Introduction

Problem Formulation
Software Implementation
Hardware Implementation

Parallel Solver for Bordered Block Diagonal Matrix

Shashank Gangrade

Under the guidance of Prof. Sachin B. Patkar,

Department of Electrical Engineering


Indian Institute of Technology, Bombay

Oct 26, 2016

Outline
1 Introduction
2 Problem Formulation
3 Software Implementation
External Packages
Software API
Results
4 Hardware Implementation
Memory Bottleneck
Hardware Blocks
Matrix Inversion Unit
Matrix Multiplication
Functional Block
Results
5 Conclusion & Future Work
Motivation

Circuit simulator for linear networks with a large number of nodes
Use the Bordered Block Diagonal (BBD) Matrix generated by node-tearing nodal analysis
Design a system to solve Bordered Block Diagonal Matrices as large as 1M × 1M
Use parallel methods in hardware and software to design a performance-critical and resource-efficient system

A general form of a Bordered Block Diagonal Matrix system can be written as AX = G:

\[
\begin{bmatrix}
A_1 & 0 & 0 & \cdots & 0 & B_1 \\
0 & A_2 & 0 & \cdots & 0 & B_2 \\
0 & 0 & A_3 & \cdots & 0 & B_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & A_{N-1} & B_{N-1} \\
C_1 & C_2 & C_3 & \cdots & C_{N-1} & A_N
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ \vdots \\ X_{N-1} \\ X_N \end{bmatrix}
=
\begin{bmatrix} G_1 \\ G_2 \\ G_3 \\ \vdots \\ G_{N-1} \\ G_N \end{bmatrix}
\]

For i in [1, N-1]:
Ai is an m × m matrix      Gi is an m × 1 vector
Bi is an m × n matrix      Xi is an m × 1 vector
Ci is an n × m matrix      GN is an n × 1 vector
AN is an n × n matrix      XN is an n × 1 vector

For each row i in the range [1, N-1], the row equation $A_i X_i + B_i X_N = G_i$ can be written as

\[ X_i = A_i^{-1} (G_i - B_i X_N) \tag{1} \]

Similarly, solving for $X_N$ in the N-th row:

\[ C_1 X_1 + C_2 X_2 + \cdots + C_{N-1} X_{N-1} + A_N X_N = G_N \]

\[ A_N X_N = G_N - (C_1 X_1 + C_2 X_2 + \cdots + C_{N-1} X_{N-1}) \]

\[ A_N X_N = G_N - \sum_{i=1}^{N-1} C_i X_i \tag{2} \]

Substituting $X_i$ from (1) into (2) gives an equation in terms of $X_N$ only.

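\[ A_N X_N = G_N - \sum_{i=1}^{N-1} C_i A_i^{-1} (G_i - B_i X_N) \]

Taking the $X_N$ terms to one side:

\[ A_N X_N - \sum_{i=1}^{N-1} C_i A_i^{-1} B_i X_N = G_N - \sum_{i=1}^{N-1} C_i A_i^{-1} G_i \]

Since these are matrices, the quotient is written as a left multiplication by the inverse:

\[ X_N = \Big( A_N - \sum_{i=1}^{N-1} C_i A_i^{-1} B_i \Big)^{-1} \Big( G_N - \sum_{i=1}^{N-1} C_i A_i^{-1} G_i \Big) \tag{3} \]

Let $G_i^*$ and $B_i^*$ denote the individual summation terms in $X_N$:

\[ G_i^* = C_i A_i^{-1} G_i, \qquad B_i^* = C_i A_i^{-1} B_i \]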

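Solutions of AX = G

\[ X_i = A_i^{-1} (G_i - B_i X_N), \quad \forall i \in [1, N-1] \]

\[ X_N = \Big( A_N - \sum_{i=1}^{N-1} B_i^* \Big)^{-1} \Big( G_N - \sum_{i=1}^{N-1} G_i^* \Big) \]

\[ G_i^* = C_i A_i^{-1} G_i, \qquad B_i^* = C_i A_i^{-1} B_i \]

There are two sources of parallelism in this solution:
Each $G_i^*$ and $B_i^*$ can be calculated in parallel, which speeds up the calculation of $X_N$
Each $X_i$ can be calculated in parallel once $X_N$ is known, which speeds up the calculation of the $X_i$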

Software Implementation

Motivation for Software:
The basic idea is to have a complete C-based implementation of the Bordered Block Diagonal Matrix solver
The software system can serve as a benchmark for the later hardware design
Profile the run times of the various parts of the program
Exploit the inherent parallelism using the multiple cores of the CPU

Software API

Diagonal block Ai is m × m
Border blocks: Bi is m × n and Ci is n × m
The number of diagonal blocks of the matrix is N
Tile size is the size of each tile matrix: TileSize = m = n
Linear Algebra Packages: LAPACK, LAPACKE, BLAS
Parallel Programming: OpenMP

Algorithm Flow Diagram

Description of parts
Part 1: Calculate $\sum B_i^*$ and $\sum G_i^*$, where

\[ G_i^* = C_i A_i^{-1} G_i, \qquad B_i^* = C_i A_i^{-1} B_i \]

Part 2: Calculate $X_N$ from $\sum B_i^*$ and $\sum G_i^*$

Part 3: Calculate the individual $X_i$

\[ X_i = A_i^{-1} (G_i - B_i X_N) \]
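The three parts can be illustrated with a small pure-Python sketch. It is not the C/LAPACK/OpenMP implementation described in these slides: the helper names (`mat_mul`, `mat_sub`, `solve`, `bbd_solve`) are hypothetical, matrices are plain lists of rows, vectors are stored as m × 1 matrices, and linear solves replace the explicit inverses. The loops run sequentially here; the iterations of Part 1 and Part 3 are independent of each other, which is the parallelism the slides refer to.

```python
def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_sub(a, b):
    """Element-wise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def bbd_solve(A, B, C, G):
    """A and G hold N blocks (the last is A_N / G_N); B and C hold N-1 blocks."""
    last = len(A) - 1
    schur = A[last]   # becomes A_N - sum(B_i*)
    rhs = G[last]     # becomes G_N - sum(G_i*)
    # Part 1: accumulate B_i* = C_i A_i^-1 B_i and G_i* = C_i A_i^-1 G_i
    for i in range(last):
        schur = mat_sub(schur, mat_mul(C[i], solve(A[i], B[i])))
        rhs = mat_sub(rhs, mat_mul(C[i], solve(A[i], G[i])))
    # Part 2: solve for X_N
    x_n = solve(schur, rhs)
    # Part 3: back-substitute X_i = A_i^-1 (G_i - B_i X_N)
    xs = [solve(A[i], mat_sub(G[i], mat_mul(B[i], x_n))) for i in range(last)]
    return xs + [x_n]
```

Calling `bbd_solve(A, B, C, G)` returns the list `[X_1, ..., X_N]`; substituting the result back into each block row is a simple way to check it.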

Run time comparison for BBD solver and standard solver

                   BBD matrix solver (ms)              LAPACK solver (ms)
N      Tile Size   Part1    Part2    Part3    Total    Total
100    4           0.255    0.017    0.049    0.329    6.904
500    4           1.036    0.03     0.238    1.313    266.813
1000   4           1.882    0.024    0.578    2.505    1939.958
1500   4           3.041    0.023    0.798    3.877    6874.44
100    8           0.956    0.033    0.131    1.134    23.711
500    8           1.899    0.033    0.377    2.32     212.629
1000   8           5.526    0.036    1.026    6.601    15743.108
1500   8           5.632    0.027    1.055    6.723    50147.257

Table: Run times in ms for matrices of different sizes

Note: The standard solver is the dgesv() function from the LAPACK library


Results: Comparison of BBD solver with LAPACK

Run times for large BBD solver

        Run Time (ms)
N       Tile Size=4    Tile Size=8
100     0.329          1.134
500     1.313          2.32
1000    2.505          6.601
5k      12.742         26.13
10k     24.483         46.477
50k     119.576        277.521
100k    246.337        480.946
500k    1896.716       2843.414
1M      2334.047       5532.105

Table: Run times for BBD matrix solver with a large number of blocks

Run times for large BBD solver

Hardware Implementation
Idea

The basic idea is to come up with a hardware design that performs better than the software API running on a multi-core CPU

Design in hardware

Basis of Design Methodology
Follow a performance-driven design methodology
Understand the various bottlenecks in the system
Extract parallelism within the constraints set by those bottlenecks
Perform better than software to justify the additional design effort
Specifications
Design targeted at a high-end FPGA such as the Virtex-6
Use Bluespec, a high-level functional hardware description language

Memory Bottleneck

Latency and bandwidth are the bottlenecks in hardware performance
For a typical DDR3 SODIMM memory on board a Virtex-6 ML605:
Bandwidth = 8.5 GB/s (1066 MT/s)
Latency = 10-20 clock cycles
For a 4 × 4 matrix of float values, this bandwidth can deliver, on average, one matrix tile per clock cycle
A system designed to work within these constraints can deliver maximum performance

Hardware Design

The system has the following specifications:
Each tile is a 4 × 4 matrix, with elements stored as 32-bit single-precision floating-point numbers
Memory bandwidth delivers 128 bits per cycle
The memory access pattern is predefined, so latency is hidden by prefetching
It takes 4 clock cycles to bring a complete matrix from memory to the FPGA
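The 4-cycle figure follows directly from the tile size and the bus width. A minimal sketch of that arithmetic, with a hypothetical helper name `fetch_cycles` and the 32-bit element and 128-bit bus widths taken from the specifications above:

```python
def fetch_cycles(rows, cols, bits_per_element=32, bus_bits_per_cycle=128):
    """Clock cycles needed to stream a rows x cols block over the memory bus."""
    total_bits = rows * cols * bits_per_element
    # Ceiling division: a partially filled bus word still costs a full cycle.
    return -(-total_bits // bus_bits_per_cycle)


tile_cycles = fetch_cycles(4, 4)    # 4 x 4 tile: 512 bits over a 128-bit bus
vector_cycles = fetch_cycles(4, 1)  # 4 x 1 vector: 128 bits
```

With these defaults a 4 × 4 tile takes 4 cycles and a 4 × 1 vector takes 1 cycle.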

Tile Inversion Block

Matrix Inverse
Inversion is based on Gauss-Jordan elimination
Perform the same set of row operations on the input matrix and on an identity matrix
Successive operations convert the input matrix into the identity matrix, while the identity matrix is transformed into the inverse of the input matrix
Hardware Specifications
Inputs one tile of the matrix in one cycle
For an n × n matrix, the inverse can be calculated in n^2 cycles
In every cycle, the unit performs either an FP division or an FP multiply-add
Each inversion unit has 8 FP MAC units and 8 FP division units
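A software sketch of the Gauss-Jordan idea, for illustration only. The function name `gauss_jordan_inverse` is hypothetical, and the slides do not say whether the hardware exchanges rows, so this version assumes non-zero pivots and does no pivoting. The divisions in the normalisation step and the multiply-subtract in the elimination step correspond to the FP division and FP multiply-add operations listed above.

```python
def gauss_jordan_inverse(matrix):
    """Invert a square matrix (list of rows) by Gauss-Jordan elimination."""
    n = len(matrix)
    left = [[float(v) for v in row] for row in matrix]  # reduced to identity
    right = [[1.0 if i == j else 0.0 for j in range(n)]
             for i in range(n)]                          # becomes the inverse
    for col in range(n):
        pivot = left[col][col]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot: this sketch does not swap rows")
        # Normalise the pivot row (floating-point divisions).
        left[col] = [v / pivot for v in left[col]]
        right[col] = [v / pivot for v in right[col]]
        # Clear the pivot column in every other row (multiply-adds).
        for row in range(n):
            if row != col:
                factor = left[row][col]
                left[row] = [v - factor * p
                             for v, p in zip(left[row], left[col])]
                right[row] = [v - factor * p
                              for v, p in zip(right[row], right[col])]
    return right
```

For example, `gauss_jordan_inverse([[4, 7], [2, 6]])` gives a result close to `[[0.6, -0.7], [-0.2, 0.4]]`, up to floating-point rounding.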

Figure: Inversion Unit for 3x3


Matrix-matrix Multiplication Block

Matrix-matrix Multiplication
This is based on the rank-one update matrix multiplication algorithm [1]
Each step adds a rank-one matrix to the accumulated result matrix
Hardware Specifications
Inputs one tile of the matrix in one cycle
For n × n matrices, the product can be calculated in n^2 cycles
In every cycle, the unit performs FP multiply-adds
Each multiply unit has 4 FP MAC units
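As a software illustration of the rank-one update formulation (not the hardware design itself), the product C = AB can be built by adding, for each k, the outer product of column k of A and row k of B to an accumulator. The function name `matmul_rank_one` is hypothetical.

```python
def matmul_rank_one(a, b):
    """Compute a @ b as a sum of rank-one (outer-product) updates."""
    n, inner, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for k in range(inner):
        col_k = [a[i][k] for i in range(n)]  # k-th column of a
        row_k = b[k]                         # k-th row of b
        # Rank-one update: c += col_k (outer product) row_k,
        # one multiply-add per element of c.
        for i in range(n):
            for j in range(p):
                c[i][j] += col_k[i] * row_k[j]
    return c
```

For 4 × 4 tiles this amounts to 4 updates of 16 multiply-adds each, 64 in total, which is consistent with 16 cycles on 4 MAC units.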

Figure: Matrix Multiplication Unit for 3x3

Functional Unit

A functional unit accumulates $B^*$ by adding the $B_i^*$ of the i-th block (and likewise $G^*$ from $G_i^*$)
Given the memory bandwidth and the latency of the blocks, the design has 4 functional units

\[ B^* = \sum_i B_i^*, \qquad G^* = \sum_i G_i^* \]

Figure: Functional Unit

Example of Hardware Data Flow

Matrix A has a tile size of 4 × 4, and the number of diagonal tiles is N = 13
Every element of A is stored as a 32-bit floating-point variable
Memory can fetch 128 bits in one clock cycle

\[
\begin{bmatrix}
A_1 & 0 & \cdots & 0 & B_1 \\
0 & A_2 & \cdots & 0 & B_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & A_{12} & B_{12} \\
C_1 & C_2 & \cdots & C_{12} & A_{13}
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_{12} \\ X_{13} \end{bmatrix}
=
\begin{bmatrix} G_1 \\ G_2 \\ \vdots \\ G_{12} \\ G_{13} \end{bmatrix}
\]

Example of Hardware Data Flow

Memory bandwidth allows 128 bits per clock cycle
Ai, Bi, Ci are 4 × 4 matrices; each takes 4 cycles to fetch
Gi is a 4 × 1 vector; it takes 1 cycle to fetch
With the access pattern shown in the table, it takes 52 clock cycles to get the data for 4 functional units
This access pattern repeats every 52 clock cycles, (N-1)/4 times

Matrix   Clock#     Matrix   Clock#
A1       4          B1       36
A2       8          B2       40
A3       12         B3       44
A4       16         B4       48
C1       20         G1       49
C2       24         G2       50
C3       28         G3       51
C4       32         G4       52

Table: Memory access pattern
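The clock numbers in the access-pattern table can be reproduced with a short sketch. The function name `access_schedule` and its parameters are hypothetical; the fetch order (all A tiles, then C, then B, then G) and the per-item costs are read off the table.

```python
def access_schedule(units=4, tile_cycles=4, vector_cycles=1):
    """Return (name, clock) pairs: the cycle at which each fetch completes."""
    schedule = []
    clock = 0
    # Fetch order taken from the table: A tiles, C tiles, B tiles, G vectors.
    for name, cost in (("A", tile_cycles), ("C", tile_cycles),
                       ("B", tile_cycles), ("G", vector_cycles)):
        for unit in range(1, units + 1):
            clock += cost
            schedule.append((f"{name}{unit}", clock))
    return schedule


for name, clock in access_schedule():
    print(f"{name:<3} {clock}")
```

The last entry completes at clock 52, the length of one round for 4 functional units.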
Example of Hardware Data Flow


Animation

Hardware Data Flow: link to animation

Number of cycles taken: 52 × (12/4) + 28 = 184

Results

We judge the performance of a design on the basis of the following criteria:
Cycles to calculate the BBD sums $\sum B_i^*$ and $\sum G_i^*$:
52 × ((N-1)/4) + 28
Number of FP operations per cycle:
4 × ((16 + 16 + 16 + 4) × 4 + 16) / 52 ≈ 17
Number of FP units needed:
80 FP MAC + 32 FP Division + 2 FP ADD
Memory bandwidth: 128 bits × 50 MHz = 800 MB/s
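These figures can be checked with a few lines of arithmetic. The function and parameter names are hypothetical; the constants (52 cycles per round, 28 additional cycles, 4 functional units, the operation counts, the 128-bit bus and the 50 MHz clock) are taken from the slides, and the number of rounds is taken as (N-1)/4 rounded up, as in the data-flow example where N = 13 gives 12/4 rounds.

```python
def total_cycles(n_blocks, units=4, cycles_per_round=52, tail_cycles=28):
    """Cycle count: one 52-cycle round per group of 4 blocks, plus 28 cycles."""
    rounds = -(-(n_blocks - 1) // units)  # ceiling of (N-1)/4
    return cycles_per_round * rounds + tail_cycles


def fp_ops_per_cycle(units=4, cycles_per_round=52):
    """Average floating-point operations per cycle, using the slide's counts."""
    ops_per_round = (16 + 16 + 16 + 4) * 4 + 16
    return units * ops_per_round / cycles_per_round


def bandwidth_mb_per_s(bus_bits=128, clock_hz=50_000_000):
    """Memory bandwidth in MB/s for a bus of bus_bits clocked at clock_hz."""
    return bus_bits / 8 * clock_hz / 1e6
```

With these definitions, `total_cycles(13)` is 184, `fp_ops_per_cycle()` is about 17.2, and `bandwidth_mb_per_s()` is 800.0.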

Results

For a system operating at a 50 MHz clock, with N = 10000:
Time = 2.6 ms
For the software using similar data types:
Time = 19.168 ms

Calculated speedup achieved: about 7.4x (19.168 ms / 2.6 ms)
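The hardware time can be derived from the cycle count and the clock frequency. A minimal sketch, with the hypothetical name `hardware_time_ms`, assuming 52 cycles per round of 4 blocks plus 28 additional cycles, and taking the 19.168 ms software time as quoted:

```python
def hardware_time_ms(n_blocks, clock_hz=50e6, units=4):
    """Estimated hardware run time in ms from the cycle-count formula."""
    rounds = -(-(n_blocks - 1) // units)  # ceiling of (N-1)/4
    cycles = 52 * rounds + 28
    return cycles / clock_hz * 1e3


hw_ms = hardware_time_ms(10000)  # 130028 cycles at 50 MHz
sw_ms = 19.168                   # software time quoted on the slide
speedup = sw_ms / hw_ms
```

Under these assumptions the hardware time comes to about 2.6 ms, and the ratio of the two quoted times is roughly 7.4.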

Conclusion & Future Work

Conclusion
We observe a substantial speedup of the hardware design over the software implementation
The design methodology can be scaled to larger designs and higher bandwidths
Future Work
Focus on reusing FP units, so that resource usage is minimized
Do cycle-accurate testing of the integrated hardware design in simulation
Run the hardware design on an FPGA using the Bluespec emulation platform
Design the system for scalability to large block matrices

For Further Reading

References I

Mahendra Burdhak. Efficient Simulation of Large Non-Linear Circuits using Partioning and Parallelism. DDP Phase-1 Report, 2016.
Kumar, V.B.Y., Joshi, S., Patkar, S.B. et al. FPGA Based High Performance Double-Precision Matrix Multiplication. Int J Parallel Prog (2010) 38: 322.
ML605 Hardware User Guide. http://www.xilinx.com/support/documentation/boards_and_kits/ug534.pdf
