Sie sind auf Seite 1von 67

Chapter 7: Systolic Architecture

Design

Keshab K. Parhi
• Systolic architectures are designed by using linear
mapping techniques on regular dependence graphs (DG).
• Regular Dependence Graph : The presence of an edge in
a certain direction at any node in the DG represents
presence of an edge in the same direction at all nodes
in the DG.
• DG corresponds to space representation à no time
instance is assigned to any computation ⇒ t=0.
• Systolic architectures have a space-time
representation where each node is mapped to a certain
processing element(PE) and is scheduled at a particular
time instance.
• Systolic design methodology maps an N-dimensional DG
to a lower dimensional systolic architecture.
• Mapping of N-dimensional DG to (N-1) dimensional
systolic array is considered.
Chap. 7 2
• Definitions :
 d1
Ø Projection vector (also called iteration vector), d =  
 d 2
Two nodes that are displaced by d or multiples of d are
executed by the same processor.
ØProcessor space vector, p T
= ( p1 p2 )
Any node with index IT=(i,j) would be executed by proc-
essor; i
pT I = ( p1 p 2 ) 
 j

ØScheduling vector, sT = (s1 s2). Any node with index I would


would be executed at time, sTI.
ØHardware Utilization Efficiency, HUE = 1/|STd|. This is
because two tasks executed by the same processor are
spaced |STd| time units apart.
ØProcessor space vector and projection vector must be
orthogonal to each other ⇒ pTd = 0.
Chap. 7 3
Ø If A and B are mapped to the same processor, then they
cannot be executed at the same time, i.e., STIA ≠ STIB, i.e.,
STd ≠ 0.
Ø Edge mapping : If an edge e exists in the space
representation or DG, then an edge pTe is introduced in the
systolic array with sTe delays.
Ø A DG can be transformed to a space-time representation by
interpreting one of the spatial dimensions as temporal
dimension. For a 2-D DG, the general transformation is
described by i’ = t = 0, j’ = pTI, and t’ = sTI, i.e.,
 i'   i  0 0 1  i 
      
 j ' = T  j  =  p' 0  j 
 t'  t   s' 0  t 
    
j’ ⇒ processor axis
t’ ⇒ scheduling time instance
Chap. 7 4
FIR Filter Design B1(Broadcast Inputs, Move Results,
Weights Stay)
dT = (1 0), pT = (0 1), sT = (1 0)
Ø Any node with index IT = (i , j)
Ø is mapped to processor pTI=j.
Ø is executed at time sTI=i.
Ø Since sTd=1 we have HUE = 1/|sTd| = 1.
Ø Edge mapping : The 3 fundamental edges corresponding to
weight, input, and result can be mapped to corresponding
edges in the systolic array as per the following table:

e pTe sTe
wt(1 0) 0 1
i/p(0 1) 1 0
result(1 –1) -1 1

Chap. 7 5
Block diagram of B1 design

Low-level implementation of B1 design


Chap. 7 6
Space-time representation of B1 design

Chap. 7 7
Design B2(Broadcast Inputs, Move Weights, Results Stay)
dT = (1 -1), pT = (1 1), sT = (1 0)
ØAny node with index IT = (i , j)
Øis mapped to processor pTI=i+j.
Øis executed at time sTI=i.
ØSince sTd=1 we have HUE = 1/|sTd| = 1.
ØEdge mapping :
e pTe sTe
wt(1 0) 1 1
i/p(0 1) 1 0
result(1 –1) 0 1

Chap. 7 8
Block diagram of B2 design

Low-level implementation of B2 design


Chap. 7 9
• Applying space time transformation we get :
j’ = pT(i j)T = i + j
t’ = sT(i j)T = i

Space-time representation of B2 design


Chap. 7 10
Design F(Fan-In Results, Move Inputs, Weights Stay)
dT = (1 0), pT = (0 1), sT = (1 1)
ØSince sTd=1 we have HUE = 1/|sTd| = 1.
ØEdge mapping :
e pTe sTe
wt(1 0) 0 1
i/p(0 1) 1 1
result(1 –1) -1 0

Block diagram of F design

Chap. 7 11
Low-level implementation of F design

Space-time representation of F design


Chap. 7 12
Design R1(Results Stay, Inputs and Weights Move in
Opposite Direction)
dT = (1 -1), pT = (1 1), sT = (1 -1)
ØSince sTd=2 we have HUE = 1/|sTd| = ½.
ØEdge mapping :
e pTe sTe
wt(1 0) 1 1
i/p(0 -1) -1 1
result(1 –1) 0 2

Chap. 7 Block diagram of R1 design 13


Low-level implementation of R1 design

Note : R1 can be obtained from B2 by 2-slow transformation


and then retiming after changing the direction of signal x.

Chap. 7 14
Design R2 and Dual R2(Results Stay, Inputs and
Weights Move in Same Direction but at Different Speeds)
dT = (1 -1), pT = (1 1),
R2 : sT = (2 1); Dual R2 : sT = (1 2);
ØSince sTd=1 for both of them we have HUE = 1/|sTd| = 1 for
both.
ØEdge mapping :
R2 Dual R2
e pTe sTe e pTe sTe
wt(1, 0) 1 2 wt(1, 0) 1 1
i/p(0,1) 1 1 i/p(0,1) 1 2
result(1, -1) 0 1 result(-1, 1) 0 1

Note : The result edge in design dual R2has been reversed to


Guarantee sTe ≥ 0. 15
Design W1 (Weights Stay, Inputs and Results Move in
Opposite Directions)
dT = (1 0), pT = (0 1), sT = (2 1)
ØSince sTd=2 for both of them we have HUE = 1/|sTd| = ½.
ØEdge mapping :

e pTe sTe
wt(1 0) 0 2
i/p(0 -1) 1 1
result(1 –1) -1 1

Chap. 7 16
Design W2 and Dual W2(Weights Stay, Inputs and
Results Move in Same Direction but at Different Speeds)
dT = (1 0), pT = (0 1),
W2 : sT = (1 2); Dual W2 : sT = (1 -1);
ØSince sTd=1 for both of them we have HUE = 1/|sTd| = 1 for
both.
ØEdge mapping :
W2 Dual W2
e pTe sTe e pTe sTe
wt(1, 0) 0 1 wt(1, 0) 0 1
i/p(0,1) 1 2 i/p(0,-1) -1 1
result(1, -1) 1 1 result(1, -1) -1 2

Chap. 7 17
• Relating Systolic Designs Using Transformations :
Ø FIR systolic architectures obtained using the
same projection vector and processor vector,
but different scheduling vectors, can be
derived from each other by using
transformations like edge reversal,
associativity, slow-down, retiming and pipelining.
• Example 1 : R1 can be obtained from B2 by slow-
down, edge reversal and retiming.

Chap. 7 18
• Example 2:

Derivation of design F from B1 using cutset retiming

Chap. 7 19
Ø Selection of sT based on scheduling inequalities:
For a dependence relation X àY, where IxT= (ix, jx)T and IyT=
(iy, jy)T are respectively the indices of the nodes X and Y.
The scheduling inequality for this dependence is given by,
Sy ≥ Sx + Tx
where Tx is the computation time of node X. The scheduling
equations can be classified into the following two types :
ØLinear scheduling , where
Sx = sT Ix = (s1 s2)(ix jx )T
Sy = sT Iy = (s1 s2)(iy jy)T
ØAffine Scheduling, where
Sx = sT Ix + γx= (s1 s2)(ix jx )T + γx
Sx = sT Ix + γy = (s1 s2)(ix jx)T + γy
So scheduling equation for affine scheduling is as follows:
sT Ix + γy ≥ sT Ix + γx + Tx
Chap. 7 20
Each edge of a DG leads to an inequality for selection of the
scheduling vectors which consists of 2 steps.
– Capture all fundamental edges. The reduced
dependence graph (RDG) is used to capture the
fundamental edges and the regular iterative algorithm
(RIA) description of the corresponding problem is used
to construct RDGs.
– Construct the scheduling inequalities according to
sT Ix + γy ≥ sT Ix + γx + Tx
and solve them for feasible sT.

Chap. 7 21
• RIA Description : The RIA has two forms
⇒ The RIA is in standard input RIA form if the index of the
inputs are the same for all equations.
⇒ The RIA is in standard output RIA form if all the output
indices are the same.
• For the FIR filtering example we have,
W(i+1, j) = W(i, j)
X(i, j+1) = X(i, j)
Y(i+1, j-1) = Y(i, j) + W(i+1, j-1)X(i+1, j-1)
The FIR filtering problem cannot be expressed in standard
input RIA form. Expressing it in standard output RIA form
we get,
W(i, j) = W(i-1, j)
X(i, j) = X(i, j-1)
Y(i, j) = Y(i-1, j+1) + W(i, j)X(i, j)

Chap. 7 22
• The reduced DG for FIR filtering is shown below.

Example :
Tmult = 5, Tadd = 2, Tcom = 1
Applying the scheduling equations to the five edges of the
above figure we get ;
W-->Y : e = (0 0)T , γx - γw ≥ 0
X -->X : e = (0 1)T , s2 + γx - γx ≥ 1
W-->W: e = (1 0)T , s1 + γw - γw ≥ 1
X -->Y : e = (0 0)T , γy - γx ≥ 0
Y --> Y: e = (1 -1)T , s1 - s2 + γy - γy ≥ 5 + 2 + 1
For linear scheduling γx =γy = γw = 0. Solving we get, s1 ≥ 1,
s2 ≥ 1 and s1 - s2 ≥ 8.
Chap. 7 23
• Taking sT = (9 1), d = (1 -1) such that sTd ≠ 0 and pT = (1,1)
such that pTd = 0 we get HUE = 1/8. The edge mapping is as
follows :
e pTe sTe
wt(1 0) 1 9
i/p(0 1) 1 1
result(1 –1) 0 8

Systolic architecture for the example

Chap. 7 24
Matrix-Matrix multiplication and 2-D Systolic Array Design

C11 = a11b11 + a12 b21


C12 = a11b12 + a12 b22
C21 = a21b11 + a22 b21
C22 = a21b12 + a22 b22

The iteration in standard output RIA form is as follows :


a(i,j,k) = a(i,j-1,k)
b(i,j,k) = b(i-1,j,k)
c(i,j,k) = c(i,j,k-1) + a(i,j,k) b(i,j,k)
Chap. 7 25
• Applying scheduling inequality with
Tmult -add = 1, and Tcom = 0 we get
s2 ≥ 0, s1 ≥ 0, s3 ≥ 1, γc - γa ≥ 0
and γc - γb ≥ 0. Take γa =γb = γc = 0
for linear scheduling.
• Solution 1 :
sT = (1,1,1), dT = (0,0,1), p1 = (1,0,0),
p2 = (0,1,0), PT = (p1 p2)T

Chap. 7 26
• Solution 2 :
sT = (1,1,1), dT = (1,1,-1), p1 = (1,0,1),
p2 = (0,1,1), PT = (p1 p2)T

Sol. 1 Sol. 2
e pTe sTe e pTe sTe
a(0, 1, 0) (0, 1) 1 a(0, 1, 0) (0, 1) 1
b(1, 0, 0) (1, 0) 1 b(1, 0, 0) (1, 0) 1
C(0, 0, 1) (0, 0) 1 C(0, 0, 1) (1, 1) 1

Chap. 7 27
VLSI Digital Signal Processing Systems

Systolic Architecture Design

Lan-Da Van (范倫達), Ph. D.


Department of Computer Science
National Chiao Tung University
Taiwan, R.O.C.
Fall, 2010

ldvan@cs.nctu.edu.tw

http://www.cs.nctu.tw/~ldvan/
VLSI Digital Signal Processing Systems

Outline
Introduction
Systolic Array Design Methodology
FIR Systolic Arrays
Selection of Scheduling Vector
Conclusion

Lan-Da Van VLSI-DSP-7-2


VLSI Digital Signal Processing Systems

Systolic Architecture
What is systolic architecture (also called Systolic Arrays)?
A network of PEs that rhythmically compute and pass
data through the system.
Used as a coprocessor in combination with a host
computer and the behavior is analogous to the flow of
blood through the heart; thus named as systolic.

Lan-Da Van VLSI-DSP-7-3


VLSI Digital Signal Processing Systems

Characteristics of Systolic Arrays


Synchronization
Modularity
Regularity
Locality
Finite Connection
Parallel/Pipeline
Extendibility
Some relaxations are introduced to increase the
utility of systolic arrays
 Neighbor interconnection ( near, but not nearest )
 Data broadcast operations
 Different PEs, especially at the boundaries

Lan-Da Van VLSI-DSP-7-4


VLSI Digital Signal Processing Systems

Outline
Introduction
Systolic Array Design Methodology
FIR Systolic Arrays
Selection of Scheduling Vector
Conclusion

Lan-Da Van VLSI-DSP-7-5


VLSI Digital Signal Processing Systems

Systolic Array Design Methodology

Represent the Algorithm as a Dependence Graph

Applying Projection, Processor, and Scheduling Vectors


(Space-Time Representation)

Edge Mapping

Construct the Final Systolic Architecture

Lan-Da Van VLSI-DSP-7-6


VLSI Digital Signal Processing Systems

Design Methodology: Basic Vectors


Projection vector dT = [d1 d2]
 Determine how DG is compressed.
 Two nodes that are displaced by d or multiples of d are executed by the
same processor
Processor space vector pT = [p1 p2]
 Any node with index IT = [i, j] would be executed by processor pTI.
Schedule vector sT = [s1 s2]
 Any node with index IT = [i, j] would be executed at time sTI.
Hardware utilization efficiency: HUE = 1/|sTd|
 This is because two tasks executed by the same processor are spaced
1/|sTd| time units apart.
Feasibility constrains
 Processor space vector and the projection vector must be orthogonal to
each other. p is orthogonal to d, that is, pTd = 0
 If A and B differ by projection vector, i.e, IA-IB = d,
then they must be executed by the same processor => pTIA = pTIB =>pT(IA-IB) = 0
=> pTd = 0
 If A and B are mapped to the same processor, then they cannot be
executed at the same time, i.e, sTIA ≠ sTIB => sTd ≠ 0
 Edge mapping: If an edge e exists in DG, then an edge pTe exists in the
systolic array with sTe delays.

Lan-Da Van VLSI-DSP-7-7


VLSI Digital Signal Processing Systems

Space to Space-Time Representation


Space-time representation
 Interpreting one of the spatial dimensions as temporal
dimension
 j’: processor axis, t’: scheduling time instance

 i'   0 0 1  i 
    
 j '    p 0  j 
T

 t '   s T 0  t 
    

Lan-Da Van VLSI-DSP-7-8


VLSI Digital Signal Processing Systems

Outline
Introduction
Systolic Array Design Methodology
FIR Systolic Arrays
Selection of Scheduling Vector
Matrix-Matrix Multiplication and 2D Systolic
Array Design
Systolic Design for Space Representations
Containing Delays
Conclusion

Lan-Da Van VLSI-DSP-7-9


VLSI Digital Signal Processing Systems

Systolic Array Design Methodology

Represent the Algorithm as a Dependence Graph

Applying Projection, Processor, and Scheduling Vectors


(Space-Time Representation)

Edge Mapping

Construct the Final Systolic Architecture

Lan-Da Van VLSI-DSP-7-10


VLSI Digital Signal Processing Systems

DG of FIR Filter
Dependence Graph (DG)
 Ex: FIR filter: y(n) = w0(n)x(n)+w1x(n-1)+w2x(n-2)

i
Lan-Da Van VLSI-DSP-7-11
VLSI Digital Signal Processing Systems

Systolic Array Design Methodology

Represent the Algorithm as a Dependence Graph

Applying Projection, Processor, and Scheduling Vectors


(Space-Time Representation)

Edge Mapping

Construct the Final Systolic Architecture

Lan-Da Van VLSI-DSP-7-12


VLSI Digital Signal Processing Systems

Applying Projection and Scheduling


(1/2)

Part of DG:
0  1 
 2  2 PE2
   
Processor vector 0 
pT I  0 1   2
1
pT I  0 1   2
… processor 2
pT = [0 1]  2  2
0  1
sT I  1 0   0 sT I  1 0   1
 2  2

0  1
Projection vector apply 1 1
dT = [1 0]    PE1 SFG

0
pT I  0 1   1
1
pT I  0 1   1
… processor 1
1 1

Scheduling vector 0 1


sT I  1 0   0 s T I  1 0   1
sT = [1 0] 1 1

0  1
0  0 
    PE0
0
pT I  0 1   0
1
pT I  0 1   0
… processor 0
0 0
0 1
sT I  1 0   0 sT I  1 0   1
0 0

Lan-Da Van VLSI-DSP-7-13


VLSI Digital Signal Processing Systems

Applying Projection and Scheduling


(2/2)

Dependence Graph Applying projection and Scheduling Space-time representation

Lan-Da Van VLSI-DSP-7-14


VLSI Digital Signal Processing Systems

Systolic Array Design Methodology

Represent the Algorithm as a Dependence Graph

Applying Projection, Processor, and Scheduling Vectors


(Space-Time Representation)

Edge Mapping

Construct the Final Systolic Architecture

Lan-Da Van VLSI-DSP-7-15


VLSI Digital Signal Processing Systems

Edge Mapping

i 
e  e p e
' T
 j
Edge mapping
Delay  s e T

Lan-Da Van VLSI-DSP-7-16


VLSI Digital Signal Processing Systems

Edge Mapping
0
e1  pT e1  0 1   1
'
Example:
1
0
delaye1  sT e1  1 0   0
0  1
input  e1   
1
1
e2  p e2  0 1   0
'
1
T
weight  e2    0 
0  Edge mapping
1
delaye 2  s e2  1 0   1
T

1 0
output  e3   
 1
1
e3  pT e3  0 1   1
'
pT=[0 1]
 1
1
sT=[1 0] delaye3  s e3  1 0   1
T

dT=[1 0]  1
Lan-Da Van VLSI-DSP-7-17
VLSI Digital Signal Processing Systems

Edge mapping

e pTe sTe

Input [0 1]T 1 0

Weight [1 0]T 0 1

Output [1 -1]T -1 1

Edge mapping table


pT=[0 1]

sT=[1 0]
dT=[1 0]

Lan-Da Van VLSI-DSP-7-18


VLSI Digital Signal Processing Systems

Systolic Array Design Methodology

Represent the Algorithm as a Dependence Graph

Applying Projection, Processor, and Scheduling Vectors


(Space-Time Representation)

Edge Mapping

Construct the Final Systolic Architecture

Lan-Da Van VLSI-DSP-7-19


VLSI Digital Signal Processing Systems

Construct the Final Systolic


Architecture

Weight 2
PE2 D

Weight 1
PE1 D

D
Weight 0
PE0 D

D This is called B1 design


Input Output

Lan-Da Van VLSI-DSP-7-20


VLSI Digital Signal Processing Systems

Alternative Designs
B1 (Broadcast inputs, Move results, Weight Stay)
B2 (Broadcast inputs, Move Weight, Results stay)
F (Fan-in results, Move inputs, Weight stay)
R1 (Results stay, Inputs and Weight move in opposite
directions)
R2 and Dual R2 (Results stay, Inputs and Weights
move in the same direction but at different speeds)
W1 (Weights stay, Inputs and Results move in
opposite directions)
W2 and Dual W2 (Weights stay, Inputs and Results
move in same direction but at different speeds)
Relating systolic designs using transformations

Lan-Da Van VLSI-DSP-7-21


VLSI Digital Signal Processing Systems

B2 – Broadcast Inputs, Move Weight,


Results Stay
e pTe sTe
1
dT=[1 -1]
pT=[1 1]
wt [1 0]T 1 1 HUE  T
1
sT=[1 0] input [0 1]T 1 0 s d
result [1 -1]T 0 1

Lan-Da Van VLSI-DSP-7-22


VLSI Digital Signal Processing Systems

F - Fan-in Results, Move Inputs,


Weight Stay
e pTe sTe
dT=[1 0] 1
pT=[0 1]
wt [1 0]T 0 1
HUE  T
1
sT=[1 1] input [0 1]T 1 1 S d
result [1 -1]T -1 0

Lan-Da Van VLSI-DSP-7-23


VLSI Digital Signal Processing Systems

R1 - Results Stay, Inputs and Weight


Move in Opposite Directions
e pTe sTe
dT=[1 -1] wt [1 0]T 1 1 1 1
pT=[1 1] HUE  
sT=[1 -1] input [0 1]T -1 1 sT d 2
result [1 -1]T 0 2

Lan-Da Van VLSI-DSP-7-24


VLSI Digital Signal Processing Systems

R2 and Dual R2-Results Stay, Inputs and Weights


Move in the Same Direction but at Different Speeds
e pTe sTe
R2
wt [1 0]T 1
dT=[1 -1] 1 2
HUE  T  1
pT=[1 1] input [0 1]T 1 1 s d
sT=[2 1]
result [1 -1]T 0 1

D D result
D
Input D D D
PE0 PE1 PE2
Weight 2D 2D 2D
e pTe sTe
1
Dual R2 wt [1 0]T 1 1 HUE  T
1
dT=[1 -1] s d
pT=[1 1] input [0 1]T 1 2
sT=[1 2] result [1 -1]T 0 1
result
D D D
Input 2D 2D 2D
PE0 PE1 PE2
Weight D D Van
Lan-Da D VLSI-DSP-7-25
VLSI Digital Signal Processing Systems

W1 – Weights Stay, Inputs and Results Move in


Opposite Directions

e pTe sTe
dT=[1 0]
wt [1 0]T 0 2 1 1
pT=[0 1] HUE  
sT=[2 1] sT d 2
input [0 1]T 1 1

result [1 -1]T -1 1

weight weight weight


2D 2D 2D
Input D D D
PE0 PE1 PE2
result
D D D

Lan-Da Van VLSI-DSP-7-26


VLSI Digital Signal Processing Systems

W2 and Dual W2-Weights Stay, Inputs and Results


Move in Same Direction but at Different Speeds

W2 e pTe sTe
1
dT=[1 0] wt [1 0]T 0 1 HUE  T
1
pT=[0 1] s d
input [0 1]T 1 2
sT=[1 2]
result [1 -1]T 1 1
weight weight weight
D D D
Input 2D 2D 2D
PE0 PE1 PE2
result
D D D

Dual W2 e pTe sTe 1


dT=[1 0] wt [1 0]T 0 1
HUE  T
1
s d
pT=[0 1] input [0 1]T -1 1
sT=[1 -1]
result [1 -1]T -1 2
weight weight weight
D D D
Input D D D
PE0 PE1 PE2
result 2D 2D 2D
Lan-Da Van VLSI-DSP-7-27
VLSI Digital Signal Processing Systems

Relating Systolic Designs Using


Transformations

The same projection vector and processor space vector


Different scheduling vectors
Can derive each other using transformations
 Edge reversal : reverse edge direction in DG when no precedence
constraints
 Associativity : when accumulating (a+b)+c = a+(b+c)
 Slow-down
 Retiming
 Pipelining

Lan-Da Van VLSI-DSP-7-28


VLSI Digital Signal Processing Systems

Cutset Retiming Transformation

B1

cutset retiming

Lan-Da Van VLSI-DSP-7-29


VLSI Digital Signal Processing Systems

Outline
Introduction
Systolic array design methodology
FIR systolic arrays
Selection of scheduling vector
Conclusion

Lan-Da Van VLSI-DSP-7-30


VLSI Digital Signal Processing Systems

Scheduling Inequalities (1/3)


Based on selected scheduling vector sT, the
projection vector d and the processor space vector pT
can be selected.
pT (I A  I B )  0  pT d  0
sT I A  sT I B  sT d  0
Consider the dependence relation X -> Y,

 ix   iy 
X : I x    Y : I y   
 jx   jy 
where Ix and Iy are the indices of node X and node Y,
respectively. The scheduling inequality for this
dependence is defined as
Lan-Da Van VLSI-DSP-7-31
VLSI Digital Signal Processing Systems

Scheduling Inequalities (2/3)


S y  S x  Tx Eq. (1)
Where Tx is the time to compute node X and Sx, Sy are the
scheduling times for nodes X, Y, respectively.
Linear scheduling
 ix 
Sx  s I x  T
 s1 s2   
 x
j
Eq. (2)
 iy 
S y  sT I y   s1 s2   
Affine scheduling  jy 
 ix 
S x  sT I x   x   s1 s2      x
 jx 
Eq. (3)
 iy 
S y  s I y   y   s1
T
s2   y
 jy 
Lan-Da Van VLSI-DSP-7-32
VLSI Digital Signal Processing Systems

Scheduling Inequalities (3/3)


Define the edge from node X to node Y as

e x y  I y  I x
Eqs. (1) & (2)  s e x  y  ry  rx  Tx
T

Hence the selection of scheduling vector consists of


two steps:
 Capture all the fundamental edges. The reduced dependence
graph (RDG) is used to capture the fundamental edges and the
regular iterative algorithm (RIA) description of the
corresponding problem is used to construct RDGs.
 Construct the scheduling inequalities and solve them for
feasible sT.

Lan-Da Van VLSI-DSP-7-33


VLSI Digital Signal Processing Systems

Regular Iterative Algorithm (RIA)


The regular iterative algorithm is the method for
constructing the reduce dependence graph (RDG).
The regular iterative algorithm (RIA) has two
standard forms:
 The RIA is in standard input RIA form if the index of the
inputs are the same for all equations.
 The RIA is in standard output RIA form if output indices are
the same for all equations.
Output RIA Form
FIR example:
W (i  1, j )  W (i, j ) W (i, j )  W (i  1, j )
X (i, j  1)  X (i, j ) X (i, j )  X (i, j  1)
Y (i  1, j  1)  Y (i, j ) Y (i, j )  Y (i  1, j  1)
 W (i  1, j  1) X (i  1, j  1)  W (i, j ) X (i, j )
Lan-Da Van VLSI-DSP-7-34
VLSI Digital Signal Processing Systems

Scheduling Vector and Systolic Array


Design Using RDG

Constructing scheduling inequalities using RDG


Determine the scheduling vector using
scheduling inequalities
Systolic mapping using the scheduling vector
This formulation can accommodate different
computation times for various operations due to
its generality.

Lan-Da Van VLSI-DSP-7-35


VLSI Digital Signal Processing Systems

Example 7.4.1 (1/4)

There are 5 edges in the above RDG.

Lan-Da Van VLSI-DSP-7-36


VLSI Digital Signal Processing Systems

Example 7.4.1 (2/4)


Reduced Dependence Graph (RDG)
0
W  Y : e   , y   x  0
0
0
X  X : e    , s2   x   x  1
1
1
W  W : e    , s1   w   w  1
0
0
X  Y : e   , y   x  0
0
1
Y  Y : e    , s1  s2   y   y  5  2  1
 1
Lan-Da Van VLSI-DSP-7-37
VLSI Digital Signal Processing Systems

Example 7.4.1 (3/4)

Lan-Da Van VLSI-DSP-7-38


VLSI Digital Signal Processing Systems

Example 7.4.1 (4/4)


Linear scheduling
e pTe sTe
s1  1, s2  1, s1  s2  8 wt(1,0) 1 9
s2  1, s1  9  sT   9 1 i/p(0,1) 1 1
d  (1, 1), pT  (1,1) Result(1,-1) 0 8

Systolic array architecture


X
D D

8D 8D
8D

9D 9D
W
Lan-Da Van VLSI-DSP-7-39
VLSI Digital Signal Processing Systems

Conclusion

Systolic architecture
 A massively parallel processing with limited I/O

communication with host computer


 Suitable for many regular interactive operations

Design methodology
 Map an N-dimensional DG to (N-1) dimensional

space-time representation
 Needs to determine three critical vectors

 Projection vector
 Processor space vector
 Scheduling vector

Lan-Da Van VLSI-DSP-7-40

Das könnte Ihnen auch gefallen