Synthesis of Orthogonal Systolic Arrays For Fault-Tolerant Matrix Multiplication

M. K. Stojčev, E. I. Milovanović, S. R. Marković, I. Ž. Milovanović

M. K. Stojčev, E. I. Milovanović, I. Ž. Milovanović and are with the Faculty of Electronic Engineering, A. Medvedeva 14, P.O. Box 73, 18000 Niš, Serbia.

Abstract: This paper presents a procedure for designing a fault-tolerant systolic array with orthogonal interconnects and bidirectional data flow (2DBOSA) for matrix multiplication. The method employs space-time redundancy to achieve fault tolerance. The obtained array has $\Omega = n(n + 2)$ processing elements and a total execution time of $T_{tot} = 6n - 5$. The array can tolerate single transient errors and the majority of multiple error patterns with high probability. Compared to a hexagonal array of the same dimensions, the number of I/O pins is reduced by approximately 30%.

I. Introduction

Matrix multiplication is one of the essential operations in various fields of science, engineering and technology, such as signal and image processing, system theory, statistical and numerical analysis, biomedical research, etc. This operation is characterized by its intensive computational complexity and regularity, and it is often required under real-time constraints.

Today's high-performance computing systems exploit one or more forms of parallelism to achieve high-speed computations. To fulfill the desired throughput rates for time-critical and computationally intensive problems, special-purpose, high-speed computing systems optimized for processing specific tasks have been designed. A systolic array is a type of special-purpose system that can be used for implementing such tasks. Fault tolerance has become an essential design requirement in such systems.

Permanent, transient and intermittent faults are the main sources of errors in integrated circuits. Permanent faults occur due to irreversible physical changes. Shorts and opens are typical examples of such faults. Transients are most frequently generated by environmental conditions, like cosmic rays. Intermittent faults occur due to unstable or marginal hardware.

Transient faults are the most common, and their number is continuously increasing due to high complexity, smaller transistor sizes, higher operational frequency, and lower voltage levels. The rate of transient faults is often much higher than the rate of permanent faults. The transient-to-permanent fault ratio is 100:1 or higher [5].

A variety of techniques have been devised for masking the errors induced by silicon faults. In the error masking approach, which is known as N-tuple modular redundancy, N copies (N odd) of a module and a majority voter are used to mask the error from a failed module. At least three modules are necessary in a voting system, which is typically called triple modular redundancy (TMR). It seems that at least 200 percent hardware overhead for fault tolerance is needed. In practice, triplicate computations need to be put to the voter and then a correct result is obtained. The triplicated computations may be computed in different hardware modules and/or at different times using space-shift, time-shift or space-time-shift [14]. If the replicated computations are performed simultaneously by different modules, it is a space-shift scheme. When the replicated computations are computed by the same module at different times, it is a time-shift scheme. If the replicated computations are computed by different modules at different times, it is a space-time-shift scheme.

In our approach, instead of using full hardware redundancy (i.e. triplication) to achieve fault tolerance, we generate redundant information partly by means of hardware (by involving two extra columns of processing elements in the 2D array), and partly by means of time redundancy. This paper presents a systematic approach for designing a two-dimensional fault-tolerant systolic array with orthogonal connections and bidirectional data flow (2DBOSA) with the capability of concurrent error detection and correction using space-time redundancy. In this approach redundant computations are introduced at the algorithmic level by deriving three equivalent algorithms with disjoint index spaces. The fault-tolerant systolic array is constructed by mapping dependence graphs of these algorithms along a projection direction vector into the physical processor-time domain. There are 13 different projection direction vectors that can be used to obtain 2D SAs with planar links [10]. Three of them give 2DBOSAs. The resulting 2DBOSAs are optimal in terms of area and speed, and all can tolerate single transient errors and the majority of multiple errors with high probability. A similar method was used in [2], [12], [16] to obtain hexagonal arrays with fault-tolerant capability. The main difference in the approach used in [12], [16] concerns the fact that the hexagonal array has a pipeline period $\lambda = 3$, which means that processing elements perform active computations in every third clock cycle. This fact was used to introduce two additional computations of the same problem instance in the idle clock cycles and then vote for the correct result using majority. On the contrary, the 2DBOSAs have pipeline period $\lambda = 1$ or $\lambda = 2$. Therefore they were not considered a suitable candidate architecture for fault-tolerant computations. In this paper we first perform the expansion of the inner computation space of the basic systolic algorithm for matrix multiplication in order to obtain an array with $\lambda = 3$. Second, we introduce redundant computations at the algorithmic level and define an optimal mapping of the proposed algorithm into a 2DBOSA which minimizes hardware cost (i.e. the number of processing elements) and computation time. Finally, for the obtained 2DBOSA we derive formulas for the initial data schedule which provides correct execution of the fault-tolerant algorithm on the synthesized array.

The rest of the paper is organized as follows. Section II is devoted to problem definition. A systematic approach for designing four fault-tolerant 2DBOSAs is described in Section III. The global hardware structure of the fault-tolerant system is presented in Section IV. Details related to performance evaluation of the synthesized array are given in Section V. Finally, the conclusion is given in Section VI.

II. Background

Let $A = (a_{ik})$ and $B = (b_{kj})$ be two matrices of order $n \times n$. To find their product, $C = A \cdot B$, the following recurrence relation is usually used

$$c_{ij}^{(0)} := 0; \quad c_{ij}^{(k)} := c_{ij}^{(k-1)} + a_{ik} b_{kj}; \quad c_{ij} := c_{ij}^{(n)}, \tag{1}$$

for $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$, $k = 1, 2, \ldots, n$. The systolic algorithm that computes $C = A \cdot B$ according to (1) has the following form

Algorithm 1
for k := 1 to n do
  for j := 1 to n do
    for i := 1 to n do
      a(i, j, k) := a(i, j − 1, k);
      b(i, j, k) := b(i − 1, j, k);
      c(i, j, k) := c(i, j, k − 1) + a(i, j, k) ∗ b(i, j, k);

where $a(i, 0, k) \equiv a_{ik}$, $b(0, j, k) \equiv b_{kj}$, $c(i, j, 0) \equiv c_{ij}^{(0)}$, for $i, j, k = 0, 1, \ldots, n$. This algorithm represents a starting point for the synthesis of two-dimensional (2D) systolic arrays (SAs) which implement matrix multiplication in hardware.

The computational structure of Algorithm 1 is completely determined by the inner computation space

$$P_{int} = \{\vec{p} = [i\ j\ k]^T \mid 1 \le i, j, k \le n\}, \tag{2}$$

where data are used or computed, and a dependency matrix

$$D = \begin{bmatrix} \vec{e}_b^{\,3} & \vec{e}_a^{\,3} & \vec{e}_c^{\,3} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{3}$$

An acyclic directed graph $G = (P_{int}, D)$, where $P_{int}$ corresponds to vertices while edges are determined by columns of matrix $D$, can be joined to Algorithm 1. This graph is placed in a three-dimensional space generated by the unit vectors

$$\vec{e}_1 = [1\ 0\ 0]^T, \quad \vec{e}_2 = [0\ 1\ 0]^T \quad \text{and} \quad \vec{e}_3 = [0\ 0\ 1]^T.$$

This graph can be used to determine all allowable projection vectors of the form $\vec{\mu} = [\mu_1\ \mu_2\ \mu_3]^T$, which are used to synthesize 2D SAs. There are 13 allowable projection vectors in total. According to these vectors, 19 different 2D SAs can be synthesized. These arrays can be classified into four groups according to the interconnection patterns between the processing elements, as shown in Fig. 1. The arrays from the same group mutually differ with respect to input data patterns and directions of data flow. All these arrays are well studied in the literature (see for example [4], [9], [10], [13]).

Fig. 1. Interconnection patterns in four groups of systolic arrays (panels: mesh, bidirectional orthogonal, hexagonal, unidirectional orthogonal).

A synthesis procedure which enables obtaining 2D SAs with the optimal number of PEs with respect to matrix dimensions, i.e. $\Omega = n^2$, was proposed in [4], [9]. The execution time was minimized for that number of PEs. Thanks to time optimization, PEs are active in every clock (machine) cycle, i.e. the pipeline period of all synthesized arrays is $\lambda = 1$. However, if time optimization were omitted, the pipeline period of the synthesized arrays would be in the range $1 \le \lambda \le 3$. To be more precise, only the hexagonal array obtained for projection vector $\vec{\mu} = [1\ 1\ 1]^T$ has $\lambda = 3$, while the others have $\lambda = 1$ or $\lambda = 2$. Because of this property, the hexagonal array obtained for $\vec{\mu} = [1\ 1\ 1]^T$ was considered a good candidate for realization of fault-tolerant matrix multiplication. Namely, the fact that PEs are active in every third clock cycle can be used to realize triplicated computations of the same problem instance and perform majority voting to obtain the correct result [3], [12], [15], [16]. This can be done with minimal hardware overhead. The obtained array can tolerate single transient errors during the computation of each element of product matrix $C$. In [12] a hexagonal array with the optimal number of PEs, i.e. $\Omega = n^2 + 2n$, which performs fault-tolerant matrix multiplication in $T = 6n - 5$ time units was designed.
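Recurrence (1) and Algorithm 1 above can be checked with a short software sketch. The function name `systolic_matmul` and the dictionary-based storage are assumptions made only for this illustration; the sketch reproduces the index propagation of Algorithm 1 sequentially and says nothing about timing or hardware mapping.

```python
def systolic_matmul(A, B):
    """Sequential sketch of Algorithm 1 for square matrices given as lists of lists."""
    n = len(A)
    a, b, c = {}, {}, {}
    # Boundary values: a(i,0,k) = a_ik, b(0,j,k) = b_kj, c(i,j,0) = 0
    for i in range(1, n + 1):
        for k in range(1, n + 1):
            a[i, 0, k] = A[i - 1][k - 1]
    for j in range(1, n + 1):
        for k in range(1, n + 1):
            b[0, j, k] = B[k - 1][j - 1]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            c[i, j, 0] = 0
    # Inner computation space: a moves along j, b along i, c accumulates along k
    for k in range(1, n + 1):
        for j in range(1, n + 1):
            for i in range(1, n + 1):
                a[i, j, k] = a[i, j - 1, k]
                b[i, j, k] = b[i - 1, j, k]
                c[i, j, k] = c[i, j, k - 1] + a[i, j, k] * b[i, j, k]
    return [[c[i, j, n] for j in range(1, n + 1)] for i in range(1, n + 1)]


product = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)  # [[19, 22], [43, 50]]
```

Each index point (i, j, k) of the triple loop corresponds to one vertex of the graph G, and the three assignments correspond to the three columns of the dependency matrix D.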
Bidirectional orthogonal arrays naturally have $\lambda = 2$ [8]. This fact was used in [14], [15] to organize matrix multiplication such that single errors can be detected, but not corrected. This has inspired us to construct an algorithm which will be executed with $\lambda = 3$ on a 2DBOSA. We then use this algorithm to derive three equivalent algorithms with disjoint index spaces and organize fault-tolerant matrix multiplication on a 2D orthogonal bidirectional SA so that single transient errors can be detected and corrected. In the next section we describe the synthesis of such an array.

III. The synthesis procedure

2DBOSAs can be obtained by mapping graph $G = (P_{int}, D)$, which corresponds to Algorithm 1, along projection vectors $\vec{\mu} = [0\ 1\ 1]^T$, $\vec{\mu} = [1\ 1\ 0]^T$ and $\vec{\mu} = [1\ 0\ 1]^T$ onto the projection plane. Here, we will consider the array obtained by the projection vector $\vec{\mu} = [1\ 0\ 1]^T$. The procedure is the same for the other two directions.

Several valid transformation matrices, $T$, can be associated to each projection vector (see for example [1], [6], [7], [11]). Without affecting generality, for the direction $\vec{\mu} = [1\ 0\ 1]^T$ we will take the following valid transformation matrix in our synthesis procedure:

$$T = \begin{bmatrix} \vec{\Pi} \\ S \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 \\ -1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \tag{4}$$

where $\vec{\Pi}$ determines the time schedule, and $S$ represents the space transformation which maps $G = (P_{int}, D)$ into a 2DBOSA. The obtained 2DBOSA has $\Omega = n(2n - 1)$ PEs and a pipeline period of $\lambda = 2$. Neither of these two parameters satisfies our requirements. Our goal is to obtain a 2DBOSA with $\Omega = n^2$ PEs and $\lambda = 3$. By the procedure described in [10] a 2DBOSA with $\Omega = n^2$ PEs can be obtained. However, its pipeline period is $\lambda = 1$ if time optimization is performed, or $\lambda = 2$ without it. Therefore we have to modify the synthesis procedure proposed in [10].

Our procedure for designing a 2DBOSA on which fault-tolerant matrix multiplication can be performed consists of the following three steps:

Step 1. Synthesizing the 2DBOSA with $\lambda = 3$ and $\Omega = n^2$ PEs. This step requires defining a new systolic algorithm equivalent to Algorithm 1;
Step 2. Deriving three equivalent algorithms from the one obtained in step 1, with disjoint index spaces. These algorithms are then used to obtain a 2DBOSA with $\Omega = n^2 + 2n$ PEs on which fault-tolerant matrix multiplication is performed;
Step 3. Determining the initial data schedule which provides correct execution of the fault-tolerant matrix multiplication algorithm on the 2DBOSA.

A. Step 1

To obtain a 2DBOSA with $\lambda = 3$, the inner computation space $P_{int} = \{\vec{p} = [i\ j\ k]^T\}$ of Algorithm 1 is mapped into a space $P_{int}^{*} = \{\vec{p}^{\,*} = [\frac{3i-1}{2}\ j\ k]^T\}$. Here we assume that in each index point of $P_{int}^{*}$ defined by position vector $\vec{p}^{\,*} = [\frac{3i-1}{2}\ j\ k]^T$ the same elements of matrices $A$, $B$ and $C^{(k-1)}$ take part in the computation as in the point $\vec{p} = [i\ j\ k]^T$ of space $P_{int}$. In other words, we assume that the following identities are valid:

$$a\left(\tfrac{3i-1}{2}, 0, k\right) \equiv a(i, 0, k) \equiv a_{ik}, \quad \text{and} \quad c\left(\tfrac{3i-1}{2}, j, k\right) \equiv c(i, j, k) \equiv c_{ij}^{(k)}. \tag{5}$$

In order to minimize the number of PEs in the 2DBOSA, we perform an accommodation of index set $P_{int}^{*}$ to the direction of projection vector $\vec{\mu} = [1\ 0\ 1]^T$ (see for example [10]). The accommodation is performed by the following mapping

$$\begin{bmatrix} u^{(0)} \\ v^{(0)} \\ w^{(0)} \end{bmatrix} = \begin{bmatrix} \vec{\mu} & \vec{e}_2 & \vec{e}_3 \end{bmatrix} \cdot \vec{p}^{\,*} + \vec{\alpha} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} \frac{3i-1}{2} \\ j \\ k \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ -1 \end{bmatrix} = \begin{bmatrix} \frac{3i-1}{2} \\ j \\ \frac{3i-3}{2} + k \end{bmatrix}, \tag{6}$$

for $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, n$. In this way we have mapped $P_{int}^{*} = \{\vec{p}^{\,*} = [\frac{3i-1}{2}\ j\ k]^T\}$ into $P_{int}^{(0)} = \{\vec{p}^{\,(0)} = [u^{(0)}\ v^{(0)}\ w^{(0)}]^T\}$. Now we can define a new graph $G^{(0)} = (P_{int}^{(0)}, D)$ which corresponds to the following systolic algorithm

Algorithm 2
for k := 1 to n do
  for j := 1 to n do
    for i := 1 to n do
      u(0) := (3i − 1)/2;
      v(0) := j;
      w(0) := (3i − 3)/2 + k;
      a(u(0), v(0), w(0)) := a(u(0), v(0) − 1, w(0));
      b(u(0), v(0), w(0)) := b(u(0) − 1, v(0), w(0));
      c(u(0), v(0), w(0)) := c(u(0), v(0), w(0) − 1) + a(u(0), v(0), w(0)) ∗ b(u(0), v(0), w(0));

By mapping graph $G^{(0)}$ using transformation matrix $S$, defined in (4), the 2DBOSA is obtained.
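The effect of the accommodation mapping (6) on the number of PEs can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper names `accommodate`, `project` and `pe_positions` are hypothetical, exact rational arithmetic is used for the half-integer indices, and the case n = 3 is chosen arbitrarily.

```python
from fractions import Fraction

# Space transformation S taken from (4)
S = [[-1, 0, 1],
     [0, 1, 0]]


def accommodate(i, j, k):
    """Mapping (6): index point [i j k] of Algorithm 1 -> [u v w] of Algorithm 2."""
    u = Fraction(3 * i - 1, 2)
    v = Fraction(j)
    w = Fraction(3 * i - 3, 2) + k
    return (u, v, w)


def project(p):
    """Apply S to an index point, giving the (x, y) position of its PE."""
    return tuple(sum(s * x for s, x in zip(row, p)) for row in S)


def pe_positions(points):
    """Set of distinct PE positions used by a collection of index points."""
    return {project(p) for p in points}


n = 3
idx = [(i, j, k)
       for k in range(1, n + 1)
       for j in range(1, n + 1)
       for i in range(1, n + 1)]

plain = pe_positions(idx)                              # direct projection of P_int
mapped = pe_positions(accommodate(*t) for t in idx)    # projection of P_int^(0)
print(len(plain), len(mapped))  # 15 9
```

For n = 3 the direct projection of the original index space uses n(2n − 1) = 15 positions, while the accommodated space uses n² = 9, in line with the PE counts quoted in the text.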
The $(x, y)$-coordinates of the processing elements in the projection plane are obtained according to the following formula

$$PE \mapsto \begin{bmatrix} x \\ y \end{bmatrix} = S \cdot \vec{p}^{\,(0)} = \begin{bmatrix} -1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \cdot \begin{bmatrix} \frac{3i-1}{2} \\ j \\ \frac{3i-3}{2} + k \end{bmatrix} = \begin{bmatrix} k - 1 \\ j \end{bmatrix},$$

for $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, n$. It can be concluded from the above equation that the obtained 2DBOSA consists of $\Omega = n^2$ PEs.

To determine the pipeline period $\lambda$ of the obtained array, a timing function which defines the time instances when the computations at point $\vec{p}^{\,(0)}$ are performed has to be determined first. The timing function is computed in the following way

$$t(\vec{p}^{\,(0)}) = \vec{\Pi} \cdot \vec{p}^{\,(0)} + \alpha = [1\ 1\ 1] \cdot \begin{bmatrix} \frac{3i-1}{2} \\ j \\ \frac{3i-3}{2} + k \end{bmatrix} + \alpha = 3i + j + k + \alpha - 2,$$

where $\vec{\Pi}$ is defined in (4), while $\alpha$ is a constant determined so that the first computation in $G^{(0)}$ is performed at point $\vec{p}_{min}^{\,(0)}$ for which $t(\vec{p}_{min}^{\,(0)}) = 0$. For all other points $\vec{p}^{\,(0)} \in P_{int}^{(0)}$, $t(\vec{p}^{\,(0)}) > 0$ holds. To determine the pipeline period $\lambda$, we have to find the time difference between the computations at two neighboring points from $P_{int}^{(0)}$. Points $\vec{p}_1^{\,(0)} = [\frac{3i-1}{2}\ j\ \frac{3i-3}{2} + k]^T$ and $\vec{p}_2^{\,(0)} = [\frac{3i+2}{2}\ j\ \frac{3i}{2} + k]^T$ are neighbors in the space $P_{int}^{(0)}$. Since

$$t(\vec{p}_2^{\,(0)}) - t(\vec{p}_1^{\,(0)}) = 3$$

we conclude that $\lambda = 3$, which was our goal. This completes step 1 of the design procedure. An example of the obtained array for the case $n = 3$ is sketched in Fig. 2.

Fig. 2. The 2DBOSA obtained after the first step, for the case n = 3.

B. Step 2

Now we have to construct three algorithms equivalent to Algorithm 2, but with disjoint index spaces. In step 1 we have obtained a systolic algorithm which is characterized by the index space $P_{int}^{(0)}$. The other two algorithms are obtained by translating the points of $P_{int}^{(0)}$ by $-\frac{1}{2}$ and $-1$ along the $i$-axis and by $\frac{1}{2}$ and $1$ along the $k$-axis. The corresponding index spaces are

$$P_{int}^{(1)} = \left\{ \vec{p}^{\,(1)} = \left[ \tfrac{3i-2}{2}\ \ j\ \ \tfrac{3i-2}{2} + k \right]^T \right\}$$

and

$$P_{int}^{(2)} = \left\{ \vec{p}^{\,(2)} = \left[ \tfrac{3i-3}{2}\ \ j\ \ \tfrac{3i-1}{2} + k \right]^T \right\}.$$

It can be shown easily that the sets $P_{int}^{(0)}$, $P_{int}^{(1)}$ and $P_{int}^{(2)}$ are mutually disjoint. Three directed graphs $G^{(0)} = (P_{int}^{(0)}, D)$, $G^{(1)} = (P_{int}^{(1)}, D)$ and $G^{(2)} = (P_{int}^{(2)}, D)$ are related to three systolic algorithms equivalent to Algorithm 2. Denote by $\bar{G}$ the union of graphs $G^{(0)}$, $G^{(1)}$ and $G^{(2)}$. A systolic algorithm that corresponds to $\bar{G}$ has the following form

Algorithm 3
for r := 0 to 2 do
  for k := 1 to n do
    for j := 1 to n do
      for i := 1 to n do
        u := (3i − r − 1)/2;
        v := j;
        w := (3i + r − 3)/2 + k;
        a(u, v, w) := a(u, v − 1, w);
        b(u, v, w) := b(u − 1, v, w);
        c(u, v, w) := c(u, v, w − 1) + a(u, v, w) ∗ b(u, v, w);

We assume that the following identities are valid in Algorithm 3

$$a\left(i, 0, \tfrac{3i+r-3}{2} + k\right) \equiv a\left(i, 0, \left\lfloor \tfrac{3i+r-3}{2} \right\rfloor + k\right)$$
$$a(i, 0, t + n) \equiv a(i, 0, t)$$
$$b\left(0, j, \tfrac{3i+r-3}{2} + k\right) \equiv b\left(0, j, \left\lfloor \tfrac{3i+r-3}{2} \right\rfloor + k\right)$$
$$b(0, j, t + n) \equiv b(0, j, t)$$

for $r = 0, 1, 2$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$, $k = 1, 2, \ldots, n$ and $t = 1, 2, \ldots, n$.

A systolic array that implements Algorithm 3 is obtained by mapping graph $\bar{G}$ using transformation matrix $S$, defined in (4).
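The properties claimed for the three index spaces of Algorithm 3 can be checked by brute force. The following is a minimal sketch, not part of the synthesis procedure itself: the helper names `index_space`, `project` and `schedule` are assumptions made for this illustration, n = 4 is an arbitrary choice, and the constant offset of the timing function is omitted because it does not affect collisions.

```python
from fractions import Fraction
from itertools import product


def index_space(r, n):
    """Index points [u v w] generated by Algorithm 3 for a fixed copy r."""
    pts = set()
    for i, j, k in product(range(1, n + 1), repeat=3):
        u = Fraction(3 * i - r - 1, 2)
        w = Fraction(3 * i + r - 3, 2) + k
        pts.add((u, Fraction(j), w))
    return pts


def project(p):
    """Space transformation S = [[-1, 0, 1], [0, 1, 0]] from (4)."""
    u, v, w = p
    return (w - u, v)


def schedule(p):
    """Time schedule Pi = [1 1 1] from (4), constant offset left out."""
    return sum(p)


n = 4
spaces = [index_space(r, n) for r in range(3)]

# The three index spaces share no point
disjoint = all(spaces[a].isdisjoint(spaces[b])
               for a, b in [(0, 1), (0, 2), (1, 2)])

union = set().union(*spaces)
pes = {project(p) for p in union}                    # distinct PE positions
slots = {(project(p), schedule(p)) for p in union}   # (PE, time step) pairs

print(disjoint, len(pes), len(slots) == len(union))
```

For n = 4 this reports that the spaces are disjoint, that the union projects onto n(n + 2) = 24 PE positions, and that no two index points fall on the same PE in the same time step, so the three copies of the computation interleave without conflicts.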
The $(x, y)$-coordinates of the processing elements in the projection plane are obtained using the following formula

$$PE \mapsto \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} -1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \cdot \begin{bmatrix} \frac{3i-r-1}{2} \\ j \\ \frac{3i+r-3}{2} + k \end{bmatrix} = \begin{bmatrix} k + r - 1 \\ j \end{bmatrix}, \tag{7}$$

for $r = 0, 1, 2$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$, and $k = 1, 2, \ldots, n$. From (7) it can be concluded that the obtained 2DBOSA has $\Omega = n^2 + 2n$ PEs, which was our goal.

Communication channels in the 2DBOSA are implemented along the directions of the column vectors of matrix $\Delta$, which is obtained by mapping the data dependency matrix using transformation matrix $S$, i.e.

$$\Delta = \begin{bmatrix} \vec{e}_b^{\,2} & \vec{e}_a^{\,2} & \vec{e}_c^{\,2} \end{bmatrix} = S \cdot D = \begin{bmatrix} -1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \cdot \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} -1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}. \tag{8}$$

This completes step 2 of the design procedure.

C. Step 3

In this step we determine the initial arrangement of input data items in the projection plane during realization of Algorithm 3 on the 2DBOSA. The space of initial computations, $P_{in}$, which corresponds to Algorithm 3 represents the union of the following three spaces

$$P_{in}^{(r)}(a) = \{\vec{p}_a = [u\ 0\ w]^T\} = \left\{ \vec{p}_a = \left[ \tfrac{3i-r-1}{2}\ \ 0\ \ \tfrac{3i+r-3}{2} + k \right]^T \right\},$$
$$P_{in}^{(r)}(b) = \{\vec{p}_b = [0\ v\ w]^T\} = \left\{ \vec{p}_b = \left[ 0\ \ j\ \ \tfrac{3i+r-3}{2} + k \right]^T \right\},$$
$$P_{in}^{(r)}(c) = \{\vec{p}_c = [u\ v\ 0]^T\} = \left\{ \vec{p}_c = \left[ \tfrac{3i-r-1}{2}\ \ j\ \ 0 \right]^T \right\},$$

for $r = 0, 1, 2$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$, and $k = 1, 2, \ldots, n$. These spaces have to be reordered in order to obtain a data schedule which provides correct execution of Algorithm 3 on the 2DBOSA. The rearrangement of these spaces is performed by the timing function (see for example [10], [11]) in the following way

$$\vec{p}_\gamma^{\,*} = \vec{p}_\gamma - (t(\vec{p}_\gamma) + 1)\,\vec{e}_\gamma^{\,3}, \tag{9}$$

for each $\vec{p}_\gamma$ from the space $P_{in}$, $\gamma \in \{a, b, c\}$, where $\vec{e}_\gamma^{\,3}$ is defined in (3). In our case, the timing function is

$$t(\vec{p}^{\,(r)}) = u + v + w - 3,$$

where $u$, $v$ and $w$ are defined in Algorithm 3. The rearranged set, $P_{in}^{*} = P_{in}^{*(r)}(a) \cup P_{in}^{*(r)}(b) \cup P_{in}^{*(r)}(c)$, is obtained according to the following formulas

$$P_{in}^{*(r)}(a) = \left\{ \vec{p}_a^{\,*} = \left[ \tfrac{3i-r-1}{2}\ \ 4 - 3i - k\ \ \tfrac{3i+r-3}{2} + k \right]^T \right\},$$
$$P_{in}^{*(r)}(b) = \left\{ \vec{p}_b^{\,*} = \left[ \tfrac{7-3i-2j-2k-r}{2}\ \ j\ \ \tfrac{3i+r-3}{2} + k \right]^T \right\},$$
$$P_{in}^{*(r)}(c) = \left\{ \vec{p}_c^{\,*} = \left[ \tfrac{3i-r-1}{2}\ \ j\ \ \tfrac{5-3i-2j+r}{2} \right]^T \right\}.$$

The initial arrangement of input data items in the projection plane during realization of Algorithm 3 on the 2DBOSA is obtained by mapping space $P_{in}^{*}$ using transformation $S$ defined in (4), i.e. according to the following formulas

$$a\left(i, 0, \left\lfloor \tfrac{3i+r-3}{2} \right\rfloor + k\right) \mapsto \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} k + r - 1 \\ 4 - 3i - k \end{bmatrix},$$
$$b\left(0, j, \left\lfloor \tfrac{3i+r-3}{2} \right\rfloor + k\right) \mapsto \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 3i + j + 2k + r - 5 \\ j \end{bmatrix}, \tag{10}$$
$$c(i, j, 0) \mapsto \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} r - 3i - j + 3 \\ j \end{bmatrix},$$

for $r = 0, 1, 2$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$, and $k = 1, 2, \ldots, n$. An example of the 2DBOSA which performs fault-tolerant matrix multiplication for the case $n = 3$ is depicted in Fig. 3.

Let us note that the other three arrays of 2DBOSA type are obtained by the same procedure. The difference relates to the choice of transformation matrix $T$, since it is chosen from the set of possible transformation matrices for each projection vector separately. All the arrays have an identical number of PEs and execution time. They differ with respect to the initial data schedule and directions of data flow.

Fig. 3. The 2DBOSA which performs fault-tolerant matrix multiplication, for the case n = 3.

IV. Global structure of the fault-tolerant system

The global structure of the system which performs fault-tolerant matrix multiplication is presented in Fig. 4.
It consists of a host computer and an accelerator (ACL). The host and the accelerator are connected via a common system bus. The host consists of a CPU, memory and an I/O subsystem. Logically, the ACL is seen by the host as a memory-mapped space where input data are stored and results of computation are obtained. It is composed of the following five building blocks:

1) Address generator logic, AGA, AGB and AGC, intended for accessing memory banks Memory A, Memory B and Memory C, respectively.
2) Input memory banks, Memory A and Memory B, used for storing input data.
3) 2DBOSA with n ∗ (n + 2) PEs which implements the fault-tolerant matrix multiplication algorithm.
4) Voter logic which is used to vote the results. It consists of n TMR voters.
5) Output memory bank, Memory C, used to store elements of product matrix C. It is composed of n dual-port memory modules, $MC_0, MC_1, \ldots, MC_{n-1}$. After the voting has been performed, the result is written into the A side of the corresponding memory module. At the end of computation, the host accesses the B side of memory modules $MC_i$, $i = 0, 1, \ldots, n - 1$, via the system bus, and transfers the result to the main memory.

Fig. 4. Global system structure.

Elements of the resulting matrix $C$ are obtained at the output of $PE_{i,n+1}$, $i = 0, 1, \ldots, n - 1$. Each element $c_{i,j}$, $i, j = 0, 1, \ldots, n - 1$, is calculated three times and obtained in consecutive time instances. The three values of $c_{ij}$ are used to vote the result. Voter logic is appended to each row of processing elements. All voter logics are identical, but with different timing. The structure of the voter logic is sketched in Fig. 5. The result obtained at the output of $PE_{i,n+1}$ drives demultiplexor DEMUX1. The output of DEMUX1, $out_s$, $s = 1, 2, 3$, is buffered in a pair of latches $LL_s$ and $LL'_s$, $s = 1, 2, 3$, alternately. Multiplexors MUX1, MUX2 and MUX3 select the latched values. Outputs $C$, $C'$ and $C''$ drive the voter. The voted value, VR, is accepted by latch LV. During the next clock cycle the output of LV is written into memory block $MC_i$, $i = 0, 1, \ldots, n - 1$. The structure of the voter is given in Fig. 6. It consists of three subtractors, S1, S2 and S3, and combinatorial logic. The status of zero markers $Z_0$, $Z_0'$ and $Z_0''$ points to equality of input values $C$, $C'$ and $C''$. If any of the zero markers is set, then the result can be voted. Otherwise, a multiple error has occurred.
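The voting rule described above can be modelled in software as follows. This is a behavioural sketch only: the function name `vote`, the pairing of subtractors with inputs, and the choice of which copy is forwarded are assumptions made for illustration and are not taken from the circuit itself.

```python
def vote(c, c1, c2):
    """Majority vote over three copies of one result element.

    Returns (voted_value, error), where error is True when no two
    copies agree, i.e. the fault cannot be masked.
    """
    # Zero markers: a subtractor output of zero means the two inputs are equal
    z0 = (c - c1) == 0     # assumed S1: compares C and C'
    z0_p = (c - c2) == 0   # assumed S2: compares C and C''
    z0_pp = (c1 - c2) == 0 # assumed S3: compares C' and C''

    if z0 or z0_p:
        return c, False    # C agrees with at least one other copy
    if z0_pp:
        return c1, False   # C is the faulty copy; C' and C'' agree
    return None, True      # all three copies differ


print(vote(5, 5, 7))  # (5, False)
print(vote(9, 5, 5))  # (5, False)
print(vote(1, 2, 3))  # (None, True)
```

A single corrupted copy is masked in every position, while three mutually different copies raise the error flag that would be reported to the host.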
Fig. 5. The structure of voter logic.

Multiplexor $MUX_{data}$ is used to select the data that is written to memory module $MC_i$, $i = 0, 1, \ldots, n - 1$. The n-input OR gate, O2, reports to the host the occurrence of an error that cannot be masked off.

Fig. 6. The structure of voter.

V. Performance analysis

In this section we evaluate the performance of the synthesized array with respect to:
- Number of processing elements;
- Total computation time;
- Time overhead due to involvement of fault-tolerance;
- Possibility to tolerate single and multiple transient faults, and
- Number of I/O pins.

According to (7) it can be concluded that the 2DBOSA consists of $\Omega = n^2 + 2n$ PEs. The total execution time of the fault-tolerant algorithm on the 2DBOSA is $T_{tot} = 6n - 5$ time units, where the duration of an add-multiply operation is taken as a time unit.

In general, by involving fault-tolerance we have increased both array size and total execution time. The array is expanded with two additional columns of PEs, i.e. by $2n$ PEs. In order to estimate the time overhead, we define a performance metric $Q$ as

$$Q = \frac{T_{FT} - T_{opt}}{T_{opt}},$$

where $T_{FT}$ corresponds to the total execution time of the fault-tolerant matrix multiplication algorithm on the 2DBOSA, and $T_{opt}$ to the total execution time of the matrix multiplication algorithm on a 2DBOSA with the optimal number of PEs, but without fault-tolerant capability. Consequently, we have [10]

$$Q = \frac{(6n - 5) - (4n - 3)}{4n - 3} \approx 0.5 = 50\%.$$

Having in mind that three copies of the matrix multiplication algorithm are computed simultaneously on the 2DBOSA at the cost of minimal hardware overhead, we can say that the time overhead is relatively low.

The synthesized array can detect and tolerate single transient faults and the majority of multiple fault patterns with high probability. The only case where multiple errors cannot be tolerated is when they affect different copies of the same element of the product matrix.

The number of I/O pins has a direct impact on host-accelerator interface complexity. Assume that each data item is $m$ bits wide and that data are fed in and out of the array in parallel. Under this condition, the number of I/O pins in the 2DBOSA proposed in this paper is $NO_{I/O} = m(3n + 2)$. The fault-tolerant hexagonal array of the same size (i.e. $n(n + 2)$ PEs) proposed in [2], [12] has $NH_{I/O} = m(4n + 3)$ I/O pins. Thus, for example, for $m = 32$ and $n = 16$, a reduction of $NH_{I/O} - NO_{I/O} = 544$ I/O pins is obtained, i.e. approximately 30%.

VI. Conclusion

We have described a method to synthesize a fault-tolerant systolic array for matrix multiplication with orthogonal interconnect and bidirectional data flow. For the given matrix dimensions, the synthesized array is both space and time optimal. Fault-tolerance is achieved by involving triplicated computation of the same problem instance and majority voting. In this way all single transient errors and the majority of multiple fault patterns can be tolerated with high probability. The execution time of the fault-tolerant algorithm is only 50% longer compared to that of the non-fault-tolerant one, implemented on an SA of almost identical hardware complexity. With respect to hexagonal arrays of the same size, the number of I/O pins is reduced by approximately 30%.

References

[1] M. Bekakos, E. Milovanović, N. Stojanović, T. Tokić, I. Milovanović, I. Milentojević, 'Transformation matrices for systolic array synthesis', J. Electrotech. Math. 1, (2002), 9-15.
[2] M. P. Bekakos, I. Ž. Milovanović, E. I. Milovanović, T. I. Tokić, M. K. Stojčev, 'Hexagonal systolic arrays for matrix multiplication', Highly Parallel Computing: Algorithms and Applications, (M. P. Bekakos, ed.) Vol. 5, Chapter 6, WITpress, Southampton-Boston, UK, 2001, 175-209.
[3] M. O. Esonu, A. J. Al-Khalili, S. Hariri, D. Al-Khalili, 'Fault tolerant design methodology for systolic array architectures', IEE Comput. Digit. Techn. 141, (1994), 17-28.
[4] D. J. Evans, C. R. Wan, 'Massive parallel processing for matrix multiplication: a systolic approach', in: M. P. Bekakos, ed., Highly Parallel Computations: Algorithms and Applications, WITpress, Southampton-Boston, 2001, 140-173.
[5] H. Kopetz, R. Obermaisser, P. Peti, N. Suri, 'From a federated to an integrated architecture for dependable embedded real-time systems', Technische Univ. Wien, Vienna, Austria, Tech. rep. 22, 2004.
[6] S. Y. Kung, 'VLSI array processors', Prentice Hall, New Jersey, 1988.
[7] C. Langauer, 'View of systolic design', Proc. PaCT'91, World Scientific, Singapore, 1991, 32-46.
[8] G. M. Melkemi, M. Tchuente, 'Computing of matrix product on a class of orthogonally connected systolic arrays', IEEE Trans. Comput. 36 (1987), 615-619.
[9] I. Z. Milentijević, I. Ž. Milovanović, E. I. Milovanović, M. K. Stojčev, 'The design of optimal planar systolic arrays for matrix multiplication', Comput. Math. Appl. 3 (1997), 17-35.
[10] I. Ž. Milovanović, E. I. Milovanović, M. P. Bekakos, I. N. Tselepis, 'Forty-three two-dimensional systolic arrays for matrix multiplication', in: Y. Huang, ed., Supercomputing Research Advances, Nova Science Publishers, New York, 2008, 237-262.
[11] S. G. Sedukhin, 'The design and analysis of systolic algorithms and structures', Programming, 2 (1990), 20-40.
[12] N. M. Stojanović, E. I. Milovanović, I. Stojmenović, I. Ž. Milovanović, T. I. Tokić, 'Mapping matrix multiplication algorithm onto fault-tolerant systolic array', Comput. Math. with Appl., 48, 1-2 (2004), 275-289.
[13] C. R. Wan, D. J. Evans, 'Nineteen ways of systolic matrix multiplication', Int. J. Comput. Math., 68 (1998), 39-65.
[14] J.-J. Wang, C.-W. Jen, 'Redundancy design for fault-tolerant systolic array', IEE Proceedings, Vol. 137, 3 (1990), 218-226.
[15] C. N. Zhang, 'Systematic design of systolic arrays for computing multiple problem instances', Microelectronics J. 23 (1992), 543-553.
[16] C. N. Zhang, T. M. Bachtiar, W. K. Chou, 'Optimal fault-tolerant design approach for VLSI array processors', IEE Proc. Comput. Digit. Tech., Vol. 144, 1 (1997), 15-21.
