Abstract: This paper presents a procedure for designing a fault-tolerant systolic array with orthogonal interconnects and bidirectional data flow (2DBOSA) for matrix multiplication. The method employs space-time redundancy to achieve fault tolerance. The obtained array has $\Omega = n(n + 2)$ processing elements and a total execution time of $T_{tot} = 6n - 5$. The array can tolerate single transient errors and the majority of multiple error patterns with high probability. Compared to a hexagonal array of the same dimensions, the number of I/O pins is reduced by approximately 30%.

M. K. Stojčev, E. I. Milovanović, I. Ž. Milovanović and are with the Faculty of Electronic Engineering, A. Medvedeva 14, P.O. Box 73, 18000 Niš, Serbia.

I. Introduction

Matrix multiplication is one of the essential operations in various fields of science, engineering and technology, such as signal and image processing, system theory, statistical and numerical analysis, biomedical research, etc. This operation is characterized by its intensive computational complexity and regularity, and it is often required under real-time constraints.

Today's high-performance computing systems exploit one or more forms of parallelism to achieve high-speed computations. To fulfill the desired throughput rates for time-critical and computationally intensive problems, special-purpose, high-speed computing systems optimized for processing specific tasks have been designed. A systolic array is a type of special-purpose system that can be used for implementing such tasks. Fault tolerance has become an essential design requirement in such systems.

Permanent, transient and intermittent faults are the main sources of errors in integrated circuits. Permanent faults occur due to irreversible physical changes. Shorts and opens are typical examples of such faults. Transients are most frequently generated by environmental conditions, like cosmic rays. Intermittent faults occur due to unstable or marginal hardware.

Transient faults are the most common, and their number is continuously increasing due to high complexity, smaller transistor sizes, higher operational frequency, and lower voltage levels. The rate of transient faults is often much higher than the rate of permanent faults. The transient-to-permanent fault ratio is 100:1 or higher [5].

A variety of techniques have been devised for masking the errors induced by silicon faults. In the error-masking approach known as N-tuple modular redundancy, N copies (N odd) of a module and a majority voter are used to mask the error from a failed module. At least three modules are necessary in a voting system, which is then typically called triple modular redundancy (TMR). It seems that at least 200 percent hardware overhead is needed for fault tolerance. In practice, the triplicated computations are fed to the voter, which then produces a correct result. The triplicated computations may be computed in different hardware modules and/or at different times using space-shift, time-shift or space-time-shift [14]. If the replicated computations are performed simultaneously by different modules, it is a space-shift scheme. When the replicated computations are computed by the same module at different times, it is a time-shift scheme. If the replicated computations are computed by different modules at different times, it is a space-time-shift scheme.

In our approach, instead of using full hardware redundancy (i.e. triplication) to achieve fault tolerance, we generate redundant information partly by means of hardware (by involving two extra columns of processing elements in the 2D array), and partly by means of time redundancy. This paper presents a systematic approach for designing a two-dimensional fault-tolerant systolic array with orthogonal connections and bidirectional data flow (2DBOSA) with the capability of concurrent error detection and correction using space-time redundancy. In this approach redundant computations are introduced at the algorithmic level by deriving three equivalent algorithms with disjoint index spaces. The fault-tolerant systolic array is constructed by mapping the dependence graphs of these algorithms along a projection direction vector into the physical processor-time domain. There are 13 different projection direction vectors that can be used to obtain 2D SAs with planar links [10]. Three of them give 2DBOSAs. The resulting 2DBOSAs are optimal in terms of area and speed, and all can tolerate single transient errors and the majority of multiple errors with high probability.

A similar method was used in [2], [12], [16] to obtain a hexagonal array with fault-tolerant capability. The main difference in the approach used in [12], [16] concerns the fact that the hexagonal array has a pipeline period $\lambda = 3$, which means that processing elements perform active computations in every third clock cycle. This fact was used to introduce two additional computations of the same problem instance in the idle clock cycles and then vote for the correct result using majority. On the contrary, the 2DBOSAs have pipeline period $\lambda = 1$ or $\lambda = 2$. Therefore they were not considered a suitable candidate architecture for fault-tolerant computations.

In this paper we first perform the expansion of the inner computation space of the basic systolic algorithm for matrix multiplication in order to obtain an array with $\lambda = 3$. Second, we introduce redundant computations at the algorithmic level and define an optimal mapping which maps the proposed algorithm into a 2DBOSA that minimizes hardware cost (i.e. the number of processing elements) and computation time. Finally, for the obtained 2DBOSA we derive formulas for the initial data schedule which provides correct execution of the fault-tolerant algorithm on the synthesized array.

The rest of the paper is organized as follows. Section 2 is devoted to problem definition. A systematic approach for designing four fault-tolerant 2DBOSAs is described in Section 3. The global hardware structure of the fault-tolerant system is presented in Section 4. Details related to performance evaluation of the synthesized array are given in Section 5. Finally, the conclusion is given in Section 6.

II. Background

Let $A = (a_{ik})$ and $B = (b_{kj})$ be two matrices of order $n \times n$. To find their product, $C = A \cdot B$, the following ...

... matrix

$$D = \begin{bmatrix} \vec{e}_b^{\,3} & \vec{e}_a^{\,3} & \vec{e}_c^{\,3} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \qquad (3)$$

An acyclic directed graph $G = (P_{int}, D)$, where $P_{int}$ corresponds to the vertices while the edges are determined by the columns of matrix $D$, can be joined to Algorithm 1. This graph is placed in a three-dimensional space generated by the unit vectors

$$\vec{e}_1 = [1\ 0\ 0]^T, \quad \vec{e}_2 = [0\ 1\ 0]^T \quad \text{and} \quad \vec{e}_3 = [0\ 0\ 1]^T.$$

This graph can be used to determine all allowable projection vectors of the form $\vec{\mu} = [\mu_1\ \mu_2\ \mu_3]^T$, which are used to synthesize 2D SAs. There are 13 allowable projection vectors in total. According to these vectors, 19 different 2D SAs can be synthesized. These arrays can be classified into four groups according to the interconnection patterns between the processing elements, as shown in Fig. 1. The arrays from the same group mutually differ with respect to input data patterns and directions of data flow. All these arrays are well studied in the literature (see for example [4], [9], [10], [13]).

[Fig. 1 panel labels: mesh, bidirectional orthogonal, hexagonal, unidirectional orthogonal.]
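The count of 13 projection vectors quoted above can be reproduced with a short enumeration. The sketch below is not taken from the paper. It assumes that the allowable vectors are exactly the nonzero vectors whose components lie in {-1, 0, 1}, with a vector and its negation treated as the same projection direction. The function name `candidate_projection_vectors` is hypothetical.

```python
from itertools import product


def candidate_projection_vectors():
    """Enumerate nonzero vectors with components in {-1, 0, 1}, up to sign.

    Assumption (not stated in the text): these are the allowable projection
    directions. A vector mu and its negation -mu describe the same direction,
    so only the representative whose first nonzero component is positive
    is kept.
    """
    vectors = []
    for mu in product((-1, 0, 1), repeat=3):
        if mu == (0, 0, 0):
            continue  # the zero vector is not a direction
        first_nonzero = next(x for x in mu if x != 0)
        if first_nonzero > 0:
            vectors.append(mu)
    return vectors


if __name__ == "__main__":
    vectors = candidate_projection_vectors()
    print(len(vectors))  # 26 nonzero vectors / 2 signs = 13
    for mu in vectors:
        print(mu)
```

Under this assumption the enumeration yields 13 directions, which agrees with the number given in the text. Which of them lead to 2DBOSAs depends on the mapping details that are not reproduced here.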
Fig. 3. The 2DBOSA which performs fault-tolerant matrix multiplication, for the case n = 3.
... consists of a host computer and an accelerator (ACL). The host and the accelerator are connected via a common system bus. The host consists of CPU, memory and I/O subsystem. Logically, the ACL is seen by the ...

1) Address generator logic, $AG_A$, $AG_B$ and $AG_C$, intended for accessing memory banks Memory A, Memory B and Memory C, respectively.

... fault-tolerant matrix multiplication algorithm.

4) Voter logic which is used to vote the results. It con...

5) Output memory bank, Memory C, used to store elements of the product matrix C. It is composed of n dual-port memory modules, $MC_0, MC_1, \ldots, MC_{n-1}$. After the voting has been performed, the result is written into the A side of the corresponding memory module. At the end of computation, the host accesses the B side of memory modules $MC_i$, $i = 0, 1, \ldots, n-1$, via the system bus, and transfers the result to the main memory.

Fig. 4. Global system structure.

Elements of the resulting matrix C are obtained at the output of $PE_{i,n+1}$, $i = 0, 1, \ldots, n-1$. Each element $c_{ij}$, $i, j = 0, 1, \ldots, n-1$, is calculated three times, and the three values are obtained in consecutive time instances. The three values of $c_{ij}$ are used to vote the result. Voter logic is appended to each row of processing elements. All voter logics are identical, but with different timing. The structure of the voter logic is sketched in Fig. 5. The result obtained at the output of $PE_{i,n+1}$ drives demultiplexor DEMUX1. The output of DEMUX1, $out_s$, $s = 1, 2, 3$, is buffered in a pair of latches $LL_s$ and $LL'_s$, $s = 1, 2, 3$, alternately. Multiplexors MUX1, MUX2 and MUX3 select the latched values. Outputs $C$, $C'$ and $C''$ drive the voter. The voted value, VR, is accepted by latch LV. During the next clock cycle the output of LV is written into memory block $MC_i$, $i = 0, 1, \ldots, n-1$.

Fig. 5. The structure of voter logic.

The structure of the voter is given in Fig. 6. It consists of three subtractors, S1, S2 and S3, and combinatorial logic. The status of zero markers $Z_0$, $Z_0'$ and $Z_0''$ indicates equality of the input values $C$, $C'$ and $C''$. If any of the zero markers is set, then the result can be voted. Otherwise, a multiple error has occurred. Multiplexor $MUX_{data}$ is used to select the data that is written to memory module $MC_i$, $i = 0, 1, \ldots, n-1$. The n-input OR gate, O2, reports to the host the occurrence of an error that cannot be masked.

... i.e. for 2n PEs. In order to estimate the time overhead, we define a performance metric Q as

$$Q = \frac{T_{FT} - T_{opt}}{T_{opt}},$$

where $T_{FT}$ corresponds to the total execution time of the fault-tolerant matrix multiplication algorithm on the 2DBOSA, and $T_{opt}$ to the total execution time of the matrix multiplication algorithm on a 2DBOSA with the optimal number of PEs, but without fault-tolerant capability. Consequently, we have [10]

$$Q = \frac{6n - (4n - 3)}{4n - 3} \approx 0.5 = 50\%.$$

Having in mind that three copies of the matrix multiplication algorithm are computed simultaneously on the 2DBOSA at the cost of minimal hardware overhead, we can say that the time overhead is relatively low.
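The voting rule described for the voter of Fig. 6 can be sketched in software as follows. This is a minimal behavioural model, not the authors' circuit. The pairing of subtractor inputs (S1 compares $C$ with $C'$, S2 compares $C$ with $C''$, S3 compares $C'$ with $C''$) and the selection rule of $MUX_{data}$ are assumptions, and the names `vote` and `row_error_status` are hypothetical.

```python
from typing import Iterable, Tuple


def vote(c: int, c1: int, c2: int) -> Tuple[int, bool]:
    """Vote on three copies (C, C', C'') of one element of the product matrix.

    Returns (voted_value, error_flag). The error flag is True when no two
    copies agree, i.e. when the error cannot be masked.
    """
    # Three subtractors; a zero difference sets the matching zero marker.
    z0 = (c - c1) == 0     # assumed S1: C  vs C'
    z0p = (c - c2) == 0    # assumed S2: C  vs C''
    z0pp = (c1 - c2) == 0  # assumed S3: C' vs C''

    # If no zero marker is set, all three copies differ: multiple error.
    error = not (z0 or z0p or z0pp)

    # Assumed MUXdata rule: C is in the majority if it matches another copy,
    # otherwise C' is selected (correct whenever C' == C'').
    voted = c if (z0 or z0p) else c1
    return voted, error


def row_error_status(error_flags: Iterable[bool]) -> bool:
    """Model of the OR gate O2: report whether any row had an unmaskable error."""
    return any(error_flags)


if __name__ == "__main__":
    print(vote(7, 7, 7))  # all copies agree
    print(vote(9, 7, 7))  # first copy corrupted, masked by the other two
    print(vote(1, 2, 3))  # no agreement, error flag raised
```

When the error flag is raised the returned value carries no meaning. In the described hardware this case is reported to the host through the error status line.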
[Fig. 6: structure of the voter, with subtractors S1, S2 and S3, zero markers $z_0$, $z_0'$ and $z_0''$, multiplexor $MUX_{data}$, and OR gates O1 and O2.]

The synthesized array can detect and tolerate single transient faults and the majority of multiple fault patterns with high probability. The only case where multiple errors cannot be tolerated is when they affect different copies of the same element of the product matrix.

The number of I/O pins has a direct impact on the complexity of the host-accelerator interface. Assume that each data item is m bits wide and that data are fed into and out of the array in parallel. Under this condition, the number of I/O pins in the 2DBOSA proposed in this paper is $NO_{I/O} = m(3n + 2)$. The fault-tolerant hexagonal array of the same size (i.e. $n(n + 2)$ PEs) proposed in [2], [12] has $NH_{I/O} = m(4n + 3)$ I/O pins. Thus, for example, for m = 32 and n = 16, a reduction of $NH_{I/O} - NO_{I/O} = 544$ I/O pins is obtained, i.e. approximately 30%.
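The overhead metric Q and the pin-count formulas above can be evaluated numerically with the short sketch below. It takes $T_{FT} = 6n$ and $T_{opt} = 4n - 3$ exactly as they appear in the displayed expression for Q (the abstract quotes $6n - 5$ for the total execution time, which would give a slightly smaller value). The function names are hypothetical.

```python
from fractions import Fraction


def time_overhead(n: int) -> Fraction:
    """Q = (T_FT - T_opt) / T_opt, with T_FT = 6n and T_opt = 4n - 3.

    Both expressions are taken from the displayed formula for Q.
    """
    t_ft = 6 * n
    t_opt = 4 * n - 3
    return Fraction(t_ft - t_opt, t_opt)


def io_pins_2dbosa(m: int, n: int) -> int:
    """I/O pins of the proposed 2DBOSA: m * (3n + 2)."""
    return m * (3 * n + 2)


def io_pins_hexagonal(m: int, n: int) -> int:
    """I/O pins of the fault-tolerant hexagonal array: m * (4n + 3)."""
    return m * (4 * n + 3)


if __name__ == "__main__":
    m, n = 32, 16
    no = io_pins_2dbosa(m, n)
    nh = io_pins_hexagonal(m, n)
    print("Q =", float(time_overhead(n)))
    print("2DBOSA pins:", no, "hexagonal pins:", nh, "saved:", nh - no)
    print("saving relative to hexagonal array:", (nh - no) / nh)
    print("saving relative to 2DBOSA:", (nh - no) / no)
```

For m = 32 and n = 16 this gives 1600 and 2144 pins, a difference of 544 as stated in the text. That difference is about 25% of the hexagonal pin count and 34% of the 2DBOSA pin count, so the quoted figure of approximately 30% lies between the two ways of normalizing it.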