IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 1, JANUARY 2013
Based on
I. INTRODUCTION
Finite field multipliers over $GF(2^m)$ have wide applications in elliptic curve cryptography (ECC) and error control coding systems [1], [2]. Polynomial basis multipliers are popularly used because they are relatively simple to design and scale to fields of higher order. Efficient hardware design for polynomial-basis multiplication is therefore important for real-time applications [3]–[5]. The all-one polynomial (AOP) is one of the classes of polynomials considered suitable for use as an irreducible polynomial for efficient implementation of finite field multiplication. Multipliers for the AOP-based binary fields are simple and regular, and therefore a number of works have explored their efficient realization [6]–[17]. Irreducible AOPs are not abundant. They are very often not preferred in cryptosystems for security reasons, and one has to choose the field order carefully to use irreducible AOPs for cryptographic applications [1], [9]. The AOP-based multipliers can also be used for the nearly AOP (NAOP), which could be used for efficient realization of ECC systems [18]. AOP-based fields could also be used for efficient implementation of Reed-Solomon encoders [19]. Besides, the AOP-based architectures can be used as a kernel circuit for field exponentiation, inversion, and division architectures [20]–[23].
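The scarcity of irreducible AOPs can be illustrated with a small sketch. It relies on the standard criterion that the AOP of degree $m$ is irreducible over $GF(2)$ when $(m+1)$ is prime and 2 is a primitive root modulo $(m+1)$. The function names are illustrative assumptions and are not taken from the paper.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small field orders."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def multiplicative_order(a, n):
    """Smallest k >= 1 with a**k == 1 (mod n); assumes gcd(a, n) == 1."""
    k, x = 1, a % n
    while x != 1:
        x = (x * a) % n
        k += 1
    return k


def aop_is_irreducible(m):
    """Check 1 + x + ... + x**m over GF(2) for m >= 2 (hypothetical helper)."""
    p = m + 1
    # 2 is a primitive root modulo p exactly when its order is p - 1 = m.
    return p > 2 and is_prime(p) and multiplicative_order(2, p) == m


def irreducible_aop_degrees(limit):
    """All degrees 2 <= m <= limit for which the AOP is irreducible."""
    return [m for m in range(2, limit + 1) if aop_is_irreducible(m)]


print(irreducible_aop_degrees(30))  # [2, 4, 10, 12, 18, 28]
```

Only six degrees up to 30 qualify, which is why the field order has to be chosen with care.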
Systolic design is a preferred type of specialized hardware solution due to its high level of pipelining, local connectivity, and many other advantageous features [24], [25]. In [13], a bit-parallel AOP-based systolic multiplier has been suggested by Lee et al. Another efficient systolic design is presented in [14]. In a recent paper [15], a low-complexity bit-parallel systolic Montgomery multiplier has been suggested. Very recently [16], an efficient digit-serial systolic Montgomery multiplier for AOP-based binary extension fields has been presented.
The systolic structures for field multiplication have two major issues. First, the registers in the systolic structures usually consume large area and power. Second, the systolic structures usually have a latency of nearly $m$ cycles, which is very often undesirable for real-time applications. Therefore, in this paper, we present a novel register-sharing technique to reduce the register requirement in the systolic structure. The proposed algorithm not only facilitates sharing of registers by the neighboring processing elements (PEs) to reduce the register complexity but also helps reduce the latency. Cut-set retiming allows one to introduce a certain number of delays on all the edges in one direction of any cut-set of a signal flow-graph (SFG) by removing an equal number of delays on all the edges in the reverse direction of the same cut-set [24]. When all the edges are in a single direction, one can introduce any desired number of delays on all the edges of any cut-set of an SFG. Therefore, this technique is highly useful for pipelining digital circuits to reduce the critical path. In this paper, we propose a novel cut-set retiming approach to reduce the clock period. The proposed structure is found to involve significantly less area-time-power complexity than the existing designs.
Manuscript received May 06, 2011; revised October 24, 2011 and December 08, 2011; accepted December 15, 2011. Date of publication January 12, 2012; date of current version December 19, 2012. This work was supported by National Science Foundation (NSF) of China under Grants 60843002 and 61174132 and by Natural Science Program of Hunan Province under Grant 09JJ6098.
J. Xie and J. He are with the School of Information Science and Engineering, Central South University, Changsha 410083, China (e-mail: jftsa@yahoo.cn; jjhe@mail.csu.edu.cn).
P. Kumar Meher is with the Department of Embedded Systems, Institute for Infocomm Research, Singapore 138632 (e-mail: pkmeher@i2r.a-star.edu.sg).
Digital Object Identifier 10.1109/TVLSI.2011.2181434
1063-8210/$31.00 © 2012 IEEE
The rest of this paper is organized as follows. The proposed algorithm for finite field multiplication over $GF(2^m)$ based on AOP is derived in Section II. In Section III, the proposed structure is presented. In Section IV, we list the complexities and compare them with those of the existing structures. Finally, the conclusion is given in Section V.
II. ALGORITHM
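Let $f(x) = \sum_{i=0}^{m} x^i$ be an irreducible AOP of degree $m$ over $GF(2)$. As a requirement of irreducible AOP for $GF(2^m)$, $(m+1)$ is prime and 2 is a primitive root modulo $(m+1)$. The set $\{\alpha^{m-1}, \alpha^{m-2}, \ldots, \alpha, 1\}$ forms the polynomial basis (where $\alpha$ is a root of $f(x)$), such that an element $X$ of the binary field can be given by

$$X = \sum_{i=0}^{m-1} x_i \alpha^i \tag{1}$$

where $x_i \in GF(2)$. Since $f(\alpha) = 0$, we also have $(\alpha + 1) f(\alpha) = \alpha^{m+1} + 1 = 0$. Therefore, we have

$$\alpha^{m+1} = 1. \tag{2}$$

An element $X \in GF(2^m)$ can then be represented in the extended polynomial basis $\{\alpha^{m}, \alpha^{m-1}, \ldots, \alpha, 1\}$ as

$$X = \sum_{j=0}^{m} x_j \alpha^j \tag{3}$$

where $x_j \in GF(2)$. Let $A, B, C \in GF(2^m)$ be expressed in this form as

$$A = \sum_{j=0}^{m} a_j \alpha^j, \quad B = \sum_{j=0}^{m} b_j \alpha^j, \quad C = \sum_{j=0}^{m} c_j \alpha^j. \tag{4}$$

If $C$ is the product of $A$ and $B$, then

$$C = A \cdot B \bmod f(\alpha) \tag{5}$$

which can be expanded as

$$C = \sum_{i=0}^{m} b_i \cdot \left( \alpha^i \cdot A \bmod f(\alpha) \right). \tag{6}$$

This can be written as

$$C = \sum_{i=0}^{m} X_i \tag{7}$$

where $X_i$ is given by

$$X_i = b_i \cdot A^i \tag{8a}$$

for $A^0 = A$, and $A^i = [\alpha^i \cdot A \bmod f(\alpha)]$, and using (3) $A^i$ can be obtained from $A$ as

$$A^i = a_{m-i}\,\alpha^{m} + a_{m-i-1}\,\alpha^{m-1} + \cdots + a_{m-i+2}\,\alpha + a_{m-i+1} \tag{8b}$$

where the subscripts are taken modulo $(m+1)$. Here the superscript of $A^i$ (and of its coefficients $a^i_j$ below) is an index, not a power. $A^{i+1}$ can be obtained from $A^i$ recursively as

$$A^{i+1} = \alpha \cdot A^i \bmod f(\alpha). \tag{9}$$

This can be expanded as

$$A^{i+1} = a^i_0 \cdot \alpha + a^i_1 \cdot \alpha^2 + \cdots + a^i_m \cdot \alpha^{m+1} \bmod f(\alpha) \tag{10a}$$

where

$$A^i = \sum_{j=0}^{m} a^i_j \alpha^j. \tag{10b}$$

Using (2), $A^{i+1}$ can be obtained as

$$A^{i+1} = a^{i+1}_0 + a^{i+1}_1 \cdot \alpha + \cdots + a^{i+1}_m \cdot \alpha^m \tag{11a}$$

where

$$a^{i+1}_0 = a^i_m \tag{11b}$$

$$a^{i+1}_j = a^i_{j-1}, \quad \text{for } 1 \le j \le m. \tag{11c}$$

It is also possible to extend (11) further to obtain $A^{i+l}$ directly from $A^i$ for $1 \le l \le m$, such that

$$a^{i+l}_j = \begin{cases} a^i_{m-l+j+1}, & \text{for } 0 \le j \le l-1 \\ a^i_{j-l}, & \text{otherwise.} \end{cases} \tag{12}$$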
We have used the above equations to derive the proposed linear systolic structure based on a novel cut-set retiming strategy and register-sharing technique.
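The recursion of (7), (8), and (11) can be sketched in software as follows. This is a minimal bit-level model for illustration, not the hardware structure. Operands are lists of $m+1$ bits in the extended polynomial basis, and all function names are assumptions introduced here. The result is checked against a plain schoolbook multiplication reduced modulo $f(x)$.

```python
from itertools import product


def aop_multiply(a, b):
    """Extended-basis product C = sum_i b_i * A^i, following (7), (8a) and (11).

    a and b are lists of m + 1 bits (coefficients of alpha**0 .. alpha**m).
    """
    c = [0] * len(a)
    a_i = list(a)                                     # A^0 = A
    for b_i in b:
        if b_i:                                       # X_i = b_i * A^i, (8a)
            c = [cj ^ aj for cj, aj in zip(c, a_i)]   # accumulate X_i, (7)
        a_i = [a_i[-1]] + a_i[:-1]                    # A^(i+1) from A^i, (11)
    return c


def to_polynomial_basis(c):
    """Drop the alpha**m term using alpha**m = 1 + alpha + ... + alpha**(m-1)."""
    top = c[-1]
    return [cj ^ top for cj in c[:-1]]


def reference_multiply(a, b):
    """Schoolbook product of m-bit operands modulo f(x) = 1 + x + ... + x**m."""
    m = len(a)
    prod = [0] * (2 * m - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            prod[i + j] ^= a_i & b_j
    for k in range(2 * m - 2, m - 1, -1):             # clear x**k for k >= m
        if prod[k]:
            for t in range(m + 1):                    # add x**(k-m) * f(x)
                prod[k - t] ^= 1
    return prod[:m]


def agrees_with_reference(m):
    """Compare both multipliers on every pair of m-bit operands."""
    for a in product([0, 1], repeat=m):
        for b in product([0, 1], repeat=m):
            ext = aop_multiply(list(a) + [0], list(b) + [0])
            if to_polynomial_basis(ext) != reference_multiply(list(a), list(b)):
                return False
    return True


# m = 4: (1 + alpha^2)(alpha + alpha^3) = alpha + alpha^5 = 1 + alpha
print(to_polynomial_basis(aop_multiply([1, 0, 1, 0, 0], [0, 1, 0, 1, 0])))
print(agrees_with_reference(4))
```

Because multiplication by $\alpha$ is a cyclic shift of the $m+1$ coefficients, each recursion step needs no logic beyond wiring, which is the property the retiming in Section III relies on.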
III. PROPOSED STRUCTURE
In this section, we derive a basic systolic design followed by the
proposed register sharing structure.
A. Basic Systolic Design
For systolic implementation of multiplication over $GF(2^m)$, the operations of (7), (8) and (11) can be performed recursively. Each recursion is composed of three steps, i.e., modular reduction of (11), bit-multiplication of (8), and bit-addition of (7). Equations (7), (8) and (11) can be represented by the SFG (shown in Fig. 1) consisting of $m$ modular reduction nodes $R(i)$ and addition nodes $A(i)$ for $1 \le i \le m$, and $(m+1)$ multiplication nodes $M(i)$ for $1 \le i \le m+1$. The functions of these nodes are shown in Fig. 1(b)–(d). Node $R(i)$ performs the one-step modular reduction of (11). Node $M(i)$ performs an AND operation of a bit of operand $B$ with a reduced form of operand $A$, according to (8). Node $A(i)$ performs the bit-addition operation according to (7) on the partial result available to the node, as shown in Fig. 1(d).
Fig. 1. SFG of the algorithm. (a) The SFG. (b) Function of node $R(i)$. (c) Function of node $M(i)$. (d) Function of node $A(i)$.
Fig. 2. Cut-set retiming of the SFG. (a) Cut-set retiming in a general way. (b) Proposed cut-set retiming. (c) Formation of PE. D denotes unit delay.
Fig. 3. Proposed systolic structure. (a) Systolic design. (b) Function of PE[0]. (c) Function of PE[1]. (d) Function of regular PE (from PE[2] to PE[$m-1$]). (e) Function of PE[$m$]. (f) Function of PE[$m+1$].
Fig. 4. Structure of PEs. (a) Internal structure of a regular PE. (b) Internal structure of PE[0] of Fig. 4. (c) An example of AND cell for $m = 4$. (d) Structure of the AC. (e) Structure of BSC where $m = 4$. (f) Alternate structure of a regular PE. (g) Alternate structure of PE[0].
Generally, we can introduce a delay between the reduction node and its corresponding bit-multiplication and bit-addition nodes, as shown in Fig. 2(a), such that the critical path is not larger than $(T_A + T_X)$, where $T_A$ and $T_X$ refer to the propagation delays of an AND gate and an XOR gate, respectively. In this section, however, we introduce a novel cut-set retiming to reduce the critical path of a PE to $T_X$. It is observed that the node $R(i)$ performs only the bit-shift operation according to (11), and therefore it does not involve any time consumption. Therefore, the retiming can be arranged such that the critical path is not larger than $T_X$, as shown in Fig. 2(b). To derive the basic design of a systolic multiplier, we have
Fig. 5. Proposed low latency systolic structure. (a) The systolic structure. (b)
Function of the AC.
$$C = \sum_{i=0}^{m/2} X_i + \sum_{i=m/2+1}^{m} X_i. \tag{13}$$
As shown in (13), one of the sums contains $[(m/2)+1]$ partial products while the other has $m/2$ partial products. Based on (13), the systolic structure of Fig. 4 could be modified to a form shown in Fig. 5, which consists of two systolic branches. The upper branch consists of $[(m/2)+2]$ PEs and the lower branch consists of $(m/2+1)$ PEs and a delay cell. Besides, an addition-cell (AC) is required to perform the final addition of the outputs of the two systolic arrays, as shown in Fig. 5(b). The structure has PEs of the same complexity as those in Fig. 3, but the latency of the structure is only $[(m/2)+3]$ cycles.
It is observed that the two systolic branches in Fig. 5 share the same input operand $A$, and the PEs in both the branches perform the same operation except the last PE in each of the branches. Therefore, we present an efficient structure using the register-sharing technique as shown in Fig. 6, where the structure consists of $[(m/2)+2]$ PEs and an AC. The circuit of its regular PE (from PE[2] to PE[$m/2-1$]) is shown in Fig. 6(c). It combines two regular PEs of Fig. 5(a) together by sharing one input-operand-transfer. The other PEs need some minor modifications, as shown in Fig. 6(b), (d) and (e), respectively. The function of the AC is the same as that in Fig. 5. Thus, the whole structure requires only $(2.5m^2 + 6.5m + 4)$ bit-registers, while the structure of Fig. 4 requires $(3m^2 + 5m + 2)$ bit-registers. Besides, the latency of the structure is $[(m/2)+3]$ cycles, while the cycle period of a regular PE is still $T_X$.
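The split of (13) can be checked functionally with a short sketch. It assumes, as an illustration only, that the second branch is seeded with $A^{m/2+1}$ obtained in one step through (12); the paper's exact PE wiring is not modeled, and the function names are hypothetical.

```python
def shift_by(a, l):
    """(12): coefficients of A^(i+l) from those of A^i, for 1 <= l <= m."""
    m = len(a) - 1
    return [a[m - l + j + 1] if j <= l - 1 else a[j - l] for j in range(m + 1)]


def branch_sum(a_start, b_bits):
    """Accumulate b_i * A^i along one branch, starting from a_start."""
    acc = [0] * len(a_start)
    a_i = list(a_start)
    for bit in b_bits:
        if bit:
            acc = [x ^ y for x, y in zip(acc, a_i)]
        a_i = [a_i[-1]] + a_i[:-1]          # one-step shift of (11)
    return acc


def two_branch_multiply(a, b):
    """C as the sum of the two partial sums of (13); m is assumed even."""
    m = len(a) - 1
    half = m // 2
    upper = branch_sum(a, b[: half + 1])                      # i = 0 .. m/2
    lower = branch_sum(shift_by(a, half + 1), b[half + 1:])   # i = m/2+1 .. m
    return [x ^ y for x, y in zip(upper, lower)]              # role of the AC


a = [1, 0, 1, 0, 0]   # 1 + alpha^2 in the extended basis, m = 4
b = [0, 1, 0, 1, 0]   # alpha + alpha^3
print(two_branch_multiply(a, b))   # [1, 1, 0, 0, 0], i.e. 1 + alpha
```

Both branches run over about half of the partial products each, which is where the halved latency comes from; the final XOR plays the role of the addition cell.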
Fig. 7. Improved low-latency systolic structure. (a) The proposed systolic array
merging. (b) Improved systolic structure.
$$C = \sum_{i=0}^{m/4-1} X_i + \sum_{i=m/4}^{m/2-1} X_i + \sum_{i=m/2}^{3m/4-1} X_i + \sum_{i=3m/4}^{m} X_i. \tag{14}$$
Following the same approach as the one used to derive the structure of Fig. 5, we can have the design in Fig. 7(a), which consists of four systolic branches. Similarly, following the approach presented to derive the structure of Fig. 6 from Fig. 5, we may have the design shown in Fig. 7(b). The design of Fig. 7(b) requires only $[(m/4)+4]$ cycles of latency. When $m$ is a large number, $l$ and $P$ can be chosen as

$$l = P = \lfloor \sqrt{m+1} \rfloor. \tag{15}$$
The proposed structure (see Fig. 6) requires $[(m/2)+2]$ PEs and one AC. Each of the regular PEs consists of $2(m+1)$ XOR gates in a pair of XOR cells and $2(m+1)$ AND gates in a pair of AND cells. Besides, the AC requires $(m+1)$ XOR gates. Moreover, $(2.5m^2 + 6.5m + 4)$ bit-registers are required for transferring data to the nearby PE. The latency of the design is $[(m/2)+3]$ cycles, where the duration of the clock period is $T_X$. The structure of Fig. 7 requires nearly the same gate count as that of Fig. 6, but its latency is $[(m/4)+4]$ cycles. The number of gates, latency and critical path of the proposed designs (see Figs. 6 and 7) and the existing designs of [11]–[16] are listed in Table I. It can be seen that the proposed design outperforms the existing designs. Although it uses slightly more registers than the design of [11], the proposed design requires a shorter latency and a shorter critical path, as well as fewer MUX gates, than the latter. The digit-serial structures of [12] and
TABLE I
AREA AND TIME COMPLEXITY
ACT = (critical-path) × (number of cycles required to obtain a result).
Digit-size L = 4.
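The counts quoted in the text for the register-sharing structure can be evaluated for a concrete field order with a small sketch. It only restates the formulas given above, assumes $m$ is even (and a multiple of 4 for the four-branch latency), and uses illustrative function names; it does not reproduce the entries of Table I.

```python
def register_sharing_counts(m):
    """Figures quoted in the text for the structure of Fig. 6 (m assumed even)."""
    return {
        "pes": m // 2 + 2,
        "xor_per_regular_pe": 2 * (m + 1),
        "and_per_regular_pe": 2 * (m + 1),
        "xor_in_ac": m + 1,
        "bit_registers": (5 * m * m + 13 * m + 8) // 2,  # 2.5m^2 + 6.5m + 4
        "latency_cycles": m // 2 + 3,
    }


def basic_bit_registers(m):
    """Register count quoted for the structure without register sharing."""
    return 3 * m * m + 5 * m + 2


def four_branch_latency(m):
    """Latency quoted for Fig. 7(b); m is assumed to be a multiple of 4."""
    return m // 4 + 4


m = 12
counts = register_sharing_counts(m)
print(counts["bit_registers"], basic_bit_registers(m))   # 442 494
print(counts["latency_cycles"], four_branch_latency(m))  # 9 7
```

The two register formulas coincide at $m = 4$, so the saving from register sharing appears only for larger field orders and grows as $0.5m^2$.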
TABLE III
FPGA SYNTHESIS RESULT OF PROPOSED AND EXISTING DESIGNS
V. CONCLUSION
An efficient systolic design for multiplication over $GF(2^m)$ based on irreducible AOP is proposed. By novel cut-set retiming, we have been able to reduce the critical path to one XOR gate delay, and by sharing registers for the input operands in the PEs, we have derived a low-latency bit-parallel systolic multiplier. Compared with the existing systolic structures for bit-parallel realization of multiplication over $GF(2^m)$, the proposed one is found to involve less area, a shorter critical path, and lower latency. From ASIC and FPGA synthesis results, we find that the proposed design involves significantly less area-delay product (ADP) and power-delay product (PDP) than the existing designs. Moreover, our proposed design can be extended to further reduce the latency.
I. INTRODUCTION
A trend toward multiprocessor system-on-chip (MPSoC) designs interconnected by on-chip networks is currently emerging for applications such as parallel processing and scientific computing [1]–[6].
Manuscript received January 12, 2011; revised June 23, 2011 and October 20,
2011; accepted December 07, 2011. Date of publication January 17, 2012; date
of current version December 19, 2012. This work was supported by the National
Research Foundation of Korea (NRF) grant funded by the Korea Government
(MEST) (2011-0020128).
The authors are with the School of Electrical Engineering, Korea University,
Seoul 136-713, South Korea (e-mail: hungpp@korea.ac.kr; ckim@korea.ac.kr).
Digital Object Identifier 10.1109/TVLSI.2011.2181545