Sie sind auf Seite 1von 6

168

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 1, JANUARY 2013

REFERENCES
[1] K. Takeuchi, G. Tanaka, H. Matsushita, K. Yoshizumi, Y. Katsuki, and
T. Sato, Observations of supply-voltage-noise dispersion in sub-nsec,
presented at the ITC, Santa Clara, CA, 2008, 26.3.
[2] J. Wang, D. Walker, A. Majhi, B. Kruseman, G. Gronthoud, L. E.
Villagra, P. Wiel, and S. Eichenberger, Power supply noise in delay
testing, presented at the ITC, Santa Clara, CA, 2006, 17.3.
[3] E. Alon, V. Stojanovic, and M. A. Horowitz, Circuits and techniques
for high-resolution measurement of on-chip power supply noise, IEEE
J. Solid-State Circuits, vol. 40, no. 4, pp. 820828, Apr. 2005.
[4] E. Alon, V. Abramzon, B. Nezamfar, and M. Horowitz, On-die power
supply noise measurement techniques, IEEE Trans. Adv. Packag., vol.
32, no. 2, pp. 248259, Feb. 2009.
[5] M. Takamiya and M. Mizuno, A sampling oscilloscope macro toward
feedback physical design methodology, in Proc. Symp. VLSI Circuits,
2004, pp. 240243.
[6] M. Nagata, On-chip measurements complementary to design flow for
integrity in SoCs, in Proc. DAC, 2007, pp. 400403.
[7] K. Shimazaki, M. Fukazawa, M. Nagata, S. Miyahara, M. Hirata, K.
Sato, and H. Tsujikawa, An integrated timing and dynamic supply
noise verification for nano-meter CMOS SoC designs, in Proc. CICC,
2005, pp. 3134.
[8] Y. Kanno, Y. Kondoh, T. Irita, K. Hirose, R. Mori, Y. Yasu, S. Komatsu, and H. Mizuno, In-situ measurement of supply-noise maps
with millivolt accuracy and nanosecond-order time resolution, in Proc.
VLSI Circuits, 2006, pp. 6364.
[9] J. R. Vazquez and J. P. de Gyvez, Power supply noise monitor for
signal integrity faults, in Proc. DATA, 2004, pp. 14061407.
[10] S. Naffziger, B. Stackhouse, T. Grutkowski, D. Josephson, J. Desai, E.
Alon, and M. H. Horowitz, The implementation of a 2-core, multithreaded itanium family processor, IEEE J. Solid-State Circuits, vol.
41, no. 1, pp. 197209, Jan. 2005.
[11] M. Fukazawa, T. Matsuno, T. Uemura, R. Akiyama, T. Kagemoto, H.
Makino, H. Takata, and M. Nagata, Fine-grained in-circuit continuous-time probing technique of dynamic supply variations in SoCs,
in Proc. ISSCC, 2007, pp. 288289.
[12] S. Pant and E. Chiprout, Power grid physics and implications for
CAD, in Proc. DAC, 2006, pp. 199204.
[13] J. Rearick and R. Rodgers, Calibrating clock stretch during AC scan
testing, presented at the ITC, Austin, TX, 2005, 11.3.
[14] K. Takeuchi, A. Yoshikawa, M. Komoda, K. Kotani, H. Matsushita,
Y. Katsuki, Y. Yamamoto, and T. Sato, Clock-skew test module for
exploring reliable clock-distribution under process and global voltagetemperature variations, IEEE Trans. Very Large Scale Integr. (VLSI)
Syst., vol. 16, no. 11, pp. 15591566, Nov. 2008.

Low-Complexity Multiplier for


All-One Polynomials

Based on

Jiafeng Xie, Pramod Kumar Meher, and Jianjun He

AbstractThis paper presents an area-time-efficient systolic structure


(2 ) based on irreducible all-one polynomial
for multiplication over
(AOP). We have used a novel cut-set retiming to reduce the duration of
the critical-path to one XOR gate delay. It is further shown that the systolic
structure can be decomposed into two or more parallel systolic branches,
where the pair of parallel systolic branches has the same input operand, and
they can share the same input operand registers. From the application-specific integrated circuit and field-programmable gate array synthesis results
we find that the proposed design provides significantly less area-delay and
power-delay complexities over the best of the existing designs.
Index TermsAll-one polynomial, finite field, systolic design.

I. INTRODUCTION

Finite field multipliers over GF (2m ) have wide applications in elliptic curve cryptography (ECC) and error control coding systems [1],
[2]. Polynomial basis multipliers are popularly used because they are
relatively simple to design, and offer scalability for the fields of higher
orders. Efficient hardware design for polynomial-based multiplication
is therefore important for real-time applications [3][5]. All-one polynomial (AOP) is one of the classes of polynomials considered suitable
to be used as irreducible polynomial for efficient implementation of
finite field multiplication. Multipliers for the AOP-based binary fields
are simple and regular, and therefore, a number of works have been
explored on its efficient realization [6][17]. Irreducible AOPs are not
abundant. They are very often not preferred in cryptosystems for security reasons, and one has to make careful choice of the field order
to use irreducible AOPs for cryptographic applications [1], [9]. The
AOP-based multipliers can be used for the nearly AOP (NAOP) which
could be used for efficient realization of ECC systems [18]. AOP-based
fields could also be used for efficient implementation of Reed-Solomon
encoders [19]. Besides, the AOP-based architectures can be used as a
kernel circuit for field exponentiation, inversion, and division architectures [20][23].
Systolic design is a preferred type of specialized hardware solution due to its high-level of pipeline ability, local connectivity and
many other advantageous features [24], [25]. In [13], a bit-parallel
AOP-based systolic multiplier has been suggested by Lee et al.. Another efficient systolic design is presented in [14]. In a recent paper
[15], a low-complexity bit-parallel systolic Montgomery multiplier has
been suggested. Very recently [16], an efficient digit-serial systolic
Montgomery multiplier for AOP-based binary extension field is presented.
The systolic structures for field multiplication have two major issues. First, the registers in the systolic structures usually consume large

Manuscript received May 06, 2011; revised October 24, 2011 and December
08, 2011; accepted December 15, 2011. Date of publication January 12, 2012;
date of current version December 19, 2012. This work was supported by National Science Foundation (NSF) of China under Grant 60843002, 61174132
and by Natural Science Program of Hunan Province under Grant 09JJ6098.
J. Xie and J. He are with the School of Information Science and Engineering,
Central South University, Changsha 410083, China (e-mail: jftsa@yahoo.cn;
jjhe@mail.csu.edu.cn).
P. Kumar Meher is with the Department of Embedded Systems, Institute for
Infocomm Research, Singapore 138632 (e-mail: pkmeher@i2r.a-star.edu.sg).
Digital Object Identifier 10.1109/TVLSI.2011.2181434
1063-8210/$31.00 2012 IEEE

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 1, JANUARY 2013

area and power. Second, the systolic structures usually have a latency
cycles, which is very often undesired for real-time apof nearly
plications. Therefore, in this paper, we have presented a novel register-sharing technique to reduce the register requirement in the systolic
structure. The proposed algorithm not only facilitates sharing of registers by the neighboring PEs to reduce the register complexity but also
helps reducing the latency. Cut-set retiming allows to introduce certain
number of delays on all the edges in one direction of any cut-set of a
signal flow-graph (SFG) by removing equal number of delays on all
the edges in the reverse direction of the same cut-set [24]. When all the
edges are in a single direction, one can introduce any desired number
of delays on all the edges of any cut-set of an SFG. Therefore, this
technique is highly useful for pipelining digital circuits to reduce the
critical path. In this paper, we have proposed a novel cut-set retiming
approach to reduce the clock-period. The proposed structure is found to
involve significantly less area-time-power complexity compared with
the existing designs.
The rest of this paper is organized as follows. The proposed algom
(2 ) based on AOP is derithm for finite field multiplication over
rived in Section II. In Section III, the proposed structure is presented.
In Section IV, we have listed the complexities and compared them
with those of the existing structures. Finally the conclusion is given
in Section V.

GF

II. ALGORITHM

xm + xm01 + . . . + x + 1 be an irreducible AOP

fx
GF (2). As a requirement of irreducible AOP for
GF (2m ); (mm0+1 1)mis02prime and 2 is the primitive modulo (m + 1).
; ; . . . ; ; 1g forms the polynomial basis (where
The set f
is a root of f (x)), such that an element X of the binary field can be

Let ( ) =
of degree m over

169

which can be decomposed to a form


m

C=

(1)

i=0

C=
where

m+1

=1

(2)

j =0

a ; B=
j

b ; C=
j

j =0

c
j

j =0

(4)

a , b , and c 2 GF (2), for 0  j  m 0 1, and a


b = 0, and c = 0.
If C is the product of elements A and B , then we can have
C = A 1 B mod f ( )
where
m

The partial product generation and modular reduction are performed


according to (8) and (9), respectively. The additions of the reduced
polynomials are performed according to (7).
Equation (9) can be expressed as

A +1 = a0 1 + a1 1 2 + 1 1 1 + a 1
i

m+1

i
m

mod

f ( )

(10a)

where

=
j =0

a :
i
j

(10b)

A +1 can be obtained as
A +1 = a0+1 + a1+1 1 + 1 1 1 + a +1 1
i

Substituting (3) into (10a),


i

i
m

a0+1 = a
a +1 = a 01;

(11a)

i
m

(11b)

j  m 0 1:
(11c)
It is also possible to extend (11) further to obtain A + directly from
A for 1  l  m, such that
0j l01
a + = aa 00;+ +1 ; for
(12)
otherwise.
i
j

i
j

for

(3)

X GF
X x x
; ; ; ;
A; B;C GF
j

This property of AOP [17] is used to reduce the complexity of field


multiplications as discussed in the following.
m
(2 ) given by (1) in polynomial basis repreAny element in
sentation can be represented as = 0 + 1 + . . . + m m , where
m
m01
(2), and f
...
1g is the extended polynomial
i 2
m
2
(2 ), they can be represented
basis [17]. Similarly, if
by the extended polynomial basis as
m

A=

(7)

where

Therefore, we have

x GF

i=0

is given by

X =b 1A
(8a)
for A0 = A, and A = [ :A mod f ( )], and using (3) A can be
obtained from A as
A = a 0 + a 0 01 01 + 1 1 1 + a 0 +2 + a 0 +1 : (8b)
Such that A +1 can be obtained from A recursively as
A +1 = 1 A mod f ( ):
(9)

(6)

Equation (6) can be expressed as a finite field accumulation

given by

X = Xm01 m01 + Xm02 m02 + 1 1 1 + X1 + X0


where X 2 GF (2) for i = m 0 1; . . . ; 2; 1; 0.
Since is a root of f (x), we can have f ( ) = 0, and
f ( ) + f ( ) = ( + 01 + 1 1 1 + + 1)
01 + 1 1 1 + + 1)
+ ( +
+1
+ 1 = 0:
=

b 1 A mod f ( ) :

= 0,

(5)

i
m l
i
j l

i l
j

We have used the above equations to derive the proposed linear systolic structure based on a novel cut-set retiming strategy and registersharing technique.
III. PROPOSED STRUCTURE
In this section, we derive a basic systolic design followed by the
proposed register sharing structure.
A. Basic Systolic Design

GF

m
(2 ), the opFor systolic implementation of multiplication over
erations of (7), (8) and (11) can be performed recursively. Each recursion is composed of three steps, i.e., modular reduction of (11), bit-multiplication of (8), and bit-addition of (7). Equations of (7), (8) and (11)
can be represented by the SFG (shown in Fig. 1) consisting of modular reduction nodes ( ) and addition nodes ( ) for 1   ,

Ri

Ai

m
i m

170

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 1, JANUARY 2013

Fig. 1. SFG of the algorithm. (a) The SFG. (b) Function of node
Function of node M (i). (d) Function of node A(i).

R(i). (c)
Fig. 3. Proposed systolic structure. (a) Systolic design. (b) Function of PE[0].
(c) Function of PE[1]. (d) Function of regular PE (from PE[2] to PE[m 1]).
(e) Function of PE[m]. (f) Function of PE[m + 1].

Fig. 4. Structure of PEs. (a) Internal structure of a regular PE. (b) Internal structure of PE[0] of Fig. 4. (c) An example of AND cell for m = 4. (d) Structure of
the AC. (e) Structure of BSC where m = 4. (f) Alternate structure of a regular
PE. (g) Alternate structure of PE[0].
Fig. 2. Cut-set retiming of the SFG. (a) Cut-set retiming in a general way. (b)
Proposed cut-set retiming. (c) Formation of PE. D denotes unit delay.

Mi

i m

and ( + 1) multiplication nodes ( ) for 1   + 1. The functions of these nodes are shown in Fig. 1(b)(d). Node ( ) performs
the modular reduction of degree by one according to (11). Node ( )
performs an AND operation of a bit of operand with a reduced form
of operand , according to (8). Node ( ) performs the bit-addition
operation according to (7), as shown in Fig. 1(d), where i is the partial result available to the node.
Generally, we can introduce a delay between the reduction node and
its corresponding bit-multiplication and bit-addition nodes, as shown
in Fig. 2(a), such that the critical-path is not larger than ( A + X ),
where the A and X refer the propagation delay of AND gate and
XOR gate, respectively. In this section, however, we introduce a novel
cut-set retiming to reduce the critical-path of a PE to X . It is observed
that the node ( ) performs only the bit-shift operation according to
(11), and therefore it does not involve any time consumption. Therefore, we introduce a critical-path which is not larger than X , as shown
in Fig. 2(b). To derive the basic design of a systolic multiplier, we have

Ai

Ri

Ri

Mi

shown the formation of PE of the retimed SFG in Fig. 2(c). It can be


observed that the cut-set retiming allows to perform a reduction operations, bit-addition, and bit-multiplication concurrently, so that the critical-path is reduced to maxf A M R g, where A , M and R are,
respectively, the computation times of the bit-addition nodes, bit-multiplication nodes, and reduction nodes.
The basic design of systolic multiplier thus derived is shown in
Fig. 3. It consists of ( + 2) PEs, and the functions of the PEs are
shown in Fig. 3. During each cycle period, the regular PE (from PE[2]
to PE[ 0 1]) not only performs the modular reduction operation
according to (11), but also performs the bit-multiplication and bit-addition operations concurrently. The detail circuit of a regular PE is
shown in Fig. 4.
The regular PE, as shown in Fig. 4(a), consists of three basic cells,
e.g., the bit-shift cell (BSC), the AND cell, and the XOR cell. The AND
cell, and the XOR cell correspond to the node ( ), and node ( )
of the SFG of Fig. 1, respectively. The structure of PE[1] of Fig. 3 is
shown in Fig. 4(b). It consists of an AND cell and a BSC. Each XOR cells
and AND cells in the PE consists of ( + 1) number of gates working
= 4. The
in parallel. Fig. 4(c) shows an example of AND cell for

T ; T ;T

T T

Mi

Ai

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 1, JANUARY 2013

171

Fig. 5. Proposed low latency systolic structure. (a) The systolic structure. (b)
Function of the AC.

PE[m + 1] of the systolic structure in Fig. 3 consists of only an XOR

cell, as shown in Fig. 4(d), which performs bit-by-bit XOR operations


of its pair of -bit inputs. The BSC in the PE performs the bit-shift operation according to (11). We have shown an example of the structure
of BSC (of PE[1] of Fig. 4) in Fig. 4(e) for = 4. Note that according
to (12), one can obtain i directly from 0 for 1   , i.e., every
PE of the structure of Fig. 3 can have the same input operand 0 , and
i
can be obtained from the BSC after 0 is fed as input. Therefore,
we can change the circuit-designs of Fig. 4(a) and (b) into the form of
Fig. 4(f) and (g), respectively. Besides, according to (11), the operation
of node ( ) does not involve any area and time-consumption. Therefore, the minimum duration of clock-period of a regular PE amounts
to maxf A X g = X . The proposed systolic design yields the first
output of desired product ( + 2) cycles after the first input is fed to
the structure, while the successive outputs are available in each cycle.

A
A

Ri
T ;T

i m

m
m

For irreducible AOP, is an even number. Therefore, let and be


+ , where is an integer in the
two integers such that ( + 1) =
2, then = 2,
range 0   . For example, if we choose =
= 1, (7) can be rewritten as

r l

C=

lP r

m=2

i=0

r
P m=

X:

i=m=2+1

(13)

m=

As shown in (13), one of the sum contains [( 2)+ 1] partial prod2 partial products. Based on (13), the sysucts while the other has
tolic structure of Fig. 4 could be modified to a form shown in Fig. 5,
which consists of two systolic branches. The upper branch consists of
[( 2) + 2] PEs and the lower branch consists of ( 2 + 1) PEs
and a delay cell. Besides, an addition-cell (AC) is required to perform
the final addition of the outputs of the two systolic arrays, as shown in
Fig. 5(b). The structure has the PEs of the same complexity as those in
Fig. 3, but the latency of structure is only [( 2) + 3] cycles.
It is observed that the two systolic branches in Fig. 5 share the same
input operand , and the PEs in both the branches perform the same operation except the last PE in each of the branches. Therefore, we present
an efficient structure using the register-sharing technique as shown in
Fig. 6, where the structure consists of [( 2) + 2] PEs and an AC.
The circuit of its regular PE (from PE[2] to PE[ 2 0 1]) is shown in
Fig. 6(c). It combines two regular PEs of Fig. 5(a) together by sharing
one input-operand-transfer. The other PEs need some minor modifications, as shown in Fig. 6(b), (d) and (e), respectively. The function
of AC is the same as that in Fig. 5. Thus, the whole structure requires
only (2 5 2 + 6 5 + 4) bit-registers, while the structure of Fig. 4
requires (3 2 +5 +2) bit-registers. Besides, the latency of structure
is [( 2) + 3] cycles, while the duration of cycle period of a regular
PE is still X .

m=

m=

m= 0

m=

Fig. 7. Improved low-latency systolic structure. (a) The proposed systolic array
merging. (b) Improved systolic structure.

We may further decompose the design in Fig. 6. For example, if we


4, then = 2, = 1, (7) can be rewritten as
choose =

P m=

C=

01

m=4

i=0

m=

:m :m
m m
m=
T

m=

01

m=2

i=m=4

3m=401

i=m=2

X:
i

i=3m=4

(14)

Following the same approach as the one used to derive the structure
of Fig. 5, we can have the design in Fig. 7(a), where it consists of four
systolic branches. Similarly, following the approach presented to derive
the structure of Fig. 6 from Fig. 5, we may have the design shown in
Fig. 7(b). The design of Fig. 7(b) requires only [( 4) + 4] cycles of
latency. When is a large number, and can be chosen as

l P
l = P = bm + 1c

m=

m=

m=

B. Shared-Register Low-Latency Systolic Structure

Fig. 6. Low-latency register-sharing systolic structure. (a) The systolic


structure. (b) Structure of PE[1]. (c) Structure of a regular PE (from PE[2] to
PE[ 2 1]). (d) Structure of PE[ 2]. (e) Structure of PE[ 2 + 1].

m=

(15)

to obtain an optimal realization.


IV. HARDWARE AND TIME COMPLEXITY

m=

The proposed structure (see Fig. 6) requires [( 2)+2] PEs and one
AC. Each of the regular PEs consists of 2( + 1) XOR gates in a pair
of XOR cells and 2( + 1) AND gates in a pair of AND cells. Besides,
the AC requires ( + 1) XOR gates. Moreover, (2 5 2 + 6 5 + 4)
bit-registers are required for transferring data to the nearby PE. The
latency of the design is [( 2) + 3] cycles, where the duration of the
clock-period is X . The structure of Fig. 7 requires nearly the same
gate-counts as that of Fig. 6. But its latency is [( 4) + 4] cycles. The
number of gates, latency and critical-path of the proposed designs (see
Figs. 6 and 7) and the existing designs of [11][16] are listed in Table I.
It can be seen that the proposed design outperforms the existing designs. Although slightly more registers than that in [11] are used, proposed design requires shorter latency and lower critical-path than the
other as well as the MUX gates. The digit-serial structures of [12] and

m
m

:m

m=

m=

:m

172

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 1, JANUARY 2013

TABLE I
AREA AND TIME COMPLEXITY

T
]T

: delay of a 3-input XOR gate.


: the delay of a T flip flop.
: For the digit-serial structure, L is the digit size.
T : the delay of a 2:1 MUX.
TABLE II
COMPARISON OF AREA AND TIME COMPLEXITY FOR m

= 20

ACT = (critical0path) 2
(cycles number required to obtain a result).
Digit-size L = 4.

a low-latency bit-parallel systolic multiplier. Compared with the existing systolic structures for bit-parallel realization of multiplication
m
(2 ), the proposed one is found to involve less area, shorter
over
critical-path and lower latency. From ASIC and FPGA synthesis results
we find that the proposed design involves significantly less ADP and
PDP than the existing designs. Moreover, our proposed design can be
extended to further reduce the latency.

GF

REFERENCES

TABLE III
FPGA SYNTHESIS RESULT OF PROPOSED AND EXISTING DESIGNS

[1] M. Ciet, J. J. Quisquater, and F. Sica, A secure family of composite


finite fields suitable for fast implementation of elliptic curve cryptography, in Proc. Int. Conf. Cryptol. India, 2001, pp. 108116.
mont[2] H. Fan and M. A. Hasan, Relationship between GF
gomery and shifted polynomial basis multiplication algorithms, IEEE
Trans. Computers, vol. 55, no. 9, pp. 12021206, Sep. 2006.
[3] C.-L. Wang and J-L. Lin, Systolic array implementation of multipliers
for finite fields GF
, IEEE Trans. Circuits Syst., vol. 38, no. 7,
pp. 796800, Jul. 1991.
[4] B. Sunar and C. K. Koc, Mastrovito multiplier for all trinomials,
IEEE Trans. Comput., vol. 48, no. 5, pp. 522527, May 1999.
[5] C. H. Kim, C.-P. Hong, and S. Kwon, A digit-serial multiplier for
, IEEE Trans. Very Large Scale Integr. (VLSI)
finite field GF
Syst., vol. 13, no. 4, pp. 476483, 2005.
[6] C. Paar, Low complexity parallel multipliers for Galois fields
based on special types of primitive polynomials, in Proc.
GF
IEEE Int. Symp. Inform. Theory, 1994, p. 98.
[7] H. Wu, Bit-parallel polynomial basis multiplier for new classes of finite fields, IEEE Trans. Computers, vol. 57, no. 8, pp. 10231031,
Aug. 2008.
[8] S. Fenn, M.G. Parker, M. Benaissa, and D. Taylor, Bit-serial multiplication in GF
using all-one polynomials, IEE Proc. Com. Digit.
Tech., vol. 144, no. 6, pp. 391393, 1997.
[9] K.-Y. Chang, D. Hong, and H.-S. Cho, Low complexity bit-parallel
multiplier for GF
defined by all-one polynomials using redundant representation, IEEE Trans. Computers, vol. 54, no. 12, pp.
16281629, Dec. 2005.
[10] H.-S. Kim and S.-W. Lee, LFSR multipliers over GF
defined
by all-one polynomial, Integr., VLSI J., vol. 40, no. 4, pp. 571578,
2007.
[11] P. K. Meher, Y. Ha, and C.-Y. Lee, An optimized design of serial-parbased on all-one polynomials,
allel finite field multiplier for GF
in Proc. ASP-DAC, 2009, pp. 210215.
[12] M. Sandoval, M. F. Uribe, and C. Kitsos, Bit-serial and digit-serial
montgomery multipliers using linear feedback shift regisGF
ters, IET Comput. Digit. Tech., vol. 5, no. 2, pp. 8694, 2011.
[13] C.-Y. Lee, E.-H. Lu, and J.-Y. Lee, Bit-parallel systolic multipliers for
GF
fields defined by all-one and equally spaced polynomials,
IEEE Trans. Computers, vol. 50, no. 6, pp. 385393, May 2001.
[14] Y.-R. Ting, E.-H. Lu, and Y.-C. Lu, Ringed bit-parallel systolic multipliers over a class of fields GF
, Integr., VLSI J., vol. 38, no. 4,
pp. 571578, 2005.
[15] C.-Y. Lee, J.-S. Horng, I.-C. Jou, and E.-H. Lu, Low-complexity
bit-parallel systolic montgomery multipliers for special classes of
GF
, IEEE Trans. Computers, vol. 54, no. 9, pp. 10611070,
Sep. 2005.

(2 )

(2 )

(2 )

Power consumption (PC).


Logic element (LE).
Digit-size L
.

=4

m=l

m=l

[16] yield one product word in


and (2
0 1) clock-periods, respectively, while the proposed structure produces one product word in
every clock-period. Besides, as shown in Fig. 7, the proposed design
can be extended further to obtain a more efficient design for high-speed
implementation, especially when is a large number.
The proposed design (see Fig. 7) has been coded in VHDL and synthesized by Synopsys Design Compiler using TSMC 90-nm library
= 20 along with the bit-parallel systolic design of [15] and
for
digit-serial systolic structure of [16]. The average computation time
(ACT), area and power consumption (at 100 MHz frequency) thus obtained are listed in Table II. The proposed design has at least 28.5%
less area-delay product (ADP) and 28.2% lower power-delay product
(PDP) compared to the existing ones.
Besides, we have synthesized the proposed design (see Fig. 7) and
the designs of [15] and [16] for = 20 and implemented on an Altera
FPGA: Cyclone-II EP2C15AF256A7 using Quartus II 9.0. From the
synthesis result, as shown in Table III, we find that the proposed design
has lower ADP and less PDP than the existing ones.

V. CONCLUSION

GF

m
Efficient systolic design for the multiplication over
(2 ) based
on irreducible AOP is proposed. By novel cut-set retiming we have
been able to reduce the critical path to one XOR gate delay and by
sharing of registers for the input-operands in the PEs, we have derived

((2 ) )

(2 )

(2 )

(2 )

(2 )

(2 )
(2 )

(2 )

(2 )

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 1, JANUARY 2013

[16] S. Talapatra, H. Rahaman, and J. Mathew, Low complexity digit serial


(2 ), IEEE
systolic montgomery multipliers for special class of
Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 5, pp. 847852,
May 2010.
[17] T. Itoh and S. Tsujii, Structure of parallel multipliers for a class of
fields
(2 ), Inform. Computation, vol. 83, no. 1, pp. 2140, 1989.
[18] C. Negre, Quadrinomial modular arithmetic using modified polynomial basis, in Proc. ITCC, 2005, pp. 550555.
[19] Z. Chen, M. Jing, J. Chen, and Y. Chang, New viewpoint of bit-serial/
parallel normal basis multipliers using irreducible all-one polynomial,
in Proc. ISCAS, 2006, pp. 14991502.
[20] C.-Y. Lee, C. W. Chiou, J. M. Lin, and C. C. Chang, Scalable and systolic montgomery multiplier over
(2 ) generated by trinomials,
IET Circuits Dev. Syst., vol. 1, no. 6, pp. 477484, 2007.
[21] C.-Y. Lee, Error-correcting codes for concurrent error correction in
bit-parallel systolic and scalable multipliers for shifted dual basis of
(2 ), in Proc. Intern. Sym. Para. Dis. Process. Appl., 2010, pp.
405412.
[22] C.-Y. Lee and C. W. Chiou, New bit-parallel systolic architectures
for computing multiplication, multiplicative inversion and division in
(2 ) under polynomial basis and normal basis representations, J.
Signal Process. Syst., vol. 52, no. 3, 2008.
[23] C.-W. Chiou, C. C. Chang, C. Y. Lee, T. W. Hou, and J. M. Lin,
Concurrent error detection and correction in Guassian normal basis
multiplier over
(2 ), IEEE Trans. Computers, vol. 58, no. 6, pp.
851857, 2009.
[24] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[25] P. K. Meher, Systolic and non-systolic scalable modular designs of
finite field multipliers for Reed-Solomon Codec, IEEE Trans. Very
Large Scale Integr. (VLSI) Syst., vol. 17, no. 6, pp. 747757, Jun. 2009.

GF

GF

GF

GF
GF

GF

Design and Implementation of an On-Chip Permutation


Network for Multiprocessor System-On-Chip
Phi-Hung Pham, Junyoung Song, Jongsun Park, and Chulwoo Kim

AbstractThis paper presents the silicon-proven design of a novel


on-chip network to support guaranteed traffic permutation in multiprocessor system-on-chip applications. The proposed network employs a
pipelined circuit-switching approach combined with a dynamic path-setup
scheme under a multistage network topology. The dynamic path-setup
scheme enables runtime path arrangement for arbitrary traffic permutations. The circuit-switching approach offers a guarantee of permuted data
and its compact overhead enables the benefit of stacking multiple networks. A 0.13- m CMOS test-chip validates the feasibility and efficiency
of the proposed design. Experimental results show that the proposed
on-chip network achieves 1.9 to 8.2 reduction of silicon overhead
compared to other design approaches.
Index TermsGuaranteed throughput, multistage interconnection network, network-on-chip, permutation network, pipelined circuit-switching,
traffic permutation.

I. INTRODUCTION
A trend of multiprocessor system-on-chip (MPSoC) design being
interconnected with on-chip networks is currently emerging for applications of parallel processing, scientific computing, and so on [1][6].
Manuscript received January 12, 2011; revised June 23, 2011 and October 20,
2011; accepted December 07, 2011. Date of publication January 17, 2012; date
of current version December 19, 2012. This work was supported by the National
Research Foundation of Korea (NRF) grant funded by the Korea Government
(MEST) (2011-0020128).
The authors are with the School of Electrical Engineering, Korea University,
Seoul 136-713, South Korea (e-mail: hungpp@korea.ac.kr; ckim@korea.ac.kr).
Digital Object Identifier 10.1109/TVLSI.2011.2181545

173

Permutation traffic, a traffic pattern in which each input sends traffic to


exactly one output and each output receives traffic from exactly one
input, is one of the important traffic classes exhibited from on-chip
multiprocessing applications [7], [8]. Standard permutations of traffic
occur in general-purpose MPSoCs, for example, polynomial, sorting,
and fast Fourier transform (FFT) computations cause shuffled permutation, whereas matrix transposes or corner-turn operations exhibit transpose permutation [6]. Recently, application-specific MPSoCs targeting
flexible Turbo/LDPC decoding have been developed, and they exhibit
arbitrary and concurrent traffic permutations due to multi-mode and
multi-standard feature [3][5]. In addition, many of the MPSoC applications (e.g., Turbo/LDPC decoding [3][5]) compute in real-time,
therefore, guaranteeing throughput (i.e., data lossless, predictable latency, guaranteed bandwidth, and in-order delivery) is critical for such
permutation traffics.
Most on-chip networks in practice are general-purpose and use
routing algorithms such as dimension-ordered routing and minimal
adaptive routing. To support permutation traffic patterns, on-chip
permutation networks using application-aware routings are needed
to achieve better performance compared to the general-purpose networks [8]. These application-aware routings are configured before
running the applications and can be implemented as source routing or
distributed routing. However, such application-aware routings cannot
efficiently handle the dynamic changes of a permutation pattern, which
is exhibited in many of the application phases [8]. The difficulty lies
in the design effort to compute the routing to support the permutation
changes in runtime, as well as to guarantee [9] the permutated traffics.
This becomes a great challenge when these permutation networks
need to be implemented under very limited on-chip power and area
overhead.
Reviewing on-chip permutation networks (supporting either full or
partial permutation) with regard to their implementation shows that
most the networks employ a packet-switching mechanism to deal with
the conflict of permuted data [3][6]. Their implementations either use
first-input first-output (FIFO) queues for the conflicting data [3], [5],
[6], or time-slot allocation in the overall system with the cost of more
routing stages [5], or a complex routing with a deflection technique that
avoids buffering of the conflicting data [4]. The choices of network design factors, i.e., topology, switching technique and the routing algorithm, have different impacts on the on-chip implementation.
Regarding the topology, regular direct topologies, such as mesh and
torus [2], [3], [6], are intuitively feasible for physical layout in a 2-D
chip. On the contrary, the high wiring irregularity and the large router
radix of indirect topologies such as Benes or Butterfly [4], [5] pose
a challenge for physical implementation [10]. However, an arbitrary
permutation pattern with its intensive load on individual source-destination pairs stresses the regular topologies and that may lead to
throughput degradation [7]. In fact, indirect multistage topologies are
preferred for on-chip traffic-permutation intensive applications [4],
[5].
Regarding the switching technique, packet switching requires an
excessive amount of on-chip power and area for the queuing buffers
(FIFOs) with pre-computed queuing depth at the switching nodes
and/or network interfaces [3][6].
Regarding the routing algorithm, the deflection routing [4] is not
energy-efficient due to the extra hops needed for deflected data transfer,
compared to a minimal routing [2], [3], [5]. Moreover, the deflection
makes packet latency less predictable; hence, it is hard to guarantee the
latency and the in-order delivery of data.
This paper presents a novel silicon-proven design of an on-chip permutation network to support guaranteed throughput of permutated traf-

1063-8210/$31.00 2012 IEEE

Das könnte Ihnen auch gefallen