
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.


A High Bit Rate Serial-Serial Multiplier With On-the-Fly Accumulation by Asynchronous Counters
Manas Ranjan Meher, Student Member, IEEE, Ching Chuen Jong, and Chip-Hong Chang, Senior Member, IEEE
Abstract: A novel approach to designing a serial-serial hybrid multiplier is proposed for applications with a high data sampling rate ($\geq 4$ GHz). The conventional way of partial product formation is revamped. Our proposed technique effectively forms the entire partial product matrix in just $n$ sampling cycles for an $n \times n$ multiplication instead of at least $2n$ cycles in the conventional serial-serial multipliers. It achieves a high bit sampling rate by replacing conventional full adders and 5:3 counters with asynchronous 1's counters so that the critical path is limited to only an AND gate and a D flip-flop (DFF). The use of 1's counters to column compress the partial products preliminarily reduces the height of the partial product matrix from $n$ to $\lfloor \log_2 n \rfloor + 1$, resulting in a significant complexity reduction of the resultant adder tree. The proposed hybrid column compressed multiplier consists of a serial-serial data accumulation unit and a parallel carry save adder (CSA) array that occupies approximately 35% and 58% less silicon area than the full CSA array multiplier with operands of wordlength $32 \times 32$ and $64 \times 64$, respectively. The post-layout simulation results based on a 90-nm seven metal single poly CMOS process technology show that our $64 \times 64$ multiplier dissipates 39% less average power at a sampling rate of 4 GHz, and has only 11% additional delay penalty to complete a multiplication compared to the conventional fully parallel CSA array multiplier.

Index Terms: Binary multiplication, on-chip serial-link bus architecture, on-the-fly accumulation, parallel multipliers, serial-serial and serial-parallel multiplier.

I. INTRODUCTION

Multipliers are the fundamental and essential building blocks of VLSI systems. The design and implementation approaches of multipliers contribute substantially to the area, speed and power consumption of computationally intensive VLSI systems. Often, the delay of multipliers dominates the critical path of these systems and, due to issues concerning reliability and portability, power consumption is a critical criterion for applications that demand low power as their primary metric. While low power and high speed multiplier circuits are highly demanded, it is not always possible to achieve both criteria simultaneously. Therefore, a good multiplier design

Manuscript received August 15, 2009; revised November 27, 2009, April 11, 2010, and July 05, 2010; accepted July 06, 2010. The work of M. R. Meher was supported by Nanyang Technological University under a Research Scholarship. M. R. Meher is with the Integrated Systems Research Laboratory, School of Electrical and Electronic Engineering, Nanyang Technological University, 639798 Singapore. C. C. Jong and C. H. Chang are with the School of Electrical and Electronic Engineering, Nanyang Technological University, 639798 Singapore. Color versions of one or more of the figures in this paper are available online. Digital Object Identifier 10.1109/TVLSI.2010.2060374

requires some tradeoff between speed and power consumption. As the device size shrinks, there are other side effects and it becomes increasingly difficult to achieve a good tradeoff by device scaling or sizing the transistors [1]. A more effective driver to optimize the area, delay and power consumption of arithmetic computations in battery powered VLSI circuits is to explore alternative architectural concepts for the design of digital multipliers.

Typically, hardware implementation of a multiplication operation consists of three stages, specifically the generation of partial products (PPs), the reduction of PPs and the final carry-propagation addition [2]. The partial products can be generated either in parallel or serially, depending on the target application and the availability of input data. The partial products are generally reduced by carry-save adders (CSAs) using an array or a tree structure. Carry propagation addition is inevitable when the number of partial products is reduced to two rows. This final adder can be a simple ripple carry adder (RCA) for low power or a carry look-ahead adder (CLA) for high speed [2]. As the height of the PP tree increases linearly with the wordlength of the multiplier, it aggravates the area, delay and power dissipation of the two subsequent stages. Therefore, it is highly desirable to reduce the number of partial products before the CSA stage. This can be achieved by the Modified Booth algorithm, which reduces the height of the PP matrix [3]. Another approach is to use high order column compressors instead of full adders (FAs) to increase the PP reduction ratio of the CSA stage [4], [5]. The drawback is that the Modified Booth encoder adds both area and delay overheads to the simple partial product generation process, and higher order compressors are slower and consume more power than the full adders. Hence a hybrid combination of both techniques is often considered.
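The three stages described above can be illustrated with a minimal behavioural sketch in Python. It is not taken from the paper: the function names (`partial_products`, `carry_save_add`, `reduce_to_two_rows`, `multiply`) are hypothetical, integers stand in for rows of bits, and the reduction simply groups rows three at a time rather than following any particular Wallace or Dadda schedule.

```python
def partial_products(x, y, n):
    """Stage 1: one shifted copy of x for every set bit of y (AND-gate array)."""
    return [(x << j) if (y >> j) & 1 else 0 for j in range(n)]


def carry_save_add(a, b, c):
    """3:2 compression of three rows into a sum row and a carry row.

    Each bit position behaves like an independent full adder, so no carry
    propagates along the row.
    """
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def reduce_to_two_rows(rows):
    """Stage 2: apply carry-save levels until at most two rows remain."""
    rows = list(rows)
    while len(rows) > 2:
        next_rows = []
        for k in range(0, len(rows) - 2, 3):
            next_rows.extend(carry_save_add(rows[k], rows[k + 1], rows[k + 2]))
        # Rows that did not fit into a group of three pass through unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
    return rows


def multiply(x, y, n):
    """Stage 3: a single carry-propagate addition of the remaining rows."""
    return sum(reduce_to_two_rows(partial_products(x, y, n)))
```

For example, `multiply(13, 11, 4)` evaluates to 143. The sketch makes the cost argument visible: the number of carry-save levels depends on how many rows enter stage 2, which is why reducing the PP height before the CSA stage is attractive.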
To reduce the wiring cost, it has been a common practice to transmit data over a communication channel through a high speed serial link [6]. For some integrated circuits constrained by I/O pins, the designers often try to reduce the number of I/O pads by serializing I/O data because I/O pads occupy large silicon area and consume high power [7]. Therefore, effort has been made to design high speed serial interfaces [6], [8]–[12] in order to facilitate on-chip buffering and parallel processing. Parallel multipliers are popular for their high speed operation but long wordlength multiplications are often constrained by the hardware cost and power consumption of the applications. In public key cryptography like RSA encryption and decryption, integer multiplications of 1024 bits are typical [13]. In Elliptic Curve Cryptography (ECC), key lengths of 112 bits and 109 bits are commonly used for prime field and binary field multiplication, respectively. Therefore, low-cost serial multipliers are widely adopted in hardware cryptography [14], [15].

1063-8210/$26.00 © 2010 IEEE

Fig. 1. On-chip communication among parallel functional modules in SoC. (a) Conventional bus structure. (b) Serial-link bus structure [16].

Fig. 2. Suitability of serial-serial multiplier for upcoming on-chip serial-link bus architectures in complex SoC.

Serial multipliers also find applications in system-on-chip (SoC) design. As technology scales, more intellectual property cores and logic blocks will be integrated in an SoC, resulting in larger interconnect area and higher power dissipation [17]. The increase in integration density of the on-chip modules causes the buses connecting these modules to become highly congested. To overcome this problem, new techniques have evolved recently to transfer on-chip data over a high speed serial link instead of a conventional bus [16]–[18]. Fig. 1(a) and (b) depict the conventional on-chip bus and alternative on-chip serial-link bus structures, respectively. In Fig. 1(b), the serializer at the source module converts the parallel outputs to a bit stream that can be transferred in a simple routing network, and at the destination module the bits are converted back to parallel data by the deserializer. The on-chip serial link is capable of transmitting data at Gb/s so that a chunk of parallel data is available when the destination module finishes the previous computation. Under the new on-chip communication paradigm for digital signal processing, it is desirable to have a low complexity data processing unit as the destination module that is able to perform partial computation on the incoming data stream at high speed while the data is being buffered. Fig. 2 illustrates a potential use of a serial-serial multiplier as a destination module in an SoC with serial-link bus architecture. The low complexity precomputation unit forms part of the serial-serial multiplier and could perform partial computation on the high speed serial bit stream.

The unit doubles as a buffer and eliminates the deserializer. As the data has been partially processed and buffered, the completion of the multiplication can be done at a lower speed with a less complex parallel multiplier. The challenge in such a scheme lies in reducing the critical path delay of the precomputation unit to that of the deserializer, which usually has a bit rate in the order of several Gb/s. We introduce this new scheme for the design of a serial-serial multiplier suitable for SoCs with on-chip serial-link bus architecture [16]. The proposed scheme could also be used as an alternative to embedded multipliers in future field-programmable gate arrays (FPGAs), where configurable logic blocks (CLBs), embedded multipliers and memory blocks are integrated with serializers/deserializers to facilitate on-chip serial data transfer in order to reduce interconnect complexity [19].

The rest of this paper is organized as follows. Section II revisits some of the existing serial multiplier architectures in the literature. In Section III, a serial accumulator developed based on the new design paradigm is proposed to deal with very high-speed data sampling rates of above 4 GHz. The accumulator employs asynchronous counters¹ to perform bit accumulation at each bit position of the PP matrix, resulting in low critical path delay and small area, especially for operands with long wordlength. An asynchronous counter has a low hardware complexity but its outputs are not synchronized with the clock, which leads to a timing delay before all output bits of the counter have settled to their final states. The correct output of the counter is read after this timing delay, which is analyzed using the timing diagram in Section VI-B. The data dependent counters change states only when the input bit is 1, which leads to low switching power dissipation. The height of the PP matrix after buffering by the asynchronous counters is reduced logarithmically to $\lfloor \log_2 n \rfloor + 1$ before it is further reduced by the CSA tree.
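The behaviour of such an asynchronous (ripple) 1's counter can be sketched in Python as follows. This is a purely behavioural model under stated assumptions, not the authors' circuit: the class name `RippleOnesCounter` is hypothetical, each stage is assumed to toggle when the preceding stage falls from 1 to 0, and the number of stages toggled per event is used only as a rough stand-in for the settling delay mentioned above.

```python
class RippleOnesCounter:
    """Behavioural model of a ripple counter that counts incoming 1s."""

    def __init__(self, width):
        self.q = [0] * width  # q[0] models the first (LSB) flip-flop

    def clock(self, data_bit):
        """Apply one clock edge gated by data_bit.

        Returns how many flip-flops toggled, i.e. how far the change rippled.
        A 0 input leaves every flip-flop untouched (no switching activity).
        """
        if not data_bit:
            return 0
        toggled = 0
        for k in range(len(self.q)):
            self.q[k] ^= 1
            toggled += 1
            if self.q[k] == 1:
                # A 0 -> 1 transition does not clock the next stage.
                break
        return toggled

    def value(self):
        """Counter contents once all stages have settled."""
        return sum(bit << k for k, bit in enumerate(self.q))


def count_ones(bits, width):
    """Feed a bit stream to a fresh counter; return (count, worst ripple depth)."""
    counter = RippleOnesCounter(width)
    worst = 0
    for b in bits:
        worst = max(worst, counter.clock(b))
    return counter.value(), worst
```

With the stream `[1, 0, 1, 1, 0, 1, 1]` and a 3-bit counter, `count_ones` reports a count of 5 and a worst-case ripple through all three stages (the transition from 3 to 4), which illustrates why the outputs must be read only after the last stage has had time to settle.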
The details of the newly proposed serial-serial multiplier are described in Section IV. The method is extended to signed multiplication of 2's complement numbers in Section V. Section VI presents the application-specific integrated circuit (ASIC) implementation results and their comparisons with other designs on various performance measures. The energy efficiency of the proposed serial-serial multiplier design is demonstrated also by the post-layout simulation results of
¹An asynchronous counter is also called a ripple counter, where the clock input of each flip-flop of the counter is driven by the output of its preceding flip-flop, except for the first flip-flop, which is driven by the clock signal.


Fig. 3. Bit serial multiplier and its basic bit-cell (BC) [23].

Fig. 4. CSAS serial-parallel multipliers [34]. (a) Unsigned CSAS. (b) 2's complement CSAS.

the designs implemented with the STM 90-nm CMOS process. Section VII concludes this paper.

II. REVIEW OF SERIAL MULTIPLIERS

Serial multipliers are popular for their low area and power [20]–[35], and are more suitable for bit-serial signal processing applications with I/O constraints and on-chip serial-link bus architectures. They are broadly classified into two categories, namely serial-serial and serial-parallel multipliers. In a serial-serial multiplier both the operands are loaded in a bit-serial fashion, reducing the data input pads to two [20]–[32]. On the other hand, a serial-parallel multiplier loads one operand in a bit-serial fashion and the other is always available for parallel operation [27], [33]–[35].

Lyon [21] proposed a bit-serial input output multiplier in 1976 which features high throughput at the expense of truncated output. A full precision bit-serial multiplier was introduced by Strader et al. for unsigned numbers [22]. The rudimentary cell consists of a 5:3 counter and some DFFs. Later, Gnanasekaran [28] extended the work in [22] and developed the first bit-serial multiplier that directly handles the negative weight of the most significant bit (MSB) in $n$-bit 2's complement representation. This method needs only $n$ cells for an $n \times n$ multiplication but it introduces an XOR gate in the critical path, which ends up with a more complicated overall design. Lenne et al. [23] designed a bit-serial-serial multiplier that is modular in structure and can operate on both signed and unsigned numbers. The 1-bit slice of a typical serial-serial multiplier, called a bit-cell (BC), is excerpted from [23] and shown in Fig. 3. A chain of such cells is interconnected to produce the output in a bit-serial manner for an $n \times n$ serial-serial multiplier. The operand bits of $X$ and $Y$ are loaded serially in each cycle and added with the far carry, the local carry, and the partial sum in the 5:3 counter. The cascaded cells form a shift register chain, with the output of one cell connected to the input of the next cell. The addition of the symmetric partial product bits (i.e., $x_i y_j$ and $x_j y_i$) is enabled by pulling the first bit (FB) signal low after the first cycle. The partial sum is shifted out serially through another shift register chain formed by the cascaded cell-to-cell connections. The registers are cleared by activating the last bit (LB) signal before the start of a new multiplication. Aggoun et al. [30] proposed a new architecture for serial-serial multiplication with 50% reduction in hardware without degrading the speed.

It is observed that all the reported multipliers have a common computational unit known as the 5:3 counter in the critical path. In addition, there are also DFFs and AND gates in the critical path, which lower the operation speed and limit the input bit rate. Many attempts have been made to reduce either the hardware cost or latency [22], [23] but there is no improvement on the critical path. To overcome this problem, Pekmestzi et al. [26] designed a systolic serial multiplier with a critical path comprising an FA, a 2:1 MUX, a DFF and an AND gate. Nibouche et al. [31] proposed two serial-serial architectures, Structure I and Structure II, which can handle a sampling frequency close to that of the serial-parallel multiplier. The critical path of Structure I consists of an FA, a DFF, and an AND gate, but this comes at the cost of a longer latency and a larger number of cycles to complete one $n \times n$ multiplication.

To reduce the number of computational cycles from $2n$ to $n$ in an $n \times n$ serial multiplier, several serial-parallel multipliers have been developed over the years [27], [33]–[35]. Most of them are based on a carry save add shift (CSAS) structure. Fig. 4 shows the unsigned and 2's complement serial-parallel multipliers based on the CSAS structure. It can be observed that the critical path consists of an FA, a DFF, and an AND gate for the unsigned multiplier in Fig. 4(a) and an extra XOR gate for the 2's complement multiplier in Fig. 4(b). Gnanasekaran [34] proposed a fast CSAS multiplier capable of producing a $2n$-bit output in $n$ clock cycles at the expense of an extra RCA. A low complexity 2's complement serial-parallel multiplier was proposed by Sunder et al. [35]. It used the Baugh-Wooley algorithm to avoid the sign extension problem [2]. Saleh et al. [27] designed many serial-parallel systolic and non-systolic multipliers with low complexity based on Booth's multi-bit recoding. Although the sampling frequency has been improved in the serial-parallel multipliers and the total number of computational cycles is halved, one of the operands needs to be loaded in parallel. Recently, there has not been any


new development in serial-serial multiplier design due to the maturity of conventional architectures. Parallel multipliers are more popular as the size is less critical due to technology scaling. However, with the emerging development of the on-chip serial-link bus architectures [16], [17], [19], serial-serial multipliers could find their potential roles in the new generation of SoCs and FPGAs. In the following sections, an approach to the design of a serial multiplier that is capable of processing input data at Gb/s without input buffering and with a reduced total number of computational cycles is proposed.

III. PROPOSED SERIAL ACCUMULATOR

Accumulation is an integral part of serial multiplier design. A typical accumulator is simply an adder that successively adds the current input with the value stored in its internal register. Generally, the adder can be a simple RCA but the speed of accumulation is limited by the carry propagation chain. The accumulation can be sped up by using a CSA with two registers to store the intermediate sum and carry vectors, but a more complex fast vector merged adder is needed to add the final outputs of these registers. In either case, the basic functional unit is an FA cell. A new approach to serial accumulation of data by using asynchronous counters is suggested here, which essentially counts the number of 1s in the respective input sequences (columns). A rudimentary version of our proposed serial accumulator was first introduced in [36].

An accumulation of $n$ integers $X_i$, for $i = 0, 1, \ldots, n-1$, can be mathematically expressed as

$$S = \sum_{i=0}^{n-1} X_i \qquad (1)$$

For ease of exposition, let $X_i$ be an unsigned integer represented by $m$ binary weighted bits. Hence, (1) can be rewritten as

$$S = \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} x_{i,j}\, 2^{j} \qquad (2)$$

where $x_{i,j}$ is the $j$th bit of the $i$th operand and is associated with a positional weight $2^{j}$. By changing the order of accumulation of $i$ and $j$, (2) becomes

$$S = \sum_{j=0}^{m-1} \left( \sum_{i=0}^{n-1} x_{i,j} \right) 2^{j} \qquad (3)$$

The inner summation of (3), $\sum_{i=0}^{n-1} x_{i,j}$, represents the sum of all the 1s present in the $j$th column and it can easily be accomplished by a simple serial 1's counter of width $\lfloor \log_2 n \rfloor + 1$.
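The reordering of the summation can be checked numerically with a short Python sketch. It is an illustration only: the function name `accumulate_by_columns` is hypothetical, plain integers model the per-column 1's counters, and the seven 8-bit operands are the ones used in the worked example of this section.

```python
def accumulate_by_columns(operands, m):
    """Accumulate m-bit unsigned operands with one 1's counter per bit column.

    Returns (counts, total), where counts[j] is the number of 1s seen in
    column j (bit 0 is the LSB) and total is the weighted sum of the counts.
    """
    n = len(operands)
    width = n.bit_length()          # equals floor(log2(n)) + 1 for n >= 1
    counts = [0] * m
    for operand in operands:        # one operand is loaded per cycle
        for j in range(m):          # all columns update concurrently in hardware
            counts[j] += (operand >> j) & 1
    # Every count must fit in a counter of the stated width.
    assert all(c < (1 << width) for c in counts)
    total = sum(c << j for j, c in enumerate(counts))
    return counts, total


OPERANDS = [0b10110001, 0b00000001, 0b11001101, 0b11100011,
            0b01110001, 0b11001101, 0b10001101]

COUNTS, TOTAL = accumulate_by_columns(OPERANDS, 8)
```

Here `COUNTS` is `[7, 1, 3, 3, 2, 3, 4, 5]` (LSB column first) and `TOTAL` is 1069, the same value as adding the seven operands directly. Each count fits in 3 bits, matching the counter width of $\lfloor \log_2 7 \rfloor + 1 = 3$.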
Thus, an $m$-bit accumulator can be realized with $m$ such dedicated counters to concurrently accumulate the 1s in each bit position. The dependency graph (DG) of such a scheme with $m = 8$ and $n = 7$ is shown in Fig. 5. The nodes in the DG perform $c_j \leftarrow c_j + x_{i,j}$, where $c_j$ is the count held for column $j$ and $x_{i,j}$ is the $j$th bit of the $i$th operand, which is equivalent to an increment of $c_j$ if $x_{i,j} = 1$. An example illustrating the serial accumulation of the inputs, 10110001, 00000001, 11001101, 11100011, 01110001, 11001101, and 10001101 in the DG is shown

Fig. 5. Dependency graph of the proposed accumulator for seven 8-bit operands.

in Fig. 6. The accumulated output after each cycle is indicated by the integer in each node and the final accumulated output in each column can be represented by a 3-bit binary number. At the end of the $n$th iteration, each counter produces a $(\lfloor \log_2 n \rfloor + 1)$-bit binary weighted output. By arranging all counter output bits of the same positional weight in the same column, with the least significant bit (LSB) of the $j$th counter output aligned at the $j$th column, the result of the accumulation can be obtained by column compressing the counter outputs with a CSA tree structure of height $\lfloor \log_2 n \rfloor + 1$. Note that the height of the CSA tree would be $n$, which is exponentially larger, should the summation be carried out without the 1's counters.

The architecture of the accumulator corresponding to Fig. 5 is shown in Fig. 7. The bits of the input operands are serially fed into their corresponding counters from column 0 (right-most in Fig. 5) to column 7. These counters execute independently and concurrently. In each cycle of accumulation, a new operand is loaded and the counters corresponding to the columns that have a 1 input are incremented. The counters can be clocked at high frequency and all the operands will be accumulated at the end of the $n$th clock. The final outputs of the counters need to be further reduced to only two rows of partial products by a CSA tree, such as a Wallace or Dadda tree [2]. A carry propagate adder is then used to obtain the final sum.

In Fig. 7, the counters $C_j$ are used to count the number of 1s in column $j$. Each of them is a simple DFF-based ripple counter. The clock is provided to the first DFF and all the other DFFs are triggered by the preceding DFF outputs. A typical 3-bit 1's counter is shown in Fig. 8. The clock input is synchronized with the input data rate and thus the operands can be accumulated with a high frequency defined by the setup time and propagation delay of a DFF. Moreover, the counters change states only when the input is 1, which leads to low switching power. This simple

and efficient bit accumulation technique is used to design the proposed serial-serial multiplier.

Fig. 6. Example for Fig. 5.

Fig. 7. Architecture of an accumulator with $m = 8$ and $n = 7$.

Fig. 8. Hardware architecture of a 3-bit 1's counter.

IV. PROPOSED SERIAL-SERIAL MULTIPLIER

This section proposes a new technique of generating the individual rows of partial products by considering two serial inputs, one starting from the LSB and the other from the MSB. Using this feeding sequence and the proposed counter-based accumulation technique presented in Section III, it takes only $n$ cycles to complete the entire partial product generation and accumulation process for an $n \times n$ multiplication. The theoretical underpinning of this design is elaborated as follows.

The product of two $n$-bit unsigned binary numbers $X$ and $Y$ can be expressed as

$$P = X \cdot Y = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} x_i\, y_j\, 2^{i+j} \qquad (4)$$

where $x_i$ and $y_j$ are the $i$th and $j$th bits of $X$ and $Y$, respectively, with bit 0 being the LSB. By reversing the sequence of index $i$ of (4)

$$P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} x_{n-1-i}\, y_j\, 2^{n-1-i+j} \qquad (5)$$

By decomposing and rearranging (5)

$$P = \sum_{r=0}^{n-1} PP_r \qquad (6)$$

where

$$PP_r = x_{n-1-r}\, y_r\, 2^{n-1} + \sum_{k=0}^{r-1} x_{n-1-r}\, y_k\, 2^{n-1-r+k} + \sum_{k=0}^{r-1} x_{n-1-k}\, y_r\, 2^{n-1+r-k}.$$

In (6) the partial product row, $PP_r$, can be generated in cycle $r$. If $X$ is fed MSB (bit $n-1$) first and $Y$ is fed LSB (bit 0) first, then $x_{n-1-r}\, y_r$ is a partial product bit generated by the current input bits, $x_{n-1-r}\, y_k$, for $k = 0, 1, \ldots, r-1$, are the partial product bits of the current input bit $x_{n-1-r}$ and each of the preceding input bits of $Y$, and $x_{n-1-k}\, y_r$, for $k = 0, 1, \ldots, r-1$, are the partial product bits of the current input bit $y_r$ and each of the preceding input bits of $X$. By appropriately sequencing the input bits of $X$ and $Y$ into a shift register, one PP row can be generated in each cycle. Consequently, $P$ can be obtained in $n$ cycles.

Fig. 9 illustrates the PP generation of an $8 \times 8$ multiplier for two unsigned numbers $X$ and $Y$. Fig. 9(a) shows the conventional partial product formation and Fig. 9(b) shows the generation sequence of the PPs according to (6), with $PP_r$ generated in cycle $r$, for $r = 0, 1, \ldots, 7$. The first term of $PP_r$ is represented by the center column, the second term by the columns to the right of the center column and the third term by the columns to the left. The PPs in Fig. 9(b) are generated in such an unconventional way in order to facilitate their accumulation on-the-fly by the proposed counter-based accumulation technique. A bank of counters is deployed, one in each column, to accumulate the bits arriving in the respective column.

The DG shown in Fig. 10 illustrates the complete operation of the PP generation and accumulation for a general $n \times n$ multiplication. Each node (circle) in the DG represents a binary counter and an ancillary AND gate to generate a partial product bit. All nodes are identical in functionality and have three inputs and three outputs, except that the data are propagated in different directions as depicted in Fig. 10. It can be seen from Fig. 10 that the nodes (L) on the left of the middle column have the identical properties of shifting the bits of $X$ to the immediate left node in each cycle, computing the PP bit and updating the counter output. Similarly, the nodes


Fig. 9. Partial product generation schemes for an $8 \times 8$ serial-serial multiplication: (a) conventional and (b) proposed.

Fig. 10. Dependency graph of $n \times n$ serial-serial partial product generation and accumulation.

Fig. 11. Proposed architecture for $8 \times 8$ serial-serial unsigned multiplication.

(R) on the right of the middle column exhibit identical properties but shift the bits of $Y$ instead. The nodes (C) on the middle column receive a new pair of input bits, $x_{n-1-r}$ and $y_r$ (the bits of $X$ and $Y$ entering in cycle $r$), in each clock cycle, process the pair to form the PP bit and propagate $x_{n-1-r}$ and $y_r$ to the left and right nodes, respectively. At the end of the $n$th cycle, the registered output of the nodes contains the sum of all 1s in the respective columns.

The complete architecture of the proposed counter-based serial-serial multiplier ($8 \times 8$) is shown in Fig. 11. The functionality of the nodes in Fig. 10 is mapped to the structure of Fig. 11 with two shift registers to perform the left and right shift, $2n-1$ counters of different bit widths sufficient to accumulate the number of 1s in their respective columns and an array of $2n-1$ AND gates to generate the PP bits serially. It can be seen from Fig. 9 that for an $8 \times 8$ multiplication, the middle column has a maximum height of 8 and a 4-bit counter is sufficient to account for the maximum column sum. The column height decreases gradually to either side and the counter width also decreases correspondingly.

The operation of the multiplier can be explained with the aid of the structure in Fig. 11. A PP bit corresponding to the middle column of the PP is produced by the center AND gate when a new pair of input bits ($x_{n-1-r}$ and $y_r$) is latched by the two DFFs (top middle) in each clock cycle. In the next cycle, $x_{n-1-r}$ and $y_r$ are shifted to the left and right, respectively, to produce the partial product bits with another pair of input bits, $x_{n-2-r}$ and $y_{r+1}$, by the array of AND gates. The AND gates are gated by the clock to ensure that the outputs driving the counters are free of glitches. Each counter changes state at the rising edge of the clock line only if a 1 is produced by its driving AND gate. After $n$ cycles, the counters hold the sums of all the 1s in the respective columns and their outputs are latched to the second stage for summation. The latched outputs are wired to the correct FAs and HAs (half adders) according to the positional weights of the output bits produced by the counters. From Fig. 11, it is observed that the column height has been reduced from 8 to 4 and the final product, $P$, can be obtained with two stages of CSA tree and a final RCA. Similarly, for $16 \times 16$, $32 \times 32$ and $64 \times 64$ multipliers the column heights are reduced logarithmically from 16, 32, and 64 to 5, 6, and 7, respectively. This drastic reduction in column height leads to a much simpler CSA tree, hence reducing the overall hardware complexity and power consumption. The latching register between the counter and the adder stages not only makes it possible to pipeline the serial data accumulation and the CSA tree reduction, but also prevents spurious transitions from propagating into the adder tree.
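The operation described above can be mimicked with a small cycle-level Python model. It is a sketch under stated assumptions rather than the authors' design: the names `serial_serial_multiply` and `counter_width` are hypothetical, $X$ is assumed to enter MSB first and $Y$ LSB first, Python lists model the two shift registers, and plain integers model the column counters, so clock gating, latching and the CSA tree are not represented.

```python
def serial_serial_multiply(x, y, n):
    """Return (product, counters) for two unsigned n-bit operands.

    counters[w] is the number of partial product bits equal to 1 that were
    generated for the column of weight 2**w; the centre column is w = n - 1.
    """
    counters = [0] * (2 * n - 1)
    x_hist, y_hist = [], []               # shift registers, newest bit at index 0
    for r in range(n):                    # one sampling cycle per input bit pair
        xb = (x >> (n - 1 - r)) & 1       # X enters MSB first (assumption)
        yb = (y >> r) & 1                 # Y enters LSB first (assumption)
        x_hist.insert(0, xb)
        y_hist.insert(0, yb)
        counters[n - 1] += xb & yb        # centre AND gate
        for d in range(1, r + 1):
            # Older X bits, shifted d places left, meet the current Y bit.
            counters[n - 1 + d] += x_hist[d] & yb
            # Older Y bits, shifted d places right, meet the current X bit.
            counters[n - 1 - d] += xb & y_hist[d]
    product = sum(c << w for w, c in enumerate(counters))
    return product, counters


def counter_width(n, k):
    """Bits needed by the counter k columns away from the centre (height n - k)."""
    return (n - k).bit_length()
```

After $n$ cycles the weighted sum of the counters equals $X \cdot Y$; in hardware this final weighted sum is what the CSA tree and the RCA compute. For an $8 \times 8$ multiplication of 255 by 255 the counters hold 1, 2, ..., 8, ..., 2, 1, so the centre counter needs `counter_width(8, 0)`, that is 4 bits, and for $n$ = 16, 32 and 64 the centre widths are 5, 6 and 7, in line with the column heights quoted above.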
Two clocks, namely $clk_1$ and $clk_2$, are employed to synchronize the data flow between the two stages. The counter stage is driven by $clk_1$ to process the inputs at high speed as the critical path is defined by the delay of only an AND gate and a


Fig. 12. Snapshot of the output sequence of the proposed multiplier (without actual timing).

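DFF. The counters and shift registers are reset to 0 in every $n$ cycles to allow a new set of operands to be loaded. $clk_2$ is derived from $clk_1$ to drive the latching register. The output sequence of a $16 \times 16$ multiplication is depicted in Fig. 12 without the complication of the actual timing. The detailed performance and timing analysis will be discussed in Section VI. The $2n$-bit product of an $n \times n$ multiplication is produced in parallel by the final RCA and can be interfaced with other logic circuitry either in parallel or in serial depending on the requirement.

V. EXTENSION FOR 2'S COMPLEMENT NUMBERS

Most digital systems operate on signed numbers commonly represented in 2's complement. In this section, the unsigned serial-serial multiplication architecture of Fig. 11 is extended to deal with signed multiplication. Using the Baugh-Wooley algorithm [2], the multiplication of two signed numbers $X$ and $Y$ represented in 2's complement form is given by

$$P_s = x_{n-1} y_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} x_i y_j 2^{i+j} + 2^{n-1} \sum_{j=0}^{n-2} \overline{x_{n-1} y_j}\, 2^{j} + 2^{n-1} \sum_{i=0}^{n-2} \overline{x_i y_{n-1}}\, 2^{i} - 2^{2n-1} + 2^{n} \qquad (7)$$

The multiplication of two unsigned numbers by the multiplier of Fig. 11 can be expressed as

$$P_u = x_{n-1} y_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} x_i y_j 2^{i+j} + 2^{n-1} \sum_{j=0}^{n-2} x_{n-1} y_j\, 2^{j} + 2^{n-1} \sum_{i=0}^{n-2} x_i y_{n-1}\, 2^{i} \qquad (8)$$

The difference between (7) and (8) is given by

$$D = P_s - P_u = 2^{n-1} \sum_{j=0}^{n-2} \left( \overline{x_{n-1} y_j} - x_{n-1} y_j \right) 2^{j} + 2^{n-1} \sum_{i=0}^{n-2} \left( \overline{x_i y_{n-1}} - x_i y_{n-1} \right) 2^{i} - 2^{2n-1} + 2^{n} \qquad (9)$$

As $\overline{x_{n-1} y_j} = 1 - x_{n-1} y_j$ and $\overline{x_i y_{n-1}} = 1 - x_i y_{n-1}$, (9) can be simplified to

$$D = 2^{n-1} \sum_{j=0}^{n-2} \left( 1 - 2 x_{n-1} y_j \right) 2^{j} + 2^{n-1} \sum_{i=0}^{n-2} \left( 1 - 2 x_i y_{n-1} \right) 2^{i} - 2^{2n-1} + 2^{n} \qquad (10)$$

To extend the unsigned multiplier architecture of Fig. 11 for signed multiplication without introducing a high overhead, the difference expressed in (10) must be simplified. Since $\sum_{j=0}^{n-2} 2^{j} = 2^{n-1} - 1$, the following summation terms embedded in (10) can be simplified by the closed form expression for the sum of a geometric progression, i.e.,

$$2^{n-1} \sum_{j=0}^{n-2} 2^{j} = 2^{2n-2} - 2^{n-1} \qquad (11)$$

By substituting (11) into (10), the constant terms cancel out and the difference, $D$, is reduced to

$$D = -2^{n} \left( \sum_{j=0}^{n-2} x_{n-1} y_j\, 2^{j} + \sum_{i=0}^{n-2} x_i y_{n-1}\, 2^{i} \right) \qquad (12)$$

The difference of (12) is added to $P_u$ so that the structure of Fig. 11 can be modified to obtain $P_s$, the product of the 2's complement multiplication

$$P_s = P_u - 2^{n} \left( \sum_{j=0}^{n-2} x_{n-1} y_j\, 2^{j} + \sum_{i=0}^{n-2} x_i y_{n-1}\, 2^{i} \right) \qquad (13)$$

In the proposed PP generation method, $y_{n-1}$ arrives only in the last cycle. The generation of $\sum_{i=0}^{n-2} x_i y_{n-1} 2^{i}$ in (13) has to be delayed until $y_{n-1}$ has arrived. The remaining terms can be computed during the initial cycles. Hence, the difference can be corrected in the CSA tree. It is trivial from (13) that a shift register, a NAND gate and several FAs are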

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.


Consisting of 2 FA and 1 HA

Fig. 13. Proposed architecture for 8 cation.

2 8 serial-serial 2s complement multipli-

required for adding this term. The architecture of the proposed 2's complement serial-serial multiplier is depicted in Fig. 13. Note that the input can be fed at the same speed as in the unsigned multiplier. A control input is required to latch in the first clock cycle so that the term is generated serially over the subsequent cycles. The bits of the term to be added in the CSA tree are marked in Fig. 13. The addition of the term raises the height of the CSA tree by only two bits regardless of the wordlength of the operands.

VI. PERFORMANCE COMPARISON AND IMPLEMENTATION RESULTS

In this section, the performances of the proposed unsigned and 2's complement multiplier architectures are evaluated in terms of area, speed, power, and other implementation factors. As the circuits being compared contain different types of logic gates and logic modules, the simplistic unit gate model, which assumes that all cells used are primitive two-input logic gates, is inadequate. It is better to preserve the basic modules as described in the architectures as long as they correspond to the commonly available cells in a typical standard cell library. The area and delay of the proposed multiplier are derived and expressed in terms of the area and delay of the basic logic modules that can be found in a typical standard cell library for different operator sizes. These theoretical estimates are then calibrated by the basic logic modules from the STM 90-nm CMOS standard cell library and used to benchmark the proposed multiplier design against other serial-serial multipliers in Section VI-A. The proposed multiplier and a parallel CSA array multiplier are also physically synthesized and laid out using the same cell library. The results are presented and discussed in Section VI-B.

A. Comparison With Serial-Serial Multipliers

In our proposed technique, the PP bits are formed and accumulated by the 1's counters in just $n$ cycles to reduce the height of the CSA tree logarithmically from $n$ to $\lfloor \log_2 n \rfloor + 1$. The serial operands are input at a high frequency, as the critical path of the input stage consists of only an AND gate and a DFF. The critical path for the 2's complement multiplier remains the same, which is contrary to most CSAS 2's complement multipliers, where an extra XOR gate delay is required in the critical path.

For the proposed structures, the clock period of the first stage is determined by the delay of a three-input AND gate and a DFF, as given by (14). The latency required to accumulate all the partial product bits of an $n \times n$ multiplication is $n$ cycles of this clock, as given by (15). The counter outputs are then latched to the adder stage. For proper pipelining, the clock of the second stage must be $n$ times slower than that of the first stage. The critical path delay of the second stage is constrained by the delay of the CSA tree and the final CPA, as given by (16) in terms of the delays of an FA and of the final CPA. Since (see Table I), .

The area complexity of the proposed unsigned and signed multipliers is derived from their basic logic modules. The first stage consists of DFFs and AND gates only. The height of the partial product column is symmetrical around the center column and decrements from $n$ to 1 towards both sides. Thus, the width of the counter at $i$ columns away from the center column is $\lfloor \log_2 (n - i) \rfloor + 1$, and the total number of DFFs required by the counters in an $n \times n$ multiplier is given by (17), which is written in terms of the number of DFFs required by the counter in the center column. Substituting this into the power series of (18) [37], we obtain (19).
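The counter sizing described above can be tabulated with a short Python sketch. It is only an illustration under stated assumptions: one 1's counter per column, one DFF per counter bit, and hypothetical function names that do not come from the paper. The total it computes covers the counter DFFs only (not the shift registers or the latching register), so it should not be read as the closed-form expression of the paper.

```python
def column_heights(n):
    """Heights of the 2n-1 columns of an n x n partial product matrix.

    The height rises from 1 to n at the center column and falls back to 1.
    """
    return [min(k + 1, 2 * n - 1 - k) for k in range(2 * n - 1)]


def counter_width(height):
    """Bits needed to count from 0 to `height`: floor(log2(height)) + 1."""
    return height.bit_length()


def counter_dff_total(n):
    """Total counter bits over all columns (assumption: one DFF per bit)."""
    return sum(counter_width(h) for h in column_heights(n))


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(f"n={n}: center counter width={counter_width(n)}, "
              f"counter DFFs={counter_dff_total(n)}")
```

For $n = 16$ the center column holds 16 bits and the sketch reports a 5-bit counter, and for $n = 8$ a 4-bit counter, which agrees with the counter widths quoted for Fig. 11 and Fig. 16.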




In addition, DFFs are also needed for the shift registers at the top of the structure, and further DFFs are required to latch the counter outputs to the CSA tree. The total number of DFFs required is given by (20).

The area complexity of the second stage can be determined from the number of FAs and HAs required for the CSA tree and the CPA. The CPA is assumed to be implemented by an RCA, the simplest adder with the lowest power dissipation. It should be noted that the pattern of the dot matrix in this stage is different from that of the conventional $n \times n$ multiplier (see Fig. 11). Owing to the preliminary counter-based reduction, the height of the matrix is reduced logarithmically from $n$ to $\lfloor \log_2 n \rfloor + 1$. In the following derivations, the number of HAs and FAs is derived from the number of stages that is required to reduce the height to 2. The number of HAs required for the proposed multiplier depends only on this number of stages, which stays constant over a range of wordlengths. HAs are needed in the first stage (the stage below the latching register) and in each of the subsequent stages, and one HA is also required in the final RCA. Therefore, the number of HAs can be expressed as (21).

If $n$ is a power of two, the number of FAs in each stage of the CSA tree follows a regular pattern determined by the number of reduction stages. If $n$ is not a power of two, the number of FAs in each stage is increased by 2 for every increment of $n$ with the same number of stages. Hence, extra FAs are needed. Therefore, the total number of FAs required by an $n \times n$ multiplier can be expressed as (22), where the first two terms account for the FAs in the CSA tree when $n$ is an exact power of two, the third term accounts for the additional FAs when $n$ deviates from a power of two for the same number of stages, and the last term accounts for the FAs in the final RCA. Note that the same power series in (18) is also applied here to simplify the expression.

In general, the area of the proposed serial-serial unsigned multiplier can be expressed in terms of basic logic modules as (23), using the areas of a DFF, FA, HA, and three-input AND gate. For the proposed 2's complement serial-serial multiplier of Fig. 13, additional DFFs, one two-input AND gate, one two-input NAND gate, inverters, and FAs are required. Thus, its total area is given by (24), which additionally involves the areas of an inverter, a two-input NAND gate, and a two-input AND gate.

To comprehensively evaluate the area-time complexity of our proposed design against other serial-serial multipliers, the area and delay of the logic cells are normalized with respect to a basic inverter from the same standard cell library. This benchmarking method has been widely adopted in the literature as a fair alternative when it is impractical to implement all competitor circuits for different dimensions [27], [35]. The normalized areas and delays of various logic cells are tabulated in Table I using the STM 90-nm standard cell library datasheet [38], where the delay and area of an inverter are 7.2 ps and 1.56 µm², respectively. As an inverter has a normalized delay of unity, a two-input NAND gate with a normalized delay of 1.57 in Table I has an actual delay of $1.57 \times 7.2 \approx 11.3$ ps. The data for the logic modules that are not available in the library (e.g., the 5:3 counter) are interpolated from their constituent logic gates.

Fig. 14. Normalized computation time for unsigned serial-serial multipliers.

Table II compares the critical path delay, cycles required, and computation time of the proposed design with the existing serial-serial multipliers. In the Area Expression column of Table II, the critical path delay is expressed in terms of the wordlength $n$ and the normalized delays from Table I. Among all, the proposed counter-based multiplier exhibits the lowest critical path delay of 21.41. This is achieved by eliminating the complex 5:3 counter and replacing it with a simple 1's counter. In addition, all the PP bits can be accumulated in just $n$ cycles in the proposed method, while the others need at least $2n$ cycles. The logarithmically reduced partial product bits are summed in a single cycle of the second-stage clock, which can be pipelined with the accumulation stage. The period of the second-stage clock is approximately $n$ times that of the first-stage clock. Thus, after an initial latency to produce the first output, one output is generated in every cycle of the second-stage clock. All other serial-serial multipliers are not easily pipelinable due to their tight integration of the PP formation, summation, and carry propagation. The critical path delay remains the same for the proposed 2's complement multiplier, while the others have an additional XOR gate or MUX in the critical path [23], [28], [35].
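The column-wise accumulation performed by the 1's counters can be checked with a small behavioural model. The Python sketch below is only a functional illustration under simplifying assumptions: it visits the partial product bits in an arbitrary order rather than in the serial order of the proposed architecture, it ignores all timing, and the function names are hypothetical. It shows that counting the 1's in each column and then adding the counts with their column weights reproduces the unsigned product.

```python
def column_counts(a, b, n):
    """Count the 1's in each column of the n x n partial product matrix.

    Column k collects the AND of bit i of `a` and bit j of `b` for i + j == k.
    """
    counts = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            counts[i + j] += (a >> i) & (b >> j) & 1
    return counts


def multiply_via_counts(a, b, n):
    """Add the column counts with weight 2**k (the job of the adder stage)."""
    return sum(count << k for k, count in enumerate(column_counts(a, b, n)))


if __name__ == "__main__":
    a, b, n = 0xFFFF, 0xFFFF, 16
    counts = column_counts(a, b, n)
    print("center column count:", counts[n - 1])
    print("product matches:", multiply_via_counts(a, b, n) == a * b)
```

With all-ones 16-bit operands, the center column count reaches 16, which is why a 5-bit counter is needed at that column.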

Fig. 14 shows the total normalized computation time of the existing and proposed unsigned serial-serial multipliers for $n$ = 8, 16, 32, 64, and 128 in a bar chart. It is evident from Fig. 14 that the proposed multiplier is much faster than all the other existing serial-serial multipliers, making it suitable for on-the-fly multiplication in high data rate applications, especially those with on-chip serial-link bus architectures. To the best of our knowledge, this is the first serial-serial multiplier that can operate at an input data sampling rate that is even higher than that of the serial-parallel multipliers [27], [33]–[35].

The proposed multiplier has an area penalty over other methods due to its semi-serial and semi-parallel structure. It needs a small adder tree of height $\lfloor \log_2 n \rfloor + 1$ in the second stage, which is not required in the other methods. It also needs more flip-flops to accumulate and latch the PP bits. The area complexities of different serial-serial multipliers are compared in Table III, where the Normalized Area column lists the area in terms of the wordlength $n$ and the normalized values from Table I. An approximation is used to simplify the normalized area expression of our design in the last column of Table III. The hardware overhead is the cost of performing the accumulation as fast as the input data buffering required by an I/O-limited design, which is achieved by using mixed execution modes in the two stages. The tradeoff between area and delay is better illustrated by the area-delay product (ADP) for $n$ = 8, 16, 32, 64, and 128 in Fig. 15. The ADP is shown in logarithmic scale due to the large variation in ADP for different $n$. The area-time tradeoff of our proposed multiplier is acceptable based on the moderate ADP from Fig. 15. In comparison with the parallel multiplier, the proposed multiplier has a better ADP, which is discussed in the next subsection.

Fig. 15. Normalized area-delay-product comparison.

B. ASIC Implementation and Post-Layout Simulation Result

The designs of the proposed 16 × 16, 32 × 32, and 64 × 64 multipliers were coded in structural VHDL, synthesized and technology mapped to the STM 90-nm standard cell library by Synopsys Design Compiler to obtain the gate-level netlists, and then placed and routed using Cadence SoC Encounter. The layouts were generated using Cadence Virtuoso and were DRC and LVS verified. The post-layout simulation results were obtained by extracting the RC from the routed designs using Synopsys StarRC-XT. A transistor-level simulation was performed on the extracted netlist. The post-layout simulation results of the proposed multiplier are compared with those of the conventional parallel CSA array multiplier to ascertain its merits. For a fair comparison, I/O pads were not included in the layout, and the conventional 16 × 16, 32 × 32, and 64 × 64 parallel CSA array multipliers were implemented with an on-chip serial data input buffer.

An important implementation issue associated with our counter-based serial-serial multiplier is that, due to the asynchronous nature of the serial 1's counter, the counter output is not available immediately after the last accumulation clock cycle, as the last input has to propagate from the LSB to the MSB of the counter. A setup and hold violation can occur if the counter output bits are read at the edge of the latching clock before they are stabilized. In this case, the latching of the counter outputs to the second adder stage must be delayed to cover the worst case, which is set by the longest counter of width $\lfloor \log_2 n \rfloor + 1$ at the center column of Fig. 11. This is addressed by refraining from triggering the counter for the last few cycles, which provides an additional guard band for clock uncertainty. The serial data inputs are kept at 0 during these cycles to maintain the counter outputs. A min-max timing analysis is performed on the design after placement and routing to ensure that the worst-case delay has been considered and that timing violations are avoided.

The timing waveforms of a 16 × 16 multiplication captured from the post-layout simulation with Synopsys NanoSim are shown at the bottom of Fig. 16. The waveforms include the data inputs (all 1's for the worst case) to the 5-bit counter at the center column (for $n = 8$, it is a 4-bit counter as in Fig. 11), Counter is the output of the 5-bit counter, Latch is the output of the latching register, and the sampling clock is set at 4 GHz. The timing waveforms in the shaded window are zoomed in and generalized for an $n \times n$ multiplication as shown at the top of Fig. 16. From the standard cell library, it is known that the setup time, clock-to-Q delay, and hold time of a DFF are 70, 40, and 20 ps, respectively. It can be seen that the counter outputs are stable for almost two cycles before and about one cycle after the active edge of the latching clock. Thus, the setup and hold time violations are avoided. It can be seen from Fig. 16 that the counter is able to count from 0 to 16 (10 in HEX) with the outputs correctly latched to the second pipeline adder tree.

Fig. 16. Timing relation between data, reset, and Clock1 of asynchronous counter and Clock2 for a 16 × 16 multiplication.

The post-layout area, delay, and ADP of the proposed and conventional array multipliers for wordlengths of 16 × 16, 32 × 32, and 64 × 64 are tabulated in Table IV. The delay of the CSA array multipliers consists of the time required to buffer a complete set of operands and the time required for the PP reduction and final CPA. For the proposed counter-based multipliers, the delay consists of the accumulation delay, the delay of the CSA tree and final CPA, and the additional delay required for bit propagation of the counters. From the post-layout simulation results of Table IV, our proposed multiplier achieves a delay comparable to that of the parallel multiplier, but its area is significantly smaller, especially for larger operands, due to its simpler adder network. From Table IV, the ADPs of the proposed 32 × 32 and 64 × 64 multipliers are approximately 28% and 52% lower than those of the corresponding CSA array counterparts.

If a series of multiplications is to be performed consecutively, the proposed multipliers can be easily pipelined into two stages. Although the clock periods of the two stages are defined by (14) and (16), respectively, the accumulation latency of (15) and the second-stage clock period must be equalized in pipeline mode for proper synchronization of the two stages, and the two clock periods are then determined accordingly. In the current implementation, the second-stage clock is derived from the first-stage clock using a clock divider circuit. In the two-stage pipelined configuration, the proposed 16 × 16, 32 × 32, and 64 × 64 multipliers can achieve throughput rates as high as 216, 120, and 64 MHz, respectively. Note that both the proposed and the CSA array multipliers reported in Table IV are not pipelined at the adder tree. For a fully pipelined CSA multiplier, the clocking frequency could be higher, and so would its throughput, but the area and routing complexity would be significantly increased, as each stage of FAs requires a set of DFFs.

The average power consumption was estimated by Synopsys NanoSim with random serial inputs at 4 GHz and a supply voltage of 1 V. A Monte Carlo statistical model [39] is adopted to obtain the mean power dissipation of each design with a 99.9% confidence level that the error is bounded below 0.25% at convergence. The Monte Carlo simulation of the mean power for the proposed 64 × 64 multiplier is shown in Fig. 17. It shows that the cumulative average power recorded after every 0.25 µs (Sampling Time) converges rapidly to within the 0.25% error bound of the final estimate. The average power consumptions of the proposed and the CSA multipliers are listed in Table V. The energy per operation in Table V is obtained as the product of the estimated mean power and the computation time of one multiplication operation. In the case of our proposed counter-based multiplier, the additional cycles required for compensating the bit propagation of the counters from the LSB to the MSB were included.

Fig. 17. Power convergence result of the proposed 64 × 64 multiplier.

VII. CONCLUSION

In this paper, a new method for computing serial-serial multiplication is introduced by using low complexity asynchronous counters. By exploiting the relationship among the bits of a partial product matrix, it is possible to generate all the rows serially in just $n$ cycles for an $n \times n$ multiplication. Employing counters to count the number of 1's in each column allows the partial product bits to be generated on-the-fly and partially accumulated in place with a critical path delay of only an AND gate and a DFF. The counter-based accumulation reduces the PP height logarithmically and improves the effective reduction rate achievable with an FA-based CSA tree. The post-layout simulation results show that the serial input can be sampled at a rate as high as 4.54 GHz when the multiplier is mapped to an ASIC with the STM 90-nm CMOS process. The sampling rate can be increased to 8 GHz if high speed cells from the same library are used. The proposed counter-based multiplier outperforms many serial-serial and serial-parallel multipliers in speed, but its hybrid architecture does carry an area overhead. Overall, the ADP is comparable to that of other serial-serial multipliers. Compared with the 32 × 32 and 64 × 64 parallel CSA array multipliers, the proposed multiplier has comparable speed but is 35% and 58% more area efficient, respectively. In addition, our 64 × 64 multiplier consumes about 31% less energy per operation than its parallel counterpart. Last but not least, the counter-based approach has the clear advantage of a low I/O requirement and hence is most suitable for complex SoCs, advanced FPGAs, and high-speed bit-serial applications.

REFERENCES

[1] D. Bailey, E. Soenen, P. Gupta, P. Villarrubia, and D. Sang, "Challenges at 45 nm and beyond," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), San Jose, CA, 2008, pp. 11–18.
[2] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs. New York: Oxford Univ. Press, 2009.
[3] A. D. Booth, "A signed binary multiplication technique," Quarterly J. Mechan. Appl. Math., vol. 4, no. 2, pp. 236–240, Aug. 1951.
[4] R. Menon and D. Radhakrishnan, "High performance 5:2 compressor architectures," IEE Proc.-Circuits Devices Syst., vol. 153, no. 5, pp. 447–452, Oct. 2006.
[5] C. H. Chang, J. Gu, and M. Zhang, "Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 10, pp. 1985–1997, Oct. 2004.
[6] M. Burzio and P. Pellegrino, "Serializing-parallelizing circuit for high speed digital signals," U.S. Patent 5 790 058, Aug. 4, 1998.
[7] STMicroelectronics, Geneva, Switzerland, "IO-PAD library datasheet," 2006.
[8] H. K. Mecklai and A. L. Webb, "Compact buffer design for serial I/O," U.S. Patent 6 167 109, Dec. 26, 2000.
[9] T. Gloekler, I. Holm, R. C. Koester, and M. W. Riley, "High speed on-chip serial link apparatus," U.S. Patent 2008/0 133 800, Jun. 5, 2008.
[10] C. Dudha, T. Schoenauer, and P. Wallner, "Parallelization of serial digital input signals," U.S. Patent 2008/0 055 126 A1, Mar. 6, 2008.
[11] W. M. Pitio, "Serializer," U.S. Patent 2003/0 102 992 A1, Jun. 5, 2003.
[12] M. Koga, "Serial/parallel converter," U.S. Patent 6 339 387 B1, Jan. 15, 2002.
[13] A. K. Lenstra and E. Verheul, "Selecting cryptographic key sizes," J. Cryptology, vol. 14, no. 4, pp. 255–293, 2001.


[14] J.-Y. Lai and C.-T. Huang, "Elixir: High-throughput cost-effective dual-field processors and the design framework for elliptic curve cryptography," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 11, pp. 1567–1580, Nov. 2008.
[15] A. Hariri and A. Reyhani-Masoleh, "Bit-serial and bit-parallel Montgomery multiplication and squaring over GF($2^m$)," IEEE Trans. Comput., vol. 58, no. 10, pp. 1332–1345, Oct. 2009.
[16] M. Ghoneima, Y. Ismail, M. Khellah, J. Tschanz, and V. De, "Serial-link bus: A low-power on-chip bus architecture," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 9, pp. 2020–2032, Sep. 2009.
[17] R. Dobkin, A. Morgenshtein, A. Kolodny, and R. Ginosar, "Parallel vs. serial on-chip communication," in Proc. ACM Int. Workshop Syst. Level Interconnect Prediction (SLIP), Newcastle, U.K., 2008, pp. 43–50.
[18] R. Dobkin, M. Moyal, A. Kolodny, and R. Ginosar, "Asynchronous current mode serial communication," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 7, pp. 1107–1117, Jul. 2010.
[19] P. Teehan, G. Lemieux, and R. Greenstreet, "Towards reliable 5 Gbps wave-pipelined and 3 Gbps surfing interconnect in 65 nm FPGAs," in Proc. 17th ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), Newcastle, U.K., 2009, pp. 43–52.
[20] I. N. Chen and R. Willoner, "An O(n) parallel multiplier with bit sequential input and output," IEEE Trans. Comput., vol. C-28, no. 10, pp. 721–727, Oct. 1979.
[21] R. F. Lyon, "Two's complement pipeline multipliers," IEEE Commun. Lett., vol. 24, no. 4, pp. 418–425, Apr. 1976.
[22] N. R. Strader and V. T. Rhyne, "A canonical bit-sequential multiplier," IEEE Trans. Comput., vol. C-31, no. 8, pp. 791–795, Aug. 1982.
[23] P. Ienne and M. A. Viredaz, "Bit-serial multipliers and squarers," IEEE Trans. Comput., vol. 43, no. 12, pp. 1445–1450, Dec. 1994.
[24] G. Bi and E. V. Jones, "High-performance bit-serial adders and multipliers," IEE Proc. G-Circuits Devices Syst., vol. 139, no. 1, pp. 109–113, Feb. 1992.
[25] A. Almiladi, M. K. Ibrahim, M. Al-Akidi, and A. Aggoun, "High performance scalable bidirectional mixed radix- serial-serial multipliers," IET Proc.-Comput. Digit. Tech., vol. 1, no. 5, pp. 632–639, 2007.
[26] K. Z. Pekmestzi, P. Kalivas, and N. Moshopoulos, "Long unsigned number systolic serial multipliers and squarers," IEEE Trans. Circuits Syst. II, Brief Papers, vol. 48, no. 3, pp. 316–321, Mar. 2001.
[27] H. I. Saleh, A. H. Khalil, M. A. Ashour, and A. E. Salama, "Novel serial-parallel multipliers," IEE Proc.-Circuits Devices Syst., vol. 148, no. 4, pp. 183–189, Aug. 2001.
[28] R. Gnanasekaran, "On a bit-serial input and bit-serial output multiplier," IEEE Trans. Comput., vol. C-32, no. 9, pp. 878–880, 1983.
[29] P. T. Balsara and D. T. Harpe, "Understanding VLSI bit serial multipliers," IEEE Trans. Edu., vol. 39, no. 1, pp. 19–28, Feb. 1996.
[30] A. Aggoun, A. Ashur, and M. K. Ibrahim, "Area-time efficient serial-serial multipliers," in Proc. IEEE Conf. Circuits Syst. (ISCAS), Geneva, Switzerland, 2000, pp. 585–588.
[31] O. Nibouche, A. Bouridarie, and M. Nibouche, "New architectures for serial-serial multiplication," in Proc. IEEE Conf. Circuits Syst. (ISCAS), Sydney, Australia, 2001, vol. 2, pp. 705–708.
[32] S. Haynal and B. Parhami, "Arithmetic structures for inner-product and other computations based on a latency-free bit-serial multiplier design," presented at the 13th Asilomar Conf. Signals, Syst. Comput., Geneva, Switzerland, 1996.
[33] I.-C. Wu, "A fast 1-D serial-parallel systolic multiplier," IEEE Trans. Comput., vol. C-36, no. 10, pp. 1243–1247, Oct. 1987.
[34] R. Gnanasekaran, "A fast serial-parallel binary multiplier," IEEE Trans. Comput., vol. C-34, no. 8, pp. 741–744, Aug. 1985.
[35] S. Sunder, F. El-Guibaly, and A. Antoniou, "Two's-complement fast serial-parallel multiplier," IEE Proc.-Circuits Devices Syst., vol. 142, no. 1, pp. 41–44, Feb. 1995.
[36] M. R. Meher, C. C. Jong, and C. H. Chang, "High-speed and low-power serial accumulator for serial/parallel multiplier," in Proc. IEEE Asia-Pacific Conf. Circuits Syst. (APCCAS), Macau, China, 2008, pp. 176–179.
[37] E. Kreyszig, Advanced Engineering Mathematics. New York, NJ: Wiley, 2006.
[38] STMicroelectronics, Geneva, Switzerland, "Standard cell library datasheet," 2006.
[39] R. Burch, F. N. Najm, P. Yang, and T. N. Trick, "A Monte Carlo approach for power estimation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 1, no. 1, pp. 63–71, Mar. 1993.

Manas Ranjan Meher (S'08) received the B.S. degree from G.M. College, Sambalpur University, Orissa, India, in 2001, the M.S. degree from Utkal University, Orissa, India, in 2003, both in computer science, and the Master of Technology degree in electronics and communication engineering from the National Institute of Technology (NIT), Rourkela, India, in 2006. He is currently pursuing the Ph.D. degree at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. He worked as a Research Engineer with the VLSI Design Laboratory, Department of Electronics and Communication Engineering, NIT. His research interests include the design of high-speed bit-serial arithmetic circuits, algorithm-to-hardware mapping of computationally intensive DSP algorithms, and digital IC design. Mr. Meher has been listed in Marquis Who's Who in the World since 2009.

Ching Chuen Jong received the B.Sc. (Eng) and Ph.D. degrees in electronic engineering from Queen Mary, University of London, London, U.K. He worked in the area of high-level synthesis of digital systems for over three years in both academia and industry in the U.K. before he joined Nanyang Technological University, Singapore, as a faculty member, where he is currently an Associate Professor with the Division of Circuits and Systems, School of Electrical and Electronic Engineering. His current technical interests include high-level synthesis, digital integrated circuit design, fast prototyping of digital systems with FPGAs, reconfigurable systems, and CAD for IC design. Dr. Jong is a Chartered Engineer (CEng), a Chartered IT Professional (CITP), a member of the Institution of Engineering and Technology (IET), and a member of the British Computer Society (BCS).

Chip-Hong Chang (S'92-M'98-SM'03) received the B.Eng. (Hons) degree from the National University of Singapore, Singapore, in 1989, and the M.Eng. and Ph.D. degrees from Nanyang Technological University (NTU), Singapore, in 1993 and 1998, respectively. He served as a Technical Consultant in industry prior to joining the School of Electrical and Electronic Engineering, NTU, in 1999, where he is currently an Associate Professor. He holds joint appointments at the university as Assistant Chair of Alumni, School of EEE, since June 2008, Deputy Director of the Centre for High Performance Embedded Systems (CHiPES) since 2000, Program Director of the Centre for Integrated Circuits and Systems (CICS) from 2003 to 2009, and Program Director of the VLSI Laboratory since 2010. His current research interests include low power arithmetic circuits, digital filter design, application specific digital signal processing, and digital watermarking for IP protection. He has published three book chapters and over 150 research papers in refereed international journals and conferences. Dr. Chang has served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-PART I: REGULAR PAPERS since 2010, as an Editorial Advisory Board Member of the Open Electrical and Electronic Engineering Journal since 2007 and of the Journal of Electrical and Computer Engineering since 2008, and as a special issue guest editor of the Journal of Circuits, Systems and Computers in 2010, and has served on several international conference advisory and technical program committees. He is a Fellow of the IET.