Design and Implementation of High-Performance High-Valency Ling Adders
Taskin Kocak
Dept. of Computer Engineering Bahcesehir University Istanbul, Turkey E-mail: taskin.kocak@bahcesehir.edu.tr
Preeti Patil
Dept. of Electrical and Electronics Engr. University of Bristol Bristol, UK E-mail: pp7020@bristol.ac.uk
Abstract: Parallel prefix adders are used for efficient VLSI implementation of binary number additions. The Ling architecture offers a faster carry computation stage compared to the conventional parallel prefix adders. Recently, Jackson and Talwar proposed a new method to factorize Ling adders, which helps to further reduce the complexity as well as the delay of the adder. This paper discusses the design and implementation details for such lower-complexity, fast parallel prefix adders based on the Ling theory of factorization. In particular, valency or radix, the number of inputs to a single node, is explored as a design parameter. Several low- and high-valency adders are implemented in 65 nm CMOS technology. Experimental results show that the high-valency Ling adders have superior area-delay characteristics over previously reported Ling-based or non-Ling adders for the same input size. Moreover, our 20-bit high-valency adder has a better area-delay measurement than the previously published 16-bit adders.

Index Terms: Adders, arithmetic, integrated circuits, logic design
I. INTRODUCTION

Binary addition is one of the fundamental operations in electronic circuits. Many modern circuits contain several adder units for applications such as the arithmetic logic unit, memory addressing and program counter update. Thus, there is considerable interest in designing faster and less complex adder architectures. Over the last few decades, several adder architectures have been proposed to optimize the adder delay; examples include carry-lookahead [1], [2], [3], ripple-carry [4], [5], and parallel prefix adders [6], [7], [8].

The parallel prefix adder is one of the most popular architectures and offers a good compromise among area, speed and power. This type of adder implements a logic function to determine whether each bit position generates the carry, propagates it or kills it. These generate and propagate/not-kill functions are then hierarchically combined to compute the carry into each bit position, forming a carry tree. The final stage computes the sum at every bit position using exclusive-or (XOR) gates.

In 1981, Ling proposed a method to form a simplified group generate function called the pseudo carry [9]. Factoring one not-kill bit out of the usual group generate function simplifies the first level of the carry computation stage. The rest of the carry tree logic is computed as in a conventional parallel prefix adder. The not-kill term has to be combined with the simplified group generate function, or pseudo carry, before the sum is computed at the end. Because this adds no delay when a multiplexer is used instead of an XOR gate to compute the sum, the resulting adder is faster.

In [10], Jackson and Talwar proposed a new approach in which the Ling adder is simplified not only in the initial stage but also in the subsequent stages. They showed that it is possible to apply the Ling factorization to all levels of the carry tree to form a reduced group generate function similar to Ling's pseudo carry, which reduces the fan-in requirement of each AND gate by one and so reduces the complexity at every level. These pseudo carries can be hierarchically combined to form the carry tree, and the final sum is then computed using multiplexers. This factorization makes it possible to implement the valency-3 equations for the reduced group generate using a 4-input CMOS gate. This was not possible in conventional adders because the valency-3 equation has 5 terms and most standard cell libraries have gates with at most 4 inputs.

The critical path of the prefix adder goes through the carry path. If the logic for the carry, that is, the group generate equations of all the stages in the carry tree, can be simplified, the carry is computed faster and the resulting adder becomes faster. Doing this increases the complexity of some other terms, but as long as the carry is on the critical path, this simplification of the carry path makes the adder faster.

The valency or radix indicates the number of inputs to a single node in the carry tree. In implementation terms, fan-in is also used interchangeably with valency. In this paper, we use valency as a design parameter and explore its effect on adder performance in terms of delay and area. We use the term low valency for two and three inputs, and high valency for four and five inputs. Using low valency in the logic circuit reduces the complexity at each level by decreasing the number of CMOS cells. However, the number of levels increases.

The work in [10] paves the way for high-valency circuits in adder architectures; however, it stopped short of giving any experimental results or discussing implementation issues. That is the subject of this paper. Hence, in this work, we first confirm the complexity reduction in Ling adders by providing experimental implementation results. Then, we demonstrate that high-valency Ling adders achieve better performance than low-valency Ling adders and other conventional parallel prefix
978-1-4673-1188-5/12/$31.00 2012 IEEE
adders. Three types of 16-bit adders are designed based on different types of parallel prefix adders using valency-4. One 20-bit adder is designed using valency-5, and a 32-bit adder is designed as a hybrid structure, divided into 20-bit and 12-bit adders using valency-5 and valency-3. All the designed adders are faster than the conventional adders and many of the recently implemented ones. The new theory using high valency, mostly valency-5, proved more efficient than the conventional adders using valency-2 in terms of area and speed. The experimental results compare the new adder designs with various types of conventional parallel prefix adders designed in the same design flow, and with recently published adders based on Ling theory.

II. BACKGROUND

A. Adders based on formulation of carry

Addition of two n-bit binary numbers requires determining what happens to the carry at each bit position, which can be generated, propagated or killed. Let $A = a_{n-1} a_{n-2} \ldots a_0$ and $B = b_{n-1} b_{n-2} \ldots b_0$ represent the two numbers to be added, and let $C_{in}$ and $C_{out}$ represent the carry-in and carry-out, respectively. The logic equations for the carry operations can be expressed as follows.

The generate operation, $g$, sets $C_{out} = 1$ independent of $C_{in}$:

$$g_i = a_i b_i \tag{1}$$

The propagate operation, $p$, sets $C_{out} = C_{in}$:

$$p_i = a_i \oplus b_i \tag{2}$$
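As a minimal sketch of Eqs. (1) and (2), the Python fragment below extracts the bit-level generate and propagate signals from two integers and uses them in a plain ripple recurrence as a reference. The function names `bit_signals` and `ripple_add` and the LSB-first list layout are illustrative choices, not part of the original design.

```python
def bit_signals(a: int, b: int, n: int):
    """Return LSB-first lists (g, p) for the n-bit operands a and b."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # Eq. (1): g_i = a_i AND b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # Eq. (2): p_i = a_i XOR b_i
    return g, p


def ripple_add(a: int, b: int, n: int, cin: int = 0) -> int:
    """Reference adder: c_{i+1} = g_i OR (p_i AND c_i), s_i = p_i XOR c_i."""
    g, p = bit_signals(a, b, n)
    carry, total = cin, 0
    for i in range(n):
        total |= (p[i] ^ carry) << i
        carry = g[i] | (p[i] & carry)
    return total | (carry << n)  # the carry-out becomes bit n of the result


g, p = bit_signals(0b1011, 0b0110, 4)
print(g)  # [0, 1, 0, 0]
print(p)  # [1, 0, 1, 1]
print(ripple_add(0b1011, 0b0110, 4) == 0b1011 + 0b0110)  # True
```

The ripple recurrence is only a correctness reference here; the parallel prefix and Ling formulations compute the same carries with less logic depth.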
$$\bar{K}^{m}_{n} = \prod_{i=m}^{n} \bar{k}_i, \quad m \le n \tag{7}$$
Fig. 1. Prefix operator with valency 2, 3 and 4
The number of stages needed to produce the carries is $\log_2 n$, where $n$ is the width of the adder. The base is 2 because this is for a valency-2 adder. This stage is formed from prefix cells, called grey or black prefix cells, in which the prefix operator is implemented. The number of lines going into the dot in the prefix cell determines the valency of the adder, as illustrated in Fig. 1. Note that the prefix cell actually has four inputs (two Gs and two Ks). The sum, $S = s_{n-1} s_{n-2} \ldots s_0$, is calculated by

$$s_i = p_i \oplus C_i = p_i \oplus G^{0}_{i-1} \tag{8}$$
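The carry tree and the sum stage of Eq. (8) can be sketched in a few lines of Python. This is a functional model only, assuming a Kogge-Stone-style valency-2 arrangement and no carry-in; the name `prefix_add` and the use of the not-kill signal `t` (the complement of kill) are choices made for this illustration.

```python
def prefix_add(a: int, b: int, n: int):
    """Valency-2 parallel prefix addition; returns (sum incl. carry-out, stages)."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # bit generate
    t = [((a >> i) | (b >> i)) & 1 for i in range(n)]   # bit not-kill
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]   # bit propagate

    G, T = g[:], t[:]          # group signals, widened stage by stage
    distance, stages = 1, 0
    while distance < n:
        G_next, T_next = G[:], T[:]
        for i in range(distance, n):
            # prefix operator: (G, T) o (G', T') = (G + T G', T T')
            G_next[i] = G[i] | (T[i] & G[i - distance])
            T_next[i] = T[i] & T[i - distance]
        G, T = G_next, T_next
        distance *= 2
        stages += 1

    carries = [0] + G[:n - 1]  # C_i is the group generate over bits i-1..0
    total = sum((p[i] ^ carries[i]) << i for i in range(n))  # Eq. (8)
    return total | (G[n - 1] << n), stages


print(prefix_add(0xFFFF, 0x0001, 16))  # (65536, 4)
```

For a 16-bit operand the loop runs four times, matching the $\log_2 n$ stage count of a valency-2 tree.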
In Eq. (8), $C_i$ is the carry into bit position $i$, which is the carry out of the $(i-1)$th stage. Thus, in the sum logic stage, the sum at each bit position can be computed by an XOR gate with the bit-level propagate at that position as one input and the carry into that position as the other.

C. Ling Adders

In [9], Ling introduced the pseudo-carry notion and showed that the delay in the carry path could be reduced by forming the pseudo-carry term ($H_{n:0} = g_n + G_{n-1:0}$). Ling's pseudo carry is less complex than $G_{n:0}$ because the fan-in of each AND gate used to compute it is reduced by one. Thus, $H_{n:0}$ is easier to implement than $G_{n:0}$. Ling also showed that recursion can be applied to produce the group pseudo carry by hierarchically combining the pseudo carries over subgroups, in a similar way as in a prefix adder. This means that the pseudo carry $H_{j:i}$ of a group from position $j$ to position $i$ can be constructed from the pseudo carries $H_{j:k}$ and $H_{k-1:i}$ of two subgroups, one formed from position $j$ to $k$ and the other from position $k-1$ to $i$:

$$H_{j:i} = H_{j:k} + \bar{K}_{j-1:k-1} H_{k-1:i} \tag{9}$$
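The pseudo-carry definition and the recursion of Eq. (9) can be checked exhaustively on a small word. The sketch below assumes that the pseudo carry of a general group is $H_{j:i} = g_j + G_{j-1:i}$ (the natural extension of the $H_{n:0}$ definition) and that the true carry is recovered as the bit not-kill ANDed with the pseudo carry; all function names are hypothetical.

```python
from itertools import product


def group_generate(g, t, hi, lo):
    """G_{hi:lo}: carry out of bits lo..hi (0 when the range is empty)."""
    carry = 0
    for i in range(lo, hi + 1):
        carry = g[i] | (t[i] & carry)
    return carry


def group_not_kill(t, hi, lo):
    """AND of the bit-level not-kill signals over bits lo..hi."""
    out = 1
    for i in range(lo, hi + 1):
        out &= t[i]
    return out


def pseudo_carry(g, t, hi, lo):
    """Assumed group form of Ling's pseudo carry: H_{hi:lo} = g_hi + G_{hi-1:lo}."""
    return g[hi] | group_generate(g, t, hi - 1, lo)


def ling_identities_hold(n: int = 5) -> bool:
    """Check G = not-kill AND H, and Eq. (9), for every pair of n-bit operands."""
    for bits in product((0, 1), repeat=2 * n):
        a, b = bits[:n], bits[n:]
        g = [x & y for x, y in zip(a, b)]
        t = [x | y for x, y in zip(a, b)]  # not-kill
        for j in range(1, n):
            for i in range(j):
                h = pseudo_carry(g, t, j, i)
                if group_generate(g, t, j, i) != (t[j] & h):
                    return False
                for k in range(i + 1, j + 1):
                    rhs = pseudo_carry(g, t, j, k) | (
                        group_not_kill(t, j - 1, k - 1) & pseudo_carry(g, t, k - 1, i)
                    )
                    if h != rhs:  # Eq. (9) violated
                        return False
    return True


print(ling_identities_hold())  # True
```

The check passes because a generate at any bit implies not-kill at that bit, which is what allows one not-kill factor to be pulled out of the group generate.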
And finally, the kill operation, $k$, resets the carry-out ($C_{out} = 0$):

$$k_i = \bar{a}_i \bar{b}_i \tag{3}$$

B. Parallel Prefix Addition

Prefix adders compute the carry from a group of bits by hierarchically combining the carries from subgroups, forming a tree structure called the carry tree. The carry out of one group of bits (0 to $i-1$), obtained by combining the carry outs (group generate and not kill) of two subgroups formed over bits $i-1$ to $n$ and $n-1$ to 0, is given by

$$C_i = G^{0}_{i-1} = G^{n}_{i-1} + \bar{K}^{n}_{i-1} G^{0}_{n-1} \tag{4}$$

where $1 \le n \le i-1$, and $G$ and $K$ are the group generate and group kill signals, respectively. Over a group from $m$ to $n$, the group operations are calculated by

$$G^{m}_{n} = g_n + \sum_{i=m}^{n-1} \left( \bar{K}^{i+1}_{n} g_i \right), \quad m < n \tag{5}$$

$$G^{m}_{n} = g_n, \quad m = n \tag{6}$$

A Ling adder using valency-2 at its first level of carry computation has the following equation for the group generate:

$$H_{n:n-1} = g_n + g_{n-1} = (a_n b_n) + (a_{n-1} b_{n-1}) \tag{10}$$
Equation (10) can be implemented using a single gate, which speeds up the computation of the sum and reduces the delay of the adder by approximately one inverter (FO4) gate delay. In a conventional prefix adder, this computation would take two levels of logic gates: the first level is a two-input NAND/NOR gate to compute the $g$ and $k$ terms, and the second level is a three-input gate. Many adders, such as those shown in [11], apply the Ling factorization to the first level of the carry computation stage. These are faster than plain prefix adders, but it would be better to apply the Ling factorization at each level of the carry tree to reduce the complexity of the group generate function and make the adder even faster. This is achieved by the recent contribution of Jackson and Talwar [10]. They noted that Eq. (9) can be further factorized as

$$H_{j:i} = (L_{j:k} + H_{j:k})(H_{j:k} + H_{k-1:i}) \tag{11}$$
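The step from Eq. (9) to Eq. (11) is the Boolean distributive law $A + BC = (A + B)(A + C)$, with $A = H_{j:k}$, $B$ the group not-kill term and $C = H_{k-1:i}$. A short truth-table check, with hypothetical function names, makes this concrete:

```python
from itertools import product


def eq9(h_upper: int, not_kill: int, h_lower: int) -> int:
    """Right-hand side of Eq. (9): H_{j:k} + notK * H_{k-1:i}."""
    return h_upper | (not_kill & h_lower)


def eq11(h_upper: int, not_kill: int, h_lower: int) -> int:
    """Right-hand side of Eq. (11) as a product of two sums."""
    first = not_kill | h_upper    # first factor of Eq. (11)
    second = h_upper | h_lower    # second factor of Eq. (11)
    return first & second


print(all(eq9(*v) == eq11(*v) for v in product((0, 1), repeat=3)))  # True
```

The factored form matters for implementation because the second factor contains no not-kill term, so the logic on the carry path has fewer inputs.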
In valency-2, each group generate function, formed over two subgroups, drives another prefix cell in the next level of the carry tree. The number of prefix cells that must be driven by a single cell, or the amount of wiring from the driving node to the driven cells, is large in the case of valency-2. Buffers therefore have to be included to support the high load, which increases the delay of the circuit, especially in later stages. By combining more carries into one node (group) without any buffering inside the node, fewer inverters are needed to drive the next-stage cells, as there are fewer nodes to be driven overall. In addition, higher valency gives fewer levels of logic, which makes the path faster. Therefore, faster adders can be implemented using the theory given in [10].

B. Gate Sizing

Before going into the details of the individual adder designs, we describe how the gates are sized. In all designs, the inputs to the adders come from flip-flops. The flip-flops generating the a and b signals are of size 200, where size 200 is defined relative to the size of the 100% inverter. Inverters of different sizes are available in the cell library, obtained by changing the width-to-length ratio of the transistors. In the same way, different gates are designed with different drive sizes by adjusting the width-to-length ratio of their transistors. For an inverter, the PMOS and NMOS width and length ratio is maintained at 1.6:1, and this inverter is called the 100% inverter or natural-size inverter. The delay of this inverter when driving a load of 4 such inverters is called 1 FO4 delay, and the input capacitance of this inverter is taken as the reference for defining the capacitances of other gates. To build an inverter of a different size, the width-to-length ratio of each transistor is increased or decreased depending on whether a larger or smaller drive strength is required. All the other gates also need to fit in the same height as the 100% inverter. For example, a 2-input NAND gate has two NMOS transistors in series, which would increase the height of the cell.

C. 16-bit adder using valency-4

Two 16-bit adders are designed based on Ladner-Fischer [6] and Kogge-Stone [8]. As shown in the dot diagram in Fig. 2, there are three levels of carry computation formed with valency 2*4*2, which are shown by three rows of solid dots. The first level indicates the input bits a and b, which are the outputs of the flip-flops. As in the conventional 16-bit Ladner-Fischer adder, the fan-out of the nodes increases in the later stages in both designs. The high fan-out load requires buffering between two levels of the carry tree, and the inverter also needs to be large enough to support it. Even though the second stage with valency-4 is shown as just one level, it needs two logic levels, resulting in the same number of levels, 4, as the conventional 16-bit adder formed using valency-2. However, the conventional adder may need more than one buffer in the critical path, resulting in higher delay. In addition, the first stage without Ling factorization requires two levels, increasing the delay by 1 FO4 delay.
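The relation between stage valencies and the width a carry tree can span is simple arithmetic, sketched below. The helper names `prefix_levels` and `tree_span` are hypothetical; the valency lists are the ones quoted in the text for the 16-bit, 20-bit and 12-bit trees.

```python
from math import prod


def prefix_levels(width: int, valency: int) -> int:
    """Smallest number of uniform-valency prefix stages that span `width` bits."""
    levels, span = 0, 1
    while span < width:
        span *= valency
        levels += 1
    return levels


def tree_span(valencies) -> int:
    """Number of bits covered by a carry tree with the given per-stage valencies."""
    return prod(valencies)


print(prefix_levels(16, 2))   # 4
print(tree_span([2, 4, 2]))   # 16
print(tree_span([2, 5, 2]))   # 20
print(tree_span([2, 3, 2]))   # 12
```

A valency-2 tree needs four stages for 16 bits, while the 2*4*2 tree needs three stages; on the reading that the valency-4 stage costs two logic levels and the other two stages one each, both arrive at the four logic levels mentioned above.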
In Eq. (11), $L_{j:k} = \bar{K}_{j-1:k-1}$. In [10], the first term is denoted as $D_{j:k}$ and the second term is denoted as $H$ or $R^{j-k+1}_{j:i}$. In the latter, the subscript indicates the range of the group over which this carry is computed and the superscript indicates the number of missing not-kill bits in the reduced generate or group pseudo-carry equation. These missing not-kill bits are included in the $D$ term, which needs to be combined with the reduced group generate equation to get the actual group generate function:

$$G_{j:i} = (D_{j:k})(H_{j:i}) \tag{12}$$
Thus, Ling factorization is applied to compute the group pseudo carry in terms of $H$ or $R$ for the carry tree stage. For higher valencies such as 4 and 5, the same factorization (of any number of not-kill terms at any level of the carry tree) can be done, but the carry equation still requires more than one logic level when implemented using gates with at most 4 inputs. The sum can be computed by a multiplexer with $H_{n-1:0}$ as the control signal for the sum output at bit position $n$, without any extra delay. In this case the control signal is $R^{j-k+1}_{j:i}$. The equation to compute the sum for bit position $n$ is given as

$$s_n = \bar{R}_{n-1:0}\,(p_n) + R_{n-1:0}\,(p_n \oplus D_{n-1:0}) \tag{13}$$

III. ADDER DESIGN

A. Motivation to use high valency

The logical depth of a prefix adder depends on its valency. A high-valency implementation requires more than one CMOS logic level to compute the resulting logic, which adds delay to the critical path of the circuit even though the number of logical levels is reduced. However, the Ling theory as modified by Jackson and Talwar [10] shows that the complexity of the carry equations can be reduced by factoring any number of not-kill terms out of the group generate equation, forming a reduced group generate equation and making the corresponding $D$ term (which includes the missing not-kill bits) more complex. As long as both of these paths are balanced such that the carry still remains on the critical path, the resulting adder can be faster. The purpose of this paper is to confirm this.
Fig. 2. 16-bit Ladner-Fischer adder carry tree with valency 2*4*2

Fig. 3. 16-bit Kogge-Stone adder carry tree with valency 2*4*2

Fig. 4. 20-bit adder carry tree with valency 2*5*2
The second 16-bit adder, which is based on the Kogge-Stone structure, is designed using valency 2*4*2, forming 3 stages of the carry tree. In this adder, the fan-out of each node in any stage of the carry tree is not limited to just 2 as in the conventional Kogge-Stone adder; this is due to the use of high valency. The dot diagram for carry generation is given in Fig. 3. As in the Kogge-Stone structure, the first stage in this adder also computes the carries for groups formed over two bits at every position. This overlapping of terms increases the number of gates in the adder. In addition, at every bit position the bit generate and not-kill terms are computed using NAND and NOR gates that combine the outputs of the flip-flops, so the total fan-out of each flip-flop becomes 6. This is a very high fan-out load, and the maximum size available for the flip-flops is 200, which cannot support such a high load. Adding a buffer after the flip-flops to drive the high fan-out load results in extra delay, and the buffer would have to be very large to support such a load. This is the main drawback of this adder.

D. 20-bit adder using valency-5

A 20-bit adder is designed with a carry tree formed in three stages using valency 2*5*2, as shown in the carry computation dot diagram in Fig. 4. The first stage is formed as in the previous adders, with valency-2 equations. The reduced group generate term misses one not-kill bit, which is denoted by the superscript of the $R$ term. In the second stage, some of the positions combine 5 subgroups to form wider groups, computing the reduced group generate in two logic levels because no single cell with 4 inputs can be used. In the second stage, for valency-3 and valency-4, one not-kill term is factored out, making the group generate equation simple enough to compute in two logic levels for valency-4 and by a single cell for valency-3. The valency-5 equation is very complex in a conventional adder. By factoring out two not-kill groups from this equation, a much simpler reduced generate equation is obtained. The third stage is formed using valency-2 equations, where the second-stage term $R^{(5)}_{9:0}$ drives a high fan-out load in the third stage. This arrangement is similar to the Ladner-Fischer structure and requires a very large buffer to support this high fan-out load. An inverter of size 480 is used for this purpose.

E. 32-bit adder using valency 2-5
The carry tree of this adder is formed by combining the carry trees of two adders, a 20-bit and a 12-bit adder, both formed in three stages. The 20-bit adder is the one just described, with valency 2*5*2. It is driven by the 12-bit adder, with valency 2*3*2, through buffering inverters. The carry tree for this 32-bit adder is shown in Fig. 5. The carry out of the 12-bit adder is combined with the carries of the 20-bit adder to form the final stage of the 32-bit adder. The carry out of the 12-bit adder has a very high fan-out load, so buffering inverters are added to support this load. The carry out of the 12-bit adder, after passing through two buffering inverters, is available at almost the same time as the carries from the 20-bit adder.
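The way the two blocks are merged in the final stage can be modelled functionally as follows. This is a sketch under stated assumptions: the 12-bit block is taken to hold the low-order bits, a simple serial scan stands in for each block's prefix tree, and the names `group_signals`, `block_sum` and `hybrid_add32` are hypothetical.

```python
import random


def group_signals(a: int, b: int, n: int):
    """LSB-first lists (G, T, p): G[i] and T[i] are the group generate and group
    not-kill over bits i..0 of an n-bit block; p[i] is the bit propagate."""
    G, T, p = [], [], []
    gen, not_kill = 0, 1
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        gen = (ai & bi) | ((ai | bi) & gen)
        not_kill &= ai | bi
        G.append(gen)
        T.append(not_kill)
        p.append(ai ^ bi)
    return G, T, p


def block_sum(p, carries) -> int:
    """Sum bits of one block: s_i = p_i XOR (carry into bit i)."""
    return sum((pi ^ ci) << i for i, (pi, ci) in enumerate(zip(p, carries)))


def hybrid_add32(a: int, b: int) -> int:
    """Add two 32-bit unsigned values as a 12-bit low block plus a 20-bit high block."""
    low_bits, high_bits = 12, 20
    mask = (1 << low_bits) - 1
    G_lo, _, p_lo = group_signals(a & mask, b & mask, low_bits)
    G_hi, T_hi, p_hi = group_signals(a >> low_bits, b >> low_bits, high_bits)

    c12 = G_lo[-1]  # carry out of the 12-bit block
    # Final stage: merge the low block's carry-out with the high block's local signals.
    merged = [G_hi[i] | (T_hi[i] & c12) for i in range(high_bits)]

    low = block_sum(p_lo, [0] + G_lo[:-1])
    high = block_sum(p_hi, [c12] + merged[:-1])
    return low | (high << low_bits) | (merged[-1] << 32)  # bit 32 is the carry-out


rng = random.Random(0)
pairs = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(1000)]
print(all(hybrid_add32(x, y) == x + y for x, y in pairs))  # True
```

The single signal `c12` feeds every position of the high block, which is the high fan-out that the buffering inverters in the design are there to drive.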
Fig. 5. 32-bit adder carry tree with valency 2*5*2 (20-bit) and valency 2*3*2 (12-bit)
IV. EXPERIMENTAL RESULTS

In this section, we provide experimental results for the adders described in the previous section. After forming the equations for the adders, RTL code is written for them. The RTL code is then synthesized using Synopsys tools on a 65 nm CMOS standard cell technology. The generated netlist is then fed to the Magma tool to carry out the floor-planning, placement, routing and clock distribution stages, producing the layout of the adder and its static timing analysis report.

If the load of any gate is larger than required to get the minimum possible delay, the load is reduced by reducing either the wire load or the fan-out load. The wire load can be reduced by placing the driving gate and the driven gates close to each other. The fan-out can be reduced by inserting an inverter after the gate, driving the gates on the non-critical path through that inverter and the gate on the critical path directly. Techniques such as shielding also reduce the wire capacitance of long wires by placing the power and ground rails parallel to the wire. The inputs of a gate have different input-to-output delays depending on the position of the transistor to which each is connected, so connecting a late-arriving signal to the faster input of the gate can also reduce the total delay. Some gates have very high input capacitance, so the delay through them is longer than through others; avoiding such gates and reformulating the equations to use better gates reduces the overall delay. All these techniques, together with careful construction of the equations, resulted in adders that are smaller, faster, or both, compared with the conventional prefix adders and some of the recent implementations.

In Table I, all results have been normalized to the reported area and delay of a 32-bit Kogge-Stone adder in each paper. This has been done to take account of the different CMOS technologies used in the previous two papers (D-K [12]: 90 nm; D-N [13]: 180 nm) and our work (65 nm).

TABLE I
RESULTS NORMALISED TO THE 32-BIT KOGGE-STONE (K-S) ADDER RESULTS REPORTED IN EACH PAPER

Adder sizes: 32-bit, 16-bit, 16-bit, 32-bit, 32-bit, 16-bit, 20-bit
Adder types: K-S, K-S, L-F, Ling
Delay, D-K [12]: 1, 0.82, 0.86, 1.01, 0.88
Delay, D-N [13]: 1, 0.87, 0.70
Delay, this work: 1, 0.82, 0.86, 0.92, 0.88, 0.75, 0.76
Area, D-K [12]: 1, 0.43, 0.32, 0.72, 0.35
Area, D-N [13]: 1, 1.01, 0.43
Area, this work: 1, 0.41, 0.34, 1, 0.75, 0.28, 0.37
After normalization, the 16-bit Kogge-Stone and Ladner-Fischer adders reported in D-K also closely match our results, giving further confidence that the comparisons of the other adders are valid. Table I shows that, compared with D-K [12], the new adders reported in this work are always faster for comparable area, and that, compared with D-N [13], the new adders are always smaller for comparable speed. The new adders are also superior to the non-Ling adders. This is highlighted by the area-delay comparison given in Table II, which shows that the new adders consistently have a superior area-delay characteristic. Moreover, the 20-bit adder has a better area-delay measurement than either of the previously published 16-bit adders.

TABLE II
AREA-DELAY COMPARISON

Adder sizes: 32-bit, 16-bit, 16-bit, 32-bit, 32-bit, 16-bit, 20-bit
Adder types: K-S, K-S, L-F, Ling
Area-delay, D-K [12]: 1, 0.35, 0.28, 0.73, 0.31
Area-delay, D-N [13]: 1, 0.88, 0.30
Area-delay, this work: 1, 0.34, 0.29, 0.92, 0.66, 0.21, 0.28
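The area-delay figures appear to be the product of the normalised area and delay values. The sketch below checks this reading against the "This work" entries of Tables I and II; treating the figure of merit as a rounded product is an assumption made here, and the variable names are illustrative.

```python
# "This work" columns, as read from Table I (normalised to the 32-bit K-S adder).
delay = [1.0, 0.82, 0.86, 0.92, 0.88, 0.75, 0.76]
area = [1.0, 0.41, 0.34, 1.0, 0.75, 0.28, 0.37]

# "This work" column, as read from Table II.
table2_this_work = [1.0, 0.34, 0.29, 0.92, 0.66, 0.21, 0.28]

# Assumed figure of merit: area times delay, rounded to two decimals.
area_delay = [round(d * a, 2) for d, a in zip(delay, area)]

print(area_delay)  # [1.0, 0.34, 0.29, 0.92, 0.66, 0.21, 0.28]
print(all(abs(x - y) < 1e-9 for x, y in zip(area_delay, table2_this_work)))  # True
```

Under this reading, the last entry (0.28, the 20-bit adder) is no larger than the 16-bit D-K and D-N figures listed in Table II, in line with the statement about the 20-bit adder.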
V. CONCLUSIONS

Ling factorization can be recursively applied to all stages in the carry computation tree of an adder. This factorization reduces the complexity of the carry path, which is generally the critical path of the adder, making it faster. It makes some other paths more complex, but if the complexity is properly balanced, the resulting adder is faster. The idea of combining more simplified logic in a group without buffers in between, and having that group drive another stage of logic (that is, a high-valency implementation of the carry equations), proved to be no slower, giving similar or better delay. Using valency-4, the 16-bit adders achieved a speed similar to that of conventional adders implemented using Ling factorization at the first level, but with a considerable area improvement. Using valency-5, the 20-bit and 32-bit adders are better than the conventional adders in terms of both speed and area. Hence, it is shown that high-valency Ling adders have superior area-delay characteristics over existing Ling or non-Ling adders for the same input size.

As a future direction, one can design adders with a sparse carry tree, i.e., instead of generating every carry, only some of the carries are generated, each of which is used to form the sum of more than one bit position. This can speed up the design further through a lower fan-out requirement and fewer stages. The area may be smaller or may remain the same, because the sum generation then requires more complex logic with more gates than in adders without a sparse carry tree.

REFERENCES
[1] A. Weinberger and J. L. Smith, "A logic for high speed addition," National Bureau of Standards, Circular 591, pp. 3-12, 1958.
[2] T. Lynch and E. Swartzlander, "A spanning tree carry lookahead adder," IEEE Trans. on Computers, vol. 41, no. 8, pp. 931-939, 1992.
[3] G. Yang, S. Jung, K. Baek, S. Kim, S. Kim, and S. Kang, "A 32-bit carry lookahead adder using dual-path all-N logic," IEEE Trans. on VLSI Systems, vol. 13, no. 8, pp. 992-996, 2005.
[4] S. Hauck, M. Hosler, and T. Fry, "High performance carry chains for FPGAs," IEEE Trans. on VLSI Systems, vol. 8, no. 2, pp. 138-147, 2000.
[5] C. Huang, J. Wang, C. Yeh, and C. Fang, "The CMOS carry-forward adders," IEEE Journal of Solid-State Circuits, vol. 39, no. 2, pp. 327-336, 2004.
[6] R. E. Ladner and M. J. Fischer, "Parallel prefix computation," Journal of the ACM, vol. 27, no. 4, pp. 831-838, Oct. 1980.
[7] S. Knowles, "A family of adders," Proc. 14th IEEE Symp. on Computer Arithmetic, pp. 30-34, 1999.
[8] P. M. Kogge and H. S. Stone, "A parallel algorithm for efficient solution of a general class of recurrence equations," IEEE Trans. on Computers, vol. C-22, no. 8, pp. 786-793, Aug. 1973.
[9] H. Ling, "High speed binary adder," IBM Journal of Research and Development, vol. 25, no. 3, pp. 156-166, 1981.
[10] R. Jackson and S. Talwar, "High speed binary addition," Proc. of the 38th Asilomar Conf. on Signals, Systems, and Computers, pp. 1350-1353, Pacific Grove, CA, November 2004.
[11] C. Efstathiou, H. T. Vergos, and D. Nikolos, "Ling adders in standard CMOS technologies," Proc. IEEE Int'l Conf. Electronics, Circuits, and Systems (ICECS), vol. 2, pp. 485-488, Sept. 2002.
[12] S. Das and S. P. Khatri, "A novel hybrid parallel-prefix adder architecture with efficient timing-area characteristic," IEEE Trans. on VLSI Systems, vol. 16, no. 3, pp. 326-331, 2008.
[13] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI Ling adders," IEEE Trans. on Computers, vol. 54, no. 2, pp. 225-231, 2005.
