
A Novel High-Speed Carry Skip Adder with AOI and OAI Logic using Verilog HDL


ABSTRACT: In this paper, we present a carry skip adder (CSKA) structure that has a higher
speed yet lower energy consumption compared with the conventional one. The speed
enhancement is achieved by applying concatenation and incrementation schemes to improve the
efficiency of the conventional CSKA (Conv-CSKA) structure. In addition, instead of utilizing
multiplexer logic, the proposed structure makes use of AND-OR-Invert (AOI) and OR-AND-
Invert (OAI) compound gates for the skip logic. The structure may be realized with both fixed
stage size and variable stage size styles, wherein the latter further improves the speed and energy
parameters of the adder. Finally, a hybrid variable latency extension of the proposed structure,
which lowers the power consumption without considerably impacting the speed, is presented.
Index Terms: Carry skip adder (CSKA), high performance, hybrid variable latency adders.

Chapter-1
INTRODUCTION TO VLSI

1.1 Very-large-scale integration


Very-large-scale integration (VLSI) is the process of creating integrated circuits by combining
thousands of transistors into a single chip. VLSI began in the 1970s when complex
semiconductor and communication technologies were being developed. The microprocessor is a
VLSI device.

Fig 1.1: A VLSI integrated-circuit die

1.2 History

During the 1920s, several inventors attempted devices that were intended to control the current in solid-state diodes and convert them into triodes. Success had to wait until after World War II, during which the attempt to improve silicon and germanium crystals for use as radar detectors led to improvements both in fabrication and in the theoretical understanding of the quantum mechanical states of carriers in semiconductors; after the war, the scientists who had been diverted to radar development returned to solid-state device development. With the invention of the transistor at Bell Labs in 1947, the field of electronics took a new direction, shifting from power-consuming vacuum tubes to solid-state devices.

With the small and effective transistor at hand, electrical engineers of the 1950s saw the possibility of constructing far more advanced circuits than before. However, as the complexity of the circuits grew, problems started arising.

One problem was the size of the circuits. A complex circuit, like a computer, was dependent on speed. If the components of the computer were too large or the wires interconnecting them too long, the electric signals couldn't travel fast enough through the circuit, making the computer too slow to be effective.

Jack Kilby at Texas Instruments found a solution to this problem in 1958. Kilby's idea was to make all the components and the chip out of the same block (monolith) of semiconductor material. When the rest of the workers returned from vacation, Kilby presented his new idea to his superiors, and he was allowed to build a test version of his circuit. In September 1958, he had his first integrated circuit ready. Although the first integrated circuit was crude and had some problems, the idea was groundbreaking. By making all the parts out of the same block of material and adding the metal needed to connect them as a layer on top of it, there was no more need for individual discrete components: no more wires and components had to be assembled manually. The circuits could be made smaller, and the manufacturing process could be automated. From here the idea of integrating all components on a single silicon wafer came into existence, which led to Small Scale Integration (SSI) in the early 1960s, Medium Scale Integration (MSI) in the late 1960s, Large Scale Integration (LSI) in the 1970s, and, by the early 1980s, VLSI with tens of thousands of transistors on a chip (later hundreds of thousands, and now millions).

1.3 Developments

The first semiconductor chips held two transistors each. Subsequent advances added more and
more transistors, and, as a consequence, more individual functions or systems were integrated
over time. The first integrated circuits held only a few devices, perhaps as many as ten diodes,
transistors, resistors and capacitors, making it possible to fabricate one or more logic gates on a
single device.Now known retrospectively as small-scale integration (SSI), improvements in
technique led to devices with hundreds of logic gates, known as medium-scale integration (MSI).
Further improvements led to large-scale integration (LSI), i.e. systems with at least a thousand
logic gates. Current technology has moved far past this mark and today's microprocessors have
many millions of gates and billions of individual transistors.

At one time, there was an effort to name and calibrate various levels of large-scale integration
above VLSI. Terms like ultra-large-scale integration (ULSI) were used. But the huge number of
gates and transistors available on common devices has rendered such fine distinctions moot.
Terms suggesting greater than VLSI levels of integration are no longer in widespread use.

As of early 2008, billion-transistor processors were commercially available. This is expected to become more commonplace as semiconductor fabrication moves from the current generation of 65 nm processes to the next 45 nm generations (while experiencing new challenges such as increased variation across process corners). A notable example is Nvidia's 280 series GPU. This GPU is unique in the fact that almost all of its 1.4 billion transistors are used for logic, in contrast to the Itanium, whose large transistor count is largely due to its 24 MB L3 cache. Current designs, unlike the earliest devices, use extensive design automation and automated logic synthesis to lay out the transistors, enabling higher levels of complexity in the resulting logic functionality. Certain high-performance logic blocks, like the SRAM (Static Random Access Memory) cell, are still designed by hand to ensure the highest efficiency (sometimes by bending or breaking established design rules to obtain the last bit of performance by trading stability). VLSI technology is moving towards radical-level miniaturization with the introduction of NEMS technology. A lot of problems need to be sorted out before the transition is actually made.
1.4 Structured design

Structured VLSI design is a modular methodology originated by Carver Mead and Lynn Conway for saving microchip area by minimizing the area of the interconnect fabric. This is obtained by the repetitive arrangement of rectangular macro blocks, which can be interconnected using wiring by abutment. An example is partitioning the layout of an adder into a row of equal bit-slice cells. In complex designs this structuring may be achieved by hierarchical nesting.

Structured VLSI design was popular in the early 1980s, but lost its popularity later with the advent of placement and routing tools, which waste a great deal of area on routing; the waste is tolerated because of the progress of Moore's Law. When introducing the hardware description language KARL in the mid-1970s, Reiner Hartenstein coined the term "structured VLSI design" (originally as "structured LSI design"), echoing Edsger Dijkstra's structured programming approach, which uses procedure nesting to avoid chaotic spaghetti-structured programs.

1.4.1 Challenges

As microprocessors become more complex due to technology scaling, microprocessor designers have encountered several challenges which force them to think beyond the design plane, and look ahead to post-silicon:

Power usage/heat dissipation: As threshold voltages have ceased to scale with advancing process technology, dynamic power dissipation has not scaled proportionally. Maintaining logic complexity when scaling the design down only means that the power dissipation per area will go up. This has given rise to techniques such as dynamic voltage and frequency scaling (DVFS) to minimize overall power.

Process variation: As photolithography techniques approach the fundamental laws of optics, achieving high accuracy in doping concentrations and etched wires is becoming more difficult and prone to errors due to variation. Designers now must simulate across multiple fabrication process corners before a chip is certified ready for production.

Stricter design rules: Due to lithography and etch issues with scaling, design rules for layout have become increasingly stringent. Designers must keep ever more of these rules in mind while laying out custom circuits. The overhead for custom design is now reaching a tipping point, with many design houses opting to switch to electronic design automation (EDA) tools to automate their design process.

Timing/design closure: As clock frequencies scale up, designers are finding it more difficult to distribute and maintain low clock skew between these high-frequency clocks across the entire chip. This has led to a rising interest in multicore and multiprocessor architectures, since an overall speedup can be obtained by lowering the clock frequency and distributing processing.

First-pass success: As die sizes shrink (due to scaling) and wafer sizes go up (to lower manufacturing costs), the number of dies per wafer increases, and the complexity of making suitable photomasks goes up rapidly. A mask set for a modern technology can cost several million dollars. This non-recurring expense deters the old iterative philosophy involving several "spin cycles" to find errors in silicon, and encourages first-pass silicon success. Several design philosophies have been developed to aid this new design flow, including design for manufacturing (DFM), design for test (DFT), and design for X.

1.5 VLSI Technology

Gone are the days when huge computers made of vacuum tubes sat humming in entire dedicated rooms and could do about 360 multiplications of 10-digit numbers in a second. Though they were heralded as the fastest computing machines of their time, they surely don't stand a chance when compared with modern-day machines. Modern-day computers are getting smaller, faster, cheaper and more power-efficient every progressing second. But what drove this change? The whole domain of computing ushered into a new dawn of electronic miniaturization with the advent of the point-contact transistor by Bardeen and Brattain (1947-48) and then the bipolar junction transistor by Shockley (1949) at Bell Laboratories.

Since the invention of the first IC (integrated circuit), in the form of a flip-flop, by Jack Kilby in 1958, our ability to pack more and more transistors onto a single chip has doubled roughly every 18 months, in accordance with Moore's Law. Such exponential development had never been seen in any other field, and it still continues to be a major area of research work.

Fig 1.2 A comparison: First Planar IC (1961) and Intel Nehalem Quad Core Die

1.6 History & Evolution of VLSI Technology


The development of microelectronics spans a time which is even less than the average life expectancy of a human, and yet it has seen as many as four generations. The early 1960s saw low-density fabrication processes classified under Small Scale Integration (SSI), in which the transistor count was limited to about 10. This rapidly gave way to Medium Scale Integration (MSI) in the late 1960s, when around 100 transistors could be placed on a single chip.

It was the time when the cost of research began to decline and private firms started entering the competition, in contrast to the earlier years, when the main burden was borne by the military. Transistor-Transistor Logic (TTL), offering higher integration densities, outlasted other IC families like ECL and became the basis of the first integrated-circuit revolution. It was the production of this family that gave impetus to semiconductor giants like Texas Instruments, Fairchild and National Semiconductor. The early seventies marked the growth of the transistor count to about 1000 per chip, called Large Scale Integration (LSI).

By the mid-eighties, the transistor count on a single chip had grown into the hundreds of thousands, and hence came the age of Very Large Scale Integration, or VLSI. Though many improvements have been made and the transistor count is still rising, further generation names like ULSI are generally avoided. It was during this time that TTL lost the battle to the MOS family, owing to the same problems that had pushed vacuum tubes into obsolescence: power dissipation and the limit it imposed on the number of gates that could be placed on a single die.

The second age of the integrated-circuit revolution started with the introduction of the first microprocessor, the 4004, by Intel in 1971, followed by the 8080 in 1974. Today many companies like Texas Instruments, Infineon, Alliance Semiconductors, Cadence, Synopsys, Celox Networks, Cisco, Micron Tech, National Semiconductors, ST Microelectronics, Qualcomm, Lucent, Mentor Graphics, Analog Devices, Intel, Philips, Motorola and many other firms have been established and are dedicated to the various fields in VLSI, like programmable logic devices, hardware description languages, design tools, embedded systems, etc.
1.7 VLSI Design

VLSI chiefly comprises front-end design and back-end design these days. While front-end design includes digital design using HDLs, design verification through simulation and other verification techniques, design from gates and design for testability, back-end design comprises CMOS library design and its characterization. It also covers physical design and fault simulation.

While simple logic gates might be considered SSI devices, and multiplexers and parity encoders MSI, the world of VLSI is much more diverse. Generally, the entire design procedure follows a step-by-step approach in which each design step is followed by simulation before actually being put onto hardware or moving on to the next step. The major design steps are different levels of abstraction of the device as a whole:
1. Problem Specification: This is a high-level representation of the system. The major parameters considered at this level are performance, functionality, physical dimensions, fabrication technology and design techniques. The specification has to be a trade-off between market requirements, the available technology and the economic viability of the design. The end specifications include the size, speed, power and functionality of the VLSI system.

2. Architecture Definition: Basic architectural decisions are made here, such as the inclusion of floating-point units, whether to use a RISC (Reduced Instruction Set Computer) or CISC (Complex Instruction Set Computer) style, the number of ALUs, cache size, etc.

3. Functional Design: Defines the major functional units of the system and thereby facilitates the identification of interconnect requirements between units and the physical and electrical specifications of each unit. A block diagram is decided upon, with the numbers of inputs, outputs and timing decided, without any details of the internal structure.

4. Logic Design: The actual logic is developed at this level. Boolean expressions, control flow, word widths, register allocation, etc. are developed, and the outcome is called a Register Transfer Level (RTL) description. This part is implemented with hardware description languages such as VHDL and/or Verilog (a small RTL sketch is given after this list). Gate minimization techniques are employed to find the simplest, or rather the smallest and most effective, implementation of the logic.

5. Circuit Design: While the logic design gives the simplified implementation of the logic, the realization of the circuit in the form of a netlist is done in this step. Gates, transistors and interconnects are put in place to make a netlist. This again is a software step, and the outcome is checked via simulation.

6. Physical Design: The conversion of the netlist into its geometrical representation is done in
this step and the result is called a layout. This step follows some predefined fixed rules like the
lambda rules which provide the exact details of the size, ratio and spacing between components.
This step is further divided into sub-steps which are:

6.1 Circuit Partitioning: Because of the huge number of transistors involved, it is not possible
to handle the entire circuit all at once due to limitations on computational capabilities and
memory requirements. Hence the whole circuit is broken down into blocks which are
interconnected.
6.2 Floor Planning and Placement: Choosing the best layout for each block from the partitioning step and for the overall chip, taking into account the interconnect area between blocks, and fixing the exact positioning on the chip so as to minimize the area while meeting the performance constraints through an iterative approach; these are the major design tasks taken care of in this step.
6.3 Routing: The quality of placement becomes evident only after this step is completed.
Routing involves the completion of the interconnections between modules. This is completed in
two steps. First connections are completed between blocks without taking into consideration the
exact geometric details of each wire and pin. Then, a detailed routing step completes point to
point connections between pins on the blocks.
6.4 Layout Compaction: The smaller the chip size can get, the better it is. The compression of
the layout from all directions to minimize the chip area thereby reducing wire lengths, signal
delays and overall cost takes place in this design step.
6.5 Extraction and Verification: The circuit is extracted from the layout and compared with the original netlist; performance verification, reliability verification and layout-correctness checking are done before the final step of packaging.

7. Packaging: The chips are put together on a Printed Circuit Board or a Multi Chip Module
to obtain the final finished product.
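As a small illustration of the logic design step (4) above, here is a minimal Verilog sketch of a register-transfer-level description; the module and signal names are ours, purely for illustration, not part of any specific design flow:

// Hypothetical RTL sketch for design step 4 (Logic Design): a 4-bit
// registered adder described at the register-transfer level.
module rtl_adder4 (
    input  wire       clk,
    input  wire       rst_n,   // active-low synchronous reset
    input  wire [3:0] a, b,    // operand inputs
    output reg  [4:0] sum      // registered 5-bit result (includes carry)
);
    always @(posedge clk) begin
        if (!rst_n)
            sum <= 5'd0;       // synchronous reset
        else
            sum <= a + b;      // register transfer: sum <= a + b
    end
endmodule

In a real flow, such RTL would next be verified by simulation and then synthesized to gates, matching the step-by-step approach described above.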

Initially, design can be done with three different methodologies, which provide different levels of freedom of customization to the designer. In increasing order of customization support, which also means an increasing amount of work on the part of the designer, the design methods are FPGAs and PLDs, standard cell (semi-custom) design, and full custom design.

While FPGAs have built-in libraries and a board already built with interconnections and blocks in place, semi-custom design allows the placement of blocks in a user-defined custom fashion with some independence, while most libraries are still available for development. Full custom design adopts a start-from-scratch approach, where the designer is required to create the whole set of libraries and also has full control over block development, placement and routing. This is also the usual progression from entry-level to professional design.
Fig 1.3: Future of VLSI

Where do we actually see VLSI technology in action? Everywhere: in personal computers, cell phones, digital cameras and almost any electronic gadget. There are certain key issues that serve as active areas of research and are constantly improving as the field continues to mature. The figures easily show how Gordon Moore proved to be a visionary, as the trend predicted by his law still continues to hold, with little deviation, and doesn't show any signs of stopping in the near future. VLSI has come a far distance from the time when chips were truly hand-crafted. But as we near the limits of miniaturization of silicon wafers, design issues have cropped up.

VLSI is dominated by CMOS technology, and much like other logic families, it has its limitations, which have been battled and improved upon over the years. Taking the example of a processor, the process technology has rapidly shrunk from 180 nm in 1999 to 65 nm in 2008; now it stands at 45 nm, with attempts being made to reduce it further (32 nm), while the die area, which had initially shrunk, is now increasing owing to the added benefit of greater packing density: a larger die means more transistors on a chip.

As the number of transistors increases, the power dissipation increases, and so does the noise. In terms of heat generated per unit area, chips have already neared the nozzle of a jet engine. At the same time, scaling threshold voltages beyond a certain point poses serious limitations on providing low dynamic power dissipation with increased complexity. The metal layers and the interconnects, both global and local, also tend to get messy at such nano levels.

Even on the fabrication front, we are fast approaching the optical limit of photolithographic processes, beyond which the feature size cannot be reduced because of decreased accuracy. This has opened up extreme ultraviolet lithography techniques. The high-speed clocks used now make it hard to reduce clock skew, imposing tight timing constraints, and this has opened up a new frontier in parallel processing. Above all, we seem to be fast approaching the atom-thin gate oxide thickness, where there might be only a single layer of atoms serving as the oxide layer in CMOS transistors. New alternatives, like gallium arsenide technology, are becoming an active area of research owing to this.
Chapter-2
INTRODUCTION TO ADDERS
2.1 Motivation
To humans, decimal numbers are easy to comprehend and use for performing arithmetic. However, in digital systems, such as a microprocessor, DSP (digital signal processor) or ASIC (application-specific integrated circuit), binary numbers are more pragmatic for a given computation. This is because binary (two-valued) signals map directly onto the on/off switching behavior that digital hardware can represent reliably and efficiently.

Binary adders are among the most essential logic elements within a digital system. In addition, binary adders are also helpful in units other than arithmetic logic units (ALUs), such as multipliers, dividers and memory addressing. Binary addition is so essential that any improvement in it can result in a performance boost for any computing system and, hence, help improve the performance of the entire system.

The major problem in binary addition is the carry chain. As the width of the input operands increases, the length of the carry chain increases. Figure 2.1 demonstrates an example of an 8-bit binary add operation and how the carry chain is affected. This example shows that the worst case occurs when the carry travels the longest possible path, from the least significant bit (LSB) to the most significant bit (MSB). In order to improve the performance of carry-propagate adders, it is possible to accelerate the carry chain, but not to eliminate it. Consequently, most digital designers resort to building faster adders when optimizing a computer architecture, because adders tend to set the critical path for most computations.

Figure 2.1: Binary Adder Example.


The binary adder is the critical element in most digital circuit designs including digital
signal processors (DSP) and microprocessor data path units. As such, extensive research
continues to be focused on improving the power delay performance of the adder. In VLSI
implementations, parallel-prefix adders are known to have the best performance. Reconfigurable
logic such as Field Programmable Gate Arrays (FPGAs) has been gaining in popularity in recent
years because it offers improved performance in terms of speed and power over DSP-based and
microprocessor-based solutions for many practical designs involving mobile DSP and
telecommunications applications and a significant reduction in development time and cost over
Application Specific Integrated Circuit (ASIC) designs.

The power advantage is especially important with the growing popularity of mobile and portable electronics, which make extensive use of DSP functions. However, because of the structure of the configurable logic and routing resources in FPGAs, parallel-prefix adders will have a different performance than in VLSI implementations. In particular, most modern FPGAs employ a fast-carry chain which optimizes the carry path for the simple ripple carry adder (RCA). In this paper, the practical issues involved in designing and implementing tree-based adders on FPGAs are described. Several tree-based adder structures are implemented and characterized on an FPGA and compared with the ripple carry adder (RCA) and the carry skip adder (CSA). Finally, some conclusions and suggestions for improving FPGA designs to enable better tree-based adder performance are given.

2.2 Carry-Propagate Adders

Binary carry-propagate adders have been studied extensively, with much of the published work attacking the carry-chain problem. Binary adders have evolved from linear adders, which have a delay approximately proportional to the width of the adder, e.g. the ripple-carry adder (RCA), to logarithmic-delay adders, such as the carry-lookahead adder (CLA). There are additional performance-enhancing schemes, including the carry-increment adder and the Ling adder, that can further improve the carry chain; however, in Very Large Scale Integration (VLSI) digital systems, the most efficient way of performing binary addition involves parallel-prefix trees, because their regular structures exhibit logarithmic delay.

2.3 Research Contributions


The implementations that have been developed in this work help to improve the design of carry select adders and their associated computing architectures. This has the potential of impacting many application-specific and general-purpose computer architectures. Consequently, this work can impact the designs of many computing systems, as well as many areas of engineering and science. In this paper, the practical issues involved in designing and implementing carry select adders on FPGAs are described. Several carry select adder structures are implemented and characterized on an FPGA and compared with the CSLA with ripple carry adder (RCA) and the CSLA with binary excess converter. Finally, some conclusions and suggestions for improving FPGA designs to enable better carry select adder performance are given.
Chapter-3
BINARY ADDER SCHEMES

Adders are among the most essential components of digital building blocks; however, the performance of adders becomes more critical as technology advances. The problem of addition involves algorithms in Boolean algebra and their respective circuit implementations. Algorithmically, there are linear-delay adders like the ripple-carry adder (RCA), which is the most straightforward but slowest. Adders like the carry-skip adder (CSKA), carry-select adder (CSLA) and carry-increment adder (CINA) are linear-based adders with optimized carry chains that improve upon the linear chain within a ripple-carry adder. Carry-lookahead adders (CLA) have logarithmic delay and have evolved into parallel-prefix structures. Other schemes, like Ling adders, NAND/NOR adders and carry-save adders, can help improve performance as well. This chapter gives background information on the architectures of adder algorithms. In the following sections, the adders are characterized with a linear gate model, which is a rough estimate of the complexity of a real implementation. Although this evaluation method can be misleading for VLSI implementers, such an estimate provides sufficient insight to understand the design trade-offs among adder algorithms.

3.1 Binary Adder Notations and Operations

As mentioned previously, adders in VLSI digital systems use binary notation. In that case, addition is done bit by bit using Boolean equations. Consider a simple binary addition with two n-bit inputs A and B and a one-bit carry-in cin, along with an n-bit output S.

Figure 3.1: 1-bit Half Adder.


S = A + B + cin

where A = an-1 an-2 ... a0 and B = bn-1 bn-2 ... b0.

The + in the equation above is the regular add operation. However, in the binary world, only Boolean algebra works. For add-related operations, AND, OR and exclusive-OR (XOR) are required. In the following, a dot between two single-bit variables, e.g. a · b, denotes 'a AND b'; a + b denotes 'a OR b'; and a ⊕ b denotes 'a XOR b'.

Considering the situation of adding two bits, the sum s and carry c can be expressed using the Boolean operations mentioned above:

si = ai ⊕ bi
ci+1 = ai · bi
These equations can be implemented as shown in Figure 3.1. The figure shows a half adder, which takes only two input bits. The solid line highlights the critical path, i.e. the longest path from the input to the output.

The equations can be extended to perform a full add operation, where there is a carry input:

si = ai ⊕ bi ⊕ ci
ci+1 = ai · bi + ai · ci + bi · ci

Figure 3.2: 1-bit Full Adder.

A full adder can be built based on the equations above. The block diagram of a 1-bit full adder is shown in Figure 3.2. The full adder is composed of two half adders and an OR gate for computing the carry-out.

Using Boolean algebra, the equivalence can be easily proven.
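The full-adder equations above map directly onto gate-level Verilog. A minimal sketch (module and port names are ours, chosen only for illustration):

// Gate-level 1-bit full adder built from the equations above:
// si = ai XOR bi XOR ci and ci+1 = ai.bi + ai.ci + bi.ci.
module full_adder (
    input  wire a, b, cin,
    output wire sum, cout
);
    assign sum  = a ^ b ^ cin;                      // sum bit
    assign cout = (a & b) | (a & cin) | (b & cin);  // carry: majority of a, b, cin
endmodule

The three AND terms in cout are simply the majority function of a, b and cin, which is the equivalence the Boolean proof above establishes.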


To help the computation of the carry for each bit, two binary literals are introduced. They are called carry generate and carry propagate, denoted by gi and pi. Another literal called the temporary sum ti is employed as well. The relations between the inputs and these literals are

gi = ai · bi
pi = ai + bi
ti = ai ⊕ bi

where i is an integer and 0 ≤ i < n.

With the help of the literals above, the output carry and sum at each bit can be written as

ci+1 = gi + pi · ci
si = ti ⊕ ci

In some literature, the carry propagate pi can be replaced with the temporary sum ti in order to save logic gates. Here the two terms are kept separate in order to clarify the concepts. For example, in Ling adders, only pi is used as the carry propagate.

The single-bit carry generate/propagate can be extended to the group versions G and P. The following equations show the inherent relations:

Gi:k = Gi:j + Pi:j · Gj-1:k
Pi:k = Pi:j · Pj-1:k

where i : k denotes the group term from i through k. Using the group carry generate/propagate, the carry can be expressed as

ci+1 = Gi:j + Pi:j · cj

3.2 Ripple-Carry Adders (RCA)

The simplest way of doing binary addition is to connect the carry-out from the previous bit to the next bit's carry-in. Each bit takes the carry-in as one of its inputs and outputs a sum bit and a carry-out bit; hence the name ripple-carry adder. This type of adder is built by cascading 1-bit full adders. A 4-bit ripple-carry adder is shown in Figure 3.3. Each trapezoidal symbol represents a single-bit full adder. At the top of the figure, the carry is rippled through the adder from cin to cout.
Figure 3.3: Ripple-Carry Adder.

It can be observed in Figure 3.3 that the critical path, highlighted with a solid line, runs from the least significant bit (LSB) of the input (a0 or b0) to the most significant bit (MSB) of the sum (sn-1). Assume that each simple gate, including AND, OR and XOR, has a delay of 2Δ, that a NOT gate has a delay of 1Δ, and that all gates have an area of 1 unit. Using this analysis and assuming that each add block is built with a 9-gate full adder, the critical path is calculated as follows:

ai, bi → si = 10Δ
ai, bi → ci+1 = 9Δ
ci → si = 5Δ
ci → ci+1 = 4Δ

The critical path, or worst-case delay, is

trca = {9 + (n - 2) × 4 + 5}Δ = {4n + 6}Δ

As each bit takes 9 gates, the area is simply 9n for an n-bit RCA.
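For reference, the ripple-carry structure of Figure 3.3 can be written as a parameterized Verilog module; this is a sketch in our own naming, not code from the text:

// Parameterized ripple-carry adder: N cascaded full adders, with the
// carry rippling from bit 0 to bit N-1 as in Figure 3.3.
module rca #(parameter N = 4) (
    input  wire [N-1:0] a, b,
    input  wire         cin,
    output wire [N-1:0] s,
    output wire         cout
);
    wire [N:0] c;                 // c[i] is the carry into bit i
    assign c[0] = cin;
    genvar i;
    generate
        for (i = 0; i < N; i = i + 1) begin : fa_chain
            assign s[i]   = a[i] ^ b[i] ^ c[i];
            assign c[i+1] = (a[i] & b[i]) | (a[i] & c[i]) | (b[i] & c[i]);
        end
    endgenerate
    assign cout = c[N];
endmodule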
3.3 Carry-Select Adders (CSLA)

Simple adders, like ripple-carry adders, are slow, since the carry has to travel through every full adder block. There is a way to improve the speed by duplicating hardware, exploiting the fact that the carry can only be either 0 or 1. The method is based on the conditional-sum adder and is extended to the carry-select adder. With two RCAs, each computing the case for one polarity of the carry-in, the sum can be obtained with a 2×1 multiplexer with the carry-in as the select signal. An example of a 16-bit carry-select adder is shown in Figure 3.4. In the figure, the adder is grouped into four 4-bit blocks. The 1-bit multiplexers for sum selection can be implemented as Figure 3.5 shows. The two carry terms are computed with the carry input tied to a constant 0 or 1:

Figure 3.4: Carry-Select Adder.

In Figure 3.4, each pair of adjacent 4-bit blocks utilizes the carry relationship

ci+4 = c0i+4 + c1i+4 · ci

The relationship can be verified with the properties of the group carry generate/propagate: c0i+4 can be written as

c0i+4 = Gi+4:i + Pi+4:i · 0 = Gi+4:i

Similarly, c1i+4 can be written as

c1i+4 = Gi+4:i + Pi+4:i · 1 = Gi+4:i + Pi+4:i

Then

c0i+4 + c1i+4 · ci = Gi+4:i + (Gi+4:i + Pi+4:i) · ci
= Gi+4:i + Gi+4:i · ci + Pi+4:i · ci
= Gi+4:i + Pi+4:i · ci
= ci+4
Figure 3.5: 2-1 Multiplexor.

Varying the number of bits in each group works for carry-select adders as well. The temporary sums can be defined as follows:

s0i+1 = ti+1 ⊕ c0i
s1i+1 = ti+1 ⊕ c1i

The final sum is selected by the carry-in cj between the temporary sums already calculated:

si+1 = ¬cj · s0i+1 + cj · s1i+1
Assuming the block size is fixed at r bits, the n-bit adder is composed of k groups of r-bit blocks, i.e. n = r × k. The critical path through the first RCA has a delay of (4r + 5)Δ from the input to the carry-out, and there are k - 2 blocks that follow, each with a delay of 4Δ for the carry to go through. The final delay comes from the multiplexer, which has a delay of 5Δ, as indicated in Figure 3.5. The total delay for this CSLA is calculated as

tcsla = {4r + 5 + 4(k - 2) + 5}Δ = {4r + 4k + 2}Δ

The area can be estimated with (2n - r) FAs, (n - r) multiplexers and (k - 1) AND/OR logic blocks. As mentioned above, each FA has an area of 9 units; the multiplexer of Figure 3.5 takes 4 units of area. The total area can be estimated as

9(2n - r) + 2(k - 1) + 4(n - r) = 22n - 13r + 2k - 2

The delay of the critical path in the CSLA is reduced at the cost of increased area. For example, in Figure 3.4, k = 4, r = 4 and n = 16. The delay for the CSLA is 34Δ, compared with 70Δ for the 16-bit RCA, and the area is 306 units, while the RCA has an area of 144 units. The delay of the CSLA is thus about half that of the RCA, but its area is more than twice that of the RCA. The adder can also be modified to have variable block sizes, which gives better delay and slightly less area.
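A sketch of the carry-select scheme in Verilog, reusing the rca module from Section 3.2 (parameter and signal names are ours). For uniformity it duplicates every block, whereas in Figure 3.4 the first block can be a single RCA:

// Carry-select adder: each R-bit block is computed twice, once assuming
// carry-in 0 and once assuming carry-in 1; the incoming block carry then
// selects the precomputed result.
module csla #(parameter N = 16, R = 4) (
    input  wire [N-1:0] a, b,
    input  wire         cin,
    output wire [N-1:0] s,
    output wire         cout
);
    localparam K = N / R;           // number of R-bit blocks
    wire [K:0] c;                   // selected carry at each block boundary
    assign c[0] = cin;
    genvar j;
    generate
        for (j = 0; j < K; j = j + 1) begin : blk
            wire [R-1:0] s0, s1;    // sums assuming carry-in 0 and 1
            wire         c0, c1;    // carry-outs assuming carry-in 0 and 1
            rca #(R) u0 (.a(a[j*R +: R]), .b(b[j*R +: R]), .cin(1'b0), .s(s0), .cout(c0));
            rca #(R) u1 (.a(a[j*R +: R]), .b(b[j*R +: R]), .cin(1'b1), .s(s1), .cout(c1));
            assign s[j*R +: R] = c[j] ? s1 : s0;   // 2:1 sum-select muxes
            assign c[j+1]      = c0 | (c1 & c[j]); // ci+4 = c0i+4 + c1i+4 . ci
        end
    endgenerate
    assign cout = c[K];
endmodule

The line computing c[j+1] is exactly the relationship ci+4 = c0i+4 + c1i+4 · ci derived above.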

3.4 Carry-Skip Adders (CSKA)

There is an alternative way of reducing the delay in the carry chain of an RCA: check whether a carry will propagate through to the next block. Adders based on this idea are called carry-skip adders. The carry out of each block is

ci+1 = ¬Pi:j · Gi:j + Pi:j · cj

Figure 3.6 shows an example of a 16-bit carry-skip adder.

Figure 3.6: Carry-Skip Adder.

The carry-out of each block is determined by selecting between the carry-in and Gi:j using Pi:j. When Pi:j = 1, the carry-in cj is allowed to pass through the block immediately. Otherwise, the carry-out is determined by Gi:j. The CSKA has less delay in the carry chain with only a little extra logic. Further improvement can generally be achieved by making the central block sizes larger and the two end block sizes smaller.
Assuming the n-bit adder is divided evenly into k r-bit blocks, part of the critical path is from the LSB input through the MSB output of the final RCA. The first delay is from the LSB input to the carry-out of the first block, which is (4r + 5)Δ. Then, there are k - 2 skip logic blocks, each with a delay of 3Δ. Each skip logic block includes one 4-input AND gate for computing Pi+3:i and one AND/OR logic block. The final RCA has a delay from its input to the sum at the MSB, which is (4r + 6)Δ. The total delay is calculated as

tcska = {4r + 5 + 3(k - 2) + 4r + 6}Δ = {8r + 3k + 5}Δ

The CSKA has n FAs and k - 2 skip logic blocks. Each skip logic block has an area of 3 units. Therefore, the total area is estimated as 9n + 3(k - 2) = 9n + 3k - 6.
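The same skip idea in Verilog, again reusing the rca module from Section 3.2 (names ours). The block propagate is the AND of the bitwise XOR propagate signals, and the skip multiplexer implements ci+1 = ¬Pi:j · Gi:j + Pi:j · cj, with the block's ripple carry-out playing the role of Gi:j when the block does not propagate:

// Fixed-block carry-skip adder per Figure 3.6: each R-bit RCA block is
// bypassed whenever its block propagate P is 1, so the carry need not
// ripple through the block.
module cska #(parameter N = 16, R = 4) (
    input  wire [N-1:0] a, b,
    input  wire         cin,
    output wire [N-1:0] s,
    output wire         cout
);
    localparam K = N / R;
    wire [K:0] c;                                   // block boundary carries
    assign c[0] = cin;
    genvar j;
    generate
        for (j = 0; j < K; j = j + 1) begin : blk
            wire rc;                                 // ripple carry-out of the block
            wire P = &(a[j*R +: R] ^ b[j*R +: R]);   // block propagate (AND of pi)
            rca #(R) u (.a(a[j*R +: R]), .b(b[j*R +: R]), .cin(c[j]),
                        .s(s[j*R +: R]), .cout(rc));
            assign c[j+1] = P ? c[j] : rc;           // skip multiplexer
        end
    endgenerate
    assign cout = c[K];
endmodule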

3.5 Carry-Look-ahead Adders (CLA)

The carry chain can also be accelerated with carry generate/propagate logic. Carry-lookahead adders employ the carry generate/propagate signals in groups to generate the carry for the next block. In other words, digital logic is used to calculate all the carries at once. When building a CLA, a reduced version of the full adder, called a reduced full adder (RFA), is utilized. Figure 3.7 shows the block diagram of an RFA. The carry generate/propagate signals gi/pi feed the carry-lookahead generator (CLG), which produces the carry inputs of the RFAs.

Figure 3.7: Reduced Full Adder.

The theory of the CLA is based on the equations below. Figure 3.8 shows an example of a 16-bit carry-lookahead adder. In the figure, each block is fixed at 4 bits. BCLG stands for block carry-lookahead generator, which generates the generate/propagate signals in group form. For the 4-bit BCLG, the following equations hold:

Gi+3:i = gi+3 + pi+3 · gi+2 + pi+3 · pi+2 · gi+1 + pi+3 · pi+2 · pi+1 · gi
Pi+3:i = pi+3 · pi+2 · pi+1 · pi

The group generate takes a delay of 4Δ, which is an OR after an AND; the carry-out of the group can then be computed as

ci+4 = Gi+3:i + Pi+3:i · ci

The carry computation also has a delay of 4Δ, again an OR after an AND. The 4-bit BCLG has an area of 14 units.
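A 4-bit BCLG written out in Verilog directly from the equations above (module and port names are ours):

// 4-bit block carry-lookahead generator: computes the carries into bits
// 1..3 of the block plus the group generate/propagate for the next level.
module bclg4 (
    input  wire [3:0] g, p,      // bit-level generate/propagate
    input  wire       cin,       // block carry-in
    output wire [3:1] c,         // carries into bits 1..3
    output wire       G, P       // group generate/propagate (Gi+3:i, Pi+3:i)
);
    assign c[1] = g[0] | (p[0] & cin);
    assign c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin);
    assign c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                       | (p[2] & p[1] & p[0] & cin);
    assign G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                    | (p[3] & p[2] & p[1] & g[0]);
    assign P = &p;               // Pi+3:i = p3.p2.p1.p0
endmodule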
The critical path of the 16-bit CLA runs from the input operands through one RFA, then three BCLGs, and through the final RFA. That is, the critical path shown in Figure 3.8 is from a0/b0 to s7. The delay is the same from a0/b0 to s11 or s15; the critical path grows logarithmically, based on the group size. The delays are listed below:

a0, b0 → p0, g0 = 2Δ
p0, g0 → G3:0 = 4Δ
G3:0 → c4 = 4Δ
c4 → c7 = 4Δ
c7 → s7 = 5Δ
a0, b0 → s7 = 19Δ

The 16-bit CLA is composed of 16 RFAs and 5 BCLGs, which amounts to an area of 16 × 8 + 5 × 14 = 198 units.

Extending the calculation above, general estimates for the delay and area can be derived. Assume the CLA has n bits, divided into k groups of r-bit blocks. It requires ⌈logr n⌉ logic levels. The critical path starts from the input, goes through the p0/g0 generation and the BCLG logic, and ends with the carry-in to the sum at the MSB. The generation of (p, g) takes a delay of 2Δ. The group version of (p, g) generated by the BCLG has a delay of 4Δ. For each following level, there is a 4Δ delay for the group (p, g) generation and a 4Δ delay from the BCLG generation to the next level, totaling 8Δ per level. Finally, from the last carry to the sum there is a delay of 5Δ. Thus, the total delay is calculated as follows:

tcla = {2 + 8(⌈logr n⌉ - 1) + 4 + 5}Δ = {3 + 8⌈logr n⌉}Δ
Chapter-4
Carry Skip Adder
A carry-skip adder (also known as a carry-bypass adder) is an adder implementation that improves on the delay of a ripple-carry adder with little extra effort compared with other adders. The improvement of the worst-case delay is achieved by using several carry-skip adders to form a block-carry-skip adder.

The worst case for a simple one-level carry-ripple adder occurs when the propagate condition [1] is true for each digit pair (ai, bi). Then the carry-in ripples through the whole n-bit adder and appears as the carry-out only after the delay of the full ripple chain.

Full adder with additional generate and propagate signals.

For each operand input bit pair (ai, bi) the propagate condition pi = ai ⊕ bi is determined using an XOR gate (see the full adder above). When all propagate conditions are true, the carry-in bit determines the carry-out bit.

The n-bit carry-skip adder consists of an n-bit carry-ripple chain, an n-input AND gate and one multiplexer. Each propagate bit pi provided by the carry-ripple chain is connected to the n-input AND gate. The resulting bit is used as the select bit of a multiplexer that switches either the last carry bit or the carry-in to the carry-out signal.

This greatly reduces the latency of the adder through its critical path, since the carry bit for each block can now "skip" over blocks whose group propagate signal is set to logic 1 (as opposed to a long ripple-carry chain, which would require the carry to ripple through each bit in the adder). The number of inputs of the AND gate is equal to the width of the adder. For a large width, this becomes impractical and leads to additional delays, because the AND gate has to be built as a tree. A good width is achieved when the sum logic has the same depth as the n-input AND gate and the multiplexer.
The critical path of a carry-skip adder begins at the first full adder, passes through all adders and ends at the final sum bit. Carry-skip adders are chained (see block-carry-skip adders below) to reduce the overall critical path, since a single n-bit carry-skip adder has no real speed benefit compared with an n-bit carry-ripple adder.

The skip logic consists of the n-input AND gate and one multiplexer.

As the propagate signals are computed in parallel and are available early, the critical path for the skip logic in a carry-skip adder consists only of the delay imposed by the multiplexer (conditional skip).

4-bit carry-skip adder.


Block-carry-skip adders:

16-bit fixed-block-carry-skip adder with a block size of 4 bits.

Block-carry-skip adders are composed of a number of carry-skip adders. There are two types of block-carry-skip adders: fixed block width and variable block width. The two operands A and B are split into blocks of m bits each.


Fixed size block-carry-skip adders:

Fixed size block-carry-skip adders split the n input bits into blocks of m bits each, resulting in k = n/m blocks. The critical path consists of the ripple path and the skip element of the first block, the skip paths enclosed between the first and the last block, and finally the ripple path of the last block.

The optimal block size m for a given adder width n is derived by differentiating the critical-path delay with respect to m and equating the derivative to 0. Only positive block sizes are realizable.
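As an illustration (ours, using the linear gate model of Section 3.4 rather than a physical delay model): with k = n/r blocks of r bits, the total delay was tcska = {8r + 3k + 5}Δ. Treating r as continuous,

T(r) = 8r + 3n/r + 5
dT/dr = 8 - 3n/r² = 0  →  ropt = (3n/8)^1/2

For n = 16 this gives ropt ≈ 2.4, so a fixed block size of 2 or 4 bits (whichever divides n) would be evaluated; the negative root is discarded because only positive block sizes are realizable.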

Variable size block-carry-skip adders:

The performance can be improved, i.e. all carries propagated more quickly, by varying the block sizes. Accordingly, the initial blocks of the adder are made smaller so as to quickly detect carry generates that must be propagated the furthest, the middle blocks are made larger because they are not the problem case, and then the most significant blocks are again made smaller so that the late-arriving carry inputs can be processed quickly.
Multilevel carry-skip adders:

By using additional skip blocks in an additional layer, the block-propagate signals are further summarized and used to perform larger skips, making the adder even faster.


Carry-skip optimization:

The problem of determining the block sizes and number of levels required to make the physically fastest carry-skip adder is known as the 'carry-skip adder optimization problem'. The problem is made complex by the fact that carry-skip adders are implemented with physical devices whose size and other parameters also affect addition time. The carry-skip optimization problem for variable block sizes and multiple levels for an arbitrary device process node was solved by Thomas W. Lynch in [2]. This reference also shows that carry-skip addition is the same as parallel-prefix addition and is thus related to, and for some configurations identical to, the Han-Carlson, Brent-Kung and Kogge-Stone adders, among a number of other adder types.
Implementation overview:

Breaking this down into more specific terms, in order to build a 4-bit carry-bypass adder, 6 full adders would be needed. The input buses would be a 4-bit A and a 4-bit B, with a carry-in (CIN) signal. The output would be a 4-bit bus X and a carry-out signal (COUT).

The first two full adders would add the first two bits together; the carry-out signal from the second full adder (call it c2) drives the select signal of three 2-to-1 multiplexers. The second set of two full adders would add the last two bits assuming c2 is a logical 0, and the final set of full adders would assume that c2 is a logical 1. The multiplexers then control which output signals are used for COUT and the upper two sum bits.
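The description above translates almost line for line into structural Verilog. The following sketch uses our own names (x, c2, and the _0/_1 suffixes for the duplicated-pair outputs):

// 4-bit adder as described above: two full adders for the low bits, two
// duplicated pairs for the high bits (assuming c2 = 0 and c2 = 1), and
// three 2:1 muxes selecting X[3:2] and COUT.
module carry_bypass4 (
    input  wire [3:0] a, b,
    input  wire       cin,
    output wire [3:0] x,
    output wire       cout
);
    wire c1, c2;                                   // ripple carries of the low bits
    assign x[0] = a[0] ^ b[0] ^ cin;
    assign c1   = (a[0] & b[0]) | (a[0] & cin) | (b[0] & cin);
    assign x[1] = a[1] ^ b[1] ^ c1;
    assign c2   = (a[1] & b[1]) | (a[1] & c1) | (b[1] & c1);
    // Second pair: bits 3:2 computed assuming c2 = 0
    wire s2_0, s3_0, c3_0, cout_0;
    assign s2_0   = a[2] ^ b[2];
    assign c3_0   = a[2] & b[2];
    assign s3_0   = a[3] ^ b[3] ^ c3_0;
    assign cout_0 = (a[3] & b[3]) | (a[3] & c3_0) | (b[3] & c3_0);
    // Third pair: bits 3:2 computed assuming c2 = 1
    wire s2_1, s3_1, c3_1, cout_1;
    assign s2_1   = a[2] ~^ b[2];                  // a ^ b ^ 1
    assign c3_1   = a[2] | b[2];
    assign s3_1   = a[3] ^ b[3] ^ c3_1;
    assign cout_1 = (a[3] & b[3]) | (a[3] & c3_1) | (b[3] & c3_1);
    // Three 2:1 multiplexers, all selected by c2
    assign x[2] = c2 ? s2_1 : s2_0;
    assign x[3] = c2 ? s3_1 : s3_0;
    assign cout = c2 ? cout_1 : cout_0;
endmodule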
Chapter-5
Proposed Carry Skip Adder

I. INTRODUCTION

ADDERS are a key building block in arithmetic and logic units (ALUs) [1]; hence, increasing their speed and reducing their power/energy consumption strongly affect the speed and power consumption of processors. There are many works on the subject of optimizing the speed and power of these units, reported in [2]-[9]. Obviously, it is highly desirable to achieve higher speeds at low power/energy consumption, which is a challenge for the designers of general-purpose processors.

One of the effective techniques to lower the power consumption of digital circuits is to reduce the supply voltage, owing to the quadratic dependence of the switching energy on the voltage.
Moreover, the subthreshold current, which is the main leakage component in OFF devices, has
an exponential dependence on the supply voltage level through the drain-induced barrier
lowering effect [10]. Depending on the amount of the supply voltage reduction, the operation of
ON devices may reside in the superthreshold, near-threshold, or subthreshold regions. Working
in the superthreshold region provides us with lower delay and higher switching and leakage
powers compared with the near/subthreshold regions. In the subthreshold region, the logic gate
delay and leakage power exhibit exponential dependences on the supply and threshold voltages.
Moreover, these voltages are (potentially) subject to process and environmental variations in the
nanoscale technologies. The variations increase uncertainties in the aforesaid performance
parameters. In addition, the small subthreshold current causes a large delay for the circuits
operating in the subthreshold region [10].

Recently, the near-threshold region has been considered as a region that provides a more
desirable tradeoff point between delay and power dissipation compared with that of the
subthreshold one, because it results in lower delay compared with the subthreshold region and
significantly lowers switching and leakage powers compared with the superthreshold region. In
addition, near-threshold operation, which uses supply voltage levels near the threshold voltage
of transistors [11], suffers considerably less from the process and environmental variations
compared with the subthreshold region.

The dependence of the power (and performance) on the supply voltage has been the motivation for the design of circuits with the feature of dynamic voltage and frequency scaling. In these circuits, to reduce the energy consumption, the system may change the voltage (and frequency) of the circuit based on the workload requirement [12]. For these systems, the circuit should be able to operate under a wide range of supply voltage levels. Of course, achieving higher speeds at lower supply voltages for the computational blocks, with the adder as one of the main components, could be crucial in the design of high-speed, yet energy-efficient, processors.

In addition to the knob of the supply voltage, one may choose between different adder
structures/families for optimizing power and speed. There are many adder families with
different delays, power consumptions, and area usages. Examples include ripple carry adder
(RCA), carry increment adder (CIA), carry skip adder (CSKA), carry select adder (CSLA), and
parallel prefix adders (PPAs). The descriptions of each of these adder architectures along with
their characteristics may be found in [1] and [13]. The RCA has the simplest structure with the
smallest area and power consumption but with the worst critical path delay. In the CSLA, the
speed, power consumption, and area usages are considerably larger than those of the RCA. The
PPAs, which are also called carry look-ahead adders, exploit direct parallel prefix structures to
generate the carry as fast as possible [14]. There are different types of parallel prefix algorithms that lead to different PPA structures with different performances. As an example, the Kogge-Stone adder (KSA) [15] is one of the fastest structures, but it results in large power consumption and area usage. It should be noted that the structural complexity of PPAs is higher than that of other adder schemes [13], [16].

The CSKA, which is an efficient adder in terms of power consumption and area usage,
was introduced in [17]. The critical path delay of the CSKA is much smaller than the one in the
RCA, whereas its area and power consumption are similar to those of the RCA. In addition, the
power-delay product (PDP) of the CSKA is smaller than those of the CSLA and PPA structures
[19]. In addition, due to the small number of transistors, the CSKA benefits from relatively short
wiring lengths as well as a regular and simple layout [18]. The comparatively lower speed of
this adder structure, however, limits its use for high-speed applications.

In this paper, given the attractive features of the CSKA structure, we have focused on
reducing its delay by modifying its implementation based on the static CMOS logic. The
concentration on the static CMOS originates from the desire to have a reliably operating circuit
under a wide range of supply voltages in highly scaled technologies [10]. The proposed
modification increases the speed considerably while maintaining the low area and power
consumption features of the CSKA. In addition, an adjustment of the structure, based on the
variable latency technique, which in turn lowers the power consumption without considerably
impacting the CSKA speed, is also presented. To the best of our knowledge, no work concentrating on the design of CSKAs operating from the superthreshold region down to the near-threshold region, or on the design of (hybrid) variable latency CSKA structures, has been reported in the literature. Hence, the contributions of this paper can be summarized as follows.

1) Proposing a modified CSKA structure that combines the concatenation and incrementation schemes with the conventional CSKA (Conv-CSKA) structure for enhancing the speed and energy efficiency of the adder. The modification provides the ability to use simpler carry skip logic based on AOI/OAI compound gates instead of the multiplexer (a gate-level sketch is given after this list).

2) Providing a design strategy for constructing an efficient CSKA structure based on analytical expressions presented for the critical path delay.

3) Investigating the impact of voltage scaling on the efficiency of the proposed CSKA
structure (from the nominal supply voltage to the near-threshold voltage).

4) Proposing a hybrid variable latency CSKA structure based on the extension of the
suggested CSKA, by replacing some of the middle stages in its structure with a PPA,
which is modified in this paper.
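To make contribution 1) concrete, the following is a rough gate-level Verilog sketch of mux-free skip logic built from AOI/OAI compound gates. It follows the alternating-polarity idea described in the paper, but the modules, names and coding style are ours, not the authors' netlist: one stage produces the block carry in complemented form with an AOI21 gate, and the next stage, fed with complemented generate/propagate signals, restores the true polarity with an OAI21 gate.

// Skip logic for a stage whose inputs are in true polarity; G and P are
// the block generate and block propagate. Produces the carry inverted:
// cout_n = NOT(G + P.cin), a single AOI21 compound gate.
module skip_aoi (
    input  wire G, P, cin,
    output wire cout_n
);
    assign cout_n = ~(G | (P & cin));
endmodule

// Skip logic for the following stage, whose G, P and carry arrive
// complemented. Restores true polarity with a single OAI21 gate:
// cout = NOT(G_n . (P_n + cin_n)) = G + P.cin.
module skip_oai (
    input  wire G_n, P_n, cin_n,
    output wire cout
);
    assign cout = ~(G_n & (P_n | cin_n));
endmodule

Because each compound gate is a single inverting CMOS stage, the AOI/OAI pair replaces the larger and slower 2:1 multiplexer in the skip path, which is the source of the speed advantage the paper attributes to this skip logic.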

II. PRIOR WORK

Since the focus of this paper is on the CSKA structure, the work related to this adder is reviewed first, and then variable latency adder structures are discussed.
A. Modifying CSKAs for Improving Speed

The conventional structure of the CSKA consists of stages containing a chain of full adders (FAs) (the RCA block) and a 2:1 multiplexer (the carry skip logic). The RCA blocks are connected to each other through 2:1 multiplexers, which can be placed in one or more levels [19]. The CSKA configuration (i.e., the number of FAs per stage) has a great impact on the speed of this type of adder [23]. Many methods have been suggested for finding the optimum number of FAs [18]-[26]. The techniques presented in [19]-[24] make use of variable stage sizes (VSSs) to minimize the delay of adders based on a single-level carry skip logic. In [25], some methods to increase the speed of multilevel CSKAs are proposed. The techniques, however, increase the area and power consumption considerably and lead to a less regular layout. The design of a static CMOS CSKA in which the stages of the CSKA have variable sizes was suggested in [18]. In addition, to lower the propagation delay of the adder, carry look-ahead logic was utilized in each stage. Again, it had a complex layout as well as large power consumption and area usage. In addition, the design approach, which was presented only for the 32-bit adder, was not general enough to be applied to structures with different bit lengths.

Alioto and Palumbo [19] propose a simple strategy for the design of a single-level CSKA. The method is based on the VSS technique, where the near-optimal numbers of FAs are determined based on the skip time (the delay of the multiplexer) and the ripple time (the time required by a carry to ripple through an FA). The goal of this method is to decrease the critical path delay by considering a noninteger ratio of the skip time to the ripple time, in contrast to most previous works, which considered an integer ratio [17], [20]. In all of the works reviewed so far, the focus was on speed; the power consumption and area usage of the CSKAs were not considered. Even for speed, the delay of the skip logic, which is based on multiplexers and forms a large part of the adder's critical path delay [19], has not been reduced.
Fig. 1. Conventional structure of the CSKA [19].

B. Improving Efficiency of Adders at Low Supply Voltages

To improve the performance of adder structures at low supply voltage levels, some methods have been proposed in [27]-[36]. In [27]-[29], an adaptive clock stretching operation has been suggested. The method is based on the observation that the critical paths in adder units are rarely activated. Therefore, the slack time between the critical paths and the off-critical paths may be used to reduce the supply voltage. Notice that the voltage reduction must not increase the delays of the noncritical timing paths beyond the clock period, which allows keeping the original clock frequency at a reduced supply voltage level. When the critical timing paths in the adder are activated, the structure uses two clock cycles to complete the operation. This way, the power consumption is reduced considerably at the cost of a rather small throughput degradation. In [27], the efficiency of this method for reducing the power consumption of the RCA structure has been demonstrated. The CSLA structure in [28] was enhanced to use the adaptive clock stretching operation, and the enhanced structure was called cascade CSLA (C2SLA). Compared with the common CSLA structure, C2SLA uses more, and differently sized, RCA blocks. Since the slack time between the critical timing paths and the longest off-critical path was small, the supply voltage scaling, and hence the power reduction, were limited. Finally, using a hybrid structure to improve the effectiveness of the adaptive clock stretching operation has been investigated in [31] and [33]. In the proposed hybrid structure, the KSA has been used in the middle part of the C2SLA, where this combination leads to an increase in the positive slack time. However, the C2SLA and its hybrid version are not good candidates for low-power ALUs. This statement originates from the fact that, due to the logic duplication in this type of adder, the power consumption and also the PDP are still high, even at low supply voltages [33].

III. CONVENTIONAL CARRY SKIP ADDER

The structure of an N-bit Conv-CSKA, which is based on blocks of the RCA (RCA
blocks), is shown in Fig. 1. In addition to the chain of FAs in each stage, there is a carry skip
logic. For an RCA that contains N cascaded FAs, the worst propagation delay of the summation
of two N-bit numbers, A and B, belongs to the case where all the FAs are in the propagation
mode. It means that the worst case delay belongs to the case where

Pi = Ai ⊕ Bi = 1 for i = 1, . . . , N

where Pi is the propagation signal related to Ai and Bi. This shows that the delay of the RCA is linearly related to N [1]. In the case where a group of cascaded FAs is in the propagate mode, the carry output of the chain is equal to the carry input. In the CSKA, the carry skip logic detects this situation and makes the carry ready for the next stage without waiting for the operation of the FA chain to be completed. The skip operation is performed using the gates and
the multiplexer shown in the figure. Based on this explanation, the N FAs of the CSKA are
grouped in Q stages. Each stage contains an RCA block with Mj FAs ( j = 1,..., Q) and a skip
logic. In each stage, the inputs of the multiplexer (skip logic) are the carry input of the stage and
the carry output of its RCA block (FA chain). In addition, the product of the propagation signals
(P) of the stage is used as the selector signal of the multiplexer.

The CSKA may be implemented using a fixed stage size (FSS) or a variable stage size (VSS), where the highest speed is obtained with the VSS structure [19], [22]. Here, the stage size is the same as the RCA block size. In Sections III-A and III-B, these two implementations of the CSKA are described in more detail.

A. Fixed Stage Size CSKA

Assuming that each stage of the CSKA contains M FAs, there are Q = N/M stages, where for the sake of simplicity we assume Q is an integer. The input signals of the jth multiplexer are the carry output of the FA chain in the jth stage, denoted by C0j, and the carry output of the previous stage (the carry input of the jth stage), denoted by C1j (Fig. 1). The critical path of the CSKA contains three parts:

1) the path of the FA chain of the first stage, whose delay is equal to M × TCARRY;
2) the path of the intermediate carry skip multiplexers, whose delay is equal to (Q - 1) × TMUX; and
3) the path of the FA chain in the last stage, whose delay is equal to (M - 1) × TCARRY + TSUM.

Note that TCARRY, TSUM, and TMUX are the propagation delays of the carry output of an FA, the sum output of an FA, and the output of a 2:1 multiplexer, respectively.

Hence, the critical path delay of a FSS CSKA is formulated by

TD = [M × TCARRY] + [(Q − 1) × TMUX] + [(M − 1) × TCARRY + TSUM]. (1)
Based on (1), the optimum value of M (Mopt) that leads to the optimum propagation delay
may be calculated as (0.5αN)^(1/2), where α is equal to TMUX/TCARRY. Therefore, the optimum
propagation delay (TD,opt) is obtained from

TD,opt = 2 × (2αN)^(1/2) × TCARRY + TSUM − (TMUX + TCARRY). (2)
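For completeness, Mopt follows mechanically from (1) with Q = N/M; a brief derivation sketch in LaTeX notation:

\[
T_D(M) = 2M\,T_{CARRY} + \left(\frac{N}{M}-1\right)T_{MUX} - T_{CARRY} + T_{SUM}
\]
\[
\frac{dT_D}{dM} = 2\,T_{CARRY} - \frac{N}{M^2}\,T_{MUX} = 0
\;\Rightarrow\;
M_{opt} = \sqrt{0.5\,\alpha N}, \qquad \alpha = \frac{T_{MUX}}{T_{CARRY}}
\]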

Thus, the optimum delay of the FSS CSKA is almost proportional to the square root of
the product of N and α [19].

B. Variable Stage Size CSKA

As mentioned before, by assigning
variable sizes to the stages, the speed of the CSKA may be improved. The speed improvement in
this type is achieved by lowering the delays of the first and third terms in (1). These delays are
minimized by lowering the sizes of the first and last RCA blocks. For instance, the first RCA block
size may be set to one, whereas the sizes of the following blocks may increase. To determine the
rate of increase, let us express the propagation delay of C1,j (t1,j) by

t1,j = max(t0,j−1, t1,j−1) + TMUX (3)

where t0,j−1 (t1,j−1) denotes the calculating delay of the C0,j−1 (C1,j−1) signal in the (j − 1)th stage.
In a FSS CSKA, except in the first stage, t0,j is smaller than t1,j. Hence, based on (3), the delay of
t0,j−1 may be increased from t0,1 to t1,j−1 without increasing the delay of the C1,j signal. This means
that one could increase the size of the (j − 1)th stage (i.e., Mj−1) without increasing the propagation
delay of the CSKA. Therefore, the size Mj of the jth stage should be bounded by

t0,j = Mj × TCARRY ≤ t1,j. (4)
Since the last RCA block size also should be minimized, the increase in the stage size
may not be continued to the last RCA block. Thus, we justify the decrease in the RCA block
sizes toward the last stage. First, note that based on Fig. 1, the output of the jth stage is, in the
worst case, accessible after t1,j + TSUM,j. Assuming that the pth stage has the maximum RCA
block size, we wish to keep the delay of the outputs of the following stages equal to the
delay of the output of the pth stage. To keep the same worst case delay for the critical path, we
should reduce the size of the following RCA blocks. For example, for i ≥ p, the output delay of
the (i + 1)th stage is t1,i + TMUX + TSUM,i+1, where TSUM,i+1 is the delay of the (i + 1)th
RCA block for calculating all of its sum outputs when its carry input is ready. Therefore, the size
of the (i + 1)th stage should be reduced to decrease TSUM,i+1, preventing an increase in the
worst case delay (TD) of the adder. In other words, we eliminate the increase in the delay of the
next stage due to the additional multiplexer by reducing the sum delay of the RCA block. This
may be analytically expressed as

TSUM,i+1 ≤ TSUM,i − TMUX. (5)

The trend of decreasing the stage size should be continued until we produce the required
number of adder bits.

Note that, in this case, the size of the last RCA block may be only one (i.e., one FA).
Hence, to reach the highest number of input bits under a constant propagation delay, both (4) and
(5) should be satisfied. Having these constraints, we can minimize the delay of the CSKA for a
given number of input bits to find the stage sizes of an optimal structure. In this optimal
CSKA, the size of the first p stages is increased, while the size of the last (Q − p) stages is decreased.
For this structure, the pth stage, which is called the nucleus of the adder, has the maximum size [24].

Now, let us find the constraints used for determining the optimum structure in this case.
As mentioned before, when the jth stage is not in the propagate mode, the carry output of the
stage is C0,j. In this case, the maximum of t0,j is equal to Mj × TCARRY. To satisfy (4), we
increase the size of the first p stages up to the nucleus using [19]

Mj = M1 + (j − 1) × α (2 ≤ j ≤ p). (6)
In addition, the maximum of TSUM,i is equal to (Mi − 1) × TCARRY + TSUM. To
satisfy (5), the size of the last (Q − p) stages from the nucleus to the last stage should decrease
based on [19]

Mi = Mp − (i − p) × α (p < i ≤ Q). (7)
In the case where α is an integer value, the exact sizes of the stages for the optimal structure
can be determined. Subsequently, the optimal values of M1, MQ, and Q, as well as the delay of
the optimal CSKA, may be calculated [19]. In the case where α is a non-integer value, one may
realize only a near-optimal structure, as detailed in [19] and [21]. In this case, most of the time,
by setting M1 to 1 and using (6) and (7), the near-optimal structure is determined. It should be
noted that, in practice, α is a non-integer value smaller than one. This is the case that has
been studied in [19], where the estimation of the near-optimal propagation delay of the CSKA is
given by [19]

This equation may be written in a more general form by replacing TMUX with TSKIP to
allow for other logic types instead of the multiplexer. For this form, α becomes equal to
TSKIP/TCARRY. Finally, note that in real implementations, TSKIP < TCARRY, and hence, ⌈α/2⌉
becomes equal to one. Thus, (8) may be written as
Note that, as (9) reveals, a large portion of the critical path delay is due to the carry
skip logics.

Fig. 2. Proposed CI-CSKA structure

IV. PROPOSED CSKA STRUCTURE

Based on the discussion presented in Section III, it is concluded that by reducing the
delay of the skip logic, one may lower the propagation delay of the CSKA significantly. Hence,
in this paper, we present a modified CSKA structure that reduces this delay.

A. General Description of the Proposed Structure

The structure is based on combining the concatenation and the incrementation schemes
[13] with the Conv-CSKA structure, and hence, is denoted by CI-CSKA. It provides us with the
ability to use simpler carry skip logics. The logic replaces the 2:1 multiplexers with AOI/OAI
compound gates (Fig. 2). These gates, which consist of fewer transistors, have lower delay, area,
and power consumption compared with those of the 2:1 multiplexer [37]. Note that, in
this structure, as the carry propagates through the skip logics, it becomes complemented.
Therefore, at the output of the skip logic of even stages, the complement of the carry is
generated. The structure has a considerably lower propagation delay with a slightly smaller area
compared with those of the conventional one. Note that while the power consumption of the
AOI (or OAI) gate is smaller than that of the multiplexer, the power consumption of the
proposed CI-CSKA is slightly higher than that of the conventional one. This is due to the increase
in the number of gates, which imposes a higher wiring capacitance (in the noncritical paths).
Now, we describe the internal structure of the proposed CI-CSKA shown in Fig. 2 in
more detail. The adder contains two N-bit inputs, A and B, and Q stages. Each stage consists of
an RCA block with the size of Mj (j = 1, ..., Q). In this structure, the carry input of all the RCA
blocks, except for the first block (whose carry input is Ci), is zero (concatenation of the RCA blocks).
Therefore, all the blocks execute their jobs simultaneously. In this structure, while the first block
computes the summation of its corresponding input bits (i.e., S1, ..., SM1) and C1, the other
blocks simultaneously compute the intermediate results {ZKj+1, ZKj+2, ..., ZKj+Mj} for
Kj = Σ(r=1 to j−1) Mr (j = 2, ..., Q), as well as the Cj signals. In the proposed structure, the first
stage has only one block, which is an RCA. Stages 2 to Q each consist of two blocks: an RCA
block and an incrementation block. The incrementation block uses the

Fig. 3. Internal structure of the jth incrementation block, Kj = Σ(r=1 to j−1) Mr (j = 2, ..., Q).

intermediate results generated by the RCA block and the carry output of the previous stage to
calculate the final summation of the stage. The internal structure of the incrementation block,
which contains a chain of half-adders (HAs), is shown in Fig. 3. In addition, note that, to reduce
the delay considerably, the carry output of the incrementation block is not used for computing the
carry output of the stage. As shown in Fig. 2, the skip logic determines the carry output
of the jth stage (CO,j) based on the intermediate results of the jth stage and the carry output of
the previous stage (CO,j−1), as well as the carry output of the corresponding RCA block (Cj).
When determining CO,j, the following cases may be encountered. When Cj is equal to one, CO,j
will be one. On the other hand, when Cj is equal to zero, if the product of the intermediate results
is one (zero), the value of CO,j will be the same as CO,j−1 (zero).
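For illustration, a behavioral Verilog sketch of this HA-chain incrementation block (the names are ours, not the paper's netlist):

module inc_block #(parameter M = 4) (
  input  [M-1:0] z,    // intermediate results from the stage's RCA block
  input          cin,  // carry output of the previous stage (CO,j-1)
  output [M-1:0] s     // final sum bits of the stage
);
  wire [M:0] c;
  assign c[0] = cin;
  genvar i;
  generate
    for (i = 0; i < M; i = i + 1) begin : ha
      assign s[i]   = z[i] ^ c[i]; // HA sum
      assign c[i+1] = z[i] & c[i]; // HA carry
    end
  endgenerate
  // As described above, c[M] is deliberately left unused: the stage's
  // carry output CO,j comes from the skip logic, not from this chain.
endmodule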

The reason for using both AOI and OAI compound gates as the skip logics is that these gates are
realized as inverting functions in standard cell libraries. This way, the need for an inverter
gate, which would increase the power consumption and delay, is eliminated. As shown in Fig. 2, if an
AOI is used as the skip logic of one stage, the next skip logic should use an OAI gate. Another point
to mention is that the direct use of the proposed skip logic in the Conv-CSKA structure
would increase the delay of the critical path considerably. This originates from the fact that, in the
Conv-CSKA, the skip logic (AOI or OAI compound gate) is not able to bypass the zero carry
input until the zero carry input propagates through the corresponding RCA block. To solve this
problem, in the proposed structure, we have used RCA blocks with a carry input of zero (the
concatenation approach). This way, since the RCA block of a stage does not need to wait
for the carry output of the previous stage, the output carries of the blocks are calculated in
parallel.
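In Boolean terms, the skip function just described is CO,j = Cj + Pj · CO,j−1, where Pj denotes the product of the intermediate results of the stage. A minimal sketch of how two consecutive stages might realize it with AOI/OAI gates (the signal names are ours; p_jp1_n and c_jp1_n stand for the complemented product and RCA carry of the (j + 1)th stage, which we assume the stage makes available):

wire co_j_n, co_jp1;
// Odd stage: an AOI21 produces the complemented carry in one gate level.
assign co_j_n = ~((p_j & co_jm1) | c_j);          // = ~CO,j
// Even stage: an OAI21 on complemented inputs restores the true polarity.
assign co_jp1 = ~((p_jp1_n | co_j_n) & c_jp1_n);  // = CO,j+1

Expanding the second assignment gives CO,j+1 = Cj+1 + Pj+1 · CO,j, so the true polarity returns every second stage without any explicit inverter.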

B. Area and Delay of the Proposed Structure

As mentioned before, the use of static AOI and OAI gates (six transistors each), compared
with the static 2:1 multiplexer (12 transistors), leads to decreases in the area usage and delay of
the skip logic [37], [38]. In addition, except for the first RCA block, the carry input of all the other
blocks is zero, and hence, for these blocks, the first adder cell in the RCA chain is a HA. This
means that (Q − 1) FAs in the conventional structure are replaced with the same number of HAs
in the suggested structure, decreasing the area usage (Fig. 2). In addition, note that the proposed
structure utilizes incrementation blocks that do not exist in the conventional one. These blocks,
however, may be implemented with about the same logic gates (XOR and AND gates) as those
used for generating the select signal of the multiplexer in the conventional structure. Therefore,
the area usage of the proposed CI-CSKA structure is decreased compared with that of the
conventional one.

The critical path of the proposed CI-CSKA structure, which contains three parts, is
shown in Fig. 2. These parts include the chain of the FAs of the first stage, the path of the skip
logics, and the incrementation block in the last stage. The delay of this path (TD) may be
expressed as

TD = [M1 × TCARRY] + [Σ(j=2 to Q−1) TSKIP,j] + [(MQ − 1) × TAND + TXOR] (10)
where the three brackets correspond to the three parts mentioned above, respectively.
Here, TAND and TXOR are the delays of the two-input static AND and XOR gates,
respectively. Note that [(Mj − 1) × TAND + TXOR] is the critical path delay of the jth
incrementation block (TINC,j), which is shown in Fig. 3. To calculate the delay of the skip logic,
the average of the delays of the AOI and OAI gates, which are typically close to one another
[35], is used. Thus, (10) may be modified to

TD = [M1 × TCARRY] + [(Q − 2) × (TAOI + TOAI)/2] + [(MQ − 1) × TAND + TXOR]. (11)
where TAOI and TOAI are the delays of the static AOI and OAI gates, respectively. The
comparison of (1) and (11) indicates that the delay of the proposed structure is smaller than that
of the conventional one. The first reason is that the delay of the skip logic is considerably
smaller than that of the conventional structure, while the number of stages is about the same
in both structures. Second, since TAND and TXOR are smaller than TCARRY and TSUM, the
third additive term in (11) becomes smaller than the third term in (1) [37]. It should be noted that
the delay reduction of the skip logic has the largest impact on the delay decrease of the whole
structure.

C. Stage Sizes Consideration

Similar to the Conv-CSKA structure, the proposed CI-CSKA structure may be implemented
with either FSS or VSS. Here, the stage size is the same as the size of the RCA and incrementation
blocks. In the case of the FSS (FSS-CI-CSKA), there are Q = N/M stages with the size of M. The
optimum value of M, which may be obtained using (11), is given by

Mopt = (N × (TAOI + TOAI) / (2 × (TCARRY + TAND)))^(1/2). (12)
In the case of the VSS (VSS-CI-CSKA), the sizes of the stages, which are M1 to MQ,
are obtained using a method similar to the one discussed in Section III-B. For this structure, the
new value of TSKIP should be used, and hence, α becomes equal to (TAOI + TOAI)/(2 × TCARRY). In
particular, the following steps should be taken.

1) The size of the RCA block of the first stage is one.


2) From the second stage to the nucleus stage, the size of the jth stage is determined based on
the delay of the product of the sums of its RCA block and the delay of the carry output of
the (j − 1)th stage. Hence, based on the description given in Section III-B, the size of the
RCA block of the jth stage should be as large as possible, while the delay of the product
of its output sums should remain smaller than the delay of the carry output of the (j − 1)th
stage. Therefore, in this case, the sizes of the stages either stay the same or increase.

3) The increase in the size is continued until the summation of all the sizes up to this
stage becomes larger than N/2. The last stage of this run, which has the largest size, is considered
as the nucleus (pth) stage. There are cases where we should consider the stage right before
this one as the nucleus stage (Step 5).

4) Starting from stage (p + 1) to the last stage, the size of stage i is determined
based on the delays of the incrementation blocks of the ith and (i − 1)th stages (TINC,i
and TINC,i−1, respectively) and the delay of the skip logic. In particular,

TINC,i ≤ TINC,i−1 − TSKIP. (13)
In this case, the size of the last stage is one, and its RCA block contains a HA.

5) Finally, note that it is possible that the sum of all the stage sizes does not equal N. In the
case where the sum is smaller than N by d bits, we should add another stage with the size of d;
the stage is placed close to the stage with the same size. In the case where the sum is larger than
N by d bits, the sizes of the stages should be revised (Step 3). For more details on how to revise
the stage sizes, one may refer to [19].
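As a worked instance of these steps, the 32-bit adder synthesized in Section 7.3 uses the stage sizes {1, 2, 3, 4, 5, 8, 5, 4}: the sizes grow by one bit per stage up to the 8-bit nucleus (realized there with the Brent–Kung block of the hybrid structure of Section V), then shrink more quickly (8, 5, 4) toward the last stage, and they sum to 32, so Step 5 needs no correction.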

Now, the procedure for determining the stage sizes is demonstrated for the 32-bit adder.
It includes both the conventional and the proposed CI-CSKA structures. The number of stages
and the corresponding size for each stage, which are given in Fig. 4, have been determined based
on a 45-nm static CMOS technology [38]. The dashed and dotted lines in the plot indicate the
rates of size increase and decrease. While the increase and decrease rates in the conventional
structure are balanced, the decrease rate is larger than the increase rate in the case of the proposed
structure. This originates from the fact that, in the Conv-CSKA structure, both the stage size
increase and decrease are determined based on the RCA block delay [according to (4) and (5)],
while in the proposed CI-CSKA structure, the increase is determined based on the RCA block
delay and the decrease is determined based on the incrementation block delay [according to
(13)]. The imbalanced rates may yield a larger nucleus stage and a smaller number of stages,
leading to a smaller propagation delay.

Fig. 4. Sizes of the stages in the case of VSS for the proposed and conventional 32-bit CSKA
structures in 45-nm static CMOS technology

V. PROPOSED HYBRID VARIABLE LATENCY CSKA

In this section, first, the structure of a generic variable latency adder, which may be used
with the voltage scaling relying on adaptive clock stretching, is described. Then, a hybrid
variable latency CSKA structure based on the CI-CSKA structure described in Section IV is
proposed.

A. Variable Latency Adders Relying on Adaptive Clock Stretching

The basic idea behind variable latency adders is that the critical paths of the adders are rarely
activated [33]. Hence, the supply voltage may be scaled down without decreasing the clock
frequency. If the critical paths are not activated, one clock period is enough for completing the
operation. In the cases where the critical paths are activated, the structure allows two clock
periods for finishing the operation. Hence, in this structure, the slack between the longest off-critical paths and the
the operation. Hence, in this structure, the slack between the longest off-critical paths and the
longest critical paths determines the maximum amount of the supply voltage scaling. Therefore,
in the variable latency adders, for determining the critical paths activation, a predictor block,
which works based on the inputs pattern, is required [28].
The concepts of the variable latency adders, adaptive clock stretching, and also supply
voltage scaling in an N-bit RCA may be explained using Fig. 5. The predictor block
consists of some XOR and AND gates that determine the product of the propagate signals of the
considered bit positions. Since the block has some area and power overheads, only a few middle
bits are used to predict the activation of the critical paths, at the price of a decrease in prediction
accuracy [31], [33]. In Fig. 5, the input bits (j + 1)th to (j + m)th have been exploited to predict the
propagation of the carry output of the jth stage (FA) to the carry output of the (j + m)th stage. For
this configuration, the carry propagation path from the first stage to the Nth stage is the longest
critical path (denoted by Long Latency Path (LLP)), while the carry propagation path
from the first stage to the (j + m)th stage and the carry propagation path from the (j + 1)th stage to the
Nth stage (denoted by Short Latency Path (SLP1) and SLP2, respectively) are the
longest off-critical paths. It should be noted that the paths that the predictor shows are (are not) active
for a given set of inputs are considered as critical (off-critical) paths. Placing the predictor bits in the
middle decreases the maximum length of the off-critical paths [33]. The range of voltage scaling is
determined by the slack time, which is defined by the delay difference between the LLP and
max(SLP1, SLP2). Since the activation probability of the critical paths is low (< 1/2^m), the clock
stretching has a negligible impact on the throughput (e.g., for a 32-bit adder, m = 6 to 10 may be
considered [33]). There are cases where the predictor mispredicts the critical path activation. By
increasing m, the number of mispredictions decreases at the price of increasing the longest off-critical
path, and hence, limiting the range of the voltage scaling. Therefore, the predictor block
size should be selected based on these tradeoffs.
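As a minimal sketch of such a predictor (the parameters N, J, and M and the module name are ours): it ANDs the propagate signals of the m middle bit positions, and a 1 on its output tells the controller to stretch the clock to two cycles.

module carry_predictor #(parameter N = 32, J = 12, M = 8) (
  input  [N-1:0] a, b,
  output         long_op  // 1: allow two clock periods for this add
);
  // Product of the propagate signals of bit positions J .. J+M-1
  assign long_op = &(a[J+M-1:J] ^ b[J+M-1:J]);
endmodule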

Fig. 5. Generic structure of variable latency adders based on RCA.


B. Proposed Hybrid Variable Latency CSKA Structure

The basic idea behind using VSS CSKA structures was based on almost balancing the
delays of paths such that the delay of the critical path is minimized compared with that of the
FSS structure [21]. This deprives us of the opportunity to use the slack time for
supply voltage scaling. To provide the variable latency feature for the VSS CSKA structure, we
replace some of the middle stages in our proposed structure with a parallel prefix adder (PPA)
modified in this paper. It should be noted that since the Conv-CSKA structure has a lower speed
than that of the proposed one, in this section, we do not consider the conventional structure. The
proposed hybrid variable latency CSKA structure is shown in Fig. 6, where an Mp-bit modified PPA
is used for the pth stage (nucleus stage). Since the nucleus stage, which has the largest size (and
delay) among the stages, is present in both SLP1 and SLP2, replacing it with the PPA reduces the
delay of the longest off-critical paths. Thus, the use of the fast PPA helps increase the available
slack time in the variable latency structure. It should be mentioned that since the input bits of the
PPA block are used in the predictor block, this block becomes part of both SLP1 and SLP2.

Fig. 6. Structure of the proposed hybrid variable latency CSKA.


Fig. 7. Internal structure of the pth stage of the proposed hybrid variable latency CSKA. Mp is
equal to 8 and Kp = Σ(r=1 to p−1) Mr.

In the proposed hybrid structure, the prefix network of the Brent–Kung adder [39] is used
for constructing the nucleus stage (Fig. 7). One of the advantages of this adder compared with
other prefix adders is that, in this structure, using forward paths, the longest carry is calculated
sooner than the intermediate carries, which are computed by backward paths. In
addition, the fan-out of this adder is less than that of other parallel adders, while the length of its
wiring is smaller [14]. Finally, it has a simple and regular layout. The internal structure of the stage p,
including the modified PPA and skip logic, is shown in Fig. 7. Note that, for this figure, the size
of the PPA is assumed to be 8 (i.e., Mp = 8).

As shown in the figure, in the preprocessing level, the propagate signals (Pi) and generate signals
(Gi) for the inputs are calculated. In the next level, using the Brent–Kung parallel prefix network, the
longest carry (i.e., G8:1) of the prefix network, along with P8:1, which is the product of all the
propagate signals of the inputs, are calculated sooner than the other intermediate signals in this
network. The signal P8:1 is used in the skip logic to determine whether the carry output of the previous
stage (i.e., CO,p−1) should be skipped or not. In addition, this signal is exploited as the predictor
signal in the variable latency adder. It should be mentioned that all of these operations are
performed in parallel with the other stages. In the case where P8:1 is one, CO,p−1 should skip this
stage, predicting that some critical paths are activated. On the other hand, when P8:1 is zero,
CO,p is equal to G8:1, and no critical path will be activated in this case. After the
parallel prefix network, the intermediate carries, which are functions of CO,p−1 and the intermediate
signals, are computed (Fig. 7). Finally, in the postprocessing level, the output sums of this stage
are calculated. It should be noted that this implementation is based on ideas similar to the
concatenation and incrementation concepts used in the CI-CSKA discussed in Section IV. It
should be noted that the end part of the SLP1 path, from CO,p−1 to the final summation results of the
PPA block, and the beginning part of the SLP2 path, from the inputs of this block to CO,p, belong to
the PPA block (Fig. 7). In addition, similar to the proposed CI-CSKA structure, the first point of
SLP1 is the first input bit of the first stage, and the last point of SLP2 is the last bit of the sum
output of the incrementation block of the stage Q. The steps for determining the sizes of the
stages in the hybrid variable latency CSKA structure are similar to the ones discussed in Section
IV. Since the PPA structure is more efficient when its size is equal to an integer power of two,
we can select a larger size for the nucleus stage accordingly [14]. This implies that the third step
discussed in that section is modified. The larger size (number of bits), compared with that of the
nucleus stage in the original CI-CSKA structure, leads to a decrease in the number of stages as
well as smaller delays for SLP1 and SLP2. Thus, the slack time increases further.
Chapter-6
Verilog HDL
In the semiconductor and electronic design industry, Verilog is a hardware description language
(HDL) used to model electronic systems. Verilog HDL, not to be confused with VHDL (a
competing language), is most commonly used in the design, verification, and implementation of
digital logic chips at the register-transfer level of abstraction. It is also used in the verification of
analog and mixed-signal circuits.

6.1 Overview

Hardware description languages such as Verilog differ from software programming languages
because they include ways of describing the propagation of time and signal dependencies
(sensitivity). There are two assignment operators, a blocking assignment (=), and a non-blocking
(<=) assignment. The non-blocking assignment allows designers to describe a state-machine
update without needing to declare and use temporary storage variables. Since these concepts are
part of Verilog's language semantics, designers could quickly write descriptions of large circuits
in a relatively compact and concise form. At the time of Verilog's introduction (1984), Verilog
represented a tremendous productivity improvement for circuit designers who were already using
graphical schematic capture software and specially written software programs to document and
simulate electronic circuits.

The designers of Verilog wanted a language with syntax similar to the C programming language,
which was already widely used in engineering software development. Like C, Verilog is case-
sensitive and has a basic preprocessor (though less sophisticated than that of ANSI C/C++). Its
control flow keywords (if/else, for, while, case, etc.) are equivalent, and its operator precedence
is compatible. Syntactic differences include variable declaration (Verilog requires bit-widths on
net/reg types), demarcation of procedural blocks (begin/end instead of curly braces {}), and many
other minor differences.

A Verilog design consists of a hierarchy of modules. Modules encapsulate design hierarchy, and
communicate with other modules through a set of declared input, output, and bidirectional ports.
Internally, a module can contain any combination of the following: net/variable declarations
(wire, reg, integer, etc.), concurrent and sequential statement blocks, and instances of other
modules (sub-hierarchies). Sequential statements are placed inside a begin/end block and
executed in sequential order within the block. However, the blocks themselves are executed
concurrently, making Verilog a dataflow language.

Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating, undefined") and
strengths (strong, weak, etc.). This system allows abstract modeling of shared signal lines, where
multiple sources drive a common net. When a wire has multiple drivers, the wire's (readable)
value is resolved by a function of the source drivers and their strengths.

A subset of statements in the Verilog language are synthesizable. Verilog modules that conform
to a synthesizable coding style, known as RTL (register-transfer level), can be physically
realized by synthesis software. Synthesis software algorithmically transforms the (abstract)
Verilog source into a netlist, a logically equivalent description consisting only of elementary
logic primitives (AND, OR, NOT, flip-flops, etc.) that are available in a specific FPGA or VLSI
technology. Further manipulations to the netlist ultimately lead to a circuit fabrication blueprint
(such as a photo mask set for an ASIC or a bit stream file for an FPGA).

6.2 History
6.2.1 Beginning
Verilog was the first modern hardware description language to be invented. It was created by
Phil Moorby and Prabhu Goel during the winter of 1983/1984 at Automated Integrated Design
Systems (renamed Gateway Design Automation in 1985) as a hardware modeling language.
Gateway Design Automation was purchased by Cadence
Design Systems in 1990. Cadence now has full proprietary rights to Gateway's Verilog and the
Verilog-XL, the HDL-simulator that would become the de-facto standard (of Verilog logic
simulators) for the next decade. Originally, Verilog was intended to describe and allow
simulation; only afterwards was support for synthesis added.

6.2.2 Verilog-95

With the increasing success of VHDL at the time, Cadence decided to make the language
available for open standardization. Cadence transferred Verilog into the public domain under the
Open Verilog International (OVI) (now known as Accellera) organization. Verilog was later
submitted to IEEE and became IEEE Standard 1364-1995, commonly referred to as Verilog-95.

In the same time frame Cadence initiated the creation of Verilog-A to put standards support
behind its analog simulator Spectre. Verilog-A was never intended to be a standalone language
and is a subset of Verilog-AMS which encompassed Verilog-95.

6.2.3 Verilog 2001

Extensions to Verilog-95 were submitted back to IEEE to cover the deficiencies that users had
found in the original Verilog standard. These extensions became IEEE Standard 1364-2001
known as Verilog-2001.

Verilog-2001 is a significant upgrade from Verilog-95. First, it adds explicit support for (2's
complement) signed nets and variables. Previously, code authors had to perform signed
operations using awkward bit-level manipulations (for example, the carry-out bit of a simple 8-
bit addition required an explicit description of the Boolean algebra to determine its correct
value). The same function under Verilog-2001 can be more succinctly described by one of the
built-in operators: +, -, /, *, >>>. A generate/endgenerate construct (similar to VHDL's
generate/endgenerate) allows Verilog-2001 to control instance and statement instantiation
through normal decision operators (case/if/else). Using generate/endgenerate, Verilog-2001 can
instantiate an array of instances, with control over the connectivity of the individual instances.
File I/O has been improved by several new system tasks. And finally, a few syntax additions
were introduced to improve code readability (e.g. always @*, named parameter override, C-style
function/task/module header declaration).
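For example (our own illustration, not taken from the standard), generate can instantiate an array of full-adder instances with control over the connectivity of each instance:

module fa (input a, b, cin, output s, cout);
  assign s    = a ^ b ^ cin;
  assign cout = (a & b) | (cin & (a ^ b));
endmodule

module rca8 (input [7:0] a, b, input cin, output [7:0] s, output cout);
  wire [8:0] c;
  assign c[0] = cin;
  genvar i;
  generate
    for (i = 0; i < 8; i = i + 1) begin : stage
      fa u (.a(a[i]), .b(b[i]), .cin(c[i]), .s(s[i]), .cout(c[i+1]));
    end
  endgenerate
  assign cout = c[8];
endmodule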

Verilog-2001 is the dominant flavor of Verilog supported by the majority of commercial EDA
software packages.

6.2.4 Verilog 2005

Not to be confused with SystemVerilog, Verilog 2005 (IEEE Standard 1364-2005) consists of
minor corrections, spec clarifications, and a few new language features (such as the uwire
keyword).

A separate part of the Verilog standard, Verilog-AMS, attempts to integrate analog and mixed
signal modeling with traditional Verilog.

Example

A hello world program looks like this:

module main;
initial
begin
$display("Hello world!");
$finish;
end
endmodule

A simple example of two flip-flops follows:

module toplevel(clock, reset);
input clock;
input reset;

reg flop1;
reg flop2;

always @ (posedge reset or posedge clock)


if (reset)
begin
flop1 <= 0;
flop2 <= 1;
end
else
begin
flop1 <= flop2;
flop2 <= flop1;
end
endmodule

The "<=" operator in Verilog is another aspect of its being a hardware description language as
opposed to a normal procedural language. This is known as a "non-blocking" assignment. Its
action doesn't register until the next clock cycle. This means that the order of the assignments is
irrelevant and will produce the same result: flop1 and flop2 will swap values every clock.

The other assignment operator, "=", is referred to as a blocking assignment. When "="
assignment is used, for the purposes of logic, the target variable is updated immediately. In the
above example, had the statements used the "=" blocking operator instead of "<=", flop1 and
flop2 would not have been swapped. Instead, as in traditional programming, the compiler would
understand to simply set flop1 equal to flop2 (and subsequently ignore the redundant logic to set
flop2 equal to flop1.)

An example counter circuit follows:

module Div20x (rst, clk, cet, cep, count, tc);


// TITLE 'Divide-by-20 Counter with enables'
// enable CEP is a clock enable only
// enable CET is a clock enable and
// enables the TC output
// a counter using the Verilog language

parameter size = 5;
parameter length = 20;

input rst;   // These inputs/outputs represent
input clk;   // connections to the module.
input cet;
input cep;

output [size-1:0] count;
output tc;

reg [size-1:0] count; // Signals assigned


// within an always
// (or initial) block
// must be of type reg

wire tc;             // Other signals are of type wire

// The always statement below is a parallel


// execution statement that
// executes any time the signals
// rst or clk transition from low to high

always @ (posedge clk or posedge rst)

if (rst)             // This causes reset of the cntr
  count <= {size{1'b0}};
else
  if (cet && cep)    // Enables both true
  begin
    if (count == length-1)
      count <= {size{1'b0}};
    else
      count <= count + 1'b1;
  end

// the value of tc is continuously assigned
// the value of the expression
assign tc = (cet && (count == length-1));

endmodule

An example of delays:

...
reg a, b, c, d;
wire e;
...
always @(b or e)
begin
a = b & e;
b = a | b;
#5 c = b;
d = #6 c ^ e;
end

The always clause above illustrates the other type of method of use, i.e. it executes whenever any
of the entities in the list (the b or e) changes. When one of these changes, a is immediately
assigned a new value, and due to the blocking assignment, b is assigned a new value afterward
(taking into account the new value of a). After a delay of 5 time units, c is assigned the value of b
and the value of c ^ e is tucked away in an invisible store. Then after 6 more time units, d is
assigned the value that was tucked away.

Signals that are driven from within a process (an initial or always block) must be of type reg.
Signals that are driven from outside a process must be of type wire. The keyword reg does not
necessarily imply a hardware register.

Definition of constants

The definition of constants in Verilog supports the addition of a width parameter. The basic
syntax is:

<Width in bits>'<base letter><number>

Examples:

12'h123 - Hexadecimal 123 (using 12 bits)


20'd44 - Decimal 44 (using 20 bits - 0 extension is automatic)
4'b1010 - Binary 1010 (using 4 bits)
6'o77 - Octal 77 (using 6 bits)

Synthesizeable constructs

There are several statements in Verilog that have no analog in real hardware, e.g. $display.
Consequently, much of the language cannot be used to describe hardware. The examples
presented here are the classic subset of the language that has a direct mapping to real gates.

// Mux examples - Three ways to do the same thing.

// The first example uses continuous assignment
wire out;
assign out = sel ? a : b;

// The second example uses a procedure
// to accomplish the same thing.
reg out;
always @(a or b or sel)
begin
  case (sel)
    1'b0: out = b;
    1'b1: out = a;
  endcase
end

// Finally - you can use if/else in a
// procedural structure.
reg out;
always @(a or b or sel)
  if (sel)
    out = a;
  else
    out = b;

The next interesting structure is a transparent latch; it will pass the input to the output when the
gate signal is set for "pass-through", and captures the input and stores it upon transition of the
gate signal to "hold". The output will remain stable regardless of the input signal while the gate
is set to "hold". In the example below the "pass-through" level of the gate would be when the
value of the if clause is true, i.e. gate = 1. This is read "if gate is true, the din is fed to latch_out
continuously." Once the if clause is false, the last value at latch_out will remain and is
independent of the value of din.

// Transparent latch example

reg out;
always @(gate or din)
  if (gate)
    out = din; // Pass through state
// Note that the else isn't required here. The variable
// out will follow the value of din while gate is high.
// When gate goes low, out will remain constant.

The flip-flop is the next significant template; in Verilog, the D-flop is the simplest, and it can be
modeled as:

reg q;
always @(posedge clk)
  q <= d;

The significant thing to notice in the example is the use of the non-blocking assignment. A basic
rule of thumb is to use <= when there is a posedge or negedge statement within the always
clause.

A variant of the D-flop is one with an asynchronous reset; there is a convention that the reset
state will be the first if clause within the statement.

reg q;
always @(posedge clk or posedge reset)
  if (reset)
    q <= 0;
  else
    q <= d;

The next variant is including both an asynchronous reset and asynchronous set condition; again
the convention comes into play, i.e. the reset term is followed by the set term.

reg q;
always @(posedge clk or posedge reset or posedge set)
  if (reset)
    q <= 0;
  else if (set)
    q <= 1;
  else
    q <= d;

Note: If this model is used to model a Set/Reset flip flop then simulation errors can result.
Consider the following test sequence of events. 1) reset goes high 2) clk goes high 3) set goes
high 4) clk goes high again 5) reset goes low followed by 6) set going low. Assume no setup and
hold violations.
In this example the always @ statement would first execute when the rising edge of reset occurs,
which would set q to 0. The next time the always block executes would be at the rising edge of
clk, which again would keep q at 0. The always block then executes when set goes high, which,
because reset is still high, forces q to remain at 0. This condition may or may not be correct
depending on the actual flip flop. However, this is not the main problem with this model.
Notice that when reset goes low, that set is still high. In a real flip flop this will cause the output
to go to a 1. However, in this model it will not occur because the always block is triggered by
rising edges of set and reset - not levels. A different approach may be necessary for set/reset flip
flops.

The final basic variant is one that implements a D-flop with a mux feeding its input. The mux
has a d-input and feedback from the flop itself. This allows a gated load function.

// Basic structure with an EXPLICIT feedback path


always @(posedge clk)
  if (gate)
    q <= d;
  else
    q <= q; // explicit feedback path

// The more common structure ASSUMES the feedback is present
// This is a safe assumption since this is how the
// hardware compiler will interpret it. This structure
// looks much like a latch. The differences are the
// @(posedge clk) and the non-blocking <=
//
always @(posedge clk)
  if (gate)
    q <= d; // the "else" mux is "implied"

Note that there are no "initial" blocks mentioned in this description. There is a split between
FPGA and ASIC synthesis tools on this structure. FPGA tools allow initial blocks where reg
values are established instead of using a "reset" signal. ASIC synthesis tools don't support such a
statement. The reason is that an FPGA's initial state is something that is downloaded into the
memory tables of the FPGA. An ASIC is an actual hardware implementation.

Initial and always

There are two separate ways of declaring a Verilog process. These are the always and the initial
keywords. The always keyword indicates a free-running process. The initial keyword indicates a
process executes exactly once. Both constructs begin execution at simulator time 0, and both
execute until the end of the block. Once an always block has reached its end, it is rescheduled
(again). It is a common misconception to believe that an initial block will execute before an
always block. In fact, it is better to think of the initial-block as a special-case of the always-
block, one which terminates after it completes for the first time.

//Examples:
initial
begin
  a = 1; // Assign a value to reg a at time 0
  #1;    // Wait 1 time unit
  b = a; // Assign the value of reg a to reg b
end

always @(a or b) // Any time a or b CHANGE, run the process
begin
  if (a)
    c = b;
  else
    d = ~b;
end // Done with this block, now return to the top (i.e. the @ event-control)

always @(posedge a) // Run whenever reg a has a low to high change
  a <= b;

These are the classic uses for these two keywords, but there are two significant additional uses.
The most common of these is an always keyword without the @(...) sensitivity list. It is possible
to use always as shown below:

always
begin    // Always begins executing at time 0 and NEVER stops
  clk = 0; // Set clk to 0
  #1;      // Wait for 1 time unit
  clk = 1; // Set clk to 1
  #1;      // Wait 1 time unit
end      // Keeps executing - so continue back at the top of the begin

The always keyword acts similar to the "C" construct while(1) {..} in the sense that it will
execute forever.

The other interesting exception is the use of the initial keyword with the addition of the forever
keyword.
The example below is functionally identical to the always example above.

initial forever // Start at time 0 and repeat the begin/end forever
begin
  clk = 0; // Set clk to 0
  #1;      // Wait for 1 time unit
  clk = 1; // Set clk to 1
  #1;      // Wait 1 time unit
end

Fork/join

The fork/join pair are used by Verilog to create parallel processes. All statements (or blocks)
between a fork/join pair begin execution simultaneously upon execution flow hitting the fork.
Execution continues after the join upon completion of the longest running statement or block
between the fork and join.

initial
fork
  $write("A"); // Print Char A
  $write("B"); // Print Char B
  begin
    #1;          // Wait 1 time unit
    $write("C"); // Print Char C
  end
join

The way the above is written, it is possible to have either the sequences "ABC" or "BAC" print
out. The order of simulation between the first $write and the second $write depends on the
simulator implementation, and may purposefully be randomized by the simulator. This allows
the simulation to contain both accidental race conditions as well as intentional non-deterministic
behavior.

Notice that VHDL cannot dynamically spawn multiple processes in the way Verilog can.

Race conditions

The order of execution isn't always guaranteed within Verilog. This can best be illustrated by a
classic example. Consider the code snippet below:

initial
a = 0;
initial
b = a;

initial
begin
#1;
$display("Value a=%b Value of b=%b",a,b);
end

What will be printed out for the values of a and b? Depending on the order of execution of the
initial blocks, it could be zero and zero, or alternately zero and some other arbitrary uninitialized
value. The $display statement will always execute after both assignment blocks have completed,
due to the #1 delay.

Operators

Note: These operators are not shown in order of precedence.

Operator type   Operator symbols   Operation performed
Bitwise         ~                  NOT (1's complement)
                &                  AND
                |                  OR
                ^                  XOR
                ~^ or ^~           XNOR
Logical         !                  NOT
                &&                 AND
                ||                 OR
Reduction       &                  AND
                ~&                 NAND
                |                  OR
                ~|                 NOR
                ^                  XOR
                ~^ or ^~           XNOR
Arithmetic      +                  Addition
                -                  Subtraction
                -                  2's complement (unary)
                *                  Multiplication
                /                  Division
                **                 Exponentiation (*Verilog-2001)
Relational      >                  Greater than
                <                  Less than
                >=                 Greater than or equal to
                <=                 Less than or equal to
                ==                 Logical equality (bit-value 1'bX is removed from comparison)
                !=                 Logical inequality (bit-value 1'bX is removed from comparison)
                ===                4-state logical equality (bit-value 1'bX is taken as literal)
                !==                4-state logical inequality (bit-value 1'bX is taken as literal)
Shift           >>                 Logical right shift
                <<                 Logical left shift
                >>>                Arithmetic right shift (*Verilog-2001)
                <<<                Arithmetic left shift (*Verilog-2001)
Concatenation   { , }              Concatenation
Replication     {n{m}}             Replicate value m for n times
Conditional     ? :                Conditional

Four-valued logic
The IEEE 1364 standard defines a four-valued logic with four states: 0, 1, Z (high impedance),
and X (unknown logic value). For the competing VHDL, a dedicated standard for multi-valued
logic exists as IEEE 1164 with nine levels.
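A small illustration of the Z and X states (names are ours):

wire bus;
assign bus = en ? data : 1'bz; // bus floats (Z) when the driver is disabled

reg r; // in simulation, r reads as X until its first assignment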
Chapter-7
FPGA Implementation

7.1 Introduction to FPGA

An FPGA contains a two-dimensional array of logic blocks and interconnections between the
logic blocks. Both the logic blocks and interconnects are programmable. Logic blocks are
programmed to implement a desired function and the interconnections are programmed using the
switch boxes to connect the logic blocks.

To be clearer: if we want to implement a complex design (a CPU, for instance), then the
design is divided into small sub-functions, and each sub-function is implemented using one logic
block. Then, to get the desired design (the CPU), all the sub-functions implemented in logic blocks
must be connected together, and this is done by programming the internal structure of the FPGA,
which is depicted in figure 7.1 below.
Figure 7.1: FPGA interconnections

FPGAs, an alternative to custom ICs, can be used to implement an entire System
on Chip (SoC). The main advantage of an FPGA is its ability to be reprogrammed: a user can
reprogram an FPGA to implement a design after the FPGA has been manufactured. This is
where the name "field programmable" comes from.

Custom ICs are expensive and take a long time to design, so they are useful only when
produced in bulk. FPGAs, however, are easy to implement within a short time with the
help of computer-aided design (CAD) tools, because there is no physical layout
process, no mask making, and no IC manufacturing.

Some disadvantages of FPGAs are that they are slow compared to custom ICs, they
cannot handle very complex designs, and they draw more power.

A Xilinx logic block consists of one Look-Up Table (LUT) and one flip-flop. An LUT
is used to implement a number of different functions. The input lines to the logic block go
into the LUT and enable it. The output of the LUT gives the result of the logic function that
it implements, and the output of the logic block is either the registered or the unregistered output
of the LUT.

SRAM is used to implement an LUT. A k-input logic function is implemented using a

2^k × 1 SRAM. The number of different possible functions for a k-input LUT is 2^(2^k).
The advantage of such an architecture is that it supports the implementation of very many logic
functions; the disadvantage is the unusually large number of memory cells required to
implement such a logic block when the number of inputs is large.

Figure 7.2 shows a 4-input LUT based implementation of logic block

An LUT-based design provides better logic block utilization. A k-input LUT-based
logic block can be implemented in a number of different ways, with a tradeoff between
performance and logic density. An n-LUT can be seen as a direct implementation of a
function truth table: each latch holds the value of the function corresponding to one
input combination. For example, a 2-LUT can be used to implement any of the 16 two-input
functions, such as AND, OR, A + NOT B, etc.
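For illustration, a k-input LUT can be modeled in Verilog as a 2^k-bit truth table indexed by the inputs; a hypothetical 2-LUT follows, where the INIT parameter plays the role of the SRAM configuration bits (4'b1000 configures an AND gate):

module lut2 #(parameter [3:0] INIT = 4'b1000) (
  input  [1:0] in,  // the two logic-block inputs select one truth-table row
  output       out
);
  wire [3:0] rom = INIT; // the 2^k x 1 SRAM contents
  assign out = rom[in];  // read the cell addressed by the inputs
endmodule

Changing INIT reconfigures the same hardware into any of the 2^(2^2) = 16 two-input functions, which is exactly the reprogrammability argument made above.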

Interconnects
A wire segment can be described as two end points of an interconnection with no
programmable switch between them. A sequence of one or more wire segments in an FPGA can
be termed as a track.

Typically an FPGA has logic blocks, interconnects and switch blocks (Input /Output
blocks). Switch blocks lie in the periphery of logic blocks and interconnect. Wire segments are
connected to logic blocks through switch blocks. Depending on the required design, one logic
block is connected to another and so on.

7.2 FPGA DESIGN FLOW

In this part of the tutorial we give a short introduction to the FPGA design flow. A
simplified version of the design flow is given in the following diagram.

Figure 7.3 FPGA Design Flow

7.2.1 Design Entry

There are different techniques for design entry: schematic-based, hardware description
language (HDL), a combination of both, etc. The selection of a method depends on the design and the
designer. If the designer wants to deal more with hardware, then schematic entry is the better
choice. When the design is complex, or the designer thinks of the design in an algorithmic way,
then HDL is the better choice. Language-based entry is faster but lags in performance and density.

HDLs represent a level of abstraction that can isolate designers from the details of the
hardware implementation, whereas schematic-based entry gives designers much more visibility into
the hardware; it is the better choice for those who are hardware oriented. Another method, rarely
used, is state-machine entry. It is the better choice for designers who think of the design as a series
of states, but the tools for state-machine entry are limited. In this documentation we deal
with HDL-based design entry.

7.2.2 Synthesis

Figure 7.4 FPGA Synthesis

Synthesis is the process that translates VHDL/Verilog code into a device netlist format, i.e., a
complete circuit with logical elements (gates, flip-flops, etc.) for the design. If the design
contains more than one sub-design (for example, to implement a processor we need a CPU as one
design element and a RAM as another), then the synthesis process generates a netlist for each
design element. The synthesis process also checks the code syntax and analyzes the hierarchy of the
design, which ensures that the design is optimized for the architecture the designer has selected.
The resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx Synthesis
Technology (XST)).

7.2.3 Implementation

This process consists of a sequence of three steps

Translate
Map
Place and Route

Translate:
This process combines all the input netlists and constraints into a logic design file. This
information is saved as an NGD (Native Generic Database) file, which can be done using the
NGDBuild program. Here, defining constraints means assigning the ports in the design to the
physical elements (e.g., pins, switches, buttons) of the targeted device and specifying the timing
requirements of the design. This information is stored in a file named UCF (User Constraints
File). Tools used to create or modify the UCF are PACE, the Constraint Editor, etc.

Figure 7.5 FPGA Translate

Map:

This process divides the whole circuit of logical elements into sub-blocks so that they can
be fit into the FPGA logic blocks. That means the map process fits the logic defined by the NGD file
into the targeted FPGA elements (Configurable Logic Blocks (CLB) and Input/Output Blocks
(IOB)) and generates an NCD (Native Circuit Description) file, which physically represents the
design mapped to the components of the FPGA. The MAP program is used for this purpose.

Figure 7.6 FPGA map

Place and Route:


The PAR program is used for this process. The place and route process places the sub-blocks
from the map process into logic blocks according to the constraints and connects the logic
blocks. For example, if a sub-block is placed in a logic block very near an I/O pin, it may save
time but may violate some other constraint; the tradeoff between all the constraints is taken into
account by the place and route process.

The PAR tool takes the mapped NCD file as input and produces a completely routed
NCD file as output, which contains the routing information.

Figure 7.7 FPGA Place and route

7.3 Synthesis Result

Once the functional verification is done, the RTL model is taken to the synthesis process using
the Xilinx ISE tool. In the synthesis process, the RTL model is converted to a gate-level netlist
mapped to a specific technology library. In the Spartan-3E family, many different devices are
available in the Xilinx ISE tool. To synthesize this design, the device XC3S500E has been
chosen, with the FG320 package and the -4 speed grade.
RTL Schematic

The RTL (register transfer level) schematic can be viewed as a black box after the synthesis of
the design is done. It shows the inputs and outputs of the system. By double-clicking on the
diagram we can see the gates, flip-flops, and MUXes inside.
Source code for Variable Stage Size Carry Skip Adder:
`timescale 1ns / 1ps
//////////////////////////////////////////////////////////////////////////////////
// Company:
// Engineer:
//
// Create Date: 12:40:01 08/13/2016
// Design Name:
// Module Name: PCSKA32_VSS
// Project Name:
// Target Devices:
// Tool versions:
// Description:
//
// Dependencies:
//
// Revision:
// Revision 0.01 - File Created
// Additional Comments:
//
//////////////////////////////////////////////////////////////////////////////////
module PCSKA32_VSS(a,b, cin, cout, s);
input [31:0] a,b;
input cin;
output cout;
output [31:0] s;
wire [8:0] c; // inter-stage carries (only c[6:0] are used here)
// Variable stage sizes: 1, 2, 3, 4, 5, 8, 5, 4 bits (32 bits in total);
// the 8-bit nucleus stage is a Brent-Kung parallel prefix block.
fa u0(.a(a[0]),.b(b[0]),.cin(cin),.s(s[0]),.cout(c[0]));                        // stage 1: 1 bit
PADDER2BIT u1(.a(a[2:1]),.b(b[2:1]),.cin(c[0]),.s(s[2:1]),.cout(c[1]));        // stage 2: 2 bits
PADDER3BIT1 u2(.a(a[5:3]),.b(b[5:3]),.cin(c[1]),.s(s[5:3]),.cout(c[2]));       // stage 3: 3 bits
PADDER4BIT u3(.a(a[9:6]),.b(b[9:6]),.cin(c[2]),.s(s[9:6]),.cout(c[3]));        // stage 4: 4 bits
PADDER5BIT1 u4(.a(a[14:10]),.b(b[14:10]),.cin(c[3]),.s(s[14:10]),.cout(c[4])); // stage 5: 5 bits
BrentKung8 u5(.A(a[22:15]),.B(b[22:15]),.Cin(c[4]),.S(s[22:15]),.Cout(c[5]));  // stage 6: 8-bit nucleus
PADDER5BIT u6(.a(a[27:23]),.b(b[27:23]),.cin(c[5]),.s(s[27:23]),.cout(c[6]));  // stage 7: 5 bits
PADDER4BIT1 u7(.a(a[31:28]),.b(b[31:28]),.cin(c[6]),.s(s[31:28]),.cout(cout)); // stage 8: 4 bits
endmodule
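A minimal testbench sketch that could drive this module in simulation (the stimulus values are illustrative):

module tb_PCSKA32_VSS;
  reg  [31:0] a, b;
  reg         cin;
  wire [31:0] s;
  wire        cout;

  PCSKA32_VSS uut (.a(a), .b(b), .cin(cin), .cout(cout), .s(s));

  initial begin
    a = 32'h0000_FFFF; b = 32'h0000_0001; cin = 1'b0; #10;
    $display("s=%h cout=%b (expect 00010000, 0)", s, cout);
    a = 32'hFFFF_FFFF; b = 32'h0000_0001; cin = 1'b0; #10;
    $display("s=%h cout=%b (expect 00000000, 1)", s, cout);
    $finish;
  end
endmodule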
The corresponding schematics of the adders after synthesis is shown below.

Figure 7.13: RTL schematic of Top-level Variable Stage Size Carry Skip Adder

Figure 7.14: RTL schematic of Internal block Variable Stage Size Carry Skip Adder
Figure 7.15: Technology schematic of Top-level Variable Stage Size Carry Skip Adder

Figure 7.16: Technology schematic of Internal block Variable Stage Size Carry Skip Adder

Figure 7.17: Internal block Variable Stage Size Carry Skip Adder
7.4 Synthesis Report

This device utilization includes the following.

Logic Utilization
Logic Distribution
Total Gate count for the Design
The device utilization summary gives the details of the number of devices used from the
available devices, also represented as a percentage. Hence, as the result of the synthesis process,
the device utilization for the chosen device and package is shown below.
Table 7-1: Synthesis report of Variable Stage Size Carry Skip Adder
Chapter-8
SIMULATION RESULTS
All the designs are described in Verilog HDL, and the synthesis and simulation are performed on
Xilinx ISE 14.4. The corresponding simulation results of the variable stage size carry skip adder
are shown in the figures below.

Figure 8-1: Test Bench for 16 bit Variable Stage Size Carry Skip Adder

Figure 8-2: Simulated output for Variable Stage Size Carry Skip Adder
CONCLUSION

In this paper, a CSKA structure called CI-CSKA was proposed, which exhibits a higher
speed and lower energy consumption compared with those of the conventional one. The speed
enhancement was achieved by modifying the structure through the concatenation and
incrementation techniques. In addition, AOI and OAI compound gates were exploited for the
carry skip logics. The efficiency of the proposed structure for both FSS and VSS was studied by
comparing its power and delay with those of the Conv-CSKA, RCA, CIA, SQRT-CSLA, and
KSA structures. The results revealed a considerably lower PDP for the VSS implementation of the
CI-CSKA structure over a wide range of voltages, from super-threshold to near-threshold. The
results also suggested the CI-CSKA structure as a very good adder for applications where
both speed and energy consumption are critical. In addition, a hybrid variable latency
extension of the structure was proposed. The efficacy of this structure was compared with
those of the variable latency RCA, C2SLA, and hybrid C2SLA structures. Again, the suggested
structure showed the lowest delay, making it a better candidate for high-speed applications.
REFERENCES

[1] I. Koren, Computer Arithmetic Algorithms, 2nd ed. Natick, MA, USA: A K Peters, Ltd., 2002.

[2] R. Zlatanovici, S. Kao, and B. Nikolic, "Energy-delay optimization of 64-bit carry-lookahead adders with a 240 ps 90 nm CMOS design example," IEEE J. Solid-State Circuits, vol. 44, no. 2, pp. 569–583, Feb. 2009.

[3] S. K. Mathew, M. A. Anders, B. Bloechel, T. Nguyen, R. K. Krishnamurthy, and S. Borkar, "A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 44–51, Jan. 2005.

[4] V. G. Oklobdzija, B. R. Zeydel, H. Q. Dao, S. Mathew, and R. Krishnamurthy, "Comparison of high-performance VLSI adders in the energy-delay space," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 6, pp. 754–758, Jun. 2005.

[5] B. Ramkumar and H. M. Kittur, "Low-power and area-efficient carry select adder," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 2, pp. 371–375, Feb. 2012.

[6] M. Vratonjic, B. R. Zeydel, and V. G. Oklobdzija, "Low- and ultra low-power arithmetic units: Design and comparison," in Proc. IEEE Int. Conf. Comput. Design, VLSI Comput. Process. (ICCD), Oct. 2005, pp. 249–252.

[7] C. Nagendra, M. J. Irwin, and R. M. Owens, "Area-time-power tradeoffs in parallel adders," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 43, no. 10, pp. 689–702, Oct. 1996.

[8] Y. He and C.-H. Chang, "A power-delay efficient hybrid carry-lookahead/carry-select based redundant binary to two's complement converter," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 1, pp. 336–346, Feb. 2008.

[9] C.-H. Chang, J. Gu, and M. Zhang, "A review of 0.18-μm full adder performances for tree structured arithmetic circuits," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 6, pp. 686–695, Jun. 2005.

[10] D. Markovic, C. C. Wang, L. P. Alarcon, T.-T. Liu, and J. M. Rabaey, "Ultralow-power design in near-threshold region," Proc. IEEE, vol. 98, no. 2, pp. 237–252, Feb. 2010.

[11] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, "Near-threshold computing: Reclaiming Moore's law through energy efficient integrated circuits," Proc. IEEE, vol. 98, no. 2, pp. 253–266, Feb. 2010.

[12] S. Jain et al., "A 280 mV-to-1.2 V wide-operating-range IA-32 processor in 32 nm CMOS," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), Feb. 2012, pp. 66–68.

[13] R. Zimmermann, "Binary adder architectures for cell-based VLSI and their synthesis," Ph.D. dissertation, Dept. Inf. Technol. Elect. Eng., Swiss Federal Inst. Technol. (ETH), Zürich, Switzerland, 1998.

[14] D. Harris, "A taxonomy of parallel prefix networks," in Proc. IEEE Conf. Rec. 37th Asilomar Conf. Signals, Syst., Comput., vol. 2, Nov. 2003, pp. 2213–2217.

[15] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Comput., vol. C-22, no. 8, pp. 786–793, Aug. 1973.

[16] V. G. Oklobdzija, B. R. Zeydel, H. Dao, S. Mathew, and R. Krishnamurthy, "Energy-delay estimation technique for high-performance microprocessor VLSI adders," in Proc. 16th IEEE Symp. Comput. Arithmetic, Jun. 2003, pp. 272–279.

[17] M. Lehman and N. Burla, "Skip techniques for high-speed carry propagation in binary arithmetic units," IRE Trans. Electron. Comput., vol. EC-10, no. 4, pp. 691–698, Dec. 1961.

[18] K. Chirca et al., "A static low-power, high-performance 32-bit carry skip adder," in Proc. Euromicro Symp. Digit. Syst. Design (DSD), Aug./Sep. 2004, pp. 615–619.

[19] M. Alioto and G. Palumbo, "A simple strategy for optimized design of one-level carry-skip adders," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 50, no. 1, pp. 141–148, Jan. 2003.

[20] S. Majerski, "On determination of optimal distributions of carry skips in adders," IEEE Trans. Electron. Comput., vol. EC-16, no. 1, pp. 45–58, Feb. 1967.

[21] A. Guyot, B. Hochet, and J.-M. Muller, "A way to build efficient carry-skip adders," IEEE Trans. Comput., vol. C-36, no. 10, pp. 1144–1152, Oct. 1987.

[22] S. Turrini, "Optimal group distribution in carry-skip adders," in Proc. 9th IEEE Symp. Comput. Arithmetic, Sep. 1989, pp. 96–103.

[23] P. K. Chan, M. D. F. Schlag, C. D. Thomborson, and V. G. Oklobdzija, "Delay optimization of carry-skip adders and block carry-lookahead adders using multidimensional dynamic programming," IEEE Trans. Comput., vol. 41, no. 8, pp. 920–930, Aug. 1992.

[24] V. Kantabutra, "Designing optimum one-level carry-skip adders," IEEE Trans. Comput., vol. 42, no. 6, pp. 759–764, Jun. 1993.

[25] V. Kantabutra, "Accelerated two-level carry-skip adders: A type of very fast adders," IEEE Trans. Comput., vol. 42, no. 11, pp. 1389–1393, Nov. 1993.

[26] S. Jia et al., "Static CMOS implementation of logarithmic skip adder," in Proc. IEEE Conf. Electron Devices Solid-State Circuits, Dec. 2003, pp. 509–512.

[27] H. Suzuki, W. Jeong, and K. Roy, "Low power adder with adaptive supply voltage," in Proc. 21st Int. Conf. Comput. Design, Oct. 2003, pp. 103–106.

[28] H. Suzuki, W. Jeong, and K. Roy, "Low-power carry-select adder using adaptive supply voltage based on input vector patterns," in Proc. Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2004, pp. 313–318.

[29] Y. Chen, H. Li, K. Roy, and C.-K. Koh, "Cascaded carry-select adder (C2SA): A new structure for low-power CSA design," in Proc. Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2005, pp. 115–118.

[30] Y. Chen, H. Li, J. Li, and C.-K. Koh, "Variable-latency adder (VL-adder): New arithmetic circuit design practice to overcome NBTI," in Proc. ACM/IEEE Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2007, pp. 195–200.

[31] S. Ghosh and K. Roy, "Exploring high-speed low-power hybrid arithmetic units at scaled supply and adaptive clock-stretching," in Proc. Asia South Pacific Design Autom. Conf. (ASP-DAC), Mar. 2008, pp. 635–640.

[32] Y. Chen et al., "Variable-latency adder (VL-adder) designs for low power and NBTI tolerance," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 11, pp. 1621–1624, Nov. 2010.

[33] S. Ghosh, D. Mohapatra, G. Karakonstantis, and K. Roy, "Voltage scalable high-speed robust hybrid arithmetic units using adaptive clocking," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 9, pp. 1301–1309, Sep. 2010.

[34] Y. Liu, Y. Sun, Y. Zhu, and H. Yang, "Design methodology of variable latency adders with multistage function speculation," in Proc. IEEE 11th Int. Symp. Quality Electron. Design (ISQED), Mar. 2010, pp. 824–830.

[35] Y.-S. Su, D.-C. Wang, S.-C. Chang, and M. Marek-Sadowska, "Performance optimization using variable-latency design style," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 10, pp. 1874–1883, Oct. 2011.

[36] K. Du, P. Varman, and K. Mohanram, "High performance reliable variable latency carry select addition," in Proc. Design, Autom., Test Eur. Conf. Exhibit. (DATE), Mar. 2012, pp. 1257–1262.

[37] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 2003.

[38] NanGate 45 nm Open Cell Library. [Online]. Available: http://www.nangate.com/, accessed Dec. 2010.

[39] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Trans. Comput., vol. C-31, no. 3, pp. 260–264, Mar. 1982.

[40] Synopsys HSPICE. [Online]. Available: http://www.synopsys.com, accessed Sep. 2011.
