Chapter 1
INTRODUCTION
1.1 INTRODUCTION
Multiplication is a fundamental arithmetic operation.
1.2 OBJECTIVES
Digital multipliers are among the most commonly used components in digital
circuit design. They are fast, reliable and efficient components that are used to
implement a wide range of operations. Depending upon the arrangement of the
components, there are different types of multipliers available, and a particular
multiplier architecture is chosen based on the application [5]. In many DSP
algorithms, the multiplier lies in the critical delay path and ultimately determines
the performance of the algorithm. The speed of the multiplier therefore largely
determines the speed of the overall system.
1.3 PROBLEM SPECIFICATION
In digital hardware, multiplication is commonly carried out using one of a few
well-known algorithms. The Urdhva Tiryagbhyam sutra also shows how an NxN
multiplier structure can be reduced to efficient 4x4 multiplier structures.
1.4 METHODOLOGIES
To design a fast and low power multiplier, several design and verification steps
have to be followed.
Chapter 2
LITERATURE SURVEY
2.1 VEDIC MATHEMATICS
Many Indian secondary school students consider Mathematics a very difficult
subject. Some students encounter difficulty with basic arithmetical operations,
while others find it difficult to manipulate symbols and balance equations; in other
words, abstract and logical reasoning is their hurdle. An experienced teacher of
Mathematics could compile a long list of such learning difficulties. Volumes have
been written on the diagnosis of 'learning difficulties' related to Mathematics and
on remedial techniques. Learning Mathematics is an unpleasant experience for some
students mainly because it involves mental exercise.
Of late, a few teachers and scholars have revived interest in Vedic Mathematics,
a system derived from Vedic principles. Vedic mathematics is the name given to the
ancient system of mathematics that was rediscovered from the Vedas. To be more
specific, it originated from the Atharva Veda, the fourth Veda. The Atharva Veda
deals with branches such as engineering, mathematics, sculpture, medicine and the
other sciences of which we are aware today. It is a unique technique of calculation
based on simple principles and rules, with which any mathematical problem - be it
arithmetic, algebra, geometry or trigonometry - can be solved mentally [6]. Swami
Bharati Krishna Tirthaji Maharaj, Shankaracharya of Govardhan Peeth, collected the
lost formulae from the Atharva Veda and wrote them down as sixteen mathematical
formulae, or sutras, together with thirteen sub-sutras (corollaries) derived from
the Vedas. Vedic mathematics introduces wonderful applications to arithmetic
computations, the theory of numbers, compound multiplications and related areas.
2.2 MULTIPLIERS
Multipliers play an important role in today's digital signal processing and
various other applications. With advances in technology, many researchers have
tried, and are trying, to design multipliers that offer one or more of the
following design targets: high speed, low power consumption, and regularity of
layout (and hence less area), or some combination of them in a single multiplier,
making them suitable for various high-speed, low-power and compact VLSI
implementations.
The common multiplication method is the add-and-shift algorithm. In parallel
multipliers, the number of partial products to be added is the main parameter that
determines the performance of the multiplier. To reduce the number of partial
products to be added, the Modified Booth algorithm is one of the most popular
choices. To achieve speed improvements, the Wallace tree algorithm can be used to
reduce the number of sequential adding stages. Further, by combining both the
Modified Booth algorithm and the Wallace tree technique we can obtain the
advantages of both algorithms in one multiplier. However, with increasing
parallelism, the amount of shifting between the partial products and intermediate
sums to be added increases, which may result in reduced speed, an increase in
silicon area due to the irregularity of the structure, and increased power
consumption due to the additional interconnect resulting from complex routing. On
the other hand, serial-parallel multipliers compromise speed to achieve better
area and power consumption. The selection of a parallel or serial multiplier
actually depends on the nature of the application.
Multiplication algorithm [7]:
1. If the LSB of Multiplier is 1, then add the multiplicand into an accumulator.
2. Shift the multiplier one bit to the right and multiplicand one bit to the left.
3. Stop when all bits of the multiplier are zero.
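The three steps above can be sketched as a small Python model (a behavioral illustration rather than a hardware description; the function name is ours):

```python
def shift_add_multiply(multiplicand: int, multiplier: int) -> int:
    """Shift-and-add multiplication following steps 1-3 above."""
    acc = 0
    while multiplier != 0:          # step 3: stop when all multiplier bits are zero
        if multiplier & 1:          # step 1: LSB of the multiplier is 1?
            acc += multiplicand     #         add the multiplicand into the accumulator
        multiplier >>= 1            # step 2: shift the multiplier one bit right...
        multiplicand <<= 1          #         ...and the multiplicand one bit left
    return acc
```

For example, shift_add_multiply(13, 11) returns 143, i.e. 13 x 11.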
Types of multipliers:
1. Serial multiplier:
Where area and power are of utmost importance and delay can be tolerated, the
serial multiplier is used. This circuit uses one adder to add the M x N partial
products. The delay is N cycles maximum. This circuit has several advantages in
asynchronous circuits.
4. Array Multiplier:
The array multiplier is well known for its regular structure. The multiplier
circuit is based on the add-and-shift algorithm. Each partial product is generated
by the multiplication of the multiplicand with one multiplier bit. The partial
products are shifted according to their bit orders and then added. The addition can
be performed with a normal carry-propagate adder. N-1 adders are required, where N
is the multiplier length. Although the method is simple, the addition is done
serially as well as in parallel.
In a straightforward multiplier, each multiplier bit generates one multiple of the
multiplicand to be added to the partial product. If the multiplier is very large,
then a large number of multiplicands have to be added, and the delay of the
multiplier is determined mainly by the number of additions to be performed. If
there is a way to reduce the number of additions, the performance will improve.
The Booth algorithm [8] is a method that reduces the number of multiplicand
multiples. For a given range of numbers to be represented, a higher representation
radix leads to fewer digits. Since a k-bit binary number can be interpreted as a
k/2-digit radix-4 number, a k/3-digit radix-8 number, and so on, high-radix
multiplication can deal with more than one bit of the multiplier in each cycle.
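As an illustration of the radix-4 (Modified Booth) idea, the Python sketch below recodes the multiplier into digits in {-2, -1, 0, 1, 2}, roughly halving the number of partial products. This is a behavioral model with our own naming, not the hardware of [8]:

```python
def booth_radix4_digits(multiplier: int, nbits: int):
    """Recode an nbits-wide two's-complement multiplier into radix-4 Booth
    digits by scanning overlapping bit triplets (b[i+1], b[i], b[i-1])."""
    b = multiplier & ((1 << nbits) - 1)
    digits, prev = [], 0                        # prev plays the role of b[-1] = 0
    table = {0: 0, 1: 1, 2: 1, 3: 2, 4: -2, 5: -1, 6: -1, 7: 0}
    for i in range(0, nbits, 2):
        triplet = (((b >> i) & 3) << 1) | prev  # bits b[i+1] b[i] b[i-1]
        prev = (b >> (i + 1)) & 1
        digits.append(table[triplet])
    return digits

def booth_multiply(a: int, b: int, nbits: int = 8) -> int:
    """Each recoded digit selects one easy multiple of the multiplicand a."""
    return sum(d * a * 4 ** j
               for j, d in enumerate(booth_radix4_digits(b, nbits)))
```

The easy multiples (0, +/-a, +/-2a) are exactly the shifts and negations that are cheap in hardware.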
A Wallace tree applies a layer of counters to the partial-product matrix to
produce a new, smaller matrix, then reduces this new matrix, and so on, until a
two-row matrix is generated. The most common counter used is the 3:2 counter,
which is a full adder. The final two rows are usually added with a carry-propagate
adder. The advantage of the Wallace tree is speed, because the addition of partial
products is now O(log N). The result of these additions is the final product bits
together with sum and carry bits, which are added in the final fast adder (CRA).
The block diagram of the Wallace tree multiplier is shown in figure 2.6.
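The reduction just described can be modeled in software. The sketch below is our own simplified behavioral model (one 3:2 pass per call, applied repeatedly until at most two rows remain, with the final two-row addition done numerically in place of the fast adder):

```python
def wallace_reduce(columns):
    """One 3:2 pass: every group of three bits in a column goes through a
    full adder, leaving a sum bit in the same column and a carry bit in the
    next.  columns[k] holds the bits of weight 2**k."""
    out = [[] for _ in range(len(columns) + 1)]
    for k, col in enumerate(columns):
        keep = len(col) - len(col) % 3
        for j in range(0, keep, 3):
            a, b, c = col[j:j + 3]
            out[k].append(a ^ b ^ c)                        # full-adder sum
            out[k + 1].append((a & b) | (b & c) | (a & c))  # full-adder carry
        out[k].extend(col[keep:])                           # 1-2 leftover bits
    return out

def wallace_multiply(a: int, b: int, nbits: int = 4) -> int:
    """AND-gate partial-product matrix, repeated 3:2 reduction, final add."""
    cols = [[] for _ in range(2 * nbits)]
    for i in range(nbits):
        for j in range(nbits):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    while max(len(c) for c in cols) > 2:
        cols = wallace_reduce(cols)
    return sum(sum(col) << k for k, col in enumerate(cols))
```

Each 3:2 pass preserves the numeric value (a + b + c = sum + 2*carry), so the final column sums give the exact product.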
In a combinational multiplier, the partial products are then added to form the
final result. The main advantage of binary multiplication is that the generation
of intermediate products is easy: if the multiplier bit is a 1, the product is a
correctly shifted copy of the multiplicand; if the multiplier bit is a 0, the
product is simply 0. The architecture of the combinational multiplier is shown in
figure 2.7. In most systems combinational multipliers are slow and take a lot of
area.
Generally, it is not possible to say that one particular multiplier yields greater
cost-effectiveness, since the trade-off is design- and technology-dependent. These
basic array multipliers consume low power and exhibit good performance; however,
their use is limited to sixteen bits. By contrast, due to the regular structure of
the Vedic multiplier, power consumption and delay are reduced as the order of the
multiplier increases.
2.3 ADDERS
An adder or summer is a digital circuit that performs addition of numbers. In
many computers and other kinds of processors, adders are used not only in the
arithmetic logic units, but also in other parts of the processor, where they are used to
calculate addresses, table indices, and similar operations.
Although adders can be constructed for many numerical representations, such
as binary-coded decimal or excess-3, the most common adders operate on binary
numbers. In cases where two's complement or ones' complement is used to represent
negative numbers, it is trivial to modify an adder into an adder-subtractor. Other
signed number representations require a more complex adder. Adder circuits are of
two types: the half adder and the full adder.
Half adder
A half adder is a combinational arithmetic circuit that adds two numbers and
produces a sum bit (S) and a carry bit (C) as the output. If A and B are the input
bits, then the sum bit (S) is the XOR of A and B and the carry bit (C) is the AND
of A and B. From this it is clear that a half adder circuit can easily be
constructed using one XOR gate and one AND gate. The half adder is the simplest of
all adder circuits, but it has a major disadvantage: it can add only two input
bits (A and B) and cannot handle a carry, if there is any, at the input. So if the
inputs to a half adder include a carry, it will be neglected and only the A and B
bits are added. That means the binary addition process is not complete, and that
is why it is called a half adder.
Full adder
A full adder adds two input bits and an incoming carry, producing a sum bit (S)
and a carry-out bit (Cout). For example, the result of 1+1+0 is 10, just as
1+1+0 = 2 in the decimal system; 2 in the decimal system corresponds to 10 in the
binary system. Splitting the result 10 gives S = 0 and Cout = 1, which justifies
the second-last row of the truth table. The same reasoning can be applied to any
row in the table.
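The gate-level descriptions above translate directly into a bit-level Python sketch; the full adder (used by the adders in the next subsection) is built from two half adders plus an OR gate for the carry:

```python
def half_adder(a: int, b: int):
    """Sum is the XOR of the inputs, carry is the AND, as described above."""
    return a ^ b, a & b

def full_adder(a: int, b: int, cin: int):
    """A full adder built from two half adders and an OR gate."""
    s1, c1 = half_adder(a, b)      # add the two input bits
    s, c2 = half_adder(s1, cin)    # then add the incoming carry
    return s, c1 | c2              # at most one of c1, c2 can be 1
```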
Other adders
1. Ripple Carry Adder (RCA)
The ripple carry adder is constructed by cascading full adders (FA) blocks in
series. One full adder is responsible for the addition of two binary digits at any stage
of the ripple carry. The carryout of one stage is fed directly to the carry-in of the next
stage. Even though this is a simple adder and can be used to add unrestricted bit
length numbers, it is however not very efficient when large bit numbers are used. One
of the most serious drawbacks of this adder is that the delay increases linearly with
the bit length. The worst-case delay of the RCA is when a carry signal transition
ripples through all stages of adder chain from the least significant bit to the most
significant bit, which is approximated by:

t = (n + 1)tc + ts

Eq.2.1

where tc is the delay through the carry stage of a full adder, and ts is the delay to
compute the sum of the last stage. The delay of ripple carry adder is linearly
proportional to n, the number of bits; therefore the performance of the RCA is limited
when n grows bigger. The advantages of the RCA are lower power consumption as
well as compact layout giving smaller chip area.
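A behavioral sketch of the ripple chain (function name is ours) makes the linear carry dependence explicit: the carry computed at bit i must be available before bit i+1 can finish:

```python
def ripple_carry_add(a: int, b: int, nbits: int = 8):
    """n cascaded full adders; the carry out of stage i feeds stage i+1,
    so the critical path grows linearly with nbits."""
    carry, result = 0, 0
    for i in range(nbits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry                          # full-adder sum
        carry = (ai & bi) | (carry & (ai ^ bi))      # full-adder carry out
        result |= s << i
    return result, carry
```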
2. Carry Lookahead Adder (CLA)
A carry-lookahead adder improves speed by being able to deduce quickly whether,
for each group of digits, that group is going to propagate a carry that comes in
from the right. The net effect is that the carries start by propagating slowly
through each 4-bit group, just as in a ripple-carry system, but then move 4 times
faster, leaping from one lookahead carry unit to the next. Finally, within each
group that receives a carry, the carry propagates slowly within the digits in that
group.
The lookahead logic is built from per-bit generate (g) and propagate (p) signals:

gi = ai bi
Eq.2.2
pi = ai ⊕ bi
Eq.2.3
ci+1 = gi + pi ci
Eq.2.4
si = pi ⊕ ci
Eq.2.5
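A 4-bit sketch of the generate/propagate scheme in Python (the carry recurrence is evaluated iteratively here for clarity; real hardware flattens it into two-level logic so that all carries appear at once):

```python
def cla4_add(a: int, b: int, cin: int = 0):
    """4-bit carry-lookahead addition: all carries are computed from the
    generate (g) and propagate (p) signals instead of rippling sums."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]  # gi = ai AND bi
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]  # pi = ai XOR bi
    c = [cin]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))                       # ci+1 = gi + pi*ci
    s = sum((p[i] ^ c[i]) << i for i in range(4))            # si = pi XOR ci
    return s, c[4]
```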
The need for fast, low-power arithmetic in, for example, multimedia processors
makes the carry-skip structure more interesting. The crossover point between the
ripple-carry adder and the carry-skip adder depends on technology considerations
and is normally situated between 4 and 8 bits. The carry-skip circuitry consists
of two logic gates. The AND gate accepts the carry-in bit and compares it to the
group propagate signal, formed from the individual propagate values:

p[i, i+3] = p(i+3) p(i+2) p(i+1) p(i)
Eq.2.6
The output from the AND gate is ORed with the carry-out of the RCA to produce the
stage output carry:

carry-out = c(i+4) + p[i, i+3] c(i)
Eq.2.7
If p[i, i+3] = 0, then the carry-out of the group is determined by the value of
c(i+4). However, if p[i, i+3] = 1 and the carry-in bit is c(i) = 1, then the group
carry-in is automatically sent to the next group of adders. The design schematic
of the Carry Skip Adder is shown in figure 2.14.
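One 4-bit carry-skip stage can be sketched as follows (a behavioral model with our own naming; the OR at the end plays the role of the skip path of figure 2.14):

```python
def carry_skip_group(a: int, b: int, cin: int):
    """One 4-bit carry-skip stage: ripple through the group, but if every
    bit position propagates (Eq. 2.6), forward cin directly to the next
    group via the skip path (Eq. 2.7)."""
    p_group = 1
    carry, result = cin, 0
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        p_group &= ai ^ bi                        # AND of individual propagates
        s = ai ^ bi ^ carry
        carry = (ai & bi) | (carry & (ai ^ bi))   # rippled carry c(i+4)
        result |= s << i
    cout = carry | (p_group & cin)                # skip path ORed with ripple
    return result, cout
```

The skip path does not change the logical result; its benefit in hardware is that the group carry-out becomes valid without waiting for the internal ripple.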
In a carry bypass adder (CBA), when the bypass condition holds the carry is sent
immediately to the next block through the bypass; otherwise, the carry is obtained
via the normal route. If (P0P1P2P3P4P5P6P7 = 1) then C0,7 = Ci,0; otherwise either
a Delete or a Generate occurred. Hence, in a CBA the full adders are divided into
groups, and each group is bypassed by a multiplexer if all of its full adders are
in propagate mode.
A four-bit carry select adder generally consists of two ripple carry adders and a
multiplexer. The carry-select adder is simple but rather fast, having a gate-level
depth of O(√n). Adding two n-bit numbers with a carry-select adder is done with
two adders (two ripple carry adders) in order to perform the calculation twice,
one time with the assumption of the carry being zero and the other assuming one.
After the two results are calculated, the correct sum, as well as the correct
carry, is selected with the multiplexer once the correct carry is known. A
carry-select adder is 40% to 90% faster than an RCA because it performs additions
in parallel and reduces the maximum carry path.
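The duplicate-and-select idea can be sketched as follows (Python integer addition stands in for the two precomputed ripple results; names are ours):

```python
def carry_select_add(a: int, b: int, nbits: int = 8, block: int = 4):
    """Each block is computed twice -- once assuming carry-in 0 and once
    assuming carry-in 1 -- and a multiplexer picks the right pair as soon
    as the real carry arrives."""
    mask = (1 << block) - 1
    carry, result = 0, 0
    for base in range(0, nbits, block):
        ab, bb = (a >> base) & mask, (b >> base) & mask
        t0, t1 = ab + bb, ab + bb + 1     # the two precomputed block results
        chosen = t1 if carry else t0      # multiplexer select
        result |= (chosen & mask) << base
        carry = chosen >> block           # block carry-out
    return result, carry
```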
Each of the adders described above has both advantages and disadvantages. As
delay is the major factor in many applications, the carry-lookahead adder is used
in the proposed design to reduce delay.
Chapter 3
PROBLEM SPECIFICATION
3.1
Various tricks and shortcuts are suggested in Vedic mathematics to optimize the
multiplication process. These methods are based on the concepts of
1. Multiplication using deficits and excess
2. Changing the base to simplify the operation.
Various methods of multiplication are proposed in Vedic multiplier. Among
sixteen sutras mainly three sutras are used for multiplication in Vedic mathematics.
1. Urdhva Tiryagbhyam: vertically and crosswise
2. Nikhilam Navatashcharamam Dashatah: All from nine and last from ten
3. Anurupyena: proportionately
URDHVA TIRYAGBHYAM
Urdhva Tiryagbhyam is the general formula applicable to all cases of
multiplication, and also to the division of a large number by another large
number. The method can be applied to decimal numbers as well as binary numbers
[10-11]. Here, the partial products and their sums are calculated in parallel, so
the multiplier is independent of the clock frequency of the processor. Due to its
regular structure, it can easily be laid out in microprocessors, and designers can
easily work around layout and timing problems to avoid catastrophic device
failures.
The processing power of the multiplier can easily be increased by increasing the
input and output data bus widths, since it has quite a regular structure. Due to
this regularity, it can easily be laid out on a silicon chip. A multiplier based
on this sutra also has the advantage that, as the number of bits increases, gate
delay and area increase very slowly compared to other conventional multipliers.
Urdhva Tiryagbhyam sutra which is used in the proposed multiplier is
illustrated with the help of a simple example. Although this sutra can be used for both
binary and decimal numbers, the example shown is used for binary numbers as binary
multiplication is required in any processor.
Multiplication of two 2 digit binary numbers:
Example 1: Find the product 3(11) X 3(11)
Step 1: The right hand most digit of the multiplicand, the first number (3) i.e., 1 is
multiplied by the right hand most digit of the multiplier, the second number (3) i.e., 1.
The product 1 X 1 = 1 forms the right hand most part of the answer.
Step 2: Now, diagonally multiply the first digit of the multiplicand (3) i.e., 1 and
second digit of the multiplier (3) i.e., 1 (answer 1 X 1=1); then multiply the second
digit of the multiplicand i.e., 1 and first digit of the multiplier i.e., 1 (answer 1 X 1 =
1); add these two: 1 + 1 = 10. This gives the next, i.e., second digit of the
answer. Hence the second digit of the answer is 0 and the carry for the next digit
is 1.
Step 3: Now, multiply the second digit of the multiplicand i.e., 1 and second digit of
the multiplier i.e., 1 vertically, i.e., 1 X 1 = 1. Then add this 1 to the previous carry
which is 1 (1+1=10). It gives the left hand most part of the answer.
Thus the product obtained is 1001.
Symbolically we can represent the Vedic multiplication process for two bit numbers
as follows:
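The three steps can also be sketched bit by bit in Python (a behavioral model of the 2x2 process; function and variable names are ours):

```python
def vedic_2x2(a: int, b: int) -> int:
    """Urdhva Tiryagbhyam for 2-bit operands a = a1a0 and b = b1b0."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    s0 = a0 & b0                                  # step 1: vertical (LSBs)
    x, y = a0 & b1, a1 & b0                       # step 2: crosswise products
    s1, c1 = x ^ y, x & y                         # half adder for the middle digit
    s2, c2 = (a1 & b1) ^ c1, (a1 & b1) & c1      # step 3: vertical (MSBs) + carry
    return (c2 << 3) | (s2 << 2) | (s1 << 1) | s0
```

vedic_2x2(0b11, 0b11) returns 0b1001, the product 9 from Example 1.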
3.2
The proposed multiplier architectures are described in the sections below. Here,
the Urdhva Tiryagbhyam (vertically and crosswise) sutra is used to propose an
architecture for the multiplication of two binary numbers. The beauty of the Vedic
multiplier is that partial product generation and additions are done concurrently;
hence, it is well suited to parallel processing. This feature makes it all the
more attractive for binary multiplication and in turn reduces delay, which is the
primary motivation behind this work.
Vedic Multiplier for 2x2 bit Module
The method is explained below for two, 2 bit binary numbers A and B where
A = a1a0 and B = b1b0 as shown in previous example. Firstly, the least significant
bits are multiplied which gives the least significant bit of the final product (vertical).
Then, the LSB of the multiplicand is multiplied with the next higher bit of the
multiplier and added to the product of the LSB of the multiplier and the next
higher bit of the multiplicand (crosswise). The sum gives the second bit of the
final product and the carry
is added with the partial product obtained by multiplying the most significant bits to
give the sum and carry. The sum is the third corresponding bit and carry becomes the
fourth bit of the final product.
The 2x2 Vedic multiplier module is implemented using four AND gates and two
half-adders, as displayed in its block diagram in figure 3.2. It is found that the
hardware architecture of the 2x2-bit Vedic multiplier is the same as that of the
2x2-bit conventional array multiplier. Hence it is concluded that multiplication
of 2-bit binary numbers by the Vedic method does not significantly improve the
multiplier's efficiency. More precisely, the total delay is only two half-adder
delays after the final bit products are generated, which is very similar to the
array multiplier. So we switch over to the implementation of the 4x4-bit Vedic
multiplier, which uses the 2x2-bit multiplier as a basic building block. The same
method can be extended to 4- and 8-bit inputs, but for a higher number of input
bits a little modification is required: whereas in a 2x2 multiplier half adders
are enough to implement the design, higher-order multipliers require higher-order
adders such as carry-lookahead adders, ripple carry adders, etc.
Each block shown above is a 2x2-bit Vedic multiplier. The first 2x2-bit
multiplier's inputs are A1A0 and B1B0. The last block is a 2x2-bit multiplier with
inputs A3A2 and B3B2. The middle one shows two 2x2-bit multipliers with inputs
A3A2 & B1B0 and A1A0 & B3B2. The final result of the multiplication is 8 bits,
S7S6S5S4S3S2S1S0. To understand the concept, the block diagram of the 4x4-bit
Vedic multiplier is shown in figure 3.4. To get the final product
(S7S6S5S4S3S2S1S0), four 2x2-bit Vedic multipliers, two 4-bit carry-lookahead
(CLA) adders, a 2-bit OR gate and two half adders are required. The proposed Vedic
multiplier can be used to reduce delay. Early literature describes Vedic
multipliers based on array multiplier structures; here, on the other hand, we
propose a new architecture that is efficient in terms of speed.
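A behavioral Python sketch of this 4x4 composition (ordinary '+' stands in for the CLA/half-adder stage; the 2x2 block is restated so the sketch is self-contained, and all names are ours):

```python
def vedic_2x2(a: int, b: int) -> int:
    # 2x2 building block: vertical and crosswise with two half adders
    a0, a1, b0, b1 = a & 1, (a >> 1) & 1, b & 1, (b >> 1) & 1
    x, y = a0 & b1, a1 & b0
    s2, c2 = (a1 & b1) ^ (x & y), (a1 & b1) & (x & y)
    return (c2 << 3) | (s2 << 2) | ((x ^ y) << 1) | (a0 & b0)

def vedic_4x4(a: int, b: int) -> int:
    """Four 2x2 blocks: AL*BL, the two cross products at a 2-bit offset,
    and AH*BH at a 4-bit offset, combined by the adder stage."""
    al, ah = a & 0b11, (a >> 2) & 0b11
    bl, bh = b & 0b11, (b >> 2) & 0b11
    cross = vedic_2x2(al, bh) + vedic_2x2(ah, bl)
    return vedic_2x2(al, bl) + (cross << 2) + (vedic_2x2(ah, bh) << 4)
```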
(Block diagram of the 4x4 multiplier.)

The 8x8-bit Vedic multiplier is built from 4x4-bit multiplier blocks as discussed
in the previous section. Let us analyze an 8x8 multiplication, say
A = A7A6A5A4A3A2A1A0 and B = B7B6B5B4B3B2B1B0. The multiplication result will be
16 bits, S15S14S13S12S11S10S9S8S7S6S5S4S3S2S1S0. Let us divide A and B into two
parts: the 8-bit multiplicand A can be decomposed into a pair of 4-bit halves, AH
and AL. Similarly, the multiplicand B can be decomposed into BH and BL.
Using the fundamentals of Vedic multiplication, taking four bits at a time and
using the 4-bit multiplier block discussed above, we can perform the
multiplication. The outputs of the 4x4-bit multipliers are added accordingly to
obtain the final product. In total, two 8-bit carry-lookahead adders are required,
as shown in figure 3.5.
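The AH-AL / BH-BL decomposition generalizes recursively. The sketch below (our behavioral model, with '+' standing in for the CLA stages) applies it all the way down to single AND gates:

```python
def vedic_multiply(a: int, b: int, nbits: int) -> int:
    """Split each operand into high/low halves and combine four half-size
    Vedic products; nbits is assumed to be a power of two."""
    if nbits == 1:
        return a & b                       # a single AND gate at the base
    half = nbits // 2
    mask = (1 << half) - 1
    al, ah = a & mask, a >> half
    bl, bh = b & mask, b >> half
    low   = vedic_multiply(al, bl, half)
    cross = vedic_multiply(al, bh, half) + vedic_multiply(ah, bl, half)
    high  = vedic_multiply(ah, bh, half)
    return low + (cross << half) + (high << nbits)
```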
Chapter 4
SOFTWARE IMPLEMENTATION
4.1 EDA TOOLS
Digital circuit design has evolved rapidly over the last 25 years. The earliest
digital circuits were designed with vacuum tubes and transistors. Integrated circuits
were then invented where logic gates were placed on a single chip. The first
integrated circuit chips were SSI (small scale integration) chips where the gate count
was very small. As technologies became sophisticated, designers were able to place
circuits with hundreds of gates on a chip. These chips are called MSI (medium scale
integration) chips. With the advent of LSI (large scale integration), designers could
put thousands of gates on a single chip. At this point, design processes started getting
very complicated, and designers felt the need to automate these processes. Electronic
design automation (EDA) techniques began to evolve. Chip designers began to use
circuit and logic simulation techniques to verify the functionality of building blocks of
the order of about 100 transistors. The circuits were still tested on the breadboard, and
the layout was done on paper or by hand on a graphic computer terminal.
With the advent of VLSI (very large scale integration) technology, designers
could design single chips with more than 100,000 transistors. Because of the
complexity of these circuits, it was not possible to verify these circuits on a
breadboard. Computer aided techniques became critical for verification and design of
VLSI digital circuits. Computer programs to do automatic placement and routing of
circuit layouts also became popular. Designers were now building gate-level
digital circuits manually on graphic terminals: they would build small blocks and
then derive higher-level blocks from them. A design flow consists of several
steps, and there is a need for a toolset at each step of the process. Modern
FPGA/ASIC projects require a complete set of CAD (Computer Aided Design) tools.
The following are the most common tools available.
Design Capture Tools
A design entry tool encapsulates a circuit description. These tools capture a
design and prepare it for simulation. Design requirements dictate the type of
design capture tool as well as the options needed.
After design capture and simulation, the design is synthesized and passed to a
place-and-route tool. The designer then maps the gate-level description or netlist
to the target design library and optimizes for speed, area or power consumption.
The objective is to provide a tool set for FPGA/ASIC design such that the number
of vendors needed to design and build the ASIC/FPGA is minimized. Each vendor
brings a new set of learning curves, maintenance tasks and interfaces that
eventually consume more time and money. Tool maturity is another important factor:
new tools almost always come with a certain number of bugs and issues.
Design Hierarchy
Hierarchical systems in general are systems organized in the shape of a pyramid
with multiple rows of objects. Each object in a row may be linked to the objects
beneath it. Hierarchical systems are as popular in computer design as they are in
everyday life. A good example of a hierarchical system in everyday life is a
monarchy, with the King on top and the Prime Minister at the next level down; the
people form the base of the pyramid. An obvious example in the computer world is a
file system with the root directory on top, in which directories contain files and
subdirectories underneath. Generally speaking, hierarchical systems have stronger
connectivity inside the modules than outside them.
Design hierarchy trees do not have a regular pattern in size and number of nodes,
and it is really up to the designer to decide how the tree should look. Figure 1-1
outlines the partitioning process in a top-down design methodology. The figure
shows that the partitioning procedure is called recursively until the design of
all sub-components is feasible by the hardware mapping procedure.
Design Methodology
Digital circuits are becoming more complex while time-to-market windows
are shrinking. Designers cannot keep up with the advancements in engineering unless
they adopt a more methodological process for design and verification of digital
circuits, i.e. a design methodology. This involves more than simply coming up with
the block diagram of a design. Rather, it requires developing and following a formal
verification plan and an incremental approach for transforming the design from an
abstract block diagram to a detailed transistor level implementation. High-level ASIC
or FPGA designs start off with capturing the design idea with a hardware description
language (HDL) at the behavioral or register-transfer (RTL) level. The design will be
then verified with an HDL simulator and synthesized to gates. Gate-level design
methods such as schematic capture were the typical design approach until a few
years ago, but when the average gate count passed the 10,000-gate threshold they
started to break down. On the other hand, the pressure to reduce the design cycle
increased and, as a result, high-level design became an imperative part of digital
design engineering. Industry experts agree nowadays that most FPGA/ASIC designers
will turn to high-level design methodologies in the near future. This happens
primarily because of the technology improvements that have taken place in the EDA
(Electronic Design Automation) tools, hardware and software. HDL allows the
designer to organize and integrate complex functions and verify the individual blocks
and eventually the entire design with tools like HDL simulators. Designers making
the switch to a high level design methodology leverage some obvious benefits. First,
individual designers are able to handle increased complexity by working at higher
levels of abstraction and delaying the design details. Second, designers can shorten
cycles and improve quality by verifying functionality earlier in the design cycle, when
design changes are easier and less expensive to make. Some designers may think that
FPGAs require significantly less functionality and features compared to ASICs. The
truth is that the improvements in FPGA technology including hardware and software
tools have enabled the FPGA manufacturers to come up with a high level of
integration in FPGAs. Today, several features are being implemented in FPGAs such
as processors, transceivers and even debugging tools are available inside the FPGA.
This makes the design cycle faster and simpler than what it was before. FPGA
designers are using as many features and as much functionality as is made available to
them. Economic requirements on the other hand play a role in the design methodology
arena.
Functionality and flexibility are two key features of design tool sets. FPGA
design tools are more cost-effective than ASIC design tools; however, they lack
some of the functionality and features available in ASIC design tools. An ASIC
design tool seat may cost over $100,000, while an FPGA design tool seat barely
runs over $10,000. There is always a compromise between different factors in
deciding which way to go: time to market, NRE (Non-Recurring Engineering) cost,
ease of use, programmability, flexibility, etc.
A number of FPGA designs are more complex and have higher density than some ASIC
designs being implemented. Furthermore, high-end FPGA design methodology nowadays
mirrors ASIC design methodology. To meet this emerging requirement, tool vendors
must provide solutions that enable FPGA designers to deal with complex designs
and, at the same time, integrate a variety of functions and features at a high
level of abstraction. HDL-based design represents a new paradigm for some
designers. The design flow is straightforward but, like anything else, it requires
going through a learning curve.
Regardless of which tools you use, the bottom line is that advanced HDL
design tools are a must for FPGA and ASIC designers. Price is not the only factor in
making the decision for the appropriate tools. Rather, designers must consider the
features required to meet their design goals and find out which tools are leaders in
their class.
Typical HDL Design Flow
Figure 4.1 shows a typical HDL design flow. This figure does not take into
account whether an ASIC or FPGA is being designed; in fact, it is a high-level
view of HDL design. In later chapters the detailed view for each technology will
be covered. As can be seen, once the design is created, it must be verified prior
to RTL synthesis to make sure that its functionality is as intended. A test bench
should be created to verify
the functionality of the design. This test bench will be used throughout the design
flow to verify the functionality at the functional, RTL and timing levels. A test bench
is a separate VHDL or Verilog piece of code that is connected to the design's inputs
and outputs. In fact, design itself is considered as a black box with a set of inputs and
outputs. The input test stimulus is applied to the design inputs and the outputs are
observed to ensure the correct functionality. Since the test vectors stay constant
throughout synthesis and place and route, they can be used at each stage to verify
functionality.
The main purpose of the test bench is to provide the stimulus and response
information, such as clocks, reset, and input data, that the design will encounter when
implemented in an ASIC or FPGA and installed into the final system.

(Design flow: Specification, Coding, Synthesis, Optimization, Implementation.)
Design Flow
The choice between top-down and bottom-up implementations is usually based on the
designer's preferences. One may prefer one method for developing a certain type of
application while another person prefers the other. The key idea of both
methodologies is the hierarchical propagation of the design units based on
behavioral modeling and optimization at each level.
A Bottom-up design methodology starts with individual blocks, which are
then combined to form the system. Design of each block starts with a set of
specifications and ends with a transistor level implementation, but each block is
verified individually and eventually all the blocks will be combined and verified
together.
On the other hand a top-down design approach defines the architecture of the
whole design as a single unit. The whole design is simulated and optimized
afterwards. The requirements for lower level blocks are derived based on the results
obtained in the previous steps. Each level is completely designed before proceeding to
the next step.
Circuits can be designed individually to meet the specifications and finally the
entire design is laid out and verified against the original requirements. Top-down
design refers to the partitioning of a system into its sub-components until all
sub-components become manageable design parts. If the design of a component is
available as part of a library, it can be reused; otherwise it can be implemented
by modifying an already available part, or it can be described for a synthesis
program or an automatic hardware generator. Partitioning the design into smaller
modules makes it possible for the designers to work as a team and be more
productive. It also reduces the total time required to complete the design,
because it reduces the effect of late changes in the design process: a change in
one module does not necessarily require updating the rest of the system. Generally
speaking, bottom-up design methodology is effective for small designs, but as the
size and complexity of digital designs continue to increase, this approach runs
into problems. A design is likely to have a number of blocks, and once they are
combined, simulation takes a long time and verification becomes difficult. Also,
the decisions that determine performance, cost and functionality are typically
made at the architectural level, which makes design modification difficult, since
any change at the higher levels of abstraction must propagate to the lower-level
modules and requires redesigning the lower-level blocks. The other challenge is
that in bottom-up design methodology several steps have to be done sequentially,
which lengthens the design process, especially when it comes to modification of
the design. To address all these challenges, many designers prefer an alternative
method, namely top-down design methodology.
The idea is to break the design down into smaller pieces so that each piece can
be designed one at a time. Of course, all the pieces have to be put together again, and
this should provide the solution to the original design problem. Assembly of the
smaller blocks is referred to as bottom-up implementation. This approach has been
applied to complex
engineering projects and is now finding its way into digital designs. Top-level design
hierarchy specifies the partitioning of the system into manageable blocks as well as
each block interface. One of the strengths of this method is that once the top-level
schematic is specified, the design process for all the blocks can be started
concurrently. It is important to avoid unnecessarily complex models in top-down
design methodology, since they complicate the design verification process.
Each block can be specified by its behavior and design can be expanded gradually.
Mapping to hardware depends on target technology, available libraries, and available
tools.
Generally, unavailability of good tools and/or libraries can be compensated by
further partitioning of a system into simpler components. After the completion of this
top-down design process, the bottom-up implementation phase begins. In this phase,
hardware components corresponding to the terminals of the tree are recursively wired
to form the hierarchical wiring of the complete system. The partition tree also shows
that the original design is initially described at the behavioral level. In the first level of
partitioning, one of its sub-components is mapped to hardware. Further partitioning is
required for hardware implementation of the other two components. This procedure
goes on until the hardware implementations of all components are available.
At every step of a top-down design process, a multilevel simulation
tool plays an important role in the correct implementation of the design. Initially, a
behavioral description of the system under design (SUD) must be simulated to verify
the designer's understanding of the problem. After the first level of partitioning, a
behavioral description of each of the sub-components must be developed, and these
descriptions must be wired to form a structural hardware model of SUD. Simulation
of this new model and comparing the results with those of the original SUD
description will verify the correctness of the first level of partitioning. After verifying
the first level of partitioning, hardware implementation of each sub-component must
be verified. For this purpose, another simulation run in which behavioral models of
sub-components are replaced by more detailed hardware level models will be
performed.
The process of partitioning and verification stated above continues throughout
the design process. At the end, a simulation model, consisting of the interconnection
specification of hardware-level models of the terminals of the partition tree, will be
formed. Simulating this model and comparing the results with those of the original
behavioral description of the SUD verifies the correctness of the complete design.
In a large design where simulation of a complete hardware-level model is too
time consuming, subsections of the partition tree will be independently verified.
Verified behavioral models of such subsections will be used in forming the simulation
model for final design verification.
4.4 XILINX
Xilinx ISE (Integrated Synthesis Environment) [12] is a software tool
produced by Xilinx for synthesis and analysis of HDL designs, enabling the developer
to synthesize their designs, perform timing analysis, examine RTL diagrams, simulate
a design's reaction to different stimuli, and configure the target device with the
programmer.
The Xilinx ISE is a design environment for FPGA products from Xilinx; it is
tightly coupled to the architecture of such chips and cannot be used with FPGA
products from other vendors. The Xilinx ISE is primarily used for circuit synthesis
and design, while the ModelSim logic simulator is used for system-level testing.
Other components shipped with the Xilinx ISE include the Embedded Development
Kit (EDK), a Software Development Kit (SDK) and ChipScope Pro.
User Interface
The primary user interface of the ISE is the Project Navigator, which includes
the design hierarchy (Sources), a source code editor (Workspace), an output console
(Transcript), and a processes tree (Processes). The design hierarchy consists of design
files (modules), whose dependencies are interpreted by the ISE and displayed as a tree
structure. For single-chip designs there may be one main module, with other modules
included by the main module, similar to the main() function in C++ programs.
Design constraints, such as pin configuration and mapping, are specified in a separate
constraints file rather than in the modules themselves.
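As a sketch of such a hierarchy, a hypothetical top-level module can instantiate lower-level modules much as a main() function calls subroutines (the module and signal names below are illustrative, not taken from the project source):

```verilog
// two leaf modules, analogous to lower-level source files in the hierarchy
module and_stage(o, x, y);
  output o; input x, y;
  assign o = x & y;
endmodule

module or_stage(o, x, y);
  output o; input x, y;
  assign o = x | y;
endmodule

// top-level ("main") module: the ISE Sources window would display
// and_stage and or_stage as children of top in the design tree
module top(y, a, b, c);
  output y; input a, b, c;
  wire w;
  and_stage u1(w, a, b);
  or_stage  u2(y, w, c);
endmodule
```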
The Processes hierarchy describes the operations that the ISE will perform on
the currently active module. The hierarchy includes compilation functions, their
dependency functions, and other utilities. The window also denotes issues or errors
that arise with each function. The Transcript window provides status of currently
running operations, and informs engineers on design issues. Such issues may be
filtered to show Warnings, Errors, or both.
Simulation
System-level testing may be performed with the ModelSim logic simulator,
and such test programs must also be written in HDL languages. Test bench programs
may include simulated input signal waveforms, or monitors which observe and verify
the outputs of the device under test. ModelSim may be used to perform the following
types of simulations:
1. Logical verification, to ensure the module produces expected results
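For instance, a minimal test bench for the half adder of Appendix I drives simulated input waveforms and monitors the outputs (the test bench module name is illustrative):

```verilog
// exhaustive test bench for the halfadder module of Appendix I
module halfadder_tb;
  reg a, b;    // driven stimulus
  wire s, c;   // observed outputs
  halfadder dut(s, c, a, b);
  initial begin
    $monitor("t=%0t a=%b b=%b s=%b c=%b", $time, a, b, s, c);
    a = 0; b = 0;       // walk through all four input combinations
    #10 a = 0; b = 1;
    #10 a = 1; b = 0;
    #10 a = 1; b = 1;
    #10 $finish;
  end
endmodule
```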
4.5 SYNTHESIS
Xilinx's patented algorithms for synthesis allow designs to run up to 30%
faster than competing programs, and allow greater logic density which reduces project
costs. Also, due to the increasing complexity of FPGA fabric, including memory
blocks and I/O blocks, more complex synthesis algorithms were developed that
separate unrelated modules into slices, reducing post-placement errors. IP Cores are
offered by Xilinx and other third-party vendors, to implement system-level functions
such as digital signal processing (DSP), bus interfaces, networking protocols, image
processing, embedded processors, and peripherals. Xilinx has been instrumental in
shifting designs from ASIC-based implementation to FPGA-based implementation.
Chapter 5
HARDWARE IMPLEMENTATION
After simulating the Verilog code using Xilinx, it is dumped into a field
programmable gate array (FPGA) [12]. An FPGA is a device that contains a matrix of
reconfigurable gate array logic circuitry. When an FPGA is configured, the internal
circuitry is connected in a way that creates a hardware implementation of the software
application. Unlike processors, FPGAs use dedicated hardware for processing logic
and do not have an operating system. FPGAs are truly parallel in nature so different
processing operations do not have to compete for the same resources. As a result, the
performance of one part of the application is not affected when additional processing
is added. Also, multiple control loops can run on a single FPGA device at different
rates. FPGA-based control systems can enforce critical interlock logic and can be
designed to prevent I/O forcing by an operator. However, unlike hard-wired printed
circuit board (PCB) designs which have fixed hardware resources, FPGA-based
systems can literally rewire their internal circuitry to allow reconfiguration after the
control system is deployed to the field. FPGA devices deliver the performance and
reliability of dedicated hardware circuitry. A single FPGA can replace thousands of
discrete components by incorporating millions of logic gates in a single integrated
circuit (IC) chip. FPGAs are constructed of three basic elements: logic blocks, I/O
cells, and interconnection resources. The logic blocks may be implemented using:
1. Transistor pairs
2. Combinational gates like basic NAND gates or XOR gates
3. n-input Lookup tables
4. Multiplexers
5. Wide fan-in AND-OR structures.
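As an illustration of item 3, an n-input lookup table can be modeled as a 2^n-bit memory addressed by the logic inputs; the following is a hedged behavioral sketch of a hypothetical 2-input LUT, not a vendor primitive:

```verilog
// behavioral model of a 2-input lookup table: the 4-bit init vector is
// the truth table, and the two logic inputs select one of its bits
module lut2(o, init, in);
  input [3:0] init;  // configuration bits (truth table contents)
  input [1:0] in;    // logic inputs act as the address
  output o;
  assign o = init[in];
endmodule
// e.g. init = 4'b1000 makes the LUT behave as a 2-input AND gate
```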
Heavy logic utilization decreases pin placement flexibility, as I/O blocks utilized for
logic cannot be reassigned in mid-design. The logic blocks are connected through
interconnect arrays, and input/output blocks surround this scheme of logic blocks and
interconnects.
5.3 FPGA SWITCH TECHNOLOGIES
FPGAs are based on an array of logic modules and a supply of uncommitted
wires to route signals. In gate arrays these wires are connected by a mask design
during manufacture. In FPGAs, however, these wires are connected by the user and
therefore must use an electronic device to connect them. Three types of devices have
been commonly used to do this: pass transistors controlled by an SRAM cell, a flash
or EEPROM cell to pass the signal, or a direct connect using antifuses. Each of these
interconnect devices has its own advantages and disadvantages. This has a major
effect on the design, architecture, and performance of the FPGA.
SRAM Based
The major advantage of SRAM-based devices is that they are infinitely
reprogrammable: they can be soldered into the system and have their function changed
quickly by merely changing the contents of a PROM. They therefore have simple
development mechanics. They can also be changed in the field by uploading new
application code, a feature attractive to designers. This flexibility does, however, come
at a price: the interconnect element has high impedance and capacitance and consumes
much more area than other technologies. Hence wires are very expensive and slow,
and the FPGA architect is forced to make large, inefficient logic modules (typically a
lookup table or LUT). The other disadvantages are that SRAM-based devices need to
be reprogrammed each time power is applied, need an external memory to store the
program, and require a large area. There are two applications of SRAM cells:
controlling the gate nodes of pass-transistor switches, and controlling the select lines
of multiplexers that drive logic block inputs.
Antifuse Based
The antifuse-based cell provides the highest-density interconnect by being a true
cross point. Thus the designer has a much larger number of interconnects, so logic
modules can be smaller and more efficient. Place and route software also has a much
easier time. These devices however are only one-time programmable and therefore
have to be thrown out every time a change is made in the design.
The antifuse has inherently low capacitance and resistance, such that the
fastest parts are all antifuse-based. The disadvantage is the requirement to
integrate the fabrication of it into the IC process, which means the process will always
lag the SRAM process in scaling. These are suitable for FPGAs because they can be
built using modified CMOS technology. The antifuse is positioned between two
interconnect wires and physically consists of three sandwiched layers: the top and
bottom layers are conductors, and the middle layer is an insulator. When
unprogrammed, the insulator isolates the top and bottom layers, but when
programmed the insulator changes to become a low-resistance link. It uses Poly-Si
and n+ diffusion as conductors and ONO as an insulator, but other antifuses rely on
metal for conductors, with amorphous silicon as the middle layer.
EEPROM Based
The EEPROM/FLASH cell in FPGAs can be used in two ways, as a control
device as in an SRAM cell or as a directly programmable switch. When used as a
switch they can be very efficient as interconnect and can be reprogrammable at the
same time. They are also non-volatile so they do not require an extra PROM for
loading. They do, however, have their drawbacks. The EEPROM process is
complicated and therefore also lags SRAM technology.
A second type of logic block is RAM logic, which can be used to implement
random access memory.
Plessey FPGA
The basic building block here is a 2-input NAND gate; these gates are connected to
each other to implement the desired function.
CHAPTER 6
RESULTS
6.1 BLOCKWISE SIMULATION RESULTS
Each block of the block diagram is individually analyzed and executed in Xilinx,
and the respective block-wise simulation results obtained from Xilinx are given
below. After simulating the Verilog code using Xilinx ISE 10.1 and
implementing the design on the FPGA, the simulation results obtained for the 2x2,
4x4 and 8x8 multipliers are shown below.
1. 2x2 VEDIC MULTIPLIER
2x2 Vedic multiplier is designed, analyzed and simulated using Xilinx ISE
10.1. The simulation results are shown in figure 6.1.
[Figure 6.1: simulation waveform of the 2x2 Vedic multiplier (inputs a, b; product s)]
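The 2x2 multiplier of figure 6.1 follows the Urdhva Tiryagbhyam (vertically and crosswise) pattern; the following is a minimal behavioral sketch assuming the halfadder module of Appendix I (the module, instance and wire names are illustrative):

```verilog
// 2x2 Vedic multiplier: the vertical product forms s[0], the crosswise
// products and the vertical MSB product are combined with half adders
module vedic2x2(s, a, b);
  input [1:0] a, b;
  output [3:0] s;
  wire carry;
  assign s[0] = a[0] & b[0];                            // vertical (LSB)
  halfadder h1(s[1], carry, a[1] & b[0], a[0] & b[1]);  // crosswise
  halfadder h2(s[2], s[3], a[1] & b[1], carry);         // vertical (MSB)
endmodule
```

For example, a = 2'b11 and b = 2'b11 produce s = 4'b1001, i.e. 3 x 3 = 9.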
The various factors on which the speed of the multiplier depends are obtained
from the synthesis report during simulation and are shown in table 6.1.
Table 6.1 Synthesis Results of 2 bit multiplier
FPGA Device Package    : 3s50pq208-5
Number of slices       : 2 out of 768
Number of 4 input LUTs : 4 out of 1536
Number of IOs          : 8 out of 124
Delay                  : 7.858 ns
Memory usage           : 141072 kilobytes
2. 4x4 VEDIC MULTIPLIER
[Simulation waveform of the 4x4 Vedic multiplier (inputs a, b; product s) with worked Examples 1 and 2]
The various factors on which the speed of the multiplier depends are obtained
from the synthesis report during simulation and are shown in table 6.2.
Table 6.2 Synthesis Results of 4 bit multiplier
FPGA Device Package    : 3s50pq208-5
Number of slices       : 19 out of 768
Number of 4 input LUTs : 33 out of 1536
Number of IOs          : 16 out of 124
Delay                  : 18.089 ns
Memory usage           : 141072 kilobytes
3. 8x8 VEDIC MULTIPLIER
[Simulation waveform of the 8x8 Vedic multiplier (inputs a, b; product s) with worked Examples 1, 2 and 3]
The various factors on which the speed of the multiplier depends are obtained
from the synthesis report during simulation and are shown in table 6.3.
Table 6.3 Synthesis Results of 8 bit multiplier
FPGA Device Package : 3s50pq208-5
Number of slices    : 94 out of 768
Number of IOs       : 32 out of 124
Delay               : 28.451 ns
Memory usage        : 145168 kilobytes
CHAPTER 7
CONCLUSION
7.1 Conclusion
The design of an 8-bit Vedic multiplier has been implemented on a Xilinx
Spartan 3 FPGA board. The hierarchical multiplier design clearly
indicates the computational advantages offered by Vedic methods. The computation
delay for the 8-bit Vedic multiplier is 28.451 ns. It is therefore seen that Vedic
multipliers are much faster than conventional multipliers. Awareness of Vedic
mathematics can be effectively increased if it is included in engineering education.
The comparison between the proposed multiplier and an 8-bit Booth radix-4 multiplier
is shown in table 7.1. As the table shows, this multiplier can help in building faster
processors in the future.
Table 7.1 Comparison of the proposed Vedic multiplier with the Booth multiplier
Parameter      Proposed Vedic multiplier   Booth multiplier
Delay          28.421 ns                   29.549 ns
Memory usage   145168 kilobytes            151860 kilobytes
APPENDIX-I
VERILOG SOURCE CODE
HALF ADDER
The data flow description of the half adder is given below. The 1-bit inputs to the half
adder are a and b, and the sum and carry outputs are s and c.
// define a 1-bit half adder by using data flow statements
module halfadder(s,c,a,b);
// I/O port declarations where s is sum and c is carry
output s,c;
input a,b;
// specify the function of a half adder
assign s = a^b;
assign c = a&b;
endmodule
// define a 4-bit carry lookahead adder using data flow statements
module cla(s,cout,a,b,cin);
//I/O port declarations where s is 4 bit sum and cout is carry out
output [3:0]s; // an array of 4 bit sum values
output cout;
input [3:0]a,b; // an array of 4 bit input values of a,b
input cin;
wire [3:0]g,p,c;
//specify function of carry look ahead adder
assign g=a&b;
assign p=a^b;
assign c[0]=cin;
assign c[1]=g[0]|(p[0]&c[0]);
assign c[2]=g[1]|(p[1]&g[0])|(p[1]&p[0]&c[0]);
assign c[3]=g[2]|(p[2]&g[1])|(p[2]&p[1]&g[0])|(p[2]&p[1]&p[0]&c[0]);
assign
cout=g[3]|(p[3]&g[2])|(p[3]&p[2]&g[1])|(p[3]&p[2]&p[1]&g[0])|(p[3]&p[2]&p[1]&p
[0]&c[0]);
assign s=p^c;
endmodule
// 8-bit adder built from two 4-bit carry lookahead adders
// (module header reconstructed; the name cla8 is assumed)
module cla8(s,cout,a,b,cin);
output [7:0]s; output cout; input [7:0]a,b; input cin;
wire [3:0]t1,t2; wire c1;
cla m1(t1[3:0],c1,a[3:0],b[3:0],cin);
cla m2(t2[3:0],cout,a[7:4],b[7:4],c1);
assign s[3:0]=t1[3:0];
assign s[7:4]=t2[3:0];
endmodule
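The 4x4 multiplier reuses four 2x2 multipliers on the operand halves. The following is a hedged sketch, assuming a vedic2x2 sub-module with 2-bit inputs and a 4-bit product; the project design sums the partial products with the carry lookahead adders above, whereas this sketch uses behavioral addition for brevity:

```verilog
// hierarchical 4x4 Vedic multiplier built from four assumed 2x2 units:
// a*b = (aH*bH)<<4 + (aH*bL + aL*bH)<<2 + aL*bL
module vedic4x4(s, a, b);
  input [3:0] a, b;
  output [7:0] s;
  wire [3:0] q0, q1, q2, q3;         // 2x2 partial products
  vedic2x2 m0(q0, a[1:0], b[1:0]);   // aL*bL
  vedic2x2 m1(q1, a[3:2], b[1:0]);   // aH*bL
  vedic2x2 m2(q2, a[1:0], b[3:2]);   // aL*bH
  vedic2x2 m3(q3, a[3:2], b[3:2]);   // aH*bH
  wire [5:0] mid = {2'b00, q1} + {2'b00, q2};
  assign s = {4'b0000, q0} + {mid, 2'b00} + {q3, 4'b0000};
endmodule
```

The same pattern extends the 4x4 unit to the 8x8 multiplier, which is how the NxN structure reduces to 2x2 blocks.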
halfadder z2(c[3],c[2],temp[2],temp[3]);
endmodule
ha z8(s[7],cout,c2,q3[3]);
endmodule
assign x=c|c1;
APPENDIX-II
VERILOG FUNCTIONS
OPERATORS
Operator type    Symbol   Operation performed
Arithmetic       +        addition
                 -        subtraction
                 *        multiplication
                 /        division
                 %        modulus
Logical          !        logical negation
                 &&       logical and
                 ||       logical or
Relational       >        greater than
                 <        less than
                 >=       greater than or equal
                 <=       less than or equal
Equality         ==       equality
                 !=       inequality
                 ===      case equality
                 !==      case inequality
Bitwise          ~        bitwise negation
                 &        bitwise and
                 |        bitwise or
                 ^        bitwise xor
                 ~^       bitwise xnor
Shift            >>       right shift
                 <<       left shift
                 >>>      arithmetic right shift
                 <<<      arithmetic left shift
Concatenation    {}       concatenation
Replication      {{}}     replication
Conditional      ?:       conditional
NUMBER SPECIFICATION
1. Sized numbers: written as <size>'<base><number>, for example 4'b1111 (4-bit binary) or 12'habc (12-bit hexadecimal).
2. Unsized numbers: written without a size specification and are by default at least 32 bits wide, for example 23456 or 'hc3.
Keywords:
always
and
assign
automatic
begin
buf
bufif0
bufif1
case
casex
casez
cell
cmos
config (starts a configuration)
default
defparam
design
disable (disables a task or block)
edge
else
end
endcase
endconfig (ends a configuration)
endfunction
endgenerate (ends a generate block)
endmodule
endprimitive
endspecify (ends a specify block)
endtable
endtask
for
force
forever
fork
function
generate
genvar
if
ifnone
incdir
include
initial
inout
input
instance
integer
join
large
liblist
library
localparam
macromodule
medium
module
nand
negedge
nmos
nor
not
notif0
notif1
or (gate primitive)
output
parameter
posedge
primitive
pulldown (gate primitive)
pullup (gate primitive)
rcmos
real
realtime
reg
release
repeat
rnmos
rpmos
rtran
rtranif0
rtranif1
scalared
signed
specify
specparam
table
task
time
tran
tranif0
tranif1
tri
tri0
tri1
triand
trior
trireg
unsigned
use
vectored
wait
wand
while
wire
wor
xnor
xor
weak0 (drive strength 3)
weak1 (drive strength 3)
REFERENCES
[1]
[2]
[3]
[4]
Neil H. E. Weste, David Harris, Ayan Banerjee, CMOS VLSI Design: A
Circuits and Systems Perspective, Third Edition, Pearson Education.
[5]
[6]
[7]
[8]
[9]
[10]
[11]
www.xilinx.com