
INTRODUCTION

Multiplication is widely used in many real-time applications. It forms an integral part of the implementation of many Digital Signal Processing, Digital Image Processing and Multimedia algorithms. The size, speed and power dissipation of any DSP chip can be significantly influenced by the design and implementation of its multiplication and squaring functions. This mandates the need for a multiplication and squaring function that is not just fast but at the same time occupies less area on the chip. Over the years a lot of research has been done in designing and implementing multiplication functions that occupy less area on the chip, consume less power and have minimal propagation time. The move towards achieving less area on chip started with the implementation of fixed width multipliers.

A fixed width multiplier has a smaller silicon area compared to a full width multiplier, which takes an n-bit multiplicand and an n-bit multiplier to yield an output that is 2n bits wide. A fixed width multiplication can be derived by truncating a full-width multiplier.

There are many different approaches to deriving a fixed width multiplier, and the most commonly used ones are:

1. Truncating the 2n-bit result of a full-width multiplier to generate a result which is n bits wide. The least significant n bits out of the 2n bits are truncated. This design yields the best results of all the fixed-width multiplier designs. However, the area savings are minimal. This is mainly because the design retains all the columns of the partial product array even when generating a truncated output. As an example, when processing inputs of word size n bits, there are 2n-1 columns in the partial product array whereas the output is n bits wide.

2. The second approach involves truncating the least significant n columns of the partial product matrix of a full-width multiplier to directly yield a result that is n bits wide. This design offers massive savings in area, but the errors introduced due to truncation are high. The design of any fixed-width multiplier is always a trade-off between area savings and error. The truncation of bits/partial product terms results in error, so an error compensation algorithm is needed to compensate for the truncation errors.
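To make the trade-off concrete, the two approaches can be modelled in a few lines of Python (a hypothetical illustration with n = 8; the function names are ours, not from the literature). Approach 1 is taken as the error-free reference, and the extra error introduced by column truncation (approach 2) is measured over random inputs:

```python
import random

N = 8  # word size (illustrative choice)

def truncate_result(x, y, n=N):
    """Approach 1: form the full 2n-bit product, then drop the n LSBs."""
    return (x * y) >> n

def truncate_columns(x, y, n=N):
    """Approach 2: drop the n least significant partial-product columns,
    i.e. keep only the terms x_i * y_j with weight i + j >= n."""
    acc = 0
    for i in range(n):
        for j in range(n):
            if i + j >= n:
                acc += (((x >> i) & 1) & ((y >> j) & 1)) << (i + j)
    return acc >> n

random.seed(1)
worst = 0
for _ in range(2000):
    x, y = random.randrange(1 << N), random.randrange(1 << N)
    worst = max(worst, truncate_result(x, y) - truncate_columns(x, y))
print("worst-case extra error of column truncation:", worst)
```

Since the columns kept by approach 2 are a subset of those kept by approach 1, the column-truncated result is never larger, and the printed worst-case gap illustrates why a compensation term is added in practice.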

1.1 Motivation

The most executed operation in the data path is addition, which requires a binary adder that adds two given numbers. Adders also play a vital role in more complex computations like multiplication, division and decimal operations. Hence, an efficient implementation of the binary adder is crucial to an efficient data path. Significant work has been done in proposing and realizing efficient adder circuits for binary addition, as described in the next chapter. However, as the technology is scaling down, new design issues like fan-out and wiring complexity are coming to the forefront. These issues are addressed to some extent by new adder architectures known as sparse adders. As operand size increases, adders also suffer from design issues that are becoming vital as they have a direct impact on the performance of an adder. Thus, there is a pressing need to develop alternative adder architectures which can address these design issues.

The next most important block in the data path after the adder is the multiplier, which is also very crucial in ASICs and DSPs. High speed multipliers reported in the literature use parallel multiplier architectures that employ compressors along with adders as basic building blocks. Compressors are multi-input, multi-output combinational logic circuits that determine the number of logic 1s in their input vectors and generate a binary coded output vector that corresponds to this number. Compressors have carry inputs and carry outputs in addition to the normal inputs and outputs. As these blocks lie directly within the critical path of a given design, thus dictating the overall circuit performance, there is a pressing need to design and validate new high speed/low power compressors.

The following objectives are proposed to be addressed in this thesis:

1. Implementation of the RPR block with increased accuracy and a lower error rate.

2. Implementation of an Algorithmic Noise Tolerant architecture based fixed width multiplier with improved performance.

This thesis is organized into seven chapters. The chapter-wise outline is as follows:

Chapter 2 deals with different methods of fixed width multipliers and different methods implemented previously.

Chapter 3 gives information about compressors, their types, and the implementation of different types of compressors.

Chapter 4 gives an introduction to multipliers, their different types and working, and reduction techniques for multiplier architectures.

Chapter 5 deals with the Algorithmic Noise Tolerant architecture and the implementation of the proposed fixed width multipliers using compressors.

Chapter 6 gives the experimental results of the proposed ANT architecture based fixed width multipliers and a comparative analysis of the results obtained.

Chapter 7 gives the conclusion and future scope.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction:

Multipliers can be broadly classified into two types.

Sequential multiplier: In a sequential multiplier each bit of either the multiplier or the multiplicand is processed one at a time. The main advantage of the design is its small area. A small piece of hardware involving a shifter and an accumulator is all that is needed to generate the output.

Parallel multiplier: In a parallel multiplier all the partial product terms are generated in parallel and the final result is obtained by adding the partial product terms over the columns. The main advantage of this design is its higher speed.

The parallel multiplier is widely used in various Digital Signal Processing, Video Processing and Multimedia applications because of its high speed of operation. The parallel array multiplier in turn is classified into two types, namely the Full Width Multiplier and the Fixed Width Multiplier.

The main drawback of a full width multiplier compared to a fixed width multiplier is its area on the chip and power consumption. The presence of AND gates to generate the partial product terms and of half and full adders to perform the column-wise sum of the partial product terms results in a large number of transistors, and hence a large power consumption. Over the years attempts have been made to overcome the area constraint and limit the power consumption.

Sunder S Kidambi, Andreas Antoniou and Fayez El-Guibaly proposed a

design which brought 50% reduction in the area of a parallel multiplier [3]. In many

DSP applications there is a need to maintain the word size. This motivated the trio to

come up with a design that addressed this requirement. In the proposed

design the lower N columns of a parallel multiplier are truncated and

a correction is then added to the remaining most significant

columns. The authors carried out a statistical error analysis to

predict the error due to truncation and provide an appropriate

correction term. Based on the correction term the design is altered

to replace the half adders with full adders where needed.

In Figure 2.1 the partial product terms along the diagonal are

added to generate the output bits p8-p15. The carry terms from

each addition propagate along the vertical arrows. FA represents the

Full adder block and AFA represents the block containing an AND

gate and a Full adder. HA stands for Half Adder block.

Highlights: This was one of the earliest designs of fixed width parallel multipliers. In addition to reducing the area of the parallel multiplier by almost 50%, it proposed an analytical technique to generate a correction term to reduce the error due to truncation. The main drawback of this design is the correction bias added to offset the error due to truncation: the correction added is a constant term and does not depend on the inputs being fed to the multiplier.

Michael J Schulte and Earl E Swartzlander furthered the work carried out by Kidambi and team [4]. The design proposed by them is based on truncating fewer partial product columns, adding a constant correction, and rounding the result to n bits. For an n-bit full-width multiplier the output is a 2n-bit word; there are 2n-1 columns in the partial product array matrix, and the terms are added column-wise to yield a 2n-bit result. In the design proposed by Schulte and Swartzlander, n-k columns are truncated to retain n+k columns of partial product terms. This design is slightly different from that proposed by Kidambi [3] in that the latter design truncated n columns whereas in this design n-k columns are truncated. Retaining more columns gives better results at the expense of area. In order to offset the error introduced due to truncation, a constant term is added to the most significant n+k columns. This correction is the average of the truncated portion of the partial product matrix. The result obtained after adding the partial product terms is then rounded to yield an n-bit result.

Highlights: The design proposed by Schulte and Swartzlander [4] brings about a huge saving in area when compared to a full-width multiplier. It also introduces a degree of flexibility in the number of columns that are truncated. This gives designers a chance to choose between area savings and better error correction. The design proposed by Kidambi [3] and team is a special case of this design [4].

Earl E Swartzlander, Jr proposed a fixed width multiplier that uses variable correction instead of constant correction [5]. The basic design of the multiplier is the same as that of a constant correction fixed width multiplier. The least significant N-2 partial product columns of a full width multiplier are truncated. The partial product terms in the (N-1)th column are then added to the partial product terms in the Nth column using full adders. This is done in order to offset the error introduced due to truncation of the least significant N-2 columns. The correction term that is generated is based on the following arguments:

1. The biggest column in the entire partial product array of a full-width multiplier is the Nth column (assuming that the columns are numbered from 1 to 2N, 1 being the least significant).

2. The (N-1)th column contributes more information to the most significant N-1 columns than the rest of the least significant N-1 columns. The information presented could be made more accurate if the carry from the (N-1)th column is taken into account.

3. Adding the elements in the (N-1)th column to the Nth column provides a variable correction, as the information presented depends on the input bits. When all the partial product terms in the (N-1)th column are zero, the correction added is zero. When all the terms are one, a different correction value is added.

The partial product terms along the outermost diagonal form the elements of the (N-1)th column of the partial product array of a standard multiplier. The partial product terms along the outer diagonal are fed to the adder blocks in the Nth column (represented by the adjacent diagonal elements). The output obtained is N bits wide.

Highlights: The advantage of this design is that the correction term changes with the inputs fed to the multiplier. The complexity is slightly higher in this case when compared to the constant correction fixed width multiplier. This design can bring about at most N-2 columns of truncation.

Jer Min Jou and Shiann Rong Kuang proposed a design [6] which is a slight variation over the other designs documented above. The Nth column of the parallel multiplier is the largest contributor of information to the most significant half of the partial product array matrix. Retaining the entire column can significantly reduce the error due to truncation. However, retaining the entire column would increase the area of the chip, as there would be an increase in the number of AND gates and adders in the design to compute the partial product terms and the sum for that column. The design proposed by the duo aims at reducing the area by using an AO cell, which is a combination of an AND gate and an OR gate as shown in figure 2.2. This AO cell is used to generate the carry information, which is then fed to the next column. This design brings about a small improvement in area as the adders in the Nth column are replaced by the AO cell. Figure 2.2 shows the implementation of the multiplier proposed by Jer Min Jou and Shiann Rong Kuang.

Highlights: This design provides a new approach to offset the error due to truncation. It is based on the reasoning that the Nth column, being the largest column in the entire partial product array, is the largest contributor of information to the remaining most significant N columns. It includes the entire column using the AO cell, which is used instead of an AND gate and a full adder. This design is better compared to previous designs as it reduces the area by using the AO cell. However, the maximum possible truncation is N-1 columns.

The truncated multiplier with symmetric correction [7] was implemented by Hyuk Park and Earl E Swartzlander, Jr with the intention of improving the error correction over other existing techniques. This multiplier design is very similar to the variable correction technique, with an additional piece of logic. The basic design incorporates the same truncation principle as the variable correction design. In addition to adding the partial product terms from the (N-K-1)th column to the (N-K)th column, an additional piece of hardware logic is added, and this proposed logic block is responsible for introducing symmetry between the positive and negative maximum errors. The introduction of the proposed additional logic only slightly increases the complexity of the design.

Highlights: The main advantage of the design is that it evenly distributes the error between its positive maximum and negative maximum errors. Retaining the extra columns in the partial product array and introducing the proposed logic increases the area of the multiplier. Thus, the better error correction obtained by this design is offset by the increase in area.

This design was proposed by Antonio G M Strollo, Nicolo Petra and David De Caro [8]. It is a variant of the fixed width multiplier that uses a tree-based adding scheme. This design uses an error compensation function that can be tweaked, based on the requirements, to bring down either the maximum error or the mean error. The design structure is shown in figure 2.3.

Each of the partial product terms in the input correction vector carries a different weight. It is shown in the paper that the outer partial product terms in the input correction vector (which is the (N-1)th column) have a lower weight when compared to the inner partial product terms. Therefore, choosing the right combination of partial product terms depending on their weights yields good results.

The paper proposes that, in order to get a low mean square error, the partial product terms in the (N-1)th column are separated into two different groups and passed on to adder trees. The outer partial product terms are processed separately by passing them to a standard tree adder structure (which is composed of half and full adders), whereas the remaining partial product terms are added using a modified tree structure (which is composed of AND and OR gates). In order to get a multiplier that yields a low maximum error, the outer partial product terms are added using a standard tree structure whereas the inner terms are added using the modified tree structure. Each of these tree structures generates carry terms which are fed to the adjacent column, which has the weight 2^-n. The paper also discusses the use of a mixing block, which is used to pass on the carry to the adjacent column with a higher weight.

Highlights: The main advantage of this design is that it provides the user with the flexibility to choose either a design that lowers the maximum error or a design that lowers the mean square error. The error compensation provided by this design is by far the best when compared to the other designs. It provides a variable error compensation bias, as the correction value depends on the input bits. The error correction function is simple to implement, as the partial product terms in the input correction vectors are used directly to provide the correction.

CHAPTER 3

DESIGN AND IMPLEMENTATION OF COMPRESSORS

3.1 Introduction

Multiplication is a basic arithmetic operation that is crucial in applications like digital signal processing, which in turn rely on efficient implementations of generic arithmetic and logic units (ALUs) and floating point units to execute dedicated operations like convolution and filtering. In the implementation of multipliers, the main phases include the generation of partial products, the reduction of partial products using CSAs (Carry-Save Adders), and carry propagation for the computation of the final result, as shown in Fig 3.1. The second phase, i.e. the reduction of the partial products, contributes most to the overall delay, area and power. In order to reduce partial products, multi-operand adders, which are different from conventional adders, are required, and hence a different design methodology is needed for multi-operand adders. A special structure known as a compressor is one strategy that can be adopted for multi-operand addition. Wallace and Dadda were the first to explain the usage of compressors for the partial product reduction tree in multipliers. Later, different optimized structures for compressors were reported in the literature.


3.2 Compressors

A (N, 2) compressor is a logic circuit that takes N bits of the same significance and generates a Sum bit and several Carry bits as the output. Though a compressor gives Sum and Carry, it is different from a conventional adder: a compressor adds N bits of the same significance, whereas an adder adds two N-bit operands whose bits have different significance. The compressor operation can be expressed logically as

I1 + I2 + ... + IN + (Cin1 + Cin2 + ... + Cink) = Sum + 2(Cout1 + Cout2 + ... + Coutk)

where I1, I2, ..., IN are the inputs and Cin1, Cin2, ..., Cink are the carry inputs.


3.3.1 3-2 Compressor

A 3-2 compressor takes 3 inputs X1, X2, X3 and generates 2 outputs, the Sum bit S and the Carry bit C, as shown in figure 3.2(a). The compressor is governed by the basic equation

X1 + X2 + X3 = Sum + 2*Carry      (3.1)

The 3-2 compressor can also be employed as a full adder cell when the third input is considered as the Carry-in from the previous compressor block. The existing design shown in figure 3.2(b) employs two XOR gates in the critical path.
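As a quick sanity check (our own behavioural sketch, not taken from the report), equation (3.1) can be verified exhaustively with a small Python model of the cell:

```python
from itertools import product

def compressor_3_2(x1, x2, x3):
    """Behavioural model of the 3-2 compressor (a full adder cell):
    Sum is the XOR of the inputs, Carry is the majority function."""
    s = x1 ^ x2 ^ x3
    carry = (x1 & x2) | (x2 & x3) | (x1 & x3)
    return s, carry

# Exhaustively check equation (3.1): X1 + X2 + X3 = Sum + 2*Carry
for x1, x2, x3 in product((0, 1), repeat=3):
    s, c = compressor_3_2(x1, x2, x3)
    assert x1 + x2 + x3 == s + 2 * c
```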

3.3.2 4-2 Compressor

A 4-2 compressor has 4 inputs X1, X2, X3 and X4 and 2 outputs, Sum and Carry, along with a Carry-in (Cin) and a Carry-out (Cout), as shown in figure 3.3. The input Cin is the output from the previous lower significant compressor. The Cout is the output to the compressor in the next significant stage.

Similar to the 3-2 compressor, a 4-2 compressor is governed by the basic equation

X1 + X2 + X3 + X4 + Cin = Sum + 2(Carry + Cout)      (3.2)

The standard implementation of the 4-2 compressor can be done using 2 full adder cells as shown in Fig 3.4. When the individual full adders are broken into their constituent XOR blocks, it can be observed that the overall delay is equal to 4 XOR-gate delays.

The block diagram in Figure 3.5 shows the existing architecture for the implementation of the 4-2 compressor with a delay of 3 XOR-gate delays. But in this architecture, the fact that both the output and its complement are available at every stage was not taken into account.
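The two-full-adder construction of Fig 3.4 can likewise be modelled and checked against equation (3.2) over all 32 input combinations (again a behavioural sketch of our own, not the transistor-level design):

```python
from itertools import product

def full_adder(a, b, cin):
    # (3,2) counter: Sum = a xor b xor cin, Carry = majority(a, b, cin)
    return a ^ b ^ cin, (a & b) | (b & cin) | (a & cin)

def compressor_4_2(x1, x2, x3, x4, cin):
    """Standard 4-2 compressor built from two cascaded full adder cells
    (Fig 3.4): the first adds X1, X2, X3; the second adds that sum, X4 and Cin.
    Note that Cout depends only on X1..X3, never on Cin."""
    t, cout = full_adder(x1, x2, x3)
    s, carry = full_adder(t, x4, cin)
    return s, carry, cout

# Exhaustively check equation (3.2): X1+X2+X3+X4+Cin = Sum + 2*(Carry + Cout)
for bits in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
```

The fact that Cout is independent of Cin is what allows 4-2 compressors to be chained within a column without a ripple path.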

3.3.3 5-2 Compressor

The 5-2 compressor block has 5 inputs X1, X2, X3, X4 and X5 and 2 outputs, Sum and Carry, along with 2 input Carry bits (Cin1, Cin2) and 2 output Carry bits (Cout1, Cout2), as shown in figure 3.5(a). The input Carry bits are the outputs from the previous lesser significant compressor block, and the output Carry bits are passed on to the next higher significant compressor block.

The basic equation that governs the function of a 5-2 compressor block is given below:

X1 + X2 + X3 + X4 + X5 + Cin1 + Cin2 = Sum + 2(Carry + Cout1 + Cout2)      (3.3)

The conventional implementation of the compressor block is shown in figure 3.5(b), where 3 cascaded full adder cells are used. When these full adders are replaced with their constituent blocks of XOR gates, it can be observed that the overall delay is equal to 6 XOR-gate delays for the Sum or Carry output.
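The cascade of three full adder cells can be sketched the same way and checked against equation (3.3) over all 128 input combinations:

```python
from itertools import product

def full_adder(a, b, cin):
    # (3,2) counter: Sum = a xor b xor cin, Carry = majority(a, b, cin)
    return a ^ b ^ cin, (a & b) | (b & cin) | (a & cin)

def compressor_5_2(x1, x2, x3, x4, x5, cin1, cin2):
    """Conventional 5-2 compressor of figure 3.5(b): three cascaded full adders."""
    t1, cout1 = full_adder(x1, x2, x3)
    t2, cout2 = full_adder(t1, x4, x5)
    s, carry = full_adder(t2, cin1, cin2)
    return s, carry, cout1, cout2

# Equation (3.3): X1+X2+X3+X4+X5+Cin1+Cin2 = Sum + 2*(Carry + Cout1 + Cout2)
for bits in product((0, 1), repeat=7):
    s, carry, cout1, cout2 = compressor_5_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout1 + cout2)
```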


3.4.1 3-2 Compressor

In CMOS implementation, gates like OR and AND are realized as NOR and NAND gates followed by an inverter. Thus, from the OR and AND gates, we can obtain the NOR and NAND outputs without any extra hardware. This technique is used to design an XOR-XNOR pair gate, which is shown in figure 3.7.

The outputs of the existing 3-2 compressor design, shown in figure 3.3, are governed by

Sum = X1 xor X2 xor X3      (3.4)

Carry = (X1 xor X2).X3 + X1.X2      (3.5)

In the existing design, the output of the first XOR gate and X3 are given as inputs to the second-stage XOR gate. This XOR gate can be replaced by a multiplexer, which reduces the delay, as a multiplexer has less delay compared to an XOR gate.

In the design shown in figure 3.9, the fact that both the XOR and XNOR outputs are computed is efficiently used to reduce the delay by replacing the second XOR gate with a MUX. This is due to the availability of the select bit, i.e. X3, at the MUX block before the inputs arrive. Thus, the time taken for the switching on of the transistors in the critical path is reduced. The proposed (3, 2) compressor outputs are governed by the same equations (3.4) and (3.5), with the second XOR stage realized as a MUX.

In the 4-2 compressor design also, the fact that both the output and its complement are available at every stage is neglected. Thus, replacing some XOR gates with multiplexers results in a significant improvement in delay. As in the previous case, the MUX block at the Sum output gets the select bit before the inputs arrive, and thus the transistors are already switched on by the time the inputs arrive. This minimizes the delay to a considerable extent, as shown in Fig 3.10.

The equations governing the outputs are shown below:

The equations governing the outputs are shown below

X 1 X 2 X 3 X 4 C

Cout=(X1

X2

).X3+

X 1 X 2

(3.8)

.X1

(3.9)

carry=( x 1 x 2 x 3 ) . x 4 + x 1 x 2 x 3 x 4 . x 4

(3.10)

The multiplexers are implemented using the transmission gate logic style shown in Figure 3.11. This design of the multiplexer is faster and also consumes less power than the CMOS design, but it requires buffers to enhance the driving capability. Therefore, these types of multiplexers can be used where there are CMOS stages at the input and output, because CMOS has good driving capability. Thus, transmission gate multiplexers are used in the intermediate stages, thereby increasing the performance.


In the proposed design of the 5-2 compressor the most important change is to efficiently use the outputs generated at every stage. This is done by replacing some XOR blocks with MUX blocks. Also, the select bits to the multiplexers in the critical path are made available well ahead of the inputs so that the critical path delay is minimized. For example, the Cout2 output from the previous lesser significant compressor block is utilized as the select bit a stage after it is produced, so that the MUX block is already switched on and the output is produced as soon as the inputs arrive. Also, if the output of a multiplexer is used as the select bit for another multiplexer, then it can be used efficiently in a similar manner because the negation of the select bit is also available, as shown in Figure 3.7. Thus an extra stage to compute the negation can be saved. Similarly, replacing the XOR block in the second stage with a MUX block reduces the delay, because the select bit X3 is already available and the transistor switching takes place in parallel with the computation of the inputs of the block.

As mentioned before, in all the general implementations of the XOR or MUX block, in particular the CMOS implementation, the output and its complement are generated. But in the existing design this advantage is not fully utilized. In the proposed design these outputs are utilized efficiently by using multiplexers at particular stages in the circuit. Also, additional inverter stages are eliminated. This in turn contributes to the reduction of delay, power consumption and transistor count (area).

The equations governing the outputs are shown below:

Sum = X1 xor X2 xor X3 xor X4 xor X5 xor Cin1 xor Cin2      (3.11)

Cout1 = (X1 + X2).X3 + X1.X2      (3.12)

Cout2 = (X4 xor X5).Cin1 + (X4 xnor X5).X4      (3.13)

temp = X1 xor X2 xor X3 xor X4 xor X5 xor Cin1

Carry = temp.Cin2 + temp'.(X1 xor X2 xor X3)      (3.14)

In the carry generation module (CGEN) shown in Figure 3.10, equation (3.12) above is used to design the CMOS implementation of Cout1, as shown in Figure 3.11.


CHAPTER 4

INTRODUCTION TO MULTIPLIERS

Multipliers are among the fundamental components of many digital systems, and hence their power dissipation and speed are of primary concern [19]. For portable applications where power consumption is the most important parameter, one should reduce the power dissipation as much as possible. One of the best ways to reduce the dynamic power dissipation is to minimize the total switching activity, i.e., the total number of signal transitions of the system.

Multiplication plays an essential role in computer arithmetic operations for both general purpose and digital signal processors, particularly in computationally intensive algorithms required by multimedia functions such as Finite Impulse Response (FIR) filters, Infinite Impulse Response (IIR) filters and the Fast Fourier Transform (FFT).

In a popular array multiplication scheme the summation of partial products proceeds in a more regular but slower manner: only one row of bits in the matrix is eliminated at each stage of the summation. In a parallel multiplier the partial products are generated by using an array of AND gates. The main problem is the summation of the partial products, and it is the time taken to perform this summation which determines the maximum speed at which a multiplier may operate. The Wallace scheme essentially minimizes the number of adder stages required to perform the summation of partial products. This is achieved by using full and half adders. A Wallace multiplier consists of three stages. The partial product matrix is formed in the first stage by N^2 AND gates. In the second stage, the partial product matrix is reduced to a height of two. Dadda replaced Wallace's pseudo adders with parallel (n, m) counters. A parallel (n, m) counter is a circuit which has n inputs and produces m outputs which provide a binary count of the ONEs present at the inputs. A full adder is an implementation of a (3, 2) counter, which takes 3 inputs and produces 2 outputs. Similarly, a half adder is an implementation of a (2, 2) counter, which takes 2 inputs and produces 2 outputs. Dadda multipliers have a less expensive reduction phase, but the final numbers may be a few bits longer, thus requiring slightly bigger adders.

In general, the product p of two n-bit unsigned binary numbers x and y may be expressed as follows:

p = SUM_{i=0}^{n-1} { y_i . (x_{n-1} x_{n-2} ... x_1 x_0) } . 2^i      (4.1)

The terms y_i.x_j are called the partial products and are generated using an array of AND gates. For a parallel multiplier, the shifting term 2^i is inherent in the wiring and does not require any explicit hardware. Thus the main problem is the summation of the partial products, and it is the time taken to perform this summation which determines the maximum speed at which a multiplier may operate.

The realization of a parallel multiplier for digital computers was considered by C.S. Wallace, who proposed a tree of pseudo-adders (that is, adders without carry propagation) producing two numbers whose sum equals the product. This sum can be obtained by applying the two numbers to a carry-propagating adder.

There are two basic schemes in the multiplication process: serial multiplication and parallel multiplication.

Serial Multiplication (Shift-Add): This scheme computes the partial products one at a time and accumulates them. The implementations are primitive, with simple architectures, and are used when there is no dedicated hardware multiplier.

Parallel Multiplication: This scheme is used for high performance machines, where computation latency needs to be minimized.

Comparing these two types, parallel multiplication has an advantage over serial multiplication because it requires fewer steps, and so it performs faster.
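The shift-add scheme can be sketched in a few lines of Python (an illustration of ours, not a hardware description), making the bit-at-a-time behaviour explicit:

```python
def shift_add_multiply(x, y, n=8):
    """Serial (shift-add) multiplication: examine one multiplier bit per step,
    conditionally adding the shifted multiplicand into an accumulator."""
    acc = 0
    for i in range(n):          # one multiplier bit per clock cycle
        if (y >> i) & 1:
            acc += x << i       # add the multiplicand, shifted by i positions
    return acc

assert shift_add_multiply(13, 11) == 143
```

The n-iteration loop is exactly why the serial scheme is slower: a parallel multiplier generates all n shifted terms at once and only has to sum them.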

4.2 Array Multiplier

The array multiplier is the most basic form of binary multiplier construction. Its basic principle is exactly like multiplication done by pen and paper. It consists of a highly regular array of full adders, the exact number depending on the length of the binary numbers to be multiplied. Each row of this array generates a partial product, which is then added to the sum and carry generated in the next row. The final result of the multiplication is obtained directly after the last row. The ANDed terms are generated using logic AND gates. Each Full Adder (FA) takes two bits (A, B) and a Carry In (Ci) as inputs and produces Sum (S) and Carry Out (Co) as outputs.

4.2.1 Principle

Due to its highly regular structure, the array multiplier is very easily constructed and can be densely implemented in VLSI, taking less space. But compared to other multiplier structures proposed later, it shows a high computational time: the computational time is of order O(N), one of the highest of any multiplier structure.

The Baugh-Wooley multiplier is used for both unsigned and signed number multiplication. Signed operands are represented in 2's complement form. The partial products are adjusted so that the negative signs move to the last step, which in turn maximizes the regularity of the multiplication array. The Baugh-Wooley multiplier operates on signed operands with 2's complement representation in such a way that the signs of all partial products are positive. To reiterate, the numerical value of an n-bit 2's complement number X is X = -x_{n-1}.2^{n-1} + SUM_{i=0}^{n-2} x_i.2^i.

4.3 Advantages

Minimum complexity.

Easily scalable.

Easily pipelined.

4.4 Disadvantages

High computational time, which grows with the operand length.

4.5 Wallace Tree Multiplier

A Wallace tree is an efficient hardware implementation of a digital circuit that multiplies two integers. For an N*N-bit multiplication, the partial products are formed by N^2 AND gates. Next, the N rows of partial products are grouped together in sets of three rows each. Any additional rows that are not members of these groups are transferred to the next level without modification. For a column consisting of three partial products, a full adder is used, with the sum dropped down to the same column whereas the carry out is brought to the next higher column. For a column with two partial products, a half adder is used in place of a full adder, as shown in figure 4.4. At the final stage, a carry propagation adder is used to add all the propagating carries to get the final result. It can also be implemented using Carry Save Adders, and it is sometimes combined with Booth encoding. Various other approaches have been investigated to reduce the number of adders for higher order bit widths such as 16 and 32. Applications include use in DSPs for performing FFTs, FIR filters, etc.


4.5.1 Function

The Wallace tree has three steps:

1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other, yielding n^2 results. Depending on the position of the multiplied bits, the wires carry different weights; for example, the wire carrying the result of a2b3 has weight 2^(2+3) = 32.

2. Reduce the number of partial products to two by layers of full and half adders.

3. Group the wires into two numbers, and add them with a conventional adder.
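The three steps above can be sketched in Python as a column-oriented reduction (a Dadda-style variant of the row grouping described earlier); the names and structure are illustrative, not the report's RTL:

```python
from collections import defaultdict

def wallace_multiply(a, b, n):
    # Step 1: AND every bit pair; bucket the n*n wires by weight i + j.
    cols = defaultdict(list)
    for i in range(n):
        for j in range(n):
            cols[i + j].append((a >> i & 1) & (b >> j & 1))
    # Step 2: layers of full (3:2) and half (2:2) adders until no
    # column holds more than two wires.
    while any(len(c) > 2 for c in cols.values()):
        nxt = defaultdict(list)
        for w, c in cols.items():
            while len(c) >= 3:                            # full adder
                x, y, z = c.pop(), c.pop(), c.pop()
                nxt[w].append(x ^ y ^ z)
                nxt[w + 1].append((x & y) | (y & z) | (x & z))
            if len(c) == 2:                               # half adder
                nxt[w].append(c[0] ^ c[1])
                nxt[w + 1].append(c[0] & c[1])
            else:
                nxt[w].extend(c)                          # pass through
        cols = nxt
    # Step 3: a final carry-propagate addition of the two remaining rows.
    return sum(bit << w for w, c in cols.items() for bit in c)

print(wallace_multiply(11, 13, 4))  # 143
```

Each adder replaces same-weight wires with one sum wire and one carry wire of double weight, so the total value is preserved at every layer.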


4.5.3 Example:


4.6 Advantages:

Each layer of the tree reduces the number of vectors by a factor of 3:2.

Minimum propagation delay.

The benefit of the Wallace tree is that there are only O(log n) reduction layers, whereas adding the partial products with regular adders would require O((log n)^2) time.

4.7 Disadvantages:

Wallace trees do not provide any advantage over ripple adder trees in many

FPGAs.

Due to the irregular routing, they may actually be slower and are certainly

more difficult to route.


CHAPTER 5

IMPLEMENTATION OF FIXED WIDTH MULTIPLIER USING ALGORITHMIC NOISE TOLERANT ARCHITECTURE (ANT)

The algorithmic noise tolerant (ANT) architecture is an effective method for error compensation in high-throughput DSP applications. An RPR-based ANT system employs a replica block to correct errors of large magnitude in the main block. The ANT technique includes both a main digital signal processor (MDSP) and an error correction (EC) block, as shown in Figure 6.1.

Let x and y be two n-bit unsigned numbers:

x = Σ (i = 0 to n−1) xi · 2^i ,   y = Σ (j = 0 to n−1) yj · 2^j   (5.1)

Their product is

P = Σ (k = 0 to 2n−1) Pk · 2^k = Σ (j = 0 to n−1) Σ (i = 0 to n−1) xi yj · 2^(i+j)   (5.2)

Generally, for n-bit inputs a multiplier produces a 2n-bit output. A fixed width multiplier instead produces an output of n bits (or fewer), with reduced precision. This is achieved by rounding off or truncating the lower order bits. The product of a fixed width multiplier may differ from the result obtained using a full width multiplier, but it provides the advantage of less area and less delay. The precision of a fixed width multiplier can be improved by compensating the truncation error. This can be done using error compensation functions; in our proposed work it is done by feeding the lower order truncated bits to the major part after truncation to provide compensation.
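As a minimal sketch of the truncation idea, assuming the fixed-width result is simply the upper half of the full product (no compensation yet):

```python
def fixed_width_product(x, y, n):
    """n-bit result of an n x n unsigned multiply: keep the most
    significant half of the 2n-bit product, drop the n LSBs."""
    full = x * y
    truncated = full >> n            # fixed-width result (no compensation)
    error = full - (truncated << n)  # value lost by truncating the LSBs
    return truncated, error

p, e = fixed_width_product(200, 180, 8)  # full product 36000
print(p, e)                              # prints 140 160 (140*256 + 160 = 36000)
```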

In the ANT technique, a replica of the MDSP, but with reduced precision operands and shorter computation delay, is used as the EC block. Under VOS, there are a number of input-dependent soft errors in the MDSP output ya[n]; however, the RPR output yr[n] is still correct, since the critical path delay of the replica is smaller than Tsamp. Therefore, yr[n] is applied to detect errors in the MDSP output ya[n]. Error detection is accomplished by comparing the difference |ya[n] − yr[n]| against a threshold Th. Once the difference between ya[n] and yr[n] is larger than Th, the output y[n] is yr[n] instead of ya[n]. As a result, y[n] can be expressed as

y[n] = ya[n], if |ya[n] − yr[n]| ≤ Th
       yr[n], if |ya[n] − yr[n]| > Th   (5.3)

Th = max over all inputs |yo[n] − yr[n]|   (5.4)

where yo[n] is the error-free output and yr[n] is the output from the RPR block.
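The ANT decision rule described above can be sketched as:

```python
def ant_output(y_a, y_r, th):
    """ANT decision rule (5.3): keep the MDSP output y_a unless it
    differs from the reduced-precision replica y_r by more than Th."""
    return y_a if abs(y_a - y_r) <= th else y_r

# Th bounds the replica's own error (5.4), so any larger deviation
# must be a soft error in the MDSP.
print(ant_output(1000, 996, th=8))  # 1000: within Th, MDSP output kept
print(ant_output(1512, 996, th=8))  # 996: soft error detected, replica used
```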


In this work, we further propose the fixed-width RPR to replace the full-width RPR block in the ANT design, as shown in Figure 2. This can not only provide higher computation precision, lower power consumption, and lower area overhead in the RPR, but can also achieve higher SNR, better area efficiency, lower operating supply voltage, and lower power consumption in realizing the ANT architecture. We demonstrate our fixed-width RPR-based ANT design in an ANT multiplier.

Fixed-width designs are usually applied in DSP applications to avoid infinite growth of bit width. Cutting off the n-bit least significant bit (LSB) output is a popular way to construct a fixed-width DSP with n-bit input and n-bit output. The hardware complexity and power consumption of a fixed-width DSP are usually about half of those of the full-length one. However, truncation of the LSB part results in rounding error, which needs to be compensated precisely. Many approaches have been presented to reduce the truncation error with a constant correction value or with a variable correction value. The circuit complexity of compensating with a constant correction value can be simpler than that of a variable correction value; however, the variable correction approaches are usually more precise.

However, in the fixed-width RPR of an ANT multiplier, the compensation error we need to correct is the overall truncation error of the MDSP block. Our compensation method is to compensate the truncation error between the full-length MDSP multiplier and the fixed-width RPR multiplier. Nowadays, there are many fixed-width multiplier designs applied to full-width multipliers; however, there is still no fixed-width RPR design applied to ANT multiplier designs.

To achieve more precise error compensation, we compensate the truncation error with a variable correction value. We construct the error compensation circuit mainly using the partial product terms with the largest weight in the least significant segment. The error compensation algorithm makes use of probability, statistics, and linear regression analysis to find the approximate compensation value. To save hardware complexity, the compensation vector formed from the partial product terms with the largest weight in the least significant segment is directly injected into the fixed-width RPR, which does not need extra compensation logic gates.

5.3.1 Error compensation using AO Cells

To design a low-error fixed-width multiplier, we first analyze the source of errors generated by MP, and then derive a small carry-generating circuit Cg to feed each carry input of MP0 to reduce errors effectively. Let the truncation error denote the difference between the two products produced by the standard multiplier and circuit MP0; it is caused by the carries generated from the column of partial products P(n−1) in the least significant part. Since the sum of the weights of the P(n−1) column exceeds the combined weight of all remaining truncated columns, Cg can be derived from the sum of that column.


The partial product array of the (n/2)-bit unsigned fixed-width multiplier in an n-bit design can be divided into four subsets: the most significant part (MSP), the input correction vector (ICV), the minor input correction vector (MICV), and the least significant part (LSP). In a fixed width multiplier, only the MSP part is kept; the remaining parts are removed.

The error caused by truncation can be compensated by feeding the ICV and MICV into the MSP, because the ICV and MICV have higher weights than the other terms in the LSP part. The ICV compensation depends on the sum of all partial product terms in the ICV, as shown in Figure 5.3.

Therefore, the other three parts, ICV, MICV, and LSP, are called the truncated part. The truncated ICV and MICV are the most important parts because of their highest weighting; therefore, they can be applied to construct the truncation error compensation algorithm.
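Assuming the usual column-index convention (the ICV is the heaviest truncated column, i + j = n − 1, and the MICV the next one, i + j = n − 2 — an assumption, since the report's figure is not reproduced here), the four-way split can be sketched as:

```python
def classify(i, j, n):
    """Subset of partial-product bit x_i * y_j in an n x n array whose
    fixed-width output keeps the most significant half."""
    w = i + j
    if w >= n:
        return "MSP"    # kept by the fixed-width multiplier
    if w == n - 1:
        return "ICV"    # heaviest truncated column
    if w == n - 2:
        return "MICV"   # next-heaviest truncated column
    return "LSP"        # remaining truncated bits

n = 8
counts = {"MSP": 0, "ICV": 0, "MICV": 0, "LSP": 0}
for i in range(n):
    for j in range(n):
        counts[classify(i, j, n)] += 1
print(counts)  # {'MSP': 28, 'ICV': 8, 'MICV': 7, 'LSP': 21}
```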

To evaluate the accuracy of a fixed-width RPR, we can exploit the difference between the (n/2)-bit fixed-width RPR output and the 2n-bit full-length MDSP output, which is expressed as

ε = P − Pt   (5.5)

where P is the output of the complete multiplier in the MDSP and Pt is the output of the fixed-width multiplier in the RPR.


The source of errors generated in the fixed-width RPR is dominated by the bit products of the ICV, since they have the largest weight. It is reported that a low-cost EC circuit can be designed easily if a simple relationship between f(EC) and β is found, where β is the summation of all partial products of the ICV. By statistically analyzing the truncated difference between the MDSP and the fixed-width RPR with a uniform input distribution, we can find the relationship between f(EC) and β.

The statistical results show that the average truncation error in the fixed-width RPR multiplier is approximately distributed between β and β + 1. More precisely, as β = 0, the average truncation error is close to β + 1; as β > 0, the average truncation error is very close to β. If we select β as the compensation vector, the compensation vector can be injected directly into the fixed-width RPR, which does not need extra compensation logic gates. Therefore, we can apply multiple input error compensation vectors to further enhance the error compensation precision. For the β > 0 case, we can still select β as the compensation vector. For the β = 0 case, we select β + 1 combining with the MICV as the compensation vector.
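The compensation-value selection can be sketched as follows, writing beta for the sum of the ICV column bits; the column-index convention (ICV at i + j = n − 1, MICV at i + j = n − 2) is an assumption about the array layout:

```python
def compensation_vector(x, y, n):
    """Select the compensation value: beta (the ICV column sum) when
    beta > 0, else beta + 1 when the MICV column is nonzero."""
    beta = sum((x >> i & 1) & (y >> (n - 1 - i) & 1) for i in range(n))
    micv = sum((x >> i & 1) & (y >> (n - 2 - i) & 1) for i in range(n - 1))
    return beta + 1 if beta == 0 and micv != 0 else beta

print(compensation_vector(0b1111, 0b1111, 4))  # 4: all ICV bits set
print(compensation_vector(0b0001, 0b0100, 4))  # 1: beta = 0, MICV nonzero
```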

The figure below shows the architecture for error compensation.

The compensation vector ICV is realized by directly injecting the partial product terms X(n−1)Y(n/2), X(n−2)Y((n/2)+1), X(n−3)Y((n/2)+2), . . . , X((n/2)+2)Y(n−2). These terms are injected directly, without extra compensation logic gates.

The other compensation vector, used to mend the insufficient error compensation case, is constructed with one conditionally controlled OR gate. One input of the OR gate is fed by X(n/2)Y(n−1), which realizes the function of the compensation vector. The other input is controlled by the judgment formula used to decide whether β = 0 while the MICV is nonzero. Then Cm is injected together with X(n/2)Y(n−1) into a two-input OR gate to correct the insufficient error compensation. Accordingly, when β = 0 and the MICV is nonzero, one additional carry-in signal C(n/2) is injected into the compensation vector to modify the compensation value to β + 1 instead of β. Moreover, the carry-in signal C(n/2) is injected at the bottom of the error compensation vector, which is the farthest location from the critical path.

In our proposed work, the compressors used in the main block further reduce the critical path delay and power consumption. Reducing the critical path delay reduces the error rate, and the performance of error compensation is evaluated through error analysis and tabulated. The main block is implemented with Baugh-Wooley multipliers using compressors and with a Wallace tree structure using compressors, and the results are analyzed in the next chapter.


CHAPTER-6

EXPERIMENT RESULTS

6.1 Compressors

A compressor is one strategy that can be adopted for multi-operand addition in an effective manner. Hence, 3:2, 4:2 and 5:2 compressors are used to reduce the critical path delay in the design of fixed width multipliers. The RTL schematic of the 3:2 compressor is shown in figure 6.1.


Figure 6.1 3:2 compressor module

The internal RTL schematic of 3:2 compressor is shown in figure 6.2.
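Behaviorally, a 3:2 compressor is a full adder, and a 4:2 compressor can be modeled as two chained 3:2 cells; this is a functional sketch, not the low-power cell designs of [2]:

```python
from itertools import product

def compressor_3_2(a, b, c):
    """3:2 compressor (full adder): three same-weight bits -> sum, carry."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)

def compressor_4_2(a, b, c, d, cin):
    """4:2 compressor as two chained 3:2 cells: five same-weight inputs
    -> one sum bit plus two bits (carry, cout) of double weight."""
    s1, cout = compressor_3_2(a, b, c)
    s, carry = compressor_3_2(s1, d, cin)
    return s, carry, cout

# Value check: inputs and outputs always represent the same total.
assert all(a + b + c + d + e == s + 2 * (cy + co)
           for a, b, c, d, e in product((0, 1), repeat=5)
           for s, cy, co in [compressor_4_2(a, b, c, d, e)])
```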


The figure 6.4 shows the internal RTL schematic of 4:2 compressor.


The main block in our work is built using Baugh-Wooley and Wallace multipliers, with the adders replaced by compressors so that the critical path delay can be reduced and multi-operand addition is possible.


The figure 6.7 shows the RTL schematic of the 16 bit Baugh-Wooley multiplier using compressors, and figure 6.8 shows its internal RTL schematic. The simulation result is shown in figure 6.9.

The Wallace tree multiplier of 16 bit size is implemented using 3:2, 4:2 and 5:2 compressors, which speed up the partial product reduction. The RTL schematic and the output waveform of the 16 bit multiplier for different combinations of inputs are shown below.

Figure 6.10 shows the RTL schematic of the 16 bit Wallace tree using compressors (main block), and Figure 6.11 shows its internal RTL schematic. Figure 6.12 shows the simulation result of the 16 bit Wallace tree using compressors.

The RPR block of 16 bit size is implemented with compensation considering the Input Correction Vector (ICV) and the Minor Input Correction Vector (MICV), as shown in figures 6.13, 6.14 and 6.15.

The simulation result of the 16 bit RPR block with compensation (ICV & MICV) is shown in figure 6.15.


The 16 bit Baugh-Wooley multiplier and Wallace tree multiplier using ANT

Architecture with compressors are designed and their respective results are discussed

below.

The RTL schematic of the ANT Architecture implemented using the Baugh-Wooley multiplier with compressors is shown in Figure 6.16, and its internal RTL schematic is shown in Figure 6.17.


Figure 6.17 ANT with RPR 16 bit & Baugh-Wooley with compressors

The figure 6.18 shows the simulation result of ANT Architecture.

Figure 6.18 Simulation result of ANT with RPR(16 bit) & Baugh-Wooley with

compressors

The RTL schematic of ANT Architecture implemented using Wallace tree with

compressors is shown in figure 6.19.

Figure 6.19 ANT with RPR 16 bit using Wallace tree with compressors

The internal RTL schematic of the ANT Architecture implemented using the Wallace tree with compressors is shown in figure 6.20.

Figure 6.20 ANT with RPR of 16 bit & Wallace tree with compressors

The figure 6.21 shows the simulation result of ANT Architecture implemented

using Wallace tree with compressors.

Figure 6.21 Simulation result of ANT with RPR of 16 bit & Wallace tree using compressors


6.6.1 Error Analysis:

The RPR block, which is a fixed width multiplier, incurs truncation error due to the truncation of the lower order bits. Hence, to compensate the truncation error, an error compensation algorithm is implemented using the truncated partial products. The error analysis of the different error compensation algorithms for the RPR of 8 bit and 16 bit is tabulated below in table 6.1.

Table 6.1 Error Analysis

Architecture                     Mean error, 8 bit RPR   Mean error, 16 bit RPR
RPR without compensation         67.9717%                20.0187%
RPR with AO cell compensation    37.107%                 10.9157%
RPR considering ICV & MICV       28.902%                 2.558%
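Mean truncation-error figures of this general kind can be estimated by Monte-Carlo simulation; the sketch below measures only the uncompensated relative truncation error under uniform inputs, and is not the report's exact metric:

```python
import random

def mean_truncation_error(n, trials=10000, seed=0):
    """Monte-Carlo estimate (in percent) of the mean relative error of an
    n-bit fixed-width multiply with no compensation, uniform inputs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = rng.randrange(1 << n)
        y = rng.randrange(1 << n)
        full = x * y
        fixed = (full >> n) << n     # keep only the most significant half
        if full:
            total += (full - fixed) / full
    return 100 * total / trials

print(round(mean_truncation_error(8), 2))
```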

The device utilization of the ANT architecture with the different multipliers, for RPR of 16 bit and 8 bit, is compared and tabulated in Table 6.2. The Virtex 6 FPGA is used for synthesis.

Table 6.2 Gate count of ANT architecture

Device count   Baugh-Wooley          Baugh-Wooley        Wallace using
               without compressors   using compressors   compressors
of 16 bit      923                   864                 837
of 8 bit       764                   740                 699


The ANT architecture of Baugh-Wooley multiplier using Compressors shows

reduction in device utilization by 6.39% and 3.14% for RPR of 16 and 8 bit

respectively compared to existing Baugh-Wooley without compressors.

The ANT architecture of Wallace tree multiplier using Compressors shows

slight reduction in device utilization compared to Baugh-Wooley with compressors.

The delay of the ANT architecture with RPR, using the Baugh-Wooley and Wallace tree multipliers with compressors, is analyzed and tabulated in Table 6.3 below. The results show a decrease in delay of 8.61% and 23.07% for Baugh-Wooley using compressors compared to the previous work [1], and a decrease in delay of 10.61% and 32.56% for the Wallace tree using compressors compared to the previous work.

Table 6.3 Delay Analysis of ANT architecture

Architecture             Existing work [1]   Baugh-Wooley        Wallace tree
                         Delay (ns)          using compressors   using compressors
                                             Delay (ns)          Delay (ns)
ANT with RPR of 16 bit   21.26               20.09               19.00
ANT with RPR of 8 bit    21.09               16.22               14.22

The power analysis of the ANT architecture based 16 bit fixed width multiplier implementations is shown in Table 6.4 below. The power consumed by the ANT architecture using the Baugh-Wooley multiplier with compressors is decreased by 15.39% compared to the ANT architecture implemented using Baugh-Wooley without compressors.

Table 6.4 Power Analysis of ANT architecture based designs

ANT Architecture        Total        Dynamic      Quiescent
based Designs           power (mW)   power (mW)   power (mW)
Baugh-Wooley
without compressors     16.69        2.97         13.12
Baugh-Wooley with
compressors             14.12        2.88         11.24
Wallace tree with
compressors             14.16        2.92         11.24

In the case of the Wallace tree design with compressors, the power reduction is almost the same as that of the Baugh-Wooley design. The power analysis is performed using the Xilinx XPower Analyzer tool.


CHAPTER-7

CONCLUSION & FUTURE SCOPE

7.1 Conclusion

The multiplier is the most important block in DSP and FFT processors, and designing it using 3:2, 4:2 and 5:2 compressors reduces the critical path delay and power. The proposed work is simulated using the ISim simulator, and synthesis is carried out using the Xilinx ISE synthesis tool on the Xilinx platform. The reports show better results compared to previous work in terms of delay, power and error analysis. The proposed system reduces power up to 15.15% by implementing Baugh-Wooley using compressors when compared to previous work.

Fixed width multipliers reduce area, power and delay. The current work

proved the efficiency of Algorithmic Noise Tolerant Architecture based fixed width

multiplier in these aspects. The same can be used in implementation of FIR filters,

MAC units and FFT processors.

The error rate and power reduction can be further improved by the use of

further advanced compensation algorithms and architectures respectively, to design

fixed width multipliers with better performance.


REFERENCES

[1] I-Chyn Wey, Chien-Chang Peng, and Feng-Yu Liao, "Reliable Low-Power Multiplier Design Using Fixed-Width Replica Redundancy Block," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 1, pp. 78-87, January 2015.

[2] Sreehari Veeramachaneni, Kirthi M. Krishna, Lingamneni Avinash, Sreekanth Reddy Puppala, and M. B. Srinivas, "Novel Architectures for High-Speed and Low-Power 3-2, 4-2 and 5-2 Compressors," 20th International Conference on VLSI Design, pp. 324-329, Jan. 2007.

[3] S. K. Sunder, E. Fayez, and A. Antoniou, "Area efficient multipliers for digital signal processing applications," IEEE Trans. on Circuits and Syst., vol. 43, no. 2, Feb. 1996.

[4] M. J. Schulte and E. E. Swartzlander Jr., "Truncated multiplication with correction constant," IEEE Trans. VLSI Systems, vol. I, pp. 388-396, May 1993.

[5] E. E. Swartzlander Jr., "Truncated multiplication with approximate rounding," in Proc. 33rd Asilomar Conference on Signals, Systems, and Computers, pp. 1480-1483, 1999.

[6] J. M. Jou, S. R. Kuang, and R. D. Chen, "Design of low-error fixed-width multiplier for DSP applications," IEEE Trans. Circuits and Syst., vol. 46, pp. 836-842, June 1999.

[7] H. Park and E. E. Swartzlander, "Truncated multiplication with symmetric correction," 40th Asilomar Conf. on Signals, Systems and Computers, vol. 6, pp. 931-934, Sept. 2006.

[8] G. M. Strollo, N. Petra, and D. De Caro, "Dual-tree error compensation for high performance fixed-width multipliers," IEEE Trans. on Circuits and Syst. II: Express Briefs, vol. 52, pp. 501-507, Aug. 2005.

[9] L. Da Van, S. S. Wang, and W. S. Feng, "Design of the lower error fixed-width multiplier and its application," IEEE Trans. on Circuits and Syst. II: Analog and Digital Signal Processing, vol. 47, pp. 1112-1118, Oct. 2000.

PUBLICATION

N. Sai Mani Bharath, H. Phanendra Babu, "Reliable Low Power Design of Fixed width Multiplier using Algorithmic Noise Tolerant Architecture" (communicated).
