
Automated Design Techniques for Low-Power High-Speed Circuits

On a Self-Configuring 64-bit Wallace Tree Multiplier


EE241: Advanced Digital Integrated Circuits Midterm Report

Zhujie Lin (jzlin@cory.eecs.berkeley.edu)
Michael Liao (mliao@berkeley.edu)

ABSTRACT

This paper presents techniques to reduce power consumption in arithmetic logic units (ALUs) while improving performance. The design approach takes advantage of varying input widths to enable evaluation with partial ALU activation. We will demonstrate that partial ALU evaluations have shorter critical paths, which enables us to increase clock speed. A shorter clock period means the circuit can spend less time in operation mode and more time in a power-saving mode such as sleep mode or clock gating mode. The proposed techniques will be implemented and benchmarked on various large-input dynamic Wallace Tree multipliers using 90nm technology. Additional power-saving circuitry, influenced by static CMOS sleep mode circuits and clock gating in dynamic logic, is added onto an existing dynamic Wallace Tree multiplier. These additions look at the incoming input bits and determine which part of the tree to precharge using a self-generated variable clock. The design follows the philosophy of "off until used": a partial tree is used for partial computations. The goal is to see whether the proposed circuit additions significantly reduce leakage power while increasing overall performance.

____________________________________________________________________________________________________

I. INTRODUCTION

The current trend in CPU and multimedia evolution emphasizes the increasing width and accuracy of ALUs. Flagship CPUs from both Intel and AMD have recently upgraded to 64 bits, and GPUs from NVidia and ATI now make use of 32-bit FPUs. While these hardware changes enable personal computers to perform on par with the high-end computers of yesteryear, software does not always take advantage of these hardware evolutions.
Powering unused hardware is a direct obstacle to today's goal of low-power, high-speed circuits. We wish to approach the need for low-power, high-speed circuits with a different philosophy: off by default. The circuit will intelligently turn on only the paths necessary for execution, and this process will be shown to have a faster execution time.

Conventional methods for power reduction rely on lowering the voltage, and on lowering the frequency to allow for lower-voltage operation. By the equation $P = \alpha C V^2 F$, power consumption seems to be reduced. A mobile processor of the current generation (Pentium M) represents this approach: ALUs are turned off when no computation is necessary, clock speed is reduced under light load, and voltage is reduced. This approach has several shortcomings: shutting down the ALU is impossible while processing a multimedia stream; a fixed computation requires the same number of clock cycles regardless of clock speed; and there is no way to increase performance without increasing power consumption.

We can reduce power consumption by relying on the activity factor $\alpha$. Instead of depending only on the general usage of a block, $\alpha$ should also depend on the width of the operation and the length of the clock. These are the revised equations:

$$P = \alpha' C V^2 F'$$
$$\alpha' = \alpha_0 \, \alpha_{width} \, \alpha_{clock}$$
$$F' = F_0 / \alpha_{clock}$$
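As a rough numeric illustration of the revised equations, the sketch below computes power relative to the baseline with C, V, and the baseline activity held fixed. It assumes the width and clock terms are fractions between 0 and 1 that multiply the baseline activity, and that raising the frequency divides the baseline frequency by the clock term. The function name and the sample values 0.25 and 0.5 are hypothetical, not measurements.

```python
def relative_power(alpha_width: float, alpha_clock: float, speed_up: bool) -> float:
    """Return P'/P0 with C, V and the baseline activity alpha_0 held fixed.

    Assumed model: alpha' = alpha_0 * alpha_width * alpha_clock, and
    F' = F_0 / alpha_clock when the shorter clock is used to raise the
    frequency (F' = F_0 otherwise).
    """
    for name, value in (("alpha_width", alpha_width), ("alpha_clock", alpha_clock)):
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {value}")
    activity_ratio = alpha_width * alpha_clock                 # alpha' / alpha_0
    frequency_ratio = 1.0 / alpha_clock if speed_up else 1.0   # F' / F_0
    return activity_ratio * frequency_ratio


# Hypothetical workload: a quarter of the width, half of the clock time.
same_frequency = relative_power(0.25, 0.5, speed_up=False)
higher_frequency = relative_power(0.25, 0.5, speed_up=True)
```

Under these assumptions the clock term cancels when the frequency is raised, so the remaining saving comes from the width term alone.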


It is safe to assume that a 64-bit multiplier will not always be performing 64-bit multiplications, so $\alpha_{width}$ is likely to be much less than 1. Since a reduced width of operation shortens the critical path of computation, $\alpha_{clock}$ represents the reduction in the time a circuit spends in operation mode. The flip side of a reduced clock period is the option to increase frequency, and because of $\alpha_{width}$ there are still overall power savings. This is an attractive choice previously unavailable to circuit designers who reduce power by scaling down voltage.

Section II investigates the problems with our benchmark, a normal Wallace Tree multiplier implemented in dynamic logic. Section III discusses the circuits and concepts needed to solve the power and performance issues of the benchmark multiplier. We present the methods for automating such changes in Section IV. Finally, Section V outlines our testing methodology.

II. SHORTCOMINGS OF BENCHMARK MULTIPLIER

A. DYNAMIC LOGIC

In most dynamic logic designs, some kind of level-restoring device is used to alleviate the problem of charge leakage on the output nodes (see Figure 1). These level restorers add extra intrinsic capacitance as well as a static leakage current that increases the power consumption of the circuit. In high-speed applications (~5 GHz), each evaluate phase has only 0.1 ns to complete (with the other 0.1 ns for precharge), so a level-restoring device may incur so much performance overhead per operation that it is not viable. To prevent leakage power consumption at critical nodes through the precharge device, clock gating is introduced, which enables precharge of the device only when it is in use, not while the device is inactive. Section III explains in detail the use of clock gating to improve overall performance and prevent leakage power consumption.
Figure 1: Dynamic Logic w/ Level Restorer (labels: VDD, Level Restorer, F, PDN, Leakage Node, GND)

B. WALLACE TREE MULTIPLIER

In a variety of applications, a basic high-speed Wallace Tree multiplier implemented in dynamic logic does not have optimal performance or power consumption. In multimedia applications, where the multiplier is always on and the input bit lengths are highly correlated, the dynamic Wallace Tree consumes unnecessary power because the multiplier is precharged every cycle. Since the input bits are correlated, if the inputs do not require the entire Wallace Tree to compute, then parts of the tree will be inactive for long periods of time. But these parts still leak charge and are still recharged by the precharge devices. If those parts of the multiplier can be turned off, power consumption can be reduced significantly.

In microprocessors, where the multiplier can be idle for long periods of time, having a constant clock to precharge the critical nodes also results in unnecessary power consumption. In this case, a sleep mode that disconnects the multiplier from the supplies makes sense. When computing legacy code (such as 16-bit and 32-bit operations) on a 64-bit multiplier, the whole word length is never used, and therefore precharging only the active parts will yield optimal power consumption.

III. PROPOSED ADDITIONS TO THE MULTIPLIER

In the discussion of power saving, the following modes of operation need to be defined: clock gating, sleep, and operation. These modes define the power consumption of each datapath depending on the word length of the input.

Clock gating
When the circuit senses that a datapath will not be used to evaluate the current input, that datapath remains in precharge mode. This reduces dynamic power dissipation, since the clock otherwise charges and discharges the input capacitances of the precharge PMOS transistors. Power consumption is $P = \alpha C V^2 F$, and for a continuously active clock $\alpha$ is 1. For a gated clock, $\alpha$ depends on the input it receives and is likely to be much less than 1.

Sleep mode
In the event that a datapath remains unused for a prolonged period of time, that datapath is shut down and all precharged nodes are allowed to leak away. Sleep mode is achieved by turning off both the precharge PMOS and the evaluate NMOS; this introduces two large resistances and thus minimizes the power consumption of unused circuitry.

Operation mode
If a datapath is determined to be necessary for processing data, it receives a clock signal to precharge the path if it was in sleep mode, or evaluates directly if the path was in clock gating mode. The length of the clock may vary depending on the bit length of the input, which results in reduced circuit activity time compared with fixed-clock operation.

Figure 2: Clock Gating, Sleep Mode, and Operation Mode (each panel: VDD, PDN, GND)
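The interplay of the three modes can be sketched as a small behavioral model of one datapath. This is only an illustration under stated assumptions: the class name, the `needed` input, and the `sleep_after` idle threshold are hypothetical, and the text does not specify how long a datapath must stay unused before it is put to sleep.

```python
from enum import Enum


class Mode(Enum):
    OPERATION = "operation"
    CLOCK_GATING = "clock_gating"
    SLEEP = "sleep"


class DatapathModeSelector:
    """Behavioral model of one datapath; sleep_after is a hypothetical threshold."""

    def __init__(self, sleep_after: int = 4) -> None:
        self.sleep_after = sleep_after   # unused cycles tolerated before sleeping
        self.idle_cycles = 0
        self.mode = Mode.SLEEP           # off by default

    def step(self, needed: bool) -> Mode:
        """Advance one cycle; `needed` says whether this input uses the datapath."""
        if needed:
            self.idle_cycles = 0
            self.mode = Mode.OPERATION
        else:
            self.idle_cycles += 1
            if self.mode is Mode.SLEEP or self.idle_cycles >= self.sleep_after:
                self.mode = Mode.SLEEP         # prolonged disuse: let nodes leak away
            else:
                self.mode = Mode.CLOCK_GATING  # short disuse: hold in precharge
        return self.mode


selector = DatapathModeSelector(sleep_after=3)
trace = [selector.step(needed) for needed in (True, False, False, False, True)]
```

A larger threshold favors clock gating, which avoids the cost of recharging every node when a sleeping datapath is needed again.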

In order to obtain the aforementioned modes of operation and to maximize power savings, several circuits need to be implemented: most significant bit (MSB) detection, a variable duty cycle clock, a datapath state selector, and a data multiplexer.

Most significant bit detection
This circuit determines the bit length of the incoming data. MSB detection must be fast and efficient, since it controls the length of the clock to reflect operation time, the arrangement of data for top calculation efficiency, and the state of every datapath.
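A software model of what MSB detection computes may help here. The sketch below scans an unsigned operand from the top bit down, as a priority encoder would, and then orders two operands so that the first is at least as long as the second, which is the condition the data multiplexer is meant to guarantee. The function names and the 64-bit default are illustrative assumptions, and the model says nothing about the circuit's speed.

```python
def msb_length(value: int, width: int = 64) -> int:
    """Return the number of significant bits of an unsigned operand (0 for zero)."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"operand does not fit in {width} unsigned bits")
    # Scan from the most significant position down, like a priority encoder.
    for position in range(width - 1, -1, -1):
        if (value >> position) & 1:
            return position + 1
    return 0


def order_operands(a: int, b: int, width: int = 64):
    """Return (long_operand, short_operand, long_length, short_length)."""
    length_a, length_b = msb_length(a, width), msb_length(b, width)
    if length_a >= length_b:
        return a, b, length_a, length_b
    return b, a, length_b, length_a
```

For example, `order_operands(3, 0x100)` swaps the operands, because `0x100` needs nine bits and `3` needs two.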

Figure 3: MSB Detection Circuit

Variable duty cycle clock
This clock generator is shown in Figure 4. The reason for having a variable clock is to reduce the operation time of the circuitry. This benefits us in two ways: first, we can run the circuit at a higher clock speed depending on the complexity of the operation; second, the less time a circuit spends in operation mode, the less current leaks away.
Figure 4: Variable Clock Generator (labels: CLK, EN)

Datapath state selector
Each datapath has a different utilization rate. In our benchmark multiplier, in highly uncorrelated operation every bit can be considered a noise signal with a 50% utilization rate when active; in correlated operation, the most significant bits see very few transitions while the least significant bits still have a noise distribution; in sparse computation mode, idle prevails, so the utilization rate of every bit is minimal; in legacy mode, we are guaranteed that a select set of bits will never be used. Our datapath state selector must have the following attributes: a minimal operation time, so that when data is highly uncorrelated the datapath does not take long to switch modes; and a careful choice between clock gating and sleep modes when the data is correlated or mostly idle, because it takes a while to bring elements from sleep to active, as every node needs to be recharged.

Data multiplexer
When dealing with two inputs, their relative bit lengths may vary, and a fixed circuit is more easily optimized for the condition that A is as long as or longer than B. A data multiplexer is thus needed to route the data into the operational circuitry so that this condition is always satisfied. This enables regularly structured operational circuitry to compute data more efficiently.

IV. DESIGN METHODOLOGY

In showcasing our power reduction and performance boosting circuits for general application, we have developed a full suite of implementation techniques to quickly convert any standard ALU design. The philosophy of design automation requires scripting that reflects the structural regularity of circuits. The goal is to generate an ALU of any length containing the circuitry mentioned in the previous section. Some parts of ALU structures are highly repetitive, while others are placed irregularly, so there are techniques to deal with each situation: scripting for regular circuits and scripting for irregular circuits.

Scripting for regular circuits
In our Wallace Tree example, Figure 5 shows that a 5-bit Wallace Tree is just a 4-bit tree with an extra row of adders and a lengthened vector add unit. We can exploit this structural regularity to generate Wallace Trees of any bit length. There are two fewer rows of parallel adders than the number of input bits, each row has two more full adders than the previous row, and the rows are followed by a vector adder twice as long as the number of bits (see Figure 6). The most structured part of a Wallace Tree is the block of AND gates; it is simply a square with a side equal to the number of bits. The setup circuitry, such as bit detection and the data multiplexer, scales linearly with the number of bits. The result of such scripting features similarly named circuit elements with slight variations in numbering to differentiate one adder from another.
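The regular part of the structure lends itself to a short generator. The sketch below follows the counts given in the text: a square block of AND gates, two fewer adder rows than bits, two more full adders per row, and a vector adder twice the bit length. The naming scheme is hypothetical, and so is the `first_row_adders` parameter, since the text does not state how many full adders the first row contains.

```python
def wallace_element_names(bits: int, first_row_adders: int = 2) -> dict:
    """Generate element names for the regular parts of a bits-by-bits Wallace Tree.

    first_row_adders is a placeholder assumption; the naming scheme is
    illustrative and not taken from an actual netlist.
    """
    if bits < 3:
        raise ValueError("need at least 3 bits to have one row of adders")
    # Square block of AND gates, one per partial-product bit.
    and_gates = [f"and_{row}_{col}" for row in range(bits) for col in range(bits)]
    # bits - 2 rows of parallel adders, each two full adders wider than the last.
    adder_rows = [
        [f"fa_r{row}_{index}" for index in range(first_row_adders + 2 * row)]
        for row in range(bits - 2)
    ]
    # Final vector adder, twice as long as the number of bits.
    vector_adder = [f"va_{index}" for index in range(2 * bits)]
    return {"and_gates": and_gates, "adder_rows": adder_rows, "vector_adder": vector_adder}


four_bit = wallace_element_names(4)
five_bit = wallace_element_names(5)
```

With this scheme the first two adder rows of `five_bit` carry the same names as the rows of `four_bit`, which mirrors the observation that a 5-bit tree is a 4-bit tree plus one more row and a longer vector adder.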

Figure 5: Regularity of a 4-bit by 4-bit Wallace Tree [1]

Scripting for irregular circuits
The input and output networks between the rows of parallel adders in a Wallace Tree are highly irregular; some adders take an original input, some take a carry and a save, and some take other combinations of original, carry, and save. On top of the irregular wiring, there is a need to simulate the wiring resistance and capacitance leading from one node to another, and the resulting model must also reflect the varying lengths of the paths. A data structure is necessary to automate the generation of such wiring networks. Whereas in a regularly structured script the circuit elements can be generated on the fly with small variations in numbering, an irregularly structured network requires the names to be entered into a database. Each entry can be referred to by its relative position in the circuit, and can also be updated with new names as more circuit elements are connected hierarchically.

The wiring network in a Wallace Tree multiplier can be represented in 3D: the top level represents a row of adders, each outputting a sum and a carry wire, as shown in Figure 6. The relative positions of these two wires are known, so they can be entered into the database in the correct locations. The level below these adders is an interconnect network whose wires might extend an original wire or represent the sum or carry wires leading to the next adder. These wires vary in length depending on their originating points. With a database, these attributes are remembered, and thus the correct values for resistance and capacitance can be extracted.
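One possible shape for such a database is a table keyed by relative position and wire kind. The sketch below is an assumption about how it could look, not the scripts actually used: the class and method names, the per-micron resistance and capacitance constants, and the linear length model are all hypothetical.

```python
from dataclasses import dataclass

WIRE_KINDS = ("original", "sum", "carry")


@dataclass
class Wire:
    name: str
    length_um: float   # hypothetical unit: microns


class WiringDatabase:
    """Wires indexed by (row, column, kind); R and C scale linearly with length."""

    def __init__(self, r_per_um: float, c_per_um: float) -> None:
        self.r_per_um = r_per_um   # assumed resistance per micron
        self.c_per_um = c_per_um   # assumed capacitance per micron
        self._wires = {}           # (row, column, kind) -> Wire

    def add(self, row: int, column: int, kind: str, name: str, length_um: float) -> None:
        if kind not in WIRE_KINDS:
            raise ValueError(f"unknown wire kind: {kind}")
        self._wires[(row, column, kind)] = Wire(name, length_um)

    def lookup(self, row: int, column: int, kind: str) -> Wire:
        return self._wires[(row, column, kind)]

    def rename(self, row: int, column: int, kind: str, new_name: str) -> None:
        """Update an entry's name as it is connected into a larger hierarchy."""
        self._wires[(row, column, kind)].name = new_name

    def parasitics(self, row: int, column: int, kind: str):
        """Return (resistance, capacitance) for the wire at this position."""
        wire = self.lookup(row, column, kind)
        return wire.length_um * self.r_per_um, wire.length_um * self.c_per_um


db = WiringDatabase(r_per_um=0.5, c_per_um=0.25)
db.add(0, 3, "sum", "fa_r0_3_sum", length_um=8.0)
db.rename(0, 3, "sum", "mult64.fa_r0_3_sum")
```

Keying on position rather than on name is what lets an entry keep its place while its name changes during hierarchical connection.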

Figure 6: Hierarchical Structure of the Wallace Tree [1]

V. TESTING METHODOLOGY

We will benchmark our proposed additions using 90nm ST Microelectronics standard cell technology. The proposed 64-bit by 64-bit multiplier will be compared against a static CMOS design and a basic dynamic design of the same input size, as well as other multipliers of smaller word length (i.e., 16b by 16b and 32b by 32b Wallace Trees), for power consumption and propagation delay. We will test the full range of operation by using a specific set of test values that vary the input word lengths, interspersed with periods of inactivity.

Input Length Choices
For the basic Wallace Tree multiplier, a 1-bit by 64-bit multiplication activates a different part of the tree than a 64-bit by 1-bit multiplication. The two operations have different power consumption as well as different propagation delays. Our proposed design, however, should be unaffected by the input order. Also, a 32-bit by 32-bit multiply on our proposed 64-bit multiplier will be tested against a pure 32-bit Wallace Tree as well as the two basic 64-bit Wallace Trees. We want to see whether our design still has power and performance advantages over a dedicated 32-bit multiplier.

Input Sequences
We will choose input sequences that incur the maximal switching activity in the test multipliers, in order to test the input extremes for dynamic switching power and propagation delay. Long periods of inactivity will be injected to test the advantages of the sleep mode and inactivity detection in our proposed design. Figure 7 shows the input bit-width sequences we will test, in four panels that plot number of bits against time: multimedia (bits correlated), noise (bits uncorrelated), idle periods, and legacy (16-, 32-, or 64-bit).
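The four bit-width patterns named in Figure 7 can be mimicked with a small stimulus generator. The sketch below is only an assumed way to produce such sequences: the function names, the starting width of 32, the drift of at most 2 bits per cycle, and the burst and hold lengths are all hypothetical choices, and a width of 0 marks an idle cycle.

```python
import random

LEGACY_WIDTHS = (16, 32, 64)


def width_sequence(kind: str, cycles: int, seed: int = 0) -> list:
    """Return one operand bit width per cycle; 0 marks an idle cycle."""
    rng = random.Random(seed)
    if kind == "noise":        # uncorrelated: each cycle drawn independently
        return [rng.randint(1, 64) for _ in range(cycles)]
    if kind == "multimedia":   # correlated: width drifts by small steps
        widths, current = [], 32
        for _ in range(cycles):
            current = min(64, max(1, current + rng.randint(-2, 2)))
            widths.append(current)
        return widths
    if kind == "idle":         # 8 active cycles, then 24 idle cycles, repeated
        return [rng.randint(1, 64) if (i // 8) % 4 == 0 else 0 for i in range(cycles)]
    if kind == "legacy":       # hold each legacy width for 16 cycles
        return [LEGACY_WIDTHS[(i // 16) % 3] for i in range(cycles)]
    raise ValueError(f"unknown sequence kind: {kind}")


def operand_of_width(width: int, rng: random.Random) -> int:
    """Return a random operand whose most significant set bit gives `width` bits."""
    if width == 0:
        return 0
    return rng.getrandbits(width) | (1 << (width - 1))
```

Feeding the same seeded sequences to every multiplier under test keeps the power and delay comparisons on equal footing.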

Figure 7: Various Testing Input Sequences

VI. REFERENCES

1. J. Rabaey, A. Chandrakasan, B. Nikolić, Digital Integrated Circuits, 2nd ed.
