
Fast Techniques for Standby Leakage Reduction in MTCMOS Circuits

Wenxin Wang
School of Engineering University of Guelph, Ontario wenxin@uoguelph.ca

Mohab Anis
ECE Department University of Waterloo, Ontario manis@vlsi.uwaterloo.ca

Shawki Areibi
School of Engineering University of Guelph, Ontario sareibi@uoguelph.ca

Abstract. Technology scaling causes subthreshold leakage currents to increase exponentially. Therefore, effective leakage minimization techniques must be designed. In addition, for a true low-power solution in System-on-Chip (SoC) design, it has to be tightly integrated into the main design environment. This paper presents two design techniques to effectively solve the sleep transistor sizing and distribution problem in MTCMOS circuits. The introduced First-Fit and Set-Covering approaches achieve lower leakage at an order of magnitude reduction in CPU time compared with other techniques in the literature. In addition, an automatic MTCMOS design environment is developed and integrated into the Canadian Microelectronics Corporation (CMC) digital ASIC design flow.

I. INTRODUCTION

As technology scales into the deep-submicron (DSM) regime, standby subthreshold leakage power increases exponentially with the reduction of the supply voltage ($V_{DD}$) and the threshold voltage ($V_{th}$). For many event-driven applications, such as mobile devices where circuits spend most of their time in an idle state with no computation, standby leakage power is especially detrimental to overall power dissipation. Multi-Threshold CMOS (MTCMOS) is an effective circuit-level methodology that provides high performance in the active mode and saves leakage power during the standby mode. The basic principle of the MTCMOS technique is to use low-$V_{th}$ transistors to design the logic gates where the switching speed is essential, while the high-$V_{th}$ transistors (also called sleep transistors) are used to effectively isolate the logic gates in standby state and limit the leakage dissipation [5].
Fig. 1. MTCMOS circuit. (a) MTCMOS circuit structure: CMOS logic with low $V_{th}$ between VDD and the virtual ground (VGND), and a high-$V_{th}$ nMOS sleep transistor, driven by the sleep signal, between VGND and GND. (b) Sleep transistor modeled as a resistor R in active mode, carrying the current I and producing the voltage drop $V_x$.

Fig. 1(a) shows the basic circuit scheme of MTCMOS. The sleep transistor (ST) in an MTCMOS circuit is controlled by a sleep control signal. During the standby mode (sleep=1), the ST is off. This causes the leakage current of the logic block to be limited to that of the ST. Due to the high $V_{th}$ of the ST, the total leakage of the circuit is minimized. On the other hand, in the active mode (sleep=0), the ST is turned on and the real ground line (GND) is directly connected to the virtual ground line (VGND). Consequently, the low-$V_{th}$ logic gates operate normally at a high speed.

In the active mode, the sleep transistor works as a resistor, as shown in Fig. 1(b). The discharge current flowing through the ST causes a voltage drop ($V_x$), which degrades the circuit performance. Both $V_x$ and the leakage of the ST in standby mode depend directly on the size of the sleep transistor [5]: a larger ST lowers $V_x$ but leaks more. Therefore, proper ST sizing is a key issue that affects the performance, the leakage power saving and the noise immunity of MTCMOS circuits.

Over the past few years, a number of ST sizing methodologies have been reported in the literature. A single ST to support the whole circuit was proposed in [5] and [6]. In [3], the ST was sized based on mutually exclusive discharge patterns, and the resulting sets of STs were eventually merged into a single large sleep transistor that accommodates the whole circuit, as in [5] and [6]. A distributed ST network methodology was proposed in [4] to minimize the total ST area, whereas a cluster-based technique was proposed in [1] and [2] which accounted for the circuit floorplan and the noise bouncing on the virtual ground rails. The drawbacks of the above techniques can be summarized as follows:

(I) In [3], [5], and [6], supporting the circuit with a single ST increases the interconnect resistance seen by distant blocks. As a result, the sleep transistor has to be sized even larger than expected to compensate for the added interconnect resistance, leading to an increase in leakage power. This drawback is even more severe in DSM regimes, where interconnects have a larger impact on the circuit's performance [7].

(II) In [3], [5], and [6], the virtual ground interconnects are neither modelled nor taken into account in the design phase, which leads to impractical designs.

(III) The work in [4], [5], and [6] has not considered timing discharge patterns for the gates. This leads to pessimistic design and oversizing of the STs, which consequently leads to a large increase in leakage power. Timing analysis must contain information about the state of the input vectors, variations in critical path delays, and how they impact the gate discharge patterns. Therefore, timing analysis must be taken into account for proper ST sizing.

(IV) Finally, the work in [1] [2] may consume relatively large CPU times for computing the ST size, location, and clustering solution. In addition, [4] does not report CPU time.

Motivated by the drawbacks of the above techniques proposed in the literature, this paper proposes two techniques for ST sizing for small and large circuits. The contributions of this paper can be outlined as follows:

(I) An automatic vector generation engine is developed to build a vector for each gate in the gate-level netlist. Based on the vector representation, an MTCMOS design environment is developed and integrated into the conventional design flow.

(II) This paper presents a modification to the algorithms presented in [1] [2], which achieves an order of magnitude reduction in the CPU time needed to compute the ST size. The proposed techniques also take virtual ground interconnects into account, as well as timing analysis for the gate discharge currents, which ensures proper ST sizing. This paper accounts for process variations, which will have higher importance in the DSM regime, where the down-scaling of devices increases the impact of intra-die variations on the performance of VLSI circuits.

The paper starts with modelling the discharge currents as vectors in Section II, while the vector generation flow is described in Section III. Section IV presents the two techniques to size the sleep transistor, including experimental results. The MTCMOS design flow is discussed in Section V. Finally, conclusions are given in Section VI.

II. PROCESSING OF DISCHARGE CURRENTS

The accuracy of sizing the sleep transistor, while maintaining adequate values for the circuit speed, is heavily dependent on how well the discharge currents at the output of each gate in the circuit are modelled. In this paper, the discharge current of each standard cell in the circuit is modelled as a trapezoid waveform as in [1] (Fig. 2).
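Fig. 2 pairs each trapezoid with a vector holding one current value per time slot (10 ps slots in this paper). The following is a minimal sketch of that sampling, not the authors' implementation: the function names, the linear ramps, and the slot counts are assumptions chosen for illustration. With the parameters below, the first call reproduces the V2 vector of Fig. 2; the parameters used for G1 only approximate V1.

```python
def trapezoid_vector(peak, start, rise, flat, fall, n_slots):
    """Sample a trapezoidal current pulse into one value per time slot.

    peak  : plateau current (I_max)
    start : number of leading zero slots before the pulse begins
    rise, flat, fall : slot counts of the rising edge, plateau, falling edge
    Linear ramps are an assumption of this sketch.
    """
    if start + rise + flat + fall > n_slots:
        raise ValueError("pulse does not fit in the time window")
    vec = [0] * n_slots
    pos = start
    for k in range(1, rise + 1):        # rising edge
        vec[pos] = peak * k / (rise + 1)
        pos += 1
    for _ in range(flat):               # plateau at the peak current
        vec[pos] = peak
        pos += 1
    for k in range(fall, 0, -1):        # falling edge
        vec[pos] = peak * k / (fall + 1)
        pos += 1
    return vec


def combine(*vectors):
    """Slot-by-slot sum of the currents of several gates."""
    return [sum(slot) for slot in zip(*vectors)]


v2 = trapezoid_vector(peak=14, start=11, rise=6, flat=6, fall=6, n_slots=36)
v1 = trapezoid_vector(peak=5, start=4, rise=2, flat=5, fall=2, n_slots=36)
total = combine(v1, v2)
print(max(total))  # 14, lower than the 19 obtained by adding the two peaks
```

Because the two pulses barely overlap in time, the peak of the summed vector stays at 14, which is the effect that vector-based sizing exploits.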

III. AUTOMATIC VECTOR GENERATION FROM RTL

The technology library used in this paper for logic synthesis is from Virtual Silicon Technology Inc., using the 0.18 µm TSMC process. For each standard cell in this library, all the parameters that characterize the discharge current (Section II) are recorded at different fanouts using the HSPICE simulator. Thus, a database (step 1 in Fig. 3) is constructed containing all the information about the discharge current for each standard cell at different fanouts. This database works as a look-up table to build the vector for each gate in the circuit after logic synthesis. The vector generation process is summarized in Fig. 3.

Fig. 3. Vector generation flow: (1) database, (2) RTL coding, (3) synthesis from the RTL-level netlist to the gate-level netlist, (4) circuit topology extraction, (5) update delay, (6) build vectors.

Circuit design starts with the RTL code (step 2). Based on the gate-level netlist produced by logic synthesis (step 3) with Synopsys Design Compiler, the circuit topology is extracted (step 4). The delay parameters ($t_{min}$ and $t_{max}$) of each gate read from the database are then updated to account for the accumulative delay (step 5), which is based on the gate-level netlist. This is shown in Fig. 6 in Section V. In general, the accumulative delay of gate $G_i$ is expressed as:

$$t_{i,min} = \min\{(t_{1,min} + d_{i,min}),\ (t_{2,min} + d_{i,min}),\ \ldots\} \qquad (1)$$

$$t_{i,max} = \max\{(t_{1,max} + d_{i,max}),\ (t_{2,max} + d_{i,max}),\ \ldots\} \qquad (2)$$

given that the outputs of gates $G_1$, $G_2$, ... are inputs to gate $G_i$, where $d_{i,min}$ and $d_{i,max}$ denote the earliest and latest delay of $G_i$ itself as read from the database. The vector representation for each gate in the circuit is then generated automatically based on the updated delay information. For a circuit with 800 gates, the vector generation process is accomplished within one second.

Fig. 2. Discharge current timing diagram and vector representation. Gate G1: $I_{max} = 5$, with $t_{1min}$ and $t_{1max}$ marked on the time axis, V1 = 0 0 0 0 1 3 5 5 5 5 5 3 1 0 0 ... Gate G2: $I_{max} = 14$, with $t_{2min}$ and $t_{2max}$ marked on the time axis, V2 = 0 0 0 0 0 0 0 0 0 0 0 2 4 6 8 10 12 14 14 14 14 14 14 12 10 8 6 4 2 0 0 ...

The peak value, the delay and the duration of the discharge current for each gate are monitored according to: (i) the different input transitions causing the discharge currents and (ii) the fanout associated with the gate. The peak discharge current value ($I_{max}$), the earliest delay time ($t_{min}$), the latest delay time ($t_{max}$) and the longest duration are documented for each gate. For example, $t_{1min}$ and $t_{1max}$ are the earliest and latest delay of gate G1 in Fig. 2, respectively. The switching activity of the gate is then calculated and multiplied by the corresponding peak value, which gives an expected discharge current estimation. The discharge current of the gate finally takes the waveform of a trapezoid, shown as the bold line in Fig. 2. This technique guarantees that, for all input combinations and the process variations accounted for, the discharge current and the glitching current are taken into account, and the speed of the circuit is maintained [2].

To facilitate vector comparisons and to offer an automated design environment, the trapezoid discharge current of each gate is represented by a vector as in [2]. The time axis is divided into 10 ps time slots (adequate accuracy for 0.18 µm CMOS technology) and each slot holds a value representing the magnitude of the discharge current at that specific time (as V1 and V2 shown in Fig. 2). This idea is adopted in this paper, such that the proposed MTCMOS design environment can be easily integrated into the Canadian Microelectronics Corporation (CMC) design flow.

IV. PROPOSED TECHNIQUES

The techniques proposed in this paper use the vectorially-modelled discharge currents to estimate the sleep transistor size. Due to the presence of the ST, a speed penalty must be incorporated. The maximum speed penalty (MSP) is set to 5%. Thus, the size of the sleep transistor that tolerates MSP = 5% can be expressed as [2]

$$\left(\frac{W}{L}\right)_{ST} = \frac{I_{ST}}{0.05\,\mu_n C_{ox}\,(V_{DD} - V_{thL})(V_{DD} - V_{thH})} \qquad (3)$$

where $(W/L)_{ST}$ is the width-to-length ratio of the ST, $\mu_n$ is the N-mobility, $C_{ox}$ is the oxide capacitance, and $V_{thH}$ (500 mV) and $V_{thL}$ (350 mV) represent the high and low threshold voltages, respectively. The objective of MTCMOS low-power design is to propose efficient techniques to evaluate the number of STs at a $(W/L)_{ST}$ value based on the discharge current limit ($I_{ST}$) of the ST, such that the leakage power is minimized. In the following subsections, two techniques are proposed, namely First-Fit and Set-Covering.
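As a worked illustration of the sizing expression, the sketch below assumes Eq. (3) has the form $(W/L)_{ST} = I_{ST} / (MSP \cdot \mu_n C_{ox} (V_{DD} - V_{thL})(V_{DD} - V_{thH}))$ with MSP = 0.05, as in [2]. The function name is hypothetical, and the supply voltage (1.8 V) and the $\mu_n C_{ox}$ value (300 µA/V²) are placeholder numbers chosen for illustration, not values reported in the paper.

```python
def st_aspect_ratio(i_st, vdd, mu_n_cox, vth_low=0.35, vth_high=0.50, msp=0.05):
    """Width-to-length ratio of a sleep transistor for a current limit i_st.

    i_st      : discharge current limit of the ST, in amperes
    vdd       : supply voltage, in volts
    mu_n_cox  : process transconductance mu_n * C_ox, in A/V^2 (assumed value)
    vth_low, vth_high : low and high threshold voltages (350 mV and 500 mV)
    msp       : maximum speed penalty (5%)
    """
    if vdd <= vth_high:
        raise ValueError("supply voltage must exceed the high threshold voltage")
    return i_st / (msp * mu_n_cox * (vdd - vth_low) * (vdd - vth_high))


# Illustrative numbers only: 1 mA limit, 1.8 V supply, 300 uA/V^2.
ratio = st_aspect_ratio(i_st=1e-3, vdd=1.8, mu_n_cox=300e-6)
print(f"(W/L)_ST = {ratio:.1f}")
```

The ratio grows linearly with the current limit, so halving the peak current seen by an ST halves its width, and with it the leakage contributed by that ST.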

A. First-Fit Technique

A preprocessing heuristic was utilized in [2] to form a set of subclusters of gates that, when combined, would not exceed the maximum current of any gate within the cluster. The ST sizing problem was then modelled as a bin-packing problem (BPP). The objective of the BPP was to assign each cluster to one bin (i.e., one ST) such that the number of bins used is minimized and the total current in each bin does not exceed $I_{ST}$, to satisfy the MSP constraint.

One problem with using the maximum current of each cluster to represent the object weight in [2] is that it over-estimates the discharge current at different time slots. Accordingly, a more efficient First-Fit (FF) heuristic technique is proposed in this paper to solve the ST sizing problem directly. As seen from the pseudo-code in Fig. 4, $I_{ST}$ is directly taken as the criterion to assign a gate to the sleep transistor. The algorithm terminates when all the gates are assigned. Without transforming the dynamic discharge current of a cluster into a static maximum current, the current over-estimation is avoided, and therefore the number of sleep transistors can be reduced. Furthermore, the pseudo-code in Fig. 4 indicates that the FF heuristic has a complexity similar to the preprocessing heuristic proposed in [1]. Hence, the CPU time involved in solving the sleep transistor sizing problem is improved dramatically, since the Integer Linear Programming (ILP) bin-packing problem is avoided.
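The First-Fit assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' C implementation: the function name, the dictionary-based data layout, and the example currents are hypothetical, and all vectors are assumed to have the same number of time slots.

```python
def first_fit(vectors, i_limit):
    """Group gates under sleep transistors so that the slot-by-slot sum of
    their discharge currents never exceeds i_limit.

    vectors : dict mapping gate name to its per-slot current vector
    i_limit : current limit (I_ST) of one sleep transistor
    Returns a list of (gate_names, combined_vector), one entry per ST.
    """
    unassigned = list(vectors)                  # gates in netlist order
    sleep_transistors = []
    while unassigned:
        seed = unassigned.pop(0)                # first free gate opens a new ST
        if max(vectors[seed]) > i_limit:
            raise ValueError(f"gate {seed} alone exceeds the ST current limit")
        members, total = [seed], list(vectors[seed])
        remaining = []
        for gate in unassigned:                 # try every other free gate
            combined = [a + b for a, b in zip(total, vectors[gate])]
            if max(combined) <= i_limit:        # fits in every time slot
                members.append(gate)
                total = combined
            else:
                remaining.append(gate)          # stays free for a later ST
        unassigned = remaining
        sleep_transistors.append((members, total))
    return sleep_transistors


# Hypothetical currents: g1 and g2 discharge at different times, g3 overlaps g1.
vectors = {
    "g1": [0, 5, 5, 0, 0, 0],
    "g2": [0, 0, 0, 5, 5, 0],
    "g3": [0, 4, 4, 0, 0, 0],
}
result = first_fit(vectors, i_limit=6)
groups = [members for members, _ in result]
print(groups)
```

In this example g1 and g2 share one ST because their pulses do not overlap, whereas a static comparison of peak currents (5 + 5 > 6) would have separated them.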

FIRST-FIT HEURISTIC
1. Initialize current vectors
2. Set all gates free to be assigned to a sleep transistor
3. For all gates G_i in circuit
     If gate G_i is not assigned yet
       assign gate G_i to new sleep transistor ST_j
       update sleep transistor info
       calculate max current, start, and end time
     End If
     For all other gates G_k in circuit
       If (gate G_k is not assigned yet)
         add current of gate G_k to sleep transistor ST_j
         If (combination ≤ current limit of ST)
           append gate G_k to sleep transistor ST_j
           update sleep transistor info
           set gate G_k locked in sleep transistor ST_j
         End If
       End If
     End For
   End For
4. Return all sleep transistors used.

Fig. 4. First-Fit heuristic for the MTCMOS technique.

Experimental Results of the First-Fit Approach

Keeping MSP = 5% as a comparison basis, Table I shows a comparison between the BPP technique used in [1] and the proposed FF heuristic in terms of total ST area, leakage and computation time. The test benches are a 4b carry look-ahead adder (CLA), a 32b parity checker (PC), a 6b multiplier (MP), a 4b ALU, a 32b single error correcting circuit (ISCAS85 C499 benchmark) and a 27b channel interrupt controller (ISCAS85 C432 benchmark). Columns BPP and FF are the results generated by the BPP and FF techniques, respectively. The last two columns show the leakage and computation time savings achieved by FF compared with BPP. Table I shows that the FF heuristic achieves less or equal leakage power compared to the techniques used in [1]. However, the main advantage arises from the large reduction in CPU time by the FF heuristic. The FF heuristic can, therefore, scale well for larger circuits.

TABLE I
COMPARISON BETWEEN BPP AND FF TECHNIQUES

Benchmark   Total width of STs (BPP) [1]   Total width of STs (FF)   Leakage savings   CPU time savings
4b CLA      3.96 µm                        3.96 µm                   0%                96%
32b PC      2.64 µm                        2.64 µm                   0%                39%
6b MP       3.96 µm                        2.64 µm                   33%               82%
4b ALU      6.6 µm                         4.4 µm                    33%               64%
C499        7.7 µm                         7.7 µm                    0%                80%
C432        16.94 µm                       14.52 µm                  14.3%             98%

B. Set-Covering Technique

In order to take the physical locations of the gates in the chip into consideration, and therefore reduce the routing complexity of larger circuits, a set-partitioning technique was further proposed in [1]. The objective of the low-power set-partitioning formulation (SPP) is to find an optimal collection of clusters such that each gate is covered by exactly one cluster (i.e., one ST in the MTCMOS technique), while the lowest cost value is achieved. The mathematical formulation of the SPP problem is as follows [2]:

$$\text{Minimize } Z = \sum_{j=1}^{n} c_j x_j \qquad (4)$$

$$\text{Subject to } \sum_{j=1}^{n} a_{ij} x_j = 1, \quad i = 1, \ldots, m \qquad (5)$$

$$x_j = \begin{cases} 1 & \text{if the } j\text{th cluster is selected} \\ 0 & \text{otherwise} \end{cases} \qquad j = 1, \ldots, n$$

where $a_{ij} = 1$ if gate $i$ belongs to cluster $j$ and 0 otherwise, $m$ is the number of gates and $n$ is the number of candidate clusters.

The cost function $c_j$ is evaluated from the physical locations of the gates with respect to each other. $c_j$ is related to the routing complexity of the circuit as well as the capacity of each cluster [2]. In order to evaluate the physical locations of the gates, the Qplace placement tool integrated into the Cadence physical design environment is applied in this work, and the coordinates are extracted from the Design Exchange Format (DEF) file. The cost function associated with group $j$ is formulated as follows [2]:

$$c_j = (w_D \times D_j) + (w_C \times C_j) \qquad (6)$$

where $D_j$ is a distance function (i.e., rectilinear distance between gates within a cluster) and $C_j$ represents the difference between the maximum ST capacity ($I_{ST}$) and the sum of all currents of gates within a cluster. Therefore,

$$D_j = \sum_{i,k \in j} d_{ik} \qquad (7)$$

and

$$C_j = I_{ST} - \sum_{i \in j} I_i \qquad (8)$$

where $d_{ik}$ is the distance between the centers of gates $i$ and $k$ in the group $j$, and $I_i$ is the current of gate $i$.

The parameters $w_D$ and $w_C$ are the weights associated with the cost of the two constraints (i.e., distance and capacity of the formed clusters). Equal values are assigned to the weights $w_D$ and $w_C$ in order to balance the distance and capacity constraints. Gates are always grouped while meeting the constraint that the sum of currents does not exceed $I_{ST}$, which is dictated by the MSP = 5% condition. The equality constraints of the SPP in Eq. (5) guarantee that every gate in the circuit is covered exactly once by a single ST. These constraints make the SPP a highly constrained problem. When the problem size (i.e., the number of gates in the circuit) increases, it becomes quite inefficient to solve the model with an ILP solver (CPLEX) in a reasonable amount of time. Experimental results in [1] show that the CPU time used by the CPLEX solver to solve the SPP increases dramatically as the circuit size increases.

Accordingly, a Set-Covering technique (SCP) is considered in this paper to reduce computation time. The same clustering heuristic introduced in [2] is used within the SCP formulation. The main difference between SPP and SCP is the sense of the constraints. By relaxing the equality constraints in Eq. (5) to greater-than-or-equal ($\geq$), the SPP problem is transformed into the SCP problem. The constraints used in the SCP model guarantee that every gate in the circuit is covered by an ST at least once, which means some gates may be covered by two or more STs. In [1], the current capacity of each sleep transistor is assumed to be a fixed value. As a result, the relaxed constraints of the SCP technique could increase the number of STs. However, when one gate is connected to more than one sleep transistor, the virtual ground wires of the different STs become common, which balances the discharging currents. As analyzed in [4], the total area of all STs can be reduced in the presence of such discharge current balancing. This balance also increases the capacitance of the virtual ground rail (VGND), which reduces noise bouncing on VGND. Therefore, it is useful to have a gate assigned to more than one ST. Furthermore, the relaxed constraints of the SCP can result in a better optimization solution because of the larger solution space. As for computation time, the CPLEX solver spends less time searching for feasible solutions of the SCP than of the SPP, due to the relaxed constraints.

Experimental Results of the SCP Technique

Table II compares the results produced by the SCP and SPP techniques in terms of total sleep transistor area, leakage power savings, the reduction in CPU time, and the reduction in the cost function for each benchmark. Compared to the SPP technique, the SCP technique achieves reductions in the cost function (due to the larger solution space) and in leakage power, because the number of STs has been reduced (due to the discharge current sharing). However, the main advantage arises from the large reduction in CPU time for the SCP technique as compared to the SPP proposed in [1].
TABLE II
COMPARISON BETWEEN SPP AND SCP TECHNIQUES

Benchmark   Total width of STs (SPP) [1]   Total width of STs (SCP)   Leakage savings   CPU time savings   Cost reduction
4b CLA      6.6 µm                         6.6 µm                     0%                88%                5%
32b PC      7.7 µm                         6.6 µm                     14.3%             73%                17%
6b MP       11.88 µm                       11.88 µm                   0%                75%                5%
4b ALU      14.5 µm                        14.5 µm                    0%                99.8%              18%
C499        26.4 µm                        23.8 µm                    9.9%              92%                17%
C432        39.6 µm                        36.9 µm                    6.8%              94%                12%
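The difference between the partitioning (exactly once) and covering (at least once) constraints can be illustrated on a toy instance. The sketch below is only an illustration: exhaustive enumeration stands in for the ILP solver (CPLEX) used in the paper, and the function names, the candidate clusters, and the costs are hypothetical. The cost helper assumes the form described in the text, namely a weighted sum of the rectilinear distances between gate pairs (each unordered pair counted once) and the unused ST capacity, with one scalar current per gate.

```python
from itertools import combinations


def cluster_cost(gates, position, current, i_st, w_d=1.0, w_c=1.0):
    """Weighted sum of rectilinear spread and unused ST capacity of a cluster."""
    distance = sum(
        abs(position[a][0] - position[b][0]) + abs(position[a][1] - position[b][1])
        for a, b in combinations(sorted(gates), 2)
    )
    slack = i_st - sum(current[g] for g in gates)
    return w_d * distance + w_c * slack


def solve_cover(clusters, costs, gates, exact):
    """Cheapest selection of clusters covering every gate.

    exact=True  : each gate covered exactly once (set partitioning, SPP)
    exact=False : each gate covered at least once (set covering, SCP)
    Returns (total_cost, selected_indices) or None if infeasible.
    """
    best = None
    for r in range(1, len(clusters) + 1):
        for subset in combinations(range(len(clusters)), r):
            hits = {g: sum(g in clusters[j] for j in subset) for g in gates}
            feasible = all((n == 1) if exact else (n >= 1) for n in hits.values())
            if feasible:
                z = sum(costs[j] for j in subset)
                if best is None or z < best[0]:
                    best = (z, subset)
    return best


gates = ["a", "b", "c", "d"]
clusters = [{"a", "b"}, {"c", "d"}, {"a", "b", "c"}]   # candidate clusters
costs = [4, 4, 3]                                       # illustrative costs
spp = solve_cover(clusters, costs, gates, exact=True)
scp = solve_cover(clusters, costs, gates, exact=False)
print(spp, scp)
```

Here the covering solution selects clusters 1 and 2 at a total cost of 7, covering gate c twice, while the partitioning solution is forced to clusters 0 and 1 at a cost of 8. This mirrors the larger solution space that the relaxed constraints open up.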

MTCMOS DESIGN ENVIRONMENT
1. Read gate-level netlist.
2. While updating the accumulative delay is not done
     For all gates G_i in circuit
       If the delay of G_i is not updated
         If all the fan-in gates (G_1, G_2, ...) are updated
           update the delay of G_i
         End If
       End If
     End For
   End While
3. Build the vector for each gate.
4. Solve the ST sizing problem using the FF or SCP technique.
5. Insert STs and sleep control signal into netlist.
6. Export the new gate-level netlist.

Fig. 6. MTCMOS automatic design environment.
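Step 2 of Fig. 6 can be sketched as a repeated sweep that updates a gate once all of its fan-in gates have been updated. The sketch assumes that the earliest accumulative delay of a gate is its own earliest delay plus the minimum over its fan-in gates, that the latest is its own latest delay plus the maximum, and that gates driven only by primary inputs keep their database delays. The function name and the three-gate netlist are hypothetical.

```python
def accumulate_delays(fanin, d_min, d_max):
    """Propagate earliest and latest delays through a gate-level netlist.

    fanin : dict mapping each gate to the list of gates driving its inputs
            (empty list for gates driven only by primary inputs)
    d_min, d_max : per-gate earliest and latest delay from the database
    Returns two dicts with the accumulated earliest and latest delays.
    """
    t_min, t_max = {}, {}
    pending = set(fanin)
    while pending:                                   # "While ... not done"
        ready = [g for g in pending if all(d in t_min for d in fanin[g])]
        if not ready:
            raise ValueError("combinational loop or undefined driver in netlist")
        for g in ready:                              # all fan-ins are updated
            drivers = fanin[g]
            t_min[g] = d_min[g] + (min(t_min[d] for d in drivers) if drivers else 0)
            t_max[g] = d_max[g] + (max(t_max[d] for d in drivers) if drivers else 0)
            pending.discard(g)
    return t_min, t_max


# Hypothetical netlist: g3 is driven by g1 and g2 (delays in ps).
fanin = {"g1": [], "g2": [], "g3": ["g1", "g2"]}
d_min = {"g1": 10, "g2": 20, "g3": 5}
d_max = {"g1": 15, "g2": 30, "g3": 8}
t_min, t_max = accumulate_delays(fanin, d_min, d_max)
print(t_min["g3"], t_max["g3"])
```

For g3 the earliest accumulated delay is 5 + min(10, 20) = 15 ps and the latest is 8 + max(15, 30) = 38 ps, which fixes where its trapezoid starts and ends on the time axis.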

V. MTCMOS DESIGN FLOW

The Canadian Microelectronics Corporation (CMC) digital ASIC design flow integrated with the developed MTCMOS design environment is shown in Fig. 5.

Fig. 5. MTCMOS design flow. CMC design flow stages: synthesis, floorplan, P&R, LVS & DRC; tools: Design Compiler, First Encounter, Qplace, Wroute, Df II. MTCMOS environment: vector modelling, ST sizing, ST insertion.

As mentioned in Section III, after synthesizing the RTL code, the gate-level netlist is imported into the MTCMOS design environment. Fig. 6 shows the pseudo-code of the MTCMOS design environment, which is implemented in C. The circuit structure is extracted and vectors are then generated for each gate in the circuit based on the discharge current database built in advance. In addition, sleep transistors with different W/L ratios (i.e., different $I_{ST}$) are developed using Eq. (3). The ST sizing problem is solved by using the FF or SCP technique. The optimally sized STs, combined with the sleep control signal, are then inserted into the gate-level netlist. Finally, the new netlist is exported to the Cadence physical design environment.

Fig. 7 shows an example of the ST layout implementation. A cavity is inserted where the STs are located. Although this layout style incurs routing complexity because of the additional virtual ground lines (VGND), the conventional placement and routing methodology can be used with minimal modification using commercially available tools (i.e., Qplace and Wroute).

Fig. 7. A layout example with placed sleep transistors (standard-cell rows between VDD and GND rails, with VGND lines and the sleep transistors placed in a cavity).

VI. CONCLUSIONS

Two techniques are proposed to size the sleep transistor in MTCMOS circuits. The First-Fit and the Set-Covering techniques achieve the same or better leakage power savings compared to the literature. In addition, the two techniques reduce computation time by 76% and 86% on average, respectively, compared to the literature. Finally, an automatic vector generation engine and an MTCMOS design environment are developed and integrated into the CMC digital ASIC design flow.

REFERENCES

[1] M. Anis, S. Areibi, and M. Elmasry. Design and Optimization of Multi-Threshold CMOS (MTCMOS) Circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22(10):1324-1342, October 2003.
[2] M. Anis, S. Areibi, M. Mahmoud, and M. Elmasry. Dynamic and Leakage Power Reduction in MTCMOS Circuits Using an Automated Efficient Gate Clustering. In Proceedings of the 39th Design Automation Conference, pages 480-485, 2002.
[3] J. Kao, S. Narendra, and A. Chandrakasan. MTCMOS Hierarchical Sizing Based on Mutual Exclusive Discharge Patterns. In Proceedings of the 35th Design Automation Conference, pages 495-500, 1998.
[4] C. Long and L. He. Distributed Sleep Transistor Network for Power Reduction. In Proceedings of the 40th Design Automation Conference, pages 181-186, 2003.
[5] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada. 1-V Power Supply High-Speed Digital Circuit Technology with Multi-Threshold Voltage CMOS. IEEE Journal of Solid-State Circuits, 30(8):847-853, August 1995.
[6] S. Mutoh, S. Shigematsu, W. Gotoh, and S. Konaka. Design Method of MTCMOS Power Switch for Low-Voltage High-Speed LSIs. In Proceedings of Asia and South Pacific Design Automation Conference, pages 113-116, January 1999.
[7] D. Sylvester and C. Hu. Analytical Modeling and Characterization of Deep-Submicron Interconnect. Proceedings of the IEEE, 89(5):634-664, May 2001.