
Low Power Clock Gates Optimization For Clock Tree Distribution

Teng, Siong Kiong¹, Dr Norhayati Soin²

¹Penang Design Center, Intel Microelectronics, Penang, Malaysia
²Dept. of Electrical Engineering, University of Malaya, Kuala Lumpur, Malaysia
¹siong.kiong.teng@intel.com, ²norhayatisoin@um.edu.my

Abstract
Clock gating has become one of the major dynamic power saving approaches in today's low power digital circuit design. In this paper, we present a new physical clock gate optimization technique using splitting and merging algorithms that works on both single level and multiple level clock gating designs. The algorithm is built on top of the standard EDA flow by running two passes of clock tree synthesis. The first pass obtains the clock buffer locations for clock gate swapping, and the second pass builds the clock tree based on the optimum clock gate locations. The merging algorithm is then used to improve the overall clock tree power. Results on an industrial design show an improvement in overall clock tree power using the aforementioned algorithm.

Keywords
Clock gating, low power, clock tree synthesis

1. Introduction
In today's advanced VLSI design environment, designers face the challenge of converging not only timing, to meet the performance target of the digital circuit, but also power. In recent years, low power initiatives have gained focus, with many researchers paying attention to low power design optimizations. This has prompted a shift in digital design toward joint optimization of both power and performance.

Recent research on low power VLSI design techniques has established various innovations such as clock gating, multi-threshold voltage transistors, multi-supply voltage, dynamic voltage and frequency scaling, and power shut-off [1,2]. Clock tree synthesis (CTS) is the process of distributing the clock signal from the PLL to all the synchronous components within a design. Clock gating is widely used in clock distribution as a method to reduce clock network power dissipation [7-12]. The clock gating components are part of the clock tree distribution components during the CTS process.

The clock root gating algorithm, which merges clock gates with different enable functions, was implemented by [3]. However, in the proposal of [3], the initial clock gate placement is not optimized, so merging the clock gates might not obtain the optimum clock gate structure.

In [4], the clock gates are split based on the total fanout of each individual clock gate. This still does not explore clock gating power efficiency in the physical domain. [4] also does not mention a solution to the multiple level clock gating problem. [5] presented a gated clock planning algorithm applied during the clock placement steps. The solution was based on the Capo placer [6] and is limited when we try to optimize the clock tree in an existing design using a standard EDA tool.

1.1 Clock Gate Placement Problem
Clock gating components are inserted on the clock tree to shut off part of the clock tree when it is idle. There are two possible clock gate operations that affect the clock tree power consumption. These two operations, dynamic and idle, are referred to as clock gate ON (CGON) and clock gate OFF (CGOFF) respectively. The dynamic operation is when all clock gates are enabled so that all registers latch actively, while the idle operation is when all clock gates are disabled to achieve gating.

In an ASIC or SoC design, there are gated and ungated clocks. Logic that can be shut off uses gated clocks, while logic that cannot be shut off uses ungated clocks. The CGON power is higher than the CGOFF power. However, in recent industrial practice and research on clock gating design, the CGON power is typically not optimized.

Figure 1 and Figure 2 show the clock tree topology with clock gates before and after CTS. In Figure 1, the clock gate CG_A is placed very close to the clock root and the clock gate CG_C is placed relatively close to the loads. The clock buffers illustrate the last fanout buffers that drive the sequential elements. They correspond to the post-CTS last level buffer placements.

Figure 1: Clock Tree Relative Placement before CTS

Using the standard EDA tool CTS flow, the clock tree is built and yields the clock tree topology shown in Figure 2. The gated and ungated clock trees essentially overlap each other physically, resulting in higher CGON power

978-1-4244-6455-5/10/$26.00 ©2010 IEEE 488 11th Int’l Symposium on Quality Electronic Design
during full operations. We will address this problem using the clock gate synthesis algorithm during CTS.

Figure 2: Clock Tree Placement after CTS

2. Two Passes Low Power Clock Gate Synthesis Flow

[Figure 3 flowchart: RTL codes → Logic Synthesis Flow → Placement and Logic Optimization Flow → First Pass Clock Tree Synthesis Flow → Low Power Multiple Level Clock Gates Splitting Algorithm → Second Pass Clock Tree Synthesis Flow → Low Power Multiple Level Clock Gates Merging Algorithm → Clock Tree Power → Achieved Lower Power? (No / Yes) → End]

Figure 3: Two Passes Low Power Clock Tree Synthesis Flow

In this section, we introduce the two passes low power clock tree synthesis flow shown in Figure 3. This flow has been proven to work on a standard industrial design using an electronic design automation (EDA) tool. The low power clock gate synthesis algorithm is integrated into the clock tree synthesis flow. The algorithm is implemented in the TCL language, which can be easily integrated into the EDA tool.

Starting from RTL coding, the normal SoC design flow runs through logic synthesis to translate the SystemVerilog or VHDL behavioral model into a gate level netlist. The gate level netlist is optimized against timing and area constraints to achieve the required design target. After the optimized netlist is generated, it is placed on the floorplan for physical optimization.

Once the physical placement of the design is completed, clock tree synthesis is run to synthesize the clock tree network that distributes the clock signal across the floorplan, aiming for a low skew, low power clocking design. To obtain the optimum clock gate design, we implement a two passes clock tree synthesis flow. During the first pass, the traditional clock tree synthesis methodology is used. The clock tree is built using the default design constraints and yields the standard clock tree structure, with the clock gates placed at their initial locations based on the pre-CTS database.

We then apply our split clock gate algorithm to split the clock gates to the locations nearest to their loads. The output of the algorithm is an engineering change order (ECO) script in TCL format that can be applied to the pre-CTS database. As a low power design normally has a lot of aggressive clock gating, there may be multiple levels of clock gating along the clock path. The flow is able to split each level of clock gate accordingly.

The TCL scripts are then applied to the pre-CTS database, giving a new pre-CTS database for the second iteration of clock tree synthesis. With the newly created clock gate locations, the design once again runs through the clock tree synthesis flow and optimization to produce the final optimized clock tree design.

3. Multiple Levels Clock Gate Splitting
In this section, the multiple levels clock gate splitting algorithm is discussed. We apply the clock gate splitting flow to the post-CTS database to find the optimum locations of the clock gates. The newly created clock gates and their locations are written out as an ECO in TCL format. This ECO TCL file is applied to the pre-CTS database for the second pass of clock tree synthesis.

The multiple level clock gate splitting algorithm can be described as below.
Input: a digital circuit design.
Output: ECO changes in TCL format.
I. Clock Tree Tracing Algorithm: For each clock source, all clock gates are traced for connectivity and the level of each clock gate is stored.
II. Clock Gate Reverse Splitting Algorithm: For the clock source, starting from the last level (n) of the clock gates
  a. For each clock gate
    i. Split the clock gate to its last level buffer locations.
    ii. Find the fanout loads of all the last level buffers.
    iii. Assign the fanout loads to the new clock gate's hash.

Teng, Low Power Clock Gates Optimization for Clock…


    iv. Find the fanin source of the original clock gate.
    v. Assign the fanin root driver to the new split clock gate's hash.
    vi. Repeat for the next clock gate.
  b. Repeat for the next level (n-1) of clock gates until level 1 is reached.
III. Generate the netlist connectivity changes and create the split clock gate ECO changes.
IV. Done.
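As an illustration only, the splitting steps can be sketched in Python. The paper's implementation is written in TCL inside the EDA tool and is not shown, so everything below is an assumption made for the sketch: the `kind`, `driver` and `loc` dictionaries, the `_S<k>` naming, the rule that a nested clock gate counts as a split location for its parent, and the simplification that sequential elements hang only off last level buffers. The example data loosely mirrors the CG_A / CG_C topology of Figure 1.

```python
from collections import defaultdict


def split_clock_gates(kind, driver, loc):
    """Return hypothetical ECO records {new_gate: {"loc", "loads", "fanin"}}."""
    fanout = defaultdict(list)
    for cell, drv in driver.items():
        fanout[drv].append(cell)

    def source_of(cell):
        # Nearest upstream clock gate or clock root, skipping plain buffers.
        d = driver[cell]
        while kind[d] == "buf":
            d = driver[d]
        return d

    def level(gate):
        # Level 1 is the gate closest to the root; nested gates count upward.
        n, d = 1, source_of(gate)
        while kind[d] == "gate":
            n, d = n + 1, source_of(d)
        return n

    def endpoints(cell):
        # Last level buffers (driving only flops) or nested gates below cell.
        found = []
        for child in fanout[cell]:
            if kind[child] == "gate":
                found.append(child)
            elif kind[child] == "buf":
                if all(kind[c] == "flop" for c in fanout[child]):
                    found.append(child)
                else:
                    found.extend(endpoints(child))
        return found

    gates = [c for c in kind if kind[c] == "gate"]
    eco, copies = {}, {}
    # Step II: start from the last level (n) and work back to level 1.
    for g in sorted(gates, key=level, reverse=True):
        copies[g] = []
        for k, ep in enumerate(endpoints(g), start=1):
            name = f"{g}_S{k}"
            if kind[ep] == "gate":
                # Nested gate: the new copy drives that gate's split copies.
                loads = list(copies[ep]) or [ep]
                for c in copies[ep]:
                    eco[c]["fanin"] = name
            else:
                loads = list(fanout[ep])
            eco[name] = {"loc": loc[ep], "loads": loads, "fanin": source_of(g)}
            copies[g].append(name)
    return eco


# Example data (assumed): CLK -> CG_A -> {CG_C, Buf_D3, Buf_D4}, CG_C -> {Buf_D1, Buf_D2}
kind = {"CLK": "root", "CG_A": "gate", "CG_C": "gate",
        "Buf_D1": "buf", "Buf_D2": "buf", "Buf_D3": "buf", "Buf_D4": "buf",
        "F1": "flop", "F2": "flop", "F3": "flop", "F4": "flop"}
driver = {"CG_A": "CLK", "CG_C": "CG_A", "Buf_D3": "CG_A", "Buf_D4": "CG_A",
          "Buf_D1": "CG_C", "Buf_D2": "CG_C",
          "F1": "Buf_D1", "F2": "Buf_D2", "F3": "Buf_D3", "F4": "Buf_D4"}
loc = {"CLK": (0, 0), "CG_A": (1, 0), "CG_C": (5, 2),
       "Buf_D1": (8, 3), "Buf_D2": (8, 1), "Buf_D3": (8, -1), "Buf_D4": (8, -3)}

eco = split_clock_gates(kind, driver, loc)
for name, record in eco.items():
    print(name, record)
```

With this input the sketch produces three copies of CG_A and two copies of CG_C, matching the counts described for Figure 4. Turning the records into tool-specific TCL ECO commands is left out, since those commands depend on the EDA tool.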

The output of the flow is the set of changes that will be applied to the pre-CTS database for the second pass CTS flow. Figure 4 shows the outcome of the design after applying the ECO changes to the pre-CTS design. Compared to Figure 1, the clock gate CG_A has been split into 3 copies, namely CG_A_S1, CG_A_S2 and CG_A_S3. The split clock gates are placed using the clock buffer or clock gate placements shown in Figure 2. CG_A_S1 occupies the placement of CG_C. The clock gates CG_A_S2 and CG_A_S3 occupy the Buf_D3 and Buf_D4 placements. The second level clock gate CG_C is also split into two copies. CG_C_S1 and CG_C_S2 occupy the Buf_D1 and Buf_D2 locations respectively.

Figure 4: Initial Clock Gate Placement after Applying Multiple Level Split Clock Gate ECO changes.

The Figure 4 design is used for the second pass of CTS optimizations. The final clock tree topology after CTS is shown in Figure 5. Compared to Figure 2, the total number of clock buffers needed to construct the clock tree topology is lower after using the split clock gate algorithm. The total number of clock buffers and clock gates in Figure 2 is 17, while the total in Figure 5 is only 14. Besides the saving in clock components, the overall wire routing is also reduced because there is less overlap between gated and ungated clocks. The theoretical analysis shows that our algorithm is able to reduce the total clock tree power during CGON operation.

Figure 5: Clock Tree Topology after CTS

4. Multiple Levels Clock Gate Merging
In section 3, we showed that the multiple levels clock gate splitting algorithm is able to reduce the CGON dynamic power. However, during CGOFF operation, the clock tree actually burns more power because the clock gates now sit at a lower level of the clock tree. The clock tree buffers from the clock sources to the inputs of the clock gates are still toggling and consume power during CGOFF operation. To solve this problem, we use the clock gate merging and relocating algorithm to merge clock gates with the same function. After merging the clock gates, we move the merged clock gate closer to the clock source to improve the CGOFF power consumption.

As opposed to the method of [3], our method is built on the existing clock tree topology. We do not optimize the clock gates by ORing the enables of different clock gates, because this would cause formal verification failures and we would need to reflect the changes back into the RTL to ensure functional correctness.

The clock gate merging and relocating algorithm can be illustrated as below.
Input: a post-CTS digital circuit design.
Output: a digital circuit design with optimized clock gates.
I. Clock Tree Merging and Relocating Algorithm: For each clock source, all clock gates are traced for connectivity and the level of each clock gate is stored.
II. For the clock source, starting from the last level (n) of the clock gates,
  a. For each clock gate
    i. Find the driver of the clock gate.
    ii. If clock gates have the same driver, merge the clock gates.
    iii. If the clock gate has one driver, relocate the clock gate to the driver location.
    iv. Apply the ECO changes to the database.
    v. Repeat for the next clock gate.
  b. Repeat for the next level (n-1) of clock gates until level 1 is reached.
III. Done.
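The merging and relocating steps can be sketched in Python as well. This is a minimal sketch under stated assumptions, not the paper's TCL implementation: the `kind`, `driver`, `loc` and `enable` dictionaries are hypothetical, "same functional clock gates" is read as "same driver and same enable signal", the merged gate keeps the first member's name instead of receiving a new one, and relocation is modeled as swapping the gate with a driver buffer that feeds nothing else, stopping at the point of divergence. What happens to the displaced buffer in the real flow is not stated in the text, so the swap is purely an assumption of this sketch.

```python
from collections import defaultdict


def merge_and_relocate(kind, driver, loc, enable):
    """Return updated copies of (kind, driver, loc) after merge and relocate."""
    kind, driver, loc = dict(kind), dict(driver), dict(loc)

    def fanout(cell):
        return [c for c, d in driver.items() if d == cell]

    def level(gate):
        # 1 + number of clock gates between this gate and the clock root.
        n, d = 1, driver[gate]
        while kind[d] != "root":
            if kind[d] == "gate":
                n += 1
            d = driver[d]
        return n

    by_level = defaultdict(list)
    for g in [c for c in kind if kind[c] == "gate"]:
        by_level[level(g)].append(g)

    for lv in sorted(by_level, reverse=True):  # last level (n) down to level 1
        groups = defaultdict(list)
        for g in by_level[lv]:
            groups[(driver[g], enable[g])].append(g)
        survivors = []
        for members in groups.values():
            keep = members[0]
            survivors.append(keep)
            for other in members[1:]:  # same driver and enable: merge into keep
                for load in fanout(other):
                    driver[load] = keep
                for table in (kind, driver, loc):
                    del table[other]
        for g in survivors:
            # Climb while the driver is a buffer that feeds only this gate.
            while kind[driver[g]] == "buf" and fanout(driver[g]) == [g]:
                buf, loads = driver[g], fanout(g)
                driver[g], driver[buf] = driver[buf], g
                for load in loads:
                    driver[load] = buf
                loc[g], loc[buf] = loc[buf], loc[g]
    return kind, driver, loc


# Example data (assumed): Buf_1 diverges into a gated and an ungated branch.
kind = {"CLK": "root", "Buf_1": "buf", "Buf_2": "buf", "Buf_3": "buf",
        "Buf_U": "buf", "CG_A_S1": "gate", "CG_A_S2": "gate",
        "F1": "flop", "F2": "flop", "F3": "flop"}
driver = {"Buf_1": "CLK", "Buf_2": "Buf_1", "Buf_U": "Buf_1", "Buf_3": "Buf_2",
          "CG_A_S1": "Buf_3", "CG_A_S2": "Buf_3",
          "F1": "CG_A_S1", "F2": "CG_A_S2", "F3": "Buf_U"}
loc = {"CLK": (0, 0), "Buf_1": (1, 0), "Buf_2": (3, 1), "Buf_3": (5, 1),
       "Buf_U": (3, -1), "CG_A_S1": (7, 2), "CG_A_S2": (7, 0)}
enable = {"CG_A_S1": "en_a", "CG_A_S2": "en_a"}

new_kind, new_driver, new_loc = merge_and_relocate(kind, driver, loc, enable)
print(new_driver["CG_A_S1"], new_loc["CG_A_S1"])
```

In this example the two split gates share a driver and an enable, so they collapse into one gate, which then climbs past Buf_3 and Buf_2 and stops below Buf_1, where the gated and ungated branches diverge.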

Figure 6 shows the clock tree structure after the clock gate splitting and CTS flow. The clock gates CG_A_S1 and CG_A_S2 can be merged to reduce the total dynamic power during CGOFF operation. The algorithm merges CG_A_S1 and CG_A_S2 to become CG_A_M12. The CG_A_M12 clock gate then goes through the relocating algorithm to move upward, closer to the clock source. When the clock gate hits the point of divergence, it stops and occupies the final location of the clock buffer. The outcome of both the clock gate merging and the clock gate relocating algorithms is shown in Figure 7.

Figure 6: Clock Gates Placement after CTS that can be merged Using Clock Gate Merging Algorithm

Figure 7: Clock Tree Topology after Clock Gate Merging

5. Results
The proposed algorithm has been tested on an industrial standard VLSI digital design. The experiment was run on a design that consists of 1 million instances and 23K flops. The results are compared between the conventional CTS flow and the new low power clock gate synthesis flow. For each flow, the total number of clock buffers used and the clock tree performance, in terms of clock skew, clock buffer area and total clock power, are analyzed.

Figure 8 shows the total number of clock buffers and clock gates used in the design with the conventional CTS flow, the clock gate splitting flow and the clock gate merging flow. After the clock gate splitting flow, the total number of clock gates increased while the total number of clock buffers decreased. The overall clock buffer count also decreased. We then took the same split database and ran it through the clock gate merging flow. The total number of clock gates decreased while the total number of clock buffers increased. The overall clock buffer count remained the same compared to the split database. The results clearly show that the proposed flow is able to reduce the overall clock buffer count in the design.

Figure 8: Number of Clock Tree Buffer Using Multiple Levels Clock Gate Splitting and Merging Flow

Figure 9 shows the clock tree performance comparison between the conventional flow, the clock gate splitting flow and the clock gate merging flow. The clock tree performance metrics are clock skew, total clock buffer area, CGON dynamic power, CGOFF dynamic power and overall leakage power.

Figure 9: Clock Tree Performance on Skew, Area and Power Using Multiple Levels Clock Gate Splitting and Merging Flow

The graph shows that both clock skew and total buffer area are reduced with the clock gate splitting and merging flow. A significant dynamic power reduction is observed during clock gate ON operation. Meanwhile, an increase in total clock dynamic power is noticed for clock gate OFF operation. The reduction in clock gate ON operation is clearly attributed to the flow avoiding long overlapping interconnect routing. However, the increase in dynamic power under the clock gate OFF condition is ascribed to the increase in pre-clock gate buffer stages, brought about by splitting the clock gates to the lower



level of buffer trees. This problem is addressed by the clock gate merging flow, which effectively reduces the clock gate power during CGOFF stages by moving the clock gate to a higher level of the clock tree buffers, resulting in better CGOFF dynamic power savings. The leakage power of the overall clock tree is also observed to remain constant, with minimal changes.

6. Discussion
The clock gate merging flow has some limitations in clock gate control timing optimization. The fanout load of the clock gate's control signal increases when a clock gate is split. This creates a potential setup timing violation on the clock gate enable signal. However, the clock skew between the clock gate and the flop that generates the control signal decreases when we split the clock gate to a lower level of the clock tree hierarchy. This improves the setup requirement. Therefore, careful monitoring is needed on the timing of the clock gate control signals.

7. Summary
In this paper, we present a new design flow for gated clock tree synthesis that is able to effectively reduce total dynamic power. We first split and duplicate the clock gates to minimize clock tree overlap between gated and ungated clocks. Then we create the new clock gates with optimized locations on the pre-CTS design and rerun the second pass of clock tree synthesis. After that, we trigger the clock gate re-merging algorithm to optimize the gated clock idle stage power. The low power clock gate splitting and merging algorithms are able to effectively reduce the clock tree dynamic power in full operation mode and idle mode. The paper also shows that with this approach, the overall number of clock tree buffers is reduced, with less area. Therefore, the proposed algorithm is shown to be able to reduce the gated clock tree power.

8. References
[1] Low-Power Methodology Manual for System-on-Chip Design. Springer, ISBN 978-0-387-71818-7.
[2] A Practical Guide to Low-Power Design. powerforward.org. http://www.powerforward.org/DesignGuide.aspx/
[3] Qi Wang and Sumit Roy, "Power Minimization by Clock Root Gating", Proceedings of the ASP-DAC 2003, Asia and South Pacific Design Automation Conference, 21-24 Jan. 2003, pp. 249-254.
[4] "Clock Gating Methodology for Power and CTS QoR", Synopsys Inc., SNUG San Jose 2006. https://solvnet.synopsys.com/retrieve/017127.html
[5] Weixiang Shen, Yici Cai, Xianlong Hong, Jiang Hu, "Gate Planning During Placement for Gated Clock Network", IEEE International Conference on Computer Design (ICCD 2008), 2008, pp. 128-133.
[6] http://vlsicad.eecs.umich.edu/BK/PDtools/
[7] R.-S. Tsay, "An Exact Zero Skew Clock Routing Algorithm", IEEE Transactions on CAD/ICAS, Vol. 12, No. 2, pp. 242-249, February 1993.
[8] Jaewon Oh and Massoud Pedram, "Gated Clock Routing for Low-Power Microprocessor Design", IEEE Trans. Computer-Aided Des. Integr. Circ. Syst., June 2001, pp. 715-722.
[9] Jaewon Oh and Massoud Pedram, "Gated Clock Routing Minimizing the Switched Capacitance", Proc. of IEEE/ACM Design Automation and Test in Europe, 1998, pp. 692-697.
[10] Jaewon Oh and Massoud Pedram, "Power reduction in microprocessor chips by gated clock routing", in Proc. ASP-DAC, pp. 313-318, 1998.
[11] Monica Donno, Enrico Macii, Luca Mazzoni, "Power-aware clock tree planning", Proc. of the 2004 International Symposium on Physical Design, 2004.
[12] Monica Donno, Enrico Macii et al., "Clock-tree power optimization based on RTL clock-gating", Proc. of the 40th Conference on Design Automation, 2003.
