Monica Donno
BullDAST s.r.l.
R&D Division
10121 Torino, ITALY
monica.donno@bulldast.com
Enrico Macii
Politecnico di Torino
Dip. di Automatica e Informatica
10129 Torino, ITALY
enrico.macii@polito.it
Luca Mazzoni
Accent s.r.l.
R&D
20059 Vimercate, ITALY
luca.mazzoni@accent.it
ABSTRACT
Modern processors and SoCs require the adoption of power-
oriented design styles, due to the implications that power
consumption may have on reliability, cost and manufactura-
bility of integrated circuits featuring nanometric technolo-
gies. And the power problem is further exacerbated by the
increasing demand of devices for mobile, battery-operated
systems, for which reduced power dissipation is mandatory.
A large fraction of the power consumed by a synchronous cir-
cuit is due to the clock distribution network. This is for two
reasons: First, the clock nets are long and heavily loaded.
Second, they are subject to a high switching activity.
The problem of automatically synthesizing a power-efficient
clock tree has been addressed recently in a few research con-
tributions. In this paper, we introduce a methodology in
which low-power clock trees are obtained through aggres-
sive exploitation of the clock-gating technology. Distinguish-
ing features of the methodology are: (i) The capability of
calculating powerful clock-gating conditions that go beyond
the simple topological search of the RTL source code. (ii)
The capability of determining the clock tree logical struc-
ture starting from an RTL description. (iii) The capability
of including in the cost function that drives the generation
of the clock tree structure both functional (i.e., clock activation
conditions) and physical (i.e., floorplanning) information.
(iv) The capability of generating a clock tree structure
that can be synthesized and routed using standard,
commercially-available back-end tools.
We illustrate the methodology for power-aware RTL clock
tree planning, provide details on the fundamental algorithms
that support it, and explain how such a methodology can be
integrated into an industrial design flow. The results achieved
on several benchmarks, as well as on a real design case,
demonstrate the feasibility and the potential of the proposed
approach.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ISPD'04, April 18-21, 2004, Phoenix, Arizona, USA.
Copyright 2004 ACM 1-58113-817-2/04/0004 ...$5.00.
Categories and Subject Descriptors
B.5 [Hardware]: Register-Transfer-Level Implementation;
B.6 [Hardware]: Logic Design; B.7 [Hardware]: Inte-
grated Circuits
General Terms
Digital design
Keywords
Low-power design, physical design and optimization, clock
tree synthesis and routing
1. INTRODUCTION
The clock distribution network is responsible for an increas-
ing fraction of the dynamic power consumed by modern pro-
cessors and SoCs [1]. For example, Figure 1 shows the break-
down of power consumption for a recent high-performance
microprocessor [2].
Figure 1: Processor Power Breakdown (Clock 40%, Memory 20%,
Control 15%, Datapath 15%, I/O pads 10%).
This result is common to many real designs: For the DEC
Alpha 21164, 40% of the chip power (i.e., around 20W out of
50W) is consumed by the clock distribution network when
the processor runs at its maximum speed [3]. Similarly, in
the Motorola MCORE micro-RISC processor, the clock trees
account for 36% of the total processor power [4]. And the
picture is predicted to get worse as the complexity and the
operating frequency of the circuits keep growing as a result
of technology scaling [5].
Designing the clock tree has thus become critical not only
for performance, but also for power, and the development of
new modeling capabilities [6] and synthesis techniques that
help in controlling the clock tree power effectively is one of
the challenges that EDA engineers currently have to face.
Different solutions for minimizing the power consumed by
the clock tree have been investigated in the recent past. In
this paper, we focus our attention on an approach (named
LPClock in the sequel) that relies on clock-gating, a
well-established concept for power optimization at the gate and
RT levels. The basis for the LPClock methodology can be
found in [7]. That work introduced new algorithms for
gated-clock tree construction that are specifically geared
towards integration with existing design flows, both in the
front-end (i.e., extraction and manipulation of RTL and
logic-level clock activation functions) and in the back-end
(i.e., interfacing to industry-strength clock tree synthesis
tools). We show how such algorithms can be combined
with innovative techniques for detecting clock-gating
conditions [8] that go beyond the pure topological analysis of
the RTL source code, to generate a power-efficient clock tree
structure. We provide and discuss validation data obtained
on a set of benchmark circuits, as well as on an industrial
design.
We emphasize that the objective of the LPClock methodol-
ogy is not that of replacing the back-end step of clock tree
synthesis and routing; instead, the goal is that of generating
a set of design constraints early enough in the design process
(i.e., planning the clock tree structure at the RTL) that can
then be exploited by traditional physical design tools during
clock tree routing.
The rest of this paper is organized as follows. In Section 2
we briefly review previous work on clock tree power
minimization, discussing techniques ranging from buffer
insertion to adoption of multiple supply voltages, from
reduced-swing clock signalling to different solutions for clock-gating.
Section 3 provides an overview of the LPClock methodology,
including the details on how clock-gating conditions are
extracted based on the concept of observability don't care
(ODC) and on the algorithms for planning the clock tree
structure. Extensions to the methodology for handling
non-hierarchical (i.e., flat) designs are also sketched. Section 4
discusses tool flow issues, thus addressing the problem of
embedding and integrating LPClock into an industrial
design framework. In Section 5 we report on the experimental
results we have obtained on some meaningful design
examples. Finally, Section 6 concludes the manuscript with some
final remarks.
2. PREVIOUS WORK
The problem of synthesizing low-power clock distribution
networks has been addressed recently and from different
angles. Initially, attention focused on techniques
based on power-driven buffer insertion. In [9], buffers are
added to the clock tree and sized as a post-processing
operation, when the tree structure is already determined.
Improved methods for buffer sizing [10] and simultaneous buffer
and clock wire sizing [11] that target a minimum-power clock
tree implementation were proposed later,
while Vittal and Marek-Sadowska made a step forward in
this domain by introducing a heuristic algorithm that
performs concurrent design of the clock tree topology and buffer
insertion [12].
A different approach to the problem of designing a
minimum-power clock tree network was taken by Igarashi et al.; in [13],
they proposed the use of multiple supply voltages to reduce
clock tree power. The incoming, high-voltage clock signal
is down-scaled by means of a low-voltage buffer stage. The
low-Vdd signal is then propagated throughout the circuit,
and regenerating elements (e.g., buffers) are inserted into
the tree structure to ensure the appropriate speed and slew
rate of the transitions. Finally, the original high voltage is
restored through level-shifters before the clock signals feed
the flip-flops.
Although the method of [13] did target minimization of the
power consumed by the clock network, it did not factor into
the power balance the cost of buffering and voltage
converters. The approach presented by Pangjung and Sapatnekar
in [14] addresses this limitation by providing a more
sophisticated algorithm for introducing buffers into the clock tree
and for placing the low-to-high voltage shifters, which are
now not necessarily located right in front of the flip-flops.
The algorithm is a modification of the Deferred-Merge
Embedding (DME) method [15, 16, 17] that considers the
possibility of buffer insertion after every step of bottom-up
sub-tree merging. In the interest of keeping the skew very close
to zero, the algorithm guarantees that the number of
regenerating elements is equalized along every root-to-sink path
of the tree. However, in spite of the solid theoretical
basis of this solution, experimental results showed very small
differences with the clock trees generated by the original
approach by Igarashi et al., which confirms the quality of the
greedy approach of [13].
An alternative to multi-voltage clock distribution networks,
proposed first by Zhang and Rabaey in [18], is based on the
idea of adopting a reduced-swing clock signalling scheme.
The paper provides general guidelines for the design of driver
and receiver circuits with reduced voltage swing, while [14]
focuses on intermediate driver circuits, whose usage is
suggested instead of traditional buffers and repeaters for
guaranteeing the required level of performance. An efficient
architecture of a low-swing receiver circuit that improves over
the one in [18] is also proposed. Compared to the
multi-voltage solution, reduced-swing clock trees are less power
efficient, as the number of intermediate receivers that are
needed to achieve the same speed of the multi-voltage
implementation is substantially larger.
Although most of the techniques mentioned above are
effective, none of them considers the fact that clock signals may
not be always needed, and thus power can be saved by
masking off (i.e., gating) the clock when a circuit (or part of it)
is idle, that is, it is not performing any useful computation
for one or more clock cycles.
Clock gating can significantly reduce the switching activity
in a circuit and on the clock nets; thus, it has been viewed
as one of the most effective logic, RTL and architectural
approaches to dynamic power minimization [19]. Complex
algorithms have been devised for calculating the idle
conditions of a circuit and for automatically inserting the
clock-gating logic into the netlist [20, 21, 22, 23]. Side effects
of the clock-gating paradigm, such as its impact on circuit
testability, have been explored in detail [24], making this
technology very mature also from the industrial standpoint.
As of today, most commercial EDA tools for power-driven
synthesis feature automatic clock-gating capabilities at
different levels of design abstraction.
Unfortunately, if applied in an uncontrolled fashion,
clock-gating can adversely impact the clock power. In fact, to
amortize its power and area overhead, the gating logic should
be shared among several flip-flops. If the flip-flops that share
a common gated-clock (i.e., a gated-clock domain) are widely
dispersed across the chip, a significant wiring overhead is
induced in the clock distribution network, as each domain
must be independently routed on dedicated wires. As a
result, clock drivers in each domain are loaded with a much
larger capacitance and power may increase even if switching
activity is decreased [25, 26]. We then conclude that
clock-gating and clock tree construction should not be seen as two
independent steps and a combined strategy is needed.
Several authors have focused on the problem of minimizing
clock tree power through exploitation of gated-clocks. In
the sequel, we summarize two contributions that have some
common roots with the approach we discuss in this paper.
In [27], Farrahi et al. defined a methodology based on
behavioral synthesis to build an activity-driven clock tree. Given a
pre-placement description of the design, the set of active and
idle times, representing the activity pattern for each
module, is extracted from the module's scheduling table. An
activity pattern is a string of 0s and 1s, indicating idle and
active control steps, respectively, for the module the pattern
refers to. The clock tree construction algorithm is heuristic;
it works bottom-up and is based on recursive weighted
matching, where the cost function is the activity of the
resulting sub-tree. The objective is to cluster into the same
sub-tree modules with similar activity patterns, so that the
clock tree can be gated with high probability as close as
possible to the root. The clusters of modules created by the
recursive matching algorithm are translated into proximity
constraints for module placement. Then, the clock tree is
routed as an H-tree. Dynamic programming is finally used
to determine where the gating logic must be inserted.
In [26], Oh et al. present a zero-skew gated-clock routing
technique for VLSI circuits that improves upon the work
of [27] in two ways. First, it starts from a placed netlist of
modules. Second, it accurately accounts for the power con-
sumption of control signals, jointly addressing the routing
problem for both the clock tree and the gated-clock control
signals. The algorithm is applicable to a class of processors
where activation signals are obtained from instructions and
where the generation of all activation signals is centralized in
a single module placed close to the center of the die. Clock
tree building is done in two steps. First, possible locations
of the internal nodes are calculated according to [28]. Then,
the exact position is found by a greedy method that merges
minimum switched capacitance nodes; delaying the merging
of high activity nodes reduces the global activity in the tree.
Further work on gated-clock tree construction can be found
in [25, 29]. The first paper reports on an exploration of the
impact of clock-gating on traditional clock tree construction
in the case of realistic benchmarks. The second contribution
extends the work of [27] in the directions indicated by [26].
Experimental data from previous work have shown that the
gated-clock technique can significantly reduce the power
dissipation in the clock distribution network. The effectiveness
of exploiting information on the clock activation functions
during clock tree generation has also been demonstrated.
However, the described approaches give little attention to
integration issues with existing design flows and they have
not been validated on real-life benchmarks.
3. LPCLOCK OVERVIEW
The objective of LPClock is to build a power-optimal gated-
clock tree structure, and use state-of-the-art physical design
tools to perform detailed clock routing and buering. As
a consequence, the output of LPClock is not a completely
routed clock tree; instead, it is a clock netlist (including
clock-gating cells and related control logic) and constraints
that, provided as input to commercial clock tree synthesis
tools, lead to a low-power gated-clock tree, while still ac-
counting for all non-power-related requirements (e.g., con-
trolled skew, low crosstalk-induced noise).
LPClock requires two inputs: (i) An RTL structural
description of a synchronous circuit, which can be obtained by any
RTL synthesis tool; (ii) A placement of the RTL modules,
which can be obtained by any RTL floorplanner.
The methodology consists of three steps, as shown in the
flow diagram of Figure 2.
[Figure 2, flow diagram. Inputs: RTL Description and Placement
Information. Steps: Calculation of Activation Functions;
Generation of Clock Tree Logical Topology; Insertion and
Propagation of Clock-Gating Cells. Output: Clock Tree Structure.]
q=1
ODC(zq)
3.2.2 Activation Functions and ODCs
According to the definition given in Section 3.1, the
activation function ACTFi of module Mi represents the set of
conditions for which the module is idle, that is, it is not
supposed to perform any useful computation; thus, its clock
input can be masked off when ACTFi = 1.
Given a module Mi with K inputs, {x1, x2, ..., xK}, and
given the observability don't care conditions for all such
inputs (i.e., ODC(xj), j = 1...K), the activation function for
module Mi is given by the intersection of all the ODC(xj)s,
that is:
$$\mathrm{ACTF}_i = \bigcap_{j=1}^{K} \mathrm{ODC}(x_j)$$
In fact, module Mi is idle for all the conditions such that
none of its inputs is observable to the environment. In other
words, when ACTFi = 1, the clock signal feeding module
Mi can be disabled, as no useful computation is going to be
performed by the logic in Mi.
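As an illustration only, the intersection above can be sketched in Python by modelling each ODC as a Boolean predicate over control signals. The signal names `sel` and `en` and the two ODC predicates are hypothetical and chosen for the example; they are not taken from any design discussed in this paper.

```python
from itertools import product
from typing import Callable, Dict, Sequence

# A condition is modelled as a Boolean function of named control signals.
Condition = Callable[[Dict[str, bool]], bool]


def activation_function(odcs: Sequence[Condition]) -> Condition:
    """Return ACTF_i: true exactly when every input ODC holds (module idle)."""
    return lambda signals: all(odc(signals) for odc in odcs)


# Hypothetical module with two inputs:
#   x1 is not observable when sel is 0,
#   x2 is not observable when en is 0 or sel is 0.
odc_x1: Condition = lambda s: not s["sel"]
odc_x2: Condition = lambda s: (not s["en"]) or (not s["sel"])

actf = activation_function([odc_x1, odc_x2])

# Enumerate the (sel, en) assignments for which the clock could be masked off.
idle_assignments = [
    (sel, en)
    for sel, en in product([False, True], repeat=2)
    if actf({"sel": sel, "en": en})
]
print(idle_assignments)
```

In this toy case the module is idle whenever `sel` is 0, regardless of `en`, because both ODCs hold there.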
3.3 Generating the Clock-Tree Topology
The second stage of the LPClock methodology takes care of
generating the logical topology of the clock tree. To this
purpose, both activation functions and placement informa-
tion are used.
For each module Mi in the design, the placed netlist contains
information about its position, as well as the physical
coordinates of its clock input (which is assumed to be unique and
which we call clock sink in the sequel), denoted by $(x^s_i, y^s_i)$.
Also available is the capacitance Ci of module Mi, which is
proportional to the number of flip-flops that are contained
in Mi.
For each pair of RTL modules (Mi, Mj) in the design, we
define their physical distance as:
$$D(M_i, M_j) = |x^s_i - x^s_j| + |y^s_i - y^s_j|$$
The physical distance is calculated with the Manhattan met-
ric, which is a good estimator of the wiring length between
clock sinks, as horizontal and vertical directions are the only
ones allowed to the routing tools. Physical closeness means
shorter interconnections, hence reduced congestion, shorter
interconnection delay and smaller parasitic capacitance.
Besides the physical distance, we also dene the logical dis-
tance between two modules Mi and Mj as:
L(Mi, Mj) = (Ci +Cj) p(i, j)
where:
p(i, j) = P(ACTFi = 1, ACTFj = 1)
is the probability for modules Mi and Mj to be idle.
If ACTFi and ACTFj are completely independent, then
p(i, j) = P(ACTFi = 1) P(ACTFj = 1). Since the in-
dependence condition is not always satised, the probabil-
ity p(i, j) can be computed in a conservative way by means
of RTL simulation: The values of ACTFi and ACTFj are
collected over N consecutive simulation cycles and the num-
ber of times in which the logic AND of the two activation
functions takes on the value 1 is calculated. In formula:
p(i, j) =
N
AND
N
The logical distance measures the similarity of the activity of
the two modules. If two modules with close activities are fed
by the same net of the clock tree, the parent node of the net
requires the clock signal for a percentage of time comparable
to that of the children nodes, leading to a reduction of the
overall activity in the tree.
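The simulation-based estimate of p(i, j) and the resulting logical distance can be sketched as follows. This is a minimal illustration, assuming the activation functions are available as per-cycle 0/1 traces; the trace values and capacitances are made up for the example.

```python
from typing import Sequence


def joint_idle_probability(actf_i: Sequence[int], actf_j: Sequence[int]) -> float:
    """Estimate p(i, j) = N_AND / N from two per-cycle ACTF traces (1 = idle)."""
    if len(actf_i) != len(actf_j) or not actf_i:
        raise ValueError("traces must be non-empty and of equal length")
    # Count the cycles in which both activation functions evaluate to 1.
    n_and = sum(a & b for a, b in zip(actf_i, actf_j))
    return n_and / len(actf_i)


def logical_distance(c_i: float, c_j: float, p_ij: float) -> float:
    """L(M_i, M_j) = (C_i + C_j) * p(i, j)."""
    return (c_i + c_j) * p_ij


# Hypothetical traces over N = 8 simulation cycles.
trace_i = [1, 1, 0, 0, 1, 0, 1, 1]
trace_j = [1, 0, 0, 1, 1, 0, 1, 0]

p = joint_idle_probability(trace_i, trace_j)
print(p, logical_distance(4.0, 12.0, p))
```

Both modules are idle together in 3 of the 8 cycles, so the estimate is 3/8 even though each module alone is idle more often than that.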
The construction of the clock tree made by LPClock is a
search into the space of all topological binary trees associ-
ated to the set of clock sinks. The search process is driven
by a cost function, shown below, that includes both physical
and logical distance information:
$$\mathrm{DIST}(i, j) = \alpha \cdot f(D(i, j)) + \beta \cdot g(L(i, j))$$
Parameters α and β allow the tuning of the weight of the
wire length between modules (i.e., the physical proximity)
versus the common activation of the modules (i.e., the
logical proximity), while f and g are normalization functions
for D and L.
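A small sketch of how such a combined cost could be evaluated over all sink pairs is given below. The weights are called `alpha` and `beta` here, and the normalization (dividing each distance by its maximum over all pairs) is an assumption made for the example, since the text does not specify f and g. Sink coordinates and logical distances are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple

Pair = Tuple[str, str]


def manhattan(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Physical distance D between two clock sinks."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def dist_table(
    sinks: Dict[str, Tuple[float, float]],
    logical: Dict[Pair, float],
    alpha: float = 1.0,
    beta: float = 1.0,
) -> Dict[Pair, float]:
    """DIST(i, j) = alpha * f(D) + beta * g(L) for every pair of sinks.

    Assumption: f and g divide by the largest D and L over all pairs.
    """
    pairs = list(combinations(sorted(sinks), 2))
    d = {p: manhattan(sinks[p[0]], sinks[p[1]]) for p in pairs}
    d_max = max(d.values()) or 1.0          # guard against all-zero distances
    l_max = max(logical[p] for p in pairs) or 1.0
    return {p: alpha * d[p] / d_max + beta * logical[p] / l_max for p in pairs}


# Hypothetical three-sink example.
sinks = {"A": (0.0, 0.0), "B": (0.0, 2.0), "C": (4.0, 0.0)}
logical = {("A", "B"): 6.0, ("A", "C"): 3.0, ("B", "C"): 1.5}

table = dist_table(sinks, logical)
best_pair = min(table, key=table.get)
print(best_pair, table[best_pair])
```

With equal weights, the physically closest pair (A, B) is not the cheapest one here, because its logical term dominates; changing `alpha` and `beta` shifts that balance.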
The clock tree construction algorithm works hierarchically,
building a binary topology on a level-by-level basis,
proceeding in a bottom-up fashion. A current_set is associated with
each level of the tree and contains all the available sinks for
that level. The algorithm aims at building the next_set,
which will contain all the sinks that belong to the next level
of the tree.
The algorithm works as follows. Given the current_set,
the DIST(i, j) cost function is evaluated for every possible
pair of sinks (i, j). Then, the pair (i, j) that gives the
minimum value for the cost function is moved from the
current_set to the next_set. This operation is repeated until
the current_set becomes empty, that is, all the sinks
in that level have been paired and moved one level higher.
Then, the newly created next_set becomes the current_set
for the next level, and the process restarts. The tree
construction procedure terminates when the current_set contains
only two sinks and hence the next_set will contain the root
of the tree.
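The level-by-level pairing can be sketched in Python as shown below. This is not the LPClock implementation: the cost function is passed in as a parameter (plain Manhattan distance in the example), the position of a merged node is assumed to be the midpoint of its children, and a sink left unpaired on an odd-sized level is assumed to be promoted unchanged. None of these three choices is stated in the text.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List


@dataclass(frozen=True)
class Node:
    label: object   # sink name, or a (left, right) pair of child labels
    x: float
    y: float


def manhattan(a: Node, b: Node) -> float:
    return abs(a.x - b.x) + abs(a.y - b.y)


def merge_midpoint(a: Node, b: Node) -> Node:
    # Assumption: the parent node sits at the midpoint of its two children.
    return Node((a.label, b.label), (a.x + b.x) / 2, (a.y + b.y) / 2)


def build_topology(
    sinks: List[Node],
    dist: Callable[[Node, Node], float],
    merge: Callable[[Node, Node], Node],
) -> Node:
    """Greedy bottom-up construction of a binary clock tree topology."""
    if not sinks:
        raise ValueError("at least one sink is required")
    current_set = list(sinks)
    while len(current_set) > 1:
        next_set: List[Node] = []
        while len(current_set) > 1:
            # Pick the pair with minimum cost among all remaining sinks.
            a, b = min(combinations(current_set, 2), key=lambda pair: dist(*pair))
            current_set.remove(a)
            current_set.remove(b)
            next_set.append(merge(a, b))
        # Assumption: an unpaired sink is promoted to the next level as-is.
        next_set.extend(current_set)
        current_set = next_set
    return current_set[0]


sinks = [Node("A", 0, 0), Node("B", 1, 0), Node("C", 10, 0), Node("D", 11, 0)]
root = build_topology(sinks, manhattan, merge_midpoint)
print(root.label)
```

Each level evaluates all pairs of the remaining sinks, so the sketch is quadratic per pairing step; it is meant to show the control flow, not an efficient implementation.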
When completed, the algorithm leads to a fully binary tree
structure, whose leaves are all the RTL modules of the de-
sign. No clock-gating cells are included in the clock tree at
this point. This is the subject of the final stage of the clock
tree planning process, which is described next.
3.4 Inserting the Clock-Gating Cells
The last stage of the LPClock methodology targets the in-
sertion and the propagation of the clock-gating cells on the
branches of the clock tree in order to guarantee that, at
any point in time of circuit operation, the largest possible
fraction of the clock nets will be disabled.
Initially, the clock-gating cells are placed right in front of the
sinks, i.e., they only condition the clock signals that enter
the RTL modules. The gating cells are then repositioned in
the tree through a procedure that tries to move them from
the leaves of the tree topology towards the upper levels.
The algorithm that we have implemented is heuristic and it
is driven by a cost function that, for each possible move, esti-
mates the total clock tree power, using the model described
in Section 3.1.
The clock tree is visited in a post-order fashion to search
for configurations of the clock-gating cells in the tree
corresponding to local minima of the cost function (i.e., the
estimated power consumption).
For every node in the tree for which the branches to the
two children nodes host a clock-gating cell (see Figure 5-a),
three possible transformations can be applied. In case the
activation functions of the two children nodes are the same,
the best possible solution is certainly the one shown in Fig-
ure 5-b, since it guarantees maximum disabling of the clock
signals for both children nodes and it requires the insertion
of only one clock-gating cell that controls the entire sub-tree.
However, there may be cases where the activation functions
of the two children nodes differ substantially; in particular,
the activation function of the right child may include most of
the idle conditions of the left child, and many more. In this
case, it may be worth resorting to the configuration shown
in Figure 5-c, which allows the disabling of the clock signal
to the right branch of the sub-tree even when the left
sub-tree actually needs the clock. Clearly, also the symmetric
case, shown in Figure 5-d, may occur and it is thus handled by
the procedure.
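The simplest of these transformations, the one of Figure 5-b, can be sketched as a post-order pass that replaces two identical gating cells on sibling branches with a single cell one level up. The sketch below covers only that case: activation functions are compared as plain strings (a real flow would need a Boolean equivalence check), and the cost-driven choices of Figures 5-c and 5-d, which depend on the power model, are deliberately left out. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TreeNode:
    name: str
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None
    # Activation function of the gating cell on the branch feeding this node,
    # or None when that branch carries no gating cell.
    gate: Optional[str] = None


def hoist_equal_gates(node: TreeNode) -> None:
    """Post-order pass: merge identical sibling gating cells into the parent."""
    if node.left is None or node.right is None:
        return  # leaf (RTL module): nothing to merge
    hoist_equal_gates(node.left)
    hoist_equal_gates(node.right)
    same = node.left.gate is not None and node.left.gate == node.right.gate
    if same and node.gate is None:
        # One cell above the node now controls the whole sub-tree.
        node.gate = node.left.gate
        node.left.gate = None
        node.right.gate = None


# Initially every leaf has its own gating cell.
a = TreeNode("A", gate="idle_1")
b = TreeNode("B", gate="idle_1")
c = TreeNode("C", gate="idle_1")
d = TreeNode("D", gate="idle_2")
n1 = TreeNode("n1", left=a, right=b)
n2 = TreeNode("n2", left=c, right=d)
root = TreeNode("root", left=n1, right=n2)

hoist_equal_gates(root)
print(n1.gate, n2.gate, root.gate)
```

After the pass, the A/B pair shares one cell at n1, while C and D keep their own cells because their activation functions differ.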
Figure 5: Gating Logic Propagation.
When the final position of the clock-gating cells inside the
clock tree is determined, the control logic that combines the
activation functions for each clock-gating cell is synthesized
and it is passed to the RTL-to-layout synthesis flow, which
will then consider the clock tree structure planned by
LPClock during both final placement and clock tree synthesis
and routing.
3.5 Handling Non-Hierarchical Designs
One fundamental assumption which stands at the basis of
the LPClock methodology is that flip-flops belonging to the
same RTL module are kept physically contiguous during the
RTL-to-layout synthesis step. Unfortunately, there are
practical cases in which this does not happen, due to the fact
that the hierarchical nature of the design is not enforced
during RTL-to-layout synthesis, leading to a layout structure in
which physical contiguity of the RTL modules (and of the
flip-flops located inside each of them) is lost. The flip-flops
belonging to the same RTL module may end up being spread
far apart across the chip, thus making the planned clock tree
logical topology highly suboptimal and of no practical use,
as the routing of the clock sub-tree to the individual
flip-flops contained in the RTL modules can be prohibitively
expensive.
This section introduces the enhancements to the LPClock
methodology which are needed to prevent the
aforementioned undesirable phenomenon, and thus enable the
applicability of LPClock also to designs with non-hierarchical
(i.e., flat) structure.
The key idea to be pursued is that of forcing physical
contiguity for the flip-flops inside an RTL module through the
assertion of placement constraints. To this purpose, we
introduce the concept of pseudo-module, which is defined as a
set of flip-flops that are identified (and marked) as
belonging to the same RTL module and for which the placement is
constrained so that the flip-flops will be placed close to each
other. This concept is exploited when the LPClock
methodology has to be applied to flat designs, for example those
which are produced by RTL synthesis.
Figure 6-a shows a layout where boxes represent flip-flops
and different grey levels denote flip-flops that belong
to different RTL modules. From the picture it is evident
that the flip-flops of a given RTL module can be scattered
in the final placement, if appropriate countermeasures are
not adopted.
Introducing the definition of pseudo-module leads to a more
localized layout structure for the flip-flops belonging to each
RTL module, as shown in Figure 6-b, thus preserving (or
reconstructing), at the physical level, the hierarchical
structure that is initially available at the RTL, and that is
essential for making the clock tree architecture planned by
LPClock effective.
Figure 6: Handling Non-Hierarchical Designs.
4. INTEGRATION OF LPCLOCK INTO AN
INDUSTRIAL FLOW
This section describes how the LPClock methodology is
integrated into an industrial design flow that relies on
commercial tools for RTL synthesis, optimization and physical
design.
Figure 7 shows the flow in detail.
Figure 7: LPClock Integrated Flow (tools: Synopsys DesignCompiler;
Cadence Silicon Ensemble Qplace and CTGen; BullDAST PowerChecker
with CGCap and LPClock; data: RTL netlist, activation functions,
clock structure, routed clock tree).
Starting from a high-level design specification (i.e., VHDL or
Verilog), the circuit is first elaborated by Synopsys
DesignCompiler to obtain an RTL structural representation from
which clocked modules and all nets, including the clock, are
extracted. A floorplan and a placement are then initialized
by Cadence SE-Qplace.
The LPClock algorithms have been implemented inside
BullDAST PowerChecker, an integrated environment for RTL
power estimation and optimization. PowerChecker features
the CGCap optimization engine, which is capable of
generating ODC-based gated-clock activation functions for all the
modules in the RTL design starting from the initial
specification.
The pre-placed netlist and the module activation functions
are fed to LPClock, which generates the clock tree structure
according to the methodology described in Section 3. The
information about the clock network topology and the
position of the clock-gating cells is introduced into the design
database. This step requires first changing the +PLACED
attribute of all the modules in the database to +FIXED, in order
to prevent the position of the modules from changing during
subsequent optimizations. Next, incremental placement
is invoked to include the clock tree structure and the
clock-gating logic into the current view of the design.
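As a rough illustration of the attribute change, and assuming the design database is exchanged as a DEF-style text file in which each component carries a `+ PLACED ( x y ) orient` clause, the rewrite could be scripted as below. The file fragment and instance names are invented for the example, and the actual mechanism used inside the flow described here may differ.

```python
import re

# Matches the placement-status keyword, with or without a space after '+'.
_PLACED = re.compile(r"\+\s*PLACED\b")


def fix_placement(def_text: str) -> str:
    """Turn every '+ PLACED' status into '+ FIXED' so later steps cannot move it."""
    return _PLACED.sub("+ FIXED", def_text)


# Hypothetical COMPONENTS fragment with two placed modules.
sample = """COMPONENTS 2 ;
- u_fifo_rx FIFO + PLACED ( 1000 2000 ) N ;
- u_crc_tx CRC + PLACED ( 5000 2000 ) FS ;
END COMPONENTS
"""

fixed = fix_placement(sample)
print(fixed)
```

Coordinates and orientations are left untouched; only the status keyword changes.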
The updated database is finally fed to Cadence SE-CTGen,
which performs buffer insertion and checks for timing closure
and final clock skew. It should be pointed out that the
insertion of the AND gate for each internal node in the clock
tree prevents any change on the clock net by CTGen, forcing
the tool to preserve the clock branching structure planned
by LPClock.
In closing this section, we would like to emphasize that the
LPClock methodology has general validity, and its usability
is not limited to the environment (i.e., tools and flow) we
have described above. As LPClock provides, as output, a
plan of the clock tree consisting of a set of constraints,
it can be easily mapped onto any RTL-to-layout flow with
very little effort, as no conceptual changes are needed.
5. EXPERIMENTAL RESULTS
We have validated the LPClock flow on some benchmark
circuits coming from different sources and domains, as well
as on an industrial design case provided by Accent, i.e., an
IEEE MAC 802.3 sublayer controller for a VCI bus with
10, 100 and 1000 Mbit/s data rates.
Each design was first synthesized and mapped using
Synopsys DesignCompiler and PowerCompiler. Then, we
generated the placed and routed netlists (including the clock
distribution network) using Cadence Silicon Ensemble Qplace
for the original descriptions, as well as the netlists for the
designs with gating logic inserted at the clock inputs of the
RTL modules and with the clock tree structure created by
LPClock. Layout extraction was performed next for all the
circuits, and the gate-level netlists were back-annotated using
the extracted parameters. Finally, gate-level power estimation
was performed using Synopsys PowerCompiler. The whole
synthesis process was timing driven, and mapping was done
onto the 0.13 µm HCMOS9 technology library by
STMicroelectronics. Clock tree synthesis with Cadence Silicon
Ensemble CTGen was performed using a very tight maximum
skew constraint (less than 0.2% of the clock cycle).
LPClock was run with a value of the α/β ratio equal to one.
This choice was made based on previous experience (see the
analysis reported in [7]). In practical terms, this means
that physical distance (parameter α) and logical distance
(parameter β) have equal weight in the cost function that
drives LPClock.
In the following sections we present and discuss the results
we have achieved for the two classes of circuits.
5.1 Benchmark Circuits
We have considered a total of eight benchmark circuits,
characterized by different functionalities and sizes. Some of them
are publicly available and are quite simple (no more than
2000 library cells), some others come from industry and are
more complex (up to 33000 cells). Details about the circuits
are summarized in Table 1.
Benchmark   # of Gates   # of Clock Sinks
Simple1            140                 72
Simple2            185                 68
Simple3           1870                624
Simple4           1943                680
Indust1          13954               1726
Indust2          17125               2054
Indust3          24587               2963
Indust4          33180               5450
Table 1: Characteristics of Benchmark Circuits.
Table 2 collects the results of the experiments. In
particular, column Clock-Gating shows the savings in the power
consumed by the clock tree w.r.t. the original circuit
implementation achieved by inserting the clock-gating logic only
at the inputs of the RTL modules. On the other hand,
column LPClock shows the clock tree power savings against the
original circuits obtained by inserting the clock-gating logic
as suggested by LPClock. A comparison of the clock power
data for the two optimized circuits shows that LPClock
offers additional savings over traditional clock-gating that
range from 3.58% to 42.03%, depending on the benchmark
(last column).
Figure 8: Block Diagram with Clocking Scheme of the MAC 802.3 Controller
(clock trees CT1: SYS_CLK, CT2: MII_TX_CLK, CT3: MII_RX_CLK).
Benchmark   Clock-Gating   LPClock   Additional Savings
Simple1           12.24%    18.32%                6.92%
Simple2           43.86%    45.87%                3.58%
Simple3           36.70%    41.78%                8.02%
Simple4            9.94%    22.94%               14.43%
Indust1           25.15%    56.61%               42.03%
Indust2           22.31%    39.88%               22.61%
Indust3           14.38%    37.23%               26.68%
Indust4           17.72%    43.26%               31.04%
Table 2: Results on Benchmark Circuits.
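The last column of Table 2 appears to be the reduction in clock power of the LPClock design relative to the clock-gated design, rather than the difference of the first two columns. Under that reading (an interpretation, not a formula given in the text), it can be recomputed from the two savings figures as 1 - (1 - LPClock) / (1 - Clock-Gating). The short check below reproduces the tabulated values to within rounding.

```python
# (Clock-Gating %, LPClock %, last column %) copied from Table 2.
rows = {
    "Simple1": (12.24, 18.32, 6.92),
    "Simple2": (43.86, 45.87, 3.58),
    "Simple3": (36.70, 41.78, 8.02),
    "Simple4": (9.94, 22.94, 14.43),
    "Indust1": (25.15, 56.61, 42.03),
    "Indust2": (22.31, 39.88, 22.61),
    "Indust3": (14.38, 37.23, 26.68),
    "Indust4": (17.72, 43.26, 31.04),
}


def additional_savings(cg_pct: float, lp_pct: float) -> float:
    """Clock power saved by LPClock relative to the clock-gated design, in %."""
    remaining_cg = 100.0 - cg_pct   # clock power left after plain clock-gating
    remaining_lp = 100.0 - lp_pct   # clock power left after LPClock
    return 100.0 * (1.0 - remaining_lp / remaining_cg)


# Largest deviation between the recomputed and the tabulated value.
max_err = max(
    abs(additional_savings(cg, lp) - delta) for cg, lp, delta in rows.values()
)
print(f"largest deviation: {max_err:.3f} percentage points")
```

The small residual deviations are what one would expect from the first two columns having been rounded to two decimals.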
The experimental data show very clearly that the clock trees
generated using LPClock as a preprocessor to CTGen are
much superior (in terms of power) to those generated by
CTGen at the end of the traditional flow for circuits of
significant size, while the additional savings are limited (i.e.,
below 15%) on smaller benchmarks. This was somehow
expected, as for small circuits the clock distribution networks
tend to have very simple structures, and thus the degrees of
freedom that are available for the optimization are reduced.
Timing analysis was performed with Synopsys PrimeTime on
the synthesized netlists containing capacitance information
back-annotated after extraction. The results have shown
that no skew violation occurred for any of the benchmarks. This
is a very important result, as it indicates the quality of the
constraints for the clock tree structure that LPClock was
able to generate.
5.2 Industrial Design: MAC 802.3 Controller
The IEEE 802.3 International Standard for Local Area
Networks (LANs) employs CSMA/CD (Carrier Sense Multiple
Access with Collision Detection) as the access method.
The MAC 802.3 (media access control) controller
implements the LAN CSMA/CD sublayer for the following
families of systems: 10 Mb/s, 100 Mb/s and 1000 Mb/s data
rates for baseband and broadband systems. Half- and
full-duplex operation modes are supported. The collision
detection access method is applied only to the half-duplex
operation mode. Frame bursting is supported for half-duplex
operation at speeds above 100 Mb/s. The MAC control frame
sublayer (optional) is supported by the current
implementation. VCI (Virtual Component Interface) buses (a
superset of the standard bus) are used as application and host
interfaces. The MII (Media Independent Interface) standard
bus is used for the PHY interface.
Figure 8 shows the top-level block diagram of the MAC 802.3
controller, highlighting the implemented clocking scheme.
There are three clock domains in the design; the system
clock (CT1, indicated by the black, solid lines), the MII TX
clock (CT2, indicated by the black dashed lines), and the MII
RX clock (CT3, indicated by the grey solid lines). The sug-
gested operating frequency for the system clock is 166MHz;
instead, both the MII TX and the MII RX clocks have a
suggested operating frequency of 125MHz.
Signals that cross different clock domains (i.e., the
configuration bits and the handshaking signals) are resynchronized
in the RESYNCH module shown at the bottom of the
block diagram.
The two asynchronous FIFOs are used to detach the data
between the system clock and the MII clock domains.
In loopback mode, the MII TX clock is also used on the
RX path; therefore, the clock trees CT2 and CT3 must be
balanced starting from the common root mii_tx_clk (see
the schematic of Figure 9).
[Figure 9 schematic labels: multiplexer inputs 0 and 1; clocks
sys_clk, mii_tx_clk, mii_rx_clk; sel_mux = (not loopback) or test_mode]