Sie sind auf Seite 1von 6

Clock Data Compensation Aware Clock Tree

Synthesis in Digital Circuits with Adaptive Clock


Generation
Taesik Na, Jong Hwan Ko and Saibal Mukhopadhyay
School of ECE, Georgia Institute of Technology, Atlanta, Georgia, USA
{taesik.na, jonghwan.ko}@gatech.edu, saibal@ece.gatech.edu

Abstract—Adaptive clock generation to track critical path path delay under PSN if and only if the voltage-delay
delay enables lowering supply voltage with improved timing sensitivities of RO, critical path and clock tree are the same
slack under supply noise. This paper presents how to synthesize regardless of the clock distribution delay or critical path delay.
clock tree in adaptive clocking to fully exploit the clock data
compensation (CDC) effect in digital circuits. The paper first Next, we analyze timing slack under PSN to show that
provides analytical proof of ideal CDC effect for ring oscillator ideal CDC cannot be achieved for designs where the voltage
based clock generation. Second, the paper analyzes non-ideal delay sensitivities of the clock tree are not the same with that of
CDC effect in a gate dominated critical path and wire dominated critical path. Finally, we propose simple but efficient clock tree
clock tree design. The paper shows the delay sensitivity mismatch synthesis (CTS) technique to maximize CDC effect. As a case
between clock tree and critical path can degrade CDC effect by study, we implement JPEG encoder engine [7] with Synopsys
analyzing timing slack under power supply noise (PSN). Finally, 28nm PDK and apply CTS changing available buffer list while
the paper proposes simple but efficient clock tree synthesis (CTS) maintaining the same transition delay. Simulation results show
technique to maximize timing slack under PSN in digital circuits that we can achieve up to 85ps timing slack (equivalent 40mV
with adaptive clock generation. voltage margin) improvement under 300mV voltage droop
with derating factor of 0.9.
Keywords—clock data compensation; clock tree synthesis;
adaptive clock The rest of the paper is organized as follows: Section II
briefly presents behavioral model of timing slack variation due
I. INTRODUCTION to PSN; Section III analytically proves ideal CDC effect for
Reducing voltage margin is one of key enablers for energy RO as a clock generation; Section IV shows non-ideal CDC
efficient integrated circuit (IC) design. The voltage margin is effect for the practical design; Section V presents timing slack
added in digital circuits to avoid potential timing error caused analysis for several design choices; Section VI shows our
by voltage droop due to the limited power delivery network proposed CTS technique; and Section VII summarizes the
(PDN) resources. One approach to reduce voltage margin is to paper.
reduce the voltage drop itself from optimal design of the PDN,
adding decoupling capacitors, and using on-chip voltage II. BACKGROUND
regulators [1]–[2]. The complementary approach is to Previous work [8] proposed behavioral modeling of timing
implement circuit level techniques that can guarantee correct slack variation due to PSN. We briefly review the previous
circuit operation under supply noise [3]–[6]. Among them, work which will be used for the proof of ideal CDC for the RO
adaptive clock generation to track critical path delay has in section III. Given a circuit having relationship between
emerged as an efficient approach [5]–[6]. It generates delay variation (Δd) and supply voltage difference (Δv) from
elongated (contracted) clock period matched with critical path nominal voltage, delay sensitivity α(Δv) [1/V] can be obtained
delay under a negative (positive) power supply noise (PSN). by [8]:
Many works proposed various clock generation schemes to
accurately mimic the critical path delay under PSN. However, ଵ οௗሺο௩ሻ
there is a lack of understanding of how the clock tree design Ž‹ ǡο‫ ݒ‬ൌ Ͳ
஽బ ο௩՜଴ ο௩
needs to change to maximally exploit the advantages of Ƚሺο‫ݒ‬ሻ  ൌ ቐ ଵ οௗሺο௩ሻ
(1)
 ǡ‫݁ݏ݅ݓݎ݄݁ݐ݋‬Ǥ
adaptive clocking. ஽బ ο௩

This paper presents how to synthesize the clock tree in where, D0 is nominal circuit delay from the nominal
adaptive clocking. In particular, we show that delay sensitivity voltage for the target circuit. The power supply noise induced
matching between the clock tree and the adaptive clock jitter (PSNIJ) J(t), then can be calculated by [8]:
generation is necessary to fully benefit from clock data
compensation (CDC) effect where clock distribution modulates ௧
clock period favorably for timing slack under PSN [1]. We ‫ܬ‬൫‫ݐ‬Ǣ ߙሺο‫ݒ‬ሻǡ ‫ܦ‬଴ ǡ ο‫ݒ‬ሺ‫ݐ‬ሻ൯ ൌ ‫׬‬௧ି஽ ߙሺο‫ݒ‬ሺ߬ሻሻο‫ݒ‬ሺ߬ሻ݀߬Ǥ (2)

consider ring oscillator (RO) based adaptive clock generation
as an example to analytically show the ideal CDC effect. We where, Δv(t) is a transient supply voltage difference from
show that the clock period of RO exactly follows the critical nominal voltage. For brevity, we use J(t; D0) instead of using

978-3-9815370-8-6/17/$31.00 2017
c IEEE 1504
J(τ; D0)

J(t-DCK0; D0) J(t; D0)


t-DCK0 t
J(τ; DCK0)
J(t-D0-J(t-DCK0; D0);DCK0)
J(t-D0;DCK0) J(t; DCK0)
t-D0-J(t-DCK0; D0) t-D0 t
Fig. 2: Waveforms showing |J(t-D0; DCK0)-J(t-D0-J(t-DCK0;D0);DCK0)| << D0
at a low frequency noise.

and critical path can be modeled as alpha delay, alpha clock


period and alpha delay [8] respectively as shown in this figure.
We assume every circuit element has the same delay sensitivity
α(Δv) in this example (αD(Δv), αRO(Δv), and αCK(Δv) = α(Δv)).
In the following, we use α, αD, αRO and αCK instead of using
α(Δv), αD(Δv), αRO(Δv) and αCK(Δv) for simpler notation.

Fig. 1: (a) Block diagram and (b) behavioral modeling [8] for ring
The clock period and the critical path are no longer
oscillator as a local clock generator. constants under PSN. From fig. 1(b), TCKIN(t) and D(t) can be
written as:
the whole term in (2) and we will use the whole term only
when necessary. Power supply noise induced delay (PSNID) ܶ‫ܭܥ‬ூே ሺ‫ݐ‬ሻ ൌ ‫ܦ‬ሺ‫ݐ‬ሻ ൌ ‫ܦ‬଴ ൅ ‫ܬ‬ሺ‫ݐ‬Ǣ ‫ܦ‬଴ ሻǤ (6)
D(t) can be calculated by adding PSNIJ (J(t; D0)), propagation
of the input delay (DIN(t)) and nominal delay (D0) as follows Making clock period at the leaf node by (4):
[8]:
ܶ‫ܭܥ‬ሺ‫ݐ‬ሻ ൌ ‫ܦ‬଴ ൅ ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ Ǣ ‫ܦ‬଴ ሻ ൅ ‫ܬ‬ሺ‫ݐ‬Ǣ ‫ܦ‬஼௄଴ ሻ
‫ܦ‬ሺ‫ݐ‬ሻ ൌ ‫ܬ‬ሺ‫ݐ‬Ǣ ‫ܦ‬଴ ሻ ൅ ‫ܦ‬ூே ሺ‫ ݐ‬െ ‫ܦ‬଴ ሻ ൅ ‫ܦ‬଴ Ǥ (3)
െ‫ܬ‬ሺ‫ ݐ‬െ ܶ‫ܭܥ‬ூே ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ ሻǢ ‫ܦ‬஼௄଴ ሻǤ (7)
Now, consider the clock period for a leaf node of the clock
tree with the clock distribution delay DCK0 from the nominal We assume the following condition is true for all t > 0:
voltage and the clock period at the source node TCKIN(t).
Power supply noise induced clock period (PSNICP) TCK(t) is ȁ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬଴ Ǣ ‫ܦ‬஼௄଴ ሻ െ ‫ܬ‬ሺ‫ ݐ‬െ ܶ‫ܭܥ‬ூே ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ ሻǢ ‫ܦ‬஼௄଴ ሻȁ ‫ܦ ا‬଴
obtained by adding PSNIJ of the current time and subtracting
PSNIJ of the previous rising edge to the propagated input clock At a low frequency noise, even though PSNIJ itself can be
period. The PSNICP can be written as [8]: large, |J(t-D0; DCK0)-J(t-D0-J(t-DCK0;D0);DCK0)| is small due to
the slow supply voltage change as shown in fig. 2. This
ܶ‫ܭܥ‬ሺ‫ݐ‬ሻ ൌ ܶ‫ܭܥ‬ூே ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ ሻ ൅ ‫ܬ‬ሺ‫ݐ‬Ǣ ‫ܦ‬஼௄଴ ሻ assumption is also true at a higher noise frequency since the
peak values of PSNID and PSNICP become negligible at a
െ‫ܬ‬ሺ‫ ݐ‬െ ܶ‫ܭܥ‬ூே ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ ሻǢ ‫ܦ‬஼௄଴ ሻǤ (4) higher frequency. Note that integration over a fixed interval on
a sin function becomes smaller at a higher noise frequency.
Resulting PSN induced slack (PSNIS) S(t) can be written as Now, TCK(t) can be re-written as:
follows [8]:
ܶ‫ܭܥ‬ሺ‫ݐ‬ሻ ൌ ‫ܦ‬଴ ൅ ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ Ǣ ‫ܦ‬଴ ሻ ൅ ‫ܬ‬ሺ‫ݐ‬Ǣ ‫ܦ‬஼௄଴ ሻ
ܵሺ‫ݐ‬ሻ ൌ ܶ‫ܭܥ‬ሺ‫ݐ‬ሻ െ ‫ܦ‬ሺ‫ݐ‬ሻǤ (5)
െ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬଴ Ǣ ‫ܦ‬஼௄଴ ሻǤ (8)
III. IDEAL CLOCK DATA COMPENSATION
Ideal CDC can be proved by showing (6) and (8) are the
When the clock period exactly follows critical path delay same.
under PSN, S(t) becomes zero and we call this ideal CDC. In
this section, we will show inherent ideal CDC effect for a RO A. Critical Path Delay < Clock Distribution Delay
as a clock generator. Suppose we have a circuit as shown in fig. First, we consider the case where critical path delay (D0) is
1(a). In this figure, clock is generated from the RO which has shorter than clock distribution delay (DCK0). By additivity
half of the critical path delay D0, making clock period of D0. principle of integration on intervals, J(t; DCK0) in (8) can be
The clock propagates through the clock tree with the clock written as:
distribution delay of DCK0. Then, the clock meets with critical
௧ ௧ି஽
path delay of D0 making zero timing slack. ‫ܬ‬ሺ‫ݐ‬Ǣ ‫ܦ‬஼௄଴ ሻ ൌ ‫׬‬௧ି஽ ߙο‫ݒ‬ሺ߬ሻ݀߬  ൅ ‫׬‬௧ି஽ బ ߙο‫ݒ‬ሺ߬ሻ݀߬
బ ಴಼బ
Now we consider PSN, Δv(t), and assume on-chip PDN is
well designed so that every circuit element is experiencing the ൌ ‫ܬ‬ሺ‫ݐ‬Ǣ ‫ܦ‬଴ ሻ ൅ ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬଴ Ǣ ‫ܦ‬஼௄଴ െ ‫ܦ‬଴ ሻǤ
same PSN. Fig. 1(b) represents behavioral modeling of this
circuit for timing slack variation due to PSN. RO, clock tree Similarly, J(t-D0; DCK0) in (8) can be decomposed as:

2017 Design, Automation and Test in Europe (DATE) 1505


50mV
Δv(t)
50mV
Slack
ΔD(t)=D(t)-D0 improvement
ΔTCK(t)=TCK(t)-D0
- S(t)
Fig. 4: Waveforms showing slack improvement is measured by
max{ΔD(t)}– max{-S(t)}.

the critical path delay [1]. To examine the effect of different


delay sensitivities, we consider the same circuit in fig. 1 and set
the delay sensitivities αD, αRO and αCK for critical path, RO and
clock tree respectively as shown in fig. 1(b). TCKIN(t), D(t) and
TCK(t) now can be written as:

ܶ‫ܭܥ‬ூே ሺ‫ݐ‬ሻ ൌ ‫ܦ‬଴ ൅ ‫ܬ‬ሺ‫ݐ‬Ǣ ߙோை ǡ ‫ܦ‬଴ ሻ (11)

Fig. 3: Design flow. ‫ܦ‬ሺ‫ݐ‬ሻ ൌ ‫ܦ‬଴ ൅ ‫ܬ‬ሺ‫ݐ‬Ǣ ߙ஽ ǡ ‫ܦ‬଴ ሻ (12)

‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬଴ Ǣ ‫ܦ‬஼௄଴ ሻ ൌ ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬଴ Ǣ ‫ܦ‬஼௄଴ െ ‫ܦ‬଴ ሻ ܶ‫ܭܥ‬ሺ‫ݐ‬ሻ ൌ ‫ܦ‬଴ ൅ ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ Ǣ ߙோை ǡ ‫ܦ‬଴ ሻ

൅‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ Ǣ ‫ܦ‬଴ ሻǤ ൅‫ܬ‬ሺ‫ݐ‬Ǣ ߙ஼௄ ǡ ‫ܦ‬஼௄଴ ሻ െ ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬଴ Ǣ ߙ஼௄ ǡ ‫ܦ‬஼௄଴ ሻǤ (13)

Substituting the above two terms for (8) gives: If we apply similar decomposition and substitution
techniques like previous section, we obtain:
ܶ‫ܭܥ‬ሺ‫ݐ‬ሻ ൌ ‫ܦ‬଴ ൅ ‫ܬ‬ሺ‫ݐ‬Ǣ ‫ܦ‬଴ ሻǤ (9) ܶ‫ܭܥ‬ሺ‫ݐ‬ሻ ൌ ‫ܦ‬଴ ൅ ‫ܬ‬ሺ‫ݐ‬Ǣ ߙ஼௄ ǡ ‫ܦ‬଴ ሻ
which is same as (6) resulting in ideal CDC effect. ൅‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ Ǣ ߙோை ǡ ‫ܦ‬଴ ሻ െ ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ Ǣ ߙ஼௄ ǡ ‫ܦ‬଴ ሻǤ (14)
B. Critial Path Delay > Clock Distribution Delay
Note the result in section III is the special form of (14)
Secondly, we consider the case where critical path delay
(D0) is longer than clock distribution delay (DCK0). We apply when αD = αRO = αCK. If we set αRO = αCK, then the equation
additivity principle of integration on intervals again for J(t; (14) can be modified to:
DCK0) in (8):
ܶ‫ܭܥ‬ሺ‫ݐ‬ሻ ൌ ܶ‫ܭܥ‬ூே ሺ‫ݐ‬ሻ ൌ ‫ܦ‬଴ ൅ ‫ܬ‬ሺ‫ݐ‬Ǣ ߙ஼௄ ǡ ‫ܦ‬଴ ሻǤ (15)
‫ܬ‬ሺ‫ݐ‬Ǣ ‫ܦ‬஼௄଴ ሻ  ൌ ‫ܬ‬ሺ‫ݐ‬Ǣ ‫ܦ‬଴ ሻ െ ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ Ǣ ‫ܦ‬଴ െ ‫ܦ‬஼௄଴ ሻǤ
This means that TCK(t) is not dependent on DCK0 resulting
Note that the integration for an interval [t-DCK0, t] can be in no clock period modulation through the clock tree. We refer
done by integration for [t-D0, t] minus [t-D0, t-DCK0] since D0 > this effect as non-clock tree modulation and (αRO = αCK) as the
DCK0. Similarly, J(t-D0; DCK0) in (8) can be re-written as: condition for non-clock tree modulation .

‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬଴ Ǣ ‫ܦ‬஼௄଴ ሻ ൌ ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ Ǣ ‫ܦ‬଴ ሻ V. FREQUENCY DOMAIN TIMING SLACK ANALYSIS
In this section, we perform frequency domain timing slack
െ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ Ǣ ‫ܦ‬଴ െ ‫ܦ‬஼௄଴ ሻ analysis for 3 test cases to understand non ideal CDC effect.
We use the behavioral modeling in fig. 1(b).
Substituting the above two terms for (8) gives:
• αRO = αCK = αD: The RO and the clock tree are both gate
ܶ‫ܭܥ‬ሺ‫ݐ‬ሻ ൌ ‫ܦ‬଴ ൅ ‫ܬ‬ሺ‫ݐ‬Ǣ ‫ܦ‬଴ ሻǤ (10) dominated. This is the ideal CDC case and resulting timing
slack can be can be obtained by substituting parameters
(12) and (14) for (5):
resulting in the same function with (6).
This is notable that ideal CDC effect does not depend on ܵሺ‫ݐ‬ሻ ൌ ͲǤ (16)
the relationship between critical path delay and clock
distribution delay. • αRO = αCK = 0.8*αD: The RO and the clock tree are both
wire dominated. Timing slack for this case is:
IV. NON-IDEAL CLOCK DATA COMPENSATION
In real implementation, delay sensitivities for the critical ܵሺ‫ݐ‬ሻ ൌ  െ‫ܬ‬ሺ‫ݐ‬Ǣ ͲǤʹߙ஽ ǡ ‫ܦ‬଴ ሻǤ (17)
path and clock tree are not the same. Normally, clock tree is
wire dominated delay and critical path is gate dominated delay, • αRO = αD, αCK = 0.8*αD: This is the most conventional case
thus, delay sensitivity of the clock tree is smaller than that of where RO generated clock period tracks critical path delay,

1506 2017 Design, Automation and Test in Europe (DATE)


120 120

100 100

80 80

(%)
(%)
60 60
αRO=αCK0=αD
aRO=aCK0=aD
40 40
αRO=αCK0=0.8αD
aRO=aCK0=0.8aD 500MHz
20 20
αRO=αDaCK0=0.8aD
aRO=aD, , αCK0=0.8αD
0 0
1.00E+07 (a) DCK0=0.5ns 1.00E+09
1.00E+08 1.00E+07 (b) DCK0=1ns 1.00E+09
1.00E+08
120 Noise frequency (Hz) 120 Noise frequency (Hz)
100 100 250MHz
80 80
500MHz 500MHz

(%)
(%)

60 60
250MHz 125MHz
40 40 375MHz 750MHz
20 20 625MHz
750MHz
0 0
1.00E+07 1.00E+08 1.00E+09 1.00E+07 1.00E+08 1.00E+09
Noise frequency (Hz) Noise frequency (Hz)
(c) DCK0=2ns (d) DCK0=4ns
Fig. 5: Normalized slack improvement for D0=1ns, (a) DCK0=0.5ns (b) DCK0=1ns (c) DCK0=2ns and (d) DCK0=4ns.

but the clock tree is wire dominated and resulting timing B. Simulation Results
slack is: Fig. 5 shows simulation results.
ܵሺ‫ݐ‬ሻ ൌ  െ‫ܬ‬ሺ‫ݐ‬Ǣ ͲǤʹߙ஽ ǡ ‫ܦ‬଴ ሻ ൅ ‫ܬ‬ሺ‫ ݐ‬െ ‫ܦ‬஼௄଴ Ǣ ͲǤʹߙ஽ ǡ ‫ܦ‬଴ ሻǤ (18) • αRO = αCK = αD: As expected, this case shows maximum
achievable slack improvement for all the noise frequencies
A. Simulation Setup since S(t)=0 as in (16). Normalized slack improvement in
this case is:
We design JPEG encoder engine [7] with Synopsis 28nm
PDK (supply voltage of 1.05V). We use Design Compiler and ௠௔௫ሼ௃ሺ௧Ǣఈವ ǡ஽బ ሻሽ
IC Compiler for synthesis and place & route respectively. After ‘”̴•Žƒ…̴‹’”ሺ݂ሻ ൌ ͳͲͲ ൈ Ǥ (22)
ି଴Ǥ଴ହሼఈವ ሺି଴Ǥ଴ହሻሽ஽బ
obtaining critical path netlist with our custom scripts, we
convert Verilog netlist to SPICE compatible netlist and We omit Δv(t) = 0.05sin(2πft) for brevity. Note that J(t; αD ,
simulate this with parasitic RC to obtain delay sensitivity D0) becomes zero at every f = n/D0 (n is positive integer) since
αD(Δv) for the critical path as shown in fig. 3. Then, we build the integration of the sinusoidal function becomes zero
behavioral model for each test case and apply 100mV peak-to- between t-D0 and t for f = n/D0 (n is positive integer). Also,
peak sinusoidal supply noise 0.05sin(2πft) by sweeping noise max{J(t; αD , D0)} decreases as the noise frequency becomes
frequency (f). We perform simulations varying DCK0 = 0.5ns, higher since the integration over D0 on a sin function becomes
1ns, 2ns and 4ns while fixing D0 = 1ns. We measure the timing smaller at a higher noise frequency. In this case, slack
slack improvement as shown in fig. 4: improvement is just a function of f, αD and D0 not DCK0 since
there is no clock period modulation through the clock tree
•Žƒ…̴‹’” ൌ ݉ܽ‫ݔ‬ሼο‫ܦ‬ሺ‫ݐ‬ሻሽ െ ݉ܽ‫ݔ‬ሼെܵሺ‫ݐ‬ሻሽ (19) (non-clock tree modulation) as αRO = αCK. Note that the graphs
for this case in fig. 5(a)–(d) are identical regardless of DCK0.
where ΔD(t) = D(t) – D0 = J(t;αD, D0) and –S(t) is negative
timing slack. Then the slack improvement can be maximized • αRO = αCK = 0.8*αD: Normalized slack improvement in this
when S(t)=0 resulting maximum achievable slack improvement: case can be obtained by substituting (12) and (17) for (21):

௠௔௫ሼ௃ሺ௧Ǣ଴Ǥ଼ఈವ ǡ஽బ ሻሽ
ƒš̴ƒ…Š‹‡˜ƒ„Ž‡̴•Žƒ…̴‹’” ൌ ƒšሼο‫ܦ‬ሺ‫ݐ‬ሻሽ (20) ‘”̴•Žƒ…̴‹’”ሺ݂ሻ ൌ ͳͲͲ ൈ Ǥ (23)
ି଴Ǥ଴ହሼఈವ ሺି଴Ǥ଴ହሻሽ஽బ

We normalize the measured timing slack improvement by


This shows that we can only achieve 80% of maximum
dividing delay increase under 50mV (half of the peak-to-peak
achievable slack improvement as shown in fig. 5. The clock
noise) DC supply voltage drop. The delay increase can be
period modulation ( ΔTCKIN(t) = TCKIN(t) –D0 = J(t; αRO, D0) )
obtained by substituting -50mV for Δv(t) in (2) resulting in -
from the RO is 80% of J(t; αD, D0) due to αRO = 0.8*αD. And
0.05{αD(-0.05)}D0. Note that the above value is positive value
the generated clock is, then, delivered to the leaf node without
since the typical delay sensitivity is negative value (delay
increases as supply voltage decreases). Then the normalized clock tree modulation since αRO = αCK. The graphs for this case
slack improvement can be the function of f: are also identical regardless of DCK0 due to non-clock tree
modulation.
‘”̴•Žƒ…̴‹’”ሺ݂ሻ ൌ
• αRO = αD, αCK = 0.8*αD: Normalized slack improvement in
௠௔௫ሼο஽ሺ௧ሻሽି௠௔௫ሼିௌሺ௧ሻሽ
this case can be obtained by substituting (12) and (18) for
ͳͲͲ ൈ ቚ (21) (21):
ି଴Ǥ଴ହሼఈವ ሺି଴Ǥ଴ହሻሽ஽బ ο௩ሺ௧ሻୀ଴Ǥ଴ହ௦௜௡ሺଶగ௙௧ሻ

2017 Design, Automation and Test in Europe (DATE) 1507


3
Critical Path
2.5 Clock tree, Max buf: x4
Clock tree, Max buf: x8
2 Clock tree, Max buf: x16
1.5

0.5
-0.3 -0.2 -0.1 0 0.1 0.2 0.3
ΔV (V)
Fig. 8: Normalized delay.

Fig. 6: (a) S(t) = 0 at every f = n/(DCK0) (n : positive integer) (b) S(t) = -J(t;
0.4αD, D0) at every f = n/(2DCK0) (n : positive odd integer) for αRO = αD,
αCK = 0.8*αD.

Fig. 9: IC compiler snapshot showing clock tree for JPEG encoder for max
buffer size is (a) x16 and (b) x4

Also note that these frequencies become lower when clock


distribution delay DCK0 becomes larger. This is noteworthy as
clock distribution delay tends to increase in today’s high
Fig. 7: Decision tree to choose the optimal design. frequency microprocessors and significant first droop noise
occurs at around 200Mhz [1]. Therefore, it suggests that clock
‘”̴•Žƒ…̴‹’”ሺ݂ሻ ൌ generation only mimicking the critical path delay does not
guarantee optimum solution. In order to get fully benefit from
௠௔ ௫ሼ௃ሺ௧Ǣఈವ ǡ஽బ ሻሽି௠௔௫ሼ௃ሺ௧Ǣ଴Ǥଶఈವ ǡ஽బ ሻି௃ሺ௧ି஽಴಼బ Ǣ଴Ǥଶఈವ ǡ஽బ ሻሽ CDC effect, delay sensitivity of clock tree should be well
ͳͲͲ ൈ . (24)
ି଴Ǥ଴ହሼఈವ ሺି଴Ǥ଴ହሻሽ஽బ matched with that of critical path delay.

Interesting observations can be found in this case. First, this VI. CDC AWARE CLOCK TREE SYNTHESIS
case shows 100% slack improvement at DC noise. Since αRO is In this section, we propose simple CDC aware CTS
the same with αD, PSNICP generated from the RO exactly algorithm to make gate dominant clock tree design in standard
follows PSNID of the critical path. At DC supply voltage design flow. We take the same design with the previous section,
droop, clock period modulation through the clock tree is but use measured delay sensitivity for clock tree (αCK) instead
negligible, thus, it achieves 100% slack improvement. of assuming αCK = αD or αCK = 0.8*αD like previous session.
Second, the slack improvement can be maximized at every We also use measured D0 and DCK0. For adaptive clock
f = n/DCK0 (n is positive integer) since S(t) becomes 0. The two generation, we assume clock is being generated with the same
terms, J(t; 0.2αD , D0) and J(t- DCK0; 0.2αD , D0) in (18), are delay sensitivity of the critical path delay. Note, design of
cancelled out at these frequencies as shown in fig. 6(a). Empty adaptive clock generation is beyond the scope of this paper.
circles marked in fig. 5 show this phenomenon. A. CDC Aware CTS
Third, the slack improvement is reduced by 40% from the To change delay sensitivity of the clock tree, we propose to
maximum achievable slack improvement at every f = n/(2DCK0) control available clock buffer list for CTS while maintaining
(n is positive odd integer). As shown in fig. 6(b), at every f = maximum transition parameter (max_transition) unchanged
n/(2DCK0) (n is positive odd integer), the amplitude of timing during place and route. We initially perform the design flow as
slack is maximized, thus, resulting in S(t) = -J(t; 0.4αD, D0). in fig. 3 with normal clock buffer list and check the timing
Slack improvement, then, becomes max{J(t; 0.6αD, D0)} which slack and power budget given PSN and derating factor. And
is 60% of maximum achievable slack improvement. The we remove maximum strength buffer one by one and perform
frequencies in this condition are marked with filled circles in the design flow again with updated clock buffer list until the
fig. 5. This is interesting in that slack improvement can be even timing slack tends to degrade and power consumption for the
lower than αRO = αCK = 0.8*αD case around these frequencies. clock tree violates power budget. Entire decision tree is shown
in fig. 7.

1508 2017 Design, Automation and Test in Europe (DATE)


0 0

-50 -50

(ps)
(ps)

-100 -100 αCK0=αD 105ps


αCK0=αD 105ps Max buf: x4 , DCK0=528ps
Max buf: x4 , DCK0=528ps -150
-150 Max buf: x8 , DCK0=394ps
Max buf: x8 , DCK0=394ps
Max buf: x16 , DCK0=330ps
Max buf: x16 , DCK0=330ps
-200 -200
1.00E+07 1.00E+08 1.00E+09 0 0.05 (a)
0.1 Derate
0.15 = 1 0.2 0.25 0.3
Noise frequency (Hz) 0 P2P Voltage (V)
Max buf: x16
Fig. 10: Timing slack with 300mV peak-to-peak voltage with D0= Max buf: x8
-50
TCK0=1.35ns Max buf: x4

(ps)
-100
B. Implementation Results
JPEG encoder is implemented with buffer list {x1, x2, x4, -150 85ps
x8, x16} first and then with {x1, x2, x4, x8}, {x1, x2, x4} and Optimal design
{x1, x2} in 28nm Synopsys PDK (supply voltage = 1.05V). -200
The implementation results show D0 = TCK0 = 1.35ns. Fig. 8 0 0.05 0.1 0.15 0.2 0.25 0.3
P2P Voltage (V)
shows normalized delay for critical path delay and clock tree
distribution delay when the maximum buffer size is limited to (b) Derate = 0.9
x16, x8 and x4. We observe that clock tree becomes wire Fig. 11: Timing slack varying peak-to-peak voltage with derating factor is
(a) 1 and (b) 0.9
dominated as we use large buffers. CTS with small sized
buffers gives longer clock distribution delay since the depth of technique.
the clock tree becomes larger as shown in fig. 9.
VII. CONCLUSION
Fig. 10 shows timing slack when we apply 300mV peak-to-
peak sinusoidal PSN. We observe that over 100ps of timing This paper addresses careful clock tree design is needed for
slack is improved by simply synthesizing clock tree with adaptive clock generation to maximize CDC effect. To this end,
limited available clock buffer list. first, we analytically show how the timing slack is impacted by
the clock tree design in adaptive clocking for both ideal case
Fig. 11 shows minimum timing slack over the noise and generalized case. Second, frequency domain timing slack
frequencies varying peak-to-peak voltage of sinusoidal PSN analysis for several cases is provided. And finally, we propose
with various derating factors. We see that timing slack for simple and efficient way to control delay sensitivity for clock
limited buffer list degrades more than normal case with tree to maximize the CDC effect. Synopsys 28nm
increased derating factor since the clock distribution delay for implementation results for JPEG encoder show 85ps timing
limited buffer list is longer than that for the normal buffer list. slack (equivalent 40mV voltage margin) improvement given
Thus, optimal design should be chosen based on PSN and 300mV voltage droop and derating factor of 0.9. Our
derating factor. CTS with normal buffer list gives optimum observation is important in that the clock tree distribution delay
design for lower PSN while CTS with limited buffer list give is becoming longer and it is wire dominated while most of the
better design for larger PSN given derating factor. critical path delay is gate dominated.
Table I summarizes clock tree implementation results for
various clock buffer list cases. As we limit maximum buffer ACKNOWLEDGMENT
size, clock tree distribution delay, skew, clock tree level and This material is based on work supported in part by
number of clock tree buffers tend to increase as expected. We National Science Foundation under grant CCF#1417323.
also observe changing clock buffer size doesn’t impact much
on the power consumption except x2 case. Even if we use large REFERENCES
number of clock buffers with limited clock buffer list, power [1] K. L. Wong et al., “Enhancing microprocessor immunity to power
consumption doesn’t change much since unit power supply noise with clock-data compensation,” JSSC, 2006.
consumption of small size buffer is lower than that of large size [2] N. Kurd et al., “Haswell: A family of IA 22 nm processors,” JSSC,
buffer offsetting increased number of clock buffers. This shows 2015.
we can make gate dominant clock tree design without [3] S. Das et al., “RazorII: In Situ Error Detection and Correction for PVT
and SER Tolerance,” JSSC, 2009.
additional power overhead by carefully using our proposed
[4] K. Chae et al., “A Dynamic Timing Error Prevention Technique in
Pipelines With Time Borrowing and Clock Stretching,” TCAS, 2014
TABLE I
[5] A. Grenat, et al., “Adaptive Clocking System for Improved Power
CLK TREE IMPLEMENTATION SUMMARY
Efficiency in a 28nm x86-64 Microprocessor,” ISSCC, 2014.
Max buffer size x16 x8 x4 x2
[6] J. Kwak and B. Nikolic, “A Self-Adjustable Clock Generator With Wide
Longest Path (ps) 330 394 528 703 Dynamic Range in 28 nm FDSOI,” JSSC, 2016
Skew (ps) 68 114 107 153
[7] OpenCores [Online]. Available: http://opencores.org/
# total level 8 8 10 19
# CT Buffers 472 475 826 2039 [8] T. Na and S. Mukhopadhyay, “Behavioral Modeling of Timing Slack
Variation in Digital Circuits Due to Power Supply Noise”, DATE, 2016.
Power (mW) 19.8 19.4 19.3 21.5

2017 Design, Automation and Test in Europe (DATE) 1509

Das könnte Ihnen auch gefallen