Clock Distribution: Rajeev Murgai

Clock Distribution
Rajeev Murgai
Advanced CAD Technologies
Fujitsu Labs of America
UC Berkeley
Feb 15, 2005
Defining Clock Skew and Jitter
Clock skew
The deterministic (knowable) difference in clock arrival times at each flip-flop

Caused mainly by imperfect balancing of clock tree/mesh
Can be deliberately introduced using delay blocks in order to time-borrow
Accounted for in STA by calculating the clock arrival times at each flip-flop
Clock jitter
The random (unknowable, except for its distribution) difference in clock arrival times at each flip-flop
Caused by on-die process, Vdd, and temperature variation, PLL jitter, crosstalk, static timing analysis (STA) inaccuracy, and layout parameter extraction (LPE) inaccuracy
Accounted for in STA by subtracting the jitter (~3σ) from the cycle time in long path analysis, and adding it to the receiving clock arrival time in race analysis
Jitter is always bad; skew can be helpful or harmful.
Clock uncertainty = skew + jitter

[Figure: long path analysis (flip-flops F with logic between them, clock skew, -jitter) and race analysis (flip-flops F, clock skew, +jitter)]
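A minimal sketch of how the long path and race checks use skew and jitter. It assumes skew is measured as the receiving clock arrival minus the launching clock arrival, and that all times are in one unit (ps here). The function names and numbers are illustrative and do not come from the slides.

```python
def long_path_slack(t_cycle, logic_max, t_setup, skew, jitter):
    """Setup slack: jitter is subtracted from the cycle time.

    skew = receiving clock arrival - launching clock arrival (assumption).
    """
    return (t_cycle - jitter) + skew - logic_max - t_setup


def race_slack(logic_min, t_hold, skew, jitter):
    """Hold slack: jitter is added to the receiving clock arrival time."""
    return logic_min - t_hold - (skew + jitter)


# Illustrative numbers in ps.
print(long_path_slack(t_cycle=1000, logic_max=800, t_setup=50, skew=30, jitter=60))  # 120
print(race_slack(logic_min=100, t_hold=20, skew=30, jitter=60))  # -10
```

With these numbers the same positive skew helps the long path and hurts the race check, while jitter reduces both slacks.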
Background
Technology scaling results in:
-- higher clock frequencies possible and requested by users
-- prominence of wiring parasitics (R, L, C) in electrical behavior
-- increasing noise impact on delays
-- increasing on-chip process variation impact on delays
Existing ASIC clock synthesis flows
-- Use tree architectures: not best for low skew, jitter, variations
-- Don't properly address noise issues
-- Rely on STA to calculate the delays through clock networks
-- Use inaccurate wiring models
-- Use noise-sensitive clock circuit topologies
-- Ignore or crudely estimate process/voltage/temperature variations
-- Don't have tight integration of physical synthesis & clock synthesis
Result
-- Predictability of clock delay is poor: clock uncertainty (i.e., skew + jitter) of 400ps is not uncommon
-- Maximum attainable clock frequency is impaired
Problems with Existing Clock Methodologies
Tree-based Clock Distribution
Low power but...
Sensitive to mismatched branches; difficult to lay out
Sensitive to noise, especially if wires are not shielded
Using STA to calculate tree timing results in large errors
=> high skew and jitter

[Figure: tree-based clock distribution from a PLL to flip-flops (F); flip-flop pairs are labeled with small, medium, and large skew and jitter]
Problems with Static Timing Analysis (STA)

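What we have...
[Figure: signal wire with elements labeled R, Cs, and Cg]
What STA uses...
[Figure: lumped model with driver resistances Rup and Rdn, wire resistance Rwire, wire capacitance split as Cw/2 + Cw/2, and load Cload]
Note: driver model is a little better than this with table look-up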
Other problems
Cw can match either delay or slew, but not both
interpolation using look-up tables
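To make the lumped model concrete, here is a small sketch that evaluates the delay of the "What STA uses" network, assuming an Elmore (first-moment) delay estimate. The slide does not say which delay metric the STA tool applies, so the metric, the function name, and the numbers are assumptions. Resistances are in ohms and capacitances in fF, which gives the delay in fs.

```python
def pi_model_elmore_delay(r_drv, r_wire, c_wire, c_load):
    """Elmore delay at the load for driver R + wire pi-model (Cw/2, Rwire, Cw/2) + Cload."""
    c_near = c_wire / 2            # half of the wire capacitance at the driver output
    c_far = c_wire / 2 + c_load    # other half plus the load at the far end
    # The driver resistance charges everything; the wire resistance charges only the far end.
    return r_drv * (c_near + c_far) + r_wire * c_far


# Illustrative values: 100 ohm driver, 50 ohm wire, 200 fF wire, 100 fF load.
print(pi_model_elmore_delay(100, 50, 200, 100))  # 40000.0 fs, i.e. 40 ps
```

A single lumped Cw like this has one free parameter, which is consistent with the point above that it can be fitted to delay or to slew but not to both.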
Clock Distribution Architectures
Two basic architectures
Tree
Grid (mesh)
Hybrids of tree and mesh
Tree + crosslinks
Mesh + local trees
Tree
Widely used in ASICs
Advantages
-- Low cost: wiring, capacitance, power
-- Clock gating easy
Disadvantages
-- Difficult to balance path delays due to asymmetric FF distribution
-- Sensitive to variations
Topologies
-- Symmetric H-tree
-- Asymmetric trees
[Figure: clock tree driving flip-flops]
CAD for Tree Architecture
Topology generation
H-tree: widely used

Method of means and medians (MMM) [Jackson et al. DAC 90]
Goal: reduce wirelength while minimizing skew.
Divide set S of points into Sleft and Sright, based on median.
| Sleft | = | Sright |
Connect/route center of mass (CM) of S to CM of Sleft and Sright.
Recurse on Sleft and Sright.
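The recursion above can be sketched in a few lines. This is a simplified illustration, not the published algorithm: it assumes the split direction simply alternates between x and y, represents each connection as a straight (parent, child) pair rather than a routed wire, and uses hypothetical function names.

```python
def center_of_mass(points):
    """Mean x and mean y of a non-empty list of (x, y) sinks."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def mmm(points, split_on_x=True):
    """Return (parent, child) segments joining centers of mass, top-down."""
    if len(points) <= 1:
        return []
    axis = 0 if split_on_x else 1
    ordered = sorted(points, key=lambda p: p[axis])
    half = len(ordered) // 2                 # median split: |Sleft| = |Sright| (+/- 1)
    left, right = ordered[:half], ordered[half:]
    root = center_of_mass(points)
    segments = [(root, center_of_mass(left)), (root, center_of_mass(right))]
    # Recurse on both halves, alternating the split direction (assumption).
    return segments + mmm(left, not split_on_x) + mmm(right, not split_on_x)


sinks = [(0, 0), (0, 2), (2, 0), (2, 2)]
for parent, child in mmm(sinks):
    print(parent, "->", child)
```

For four sinks on the corners of a square this produces a small H-shaped tree rooted at the center, with equal path length to every sink.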
Method of Means & Medians
Problem
-- May not result in zero skew
Solution
-- One-step look-ahead to decide the direction of splitting.
-- Estimate skews using the Penfield-Rubinstein model.
Other problems
-- Buffer insertion not handled.
-- Obstructions not handled.
Topology: Recursive Geometric Matching
[Kahng et al. DAC 91]
Bottom-up pair-wise merge algorithm
Optimum geometric matching on n points (minimum wirelength)
Determine center point of each match edge
Recurse on n/2 points
Uses path length skews
Tries to balance root to leaf path lengths.
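A rough sketch of the bottom-up merge structure described above. Note the substitution: it pairs points greedily by closest Manhattan distance instead of computing an optimum geometric matching, so it only illustrates the match, take-center, recurse loop. All names are hypothetical.

```python
from itertools import combinations


def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def match_level(points):
    """Greedily pair the closest points (stand-in for optimum matching)."""
    remaining = list(points)
    centers, edges = [], []
    while len(remaining) > 1:
        p, q = min(combinations(remaining, 2), key=lambda pq: manhattan(*pq))
        remaining.remove(p)
        remaining.remove(q)
        edges.append((p, q))
        centers.append(((p[0] + q[0]) / 2, (p[1] + q[1]) / 2))  # center of the match edge
    centers.extend(remaining)  # an odd point moves up unmatched
    return centers, edges


def recursive_matching(sinks):
    """Return (root, list of match edges per level), built bottom-up."""
    points, levels = list(sinks), []
    while len(points) > 1:
        points, edges = match_level(points)
        levels.append(edges)
    return points[0], levels


root, levels = recursive_matching([(0, 0), (0, 2), (10, 0), (10, 2)])
print(root, len(levels))  # (5.0, 1.0) 2
```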
Topology: Simulated Annealing
Topology generation
Cheng et al: improve initial topology by simulated annealing
effective in reducing delay
CAD for Tree Architecture
Routing & wire sizing
Tsay, TCAD 93: zero-skew routing
first paper to use Elmore delay as delay model
earlier work used pathlength
DME, planar DME
make faster paths slower by detours/snaking to match delays
may use wire-sizing: make slower paths faster

Wire spacing
Buffering
Tellez & Sarrafzadeh, TCAD 97
insert the minimum number of buffers on a given topology to meet skew and slew constraints.
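The zero-skew routing idea listed above can be illustrated with a single merge step under the Elmore delay model. The sketch places the merge point on a wire of given length joining two subtrees so that both branches have equal delay. The formula is obtained by equating the two branch delays; it is a sketch under these assumptions and not necessarily the exact formulation of the cited paper. The function names, per-unit wire parameters, and numbers are illustrative.

```python
def branch_delay(t_sub, c_sub, wire_len, r_unit, c_unit):
    """Elmore delay through a wire of wire_len into a subtree (delay t_sub, cap c_sub)."""
    return r_unit * wire_len * (c_unit * wire_len / 2 + c_sub) + t_sub


def zero_skew_split(t1, c1, t2, c2, length, r_unit, c_unit):
    """Fraction x of the wire on the subtree-1 side that equalizes both delays.

    x outside [0, 1] means no point on the wire balances the delays, so the
    faster side needs extra wire (detour/snaking).
    """
    r_total = r_unit * length
    c_total = c_unit * length
    return (t2 - t1 + r_total * (c2 + c_total / 2)) / (r_total * (c1 + c2 + c_total))


x = zero_skew_split(t1=5.0, c1=2.0, t2=4.0, c2=1.0, length=10.0, r_unit=0.1, c_unit=0.2)
d1 = branch_delay(5.0, 2.0, x * 10.0, 0.1, 0.2)
d2 = branch_delay(4.0, 1.0, (1 - x) * 10.0, 0.1, 0.2)
print(x, d1, d2)
```

Here the slower, heavier subtree 1 gets the shorter piece of wire, and both branch delays come out equal.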
Grid/Mesh
n x n uniform mesh
Distributed array of k x k buffers drives the mesh.
Buffers driven by global H-tree.
Flip-flops directly connected to the nearest mesh segment
Used in modern processors
Advantages
-- Excellent for low skew
-- Robust to variations
Disadvantages
-- Higher wiring area, capacitance, power
-- Difficult to analyze
Loops and redundancy
[Figure: clock source driving a mesh; flip flops attached to mesh segments]
Mesh
Sizing of clock distribution networks for high performance CPU chips
Desai et al., DEC [DAC 1996]
goal: size grid interconnect segments with constraints on clock latency and average current
assume: initial grid and interconnect sizes
width explicit => non-linear program; practical only for small networks/trees.
consider width as implicit & solve using a sequence of network problems.
Results: applied on clock networks of two actual processors: DC21046A and DC21164. Results for DC21046A:
275MHz clock
grid has 1 million edges, 15.5K drivers, 81K receivers
16% reduction in capacitance without increasing clock latency.
Runtime: 3 days.
Optimal Wire and Transistor Sizing for Circuits with Non-tree Topology
Vandenberghe et al., Stanford University [ICCAD 97]
RC circuit with tree topology => sizing problem is convex optimization
meshes have R loops; use dominant time constant as measure of delay
solve using semi-definite programming (quasi-convex function)
Hybrid Architecture: Tree + Cross-links
Reducing Clock Skew Variability via Cross Links
[Rajaram et al., DAC 2004]
tree + short-circuit some sink pairs => non-tree topology
clock signal propagates through multiple paths; reduces skew and skew variability between shorted sinks
reduces skew variability by 30-70%
very small wire-length penalty (2%) over tree topology
Drawback:
does not consider buffering
Hybrid Architecture: Mesh + Trees
Hybrid Structured Clock Network Construction [Su & Sapatnekar, ICCAD 01]
Hybrid clock topology
-- simple top-level global mesh
-- zero-skew local trees at bottom
Presents wire sizing scheme to achieve latency and skew reduction.
-- iterative LP to minimize wire width (area) of top-level mesh, given delay bound
-- uses Elmore delay t = G^-1 C
-- sensitivity-based post-layout clock tree tuning to reduce skew.
[Figure: mesh driven from the source with nodes a, b, c, d; node a annotated (a, CDa)]
Clock Architectures
Tree
-- low cost (wiring, power, cap)
-- higher skew, jitter than mesh
-- widely used in ASIC designs
-- clock gating easy to incorporate
Mesh
-- excellent for low skew, jitter
-- high power, area, capacitance
-- difficult to analyze
-- clock gating not easy
-- used in modern processors
Hybrid: tree + cross-links
-- low cost (wiring, power, cap)
-- smaller skew, jitter than tree
-- difficult to analyze
Hybrid: mesh + local trees
-- suitable for coarse mesh
Best architecture depends on the application
[Figures: clock source driving flip-flops through a tree, a mesh, a tree with cross-links, and a mesh with local trees]
Processors
Traditionally two hierarchies
-- Global clock network
-- Local clock network
Skew control
-- Global network: balanced trees or grids
-- Local network: de-skewing buffers
Pentium 4 [JSSC, Nov 2001]
0.18u, 6 metal layers, 42 million transistors
Core medium clock frequency: 2 GHz
Used by most core blocks
High speed scheduling and execution: 4GHz
Non critical blocks (e.g., bus interface logic): 1GHz
Global clock distribution
3 spines; each spine has binary clock distribution

jitter reduction schemes
low-pass RC-filtered power supply for clock drivers
shield clock wires
[Figure: clock source driving three spines]
IBM [JSSC 2001]
Same clock architecture for 6 chips (including PowerPC):
Design priorities: min. clock skew, sharp rise and fall times (below 100 ps for 1 ns clock), 50% duty cycle, low power consumption
Global buffered H-trees (on top 2 layers) drive sector buffers.
Each sector buffer drives a tunable tree, which drives the global mesh
-- length-matched
Tree wire-widths tuned to minimize skew over long distances
Mesh minimizes local skew by connecting nearby points directly.
For each chip, 10-20 complete tuning cycles
-- Buffer placement, wiring
Flip-flops connected to closest point on mesh
Global clock skew of 22ps
Inductance included in analysis
Mesh difficult to analyze due to loops
-- cut the mesh
[Figure: clock source, mesh, and flip flops]
Alpha, DEC [JSSC, Nov 98]
0.35u, 4 metal layers, 15.2 million transistors, 600 MHz at 2.2V
3 hierarchies in clock distribution
Global, major (regional) and local
Multi-level mesh
global: trees to global GCLK grid
Uses 3% of M3/M4 interconnect
M3/M4 shielding; M2, M4: Vdd/Vss
power = 16W; skew = 72ps
Major (regional)
six grids over execution units
use 6% of M3, M4
power = 14W
Local clock
tree structure, not shielded
conditional/unconditional clocks
less than 10ps skew; power = 15.6W
Clock simulation
AWE-reduction + SPICE
[Figure: PLL driving the GCLK grid]
Summary of Processor Clock Design
Three basic routing structures for global clock
H-tree
low skew, smallest routing capacitance, low power
Floorplan flexibility is poor:

Grid or mesh
low skew, increases routing capacitance, worse power
Alpha uses global clock grid and regional clock grids

Spine
Small RC delay because of large spine width
Spine has to balance delays; difficult problem
Routing cap lower than grid but may be higher than H-tree.
Clock structure   Clock skew   Capacitance/Layout area/power   Floorplan flexibility
H-tree            Low/medium   Low                             Low
Grid              Low          High                            High
Spine             Medium       Medium/high                     Medium
Elimination of Process-dependent Clock Skew in CMOS VLSI, Shoji [JSSC, Oct. 86]
Given two paths from clock source to FFs
Conventional design method
-- design paths such that skew between S1 and S2 is zero at a (fixed) process corner
-- TA + TB + TC = TD + TE (typical corner)
However, skew may not be zero at another process corner
Novel idea in the paper
-- design the two paths such that skew between S1 and S2 is zero for different process corners
For high-current process corner H,
-- TA(H) = TA * 1/fN; TB(H) = TB * 1/fP (fN, fP > 1)
Zero-skew condition at H
-- TA(H) + TB(H) + TC(H) = TD(H) + TE(H)
-- (TA + TC)/fN + TB/fP = TD/fN + TE/fP
-- (TE - TB)/fN = (TE - TB)/fP, after substituting TA + TC = TD + TE - TB from the typical corner
[Figure: CLK driving two inverter paths that end at S1 and S2]
Elimination of Process-dependent Clock Skew in CMOS VLSI, Shoji [JSSC, Oct. 86]
Either TE = TB or fN = fP.
But fN may not be the same as fP (for PH-NL process)
In general, TE = TB => TD = TA + TC.
Pull-up and pull-down delays of the two paths should be identical.
Determine NMOS & PMOS transistor widths of inverters to achieve this.
Results
-- 1.75 u process
-- Widths selected manually
-- Lead to very small skews at all process corners
Drawbacks
-- only analyzes two paths
-- assumes identical percentage delay variation for all NMOS (PMOS) devices
-- uses simplistic delay model; ignores wire cap
Optimal Clock Skew Scheduling
Long & short path constraints impose lower/upper bounds on skew.
long path analysis: aj ≥ ai + logic_max + tset_up - Tcycle
short path analysis: aj ≤ ai + logic_min - thold
Leads to a set of linear inequalities: ai - aj ≤ cij
Given a clock cycle, feasibility can be solved using a linear program, or more efficiently with Bellman-Ford shortest path [Fishburn TCAD90].
To compute the optimum clock cycle, either
-- perform binary search using the above feasibility check, or
-- perform parametrized shortest path [Tarjan et al.]
One challenge: realize each ai
Other objectives: minimize power or switching noise.
j
i
F
F
ai skew
clk
Logic
F
F
aj
25
Optimal Clock Skew Scheduling Tolerant to Process Variations [Neves & Friedman, 96]
Long path and short path constraints impose lower and upper bounds on skew.
long path analysis: aj ≥ ai + logic_max + tset_up - Tcycle
short path analysis: aj ≤ ai + logic_min - thold
Try to choose skews in the middle of the bounds for maximum protection against process variations.
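As a small worked example of choosing the middle of the bounds, the sketch below computes the permissible range of the skew aj - ai for one pair of flip-flops and returns its midpoint together with the margin on each side. It treats a single pair in isolation, which is a simplification of the cited scheduling method. Names and numbers (in ps) are illustrative.

```python
def permissible_range(logic_min, logic_max, t_setup, t_hold, t_cycle):
    """Lower and upper bound on skew = aj - ai for one launch/capture pair."""
    lower = logic_max + t_setup - t_cycle   # from long path analysis
    upper = logic_min - t_hold              # from short path analysis
    return lower, upper


def centered_skew(logic_min, logic_max, t_setup, t_hold, t_cycle):
    """Midpoint of the permissible range and the margin to either bound."""
    lower, upper = permissible_range(logic_min, logic_max, t_setup, t_hold, t_cycle)
    if lower > upper:
        raise ValueError("no permissible skew at this cycle time")
    return (lower + upper) / 2, (upper - lower) / 2


print(centered_skew(logic_min=200, logic_max=900, t_setup=50, t_hold=30, t_cycle=1000))
# (60.0, 110.0): target skew of 60 ps with 110 ps of margin on each side
```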
[Figure: flip-flops i and j with clock arrivals ai and aj, and logic between them]
