A synchronous system needs a clock to which all signals are synchronized.
Clock distribution network: the goal is to deliver a clock signal that arrives at the same time at equivalent points on the chip.
Problem: clock skew.
Clock skew: mismatches in wire delays cause differences in clock arrival times at equivalent points on the chip.
The arrival time of the clock can only be predicted to within the skew: clock arrival = Y +/- skew.
Clock skew must be accounted for in the timing budget when determining delay paths to meet setup/hold constraints.
Clock skew is THE problem. Everything else, such as the power cost of the clock distribution system, comes from trying to solve the clock skew problem.
BR 6/07
Clock Distribution
[Figure: clock distribution from the clk driver. Each branch is labeled with a 100 ps delay from the 0 ps reference point.]
Clocking Regions
[Figure: clocking regions. Delays labeled 102 ps, 99 ps, 97 ps, 100 ps and 100 ps; local skew: 6 ps.]
The chip is divided into regions. The further a signal has to travel, the larger the skew budget.
Skew: Flip-Flops
[Figure: flip-flops F1 and F2 separated by combinational logic (CL), with a timing diagram of clk, Q1 and D2 showing Tc, tpcq, tpdq, tsetup and tskew.]
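The symbols in the flip-flop figure combine into a setup-time budget for the path from F1 to F2. The sketch below assumes the common textbook form Tc >= tpcq + tpd + tsetup + tskew, where tpd (the combinational logic delay) is a name introduced here, and all numeric values are illustrative rather than taken from the slides.

```python
def max_logic_delay_ps(t_c, t_pcq, t_setup, t_skew):
    """Largest combinational delay that still meets setup at F2.

    Assumes the flip-flop setup constraint
        Tc >= tpcq + tpd + tsetup + tskew
    so skew is subtracted directly from the time left for logic.
    """
    return t_c - (t_pcq + t_setup + t_skew)


# Illustrative numbers only (not from the slides), all in ps.
budget_no_skew = max_logic_delay_ps(t_c=1000, t_pcq=50, t_setup=60, t_skew=0)
budget_with_skew = max_logic_delay_ps(t_c=1000, t_pcq=50, t_setup=60, t_skew=100)
print(budget_no_skew, budget_with_skew)
```

Under these assumed numbers, 100 ps of skew removes 100 ps from the time available to the logic, which is why skew is charged against the timing budget.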
Alpha 21064 Die Photo (1993)
Single clock driver; the 2 transistors of the buffer are visible to the naked eye. Clocking scheme was 2-phase, single-wire.
Max clock skew approx. 180 ps (3.6% of the 5 ns clock period). One gate delay is about 300 ps, so clock skew is more than half of a gate delay (180/300 = 0.6).
Note that the skew is smallest closest to the center of the chip, where the driver is located.
Max clock skew approx. 80 ps (2.4% of the 3.3 ns clock period). One gate delay is about 240 ps, so clock skew is about 1/3 of a gate delay.
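The skew figures quoted for the two designs can be checked directly from the stated skew, clock period and gate delay. This is a small arithmetic sketch; the function name and the labels "first" and "second" are chosen here for illustration.

```python
def skew_ratios(skew_ps, period_ps, gate_delay_ps):
    """Return skew as a fraction of the clock period and of one gate delay."""
    return skew_ps / period_ps, skew_ps / gate_delay_ps


first = skew_ratios(180, 5000, 300)   # 180 ps skew, 5 ns period, 300 ps gate
second = skew_ratios(80, 3300, 240)   # 80 ps skew, 3.3 ns period, 240 ps gate

for name, (of_period, of_gate) in (("first", first), ("second", second)):
    print(f"{name}: {of_period:.1%} of period, {of_gate:.2f} gate delays")
```

The absolute skew fell from 180 ps to 80 ps between the two designs, and the skew measured in gate delays fell as well.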
Universal availability of clock signals.
Design teams can proceed in parallel since clock constraints are well known.
Good process-variation tolerance.
The disadvantage is the extra capacitance of the grid.
Power-performance tradeoff is determined by choice of skew target, which establishes the needed grid density, which determines the clock driver size.
State elements and clocking points were 0 to 8 gates past GCLK.
Six major regional clocks, two gain stages past GCLK, with grids juxtaposed with GCLK but shielded from it.
Major clocks drive local clocks and conditional clocks
Window-pane arrangement: same skew to all panes. Note the redundant drive to clock nets.
PLLs are analog circuits that use a charge pump and a voltage controlled oscillator (VCO) to perform phase alignment
The Alpha 21264 PLL used a separate, regulated 3.3 V supply and was located in the corner of the chip to minimize noise impact.
Section 9.5.2 of the Rabaey text has a block diagram of a PLL.
All high performance CPUs and most ASICs now include a PLL for internal clock generation
Global clock grid. Uses 3% of M3/M4 routing layers (lines in picture are misleadingly thick).
Lateral shielding via Vss/Vdd prevents clock noise from coupling into signal lines. Clock wires and lateral shields were manually placed
Major clocks saved power over a single global clock because they service a lighter load and the distribution area is smaller. Both of these mean that smaller drivers are needed.
GCLK + major clocks used 24 W @ 2.2 V, 600 MHz. It is estimated that at least 40 W would have been required if only global clocks were used. 10%-90% rise/fall times were targeted at < 320 ps.
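To give a sense of scale for the 24 W and 40 W figures, the sketch below converts them to an effective switched capacitance using P = C * Vdd^2 * f. It assumes all clock power is dynamic and that every clock node toggles once per cycle; neither assumption comes from the slides, so the capacitance values are rough estimates only.

```python
def switched_capacitance_nf(power_w, vdd_v, freq_hz):
    """Effective switched capacitance from P = C * Vdd^2 * f, in nF.

    Assumes all clock power is dynamic and every node toggles each cycle.
    """
    return power_w / (vdd_v ** 2 * freq_hz) * 1e9


VDD, FREQ = 2.2, 600e6
hierarchical = switched_capacitance_nf(24, VDD, FREQ)  # GCLK + major clocks
global_only = switched_capacitance_nf(40, VDD, FREQ)   # estimated lower bound
saving = 1 - 24 / 40                                    # fraction of power saved

print(f"{hierarchical:.2f} nF vs {global_only:.2f} nF, saving {saving:.0%}")
```

Under these assumptions the hierarchical scheme switches roughly 8 nF per cycle instead of nearly 14 nF, a power saving of at least 40%.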
Major clock grids. The densest major clock grids used up to 6% of M3/M4 routing. White areas are serviced by local clocks; local clocks are also present in major clock grids.
There were 60,000 local clock nodes, and all were analyzed with SPICE using minimum and maximum gate capacitance estimates.
Some local clocks had very high min/max delay variation tolerances (up to 280 ps)
Technology
0.18 µm CMOS
25.4 million transistors
6 metal layers
Flip-chip with 1014 pads
Recall that the Alpha 21264 had 15.2 million transistors.
Both the 2X reference clock and the core clock are distributed across the die via an H-tree.
Routed in M5/M6.
Fully laterally shielded with Vss/Vdd.
Inductive reflections are minimized at branch points by sizing wires to match impedances.
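Impedance matching at a branch point can be illustrated with the standard transmission-line reflection coefficient. The sketch below assumes an ideal lossless line that splits into identical branches, which appear in parallel at the junction. The 50 ohm value is a hypothetical example, not a figure from the design.

```python
def reflection_coefficient(z_parent, z_branch, n_branches=2):
    """Reflection seen by a wave arriving at a junction of identical branches.

    The branches appear in parallel, so the load is z_branch / n_branches.
    """
    z_load = z_branch / n_branches
    return (z_load - z_parent) / (z_load + z_parent)


Z0 = 50.0  # hypothetical parent-wire impedance in ohms

print(reflection_coefficient(Z0, Z0))      # same-impedance branches: mismatch
print(reflection_coefficient(Z0, 2 * Z0))  # branches at 2 * Z0: matched
```

In this idealized model, each of the two branches needs twice the impedance of the parent wire for the reflection to vanish, which in practice means narrower wires after each split.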
Inductance adds extra delay in the current return path.
Inductive effects decreased clock buffer delays due to faster transition rates.
This does NOT account for delays due to on-die process variations.
At GHz clock speeds, skew due to on-die process variations can cause timing failures.
IA64 used an active distributed deskewing approach for GCLK and the regional clocks.
The designers wanted to avoid the detailed delay matching and timing analysis that the Alpha design required after complete implementation, because of its impact on the design schedule.
The approach also accounts for delay due to on-die process variations.
Feedback clock!!! A delay circuit is used to control edge alignment of the global clock with the regional clock. In general, this is a form of digital Delay Locked Loop (DLL). Any form of PLL/DLL must have feedback for correction!
Decoupling caps
Deskew Register adjusted every 16 clock cycles of Reference Clock. The Deskew buffer is just a simple form of a Delay Locked Loop (DLL).
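The feedback behavior of a deskew buffer can be sketched as a simple bang-bang loop: a phase detector reports only whether the regional clock is early or late, and the deskew register moves one step per update. This is a toy model, not the IA-64 circuit. The step size, register range and initial skew are hypothetical, and each loop iteration stands for one register update (one 16-cycle comparison window in the text above).

```python
def run_deskew(initial_skew_ps, step_ps, updates, setting=0, lo=-10, hi=10):
    """Bang-bang deskew loop: one register update per phase comparison.

    initial_skew_ps is how late the regional clock arrives relative to the
    reference when the delay setting is 0. Each register step shifts the
    regional clock by step_ps. Returns the residual skew after each update.
    """
    history = []
    for _ in range(updates):
        skew = initial_skew_ps + setting * step_ps
        if skew > 0:            # regional clock late: remove delay
            setting = max(lo, setting - 1)
        elif skew < 0:          # regional clock early: add delay
            setting = min(hi, setting + 1)
        history.append(initial_skew_ps + setting * step_ps)
    return history


# Hypothetical values: 28 ps initial skew, 8 ps per register step.
residual = run_deskew(initial_skew_ps=28, step_ps=8, updates=8)
print(residual)
```

In this model the loop walks the skew down and then dithers within one delay step of zero, so the residual skew is bounded by the step size plus the phase detector uncertainty.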
It was not possible to maintain a balanced routing network and load matching for the core clock over such a large design with multiple design teams, since the core clock was driving logic.
However, it was possible to design a balanced routing network with load matching for the reference clock, since all it drove were the DSKs and the global clock design team was solely responsible for the reference clock design.
Feedback clocks from the regional clock distribution were then used to deskew the regional clocks with respect to the reference clock.
Skew Elements
Total skew of the design is based on residual skew in the reference clock, uncertainty of the phase detector in the DSK, and mismatches of the feedback clocks.
The reference clock did not have as large a distribution region as the core clock, and its loads were better matched, so it had tighter skew than would have been possible with the global clock.
Feedback clock routes were kept short with respect to the DSKs.
Phase detector uncertainty was kept small via symmetric layout techniques and by allowing a long time for phase comparison.
Local Clocks
Local clocks are generated from regional clocks and provide the clocks needed by domino logic.
Full timing analysis was performed on local clocks.
Local clocks were the responsibility of the functional block design teams.
Global and regional clocks were the responsibility of the global clock design team.
Delay was added for time borrowing or to account for skew in the local clock.
If the shortest path from G1 to G2 is less than the max skew, then an incorrect value may get clocked into G2 when the clock edge arrives at G2.
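This race condition can be written as a simple check. The sketch below assumes the worst case, in which G2's clock edge arrives a full max skew after G1's. The optional hold-time parameter is an addition for illustration and is not mentioned in the slide.

```python
def hold_violation(shortest_path_ps, max_skew_ps, t_hold_ps=0):
    """True if new data from G1 can reach G2 before G2's late clock edge.

    Assumes the worst case: G2's clock arrives max_skew_ps after G1's.
    t_hold_ps is an optional hold time for G2 (an added assumption).
    """
    return shortest_path_ps < max_skew_ps + t_hold_ps


print(hold_violation(shortest_path_ps=60, max_skew_ps=100))   # True: race
print(hold_violation(shortest_path_ps=150, max_skew_ps=100))  # False: safe
```

When the check fails, the usual remedy is to pad the short path with delay until it exceeds the skew.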
LCB = local clock buffer. Common reference means in same DSK cluster
Comments
Active de-skewing used in the 1st generation was jettisoned in the 2nd generation.
The 2nd generation just used a balanced H-tree.
Difficult to route this type of structure; all clock routing was reserved prior to block layout.
Differential clocks used for 2nd-level clock distribution reduced jitter.
Non-active de-skew is easier to test and has more deterministic behavior.
Intentional clock skewing for time borrowing is easier.
differential clocks
Gated clocks
This level reduces inductive effects. Locates gnd current return close to clock lines.
Fuse-Based De-skewing
69 fuses controlling 23 clock zones. Delay increments of 30.5 ps over a 220 ps range. Exhaustive search for the best fuse settings is not possible, so a generic search algorithm with statistical history is used; done during production sort.
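A quick count shows why exhaustive search over fuse settings is impractical. The sketch below assumes each fuse is a binary choice and that the 69 fuses are split evenly across the 23 zones. The tester rate is a hypothetical number used only to convey scale.

```python
FUSES = 69
ZONES = 23

total_settings = 2 ** FUSES            # every fuse is either blown or intact
fuses_per_zone = FUSES // ZONES        # assumes fuses are split evenly
settings_per_zone = 2 ** fuses_per_zone

# Hypothetical tester speed, used only to give a sense of scale.
MEASUREMENTS_PER_SECOND = 1_000_000
years = total_settings / MEASUREMENTS_PER_SECOND / (3600 * 24 * 365)

print(total_settings, settings_per_zone, f"{years:.1e} years")
```

Even at a million measurements per second, stepping through every combination would take millions of years per die, so a guided search is the only practical option.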
90 nm IA Microprocessor (2003)
Global clock distribution scaled up to 6 GHz
Used a clock distributed by an H-tree, but shorted the clock nodes at about every third level in order to reduce the skew. No active de-skew or fuse-based de-skew.
active de-skew
Xeon Dual-core (2006/65 nm) 3.4 GHz (Tulsa): two different clock systems
Core clocks (clocks for the processor cores): use the same core clock scheme as the Xeon Single Core (2003/90 nm). This clock scheme was designed to scale up to 6 GHz and used an H-tree distributed clock with shorted nodes that produced less than 10 ps skew. No active de-skew or fuse-based de-skew.
Un-core clock (everything outside the cores: cache, bus logic, etc.): the large area prevented use of a gridded clock (power restriction), so a clock tree (9 vertical, 2 horizontal) was used, with fuse-based de-skew at the root of each vertical spine. Achieved less than 11 ps skew.
Clock Domains
Clock Distribution
Fuse-based deskew buffers are located at the root of the vertical MCLK spines.
Clock Hierarchy
Core and un-core clocks are aligned; this just de-skews the data.
Global Skew
Skew < 10 ps
Power