The Future of Electrical I/O for Microprocessors
Frank O'Mahony
frank.omahony@intel.com
Intel Labs, Hillsboro, OR USA
Outline
1TByte/s I/O: motivation and challenges
Circuit Directions
Channel Directions
Tool Directions
470Gb/s Prototype
Microprocessor Bandwidth Needs
As CPU core count increases, I/O bandwidth (BW)
requirements will increase for all segments
Current system bandwidth requirements (Y2010)
Client BW = ~50GB/s
Server BW = ~100GB/s
High-end Server BW = ~200GB/s
[Figures: server example; high-end server example]
Microprocessor Bandwidth Trends
Bandwidth drivers: CPU to memory, CPU to CPU, CPU to peripheral, CPU to I/O bridge
[Figure: bandwidth (GB/s, log scale 0.1 to 1000) vs. year (2000 to 2015), with 2X/2yrs. and 3X/2yrs. trend lines]
High-end microprocessors are expected to need ~1TB/s during the coming decade
Microprocessor I/O Power
Current system I/O power efficiency is 20-40pJ/bit
If I/O power efficiency doesn't improve during the next decade, then:
1TB/s x 20pJ/bit = 160W
System           BW         I/O Pwr. Eff.   I/O Pwr.
Client           ~50GB/s    20pJ/bit        8W
Server           ~100GB/s   20pJ/bit        16W
High-End Server  ~200GB/s   20pJ/bit        32W
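The power figures above follow from multiplying bandwidth in bits per second by energy per bit. A minimal sketch of that arithmetic, where the function name `io_power_watts` is illustrative and not from the slides:

```python
def io_power_watts(bandwidth_gbytes_per_s: float, energy_pj_per_bit: float) -> float:
    """I/O power in watts for a bandwidth in GB/s and an efficiency in pJ/bit."""
    bits_per_second = bandwidth_gbytes_per_s * 1e9 * 8  # bytes -> bits
    return bits_per_second * energy_pj_per_bit * 1e-12  # pJ -> J


# Bandwidth points taken from the slide; 20pJ/bit is the slide's assumed efficiency.
for name, bw in [("Client", 50), ("Server", 100), ("High-end server", 200), ("1TB/s target", 1000)]:
    print(f"{name}: {io_power_watts(bw, 20):.0f} W")
```

The factor of 8 (bytes to bits) is what turns 1TB/s at 20pJ/bit into 160W rather than 20W.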
I/O Energy Efficiency Trends
[Figure: energy efficiency (pJ/bit, log scale 1 to 100) vs. year (2002 to 2010), with a -20%/year trend line]
Issue: energy per bit is improving only ~20% per year while bandwidth is increasing 40-70% per year
Ref: R. Palmer, ISSCC 07
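To see how quickly the two trends diverge, the sketch below converts the 2X/2yrs. and 3X/2yrs. bandwidth trends into per-year rates and compounds them against a 20% per year energy improvement. It assumes both trends simply continue unchanged, which is an illustration and not a forecast from the talk; the function names are hypothetical.

```python
def annual_rate(multiplier: float, years: float) -> float:
    """Convert 'multiplier every N years' into an equivalent per-year growth rate."""
    return multiplier ** (1.0 / years) - 1.0


def relative_io_power(bw_growth: float, energy_reduction: float, years: int) -> float:
    """I/O power after `years`, relative to today, if both trends compound annually."""
    return ((1.0 + bw_growth) * (1.0 - energy_reduction)) ** years


slow = annual_rate(2, 2)  # 2X/2yrs. is roughly 41% per year
fast = annual_rate(3, 2)  # 3X/2yrs. is roughly 73% per year
for label, growth in [("2X/2yrs.", slow), ("3X/2yrs.", fast)]:
    ratio = relative_io_power(growth, 0.20, 10)
    print(f"{label}: {growth:.0%}/year, I/O power after 10 years = {ratio:.1f}x today")
```

Even on the slower trend the product of the two factors is above 1 each year, so total I/O power keeps rising.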
Energy Efficiency and Channel Loss Tradeoff
Power efficiency is strongly correlated to channel loss
Simply scaling per-pin BW will not meet power budget
Low-power interfaces should be wider, not faster
[Figure: energy efficiency (pJ/bit, log scale 1 to 100) vs. channel loss at the symbol rate (0 to 40 dB), based on transceivers reported 2006-2009 in 65-130nm CMOS]
Channel/Interconnect Density
Conventional package/socket density does not
scale with process
Width of interfaces is limited by routing
congestion
C4 pitch << Pkg. pin pitch
[Figure: flip-chip package; density vs. time for circuit area and C4 bump area]
Problem Statement Summary
Bandwidth needs are quickly approaching 1TB/s
Energy efficiency is not scaling as aggressively
as bandwidth
The channel limits our ability to increase per-pin
data rate and/or increase the width of an
interface
How Will Electrical I/O scale to 1TB/s?
1. Co-design the interconnects and I/O circuitry to
meet bandwidth, scalability and power
efficiency demands
2. Scale the channel by transitioning to new
channel configurations and materials
3. Use accurate, statistical link design tools to
identify balanced architectures.
Circuit Directions
Low Active Power Techniques
[Figure: power efficiency (mW/Gb/s, log scale 1 to 100) vs. channel loss at the symbol rate (0 to 40 dB)]
Power Optimized Links
Simple equalization
Low TX swing
Sensitive RX sampler
Low-power clocking
Minimize analog circuit complexity
[Figure: RX input network with L, R_TERM and C_PAD]
Ref: G. Balamurugan, JSSC 4/08
The lowest-power links simplify equalization and clocking circuitry to reduce power
Equalization examples:
Constrain equalization range by known channel characteristics
Continuous-time linear Rx equalizer
Power Management: Scalable supplies
Adapt supply to frequency, process, temperature (f,P,T)
Digital: Power $\propto V_{SUPPLY}^{2} \cdot f$
Analog: Power $\propto V_{SUPPLY} \cdot I_{bias}$
Removes excess circuit BW and headroom
[Figures: regulated-supply ring VCO (V_REF, V_REG(f,P,T), C_REG); data link with adaptive supply (VR, V_SUPPLY, TX, RX)]
Power Management: Scalable supplies
Power efficiency improves with adaptive supply/biasing
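Data rate   Supply   Energy Eff.
5Gb/s       0.68V    2.7pJ/bit
10Gb/s      0.85V    3.6pJ/bit
15Gb/s      1.05V    5.0pJ/bit
20Gb/s      1.2V     11.2pJ/bit
[Figure: bar chart of energy efficiency (pJ/bit, 0 to 12) at each operating point, broken down into TX Driver, TX Ser/Pre, TX Clk, RXFE and RX Clk]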
Refs: G. Balamurugan, JSSC 4/08 and B. Casper, ISSCC 06
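The per-lane power at each operating point can be recovered from the slide's numbers, since Gb/s times pJ/bit gives mW directly. The sketch below assumes the four energy values (2.7, 3.6, 5.0, 11.2 pJ/bit) pair with the four data-rate/supply labels in the order they appear; the helper name `link_power_mw` is illustrative.

```python
# (data rate in Gb/s, supply in V, energy efficiency in pJ/bit), paired in slide order
operating_points = [(5, 0.68, 2.7), (10, 0.85, 3.6), (15, 1.05, 5.0), (20, 1.2, 11.2)]


def link_power_mw(rate_gbps: float, energy_pj_per_bit: float) -> float:
    """Gb/s x pJ/bit gives mW (1e9 bit/s x 1e-12 J/bit = 1e-3 W)."""
    return rate_gbps * energy_pj_per_bit


for rate, vdd, eff in operating_points:
    print(f"{rate:>2} Gb/s @ {vdd:.2f} V: {link_power_mw(rate, eff):6.1f} mW per lane")
```

Quadrupling the data rate from 5Gb/s to 20Gb/s raises per-lane power by far more than 4x, which is the quantitative case for wider rather than faster interfaces.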
Aggressive Power Management
Don't spend power doing nothing!
Rapidly adapt to bandwidth demand
Requires fast, granular bandwidth adaptation
[Figure: normalized bandwidth demand (0.0 to 1.0) vs. time. A conventional (fixed bandwidth) link wastes energy whenever demand is below peak; an adaptive-bandwidth link turns that wasted energy into energy savings.]
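A rough way to quantify the wasted energy is to integrate power over a demand trace for both cases. The sketch below is an idealized model: it assumes link power scales in proportion to delivered bandwidth, with an optional idle floor. The trace, the floor value and the function name are hypothetical and chosen only for illustration.

```python
def link_energy(demand, adaptive: bool, idle_floor: float = 0.0) -> float:
    """Energy over a demand trace, in units of (peak power x one time step).

    demand: per-step bandwidth demand, normalized to 0.0-1.0.
    adaptive=False: the link always burns peak power (fixed bandwidth).
    adaptive=True: power tracks demand but never drops below idle_floor.
    """
    if adaptive:
        return sum(max(d, idle_floor) for d in demand)
    return float(len(demand))


# Hypothetical normalized demand trace, for illustration only
demand = [0.2, 0.2, 1.0, 0.6, 0.2, 0.0, 0.4, 0.8]
fixed = link_energy(demand, adaptive=False)
adapt = link_energy(demand, adaptive=True, idle_floor=0.1)
print(f"energy saved by adapting: {1 - adapt / fixed:.0%}")
```

Real links fall short of this ideal because state changes take time and the idle floor is not zero, which is why the slide calls for fast, granular adaptation.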
Device Variation in Scaled CMOS
Device manufacturing tolerances are improving, but area scaling still causes higher variation
The fundamental power/area vs. variation tradeoff is not acceptable
Ref: K. Kuhn, IEDM 2007
$\sigma_{V_T}^{2} = c_{2} \left( \frac{1}{L_{eff} W_{eff}} \right)$
Need circuit architectures
that fundamentally change
this tradeoff.
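The tradeoff can be made concrete with a Pelgrom-style mismatch model, in which the threshold-voltage standard deviation falls only as the square root of gate area. This is a sketch under that assumed model; the coefficient `A_VT` below is a hypothetical value and not a number from the talk or from the cited reference.

```python
import math


def sigma_vt_mv(a_vt_mv_um: float, w_um: float, l_um: float) -> float:
    """Pelgrom-style mismatch: sigma(VT) = A_VT / sqrt(W * L)."""
    return a_vt_mv_um / math.sqrt(w_um * l_um)


def area_for_sigma_um2(a_vt_mv_um: float, target_sigma_mv: float) -> float:
    """Gate area W*L needed to reach a target sigma(VT)."""
    return (a_vt_mv_um / target_sigma_mv) ** 2


A_VT = 3.0  # mV*um, hypothetical matching coefficient
for target in (10.0, 5.0, 2.5):
    print(f"sigma(VT) = {target} mV needs {area_for_sigma_um2(A_VT, target):.3f} um^2")
```

Each halving of variation costs 4x the device area, and with it more capacitance and power, which is why sizing alone is not an acceptable answer.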
Mitigating Device Variation
Calibration greatly improves the power/variation tradeoff
Receiver offset calibration
Duty cycle correction
Adaptive equalizers
Clock recovery (or deskew)
Simple calibration doesn't alleviate all variation issues (e.g. PSRR)
Circuit derivatives (gm, ro) are not calibrated by offset calibration
PSRR is not calibrated
Possible solutions:
Dynamic calibration (e.g. auto-zero)
Redundancy/reconfigurability
Better "correct by design" circuits
[Figure: comparator with input offset V_offset]
[Figure: phase generation example with coarse phase gen., DSM, LPF and phase select, plus DNL (ps) and INL (ps) plots. Ref: P. Hanumolu, JSSC 2/08.]
Channel Directions
Channel scaling
Circuit innovation alone will probably not be enough to reach the 1TB/s target; the channel needs to scale too!
Better signal integrity: Improved electrical characteristics mean less power in clocking and equalization
Higher density: More lanes allow each lane to operate at a lower data rate, which gives better power efficiency
Channel vs. Equalization tradeoffs:
Backplane example
[Figure: |S21| (0dB to -80dB) vs. frequency (0GHz to 15GHz) for stubbed-via and drilled-via backplanes (BP)]
Ref: B. Casper, CICC 07.
Improve channel signal integrity
[Figure: |S21| (0dB to -80dB) vs. frequency (0GHz to 15GHz) for a 19" flex cable compared with stubbed-via and drilled-via backplanes (BP)]
[Figure: 19" flex cable with CPU socket and flex connector]
High density channels
[Figure, built up over five slides: approx. contact/routing pitch (µm, log scale 1 to 1000), showing contact pitch and routing pitch]
Tool Directions
What is the Right Link Architecture?
Designers need the ability to quickly and
accurately compare architecture options
[Figure: TX-to-RX link annotated with open design questions]
TX: Clock jitter? Signal swing? Equalization?
Channel: ISI? Xtalk?
Link: Modulation (PAM)? Data rate? Interface width?
RX: Clock jitter? Sensitivity? Equalization?
Empirical Approach
Simulate system with random data
This doesn't provide adequate accuracy (BER < $10^{-12}$)
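One way to see why random-data simulation falls short is to count the bits needed. For a run with zero observed errors, the standard confidence bound is N = -ln(1 - CL) / BER. The sketch below applies it; the simulator throughput is a hypothetical figure used only to give a sense of scale.

```python
import math


def bits_for_ber(target_ber: float, confidence: float = 0.95) -> float:
    """Error-free bits needed to claim BER < target_ber at the given confidence."""
    return -math.log(1.0 - confidence) / target_ber


n_bits = bits_for_ber(1e-12)
sim_bits_per_second = 1e6  # hypothetical time-domain simulator throughput
days = n_bits / sim_bits_per_second / 86400
print(f"{n_bits:.2e} bits, about {days:.0f} days at {sim_bits_per_second:.0e} simulated bits/s")
```

Roughly three trillion error-free bits are needed for one design point, so sweeping architectures this way is impractical.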
Full System Statistical Analysis
Specify high-level architecture and block characteristics
Enables fast evaluation of link sensitivities
Inputs to statistical signaling analysis: TX jitter; channel and co-channel responses; equalization; modulation; RX input-referred noise; RX sampling jitter
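The idea behind a statistical approach can be sketched in a few lines: build the exact probability distribution of the ISI from the channel's pulse-response cursors, then average the Gaussian noise tail over it. This is a heavily simplified illustration, not the tool described in the talk. It assumes binary signaling, independent data and Gaussian noise, and it ignores jitter, crosstalk and equalization; all numeric values and function names are hypothetical.

```python
import math
from collections import defaultdict


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def isi_distribution(cursors):
    """Exact PDF of the ISI sum for independent +/-1 data, built by convolution."""
    pdf = {0.0: 1.0}
    for h in cursors:
        nxt = defaultdict(float)
        for value, prob in pdf.items():
            nxt[round(value + h, 12)] += 0.5 * prob
            nxt[round(value - h, 12)] += 0.5 * prob
        pdf = dict(nxt)
    return pdf


def statistical_ber(main_cursor: float, isi_cursors, noise_sigma: float) -> float:
    """BER at the sampling point: noise tail averaged over the ISI distribution."""
    pdf = isi_distribution(isi_cursors)
    return sum(p * q_function((main_cursor + isi) / noise_sigma) for isi, p in pdf.items())


# Hypothetical pulse response (volts) and RX input-referred noise
ber = statistical_ber(main_cursor=0.30, isi_cursors=[0.05, 0.03, 0.02], noise_sigma=0.02)
print(f"estimated BER = {ber:.2e}")
```

Because the tail probabilities are computed analytically, very low BER values can be evaluated directly, without simulating trillions of bits.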
Maximum Data Rate Comparison:
Backplane vs. Flex
[Figure: maximum data rate at BER = $10^{-12}$ (axis marks at 15Gb/s, 30Gb/s and 45Gb/s) vs. number of TX FIR taps and DFE taps, for flex, drilled-via BP and stubbed-via BP channels]
Statistical system analysis provides designers with real performance tradeoffs and brick walls
470Gb/s Prototype
47x10Gb/s, 1.4pJ/bit Interface (45nm CMOS)
[Figure: dieA and dieB connected by five bundles (txbundle_A[1:0] to rxbundle_B[1:0], txbundle_B[2:0] to rxbundle_A[2:0]) of 9 or 10 lanes each, with clocks fclk_A and fclk_B and IL-VCOs]
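The headline numbers can be cross-checked with simple arithmetic. The sketch below sums the per-bundle lane counts from the figure (9, 10, 9, 9, 10) and then extrapolates linearly to 1TB/s at the same per-lane rate and efficiency. The extrapolation is an illustration only; it ignores whether the package and channel could actually carry that many lanes.

```python
lanes_per_bundle = [9, 10, 9, 9, 10]  # as labeled in the figure
rate_gbps = 10                        # per-lane data rate
energy_pj_per_bit = 1.4

total_lanes = sum(lanes_per_bundle)
aggregate_gbps = total_lanes * rate_gbps
power_w = aggregate_gbps * energy_pj_per_bit * 1e-3  # Gb/s x pJ/bit = mW

# Linear extrapolation to 1TB/s (8000 Gb/s), assuming the same rate and efficiency
target_gbps = 8000
lanes_for_target = target_gbps // rate_gbps
power_for_target_w = target_gbps * energy_pj_per_bit * 1e-3

print(f"{total_lanes} lanes, {aggregate_gbps} Gb/s, {power_w:.3f} W")
print(f"1TB/s would need {lanes_for_target} lanes and about {power_for_target_w:.1f} W")
```

At 1.4pJ/bit the I/O power for 1TB/s lands near 11W, compared with 160W at 20pJ/bit.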
Bundled Architecture
Conventional: Independent clocking (one deskew circuit per RX sampler)
Clocking innovation: Bundle clocking
Optimized: Bundle clocking (one bundle deskew circuit shared by the RX samplers in the bundle)
Bundled clocking reduces I/O power
[Figure: clk and data paths for conventional independent clocking (five deskew circuits, five RX samplers) and for bundle clocking (one bundle deskew, five RX samplers)]
Fast RX Power States
RX bundle power reduced by 93% in standby
All RX lanes return to reliable operation in <5ns
[Figure: probability of correct seeding (0 to 1) vs. wake-up time (0 to 5 ns) for rxbundle_B[0], rxlane[9:0]]
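These two measurements determine how much a fast standby state is worth. The sketch below computes average RX bundle power for a given active fraction, and the wake-up latency expressed in bit times. It assumes a simple two-state model (fully active or in standby) and neglects transition energy; the function names are illustrative.

```python
def average_rx_power(active_fraction: float, standby_reduction: float = 0.93) -> float:
    """Average RX bundle power, normalized to active power (two-state model)."""
    standby_power = 1.0 - standby_reduction
    return active_fraction + (1.0 - active_fraction) * standby_power


def wakeup_bit_times(wakeup_ns: float, rate_gbps: float) -> float:
    """Wake-up latency expressed in bit times per lane (ns x Gb/s = bits)."""
    return wakeup_ns * rate_gbps


for active in (1.0, 0.5, 0.25, 0.0):
    print(f"active {active:.0%}: average power {average_rx_power(active):.3f} of peak")
print(f"5ns wake-up at 10Gb/s = {wakeup_bit_times(5, 10):.0f} bit times per lane")
```

A wake-up of about 50 bit times is short enough that a link could enter standby during brief idle gaps rather than only during long ones.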
Silicon Area Compression
Floor plan optimization: minimize I/O area
[Figure: conventional I/O floor plan vs. optimized bundle layout, showing power, ground, I/O signals and I/O circuitry within the I/O layout]
Interface Floorplan
[Figure: floorplan at the die edge, with dimensions 2864µm and 1302µm, showing txbundle_A[1:0], rxbundle_A[2:0], IL-VCO + Drv, Lane[9:5] and Lane[4:0], 10 TL pairs, and active I/O circuitry]
Active circuit area is reduced with TL routing.
Interface Configuration
Within-bundle lanes matched to <100µm
Dense LGA connector minimizes breakout area
Bundles share the same routing layer
2X density on stripline layers due to reduced Xtalk
[Figure: cross-section with dieA and dieB on packages, socket and PCB, joined by an HDI/flex bridge through a 500µm LGA connector; routing layers Microstrip, Stripline1 and Stripline2; 5 signals/mm]
Silicon and Interconnect Prototypes
[Photos: 0.5m flex interconnect; 3m twinax cable]
Electrical Interconnect Scaling Challenges
[Figure: power efficiency (pJ/bit, log scale 1 to 100) vs. channel loss at the symbol rate (0 to 40 dB), based on transceivers reported 2006-2009 in 65-130nm CMOS, with this work (45nm CMOS prototype at 10Gb/s data rate) marked]
I/O Power Efficiency Measurements
[Figure: power efficiency (pJ/bit, 0.00 to 3.00) vs. channel length (0 to 300 cm) for HDI (high density interconnect), LCP flex and 32AWG twinax channels; link data rate = 10Gb/s]
Summary
Bandwidth needs are quickly approaching 1TB/s
Extending electrical I/O to 1TB/s requires balance
between power, data rate, density and cost
Evaluate alternate channel configurations and
materials
Recent results indicate that electrical I/O will be up to the task for in-box I/O