
Design of High-Speed Links:

A look at Modern VLSI Design

Vladimir Stojanović

Computer Systems Laboratory


Stanford University
Chip design is changing
• Becoming constrained by power
• Not so much by area/density

                Pentium      Pentium 4
Transistors     3M           125M
Power density   30mW/mm2     850mW/mm2
Technology      0.6um        90nm
Power           4W           103W
Clock           0.1GHz       3.4GHz

• Best systems trade off circuits, architecture and system issues
2
Power-performance system optimization
• Complex, many levels of hierarchy and variables

Individual components
Flops & latches (power and timing critical)
[Figure: two D/Q flops clocked by Clk around a logic block]

System level, VLSI blocks and circuits
- Physical (Vdd, Vth, sizing)
- Logic
- uArchitecture (parallelism, pipelining)
[Figure: pipelined and parallel arrangements of Logic A and Logic B between D/Q flops, with per-block supplies and thresholds Vdd1 to Vdd5, Vth1 to Vth5]

Interfaces (digital, analog and mixed-signal)
[Figure: Transmitter, Channel, Receiver]

6
Look at sub-problem: links

• Seems pretty simple:

[Figure: Transmitter, Channel, Receiver]

• Challenging multi-disciplinary area
  • Circuits
  • Communications
  • Optimization

7
What makes it challenging

[Figure: high-speed link chip, > 2 GHz signals]

• Now, the bandwidth limit is in wires


8
New link design

Dealing with bandwidth limited channels


• This is an old research area
  • Textbooks on digital communications
  • Think modems, DSL
• But can’t directly apply their solutions
  • Standard approach requires high-speed A/Ds and digital signal processing
  • 20Gs/s A/Ds are expensive
• (Un)fortunately need to rethink issues

9
Outline
• Show system level optimization for links
• Create a framework to evaluate trade-offs

• Background on high-speed links
• High-speed link modeling
• System level optimization
• Practical implementation issues
• Current / future work

10
Backplane environment

[Figure: backplane signal path with labeled elements: package, on-chip parasitics (termination resistance and device loading capacitance), package via, line card trace, line card via, backplane connector, backplane via, backplane trace]

• Line attenuation
• Reflections from stubs (vias)

11
Backplane channel
• Loss is variable
  • Same backplane
  • Different lengths
  • Different stubs
    • Top vs. Bot
• Attenuation is large
  • >30dB @ 3GHz
  • But is that bad?
    • Required signal amplitude set by noise

[Figure: attenuation [dB] vs. frequency [GHz], 0 to 10 GHz, for 9" FR4, 26" FR4, 9" FR4 with via stub, and 26" FR4 with via stub]

12
Inter-symbol interference (ISI)
• Channel is low pass
  • Our nice short pulse gets spread out
• Dispersion: short latency (skin-effect, dielectric loss)
• Reflections: long latency (impedance mismatches: connectors, via stubs, device parasitics, package)

[Figure: pulse response vs. time [ns], 0 to 3 ns, Tsymbol=160ps]

13
ISI
[Figure: received amplitude (0 to 1) vs. symbol time (0 to 18), with the corrupted middle sample marked "Error!"]

• Middle sample is corrupted by 0.2 trailing ISI (from the previous symbol) and 0.1 leading ISI (from the next symbol), resulting in 0.3 total ISI
• As a result, the middle symbol is detected in error
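The arithmetic on this slide can be checked with a short sketch. The 0.2 trailing and 0.1 leading ISI terms come from the slide; the 0.5 main cursor, the unipolar 1-0-1 pattern and the half-cursor slicer threshold are assumptions chosen only so that the error becomes visible.

```python
import numpy as np

# Sampled pulse response: [leading ISI, main cursor, trailing ISI].
# 0.1 and 0.2 are from the slide; the 0.5 main cursor is an assumed value.
pulse = np.array([0.1, 0.5, 0.2])
bits = np.array([1, 0, 1])          # assumed unipolar pattern around the middle symbol

rx = np.convolve(bits, pulse)       # superposition of the three pulse responses
middle_sample = rx[2]               # sample aligned with the middle symbol's main cursor
threshold = pulse[1] / 2            # assumed slicer level: half the main cursor
detected = int(middle_sample > threshold)

print(f"middle sample = {middle_sample:.2f}, threshold = {threshold:.2f}, detected = {detected}")
```

With these assumed values the middle symbol, sent as 0, lands at 0.3 and is read as 1.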

14
The right sub-system model

• Need accurate models
  • To relate the power/complexity to performance

• Main system impairments
  • Interference
  • Various noise sources
    • Voltage (thermal, supply, offsets, quantization noise)
    • Timing (jitter, offset)

15
Problem with current models

• Worst case analysis
  • Can be too pessimistic
    • If probability of worst case very small

• Gaussian distributions
  • Works well near mean
  • Often way off at tails
    • e.g. ISI distribution is bounded

• Use direct noise and interference statistics

16
Effect of timing noise
• Need to map from time to voltage

[Figure: ideal vs. jittered sampling instants on the received waveform; voltage noise appears when the receiver clock is off]

• The effect is going to depend on the size of the jitter, the input sequence, and the channel
17
Example: Effect of transmitter jitter
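[Figure: jittered pulse decomposition. The jittered symbol b_k, with edges at kT + \epsilon_k^{TX} and (k+1)T + \epsilon_{k+1}^{TX}, is drawn as the ideal symbol b_k spanning kT to (k+1)T plus noise pulses of height b_k at the two edges.]

• Decompose output into ideal and noise
  • Noise consists of pulses at the front and end of the symbol
  • Width of each pulse is equal to the jitter
  • Approximate with deltas on bandlimited channels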
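A minimal numeric sketch of this decomposition, assuming a rectangular symbol on a 1 ps grid. The 160 ps symbol time is taken from the ISI slide; the jitter values and the names `eps_k` and `eps_k1` are illustrative assumptions.

```python
import numpy as np

T = 160                      # symbol time in ps
t = np.arange(-T, 2 * T)     # 1 ps time grid
b_k = 1.0                    # symbol value
eps_k, eps_k1 = 5, -3        # assumed edge jitter in ps: start edge late, end edge early

ideal = b_k * ((t >= 0) & (t < T))
jittered = b_k * ((t >= eps_k) & (t < T + eps_k1))
noise = jittered - ideal     # nonzero only near the two symbol edges

front_area = noise[t < T // 2].sum()    # area of the noise pulse at the front edge
end_area = noise[t >= T // 2].sum()     # area of the noise pulse at the end edge

# Delta approximation: weight -b_k*eps_k at kT and +b_k*eps_k1 at (k+1)T
print(front_area, -b_k * eps_k)
print(end_area, b_k * eps_k1)
```

Each noise pulse is as wide as the jitter on that edge, so on a bandlimited channel it can be replaced by a delta whose weight is the symbol value times the jitter.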
18
Jitter effect on voltage noise
• Transmitter jitter
  • High frequency (cycle-cycle) jitter is bad
    • Changes the energy (area) of the symbol
    • No correlation of noise sources that sum
  • Low frequency jitter is less bad
    • Effectively shifts waveform
    • Correlated noise gives partial cancellation
• Receive jitter
  • Modeled by shift of transmit sequence
  • Same as low frequency transmitter jitter
• Bandwidth of the jitter is critical
  • It sets the magnitude of the noise created
19
Jitter source from PLL clocks
[Figure: PLL block diagram: RefClk, phase detector (Kpd), charge pump (Icp), loop filter (R, C), VCO (Kvco/s), clock buffer, and divide-by-N feedback]
[Figure: noise transfer functions [dB] vs. frequency [Hz], 10^5 to 10^10, from input clock, from VCO supply, and from clock buffer supply]

• Noise sources
  • Reference clock phase noise
  • VCO supply noise
  • Clock buffer supply noise

M. Mansuri and C-K.K. Yang, "Jitter optimization based on phase-locked loop design parameters," IEEE Journal Solid-State Circuits, Nov. 2002
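The shape of these transfer functions can be sketched from the block diagram with a linear charge-pump PLL model. This is an assumed textbook model, not the cited paper's analysis: all component values are illustrative, Kvco is taken in rad/s/V, and the phase detector plus charge pump gain is taken as Icp/(2*pi). Only the reference and VCO paths are modeled.

```python
import numpy as np

def pll_noise_transfer(f_hz, icp=100e-6, r=10e3, c=100e-12,
                       kvco=2 * np.pi * 1e9, n=10):
    """Return (h_ref, h_vco) at the given frequencies for an assumed 2nd-order PLL.

    h_ref: reference phase to output phase, normalized by N (low-pass).
    h_vco: VCO phase noise to output phase (high-pass).
    """
    s = 2j * np.pi * np.asarray(f_hz, dtype=float)
    loop_filter = r + 1.0 / (s * c)                       # series R-C loop filter
    loop_gain = (icp / (2 * np.pi)) * loop_filter * (kvco / s) / n
    h_ref = loop_gain / (1.0 + loop_gain)
    h_vco = 1.0 / (1.0 + loop_gain)
    return h_ref, h_vco

freqs = np.array([1e5, 1e10])
h_ref, h_vco = pll_noise_transfer(freqs)
for f, a, b in zip(freqs, h_ref, h_vco):
    print(f"{f:.0e} Hz: ref {20 * np.log10(abs(a)):6.1f} dB, vco {20 * np.log10(abs(b)):6.1f} dB")
```

Reference noise passes inside the loop bandwidth while VCO noise passes outside it, which is why the loop bandwidth decides how much of each source reaches the clock.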
20
2x Oversampled bang-bang CDR
[Figure: data samples d_{n-1} and d_n with the edge sample e_n between them; the case shown is e_n late]

• Generate early/late from d_n, d_{n-1}, e_n
  • Simple 1st order loop, cancels receiver setup time
• Now need jitter on data Clk, not PLL output
  • Base linear PLL jitter
  • Add non-linear phase selector noise from CDR
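A minimal sketch of the early/late decision, assuming the usual 2x oversampled ordering in which the edge sample e_n is taken between d_{n-1} and d_n. The function name and the -1/0/+1 encoding are illustrative assumptions.

```python
def bang_bang_phase_detect(d_prev, e, d):
    """Return -1 (clock early), +1 (clock late) or 0 (no transition, no information)."""
    if d_prev == d:
        return 0                      # no data transition between the two samples
    # Edge sample still equals the old bit: it was taken before the transition.
    # Edge sample already equals the new bit: it was taken after the transition.
    return -1 if e == d_prev else 1

for d_prev, e, d in [(0, 0, 1), (0, 1, 1), (1, 1, 1), (1, 0, 0)]:
    print((d_prev, e, d), bang_bang_phase_detect(d_prev, e, d))
```

The output carries only the sign of the phase error, which is what makes the loop non-linear and adds the phase selector noise mentioned above.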
21
Bang-bang CDR model
• Model CDR loop as a state machine: Markov chain

[Figure: log10 steady-state probability (0 down to -15) vs. phase count (0 to 250)]

• Gives the probability distribution of phase
  • Which is the CDR jitter distribution

A.E. Payzin, "Analysis of a Digital Bit Synchronizer," IEEE Transactions on Communications, April 1983.
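A hedged sketch of the Markov-chain idea, assuming a first-order loop that moves one phase step per update and an edge sample corrupted by Gaussian jitter. The number of phases, lock point, jitter sigma and transition density are all illustrative values, not figures from the slides.

```python
import numpy as np
from math import erf, sqrt

def cdr_phase_distribution(n_phases=64, lock=32, sigma=6.0, p_transition=0.5):
    """Steady-state phase probabilities of an assumed bang-bang CDR chain."""
    P = np.zeros((n_phases, n_phases))
    for i in range(n_phases):
        # Probability that the detector reports "late" at phase position i.
        p_late = 0.5 * (1.0 + erf((i - lock) / (sigma * sqrt(2.0))))
        down = p_transition * p_late            # late: step the phase back
        up = p_transition * (1.0 - p_late)      # early: step the phase forward
        P[i, max(i - 1, 0)] += down
        P[i, min(i + 1, n_phases - 1)] += up
        P[i, i] += 1.0 - up - down              # no data transition: hold
    # Solve pi = pi P together with sum(pi) = 1.
    A = P.T - np.eye(n_phases)
    A[-1, :] = 1.0
    rhs = np.zeros(n_phases)
    rhs[-1] = 1.0
    return np.linalg.solve(A, rhs)

pi = cdr_phase_distribution()
print("most likely phase:", int(np.argmax(pi)), "probability:", float(pi.max()))
```

Plotting log10 of the result against phase count gives a curve of the kind shown on the slide, with the tails giving the CDR jitter distribution.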
22
Outline
• Show system level optimization for links
• Create a framework to evaluate trade-offs

• Background on high-speed links
• High-speed link modeling
• System level optimization
  • Limits: what is the capacity of these links?
  • Improving today’s baseband signaling
• Practical implementation issues
• Current / future work
23
Baseline channels
[Figure: attenuation [dB] vs. frequency [GHz], 0 to 20 GHz, for 26" NELCO with no stub and 26" FR4 with via stub]

• Legacy (FR4) - lots of reflections
• Microwave engineered (NELCO)

24
Capacity with link-specific noise
[Figure: two panels, NELCO and FR4. Capacity [Gb/s] (0 to 140) vs. log10(clipping probability) (-25 to 0), for thermal noise only, thermal noise and LC PLL phase noise, and thermal noise and ring PLL phase noise]

• Effective noise from phase noise
  • Proportional to signal energy
  • Decreases expected gains
• Still, capacity much higher than data rates in today’s links
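Why signal-proportional noise cuts the expected gains can be seen in a small sketch. It assumes a flat transmit spectrum, a made-up channel loss that is linear in dB, and a phase-noise term modeled simply as a fixed fraction of the received signal power. None of these numbers come from the slides, and no water-filling or clipping constraint is applied.

```python
import numpy as np

def capacity_gbps(freq_ghz, atten_db, signal_psd, thermal_psd, phase_noise_ratio=0.0):
    """Sum of log2(1 + SNR) over uniform frequency bins, in Gb/s."""
    f = np.asarray(freq_ghz, dtype=float)
    h2 = 10.0 ** (np.asarray(atten_db, dtype=float) / 10.0)   # |H(f)|^2
    rx_signal = signal_psd * h2
    # Assumed model: phase noise adds noise proportional to the received signal.
    noise = thermal_psd + phase_noise_ratio * rx_signal
    df = f[1] - f[0]
    return float(np.sum(np.log2(1.0 + rx_signal / noise)) * df)

f = np.linspace(0.1, 10.0, 100)       # GHz
atten = -4.0 * f                      # hypothetical loss of 4 dB per GHz

thermal_only = capacity_gbps(f, atten, 1.0, 1e-9)
with_phase_noise = capacity_gbps(f, atten, 1.0, 1e-9, phase_noise_ratio=1e-3)
print(thermal_only, with_phase_noise)
```

Because the extra noise grows with the signal, the SNR saturates at 1/phase_noise_ratio and raising the transmit power buys less than it would with thermal noise alone.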
25
Today’s links

• Exclusively baseband
  • Biggest problem is ISI
  • Starting to use equalization
  • Thinking about multi-level modulation
• Constrained by speed and power
  • Large number of links on a chip

• Model links to find efficient implementations

26
Baseband links - removing ISI
[Figure: link with a linear transmit equalizer (anticausal and causal taps) driving the channel, and a decision-feedback equalizer at the receiver (sampled data, deadband, feedback taps, TapSel logic)]

• Transmit and Receive Equalization
  • Changes signal to correct for ISI
• Often easier to work at transmitter
  • DACs easier than ADCs

J. Zerbe et al, "Design, Equalization and Clock Recovery for a 2.5-10Gb/s 2-PAM/4-PAM Backplane Transceiver Cell," IEEE Journal Solid-State Circuits, Dec. 2003.
27
Transmit equalization – headroom constraint

[Figure: attenuation [dB] vs. frequency [GHz], 0 to 2.5 GHz, unequalized and equalized, under the peak power constraint. Amplitude of equalized signal depends on the channel.]

• Transmit DAC has limited voltage headroom
  • Unknown target signal levels
    • Hard to formulate error or objective function
  • Need to tune the equalizer and receive comparator levels

28
Optimization example:
Power constrained linear precoding

\[ \mathrm{MSE}(w, g) = E_a \left( 1 - 2 g\, w^T P 1_\Delta + g^2 w^T P P^T w \right) + g^2 \sigma^2 \]

\[ \mathrm{SINR}_{unbiased}(w) = \frac{E_a (w^T P 1_\Delta)^2}{E_a\, w^T P (I - 1_\Delta 1_\Delta^T)(I - 1_\Delta 1_\Delta^T)^T P^T w + \sigma^2} \]

• Add variable gain to amplify to known target level
  • Formulate the objective function from error
• SINR is not concave in w in general
  • Change objective to quasiconcave SINR_unbiased
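The two objectives can be evaluated numerically with a short sketch. It assumes w holds the transmit taps, P is the convolution matrix of the sampled pulse response (so w^T P is the equalized pulse response), and 1_Delta picks the main cursor. The two-sample pulse response, noise level and tap values are illustrative assumptions.

```python
import numpy as np

def conv_matrix(pulse, n_taps):
    """Rows are shifted copies of the pulse response, so w @ P = conv(w, pulse)."""
    P = np.zeros((n_taps, n_taps + len(pulse) - 1))
    for i in range(n_taps):
        P[i, i:i + len(pulse)] = pulse
    return P

def mse(w, g, P, delta, Ea, sigma2):
    c = w @ P                                   # equalized pulse response
    return Ea * (1 - 2 * g * c[delta] + g**2 * (c @ c)) + g**2 * sigma2

def sinr_unbiased(w, P, delta, Ea, sigma2):
    c = w @ P
    isi = np.delete(c, delta)                   # everything except the main cursor
    return Ea * c[delta] ** 2 / (Ea * (isi @ isi) + sigma2)

P = conv_matrix([1.0, 0.5], n_taps=2)           # assumed channel: one postcursor
w_flat = np.array([1.0, 0.0])                   # no equalization
w_eq = np.array([2 / 3, -1 / 3])                # one post tap, sum(|w|) = 1

print(mse(w_flat, 1.0, P, 0, 1.0, 0.01))
print(sinr_unbiased(w_flat, P, 0, 1.0, 0.01), sinr_unbiased(w_eq, P, 0, 1.0, 0.01))
```

The equalized taps give a smaller main cursor but a much higher unbiased SINR, which is the trade the peak power constraint forces.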
29
Optimal linear precoding
• Still, does this objective really relate to link performance?
  • Need to look at noise and interference distributions

\[ \underset{w}{\text{maximize}} \quad \frac{0.5\, d_{min}\, w^T P 1_\Delta - V_{peak} \lVert w^T P I_{PD} \rVert_1 - \mathrm{offset}}{\left( E_a\, w^T P (I - 1_\Delta 1_\Delta^T - I_{PD})(I - 1_\Delta 1_\Delta^T - I_{PD})^T P^T w + \sigma^2 \right)^{1/2}} \]

\[ \text{s.t.} \quad \lVert w \rVert_1 \le 1, \qquad \sigma^2 = w^T S_0^{TX} w + w^T S_0^{RX} w + \sigma_{thermal}^2 \]

• Minimize BER
  • Residual dispersion into peak distortion
  • Reflections into mean distortion
  • Includes all link-specific noise sources

30
Including feedback equalization
• Feedback equalization (DFE)
  • Subtracts error from input
  • No attenuation

[Figure: feedback equalization applied to the pulse response; amplitude (0 to 1) vs. symbol time (0 to 18)]

• Problem with DFE
  • Need to know interfering bits
  • ISI must be causal
  • Problem - latency in the decision circuit
    • Receive latency + DAC settling < bit time
  • Can increase allowable time by loop unrolling
    • Receive next bit before the previous is resolved

31
1 bit loop unrolling
[Figure: 2PAM signal constellation split by the previous bit. Two slicers on x_n, with thresholds offset in opposite directions, produce d_n | d_{n-1} = 1 and d_n | d_{n-1} = 0; a multiplexer selected by d_{n-1}, followed by a flop clocked by dClk, picks the final d_n.]

• Instead of subtracting the error
  • Move the slicer level to include the noise
  • Slice for each possible level, since previous value unknown

K.K. Parhi, "High-Speed architectures for algorithms with quantizer loops," IEEE International Symposium on Circuits and Systems, May 1990
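A minimal behavioral sketch of one-tap loop unrolling, compared against a direct DFE. It assumes a noise-free 2PAM channel with levels +/-1 and a single postcursor of size `alpha`; the names and the value 0.6 are illustrative.

```python
import numpy as np

def dfe_direct(x, alpha):
    """Classic DFE: subtract the previous bit's ISI, then slice at zero."""
    d_prev, out = 0, []
    for xn in x:
        isi = alpha if d_prev == 1 else -alpha
        d_prev = 1 if xn - isi > 0 else 0
        out.append(d_prev)
    return out

def dfe_unrolled(x, alpha):
    """Loop-unrolled DFE: slice at both shifted thresholds, select afterwards."""
    x = np.asarray(x)
    if_prev_1 = (x > alpha).astype(int)     # decision assuming d_{n-1} = 1
    if_prev_0 = (x > -alpha).astype(int)    # decision assuming d_{n-1} = 0
    d_prev, out = 0, []
    for c1, c0 in zip(if_prev_1, if_prev_0):
        d_prev = int(c1 if d_prev == 1 else c0)   # only a mux sits in the feedback path
        out.append(d_prev)
    return out

bits = [1, 0, 0, 1, 1, 1, 0, 1, 0, 0]
alpha = 0.6
a = [2 * b - 1 for b in bits]
x = [a[n] + alpha * (a[n - 1] if n else -1) for n in range(len(a))]
print(dfe_direct(x, alpha) == dfe_unrolled(x, alpha) == bits)
```

Both versions make the same decisions; the unrolled one moves the comparison out of the feedback loop, so only the multiplexer select has to settle within a bit time.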
32
Residual error

• Cannot correct all the ISI
  • Equalizers are finite length
  • EQ coefficients quantized
  • ISI-noise enhancement tradeoff
• The error affects both voltage and timing

• Need accurate distribution of this error
  • Random data
  • Standard textbook methods for distribution of the sum of weighted random variables
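One textbook way to get this distribution is to convolve the two-point distributions of the individual ISI terms. The sketch below assumes random 2PAM data, so each residual tap contributes plus or minus its magnitude with probability 1/2; the tap values and the 1 mV grid are illustrative assumptions.

```python
import numpy as np
from math import erfc, sqrt

def isi_pmf(taps, step=1e-3):
    """Return (volts, pmf) of the sum of +/-tap terms on a grid of size step."""
    pmf = np.array([1.0])
    for c in taps:
        k = int(round(abs(c) / step))
        two_point = np.zeros(2 * k + 1)
        two_point[0] += 0.5                 # contribution -|c|
        two_point[-1] += 0.5                # contribution +|c|
        pmf = np.convolve(pmf, two_point)   # distribution of the running sum
    half = (len(pmf) - 1) // 2
    volts = (np.arange(len(pmf)) - half) * step
    return volts, pmf

def gaussian_tail(v, sigma):
    """P(X > v) for a zero-mean Gaussian with the given sigma."""
    return 0.5 * erfc(v / (sigma * sqrt(2.0)))

taps = [0.05, 0.03, 0.02]                   # assumed residual ISI taps in volts
volts, pmf = isi_pmf(taps)
sigma = sqrt(sum(c * c for c in taps))      # Gaussian model with the same variance

print("largest possible ISI:", volts[pmf > 0].max())
print("exact P(ISI > 0.11 V):", pmf[volts > 0.11].sum())
print("Gaussian P(ISI > 0.11 V):", gaussian_tail(0.11, sigma))
```

The exact distribution is bounded by the sum of the tap magnitudes, while the Gaussian model keeps assigning probability beyond that bound, which is where its tail pessimism comes from.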

33
Comparison with Gaussian model
[Figure: comparison with the Gaussian model, two panels. Cumulative ISI distribution: log10 probability [cdf] vs. residual ISI [mV]; 40mV error at 10^-10, 25% of eye height. Impact on CDR phase: log10 steady-state phase probability vs. phase count; 9% Tsymbol vs. 4% Tsymbol, error at 10^-10.]

• Gaussian model only good down to 10^-3 probability
  • Way pessimistic for much lower probabilities
34
BER contours
[Figure: BER contours (log10 BER from -5 to -30), margin [mV] vs. time [ps], for 5 tap Tx Eq and for 5 tap Tx Eq + 1 tap DFE]

• Voltage margin
  • Min. distance between the receiver threshold and contours with same BER

35
Pulse amplitude modulation

Binary (NRZ)                    PAM4
1 bit / symbol                  2 bits / symbol
Symbol rate = bit rate          Symbol rate = bit rate/2
Levels (top to bottom): 1, 0    Levels (top to bottom): 00, 01, 11, 10
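A small sketch of the two mappings. The PAM4 bit-pair order (00, 01, 11, 10 from top to bottom) follows the slide; the numeric levels +3, +1, -1, -3 and the function names are assumptions for illustration.

```python
# Bit pairs in the slide's top-to-bottom order, mapped to assumed level values.
PAM4_LEVELS = {(0, 0): 3, (0, 1): 1, (1, 1): -1, (1, 0): -3}

def pam2_encode(bits):
    """Binary (NRZ): one symbol per bit."""
    return [1 if b else -1 for b in bits]

def pam4_encode(bits):
    """PAM4: one symbol per pair of bits."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

bits = [0, 0, 0, 1, 1, 1, 1, 0]
print(pam2_encode(bits))   # 8 symbols
print(pam4_encode(bits))   # 4 symbols for the same bits
```

With this ordering, neighboring levels differ in a single bit, so the most likely slicer error (picking an adjacent level) costs one bit error.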

36
Multi-level: Offset and jitter are crucial
[Figure: three panels of data rate [Gb/s] vs. symbol rate [Gs/s], 0 to 20 Gs/s, for PAM2, PAM4, PAM8 and PAM16, with: thermal noise only; thermal noise + offset; thermal noise + offset + jitter]

• To make better use of available bandwidth, need better circuits

37
Full ISI compensation too costly
[Figure: three panels of data rate [Gb/s] vs. symbol rate [Gs/s], 0 to 16 Gs/s, for PAM2, PAM4, PAM8 and PAM16, with: thermal noise only; thermal noise + offset; thermal noise + offset + jitter]

• Today’s links cannot afford to compensate all ISI
  • Limits today’s maximum achievable data rates

38
Outline
• Show system level optimization for links
• Create a framework to evaluate trade-offs

• Background on high-speed links
• High-speed link modeling
• System level optimization
• Practical implementation issues
  • Low-cost adaptation
  • Dual-mode link (hardware re-use)
• Current / future work

39
Fully adaptive dual-mode link

[Figure: TX, PLL and RX blocks labeled]

• PAM2/PAM4
• 2-10Gb/s
• 0.13µm
• 40mW/Gb

• Reconfigurable dual-mode PAM2/PAM4 link
• Adaptive equalization
  • Transmit and receive equalization
  • DFE with loop unrolling
40
Adaptation with minimum overhead
[Figure: Tx data drives the channel. At the receiver an adaptive sampler (clocked by aClk, referenced to dLev) produces the error, data samplers (dClk) with adaptive thresholds produce Rx data, and edge samplers (eClk) feed the CDR; an adaptive macro generates the tap updates.]

• Adaptive sampler
  • Generates the error signal at reference level
• Monitors the link
  • Adjustable voltage and time reference
  • On-chip sampling scope
  • Can replace any other sampler - calibration
41
Dual-loop adaptive algorithm
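• Data level reference loop

\[ dLev_{n+1} = dLev_n - step_{dataLev}\, \mathrm{sign}(e_n), \quad \hat{x}_n > 0 \]

[Figure: eye diagrams for the initial eye (dLev_init, error_init p-p), the mid-way equalized eye (dLev_mid) and the equalized eye (dLev_end), with sign(e_n) and sign(\hat{x}_n) marked]

• Equalizer loop

\[ w_{n+1} = w_n + step_w\, \mathrm{sign}(e_n)\, \mathrm{sign}(\hat{x}_n) \]

  • Scale the equalizer - output Tx constraint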
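A behavioral sketch of the two loops running together. Several things here are assumptions rather than the chip's implementation: the error is taken as e_n = dLev - x_n, updates happen only on positive symbols, the known transmitted symbols stand in for the receiver decisions, the channel has a single postcursor, and the taps are rescaled to sum(|w|) = 1 after every update to model the transmit peak constraint.

```python
import numpy as np

def adapt_dual_loop(pulse, n_symbols=20000, step_w=0.002, step_dlev=0.002, seed=0):
    """Sign-sign adaptation of a 2-tap Tx equalizer and the data level reference."""
    rng = np.random.default_rng(seed)
    a = rng.choice([-1.0, 1.0], size=n_symbols)     # random 2PAM data
    w = np.array([1.0, 0.0])                        # [main, post1], unequalized start
    dlev = 1.0
    start = len(w) + len(pulse) - 2
    for n in range(start, n_symbols):
        c = np.convolve(w, pulse)                   # equalized pulse response
        x = float(c @ a[n - np.arange(len(c))])     # received sample for symbol n
        if a[n] > 0:                                # error sampler watches the positive level
            e = np.sign(dlev - x)                   # assumed sign convention
            dlev -= step_dlev * e                   # data level reference loop
            w = w + step_w * e * a[n - np.arange(len(w))]   # equalizer loop
            w = w / np.sum(np.abs(w))               # rescale to the Tx peak constraint
    return w, dlev

pulse = [1.0, 0.25]                                 # assumed channel
w, dlev = adapt_dual_loop(pulse)
print("taps:", w, "dLev:", dlev, "equalized response:", np.convolve(w, pulse))
```

The rescaling step removes the degree of freedom shared by the two loops: the equalizer fixes the shape, the peak constraint fixes the scale, and dLev settles at whatever level the channel then delivers.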
42
Dual loop convergence – 4 tap example
[Figure: PAM2, 5Gb/s, 4 taps Tx equalization. Left: dLev [mV] vs. number of updates (0 to 200). Right: tap weight [mV] vs. number of updates for main tap, pre1, post1 and post2]

• Hard to estimate analytically
• Experimental results show
  • Both loops are stable within a wide range of relative speeds, 0.1x to 10x

43
Hardware re-use: Dual-mode receiver

[Figure: dual-mode receiver. Three dClk-clocked slicers compare the input x against thresh(+), 0 and thresh(-); multiplexers controlled by prDFE enable route the results to the lsb(+), msb and lsb(-) outputs.]

• PAM4
• PAM2
• PAM2 with loop-unrolled DFE tap
  • Leverage multi-level properties of signals in loop-unrolling

46
Improvements with loop-unrolling
[Figure: (a) unequalized response [V] vs. time [ps], 0 to 4000 ps; (b) fully transmit equalized response and transmit equalized response with one tap DFE [V] vs. time [ps]; eye shown as log10(voltage probability distribution), [mV] vs. [ps]]

• Signal as seen by the receiver (on-chip scope)

47
Model and measurements
[Figure: log10(BER) (0 to -14) vs. voltage margin [mV] (80 to -80), model and measurements]

• PAM4, 3 taps of transmit equalization, 5Gb/s
48
Outline
• Show system level optimization for links
• Create a framework to evaluate trade-offs

• Background on high-speed links
• High-speed link modeling
• System level optimization
• Practical implementation issues
• Current / future work
  • Bridging the gap to link capacity

49
Bridging the gap: Multi-tone link
[Figure: multi-tone data rates with thermal noise, #bits/Hz vs. frequency [GHz], 0 to 14 GHz: Nelco 64Gb/s, FR4 38Gb/s]
[Figure: multi-tone transceiver. data0 passes through an LPF at each end; data1 to dataN are low-pass filtered, mixed with e^{jw1t} to e^{jwNt} and band-pass filtered at the transmitter, then band-pass filtered, mixed and low-pass filtered at the receiver. The number of levels varies across frequency f.]

• Challenge: balancing the inter-symbol and inter-channel interference
  • Microwave filter techniques
  • Custom signal processing
51
Conclusions
• Links nice example of system-level optimization
  • Need accurate models
  • Global tradeoff
    • off-chip communication with on-chip computation
• ISI is large in baseband links
  • Can’t completely compensate
    • (At least not with reasonable area/power)
  • Power constrained transmitter
  • PAM4 and simple DFE are attractive solutions
  • Implemented practical, low-cost algorithms
• Still, far from the capacity of these links
  • Looking into multi-tone to bridge the gap
52
Acknowledgments
• Prof. Mark Horowitz and Prof. Vojin Oklobdzija
• Prof. Stephen Boyd, Prof. Joseph Kahn, Prof. Thomas Lee
• My mother Nada, my wife Ivana, kids Marija and Marko, my sister Tamara, Maurizio and my whole family
• Rambus and MARCO IFC for support
• Jared Zerbe, Andrew Ho, Fred Chen and everybody in Rambus XG team
• MH group - especially Elad Alon and Amir Amirkhany
• Dr. George Ginis and Prof. John Cioffi
• Dejan Markovic and Prof. Borivoje Nikolic
• Prof. Michael Flynn, Prof. Ken Yang
• Marianne Marx, Teresa, Penny, Taru, Deborah, Pamela
• My friends Svjetlana, Danijela and Dejan

53
