
An Output Structure for a Bi-Modal 6.4-Gbps GDDR5 and 2.4-Gbps DDR3 Compatible Memory Interface


Navin K. Mishra1, Manish Jain1, Phuong Le2, Sanku Mukherjee2, Arul Sendhil1, Amir Amirkhany2
1 Rambus Chip Technologies, Bangalore, India
2 Rambus Inc., 1050 Enterprise Way, Suite 700, Sunnyvale, CA, U.S.A.

Abstract - A bi-modal x32 memory interface supports 6.4-Gbps GDDR5 signaling as well as 2.4-Gbps DDR3 signaling with a 1.5V IO supply. The interface incorporates a novel driver and pre-driver structure that supports one-tap equalization and presents a very small capacitive load to the pins. The entire interface, including both the data and request channels, achieves energy efficiencies of 11.6mW/Gbps and 27.7mW/Gbps in GDDR5 and DDR3 modes, respectively, and communicates successfully with 1.6-Gbps DDR3 and 6.0-Gbps GDDR5 DRAMs.


I. INTRODUCTION

Multi-modal interfaces, which can support several signaling standards with the same silicon and package, help amortize costs across multiple market segments. They additionally provide a smooth transition path for system designers to deploy higher-speed products as their cost drops throughout the life cycle of the system. This paper presents the design of a bi-modal memory interface that supports 6.4-Gbps GDDR5 signaling as well as 2.4-Gbps DDR3 signaling.

In DDR memory interfaces, the IO supply voltage is substantially higher than the voltage supported by the thin-oxide transistors available in modern technologies. As a result, IO devices with thick oxides are often used in the driver and pre-driver stages [1]. However, the performance of the IO transistors becomes a bottleneck as target speeds increase. In addition, the use of thick-oxide devices at the output stage substantially adds to the capacitive loading at the pin, which deteriorates signal quality by introducing inter-symbol interference (ISI). In this design, thin-oxide transistors are used throughout the high-speed signal path in the transmitter, while the architecture is designed such that all transistors remain protected against voltage stress. Fig. 1 shows a high-level block diagram of the bi-modal driver.

The use of a high IO supply at the output stage and a low core supply for the pipe also introduces a high-speed level-shifting problem [2]: the high-speed signal would have to be transferred to the high-supply domain with minimal duty-cycle distortion and jitter. Unfortunately, power-efficient, high-quality wideband level-shifters are difficult to implement. Instead, in this design, the level-shifting is performed before the final output multiplexer (OMUX) in the signal path, so that the quality of the output signal is dictated only by the final multiplexer and the high-speed clock. The high-speed clock is level-shifted through an efficient narrow-band capacitive level-shifter. Section II describes the architecture of the driver and the pre-driver.

Fig. 1. Block diagram of the GDDR5/DDR3-compatible driver/pre-driver subsystem (clock level-shifter, delay circuit, on-chip regulator, PU/PD pre-drivers and OMUXes with ERE capacitors, and cascode driver stage driving the output pad).

To enable the use of thin-oxide transistors in the OMUX and in the pre-driver, the two paths controlling the output rising (pull-up) and falling (pull-down) transitions are separated. The pull-up (PU) path is designed between a 0.5V rail and the IO supply, while the pull-down (PD) path is designed under the core supply of 0.9V. The 0.5V rail is generated by on-chip regulators, where each regulator serves four bit-slices. Section II also describes the high-voltage stress protection mechanisms included in this design in more detail.

Separating the PU and PD pre-drivers can affect the matching between the output falling and rising transitions, leading to distortion at the output stage. A novel capacitive edge-rate equalization method is introduced in Section III to mitigate this effect with minimal power overhead.

The bi-modal memory interface, implemented in a 40nm CMOS process, communicates successfully with 6.0-Gbps GDDR5 and 1.6-Gbps DDR3 DRAMs at very good energy efficiencies. The data rate in the GDDR5 mode is limited by the DRAM speed. A summary of measurements is provided in Section IV, and the paper is summarized in Section V.

II. DRIVER ARCHITECTURE

The driver is designed to present a 40-Ohm PU and PD impedance to the channel when in GDDR5 or DDR3 transmit mode. When in GDDR5 receive mode, the PU branch of the driver is reconfigured to support 60-Ohm termination to the IO supply.


In DDR3 receive mode, the PU and PD branches of the driver are configured to create 120-Ohm termination to the IO supply and to ground, respectively. The structure of the driver is very similar to the one presented in [3]. The driver consists of 24 segments, as shown in Fig. 2(a). Each slice has a cascode push-pull structure, and the poly resistor is shared between PU and PD. The 24 slices are divided into two identical sets of 12 slices. The slices in each set are also identical, except that the length of the poly in each slice is slightly perturbed (by k) from a nominal value of 720-Ohm. The total impedances of the individual slices are also shown in Fig. 2(a).

For impedance calibration of the driver, some of the slices are turned off based on the on-chip voltage and temperature, as well as the process corner. If, for example, 3 slices have to be turned off, many choices exist since all the slices are slightly different, and therefore very fine impedance control can be implemented [3] (a sketch follows Fig. 2). Fig. 3 shows the simulated impedance calibration curves of the GDDR5 and the DDR3 driver across process, voltage, and temperature (PVT) variations. Only a 40-point subset of all the possible combinations of the slices is required to achieve less than a 5% step size in the driver impedance. For GDDR5 receive termination, 6 of the slices are turned off and the rest are calibrated to 60-Ohm. For DDR3 receive termination, one of the 12-slice sets pulls high and the other set pulls low, and both are calibrated to 120-Ohm.

For equalization, a subset of the slices which are ON has to be assigned to the post data. In this design, six of the slices can be shared between the main tap and the post tap; these slices are never turned off as part of the impedance calibration. Again, since the slices are slightly different, if for example two slices have to be assigned to the post data, many choices exist, leading to very fine equalization control [3]. Fig. 4 shows the driver output swing (for a DC non-toggling output pattern) for different equalizer coefficients and across PVT variations. A 30% equalization range with 6% resolution is enabled with a 24-point subset of all the possible combinations.

Fig. 2(b) shows the block diagram of one slice of the driver. The cascode structure at the output stage is required to protect the thin-oxide transistors from the 1.5V IO supply. The gates of the lower PMOS and upper NMOS in the cascode are tied to 0.5V (regulated supply) and 0.9V (core supply), respectively. All 24 slices in the driver have the same transistor sizes at the output stage; as a result, they all have identical pre-driver and OMUX circuits. A DC control mechanism behind the OMUX controls the ON and OFF state of the slices for output impedance calibration. An on-chip impedance calibration circuit automatically sets the state of the control signals using an off-chip 60-Ohm reference resistor and a reference voltage.

In order to protect the PU PMOS transistors at the output stage, the pre-driver output has to swing between 0.5V and 1.5V. To accommodate this constraint, the PU pre-driver and OMUX are separated from the PD pre-driver and OMUX, and the PU path is implemented between the 1.5V supply and a regulated 0.5V node. The required data level-shifting from the core supply to the IO supply is hidden behind the OMUX, so that the output rising-edge quality is not impacted by the level-shifter performance.

Fig. 2. (a) DDR driver. (b) One driver slice.
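The fine impedance control described above can be illustrated with a small search over which slices to disable, picking the subset whose remaining parallel combination lands closest to the target. The slice values below are invented (the paper states only a ~720-Ohm nominal with small perturbations); the real design uses an on-chip calibration circuit with an off-chip 60-Ohm reference.

    import itertools

    def r_parallel(rs):
        return 1.0 / sum(1.0 / r for r in rs)

    # Hypothetical slice resistances: 24 slices near a 720-Ohm nominal, each
    # perturbed slightly so different ON subsets give finely spaced values.
    slices = [720.0 + 2.0 * i for i in range(24)]

    def calibrate(target_ohm, n_off):
        # Try every choice of which n_off slices to disable; keep the subset
        # whose remaining parallel resistance is closest to the target.
        # Exhaustive search is cheap at this size (24 choose 6 = 134,596).
        best_r, best_off = None, None
        for off in itertools.combinations(range(len(slices)), n_off):
            on = [slices[i] for i in range(len(slices)) if i not in off]
            r = r_parallel(on)
            if best_r is None or abs(r - target_ohm) < abs(best_r - target_ohm):
                best_r, best_off = r, off
        return best_r, best_off

    print(calibrate(40.0, 6))   # 18 slices ON: roughly 720/18 = 40 Ohm

Sweeping the number of disabled slices and the subset choice generates the finely spaced impedance codes plotted in Fig. 3.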
Fig. 3. Simulated driver impedance (Rdc of the PU and PD branches, in Ohm) versus driver code index across PVT. (a) DDR3 mode; maximum step size is 1.91 Ohm. (b) GDDR5 mode; maximum step size is 1.92 Ohm.
Fig. 4. Normalized equalized driver output swing versus equalizer code index across PVT, at 40-Ohm calibrated driver impedance. Maximum equalizer step across corners is 6% of the output swing.
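The swing behavior in Fig. 4 follows from a simple model of segment-based one-tap equalization: for a DC (non-toggling) pattern, slices assigned to the post tap drive the opposite polarity from the main-tap slices, so the normalized output settles at the conductance-weighted difference of the two groups. The sketch below uses invented per-slice conductances; it illustrates the idea rather than the design's exact equations.

    def dc_swing(g_on, post_idx):
        # Normalized DC output level when the slices in post_idx carry the
        # inverted post-cursor data and the rest carry the main-cursor data.
        g_post = sum(g_on[i] for i in post_idx)
        g_main = sum(g for i, g in enumerate(g_on) if i not in post_idx)
        return (g_main - g_post) / (g_main + g_post)

    # 18 ON slices with slightly different conductances (hypothetical values).
    g_on = [1.0 / (720.0 + 2.0 * i) for i in range(18)]

    # Moving 0..3 slices to the post tap steps the DC swing down; choosing
    # *which* of the shareable slices move gives the fine-grained control.
    for k in range(4):
        print(k, round(dc_swing(g_on, list(range(k))), 3))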

The OMUX clock, nonetheless, has to be level-shifted to the 0.5V-1.5V range. Fig. 5 shows the clock level-shifter structure. An AC-coupling and trip-biasing-based level shifter [4] is employed. This configuration also acts as a duty-cycle corrector. The trip inverter operates in the 0.5V-1.5V voltage range. The AC coupling capacitor blocks the DC level of the incoming clock (i.e., (0V+0.9V)/2 = 0.45V), and the trip inverter biases the other node of the capacitor to a new DC level of 1V ((0.5V+1.5V)/2 = 1V). The AC coupling capacitor and the trip resistor are chosen such that the minimum circuit bandwidth exceeds 1.0GHz. The pull-down pre-driver and OMUX are implemented between the core supply of 0.9V and ground. A replica of the PU clock level-shifter is placed in the PD clock path for matching purposes (Fig. 1).
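As a quick check of that bandwidth requirement, the high-pass corner formed by the coupling capacitor and the trip-bias resistor is f_c = 1/(2*pi*R*C). The component values below are assumptions for illustration only (the paper gives neither R nor C); they are chosen so the corner lands near the stated 1.0GHz, below the 1.6GHz minimum clock frequency.

    import math

    C_ac = 50e-15    # AC coupling capacitor (assumed value)
    R_trip = 3e3     # trip-bias resistor (assumed value)

    # High-pass -3dB corner of the coupling network.
    f_corner = 1.0 / (2.0 * math.pi * R_trip * C_ac)
    print(f"high-pass corner = {f_corner / 1e9:.2f} GHz")   # ~1.06 GHz
    # A corner near 1GHz passes the 1.6GHz-3.2GHz clock with little
    # attenuation while still blocking the 0.45V input DC level.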

Fig. 5. AC coupling and trip-biasing-based clock level shifter.

III. BALANCING OF THE OUTPUT RISE AND FALL TIME

A consequence of separating the PU and PD pre-drivers and OMUXes and placing them on different supplies is that the output rising and falling transitions may not be symmetric, which can seriously distort the driver output. It is important to note that the PU and PD data paths include identical sets of circuits and carry identical signals, except for the high and low voltage levels. For example, when the PU pre-driver output rises from 0.5V to 1.5V to shut off the output-stage PMOS, the output of the PD pre-driver also transitions from low to high (from 0V to 0.9V) to enable the output-stage NMOS. The same condition holds for the PU OMUX clock and the PD OMUX clock.

To mitigate the potential mismatch between these logically identical signals, edge-rate equalizing (ERE) capacitors are used in this design, as shown in Fig. 1. An ERE capacitor can be placed between any node pair that has identical logical levels and must transition together at all times. The capacitor causes an averaging effect (it delays the faster signal and speeds up the slower one), synchronizing the transitions. In this design, two critical node pairs are equalized: the PU and PD OMUX clock inputs, and the PU and PD pre-driver outputs. The two capacitors are shown in Fig. 1.

Fig. 6 shows the simulated waveforms at the OMUX clock inputs and the pre-driver data outputs, with and without the edge-rate equalizing capacitors. As shown in Fig. 6, the simulated worst-case skew between the OMUX clock rising edges without compensation is 12.6ps, while the falling-edge skew is 11.4ps. With 15fF of capacitive compensation, the rising and falling skews are reduced to 1.9ps and 1.2ps, respectively. Similarly, a 15fF compensation capacitor at the output of the pre-drivers leads to about an order-of-magnitude reduction in the rising and falling skews (Fig. 6). This edge-equalization technique does not incur a power or delay overhead to first order, and adds only a small area overhead to the pre-driver. A behavioral model of the averaging effect is sketched after Fig. 6.

[Fig. 6 annotations: with ERE capacitors, skews are clk_1v5 1.9ps, clk_0v9 1.2ps, datap 1.3ps, datan 1.6ps; without them, clk_0v9 12.6ps, clk_1v5 11.4ps, datap 22.3ps, datan 25.2ps.]

Fig. 6. Timing diagrams showing the effectiveness of the ERE capacitors.
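The averaging mechanism can be reproduced with a small behavioral model: each pre-driver output is an RC ramp toward its own target level, and the ERE capacitor couples the two nodes so the faster edge lends charge to the slower one. All element values below are invented for illustration; the sketch demonstrates the mechanism only, not the design's actual devices.

    import numpy as np

    # Two logically identical edges: a fast PD node (0 -> 0.9V) and a slower
    # PU node (0.5 -> 1.5V). All values are hypothetical.
    Cg, Cc = 10e-15, 15e-15      # per-node ground cap, ERE coupling cap
    Ra, Rb = 400.0, 900.0        # drive resistances (PU made slower on purpose)
    v0 = np.array([0.0, 0.5])    # initial levels
    vt = np.array([0.9, 1.5])    # target levels

    def midpoint_times(with_cap, dt=1e-13, steps=4000):
        # C dv/dt = i, integrated with explicit Euler; returns the time at
        # which each node crosses the midpoint of its own swing.
        C = (np.array([[Cg + Cc, -Cc], [-Cc, Cg + Cc]]) if with_cap
             else np.diag([Cg, Cg]))
        Cinv = np.linalg.inv(C)
        x, t_cross = v0.copy(), [None, None]
        for n in range(steps):
            i = (vt - x) / np.array([Ra, Rb])   # driver currents toward targets
            x = x + dt * (Cinv @ i)
            for k in (0, 1):
                if t_cross[k] is None and x[k] >= (v0[k] + vt[k]) / 2:
                    t_cross[k] = n * dt
        return t_cross

    ta, tb = midpoint_times(True)
    ua, ub = midpoint_times(False)
    print(f"skew with ERE cap: {abs(ta - tb)*1e12:.1f} ps, "
          f"without: {abs(ua - ub)*1e12:.1f} ps")

With these invented values the skew drops by more than half; the effect strengthens as the coupling capacitor grows relative to the node capacitance, and the extracted design in Fig. 6 shows about an order of magnitude.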

IV. MEASUREMENT RESULTS

The prototype chip is implemented in a TSMC 40-nm CMOS process with conventional flip-chip package assembly. Bump pitch is 150um x 180um. The x32 bi-modal interface discussed in this paper is part of two x16 tri-modal interfaces [3], [5] that also support a higher-speed interface. For tests in GDDR5 and DDR3 modes, both interfaces are configured to operate in DDR mode. Fig. 7 shows the measured eye diagrams in different settings for both modes. One-bit per-byte DBI encoding was applied for the GDDR5-mode measurements in this paper; a sketch of the encoding follows.
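GDDR5-style buses use DC data-bus inversion: if a byte would drive more than four lines low (the current-drawing state in pseudo-open-drain signaling), the complement is sent instead and a ninth DBI line is asserted. A minimal sketch of this rule, with polarity conventions simplified:

    def dbi_dc_encode(byte_bits):
        # DC data-bus inversion: if more than four of the eight lines would
        # be driven low, transmit the complement and assert the DBI flag.
        if byte_bits.count(0) > 4:
            return [1 - b for b in byte_bits], 1
        return byte_bits, 0

    # Five zeros would be driven, so the byte is inverted and DBI asserted.
    assert dbi_dc_encode([0, 0, 0, 0, 0, 1, 1, 1]) == ([1, 1, 1, 1, 1, 0, 0, 0], 1)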

Fig. 7. Eye diagrams on a scope using a PRBS-7 pattern. (a) Un-equalized 6.4-Gbps GDDR5, 100mV/div and 30ps/div. (b) Equalized 6.4-Gbps GDDR5, 100mV/div and 30ps/div. (c) Un-equalized 2.4-Gbps DDR3, 100mV/div and 100ps/div.
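The PRBS-7 pattern in Fig. 7 is the maximal-length sequence of the polynomial x^7 + x^6 + 1, with a 127-bit period. A minimal generator (standard construction, not taken from the paper):

    def prbs7(n, seed=0x7F):
        # 7-bit Fibonacci LFSR with feedback taps at bits 7 and 6 (1-indexed).
        state, out = seed & 0x7F, []
        for _ in range(n):
            out.append(state & 1)
            fb = ((state >> 6) ^ (state >> 5)) & 1
            state = ((state << 1) | fb) & 0x7F
        return out

    bits = prbs7(254)
    assert bits[:127] == bits[127:]   # the sequence repeats every 127 bits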

The system is tested in x32 configuration over 3 FR4 PCB traces with the routing constraints of a typical 6-layer graphics card: with a GDDR5 DRAM in GDDR5 mode, and with two x16 DDR3 DRAMs in DDR3 mode. The test chip includes an on-chip bathtub measurement capability. The measured bathtub curve for the GDDR5 mode is shown in Fig. 8 for bursty WRITEs (with reads in between to validate the data) to the DRAM at a 6Gbps data rate. Operation at 6.4Gbps is not supported by the DRAM.
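The extrapolation behind such a bathtub margin can be sketched as follows: assuming a Gaussian jitter tail, measured BER points on an eye edge are mapped through the inverse Q-function, fit with a line, and extended to the target BER of 10^-16. All data points below are hypothetical; none are from the paper.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical measured points near the left eye edge: (phase in ps, BER).
    pts = [(30.0, 1e-5), (38.0, 1e-7), (46.0, 1e-9)]
    t = np.array([p for p, _ in pts])
    z = norm.isf(np.array([b for _, b in pts]))   # BER = Q(z), so z = Q^-1(BER)

    # For a Gaussian tail, z is linear in phase: z = (t - mu) / sigma.
    slope, intercept = np.polyfit(t, z, 1)
    sigma, mu = 1.0 / slope, -intercept / slope

    t16 = mu + sigma * norm.isf(1e-16)   # phase where the edge reaches 1e-16
    ui = 1e12 / 6e9                      # one bit period at 6Gbps, in ps
    print(f"extrapolated eye margin: {ui - 2 * t16:.1f} ps")  # symmetric edges assumed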

Fig. 8. WRITE timing bathtub curve in GDDR5 mode with optimized Tx EQ (24ps timing margin at 10^-16 BER and 6Gbps data rate).

TABLE 1: Overall Performance Summary
Technology: TSMC 40nm G+
Supply voltage (VddC/VddA/VddIO): 0.9V/1.0V/1.5V
Cell size (x32 interface): 6.1mm x 1.4mm
Clock frequency: 1.6GHz-3.2GHz
Data rate (GDDR5/DDR3): 6.4Gbps/2.4Gbps
Extrapolated continuous-WRITE timing margin w/ DBI @ 6Gbps GDDR5, BER = 10^-16: 24ps
Power @ 6.4Gbps (32 data links, 4 DBI, command/address): 2.4W (nominal)
Power @ 2.4Gbps (32 data links, command/address): 2.2W (nominal)
Energy efficiency @ 6.4Gbps (nominal): 11.6mW/Gbps
Energy efficiency @ 2.4Gbps (nominal): 27.7mW/Gbps


Table 1 summarizes the overall performance in GDDR5 and DDR3 modes. The x32 tri-modal interface occupies an area of 8.5mm2. The entire system, including the data and request channels, dissipates 2.4W in GDDR5 mode and 2.1W in DDR3 mode. Measured random jitter at the pad is 2.76ps (rms). The clocking architecture in the system is designed to support a frequency range of 1.6GHz to 4GHz. To enable 2.4-Gbps dual-data-rate (DDR) operation in DDR3 mode, bit-stuffing is employed in the pipe to extend every data bit by one bit period, as sketched below.
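A minimal model of that bit-stuffing, assuming the behavior described above (each bit is simply repeated so it occupies two bit periods of the faster pipe):

    def stuff_bits(bits):
        # Repeat each bit once so every symbol lasts two bit periods; the
        # serializer keeps its native rate while the line rate is halved.
        out = []
        for b in bits:
            out += [b, b]
        return out

    assert stuff_bits([1, 0, 1]) == [1, 1, 0, 0, 1, 1]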

Fig. 9 shows the die photo. The x32 interface was implemented at the lower edge of a die which hosted another x32 interface at the upper edge. The inductors visible in the photo belong to LC VCOs used in the high-speed mode supported by the tri-modal PHYs. In the DDR modes, a ring VCO is enabled within the tri-VCO PLL [5].

[Fig. 9 labels: 2.16mm x16 DQ, 1.80mm RQ, 2.16mm x16 DQ.]

Fig. 9. Die photo. Pad pitch is 180um (horizontal) and 150um (vertical).

V. CONCLUSION

A bi-modal GDDR5/DDR3 memory interface is enabled using a novel driver architecture. The driver uses exclusively thin-oxide transistors in the critical path to enable high-speed operation and minimize the capacitive loading at the pad. Appropriate voltage-stress protection mechanisms are implemented such that all thin-oxide transistors remain protected under the 1.5V IO supply. Edge-rate compensation mechanisms are also introduced to minimize duty-cycle distortion at the output. The system achieves a BER of 10^-16 with healthy margins while communicating bursty traffic to GDDR5 and DDR3 DRAMs.

REFERENCES

[1] R. Kho et al., "A 75nm 7Gb/s/pin 1Gb GDDR5 graphics memory device with bandwidth improvement techniques," IEEE Journal of Solid-State Circuits, Jan. 2010.
[2] H. Partovi et al., "Single-ended transceiver design techniques for 5.33Gbps graphics applications," ISSCC Dig. Tech. Papers, Feb. 2009.
[3] A. Amirkhany et al., "A 12.8-Gbps/link tri-modal single-ended memory interface for graphics applications," Symp. VLSI Circuits, Jun. 2011, in press.
[4] T. Wu et al., "Clocking circuits for a 16Gbps memory interface," IEEE CICC, Sep. 2008.
[5] K. Kaviani et al., "A tri-modal 20Gbps/link differential/DDR3/GDDR5 memory interface," Symp. VLSI Circuits, Jun. 2011, in press.
