
An RLDRAM II Implementation of a 10Gbps Shared Packet Buffer for Network Processing

Ciaran Toal, Dwayne Burns, Kieran McLaughlin, Sakir Sezer, Stephen O’Kane
Institute of Electronics, Communications and Information Technology (ECIT)
Queen’s University Belfast
Ciaran.Toal@ee.qub.ac.uk

Abstract

This paper presents the design and implementation of a fast shared packet buffer for throughput rates of at least 10Gbps using RLDRAM II memory. A complex packet buffer controller is implemented on an Altera FPGA and interfaced to the memory. Four RLDRAM II devices are combined to store the packet data and one RLDRAM II device is used to store a linked-list of the packet memory addresses, which is maintained by the packet buffer controller. The architecture is pipelined and optimised to combat the latencies involved with RLDRAM II technology to enable a high-performance, low-cost packet buffer implementation.

1. Introduction

Internet traffic continues to grow at rates close to 100% per year [1]. Existing data traffic processing architectures are not able to keep up with ever-increasing bandwidth demands, as a technology gap exists between the growth of Internet traffic and the growth in performance achievable purely through Moore's Law. With the emergence of new Internet traffic types and services, the number of different flows travelling through the network is increasing at an enormous rate. This presents new challenges in network nodes, where high-speed data processing is becoming ever more vital. High-throughput connections have extremely high data handling requirements, and part of this is data access to memory. The focus of this paper is the need for high-speed, high-capacity packet buffering architectures, in this case specifically for an IP packet scheduler.

SRAM memory structures are ideal in terms of latency, but their high cost and low memory capacity make them unfeasible for high-capacity packet buffering in network processing architectures. DDR II SDRAM architectures offer high memory capacity but are hindered by their latency. RLDRAM (Reduced Latency DRAM) is an alternative memory architecture that offers a memory capacity similar to that of DDR II SDRAM but with a lower latency [2].

This paper describes a complex memory controller architecture designed for an Altera FPGA that uses a number of RLDRAM II memory modules to create a 10Gbps shared packet buffer. Synthesis results are provided, as well as an analysis of possible future throughput capabilities using different TDM (Time Division Multiplexing) patterns.

2. Background Information

The shared buffer architecture has been identified as the most suitable buffering technique available [3]. This is mainly because it utilises the buffer memory more efficiently and offers a lower packet loss rate for a given amount of buffer space when compared to other buffering strategies such as the input buffer [4], [5] and the output buffer [6].

The majority of modern shared buffer architectures have been developed as single-chip solutions [7], [8]. However, advances in FPGA and SoC technologies have made these a viable platform for implementing high-speed, high-performance packet buffers.

2.1 Need for shared packet buffering in network processing

Packet scheduling has become a key technique for delivering Quality of Service (QoS) in packet-switched networks such as ATM networks and the Internet. Weighted Fair Queuing (WFQ) and packet-by-packet generalised processor sharing (PGPS) are the preferred schemes for scheduling variable-length packets such as IP packets [9], [10]. These scheduling techniques require efficient storage and retrieval of packets at network nodes, which is critical for their effective operation and ultimately for the provisioning of QoS.

The increase in throughput rates and the demand for service variety on the Internet, in particular, require that more efficient packet buffers and faster packet retrieval techniques be employed at nodes. This demand cannot be satisfied simply by increasing the physical size of the buffer memory at each node. Provisioning of application-specific QoS also encompasses the provisioning of packet buffers for individual flows. Sharing resources, in particular the packet buffer resource at network nodes, is essential for the efficient operation of sophisticated QoS-enabled routers and switches.

An FPGA-based shared buffer architecture has previously been investigated [11]. However, this was only for design validation purposes and is not intended for commercial exploitation. The reference presents a 4×4 ATM switch which has a shared buffer for 32 16-byte cells. It does not address the issue of buffering variable-sized packets as required for IP-based networks such as the Internet. It is therefore vital that an efficient design of a high-speed shared buffer can be produced to facilitate packet scheduling with a high degree of QoS for IP-based networks.

2.2 Packet Buffering for WFQ Scheduling

A hardware-based IP packet scheduler has been derived that utilises a WFQ algorithm deployed in hardware to provide QoS. As part of this architecture, high-speed packet buffering is required to handle the storage and retrieval of variable-sized IP packets. The data for individual packets is divided between cells of 128 bytes. A linked-list is used to track the position of subsequent cells belonging to individual packets, since these will be distributed throughout the shared buffer memory. The aim is to provide a shared buffer memory that can operate fast enough to support line speeds of 10Gbps.
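To make the cell and linked-list organisation concrete, the short sketch below (illustrative Python, not part of the published design; all names are hypothetical) segments a variable-sized packet into 128-byte cells and records, for each cell address, the address of the packet's next cell:

CELL_SIZE = 128  # bytes per cell in the shared buffer

def segment_packet(payload: bytes) -> list:
    """Split a variable-sized packet into fixed 128-byte cells,
    padding the final cell so each cell fills a complete memory burst."""
    return [payload[i:i + CELL_SIZE].ljust(CELL_SIZE, b"\x00")
            for i in range(0, len(payload), CELL_SIZE)]

# Illustrative model of the shared buffer: packet_mem holds cell data,
# next_cell holds the linked-list (None marks the end of a packet's chain).
packet_mem = {}
next_cell = {}

def store_packet(payload: bytes, free_cells: list) -> int:
    """Write a packet into the buffer cell by cell and return the address
    of its first cell, which the scheduler keeps until service time."""
    cells = segment_packet(payload)
    addrs = [free_cells.pop(0) for _ in cells]
    for addr, cell, nxt in zip(addrs, cells, addrs[1:] + [None]):
        packet_mem[addr] = cell
        next_cell[addr] = nxt
    return addrs[0]

Retrieval is the reverse walk: starting from the returned first-cell address, the chain in next_cell is followed until the whole packet has been read, which mirrors the traversal the control memory performs in hardware (Section 3.1).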

2.3 RLDRAM II

The high-speed, high-capacity shared buffer design that has been developed utilises a special type of DRAM optimised for low-latency access, called RLDRAM II. This memory, developed by Micron, is targeted at networking applications, L3 cache, high-end commercial graphics, etc. RLDRAM II utilises an eight-bank architecture that is optimised for high-speed operation and a double data rate I/O for increased bandwidth. The eight-bank architecture enables RLDRAM II devices to achieve peak bandwidth by decreasing the probability of random access conflicts.

The RLDRAM II architecture offers separate I/O (SIO) and common I/O (CIO) options. SIO devices have separate READ and WRITE ports to eliminate bus turnaround cycles and contention. Optimised for near-term READ and WRITE balance, RLDRAM II SIO devices are able to achieve full bus utilisation. CIO devices have a shared READ/WRITE port that requires one additional cycle to turn the bus around. The RLDRAM II CIO architecture is optimised for data streaming, where the near-term bus operation is either 100% READ or 100% WRITE, independent of the long-term balance.

3. RLDRAM II Shared Packet Buffer Architecture

The proposed architecture targets a minimum line rate of 10Gbps. It consists of five RLDRAM II memory chips: four of these are RLDRAM II CIO 8Mbyte devices with a 36-bit data width, and the last chip is an RLDRAM II CIO 16Mbyte device with an 18-bit data width. The four 8Mbyte devices are used to store the packets, whilst the single 16Mbyte device is used to store the linked-list of packet addresses.

Fig. 1 shows a high-level diagram of the architecture of this design.

[Figure 1: High-level diagram of packet buffer architecture. The control memory (1 x MT49H8M36FM2.5) and packet memory (4 x MT49H16M18FM2.5) sit above an RLDRAM II interface with control-memory and packet-memory state machines and ingress/egress control in the 200MHz domain, dual-clock ingress and egress FIFOs crossing to the 100MHz domain, and the packet scheduler below.]

The five RLDRAM II chips are shown at the top of the diagram. The system clock operates at half the rate of the memory interface logic, which itself is composed of an interface to the RLDRAM II chips, ingress and egress controllers, and separate state machines to operate the control memory for the linked-list and the packet memory itself.

The packet memories are divided into cells, or blocks, of 128 bytes, which corresponds to one complete read and write of the TDM cycle. Packets are stored into the linked-list using 128-byte cells. However, since RLDRAM II memory devices have read and write latencies of 4 and 5 clock cycles respectively when operating at 200MHz, incorporating the address linked-list into the same memory device as the packets is not feasible for high throughputs. Therefore the linked-list addresses are stored on a separate memory module, the control memory. There is an exact memory address mapping between the control memory and the packet memory.

The combined-IO nature of the RLDRAM devices means that a TDM (Time Division Multiplexing) pattern must be used. A trade-off in terms of memory usage and packet size must be made due to the burst access of the RLDRAM II memories. For this implementation a 4-2-4-2 TDM pattern is utilised, i.e. a 4 clock cycle write burst, a 2 clock cycle write-to-read turnaround, a 4 clock cycle read burst and finally a 2 clock cycle read-to-write turnaround. RLDRAM II uses a double data rate by accessing the memory on both the rising and falling edges of the clock.

Dual-clock FIFOs have been implemented to enable the packet data to cross between the clock domains. The memory controller is partitioned into ingress control and egress control. Since the RLDRAM II devices are combined IO, the TDM cadences are controlled using state machines that write and read the data from ingress and egress control to the RLDRAM II interface.

Fig. 2 shows a simplified timing diagram of the packet and control memory state machines, based on the 4-2-4-2 TDM cadence, which allows 128 bytes of data (one complete cell) to be written to and read from the memory.

[Figure 2: Packet and Control Memory State Machine Timing Diagram. Over the 12 clock cycle TDM period the packet memory state machine alternates between a read state and a write state separated by read-to-write and write-to-read turnarounds, while the control memory state machine performs an ingress read, an egress read and an egress write separated by NOPs.]

The control memory requires only one read for the ingress, one read for the egress and one write per serviced packet to maintain the address linked-list. The no-operation cycles (NOPs) are in place to satisfy the bank tRC (row cycle time) requirements. The RLDRAM II control memory only ever accesses bank 0, so each time it is accessed for one clock cycle there must be 3 clock cycles when it is not accessed.
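The cadence can be summarised with a small behavioural sketch (illustrative Python, not the actual HDL; the phase names are invented here). The comment also shows the cell-size arithmetic under the assumption, not stated in the paper, that 32 of the 36 data bits of each packet-memory device carry payload:

from itertools import cycle

# 4-2-4-2 cadence on the 200MHz memory clock: a 4-cycle write burst,
# a 2-cycle write-to-read turnaround, a 4-cycle read burst and a
# 2-cycle read-to-write turnaround, repeating every 12 clock cycles.
TDM_4_2_4_2 = ["WRITE"] * 4 + ["W2R_NOP"] * 2 + ["READ"] * 4 + ["R2W_NOP"] * 2

def tdm_phases(cadence=TDM_4_2_4_2):
    """Yield (clock_cycle, phase) pairs for the repeating TDM pattern."""
    for n, phase in enumerate(cycle(cadence)):
        yield n, phase

# With data transferred on both clock edges (DDR) and assuming 32 of the
# 36 bits per device carry payload, a 4-cycle burst across the four
# packet-memory devices moves 4 cycles * 2 beats * 4 devices * 32 bits
# = 1024 bits, i.e. one 128-byte cell per write burst and one per read burst.
phases = tdm_phases()
for _ in range(12):          # print one complete 12-cycle TDM period
    print(next(phases))

The same generator with different burst and gap lengths models the 4-5-4-5, 6-2-6-2 and 8-1-8-1 cadences examined in Section 5.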

3.1 Maintaining the address linked-list

On start-up the packet buffer first initialises the linked-list. The memory is assumed empty and there is only one linked-list chain of empty cell addresses. When a packet comes in on the ingress FIFO, it is collected into a 128-byte buffer before being written into the next available packet memory cell. The address of this cell is provided by the control memory. The first address of the packet is returned to the packet scheduler to be stored until the packet is to be serviced.

When a cell of data is being written to the packet memory, the same address is accessed in the control memory in order to fetch the address of the next available empty linked-list cell. The next cell address will therefore be available by the time the next cell of data has been collected.

When a packet is retrieved from memory, the first cell address is provided by the packet scheduler and both the packet memory and the control memory are accessed at this address at the same time. Whilst the packet cell is being read from the packet memory, the address of the next cell is read back from the control memory to be accessed on the next TDM cycle. The packet scheduler must also provide the packet length information on both packet ingress and egress to the memory controller so that it can calculate how many cells must be written or read.

Fig. 3 shows an example of how the linked-list in the control memory corresponds to data in the packet memory.

[Figure 3: Address linked-list. Each control memory entry in bank 0 holds, for the packet memory cell (banks 0-3) at the same address, the address of the next cell in its chain, with 'E' marking the end of a chain.]

Once a packet has been read out of memory, the corresponding linked-list of cell addresses must be joined onto the end of the empty address linked-list. The last address in the empty address linked-list is stored locally in the memory controller. Once a packet is read out, the first address of that packet is written into the location of the last link in the empty address linked-list. The last cell address of this packet is then stored locally in the memory controller as the new end of the empty list.

[Figure 4: Linked-list write back procedure]
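The allocate/retrieve/recycle bookkeeping of Section 3.1 and the write-back procedure of Fig. 4 can be captured in a short behavioural model. The sketch below is illustrative Python only (class and method names are hypothetical); the real controller performs these steps in hardware, overlapping the control-memory accesses with the packet-memory bursts:

END = None   # end-of-chain marker (shown as 'E' in Figure 3)

class ControlMemoryModel:
    """Behavioural model of the control-memory linked-list, with the head
    and tail of the empty-address chain held locally in the controller."""

    def __init__(self, num_cells: int):
        # On start-up every cell is empty, forming one chain 0 -> 1 -> ... -> END.
        self.control_mem = {a: (a + 1 if a + 1 < num_cells else END)
                            for a in range(num_cells)}
        self.free_head = 0               # next empty cell to hand to ingress
        self.free_tail = num_cells - 1   # last link of the empty chain

    def allocate_cell(self) -> int:
        """Ingress: return the next empty cell address; the control-memory
        read at that address already yields the following empty address."""
        addr = self.free_head
        self.free_head = self.control_mem[addr]
        return addr

    def next_of(self, addr: int):
        """Egress: while a cell is read from packet memory, the control
        memory returns the address of the packet's next cell."""
        return self.control_mem[addr]

    def recycle_chain(self, first: int, last: int) -> None:
        """Write-back: splice a serviced packet's chain of cell addresses
        onto the end of the empty chain (the single control-memory write
        per serviced packet) and remember the new tail locally."""
        self.control_mem[self.free_tail] = first
        self.free_tail = last

Because the scheduler supplies the packet length on both ingress and egress, the controller knows how many cells to write or read, so no end-of-packet marker is needed inside a chain; a real controller would additionally track the number of free cells so that the head never overruns the tail.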

4. Synthesis Results & Performance Analysis

The system runs with two clock domains, as shown in Fig. 1: a system clock of 100MHz, with the memory interface logic running at twice this speed, 200MHz. With the 4-2-4-2 configuration this gives a maximum throughput of 12.8Gbps, so the TDM structure used is more than capable of providing the 10Gbps that was originally specified.

The graph in Fig. 5 shows how the bandwidth of this implementation varies for different packet payload sizes, together with the rate required to achieve 10Gbps line throughput. The graph assumes a minimum payload size of 60 bytes with a 20-byte packet header. The dashed line shows the performance required to satisfy a line speed of 10Gbps. The rate actually required is lower than the line rate because the packet header data does not have to be processed.

[Figure 5: Bandwidth of proposed packet buffer architecture for various payload sizes. Achieved rate and the rate required for 10Gbps are plotted against payload size (0-500 bytes), with bandwidth (0-18Gbps) on the vertical axis.]

Fig. 5 shows that this architecture sustains 10Gbps traffic throughput for every packet size; in fact, for packets larger than 160 bytes, 12Gbps is supported.
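The dashed curve follows from the stated assumptions: a 20-byte header that is not stored in the buffer, so only the payload fraction of the 10Gbps line must be sustained. A minimal sketch of that calculation (illustrative only, not taken from the paper):

HEADER_BYTES = 20   # packet header size assumed in Fig. 5

def required_buffer_rate(payload_bytes: int, line_rate_gbps: float = 10.0) -> float:
    """Buffer bandwidth needed to keep up with the line when only the
    payload, not the header, is written to the packet memory."""
    return line_rate_gbps * payload_bytes / (payload_bytes + HEADER_BYTES)

# e.g. the minimum 60-byte payload needs only 10 * 60/80 = 7.5Gbps of
# buffer bandwidth on a fully loaded 10Gbps line.
for payload in (60, 128, 256, 500):
    print(payload, round(required_buffer_rate(payload), 2))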
The packet buffer controller has been targeted at an Altera Stratix II speed grade -4 device. The post-layout synthesis results are detailed in Table 1.

Table 1: Post Layout Synthesis Results

Device          | Clock Speed | ALUTs | Registers
EP2S130F1508C4  | 200MHz      | 6668  | 6233

For the configuration used, the RLDRAM II requires the packet buffer controller to operate at 200MHz, which is easily attainable using Stratix II FPGA technology. The design uses a total of 6,668 ALUTs (only 6% of the device). These results demonstrate that operating a shared buffer architecture using FPGA technology is possible for the target line speed of 10Gbps.

5. Exploration of Achievable Throughput Rates

Having shown line speeds of 10Gbps to be achievable using standard FPGA technology, the next aim is to improve the performance beyond this to 20Gbps for shared packet buffers. This will be necessary for future architectures, as it has already been established that network speed demands are increasing rapidly. In order to achieve higher throughputs, the architecture must operate at a higher frequency. The highest frequency at which the RLDRAM II can be configured to operate is 400MHz. As a consequence of running at such high speed, the tRC, tWR and tRL cycle counts increase, and it is no longer possible to operate the RLDRAM II using a 4-2-4-2 TDM.

In order to satisfy tRC, the TDM pattern must be at least 4-5-4-5, which produces the throughput curve shown in Fig. 6.

[Figure 6: Throughput for 400MHz with 4-5-4-5 TDM. Achieved rate and the rate required for 20Gbps are plotted against payload size (0-500 bytes).]

Fig. 6 shows that the 4-5-4-5 TDM pattern does not satisfy the required throughput for almost 50% of the packet sizes. Using a cell size of 128 bytes now becomes very inefficient.

Fig. 7 shows the throughput performance achieved if the TDM pattern is changed to a 6-2-6-2 cadence. In this case 6 banks of the packet memory RLDRAM II are utilised.

[Figure 7: Throughput for 400MHz with 6-2-6-2 TDM. Achieved rate and the rate required for 20Gbps are plotted against payload size (0-500 bytes).]

This shows that by utilising a 6-2-6-2 cadence, a 20Gbps throughput rate is satisfied for all packets with a payload size larger than 88 bytes. The maximum throughput shortfall, for the minimum packet payload size of 60 bytes, is 1.4Gbps.

Fig. 8 shows the throughput performance achieved if the TDM pattern is changed to an 8-1-8-1 cadence. In this case all 8 banks of the packet memory RLDRAM II are utilised.

[Figure 8: Throughput for 400MHz with 8-1-8-1 TDM. Achieved rate and the rate required for 20Gbps are plotted against payload size (0-500 bytes).]

This shows that utilising an 8-1-8-1 cadence, a 20Gbps throughput rate is satisfied for all packets with a payload size larger than 104 bytes. The maximum throughput shortfall, for the minimum packet payload size of 60 bytes, is now 2.9Gbps.

From this analysis, the most effective cadence to use is the 6-2-6-2 pattern. However, there is no pattern that can effectively handle 20Gbps throughput rates for packets with a payload size of 88 bytes or less.
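The trade-off between the cadences can be illustrated by comparing how much of each TDM period is spent on data bursts: longer bursts raise this fraction but, presumably with a correspondingly larger cell per burst, waste more of each slot on short packets, which is consistent with 6-2-6-2 outperforming 8-1-8-1 for the smallest payloads. The ratios below are simple arithmetic on the stated cadences (illustrative only, not a reproduction of the measured curves in Figs. 6-8):

# (write burst, write-to-read gap, read burst, read-to-write gap) in cycles
CADENCES = {
    "4-2-4-2 (200MHz)": (4, 2, 4, 2),
    "4-5-4-5 (400MHz)": (4, 5, 4, 5),
    "6-2-6-2 (400MHz)": (6, 2, 6, 2),
    "8-1-8-1 (400MHz)": (8, 1, 8, 1),
}

for name, (w, w2r, r, r2w) in CADENCES.items():
    period = w + w2r + r + r2w
    data_fraction = (w + r) / period      # share of cycles that move data
    print(f"{name}: {period}-cycle period, {data_fraction:.0%} data cycles")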

6. Conclusions

This paper has shown that a shared packet buffer operating beyond 10Gbps is possible using RLDRAM II and standard FPGA technology. The shared buffer architecture is seen to be the most suitable buffering solution for future packet buffers: it offers a lower packet loss rate for a given amount of buffer space and utilises the buffer memory more efficiently than other buffering strategies. This implementation uses a network of RLDRAM II memories and a memory controller implemented in FPGA technology. The architecture uses one RLDRAM II device to store an address linked-list of the cells that contain the packet data and four RLDRAM II devices to hold the packet data itself. The use of RLDRAM II overcomes the inadequacies of other RAM technologies, such as SRAM, which is very expensive, and DDR SDRAM, which has high latency.

This design is suitable for deployment in routers operating scheduling policies that require shared buffer access, as well as for deploying other complex networking technologies.

The architecture is flexible and scalable, and it has been shown that by reconfiguring it, throughputs approaching 20Gbps are possible for a large range of packet sizes. It is anticipated that further speed gains can be made by refining the architecture, making this design suitable for future networking applications with high-speed, high-capacity packet buffering requirements.

7. References
[1] International Data Group (IDG), "Worldwide Bandwidth End-User Forecast and Analysis, 2003-2007: More is Still Not Enough," 2003.
[2] http://www.micron.com/products/dram/rldram/
[3] H. Kuwahara, "A shared buffer memory switch for an ATM exchange," IEEE GLOBECOM, 1989.
[4] J.Y. Hui and E. Arthurs, "A Broadband Packet Switch for Integrated Transport," IEEE Journal on Selected Areas in Communications, Vol. 5, No. 8, pp. 1264-1273, Oct. 1988.
[5] C. Minkenberg and T. Engbersen, "A combined input and output queued packet switched system based on PRIZMA switch on a chip technology," IEEE Communications Magazine, Dec. 2000.
[6] M. Hluchyj and M. Karol, "Queuing in high-performance packet switching," IEEE Journal on Selected Areas in Communications, Vol. SAC-5, 1987.
[7] M. Lau and S. Shieh, "Gigabit Ethernet switches using a shared buffer architecture," IEEE Communications Magazine, 2003.
[8] K. J. Schultz and P. G. Gulak, "CAM-based single-chip shared buffer ATM switch," IEEE ICC'94, 1994.
[9] A. Demers, S. Keshav and S. Shenker, "Analysis and Simulation of a Fair Queuing Algorithm," in Proc. ACM SIGCOMM'89, pp. 3-12.
[10] A. Parekh and R. Gallager, "A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Single Node Case," IEEE/ACM Transactions on Networking, Vol. 1, pp. 344-357, June 1993.
[11] J.W. Shim, G.J. Jeong and M.K. Lee, "FPGA implementation of a scalable shared buffer ATM switch," ATM, June 1998.
