
DESIGN AND VERIFICATION OF A LAYER-2 ETHERNET MAC CLASSIFICATION

ENGINE FOR A GIGABIT ETHERNET SWITCH

Jorge Tonfat, Ricardo Reis

Universidade Federal do Rio Grande do Sul (UFRGS)

Instituto de Informática - PPGC/PGMicro, Av. Bento Gonçalves 9500, Porto Alegre, RS - Brazil
jorgetonfat@ieee.org, reis@inf.ufrgs.br

ABSTRACT

This work presents the design and verification of the main block of a Gigabit Ethernet switch for an ASIC based on the NetFPGA platform [1]. The main function of the layer-2 classification engine is to forward Ethernet frames to their corresponding output ports. To accomplish this task, the block stores the source MAC address of each frame in an SRAM memory and associates it with one of the input ports. The classification engine uses a hashing scheme that has proven effective in terms of performance and implementation cost. It can look up a constant 62.5 million frames per second, which is enough to work at wire-speed rate in a 42-port Gigabit switch. The main challenge was to achieve wire-speed rate during the learning process using external SRAM memory; this means that the bandwidth is not reduced when new flows appear. The block was synthesized with a 180nm process and verified using SystemVerilog. A constrained-random stimulus approach is used in a layered-testbench environment with self-checking capability.

Index Terms - Ethernet, NetFPGA, Classification Engine

1. INTRODUCTION

Ethernet is the most popular layer-2 (L2) protocol (data link layer according to the OSI model). It is widely used in Local Area Networks (LANs) and also in Metropolitan Area Networks (MANs). Its popularity is mainly due to its low cost, high performance, and fast standardization: 10 Mbit/s in 1983, 100 Mbit/s in 1995, 1 Gbit/s in 1998, and 10 Gbit/s in 2002.

At the beginning, LANs were designed around one shared communication channel. During the late 80s and early 90s, two main factors changed the way LANs were designed [2]: the LAN topology, which changed to a structured wiring system using central hubs, and the improvement of computing systems and applications, which exceeded the capacity of shared LANs, limiting the overall performance.

These factors, together with advances in microelectronics technology, allowed the development of LAN switches that use the wiring structure already installed to create a micro-segmented network. These changes have the following advantages: the possibility of eliminating collisions when the full-duplex operation mode is used, and the fact that each device has dedicated bandwidth and an independent data rate.

Gigabit Ethernet switches are deployed in different kinds of scenarios. Depending on the scenario, some characteristics differ, such as the lookup table size. This work focuses on scenarios where the L2 table is large (more than 100k entries), as in an enterprise core. In the present work, we propose a layer-2 classification engine for a Gigabit Ethernet switch. This paper is organized as follows: Section 2 presents a description of the NetFPGA platform. Section 3 presents related previous works. Section 4 describes the design and implementation of our L2 classification engine. Section 5 presents the verification methodology. Section 6 shows the results for an ASIC implementation, and Section 7 presents the conclusions and future work.

This work is partially supported by the Brazilian National Council for Scientific and Technological Development (CNPq - Brazil) and the Coordination for the Improvement of Higher Education Personnel (CAPES).

2. THE NETFPGA PLATFORM

The NetFPGA platform [1] was developed by a research group at Stanford to enable fast prototyping of networking hardware. It basically contains an FPGA, four 1 GigE ports, and buffer memory. The core clock of the board runs at 125 MHz. NetFPGA offers a basic modular hardware structure implemented in the FPGA, as shown in Figure 1. Frames inside the data pipeline have their own header format, as shown in Figure 2. The NetFPGA header contains information about the frame being processed, such as the frame size, the source port, and the destination port calculated by the classification engine. New modules can also add more headers. This pipelined structure allows us to implement specific functions in modules and integrate them quickly.
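The per-frame module header described above can be modeled in software. The following sketch is illustrative only: the field names and widths are assumptions, not the exact NetFPGA header layout, and the flooding rule shown anticipates the classification behavior described later in the paper.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model of a NetFPGA-style module header carried in front of
# each frame in the pipeline; fields are illustrative, not the real layout.
@dataclass
class ModuleHeader:
    frame_size: int   # frame length in bytes
    src_port: int     # port the frame arrived on
    dst_ports: int    # one-hot port bitmap, filled in by the classifier

def classify(header: ModuleHeader, hit_port: Optional[int], n_ports: int) -> None:
    """Fill in the destination ports; flood on a lookup miss."""
    if hit_port is not None:
        header.dst_ports = 1 << hit_port
    else:
        # miss: flood to every port except the one the frame came from
        header.dst_ports = ((1 << n_ports) - 1) & ~(1 << header.src_port)

hdr = ModuleHeader(frame_size=64, src_port=1, dst_ports=0)
classify(hdr, hit_port=None, n_ports=4)   # a miss on a 4-port switch
print(bin(hdr.dst_ports))                 # 0b1101: all ports except port 1
```

Downstream modules can then read the bitmap without re-parsing the Ethernet frame, which is the point of carrying a pipeline header.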

978-1-4244-8157-6/10/$26.00 ©2010 IEEE. ICECS 2010.



Fig. 1. The NetFPGA framework [1]. (The figure shows the FPGA pipeline: Input Arbiter, Output Port Lookup, Frame Marker, and Output Queues modules, with register I/O over PCI and DMA from the host.)

Fig. 3. Block diagram of the Gigabit Ethernet switch based on the NetFPGA platform.

Fig. 2. The NetFPGA packet format [1]. (The figure shows the first packet word, the second packet word, and an example last word with two valid bytes.)

3. L2 LOOKUP ARCHITECTURES

The L2 switching task needs three fields from the MAC layer header: the MAC address, the port related to that address, and the VLAN (Virtual LAN) id (if present) for flow identification. This information needs to be stored in some kind of data structure, and previous works have shown different approaches. Solutions using binary and ternary CAMs (content-addressable memories) have been used, but their cost per bit and power consumption, as shown in [3], make them unviable for switches deployed at the enterprise core or metro edge, where the lookup table size is in the hundreds of thousands of entries. Another option is a software-only switch, but tests done in [4] confirm that the switching task needs to be implemented in hardware. One popular solution is to use hash functions to store MAC addresses in an SRAM. This requires simple hardware compared with the alternatives, but it has disadvantages such as hash collisions and a decreased table capacity. To deal with hash collisions, the table is organized in buckets that contain multiple entries; since buckets are smaller than the table, the lookup is accelerated. Carefully chosen hash functions with a relatively uniform distribution of output values (memory indexes) over an arbitrary set of input values (MAC addresses) reduce hash collisions and improve the effective table capacity [5]. This work uses a hashing scheme that combines the MAC address and the VLAN id to generate indexes into the SRAM.

4. L2 CLASSIFICATION ENGINE ARCHITECTURE

The L2 classification engine is the block that implements the main functions of a Gigabit Ethernet switch, frame forwarding and learning, and is part of the data path as shown in Figure 3. It uses the MAC destination address (MACDA) and the MAC source address (MACSA) of each frame to forward it to its proper destination port. To achieve this, a table entry must be created with the MACSA and the input port, in a process called learning. When the MACDA is not found in the table (a miss), the frame is sent to all ports except the source port (flooding). According to the IEEE 802.1D standard [6], it is necessary to age out all entries that are not accessed for a programmable amount of time; this is not a priority task and should not interrupt the main learning/forwarding process. VLAN tags must also be considered during frame learning/forwarding to be compliant with the IEEE 802.1Q standard [7].

Every frame needs at least two read accesses to the lookup table: one for the MACDA (forwarding) and another for the MACSA (learning). If the MACSA is not found, or the associated source port differs, a third access (a write) is necessary to update the input port number or to create a new entry in the table. In our design, four read and two write accesses are needed because a two-entry bucket mechanism is used, so each bucket is stored in two consecutive memory addresses. One extra read access is needed when a VLAN frame is processed. VLAN information is stored in the same SRAM as the lookup table, as shown in Figure 4. The 4096 VLANs are mapped directly into the SRAM, four ids per memory entry; Figure 4 shows the organization for an 8-port Gigabit switch as an example. The 16 bits assigned to each VLAN id are divided into two groups: the VLAN tagged members and the VLAN members. It is important to know whether a port is a tagged member or not: tagged members should pass the VLAN tag in the frame header. This information is dispatched to the output ports module, inserted into the NetFPGA header shown in Figure 2.

Fig. 4. VLAN information memory format (tagged-member and member bitmaps, shown for an 8-port switch).

The lookup table is stored in a 72-bit wide SRAM. Each table entry is composed of a MAC address (48 bits), the input port (8 bits), the VLAN id (12 bits), and three status bits: the valid bit, the static bit, and the age bit. When a new MAC address is learned, the valid and age bits are set. If this MAC address is found later, the age bit is refreshed. In each aging pass, the age bit of every valid entry is cleared; if the age bit is already clear, the valid bit is cleared and the entry is aged out. The aging process does not modify entries with the static bit set; this bit denotes an entry programmed by external access.

The block diagram is presented in Figure 6. The header parser module extracts the MACSA, the MACDA, and the VLAN id from the frame and sends this information to the Exact Match and Learning block. The frame waits for the lookup in the input FIFO, which acts as a buffer. In the worst case (MACSA not found), the module has a constant latency of 12 clock cycles. This classification engine should be able to support 42 GigE ports at wire speed, i.e., to process 62.5 million frames per second in the worst case. The worst-case scenario appears when the switch deals with the minimum frame length (64 bytes), the minimum interframe gap, and negligible propagation delays, as explained in [2]. Under these conditions, a frame must be processed in at most 16 ns (672 ns for a single GigE port), or 8 clock cycles at a 500 MHz clock. To achieve this goal, the block needs to be tightly coupled with the external SRAM memory controller. Since each frame needs 12 cycles (including read and write requests, header data fetch, and hashing of the MAC addresses), the FSMs of both modules (SRAM arbiter and Exact Match & Learning) have 16 cycles and are able to process two frames. Figure 5 shows how two frames are processed interleaved, using 14 of the 16 cycles for memory accesses. The two remaining cycles are used for external access and for the aging module. Since the SRAM is accessed by three different sources (the forwarding/learning module, the aging module, and external access through the register bus), an arbiter is needed. It can accept one request (read or write) per clock cycle and has a latency of five cycles due to the pipeline structure and the ZBT (zero-bus-turnaround) SRAM.

Fig. 5. The 16-cycle SRAM arbiter FSM, showing only the memory requests (interleaved read/write requests for frames F0 and F1).

Fig. 6. L2 classification engine block diagram.

5. VERIFICATION METHODOLOGY

For the verification stage, we use SystemVerilog and ModelSim to create the testbench environment. The testbench architecture is shown in Figure 7.

A testcase has been developed with particular constraints that limit the random stimulus generation. With these constraints, the generator creates a programmable number of random frames that are inserted into the DUT (Design Under Test). The agent, or transactor, takes these frames (described as high-level variables), transforms them into signals (bytes), and sends them through the interface (driver). The scoreboard predicts the expected result of each block, and this result is used by the checker to compare against the data received from the DUT. During this process, tens of bugs were found in the design and corrected. It is always preferable that the testbench and the design be developed by different people; this adds some redundancy to the interpretation of the specification.
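As an illustrative software model (not the RTL), the hash-bucket lookup table with learning and aging described in Sections 3 and 4 can be sketched as follows. The hash function (CRC-32 here), the bucket count, and the dictionary layout are assumptions for illustration; the real design indexes a 72-bit-wide external SRAM and also carries static bits, which are simplified away.

```python
import zlib

BUCKET_BITS = 12   # assumed bucket count (4K) for illustration only
BUCKET_SIZE = 2    # the paper's design uses two-entry buckets

def bucket_index(mac: int, vlan: int) -> int:
    # Hash over MAC (48 bits) concatenated with VLAN id (12 bits).
    # CRC-32 stands in for the (unspecified) hardware hash function.
    key = ((mac << 12) | vlan).to_bytes(8, "big")
    return zlib.crc32(key) & ((1 << BUCKET_BITS) - 1)

class MacTable:
    def __init__(self):
        # each bucket holds up to BUCKET_SIZE (mac, vlan) -> {port, age} entries
        self.buckets = [dict() for _ in range(1 << BUCKET_BITS)]

    def learn(self, mac: int, vlan: int, port: int) -> bool:
        """Create or refresh the entry for (mac, vlan); set its age bit."""
        b = self.buckets[bucket_index(mac, vlan)]
        if (mac, vlan) in b or len(b) < BUCKET_SIZE:
            b[(mac, vlan)] = {"port": port, "age": True}
            return True
        return False  # bucket full: hash collision, entry not stored

    def lookup(self, mac: int, vlan: int):
        """Return the destination port, refreshing the age bit on a hit."""
        b = self.buckets[bucket_index(mac, vlan)]
        entry = b.get((mac, vlan))
        if entry:
            entry["age"] = True
            return entry["port"]
        return None  # miss: the caller floods the frame

    def age_tick(self):
        """One aging pass: clear age bits; drop entries already cleared."""
        for b in self.buckets:
            for key in list(b):
                if b[key]["age"]:
                    b[key]["age"] = False
                else:
                    del b[key]
```

An entry therefore survives as long as it is looked up at least once between consecutive aging passes, which matches the age-bit behavior described for the hardware table.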
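The worst-case numbers quoted for the engine follow from simple arithmetic on minimum-size frames, using the standard Ethernet preamble and interframe-gap sizes: at 1 Gbps one bit occupies 1 ns on the wire.

```python
# Minimum-size frame occupancy on a 1 Gbps wire:
FRAME, PREAMBLE, IFG = 64, 8, 12          # bytes (frame, preamble+SFD, gap)
wire_bits = (FRAME + PREAMBLE + IFG) * 8  # 672 bit times per frame slot
slot_ns = wire_bits                       # at 1 Gbps, 1 bit time = 1 ns

# The engine handles one frame every 8 cycles of a 500 MHz clock:
cycle_ns = 2                              # 1 / 500 MHz = 2 ns
frame_ns = 8 * cycle_ns                   # 16 ns per frame
frames_per_s = 10**9 // frame_ns          # 62,500,000 frames per second

print(slot_ns // frame_ns)                # 42: wire speed for 42 GigE ports
```

This reproduces the paper's figures: 672 ns per frame on a single GigE port, 62.5 million frames per second, and wire-speed operation for a 42-port switch.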

Fig. 7. Classification engine testbench architecture. (The figure shows the SystemVerilog environment around the DUT, including the report stage.)

6. IMPLEMENTATION RESULTS

Table 1 shows a comparison between different classification engines. The work in [8] does not mention the operating frequency. [9] reports higher bandwidth, but it is important to note that the bandwidth they show is an average that depends on the number of collisions and the size of the table; the results shown are for a 64K table. Compared with the results they obtain for a 32K table, the bandwidth is reduced by 25%, while the results presented in this work are constant with respect to the table size. Our module was synthesized with different table sizes (4K, 32K, 64K, and 128K); all of them achieve an operating frequency of 500 MHz and a bandwidth of 42 Gbps.

Table 1. Comparison with related works.

  Solution   Op. Freq. (MHz)   Bandwidth (Gbps)   Technology Process
  Ours       500               42                 TSMC 180nm
  [10]       125               22                 180nm
  [8]        not reported      10                 180nm
  [9]        400               103.5              UMC 130nm

7. CONCLUSIONS AND FUTURE WORK

A classification engine for a Gigabit Ethernet switch was presented. The verification stage is essential for finding bugs that appear only under random stimulus; the constrained-random approach reaches the coverage goal more efficiently than simpler methods such as directed tests. The architecture presented in this work achieves the throughput required for a 42-port GigE switch. The next step is to extend the engine to process layer-3 protocols such as IP; this will require a search algorithm such as LPM (Longest Prefix Match).

8. REFERENCES

[1] J. Naous, G. Gibb, S. Bolouki, and N. McKeown, "NetFPGA: reusable router architecture for experimental research," in PRESTO '08: Proceedings of the ACM Workshop on Programmable Routers for Extensible Services of Tomorrow, New York, NY, USA, 2008, pp. 1-7.

[2] R. Seifert and J. Edwards, The All-New Switch Book: The Complete Guide to LAN Switching Technology, Wiley, Hoboken, NJ, 2008.

[3] A. J. McAuley and P. Francis, "Fast routing table lookup using CAMs," in Proceedings of IEEE INFOCOM '93, 1993, vol. 3, pp. 1382-1391.

[4] J. Luo, J. Pettit, M. Casado, J. Lockwood, and N. McKeown, "Prototyping fast, simple, secure switches for Ethane," in 15th Annual IEEE Symposium on High-Performance Interconnects (HOTI 2007), 2007, pp. 73-82.

[5] C. Huntley, G. Antonova, and P. Guinand, "Effect of hash collisions on the performance of LAN switching devices and networks," in Proceedings of the 31st IEEE Conference on Local Computer Networks, 2006, pp. 280-284.

[6] IEEE Std 802.1D-2004, "IEEE standard for local and metropolitan area networks: media access control (MAC) bridges," 2004.

[7] IEEE Std 802.1Q-2005, "IEEE standard for local and metropolitan area networks: virtual bridged local area networks," 2006.

[8] S. M. Mishra, A. Guruprasad, C. F. Hu, P. K. Pandey, and M. Hung, "Wire-speed traffic management in Ethernet switches," in Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS '03), 2003, vol. 2, pp. II-105-II-108.

[9] V. Papaefstathiou and I. Papaefstathiou, "A hardware-engine for layer-2 classification in low-storage, ultra high bandwidth environments," in Proceedings of Design, Automation and Test in Europe (DATE '06), 2006, vol. 2, pp. 1-6.

[10] M. V. Lau, S. Shieh, P.-F. Wang, B. Smith, D. Lee, J. Chao, B. Shung, and C.-C. Shih, "Gigabit Ethernet switches using a shared buffer architecture," IEEE Communications Magazine, vol. 41, no. 12, pp. 76-84, Dec. 2003.

