
High Performance Switches and Routers: Theory and Practice
Sigcomm 99 Tutorial, August 30, 1999


Harvard University

High Performance Switching and Routing
Telecom Center Workshop: Sept 4, 1997

Nick McKeown

Balaji Prabhakar

Stanford University, Departments of Electrical Engineering and Computer Science


nickm@stanford.edu balaji@isl.stanford.edu

Tutorial Outline
Introduction:
What is a Packet Switch?

Packet Lookup and Classification:


Where does a packet go next?

Switching Fabrics:
How does the packet get there?

Output Scheduling:
When should the packet leave?

Copyright 1999. All Rights Reserved

Introduction
What is a Packet Switch?

Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers


Basic Architectural Components

Admission Control

Congestion Control

Routing Switching

Reservation

Control

Policing

Output Scheduling

Datapath:
per-packet processing


Basic Architectural Components


1.

Datapath: per-packet processing


2. Interconnect

Forwarding Table

3. Output Scheduling

Forwarding Decision
Forwarding Table

Forwarding Decision
Forwarding Table

Forwarding Decision

Where high performance packet switches are used


- Carrier Class Core Router - ATM Switch - Frame Relay Switch

The Internet Core

Edge Router

Enterprise WAN access & Enterprise Campus Switch


Introduction
What is a Packet Switch?

Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers


ATM Switch
Lookup cell VCI/VPI in VC table. Replace old VCI/VPI with new. Forward cell to outgoing interface. Transmit cell onto link.


Ethernet Switch
Lookup frame DA in forwarding table.
If known, forward to correct port. If unknown, broadcast to all ports.

Learn SA of incoming frame. Forward frame to outgoing interface. Transmit frame onto link.
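The learn-and-forward loop above can be sketched as a minimal software bridge; port numbers and the shortened MAC addresses are illustrative:

```python
class LearningBridge:
    """Minimal Ethernet learning bridge: learn source addresses,
    forward to the known port, or flood to all other ports."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.table = {}  # MAC address -> port

    def receive(self, frame_sa, frame_da, in_port):
        # Learn: remember which port the source address lives on.
        self.table[frame_sa] = in_port
        # Forward: known destination -> one port; unknown -> flood.
        out = self.table.get(frame_da)
        if out == in_port:
            return []          # destination is on the arrival segment
        if out is not None:
            return [out]
        return [p for p in range(self.num_ports) if p != in_port]

bridge = LearningBridge(num_ports=4)
flood = bridge.receive("aa:aa", "bb:bb", in_port=0)   # bb:bb unknown: flood
reply = bridge.receive("bb:bb", "aa:aa", in_port=2)   # aa:aa learned on port 0
```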


IP Router
Lookup packet DA in forwarding table.
If known, forward to correct port. If unknown, drop packet.

Decrement TTL, update header Cksum. Forward packet to outgoing interface. Transmit packet onto link.
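The TTL decrement and checksum update can be sketched as below. For clarity this recomputes the IPv4 header checksum from scratch rather than using the incremental update of RFC 1624; a plain 20-byte header with no options is assumed:

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of 16-bit words (checksum field zeroed by caller)."""
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

def forward(header: bytes) -> bytes:
    """Decrement TTL (byte 8) and rewrite the header checksum (bytes 10-11)."""
    h = bytearray(header)
    if h[8] == 0:
        raise ValueError("TTL expired")
    h[8] -= 1
    h[10:12] = b"\x00\x00"
    struct.pack_into("!H", h, 10, ipv4_checksum(bytes(h)))
    return bytes(h)

# Build a 20-byte header with TTL=64 and a valid checksum.
hdr = bytearray(20)
hdr[0] = 0x45          # version/IHL
hdr[8] = 64            # TTL
hdr[9] = 6             # protocol = TCP
struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
out = forward(bytes(hdr))
```

A receiver validates the result by summing the whole header, checksum included: the one's-complement sum of a correct header is zero.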


Introduction
What is a Packet Switch?

Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers


First-Generation IP Routers
Shared Backplane
CPU
Buffer Memory

CPU, Memory, Line Interface

DMA

DMA

DMA

Line Interface
MAC

Line Interface
MAC

Line Interface
MAC


Second-Generation IP Routers
CPU
Buffer Memory

DMA

DMA

DMA

Line Card Local Buffer Memory


MAC

Line Card Local Buffer Memory


MAC

Line Card Local Buffer Memory


MAC


Third-Generation Switches/Routers

Switched Backplane
Line Interfaces, CPU, Memory

Line Card Local Buffer Memory


MAC

CPU Card

Line Card Local Buffer Memory


MAC


Fourth-Generation Switches/Routers
Clustering and Multistage


Packet Switches
References
- J. Giacopelli, M. Littlewood, W. D. Sincoskie, "Sunshine: A high performance self-routing broadband packet switch architecture," ISS 90.
- J. S. Turner, "Design of a broadcast packet switching network," IEEE Trans Comm, June 1988, pp. 734-743.
- C. Partridge et al., "A fifty gigabit per second IP router," IEEE Trans Networking, 1998.
- N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, "The Tiny Tera: A packet switch core," IEEE Micro Magazine, Jan-Feb 1997.

Tutorial Outline
Introduction:
What is a Packet Switch?

Packet Lookup and Classification:


Where does a packet go next?

Switching Fabrics:
How does the packet get there?

Output Scheduling:
When should the packet leave?


Basic Architectural Components


1.

Datapath: per-packet processing


2. Interconnect

Forwarding Table

3. Output Scheduling

Forwarding Decision
Forwarding Table

Forwarding Decision
Forwarding Table

Forwarding Decision

Forwarding Decisions
ATM and MPLS switches Bridges and Ethernet switches
Associative Lookup Hashing Trees and tries Caching CIDR Patricia trees/tries Other methods Direct Lookup

IP Routers

Packet Classification


ATM and MPLS Switches


Direct Lookup

VCI

Memory

(Port, VCI)

Address

Data
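Because VCI/VPI labels are allocated per link, the lookup is a single memory read: the incoming VCI indexes a table whose entry holds the outgoing (port, new VCI). A minimal sketch; the table size and entries are illustrative:

```python
VC_TABLE_SIZE = 4096  # one slot per possible incoming VCI on this link

# Each slot holds (output port, outgoing VCI), or None if unassigned.
vc_table = [None] * VC_TABLE_SIZE
vc_table[42] = (3, 77)   # cells arriving with VCI 42 leave port 3 as VCI 77

def switch_cell(vci: int):
    entry = vc_table[vci]         # direct lookup: one memory access
    if entry is None:
        raise KeyError(f"no connection for VCI {vci}")
    out_port, new_vci = entry     # old label is replaced on the way out
    return out_port, new_vci
```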


Forwarding Decisions
ATM and MPLS switches Bridges and Ethernet switches
Associative Lookup Hashing Trees and tries Caching CIDR Patricia trees/tries Other methods Direct Lookup

IP Routers

Packet Classification


Bridges and Ethernet Switches


Associative Lookups
Associative Memory or CAM
Associated Data

Advantages:
Simple

Search Data
48

Network Associated Address Data

Disadvantages
Slow High Power Small Expensive

Hit?
log2N

Address


Bridges and Ethernet Switches


Hashing
Associated Data

Address

48

Hashing Function

16

Data

Search Data

Memory

Hit?
log2N

Address


Lookups Using Hashing


An example
Memory #1
Search Data
48

#2 #2

#3

#4

Hashing Function

16

CRC-16

#1

{
#3

Associated Data

Hit?
log2N

Address

Linked lists

Lookups Using Hashing


Performance of simple example

ER = 1 + (M - 1) / 2N

Where: ER = Expected number of memory references, M = Number of memory addresses in table, N = Number of linked lists; average chain length = M/N
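A sketch of the scheme: hash the 48-bit address into a 16-bit bucket index and chain collisions in linked lists (Python lists here). The CRC-16 of the slide is stood in for by truncating `zlib.crc32`, an assumption made for illustration:

```python
import zlib

NUM_LISTS = 1 << 16   # N linked lists, indexed by a 16-bit hash

def h16(addr48: int) -> int:
    """16-bit hash of a 48-bit MAC address (crc32 truncated, not a true CRC-16)."""
    return zlib.crc32(addr48.to_bytes(6, "big")) & 0xFFFF

table = [[] for _ in range(NUM_LISTS)]

def insert(addr48, port):
    table[h16(addr48)].append((addr48, port))

def lookup(addr48):
    refs = 0
    for entry, port in table[h16(addr48)]:   # walk the chain
        refs += 1
        if entry == addr48:
            return port, refs                # refs = memory references used
    return None, refs

insert(0x0000AA112233, 5)
insert(0x0000BB445566, 9)
```

Averaging `refs` over many lookups gives the ER of the formula above: small in expectation, but not deterministic.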


Lookups Using Hashing


Advantages:
Simple Expected lookup time can be small

Disadvantages
Non-deterministic lookup time Inefficient use of memory


Trees and Tries


Binary search tree: N entries, depth log2 N, comparisons branch on &lt; / &gt;.
Binary search trie: branch on successive address bits 0/1; e.g. leaves 010 and 111.


Trees and Tries


Multiway tries
16-ary search trie: each node holds 16 (label, pointer) entries, 0000 through 1111; e.g. leaves reached for addresses 000011110000 and 111111111111.


Trees and Tries


Multiway tries
En = 1 + Σ (i = 1 … L−1) D^i (1 − (1 − D^−i)^N)

Ew = D·En − (En − 1) − N

Where: D = Degree of tree, L = Number of layers/references, N = Number of entries in table, En = Expected number of nodes, Ew = Expected amount of wasted memory (slots)

Degree of Tree   # Mem References   # Nodes (x10^6)   Total Memory (MB)   Fraction Wasted (%)
2                48                 1.09              4.3                 49
4                24                 0.53              4.3                 73
8                16                 0.35              5.6                 86
16               12                 0.25              8.3                 93
64               8                  0.17              21                  98
256              6                  0.12              64                  99.5

Table produced from 2^15 randomly generated 48-bit addresses
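The expected node count of a D-ary trie over uniform random addresses can be checked numerically. The sketch below sums, per level, the expected number of distinct length-i prefixes among N random keys; the two-byte table slot used to convert node counts into megabytes is an assumption:

```python
def expected_nodes(D: int, L: int, N: int) -> float:
    """E_n for a D-ary trie of L levels holding N uniform random keys:
    1 (root) + expected distinct prefixes at each deeper level."""
    total = 1.0
    for i in range(1, L):
        buckets = float(D) ** i
        total += buckets * (1.0 - (1.0 - 1.0 / buckets) ** N)
    return total

N = 2 ** 15
en2 = expected_nodes(2, 48, N)     # binary trie over 48-bit addresses
en16 = expected_nodes(16, 12, N)   # 16-ary trie: 12 levels of 4 bits each
mb2 = 2 * en2 * 2 / 1e6            # D slots/node, ~2 bytes/slot (assumed)
```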



Forwarding Decisions
ATM and MPLS switches Bridges and Ethernet switches
Associative Lookup Hashing Trees and tries Caching CIDR Patricia trees/tries Other methods Direct Lookup

IP Routers

Packet Classification


Caching Addresses
Slow Path
CPU
Buffer Memory

Fast Path
DMA

DMA

DMA

Line Card Local Buffer Memory


MAC

Line Card Local Buffer Memory


MAC

Line Card Local Buffer Memory


MAC


Caching Addresses
LAN: Average flow &lt; 40 packets
WAN: Huge number of flows

(Figure: cache hit rate over time, with cache = 10% of full table.)



IP Routers
Class-based addresses
IP Address Space
Class A Class B Class C D

212.17.9.4

Class A Class B Class C

Routing Table: Exact Match 212.17.9.0 Port 4


IP Routers
CIDR
Class-based: A | B | C | D over the range 0 to 2^32 − 1

Classless: e.g. 65/8, 128.9/16 (2^16 addresses starting at 128.9.0.0), 142.12/19

128.9.16.14 falls within 128.9/16

IP Routers
CIDR
128.9.19/24, 128.9.25/24, 128.9.16/20, 128.9.176/20 are nested inside 128.9/16 (within 0 to 2^32 − 1)

128.9.16.14: most specific route = longest matching prefix



IP Routers
Metrics for Lookups
For 128.9.16.14:

Prefix        Port
65/8          3
128.9/16      5
128.9.16/20   2
128.9.19/24   7
128.9.25/24   10
128.9.176/20  1
142.12/19     3

Metrics: lookup time, storage space, update time, preprocessing time


IP Router
Lookup
H E A D E R

Dstn Addr

Forwarding Engine Next Hop Computation Forwarding Table Destination Next Hop -------------------

Next Hop

Incoming Packet

IPv4 unicast destination address based lookup



Need more than IPv4 unicast lookups


Multicast
PIM-SM
Longest Prefix Matching on the source and group address Try (S,G) followed by (*,G) followed by (*,*,RP) Check Incoming Interface

DVMRP:
Incoming Interface Check followed by (S,G) lookup

IPv6
128-bit destination address field Exact address architecture not yet known


Lookup Performance Required


Line    Line Rate   Pkt size = 40B   Pkt size = 240B
T1      1.5 Mb/s    4.68 Kpps        0.78 Kpps
OC3     155 Mb/s    480 Kpps         80 Kpps
OC12    622 Mb/s    1.94 Mpps        323 Kpps
OC48    2.5 Gb/s    7.81 Mpps        1.3 Mpps
OC192   10 Gb/s     31.25 Mpps       5.21 Mpps

Gigabit Ethernet (84B packets): 1.49 Mpps



Size of the Routing Table

Source: http://www.telstra.net/ops/bgptable.html

Ternary CAMs
Value       Mask             Next Hop
10.0.0.0    255.0.0.0        R1
10.1.0.0    255.255.0.0      R2
10.1.1.0    255.255.255.0    R3
10.1.3.0    255.255.255.0    R4
10.1.3.1    255.255.255.255  R4

All entries are compared in parallel; a priority encoder selects the first (highest priority) match.
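A software model of the ternary match: a real TCAM compares every (value, mask) entry in parallel and a priority encoder returns the lowest-index match, so entries are stored longest-prefix first. Here that ordering is produced with a sort:

```python
def ip(s):  # dotted quad -> 32-bit int
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# (value, mask, next hop), as in the table above.
entries = [
    (ip("10.0.0.0"), ip("255.0.0.0"),       "R1"),
    (ip("10.1.0.0"), ip("255.255.0.0"),     "R2"),
    (ip("10.1.1.0"), ip("255.255.255.0"),   "R3"),
    (ip("10.1.3.0"), ip("255.255.255.0"),   "R4"),
    (ip("10.1.3.1"), ip("255.255.255.255"), "R4"),
]
# Order by decreasing mask specificity so index 0 is highest priority.
tcam = sorted(entries, key=lambda e: bin(e[1]).count("1"), reverse=True)

def tcam_lookup(addr):
    for value, mask, hop in tcam:    # models parallel compare + priority encoder
        if addr & mask == value:     # first (most specific) match wins
            return hop
    return None
```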

Copyright 1999. All Rights Reserved

41

Binary Tries
Example Prefixes: a) 00001, b) 00010, c) 00011, d) 001, e) 0101, f) 011, g) 100, h) 1010, i) 1100, j) 11110000
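The prefixes above can be loaded into a binary trie; longest-prefix match then walks the trie, remembering the last prefix-bearing node passed. A minimal sketch using the slide's prefixes as bit strings:

```python
PREFIXES = {"00001": "a", "00010": "b", "00011": "c", "001": "d",
            "0101": "e", "011": "f", "100": "g", "1010": "h",
            "1100": "i", "11110000": "j"}

def make_node():
    return {"0": None, "1": None, "entry": None}

root = make_node()
for prefix, name in PREFIXES.items():
    node = root
    for bit in prefix:              # create the path for this prefix
        if node[bit] is None:
            node[bit] = make_node()
        node = node[bit]
    node["entry"] = name

def lpm(addr_bits: str):
    """Walk addr_bits from the root; the last entry seen is the longest match."""
    best, node = None, root
    for bit in addr_bits:
        node = node[bit]
        if node is None:
            break
        if node["entry"] is not None:
            best = node["entry"]
    return best
```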


Patricia Tree
Same example prefixes; the one-way branch leading to j) 11110000 is compressed into a single node with Skip = 5.


Patricia Tree
Advantages: general solution; extensible to wider fields.
Disadvantages: many memory accesses; may need backtracking; pointers take up a lot of space.

Avoid backtracking by storing the intermediate-best matched prefix. (Dynamic Prefix Tries) 40K entries: 2MB data structure with 0.3-0.5 Mpps [O(W)]

Binary search on trie levels




Binary search on trie levels


Store a hash table for each prefix length to aid search at a particular trie level.

Example prefixes: 10.0.0.0/8, 10.1.0.0/16, 10.1.1.0/24, 10.1.2.0/24, 10.2.3.0/24
Hash tables: Length 8: {10}; Length 16: {10.1, 10.2}; Length 24: {10.1.1, 10.1.2, 10.2.3}
Example addresses: 10.1.1.4, 10.4.4.3, 10.2.3.9, 10.2.4.8
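A runnable sketch of the scheme with the example prefixes: one hash table per prefix length, markers left on each prefix's binary-search path, and each marker precomputed with its best matching shorter prefix so the search never backtracks. Route names R8, R16, R24a, etc. are illustrative:

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

ROUTES = {("10.0.0.0", 8): "R8", ("10.1.0.0", 16): "R16",
          ("10.1.1.0", 24): "R24a", ("10.1.2.0", 24): "R24b",
          ("10.2.3.0", 24): "R24c"}

lengths = sorted({l for _, l in ROUTES})          # [8, 16, 24]
tables = {l: {} for l in lengths}                 # one hash table per length

def key(addr, l):
    return addr >> (32 - l)

# Insert real prefixes, then markers along each prefix's binary-search path.
for (p, l), hop in ROUTES.items():
    tables[l][key(ip(p), l)] = {"real": hop, "bmp": None}
for (p, l), hop in ROUTES.items():
    lo, hi = 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        ml = lengths[mid]
        if ml == l:
            break
        if ml < l:    # search probes here before reaching l: leave a marker
            tables[ml].setdefault(key(ip(p), ml), {"real": None, "bmp": None})
            lo = mid + 1
        else:
            hi = mid - 1
# Precompute each entry's best matching real prefix (avoids backtracking).
for l in lengths:
    for k, e in tables[l].items():
        if e["real"] is not None:
            e["bmp"] = e["real"]
            continue
        for sl in reversed([x for x in lengths if x < l]):
            hit = tables[sl].get(k >> (l - sl))
            if hit and hit["real"] is not None:
                e["bmp"] = hit["real"]
                break

def lookup(addr_str):
    addr, best = ip(addr_str), None
    lo, hi = 0, len(lengths) - 1
    while lo <= hi:                      # binary search over prefix lengths
        mid = (lo + hi) // 2
        e = tables[lengths[mid]].get(key(addr, lengths[mid]))
        if e:                            # hit (real or marker): try longer
            best = e["bmp"] or best
            lo = mid + 1
        else:                            # miss: only shorter lengths can match
            hi = mid - 1
    return best
```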



Binary search on trie levels


Disadvantages Multiple hashed memory accesses. Updates are complex. Advantages Scaleable to IPv6.

33K entries: 1.4MB data structure with 1.2-2.2 Mpps [O(log W)]

Compacting Forwarding Tables

1 0 0 0

1 1 0 0 0 1


Compacting Forwarding Tables


Bit-vector: 10001010 11100010 10000010 10110100 11000000
Codeword array: (R1, 0), (R2, 3), (R3, 7), (R4, 9), (R5, 0)
Base index array: 0, 13

Compacting Forwarding Tables


Disadvantages Scalability to larger tables? Updates are complex. Advantages Extremely small data structure - can fit in cache.

33K entries: 160KB data structure with average 2Mpps [O(W/k)]



Multi-bit Tries
16-ary search trie: each node holds 16 (label, pointer) entries, 0000 through 1111; e.g. leaves reached for addresses 000011110000 and 111111111111.


Compressed Tries
Only 3 memory accesses: one each at levels L8, L16, and L24.


Routing Lookups in Hardware

(Histogram: number of prefixes vs. prefix length.)

Most prefixes are 24-bits or shorter



Routing Lookups in Hardware


Prefixes up to 24-bits: the top 24 bits of the address (e.g. 142.19.6 of 142.19.6.14) directly index a table of 2^24 = 16M next-hop entries.


Routing Lookups in Hardware


Prefixes up to 24-bits resolve in the first table (e.g. 128.3.72 indexes a next hop directly). For prefixes longer than 24 bits, the entry instead holds a pointer selecting a 256-entry block (base) in a second table; the last 8 bits of the address (offset, e.g. 44 for 128.3.72.44) select the next hop within that block.
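A software model of the two-table scheme. The 2^24-entry first table is modeled with a dict; entry format, next-hop names, and the example routes are illustrative, and prefixes shorter than /24 would be expanded into multiple first-table entries before installation:

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

tbl24 = {}   # models the 2^24-entry first table, indexed by the top 24 bits
tbl2 = []    # second table: one 256-entry block per long prefix

def add_route_24(prefix, hop):
    """Install a /24 route: the entry is the next hop itself (flag 0)."""
    tbl24[ip(prefix) >> 8] = (0, hop)

def add_route_long(prefix24, sub_entries, default_hop):
    """Install routes longer than /24 sharing a common 24-bit prefix."""
    base = len(tbl2) // 256
    tbl2.extend([default_hop] * 256)
    for last_byte, hop in sub_entries.items():
        tbl2[base * 256 + last_byte] = hop
    tbl24[ip(prefix24) >> 8] = (1, base)          # flag 1: entry is a pointer

def lookup(addr):
    entry = tbl24.get(ip(addr) >> 8)              # first memory access
    if entry is None:
        return None
    flag, value = entry
    if flag == 0:
        return value                              # done in one access
    return tbl2[value * 256 + (ip(addr) & 0xFF)]  # second access: last 8 bits

add_route_24("142.19.6.0", "A")
add_route_long("128.3.72.0", {44: "B"}, default_hop="C")
```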


Routing Lookups in Hardware


Generalization: prefixes up to N bits index a 2^N-entry first table; prefixes longer than N bits consume second-table blocks indexed by the next M bits (N+M total).


Routing Lookups in Hardware


Disadvantages Large memory required (9-33MB) Depends on prefix-length distribution.
Advantages 20Mpps with 50ns DRAM Easy to implement in hardware

Various compression schemes can be employed to decrease the storage requirements: e.g. employ carefully chosen variable length strides, bitmap compression etc.

IP Router Lookups
References
- A. Brodnik, S. Carlsson, M. Degermark, S. Pink, "Small forwarding tables for fast routing lookups," Sigcomm 1997, pp. 3-14.
- B. Lampson, V. Srinivasan, G. Varghese, "IP lookups using multiway and multicolumn search," Infocom 1998, pp. 1248-56, vol. 3.
- M. Waldvogel, G. Varghese, J. Turner, B. Plattner, "Scalable high speed IP routing lookups," Sigcomm 1997, pp. 25-36.
- P. Gupta, S. Lin, N. McKeown, "Routing lookups in hardware at memory access speeds," Infocom 1998, pp. 1241-1248, vol. 3.
- S. Nilsson, G. Karlsson, "Fast address lookup for Internet routers," IFIP Intl Conf on Broadband Communications, Stuttgart, Germany, April 1-3, 1998.
- V. Srinivasan, G. Varghese, "Fast IP lookups using controlled prefix expansion," Sigmetrics, June 1998.


Forwarding Decisions
ATM and MPLS switches Bridges and Ethernet switches
Associative Lookup Hashing Trees and tries Caching CIDR Patricia trees/tries Other methods Direct Lookup

IP Routers

Packet Classification


Providing Value-Added Services


Some examples
Differentiated services
Regard traffic from Autonomous System #33 as 'platinum-grade'

Access Control Lists


Deny udp host 194.72.72.33 194.72.6.64 0.0.0.15 eq snmp

Committed Access Rate


Rate limit WWW traffic from sub-interface#739 to 10Mbps

Policy-based Routing
Route all voice traffic through the ATM network


Packet Classification
H E A D E R

Forwarding Engine Packet Classification Classifier (Policy Database) Predicate Action -------------------

Action

Incoming Packet

Multi-field Packet Classification


Rule    Field 1            Field 2           …   Field k   Action
Rule1   152.163.190.69/21  152.163.80.11/32  …   UDP       A1
Rule2   152.168.3.0/24     152.163.0.0/16    …   TCP       A2
…
RuleN   152.168.0.0/16     152.0.0.0/8       …   ANY       An
Given a classifier with N rules, find the action associated with the highest priority rule matching an incoming packet.
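The problem statement maps directly onto a first-match linear scan, the baseline that the schemes which follow try to beat. Rule fields, prefixes, and actions below are illustrative:

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def pfx(s):                      # "a.b.c.d/len" -> (value, mask)
    addr, plen = s.split("/")
    mask = 0 if plen == "0" else (0xFFFFFFFF << (32 - int(plen))) & 0xFFFFFFFF
    return ip(addr) & mask, mask

# Rules in priority order: (src prefix, dst prefix, protocol or None, action)
RULES = [
    (pfx("152.163.190.0/24"), pfx("152.163.80.0/24"), "UDP", "A1"),
    (pfx("152.168.3.0/24"),   pfx("152.163.0.0/16"),  "TCP", "A2"),
    (pfx("152.0.0.0/8"),      pfx("0.0.0.0/0"),       None,  "An"),
]

def classify(src, dst, proto):
    # First match wins = highest priority rule; O(N) per packet.
    for (sv, sm), (dv, dm), p, action in RULES:
        if ip(src) & sm == sv and ip(dst) & dm == dv and p in (None, proto):
            return action
    return "default"
```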

Geometric Interpretation in 2D
Each rule is a region in (Field #1, Field #2) space: R1 … R7 are rectangles, e.g. (144.24/16, 64/24) or (128.16.46.23, *). An incoming packet is a point (P1, P2); it matches the highest priority region containing it.


Proposed Schemes
Sequential Evaluation
  Pros: small storage, scales well with the number of fields
  Cons: slow classification rates
Ternary CAMs
  Pros: single-cycle classification
  Cons: cost, density, power consumption
Grid of Tries (Srinivasan et al [Sigcomm 98])
  Pros: small storage requirements and fast lookup rates for two fields; suitable for big classifiers
  Cons: not easily extendible to more than two fields


Proposed Schemes (Contd.)


Crossproducting (Srinivasan et al [Sigcomm 98])
  Pros: fast accesses; suitable for multiple fields
  Cons: large memory requirements; suitable without caching only for classifiers with fewer than 50 rules
Bit-level Parallelism (Lakshman and Stiliadis [Sigcomm 98])
  Pros: suitable for multiple fields
  Cons: large memory bandwidth required; comparatively slow lookup rate; hardware only


Proposed Schemes (Contd.)


Hierarchical Intelligent Cuttings (Gupta and McKeown [HotI 99])
  Pros: suitable for multiple fields; small memory requirements; good update time
  Cons: large preprocessing time
Tuple Space Search (Srinivasan et al [Sigcomm 99])
  Pros: suitable for multiple fields; the basic scheme has good update times and memory requirements
  Cons: classification rate can be low; requires perfect hashing for determinism
Recursive Flow Classification (Gupta and McKeown [Sigcomm 99])
  Pros: fast accesses; suitable for multiple fields; reasonable memory requirements for real-life classifiers
  Cons: large preprocessing time and memory requirements for large classifiers

Grid of Tries
(Figure: a trie on dimension 1 whose valid nodes point to tries on dimension 2 storing rules R1 through R7.)

Grid of Tries
Disadvantages Static solution Not easy to extend to higher dimensions Advantages Good solution for two dimensions

20K entries: 2MB data structure with 9 memory accesses [at most 2W]

Classification using Bit Parallelism


(Figure: each dimension yields a rule bitmap, e.g. 0111 and 1100; ANDing the bitmaps and taking the highest priority set bit selects among R1 … R4.)


Classification using Bit Parallelism


Disadvantages Large memory bandwidth Hardware optimized Advantages Good solution for multiple dimensions for small classifiers

512 rules: 1Mpps with single FPGA and 5 128KB SRAM chips.

Classification Using Multiple Fields


Recursive Flow Classification reduces the packet header in stages through successive memory lookups:

2^S = 2^128 (packet header fields F1 … Fn) → 2^64 → 2^24 → 2^T = 2^12 → Action


Packet Classification
References
- T. V. Lakshman, D. Stiliadis, "High speed policy based packet forwarding using efficient multi-dimensional range matching," Sigcomm 1998, pp. 191-202.
- V. Srinivasan, S. Suri, G. Varghese, M. Waldvogel, "Fast and scalable layer 4 switching," Sigcomm 1998, pp. 203-214.
- V. Srinivasan, G. Varghese, S. Suri, "Fast packet classification using tuple space search," to be presented at Sigcomm 1999.
- P. Gupta, N. McKeown, "Packet classification using hierarchical intelligent cuttings," Hot Interconnects VII, 1999.
- P. Gupta, N. McKeown, "Packet classification on multiple fields," Sigcomm 1999.


Tutorial Outline
Introduction:
What is a Packet Switch?

Packet Lookup and Classification:


Where does a packet go next?

Switching Fabrics:
How does the packet get there?

Output Scheduling:
When should the packet leave?


Switching Fabrics
Output and Input Queueing Output Queueing Input Queueing
Scheduling algorithms Combining input and output queues Other non-blocking fabrics Multicast traffic

Basic Architectural Components


1.

Datapath: per-packet processing


2. Interconnect

Forwarding Table

3. Output Scheduling

Forwarding Decision
Forwarding Table

Forwarding Decision
Forwarding Table

Forwarding Decision

Interconnects
Two basic techniques
Input Queueing Output Queueing

Usually a non-blocking switch fabric (e.g. crossbar)



Usually a fast bus



Interconnects
Output Queueing
Individual Output Queues: memory b/w = (N+1)R

Centralized Shared Memory: memory b/w = 2NR

Output Queueing
The ideal

Output Queueing
How fast can we make centralized shared memory?
5ns SRAM Shared Memory

1 2 N
200 byte bus

5ns per memory operation Two memory operations per packet Therefore, up to 160Gb/s In practice, closer to 80Gb/s
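The arithmetic behind the 160 Gb/s figure, kept in integer nanosecond units:

```python
bus_bits = 200 * 8        # 200-byte bus -> 1600 bits moved per memory access
t_access_ns = 5           # 5 ns SRAM cycle

raw_bw_gbps = bus_bits / t_access_ns        # bits per ns = Gb/s through memory
per_packet_ops = 2                          # write on arrival + read on departure
capacity_gbps = raw_bw_gbps / per_packet_ops
```

The result is the theoretical 160 Gb/s; segmentation overhead and bank conflicts are why practice lands nearer 80 Gb/s.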


Switching Fabrics
Output and Input Queueing Output Queueing Input Queueing
Scheduling algorithms Other non-blocking fabrics Combining input and output queues Multicast traffic

Interconnects
Input Queueing with Crossbar
Memory b/w = 2R
Scheduler

Data In

configuration

Data Out


Input Queueing
Head of Line Blocking
Delay grows without bound as the load approaches 58.6% (2 − √2) of link capacity.


Head of Line Blocking


Input Queueing
Virtual output queues


Input Queues
Virtual Output Queues

(Delay vs. load: with virtual output queues, throughput can reach 100%.)


Input Queueing
Memory b/w = 2R
Scheduler

Can be quite complex!


Input Queueing
Scheduling
Each input i keeps VOQs Q(i,1) … Q(i,n); arrivals Ai(t) at input i, departures Dj(t) at output j. Each cell time, a matching M connects inputs to outputs.


Input Queueing
Example: inputs 1-4 and outputs 1-4 form a request graph whose edges are weighted by VOQ occupancy (7, 2, 4, 2, 5, 2); the scheduler picks a bipartite matching (weight = 18).
Question: Maximum weight or maximum size?



Input Queueing
Scheduling
Maximum Size
Maximizes instantaneous throughput Does it maximize long-term throughput?

Maximum Weight
Can clear most backlogged queues But does it sacrifice long-term throughput?



Input Queueing
Longest Queue First or Oldest Cell First
Weight = queue length (LQF) or waiting time (OCF). Example: with VOQ lengths 1, 1, 1 and 10, a maximum weight matching serves the length-10 queue and sustains 100% throughput.


Input Queueing
Why is serving long/old queues better than serving maximum number of queues?
When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput. When traffic is non-uniform, some queues become longer than others. A good algorithm keeps the queue lengths matched, and services a large number of queues.
(Figure: average occupancy per VOQ #, under uniform and non-uniform traffic.)

Input Queueing
Practical Algorithms Maximal Size Algorithms
Wave Front Arbiter (WFA) Parallel Iterative Matching (PIM) iSLIP

Maximal Weight Algorithms


Fair Access Round Robin (FARR) Longest Port First (LPF)


Wave Front Arbiter


Wave Front Arbiter


Wave Front Arbiter


Implementation

Combinational Logic Blocks


Wave Front Arbiter


Wrapped WFA (WWFA)
N steps instead of 2N-1


Input Queueing
Practical Algorithms Maximal Size Algorithms
Wave Front Arbiter (WFA) Parallel Iterative Matching (PIM) iSLIP

Maximal Weight Algorithms


Fair Access Round Robin (FARR) Longest Port First (LPF)


Parallel Iterative Matching: random selection

Iteration #1: Requests → Grant (each output randomly selects one requesting input) → Accept/Match (each input randomly selects one granting output)
Iteration #2: repeat among the ports left unmatched

Parallel Iterative Matching


Maximal is not Maximum
Example: starting from the same requests, PIM can converge to a maximal matching that is smaller than the maximum matching.


Parallel Iterative Matching


Analytical Results
Number of iterations to converge:

E[Ui] ≤ N^2 / 4^i, so E[C] ≤ log2 N

where C = # of iterations required to resolve connections, N = # of ports, Ui = # of unresolved connections after iteration i
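The convergence behaviour can be checked with a small simulation of the three-phase iteration. This is a sketch, not the original implementation; the fully loaded request pattern and seed are arbitrary choices:

```python
import random

def pim(requests, rng):
    """requests[i] = set of outputs input i wants; returns (input->output, iters)."""
    n = len(requests)
    match = {}
    iterations = 0
    while True:
        iterations += 1
        free_in = [i for i in range(n) if i not in match]
        used_out = set(match.values())
        # Grant: each free output picks a random requesting free input.
        grants = {}
        for o in range(n):
            if o in used_out:
                continue
            asking = [i for i in free_in if o in requests[i]]
            if asking:
                grants.setdefault(rng.choice(asking), []).append(o)
        if not grants:
            return match, iterations
        # Accept: each granted input picks a random granting output.
        for i, outs in grants.items():
            match[i] = rng.choice(outs)

rng = random.Random(7)
n = 16
reqs = [set(range(n)) for _ in range(n)]   # all inputs request all outputs
match, iters = pim(reqs, rng)
```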



Input Queueing
Practical Algorithms Maximal Size Algorithms
Wave Front Arbiter (WFA) Parallel Iterative Matching (PIM) iSLIP

Maximal Weight Algorithms


Fair Access Round Robin (FARR) Longest Port First (LPF)


iSLIP: round-robin selection

Iteration #1: Requests → Grant (each output selects the first requesting input at or after its round-robin pointer) → Accept/Match (each input selects the first granting output at or after its pointer)
Iteration #2: repeat among the ports left unmatched

iSLIP
Properties
Random under low load; TDM under high load. Lowest priority is given to the MRU (most recently used) input. With 1 iteration: fair to outputs. Converges in at most N iterations, on average ≤ log2 N. Implementation: N priority encoders. Up to 100% throughput for uniform traffic.
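The pointer logic can be sketched as below, run for up to N iterations per cell time. One simplification is labelled clearly: this sketch advances the pointers on every accepted grant, whereas iSLIP proper updates them only for grants accepted in the first iteration:

```python
def islip(requests, grant_ptr, accept_ptr, n, iterations):
    """requests[i] = set of outputs input i has cells for.
    grant_ptr[o], accept_ptr[i]: round-robin pointers. Returns input->output."""
    match = {}
    for _ in range(iterations):
        # Grant: each unmatched output offers to the first requesting
        # unmatched input at or after its pointer.
        grants = {}
        for o in range(n):
            if o in match.values():
                continue
            for k in range(n):
                i = (grant_ptr[o] + k) % n
                if i not in match and o in requests[i]:
                    grants.setdefault(i, []).append(o)
                    break
        # Accept: each input takes the first offering output at or after
        # its pointer; pointers move just past the accepted port.
        for i, offers in grants.items():
            for k in range(n):
                o = (accept_ptr[i] + k) % n
                if o in offers:
                    match[i] = o
                    grant_ptr[o] = (i + 1) % n
                    accept_ptr[i] = (o + 1) % n
                    break
    return match

n = 4
reqs = [set(range(n)) for _ in range(n)]     # fully loaded 4x4 switch
match = islip(reqs, [0] * n, [0] * n, n, iterations=n)
```

Being deterministic, unlike PIM, the fully loaded case desynchronizes the pointers and settles into the TDM-like pattern the slide describes.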


iSLIP Implementation

Per output: a programmable priority encoder for Grant; per input: a programmable priority encoder for Accept; each arbiter keeps log2 N bits of pointer state feeding the Decision.

Input Queueing References


References
- M. Karol et al., "Input vs output queueing on a space-division packet switch," IEEE Trans Comm., Dec 1987, pp. 1347-1356.
- Y. Tamir, "Symmetric crossbar arbiters for VLSI communication switches," IEEE Trans Parallel and Dist Sys., Jan 1993, pp. 13-27.
- T. Anderson et al., "High-speed switch scheduling for local area networks," ACM Trans Comp Sys., Nov 1993, pp. 319-352.
- N. McKeown, "The iSLIP scheduling algorithm for input-queued switches," IEEE Trans Networking, April 1999, pp. 188-201.
- C. Lund et al., "Fair prioritized scheduling in an input-buffered switch," Proc. of IFIP-IEEE Conf., April 1996, pp. 358-69.
- A. Mekkittikul et al., "A practical scheduling algorithm to achieve 100% throughput in input-queued switches," IEEE Infocom 98, April 1998.

Switching Fabrics
Output and Input Queueing Output Queueing Input Queueing
Scheduling algorithms Other non-blocking fabrics Combining input and output queues Multicast traffic

Other Non-Blocking Fabrics


Clos Network


Other Non-Blocking Fabrics


Clos Network
Expansion factor required = 2 − 1/N (but still blocking for multicast)


Other Non-Blocking Fabrics


Self-Routing Networks
000 001 010 011 100 101 110 111 000 001 010 011 100 101 110 111


Other Non-Blocking Fabrics


Self-Routing Networks
The Non-blocking Batcher Banyan Network
Batcher Sorter

Self-Routing Network
000 001 010 011 100 101 110 111

Fabric can be used as scheduler. Batcher-Banyan network is blocking for multicast.



Switching Fabrics
Output and Input Queueing Output Queueing Input Queueing
Scheduling algorithms Other non-blocking fabrics Combining input and output queues Multicast traffic

Speedup
Context
input-queued switches output-queued switches the speedup problem

Early approaches Algorithms Implementation considerations


Speedup: Context
A generic switch: memory can be placed at the inputs, the outputs, or both.

The placement of memory gives


- Output-queued switches - Input-queued switches - Combined input- and output-queued switches

Output-queued switches

Best delay and throughput performance


- Possible to erect bandwidth firewalls between sessions

Main problem
- Requires high fabric speedup (S = N)

Unsuitable for high-speed switching



Input-queued switches

Big advantage
- Speedup of one is sufficient

Main problem
- Can't guarantee delay due to input contention

Overcoming input contention: use higher speedup



A Comparison
Memory speeds for 32x32 switch
Line Rate   OQ Memory BW   OQ Access Time/cell   IQ Memory BW   IQ Access Time/cell
100 Mb/s    3.3 Gb/s       128 ns                200 Mb/s       2.12 us
1 Gb/s      33 Gb/s        12.8 ns               2 Gb/s         212 ns
2.5 Gb/s    82.5 Gb/s      5.12 ns               5 Gb/s         84.8 ns
10 Gb/s     330 Gb/s       1.28 ns               20 Gb/s        21.2 ns
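These entries follow from memory b/w = (N+1)R for output queueing and 2R for input queueing; a quick check, with 53-byte cells assumed for the access-time column:

```python
CELL_BITS = 53 * 8                 # assumed ATM-style cell

def memory_requirements(n_ports, line_rate_bps):
    oq_bw = (n_ports + 1) * line_rate_bps    # N writes + 1 read per cell time
    iq_bw = 2 * line_rate_bps                # 1 write + 1 read
    return oq_bw, CELL_BITS / oq_bw, iq_bw, CELL_BITS / iq_bw

# 32x32 switch at 100 Mb/s per line (first row of the table).
oq_bw, oq_t, iq_bw, iq_t = memory_requirements(32, 100e6)
```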


The Speedup Problem


Find a compromise: 1 < Speedup << N
- to get the performance of an OQ switch - close to the cost of an IQ switch

Essential for high speed QoS switching


Some Early Approaches


Probabilistic Analyses
- assume traffic models (Bernoulli, Markov-modulated, non-uniform loading, friendly correlated) - obtain mean throughput and delays, bounds on tails - analyze different fabrics (crossbar, multistage, etc)

Numerical Methods
- use actual and simulated traffic traces - run different algorithms - set the speedup dial at various values


The findings
Very tantalizing ...
- under different settings (traffic, loading, algorithm, etc) - and even for varying switch sizes

A speedup of between 2 and 5 was sufficient!


Using Speedup

Intuition
Bernoulli IID inputs, speedup = 1: fabric throughput = 0.58

Bernoulli IID inputs, speedup = 2: fabric throughput = 1.16; input efficiency = 1/1.16; average input queue = 6.25

Intuition (continued)
Bernoulli IID inputs, speedup = 3: fabric throughput = 1.74; input efficiency = 1/1.74; average input queue = 1.35. Speedup = 4: fabric throughput = 2.32; input efficiency = 1/2.32; average input queue = 0.75


Issues
Need hard guarantees
- exact, not average

Robustness
- realistic, even adversarial, traffic not friendly Bernoulli IID


The Ideal Solution


Inputs Speedup = N Outputs

?
Speedup << N

Question: Can we find


- a simple and good algorithms - that exactly mimics output-queueing - regardless of switch sizes and traffic patterns?


What is exact mimicking?

Apply same inputs to an OQ and a CIOQ switch


- packet by packet

Obtain same outputs


- packet by packet


Algorithm - MUCF

Key concept: urgency value


- urgency = departure time - present time


MUCF
The algorithm
- Outputs try to get their most urgent packets - Inputs grant to the output whose packet is most urgent, ties broken by port number - Losing outputs try for their next most urgent packet - The algorithm terminates when no more matches are possible
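One matching phase can be sketched as below. The urgency values are supplied directly; in the real switch they come from tracking the departure times of a shadow OQ switch:

```python
def mucf(urgency):
    """urgency[i][o] = departure - present time for input i's head cell to
    output o (None if no such cell). Returns an input->output matching."""
    n = len(urgency)
    match_in, match_out = {}, {}
    while True:
        proposals = {}
        for o in range(n):
            if o in match_out:
                continue
            # Output o tries for its most urgent remaining packet.
            cands = [(urgency[i][o], i) for i in range(n)
                     if i not in match_in and urgency[i][o] is not None]
            if cands:
                u, i = min(cands)            # smallest value = most urgent
                proposals.setdefault(i, []).append((u, o))
        if not proposals:
            return match_in                  # no output can propose: done
        for i, offers in proposals.items():
            u, o = min(offers)               # input grants most urgent output
            match_in[i], match_out[o] = o, i
```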


Stable Marriage Problem


Men = Outputs: Bill, John, Pedro
Women = Inputs: Hillary, Monica, Maria

An example

Observation: only two reasons a packet doesn't get to its output - input contention and output contention - This is why a speedup of 2 works!

What does this get us?


Speedup of 4 is sufficient for exact emulation of FIFO OQ switches, with MUCF

What about non-FIFO OQ switches?


E.g. WFQ, Strict priority


Other results
To exactly emulate an NxN OQ switch
- Speedup of 2 − 1/N is necessary and sufficient (hence a speedup of 2 is sufficient for all N) - Input traffic patterns can be absolutely arbitrary - The emulated OQ switch may use any monotone scheduling policy - E.g.: FIFO, LIFO, strict priority, WFQ, etc


139

What gives?
Complexity of the algorithms
- Extra hardware for processing - Extra run time (time complexity)

What is the benefit?


- Reduced memory bandwidth requirements

Tradeoff: Memory for processing


- Moore's Law supports this tradeoff
140

Implementation - a closer look


Main sources of difficulty
- Estimating urgency, etc: the needed info is distributed (and must be communicated among inputs and outputs)
- Matching process: too many iterations?

Estimating urgency depends on what is being emulated


- Like taking a ticket to hold a place in a queue
- FIFO, strict priorities: no problem
- WFQ, etc.: problems
141

Implementation (contd)
Matching process
- A variant of the stable marriage problem
- Worst-case number of iterations for SMP = N^2
- Worst-case number of iterations in switching = N
- With high probability (and on average) approximately log(N)


142

Other Work
Relax stringent requirement of exact emulation
- Least Occupied Output First Algorithm (LOOFA): keeps outputs busy whenever there are packets; by time-stamping packets, it also exactly mimics output queueing
- Disallow arbitrary inputs (e.g. leaky-bucket constrained) and obtain worst-case delay bounds


143

References for speedup


- Y. Oie et al, Effect of speedup in nonblocking packet switch, ICC '89.
- A. L. Gupta, N. D. Georganas, Analysis of a packet switch with input and output buffers and speed constraints, Infocom '91.
- S.-T. Chuang et al, Matching output queueing with a combined input and output queued switch, IEEE JSAC, vol 17, no 6, 1999.
- B. Prabhakar, N. McKeown, On the speedup required for combined input and output queued switching, Automatica, vol 35, 1999.
- P. Krishna et al, On the speedup required for work-conserving crossbar switches, IEEE JSAC, vol 17, no 6, 1999.
- A. Charny, Providing QoS guarantees in input buffered crossbar switches with speedup, PhD Thesis, MIT, 1998.


144

Switching Fabrics
Output and Input Queueing
Output Queueing
Input Queueing
Scheduling algorithms
Other non-blocking fabrics
Combining input and output queues
Multicast traffic
145


Multicast Switching
The problem
Switching with crossbar fabrics
Switching with other fabrics


146

Multicasting


147

Crossbar fabrics: Method 1


Copy network + unicast switching

Copy networks

Increased hardware, increased input contention


148

Method 2
Use copying properties of crossbar fabric
No fanout-splitting: Easy, but low throughput

Fanout-splitting: higher throughput, but not as simple. Leaves residue.
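The difference between the two modes can be sketched with a toy one-cell-time model. This is an illustrative helper, not from the tutorial; it assumes each output grants the lowest-numbered requesting input.

```python
def one_cell_time(fanouts, fanout_splitting=True):
    """fanouts: input -> set of requested outputs. Each output grants
    one requesting input (lowest port number here). With fanout
    splitting, an input delivers whatever subset was granted and keeps
    the rest as residue; without it, an input transmits only if its
    entire fanout was granted. Returns the residue left at each input.
    """
    grants = {}                     # output -> input it grants
    for i, outs in sorted(fanouts.items()):
        for o in outs:
            grants.setdefault(o, i)
    residue = {}
    for i, outs in fanouts.items():
        served = {o for o in outs if grants.get(o) == i}
        if not fanout_splitting and served != outs:
            served = set()          # all-or-nothing: wait for next time
        left = outs - served
        if left:
            residue[i] = left
    return residue
```

With two inputs contending for output 1, splitting delivers the non-contended copies immediately and leaves only the lost copy as residue; without splitting the whole fanout waits, which is what depresses throughput.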


149

The effect of fanout-splitting

Performance of an 8x8 switch with and without fanout-splitting under uniform IID traffic
150

Placement of residue
Key question: How should outputs grant requests? (and hence decide placement of residue)


151

Residue and throughput


Result: Concentrating residue brings more new work forward, and hence leads to higher throughput. But there are fairness problems to deal with. This and other problems can be looked at in a unified way by mapping the multicasting problem onto a variation of Tetris.


152

Multicasting and Tetris


[Figure: Tetris analogy - residue packets stacked over input ports 1-5 and output ports 1-5]


153

Multicasting and Tetris


[Figure: Tetris analogy - the same residue, now concentrated over a few of the output ports 1-5]


154

Replication by recycling
Main idea: Make two copies at a time using a binary tree with input at root and all possible destination outputs at the leaves.
[Figure: binary tree with the input at the root and destination leaves a-e; packets x and y are copied pairwise toward their destinations]
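The pairwise-copy idea can be sketched as follows. This is an illustrative helper: the real scheme recycles the second copy back through the fabric, whereas this sketch just tracks the copies on a software work list.

```python
def recycle_copies(packet, dests):
    """Binary-tree replication: a packet carrying a set of destinations
    is split into two copies, each carrying half the set, and the copies
    are recycled until every copy has exactly one destination.
    Returns (packet, destination) pairs sorted by destination."""
    work, done = [(packet, sorted(dests))], []
    while work:
        pkt, ds = work.pop()
        if len(ds) == 1:
            done.append((pkt, ds[0]))   # single destination: deliver
        else:
            mid = len(ds) // 2
            work.append((pkt, ds[:mid]))  # first copy
            work.append((pkt, ds[mid:]))  # second copy, recycled
    return sorted(done, key=lambda t: t[1])
```

A fanout of F finishes after about log2(F) recycling passes, which is why the scheme scales to large fanouts but introduces variable delay.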


155

Replication by recycling (contd)


[Figure: recycling switch - Receive, Table, Network, Resequencing, Output, Transmit blocks, with a Recycle path back to the input]

Scalable to large fanouts. Needs resequencing at outputs and introduces variable delays.


156

References for Multicasting


- J. Hayes et al, Performance analysis of a multicast switch, IEEE Trans. on Communications, vol 39, April 1991.
- B. Prabhakar et al, Tetris models for multicast switches, Proc. of the 30th Annual Conference on Information Sciences and Systems, 1996.
- B. Prabhakar et al, Multicast scheduling for input-queued switches, IEEE JSAC, 1997.
- J. Turner, An optimal nonblocking multicast virtual circuit switch, INFOCOM, 1994.


157

Tutorial Outline
Introduction:
What is a Packet Switch?

Packet Lookup and Classification:


Where does a packet go next?

Switching Fabrics:
How does the packet get there?

Output Scheduling:
When should the packet leave?


158

Output Scheduling
What is output scheduling?
How is it done?
Practical considerations


159

Output Scheduling
Allocating output bandwidth Controlling packet delay

scheduler


160

Output Scheduling

FIFO

Fair Queueing


161

Motivation
FIFO is natural but gives poor QoS
bursty flows increase delays for others; hence FIFO cannot guarantee delays

Need round robin scheduling of packets


Fair Queueing Weighted Fair Queueing, Generalized Processor Sharing


162

Fair queueing: Main issues


Level of granularity
packet-by-packet? (favors long packets)
bit-by-bit? (ideal, but very complicated)

Packet Generalized Processor Sharing (PGPS)


serves packet-by-packet and imitates bit-by-bit schedule within a tolerance


163

How does WFQ work?


Weights: WR = 1, WG = 5, WP = 2
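A minimal sketch of the packet-by-packet approximation, assuming a deliberately simplified virtual clock that just advances to the finish time of the last packet served (real WFQ/PGPS tracks the GPS virtual time more carefully); names are illustrative.

```python
import heapq

class WFQ:
    """Simplified weighted fair queueing: each packet is stamped with a
    virtual finish time F = max(V, F_last[flow]) + length / weight and
    packets are served in increasing F order."""

    def __init__(self, weights):
        self.w = weights                       # flow -> weight
        self.last_finish = {f: 0.0 for f in weights}
        self.v = 0.0                           # crude virtual time
        self.heap = []

    def enqueue(self, flow, length):
        start = max(self.v, self.last_finish[flow])
        finish = start + length / self.w[flow]
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, flow, length))

    def dequeue(self):
        finish, flow, length = heapq.heappop(self.heap)
        self.v = finish                        # advance the crude clock
        return flow
```

With the slide's weights (WR = 1, WG = 5, WP = 2) and equal-length packets, the weight-5 flow gets its packet out first, then weight 2, then weight 1.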


164

Delay guarantees
Theorem
If flows are leaky bucket constrained and all nodes employ GPS (WFQ), then the network can guarantee worst-case delay bounds to sessions.
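For a single PGPS node, the Parekh-Gallager bound behind this theorem is roughly of the following form (our notation: sigma_i is the session's leaky-bucket burst size, g_i its guaranteed rate with g_i >= rho_i, L_max the maximum packet length, r the link rate):

```latex
D_i^{\mathrm{PGPS}} \;\le\; \frac{\sigma_i}{g_i} \;+\; \frac{L_{\max}}{r}
```

The first term is the GPS fluid delay of draining the worst-case burst at the guaranteed rate; the second accounts for PGPS serving whole packets rather than fluid. The multiple-node case adds a per-hop packetization term along the path.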


165

Practical considerations
For every packet, the scheduler needs to
classify it into the right flow queue (maintaining a linked list per flow), and schedule it for departure

Complexities of both are O(log [# of flows])


the first is hard to overcome; the second can be overcome by DRR
166

Deficit Round Robin


[Figure: DRR example - flow queues holding packets of sizes between 50 and 1000 bytes, served with a 500-byte quantum]

Good approximation of FQ
Much simpler to implement


Quantum size = 500
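DRR itself fits in a few lines. This sketch uses illustrative names and byte-sized packets, with the slide's 500-byte quantum as the default:

```python
from collections import deque

def drr_schedule(flows, quantum=500):
    """Deficit round robin: each backlogged flow earns `quantum` bytes
    of credit per round and sends head-of-line packets while its
    deficit covers them. `flows` is a list of packet-size lists;
    returns the service order as (flow index, packet size) pairs."""
    queues = [deque(f) for f in flows]
    deficit = [0] * len(queues)
    order = []
    while any(queues):
        for i, q in enumerate(queues):
            if not q:
                deficit[i] = 0      # idle flows don't bank credit
                continue
            deficit[i] += quantum
            while q and q[0] <= deficit[i]:
                pkt = q.popleft()
                deficit[i] -= pkt
                order.append((i, pkt))
    return order
```

Unlike WFQ there is no sorting: each packet costs O(1) work, which is exactly why DRR removes the O(log N) scheduling term mentioned above.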


167

But...
WFQ is still very hard to implement
classification is a problem
needs to maintain too much state information
doesn't scale well


168

Strict Priorities and Diff Serv


Classify flows into priority classes
maintain only per-class queues
perform FIFO within each class
avoid the curse of dimensionality
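A per-class strict-priority scheduler is little more than an array of FIFOs. A minimal sketch (illustrative class numbering, 0 = highest priority):

```python
from collections import deque

class StrictPriority:
    """Per-class FIFO queues served in strict priority order:
    a lower-priority class is served only when every
    higher-priority class is empty."""

    def __init__(self, n_classes):
        self.queues = [deque() for _ in range(n_classes)]

    def enqueue(self, cls, pkt):
        self.queues[cls].append(pkt)

    def dequeue(self):
        for q in self.queues:      # scan from highest priority down
            if q:
                return q.popleft()
        return None                # idle
```

State is per class, not per flow, so the structure needs only a handful of queues regardless of how many flows traverse the router.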


169

Diff Serv
A framework for providing differentiated QoS
set Type of Service (ToS) bits in packet headers; this classifies packets into classes
routers maintain per-class queues
condition traffic at network edges to conform to class requirements

May still need queue management inside the network
170

References for O/p Scheduling


- A. Demers et al, Analysis and simulation of a fair queueing algorithm, ACM SIGCOMM 1989.
- A. Parekh, R. Gallager, A generalized processor sharing approach to flow control in integrated services networks: the single node case, IEEE/ACM Trans. on Networking, June 1993.
- A. Parekh, R. Gallager, A generalized processor sharing approach to flow control in integrated services networks: the multiple node case, IEEE/ACM Trans. on Networking, August 1993.
- M. Shreedhar, G. Varghese, Efficient fair queueing using deficit round robin, ACM SIGCOMM, 1995.
- K. Nichols, S. Blake (eds), Differentiated Services: Operational Model and Definitions, Internet Draft, 1998.


171

Active Queue Management


Problems with traditional queue management
tail drop

Active Queue Management


goals
an example
effectiveness


172

Tail Drop Queue Management


[Figure: lock-out - a single flow occupies the queue up to the max queue length]


173

Tail Drop Queue Management


Drop packets only when queue is full
long steady-state delay
global synchronization
bias against bursty traffic


174

Global Synchronization

[Figure: flows synchronize - the queue repeatedly fills to the max queue length and drains as all flows back off together]


175

Bias Against Bursty Traffic

[Figure: a burst arriving at a nearly full queue is dropped at the max queue length]


176

Alternative Queue Management Schemes


177

Active Queue Management


Goals
Solve lock-out and full-queue problems
no lock-out behavior
no global synchronization
no bias against bursty flows
low steady-state delay
lower packet dropping

Provide better QoS at a router


178

Active Queue Management


Problems with traditional queue management
tail drop

Active Queue Management


goals
an example
effectiveness


179

Random Early Detection (RED)


[Figure: RED queue - packets P1..Pk, instantaneous length q, average length qavg, thresholds minth and maxth]

if qavg < minth: admit every packet
else if qavg <= maxth: drop an incoming packet with probability p = (qavg - minth) / (maxth - minth)
else (qavg > maxth): drop every incoming packet
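The rule above, plus the usual exponentially weighted average for qavg, can be sketched directly. Thresholds and the weight here are illustrative; deployed RED also scales the linear region by a max_p < 1 and spaces drops out, which this sketch omits.

```python
import random

def update_qavg(qavg, q, w=0.002):
    """EWMA of the instantaneous queue length q."""
    return (1 - w) * qavg + w * q

def red_drop(qavg, minth=5, maxth=15):
    """RED drop decision as on the slide: admit below minth, drop with
    probability rising linearly from 0 to 1 between minth and maxth,
    drop every packet above maxth."""
    if qavg < minth:
        return False
    if qavg <= maxth:
        p = (qavg - minth) / (maxth - minth)
        return random.random() < p
    return True
```

Because the decision uses the average qavg rather than the instantaneous q, short bursts pass through untouched while sustained congestion raises the drop probability smoothly.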


180

Effectiveness of RED: Lock-Out


Packets are randomly dropped Each flow has the same probability of being discarded


181

Effectiveness of RED: Full-Queue


Drop packets probabilistically in anticipation of congestion (not when the queue is full)
Use qavg to decide the packet dropping probability: allow instantaneous bursts

Randomness avoids global synchronization


182

What QoS does RED Provide?


Lower buffer delay: good interactive service
Given responsive flows: packet dropping is reduced
Given responsive flows: fair bandwidth allocation
early congestion indication allows traffic to throttle back before congestion
qavg is controlled to be small


183

Unresponsive or aggressive flows


Don't properly back off during congestion
Take away bandwidth from TCP-compatible flows
Monopolize buffer space


184

Control Unresponsive Flows


Some active queue management schemes
RED with penalty box
Flow RED (FRED)
Stabilized RED (SRED)

identify and penalize unresponsive flows with a bit of extra work


185

Active Queue Management


References
- B. Braden et al, Recommendations on queue management and congestion avoidance in the Internet, RFC 2309, 1998.
- S. Floyd, V. Jacobson, Random early detection gateways for congestion avoidance, IEEE/ACM Trans. on Networking, 1(4), Aug. 1993.
- D. Lin, R. Morris, Dynamics of random early detection, ACM SIGCOMM, 1997.
- T. Ott et al, SRED: Stabilized RED, INFOCOM 1999.
- S. Floyd, K. Fall, Router mechanisms to support end-to-end congestion control, LBL technical report, 1997.


186

Tutorial Outline
Introduction:
What is a Packet Switch?

Packet Lookup and Classification:


Where does a packet go next?

Switching Fabrics:
How does the packet get there?

Output Scheduling:
When should the packet leave?


187

Basic Architectural Components

Admission Control

Congestion Control

Routing Switching

Reservation

Control

Policing

Output Scheduling

Datapath:
per-packet processing


188

Basic Architectural Components


1.

Datapath: per-packet processing


2. Interconnect

Forwarding Table

3. Output Scheduling

Forwarding Decision
Forwarding Table

Forwarding Decision
Forwarding Table

Forwarding Decision
189
