
High Performance Switches and Routers: Theory and Practice
Sigcomm 99 Tutorial, August 30, 1999


Harvard University

High Performance Switching and Routing
Telecom Center Workshop: Sept 4, 1997

Nick McKeown

Balaji Prabhakar

Stanford University, Departments of Electrical Engineering and Computer Science


nickm@stanford.edu balaji@isl.stanford.edu

Tutorial Outline
Introduction:
What is a Packet Switch?

Packet Lookup and Classification:


Where does a packet go next?

Switching Fabrics:
How does the packet get there?

Output Scheduling:
When should the packet leave?

Copyright 1999. All Rights Reserved

Introduction
What is a Packet Switch?

Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers


Basic Architectural Components

Admission Control

Congestion Control

Routing Switching

Reservation

Control

Policing

Output Scheduling

Datapath:
per-packet processing


Basic Architectural Components


1.

Datapath: per-packet processing


2. Interconnect

Forwarding Table

3. Output Scheduling

Forwarding Decision
Forwarding Table

Forwarding Decision
Forwarding Table

Forwarding Decision

Where high performance packet switches are used


- Carrier Class Core Router - ATM Switch - Frame Relay Switch

The Internet Core

Edge Router

Enterprise WAN access & Enterprise Campus Switch


Introduction
What is a Packet Switch?

Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers


ATM Switch
Lookup cell VCI/VPI in VC table. Replace old VCI/VPI with new. Forward cell to outgoing interface. Transmit cell onto link.


Ethernet Switch
Lookup frame DA in forwarding table.
If known, forward to correct port. If unknown, broadcast to all ports.

Learn SA of incoming frame. Forward frame to outgoing interface. Transmit frame onto link.
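The learn-and-forward loop above can be sketched as a minimal software bridge; port numbers and the shortened MAC addresses are illustrative:

```python
class LearningBridge:
    """Minimal Ethernet learning bridge: learn source addresses,
    forward to the known port, or flood to all other ports."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.table = {}  # MAC address -> port

    def receive(self, frame_sa, frame_da, in_port):
        # Learn: remember which port the source address lives on.
        self.table[frame_sa] = in_port
        # Forward: known destination -> one port; unknown -> flood.
        out = self.table.get(frame_da)
        if out == in_port:
            return []          # destination is on the arrival segment
        if out is not None:
            return [out]
        return [p for p in range(self.num_ports) if p != in_port]

bridge = LearningBridge(num_ports=4)
flood = bridge.receive("aa:aa", "bb:bb", in_port=0)   # bb:bb unknown: flood
reply = bridge.receive("bb:bb", "aa:aa", in_port=2)   # aa:aa learned on port 0
```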


IP Router
Lookup packet DA in forwarding table.
If known, forward to correct port. If unknown, drop packet.

Decrement TTL, update header Cksum. Forward packet to outgoing interface. Transmit packet onto link.
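The TTL decrement and checksum update can be sketched as below. For clarity this recomputes the IPv4 header checksum from scratch rather than using the incremental update of RFC 1624; a plain 20-byte header with no options is assumed:

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of 16-bit words (checksum field zeroed by caller)."""
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

def forward(header: bytes) -> bytes:
    """Decrement TTL (byte 8) and rewrite the header checksum (bytes 10-11)."""
    h = bytearray(header)
    if h[8] == 0:
        raise ValueError("TTL expired")
    h[8] -= 1
    h[10:12] = b"\x00\x00"
    struct.pack_into("!H", h, 10, ipv4_checksum(bytes(h)))
    return bytes(h)

# Build a 20-byte header with TTL=64 and a valid checksum.
hdr = bytearray(20)
hdr[0] = 0x45          # version/IHL
hdr[8] = 64            # TTL
hdr[9] = 6             # protocol = TCP
struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
out = forward(bytes(hdr))
```

A receiver validates the result by summing the whole header, checksum included: the one's-complement sum of a correct header is zero.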


Introduction
What is a Packet Switch?

Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers


First-Generation IP Routers
Shared Backplane
CPU
Buffer Memory

CPU, Memory, Line Interface

DMA

DMA

DMA

Line Interface
MAC

Line Interface
MAC

Line Interface
MAC


Second-Generation IP Routers
CPU
Buffer Memory

DMA

DMA

DMA

Line Card Local Buffer Memory


MAC

Line Card Local Buffer Memory


MAC

Line Card Local Buffer Memory


MAC


Third-Generation Switches/Routers

Switched Backplane
Line Interfaces, CPU, Memory

Line Card Local Buffer Memory


MAC

CPU Card

Line Card Local Buffer Memory


MAC


Fourth-Generation Switches/Routers
Clustering and Multistage


Packet Switches
References
- J. Giacopelli, M. Littlewood, W. D. Sincoskie, "Sunshine: A high performance self-routing broadband packet switch architecture," ISS 90.
- J. S. Turner, "Design of a broadcast packet switching network," IEEE Trans Comm, June 1988, pp. 734-743.
- C. Partridge et al., "A fifty gigabit per second IP router," IEEE Trans Networking, 1998.
- N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, "The Tiny Tera: A packet switch core," IEEE Micro Magazine, Jan-Feb 1997.

Tutorial Outline
Introduction:
What is a Packet Switch?

Packet Lookup and Classification:


Where does a packet go next?

Switching Fabrics:
How does the packet get there?

Output Scheduling:
When should the packet leave?


Basic Architectural Components


1.

Datapath: per-packet processing


2. Interconnect

Forwarding Table

3. Output Scheduling

Forwarding Decision
Forwarding Table

Forwarding Decision
Forwarding Table

Forwarding Decision

Forwarding Decisions
ATM and MPLS switches Bridges and Ethernet switches
Associative Lookup Hashing Trees and tries Caching CIDR Patricia trees/tries Other methods Direct Lookup

IP Routers

Packet Classification


ATM and MPLS Switches


Direct Lookup

VCI

Memory

(Port, VCI)

Address

Data
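Because VCI/VPI labels are allocated per link, the lookup is a single memory read: the incoming VCI indexes a table whose entry holds the outgoing (port, new VCI). A minimal sketch; the table size and entries are illustrative:

```python
VC_TABLE_SIZE = 4096  # one slot per possible incoming VCI on this link

# Each slot holds (output port, outgoing VCI), or None if unassigned.
vc_table = [None] * VC_TABLE_SIZE
vc_table[42] = (3, 77)   # cells arriving with VCI 42 leave port 3 as VCI 77

def switch_cell(vci: int):
    entry = vc_table[vci]         # direct lookup: one memory access
    if entry is None:
        raise KeyError(f"no connection for VCI {vci}")
    out_port, new_vci = entry     # old label is replaced on the way out
    return out_port, new_vci
```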


Forwarding Decisions
ATM and MPLS switches Bridges and Ethernet switches
Associative Lookup Hashing Trees and tries Caching CIDR Patricia trees/tries Other methods Direct Lookup

IP Routers

Packet Classification


Bridges and Ethernet Switches


Associative Lookups
Associative Memory or CAM
Associated Data

Advantages:
Simple

Search Data
48

Network Associated Address Data

Disadvantages
Slow High Power Small Expensive

Hit?
log2N

Address


Bridges and Ethernet Switches


Hashing
Associated Data

Address

48

Hashing Function

16

Data

Search Data

Memory

Hit?
log2N

Address


Lookups Using Hashing


An example
Memory #1
Search Data
48

#2 #2

#3

#4

Hashing Function

16

CRC-16

#1

{
#3

Associated Data

Hit?
log2N

Address

Linked lists

Lookups Using Hashing


Performance of simple example

ER = 1 + (M - 1) / 2N

Where: ER = Expected number of memory references, M = Number of memory addresses in table, N = Number of linked lists; average chain length = M/N
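A sketch of the scheme: hash the 48-bit address into a 16-bit bucket index and chain collisions in linked lists (Python lists here). The CRC-16 of the slide is stood in for by truncating `zlib.crc32`, an assumption made for illustration:

```python
import zlib

NUM_LISTS = 1 << 16   # N linked lists, indexed by a 16-bit hash

def h16(addr48: int) -> int:
    """16-bit hash of a 48-bit MAC address (crc32 truncated, not a true CRC-16)."""
    return zlib.crc32(addr48.to_bytes(6, "big")) & 0xFFFF

table = [[] for _ in range(NUM_LISTS)]

def insert(addr48, port):
    table[h16(addr48)].append((addr48, port))

def lookup(addr48):
    refs = 0
    for entry, port in table[h16(addr48)]:   # walk the chain
        refs += 1
        if entry == addr48:
            return port, refs                # refs = memory references used
    return None, refs

insert(0x0000AA112233, 5)
insert(0x0000BB445566, 9)
```

Averaging `refs` over many lookups gives the ER of the formula above: small in expectation, but not deterministic.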


Lookups Using Hashing


Advantages:
Simple Expected lookup time can be small

Disadvantages
Non-deterministic lookup time Inefficient use of memory


Trees and Tries


Binary search tree: N entries, depth log2 N, comparisons branch on &lt; / &gt;.
Binary search trie: branch on successive address bits 0/1; e.g. leaves 010 and 111.


Trees and Tries


Multiway tries
16-ary search trie: each node holds 16 (label, pointer) entries, 0000 through 1111; e.g. leaves reached for addresses 000011110000 and 111111111111.


Trees and Tries


Multiway tries
En = 1 + Σ (i = 1 … L−1) D^i (1 − (1 − D^−i)^N)

Ew = D·En − (En − 1) − N

Where: D = Degree of tree, L = Number of layers/references, N = Number of entries in table, En = Expected number of nodes, Ew = Expected amount of wasted memory (slots)

Degree of Tree   # Mem References   # Nodes (x10^6)   Total Memory (MB)   Fraction Wasted (%)
2                48                 1.09              4.3                 49
4                24                 0.53              4.3                 73
8                16                 0.35              5.6                 86
16               12                 0.25              8.3                 93
64               8                  0.17              21                  98
256              6                  0.12              64                  99.5

Table produced from 2^15 randomly generated 48-bit addresses
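The expected node count of a D-ary trie over uniform random addresses can be checked numerically. The sketch below sums, per level, the expected number of distinct length-i prefixes among N random keys; the two-byte table slot used to convert node counts into megabytes is an assumption:

```python
def expected_nodes(D: int, L: int, N: int) -> float:
    """E_n for a D-ary trie of L levels holding N uniform random keys:
    1 (root) + expected distinct prefixes at each deeper level."""
    total = 1.0
    for i in range(1, L):
        buckets = float(D) ** i
        total += buckets * (1.0 - (1.0 - 1.0 / buckets) ** N)
    return total

N = 2 ** 15
en2 = expected_nodes(2, 48, N)     # binary trie over 48-bit addresses
en16 = expected_nodes(16, 12, N)   # 16-ary trie: 12 levels of 4 bits each
mb2 = 2 * en2 * 2 / 1e6            # D slots/node, ~2 bytes/slot (assumed)
```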



Forwarding Decisions
ATM and MPLS switches Bridges and Ethernet switches
Associative Lookup Hashing Trees and tries Caching CIDR Patricia trees/tries Other methods Direct Lookup

IP Routers

Packet Classification


Caching Addresses
Slow Path
CPU
Buffer Memory

Fast Path
DMA

DMA

DMA

Line Card Local Buffer Memory


MAC

Line Card Local Buffer Memory


MAC

Line Card Local Buffer Memory


MAC


Caching Addresses
LAN: Average flow &lt; 40 packets
WAN: Huge number of flows

(Figure: cache hit rate over time, with cache = 10% of full table.)



IP Routers
Class-based addresses
IP Address Space
Class A Class B Class C D

212.17.9.4

Class A Class B Class C

Routing Table: Exact Match 212.17.9.0 Port 4


IP Routers
CIDR
Class-based: A | B | C | D over the range 0 to 2^32 − 1

Classless: e.g. 65/8, 128.9/16 (2^16 addresses starting at 128.9.0.0), 142.12/19

128.9.16.14 falls within 128.9/16

IP Routers
CIDR
128.9.19/24, 128.9.25/24, 128.9.16/20, 128.9.176/20 are nested inside 128.9/16 (within 0 to 2^32 − 1)

128.9.16.14: most specific route = longest matching prefix



IP Routers
Metrics for Lookups
For 128.9.16.14:

Prefix        Port
65/8          3
128.9/16      5
128.9.16/20   2
128.9.19/24   7
128.9.25/24   10
128.9.176/20  1
142.12/19     3

Metrics: lookup time, storage space, update time, preprocessing time


IP Router
Lookup
H E A D E R

Dstn Addr

Forwarding Engine Next Hop Computation Forwarding Table Destination Next Hop -------------------

Next Hop

Incoming Packet

IPv4 unicast destination address based lookup



Need more than IPv4 unicast lookups


Multicast
PIM-SM
Longest Prefix Matching on the source and group address Try (S,G) followed by (*,G) followed by (*,*,RP) Check Incoming Interface

DVMRP:
Incoming Interface Check followed by (S,G) lookup

IPv6
128-bit destination address field Exact address architecture not yet known


Lookup Performance Required


Line    Line Rate   Pkt size = 40B   Pkt size = 240B
T1      1.5 Mb/s    4.68 Kpps        0.78 Kpps
OC3     155 Mb/s    480 Kpps         80 Kpps
OC12    622 Mb/s    1.94 Mpps        323 Kpps
OC48    2.5 Gb/s    7.81 Mpps        1.3 Mpps
OC192   10 Gb/s     31.25 Mpps       5.21 Mpps

Gigabit Ethernet (84B packets): 1.49 Mpps



Size of the Routing Table

Source: http://www.telstra.net/ops/bgptable.html

Ternary CAMs
Value       Mask             Next Hop
10.0.0.0    255.0.0.0        R1
10.1.0.0    255.255.0.0      R2
10.1.1.0    255.255.255.0    R3
10.1.3.0    255.255.255.0    R4
10.1.3.1    255.255.255.255  R4

All entries are compared in parallel; a priority encoder selects the first (highest priority) match.
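A software model of the ternary match: a real TCAM compares every (value, mask) entry in parallel and a priority encoder returns the lowest-index match, so entries are stored longest-prefix first. Here that ordering is produced with a sort:

```python
def ip(s):  # dotted quad -> 32-bit int
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# (value, mask, next hop), as in the table above.
entries = [
    (ip("10.0.0.0"), ip("255.0.0.0"),       "R1"),
    (ip("10.1.0.0"), ip("255.255.0.0"),     "R2"),
    (ip("10.1.1.0"), ip("255.255.255.0"),   "R3"),
    (ip("10.1.3.0"), ip("255.255.255.0"),   "R4"),
    (ip("10.1.3.1"), ip("255.255.255.255"), "R4"),
]
# Order by decreasing mask specificity so index 0 is highest priority.
tcam = sorted(entries, key=lambda e: bin(e[1]).count("1"), reverse=True)

def tcam_lookup(addr):
    for value, mask, hop in tcam:    # models parallel compare + priority encoder
        if addr & mask == value:     # first (most specific) match wins
            return hop
    return None
```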

Copyright 1999. All Rights Reserved

41

Binary Tries
Example Prefixes: a) 00001, b) 00010, c) 00011, d) 001, e) 0101, f) 011, g) 100, h) 1010, i) 1100, j) 11110000
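The prefixes above can be loaded into a binary trie; longest-prefix match then walks the trie, remembering the last prefix-bearing node passed. A minimal sketch using the slide's prefixes as bit strings:

```python
PREFIXES = {"00001": "a", "00010": "b", "00011": "c", "001": "d",
            "0101": "e", "011": "f", "100": "g", "1010": "h",
            "1100": "i", "11110000": "j"}

def make_node():
    return {"0": None, "1": None, "entry": None}

root = make_node()
for prefix, name in PREFIXES.items():
    node = root
    for bit in prefix:              # create the path for this prefix
        if node[bit] is None:
            node[bit] = make_node()
        node = node[bit]
    node["entry"] = name

def lpm(addr_bits: str):
    """Walk addr_bits from the root; the last entry seen is the longest match."""
    best, node = None, root
    for bit in addr_bits:
        node = node[bit]
        if node is None:
            break
        if node["entry"] is not None:
            best = node["entry"]
    return best
```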


Patricia Tree
Same example prefixes; the one-way branch leading to j) 11110000 is compressed into a single node with Skip = 5.


Patricia Tree
Advantages: general solution; extensible to wider fields.
Disadvantages: many memory accesses; may need backtracking; pointers take up a lot of space.

Avoid backtracking by storing the intermediate-best matched prefix. (Dynamic Prefix Tries) 40K entries: 2MB data structure with 0.3-0.5 Mpps [O(W)]

Binary search on trie levels




Binary search on trie levels


Store a hash table for each prefix length to aid search at a particular trie level.

Example prefixes: 10.0.0.0/8, 10.1.0.0/16, 10.1.1.0/24, 10.1.2.0/24, 10.2.3.0/24
Hash tables: Length 8: {10}; Length 16: {10.1, 10.2}; Length 24: {10.1.1, 10.1.2, 10.2.3}
Example addresses: 10.1.1.4, 10.4.4.3, 10.2.3.9, 10.2.4.8
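A runnable sketch of the scheme with the example prefixes: one hash table per prefix length, markers left on each prefix's binary-search path, and each marker precomputed with its best matching shorter prefix so the search never backtracks. Route names R8, R16, R24a, etc. are illustrative:

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

ROUTES = {("10.0.0.0", 8): "R8", ("10.1.0.0", 16): "R16",
          ("10.1.1.0", 24): "R24a", ("10.1.2.0", 24): "R24b",
          ("10.2.3.0", 24): "R24c"}

lengths = sorted({l for _, l in ROUTES})          # [8, 16, 24]
tables = {l: {} for l in lengths}                 # one hash table per length

def key(addr, l):
    return addr >> (32 - l)

# Insert real prefixes, then markers along each prefix's binary-search path.
for (p, l), hop in ROUTES.items():
    tables[l][key(ip(p), l)] = {"real": hop, "bmp": None}
for (p, l), hop in ROUTES.items():
    lo, hi = 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        ml = lengths[mid]
        if ml == l:
            break
        if ml < l:    # search probes here before reaching l: leave a marker
            tables[ml].setdefault(key(ip(p), ml), {"real": None, "bmp": None})
            lo = mid + 1
        else:
            hi = mid - 1
# Precompute each entry's best matching real prefix (avoids backtracking).
for l in lengths:
    for k, e in tables[l].items():
        if e["real"] is not None:
            e["bmp"] = e["real"]
            continue
        for sl in reversed([x for x in lengths if x < l]):
            hit = tables[sl].get(k >> (l - sl))
            if hit and hit["real"] is not None:
                e["bmp"] = hit["real"]
                break

def lookup(addr_str):
    addr, best = ip(addr_str), None
    lo, hi = 0, len(lengths) - 1
    while lo <= hi:                      # binary search over prefix lengths
        mid = (lo + hi) // 2
        e = tables[lengths[mid]].get(key(addr, lengths[mid]))
        if e:                            # hit (real or marker): try longer
            best = e["bmp"] or best
            lo = mid + 1
        else:                            # miss: only shorter lengths can match
            hi = mid - 1
    return best
```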



Binary search on trie levels


Disadvantages Multiple hashed memory accesses. Updates are complex. Advantages Scaleable to IPv6.

33K entries: 1.4MB data structure with 1.2-2.2 Mpps [O(log W)]

Compacting Forwarding Tables

1 0 0 0

1 1 0 0 0 1


Compacting Forwarding Tables


Bit-vector: 10001010 11100010 10000010 10110100 11000000
Codeword array: (R1, 0), (R2, 3), (R3, 7), (R4, 9), (R5, 0)
Base index array: 0, 13

Compacting Forwarding Tables


Disadvantages Scalability to larger tables? Updates are complex. Advantages Extremely small data structure - can fit in cache.

33K entries: 160KB data structure with average 2Mpps [O(W/k)]



Multi-bit Tries
16-ary search trie: each node holds 16 (label, pointer) entries, 0000 through 1111; e.g. leaves reached for addresses 000011110000 and 111111111111.


Compressed Tries
Only 3 memory accesses: one each at levels L8, L16, and L24.


Routing Lookups in Hardware

(Histogram: number of prefixes vs. prefix length.)

Most prefixes are 24-bits or shorter



Routing Lookups in Hardware


Prefixes up to 24-bits: the top 24 bits of the address (e.g. 142.19.6 of 142.19.6.14) directly index a table of 2^24 = 16M next-hop entries.


Routing Lookups in Hardware


Prefixes up to 24-bits resolve in the first table (e.g. 128.3.72 indexes a next hop directly). For prefixes longer than 24 bits, the entry instead holds a pointer selecting a 256-entry block (base) in a second table; the last 8 bits of the address (offset, e.g. 44 for 128.3.72.44) select the next hop within that block.
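A software model of the two-table scheme. The 2^24-entry first table is modeled with a dict; entry format, next-hop names, and the example routes are illustrative, and prefixes shorter than /24 would be expanded into multiple first-table entries before installation:

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

tbl24 = {}   # models the 2^24-entry first table, indexed by the top 24 bits
tbl2 = []    # second table: one 256-entry block per long prefix

def add_route_24(prefix, hop):
    """Install a /24 route: the entry is the next hop itself (flag 0)."""
    tbl24[ip(prefix) >> 8] = (0, hop)

def add_route_long(prefix24, sub_entries, default_hop):
    """Install routes longer than /24 sharing a common 24-bit prefix."""
    base = len(tbl2) // 256
    tbl2.extend([default_hop] * 256)
    for last_byte, hop in sub_entries.items():
        tbl2[base * 256 + last_byte] = hop
    tbl24[ip(prefix24) >> 8] = (1, base)          # flag 1: entry is a pointer

def lookup(addr):
    entry = tbl24.get(ip(addr) >> 8)              # first memory access
    if entry is None:
        return None
    flag, value = entry
    if flag == 0:
        return value                              # done in one access
    return tbl2[value * 256 + (ip(addr) & 0xFF)]  # second access: last 8 bits

add_route_24("142.19.6.0", "A")
add_route_long("128.3.72.0", {44: "B"}, default_hop="C")
```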


Routing Lookups in Hardware


Generalization: prefixes up to N bits index a 2^N-entry first table; prefixes longer than N bits consume second-table blocks indexed by the next M bits (N+M total).


Routing Lookups in Hardware


Disadvantages Large memory required (9-33MB) Depends on prefix-length distribution.
Advantages 20Mpps with 50ns DRAM Easy to implement in hardware

Various compression schemes can be employed to decrease the storage requirements: e.g. employ carefully chosen variable length strides, bitmap compression etc.

IP Router Lookups
References
- A. Brodnik, S. Carlsson, M. Degermark, S. Pink, "Small forwarding tables for fast routing lookups," Sigcomm 1997, pp. 3-14.
- B. Lampson, V. Srinivasan, G. Varghese, "IP lookups using multiway and multicolumn search," Infocom 1998, pp. 1248-56, vol. 3.
- M. Waldvogel, G. Varghese, J. Turner, B. Plattner, "Scalable high speed IP routing lookups," Sigcomm 1997, pp. 25-36.
- P. Gupta, S. Lin, N. McKeown, "Routing lookups in hardware at memory access speeds," Infocom 1998, pp. 1241-1248, vol. 3.
- S. Nilsson, G. Karlsson, "Fast address lookup for Internet routers," IFIP Intl Conf on Broadband Communications, Stuttgart, Germany, April 1-3, 1998.
- V. Srinivasan, G. Varghese, "Fast IP lookups using controlled prefix expansion," Sigmetrics, June 1998.


Forwarding Decisions
ATM and MPLS switches Bridges and Ethernet switches
Associative Lookup Hashing Trees and tries Caching CIDR Patricia trees/tries Other methods Direct Lookup

IP Routers

Packet Classification


Providing Value-Added Services


Some examples
Differentiated services
Regard traffic from Autonomous System #33 as 'platinum-grade'

Access Control Lists


Deny udp host 194.72.72.33 194.72.6.64 0.0.0.15 eq snmp

Committed Access Rate


Rate limit WWW traffic from sub-interface#739 to 10Mbps

Policy-based Routing
Route all voice traffic through the ATM network


Packet Classification
H E A D E R

Forwarding Engine Packet Classification Classifier (Policy Database) Predicate Action -------------------

Action

Incoming Packet

Multi-field Packet Classification


Rule    Field 1            Field 2           …   Field k   Action
Rule1   152.163.190.69/21  152.163.80.11/32  …   UDP       A1
Rule2   152.168.3.0/24     152.163.0.0/16    …   TCP       A2
…
RuleN   152.168.0.0/16     152.0.0.0/8       …   ANY       An
Given a classifier with N rules, find the action associated with the highest priority rule matching an incoming packet.
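The problem statement maps directly onto a first-match linear scan, the baseline that the schemes which follow try to beat. Rule fields, prefixes, and actions below are illustrative:

```python
def ip(s):
    a, b, c, d = (int(x) for x in s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def pfx(s):                      # "a.b.c.d/len" -> (value, mask)
    addr, plen = s.split("/")
    mask = 0 if plen == "0" else (0xFFFFFFFF << (32 - int(plen))) & 0xFFFFFFFF
    return ip(addr) & mask, mask

# Rules in priority order: (src prefix, dst prefix, protocol or None, action)
RULES = [
    (pfx("152.163.190.0/24"), pfx("152.163.80.0/24"), "UDP", "A1"),
    (pfx("152.168.3.0/24"),   pfx("152.163.0.0/16"),  "TCP", "A2"),
    (pfx("152.0.0.0/8"),      pfx("0.0.0.0/0"),       None,  "An"),
]

def classify(src, dst, proto):
    # First match wins = highest priority rule; O(N) per packet.
    for (sv, sm), (dv, dm), p, action in RULES:
        if ip(src) & sm == sv and ip(dst) & dm == dv and p in (None, proto):
            return action
    return "default"
```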

Geometric Interpretation in 2D
Each rule is a region in (Field #1, Field #2) space: R1 … R7 are rectangles, e.g. (144.24/16, 64/24) or (128.16.46.23, *). An incoming packet is a point (P1, P2); it matches the highest priority region containing it.


Proposed Schemes
Sequential Evaluation
  Pros: small storage, scales well with the number of fields
  Cons: slow classification rates
Ternary CAMs
  Pros: single-cycle classification
  Cons: cost, density, power consumption
Grid of Tries (Srinivasan et al [Sigcomm 98])
  Pros: small storage requirements and fast lookup rates for two fields; suitable for big classifiers
  Cons: not easily extendible to more than two fields


Proposed Schemes (Contd.)


Crossproducting (Srinivasan et al [Sigcomm 98])
  Pros: fast accesses; suitable for multiple fields
  Cons: large memory requirements; suitable without caching only for classifiers with fewer than 50 rules
Bit-level Parallelism (Lakshman and Stiliadis [Sigcomm 98])
  Pros: suitable for multiple fields
  Cons: large memory bandwidth required; comparatively slow lookup rate; hardware only


Proposed Schemes (Contd.)


Hierarchical Intelligent Cuttings (Gupta and McKeown [HotI 99])
  Pros: suitable for multiple fields; small memory requirements; good update time
  Cons: large preprocessing time
Tuple Space Search (Srinivasan et al [Sigcomm 99])
  Pros: suitable for multiple fields; the basic scheme has good update times and memory requirements
  Cons: classification rate can be low; requires perfect hashing for determinism
Recursive Flow Classification (Gupta and McKeown [Sigcomm 99])
  Pros: fast accesses; suitable for multiple fields; reasonable memory requirements for real-life classifiers
  Cons: large preprocessing time and memory requirements for large classifiers

Grid of Tries
(Figure: a trie on dimension 1 whose valid nodes point to tries on dimension 2 storing rules R1 through R7.)

Grid of Tries
Disadvantages Static solution Not easy to extend to higher dimensions Advantages Good solution for two dimensions

20K entries: 2MB data structure with 9 memory accesses [at most 2W]

Classification using Bit Parallelism


(Figure: each dimension yields a rule bitmap, e.g. 0111 and 1100; ANDing the bitmaps and taking the highest priority set bit selects among R1 … R4.)


Classification using Bit Parallelism


Disadvantages Large memory bandwidth Hardware optimized Advantages Good solution for multiple dimensions for small classifiers

512 rules: 1Mpps with single FPGA and 5 128KB SRAM chips.

Classification Using Multiple Fields


Recursive Flow Classification reduces the packet header in stages through successive memory lookups:

2^S = 2^128 (packet header fields F1 … Fn) → 2^64 → 2^24 → 2^T = 2^12 → Action


Packet Classification
References
- T. V. Lakshman, D. Stiliadis, "High speed policy based packet forwarding using efficient multi-dimensional range matching," Sigcomm 1998, pp. 191-202.
- V. Srinivasan, S. Suri, G. Varghese, M. Waldvogel, "Fast and scalable layer 4 switching," Sigcomm 1998, pp. 203-214.
- V. Srinivasan, G. Varghese, S. Suri, "Fast packet classification using tuple space search," to be presented at Sigcomm 1999.
- P. Gupta, N. McKeown, "Packet classification using hierarchical intelligent cuttings," Hot Interconnects VII, 1999.
- P. Gupta, N. McKeown, "Packet classification on multiple fields," Sigcomm 1999.


Tutorial Outline
Introduction:
What is a Packet Switch?

Packet Lookup and Classification:


Where does a packet go next?

Switching Fabrics:
How does the packet get there?

Output Scheduling:
When should the packet leave?


Switching Fabrics
Output and Input Queueing Output Queueing Input Queueing
Scheduling algorithms Combining input and output queues Other non-blocking fabrics Multicast traffic

Basic Architectural Components


1.

Datapath: per-packet processing


2. Interconnect

Forwarding Table

3. Output Scheduling

Forwarding Decision
Forwarding Table

Forwarding Decision
Forwarding Table

Forwarding Decision

Interconnects
Two basic techniques
Input Queueing Output Queueing

Usually a non-blocking switch fabric (e.g. crossbar)



Usually a fast bus



Interconnects
Output Queueing
Individual Output Queues: memory b/w = (N+1)R

Centralized Shared Memory: memory b/w = 2NR

Output Queueing
The ideal

Output Queueing
How fast can we make centralized shared memory?
5ns SRAM Shared Memory

1 2 N
200 byte bus

5ns per memory operation Two memory operations per packet Therefore, up to 160Gb/s In practice, closer to 80Gb/s
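The arithmetic behind the 160 Gb/s figure, kept in integer nanosecond units:

```python
bus_bits = 200 * 8        # 200-byte bus -> 1600 bits moved per memory access
t_access_ns = 5           # 5 ns SRAM cycle

raw_bw_gbps = bus_bits / t_access_ns        # bits per ns = Gb/s through memory
per_packet_ops = 2                          # write on arrival + read on departure
capacity_gbps = raw_bw_gbps / per_packet_ops
```

The result is the theoretical 160 Gb/s; segmentation overhead and bank conflicts are why practice lands nearer 80 Gb/s.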


Switching Fabrics
Output and Input Queueing Output Queueing Input Queueing
Scheduling algorithms Other non-blocking fabrics Combining input and output queues Multicast traffic

Interconnects
Input Queueing with Crossbar
Memory b/w = 2R
Scheduler

Data In

configuration

Data Out


Input Queueing
Head of Line Blocking
Delay grows without bound as the load approaches 58.6% (2 − √2) of link capacity.


Head of Line Blocking


Input Queueing
Virtual output queues


Input Queues
Virtual Output Queues

(Delay vs. load: with virtual output queues, throughput can reach 100%.)


Input Queueing
Memory b/w = 2R
Scheduler

Can be quite complex!


Input Queueing
Scheduling
Each input i keeps VOQs Q(i,1) … Q(i,n); arrivals Ai(t) at input i, departures Dj(t) at output j. Each cell time, a matching M connects inputs to outputs.


Input Queueing
Example: inputs 1-4 and outputs 1-4 form a request graph whose edges are weighted by VOQ occupancy (7, 2, 4, 2, 5, 2); the scheduler picks a bipartite matching (weight = 18).
Question: Maximum weight or maximum size?



Input Queueing
Scheduling
Maximum Size
Maximizes instantaneous throughput Does it maximize long-term throughput?

Maximum Weight
Can clear most backlogged queues But does it sacrifice long-term throughput?



Input Queueing
Longest Queue First or Oldest Cell First
Weight = queue length (LQF) or waiting time (OCF). Example: with VOQ lengths 1, 1, 1 and 10, a maximum weight matching serves the length-10 queue and sustains 100% throughput.


Input Queueing
Why is serving long/old queues better than serving maximum number of queues?
When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput. When traffic is non-uniform, some queues become longer than others. A good algorithm keeps the queue lengths matched, and services a large number of queues.
(Figure: average occupancy per VOQ #, under uniform and non-uniform traffic.)

Input Queueing
Practical Algorithms Maximal Size Algorithms
Wave Front Arbiter (WFA) Parallel Iterative Matching (PIM) iSLIP

Maximal Weight Algorithms


Fair Access Round Robin (FARR) Longest Port First (LPF)


Wave Front Arbiter


Wave Front Arbiter


Wave Front Arbiter


Implementation

Combinational Logic Blocks


Wave Front Arbiter


Wrapped WFA (WWFA)
N steps instead of 2N-1


Input Queueing
Practical Algorithms Maximal Size Algorithms
Wave Front Arbiter (WFA) Parallel Iterative Matching (PIM) iSLIP

Maximal Weight Algorithms


Fair Access Round Robin (FARR) Longest Port First (LPF)


Parallel Iterative Matching: random selection

Iteration #1: Requests → Grant (each output randomly selects one requesting input) → Accept/Match (each input randomly selects one granting output)
Iteration #2: repeat among the ports left unmatched

Parallel Iterative Matching


Maximal is not Maximum
Example: starting from the same requests, PIM can converge to a maximal matching that is smaller than the maximum matching.


Parallel Iterative Matching


Analytical Results
Number of iterations to converge:

E[Ui] ≤ N^2 / 4^i, so E[C] ≤ log2 N

where C = # of iterations required to resolve connections, N = # of ports, Ui = # of unresolved connections after iteration i
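The convergence behaviour can be checked with a small simulation of the three-phase iteration. This is a sketch, not the original implementation; the fully loaded request pattern and seed are arbitrary choices:

```python
import random

def pim(requests, rng):
    """requests[i] = set of outputs input i wants; returns (input->output, iters)."""
    n = len(requests)
    match = {}
    iterations = 0
    while True:
        iterations += 1
        free_in = [i for i in range(n) if i not in match]
        used_out = set(match.values())
        # Grant: each free output picks a random requesting free input.
        grants = {}
        for o in range(n):
            if o in used_out:
                continue
            asking = [i for i in free_in if o in requests[i]]
            if asking:
                grants.setdefault(rng.choice(asking), []).append(o)
        if not grants:
            return match, iterations
        # Accept: each granted input picks a random granting output.
        for i, outs in grants.items():
            match[i] = rng.choice(outs)

rng = random.Random(7)
n = 16
reqs = [set(range(n)) for _ in range(n)]   # all inputs request all outputs
match, iters = pim(reqs, rng)
```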



Input Queueing
Practical Algorithms Maximal Size Algorithms
Wave Front Arbiter (WFA) Parallel Iterative Matching (PIM) iSLIP

Maximal Weight Algorithms


Fair Access Round Robin (FARR) Longest Port First (LPF)


iSLIP: round-robin selection

Iteration #1: Requests → Grant (each output selects the first requesting input at or after its round-robin pointer) → Accept/Match (each input selects the first granting output at or after its pointer)
Iteration #2: repeat among the ports left unmatched

iSLIP
Properties
Random under low load; TDM under high load. Lowest priority is given to the MRU (most recently used) input. With 1 iteration: fair to outputs. Converges in at most N iterations, on average ≤ log2 N. Implementation: N priority encoders. Up to 100% throughput for uniform traffic.
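The pointer logic can be sketched as below, run for up to N iterations per cell time. One simplification is labelled clearly: this sketch advances the pointers on every accepted grant, whereas iSLIP proper updates them only for grants accepted in the first iteration:

```python
def islip(requests, grant_ptr, accept_ptr, n, iterations):
    """requests[i] = set of outputs input i has cells for.
    grant_ptr[o], accept_ptr[i]: round-robin pointers. Returns input->output."""
    match = {}
    for _ in range(iterations):
        # Grant: each unmatched output offers to the first requesting
        # unmatched input at or after its pointer.
        grants = {}
        for o in range(n):
            if o in match.values():
                continue
            for k in range(n):
                i = (grant_ptr[o] + k) % n
                if i not in match and o in requests[i]:
                    grants.setdefault(i, []).append(o)
                    break
        # Accept: each input takes the first offering output at or after
        # its pointer; pointers move just past the accepted port.
        for i, offers in grants.items():
            for k in range(n):
                o = (accept_ptr[i] + k) % n
                if o in offers:
                    match[i] = o
                    grant_ptr[o] = (i + 1) % n
                    accept_ptr[i] = (o + 1) % n
                    break
    return match

n = 4
reqs = [set(range(n)) for _ in range(n)]     # fully loaded 4x4 switch
match = islip(reqs, [0] * n, [0] * n, n, iterations=n)
```

Being deterministic, unlike PIM, the fully loaded case desynchronizes the pointers and settles into the TDM-like pattern the slide describes.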


iSLIP Implementation

Per output: a programmable priority encoder for Grant; per input: a programmable priority encoder for Accept; each arbiter keeps log2 N bits of pointer state feeding the Decision.

Input Queueing References


References
- M. Karol et al., "Input vs output queueing on a space-division packet switch," IEEE Trans Comm., Dec 1987, pp. 1347-1356.
- Y. Tamir, "Symmetric crossbar arbiters for VLSI communication switches," IEEE Trans Parallel and Dist Sys., Jan 1993, pp. 13-27.
- T. Anderson et al., "High-speed switch scheduling for local area networks," ACM Trans Comp Sys., Nov 1993, pp. 319-352.
- N. McKeown, "The iSLIP scheduling algorithm for input-queued switches," IEEE Trans Networking, April 1999, pp. 188-201.
- C. Lund et al., "Fair prioritized scheduling in an input-buffered switch," Proc. of IFIP-IEEE Conf., April 1996, pp. 358-69.
- A. Mekkittikul et al., "A practical scheduling algorithm to achieve 100% throughput in input-queued switches," IEEE Infocom 98, April 1998.

Switching Fabrics
Output and Input Queueing Output Queueing Input Queueing
Scheduling algorithms Other non-blocking fabrics Combining input and output queues Multicast traffic

Other Non-Blocking Fabrics


Clos Network


Other Non-Blocking Fabrics


Clos Network
Expansion factor required = 2 − 1/N (but still blocking for multicast)


Other Non-Blocking Fabrics


Self-Routing Networks
000 001 010 011 100 101 110 111 000 001 010 011 100 101 110 111


Other Non-Blocking Fabrics


Self-Routing Networks
The Non-blocking Batcher Banyan Network
Batcher Sorter

Self-Routing Network
000 001 010 011 100 101 110 111

Fabric can be used as scheduler. Batcher-Banyan network is blocking for multicast.



Switching Fabrics
Output and Input Queueing Output Queueing Input Queueing
Scheduling algorithms Other non-blocking fabrics Combining input and output queues Multicast traffic

Speedup
Context
input-queued switches output-queued switches the speedup problem

Early approaches Algorithms Implementation considerations


Speedup: Context
A generic switch: memory can be placed at the inputs, the outputs, or both.

The placement of memory gives


- Output-queued switches - Input-queued switches - Combined input- and output-queued switches

Output-queued switches

Best delay and throughput performance


- Possible to erect bandwidth firewalls between sessions

Main problem
- Requires high fabric speedup (S = N)

Unsuitable for high-speed switching



Input-queued switches

Big advantage
- Speedup of one is sufficient

Main problem
- Can't guarantee delay due to input contention

Overcoming input contention: use higher speedup



A Comparison
Memory speeds for 32x32 switch
Line Rate   OQ Memory BW   OQ Access Time/cell   IQ Memory BW   IQ Access Time/cell
100 Mb/s    3.3 Gb/s       128 ns                200 Mb/s       2.12 us
1 Gb/s      33 Gb/s        12.8 ns               2 Gb/s         212 ns
2.5 Gb/s    82.5 Gb/s      5.12 ns               5 Gb/s         84.8 ns
10 Gb/s     330 Gb/s       1.28 ns               20 Gb/s        21.2 ns
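These entries follow from memory b/w = (N+1)R for output queueing and 2R for input queueing; a quick check, with 53-byte cells assumed for the access-time column:

```python
CELL_BITS = 53 * 8                 # assumed ATM-style cell

def memory_requirements(n_ports, line_rate_bps):
    oq_bw = (n_ports + 1) * line_rate_bps    # N writes + 1 read per cell time
    iq_bw = 2 * line_rate_bps                # 1 write + 1 read
    return oq_bw, CELL_BITS / oq_bw, iq_bw, CELL_BITS / iq_bw

# 32x32 switch at 100 Mb/s per line (first row of the table).
oq_bw, oq_t, iq_bw, iq_t = memory_requirements(32, 100e6)
```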


The Speedup Problem


Find a compromise: 1 < Speedup << N
- to get the performance of an OQ switch - close to the cost of an IQ switch

Essential for high speed QoS switching


Some Early Approaches


Probabilistic Analyses
- assume traffic models (Bernoulli, Markov-modulated, non-uniform loading, friendly correlated) - obtain mean throughput and delays, bounds on tails - analyze different fabrics (crossbar, multistage, etc)

Numerical Methods
- use actual and simulated traffic traces - run different algorithms - set the speedup dial at various values


The findings
Very tantalizing ...
- under different settings (traffic, loading, algorithm, etc) - and even for varying switch sizes

A speedup of between 2 and 5 was sufficient!


Using Speedup

Intuition
Bernoulli IID inputs, speedup = 1: fabric throughput = 0.58

Bernoulli IID inputs, speedup = 2: fabric throughput = 1.16; input efficiency = 1/1.16; average input queue = 6.25

Intuition (continued)
Bernoulli IID inputs, speedup = 3: fabric throughput = 1.74; input efficiency = 1/1.74; average input queue = 1.35. Speedup = 4: fabric throughput = 2.32; input efficiency = 1/2.32; average input queue = 0.75


Issues
Need hard guarantees
- exact, not average

Robustness
- realistic, even adversarial, traffic not friendly Bernoulli IID


The Ideal Solution


Inputs Speedup = N Outputs

?
Speedup << N

Question: Can we find


- a simple and good algorithms - that exactly mimics output-queueing - regardless of switch sizes and traffic patterns?


What is exact mimicking?

Apply same inputs to an OQ and a CIOQ switch


- packet by packet

Obtain same outputs


- packet by packet


Algorithm - MUCF

Key concept: urgency value


- urgency = departure time - present time


MUCF
The algorithm
- Outputs try to get their most urgent packets - Inputs grant to the output whose packet is most urgent, ties broken by port number - Losing outputs try for their next most urgent packet - The algorithm terminates when no more matches are possible
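One matching phase can be sketched as below. The urgency values are supplied directly; in the real switch they come from tracking the departure times of a shadow OQ switch:

```python
def mucf(urgency):
    """urgency[i][o] = departure - present time for input i's head cell to
    output o (None if no such cell). Returns an input->output matching."""
    n = len(urgency)
    match_in, match_out = {}, {}
    while True:
        proposals = {}
        for o in range(n):
            if o in match_out:
                continue
            # Output o tries for its most urgent remaining packet.
            cands = [(urgency[i][o], i) for i in range(n)
                     if i not in match_in and urgency[i][o] is not None]
            if cands:
                u, i = min(cands)            # smallest value = most urgent
                proposals.setdefault(i, []).append((u, o))
        if not proposals:
            return match_in                  # no output can propose: done
        for i, offers in proposals.items():
            u, o = min(offers)               # input grants most urgent output
            match_in[i], match_out[o] = o, i
```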


Stable Marriage Problem


Men = Outputs: Bill, John, Pedro
Women = Inputs: Hillary, Monica, Maria

An example

Observation: only two reasons a packet doesn't get to its output - input contention and output contention - This is why a speedup of 2 works!

What does this get us?


Speedup of 4 is sufficient for exact emulation of FIFO OQ switches, with MUCF

What about non-FIFO OQ switches?


E.g. WFQ, Strict priority


Other results
To exactly emulate an NxN OQ switch
- Speedup of 2 − 1/N is necessary and sufficient (hence a speedup of 2 is sufficient for all N) - Input traffic patterns can be absolutely arbitrary - The emulated OQ switch may use any monotone scheduling policy - E.g.: FIFO, LIFO, strict priority, WFQ, etc


139

What gives?
Complexity of the algorithms
- Extra hardware for processing - Extra run time (time complexity)

What is the benefit?


- Reduced memory bandwidth requirements

Tradeoff: Memory for processing


- Moore's Law supports this tradeoff
140

Implementation - a closer look


Main sources of difficulty
- Estimating urgency, etc: the needed info is distributed (and must be communicated among inputs and outputs)
- Matching process: too many iterations?

Estimating urgency depends on what is being emulated


- Like taking a ticket to hold a place in a queue
- FIFO, strict priorities: no problem
- WFQ, etc.: problems
141

Implementation (contd)
Matching process
- A variant of the stable marriage problem
- Worst-case number of iterations for SMP = N^2
- Worst-case number of iterations in switching = N
- With high probability (and on average) approximately log(N)


142

Other Work
Relax stringent requirement of exact emulation
- Least Occupied Output First Algorithm (LOOFA): keeps outputs busy whenever there are packets; by time-stamping packets, it also exactly mimics output queueing
- Disallow arbitrary inputs (e.g. leaky-bucket constrained) and obtain worst-case delay bounds


143

References for speedup


- Y. Oie et al, Effect of speedup in nonblocking packet switch, ICC '89.
- A. L. Gupta, N. D. Georganas, Analysis of a packet switch with input and output buffers and speed constraints, Infocom '91.
- S.-T. Chuang et al, Matching output queueing with a combined input and output queued switch, IEEE JSAC, vol 17, no 6, 1999.
- B. Prabhakar, N. McKeown, On the speedup required for combined input and output queued switching, Automatica, vol 35, 1999.
- P. Krishna et al, On the speedup required for work-conserving crossbar switches, IEEE JSAC, vol 17, no 6, 1999.
- A. Charny, Providing QoS guarantees in input buffered crossbar switches with speedup, PhD Thesis, MIT, 1998.


144

Switching Fabrics
Output and Input Queueing
Output Queueing
Input Queueing
Scheduling algorithms
Other non-blocking fabrics
Combining input and output queues
Multicast traffic
145


Multicast Switching
The problem
Switching with crossbar fabrics
Switching with other fabrics


146

Multicasting


147

Crossbar fabrics: Method 1


Copy network + unicast switching

Copy networks

Increased hardware, increased input contention


148

Method 2
Use copying properties of crossbar fabric
No fanout-splitting: Easy, but low throughput

Fanout-splitting: higher throughput, but not as simple. Leaves residue.
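The difference between the two modes can be sketched with a toy one-cell-time model. This is an illustrative helper, not from the tutorial; it assumes each output grants the lowest-numbered requesting input.

```python
def one_cell_time(fanouts, fanout_splitting=True):
    """fanouts: input -> set of requested outputs. Each output grants
    one requesting input (lowest port number here). With fanout
    splitting, an input delivers whatever subset was granted and keeps
    the rest as residue; without it, an input transmits only if its
    entire fanout was granted. Returns the residue left at each input.
    """
    grants = {}                     # output -> input it grants
    for i, outs in sorted(fanouts.items()):
        for o in outs:
            grants.setdefault(o, i)
    residue = {}
    for i, outs in fanouts.items():
        served = {o for o in outs if grants.get(o) == i}
        if not fanout_splitting and served != outs:
            served = set()          # all-or-nothing: wait for next time
        left = outs - served
        if left:
            residue[i] = left
    return residue
```

With two inputs contending for output 1, splitting delivers the non-contended copies immediately and leaves only the lost copy as residue; without splitting the whole fanout waits, which is what depresses throughput.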


149

The effect of fanout-splitting

Performance of an 8x8 switch with and without fanout-splitting under uniform IID traffic
150

Placement of residue
Key question: How should outputs grant requests? (and hence decide placement of residue)


151

Residue and throughput


Result: Concentrating residue brings more new work forward, and hence leads to higher throughput. But there are fairness problems to deal with. This and other problems can be looked at in a unified way by mapping the multicasting problem onto a variation of Tetris.


152

Multicasting and Tetris


[Figure: Tetris analogy - residue packets stacked over input ports 1-5 and output ports 1-5]


153

Multicasting and Tetris


[Figure: Tetris analogy - the same residue, now concentrated over a few of the output ports 1-5]


154

Replication by recycling
Main idea: Make two copies at a time using a binary tree with input at root and all possible destination outputs at the leaves.
[Figure: binary tree with the input at the root and destination leaves a-e; packets x and y are copied pairwise toward their destinations]
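The pairwise-copy idea can be sketched as follows. This is an illustrative helper: the real scheme recycles the second copy back through the fabric, whereas this sketch just tracks the copies on a software work list.

```python
def recycle_copies(packet, dests):
    """Binary-tree replication: a packet carrying a set of destinations
    is split into two copies, each carrying half the set, and the copies
    are recycled until every copy has exactly one destination.
    Returns (packet, destination) pairs sorted by destination."""
    work, done = [(packet, sorted(dests))], []
    while work:
        pkt, ds = work.pop()
        if len(ds) == 1:
            done.append((pkt, ds[0]))   # single destination: deliver
        else:
            mid = len(ds) // 2
            work.append((pkt, ds[:mid]))  # first copy
            work.append((pkt, ds[mid:]))  # second copy, recycled
    return sorted(done, key=lambda t: t[1])
```

A fanout of F finishes after about log2(F) recycling passes, which is why the scheme scales to large fanouts but introduces variable delay.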


155

Replication by recycling (contd)


[Figure: recycling switch - Receive, Table, Network, Resequencing, Output, Transmit blocks, with a Recycle path back to the input]

Scalable to large fanouts. Needs resequencing at outputs and introduces variable delays.


156

References for Multicasting


- J. Hayes et al, Performance analysis of a multicast switch, IEEE Trans. on Communications, vol 39, April 1991.
- B. Prabhakar et al, Tetris models for multicast switches, Proc. of the 30th Annual Conference on Information Sciences and Systems, 1996.
- B. Prabhakar et al, Multicast scheduling for input-queued switches, IEEE JSAC, 1997.
- J. Turner, An optimal nonblocking multicast virtual circuit switch, INFOCOM, 1994.


157

Tutorial Outline
Introduction:
What is a Packet Switch?

Packet Lookup and Classification:


Where does a packet go next?

Switching Fabrics:
How does the packet get there?

Output Scheduling:
When should the packet leave?


158

Output Scheduling
What is output scheduling?
How is it done?
Practical considerations


159

Output Scheduling
Allocating output bandwidth Controlling packet delay

scheduler


160

Output Scheduling

FIFO

Fair Queueing


161

Motivation
FIFO is natural but gives poor QoS
bursty flows increase delays for others; hence FIFO cannot guarantee delays

Need round robin scheduling of packets


Fair Queueing Weighted Fair Queueing, Generalized Processor Sharing


162

Fair queueing: Main issues


Level of granularity
packet-by-packet? (favors long packets)
bit-by-bit? (ideal, but very complicated)

Packet Generalized Processor Sharing (PGPS)


serves packet-by-packet and imitates bit-by-bit schedule within a tolerance


163

How does WFQ work?


Weights: WR = 1, WG = 5, WP = 2
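A minimal sketch of the packet-by-packet approximation, assuming a deliberately simplified virtual clock that just advances to the finish time of the last packet served (real WFQ/PGPS tracks the GPS virtual time more carefully); names are illustrative.

```python
import heapq

class WFQ:
    """Simplified weighted fair queueing: each packet is stamped with a
    virtual finish time F = max(V, F_last[flow]) + length / weight and
    packets are served in increasing F order."""

    def __init__(self, weights):
        self.w = weights                       # flow -> weight
        self.last_finish = {f: 0.0 for f in weights}
        self.v = 0.0                           # crude virtual time
        self.heap = []

    def enqueue(self, flow, length):
        start = max(self.v, self.last_finish[flow])
        finish = start + length / self.w[flow]
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, flow, length))

    def dequeue(self):
        finish, flow, length = heapq.heappop(self.heap)
        self.v = finish                        # advance the crude clock
        return flow
```

With the slide's weights (WR = 1, WG = 5, WP = 2) and equal-length packets, the weight-5 flow gets its packet out first, then weight 2, then weight 1.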


164

Delay guarantees
Theorem
If flows are leaky bucket constrained and all nodes employ GPS (WFQ), then the network can guarantee worst-case delay bounds to sessions.
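For a single PGPS node, the Parekh-Gallager bound behind this theorem is roughly of the following form (our notation: sigma_i is the session's leaky-bucket burst size, g_i its guaranteed rate with g_i >= rho_i, L_max the maximum packet length, r the link rate):

```latex
D_i^{\mathrm{PGPS}} \;\le\; \frac{\sigma_i}{g_i} \;+\; \frac{L_{\max}}{r}
```

The first term is the GPS fluid delay of draining the worst-case burst at the guaranteed rate; the second accounts for PGPS serving whole packets rather than fluid. The multiple-node case adds a per-hop packetization term along the path.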


165

Practical considerations
For every packet, the scheduler needs to
classify it into the right flow queue (maintaining a linked list per flow), and schedule it for departure

Complexities of both are O(log [# of flows])


the first is hard to overcome; the second can be overcome by DRR
166

Deficit Round Robin


[Figure: DRR example - flow queues holding packets of sizes between 50 and 1000 bytes, served with a 500-byte quantum]

Good approximation of FQ
Much simpler to implement


Quantum size = 500
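DRR itself fits in a few lines. This sketch uses illustrative names and byte-sized packets, with the slide's 500-byte quantum as the default:

```python
from collections import deque

def drr_schedule(flows, quantum=500):
    """Deficit round robin: each backlogged flow earns `quantum` bytes
    of credit per round and sends head-of-line packets while its
    deficit covers them. `flows` is a list of packet-size lists;
    returns the service order as (flow index, packet size) pairs."""
    queues = [deque(f) for f in flows]
    deficit = [0] * len(queues)
    order = []
    while any(queues):
        for i, q in enumerate(queues):
            if not q:
                deficit[i] = 0      # idle flows don't bank credit
                continue
            deficit[i] += quantum
            while q and q[0] <= deficit[i]:
                pkt = q.popleft()
                deficit[i] -= pkt
                order.append((i, pkt))
    return order
```

Unlike WFQ there is no sorting: each packet costs O(1) work, which is exactly why DRR removes the O(log N) scheduling term mentioned above.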


167

But...
WFQ is still very hard to implement
classification is a problem
needs to maintain too much state information
doesn't scale well


168

Strict Priorities and Diff Serv


Classify flows into priority classes
maintain only per-class queues
perform FIFO within each class
avoid the curse of dimensionality
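A per-class strict-priority scheduler is little more than an array of FIFOs. A minimal sketch (illustrative class numbering, 0 = highest priority):

```python
from collections import deque

class StrictPriority:
    """Per-class FIFO queues served in strict priority order:
    a lower-priority class is served only when every
    higher-priority class is empty."""

    def __init__(self, n_classes):
        self.queues = [deque() for _ in range(n_classes)]

    def enqueue(self, cls, pkt):
        self.queues[cls].append(pkt)

    def dequeue(self):
        for q in self.queues:      # scan from highest priority down
            if q:
                return q.popleft()
        return None                # idle
```

State is per class, not per flow, so the structure needs only a handful of queues regardless of how many flows traverse the router.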


169

Diff Serv
A framework for providing differentiated QoS
set Type of Service (ToS) bits in packet headers; this classifies packets into classes
routers maintain per-class queues
condition traffic at network edges to conform to class requirements

May still need queue management inside the network
170

References for O/p Scheduling


- A. Demers et al, Analysis and simulation of a fair queueing algorithm, ACM SIGCOMM 1989.
- A. Parekh, R. Gallager, A generalized processor sharing approach to flow control in integrated services networks: the single node case, IEEE/ACM Trans. on Networking, June 1993.
- A. Parekh, R. Gallager, A generalized processor sharing approach to flow control in integrated services networks: the multiple node case, IEEE/ACM Trans. on Networking, August 1993.
- M. Shreedhar, G. Varghese, Efficient fair queueing using deficit round robin, ACM SIGCOMM, 1995.
- K. Nichols, S. Blake (eds), Differentiated Services: Operational Model and Definitions, Internet Draft, 1998.


171

Active Queue Management


Problems with traditional queue management
tail drop

Active Queue Management


goals
an example
effectiveness


172

Tail Drop Queue Management


[Figure: lock-out - a single flow occupies the queue up to the max queue length]


173

Tail Drop Queue Management


Drop packets only when queue is full
long steady-state delay
global synchronization
bias against bursty traffic


174

Global Synchronization

[Figure: flows synchronize - the queue repeatedly fills to the max queue length and drains as all flows back off together]


175

Bias Against Bursty Traffic

[Figure: a burst arriving at a nearly full queue is dropped at the max queue length]


176

Alternative Queue Management Schemes


177

Active Queue Management


Goals
Solve lock-out and full-queue problems
no lock-out behavior
no global synchronization
no bias against bursty flows
low steady-state delay
lower packet dropping

Provide better QoS at a router


178

Active Queue Management


Problems with traditional queue management
tail drop

Active Queue Management


goals
an example
effectiveness


179

Random Early Detection (RED)


[Figure: RED queue - packets P1..Pk, instantaneous length q, average length qavg, thresholds minth and maxth]

if qavg < minth: admit every packet
else if qavg <= maxth: drop an incoming packet with probability p = (qavg - minth) / (maxth - minth)
else (qavg > maxth): drop every incoming packet
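The rule above, plus the usual exponentially weighted average for qavg, can be sketched directly. Thresholds and the weight here are illustrative; deployed RED also scales the linear region by a max_p < 1 and spaces drops out, which this sketch omits.

```python
import random

def update_qavg(qavg, q, w=0.002):
    """EWMA of the instantaneous queue length q."""
    return (1 - w) * qavg + w * q

def red_drop(qavg, minth=5, maxth=15):
    """RED drop decision as on the slide: admit below minth, drop with
    probability rising linearly from 0 to 1 between minth and maxth,
    drop every packet above maxth."""
    if qavg < minth:
        return False
    if qavg <= maxth:
        p = (qavg - minth) / (maxth - minth)
        return random.random() < p
    return True
```

Because the decision uses the average qavg rather than the instantaneous q, short bursts pass through untouched while sustained congestion raises the drop probability smoothly.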


180

Effectiveness of RED: Lock-Out


Packets are randomly dropped Each flow has the same probability of being discarded


181

Effectiveness of RED: Full-Queue


Drop packets probabilistically in anticipation of congestion (not when the queue is full)
Use qavg to decide the packet dropping probability: allow instantaneous bursts

Randomness avoids global synchronization


182

What QoS does RED Provide?


Lower buffer delay: good interactive service
Given responsive flows: packet dropping is reduced
Given responsive flows: fair bandwidth allocation
early congestion indication allows traffic to throttle back before congestion
qavg is controlled to be small


183

Unresponsive or aggressive flows


Don't properly back off during congestion
Take away bandwidth from TCP-compatible flows
Monopolize buffer space


184

Control Unresponsive Flows


Some active queue management schemes
RED with penalty box
Flow RED (FRED)
Stabilized RED (SRED)

identify and penalize unresponsive flows with a bit of extra work


185

Active Queue Management


References
- B. Braden et al, Recommendations on queue management and congestion avoidance in the Internet, RFC 2309, 1998.
- S. Floyd, V. Jacobson, Random early detection gateways for congestion avoidance, IEEE/ACM Trans. on Networking, 1(4), Aug. 1993.
- D. Lin, R. Morris, Dynamics of random early detection, ACM SIGCOMM, 1997.
- T. Ott et al, SRED: Stabilized RED, INFOCOM 1999.
- S. Floyd, K. Fall, Router mechanisms to support end-to-end congestion control, LBL technical report, 1997.


186

Tutorial Outline
Introduction:
What is a Packet Switch?

Packet Lookup and Classification:


Where does a packet go next?

Switching Fabrics:
How does the packet get there?

Output Scheduling:
When should the packet leave?


187

Basic Architectural Components

Admission Control

Congestion Control

Routing Switching

Reservation

Control

Policing

Output Scheduling

Datapath:
per-packet processing


188

Basic Architectural Components


1.

Datapath: per-packet processing


2. Interconnect

Forwarding Table

3. Output Scheduling

Forwarding Decision
Forwarding Table

Forwarding Decision
Forwarding Table

Forwarding Decision
189
