100GE & traffic loss !!!
[Figure: a "wirespeed" router with 100GE and 10GE ports dropping traffic at <1 Gbps offered load]
[Diagram: traffic and "other traffic" crossing the ingress PFE and egress PFE]
The mux
• Single high-speed Out (Tx) interface (1 x Tx)
• Many to One
• No congestion risk
  – N x Rx rate <= 1 x Tx rate → no need for buffers
The de-mux
• Simple model
  – Single high-speed In (Rx) port (1 x Rx) and
  – Multiple lower-speed Out (Tx) interfaces (N x Tx)
• One to Many → congestion risk
  – Traffic/datagrams may appear at any time
  – Not aware of egress port state
VOQ
• Variant of Input Queuing – "dedicated Input queue for each output port"
  – Extension to IQ
  – Ingress PFE can send to multiple egresses simultaneously
• No need for over-speed
• No HoL blocking
• Flow-Control and scheduling via Grants (GR)
[Diagram: each Rx port (P1…Pn) keeps one VOQ per egress port (P1 VOQ … Pn VOQ); grants (GR) pace each VOQ toward its Port P1…Pn Tx]
The router
[Diagram: Line Card (ingress part) – Switch Fabric – Line Card (egress part); each line card: buffer memory, ingress/egress PFE (ASIC/NPU), external ports P1…Pn Rx/Tx, fabric ports, fabric scheduler]
• Overload possible: multiple Rx ports send to a single Tx port
  – via the congested port's output queue
  – or dropped if the queue is full
• Overload possible: router ingress hi-speed fabric port → low-speed interface
• Fabric queuing and flow-control are independent from egress port queuing and scheduling.
• Router behavior – residency time (latency), jitter, drop rate – depends on both.
End-to-End VOQ system
[Diagram: router ingress interfaces P1…Pn Rx → ingress PFE buffer memory → fabric ports → switch fabric → egress PFE buffer memory → egress interfaces P1…Pn Tx]
• For queuing purposes, the switch fabric and all de-mux stages are seen as a single switch
  – N inputs (Rx) → M outputs (Tx)
  – N x Rx speed == M x Tx speed
  – M >> N
Buffer once – latency
[Diagram: per-egress-interface VOQs (P1 IF0 VOQ … Pn IFn VOQ) on each Rx port; Request/Grant (RQ/GR) signaling toward Port P1…Pn Tx and egress interfaces IF0…IFn Tx; potential congestion points marked]
• Queues == Buffer == accommodate a burst by delaying it
• Burst absorption capability does NOT depend on the type of burst (regardless of the point at which congestion appears)
  – Max: VOQ size
  – Min: VOQ size
• Max latency: Σ (VOQ size)
• Wait, there will be an example
Buffer once – bandwidth
[Diagram: ingress PFE (ASIC/NPU) with per-egress-interface VOQs (P1 IF0 VOQ … Pn IFn VOQ), buffer memory and fabric ports; switch fabric with fabric scheduler; egress PFE with interfaces IF0…IFn]

Fabric Scheduler and flow-control
• Fabric queuing – VOQ (per egress interface)
• Fabric scheduler and flow-control decide:
  – Which ingress [PFE, VOQ] gets the GRANT next for a given egress (e.g. fair-share)
  – Monitor the egress interface and stop giving GR to the queue on the fabric ingress if the egress interface is not free
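To make the GRANT logic concrete, here is a toy sketch – an illustration of the idea, not any vendor's scheduler. Round-robin stands in for the fair-share policy, and the `egress_free` callback stands in for the egress-interface monitor that withholds grants; both names are assumptions of this sketch.

```python
from collections import deque

def grant_loop(requesters, egress_free, steps):
    """Issue GRANTs round-robin across ingress [PFE, VOQ] requesters,
    withholding the grant whenever the egress interface is not free."""
    grants = []
    for _ in range(steps):
        if not egress_free() or not requesters:
            continue                 # flow-control: no GR this cycle
        voq = requesters.popleft()   # fair-share: next requester in line
        grants.append(voq)
        requesters.append(voq)       # still backlogged -> re-queue
    return grants

# Three ingress PFEs all requesting the same egress interface IF1:
print(grant_loop(deque(["PFE1/IF1", "PFE2/IF1", "PFE3/IF1"]),
                 egress_free=lambda: True, steps=6))
# ['PFE1/IF1', 'PFE2/IF1', 'PFE3/IF1', 'PFE1/IF1', 'PFE2/IF1', 'PFE3/IF1']
```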
Un-intuitive drop behavior of router "X"
[Diagram: F1 150G, F2 50G, F3 200G, F4 100G, F5 200G; F1–F5 (700G total) → 100GE IF1; F6 10G → 100GE IF2]
• All traffic is BE

Flow   Offered   Expected Gbps (loss)?   Observed Gbps (loss)
F3     200G      29 (86%)                22 (88%)
F4     100G      14 (86%)                11 (88%)
F5     200G      29 (86%)                33 (83%)
F6      10G      10 (0%)                 4.8 (52%)

• Why losses in F6?
• Why unequal losses in F1–F6?
Un-intuitive drop behavior of router "Y"
[Diagram: same flows – F1 150G, F2 50G, F3 200G, F4 100G, F5 200G (700G total) → 100GE IF1; F6 10G → 100GE IF2; router "Y" observes a different loss pattern, derived on the following slides]
Router X – CIOQ – behavior
[Diagram: PFE1 (F1 150G, F2 50G), PFE2 (F3 200G, F4 100G), PFE3 (F5 200G, F6 10G) → switch fabric → PFE4 with IF1 (100GE, F1–F5) and IF2 (100GE, F6)]
• Fabric port → switch fabric: 300Gbps
• Fair-share fabric scheduler
• Flows F1 & F2 share the same buffer (queue) on PFE1. Same for F3 & F4 @ PFE2 and for F5 & F6 @ PFE3.
• There are 3 ingress PFEs that want to talk to PFE4
  – The fair-share scheduler grants 100Gbps of the 300Gbps fabric port to each ingress PFE

        Offered     After fabric        Loss
F1 F2   150   50    75    25  } 100G    50%  50%
F3 F4   200  100    67    33  } 100G    67%  67%
F5 F6   200   10    95.2 4.8  } 100G    52%  52%
Router X – CIOQ – behavior
• Flows F1–F5 (295Gbps) are queued in the OQ of IF1 and Tx @ 100Gbps (66% loss).
• Flow F6 (4.8Gbps) is queued in the OQ of IF2, which Tx @ 100Gbps (0% loss at egress).

        Offered     After fabric        Loss       On egress IF   Total loss
F1 F2   150   50    75    25  } 100G    50%  50%   25    8        82%  82%
F3 F4   200  100    67    33  } 100G    67%  67%   22   11        88%  88%
F5 F6   200   10    95.2 4.8  } 100G    52%  52%   33  4.8        83%  52%
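The table above can be reproduced with a few lines of arithmetic. The sketch below is a simplification under stated assumptions – strictly proportional FIFO sharing inside a queue and an even fair-share split of the 300Gbps fabric port – not a model of any particular ASIC:

```python
def proportional_share(offered, capacity):
    """Scale each flow down to fit capacity, in proportion to offered load."""
    total = sum(offered.values())
    if total <= capacity:
        return dict(offered)
    return {f: r * capacity / total for f, r in offered.items()}

pfes = [{"F1": 150, "F2": 50}, {"F3": 200, "F4": 100}, {"F5": 200, "F6": 10}]
fabric_grant = 300 / len(pfes)   # fair share of the 300G fabric port: 100G each

after_fabric = {}
for queue in pfes:               # CIOQ: one shared queue per ingress PFE
    after_fabric.update(proportional_share(queue, fabric_grant))

egress = {"IF1": ["F1", "F2", "F3", "F4", "F5"], "IF2": ["F6"]}
for interface, flows in egress.items():   # each egress interface drains 100G
    arrived = {f: after_fabric[f] for f in flows}
    for f, rate in sorted(proportional_share(arrived, 100).items()):
        print(f"{interface} {f}: {rate:5.1f} Gbps out")
# F6 exits at ~4.8 Gbps (52% loss) although IF2 is idle: it starved behind
# F5 in the shared PFE3 queue -- the "un-intuitive" drop of router X.
```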
CLI example – PFE 3
[CLI output omitted]
Router Y – VOQ – behavior
[Diagram: PFE1 (F1, F2), PFE2 (F3, F4), PFE3 (F5, F6) → switch fabric → PFE4 with IF1 (100GE) and IF2 (100GE)]
• Flows F1 & F2 share the same VOQ buffer (queue) on PFE1. Same for F3 & F4 @ PFE2 and for F5 @ PFE3.
• Flow F6 has a separate buffer (the IF2 VOQ) on PFE3.
• Fabric scheduling gives 33Gbps to each ingress PFE for the egress interface IF1 VOQ.

        VOQ offered   From fabric @ PFE4    Loss
F1 F2   150   50      25   8  } 33G         83%  83%
F3 F4   200  100      22  11  } 33G         89%  89%
F6       10           10      } 100G         0%
Router Y – VOQ – behavior
• Only 1 ingress PFE wants to talk to egress interface IF2 (100GE)
  – Fabric scheduling gives the full 100Gbps to that one ingress PFE (PFE3) for the egress interface IF2 VOQ → F6 Tx @ 10Gbps.

        VOQ offered   From fabric @ PFE4    Loss       On egress IF   Loss
F1 F2   150   50      25   8  } 33G         83%  83%   25    8        83%  83%
F3 F4   200  100      22  11  } 33G         89%  89%   22   11        89%  89%
F5      200           33      } 33G         83%        33             83%
F6       10           10      } 100G         0%        10              0%
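For contrast, a sketch of the same traffic through the VOQ model – again an illustrative simplification, not vendor code. Grants are split per egress interface among only the PFEs that request it, so F6's VOQ for IF2 receives the full 100Gbps grant:

```python
def voq_rates(demands, if_capacity):
    """demands: {egress_if: {pfe: {flow: offered_gbps}}} -> per-flow output."""
    out = {}
    for eif, pfe_queues in demands.items():
        grant = if_capacity[eif] / len(pfe_queues)   # fair share per requesting PFE
        for queue in pfe_queues.values():            # one VOQ per [PFE, egress IF]
            total = sum(queue.values())
            scale = min(1.0, grant / total)          # FIFO inside a single VOQ
            out.update({f: r * scale for f, r in queue.items()})
    return out

demands = {
    "IF1": {"PFE1": {"F1": 150, "F2": 50},
            "PFE2": {"F3": 200, "F4": 100},
            "PFE3": {"F5": 200}},
    "IF2": {"PFE3": {"F6": 10}},   # only PFE3 requests IF2 -> full 100G grant
}
for flow, rate in sorted(voq_rates(demands, {"IF1": 100, "IF2": 100}).items()):
    print(f"{flow}: {rate:5.1f} Gbps out")
# IF1 requesters each get 33.3G (F1 25, F2 8.3, F3 22.2, F4 11.1, F5 33.3);
# F6 rides its own IF2 VOQ untouched at 10G -- 0% loss, as in the table.
```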
CLI example – ingress PFE 3, VOQ of the egress IF (200Gbps toward the egress interface, 100GE)
[CLI output omitted]

Mitigation: load PFEs fairly
[Diagram: F4 split as F4' 33G @ PFE1, F4'' 33G @ PFE2, F4''' 33G @ PFE3; F2 30G, F3 200G, F6 1G]
• Have to live with it as it is.
• Other goodies behind – blast radius.
ROUTER BEHAVIOR CASE STUDY – LATENCY
Intuitive latency
[Diagram: F1 200G, F2 100G, F3 100G (400G total) → PFE1/PFE2 → switch fabric → PFE4 → 100GE]
• 400Gbps --> 100Gbps
• Buffer 100% full -> 100ms latency
[Table: latency (ms) – Expected (loss)? vs. Observed 1 (router Y) vs. Observed 2 (router X); values derived on the following slides]
• Router X -> CIOQ: 100ms in the OQ after the fabric
• Router Y -> VOQ: 100ms in the VoQ before the fabric
• Fabric port -> 300Gbps
Router X – CIOQ
[Diagram: PFE1 (F1, F2), PFE2 (F3) → switch fabric → PFE4, IF1 (100GE)]
• Flows F1 & F2 share the same VoQ buffer (queue) on PFE1. Flow F3 is alone on PFE2.
• There are 2 ingress PFEs that want to talk to PFE4
  – Fabric scheduling guarantees 150Gbps to each ingress PFE
• Flows F1–F2 (300Gbps) are queued in the VOQ on PFE1 and Tx @ 200Gbps (150Gbps + leftover). Latency is 100ms.
• Flow F3 (100Gbps) is queued in the VOQ on PFE2 and Tx @ 100Gbps → no buffering.
• Flows F1–F3 (300Gbps) are queued in the OQ of IF1 and Tx @ 100Gbps. Egress interface latency is 100ms.

Flow   Offered (Gbps)   Fabric component (ms)   Egress interface (ms)   Total (ms)
F1     200              100 (33% loss)          100                     200
F2     100              100 (33% loss)          100                     200
F3     100              0 (no buffering)        100                     100
Router Y – VOQ
[Diagram: PFE1 (F1, F2), PFE2 (F3) → switch fabric → PFE4, IF1 (100GE)]
• Flows F1 & F2 share the same VoQ buffer (queue) on PFE1. Flow F3 is alone on PFE2.
• There are 2 ingress PFEs that want to talk to PFE4
  – Fabric scheduling guarantees 50Gbps to each ingress PFE
• Flows F1–F2 (300Gbps) are queued in the VOQ of IF1 on PFE1 and Tx @ 50Gbps. Latency is 100ms.
• Flow F3 (100Gbps) is queued in the VOQ of IF1 on PFE2 and Tx @ 50Gbps. Latency is 100ms.
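A worked check of the two observations above, assuming each buffering stage is sized to 100ms of its own drain rate (an assumption made for illustration): once a stage runs full, it contributes size/drain = 100ms regardless of the drain rate, so router X's two full stages add up while router Y's single stage does not.

```python
def stage_latency_ms(buffer_ms, drain_gbps):
    """Residency of a full buffer sized as drain_rate * buffer_ms."""
    size_gbit = drain_gbps * (buffer_ms / 1000.0)
    return 1000.0 * size_gbit / drain_gbps   # always == buffer_ms when full

# Router X (CIOQ): fabric VOQ on PFE1 (drains 200G) + egress OQ (drains 100G)
x_latency = stage_latency_ms(100, 200) + stage_latency_ms(100, 100)
# Router Y (VOQ): buffered once, in the IF1 VOQ before the fabric (50G grant)
y_latency = stage_latency_ms(100, 50)
print(x_latency, y_latency)   # 200.0 100.0 -- buffered twice vs. buffered once
```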
Homework – router "W"
[Diagram: seven 100G flows F1–F7 across PFE1–PFE3 → switch fabric → PFE4 → 100GE; F2, F5, F7 grouped together]

Flow   Gbps   Loss %
F1      33    67%
F2      50    50%
F3     100     0%
F4      33    67%
F6      33    67%
F7      25    75%
Homework
• What queuing architecture does router W use?
• Explain the behavior.
• Answers: rafal@juniper.net
  – Deadline – 11pm today.
Summary
• Know your router anatomy.
• System queuing architecture impacts power consumption and system scaling capabilities.
• System queuing architecture impacts residency time.
• System queuing architecture may be a reason for non-intuitive traffic loss patterns.
  – (Re)Assign ports to roles in a smart way.
  – Trying to solve a non-existing problem costs time and the headache of writing an incident report – avoid it.
BACKUP SLIDES
From building blocks to centralized router
[Diagram: Ports P1…Pn Rx → per-port output queues (P1 OQ … Pn OQ) → Ports P1…Pn Tx; N x Rx in, N x Tx out; N x Port Speed per port, N x N x Port Speed aggregate into buffers]
• Single switching element
• No Switch Fabric
• N x N interfaces
• (N x Port Speed) fan-in to each buffer
  – If a switch port is 600Gbps and we have 100 ports → 60T of raw bandwidth into buffers!
• 100% efficiency
• Good on paper only
  – extremely expensive
  – bound by technology
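A quick check of the fan-in claim, using the slide's numbers: every output buffer must be able to absorb all N ports sending to it at once.

```python
ports, port_gbps = 100, 600
print(ports * port_gbps / 1000, "Tbps into a single output buffer")  # 60.0 Tbps
```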
Buffer once – per egress interface VOQ
Theory vs. practice
[Timing diagram: Data in → RQ across the fabric → GR from the egress PFE → DATA, while the egress interface is busy]
• Small buffer after the fabric (egress PFE) to compensate the RQ/GR delay.
  – Need just ~10 usec.
• Similar to CIOQ but:
  – RQ/GR is asynchronous – an RQ can be sent (and a GR given) while the egress interface handles other data. Statistically a minor problem.
  – RQ/GR scheduler latency can't be compensated: if the shallow buffer is < 2x that latency → inefficient egress IF utilization.
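Rough sizing of that shallow egress buffer, assuming a 100GE interface and the slide's ~10 usec RQ/GR delay (the 2x factor comes from the slide; the exact byte count is illustrative):

```python
rate_gbps = 100          # egress interface speed (assumed 100GE)
rq_gr_delay_us = 10      # ~10 usec RQ/GR delay from the slide
# The buffer must cover 2x the scheduler latency to keep the egress IF busy:
buffer_bytes = rate_gbps * 1e9 * (2 * rq_gr_delay_us * 1e-6) / 8
print(f"{buffer_bytes / 1024:.0f} KiB")   # ~244 KiB at 100Gbps
```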
Impact on PFE design
PFE complexity
• A PFE for a per-egress-IF VOQ system != a PFE for a CIOQ system
• For a system of 4k IF and 8 QoS classes:
  – a VOQ PFE needs to support 32,000 queues
  – a CIOQ PFE needs 1,040 queues
• If a PFE supports 400k queues:
  – a VOQ system can support 50k IF
  – a CIOQ system can support millions of IF (from the queue-scaling perspective only; other limits apply)
• A CIOQ PFE typically has fewer queues but much more of other functionality.
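The queue counts check out with simple multiplication. The sketch only verifies the VOQ side; the 1,040-queue CIOQ figure is taken as given from the slide:

```python
classes = 8
interfaces = 4_000
print(interfaces * classes)      # 32,000 queues for a per-egress-IF VOQ PFE

queue_budget = 400_000           # if the PFE supports 400k queues...
print(queue_budget // classes)   # ...a VOQ system tops out at 50,000 IF
```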
PFE performance
• To handle 4 x 100GE interfaces, a PFE needs:
  – CIOQ: more than 3.6Tbps of PFE I/O.
    • Every packet goes to and from memory twice – 4 memory accesses.
  – VOQ: more than 2.4Tbps (25% less) of PFE I/O.
    • Every packet goes to and from memory once – 2 memory accesses.
  – Memory I/O BW needs to be oversized → an even better reduction (~30%).
[Diagram: PFE with external IF I/O of 400 + 400 Gbps and DRAM; memory I/O of 1600 Gbps (CIOQ) vs. 800 Gbps (VOQ)]
• Less I/O == saved gates
  – have more VOQs (a bigger system), OR
  – lower cost and power, OR
  – higher performance, OR ...
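The memory-I/O figures follow from counting accesses per packet; a sketch (only the 1600/800 Gbps memory numbers appear in the figure – how external IF and fabric I/O combine into the >3.6T/>2.4T totals is an assumption here):

```python
line_rate_gbps = 4 * 100            # 4 x 100GE, per direction

def memory_io(accesses_per_packet):
    """Raw memory bandwidth: every bit crosses the memory bus per access."""
    return line_rate_gbps * accesses_per_packet

print(memory_io(4))  # CIOQ: write+read on both ingress and egress -> 1600 Gbps
print(memory_io(2))  # VOQ: buffered once -> 800 Gbps (half the memory I/O)
```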
CoS vs. Queuing
[Diagram: Queueing vs. QoS at an ingress PFE (4 classes); without CoS, one queue per egress; with CoS, queues Class 0 … Class n per egress PFE; applies to both IQ and VoQ]
• Queuing as discussed so far
  – Assumes all traffic is the same class
• CoS
  – Scheduler(s) take 2 things into account:
    • class of traffic
    • source instance (e.g. ingress PFE)
  – What was 1 queue becomes a set of parallel queues.
  – A classifier is needed to put traffic into one queue or another (out of scope).
CIOQ – Fabric
[Diagram: SWITCH FABRIC; per ingress PFE (PFE1, PFE2, …): a Hi-Priority Queue and a Low-priority Queue feeding a root scheduler]
• Fabric VOQ
  – Flow-Control
    • per fabric egress
    • by Request–Grant protocols
    • does not depend on egress interface and OQ state
  – Classification: 2 classes, Hi-/Low-priority
  – Scheduling
    • ingress PFE fairness first
    • CIR-bound priority scheduling
  – Delay Bandwidth Buffer: ______
• Scheduler logic sits on the egress side of the fabric.
CIOQ – Egress Interface
[Diagram: ingress PFE1…PFEn Hi-/Low-priority queues at a root; egress side: Queue 0 … Queue 7 per logical interface, classification, logical interfaces and logical interface sets]
• Scheduling
  – 5 priority levels, 4 RED profiles per queue. Configurable.
  – 2 or 4 scheduling hierarchy levels w/ priority propagation
• Delay Bandwidth Buffer: ______
VOQ – Fabric and Egress
[Diagram: SWITCH FABRIC; per egress logical interface: VOQ_Class_0 … VOQ_Class_7]
• Flow-Control: per egress (logical) interface state
• Queues
  – 8 per egress (logical) interface
  – 390,000 per system (HW limit)
• Scheduling
  – 4 priority levels, 4 RED profiles per queue. Configurable.