
The trouble with I

v3

Dan Rautio

Copyright 2006 Juniper Networks, Inc.

Proprietary and Confidential

www.juniper.net

The trouble with I


V1: initial version, 2/12/08
V2: added case studies 8 and 9, 3/5/08
V3: added case study 10, 12/5/08


Overview
Type of problems
Symptoms (very important)
Which block of the ASIC?
Case study of past problems
Deep Dive for each symptom

Ichip Performance
Ichip enhancements


Type of problems
Transit traffic loss
PFE-related problem: check out the ASIC
RE-generated traffic is not affected
It is normal to see transit traffic affected while RE-generated packets show no problem:
RE-generated packets have the DT bit set to 1.
The RE doesn't need a next-hop lookup (Ir), since it already knows which interface to send the packet out of.
The RE also builds the L2/L3 headers itself, so Iwo doesn't need to touch the packet either.


Ichip packet flow


Symptoms (very important!)

Typical symptoms
1. The interface flaps and the far-end router receives garbage, or the far-end DPC is reset, which causes a long flow control (FC) assertion to be sent to the other side.
   10GE interface getting flow controlled for more than 200 msec: PR/231419, PR/250350, PR/103298, PR/104884, PR/103597, PR/103712
2. After a "restart routing", all multicast traffic stops.
   There were many similar cases in the past with Gimlet, but there they only caused illegal nh size and SRAM parity errors. The Ichip will wedge.
   Incorrect Iwo_key (Lout_key) pointing to RLDRAM (SRAM): PR/240012, PR/258760



3. A specific type of traffic (MPLS->IPv4) stops working.
   Incorrect packet length calculation for MPLS->IPv4 nexthop traffic: PR/251042
4. A specific type of traffic (IPv6) stops working.
   Incorrect packet length calculation for IPv6 nexthop traffic: PR/105266
5. Large tunnel traffic stops working.
   Tunnel ingress with fragmentation: PR/237450
6. Complete packet loss between two PFEs.
   I3.0 Ichip fabric output queue: PR/268274
7. The DPC repeatedly crashes with JUNOS 8.5.
   IA FPGA DMA corruption: PR/269699



8. The interface stops forwarding packets, or IP CRC packet errors are reported in syslog.
   PR/277853, PR/27741
9. BIST memory error on Ichip RLDRAM.
   PR/255204
10. Troubleshooting: mapping LINK (SF fabric port) error messages to the proper FRU.
   PR/407207


Which ASIC block?


1. 10GE interface getting flow controlled for more than 200 msec: PR/231419, PR/250350, PR/103298, PR/104884, PR/103597, PR/103712
   Ipktrd (packet reader): aged cells not detected
2. Incorrect Iwo_key (Lout_key) pointing to RLDRAM (SRAM): PR/240012, PR/258760
   Ir sends a bogus Lout_key. Iwo receives the bad Lout_key and sends feedback to Imq/Ipktrd, which is then bogus as well. Ipktrd gets into a bad state.
3. Incorrect packet length calculation for MPLS->IPv4 nexthop traffic: PR/251042
   Iwo data buffer and Iwo SPI microcode packet length calculation



4. Incorrect packet length calculation for IPv6 nexthop traffic: PR/105266
   Iwo microcode packet length calculation incorrect
5. Tunnel ingress with fragmentation: PR/237450
   Iwo microcode packet length calculation incorrect
6. I3.0 Ifo queue buffer: PR/268274
   Ipktrd fab icell buffer allocation incorrect
7. IA FPGA DMA corruption: PR/269699
   Ichip DMA code is not protected from interrupts, which causes DMA corruption.



8. Traffic sent to queues that are not configured: PR/277853, PR/27741
   Iwo wedge or Iwo CRC errors
9. RLDRAM BIST failure: the memory is not initialized correctly and can end up with parity errors: PR/255204
   BIST memory error on Ichip RLDRAM
10. Troubleshooting: mapping LINK (SF fabric port) error messages to the proper FRU
   PR/407207


Case Study 1: 10GE interface getting flow controlled for more than 200 msec

PR/231419, PR/250350, PR/103298, PR/104884, PR/103597, PR/103712
A reboot of the far-end router asserts FC to the DUT: PR/103298
A cFPC 10GE PIC interface flap sends garbage to the other end: PR/250350
Flow control for a few seconds: PR/104884
Some of the time, you get messages like these:
Apr 26 13:46:26 tomahawk fpc0 ICHIP(3): New crc errors in WO IP stream_id 0, iwo_ip_poll_stream_stats

These events are mostly silent. On the remote interface, the adjacency would just time out.


If OSPF is enabled, the interface that is getting wedged moves into the OSPF 1-way state ("neighbor is in one-way mode"). If IS-IS is enabled, the wedged interface moves into the IS-IS "init" state due to "Not Seenself".
With BGP, you will likely see "NOTIFICATION 6" from the remote BGP peer, which resets the session because the keepalive timer expired.
The following Ichip outputs show that the pktrd is wedged:
This state says that the PRQ and ICB buffers are not empty and the packet read is not done, but WO is not receiving any cells.

RFEB0(reno-re0 vty)# sh ichip 0 ipktrd qstatus
WAN Queue Status
WAN_PRQ_MPTY - 0xfffffffe
WAN_ICB_MPTY - 0xfffffffe
WAN_PRD_DONE - 0xfffffffe
FABRIC Queue Status
FAB_PRQ_MPTY[0] - 0xffffffff
FAB_ICB_MPTY[0] - 0xffffffff
FAB_PRD_DONE[0] - 0xffffffff
RFEB0(reno-re0 vty)# show ichip 0 registers pktrd wan
...
(0xf0830400) pktrd.wan_prq_mpty:0xfffffffe
(0xf0832400) pktrd.wan_icb_mpty:0xfffffffe
(0xf0833500) pktrd.wan_prd_done:0xfffffffe
(0xf0834300) pktrd.wan_dbf_org[0]:0x80000000
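
The status words above can be read as per-stream bitmaps. The following is a minimal sketch, assuming that bit N of each word corresponds to WAN stream N and that a set bit means "empty" or "done". That reading is inferred from the register names and from the stream numbers used elsewhere in this deck, not from chip documentation, and the helper name `clear_bits` is hypothetical.

```python
def clear_bits(mask: int, width: int = 32) -> list[int]:
    """Return the bit positions that are 0 in a status word."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


# Values copied from the "ipktrd qstatus" capture on this slide.
wan_status = {
    "WAN_PRQ_MPTY": 0xFFFFFFFE,
    "WAN_ICB_MPTY": 0xFFFFFFFE,
    "WAN_PRD_DONE": 0xFFFFFFFE,
}

# Streams whose bit is clear in each word (assumed: not empty / not done).
suspect = {name: clear_bits(value) for name, value in wan_status.items()}

# A stream that is flagged in all three words matches the wedge signature.
wedged = set.intersection(*(set(bits) for bits in suspect.values()))

print(suspect)
print(sorted(wedged))
```

Under these assumptions, the only stream flagged in all three words is stream 0, which is the stream examined in the next capture.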


RFEB0(reno-re0 vty)# show ichip 0 wo statistics ip wan_stream 0

Iwo Input Processor Statistics:
Counter Name            Total            Rate           Peak Rate
---------------------- ---------------- -------------- -------------
Stream(0):
Input packets           322996428        75722
output packets          322996212        75722
ssmcst packets
fragmented packets
input drops             215
output drops            62736            146
> 2 cell crc drops      5357             27
<= 2 cell crc drops     4315             10


The above symptoms uncovered a hardware bug in the Ichip.
The Ichip issue is caused by aged icells that are not detected as aged. If the packet memory write pointer wraps 4 times on an Ichip with 512MB*3 of packet memory, then an icell read request that was in the pipeline before the flow control started may not be detected as aged when the flow control is released.
The workaround for PR 231419 is to flush aged icells before they can become undetected aged icells. There are 3 "virtual address" bits in the packet memory address (2 bits with 512MB DIMMs). An aged icell can become an undetected aged icell if these 3 address bits are the same during the read as the 3 bits currently being written by the packet writer. That is, the packet memory has been overwritten a multiple of 8 times.
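
The aliasing condition described above can be illustrated with a small model. This is only a sketch: it assumes the hardware effectively compares the low "virtual address" bits of a wrap counter, and the function name `detected_as_aged` is hypothetical.

```python
def detected_as_aged(wraps_since_write: int, virtual_bits: int = 3) -> bool:
    """Return True if a stale icell would still be recognised as aged.

    Assumption: only the low `virtual_bits` bits of the wrap count are
    compared, so the check aliases every 2**virtual_bits wraps.
    """
    return wraps_since_write % (1 << virtual_bits) != 0


# Wrap counts (1..16) at which an aged icell slips through undetected.
undetected_3bit = [w for w in range(1, 17) if not detected_as_aged(w, 3)]
undetected_2bit = [w for w in range(1, 17) if not detected_as_aged(w, 2)]

print(undetected_3bit)  # [8, 16]
print(undetected_2bit)  # [4, 8, 12, 16]
```

With 3 bits the first undetected case is the 8th overwrite; with 2 bits (512MB DIMMs) it is the 4th, which matches the "wraps 4 times" case mentioned on this slide.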


The workaround is to sample traffic flow: if traffic is blocked on a stream and the elapsed time is getting near the undetected-aged-icell time, flush the stream. Traffic flow is detected using the flow state register in the WO_SPI block. If these bits say that no traffic has flowed during the sample period while traffic was pending, then traffic is blocked.
The delay bandwidth buffer is at least 200 msec, so overwriting it 8 times requires at least 1.6 seconds of real time. Sampling must be at least twice as fast as this to avoid sample aliasing, so a flow sample period of 0.5 seconds is used to detect blocked traffic.
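
The timing argument and the sampling loop can be sketched as follows. This is an illustration under stated assumptions, not the actual microkernel code: the `StreamSample` fields stand in for the WO_SPI flow state bits and the pending-traffic check, and the policy of flushing when one more sample period would reach the aliasing time is an assumption about what "getting near" means.

```python
from dataclasses import dataclass

BUFFER_DEPTH_S = 0.2      # delay bandwidth buffer: at least 200 msec
WRAPS_TO_ALIAS = 8        # 2**3 virtual address bits
SAMPLE_PERIOD_S = 0.5     # flow sample period from the slide
ALIAS_TIME_S = BUFFER_DEPTH_S * WRAPS_TO_ALIAS  # about 1.6 s


@dataclass
class StreamSample:
    flowed: bool   # hypothetical: a flow state bit was set during the period
    pending: bool  # hypothetical: traffic was queued for the stream


def should_flush(samples, period=SAMPLE_PERIOD_S, limit=ALIAS_TIME_S):
    """Return True once a stream has been blocked so long that waiting one
    more sample period would reach the aliasing time (assumed policy)."""
    blocked_for = 0.0
    for sample in samples:
        if sample.pending and not sample.flowed:
            blocked_for += period
        else:
            blocked_for = 0.0  # traffic moved, or nothing was waiting
        if blocked_for + period >= limit:
            return True
    return False


blocked = StreamSample(flowed=False, pending=True)
flowing = StreamSample(flowed=True, pending=True)
print(should_flush([blocked, blocked]))
print(should_flush([blocked, blocked, blocked]))
print(should_flush([blocked, blocked, flowing, blocked, blocked]))
```

In this model the third consecutive blocked sample (1.5 seconds) triggers the flush, which leaves a margin before the 1.6 second aliasing point.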
You will see the following syslogs when auto recovery is successful:
[Nov 1 11:46:50.478 LOG: Info] ICHIP (0) one or more stream blocked detected: 0x00000001
[Nov 1 11:46:50.678 LOG: Err] imq_q_waiting_queue_empty: phy_q 0, telapsed:200ms, mu0:1491049, mu1:1491049, mu2:1491049, loop:42879
[Nov 1 11:46:50.678 LOG: Err] imq_q_disable_queue:failed, phy_q:0
[Nov 1 11:46:50.678 LOG: Err] ICHIP(0) imq_stream_disable_stream() failed to disable physical queue 0 in stream 32


Case Study 2: Incorrect Iwo_key (Lout_key) pointing to RLDRAM (SRAM)

PR/240012, PR/258760, PR/228361
After a restart routing, multicast traffic stops working: PR/240012
After a restart routing, VPLS traffic stops working: PR/258760
You receive syslogs like the one below:
Aug 24 10:08:07 Poseidon-RE0 fpc1 ICHIP(3): New illegal link errors in WO DESRD lout_key 0x000008bc stream_id 11, iwo_desrd_poll_stream_stats

Here are the errors on the Ichip with this problem:
ADPC0(r16 vty)# show ichip 3 wo statistics
Iwo Statistics:
Summary                 Total            Rate           Peak Rate
---------------------- ---------------- -------------- -------------
input packets           196629610        148810         149095
output packets          196629140        148810         149095
output bytes            20646461553      15625047       15654950
total drops             975              203

ADPC0(r16 vty)# show ichip 3 wo statistics desrd wan_stream 1

Iwo Descriptor Reader Statistics:
Counter Name            Total            Rate           Peak Rate
---------------------- ---------------- -------------- -------------
input packets           191808272        148810         149141
output packets          191808272        148810         149141
Stream(1):
id error drops          356              0              77
data error drops        0                0              0
oflow error drops       37

ADPC1(Poseidon-RE0 vty)# ... stream wan_stream 11 mu_mad

Stream 43 mas/mu/mad/hnq_ptr info:
queue   mas      mu       mad      hnq_ptr(0x)
------- -------- -------- -------- ---------------
0       24132    24026    000b75:000b71

It has been observed that if the descriptor reader encounters an incorrect next-hop database, WO can wedge (PR228361).
The cause is not understood, and no test bench has been created that can reliably reproduce this.
Normally, when the next-hop database is corrupted, "illegal link" errors are reported.
While these errors are being reported, WO has occasionally been observed to wedge.


For PR228361, what we think is occurring is that the memory pointed to by an Lout key is modified by new next-hop entries. The memory is released, but there are still packets in flight that point at this memory. The memory is then reallocated to a new next-hop and overwritten with new data. When the in-flight packet reaches WO, the alignment is different, so the old Lout key points at random data.
It may be helpful for debugging if memory is initialized with illegal descriptor tags (top 4 bits of zero). Also, memory allocation could be changed to allocate the oldest freed block first, probably hiding the bug forever.
Exactly what sequence of bad bits in the descriptors causes the wedge is unknown. But since there is no software control over the hardware state machine that sequences the descriptor reading, it is unlikely that any software workaround can be applied to prevent wedging.
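
The "allocate the oldest freed block first" idea can be sketched with a FIFO free list. This is a toy model, not the Junos next-hop allocator; the class name and block identifiers are hypothetical. The point is only that a block that was just released is reused last, which gives in-flight packets the longest possible time to drain before their Lout key points at new data.

```python
from collections import deque


class OldestFirstAllocator:
    """Toy free list that hands out the block that was freed longest ago."""

    def __init__(self, block_ids):
        # Left end of the deque holds the block that has been free the longest.
        self._free = deque(block_ids)

    def alloc(self):
        return self._free.popleft()

    def free(self, block_id):
        # A just-released block goes to the back of the queue.
        self._free.append(block_id)


allocator = OldestFirstAllocator(range(4))
released = allocator.alloc()   # block 0 is in use
allocator.free(released)       # block 0 released while packets may be in flight
reuse_order = [allocator.alloc() for _ in range(4)]
print(reuse_order)  # [1, 2, 3, 0]
```

A stack-like (most recently freed first) free list would hand block 0 straight back, which is the reuse pattern that exposes stale Lout keys.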


Case Study 3: Incorrect packet length calculation for MPLS->IPv4 nexthop traffic

PR/251042, PR/240148
On M120 and MX-series routers and M320 Enhanced III FPCs, when an MPLS-encapsulated IPv4 packet that is padded to meet the minimum Layer 2 frame size (for example, 64 bytes for frames on Ethernet media) exits an LSP, the egress interface might stop forwarding packets. This can happen when the router is configured as a PE router in a VPN or is the penultimate node of an LSP. To recover, reboot the FPC (on MX-series or M320 routers) or the FEB (on the M120 platform) that houses the affected interface.
Root cause: don't trust the plen from the IP header, since it is not sanity checked in the MPLS->IPv4 case. Use the notification plen to calculate the dbuf size when requesting data from Iwo l23 to Iwo IP.
An incorrect DBUFSZ computed by the WO microcode can cause a wedge (PR251042). This is a microcode programming requirement.
Output packets are assembled in the WO output block (wo_spi) by pulling the headers and the first part of the payload from the L23 engine and the remainder of the payload from DBUF.
Microcode computes the remainder size and sets DBUFSZ to the number of bytes to pull from DBUF.
If there is data in DBUF but microcode incorrectly sets DBUFSZ to zero, WO does not pull data from DBUF and WO can wedge.


The wedge can happen because DBUF can become full, so it stops sending grants to the packet reader.
WO drains its packets but does not drain DBUF, and waits for new packets from the packet reader, which never arrive.
Another possible way it can wedge, not confirmed by simulation, is that the incorrect DBUFSZ results in a temporarily incorrect "byte limit" sent back to the packet reader. The packet reader then waits for the bytes to drain, but they never do because they are stuck in DBUF.
The opposite error (a non-zero DBUFSZ where it should be zero) can corrupt packets for an indefinite time, but this has not been observed to cause a wedge. That is, if the packet's plen is two cells or less, the microcode *must* set DBUFSZ to zero.



If it does not, the next packet can also become corrupted. It is speculated, but not confirmed by simulation, that this type of error can also result in WO CRC errors.
If DBUFSZ should be non-zero but is computed to be larger or smaller than it is supposed to be, the error is detected by hardware, the current packet is sent out with an EOPE, and WO recovers. It is only when the zero/non-zero state is incorrect that WO has been observed to wedge.
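
The DBUFSZ rules above can be summarised in a simplified model. This is a sketch, not the WO microcode: it assumes 64-byte cells (inferred from "384 bytes (6 cells)" in Case Study 6), assumes that exactly the first two cells come from the L23 engine, and ignores header rewrites. The function names and the example lengths are hypothetical.

```python
CELL_BYTES = 64   # assumption: inferred from "384 bytes (6 cells)"
L23_CELLS = 2     # packets of two cells or less must get DBUFSZ = 0


def dbuf_size(plen: int) -> int:
    """Bytes expected in DBUF for a packet of `plen` bytes (simplified)."""
    return max(0, plen - L23_CELLS * CELL_BYTES)


def outcome(expected: int, programmed: int) -> str:
    """Classify a DBUFSZ programming error the way the slides describe it."""
    if programmed == expected:
        return "ok"
    if expected > 0 and programmed == 0:
        return "possible wedge"        # data left behind in DBUF
    if expected == 0 and programmed > 0:
        return "packet corruption"     # pulls data that is not there
    return "EOPE, WO recovers"         # wrong but non-zero: hardware detects it


# Hypothetical numbers: the notification says 130 bytes, the IP header claims 120.
notification_plen, ip_header_plen = 130, 120
print(outcome(dbuf_size(notification_plen), dbuf_size(ip_header_plen)))
```

In this example, trusting the IP header length flips DBUFSZ from non-zero to zero, which is the one error class that has been observed to wedge WO.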


Case Study 4: Incorrect packet length calculation for IPv6 nexthop traffic

PR/105266
An invalid length value in the IPv6 header may cause Iwo to wedge. So when calculating the DBUF size, use the adj_plen instead of the plen in the IPv6 header.
Don't trust the packet length field in the IP header; instead, use the hardware-calculated ADJ_PLEN to calculate the DBUF size.
ADPC2(mercator-re1 vty)# sh nh in ge-2/2/1
ID     Type      Interface      Next Hop Addr             Protocol    Encap         MTU
-----  --------  -------------  ------------------------  ----------  ------------  ----
534    Unicast   ge-2/2/1.0     2001:668:0:2::1:662       IPv6        Ethernet      9194
548    Unicast   ge-2/2/1.0     fe80::217:cb00:aa1:7ff0   IPv6        Ethernet      9194

Here the killer packet has the wrong payload length (second row, right before 3B, changed from the correct payload length of 06 to 00):
00 19 E2 B1 61 73 00 00 04 00 01 00 86 DD 60 30
00 00 00 00 3B FF FE 80 00 00 00 00 00 00 02 00
04 FF FE 00 01 00 20 01 06 68 00 00 00 02 00 00
00 00 00 01 06 62 FF FF 00 00 00 00 CB 8D C8 9B
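
The mismatch in the dump above can be checked by decoding the frame. This is a minimal sketch that assumes the dump is a complete Ethernet frame with a 14-byte header, a fixed 40-byte IPv6 header, and a trailing 4-byte FCS; the constant names are introduced here for illustration only.

```python
import ipaddress

FRAME_HEX = """
00 19 E2 B1 61 73 00 00 04 00 01 00 86 DD 60 30
00 00 00 00 3B FF FE 80 00 00 00 00 00 00 02 00
04 FF FE 00 01 00 20 01 06 68 00 00 00 02 00 00
00 00 00 01 06 62 FF FF 00 00 00 00 CB 8D C8 9B
"""

ETH_HDR, IPV6_HDR, FCS = 14, 40, 4  # FCS presence is an assumption

frame = bytes.fromhex("".join(FRAME_HEX.split()))

ethertype = int.from_bytes(frame[12:14], "big")
# IPv6 payload length is bytes 4-5 of the IPv6 header.
payload_len_field = int.from_bytes(frame[ETH_HDR + 4:ETH_HDR + 6], "big")
next_header = frame[ETH_HDR + 6]
dst = ipaddress.IPv6Address(frame[ETH_HDR + 24:ETH_HDR + 40])

# What is actually on the wire after the IPv6 header, excluding the FCS.
actual_payload = len(frame) - ETH_HDR - IPV6_HDR - FCS

print(hex(ethertype), hex(next_header), dst)
print("header says", payload_len_field, "bytes; frame carries", actual_payload)
```

Under these assumptions the header claims 0 payload bytes while the frame carries 6, and the destination decodes to the next-hop address of nexthop 534 shown earlier.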

ADPC2(mercator-re1 vty)# sh ich 1 re wo ip
(0xc4900b0c) wo.ip.inter_status: 0x00000002
(0xc4900b10) wo.ip.int_status_diag: 0x00000002
(0xc4900b14) wo.ip.inter_log: 0x00000880

ADPC2(mercator-re1 vty)# sh ich 1 ipktrd qs
WAN Queue Status
WAN_PRQ_MPTY - 0xfffffffb
WAN_ICB_MPTY - 0xffffffff
WAN_PRD_DONE - 0xffffffff

ADPC2(mercator-re1 vty)# sh ich 1 imq conf str wan 2 mu

Stream 34 mas/mu/mad/hnq_ptr info:
queue   mas      mu       mad      hnq_ptr(0x)
------- -------- -------- -------- ---------------
0       90456    90035    000692:00068e

ADPC2(mercator-re1 vty)# sh ich 1 wo stat ip wan 2

Iwo Input Processor Statistics:
Counter Name            Total            Rate           Peak Rate
---------------------- ---------------- -------------- -------------
Stream(2):
input packets           622446783        933309
output packets          622446759        933309


Case Study 5: Tunnel ingress with fragmentation

PR/237450
The issue is that the tag len is not correctly added to the L2 header len in the Iwo ucode. This causes an incorrect len and offset in the subsequent IP fragments. This is a day-one bug on Ichip platforms.
The tag len was not correctly extracted from $R_TC because an incorrect mask was used. As a result, the tag len is not correctly added to the L2 header len, which causes the IP total length and fragment offset fields to be updated at the wrong locations in subsequent IP fragments.


Case Study 6: I3.0 Ifo queue buffer

PR/268274
Found improper programming in the Ichip pktrd configuration:
the fab_cfg_icb was assigning a lookahead quota of 4 but only 5 entries to the buffer;
there needs to be space for 2 additional entries for the current packet. So I dialed down the lookahead quota to 3.
The script does this for all 96 fabric destinations on all 4 Ichips on the DPC.
A quota of 3 is adequate for performance; this is what we used in our chip simulations.
Somehow, this value did not get propagated to the Junos software, which is using the value of 4 that was OK for I2.0 (but not I3.0). So this seems to be a day-one bug.


The need for a large MTU to trigger the wedge now makes sense. A packet needs 2 non-lookahead icells when it is 3392 bytes or more.
When the icell lookahead limit is wrong by one, the extra lookahead icell overwrites only the second non-lookahead icell.
Thus, if the MTU is less than 3392 bytes, the second icell is never overwritten.
Four 1-icell packets also need to arrive during the large (2-icell) packet so that all 4 of the icell lookahead buffers are filled.
So one stream of low-rate 9K packets and a second stream with bursts of at least four single-icell (384 to 3391 byte) packets should show the problem.


The stream of smaller packets needs to be between 384 bytes (6 cells) and 3391 bytes (53 cells). I suspect a stream of 384-byte packets will cause the wedge the quickest.
These small packets need to be in a burst, that is, back to back, for at least 4 packets, and need to arrive just after the 9K packet.
I2 has 32 fabric streams; I3 has 96. I2 has 256 icell buffer pointers; I3 has 512. The streams were increased by a factor of 3, but the buffer pointers were increased only by a factor of 2. Thus the programming must be different.
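
The arithmetic behind this case can be checked in a few lines. This is a sketch under two assumptions that are not stated on the slides: that the icell buffer pointers are split evenly across the fabric streams, and that a cell is 64 bytes (from "384 bytes (6 cells)"). The function name `quota_fits` is hypothetical.

```python
CELL_BYTES = 64              # assumption: 384 bytes = 6 cells
CURRENT_PACKET_ENTRIES = 2   # entries reserved for the current packet


def quota_fits(lookahead_quota: int, buffer_entries: int) -> bool:
    """True if lookahead entries plus current-packet entries fit the buffer."""
    return lookahead_quota + CURRENT_PACKET_ENTRIES <= buffer_entries


# Assumption: icell buffer pointers are divided evenly across fabric streams.
entries = {"I2": 256 // 32, "I3": 512 // 96}

for chip, count in entries.items():
    for quota in (4, 3):
        print(chip, "entries:", count, "quota:", quota,
              "fits:", quota_fits(quota, count))

# 53 cells is the first size at which a packet needs a second non-lookahead icell.
print(53 * CELL_BYTES)
```

With an even split, I2 gets 8 entries per stream and I3 gets 5, so a quota of 4 fits on I2 but overflows by one entry on I3, while a quota of 3 fits on both.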



ADPC1(A2-MX960-BOT vty)# sh ich 0 ipktrd qst
FABRIC Queue Status
FAB_PRQ_MPTY[0] - 0xfffcffff
FAB_PRQ_MPTY[1] - 0xffffffff
FAB_PRQ_MPTY[2] - 0xffffffff
FAB_ICB_MPTY[0] - 0xfffeffff
FAB_ICB_MPTY[1] - 0xffffffff
FAB_ICB_MPTY[2] - 0xffffffff
FAB_PRD_DONE[0] - 0xfffcffff

<<<

<<<

ADPC1(A2-MX960-BOT vty)# sh ich 0 imq conf str fab 8 mu

Stream 72 mas/mu/mad/hnq_ptr info:
queue   mas      mu       mad      hnq_ptr(0x)
------- -------- -------- -------- ---------------
        114684   114399   00010a:000106
        114684   113178   00011c:000118


ADPC1(A2-MX960-BOT vty)# sh ich 0 ipktrd err fab 16
Cell-error Counters:
Counter Name            Total            Rate           Peak Rate
---------------------- ---------------- -------------- -------------
fab_strm[16] ecc
fab_strm[16] err        165
fab_strm[16] nt_age
fab_strm[16] ic_age
fab_strm[16] dc_age     55               13

ADPC1(A2-MX960-BOT vty)# sh ich 0 imq stat fab 8 qu 0

physical queue 272:
Counter Name            Total            Rate           Peak Rate
---------------------- ---------------- -------------- -------------
OUT_PKT_Q               26530802         776265
(BYTE)                  2813438901       82355652


ADPC1(A2-MX960-BOT vty)# sh ich 0 imq stat fab 8 qu 0
Counter Name            Total            Rate           Peak Rate
---------------------- ---------------- -------------- -------------
[..]
RDROP_PKT_Q DP00        3157265045       129153         258849
(BYTE)                  330653936819     13561141       27296088


Case Study 7: IA FPGA DMA corruption

PR/269699
Some critical sections of DMA code were not protected from interrupts, which made DMA data corruption possible if one I-chip write was interrupted by code that performed another I-chip write.
On the MX platform, with a lot of traffic going to the Routing Engine and many route changes, the DPC might trigger an assertion without producing a core dump.



Look at the "How to Repeat" section of PR/269699 for good ways to reproduce the problem:
Use the shell command route.new to install and delete many routes.
Use JIT to generate traffic through many interfaces.
On the interfaces with JIT-generated traffic, add a firewall filter that creates a lot of exception traffic: sample, log, count.
Use an Ixia tester to send exception traffic: send PIM traffic that causes an iif mismatch, send traffic to the internal loopback address, send ARP traffic.


Case Study 8: Traffic sent to queues that are not configured

PR/277853, PR/27741
How to repeat: make sure you get traffic flowing with bursts and some MPLS packets in Q1 and Q2. After flapping the interface on the upstream router, you should see the aging counter increasing, and after some time you should also see packet CRC errors reported.
If some traffic flows use queue 1 or queue 2 without class-of-service configured, this might trigger packet aging, which can corrupt packets.
This is reported as an IP CRC packet error in syslog, or it can even stop traffic forwarding.
As a workaround, make sure a transmit-rate is configured for every queue that is used for traffic.


The default configuration leaves several (6 of 8) queues at zero transmit-rate.
If traffic happens to go to these queues on an interface that is oversubscribed by traffic flowing through the other, non-zero-rate queues, the zero-rate queues accumulate pending traffic.
That traffic ages and then enters the realm of undetected aging.
When the oversubscription ends, the pending traffic drains.
A lot of it is detected as aged (with ~7/8 probability, if we use 3 bits for "aging" detection), but some drains out undetected.
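
The ~7/8 figure follows from the number of aging bits. As a sketch, assuming the number of buffer wraps at drain time is effectively uniform modulo 2^bits (an assumption made here for illustration), the detection probability is:

```python
from fractions import Fraction


def detection_probability(aging_bits: int) -> Fraction:
    """Chance that stale traffic is flagged as aged, assuming the wrap count
    at drain time is uniformly distributed modulo 2**aging_bits."""
    return 1 - Fraction(1, 2 ** aging_bits)


print(detection_probability(3))  # 7/8
print(detection_probability(2))  # 3/4
```

So with 3 bits roughly one stale icell in eight drains out undetected, and with 2 bits (512MB DIMMs, per Case Study 1) roughly one in four.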


Most of it encounters packet CRC errors and gets dropped before it leaves the chipset.
This problem exists in the T/M/MX series PFEs (Gimlet, Stoli, Ichip). However, the Ichip is particularly susceptible to it because of PR231419, where it can cause a wedge. PR277853 documents such a situation.
There are two ways to address this issue:
1. Change the default transmit-rate on all queues to a very small value (that is, not zero); this value needs to be carved out of the overall credits that are assigned to the other queues on the interface.
It needs to be large enough to drain the minimum Mas of the queue in less than the undetected age time (for example, ~4 seconds for Ichip).
A strict-priority queue configuration with improperly provisioned traffic can defeat this mechanism, so there are no 100% guarantees.
Configure a policer for the strict-high queue.


2. Enhance the workaround in PR231419 to detect a blocked queue in Imq as well:
check that Mu is non-zero and that the queue output counter has not incremented during the one-second interval, and if so,
disable enqueue into this queue, write the RAMQ_START_PTR and HNQ_STE_REG for that queue (which effectively throws away all pending traffic in this queue), and then
re-enable enqueue into the queue.
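
The proposed sequence can be sketched as follows. This is an illustration only: `FakeQueue` is a stand-in for one Imq queue, every method name is hypothetical, and the real register writes (RAMQ_START_PTR, HNQ_STE_REG) are modelled as simply zeroing the pending count.

```python
from dataclasses import dataclass, field


@dataclass
class FakeQueue:
    """Stand-in for one Imq queue; every name here is hypothetical."""
    mu: int                       # pending memory units
    out_count: int                # queue output counter
    enqueue_enabled: bool = True
    actions: list = field(default_factory=list)

    def disable_enqueue(self):
        self.enqueue_enabled = False
        self.actions.append("disable enqueue")

    def reset_pointers(self):
        # Models writing RAMQ_START_PTR and HNQ_STE_REG: pending traffic is dropped.
        self.mu = 0
        self.actions.append("reset pointers")

    def enable_enqueue(self):
        self.enqueue_enabled = True
        self.actions.append("enable enqueue")


def recover_if_blocked(queue: FakeQueue, out_count_one_second_ago: int) -> bool:
    """Flush the queue if it holds traffic but sent nothing in the last second."""
    blocked = queue.mu != 0 and queue.out_count == out_count_one_second_ago
    if blocked:
        queue.disable_enqueue()
        queue.reset_pointers()
        queue.enable_enqueue()
    return blocked


stuck = FakeQueue(mu=100, out_count=5)
healthy = FakeQueue(mu=100, out_count=9)
print(recover_if_blocked(stuck, 5), stuck.actions)
print(recover_if_blocked(healthy, 5), healthy.actions)
```

The order matters in this model: enqueue is disabled before the pointers are rewritten so that no new traffic lands in the queue while its state is being reset.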


Case Study 9: RLDRAM BIST failure (PR/255204)

Once the RLDRAM BIST fails, the memory is not initialized correctly and eventually ends up with parity errors.
You need to reboot the DPC and make sure the BIST does not fail.
[Mar 4 20:04:01.139 LOG: Err] ICHIP(2): iifwo2 sram parity error
[Mar 4 20:04:01.139 LOG: Err] ICHIP(2): iifwo2 sram parity err count 1
[Mar 4 20:04:01.139 LOG: Err] ICHIP(2): iifwo2 parity err at word offset 0x4017e2(hashed 0x17e4)
[Mar 4 20:04:01.139 LOG: Err] ICHIP(2): iifwo2 parity err data0 0x8c400c24 data1 0x203000d



To perform a stress test of the RLDRAM, configure the following:
set chassis no-reset-on-timeout
set chassis fpc slot <x> command noboot
set chassis images rfeb slot <x> command noboot

Then cty to the DPC and reboot it. While rebooting, it should not load the image, and you get the BOOT prompt.
Enter the following:
boot -m manufacturing 1 adpc.jbf



Once this is loaded, run:
set parser security 10
diagnostic ram_bist all stress
diagnostic ichip 0 register

and check whether there are errors.


Case Study 10: Troubleshooting LINK (SF fabric port) error messages and mapping them to the proper FRU

Syslog error messages appear as shown below. The trouble is mapping these LINKs to the proper FRU. PR/407207 has been opened to enhance the error messages.
Note the link numbers in the output below, links 62 and 56. Which FRU do these map to?
Dec 2 18:29:05 CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 62 failed because of crc errors
Dec 2 18:29:05 CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 1, ID 0): link 56 failed because of crc errors

Since many different commands give channel and link output, it is difficult to know which command helps cross-reference the FRU with the LINK found in the messages log.
After much digging around, we could find circumstantial evidence with a number of different commands, but had a very hard time identifying the one command that matches the LINK error messages to a FRU.

Here are some of the commands we found that provide circumstantial evidence of which FRU might be causing problems.
See which DPC has a link error (only dpc0 in this case):
cli> show chassis fabric plane
Fabric management PLANE state
Plane 0
Plane state: SPARE
FPC 0
PFE 0 :Link error   <<<
PFE 1 :Link error   <<<
PFE 2 :Link error   <<<
PFE 3 :Link error   <<<
[..]

See which (DPC) slot reports CRC errors (0-5 for calypso). Only slot 0 (dpc0) shows CRC errors in this case:
cli> show chassis hsl channel asic-name CB[0,1]F[0,1] info slot [0-5]

cli> show chassis hsl channel asic-name CB0F0 info slot 0
CB0F0-chan-rx-105 : Down
Full with 8 links
HSL2_TYPE_T_RX
Flag: 0
reg: 0x2f000
first_link: CB0F0-hss_cu08-link-120
64b66b No-scramble No-plesio input-width:0
Cell received: 0
CRC errors: 393210 exceeded 0
Cell last : 0
CRC last : 65535   <<<
Finally, we found a hidden command that maps the LINK in the error messages to the FRU causing the problems.
Refer to the first slide of this case study for the log messages.
We are looking for the FRU that matches LINK numbers 62 and 56.
cli> show chassis fabric statistics detail totals 0
SF-chip statistics for plane#0
------------------------------
Total pio errors=0
Statistics for input link#0 (DPC1PFE0->CB0F0_00_0):   <<< does not match: dpc1 connects to link 0
Valid cells   : 0
Request cells : 0
Grant cells   : 0
Dropped cells : 0
[..]

cli> show chassis fabric statistics detail totals 0
[..]
Statistics for output link#56 (CB0F0_14_0->DPC0PFE0) 0:   << match link 56 to dpc0
Valid cells   : 0
Request cells : 0
Grant cells   : 0
Dropped cells : 0
[..]
Statistics for output link#62 (CB0F0_15_2->DPC0PFE2) 0:   << match link 62 to dpc0
Valid cells   : 0
Request cells : 0
Grant cells   : 0
Dropped cells : 0
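
The manual cross-reference above can be automated with two regular expressions. This is a sketch that assumes the syslog and CLI text has been saved as strings in exactly the formats shown on these slides; the function names are hypothetical, and the code does not invoke any Junos command itself.

```python
import re

# Sample text copied from the slides of this case study.
SYSLOG = """
Dec 2 18:29:05 CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 62 failed because of crc errors
Dec 2 18:29:05 CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 1, ID 0): link 56 failed because of crc errors
"""
STATS = """
Statistics for input link#0 (DPC1PFE0->CB0F0_00_0):
Statistics for output link#56 (CB0F0_14_0->DPC0PFE0) 0:
Statistics for output link#62 (CB0F0_15_2->DPC0PFE2) 0:
"""

LINK_ERR = re.compile(r"Fchip \(CB (\d+), ID (\d+)\): link (\d+) failed")
LINK_STAT = re.compile(r"Statistics for (input|output) link#(\d+) \((\w+)->(\w+)\)")


def failed_links(syslog: str) -> list[int]:
    """Link numbers reported in CHASSISD_FASIC_HSL_LINK_ERROR messages."""
    return [int(m.group(3)) for m in LINK_ERR.finditer(syslog)]


def frus_for_link(link: int, stats_output: str) -> set[str]:
    """DPC/PFE endpoints attached to a given SF link number."""
    frus = set()
    for direction, number, src, dst in LINK_STAT.findall(stats_output):
        if int(number) == link:
            # Input links run DPC->SF chip, output links run SF chip->DPC.
            frus.add(src if direction == "input" else dst)
    return frus


for link in failed_links(SYSLOG):
    print(link, sorted(frus_for_link(link, STATS)))
```

A set is returned rather than a single name because the slides do not say whether input and output links share a numbering space; if they do, one link number could map to more than one endpoint.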

Here we artificially create CRC errors and check the output:
ADPC3(router vty)# test hsl2 corrupt ichip_0 0 100

The following error messages appear in the log. In this case, we want to find
which FRU matches link numbers 12 and 20:
Dec 5 10:57:19 honolulu-re0 chassisd[5941]: CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 12 failed because of crc errors
Dec 5 10:57:19 honolulu-re0 chassisd[5941]: CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 20 failed because of crc errors
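
When many of these messages pile up, it can help to pull out the distinct link numbers before going to the mapping command. The following is a minimal sketch in Python, assuming the message shape is exactly that of the two log lines above; the function name `failed_links` is hypothetical and other message variants are not covered.

```python
import re

# Pattern built from the two log lines shown above; other variants are not covered.
LINK_ERROR = re.compile(
    r"CHASSISD_FASIC_HSL_LINK_ERROR: Fchip \(CB (\d+), ID (\d+)\): "
    r"link (\d+) failed because of crc errors"
)


def failed_links(log_text):
    """Return sorted, de-duplicated (cb, fchip_id, link) tuples found in log_text."""
    # A message may be wrapped over several lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return sorted({tuple(int(g) for g in m.groups()) for m in LINK_ERROR.finditer(flat)})


sample = """
Dec 5 10:57:19 honolulu-re0 chassisd[5941]:
CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 12
failed because of crc errors
Dec 5 10:57:19 honolulu-re0 chassisd[5941]:
CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 20
failed because of crc errors
"""
print(failed_links(sample))  # [(0, 0, 12), (0, 0, 20)]
```

The result gives the CB, Fchip ID and link numbers to look for in the fabric statistics output.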


Case Study 10: Trouble shooting LINK (SF fabric ports) error messages to the proper FRU
This hidden command does not provide the link mapping, but it does show the
CRC errors on dpc3 (slot 3) on channel CB0F0-chan-rx-24:
cli> show chassis hsl channel asic-name CB0F0 info slot 3
CB0F0-chan-rx-24 : up
Sub channel 1 of 4 with 2 links
HSL2_TYPE_T_RX
Flag: 0x23    reg: 0x23000    first_link: CB0F0-hss_cu08-link-24
64b66b No-scramble Plesio input-width:3
Cell received: 10272828104118    CRC errors: 100 exceeded 0
Cell last    : 10105354          CRC last  : 0    <<<

Case Study 10: Trouble shooting LINK (SF fabric ports) error messages to the proper FRU
This command provides the mapping between the LINK reported in the log and
the FRU causing the problem.
Again, we are looking for links 12 and 20:
cli> show chassis fabric statistics detail totals 0
SF-chip statistics for plane#0
------------------------------
Total pio errors=0
<<< no match
Statistics for input link#0 (DPC1PFE0->CB0F0_00_0):   <<< link 0
    Valid cells    : 0
    Request cells  : 0
    Grant cells    : 0
    Dropped cells  : 0
[..]


Case Study 10: Trouble shooting LINK (SF fabric ports) error messages to the proper FRU
This command provides the mapping between the LINK reported in the log and
the FRU causing the problem. Below we find that links 12 and 20 map to DPC3:
cli> show chassis fabric statistics detail totals 0
<..>
Statistics for input link#12 (DPC3PFE0->CB0F0_03_0):
    Valid cells    : 1
    Request cells  : 1
    Grant cells    : 0
    Dropped cells  : 0
<..>
Statistics for input link#20 (DPC3PFE0->CB0F0_05_0):
    Valid cells    : 1
    Request cells  : 1
    Grant cells    : 0
    Dropped cells  : 0
<..>
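
The manual search through this long output can be scripted. The sketch below, in Python, builds a link-to-FRU table from the "Statistics for input/output link#N (...)" header lines; it assumes the header format is exactly as shown in these slides, and the function name `link_to_fru` is hypothetical.

```python
import re

# Matches headers such as "Statistics for input link#12 (DPC3PFE0->CB0F0_03_0):"
LINK_HEADER = re.compile(
    r"Statistics for (input|output) link#(\d+) \((\S+?)->(\S+?)\)"
)


def link_to_fru(output):
    """Map (direction, link number) to the DPC/PFE name in the header line."""
    mapping = {}
    for direction, link, src, dst in LINK_HEADER.findall(output):
        # For input links the DPC is the source; for output links it is the destination.
        endpoint = src if direction == "input" else dst
        mapping[(direction, int(link))] = endpoint
    return mapping


sample = """
Statistics for input link#0 (DPC1PFE0->CB0F0_00_0):
Statistics for input link#12 (DPC3PFE0->CB0F0_03_0):
Statistics for input link#20 (DPC3PFE0->CB0F0_05_0):
Statistics for output link#56 (CB0F0_14_0->DPC0PFE0) 0:
"""
frus = link_to_fru(sample)
for link in (12, 20):
    print(link, frus[("input", link)])  # both print DPC3PFE0
```

The table is keyed by direction as well as link number because the slides show link numbers on both input and output headers.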


Case Study 10: Trouble shooting LINK (SF fabric ports) error messages to the proper FRU
Here is the fabric mapping for MX240:
                   scb 0       scb 0       scb 1       scb 1
                   sf          sf          sf          sf
dpc slot  sf port  Ichip port  Ichip port  Ichip port  Ichip port
========  =======  ==========  ==========  ==========  ==========

15

n/a

n/a

14

n/a

n/a

13

11

9
8

0
4

1
5

2
6

3
7


Case Study 10: Trouble shooting LINK (SF fabric ports) error messages to the proper FRU
PR/407207 was opened to enhance the error
message logs to report the FRU instead of the
SF fabric port LINK


I chip performance
For a vanilla IPv4 packet, we have only one IP DA lookup. For uRPF,
we have an SA IP lookup, plus an IIF tree lookup, plus the DA IP lookup.
So for the uRPF path we have doubled (or more) the number of
SRAM accesses.
Basically, there is a limited budget of RLDRAM (aka SRAM, since that is
what previous chips used) accesses per packet, due to the
memory bandwidth limit.
The budget is smallest for the smallest packet sizes (highest pps
rates).
The requirement for uRPF is significantly higher than for vanilla IPv4, so
at minimum packet sizes there won't be adequate RLDRAM
bandwidth; route lookups will back up and we'll drop in the Ichip iif
block.
This is similar to the Rchip input tail-drops we've seen when
large filters were configured (at other customers).


I chip performance
Each RLDRAM part can service at most 200*10^6 reads/sec. We'll
lose some performance due to RLDRAM refresh overhead and other
small inefficiencies, so peak utilization is typically ~95% (i.e.,
~190*10^6 reads/sec per part). For the two parts combined, the max
rate should be ~380*10^6 reads/sec.
If the rlkp subsystem must perform 16 fetches from RLDRAM for
every notification that it services, then the max throughput rate is
~(380/16)*10^6 notifs/sec, i.e. ~23.75*10^6 notifs/sec.
If you observe RLDRAM utilization rates and system throughput
rates significantly less than the above, then that is consistent with
our suspicion of an RLDRAM hot bank bottleneck (but isn't really
strong enough to be evidence). It will provide us with some
quantitative measurements of the level of activity seen at each of
the two RLDRAM parts for rlkp.
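
The budget arithmetic above can be written out as a small Python sketch. The three constants are the figures quoted on this slide; the function names are hypothetical.

```python
PEAK_READS_PER_PART = 200e6  # reads/sec one RLDRAM part can service (from the text)
UTILIZATION = 0.95           # ~95% after refresh overhead and small inefficiencies
PARTS = 2                    # two RLDRAM parts serve rlkp


def max_read_rate():
    """Combined sustainable RLDRAM read rate in reads/sec."""
    return PEAK_READS_PER_PART * UTILIZATION * PARTS


def max_notif_rate(fetches_per_notif):
    """Upper bound on notifications/sec for a given number of RLDRAM fetches each."""
    return max_read_rate() / fetches_per_notif


print(f"{max_read_rate() / 1e6:.0f} M reads/sec")      # 380 M reads/sec
print(f"{max_notif_rate(16) / 1e6:.2f} M notifs/sec")  # 23.75 M notifs/sec
```

The same function gives the ceiling for any other fetch count, which is how a heavier lookup path such as uRPF lowers the achievable notification rate.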


I chip performance
Ichip does not provide statistics on accesses to
*each* RLDRAM bank. So if you have a hot-bank
problem, you can only deduce it indirectly, using jsim and
counting the number of accesses to each bank (the 3
LSBs of the SRAM address), as you have started doing
below. Of course, jsim can be run on only a handful of
routes, so one might not get the full picture.
ADPC7(lab-brdr-01_re0 vty)# sh ich 0 iif stat
Discard counters:
Counter Name                      Total           Rate      Peak Rate
---------------------- ---------------- -------------- --------------
WAN_DROP_CNTR               17021548790        4009502        7973163
FAB_DROP_CNTR                 283905625         323074        1168296


I chip performance
ADPC7(lab-brdr-01_re0 vty)# sh ich 0 isr stat
RLDRAM accessing statistics
Counter Name                      Total           Rate      Peak Rate
---------------------- ---------------- -------------- --------------
R1_SRM_RD                11890300277284      195280981      219119725
R2_SRM_RD                 9481857885985      186351963      213945425

The sram bank is the 3 LSBs of the sram address.
For example, for the first line below,
- [ 2]   9 sram 000153
sram bank 3 (decimal) is being used.
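
Counting accesses per bank by hand gets tedious over many jsim lines. The Python sketch below does the tally; it assumes every read shows up as "sram <hex address>", which is inferred from the single sample line above, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, taken from the one sample above: "... sram <hex address> ..."
SRAM_READ = re.compile(r"\bsram\s+([0-9a-fA-F]+)\b")


def bank_of(addr):
    """The sram bank is the 3 least significant bits of the sram address."""
    return addr & 0x7


def bank_histogram(jsim_text):
    """Count sram reads per bank across all lines of jsim output."""
    banks = Counter()
    for match in SRAM_READ.finditer(jsim_text):
        banks[bank_of(int(match.group(1), 16))] += 1
    return banks


print(bank_of(0x000153))                      # 3
print(bank_histogram("[ 2]  9 sram 000153"))  # Counter({3: 1})
```

A bank whose count is far above the others across a set of routes would be a candidate hot bank, with the caveat already noted that jsim covers only a handful of routes.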


I chip performance
21 sram reads on ingress pfe
- the breakdown
- bank - used
[ 2]   9 sram 000153 4005eb61 nh: multiple SER(no SE) hops=1 addr=0x10017a
segment extended reference


I chip enhancements

PR/231419 - PR/103712, PR/228781, PR/103298

The work-around for PR 231419 is to flush aged icells before
they can become undetected aged icells. There are 3 "virtual
address" bits in the packet memory address (2 bits if 512MB
DIMMs). An aged icell can become an undetected aged icell if
these 3 address bits are the same during the read as the 3 bits
currently being written by the packet writer. That is, the packet
memory has been overwritten a multiple of 8 times.

The workaround is to sample traffic flow and, if traffic is blocked
on a stream and the time is getting near the undetected aged icell
time, flush the stream. Detecting traffic flow is done
using the flow state register in the WO_SPI block. If these bits
say no traffic has flowed during the sample period and there was
traffic pending, then traffic is blocked.

The delay bandwidth buffer is at least 200 msec, so overwriting 8
times requires at least 1.6 seconds of real time. The sample rate
must be at least twice this rate to avoid sample aliasing. Pick a
flow sample period of 0.5 seconds to detect blocked traffic.
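
The timing logic of this workaround can be sketched in Python as follows. This is an illustration only, not the actual microcode: the class `StreamWatch`, its inputs, and the 500 ms safety margin `MARGIN_MS` are hypothetical; the 200 msec buffer, the factor of 8 and the 0.5 second sample period come from the text.

```python
BUFFER_MS = 200                             # delay bandwidth buffer, at least 200 msec
OVERWRITES_TO_ALIAS = 8                     # 3 virtual address bits wrap after 8 overwrites
ALIAS_MS = BUFFER_MS * OVERWRITES_TO_ALIAS  # 1600 ms before an aged icell can go undetected
SAMPLE_MS = 500                             # flow sample period picked in the text
MARGIN_MS = 500                             # hypothetical safety margin, not from the text


class StreamWatch:
    """Tracks how long one stream has been blocked, in sample-period steps."""

    def __init__(self):
        self.blocked_ms = 0

    def sample(self, traffic_flowed, traffic_pending):
        """Return True when the stream should be flushed at this sample."""
        blocked = traffic_pending and not traffic_flowed
        if not blocked:
            self.blocked_ms = 0
            return False
        self.blocked_ms += SAMPLE_MS
        # Flush before the blocked time can reach the aliasing time.
        if self.blocked_ms + MARGIN_MS >= ALIAS_MS:
            self.blocked_ms = 0
            return True
        return False


watch = StreamWatch()
decisions = [watch.sample(traffic_flowed=False, traffic_pending=True) for _ in range(3)]
print(ALIAS_MS, decisions)  # 1600 [False, False, True]
```

With these numbers a stream that stays blocked is flushed on the third consecutive blocked sample, i.e. after 1.5 seconds, which is inside the 1.6 second window.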


I chip enhancements

PR/260708 - Add more debug info in Iwo as requested by customer

PR/257149

Although the sw bug that triggered the wedge has been fixed, this
PR will be moved into suspend mode and should reflect the
hw part. In future designs, hw should prevent a wedge
caused by a fifo overflow, and should have better protection and
debugging tools to address such conditions quickly.

PR/264319 IQ2 - add syslog when prolonged backpressure happens on the
ET/XET FPGA to the upstream block

PR/264322 Iwo/Lout - add syslog when prolonged backpressure happens on
the ET/XET FPGA to the upstream block


I chip enhancements

PR/263909

There are many different ways in which an ichip can get
wedged. When there are many ichips on one fru that can be
power reset (for example the 4x10G dpc with 4 ichips),
then if one ichip gets wedged, all of the ichips must be
power cycled to unwedge just that one ichip. This makes
customers very unhappy, as 3 other working 10G streams
must be interrupted to fix an issue with the one wedged ichip.
This modification request is to see if it is possible to reset
just one ichip on a multi-ichip based fru.

PR/407207

Change the syslog messages to map the FRU to the LINK
error messages


I chip enhancements

PR/263084

In the I2 and I3 chip, if an Lout key points to a corrupted Next
Hop DB entry, it is possible for the WO block to wedge. The exact
hardware reason for the wedge is currently unknown, but it has
been observed experimentally (PR240012). The hardware does
check certain invalid combinations and drops the packet. But
some combinations of corrupted nhdb are not detected and
can lead to wedging.
To improve robustness in the case of bad Lout keys or a corrupted
Next Hop DB, it is desirable to initialize to zero all of the memory
that can be addressed by an Lout key. This includes the multicast
area, tag area, and L2 Descriptor area. A zero value allows the
hardware to be much more robust in detecting a corrupted nhdb,
as zero is always an illegal value. Whenever the hardware reads an
entry with zeros in the top nibble, the result will be a syslog of
"New illegal link errors in WO DESRD lout_key" and the packet is
dropped, instead of possibly wedging.
Robustness can also be improved if the memory allocator zeros
any freed nhdb and multicast table memory blocks. This results
in stale Lout keys being reported as illegal link errors instead of
possibly wedging.
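
The zeroing idea can be illustrated with a toy model in Python. It is a sketch under stated assumptions, not the real allocator or hardware: the class `NhdbMemory`, the 32-bit word width and the "drop"/"forward" results are all hypothetical; only the rule that a zero top nibble is illegal and leads to a drop comes from the text.

```python
class NhdbMemory:
    """Toy model of memory addressed by Lout keys (hypothetical layout)."""

    WORD_BITS = 32  # assumption: the word width is not given in the text

    def __init__(self, size):
        # Initialize to zero everything an Lout key can address.
        self.words = [0] * size

    def write(self, addr, value):
        self.words[addr] = value & ((1 << self.WORD_BITS) - 1)

    def free(self, addr, length):
        # Zero freed blocks so a stale Lout key lands on an illegal entry.
        for i in range(addr, addr + length):
            self.words[i] = 0

    def read_entry(self, lout_key):
        """Return "drop" for an entry with a zero top nibble, else "forward"."""
        word = self.words[lout_key]
        top_nibble = word >> (self.WORD_BITS - 4)
        return "drop" if top_nibble == 0 else "forward"


mem = NhdbMemory(16)
mem.write(5, 0x4005EB61)
print(mem.read_entry(5))  # forward
mem.free(5, 1)
print(mem.read_entry(5))  # drop (stale key now hits a zeroed entry)
```

In the model, both never-written and freed locations are detected the same way, which is the point of combining zero initialization with zero-on-free.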


I chip enhancements

PR/262802 - IPv6 jumbograms (RFC 2675) are not handled correctly

PR/105469

Ichip wan stream summary stats.

When we capture the wo stats for wan_stream 0,
only the following is listed. The Iwo desrd, hdrf
and ip are all missing.
Also, the imq stream red / tail drop is missing
as well.
Can we include them in the "summary" capture for
troubleshooting purposes? That would minimize
"missing info" cases when we perform troubleshooting
within a limited period of time.


Question or Suggestion
If you have any questions regarding I chip
troubleshooting, please contact
mx-escalation@juniper.net

If you have any questions about this presentation, please contact
drautio@juniper.net
