v3
Dan Rautio
www.juniper.net
Overview
Type of problems
Symptoms Very Important
Which block of the ASIC?
Case study of past problems
Deep Dive for each symptom
Ichip Performance
Ichip enhancements
Type of problems
Transit traffic loss
PFE related problem
PR/277853, PR/27741
PR/255204
PR/407207
PR/407207
Total
Rate
Peak Rate
322996428
75722
output packets
322996212
75722
ssmcst packets
fragmented packets
input drops
215
output drops
62736
146
5357
27
4315
10
PR/240012, PR/258760
Here are the errors on the ichip with this problem:
ADPC0(r16 vty)# show ichip 3 wo statistics
Iwo Statistics:
Summary                Total            Rate           Peak Rate
---------------------- ---------------- -------------- -------------
input packets          196629610        148810         149095
output packets         196629140        148810         149095
output bytes           20646461553      15625047       15654950
total drops
975
203
37
mas
mu
mad
hnq_ptr(0x)
24132
24026
000b75:000b71
PR/251042, PR/240148
On M120 and MX-series routers and M320 Enhanced III FPCs, when an
MPLS-encapsulated IPv4 packet that is padded to meet the minimum Layer 2 frame size (for example, 64 bytes for frames on Ethernet media) exits
an LSP, the egress interface might stop forwarding packets. This can
happen when the router is configured as a PE router in a VPN or is the
penultimate node of an LSP. To recover, reboot the FPC (on MX-series or
M320 routers) or the FEB (on M120 platform) that houses the affected
interface.
Root cause: the plen from the IP header cannot be trusted, because it is
not sanity checked in the MPLS->IPv4 case. Instead, use the notification
plen to calculate the dbuf size when requesting data from Iwo l23 to Iwo IP.
An incorrect DBUFSZ computed by the WO microcode can cause a wedge
(PR251042). This is a microcode programming requirement.
Output packets are assembled in the WO output block (wo_spi) by pulling
the headers and the first part of the payload from the L23 engine and the
remainder of the payload from DBUF.
Microcode computes the remainder size and sets DBUFSZ to the number of
bytes to pull from DBUF.
If there is data in DBUF but microcode incorrectly sets DBUFSZ to zero,
then WO does not pull the data from DBUF and WO can wedge.
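The DBUFSZ requirement above can be sketched as follows. This is an illustrative Python model, not the WO microcode: the function name `dbuf_size`, the parameter `head_bytes` (the bytes supplied by the L23 engine), and all example lengths are assumptions made for the sketch.

```python
def dbuf_size(plen: int, head_bytes: int) -> int:
    """Bytes to pull from DBUF: total packet length minus the part
    already supplied by the L23 engine (hypothetical model)."""
    if plen < 0 or head_bytes < 0:
        raise ValueError("lengths must be non-negative")
    return max(plen - head_bytes, 0)


# Hypothetical padded packet: the notification reports more bytes than the
# IPv4 header's length field, because of Layer 2 padding.
NOTIF_PLEN = 50    # assumed length carried in the notification
IP_HDR_PLEN = 32   # assumed (unchecked) length field from the IPv4 header
HEAD_BYTES = 32    # assumed number of bytes supplied by the L23 engine

from_ip_header = dbuf_size(IP_HDR_PLEN, HEAD_BYTES)     # 0 bytes requested
from_notification = dbuf_size(NOTIF_PLEN, HEAD_BYTES)   # 18 bytes requested
```

In this made-up case the IP header length yields a DBUFSZ of zero while 18 bytes are still sitting in DBUF, which is the condition described above as leading to a wedge; the notification length yields the correct remainder.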
Copyright 2006 Juniper Networks, Inc.
       Type      Interface                               Protocol    Encap         MTU
-----  --------  -------------  -----------------------  ----------  ------------  ----
534    Unicast   ge-2/2/1.0     2001:668:0:2::1:662      IPv6        Ethernet      9194
548    Unicast   ge-2/2/1.0     fe80::217:cb00:aa1:7ff0  IPv6        Ethernet      9194
wo.ip.inter_status: 0x00000002 (0xc4900b10)
wo.ip.int_status_diag: 0x00000002 (0xc4900b14)
wo.ip.inter_log: 0x00000880
0xfffffffb
WAN_ICB_MPTY - 0xffffffff
WAN_PRD_DONE - 0xffffffff
mas
-------90456
mu
mad
--------
hnq_ptr(0x)
--------
90035
--------------000692:00068e
Total
Rate
Peak Rate
622446783
933309
output packets
622446759
933309
0xfffcffff
FAB_PRQ_MPTY[1] - 0xffffffff
FAB_PRQ_MPTY[2] - 0xffffffff
FAB_ICB_MPTY[0] - 0xfffeffff
FAB_ICB_MPTY[1] - 0xffffffff
FAB_ICB_MPTY[2] - 0xffffffff
FAB_PRD_DONE[0] - 0xfffcffff
<<<
<<<
mas
mu
--------
hnq_ptr(0x)
-------
--------
114684
114399
00010a:000106
114684
113178
00011c:000118
--------
mad
---------------
Total
Rate
Peak Rate
fab_strm[16] err
165
fab_strm[16] nt_age
fab_strm[16] ic_age
fab_strm[16] dc_age
55
13
Rate
Peak Rate
Total
26530802
2813438901
776265
82355652
Total
Rate
Peak Rate
3157265045
129153
258849
(BYTE)
330653936819
13561141
27296088
Then cty to the DPC and reboot it. While it is
rebooting it should not load the image, and you
get the BOOT prompt.
At the prompt, enter the following:
boot -m manufacturing 1 adpc.jbf
<<<
<<<
<<<
<<<
[..]
reg: 0x2f000
first_link: CB0F0-hss_cu08-
Cell received: 0
Cell last
CRC last
: 0
: 65535
<<<
Valid cells   : 0
Request cells : 0
Grant cells   : 0
Dropped cells : 0
[..]
: 0
Request cells : 0
Grant cells   : 0
Dropped cells : 0
[..]
Statistics for output link#62 (CB0F0_15_2->DPC0PFE2) 0: << match link 62 to dpc0
Valid cells   : 0
Request cells : 0
Grant cells   : 0
Dropped cells : 0
reg: 0x23000
first_link: CB0F0-hss_cu08-link-24
Cell last
CRC last
: 10105354
<<<
: 0
<<< no match
: 0
Request cells : 0
Grant cells   : 0
Dropped cells : 0
[..]
<..>
Statistics for input link#12 (DPC3PFE0->CB0F0_03_0):
Valid cells   : 1
Request cells : 1
Grant cells   : 0
Dropped cells : 0
<..>
Statistics for input link#20 (DPC3PFE0->CB0F0_05_0):
Valid cells   : 1
Request cells : 1
Grant cells   : 0
Dropped cells : 0
<..>
scb 0
scb 1
scb 1
sf
sf
sf
sf
dpc slot
sf port
Ichip port
Ichip port
Ichip port
Ichip port
========
=======
==========
==========
==========
==========
15
n/a
n/a
14
n/a
n/a
13
11
9
8
0
4
1
5
2
6
3
7
I chip performance
For a vanilla IPv4 packet, there is only one lookup: the IP DA lookup.
For uRPF, there is an IP SA lookup, plus an IIF tree lookup, plus the
IP DA lookup.
So the uRPF path at least doubles the number of SRAM accesses.
Basically, there is a limited budget of RLDRAM accesses per packet
because of the memory bandwidth limit. (RLDRAM is also referred to as
SRAM, since SRAM was used on previous chips.)
The budget is smallest at the smallest packet sizes (highest pps
rates).
The requirement for uRPF is significantly higher than for vanilla IPv4,
so at minimum packet sizes there won't be adequate RLDRAM bandwidth:
route lookups will back up and packets will be dropped in the Ichip iif
block.
This is similar to the Rchip input tail-drops seen at other customers
when large filters were configured.
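The shrinking per-packet budget described above can be illustrated with a rough calculation. This is a sketch under stated assumptions, not a measured Ichip figure: the 10 Gb/s line rate, the 20 bytes of per-frame Ethernet wire overhead, and the function name are assumptions; the aggregate rate of 380*10^6 reads/sec is the combined RLDRAM figure quoted in this deck.

```python
def reads_per_packet_budget(frame_bytes: int,
                            line_rate_bps: float = 10e9,     # assumed line rate
                            reads_per_sec: float = 380e6,    # combined RLDRAM rate from the deck
                            wire_overhead_bytes: int = 20):  # assumed preamble + inter-frame gap
    """Average number of RLDRAM reads available per packet at line rate."""
    # Packets per second when the link is saturated with frames of this size.
    pps = line_rate_bps / (8 * (frame_bytes + wire_overhead_bytes))
    return reads_per_sec / pps


small = reads_per_packet_budget(64)     # about 25.5 reads per packet
large = reads_per_packet_budget(1500)   # about 462 reads per packet
```

Under these assumptions a minimum-size frame leaves only a few dozen reads per packet, so a path that multiplies the number of lookups has far less headroom than it does with large frames.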
I chip performance
Each RLDRAM part can service at most 200*10^6 reads/sec. Some
performance is lost to RLDRAM refresh overhead and other small
inefficiencies, so peak utilization is typically ~95% (i.e.,
~190*10^6 reads/sec per part). For the two parts combined, the max
rate should be ~380*10^6 reads/sec.
If the rlkp subsystem must perform 16 fetches from RLDRAM for
every notification that it services, then the max throughput rate is
~380/16 * 10^6 notifs/sec (about 23.75*10^6).
If you observe RLDRAM utilization rates and system throughput
rates significantly lower than the above, that is consistent with
our suspicion of an RLDRAM hot-bank bottleneck (but isn't really
strong enough to be evidence). The observation will still give us
quantitative measurements of the level of activity seen at each of
the two RLDRAM parts for rlkp.
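The throughput ceiling worked out above can be written as a small calculation. This is a minimal sketch: the constant and function names are illustrative, and the numbers are the ones quoted in the text (200*10^6 reads/sec per part, ~95% utilization, two parts).

```python
READS_PER_PART = 200e6   # peak reads/sec per RLDRAM part (from the text)
UTILIZATION = 0.95       # typical achievable fraction (from the text)
PARTS = 2                # number of RLDRAM parts serving rlkp (from the text)


def max_notif_rate(fetches_per_notif: int) -> float:
    """Upper bound on notifications/sec for a given number of RLDRAM
    fetches per notification, assuming accesses spread evenly."""
    if fetches_per_notif <= 0:
        raise ValueError("fetches_per_notif must be positive")
    return PARTS * READS_PER_PART * UTILIZATION / fetches_per_notif


rate_16 = max_notif_rate(16)   # roughly 23.75 million notifications/sec
```

The bound assumes reads are spread evenly across banks; a hot bank would show up as measured throughput well below this figure.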
I chip performance
Ichip does not provide statistics on accesses to
*each* RLDRAM bank. So if you have a hot-bank
problem, you can only deduce it indirectly, by using jsim and
counting the number of accesses to each bank (the 3
LSBs of the SRAM address), as you have started doing
below. Of course, jsim can be run on only a handful of
routes, so one might not get the full picture.
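The per-bank counting described above can be sketched as a histogram over the 3 least significant bits of each SRAM address. This is an illustrative helper, not a jsim feature: the function names are hypothetical, and apart from 0x000153 the sample addresses are made up.

```python
from collections import Counter


def bank_of(sram_addr: int) -> int:
    """Bank index, taken as the 3 LSBs of the SRAM address."""
    return sram_addr & 0x7


def bank_histogram(sram_addrs):
    """Count accesses per bank for a list of SRAM read addresses."""
    return Counter(bank_of(addr) for addr in sram_addrs)


# Made-up sample of read addresses collected from a lookup trace.
sample = [0x000153, 0x00015B, 0x000163, 0x000154]
hist = bank_histogram(sample)   # bank 3 is hit three times, bank 4 once
```

A heavily skewed histogram over a representative set of lookups would be consistent with a hot-bank bottleneck, with the caveat already noted that only a handful of routes can be traced this way.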
ADPC7(lab-brdr-01_re0 vty)# sh ich 0 iif stat
Discard counters:
Counter Name             Total            Rate           Peak Rate
                         17021548790      4009502        7973163
FAB_DROP_CNTR            283905625        323074         1168296
I chip performance
ADPC7(lab-brdr-01_re0 vty)# sh ich 0 isr stat
RLDRAM accessing statistics
Counter Name           Total            Rate           Peak Rate
---------------------- ---------------- -------------- ------------
R1_SRM_RD              11890300277284   195280981      219119725
R2_SRM_RD              9481857885985    186351963      213945425
9 sram 000153
I chip performance
21 sram reads on ingress pfe
- the breakdown
- bank - used
[ 2]
I chip enhancements
I chip enhancements
Although the software bug that triggered the wedge has been fixed,
this PR will be moved to suspended state and should reflect the
hardware part. In future designs, hardware should prevent a wedge
caused by a FIFO overflow and provide better protection and
debugging tools to address such conditions quickly.
I chip enhancements
PR/263909
PR/407207
I chip enhancements
PR/263084
I chip enhancements
Questions or Suggestions
If you have any questions regarding Ichip
troubleshooting, please contact
mx-escalation@juniper.net