Author:
Department:
Date:
Version: 0.1
Troubleshooting guide for T/M/J JUNOS routers
1 Table of contents
Introduction
Document Objective
Scope
1.1 Document History
Related documents
2 Troubleshooting guidelines
2.1 Basic troubleshooting for all events
2.2 Common events
2.2.1 Power supply failure
2.2.2 Fan failure/Temperature alert
2.2.3 Device reboot with unknown cause
2.2.4 Chassis event (component failure)
2.2.5 Routing-engine
2.2.6 Link failure
2.2.7 Management IP unreachable (ICMP)
2.2.8 In-band Loopback IP unreachable (ICMP)
2.2.9 BGP neighbor
2.2.10 ISIS adjacency
2.2.11 VRRP
2.2.12 LDP neighbor/MPLS
2.2.13 PIM neighbor/multicast
2.3 Non-fault management alarms or undocumented events
2.3.1 Undocumented event
2.3.2 Network slow
2.3.3 Reachability problem
2.3.4 Complete service/product not working
2.4 Disaster recovery
2.5 Hardware maintenance verification
Introduction
Document Objective
This document gives basic instructions for certain types of alarms. The basic
troubleshooting steps are categorized per event and are valid for JUNOS
software running on M/T/J series models.
For most events a reference to the vendor documentation is given, where
additional information can be looked up. This vendor documentation is also available
in PDF format and should be present at a common location for operational personnel
(accompanying this document).
The interpretation of command output can also be looked up in the vendor
documentation:
• Go to www.juniper.net and type the command in the search area. All command
output reference information can be found there.
Scope
This document will describe the initial troubleshooting for the most common events.
It will also describe a generic approach per fault.
Related documents
2 Troubleshooting guidelines
2.1 Basic troubleshooting for all events
show version
show system uptime
show log messages | last 100
show chassis alarms
show chassis hardware
Details:
• show version -> shows the model you are working on
• show system uptime -> shows the current system uptime and when the system
was last configured. The load figures indicate how busy the system is.
• show log messages | last 100 -> shows the last 100 events which
happened on the router
• show chassis alarms -> shows whether any chassis alarms are active on the
router
• show chassis hardware -> shows which hardware is present
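The load figures reported by show system uptime can also be checked by a script. The following is a minimal Python sketch, assuming the output contains a line of the form "load averages: x, y, z". The helper names, the sample line and the threshold of 1.0 are illustrative assumptions and are not taken from the vendor documentation.

```python
import re

# Assumed line format: "... load averages: 0.12, 0.08, 0.05"
LOAD_RE = re.compile(r"load averages?:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_load_averages(uptime_output):
    """Return the (1 min, 5 min, 15 min) load averages, or None if absent."""
    match = LOAD_RE.search(uptime_output)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


def is_busy(loads, threshold=1.0):
    """Flag the system as busy when the 5-minute load exceeds the threshold.

    The default threshold is an illustrative assumption, not a vendor figure.
    """
    return loads[1] > threshold


# Hypothetical sample line, for illustration only.
sample = "10:15AM up 42 days, 3:07, 1 user, load averages: 0.12, 0.08, 0.05"
loads = parse_load_averages(sample)
print(loads, is_busy(loads))
```

A suitable threshold depends on the platform and should be agreed with the next escalation level before it is used for alarming.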
2.2 Common events
2.2.1 Power supply failure
Diagnostics:
The commands below should be run in case of a power supply failure. Please note
that not all systems have a PEM module.
Impact:
Common causes:
Solution:
Further reference:
2.2.2 Fan failure/Temperature alert
Diagnostics:
Impact:
Most chassis have redundant fans. Overheating can occur if the failed fan is not
repaired.
Common causes:
Solution:
Further reference:
2.2.3 Device reboot with unknown cause
Diagnostics:
Impact:
The impact depends on where the system sits in the topology. For systems with an
access-related function, a reboot generally means that a short outage has occurred.
In the core, the impact should be minimal.
Common causes:
• Power failure
• Bug/crash
• Routing engine failure (can be hard-disk failure on RE)
Solution:
Further reference:
2.2.4 Chassis event (component failure)
Diagnostics:
Look in the further reference section for your specific model (and then under
“monitoring model XXX” components section).
Impact:
Common causes:
• Hardware failure
Solution:
• Replace the hardware via the vendor contract. In most cases there will be a
service contract with a 3 hour time-to-fix. Open a ticket with this supplier as
soon as possible and let them replace the hardware.
Further reference:
2.2.5 Routing-engine
Diagnostics:
Most of the time an RE failure will also cause other alarms (for example BGP/LDP/ISIS
restarting).
Impact:
• On dual routing-engine systems the backup will take over, with a short
interruption.
• On single routing-engine systems, either another system in the topology will
take over or this machine is down together with all services it provides. In
most situations there is no impact beyond the normal fail-over times.
• If the backup RE has failed there is no service interruption.
Common causes:
Solution:
• Primary RE failure
o Check if the backup RE has taken over. If not, switch over manually via:
request chassis routing-engine master switch
o Replace the faulty RE (in a service window in case of redundancy)
• In case of backup RE failure, replace it in a service window
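The solution steps above can be written down as a small decision helper. This is a minimal Python sketch. The function name and its parameters are hypothetical, and the returned strings only restate the steps of this section.

```python
def re_failure_steps(failed_re, backup_took_over=True, redundant=True):
    """Return the ordered actions for a routing-engine (RE) failure.

    failed_re        -- "primary" or "backup"
    backup_took_over -- whether the backup RE is already master
    redundant        -- whether the chassis has two routing engines
    """
    steps = []
    if failed_re == "primary":
        if redundant and not backup_took_over:
            # Manual switchover, using the command given in this section.
            steps.append("request chassis routing-engine master switch")
        if redundant:
            steps.append("replace the faulty RE in a service window")
        else:
            steps.append("replace the faulty RE")
    elif failed_re == "backup":
        steps.append("replace the backup RE in a service window")
    else:
        raise ValueError("failed_re must be 'primary' or 'backup'")
    return steps


print(re_failure_steps("primary", backup_took_over=False))
```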
Further reference:
2.2.6 Link failure
Diagnostics:
The monitor interface command can be used to check whether a link is currently
transmitting traffic. show chassis hardware detail will show which PIC and SFP are
present.
Impact:
Common causes:
• Fiber failures
• GBIC/XENPAK failure
• PIC failure
• Other side failure
Solution:
If no obvious alarms or related network error conditions are present on the
reporting devices, our hardware supplier should be able to help diagnose whether
faulty equipment is causing the link failures.
Further reference:
2.2.7 Management IP unreachable (ICMP)
Diagnostics:
The commands below should be run in case of management IP failures (the DCN
gateway can be found as backup-router in the configuration):
#primary RE
ping <DCN gateway>
#backup RE
request routing-engine login other
ping <DCN gateway>
Also try to ping the IP from a DCN management station to verify the alarm reported
by the fault management system.
Impact:
• This could indicate that the routing-engine failed (see routing engine failure)
Common causes:
Solution:
Further reference:
N/A
2.2.8 In-band Loopback IP unreachable (ICMP)
Diagnostics:
ping <loopback>
show route <ip>
If a loopback is not reachable, there must be a major problem with this node. Other
alarms should be present for this node, or neighboring nodes should report
problems.
Impact:
• Access: This could indicate that the node is down and not providing any
service.
• Core: This could mean that the complete node is down. In the core another
system will have taken over.
Common causes:
Solution:
An in-band loopback down will probably be caused by one of the other events (most
likely a component failure or power failure).
Further reference:
N/A
2.2.9 BGP neighbor
• EBGP neighbors. These can be recognized because the remote peer does not
have our own AS number. There are several EBGP neighbor types.
o Peer -> This neighbor is connected via a public exchange. Typically
there will be several neighbors in this category which are down
(because we have so many). As long as only a small number are down,
this does not cause any network problem. This type of neighbor is only
present at BR (border) routers. No action should be taken for peers
that have been down for less than 24 hours.
o Transit -> These sessions are also only present at BR routers. We
always have multiple transits which back each other up. They provide
reachability to the complete internet for us. If a transit is down it must
be fixed as soon as possible.
o Content -> From here we retrieve special content. Currently there is
only one: the NOB connected to the MBR (multicast BR). Most of the
time these will be redundant setups.
o Customers -> These are present at IAR1X routers. These will be
individual business customers. The customer should be contacted in
case of problems.
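The classification above can be sketched as a small triage helper in Python. The function names, the type labels and the returned action strings are hypothetical. The action for the content type and for peers down 24 hours or longer is not stated in this document and is an assumption. The AS numbers in the example come from the private range and are placeholders.

```python
def session_type(local_as, peer_as):
    """IBGP when the remote peer uses our own AS number, otherwise EBGP."""
    return "IBGP" if local_as == peer_as else "EBGP"


def ebgp_action(neighbor_type, hours_down):
    """Return the first action for an EBGP neighbor that is down."""
    if neighbor_type == "peer":
        # Peers down for less than 24 hours require no action.
        if hours_down < 24:
            return "no action"
        return "investigate"  # assumption: not spelled out in this guide
    if neighbor_type == "transit":
        return "fix as soon as possible"
    if neighbor_type == "customer":
        return "contact the customer"
    if neighbor_type == "content":
        return "investigate"  # assumption: not spelled out in this guide
    raise ValueError("unknown neighbor type: " + neighbor_type)


# Placeholder AS numbers from the private range.
print(session_type(64512, 64513), ebgp_action("peer", hours_down=3))
```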
Diagnostics:
The commands below should be run in case of BGP neighbor events.
For most IBGP failures, other failures which correlate with the BGP event
should also be present (e.g. link-down, ISIS).
Impact:
• IBGP sessions down -> Normally systems are connected to two BGP neighbors
for redundancy. If both are down this could be service affecting.
• EBGP session down -> For important traffic, redundant BGP sessions should be
available. If both are down this could affect customer reachability.
Common causes:
IBGP:
• ISIS routing problems
• Configuration error on a newly commissioned system
• Remote neighbor failure
EBGP:
• Neighbor router failure
Solution:
Please note that, in most cases, nothing has to be configured on the reporting node
itself to solve BGP neighbor issues.
• Verify remote router
• Verify ISIS is running normally
Further reference:
2.2.10 ISIS adjacency
Diagnostics:
Impact:
• If from a topology perspective a system is isolated from all its other ISIS
neighbors then the system will not be able to provide any services.
Common causes:
Solution:
Please note that, in most cases, nothing has to be done at the reporting router
itself to solve an ISIS neighbor failure.
Further reference:
2.2.11 VRRP
VRRP is used on SAR routers to provide redundancy to connected hosts. Normally two
VRRP routers are present, where one is the master and the other the backup
router. When the master fails the backup will take over. When the uplink
interface of the master fails, mastership will also move to the backup node.
Diagnostics:
Normally with a VRRP event there should also be another event which is the cause of
the VRRP alarm (link down, node down, switch down).
Impact:
• Normally the backup router will take over and no service interruption
should occur.
Common causes:
Solution:
Further reference:
N/A
2.2.12 LDP neighbor/MPLS
LDP is used in combination with VPNs in the network. VPNs are used, for example, for
VOD, VOIP and wholesale traffic.
Diagnostics:
The commands below should be run in case of LDP neighbor failures or MPLS problems.
Impact:
• Systems should have multiple redundant LDP sessions. If multiple sessions are
down it can cause reachability problems within the VPNs and affect VOD
and VOIP services.
Common causes:
Solution:
Please note that, in most cases, nothing has to be done at the reporting router
itself to solve MPLS/LDP events.
Further reference:
N/A
2.2.13 PIM neighbor/multicast
Multicast traffic is only forwarded in the network if the PIM, BGP and ISIS protocols
work correctly. A special router performs the rendezvous point (RP) function.
This router should be reachable at all times.
Diagnostics:
Impact:
• Multiple PIM neighbors should be present. If more than one PIM neighbor is
down it can mean that no multicast is flowing through the router
• If the RP is not reachable no multicast traffic can be sent to this router
Common causes:
• ISIS problems
• RP failure
• Remote node failure
• Configuration failure
• BGP problems
Solution:
Please note that, in most cases, nothing has to be done at the reporting router
itself to solve a multicast/PIM event.
Further reference:
N/A
2.3 Non-fault management alarms or undocumented events
2.3.1 Undocumented event
• Do the basic checks for all failures on the node where you think the problem is
• Check if there is customer impact
• Check if you can find a common node in the topology (consult the NGN
network drawing) which causes the problem
• Try to relate the problem to a certain protocol or service
• Use the service documentation to troubleshoot the service itself
• Always inform the next escalation level that something undocumented at
platform level has happened, so this document can be improved.
2.3.2 Network slow
If there are complaints that the network is “slow”, check the main network
capacity indicators:
• Peering points
• NGN – Cisco backbone interconnection
• Transit interconnection
Check for huge traffic spikes (might be a DDoS attack) or huge traffic declines (might
be a routing problem).
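A check for spikes and declines can be sketched in Python as a comparison of the current traffic rate against a baseline. The function name and both factors are illustrative assumptions. This document does not define what counts as a "huge" spike or decline, so the defaults must be tuned per interconnection.

```python
def classify_traffic(current_bps, baseline_bps,
                     spike_factor=3.0, decline_factor=0.3):
    """Compare the current rate with a baseline rate.

    Returns "spike" (possible DDoS attack), "decline" (possible routing
    problem) or "normal". Both factors are assumed defaults.
    """
    if baseline_bps <= 0:
        raise ValueError("baseline_bps must be positive")
    ratio = current_bps / baseline_bps
    if ratio >= spike_factor:
        return "spike"
    if ratio <= decline_factor:
        return "decline"
    return "normal"


# Example: 9 Gbit/s measured against a 2 Gbit/s baseline.
print(classify_traffic(9e9, 2e9))
```

The baseline could be, for example, the rate at the same time of day in the previous week, taken from the capacity graphs of the interconnection.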
2.3.3 Reachability problem
If it has been verified in the fault management system that no problems are currently
present which could cause this, a routing problem may have occurred. This
can happen after changes (check which RFCs have been performed) or after certain
events (for example, a router which has never been used before has become active
after a switchover).
Escalate to the next level if there is customer impact.
2.3.4 Complete service/product not working
In this case make sure that all outstanding alarms in the fault management system are
confirmed and that they cannot impact the service which is currently not working. It
is highly unlikely that a complete service is down without any alarms being raised
for it.
2.4 Disaster recovery
Almost all JUNOS based systems in the Tele2-Versatel network have a dual
routing-engine setup (there are a small number of single-RE chassis systems in the
network). It is highly unlikely that we will ever come into a situation where we
lose the complete router and all configuration. Below you can find a procedure to
recover the configuration in case of disaster.
Configuration -> Repository -> Display -> <select device> -> <latest date>
edit
load override terminal
<cut and paste configuration>
CTRL-D
commit synchronize
2.5 Hardware maintenance verification
Juniper has an excellent Network Operations guide available which documents what
to do with hardware failure, maintenance and replacement. See further reference
where to find the hardware documentation (this includes per component how to
verify correct behavior).
This will show the present hardware; the specific component command will show the
status of the component(s).
Further reference:
http://www.juniper.net/techpubs/software/nog/nog-hardware/html/nog-hardwareTOC.html