5-minute initial troubleshooting on Brocade equipment

created by elonden on Jun 18, 2013 11:22 PM, last modified by elonden on Sep 4, 2013 4:15 PM
Version 2
Very often the HDS support organisation (GSC) gets involved in cases where a massive amount of host
logs, array dumps, and FC and IP traces has been collected, easily adding up to many gigabytes of data.
This is then accompanied by a very terse problem description such as "I have a problem with my host,
can you check?".
I'm sure the intention to provide us with all the data is good, but the problem is the lack of detail
about the problem itself. We do require a detailed explanation of what the problem is, when it occurred,
and whether it is still ongoing.
There are also things you can do yourself before opening a case with HDS. On many occasions you'll find
that the feedback you get from us within 10 minutes either fixes the problem or provides a simple
workaround that reduces its impact. Further troubleshooting can then be done in a somewhat less
stressful time frame.
This example lists some things you can check on a Brocade platform. (Mainly because many of the
problems I see are related to fabric issues and my job is primarily focused on storage networking.)
First of all, take a look at the overall health of the switch:
Command: switchstatusshow
Explanation: Provides an overview of the general components of the switch. These all need
to show up as HEALTHY and not (as shown here) as "Marginal".
Sample output (truncated):
Sydney_ILAB_DCX-4
Switch Health Report
Switch Name: Sydn
IP address: 10.129.2
SwitchState: MARG
Duration: 214:29
Power supplies monito
Temperatures monitor
Fans monitor
WWN servers monitor
CP monitor
Blades monitor
Core Blades monitor
Flash monitor
Marginal ports monitor
Faulty ports monitor
Missing SFPs monitor
Error ports monitor
All ports are healthy
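If you collect this output regularly, a small script can flag anything that is not HEALTHY. The following is a minimal sketch only. It assumes each component line has the form "<component> monitor <STATE>" (the state column is cut off in the sample above), and the function name and the sample text are hypothetical.

```python
import re

# Assumed line layout: "<component> monitor   <STATE>", e.g. "Fans monitor   HEALTHY".
MONITOR_LINE = re.compile(r"^(?P<component>.+?)\s+monitor\s+(?P<state>[A-Z]+)\s*$")


def unhealthy_components(report: str) -> dict:
    """Return {component: state} for every monitor line that is not HEALTHY."""
    problems = {}
    for line in report.splitlines():
        match = MONITOR_LINE.match(line.strip())
        if match and match.group("state") != "HEALTHY":
            problems[match.group("component")] = match.group("state")
    return problems


# Hypothetical, hand-written sample; not captured from a real switch.
SAMPLE = """\
Power supplies monitor   HEALTHY
Fans monitor             HEALTHY
Marginal ports monitor   MARGINAL
"""

if __name__ == "__main__":
    print(unhealthy_components(SAMPLE))
```

An empty result means every component line that was recognised reported HEALTHY; it says nothing about lines the pattern did not match.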
Command: switchshow
Explanation: Provides a general overview of logical switch status (no physical components)
plus a list of ports and their status.
The switchState should always be Online.
The switchDomain should have a unique ID in the fabric.
If zoning is configured it should be in the "ON" state.
As for the ports, these should all be "Online" for connected and operational ports. If you see
ports showing "No_Sync" while the port is not disabled, there is likely a cable or SFP/HBA problem.
If you have configured FabricWatch to enable port fencing you'll see indications like here
with port 75.
Obviously, for any port to work it should be enabled.
Sample output (truncated):
Sydney_ILAB_DCX-4
switchName: Sydne
switchType: 77.3
switchState: Online
switchMode: Native
switchRole: Principa
switchDomain: 143
switchId: fffc8f
switchWwn: 10:00:0
zoning: ON (Broc
switchBeacon: OFF
FC Router: OFF
Fabric Name: FID 1
Allow XISL Use: OF
LS Attributes: [FID:
Index Slot Port Addre
=================
0 1 0 8f0000
1 1 1 8f0100
2 1 2 8f0200
3 1 3 8f0300
4 1 4 8f0400
5 1 5 8f0500
6 1 6 8f0600
7 1 7 8f0700
8 1 8 8f0800
75 2 11 8f4b00
76 2 12 8f4c00
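The "No_Sync but not disabled" check can be scripted against saved switchshow output. This is a rough sketch under stated assumptions: the port lines are assumed to have the director-style columns Index, Slot, Port, Address, Media, Speed, State, Proto followed by free text (most of these columns are cut off in the sample above), and the function name and sample lines are hypothetical.

```python
def suspect_ports(switchshow_text):
    """Return (slot, port) tuples for ports in No_Sync that are not disabled."""
    suspects = []
    for line in switchshow_text.splitlines():
        fields = line.split()
        # Assumed columns: Index Slot Port Address Media Speed State Proto [comment...]
        if len(fields) < 8:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit() and fields[2].isdigit()):
            continue  # header, separator or switch attribute line
        state = fields[6]
        comment = " ".join(fields[8:])
        if state == "No_Sync" and "Disabled" not in comment:
            suspects.append((int(fields[1]), int(fields[2])))
    return suspects


# Hypothetical, hand-written port lines; not captured from a real switch.
SAMPLE = """\
  0    1    0   8f0000   id    N8   Online      FC  F-Port  10:00:00:00:00:00:00:01
  1    1    1   8f0100   id    N8   No_Sync     FC
  2    1    2   8f0200   id    N8   No_Sync     FC  Disabled
"""

if __name__ == "__main__":
    print(suspect_ports(SAMPLE))
```

Ports returned here are the ones worth checking for cable, SFP or HBA problems first. Fixed-port switches without a Slot column would need a different column mapping.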
Command: sfpshow <slot>/<port>
Explanation: One of the most important parts of a link, irrespective of mode and distance, is
the SFP. On newer hardware and software sfpshow provides a lot of information on the
overall health of the link.
With older FOS versions there could be a discrepancy between what was displayed in this
output and what was actually plugged into the port. The reason was that the SFPs are only
polled every now and then for status and update information. If a port was persistently
disabled it was not updated at all, so in theory you could plug in another SFP and sfpshow
would still display the old information. With FOS 7.0.1 and up this has been corrected and
you can now also see the latest polling time per SFP.
The question we often get is: "What should these values be?". The answer is
"It depends". As you can imagine, a shortwave 4G SFP requires less current than a
longwave 100 km SFP, so in essence the SFP specifications should be consulted. As a rule
of thumb, signal quality depends on the TX power value minus the link-loss budget. The
result should be within the RX power specifications of the receiving SFP.
Also check the Current and Voltage of the SFP. If an SFP is broken, the indication is often
that it draws no power at all and you'll see these two values drop to zero.
Sample output (truncated):
Sydney_ILAB_DCX-4
Identifier: 3 SFP
Connector: 7 LC
Transceiver: 540c404
Encoding: 1 8B10
Baud Rate: 85 (uni
Length 9u: 0 (units
Length 9u: 0 (units
Length 50u (OM2): 5
Length 50u (OM3): 0
Length 62.5u:2 (uni
Length Cu: 0 (units
Vendor Name: BROC
Vendor OUI: 00:05:1
Vendor PN: 57-1000
Vendor Rev: A
Wavelength: 850 (un
Options: 003a Loss
BR Max: 0
BR Min: 0
Serial No: UAF1104
Date Code: 101125
DD Type: 0x68
Enh Options: 0xfa
Status/Ctrl: 0x80
Alarm flags[0,1] = 0x5
Warn Flags[0,1] = 0x5
Temperature: 25 C
Current: 6.322 mA
Voltage: 3290.2 m
RX Power: -3.2 dB
TX Power: -3.3 dB
State transitions: 1
Last poll time: 06-20-2
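The Current and Voltage check is easy to automate on saved sfpshow output. The sketch below is illustrative only: it assumes the two fields appear as "Current: <number> ..." and "Voltage: <number> ..." at the start of a line, as in the sample above, and the function names and the zero floor are assumptions rather than vendor thresholds.

```python
import re


def read_sfp_power(sfpshow_text):
    """Return (current, voltage) as floats in the units printed, or None if absent."""
    current = re.search(r"^\s*Current:\s*([0-9.]+)", sfpshow_text, re.MULTILINE)
    voltage = re.search(r"^\s*Voltage:\s*([0-9.]+)", sfpshow_text, re.MULTILINE)
    if not current or not voltage:
        return None
    return float(current.group(1)), float(voltage.group(1))


def draws_no_power(sfpshow_text, floor=0.0):
    """True when both Current and Voltage are at or below the floor (assumed 0)."""
    reading = read_sfp_power(sfpshow_text)
    if reading is None:
        return False  # fields missing, nothing to judge
    current, voltage = reading
    return current <= floor and voltage <= floor


# The two lines below are copied from the (truncated) sample in this article.
HEALTHY_SAMPLE = "Current: 6.322 mA\nVoltage: 3290.2 m"

if __name__ == "__main__":
    print(read_sfp_power(HEALTHY_SAMPLE), draws_no_power(HEALTHY_SAMPLE))
```

A result of True only suggests a dead SFP; a missing or unsupported digital-diagnostics readout may look different and would need its own handling.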
Command: porterrshow
Explanation: For link state counters this is the most useful command on the switch. However,
there is a perception that it provides a "silver bullet" for solving port and link issues, and
that is not the case. Basically it provides a snapshot of the content of the LESB (Link Error
Status Block) of a port at that particular point in time. It does not tell us when these
counters accumulated or over which time frame. So in order to create a sensible picture of
the status of the ports we need a baseline. This baseline can be created by resetting all
counters so they start from zero. To do this, issue the "statsclear" command on the CLI.
There are 7 columns you should pay attention to from a physical perspective.
enc_in - Encoding errors inside frames. These are errors that happen at the FC-1 layer when
encoding 8 bits to 10 and back or, with 10G and 16G FC, from 64 bits to 66 and back. Since
these happen on bits that are part of a data frame they are counted in this column.
crc_err - An enc_in error might lead to a CRC error; however, this column shows frames that
have been marked as invalid because of a CRC error earlier in the datapath. According to the
FC specifications it is up to the implementer whether to discard the frame right away or mark
it as invalid and send it to the destination anyway. There are pros and cons to both
approaches. So basically, if you see crc_err in this column it means the port has received a
frame with an incorrect CRC but the corruption occurred further upstream.
crc_g_eof - This column is the same as crc_err, except that the incoming frames are NOT
marked as invalid. If you see these, most often the enc_in counter increases as well, but not
necessarily. If the enc_in and/or enc_out column increases as well there is a physical link
issue which could be resolved by cleaning connectors, replacing a cable or (in rare cases)
replacing the SFP and/or HBA. If the enc_in and enc_out columns do NOT increase there is an
issue between the SERDES chip and the SFP which causes the CRC to mismatch the frame. This
is a firmware issue which could be resolved by upgrading to the latest FOS code. There are a
couple of defects listed to track these.
Sample porterrshow output (truncated):
Sydney_ILAB_DCX-4S_L
frames
tx rx
0: 100.1m 53.4m
1: 466.6k 154.5k
2: 476.9k 973.7k
3: 474.2k 155.0k
enc_out - Similar to enc_in, this is the same encoding error, but it occurred outside normal
frame boundaries, i.e. no host IO frame was impacted. This may seem harmless; however, be
aware that a lot of primitive signals and sequences which are paramount for fibre-channel
operations travel in between normal data frames. Primitives which regulate credit flow
(R_RDY and VC_RDY) and signal clock synchronization are especially important. If this column
increases on any port you'll likely run into performance problems sooner or later, or you
will see a problem with link stability and sync errors (see below).
Link_Fail - This means a port has received a NOS (Not Operational) primitive
from the remote side and it needs to change the port operational state to LF1
(Link Fail 1) after which the recovery sequence needs to commence. (See the
FC-FS standards specification for that)
Loss_Sync - Loss of synchronization. The transmitter and receiver side of the
link maintain a clock synchronization based on primitive signals which start
with a certain bit pattern (K28.5). If the receiver is not able to sync its baud-rate
to the rate where it can distinguish between these primitives it will lose sync
and hence it cannot determine when a data frame starts.
Loss_Sig - Loss of Signal. This column shows a drop of light, i.e. no light (or
insufficient RX power) is observed for over 100 ms, after which the port goes into a
non-active state. This counter often increases when the link-loss budget is overdrawn. Take,
for instance, a TX side that sends out light at -4 dBm and a receiver whose lower sensitivity
threshold is -12 dBm. If the quality of the cable deteriorates the signal to a value lower
than that threshold, you will see the port bounce very often and this counter increase.
Other frequent culprits are unclean connectors, patch panels and badly made fibre splices.
These ports should be shut down immediately and the cabling plant checked. Replacing
cables and/or bypassing patch panels is often a quick way to find out where
the problem is.
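The link-loss arithmetic in the Loss_Sig example can be written out as a short calculation. This is a sketch only: the -4 and -12 values come from the example above, while the 3 dB and 9 dB loss figures and the function names are made up for illustration. Real limits must come from the SFP specifications.

```python
def rx_power_dbm(tx_power_dbm, link_loss_db):
    """Expected power at the receiver: TX power minus total loss on the link."""
    return tx_power_dbm - link_loss_db


def link_margin_db(tx_power_dbm, link_loss_db, rx_sensitivity_dbm):
    """Headroom above the receiver's lower sensitivity threshold (negative = too weak)."""
    return rx_power_dbm(tx_power_dbm, link_loss_db) - rx_sensitivity_dbm


if __name__ == "__main__":
    tx, sensitivity = -4.0, -12.0  # values from the Loss_Sig example
    for loss in (3.0, 9.0):        # hypothetical cable plant losses in dB
        margin = link_margin_db(tx, loss, sensitivity)
        verdict = "ok" if margin >= 0 else "below RX threshold"
        print(f"loss {loss} dB -> RX {rx_power_dbm(tx, loss)} dBm, margin {margin} dB ({verdict})")
```

With these numbers the link tolerates at most 8 dB of loss; anything beyond that puts the received signal under the threshold, which is when the port starts to bounce.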
The other columns are more related to protocol issues and/or performance
problems, which could be the result of a physical problem but not its cause.
In short, look at the 7 columns mentioned above and check that no port
shows an increasing value.
============================================
too_short/too_long - indicates a protocol error where SOF or EOF are
observed too soon or too late. These two columns rarely increase.
bad_eof - Bad End-of-Frame. This column indicates an issue where the sender
has observed an abnormality in a frame or its transceiver whilst the
frame header and portions of the payload were already sent to the destination.
The only way for a transceiver to notify the destination is to invalidate the
frame. It truncates the frame and adds an EOFni or EOFa to the end. This
signals to the destination that the frame is corrupt and should be discarded.
F_Rjt and F_Bsy are often seen in FICON environments where control frames
could not be processed in time or are rejected based on fabric configuration or
fabric status.
c3timout (tx/rx) - These counters indicate that a port is not able to
forward a frame to its destination in time. They show either a problem
downstream of this port (tx) or a problem on this port, where it has received a
frame meant to be forwarded to another port inside the same switch (rx).
Frames are ALWAYS discarded at the RX side (since that's where the buffers
hold the frame). The tx column is an aggregate of all rx ports that need to
send frames via this port according to the routing tables created by FSPF.
pcs_err - Physical Coding Sublayer. These values represent encoding errors
on 16G platforms and above. Since 16G speeds have changed to 64b/66b
encoding/decoding, there is a separate control structure that takes care of this.
As a best practice it is wise to keep a record of these port errors and create a
new baseline every week. This allows you to quickly identify errors and solve
them before they become a problem with a prolonged resolution time.
Make sure you do this fabric-wide to maintain consistency across all switches
in that fabric.
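Comparing a new snapshot against the baseline is simple to script once the counters are in a dictionary. The sketch below is a hedged illustration: the counter names follow the column names used in this article, the {port: {counter: value}} input layout is an assumption (parsing the raw porterrshow text is left out), and parse_count assumes the k/m/g abbreviations seen in the sample output stand for thousands, millions and billions.

```python
# The 7 physical columns named in this article.
PHYSICAL_COUNTERS = (
    "enc_in", "crc_err", "crc_g_eof", "enc_out",
    "Link_Fail", "Loss_Sync", "Loss_Sig",
)

_SUFFIX = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_count(text):
    """Turn an abbreviated counter such as '466.6k' into an integer (assumed notation)."""
    text = text.strip().lower()
    if text and text[-1] in _SUFFIX:
        return round(float(text[:-1]) * _SUFFIX[text[-1]])
    return int(text)


def increased_counters(baseline, current):
    """Return {port: {counter: delta}} for physical counters that grew since the baseline."""
    report = {}
    for port, counters in current.items():
        before = baseline.get(port, {})
        deltas = {
            name: counters.get(name, 0) - before.get(name, 0)
            for name in PHYSICAL_COUNTERS
            if counters.get(name, 0) > before.get(name, 0)
        }
        if deltas:
            report[port] = deltas
    return report


if __name__ == "__main__":
    # Hypothetical snapshots keyed by slot/port.
    baseline = {"1/0": {"enc_out": 0, "crc_err": 2}}
    current = {"1/0": {"enc_out": 15, "crc_err": 2}, "1/1": {"Loss_Sig": 0}}
    print(increased_counters(baseline, current))
```

Running the same comparison on every switch in the fabric, against baselines taken at the same time, keeps the results comparable.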
Make sure that all of these physical issues are solved first. No software can compensate for hardware
problems, and the HDS support organization will give you this task anyway before starting work on the
issue.
As for which information to collect, please refer to https://tuf.hds.com where you will find pages for all
GSC-supported products and instructions on how to collect it.
