
Storage Diagnostics and Troubleshooting

Participant Guide

Global Education Services


LSI Corporation

3rd edition (July 2008)

Table of Contents

Terms and Conditions .............................................................................................. 5

Storage Systems Diagnostics and Troubleshooting Course Outline ............................... 9

Module 1: Storage System Support Data Overview ................................................... 13


All Support Data Capture ..................................................................................... 14
Major Event Log (MEL) Overview ......................................................................... 17
State Capture Data File ....................................................................................... 32
Accessing the Controller Shell.............................................................................. 34
Logging In To the Controller Shell (06.xx) ............................................................ 34
Logging In To the Controller Shell (07.xx) ............................................................ 34
Controller Analysis.............................................................................................. 35
Additional Output ............................................................................................... 48
Knowledge Check ............................................................................................... 50
Additional Commands ......................................................................................... 51
Debug Queue..................................................................................................... 56
Knowledge Check ............................................................................................... 59
Modifying Controller States.................................................................................. 60
Diagnostic Data Capture (DDC) ........................................................................... 62
Knowledge Check ............................................................................................... 65

Module 3: Configuration Overview and Analysis....................................................... 67


Configuration Overview and Analysis.................................................................... 68
Knowledge Check ............................................................................................... 74
Drive and Volume State Management................................................................... 75
Volume Mappings Information ............................................................................. 92
Knowledge Check ............................................................................................... 94
Portable Volume Groups in 07.xx ......................................................................... 95
RAID 6 Volumes in 07.xx..................................................................................... 96
Troubleshooting Multiple Drive Failures ................................................................ 97
Offline Volume Groups ...................................................................................... 106
Clearing the Configuration................................................................................. 108
Recovering Lost Volumes .................................................................................. 109
Knowledge Check ............................................................................................. 114

Module 4: Fibre Channel Overview and Analysis .................................................... 115


Fibre Channel................................................................................................... 116
Fibre Channel Arbitrated Loop (FC-AL) ............................................................... 116
Fibre Channel Arbitrated Loop (FC-AL) – The LIP ................................................ 117
Knowledge Check ............................................................................................. 122
Drive Side Architecture Overview ....................................................................... 123
Knowledge Check ............................................................................................. 139

Destination Driver Events .................................................................................. 140
Read Link Status (RLS) and Switch-on-a-Chip (SOC)............................................ 143
What is SOC or SBOD?...................................................................................... 148
Field Case........................................................................................................ 160
Drive Channel State Management ...................................................................... 161
SAS Backend.................................................................................................... 163

Appendix A: SANtricity Managed Storage Systems .................................................. 173


6998 /6994 /6091 (Front) ................................................................................. 174
6998 /6994 /6091 (Back) .................................................................................. 174
3992 (Back) ..................................................................................................... 175
3994 (Back) ..................................................................................................... 176
4600 16-Drive Enclosure (Back)......................................................................... 176
4600 16-Drive Enclosure (Front) ........................................................................ 176

Appendix B: Simplicity Managed Storage Systems .................................................. 178


1333 ............................................................................................................... 178
1532 ............................................................................................................... 179
1932 ............................................................................................................... 180
SAS Drive Tray (Front)...................................................................................... 181
SAS Expansion Tray (Back) ............................................................................... 181

Appendix C – State, Status, Flags (06.xx) .............................................................. 183

Appendix D – Chapter 2 - MEL Data Format ........................................................... 189

Appendix E – Chapter 30 – Data Field Types .......................................................... 203

Appendix F – Chapter 31 – RPC Function Numbers ................................................. 215

Appendix G – Chapter 32 – SYMbol Return Codes................................................... 229

Appendix H – Chapter 5 - Host Sense Data ............................................................ 261

Appendix I – Chapter 11 – Sense Codes ................................................................ 279

Terms and Conditions
Agreement
This Educational Services and Products Terms and Conditions (“Agreement”) is between
LSI Corporation (“LSI”), a Delaware corporation, doing business in AL, AZ, CA, CO, CT,
DE, FL, GA, KS, IL, MA, MD, MN, NC, NH, NJ, NY, OH, OR, PA, SC, UT, TX, VA and WA
as LSI Corporation, with a place of business at 1621 Barber Lane, Milpitas, California
95035 and you, the Student. By signing this Agreement, or clicking on the “Accept”
button as appropriate, Student accepts all of the terms and conditions set forth below.
LSI reserves the right to change or modify the terms and conditions of this Agreement
at any time.

Course materials
The course materials are derived from end-user publications and engineering data
related to LSI’s Engenio Storage Group (“ESG”) and reflect the latest information
available at the time of printing but will not include modifications if they occurred after
the date of publication. In all cases, if there is discrepancy between this information and
official publications issued by LSI, LSI’s official publications shall take precedence.

LSI assumes no obligation for the accuracy or correctness of the course materials,
assumes no obligation to correct any errors contained herein or to advise Student of
such errors, and assumes no liability for the accuracy or correctness of the course
materials provided to Student. LSI
makes no commitment to update the course materials and LSI reserves the right to
change the course materials, including any terms and conditions, from time to time
at its sole discretion. LSI reserves the right to seek all available remedies for any illegal
misuse of the course materials by Student.

Certification
Student acknowledges that purchasing or participating in an LSI course does not imply
certification with respect to any LSI certification program. To obtain certification,
Student must successfully complete all required elements in an applicable LSI
certification program. LSI may update or change certification requirements at any time
without notice.

Ownership
LSI and its affiliates retain all right, title and interest in and to the course materials,
including all copyrights therein. LSI grants Student permission to use the course
materials for personal, educational purposes only. The resale, reproduction, or
distribution of the course materials, and the creation of derivative works based on the
course materials, is prohibited without the prior express written permission of LSI.
Nothing in this Agreement shall be construed as an assignment of any patents,
copyrights, trademarks, or trade secret information or other intellectual property rights.

Testing
While Student is participating in a course, LSI may test Student's understanding of the
subject matter. Furthermore, LSI may record the Student's participation in a course with
videotape or other recording means. Student agrees that LSI is the owner of all such
test results and recordings, and may use such test results and recordings subject to
LSI's privacy policy.

Software license
All software utilized or distributed as course materials, or an element thereof, is licensed
pursuant to the license agreement accompanying the software.

Indemnification
Student agrees to indemnify, defend and hold LSI, and all its officers, directors, agents,
employees and affiliates, harmless from and against any and all third party claims for
loss, damage, liability, and expense (including reasonable attorney's fees and costs)
arising out of content submitted by Student, Student's use of course materials (except
as expressly outlined herein), or Student's violations of any rights of another.

Disclaimer of warranties
THE COURSE MATERIALS (INCLUDING ANY SOFTWARE) ARE PROVIDED ON AN “AS
IS” AND “AS AVAILABLE” BASIS, WITHOUT WARRANTY OF ANY KIND. LSI DOES
NOT WARRANT THAT THE COURSE MATERIALS: WILL MEET STUDENT'S
REQUIREMENTS; WILL BE UNINTERRUPTED, TIMELY, SECURE, OR ERROR-FREE; OR
WILL PRODUCE RESULTS THAT ARE RELIABLE. LSI EXPRESSLY DISCLAIMS ALL
WARRANTIES, WHETHER EXPRESS, IMPLIED OR STATUTORY, ORAL OR WRITTEN,
WITH RESPECT TO THE COURSE MATERIALS, INCLUDING WITHOUT LIMITATION THE
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE WITH RESPECT TO THE SAME. LSI EXPRESSLY DISCLAIMS ANY WARRANTY
WITH RESPECT TO ANY TITLE OR NONINFRINGEMENT OF ANY THIRD-PARTY
INTELLECTUAL PROPERTY RIGHTS, OR AS TO THE ABSENCE OF COMPETING CLAIMS,
OR AS TO INTERFERENCE WITH STUDENT’S QUIET ENJOYMENT.

Limitation of liability
STUDENT AGREES THAT LSI SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, CONSEQUENTIAL OR EXEMPLARY DAMAGES, INCLUDING BUT
NOT LIMITED TO, DAMAGES FOR LOSS OF PROFITS, GOODWILL, USE, DATA OR
OTHER SUCH LOSSES, ARISING OUT OF THE USE OR INABILITY TO USE THE COURSE
MATERIALS, EVEN IF LSI HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES. LSI'S LIABILITY FOR DAMAGES TO STUDENT FOR ANY CAUSE
WHATSOEVER, REGARDLESS OF THE FORM OF ANY CLAIM OR ACTION, SHALL NOT
EXCEED THE AGGREGATE FEES PAID BY STUDENT FOR THE USE OF THE COURSE
MATERIALS INVOLVED IN THE CLAIM.

Miscellaneous
Student agrees to not export or re-export the course materials without the appropriate
United States and foreign government licenses, and shall otherwise comply with all
applicable export laws. In the event that course materials in the form of software is
acquired by or on behalf of a unit or agency of the United States government (the
“Agency”), the Agency agrees that such software is comprised of “commercial computer
software” and “commercial computer software documentation” as such terms are used
in 48 C.F.R. 12.212 (Sept. 1995) and is provided to the Agency for evaluation or
licensing (A) by or on behalf of civilian agencies, consistent with the policy set forth in
48 C.F.R. 12.212; or (B) by or on behalf of units of the Department of Defense,
consistent with the policies set forth in 48 C.F.R. 227-7202-1 (June 1995) and
227.7203-3 (June 1995).
This Agreement shall be governed by and construed in accordance with the laws of the
State of California, without regard to its choice of law or conflict of law provisions. In the
event of any conflict between foreign laws, rules and regulations and those of the
United States, the laws, rules and regulations of the United States shall govern.
In any action or proceeding to enforce the rights under this Agreement, the prevailing
party shall be entitled to recover reasonable costs and attorneys' fees.
In the event that any provision of this Agreement shall, in whole or in part, be
determined to be invalid, unenforceable or void for any reason, such determination shall
affect only the portion of such provision determined to be invalid, unenforceable or void,
and shall not affect the remainder of such provision or any other provision of this
Agreement. This Agreement constitutes the entire agreement between LSI and
Student relating to the course materials and supersedes any prior agreements, whether
written or oral, between the parties.

Trademark acknowledgments
Engenio, the Engenio design, HotScale™, SANtricity, and SANshare™ are trademarks or
registered trademarks of LSI Corporation. All other brand and product names may be
trademarks of their respective companies.

Copyright notice
© 2006, 2007, 2008 LSI Corporation. All rights reserved

Agreement accepted by Student (Date):

Agreement not accepted by Student (Date):

Storage Systems Diagnostics and Troubleshooting Course
Outline
Course Description:
Storage Systems Diagnostics and Troubleshooting is an advanced course that presents
the technical aspects of diagnosing and troubleshooting LSI-based storage systems
through advanced data analysis and in-depth troubleshooting.

The basic objective of this course is to equip the participants with the essential concepts
associated with troubleshooting and repairing LSI-based storage systems using
SANtricity™ Storage Management software, analysis of support data, or controller shell
commands.

The information contained in the course is derived from internal engineering publications
and is confidential to LSI Corporation. It reflects the latest information available at the
time of printing but may not include modifications if they occurred after the date of
publication.

Prerequisites:

Ideally the successful student will have completed both the Installation and
Configuration and the Support and Maintenance courses offered by Global Education
Services at LSI Corporation.

However, an equivalent knowledge of storage management, installation, basic
maintenance and problem determination with LSI-based storage systems can be
substituted.

Students should have at least 6 months field exposure with LSI storage products and
technologies in a support function.

Audience:
This course is designed for customer support personnel responsible for diagnosing and
troubleshooting LSI storage systems through the use of support data analysis and
controller shell access. The course is designed for individuals employed as Tier 3 support
of LSI-based storage systems.

It is assumed that the student has in-depth experience and knowledge with Fibre
Channel Storage Area Network (SAN) technologies including RAID, Fibre Channel
topology, hardware components, installation, and configuration.

Course Length:
Approximately 4 days in length with 60% lecture and 40% hands-on lab.

Course Objectives
Upon completion of this course, the participant will be able to:
• Recognize the underlying behavior of LSI-based storage systems
• Analyze a storage system for failures through the analysis of support data
• Successfully analyze backend fibre channel errors
• Successfully interpret configuration errors

Course Modules
1. Storage System Support Data Analysis
2. Storage System Level Overview
3. Configuration Overview and Analysis
4. IO Driver and Drive Side Error Reporting and Analysis

Module 1: Storage System Support Data Overview

Upon completion, the participant should be able to do the following:


• Describe the purpose of the files that are included within the All Support Data
Capture
• Analyze the Major Event Log at a high level in order to diagnose an event
Lab
• Gather the support data file
• Analyze a MEL event
• Diagram the events in a MEL that lead to an error

Module 2: Storage System Level Overview

Upon completion, the participant should be able to do the following:


• Log into the controller shell
• Identify and modify the controller states
• Recognize the battery function within the controllers
• Describe the network functionality
• List developer functions available within the controller shell commands
Lab
• Log into the controller shell
• Modify controller states

Module 3: Configuration Overview and Analysis

Upon completion, the participant should be able to do the following:


• Describe the difference between the legacy configuration structures and the new
07.xx firmware configuration database
• Analyze an array’s configuration from shell output and recognize any errors in
the configuration
LAB
• Fix configuration errors on live system

Module 4: IO Driver and Drive Side Error Reporting and Analysis

Upon completion, the participant should be able to do the following:


• Describe how fibre channel topology works
• Determine how fibre channel topology relates to the different protocols that LSI
uses in its storage array products
• Analyze backend errors for problem determination and isolation
LAB
• Analyze backend data case studies

Module 1: Storage System Support Data Overview
Upon completion, the participant should be able to do the following:
• Describe the purpose of the files that are included within the All Support Data
Capture
• Analyze the Major Event Log at a high level in order to diagnose an event

All Support Data Capture

• ZIP archive of useful debugging files


• Some files are for development use only and are not readable by support personnel
• Typically the first item requested for new problem analysis

• Benefits
– Provides a point-in-time snapshot of system status.
– Contains all logs needed for a ‘first look’ at system failures.
– Easy customer interface through the GUI.
– Non-disruptive

• Drawbacks
– Requires GUI accessibility.
– Can take some time to gather on a large system.

All Support Data Capture

All Support Data Capture Files - 06.xx.xx.xx
• driveDiagnosticData.bin
– Drive log information contained in a binary format.
• majorEventLog.txt
– Major Event Log
• NVSRAMdata.txt
– NVSRAM settings from both controllers
• objectBundle
– Binary format file containing java object properties
• performanceStatistics.csv
– Current performance statistics by volume
• persistentReservations.txt
– Volumes with persistent reservations will be noted here
• readLinkStatus.csv
– RLS diagnostic information in comma separated value format
• recoveryGuruProcedures.html
– Recovery Guru procedures for all failures on the system
• recoveryProfile.csv
– Log of all changes made to the configuration
• socStatistics.csv
– SOC diagnostic information in comma separated value format
• stateCaptureData.dmp/txt
– Informational shell commands run on both controllers
• storageArrayConfiguration.cfg
– Saved configuration for use in the GUI script engine
• storageArrayProfile.txt
– Storage array profile
• unreadableSectors.txt
– Unreadable sectors will be noted here, noting the volume and drive LBA

All Support Data Capture Files - 07.xx.xx.xx
• Contains all the same files as the 06.xx.xx.xx releases but adds 3 new files.
– Connections.txt
• Lists the physical connections between expansion trays
– ExpansionTrayLog.txt
• ESM event log for each ESM in the expansion trays
– featureBundle.txt
• Lists all premium features and their status on the system

• Most useful files for first-look system analysis and troubleshooting


– stateCaptureData.dmp/txt
– majorEventLog.txt
– storageArrayProfile.txt
– socStatistics.csv
– readLinkStatus.csv
– recoveryGuruProcedures.html

Major Event Log (MEL) Overview
Major Event Log Facts
• Array controllers log events and state transitions to an 8192 event circular buffer.

• Log is written to DACSTOR region of drives.

– Log is permanent
– Survives:
• Power cycles
• Controller swaps

• SANtricity can display log, sort by parameters and save to file.

• Only critical errors send SNMP traps and Email alerts
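
To illustrate the circular-buffer behavior, here is a conceptual sketch only (not LSI controller code) showing why the MEL retains just the most recent 8192 events: once the buffer is full, each new event overwrites the oldest, while the sequence number keeps increasing.

    # Conceptual sketch of a fixed-size circular event log (illustrative only).
    class CircularEventLog:
        def __init__(self, capacity=8192):
            self.capacity = capacity
            self.entries = [None] * capacity
            self.sequence = 0                  # monotonically increasing sequence number

        def log(self, event):
            self.entries[self.sequence % self.capacity] = (self.sequence, event)
            self.sequence += 1                 # once full, new events overwrite the oldest

    log = CircularEventLog()
    for n in range(10000):
        log.log("event %d" % n)
    # Only the most recent 8192 events (sequence numbers 1808-9999) remain in the buffer.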

A Details Window from a MEL log (06.xx)

General Raw Data Categories (06.xx)

General Raw Data Categories (07.xx)

Byte Swapping

• Remember, when byte swapping, to select all of the bytes in the field

• NOTE: Do not swap the nibbles

– e.g. Value is not “00 00 00 00 00 00 01 fa”
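
To make the note above concrete, here is a short Python illustration; the eight-byte field used here is hypothetical (the actual bytes from the slide are not reproduced). Byte swapping reverses whole bytes, keeping the two hex digits of each byte together; reversing the hex string nibble by nibble splits the bytes apart and produces a misleading value like the one called out above.

    # Hypothetical raw field bytes as they appear in the dump, left to right.
    raw = ["af", "10", "00", "00", "00", "00", "00", "00"]

    # Correct: reverse whole bytes, keeping each byte's two hex digits together.
    byte_swapped = "".join(reversed(raw))     # '00000000000010af' -> 0x10af

    # Incorrect: reversing nibble by nibble splits the bytes apart.
    nibble_swapped = "".join(raw)[::-1]       # '00000000000001fa' -> not the real value

    print(byte_swapped, nibble_swapped)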

Comparison of the Locations of the Summary Information and Raw Data (06.xx)

Quick View of the Locations Raw Data Fields (06.xx)

MELH - Signature
MEL version - 2 means 5.x code or 06.x code
Event Description - Includes: Event Group, Component, Internal Flags, Log Group &
Priority
I/O Origin – refer to the MEL spec for the event type
Reporting Controller - 0=A 1=B
Valid? - 0=Not valid 1=Valid data
O1 - Number of Optional Data Fields
O2 - Total length of all of the Optional Data Fields in Hex
F1 - Length of this optional data field
F2 - Data field type (If there is a value of 0x8000 this is a continuation of
the previous optional data field. This would be read as a continuation
of the previous data field type 0x010d.)
F3 - The “cc” means drive side channel and the following value refers to
the channel number and is 1 relative.
Sense Data - Vendor specific depending on the component type.
N/U - Not Used

Comparison of the Locations of the Summary Information and Raw
Data (07.xx)

Quick View of the Locations Raw Data Fields (07.xx)

Event Description - includes: Event Group, Component, Internal Flags, Log Group & Priority

Location – Decode based on the component type

Valid? - 0=Not valid 1=Valid data

1. – I/O Origin
2. - Reserved
3. - Controller reported by (0=A 1=B)
4. - Number of optional data fields present
5. - Total length of optional Data
6. - Single optional field length
7. - Data field type, data field types that begin with 0x8000 are a continuation of the
previous data field of the same type

Sense Data - vendor specific depending on the component type.

MEL Summary Information

• Date/Time: Time of the event adjusted to the management station local clock
• Sequence number: Order that the event was written to the MEL
• Event type: Event code; check the MEL Specification for a list of all event types
• Event category: Category of the event (Internal, Error, Command)
• Priority: Either informational or critical
• Description: Description of the event type
• Event specific codes: Information related to the event (if available)
• Component type: Component the event is associated with
• Component location: Physical location of the component the event is associated with
• Logged by: Controller which logged the event

Event Specific Codes

• Skey/ASC/ASCQ
– Defined in Chapter 11 (06.xx), 12 (07.xx) of the Software Interface Spec
• AEN Posted events
– Event 3101
• Drive returned check condition events
– Event 100a

• Return status/RPC function/null


– Defined in Chapter 31 & 32 of the MEL Spec (06.16)
• Controller return status/function call for requested operation
events
– Event 5023

Controller Return States

• Return status and RPC function call as defined in the MEL Specification

Event Specific Codes

• Return Status

0x01 = RETCODE_OK

• RPC Function Call

0x07 = createVolume_1()

Event Specific Codes

• SenseKey /ASC /ASCQ

6/3f/80 = Drive no longer usable (The controller set the drive state to
“Failed – Write Failure”)

AEN Posted for recently logged event (06.xx)

• Byte 14 = 0x7d (FRU)

• Bytes 26 & 27 = 0x02 & 0x05 (FRU Qualifiers)

• Values decoded using the Software Interface Specification, Chapter 5 (06.xx)

• FRU Qualifiers are decoded depending on what the FRU value is

Sense Data (SIS Chapter 5)
• Byte 14 FRU = 0x7d
– FRU is Drive Group (Devnum = 0x60000d)

• Byte 26 = 0x02
– Tray ID = 2

• Byte 27 = 0x05
– Slot = 5

AEN posted for recently logged event (06.xx)

• Byte 14 = 0x06 (FRU)

• Bytes 26 & 27 = 0xd5 & 0x69 (FRU Qualifiers)

• Values decoded using the Software Interface Specification, Chapter 5 (06.xx)

• FRU Qualifiers are decoded depending on the FRU code

Sense Data (SIS Chapter 5)

• SenseKey / ASC / ASCQ


6/3f/c7 = Non Media Component Failure

• Byte 14 FRU = 0x06


– FRU is Subsystem Group

• Byte 26 = 0xd5

1101 0101 – ignoring the high-order bit, the remaining bits give 0x55 = tray 85

• Byte 27 = 0x69

0110 1001

– Device State (upper three bits) = 0x3 = Missing
– Device Type Identifier (lower five bits) = 0x09 = Nonvolatile Cache
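
The two decodes above reduce to a few lines of bit arithmetic. The sketch below is illustrative only: the bit layout (tray in the low seven bits of byte 26; device state in the upper three bits and device type in the lower five bits of byte 27) is inferred from the example values, and the function name is hypothetical. The Software Interface Specification remains the authority.

    # Illustrative decode of Subsystem Group (FRU 0x06) qualifier bytes.
    def decode_subsystem_fru_qualifiers(byte26, byte27):
        tray = byte26 & 0x7F                  # low seven bits:   0xd5 -> 0x55 = tray 85
        device_state = (byte27 >> 5) & 0x07   # upper three bits: 0x69 -> 0x3  = Missing
        device_type = byte27 & 0x1F           # lower five bits:  0x69 -> 0x09 = Nonvolatile Cache
        return tray, device_state, device_type

    print(decode_subsystem_fru_qualifiers(0xd5, 0x69))   # (85, 3, 9)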

Automatic Volume Transfer

• IO Origin field
o 0x00 = Normal AVT
o 0x01 = Forced AVT
• LUN field
o Number of volumes being transferred
o Will be 0x00 if it is a forced volume transfer


Mode Select Page 2C

• IOP ID Field
o Contains the Host Number that issued the Mode Select (referenced in the
tditnall command output)

• Optional data is defined in the Software Interface Specification, section 6.15 (or 5.15)

Module 2: Storage System Analysis
Upon completion, the participant should be able to do the following:
• Log into the controller shell
• Identify and modify the controller states
• Recognize the battery function within the controllers
• Describe the network functionality
• List developer functions available within the controller shell commands

State Capture Data File

• Series of controller shell commands run against both controllers

• Different firmware levels run different sets of commands

• Some information still needs to be gathered manually

Amethyst/Chromium (06.16.xx,06.19.xx/06.23.xx)

The following commands are collected in the state capture for the Amethyst and
Chromium releases:

moduleList spmShowMaps fcAll
arrayPrintSummary spmShow socShow
cfgUnitList getObjectGraph_MT showEnclosures
vdShow ccmStateAnalyze netCfgShow
cfgUnitList i showSdStatus
cfgUnit ionShow 99 dqprint
ghsList showEnclosuresPage81 printBatteryAge
cfgPhyList fcDump dqlist

Chromium 2 State Capture Additions (06.60.xx.xx)


The release of Chromium 2 (06.60.xx.xx) introduced the following additional commands
to the state capture dump.

tditnall luall fcHosts 3
iditnall ionShow 12 svlShow
fcnShow excLogShow getObjectGraph_MT 99*
chall ccmStateAnalyze 99**

* getObjectGraph_MT 99 replaced the individual getObjectGraph_MT calls used in previous
releases

** ccmStateAnalyze 99 replaces the ccmStateAnalyze used in previous releases

Crystal (07.10.xx.xx)
The following commands are collected in the state capture for the Crystal release:

evfShowOwnership luall hwLogShow
rdacMgrShow ionShow spmShowMaps
vdmShowDriveTrays fcDump spmShow
vdmDrmShowHSDrives fcAll 10 fcHosts
evfShowVol showSdStatus getObjectGraph_MT
vdmShowVGInfo ionShow 99 ccmShowState
bmgrShow discreteLineTableShow netCfgShow
bidShow ssmShowTree inetstatShow
tditnall ssmDumpEncl dqprint
iditnall socShow dqlist
fcnShow showEnclosuresPage81 taskInfoAll

Accessing the Controller Shell

• Accessed via RS-232 port on communication module

• Default settings are 38,400 baud, 8-N-1 no flow control

• 06.xx firmware controllers allow access to the controller shell over the network via
rlogin

• 07.xx firmware controllers allow access to the controller shell over the network via
telnet

• Always capture your shell session using your terminal's capturing functionality

Logging In To the Controller Shell (06.xx)


• If logging in serially, get command prompt by sending Break signal, followed by Esc
key when prompted.
– Using rlogin, you may be prompted for a login name; use “root”

• Enter password when prompted:


– Infiniti

• Command prompt is a ‘right arrow’ ( -> )

• The shell allows user to access controller firmware commands & routines directly

Logging In To the Controller Shell (07.xx)


• If logging in serially, get command prompt by sending Break signal, followed by Esc
key when prompted.
– Otherwise shell access can be gained via the telnet protocol.

• You will be prompted for a login name, use “shellUsr”

• Enter password when prompted:


– wy3oo&w4

• Command prompt is a ‘right arrow’ ( -> )

• The shell allows user to access controller firmware commands & routines directly.

Controller Analysis

Controller Analysis
• bidShow 255 (07.xx)

• Driver level information, similar to bmgrShow but for development use

getObjectGraph_MT / getObjectGraph_MT 99
• Prior to Chromium 2 (06.60.xx.xx), and in Crystal (07.xx) the getObjectGraph_MT
command was used several times to collect the following:
• getObjectGraph_MT 1 – Controller Information
• getObjectGraph_MT 4 – Drive Information
• getObjectGraph_MT 8 – Component Status

• As of Chromium 2 (06.60.xx.xx) the state capture utilizes getObjectGraph_MT 99,
which collects the entire object graph including controller, drive, component, and
volume/configuration data.
• The object graph is actually used by the Storage Manager software to provide the
visual representation of the current array status.
• The output of getObjectGraph_MT can be used to determine individual component
status.
The downside of using the getObjectGraph_MT output is that it is somewhat
complicated and cryptic; however, it can be very valuable in determining problems with
the information being reported to the customer via Storage Manager.

Additional Output

Knowledge Check

Analyze the storageArrayProfile.txt file to find the following information:

Controller Firmware version:


Board ID:
Network IP Address
Controller A:
Controller B:
Volume Ownership (by SSID)
Controller A:
Controller B:

ESM Firmware Version:

Find the same information in the StateCaptureData.txt file. List what command was
referenced to find the information.

Command Referenced
06.xx 07.xx
Controller Firmware version:
Board ID:
Network IP Address:
Volume Ownership (by SSID):
ESM Firmware Version:

Additional Commands

Debug Queue

• Used to log pertinent information about various firmware functions.

• Each core asset team can write to the debug queue.

• There is no standard for data written to the debug queue; each core asset team
writes the information it feels is needed for debug.

• The debug queue output is becoming increasingly important for problem
determination and root cause analysis.

• Because so much data is being written to the debug queue, it is important to gather
it as soon as possible after the initial failure.

• Because there is no standard for the data written to the debug queue, it is necessary
for multiple development teams to work in conjunction to analyze the debug queue.

• This makes it difficult to interpret from a support standpoint without development
involvement.

Debug Queue Rules
• First check ‘dqlist’ to verify which trace contains events during the time of failure

• It is possible that there may not be a debug queue trace file that contains the
timeline of the failure; in this case, no information can be gained

• First data capture is a must with the debug queue as information is logged very
quickly

• Even though a trace may be available for a certain timeframe, it is not a
guarantee that further information can be gained about a failure event

Summary

• Look at the first / last timestamps and remember that they’re in GMT.

• Don’t just type ‘dqprint’ unless you actually want to flush and print the ‘trace’
trace file (the one we’re currently writing new debug queue data to). Only typing
‘dqprint’ can actually make you lose the useful data if you’re not paying
attention.

• Keep in mind that the debug queue wasn’t designed for you to read, only for you
to collect and someone in development to read.

• Remember, even LSI developers, when looking at debug queue traces, need to
go back to the core asset team that actually wrote the code that printed specific
debug queue data, in order to decode it.

Knowledge Check

What command would you run to gather the following information:

Detailed process listing:


Available controller memory:
Lock status:

There is no need to capture controller shell login sessions.

True False

The Debug Queue should only be printed at development request.

True False

The Debug Module is needed for access to all controller shell commands.

True False

Modifying Controller States

• Controller states can be modified via the GUI to place a controller offline, in
service mode, or online, or to reset a controller

• These same functions can be achieved from the controller shell if GUI access is
not available

• Commands that end in _MT use the SYMbol layer and require that the network
be enabled, but do not require that the controller actually be on the network.
The controller must also be through Start Of Day

• The _MT commands are valid for both 06.xx and 07.xx firmware

• The legacy (06.xx and lower) commands are referenced in the ‘Troubleshooting
and Technical Reference Guide Volume 1’ on page 27

• To transfer all volumes from the alternate controller and place the
alternate controller in service mode

-> setControllerServiceMode_MT 1

-> cmgrSetAltToServiceMode (07.xx only)

• While the controller is in service mode it is still powered on and is available for
shell access. However it is not available for host I/O, similar to a ‘passive’ mode.

• To transfer all volumes from the alternate controller and place the
alternate controller offline

-> setControllerToFailed_MT 1

-> cmgrSetAltToFailed (07.xx only)

• While the controller is offline it is powered off and is unavailable for shell access.
It is not available for host I/O

• To place the alternate controller back online from either an offline
state, or from in service mode

-> setControllerToOptimal_MT 1

-> cmgrSetAltToOptimal (07.xx only)

• This will place the alternate controller back online and active; however, it will not
automatically redistribute the volumes to the preferred controller

• In order to reset a controller

• Soft reset controller


– Reboot

• Reset controller with full POST

– sysReboot
– resetController_MT 0

• Reset the alternate controller (06.xx)

– isp rdacMgrAltCtlReset

• Reset the alternate controller

– altCtlReset 2
– resetController_MT 1

Diagnostic Data Capture (DDC)

Brief History
• Multiple ancient IO events in the field

• Need for better diagnostic capability

• A common infrastructure that can be used for many such events

What is DDC (Diagnostic Data Capture)?

• A mechanism to capture sufficient diagnostic information about the
controller/array state at the time of an unusual event, and store the diagnostic
data for later retrieval/transfer to LSI Development for further analysis

• Introduced in Yuma 1.2 (06.12.16.00)

• Part of Agate (06.15.23.00)

• All future releases

Unusual events triggering DDC (as of 07.xx)

• Ancient IO

• Master abort due to bad address accessed by the fibre channel chip results in
PCI error

• Destination device number registry corruption

• EDC Error returned by the disk drives

• Quiescence failure of volumes owned by the alternate controller

DDC Trigger

• MEL event gets logged whenever DDC logs are available in the system

• A system-wide Needs Attention condition is created for successful DDC capture

• Batteries
– DDC is enabled if the system has batteries which are sufficiently charged
– DDC logs triggered by ancient IO MAY survive without batteries, as
ancient IO does not cause a hard reboot.

• No new DDC trigger if all of the following are true


– New event is of same type as previous
– New trigger happens within 10 minutes of the previous trigger
– Previous DDC logs have not been retrieved (DDC - NA is set)
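
The suppression rule above amounts to a single predicate; the sketch below simply restates the three conditions in Python (the names are hypothetical).

    from datetime import timedelta

    # Restatement of the DDC re-trigger suppression rule described above.
    def suppress_new_ddc_trigger(new_type, prev_type, new_time, prev_time, ddc_na_set):
        same_type     = (new_type == prev_type)
        within_10_min = (new_time - prev_time) <= timedelta(minutes=10)
        not_retrieved = ddc_na_set            # previous DDC logs have not been retrieved
        return same_type and within_10_min and not_retrieved   # all three must hold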

Persistency of DDC Information

• DDC info is persistent across power cycles and controller reboots, provided the
following is true:
– The system contains batteries which are sufficiently charged

DDC Logs format

• Binary

• Must be sent to LSI development to be analyzed

DDC CLI commands

• Commands to retrieve the DDC information


– save storageArray diagnosticData file="<filename>.zip";

• Command to clear the DDC NA


– reset storageArray diagnosticData;
– The CLI calls this command internally if retrieval is successful
– This can be called without any retrieval (just to clear the NA)

DDC MEL Events

• MEL_EV_DDC_AVAILABLE
– Event # 6900
– Diagnostic data is available
– Critical

• MEL_EV_DDC_RETRIEVE_STARTED
– Event # 6901
– Diagnostic data retrieval operation started
– Informational

• MEL_EV_DDC_RETRIEVE_COMPLETED
– Event # 6902
– Diagnostic data retrieval operation completed
– Informational

• MEL_EV_DDC_NEEDS_ATTENTION_CLEARED
– Event # 6903
– Diagnostic data Needs Attention status cleared
– Informational

Knowledge Check

1) A controller can only be placed offline via the controller shell interface.

True False

2) A controller in service mode is available for I/O.

True False

3) An offline controller is not available for shell access.

True False

4) DDC is to be collected and interpreted by support personnel.

True False

Module 3: Configuration Overview and Analysis
Upon completion, the participant should be able to do the following:
• Describe the difference between the legacy configuration structures and the new
07.xx firmware configuration database
• Analyze an array’s configuration from shell output and recognize any errors in
the configuration

Configuration Overview and Analysis

• In 06.xx firmware, the storage array configuration was maintained as data structures
resident in controller memory with pointers to related data structures

• The data structures were written to DACstore with physical references (devnums)
instead of memory pointer references

• A drawback of this design is that the physical references used in DACstore
(devnums) could change, which could cause a configuration error when the
controllers are reading the configuration information from DACstore

• As of 07.xx the storage array configuration has been changed to a database design

• The benefits are as follows:


– A single configuration database that is stored on every drive in a storage array
– Configuration changes are made in a transactional manner – i.e. updates are
either made in their entirety or not at all
– Provides support for > 2TB Volumes, increased partitions, increased host ports
– Unlimited Global Hot Spares
– More drives per volume group
– Pieces can be failed on a drive as opposed to the entire drive

Configuration Overview and Analysis

What does this mean to support?

• Drive States and Volume States have changed slightly

• Shell commands have changed

– cfgPhyList, cfgUnitList, cfgSetDevOper, cfgFailDrive, etc

Configuration Overview and Analysis (06.xx)

• How is the configuration of an 06.xx storage array maintained?

• Each component of the configuration is maintained via data structures


– Piece Structure
– Drive Structure
– Volume Structure

• Each structure contains a reference pointer to associated structures as well as
information directly related to its component

• Pieces
– Pieces are simply the slice of a disk that one volume is utilizing; there
could be multiple pieces on a drive, but a piece can only reference one
drive

• Piece Structures
– Piece structures maintain the following configuration data
• A pointer to the volume structure
• A pointer to the drive structure
• Devnum of drive that the piece resides on
• Spared devnum if a global hot spare has taken over
• The piece’s state

• Drive Structures
– Drive structures maintain the following configuration data
• The drives devnum and tray/slot information
• Blocksize, Capacity, Data area start and end
• The drive’s state and status
• The drive’s flags
• The number of volumes resident on the drive (assuming it is
assigned)
• Pointers to all pieces that are resident on the drive (assuming it is
assigned)

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 69
• Volume Structures
– Volume structures maintain the following information
• SSID number
• RAID level
• Capacity
• Segment size
• Volume state
• Volume label
• Current owner
• Pointer to the first piece
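
To make the pointer relationships described above concrete, here is a minimal, illustrative sketch (not the actual firmware structures; field names are simplified) showing how a piece points at both its volume and its drive, while the volume points only at its first piece.

    # Illustrative sketch of the 06.xx in-memory configuration structures.
    class Drive:
        def __init__(self, devnum, tray, slot):
            self.devnum, self.tray, self.slot = devnum, tray, slot
            self.pieces = []              # pointers to piece structures resident on this drive

    class Volume:
        def __init__(self, ssid, raid_level):
            self.ssid, self.raid_level = ssid, raid_level
            self.first_piece = None       # pointer to the first piece structure

    class Piece:
        def __init__(self, volume, drive, state="Optimal"):
            self.volume = volume          # pointer to the volume structure
            self.drive = drive            # pointer to the drive structure
            self.devnum = drive.devnum    # physical reference persisted in DACstore
            self.state = state
            drive.pieces.append(self)

When this graph is persisted to DACstore, the memory pointers are replaced with devnums, which is why a devnum change can break the configuration when it is read back, as noted earlier.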

06.xx configuration layout

Configuration Overview and Analysis (07.xx)

• How is the configuration of an 07.xx storage array maintained?

• Each component of the configuration is maintained via ‘records’ in the
configuration database
– Piece Records
– Drive Records
– RAID Volume Records
– Volume Group Records

• Each record maintains a reference to its parent record and its own specific state
info
• The “Virtual Disk Manager” (VDM) uses this information and facilitates the
configuration and I/O behaviors of each volume group
– VDM is the core module that consists of the drive manager, the piece
manager, the volume manager, the volume group manager, and
exclusive operations manager

• Pieces
– Pieces may also be referenced as ‘Ordinals’. Just remember that piece ==
ordinal and ordinal == piece

• Piece Records
– Piece records maintain the following configuration data
• A reference to the RAID Volume Record
• Update Timestamp of the piece record
• The persisted ordinal (what piece number, in stripe order, is this
record in the RAID Volume)
• The piece’s state

– Note that there is no reference to a drive record


– The update timestamp is set when the piece is failed
– The parent record for a piece is the RAID Volume record it belongs to

• Drive Records

– Drive records maintain the following configuration data


• The physical drive’s WWN
• Blocksize, Capacity, Data area start and end
• The drive’s accessibility, role, and availability states (more on this
later)
• The drive’s physical slot and enclosure WWN reference
• The WWN of the volume group the drive belongs to (assuming it
is assigned)
• The drive’s ordinal in the volume group (its piece number)
• Reasons for why a drive is marked incompatible, non-redundant,
or marked as non-critical fault
• Failure Reason
• Offline Reason

– Note that there is no reference to the piece record itself, only the ordinal
value
– The parent record for an assigned drive is the Volume Group record

• RAID Volume Records

– RAID Volume records maintain the following configuration data


• SSID
• RAID level
• Current path
• Preferred path
• Piece length
• Offset
• Volume state
• Volume label
• Segment size

– Volume Records only refer back to their parent volume group record via
the WWN of the volume group

• Volume Group Records

– Volume Group records simply maintain the following


• The WWN of the Volume Group
• The Volume Group Label
• The RAID Level
• The current state of the Volume Group
• The Volume Group sequence number

– Note that the Volume Group record does not reference anything but itself
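
For comparison, here is a minimal sketch of the 07.xx records (illustrative only; the real database schema and reference keys are simplified). Each record carries only WWN or ordinal references to its parent, so nothing depends on devnums or memory addresses.

    # Illustrative sketch of 07.xx configuration-database records.
    class VolumeGroupRecord:
        def __init__(self, vg_wwn, label, raid_level, state, sequence):
            # References nothing but itself.
            self.vg_wwn, self.label, self.raid_level = vg_wwn, label, raid_level
            self.state, self.sequence = state, sequence

    class RaidVolumeRecord:
        def __init__(self, ssid, vg_wwn, raid_level, state):
            self.ssid, self.raid_level, self.state = ssid, raid_level, state
            self.vg_wwn = vg_wwn          # parent reference: the volume group's WWN

    class PieceRecord:
        def __init__(self, parent_volume, ordinal, state="Optimal"):
            self.parent_volume = parent_volume   # parent reference: the RAID Volume record
            self.ordinal = ordinal               # piece number in stripe order; no drive reference
            self.state = state

    class DriveRecord:
        def __init__(self, drive_wwn, vg_wwn=None, ordinal=None):
            self.drive_wwn = drive_wwn
            self.vg_wwn = vg_wwn                 # volume group the drive belongs to, if assigned
            self.ordinal = ordinal               # the drive's piece number within that group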

07.xx configuration layout

Configuration Overview and Analysis

• There are several advantages that may not be immediately obvious

o The 06.xx configuration manager used devnums (which could change)
and arbitrary memory locations (which change on every reboot)

o 07.xx configuration uses hard set values such as physical device WWNs,
and internally set WWN values for RAID Volumes and Volume Groups
which will not change once created.

• The configuration database is maintained on all drives in the storage array

• Provides for a more robust and reliable means of handling failure scenarios

Knowledge Check

1) 06.xx config uses data structures or database records to maintain the


configuration?

2) 07.xx config database is stored on every drive.

True False

3) Shell commands to analyze the config did not change between 06.xx and 07.xx.

True False

4) What are the 3 data structures used for 06.xx config?

5) What are the 4 database records used for 07.xx config?

Drive and Volume State Management

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 75
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 76
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 77
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 78
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 79
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 80
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 81
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 82
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 83
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 84
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 85
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 86
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 87
Volume State Management
Beginning with Crystal there are different classifications for volume group states

• Complete – All drives in a group are present

• Partially Complete – Drives are missing; however, redundancy is available to allow
I/O operations to continue

• Incomplete – Drives are missing and there is not enough redundancy available to
allow I/O operations to continue

• Missing – All drives in a volume group are inaccessible

• Exported – Volume group and associated volumes are offline as a result of a user
initiated export (used in preparation for a drive migration)
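
These classifications can be thought of as a function of how many member drives are present and how many missing drives the RAID level can tolerate. The sketch below is a simplification (it ignores the Exported state, which is user initiated rather than derived from drive presence).

    # Simplified illustration of the volume group state classifications above.
    def classify_volume_group(total_drives, present_drives, tolerable_missing):
        # tolerable_missing: how many absent drives still allow I/O
        # (e.g. 0 for RAID 0, 1 for RAID 5, 2 for RAID 6).
        missing = total_drives - present_drives
        if present_drives == 0:
            return "Missing"              # all drives in the group are inaccessible
        if missing == 0:
            return "Complete"             # all drives are present
        if missing <= tolerable_missing:
            return "Partially Complete"   # redundancy still allows I/O to continue
        return "Incomplete"               # not enough redundancy left to continue I/O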

Hot Spare Behavior

• Only valid for non-RAID 0 volumes and volume groups

• Not valid if any volumes in the volume group are dead

• A hot spare can spare for a failed drive or NotPresent drive that has failed pieces

• If an InUse hot spare drive fails and that failure causes any volumes in the
volume group to transition to failed state, then the failed InUse hot spare will
remain integrated in the VG to provide the best chance of recovery

• If none of the volumes in the volume group are in the failed state, then the failed
InUse hot spare is de-integrated from the volume group making it a “failed
standby” hot spare and another optimal standby hot spare will be integrated

• If failure occurred due to reconstruction (read error), then the InUse hot spare
drive won’t be failed but it will be de-integrated from the volume group. We
won’t retry integration with another standby hot spare drive. This “read error”
information is not persisted or held in memory so we will retry integration if the
controller was ever rebooted or if there was an event that would start
integration.

• When copyback completes, the InUse hot spare drive is de-integrated from its
group and is transitioned to a Standby Optimal hot spare drive.

• New hot spare features (07.xx)


– An integrated hot spare can be made the permanent member of the
volume group it is sparing in via a user action in SANtricity Storage
Manager

Volume Mappings Information

Knowledge Check

1) For 07.xx list all of the possible:


Drive accessibility states:

Drive role states:

Drive availability states:

2) What command(s) would you reference in order to get a quick look at all volume
states?

06.xx: 07.xx:

3) What command(s) would you reference in order to get a quick look at all drive
states?

06.xx: 07.xx:

Portable Volume Groups in 07.xx

• Previously, drive migrations were performed via a system of checking NVSRAM
bits, marking volume groups offline, removing drives, and finally carefully
re-inserting drives into the receiving system one at a time and waiting for the
group to be merged and brought online.

• This procedure is now gone and has been replaced by portable volume group
functionality.

• Portable volume group functionality provides a means of safely removing and moving an entire drive group from one storage system to another

• Uses the model of “Exporting” and “Importing” the configuration on the associated disks

• “Exporting” a volume group performs the following


o Volumes are removed from the current configuration and configuration
database synchronization ceases

o The Volume Group is placed in the “Export” state and the drives marked
offline and spun down

o Drive references are removed once all drives in the “Exported” volume
group are physically removed from the donor system

• Drives can now be moved to the receiving system


o Once all drives are inserted to the receiving system the volume group
does not immediately come online

o The user must specify that the configuration of the new disks be
“Imported” to the current system configuration

o Once “Imported” the configuration data on the migrated group and the
existing configuration on the receiving system are synchronized and the
volume group is brought online

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 95
RAID 6 Volumes in 07.xx

• First we should get the “Marketing” stuff out of the way

o RAID 6 is provided as a premium feature

o RAID 6 will only be supported on the Winterpark (399x) platform due to controller hardware requirements

• XBB-2 (Which will release with Emerald 7.3x) will support RAID 6

o RAID 6 Volume Groups can be migrated to systems that do not have RAID 6 enabled via a feature key but only if the controller hardware supports RAID 6

• The volume group that is migrated will continue to function; however, a needs attention condition will be generated because the premium features will not be within limits
The Technical Bits

• LSI’s RAID 6 implementation is of a P+Q design (a generic sketch of the P and Q calculations follows this list)

o P is for parity, just like we’ve always had for RAID 5 and can be used to
reconstruct data

o Q is for the differential polynomial calculation which when used with Gaussian elimination techniques can also be used to reconstruct data

o It’s probably easier to think of the “Q” as CRC data

• A RAID 6 Volume Group can survive up to two drive failures and maintain access
to user data

• Minimum number of drives for a RAID 6 Volume Group is five drives with a
maximum of 30

• There is some additional capacity overhead due to the need to store both P and
Q data (i.e. the capacity of two disks instead of one like in RAID 5)

• Recovery from RAID 6 failures only requires slight modification of RAID 5 recovery procedures
o Revive up to the third drive to fail
o Reconstruct the first AND second drive to fail

• Reconstructions on RAID 6 volume groups will take about twice as long as a normal RAID 5 reconstruction
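
A generic P+Q sketch (this is the common Reed-Solomon style formulation, shown here for illustration only and not necessarily LSI’s exact implementation): for the data blocks D0 ... Dn-1 in a stripe,

P = D0 xor D1 xor ... xor Dn-1
Q = (g^0 * D0) xor (g^1 * D1) xor ... xor (g^(n-1) * Dn-1)

where g is a generator of the Galois field GF(2^8) and the multiplications are performed in that field. Losing any two data blocks leaves two independent equations in two unknowns, which can be solved (the Gaussian elimination referenced above) to rebuild the missing blocks; P alone rebuilds a single lost block, exactly as in RAID 5.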

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 96
Troubleshooting Multiple Drive Failures

• When addressing a multiple drive failure, there are several key pieces of information
that need to be determined prior to performing any state modifications.

• RAID Level
o Is it a RAID 6?
– RAID 6 volume group failures occur after 3 drives have failed in
the volume group
o Is it a RAID 3/5 or RAID 1?
– RAID 5 volume group failures occur after two drives have failed in
a volume group.
o RAID 1 volume group failures occur when enough drives fail to cause an
incomplete mirror.
– This could be as few as two drives or half the drives + 1.
o RAID 0 volume groups are dead upon the first drive failure

• Despite the drive failures, is each individual volume group configuration complete?
– i.e. Are all drives accounted for, regardless of failed or optimal?

• How many drives have failed, and to which volume group does each drive belong?

• In what order did the drives fail in each individual volume group?

• Are there any global hot spares?

o Are any of the hot spares in use?
o Are there any hot spares not in use, and if so, are they in an optimal
condition?

• Are there any backend errors that led to the initial drive failures?
o This is the most common cause of multiple drive failures; all backend
issues must be fixed or isolated before continuing any further

Multiple Drive Failures – Why RAID Level is Important

• RAID 6 Volume Groups


o RAID 6 volume groups can survive 2 drive failures due to the P+Q
redundancy model; after the third drive failure the volume group is
marked as failed
o Up until the third drive failure, data in the stripe is consistent across the
drives

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 97
• RAID 5 and RAID 3 Volume Groups
o After the second drive failure the volume group and associated volumes
are marked as failed; no I/Os have been accepted since the second drive
failed
o Up until the second drive failure, data in the stripe is consistent across
the drives

• RAID 1 Volume Groups


o RAID 1 volume groups can survive multiple drive failures as long as one
side of the mirror is still optimal
o RAID 1 volume groups can be failed after only two drives fail if both the
data drive and the mirror drive fail
o Until the mirror becomes incomplete the RAID 1 pairs will function
normally

• RAID 0
o As there is no redundancy these arrays cannot generally be recovered.
However, the drives can be revived and checked – no guarantees can be
made that the data will be recovered.

Multiple Drive Failures – Configuration Considerations


• Although there are several mechanisms to ensure configuration integrity, there are
failure scenarios that may result in configuration corruption

• If the failed volume group’s configuration is incomplete, reviving and reconstructing drives could permanently corrupt user data

• If any of the drives have an ‘offline’ status (06.xx), reviving drives could revert them
to an unassigned state

• How can this be avoided?


o Check to see if the customer has an old profile that shows the appropriate
configuration for the failed volume group(s)

o If the volume group configuration appears to be incomplete, corrupted, or if there is any doubt – escalate immediately

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 98
Multiple Drive Failures – How Many Drives?
• Assuming the volume group configuration is complete and all drives are accounted
for, you need to determine how many drives are failed

• Make a list of the failed drives in each failed volume group

• Using the output of ionShow 12, determine whether or not these drives are in an
open state
o If the drives are in a closed state they will be inaccessible and attempts
to spin up, revive, or reconstruct will likely fail

Multiple Drive Failures – What’s the failure order?


• Failure order is important for RAID 6, RAID 3/5, and RAID 1 volume group failures.

• Determining the failure order is just as important as determining the status of the
failed volume group’s configuration

• Failure order should be determined from multiple data points

o The Major Event Log (MEL)

o Timestamp information from the drive’s DACstore (06.xx)

o Timestamp information from the failed piece (07.xx)

• Often, failures occur close together and will show up either at the same
timestamp or within seconds of each other in the MEL

Multiple Drive Failures – What’s the failure order? (06.xx)

• In order to obtain information from DACstore the drive must be spun up

isp cfgPrepareDrive,0x<phydev>

Note: this is the only command that uses the “phydev” address not the
devnum address

• This command will spin the drive up, but not place it back in service.
It will still be listed as failed by the controller.
However since it is spun up, it will service direct disk reads of the DACstore region
necessary for the following commands.

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 99
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 100
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 101
I’ve got my failure order, what’s next?

• Using the information on the previous slides you should have now determined what
the failure order is of the drives.

• Special considerations need to be made depending on the RAID level of the failed
volume group

o For RAID 6 volume groups, the most important piece of information is the
first two drives that failed

o For RAID 5 volume groups, the most important piece of information is the
first drive that failed

o For RAID 1 volume groups, the most important piece of information is the
first drive that failed causing the mirror to break.

• Before making any modifications to the failed drives, any unused global hot spares
should be failed to prevent them from sparing for drives unnecessarily.

o To fail the hot spares

– Determine which unused hot spares are to be failed

– From the GUI


• Select the drive
• From the Advanced menu select Recovery >> Fail Drive

– From the controller shell


• Determine the devnums of the hot spares that are to be failed
• Using the devnum enter

– isp cfgFailDrive,0x<devnum> (06.xx)

– setDriveToFailed_MT 0x<devnum> (06.xx & 07.xx)
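
For example, a minimal shell sequence for failing one unused standby hot spare (the devnum 0x10002 is purely hypothetical; use the devnums identified from your drive listing):

-> setDriveToFailed_MT 0x10002

Repeat for each unused standby hot spare before touching the failed data drives.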

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 102
Reviving Drives
• Begin with the last drive that failed and revive drives until the volume group
becomes degraded

• From the GUI


o Select the last drive to fail and from the Advanced menu select
Recovery >> Revive >> Drive
o Check to see if the volume group is degraded; if not, move on to the next
drive (Last -> First) and revive it.
Repeat this step until the volume group is degraded
o Volume group and associated volumes should now be in a degraded
state.

• From the controller shell


o Using the devnum of the drive perform the following

• isp cfgSetDevOper,0x<devnum> (06.xx)

• setDriveToOptimal_MT 0x<devnum> (06.xx & 07.xx)

o Check to see if the volume group is degraded; if not, move on to the next
drive (Last -> First) and revive it. Repeat this step until the volume group
is degraded

o The volume group and associated volumes should now be in a degraded state

• Mount volumes in read-only (if possible) and verify data
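
A hypothetical shell walk-through of the revive loop for a failed RAID 5 volume group (the devnums 0x10015 and 0x10014 are illustrative only; always start with the LAST drive to fail and stop as soon as the group reports degraded):

-> setDriveToOptimal_MT 0x10015
(check the volume group state; if it is still failed, revive the next-to-last drive)
-> setDriveToOptimal_MT 0x10014

The remaining failed drive(s) are then handled by reconstruction in the Cleanup step.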

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 103
Cleanup
• If data checks out, reconstruct the remaining failed drives, replace drives as
warranted

– From the GUI


• Select the drive
• From the Advanced menu select
Recovery >> Reconstruct Drive

– From the controller shell

• Using the devnum of the drive perform the following

• isp cfgReplaceDrive,0x<devnum> (06.xx)

• startDriveReconstruction_MT 0x<devnum> (06.xx & 07.xx)

• Once reconstructions have begun, the previously failed hot spares can be revived

– From the GUI

• Select the hot spare drive to be revived
• From the Advanced menu select
Recovery >> Revive >> Drive

– From the controller shell

• Using the devnum of the drive perform the following

– isp cfgSetDevOper,0x<devnum> (06.xx)

– setDriveToOptimal_MT 0x<devnum> (06.xx & 07.xx)
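
A hypothetical cleanup sequence from the shell (the devnum 0x10013 is illustrative; it would be the first drive to fail, which was left failed by the revive step), followed by reviving a previously failed standby hot spare using the same revive command shown earlier:

-> startDriveReconstruction_MT 0x10013
-> setDriveToOptimal_MT 0x10002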

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 104
Multiple Drive Failures – A Few Final Notes

• If there is any doubt about the failure order, the array configuration, or you are
simply not confident – find a senior team member to consult with prior to taking any
action.
– Beyond this you can ALWAYS escalate

• You are dealing with a customer’s data, be mindful of this at all times.

– Think about what you are doing, establish a plan based on high level
facts

– Take your time

– Write down the information as you review the data

– If something doesn’t look right, ask a co-worker or escalate

• RAID 0 Volume Groups

– Revive the drives, check the data.

– There is no guarantee that data will be recovered, and depending on the nature of the drive failure the array may not stay optimal long enough to use the data.

• If there are multiple drive failures, there is a chance that a backend problem is at fault

– DO NOT PULL AND RESEAT DRIVES

– Every attempt should be made to resolve any backend issues prior to changing drive states.

– Get the failure order information, address the backend issue, spin up
drives and restore access.

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 105
Offline Volume Groups

Offline Volume Groups (06.xx)


• As a protection mechanism in 06.xx configuration manager, if all members (drives)
of a volume group are not present during start of day, the controller will mark the
associated volume group offline until all members are available

• This behavior can cause situations where a volume group is left in an offline status
with all drives present, or with one drive listed as out of service

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 106
Offline Volume Groups (06.xx)

• IMPORTANT: If a group is offline, it is unavailable for configuration changes.
That means that if any drives in the associated volume group are failed and
revived, they will not be configured into the volume group, but will transition to
an unassigned state instead

• In order to bring a volume group online through the controller shell with no
pieces out of service, or only one piece out of service

– isp cfgMarkNonOptimalDriveGroupOnline,<SSID>

• Where ‘SSID’ is the SSID of any volume in the group; this only needs to be run once, against any one volume in the group
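
For example, if a volume in the offline group has SSID 2 (the SSID here is hypothetical; use any SSID belonging to the affected group):

-> isp cfgMarkNonOptimalDriveGroupOnline,2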

Offline Volume Groups (07.xx)


• Because 07.xx firmware does not implement this functionality, it is not expected
that this will be a concern for 07.xx systems

• Volume Groups that do not have all members (drives) present during start of day
will transition to their appropriate state

– Partially Complete – Degraded


– Incomplete – Dead
– Missing

• Even though the group is listed as degraded or dead, it is possible that all
volumes will still be in an optimal state since no pieces are marked as out
of service

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 107
Clearing the Configuration

• In extreme situations it may be necessary to clear the configuration from the system

• This can be accomplished by either clearing the configuration information from the
appropriate region in DACstore or by completely wiping DACstore from the drives
and rebuilding it during start of day

• The configuration can be reset via the GUI

– Advanced >> Recovery >> Reset >> Configuration (06.xx)

– Advanced >> Recovery >> Clear Configuration >> Storage Array (07.xx)

• To wipe the configuration information

– sysWipe

• This command must be run on both controllers.


• For 06.xx systems, the controllers must be rebooted once the
command has completed.
• As of 07.xx the controllers will reboot automatically once the
command has completed

• To wipe DACstore from all drives

– sysWipeZero 1 (06.xx)

– dsmWipeAll (07.xx)

• After either of these commands, the controllers must be rebooted in order to write new DACstore to all the drives

• To wipe DACstore from a single drive

– isp cfgWipe1,0x<devnum> (06.xx)

• Either the controllers must be rebooted in order to write new DACstore to the drive, or it must be (re)inserted into a system

– dsmWipe 0x<devnum>,<writeNewDacstore> (07.xx)

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 108
• Where <writeNewDacstore> is either a 0 to not write new DACstore until
start of day or the drive is (re)inserted into a system, or a 1 to write new
clean DACstore once it has been cleared
• There are times when the Feature Enable Identifier key becomes
corrupt; in order to clear it and generate a new Feature Enable Identifier,
use the following command.

• safeSysWipe (06.xx and 07.xx)

• For 07.xx systems, you must also remove the safe header from the
database
• dbmRemoveSubRecordType 18 (07.xx)

Note: This is a very dangerous command as it wipes out a record in the database – make sure you type “18” and not another number

• Once this has been completed on both controllers, they will need to both
be rebooted in order to generate a new ID.

• All premium feature keys will need to be regenerated with the new ID
and reapplied.

Recovering Lost Volumes

• There are times that volumes are lost and need to be recovered, either due to a
configuration problem with the storage array, or the customer simply deleted the
wrong volume

• Multiple pieces of information must be known about the missing volume in order to ensure data
recovery
– Drives and Piece Order of the drives in the missing volume group
– Capacity of each volume in the volume group
– Disk offset where each volume starts
– Segment Size of the volumes
– RAID level of the group
– Last known state of the drives

• This information can be obtained from historical capture all support data files
relatively easily

• Finding Drive and Piece order

– Old Profile in the ‘Volume Group’ section

– vdShow or cfgUnit output in the stateCaptureData.dmp file (06.xx)

– evfShowVol output in the stateCaptureData.txt file (07.xx)

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 109
• Finding Capacity, Offset, RAID level, and Segment size

– vdShow or cfgUnit output in the stateCaptureData.dmp file (06.xx)

– evfShowVol output in the stateCaptureData.txt file (07.xx)

• The last known state of the drives is a special case: if a drive was previously
failed in a volume prior to the deletion of the volume, it must be failed again after
the recreation of the volume in order to maintain consistent data/parity

• SMcli command to recreate a volume without initializing data on the volume

– recover volume (drive=(trayID,slotID) | drives=(trayID1,slotID1 ... trayIDn,slotIDn) |
volumeGroup=volumeGroupNumber) userLabel="volumeName" capacity=volumeCapacity
offset=offsetValue raidLevel=(0 | 1 | 3 | 5 | 6) segmentSize=segmentSizeValue
[owner=(a | b) cacheReadPrefetch=(TRUE | FALSE)]

• This command is discussed in the EMW help in further detail

– Help >> Contents >> Command Reference Table of Contents >> Commands Listed by Function >> Volume Commands >> Recover RAID Volume

• When specifying the capacity, specify it in bytes for a better chance of data
recovery; if it is entered in gigabytes there could be some rounding discrepancies in the
outcome
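
A hypothetical, filled-in example (every value shown – tray/slot order, label, capacity, offset, RAID level, and segment size – is illustrative and must come from the saved profile or state capture of the lost configuration; a volume reported as 100 GB in binary gigabytes would be entered as 100 x 1,073,741,824 = 107,374,182,400 bytes):

recover volume drives=(1,1 1,2 1,3 1,4 1,5) userLabel="Payroll_1" capacity=107374182400 offset=0 raidLevel=5 segmentSize=128 owner=a;

The drives must be listed in the original piece order of the volume group.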

• A lost volume can be created using this method as many times as necessary until the
data is recovered as long as there are no writes that take place to the volume when
it is recreated improperly

• NEVER use this method to create a brand new volume that contains no data. Doing
so will cause data corruption upon degradation, since the volume was never
initialized during creation.

• If creating volumes using the GUI, instead of the ‘recover volume’ CLI command,
steps must first be made in the controller shell in order to prevent initialization

• There is a flag in the controller shell that defines whether or not to initialize the data
region of the drives upon new volume creations

– writeZerosFlag

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 110
Recovering Lost Volumes – Setup

(Note in the following examples: red denotes what to type, black is the output, and blue indicates pressing the <enter> key)

-> writeZerosFlag
value = 0 = 0x0

-> writeZerosFlag=1

-> writeZerosFlag
value = 1 = 0x1

-> VKI_EDIT_OPTIONS

EDIT APPLICATION SCRIPTS (disabled)

Enter ‘I’ to insert statement; ‘D’ to delete statement;


‘C’ to clear all options; + to enable debug options; ‘Q’ to quit i <enter>
Enter statement to insert (exit insert mode with newline only):
writeZerosFlag=1 <enter>

EDIT APPLICATION SCRIPTS (disabled)

1) writeZerosFlag=1

Enter ‘I’ to insert statement; ‘D’ to delete statement;


‘C’ to clear all options; + to enable debug options; ‘Q’ to quit + <enter>

EDIT APPLICATION SCRIPTS (enabled)

1) writeZerosFlag=1

Enter ‘I’ to insert statement; ‘D’ to delete statement;


‘C’ to clear all options; + to enable debug options; ‘Q’ to quit q <enter>
Commit changes to NVSRAM (y/n) y <enter>
value = 12589824 = 0xc01b00

->

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 111
Recovering Lost Volumes

• A lost volume can be created using this method as many times as necessary until the
data is recovered as long as there are no writes that take place to the volume when
it is recreated improperly

• NEVER use this method to create a brand new volume that contains no data. Doing
so will cause data corruption upon degradation, since the volume was never
initialized during creation

• Always verify that once the volume has been recreated that the system has been
cleaned up from all changes made during the volume recreation process

Recovering Lost Volumes – Cleanup

-> writeZerosFlag
value = 1 = 0x1

-> writeZerosFlag=0

-> writeZerosFlag
value = 0 = 0x0

-> VKI_EDIT_OPTIONS

EDIT APPLICATION SCRIPTS (enabled)

1) writeZerosFlag=1

Enter ‘I’ to insert statement; ‘D’ to delete statement;


‘C’ to clear all options; + to enable debug options; ‘Q’ to quit c <enter>
Clear all options? (y/n) y <enter>

EDIT APPLICATION SCRIPTS (enabled)

Enter ‘I’ to insert statement; ‘D’ to delete statement;


‘C’ to clear all options; + to enable debug options; ‘Q’ to quit - <enter>

EDIT APPLICATION SCRIPTS (disabled)

Enter ‘I’ to insert statement; ‘D’ to delete statement;


‘C’ to clear all options; + to enable debug options; ‘Q’ to quit q <enter>

Commit changes to NVSRAM (y/n) y <enter>


value = 12589824 = 0xc01b00

->

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 112
Recovering Lost Volumes – IMPORTANT

• IMPORTANT: do not attempt to recover lost volumes without development help. Since this deals with customer data, it is a very sensitive matter

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 113
Knowledge Check

1) 06.xx – List the process required to determine the drive failure order for a
volume group.

2) 07.xx – List the process required to determine the drive failure order for a
volume group.

3) Clearing the configuration is a normal troubleshooting technique that will be used frequently.

True False

4) Recovering a lost volume is a simple process that should be done without needing to take much into consideration.

True False

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 114
Module 4: Fibre Channel Overview and Analysis
Upon completion, you should be able to do the following:
• Describe how fibre channel topology works
• Determine how fibre channel topology relates to the different protocols that LSI
uses in its storage array products
• Analyze backend errors for problem determination and isolation

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 115
Fibre Channel

• Fibre Channel is a transport protocol


– Used with upper layer protocols such as SCSI, IP, and ATM

• Provides a maximum of 127 ports in an FC-AL environment


– Is the limiting factor in the number of expansion drive trays that can be
used on a loop pair

Fibre Channel Arbitrated Loop (FC-AL)

• Devices are connected in a ‘one way’ loop or ring topology


– Can either be physically connected in a ring fashion or using a hub

• Bandwidth is shared among all devices on the loop

• Arbitration is required for one port (the ‘initiator’) to communicate with another (the
‘target’)

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 116
Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• Prior to beginning I/O operations on any drive channel a Loop Initialization (LIP)
must occur.

– This must be done to address devices (ports) on the channel with an ALPA (Arbitrated Loop Physical Address) and build the loop positional map

• A 128-bit (four word) map is passed around the loop by the loop master (the
controller)

– Each offset in the map corresponds to an ALPA and has a state of either
0 for unclaimed or 1 for claimed

• There are two steps in the LIP that we will skip

– LISM – Loop Initialization Select Master


• The “Loop Master” is determined
• The “Loop Master” assumes the lowest ALPA (0x01)
• The “A” controller is always the loop master (under optimal
conditions)

– LIFA – Loop Initialization Fabric Address


• Fabric Assigned addresses are determined
• Occurs on HOST side connections

• The three steps we will be looking at are the

– LIPA – Loop Initialization Previous Address


– LIHA – Loop Initialization Hard Address
– LISA – Loop Initialization Soft Address

• The LIP process is the same regardless of drive trays attached (JBOD & SBOD)

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 117
Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• The LIPA Phase


– The Loop Master sends the loop map out and designates it as the LIPA
phase in the header of the frame

– The loop map is passed from device to device in order

– If a device’s port was previously logged in to the loop, it will attempt to assume its previous address by setting the appropriate offset in the map to ‘1’

– If a device was not previously addressed it will pass the frame on to the
next device in the loop

Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• The LIHA Phase


– Once the LIPA phase is complete, the loop master will send the loop map
out again, this time specifying the LIHA phase in the header of the
frame

– The loop map is once again passed from device to device in the loop

– Each device will check its hard address against the loop map

– If the offset of the loop map that corresponds to the device’s hard
address is available (set to 0) it will set that bit to 1, assuming the
corresponding ALPA, and pass the loop map on to the next device

– If the hard address is not available it will pass the loop map on and await
the LISA stage of initialization

– Devices that assumed an ALPA in the LIPA phase will simply pass the
map on to the next device

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 118
Fibre Channel Arbitrated Loop (FC-AL) – The LIP
• How are hard addresses determined?

– Hard Addresses are determined by the ‘ones’ digit of the drive tray ID
and the slot position of the device in the drive tray

– Controllers are set via hardware to always assume the same hard IDs to
ensure that they assume the lower two ALPA addresses in the loop map
(0x01 for “A” and 0x02 for “B”)

• What is the benefit?

– By using hard addressing on devices a LIP can be completed quickly and non-disruptively

– LIPs can occur for a variety of reasons – loss of communication/synchronization, new devices joining the loop (hot adding drives and ESMs)

– I/Os that were in progress when the LIP occurred can be recovered
quickly without the need for lengthy timeouts and retries

Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• The LISA Phase


– Once the LIHA phase has completed, the loop master will send the loop
map out again, now designating it as the LISA phase in the frame
header

– Devices that had not assumed an ALPA on the loop map in the LIPA and
LIHA phase of initialization will now take the first available ALPA in the
loop map

• If no ALPA is available the device will be ‘non-participating’ and will not be addressable on the loop

– When the LISA phase is received again by the loop master it will check
the frame header for a specific value that indicates that LISA had
completed

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 119
Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• Once LISA has completed, the loop master will distribute the loop map again and
each device will enter its hex ALPA in the order that it is received

– This is referred to as the LIRP (Loop Initialization Report Position) phase

• The loop master will distribute the completed loop map to all devices to inform them
of their relative position in loop to the loop master

– This is referred to as the LILP (Loop Initialization Loop Position) phase

• The loop master ends the LIP by transmitting a CLS (Close) frame to all devices on
the loop placing them in monitor mode

Fibre Channel Arbitrated Loop (FC-AL) – The LIP

• Hard Address Contention

– Hard address contention occurs when a device is unable to assume the ALPA that corresponds to its hard address and can be caused by

• The ‘ones’ digit of the tray ID not being unique among the drive
trays on a given loop

• A hardware problem that results in the device reading the incorrect hard address, or the device simply reporting the wrong address during the LIP

– Hard address contention will result in devices taking soft addresses during the LIP

• ALPA Map Corruption

– A bad device on the loop will corrupt the ALPA map resulting in devices
not assuming the correct address or not participating in the loop

• The net effect of these conditions is that LIPs become a disruptive process that can have
adverse effects on the operation of the loop

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 120
Fibre Channel Arbitrated Loop (FC-AL) – Communication

• Each port has what is referred to as a Loop Port State Machine (LPSM) that is used
to define the behavior when it requires access or use of the loop

• While the loop is idle, the LPSM will be in MONITOR mode and transmitting IDLE
frames

• In order for one device to communicate with another arbitration must be performed

– An ARB frame will be passed along the loop from the initiating device to
the target device

– If the ARB frame is received and contains the ALPA of the initiating device
it will transition from MONITOR to ARB_WON

– An OPN (Open) frame will be sent to the device that it wishes to open
communication with

– Data is transferred between the two devices

– CLS (Close) is sent and the device ports return to the MONITOR state

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 121
Knowledge Check

1) The Fibre Channel protocol does not have very much overhead for login and
communication.

True False

2) Soft addressing should not cause a problem in an optimal system.

True False

3) List all the LIP phases:

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 122
Drive Side Architecture Overview

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 123
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 124
SCSI Architecture Model Terminology

• nexus: A relationship between two SCSI devices, and the SCSI initiator port and
SCSI target port objects within those SCSI devices.

• I_T nexus: A nexus between a SCSI initiator port and a SCSI target port.

• logical unit: A SCSI target device object, containing a device server and task
manager, that implements a device model and manages tasks to process commands
sent by an application client.

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 125
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 126
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 127
Role column
FCdr – Fibre Channel drive
SATAdr – SATA drive
SASdr – SAS drive

ORP columns indicate the overall state of the lu for disk device types (normally should
be “+++”).

O= Operation – the state of the ITN currently chosen

+) chosen itn is not degraded


d) chosen itn is degraded

R= Redundancy – the state of the redundant ITN

+) alternate itn is up
d) alternate itn is degraded
-) alternate itn is down
x) there is no alternate itn

P= Performance – Are we using the preferred path?

+) chosen itn is preferred

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 128
-) chosen itn is not preferred
) no itn preferences

The Channels column indicates the state of the itn on that channel which is for its lu.
*) up & chosen
+) up & not chosen
D) degraded & chosen
d) degraded & not chosen
-) down
x) not present
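
A hypothetical reading of this display: an ORP of “+x+” would mean the chosen itn is not degraded, there is no alternate itn, and the chosen itn is preferred; a Channels row showing “*” on one channel and “+” on the other would mean both paths are up, with the “*” channel currently chosen.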

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 129
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 130
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 131
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 132
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 133
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 134
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 135
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 136
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 137
Fibre Channel Overview and Analysis
• In order to reset the backend statistics that are displayed by the previous
commands

o iopPerfMonRestart

• This must be done on both controllers


• Also flushes debug queue

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 138
Knowledge Check

1) What command will show drive path information?

2) What command will show what hosts are logged in?

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 139
Destination Driver Events

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 140
Destination Driver Events (Error Codes)

• Target detected errors:

status-sk/asc/ascq = use SCSI definitions (status=ff means unused, sk=00 means unused)

• Hid detected errors:

02-0b/00/00 IO timeout
ff-00/01/00 ITN fail timeout (ITN has been disconnected for too long)
ff-00/02/00 device fail timeout (all ITNs to device have been discon. for too long)
ff-00/03/00 cmd breakup error

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 141
Destination Driver Events (Error Codes)

Lite detected errors: 02-0b/xx/xx, where xx = XCB_STAT code from the table below

#define XCB_STAT_GEN_ERROR 0x01


#define XCB_STAT_BAD_ALPA 0x02
#define XCB_STAT_OVERFLOW 0x03
#define XCB_STAT_COUNT 0x04
#define XCB_STAT_LINK_FAILURE 0x05
#define XCB_STAT_LOGOUT 0x06
#define XCB_STAT_OXR_ERROR 0x07
#define XCB_STAT_ABTS_SENDER 0x08
#define XCB_STAT_ABTS_RECEIVER 0x09
#define XCB_STAT_OP_HALTED 0x0a
#define XCB_STAT_DATA_MISMATCH 0x0b
#define XCB_STAT_KILL_IO 0x0c
#define XCB_STAT_BAD_SCSI 0x0d
#define XCB_STAT_MISROUTED 0x0e
#define XCB_STAT_ABTS_REPLY_TIMEOUT 0x0f
#define XCB_STAT_REPLY_TIMEOUT 0x10
#define XCB_STAT_FCP_RSP_ERROR 0x11
#define XCB_STAT_LS_RJT 0x12
#define XCB_STAT_FCP_CHECK_COND 0x13
#define XCB_STAT_FCP_SCSI_STAT 0x14
#define XCB_STAT_FCP_RSP_CODE 0x15
#define XCB_STAT_FCP_SCSICON 0x16
#define XCB_STAT_FCP_RESV_CONFLICT 0x17
#define XCB_STAT_FCP_DEVICE_BUSY 0x18
#define XCB_STAT_FCP_QUEUE_FULL 0x19
#define XCB_STAT_FCP_ACA_ACTIVE 0x1a
#define XCB_STAT_MEMORY_ERR 0x1b
#define XCB_STAT_ILLEGAL_REQUEST 0x1c
#define XCB_STAT_MIRROR_CHANNEL_BUSY 0x1d
#define XCB_STAT_FCP_INV_LUN 0x1e
#define XCB_STAT_FCP_DL_MISMATCH 0x1f
#define XCB_STAT_EDC_ERROR 0x20
#define XCB_STAT_EDC_BLOCK_SIZE_ERROR 0x21
#define XCB_STAT_EDC_ORDER_ERROR 0x22
#define XCB_STAT_EDC_REL_OFFSET_ERROR 0x23
#define XCB_STAT_EDC_UDT_FLUSH_ERROR 0x24
#define XCB_STAT_FCP_IOS 0x25
#define XCB_STAT_FCP_IOS_DUP 0x26
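
A hypothetical decode using the table above: a destination driver event with error code 02-0b/05/00 would indicate a Lite detected error of XCB_STAT_LINK_FAILURE (0x05), while 02-0b/13/00 would indicate XCB_STAT_FCP_CHECK_COND (0x13).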

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 142
Read Link Status (RLS) and Switch-on-a-Chip (SOC)
• Each port on each device maintains a Link Error Status Block (LESB) which tracks
the following errors

– Invalid Transmission Words


– Loss of Signal
– Loss of Synchronization
– Invalid CRCs
– Link Failures
– Primitive Sequence Errors

• Read Link Status (RLS) is a link service that collects the LESB from each device

• Transmission Words

– Formed by 4 Transmission Characters

– Two types:
• Data Word
– Dxx.y, Dxx.y, Dxx.y, Dxx.y

• Special Function Word such as Ordered Set


– Kxx.y, Dxx.y, Dxx.y, Dxx.y

– Ordered Set consists of Frame Delimiter, Primitive Signal, and Primitive Sequence

• A Transmission Word is Invalid when one of the following conditions is detected:

– At least one Invalid Transmission Character is within Transmission Word

– Any valid Special Character is at second, third, or fourth character position of a Transmission Word

– A defined Ordered Set is received with Improper Beginning Running Disparity

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 143
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 144
RLS Diagnostics
• Analyze RLS Counts:

– Look for “step” or “spike” in error counts

– Identify the first device (in Loop Map Order) that detects high number of
Link Errors
• Link Error Severity Order: LF > LOS > ITW

– Get the location of the first device

– Get the location of its upstream device

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 145
RLS Diagnostics Example

Example:
• Drive [0,9] has high error counts in ITW, LF, and LOS

• Upstream device is Drive [0,8]

• Drive [0,8] and Drive [0,9] are in same tray

• Most likely bad component: Drive [0,8]

Important Note:
• Logs need to be interpreted, not merely read
• The data is representative of errors seen by the devices on the loop
• No Standard error counting
• Different devices may count errors at different rates
• RLS counts are still valid in SOC environments
• However, RLS counts are not valid for SATA trays

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 146
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 147
What is SOC or SBOD?

• Switch-On-a-Chip ( SOC )
• Switch Bunch Of Disks (SBOD)

Features:
• Crossbar switch (Loop-Switch)
• Supported in FC-AL topologies
• Per device monitoring

SOC Components
• Controllers
– 6091 Controller
– 399x Controller

• Drive Trays
– 2Gb SBOD ESM (2610)
– 4Gb ESM (4600 – Wrigley)

SBOD vs JBOD

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 148
What is the SES?

SCSI Enclosure Services

• The SOC provides monitoring and control for the SES

• The SES is the device that consumes the ALPA

• The brains of the ESM

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 149
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 150
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 151
SOC Statistics
• In order to clear the drive side SOC statistics

clearSocErrorStatistics_MT

• In order to clear the controller side SOC statistics

socShow 1

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 152
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 153
Determining SFP Ports

• On 2Gb SBOD drive enclosures, ports go from left to right

• On 4Gb SBOD drive enclosures, ports start from the center and go to the outside
(Wrigley-Husker)

• On all production models, ports are labeled on the drive trays

Port State (PS)


• Inserted – The standard state when a device is present

• Loopback – a connection when Tx is connected to Rx

• Unknown – non-deterministic state

• Various forms of bypassed state exist.

– Most commonly seen:


• Byp_TXFlt is expected when a drive is not inserted
• Byp_NoFru is expected when an SFP is not present

– Other misc.
• Bypassed, Byp_LIPF8, Byp_TmOut, Byp_RxLOS, Byp_Sync,
Byp_LIPIso, Byp_LTBI, Byp_Manu, Byp_Redn, Byp_Snoop,
Byp_CRC, Byp_OS

Port State (PS) meanings


• Bypassed – Generic bypass condition (indication that port was never in
use)
• Byp_TXFlt – Bypassed due to transmission fault
• Byp_NoFru – No FRU installed
• Byp_LIPF8 – Bypass on LIP (F8,F8) or No Comma
• Byp_TmOut – Bypassed due to timeout
• Byp_RxLOS – Bypassed due to receiver Loss Of Signal (LOSG)
• Byp_Sync – Bypasses due to Loss Of Synchronization (LOS)
• Byp_LIPIso – Bypass – LIP isolation port
• Byp_LTBI – Loop Test Before Insert testing process
• Byp_Manu – General catch all for a forced bypass state
• Byp_Redn – Redundant port connection
• Byp_CRC – Bypassed due to CRC errors
• Byp_OS – Bypassed due to Ordered Set errors
• Byp_Snoop

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 154
Port Insertion Count (PIC)

• Port insertion count – The number of times the device has been inserted into this
port.

• The value is incremented each time a port successfully transitions from the
bypassed state to inserted state.

• Range: 0-255 (2^8)

Loop state (LS)


• The condition of the loop between the SOC and component

• Possible States:

– Up = Expected state when a device is present

– Down = Expected state when no device is present

– Transition states as loop is coming up ( listed in order )


• Down -> Init -> Open -> Actv -> Up

Loop Up Count (LUC)


• The total instances that the loop has been identified as having changed from
Down to Up during the SOC polling intervals.

– Note: This implies that a loop can go down and up multiple instances in
one SOC polling cycle and only be detected once.

– Polling cycle is presently 30 ms

– Range: 0-255 (2^8)

CRC Error Count (CRCEC)


• Number of CRC (Cyclic Redundancy Check) errors that are detected in frames.

• A single invalid word in a frame will increment the CRC counter

• Range: 0 - 4,294,967,294 (2^32)

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 155
Relative Frequency Drive Error Avg.
(relFrq count / RFDEA)
• SBODs are connected to multiple devices.

• This leads to the SBOD being in multiple clock domains

• Over time, clocks tend to drift. SBODs employ a clock check feature comparing
the relative frequency of all attached devices to the clock connected to the
SBOD.

• If one transmitter is transmitting at the slow end and its partner at the fast end
of the tolerance range, then the two clocks are in specification but will have extreme
difficulty communicating

• Range: 0 - 4,294,967,294 (2^32)

Loop Cycle Count (loopCy / LCC)


• The loop cycle is the detection of a Loop transition.
– Unlike Loop Up Count the Loop Cycle count does not require the loop to
transition to the up state.

• The Loop Cycle Count is more useful in understanding overhead of the FC protocol.

• Until Loop Up goes to 1 no data has been transmitted.

• Loop Cycle allows for an understanding that an attempt is being made to bring
up the loop.
– Does not mean the loop has come up

• Range: 0 - 4,294,967,294 (2^32)

• Possible States:
– Same as Loop States (LS)
• Up, Down, Transition states as loop is coming up

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 156
Ordered Set Error Count (OSErr / OSEC)
• Number of Ordered Sets that are received with an encoding error.

• Ordered Sets include Idle, ARB, LIP, SOF, EOF, etc

• Range: 0 - 4,294,967,294 (2^32)

Port Connection Held Off Count (hldOff / PCHOC)


• Port connections held off count

• The number of instances a device has attempted to connect to a specific port and received busy.

• Range: 0 - 4,294,967,294 (2^32)

Port Utilization- Traffic Utilization (PUP)


• The percentage of bandwidth detected over a 240ms period of time.

Other values
• Sample Time:

– Time in seconds in which that sample was taken

General Rules of Thumb for Analysis


• It requires more energy to transmit (Tx) than receive (Rx)

• In some instances it is not possible to isolate the specific problematic component.

– The recommended replacement order is the following


1. SFP
2. Cable
3. ESM
4. Controller

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 157
Analysis of RLS/ SOC

• RLS is an error reporting mechanism that reports errors as seen by the devices
on the array.

• SOC counters are controlled by the SOC chip

• SOC is an error reporting mechanism that monitors communication between two devices.

• SOC data does not render RLS information obsolete

• RLS & SOC need to be interpreted not merely read

• Different devices may not count errors at the same rate

• Different devices may have different expected thresholds

• Know the topology/ cabling of the storage array

• When starting analysis always capture both RLS and SOC

• Do not always expect the first capture of the RLS/SOC to pinpoint the
problematic device.

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 158
Analysis of SOC
• Errors are generally not propagated through the loop in a SOC Environment.
– What is recorded is the communication statistics between two devices.

• The exception to the rule


– loopUp Count
– CRC Error Count
– OS Error Count

• Focus emphasis on the following parameters


– Insertion count
– Loop up count
– Loop cycle count
– CRC error count
– OS error count

• The component connected to the port with the highest errors in the
aforementioned stats is the most likely candidate for a bad component

Known Limitations
• Non-optimal configurations
– i.e. improper cabling

• SOC in hub mode

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 159
Field Case

• Multiple backend issues reported in MEL

• readLinkStatus.csv

• RLS stats show drive trays 1 & 2 are on channels 1 & 3 (all counts zero)

Field Case (cont)


• socStatistics.csv (Amethyst 2 release)

– SOC stats show a problem (M = Million):


• Focusing on Drive Tray 1 ESM-A, the user can see that the SES (the
brains of the ESM) is bypassed and the loop state is down.

• Recommendation was to replace ESM-A.

• The drive tray can continue to operate after it is up without the SES.

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 160
Drive Channel State Management

This feature provides a mechanism for identifying drive-side channels where device
paths (IT nexus) are experiencing channel related I/O problems.

This mechanism’s goal is twofold:

1) It aims to provide ample notice to an administrator that some form of problem exists among the components that are present on the channel

2) It attempts to eliminate, or at least reduce, I/O on drive channels that are experiencing those problems.

• There are two states for a drive channel – OPTIMAL and DEGRADED

• A drive channel will be marked degraded by the controller when a predetermined threshold has been met for channel errors

– Timeout errors
– Controller detected errors: Misrouted FC Frames and Bad ALPA errors, for
example
– Drive detected errors: SCSI Parity Errors, for example
– Link Down errors

• When a drive channel is marked degraded a critical event will be logged to the
MEL and a needs attention condition set in Storage Manager

What a degraded drive channel means


• When a controller marks a drive-side channel DEGRADED, that channel will be
avoided to the greatest extent possible when scheduling drive I/O operations.

– To be more precise, the controller will always select an OPTIMAL channel over a DEGRADED channel when scheduling a drive I/O operation.

– However, if both paths to a given drive are associated with DEGRADED channels, the controller will arbitrarily choose one of the two.

• This point further reinforces the importance of directing administrative attention to a DEGRADED channel so that it can be repaired and returned to the OPTIMAL state before other potential path problems arise.

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 161
• A drive channel that is marked degraded will be persisted through a reboot as
the surviving controller will direct the rebooting controller to mark the path
degraded
– If there is no alternate controller the drive path will be marked OPTIMAL
again

• The drive channel will not automatically transition back to an OPTIMAL state
(with the exception of the above situation) unless directed by the user via the
Storage Manager software

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 162
SAS Backend

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 163
SAS Backend Overview and Analysis
• Statistics collected from PHYs
– A SAS Wide port consists of multiple PHYs, each with independent error
counters

• Statistics collected from PHYs on:


– SAS Expanders
– SAS Disks
– SAS I/O Protocol ASICs

• PHYs that do not maintain counters


– Reported as “N/A” or similar in User Interface
– Including SATA Disks

• PHY counters do not wrap (per standard)


– Maximum value of 4,294,967,295 (2^32 – 1)
– Must be manually reset

• Counters defined in SAS 1.1 Standard


– Invalid DWORDs
– Running Disparity Errors
– Loss of DWORD Synchronization
• After dword synchronization has been achieved, this state
machine monitors invalid dwords that are received. When an
invalid dword is detected, it requires two valid dwords to nullify its
effect. When four invalid dwords are detected without
nullification, dword synchronization is considered lost.
– PHY Reset Problems

• Additional information returned


– Elapsed time since PHY logs were last cleared
– Negotiated physical link rate for the PHY
– Hardware maximum physical link rate for the PHY

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 164
SAS Error counts
• IDWC – Invalid Dword Count
– A dword that is not a data word or a primitive (i.e., in the character
context, a dword that contains an invalid character, a control character in
other than the first character position, a control character other than
K28.3 or K28.5 in the first character position, or one or more characters
with a running disparity error). This could mark the beginning of a loss of
Dword synchronization. After the fourth non-nullified Invalid Dword (an
Invalid Dword is nullified if it is followed by two valid Dwords), Dword
synchronization is lost.

• RDEC – Running Disparity Error Count


– Cumulative encoded signal imbalance between the one and zero signal states.
Any Dword with one or more Running Disparity Errors will be considered
an invalid Dword.

• LDWSC – Loss of Dword synch Count


– After the fourth non-nullified Invalid Dword (an Invalid Dword is nullified
if it is followed by two valid Dwords), Dword synchronization is lost.

• RPC – Phy Reset Problem Count


– Number of times a phy reset problem occurred. When a phy or link is
reset, it will run through its reset sequence (OOB, Speed Negotiation,
Multiplexing, Identification).

SAS Backend Overview and Analysis


• SAS error logs are gathered as part of the Capture all Support Data bundle
– sasPhyErrorLogs.csv

• Not available through the GUI interface, only CLI or the support bundle.

• CLI command to collect SAS PHY Error Statistics


– save storageArray SASPHYCounts file=“<file>”;

• CLI command to reset SAS PHY Error Statistics


– reset storageArray SASPHYCounts;

• Shell commands to collect SAS PHY Error Statistics


– sasShowPhyErrStats 0
• List phys with errors
– sasShowPhyErrStats 1
• List all phys
– getSasErrorStatistics_MT

• Shell commands to reset SAS PHY Error Statistics

– sasClearPhyErrStats
– clearSasErrorStatistics_MT
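
A hypothetical example of the collection command with a concrete (placeholder) file name, as it would be entered in the script editor or passed to SMcli:

save storageArray SASPHYCounts file="sasPhyCounts.csv";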

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 165
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 166
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 167
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 168
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 169
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 170
SAS Backend Overview and Analysis

• Remember that SAS error statistics are gathered per PHY

• If a PHY has a high error count, look at the device that the PHY is directly
attached to

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 171
Left Blank Intentionally

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 172
Appendix A: SANtricity Managed Storage Systems

• Fully-featured midrange storage designed for wide-ranging open systems environments
• Compute-intensive applications, consolidation, tiered storage
• Fully-featured management software designed to provide administrators
with extensive configuration flexibility
• FC and IB connectivity with support for FC/SATA drives

Attribute         6998 | 6994                                    6498

Overview          Flagship system targeted at enterprises        Targeted at HPC environments utilizing
                  with compute-intensive applications and        InfiniBand for Linux server clustering
                  large consolidations                           interconnect

Key features      • Disk performance                             • Native IB interfaces
                  • SANtricity robustness                        • Switched-loop backend
                  • Dedicated data cache                         • FC | SATA intermixing
                  • 4 Gb/s interfaces                            • SANtricity robustness
                  • Switched-loop backend
                  • FC | SATA intermixing

Host interfaces   Eight 4 Gb/s FC                                Four 10 Gb/s IB

Drive interfaces  Eight 4 Gb/s FC                                Eight 4 Gb/s FC

Drives            224 FC or SATA                                 224 FC or SATA

Data cache        4, 8, 16 GB (dedicated)                        2 GB (dedicated)

Cache IOPS        575,000 | 375,000 IOPS                         ---

Disk IOPS         86,000 | 62,000 IOPS                           ---

Disk MB/s         1,600 | 1,280 MB/s                             1,280 MB/s
Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 173
6998 /6994 /6091 (Front)

6998 /6994 /6091 (Back)

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 174
Attribute 3994 | 3992

Overview Fully-featured systems targeted at midrange environments requiring high-end functionality and performance value

Key features • Performance value
• SANtricity robustness
• FC | SATA intermixing
• 4 Gb/s interfaces
• Switched-loop backend

Host interfaces Eight | Four 4 Gb/s FC

Drive interfaces Four 4 Gb/s FC

Drives 112 FC or SATA

Data cache 4 GB | 2 GB (shared)

Cache IOPS 120,000 IOPS

Disk IOPS 44,000 | 28,000 IOPS

Disk MB/s 990 | 740 MB/s

3992 (Back)

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 175
3994 (Back)

4600 16-Drive Enclosure (Back)

4600 16-Drive Enclosure (Front)

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 176
Left Blank Intentionally

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 177
Appendix B: Simplicity Managed Storage Systems
• Affordable and reliable storage designed for SMB, departmental and remote-site customers
• Intuitive, task-oriented management software designed for sites with limited IT resources that need to be self-sufficient
• FC and SAS connectivity with support for SAS/SATA drives (SATA drive support mid-2007)

Attribute 1333 | 1331

Overview Shared DAS targeted at SMB and entry-level environments requiring ease of use and reliability. Entry-point storage for Microsoft Cluster Server

Key features • Shared DAS
• High availability/reliability
• SAS host interfaces
• Robust, intuitive Simplicity software
• Snapshot / Volume Copy

Host interfaces Six | Two 3 Gb/s “wide” SAS

Drive interfaces Two 3 Gb/s “wide” SAS

Drives 42 SAS

Data cache 2 GB | 1 GB (shared)

Cache IOPS 91,000 IOPS

Disk IOPS 22,000 IOPS

Disk MB/s 900 MB/s

1333

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 178
Attribute 1532

Overview iSCSI connectivity – integration into low-cost IP networks. Pervasive and well-understood interface technology. Simple to implement and manage with intuitive, easy-to-use storage software

Key features • Cost effective and reliable
• iSCSI host connectivity
• Attach to redundant IP switches

Host interfaces Four 1Gb/s iSCSI

Drive interfaces Two 3 Gb/s “wide” SAS

Drives 42 SAS

Data cache 2 GB | 1 GB (shared)

Cache IOPS 64,000 IOPS

Disk IOPS 22,000 IOPS

Disk MB/s 320 MB/s

1532

Storage System Diagnostics & Troubleshooting - Copyright © LSI 2008, All Rights Reserved Page 179
Attribute 1932

Overview Ideal for departments or remote offices that need to integrate inexpensive storage into existing FC networks. Also appealing to smaller organizations planning initial SANs.

• High availability/reliability
• Robust, intuitive Simplicity software
Key features
• 4 Gb/s host interfaces
• Snapshot / Volume Copy

Host interfaces Four 4 Gb/s FC

Drive interfaces Two 3 Gb/s “wide” SAS

Drives 42 SAS

Data cache 2 GB | 1 GB (shared)

Cache IOPS 114,000 IOPS

Disk IOPS 22,000 IOPS

Disk MB/s 900 MB/s

1932

SAS Drive Tray (Front)

SAS Expansion Tray (Back)

Appendix C – State, Status, Flags (06.xx)

Drive State, Status, Flags


From pp 15 – 16, Troubleshooting and Technical Reference Guide – Volume 1

Drive State Values

0 Optimal
1 Non-existent drive
2 Unassigned, w/DACstore
3 Failed
4 Replaced
5 Removed – optimal pg2A = 0
6 Removed – replaced pg2A = 4
7 Removed – Failed pg2A = 3
8 Unassigned, no DACstore

Drive Status Values


0x0000 Optimal
0x0001 Unknown Channel
0x0002 Unknown Drive SCSI ID
0x0003 Unknown Channel and Drive SCSI ID
0x0080 Format in progress
0x0081 Reconstruction in progress
0x0082 Copy-back in progress
0x0083 Reconstruction initiated but no GHS is integrated
0x0090 Mismatched controller serial number
0x0091 Wrong vendor – lock out
0x0092 Unassigned drive locked out
0x00A0 Format failed
0x00A1 Write failed
0x00A2 Start of Day failed
0x00A3 User failed via Mode Select
0x00A4 Reconstruction failed
0x00A5 Drive failed at Read Capacity
0x00A6 Drive failed for internal reason
0x00B0 No information available
0x00B1 Wrong sector size
0x00B2 Wrong capacity
0x00B3 Incorrect Mode parameters
0x00B4 Wrong controller serial number
0x00B5 Channel Mismatch
0x00B6 Drive Id mismatch
0x00B7 DACstore inconsistent
0x00B8 Drive needs to have a 2MB DACstore
0x00C0 Wrong drive replaced
0x00C1 Drive not found
0x00C2 Drive offline, internal reasons

Drive Flags (d_flags)
0x00000100 Drive is locked for diagnostics
0x00000200 Drive contains config sundry
0x00000400 Drive is marked deleted by Raid Mgr.
0x00000800 Defined drive without drive
0x00001000 Drive is spinning or accessible
0x00002000 Drive contains a format or accessible
0x00004000 Drive is designated as HOT SPARE
0x00008000 Drive has been removed
0x00010000 Drive has an ADP93 DACstore
0x00020000 DACstore update failed
0x00040000 Sub-volume consistency checked during SOD
0x00080000 Drive is part of a foreign rank (cold added).
0x00100000 Change vdunit number
0x00200000 Expanded DACstore parameters
0x00400000 Reconfiguration performed in reverse VOLUME order
0x00800000 Copy operation is active (not queued).
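
In support data these flag words show up as a single hex value; individual conditions are read by
ANDing the value with the masks above. A minimal decode sketch in Python (masks copied from the
table; the helper name and the selection of masks shown are illustrative only):

    # Map a few of the documented d_flags bit masks to their meanings.
    D_FLAG_MASKS = {
        0x00004000: "Drive is designated as HOT SPARE",
        0x00008000: "Drive has been removed",
        0x00020000: "DACstore update failed",
        0x00800000: "Copy operation is active (not queued)",
    }

    def decode_d_flags(d_flags):
        """Return the descriptions of all documented bits set in a d_flags word."""
        return [text for mask, text in D_FLAG_MASKS.items() if d_flags & mask]

    # Example: 0x0000C000 decodes as a hot spare that has been removed
    print(decode_d_flags(0x0000C000))

The same approach applies to the volume vd_flags values listed later in this appendix.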

Volume State, Status, Flags
From pp 17 – 18, Troubleshooting and Technical Reference Guide – Volume 1

VOLUME State (vd_state)


These flags are bit values, and the following flags are valid:
0x0000 optimal
0x0001 degraded
0x0002 reconstructing
0x0003 formatting
0x0004 dead
0x0005 quiescent
0x0006 non-existent
0x0007 dead, awaiting format
0x0008 not spun up yet
0x0009 unconfigured
0x000a LUN is in process of ADP93 upgrade
0x000b Optimal state and reconfig
0x000c Degraded state and reconfig
0x000d Dead state and reconfig

VOLUME Status (vd_status)


These flags are bit values, and the following flags are valid:
0x0000 No sub-state/status available
0x0020 Parity scan in progress
0x0022 Copy operation in progress
0x0023 Restore operation in progress
0x0025 Host parity scan in progress
0x0044 Format in progress on virtual disk
0x0045 Replaced wrong drive
0x0046 Deferred error

VOLUME Flags (vd_flags)
These flags are bit values, and the following flags are valid:

0x00000001 Configured
0x00000002 Open
0x00000004 On-Line
0x00000008 Not Suspended
0x00000010 Resources available
0x00000020 Degraded
0x00000040 Spare piece - VOLUME has Global Hot Spare drive in use
0x00000080 RAID 1 ping-pong state
0x00000100 RAID 5 left asymmetric mapping
0x00000200 Write-back caching enabled
0x00000400 Read caching enabled
0x00000800 Suspension in progress while switching Global Hot Spare drive
0x00001000 Quiescence has been aborted or stopped
0x00010000 Prefetch enabled
0x00020000 Prefetch multiplier enabled
0x00040000 IAF not yet started, don't restart yet
0x00100000 Data scrubbing is enabled on this unit
0x00200000 Parity check is enabled on this unit
0x00400000 Reconstruction read failed
0x01000000 Reconstruction in progress
0x02000000 Data initialization in progress
0x04000000 Reconfiguration in progress
0x08000000 Global Hot Spare copy-back in progress
0x90000000 VOLUME halted; awaiting graceful termination of any reconstruction,
verify, or copy-back

From p 27, Troubleshooting and Technical Reference Guide – Volume 1

3.2.5 Controller/RDAC Modify Commands

3.2.5.01 isp rdacMgrSetModeActivePassive


This command sets the controller (the one you are connected to) to active mode and the alternate
controller to passive mode.
WARNING: This command modifies only the controller states, not the controller cache setup.
The cache setup can be updated by issuing the following command:
isp ccmEventNotify,0x0f

3.2.5.02 isp rdacMgrSetModeDualActive


This command sets both array controllers to dual-active mode.
WARNING: This command modifies only the controller states, not the controller cache setup.
The cache setup can be updated by issuing the following command:
isp ccmEventNotify,0x0f

3.2.5.03 isp rdacMgrAltCtlFail


Fails the alternate controller and takes ownership of its volumes.
NOTE: In order to fail a controller, it may be necessary to set the controller to a passive state
first.

3.2.5.04 isp rdacMgrAltCtlResetRelease


Releases the alternate controller if it is being held in reset or is failed.

Appendix D – Chapter 2 - MEL Data Format
Major Event Log Specification 349-1053040 (Software Release 6.16)
LSI Logic Confidential
Chapter 2: MEL Data Format
The event viewer formats and displays the most meaningful fields of major event log entries from
the controller. The data displayed for individual events varies with the event type and is described
in the Events Description section. The raw data contains the entire major event data structure
retrieved from the controller subsystem. The event viewer displays the raw data as a character
string. Fields that occupy multiple bytes may appear to be byte swapped depending on the host
system. Fields that may appear as byte swapped are noted in the table below.

2.1. Overview of the Major Event Log Fields

Table 2-1: MEL Data Fields

2.1.1. Constant Data Field format, No Version Number

Note: If the log entry field does not have a version number, the format will be as shown below.
Table 2-2: Constant Data Field format, No Version Number

2.1.2. Constant Data Field Format, Version 1

If the log entry field contains version 1, the format will be as shown below.
Table 2-3: Constant Data Field Format, Version 1


2.2. Detail of Constant Data Fields

2.2.1. Signature (Bytes 0-3) Field Details


The Signature field is used internally by the controller. The current value is ‘MELH.’

2.2.2. Version (Bytes 4 -7) Field Details


When the Version field is present, the value should be 1 or 2, depending on the format of the
MEL entry.

2.2.3. Sequence Number (Bytes 8 - 15) Field Details


The Sequence Number field is a 64-bit incrementing value starting from the time the system log
was created or last initialized. Resetting the log does not affect this value.

2.2.4. Event Number (Bytes 16 - 19) Field Details

The Event Number is a 4 byte encoded value that includes bits for drive and controller inclusion,
event priority, and the event value. The Event Number field is encoded as follows:

Table 2-4: Event Number (Bytes 16 - 19) Encoding

2.2.4.1. Event Number - Internal Flags Field Details

The Internal Flags are used internally within the controller firmware for events that require unique
handling. The host application ignores these values.

Table 2-5: Internal Flags Field Values

2.2.4.2. Event Number - Log Group Field Details

The Log Group field indicates what kind of event is being logged. All events are logged in the
system log. The values for the Log Group Field are described as follows:

Table 2-6: Log Group Field Values

2.2.4.3. Event Number - Priority Field Details

The Priority field is defined as follows:

Table 2-7: Priority Field Values

2.2.4.4. Event Number - Event Group Field Details

The Event Group field is defined as follows:

Table 2-8: Event Group Field Values

2.2.4.5. Event Number - Component Type Field Details

The Component Type Field Values are defined as follows:

2.2.5. Timestamp (Bytes 20 - 23) Field Details

The Timestamp field is a 4 byte value that corresponds to the real time clock on the controller.
The real time clock is set (via the boot menu) at the time of manufacture. It is incremented every
second and started relative to January 1, 1970.
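
Because the timestamp is just a count of seconds since January 1, 1970, it can be converted with
standard time functions once the 4-byte value has been extracted. A minimal sketch in Python
(treating the clock as UTC is an assumption; the controller clock simply holds whatever was set via
the boot menu):

    from datetime import datetime, timezone

    def mel_timestamp_to_string(timestamp):
        """Render a MEL timestamp (seconds since Jan 1, 1970) as a readable string."""
        return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%m/%d/%y %H:%M:%S")

    # Example: 0x47B2A3C0 seconds corresponds to a date in early 2008
    print(mel_timestamp_to_string(0x47B2A3C0))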

2.2.6. Location Information (Bytes 24 - 27 ) Field Details


The Location Information field indicates the Channel/Drive or Tray/Slot information for the event.
Logging of data for this field is optional and is zero when not specified.

2.2.7. IOP ID (Bytes 28-31) Field Details

The IOP ID is used by MEL to associate multiple log entries with a single event or I/O. The IOP ID
is guaranteed to be unique for each I/O. A valid IOP ID may not be available for certain MEL
entries and some events use this field to log other information. The event descriptions will
indicate if the IOP ID is being used for unique log information.

Logging of data for this field is optional and is zero when not specified.

2.2.8. I/O Origin (Bytes 32-33) Field Details


The I/O Origin field specifies where the I/O or action that caused the event originated. It uses one
of the Error Event Logger defined origin codes:

Table 2-9: I/O Origin Field Values

A valid I/O Origin may not be available for certain MEL entries and some events use this field to
log other information. The event descriptions will indicate if the I/O Origin is being used for unique
log information. Logging of data for this field is optional and is zero when not specified. When
decoding MEL events, additional FRU information can be found in the Software Interface
Specification.

2.2.9. LUN/Volume Number (Bytes 36 - 39) Field Details


The LUN/Volume Number field specifies the LUN or volume associated with the event being
logged. Logging of data for this field is optional and is zero when not specified.

2.2.10. Controller Number (Bytes 40-43) Field Details

The Controller Number field specifies the controller associated with the event being logged.

Table 2-10: Controller Number (Bytes 40-43) Field Values

Logging of data for this field is optional and is zero when not specified.

2.2.11. Category Number (Bytes 44 - 47) Field Details

This field identifies the category of the log entry. This field is identical to the event group field
encoded in the event number.

Table 2-11: Event Group Field Values

2.2.12. Component Type (Bytes 48 - 51) Field Details

Identifies the component type associated with the log entry. This is identical to the Component
Group list encoded in the event number.

Table 2-12: Component Type Field Details

2.2.13. Component Location Field Details


The first entry in this field identifies the component based on the Component Type field listed
above. The definition of the remaining bytes is dependent on the Component Type.

Table 2-13: Component Type Location Values

2.2.14. Location Valid (Bytes 120-123) Field Details

This field contains a value of 1 if the component location field contains valid data. If the
component location data is not valid or cannot be determined the value is 0.

2.2.15. Number of Optional Fields Present (Byte 124) Field Details


The Number of Optional Fields Present specifies the number (if any) of additional data fields that
follow. If this field is zero then there is no additional data for this log entry.
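
Taken together, the constant portion of an entry can be pulled apart directly from the byte offsets
documented above. A minimal sketch in Python (assumes a Version 1 entry held in a bytes object and
little-endian byte order; as noted earlier, multi-byte fields may appear byte swapped on some hosts,
so the byte order argument may need to be flipped):

    def parse_mel_constant_fields(entry, byteorder="little"):
        """Extract the documented constant fields from a raw Version 1 MEL entry."""
        def u(lo, hi):
            return int.from_bytes(entry[lo:hi + 1], byteorder)
        return {
            "signature": entry[0:4].decode("ascii", "replace"),  # expected 'MELH'
            "version": u(4, 7),
            "sequence_number": u(8, 15),
            "event_number": u(16, 19),
            "timestamp": u(20, 23),          # seconds since Jan 1, 1970
            "location_info": u(24, 27),      # channel/drive or tray/slot, 0 if not specified
            "iop_id": u(28, 31),
            "io_origin": u(32, 33),
            "lun_volume": u(36, 39),
            "controller": u(40, 43),
            "category": u(44, 47),
            "component_type": u(48, 51),
            "location_valid": u(120, 123),   # 1 if the component location field is valid
            "optional_field_count": entry[124],
        }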

2.2.16. Optional Field Data Field Details

The format for the individual optional data fields follows:

Table 2-14: Optional Field Data Format

2.2.17. Data Length (Byte 128) Field Details

The length in bytes of the optional data field data (including the Data Field Type).

2.2.18. Data Field Type (Bytes 130-131) Field Details


See Data Field Types on page 14 for the definitions of the various optional data fields.

2.2.19. Data (Byte 132) Field Details


Optional field data associated with the Data Field Type. This data may appear as byte swapped
when using the event viewer.

Appendix E – Chapter 30 – Data Field Types
Major Event Log Specification 349-1053040 (Software Release 6.16)
LSI Logic Confidential
Chapter 30: Data Field Types
This table describes data field types.

Table 30-1: Data Field Types

Appendix F – Chapter 31 – RPC Function Numbers

Major Event Log Specification 349-1053040 (Software Release 6.16)


LSI Logic Confidential
Chapter 31: RPC Function Numbers
The following table lists SYMbol remote procedure call function numbers:
Table 31-1: SYMbol RPC Functions

Appendix G – Chapter 32 – SYMbol Return Codes
Major Event Log Specification 349-1053040 (Software Release 6.16)
LSI Logic Confidential

Chapter 32: SYMbol Return Codes


This section provides a description of each of the SYMbol return codes.

Return Codes

Appendix H – Chapter 5 - Host Sense Data

Software Interface Specification 349-1062130 - Rev. A1 (Chromium 1 & 2)


LSI Logic Confidential
Chapter 5: Host Sense Data
5.1. Request Sense Data Format
Sense data returned by the Request Sense command is in one of two formats: Fixed format or
Descriptor format. The format is based on the value of the D_SENSE bit (byte 2, bit 2) in the
Control Mode Page. When this bit is set to 0, sense data is returned using Fixed format. When
the bit is set to 1, then sense data is returned using Descriptor format. This parameter will default
to 1b for volumes >= 2 TB in size. The parameter defaults to 0b for volumes < 2 TB in size. This
setting is persisted on a logical unit basis. See “6.11. Control Mode Page (Page A)” on page 6-232.

The first byte of all sense data contains the response code field, which indicates the error type and
format of the sense data:

If the response code is 0x70 or 0x71, the sense data format is Fixed. See “5.1.1. Request Sense
Data - Fixed Format” on page 5-189. If the response code is 0x72 or 0x73, the sense data format
is Descriptor. See “5.1.2. Request Sense Data - Descriptor Format” on page 5-205.
For more information on sense data response codes, see SPC-3, SCSI Primary Commands.
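
A decoder can branch on the two formats by inspecting the response code in byte 0. A minimal
sketch in Python (masking off bit 7, which SPC-3 defines as the VALID bit in fixed-format sense
data):

    def sense_format(sense):
        """Classify Request Sense data as Fixed or Descriptor format from byte 0."""
        response_code = sense[0] & 0x7F
        if response_code in (0x70, 0x71):
            return "fixed"
        if response_code in (0x72, 0x73):
            return "descriptor"
        return "unknown"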

5.1.1. Request Sense Data - Fixed Format


The table below outlines the Fixed format for Request Sense data. Information about individual
bytes is defined in the paragraphs following the table.

Table 5.1: Request Sense Data Format

5. 1. 1. 1. Incorrect Length Indicator (ILI) - Byte 2
This bit is used to inform the host system that the requested non-zero byte transfer length for a
Read or Write Long command does not exactly match the available data length. The information
field in the sense data will be set to the difference (residue) of the requested length minus the
actual length in bytes. Negative values will be indicated by two's complement notation. Since the
controller does not support Read or Write Long, this bit is always zero.

5. 1. 1. 2. Sense Key - Byte 2


Possible sense keys returned are shown in the following table:

Table 5.2: Sense Key - Byte 2
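
Both the ILI bit and the sense key live in byte 2 of the fixed-format sense data, so they are usually
extracted together. A minimal sketch in Python (bit positions follow the standard SCSI-2 fixed sense
layout: sense key in the low nibble, ILI in bit 5):

    def decode_sense_byte2(sense):
        """Pull the sense key and ILI bit out of byte 2 of fixed-format sense data."""
        return {
            "sense_key": sense[2] & 0x0F,  # see Table 5.2
            "ili": bool(sense[2] & 0x20),  # always zero here; Read/Write Long is not supported
        }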

5. 1. 1. 3. Information Bytes - Bytes 3-6
This field is implemented as defined in the SCSI standard for direct access devices. The
information could be any one of the following types of information:

• The unsigned logical block address indicating the location of the error being reported.
• The first invalid logical block address if the sense key indicates an illegal request.

5. 1. 1. 4. Additional Sense Length - Byte 7


This value will indicate the number of additional sense bytes to follow. Some errors cannot return
valid data in all of the defined fields. For these errors, invalid fields will be zero-filled unless
specified in the SCSI-2 standard as containing 0xFF if invalid.

The value in this field will be 152 (0x98) in most cases. However, there are situations when only
the standard sense data will be returned. For these sense blocks, the additional sense length is
10 (0x0A).

5. 1. 1. 5. Command Specific Information – Bytes 8-11


This field is only valid for sense data returned after an unsuccessful Reassign Blocks command.
The logical block address of the first defect descriptor not reassigned will be returned in this field.
These bytes will be 0xFFFFFFFF if information about the first defect descriptor not reassigned is
not available or if all the defects have been reassigned.

The command-specific field will always be zero-filled for sense data returned for commands other
than Reassign Blocks.

5. 1. 1. 6. Additional Sense Codes - Bytes 12-13


See “11.2. Additional Sense Codes and Qualifiers” on page 11-329 for details on the information
returned in these fields.

5. 1. 1. 7. Field Replaceable Unit Code - Byte 14


A non-zero value in this byte identifies a field replaceable unit that has failed or a group of field
replaceable modules that includes one or more failed devices. For some Additional Sense Codes,
the FRU code must be used to determine where the error occurred. As an example, the Additional Sense Code for
SCSI bus parity error is returned for a parity error detected on either the host bus or one of the
drive buses. In this case, the FRU field must be evaluated to determine if the error occurred on
the host channel or a drive channel.

Because of the large number of replaceable units possible in an array, a single byte is not
sufficient to report a unique identifier for each individual field replaceable unit. To provide
meaningful information that will decrease field troubleshooting and problem resolution time, FRUs
have been grouped. The defined FRU groups are listed below.

5.1.1.7.1. Host Channel Group (0x01)

A FRU group consisting of the host SCSI bus, its SCSI interface chip, and all initiators and other
targets connected to the bus.

5.1.1.7.2. Controller Drive Interface Group (0x02)

A FRU group consisting of the SCSI interface chips on the controller which connect to the drive
buses.

5.1.1.7.3. Controller Buffer Group (0x03)

A FRU group consisting of the controller logic used to implement the on-board data buffer.

5.1.1.7.4. Controller Array ASIC Group (0x04)

A FRU group consisting of the ASICs on the controller associated with the array functions.

5.1.1.7.5. Controller Other Group (0x05)

A FRU group consisting of all controller related hardware not associated with another group.

5.1.1.7.6. Subsystem Group (0x06)

A FRU group consisting of subsystem components that are monitored by the array controller,
such as power supplies, fans, thermal sensors, and AC power monitors. Additional information
about the specific failure within this FRU group can be obtained from the additional FRU bytes
field of the array sense.

5.1.1.7.7. Subsystem Configuration Group (0x07)

A FRU group consisting of subsystem components that are configurable by the user, on which
the array controller will display information (such as faults).

5.1.1.7.8. Sub-enclosure Group (0x08)

A FRU group consisting of the attached enclosure devices. This group includes the power
supplies, environmental monitor, and other subsystem components in the sub-enclosure.

5.1.1.7.9. Redundant Controller Group (0x09)

A FRU group consisting of the attached redundant controllers.

5.1.1.7.10. Drive Group (0x10 - 0xFF)

A FRU group consisting of a drive (embedded controller, drive electronics, and Head Disk
Assembly), its power supply, and the SCSI cable that connects it to the controller; or supporting
sub-enclosure environmental electronics.

For SCSI drive-side arrays, the FRU code designates the channel ID in the most significant nibble
and the SCSI ID of the drive in the least significant nibble. For Fibre Channel drive-side arrays,
the FRU code contains an internal representation of the drive’s channel and id. This
representation may change and does not reflect the physical location of the drive. The sense data
additional FRU fields will contain the physical drive tray and slot numbers.

NOTE: Channel ID 0 is not used because a failure of drive ID 0 on this channel would cause an
FRU code of 0x00, which the SCSI-2 standard defines as no specific unit has been identified to
have failed or that the data is not available.
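
For the SCSI drive-side case, the channel and SCSI ID can therefore be recovered by splitting the
FRU code into nibbles. A minimal sketch in Python (only meaningful for drive-group FRU codes of
0x10 and above on SCSI drive-side arrays; for Fibre Channel arrays the physical tray and slot come
from the additional FRU bytes instead):

    def decode_drive_fru(fru_code):
        """Split a drive-group FRU code from a SCSI drive-side array into channel and SCSI ID."""
        if fru_code < 0x10:
            raise ValueError("not a drive-group FRU code")
        channel = (fru_code >> 4) & 0x0F   # most significant nibble
        scsi_id = fru_code & 0x0F          # least significant nibble
        return channel, scsi_id

    # Example: FRU code 0x23 -> channel 2, SCSI ID 3
    print(decode_drive_fru(0x23))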

5. 1. 1. 8. Sense Key Specific Bytes - Bytes 15-17

This field is valid for a sense key of Illegal Request when the sense-key specific valid (SKSV) bit
is on. The sense-key specific field will contain the data defined below. In this release of the
software, the field pointer is only supported if the error is in the CDB.

• C/D = 1 indicates the illegal parameter is in the CDB.
• C/D = 0 indicates that the illegal parameter is in the parameters sent during a Data Out phase.
• BPV = 0 indicates that the value in the Bit Pointer field is not valid.
• BPV = 1 indicates that the Bit Pointer field specifies which bit of the byte designated by the Field
Pointer field is in error. When a multiple-bit error exists, the Bit Pointer field will point to the most
significant (left-most) bit of the field.

The Field Pointer field indicates which byte of the CDB or the parameter was in error. Bytes are
numbered from zero. When a multiple-byte field is in error, the pointer will point to the most-
significant byte.
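
A decode sketch for these three bytes in Python (the bit positions follow the standard SCSI-2
field-pointer layout for an Illegal Request sense key; they are an assumption here since the bit map
itself is not reproduced in this extract):

    def decode_field_pointer(sense):
        """Decode the sense-key specific bytes 15-17 for an Illegal Request sense key."""
        b15 = sense[15]
        return {
            "sksv": bool(b15 & 0x80),      # field is only meaningful when SKSV is set
            "cd": (b15 >> 6) & 0x01,       # 1 = error in the CDB, 0 = error in Data Out parameters
            "bpv": bool(b15 & 0x08),       # Bit Pointer field valid
            "bit_pointer": b15 & 0x07,     # most significant bit of the field in error
            "field_pointer": (sense[16] << 8) | sense[17],  # byte number, counted from zero
        }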

5. 1. 1. 9. Recovery Actions - Bytes 18-19

This is a bit-significant field that indicates the recovery actions performed by the array controller.

5. 1. 1. 10. Total Number Of Errors - Byte 20


This field contains a count of the total number of errors encountered during execution of the
command.
The ASC and ASCQ for the last two errors encountered are in the ASC/ASCQ stack field.
6 Downed LUN
5 Failed drive
5. 1. 1. 11. Total Retry Count - Byte 21

The total retry count is for all errors seen during execution of a single CDB set.

5. 1. 1. 12. ASC/ASCQ Stack - Bytes 22-25

These fields store information when multiple errors are encountered during execution of a
command. The ASC/ASCQ pairs are presented in order of most recent to least recent error
detected.

5. 1. 1. 13. Additional FRU Information - Bytes 26-33


These bytes provide additional information about the field replaceable unit identified in byte 14.
The first two bytes are qualifier bytes that provide details about the FRU in byte 14. Byte 28 is an
additional FRU code which identifies a second field replaceable unit. The value in byte 28 can be
interpreted using the description for byte 14. Bytes 29 and 30 provide qualifiers for byte 28, just
as bytes 26 and 27 provide qualifiers for byte 14. The table below shows the layout of this field.
Following the table is a description of the FRU group code qualifiers. If an FRU group code
qualifier is not listed below, this indicates that bytes 26 and 27 are not used in this release

5.1.1.13.1. FRU Group Qualifiers for the Host Channel Group (Code 0x01)

FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which
host channel is reporting the failed component. The least significant byte provides the device type
and state of the device being reported

5.1.1.13.2. Mini-hub Port

Mini-Hub Port indicates which of the Mini-Hub ports is being referenced. For errors where the
Mini-Hub port is irrelevant, port 0 is specified.

5.1.1.13.3. Controller Number

Controller Number indicates which controller the host interface is connected to.

5.1.1.13.4. Host Channel LSB Format

The least significant byte provides the device type and state of the device being reported.

Host Channel Number indicates which channel of the specified controller. Values 1 through 4 are
valid.

5.1.1.13.4.1. Host Channel Device State


Host Channel Device State is defined as:

5.1.1.13.4.2. Host Channel Device Type Identifier
The Host Channel Device Type Identifier is defined as:

5.1.1.13.5. FRU Group Qualifiers For Controller Drive Interface Group (Code 0x02)

FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which
drive channel is reporting the failed component. The least significant byte provides the device
type and state of the device being reported.

5.1.1.13.5.1. Drive Channel MSB Format:

* = Reserved for parallel SCSI

5.1.1.13.5.2. Mini-Hub Port

The Mini-Hub Port indicates which of the Mini-Hub ports is being referenced. For errors where the
Mini-Hub port is irrelevant, port 0 is specified.

5.1.1.13.5.3. Drive Channel Number
Drive Channel Number indicates which channel. Values 1 through 6 are valid.

5.1.1.13.5.4. Drive Channel LSB Format


Drive Channel LSB Format (Not used on parallel SCSI)

5.1.1.13.5.41. Drive Interface Channel Device State


Drive Interface Channel Device State is defined as:

5.1.1.13.5.42. Host Channel Device Type Identifier


Host Channel Device Type Identifier is defined as

5.1.1.13.6. FRU Group Qualifiers For The Subsystem Group (Code 0x06)

FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which
primary component fault line is reporting the failed component. The information returned depends
on the configuration set up by the user. For more information, see OLBS 349-1059780, External
NVSRAM Specification for Software Release 7.10. The least significant byte provides the device
type and state of the device being reported. The format for the least significant byte is the same
as Byte 27 of the FRU Group Qualifier for the Sub-Enclosure Group (0x08).

5.1.1.13.7. FRU Group Qualifiers For The Sub-Enclosure Group (Code 0x08)

FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which
enclosure identifier is reporting the failed component. The least significant byte provides the
device type and state of the device being reported.

Statuses are reported such that the first enclosure for each channel is reported, followed by the
second enclosure for each channel.

5.1.1.13.7.1. Sub-Enclosure MSB Format:

5.1.1.13.7.11. Tray Identifier Enable (TIE) Bit


When the Tray Identifier Enable (TIE) bit is set to 01b, the Sub-Enclosure Identifier field provides
the tray identifier for the sub-enclosure being described.

5.1.1.13.7.12. Sub-Enclosure Identifier


When set to 00b, the Sub-Enclosure Identifier is defined as

5.1.1.13.7.2. Sub-Enclosure LSB Format

5.1.1.13.7.21. Sub-Enclosure Device State

The Sub-Enclosure Device State is defined as

5.1.1.13.7.22. Sub-Enclosure Device Type Identifier

The Sub-Enclosure Device Type Identifier is defined as

5.1.1.13.8. FRU Group Qualifiers For The Redundant Controller Group (Code 0x09)

FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates which tray
contains the failed controller. The least significant byte indicates the failed controller within the
tray.

5.1.1.13.8.1. Redundant Controller MSB Format:

5.1.1.13.8.2. Redundant Controller LSB Format:

5.1.1.13.8.21. Controller Number Field


The Controller Number field is defined as:

5.1.1.13.9. FRU Group Qualifiers For The Drive Group (Code 0x10 – 0xFF)

FRU Group Qualifier - Bytes 26 (MSB) & 27 (LSB) - The most significant byte indicates the tray
number of the affected drive. The least significant byte indicates the drive’s physical slot within
the drive tray indicated in byte 26.

5.1.1.13.9.1. Drive Group MSB Format:

5.1.1.13.9.2. Drive Group LSB Format:

5. 1. 1. 14. Error Specific Information - Bytes 34-36

This field provides information read from the array controller VLSI chips and other sources. It is
intended primarily for development testing, and the contents are not specified.

5. 1. 1. 15. Error Detection Point - Bytes 37-40

The error detection point field will indicate where in the software the error was detected. It is
intended primarily for development testing, and the contents are not specified.

5. 1. 1. 16. Original CDB - Bytes 41-50

This field contains the original Command Descriptor Block received from the host.

5. 1. 1. 17. Reserved - Byte 51

5. 1. 1. 18. Host Descriptor - Bytes 52-53

This bit position field provides information about the host. Definitions are given below.

5. 1. 1. 19. Controller Serial Number - Bytes 54-69

This sixteen-byte field contains the manufacturing identification of the array hardware. Bytes of
this field are identical to the information returned by the Unit Serial Number page in the Inquiry
Vital Product Data.

5. 1. 1. 20. Array Software Revision - Bytes 70-73

The Array Application Software Revision Level matches that returned by an Inquiry command.

5. 1. 1. 21. LUN Number - Byte 75

The LUN number field is the logical unit number in the Identify message received from the host
after selection.

5. 1. 1. 22. LUN Status - Byte 76

This field indicates the status of the LUN. Its contents are defined in the logical array page
description in the Mode Parameters section of this specification except for the value of 0xFF,
which is unique to this field.
A value of 0xFF returned in this byte indicates the LUN is undefined or is currently unavailable
(reported at Start of Day before the LUN state is known).

5. 1. 1. 23. Host ID - Bytes 77-78

The host ID is the SCSI ID of the host that selected the array controller for execution of this
command.

5. 1. 1. 24. Drive Software Revision - Bytes 79-82

This field contains the software revision level of the drive involved in the error if the error was a
drive error and the controller was able to retrieve the information.

5. 1. 1. 25. Drive Product ID - Bytes 83-98

This field identifies the Product ID of the drive involved in the error if the error was a drive error
and the controller was able to determine this information. This information is obtained from the
drive Inquiry command.

5. 1. 1. 26. Array Power-up Status - Bytes 99-100


In this release of the software, these bytes are always set to zero.

5. 1. 1. 27. RAID Level - Byte 101

This byte indicates the configured RAID level for the logical unit returning the sense data. The
values that can be returned are 0, 1, 3, 5, or 255. A value of 255 indicates that the LUN RAID
level is undefined.

5. 1. 1. 28. Drive Sense Identifier - Bytes 102-103

These bytes identify the source of the sense block returned in the next field. Byte 102 identifies
the channel and ID of the drive. Refer to the FRU group codes for physical drive ID assignments.
Byte 103 is reserved for identification of a drive logical unit in future implementations and it is
always set to zero in this release.

5. 1. 1. 29. Drive Sense Data - Bytes 104-135

For drive detected errors, these fields contain the data returned by the drive in response to the
Request Sense command from the array controller. If multiple drive errors occur during the
transfer, the sense data from the last error will be returned.

5. 1. 1. 30. Sequence Number - Bytes 136-139

This field contains the controller’s internal sequence number for the IO request.

5. 1. 1. 31. Date and Time Stamp - Bytes 140-155


The 16 ASCII characters in this field will be three spaces followed by the month, day, year, hour,
minute, second when the error occurred in the following format:

MMDDYY/HHMMSS
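
Because the field is fixed-width ASCII, it can be parsed by stripping the three leading spaces and
applying the stated format. A minimal sketch in Python:

    from datetime import datetime

    def parse_sense_timestamp(field):
        """Parse the 16-character Date and Time Stamp field (bytes 140-155)."""
        text = field.decode("ascii").strip()   # drop the three leading spaces
        return datetime.strptime(text, "%m%d%y/%H%M%S")

    # Example: b'   021508/134502' parses as February 15, 2008, 13:45:02
    print(parse_sense_timestamp(b"   021508/134502"))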

5. 1. 1. 32. Reserved - Bytes 156 – 159

Appendix I – Chapter 11 – Sense Codes
Chapter 11: Sense Codes

11.1. Sense Keys

11.2. Additional Sense Codes and Qualifiers
This section lists the Additional Sense Code (ASC) and Additional Sense Code Qualifier
(ASCQ) values returned by the array controller in the sense data. SCSI-2 defined codes are used
when possible. Array specific error codes are used when necessary, and are assigned SCSI-2
vendor unique codes 0x80-0xFF. More detailed sense key information may be obtained from the
array controller command descriptions or the SCSI-2 standard.

Codes defined by SCSI-2 and the array vendor specific codes are shown below. The most
probable sense keys (listed below for reference) returned for each error are also listed in the
table. A sense key encapsulated by parentheses in the table is an indication that the sense key is
determined by the value in byte 0x0A. See Section .
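
Given that split, a decoder can at least tell standard SCSI-2 codes apart from the array's
vendor-unique range. A minimal sketch in Python (treating either half of the pair falling in
0x80-0xFF as array specific is an assumption; the helper name is illustrative):

    def classify_asc_ascq(asc, ascq):
        """Classify an ASC/ASCQ pair as SCSI-2 defined or array vendor-unique (0x80-0xFF)."""
        if asc >= 0x80 or ascq >= 0x80:
            return "vendor-unique (array specific)"
        return "SCSI-2 defined"

    # Example: an ASC of 0x84 falls in the vendor-unique range
    print(classify_asc_ascq(0x84, 0x00))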
