Understanding SCSI Sense - Disk Survey

http://blog.disksurvey.org/knowledge-base/scsi-sense/

Disk Survey
Surveying the disks of the world
Understanding SCSI Sense


This page is about decoding and interpreting the SCSI sense buffer in order to troubleshoot a disk or storage
device problem.
A SCSI sense buffer is the error reporting facility in SCSI. It reports the error code and possibly also additional
information that helps to locate the source of the problem so the administrator or developer can help resolve the
issue.
A SCSI sense has several top-level attributes that one would care about the most:
Sense type, either fixed or descriptor,
What command it relates to, current or previous,
Sense Key,
ASC/ASCQ: Additional Sense Code and Additional Sense Code Qualifier.
The easiest way to decode a sense buffer is to use a tool. I know of two:
sg3_utils, which provides sg_decode_sense since version 1.31
libscsicmd, which implements a decoder
a web tool based on libscsicmd is also available to decode the sense data
The explanation below focuses on how to decode a sense buffer by hand and on what can be understood from it.
The sense type matters for decoding: you need to know whether the buffer is in fixed format or descriptor
format. The fixed format is the most common, and most of the direct decoding instructions below are about it.
The descriptor format is more complex and less frequent, but it's worth being aware of its existence. Both
formats provide the same details; only the decoding mechanics differ.
One important distinction about a sense buffer is whether the sense is about the command that failed with the sense
or about a previous command. It is entirely possible that the command that returned with an error is not at all at fault
and that everything is just fine with it, but that a previous command that was already acknowledged went bad at the end
and the SCSI target has no other way to tell the user about the problem. In such a case some random other
command will be failed with a sense buffer that indicates the problem was in a previous command.
The first byte to look at is byte 0, and what matters there are the 7 lower bits, so if the number is at or above 80h
(128 decimal) you need to subtract 80h to get the actual value. There are only 4 permitted values for these 7
lower bits:
70h fixed format, current sense
71h fixed format, previous sense
72h descriptor format, current sense
73h descriptor format, previous sense
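The byte 0 rule above can be sketched in a few lines of Python. This is only an illustration: the names RESPONSE_CODES and decode_response_code are made up for this example and do not come from any of the tools mentioned earlier.

```python
# The four permitted values listed above, mapped to (format, which command).
RESPONSE_CODES = {
    0x70: ("fixed", "current"),
    0x71: ("fixed", "previous"),
    0x72: ("descriptor", "current"),
    0x73: ("descriptor", "previous"),
}


def decode_response_code(byte0):
    """Return (format, which command) for byte 0 of a sense buffer."""
    code = byte0 & 0x7F  # keep only the 7 lower bits, dropping the top bit
    if code not in RESPONSE_CODES:
        raise ValueError(f"unexpected value {code:02X}h in byte 0")
    return RESPONSE_CODES[code]


print(decode_response_code(0xF1))  # ('fixed', 'previous')
```

Masking with 7Fh is the same as subtracting 80h when the top bit is set, and leaves the value alone when it is not.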
The next important information is in byte 2 (number 3 if counting from 1): the four lowest bits are the sense key.
Since the sense buffer is given in hexadecimal numbers, this is the second character of the number. The sense key is
the key to understanding the error code. It tells you the high-level issue; the common sense keys and their meanings are detailed below.
The next part is the ASC and ASCQ, which in the fixed format are found in bytes 12 and 13 (13 and 14 if counting from 1). These
explain the specifics of the problem in somewhat more detail.
Take a look at the following example:
f1 00 03 02 DD 7E BF 18 00 00 00 00 0C 03 00 00 00 00 00 00 03 0C 03 00 00 0F 83 01 00 08 00 00

The first byte is F1h. We remember to remove the top bit and get 71h, which means fixed format, previous
sense, so we can decode further according to our instructions and remember that the IO that failed with this sense
is not to blame; it was a previous IO that failed. Next we find 3h as the second nibble of the third byte, which
tells us that this is a medium error: a disk tried to read or write and failed. The last part is the ASC and ASCQ,
which are 0Ch/03h and translate to WRITE ERROR - RECOMMEND REASSIGNMENT. This tells us it
was a write that failed and that the disk suggests reassigning the sector. One part that is a bit harder to decode
is the LBA that actually failed. The top bit of the first byte being set says that the information field
has meaning, and in the case of a medium error (sense key 3h) the meaning is the first LBA that failed. In the fixed
format the information field is bytes 3 to 6, here 02 DD 7E BF, so the failing LBA is 02DD7EBFh.
You can see full parsing of this sense in the webapp: f1 00 03 02 DD 7E BF 18 00 00 00 00 0C 03 00 00 00 00 00
00 03 0C 03 00 00 0F 83 01 00 08 00 00
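The manual decoding steps walked through above can be sketched in Python. This is a minimal sketch for the fixed format only; parse_fixed_sense and the dictionary keys are names invented for this example, and it assumes the information field sits in bytes 3 to 6 as in the fixed format layout.

```python
def parse_fixed_sense(hex_text):
    """Decode the top-level attributes of a fixed-format sense buffer."""
    sense = bytes.fromhex(hex_text)
    if len(sense) < 14:
        raise ValueError("buffer too short to hold ASC/ASCQ")
    response_code = sense[0] & 0x7F          # 7 lower bits of byte 0
    if response_code not in (0x70, 0x71):
        raise ValueError("not a fixed-format sense buffer")
    info_valid = bool(sense[0] & 0x80)       # top bit of byte 0
    return {
        "previous": response_code == 0x71,   # sense is about an earlier command
        "sense_key": sense[2] & 0x0F,        # four lowest bits of byte 2
        "asc": sense[12],
        "ascq": sense[13],
        # Bytes 3..6, big-endian; only meaningful when the top bit is set.
        "information": int.from_bytes(sense[3:7], "big") if info_valid else None,
    }


example = ("f1 00 03 02 DD 7E BF 18 00 00 00 00 0C 03 00 00 "
           "00 00 00 00 03 0C 03 00 00 0F 83 01 00 08 00 00")
result = parse_fixed_sense(example)
print(f"previous={result['previous']} key={result['sense_key']:X}h "
      f"asc/ascq={result['asc']:02X}h/{result['ascq']:02X}h "
      f"lba={result['information']:08X}h")
```

For anything beyond a quick look, the tools mentioned earlier remain the better choice, since they also handle the descriptor format and the full ASC/ASCQ tables.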
The sense keys are listed briefly at the T10 Sense Key page. The ASC/ASCQ are listed at the T10 ASC/ASCQ
page.
Common sense keys are:
1h  Recovered Error   informational only
2h  Not Ready         temporary error, need to wait it out
3h  Medium Error      may work if retried, disappears after write or reassign
4h  Hardware Error    usually permanent failure
5h  Illegal Request   mostly a programming error, maybe the device handles an older standard with some bits unsupported
6h  Unit Attention    a storage fabric problem, usually a notification and not a problem with the IO itself
7h  Data Protect      the device cannot be read/written, needs to be unlocked (physically or logically)
Bh  Aborted Task      fabric problem, command may be retried but possibly a bad cable
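As an illustration, the list of common sense keys can be turned into a small lookup. The names SENSE_KEYS and describe_sense_key are made up for this sketch, the hints are shortened from the list above, and keys not listed there are simply reported as unlisted.

```python
# Sense keys and hints taken from the list of common sense keys above.
SENSE_KEYS = {
    0x1: ("Recovered Error", "informational only"),
    0x2: ("Not Ready", "temporary error, wait it out"),
    0x3: ("Medium Error", "may work if retried, disappears after write or reassign"),
    0x4: ("Hardware Error", "usually permanent failure"),
    0x5: ("Illegal Request", "mostly a programming error"),
    0x6: ("Unit Attention", "notification, not a problem with the IO itself"),
    0x7: ("Data Protect", "device needs to be unlocked"),
    0xB: ("Aborted Task", "fabric problem, command may be retried"),
}


def describe_sense_key(byte2):
    """Describe the sense key held in the four lowest bits of byte 2."""
    key = byte2 & 0x0F
    name, hint = SENSE_KEYS.get(key, ("not listed here", "see the T10 Sense Key page"))
    return f"{key:X}h {name}: {hint}"


print(describe_sense_key(0x03))
```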

Recovered Error

A recovered error is the least problematic kind, since it only says that there was a problem which the storage
device managed to take care of. The device is just letting the user know, in case they are curious or would like to
dig deeper and find out what is going on.
There are two reasons why a recovered error would be returned:
SMART Trip
Medium Errors recovered
If a disk finds that it is about to fail according to its SMART logic (also known as Informational Exceptions in
SCSI), it will report it in log page 2Fh. The only way for it to tell the storage system that now is the time to start
looking at this log page is to take one random IO (the first one that comes up after the SMART issue is detected)
and return a recovered error with an ASC/ASCQ of 5Dh/00h, which stands for FAILURE PREDICTION
THRESHOLD EXCEEDED.
If a sector is having problems and it took a non-trivial amount of work to recover from it, a recovered error may be
reported with an ASC of 11h, 17h or 18h, depending on the severity and the type of recovery needed.
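A monitoring script could tell the two reasons apart from the ASC/ASCQ alone. The following is only a sketch under that assumption; classify_recovered_error and its return strings are hypothetical names chosen for this example.

```python
def classify_recovered_error(asc, ascq):
    """Classify a Recovered Error (sense key 1h) by the two reasons above."""
    if (asc, ascq) == (0x5D, 0x00):
        # FAILURE PREDICTION THRESHOLD EXCEEDED: time to read log page 2Fh.
        return "smart-trip"
    if asc in (0x11, 0x17, 0x18):
        # The device had to work to recover the data in some sector.
        return "recovered-medium-error"
    return "other"


print(classify_recovered_error(0x5D, 0x00))  # smart-trip
```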
Whether a specific device returns a recovered error sense is determined by some parameters in the mode
pages. You may want to peruse them to find out how to turn this behavior on or off.
The normal Linux kernel SCSI stack will ignore this sense and continue along, only reporting it in dmesg.

Not Ready
Device not ready: wait and retry. The device will either become ready or fail, and if it fails it should time out
by itself to a Hardware Error.
A Not Ready sense is returned when the device is powering up and not yet ready to really respond to anything
serious, such as when an HDD is still spinning up or when an SSD has still not read its metadata tables from the
flash.
Under some error conditions this may persist for a while, and if it persists for more than 30 seconds or so it is
likely to be a failure already. In most cases the device will have a timeout of its own, after which it will transition to
replying 4h Hardware Error instead of the Not Ready reply.
A user can only wait a bit more for the device to get ready, and fail it out if it takes too long to exit this state.
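The wait-then-give-up policy can be sketched as a polling loop with a deadline. Everything here is an assumption made for illustration: is_ready stands for whatever check you use (for example a command that stops returning Not Ready), and the 30 second default just mirrors the rough figure above.

```python
import time


def wait_until_ready(is_ready, timeout_s=30.0, poll_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    Returns True if the device became ready, False if it should be failed out.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A caller would treat a False return the same way as a Hardware Error and take the device out of service.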

Medium Error
A medium error means that you tried to read or write data and the disk failed. It is also taken to mean that the
problem is not permanent and doesn't afflict the entire disk, only some area of it. A disk can reassign the affected
sectors to solve the problem. If the disk is configured to auto-reassign, then a write to that area will cause the disk to
reassign and the problem will be gone; if the disk is not configured to auto-reassign, then you need to use the
REASSIGN BLOCKS command to get the same effect.
In some cases a retry of the read may eventually get the data, but the likelihood is not high and it
incurs a great penalty in time, since normally a medium error is declared only after a timeout is reached during the read
operation.
In a SCSI disk there are two bits, AWRE and ARRE, that control whether auto-reallocation is done by the disk on writes and on reads respectively.
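To check these two bits you need the Read-Write Error Recovery mode page. The sketch below assumes the usual layout of that page (page code 01h, with AWRE in bit 7 and ARRE in bit 6 of byte 2) and that you already have the raw page bytes from a MODE SENSE; reallocation_flags is a name made up for this example, so verify the layout against the standard or your drive manual before relying on it.

```python
def reallocation_flags(mode_page):
    """Read AWRE and ARRE from a raw Read-Write Error Recovery mode page.

    mode_page must start at the page header (byte 0 holds the page code).
    """
    if len(mode_page) < 3:
        raise ValueError("mode page too short")
    if mode_page[0] & 0x3F != 0x01:   # low 6 bits of byte 0 are the page code
        raise ValueError("not the Read-Write Error Recovery mode page")
    flags = mode_page[2]
    return {
        "AWRE": bool(flags & 0x80),   # auto-reallocate on write
        "ARRE": bool(flags & 0x40),   # auto-reallocate on read
    }


print(reallocation_flags(bytes([0x81, 0x0A, 0xC0])))  # {'AWRE': True, 'ARRE': True}
```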

Hardware Error
A hardware error is reported when the disk reaches a fatal state from which it will not recover. The disk can no
longer be read or written. Not even a power cycle will help in this case.

Illegal Request
When a disk returns Illegal Request, it means it failed to parse the command or the data you gave it. Either the
command is invalid or it is unsupported by the disk. This can happen when the disk supports an older standard or
doesn't adhere to the standard completely.
When doing MODE SELECT and LOG SELECT commands, you will also get an Illegal Request if the parameter you are
trying to change does not support being changed. You can get the Changeable Mask for MODE
SELECT with MODE SENSE to see if this is the case.
You will need to reformulate the request or avoid it altogether.

Unit Attention
A Unit Attention is the way for the device to tell you that its operational state or the fabric state has changed.
Since SCSI is a client-server protocol, there is no other way for the device to tell you that something changed
without piggy-backing on another request, which is exactly what happens here. The command that you performed
is likely to be just fine, but there was some other condition in the device that requires the user's attention. The
condition needs to be taken care of, and the command that was unfortunate enough to be failed for it can be
retried.
Examples of this are when MODE SELECT is used to change mode parameters, or when an initiator is lost and
you then get an I_T NEXUS LOSS OCCURRED, among several other such cases.

Data Protect
Data Protect is received when the device is working but locked, either by a physical write lock or, with Data-at-Rest
encryption, when the device or the band has not yet been unlocked.
It only means that an unlock needs to happen for the action to be allowed.

Aborted Task
When a communication link fails or a command is aborted, you can get this sense key. It cannot be directly
attributed to a failure in the device; it is most likely a connectivity issue which will need to be resolved.
If there is a flaky link, these errors can come and go from time to time and it will be hard to communicate with the
device. In most cases the command should just be retried several times, and a problem flag raised if this continues.
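A retry policy along these lines can be sketched as follows. It is only an illustration: submit stands for a hypothetical callable that issues the command and returns the resulting sense key (or None on success), and max_attempts is an arbitrary choice.

```python
ABORTED_TASK = 0x0B


def submit_with_retry(submit, max_attempts=3):
    """Retry a command while it keeps failing with sense key Bh.

    Returns (final sense key or None, number of aborted attempts). The caller
    can raise a problem flag when the aborted count keeps being non-zero.
    """
    aborted = 0
    for _ in range(max_attempts):
        sense_key = submit()
        if sense_key != ABORTED_TASK:
            return sense_key, aborted   # success or a different kind of error
        aborted += 1
    return ABORTED_TASK, aborted        # still aborted after every attempt
```

Counting the aborted attempts, not just the final outcome, is what lets a flaky link be noticed even when the retries eventually succeed.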
Some communication failures, when they are rare, are of no importance and can be assumed to happen, but if the
failure is common enough it should be reported so that the user can fix it. The normal BER for SCSI links is around
10^-15, so on the order of one error per day at a full data rate of about 6Gbps is perfectly acceptable. Above that it really
depends on the application and system.
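The arithmetic behind that estimate is short enough to spell out. The sketch assumes the link runs at the full 6Gbps line rate for the whole day and ignores encoding overhead, so it is an upper bound on the bits actually carried.

```python
BER = 1e-15                      # bit error rate quoted above
LINK_RATE_BPS = 6e9              # about 6Gbps
SECONDS_PER_DAY = 24 * 60 * 60

bits_per_day = LINK_RATE_BPS * SECONDS_PER_DAY
expected_errors_per_day = bits_per_day * BER

print(f"{bits_per_day:.3e} bits/day, {expected_errors_per_day:.2f} expected errors/day")
```

The result is about half an error per day, which is the same order of magnitude as the one error per day mentioned above.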
Posted by Baruch Even

Copyright 2014 - Baruch Even - Powered by Octopress
