C5 2007-06-01 Silent Corruptions P Kelemen

Silent Corruptions
KELEMEN Péter, CERN IT-FIO, After C5
Silent Corruptions Peter.Kelemen@cern.ch CERN After C5, June 1st, 2007 1
Corruptions
Corruption: data is changed unintentionally
mechanisms to detect/correct
  parity, checksum (CRC32, MD5, SHA1, ...)
  ECC
  multiple copies with quorum
detection: SW/HW-level with error messages
correction: SW/HW-level with warnings
Silent corruption: data is changed unintentionally without any errors/warnings!
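As an illustration of the "multiple copies with quorum" mechanism, here is a minimal sketch of a majority vote over replicas. The function name `quorum_read` is hypothetical and the replicas are plain byte strings; this is not taken from any particular storage system.

```python
from collections import Counter

def quorum_read(copies):
    """Return the value held by a strict majority of copies, or None if there is no quorum."""
    if not copies:
        return None
    value, votes = Counter(copies).most_common(1)[0]
    return value if votes * 2 > len(copies) else None

good = b"\x33" * 16
bad = good[:14] + b"\x22" + good[15:]   # one silently corrupted replica

print(quorum_read([good, bad, good]) == good)  # the majority outvotes the bad copy
print(quorum_read([good, bad]))                # two copies that disagree: no quorum
```

With only two copies a mismatch can be detected but not corrected, which is why quorum schemes need at least three replicas.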
Corruption Sources
hardware errors (memory, CPU, disk, NIC)
data transfer noise (UTP, SATA, FC, wireless)
firmware bugs (RAID controller, disk, NIC)
software bugs
  kernel
    VM
    filesystems
    block layer
  application: crash, etc. (not discussed)

Expected Bit Error Rate (BER)
NIC/link: 10^-10 (1 bit in ~1.1 GiB)
  checksummed, retransmit if necessary
memory: 10^-12 (1 bit in ~116 GiB)
  ECC
desktop disk: 10^-14 (1 bit in ~11.3 TiB)
  various error correction codes
enterprise disk: 10^-15 (1 bit in ~113 TiB)
  various error correction codes
quotes from standards/specifications
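The "1 bit in ~N" figures follow directly from the bit error rates: a BER of 10^-k means one bad bit per 10^k bits, i.e. per 10^k / 8 bytes. A short sketch of that arithmetic (the helper name `bytes_per_bit_error` is made up for this example):

```python
GiB = 2**30
TiB = 2**40

def bytes_per_bit_error(ber):
    """Average number of bytes transferred per single bit error at the given BER."""
    return 1.0 / ber / 8

rates = {
    "NIC/link": 1e-10,
    "memory": 1e-12,
    "desktop disk": 1e-14,
    "enterprise disk": 1e-15,
}

for name, ber in rates.items():
    n = bytes_per_bit_error(ber)
    if n >= TiB:
        print(f"{name}: 1 bit in ~{n / TiB:.2f} TiB")
    else:
        print(f"{name}: 1 bit in ~{n / GiB:.2f} GiB")
```

The computed values are about 1.16 GiB, 116 GiB, 11.4 TiB and 114 TiB, matching the figures quoted above once those are read as rounded down.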
It Already Happened to Us
DON'T PANIC!
acknowledge user observation (if any)
assess the problem
  develop/deploy tools for data collection
estimate the scale of the problem
research the cause (correlation) and impact
evaluate possible solutions
deploy possible solutions
fsprobe(8)
probabilistic storage integrity check
  write known bit pattern
  read it back
  compare and alert when mismatch found
low I/O footprint for background operation
keep complexity to the minimum
use static buffers
attempt to preserve details about detected corruptions for further analysis
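The write / read back / compare loop can be sketched as follows. This is not fsprobe's source code: the function names, the `fsync` call and the small demo file size are assumptions made for illustration. In particular, a read-back on a real system may be served from the page cache rather than from disk, and the slides do not say how fsprobe handles that.

```python
import os
import tempfile

def write_pattern(path, pattern_byte, file_size, buf_size=1 << 20):
    """Fill the file with pattern_byte and push it out to the storage stack."""
    pattern = bytes([pattern_byte]) * buf_size   # static pattern buffer
    with open(path, "wb") as f:
        remaining = file_size
        while remaining > 0:
            n = min(buf_size, remaining)
            f.write(pattern[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())

def verify_pattern(path, pattern_byte, buf_size=1 << 20):
    """Read the file back; return the offsets of buffers that differ from the pattern."""
    pattern = bytes([pattern_byte]) * buf_size
    bad_offsets = []
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buf_size)
            if not chunk:
                break
            if chunk != pattern[:len(chunk)]:
                bad_offsets.append(offset)       # keep details for later analysis
            offset += len(chunk)
    return bad_offsets

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "probe.dat")
        for byte in (0x55, 0xAA, 0x33, 0xCC, 0x0F, 0xF0):
            write_pattern(path, byte, file_size=4 << 20)   # 4 MiB demo, not 2 GiB
            print(hex(byte), verify_pattern(path, byte))
```

An empty list means every buffer matched; any offsets returned mark the buffers worth dumping for analysis.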
fsprobe cycles
pattern buffer (1 MiB)
test file (2 GiB)
read buffer (1 MiB)
ONE cycle completes in 2*2048 sec = 1 hour 8 minutes
SIX cycles in (2*2048+300)*6 sec = 7 hours 20 minutes
0x55 [01010101]
0xAA [10101010]
0x33 [00110011]
0xCC [11001100]
0x0F [00001111]
0xF0 [11110000]
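The cycle timings can be checked with a few lines of arithmetic. The interpretation used here is an assumption inferred from the formulas, not stated on the slide: 2048 is the number of 1 MiB buffers in the 2 GiB test file, the factor 2 covers one write pass and one read pass at roughly one buffer per second, and 300 is a pause in seconds after each cycle.

```python
MIB = 1 << 20
BUFFERS = (2 << 30) // MIB            # 2 GiB test file / 1 MiB buffer = 2048

def fmt(seconds):
    """Format a duration rounded to the nearest minute."""
    hours, minutes = divmod(round(seconds / 60), 60)
    return f"{hours} h {minutes} min"

one_cycle = 2 * BUFFERS               # write pass + read pass (assumed 1 buffer/sec)
six_cycles = (2 * BUFFERS + 300) * 6  # six patterns, assumed 300 s pause after each

patterns = [0x55, 0xAA, 0x33, 0xCC, 0x0F, 0xF0]

print(one_cycle, "sec =", fmt(one_cycle))
print(six_cycles, "sec =", fmt(six_cycles))
for p in patterns:
    print(f"0x{p:02X} [{p:08b}]")
```

The six patterns form three complementary pairs, so over a full set of cycles every bit position is written as both 0 and 1 with three different neighbour arrangements.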
Investigation
fsprobe deployed on 4000 nodes
1400 incidents reported (total ~50 PB traffic)
230 nodes affected (26 HW types)
incidents tracked in SQLite
steady flow of 1-3 incidents per day
multiple types of corruptions observed
affected systems are very diverse
  SLC3/SLC4/RHEL4, XFS/ext3, 3ware/ARECA, ...
some corruptions are transient
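To put these numbers in proportion, a rough back-of-the-envelope calculation, assuming "~50 PB" means decimal petabytes. An incident can be anything from a single flipped bit to several large chunks, so the result is not a bit error rate.

```python
nodes_total = 4000
nodes_affected = 230
incidents = 1400
traffic_bytes = 50e15   # ~50 PB, decimal petabytes assumed

affected_fraction = nodes_affected / nodes_total
incidents_per_affected_node = incidents / nodes_affected
bytes_per_incident = traffic_bytes / incidents

print(f"{affected_fraction:.2%} of nodes affected")
print(f"{incidents_per_affected_node:.1f} incidents per affected node on average")
print(f"one incident per ~{bytes_per_incident / 1e12:.1f} TB of probe traffic")
```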

Corruption Types
Type I
  single/double bit errors
  usually bad memory (RAM, cache, etc.)
Type II
  small, 2^n-sized chunks (128-512 bytes) of unknown origin
Type III
  multiple large chunks of 64K, old file data
Type IV
  various sized chunks of zeros
Type I
usually persistent
bit(s) have flipped in a byte
  Single Bit Error (SBE)
  Double Bit Error (DBE)
    DBEs are 3x more common than SBEs
    a single case of a triple bit error was observed
1→0 transition more frequent than 0→1
strong correlation with bad memory (verified)
happens with expensive ECC-memory too
Type I Example
00000000  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
*
35285650  33 33 33 33 33 33 33 33  33 33 33 33 33 33 22 33  |33333333333333"3|
35285660  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
*
80000000

0x33 = 00110011b
0x22 = 00100010b
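The example byte can be classified mechanically by XOR-ing the expected and observed values: the set bits of the result are the flipped positions, and masking with the expected value tells which of them went from 1 to 0. A small sketch (the function name `classify_flip` is hypothetical):

```python
def classify_flip(expected, observed):
    """Return (kind, n_one_to_zero, n_zero_to_one) for a single corrupted byte."""
    diff = expected ^ observed                        # set bits mark flipped positions
    flipped = bin(diff).count("1")
    one_to_zero = bin(expected & diff).count("1")     # bits that were 1 and are now 0
    zero_to_one = flipped - one_to_zero
    kind = {0: "none", 1: "SBE", 2: "DBE"}.get(flipped, f"{flipped}-bit error")
    return kind, one_to_zero, zero_to_one

print(classify_flip(0x33, 0x22))
```

For 0x33 read back as 0x22 the difference is 0x11: a double bit error in which both bits went from 1 to 0.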
Type II
usually transient
small chunks of random looking data
  ...but can go up to 128K
sometimes identifiable user data
observed in vicinity of OOM situations
possible SLAB corruption?
Type II Example
00000000  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
*
000def00  f1 2b f8 2b cd 43 38 38  e3 43 bd 8d b0 01 12 0a  |.+.+.C88.C......|
000def10  af 6e c3 b6 57 3e 5f fa  e3 d6 e8 7b ef 5b 6f 3c  |.n..W>_....{.[o<|
000def20  e4 42 42 0c e9 22 2e f1  d0 c6 a5 55 f2 f3 a7 38  |.BB..".....U...8|
000def30  b5 43 77 c9 5d 4e 16 a2  39 79 5f 31 10 65 b8 e4  |.Cw.]N..9y_1.e..|
000def40  9c 8a 94 a0 73 2d f7 ad  d9 12 31 2b f5 db b4 18  |....s-....1+....|
000def50  e4 ff c6 14 ee 00 d6 c0  7a c8 8e c0 3f 73 32 73  |........z...?s2s|
000def60  79 b3 63 d8 4c 8f 5d d8  c5 1e e4 5f 0e 2a 1d 94  |y.c.L.]...._.*..|
000def70  f8 2d 64 e7 10 e1 6b 89  da b3 fb 0b 48 59 d4 df  |.-d...k.....HY..|
000def80  9a 0c a4 ae 18 c5 40 a5  70 6b 19 d3 b9 f7 a2 b3  |......@.pk......|
000def90  44 df 2a 50 a1 55 31 02  57 d6 19 43 80 b0 0a 89  |D.*P.U1.W..C....|
000defa0  fe c2 34 ed cc 73 2e 64  38 89 6e 5a be d7 3b c4  |..4..s.d8.nZ..;.|
000defb0  db dd 58 42 a1 62 2f 6d  92 6a ed 9b 23 6e 1e 79  |..XB.b/m.j..#n.y|
000defc0  d8 88 38 86 92 2f af 29  a1 0f c0 21 46 fc 3a e3  |..8../.)...!F.:.|
000defd0  ac 17 0a f1 c3 31 89 82  59 e5 89 b9 9e fa 45 b9  |.....1..Y.....E.|
000defe0  54 d6 6a 72 b9 6a d6 1f  ff cb 6a 10 1e bd 66 87  |T.jr.j....j...f.|
000deff0  68 80 64 b0 53 97 74 72  ee f6 87 9f 23 47 cb 48  |h.d.S.tr....#G.H|
000df000  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
*
00100000
Type III
usually persistent, comes in bursts
strong correlation: I/O command timeouts
  3ware hides timeouts, look at extended diag
  also observed on plain SATA systems
  ...sometimes with failed READ commands!
previous data from earlier cycles (sometimes multiple cycles old!) or from another location on disk
seems to match RAID stripe size (64K)
  observed on 16K chunk RAID arrays as well

Type III Example

00000000  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
*
34205200  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
*
34215200  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
*
34265200  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
*
34275200  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
*
342c5200  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
*
342d5200  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
*
34325200  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
*
34335200  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
*
34385200  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
*
34395200  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
*
80000000
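Reading the offsets of this example pairwise gives the size and spacing of the stale chunks. The sketch below assumes, as the alternating 0xCC / 0x33 rows indicate, that each 0x33 run starts at one listed offset and ends at the next.

```python
# Offsets where stale 0x33 data starts and where the expected 0xCC pattern resumes.
starts = [0x34205200, 0x34265200, 0x342C5200, 0x34325200, 0x34385200]
ends   = [0x34215200, 0x34275200, 0x342D5200, 0x34335200, 0x34395200]

lengths = [e - s for s, e in zip(starts, ends)]
spacing = [b - a for a, b in zip(starts, starts[1:])]

print([n // 1024 for n in lengths], "KiB per chunk")
print([n // 1024 for n in spacing], "KiB between chunk starts")
```

Every chunk is exactly 64 KiB of 0x33, the pattern written in the cycle before 0xCC, and the chunks recur every 384 KiB.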
Type IV
usually persistent
relatively recent observations (since April)
...not sure yet this warrants another category
Type IV Example
00000000  aa aa aa aa aa aa aa aa  aa aa aa aa aa aa aa aa  |................|
*
00052980  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00053000  aa aa aa aa aa aa aa aa  aa aa aa aa aa aa aa aa  |................|
*
80000000
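Dumps like the ones above can be summarised by listing each contiguous run of bytes that differs from the expected pattern. A minimal sketch (the function name `mismatch_runs` is hypothetical), applied to a synthetic buffer that mimics the offsets of the Type IV example:

```python
def mismatch_runs(data, expected_byte):
    """Return (offset, length) for each contiguous run of bytes != expected_byte."""
    runs = []
    start = None
    for i, b in enumerate(data):
        if b != expected_byte:
            if start is None:
                start = i                        # a new run begins here
        elif start is not None:
            runs.append((start, i - start))      # the run ended just before i
            start = None
    if start is not None:                        # run extends to the end of the data
        runs.append((start, len(data) - start))
    return runs

# Synthetic buffer: 0xAA pattern with a zeroed region at 0x52980..0x53000.
buf = bytearray([0xAA]) * 0x54000
buf[0x52980:0x53000] = bytes(0x53000 - 0x52980)

for offset, length in mismatch_runs(buf, 0xAA):
    print(f"offset 0x{offset:08x}, {length} bytes")
```

The run length and alignment are what separate the categories: a single byte points towards Type I, a small power-of-two chunk towards Type II, and 64K multiples towards Type III.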
Corruption Time Distribution

[Figure: histogram of #incidents (axis 0-110) by hours (axis 0-23)]
Operating Systems
[Chart; legend: SLC3, SLC4, RHEL4, other]
Corruption Types
[Chart; legend: Type I, Type II, Type III, Type IV, unknown]
Corruption Persistence
[Chart; legend: persistent, transient, unknown]
Daily Distribution
[Chart; annotations:]
wide-scale deployment ramp-up finished 2007-01-31
WD firmware upgrade campaign started 2007-02-20
Where From?
user space
VM
filesystems
block layer
SCSI layer
low-level drivers
controller firmware
storage firmware
physical magnetic media

What Can Be Done?
self-examining/healing hardware (?)
WRITE-READ cycles before ACK
checksumming? not necessarily enough
end-to-end checksumming (ZFS has a point)
store multiple copies
regular scrubbing of RAID arrays
data refresh re-read cycles on tapes
...generally accept and prepare for corruptions
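The idea behind end-to-end checksumming is that the producer computes the checksum and the consumer verifies it, so corruption anywhere in between (VM, filesystem, drivers, firmware, media) is detected. A minimal sketch, assuming SHA-1 and a simple "digest prepended to payload" layout; the names `seal` and `unseal` are invented for this example and the layout is not ZFS's on-disk format.

```python
import hashlib

DIGEST_LEN = hashlib.sha1().digest_size   # 20 bytes

def seal(payload: bytes) -> bytes:
    """Producer side: prepend a checksum computed before the data leaves the application."""
    return hashlib.sha1(payload).digest() + payload

def unseal(blob: bytes) -> bytes:
    """Consumer side: verify the checksum after the data has crossed the whole I/O stack."""
    digest, payload = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if hashlib.sha1(payload).digest() != digest:
        raise ValueError("checksum mismatch: data was corrupted somewhere on the path")
    return payload

blob = seal(b"\x33" * 64)
print(unseal(blob) == b"\x33" * 64)

corrupted = bytearray(blob)
corrupted[DIGEST_LEN + 14] = 0x22         # simulate a Type I double bit error
try:
    unseal(bytes(corrupted))
except ValueError as err:
    print(err)
```

This detects corruption but does not correct it; correction still needs redundancy, such as the multiple copies listed above.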
Conclusions
silent corruptions are a fact of life
first step towards a solution is detection
elimination seems impossible
existing datasets are at the mercy of Murphy
correction will cost time AND money
effort has to start now (if not started already)
multiple cost-schemes exist
  trade time and storage space (à la Google)
  trade time and CPU power (correction codes)
Departing Words
Trust, but verify

Ronald Reagan
Questions?
Thank you and have a nice filesystem (without corruptions)!
Further Reading
Bernd Panzer-Steindel: Data Integrity v3
http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797
