Corruptions
parity, checksum (CRC32, MD5, SHA1, ...)
ECC
multiple copies with quorum
detection: SW/HW-level with error messages
correction: SW/HW-level with warnings
Silent corruption: data is changed unintentionally without any errors/warnings!
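The "multiple copies with quorum" idea can be sketched as a majority vote over replicas. This is a minimal illustration, not something taken from the slides: the function name `quorum_read` and the strict-majority rule are assumptions.

```python
from collections import Counter


def quorum_read(copies):
    """Return the value held by a strict majority of the copies.

    `copies` is a list of bytes objects read from independent replicas
    (hypothetical setup). Raises ValueError when no value has a majority.
    """
    if not copies:
        raise ValueError("no copies to compare")
    value, votes = Counter(copies).most_common(1)[0]
    if votes * 2 <= len(copies):
        raise ValueError("no majority among copies")
    return value


# One of three replicas has a silently corrupted byte; the other two outvote it.
print(quorum_read([b"payload", b"payload", b"pazload"]))  # b'payload'
```

With only two copies a mismatch can be detected but not resolved, which is why the sketch insists on a strict majority.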
Silent Corruptions Peter.Kelemen@cern.ch CERN After C5, June 1st, 2007 2
Corruption Sources
hardware errors (memory, CPU, disk, NIC)
data transfer noise (UTP, SATA, FC, wireless)
firmware bugs (RAID controller, disk, NIC)
software bugs
kernel
checksummed, retransmit if necessary
ECC
various error correction codes
various error correction codes
It Already Happened to Us
DON'T PANIC!
acknowledge user observation (if any)
assess the problem
estimate the scale of the problem
research the cause (correlation) and impact
evaluate possible solutions
deploy possible solutions
fsprobe(8)
write known bit pattern
read it back
compare and alert when mismatch found
low I/O footprint for background operation
keep complexity to the minimum
use static buffers
attempt to preserve details about detected corruptions for further analysis
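The write, read back, compare loop can be sketched as follows. This is not the fsprobe source: the function names, the two-chunk file and the `demo` helper are assumptions made for illustration, and a plain read-back like this may be served from the page cache, so a real probe would need a way around the cache.

```python
import os
import tempfile

CHUNK = 1024 * 1024  # 1 MiB static pattern buffer (size chosen for illustration)


def write_pattern(path, fill, chunks):
    """Fill the file with `chunks` copies of a static buffer of byte `fill`."""
    buf = bytes([fill]) * CHUNK
    with open(path, "wb") as f:
        for _ in range(chunks):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device


def verify_pattern(path, fill, chunks):
    """Read the file back; return byte offsets of chunks that do not match."""
    buf = bytes([fill]) * CHUNK
    bad = []
    with open(path, "rb") as f:
        for i in range(chunks):
            if f.read(CHUNK) != buf:
                bad.append(i * CHUNK)
    return bad


def demo():
    """Write, verify, inject one bad byte by hand, then verify again."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_pattern(path, 0x33, 2)
        clean = verify_pattern(path, 0x33, 2)
        with open(path, "r+b") as f:  # simulate a corruption inside chunk 1
            f.seek(CHUNK + 5)
            f.write(b"\x22")
        dirty = verify_pattern(path, 0x33, 2)
        return clean, dirty
    finally:
        os.remove(path)
```

Calling `demo()` returns an empty list for the clean pass and the offset of the damaged chunk for the second pass. Keeping the mismatching chunk around, rather than only counting it, is what allows the later per-type analysis.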
fsprobe cycles
pattern buffer (1 MiB)
ONE cycle completes in 2*2048 sec = 1 hour 8 minutes
SIX cycles in (2*2048+300)*6 sec = 7 hours 20 minutes
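The timing figures can be checked with a few lines of arithmetic. The constant names are hypothetical, and reading the formula as two 2048-second passes plus a 300-second pause per cycle is an assumption based only on the expressions on the slide.

```python
from datetime import timedelta

PASS_SECONDS = 2048   # duration of one pass, from the slide's formula
PAUSE_SECONDS = 300   # extra term per cycle in the six-cycle formula


def cycle_seconds():
    """One cycle: 2 * 2048 seconds."""
    return 2 * PASS_SECONDS


def run_seconds(cycles):
    """Several cycles, each followed by the 300-second term."""
    return (2 * PASS_SECONDS + PAUSE_SECONDS) * cycles


print(timedelta(seconds=cycle_seconds()))  # 1:08:16
print(timedelta(seconds=run_seconds(6)))   # 7:19:36
```

The exact values are 1 h 8 min 16 s and 7 h 19 min 36 s, which the slide rounds to 1 hour 8 minutes and 7 hours 20 minutes.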
Investigation
fsprobe deployed on 4000 nodes
1400 incidents reported (total ~50PB traffic)
230 nodes affected (26 HW types)
incidents tracked in SQLite
steady flow of 1-3 incidents per day
multiple types of corruptions observed
affected systems are very diverse
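A back-of-the-envelope reading of these figures, assuming decimal units (1 PB = 1000 TB) and treating the approximate 50 PB as exact. The variable names are illustrative only.

```python
NODES_TOTAL = 4000
NODES_AFFECTED = 230
INCIDENTS = 1400
TRAFFIC_PB = 50  # approximate total traffic quoted on the slide

affected_fraction = NODES_AFFECTED / NODES_TOTAL
tb_per_incident = TRAFFIC_PB * 1000 / INCIDENTS

print(f"{affected_fraction:.2%} of nodes affected")        # 5.75% of nodes affected
print(f"about {tb_per_incident:.1f} TB of traffic per incident")  # about 35.7 TB ...
```

So roughly one node in seventeen reported at least one incident, at an average of one incident per few tens of terabytes of probe traffic.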
Corruption Types
Type I: single/double bit errors, usually bad memory (RAM, cache, etc.)
Type II: small, 2^n-sized chunks (128-512 bytes) of unknown origin
Type III: multiple large chunks of 64K, old file data
Type IV: various sized chunks of zeros
Type I
usually persistent
bit(s) have flipped in a byte
Single Bit Error (SBE)
Double Bit Error (DBE)
DBEs are 3x more common than SBEs
a single case of a triple bit error was observed
1→0 transition more frequent than 0→1
strong correlation with bad memory (verified)
happens with expensive ECC-memory too
Type I Example
33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33
33 33 33 33 33 33 33 33 33 33 33 33 33 33 22 33 33 33 33 33 33 33 33 33
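Bit flips of this kind can be classified by XOR-ing the expected and observed bytes. The sketch below assumes the first row of the example is the expected pattern and the second row is what was read back. The helper name `bit_flips` is hypothetical.

```python
def bit_flips(expected: bytes, observed: bytes):
    """Return (offset, ones_to_zeros, zeros_to_ones) for each differing byte."""
    result = []
    for offset, (e, o) in enumerate(zip(expected, observed)):
        if e != o:
            diff = e ^ o                             # bits that changed
            ones_to_zeros = bin(diff & e).count("1")  # set before, clear after
            zeros_to_ones = bin(diff & o).count("1")  # clear before, set after
            result.append((offset, ones_to_zeros, zeros_to_ones))
    return result


expected = bytes.fromhex("33" * 24)
observed = bytes.fromhex("33" * 14 + "22" + "33" * 9)
print(bit_flips(expected, observed))  # [(14, 2, 0)]
```

Here 0x33 became 0x22: two bits flipped in one byte, both from 1 to 0, so under the assumption above this is a double bit error.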
Type II
sometimes identifiable user data
observed in vicinity of OOM situations
possible SLAB corruption?
Type II Example
00000000  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
*
000def00  f1 2b f8 2b cd 43 38 38  e3 43 bd 8d b0 01 12 0a  |.+.+.C88.C......|
000def10  af 6e c3 b6 57 3e 5f fa  e3 d6 e8 7b ef 5b 6f 3c  |.n..W>_....{.[o<|
000def20  e4 42 42 0c e9 22 2e f1  d0 c6 a5 55 f2 f3 a7 38  |.BB..".....U...8|
000def30  b5 43 77 c9 5d 4e 16 a2  39 79 5f 31 10 65 b8 e4  |.Cw.]N..9y_1.e..|
000def40  9c 8a 94 a0 73 2d f7 ad  d9 12 31 2b f5 db b4 18  |....s-....1+....|
000def50  e4 ff c6 14 ee 00 d6 c0  7a c8 8e c0 3f 73 32 73  |........z...?s2s|
000def60  79 b3 63 d8 4c 8f 5d d8  c5 1e e4 5f 0e 2a 1d 94  |y.c.L.]...._.*..|
000def70  f8 2d 64 e7 10 e1 6b 89  da b3 fb 0b 48 59 d4 df  |.-d...k.....HY..|
000def80  9a 0c a4 ae 18 c5 40 a5  70 6b 19 d3 b9 f7 a2 b3  |......@.pk......|
000def90  44 df 2a 50 a1 55 31 02  57 d6 19 43 80 b0 0a 89  |D.*P.U1.W..C....|
000defa0  fe c2 34 ed cc 73 2e 64  38 89 6e 5a be d7 3b c4  |..4..s.d8.nZ..;.|
000defb0  db dd 58 42 a1 62 2f 6d  92 6a ed 9b 23 6e 1e 79  |..XB.b/m.j..#n.y|
000defc0  d8 88 38 86 92 2f af 29  a1 0f c0 21 46 fc 3a e3  |..8../.)...!F.:.|
000defd0  ac 17 0a f1 c3 31 89 82  59 e5 89 b9 9e fa 45 b9  |.....1..Y.....E.|
000defe0  54 d6 6a 72 b9 6a d6 1f  ff cb 6a 10 1e bd 66 87  |T.jr.j....j...f.|
000deff0  68 80 64 b0 53 97 74 72  ee f6 87 9f 23 47 cb 48  |h.d.S.tr....#G.H|
000df000  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
*
00100000
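In the dump above, the damaged region runs from offset 0x000def00 up to 0x000df000 in a 1 MiB buffer filled with 0xcc, which is 256 bytes. A sketch of how such extents could be located and tested for a power-of-two size follows. The function names and the 16-byte comparison granularity are assumptions, and the damaged bytes in the demo are synthetic, not the bytes from the dump. Comparing block by block rather than byte by byte matters because a corrupted byte can happen to equal the fill value (the dump contains one such 0xcc inside the damaged region).

```python
def corrupted_extents(data: bytes, fill: int, block: int = 16):
    """Return (start, length) of maximal runs of blocks that differ from `fill`."""
    good = bytes([fill]) * block
    extents = []
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        if chunk != good[:len(chunk)]:
            if extents and extents[-1][0] + extents[-1][1] == start:
                # adjacent to the previous bad block: extend that extent
                extents[-1] = (extents[-1][0], extents[-1][1] + len(chunk))
            else:
                extents.append((start, len(chunk)))
    return extents


def is_power_of_two(n: int) -> bool:
    return n > 0 and n & (n - 1) == 0


# Synthetic stand-in for the example: 1 MiB of 0xcc with 256 foreign bytes.
buf = bytearray(b"\xcc" * 0x100000)
buf[0xdef00:0xdf000] = bytes(range(256))
extents = corrupted_extents(bytes(buf), 0xcc)
print([(hex(s), n, is_power_of_two(n)) for s, n in extents])
# [('0xdef00', 256, True)]
```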
Type III
3ware hides timeouts, look at extended diag
also observed on plain SATA systems
...sometimes with failed READ commands!
previous data from earlier cycles (sometimes multiple cycles old!)
or from another location on disk
seems to match RAID stripe size (64K)
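One way to recognise "previous data from earlier cycles" is to compare a damaged chunk against the patterns written in earlier cycles. The sketch assumes each cycle fills the file with a single, distinct byte value and that the probe remembers those values. Both are assumptions for illustration, and `stale_cycle_age` is a hypothetical helper.

```python
STRIPE = 64 * 1024  # 64K, the chunk size mentioned on the slide


def stale_cycle_age(chunk: bytes, history):
    """Return how many cycles ago `chunk` was valid data, or None.

    `history` lists the fill bytes of earlier cycles, oldest first,
    excluding the current cycle.
    """
    for age, fill in enumerate(reversed(history), start=1):
        if chunk == bytes([fill]) * len(chunk):
            return age
    return None


history = [0x33, 0xCC, 0xAA]                        # three earlier cycles
print(stale_cycle_age(b"\xaa" * STRIPE, history))   # 1 (previous cycle)
print(stale_cycle_age(b"\x33" * STRIPE, history))   # 3 (three cycles old)
print(stale_cycle_age(b"\x5a" * STRIPE, history))   # None (not a known pattern)
```

Data copied from another location on disk would not be caught by this check and needs a different test.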
Type IV
usually persistent
relatively recent observations (since April)
...not sure yet this warrants another category
Type IV Example
aa aa aa aa aa aa aa aa 00 00 00 00 00 00 00 00 aa aa aa aa aa aa aa aa
aa aa aa aa aa aa aa aa 00 00 00 00 00 00 00 00 aa aa aa aa aa aa aa aa
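Runs of zeros like the one in this example can be located with a byte-level regular expression. This is a small sketch with a hypothetical helper name. It is only meaningful when the expected pattern itself contains no zero bytes.

```python
import re


def zero_runs(data: bytes, min_len: int = 1):
    """Return (offset, length) for each maximal run of 0x00 bytes."""
    runs = []
    for m in re.finditer(rb"\x00+", data):
        length = m.end() - m.start()
        if length >= min_len:
            runs.append((m.start(), length))
    return runs


row = bytes.fromhex("aa" * 8 + "00" * 8 + "aa" * 8)
print(zero_runs(row))  # [(8, 8)]
```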
Operating Systems
Corruption Types
Corruption Persistence
Daily Distribution
wide-scale deployment ramp-up finished 2007-01-31
WD firmware upgrade campaign started 2007-02-20
Where From?
user space
VM
filesystems
block layer
SCSI layer
low-level drivers
controller firmware
storage firmware
self-examining/healing hardware (?)
WRITE-READ cycles before ACK
checksumming? not necessarily enough
end-to-end checksumming (ZFS has a point)
store multiple copies
regular scrubbing of RAID arrays
data refresh re-read cycles on tapes
...generally accept and prepare for corruptions
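End-to-end checksumming means the producer of the data computes the checksum and the final consumer verifies it, so corruption introduced by any layer in between is caught. A minimal application-level sketch follows. The `seal`/`unseal` names and the choice of SHA-256 are assumptions, and this is not how ZFS lays out its checksums on disk.

```python
import hashlib

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest before handing it to storage."""
    return hashlib.sha256(payload).digest() + payload


def unseal(blob: bytes) -> bytes:
    """Verify the digest after reading back; raise if the payload changed."""
    digest, payload = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: data changed between write and read")
    return payload


blob = seal(b"event data")
print(unseal(blob))  # b'event data'

damaged = bytearray(blob)
damaged[-1] ^= 0x01  # flip one bit in the stored payload
try:
    unseal(bytes(damaged))
except ValueError as err:
    print(err)
```

A checksum of this kind only detects the damage. Repair still needs a second copy or an error-correcting code, which is where the "store multiple copies" item comes in.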
Conclusions
silent corruptions are a fact of life
first step towards a solution is detection
elimination seems impossible
existing datasets are at the mercy of Murphy
correction will cost time AND money
effort has to start now (if not started already)
multiple cost-schemes exist
  trade time and storage space (à la Google)
  trade time and CPU power (correction codes)
Departing Words
Questions?
Further Reading
http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797