
Silent Corruptions

KELEMEN Péter, CERN IT-FIO, After C5

Silent Corruptions Peter.Kelemen@cern.ch CERN After C5, June 1st, 2007 1

Corruptions

- Corruption: data is changed unintentionally
- mechanisms to detect/correct:
    - parity, checksum (CRC32, MD5, SHA1, ...)
    - ECC
    - multiple copies with quorum
- detection: SW/HW-level with error messages
- correction: SW/HW-level with warnings
- Silent corruption: data is changed unintentionally without any errors/warnings!
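Of the mechanisms listed above, "multiple copies with quorum" is the easiest to show in a few lines. The following is a minimal sketch, assuming the copies are already in memory as byte strings; the function name `quorum_read` is illustrative and does not come from the talk.

```python
from collections import Counter


def quorum_read(copies):
    """Return the value held by a strict majority of the copies.

    Raises ValueError when no value has more than half of the votes,
    because the correct copy cannot then be told apart from a bad one.
    """
    value, votes = Counter(copies).most_common(1)[0]
    if votes * 2 <= len(copies):
        raise ValueError("no majority among copies")
    return value


if __name__ == "__main__":
    # Three replicas, one of them silently corrupted.
    replicas = [b"payload", b"payload", b"pay\x00oad"]
    print(quorum_read(replicas))
```

With only two copies a mismatch can be detected but not corrected, which is why the sketch insists on a strict majority.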

Corruption Sources

- hardware errors (memory, CPU, disk, NIC)
- data transfer noise (UTP, SATA, FC, wireless)
- firmware bugs (RAID controller, disk, NIC)
- software bugs
    - kernel: VM, filesystems, block layer
    - application: crash, etc. (not discussed)



Expected Bit Error Rate (BER)

- NIC/link: 10^-10 (1 bit in ~1.1 GiB)
    - checksummed, retransmit if necessary
- memory: 10^-12 (1 bit in ~116 GiB)
    - ECC
- desktop disk: 10^-14 (1 bit in ~11.3 TiB)
    - various error correction codes
- enterprise disk: 10^-15 (1 bit in ~113 TiB)
    - various error correction codes

(quotes from standards/specifications)
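The "1 bit in N bytes" figures follow directly from the rates: a bit error rate of 10^-n means one expected bad bit per 10^n bits transferred. A short sketch of that arithmetic (the function name is illustrative, not from the talk):

```python
GIB = 2**30
TIB = 2**40


def bytes_per_expected_bit_error(ber):
    """Average number of bytes transferred per one expected bit error."""
    bits = 1.0 / ber  # one error expected every 1/BER bits
    return bits / 8   # 8 bits per byte


# Rates as quoted on the slide.
RATES = {
    "NIC/link": 1e-10,
    "memory": 1e-12,
    "desktop disk": 1e-14,
    "enterprise disk": 1e-15,
}

if __name__ == "__main__":
    for name, ber in RATES.items():
        volume = bytes_per_expected_bit_error(ber)
        if volume < TIB:
            print(f"{name}: 1 bit in ~{volume / GIB:.2f} GiB")
        else:
            print(f"{name}: 1 bit in ~{volume / TIB:.2f} TiB")
```

The slide's numbers are these values rounded down (for example 1.16 GiB is shown as ~1.1 GiB).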

It Already Happened to Us

- DON'T PANIC!
- acknowledge user observation (if any)
- assess the problem
    - develop/deploy tools for data collection
- estimate the scale of the problem
- research the cause (correlation) and impact
- evaluate possible solutions
- deploy possible solutions


fsprobe(8)

- probabilistic storage integrity check
    - write known bit pattern
    - read it back
    - compare and alert when mismatch found
- low I/O footprint for background operation
- keep complexity to the minimum
- use static buffers
- attempt to preserve details about detected corruptions for further analysis
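The write, read back, compare loop can be sketched as follows. This is not fsprobe's source, only an illustration of the idea under simplifying assumptions: the function names are invented here, there is no I/O throttling, and no attempt is made to bypass the page cache, so a read may be served from memory rather than from the disk.

```python
import os
import tempfile

CHUNK = 1024 * 1024  # 1 MiB buffer, the size used on the slides


def write_pattern(path, pattern_byte, chunks):
    """Fill the file with `chunks` buffers of one repeated byte."""
    pattern = bytes([pattern_byte]) * CHUNK
    with open(path, "wb") as f:
        for _ in range(chunks):
            f.write(pattern)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device


def find_mismatches(path, pattern_byte, chunks):
    """Read the file back and return indices of chunks that differ."""
    pattern = bytes([pattern_byte]) * CHUNK
    bad = []
    with open(path, "rb") as f:
        for index in range(chunks):
            if f.read(CHUNK) != pattern:
                bad.append(index)  # a real tool would also keep the bad data
    return bad


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        test_file = os.path.join(directory, "probe.dat")
        write_pattern(test_file, 0x55, chunks=4)
        print("mismatching chunks:", find_mismatches(test_file, 0x55, chunks=4))
```

Reusing one pre-built pattern buffer for every write and comparison keeps the loop simple, in the spirit of the "static buffers" and "minimum complexity" goals listed above.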

fsprobe cycles
[Diagram: pattern buffer (1 MiB), test file (2 GiB), read buffer (1 MiB)]

- ONE cycle completes in 2*2048 sec = 1 hour 8 minutes
- SIX cycles in (2*2048+300)*6 sec = 7 hours 20 minutes

Bit patterns:

    0x55 [01010101]
    0xAA [10101010]
    0x33 [00110011]
    0xCC [11001100]
    0x0F [00001111]
    0xF0 [11110000]
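The cycle times can be reproduced with a few lines of arithmetic. Two things in this sketch are assumptions read off the slide's formula rather than stated facts: that I/O is paced at one 1 MiB buffer per second (which is what 2*2048 sec for a 2 GiB file implies), and that the six cycles correspond to the six bit patterns. The purpose of the 300-second term is not given on the slide; it is treated here as a per-cycle pause.

```python
FILE_SIZE = 2 * 1024**3   # 2 GiB test file
BUFFER_SIZE = 1024**2     # 1 MiB pattern/read buffer
PATTERNS = [0x55, 0xAA, 0x33, 0xCC, 0x0F, 0xF0]

SECONDS_PER_BUFFER = 1    # assumption: inferred from "2*2048 sec"
PAUSE_SECONDS = 300       # the "+300" term on the slide
CYCLES = len(PATTERNS)    # assumption: one cycle per pattern

buffers = FILE_SIZE // BUFFER_SIZE              # 2048 buffers per pass
one_cycle = 2 * buffers * SECONDS_PER_BUFFER    # write pass plus read pass
all_cycles = (one_cycle + PAUSE_SECONDS) * CYCLES


def hours_minutes(seconds):
    """Format a duration, rounded to the nearest minute."""
    minutes = round(seconds / 60)
    return f"{minutes // 60} h {minutes % 60} min"


if __name__ == "__main__":
    print(one_cycle, "s =", hours_minutes(one_cycle))    # 4096 s = 1 h 8 min
    print(all_cycles, "s =", hours_minutes(all_cycles))  # 26376 s = 7 h 20 min
    for p in PATTERNS:
        print(f"0x{p:02X} [{p:08b}]")
```

Each pattern is paired with its bitwise complement (0x55/0xAA, 0x33/0xCC, 0x0F/0xF0), so over the six patterns every bit position is written as both 0 and 1.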


Investigation

- fsprobe deployed on 4000 nodes
- 1400 incidents reported (total ~50 PB traffic)
- 230 nodes affected (26 HW types)
- incidents tracked in SQLite
- steady flow of 1-3 incidents per day
- multiple types of corruptions observed
- affected systems are very diverse
    - SLC3/SLC4/RHEL4, XFS/ext3, 3ware/ARECA, ...
- some corruptions are transient



Corruption Types

- Type I: single/double bit errors, usually bad memory (RAM, cache, etc.)
- Type II: small, 2^n-sized chunks (128-512 bytes) of unknown origin
- Type III: multiple large chunks of 64K, old file data
- Type IV: various sized chunks of zeros

Type I

- usually persistent
- bit(s) have flipped in a byte
    - Single Bit Error (SBE)
    - Double Bit Error (DBE)
        - DBEs are 3x more common than SBEs
    - a single case of a triple bit error was observed
- 1→0 transition more frequent than 0→1
- strong correlation with bad memory (verified)
- happens with expensive ECC-memory too

Type I Example

    00000000  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
    *
    35285650  33 33 33 33 33 33 33 33  33 33 33 33 33 33 22 33  |33333333333333"3|
    35285660  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
    *
    80000000

    0x33 = 00110011b
    0x22 = 00100010b
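XOR-ing the expected byte with the observed one shows which bits flipped and in which direction. A small sketch (the helper name `bit_flips` is illustrative):

```python
def bit_flips(expected, observed):
    """Count flipped bits between two byte values, split by direction."""
    diff = expected ^ observed  # 1 in every bit position that changed
    return {
        "flipped": bin(diff).count("1"),
        "one_to_zero": bin(diff & expected).count("1"),  # was 1, now 0
        "zero_to_one": bin(diff & observed).count("1"),  # was 0, now 1
    }


if __name__ == "__main__":
    print(bit_flips(0x33, 0x22))
    # {'flipped': 2, 'one_to_zero': 2, 'zero_to_one': 0}
```

The example byte is therefore a double bit error in which both bits went from 1 to 0.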


Type II

- usually transient
- small chunks of random looking data
    - ...but can go up to 128K
- sometimes identifiable user data
- observed in vicinity of OOM situations
- possible SLAB corruption?


Type II Example
    00000000  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
    *
    000def00  f1 2b f8 2b cd 43 38 38  e3 43 bd 8d b0 01 12 0a  |.+.+.C88.C......|
    000def10  af 6e c3 b6 57 3e 5f fa  e3 d6 e8 7b ef 5b 6f 3c  |.n..W>_....{.[o<|
    000def20  e4 42 42 0c e9 22 2e f1  d0 c6 a5 55 f2 f3 a7 38  |.BB..".....U...8|
    000def30  b5 43 77 c9 5d 4e 16 a2  39 79 5f 31 10 65 b8 e4  |.Cw.]N..9y_1.e..|
    000def40  9c 8a 94 a0 73 2d f7 ad  d9 12 31 2b f5 db b4 18  |....s-....1+....|
    000def50  e4 ff c6 14 ee 00 d6 c0  7a c8 8e c0 3f 73 32 73  |........z...?s2s|
    000def60  79 b3 63 d8 4c 8f 5d d8  c5 1e e4 5f 0e 2a 1d 94  |y.c.L.]...._.*..|
    000def70  f8 2d 64 e7 10 e1 6b 89  da b3 fb 0b 48 59 d4 df  |.-d...k.....HY..|
    000def80  9a 0c a4 ae 18 c5 40 a5  70 6b 19 d3 b9 f7 a2 b3  |......@.pk......|
    000def90  44 df 2a 50 a1 55 31 02  57 d6 19 43 80 b0 0a 89  |D.*P.U1.W..C....|
    000defa0  fe c2 34 ed cc 73 2e 64  38 89 6e 5a be d7 3b c4  |..4..s.d8.nZ..;.|
    000defb0  db dd 58 42 a1 62 2f 6d  92 6a ed 9b 23 6e 1e 79  |..XB.b/m.j..#n.y|
    000defc0  d8 88 38 86 92 2f af 29  a1 0f c0 21 46 fc 3a e3  |..8../.)...!F.:.|
    000defd0  ac 17 0a f1 c3 31 89 82  59 e5 89 b9 9e fa 45 b9  |.....1..Y.....E.|
    000defe0  54 d6 6a 72 b9 6a d6 1f  ff cb 6a 10 1e bd 66 87  |T.jr.j....j...f.|
    000deff0  68 80 64 b0 53 97 74 72  ee f6 87 9f 23 47 cb 48  |h.d.S.tr....#G.H|
    000df000  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
    *
    00100000


Type III

- usually persistent, comes in bursts
- strong correlation: I/O command timeouts
    - 3ware hides timeouts, look at extended diag
    - also observed on plain SATA systems
    - ...sometimes with failed READ commands!
- previous data from earlier cycles (sometimes multiple cycles old!) or from another location on disk
- seems to match RAID stripe size (64K)
    - observed on 16K chunk RAID arrays as well



Type III Example


    00000000  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
    *
    34205200  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
    *
    34215200  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
    *
    34265200  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
    *
    34275200  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
    *
    342c5200  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
    *
    342d5200  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
    *
    34325200  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
    *
    34335200  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
    *
    34385200  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
    *
    34395200  cc cc cc cc cc cc cc cc  cc cc cc cc cc cc cc cc  |................|
    *
    80000000


Type IV

- usually persistent
- relatively recent observations (since April)
- ...not sure yet this warrants another category


Type IV Example

    00000000  aa aa aa aa aa aa aa aa  aa aa aa aa aa aa aa aa  |................|
    *
    00052980  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    *
    00053000  aa aa aa aa aa aa aa aa  aa aa aa aa aa aa aa aa  |................|
    *
    80000000
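The four types described above can be told apart mechanically in the simple cases. The sketch below is only a rough heuristic built from the slide descriptions, not the procedure used in the investigation: the function name, the three-bit threshold for Type I and the rule that Type III means "filled with another known test pattern" are all assumptions made for illustration. Stale data copied from another location on disk, for instance, would land in the Type II bucket here.

```python
PATTERNS = (0x55, 0xAA, 0x33, 0xCC, 0x0F, 0xF0)  # test patterns from the slides


def classify(expected_byte, observed):
    """Guess the corruption type of a buffer that should hold one repeated byte."""
    # Collect every observed byte that differs from the expected pattern byte.
    bad = bytes(b for b in observed if b != expected_byte)
    if not bad:
        return "clean"
    flipped = sum(bin(b ^ expected_byte).count("1") for b in bad)
    if len(bad) == 1 and flipped <= 3:
        return "Type I"    # a few flipped bits within a single byte
    if set(bad) == {0}:
        return "Type IV"   # chunk of zeros
    if len(set(bad)) == 1 and bad[0] in PATTERNS:
        return "Type III"  # looks like data from an earlier pattern cycle
    return "Type II"       # random-looking chunk


if __name__ == "__main__":
    print(classify(0x33, b"\x33" * 14 + b"\x22" + b"\x33"))
    print(classify(0xCC, b"\xcc" * 16 + b"\x33" * 64 + b"\xcc" * 16))
    print(classify(0xAA, b"\xaa" * 8 + b"\x00" * 32 + b"\xaa" * 8))
```

The three calls mirror the Type I, Type III and Type IV example dumps in miniature.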


Corruption Time Distribution


[Figure: number of incidents (axis 0-110) by hour of day (0-23)]


Operating Systems

[Chart legend: SLC3, SLC4, RHEL4, other]


Corruption Types

[Chart legend: Type I, Type II, Type III, Type IV, unknown]


Corruption Persistence

[Chart legend: persistent, transient, unknown]


Daily Distribution
[Chart annotations:]
- wide-scale deployment ramp-up finished 2007-01-31
- WD firmware upgrade campaign started 2007-02-20


Where From?
- user space
- VM
- filesystems
- block layer
- SCSI layer
- low-level drivers
- controller firmware
- storage firmware
- physical magnetic media



What Can Be Done?

- self-examining/healing hardware (?)
- WRITE-READ cycles before ACK
- checksumming? not necessarily enough
- end-to-end checksumming (ZFS has a point)
- store multiple copies
- regular scrubbing of RAID arrays
- "data refresh" re-read cycles on tapes
- ...generally accept and prepare for corruptions
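End-to-end checksumming means the producer of the data computes the checksum and the final consumer verifies it, so every layer in between is covered. A minimal sketch, assuming SHA-256 and a digest simply prepended to the payload; the names `seal` and `unseal` are illustrative, and this is not how ZFS or any particular system lays out its checksums.

```python
import hashlib

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def seal(payload):
    """Producer side: prepend a SHA-256 digest of the payload."""
    return hashlib.sha256(payload).digest() + payload


def unseal(blob):
    """Consumer side: verify the digest and return the payload.

    Raises ValueError if the payload no longer matches its digest.
    """
    digest, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: data corrupted in transit or at rest")
    return payload


if __name__ == "__main__":
    blob = bytearray(seal(b"event data"))
    blob[-1] ^= 0x01  # simulate a single silent bit flip
    try:
        unseal(bytes(blob))
    except ValueError as error:
        print("detected:", error)
```

Keeping the digest next to the data is the simplest layout but also the weakest one: stale data returned together with its own old digest would still verify, so storing the digest separately from the data is the more robust choice.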

Conclusions

- silent corruptions are a fact of life
- first step towards a solution is detection
- elimination seems impossible
- existing datasets are at the mercy of Murphy
- correction will cost time AND money
- effort has to start now (if not started already)
- multiple cost-schemes exist
    - trade time and storage space (à la Google)
    - trade time and CPU power (correction codes)

Departing Words

Trust, but verify


Ronald Reagan


Questions?

Thank you and have a nice filesystem (without corruptions)!


Further Reading

Bernd Panzer-Steindel: Data Integrity v3

http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797

