Title: Some notes on fault tolerant storage systems
Tags: english, sysadmin, raid
Date: 2017-11-01 15:35

<p>If you care about how fault tolerant your storage is, you might
find these articles and papers interesting. They have shaped how I
think when designing a storage system.</p>

<ul>

<li>USENIX ;login: <a
href="https://www.usenix.org/publications/login/summer2017/ganesan">Redundancy
Does Not Imply Fault Tolerance: Analysis of Distributed Storage
Reactions to Single Errors and Corruptions</a> by Aishwarya Ganesan,
Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi
H. Arpaci-Dusseau</li>

<li>ZDNet
<a href="http://www.zdnet.com/article/why-raid-5-stops-working-in-2009/">Why
RAID 5 stops working in 2009</a> by Robin Harris (the arithmetic
behind its argument is sketched after this list)</li>

<li>ZDNet
<a href="http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/">Why
RAID 6 stops working in 2019</a> by Robin Harris</li>

<li>USENIX FAST'07
<a href="http://research.google.com/archive/disk_failures.pdf">Failure
Trends in a Large Disk Drive Population</a> by Eduardo Pinheiro,
Wolf-Dietrich Weber and Luiz André Barroso</li>

<li>USENIX ;login: <a
href="https://www.usenix.org/system/files/login/articles/hughes12-04.pdf">Data
Integrity: Finding Truth in a World of Guesses and Lies</a> by Doug
Hughes</li>

<li>USENIX FAST'08
<a href="https://www.usenix.org/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/">An
Analysis of Data Corruption in the Storage Stack</a> by
L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C.
Arpaci-Dusseau, and R. H. Arpaci-Dusseau</li>

<li>USENIX FAST'07 <a
href="https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder_html/">Disk
failures in the real world: what does an MTTF of 1,000,000 hours mean
to you?</a> by B. Schroeder and G. A. Gibson</li>

<li>USENIX FAST'08 <a
href="https://www.usenix.org/events/fast08/tech/full_papers/jiang/jiang_html/">Are
Disks the Dominant Contributor for Storage Failures? A Comprehensive
Study of Storage Subsystem Failure Characteristics</a> by Weihang
Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky</li>

<li>SIGMETRICS 2007
<a href="http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf">An
analysis of latent sector errors in disk drives</a> by
L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler</li>

</ul>

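<p>The core of the two RAID predictions above is simple arithmetic.
Here is a back-of-the-envelope sketch in Python, assuming the
commonly quoted consumer drive specification of one unrecoverable
read error (URE) per 10^14 bits read; the disk count and sizes are
illustration values of mine, not numbers from the articles:</p>

<pre>
import math

# Commonly quoted consumer drive spec: one unrecoverable
# read error (URE) per 10^14 bits read.
URE_RATE = 1e-14

def rebuild_failure_probability(disks, disk_bytes, ure_rate=URE_RATE):
    # After one disk in a RAID 5 array fails, every remaining
    # disk must be read end to end without a single URE for the
    # rebuild to succeed.  One URE during the rebuild loses data.
    bits_to_read = (disks - 1) * disk_bytes * 8
    p_clean_rebuild = math.exp(bits_to_read * math.log1p(-ure_rate))
    return 1 - p_clean_rebuild

# Six 2 TB disks: the rebuild reads 10 TB and fails more
# often than not.
print(rebuild_failure_probability(6, 2.0e12))  # roughly 0.55
</pre>

<p>Larger disks, or more of them, push the failure probability toward
certainty, which is the point of the 2009 and 2019 predictions.</p>
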
<p>Several of these research papers are based on data collected from
hundreds of thousands or millions of disks, and their findings are
eye-opening. The short story is: do not implicitly trust RAID or
other redundant storage systems. Details matter. And unfortunately,
few options on Linux address all the identified issues. Both ZFS and
Btrfs are doing a fairly good job, but have legal and practical
issues of their own. I wonder how cluster file systems like Ceph do
in this regard. After all, there is an old saying: you know you have
a distributed system when the crash of a computer you have never
heard of stops you from getting any work done. The same holds true
when fault tolerance does not work.</p>

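<p>What ZFS and Btrfs get right here is end-to-end checksumming: a
checksum is stored with each block on write and verified on every
read, so silent corruption is detected instead of quietly passed on.
A toy Python sketch of the principle, not of how either file system
actually lays out its metadata:</p>

<pre>
import hashlib

def write_block(store, key, data):
    # Store the block together with its SHA-256 checksum.
    store[key] = (hashlib.sha256(data).digest(), data)

def read_block(store, key):
    # Verify the checksum on every read, so corrupt data is
    # reported instead of silently returned.
    checksum, data = store[key]
    if hashlib.sha256(data).digest() != checksum:
        raise IOError("checksum mismatch, block %r is corrupt" % key)
    return data

store = {}
write_block(store, "block-0", b"important data")
# Simulate a bit flip the drive itself never reports:
store["block-0"] = (store["block-0"][0], b"importent data")
read_block(store, "block-0")  # raises instead of returning bad data
</pre>

<p>With redundancy underneath, a detected mismatch can be repaired
from a good copy, which is what a scrub does.</p>
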
<p>Just remember, in the end, it does not matter how redundant or how
fault tolerant your storage is if you do not continuously monitor its
status to detect and replace failed disks.</p>

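<p>For Linux software RAID, mdadm --monitor and smartd are the usual
tools for this. As a minimal illustration of the idea, here is a
sketch that checks /proc/mdstat for failed or missing array members,
assuming the usual format where failed devices are marked with (F)
and a degraded array shows an underscore in its [UU_] status
field:</p>

<pre>
import re
import sys

def degraded_md_arrays(mdstat="/proc/mdstat"):
    # Report md arrays with failed "(F)" or missing "_" members.
    degraded = []
    current = None
    with open(mdstat) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            if current and ("(F)" in line
                            or re.search(r"\[[U_]*_[U_]*\]", line)):
                degraded.append(current)
                current = None  # report each array only once
    return degraded

bad = degraded_md_arrays()
if bad:
    print("Degraded arrays: " + ", ".join(bad))
    sys.exit(1)
print("All md arrays healthy.")
</pre>

<p>In practice you would run something like this from cron, or
better, let mdadm --monitor send mail the moment a device fails.</p>
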
<p>As usual, if you use Bitcoin and want to show your support of my
activities, please send Bitcoin donations to my address
<b><a href="bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b">15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b</a></b>.</p>