If you care about how fault tolerant your storage is, you might
find these articles and papers interesting. They have shaped how I
think when designing a storage system.

- USENIX ;login: "Redundancy Does Not Imply Fault Tolerance:
  Analysis of Distributed Storage Reactions to Single Errors and
  Corruptions" by Aishwarya Ganesan, Ramnatthan Alagappan,
  Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau

- ZDNet "Why RAID 5 stops working in 2009" by Robin Harris

- ZDNet "Why RAID 6 stops working in 2019" by Robin Harris

- USENIX FAST'07 "Failure Trends in a Large Disk Drive Population"
  by Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso

- USENIX ;login: "Data Integrity: Finding Truth in a World of
  Guesses and Lies" by Doug Hughes

- USENIX FAST'08 "An Analysis of Data Corruption in the Storage
  Stack" by L. N. Bairavasundaram, G. R. Goodson, B. Schroeder,
  A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau

- USENIX FAST'07 "Disk failures in the real world: what does an
  MTTF of 1,000,000 hours mean to you?" by B. Schroeder and
  G. A. Gibson

- USENIX ;login: "Are Disks the Dominant Contributor for Storage
  Failures? A Comprehensive Study of Storage Subsystem Failure
  Characteristics" by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou,
  and Arkady Kanevsky

- SIGMETRICS 2007 "An analysis of latent sector errors in disk
  drives" by L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy,
  and J. Schindler

Several of these research papers are based on data collected from
hundreds of thousands or millions of disks, and their findings are
eye-opening. The short story is simple: do not implicitly trust RAID
or redundant storage systems. Details matter. Unfortunately, there
are few options on Linux that address all the identified issues. Both
ZFS and Btrfs do a fairly good job, but have legal and practical
issues of their own. I wonder how cluster file systems like Ceph do
in this regard. After all, there is an old saying: you know you have
a distributed system when the crash of a computer you have never
heard of stops you from getting any work done. The same holds true
if fault tolerance does not work.

Just remember, in the end, it does not matter how redundant or how
fault tolerant your storage is if you do not continuously monitor its
status to detect and replace failed disks.

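To make the monitoring point concrete, here is a minimal sketch that
assumes Linux software RAID (md). That is only one possible setup and
not something the papers above prescribe. It scans text in the
/proc/mdstat format and lists arrays whose member status contains a
"_", which marks a missing or failed member. The function name
degraded_arrays and the SAMPLE text are made up for this
illustration.

```python
import re

# Matches the "[n/m] [UU_]" member status found on an md array's block line.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
# Matches the first line of an array entry, e.g. "md0 : active raid1 ...".
HEADER_RE = re.compile(r"^(md\S*)\s*:")


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        # "U" means the member is up, "_" means it is missing or failed.
        if current and status and "_" in status.group(3):
            degraded.append(current)
    return degraded


# Hypothetical sample in the /proc/mdstat layout: md0 healthy, md1 degraded.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdc1[0]
      976630336 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a real machine one would feed it the contents of /proc/mdstat from
a periodic job and raise an alert when the list is not empty. Existing
tools such as mdadm in monitor mode or smartd already cover this kind
of check, so treat the sketch as an illustration of the idea rather
than a replacement for them.
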
As usual, if you use Bitcoin and want to show your support of my
activities, please send Bitcoin donations to my address
15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.