If you care about how fault tolerant your storage is, you might +find these articles and papers interesting. They have formed how I +think of when designing a storage system.
+ +-
+
+
- USENIX :login; Redundancy +Does Not Imply Fault Tolerance. Analysis of Distributed Storage +Reactions to Single Errors and Corruptions by Aishwarya Ganesan, +Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi +H. Arpaci-Dusseau + +
- ZDNet +Why +RAID 5 stops working in 2009 by Robin Harris + +
- ZDNet +Why +RAID 6 stops working in 2019 by Robin Harris + +
- USENIX FAST'07 +Failure +Trends in a Large Disk Drive Population by Eduardo Pinheiro, +Wolf-Dietrich Weber and Luiz AndreÌ Barroso + +
- USENIX ;login: Data +Integrity. Finding Truth in a World of Guesses and Lies by Doug +Hughes + +
- USENIX FAST'08 +An +Analysis of Data Corruption in the Storage Stack by +L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. +Arpaci-Dusseau, and R. H. Arpaci-Dusseau + +
- USENIX FAST'07 Disk +failures in the real world: what does an MTTF of 1,000,000 hours mean +to you? by B. Schroeder and G. A. Gibson. + +
- USENIX ;login: Are +Disks the Dominant Contributor for Storage Failures? A Comprehensive +Study of Storage Subsystem Failure Characteristics by Weihang +Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky + +
- SIGMETRICS 2007 +An +analysis of latent sector errors in disk drives by +L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler + +
Several of these research papers are based on data collected from +hundred thousands or millions of disk, and their findings are eye +opening. The short story is simply do not implicitly trust RAID or +redundant storage systems. Details matter. And unfortunately there +are few options on Linux addressing all the identified issues. Both +ZFS and Btrfs are doing a fairly good job, but have legal and +practical issues on their own. I wonder how cluster file systems like +Ceph do in this regard. After all, there is an old saying, you know +you have a distributed system when the crash of a compyter you have +never heard of stops you from getting any work done. The same holds +true if fault tolerance do not work.
+ +Just remember, in the end, it do not matter how redundant, or how +fault tolerant your storage is, if you do not continuously monitor its +status to detect and replace failed disks.
+ +