Title: Some notes on fault tolerant storage systems Tags: english, sysadmin, raid Date: 2017-11-01 15:35

If you care about how fault tolerant your storage is, you might find these articles and papers interesting. They have formed how I think of when designing a storage system.

USENIX :login; Redundancy Does Not Imply Fault Tolerance. Analysis of Distributed Storage Reactions to Single Errors and Corruptions by Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
ZDNet Why RAID 5 stops working in 2009 by Robin Harris
ZDNet Why RAID 6 stops working in 2019 by Robin Harris
USENIX FAST'07 Failure Trends in a Large Disk Drive Population by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso
USENIX ;login: Data Integrity. Finding Truth in a World of Guesses and Lies by Doug Hughes
USENIX FAST'08 An Analysis of Data Corruption in the Storage Stack by L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau
USENIX FAST'07 Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? by B. Schroeder and G. A. Gibson.
USENIX ;login: Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky
SIGMETRICS 2007 An analysis of latent sector errors in disk drives by L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler

Several of these research papers are based on data collected from hundred thousands or millions of disk, and their findings are eye opening. The short story is simply do not implicitly trust RAID or redundant storage systems. Details matter. And unfortunately there are few options on Linux addressing all the identified issues. Both ZFS and Btrfs are doing a fairly good job, but have legal and practical issues on their own. I wonder how cluster file systems like Ceph do in this regard. After all, there is an old saying, you know you have a distributed system when the crash of a compyter you have never heard of stops you from getting any work done. The same holds true if fault tolerance do not work.

Just remember, in the end, it do not matter how redundant, or how fault tolerant your storage is, if you do not continuously monitor its status to detect and replace failed disks.