Title: Some notes on fault tolerant storage systems
Tags: english, sysadmin, raid
Date: 2017-11-01 15:35
If you care about how fault tolerant your storage is, you might
find these articles and papers interesting. They have shaped how I
think when designing a storage system.
- USENIX ;login: Redundancy Does Not Imply Fault Tolerance. Analysis
of Distributed Storage Reactions to Single Errors and Corruptions by
Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau,
and Remzi H. Arpaci-Dusseau
- ZDNet Why RAID 5 stops working in 2009 by Robin Harris
- ZDNet Why RAID 6 stops working in 2019 by Robin Harris
- USENIX FAST'07 Failure Trends in a Large Disk Drive Population by
Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso
- USENIX ;login: Data
Integrity. Finding Truth in a World of Guesses and Lies by Doug
Hughes
- USENIX FAST'08 An Analysis of Data Corruption in the Storage Stack
by L. N. Bairavasundaram, G. R. Goodson, B. Schroeder,
A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau
- USENIX FAST'07 Disk
failures in the real world: what does an MTTF of 1,000,000 hours mean
to you? by B. Schroeder and G. A. Gibson.
- USENIX ;login: Are
Disks the Dominant Contributor for Storage Failures? A Comprehensive
Study of Storage Subsystem Failure Characteristics by Weihang
Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky
- SIGMETRICS 2007 An analysis of latent sector errors in disk drives
by L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler
Several of these research papers are based on data collected from
hundreds of thousands or millions of disks, and their findings are
eye-opening. The short story is simple: do not implicitly trust RAID
or redundant storage systems. Details matter. Unfortunately, there
are few options on Linux that address all the identified issues. Both
ZFS and Btrfs do a fairly good job, but have legal and practical
issues of their own. I wonder how cluster file systems like Ceph do
in this regard. After all, there is an old saying: you know you have
a distributed system when the crash of a computer you have never
heard of stops you from getting any work done. The same holds true
if fault tolerance does not work.
Just remember, in the end it does not matter how redundant or how
fault tolerant your storage is if you do not continuously monitor its
status to detect and replace failed disks.
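
To make the monitoring point concrete, here is a minimal sketch of one such check. It assumes Linux software RAID (md), where /proc/mdstat shows a member status such as [UU] for a healthy array and [UU_] when a member is missing. The function name degraded_arrays and the sample text are made up for illustration, and a real setup would more likely rely on an existing monitoring tool than on a home-grown script.

```python
import re


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose member status shows a missing disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # An array section starts with a line like "md0 : active raid1 ...".
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The member status ends the next line: [UU] is healthy, [U_] is not.
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded


# Hypothetical /proc/mdstat content, used here instead of reading the real file.
SAMPLE = """\
Personalities : [raid1] [raid5]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid5 sdc1[2] sdb2[1](F) sda2[0]
      1000 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))  # prints ['md1']
```

On a real machine the text would come from reading /proc/mdstat, and the check would run periodically and raise an alert whenever the returned list is not empty. It only covers md arrays; ZFS, Btrfs and hardware RAID controllers each need their own status check.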