<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/'>
<channel>
<title>Petter Reinholdtsen - Entries from November 2017</title>
<description>Entries from November 2017</description>
<link>http://people.skolelinux.org/pere/blog/</link>


<item>
<title>Some notes on fault tolerant storage systems</title>
<link>http://people.skolelinux.org/pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</link>
<guid isPermaLink="true">http://people.skolelinux.org/pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</guid>
<pubDate>Wed, 1 Nov 2017 15:35:00 +0100</pubDate>
<description>&lt;p&gt;If you care about how fault tolerant your storage is, you might
find these articles and papers interesting. They have shaped how I
think when designing a storage system.&lt;/p&gt;

&lt;ul&gt;

&lt;li&gt;USENIX ;login: &lt;a
href=&quot;https://www.usenix.org/publications/login/summer2017/ganesan&quot;&gt;Redundancy
Does Not Imply Fault Tolerance: Analysis of Distributed Storage
Reactions to Single Errors and Corruptions&lt;/a&gt; by Aishwarya Ganesan,
Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi
H. Arpaci-Dusseau&lt;/li&gt;

&lt;li&gt;ZDNet
&lt;a href=&quot;http://www.zdnet.com/article/why-raid-5-stops-working-in-2009/&quot;&gt;Why
RAID 5 stops working in 2009&lt;/a&gt; by Robin Harris&lt;/li&gt;

&lt;li&gt;ZDNet
&lt;a href=&quot;http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/&quot;&gt;Why
RAID 6 stops working in 2019&lt;/a&gt; by Robin Harris&lt;/li&gt;

&lt;li&gt;USENIX FAST&#39;07
&lt;a href=&quot;http://research.google.com/archive/disk_failures.pdf&quot;&gt;Failure
Trends in a Large Disk Drive Population&lt;/a&gt; by Eduardo Pinheiro,
Wolf-Dietrich Weber and Luiz André Barroso&lt;/li&gt;

&lt;li&gt;USENIX ;login: &lt;a
href=&quot;https://www.usenix.org/system/files/login/articles/hughes12-04.pdf&quot;&gt;Data
Integrity: Finding Truth in a World of Guesses and Lies&lt;/a&gt; by Doug
Hughes&lt;/li&gt;

&lt;li&gt;USENIX FAST&#39;08
&lt;a href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/&quot;&gt;An
Analysis of Data Corruption in the Storage Stack&lt;/a&gt; by
L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C.
Arpaci-Dusseau, and R. H. Arpaci-Dusseau&lt;/li&gt;

&lt;li&gt;USENIX FAST&#39;07 &lt;a
href=&quot;https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder_html/&quot;&gt;Disk
failures in the real world: what does an MTTF of 1,000,000 hours mean
to you?&lt;/a&gt; by B. Schroeder and G. A. Gibson&lt;/li&gt;

&lt;li&gt;USENIX FAST&#39;08 &lt;a
href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/jiang/jiang_html/&quot;&gt;Are
Disks the Dominant Contributor for Storage Failures? A Comprehensive
Study of Storage Subsystem Failure Characteristics&lt;/a&gt; by Weihang
Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky&lt;/li&gt;

&lt;li&gt;SIGMETRICS 2007
&lt;a href=&quot;http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf&quot;&gt;An
analysis of latent sector errors in disk drives&lt;/a&gt; by
L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Several of these research papers are based on data collected from
hundreds of thousands or millions of disks, and their findings are
eye-opening. The short story is: do not implicitly trust RAID or
redundant storage systems. Details matter. And unfortunately there
are few options on Linux addressing all the identified issues. Both
ZFS and Btrfs are doing a fairly good job, but have legal and
practical issues of their own. I wonder how cluster file systems like
Ceph do in this regard. After all, there is an old saying: you know
you have a distributed system when the crash of a computer you have
never heard of stops you from getting any work done. The same holds
true if fault tolerance does not work.&lt;/p&gt;
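
&lt;p&gt;As a small illustration of what it can mean to not implicitly
trust the storage stack, here is a minimal Python sketch (my own
example, not taken from the papers above) that records SHA-256
checksums for a file tree and verifies them later. End-to-end
checksums like these can expose silent corruption that a RAID layer
happily passes through. The paths in the usage part are
hypothetical.&lt;/p&gt;

&lt;pre&gt;
#!/usr/bin/env python3
# Minimal sketch: record SHA-256 checksums for a file tree and
# verify them later, independently of what the storage stack claims.
import hashlib
import json
import os

def sha256(path, bufsize=1048576):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(bufsize), b''):
            h.update(chunk)
    return h.hexdigest()

def record(tree, dbfile):
    # Walk the tree once while the data is known to be good.
    db = {}
    for dirpath, _, files in os.walk(tree):
        for name in files:
            path = os.path.join(dirpath, name)
            db[path] = sha256(path)
    with open(dbfile, 'w') as f:
        json.dump(db, f)

def verify(dbfile):
    # Return the paths whose contents no longer match the recorded sums.
    with open(dbfile) as f:
        db = json.load(f)
    return [path for path, digest in db.items()
            if not os.path.exists(path) or sha256(path) != digest]

if __name__ == '__main__':
    # Hypothetical path: run record() once, then verify() from cron.
    for path in verify('/var/lib/checksums.json'):
        print('CORRUPTED or missing: ' + path)
&lt;/pre&gt;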

&lt;p&gt;Just remember, in the end it does not matter how redundant or how
fault tolerant your storage is if you do not continuously monitor its
status to detect and replace failed disks.&lt;/p&gt;
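
&lt;p&gt;To make the monitoring point concrete, here is a small Python
sketch of my own that checks /proc/mdstat for degraded Linux software
RAID arrays. It only covers md arrays and is meant as a starting
point for a cron job, not as a complete monitoring solution.&lt;/p&gt;

&lt;pre&gt;
#!/usr/bin/env python3
# Sketch: warn about degraded Linux software RAID (md) arrays.  In
# /proc/mdstat the member status looks like [UU] for a healthy
# two-disk array; an underscore marks a missing or failed disk.
import re
import sys

def degraded_arrays(path='/proc/mdstat'):
    with open(path) as f:
        text = f.read()
    failed = []
    # Match each mdN stanza up to its member status field.
    for name, status in re.findall(r'^(md\d+) :.*?\[([U_]+)\]',
                                   text, re.MULTILINE | re.DOTALL):
        if '_' in status:
            failed.append(name)
    return failed

if __name__ == '__main__':
    bad = degraded_arrays()
    for name in bad:
        print('WARNING: ' + name + ' is degraded, replace the failed disk')
    sys.exit(1 if bad else 0)
&lt;/pre&gt;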
</description>
</item>

</channel>
</rss>