<?xml version="1.0" encoding="utf-8"?>
<rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/'>
<title>Petter Reinholdtsen - Entries tagged raid</title>
<description>Entries tagged raid</description>
<link>http://people.skolelinux.org/pere/blog/</link>
<title>Some notes on fault tolerant storage systems</title>
<link>http://people.skolelinux.org/pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</link>
<guid isPermaLink="true">http://people.skolelinux.org/pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</guid>
<pubDate>Wed, 1 Nov 2017 15:35:00 +0100</pubDate>
<description><p>If you care about how fault tolerant your storage is, you might
find these articles and papers interesting.  They have shaped how I
think when designing a storage system.</p>
<li>USENIX ;login:
<a href="https://www.usenix.org/publications/login/summer2017/ganesan">Redundancy
Does Not Imply Fault Tolerance. Analysis of Distributed Storage
Reactions to Single Errors and Corruptions</a> by Aishwarya Ganesan,
Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi
H. Arpaci-Dusseau</li>
<li>ZDNet
<a href="http://www.zdnet.com/article/why-raid-5-stops-working-in-2009/">Why
RAID 5 stops working in 2009</a> by Robin Harris</li>
<li>ZDNet
<a href="http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/">Why
RAID 6 stops working in 2019</a> by Robin Harris</li>
<li>USENIX FAST '07
<a href="http://research.google.com/archive/disk_failures.pdf">Failure
Trends in a Large Disk Drive Population</a> by Eduardo Pinheiro,
Wolf-Dietrich Weber and Luiz André Barroso</li>
<li>USENIX ;login:
<a href="https://www.usenix.org/system/files/login/articles/hughes12-04.pdf">Data
Integrity. Finding Truth in a World of Guesses and Lies</a> by Doug
Hughes</li>
<li>USENIX FAST '08
<a href="https://www.usenix.org/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/">An
Analysis of Data Corruption in the Storage Stack</a> by
L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C.
Arpaci-Dusseau, and R. H. Arpaci-Dusseau</li>
<li>USENIX FAST '07
<a href="https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder_html/">Disk
failures in the real world: what does an MTTF of 1,000,000 hours mean
to you?</a> by B. Schroeder and G. A. Gibson.</li>
<li>USENIX ;login:
<a href="https://www.usenix.org/events/fast08/tech/full_papers/jiang/jiang_html/">Are
Disks the Dominant Contributor for Storage Failures? A Comprehensive
Study of Storage Subsystem Failure Characteristics</a> by Weihang
Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky</li>
<li>SIGMETRICS 2007
<a href="http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf">An
analysis of latent sector errors in disk drives</a> by
L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler</li>
<p>Several of these research papers are based on data collected from
hundreds of thousands or millions of disks, and their findings are
eye-opening.  The short story is: do not implicitly trust RAID or
redundant storage systems.  Details matter.  And unfortunately there
are few options on Linux addressing all the identified issues.  Both
ZFS and Btrfs are doing a fairly good job, but have legal and
practical issues of their own.  I wonder how cluster file systems like
Ceph do in this regard.  After all, as the old saying goes, you know
you have a distributed system when the crash of a computer you have
never heard of stops you from getting any work done.  The same holds
true when fault tolerance does not work.</p>
<p>Just remember: in the end it does not matter how redundant or how
fault tolerant your storage is, if you do not continuously monitor its
status to detect and replace failed disks.</p>
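<p>A minimal sketch of such monitoring for Linux Software RAID,
assuming the md driver and a standard /proc/mdstat (adjust it to your
own setup before relying on it), could be a small cron job like
this:</p>

<blockquote><pre>
#!/bin/sh
# Complain if any md array has failed members "(F)" or holes in its
# status bitmap, e.g. [U_].  Cron mails whatever output this produces.
# mdadm --monitor --scan --oneshot can report degraded arrays as well.
if grep -E '\(F\)|\[[^]]*_[^]]*\]' /proc/mdstat >/dev/null 2>&1; then
    echo "Possible RAID problem detected:"
    cat /proc/mdstat
fi
</pre></blockquote>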
<title>How to figure out which RAID disk to replace when it fails</title>
<link>http://people.skolelinux.org/pere/blog/How_to_figure_out_which_RAID_disk_to_replace_when_it_fail.html</link>
<guid isPermaLink="true">http://people.skolelinux.org/pere/blog/How_to_figure_out_which_RAID_disk_to_replace_when_it_fail.html</guid>
<pubDate>Tue, 14 Feb 2012 21:25:00 +0100</pubDate>
<description><p>Once in a while my home server has disk problems.  Thanks to Linux
Software RAID, I have not lost data yet (but
<a href="http://comments.gmane.org/gmane.linux.raid/34532">I was
close</a> this summer :).  But once a disk starts to behave funny, a
practical problem presents itself: how do I get from the Linux device
name (like /dev/sdd) to something that can be used to identify the
disk when the computer is turned off?  In my case I have SATA disks
with a unique ID printed on the label.  All I need is a way to query
the disk to get that ID out.</p>
<p>After fumbling a bit, I
<a href="http://www.cyberciti.biz/faq/linux-getting-scsi-ide-harddisk-information/">found
that hdparm -I</a> will report the disk serial number, which is
printed on the disk label.  The following (almost) one-liner can be
used to look up the IDs of all the failed disks:</p>
<blockquote><pre>
for d in $(cat /proc/mdstat |grep '(F)'|tr ' ' "\n"|grep '(F)'|cut -d\[ -f1|sort -u);
do
  printf "Failed disk $d: "
  hdparm -I /dev/$d |grep 'Serial Num'
done
</pre></blockquote>
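<p>As a cross check (a sketch that assumes a system with udev, where
the serial number normally ends up embedded in the persistent device
names), the same information can often be read straight from the
/dev/disk/by-id/ symlinks:</p>

<blockquote><pre>
# List the persistent names pointing at sdd; the serial number is
# usually part of the ata-* link name.
ls -l /dev/disk/by-id/ | grep sdd
</pre></blockquote>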
<p>Putting it here to make sure I do not have to search for it the
next time, and in case others find it useful.</p>
<p>At the moment I have two failing disks. :(</p>
<blockquote><pre>
Failed disk sdd1: Serial Number: WD-WCASJ1860823
Failed disk sdd2: Serial Number: WD-WCASJ1860823
Failed disk sde2: Serial Number: WD-WCASJ1840589
</pre></blockquote>
<p>The last time I had failing disks, I added the serial number on
labels I printed and stuck on the short sides of each disk, to be able
to figure out which disk to take out of the box without having to
remove each disk to look at the physical vendor label.  The vendor
label is at the top of the disk, which is hidden when the disks are
mounted inside my box.</p>
<p>I really wish the check_linux_raid Nagios plugin for checking Linux
Software RAID in the
<a href="http://packages.qa.debian.org/n/nagios-plugins.html">nagios-plugins-standard</a>
Debian package would look up this value automatically, as it would
make the plugin a lot more useful when my disks fail.  At the moment
it only reports a failure when there are no more spares left (it
really should warn as soon as a disk is failing), and it does not tell
me which disk(s) are failing when the RAID is running short on
disks.</p>