<?xml version="1.0" encoding="utf-8"?>
<rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/'>
<channel>
<title>Petter Reinholdtsen - Entries tagged raid</title>
<description>Entries tagged raid</description>
<link>https://people.skolelinux.org/pere/blog/</link>


<item>
<title>RAID status from LSI Megaraid controllers in Debian</title>
<link>https://people.skolelinux.org/pere/blog/RAID_status_from_LSI_Megaraid_controllers_in_Debian.html</link>
<guid isPermaLink="true">https://people.skolelinux.org/pere/blog/RAID_status_from_LSI_Megaraid_controllers_in_Debian.html</guid>
<pubDate>Wed, 17 Apr 2024 17:00:00 +0200</pubDate>
<description>&lt;p&gt;I am happy to report that
&lt;a href=&quot;https://github.com/namiltd/megactl&quot;&gt;the megactl package&lt;/a&gt;,
useful for fetching RAID status when using the LSI Megaraid controller,
is now available in Debian. It passed NEW a few days ago, is now
&lt;a href=&quot;https://tracker.debian.org/pkg/megactl&quot;&gt;available in
unstable&lt;/a&gt;, and will probably show up in testing in a week&#39;s time. The
new version should provide AppStream hardware mapping and should
integrate nicely with isenkram.&lt;/p&gt;
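
&lt;p&gt;A minimal sketch of how this could be used once the package reaches
your suite. The isenkram-lookup tool comes from the isenkram package,
and the expectation that megactl shows up in its output is an
assumption based on the new hardware mapping:&lt;/p&gt;

&lt;pre&gt;
# Ask isenkram which packages match the hardware in this machine.  On a
# host with a Megaraid/PERC controller, megactl should be among the
# proposals once the AppStream hardware mapping is in place.
isenkram-lookup

# Install the tool and print the RAID status.
apt install megactl
megasasctl
&lt;/pre&gt;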

&lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
activities, please send Bitcoin donations to my address
&lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;

</description>
</item>

<item>
<title>RAID status from LSI Megaraid controllers using free software</title>
<link>https://people.skolelinux.org/pere/blog/RAID_status_from_LSI_Megaraid_controllers_using_free_software.html</link>
<guid isPermaLink="true">https://people.skolelinux.org/pere/blog/RAID_status_from_LSI_Megaraid_controllers_using_free_software.html</guid>
<pubDate>Sun, 3 Mar 2024 22:40:00 +0100</pubDate>
<description>&lt;p&gt;The last few days I have revisited RAID setup using the LSI
Megaraid controller. These are a family of controllers called PERC by
Dell, and are present in several old PowerEdge servers, and I recently
got my hands on one of these. I had forgotten how to handle this RAID
controller in Debian, so I had to take a peek at the
&lt;a href=&quot;https://wiki.debian.org/LinuxRaidForAdmins&quot;&gt;Debian wiki page
&quot;Linux and Hardware RAID: an administrator&#39;s summary&quot;&lt;/a&gt; to remember
what kind of software is available to configure and monitor the disks
and controller. I prefer Free Software alternatives to proprietary
tools, as the latter tend to fall into disarray once the manufacturer
loses interest, and often do not work with newer Linux distributions.
Sadly there is no free software tool to configure the RAID setup, only
to monitor it. RAID can provide improved reliability and resilience in
a storage solution, but only if it is regularly checked and any
broken disks are replaced in time. I thus want to ensure some
automatic monitoring is available.&lt;/p&gt;

&lt;p&gt;In the discovery process, I came across an old free software tool to
monitor PERC2, PERC3, PERC4 and PERC5 controllers, which to my
surprise is not present in Debian. To help change that I created a
&lt;a href=&quot;https://bugs.debian.org/1065322&quot;&gt;request for packaging of the
megactl package&lt;/a&gt;, and tried to track down a usable version.
&lt;a href=&quot;https://sourceforge.net/p/megactl/&quot;&gt;The original project
site&lt;/a&gt; is on Sourceforge, but as far as I can tell that project has
been dead for more than 15 years. I managed to find a
&lt;a href=&quot;https://github.com/hmage/megactl&quot;&gt;more recent fork on
github&lt;/a&gt; from user hmage, but it is unclear to me if it is still
being maintained. It has not seen much improvement since 2016. A
&lt;a href=&quot;https://github.com/namiltd/megactl&quot;&gt;more up to date
edition&lt;/a&gt; is a git fork of the original github fork by user
namiltd, and this newer fork seems a lot more promising. The owner of
this github repository has replied to change proposals within hours,
and has already added some improvements and support for more hardware.
Sadly he is reluctant to commit to maintaining the tool and stated in
&lt;a href=&quot;https://github.com/namiltd/megactl/pull/1&quot;&gt;my first pull
request&lt;/a&gt; that he thinks a new release should be made based on the
git repository owned by hmage. I perfectly understand this
reluctance, as I feel the same about maintaining yet another package
in Debian when I barely have time to take care of the ones I already
maintain, but I do not really have high hopes that hmage will have time
to spend on it, and hope namiltd will change his mind.&lt;/p&gt;

&lt;p&gt;In any case, I created
&lt;a href=&quot;https://salsa.debian.org/debian/megactl&quot;&gt;a draft package&lt;/a&gt;
based on the namiltd edition and put it under the debian group on
salsa.debian.org. If you own a Dell PowerEdge server with one of the
PERC controllers, or any other RAID controller using the megaraid or
megaraid_sas Linux kernel modules, you might want to check it out. If
enough people are interested, perhaps the package will make it into
the Debian archive.&lt;/p&gt;

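&lt;p&gt;A quick way to check whether a machine is relevant is to look at
which kernel driver is bound to the controller. A small sketch using
standard tools (lspci is in the pciutils package):&lt;/p&gt;

&lt;pre&gt;
# Show the RAID controller and the kernel driver handling it.
lspci -k | grep -A 3 -i raid

# Or simply check whether one of the megaraid modules is loaded.
lsmod | grep -i megaraid
&lt;/pre&gt;
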
&lt;p&gt;There are two tools provided, megactl for the megaraid Linux kernel
module, and megasasctl for the megaraid_sas Linux kernel module. The
simple output from the command on one of my machines looks like this
(yes, I know some of the disks have problems. :).&lt;/p&gt;

&lt;pre&gt;
# megasasctl
a0 PERC H730 Mini encl:1 ldrv:2 batt:good
a0d0 558GiB RAID 1 1x2 optimal
a0d1 3067GiB RAID 0 1x11 optimal
a0e32s0 558GiB a0d0 online errs: media:0 other:19
a0e32s1 279GiB a0d1 online
a0e32s2 279GiB a0d1 online
a0e32s3 279GiB a0d1 online
a0e32s4 279GiB a0d1 online
a0e32s5 279GiB a0d1 online
a0e32s6 279GiB a0d1 online
a0e32s8 558GiB a0d0 online errs: media:0 other:17
a0e32s9 279GiB a0d1 online
a0e32s10 279GiB a0d1 online
a0e32s11 279GiB a0d1 online
a0e32s12 279GiB a0d1 online
a0e32s13 279GiB a0d1 online

#
&lt;/pre&gt;

&lt;p&gt;In addition to displaying a simple status report, it can also test
individual drives and print the various event logs. Perhaps you too
will find it useful?&lt;/p&gt;

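&lt;p&gt;To get the automatic monitoring I asked for above, a small cron job
that only sends mail when the status changes is enough. A minimal
sketch, assuming root receives mail on the machine; the state file
location is an arbitrary choice:&lt;/p&gt;

&lt;pre&gt;
#!/bin/sh
# Hypothetical cron job: mail root when the megasasctl report changes,
# for example because a disk went from online to failed.
statefile=/var/lib/misc/megasasctl.last
current=$(megasasctl)
previous=$(cat $statefile 2&gt;/dev/null)
if [ &quot;$current&quot; != &quot;$previous&quot; ]; then
    echo &quot;$current&quot; | mail -s &quot;RAID status change on $(hostname)&quot; root
    echo &quot;$current&quot; &gt; $statefile
fi
&lt;/pre&gt;
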
&lt;p&gt;In the packaging process I provided some patches upstream to
improve installation and ensure
&lt;a href=&quot;https://github.com/namiltd/megactl/pull/2&quot;&gt;an AppStream
metainfo file is provided&lt;/a&gt; listing all supported hardware, to allow
&lt;a href=&quot;https://tracker.debian.org/isenkram&quot;&gt;isenkram&lt;/a&gt; to propose
the package on all servers with a relevant PCI card.&lt;/p&gt;

&lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
activities, please send Bitcoin donations to my address
&lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;

</description>
</item>

<item>
<title>Some notes on fault tolerant storage systems</title>
<link>https://people.skolelinux.org/pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</link>
<guid isPermaLink="true">https://people.skolelinux.org/pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</guid>
<pubDate>Wed, 1 Nov 2017 15:35:00 +0100</pubDate>
<description>&lt;p&gt;If you care about how fault tolerant your storage is, you might
find these articles and papers interesting. They have shaped how I
think when designing a storage system.&lt;/p&gt;

&lt;ul&gt;

&lt;li&gt;USENIX ;login: &lt;a
href=&quot;https://www.usenix.org/publications/login/summer2017/ganesan&quot;&gt;Redundancy
Does Not Imply Fault Tolerance. Analysis of Distributed Storage
Reactions to Single Errors and Corruptions&lt;/a&gt; by Aishwarya Ganesan,
Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi
H. Arpaci-Dusseau&lt;/li&gt;

&lt;li&gt;ZDNet
&lt;a href=&quot;http://www.zdnet.com/article/why-raid-5-stops-working-in-2009/&quot;&gt;Why
RAID 5 stops working in 2009&lt;/a&gt; by Robin Harris&lt;/li&gt;

&lt;li&gt;ZDNet
&lt;a href=&quot;http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/&quot;&gt;Why
RAID 6 stops working in 2019&lt;/a&gt; by Robin Harris&lt;/li&gt;

&lt;li&gt;USENIX FAST&#39;07
&lt;a href=&quot;http://research.google.com/archive/disk_failures.pdf&quot;&gt;Failure
Trends in a Large Disk Drive Population&lt;/a&gt; by Eduardo Pinheiro,
Wolf-Dietrich Weber and Luiz André Barroso&lt;/li&gt;

&lt;li&gt;USENIX ;login: &lt;a
href=&quot;https://www.usenix.org/system/files/login/articles/hughes12-04.pdf&quot;&gt;Data
Integrity. Finding Truth in a World of Guesses and Lies&lt;/a&gt; by Doug
Hughes&lt;/li&gt;

&lt;li&gt;USENIX FAST&#39;08
&lt;a href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/&quot;&gt;An
Analysis of Data Corruption in the Storage Stack&lt;/a&gt; by
L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C.
Arpaci-Dusseau, and R. H. Arpaci-Dusseau&lt;/li&gt;

&lt;li&gt;USENIX FAST&#39;07 &lt;a
href=&quot;https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder_html/&quot;&gt;Disk
failures in the real world: what does an MTTF of 1,000,000 hours mean
to you?&lt;/a&gt; by B. Schroeder and G. A. Gibson.&lt;/li&gt;

&lt;li&gt;USENIX FAST&#39;08 &lt;a
href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/jiang/jiang_html/&quot;&gt;Are
Disks the Dominant Contributor for Storage Failures? A Comprehensive
Study of Storage Subsystem Failure Characteristics&lt;/a&gt; by Weihang
Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky&lt;/li&gt;

&lt;li&gt;SIGMETRICS 2007
&lt;a href=&quot;http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf&quot;&gt;An
analysis of latent sector errors in disk drives&lt;/a&gt; by
L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Several of these research papers are based on data collected from
hundreds of thousands or millions of disks, and their findings are
eye-opening. The short story is: do not implicitly trust RAID or
redundant storage systems. Details matter. And unfortunately there
are few options on Linux addressing all the identified issues. Both
ZFS and Btrfs are doing a fairly good job, but have legal and
practical issues of their own. I wonder how cluster file systems like
Ceph do in this regard. After all, there is an old saying: you know
you have a distributed system when the crash of a computer you have
never heard of stops you from getting any work done. The same holds
true if fault tolerance does not work.&lt;/p&gt;

&lt;p&gt;Just remember, in the end, it does not matter how redundant or how
fault tolerant your storage is, if you do not continuously monitor its
status to detect and replace failed disks.&lt;/p&gt;

&lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
activities, please send Bitcoin donations to my address
&lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;
</description>
</item>

<item>
<title>How to figure out which RAID disk to replace when it fails</title>
<link>https://people.skolelinux.org/pere/blog/How_to_figure_out_which_RAID_disk_to_replace_when_it_fail.html</link>
<guid isPermaLink="true">https://people.skolelinux.org/pere/blog/How_to_figure_out_which_RAID_disk_to_replace_when_it_fail.html</guid>
<pubDate>Tue, 14 Feb 2012 21:25:00 +0100</pubDate>
<description>&lt;p&gt;Once in a while my home server has disk problems. Thanks to Linux
Software RAID, I have not lost data yet (but
&lt;a href=&quot;http://comments.gmane.org/gmane.linux.raid/34532&quot;&gt;I was
close&lt;/a&gt; this summer :). But once a disk starts to behave
funny, a practical problem presents itself. How do I get from the Linux
device name (like /dev/sdd) to something that can be used to identify
the disk when the computer is turned off? In my case I have SATA
disks with a unique ID printed on the label. All I need is a way to
figure out how to query the disk to get the ID out.&lt;/p&gt;

&lt;p&gt;After fumbling a bit, I
&lt;a href=&quot;http://www.cyberciti.biz/faq/linux-getting-scsi-ide-harddisk-information/&quot;&gt;found
that hdparm -I&lt;/a&gt; will report the disk serial number, which is
printed on the disk label. The following (almost) one-liner can be
used to look up the ID of all the failed disks:&lt;/p&gt;

&lt;blockquote&gt;&lt;pre&gt;
for d in $(cat /proc/mdstat |grep &#39;(F)&#39;|tr &#39; &#39; &quot;\n&quot;|grep &#39;(F)&#39;|cut -d\[ -f1|sort -u);
do
  printf &quot;Failed disk $d: &quot;
  hdparm -I /dev/$d |grep &#39;Serial Num&#39;
done
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Putting it here to make sure I do not have to search for it the
next time, and in case others find it useful.&lt;/p&gt;

&lt;p&gt;At the moment I have two failing disks. :(&lt;/p&gt;

&lt;blockquote&gt;&lt;pre&gt;
Failed disk sdd1: Serial Number: WD-WCASJ1860823
Failed disk sdd2: Serial Number: WD-WCASJ1860823
Failed disk sde2: Serial Number: WD-WCASJ1840589
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;The last time I had failing disks, I added the serial number on
labels I printed and stuck on the short sides of each disk, to be able
to figure out which disk to take out of the box without having to
remove each disk to look at the physical vendor label. The vendor
label is at the top of the disk, which is hidden when the disks are
mounted inside my box.&lt;/p&gt;

&lt;p&gt;I really wish the check_linux_raid Nagios plugin for checking Linux
Software RAID in the
&lt;a href=&quot;http://packages.qa.debian.org/n/nagios-plugins.html&quot;&gt;nagios-plugins-standard&lt;/a&gt;
Debian package would look up this value automatically, as it would
make the plugin a lot more useful when my disks fail. At the moment
it only reports a failure when there are no more spares left (it really
should warn as soon as a disk is failing), and it does not tell me which
disk(s) are failing when the RAID is running short on disks.&lt;/p&gt;
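
&lt;p&gt;Until the plugin learns to do this, a small wrapper can append the
serial numbers itself. A rough sketch, not tested in production,
assuming the plugin is installed as
/usr/lib/nagios/plugins/check_linux_raid by the Debian package:&lt;/p&gt;

&lt;blockquote&gt;&lt;pre&gt;
#!/bin/sh
# Hypothetical wrapper: run the stock plugin, then append the serial
# number of every failed RAID member to its output.
out=$(/usr/lib/nagios/plugins/check_linux_raid); status=$?
for d in $(grep &#39;(F)&#39; /proc/mdstat | tr &#39; &#39; &quot;\n&quot; | grep &#39;(F)&#39; | cut -d\[ -f1 | sort -u); do
    serial=$(hdparm -I /dev/$d 2&gt;/dev/null | awk &#39;/Serial Number/ {print $3}&#39;)
    out=&quot;$out, failed $d serial $serial&quot;
done
echo &quot;$out&quot;
exit $status
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Pointing the Nagios service check at such a wrapper would at least
put the disk serial number directly in the alert.&lt;/p&gt;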
</description>
</item>

</channel>
</rss>