<?xml version="1.0" encoding="utf-8"?>
<rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/'>
<channel>
<title>Petter Reinholdtsen - Entries tagged raid</title>
<description>Entries tagged raid</description>
<link>https://people.skolelinux.org/pere/blog/</link>


<item>
<title>RAID status from LSI Megaraid controllers in Debian</title>
<link>https://people.skolelinux.org/pere/blog/RAID_status_from_LSI_Megaraid_controllers_in_Debian.html</link>
<guid isPermaLink="true">https://people.skolelinux.org/pere/blog/RAID_status_from_LSI_Megaraid_controllers_in_Debian.html</guid>
<pubDate>Wed, 17 Apr 2024 17:00:00 +0200</pubDate>
<description>&lt;p&gt;I am happy to report that
&lt;a href=&quot;https://github.com/namiltd/megactl&quot;&gt;the megactl package&lt;/a&gt;,
useful for fetching RAID status when using the LSI Megaraid controller,
is now available in Debian. It passed NEW a few days ago, is now
&lt;a href=&quot;https://tracker.debian.org/pkg/megactl&quot;&gt;available in
unstable&lt;/a&gt;, and will probably show up in testing in a week&#39;s time. The
new version should provide AppStream hardware mapping and should
integrate nicely with isenkram.&lt;/p&gt;
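
&lt;p&gt;A minimal sketch of how this could be used once the package reaches
your suite. The isenkram-lookup tool comes from the isenkram package,
and the expectation that megactl shows up in its output is an
assumption based on the new hardware mapping:&lt;/p&gt;

&lt;pre&gt;
# Ask isenkram which packages match the hardware in this machine.  On a
# host with a Megaraid/PERC controller, megactl should be among the
# proposals once the AppStream hardware mapping is in place.
isenkram-lookup

# Install the tool and print the RAID status.
apt install megactl
megasasctl
&lt;/pre&gt;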

&lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
activities, please send Bitcoin donations to my address
&lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;

</description>
</item>

<item>
<title>RAID status from LSI Megaraid controllers using free software</title>
<link>https://people.skolelinux.org/pere/blog/RAID_status_from_LSI_Megaraid_controllers_using_free_software.html</link>
<guid isPermaLink="true">https://people.skolelinux.org/pere/blog/RAID_status_from_LSI_Megaraid_controllers_using_free_software.html</guid>
<pubDate>Sun, 3 Mar 2024 22:40:00 +0100</pubDate>
<description>&lt;p&gt;The last few days I have revisited RAID setup using the LSI
Megaraid controller. These are a family of controllers called PERC by
Dell, and are present in several old PowerEdge servers, and I recently
got my hands on one of these. I had forgotten how to handle this RAID
controller in Debian, so I had to take a peek at the
&lt;a href=&quot;https://wiki.debian.org/LinuxRaidForAdmins&quot;&gt;Debian wiki page
&quot;Linux and Hardware RAID: an administrator&#39;s summary&quot;&lt;/a&gt; to remember
what kind of software is available to configure and monitor the disks
and controller. I prefer Free Software alternatives to proprietary
tools, as the latter tend to fall into disarray once the manufacturer
loses interest, and often do not work with newer Linux distributions.
Sadly there is no free software tool to configure the RAID setup, only
to monitor it. RAID can provide improved reliability and resilience in
a storage solution, but only if it is regularly checked and any
broken disks are replaced in time. I thus want to ensure some
automatic monitoring is available.&lt;/p&gt;

&lt;p&gt;In the discovery process, I came across an old free software tool to
monitor PERC2, PERC3, PERC4 and PERC5 controllers, which to my
surprise is not present in Debian. To help change that I created a
&lt;a href=&quot;https://bugs.debian.org/1065322&quot;&gt;request for packaging of the
megactl package&lt;/a&gt;, and tried to track down a usable version.
&lt;a href=&quot;https://sourceforge.net/p/megactl/&quot;&gt;The original project
site&lt;/a&gt; is on Sourceforge, but as far as I can tell that project has
been dead for more than 15 years. I managed to find a
&lt;a href=&quot;https://github.com/hmage/megactl&quot;&gt;more recent fork on
github&lt;/a&gt; from user hmage, but it is unclear to me if it is still
being maintained. It has not seen much improvement since 2016. A
&lt;a href=&quot;https://github.com/namiltd/megactl&quot;&gt;more up to date
edition&lt;/a&gt; is a git fork of the original github fork by user
namiltd, and this newer fork seems a lot more promising. The owner of
this github repository has replied to change proposals within hours,
and has already added some improvements and support for more hardware.
Sadly he is reluctant to commit to maintaining the tool and stated in
&lt;a href=&quot;https://github.com/namiltd/megactl/pull/1&quot;&gt;my first pull
request&lt;/a&gt; that he thinks a new release should be made based on the
git repository owned by hmage. I perfectly understand this
reluctance, as I feel the same about maintaining yet another package
in Debian when I barely have time to take care of the ones I already
maintain, but I do not really have high hopes that hmage will have time
to spend on it, and hope namiltd will change his mind.&lt;/p&gt;

&lt;p&gt;In any case, I created
&lt;a href=&quot;https://salsa.debian.org/debian/megactl&quot;&gt;a draft package&lt;/a&gt;
based on the namiltd edition and put it under the debian group on
salsa.debian.org. If you own a Dell PowerEdge server with one of the
PERC controllers, or any other RAID controller using the megaraid or
megaraid_sas Linux kernel modules, you might want to check it out. If
enough people are interested, perhaps the package will make it into
the Debian archive.&lt;/p&gt;

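&lt;p&gt;A quick way to check whether a machine is relevant is to look at
which kernel driver is bound to the controller. A small sketch using
standard tools (lspci is in the pciutils package):&lt;/p&gt;

&lt;pre&gt;
# Show the RAID controller and the kernel driver handling it.
lspci -k | grep -A 3 -i raid

# Or simply check whether one of the megaraid modules is loaded.
lsmod | grep -i megaraid
&lt;/pre&gt;
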
&lt;p&gt;There are two tools provided, megactl for the megaraid Linux kernel
module, and megasasctl for the megaraid_sas Linux kernel module. The
simple output from the command on one of my machines looks like this
(yes, I know some of the disks have problems. :).&lt;/p&gt;

&lt;pre&gt;
# megasasctl
a0 PERC H730 Mini encl:1 ldrv:2 batt:good
a0d0 558GiB RAID 1 1x2 optimal
a0d1 3067GiB RAID 0 1x11 optimal
a0e32s0 558GiB a0d0 online errs: media:0 other:19
a0e32s1 279GiB a0d1 online
a0e32s2 279GiB a0d1 online
a0e32s3 279GiB a0d1 online
a0e32s4 279GiB a0d1 online
a0e32s5 279GiB a0d1 online
a0e32s6 279GiB a0d1 online
a0e32s8 558GiB a0d0 online errs: media:0 other:17
a0e32s9 279GiB a0d1 online
a0e32s10 279GiB a0d1 online
a0e32s11 279GiB a0d1 online
a0e32s12 279GiB a0d1 online
a0e32s13 279GiB a0d1 online

#
&lt;/pre&gt;

&lt;p&gt;In addition to displaying a simple status report, it can also test
individual drives and print the various event logs. Perhaps you too
will find it useful?&lt;/p&gt;

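&lt;p&gt;To get the automatic monitoring I asked for above, a small cron job
that only sends mail when the status changes is enough. A minimal
sketch, assuming root receives mail on the machine; the state file
location is an arbitrary choice:&lt;/p&gt;

&lt;pre&gt;
#!/bin/sh
# Hypothetical cron job: mail root when the megasasctl report changes,
# for example because a disk went from online to failed.
statefile=/var/lib/misc/megasasctl.last
current=$(megasasctl)
previous=$(cat $statefile 2&gt;/dev/null)
if [ &quot;$current&quot; != &quot;$previous&quot; ]; then
    echo &quot;$current&quot; | mail -s &quot;RAID status change on $(hostname)&quot; root
    echo &quot;$current&quot; &gt; $statefile
fi
&lt;/pre&gt;
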
&lt;p&gt;In the packaging process I provided some patches upstream to
improve installation and ensure
&lt;a href=&quot;https://github.com/namiltd/megactl/pull/2&quot;&gt;an AppStream
metainfo file is provided&lt;/a&gt; listing all supported hardware, to allow
&lt;a href=&quot;https://tracker.debian.org/isenkram&quot;&gt;isenkram&lt;/a&gt; to propose
the package on all servers with a relevant PCI card.&lt;/p&gt;

&lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
activities, please send Bitcoin donations to my address
&lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;

</description>
</item>

<item>
<title>Some notes on fault tolerant storage systems</title>
<link>https://people.skolelinux.org/pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</link>
<guid isPermaLink="true">https://people.skolelinux.org/pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</guid>
<pubDate>Wed, 1 Nov 2017 15:35:00 +0100</pubDate>
<description>&lt;p&gt;If you care about how fault tolerant your storage is, you might
find these articles and papers interesting. They have shaped how I
think when designing a storage system.&lt;/p&gt;

&lt;ul&gt;

&lt;li&gt;USENIX ;login: &lt;a
href=&quot;https://www.usenix.org/publications/login/summer2017/ganesan&quot;&gt;Redundancy
Does Not Imply Fault Tolerance. Analysis of Distributed Storage
Reactions to Single Errors and Corruptions&lt;/a&gt; by Aishwarya Ganesan,
Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi
H. Arpaci-Dusseau&lt;/li&gt;

&lt;li&gt;ZDNet
&lt;a href=&quot;http://www.zdnet.com/article/why-raid-5-stops-working-in-2009/&quot;&gt;Why
RAID 5 stops working in 2009&lt;/a&gt; by Robin Harris&lt;/li&gt;

&lt;li&gt;ZDNet
&lt;a href=&quot;http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/&quot;&gt;Why
RAID 6 stops working in 2019&lt;/a&gt; by Robin Harris&lt;/li&gt;

&lt;li&gt;USENIX FAST&#39;07
&lt;a href=&quot;http://research.google.com/archive/disk_failures.pdf&quot;&gt;Failure
Trends in a Large Disk Drive Population&lt;/a&gt; by Eduardo Pinheiro,
Wolf-Dietrich Weber and Luiz André Barroso&lt;/li&gt;

&lt;li&gt;USENIX ;login: &lt;a
href=&quot;https://www.usenix.org/system/files/login/articles/hughes12-04.pdf&quot;&gt;Data
Integrity. Finding Truth in a World of Guesses and Lies&lt;/a&gt; by Doug
Hughes&lt;/li&gt;

&lt;li&gt;USENIX FAST&#39;08
&lt;a href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/&quot;&gt;An
Analysis of Data Corruption in the Storage Stack&lt;/a&gt; by
L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C.
Arpaci-Dusseau, and R. H. Arpaci-Dusseau&lt;/li&gt;

&lt;li&gt;USENIX FAST&#39;07 &lt;a
href=&quot;https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder_html/&quot;&gt;Disk
failures in the real world: what does an MTTF of 1,000,000 hours mean
to you?&lt;/a&gt; by B. Schroeder and G. A. Gibson.&lt;/li&gt;

&lt;li&gt;USENIX FAST&#39;08 &lt;a
href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/jiang/jiang_html/&quot;&gt;Are
Disks the Dominant Contributor for Storage Failures? A Comprehensive
Study of Storage Subsystem Failure Characteristics&lt;/a&gt; by Weihang
Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky&lt;/li&gt;

&lt;li&gt;SIGMETRICS 2007
&lt;a href=&quot;http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf&quot;&gt;An
analysis of latent sector errors in disk drives&lt;/a&gt; by
L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Several of these research papers are based on data collected from
hundreds of thousands or millions of disks, and their findings are
eye-opening. The short story is: do not implicitly trust RAID or
redundant storage systems. Details matter. And unfortunately there
are few options on Linux addressing all the identified issues. Both
ZFS and Btrfs are doing a fairly good job, but have legal and
practical issues of their own. I wonder how cluster file systems like
Ceph do in this regard. After all, there is an old saying: you know
you have a distributed system when the crash of a computer you have
never heard of stops you from getting any work done. The same holds
true if fault tolerance does not work.&lt;/p&gt;

&lt;p&gt;Just remember, in the end, it does not matter how redundant or how
fault tolerant your storage is, if you do not continuously monitor its
status to detect and replace failed disks.&lt;/p&gt;

&lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
activities, please send Bitcoin donations to my address
&lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;
</description>
</item>

<item>
<title>How to figure out which RAID disk to replace when it fails</title>
<link>https://people.skolelinux.org/pere/blog/How_to_figure_out_which_RAID_disk_to_replace_when_it_fail.html</link>
<guid isPermaLink="true">https://people.skolelinux.org/pere/blog/How_to_figure_out_which_RAID_disk_to_replace_when_it_fail.html</guid>
<pubDate>Tue, 14 Feb 2012 21:25:00 +0100</pubDate>
<description>&lt;p&gt;Once in a while my home server has disk problems. Thanks to Linux
Software RAID, I have not lost data yet (but
&lt;a href=&quot;http://comments.gmane.org/gmane.linux.raid/34532&quot;&gt;I was
close&lt;/a&gt; this summer :). But once a disk starts to behave
funny, a practical problem presents itself. How do I get from the Linux
device name (like /dev/sdd) to something that can be used to identify
the disk when the computer is turned off? In my case I have SATA
disks with a unique ID printed on the label. All I need is a way to
figure out how to query the disk to get the ID out.&lt;/p&gt;

&lt;p&gt;After fumbling a bit, I
&lt;a href=&quot;http://www.cyberciti.biz/faq/linux-getting-scsi-ide-harddisk-information/&quot;&gt;found
that hdparm -I&lt;/a&gt; will report the disk serial number, which is
printed on the disk label. The following (almost) one-liner can be
used to look up the ID of all the failed disks:&lt;/p&gt;

&lt;blockquote&gt;&lt;pre&gt;
for d in $(cat /proc/mdstat |grep &#39;(F)&#39;|tr &#39; &#39; &quot;\n&quot;|grep &#39;(F)&#39;|cut -d\[ -f1|sort -u);
do
  printf &quot;Failed disk $d: &quot;
  hdparm -I /dev/$d |grep &#39;Serial Num&#39;
done
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Putting it here to make sure I do not have to search for it the
next time, and in case others find it useful.&lt;/p&gt;

&lt;p&gt;At the moment I have two failing disks. :(&lt;/p&gt;

&lt;blockquote&gt;&lt;pre&gt;
Failed disk sdd1: Serial Number: WD-WCASJ1860823
Failed disk sdd2: Serial Number: WD-WCASJ1860823
Failed disk sde2: Serial Number: WD-WCASJ1840589
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;The last time I had failing disks, I added the serial number on
labels I printed and stuck on the short sides of each disk, to be able
to figure out which disk to take out of the box without having to
remove each disk to look at the physical vendor label. The vendor
label is at the top of the disk, which is hidden when the disks are
mounted inside my box.&lt;/p&gt;

&lt;p&gt;I really wish the check_linux_raid Nagios plugin for checking Linux
Software RAID in the
&lt;a href=&quot;http://packages.qa.debian.org/n/nagios-plugins.html&quot;&gt;nagios-plugins-standard&lt;/a&gt;
Debian package would look up this value automatically, as it would
make the plugin a lot more useful when my disks fail. At the moment
it only reports a failure when there are no more spares left (it really
should warn as soon as a disk is failing), and it does not tell me which
disk(s) are failing when the RAID is running short on disks.&lt;/p&gt;
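
&lt;p&gt;Until the plugin learns to do this, a small wrapper can append the
serial numbers itself. A rough sketch, not tested in production,
assuming the plugin is installed as
/usr/lib/nagios/plugins/check_linux_raid by the Debian package:&lt;/p&gt;

&lt;blockquote&gt;&lt;pre&gt;
#!/bin/sh
# Hypothetical wrapper: run the stock plugin, then append the serial
# number of every failed RAID member to its output.
out=$(/usr/lib/nagios/plugins/check_linux_raid); status=$?
for d in $(grep &#39;(F)&#39; /proc/mdstat | tr &#39; &#39; &quot;\n&quot; | grep &#39;(F)&#39; | cut -d\[ -f1 | sort -u); do
    serial=$(hdparm -I /dev/$d 2&gt;/dev/null | awk &#39;/Serial Number/ {print $3}&#39;)
    out=&quot;$out, failed $d serial $serial&quot;
done
echo &quot;$out&quot;
exit $status
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Pointing the Nagios service check at such a wrapper would at least
put the disk serial number directly in the alert.&lt;/p&gt;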
</description>
</item>

</channel>
</rss>