X-Git-Url: http://pere.pagekite.me/gitweb/homepage.git/blobdiff_plain/0f2c45fda2387a506fe9da3ae9d73fcb7c31d927..f16191ca76a47a08f56d027facdb8ab474bae8bb:/blog/archive/2012/10/10.rss diff --git a/blog/archive/2012/10/10.rss b/blog/archive/2012/10/10.rss index e050005a29..00320eaf8d 100644 --- a/blog/archive/2012/10/10.rss +++ b/blog/archive/2012/10/10.rss @@ -6,6 +6,95 @@ http://people.skolelinux.org/pere/blog/ + + 12 years of outages - summarised by Stuart Kendrick + http://people.skolelinux.org/pere/blog/12_years_of_outages___summarised_by_Stuart_Kendrick.html + http://people.skolelinux.org/pere/blog/12_years_of_outages___summarised_by_Stuart_Kendrick.html + Fri, 26 Oct 2012 14:20:00 +0200 + <p>I work at the <a href="http://www.uio.no/">University of Oslo</a> +looking after the computers, mostly on the unix side, but in general +all over the place. I am also a member (and currently leader) of +<a href="http://www.nuug.no/">the NUUG association</a>, which in turn +makes me a member of <a href="http://www.usenix.org/">USENIX</a>. NUUG +is a member organisation for those of us in Norway interested in free +software, open standards and unix-like operating systems, and USENIX +is a US-based member organisation with similar goals. Thanks to +these memberships, I get all issues of the great USENIX magazine +<a href="https://www.usenix.org/publications/login">;login:</a> in the +mail several times a year. 
The magazine is great, and I read most of +it every time.</p> + +<p>In the last issue of the USENIX magazine ;login:, there is an +article by <a href="http://www.skendric.com/">Stuart Kendrick</a> from +Fred Hutchinson Cancer Research Center titled +"<a href="https://www.usenix.org/publications/login/october-2012-volume-37-number-5/what-takes-us-down">What +Takes Us Down</a>" (longer version also +<a href="http://www.skendric.com/problem/incident-analysis/2012-06-30/What-Takes-Us-Down.pdf">available +from his own site</a>), where he reports what he found when he +processed the outage reports (both planned and unplanned) from the +last twelve years and classified them according to cause, time of day, +and so on. The article is a good read to get some empirical data on +what kind of problems affect a data centre, but what really inspired +me was the kind of reporting they have had in place since 2000.</p> + +<p>The centre set up a mailing list, and started to send fairly +standardised messages to this list when an outage was planned or when +it had already occurred, to announce the plan and get feedback on the +assumptions about scope and user impact. Here are the two examples from the +article. First the unplanned outage:</p> + +<blockquote><pre> +Subject: Exchange 2003 Cluster Issues +Severity: Critical (Unplanned) +Start: Monday, May 7, 2012, 11:58 +End: Monday, May 7, 2012, 12:38 +Duration: 40 minutes +Scope: Exchange 2003 +Description: The HTTPS service on the Exchange cluster crashed, triggering + a cluster failover. + +User Impact: During this period, all Exchange users were unable to + access e-mail. Zimbra users were unaffected. +Technician: [xxx] +</pre></blockquote> + +Next the planned outage: + +<blockquote><pre> +Subject: H Building Switch Upgrades +Severity: Major (Planned) +Start: Saturday, June 16, 2012, 06:00 +End: Saturday, June 16, 2012, 16:00 +Duration: 10 hours +Scope: H2 Transport +Description: Currently, Catalyst 4006s provide 10/100 Ethernet to end- + stations. 
We will replace these with newer Catalyst + 4510s. +User Impact: All users on H2 will be isolated from the network during + this work. Afterward, they will have gigabit + connectivity. +Technician: [xxx] +</pre></blockquote> + +<p>He notes in his article that the date formats and other fields have +been a bit too free-form to be easily processed automatically +into a database for further analysis. I would have used ISO 8601 +dates myself to make processing easier (in other words I would ask +people to write '2012-06-16 06:00 +0000' instead of the start time +format listed above). There are also other issues with the format +that could be improved; read the article for the details.</p> + +<p>I find the idea of standardising outage messages so +good that I would like to get it implemented here at the +university too. We do register +<a href="http://www.uio.no/tjenester/it/aktuelt/planlagte-tjenesteavbrudd/">planned +changes and outages in a calendar</a>, and report them to a mailing +list, but we do not do so in a structured format, and unplanned +outages are not reported to the same location. Perhaps something +for other sites to consider too?</p> + + + Amazon steal books from customer and throw out her out without any explanation http://people.skolelinux.org/pere/blog/Amazon_steal_books_from_customer_and_throw_out_her_out_without_any_explanation.html