Title: 12 years of outages - summarised by Stuart Kendrick
Tags: english, nuug, standard, usenix
Date: 2012-10-26 14:20

<p>I work at the <a href="http://www.uio.no/">University of Oslo</a>
looking after the computers, mostly on the unix side, but in general
all over the place. I am also a member (and currently leader) of
<a href="http://www.nuug.no/">the NUUG association</a>, which in turn
makes me a member of <a href="http://www.usenix.org/">USENIX</a>. NUUG
is a member organisation for those of us in Norway interested in free
software, open standards and unix-like operating systems, and USENIX
is a US-based member organisation with similar goals. Thanks to
these memberships, I get every issue of the USENIX magazine
<a href="https://www.usenix.org/publications/login">;login:</a> in the
mail several times a year. The magazine is great, and I read most of
it every time.</p>

<p>In the latest issue of ;login:, there is an
article by <a href="http://www.skendric.com/">Stuart Kendrick</a> from
Fred Hutchinson Cancer Research Center titled
"<a href="https://www.usenix.org/publications/login/october-2012-volume-37-number-5/what-takes-us-down">What
Takes Us Down</a>" (a longer version is also
<a href="http://www.skendric.com/problem/incident-analysis/2012-06-30/What-Takes-Us-Down.pdf">available
from his own site</a>), where he reports what he found when he
processed the outage reports (both planned and unplanned) from the
last twelve years and classified them according to cause, time of day
and so on. The article is a good read to get some empirical data on
what kind of problems affect a data centre, but what really inspired
me was the kind of reporting they have had in place since 2000.</p>

<p>The centre set up a mailing list, and started to send fairly
standardised messages to this list when an outage was planned or
had already occurred, to announce the plan and get feedback on the
assumptions about scope and user impact. Here are the two examples from
the article. First the unplanned outage:</p>

<blockquote><pre>
Subject: Exchange 2003 Cluster Issues
Severity: Critical (Unplanned)
Start: Monday, May 7, 2012, 11:58
End: Monday, May 7, 2012, 12:38
Duration: 40 minutes
Scope: Exchange 2003
Description: The HTTPS service on the Exchange cluster crashed, triggering
a cluster failover.

User Impact: During this period, all Exchange users were unable to
access e-mail. Zimbra users were unaffected.
Technician: [xxx]
</pre></blockquote>

<p>Next the planned outage:</p>

<blockquote><pre>
Subject: H Building Switch Upgrades
Severity: Major (Planned)
Start: Saturday, June 16, 2012, 06:00
End: Saturday, June 16, 2012, 16:00
Duration: 10 hours
Scope: H2 Transport
Description: Currently, Catalyst 4006s provide 10/100 Ethernet to
end-stations. We will replace these with newer Catalyst
4510s.
User Impact: All users on H2 will be isolated from the network during
this work. Afterward, they will have gigabit
connectivity.
Technician: [xxx]
</pre></blockquote>

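<p>To give an idea of how messages in this layout could be processed
automatically, here is a minimal Python sketch that turns one message
into a dictionary. The function name <code>parse_outage</code> and the
list of field names are my own assumptions, taken only from the two
examples above, so treat it as an illustration and not as the tooling
used at the centre.</p>

```python
import re

# Field names as they appear in the two example messages. This list is an
# assumption; the real reports may use more (or differently named) fields.
FIELDS = ("Subject", "Severity", "Start", "End", "Duration", "Scope",
          "Description", "User Impact", "Technician")
FIELD_RE = re.compile(r"^(%s):\s*(.*)$" % "|".join(map(re.escape, FIELDS)))


def parse_outage(text):
    """Turn one outage message into a dict keyed by field name."""
    record = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD_RE.match(line)
        if match:
            current, value = match.groups()
            record[current] = value
        elif current is not None:
            # A line without a known field name continues the previous field.
            record[current] = (record[current] + " " + line).strip()
    return record


EXAMPLE = """\
Subject: Exchange 2003 Cluster Issues
Severity: Critical (Unplanned)
Start: Monday, May 7, 2012, 11:58
End: Monday, May 7, 2012, 12:38
Duration: 40 minutes
Scope: Exchange 2003
Description: The HTTPS service on the Exchange cluster crashed, triggering
a cluster failover.

User Impact: During this period, all Exchange users were unable to
access e-mail. Zimbra users were unaffected.
Technician: [xxx]
"""

record = parse_outage(EXAMPLE)
print(record["Severity"])
print(record["Description"])
```

<p>Matching only against a fixed list of field names keeps wrapped
continuation lines attached to the right field, at the cost of
silently folding any unknown field into the previous one.</p>
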
<p>He notes in his article that the date formats and other fields have
been a bit too free-form to make it easy to automatically process them
into a database for further analysis. I would have used ISO 8601
dates myself to make them easier to process (in other words, I would ask
people to write '2012-06-16 06:00 +0000' instead of the start time
format listed above). There are also other issues with the format
that could be improved; read the article for the details.</p>

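<p>As a rough sketch of the conversion I have in mind, the following
Python snippet rewrites a start time in the format used in the examples
into the ISO 8601 style shown above. The function name
<code>to_iso_style</code> and the <code>utc_offset_hours</code>
parameter are hypothetical. The example messages do not state a time
zone, so the offset has to be supplied by whoever runs the conversion.
It also assumes English day and month names (the default C locale).</p>

```python
from datetime import datetime, timedelta, timezone

# Layout of the Start/End fields in the two example messages,
# e.g. "Saturday, June 16, 2012, 06:00".
SOURCE_FORMAT = "%A, %B %d, %Y, %H:%M"


def to_iso_style(text, utc_offset_hours=0):
    """Rewrite a free-form start time as 'YYYY-MM-DD HH:MM +ZZZZ'.

    The UTC offset is not part of the message, so it is passed in
    explicitly; the default of 0 is only a placeholder assumption.
    """
    naive = datetime.strptime(text, SOURCE_FORMAT)
    aware = naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return aware.strftime("%Y-%m-%d %H:%M %z")


print(to_iso_style("Saturday, June 16, 2012, 06:00"))
print(to_iso_style("Monday, May 7, 2012, 11:58", utc_offset_hours=-7))
```

<p>Writing the offset into every timestamp removes the need to guess
the time zone later, which is exactly the kind of guesswork that makes
free-form dates hard to load into a database.</p>
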
<p>I find standardising outage messages to be such a
good idea that I would like to get it implemented here at the
university too. We do register
<a href="http://www.uio.no/tjenester/it/aktuelt/planlagte-tjenesteavbrudd/">planned
changes and outages in a calendar</a>, and report them to a mailing
list, but we do not do so in a structured format, and unplanned
outages are not reported to the same location. Perhaps something
for other sites to consider too?</p>