I work at the University of Oslo looking after the computers, mostly on the unix side, but in general all over the place. I am also a member (and currently leader) of the NUUG association, which in turn makes me a member of USENIX. NUUG is a member organisation for those of us in Norway interested in free software, open standards and unix-like operating systems, and USENIX is a US-based member organisation with similar goals. Thanks to these memberships, I get every issue of the USENIX magazine ;login: in the mail several times a year. The magazine is great, and I read most of it every time.

In the last issue of the USENIX magazine ;login:, there is an article by Stuart Kendrick from Fred Hutchinson Cancer Research Center titled "What Takes Us Down" (also available from his own site), where he reports what he found when he processed the outage reports (both planned and unplanned) from the last twelve years and classified them according to cause, time of day, etc. The article is a good read to get some empirical data on what kinds of problems affect a data centre, but what really inspired me was the kind of reporting they have had in place since 2000.

The centre set up a mailing list and started sending fairly standardised messages to it when an outage was planned or had already occurred, to announce the plan and get feedback on the assumptions about scope and user impact. Here are the two examples from the article. First the unplanned outage:

Subject:     Exchange 2003 Cluster Issues
Severity:    Critical (Unplanned)
Start:       Monday, May 7, 2012, 11:58
End:         Monday, May 7, 2012, 12:38
Duration:    40 minutes
Scope:       Exchange 2003
Description: The HTTPS service on the Exchange cluster crashed, triggering
             a cluster failover.

User Impact: During this period, all Exchange users were unable to
             access e-mail. Zimbra users were unaffected.
Technician:  [xxx]

Next the planned outage:

Subject:     H Building Switch Upgrades
Severity:    Major (Planned)
Start:       Saturday, June 16, 2012, 06:00
End:         Saturday, June 16, 2012, 16:00
Duration:    10 hours
Scope:       H2 Transport
Description: Currently, Catalyst 4006s provide 10/100 Ethernet to end-
             stations. We will replace these with newer Catalyst
             4510s.
User Impact: All users on H2 will be isolated from the network during
             this work. Afterward, they will have gigabit
             connectivity.
Technician:  [xxx]
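
Messages in this layout are fairly easy to pick apart with a script. The Python sketch below is not from the article. It assumes that every field starts at the beginning of a line as "Name: value" and that indented lines continue the previous field. The function name parse_outage_message is made up for illustration.

```python
def parse_outage_message(text):
    """Split an outage message into a dict of field name -> value.

    Assumptions (not from the article): fields start in column 0 as
    'Name: value', and indented lines continue the previous field.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between fields
        if line[0].isspace() and current is not None:
            # Continuation line: glue it to the previous field. A trailing
            # hyphen is treated as a word split over two lines.
            joiner = "" if fields[current].endswith("-") else " "
            fields[current] += joiner + line.strip()
        else:
            # Split on the first colon only, so times like 06:00 survive.
            name, sep, value = line.partition(":")
            if not sep:
                continue  # not a field line, ignore it
            current = name.strip()
            fields[current] = value.strip()
    return fields


MESSAGE = """\
Subject: H Building Switch Upgrades
Severity: Major (Planned)
Start: Saturday, June 16, 2012, 06:00
End: Saturday, June 16, 2012, 16:00
Duration: 10 hours
Scope: H2 Transport
Description: Currently, Catalyst 4006s provide 10/100 Ethernet to end-
             stations. We will replace these with newer Catalyst
             4510s.
User Impact: All users on H2 will be isolated from the network during
             this work. Afterward, they will have gigabit
             connectivity.
Technician: [xxx]
"""

record = parse_outage_message(MESSAGE)
print(record["Severity"])
print(record["Start"])
```

The result is a plain dictionary, which could be fed into a database or a spreadsheet. A real mailing list would probably need more forgiving parsing, since people do not always stick to the template.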

He notes in his article that the date formats and other fields have been a bit too free-form to make it easy to process them automatically into a database for further analysis. I would have used ISO 8601 dates myself to make processing easier (in other words, I would ask people to write '2012-06-16 06:00 +0000' instead of the start time format listed above). There are other issues with the format that could be improved too; read the article for the details.
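
To illustrate the difference, here is a small Python sketch that converts the start time format used in the example messages into the ISO 8601 style suggested above. It is only a sketch under assumptions: the function name to_iso8601 is made up, the example messages do not state a time zone, so UTC is assumed here purely to match the '+0000' in the example, and the day and month names are expected to be in English.

```python
from datetime import datetime, timezone

# Format used in the example messages: 'Saturday, June 16, 2012, 06:00'.
# %A and %B depend on the locale; the default C locale uses English names.
ARTICLE_FORMAT = "%A, %B %d, %Y, %H:%M"


def to_iso8601(free_form, tz=timezone.utc):
    """Convert a free-form start/end time to 'YYYY-MM-DD HH:MM +ZZZZ'.

    The time zone is an assumption; the messages themselves carry none.
    """
    parsed = datetime.strptime(free_form, ARTICLE_FORMAT)
    return parsed.replace(tzinfo=tz).strftime("%Y-%m-%d %H:%M %z")


print(to_iso8601("Saturday, June 16, 2012, 06:00"))
# prints: 2012-06-16 06:00 +0000
```

Any deviation from the expected format, such as an abbreviated month name, makes strptime raise a ValueError. That is exactly the kind of fragility that goes away when the dates are written in ISO 8601 to begin with.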

I find standardising outage messages such a good idea that I would like to get it implemented here at the university too. We do register planned changes and outages in a calendar, and report them to a mailing list, but we do not do so in a structured format, and unplanned outages are not reported to the same location. Perhaps something for other sites to consider too?