I work at the University of Oslo looking after the computers, mostly on the Unix side, but in general all over the place. I am also a member (and currently leader) of the NUUG association, which in turn makes me a member of USENIX. NUUG is a member organisation for those of us in Norway interested in free software, open standards and Unix-like operating systems, and USENIX is a US-based member organisation with similar goals. Thanks to these memberships, I get every issue of the USENIX magazine ;login: in the mail several times a year. The magazine is great, and I read most of it every time.
In the latest issue of ;login:, there is an article by Stuart Kendrick from Fred Hutchinson Cancer Research Center titled "What Takes Us Down" (also available from his own site), where he reports what he found when he processed the outage reports (both planned and unplanned) from the last twelve years and classified them according to cause, time of day, and so on. The article is a good read if you want some empirical data on what kinds of problems affect a data centre, but what really inspired me was the kind of reporting they have had in place since 2000.
The centre set up a mailing list and started sending fairly standardised messages to it, both when an outage was planned, to announce the plan and get feedback on the assumptions about scope and user impact, and after an unplanned outage had occurred. Here are the two examples from the article. First the unplanned outage:
Subject: Exchange 2003 Cluster Issues
Severity: Critical (Unplanned)
Start: Monday, May 7, 2012, 11:58
End: Monday, May 7, 2012, 12:38
Duration: 40 minutes
Scope: Exchange 2003
Description: The HTTPS service on the Exchange cluster crashed, triggering a cluster failover.
User Impact: During this period, all Exchange users were unable to access e-mail. Zimbra users were unaffected.
Technician: [xxx]

Next the planned outage:

Subject: H Building Switch Upgrades
Severity: Major (Planned)
Start: Saturday, June 16, 2012, 06:00
End: Saturday, June 16, 2012, 16:00
Duration: 10 hours
Scope: H2 Transport
Description: Currently, Catalyst 4006s provide 10/100 Ethernet to end-stations. We will replace these with newer Catalyst 4510s.
User Impact: All users on H2 will be isolated from the network during this work. Afterward, they will have gigabit connectivity.
Technician: [xxx]
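Because both messages use the same "Field: value" layout, they can be split into fields mechanically. Here is a minimal sketch in Python of how I imagine that could be done. It is not the tooling used in the article: the function name parse_outage and the list of field names are my own assumptions, taken from the two examples above, and the approach is naive in that a field name followed by a colon inside the free text would confuse it.

```python
import re

# Field names as they appear in the two example messages (an assumption:
# the real mailing list may use more or differently named fields).
FIELDS = ("Subject", "Severity", "Start", "End", "Duration", "Scope",
          "Description", "User Impact", "Technician")

# Match any known field name followed by a colon, wherever it appears.
FIELD_RE = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in FIELDS) + r"):\s*")


def parse_outage(text):
    """Split an outage message into a dict mapping field name to value."""
    # With one capture group, re.split returns
    # [preamble, name, value, name, value, ...].
    parts = FIELD_RE.split(text)
    fields = {}
    for name, value in zip(parts[1::2], parts[2::2]):
        # Collapse line breaks and repeated spaces inside a value.
        fields[name] = " ".join(value.split())
    return fields


example = """Subject: Exchange 2003 Cluster Issues
Severity: Critical (Unplanned)
Start: Monday, May 7, 2012, 11:58
End: Monday, May 7, 2012, 12:38
Duration: 40 minutes
Scope: Exchange 2003
Technician: [xxx]"""

fields = parse_outage(example)
print(fields["Severity"])  # prints "Critical (Unplanned)"
```

Since the pattern looks for field names rather than line breaks, the same function also copes with a message where all fields have been joined onto one line.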
Kendrick notes in his article that the date formats and other fields have been a bit too free-form to make it easy to process them automatically into a database for further analysis. I would have used ISO 8601 dates myself (in other words, I would ask people to write '2012-06-16 06:00 +0000' instead of the start time format listed above). There are other aspects of the format that could be improved too; read the article for the details.
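To illustrate the difference, here is a small Python sketch that converts the start time format used in the examples into the form I suggest, and recomputes the duration as a sanity check. The function names are hypothetical, and two things are assumptions on my part: the messages do not state a time zone, so the UTC offset is a parameter that defaults to +0000 only to mirror my example above, and the weekday and month names are parsed with strptime, which expects them in the current locale (English here).

```python
from datetime import datetime, timedelta, timezone

# Layout of the Start/End fields in the example messages,
# e.g. "Saturday, June 16, 2012, 06:00".
FREE_FORM = "%A, %B %d, %Y, %H:%M"


def to_iso8601(stamp, utc_offset_hours=0):
    """Convert a free-form time stamp to 'YYYY-MM-DD HH:MM +hhmm'.

    The UTC offset is an assumption supplied by the caller, since the
    example messages do not carry a time zone.
    """
    naive = datetime.strptime(stamp, FREE_FORM)
    aware = naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return aware.strftime("%Y-%m-%d %H:%M %z")


def duration_minutes(start, end):
    """Return the number of whole minutes between two free-form stamps."""
    delta = (datetime.strptime(end, FREE_FORM)
             - datetime.strptime(start, FREE_FORM))
    return int(delta.total_seconds() // 60)


print(to_iso8601("Saturday, June 16, 2012, 06:00"))
# prints "2012-06-16 06:00 +0000"
print(duration_minutes("Monday, May 7, 2012, 11:58",
                       "Monday, May 7, 2012, 12:38"))
# prints 40
```

The second result agrees with the "Duration: 40 minutes" field in the unplanned example, which hints at another benefit of machine-readable dates: the duration no longer needs to be typed in by hand.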
I find standardising outage messages such a good idea that I would like to get it implemented here at the university too. We do register planned changes and outages in a calendar, and report them to a mailing list, but we do not do so in a structured format, and unplanned outages are not reported to the same place. Perhaps something for other sites to consider too?