I work at the University of Oslo
-looking after the computers, mostly on the unix side, but in general
-all over the place. I am also a member (and currently leader) of
-the NUUG association, which in turn
-makes me a member of USENIX. NUUG
-is a member organisation for us in Norway interested in free
-software, open standards and unix like operating systems, and USENIX
-is a US-based member organisation with similar goals. And thanks to
-these memberships, I get all issues of the great USENIX magazine
-;login: in the
-mail several times a year. The magazine is great, and I read most of
-it every time.
-
-
In the last issue of the USENIX magazine ;login:, there is an
-article by Stuart Kendrick from
-Fred Hutchinson Cancer Research Center titled
-"What
-Takes Us Down" (longer version also
-available
-from his own site), where he reports what he found when he
-processed the outage reports (both planned and unplanned) from the
-last twelve years and classified them according to cause, time of day,
-etc. The article is a good read to get some empirical data on
-what kind of problems affect a data centre, but what really inspired
-me was the kind of reporting they had put in place since 2000.
-
-
The centre set up a mailing list, and started to send fairly
-standardised messages to this list when an outage was planned or when
-it had already occurred, to announce the plan and get feedback on the
-assumptions on scope and user impact. Here are the two examples from the
-article: First the unplanned outage:
-
-
-Subject: Exchange 2003 Cluster Issues
-Severity: Critical (Unplanned)
-Start: Monday, May 7, 2012, 11:58
-End: Monday, May 7, 2012, 12:38
-Duration: 40 minutes
-Scope: Exchange 2003
-Description: The HTTPS service on the Exchange cluster crashed, triggering
- a cluster failover.
-
-User Impact: During this period, all Exchange users were unable to
- access e-mail. Zimbra users were unaffected.
-Technician: [xxx]
-
-
-Next the planned outage:
-
-
-Subject: H Building Switch Upgrades
-Severity: Major (Planned)
-Start: Saturday, June 16, 2012, 06:00
-End: Saturday, June 16, 2012, 16:00
-Duration: 10 hours
-Scope: H2 Transport
-Description: Currently, Catalyst 4006s provide 10/100 Ethernet to end-
- stations. We will replace these with newer Catalyst
- 4510s.
-User Impact: All users on H2 will be isolated from the network during
- this work. Afterward, they will have gigabit
- connectivity.
-Technician: [xxx]
-
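Messages in this shape are regular enough to split into fields mechanically. Here is a minimal sketch in Python of how that could look. The function name `parse_outage` and the rule that an indented line continues the previous field are my own assumptions based on the two examples, not something taken from the article.

```python
import re

# Field names as they appear in the two example messages.
FIELD = re.compile(
    r"^(Subject|Severity|Start|End|Duration|Scope|Description|"
    r"User Impact|Technician):\s*(.*)$"
)


def parse_outage(text):
    """Split an outage message into a dict of field name -> value.

    Hypothetical helper: lines that do not start with a known field
    name are assumed to continue the previous field.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current and line.strip():
            # Continuation line: append to the previous field.
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields


example = """\
Subject: Exchange 2003 Cluster Issues
Severity: Critical (Unplanned)
Start: Monday, May 7, 2012, 11:58
End: Monday, May 7, 2012, 12:38
Duration: 40 minutes
Scope: Exchange 2003
Description: The HTTPS service on the Exchange cluster crashed, triggering
 a cluster failover.

User Impact: During this period, all Exchange users were unable to
 access e-mail. Zimbra users were unaffected.
Technician: [xxx]
"""

report = parse_outage(example)
print(report["Severity"])
print(report["Description"])
```

Words hyphenated across a line break, like "end-" / "stations" in the planned outage example, would need extra handling that this sketch does not attempt.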
-
-
He notes in his article that the date formats and other fields have
-been a bit too free form to make it easy to automatically process them
-into a database for further analysis, and I would have used ISO 8601
-dates myself to make it easier to process (in other words I would ask
-people to write '2012-06-16 06:00 +0000' instead of the start time
-format listed above). There are also other issues with the format
-that could be improved; read the article for the details.
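To show what the conversion to ISO 8601 style dates might look like, here is a small Python sketch. The function name `to_iso8601` is made up for illustration, and because the example messages do not state a time zone, the UTC offset is a parameter the caller has to supply (an assumption on my part, defaulting to +0000 to match the example above).

```python
from datetime import datetime, timedelta, timezone


def to_iso8601(free_form, utc_offset_hours=0):
    """Convert e.g. 'Saturday, June 16, 2012, 06:00' to
    '2012-06-16 06:00 +0000'.

    Hypothetical helper: the offset is not part of the original
    message and must be given by the caller.
    """
    # Date layout as used in the two example messages (English names).
    parsed = datetime.strptime(free_form, "%A, %B %d, %Y, %H:%M")
    tz = timezone(timedelta(hours=utc_offset_hours))
    return parsed.replace(tzinfo=tz).strftime("%Y-%m-%d %H:%M %z")


print(to_iso8601("Saturday, June 16, 2012, 06:00"))
print(to_iso8601("Monday, May 7, 2012, 11:58", utc_offset_hours=-7))
```

Parsing only works as long as everyone sticks to exactly this layout, which is the point of asking for a fixed format in the first place.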
-
-
Standardising outage messages seems to me such a
-good idea that I would like to get it implemented here at the
-university too. We do register
-planned
-changes and outages in a calendar, and report them to a mailing
-list, but we do not do so in a structured format and there is no
-report to the same location for unplanned outages. Perhaps something
-for other sites to consider too?
-