From 93b5072474c797a9a5b591b2c130d20e91dafddd Mon Sep 17 00:00:00 2001
From: Petter Reinholdtsen
Date: Fri, 26 Oct 2012 12:01:05 +0000
Subject: [PATCH] New post.

---
 blog/data/2012-10-26-system-downtime.txt | 87 ++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 blog/data/2012-10-26-system-downtime.txt

diff --git a/blog/data/2012-10-26-system-downtime.txt b/blog/data/2012-10-26-system-downtime.txt
new file mode 100644
index 0000000000..821c05bb56
--- /dev/null
+++ b/blog/data/2012-10-26-system-downtime.txt
@@ -0,0 +1,87 @@
+Title: 12 years of outages - summarised by Stuart Kendrick
+Tags: english, nuug, standard
+Date: 2012-10-26 10:20
+

I work at the University of Oslo looking after the computers, mostly on the unix side, but in general all over the place. I am also a member (and currently the leader) of the NUUG association, which in turn makes me a member of USENIX. NUUG is a member organisation for people in Norway interested in free software, open standards and unix-like operating systems, and USENIX is a US-based member organisation with similar aims. I tend to distill it down to the simple statement that all the skilled computer people are members of NUUG, which, while a goal, is still not quite reflected in reality. Thanks to these memberships, I get every issue of the USENIX magazine ;login: in the mail several times a year. The magazine is great, and I read most of it every time.


In the last issue of the USENIX magazine ;login:, there is an article by Stuart Kendrick from the Fred Hutchinson Cancer Research Center titled What Takes Us Down (also available from his own site), where he reports what he found when he processed the outage reports (both planned and unplanned) from the last twelve years and classified them according to cause, time of day, and so on. The article is a good read if you want some empirical data on what kind of problems affect a data centre, but what really inspired me was the kind of reporting they have had in place since 2000.


The centre set up a mailing list, and sends fairly standardised messages to this list when an outage is planned or after one has occurred. Here are the two examples from the article. First the unplanned outage:

+Subject:     Exchange 2003 Cluster Issues
+Severity:    Critical (Unplanned)
+Start: 	     Monday, May 7, 2012, 11:58
+End: 	     Monday, May 7, 2012, 12:38
+Duration:    40 minutes
+Scope:	     Exchange 2003
+Description: The HTTPS service on the Exchange cluster crashed, triggering
+             a cluster failover.
+
+User Impact: During this period, all Exchange users were unable to
+             access e-mail. Zimbra users were unaffected.
+Technician:  [xxx]
+

Next the planned outage:

+Subject:     H Building Switch Upgrades
+Severity:    Major (Planned)
+Start:	     Saturday, June 16, 2012, 06:00
+End:	     Saturday, June 16, 2012, 16:00
+Duration:    10 hours
+Scope:	     H2 Transport
+Description: Currently, Catalyst 4006s provide 10/100 Ethernet to end-
+	     stations. We will replace these with newer Catalyst
+	     4510s.
+User Impact: All users on H2 will be isolated from the network during
+     	     this work. Afterward, they will have gigabit
+     	     connectivity.
+Technician:  [xxx]
+
+
+

He notes in his article that the date formats and other fields have been a bit too free-form to make it easy to automatically process the reports into a database for further analysis. I would have used ISO 8601 dates myself to make them easier to process (in other words, I would ask people to write '2012-06-16 06:00' instead of the start time format listed above). There are other issues with the format that could be improved as well; read the article for the details.
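
To make the point concrete, here is a small sketch (my own, not from the article) of how such a report could be parsed into a record with the timestamps normalised to ISO 8601. The field names come from the examples above, while the strptime format string is an assumption based on how the dates look there:

import re
from datetime import datetime

# Sketch, not from the article: parse one outage report written in the
# format shown above into a dictionary, and rewrite the Start and End
# fields as ISO 8601 style timestamps.
def parse_report(text):
    fields = {}
    key = None
    for line in text.splitlines():
        match = re.match(r'(\w[\w ]*?):\s*(.*)', line)
        if match:
            key, value = match.groups()
            fields[key] = value.strip()
        elif key and line.strip():
            # Indented continuation of a wrapped field like Description.
            fields[key] += ' ' + line.strip()
    for name in ('Start', 'End'):
        if name in fields:
            # The input date format is an assumption based on the examples.
            when = datetime.strptime(fields[name], '%A, %B %d, %Y, %H:%M')
            fields[name] = when.strftime('%Y-%m-%d %H:%M')
    return fields

Run on the first example above, this gives a Start value of '2012-05-07 11:58', which sorts correctly as plain text and is trivial to load into a database.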


I find the idea of standardising outage messages such a good one that I would like to get it implemented here at the university too. We do register planned changes and outages in a calendar and report them to a mailing list, but we do not do so in a structured format, and unplanned outages are not reported to the same place. Perhaps something for other sites to consider too?
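
If we were to introduce something similar at the university, a small script that composes the message in a fixed layout and hands it to a mailing list might be all that is needed to get started. Here is a rough sketch of what I have in mind (the field layout follows the examples above, while the function name and the addresses are made up for illustration):

from email.message import EmailMessage
from smtplib import SMTP

# Rough sketch with made-up names and addresses: compose a standardised
# outage announcement and hand it to the local mail server.
def outage_report(subject, severity, start, end, scope,
                  description, impact, technician):
    body = "\n".join([
        "Subject:     %s" % subject,
        "Severity:    %s" % severity,
        "Start:       %s" % start,   # ISO 8601, e.g. '2012-06-16 06:00'
        "End:         %s" % end,
        "Scope:       %s" % scope,
        "Description: %s" % description,
        "User Impact: %s" % impact,
        "Technician:  %s" % technician,
    ])
    msg = EmailMessage()
    msg["Subject"] = "[outage] %s - %s" % (severity, subject)
    msg["From"] = "drift@example.org"            # made-up addresses
    msg["To"] = "outage-reports@example.org"
    msg.set_content(body)
    return msg

report = outage_report("H Building Switch Upgrades", "Major (Planned)",
                       "2012-06-16 06:00", "2012-06-16 16:00",
                       "H2 Transport", "Replace the Catalyst 4006s.",
                       "All users on H2 will be offline during the work.",
                       "[xxx]")
with SMTP("localhost") as server:   # assumes a local mail server is running
    server.send_message(report)

Keeping the timestamps in ISO 8601 from the start would also make the kind of analysis Kendrick describes much easier to do later.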

-- 
2.47.2