Title: 12 years of outages - summarised by Stuart Kendrick
-Tags: english, nuug, standard
-Date: 2012-10-26 10:20
+Tags: english, nuug, standard, usenix
+Date: 2012-10-26 14:20
-<p>I work at the <ahref="http://www.uio.no/">University of Oslo</a>
+<p>I work at the <a href="http://www.uio.no/">University of Oslo</a>
looking after the computers, mostly on the unix side, but in general
all over the place. I am also a member (and currently leader) of
-<ahref="http://www.nuug.no/">the NUUG association</a>, which in turn
-make me a member of <ahref="http://www.usenix.org/">USENIX</a>. NUUG
+<a href="http://www.nuug.no/">the NUUG association</a>, which in turn
+makes me a member of <a href="http://www.usenix.org/">USENIX</a>. NUUG
is a member organisation for those of us in Norway interested in free
software, open standards and unix like operating systems, and USENIX
-is a US based member organisation with similar targets. I tend to
-distill it down to the simple statement that all the skilled computer
-people are members of NUUG, which while a goal is still not quite
-reflected in reality. And thanks to these memberships, I get all
-issues of the great USENIX magazine
-<ahref="https://www.usenix.org/publications/login">;login:</a> in the
+is a US-based member organisation with similar goals. And thanks to
+these memberships, I get all issues of the great USENIX magazine
+<a href="https://www.usenix.org/publications/login">;login:</a> in the
mail several times a year. The magazine is great, and I read most of
it every time.</p>
<p>In the last issue of the USENIX magazine ;login:, there is an
-article by <ahref="http://www.skendric.com/">Stuart Kendrick</a> from
+article by <a href="http://www.skendric.com/">Stuart Kendrick</a> from
Fred Hutchinson Cancer Research Center titled
-<ahref="https://www.usenix.org/publications/login/october-2012-volume-37-number-5/what-takes-us-down">What
-Takes Us Down</a> (also
-<ahref="http://www.skendric.com/problem/incident-analysis/2012-06-30/What-Takes-Us-Down.pdf">available
+"<a href="https://www.usenix.org/publications/login/october-2012-volume-37-number-5/what-takes-us-down">What
+Takes Us Down</a>" (longer version also
+<a href="http://www.skendric.com/problem/incident-analysis/2012-06-30/What-Takes-Us-Down.pdf">available
from his own site</a>), where he reports what he found when he
processed the outage reports (both planned and unplanned) from the
last twelve years and classified them according to cause, time of day,
what kind of problems affect a data centre, but what really inspired
me was the kind of reporting they had put in place since 2000.</p>
-<p>The centre set up a mailing list, and send fairly standardised
-messages to this list when a outage was planned or when it already
-occurred. Here is the two example from the article: First the
-unplanned outage:
+<p>The centre set up a mailing list, and started to send fairly
+standardised messages to this list when an outage was planned or when
+one had already occurred, to announce the plan and get feedback on the
+assumptions on scope and user impact. Here are the two examples from the
+article. First the unplanned outage:
<blockquote><pre>
Subject: Exchange 2003 Cluster Issues
this work. Afterward, they will have gigabit
connectivity.
Technician: [xxx]
-<blockquote><pre>
+</pre></blockquote>
<p>He notes in his article that the date formats and other fields have
been a bit too free form to make it easy to automatically process them
into a database for further analysis, and I would have used ISO 8601
dates myself to make it easier to process (in other words I would ask
-people to write '2012-06-16 06:00' instead of the start time format
-listed above). There are also other issues with the format that could
-be improved, read the article for the details.</p>
+people to write '2012-06-16 06:00 +0000' instead of the start time
+format listed above). There are also other issues with the format
+that could be improved; read the article for the details.</p>
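<p>To illustrate why a fixed date format helps, here is a minimal
sketch in Python of how a start time written as '2012-06-16 06:00
+0000' could be parsed. The names <tt>OUTAGE_TIME_FORMAT</tt> and
<tt>parse_outage_time</tt> are my own invention for this example, and
nothing like this appears in the article.</p>

```python
from datetime import datetime, timezone

# Assumed field layout: 'YYYY-MM-DD HH:MM +ZZZZ', the format suggested above.
# This is a hypothetical convention, not the one used in the article.
OUTAGE_TIME_FORMAT = "%Y-%m-%d %H:%M %z"


def parse_outage_time(value):
    """Parse a timestamp like '2012-06-16 06:00 +0000' into an aware datetime."""
    return datetime.strptime(value.strip(), OUTAGE_TIME_FORMAT)


start = parse_outage_time("2012-06-16 06:00 +0000")
# Normalise to UTC so outages reported from different time zones sort together.
print(start.astimezone(timezone.utc).isoformat())
```

<p>Because the time zone offset is part of the field, two reports of
the same moment written with different offsets end up as the same
point in time once parsed.</p>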
<p>Standardising outage messages seems to be such a
good idea that I would like to get it implemented here at the
university too. We do register
-<ahref="http://www.uio.no/tjenester/it/aktuelt/planlagte-tjenesteavbrudd/">planned
+<a href="http://www.uio.no/tjenester/it/aktuelt/planlagte-tjenesteavbrudd/">planned
changes and outages in a calendar</a>, and report them to a mailing
list, but we do not do so in a structured format and there is not a
report to the same location for unplanned outages. Perhaps something