Title: 12 years of outages - summarised by Stuart Kendrick
Tags: english, nuug, standard, usenix
Date: 2012-10-26 14:20

<p>I work at the <a href="http://www.uio.no/">University of Oslo</a>
looking after the computers, mostly on the unix side, but in general
all over the place.  I am also a member (and currently leader) of
<a href="http://www.nuug.no/">the NUUG association</a>, which in turn
makes me a member of <a href="http://www.usenix.org/">USENIX</a>.  NUUG
is a member organisation for those of us in Norway interested in free
software, open standards and unix-like operating systems, and USENIX
is a US-based member organisation with similar goals.  Thanks to
these memberships, I get all issues of the USENIX magazine
<a href="https://www.usenix.org/publications/login">;login:</a> in the
mail several times a year.  The magazine is great, and I read most of
it every time.</p>

<p>In the latest issue of the USENIX magazine ;login:, there is an
article by <a href="http://www.skendric.com/">Stuart Kendrick</a> from
Fred Hutchinson Cancer Research Center titled
"<a href="https://www.usenix.org/publications/login/october-2012-volume-37-number-5/what-takes-us-down">What
Takes Us Down</a>" (longer version also
<a href="http://www.skendric.com/problem/incident-analysis/2012-06-30/What-Takes-Us-Down.pdf">available
from his own site</a>), where he reports what he found when he
processed the outage reports (both planned and unplanned) from the
last twelve years and classified them according to cause, time of day,
etc.  The article is a good read to get some empirical data on
what kind of problems affect a data centre, but what really inspired
me was the kind of reporting they had put in place since 2000.</p>

<p>The centre set up a mailing list, and started to send fairly
standardised messages to this list when an outage was planned or
had already occurred, to announce the plan and get feedback on the
assumptions about scope and user impact.  Here are the two examples
from the article.  First the unplanned outage:</p>

<blockquote><pre>
Subject:     Exchange 2003 Cluster Issues
Severity:    Critical (Unplanned)
Start:       Monday, May 7, 2012, 11:58
End:         Monday, May 7, 2012, 12:38
Duration:    40 minutes
Scope:       Exchange 2003
Description: The HTTPS service on the Exchange cluster crashed, triggering
             a cluster failover.

User Impact: During this period, all Exchange users were unable to
             access e-mail. Zimbra users were unaffected.
Technician:  [xxx]
</pre></blockquote>

<p>Next the planned outage:</p>

<blockquote><pre>
Subject:     H Building Switch Upgrades
Severity:    Major (Planned)
Start:       Saturday, June 16, 2012, 06:00
End:         Saturday, June 16, 2012, 16:00
Duration:    10 hours
Scope:       H2 Transport
Description: Currently, Catalyst 4006s provide 10/100 Ethernet to end-
             stations. We will replace these with newer Catalyst
             4510s.
User Impact: All users on H2 will be isolated from the network during
             this work. Afterward, they will have gigabit
             connectivity.
Technician:  [xxx]
</pre></blockquote>

<p>He notes in his article that the date formats and other fields have
been a bit too free-form to make it easy to automatically process them
into a database for further analysis, and I would have used ISO 8601
dates myself to make it easier to process (in other words I would ask
people to write '2012-06-16 06:00 +0000' instead of the start time
format listed above).  There are also other issues with the format
that could be improved; read the article for the details.</p>
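<p>To give an idea of what such processing could look like, here is a
rough sketch in Python.  It is my own illustration, not the method
used in the article.  It assumes every field starts at the beginning
of a line with a name and a colon, that continuation lines are
indented, and that the dates follow the layout in the two examples
above.  The names <tt>parse_outage</tt> and <tt>to_iso</tt> are made
up for this sketch.</p>

```python
import re
from datetime import datetime

# The planned-outage example from the article, copied verbatim.
MESSAGE = """\
Subject:     H Building Switch Upgrades
Severity:    Major (Planned)
Start:       Saturday, June 16, 2012, 06:00
End:         Saturday, June 16, 2012, 16:00
Duration:    10 hours
Scope:       H2 Transport
Description: Currently, Catalyst 4006s provide 10/100 Ethernet to end-
             stations. We will replace these with newer Catalyst
             4510s.
User Impact: All users on H2 will be isolated from the network during
             this work. Afterward, they will have gigabit
             connectivity.
Technician:  [xxx]
"""

# Assumption: a field starts in column 0 with a capitalised name and a colon.
FIELD_RE = re.compile(r"^([A-Z][A-Za-z ]*):\s*(.*)$")

# Assumption: Start/End always use the layout seen in the two examples.
DATE_FORMAT = "%A, %B %d, %Y, %H:%M"


def parse_outage(text):
    """Split a 'Field: value' message into a dict, joining continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current is not None and line.strip():
            # Indented line: it continues the previous field.  A trailing
            # hyphen means a hyphenated word was split across two lines,
            # so join without a space in that case.
            glue = "" if fields[current].endswith("-") else " "
            fields[current] += glue + line.strip()
    return fields


def to_iso(value):
    """Convert 'Saturday, June 16, 2012, 06:00' to '2012-06-16 06:00'."""
    parsed = datetime.strptime(value, DATE_FORMAT)
    return parsed.isoformat(sep=" ", timespec="minutes")


fields = parse_outage(MESSAGE)
start = datetime.strptime(fields["Start"], DATE_FORMAT)
end = datetime.strptime(fields["End"], DATE_FORMAT)
print(to_iso(fields["Start"]))  # 2012-06-16 06:00
print(to_iso(fields["End"]))    # 2012-06-16 16:00
print(end - start)              # 10:00:00
```

<p>The messages carry no time zone, so the sketch can only produce
local times without a UTC offset.  The English day and month names are
parsed according to the current locale, which is one more reason to
prefer writing ISO 8601 dates in the messages to begin with.</p>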

<p>I find standardising outage messages to be such a
good idea that I would like to get it implemented here at the
university too.  We do register
<a href="http://www.uio.no/tjenester/it/aktuelt/planlagte-tjenesteavbrudd/">planned
changes and outages in a calendar</a>, and report them to a mailing
list, but we do not do so in a structured format, and unplanned
outages are not reported to the same location.  Perhaps something
for other sites to consider too?</p>