X-Git-Url: http://pere.pagekite.me/gitweb/homepage.git/blobdiff_plain/0f2c45fda2387a506fe9da3ae9d73fcb7c31d927..f16191ca76a47a08f56d027facdb8ab474bae8bb:/blog/archive/2012/10/10.rss

diff --git a/blog/archive/2012/10/10.rss b/blog/archive/2012/10/10.rss
index e050005a29..00320eaf8d 100644
--- a/blog/archive/2012/10/10.rss
+++ b/blog/archive/2012/10/10.rss
@@ -6,6 +6,95 @@
                 <link>http://people.skolelinux.org/pere/blog/</link>
 
 	
+	<item>
+		<title>12 years of outages - summarised by Stuart Kendrick</title>
+		<link>http://people.skolelinux.org/pere/blog/12_years_of_outages___summarised_by_Stuart_Kendrick.html</link>        
+		<guid isPermaLink="true">http://people.skolelinux.org/pere/blog/12_years_of_outages___summarised_by_Stuart_Kendrick.html</guid>
+                <pubDate>Fri, 26 Oct 2012 14:20:00 +0200</pubDate>
+		<description>&lt;p&gt;I work at the &lt;a href=&quot;http://www.uio.no/&quot;&gt;University of Oslo&lt;/a&gt;
+looking after the computers, mostly on the unix side, but in general
+all over the place.  I am also a member (and currently leader) of
+&lt;a href=&quot;http://www.nuug.no/&quot;&gt;the NUUG association&lt;/a&gt;, which in turn
+makes me a member of &lt;a href=&quot;http://www.usenix.org/&quot;&gt;USENIX&lt;/a&gt;.  NUUG
+is a member organisation for those of us in Norway interested in free
+software, open standards and unix-like operating systems, and USENIX
+is a US-based member organisation with similar goals.  Thanks to
+these memberships, I get all issues of the great USENIX magazine
+&lt;a href=&quot;https://www.usenix.org/publications/login&quot;&gt;;login:&lt;/a&gt; in the
+mail several times a year.  The magazine is great, and I read most of
+it every time.&lt;/p&gt;
+
+&lt;p&gt;In the latest issue of the USENIX magazine ;login:, there is an
+article by &lt;a href=&quot;http://www.skendric.com/&quot;&gt;Stuart Kendrick&lt;/a&gt; from
+Fred Hutchinson Cancer Research Center titled
+&quot;&lt;a href=&quot;https://www.usenix.org/publications/login/october-2012-volume-37-number-5/what-takes-us-down&quot;&gt;What
+Takes Us Down&lt;/a&gt;&quot; (longer version also
+&lt;a href=&quot;http://www.skendric.com/problem/incident-analysis/2012-06-30/What-Takes-Us-Down.pdf&quot;&gt;available
+from his own site&lt;/a&gt;), where he reports what he found when he
+processed the outage reports (both planned and unplanned) from the
+last twelve years and classified them according to cause, time of day,
+etc.  The article is a good read for empirical data on
+what kinds of problems affect a data centre, but what really inspired
+me was the kind of reporting they have had in place since 2000.&lt;/p&gt;
+
+&lt;p&gt;The centre set up a mailing list, and started to send fairly
+standardised messages to this list when an outage was planned or when
+one had already occurred, to announce the plan and get feedback on the
+assumptions about scope and user impact.  Here are the two examples from the
+article.  First the unplanned outage:&lt;/p&gt;
+
+&lt;blockquote&gt;&lt;pre&gt;
+Subject:     Exchange 2003 Cluster Issues
+Severity:    Critical (Unplanned)
+Start: 	     Monday, May 7, 2012, 11:58
+End: 	     Monday, May 7, 2012, 12:38
+Duration:    40 minutes
+Scope:	     Exchange 2003
+Description: The HTTPS service on the Exchange cluster crashed, triggering
+             a cluster failover.
+
+User Impact: During this period, all Exchange users were unable to
+             access e-mail. Zimbra users were unaffected.
+Technician:  [xxx]
+&lt;/pre&gt;&lt;/blockquote&gt;
+
+&lt;p&gt;Next the planned outage:&lt;/p&gt;
+
+&lt;blockquote&gt;&lt;pre&gt;
+Subject:     H Building Switch Upgrades
+Severity:    Major (Planned)
+Start:	     Saturday, June 16, 2012, 06:00
+End:	     Saturday, June 16, 2012, 16:00
+Duration:    10 hours
+Scope:	     H2 Transport
+Description: Currently, Catalyst 4006s provide 10/100 Ethernet to end-
+	     stations. We will replace these with newer Catalyst
+	     4510s.
+User Impact: All users on H2 will be isolated from the network during
+     	     this work. Afterward, they will have gigabit
+     	     connectivity.
+Technician:  [xxx]
+&lt;/pre&gt;&lt;/blockquote&gt;
+
+&lt;p&gt;He notes in his article that the date formats and other fields have
+been a bit too free-form to make it easy to automatically process them
+into a database for further analysis, and I would have used ISO 8601
+dates myself to make them easier to process (in other words, I would ask
+people to write &#39;2012-06-16 06:00 +0000&#39; instead of the start time
+format listed above).  There are also other issues with the format
+that could be improved; read the article for the details.&lt;/p&gt;
+
+&lt;p&gt;I find standardising outage messages such a
+good idea that I would like to get it implemented here at the
+university too.  We do register
+&lt;a href=&quot;http://www.uio.no/tjenester/it/aktuelt/planlagte-tjenesteavbrudd/&quot;&gt;planned
+changes and outages in a calendar&lt;/a&gt;, and report them to a mailing
+list, but we do not do so in a structured format, and unplanned
+outages are not reported to the same location.  Perhaps something
+for other sites to consider too?&lt;/p&gt;
+</description>
+	</item>
+	
 	<item>
 		<title>Amazon steal books from customer and throw out her out without any explanation</title>
 		<link>http://people.skolelinux.org/pere/blog/Amazon_steal_books_from_customer_and_throw_out_her_out_without_any_explanation.html</link>