Petter Reinholdtsen

Some notes on fault tolerant storage systems

1st November 2017

If you care about how fault tolerant your storage is, you might find these articles and papers interesting. They have formed how I think of when designing a storage system.

USENIX :login; Redundancy Does Not Imply Fault Tolerance. Analysis of Distributed Storage Reactions to Single Errors and Corruptions by Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
ZDNet Why RAID 5 stops working in 2009 by Robin Harris
ZDNet Why RAID 6 stops working in 2019 by Robin Harris
USENIX FAST'07 Failure Trends in a Large Disk Drive Population by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso
USENIX ;login: Data Integrity. Finding Truth in a World of Guesses and Lies by Doug Hughes
USENIX FAST'08 An Analysis of Data Corruption in the Storage Stack by L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau
USENIX FAST'07 Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? by B. Schroeder and G. A. Gibson.
USENIX ;login: Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky
SIGMETRICS 2007 An analysis of latent sector errors in disk drives by L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler

Several of these research papers are based on data collected from hundred thousands or millions of disk, and their findings are eye opening. The short story is simply do not implicitly trust RAID or redundant storage systems. Details matter. And unfortunately there are few options on Linux addressing all the identified issues. Both ZFS and Btrfs are doing a fairly good job, but have legal and practical issues on their own. I wonder how cluster file systems like Ceph do in this regard. After all, there is an old saying, you know you have a distributed system when the crash of a compyter you have never heard of stops you from getting any work done. The same holds true if fault tolerance do not work.

Just remember, in the end, it do not matter how redundant, or how fault tolerant your storage is, if you do not continuously monitor its status to detect and replace failed disks.

Tags: english, raid, sysadmin.

Web services for writing academic LaTeX papers as a team

31st October 2017

I was surprised today to learn that a friend in academia did not know there are easily available web services available for writing LaTeX documents as a team. I thought it was common knowledge, but to make sure at least my readers are aware of it, I would like to mention these useful services for writing LaTeX documents. Some of them even provide a WYSIWYG editor to ease writing even further.

There are two commercial services available, ShareLaTeX and Overleaf. They are very easy to use. Just start a new document, select which publisher to write for (ie which LaTeX style to use), and start writing. Note, these two have announced their intention to join forces, so soon it will only be one joint service. I've used both for different documents, and they work just fine. While ShareLaTeX is free software, while the latter is not. According to a announcement from Overleaf, they plan to keep the ShareLaTeX code base maintained as free software.

But these two are not the only alternatives. Fidus Writer is another free software solution with the source available on github. I have not used it myself. Several others can be found on the nice alterntiveTo web service.

If you like Google Docs or Etherpad, but would like to write documents in LaTeX, you should check out these services. You can even host your own, if you want to. :)

Tags: english.

Locating IMDB IDs of movies in the Internet Archive using Wikidata

25th October 2017

Recently, I needed to automatically check the copyright status of a set of The Internet Movie database (IMDB) entries, to figure out which one of the movies they refer to can be freely distributed on the Internet. This proved to be harder than it sounds. IMDB for sure list movies without any copyright protection, where the copyright protection has expired or where the movie is lisenced using a permissive license like one from Creative Commons. These are mixed with copyright protected movies, and there seem to be no way to separate these classes of movies using the information in IMDB.

First I tried to look up entries manually in IMDB, Wikipedia and The Internet Archive, to get a feel how to do this. It is hard to know for sure using these sources, but it should be possible to be reasonable confident a movie is "out of copyright" with a few hours work per movie. As I needed to check almost 20,000 entries, this approach was not sustainable. I simply can not work around the clock for about 6 years to check this data set.

I asked the people behind The Internet Archive if they could introduce a new metadata field in their metadata XML for IMDB ID, but was told that they leave it completely to the uploaders to update the metadata. Some of the metadata entries had IMDB links in the description, but I found no way to download all metadata files in bulk to locate those ones and put that approach aside.

In the process I noticed several Wikipedia articles about movies had links to both IMDB and The Internet Archive, and it occured to me that I could use the Wikipedia RDF data set to locate entries with both, to at least get a lower bound on the number of movies on The Internet Archive with a IMDB ID. This is useful based on the assumption that movies distributed by The Internet Archive can be legally distributed on the Internet. With some help from the RDF community (thank you DanC), I was able to come up with this query to pass to the SPARQL interface on Wikidata:

SELECT ?work ?imdb ?ia ?when ?label
WHERE
{
  ?work wdt:P31/wdt:P279* wd:Q11424.
  ?work wdt:P345 ?imdb.
  ?work wdt:P724 ?ia.
  OPTIONAL {
        ?work wdt:P577 ?when.
        ?work rdfs:label ?label.
        FILTER(LANG(?label) = "en").
  }
}

If I understand the query right, for every film entry anywhere in Wikpedia, it will return the IMDB ID and The Internet Archive ID, and when the movie was released and its English title, if either or both of the latter two are available. At the moment the result set contain 2338 entries. Of course, it depend on volunteers including both correct IMDB and The Internet Archive IDs in the wikipedia articles for the movie. It should be noted that the result will include duplicates if the movie have entries in several languages. There are some bogus entries, either because The Internet Archive ID contain a typo or because the movie is not available from The Internet Archive. I did not verify the IMDB IDs, as I am unsure how to do that automatically.

I wrote a small python script to extract the data set from Wikidata and check if the XML metadata for the movie is available from The Internet Archive, and after around 1.5 hour it produced a list of 2097 free movies and their IMDB ID. In total, 171 entries in Wikidata lack the refered Internet Archive entry. I assume the 70 "disappearing" entries (ie 2338-2097-171) are duplicate entries.

This is not too bad, given that The Internet Archive report to contain 5331 feature films at the moment, but it also mean more than 3000 movies are missing on Wikipedia or are missing the pair of references on Wikipedia.

I was curious about the distribution by release year, and made a little graph to show how the amount of free movies is spread over the years:

I expect the relative distribution of the remaining 3000 movies to be similar.

If you want to help, and want to ensure Wikipedia can be used to cross reference The Internet Archive and The Internet Movie Database, please make sure entries like this are listed under the "External links" heading on the Wikipedia article for the movie:

* {{Internet Archive film|id=FightingLady}}
* {{IMDb title|id=0036823|title=The Fighting Lady}}

Please verify the links on the final page, to make sure you did not introduce a typo.

Here is the complete list, if you want to correct the 171 identified Wikipedia entries with broken links to The Internet Archive: Q1140317, Q458656, Q458656, Q470560, Q743340, Q822580, Q480696, Q128761, Q1307059, Q1335091, Q1537166, Q1438334, Q1479751, Q1497200, Q1498122, Q865973, Q834269, Q841781, Q841781, Q1548193, Q499031, Q1564769, Q1585239, Q1585569, Q1624236, Q4796595, Q4853469, Q4873046, Q915016, Q4660396, Q4677708, Q4738449, Q4756096, Q4766785, Q880357, Q882066, Q882066, Q204191, Q204191, Q1194170, Q940014, Q946863, Q172837, Q573077, Q1219005, Q1219599, Q1643798, Q1656352, Q1659549, Q1660007, Q1698154, Q1737980, Q1877284, Q1199354, Q1199354, Q1199451, Q1211871, Q1212179, Q1238382, Q4906454, Q320219, Q1148649, Q645094, Q5050350, Q5166548, Q2677926, Q2698139, Q2707305, Q2740725, Q2024780, Q2117418, Q2138984, Q1127992, Q1058087, Q1070484, Q1080080, Q1090813, Q1251918, Q1254110, Q1257070, Q1257079, Q1197410, Q1198423, Q706951, Q723239, Q2079261, Q1171364, Q617858, Q5166611, Q5166611, Q324513, Q374172, Q7533269, Q970386, Q976849, Q7458614, Q5347416, Q5460005, Q5463392, Q3038555, Q5288458, Q2346516, Q5183645, Q5185497, Q5216127, Q5223127, Q5261159, Q1300759, Q5521241, Q7733434, Q7736264, Q7737032, Q7882671, Q7719427, Q7719444, Q7722575, Q2629763, Q2640346, Q2649671, Q7703851, Q7747041, Q6544949, Q6672759, Q2445896, Q12124891, Q3127044, Q2511262, Q2517672, Q2543165, Q426628, Q426628, Q12126890, Q13359969, Q13359969, Q2294295, Q2294295, Q2559509, Q2559912, Q7760469, Q6703974, Q4744, Q7766962, Q7768516, Q7769205, Q7769988, Q2946945, Q3212086, Q3212086, Q18218448, Q18218448, Q18218448, Q6909175, Q7405709, Q7416149, Q7239952, Q7317332, Q7783674, Q7783704, Q7857590, Q3372526, Q3372642, Q3372816, Q3372909, Q7959649, Q7977485, Q7992684, Q3817966, Q3821852, Q3420907, Q3429733, Q774474

Tags: english, opphavsrett.

A one-way wall on the border?

14th October 2017

I find it fascinating how many of the people being locked inside the proposed border wall between USA and Mexico support the idea. The proposal to keep Mexicans out reminds me of the propaganda twist from the East Germany government calling the wall the “Antifascist Bulwark” after erecting the Berlin Wall, claiming that the wall was erected to keep enemies from creeping into East Germany, while it was obvious to the people locked inside it that it was erected to keep the people from escaping.

Do the people in USA supporting this wall really believe it is a one way wall, only keeping people on the outside from getting in, while not keeping people in the inside from getting out?

Tags: english.

Generating 3D prints in Debian using Cura and Slic3r(-prusa)

9th October 2017

At my nearby maker space, Sonen, I heard the story that it was easier to generate gcode files for theyr 3D printers (Ultimake 2+) on Windows and MacOS X than Linux, because the software involved had to be manually compiled and set up on Linux while premade packages worked out of the box on Windows and MacOS X. I found this annoying, as the software involved, Cura, is free software and should be trivial to get up and running on Linux if someone took the time to package it for the relevant distributions. I even found a request for adding into Debian from 2013, which had seem some activity over the years but never resulted in the software showing up in Debian. So a few days ago I offered my help to try to improve the situation.

Now I am very happy to see that all the packages required by a working Cura in Debian are uploaded into Debian and waiting in the NEW queue for the ftpmasters to have a look. You can track the progress on the status page for the 3D printer team.

The uploaded packages are a bit behind upstream, and was uploaded now to get slots in the NEW queue while we work up updating the packages to the latest upstream version.

On a related note, two competitors for Cura, which I found harder to use and was unable to configure correctly for Ultimaker 2+ in the short time I spent on it, are already in Debian. If you are looking for 3D printer "slicers" and want something already available in Debian, check out slic3r and slic3r-prusa. The latter is a fork of the former.

Tags: 3d-printer, debian, english.

Mangler du en skrue, eller har du en skrue løs?

4th October 2017

Når jeg holder på med ulike prosjekter, så trenger jeg stadig ulike skruer. Det siste prosjektet jeg holder på med er å lage en boks til en HDMI-touch-skjerm som skal brukes med Raspberry Pi. Boksen settes sammen med skruer og bolter, og jeg har vært i tvil om hvor jeg kan få tak i de riktige skruene. Clas Ohlson og Jernia i nærheten har sjelden hatt det jeg trenger. Men her om dagen fikk jeg et fantastisk tips for oss som bor i Oslo. Zachariassen Jernvare AS i Hegermannsgate 23A på Torshov har et fantastisk utvalg, og åpent mellom 09:00 og 17:00. De selger skruer, muttere, bolter, skiver etc i løs vekt, og så langt har jeg fått alt jeg har lett etter. De har i tillegg det meste av annen jernvare, som verktøy, lamper, ledninger, etc. Jeg håper de har nok kunder til å holde det gående lenge, da dette er en butikk jeg kommer til å besøke ofte. Butikken er et funn å ha i nabolaget for oss som liker å bygge litt selv. :)

Tags: norsk.

Visualizing GSM radio chatter using gr-gsm and Hopglass

29th September 2017

Every mobile phone announce its existence over radio to the nearby mobile cell towers. And this radio chatter is available for anyone with a radio receiver capable of receiving them. Details about the mobile phones with very good accuracy is of course collected by the phone companies, but this is not the topic of this blog post. The mobile phone radio chatter make it possible to figure out when a cell phone is nearby, as it include the SIM card ID (IMSI). By paying attention over time, one can see when a phone arrive and when it leave an area. I believe it would be nice to make this information more available to the general public, to make more people aware of how their phones are announcing their whereabouts to anyone that care to listen.

I am very happy to report that we managed to get something visualizing this information up and running for Oslo Skaperfestival 2017 (Oslo Makers Festival) taking place today and tomorrow at Deichmanske library. The solution is based on the simple recipe for listening to GSM chatter I posted a few days ago, and will show up at the stand of Åpen Sone from the Computer Science department of the University of Oslo. The presentation will show the nearby mobile phones (aka IMSIs) as dots in a web browser graph, with lines to the dot representing mobile base station it is talking to. It was working in the lab yesterday, and was moved into place this morning.

We set up a fairly powerful desktop machine using Debian Buster/Testing with several (five, I believe) RTL2838 DVB-T receivers connected and visualize the visible cell phone towers using an English version of Hopglass. A fairly powerfull machine is needed as the grgsm_livemon_headless processes from gr-gsm converting the radio signal to data packages is quite CPU intensive.

The frequencies to listen to, are identified using a slightly patched scan-and-livemon (to set the --args values for each receiver), and the Hopglass data is generated using the patches in my meshviewer-output branch. For some reason we could not get more than four SDRs working. There is also a geographical map trying to show the location of the base stations, but I believe their coordinates are hardcoded to some random location in Germany, I believe. The code should be replaced with code to look up location in a text file, a sqlite database or one of the online databases mentioned in the github issue for the topic.

If this sound interesting, visit the stand at the festival!

Tags: debian, english, personvern, surveillance.

Easier recipe to observe the cell phones around you

24th September 2017

A little more than a month ago I wrote how to observe the SIM card ID (aka IMSI number) of mobile phones talking to nearby mobile phone base stations using Debian GNU/Linux and a cheap USB software defined radio, and thus being able to pinpoint the location of people and equipment (like cars and trains) with an accuracy of a few kilometer. Since then we have worked to make the procedure even simpler, and it is now possible to do this without any manual frequency tuning and without building your own packages.

The gr-gsm package is now included in Debian testing and unstable, and the IMSI-catcher code no longer require root access to fetch and decode the GSM data collected using gr-gsm.

Here is an updated recipe, using packages built by Debian and a git clone of two python scripts:

Start with a Debian machine running the Buster version (aka testing).
Run 'apt install gr-gsm python-numpy python-scipy python-scapy' as root to install required packages.
Fetch the code decoding GSM packages using 'git clone github.com/Oros42/IMSI-catcher.git'.
Insert USB software defined radio supported by GNU Radio.
Enter the IMSI-catcher directory and run 'python scan-and-livemon' to locate the frequency of nearby base stations and start listening for GSM packages on one of them.
Enter the IMSI-catcher directory and run 'python simple_IMSI-catcher.py' to display the collected information.

Note, due to a bug somewhere the scan-and-livemon program (actually its underlying program grgsm_scanner) do not work with the HackRF radio. It does work with RTL 8232 and other similar USB radio receivers you can get very cheaply (for example from ebay), so for now the solution is to scan using the RTL radio and only use HackRF for fetching GSM data.

As far as I can tell, a cell phone only show up on one of the frequencies at the time, so if you are going to track and count every cell phone around you, you need to listen to all the frequencies used. To listen to several frequencies, use the --numrecv argument to scan-and-livemon to use several receivers. Further, I am not sure if phones using 3G or 4G will show as talking GSM to base stations, so this approach might not see all phones around you. I typically see 0-400 IMSI numbers an hour when looking around where I live.

I've tried to run the scanner on a Raspberry Pi 2 and 3 running Debian Buster, but the grgsm_livemon_headless process seem to be too CPU intensive to keep up. When GNU Radio print 'O' to stdout, I am told there it is caused by a buffer overflow between the radio and GNU Radio, caused by the program being unable to read the GSM data fast enough. If you see a stream of 'O's from the terminal where you started scan-and-livemon, you need a give the process more CPU power. Perhaps someone are able to optimize the code to a point where it become possible to set up RPi3 based GSM sniffers? I tried using Raspbian instead of Debian, but there seem to be something wrong with GNU Radio on raspbian, causing glibc to abort().

Tags: debian, english, personvern, surveillance.

Datalagringsdirektivet kaster skygger over Høyre og Arbeiderpartiet

7th September 2017

For noen dager siden publiserte Jon Wessel-Aas en bloggpost om «Konklusjonen om datalagring som EU-kommisjonen ikke ville at vi skulle få se». Det er en interessant gjennomgang av EU-domstolens syn på snurpenotovervåkning av befolkningen, som er klar på at det er i strid med EU-lovgivingen.

Valgkampen går for fullt i Norge, og om noen få dager er siste frist for å avgi stemme. En ting er sikkert, Høyre og Arbeiderpartiet får ikke min stemme denne gangen heller. Jeg har ikke glemt at de tvang igjennom loven som skulle pålegge alle data- og teletjenesteleverandører å overvåke alle sine kunder. En lov som er vedtatt, og aldri opphevet igjen.

Det er tydelig fra diskusjonen rundt grenseløs digital overvåkning (eller "Digital Grenseforsvar" som det kalles i Orvellisk nytale) at hverken Høyre og Arbeiderpartiet har noen prinsipielle sperrer mot å overvåke hele befolkningen, og diskusjonen så langt tyder på at flere av de andre partiene heller ikke har det. Mange av de som stemte for Datalagringsdirektivet i Stortinget (64 fra Arbeiderpartiet, 25 fra Høyre) er fortsatt aktive og argumenterer fortsatt for å radere vekk mer av innbyggernes privatsfære.

Når myndighetene demonstrerer sin mistillit til folket, tror jeg folket selv bør legge litt innsats i å verne sitt privatliv, ved å ta i bruk ende-til-ende-kryptert kommunikasjon med sine kjente og kjære, og begrense hvor mye privat informasjon som deles med uvedkommende. Det er jo ingenting som tyder på at myndighetene kommer til å være vår privatsfære. Det er mange muligheter. Selv har jeg litt sans for Ring, som er basert på p2p-teknologi uten sentral kontroll, er fri programvare, og støtter meldinger, tale og video. Systemet er tilgjengelig ut av boksen fra Debian og Ubuntu, og det finnes pakker for Android, MacOSX og Windows. Foreløpig er det få brukere med Ring, slik at jeg også bruker Signal som nettleserutvidelse.

Tags: dld, norsk, personvern, stortinget, surveillance, valg.

Simpler recipe on how to make a simple $7 IMSI Catcher using Debian

9th August 2017

On friday, I came across an interesting article in the Norwegian web based ICT news magazine digi.no on how to collect the IMSI numbers of nearby cell phones using the cheap DVB-T software defined radios. The article refered to instructions and a recipe by Keld Norman on Youtube on how to make a simple $7 IMSI Catcher, and I decided to test them out.

The instructions said to use Ubuntu, install pip using apt (to bypass apt), use pip to install pybombs (to bypass both apt and pip), and the ask pybombs to fetch and build everything you need from scratch. I wanted to see if I could do the same on the most recent Debian packages, but this did not work because pybombs tried to build stuff that no longer build with the most recent openssl library or some other version skew problem. While trying to get this recipe working, I learned that the apt->pip->pybombs route was a long detour, and the only piece of software dependency missing in Debian was the gr-gsm package. I also found out that the lead upstream developer of gr-gsm (the name stand for GNU Radio GSM) project already had a set of Debian packages provided in an Ubuntu PPA repository. All I needed to do was to dget the Debian source package and built it.

The IMSI collector is a python script listening for packages on the loopback network device and printing to the terminal some specific GSM packages with IMSI numbers in them. The code is fairly short and easy to understand. The reason this work is because gr-gsm include a tool to read GSM data from a software defined radio like a DVB-T USB stick and other software defined radios, decode them and inject them into a network device on your Linux machine (using the loopback device by default). This proved to work just fine, and I've been testing the collector for a few days now.

The updated and simpler recipe is thus to

start with a Debian machine running Stretch or newer,
build and install the gr-gsm package available from http://ppa.launchpad.net/ptrkrysik/gr-gsm/ubuntu/pool/main/g/gr-gsm/,
clone the git repostory from https://github.com/Oros42/IMSI-catcher,
run grgsm_livemon and adjust the frequency until the terminal where it was started is filled with a stream of text (meaning you found a GSM station).
go into the IMSI-catcher directory and run 'sudo python simple_IMSI-catcher.py' to extract the IMSI numbers.

To make it even easier in the future to get this sniffer up and running, I decided to package the gr-gsm project for Debian (WNPP #871055), and the package was uploaded into the NEW queue today. Luckily the gnuradio maintainer has promised to help me, as I do not know much about gnuradio stuff yet.

I doubt this "IMSI cacher" is anywhere near as powerfull as commercial tools like The Spy Phone Portable IMSI / IMEI Catcher or the Harris Stingray, but I hope the existance of cheap alternatives can make more people realise how their whereabouts when carrying a cell phone is easily tracked. Seeing the data flow on the screen, realizing that I live close to a police station and knowing that the police is also wearing cell phones, I wonder how hard it would be for criminals to track the position of the police officers to discover when there are police near by, or for foreign military forces to track the location of the Norwegian military forces, or for anyone to track the location of government officials...

It is worth noting that the data reported by the IMSI-catcher script mentioned above is only a fraction of the data broadcasted on the GSM network. It will only collect one frequency at the time, while a typical phone will be using several frequencies, and not all phones will be using the frequencies tracked by the grgsm_livemod program. Also, there is a lot of radio chatter being ignored by the simple_IMSI-catcher script, which would be collected by extending the parser code. I wonder if gr-gsm can be set up to listen to more than one frequency?

Tags: debian, english, personvern, surveillance.