X-Git-Url: https://pere.pagekite.me/gitweb/homepage.git/blobdiff_plain/fc20a6cd50c2669321f60fc80703c58917dc5081..9b90db63be4ef8ae3cc2db3385af1738048fe6f4:/blog/index.rss diff --git a/blog/index.rss b/blog/index.rss index 0a657f587a..8780170dce 100644 --- a/blog/index.rss +++ b/blog/index.rss @@ -6,6 +6,102 @@ http://people.skolelinux.org/pere/blog/ + + Free software archive system Nikita now able to store documents + http://people.skolelinux.org/pere/blog/Free_software_archive_system_Nikita_now_able_to_store_documents.html + http://people.skolelinux.org/pere/blog/Free_software_archive_system_Nikita_now_able_to_store_documents.html + Sun, 19 Mar 2017 08:00:00 +0100 + <p>The <a href="https://github.com/hiOA-ABI/nikita-noark5-core">Nikita +Noark 5 core project</a> is implementing the Norwegian standard for +keeping an electronic archive of government documents. +<a href="http://www.arkivverket.no/arkivverket/Offentlig-forvaltning/Noark/Noark-5/English-version">The +Noark 5 standard</a> document the requirement for data systems used by +the archives in the Norwegian government, and the Noark 5 web interface +specification document a REST web service for storing, searching and +retrieving documents and metadata in such archive. I've been involved +in the project since a few weeks before Christmas, when the Norwegian +Unix User Group +<a href="https://www.nuug.no/news/NOARK5_kjerne_som_fri_programvare_f_r_epostliste_hos_NUUG.shtml">announced +it supported the project</a>. I believe this is an important project, +and hope it can make it possible for the government archives in the +future to use free software to keep the archives we citizens depend +on. But as I do not hold such archive myself, personally my first use +case is to store and analyse public mail journal metadata published +from the government. I find it useful to have a clear use case in +mind when developing, to make sure the system scratches one of my +itches.</p> + +<p>If you would like to help make sure there is a free software +alternatives for the archives, please join our IRC channel +(<a href="irc://irc.freenode.net/%23nikita"">#nikita on +irc.freenode.net</a>) and +<a href="https://lists.nuug.no/mailman/listinfo/nikita-noark">the +project mailing list</a>.</p> + +<p>When I got involved, the web service could store metadata about +documents. But a few weeks ago, a new milestone was reached when it +became possible to store full text documents too. Yesterday, I +completed an implementation of a command line tool +<tt>archive-pdf</tt> to upload a PDF file to the archive using this +API. The tool is very simple at the moment, and find existing +<a href="https://en.wikipedia.org/wiki/Fonds">fonds</a>, series and +files while asking the user to select which one to use if more than +one exist. Once a file is identified, the PDF is associated with the +file and uploaded, using the title extracted from the PDF itself. The +process is fairly similar to visiting the archive, opening a cabinet, +locating a file and storing a piece of paper in the archive. Here is +a test run directly after populating the database with test data using +our API tester:</p> + +<p><blockquote><pre> +~/src//noark5-tester$ ./archive-pdf mangelmelding/mangler.pdf +using arkiv: Title of the test fonds created 2017-03-18T23:49:32.103446 +using arkivdel: Title of the test series created 2017-03-18T23:49:32.103446 + + 0 - Title of the test case file created 2017-03-18T23:49:32.103446 + 1 - Title of the test file created 2017-03-18T23:49:32.103446 +Select which mappe you want (or search term): 0 +Uploading mangelmelding/mangler.pdf + PDF title: Mangler i spesifikasjonsdokumentet for NOARK 5 Tjenestegrensesnitt + File 2017/1: Title of the test case file created 2017-03-18T23:49:32.103446 +~/src//noark5-tester$ +</pre></blockquote></p> + +<p>You can see here how the fonds (arkiv) and serie (arkivdel) only had +one option, while the user need to choose which file (mappe) to use +among the two created by the API tester. The <tt>archive-pdf</tt> +tool can be found in the git repository for the API tester.</p> + +<p>In the project, I have been mostly working on +<a href="https://github.com/petterreinholdtsen/noark5-tester">the API +tester</a> so far, while getting to know the code base. The API +tester currently use +<a href="https://en.wikipedia.org/wiki/HATEOAS">the HATEOAS links</a> +to traverse the entire exposed service API and verify that the exposed +operations and objects match the specification, as well as trying to +create objects holding metadata and uploading a simple XML file to +store. The tester has proved very useful for finding flaws in our +implementation, as well as flaws in the reference site and the +specification.</p> + +<p>The test document I uploaded is a summary of all the specification +defects we have collected so far while implementing the web service. +There are several unclear and conflicting parts of the specification, +and we have +<a href="https://github.com/petterreinholdtsen/noark5-tester/tree/master/mangelmelding">started +writing down</a> the questions we get from implementing it. We use a +format inspired by how <a href="http://www.opengroup.org/austin/">The +Austin Group</a> collect defect reports for the POSIX standard with +<a href="http://www.opengroup.org/austin/mantis.html">their +instructions for the MANTIS defect tracker system</a>, in lack of an official way to structure defect reports for Noark 5 (our first submitted defect report was a <a href="https://github.com/petterreinholdtsen/noark5-tester/blob/master/mangelmelding/sendt/2017-03-15-mangel-prosess.md">request for a procedure for submitting defect reports</a> :). + +<p>The Nikita project is implemented using Java and Spring, and is +fairly easy to get up and running using Docker containers for those +that want to test the current code base. The API tester is +implemented in Python.</p> + + + Detecting NFS hangs on Linux without hanging yourself... http://people.skolelinux.org/pere/blog/Detecting_NFS_hangs_on_Linux_without_hanging_yourself___.html @@ -509,164 +605,5 @@ kanskje Datatilsynet bør gjøre det?</p> - - Where did that package go? &mdash; geolocated IP traceroute - http://people.skolelinux.org/pere/blog/Where_did_that_package_go___mdash__geolocated_IP_traceroute.html - http://people.skolelinux.org/pere/blog/Where_did_that_package_go___mdash__geolocated_IP_traceroute.html - Mon, 9 Jan 2017 12:20:00 +0100 - <p>Did you ever wonder where the web trafic really flow to reach the -web servers, and who own the network equipment it is flowing through? -It is possible to get a glimpse of this from using traceroute, but it -is hard to find all the details. Many years ago, I wrote a system to -map the Norwegian Internet (trying to figure out if our plans for a -network game service would get low enough latency, and who we needed -to talk to about setting up game servers close to the users. Back -then I used traceroute output from many locations (I asked my friends -to run a script and send me their traceroute output) to create the -graph and the map. The output from traceroute typically look like -this: - -<p><pre> -traceroute to www.stortinget.no (85.88.67.10), 30 hops max, 60 byte packets - 1 uio-gw10.uio.no (129.240.202.1) 0.447 ms 0.486 ms 0.621 ms - 2 uio-gw8.uio.no (129.240.24.229) 0.467 ms 0.578 ms 0.675 ms - 3 oslo-gw1.uninett.no (128.39.65.17) 0.385 ms 0.373 ms 0.358 ms - 4 te3-1-2.br1.fn3.as2116.net (193.156.90.3) 1.174 ms 1.172 ms 1.153 ms - 5 he16-1-1.cr1.san110.as2116.net (195.0.244.234) 2.627 ms he16-1-1.cr2.oslosda310.as2116.net (195.0.244.48) 3.172 ms he16-1-1.cr1.san110.as2116.net (195.0.244.234) 2.857 ms - 6 ae1.ar8.oslosda310.as2116.net (195.0.242.39) 0.662 ms 0.637 ms ae0.ar8.oslosda310.as2116.net (195.0.242.23) 0.622 ms - 7 89.191.10.146 (89.191.10.146) 0.931 ms 0.917 ms 0.955 ms - 8 * * * - 9 * * * -[...] -</pre></p> - -<p>This show the DNS names and IP addresses of (at least some of the) -network equipment involved in getting the data traffic from me to the -www.stortinget.no server, and how long it took in milliseconds for a -package to reach the equipment and return to me. Three packages are -sent, and some times the packages do not follow the same path. This -is shown for hop 5, where three different IP addresses replied to the -traceroute request.</p> - -<p>There are many ways to measure trace routes. Other good traceroute -implementations I use are traceroute (using ICMP packages) mtr (can do -both ICMP, UDP and TCP) and scapy (python library with ICMP, UDP, TCP -traceroute and a lot of other capabilities). All of them are easily -available in <a href="https://www.debian.org/">Debian</a>.</p> - -<p>This time around, I wanted to know the geographic location of -different route points, to visualize how visiting a web page spread -information about the visit to a lot of servers around the globe. The -background is that a web site today often will ask the browser to get -from many servers the parts (for example HTML, JSON, fonts, -JavaScript, CSS, video) required to display the content. This will -leak information about the visit to those controlling these servers -and anyone able to peek at the data traffic passing by (like your ISP, -the ISPs backbone provider, FRA, GCHQ, NSA and others).</p> - -<p>Lets pick an example, the Norwegian parliament web site -www.stortinget.no. It is read daily by all members of parliament and -their staff, as well as political journalists, activits and many other -citizens of Norway. A visit to the www.stortinget.no web site will -ask your browser to contact 8 other servers: ajax.googleapis.com, -insights.hotjar.com, script.hotjar.com, static.hotjar.com, -stats.g.doubleclick.net, www.google-analytics.com, -www.googletagmanager.com and www.netigate.se. I extracted this by -asking <a href="http://phantomjs.org/">PhantomJS</a> to visit the -Stortinget web page and tell me all the URLs PhantomJS downloaded to -render the page (in HAR format using -<a href="https://github.com/ariya/phantomjs/blob/master/examples/netsniff.js">their -netsniff example</a>. I am very grateful to Gorm for showing me how -to do this). My goal is to visualize network traces to all IP -addresses behind these DNS names, do show where visitors personal -information is spread when visiting the page.</p> - -<p align="center"><a href="www.stortinget.no-geoip.kml"><img -src="http://people.skolelinux.org/pere/blog/images/2017-01-09-www.stortinget.no-geoip-small.png" alt="map of combined traces for URLs used by www.stortinget.no using GeoIP"/></a></p> - -<p>When I had a look around for options, I could not find any good -free software tools to do this, and decided I needed my own traceroute -wrapper outputting KML based on locations looked up using GeoIP. KML -is easy to work with and easy to generate, and understood by several -of the GIS tools I have available. I got good help from by NUUG -colleague Anders Einar with this, and the result can be seen in -<a href="https://github.com/petterreinholdtsen/kmltraceroute">my -kmltraceroute git repository</a>. Unfortunately, the quality of the -free GeoIP databases I could find (and the for-pay databases my -friends had access to) is not up to the task. The IP addresses of -central Internet infrastructure would typically be placed near the -controlling companies main office, and not where the router is really -located, as you can see from <a href="www.stortinget.no-geoip.kml">the -KML file I created</a> using the GeoLite City dataset from MaxMind. - -<p align="center"><a href="http://people.skolelinux.org/pere/blog/images/2017-01-09-www.stortinget.no-scapy.svg"><img -src="http://people.skolelinux.org/pere/blog/images/2017-01-09-www.stortinget.no-scapy-small.png" alt="scapy traceroute graph for URLs used by www.stortinget.no"/></a></p> - -<p>I also had a look at the visual traceroute graph created by -<a href="http://www.secdev.org/projects/scapy/">the scrapy project</a>, -showing IP network ownership (aka AS owner) for the IP address in -question. -<a href="http://people.skolelinux.org/pere/blog/images/2017-01-09-www.stortinget.no-scapy.svg">The -graph display a lot of useful information about the traceroute in SVG -format</a>, and give a good indication on who control the network -equipment involved, but it do not include geolocation. This graph -make it possible to see the information is made available at least for -UNINETT, Catchcom, Stortinget, Nordunet, Google, Amazon, Telia, Level -3 Communications and NetDNA.</p> - -<p align="center"><a href="https://geotraceroute.com/index.php?node=4&host=www.stortinget.no"><img -src="http://people.skolelinux.org/pere/blog/images/2017-01-09-www.stortinget.no-geotraceroute-small.png" alt="example geotraceroute view for www.stortinget.no"/></a></p> - -<p>In the process, I came across the -<a href="https://geotraceroute.com/">web service GeoTraceroute</a> by -Salim Gasmi. Its methology of combining guesses based on DNS names, -various location databases and finally use latecy times to rule out -candidate locations seemed to do a very good job of guessing correct -geolocation. But it could only do one trace at the time, did not have -a sensor in Norway and did not make the geolocations easily available -for postprocessing. So I contacted the developer and asked if he -would be willing to share the code (he refused until he had time to -clean it up), but he was interested in providing the geolocations in a -machine readable format, and willing to set up a sensor in Norway. So -since yesterday, it is possible to run traces from Norway in this -service thanks to a sensor node set up by -<a href="https://www.nuug.no/">the NUUG assosiation</a>, and get the -trace in KML format for further processing.</p> - -<p align="center"><a href="http://people.skolelinux.org/pere/blog/images/2017-01-09-www.stortinget.no-geotraceroute-kml-join.kml"><img -src="http://people.skolelinux.org/pere/blog/images/2017-01-09-www.stortinget.no-geotraceroute-kml-join.png" alt="map of combined traces for URLs used by www.stortinget.no using geotraceroute"/></a></p> - -<p>Here we can see a lot of trafic passes Sweden on its way to -Denmark, Germany, Holland and Ireland. Plenty of places where the -Snowden confirmations verified the traffic is read by various actors -without your best interest as their top priority.</p> - -<p>Combining KML files is trivial using a text editor, so I could loop -over all the hosts behind the urls imported by www.stortinget.no and -ask for the KML file from GeoTraceroute, and create a combined KML -file with all the traces (unfortunately only one of the IP addresses -behind the DNS name is traced this time. To get them all, one would -have to request traces using IP number instead of DNS names from -GeoTraceroute). That might be the next step in this project.</p> - -<p>Armed with these tools, I find it a lot easier to figure out where -the IP traffic moves and who control the boxes involved in moving it. -And every time the link crosses for example the Swedish border, we can -be sure Swedish Signal Intelligence (FRA) is listening, as GCHQ do in -Britain and NSA in USA and cables around the globe. (Hm, what should -we tell them? :) Keep that in mind if you ever send anything -unencrypted over the Internet.</p> - -<p>PS: KML files are drawn using -<a href="http://ivanrublev.me/kml/">the KML viewer from Ivan -Rublev<a/>, as it was less cluttered than the local Linux application -Marble. There are heaps of other options too.</p> - -<p>As usual, if you use Bitcoin and want to show your support of my -activities, please send Bitcoin donations to my address -<b><a href="bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&label=PetterReinholdtsenBlog">15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b</a></b>.</p> - - -