- <div class="title"><a href="http://people.skolelinux.org/pere/blog/Free_software_archive_system_Nikita_now_able_to_store_documents.html">Free software archive system Nikita now able to store documents</a></div>
- <div class="date">19th March 2017</div>
- <div class="body"><p>The <a href="https://github.com/hiOA-ABI/nikita-noark5-core">Nikita
-Noark 5 core project</a> is implementing the Norwegian standard for
-keeping an electronic archive of government documents.
-<a href="http://www.arkivverket.no/arkivverket/Offentlig-forvaltning/Noark/Noark-5/English-version">The
-Noark 5 standard</a> documents the requirements for data systems used by
-the archives in the Norwegian government, and the Noark 5 web interface
-specification documents a REST web service for storing, searching and
-retrieving documents and metadata in such an archive. I've been involved
-in the project since a few weeks before Christmas, when the Norwegian
-Unix User Group
-<a href="https://www.nuug.no/news/NOARK5_kjerne_som_fri_programvare_f_r_epostliste_hos_NUUG.shtml">announced
-it supported the project</a>. I believe this is an important project,
-and hope it can make it possible for the government archives in the
-future to use free software to keep the archives we citizens depend
-on. But as I do not hold such an archive myself, my first use
-case is to store and analyse public mail journal metadata published
-by the government. I find it useful to have a clear use case in
-mind when developing, to make sure the system scratches one of my
-itches.</p>
-
-<p>If you would like to help make sure there are free software
-alternatives for the archives, please join our IRC channel
-(<a href="irc://irc.freenode.net/%23nikita">#nikita on
-irc.freenode.net</a>) and
-<a href="https://lists.nuug.no/mailman/listinfo/nikita-noark">the
-project mailing list</a>.</p>
-
-<p>When I got involved, the web service could store metadata about
-documents. But a few weeks ago, a new milestone was reached when it
-became possible to store full text documents too. Yesterday, I
-completed an implementation of a command line tool
-<tt>archive-pdf</tt> to upload a PDF file to the archive using this
-API. The tool is very simple at the moment, and finds existing
-<a href="https://en.wikipedia.org/wiki/Fonds">fonds</a>, series and
-files while asking the user to select which one to use if more than
-one exists. Once a file is identified, the PDF is associated with the
-file and uploaded, using the title extracted from the PDF itself. The
-process is fairly similar to visiting the archive, opening a cabinet,
-locating a file and storing a piece of paper in the archive. Here is
-a test run directly after populating the database with test data using
-our API tester:</p>
-
-<p><blockquote><pre>
-~/src//noark5-tester$ ./archive-pdf mangelmelding/mangler.pdf
-using arkiv: Title of the test fonds created 2017-03-18T23:49:32.103446
-using arkivdel: Title of the test series created 2017-03-18T23:49:32.103446
-
- 0 - Title of the test case file created 2017-03-18T23:49:32.103446
- 1 - Title of the test file created 2017-03-18T23:49:32.103446
-Select which mappe you want (or search term): 0
-Uploading mangelmelding/mangler.pdf
- PDF title: Mangler i spesifikasjonsdokumentet for NOARK 5 Tjenestegrensesnitt
- File 2017/1: Title of the test case file created 2017-03-18T23:49:32.103446
-~/src//noark5-tester$
-</pre></blockquote></p>
-
-<p>You can see here how the fonds (arkiv) and series (arkivdel) only had
-one option, while the user needs to choose which file (mappe) to use
-among the two created by the API tester. The <tt>archive-pdf</tt>
-tool can be found in the git repository for the API tester.</p>
-
-<p>In the project, I have been mostly working on
-<a href="https://github.com/petterreinholdtsen/noark5-tester">the API
-tester</a> so far, while getting to know the code base. The API
-tester currently uses
-<a href="https://en.wikipedia.org/wiki/HATEOAS">the HATEOAS links</a>
-to traverse the entire exposed service API and verify that the exposed
-operations and objects match the specification, as well as trying to
-create objects holding metadata and uploading a simple XML file to
-store. The tester has proved very useful for finding flaws in our
-implementation, as well as flaws in the reference site and the
-specification.</p>
-
-<p>The test document I uploaded is a summary of all the specification
-defects we have collected so far while implementing the web service.
-There are several unclear and conflicting parts of the specification,
-and we have
-<a href="https://github.com/petterreinholdtsen/noark5-tester/tree/master/mangelmelding">started
-writing down</a> the questions we get from implementing it. We use a
-format inspired by how <a href="http://www.opengroup.org/austin/">The
-Austin Group</a> collects defect reports for the POSIX standard with
-<a href="http://www.opengroup.org/austin/mantis.html">their
-instructions for the MANTIS defect tracker system</a>, lacking an official way to structure defect reports for Noark 5 (our first submitted defect report was a <a href="https://github.com/petterreinholdtsen/noark5-tester/blob/master/mangelmelding/sendt/2017-03-15-mangel-prosess.md">request for a procedure for submitting defect reports</a> :).</p>
-
-<p>The Nikita project is implemented using Java and Spring, and is
-fairly easy to get up and running using Docker containers for those
-that want to test the current code base. The API tester is
-implemented in Python.</p>
+ <div class="title"><a href="http://people.skolelinux.org/pere/blog/Locating_IMDB_IDs_of_movies_in_the_Internet_Archive_using_Wikidata.html">Locating IMDB IDs of movies in the Internet Archive using Wikidata</a></div>
+ <div class="date">25th October 2017</div>
+ <div class="body"><p>Recently, I needed to automatically check the copyright status of a
+set of <a href="http://www.imdb.com/">The Internet Movie Database
+(IMDB)</a> entries, to figure out which of the movies they refer
+to can be freely distributed on the Internet. This proved to be
+harder than it sounds. IMDB certainly lists movies without any
+copyright protection, where the copyright protection has expired or
+where the movie is licensed using a permissive license like one from
+Creative Commons. These are mixed with copyright protected movies,
+and there seems to be no way to separate these classes of movies using
+the information in IMDB.</p>
+
+<p>First I tried to look up entries manually in IMDB,
+<a href="https://www.wikipedia.org/">Wikipedia</a> and
+<a href="https://www.archive.org/">The Internet Archive</a>, to get a
+feel for how to do this. It is hard to know for sure using these sources,
+but it should be possible to be reasonably confident a movie is "out
+of copyright" with a few hours of work per movie. As I needed to check
+almost 20,000 entries, this approach was not sustainable. I simply
+cannot work around the clock for about 6 years to check this data
+set.</p>
+
+<p>I asked the people behind The Internet Archive if they could
+introduce a new metadata field in their metadata XML for IMDB ID, but
+was told that they leave it completely to the uploaders to update the
+metadata. Some of the metadata entries had IMDB links in the
+description, but I found no way to download all metadata files in bulk
+to locate those ones and put that approach aside.</p>
+
+<p>In the process I noticed several Wikipedia articles about movies
+had links to both IMDB and The Internet Archive, and it occurred to me
+that I could use the Wikipedia RDF data set to locate entries with
+both, to at least get a lower bound on the number of movies on The
+Internet Archive with an IMDB ID. This is useful based on the
+assumption that movies distributed by The Internet Archive can be
+legally distributed on the Internet. With some help from the RDF
+community (thank you DanC), I was able to come up with this query to
+pass to <a href="https://query.wikidata.org/">the SPARQL interface on
+Wikidata</a>:</p>
+
+<p><pre>
+SELECT ?work ?imdb ?ia ?when ?label
+WHERE
+{
+ ?work wdt:P31/wdt:P279* wd:Q11424.
+ ?work wdt:P345 ?imdb.
+ ?work wdt:P724 ?ia.
+ OPTIONAL {
+ ?work wdt:P577 ?when.
+ ?work rdfs:label ?label.
+ FILTER(LANG(?label) = "en").
+ }
+}
+</pre></p>
+
+<p>If I understand the query right, for every film entry anywhere in
+Wikipedia, it will return the IMDB ID and The Internet Archive ID, and
+when the movie was released and its English title, if both of the
+latter two are available (they share a single OPTIONAL block, so a row
+gets either both values or neither). At the moment the result set contains
+2338 entries. Of course, it depends on volunteers including both
+correct IMDB and The Internet Archive IDs in the Wikipedia articles
+for the movie. It should be noted that the result will include
+duplicates if the movie has entries in several languages. There are
+some bogus entries, either because The Internet Archive ID contains a
+typo or because the movie is not available from The Internet Archive.
+I did not verify the IMDB IDs, as I am unsure how to do that
+automatically.</p>
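+
+<p>As a rough illustration of how such a query can be sent to Wikidata
+from a script, here is a minimal Python sketch. It is not the script I
+used. It assumes the public endpoint
+<tt>https://query.wikidata.org/sparql</tt> with JSON output, and the
+function names and the User-Agent string are made up for the
+example.</p>

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service (assumed here).
ENDPOINT = "https://query.wikidata.org/sparql"

# The query from the text, unchanged.
QUERY = """
SELECT ?work ?imdb ?ia ?when ?label
WHERE
{
  ?work wdt:P31/wdt:P279* wd:Q11424.
  ?work wdt:P345 ?imdb.
  ?work wdt:P724 ?ia.
  OPTIONAL {
    ?work wdt:P577 ?when.
    ?work rdfs:label ?label.
    FILTER(LANG(?label) = "en").
  }
}
"""


def build_url(query, endpoint=ENDPOINT):
    """Return the GET URL asking the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return endpoint + "?" + params


def parse_bindings(result):
    """Flatten SPARQL JSON results into a list of plain dicts.

    Variables left unbound by OPTIONAL are simply absent from a row.
    """
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append({name: cell["value"] for name, cell in binding.items()})
    return rows


def fetch_rows(query=QUERY):
    """Run the query over the network and return the flattened rows."""
    request = urllib.request.Request(
        build_url(query),
        # Hypothetical client name, replace with your own.
        headers={"User-Agent": "imdb-ia-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

+<p>Calling <tt>fetch_rows()</tt> should give one dictionary per result
+row with the keys <tt>work</tt>, <tt>imdb</tt> and <tt>ia</tt>, plus
+<tt>when</tt> and <tt>label</tt> when the optional part matched.</p>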
+
+<p>I wrote a small Python script to extract the data set from Wikidata
+and check if the XML metadata for the movie is available from The
+Internet Archive, and after around 1.5 hours it produced a list of 2097
+free movies and their IMDB ID. In total, 171 entries in Wikidata lack
+the referenced Internet Archive entry. I assume the 70 "disappearing"
+entries (i.e. 2338-2097-171) are duplicate entries.</p>
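+
+<p>The check can be sketched roughly like this. It is a simplified
+stand-in and not the actual script. The metadata URL pattern
+<tt>https://archive.org/download/ID/ID_meta.xml</tt> is an assumption,
+and the function names are invented for the example. Rows are
+dictionaries with <tt>imdb</tt> and <tt>ia</tt> keys, as returned by the
+query.</p>

```python
import urllib.error
import urllib.parse
import urllib.request


def meta_url(ia_id):
    """Assumed location of the metadata XML for an Internet Archive item."""
    quoted = urllib.parse.quote(ia_id)
    return "https://archive.org/download/{0}/{0}_meta.xml".format(quoted)


def url_exists(url):
    """Return True if the URL can be fetched without an error."""
    try:
        with urllib.request.urlopen(url):
            return True
    except urllib.error.URLError:
        # HTTPError (404 and friends) is a subclass of URLError.
        return False


def classify(rows, exists=url_exists):
    """Split query rows into free movies and rows with a missing IA item.

    Returns (free, missing). free maps IMDB ID to IA ID, so repeated
    rows for the same movie collapse into one entry. missing lists every
    row whose metadata could not be fetched, repeats included.
    """
    free = {}
    missing = []
    cache = {}  # IA ID -> bool, to avoid fetching the same item twice
    for row in rows:
        ia_id = row["ia"]
        if ia_id not in cache:
            cache[ia_id] = exists(meta_url(ia_id))
        if cache[ia_id]:
            free[row["imdb"]] = ia_id
        else:
            missing.append(row)
    return free, missing
```

+<p>With this split, the rows that are neither in <tt>free</tt> nor in
+<tt>missing</tt> are the ones that collapsed as duplicates, which is
+how a number like 2338-2097-171 can be accounted for.</p>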
+
+<p>This is not too bad, given that The Internet Archive reports that it
+contains <a href="https://archive.org/details/feature_films">5331
+feature films</a> at the moment, but it also means more than 3000
+movies are missing on Wikipedia or are missing the pair of references
+on Wikipedia.</p>
+
+<p>I was curious about the distribution by release year, and made a
+little graph to show how the number of free movies is spread over the
+years:</p>
+
+<p><img src="http://people.skolelinux.org/pere/blog/images/2017-10-25-verk-i-det-fri-filmer.png"></p>
+
+<p>I expect the relative distribution of the remaining 3000 movies to
+be similar.</p>
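+
+<p>The numbers behind a graph like the one above can be computed with
+a few lines of Python. This is only a sketch, assuming rows with an
+<tt>imdb</tt> key and an optional <tt>when</tt> key holding an ISO 8601
+timestamp, and counting each IMDB ID once using its earliest known
+year. The function name is made up for the example.</p>

```python
from collections import Counter


def release_year_histogram(rows):
    """Count movies per release year.

    rows: dicts with an "imdb" key and optionally a "when" key holding
    an ISO 8601 timestamp such as "1944-01-01T00:00:00Z".
    Each IMDB ID is counted once, using its earliest known year.
    Rows without a release date are skipped.
    """
    earliest = {}
    for row in rows:
        when = row.get("when")
        if not when:
            continue
        year = int(when[:4])
        imdb = row["imdb"]
        if imdb not in earliest or year < earliest[imdb]:
            earliest[imdb] = year
    return Counter(earliest.values())


def print_histogram(histogram):
    """Print one line per year with a bar of '#' characters."""
    for year, count in sorted(histogram.items()):
        print(year, "#" * count)
```

+<p>Feeding the result to a plotting tool instead of
+<tt>print_histogram()</tt> would give a graph by release year.</p>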
+
+<p>If you want to help, and want to ensure Wikipedia can be used to
+cross reference The Internet Archive and The Internet Movie Database,
+please make sure entries like this are listed under the "External
+links" heading on the Wikipedia article for the movie:</p>
+
+<p><pre>
+* {{Internet Archive film|id=FightingLady}}
+* {{IMDb title|id=0036823|title=The Fighting Lady}}
+</pre></p>
+
+<p>Please verify the links on the final page, to make sure you did not
+introduce a typo.</p>
+
+<p>Here is the complete list, if you want to correct the 171
+identified Wikipedia entries with broken links to The Internet
+Archive: <a href="http://www.wikidata.org/entity/Q1140317">Q1140317</a>,
+<a href="http://www.wikidata.org/entity/Q458656">Q458656</a>,
+<a href="http://www.wikidata.org/entity/Q458656">Q458656</a>,
+<a href="http://www.wikidata.org/entity/Q470560">Q470560</a>,
+<a href="http://www.wikidata.org/entity/Q743340">Q743340</a>,
+<a href="http://www.wikidata.org/entity/Q822580">Q822580</a>,
+<a href="http://www.wikidata.org/entity/Q480696">Q480696</a>,
+<a href="http://www.wikidata.org/entity/Q128761">Q128761</a>,
+<a href="http://www.wikidata.org/entity/Q1307059">Q1307059</a>,
+<a href="http://www.wikidata.org/entity/Q1335091">Q1335091</a>,
+<a href="http://www.wikidata.org/entity/Q1537166">Q1537166</a>,
+<a href="http://www.wikidata.org/entity/Q1438334">Q1438334</a>,
+<a href="http://www.wikidata.org/entity/Q1479751">Q1479751</a>,
+<a href="http://www.wikidata.org/entity/Q1497200">Q1497200</a>,
+<a href="http://www.wikidata.org/entity/Q1498122">Q1498122</a>,
+<a href="http://www.wikidata.org/entity/Q865973">Q865973</a>,
+<a href="http://www.wikidata.org/entity/Q834269">Q834269</a>,
+<a href="http://www.wikidata.org/entity/Q841781">Q841781</a>,
+<a href="http://www.wikidata.org/entity/Q841781">Q841781</a>,
+<a href="http://www.wikidata.org/entity/Q1548193">Q1548193</a>,
+<a href="http://www.wikidata.org/entity/Q499031">Q499031</a>,
+<a href="http://www.wikidata.org/entity/Q1564769">Q1564769</a>,
+<a href="http://www.wikidata.org/entity/Q1585239">Q1585239</a>,
+<a href="http://www.wikidata.org/entity/Q1585569">Q1585569</a>,
+<a href="http://www.wikidata.org/entity/Q1624236">Q1624236</a>,
+<a href="http://www.wikidata.org/entity/Q4796595">Q4796595</a>,
+<a href="http://www.wikidata.org/entity/Q4853469">Q4853469</a>,
+<a href="http://www.wikidata.org/entity/Q4873046">Q4873046</a>,
+<a href="http://www.wikidata.org/entity/Q915016">Q915016</a>,
+<a href="http://www.wikidata.org/entity/Q4660396">Q4660396</a>,
+<a href="http://www.wikidata.org/entity/Q4677708">Q4677708</a>,
+<a href="http://www.wikidata.org/entity/Q4738449">Q4738449</a>,
+<a href="http://www.wikidata.org/entity/Q4756096">Q4756096</a>,
+<a href="http://www.wikidata.org/entity/Q4766785">Q4766785</a>,
+<a href="http://www.wikidata.org/entity/Q880357">Q880357</a>,
+<a href="http://www.wikidata.org/entity/Q882066">Q882066</a>,
+<a href="http://www.wikidata.org/entity/Q882066">Q882066</a>,
+<a href="http://www.wikidata.org/entity/Q204191">Q204191</a>,
+<a href="http://www.wikidata.org/entity/Q204191">Q204191</a>,
+<a href="http://www.wikidata.org/entity/Q1194170">Q1194170</a>,
+<a href="http://www.wikidata.org/entity/Q940014">Q940014</a>,
+<a href="http://www.wikidata.org/entity/Q946863">Q946863</a>,
+<a href="http://www.wikidata.org/entity/Q172837">Q172837</a>,
+<a href="http://www.wikidata.org/entity/Q573077">Q573077</a>,
+<a href="http://www.wikidata.org/entity/Q1219005">Q1219005</a>,
+<a href="http://www.wikidata.org/entity/Q1219599">Q1219599</a>,
+<a href="http://www.wikidata.org/entity/Q1643798">Q1643798</a>,
+<a href="http://www.wikidata.org/entity/Q1656352">Q1656352</a>,
+<a href="http://www.wikidata.org/entity/Q1659549">Q1659549</a>,
+<a href="http://www.wikidata.org/entity/Q1660007">Q1660007</a>,
+<a href="http://www.wikidata.org/entity/Q1698154">Q1698154</a>,
+<a href="http://www.wikidata.org/entity/Q1737980">Q1737980</a>,
+<a href="http://www.wikidata.org/entity/Q1877284">Q1877284</a>,
+<a href="http://www.wikidata.org/entity/Q1199354">Q1199354</a>,
+<a href="http://www.wikidata.org/entity/Q1199354">Q1199354</a>,
+<a href="http://www.wikidata.org/entity/Q1199451">Q1199451</a>,
+<a href="http://www.wikidata.org/entity/Q1211871">Q1211871</a>,
+<a href="http://www.wikidata.org/entity/Q1212179">Q1212179</a>,
+<a href="http://www.wikidata.org/entity/Q1238382">Q1238382</a>,
+<a href="http://www.wikidata.org/entity/Q4906454">Q4906454</a>,
+<a href="http://www.wikidata.org/entity/Q320219">Q320219</a>,
+<a href="http://www.wikidata.org/entity/Q1148649">Q1148649</a>,
+<a href="http://www.wikidata.org/entity/Q645094">Q645094</a>,
+<a href="http://www.wikidata.org/entity/Q5050350">Q5050350</a>,
+<a href="http://www.wikidata.org/entity/Q5166548">Q5166548</a>,
+<a href="http://www.wikidata.org/entity/Q2677926">Q2677926</a>,
+<a href="http://www.wikidata.org/entity/Q2698139">Q2698139</a>,
+<a href="http://www.wikidata.org/entity/Q2707305">Q2707305</a>,
+<a href="http://www.wikidata.org/entity/Q2740725">Q2740725</a>,
+<a href="http://www.wikidata.org/entity/Q2024780">Q2024780</a>,
+<a href="http://www.wikidata.org/entity/Q2117418">Q2117418</a>,
+<a href="http://www.wikidata.org/entity/Q2138984">Q2138984</a>,
+<a href="http://www.wikidata.org/entity/Q1127992">Q1127992</a>,
+<a href="http://www.wikidata.org/entity/Q1058087">Q1058087</a>,
+<a href="http://www.wikidata.org/entity/Q1070484">Q1070484</a>,
+<a href="http://www.wikidata.org/entity/Q1080080">Q1080080</a>,
+<a href="http://www.wikidata.org/entity/Q1090813">Q1090813</a>,
+<a href="http://www.wikidata.org/entity/Q1251918">Q1251918</a>,
+<a href="http://www.wikidata.org/entity/Q1254110">Q1254110</a>,
+<a href="http://www.wikidata.org/entity/Q1257070">Q1257070</a>,
+<a href="http://www.wikidata.org/entity/Q1257079">Q1257079</a>,
+<a href="http://www.wikidata.org/entity/Q1197410">Q1197410</a>,
+<a href="http://www.wikidata.org/entity/Q1198423">Q1198423</a>,
+<a href="http://www.wikidata.org/entity/Q706951">Q706951</a>,
+<a href="http://www.wikidata.org/entity/Q723239">Q723239</a>,
+<a href="http://www.wikidata.org/entity/Q2079261">Q2079261</a>,
+<a href="http://www.wikidata.org/entity/Q1171364">Q1171364</a>,
+<a href="http://www.wikidata.org/entity/Q617858">Q617858</a>,
+<a href="http://www.wikidata.org/entity/Q5166611">Q5166611</a>,
+<a href="http://www.wikidata.org/entity/Q5166611">Q5166611</a>,
+<a href="http://www.wikidata.org/entity/Q324513">Q324513</a>,
+<a href="http://www.wikidata.org/entity/Q374172">Q374172</a>,
+<a href="http://www.wikidata.org/entity/Q7533269">Q7533269</a>,
+<a href="http://www.wikidata.org/entity/Q970386">Q970386</a>,
+<a href="http://www.wikidata.org/entity/Q976849">Q976849</a>,
+<a href="http://www.wikidata.org/entity/Q7458614">Q7458614</a>,
+<a href="http://www.wikidata.org/entity/Q5347416">Q5347416</a>,
+<a href="http://www.wikidata.org/entity/Q5460005">Q5460005</a>,
+<a href="http://www.wikidata.org/entity/Q5463392">Q5463392</a>,
+<a href="http://www.wikidata.org/entity/Q3038555">Q3038555</a>,
+<a href="http://www.wikidata.org/entity/Q5288458">Q5288458</a>,
+<a href="http://www.wikidata.org/entity/Q2346516">Q2346516</a>,
+<a href="http://www.wikidata.org/entity/Q5183645">Q5183645</a>,
+<a href="http://www.wikidata.org/entity/Q5185497">Q5185497</a>,
+<a href="http://www.wikidata.org/entity/Q5216127">Q5216127</a>,
+<a href="http://www.wikidata.org/entity/Q5223127">Q5223127</a>,
+<a href="http://www.wikidata.org/entity/Q5261159">Q5261159</a>,
+<a href="http://www.wikidata.org/entity/Q1300759">Q1300759</a>,
+<a href="http://www.wikidata.org/entity/Q5521241">Q5521241</a>,
+<a href="http://www.wikidata.org/entity/Q7733434">Q7733434</a>,
+<a href="http://www.wikidata.org/entity/Q7736264">Q7736264</a>,
+<a href="http://www.wikidata.org/entity/Q7737032">Q7737032</a>,
+<a href="http://www.wikidata.org/entity/Q7882671">Q7882671</a>,
+<a href="http://www.wikidata.org/entity/Q7719427">Q7719427</a>,
+<a href="http://www.wikidata.org/entity/Q7719444">Q7719444</a>,
+<a href="http://www.wikidata.org/entity/Q7722575">Q7722575</a>,
+<a href="http://www.wikidata.org/entity/Q2629763">Q2629763</a>,
+<a href="http://www.wikidata.org/entity/Q2640346">Q2640346</a>,
+<a href="http://www.wikidata.org/entity/Q2649671">Q2649671</a>,
+<a href="http://www.wikidata.org/entity/Q7703851">Q7703851</a>,
+<a href="http://www.wikidata.org/entity/Q7747041">Q7747041</a>,
+<a href="http://www.wikidata.org/entity/Q6544949">Q6544949</a>,
+<a href="http://www.wikidata.org/entity/Q6672759">Q6672759</a>,
+<a href="http://www.wikidata.org/entity/Q2445896">Q2445896</a>,
+<a href="http://www.wikidata.org/entity/Q12124891">Q12124891</a>,
+<a href="http://www.wikidata.org/entity/Q3127044">Q3127044</a>,
+<a href="http://www.wikidata.org/entity/Q2511262">Q2511262</a>,
+<a href="http://www.wikidata.org/entity/Q2517672">Q2517672</a>,
+<a href="http://www.wikidata.org/entity/Q2543165">Q2543165</a>,
+<a href="http://www.wikidata.org/entity/Q426628">Q426628</a>,
+<a href="http://www.wikidata.org/entity/Q426628">Q426628</a>,
+<a href="http://www.wikidata.org/entity/Q12126890">Q12126890</a>,
+<a href="http://www.wikidata.org/entity/Q13359969">Q13359969</a>,
+<a href="http://www.wikidata.org/entity/Q13359969">Q13359969</a>,
+<a href="http://www.wikidata.org/entity/Q2294295">Q2294295</a>,
+<a href="http://www.wikidata.org/entity/Q2294295">Q2294295</a>,
+<a href="http://www.wikidata.org/entity/Q2559509">Q2559509</a>,
+<a href="http://www.wikidata.org/entity/Q2559912">Q2559912</a>,
+<a href="http://www.wikidata.org/entity/Q7760469">Q7760469</a>,
+<a href="http://www.wikidata.org/entity/Q6703974">Q6703974</a>,
+<a href="http://www.wikidata.org/entity/Q4744">Q4744</a>,
+<a href="http://www.wikidata.org/entity/Q7766962">Q7766962</a>,
+<a href="http://www.wikidata.org/entity/Q7768516">Q7768516</a>,
+<a href="http://www.wikidata.org/entity/Q7769205">Q7769205</a>,
+<a href="http://www.wikidata.org/entity/Q7769988">Q7769988</a>,
+<a href="http://www.wikidata.org/entity/Q2946945">Q2946945</a>,
+<a href="http://www.wikidata.org/entity/Q3212086">Q3212086</a>,
+<a href="http://www.wikidata.org/entity/Q3212086">Q3212086</a>,
+<a href="http://www.wikidata.org/entity/Q18218448">Q18218448</a>,
+<a href="http://www.wikidata.org/entity/Q18218448">Q18218448</a>,
+<a href="http://www.wikidata.org/entity/Q18218448">Q18218448</a>,
+<a href="http://www.wikidata.org/entity/Q6909175">Q6909175</a>,
+<a href="http://www.wikidata.org/entity/Q7405709">Q7405709</a>,
+<a href="http://www.wikidata.org/entity/Q7416149">Q7416149</a>,
+<a href="http://www.wikidata.org/entity/Q7239952">Q7239952</a>,
+<a href="http://www.wikidata.org/entity/Q7317332">Q7317332</a>,
+<a href="http://www.wikidata.org/entity/Q7783674">Q7783674</a>,
+<a href="http://www.wikidata.org/entity/Q7783704">Q7783704</a>,
+<a href="http://www.wikidata.org/entity/Q7857590">Q7857590</a>,
+<a href="http://www.wikidata.org/entity/Q3372526">Q3372526</a>,
+<a href="http://www.wikidata.org/entity/Q3372642">Q3372642</a>,
+<a href="http://www.wikidata.org/entity/Q3372816">Q3372816</a>,
+<a href="http://www.wikidata.org/entity/Q3372909">Q3372909</a>,
+<a href="http://www.wikidata.org/entity/Q7959649">Q7959649</a>,
+<a href="http://www.wikidata.org/entity/Q7977485">Q7977485</a>,
+<a href="http://www.wikidata.org/entity/Q7992684">Q7992684</a>,
+<a href="http://www.wikidata.org/entity/Q3817966">Q3817966</a>,
+<a href="http://www.wikidata.org/entity/Q3821852">Q3821852</a>,
+<a href="http://www.wikidata.org/entity/Q3420907">Q3420907</a>,
+<a href="http://www.wikidata.org/entity/Q3429733">Q3429733</a>,
+<a href="http://www.wikidata.org/entity/Q774474">Q774474</a></p>