From: Petter Reinholdtsen Date: Wed, 25 Oct 2017 10:36:27 +0000 (+0200) Subject: Generated. X-Git-Url: http://pere.pagekite.me/gitweb/homepage.git/commitdiff_plain/d7f60d7c8b4286cba1ca32e0c07141785191b00e?ds=sidebyside Generated. --- diff --git a/blog/Locating_IMDB_IDs_of_movies_in_the_Internet_Archive_using_Wikidata.html b/blog/Locating_IMDB_IDs_of_movies_in_the_Internet_Archive_using_Wikidata.html new file mode 100644 index 0000000000..652cdca1d5 --- /dev/null +++ b/blog/Locating_IMDB_IDs_of_movies_in_the_Internet_Archive_using_Wikidata.html @@ -0,0 +1,715 @@ + + + + + Petter Reinholdtsen: Locating IMDB IDs of movies in the Internet Archive using Wikidata + + + + + + +
+

+ Petter Reinholdtsen + +

+ +
+ + +
+
Locating IMDB IDs of movies in the Internet Archive using Wikidata
+
25th October 2017
+

Recently, I needed to automatically check the copyright status of a +set of The Internet Movie database +(IMDB) entries, to figure out which one of the movies they refer +to can be freely distributed on the Internet. This proved to be +harder than it sounds. IMDB for sure list movies without any +copyright protection, where the copyright protection has expired or +where the movie is lisenced using a permissive license like one from +Creative Commons. These are mixed with copyright protected movies, +and there seem to be no way to separate these classes of movies using +the information in IMDB.

+ +

First I tried to look up entries manually in IMDB, +Wikipedia and +The Internet Archive, to get a +feel how to do this. It is hard to know for sure using these sources, +but it should be possible to be reasonable confident a movie is "out +of copyright" with a few hours work per movie. As I needed to check +almost 20,000 entries, this approach was not sustainable. I simply +can not work around the clock for about 6 years to check this data +set.

+ +

I asked the people behind The Internet Archive if they could +introduce a new metadata field in their metadata XML for IMDB ID, but +was told that they leave it completely to the uploaders to update the +metadata. Some of the metadata entries had IMDB links in the +description, but I found no way to download all metadata files in bulk +to locate those ones and put that approach aside.

+ +

In the process I noticed several Wikipedia articles about movies +had links to both IMDB and The Internet Archive, and it occured to me +that I could use the Wikipedia RDF data set to locate entries with +both, to at least get a lower bound on the number of movies on The +Internet Archive with a IMDB ID. This is useful based on the +assumption that movies distributed by The Internet Archive can be +legally distributed on the Internet. With some help from the RDF +community (thank you DanC), I was able to come up with this query to +pass to the SPARQL interface on +Wikidata: + +

+SELECT ?work ?imdb ?ia ?when ?label
+WHERE
+{
+  ?work wdt:P31/wdt:P279* wd:Q11424.
+  ?work wdt:P345 ?imdb.
+  ?work wdt:P724 ?ia.
+  OPTIONAL {
+        ?work wdt:P577 ?when.
+        ?work rdfs:label ?label.
+        FILTER(LANG(?label) = "en").
+  }
+}
+

+ +

If I understand the query right, for every film entry anywhere in +Wikpedia, it will return the IMDB ID and The Internet Archive ID, and +when the movie was released and its English title, if either or both +of the latter two are available. At the moment the result set contain +2338 entries. Of course, it depend on volunteers including both +correct IMDB and The Internet Archive IDs in the wikipedia articles +for the movie. It should be noted that the result will include +duplicates if the movie have entries in several languages. There are +some bogus entries, either because The Internet Archive ID contain a +typo or because the movie is not available from The Internet Archive. +I did not verify the IMDB IDs, as I am unsure how to do that +automatically.

+ +

I wrote a small python script to extract the data set from Wikidata +and check if the XML metadata for the movie is available from The +Internet Archive, and after around 1.5 hour it produced a list of 2097 +free movies and their IMDB ID. In total, 171 entries in Wikidata lack +the refered Internet Archive entry. I assume the 60 "disappearing" +entries (ie 2338-2097-171) are duplicate entries.

+ +

This is not too bad, given that The Internet Archive report to +contain 5331 +feature films at the moment, but it also mean more than 3000 +movies are missing on Wikipedia or are missing the pair of references +on Wikipedia.

+ +

I was curious about the distribution by release year, and made a +little graph to show how the amount of free movies is spread over the +years:

+ +

+ +

I expect the relative distribution of the remaining 3000 movies to +be similar.

+ +

If you want to help, and want to ensure Wikipedia can be used to +cross reference The Internet Archive and The Internet Movie Database, +please make sure entries like this are listed under the "External +links" heading on the Wikipedia article for the movie:

+ +

+* {{Internet Archive film|id=FightingLady}}
+* {{IMDb title|id=0036823|title=The Fighting Lady}}
+

+ +

Please verify the links on the final page, to make sure you did not +introduce a typo.

+ +

Here is the complete list, if you want to correct the 171 +identified Wikipedia entries with broken links to The Internet +Archive: Q1140317, +Q458656, +Q458656, +Q470560, +Q743340, +Q822580, +Q480696, +Q128761, +Q1307059, +Q1335091, +Q1537166, +Q1438334, +Q1479751, +Q1497200, +Q1498122, +Q865973, +Q834269, +Q841781, +Q841781, +Q1548193, +Q499031, +Q1564769, +Q1585239, +Q1585569, +Q1624236, +Q4796595, +Q4853469, +Q4873046, +Q915016, +Q4660396, +Q4677708, +Q4738449, +Q4756096, +Q4766785, +Q880357, +Q882066, +Q882066, +Q204191, +Q204191, +Q1194170, +Q940014, +Q946863, +Q172837, +Q573077, +Q1219005, +Q1219599, +Q1643798, +Q1656352, +Q1659549, +Q1660007, +Q1698154, +Q1737980, +Q1877284, +Q1199354, +Q1199354, +Q1199451, +Q1211871, +Q1212179, +Q1238382, +Q4906454, +Q320219, +Q1148649, +Q645094, +Q5050350, +Q5166548, +Q2677926, +Q2698139, +Q2707305, +Q2740725, +Q2024780, +Q2117418, +Q2138984, +Q1127992, +Q1058087, +Q1070484, +Q1080080, +Q1090813, +Q1251918, +Q1254110, +Q1257070, +Q1257079, +Q1197410, +Q1198423, +Q706951, +Q723239, +Q2079261, +Q1171364, +Q617858, +Q5166611, +Q5166611, +Q324513, +Q374172, +Q7533269, +Q970386, +Q976849, +Q7458614, +Q5347416, +Q5460005, +Q5463392, +Q3038555, +Q5288458, +Q2346516, +Q5183645, +Q5185497, +Q5216127, +Q5223127, +Q5261159, +Q1300759, +Q5521241, +Q7733434, +Q7736264, +Q7737032, +Q7882671, +Q7719427, +Q7719444, +Q7722575, +Q2629763, +Q2640346, +Q2649671, +Q7703851, +Q7747041, +Q6544949, +Q6672759, +Q2445896, +Q12124891, +Q3127044, +Q2511262, +Q2517672, +Q2543165, +Q426628, +Q426628, +Q12126890, +Q13359969, +Q13359969, +Q2294295, +Q2294295, +Q2559509, +Q2559912, +Q7760469, +Q6703974, +Q4744, +Q7766962, +Q7768516, +Q7769205, +Q7769988, +Q2946945, +Q3212086, +Q3212086, +Q18218448, +Q18218448, +Q18218448, +Q6909175, +Q7405709, +Q7416149, +Q7239952, +Q7317332, +Q7783674, +Q7783704, +Q7857590, +Q3372526, +Q3372642, +Q3372816, +Q3372909, +Q7959649, +Q7977485, +Q7992684, +Q3817966, +Q3821852, +Q3420907, +Q3429733, +Q774474

+
+ + + + +
+ + + + + +

+ Created by Chronicle v4.6 +

+ + +