From c5224e0563757ef0eda3b6290d206eb3227dcd4b Mon Sep 17 00:00:00 2001 From: Petter Reinholdtsen Date: Wed, 25 Oct 2017 12:19:04 +0200 Subject: [PATCH 1/1] New post about movies. --- .../data/2017-10-25-verk-i-det-fri-filmer.txt | 279 ++++++++++++++++++ .../2017-10-25-verk-i-det-fri-filmer.png | Bin 0 -> 8989 bytes 2 files changed, 279 insertions(+) create mode 100644 blog/data/2017-10-25-verk-i-det-fri-filmer.txt create mode 100644 blog/images/2017-10-25-verk-i-det-fri-filmer.png diff --git a/blog/data/2017-10-25-verk-i-det-fri-filmer.txt b/blog/data/2017-10-25-verk-i-det-fri-filmer.txt new file mode 100644 index 0000000000..a991845c90 --- /dev/null +++ b/blog/data/2017-10-25-verk-i-det-fri-filmer.txt @@ -0,0 +1,279 @@ +Title: Locating IMDB IDs of movies in the Internet Archive using Wikidata +Tags: english, opphavsrett +Date: 2017-10-25 12:20 + +

Recently, I needed to automatically check the copyright status of a +set of The Internet Movie Database +(IMDB) entries, to figure out which of the movies they refer +to can be freely distributed on the Internet. This proved to be +harder than it sounds. IMDB certainly lists movies without any +copyright protection, where the copyright protection has expired or +where the movie is licensed using a permissive license like one from +Creative Commons. These are mixed with copyright protected movies, +and there seems to be no way to separate these classes of movies using +the information in IMDB.

+ +

First I tried to look up entries manually in IMDB, +Wikipedia and +The Internet Archive, to get a +feel for how to do this. It is hard to know for sure using these sources, +but it should be possible to be reasonably confident a movie is "out +of copyright" with a few hours of work per movie. As I needed to check +almost 20,000 entries, this approach was not sustainable. I simply +cannot work around the clock for about 6 years to check this data +set.

+ +

I asked the people behind The Internet Archive if they could +introduce a new metadata field in their metadata XML for IMDB ID, but +was told that they leave it completely to the uploaders to update the +metadata. Some of the metadata entries had IMDB links in the +description, but I found no way to download all metadata files in bulk +to locate those, so I put that approach aside.

+ +

In the process I noticed several Wikipedia articles about movies +had links to both IMDB and The Internet Archive, and it occurred to me +that I could use the Wikipedia RDF data set to locate entries with +both, to at least get a lower bound on the number of movies on The +Internet Archive with an IMDB ID. This is useful based on the +assumption that movies distributed by The Internet Archive can be +legally distributed on the Internet. With some help from the RDF +community (thank you DanC), I was able to come up with this query to +pass to the SPARQL interface on +Wikidata: + +

+SELECT ?work ?imdb ?ia ?when ?label
+WHERE
+{
+  ?work wdt:P31/wdt:P279* wd:Q11424.
+  ?work wdt:P345 ?imdb.
+  ?work wdt:P724 ?ia.
+  OPTIONAL {
+        ?work wdt:P577 ?when.
+        ?work rdfs:label ?label.
+        FILTER(LANG(?label) = "en").
+  }
+}
+

+ +

If I understand the query right, for every film entry anywhere in +Wikipedia, it will return the IMDB ID and The Internet Archive ID, and +when the movie was released and its English title, if both +of the latter two are available (they share one OPTIONAL block). At the moment the result set contains +2338 entries. Of course, it depends on volunteers including both +correct IMDB and The Internet Archive IDs in the Wikipedia articles +for the movie. It should be noted that the result will include +duplicates if the movie has entries in several languages. There are +some bogus entries, either because The Internet Archive ID contains a +typo or because the movie is not available from The Internet Archive. +I did not verify the IMDB IDs, as I am unsure how to do that +automatically.
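As a rough illustration of how such a query can be submitted and its result flattened, here is a minimal Python sketch. It assumes the public Wikidata endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON result layout. The function names `build_url`, `parse_bindings` and `fetch_rows` and the User-Agent string are made up for this example and are not taken from the script mentioned in this post.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?work ?imdb ?ia ?when ?label
WHERE
{
  ?work wdt:P31/wdt:P279* wd:Q11424.
  ?work wdt:P345 ?imdb.
  ?work wdt:P724 ?ia.
  OPTIONAL {
        ?work wdt:P577 ?when.
        ?work rdfs:label ?label.
        FILTER(LANG(?label) = "en").
  }
}
"""


def build_url(query, endpoint=ENDPOINT):
    """Return a GET URL asking the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return endpoint + "?" + params


def parse_bindings(doc):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts.

    Variables from the OPTIONAL block are simply absent from a row
    when they were not bound.
    """
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


def fetch_rows(query=QUERY):
    """Run the query over the network and return the flattened rows."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "imdb-ia-example/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

Calling `fetch_rows()` would give one dict per result row, with the keys `work`, `imdb` and `ia` always present and `when` and `label` present only when the optional part matched.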

+ +

I wrote a small Python script to extract the data set from Wikidata +and check if the XML metadata for the movie is available from The +Internet Archive, and after around 1.5 hours it produced a list of 2097 +free movies and their IMDB ID. In total, 171 entries in Wikidata lack +the referenced Internet Archive entry. I assume the 70 "disappearing" +entries (i.e. 2338-2097-171) are duplicate entries.
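The bookkeeping behind these numbers can be sketched as follows. This is one plausible reading, not the original script. It swaps the XML metadata check for The Internet Archive's JSON metadata endpoint (https://archive.org/metadata/IDENTIFIER), which as far as I know returns an empty object for unknown identifiers. The names `metadata_url`, `item_exists` and `classify` are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def metadata_url(ia_id):
    """URL of the JSON metadata record for an Internet Archive item."""
    return "https://archive.org/metadata/" + urllib.parse.quote(ia_id)


def item_exists(ia_id):
    """Assumption: an unknown identifier yields an empty JSON object."""
    with urllib.request.urlopen(metadata_url(ia_id)) as response:
        return bool(json.load(response))


def classify(rows, exists=item_exists):
    """Split query rows into free movies and broken references.

    free maps IMDB ID to Internet Archive ID, so repeated rows for the
    same movie collapse into one entry. broken keeps every Wikidata
    entry whose Internet Archive item was not found, repeats included.
    Whatever is left over is counted as duplicates.
    """
    free = {}
    broken = []
    for row in rows:
        if exists(row["ia"]):
            free[row["imdb"]] = row["ia"]
        else:
            broken.append(row["work"])
    duplicates = len(rows) - len(free) - len(broken)
    return free, broken, duplicates
```

Passing a different `exists` function makes it possible to try the logic on a handful of rows without touching the network.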

+ +

This is not too bad, given that The Internet Archive reports that it +contains 5331 +feature films at the moment, but it also means more than 3000 +movies are missing from Wikipedia or are missing the pair of references +on Wikipedia.

+ +

I was curious about the distribution by release year, and made a +little graph to show how the number of free movies is spread over the +years:

+ +

+ +

I expect the relative distribution of the remaining 3000 movies to +be similar.
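The per-year counts behind such a graph could be computed along these lines. This is a small sketch assuming each row is a dict with an `imdb` key and an optional `when` key holding an ISO timestamp such as `1945-01-17T00:00:00Z`. The function name `movies_per_year` and the choice to keep the earliest release year per IMDB ID are my own, not taken from the original script.

```python
from collections import Counter


def movies_per_year(rows):
    """Count movies per release year, one vote per IMDB ID.

    Rows without a release date are skipped. When a movie has several
    release dates, the earliest year is used.
    """
    earliest = {}
    for row in rows:
        if "when" not in row:
            continue
        year = int(row["when"][:4])  # leading four digits of the timestamp
        imdb = row["imdb"]
        if imdb not in earliest or year < earliest[imdb]:
            earliest[imdb] = year
    return Counter(earliest.values())
```

The resulting `Counter` can be handed to any plotting tool, or printed sorted by year for a quick look.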

+ +

If you want to help, and want to ensure Wikipedia can be used to +cross-reference The Internet Archive and The Internet Movie Database, +please make sure entries like this are listed under the "External +links" heading on the Wikipedia article for the movie:

+ +

+* {{Internet Archive film|id=FightingLady}}
+* {{IMDb title|id=0036823|title=The Fighting Lady}}
+

+ +

Please verify the links on the final page, to make sure you did not +introduce a typo.
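To reduce the risk of typos when adding many such links, the two template lines can be generated from the IDs. This is a small sketch. The helper name `external_links` is made up, and it assumes the IMDB ID may arrive with the `tt` prefix that the IMDb title template does not use in the example above.

```python
def external_links(ia_id, imdb_id, title):
    """Return the two wiki markup lines for the "External links" section."""
    # Drop a leading "tt" so the ID matches the form used in the template.
    numeric = imdb_id[2:] if imdb_id.startswith("tt") else imdb_id
    return [
        "* {{Internet Archive film|id=%s}}" % ia_id,
        "* {{IMDb title|id=%s|title=%s}}" % (numeric, title),
    ]


for line in external_links("FightingLady", "tt0036823", "The Fighting Lady"):
    print(line)
```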

+ +

Here is the complete list, if you want to correct the 171 +identified Wikipedia entries with broken links to The Internet +Archive: Q1140317, +Q458656, +Q458656, +Q470560, +Q743340, +Q822580, +Q480696, +Q128761, +Q1307059, +Q1335091, +Q1537166, +Q1438334, +Q1479751, +Q1497200, +Q1498122, +Q865973, +Q834269, +Q841781, +Q841781, +Q1548193, +Q499031, +Q1564769, +Q1585239, +Q1585569, +Q1624236, +Q4796595, +Q4853469, +Q4873046, +Q915016, +Q4660396, +Q4677708, +Q4738449, +Q4756096, +Q4766785, +Q880357, +Q882066, +Q882066, +Q204191, +Q204191, +Q1194170, +Q940014, +Q946863, +Q172837, +Q573077, +Q1219005, +Q1219599, +Q1643798, +Q1656352, +Q1659549, +Q1660007, +Q1698154, +Q1737980, +Q1877284, +Q1199354, +Q1199354, +Q1199451, +Q1211871, +Q1212179, +Q1238382, +Q4906454, +Q320219, +Q1148649, +Q645094, +Q5050350, +Q5166548, +Q2677926, +Q2698139, +Q2707305, +Q2740725, +Q2024780, +Q2117418, +Q2138984, +Q1127992, +Q1058087, +Q1070484, +Q1080080, +Q1090813, +Q1251918, +Q1254110, +Q1257070, +Q1257079, +Q1197410, +Q1198423, +Q706951, +Q723239, +Q2079261, +Q1171364, +Q617858, +Q5166611, +Q5166611, +Q324513, +Q374172, +Q7533269, +Q970386, +Q976849, +Q7458614, +Q5347416, +Q5460005, +Q5463392, +Q3038555, +Q5288458, +Q2346516, +Q5183645, +Q5185497, +Q5216127, +Q5223127, +Q5261159, +Q1300759, +Q5521241, +Q7733434, +Q7736264, +Q7737032, +Q7882671, +Q7719427, +Q7719444, +Q7722575, +Q2629763, +Q2640346, +Q2649671, +Q7703851, +Q7747041, +Q6544949, +Q6672759, +Q2445896, +Q12124891, +Q3127044, +Q2511262, +Q2517672, +Q2543165, +Q426628, +Q426628, +Q12126890, +Q13359969, +Q13359969, +Q2294295, +Q2294295, +Q2559509, +Q2559912, +Q7760469, +Q6703974, +Q4744, +Q7766962, +Q7768516, +Q7769205, +Q7769988, +Q2946945, +Q3212086, +Q3212086, +Q18218448, +Q18218448, +Q18218448, +Q6909175, +Q7405709, +Q7416149, +Q7239952, +Q7317332, +Q7783674, +Q7783704, +Q7857590, +Q3372526, +Q3372642, +Q3372816, +Q3372909, +Q7959649, +Q7977485, +Q7992684, +Q3817966, +Q3821852, +Q3420907, +Q3429733, +Q774474

diff --git a/blog/images/2017-10-25-verk-i-det-fri-filmer.png b/blog/images/2017-10-25-verk-i-det-fri-filmer.png new file mode 100644 index 0000000000000000000000000000000000000000..930478ad7725f4b4b1c83d7c5ab4f4b6b6f5f87d GIT binary patch literal 8989 zcmaJ{2Ut_vvfjW^L_m&!fFNK20TEE?9i>POB{Zqhi~%Xqq=SkeO0UvUgwT;LT`7hR z5<=)jdg#5wTRG>vd*3_nyvz5I>~F6o=zdO4{eZpVv9dw-9s{Qj)u`?UuAW>ix_Jn|!+7U1p5f=#vn+ zI>?)#RD3~h+K+kFxlcmHz$BIhWPiUF*gG z$|8%88j3GvWahe(qsCYDIB6a0uwZ76p?g^TQ>`ZdZ?zR zc0RDtpu#R$>Uax|8m0s@1l=H~S{W)+*&X-wUE2Nr{r>*3kCN8a-+Lw|9@p`KHH60{ zc(K&jk4rI&yQcg zXyf)cT&@uGJvka!4haciWMI&M!=nxk53xNx;cH)=acmq{>43~ zp{mc@BTn7O?i5Ln%}(Jm%hpJL3mzUG{QSIz0HY29f$A?XUR0C?b}hAu@ks8EM{f1O z)fYPA`Ev>j!%gf8}+rbJp|MT9lkWqoXm!M{Wpb2+2;Jp7wpiN&-4=A405+;F+_?BlsLL;`j zEBs<^U(_?XLc&ZjH0oXzeQLxI?)oHAThGcU!g=D`BjXwoUL`R{OPNST59lf-FOtYA zGnYFld`YplrRD1d))ib=ve6i^pSGf#-#xev`{f@U!@0S+T4NWWt3VCR&`^U(=r-=v znX;l`y4zV;@7?_}E0r()JS5QWrvOq;*+gMsq{(N#%mmTx2js=xtZ%NFWfRK>&T~`@W2ob&p z@it$WhbFFoE#_?^T-^pg;F|#=W++L%f#mD&K)|mgYm&(59di%SH&HeTzxPEXVvxvd ze^LigYOqVJ|6z0fm1F+VPyOy;u$Of8^uG3WLVm>!{nRp-2Wt<5FR&~{A(f}5a9d3@ z<2)`=P}*%^rjC={?AH*>Q`8-+u z{Z?$^Q1}f)vhE)tOxq3Qd{5on^qHBNJ?1|Knxh|l3@3i+*Oe^JQ9oX&!8xsOL8NF{ z{u_D9(wBZJsIUk)iGYA&WUze>I};O=1=_>Iqttd-h{vQx->||C`!oF^^V6Gt=J5uL zsOoBUan~eX7boc3#d2#m6W?Q(Q5T#sBO_z)C;9hVUE-@vp_CUcw5{%{M1ncbl9DDP zx%^&PWwiK6T^2 zGp=4bB*Fo|=2*qUZEa&iOTlYtN$Q9A(tB^h%-GnBSz;ir`1i`pOatjtcMTn#fD2dV zF>7m=lrV8c41+`y>>wxt9}PN3^Ke7j#2|bchlYB?CYm?|ZEf$XdGX(W546z(_+>%3 z{1zxj>K6yDFI^qKn3TP+#3m25T_U5>xVChY8x9{*gEO+${}7^q+NjB>{7MH5$^G;O z4BUSA+s!bD4zDePqPan`=Xw<1djXnxhF_<6EK-EghJJSe2dBuK^_fza9he%uXjwS!ginmd#ptS5S!3QQ+k7#e;7*f6dj6@_dG5raCr9l zM~+s-WoW55X6=foqa0uDR^$a%sXVafZ33MGMH-x+Wv-?<&2_b|?sC(f7wy;o#;q(# z>1`dISzX=j*r0_MB9W1L#x4PJugA&$T$wfYmW~d)l)SonSh?+RX7hV`o5lWoF^{dR z8N$?dhNH~UAnay#580JKQS|!t z>#kDTwQwyp*=|I1K=vlo7N7rVW_|s+APA86e04pnCu^jV8!b1)+&3)S4jGkDKCSg6 zF*&rDo=xTqGrWYKqZRk$e%Fh-^gtF5MW@-0pP6cjePd%Re?=n^P?TL?Z%Ro?Nfs7T 
zxEu8Fy1%ma5^^3mijK}@F^4qu#zsB187XBrFMN*ZYD&I8VzkO-sNT0;$ad)S@!{@b zW}@9YW}82Qv)q2XpsNd(Z(Q9E(wm9)(&z6%nv5Xv&-~D(RQL3(9&?+b#A2}v>b?h2 zChjva{ucj2C$e;DXyoIc3DGE$SmQ2*(vzxoRe9UDX7jJ9q&QvpLHQH3&9aD%s`u$F zn&YQ<-3dx4zukY8s^RqFhT4!!QIR_*G(4b4D^m%4`mxFp-K9w9=@vh>h%Ly09Y9yh zWu_ovPGL|1IsQ=i1hr9u;7KZq;Pi8prKExk36i4MfFu*)F?#UW#)D5JX;(BF|6K%%<3!NH3y5k%eWlKa6XcQ?+YX*i~2fDD{T$t0AOo2w!`;_zDP zaH+(93ivPx%t%5Ufw~-Fa6Y=Spx_IDZI#N=2Wo0%J|~+Ssh!c?Pd+{qIG+~o2y=YE9_q!uxtFd^)Zy4iO;U0r=Lz+C*~sw z>4AWRmwe>*&s;ASa!7Y3d5z69G$%I~vp$91^fqv#MQldZ?8&G6^@H>bsF2Gy57tyX zx?HQQm$%X!!3V9}Z+QJ_>e<-vy;ei)GWzv)y}HITK23e-O7?n<7dE~k(Y1<8HlR^A{< z+V4QwSLJNn(gOZG`dtQ*8!4#vj8NJ(I~tsnZuG$M^=|DXNYW~kj41%6W$(&PE+i%r zk39rY*WIogXHlV%;K7>abq5S$^7>{<%g*TNSj~TXDMd+T8WCOl$7l_JgS9*(yrUym zbmkZ-r!)weHm!2+@}<6+&$pssKFOr4UV4)b+7=cxv8nVj3xs8h>+*68=C=o_r$w5g z=W_qM-e5sDy!ERlJix8X$1#}Q-nu15dO?mYzQ_`Nhax^Re%*a=nc0B~fT-0CT)Va_ z{hg$Jv48@j%2)?h&Hgx-OH zGJl-+(RvJ7nAJtuZ^>Y99bw6oh}=rf=0`$^gD&oVDZ^|NRF3i^)40(wq89}85_48m zmp(29ttkP7mO(Q-e!OBy$^xL!=h&n6aHYCxvrF7%ZQ?$UL21iRg+_5;Lb;Zc<`TgK za+LnzvkZ^qXN8IOMFM^eU&_jwKFS96jgIDT&UG)2d9GpuD4G7L8l98`rWxi}a?TP= zdqZg4ILYKT7WLgnI};PPzLBVuJ^o%%`uKi6D6G1^Zk^7X9;Wi>U+ zLt=mDB)nf#DhEcF=*`}B)on(v%4u}lc*ynjr4fOHGqi~S(C1P-VV*lnKNIcive{K) z*{=Krr~jkYoytUJ9ufU=ad~5b_yh}*K4iUo85ePbWT!bNuEQ-YZ)ZCbQ9C=X`KI+Ipc$;13KcK)K6FJGynCnlwxq{A z>6pJ^qQ=PF8X@ME5$v`$_}jCa67dR_e>RBEhbbReacfg5+|7K;PdVyd`9mP?`nqz? 
zZV}@{yyt{fUMU5RbNbQKyL7KDzKhNc24%;9VW#yj)M(|)%MWI*PrkIF_3&@%qFq-- z)GP>bc~88|P?dcVj;%L58oy!KQ7ge1x5A&u6&fQj^rj88k~q$O&R{IpulO!Eep@Y% zac`fpa^8|H#xFOA$9Zr+$qGY%SiP6c5z!rbN`DJv7NZOX^6f|NbwItY5LfN0lr#c; z!KsQ4ZS($tL5S>z=hntk`riSCRRG7oIk=a+?z1;3-G>ui%28A1epHuElW~$ac*%D# zCzm@fii3~Mo0mqqBv%((G{|)eV7*axG_&n2zX`@AFa7LW6DoKGO>BB~{ z*C0MU>{EgPyJdeKAkB1`53h73=4gd(=JOe;fOiF847;<6N*a+$iBG1HQb z?3uYEHreE*T>$YN)`W74l1D+k}R1JfkjJ%+oa_miM@vqEXv)Ts|a3jqS>u7Wb|AYpu?WCb|Gc z|EjrY^>ZFrC?yCNDo{y$sd+|Ti=s}<&LB&1`&AuUkDG4K4lrxdtqZGPO2F5(qKSkT z|3@gEy&=KryWgkqCE91*{W+s^m1feTk{nDO4vyl@0t^ub!MDFLT5Pq6#l6tX=AC&| z*@vEEjq&#?whi1I};?m$`l9XqwlA2duaWl_oTc6j7CKBXSx=8V( zKe0mKBPRy`%E^^J*J5w{QO0cn*A085Rg*QoHbuhy+O@$5E#IAA!*g4Ts=TSbQk!ek zAlaMbc;_}uK5d8|=F-K>EcJ!oIWlF)*t^ctG~V0x78}cRzxK#cKcZzV7>)Mb(c-y1 zKdl&y?Uqh;cLOD$5zla)PWF>*%R}l;X9ZrQ!ZohO(=pAK`iz*1xzEz?2QBkc)6qpI z4VE{xy`=>*7lDk;Y|mkbcdn|FV{YKk5IzYnqm>+?r^i%;Y=$FBm)Mq>oA(S2_Kz}{ zC=&+jn%r=b0I@YzCU zF$WsyXj(v)EGeD`>d*GM+;~mBxB|X3x^s5-8>TehNTlQO=NJ33$wk%k%hDMRb!Wu# zzb2tH^u&Zuc%l}qGIts`_N}RTUnn&*T?-1s%(T-a_8*i+xdlQw(IM7hD zP~Dooj0Gg+R_f{gsmn~8Ao@KQhHI-%r&5*M6)&hMW_MtinG;v(X!W413~NxeHOz$@ z<%%sYPXmDWpm&3+)y0{A5ubVRic0z@@Zg|W{W~e>K2gE?qV3#3b)OR|hd;a+_h8y7 zHo^K`VN@QBJ3#pBjD5VF^Yfi0>wE(py`!`2mCNN~^4A-mlc_s08{@00)}Jq|`koF6 z3c~&58sOBfb$j{2L$r>r@_4o1Db-xu2Cc8PO^Veeb{%P69?{i{?ocMI3cf9so<3zz z=T&z-qR`R9o>l~eN7a>QC|C=YlJXIcoIa5n-F-Rw8Fj^+q0pQr<$+aYwt{G;k2>!}d-(^Pb|N*d z=?=SEQ%J(+6uKdhsdRL7xKM9cM0)6RYg@||^>c(W@HOYl&iv-LVV1$dHTqO8lX%XH z!%}|%H^OO5UFp2wwY^Qb7^NN{;B77{8pP&a8J>MICP3Mo>JMYCFf`E!y&)gUAptnt zMO-1SdbKWBe6_!`=rw?f05550&?%bRN=hXn+S9r1l}_{HiEW(o73 zTlsJBcV_iA376cud5@PmHeU^XzLsE3L^caoWmphvNQ?GFQnx=v^>KnB8%IQYthcz? 
zPu#+8s$Gy@kbZD-LZ23lM)Q}BipRkMD%Ev~%y%y{e35qE_Cnu+V zwB%+7|4@y4iN&%a3Si~RN<6tYa2L(=blg_=saLXWx6S-e8JTgl(DAhz5LfBxdV&a8 zKCfwA!j}(vIfpkZ7RrYUFjX!-0CMJiPv*=UN_+75oPjA+SMbS{v@ZK7Zto@r7mRbC zOFqT=p6-qpmrLs;=@!;?X(jd-QgPB6SaH`uZD_FTs9`&mB^8>(miX~6@Z+hC8FK8% zNO<_|47^3=K_!qqjVou?J)H46JbhqTdmt7PN}v>J_!X8jS|eY^pQa3sbByDZDWWdn z#RUdUrH^w@W(s_tPgj0!z0$aTF(oC1WqWJu>0eZSf4(7VF4cEzA+9K21749I>Zq-! zm;KSKoY$14LPVK=bH}LMP_azTE9X-~eER)2HZ*>1Q3DS%S512_i44crrK;2~?I_U# z@^*Wp-3%lI{!jqjO7~_qu^=?}3A@%pAxHj9i?*GSO!RfTX{GZJ;%Je$l z8-sanTnO*(lBmT(-NI0GyDRznZhp4X@!0#(n@|k94r~0PWq=WCb*`5$PL?%h;Sdm@ z%38llx15i?kGzwgy&krBs(7z+`fi~_N20L$Tz4vNX|Tj)p_kp*_jsG}^5uA+`?$}3 zr?-%0H)S2Pde@01SLFh5tuuA7wyiN0ESTh4U6?2zGU(m-T#RRZZ2PWWQQcxv*vF{K zATx$84he4EJPp8qfg?uR3xQ>Jqhj8`t6~|aEXDKO@Y$!4Y?6md^>kbA{IxLN-I%2J zQBk&z=~`YXZ5N*x2Z%hf5QMFJD;FKm3vb)Nyiy#Iz6ABtTc?$;*d*%rn)v|V{TVR2 zi*fZw8FYLmdEna6nu1x`wW*cymnG#NN@SDruE7Q`U7yT$TpFzVIW|_H$oA>rz(fA} z#j=VD@ndShsfa%J@<+SMwz%Xh@GlK*1-BKzDJV1oY6@`XnPLM-BM71; z1|$w7BM(R=aGIp|<*$R#u`!b~ukQn1AAD}_>==msotVke=;TXqetxhw6;|f4ZQYq7 zX(+wls_qQjYM@B=G#=YUu3V1BdEk)r{>a*Qw~0=dR}3z$}5zQh1w?-AER_iQE0g#zt`N z*dt3NX}O@!NlQ2st#q23-aWi@{>^#p(I+K(R3NL5zVz{e`Zgep@F|`vKUaNrtN*uY z1WM>=f3s?L%(Kj_F<{{CudQnTDGpdtPdi_IUVneT|8Nk-IVtpt=%+Xn52ZjV)_!cR z9!<*Yf3VTQl1?jUu6}#I6F)fK>;#Z_7#lk3HWl(j z<0nrDjod6E)%(fG9$FgacIA0OD2B>sjbw9evXLwGJ;5-oLkCPperk4~&xpYCE_ywz zT~HyiBa%&m)q5+V^6Ag(@JPbv=2Vz;x%0D?k;++|sjqE9Cer748#fD1+R2A@%#}ki z`4wZFMdS1;*7K?whmC|2D3mRPHW#S)hgRE&$h%)>yUEf#@n>WM;i!4_L-ACQQ0SI=i z*3$);&VA)VZh5)TvNwP8_x8JydCkbbnqSwYfruUVTY|7Kse>$n2MycT8%J*K&!r-~ zQj|lY*h0d>a4*1~~!4H?8>sn09HN@V{Ri544 zH;*5jpU)EY+O(+Rv%FbiG} zpt0kCNe7P#W4md82^_l{6q(C6SS5)%%{GCcSP;=MXSGA6W;gka*!d>mpS$uRV=Vnq zz_7yFodjOvs?U0+-a6kO_w*LU#?rYsJ99FI!z^tTR$u7I#jgG-y1iV!m4(dtC%~(UbyNK=- z+cC#>9x_}5Kl9;6J6|V&`{BI&e8JVaL%U5tyPiRM-pzO)#fG^sN_nc#MF49+}X@g>h=S0mB3Y1ax?=RHC=`y=HTCay!7*ak>VJsHIa?^mq4! z8NP>Cp=TIxDBUzv0Vz%0!iEe2bT6Rl{=-`yf707EAcce1L>fQn;RApuEud>V^-f7G X^@Kn5lT-78Tac1GOs?R;<5&L$Ofqzy literal 0 HcmV?d00001 -- 2.47.2