Petter Reinholdtsen

Legal to share more than 11,000 movies listed on IMDB?

7th January 2018

I've continued to track down lists of movies that are legal to distribute on the Internet, and have identified more than 11,000 title IDs in The Internet Movie Database so far. Most of them (57%) are feature films from the USA published before 1923. I've also tracked down more than 24,000 movies I have not yet been able to map to an IMDB title ID, so the real number could be a lot higher. According to the front web page for Retro Film Vault, there are 44,000 public domain films, so I guess there are still some left to identify.

The complete data set is available from a public git repository, including the scripts used to create it. Most of the data is collected using web scraping, for example from the "product catalog" of companies selling copies of public domain movies, but any source I find believable is used. I've so far had to throw out three sources because I did not trust the public domain status of the movies listed.

Anyway, this is the summary of the 28 collected data sources so far:

 2352 entries (   66 unique) with and 15983 without IMDB title ID in free-movies-archive-org-search.json
 2302 entries (  120 unique) with and     0 without IMDB title ID in free-movies-archive-org-wikidata.json
  195 entries (   63 unique) with and   200 without IMDB title ID in free-movies-cinemovies.json
   89 entries (   52 unique) with and    38 without IMDB title ID in free-movies-creative-commons.json
  344 entries (   28 unique) with and   655 without IMDB title ID in free-movies-fesfilm.json
  668 entries (  209 unique) with and  1064 without IMDB title ID in free-movies-filmchest-com.json
  830 entries (   21 unique) with and     0 without IMDB title ID in free-movies-icheckmovies-archive-mochard.json
   19 entries (   19 unique) with and     0 without IMDB title ID in free-movies-imdb-c-expired-gb.json
 6822 entries ( 6669 unique) with and     0 without IMDB title ID in free-movies-imdb-c-expired-us.json
  137 entries (    0 unique) with and     0 without IMDB title ID in free-movies-imdb-externlist.json
 1205 entries (   57 unique) with and     0 without IMDB title ID in free-movies-imdb-pd.json
   84 entries (   20 unique) with and   167 without IMDB title ID in free-movies-infodigi-pd.json
  158 entries (  135 unique) with and     0 without IMDB title ID in free-movies-letterboxd-looney-tunes.json
  113 entries (    4 unique) with and     0 without IMDB title ID in free-movies-letterboxd-pd.json
  182 entries (  100 unique) with and     0 without IMDB title ID in free-movies-letterboxd-silent.json
  229 entries (   87 unique) with and     1 without IMDB title ID in free-movies-manual.json
   44 entries (    2 unique) with and    64 without IMDB title ID in free-movies-openflix.json
  291 entries (   33 unique) with and   474 without IMDB title ID in free-movies-profilms-pd.json
  211 entries (    7 unique) with and     0 without IMDB title ID in free-movies-publicdomainmovies-info.json
 1232 entries (   57 unique) with and  1875 without IMDB title ID in free-movies-publicdomainmovies-net.json
   46 entries (   13 unique) with and    81 without IMDB title ID in free-movies-publicdomainreview.json
  698 entries (   64 unique) with and   118 without IMDB title ID in free-movies-publicdomaintorrents.json
 1758 entries (  882 unique) with and  3786 without IMDB title ID in free-movies-retrofilmvault.json
   16 entries (    0 unique) with and     0 without IMDB title ID in free-movies-thehillproductions.json
   63 entries (   16 unique) with and   141 without IMDB title ID in free-movies-vodo.json
11583 unique IMDB title IDs in total, 8724 only in one list, 24647 without IMDB title ID
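
A minimal sketch of how a summary like the one above could be computed. It assumes that "unique" means title IDs found in no other list, and that each source has already been reduced to a set of IMDB title IDs; the function name and the example IDs are made up for illustration and are not taken from the scripts in the git repository.

```python
from collections import Counter

def summarize(sources):
    """sources maps a source name to the set of IMDB title IDs it lists."""
    # Count in how many sources each title ID appears.
    seen = Counter(tid for ids in sources.values() for tid in ids)
    # Per source: (number of IDs, number of IDs found in no other source).
    per_source = {
        name: (len(ids), sum(1 for tid in ids if seen[tid] == 1))
        for name, ids in sources.items()
    }
    total = len(seen)
    only_in_one = sum(1 for count in seen.values() if count == 1)
    return per_source, total, only_in_one

# Made-up example data, not real title IDs.
example = {
    "list-a.json": {"tt0000001", "tt0000002"},
    "list-b.json": {"tt0000002", "tt0000003"},
}
per_source, total, only_in_one = summarize(example)
```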

I keep finding more data sources. I found the cinemovies source just a few days ago, and as you can see from the summary, it extended my list with 63 movies. Check out the mklist-* scripts in the git repository if you are curious how the lists are created. Many of the titles are extracted using searches on IMDB, where I look for the title and year, and accept search results with only one movie listed if the year matches. This allows me to automatically use many lists of movies without IMDB title ID references, at the cost of increasing the risk of wrongly identifying an IMDB title ID as public domain. So far my random manual checks have indicated that the method is solid, but I really wish all lists of public domain movies would include a unique movie identifier like the IMDB title ID. It would make the job of counting movies in the public domain a lot easier.
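
The acceptance rule described above (exactly one movie in the search result, and a matching year) can be sketched as follows. The shape of the search result, a list of dictionaries with 'id' and 'year' keys, is an assumption made for illustration; the actual mklist-* scripts may represent search results differently.

```python
def pick_title_id(results, wanted_year):
    """Return the IMDB title ID only if the match is unambiguous.

    results is a hypothetical list of dicts like {"id": "tt...", "year": 1927}.
    """
    # Reject empty and ambiguous search results.
    if len(results) != 1:
        return None
    hit = results[0]
    # Reject a single hit if the year differs from the one in the source list.
    if hit.get("year") != wanted_year:
        return None
    return hit.get("id")
```

Rejecting everything except a single hit with the right year trades coverage for precision, which fits the concern about wrongly marking a title as public domain.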

Tags: english, opphavsrett, verkidetfri.

Comments on «Evaluation of (il)legality» for Popcorn Time

20th December 2017

Yesterday I was in Follo District Court as an expert witness and presented my investigations into counting film works in the public domain, related to the association NUUG's involvement in the case concerning Økokrim's seizure and later confiscation of the DNS domain popcorn-time.no. I talked about several things, but mostly about my assessment of how the film industry has measured how illegal Popcorn Time is. As far as I can tell, the film industry's measurement has been passed on unchanged by the Norwegian police, and the courts have relied on the measurement when assessing Popcorn Time both in Norway and abroad (the 99% figure is also cited in foreign court decisions).

Ahead of my testimony I wrote a note, mostly to myself, with the points I wanted to get across. Here is a copy of the note I wrote and gave to the prosecution. Strangely, the judges did not want the note, so if I understood the court process correctly, only the histogram graph was entered into the case documentation. The judges were apparently only interested in what I said in court, not what I had written beforehand. In any case, I assume others besides me may find the text useful, and I am therefore publishing it here. I attach a transcript of document 09,13, which is the central document I comment on.

Comments on «Evaluation of (il)legality» for Popcorn Time

Summary

The measurement method that Økokrim has relied on when claiming that 99% of the movies available from Popcorn Time are shared illegally has weaknesses.

The person or persons who assessed whether movies can be legally shared have not succeeded in identifying movies that can be shared legally, and have apparently assumed that only very old movies can be shared legally. Økokrim assumes that there is only one movie that can be freely shared among those observed available via various Popcorn Time variants, the Charlie Chaplin film «The Circus» from 1928. I find three more among the observed movies: «The Brain That Wouldn't Die» from 1962, «God’s Little Acre» from 1958 and «She Wore a Yellow Ribbon» from 1949. It is quite possible there are more. There are thus at least four times as many movies that can be legally shared on the Internet in the data set Økokrim has relied on when it is claimed that less than 1% can be shared legally.

Next, the selection made by searching for random words taken from the Dale-Chall word list deviates from the year distribution of the movie catalogs used as a whole, which affects the split between movies that can be legally shared and movies that cannot. In addition, picking the upper part (the first five) of the search results gives a deviation from the correct year distribution, which affects the share of public domain works in the search result.

What is measured is not the (il)legality of the use of Popcorn Time, but the (il)legality of the content of bittorrent movie catalogs that are maintained independently of Popcorn Time.

Documents discussed: 09,12, 09,13, 09,14, 09,18, 09,19, 09,20.

Detailed comments

Økokrim has explained to the courts that at least 99% of everything available from various Popcorn Time variants is shared illegally on the Internet. I became curious about how they arrived at this figure, and this note is a collection of comments on the measurement Økokrim refers to. Part of the reason I chose to look at the case is that I am interested in identifying and counting how many artistic works have fallen into the public domain or for other reasons can be legally shared on the Internet, and I was therefore interested in how the one percent that might be shared legally had been found.

The 99% share comes from an uncredited and undated note that sets out to document a method for measuring how (il)legal various Popcorn Time variants are.

Briefly summarised, the method document explains that because it is not possible to obtain a complete list of all movie titles available via Popcorn Time, something meant to be a representative sample is created by selecting 50 search terms longer than three characters from the word list known as Dale-Chall. For each search term a search is run and the first five movies in the search result are collected, until 100 unique movie titles have been found. If 50 search terms were not enough to reach 100 unique movie titles, more movies from each search result were added. If this was still not enough, additional randomly chosen search terms were drawn and searched for until 100 unique movie titles had been identified.
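
The sampling procedure summarised above can be sketched roughly like this. It is a reading of the method document, not the program used in the original measurement or in my simulations: the function name, the `search` callback and all parameters are hypothetical, and the sketch assumes that the extra keywords in the last step again contribute only their first five results, which the method document does not state.

```python
import random

def sample_titles(words, search, target=100, first_batch=50, per_query=5, seed=None):
    """Collect up to `target` unique titles using keyword searches.

    words  -- word list to draw keywords from (e.g. Dale-Chall)
    search -- hypothetical callback: keyword -> ordered list of titles
    """
    rng = random.Random(seed)
    # Only keywords longer than three characters are used.
    candidates = [w for w in words if len(w) > 3]
    rng.shuffle(candidates)

    titles, seen = [], set()

    def add(results):
        for title in results:
            if len(titles) >= target:
                return
            if title not in seen:  # drop duplicates
                seen.add(title)
                titles.append(title)

    first = candidates[:first_batch]
    # Step 1: the first five results for each of the first 50 keywords.
    for word in first:
        add(search(word)[:per_query])
    # Step 2: if that is not enough, add the remaining results for the same keywords.
    if len(titles) < target:
        for word in first:
            add(search(word)[per_query:])
    # Step 3: if still not enough, keep drawing further random keywords.
    for word in candidates[first_batch:]:
        if len(titles) >= target:
            break
        add(search(word)[:per_query])
    return titles
```

Running the same function many times with different seeds against a complete catalog is one way to see how much a single sample of 100 titles varies.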

Then, for each of the movie titles, it was «verified whether or not there is a reasonable expectation that the work is copyrighted by checking if they are available on IMDb, also verifying the director, the year when the title was released, the release date for a certain market, the production company/ies of the title and the distribution company/ies» (quoted in my own Norwegian translation in the original note).

The method is reproduced both in the uncredited documents 09,13 and 09,19, and described from page 47 of document 09,20, slides dated 2017-02-01. The latter is credited to Geerart Bourlon from Motion Picture Association EMEA. The method appears to have several weaknesses that bias the results. It starts by stating that it is not possible to extract a complete list of all movie titles available, and that this is the reason for the choice of method. This premise is not consistent with what is stated in document 09,12, which also lacks a stated author and date. Document 09,12 describes how the entire catalog content was downloaded and counted. Document 09,12 is possibly the same report that was referred to in the judgment from Oslo District Court 2017-11-03 (case 17-093347TVI-OTIR/05) as the report of 1 June 2017 by Alexander Kind Petersen, but I have not compared the documents word for word to verify this.

IMDB is short for The Internet Movie Database, a recognised commercial web service used actively by both the film industry and others to keep track of which feature films (and some other films) exist or are in production, and information about these films. The data quality is high, with few errors and few missing films. IMDB does not show information about the copyright status of films on the info page for each film. As part of the IMDB service there are lists of films made by volunteers, listing what is assumed to be works in the public domain.

There are several sources that can be used to find films that are in the public domain or have terms of use that make it legal for everyone to share them on the Internet. Over the last few weeks I have tried to collect and cross-link these lists in an attempt to count the number of films in the public domain. Starting from such lists (and published films, in the case of the Internet Archive), I have so far managed to identify more than 11,000 films, mainly feature films.

The vast majority of the entries are taken from IMDB itself, based on the fact that all films made in the USA before 1923 have fallen into the public domain. The corresponding cut-off for Great Britain is 1912-07-01, but this makes up only a very small share of the feature films in IMDB (19 in total). Another large share comes from the Internet Archive, where I have identified films with a reference to IMDB. The Internet Archive, which is based in the USA, has a policy of only publishing films that are legal to distribute. During this work I have come across several films that have been removed from the Internet Archive, which leads me to conclude that the people controlling the Internet Archive actively work to only have legal content there, even though it is largely run by volunteers. Another large list of films comes from the commercial company Retro Film Vault, which sells public domain films to the TV and film industry. I have also made use of lists of films claimed to be in the public domain, namely Public Domain Review, Public Domain Torrents and Public Domain Movies (.net and .info), as well as lists of films with Creative Commons licensing from Wikipedia, VODO and The Hill Productions. I have done some spot checks by assessing films that are mentioned on only one list. Where I have found errors that made me doubt the judgement of those who made the list, I have discarded the list completely (this applies to one list from IMDB).

By starting from works that can be assumed to be legally shared on the Internet (from among others the Internet Archive, Public Domain Torrents, Public Domain Review and Public Domain Movies), and linking them to entries in IMDB, I have so far managed to identify more than 11,000 films (mainly feature films) that there is reason to believe can be legally distributed by everyone on the Internet. As additional sources, lists of films assumed or claimed to be in the public domain have been used. These sources come from communities working to make available to the general public all works that have fallen into the public domain or have terms of use that allow sharing.

In addition to the more than 11,000 films where the title ID in IMDB has been identified, I have found more than 20,000 entries where I have not yet had the capacity to track down the title ID in IMDB. Some of these are probably duplicates of the IMDB entries identified so far, but hardly all. Retro Film Vault claims to have 44,000 public domain film works in its catalog, so it is possible that the real number is considerably higher than what I have managed to identify so far. The conclusion is that the number 11,000 is a lower bound on how many films in IMDB can be legally shared on the Internet. According to statistics from IMDB, there are 4.6 million titles registered, of which 3 million are TV series episodes. I have not found out how they are distributed per year.

If all the title IDs in IMDB that are claimed to be legally shareable on the Internet are distributed by year, one gets the following histogram:

The histogram shows that the effect of missing registration or renewal of registration is that many films released in the USA before 1978 are in the public domain today. In addition, one can see that there are several films released in recent years with terms of use that allow sharing, possibly due to the rise of the Creative Commons movement.

For automated analysis of the catalogs I have made a small program that connects to the bittorrent catalogs used by various Popcorn Time variants and downloads the complete list of movies in the catalogs, which confirms that it is possible to fetch a complete list of all movie titles available. I have looked at four bittorrent catalogs. The first is used by the client available from www.popcorntime.sh and is named 'sh' in this document. The second is, according to document 09,12, used by the client available from popcorntime.ag and popcorntime.sh and is named 'yts' in this document. The third is used by the web pages available from popcorntime-online.tv and is named 'apidomain' in this document. The fourth is used by the client available from popcorn-time.to according to document 09,12, and is named 'ukrfnlge' in this document.

The method Økokrim relies on states in its fourth point that discretion is a suitable way of finding out whether a film can be legally shared on the Internet or not, and says that it was «verified whether or not there is a reasonable expectation that the work is copyrighted». First of all, it is not enough to establish whether a film is «copyrighted» to know whether it is legal to share it on the Internet or not, as there are several films with copyright terms of use that allow sharing on the Internet. Examples of this are Creative Commons licensed films like Citizenfour from 2014 and Sintel from 2010. In addition to these, there are several films that are now in the public domain because of missing registration or renewal of registration, even though the director, the production company and the distributor all would like protection. Examples of this are Plan 9 from Outer Space from 1959 and Night of the Living Dead from 1968. All films from the USA that were in the public domain before 1989-03-01 remained in the public domain, as the Berne Convention, which took effect in the USA at that time, was not given retroactive effect. If there is anything the story of the song «Happy birthday» tells us, where payment for use was collected for decades even though the song was not actually protected by copyright law, it is that each individual work must be assessed carefully and in detail before one can establish whether the work is in the public domain or not. It is not enough to trust self-declared rights holders. Several examples of public domain works misclassified as protected come from document 09,18, which lists search results for the client referred to as popcorntime.sh and according to the note contains only one film (The Circus from 1928) that with some doubt can be assumed to be in the public domain.

On a quick read-through of document 09,18, which contains screenshots from the use of a Popcorn Time variant, I found mention of both the film «The Brain That Wouldn't Die» from 1962, which is available from the Internet Archive and which according to Wikipedia is in the public domain in the USA as it was released in 1962 without a copyright notice, and the film «God’s Little Acre» from 1958, which is published on Wikipedia, where it is stated that the black and white version is in the public domain. It is not clear from document 09,18 whether the film mentioned there is the black and white version. For capacity reasons, and because the film overview in document 09,18 is not machine readable, I have not tried to check all the films listed there against the list of films assumed to be legal to distribute on the Internet.

By automated review of the list of IMDB references under the spreadsheet tab 'Unique titles' in document 09,14, I additionally found the film «She Wore a Yellow Ribbon» from 1949, which is probably also misclassified. The film «She Wore a Yellow Ribbon» is available from the Internet Archive and marked as public domain there. There thus appear to be at least four times as many films that can be legally shared on the Internet as was assumed when claiming that at least 99% of the content is illegal. I do not rule out that closer investigation could uncover more. The point is in any case that the method's item about a «reasonable expectation that the work is copyrighted» makes the method unreliable.

The measurement method discussed picks random search terms from the Dale-Chall word list. That word list contains 3000 simple English words that fourth graders in the USA are expected to understand. It is not stated why this particular word list was chosen, and it is unclear to me whether it is suited to obtaining a representative sample of films. Many of the words give an empty search result. By simulating equivalent searches I see large deviations from the distribution in the catalog for individual measurements. This suggests that individual measurements of 100 films, as the measurement method describes, are not well suited to finding the share of illegal content in the bittorrent catalogs.

One can counteract this large deviation for individual measurements by doing many searches and merging the results. I have tested this by carrying out 100 individual measurements (i.e. measuring (100x100=) 10,000 randomly chosen films), which gives a smaller, but still significant, deviation compared to a count of films per year in the whole catalog.

The measurement method fetches the top five entries in the search result. The search results are sorted by the number of bittorrent clients registered as sharers in the catalogs, which can give a bias towards films that are popular among those using the bittorrent catalogs, without saying anything about what content is available or what content is shared with Popcorn Time clients. I have tried to measure how large such a bias might be by comparing the distribution if one takes the 5 bottom entries in the search result instead. The difference between these two methods is clearly visible in the histograms for some of the catalogs. Here are histograms of films found in the complete catalog (green line), and films found by searching for words in Dale-Chall. Graphs marked 'top' pick from the first 5 in the search result, while those marked 'bottom' pick from the last 5. One can see here that the results are significantly affected by whether one looks at the first or the last films in a search result.
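
A minimal sketch of such a top versus bottom comparison, assuming each search result is a list of dictionaries with a 'year' key, already sorted the way the catalog returns them. The function names are hypothetical, and the single-number distance used here (total variation distance between the per-year shares) is a choice made for this illustration; the comparison described above was done by looking at histograms.

```python
from collections import Counter

def pick(results, where="top", n=5):
    """Take the first or the last n entries of one search result."""
    return results[:n] if where == "top" else results[-n:]

def share_per_year(movies):
    """Fraction of the movies released in each year."""
    counts = Counter(m["year"] for m in movies)
    total = sum(counts.values())
    return {year: c / total for year, c in counts.items()} if total else {}

def distance(a, b):
    """Total variation distance between two per-year distributions (0 to 1)."""
    years = set(a) | set(b)
    return 0.5 * sum(abs(a.get(y, 0) - b.get(y, 0)) for y in years)
```

Comparing `share_per_year` for the complete catalog with the shares for the 'top' and 'bottom' picks gives one number per sampling strategy. Note that the two picks overlap when a search returns fewer than ten movies.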

It is worth noting that the bittorrent catalogs discussed are not made for use with Popcorn Time. For example, the YTS catalog, which is used by the client downloaded from popcorntime.sh, belongs to an independent file sharing related web site, YTS.AG, with a separate user community. The measurement method proposed by Økokrim therefore does not measure the (il)legality of the use of Popcorn Time, but the (il)legality of the content of these catalogs.

The method from Økokrim's document 09,13 in the criminal case about the DNS seizure.

1. Evaluation of (il)legality

1.1. Methodology

Due to its technical configuration, Popcorn Time applications don't allow to make a full list of all titles made available. In order to evaluate the level of illegal operation of PCT, the following methodology was applied:

A random selection of 50 keywords, greater than 3 letters, was made from the Dale-Chall list that contains 3000 simple English words1. The selection was made by using a Random Number Generator2.
For each keyword, starting with the first randomly selected keyword, a search query was conducted in the movie section of the respective Popcorn Time application. For each keyword, the first five results were added to the title list until the number of 100 unique titles was reached (duplicates were removed).
For one fork, .CH, insufficient titles were generated via this approach to reach 100 titles. This was solved by adding any additional query results above five for each of the 50 keywords. Since this still was not enough, another 42 random keywords were selected to finally reach 100 titles.
It was verified whether or not there is a reasonable expectation that the work is copyrighted by checking if they are available on IMDb, also verifying the director, the year when the title was released, the release date for a certain market, the production company/ies of the title and the distribution company/ies.

1.2. Results

Between 6 and 9 June 2016, four forks of Popcorn Time were investigated: popcorn-time.to, popcorntime.ag, popcorntime.sh and popcorntime.ch. An excel sheet with the results is included in Appendix 1. Screenshots were secured in separate Appendixes for each respective fork, see Appendix 2-5.

For each fork, out of 100, de-duplicated titles it was possible to retrieve data according to the parameters set out above that indicate that the title is commercially available. Per fork, there was 1 title that presumably falls within the public domain, i.e. the 1928 movie "The Circus" by and with Charles Chaplin.

Based on the above it is reasonable to assume that 99% of the movie content of each fork is copyright protected and is made available illegally.

This exercise was not repeated for TV series, but considering that besides production companies and distribution companies also broadcasters may have relevant rights, it is reasonable to assume that at least a similar level of infringement will be established.

Based on the above it is reasonable to assume that 99% of all the content of each fork is copyright protected and are made available illegally.

Tags: fildeling, freeculture, norsk, nuug, opphavsrett, verkidetfri, video.

Cura, the nice 3D print slicer, is now in Debian Unstable

17th December 2017

After several months of working and waiting, I am happy to report that the nice and user friendly 3D printer slicer software Cura just entered Debian Unstable. It consists of six packages: cura, cura-engine, libarcus, fdm-materials, libsavitar and uranium. The last two, uranium and cura, entered Unstable yesterday. This should make it easier for Debian users to print on at least the Ultimaker class of 3D printers. My nearest 3D printer is an Ultimaker 2+, so it will make life easier for at least me. :)

The work to make this happen was done by Gregor Riepl, and I was happy to assist him in sponsoring the packages. With the introduction of Cura, Debian is up to three 3D printer slicers at your service, Cura, Slic3r and Slic3r Prusa. If you own or have access to a 3D printer, give it a go. :)

The 3D printer software is maintained by the 3D printer Debian team, flocking together on the 3dprinter-general mailing list and the #debian-3dprinting IRC channel.

The next step for Cura in Debian is to update the cura package to version 3.0.3 and then update the entire set of packages to version 3.1.0, which showed up in the last few days.

Tags: 3d-printer, debian, english.

Idea for finding all public domain movies in the USA

13th December 2017

While looking at the scanned copies of the copyright renewal entries for movies published in the USA, an idea occurred to me. The number of renewals per year is so low that it should be fairly quick to transcribe them all and add references to the corresponding IMDB title ID. This would give the (presumably) complete list of movies published 28 years earlier that did _not_ enter the public domain for the transcribed year. By fetching the list of USA movies published 28 years earlier and subtracting the movies with renewals, we should be left with movies registered in IMDB that are now in the public domain. For the year 1955 (which is the one I have looked at the most), the total number of pages to transcribe is 21. For the 28 years from 1950 to 1978, it should be in the range 500-600 pages. It is just a few days of work, and spread among a small group of people it should be doable in a few weeks of spare time.

A typical copyright renewal entry looks like this (the first one listed for 1955):
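
The subtraction described above is a plain set difference. A minimal sketch, assuming both inputs are collections of IMDB title IDs that have already been gathered; the function names are hypothetical, and the result is only a list of candidates, since it is no better than the two input lists.

```python
RENEWAL_TERM = 28  # years between first publication and renewal, as described above

def publication_year(renewal_year):
    """Year whose movies had to be renewed in renewal_year."""
    return renewal_year - RENEWAL_TERM

def public_domain_candidates(published_ids, renewed_ids):
    """Title IDs published that year for which no renewal was transcribed."""
    return set(published_ids) - set(renewed_ids)
```

For the renewal year 1955, `publication_year(1955)` gives 1927, which matches the registration year of the first 1955 entry.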

ADAM AND EVIL, a photoplay in seven reels by Metro-Goldwyn-Mayer Distribution Corp. (c) 17Aug27; L24293. Loew's Incorporated (PWH); 10Jun55; R151558.

The movie title as well as registration and renewal dates are easy enough to locate by a program (split on first comma and look for DDmmmYY). The rest of the text is not required to find the movie in IMDB, but is useful to confirm the correct movie is found. I am not quite sure what the L and R numbers mean, but suspect they are reference numbers into the archive of the US Copyright Office.
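
A minimal sketch of the parsing described above: split on the first comma and look for DDmmmYY dates. It assumes three-letter English month abbreviations and that all two-digit years belong to the 1900s; the function name and the returned dictionary layout are made up for illustration.

```python
import re
from datetime import date

# Three-letter month abbreviations, assumed to be the only form used.
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Dates written like 17Aug27 or 10Jun55.
DATE_RE = re.compile(r"\b(\d{1,2})([A-Z][a-z]{2})(\d{2})\b")

def parse_renewal(entry):
    """Extract title, registration date and renewal date from one entry."""
    # The title is everything before the first comma.
    title, _, rest = entry.partition(",")
    dates = [date(1900 + int(yy), MONTHS[mon], int(dd))
             for dd, mon, yy in DATE_RE.findall(rest)
             if mon in MONTHS]
    return {
        "title": title.strip(),
        # The first date is the registration, the last one the renewal.
        "registered": dates[0] if dates else None,
        "renewed": dates[-1] if len(dates) > 1 else None,
    }

entry = ("ADAM AND EVIL, a photoplay in seven reels by Metro-Goldwyn-Mayer "
         "Distribution Corp. (c) 17Aug27; L24293. Loew's Incorporated (PWH); "
         "10Jun55; R151558.")
parsed = parse_renewal(entry)
```

For the 1955 entry quoted above, this gives the title ADAM AND EVIL, registered 1927-08-17 and renewed 1955-06-10. Titles that themselves contain a comma would be cut short by this approach.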

Tracking down the equivalent IMDB title ID is probably going to be a manual task, but given the year it is fairly easy to search for the movie title using for example http://www.imdb.com/find?q=adam+and+evil+1927&s=all. Using this search, I find that the equivalent IMDB title ID for the first renewal entry from 1955 is http://www.imdb.com/title/tt0017588/.
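
The search URL above can be built from the transcribed title and the registration year. A minimal sketch, where the function name is hypothetical and the URL layout is simply copied from the example above:

```python
from urllib.parse import urlencode

def imdb_search_url(title, year):
    """Build an IMDB search URL for a title and its year."""
    # urlencode turns spaces into '+' and escapes other special characters.
    query = urlencode({"q": f"{title.lower()} {year}", "s": "all"})
    return "http://www.imdb.com/find?" + query
```

Calling `imdb_search_url("ADAM AND EVIL", 1927)` gives the same URL as the one shown above.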

I suspect the best way to do this would be to make a specialised web service that makes it easy for contributors to transcribe entries and track down IMDB title IDs. In the web service, once an entry is transcribed, the title and year could be extracted from the text and a search in IMDB conducted, so the user can pick the equivalent IMDB title ID right away. By spreading out the work among volunteers, it would also be possible to have at least two persons transcribe the same entries, to be able to discover any typos introduced. But I will need help to make this happen, as I lack the spare time to do all of this on my own. If you would like to help, please get in touch. Perhaps you can draft a web service for crowd sourcing the task?

Note, Project Gutenberg already has some transcribed copies of the US Copyright Office renewal protocols, but I have not been able to find any film renewals there, so I suspect they only have copies of renewals for written works. I have not been able to find any transcribed versions of movie renewals so far. Perhaps they exist somewhere?

I would love to figure out methods for finding all the public domain works in other countries too, but it is a lot harder. At least for Norway and Great Britain, such work involves tracking down the people involved in making the movie and figuring out when they died. It is hard enough to figure out who was part of making a movie, but I do not know how to automate such a procedure without a registry of every person involved in making movies and their death year.

As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.

Tags: english, opphavsrett, verkidetfri.

Is the short movie «Empty Socks» from 1927 in the public domain or not?

5th December 2017

Three years ago, a presumed lost animation film, Empty Socks from 1927, was discovered in the Norwegian National Library. At the time it was discovered, it was generally assumed to be copyrighted by The Walt Disney Company, and I blogged about my reasoning to conclude that it would enter the Norwegian equivalent of the public domain in 2053, based on my understanding of Norwegian Copyright Law. But a few days ago, I came across a blog post claiming the movie was already in the public domain, at least in the USA. The reasoning is as follows: The film was released in November or December 1927 (sources disagree), and presumably registered its copyright that year. At that time, right holders of movies registered by the copyright office received government protection for their work for 28 years. After 28 years, the copyright had to be renewed if they wanted the government to protect it further. The blog post I found claims such renewal did not happen for this movie, and thus it entered the public domain in 1956. Yet someone claims the copyright was renewed and the movie is still copyright protected. Can anyone help me to figure out which claim is correct? I have not been able to find Empty Socks in Catalog of copyright entries. Ser.3 pt.12-13 v.9-12 1955-1958 Motion Pictures available from the University of Pennsylvania, neither on page 45 for the first half of 1955, nor on page 119 for the second half of 1955. It is of course possible that the renewal entry was left out of the printed catalog by mistake. Is there some way to rule out this possibility? Please help, and update the Wikipedia page with your findings.

As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.

Tags: english, freeculture, opphavsrett, verkidetfri, video.

Metadata proposal for movies on the Internet Archive

28th November 2017

It would be easier to locate the movie you want to watch in the Internet Archive if the metadata about each movie was more complete and accurate. In the archiving community, a well known saying states that good metadata is a love letter to the future. The metadata in the Internet Archive could use a face lift for the future to love us back. Here is a proposal for a small improvement that would make the metadata more useful today. I've been unable to find any document describing the various standard fields available when uploading videos to the archive, so this proposal is based on my best guess and on searching through several of the existing movies.

I have a few use cases in mind. First of all, I would like to be able to count the number of distinct movies in the Internet Archive, without duplicates. I would further like to identify the IMDB title ID of the movies in the Internet Archive, to be able to look up an IMDB title ID and know if I can fetch the video from there and share it with my friends.

Second, I would like the Butter data provider for the Internet Archive (available from github) to list as many of the good movies as possible. The plugin currently does a search in the archive with the following parameters:

collection:moviesandfilms
AND NOT collection:movie_trailers
AND -mediatype:collection
AND format:"Archive BitTorrent"
AND year

Most of the cool movies that fail to show up in Butter do so because the 'year' field is missing. The 'year' field is populated by the year part from the 'date' field, and should be when the movie was released (date or year). Two such examples are Ben Hur from 1905 and Caminandes 2: Gran Dillama from 2013, where the year metadata field is missing.
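
A minimal sketch of how the search parameters above could be turned into a request URL, for anyone who wants to reproduce what Butter sees. It assumes the Internet Archive's advancedsearch.php endpoint and its q, fl[], rows, page and output parameters; this is not taken from the Butter plugin source, which may build its request differently, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# The query quoted above, joined into one search expression.
BUTTER_QUERY = " ".join([
    "collection:moviesandfilms",
    "AND NOT collection:movie_trailers",
    "AND -mediatype:collection",
    'AND format:"Archive BitTorrent"',
    "AND year",
])

def search_url(query=BUTTER_QUERY, rows=50, page=1):
    """Build a search URL (assumed endpoint and parameter names)."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # fields to return for each hit
        ("fl[]", "title"),
        ("fl[]", "year"),
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)
```

Fetching the resulting URL and looking for a known identifier in the result is a quick way to check whether a given movie would be listed.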

So, my proposal is simply, for every movie in The Internet Archive where an IMDB title ID exists, please fill in these metadata fields (note, they can be updated also long after the video was uploaded, but as far as I can tell, only by the uploader):

mediatype: Should be 'movie' for movies.
collection: Should contain 'moviesandfilms'.
title: The title of the movie, without the publication year.
date: The date or year the movie was released. This makes the movie show up in Butter, makes it possible to know the age of the movie, and is useful to figure out copyright status.
director: The director of the movie. This makes it easier to know if the correct movie is found in movie databases.
publisher: The production company making the movie. Also useful for identifying the correct movie.
links: Add a link to the IMDB title page, for example like this: <a href="http://www.imdb.com/title/tt0028496/">Movie in IMDB</a>. This makes it easier to find duplicates and allows for counting the number of unique movies in the Archive. Other external references, like to TMDB, could be added like this too.
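
To make the proposal concrete, here is a minimal sketch of how a script could check an item against the list above and pull the IMDB title ID out of the links field. Representing the metadata as a plain Python dict is an assumption, as are the function names and the made-up example item. Only the field names and the link format come from the proposal.

```python
import re

# Fields from the proposal above.
PROPOSED_FIELDS = ("mediatype", "collection", "title", "date",
                   "director", "publisher", "links")

# Matches the title ID in a link like http://www.imdb.com/title/tt0028496/
IMDB_RE = re.compile(r"imdb\.com/title/(tt\d+)")


def missing_fields(metadata):
    """Return the proposed fields that are absent or empty in a metadata dict."""
    return [field for field in PROPOSED_FIELDS if not metadata.get(field)]


def imdb_title_id(metadata):
    """Extract the IMDB title ID from the free text 'links' field, or None."""
    match = IMDB_RE.search(metadata.get("links", ""))
    return match.group(1) if match else None


# Made-up item for illustration, using the link example from the list above.
example = {
    "mediatype": "movie",
    "collection": ["moviesandfilms"],
    "title": "Example Movie",
    "links": '<a href="http://www.imdb.com/title/tt0028496/">Movie in IMDB</a>',
}
print(missing_fields(example))
print(imdb_title_id(example))
```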

I did consider proposing a custom field for the IMDB title ID (for example 'imdb_title_url', 'imdb_code' or simply 'imdb'), but suspect it will be easier to simply place it in the links free text field.

I created a list of IMDB title IDs for several thousand movies in the Internet Archive, but I also got a list of several thousand movies without such an IMDB title ID (and quite a few duplicates). It would be great if this data set could be integrated into the Internet Archive metadata to be available for everyone in the future, but with the current policy of leaving metadata editing to the uploaders, it will take a while before this happens. If you have uploaded movies into the Internet Archive, you can help. Please consider following my proposal above for your movies, to ensure that they are properly counted. :)

The list is mostly generated using Wikidata, which, based on Wikipedia articles, makes it possible to link between IMDB and movies in the Internet Archive. But there are lots of movies without a Wikipedia article, and some movies where only a collection page exists (like for the Caminandes example above, where there are three movies but only one Wikidata entry).

As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.

Tags: english, opphavsrett, verkidetfri.

Legal to share more than 3000 movies listed on IMDB?

18th November 2017

A month ago, I blogged about my work to automatically check the copyright status of IMDB entries, and try to count the number of movies listed in IMDB that are legal to distribute on the Internet. I have continued to look for good data sources, and identified a few more. The code used to extract information from various data sources is available in a git repository, currently available from github.

So far I have identified 3186 unique IMDB title IDs. To gain better understanding of the structure of the data set, I created a histogram of the year associated with each movie (typically release year). It is interesting to notice where the peaks and dips in the graph are located. I wonder why they are placed there. I suspect World War II caused the dip around 1940, but what caused the peak around 2010?
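
A histogram like this takes only a few lines to produce. The following is a minimal sketch, not the script I used. The function names and the sample years are made up for illustration, and it groups the years per decade and prints text bars instead of drawing a graph.

```python
from collections import Counter


def year_histogram(years, bucket=10):
    """Count movies per bucket of years (per decade by default)."""
    counts = Counter((year // bucket) * bucket for year in years)
    return dict(sorted(counts.items()))


def render(histogram, width=40):
    """Render the histogram as rows of text bars scaled to the largest bucket."""
    top = max(histogram.values())
    return ["%d %s %d" % (start, "#" * max(1, count * width // top), count)
            for start, count in histogram.items()]


# Made-up years for illustration, not taken from the data set.
sample = [1905, 1912, 1915, 1931, 1938, 1939, 2010, 2012]
for row in render(year_histogram(sample)):
    print(row)
```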

I've so far identified ten sources for IMDB title IDs for movies in the public domain or with a free license. These are the statistics reported when running 'make stats' in the git repository:

  249 entries (    6 unique) with and   288 without IMDB title ID in free-movies-archive-org-butter.json
 2301 entries (  540 unique) with and     0 without IMDB title ID in free-movies-archive-org-wikidata.json
  830 entries (   29 unique) with and     0 without IMDB title ID in free-movies-icheckmovies-archive-mochard.json
 2109 entries (  377 unique) with and     0 without IMDB title ID in free-movies-imdb-pd.json
  291 entries (  122 unique) with and     0 without IMDB title ID in free-movies-letterboxd-pd.json
  144 entries (  135 unique) with and     0 without IMDB title ID in free-movies-manual.json
  350 entries (    1 unique) with and   801 without IMDB title ID in free-movies-publicdomainmovies.json
    4 entries (    0 unique) with and   124 without IMDB title ID in free-movies-publicdomainreview.json
  698 entries (  119 unique) with and   118 without IMDB title ID in free-movies-publicdomaintorrents.json
    8 entries (    8 unique) with and   196 without IMDB title ID in free-movies-vodo.json
 3186 unique IMDB title IDs in total

The entries without IMDB title ID are candidates to increase the data set, but might equally well be duplicates of entries already listed with IMDB title ID in one of the other sources, or represent movies that lack an IMDB title ID. I've seen examples of all these situations when peeking at the entries without IMDB title ID. Based on these data sources, the lower bound for movies listed in IMDB that are legal to distribute on the Internet is between 3186 and 4713.
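
The arithmetic behind that range can be spelled out. In this small sketch the low end assumes none of the entries without IMDB title ID add a new movie, and the high end assumes every one of them does. The numbers are copied from the statistics above, while the variable names are only for illustration.

```python
# Entries without IMDB title ID, copied from the 'make stats' output above.
# Sources reporting 0 such entries are left out.
without_id = {
    "free-movies-archive-org-butter.json": 288,
    "free-movies-publicdomainmovies.json": 801,
    "free-movies-publicdomainreview.json": 124,
    "free-movies-publicdomaintorrents.json": 118,
    "free-movies-vodo.json": 196,
}
unique_with_id = 3186

# Low end: every entry without ID is a duplicate or a movie not in IMDB.
low = unique_with_id
# High end: every entry without ID is a distinct movie that is in IMDB.
high = unique_with_id + sum(without_id.values())
print(low, high)
```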

It would be great for improving the accuracy of this measurement, if the various sources added IMDB title ID to their metadata. I have tried to reach the people behind the various sources to ask if they are interested in doing this, without any replies so far. Perhaps you can help me get in touch with the people behind VODO, Public Domain Torrents, Public Domain Movies and Public Domain Review to try to convince them to add more metadata to their movie entries?

Another way you could help is by adding pages to Wikipedia about movies that are legal to distribute on the Internet. If such a page exists and includes a link to both IMDB and The Internet Archive, the script used to generate free-movies-archive-org-wikidata.json should pick up the mapping as soon as Wikidata is updated.

As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.

Tags: english, opphavsrett, verkidetfri.

Some notes on fault tolerant storage systems

1st November 2017

If you care about how fault tolerant your storage is, you might find these articles and papers interesting. They have shaped how I think when designing a storage system.

USENIX ;login: Redundancy Does Not Imply Fault Tolerance. Analysis of Distributed Storage Reactions to Single Errors and Corruptions by Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
ZDNet Why RAID 5 stops working in 2009 by Robin Harris
ZDNet Why RAID 6 stops working in 2019 by Robin Harris
USENIX FAST'07 Failure Trends in a Large Disk Drive Population by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso
USENIX ;login: Data Integrity. Finding Truth in a World of Guesses and Lies by Doug Hughes
USENIX FAST'08 An Analysis of Data Corruption in the Storage Stack by L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau
USENIX FAST'07 Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? by B. Schroeder and G. A. Gibson.
USENIX ;login: Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky
SIGMETRICS 2007 An analysis of latent sector errors in disk drives by L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler

Several of these research papers are based on data collected from hundreds of thousands or millions of disks, and their findings are eye opening. The short story is simply: do not implicitly trust RAID or redundant storage systems. Details matter. And unfortunately there are few options on Linux addressing all the identified issues. Both ZFS and Btrfs are doing a fairly good job, but have legal and practical issues of their own. I wonder how cluster file systems like Ceph do in this regard. After all, there is an old saying: you know you have a distributed system when the crash of a computer you have never heard of stops you from getting any work done. The same holds true if fault tolerance does not work.

Just remember, in the end, it does not matter how redundant or how fault tolerant your storage is, if you do not continuously monitor its status to detect and replace failed disks.
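
As one example of what such monitoring can look like, here is a minimal sketch that looks for degraded Linux software RAID arrays in the text of /proc/mdstat. It assumes the usual "[2/1] [U_]" status notation, the function name and the sample text are made up for illustration, and it only covers md arrays. It does nothing to detect the silent corruption discussed in the papers above.

```python
import re

# Matches the member status in /proc/mdstat, e.g. "[2/1] [U_]".
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
# Matches the line starting an array description, e.g. "md0 : active raid1 ...".
HEADER_RE = re.compile(r"^(md\S*)\s*:")


def degraded_arrays(mdstat_text):
    """Return names of md arrays with fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Hand-written sample in the style of /proc/mdstat, for illustration only.
sample = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048576 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      2097152 blocks [2/1] [U_]

unused devices: <none>
"""
print(degraded_arrays(sample))
```

On a real machine the text would come from reading /proc/mdstat, and a non-empty result should end up as an alert somewhere a human will see it.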

As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.

Tags: english, raid, sysadmin.

Web services for writing academic LaTeX papers as a team

31st October 2017

I was surprised today to learn that a friend in academia did not know there are easily available web services for writing LaTeX documents as a team. I thought it was common knowledge, but to make sure at least my readers are aware of it, I would like to mention these useful services for writing LaTeX documents. Some of them even provide a WYSIWYG editor to ease writing even further.

There are two commercial services available, ShareLaTeX and Overleaf. They are very easy to use. Just start a new document, select which publisher to write for (i.e. which LaTeX style to use), and start writing. Note, these two have announced their intention to join forces, so soon there will only be one joint service. I've used both for different documents, and they work just fine. ShareLaTeX is free software, while Overleaf is not. According to an announcement from Overleaf, they plan to keep the ShareLaTeX code base maintained as free software.

But these two are not the only alternatives. Fidus Writer is another free software solution with the source available on github. I have not used it myself. Several others can be found on the nice AlternativeTo web service.

If you like Google Docs or Etherpad, but would like to write documents in LaTeX, you should check out these services. You can even host your own, if you want to. :)

As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.

Tags: english.

Locating IMDB IDs of movies in the Internet Archive using Wikidata

25th October 2017

Recently, I needed to automatically check the copyright status of a set of The Internet Movie Database (IMDB) entries, to figure out which of the movies they refer to can be freely distributed on the Internet. This proved to be harder than it sounds. IMDB for sure lists movies without any copyright protection, where the copyright protection has expired or where the movie is licensed using a permissive license like one from Creative Commons. These are mixed with copyright protected movies, and there seems to be no way to separate these classes of movies using the information in IMDB.

First I tried to look up entries manually in IMDB, Wikipedia and The Internet Archive, to get a feel for how to do this. It is hard to know for sure using these sources, but it should be possible to be reasonably confident a movie is "out of copyright" with a few hours of work per movie. As I needed to check almost 20,000 entries, this approach was not sustainable. I simply can not work around the clock for about 6 years to check this data set.

I asked the people behind The Internet Archive if they could introduce a new metadata field in their metadata XML for IMDB ID, but was told that they leave it completely to the uploaders to update the metadata. Some of the metadata entries had IMDB links in the description, but I found no way to download all metadata files in bulk to locate those ones and put that approach aside.

In the process I noticed several Wikipedia articles about movies had links to both IMDB and The Internet Archive, and it occurred to me that I could use the Wikipedia RDF data set to locate entries with both, to at least get a lower bound on the number of movies on The Internet Archive with an IMDB ID. This is useful based on the assumption that movies distributed by The Internet Archive can be legally distributed on the Internet. With some help from the RDF community (thank you DanC), I was able to come up with this query to pass to the SPARQL interface on Wikidata:

SELECT ?work ?imdb ?ia ?when ?label
WHERE
{
  ?work wdt:P31/wdt:P279* wd:Q11424.
  ?work wdt:P345 ?imdb.
  ?work wdt:P724 ?ia.
  OPTIONAL {
        ?work wdt:P577 ?when.
        ?work rdfs:label ?label.
        FILTER(LANG(?label) = "en").
  }
}

If I understand the query right, for every film entry anywhere in Wikipedia, it will return the IMDB ID and The Internet Archive ID, and when the movie was released and its English title, if either or both of the latter two are available. At the moment the result set contains 2338 entries. Of course, it depends on volunteers including both correct IMDB and The Internet Archive IDs in the Wikipedia articles for the movie. It should be noted that the result will include duplicates if the movie has entries in several languages. There are some bogus entries, either because The Internet Archive ID contains a typo or because the movie is not available from The Internet Archive. I did not verify the IMDB IDs, as I am unsure how to do that automatically.
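
For those who want to run the query from a script, here is a minimal sketch of how to build the request and read the answer. The endpoint URL and its query and format parameters are assumptions about the Wikidata query service, the function names are made up, and the sample response is hand-written in the SPARQL JSON results format with a placeholder entity URI. It is not the script I used.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wikidata query service.
ENDPOINT = "https://query.wikidata.org/sparql"


def sparql_url(query):
    """Build a GET URL asking the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def parse_bindings(document):
    """Flatten SPARQL JSON results into dicts of variable name to value."""
    rows = []
    for binding in document["results"]["bindings"]:
        # Variables from the OPTIONAL part are simply absent when unbound.
        rows.append({name: cell["value"] for name, cell in binding.items()})
    return rows


# Hand-written response for illustration; Q0 is a placeholder entity.
sample = json.loads("""
{"head": {"vars": ["work", "imdb", "ia"]},
 "results": {"bindings": [
   {"work": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"},
    "imdb": {"type": "literal", "value": "tt0036823"},
    "ia": {"type": "literal", "value": "FightingLady"}}]}}
""")
for row in parse_bindings(sample):
    print(row["imdb"], row["ia"], row.get("when"))
```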

I wrote a small python script to extract the data set from Wikidata and check if the XML metadata for the movie is available from The Internet Archive, and after around 1.5 hours it produced a list of 2097 free movies and their IMDB ID. In total, 171 entries in Wikidata lack the referenced Internet Archive entry. I assume the 70 "disappearing" entries (i.e. 2338-2097-171) are duplicate entries.
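
The bookkeeping behind those three numbers can be sketched like this. It is not the actual script. The function names are made up, the URL pattern for the metadata XML is an assumption, and the availability check is passed in as a parameter so the counting can be tried without touching the network.

```python
from urllib.parse import quote
from urllib.request import urlopen


def metadata_exists(ia_id):
    """Assumed check: try to fetch the item's _meta.xml from The Internet Archive."""
    name = quote(ia_id)
    url = "https://archive.org/download/%s/%s_meta.xml" % (name, name)
    try:
        with urlopen(url, timeout=30) as response:
            return response.status == 200
    except OSError:  # covers HTTP errors, unreachable hosts and timeouts
        return False


def classify(entries, is_available):
    """Split (imdb_id, ia_id) pairs into available, missing and duplicate."""
    seen = set()
    available, missing, duplicates = [], [], []
    for entry in entries:
        if entry in seen:
            duplicates.append(entry)
            continue
        seen.add(entry)
        if is_available(entry[1]):
            available.append(entry)
        else:
            missing.append(entry)
    return available, missing, duplicates


# Made-up entries and a fake checker, for illustration only.
entries = [("tt1", "a"), ("tt1", "a"), ("tt2", "b"), ("tt3", "gone")]
available, missing, duplicates = classify(entries, lambda ia_id: ia_id != "gone")
print(len(available), len(missing), len(duplicates))
```

Passing metadata_exists instead of the fake checker would do the real lookups, one request per entry, which goes some way to explain why a run over a couple of thousand entries takes a while.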

This is not too bad, given that The Internet Archive reports that it contains 5331 feature films at the moment, but it also means more than 3000 movies are missing on Wikipedia or are missing the pair of references on Wikipedia.

I was curious about the distribution by release year, and made a little graph to show how the number of free movies is spread over the years:

I expect the relative distribution of the remaining 3000 movies to be similar.

If you want to help, and want to ensure Wikipedia can be used to cross reference The Internet Archive and The Internet Movie Database, please make sure entries like this are listed under the "External links" heading on the Wikipedia article for the movie:

* {{Internet Archive film|id=FightingLady}}
* {{IMDb title|id=0036823|title=The Fighting Lady}}

Please verify the links on the final page, to make sure you did not introduce a typo.
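
If you have many movies to add, the two lines can be generated instead of typed, which reduces the risk of typos. A minimal sketch follows. The function name is made up, and the only assumption beyond the example above is that the IMDb template wants the title ID without its "tt" prefix, as the example suggests.

```python
def external_links(ia_id, imdb_id, title):
    """Return the two wiki template lines for a movie."""
    # The example above uses the numeric part of the IMDB title ID only.
    number = imdb_id[2:] if imdb_id.startswith("tt") else imdb_id
    return [
        "* {{Internet Archive film|id=%s}}" % ia_id,
        "* {{IMDb title|id=%s|title=%s}}" % (number, title),
    ]


for line in external_links("FightingLady", "tt0036823", "The Fighting Lady"):
    print(line)
```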

Here is the complete list, if you want to correct the 171 identified Wikipedia entries with broken links to The Internet Archive: Q1140317, Q458656, Q458656, Q470560, Q743340, Q822580, Q480696, Q128761, Q1307059, Q1335091, Q1537166, Q1438334, Q1479751, Q1497200, Q1498122, Q865973, Q834269, Q841781, Q841781, Q1548193, Q499031, Q1564769, Q1585239, Q1585569, Q1624236, Q4796595, Q4853469, Q4873046, Q915016, Q4660396, Q4677708, Q4738449, Q4756096, Q4766785, Q880357, Q882066, Q882066, Q204191, Q204191, Q1194170, Q940014, Q946863, Q172837, Q573077, Q1219005, Q1219599, Q1643798, Q1656352, Q1659549, Q1660007, Q1698154, Q1737980, Q1877284, Q1199354, Q1199354, Q1199451, Q1211871, Q1212179, Q1238382, Q4906454, Q320219, Q1148649, Q645094, Q5050350, Q5166548, Q2677926, Q2698139, Q2707305, Q2740725, Q2024780, Q2117418, Q2138984, Q1127992, Q1058087, Q1070484, Q1080080, Q1090813, Q1251918, Q1254110, Q1257070, Q1257079, Q1197410, Q1198423, Q706951, Q723239, Q2079261, Q1171364, Q617858, Q5166611, Q5166611, Q324513, Q374172, Q7533269, Q970386, Q976849, Q7458614, Q5347416, Q5460005, Q5463392, Q3038555, Q5288458, Q2346516, Q5183645, Q5185497, Q5216127, Q5223127, Q5261159, Q1300759, Q5521241, Q7733434, Q7736264, Q7737032, Q7882671, Q7719427, Q7719444, Q7722575, Q2629763, Q2640346, Q2649671, Q7703851, Q7747041, Q6544949, Q6672759, Q2445896, Q12124891, Q3127044, Q2511262, Q2517672, Q2543165, Q426628, Q426628, Q12126890, Q13359969, Q13359969, Q2294295, Q2294295, Q2559509, Q2559912, Q7760469, Q6703974, Q4744, Q7766962, Q7768516, Q7769205, Q7769988, Q2946945, Q3212086, Q3212086, Q18218448, Q18218448, Q18218448, Q6909175, Q7405709, Q7416149, Q7239952, Q7317332, Q7783674, Q7783704, Q7857590, Q3372526, Q3372642, Q3372816, Q3372909, Q7959649, Q7977485, Q7992684, Q3817966, Q3821852, Q3420907, Q3429733, Q774474

As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.

Tags: english, opphavsrett, verkidetfri.