blog/archive/2017/11/11.rss

   1 <?xml version="1.0" encoding="ISO-8859-1"?>
   2 <rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/'>
   3         <channel>
   4                 <title>Petter Reinholdtsen - Entries from November 2017</title>
   5                 <description>Entries from November 2017</description>
   6                 <link>http://people.skolelinux.org/pere/blog/</link>
   7
   8
   9         <item>
  10                 <title>Metadata proposal for movies on the Internet Archive</title>
  11                 <link>http://people.skolelinux.org/pere/blog/Metadata_proposal_for_movies_on_the_Internet_Archive.html</link>
  12                 <guid isPermaLink="true">http://people.skolelinux.org/pere/blog/Metadata_proposal_for_movies_on_the_Internet_Archive.html</guid>
  13                 <pubDate>Tue, 28 Nov 2017 12:00:00 +0100</pubDate>
  14                 <description>&lt;p&gt;It would be easier to locate the movie you want to watch in
  15 &lt;a href=&quot;https://www.archive.org/&quot;&gt;the Internet Archive&lt;/a&gt;, if the
  16 metadata about each movie was more complete and accurate.  In the
  17 archiving community, a well-known saying states that good metadata is a
  18 love letter to the future.  The metadata in the Internet Archive could
  19 use a face lift for the future to love us back.  Here is a proposal
  20 for a small improvement that would make the metadata more useful
  21 today.  I&#39;ve been unable to find any document describing the various
  22 standard fields available when uploading videos to the archive, so
  23 this proposal is based on my best guess and on searching through several
  24 of the existing movies.&lt;/p&gt;
  25
  26 &lt;p&gt;I have a few use cases in mind.  First of all, I would like to be
  27 able to count the number of distinct movies in the Internet Archive,
  28 without duplicates.  I would further like to identify the IMDB title
  29 ID of the movies in the Internet Archive, to be able to look up an IMDB
  30 title ID and know if I can fetch the video from there and share it
  31 with my friends.&lt;/p&gt;
  32
  33 &lt;p&gt;Second, I would like the Butter data provider for the Internet
  34 Archive
  35 (&lt;a href=&quot;https://github.com/butterproviders/butter-provider-archive&quot;&gt;available
  36 from GitHub&lt;/a&gt;), to list as many of the good movies as possible.  The
  37 plugin currently does a search in the archive with the following
  38 parameters:&lt;/p&gt;
  39
  40 &lt;p&gt;&lt;pre&gt;
  41 collection:moviesandfilms
  42 AND NOT collection:movie_trailers
  43 AND -mediatype:collection
  44 AND format:&quot;Archive BitTorrent&quot;
  45 AND year
  46 &lt;/pre&gt;&lt;/p&gt;
  47
  48 &lt;p&gt;Most of the cool movies that fail to show up in Butter do so
  49 because the &#39;year&#39; field is missing.  The &#39;year&#39; field is populated by
  50 the year part from the &#39;date&#39; field, and should be when the movie was
  51 released (date or year).  Two such examples are
  52 &lt;a href=&quot;https://archive.org/details/SidneyOlcottsBen-hur1905&quot;&gt;Ben Hur
  53 from 1905&lt;/a&gt; and
  54 &lt;a href=&quot;https://archive.org/details/Caminandes2GranDillama&quot;&gt;Caminandes
  55 2: Gran Dillama from 2013&lt;/a&gt;, where the year metadata field is
  56 missing.&lt;/p&gt;
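
&lt;p&gt;As a rough illustration of the search listed above, here is a
minimal Python sketch that assembles the same query terms into a URL
for the archive.org advanced search endpoint.  The request parameters
(fl[], rows, output) and the helper name build_search_url are
assumptions made for illustration and are not taken from the Butter
provider source.  The sketch only builds the URL and does not send any
request.&lt;/p&gt;

&lt;pre&gt;
```python
from urllib.parse import urlencode

# Query terms copied from the search listed above.
QUERY_TERMS = [
    "collection:moviesandfilms",
    "NOT collection:movie_trailers",
    "-mediatype:collection",
    'format:"Archive BitTorrent"',
    "year",
]


def build_search_url(rows=50):
    """Return an advanced search URL for the query (hypothetical helper)."""
    query = " AND ".join(QUERY_TERMS)
    # Assumed request parameters: which fields to return, how many rows,
    # and the output format.
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("fl[]", "year"),
        ("rows", rows),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url())
```
&lt;/pre&gt;

&lt;p&gt;A movie without the year field would never match the last term of
this query, which is why filling in the date matters.&lt;/p&gt;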
  57
  58 &lt;p&gt;So, my proposal is simply this: for every movie in the Internet Archive
  59 where an IMDB title ID exists, please fill in these metadata fields
  60 (note that they can also be updated long after the video was uploaded, but
  61 as far as I can tell, only by the uploader):&lt;/p&gt;
  62
  63 &lt;dl&gt;
  64
  65 &lt;dt&gt;mediatype&lt;/dt&gt;
  66 &lt;dd&gt;Should be &#39;movie&#39; for movies.&lt;/dd&gt;
  67
  68 &lt;dt&gt;collection&lt;/dt&gt;
  69 &lt;dd&gt;Should contain &#39;moviesandfilms&#39;.&lt;/dd&gt;
  70
  71 &lt;dt&gt;title&lt;/dt&gt;
  72 &lt;dd&gt;The title of the movie, without the publication year.&lt;/dd&gt;
  73
  74 &lt;dt&gt;date&lt;/dt&gt;
  75 &lt;dd&gt;The date or year the movie was released.  This makes the movie show
  76 up in Butter, makes it possible to know the age of the movie, and is
  77 useful for figuring out its copyright status.&lt;/dd&gt;
  78
  79 &lt;dt&gt;director&lt;/dt&gt;
  80 &lt;dd&gt;The director of the movie.  This makes it easier to know if the
  81 correct movie is found in movie databases.&lt;/dd&gt;
  82
  83 &lt;dt&gt;publisher&lt;/dt&gt;
  84 &lt;dd&gt;The production company making the movie.  Also useful for
  85 identifying the correct movie.&lt;/dd&gt;
  86
  87 &lt;dt&gt;links&lt;/dt&gt;
  88
  89 &lt;dd&gt;Add a link to the IMDB title page, for example like this: &amp;lt;a
  90 href=&quot;http://www.imdb.com/title/tt0028496/&quot;&amp;gt;Movie in
  91 IMDB&amp;lt;/a&amp;gt;.  This makes it easier to find duplicates and allows
  92 counting the number of unique movies in the Archive.  Other external
  93 references, like to TMDB, could be added like this too.&lt;/dd&gt;
  94
  95 &lt;/dl&gt;
  96
  97 &lt;p&gt;I did consider proposing a custom field for the IMDB title ID (for
  98 example &#39;imdb_title_url&#39;, &#39;imdb_code&#39; or simply &#39;imdb&#39;), but suspect it
  99 will be easier to simply place it in the links free text field.&lt;/p&gt;
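
&lt;p&gt;A consumer of the links field could then recover the IMDB title ID
with a simple pattern match.  The following is a minimal Python sketch
under that assumption.  The function name extract_imdb_ids and the
sample text are hypothetical, and the pattern assumes that title IDs
look like tt followed by digits, as in the example URL above.&lt;/p&gt;

&lt;pre&gt;
```python
import re

# Assumed pattern: IMDB title URLs contain /title/tt followed by digits.
IMDB_TITLE_RE = re.compile(r"imdb\.com/title/(tt\d+)")


def extract_imdb_ids(links_text):
    """Return the distinct IMDB title IDs found in a free text field, in order."""
    seen = []
    for title_id in IMDB_TITLE_RE.findall(links_text):
        if title_id not in seen:
            seen.append(title_id)
    return seen


if __name__ == "__main__":
    sample = "Movie in IMDB: http://www.imdb.com/title/tt0028496/"
    print(extract_imdb_ids(sample))
```
&lt;/pre&gt;

&lt;p&gt;Two archive entries yielding the same title ID would then be
candidates for being duplicates of the same movie.&lt;/p&gt;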
 100
 101 &lt;p&gt;I created
 102 &lt;a href=&quot;https://github.com/petterreinholdtsen/public-domain-free-imdb&quot;&gt;a
 103 list of IMDB title IDs for several thousand movies in the Internet
 104 Archive&lt;/a&gt;, but I also have a list of several thousand movies without
 105 an IMDB title ID (and quite a few duplicates).  It would be great if
 106 this data set could be integrated into the Internet Archive metadata
 107 to be available for everyone in the future, but with the current
 108 policy of leaving metadata editing to the uploaders, it will take a
 109 while before this happens.  If you have uploaded movies into the
 110 Internet Archive, you can help.  Please consider following my proposal
 111 above for your movies, to ensure that each movie is properly
 112 counted. :)&lt;/p&gt;
 113
 114 &lt;p&gt;The list is mostly generated using Wikidata, which, based on
 115 Wikipedia articles, makes it possible to link between IMDB and movies in
 116 the Internet Archive.  But there are lots of movies without a
 117 Wikipedia article, and some movies where only a collection page exists
 118 (like for &lt;a href=&quot;https://en.wikipedia.org/wiki/Caminandes&quot;&gt;the
 119 Caminandes example above&lt;/a&gt;, where there are three movies but only
 120 one Wikidata entry).&lt;/p&gt;
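
&lt;p&gt;For readers curious how such a mapping can be pulled from Wikidata,
here is a minimal Python sketch that builds a SPARQL query URL.  It
assumes the Wikidata properties P345 (IMDb ID) and P724 (Internet
Archive ID) and the public query service at query.wikidata.org.  It is
not the script from the repository, the helper name build_wikidata_url
is hypothetical, and the sketch only constructs the URL without
running the query.&lt;/p&gt;

&lt;pre&gt;
```python
from urllib.parse import urlencode

# Assumed Wikidata properties: P345 is the IMDb ID and P724 is the
# Internet Archive ID.  The filter keeps only title IDs (tt...).
SPARQL_QUERY = """
SELECT ?item ?imdb ?archive WHERE {
  ?item wdt:P345 ?imdb .
  ?item wdt:P724 ?archive .
  FILTER(STRSTARTS(?imdb, "tt"))
}
"""


def build_wikidata_url(query=SPARQL_QUERY):
    """Return a URL asking the Wikidata query service for JSON results."""
    params = {"query": query.strip(), "format": "json"}
    return "https://query.wikidata.org/sparql?" + urlencode(params)


if __name__ == "__main__":
    print(build_wikidata_url())
```
&lt;/pre&gt;

&lt;p&gt;A movie only shows up in such a query if its Wikidata entry carries
both identifiers, which is why collection pages like the Caminandes one
lead to gaps.&lt;/p&gt;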
 121 </description>
 122         </item>
 123
 124         <item>
 125                 <title>Legal to share more than 3000 movies listed on IMDB?</title>
 126                 <link>http://people.skolelinux.org/pere/blog/Legal_to_share_more_than_3000_movies_listed_on_IMDB_.html</link>
 127                 <guid isPermaLink="true">http://people.skolelinux.org/pere/blog/Legal_to_share_more_than_3000_movies_listed_on_IMDB_.html</guid>
 128                 <pubDate>Sat, 18 Nov 2017 21:20:00 +0100</pubDate>
 129                 <description>&lt;p&gt;A month ago, I blogged about my work to
 130 &lt;a href=&quot;http://people.skolelinux.org/pere/blog/Locating_IMDB_IDs_of_movies_in_the_Internet_Archive_using_Wikidata.html&quot;&gt;automatically
 131 check the copyright status of IMDB entries&lt;/a&gt;, and try to count the
 132 number of movies listed in IMDB that are legal to distribute on the
 133 Internet.  I have continued to look for good data sources, and
 134 identified a few more.  The code used to extract information from
 135 various data sources is available in
 136 &lt;a href=&quot;https://github.com/petterreinholdtsen/public-domain-free-imdb&quot;&gt;a
 137 git repository&lt;/a&gt;, currently available from GitHub.&lt;/p&gt;
 138
 139 &lt;p&gt;So far I have identified 3186 unique IMDB title IDs.  To gain a
 140 better understanding of the structure of the data set, I created a
 141 histogram of the year associated with each movie (typically release
 142 year).  It is interesting to notice where the peaks and dips in the
 143 graph are located.  I wonder why they are placed there.  I suspect
 144 World War II caused the dip around 1940, but what caused the peak
 145 around 2010?&lt;/p&gt;
 146
 147 &lt;p align=&quot;center&quot;&gt;&lt;img src=&quot;http://people.skolelinux.org/pere/blog/images/2017-11-18-verk-i-det-fri-filmer.png&quot; /&gt;&lt;/p&gt;
 148
 149 &lt;p&gt;I&#39;ve so far identified ten sources for IMDB title IDs for movies in
 150 the public domain or with a free license.  These are the statistics
 151 reported when running &#39;make stats&#39; in the git repository:&lt;/p&gt;
 152
 153 &lt;pre&gt;
 154   249 entries (    6 unique) with and   288 without IMDB title ID in free-movies-archive-org-butter.json
 155  2301 entries (  540 unique) with and     0 without IMDB title ID in free-movies-archive-org-wikidata.json
 156   830 entries (   29 unique) with and     0 without IMDB title ID in free-movies-icheckmovies-archive-mochard.json
 157  2109 entries (  377 unique) with and     0 without IMDB title ID in free-movies-imdb-pd.json
 158   291 entries (  122 unique) with and     0 without IMDB title ID in free-movies-letterboxd-pd.json
 159   144 entries (  135 unique) with and     0 without IMDB title ID in free-movies-manual.json
 160   350 entries (    1 unique) with and   801 without IMDB title ID in free-movies-publicdomainmovies.json
 161     4 entries (    0 unique) with and   124 without IMDB title ID in free-movies-publicdomainreview.json
 162   698 entries (  119 unique) with and   118 without IMDB title ID in free-movies-publicdomaintorrents.json
 163     8 entries (    8 unique) with and   196 without IMDB title ID in free-movies-vodo.json
 164  3186 unique IMDB title IDs in total
 165 &lt;/pre&gt;
 166
 167 &lt;p&gt;The entries without IMDB title ID are candidates to increase the
 168 data set, but might equally well be duplicates of entries already
 169 listed with IMDB title ID in one of the other sources, or represent
 170 movies that lack an IMDB title ID.  I&#39;ve seen examples of all these
 171 situations when peeking at the entries without IMDB title ID.  Based
 172 on these data sources, the lower bound for movies listed in IMDB that
 173 are legal to distribute on the Internet is between 3186 and 4713.&lt;/p&gt;
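
&lt;p&gt;The two figures can be reproduced from the statistics above: 3186
is the number of unique IMDB title IDs, and 4713 is what you get if
every entry without an IMDB title ID turns out to be a new movie
listed in IMDB.  A small Python sketch of that arithmetic follows,
with the numbers copied from the make stats output.  The names used
are illustrative only.&lt;/p&gt;

&lt;pre&gt;
```python
# Entries without IMDB title ID per source, copied from the statistics above.
WITHOUT_ID = {
    "free-movies-archive-org-butter.json": 288,
    "free-movies-archive-org-wikidata.json": 0,
    "free-movies-icheckmovies-archive-mochard.json": 0,
    "free-movies-imdb-pd.json": 0,
    "free-movies-letterboxd-pd.json": 0,
    "free-movies-manual.json": 0,
    "free-movies-publicdomainmovies.json": 801,
    "free-movies-publicdomainreview.json": 124,
    "free-movies-publicdomaintorrents.json": 118,
    "free-movies-vodo.json": 196,
}

# Total number of unique IMDB title IDs reported by make stats.
UNIQUE_WITH_ID = 3186


def estimate_range(unique_with_id, without_id):
    """Return (low, high), where high counts every entry without an ID as new."""
    return unique_with_id, unique_with_id + sum(without_id.values())


if __name__ == "__main__":
    low, high = estimate_range(UNIQUE_WITH_ID, WITHOUT_ID)
    print(low, high)  # prints: 3186 4713
```
&lt;/pre&gt;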
 174
 175 &lt;p&gt;It would greatly improve the accuracy of this measurement
 176 if the various sources added IMDB title IDs to their metadata.  I have
 177 tried to reach the people behind the various sources to ask if they
 178 are interested in doing this, without any replies so far.  Perhaps you
 179 can help me get in touch with the people behind VODO, Public Domain
 180 Torrents, Public Domain Movies and Public Domain Review to try to
 181 convince them to add more metadata to their movie entries?&lt;/p&gt;
 182
 183 &lt;p&gt;Another way you could help is by adding pages to Wikipedia about
 184 movies that are legal to distribute on the Internet.  If such a page
 185 exists and includes links to both IMDB and the Internet Archive, the
 186 script used to generate free-movies-archive-org-wikidata.json should
 187 pick up the mapping as soon as Wikidata is updated.&lt;/p&gt;
 188
 189 &lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
 190 activities, please send Bitcoin donations to my address
 191 &lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;
 192 </description>
 193         </item>
 194
 195         <item>
 196                 <title>Some notes on fault tolerant storage systems</title>
 197                 <link>http://people.skolelinux.org/pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</link>
 198                 <guid isPermaLink="true">http://people.skolelinux.org/pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</guid>
 199                 <pubDate>Wed, 1 Nov 2017 15:35:00 +0100</pubDate>
 200                 <description>&lt;p&gt;If you care about how fault tolerant your storage is, you might
 201 find these articles and papers interesting.  They have shaped how I
 202 think when designing a storage system.&lt;/p&gt;
 203
 204 &lt;ul&gt;
 205
 206 &lt;li&gt;USENIX ;login: &lt;a
 207 href=&quot;https://www.usenix.org/publications/login/summer2017/ganesan&quot;&gt;Redundancy
 208 Does Not Imply Fault Tolerance.  Analysis of Distributed Storage
 209 Reactions to Single Errors and Corruptions&lt;/a&gt; by Aishwarya Ganesan,
 210 Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi
 211 H. Arpaci-Dusseau&lt;/li&gt;
 212
 213 &lt;li&gt;ZDNet
 214 &lt;a href=&quot;http://www.zdnet.com/article/why-raid-5-stops-working-in-2009/&quot;&gt;Why
 215 RAID 5 stops working in 2009&lt;/a&gt; by Robin Harris&lt;/li&gt;
 216
 217 &lt;li&gt;ZDNet
 218 &lt;a href=&quot;http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/&quot;&gt;Why
 219 RAID 6 stops working in 2019&lt;/a&gt; by Robin Harris&lt;/li&gt;
 220
 221 &lt;li&gt;USENIX FAST&#39;07
 222 &lt;a href=&quot;http://research.google.com/archive/disk_failures.pdf&quot;&gt;Failure
 223 Trends in a Large Disk Drive Population&lt;/a&gt; by Eduardo Pinheiro,
 224 Wolf-Dietrich Weber and Luiz André Barroso&lt;/li&gt;
 225
 226 &lt;li&gt;USENIX ;login: &lt;a
 227 href=&quot;https://www.usenix.org/system/files/login/articles/hughes12-04.pdf&quot;&gt;Data
 228 Integrity.  Finding Truth in a World of Guesses and Lies&lt;/a&gt; by Doug
 229 Hughes&lt;/li&gt;
 230
 231 &lt;li&gt;USENIX FAST&#39;08
 232 &lt;a href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/&quot;&gt;An
 233 Analysis of Data Corruption in the Storage Stack&lt;/a&gt; by
 234 L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C.
 235 Arpaci-Dusseau, and R. H. Arpaci-Dusseau&lt;/li&gt;
 236
 237 &lt;li&gt;USENIX FAST&#39;07 &lt;a
 238 href=&quot;https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder_html/&quot;&gt;Disk
 239 failures in the real world: what does an MTTF of 1,000,000 hours mean
 240 to you?&lt;/a&gt; by B. Schroeder and G. A. Gibson.&lt;/li&gt;
 241
 242 &lt;li&gt;USENIX FAST&#39;08 &lt;a
 243 href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/jiang/jiang_html/&quot;&gt;Are
 244 Disks the Dominant Contributor for Storage Failures?  A Comprehensive
 245 Study of Storage Subsystem Failure Characteristics&lt;/a&gt; by Weihang
 246 Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky&lt;/li&gt;
 247
 248 &lt;li&gt;SIGMETRICS 2007
 249 &lt;a href=&quot;http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf&quot;&gt;An
 250 analysis of latent sector errors in disk drives&lt;/a&gt; by
 251 L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler&lt;/li&gt;
 252
 253 &lt;/ul&gt;
 254
 255 &lt;p&gt;Several of these research papers are based on data collected from
 256 hundreds of thousands or millions of disks, and their findings are
 257 eye-opening.  The short story is simply: do not implicitly trust RAID or
 258 redundant storage systems.  Details matter.  And unfortunately there
 259 are few options on Linux addressing all the identified issues.  Both
 260 ZFS and Btrfs are doing a fairly good job, but have legal and
 261 practical issues of their own.  I wonder how cluster file systems like
 262 Ceph do in this regard.  After all, there is an old saying, you know
 263 you have a distributed system when the crash of a computer you have
 264 never heard of stops you from getting any work done.  The same holds
 265 true if fault tolerance does not work.&lt;/p&gt;
 266
 267 &lt;p&gt;Just remember, in the end, it does not matter how redundant, or how
 268 fault tolerant your storage is, if you do not continuously monitor its
 269 status to detect and replace failed disks.&lt;/p&gt;
 270
 271 &lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
 272 activities, please send Bitcoin donations to my address
 273 &lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;
 274 </description>
 275         </item>
 276
 277         </channel>
 278 </rss>