blog/archive/2017/11/11.rss

   1 <?xml version="1.0" encoding="ISO-8859-1"?>
   2 <rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/'>
   3         <channel>
   4                 <title>Petter Reinholdtsen - Entries from November 2017</title>
   5                 <description>Entries from November 2017</description>
   6                 <link>https://www.hungry.com/~pere/blog/</link>
   7
   8
   9         <item>
  10                 <title>Metadata proposal for movies on the Internet Archive</title>
  11                 <link>https://www.hungry.com/~pere/blog/Metadata_proposal_for_movies_on_the_Internet_Archive.html</link>
  12                 <guid isPermaLink="true">https://www.hungry.com/~pere/blog/Metadata_proposal_for_movies_on_the_Internet_Archive.html</guid>
  13                 <pubDate>Tue, 28 Nov 2017 12:00:00 +0100</pubDate>
  14                 <description>&lt;p&gt;It would be easier to locate the movie you want to watch in
  15 &lt;a href=&quot;https://www.archive.org/&quot;&gt;the Internet Archive&lt;/a&gt;, if the
  16 metadata about each movie was more complete and accurate.  In the
  17 archiving community, a well-known saying states that good metadata is a
  18 love letter to the future.  The metadata in the Internet Archive could
  19 use a face lift for the future to love us back.  Here is a proposal
  20 for a small improvement that would make the metadata more useful
  21 today.  I&#39;ve been unable to find any document describing the various
  22 standard fields available when uploading videos to the archive, so
  23 this proposal is based on my best guess and searching through several
  24 of the existing movies.&lt;/p&gt;
  25
  26 &lt;p&gt;I have a few use cases in mind.  First of all, I would like to be
  27 able to count the number of distinct movies in the Internet Archive,
  28 without duplicates.  I would further like to identify the IMDB title
  29 ID of the movies in the Internet Archive, to be able to look up an IMDB
  30 title ID and know if I can fetch the video from there and share it
  31 with my friends.&lt;/p&gt;
  32
  33 &lt;p&gt;Second, I would like the Butter data provider for the Internet
  34 Archive
  35 (&lt;a href=&quot;https://github.com/butterproviders/butter-provider-archive&quot;&gt;available
  36 from GitHub&lt;/a&gt;), to list as many of the good movies as possible.  The
  37 plugin currently does a search in the archive with the following
  38 parameters:&lt;/p&gt;
  39
  40 &lt;p&gt;&lt;pre&gt;
  41 collection:moviesandfilms
  42 AND NOT collection:movie_trailers
  43 AND -mediatype:collection
  44 AND format:&quot;Archive BitTorrent&quot;
  45 AND year
  46 &lt;/pre&gt;&lt;/p&gt;
  47
  48 &lt;p&gt;Most of the cool movies that fail to show up in Butter do so
  49 because the &#39;year&#39; field is missing.  The &#39;year&#39; field is populated by
  50 the year part from the &#39;date&#39; field, and should be when the movie was
  51 released (date or year).  Two such examples are
  52 &lt;a href=&quot;https://archive.org/details/SidneyOlcottsBen-hur1905&quot;&gt;Ben Hur
  53 from 1905&lt;/a&gt; and
  54 &lt;a href=&quot;https://archive.org/details/Caminandes2GranDillama&quot;&gt;Caminandes
  55 2: Gran Dillama from 2013&lt;/a&gt;, where the year metadata field is
  56 missing.&lt;/p&gt;
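&lt;p&gt;As a minimal sketch, the search terms above can be joined into a
query URL.  This assumes the public advancedsearch.php endpoint of the
archive; the returned fields (fl[]) and the row count are illustrative
choices, and the Butter plugin may well build its request differently.
No request is sent, the code only constructs the URL.&lt;/p&gt;

&lt;pre&gt;
```python
from urllib.parse import urlencode

# Search terms copied from the query listed above.
TERMS = [
    "collection:moviesandfilms",
    "NOT collection:movie_trailers",
    "-mediatype:collection",
    'format:"Archive BitTorrent"',
    "year",
]


def build_search_url(terms, rows=50):
    """Return a search URL for the given terms (no request is made)."""
    params = [
        ("q", " AND ".join(terms)),
        # Assumed field selection, not taken from the plugin.
        ("fl[]", "identifier"),
        ("fl[]", "title"),
        ("fl[]", "year"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


url = build_search_url(TERMS)
print(url)
```
&lt;/pre&gt;

&lt;p&gt;Dropping the final &quot;year&quot; term from the list is a quick way to see
which movies are hidden only because the field is missing.&lt;/p&gt;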
  57
  58 &lt;p&gt;So, my proposal is simply this: for every movie in the Internet Archive
  59 where an IMDB title ID exists, please fill in these metadata fields
  60 (note that they can also be updated long after the video was uploaded, but
  61 as far as I can tell, only by the uploader):&lt;/p&gt;
  62
  63 &lt;dl&gt;
  64
  65 &lt;dt&gt;mediatype&lt;/dt&gt;
  66 &lt;dd&gt;Should be &#39;movie&#39; for movies.&lt;/dd&gt;
  67
  68 &lt;dt&gt;collection&lt;/dt&gt;
  69 &lt;dd&gt;Should contain &#39;moviesandfilms&#39;.&lt;/dd&gt;
  70
  71 &lt;dt&gt;title&lt;/dt&gt;
  72 &lt;dd&gt;The title of the movie, without the publication year.&lt;/dd&gt;
  73
  74 &lt;dt&gt;date&lt;/dt&gt;
  75 &lt;dd&gt;The date or year the movie was released.  This makes the movie show
  76 up in Butter, makes it possible to know the age of the movie,
  77 and is useful for figuring out its copyright status.&lt;/dd&gt;
  78
  79 &lt;dt&gt;director&lt;/dt&gt;
  80 &lt;dd&gt;The director of the movie.  This makes it easier to know if the
  81 correct movie is found in movie databases.&lt;/dd&gt;
  82
  83 &lt;dt&gt;publisher&lt;/dt&gt;
  84 &lt;dd&gt;The production company making the movie.  Also useful for
  85 identifying the correct movie.&lt;/dd&gt;
  86
  87 &lt;dt&gt;links&lt;/dt&gt;
  88
  89 &lt;dd&gt;Add a link to the IMDB title page, for example like this: &amp;lt;a
  90 href=&quot;http://www.imdb.com/title/tt0028496/&quot;&amp;gt;Movie in
  91 IMDB&amp;lt;/a&amp;gt;.  This makes it easier to find duplicates and allows
  92 counting the number of unique movies in the Archive.  Other external
  93 references, like to TMDB, could be added like this too.&lt;/dd&gt;
  94
  95 &lt;/dl&gt;
  96
  97 &lt;p&gt;I did consider proposing a Custom field for the IMDB title ID (for
  98 example &#39;imdb_title_url&#39;, &#39;imdb_code&#39; or simply &#39;imdb&#39;), but suspect it
  99 will be easier to simply place it in the links free text field.&lt;/p&gt;
 100
 101 &lt;p&gt;I created
 102 &lt;a href=&quot;https://github.com/petterreinholdtsen/public-domain-free-imdb&quot;&gt;a
 103 list of IMDB title IDs for several thousand movies in the Internet
 104 Archive&lt;/a&gt;, but I also have a list of several thousand movies without
 105 such an IMDB title ID (and quite a few duplicates).  It would be great if
 106 this data set could be integrated into the Internet Archive metadata
 107 to be available for everyone in the future, but with the current
 108 policy of leaving metadata editing to the uploaders, it will take a
 109 while before this happens.  If you have uploaded movies into the
 110 Internet Archive, you can help.  Please consider following my proposal
 111 above for your movies, to ensure that each movie is properly
 112 counted. :)&lt;/p&gt;
 113
 114 &lt;p&gt;The list is mostly generated using Wikidata, which, based on
 115 Wikipedia articles, makes it possible to link between IMDB and movies in
 116 the Internet Archive.  But there are lots of movies without a
 117 Wikipedia article, and some movies where only a collection page exists
 118 (like for &lt;a href=&quot;https://en.wikipedia.org/wiki/Caminandes&quot;&gt;the
 119 Caminandes example above&lt;/a&gt;, where there are three movies but only
 120 one Wikidata entry).&lt;/p&gt;
 121
 122 &lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
 123 activities, please send Bitcoin donations to my address
 124 &lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;
 125 </description>
 126         </item>
 127
 128         <item>
 129                 <title>Legal to share more than 3000 movies listed on IMDB?</title>
 130                 <link>https://www.hungry.com/~pere/blog/Legal_to_share_more_than_3000_movies_listed_on_IMDB_.html</link>
 131                 <guid isPermaLink="true">https://www.hungry.com/~pere/blog/Legal_to_share_more_than_3000_movies_listed_on_IMDB_.html</guid>
 132                 <pubDate>Sat, 18 Nov 2017 21:20:00 +0100</pubDate>
 133                 <description>&lt;p&gt;A month ago, I blogged about my work to
 134 &lt;a href=&quot;https://people.skolelinux.org/pere/blog/Locating_IMDB_IDs_of_movies_in_the_Internet_Archive_using_Wikidata.html&quot;&gt;automatically
 135 check the copyright status of IMDB entries&lt;/a&gt;, and try to count the
 136 number of movies listed in IMDB that are legal to distribute on the
 137 Internet.  I have continued to look for good data sources, and
 138 identified a few more.  The code used to extract information from
 139 various data sources is available in
 140 &lt;a href=&quot;https://github.com/petterreinholdtsen/public-domain-free-imdb&quot;&gt;a
 141 git repository&lt;/a&gt;, currently available from GitHub.&lt;/p&gt;
 142
 143 &lt;p&gt;So far I have identified 3186 unique IMDB title IDs.  To gain
 144 better understanding of the structure of the data set, I created a
 145 histogram of the year associated with each movie (typically release
 146 year).  It is interesting to notice where the peaks and dips in the
 147 graph are located.  I wonder why they are placed there.  I suspect
 148 World War II caused the dip around 1940, but what caused the peak
 149 around 2010?&lt;/p&gt;
 150
 151 &lt;p align=&quot;center&quot;&gt;&lt;img src=&quot;https://people.skolelinux.org/pere/blog/images/2017-11-18-verk-i-det-fri-filmer.png&quot; /&gt;&lt;/p&gt;
 152
 153 &lt;p&gt;I&#39;ve so far identified ten sources for IMDB title IDs for movies in
 154 the public domain or with a free license.  These are the statistics
 155 reported when running &#39;make stats&#39; in the git repository:&lt;/p&gt;
 156
 157 &lt;pre&gt;
 158   249 entries (    6 unique) with and   288 without IMDB title ID in free-movies-archive-org-butter.json
 159  2301 entries (  540 unique) with and     0 without IMDB title ID in free-movies-archive-org-wikidata.json
 160   830 entries (   29 unique) with and     0 without IMDB title ID in free-movies-icheckmovies-archive-mochard.json
 161  2109 entries (  377 unique) with and     0 without IMDB title ID in free-movies-imdb-pd.json
 162   291 entries (  122 unique) with and     0 without IMDB title ID in free-movies-letterboxd-pd.json
 163   144 entries (  135 unique) with and     0 without IMDB title ID in free-movies-manual.json
 164   350 entries (    1 unique) with and   801 without IMDB title ID in free-movies-publicdomainmovies.json
 165     4 entries (    0 unique) with and   124 without IMDB title ID in free-movies-publicdomainreview.json
 166   698 entries (  119 unique) with and   118 without IMDB title ID in free-movies-publicdomaintorrents.json
 167     8 entries (    8 unique) with and   196 without IMDB title ID in free-movies-vodo.json
 168  3186 unique IMDB title IDs in total
 169 &lt;/pre&gt;
 170
 171 &lt;p&gt;The entries without IMDB title ID are candidates to increase the
 172 data set, but might equally well be duplicates of entries already
 173 listed with IMDB title ID in one of the other sources, or represent
 174 movies that lack an IMDB title ID.  I&#39;ve seen examples of all these
 175 situations when peeking at the entries without IMDB title ID.  Based
 176 on these data sources, the lower bound for movies listed in IMDB that
 177 are legal to distribute on the Internet is between 3186 and 4713.&lt;/p&gt;
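&lt;p&gt;The two figures can be reproduced from the statistics above.  The
sketch below assumes the upper figure arises by treating every entry
without an IMDB title ID as a distinct movie not already counted; that
reading is my assumption, but it matches the numbers.&lt;/p&gt;

&lt;pre&gt;
```python
# Entries without IMDB title ID per source, copied from the
# 'make stats' output above.
WITHOUT_ID = {
    "free-movies-archive-org-butter.json": 288,
    "free-movies-archive-org-wikidata.json": 0,
    "free-movies-icheckmovies-archive-mochard.json": 0,
    "free-movies-imdb-pd.json": 0,
    "free-movies-letterboxd-pd.json": 0,
    "free-movies-manual.json": 0,
    "free-movies-publicdomainmovies.json": 801,
    "free-movies-publicdomainreview.json": 124,
    "free-movies-publicdomaintorrents.json": 118,
    "free-movies-vodo.json": 196,
}

# Unique IMDB title IDs across all sources, from the last stats line.
UNIQUE_WITH_ID = 3186

# Low end: none of the entries without ID adds a new IMDB title.
lower = UNIQUE_WITH_ID
# High end (assumption): every entry without ID is a new, distinct title.
upper = UNIQUE_WITH_ID + sum(WITHOUT_ID.values())

print(lower, upper)
```
&lt;/pre&gt;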
 178
 179 &lt;p&gt;It would be great for improving the accuracy of this measurement,
 180 if the various sources added IMDB title ID to their metadata.  I have
 181 tried to reach the people behind the various sources to ask if they
 182 are interested in doing this, without any replies so far.  Perhaps you
 183 can help me get in touch with the people behind VODO, Public Domain
 184 Torrents, Public Domain Movies and Public Domain Review to try to
 185 convince them to add more metadata to their movie entries?&lt;/p&gt;
 186
 187 &lt;p&gt;Another way you could help is by adding pages to Wikipedia about
 188 movies that are legal to distribute on the Internet.  If such a page
 189 exists and includes a link to both IMDB and the Internet Archive, the
 190 script used to generate free-movies-archive-org-wikidata.json should
 191 pick up the mapping as soon as Wikidata is updated.&lt;/p&gt;
 192
 193 &lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
 194 activities, please send Bitcoin donations to my address
 195 &lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;
 196 </description>
 197         </item>
 198
 199         <item>
 200                 <title>Some notes on fault tolerant storage systems</title>
 201                 <link>https://www.hungry.com/~pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</link>
 202                 <guid isPermaLink="true">https://www.hungry.com/~pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</guid>
 203                 <pubDate>Wed, 1 Nov 2017 15:35:00 +0100</pubDate>
 204                 <description>&lt;p&gt;If you care about how fault tolerant your storage is, you might
 205 find these articles and papers interesting.  They have shaped how I
 206 think when designing a storage system.&lt;/p&gt;
 207
 208 &lt;ul&gt;
 209
 210 &lt;li&gt;USENIX ;login: &lt;a
 211 href=&quot;https://www.usenix.org/publications/login/summer2017/ganesan&quot;&gt;Redundancy
 212 Does Not Imply Fault Tolerance.  Analysis of Distributed Storage
 213 Reactions to Single Errors and Corruptions&lt;/a&gt; by Aishwarya Ganesan,
 214 Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi
 215 H. Arpaci-Dusseau&lt;/li&gt;
 216
 217 &lt;li&gt;ZDNet
 218 &lt;a href=&quot;http://www.zdnet.com/article/why-raid-5-stops-working-in-2009/&quot;&gt;Why
 219 RAID 5 stops working in 2009&lt;/a&gt; by Robin Harris&lt;/li&gt;
 220
 221 &lt;li&gt;ZDNet
 222 &lt;a href=&quot;http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/&quot;&gt;Why
 223 RAID 6 stops working in 2019&lt;/a&gt; by Robin Harris&lt;/li&gt;
 224
 225 &lt;li&gt;USENIX FAST&#39;07
 226 &lt;a href=&quot;http://research.google.com/archive/disk_failures.pdf&quot;&gt;Failure
 227 Trends in a Large Disk Drive Population&lt;/a&gt; by Eduardo Pinheiro,
 228 Wolf-Dietrich Weber and Luiz André Barroso&lt;/li&gt;
 229
 230 &lt;li&gt;USENIX ;login: &lt;a
 231 href=&quot;https://www.usenix.org/system/files/login/articles/hughes12-04.pdf&quot;&gt;Data
 232 Integrity.  Finding Truth in a World of Guesses and Lies&lt;/a&gt; by Doug
 233 Hughes&lt;/li&gt;
 234
 235 &lt;li&gt;USENIX FAST&#39;08
 236 &lt;a href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/&quot;&gt;An
 237 Analysis of Data Corruption in the Storage Stack&lt;/a&gt; by
 238 L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C.
 239 Arpaci-Dusseau, and R. H. Arpaci-Dusseau&lt;/li&gt;
 240
 241 &lt;li&gt;USENIX FAST&#39;07 &lt;a
 242 href=&quot;https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder_html/&quot;&gt;Disk
 243 failures in the real world: what does an MTTF of 1,000,000 hours mean
 244 to you?&lt;/a&gt; by B. Schroeder and G. A. Gibson.&lt;/li&gt;
 245
 246 &lt;li&gt;USENIX ;login: &lt;a
 247 href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/jiang/jiang_html/&quot;&gt;Are
 248 Disks the Dominant Contributor for Storage Failures?  A Comprehensive
 249 Study of Storage Subsystem Failure Characteristics&lt;/a&gt; by Weihang
 250 Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky&lt;/li&gt;
 251
 252 &lt;li&gt;SIGMETRICS 2007
 253 &lt;a href=&quot;http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf&quot;&gt;An
 254 analysis of latent sector errors in disk drives&lt;/a&gt; by
 255 L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler&lt;/li&gt;
 256
 257 &lt;/ul&gt;
 258
 259 &lt;p&gt;Several of these research papers are based on data collected from
 260 hundreds of thousands or millions of disks, and their findings are
 261 eye-opening.  The short story is simply: do not implicitly trust RAID or
 262 redundant storage systems.  Details matter.  And unfortunately there
 263 are few options on Linux addressing all the identified issues.  Both
 264 ZFS and Btrfs are doing a fairly good job, but have legal and
 265 practical issues of their own.  I wonder how cluster file systems like
 266 Ceph do in this regard.  After all, there is an old saying: you know
 267 you have a distributed system when the crash of a computer you have
 268 never heard of stops you from getting any work done.  The same holds
 269 true if fault tolerance does not work.&lt;/p&gt;
 270
 271 &lt;p&gt;Just remember, in the end, it does not matter how redundant, or how
 272 fault tolerant your storage is, if you do not continuously monitor its
 273 status to detect and replace failed disks.&lt;/p&gt;
 274
 275 &lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
 276 activities, please send Bitcoin donations to my address
 277 &lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;
 278 </description>
 279         </item>
 280
 281         </channel>
 282 </rss>