1 <?xml version="1.0" encoding="ISO-8859-1"?>
2 <rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/'>
3 <channel>
4 <title>Petter Reinholdtsen - Entries from November 2017</title>
5 <description>Entries from November 2017</description>
6 <link>http://people.skolelinux.org/pere/blog/</link>
7
8
9 <item>
10 <title>Metadata proposal for movies on the Internet Archive</title>
11 <link>http://people.skolelinux.org/pere/blog/Metadata_proposal_for_movies_on_the_Internet_Archive.html</link>
12 <guid isPermaLink="true">http://people.skolelinux.org/pere/blog/Metadata_proposal_for_movies_on_the_Internet_Archive.html</guid>
13 <pubDate>Tue, 28 Nov 2017 12:00:00 +0100</pubDate>
14 <description>&lt;p&gt;It would be easier to locate the movie you want to watch in
15 &lt;a href=&quot;https://www.archive.org/&quot;&gt;the Internet Archive&lt;/a&gt;, if the
16 metadata about each movie was more complete and accurate. In the
17 archiving community, a well-known saying states that good metadata is a
18 love letter to the future. The metadata in the Internet Archive could
19 use a face lift for the future to love us back. Here is a proposal
20 for a small improvement that would make the metadata more useful
21 today. I&#39;ve been unable to find any document describing the various
22 standard fields available when uploading videos to the archive, so
23 this proposal is based on my best guess and searching through several
24 of the existing movies.&lt;/p&gt;
25
26 &lt;p&gt;I have a few use cases in mind. First of all, I would like to be
27 able to count the number of distinct movies in the Internet Archive,
28 without duplicates. I would further like to identify the IMDB title
29 ID of the movies in the Internet Archive, to be able to look up an IMDB
30 title ID and know if I can fetch the video from there and share it
31 with my friends.&lt;/p&gt;
32
33 &lt;p&gt;Second, I would like the Butter data provider for the Internet
34 Archive
35 (&lt;a href=&quot;https://github.com/butterproviders/butter-provider-archive&quot;&gt;available
36 from GitHub&lt;/a&gt;) to list as many of the good movies as possible. The
37 plugin currently does a search in the archive with the following
38 parameters:&lt;/p&gt;
39
40 &lt;p&gt;&lt;pre&gt;
41 collection:moviesandfilms
42 AND NOT collection:movie_trailers
43 AND -mediatype:collection
44 AND format:&quot;Archive BitTorrent&quot;
45 AND year
46 &lt;/pre&gt;&lt;/p&gt;
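&lt;p&gt;As a minimal sketch, the search terms above could be turned into a request URL along these lines. The advancedsearch.php endpoint and the q, fl[], rows, page and output parameters are my assumptions about the archive search interface, not something taken from the Butter plugin source, and build_search_url is a hypothetical helper name.&lt;/p&gt;

```python
from urllib.parse import urlencode

# Search terms copied from the query listed above.
QUERY_TERMS = [
    "collection:moviesandfilms",
    "NOT collection:movie_trailers",
    "-mediatype:collection",
    'format:"Archive BitTorrent"',
    "year",
]

# Assumption: this is the search endpoint; adjust if the archive differs.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(rows=50, page=1):
    """Return a URL for the search; no network request is made here."""
    query = " AND ".join(QUERY_TERMS)
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # assumed name for the returned field list
        ("fl[]", "year"),
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url())
```

&lt;p&gt;The sketch only builds the URL, so it can be inspected before anything is fetched.&lt;/p&gt;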
47
48 &lt;p&gt;Most of the cool movies that fail to show up in Butter do so
49 because the &#39;year&#39; field is missing. The &#39;year&#39; field is populated by
50 the year part from the &#39;date&#39; field, and should be when the movie was
51 released (date or year). Two such examples are
52 &lt;a href=&quot;https://archive.org/details/SidneyOlcottsBen-hur1905&quot;&gt;Ben Hur
53 from 1905&lt;/a&gt; and
54 &lt;a href=&quot;https://archive.org/details/Caminandes2GranDillama&quot;&gt;Caminandes
55 2: Gran Dillama from 2013&lt;/a&gt;, where the year metadata field is
56 missing.&lt;/p&gt;
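&lt;p&gt;A rough sketch of how the year part could be derived from a date value. I assume here that dates start with a four digit year, as in 1905 or 2013-11-22; the function name year_from_date is hypothetical and this is not how the archive itself is known to do it.&lt;/p&gt;

```python
import re

# Assumption: a usable date starts with a four digit year.
YEAR_RE = re.compile(r"(\d{4})(?!\d)")


def year_from_date(date):
    """Return the leading year of a date string as an int, or None."""
    if not date:
        return None
    match = YEAR_RE.match(date.strip())
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for value in ("1905", "2013-11-22", "", None):
        print(repr(value), year_from_date(value))
```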
57
58 &lt;p&gt;So, my proposal is simply this: for every movie in the Internet Archive
59 where an IMDB title ID exists, please fill in these metadata fields
60 (note that they can also be updated long after the video was uploaded, but
61 as far as I can tell, only by the uploader):&lt;/p&gt;
62
63 &lt;dl&gt;
64
65 &lt;dt&gt;mediatype&lt;/dt&gt;
66 &lt;dd&gt;Should be &#39;movie&#39; for movies.&lt;/dd&gt;
67
68 &lt;dt&gt;collection&lt;/dt&gt;
69 &lt;dd&gt;Should contain &#39;moviesandfilms&#39;.&lt;/dd&gt;
70
71 &lt;dt&gt;title&lt;/dt&gt;
72 &lt;dd&gt;The title of the movie, without the publication year.&lt;/dd&gt;
73
74 &lt;dt&gt;date&lt;/dt&gt;
75 &lt;dd&gt;The date or year the movie was released. This makes the movie show
76 up in Butter, makes it possible to know the age of the
77 movie, and is useful for figuring out its copyright status.&lt;/dd&gt;
78
79 &lt;dt&gt;director&lt;/dt&gt;
80 &lt;dd&gt;The director of the movie. This makes it easier to know if the
81 correct movie is found in movie databases.&lt;/dd&gt;
82
83 &lt;dt&gt;publisher&lt;/dt&gt;
84 &lt;dd&gt;The production company making the movie. Also useful for
85 identifying the correct movie.&lt;/dd&gt;
86
87 &lt;dt&gt;links&lt;/dt&gt;
88
89 &lt;dd&gt;Add a link to the IMDB title page, for example like this: &amp;lt;a
90 href=&quot;http://www.imdb.com/title/tt0028496/&quot;&amp;gt;Movie in
91 IMDB&amp;lt;/a&amp;gt;. This makes it easier to find duplicates and allows for
92 counting the number of unique movies in the Archive. Other external
93 references, like to TMDB, could be added like this too.&lt;/dd&gt;
94
95 &lt;/dl&gt;
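&lt;p&gt;To illustrate the proposal, here is a small sketch that checks one metadata record against the fields listed above and pulls the IMDB title ID out of the links text. The record is assumed to be a plain dictionary where each value is a string or a list of strings; that layout and the names proposal_problems and imdb_title_ids are my own assumptions, not an Internet Archive API.&lt;/p&gt;

```python
import re

# Matches the title ID in links such as http://www.imdb.com/title/tt0028496/
IMDB_TITLE_RE = re.compile(r"imdb\.com/title/(tt\d+)")

# The fields named in the proposal above.
REQUIRED_FIELDS = ("mediatype", "collection", "title", "date",
                   "director", "publisher", "links")


def as_list(value):
    """Return the non-empty values of a field as a list."""
    if value is None:
        return []
    values = value if isinstance(value, (list, tuple)) else [value]
    return [v for v in values if v]


def imdb_title_ids(metadata):
    """Return the distinct IMDB title IDs found in the links field."""
    found = []
    for text in as_list(metadata.get("links")):
        for title_id in IMDB_TITLE_RE.findall(text):
            if title_id not in found:
                found.append(title_id)
    return found


def proposal_problems(metadata):
    """List the ways a record deviates from the proposal."""
    problems = ["missing " + name for name in REQUIRED_FIELDS
                if not as_list(metadata.get(name))]
    mediatype = as_list(metadata.get("mediatype"))
    if mediatype and mediatype != ["movie"]:
        problems.append("mediatype is not movie")
    collection = as_list(metadata.get("collection"))
    if collection and "moviesandfilms" not in collection:
        problems.append("collection lacks moviesandfilms")
    if as_list(metadata.get("links")) and not imdb_title_ids(metadata):
        problems.append("links has no IMDB title link")
    return problems
```

&lt;p&gt;An empty list from proposal_problems means the record follows the proposal, and two records with the same ID from imdb_title_ids are candidates for being duplicates.&lt;/p&gt;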
96
97 &lt;p&gt;I did consider proposing a custom field for the IMDB title ID (for
98 example &#39;imdb_title_url&#39;, &#39;imdb_code&#39; or simply &#39;imdb&#39;), but suspect it
99 will be easier to simply place it in the links free text field.&lt;/p&gt;
100
101 &lt;p&gt;I created
102 &lt;a href=&quot;https://github.com/petterreinholdtsen/public-domain-free-imdb&quot;&gt;a
103 list of IMDB title IDs for several thousand movies in the Internet
104 Archive&lt;/a&gt;, but I also got a list of several thousand movies without
105 such an IMDB title ID (and quite a few duplicates). It would be great if
106 this data set could be integrated into the Internet Archive metadata
107 to be available for everyone in the future, but with the current
108 policy of leaving metadata editing to the uploaders, it will take a
109 while before this happens. If you have uploaded movies into the
110 Internet Archive, you can help. Please consider following my proposal
111 above for your movies, to ensure that each movie is properly
112 counted. :)&lt;/p&gt;
113
114 &lt;p&gt;The list is mostly generated using Wikidata, which based on
115 Wikipedia articles makes it possible to link between IMDB and movies in
116 the Internet Archive. But there are lots of movies without a
117 Wikipedia article, and some movies where only a collection page exists
118 (like for &lt;a href=&quot;https://en.wikipedia.org/wiki/Caminandes&quot;&gt;the
119 Caminandes example above&lt;/a&gt;, where there are three movies but only
120 one Wikidata entry).&lt;/p&gt;
121 </description>
122 </item>
123
124 <item>
125 <title>Legal to share more than 3000 movies listed on IMDB?</title>
126 <link>http://people.skolelinux.org/pere/blog/Legal_to_share_more_than_3000_movies_listed_on_IMDB_.html</link>
127 <guid isPermaLink="true">http://people.skolelinux.org/pere/blog/Legal_to_share_more_than_3000_movies_listed_on_IMDB_.html</guid>
128 <pubDate>Sat, 18 Nov 2017 21:20:00 +0100</pubDate>
129 <description>&lt;p&gt;A month ago, I blogged about my work to
130 &lt;a href=&quot;http://people.skolelinux.org/pere/blog/Locating_IMDB_IDs_of_movies_in_the_Internet_Archive_using_Wikidata.html&quot;&gt;automatically
131 check the copyright status of IMDB entries&lt;/a&gt;, and try to count the
132 number of movies listed in IMDB that are legal to distribute on the
133 Internet. I have continued to look for good data sources, and
134 identified a few more. The code used to extract information from
135 various data sources is available in
136 &lt;a href=&quot;https://github.com/petterreinholdtsen/public-domain-free-imdb&quot;&gt;a
137 git repository&lt;/a&gt;, currently available from GitHub.&lt;/p&gt;
138
139 &lt;p&gt;So far I have identified 3186 unique IMDB title IDs. To gain a
140 better understanding of the structure of the data set, I created a
141 histogram of the year associated with each movie (typically release
142 year). It is interesting to notice where the peaks and dips in the
143 graph are located. I wonder why they are placed there. I suspect
144 World War II caused the dip around 1940, but what caused the peak
145 around 2010?&lt;/p&gt;
146
147 &lt;p align=&quot;center&quot;&gt;&lt;img src=&quot;http://people.skolelinux.org/pere/blog/images/2017-11-18-verk-i-det-fri-filmer.png&quot; /&gt;&lt;/p&gt;
148
149 &lt;p&gt;I&#39;ve so far identified ten sources for IMDB title IDs for movies in
150 the public domain or with a free license. These are the statistics
151 reported when running &#39;make stats&#39; in the git repository:&lt;/p&gt;
152
153 &lt;pre&gt;
154 249 entries ( 6 unique) with and 288 without IMDB title ID in free-movies-archive-org-butter.json
155 2301 entries ( 540 unique) with and 0 without IMDB title ID in free-movies-archive-org-wikidata.json
156 830 entries ( 29 unique) with and 0 without IMDB title ID in free-movies-icheckmovies-archive-mochard.json
157 2109 entries ( 377 unique) with and 0 without IMDB title ID in free-movies-imdb-pd.json
158 291 entries ( 122 unique) with and 0 without IMDB title ID in free-movies-letterboxd-pd.json
159 144 entries ( 135 unique) with and 0 without IMDB title ID in free-movies-manual.json
160 350 entries ( 1 unique) with and 801 without IMDB title ID in free-movies-publicdomainmovies.json
161 4 entries ( 0 unique) with and 124 without IMDB title ID in free-movies-publicdomainreview.json
162 698 entries ( 119 unique) with and 118 without IMDB title ID in free-movies-publicdomaintorrents.json
163 8 entries ( 8 unique) with and 196 without IMDB title ID in free-movies-vodo.json
164 3186 unique IMDB title IDs in total
165 &lt;/pre&gt;
166
167 &lt;p&gt;The entries without IMDB title ID are candidates to increase the
168 data set, but might equally well be duplicates of entries already
169 listed with IMDB title ID in one of the other sources, or represent
170 movies that lack an IMDB title ID. I&#39;ve seen examples of all these
171 situations when peeking at the entries without IMDB title ID. Based
172 on these data sources, the lower bound for movies listed in IMDB that
173 are legal to distribute on the Internet is between 3186 and 4713.&lt;/p&gt;
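&lt;p&gt;The range follows from the statistics above: 3186 unique IDs are already known, and the entries without IMDB title ID add up to 288 + 801 + 124 + 118 + 196 = 1527, which gives 3186 + 1527 = 4713 if every one of them turned out to be a distinct movie listed in IMDB. The sketch below redoes that arithmetic and shows one way the per source counts could be computed. I assume here that unique means an ID found in that source only; the function names are hypothetical and this is not the code behind make stats.&lt;/p&gt;

```python
from collections import Counter


def count_ids(sources):
    """Count IDs per source that appear in no other source, plus the total.

    sources maps a source name to the collection of IMDB title IDs it lists.
    Returns (ids_only_in_that_source, total_unique_ids).
    """
    seen = Counter(title_id for ids in sources.values()
                   for title_id in set(ids))
    only_here = {
        name: sum(1 for title_id in set(ids) if seen[title_id] == 1)
        for name, ids in sources.items()
    }
    return only_here, len(seen)


# Entries without IMDB title ID, copied from the statistics above.
WITHOUT_ID = {
    "butter": 288,
    "publicdomainmovies": 801,
    "publicdomainreview": 124,
    "publicdomaintorrents": 118,
    "vodo": 196,
}
KNOWN_IDS = 3186


def bounds(known_ids=KNOWN_IDS, without_id=WITHOUT_ID):
    """Return the range given by the known IDs and the unmatched entries."""
    return known_ids, known_ids + sum(without_id.values())


if __name__ == "__main__":
    print(bounds())
```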
174
175 &lt;p&gt;It would be great for improving the accuracy of this measurement,
176 if the various sources added IMDB title IDs to their metadata. I have
177 tried to reach the people behind the various sources to ask if they
178 are interested in doing this, without any replies so far. Perhaps you
179 can help me get in touch with the people behind VODO, Public Domain
180 Torrents, Public Domain Movies and Public Domain Review to try to
181 convince them to add more metadata to their movie entries?&lt;/p&gt;
182
183 &lt;p&gt;Another way you could help is by adding pages to Wikipedia about
184 movies that are legal to distribute on the Internet. If such a page
185 exists and includes links to both IMDB and the Internet Archive, the
186 script used to generate free-movies-archive-org-wikidata.json should
187 pick up the mapping as soon as Wikidata is updated.&lt;/p&gt;
188
189 &lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
190 activities, please send Bitcoin donations to my address
191 &lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;
192 </description>
193 </item>
194
195 <item>
196 <title>Some notes on fault tolerant storage systems</title>
197 <link>http://people.skolelinux.org/pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</link>
198 <guid isPermaLink="true">http://people.skolelinux.org/pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</guid>
199 <pubDate>Wed, 1 Nov 2017 15:35:00 +0100</pubDate>
200 <description>&lt;p&gt;If you care about how fault tolerant your storage is, you might
201 find these articles and papers interesting. They have shaped how I
202 think when designing a storage system.&lt;/p&gt;
203
204 &lt;ul&gt;
205
206 &lt;li&gt;USENIX ;login: &lt;a
207 href=&quot;https://www.usenix.org/publications/login/summer2017/ganesan&quot;&gt;Redundancy
208 Does Not Imply Fault Tolerance. Analysis of Distributed Storage
209 Reactions to Single Errors and Corruptions&lt;/a&gt; by Aishwarya Ganesan,
210 Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi
211 H. Arpaci-Dusseau&lt;/li&gt;
212
213 &lt;li&gt;ZDNet
214 &lt;a href=&quot;http://www.zdnet.com/article/why-raid-5-stops-working-in-2009/&quot;&gt;Why
215 RAID 5 stops working in 2009&lt;/a&gt; by Robin Harris&lt;/li&gt;
216
217 &lt;li&gt;ZDNet
218 &lt;a href=&quot;http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/&quot;&gt;Why
219 RAID 6 stops working in 2019&lt;/a&gt; by Robin Harris&lt;/li&gt;
220
221 &lt;li&gt;USENIX FAST&#39;07
222 &lt;a href=&quot;http://research.google.com/archive/disk_failures.pdf&quot;&gt;Failure
223 Trends in a Large Disk Drive Population&lt;/a&gt; by Eduardo Pinheiro,
224 Wolf-Dietrich Weber and Luiz André Barroso&lt;/li&gt;
225
226 &lt;li&gt;USENIX ;login: &lt;a
227 href=&quot;https://www.usenix.org/system/files/login/articles/hughes12-04.pdf&quot;&gt;Data
228 Integrity. Finding Truth in a World of Guesses and Lies&lt;/a&gt; by Doug
229 Hughes&lt;/li&gt;
230
231 &lt;li&gt;USENIX FAST&#39;08
232 &lt;a href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/&quot;&gt;An
233 Analysis of Data Corruption in the Storage Stack&lt;/a&gt; by
234 L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C.
235 Arpaci-Dusseau, and R. H. Arpaci-Dusseau&lt;/li&gt;
236
237 &lt;li&gt;USENIX FAST&#39;07 &lt;a
238 href=&quot;https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder_html/&quot;&gt;Disk
239 failures in the real world: what does an MTTF of 1,000,000 hours mean
240 to you?&lt;/a&gt; by B. Schroeder and G. A. Gibson.&lt;/li&gt;
241
242 &lt;li&gt;USENIX FAST&#39;08 &lt;a
243 href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/jiang/jiang_html/&quot;&gt;Are
244 Disks the Dominant Contributor for Storage Failures? A Comprehensive
245 Study of Storage Subsystem Failure Characteristics&lt;/a&gt; by Weihang
246 Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky&lt;/li&gt;
247
248 &lt;li&gt;SIGMETRICS 2007
249 &lt;a href=&quot;http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf&quot;&gt;An
250 analysis of latent sector errors in disk drives&lt;/a&gt; by
251 L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler&lt;/li&gt;
252
253 &lt;/ul&gt;
254
255 &lt;p&gt;Several of these research papers are based on data collected from
256 hundreds of thousands or millions of disks, and their findings are
257 eye-opening. The short story is simply: do not implicitly trust RAID or
258 redundant storage systems. Details matter. And unfortunately there
259 are few options on Linux addressing all the identified issues. Both
260 ZFS and Btrfs are doing a fairly good job, but have legal and
261 practical issues of their own. I wonder how cluster file systems like
262 Ceph do in this regard. After all, there is an old saying, you know
263 you have a distributed system when the crash of a computer you have
264 never heard of stops you from getting any work done. The same holds
265 true if fault tolerance does not work.&lt;/p&gt;
266
267 &lt;p&gt;Just remember, in the end, it does not matter how redundant, or how
268 fault tolerant your storage is, if you do not continuously monitor its
269 status to detect and replace failed disks.&lt;/p&gt;
270
271 &lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
272 activities, please send Bitcoin donations to my address
273 &lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;
274 </description>
275 </item>
276
277 </channel>
278 </rss>