1 <?xml version="1.0" encoding="ISO-8859-1"?>
2 <rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/'>
3 <channel>
4 <title>Petter Reinholdtsen - Entries from November 2017</title>
5 <description>Entries from November 2017</description>
6 <link>https://www.hungry.com/~pere/blog/</link>
7
8
9 <item>
10 <title>Metadata proposal for movies on the Internet Archive</title>
11 <link>https://www.hungry.com/~pere/blog/Metadata_proposal_for_movies_on_the_Internet_Archive.html</link>
12 <guid isPermaLink="true">https://www.hungry.com/~pere/blog/Metadata_proposal_for_movies_on_the_Internet_Archive.html</guid>
13 <pubDate>Tue, 28 Nov 2017 12:00:00 +0100</pubDate>
14 <description>&lt;p&gt;It would be easier to locate the movie you want to watch in
15 &lt;a href=&quot;https://www.archive.org/&quot;&gt;the Internet Archive&lt;/a&gt;, if the
16 metadata about each movie was more complete and accurate. In the
17 archiving community, a well-known saying states that good metadata is a
18 love letter to the future. The metadata in the Internet Archive could
19 use a face lift for the future to love us back. Here is a proposal
20 for a small improvement that would make the metadata more useful
21 today. I&#39;ve been unable to find any document describing the various
22 standard fields available when uploading videos to the archive, so
23 this proposal is based on my best guess and searching through several
24 of the existing movies.&lt;/p&gt;
25
26 &lt;p&gt;I have a few use cases in mind. First of all, I would like to be
27 able to count the number of distinct movies in the Internet Archive,
28 without duplicates. I would further like to identify the IMDB title
29 ID of the movies in the Internet Archive, to be able to look up an IMDB
30 title ID and know if I can fetch the video from there and share it
31 with my friends.&lt;/p&gt;
32
33 &lt;p&gt;Second, I would like the Butter data provider for the Internet
34 Archive
35 (&lt;a href=&quot;https://github.com/butterproviders/butter-provider-archive&quot;&gt;available
36 from GitHub&lt;/a&gt;), to list as many of the good movies as possible. The
37 plugin currently does a search in the archive with the following
38 parameters:&lt;/p&gt;
39
40 &lt;p&gt;&lt;pre&gt;
41 collection:moviesandfilms
42 AND NOT collection:movie_trailers
43 AND -mediatype:collection
44 AND format:&quot;Archive BitTorrent&quot;
45 AND year
46 &lt;/pre&gt;&lt;/p&gt;
47
48 &lt;p&gt;Most of the cool movies that fail to show up in Butter do so
49 because the &#39;year&#39; field is missing. The &#39;year&#39; field is populated by
50 the year part from the &#39;date&#39; field, and should be when the movie was
51 released (date or year). Two such examples are
52 &lt;a href=&quot;https://archive.org/details/SidneyOlcottsBen-hur1905&quot;&gt;Ben Hur
53 from 1905&lt;/a&gt; and
54 &lt;a href=&quot;https://archive.org/details/Caminandes2GranDillama&quot;&gt;Caminandes
55 2: Gran Dillama from 2013&lt;/a&gt;, where the year metadata field is
56 missing.&lt;/p&gt;
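&lt;p&gt;To make the search easier to reproduce, here is a minimal Python sketch that assembles the same query string and a search URL. It assumes the query is sent to the advancedsearch.php endpoint on archive.org. I have not checked how the Butter plugin actually issues its request, and the helper names build_query and build_search_url are made up for illustration.&lt;/p&gt;

```python
from urllib.parse import urlencode

# Query clauses quoted from the post; the plugin's actual code may differ.
CLAUSES = [
    "collection:moviesandfilms",
    "NOT collection:movie_trailers",
    "-mediatype:collection",
    'format:"Archive BitTorrent"',
    "year",
]


def build_query(clauses):
    """Join the search clauses with AND, as in the query shown above."""
    return " AND ".join(clauses)


def build_search_url(clauses, rows=50):
    """Build a search URL for archive.org (assumed endpoint: advancedsearch.php)."""
    params = [
        ("q", build_query(clauses)),
        ("fl[]", "identifier"),
        ("fl[]", "title"),
        ("fl[]", "year"),
        ("rows", rows),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


if __name__ == "__main__":
    print(build_query(CLAUSES))
    print(build_search_url(CLAUSES))
```

&lt;p&gt;The bare year clause at the end is what drops items lacking a year value, so removing it from CLAUSES is a quick way to see which movies are being filtered out.&lt;/p&gt;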
57
58 &lt;p&gt;So, my proposal is simple: for every movie in the Internet Archive
59 where an IMDB title ID exists, please fill in these metadata fields
60 (note that they can also be updated long after the video was uploaded,
61 but as far as I can tell, only by the uploader):&lt;/p&gt;
62
63 &lt;dl&gt;
64
65 &lt;dt&gt;mediatype&lt;/dt&gt;
66 &lt;dd&gt;Should be &#39;movie&#39; for movies.&lt;/dd&gt;
67
68 &lt;dt&gt;collection&lt;/dt&gt;
69 &lt;dd&gt;Should contain &#39;moviesandfilms&#39;.&lt;/dd&gt;
70
71 &lt;dt&gt;title&lt;/dt&gt;
72 &lt;dd&gt;The title of the movie, without the publication year.&lt;/dd&gt;
73
74 &lt;dt&gt;date&lt;/dt&gt;
75 &lt;dd&gt;The date or year the movie was released. This makes the movie show
76 up in Butter, makes it possible to know the age of the movie, and is
77 useful for figuring out its copyright status.&lt;/dd&gt;
78
79 &lt;dt&gt;director&lt;/dt&gt;
80 &lt;dd&gt;The director of the movie. This makes it easier to know if the
81 correct movie is found in movie databases.&lt;/dd&gt;
82
83 &lt;dt&gt;publisher&lt;/dt&gt;
84 &lt;dd&gt;The production company making the movie. Also useful for
85 identifying the correct movie.&lt;/dd&gt;
86
87 &lt;dt&gt;links&lt;/dt&gt;
88
89 &lt;dd&gt;Add a link to the IMDB title page, for example like this: &amp;lt;a
90 href=&quot;http://www.imdb.com/title/tt0028496/&quot;&amp;gt;Movie in
91 IMDB&amp;lt;/a&amp;gt;. This makes it easier to find duplicates and allows
92 counting the number of unique movies in the Archive. Other external
93 references, like to TMDB, could be added like this too.&lt;/dd&gt;
94
95 &lt;/dl&gt;
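&lt;p&gt;As a rough illustration of how the links field could be used for duplicate detection, the Python sketch below pulls IMDB title IDs out of free text and counts the distinct ones. The record layout and the helper names imdb_ids and count_unique are hypothetical, and for brevity the sample values hold only the URL rather than the full HTML anchor suggested above.&lt;/p&gt;

```python
import re

# IMDB title IDs look like "tt" followed by digits, as in the example URL above.
IMDB_ID = re.compile(r"imdb\.com/title/(tt\d+)")


def imdb_ids(links_text):
    """Return the IMDB title IDs found in a free text 'links' value."""
    return IMDB_ID.findall(links_text)


def count_unique(items):
    """Count distinct IMDB title IDs across hypothetical metadata records."""
    seen = set()
    for item in items:
        seen.update(imdb_ids(item.get("links", "")))
    return len(seen)


if __name__ == "__main__":
    # Hypothetical records: two uploads of the same movie and one without a link.
    items = [
        {"identifier": "upload-a", "links": "http://www.imdb.com/title/tt0028496/"},
        {"identifier": "upload-b", "links": "See http://www.imdb.com/title/tt0028496/"},
        {"identifier": "upload-c"},
    ]
    print(count_unique(items))
```

&lt;p&gt;Because the pattern only looks at the URL, it works the same whether the field holds a bare link or a link wrapped in an anchor tag.&lt;/p&gt;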
96
97 &lt;p&gt;I did consider proposing a Custom field for the IMDB title ID (for
98 example &#39;imdb_title_url&#39;, &#39;imdb_code&#39; or simply &#39;imdb&#39;), but suspect it
99 will be easier to simply place it in the links free text field.&lt;/p&gt;
100
101 &lt;p&gt;I created
102 &lt;a href=&quot;https://github.com/petterreinholdtsen/public-domain-free-imdb&quot;&gt;a
103 list of IMDB title IDs for several thousand movies in the Internet
104 Archive&lt;/a&gt;, but I also have a list of several thousand movies without
105 such an IMDB title ID (and quite a few duplicates). It would be great if
106 this data set could be integrated into the Internet Archive metadata
107 to be available for everyone in the future, but with the current
108 policy of leaving metadata editing to the uploaders, it will take a
109 while before this happens. If you have uploaded movies into the
110 Internet Archive, you can help. Please consider following my proposal
111 above for your movies, to ensure that each movie is properly
112 counted. :)&lt;/p&gt;
113
114 &lt;p&gt;The list is mostly generated using Wikidata, which, based on
115 Wikipedia articles, makes it possible to link between IMDB and movies in
116 the Internet Archive. But there are lots of movies without a
117 Wikipedia article, and some movies where only a collection page exists
118 (like for &lt;a href=&quot;https://en.wikipedia.org/wiki/Caminandes&quot;&gt;the
119 Caminandes example above&lt;/a&gt;, where there are three movies but only
120 one Wikidata entry).&lt;/p&gt;
121
122 &lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
123 activities, please send Bitcoin donations to my address
124 &lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;
125 </description>
126 </item>
127
128 <item>
129 <title>Legal to share more than 3000 movies listed on IMDB?</title>
130 <link>https://www.hungry.com/~pere/blog/Legal_to_share_more_than_3000_movies_listed_on_IMDB_.html</link>
131 <guid isPermaLink="true">https://www.hungry.com/~pere/blog/Legal_to_share_more_than_3000_movies_listed_on_IMDB_.html</guid>
132 <pubDate>Sat, 18 Nov 2017 21:20:00 +0100</pubDate>
133 <description>&lt;p&gt;A month ago, I blogged about my work to
134 &lt;a href=&quot;https://people.skolelinux.org/pere/blog/Locating_IMDB_IDs_of_movies_in_the_Internet_Archive_using_Wikidata.html&quot;&gt;automatically
135 check the copyright status of IMDB entries&lt;/a&gt;, and try to count the
136 number of movies listed in IMDB that are legal to distribute on the
137 Internet. I have continued to look for good data sources, and
138 identified a few more. The code used to extract information from
139 various data sources is available in
140 &lt;a href=&quot;https://github.com/petterreinholdtsen/public-domain-free-imdb&quot;&gt;a
141 git repository&lt;/a&gt;, currently available from GitHub.&lt;/p&gt;
142
143 &lt;p&gt;So far I have identified 3186 unique IMDB title IDs. To gain a
144 better understanding of the structure of the data set, I created a
145 histogram of the year associated with each movie (typically release
146 year). It is interesting to notice where the peaks and dips in the
147 graph are located. I wonder why they are placed there. I suspect
148 World War II caused the dip around 1940, but what caused the peak
149 around 2010?&lt;/p&gt;
150
151 &lt;p align=&quot;center&quot;&gt;&lt;img src=&quot;https://people.skolelinux.org/pere/blog/images/2017-11-18-verk-i-det-fri-filmer.png&quot; /&gt;&lt;/p&gt;
152
153 &lt;p&gt;I&#39;ve so far identified ten sources for IMDB title IDs for movies in
154 the public domain or with a free license. These are the statistics
155 reported when running &#39;make stats&#39; in the git repository:&lt;/p&gt;
156
157 &lt;pre&gt;
158 249 entries ( 6 unique) with and 288 without IMDB title ID in free-movies-archive-org-butter.json
159 2301 entries ( 540 unique) with and 0 without IMDB title ID in free-movies-archive-org-wikidata.json
160 830 entries ( 29 unique) with and 0 without IMDB title ID in free-movies-icheckmovies-archive-mochard.json
161 2109 entries ( 377 unique) with and 0 without IMDB title ID in free-movies-imdb-pd.json
162 291 entries ( 122 unique) with and 0 without IMDB title ID in free-movies-letterboxd-pd.json
163 144 entries ( 135 unique) with and 0 without IMDB title ID in free-movies-manual.json
164 350 entries ( 1 unique) with and 801 without IMDB title ID in free-movies-publicdomainmovies.json
165 4 entries ( 0 unique) with and 124 without IMDB title ID in free-movies-publicdomainreview.json
166 698 entries ( 119 unique) with and 118 without IMDB title ID in free-movies-publicdomaintorrents.json
167 8 entries ( 8 unique) with and 196 without IMDB title ID in free-movies-vodo.json
168 3186 unique IMDB title IDs in total
169 &lt;/pre&gt;
170
171 &lt;p&gt;The entries without IMDB title ID are candidates to increase the
172 data set, but might equally well be duplicates of entries already
173 listed with IMDB title ID in one of the other sources, or represent
174 movies that lack an IMDB title ID. I&#39;ve seen examples of all these
175 situations when peeking at the entries without IMDB title ID. Based
176 on these data sources, the lower bound for movies listed in IMDB that
177 are legal to distribute on the Internet lies somewhere between 3186 and 4713.&lt;/p&gt;
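&lt;p&gt;The two ends of that range match simple arithmetic on the statistics above: 3186 is the number of unique IMDB title IDs, and 4713 is what you get if every entry without an ID turned out to be a new, distinct movie. The Python sketch below reproduces the numbers under that assumption. The names WITHOUT_IMDB_ID and bounds are only for illustration and are not part of the git repository.&lt;/p&gt;

```python
# Entries without IMDB title ID per source, copied from the statistics above.
# Sources reporting 0 such entries are left out.
WITHOUT_IMDB_ID = {
    "free-movies-archive-org-butter.json": 288,
    "free-movies-publicdomainmovies.json": 801,
    "free-movies-publicdomainreview.json": 124,
    "free-movies-publicdomaintorrents.json": 118,
    "free-movies-vodo.json": 196,
}

# Total number of unique IMDB title IDs reported by the statistics.
UNIQUE_WITH_IMDB_ID = 3186


def bounds(unique_with_id, without_id):
    """Return (low, high): high assumes every ID-less entry is a new movie."""
    low = unique_with_id
    high = unique_with_id + sum(without_id.values())
    return low, high


if __name__ == "__main__":
    print(bounds(UNIQUE_WITH_IMDB_ID, WITHOUT_IMDB_ID))  # (3186, 4713)
```

&lt;p&gt;The real figure is likely well below the high end, since many of the 1527 entries without an ID are duplicates or have no IMDB entry at all.&lt;/p&gt;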
178
179 &lt;p&gt;It would be great for improving the accuracy of this measurement,
180 if the various sources added IMDB title ID to their metadata. I have
181 tried to reach the people behind the various sources to ask if they
182 are interested in doing this, without any replies so far. Perhaps you
183 can help me get in touch with the people behind VODO, Public Domain
184 Torrents, Public Domain Movies and Public Domain Review to try to
185 convince them to add more metadata to their movie entries?&lt;/p&gt;
186
187 &lt;p&gt;Another way you could help is by adding pages to Wikipedia about
188 movies that are legal to distribute on the Internet. If such a page
189 exists and includes links to both IMDB and the Internet Archive, the
190 script used to generate free-movies-archive-org-wikidata.json should
191 pick up the mapping as soon as Wikidata is updated.&lt;/p&gt;
192
193 &lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
194 activities, please send Bitcoin donations to my address
195 &lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;
196 </description>
197 </item>
198
199 <item>
200 <title>Some notes on fault tolerant storage systems</title>
201 <link>https://www.hungry.com/~pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</link>
202 <guid isPermaLink="true">https://www.hungry.com/~pere/blog/Some_notes_on_fault_tolerant_storage_systems.html</guid>
203 <pubDate>Wed, 1 Nov 2017 15:35:00 +0100</pubDate>
204 <description>&lt;p&gt;If you care about how fault tolerant your storage is, you might
205 find these articles and papers interesting. They have shaped how I
206 think when designing a storage system.&lt;/p&gt;
207
208 &lt;ul&gt;
209
210 &lt;li&gt;USENIX ;login: &lt;a
211 href=&quot;https://www.usenix.org/publications/login/summer2017/ganesan&quot;&gt;Redundancy
212 Does Not Imply Fault Tolerance. Analysis of Distributed Storage
213 Reactions to Single Errors and Corruptions&lt;/a&gt; by Aishwarya Ganesan,
214 Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi
215 H. Arpaci-Dusseau&lt;/li&gt;
216
217 &lt;li&gt;ZDNet
218 &lt;a href=&quot;http://www.zdnet.com/article/why-raid-5-stops-working-in-2009/&quot;&gt;Why
219 RAID 5 stops working in 2009&lt;/a&gt; by Robin Harris&lt;/li&gt;
220
221 &lt;li&gt;ZDNet
222 &lt;a href=&quot;http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/&quot;&gt;Why
223 RAID 6 stops working in 2019&lt;/a&gt; by Robin Harris&lt;/li&gt;
224
225 &lt;li&gt;USENIX FAST&#39;07
226 &lt;a href=&quot;http://research.google.com/archive/disk_failures.pdf&quot;&gt;Failure
227 Trends in a Large Disk Drive Population&lt;/a&gt; by Eduardo Pinheiro,
228 Wolf-Dietrich Weber and Luiz André Barroso&lt;/li&gt;
229
230 &lt;li&gt;USENIX ;login: &lt;a
231 href=&quot;https://www.usenix.org/system/files/login/articles/hughes12-04.pdf&quot;&gt;Data
232 Integrity. Finding Truth in a World of Guesses and Lies&lt;/a&gt; by Doug
233 Hughes&lt;/li&gt;
234
235 &lt;li&gt;USENIX FAST&#39;08
236 &lt;a href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/&quot;&gt;An
237 Analysis of Data Corruption in the Storage Stack&lt;/a&gt; by
238 L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C.
239 Arpaci-Dusseau, and R. H. Arpaci-Dusseau&lt;/li&gt;
240
241 &lt;li&gt;USENIX FAST&#39;07 &lt;a
242 href=&quot;https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder_html/&quot;&gt;Disk
243 failures in the real world: what does an MTTF of 1,000,000 hours mean
244 to you?&lt;/a&gt; by B. Schroeder and G. A. Gibson.&lt;/li&gt;
245
246 &lt;li&gt;USENIX FAST&#39;08 &lt;a
247 href=&quot;https://www.usenix.org/events/fast08/tech/full_papers/jiang/jiang_html/&quot;&gt;Are
248 Disks the Dominant Contributor for Storage Failures? A Comprehensive
249 Study of Storage Subsystem Failure Characteristics&lt;/a&gt; by Weihang
250 Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky&lt;/li&gt;
251
252 &lt;li&gt;SIGMETRICS 2007
253 &lt;a href=&quot;http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf&quot;&gt;An
254 analysis of latent sector errors in disk drives&lt;/a&gt; by
255 L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler&lt;/li&gt;
256
257 &lt;/ul&gt;
258
259 &lt;p&gt;Several of these research papers are based on data collected from
260 hundreds of thousands or millions of disks, and their findings are
261 eye-opening. The short story is simple: do not implicitly trust RAID or
262 redundant storage systems. Details matter. And unfortunately there
263 are few options on Linux addressing all the identified issues. Both
264 ZFS and Btrfs are doing a fairly good job, but have legal and
265 practical issues of their own. I wonder how cluster file systems like
266 Ceph do in this regard. After all, there is an old saying, you know
267 you have a distributed system when the crash of a computer you have
268 never heard of stops you from getting any work done. The same holds
269 true if fault tolerance does not work.&lt;/p&gt;
270
271 &lt;p&gt;Just remember, in the end, it does not matter how redundant, or how
272 fault tolerant your storage is, if you do not continuously monitor its
273 status to detect and replace failed disks.&lt;/p&gt;
274
275 &lt;p&gt;As usual, if you use Bitcoin and want to show your support of my
276 activities, please send Bitcoin donations to my address
277 &lt;b&gt;&lt;a href=&quot;bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&quot;&gt;15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&lt;/a&gt;&lt;/b&gt;.&lt;/p&gt;
278 </description>
279 </item>
280
281 </channel>
282 </rss>