X-Git-Url: http://pere.pagekite.me/gitweb/homepage.git/blobdiff_plain/76f50252fb1cf3031a53e28bac1d1ed136c22d97..0ac09e03cd9937e13295750471e75d8fd907f112:/blog/index.rss

diff --git a/blog/index.rss b/blog/index.rss
index 70f6ecb37a..c5fcc608cb 100644
--- a/blog/index.rss
+++ b/blog/index.rss
@@ -6,6 +6,288 @@ http://people.skolelinux.org/pere/blog/

s3ql, a locally mounted cloud file system - nice free software
http://people.skolelinux.org/pere/blog/s3ql__a_locally_mounted_cloud_file_system___nice_free_software.html
http://people.skolelinux.org/pere/blog/s3ql__a_locally_mounted_cloud_file_system___nice_free_software.html
Wed, 9 Apr 2014 11:30:00 +0200

<p>For a while now, I have been looking for a sensible offsite backup
solution for use at home.  My requirements are simple: it must be
cheap and locally encrypted (in other words, I keep the encryption
keys, and the storage provider does not have access to my private
files).  One idea my friends and I had many years ago, before the
cloud storage providers showed up, was to use Google mail as storage:
write a Linux block device that stores its blocks as emails in the
mail service provided by Google, and thus get heaps of free space.
On top of this one could add encryption, RAID and volume management
to get lots of (admittedly fairly slow) cheap and encrypted storage.
But I never found the time to implement such a system.  The last few
weeks I have instead been looking at
<a href="https://bitbucket.org/nikratio/s3ql/">S3QL</a>, a locally
mounted network-backed file system with the features I need.</p>

<p>S3QL is a FUSE file system with a local cache and cloud storage,
supporting several different storage providers: any with an Amazon S3,
Google Drive or OpenStack API will do, and there are heaps of such
providers.  S3QL can also use a local directory as storage, which
combined with sshfs allows for file storage on any ssh server (see the
sketch below).  S3QL includes support for encryption, compression,
de-duplication, snapshots and immutable file systems, allowing me to
mount the remote storage as a local mount point and look at and use
the files as if they were local, while the content is also stored in
the cloud.  This allows me to have a backup that should survive a
house fire.  The file system cannot be shared between several machines
at the same time, as only one machine can mount it at a time, but any
machine with the encryption key and access to the storage service can
mount it once it is unmounted.</p>

<p>It is simple to use.  I'm using it on Debian Wheezy, where the
package is already included.  To get started, run <tt>apt-get
install s3ql</tt>.  Next, pick a storage provider.  I ended up picking
Greenqloud, after reading their nice recipe on
<a href="https://greenqloud.zendesk.com/entries/44611757-How-To-Use-S3QL-to-mount-a-StorageQloud-bucket-on-Debian-Wheezy">how
to use S3QL with their Amazon S3-compatible service</a>, because I
trust the laws in Iceland more than those in the USA when it comes to
keeping my personal data safe and private, and thus would rather spend
money on a company in Iceland.  Another nice recipe is available in
the article
<a href="http://www.admin-magazine.com/HPC/Articles/HPC-Cloud-Storage">S3QL
Filesystem for HPC Storage</a> by Jeff Layton in the HPC section of
Admin magazine.  When the provider is picked, figure out how to get
the API key needed to connect to the storage API.  With Greenqloud,
the key did not show up until I had added payment details to my
account.</p>
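<p>By the way, if you want to test S3QL without involving a storage
provider at all, the local directory backend mentioned above can be
used on its own.  The following is a minimal sketch with illustrative
paths, assuming I read the documentation correctly that a storage URL
of the form local:///path selects the local backend:</p>

<p><blockquote><pre>
# mkdir -p /var/lib/s3ql-local /mnt/s3ql-test
# mkfs.s3ql local:///var/lib/s3ql-local
# mount.s3ql local:///var/lib/s3ql-local /mnt/s3ql-test
</pre></blockquote></p>

<p>To put the storage directory on a remote ssh server instead, mount
it first with something like <tt>sshfs backup@server:/srv/s3ql
/var/lib/s3ql-local</tt> (host and paths are examples), and S3QL
should not notice the difference.</p>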
<p>Armed with the API access details, it is time to create the file
system.  First, create a new bucket in the cloud.  This bucket is the
file system storage area.  I picked a bucket name reflecting the
machine that was going to store data there, but any name will do.
I'll refer to it as <tt>bucket-name</tt> below.  In addition, one
needs the API login and password, and a locally created passphrase.
Store it all in ~root/.s3ql/authinfo2 like this:</p>

<p><blockquote><pre>
[s3c]
storage-url: s3c://s.greenqloud.com:443/bucket-name
backend-login: API-login
backend-password: API-password
fs-passphrase: local-password
</pre></blockquote></p>

<p>I create my local passphrase using <tt>pwget 50</tt> or similar,
but any sensible way to create a fairly random password should do.
Armed with these details, it is now time to run mkfs, entering the API
details and the new passphrase to create the file system:</p>

<p><blockquote><pre>
# mkdir -m 700 /var/lib/s3ql-cache
# mkfs.s3ql --cachedir /var/lib/s3ql-cache --authfile /root/.s3ql/authinfo2 \
  --ssl s3c://s.greenqloud.com:443/bucket-name
Enter backend login:
Enter backend password:
Before using S3QL, make sure to read the user's guide, especially
the 'Important Rules to Avoid Loosing Data' section.
Enter encryption password:
Confirm encryption password:
Generating random encryption key...
Creating metadata tables...
Dumping metadata...
..objects..
..blocks..
..inodes..
..inode_blocks..
..symlink_targets..
..names..
..contents..
..ext_attributes..
Compressing and uploading metadata...
Wrote 0.00 MB of compressed metadata.
# </pre></blockquote></p>

<p>The next step is mounting the file system to make the storage
available.</p>

<p><blockquote><pre>
# mount.s3ql --cachedir /var/lib/s3ql-cache --authfile /root/.s3ql/authinfo2 \
  --ssl --allow-root s3c://s.greenqloud.com:443/bucket-name /s3ql
Using 4 upload threads.
Downloading and decompressing metadata...
Reading metadata...
..objects..
..blocks..
..inodes..
..inode_blocks..
..symlink_targets..
..names..
..contents..
..ext_attributes..
Mounting filesystem...
# df -h /s3ql
Filesystem                              Size  Used Avail Use% Mounted on
s3c://s.greenqloud.com:443/bucket-name  1.0T     0  1.0T   0% /s3ql
#
</pre></blockquote></p>

<p>The file system is now ready for use.  I use rsync to store my
backups in it, and as the metadata used by rsync is downloaded at
mount time, no network traffic (and storage cost) is triggered by
running rsync.</p>
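<p>As an illustration, here is the kind of rsync invocation I have in
mind, with example source and target paths:</p>

<p><blockquote><pre>
# rsync -aH --delete /home/ /s3ql/backup-home/
</pre></blockquote></p>

<p>The -a and -H options preserve file attributes and hard links,
while --delete makes the copy under /s3ql/backup-home/ mirror the
current state of /home/.</p>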
<p>To unmount, one should not use the normal umount command, as it
will not flush the cache to the cloud storage.  Instead, run the
umount.s3ql command like this:</p>

<p><blockquote><pre>
# umount.s3ql /s3ql
#
</pre></blockquote></p>

<p>There is a fsck command available to check the file system and
correct any problems detected.  It can also be used if the local
server crashes while the file system is mounted, to reset the "already
mounted" flag.  This is what it looks like when processing a working
file system:</p>

<p><blockquote><pre>
# fsck.s3ql --force --ssl s3c://s.greenqloud.com:443/bucket-name
Using cached metadata.
File system seems clean, checking anyway.
Checking DB integrity...
Creating temporary extra indices...
Checking lost+found...
Checking cached objects...
Checking names (refcounts)...
Checking contents (names)...
Checking contents (inodes)...
Checking contents (parent inodes)...
Checking objects (reference counts)...
Checking objects (backend)...
..processed 5000 objects so far..
..processed 10000 objects so far..
..processed 15000 objects so far..
Checking objects (sizes)...
Checking blocks (referenced objects)...
Checking blocks (refcounts)...
Checking inode-block mapping (blocks)...
Checking inode-block mapping (inodes)...
Checking inodes (refcounts)...
Checking inodes (sizes)...
Checking extended attributes (names)...
Checking extended attributes (inodes)...
Checking symlinks (inodes)...
Checking directory reachability...
Checking unix conventions...
Checking referential integrity...
Dropping temporary indices...
Backing up old metadata...
Dumping metadata...
..objects..
..blocks..
..inodes..
..inode_blocks..
..symlink_targets..
..names..
..contents..
..ext_attributes..
Compressing and uploading metadata...
Wrote 0.89 MB of compressed metadata.
#
</pre></blockquote></p>

<p>Thanks to the cache, working on files that fit in the cache is very
quick, about the same speed as local file access.  Uploading large
amounts of data is for me limited by the bandwidth out of and into my
house.  Uploading 685 MiB with a 100 MiB cache gave me 305 kiB/s,
which is very close to my upload speed, and downloading the same
Debian installation ISO gave me 610 kiB/s, close to my download speed.
Both were measured using <tt>dd</tt>.  So for me, the bottleneck is my
network, not the file system code.  I do not know what a good cache
size would be, but suspect that the cache should be larger than your
working set.</p>
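<p>For those who want to repeat my throughput measurement, here is a
sketch of how it can be done with <tt>dd</tt>, which reports the
transfer rate when it finishes.  The ISO file name is just an example,
and the unmount and mount in the middle make sure the read test is not
served from the local cache:</p>

<p><blockquote><pre>
# dd if=debian.iso of=/s3ql/debian.iso bs=1M
# umount.s3ql /s3ql
# mount.s3ql --cachedir /var/lib/s3ql-cache --authfile /root/.s3ql/authinfo2 \
  --ssl --allow-root s3c://s.greenqloud.com:443/bucket-name /s3ql
# dd if=/s3ql/debian.iso of=/dev/null bs=1M
</pre></blockquote></p>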
<p>I mentioned that only one machine can mount the file system at a
time.  If another machine tries, it is told that the file system is
busy:</p>

<p><blockquote><pre>
# mount.s3ql --cachedir /var/lib/s3ql-cache --authfile /root/.s3ql/authinfo2 \
  --ssl --allow-root s3c://s.greenqloud.com:443/bucket-name /s3ql
Using 8 upload threads.
Backend reports that fs is still mounted elsewhere, aborting.
#
</pre></blockquote></p>

<p>The file content is uploaded when the cache is full, while the
metadata is uploaded once every 24 hours by default.  To ensure the
file system content is flushed to the cloud, one can either unmount
the file system, or ask S3QL to flush the cache and metadata using
s3qlctrl:</p>

<p><blockquote><pre>
# s3qlctrl upload-meta /s3ql
# s3qlctrl flushcache /s3ql
#
</pre></blockquote></p>

<p>If you are curious how much space your data uses in the cloud, and
how much compression and de-duplication cut down on the storage usage,
you can use s3qlstat on the mounted file system to get a report:</p>

<p><blockquote><pre>
# s3qlstat /s3ql
Directory entries: 9141
Inodes: 9143
Data blocks: 8851
Total data size: 22049.38 MB
After de-duplication: 21955.46 MB (99.57% of total)
After compression: 21877.28 MB (99.22% of total, 99.64% of de-duplicated)
Database size: 2.39 MB (uncompressed)
(some values do not take into account not-yet-uploaded dirty blocks in cache)
#
</pre></blockquote></p>

<p>I mentioned earlier that there are several possible suppliers of
storage.  I did not try to locate them all, but am aware of at least
<a href="https://www.greenqloud.com/">Greenqloud</a>,
<a href="http://drive.google.com/">Google Drive</a>,
<a href="http://aws.amazon.com/s3/">Amazon S3 web services</a>,
<a href="http://www.rackspace.com/">Rackspace</a> and
<a href="http://crowncloud.net/">Crowncloud</a>.  The latter even
accepts payment in Bitcoin.  Pick one that suits your needs.  Some of
them provide several GiB of free storage, but the pricing models are
quite different, and you will have to figure out what suits you
best.</p>

<p>While researching this blog post, I had a look at research papers
and posters discussing the S3QL file system.  There are several, which
tells me that the file system is getting critical review from the
research community, and this increased my confidence in using it.  One
nice poster is titled
"<a href="http://www.lanl.gov/orgs/adtsc/publications/science_highlights_2013/docs/pg68_69.pdf">An
Innovative Parallel Cloud Storage System using OpenStack's Swift
Object Store and Transformative Parallel I/O Approach</a>" by
Hsing-Bung Chen, Benjamin McClelland, David Sherrill, Alfred Torrez,
Parks Fields and Pamela Smith.  Please have a look.</p>

<p>Given my problems with different file systems earlier, I decided to
check out the mounted S3QL file system to see if it would be usable as
a home directory (in other words, that it provides POSIX semantics
when it comes to locking, umask handling and so on).  Running
<a href="http://people.skolelinux.org/pere/blog/Testing_if_a_file_system_can_be_used_for_home_directories___.html">my
test code to check file system semantics</a>, I was happy to discover
that no errors were found.  So the file system can be used for home
directories, if one chooses to do so.</p>

<p>If you do not want a local file system, and want something that
works without the Linux FUSE file system, I would like to mention the
<a href="http://www.tarsnap.com/">Tarsnap service</a>, which also
provides locally encrypted backup using a command line client.  It has
a nicer access control system, where one can split out read and write
access, allowing some systems to write to the backup and others to
only read from it.</p>

<p>As usual, if you use Bitcoin and want to show your support of my
activities, please send Bitcoin donations to my address
<b><a href="bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b&label=PetterReinholdtsenBlog">15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b</a></b>.</p>

The EU Court of Justice confirmed today that the Data Retention Directive is invalid
http://people.skolelinux.org/pere/blog/EU_domstolen_bekreftet_i_dag_at_datalagringsdirektivet_er_ulovlig.html

@@ -616,97 +898,5 @@ workstation, LTSP client or LTSP server.</p>

How should RFC 822-formatted email be stored in a NOARK5 database?
http://people.skolelinux.org/pere/blog/Hvordan_b_r_RFC_822_formattert_epost_lagres_i_en_NOARK5_database_.html
http://people.skolelinux.org/pere/blog/Hvordan_b_r_RFC_822_formattert_epost_lagres_i_en_NOARK5_database_.html
Fri, 7 Mar 2014 15:20:00 +0100

<p>A few weeks ago, NXC's free software licensed NOARK5 solution was
<a href="http://www.nuug.no/aktiviteter/20140211-noark/">presented at
NUUG</a> (video
<a href="https://www.youtube.com/watch?v=JCb_dNS3MHQ">available on
YouTube for now</a>), and it got me to look a bit closer at NOARK5,
the standard for records management in the Norwegian public sector.  I
wonder whether this core could be useful in a couple of my projects,
and for one of them the most relevant use is storing email.  I could
not find any recommendation on how RFC 822-formatted email (aka
Internet email) should be stored in NOARK5, even though I know some
archives print the email to PDF using their mail client and then
archive the PDF (or even worse, print it on paper and store the
scanned image of the email as PDF in the archive).</p>
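<p>To make the question concrete, here is a hypothetical fragment of
the kind of XML mapping I am after, where every header is preserved
verbatim and in order, so the original message can be reconstructed
unchanged.  The element names are invented for this illustration, and
a real mapping would also need to cover MIME structure and
encodings:</p>

<p><blockquote><pre>
&lt;message&gt;
  &lt;header name="Received"&gt;from mail.example.com by ...&lt;/header&gt;
  &lt;header name="From"&gt;sender@example.com&lt;/header&gt;
  &lt;header name="To"&gt;archive@example.com&lt;/header&gt;
  &lt;header name="Subject"&gt;Example message&lt;/header&gt;
  &lt;body&gt;The raw message body goes here.&lt;/body&gt;
&lt;/message&gt;
</pre></blockquote></p>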
<p>Not many formats are accepted by Riksarkivet (the National Archives
of Norway) for long-term preservation of public records, and PDF and
XML are the most relevant ones.  It struck me that there must be some
suitable XML representation, and perhaps even agreement on which one
to use, so I gathered my courage and asked
<a href="http://samdok.com/">SAMDOK</a>, a group associated with the
archive authorities that appears to work on NOARK interoperability,
whether they had any recommendations:</p>

<p><blockquote>
<p>Hi.</p>

<p>I am not sure if this is the right forum to raise my question, but
I wonder whether there is a defined recommendation for how RFC
822-formatted email (aka ordinary Internet email) should be handled in
NOARK5, such that all the information in the email is preserved
(e.g. the Received lines).  Is there a recommended XML mapping along
the lines of the one described at
&lt;URL: <a href="https://www.informit.com/articles/article.aspx?p=32074">https://www.informit.com/articles/article.aspx?p=32074</a> &gt;?  My
goal is to be able to store the email in a NOARK5 core and retrieve an
identically formatted copy of the original email when needed.</p>
</blockquote></p>

<p>The SAMDOK contact thought the question was better directed at
Riksarkivet itself, and today I got an answer from there, written by
senior adviser Geir Ivar Tungesvik:</p>

<p><blockquote>
<p>Riksarkivet has no recommendations when it comes to converting
email to XML.  Each records creator is free to define or use a format
of their own, including, as the question asks, a format from which the
original email format can be re-established from the XML.  The XML
(email) documents must be referenced in the archive structure, and a
valid XML schema (.xsd) for the XML files must be included.  Records
creators are thus free to do what they want, as long as it is
documented and an extract can be produced when the records are
transferred to a depot.</p>

<p>The mandatory requirements in the Noark 5 standard must of course
be fulfilled, in dialogue with Riksarkivet as part of the approval
process.  For public sector archives, the files loependeJournal.xml
and offentligJournal.xml are especially important.  Private archives
that want to follow the Noark 5 standard are of course free to use
whatever parts of the mandatory requirements are relevant to
them.</p>
</blockquote></p>

<p>It thus looks to me like there is an unfilled need for a standard
way to store RFC 822-formatted messages as XML.  Does anyone know of a
good specification for this?  In addition to the one mentioned above,
I have come across several relevant descriptions (search for "rfc 822
xml" to find the alternatives):</p>

<ul>

<li><a href="http://www.openhealth.org/xmtp/">XML MIME Transformation
Protocol (XMTP)</a> from OpenHealth, last updated in 2001.</li>

<li><a href="https://tools.ietf.org/html/draft-klyne-message-rfc822-xml-03">An
XML format for mail and other messages</a>, an IETF draft from
2001.</li>

<li><a href="http://www.informit.com/articles/article.aspx?p=32074">xMail:
E-mail as XML</a>, an article from 2003 describing the Python module
rfc822, which produces an XML representation of an RFC 822-formatted
email.</li>

</ul>

<p>Are there other and better specifications for this kind of storage?
Send me an email if you have any input.</p>