From: Petter Reinholdtsen Date: Wed, 9 Apr 2014 09:15:16 +0000 (+0000) Subject: New post. X-Git-Url: https://pere.pagekite.me/gitweb/homepage.git/commitdiff_plain/04f9755246f9917bd4810d624c25963d51a521ae?ds=sidebyside New post. --- diff --git a/blog/data/2014-04-09-nice-s3ql.txt b/blog/data/2014-04-09-nice-s3ql.txt new file mode 100644 index 0000000000..59bf76d0a4 --- /dev/null +++ b/blog/data/2014-04-09-nice-s3ql.txt @@ -0,0 +1,263 @@ +Title: s3ql, a locally mounted cloud file system - nice free software +Tags: english, debian, personvern, sikkerhet +Date: 2014-04-09 11:20 + +

For a while now, I have been looking for a sensible off-site backup +solution to use at home. My requirements are that it is cheap and locally +encrypted (in other words, I keep the keys and the storage provider does +not have access to my private files). One idea my friends and I have +had over the years has been to use Google mail as storage, writing a +Linux block device storing blocks as emails in the mail service +provided by Google, and thus get heaps of free space. On top of this +one can add encryption, RAID and volume management to have lots of +(fairly slow, I admit) cheap and encrypted storage. But I never +found time to implement such a system. For the last few weeks I have +looked at a system called +S3QL, a locally +mounted network-backed file system with the features I need.

+ +

S3QL is a FUSE file system with a local cache and cloud storage, +handling several different storage providers, any with an Amazon S3, +Google Storage or OpenStack API. There are heaps of such providers. It +can also use a local directory as storage, which combined with sshfs +allows for file storage on any ssh server. S3QL includes support for +encryption, compression, de-duplication, snapshots and immutable file +systems, allowing me to mount the remote storage as a local mount +point, look at and use the files as if they were local, while the +content is stored in the cloud as well. This allows me to have a +backup that should survive fire. The file system can not be shared +between several machines at the same time, as only one can mount it at +a time, but any machine with the encryption key and access to the +storage service can mount it if it is unmounted.

+ +

It is simple to use. I'm using it on Debian Wheezy, where the +package is included already. So to get started, run apt-get +install s3ql. Next, pick a storage provider. I ended up picking +Greenqloud, after reading their nice recipe on +how +to use s3ql with their Amazon S3 service, because I trust the laws +in Iceland more than those in the USA when it comes to keeping my data safe +and private, and thus would rather spend money on a company in +Iceland. Another nice recipe is available from the article +S3QL +Filesystem for HPC Storage by Jeff Layton in the HPC section of +Admin magazine. When the provider is picked, figure out how to get +the API key needed to connect to the storage API. With Greenqloud, +the key did not show up until I had added payment details to my +account.

+ +

Armed with the API access details, it is time to create the file +system. First, create a new bucket in the cloud. This bucket is the +file system storage area. I picked a bucket name reflecting the +machine that was going to store data there, but any name will do. +I'll refer to it as bucket-name below. In addition, one needs +the API login and password, and a locally created password. Store it +all in ~root/.s3ql/authinfo2 like this: + +

+[s3c]
+storage-url: s3c://s.greenqloud.com:443/bucket-name
+backend-login: API-login
+backend-password: API-password
+fs-passphrase: local-password
+
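Since this file holds both the API password and the file system passphrase, it should only be readable by its owner. Here is a minimal sketch of how the file could be created with tight permissions. The function names write_authinfo and random_passphrase are my own inventions for this example, not part of S3QL, and the passphrase generator is just one of many possible ways to get a fairly random string:

```shell
# Sketch: write an authinfo2 file readable only by its owner.
# write_authinfo is a hypothetical helper, not an S3QL command.
# Arguments: FILE STORAGE-URL API-LOGIN API-PASSWORD FS-PASSPHRASE
write_authinfo() {
    file=$1
    url=$2
    login=$3
    password=$4
    passphrase=$5
    (
        # The subshell keeps the umask change local to this function.
        umask 077
        mkdir -p "$(dirname "$file")"
        cat > "$file" <<EOF
[s3c]
storage-url: $url
backend-login: $login
backend-password: $password
fs-passphrase: $passphrase
EOF
    )
    # Also tighten the mode if the file already existed.
    chmod 600 "$file"
}

# Sketch: print a random alphanumeric passphrase (default 50 characters),
# using only /dev/urandom and standard tools.
random_passphrase() {
    head -c 4096 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-"${1:-50}"
}
```

It could be used like this: write_authinfo /root/.s3ql/authinfo2 s3c://s.greenqloud.com:443/bucket-name API-login API-password "$(random_passphrase 50)". Remember to store a copy of the passphrase somewhere safe, as the backup is useless without it.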

+ +

I create my local passphrase using pwgen 50 or similar, +but any sensible way to create a fairly random password should do it. +Armed with these details, it is now time to run mkfs, entering the API +details and password to create it:

+ +

+# mkdir -m 700 /var/lib/s3ql-cache
+# mkfs.s3ql --cachedir /var/lib/s3ql-cache --authfile /root/.s3ql/authinfo2   --ssl s3c://s.greenqloud.com:443/bucket-name
+Enter backend login: 
+Enter backend password: 
+Before using S3QL, make sure to read the user's guide, especially
+the 'Important Rules to Avoid Loosing Data' section.
+Enter encryption password: 
+Confirm encryption password: 
+Generating random encryption key...
+Creating metadata tables...
+Dumping metadata...
+..objects..
+..blocks..
+..inodes..
+..inode_blocks..
+..symlink_targets..
+..names..
+..contents..
+..ext_attributes..
+Compressing and uploading metadata...
+Wrote 0.00 MB of compressed metadata.
+# 

+ +

The next step is mounting the file system to make the storage available. + +

+# mount.s3ql --cachedir /var/lib/s3ql-cache --authfile /root/.s3ql/authinfo2 --ssl --allow-root s3c://s.greenqloud.com:443/bucket-name /s3ql
+Using 4 upload threads.
+Downloading and decompressing metadata...
+Reading metadata...
+..objects..
+..blocks..
+..inodes..
+..inode_blocks..
+..symlink_targets..
+..names..
+..contents..
+..ext_attributes..
+Mounting filesystem...
+# df -h /s3ql
+Filesystem                              Size  Used Avail Use% Mounted on
+s3c://s.greenqloud.com:443/bucket-name  1.0T     0  1.0T   0% /s3ql
+#
+

+ +

The file system is now ready for use. I use rsync to store my +backups in it, and as the metadata used by rsync is downloaded at +mount time, no network traffic (and storage cost) is triggered by +running rsync. To unmount, one should not use the normal umount +command, as this will not flush the cache to the cloud storage, but +instead run the umount.s3ql command like this: + +

+# umount.s3ql /s3ql
+# 
+

+ +

There is a fsck command available to check the file system and +correct any problems detected. This can be used if the local server +crashes while the file system is mounted, to reset the "already +mounted" flag. This is what it looks like when processing a working +file system:

+ +

+# fsck.s3ql --force --ssl s3c://s.greenqloud.com:443/bucket-name
+Using cached metadata.
+File system seems clean, checking anyway.
+Checking DB integrity...
+Creating temporary extra indices...
+Checking lost+found...
+Checking cached objects...
+Checking names (refcounts)...
+Checking contents (names)...
+Checking contents (inodes)...
+Checking contents (parent inodes)...
+Checking objects (reference counts)...
+Checking objects (backend)...
+..processed 5000 objects so far..
+..processed 10000 objects so far..
+..processed 15000 objects so far..
+Checking objects (sizes)...
+Checking blocks (referenced objects)...
+Checking blocks (refcounts)...
+Checking inode-block mapping (blocks)...
+Checking inode-block mapping (inodes)...
+Checking inodes (refcounts)...
+Checking inodes (sizes)...
+Checking extended attributes (names)...
+Checking extended attributes (inodes)...
+Checking symlinks (inodes)...
+Checking directory reachability...
+Checking unix conventions...
+Checking referential integrity...
+Dropping temporary indices...
+Backing up old metadata...
+Dumping metadata...
+..objects..
+..blocks..
+..inodes..
+..inode_blocks..
+..symlink_targets..
+..names..
+..contents..
+..ext_attributes..
+Compressing and uploading metadata...
+Wrote 0.89 MB of compressed metadata.
+# 
+

+ +

Thanks to the cache, working on files that fit in the cache is very +quick, about the same speed as local file access. Uploading large +amounts of data is for me limited by the bandwidth out of and into my +house. Uploading 685 MiB with a 100 MiB cache gave me 305 kiB/s, +which is very close to my upload speed, and downloading the same +Debian installation ISO gave me 610 kiB/s, close to my download speed. +Both were measured using dd. So for me, the bottleneck is my +network, not the file system code.
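To put those numbers in perspective, here is a small sketch of the arithmetic using plain shell integer math. The function name transfer_seconds is made up for this example. It takes a size in MiB and a rate in kiB/s and prints the number of whole seconds, rounded down:

```shell
# Sketch: whole seconds needed to move SIZE_MIB at RATE_KIB_PER_S.
# transfer_seconds is a hypothetical helper name; the result is rounded down.
transfer_seconds() {
    echo $(( $1 * 1024 / $2 ))
}

transfer_seconds 685 305   # upload of the ISO, prints 2299
transfer_seconds 685 610   # download of the ISO, prints 1149
```

In other words, the 685 MiB ISO took roughly 38 minutes to upload and 19 minutes to download.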

+ +I mentioned that only one machine can mount the file system at a +time. If another machine tries, it is told that the file system is +busy: + +

+# mount.s3ql --cachedir /var/lib/s3ql-cache --authfile /root/.s3ql/authinfo2 --ssl --allow-root s3c://s.greenqloud.com:443/bucket-name /s3ql
+Using 8 upload threads.
+Backend reports that fs is still mounted elsewhere, aborting.
+#
+

+ +

The file content is uploaded when the cache is full, while the +metadata is uploaded once every 24 hours by default. To ensure the +file system content is flushed to the cloud, one can either unmount the +file system, or ask s3ql to flush the cache and metadata using +s3qlctrl: + +

+# s3qlctrl upload-meta /s3ql
+# s3qlctrl flushcache /s3ql
+# 
+
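If the file system stays mounted between backup runs, the rsync run and the two s3qlctrl calls can be combined. The following is only a sketch under my own assumptions: the names run and backup_and_flush are hypothetical, the mount point defaults to /s3ql as used in this post, and setting DRY_RUN=1 makes it print the commands instead of running them:

```shell
# Sketch: copy a directory into the mounted S3QL file system and then ask
# S3QL to upload metadata and cached data right away.
# run and backup_and_flush are hypothetical helper names.

# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Arguments: SOURCE-DIRECTORY [MOUNT-POINT]
backup_and_flush() {
    src=$1
    mnt=${2:-/s3ql}
    # Stop early if the copy fails, so a broken run is noticed.
    run rsync -a "$src" "$mnt/" || return 1
    # Same two commands and order as shown above.
    run s3qlctrl upload-meta "$mnt" &&
    run s3qlctrl flushcache "$mnt"
}
```

Running it as DRY_RUN=1 backup_and_flush /home/user first is a cheap way to check what would be executed before trusting it with real data.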

+ +

If you are curious about how much space your data uses in the +cloud, and how much compression and deduplication cut down on the +storage usage, you can use s3qlstat on the mounted file system to get +a report:

+ +

+# s3qlstat /s3ql
+Directory entries:    9141
+Inodes:               9143
+Data blocks:          8851
+Total data size:      22049.38 MB
+After de-duplication: 21955.46 MB (99.57% of total)
+After compression:    21877.28 MB (99.22% of total, 99.64% of de-duplicated)
+Database size:        2.39 MB (uncompressed)
+(some values do not take into account not-yet-uploaded dirty blocks in cache)
+#
+

+ +

I mentioned earlier that there are several possible suppliers of +storage. I did not try to locate them all, but am aware of at least +Greenqloud, +Google Storage, +Amazon S3 web services, +Rackspace and +Crowncloud. The latter even +accepts payment in Bitcoin. Pick one that suits your needs. Some of +them provide several GiB of free storage, but the price models are +quite different and you will have to figure out what suits you +best.

+ +

While researching this blog post, I had a look at research papers +and posters discussing the S3QL file system. There are several, which +told me that the file system is being critically examined by the +scientific community, and this increased my confidence in using it. One nice +poster is titled +"An +Innovative Parallel Cloud Storage System using OpenStack’s SwiftObject +Store and Transformative Parallel I/O Approach" by Hsing-Bung +Chen, Benjamin McClelland, David Sherrill, Alfred Torrez, Parks Fields +and Pamela Smith. Please have a look.

+ +

If you do not want a locally mounted file system, and want something that +works without the Linux FUSE file system, I would like to mention the +Tarsnap service, which also +provides locally encrypted backup using a command line client. It has +a nicer access control system, where one can split out read and write +access, allowing some systems to write to the backup and others to +only read from it.

+ +

As usual, if you use Bitcoin and want to show your support of my +activities, please send Bitcoin donations to my address +15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.