New Year’s Resolution: How to Back Up Your Data

Last week I encouraged you to use Amazon’s cheap and extremely resilient Glacier service to archive your personal data. Let’s see how to do this.

Glacier uses the same storage technology as S3, for just one-tenth the cost. The primary trade-off is that Glacier is intended for archiving data, not using it: to get the low price you put an archive into Glacier and simply leave it there. One drawback to Glacier is indicated by its name. It is not designed for speed; it takes about four hours just to get an inventory of a vault’s archives. Verify everything you do, but be patient.

Another drawback is the interface. It definitely isn’t point-and-click like S3. Amazon’s original plan was that everyone would write their own storage code using Amazon’s software development kits (SDKs) for Java, Python, and .NET. You don’t have to write your own, though: a number of free tools are available, and Amazon has added Glacier support to its command-line toolkit. But before we look at those tools, we need to prepare our data and storage area.

Bundle and (Maybe) Compress

Start by gathering your data into large archive files. I use tar, although zip would also work. Billing is in units of gigabytes, and each archive is rounded up. Say you have 50,000 image files of 1–2 MB each: you would pay less than $1 per month to store them as one tar archive, but around $500 to store them as 50,000 separate archives, each rounded up to a full gigabyte.
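
To see where that $500 figure comes from, apply the rounding rule: 50,000 archives, each rounded up to a full gigabyte, at the roughly one-cent-per-gigabyte-month price the figure implies:

$ echo '50000 * 1 * 0.01' | bc
500.00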

You might compress your archives with xz, but data like JPEG images and MP3 audio files are already compressed and won’t become significantly smaller.

$ tar cf /tmp/darkside.tar ./mp3/Pink\ Floyd/1973\ -\ Dark\ Side\ Of\ The\ Moon/*.mp3
$ ls -lh /tmp/darkside.tar
-rw-r--r--  1 cromwell  cromwell  39.3M Jan 12 10:11 /tmp/darkside.tar
$ xz -9v /tmp/darkside.tar
/tmp/darkside.tar (1/1)
  100 %         39.2 MiB / 39.3 MiB = 0.998   1.4 MiB/s       0:28
$ ls -lh /tmp/darkside.tar.xz
-rw-r--r--  1 cromwell  cromwell  39.2M Jan 12 10:11 /tmp/darkside.tar.xz
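
By contrast, text-heavy data such as documents or source code compresses well with xz, often shrinking several-fold. A quick sketch, using a hypothetical ./Documents directory:

$ tar cf /tmp/documents.tar ./Documents/
$ xz -9v /tmp/documents.tar
$ ls -lh /tmp/documents.tar.xz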

Name Your Storage Vaults

The next step is important: create vaults with meaningful names. You can see your vault names immediately, but it takes four hours to retrieve an inventory of the archives they contain. I create a vault named Pictures-2015-12.tar to store an archive named Pictures-2015-12.tar, which is the result of doing the following at the end of last December:

$ cd
$ tar cf Pictures-2015-12.tar ./Pictures/
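
Before uploading, it’s worth listing the archive’s first few entries with tar tf to confirm you bundled what you intended:

$ tar tf Pictures-2015-12.tar | head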

If I had simply named the vault Pictures then I wouldn’t remember if it had several individual archive files or one large one, or what their names were. If I had simply named it 2015 or, worse yet, something like backup, then I wouldn’t have any idea at all. I would have to request inventories of all my vaults and wait four hours for each one!

Upload Your Archives

I still use SAGU, the Simple Amazon Glacier Uploader. It’s a Java-based graphical client for Linux, Mac OS X, BSD, and Windows. It provides very little feedback while running, but it gets the job done.

Other alternatives that have been around for a while include HashBackup, a command-line tool for Linux, Mac OS X, and BSD; CrossFTP, a graphical client for Linux, Mac OS X, and Windows; and mt-aws-glacier, a Perl tool supporting multi-threaded multi-part file system sync to Glacier.

Amazon’s Python SDK is Boto; look for packages with “boto” in their name: python-boto on most Linux distributions and py-boto on OpenBSD. Boto includes the glacier command, with which you can list your vaults and upload smaller archives of just a few gigabytes each.
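
A minimal sketch of using it, assuming your credentials are already set up in ~/.boto; the exact subcommands vary between boto versions, so check glacier --help:

$ # List your vaults (names appear immediately, unlike inventories).
$ glacier vaults
$ # Upload an archive: first argument is the vault name, the rest are files.
$ glacier upload Pictures-2015-12.tar Pictures-2015-12.tar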

Relevant Amazon how-to pages include “Uploading an Archive in Amazon Glacier” and “Using Amazon Glacier with the AWS Command Line Interface”.

The second of those describes how to upload with their command-line interface. It’s rather complicated: you must split the archive into many chunk files with dd or split, and then upload the individual chunks, specifying each one’s byte range so they can be reassembled.
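
Condensed from Amazon’s walk-through, the sequence looks roughly like this; the vault name matches our example, the 1 MiB part size keeps the numbers small, and $UPLOADID stands for the upload ID that the first aws command prints:

$ # Split the archive into 1 MiB chunks named chunkaa, chunkab, ...
$ split -b 1048576 Pictures-2015-12.tar chunk
$ # Start the multipart upload; note the uploadId in the output.
$ aws glacier initiate-multipart-upload --account-id - \
      --vault-name Pictures-2015-12.tar --part-size 1048576
$ # Upload each chunk with its byte range (the second chunk shown here).
$ aws glacier upload-multipart-part --account-id - \
      --vault-name Pictures-2015-12.tar --upload-id $UPLOADID \
      --range 'bytes 1048576-2097151/*' --body chunkab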

As if that isn’t complicated enough, you must then calculate the tree hash: take the SHA-256 hash of each chunk, concatenate and hash adjacent pairs, and repeat level by level until a single root hash remains. If you had just 4 parts, files chunkaa through chunkad, that would be done as follows:

$ openssl dgst -sha256 -binary chunkaa > hash1
$ openssl dgst -sha256 -binary chunkab > hash2
$ openssl dgst -sha256 -binary chunkac > hash3
$ openssl dgst -sha256 -binary chunkad > hash4

$ cat hash1 hash2 > pair12
$ openssl dgst -sha256 -binary pair12 > hash12
$ cat hash3 hash4 > pair34
$ openssl dgst -sha256 -binary pair34 > hash34
$ cat hash12 hash34 > pair1234
$ TREEHASH=$(openssl dgst -sha256 pair1234 | awk '{print $2}')
$ echo $TREEHASH
b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c

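The four-file example generalizes to any number of chunks. Here is a rough sketch of a script, which I’ll call treehash.sh (my name, not a standard tool), that applies the same pairing to whatever chunk files you pass it, in order:

#!/bin/sh
# treehash.sh -- rough sketch: compute the SHA-256 tree hash of the
# chunk files given as arguments by pairing adjacent hashes level by
# level until a single root hash remains.
dir=$(mktemp -d)
i=0
for chunk in "$@"; do
    openssl dgst -sha256 -binary "$chunk" > "$dir/h$(printf %06d $i)"
    i=$((i + 1))
done
while [ "$(ls "$dir" | wc -l)" -gt 1 ]; do
    j=0
    left=
    for a in "$dir"/h*; do
        if [ -z "$left" ]; then
            left=$a
        else
            # Hash the concatenation of each adjacent pair.
            cat "$left" "$a" | openssl dgst -sha256 -binary > "$dir/g$(printf %06d $j)"
            rm "$left" "$a"
            left=
            j=$((j + 1))
        fi
    done
    # An odd hash left over is promoted unchanged to the next level.
    [ -n "$left" ] && mv "$left" "$dir/g$(printf %06d $j)"
    for g in "$dir"/g*; do
        mv "$g" "$dir/h${g##*/g}"
    done
done
# The single remaining file is the root hash; print it as hex.
od -An -tx1 "$dir"/h* | tr -d ' \n'
echo
rm -r "$dir"

Running sh treehash.sh chunkaa chunkab chunkac chunkad should print the same root hash as the manual steps above.
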
Then you supply that value when completing the upload. The tree hash is always required; tools like SAGU calculate it for you behind the scenes. Nice, huh?
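
With the AWS command-line interface that final step looks something like this, again assuming the example vault and a 4 MiB archive assembled from four 1 MiB chunks:

$ aws glacier complete-multipart-upload --account-id - \
      --vault-name Pictures-2015-12.tar --upload-id $UPLOADID \
      --checksum $TREEHASH --archive-size 4194304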

… And Wait Patiently

You can verify that you have a connection with lsof:

# lsof -i tcp:443
COMMAND   PID     USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
java    20576 cromwell   28u  IPv6 7720188      0t0  TCP c-24-12-71-136.hsd1.in.comcast.net:55882->glacier.us-east-1.amazonaws.com:https (ESTABLISHED)

Have a look at your outbound network utilization to verify that something is really going on. Amazon has plenty of bandwidth at their end, so the bottleneck is your own connection, which the upload will saturate. Observe your outbound bandwidth, do the math to estimate the approximate finish time, and be patient.
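
On Linux, one crude way to measure that is to sample the kernel’s transmit-byte counter twice; eth0 is just an example interface name, so substitute your own:

$ # The ninth number after "eth0:" is total bytes transmitted.
$ grep eth0: /proc/net/dev ; sleep 10 ; grep eth0: /proc/net/dev
$ # Subtract the readings, divide by 10 for bytes per second, then
$ # divide the remaining archive size by that rate to estimate the time.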

Keep your data safe!
