Glacier uses the same storage technology as S3, for just one-tenth the cost. The primary trade-off is that Glacier is intended for archiving data, not using it: to get the low price you put an archive into Glacier and just leave it there. One drawback to Glacier is indicated by its name. It is not designed for speed; it takes four hours simply to get an inventory of a vault. Verify what you upload, but be patient.
Another drawback is the interface. It definitely isn’t point-and-click like S3. Amazon’s original plan was that everyone would write their own storage code using Amazon’s software development kits (or SDKs) for Java, Python, and .NET. You don’t have to write your own, though: a number of free tools are available, and Amazon has added Glacier support to their command-line toolkit. But before we look at those tools, we need to prepare our data and storage area.
Start by gathering your data into large archive files. I use
tar, although
zip would also work. The billing is in units of gigabytes and they round everything up. Let’s say you have 50,000 image files of 1–2 MB each. You would pay less than $1 per month to store them as one
tar archive file, but about $500 to store them as 50,000 separate archives.
You might compress your archives with
xz, but data like JPEG images and MP3 audio files are already compressed and won’t become significantly smaller.
$ tar cf /tmp/darkside.tar ./mp3/Pink\ Floyd/1973\ -\ Dark\ Side\ Of\ The\ Moon/*.mp3
$ ls -lh /tmp/darkside.tar
-rw-r--r-- 1 cromwell cromwell 39.3M Jan 12 10:11 /tmp/darkside.tar
$ xz -9v /tmp/darkside.tar
/tmp/darkside.tar (1/1)  100 %  39.2 MiB / 39.3 MiB = 0.998  1.4 MiB/s  0:28
$ ls -lh /tmp/darkside.tar.xz
-rw-r--r-- 1 cromwell cromwell 39.2M Jan 12 10:11 /tmp/darkside.tar.xz
The next step is important: create vaults with meaningful names. You can see your vault names immediately, but it takes four hours to retrieve an inventory of the archives they contain. I create a vault named
Pictures-2015-12.tar to store an archive named
Pictures-2015-12.tar, which is the result of doing the following at the end of last December:
$ cd
$ tar cf Pictures-2015-12.tar ./Pictures/
If I had simply named the vault
Pictures then I wouldn’t remember if it had several individual archive files or one large one, or what their names were. If I had simply named it
2015 or, worse yet, something like
backup, then I wouldn’t have any idea at all. I would have to request inventories of all my vaults and wait four hours for each one!
Other alternatives that have been around for a while include HashBackup, a command-line tool for Linux, Mac OS X, and BSD; CrossFTP, a graphical client for Linux, Mac OS X, and Windows; and mt-aws-glacier, a Perl tool supporting multi-threaded multi-part file system sync to Glacier.
Amazon’s Python SDK is Boto; look for packages with “boto” in their name:
python-boto on most Linux distributions and
py-boto on OpenBSD. That gives you the command
glacier with which you can list your vaults and upload smaller archives of just a few gigabytes each.
Relevant Amazon how-to pages include “Uploading an Archive in Amazon Glacier” and “Using Amazon Glacier with the AWS Command Line Interface”.
The second of those describes how to upload with their command line interface. It’s rather complicated. You must split the archive into many files with
split, and then upload the individual chunks while specifying the byte ranges of the chunks so they can be reassembled.
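A sketch of the splitting step, using an illustrative sample file made with dd (AWS requires part sizes that are a power-of-two multiple of 1 MiB, from 1 MiB up to 4 GiB):

```shell
# Make a 4 MiB sample file, then split it into 1 MiB parts named
# chunkaa, chunkab, ... ("chunk" prefix plus alphabetical suffixes).
dd if=/dev/zero of=sample.tar bs=1048576 count=4 2>/dev/null
split -b 1048576 sample.tar chunk
echo chunk*                   # chunkaa chunkab chunkac chunkad
```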
As if that isn’t complicated enough, you must then calculate the tree hash, chaining together hashes of hashes. If you had just 4 parts, files
chunkaa through
chunkad, that would be done as follows:
$ openssl dgst -sha256 -binary chunkaa > hash1
$ openssl dgst -sha256 -binary chunkab > hash2
$ openssl dgst -sha256 -binary chunkac > hash3
$ openssl dgst -sha256 -binary chunkad > hash4
$ cat hash1 hash2 > hash12
$ openssl dgst -sha256 -binary hash12 > hash12hash
$ cat hash3 hash4 > hash34
$ openssl dgst -sha256 -binary hash34 > hash34hash
$ cat hash12hash hash34hash > hash1234
$ openssl dgst -sha256 -r hash1234 > hash1234hash
$ TREEHASH=$(awk '{print $1}' hash1234hash)
$ echo $TREEHASH
Then upload that value along with the archive. It is always required, but tools like SAGU calculate it for you. Nice, huh?
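These manual steps can be scripted for any number of parts. Here is a sketch of a shell function that computes the AWS-style tree hash (hash each part, then repeatedly hash concatenated pairs of digests); the function name treehash is my own, not part of any toolkit:

```shell
# Compute an AWS-style SHA-256 tree hash of the files given as arguments.
treehash() {
    dir=$(mktemp -d)
    n=0
    for part in "$@"; do
        openssl dgst -sha256 -binary "$part" > "$dir/h$n"
        n=$((n + 1))
    done
    # Collapse one level at a time until a single digest remains.
    while [ "$n" -gt 1 ]; do
        j=0 k=0
        while [ "$k" -lt "$n" ]; do
            if [ $((k + 1)) -lt "$n" ]; then
                cat "$dir/h$k" "$dir/h$((k + 1))" |
                    openssl dgst -sha256 -binary > "$dir/g$j"
            else
                mv "$dir/h$k" "$dir/g$j"    # odd part carries up unchanged
            fi
            k=$((k + 2)); j=$((j + 1))
        done
        n=$j k=0
        while [ "$k" -lt "$n" ]; do
            mv "$dir/g$k" "$dir/h$k"; k=$((k + 1))
        done
    done
    od -An -tx1 "$dir/h0" | tr -d ' \n'    # print final digest as hex
    echo
    rm -r "$dir"
}

# For a single part the tree hash is just that part's SHA-256:
printf 'demo' > part1
treehash part1
```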
You can verify that you have a connection with
lsof:
# lsof -i tcp:443
COMMAND   PID     USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
java    20576 cromwell  28u   IPv6 7720188      0t0  TCP c-24-12-71-136.hsd1.in.comcast.net:55882->glacier.us-east-1.amazonaws.com:https (ESTABLISHED)
Have a look at your outbound network utilization to verify that something is really happening. Amazon has plenty of bandwidth at their end, so you will saturate your own uplink. Observe your outbound bandwidth, do the math to estimate the approximate finish time, and be patient.
Keep your data safe!