Two years ago I wrote about backing up your personal data using cloud services, and specifically using Amazon’s Glacier service. I’m writing this post a couple of weeks after New Year’s Day, a time of year to think about getting into better habits. Let’s take another look at this.
Some simple commands that you learn in Learning Tree’s Linux server administration course make it easy to back up your personal data, and the price with Glacier is certainly good! If your camera or phone produces images of 1–2 MB, a dollar a month would let you archive 50,000 to 100,000 pictures.
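The arithmetic behind that estimate is worth a quick check. Assuming Glacier’s roughly $0.01 per GB-month price at the time of writing, a dollar a month buys about 100 GB:

```shell
# Assumption: Glacier storage at about $0.01 per GB-month,
# so $1 per month covers roughly 100 GB.
budget_gb=100
echo $(( budget_gb * 1024 / 2 ))   # 2 MB photos: 51200, about 50,000
echo $(( budget_gb * 1024 ))       # 1 MB photos: 102400, about 100,000
```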
As we discuss in Learning Tree’s System and Network Security Introduction course, you simply cannot prove anything about data availability, as it isn’t based on mathematical concepts the way confidentiality and integrity are. But Amazon makes a very good argument for their Glacier service being designed to provide an average annual durability of 99.999999999% for an archive. That is, only a 1-in-100,000,000,000 chance that an archived file would be lost within one year. They base their analysis of their design on their ever-growing and already enormous collection of observations on media lifetimes and other sources of data loss or corruption.
Nothing is perfect — extra copies on local disks provide only limited help, and there’s nothing magic about magnetic tape. I can’t imagine a corporation or government agency implementing an archive system like Glacier. Someone might get huffy and protest “But the U.S. Government would do this, of course!” No, the U.S. Government pays Amazon to build and operate these types of systems.
Amazon’s design uses redundancy and geographic diversity. An archive is stored at multiple facilities in what Amazon calls multiple availability zones, and on multiple devices within each of those facilities. There is hardware RAID at the very bottom, and aggressive replacement of hardware. The data is periodically written onto new hardware, with SHA2-256 hashes monitoring integrity.
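We can’t watch Amazon run those checks, but the same integrity idea is easy to apply to your own archives with standard tools. A minimal sketch, using a stand-in file:

```shell
# Stand-in archive file for the demo:
echo "example archive contents" > archive.dat

# Record the SHA-256 hash at archive time:
sha256sum archive.dat > archive.dat.sha256

# Later (say, after a restore), verify it; prints "archive.dat: OK" on success:
sha256sum -c archive.dat.sha256
```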
Sure, Amazon has service outages from time to time. But we need to carefully consider the difference between data availability and durability. As Amazon uses those terms, availability means the ability to access your data right now, and no one can really promise that when your request has to traverse the Internet, where any carrier link or router along the way might be temporarily down. Durability means that you might have to wait a little while, but you will get your data back.
There is no confidentiality with Glacier. Amazon describes how uploads are done over TLS only, and how the stored data is encrypted with AES using 256-bit keys. But those are keys you don’t control and never even see, and Amazon will silently turn your data over to the U.S. Government.
If you want confidentiality, do it yourself. Use the OpenSSL toolkit to encrypt your files before uploading them. Generate strong keys and save multiple paper copies in safe storage. Here’s how to generate a random 256-bit key, saving one copy into a file and printing three copies:
$ dd if=/dev/random bs=32 count=1 | \
      base64 | tee my-key | enscript -# 3
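Once the key exists, encrypting before upload is one more OpenSSL command. Here’s a sketch with hypothetical filenames (photos.tar, my-key), using AES-256-CBC with PBKDF2 key derivation; the -pbkdf2 flag needs OpenSSL 1.1.1 or later:

```shell
# Stand-in archive and key for the demo (/dev/urandom so it never blocks):
echo "sample photo data" > photos.tar
dd if=/dev/urandom bs=32 count=1 2>/dev/null | base64 > my-key

# Encrypt with AES-256 before uploading to Glacier:
openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in photos.tar -out photos.tar.enc -pass file:my-key

# Always confirm the round trip before deleting anything local:
openssl enc -d -aes-256-cbc -pbkdf2 \
    -in photos.tar.enc -out photos-check.tar -pass file:my-key
cmp photos.tar photos-check.tar && echo "round trip OK"
```

Upload only the .enc file; the key never leaves your hands.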
Amazon’s US-East region has the most visible outages. That is partly because it’s the default region for most customers and for Amazon itself, and partly because it sits in what is already the most overloaded neighborhood of the Internet. The Atlantic recently ran a very interesting article, “Why Amazon’s Data Centers Are Hidden in Spy Country,” explaining the Internet history of northern Virginia and Amazon’s place within it.
The drawback to Glacier is the user interface. It’s not point-and-click as we have with S3 (which is based on the same storage technology with the same expected durability, but at ten times the cost). There is a command-line interface, but it’s not at all friendly. Check back next week and I’ll explain the alternatives.