One of my consulting clients made a disturbing discovery last month. This happened in-house, but it would have been even more likely in the cloud.
One machine holds an Oracle database and some other crucial files. They set up a scheduled job to back up the database and those files at 3 AM every workday.
One staff member had the task of removing the tape from the drive every morning, taking the fresh backup to a safe location and inserting the next tape into the drive.
Sure enough, every morning a rewound tape stuck out of the drive. The Oracle administrator always observed that the database had shut down and restarted in the middle of the night. Things seemed OK, but they wanted to make sure.
The real test would be to recover an arbitrarily chosen backup onto a blank system, but that would require a spare system with plenty of empty storage, and the Oracle administrator would have to help us verify the recovered database.
First, let’s look at the backup script.
It was very nicely designed, carefully checking everything that I could imagine going wrong and clearly describing any problems it encountered. Its run was scheduled by cron and so all of its output would be e-mailed to the job’s owner, the administrator root.
“You aren’t getting any messages with errors?”
No, none at all. Good!
“So, are you doing anything with those daily reports about the backup?”
“What daily reports?”
There’s no way for the script to run without generating some output along the way. Success or failure, it’s going to say something. Maybe it’s just not being heard:
$ ls -l /var/mail
Look at that multi-megabyte mail file for root! Use su to become the administrator and read the mail.
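Spotting a neglected mailbox does not have to depend on someone happening to run ls. As a rough sketch, the function below lists spool files over about 1 MiB. The function name and the threshold are my own illustrative choices, not anything from the client's system.

```shell
#!/bin/sh
# Sketch: list mail spool files larger than about 1 MiB.
# The function name and the threshold are illustrative assumptions.

large_mailboxes() {
    spool_dir="${1:-/var/mail}"
    # POSIX find counts -size in 512-byte blocks, so +2048 means "over 1 MiB".
    find "$spool_dir" -type f -size +2048 -exec ls -l {} +
}
```

Calling `large_mailboxes` with no argument checks /var/mail, and a system account such as root showing up in the result is a hint that reports are piling up unread.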
Would you look at that. Reports dutifully e-mailed at about 03:01 every day reporting that Oracle had stopped cleanly, but then the specified tape device did not exist. The commands to rewind and eject all tapes had run just fine, though.
Some hardware had been reconfigured back in May. From then until we found this in February, roughly nine months, they had just been shuffling blank, rewound tapes.
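The client's script already reported the missing device, so this is not their code. It is only a minimal sketch of what such a check can look like, assuming the tape drive shows up as a character special file. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: detect a missing tape device before writing to it.
# The function name is illustrative, not taken from the client's script.

tape_device_ok() {
    dev="$1"
    # A tape drive normally appears as a character special file; -c tests for that.
    if [ -c "$dev" ]; then
        return 0
    fi
    echo "ERROR: tape device $dev does not exist or is not a character device" >&2
    return 1
}
```

A backup script could call `tape_device_ok "$TAPE_DEVICE" || exit 1` before doing anything else, where `TAPE_DEVICE` is again an assumed variable name. Exiting before the rewind and eject steps would also leave the tape in the drive, so the failure would be visible to whoever comes to swap it.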
The good news is that we found and fixed the problem before any recovery was needed. The physical presence of the ejected tape and the annoyance of the daily ritual kept reminding them to check on this.
In Learning Tree’s Cloud Security Essentials course we discuss how IaaS cloud services probably offer a better chance of achieving compliance, because IaaS amounts to remote system administration and you retain control and visibility. But failures in unattended scheduled jobs, such as backups from AWS EC2 servers to S3 storage, offer fewer reminders and clues.
The fix had three parts: correct the tape device name in the script, make sure that mail to root was automatically forwarded to a human, and then confirm that the daily reports started appearing. Basic system administration isn’t glamorous, but it’s necessary.
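On systems with a Sendmail-style aliases file, root's mail is usually forwarded by adding a line such as `root: admin@example.com` (a placeholder address) and then running `newaliases`. The sketch below only checks whether such an entry exists. The function name is my own, and the default path /etc/aliases is an assumption that may differ on your system.

```shell
#!/bin/sh
# Sketch: show where an aliases file forwards root's mail, if anywhere.
# The function name is illustrative; /etc/aliases is a common default path.

root_alias_target() {
    aliases_file="${1:-/etc/aliases}"
    # Print whatever follows "root:" on the first uncommented root entry.
    sed -n 's/^root:[[:space:]]*//p' "$aliases_file" | head -n 1
}
```

Empty output from `root_alias_target` means root's mail is staying in the local spool, which is exactly how these reports went unread.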
I’ll show you how to fix this problem next week!