It’s only a Backup if you’ve tested a Restore

In modern software development there are all kinds of exotic data storage options available to us, and many reasons why each might be a good or bad fit for a given context. However, some basic needs are common to all systems, and one of these is that if you are storing valuable data, you need to know it is safe from corruption or loss – and that if something does go wrong, you can get the system back on its feet afterwards. It doesn’t matter how shiny a technology is; if this requirement is not met, it would be reckless and unprofessional to gamble with the business’s data.

Many databases offer backup and restore as a feature or plugin. That’s great: these options are supported by the vendor, well maintained and trusted, and their availability should be a first-class criterion when selecting a data storage platform. On its own, though, this is not enough; the team supporting the system needs to know that:

  • Backups have been configured with appropriate settings for the context
  • Backups are actually running in the correct way
  • The backups can be restored and normal operation can be resumed!

Two out of three here isn’t good enough; you need to do this for real, on an environment that doesn’t matter, so that the process is tried and tested. Document or script the process, so that when it matters there are fewer surprises. Better still, practice on a regular basis; one of the principles of Continuous Integration and Continuous Deployment is that if something is perceived to be difficult or risky, you should do it more often to force it to become simpler and safer. You could script something that takes the most recent backup, restores it to a fresh environment, and runs some health checks against it – with this in place you have a health monitor proving that, should the worst happen, you have a way back.
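As a minimal sketch of that idea, the Python below finds the most recent backup file, fails loudly if none exists or the newest one is stale, and then hands it to a restore-and-check step. The backup directory, file extension, and the `restore_db` and `healthcheck` commands are all assumptions standing in for whatever your platform actually provides (for example, `pg_restore` into a scratch instance followed by a smoke-test query).

```python
import subprocess
import time
from pathlib import Path

BACKUP_DIR = Path("/var/backups/mydb")   # assumed location of backup files
MAX_AGE_SECONDS = 24 * 60 * 60           # alert if the newest backup is over a day old

def latest_backup(backup_dir: Path, max_age: float = MAX_AGE_SECONDS) -> Path:
    """Find the most recent backup file; fail loudly if it is missing or stale."""
    backups = sorted(backup_dir.glob("*.bak"), key=lambda p: p.stat().st_mtime)
    if not backups:
        raise RuntimeError(f"No backups found in {backup_dir}")
    newest = backups[-1]
    age = time.time() - newest.stat().st_mtime
    if age > max_age:
        raise RuntimeError(f"Newest backup {newest.name} is {age / 3600:.1f}h old")
    return newest

def restore_and_check(backup: Path) -> None:
    """Restore into a throwaway environment and run a basic health check."""
    # 'restore_db' and 'healthcheck' are hypothetical placeholders for your
    # platform's own tooling; a non-zero exit from either fails the run.
    subprocess.run(["restore_db", "--target", "scratch", str(backup)], check=True)
    subprocess.run(["healthcheck", "--env", "scratch"], check=True)
```

Run on a schedule, a failure of any step – no backup, a stale backup, a restore that errors, or a health check that fails – becomes an alert, rather than a surprise during an incident.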

Make sure you think about every element in the chain: if you are saving your backup files to a disk, how is that disk protected? Are there any failure modes you couldn’t recover from? Cloud service providers, for example, offer features like snapshots, redundant storage, and even geo-redundant storage for resilience against region-wide incidents.
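One cheap link in that chain is detecting silent corruption of the backup files themselves. A sketch, assuming backups are plain files: record a SHA-256 digest in a sidecar file when the backup is taken, and re-verify it before you ever need to restore.

```python
import hashlib
from pathlib import Path

def write_checksum(backup: Path) -> Path:
    """Record a SHA-256 digest alongside the backup at the time it is taken."""
    digest = hashlib.sha256(backup.read_bytes()).hexdigest()
    sidecar = backup.with_suffix(backup.suffix + ".sha256")
    sidecar.write_text(digest)
    return sidecar

def verify_checksum(backup: Path) -> bool:
    """Re-hash the file and compare; a mismatch means the copy has been damaged."""
    sidecar = backup.with_suffix(backup.suffix + ".sha256")
    expected = sidecar.read_text().strip()
    actual = hashlib.sha256(backup.read_bytes()).hexdigest()
    return actual == expected
```

This doesn’t replace redundant storage, but it tells you a backup has rotted on disk before you bet a recovery on it.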

Be aware, of course, that hardware failure isn’t the only thing that can go wrong. Two other common causes of database problems are valid commands issued by accident – processed correctly, but with an undesirable result – and corruption from incorrect sequences of commands that are partially handled and leave data in a broken state. Some backup technologies capture a transaction log; be mindful that replaying it will reapply the corrupting transactions unless you have a way to filter them out! Generally speaking, you will want to identify the last known-good point where a backup exists, and work forward from there.
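Finding that last known-good point can be expressed very simply. A sketch, assuming you know roughly when the corrupting event happened and have timestamps for your backups:

```python
def last_known_good(backup_times, corruption_time):
    """Return the most recent backup taken strictly before the corrupting event."""
    candidates = [t for t in backup_times if t < corruption_time]
    if not candidates:
        raise RuntimeError("No backup predates the corruption")
    return max(candidates)
```

The `RuntimeError` branch is the case backup retention policies exist to prevent: if corruption goes unnoticed longer than you keep backups, there is no clean point to return to.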

It’s worth mentioning that all of this is far more difficult in a distributed system, which may have multiple data stores. In that scenario forethought is more important still, to make sure there is a tested strategy in place to recover the whole system. You might, for example, have publish/subscribe mechanics between services, where rolling one service’s database back to a backup means other services still hold records of things that happened since that backup – which can lead to further corruption. Careful design and practice are needed to ensure a healthy recovery. Techniques like event sourcing can help, because current state can be rebuilt by replaying events.

There are many ways that a database can be damaged – and it’s probably not technically feasible to eliminate all of these. When the worst does happen, make sure you have a way back from it!