April 27th, 2022 by Adam Sandman
If you have been following the news in our industry recently, you will probably be aware that another company in our space had a multi-week outage due to a backup and restore problem. It’s an important topic that is often misunderstood by customers and users. We are frequently asked about our backup process, but typically less so about the restore part of the equation. Understanding how to restore a system is just as important as backing it up.
Since we offer our Spira application lifecycle management and KronoDesk IT service desk products for both cloud and on-premise customers, this article will provide some best practices for setting up backup and restore for our self-hosted / download customers based on our cloud experiences.
The Basics
The two most common acronyms you will come across when discussing backup (and restore) strategies are:
- The Recovery Point Objective (RPO) – how far back you maintain backups, and to which points in time you can restore the system.
- The Recovery Time Objective (RTO) – how long it takes you to get the system back to the specified recovery point.
This sounds pretty simple, but what people often forget to plan for is the need to be able to selectively restore data, or the need to have frequent backup intervals so that the risk of data loss is reduced. For example, if you take backups every day at 9:00am and you have a failure at 8:00am, you have effectively lost 23 hours of data (since the last backup). If you made backups every hour, the most data you could “lose” is only 59 minutes.
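To make that arithmetic concrete, here is a minimal sketch; the dates and times simply mirror the example above and are purely illustrative:

```python
from datetime import datetime, timedelta

def data_lost(last_backup: datetime, failure: datetime) -> timedelta:
    """Everything written after the last successful backup is lost."""
    return failure - last_backup

# Daily backups at 9:00am; failure the next day at 8:00am -> 23 hours lost.
print(data_lost(datetime(2022, 4, 26, 9, 0), datetime(2022, 4, 27, 8, 0)))

# Hourly backups; failure at 3:59pm, last backup at 3:00pm -> 59 minutes lost.
print(data_lost(datetime(2022, 4, 27, 15, 0), datetime(2022, 4, 27, 15, 59)))
```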
Data Tenancy is Key
One other major consideration when deciding on backup and restore strategies is to understand the tenancy of your systems and the data in them:
- Multi-Tenant Architecture – you have one single data store that serves all your customers
- Single-Tenant Architecture – you have individual separate data stores that are unique per customer
However, even within a single customer instance, the system can be designed so that each project or workspace has its own data store, or so that one data store serves all of that customer’s projects.
The more granular your data tenancy is, the easier it will be to restore a customer’s data without negatively affecting other customers.
For example, imagine we have a fully multi-tenant architecture where all customers use the same database, and backups of the entire database are made every day. If a customer makes an unintended change or deletes some data, they may ask to restore back to the previous day’s data. If that database backup is for all customers, you may have to revert all customers back to the previous day, which is not desirable! In that situation, you would need to architect the backups so that you can revert just one customer back from the restored data.
If you had the exact same situation, but used a more granular single-tenant architecture, you could simply restore the one customer from the daily backup without negatively impacting other customers.
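As a rough illustration, here is a minimal sketch of what such a per-customer restore could look like with one SQL Server database per tenant. The server, database and file names are hypothetical, not our actual production setup:

```python
import pyodbc

def restore_customer(customer: str, backup_date: str) -> None:
    """Restore one tenant's database from its own daily backup,
    leaving every other tenant's database untouched."""
    # Hypothetical convention: the database and backup file are named
    # after the customer. Names come from an internal list, never user input.
    backup_file = f"D:\\Backups\\{customer}_{backup_date}.bak"
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=db01;Trusted_Connection=yes",
        autocommit=True,  # RESTORE cannot run inside a user transaction
    )
    conn.execute(
        f"RESTORE DATABASE [{customer}] FROM DISK = N'{backup_file}' WITH REPLACE"
    )

restore_customer("acme", "2022-04-26")
```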
Lessons from the Inflectra Cloud
While there is no single, correct way to manage your backup and restore strategy, here are some recommendations based on how we’ve designed our cloud disaster recovery plan.
- Avoid permanent deletes as much as possible
- Maintain customer-specific backups
- Frequent snapshots
- Geographical dispersion
1. Avoid permanent deletes as much as possible
The need to perform restoration from backups can happen for a variety of reasons, including natural disasters or IT disruptions. However, one common reason is that a user accidentally deleted something they shouldn’t have.
In our Spira platform we have made most deletes “soft”, where we simply mark a requirement, defect, test case or other artifact as deleted rather than actually deleting the item. That allows administrators of the project in Spira to simply ‘undelete’ the item from within the application.
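Here is a minimal sketch of the general soft-delete pattern, using an illustrative schema rather than Spira’s actual tables:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE artifact (id INTEGER PRIMARY KEY, name TEXT, "
    "is_deleted INTEGER NOT NULL DEFAULT 0)"
)
db.execute("INSERT INTO artifact (name) VALUES ('Login test case')")

# A 'delete' just sets the flag, so normal queries no longer see the row...
db.execute("UPDATE artifact SET is_deleted = 1 WHERE id = 1")
print(db.execute("SELECT name FROM artifact WHERE is_deleted = 0").fetchall())  # []

# ...and an administrator can undelete it without any backup restore at all.
db.execute("UPDATE artifact SET is_deleted = 0 WHERE id = 1")
print(db.execute("SELECT name FROM artifact WHERE is_deleted = 0").fetchall())
```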
Where there are actions in Spira that result in permanent deletes or changes, we require the user to enter the name of the project or other item to avoid someone doing it erroneously. This is similar to the approach taken by Amazon Web Services (AWS), where they force you to type in the letters “D-E-L-E-T-E” to permanently delete a resource.
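A type-the-name confirmation is simple to build; this sketch is illustrative rather than Spira’s actual code:

```python
def confirm_permanent_delete(project_name: str) -> bool:
    """Require the operator to re-type the exact project name
    before anything is permanently removed."""
    typed = input(f"Type the project name '{project_name}' to confirm: ")
    return typed == project_name

if confirm_permanent_delete("Sample Project"):
    print("Project permanently deleted.")
else:
    print("Name did not match – nothing was deleted.")
```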
These two design choices generally (but not always) reduce the number of accidental deletes that require backup restoration.
2. Maintain customer-specific backups
We designed our Spira cloud infrastructure to mirror, in large part, our on-premise deployments, which allows us to seamlessly and easily move customers to/from the cloud as their needs (or laws) change. As a result, we maintain daily backups of each SQL Server database separately on a rolling 7-day basis. That means we can restore a customer from any of the last seven daily backups and not negatively impact any other customers with the rollback.
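A rolling 7-day window can be as simple as naming each backup file by weekday, so that each night’s backup overwrites the one from a week earlier. This is a sketch under that assumption, not our exact production script:

```python
import pyodbc
from datetime import date

def nightly_backup(customer: str, when: date) -> None:
    """Back up one tenant's database to a weekday-keyed file,
    giving a self-pruning rolling window of seven daily backups."""
    backup_file = f"D:\\Backups\\{customer}_{when.strftime('%a')}.bak"  # e.g. acme_Wed.bak
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=db01;Trusted_Connection=yes",
        autocommit=True,  # BACKUP cannot run inside a user transaction
    )
    # WITH INIT overwrites last week's file of the same name.
    conn.execute(
        f"BACKUP DATABASE [{customer}] TO DISK = N'{backup_file}' WITH INIT"
    )

nightly_backup("acme", date.today())
```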
Using the standard Microsoft SQL Server restore tools, the procedure to revert a customer back takes a few minutes rather than days. The longest part is usually the time it takes to coordinate with the customer and inform users – which are of course very important steps.
However, we realized that with a daily backup you could lose up to 23 hours of data if the event happened just before the next daily backup. For that reason, we have a separate set of AWS volume snapshots that happen every hour.
3. Frequent snapshots
In addition to the daily SQL Server backups, we take hourly snapshots of the web server and database server Elastic Compute Cloud (EC2) instances. In the case of a single-customer database restore, this is useful because it allows us to restore the data from any hourly snapshot. This means that the most data that will be lost is 59 minutes’ worth (if you deleted something at 3:59pm and the last snapshot was at 3:00pm).
In the case of a more serious hardware failure, the fact that the EC2 snapshots contain the entire working virtual machine means that restoring functionality is much faster, since you can just restore the EC2 image from the snapshot and the operating system, application stack and all data come back at the same time. We can restore a single EC2 virtual machine (containing over 30 customers) in less than an hour, vs. the multi-hour process it would take using physical hardware.
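With boto3, taking such a volume snapshot is a one-call operation; the volume ID and description below are hypothetical:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical EBS volume behind a web/database server.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="Hourly snapshot: OS, application stack and data together",
)
print(snapshot["SnapshotId"])  # run hourly from a scheduled job
```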
4. Geographical dispersion
It is important to make sure that you consider the effects of geography and legal restrictions. At Inflectra, we have customers that have specific data residency requirements. Typically this means that certain customers’ data must stay in the USA, certain data needs to stay in the EU, and other data must remain “in-country”.
What we have done is define a set of legally equivalent “privacy regimes” that we replicate data between. That way we can maximize the geographic spread of our data (to avoid a natural disaster in one area affecting both primary and backup regions) and avoid data flowing from one privacy regime to another, violating data residency laws such as GDPR or HIPAA. A sketch of this cross-region replication follows the examples below.
For example:
- Our US customers have their primary data hosted in Northern Virginia and backed up in Ohio
- Our EU customers have their primary data hosted in Dublin, Ireland and backed up in Frankfurt, Germany
- Our Australian customers have their primary data hosted in Sydney, NSW and backed up to Melbourne, Victoria
- Etc.
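As that sketch, the pairing below maps each primary region to a backup region inside the same privacy regime; the region pairs and snapshot ID are illustrative, not our actual configuration:

```python
import boto3

# Primary region -> backup region, kept within one legal "privacy regime".
PRIVACY_REGIMES = {
    "us-east-1": "us-east-2",            # N. Virginia -> Ohio
    "eu-west-1": "eu-central-1",         # Dublin -> Frankfurt
    "ap-southeast-2": "ap-southeast-4",  # Sydney -> Melbourne
}

def replicate_snapshot(snapshot_id: str, primary_region: str) -> str:
    """Copy a snapshot to the paired backup region in the same regime,
    so data never crosses a residency boundary."""
    backup_region = PRIVACY_REGIMES[primary_region]
    # copy_snapshot is issued from the *destination* region.
    ec2 = boto3.client("ec2", region_name=backup_region)
    copy = ec2.copy_snapshot(
        SourceRegion=primary_region,
        SourceSnapshotId=snapshot_id,
        Description=f"Geo-redundant copy from {primary_region}",
    )
    return copy["SnapshotId"]

print(replicate_snapshot("snap-0123456789abcdef0", "eu-west-1"))
```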
Conclusion
When looking at disaster recovery strategies, it is important to consider the frequency of backups, the retention period, and also the type of backups taken. The latter should be considered in the context of your overall data storage strategy so that you can restore customers’ data as quickly as possible, with as little data loss as possible, and with as little impact on other customers or users as possible.