
The importance of Disaster Recovery: lessons from the UniSuper Google GCP outage


Google Cloud | DR Cost | DR Planning
By Neil Barton | May 23, 2024

In an age where digital transformation drives business operations, the recent UniSuper Google Cloud Platform (GCP) incident underscores a critical lesson for organizations worldwide: the need for robust Disaster Recovery (DR) planning for all disaster scenarios. UniSuper, a major superannuation fund in Australia, experienced significant service disruptions due to issues with Google's cloud services, leaving more than half a million UniSuper fund members without access to their accounts for a number of days. This event highlighted not just the vulnerabilities inherent in relying on cloud infrastructure but also the essential role that DR plays in maintaining business continuity. For organizations deploying to the cloud, I propose a series of questions they should be asking themselves, and measures they should be taking, now.


Understanding the UniSuper GCP Incident

According to the joint statement from UniSuper and Google Cloud, this “one-of-a-kind” incident on Google Cloud's side resulted in the deletion of UniSuper’s entire private cloud subscription. Although UniSuper had duplicated its environment across two different GCP regions as protection against outages and loss, the incident wiped out its infrastructure in both regions.

For an organization managing AUD $125 billion in retirement savings, even a minor hiccup can have significant ramifications; an outage of this type was a major issue. Google Cloud CEO Thomas Kurian confirmed it was caused by an inadvertent misconfiguration during the provisioning of UniSuper’s Private Cloud services, which ultimately resulted in the deletion of the subscription.

Below is a set of observations from this incident. Neither UniSuper nor Google Cloud has disclosed anything specific; the information released so far has been vague. Even so, there are important lessons to be learned.


Observations 

1. Multi-region DR is not always enough for every disaster scenario: Duplicating an environment across two different GCP regions as protection against outages and loss is a recommended best practice, and UniSuper followed it. Yet this “one-of-a-kind occurrence”, as Google Cloud CEO Thomas Kurian described it, deleted UniSuper’s account across both regions. Most organizations would have considered region redundancy sufficient; UniSuper wisely went a step further by keeping backups with another service provider. As an organization, you need to plan for the worst-case scenario.

2. Responsibility lies with the customer: Cloud Service Providers (CSPs) do not provide DR guarantees or capabilities in most cases. The onus is on you, the customer, to provide the necessary DR plan, even when the CSP is the root cause of the disaster. This is an often-overlooked aspect when companies move to the cloud.

3. Independent DR architectures mitigate risks: UniSuper was fortunate, and likely well prepared, to have “air-gapped” backups with a separate provider, which allowed them to recover to a known point and minimize data loss. This was a very well-thought-out DR strategy and almost certainly saved this from being a much worse outcome. Even so, restoring from backups meant a long, multi-day return to service and some data loss that had to be addressed. Be sure to consider air-gapped backups and multi-cloud or hybrid standby infrastructure in your DR planning.

4. Large CSPs (GCP, AWS, Azure, etc.) are not immune from issues: Maintenance tasks that impact customers, buggy scripts, and other infrastructure stability problems have occurred at every major cloud provider and will continue to occur. While this specific incident was declared “one of a kind”, general infrastructure-related issues are common and should be accounted for when planning your Business Continuity and DR strategy.

5. Test, test and test: DR plans are only as good as the last time they were tested. Your DR solution should be easy to test, and tested regularly.


Conclusion

The UniSuper Google GCP issue serves as a stark reminder of the fragility of even the most sophisticated digital infrastructures. Notably, since the UniSuper incident, Google Cloud has had a separate set of issues following internal maintenance activities, affecting a variety of its cloud services, such as Compute Engine and Kubernetes Engine. This is a reminder that issues on cloud service providers are not infrequent, and it underscores the necessity of robust DR planning to safeguard against disruptions.

By prioritizing DR, organizations not only protect their operations and data but also preserve their reputation and client trust. Investing in comprehensive DR strategies is not just prudent – it is essential.

From a practical standpoint:

First and foremost, critical business systems relying solely on in-region replication should be urgently reviewed. Configuring a standby environment in a different region is essential for ensuring rapid recovery during regional failures.
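As an illustration of what such a review might automate, the sketch below flags any critical system whose replicas all sit in a single region. The inventory data, system names, and region names are hypothetical and purely for illustration; in practice this data would come from your cloud provider's inventory APIs.

```python
# Hypothetical inventory mapping critical systems to the regions
# their replicas are deployed in (illustrative data, not a real API).
INVENTORY = {
    "member-portal-db": ["australia-southeast1", "australia-southeast2"],
    "payments-db": ["australia-southeast1"],  # in-region only: a DR risk
}

def single_region_systems(inventory):
    """Return the systems whose replicas all live in a single region."""
    return sorted(
        name for name, regions in inventory.items()
        if len(set(regions)) < 2
    )

print(single_region_systems(INVENTORY))  # ['payments-db']
```

A check like this can run on a schedule so that any new system deployed without cross-region redundancy is flagged before a regional failure exposes it.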

Companies should also ensure that their DR strategy can withstand true catastrophic events. This includes maintaining air-gapped backups stored in separate locations or with different providers, and even implementing a replicated warm-standby database on an alternative service provider for critical databases.
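One simple, automatable safeguard here is verifying that the newest air-gapped backup is fresh enough to meet your recovery point objective (RPO). A minimal sketch, assuming you can obtain the timestamp of the latest backup from the secondary provider; the timestamps and RPO values below are illustrative:

```python
from datetime import datetime, timedelta, timezone

def backup_within_rpo(last_backup_utc, rpo_hours, now=None):
    """True if the newest air-gapped backup is recent enough to meet the RPO."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup_utc <= timedelta(hours=rpo_hours)

# Illustrative check: a backup taken 6 hours ago, against two RPO targets.
now = datetime(2024, 5, 23, 12, 0, tzinfo=timezone.utc)
last = now - timedelta(hours=6)
print(backup_within_rpo(last, rpo_hours=24, now=now))  # True
print(backup_within_rpo(last, rpo_hours=4, now=now))   # False
```

Alerting when this check fails catches silently broken backup jobs long before a disaster forces you to discover them.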

Regular testing of the DR plan is also crucial to ensure that it:

a. Recovers all necessary components as expected,

b. Is kept up-to-date with changes in the company's infrastructure, 

c. Provides sufficient performance for current workloads.
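These three criteria can be scripted so that each DR drill produces a pass/fail report rather than relying on manual inspection. A minimal sketch with hypothetical drill results; the component names, version strings, and throughput figures are invented for illustration:

```python
def evaluate_drill(recovered, expected, config_version, prod_version,
                   measured_tps, required_tps):
    """Score a DR drill against the three criteria above."""
    return {
        "all_components_recovered": set(expected) <= set(recovered),
        "config_up_to_date": config_version == prod_version,
        "performance_sufficient": measured_tps >= required_tps,
    }

# Hypothetical drill: the app tier failed to come up on the standby.
report = evaluate_drill(
    recovered=["db", "cache"],
    expected=["db", "cache", "app"],
    config_version="2024-05-01",
    prod_version="2024-05-01",
    measured_tps=950,
    required_tps=800,
)
print(report)
```

Keeping the expected component list and performance thresholds in version control alongside infrastructure code helps satisfy point (b): the drill criteria evolve with the environment.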


Next steps: 

If you have any questions about your Disaster Recovery, whether you're running critical workloads on Oracle SE, SQL Server, or PostgreSQL, contact us today! With 20 years of expertise in database management, we're here to assist you.

Contact us
