Blog

When Disaster Strikes: Fire takes out Kakao and Naver services

Written by Erin Thomson | Nov 15, 2022 8:51:53 PM

Disasters happen all the time, whether in an on-premises setting, a public data center (IaaS), or a Cloud Service Provider (CSP), and the growing reliance on technology means extra measures need to be taken to protect business data. Unfortunately, many organizations only prioritize a suitable disaster recovery solution after a disaster and data loss have occurred. This blog looks at two major outages in Korea in October of this year, as well as some of their long-term consequences. We will then look at how these events could have been averted and how to achieve dependable, resilient Disaster Recovery (DR).

Data center fire takes down South Korea’s two major web giants 

On October 15th, 2022, two of South Korea’s biggest internet companies, Naver and Kakao, were significantly impacted when a fire broke out at 3:33 pm at the SK Group data center in Pangyo, which hosts the majority of their infrastructure. Naver fared better thanks to their existing backup plan, but Kakao experienced ten hours of downtime as 32,000 of their servers shut down, with only 12,000 recovered by the following day. The disruption to Kakao’s services continued for several days, affecting their 43 million active users in South Korea.

Kakao suffered the country’s worst server outage

Kakao is commonly referred to as “Korea’s app for everything”: it’s the nation’s top messaging app, but it also provides services for banking and payments, entertainment, transportation and more. The outage disrupted all of their communication services and also impacted finance and transportation within South Korea. According to the Korea Herald, Kakao did not have a proper contingency plan in place despite being in service for more than 12 years. Although they had a backup system, it wasn’t functional and was housed in a single location, which turned out to be the very data center that went down.

Namkoong Whon, their co-CEO, stated that he "felt the heavy burden of responsibility" given the gravity of the situation, prompting him to resign from his role. According to Korean news outlet KBS WORLD, the effects of the outage are still being felt, with trust in the company severely damaged and numerous groups of customers preparing class-action suits against Kakao.

So where did Kakao go wrong? 

Kakao did not have a comprehensive contingency plan in place, which is why all of their essential databases were stored in a single data center in Pangyo. They also aimed to save money by not operating their own data centers. The absence of a disaster recovery plan, combined with the fact that their only backups sat in that same single location, is what turned a data center fire into Kakao's largest server outage to date, and why millions of consumers lost access to their services and experienced delays for days afterward. Unfortunately, when thinking about disasters (if we think about them at all) we usually don’t imagine they will happen to us. We naively hope that everything will be fine, but hope is not a strategy.

The basis of best-practice DR is a warm standby database in a separate geographical location. Redundancy within a single cloud or data center does not guarantee data security: geographical and infrastructure separation is required in case the primary location is impacted. Modern best practice calls for an up-to-date, separate standby environment, ready for immediate failover and activation, as sketched below.
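To make this concrete, here is a minimal, product-agnostic sketch of how a warm standby in another region might be watched for readiness. It assumes a PostgreSQL primary/standby pair; the hostname, credentials, and 60-second lag threshold are hypothetical and would need to be adapted to your own topology and tooling.

```python
# Minimal replication-lag check for a warm standby in a separate region.
# Assumes a PostgreSQL primary/standby pair; the hostname, credentials, and
# 60-second threshold below are hypothetical placeholders.
import sys

import psycopg2  # pip install psycopg2-binary

STANDBY_DSN = "host=standby.eu-west.example.com dbname=appdb user=monitor password=secret"
MAX_LAG_SECONDS = 60  # alert if the standby falls more than a minute behind


def standby_lag_seconds(dsn: str) -> float:
    """Return how many seconds the standby's replay lags behind the current time."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 0)"
        )
        return float(cur.fetchone()[0])


if __name__ == "__main__":
    lag = standby_lag_seconds(STANDBY_DSN)
    if lag > MAX_LAG_SECONDS:
        print(f"WARNING: standby is {lag:.0f}s behind the primary", file=sys.stderr)
        sys.exit(1)
    print(f"OK: standby lag is {lag:.0f}s")
```

A check like this, run on a schedule and wired into alerting, is what turns a standby from a box-ticking exercise into something you can actually fail over to.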

Relying on backups alone is no longer recommended, because restoring them takes time, and you won't know whether a backup will actually work in a disaster unless you test it thoroughly and frequently. It's also important to recognize that having the technological tools in place is not enough; the DR processes and ongoing maintenance are just as crucial. This is why a regularly tested, up-to-date standby environment, ready for fast failover and activation, is what today's best practices require. Having merely "cold" backups is not enough, and even those should be exercised with a routine restore drill like the one sketched below.
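The sketch below shows one way such a restore drill could be automated, assuming PostgreSQL client tools on the PATH and dump-style backups; the backup directory, scratch database name, and sanity query are placeholders for illustration only.

```python
# Hypothetical nightly restore drill: restore the latest dump into a throwaway
# database and run a basic sanity check. Paths, database names, and the sanity
# query are placeholders; assumes PostgreSQL client tools on the PATH.
import subprocess
from pathlib import Path

BACKUP_DIR = Path("/backups/appdb")   # hypothetical backup location
SCRATCH_DB = "appdb_restore_test"     # throwaway database used only for the drill


def latest_backup() -> Path:
    """Pick the most recent custom-format dump from the backup directory."""
    dumps = sorted(BACKUP_DIR.glob("*.dump"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        raise RuntimeError("no backups found - the DR plan is already broken")
    return dumps[-1]


def restore_and_verify(dump: Path) -> None:
    """Restore into a scratch database, then confirm a key table has rows."""
    subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
    subprocess.run(["createdb", SCRATCH_DB], check=True)
    subprocess.run(["pg_restore", "--dbname", SCRATCH_DB, str(dump)], check=True)
    result = subprocess.run(
        ["psql", "-d", SCRATCH_DB, "-tAc", "SELECT count(*) FROM users"],
        check=True, capture_output=True, text=True,
    )
    if int(result.stdout.strip()) == 0:
        raise RuntimeError("restore test failed: users table is empty")


if __name__ == "__main__":
    restore_and_verify(latest_backup())
    print("restore test passed")
```

Run on a schedule, a drill like this catches broken or unusable backups long before a real disaster does.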

So how can an event like this be prevented?

To avoid losing revenue, customers, and your good name in the market, deploying a standby database and testing DR regularly are mission-critical. Understanding your limitations and critical failure points is the first step: it means accepting that downtime does occur and building a robust solution that mitigates all foreseeable risks. Fortunately, software solutions like Standby MultiPlatform make it easy and efficient to create a warm standby environment that enables fast recovery and near-zero data loss from any disaster. This sort of setup should be the gold standard for contingency planning in database management, making data recovery far easier and more efficient when the unforeseen happens. It’s better to be safe than sorry.

Interested in learning more or have any questions? Contact one of our team members today! Alternatively, you can try StandbyMP for yourself here.