Dropbox Unplugged Its Own Datacenter – And Things Went Better Than Expected

If you're unsure how resilient your organization is to a disaster, there's a simple way to find out: unplug one of your datacenters from the internet and see what happens.

That's what Dropbox did in November, though with a bit more forethought. The company had long planned to take its San Jose datacenter (its largest) offline, and ran extensive tests before the actual event. In the end it took all three datacenters in the city offline by physically pulling each site's main fiber connection from its port.

Dubbed the "SJC blackhole," the experiment was declared a success after 30 minutes had elapsed with what Dropbox described as no impact on its global availability. "In the unlikely event of a disaster, our revamped failover procedures showed that we now had the people and processes in place to offer a significantly reduced RTO [recovery time objective]," Dropbox said in a postmortem of the event.

According to the company, RTOs were reduced from eight or nine minutes to four or five.

What was Dropbox thinking?

After parting ways with previous hosting service AWS and building its own datacenters, Dropbox said it realized there was a problem: its metadata was highly replicated, but block data wasn't. "Given San Jose's proximity to the San Andreas Fault, it was critical we ensured an earthquake wouldn't take Dropbox offline," the company said.

Dropbox's first attempt to reduce its dependence on San Jose was Magic Pocket, a system that distributes block data across multiple datacenters, each of which can serve portions of files at the same time, so that a single datacenter outage does not take the service down. This is known as an active-active system because multiple nodes serve files to users simultaneously.

Dropbox ultimately settled on an active-passive failover model, which still replicates blocks across multiple datacenters but serves files from only one location at a time. It said the change was necessary because of limitations imposed by how Dropbox itself had chosen to manage metadata.

"These choices severely limited our architectural choices when designing an active-active system, and made the resulting system much more complex," Dropbox said.
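To make the active-passive idea concrete, here is a minimal Python sketch. It is not Dropbox's implementation: the `Datacenter` and `ActivePassiveStore` classes, the site names, and the in-memory dictionaries are all assumptions made for illustration. It shows only the pattern described above, where every block is replicated to all sites but reads come from a single active site until that site fails.

```python
from dataclasses import dataclass, field


@dataclass
class Datacenter:
    """Hypothetical stand-in for one site; blocks live in a plain dict."""
    name: str
    healthy: bool = True
    blocks: dict = field(default_factory=dict)


class ActivePassiveStore:
    """Replicates writes to every healthy site, serves reads from one active site."""

    def __init__(self, sites):
        self.sites = list(sites)
        self.active = self.sites[0]  # the first site starts as the active one

    def put(self, key, data):
        # Replicate to all healthy sites so a passive site can take over later.
        for site in self.sites:
            if site.healthy:
                site.blocks[key] = data

    def get(self, key):
        # Reads only ever hit the active site; fail over first if it is down.
        if not self.active.healthy:
            self.fail_over()
        return self.active.blocks[key]

    def fail_over(self):
        # Promote the first healthy site to active.
        for site in self.sites:
            if site.healthy:
                self.active = site
                return
        raise RuntimeError("no healthy datacenter available")


sjc = Datacenter("sjc")
dfw = Datacenter("dfw")
store = ActivePassiveStore([sjc, dfw])
store.put("block-1", b"hello")

sjc.healthy = False            # simulate the blackhole
data = store.get("block-1")    # triggers failover, then reads from dfw
```

An active-active design would instead let `get` pick any healthy site for each request, which is where the metadata constraints Dropbox mentions would start to bite.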

Failing over and over

A failure in failover tooling in May 2020 caused a 47-minute service outage, which pushed Dropbox into high gear on improving its disaster recovery systems. It started by forming a dedicated disaster recovery team, which rebuilt Dropbox's failover-handling software and then ran a series of tests, of which the November 2021 shutdown was part.

Testing began at Dropbox's two Dallas-Fort Worth datacenters, and initially things were less than smooth: the team had not realized that all of its S3 proxies were running from the datacenter it took offline. A second test proved more successful, which led to the San Jose experiment.
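The kind of hidden dependency that tripped up the first DFW test can be checked for ahead of time. The Python sketch below is purely illustrative: the `single_site_dependencies` function, the service names, and the site labels are hypothetical and do not come from Dropbox's tooling. Given a map of where each service runs, it flags any service whose every instance sits in the site about to be unplugged.

```python
def single_site_dependencies(placements, site_going_offline):
    """Return services whose every instance runs in the site being taken offline.

    placements maps a service name to the list of sites hosting it.
    """
    return sorted(
        service
        for service, sites in placements.items()
        if sites and set(sites) <= {site_going_offline}
    )


# Hypothetical inventory: the proxies run in only one of the two DFW sites.
placements = {
    "s3-proxy": ["dfw-1"],
    "metadata": ["dfw-1", "sjc"],
    "block-storage": ["dfw-1", "dfw-2", "sjc"],
}

at_risk = single_site_dependencies(placements, "dfw-1")
print(at_risk)
```

With this made-up inventory, only `s3-proxy` is reported, since the other services have instances elsewhere.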

"Much like our second DFW test, we saw no impact to global availability—and ultimately reached our goal of a 30-minute SJC blackhole," Dropbox said. 

Dropbox's postmortem is worth paying attention to: not only does it describe how the company distributed its services and made its entire system more resilient, it also shows how much work it takes for a large enterprise to commit to a project of that kind.

The entire effort to improve resiliency was described by Dropbox as a multi-year, multi-team project. Its nature as a cloud service may mean Dropbox is more complex than most enterprises, but that should serve as a motivator: disaster recovery planning at other companies may be a lot easier.

Dropbox also recommends that other companies perform regular disaster recovery practise exercises. "Like a muscle, it takes training and practise to get stronger." ®
