Dropbox Unplugged Its Own Datacenter – And Things Went Better Than Expected
If you're unsure how resilient your organization is to a disaster, there's a simple way to find out: unplug one of your datacenters from the internet and see what happens.
That's what Dropbox did in November, though with a bit more forethought. It had been planning to take the San Jose datacenter (its largest) offline for some time, and performed extensive tests prior to the actual event. It actually took all three datacenters in the city offline by physically pulling each site's main fiber connection from its port.
Dubbed the "SJC blackhole," the experiment was declared a success after 30 minutes passed with what Dropbox described as no impact to its global availability. "In the unlikely event of a disaster, our revamped failover procedures showed that we now had the people and processes in place to offer a significantly reduced RTO [recovery time objective]," Dropbox said in a postmortem of the event.
According to the company, RTOs were reduced from eight or nine minutes to four or five.
What was Dropbox thinking?
After parting ways with previous hosting service AWS and building its own datacenters, Dropbox said it realized there was a problem: its metadata was highly replicated, but block data wasn't. "Given San Jose's proximity to the San Andreas Fault, it was critical we ensured an earthquake wouldn't take Dropbox offline," the company said.
Dropbox's first attempt to end its reliance on a single location was Magic Pocket, a system that distributes block data across multiple datacenters, each of which can serve portions of files at the same time, so that an outage at one datacenter does not take the service down. This is known as an active-active system because multiple nodes serve files to users simultaneously.
Dropbox ultimately settled on an active-passive failover model, which still replicates blocks across multiple datacenters but serves files from only one location at a time. The company said it had to take this route because of constraints imposed by how it had chosen to manage metadata.
"These choices severely limited our architectural choices when designing an active-active system, and made the resulting system much more complex," Dropbox said.
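The difference between the two models can be shown with a minimal sketch. Everything in it is an assumption made for illustration: the `Datacenter` class, the function names, the site labels, and the priority order are hypothetical and do not come from Dropbox's postmortem.

```python
from dataclasses import dataclass


@dataclass
class Datacenter:
    name: str
    healthy: bool = True


def active_active_targets(sites):
    """Active-active: every healthy site serves traffic at the same time."""
    return [site.name for site in sites if site.healthy]


def active_passive_target(sites):
    """Active-passive: only the first healthy site in priority order serves
    traffic; the others hold replicas and wait to be promoted."""
    for site in sites:
        if site.healthy:
            return site.name
    raise RuntimeError("no healthy datacenter available")


# Hypothetical priority order: the first entry is the active site.
sites = [Datacenter("sjc"), Datacenter("dfw")]

serving_before = active_passive_target(sites)
sites[0].healthy = False  # simulate the active site going dark
serving_after = active_passive_target(sites)
remaining_active_active = active_active_targets(sites)
```

In the active-passive case a failover is a single promotion step, which is simpler to reason about than many sites serving at once but makes the speed of that promotion the thing that determines the RTO.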
Failing over and over
A failure in Dropbox's failover tooling in May 2020 caused a 47-minute service outage, which pushed the company into high gear on improving its disaster recovery systems. It started by forming a dedicated disaster recovery team, which rebuilt Dropbox's failover-handling software and then ran a series of tests, of which the November 2021 shutdown was part.
Testing began at Dropbox's two Dallas-Fort Worth datacenters, and initially things were less than smooth, because the team had not realized that all of its S3 proxies were running from the datacenter it took offline. A second test proved more successful, which led to the San Jose experiment.
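The kind of oversight that tripped up the first Dallas-Fort Worth test can be caught with a placement check before a site is unplugged. The sketch below is only an illustration: the function name, the service names, and the placement data are hypothetical, and it is not a description of Dropbox's actual tooling.

```python
def services_stranded_by(placements, datacenter):
    """Return the services whose every instance runs in `datacenter`,
    meaning they would disappear entirely if that site went dark."""
    return sorted(
        service
        for service, sites in placements.items()
        if sites and set(sites) <= {datacenter}
    )


# Hypothetical inventory: service name -> datacenter of each running instance.
placements = {
    "s3-proxy": ["dfw-1", "dfw-1"],
    "metadata": ["sjc", "dfw-1", "dfw-2"],
    "block-store": ["sjc", "dfw-2"],
}

stranded = services_stranded_by(placements, "dfw-1")
print(stranded)
```

With this made-up inventory, the check flags `s3-proxy` as the only service that would be left with no instances once `dfw-1` is taken offline.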
"Much like our second DFW test, we saw no impact to global availability—and ultimately reached our goal of a 30-minute SJC blackhole," Dropbox said.
Dropbox's postmortem is worth paying attention to: not only does it show how the company distributed its services and made its entire system more resilient, it also shows how much work a large enterprise must commit to a project of that kind.
The entire effort to improve resiliency was described by Dropbox as a multi-year, multi-team project. Its nature as a cloud service may mean Dropbox is more complex than other enterprises, but that should serve as a motivator: disaster recovery planning in other companies may be a lot easier.
Dropbox also recommends that other companies perform regular disaster recovery practise exercises. "Like a muscle, it takes training and practise to get stronger." ®