Discord Details How It Dodged Latency With A Super-disk Made In The Cloud
Chat platform Discord delivered a playful slap to Google yesterday with a post describing how the company dealt with "reliability issues" to achieve some impressively low latency.
Discord's millions of users send about 4 billion messages through the platform each day. The company runs a set of NoSQL database clusters (powered by ScyllaDB), and because the service is real-time, those databases need to respond to queries as quickly as possible.
"The biggest impact on our database performance is the latency of individual disk operations, how long it takes to read or write data from the physical hardware," said Glen Oakley, a senior software engineer at Discord.
Below a certain query rate, all is good. "Our databases do a great job of handling requests in parallel," said Oakley.
However, at some point you will hit blocking issues, where the database has to wait for an outstanding disk operation to complete before starting another. Things slow down, and users notice. The queries might time out before reaching the top of the queue.
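The tipping point is easy to see in a toy model. The sketch below is not Discord's code: it assumes a single disk that serves one operation at a time, queries arriving at a fixed interval, and a hypothetical per-query timeout. Real databases overlap many operations, so treat it as an illustration of the queueing effect only.

```python
def simulate_queue(arrival_interval_ms, disk_latency_ms, timeout_ms, n_queries):
    """Count queries that time out while waiting for a single, serial disk.

    All parameters are illustrative assumptions, not figures from Discord.
    """
    disk_free_at = 0.0  # time at which the disk finishes its current operation
    timed_out = 0
    for i in range(n_queries):
        arrival = i * arrival_interval_ms
        start = max(arrival, disk_free_at)  # wait if the disk is still busy
        if start - arrival > timeout_ms:
            # The query gave up before reaching the front of the queue.
            timed_out += 1
            continue
        disk_free_at = start + disk_latency_ms
    return timed_out


if __name__ == "__main__":
    # Disk faster than the arrival rate: the queue never builds up.
    print(simulate_queue(1.0, 0.5, 10.0, 100))
    # Disk slower than the arrival rate: waits grow until queries time out.
    print(simulate_queue(1.0, 2.0, 10.0, 100))
```

Once each disk operation takes longer than the gap between arrivals, the wait grows with every query, which is why per-operation latency matters more than raw throughput here.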
One might have thought that slinging the Local SSDs on offer from GCP would deal with the problem. Oakley noted that the NVMe-based storage had incredibly low latency, but "in our testing, we ran into enough reliability issues that we didn't feel comfortable with depending on this solution for our critical data storage."
Another option was persistent disks: storage that can be attached or detached when needed, is replicated, and is connected via the network. That network hop means their latency is nowhere near as low as that of a directly attached disk.
So what to do? The team wanted to stick with GCP and prioritize low-latency disk reads, but did not want to sacrifice existing uptime guarantees. They also needed to be able to survive a bad sector on an SSD. The solution was to use GCP's Local SSDs for low-latency reads while still writing to the Persistent Disks to take advantage of snapshotting and redundancy via replication.
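The read and write paths described above can be sketched as a toy model. This is a hypothetical in-memory illustration, not Discord's implementation: the class name, the `bad_sectors` set, and the dictionary-backed "disks" are all assumptions made for the example.

```python
class MirroredVolume:
    """Toy model: fast local copy for reads, durable copy for safety."""

    def __init__(self):
        self.local_ssd = {}        # stands in for the low-latency Local SSDs
        self.persistent_disk = {}  # stands in for the replicated Persistent Disk
        self.bad_sectors = set()   # blocks the local copy can no longer serve

    def write(self, block, data):
        # Every write lands on both copies, so the durable side is never stale.
        self.local_ssd[block] = data
        self.persistent_disk[block] = data

    def read(self, block):
        # Prefer the fast copy; fall back to the durable copy on a bad sector.
        if block not in self.bad_sectors and block in self.local_ssd:
            return self.local_ssd[block], "local_ssd"
        return self.persistent_disk[block], "persistent_disk"


if __name__ == "__main__":
    vol = MirroredVolume()
    vol.write(7, b"hello")
    print(vol.read(7))       # served from the local copy
    vol.bad_sectors.add(7)   # simulate a bad sector on the SSD
    print(vol.read(7))       # served from the persistent copy instead
```

The trade-off is that writes are only as fast as the slower copy, while reads get the Local SSD's latency in the common case.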
After faffing around with various caching options in software (Discord runs Ubuntu on its database servers), the team settled on md and a tricked-out RAID configuration. RAID0 (which just splits raw data over disks: lose one, lose 'em all) was selected for the Local SSDs, with a RAID1 (basically a mirror) pairing the Persistent Disk with the RAID0 array.
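For the curious, a layout along these lines could be described with two `mdadm --create` invocations. The sketch below only builds the argument lists and does not run them. The device paths are hypothetical, and the `--write-mostly` flag (a real mdadm option that steers reads away from the devices listed after it) is an assumption here, since the article does not say which md options Discord used.

```python
def build_mdadm_commands(local_ssds, persistent_disk,
                         stripe_dev="/dev/md0", mirror_dev="/dev/md1"):
    """Return argv lists for a RAID0 stripe mirrored (RAID1) with another disk.

    Device paths are placeholders; nothing is executed.
    """
    # RAID0 across the Local SSDs: fast, but losing one member loses the array.
    stripe = [
        "mdadm", "--create", stripe_dev,
        "--level=0", f"--raid-devices={len(local_ssds)}",
        *local_ssds,
    ]
    # RAID1 between the stripe and the Persistent Disk. Assumption: the
    # network-attached member is flagged write-mostly so reads favour the stripe.
    mirror = [
        "mdadm", "--create", mirror_dev,
        "--level=1", "--raid-devices=2",
        stripe_dev, "--write-mostly", persistent_disk,
    ]
    return [stripe, mirror]


if __name__ == "__main__":
    # Hypothetical device names for illustration only.
    for argv in build_mdadm_commands(["/dev/nvme0n1", "/dev/nvme0n2"], "/dev/sdb"):
        print(" ".join(argv))
```

Building the commands as data rather than executing them keeps the example safe to run anywhere; creating arrays for real needs root and will destroy whatever is on the named devices.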
The result was, more or less, the super-disk success hoped for, although Oakley noted there were some specific edge cases encountered in the cloud environment. "In retrospect," he said, "disk latency should have been an obvious concern early on in our database deployments.
"The world of cloud computing causes so many systems to behave in ways that are nothing like their physical datacenter counterparts."
Something to keep in mind during your company's charge to the cloud. ®