Matrix.org Homeserver Grinds To A Halt After RAID Meltdown

A RAID failure has taken the Matrix.org homeserver offline, leaving users of the decentralized messaging service unable to send or receive messages while engineers attempt a 55 TB database restore.

To be clear, those with their own homeservers, such as government organizations, are unaffected, but anyone using Matrix.org as their homeserver will have been hearing the sound of silence from the platform while the team works to bring the service back online.

Problems began at 1117 UTC on September 2, when the secondary Matrix.org database lost its file system due to a RAID failure. The primary fell over at 1726 UTC, and a few minutes later, the organization admitted that things were indeed not very healthy.

The Matrix.org homeserver is backed by a large PostgreSQL database, which caused the organization grief in July when a long-gestating corruption of part of a table index caused issues with "rooms" in the system. The result was that attempts to join rooms would fail, messages wouldn't send, and occasional cryptic error messages would appear.

The team was understandably a little cautious when restoring the database and eventually reported: "We haven't been able to restore the DB primary filesystem to a state we're confident in running as a primary (especially given our experiences with slow-burning postgres db corruption)."

The solution is a full 55 TB database snapshot restore followed by a replay of 17 hours' worth of traffic. At the time of writing, the team had managed to restore the snapshot and subsequent incremental backups and was about to embark on the traffic replay.

Neil Johnson, chief engineering officer at Element, a messaging platform by the creators of Matrix, told The Register the trouble started with a routine storage upgrade exercise that went badly wrong. "A whole series of things happened at exactly the wrong time in unison, which then led to the situation that we see," he said.

It's not a great look for the organization, as users who rely on the Matrix.org homeserver can't access it. Messages sent to Matrix.org users will be queued until the service is back up and running. "There's not going to be any data loss. Eventually your message will get through," Johnson said.

There is no charge for using Matrix.org and there is also no service level agreement.

The incident demonstrates the benefits of a decentralized system. Users with their own homeservers aren't affected, nor are organizations such as Element, whose customer deployments use the same underlying technology.

One homeserver going down does not affect the rest, even one as visible as Matrix.org.

Matrix has become increasingly important in recent years as public and private sector organizations seek to reduce their dependency on centralized messaging services that might not meet sovereignty or privacy requirements. The Matrix.org outage, while embarrassing, serves to highlight that a decentralized approach can protect users from whoopsies on the part of those who run the service. ®
