Cloudflare Fesses Up To Config Change That Borked Internet Access For All

There was a disturbance in the force on July 14 after Cloudflare borked a configuration change that resulted in an outage, impacting internet services across the planet.

In a blog post, the content delivery network services biz detailed the unfortunate series of events that led to Monday's disruption.

On the day itself, "Cloudflare's 1.1.1.1 Resolver service became unavailable to the internet starting at 21:52 UTC and ending at 22:54 UTC. The majority of 1.1.1.1 users globally were affected. For many users, not being able to resolve names using the 1.1.1.1 Resolver meant that basically all Internet services were unavailable," Cloudflare said.

But the problem originated much earlier.

The outage was caused by a "misconfiguration of legacy systems" used to maintain the infrastructure that advertises Cloudflare's IP addresses to the internet.

"The root cause was an internal configuration error and not the result of an attack or a BGP hijack," the corp said.

Back on June 6 this year, as Cloudflare was preparing a service topology for a future Data Localization Suite (DLS) service, it introduced the config gremlin - prefixes connected to the 1.1.1.1 public DNS Resolver were "inadvertently included alongside the prefixes that were intended for the new DLS service."

"This configuration error sat dormant in the production network as the new DLS service was not yet in use, but it set the stage for the outage on July 14. Since there was no immediate change to the production network there was no end-user impact, and because there was no impact, no alerts were fired."

On July 14, a second tweak to the service was made: Cloudflare added an offline datacenter location to the service topology for the pre-production DLS service in order "to allow for some internal testing." But the change triggered a refresh of the global configuration of the associated routes, "and it was at this point that the impact from the earlier configuration error was felt."

Things went awry at 2148 UTC.

"Due to the earlier configuration error linking the 1.1.1.1 Resolver's IP addresses to our non-production service, those 1.1.1.1 IPs were inadvertently included when we changed how the non-production service was set up… The 1.1.1.1 Resolver prefixes started to be withdrawn from production Cloudflare datacenters globally."
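For readers who want the mechanics, here is a minimal Python sketch of how a misconfiguration can sit dormant and then bite on a later refresh. It is a toy model, not Cloudflare's actual tooling: the `Service` class, the `refresh` function, the rule that the last service to claim a prefix owns it, the location names, and the 192.0.2.0/24 placeholder prefix are all assumptions made for illustration. 1.1.1.0/24 stands in for the Resolver prefixes.

```python
from dataclasses import dataclass

RESOLVER_PREFIX = "1.1.1.0/24"       # stands in for the 1.1.1.1 Resolver prefixes
DLS_PREFIX = "192.0.2.0/24"          # placeholder documentation prefix (assumption)
PRODUCTION = {"site-a", "site-b", "site-c"}  # hypothetical production locations


@dataclass
class Service:
    name: str
    prefixes: set
    locations: set


def refresh(services, online_locations):
    """Recompute which prefixes each online location advertises.

    Assumed rule: the last service to claim a prefix owns it, and a prefix
    is advertised only at the owning service's locations.
    """
    owner = {}
    for svc in services:
        for prefix in svc.prefixes:
            owner[prefix] = svc
    adverts = {loc: set() for loc in online_locations}
    for prefix, svc in owner.items():
        for loc in svc.locations & online_locations:
            adverts[loc].add(prefix)
    return adverts


resolver = Service("resolver", {RESOLVER_PREFIX}, set(PRODUCTION))
before = refresh([resolver], PRODUCTION)

# June 6: the Resolver prefix is wrongly attached to the unused DLS service.
# No refresh is triggered, so the advertised routes stay as they were.
dls = Service("dls-preprod", {DLS_PREFIX, RESOLVER_PREFIX}, set())

# July 14: an offline test location is added, which triggers a global refresh.
dls.locations.add("test-offline")
after = refresh([resolver, dls], PRODUCTION)

print("before:", before)
print("after: ", after)
```

In this model the Resolver prefix is advertised from every production location before the refresh and from none afterwards, because its new owner has only an offline test location. The point is that the faulty entry does nothing until something forces the routes to be recomputed.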

Traffic began to drop four minutes later and internal health alerts started to emerge. An "incident" was declared at 2201 UTC and a fix dispatched at 2220 to restore the previous configuration.

"To accelerate full restoration of service, a manually triggered action is validated in testing locations before being executed," Cloudflare said in its explanation of the outage. Resolver alerts were cleared by 2254 UTC and DNS traffic on Resolver prefixes went back to typical levels, it added.
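The intervals implied by Cloudflare's timeline can be worked out with a few lines of Python. This is only arithmetic on the UTC times quoted above; the event labels and the `minutes_between` helper are our own names, not Cloudflare's.

```python
from datetime import datetime

# UTC times as reported in Cloudflare's account; labels are illustrative.
events = {
    "change_applied": "21:48",
    "impact_start": "21:52",
    "incident_declared": "22:01",
    "fix_dispatched": "22:20",
    "alerts_cleared": "22:54",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day events in the timeline."""
    fmt = "%H:%M"
    delta = datetime.strptime(events[end], fmt) - datetime.strptime(events[start], fmt)
    return int(delta.total_seconds() // 60)


change_to_impact = minutes_between("change_applied", "impact_start")
impact_to_declared = minutes_between("impact_start", "incident_declared")
declared_to_fix = minutes_between("incident_declared", "fix_dispatched")
total_impact = minutes_between("impact_start", "alerts_cleared")

print(change_to_impact, impact_to_declared, declared_to_fix, total_impact)
# prints: 4 9 19 62
```

On Cloudflare's own figures, then, the impact window runs to 62 minutes, with 9 minutes between the first traffic drop and the incident being declared.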

Data from DNSPerf shared with us by a reader indicates the disruption lasted around three hours, far longer than Cloudflare's summary suggests.

As a Reg reader pointed out: "Remember this is a DNS service. Every person using the service would have had no ability to use the internet. Every business using Cloudflare had no internet for the length of the outage. NO DNS = NO INTERNET." ®
