WAN Router IP Address Change Blamed For Global Microsoft 365 Outage

The global outage of Microsoft 365 services that last week prevented some users from accessing resources for more than half a working day was down to a packet bottleneck caused by a router IP address change.

Microsoft's wide area network toppled a bunch of services from 07:05 UTC on January 25 and although some regions and services had come back online by 09:00, intermittent packet loss woes weren't fully mitigated until 12:43. The wobble also affected Azure Government cloud services.

In a postmortem, Microsoft said that changes made to its WAN had hit connectivity between clients and Azure, across regions and cross-premises via ExpressRoute.

"As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them.

"The command that caused the issue has different behaviors on different network devices, and the command had not been vetted using our full qualification process on the router on which it was executed."

This meant users were unable to access resources hosted in Azure or other Microsoft 365 and Power Platform services.

Microsoft said monitoring systems detected DNS and WAN-related troubles at 07:12, some seven minutes after they began.

By 08:20, resident techies at Microsoft had spotted the "problematic command that triggered the issues" and some 40 minutes later networking telemetry indicated many of the services were running again.

However, Microsoft said the initial problem with the WAN meant automated systems for maintaining its health were paused. This included systems for identifying and expelling unhealthy devices, as well as the traffic engineering system for optimizing the flow of data across the network.

"Due to the pause in these systems, some paths in the network experienced increased packet loss from 09:35 UTC until those systems were manually restarted, restoring the WAN to optimal operating conditions. This recovery was completed at 12:43 UTC," the postmortem added.

Steps Microsoft is taking to make similar incidents less likely or less severe include blocking "highly impactful command from getting executed on the devices" and requiring all command execution on devices to follow safe guidelines.

The final post-incident report is scheduled to be published a fortnight after the outage. ®
