WAN Router IP Address Change Blamed For Global Microsoft 365 Outage

The global outage of Microsoft 365 services that last week prevented some users from accessing resources for more than half a working day was down to a packet bottleneck caused by a router IP address change.

Microsoft's wide area network toppled a bunch of services from 07:05 UTC on January 25 and although some regions and services had come back online by 09:00, intermittent packet loss woes weren't fully mitigated until 12:43. The wobble also affected Azure Government cloud services.

In a postmortem, Microsoft said that changes made to its WAN had hit connectivity between clients and Azure, across regions and cross-premises via ExpressRoute.

"As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them.

"The command that caused the issue has different behaviors on different network devices, and the command had not been vetted using our full qualification process on the router on which it was executed."

This meant users were unable to access resources hosted in Azure or other Microsoft 365 and Power Platform services.

Microsoft said monitoring systems detected DNS and WAN-related troubles at 07:12, some seven minutes after they began.

By 08:20, resident techies at Microsoft had spotted the "problematic command that triggered the issues" and some 40 minutes later networking telemetry indicated many of the services were running again.

However, Microsoft said the initial problem with the WAN meant automated systems for maintaining its health were paused. This included systems for identifying and expelling unhealthy devices, as well as the traffic engineering system for optimizing the flow of data across the network.

"Due to the pause in these systems, some paths in the network experienced increased packet loss from 09:35 UTC until those systems were manually restarted, restoring the WAN to optimal operating conditions. This recovery was completed at 12:43 UTC," the postmortem added.
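For readers who want to tally the timeline, the short Python sketch below works out the intervals from the timestamps quoted in the postmortem (all UTC, same day). It is only an illustration: the dictionary keys and the helper name `minutes_between` are our own labels, and the 09:00 entry is the approximate point at which telemetry showed many services running again.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Timestamps as reported in the article (UTC); key names are illustrative.
TIMELINE = {
    "impact_start": "07:05",
    "detected": "07:12",
    "command_identified": "08:20",
    "many_services_back": "09:00",   # approximate, per networking telemetry
    "residual_loss_start": "09:35",
    "fully_recovered": "12:43",
}

time_to_detect = minutes_between(TIMELINE["impact_start"], TIMELINE["detected"])
time_to_identify = minutes_between(TIMELINE["impact_start"], TIMELINE["command_identified"])
residual_loss = minutes_between(TIMELINE["residual_loss_start"], TIMELINE["fully_recovered"])
total_impact = minutes_between(TIMELINE["impact_start"], TIMELINE["fully_recovered"])

print(f"Time to detect:        {time_to_detect} min")
print(f"Time to identify:      {time_to_identify} min")
print(f"Residual packet loss:  {residual_loss} min")
print(f"Total impact window:   {total_impact // 60} h {total_impact % 60} min")
```

On these figures the total impact window comes to 5 hours 38 minutes, of which a little over three hours was the residual packet loss that followed the pause in the automated systems.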

Steps Microsoft is taking to make similar incidents less likely or less severe include blocking "highly impactful command from getting executed on the devices" and requiring all command execution on devices to follow safe guidelines.
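Microsoft has not published how these safeguards are implemented. As a rough sketch of the general idea, a pre-execution check might refuse any command that is on a high-impact block list or that has not been qualified for the specific device model it is about to run on. Everything in the Python example below is hypothetical, including the device model names, the command strings, and the `check_command` helper.

```python
from dataclasses import dataclass

# Hypothetical block list: commands considered too impactful to run at all.
BLOCKED_COMMANDS = {"clear routing all"}

# Hypothetical qualification record: (device model, command) pairs that
# have passed a full qualification process on that model.
QUALIFIED = {
    ("vendor-a-core", "set interface address"),
    ("vendor-b-edge", "set interface address"),
    ("vendor-a-core", "show route summary"),
}

@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str

def check_command(device_model: str, command: str) -> Decision:
    """Decide whether a command may run on a given device model."""
    # Normalize case and whitespace so trivial variations do not bypass checks.
    cmd = " ".join(command.lower().split())
    if cmd in BLOCKED_COMMANDS:
        return Decision(False, "command is on the high-impact block list")
    if (device_model, cmd) not in QUALIFIED:
        # The same command can behave differently on different devices,
        # so qualification is tracked per model rather than globally.
        return Decision(False, f"command not qualified on {device_model}")
    return Decision(True, "qualified for this device model")

if __name__ == "__main__":
    for model, cmd in [
        ("vendor-a-core", "set interface address"),
        ("vendor-c-wan", "set interface address"),
        ("vendor-a-core", "Clear  Routing ALL"),
    ]:
        print(model, "|", cmd, "->", check_command(model, cmd))
```

Keying the qualification record on the device model as well as the command reflects the postmortem's point that one command can have different behaviors on different network devices.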

The final post-incident report is scheduled to be published a fortnight after the outage. ®
