Microsoft: Azure Delays Not Acknowledged For 5 Hours Because Manager Was Asleep

Microsoft has revealed it took five hours to acknowledge lengthy disruptions affecting European customers in late March because the task of informing customers relied on a US-based incident manager, who was asleep at the time. 

The delays affected customers in Europe and the UK for three days beginning around 9am UTC on March 24. However, at the outset, as customers struggled with extra-sluggish Azure services, Microsoft missed its 10-minute target for acknowledging issues by a wide margin. 

In a post-mortem, Chad Kimes, director of engineering at Azure, admitted Microsoft's "communication during this incident was also problematic" and apologized for the frustration and confusion this caused to the 6,136 customers affected.

The technical issue itself was caused by virtual-machine capacity constraints due to a surge in demand for Azure compute resources during the COVID-19 pandemic. This resulted in 21-minute delays affecting Microsoft's Pipelines DevOps service for releasing new builds targeting Windows and Linux agents in Azure. The longest delay was nine hours, according to Kimes.

"The problem here is that our live-site processes have a gap for these types of incidents," Kimes said of the communication issue. 

"When incidents involve customer request failures or performance impacts, we have automated tooling that starts an incident and loops in both a DRI (designated responsible individual) and what we call a PIM (primary incident manager). The PIM is typically the person responsible for posting external communications acknowledging the incident," he added.

"Pipeline delays are detected by different tooling, and the PIM is not currently paged for these types of incidents. As a result, while the DRI was hard at work understanding the technical issues and looking for potential mitigations, the PIM was still asleep. Only when the PIM joined the incident bridge at roughly the beginning of business hours in the Eastern United States was the incident finally acknowledged."

Microsoft says it is planning to improve its live-site processes to "ensure that initial communication of pipeline delay incidents happens on the same schedule as other incident types".

The company is also rolling out architectural changes to mitigate bottlenecks in spinning up new agents from its hosted agent pool. 
