Alibaba Cloud Reveals Its Uptime And Efficiency Secrets Developed By In-house Network Boffins

Chinese web giant Alibaba has cut cumulative network outage time by more than 92 percent, reduced the unit cost of its load balancing infrastructure by 18.9 percent, and found ways to improve SmartNIC performance by offloading workloads to idle infrastructure.

The company revealed those outcomes in papers it will present at the SIGCOMM conference next week.

The reduction in network outages came from a technology Alibaba calls “ZooRoute” that its researchers describe [PDF] as “a fast failure recovery service that ensures global bypass in large-scale cloud networks within seconds.”

The paper describing ZooRoute explains that cloud operators’ networks will inevitably fail from time to time, and that strategies like fast rerouting and traffic engineering can take seconds and minutes respectively to restore traffic flows – too slow for many users.

“As a result, tenants are forced to develop their own recovery solutions, which typically involve redundant resources or protocol stack modifications, thereby increasing capital and operating expenses,” the paper argues.

The company claims its own ZooRoute tech can “instantly reroute traffic to a working path” by constantly probing for viable routes. If a failure occurs, ZooRoute is therefore aware of a route that will work, and switches to it ASAP. The paper says Alibaba Cloud has used ZooRoute for 18 months, and it has “significantly improved network reliability, reducing cumulative outage time by 92.71 percent.”
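The probe-then-switch idea described above can be illustrated with a minimal Python sketch. It is not Alibaba's implementation: the `PathTable` class, its method names, and the `probe` callback are all hypothetical, and the paper's actual probing and rerouting mechanisms are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class PathTable:
    """Hypothetical helper: tracks which candidate paths passed their latest probe."""
    paths: List[str]
    healthy: Dict[str, bool] = field(default_factory=dict)
    active: Optional[str] = None

    def record_probes(self, probe: Callable[[str], bool]) -> None:
        # Probe every candidate path, not only the active one, so that a
        # known-good alternative is already on hand when a failure occurs.
        for path in self.paths:
            self.healthy[path] = probe(path)

    def select(self) -> Optional[str]:
        # Stay on the active path while its latest probe succeeded.
        if self.active is not None and self.healthy.get(self.active, False):
            return self.active
        # Otherwise switch to the first path whose latest probe succeeded,
        # or to None if no path is currently known to work.
        self.active = next(
            (p for p in self.paths if self.healthy.get(p, False)), None
        )
        return self.active
```

Because the health table is refreshed continuously, the switch in `select` is a lookup rather than a search, which is the property the article attributes to ZooRoute.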

Alibaba Cloud has also deployed a tool called Hermes that it says “reduces daily worker hangs by 99.8 percent and lowers the unit cost of L7 LB infrastructure by 18.9 percent.”

A paper [PDF] describing Hermes explains that the layer 7 load balancers clouds use to keep their networks humming “rely on I/O event notification mechanisms such as epoll to dispatch connections from the kernel to userspace workers,” but that this approach sometimes creates bottlenecks.

Alibaba’s solution uses eBPF – a technology that lets sandboxed programs run inside the Linux kernel – to filter demands from workers, work out which deserve priority, and then schedule tasks accordingly.

“Hermes is well suited for cloud L7 LBs facing diverse and rapidly changing traffic patterns, where no single scheduling policy can optimally handle all tenant workloads,” the paper states. It reports that in production at Alibaba Cloud, Hermes has reduced the standard deviation of per-worker CPU utilization and connection counts by 90 percent and 99.4 percent respectively, cut average daily worker hangs by 99.8 percent, and lowered the unit cost of the infrastructure behind its L7 LBs by 18.9 percent.
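As a rough illustration of the filter-then-schedule approach described for Hermes, the following userspace Python sketch picks a worker for a new connection. It is a toy model, not eBPF code and not the paper's algorithm: the `Worker` fields, the `pick_worker` function, and the `cpu_limit` threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    name: str
    connections: int     # open connections currently held by the worker
    cpu_percent: float   # most recent CPU utilization sample
    responsive: bool     # False if the worker appears to be hung


def pick_worker(workers: List[Worker], cpu_limit: float = 90.0) -> Optional[Worker]:
    # Filter step: drop workers that look hung or are at or above the CPU limit.
    candidates = [
        w for w in workers if w.responsive and w.cpu_percent < cpu_limit
    ]
    if not candidates:
        return None
    # Scheduling step: give the new connection to the candidate with the
    # fewest connections, breaking ties on lower CPU utilization.
    return min(candidates, key=lambda w: (w.connections, w.cpu_percent))
```

Steering connections by reported load, rather than by whichever worker happens to wake first, is one way to narrow the per-worker spread in CPU utilization and connection counts that the paper measures.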

A third paper from Alibaba describes [PDF] “Nezha”, a distributed vSwitch load sharing system that works on SmartNICs – the CPU-equipped network cards that hyperscalers use to run networking and storage plumbing workloads so that CPUs can run tenants’ applications.

In the paper about Nezha, Alibaba admits that some of the virtual switches running on its SmartNICs are maxed out. Its solution is to find under-used SmartNICs and shift workloads to them.

“The deployment cost of Nezha is only a small fraction of that required to deploy new devices,” the paper states, adding that the system has significantly improved performance and moved bottlenecks from the vSwitch to the VM kernel stack.
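The idea of shifting work from saturated SmartNICs to under-used ones can be sketched as a simple greedy planner. This is an assumption-laden illustration rather than Nezha's design: the `plan_offload` function, the `high` and `low` thresholds, and the treatment of load as interchangeable integer percentages of capacity are all hypothetical simplifications.

```python
from typing import Dict, List, Tuple


def plan_offload(
    load: Dict[str, int], high: int = 90, low: int = 50
) -> List[Tuple[str, str, int]]:
    """Return (source, target, amount) moves from saturated to under-used NICs.

    `load` maps a NIC name to its utilization as an integer percentage.
    """
    remaining = dict(load)  # work on a copy so the caller's data is untouched
    moves: List[Tuple[str, str, int]] = []
    # Visit the most heavily loaded NICs first.
    for src in sorted(load, key=load.get, reverse=True):
        if remaining[src] <= high:
            continue
        # Try targets from least to most loaded at this point in the plan.
        for dst in sorted(remaining, key=remaining.get):
            excess = remaining[src] - high
            if excess <= 0:
                break
            spare = low - remaining[dst]
            if dst == src or spare <= 0:
                continue
            amount = min(excess, spare)
            remaining[src] -= amount
            remaining[dst] += amount
            moves.append((src, dst, amount))
    return moves
```

For example, `plan_offload({"nic-a": 100, "nic-b": 20})` returns `[("nic-a", "nic-b", 10)]`, moving just enough load to bring the saturated card back to the `high` threshold.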

SIGCOMM commences on September 8th, in Coimbra, Portugal.

One notable feature of this year’s event is a keynote by distinguished computer scientist (and Register columnist) Bruce Davie, to celebrate his being chosen as the recipient of the annual SIGCOMM Award, in recognition of his lifetime contributions to the field of communication networks.

Bruce is the first Australian to win the award, which The Register’s APAC desk thinks is bloody brilliant. ®
