FYI: Today's Computer Chips Are So Advanced, They Are More 'Mercurial' Than Precise – And Here's The Proof

Computer chips have advanced to the point that they're no longer reliable: they've become "mercurial," as Google puts it, and may not perform their calculations in a predictable manner.

Not that they were ever completely reliable. CPU errors have been around as long as CPUs themselves. They arise not only from design oversights but also from environmental conditions and from physical system failures that produce faults.

But these errors have tended to be rare enough that, as long as systems appear to be operating as expected, only the most sensitive calculations are subjected to extensive verification. Mostly, computer chips are treated as trustworthy.

Lately, however, two of the world's larger CPU stressors, Google and Facebook, have been detecting CPU misbehavior more frequently, enough that they're now urging technology companies to work together to better understand how to spot these errors and remediate them.

"Our adventure began as vigilant production teams increasingly complained of recidivist machines corrupting data," said Peter Hochschild, a Google engineer, in a video presented as a part of the Hot Topics in Operating Systems (HotOS) 2021 conference this week.

"These machines were credibly accused of corrupting multiple different stable well-debugged large-scale applications. Each machine was accused repeatedly by independent teams but conventional diagnostics found nothing wrong with them."

Looking more deeply at the code involved and operational telemetry from their machines, Google engineers began to suspect problems with their hardware. Their investigation found that the incidence of hardware errors was greater than expected and these issues showed themselves sporadically, long after installation, and on specific, individual CPU cores rather than entire chips or a family of parts.

The Google researchers examining these silent corrupt execution errors (CEEs) concluded "mercurial cores" were to blame – CPUs that miscalculated occasionally, under different circumstances, in a way that defied prediction. (That's mercurial as in unpredictable, not Mercurial as in the version control system of the same name.)

The errors were not the result of chip architecture design missteps, and they're not detected during manufacturing tests. Rather, Google engineers theorize, the errors have arisen because we've pushed semiconductor manufacturing to a point where failures have become more frequent and we lack the tools to identify them in advance.

In a paper titled "Cores that don’t count," Hochschild and colleagues Paul Turner, Jeffrey Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David Culler, and Amin Vahdat cite several plausible reasons why the unreliability of computer cores is only now receiving attention, including larger server fleets that make rare problems more visible, increased attention to overall reliability, and software development improvements that reduce the rate of software bugs.

"But we believe there is a more fundamental cause: ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design," the researchers state, noting that existing verification methods are ill-suited for spotting flaws that occur sporadically or as a result of physical deterioration after deployment.

Google's not alone

Facebook has noticed the errors, too. In February, the social ad biz published a related paper, "Silent Data Corruption at Scale," that states, "Silent data corruptions are becoming a more common phenomena in data centers than previously observed." The paper proposes mitigation strategies but doesn't address the root cause.

As Google's researchers see it, Facebook spotted a symptom of unreliable cores – silent data corruption. But identifying the cause of the problem, and coming up with a fix, will require further work.

The risks posed by misbehaving cores include not only crashes, which the existing fail-stop model for error handling can accommodate, but also incorrect calculations and data loss, which may go unnoticed and pose a particular risk at scale.
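
To make the distinction concrete, here is a minimal Python sketch of one way software can turn a silent miscalculation into a fail-stop error: run the same computation more than once and refuse to continue if the answers disagree. This is an illustration only, not a technique taken from the Google or Facebook papers; the names `run_checked` and `SilentCorruptionError` are hypothetical. Note the limitation: repeating the work on the same core may simply reproduce the same wrong answer, so a real deployment would presumably want the repeats scheduled on different cores or machines.

```python
class SilentCorruptionError(RuntimeError):
    """Raised when repeated runs of the same computation disagree."""


def run_checked(fn, *args, runs=2):
    """Run fn(*args) several times and fail loudly if any result differs.

    Hypothetical helper: assumes fn is deterministic and that its results
    can be compared with ==.
    """
    if runs < 2:
        raise ValueError("need at least two runs to compare")
    results = [fn(*args) for _ in range(runs)]
    first = results[0]
    if any(r != first for r in results[1:]):
        # Disagreement: stop instead of silently passing on bad data.
        raise SilentCorruptionError(f"{fn.__name__} returned differing results")
    return first


if __name__ == "__main__":
    total = run_checked(sum, range(1_000_000))
    print("verified result:", total)
```

The cost is obvious: every checked computation takes at least twice as long, which is one reason blanket verification has historically been reserved for the most sensitive calculations.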

Hochschild recounted an instance where Google's errant hardware conducted what might be described as an auto-erratic ransomware attack.

"One of our mercurial cores corrupted encryption," he explained. "It did it in such a way that only it could decrypt what it had wrongly encrypted."

Google's researchers declined to reveal the CEE rates detected at the company's data centers, citing "business reasons," though they provided a ballpark figure: "on the order of a few mercurial cores per several thousand machines – similar to the rate reported by Facebook."

Ideally, Google would like to see automated methods to identify mercurial cores and has suggested strategies like CPU testing throughout the chip's lifecycle rather than relying only on burn-in testing prior to deployment. The mega-corp is currently relying on human-driven core integrity interrogation, which is not particularly accurate, because tools and techniques for identifying dubious cores remain works in progress.
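
As a rough illustration of what an automated, repeatable core check might look like, the Python sketch below pins the current process to each available core in turn, runs a deterministic workload there, and flags any core whose output disagrees with the majority. It is a sketch under stated assumptions, not Google's tooling: `workload` and `screen_cores` are hypothetical names, the SHA-256 chain is an arbitrary stand-in for a real test, and pinning relies on `os.sched_setaffinity`, which Python exposes only on some platforms (Linux among them). Where pinning is unavailable, the sketch falls back to a single unpinned run.

```python
import hashlib
import os
from collections import Counter


def workload(seed: int, rounds: int = 2000) -> str:
    """Deterministic CPU-bound task: an iterated SHA-256 chain."""
    digest = seed.to_bytes(8, "big")
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def screen_cores(seed: int = 1, trials: int = 3) -> dict:
    """Return {core_id: healthy} after running the workload on each core.

    A core is marked healthy only if every trial on it matches the
    most common output seen across all cores.
    """
    can_pin = hasattr(os, "sched_setaffinity")
    allowed = sorted(os.sched_getaffinity(0)) if can_pin else [0]
    outputs = {}
    try:
        for core in allowed:
            if can_pin:
                os.sched_setaffinity(0, {core})  # run only on this core
            outputs[core] = [workload(seed) for _ in range(trials)]
    finally:
        if can_pin:
            os.sched_setaffinity(0, set(allowed))  # restore original mask
    votes = Counter(out for runs in outputs.values() for out in runs)
    consensus = votes.most_common(1)[0][0]
    return {
        core: all(out == consensus for out in runs)
        for core, runs in outputs.items()
    }


if __name__ == "__main__":
    for core, healthy in screen_cores().items():
        print(f"core {core}: {'ok' if healthy else 'SUSPECT'}")
```

A single workload like this exercises only a narrow slice of what a core does, so a clean result proves little. Since the faulty cores described here miscalculate only occasionally and under varying circumstances, a useful screen would likely need many different tests, repeated over the life of the machine.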

"In our recent experience, roughly half of these human identified suspects are actually proven, on deeper investigation, to be mercurial cores – we must extract 'confessions' via further testing (often after first developing a new automatable test)," Google's researchers explain. "The other half is a mix of false accusations and limited reproducibility."

Let the Core Inquisition begin. ®
