AMD EPYC Rome CPUs Stop Working After 1,044 Days of Uptime

June 03, 2023
A revision guide for AMD's EPYC 7002 "Rome" server processors reveals that a CPU core can hang after about 1,044 days of uptime (approximately three years).

AMD's EPYC Rome CPU Core Fails To Exit Sleep State After Almost Three Years Of Uptime

The AMD EPYC Rome CPUs are based on the Zen 2 core architecture and are some of the most competitive chips that the Red team has introduced for the data center market. However, an issue has come to light in which a core can get stuck in the CC6 sleep state after almost three years of uptime. The core enters CC6 normally but never wakes up again, so it effectively hangs. Here is how AMD describes the issue:

A core will fail to exit CC6 after about 1044 days after the last system reset. The time of failure may vary depending upon spread spectrum and REFCLK frequency.

According to AMD, the timing of the failure depends on spread spectrum (slightly modulating the clock frequency to reduce electromagnetic interference) and on the REFCLK frequency (the reference clock that helps the chip keep track of time). However, AMD's stated failure time may be slightly off: according to a Reddit user, acid_migrain, the actual figure is closer to 1042 days and roughly 12 hours. Here is why:

Despite what they say, the problem actually manifests at 1042 days and roughly 12 hours. The TSC ticks at 2800 MHz, and 2800 * 10**6 * 1042.5 days almost equals 0x380000000000000, which has too many zeros not to be a coincidence.
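The arithmetic in that comment can be checked with a short Python sketch. It assumes, as the comment does, a TSC frequency of 2800 MHz, and it makes the implied conversion from days to seconds explicit. The names TSC_HZ, TARGET_TICKS, ticks_after and days_to_reach are illustrative only and do not come from AMD's document.

```python
# Sketch: verify the Reddit arithmetic linking TSC ticks to the failure time.
# Assumption: the TSC ticks at a constant 2800 MHz, as stated in the comment.

TSC_HZ = 2800 * 10**6              # ticks per second
SECONDS_PER_DAY = 86_400
TARGET_TICKS = 0x380000000000000   # the "round" hex value from the comment


def ticks_after(days: float, tsc_hz: int = TSC_HZ) -> float:
    """Number of TSC ticks accumulated after the given number of days."""
    return tsc_hz * days * SECONDS_PER_DAY


def days_to_reach(ticks: int, tsc_hz: int = TSC_HZ) -> float:
    """Days of uptime needed for the TSC to reach the given tick count."""
    return ticks / tsc_hz / SECONDS_PER_DAY


ticks_at_1042_5 = ticks_after(1042.5)
days_at_target = days_to_reach(TARGET_TICKS)

print(f"ticks after 1042.5 days: {ticks_at_1042_5:.6e}")
print(f"target tick count:       {float(TARGET_TICKS):.6e}")
print(f"days to reach target:    {days_at_target:.4f}")
# The target is a small multiple of a large power of two (7 * 2**55).
print(f"target == 7 * 2**55:     {TARGET_TICKS == 7 * 2**55}")
```

Under that assumption, the counter reaches the hex value within about ten seconds of the 1042.5-day mark. A different TSC frequency would shift the result, which is consistent with AMD's note that the failure time varies with the clock configuration.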

The workaround is simple: reboot before 1,044 days of uptime, reset the CPU timer, or turn off the CC6 sleep state. AMD has no plans to provide a fix for this, as mentioned in the document. This isn't a severe issue; errata like this turn up in many CPUs. The EPYC 7002 was introduced in 2019, and the erratum is coming to light now because some customers' systems have been running long enough (1,044 days) to hit it.
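For administrators who want to schedule the reboot workaround, a minimal Python sketch of an uptime check might look like the following. It assumes a Linux host where /proc/uptime is available and treats uptime since boot as a stand-in for "time since the last system reset". The 1042.5-day threshold (the more conservative Reddit estimate) and the 30-day warning margin are arbitrary choices, and the function names are hypothetical. The script cannot tell whether a given CPU is actually affected.

```python
# Sketch: warn when a Linux host's uptime approaches the reported failure window.
# Assumptions: /proc/uptime exists, and uptime approximates time since last reset.

SECONDS_PER_DAY = 86_400
THRESHOLD_DAYS = 1042.5   # conservative estimate from the Reddit comment
MARGIN_DAYS = 30.0        # arbitrary safety margin for scheduling a reboot


def days_remaining(uptime_seconds: float,
                   threshold_days: float = THRESHOLD_DAYS) -> float:
    """Days left before uptime reaches the threshold (negative if already past)."""
    return threshold_days - uptime_seconds / SECONDS_PER_DAY


def needs_reboot(uptime_seconds: float,
                 margin_days: float = MARGIN_DAYS,
                 threshold_days: float = THRESHOLD_DAYS) -> bool:
    """True once the remaining time is within the safety margin."""
    return days_remaining(uptime_seconds, threshold_days) <= margin_days


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read seconds since boot; the first field of /proc/uptime on Linux."""
    with open(path) as f:
        return float(f.read().split()[0])


if __name__ == "__main__":
    try:
        uptime = read_uptime_seconds()
    except OSError:
        print("Could not read /proc/uptime; is this a Linux host?")
    else:
        left = days_remaining(uptime)
        status = "schedule a reboot" if needs_reboot(uptime) else "ok"
        print(f"uptime: {uptime / SECONDS_PER_DAY:.1f} days, "
              f"{left:.1f} days to threshold ({status})")
```

A check like this could run from a daily scheduled job; disabling CC6 instead avoids the reboot at the cost of higher idle power.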

Because processors are so complex, many bugs are discovered only after the chips have shipped. These issues vary widely, from relatively minor ones, such as malfunctioning flags and cache tags, to more serious ones that could leave an attack vector open. The chipmaker weighs the severity of the defect, how easily it can be fixed, and how urgently it must be addressed before deciding when and how to provide a fix.

News Source: Tom's Hardware

Source: Wccftech