Nvidia's data center Blackwell GPUs reportedly overheat, require rack redesigns and cause delays for customers

Microsoft's Blackwell-based machine
(Image credit: Microsoft/X)

Nvidia's next-generation Blackwell processors are facing significant challenges with overheating when installed in high-capacity server racks, reports The Information. These issues have reportedly led to design changes and delays and raised concerns among customers like Google, Meta, and Microsoft about whether they can deploy Blackwell servers on time. 

According to insiders familiar with the situation who spoke with The Information, Nvidia's Blackwell GPUs for AI and HPC overheat when installed in racks that hold 72 processors, machines expected to draw up to 120kW per rack. The problems have forced Nvidia to reevaluate the design of its server racks multiple times, as overheating limits GPU performance and risks damaging components. Customers reportedly worry that these setbacks could derail their timelines for deploying the new processors in their data centers. 
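For rough context, here is a back-of-envelope figure derived from the numbers above rather than an official specification: 120 kW ÷ 72 GPUs ≈ 1.7 kW per GPU position, and that power budget also has to cover the rack's CPUs, switches, and other components, which illustrates why cooling such a dense system is so demanding.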

Nvidia has reportedly instructed its suppliers to make several design changes to the racks to counteract overheating issues. The company has worked closely with its suppliers and partners to develop engineering revisions to improve server cooling. While these adjustments are standard for such large-scale tech releases, they have nonetheless added to the delay, further pushing back expected shipping dates. 

In response to the reports of delays and overheating, an Nvidia spokesperson told Reuters that the company is working with cloud providers on the engineering changes and described them as part of the normal development process. The collaboration with cloud providers and suppliers is meant to ensure the final product meets performance and reliability expectations as Nvidia continues to work through these technical challenges. 

Previously, Nvidia had to delay the Blackwell production ramp due to a design flaw that hurt yields. Nvidia's Blackwell B100 and B200 GPUs use TSMC's CoWoS-L packaging technology to connect their two chiplets. This design relies on an RDL interposer with local silicon interconnect (LSI) bridges, which supports data transfer speeds of up to 10 TB/s, and the precise positioning of these LSI bridges is essential for the technology to function as intended. However, a mismatch in the thermal expansion characteristics of the GPU chiplets, LSI bridges, RDL interposer, and motherboard substrate led to warping and system failures. To address this, Nvidia reportedly modified the GPU silicon's top metal layers and bump structures to improve production reliability. Although Nvidia never revealed specific details about these changes, it noted that new masks were necessary as part of the fix. 

As a result, the final revision of Blackwell GPUs only entered mass production in late October, which means that Nvidia will be able to ship these processors starting in late January. 

Nvidia's clients, including tech giants like Google, Meta, and Microsoft, use Nvidia's GPUs to train their most powerful large language models. Delays in Blackwell AI GPUs naturally affect Nvidia customers' plans and products.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • bit_user
    It seems like Nvidia tried to launch with clock speeds that were too high.

    Perhaps they felt compelled to do that, in order to justify the exorbitant pricing. This would also explain why they can't simply tell customers to reduce clock speeds a little, since that would significantly affect the perf/$ and might result in customers asking for a rebate to offset the lost performance.

    While Nvidia provides industry leading AI performance, I'm sure they're probably less competitive on the perf/$ front. Intel basically said as much, during recent announcements regarding their repositioning of Gaudi. This is something Nvidia must be keeping an eye on.

    TBH, I honestly expected Nvidia to have transitioned their AI-centric products away from a GPU-style architecture, by now, and towards a more purpose-built, dataflow-oriented architecture like most of their competitors have adopted. I probably just underestimated how much effort it would take them to overhaul their software infrastructure, which I guess supports Jim Keller's claim that CUDA is more of a swamp than a moat.
  • raminpro
    These days performance is not very important (all hardware performs close together); temperature matters more.

    I can't understand why they don't use AMD.
    AMD may be a little weaker, but it runs with much lower heat and wattage.

    High wattage and high temperature = degradation.