Nvidia's Blackwell AI GPU overheating issues are seemingly overhyped — semiconductor analysts reveal cooling issues have been mostly addressed
Blackwell's cooling issues aren't as severe as some might have thought.
Reports of Nvidia's GB200 NVL72 server racks overheating have apparently been exaggerated. According to Business Insider, Dylan Patel, chief analyst at Semianalysis, said that Blackwell's cooling design issues, which have been present for months, have been largely addressed and that the overheating concerns are overblown.
Semianalysis' five analysts monitoring the semiconductor industry reported that the cooling system issues, which triggered "reworks" from several suppliers, required only a "minor" change. Blackwell's cooling faults have been specifically problematic with Nvidia's massive 72-chip server rack, which can consume up to 120kW. Flaws in the rack's design have forced Nvidia to revise it multiple times because the GPUs inside were overheating. This has set back shipments of Nvidia's GB200 hardware, with the required design changes causing additional delays.
Nvidia's B200 GPUs are the most potent processing chips for AI workloads. The GB200 superchip, for instance, has a configurable TDP in the thousands of watts, with a peak rating of up to 2,700 watts. These absurdly high power figures make air cooling virtually impossible within the constraints of a standard rack-mount form factor.
This physics problem has forced Nvidia to require liquid cooling on its latest Blackwell GPUs. It also requires data centers to revamp their server farms to accommodate the infrastructure needed to support liquid-cooled servers.
Nvidia could solve this problem by creating slower air-cooled GPUs — which the GPU manufacturer still does, in the form of GPUs such as the H200 NVL. However, to remain at the bleeding edge of the AI GPU arms race, Nvidia is prioritizing performance no matter the cost, which is why the company has opted to make GPUs that require thousands of watts of power at the expense of air-cooling.
The good news is that Nvidia's 72-chip Blackwell cooling issues are apparently minor and have been largely addressed already. In addition, only Nvidia's flagship 72-chip server rack is having the problem.
Aaron Klotz is a contributing writer for Tom’s Hardware, covering news related to computer hardware such as CPUs and graphics cards.
bit_user
Ultimately, what matters are the delays affecting customer deployments. Whether they're due to minor or major issues is of secondary concern, but a delay is a delay.
DS426
cristovao said: 4090 issues were also overhyped.
Yeah, sure.
https://www.tomshardware.com/news/technician-repairs-hundreds-rtx-4090-melted-connectors-every-month
Maybe you say "that's still a small percentage given the total sales of 4090's," which is very true, but we also don't know about every case of these failures; moreover, some cause collateral damage to the rest of the system, such as PSU and motherboard failure. It was so bad that a new 12V high power standard was quickly devised -- something that's never been a problem in the history of ATX PCI-E for Graphics history. CableMod even recalled their products thru official consumer protection channels: https://www.cpsc.gov/Recalls/2024/GPU-Angled-Adapters-Recalled-Due-to-Fire-and-Burn-Hazards-Manufactured-by-CableMod
The 4090 is a crazy impressive gaming GPU, but nVidia doesn't need any free passes on $3.6T+ of market cap and the master class of master classes on mindshare success.