Nvidia's Jensen Huang admits AI chip design flaw was '100% Nvidia's fault' — TSMC not to blame, now-fixed Blackwell chips are in production
Huang dismissed reports of tension with TSMC as "fake news."
Nvidia's yield-killing design flaw in its Blackwell GPU was fixed months ago, and a refined version of the B100/B200 processors is about to enter mass production. Jensen Huang, Nvidia's CEO, admitted this week that the flaw was entirely caused by Nvidia and said that the company's production partner TSMC helped fix it in a timely manner, according to Reuters.
"We had a design flaw in Blackwell, it was functional, but the design flaw caused the yield to be low," Huang said. "It was 100% Nvidia's fault."
When the first reports about the design flaw emerged, some media outlets reported that TSMC was to blame and suggested this might be causing strain between Nvidia and its foundry partner. According to Huang, this was not the case: Nvidia's own miscalculations caused the problem. Huang also dismissed reports of tensions between the two companies as "fake news."
Nvidia's Blackwell B100 and B200 GPUs link their two chiplets using TSMC's CoWoS-L packaging technology, which relies on an RDL interposer equipped with local silicon interconnect (LSI) bridges to enable data transfer rates of about 10 TB/s. The placement of these bridges is critical. However, a reported mismatch in the thermal expansion properties of the GPU chiplets, LSI bridges, RDL interposer, and motherboard substrate caused the system to warp and fail, and Nvidia reportedly had to modify the top metal layers and bumps of the GPU silicon to improve production yields. While the company did not disclose specific details about the fix, it did mention that new masks were required.
Yield-killing problems and major functionality issues (errata) are not unheard of in the semiconductor world. Typically, companies fix them by modifying a metal layer (or two) and calling it a new stepping. Case in point: Intel's Sapphire Rapids reportedly had 500 bugs, and the company released around a dozen steppings to fix them all (five were base respins). Every new stepping takes around three months to complete (including identifying the problem, fixing it, and producing a new version of the chip), so the speed at which Nvidia and TSMC fixed the Blackwell GPU is pretty impressive.
The now-fixed Blackwell GPUs for AI and supercomputers will enter mass production in late October and should start shipping early next year (which will still be Nvidia's fiscal year 2025).
That said, Nvidia disclosed earlier this year that, in order to meet demand for its Blackwell GPUs among major cloud service providers such as AWS, Google, and Microsoft, it will still have to ship some of the initial low-yield Blackwell processors in 2024. It's unclear how many Blackwell GPUs will be shipped to data centers in 2024.
Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and the latest fab tools to high-tech industry trends.
sjkpublic I am guessing Elon bought some of these. He must not be happy. It will be interesting to see what happens next.
Stomx Dear Jensen and Elon (who recently bought 100k of them),
Please donate the whole this batch of your not yet dead B100/200 to Universities. Probably some of them will survive for a few years if use water cooling.
I'd gladly take a basket of B200 to upgrade my lab and personal supercomputer :)
RUSerious
Stomx said: Please donate the whole this batch of your not yet dead B100/200 to Universities
Good luck! I think you will need it. Out of curiosity - what's your research area?
Samlebon2306 Imagine what would happen when Bugattis are mass-produced. I guess tow-truckers would be the happiest guys.
Pierce2623 Weren’t there people on here claiming that the idea of mistake by Nvidia was ludicrous?
Conor Stewart
Stomx said: Dear Jensen and Elon (who recently bought 100k of them),
Please donate the whole this batch of your not yet dead B100/200 to Universities. Probably some of them will survive for a few years if use water cooling.
I'd gladly take a basket of B200 to upgrade my lab and personal supercomputer :)
It seems the issue is with yields, not chips failing, so the functional chips from that batch should be fine to be used, so why would they donate them? They said absolutely nothing about the chips failing or that water-cooling would fix the issue.
It is even mentioned in the article that they will still ship these chips.