Nvidia's Jensen Huang admits AI chip design flaw was '100% Nvidia's fault' — TSMC not to blame, now-fixed Blackwell chips are in production

Nvidia's yield-killing design flaw in its Blackwell GPU was fixed months ago, and a refined version of the B100/B200 processors is about to enter mass production. Jensen Huang, Nvidia's CEO, admitted this week that the flaw was entirely caused by Nvidia and said that the company's production partner TSMC helped fix it in a timely manner, according to Reuters.

"We had a design flaw in Blackwell, it was functional, but the design flaw caused the yield to be low," Huang said. "It was 100% Nvidia's fault."

When the first reports about the design flaw emerged, some media outlets blamed TSMC and suggested that the problem might be straining relations between Nvidia and its foundry partner. According to Huang, that was not the case: Nvidia's own miscalculations caused the problem. Huang also dismissed reports of tensions between the two companies as "fake news."

Nvidia's Blackwell B100 and B200 GPUs link their two chiplets using TSMC's CoWoS-L packaging technology, which relies on an RDL interposer equipped with local silicon interconnect (LSI) bridges that enable data transfer rates of about 10 TB/s. The placement of these bridges is critical. However, a reported mismatch in the thermal expansion properties of the GPU chiplets, LSI bridges, RDL interposer, and motherboard substrate caused the system to warp and fail. Nvidia reportedly had to modify the top metal layers and bumps of the GPU silicon to improve production yields. While the company did not disclose specific details about the fix, it did mention that new masks were required.
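
To get a feel for why a thermal expansion mismatch matters, here is a rough back-of-the-envelope sketch in Python. It uses the textbook linear expansion relation (change in length = length × expansion coefficient × temperature change) and compares two bonded materials. Every number in it is an illustrative assumption rather than a figure from Nvidia or TSMC: roughly 2.6 ppm/°C is a commonly quoted value for silicon, 15 ppm/°C is a typical ballpark for an organic substrate, and the 26 mm span and 200 °C temperature swing are hypothetical. The function name is likewise made up for this example.

```python
def expansion_mismatch_um(length_mm, cte_a_ppm, cte_b_ppm, delta_t_c):
    """Difference in free linear thermal expansion, in micrometres, between
    two materials that share the same span and temperature swing.

    length_mm  : shared span in millimetres
    cte_*_ppm  : coefficients of thermal expansion in ppm per degree C
    delta_t_c  : temperature change in degrees C
    """
    length_um = length_mm * 1000.0                  # mm -> micrometres
    delta_cte = abs(cte_a_ppm - cte_b_ppm) * 1e-6   # ppm/C -> 1/C
    return length_um * delta_cte * delta_t_c


# Illustrative assumptions only; none of these values come from the article.
SILICON_CTE_PPM = 2.6             # commonly quoted value for silicon
ORGANIC_SUBSTRATE_CTE_PPM = 15.0  # rough ballpark for an organic substrate
SPAN_MM = 26.0                    # hypothetical span across a die
DELTA_T_C = 200.0                 # hypothetical temperature swing

if __name__ == "__main__":
    mismatch = expansion_mismatch_um(
        SPAN_MM, SILICON_CTE_PPM, ORGANIC_SUBSTRATE_CTE_PPM, DELTA_T_C
    )
    print(f"Unconstrained expansion mismatch: {mismatch:.1f} micrometres")
```

Because the materials are bonded together and cannot move freely, a difference of this kind shows up as mechanical stress and warpage instead. This simple model ignores the actual geometry, stiffness, and layer stack of a real package, so it only illustrates the scale of the effect.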

Yield-killing problems and major functionality issues (errata) are not unheard of in the semiconductor world. Typically, companies fix them by modifying a metal layer (or two) and calling it a new stepping. Case in point: Intel's Sapphire Rapids reportedly had 500 bugs, and the company released around a dozen steppings to fix them all (five were base respins). Every new stepping takes around three months to complete (including identifying the problem, fixing it, and producing a new version of the chip), so the speed at which Nvidia and TSMC fixed the Blackwell GPU is pretty impressive.

The now-fixed Blackwell GPUs for AI and supercomputers will enter mass production in late October and should start shipping early next year (which will still be Nvidia's fiscal year 2025). 

That said, Nvidia disclosed earlier this year that, in order to meet demand for its Blackwell GPUs among major cloud service providers such as AWS, Google, and Microsoft, it will still have to ship some of the initial low-yield Blackwell processors in 2024. It's unclear how many Blackwell GPUs will be shipped to data centers in 2024.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • sjkpublic
    I am guessing Elon bought some of these. He must not be happy. It will be interesting to see what happens next.
  • Stomx
    Dear Jensen and Elon (who recently bought 100k of them),
    Please donate this whole batch of your not-yet-dead B100/B200s to universities. Some of them will probably survive for a few years if water cooling is used.
    I'd gladly take a basket of B200 to upgrade my lab and personal supercomputer :)
  • RodroX
    Rushing for the "bubble cash" usually has its issues.
  • RUSerious
    Stomx said:
    Please donate this whole batch of your not-yet-dead B100/B200s to universities
    Good luck! I think you will need it. Out of curiosity - what's your research area?
  • Samlebon2306
    Imagine what would happen when Bugattis are mass-produced. I guess tow-truckers would be the happiest guys.
  • Pierce2623
    Weren’t there people on here claiming that the idea of a mistake by Nvidia was ludicrous?
  • Conor Stewart
    Stomx said:
    Dear Jensen and Elon (who recently bought 100k of them),
    Please donate this whole batch of your not-yet-dead B100/B200s to universities. Some of them will probably survive for a few years if water cooling is used.
    I'd gladly take a basket of B200 to upgrade my lab and personal supercomputer :)
    It seems the issue is with yields, not with chips failing, so the functional chips from that batch should be fine to use. Why would they donate them? They said absolutely nothing about the chips failing or about water cooling fixing the issue.

    It is even mentioned in the article that they will still ship these chips.