Jim Keller suggests Nvidia should have used Ethernet to stitch together Blackwell GPUs, saving billions


As a strong supporter of open standards, Jim Keller tweeted that Nvidia should have used the Ethernet protocol for chip-to-chip connectivity in its Blackwell-based GB200 GPUs for AI and HPC. Keller contends this could have saved Nvidia and users of its hardware a lot of money. It would have also made it a bit easier for those customers to migrate their software to different hardware platforms, which Nvidia doesn't necessarily want.

When Nvidia introduced its GB200 GPU for AI and HPC applications, the company focused primarily on its AI performance and advanced memory subsystem, saying little about how the device was made. The GB200 GPU comprises two compute processors stitched together using TSMC's CoWoS-L packaging technology and the NVLink interconnection technology, which uses a proprietary protocol. This isn't an issue for those who already use Nvidia's hardware and software, but it poses a challenge for the rest of the industry when porting software away from Nvidia's platforms.

There is a reason why Jim Keller, a legendary CPU designer and chief executive officer of Tenstorrent, an Nvidia rival, suggests that Nvidia should have used Ethernet instead of proprietary NVLink. Nvidia's platforms use proprietary low-latency NVLink for chip-to-chip and server-to-server communications (which competes against PCIe with the CXL protocol on top) and proprietary InfiniBand connections for higher-tier comms. To maximize performance, the software is tuned for both technologies' peculiarities. For obvious reasons, this could somewhat complicate porting software to other hardware platforms, which is good for Nvidia and not so good for its competitors.

pic.twitter.com/RXMO7bRwEh (April 11, 2024)

There is a catch, though. Ethernet is a ubiquitous technology at both the hardware and software levels, and it is a competitor to Nvidia's low-latency and high-bandwidth (up to 200 Gb/s) InfiniBand interconnection for data centers. Performance-wise, Ethernet (particularly next-generation 400 GbE and 800 GbE) can compete with InfiniBand.

However, InfiniBand still has some advantages regarding features for AI and HPC and superior tail latencies, so some might say that Ethernet's capabilities don't cater to emerging AI and HPC workloads. Meanwhile, the industry — spearheaded by AMD, Broadcom, Intel, Meta, Microsoft, and Oracle — is developing the Ultra Ethernet interconnection technology, poised to offer higher throughput and features for AI and HPC communications. Of course, Ultra Ethernet will become a more viable competitor to Nvidia's InfiniBand for these sorts of workloads.

Nvidia's dominance with its CUDA software platform also faces challenges, hence the advent of the broadly supported Unified Acceleration Foundation (UXL), an industry consortium that includes Arm, Intel, Qualcomm, and Samsung, among others, and that is intended to provide an alternative to CUDA.

Of course, Nvidia needs data center platforms it can ship here and now, which is probably at least part of the reason it is willing to spend billions on proprietary technologies. If open-standard technologies like PCIe with CXL and Ultra Ethernet outpace Nvidia's proprietary NVLink and InfiniBand technologies in performance and capabilities, Nvidia will have to redevelop its platforms, so Keller advises (or trolls) that Nvidia should adopt Ethernet. However, this may be years away, so for now, Nvidia's designs continue to leverage proprietary interconnects.

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

52 Comments from the forums

ezst036

Nvidia certainly has become the poster child for vendor lock-in.

Putting Apple for a run for its money.
hotaru251

It would have also made it a bit easier for those customers to migrate their software to different hardware platforms, which Nvidia doesn't necessarily want.

nvidia likely did it specifically for that reason.
even if it was cheaper to do it another way they want you locked into their stuff forever.

ezst036 said:
Putting Apple for a run for its money.
if that leaked memo a while ago was anything to go by Jensen entirely wants to be like Apple.
The Historical Fidelity

hotaru251 said:
nvidia likely did it specifically for that reason.
even if it was cheaper to do it another way they want you locked into their stuff forever.

if that leaked memo a while ago was anything to go by Jensen entirely wants to be like Apple.
I feel like companies like Apple and Nvidia are paving the way towards new Anti-Trust law interpretation. In that they are building roadblocks against another competitor having the chance to compete and win over the customer, thus creating a de facto monopoly in a certain market. Only the anti-trust case must be looked at on a per-capita basis being locked into the company’s ecosystem instead of simply from the classical “lack of competitor in the market-place” basis.

Ex: if Nvidia sells to 80% of the datacenter customers, and let’s say Intel or AMD come out with superior products incompatible with Nvidia’s required secondary hardware/connectivity/etc., if the only reason Intel and AMD cannot win over any part of the Nvidia 80% is because customers state they had to buy expensive proprietary Nvidia secondaries to use their current Nvidia hardware and abandoning Nvidia would incur losing all the investment in proprietary Nvidia secondaries, then that is a “DeFacto Monopoly” on 80% of the market, not because Nvidia is more competitive, but because of an artificial “ball and chain” arresting customers. All while their competitors offer better products using open source secondaries.

Maybe I’ve had too much coffee, but when I read this article, this idea popped in my head and I rolled with it lol.
hotaru251

The Historical Fidelity said:
with superior products incompatible with Nvidia’s required secondary hardware/connectivity/etc
I mean they do go out of their way to do so.
https://github.com/vosen/ZLUDA was made to let CUDA run on non-Nvidia GPUs (some AMD GPUs, not all)... Nvidia updated their licensing agreement to make that against the terms of use.

That "could" potentially be used in a case if it ever got to that point.
thisisaname

Was the next generation of Ethernet out when they were designing it?
Alvar "Miles" Udell

Keller contends this could have saved Nvidia and users of its hardware a lot of money.

Except it wouldn't save users any money, it'd just increase nVidia's profit margin if what he says is true.
ezst036

hotaru251 said:
if that leaked memo a while ago was anything to go by Jensen entirely wants to be like Apple.

I agree. Though, to be fair, it is always the leaders who take the arrows. Nvidia and Apple respectively, are the unrefuted leaders in their domains.

The most hilarious thing I think though is how Apple for years refused to have any dealings whatsoever with Nvidia, only supporting AMD (ATi) video cards throughout the decade, beginning around 2010. Nvidia was entirely outmoded in Mojave.

Apple's schemes of vendor lock-in found a hard deadlock against Nvidia's schemes of vendor lock-in. hehehe
edzieba

thisisaname said:
Was the next generation of Ethernet out when they were designing it?
No. And back when NVLink was developed, Ethernet was still limited to 10 Gigabit (the 25 Gigabit alliance had not even formed yet) and lacked the latency and coherency requirements for inter-chip communication.

The headline could effectively read "Jim Keller suggests Nvidia could reduce chip costs using time travel".
hotaru251

ezst036 said:
Nvidia and Apple respectively, are the unrefuted leaders in their domains.
Apple isn't.
Apple's a brand more than unmatched products.

ezst036 said:
Apple for years refused to have any dealings whatsoever with Nvidia
not really. that bad blood was due to business co-op blame game and made total sense.
AmazingGoose

Blackwell NVLink supports 1.8TBps. Is there an Ethernet equivalent to 14.4Tbps? Even with some form of bonded or multichannel link, I'd be surprised if such an Ethernet solution exists.
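The unit conversion in the comment above can be checked with a short sketch. It takes the 1.8 TB/s figure as quoted by the commenter, assumes decimal units (1 TB/s = 1,000 GB/s, 1 byte = 8 bits), and uses hypothetical function names. The link counts are raw line-rate arithmetic only; they ignore protocol overhead and latency, and say nothing about whether bonding that many Ethernet links is practical.

```python
# Sanity check of the bandwidth figures quoted in the comment above.
# Assumptions: decimal units, and the 1.8 TB/s NVLink figure as quoted.
# Integer Gb/s values are used throughout to avoid floating-point noise.

NVLINK_GBYTES_PER_S = 1800  # 1.8 TB/s, as quoted in the comment


def gbytes_to_gbits(gbytes_per_s: int) -> int:
    """Convert gigabytes per second to gigabits per second (8 bits per byte)."""
    return gbytes_per_s * 8


def links_needed(target_gbits_per_s: int, link_gbits_per_s: int) -> int:
    """Smallest number of links whose combined raw rate meets the target."""
    # Ceiling division on integers.
    return -(-target_gbits_per_s // link_gbits_per_s)


nvlink_gbits = gbytes_to_gbits(NVLINK_GBYTES_PER_S)
print(f"NVLink: {nvlink_gbits / 1000} Tb/s")

# Raw link counts for the Ethernet speeds named in the article.
for speed in (400, 800):
    print(f"{speed} GbE links needed: {links_needed(nvlink_gbits, speed)}")
```

On raw line rate alone, matching the quoted figure would take 18 links at 800 GbE or 36 at 400 GbE.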
