xAI Colossus supercomputer with 100K H100 GPUs comes online — Musk lays out plans to double GPU count to 200K with 50K H100 and 50K H200
Colossus took just over four months to build.
Elon Musk's xAI has brought what it calls the world's most powerful AI training system online. The Colossus supercomputer uses as many as 100,000 Nvidia H100 GPUs for training and is set to expand with another 50,000 H100 and 50,000 H200 GPUs in the coming months.
"This weekend, the xAI team brought our Colossus 100K H100 training cluster online," Elon Musk wrote in an X post. "From start to finish, it was done in 122 days. Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200K (50K H200s) in a few months."
According to Michael Dell, head of the high-tech giant that bears his name, Dell developed and assembled the Colossus system quickly. This suggests that the server maker has accumulated considerable experience deploying AI servers during the AI boom of the last few years.
Elon Musk and his companies have been busy making supercomputer-related announcements recently. In late August, Tesla announced its Cortex AI cluster featuring 50,000 Nvidia H100 GPUs and 20,000 of Tesla's Dojo AI wafer-sized chips. Even before that, in late July, xAI kicked off AI training on the Memphis Supercluster, comprising 100,000 liquid-cooled H100 GPUs. This supercomputer must draw at least 150 MW of power, as 100,000 H100 GPUs alone consume around 70 MW, before counting CPUs, networking, and cooling.
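The 70 MW figure can be reproduced with simple arithmetic. The sketch below is a rough estimate only: it assumes each H100 draws about 700 W (the rated TDP of the SXM module) and treats the 150 MW figure quoted above as the floor for the whole facility. The function and constant names are illustrative.

```python
# Back-of-the-envelope power estimate for a 100,000-GPU cluster.
# Assumption: ~700 W per H100, the rated TDP of the SXM module.
GPU_COUNT = 100_000
WATTS_PER_GPU = 700
FACILITY_FLOOR_MW = 150  # minimum total draw quoted in the article


def gpu_power_mw(gpu_count: int, watts_per_gpu: float) -> float:
    """Return the combined GPU power draw in megawatts."""
    return gpu_count * watts_per_gpu / 1_000_000


gpus_mw = gpu_power_mw(GPU_COUNT, WATTS_PER_GPU)
# Whatever is left covers everything that is not a GPU:
# CPUs, memory, networking, storage, cooling, and conversion losses.
overhead_mw = FACILITY_FLOOR_MW - gpus_mw

print(f"GPUs alone: {gpus_mw:.0f} MW")
print(f"Implied non-GPU overhead: at least {overhead_mw:.0f} MW")
```

Under these assumptions, the GPUs account for a little under half of the 150 MW floor.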
Although all of these clusters are formally operational and even training AI models, it is unclear how much of that capacity is actually online today. First, it takes time to debug and optimize the settings of such superclusters. Second, xAI needs to ensure that they get enough power: while Elon Musk's company has been using 14 diesel generators to power its Memphis supercomputer, they were still not enough to feed all 100,000 H100 GPUs.
xAI's training of the Grok version 2 large language model (LLM) required up to 20,000 Nvidia H100 GPUs, and Musk predicted that future versions, such as Grok 3, will need even more resources, potentially around 100,000 Nvidia H100 processors for training. To that end, xAI needs its vast data centers to train Grok 3 and then run inference on this model.
Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.
DerKeyser
All this from the man that warned us against AI 😂
I’m fully aware there is nothing intelligent about current “AI”, but it is still a good show of how two-faced and cynical these guys are. As long as there is money involved, nothing else matters. No mention of how the energy consumption and production of all this crap is killing our planet, the planet he supposedly wants to save with Tesla by the way…. No pollution and pointless resource usage here…
vanadiel007
AI is so bizarre to me. Even more bizarre than bitcoin or even HDD coin.
It's like why?
I don't see how this could not end in a big bubble burst at some point in the future.
Evildead_666
I can see him somehow connecting Colossus, Cortex and Memphis together.
Nothing better than to have a huge supercomputer spread over multiple sites for security and redundancy.
If he needs network bandwidth and security, he has Starlink.
FunSurfer
They need to use Colossus to design a fusion reactor in order to supply enough power to handle its growth rate.
MacZ24
Title: Elon Musk buys a lot of GPUs
Subtitle: Elon Musk tries to spit on Sam Altman's face and bankrupt him and his company if possible.
hellcinder99
All that power to ask if the earth is flat or if the lunar landing actually happened. Of course if it's Elon, that just means he's looking for new people to implant his seed. Cause we're running out of humans, am I right?
Amdlova
The good, the bad and the ugly...
The good... X users will have the most powerful tool in the world.
The bad... no one can say what will happen.
The ugly... the same guy who has Tesla will set up 14 diesel generators to power some of these machines...
Where are the Tesla solar-powered batteries???
mangaTom
DerKeyser said: All this from the man that warned us against AI 😂 I’m fully aware there is nothing intelligent about current “AI”, but it is still a good show of how two-faced and cynical these guys are. As long as there is money involved, nothing else matters. No mention of how the energy consumption and production of all this crap is killing our planet, the planet he supposedly wants to save with Tesla by the way…. No pollution and pointless resource usage here…
The earth isn't going anywhere. It has survived far worse events than us. It's way more resilient, considering the multiple extinction and cosmic events that it has experienced. Humans, on the other hand, though...
DerKeyser
The most fascinating thing about these money and power mongers is that they are VERY bright indeed, and usually understand math better than most of us. Yet the most simple curve is completely ignored in the pursuit of money.
This curve could be of such different things as:
- Power consumption and its cost/impact on earth's resources
- Cost of acquiring resources as they become more scarce/rare
- Specific species extinction events as humans overconsume and destroy all natural habitats.
- Global temperature as MASSIVE supercomputers guzzle peta- or zettawatt-hours to generate pointless cryptocoins, make a better chatbot or generate new funny cat images to no use.
- Human overpopulation as the rich and powerful continue to ignore and overexploit the less fortunate
Most of us understand how things end with exponential curves......
Yet it doesn't seem to register with these super humans. I guess they are not too bright after all....
jp7189
Evildead_666 said: I can see him somehow connecting Colossus, Cortex and Memphis together. Nothing better than to have a huge supercomputer spread over multiple sites for security and redundancy. If he needs network bandwidth and security, he has Starlink.
Colossus is in Memphis. I took that to be the same deployment with a new name, but this article makes it sound like a whole new thing... so now I'm confused.