Weak demand prevents Amazon from deploying AMD's Instinct AI accelerators in the cloud — the company plans to strengthen its portfolio with Nvidia Blackwell GPUs (Updated)

(Image credit: AMD)

Update 12/10/2024 4:20am PT: AMD has reached out with the following statement regarding the Business Insider report:

We have a great relationship with AWS and the report was not accurate - we are actively engaged with AWS and end customers on AI opportunities.
— AMD representative.

Original Article:

When AMD launched its Instinct MI300X accelerators for AI and HPC about a year ago, Amazon Web Services (AWS) expressed interest in deploying them in the cloud. However, according to an Amazon executive cited by Business Insider, the company still has not done so due to a lack of strong demand.

"We follow customer demand," Gadi Hutt, Director of Product and Customer Engineering at Annapurna Labs, an Amazon company, told Business Insider. "If customers have strong indications that those are needed, then there's no reason not to deploy."

According to Hutt, there simply has not been enough interest to justify deploying AMD's Instinct MI300X accelerators at AWS. While AMD's Instinct MI300X is cheaper than Nvidia's H100, its software stack is not as robust as Nvidia's CUDA, which scares off many developers. As AMD's hardware offerings improve (e.g., Instinct MI325X), so too should its software.

To some degree, Hutt may be considered an interested party, as Annapurna's in-house Trainium accelerators compete with chips from AMD and Nvidia in AWS's data centers. Still, given that he spoke on the record, his comments presumably reflect AWS's stance.

With its in-house-designed Trainium and Trainium2 chips, AWS does not have to pay a premium to AMD or Nvidia, which is why it can offer Trn1 and Trn2 instances at prices very competitive with instances powered by Nvidia's H100 GPUs. This may be another reason for the low interest in third-party alternatives to Nvidia.

As for Nvidia, AWS announced at its re:Invent conference that it would strengthen its AI offerings with Nvidia's upcoming Blackwell GPUs for AI and HPC, introducing P6 instances equipped with Blackwell GPUs in anticipation of strong demand for these machines.

Despite not offering AMD's Instinct MI300X in the cloud, AWS continues to collaborate closely with the company and offers plenty of instances based on AMD's EPYC processors. Given their core counts and memory subsystems, these processors offer massive advantages over rival Intel Xeons for compute- and memory-intensive instances.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • bit_user
    I wonder if AMD has tried giving away some of these systems to universities. That's one thing that helped CUDA gain strong early adoption.
    Reply
  • russell_john
As I have been saying for years, Nvidia's strength over their competitors is their software stack, which is wide and deep. They are willing to pay for the best software engineers and know how to keep them. They have a very liberal/progressive benefits program and are a shining example that DEI actually works when done respectfully and properly.
    Reply
  • russell_john
    bit_user said:
    I wonder if AMD has tried giving away some of these systems to universities. That's one thing that helped CUDA gain strong early adoption.

    CUDA was something no one else had, and unless AMD comes up with something that's actually revolutionary like CUDA was, they can't recreate CUDA's success. Their other problem is that CUDA has a 10-15 year head start, and Nvidia was on their 4th generation of Tensor Cores when AMD released their first generation "AI Cores". Even though the 7000 Series GPUs all have AI Cores, AMD still isn't utilizing them even though you paid for them; they just don't seem to have the software chops to make an AI-driven version of FSR like Nvidia's AI-driven DLSS.
    Reply
  • bit_user
    russell_john said:
    CUDA was something no one else had, and unless AMD comes up with something that's actually revolutionary like CUDA was, they can't recreate CUDA's success.
    My point wasn't to dig into deep history, but OpenCL came along soon enough after CUDA and before GPU Compute really took off. There was a window where the industry could've turned away from CUDA, but sadly the key players who could've made it happen (Google, Apple, and Microsoft) instead pursued their own compute APIs and Nvidia was successfully able to exploit the resulting fragmentation.

    Anyway, what AMD did, like 5 years ago, was to make a CUDA compatibility layer and toolchain, which greatly streamlines the process of porting CUDA code. So, AMD has basically done the best it can to nullify the API-level advantages of CUDA.

    They've also ported most of the popular deep learning frameworks to use their GPUs, so that most AI users (ideally) should see AMD's solution as a drop-in replacement. Now, it's obviously not going to be completely seamless, especially for more advanced use cases. That's why you want a group of motivated, capable, and resourceful users (like university students and post docs - hence my suggestion). I'm sure a lot of folks at university are currently starved for AI training time, so they're already highly motivated to use an alternate solution, if one were available to them.
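    As a rough illustration of that "drop-in" point (a minimal sketch, assuming a ROCm build of PyTorch and a supported AMD GPU, not an official AMD example): PyTorch's ROCm builds reuse the CUDA device API, so typical training code runs without AMD-specific changes.

    import torch

    # On a ROCm build of PyTorch, the CUDA API surface is reused, so this
    # returns True on a supported AMD GPU just as it does on an Nvidia one.
    print(torch.cuda.is_available())

    # torch.version.hip is set on ROCm builds (and is None on CUDA builds),
    # which is one way to tell which backend is actually in use.
    print(torch.version.hip)

    # The same "cuda" device string works; no AMD-specific code path is needed.
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x.t()
    print(y.shape, y.device)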

    russell_john said:
    Nvidia was on their 4th generation of Tensor Cores when AMD released their first generation "AI Cores". Even though the 7000 Series GPUs all have AI Cores, AMD still isn't utilizing them even though you paid for them
    I'm not sure what you're even talking about here. In CDNA (the MI100 launched 4 years ago), AMD introduced Matrix cores, which were actually quite a bit more general than Nvidia's Tensor cores of the day.
    https://www.tomshardware.com/news/amd-announces-the-instinct-mi100-gpu-cdna-breaks-10-tflops-barrier
    Conversely, even as of RDNA3, their client GPUs don't actually have tensor or matrix cores. WMMA (Wave Matrix Multiply-Accumulate) instructions rely on the existing vector compute machinery, rather than adding any dedicated, new matrix-multiply pipelines.
    https://www.tomshardware.com/news/amd-rdna-3-gpu-architecture-deep-dive-the-ryzen-moment-for-gpus
    XDNA is another thing, entirely. It didn't come from their GPU division and currently has nothing to do with either their client or server GPUs.

    russell_john said:
    they just don't seem to have the software chops to make an AI-driven version of FSR like Nvidia's AI-driven DLSS.
    This article really isn't about their client GPUs. So, I'm not even going to touch the subject of FSR, because that's even more irrelevant.
    Reply
  • hwertz
    The previous ZLUDA project was apparently dropped because of terms in CUDA's license saying one can't reverse engineer the binaries (given that one is not using CUDA when they are using ZLUDA, I'm really not sure Nvidia has jack to say about this, but apparently it made AMD nervous enough to pull the project). Now they have SCALE, which supposedly ALSO does this (why one would be OK and the other not? Who knows.)

    The problem with AMD's software ecosystem until pretty recently (forget the fancy aspects of Nvidia's CUDA software like debuggers and profilers, I don't need that stuff) was that you could be like 'I want to run TensorFlow and PyTorch', and you couldn't expect it to just work!
    Reply
  • bit_user
    hwertz said:
    The problem with AMD's software ecosystem until pretty recently (forget the fancy aspects of Nvidia's CUDA software like debuggers and profilers, I don't need that stuff) was that you could be like 'I want to run TensorFlow and PyTorch', and you couldn't expect it to just work!
    That's why they need to find motivated users who are willing to put up with a bit of pain and inconvenience.
    Reply