Nvidia removes Rubin CPX accelerators from its roadmap: Groq 3 LPUs take center stage

Nvidia
(Image credit: Nvidia)

One thing that caught our attention during Jensen Huang's keynote at GTC 2026 on Monday was the lack of any mention of the Rubin CPX context phase accelerator, which the company promoted last year as an important part of the Vera Rubin platform. The Rubin CPX was also absent from the slides shown during the keynote. Those slides do mention Nvidia's upcoming Groq 3 LPU processors and LPX racks, which may indicate that these processors are replacing the CPX in Nvidia's roadmap.

Nvidia's Rubin CPX GPU was meant to be part of the company's Vera Rubin and Vera Rubin Ultra platforms. It was designed to accelerate the initial, compute-intensive context phase of a query, which processes the input to generate the first output token. The main advantage of the context phase accelerator was its reliance on GDDR7 memory, which does not offer extreme bandwidth like HBM3E or HBM4 but consumes dramatically less power. That trade-off was said to greatly improve the competitiveness of Nvidia's Rubin platform for inference workloads.
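
To make the two inference phases concrete, here is a minimal Python sketch of how a context (prefill) pass differs from the token-by-token decode loop that follows it. Everything in it is hypothetical: toy_next_token is a stand-in for a real model forward pass, and the list called cache only plays the role of a KV cache. It shows the control flow, not how any Nvidia or Groq part actually works.

```python
from typing import List, Tuple


def toy_next_token(cache: List[int]) -> int:
    """Hypothetical stand-in for a model forward pass (not a real LLM)."""
    return (sum(cache) + len(cache)) % 100


def prefill(prompt: List[int]) -> Tuple[List[int], int]:
    """Context phase: consume the whole prompt in one pass and
    produce the first output token plus the cache for later steps."""
    cache = list(prompt)  # plays the role of the KV cache built from the input
    first_token = toy_next_token(cache)
    return cache, first_token


def decode(cache: List[int], first_token: int, n_new: int) -> List[int]:
    """Decode phase: emit one token per step, each step reading the
    entire cache again before appending to it."""
    out = [first_token]
    cache = cache + [first_token]  # copy, so the caller's cache is untouched
    for _ in range(n_new - 1):
        tok = toy_next_token(cache)
        out.append(tok)
        cache.append(tok)
    return out


cache, first = prefill([3, 1, 4, 1, 5])
tokens = decode(cache, first, n_new=4)
print(first, tokens)
```

The prefill step touches every input token once, in bulk, while each decode step produces a single token and rereads the accumulated state, which is why the two phases can favor different hardware.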

Nvidia's Groq 3 low-latency inference accelerators, which Nvidia calls LPUs, are designed to offer significant inference performance with extremely low latency, as they rely mainly on internal SRAM, which is faster, lower latency, and lower power than any type of DRAM. For example, Nvidia's LP30 processor comes with 512 MB of SRAM and offers 1.23 FP8 PFLOPS of performance, which works out to 9.6 PFLOPS per Groq 3 LPX compute tray or 315 FP8 PFLOPS per rack. By contrast, the Rubin CPX accelerator was to deliver up to 30 NVFP4 PFLOPS of compute throughput, but with considerably higher latency.
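
The per-chip, per-tray, and per-rack figures quoted above imply how many LPUs sit in a tray and in a rack. The Python sketch below works that out. The chip counts and the rack-level SRAM total are inferences from the quoted throughput numbers, not figures stated in this article, and because the quoted numbers are rounded, the divisions are only approximate.

```python
# FP8 PFLOPS figures as quoted in the article.
PFLOPS_PER_LPU = 1.23
PFLOPS_PER_TRAY = 9.6
PFLOPS_PER_RACK = 315.0
SRAM_MB_PER_LPU = 512


def implied_count(total: float, per_unit: float) -> int:
    """Nearest whole number of units that would add up to `total`."""
    return round(total / per_unit)


# 9.6 / 1.23 is about 7.8, so the nearest whole count is 8 (inferred).
lpus_per_tray = implied_count(PFLOPS_PER_TRAY, PFLOPS_PER_LPU)

# 315 / 1.23 is about 256.1, so the nearest whole count is 256 (inferred).
lpus_per_rack = implied_count(PFLOPS_PER_RACK, PFLOPS_PER_LPU)

# Trays per rack follows from the two inferred counts above.
trays_per_rack = lpus_per_rack // lpus_per_tray

# Aggregate on-chip SRAM per rack under the same inferred count.
sram_gb_per_rack = lpus_per_rack * SRAM_MB_PER_LPU / 1024

print(lpus_per_tray, lpus_per_rack, trays_per_rack, sram_gb_per_rack)
```

Note that 8 x 1.23 is 9.84 rather than 9.6, so the tray figure was presumably computed from a per-chip value rounded differently from the one quoted.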

For now, it remains to be seen whether Nvidia will actually offer its Rubin CPX accelerators or will refocus its efforts on Groq 3 LPU low-latency inference accelerators. Given Nvidia's recent $20 billion deal for a non-exclusive license to startup Groq's chip technology, along with its talent, the move would make sense. The absence of Rubin CPX from the roadmap slides and Nvidia's public emphasis on LPU processors are rather clear indicators of the company's priorities. Nonetheless, it is possible that some of Nvidia's customers will deploy CPX accelerators, as they have already invested in them by tweaking their software for these processors. After all, off-roadmap parts are pretty common in the industry.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

  • abufrejoval
    Yes, that was quite conspicuous: they were perhaps a little late to recognize the urgency of lowering the inference energy cost and that it would require a complete redesign, instead of just modifying their GPU-based offerings.

    I'm quite sure there was a lot of Wattage bashing by the other hyperscalers on CPX and they realized that it might offer more attack surface than addressable market: Vera will take whatever RAM is already LP-DDR, the rest gets stacked into HBM.
    Reply
  • Pierce2623
    It’s weird though, because the Groq chip accelerates a totally different phase of inference. It’s designed to accelerate decode, whereas CPX was designed to accelerate context.
    Reply
  • bit_user
    abufrejoval said:
    Vera will take whatever RAM is already LP-DDR, the rest gets stacked into HBM.
    First of all, that's not remotely possible. Each modern form of DRAM is very specialized, at the silicon level, to a particular set of applications: LPDDR, DDR, GDDR, and HBM. Pretty much the only thing they have in common is the DRAM cells and their organization into rows, columns, banks, and ranks. Everything else about the dies is highly specialized and you can't just take GDDR7 dies and throw them in a HBM4e stack.

    Secondly, the article didn't mention it, but Groq's LPUs use DDR5. In fact, they support up to 12 TB of it, according to this:
    https://www.nvidia.com/en-us/data-center/lpx/
    So, if the article is right about Nvidia switching from Rubin CPX to Groq LPUs, it will likely mean less demand on GDDR7 (yay!), but more demand for DDR5 (boo!). Fortunately, DDR5 is more common. I'm not sure if all of the big 3 DRAM makers are yet producing GDDR7, but it's a lower-volume part and should have less elasticity. Perhaps this could even free up enough supply for Nvidia to do a RTX 5000 Super refresh, after all?
    Reply
  • abufrejoval
    bit_user said:
    First of all, that's not remotely possible. Each modern form of DRAM is very specialized, at the silicon level, to a particular set of applications: LPDDR, DDR, GDDR, and HBM. Pretty much the only thing they have in common is the DRAM cells and their organization into rows, columns, banks, and ranks. Everything else about the dies is highly specialized and you can't just take GDDR7 dies and throw them in a HBM4e stack.
    I corrected the post to make you happy.

    Yes, chips already produced and packaged can't just morph into something else.

    Just how much commonality exists in raw DRAM dies these days to turn them into one thing or another when packaging I don't know, but would love to learn. I was quite surprised to learn that AMD has provisioned all their Zen CCDs with vias for V-Cache, while only a part of their output actually uses them. So now I can't exclude that current GDDR dies might contain provisions for HBM vias, where packaging determines their final use.

    Most likely it's a classic case of modularity costing extra but also reducing the risks.

    However, rededicating DRAM production lines from one type to another seems much less of an effort than building capacity and has reportedly been going on. I also don't know if fab times in DRAM are very different from logic, multi-patterning might be used there as well, so even if retooling was zero, it might also be months from wafer starts to finished die... which increases the motives for modularity...
    bit_user said:
    Secondly, the article didn't mention it, but Groq's LPUs use DDR5. In fact, they support up to 12 TB of it, according to this:
    https://www.nvidia.com/en-us/data-center/lpx/
    So, if the article is right about Nvidia switching from Rubin CPX to Groq LPUs, it will likely mean less demand on GDDR7 (yay!), but more demand for DDR5 (boo!). Fortunately, DDR5 is more common. I'm not sure if all of the big 3 DRAM makers are yet producing GDDR7, but it's a lower-volume part and should have less elasticity. Perhaps this could even free up enough supply for Nvidia to do a RTX 5000 Super refresh, after all?
    Nvidia would still rather like to sell professional GPUs and are sure to milk the most out of every market.

    For gaming I'm not sure I see the need. For AI I'm no longer curious enough to suffer the expense. And the supposedly more interesting models have moved beyond what most anyone can afford locally, so it's cloud or nothing, by design.
    Reply
  • bit_user
    abufrejoval said:
    Yes, chips already produced and packaged can't just morph into something else.
    No, it's not just a matter of packaging. Like I said, the differences are down to the silicon level. If you want HBM, that's a different die that needs to be made with different masks. HBM is particularly special, since it needs to support TSVs for stacking, but all of the DRAM types are specialized to the point of having their own masks and fundamentally different die layouts.

    abufrejoval said:
    Just how much commonality exists in raw DRAM dies these days to turn them into one thing or another when packaging I don't know, but would love to learn.
    Well, the info is out there, if you look for it.

    abufrejoval said:
    I was quite surprised to learn that AMD has provisioned all their Zen CCDs with vias for V-Cache, while only a part of their output actually uses them.
    Zen 3, yes. Zen 2 also had them. In the Zen 4 CCDs, the ones with the full-sized cores had them, but the ones with Zen 4C did not. I assume that's also true of Zen 5 vs. 5C, but I haven't heard it confirmed.

    abufrejoval said:
    So now I can't exclude that current GDDR dies might contain provisions for HBM vias, where packaging determines their final use.
    No, that's silly. You're comparing apples and dogs, now.

    abufrejoval said:
    However, rededicating DRAM production lines from one type to another seems much less of an effort
    I'd expect so, although HBM requires advanced packaging that the others don't. As far as I'm aware, it's the only one that uses TSVs.

    I've read about Micron building fabs for HBM, but I think the main thing that sets them apart from other fabs is the ability to do die-stacking on-site. Otherwise, I'm not sure it'd need to be called out as a separate class of DRAM fab.

    Maybe @thestryker knows more.

    abufrejoval said:
    Nvidia would still rather like to sell professional GPUs and are sure to milk the most out of every market.

    For gaming I'm not sure I see the need.
    The need is coming from the fact that DRAM hasn't been scaling well and this has become apparent in graphics memory capacities. The use of 16 Gigabit GDDR7 dies in their RTX 5060 and above (the RTX 5050 uses GDDR6, which only has 16 Gb dies) has particularly impacted GPUs with narrower datapaths, like the lower end cards, since the DRAM shortage has meant they're de-prioritizing models with 2 chips per channel, like in the case of the 16 GB RTX 5060 Ti. So, switching to 24 Gb dies would let them make a 12 GB version of that card and an 18 GB version of the RTX 5070. That would be great for gamers and help them compete better with AMD's RX 9060 and RX 9070.
    https://www.tomshardware.com/pc-components/gpus/gigabyte-ceo-explains-nvidias-potential-gpu-supply-strategy-amid-crushing-memory-shortages-gross-revenue-per-gigabyte-of-gddr7-memory-could-decide-what-products-thrive
    Reply
  • thestryker
    bit_user said:
    Secondly, the article didn't mention it, but Groq's LPUs use DDR5. In fact, they support up to 12 TB of it, according to this
    The LPU doesn't use any external memory by itself near as I can tell. The LPX racks do use a host CPU with them so potentially that's where the DDR5 is coming into play.
    bit_user said:
    I'm not sure if all of the big 3 DRAM makers are yet producing GDDR7,
    All three are manufacturing GDDR7. I want to say Micron was the one who was late to the party, but they may have been converting GDDR6X capacity over. I've always figured that exclusivity was a double-edged sword should they become capacity constrained.
    bit_user said:
    I've read about Micron building fabs for HBM, but I think the main thing that sets them apart from other fabs is the ability to do die-stacking on-site.
    HBM is the only one which requires advanced packaging, but I don't believe anyone has on-site facilities yet. It is on Micron's roadmap, but from what I've read that sounds like something which will happen after the fabs are finished.
    Reply
  • thestryker
    abufrejoval said:
    So now I can't exclude that current GDDR dies might contain provisions for HBM vias, where packaging determines their final use.

    Most likely it's a classic case of modularity costing extra but also reducing the risks.

    However, rededicating DRAM production lines from one type to another seems much less of an effort than building capacity and has reportedly been going on.
    No DRAM technology is interchangeable which is part of the reason we have so many capacity problems. HBM eating up a disproportionately high amount of wafers and taking longer to make is another big one. HBM also tends to have lower yield rates and is the only one which requires TSVs. They do use the same specialized process nodes which allows for the fabs to be converted in a cost effective manner, but it's still months of downtime from what I understand.

    AMD putting TSVs in all the CCDs comes down to flexibility. It's really unprecedented control in the processor market to have such an important building block that can be allocated to client or enterprise as demand shifts. This makes wafer allocation extremely simple for them and likely explains why the mobile parts have historically been so far behind release schedule wise.
    abufrejoval said:
    For gaming I'm not sure I see the need.
    VRAM capacity is a very real problem in gaming cards. This has been driven by the usage of cache to make up for lower memory bus width. Entry level as it had existed for every video card generation until Ampere/RDNA2 is dead, but what counts as entry level now still only has 8GB VRAM. The 5060/9060 XT 8GB are faster than the GPU in the PS5/Series X, but can run into situations where the visual fidelity is lower simply because they don't have as much VRAM.

    If Nvidia changed nothing about the Blackwell GDDR7 lineup other than moving to 3GB ICs, it would change everything. 5060 with 12GB seems like the right amount of memory for that tier of card and even the 5060 Ti could have just been a single 12GB SKU. Meanwhile the 5070 moves to 18GB and the 5070 Ti/5080 go up to 24GB. Based on what we've seen happening and the future technologies that have been touted, these quantities should be enough for their common lifetime.

    So while Nvidia would for sure rather sell enterprise parts, if the CPX is indeed dead they may end up with enough 3GB GDDR7 to still launch a Super refresh. This would allow them to extend the lifetime of Blackwell and push back client usage of a more advanced process node.
    Reply
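
The capacity figures discussed in the comments above (8, 12, 16, 18, and 24 GB) follow from simple arithmetic on bus width and die density. The Python sketch below reproduces them. The bus widths are assumptions taken from public spec sheets rather than from this thread, the 32-bit-per-chip interface is the standard GDDR7 device width, and vram_gb is a hypothetical helper name.

```python
CHIP_INTERFACE_BITS = 32  # one GDDR7 package per 32 bits of memory bus

# Memory bus widths in bits. Assumed from public spec sheets,
# not stated in the article or the comments.
BUS_WIDTH_BITS = {
    "RTX 5060": 128,
    "RTX 5060 Ti": 128,
    "RTX 5070": 192,
    "RTX 5070 Ti": 256,
    "RTX 5080": 256,
}


def vram_gb(bus_bits: int, die_gbit: int, chips_per_channel: int = 1) -> int:
    """Total VRAM in GB for a given bus width and die density.

    chips_per_channel=2 models the doubled-up layout that the
    comments call "2 chips per channel".
    """
    channels = bus_bits // CHIP_INTERFACE_BITS
    return channels * chips_per_channel * die_gbit // 8


for card, bits in BUS_WIDTH_BITS.items():
    today = vram_gb(bits, die_gbit=16)      # 2 GB (16 Gb) dies
    with_3gb = vram_gb(bits, die_gbit=24)   # 3 GB (24 Gb) dies
    print(f"{card}: {today} GB with 16 Gb dies, {with_3gb} GB with 24 Gb dies")
```

Under these assumptions, a 128-bit card goes from 8 GB to 12 GB with 24 Gb dies without needing two chips per channel, which is the point both commenters make about the RTX 5060 Ti.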
  • Pierce2623
    abufrejoval said:
    I corrected the post to make you happy.

    Yes, chips already produced and packaged can't just morph into something else.

    Just how much commonality exists in raw DRAM dies these days to turn them into one thing or another when packaging I don't know, but would love to learn. I was quite surprised to learn that AMD has provisioned all their Zen CCDs with vias for V-Cache, while only a part of their output actually uses them. So now I can't exclude that current GDDR dies might contain provisions for HBM vias, where packaging determines their final use.

    Most likely it's a classic case of modularity costing extra but also reducing the risks.

    However, rededicating DRAM production lines from one type to another seems much less of an effort than building capacity and has reportedly been going on. I also don't know if fab times in DRAM are very different from logic, multi-patterning might be used there as well, so even if retooling was zero, it might also be months from wafer starts to finished die... which increases the motives for modularity...

    Nvidia would still rather like to sell professional GPUs and are sure to milk the most out of every market.

    For gaming I'm not sure I see the need. For AI I'm no longer curious enough to suffer the expense. And the supposedly more interesting models have moved beyond what most anyone can afford locally, so it's cloud or nothing, by design.
    Provision points for TSVs are almost costless to add. Packaging two chips onto the TSVs before placing them on the PCB is the expensive part.
    Reply
  • bit_user
    Pierce2623 said:
    Provision points for TSVs are almost costless to add. Packaging two chips onto the TSVs before placing them on the PCB is the expensive part.
    The way AMD maintained low latency for its stacked L3 cache was to build the tag RAM for the X3D die into the base die. So, in the case of those, it did add some nonzero cost to build-in the provisions for X3D to the base die. Then, when they went to do the density-optimized C-core dies, they got rid of that.
    Reply
  • abufrejoval
    bit_user said:
    The way AMD maintained low latency for its stacked L3 cache was to build the tag RAM for the X3D die into the base die. So, in the case of those, it did add some nonzero cost to build-in the provisions for X3D to the base die. Then, when they went to do the density-optimized C-core dies, they got rid of that.
    You mentioned that trick before, but I can't say it properly registered in my brain just what they did there until now: pretty smart move, is all I can say!

    Do you remember where they published that detail? Because I sure can't remember seeing that in a primary source.

    And I wonder if there is a way to generalize that approach one way or another for other scenarios.
    Reply