China announces CPU-only exascale supercomputer with 47,000 homemade processors, record 2 Exaflops of performance without GPUs — Lingshen super said to use Huawei Kunpeng servers and no foreign-made components

China announces Lingsheng supercomputer project — (Image credit: National Supercomputing Center, Shenzen)

China's National Supercomputing Center in Shenzhen announced the Lingshen supercomputer project at a conference on April 24, targeting sustained performance above 2 ExaFLOPS using only CPUs and no foreign-made components.

The system would pack 47,000 processors into 92 compute cabinets, making it the first exascale machine designed to reach that performance tier without GPU accelerators. Lu Yutong, director of the Shenzhen supercomputing center and the system's chief designer, presented the technical details at the event.

Every other exascale system in operation relies heavily on GPUs or accelerator hardware. The U.S. Department of Energy's El Capitan, currently the world's fastest supercomputer, runs on 44,544 AMD MI300A APUs that tightly couple CPU and GPU silicon on a single package. Lingsheng’s CPU-only architecture would represent a fundamentally different approach to reaching exascale throughput.

Latest Videos From

The project is split into a pilot verification phase and a full production system, with the pilot using 100 Huawei Kunpeng servers built on Arm-based Taishan cores, totaling 12,800 cores. The production system scales to 1,580 blade servers using x86 CPUs with 101,120 cores and a theoretical peak above 10 petaflops. 16 four-way servers could add another 2,048 cores, and four eight-way servers could contribute a further 1,280, based on a machine translation of Chinese-language press materials.

The full system would include 36 network cabinets supporting a million-port interconnect, 650PB of planned storage across 428 nodes, and 67 liquid-cooled storage cabinets with 10 TB/s of bandwidth.

Lingsheng's claimed sustained performance of 2+ ExaFLOPS would, if achieved, exceed El Capitan's measured Linpack score of 1.809 ExaFLOPS. While El Capitan's theoretical peak is 2.79 ExaFLOPS, real-world Linpack results are always lower. Obviously, no Linpack or equivalent benchmark data exists for Lingsheng because the system hasn’t been built yet.

China’s claims are at best dubious. On a literal reading of the announcement, China says it might be able to achieve 2+ ExaFLOPS at some point in the future. But El Capitan is already theoretically capable of 2.79 ExaFLOPS, so it’s difficult to see how China’s project is ever going to "cast a new benchmark for global supercomputing," when it’s unlikely to be running even five years from now.

Then China says there’ll be no reliance on outside vendors for Lingsheng while also claiming that the production system will use x86 CPUs. China’s domestic x86 options are limited to Zhaoxin, a joint venture between VIA Technologies and the Shanghai municipal government, and Hygon, which originally licensed AMD's Zen architecture but lost access to updated designs following U.S. export restrictions.

Neither company has demonstrated processors competitive with current-gen parts from Intel or AMD, and the center and its partners didn’t name any suppliers or operational timeline for the completed system.

Follow Tom's Hardware on Google News, or add us as a preferred source, to get our latest news, analysis, & reviews in your feeds.

TOPICS

Luke James is a freelance writer and journalist. Although his background is in legal, he has a personal interest in all things tech, especially hardware and microelectronics, and anything regulatory.

48 Comments Comment from the forums

blitzkrieg316

I bet... TH is very pro china
Reply
Phyzzi

blitzkrieg316 said:
I bet... TH is very pro china
Not in the actual article here they aren't. They are pretty explicit in the end that the claim is both unlikely as it sits and not that great given the project timeline.

That said, there's nothing worse about Chinese ambition than there is about, say, USA's ambition. China does have both material and technological resources and even if I agree with the author's doubts on the likelihood of the project coming to fruition exactly as specified, it's not like the SLS was reflective of the final rocket on the first pass either, and there are ample reasons for China to want super compute access that isn't being directed or watched by outsiders so this project is likely to find continuing support as it rolls out.

Reading a little more deeply, there's something to be learned about the goals for this project if it's intending to use CPU cores instead of GPU cores. China is clearly interested in calculations with more digits of accuracy but without waiting for results, which probably means either space program calculations or possibly some other detailed simulation (protein folding or other nanoscale simulation that has to account for quantum effects), and they want this to be information that isn't shared, so probably at least somewhat military technology. Since the big push is compute without GPU acceleration and on internally designed and manufactured components, that means they aren't looking at AI applications and are almost certainly wary of backdoors and hardware espionage loopholes. However, since they also are suggesting using x86 architecture (which they certainly know involves some pain in switching fabs producing chips for mobile architecture), it seems likely that there is existing software that they want to run on this machine that won't do well even on 64 bit mobile processors. Finally the fact that China is announcing their *intent* to make this clearly military-use system rather than waiting for it to be complete means they are responding to something specific: my bet is the smooth completion of the Artemis II mission. This would fit with some other recent posturing China has done regarding their space program.

So yeah, the title is overly "gentle" on the Chinese propaganda machine, but the article has some useful information in it and there is even more if you stop to think about it.
Reply
bit_user

I would just point out that Fujitsu made the Fugaku supercomputer, which was the fastest supercomputer in the world from 2020 to 2022, without using GPUs. They also used their own self-designed ARM cores, implementing SVE at 512-bits, for a total of 0.44 Exaflops (i.e. using the official Top 500 fp64 RMax). It was also quite efficient.

So, it can definitely be done.
Reply
Phyzzi

bit_user said:
I would just point out that Fujitsu made the Fugaku supercomputer, which was the fastest supercomputer in the world from 2020 to 2022, without using GPUs. They also used their own self-designed ARM cores, implementing SVE at 512-bits, for a total of 0.44 Exaflops (i.e. using the official Top 500 fp64 RMax). It was also quite efficient.

So, it can definitely be done.
I don't think there's so much a technical question as to whether this is, broadly speaking, possible. For me the questions are 1) why announce now before seemingly securing the project (they don't appear to have processors currently in the pipeline or necessarily even designed) and 2) why specifically plan to use x86 (which has some known issues and inefficiency compared to ARM for legacy compatibility reasons). Maybe those issues will be sidestepped somehow, or maybe there are already more things in place than seem to be, but with China especially I find state announcements of new and planned technology to be worth looking at through the lense of "Why this and why now?". While I imagine that a real project will move forward, I also have no doubts that the announcement is propaganda and would certainly hesitate to assume it will achieve all the stated goals or specs.
Reply
qxp

So if they actually did that this is pretty smart. The idea of using CPU/GPU combo comes with a handycap - you need to shuffle data between CPU and GPU for adaptive algorithms. What you want instead is a capable CPU with a vector unit, similar to Xeon Phi and modern AMD/Intel CPUs but with a lot more memory bandwidth. Strix Halo is a small step in the right direction.

Compared to conventional CPU/GPU combo the high-bandwidth CPU and vector unit can offer a possibility of large algorithmic improvements.
Reply
bit_user

Phyzzi said:
I don't think there's so much a technical question as to whether this is, broadly speaking, possible.
The article seems to cast doubt on it. That's the main reason I cited Fugaku.

Phyzzi said:
For me the questions are 1) why announce now before seemingly securing the project
If you look at every US-based supercomputer, there were announcements made long before even the chips for it were being manufactured. Such an announcement serves a very practical purpose of informing researchers who might want to run jobs on it. They need some guideposts, so they can start tuning, porting, and optimizing their code for it. There are other reasons you might want to announce it, but I won't venture into such a realm of speculation.

Phyzzi said:
2) why specifically plan to use x86 (which has some known issues and inefficiency compared to ARM for legacy compatibility reasons).
Maybe hedging their bets? Or, maybe just throwing some funding towards Zhaoxin, in order to support development of more competitive x86 cores.
Reply
bit_user

qxp said:
The idea of using CPU/GPU combo comes with a handycap - you need to shuffle data between CPU and GPU for adaptive algorithms.
The data needs to get shuffled between nodes, anyhow. Or else, why even have a supercomputer?

As far as data movement goes, a switched fabric like NVLink scales far better than using CPU-centric PCIe.

qxp said:
Compared to conventional CPU/GPU combo the high-bandwidth CPU and vector unit can offer a possibility of large algorithmic improvements.
It would have to be something well beyond the realm of standard Linpack, because GPUs have dominated that space for the past 1.5 decades.

Intel tried to pitch Xeon Phi in a similar way to how you're saying. It didn't work out very well, however.
Reply
qxp

bit_user said:
The data needs to get shuffled between nodes, anyhow. Or else, why even have a supercomputer?

As far as data movement goes, a switched fabric like NVLink scales far better than using CPU-centric PCIe.

I meant on much smaller scale, such as within tight loops.

bit_user said:

It would have to be something well beyond the realm of standard Linpack, because GPUs have dominated that space for the past 1.5 decades.

Intel tried to pitch Xeon Phi in a similar way to how you're saying. It didn't work out very well, however.

Actually Xeon Phi was great as far as hardware was concerned. Liked it a lot. At the time of release a Xeon Phi gave 1 TFlop of compute, while Nvidia's GeForce gave 2 Tflops.

But because Xeon Phi was really just 200+ thread Pentium with a vector unit one can use a smarter algorithm that gave speedup of roughly 10x. So even though I did not have 2 Tflops it was a win.

The reason Xeon Phi flopped I suspect is that Intel grossly overpriced it, essentially killing it. For example the 8GB Xeon Phi was around $2000, while the 16GB version was $5000. Surely that extra 8GB memory did not cost $2000+ ? And the 4GB Xeon Phi was mostly useless because for interesting algorithms you need some memory per thread.

And then there is a consideration is that at the time the cloud infrastructure companies were selling CPUs per virtual instance, and I bet Intel was afraid that someone would get a decent Xeon Phi and launch 200+ VMs on it, and then why buy Xeon server CPUs?

The right way to develop Xeon Phi would have been to stick an Ethernet controller on it so you can plug it directly into the switch and sell 16GB (or even 32GB) version for some reasonable amount of money, say $1600 (GeForce was like $300-500 at the time).

Then you essentially get a 1TFlop computer for $1600 and this would have outcompeted all the GPUs.

And if you look now, we have a Strix Halo product with decent memory bandwidth, decent RAM size, but can you easily find a box with even a 40Gbit interface? No.. And no notebooks with 15-16" screen either. And its like we released it its out but we really don't want you have it.

If you do Strix Halo right it would outcompete H200. and all the MIxx stuff.
Reply
SpicyLlama

Phyzzi said:
Not in the actual article here they aren't. They are pretty explicit in the end that the claim is both unlikely as it sits and not that great given the project timeline.

That said, there's nothing worse about Chinese ambition than there is about, say, USA's ambition. China does have both material and technological resources and even if I agree with the author's doubts on the likelihood of the project coming to fruition exactly as specified, it's not like the SLS was reflective of the final rocket on the first pass either, and there are ample reasons for China to want super compute access that isn't being directed or watched by outsiders so this project is likely to find continuing support as it rolls out.

Reading a little more deeply, there's something to be learned about the goals for this project if it's intending to use CPU cores instead of GPU cores. China is clearly interested in calculations with more digits of accuracy but without waiting for results, which probably means either space program calculations or possibly some other detailed simulation (protein folding or other nanoscale simulation that has to account for quantum effects), and they want this to be information that isn't shared, so probably at least somewhat military technology. Since the big push is compute without GPU acceleration and on internally designed and manufactured components, that means they aren't looking at AI applications and are almost certainly wary of backdoors and hardware espionage loopholes. However, since they also are suggesting using x86 architecture (which they certainly know involves some pain in switching fabs producing chips for mobile architecture), it seems likely that there is existing software that they want to run on this machine that won't do well even on 64 bit mobile processors. Finally the fact that China is announcing their *intent* to make this clearly military-use system rather than waiting for it to be complete means they are responding to something specific: my bet is the smooth completion of the Artemis II mission. This would fit with some other recent posturing China has done regarding their space program.

So yeah, the title is overly "gentle" on the Chinese propaganda machine, but the article has some useful information in it and there is even more if you stop to think about it.

IMO, the "why" is because global media has been running that story about their supercomputing center being hacked, and massive amounts of data exfiltrated (probably by Tailored Access Operations). Fits many of the same keywords you'd see in Google, now their botnets and propaganda accounts will start spamming all over western social media. Saving face is the primary concern in Chinese Communist Party culture, such an embarrassment and failure can not be permitted to continue in public discourse.

This also fits with similar US computer network operations that happened prior to Russia's invasion of Ukraine, where Tailored Access Operations was again key in obtaining secret information. Exposing and embarrassing adversary governments has become a new tactic utilized by the US Intelligence Community. China just happens to be a particularly juicy target due to the cultural norms.

And to pre-answer, yes, I believe the data is real.
Reply
zsydeepsky

bit_user said:
I would just point out that Fujitsu made the Fugaku supercomputer, which was the fastest supercomputer in the world from 2020 to 2022, without using GPUs. They also used their own self-designed ARM cores, implementing SVE at 512-bits, for a total of 0.44 Exaflops (i.e. using the official Top 500 fp64 RMax). It was also quite efficient.

So, it can definitely be done.
Furthermore, the fastest supercomputer during 2016 & 2017 was China's Sunway Taihu Light
which is also built with 40960 China's in-house CPUs only.

Personally, the most surprising aspect of this news is that China ANNOUNCES new supercomputers.

Since the trade war, China has stopped exposing its supercomputer plans, hiding from the Top500 ranklists, for fear that the US sanctions would interrupt those plans.

So the Chinese gov must be very confident that the system can be completely immune to any sanctions or embargoes.
Reply

Show more comments