Chinese algorithm claimed to boost Nvidia GPU performance by up to 800X for advanced science applications
The breakthrough enables new solutions for complex mechanical challenges across multiple industries.
Researchers from Shenzhen MSU-BIT University, a collaboration between Lomonosov Moscow State University and the Beijing Institute of Technology, have reportedly developed a new computational algorithm that can significantly enhance the efficiency of peridynamics (PD), a non-local theory used to model fractures and material damage. The new method increases performance by up to 800 times, dramatically improving the speed of large-scale material simulations.
Peridynamics is widely used to predict material failure in aerospace, civil engineering, and military applications. However, traditional PD simulations require significant computational resources, making large-scale studies slow and impractical. Associate Professor Yang Yang and her team tackled this problem by leveraging Nvidia's CUDA technology to optimize algorithm design and memory management.
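In bond-based peridynamics, each particle's internal force is a sum of pairwise "bond" forces with every neighbor inside a fixed horizon, which maps naturally onto one GPU thread per particle. The sketch below is not the team's PD-General code; it is a minimal, hypothetical CUDA kernel illustrating the general pattern such GPU ports build on: a precomputed, flattened neighbor list and structure-of-arrays data so memory accesses stay coalesced.

// Hypothetical sketch of a 2D bond-based peridynamics force kernel (not the
// PD-General implementation). One thread per particle; neighbors are stored
// in a flat CSR-style list so global-memory reads are mostly coalesced.
__global__ void pd_bond_forces(const float *x, const float *y,        // reference coordinates (SoA)
                               const float *ux, const float *uy,      // displacements (SoA)
                               const int *nbr_start, const int *nbr,  // CSR neighbor list (n+1 offsets / bond indices)
                               float c,                               // micro-modulus constant
                               float *fx, float *fy, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float fxi = 0.0f, fyi = 0.0f;
    for (int k = nbr_start[i]; k < nbr_start[i + 1]; ++k) {
        int j = nbr[k];
        // Reference and deformed bond vectors.
        float dx  = x[j] - x[i],           dy  = y[j] - y[i];
        float ddx = dx + ux[j] - ux[i],    ddy = dy + uy[j] - uy[i];
        float len0 = sqrtf(dx * dx + dy * dy);
        float len  = sqrtf(ddx * ddx + ddy * ddy);
        float s    = (len - len0) / len0;  // bond stretch
        // Force density along the deformed bond direction.
        fxi += c * s * ddx / len;
        fyi += c * s * ddy / len;
    }
    fx[i] = fxi;
    fy[i] = fyi;
}

A real solver would add bond-breaking (damage) checks, volume corrections, and time integration on top of this; a launch such as pd_bond_forces<<<(n + 255) / 256, 256>>>(...) covers all particles each step.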
Their PD-General framework achieved up to 800x speed gains on an Nvidia RTX 4070 compared to traditional serial programs and 100x faster performance than OpenMP-based parallel programs. In large-scale simulations involving millions of particles, it completed 4,000 iterations in five minutes; for a large 2D uniaxial tension problem, it processed 69.85 million iterations in under two minutes using single precision.
The enhanced computational efficiency means researchers can now conduct simulations on consumer-grade GPUs instead of relying on costly, high-performance computing clusters. This has broad implications for industries that require detailed material analysis, including:
- Aerospace and Defense: Improved modeling of material stress and failure in aircraft structures.
- Engineering and Manufacturing: More efficient testing of materials for construction and industrial applications.
- Military Research: Faster development of impact-resistant materials for defense systems.
The ability to achieve high-performance simulations on widely available GPUs also reduces reliance on restricted foreign technology. Given ongoing trade restrictions and sanctions, this breakthrough could allow China and Russia to advance research without depending on high-end computing hardware from Western countries.
This development also marks a significant step in computational mechanics, enabling faster and more accessible simulations for material science, engineering, and defense applications. The study was notably published in the Chinese Journal of Computational Mechanics on January 8, 2025, and the research team believes that this optimization could extend beyond peridynamics, improving GPU performance for other scientific computations.
ezst036
It makes a lot of sense that China is investing heavily on the software side in order to do more with less.
All of these U.S. government restrictions kind of miss the point when the only restriction is on a top-end model, but then here come the software optimizations.
JRStern
Sounds like good work!
Not shocking, this has happened with Fourier transforms and even matrix solvers, year after year.
That sort of thing is probably a very minor part of NVDA sales.
derekullo
rluker5 said: Would it have killed the author to use the word "some" in the title?
Depending on where he lives ... maybe!
bit_user
I just have to roll my eyes at this. Any time you see the popular press report on something from a scientific journal, it should be viewed through a lens of skepticism. It's probably well outside the expertise of the article's author, who might not even be very accustomed to this type of research and lacks the context needed to interpret its results. In virtually every case of this I've seen on Toms, they're actually basing their article on an article in another publication, which adds yet another unknown into the mix.
The article said: Their PD-General framework achieved up to 800x speed gains on an Nvidia RTX 4070 compared to traditional serial programs
Okay, but who's using serial, anyhow? Especially when they talk about the GPU replacing "costly, high-performance computing clusters." This is the same kind of marketing BS we get from Nvidia, where they like to trumpet how much faster a CUDA version of something is than a slow, lame old implementation that no one serious would actually run.
The article said: ... and 100x faster performance than OpenMP-based parallel programs.
Again, this raises lots of questions, like: which backend - CPU or GPU? If CPU, what kind?
In general, OpenMP isn't usually very good. It's what you use when you've got a bunch of legacy code and you just want a quick, easy, low-risk way of speeding it up on a multi-core or GPU-accelerated machine. So, 100x probably isn't very surprising, here.
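(For illustration: the kind of OpenMP port described here usually just parallelizes the existing loop across CPU cores, while a CUDA port maps the same loop onto thousands of GPU threads. A minimal, generic sketch, not taken from the paper:)

#include <omp.h>

// Legacy-style OpenMP port: the serial loop, spread across host CPU cores.
// (Host code; build with e.g. nvcc -Xcompiler -fopenmp)
void axpy_openmp(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// CUDA port of the same loop: one lightweight thread per element, operating
// on device arrays (d_x, d_y), instead of one heavy chunk per CPU core.
__global__ void axpy_cuda(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
// Launch: axpy_cuda<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);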
I don't know if this is a nothing burger, where someone just got the standard GPU compute-level speedups from porting their code to a GPU, or if there was any novel algorithmic breakthrough that enabled the CUDA version to be quite so much faster.
I will say it's interesting they opted to use a consumer GPU for this, because consumer cards have pretty lousy fp64 performance, and fp64 is usually what scientific and engineering code uses. So, one innovation might be careful management of numerical precision, in order to get away with fp32 arithmetic.
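(Speculating on what that precision management could mean, this is not taken from the paper: one common trick is to do the per-bond arithmetic in fp32 but protect the accumulation, either by summing into a double or by using compensated summation. Generic sketches:)

// (a) Accumulate fp32 per-bond contributions into a double. Only the adds
//     are fp64, which is cheap even on consumer GPUs.
__device__ float accumulate_fp64(const float *bond_force, int count)
{
    double sum = 0.0;
    for (int k = 0; k < count; ++k)
        sum += (double)bond_force[k];
    return (float)sum;
}

// (b) Stay entirely in fp32 but use compensated (Kahan) summation.
__device__ float accumulate_kahan(const float *bond_force, int count)
{
    float sum = 0.0f, comp = 0.0f;       // running sum + lost low-order bits
    for (int k = 0; k < count; ++k) {
        float y = bond_force[k] - comp;  // corrected next term
        float t = sum + y;               // new (rounded) sum
        comp = (t - sum) - y;            // recover what the add rounded off
        sum = t;
    }
    return sum;
}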
If anyone has a link to the paper, I'd be interested in having a glance at it. During a brief search for it, I found this paper from 2017, claiming a speedup of 12 to 100x relative to sequential code. The link is just to an abstract, so I don't know what kind of GPU they used (but probably a fair bit slower than an RTX 4070):
Accelerating Peridynamics Program Using GPU with CUDA and OpenACC
J.X. Li, J.M. Zhao, F. Xu, and Y.J. Liu
Institute for Computational Mechanics and Its Applications (NPUiCMA), Northwestern Polytechnical University, Xi'an, 710072, P. R. China.
Mechanical Engineering, University of Cincinnati, Cincinnati, Ohio, 45221-0072, USA.
https://www.sci-en-tech.com/ICCM2017/PDFs/2404-8302-1-PB.pdf
BTW, OpenACC is a relative/derivative of OpenMP.
Peksha
bit_user said: this is a nothing burger, where someone just got the standard GPU compute-level speedups from porting their code to a GPU
From the description, it looks like this.
The Historical Fidelity
bit_user said: I just have to roll my eyes at this. Any time you see the popular press report on something from a scientific journal, it should be viewed through a lens of skepticism. ... If anyone has a link to the paper, I'd be interested in having a glance at it.
Here, I found the 2025 article for you. I'm interested in your opinion on it.
https://doi.org/10.1016/j.enganabound.2025.106133
To me, when I see fantastical numbers like 800x, my first thought is that they are probably talking about a single step in a long chain of steps that they optimized by 800x, and that, taken in the context of the entire operation, it only minimally reduces total compute time. But I'll dig into it as well to see if I'm right or if this really is a breakthrough.
Edit: after reading the article, it seems this is simply a targeted optimization for the resource structures of an RTX 4070, and only an RTX 4070. That means that to get this kind of performance improvement in a way that competes with existing methods, the researchers would need to develop an optimization scheme for each individual GPU used in this market. It's cool, but software tool providers are not going to spend that kind of time building 50+ unique optimization schemes, each compatible with only one specific GPU, so comparing this to the GPU-agnostic software tools on the market is a bit like comparing an ASIC to a CPU's general-purpose core.
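(For reference, the "resource structures" being tuned against, such as SM count, shared memory per block, and a preferred block size, are values the CUDA runtime reports per device. A generic, hypothetical sketch of querying them, unrelated to the paper's code:)

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, only used as a target for the occupancy query below.
__global__ void dummy_kernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main()
{
    // Report the resources of whatever GPU is actually installed,
    // rather than assuming an RTX 4070.
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d SMs, %zu KiB shared memory per block\n",
           prop.name, prop.multiProcessorCount, prop.sharedMemPerBlock / 1024);

    // Ask the runtime for an occupancy-friendly block size on this device.
    int min_grid = 0, block = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, dummy_kernel, 0, 0);
    printf("suggested block size: %d\n", block);
    return 0;
}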