Datacenter GPU service life can be surprisingly short: only one to three years, according to an unnamed Google architect

(Image credit: Nvidia)

Datacenter GPUs may last only one to three years, depending on their utilization rate, according to a high-ranking Alphabet specialist quoted by Tech Fund. Because GPUs do the heavy lifting for AI training and inference, they are under considerable load almost constantly and therefore degrade faster than other components.

The utilization rate of GPUs running AI workloads in datacenters operated by cloud service providers (CSPs) is between 60% and 70%. At such rates, a GPU will typically survive between one and two years, three years at the most, according to a remark attributed to a principal generative AI architect at Alphabet and reported by @techfund, a long-term tech investor with good sources.

We could not verify the identity of the person, who describes themselves as a 'GenAI principal architect at Alphabet,' so their claims cannot be taken at face value. Nonetheless, the claim has merit: modern datacenter GPUs for AI and HPC applications consume and dissipate 700W of power or more, which is considerable stress for a small piece of silicon.

There is a way to prolong the lifespan of a GPU, according to the speaker: reduce its utilization rate. However, that also means the hardware depreciates more slowly and takes longer to pay back its cost, which is not good for business. As a result, most cloud service providers prefer to run their GPUs at high utilization rates.
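
As a rough illustration of that trade-off, the back-of-the-envelope Python sketch below compares payback time at two utilization rates. The purchase price and hourly rental rate are purely hypothetical placeholders, not figures from the article or from any provider.

# Hypothetical sketch of the utilization trade-off described above.
# The price and hourly rate are made-up placeholders for illustration only.
GPU_PRICE_USD = 30_000        # assumed purchase price of one accelerator
RENTAL_USD_PER_HOUR = 2.50    # assumed revenue per billed GPU-hour
HOURS_PER_YEAR = 24 * 365

for utilization in (0.70, 0.40):
    revenue_per_year = RENTAL_USD_PER_HOUR * HOURS_PER_YEAR * utilization
    payback_years = GPU_PRICE_USD / revenue_per_year
    print(f"{utilization:.0%} utilization: ${revenue_per_year:,.0f}/year, "
          f"payback in {payback_years:.1f} years")
# 70% utilization: $15,330/year, payback in 2.0 years
# 40% utilization: $8,760/year, payback in 3.4 years

Running the hardware more gently may stretch its service life, but if payback slips toward or past that lifespan, the economics stop working, which is why providers keep utilization high.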

Earlier this year, Meta released a study describing the training of its Llama 3 405B model on a cluster of 16,384 Nvidia H100 80GB GPUs. The cluster's model FLOPs utilization (MFU) was about 38% (using BF16), and yet, out of 419 unforeseen disruptions during a 54-day pre-training snapshot, 148 (30.1%) were caused by various GPU failures (including NVLink failures) and 72 (17.2%) were caused by HBM3 memory failures.

Meta's results actually look quite favorable for the H100. If GPUs and their memory keep failing at Meta's rate, the annualized failure rate of these processors works out to around 9%, and the cumulative failure rate over three years to roughly 27%, though GPUs likely fail more often as they age beyond their first year in service.
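
Those figures follow from a simple extrapolation of Meta's numbers. The short Python sketch below reproduces the back-of-the-envelope math: it lumps the GPU and HBM3 failures together and scales the 54-day window up to a year, ignoring repairs, swaps, and the tendency of failure rates to climb as hardware ages.

# Annualizing the failure counts Meta reported for its 16,384-GPU
# Llama 3 405B pre-training cluster over a 54-day snapshot.
GPUS = 16_384
DAYS = 54
GPU_FAILURES = 148     # GPU faults, including NVLink failures
HBM3_FAILURES = 72     # HBM3 memory failures

failures = GPU_FAILURES + HBM3_FAILURES             # 220 failures in 54 days
annualized_rate = failures / GPUS * (365 / DAYS)    # fraction of fleet per year
three_year_rate = annualized_rate * 3               # naive linear extrapolation

print(f"Annualized failure rate: {annualized_rate:.1%}")   # ~9.1%
print(f"Naive three-year rate:   {three_year_rate:.1%}")   # ~27.2%

A compounding calculation, 1 - (1 - 0.091)^3, gives a slightly lower figure of roughly 25%, and either way the real-world numbers would likely worsen in the second and third years as wear accumulates.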

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • bit_user
    The article said:
    There is a way to prolong the lifespan of a GPU, according to the speaker: reduce its utilization rate.
    Precisely what is this "utilization rate"? Is it referring to the duty cycle or is it perhaps the product of the duty cycle and the clock speed? I assume probably the latter. If so, I think just telling someone "use it less" isn't a good answer. I think a modest clock speed reduction might be.

    IMO, what would be useful to see is a curve of clock speed vs. failure rate for: core clocks, memory clocks, and NVLink clocks. My guess is that Nvidia is driving these clocks as high as it thinks they can realistically go, but I'll bet it's well past the point where longevity suffers.

    I also can't help but wonder about temperatures and cooling.

    BTW, this is potentially of some interest to us mere mortals, as it turns out these "GPUs" can be purchased for very little money, once they're a few generations old. Check out prices on eBay for Nvidia P100s. If you just wanted something with a lot of fp64 horsepower and HBM, you can practically get them for a song.
    Reply
  • jp7189
    bit_user said:
    Precisely what is this "utilization rate"? Is it referring to the duty cycle or is it perhaps the product of the duty cycle and the clock speed? I assume probably the latter. If so, I think just telling someone "use it less" isn't a good answer. I think a modest clock speed reduction might be.

    IMO, what would be useful to see is a curve of clock speed vs. failure rate for: core clocks, memory clocks, and NVLink clocks. My guess is that Nvidia is driving these clocks as high as it thinks they can realistically go, but I'll bet it's well past the point where longevity suffers.

    I also can't help but wonder about temperatures and cooling.

    BTW, this is potentially of some interest to us mere mortals, as it turns out these "GPUs" can be purchased for very little money, once they're a few generations old. Check out prices on eBay for Nvidia P100s. If you just wanted something with a lot of fp64 horsepower and HBM, you can practically get them for a song.
    Clocks have zero to do with longevity. Voltage and heat kill chips. You could argue that more voltage is required to hit the signal high threshold faster in order to hit a higher clock rate, but that's looking at it backwards.
    Reply
  • gg83
    bit_user said:
    Precisely what is this "utilization rate"? Is it referring to the duty cycle or is it perhaps the product of the duty cycle and the clock speed? I assume probably the latter. If so, I think just telling someone "use it less" isn't a good answer. I think a modest clock speed reduction might be.

    IMO, what would be useful to see is a curve of clock speed vs. failure rate for: core clocks, memory clocks, and NVLink clocks. My guess is that Nvidia is driving these clocks as high as it thinks they can realistically go, but I'll bet it's well past the point where longevity suffers.

    I also can't help but wonder about temperatures and cooling.

    BTW, this is potentially of some interest to us mere mortals, as it turns out these "GPUs" can be purchased for very little money, once they're a few generations old. Check out prices on eBay for Nvidia P100s. If you just wanted something with a lot of fp64 horsepower and HBM, you can practically get them for a song.
    Is it any different than the Ethereum mining GPUs? Buying them used was always a gamble.
    Reply
  • jp7189
    Back when Intel was stuck at 14+++, part of the issue was insisting on cobalt doping because the pure copper planned for smaller features wasn't hitting their longevity targets, whereas TSMC had no such qualms. I have always wondered what that would mean in the real world. Has the usable life of chips been getting shorter with each subsequent shrink?
    Reply
  • bit_user
    jp7189 said:
    Clocks have zero to do with longevity. Voltage and heat kill chips. You could argue that more voltage is required to hit the signal high threshold faster in order to hit a higher clock rate, but that's looking at it backwards.
    I'd argue it's not looking at it backwards, because the voltage/frequency curve has already been established by Nvidia for operating these products in a safe and error-free manner. You can scale back the clockspeeds without invalidating that, but as soon as you start monkeying with things like undervolting, you're now operating it outside those safety margins. Coloring outside the lines might be okay for gamers, but not datacenter operators.

    So, with clockspeed being effectively the only variable they can directly control, that's exactly the terms in which they'd need to look at it.
    Reply
  • Eximo
    jp7189 said:
    Back when Intel was stuck at 14+++, part of the issue was insisting on cobalt doping because the pure copper planned for smaller features wasn't hitting their longevity targets, whereas TSMC had no such qualms. I have always wondered what that would mean in the real world. Has the usable life of chips been getting shorter with each subsequent shrink?

    The voltage has been slowly but surely dropping as the process node shrinks. This should counteract some of the potential degradation issues, though we can see with Intel what happens when you go out of bounds. Tighter and tighter restrictions on overclocking and boost are about the only way this is going to go, much like Nvidia really restricting power and voltage limits.
    Reply
  • RUSerious
    Hmm, cost of doing business. Lots of components in server ops have lifetimes much shorter than what we DIYers are used to. I remember replacing 60mm and 80mm 5k+ RPM fans in servers. Don't put a finger near those when testing. Also, don't forget hearing protection.
    Reply
  • jp7189
    bit_user said:
    I'd argue it's not looking at it backwards, because the voltage/frequency curve has already been established by Nvidia for operating these products in a safe and error-free manner. You can scale back the clockspeeds without invalidating that, but as soon as you start monkeying with things like undervolting, you're now operating it outside those safety margins. Coloring outside the lines might be okay for gamers, but not datacenter operators.

    So, with clockspeed being effectively the only variable they can directly control, that's exactly the terms in which they'd need to look at it.
    No, clockspeeds aren't the primary knob for DC GPUs. Power and therefore thermals are the primary knob which has an effect on the voltage and dictates clockspeeds. No one is setting clock speeds (in a datacenter) and hoping the other parameters fall in line. They set the other parameters and accept the clockspeed as a result.
    Reply
  • bit_user
    jp7189 said:
    No, clockspeeds aren't the primary knob for DC GPUs. Power and therefore thermals are the primary knob which has an effect on the voltage and dictates clockspeeds.
    Now you're the one who has it backwards, because power is even more loosely coupled to either clockspeed or voltage. If you set a power limit of half, but the workload has low shader utilization, the clocks (and therefore voltage) might still get ramped up, because the power management controller sees that it has enough headroom to clock up the non-idle units without going over the power limit. That will still cause accelerated wear on those blocks.

    If the issue you want to control is wearout, then the best solution definitely involves lowering the peak frequencies. I'll bet it would only take a modest clipping of peak clocks. To achieve the same effect by limiting power, you might have to reduce it much more. You might also choose to limit power, based on how much thermals are thought to be a factor.

    jp7189 said:
    No one is setting clock speeds (in a datacenter) and hoping the other parameters fall in line.
    What part of anything I said involves "hope"? I said measure how they correlate, in order to set limits that favor a longer service life.

    jp7189 said:
    They set the other parameters and accept the clockspeed as a result.
    I'm not talking about current practice, which I'd expect is mostly centered around balancing against short-term operational costs (i.e. things like energy costs and cooling capacity). I'm talking about what you would do, if you really wanted to increase the service life of these components.

    Hey, it's a heck of a lot better than reducing duty cycle! That essentially means letting your hardware idle, where it's wasting both space and still some power!
    Reply
  • jp7189
    bit_user said:
    Now you're the one who has it backwards, because power is even more loosely coupled to either clockspeed or voltage. If you set a power limit of half, but the workload has low shader utilization, the clocks (and therefore voltage) might still get ramped up, because the power management controller sees that it has enough headroom to clock up the non-idle units without going over the power limit. That will still cause accelerated wear on those blocks.

    If the issue you want to control is wearout, then the best solution definitely involves lowering the peak frequencies. I'll bet it would only take a modest clipping of peak clocks. To achieve the same effect by limiting power, you might have to reduce it much more. You might also choose to limit power, based on how much thermals are thought to be a factor.


    What part of anything I said involves "hope"? I said measure how they correlate, in order to set limits that favor a longer service life.


    I'm not talking about current practice, which I'd expect is mostly centered around balancing against short-term operational costs (i.e. things like energy costs and cooling capacity). I'm talking about what you would do, if you really wanted to increase the service life of these components.

    Hey, it's a heck of a lot better than reducing duty cycle! That essentially means letting your hardware idle, where it's wasting both space and still some power!
    Are you intentionally misunderstanding me just to argue? Clockspeed is the rate at which the line is read. Sampling that data faster or slower has nothing at all to do with longevity. You argue that setting the clock rate changes the voltage because you move down on the V/F curve, but that is backwards. DC integrators have power and thermal targets for the overall systems and specific cards, which affect voltage and that in turn sets clockspeed. Then you argue that you aren't talking about what they do but rather what they should do (change clockspeed), but I'll go back to my very first point and say there is no doubt that voltage and power are the primary tuning points for longevity. How fast the line is read (clockspeed) can't possibly have any effect at all on chip degradation.
    Reply
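
For readers who want to experiment with the two knobs debated in this thread, both are exposed through NVIDIA's management interface. The Python sketch below is a minimal illustration, assuming the nvidia-ml-py (pynvml) bindings, administrative privileges, and a datacenter-class GPU that supports locked clocks (Volta or newer); the specific clock and power values are arbitrary examples rather than recommendations.

# Minimal sketch: capping peak core clocks vs. lowering the board power limit.
# Assumes the nvidia-ml-py (pynvml) bindings, admin rights, and a GPU that
# supports locked clocks; the numbers below are arbitrary examples.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU in the system

# Option 1: clip the peak core clock. The GPU can still boost freely
# anywhere inside this window.
pynvml.nvmlDeviceSetGpuLockedClocks(handle, 210, 1400)  # min/max in MHz

# Option 2: lower the board power limit. Clocks and voltage then settle
# at whatever fits under the reduced budget.
default_mw = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)
pynvml.nvmlDeviceSetPowerManagementLimit(handle, int(default_mw * 0.8))

# Undo both changes and clean up.
pynvml.nvmlDeviceResetGpuLockedClocks(handle)
pynvml.nvmlDeviceSetPowerManagementLimit(handle, default_mw)
pynvml.nvmlShutdown()

The same controls are available from the command line on recent drivers via nvidia-smi --lock-gpu-clocks and nvidia-smi --power-limit.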