Google Replaces Millions of Intel's CPUs With Its Own Homegrown Chips
YouTube now uses homegrown Argos VCUs
Google has designed its own processors, the Argos video (trans)coding units (VCUs), which have a single purpose: processing video. According to a recent report, the highly efficient chips have allowed the technology giant to replace millions, possibly tens of millions, of Intel CPUs with its own silicon.
For many years Intel's video decoding/encoding engines that come built into its CPUs have dominated the market both because they offered leading-edge performance and capabilities and because they were easy to use. But custom-built application-specific integrated circuits (ASICs) tend to outperform general-purpose hardware because they are designed for one workload only. As such, Google turned to developing its own specialized hardware for video processing tasks for YouTube, and to great effect.
However, Intel may have a trick up its sleeve with its latest tech that could win back Google's specialized video processing business.
Loads of Videos Require New Hardware
Users upload more than 500 hours of video content in various formats every minute to YouTube. Google needs to quickly transcode that content to multiple resolutions (including 144p, 240p, 360p, 480p, 720p, 1080p, 1440p, 2160p, and 4320p) and data-efficient formats (e.g., H.264, VP9 or AV1), which requires formidable encoding horsepower.
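To get a feel for the scale of that workload, here is a rough back-of-the-envelope sketch. The 500 hours per minute figure and the resolution labels come from the paragraph above; the 16:9 frame widths are assumptions for illustration, since the article only gives the "p" labels, and the worst case of every upload being rendered at every resolution in every codec is hypothetical.

```python
# Assumed 16:9 frame sizes for the resolutions named in the article.
# The widths are illustrative; the article lists only the "p" labels.
LADDER = {
    "144p": (256, 144),
    "240p": (426, 240),
    "360p": (640, 360),
    "480p": (854, 480),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "1440p": (2560, 1440),
    "2160p": (3840, 2160),
    "4320p": (7680, 4320),
}
CODECS = ["H.264", "VP9", "AV1"]   # formats the article gives as examples
UPLOAD_HOURS_PER_MINUTE = 500      # figure quoted in the article


def pixels_per_frame(width, height):
    """Number of pixels in one frame of the given size."""
    return width * height


# Hours of new video arriving per hour of wall-clock time.
realtime_factor = UPLOAD_HOURS_PER_MINUTE * 60

# Hypothetical worst case: every upload rendered at every rung in every codec.
max_outputs_per_upload = len(LADDER) * len(CODECS)

# Pixels per output frame summed over the whole ladder, relative to 1080p alone.
total_pixels = sum(pixels_per_frame(w, h) for w, h in LADDER.values())
ladder_vs_1080p = total_pixels / pixels_per_frame(*LADDER["1080p"])

print(f"Video arrives at {realtime_factor:,}x real time")
print(f"Up to {max_outputs_per_upload} outputs per upload")
print(f"Full ladder is about {ladder_vs_1080p:.1f}x the pixels of a 1080p-only encode")
```

In practice an upload would only be transcoded to resolutions at or below its source, so the real multiplier is lower than this upper bound.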
Historically, Google had two options for transcoding/encoding content. The first option was Intel's Visual Computing Accelerator (VCA) that packed three Xeon E3 CPUs with built-in Iris Pro P6300/P580 GT4e integrated graphics cores with leading-edge hardware encoders. The second option was to use software encoding and general-purpose Intel Xeon processors.
Google decided that neither option was power-efficient enough for emerging YouTube workloads – the Visual Computing Accelerator was rather power hungry itself, whereas scaling the number of Xeon CPUs essentially meant increasing the number of servers, which means additional power and datacenter footprint. As a result, Google decided to go with custom in-house hardware.
Google's first-generation Argos VCU does not replace Intel's central processors completely as the servers still need to run the OS and manage storage drives and network connectivity. To a large degree, Google's Argos VCU resembles a GPU that always needs an accompanying CPU.
Instead of stream processors like we see in GPUs, Google's VCU integrates ten H.264/VP9 encoder engines, several decoder cores, four LPDDR4-3200 memory channels (featuring 4x32-bit interfaces), a PCIe interface, a DMA engine, and a small general-purpose core for scheduling purposes. Most of the IP, except the in-house designed encoders/transcoders, was licensed from third parties to cut down on development costs. Each VCU is also equipped with 8GB of usable ECC LPDDR4 memory.
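The memory configuration above implies a theoretical peak bandwidth that can be worked out directly. This is a minimal sketch that assumes LPDDR4-3200 means 3,200 mega-transfers per second per pin and ignores any overhead from ECC or refresh, neither of which the article quantifies.

```python
# Figures from the article: four LPDDR4-3200 channels, each 32 bits wide.
CHANNELS = 4
BUS_WIDTH_BITS = 32
TRANSFERS_PER_SECOND = 3200 * 10**6  # assumption: "3200" = 3200 MT/s

bytes_per_transfer = BUS_WIDTH_BITS // 8

# Theoretical peak, before ECC or refresh overhead (not modeled here).
per_channel_gb_s = TRANSFERS_PER_SECOND * bytes_per_transfer / 1e9
peak_gb_s = per_channel_gb_s * CHANNELS

print(f"Per channel: {per_channel_gb_s:.1f} GB/s")
print(f"Per VCU:     {peak_gb_s:.1f} GB/s")
```

Under those assumptions each VCU tops out at roughly 51 GB/s of raw memory bandwidth; sustained figures would be lower.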
The main idea behind Google's VCU is to put as many high-performance encoders/transcoders into a single piece of silicon as possible (while remaining power efficient) and then scale the number of VCUs separately from the number of servers needed. Google places two VCUs on a board and then installs 10 cards per dual-socket Intel Xeon server, greatly increasing the company's decoding/transcoding performance per rack.
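The density described above works out as follows. All inputs are taken from the article (two VCUs per board, 10 boards per server, ten encoder engines and 8GB of memory per VCU); the variable names are just for illustration.

```python
# Per-server totals implied by the figures quoted in the article.
VCUS_PER_CARD = 2
CARDS_PER_SERVER = 10
ENCODERS_PER_VCU = 10
MEMORY_PER_VCU_GB = 8

vcus_per_server = VCUS_PER_CARD * CARDS_PER_SERVER
encoders_per_server = vcus_per_server * ENCODERS_PER_VCU
vcu_memory_per_server_gb = vcus_per_server * MEMORY_PER_VCU_GB

print(f"{vcus_per_server} VCUs per server")
print(f"{encoders_per_server} encoder engines per server")
print(f"{vcu_memory_per_server_gb} GB of VCU-attached memory per server")
```

The point of the design is that these totals grow with the number of cards, not with the number of Xeon sockets.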
Increasing Efficiency Leads to Migration from Xeon
Google says that its VCU-based machines have seen up to 7x (H.264) and up to 33x (VP9) improvements in performance/TCO compute efficiency compared to Intel Skylake-powered server systems. This improvement accounts for the cost of the VCUs (vs. Intel's CPUs) and three years of operational expenses, which makes VCUs an easy choice for video behemoth YouTube.
Offline Two-Pass Single Output (SOT) Throughput in CPU, GPU, and VCU-Equipped Systems
System | H.264 Throughput (MPix/s) | VP9 Throughput (MPix/s) | H.264 Performance/TCO | VP9 Performance/TCO |
2-way Skylake | 714 | 154 | 1x | 1x |
4x Nvidia T4 | 2,484 | - | 1.5x | - |
8x Google Argos VCUs | 5,973 | 6,122 | 4.4x | 20.8x |
20x Google Argos VCUs | 14,932 | 15,306 | 7x | 33.3x |
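One way to sanity-check the table is to back out the relative cost each row implies. The sketch below assumes that the performance/TCO column equals the throughput speedup divided by TCO relative to the Skylake baseline; the article does not spell out Google's exact formula, so treat that relationship and the helper name as assumptions.

```python
# Figures copied from the table: (throughput in MPix/s, perf/TCO vs. Skylake).
SYSTEMS = {
    "2-way Skylake":         {"h264": (714, 1.0),   "vp9": (154, 1.0)},
    "8x Google Argos VCUs":  {"h264": (5973, 4.4),  "vp9": (6122, 20.8)},
    "20x Google Argos VCUs": {"h264": (14932, 7.0), "vp9": (15306, 33.3)},
}
BASELINE = "2-way Skylake"


def implied_relative_tco(system, codec):
    """Back out TCO relative to the baseline.

    ASSUMPTION: perf/TCO = (throughput speedup) / (relative TCO).
    """
    throughput, perf_per_tco = SYSTEMS[system][codec]
    base_throughput, _ = SYSTEMS[BASELINE][codec]
    speedup = throughput / base_throughput
    return speedup / perf_per_tco


for name in SYSTEMS:
    for codec in ("h264", "vp9"):
        tco = implied_relative_tco(name, codec)
        print(f"{name:<22} {codec:<5} implied TCO ~ {tco:.2f}x baseline")
```

Under this assumption the H.264 and VP9 columns imply nearly the same cost for each configuration, roughly 1.9x the baseline for the 8-VCU system and roughly 3x for the 20-VCU system, which suggests the published ratios are internally consistent.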
From the performance numbers shared by Google, it is evident that a single Argos VCU is barely faster than a 2-way Intel Skylake server in H.264. However, since 20 VCUs can be installed into such a server, the VCU wins from an efficiency perspective. When it comes to the more demanding VP9 codec, a single Google VCU appears to be about five times faster than Intel's dual-socket Xeon and therefore offers impressive efficiency advantages.
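The per-VCU comparison can be reproduced by normalizing the table's 8-VCU row. This sketch assumes throughput divides evenly across the VCUs in a system, which the 20-VCU row supports since it yields nearly the same per-chip figure.

```python
# Throughput in MPix/s, taken from the table above.
SKYLAKE = {"h264": 714, "vp9": 154}
ARGOS_8X = {"h264": 5973, "vp9": 6122}
ARGOS_20X = {"h264": 14932, "vp9": 15306}


def per_vcu(system_throughput, vcu_count):
    """Average throughput per VCU, assuming an even split across chips."""
    return {codec: value / vcu_count for codec, value in system_throughput.items()}


single_vcu = per_vcu(ARGOS_8X, 8)
single_vcu_check = per_vcu(ARGOS_20X, 20)

for codec in ("h264", "vp9"):
    ratio = single_vcu[codec] / SKYLAKE[codec]
    print(
        f"{codec}: {single_vcu[codec]:.1f} MPix/s per VCU "
        f"({single_vcu_check[codec]:.1f} from the 20x row), "
        f"{ratio:.2f}x a 2-way Skylake server"
    )
```

The result is roughly 1.05x for H.264 and just under 5x for VP9, matching the "barely faster" and "five times faster" characterizations.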
Since Google has been using its Argos VCUs for several years now, it has clearly replaced many of its Xeon-based YouTube servers with machines running its own silicon. It is extremely hard to estimate how many Xeon systems Google actually replaced, but some analysts believe the technology giant could have swapped between four and 33 million Intel CPUs for its own VCUs. Even if the second number is an overestimate, we are still talking about millions of units.
Since Google needs loads of processors for its other services, it is likely that the number of CPUs that the company buys from AMD or Intel is still very high and is not going to decrease any time soon as it will be years before Google's own datacenter-grade system-on-chips (SoCs) will be ready.
It is also noteworthy that in an attempt to use innovative encoding technologies (e.g., AV1) right now, Google needs to use general-purpose CPUs even for YouTube as the Argos does not support the codec. Furthermore, as more efficient codecs emerge (and these tend to be more demanding in terms of compute horsepower), Google will have to continue to use CPUs for initial deployments. Ironically, the advantage of dedicated hardware will only grow in the future.
Google is already working on its second-gen VCU that supports AV1, H.264, and VP9 codecs as it needs to further increase the efficiency of its encoding technologies. It is unclear when the new VCUs will be deployed, but it is clear that the company wants to use its own SoCs instead of general-purpose processors where possible.
Intel Isn't Standing Still
Intel isn't standing still, though. The company's DG1 Xe-LP-based quad-chip SG1 server card can decode up to 28 4Kp60 streams as well as transcode up to 12 simultaneous streams. Essentially, Intel's SG1 does exactly what Google's Argos VCU does: scale video decoding and transcoding performance separately from the server count and thus reduce the number of general-purpose processors required in a data center used for video applications.
With its upcoming single-tile Xe-HP GPU, Intel will offer transcoding of 10 high-quality 4Kp60 streams simultaneously. Keeping in mind that some Xe-HP GPUs will scale to four tiles, and more than one GPU can be installed per system, Intel's market-leading media decoding and encoding capabilities will only become more solid.
Summary
Google has managed to build a remarkable H.264 and VP9-supporting video (trans)coding unit (VCU) that can offer significantly higher efficiency in video encoding/transcoding workloads than Intel's existing CPUs. Furthermore, VCUs enable Google to scale its video encoding/transcoding performance independently from the number of servers.
Yet, Intel already has its Xe-LP GPUs and SG1 cards that offer some serious video decoding and encoding capabilities, too, so Intel will still be successful in datacenters with heavy video streaming workloads. Furthermore, with the emergence of Intel's Xe-HP GPUs, the company promises to solidify its position in this market.
Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.
-
Co BIY These kind of single company efficiencies will make market entry for competitors increasingly difficult.
It's easy to hate on intel but at least they'll sell to anyone.
The tech is amazing though. What process node and foundry are these being made on? -
jkflipflop98 I'm sure they'll also be peddling these things to Netflix, Hulu, Disney, yada yada yada. Smart idea, if they can beat Intel. Good luck. -
watzupken That's why Intel's woes are getting worst over time. The sector which they historically makes a lot of money is eroding from them very quickly. That pain is starting to surface in their latest earnings where revenue from data centers fell off a cliff when everyone else is seeing a healthy bump in revenue/ profit. We will see in future earnings whether its truly a case of the industry still trying to digest the inventory or is there a bigger problem that is starting to show up on their P/L. -
TerryLaze For many years Intel's video decoding/encoding engines that come built into its CPUs have dominated the market both because they offered leading-edge performance and capabilities and because they were easy to use. But custom-built application-specific integrated circuits (ASICs) tend to outperform general-purpose hardware because they are designed for one workload only. As such, Google turned to developing its own specialized hardware for video processing tasks for YouTube, and to great effect.
Isn't qsv an ASIC as well?! The issue for google is that it is connected to a CPU and that's probably why it's less efficient. (draws more power) -
Friesiansam excalibur1814 said: Cool! Now every video on youtube will have THREE adverts.
Use a good adblocker. I see no adverts on YouTube. -
ottonis What about encoding quality? Being a digital video editing hobbyist myself, for maximum quality output software encoding has usually been preferred over hardware accelerated encoding (e.g. NVENC). The obvious trade-off was obviously speed.
Have the most recent iterations of hardware encoders improved that much that they are on par or better than software based video encoding?
And is it true that the incredible performance of M1 chips in video encoding/decoding tasks is due to some implemented sophisticated hardware accelerators in the iGPU? -
evdjj3j TerryLaze said: Isn't qsv an ASIC as well?! The issue for google is that it is connected to a CPU and that's probably why it's less efficient. (draws more power)
I was thinking the same thing. -
JayNor while the transcoding is one operation, the SVP/GM of Intel's IOTG group this week talked about their main focus is on doing ai inference on multiple camera streams ... ai at the edge. So, I think Intel will find more use cases for its GPUs.
FPGA based SmartNICs are also being promoted for fixed operation streaming. -
artk2219 Friesiansam said:Use a good adblocker. I see no adverts on YouTube.
Seriously. Youtube is damned near unwatchable without an ad blocker now.