FFmpeg devs boast of up to 94x performance boost after implementing handwritten AVX-512 assembly code

Contemporary high-level programming languages and advanced compilers greatly simplify software development and lower its costs. However, this way of programming can hide the performance capabilities of modern hardware, partly due to inefficiencies of application programming interfaces (APIs). According to FFmpeg, a good old handwritten assembly code path can improve performance by between three and 94 times, depending on the workload. FFmpeg did not disclose the hardware on which these results were measured.

FFmpeg is an open-source multimedia processing project developed by volunteers who contribute to its codebase, fix bugs, and add new features. The project is led by a small group of core developers and maintainers who oversee its direction and ensure that contributions meet certain standards. They coordinate the project's development and release cycles, merging contributions from other developers. This group has now implemented a handwritten AVX-512 assembly code path, something that has rarely been done before, at least not in the video industry.

The developers have created an optimized code path using the AVX-512 instruction set to accelerate specific functions within the FFmpeg multimedia processing library. By leveraging AVX-512, they achieved significant performance improvements (from three to 94 times faster) compared to standard implementations. AVX-512 processes large chunks of data in parallel using 512-bit registers, each of which holds up to 16 single-precision or 8 double-precision floating-point values that a single instruction can operate on at once. This kind of optimization suits compute-heavy tasks in general, and video and image processing in particular.
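The register arithmetic behind those figures can be sketched in a few lines of Python. This is an illustration only, not FFmpeg code: the helper name lanes_per_register is made up for this example, and the 8-bit row is included on the assumption that much video data is stored as 8-bit samples.

```python
# Illustrative sketch: how many elements fit into one SIMD register.
# `lanes_per_register` is a hypothetical helper, not part of FFmpeg.

def lanes_per_register(register_bits: int, element_bits: int) -> int:
    """Return how many elements of `element_bits` fit in one register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element size must evenly divide the register size")
    return register_bits // element_bits

# Register widths of the instruction sets named in the article.
REGISTER_BITS = {"SSE (128-bit)": 128, "AVX2 (256-bit)": 256, "AVX-512 (512-bit)": 512}

# Element widths: 8-bit samples (assumed typical for video), float32, float64.
ELEMENT_BITS = {"8-bit sample": 8, "single precision": 32, "double precision": 64}

if __name__ == "__main__":
    for isa, reg_bits in REGISTER_BITS.items():
        for elem, elem_bits in ELEMENT_BITS.items():
            print(f"{isa}: {lanes_per_register(reg_bits, elem_bits)} x {elem}")
```

Under these assumptions, a single 512-bit register covers twice as many elements as a 256-bit AVX2 register and four times as many as a 128-bit SSE register, which is where the potential gain over narrower SIMD code comes from.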

The benchmarking results show that the new handwritten AVX-512 code path performs considerably faster than other implementations, including baseline C code and older, narrower SIMD instruction sets like AVX2 and SSE3. In some cases, the AVX-512 code path achieves a speedup of nearly 94 times over the baseline C code, highlighting the efficiency of hand-optimized assembly code for AVX-512.
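A speedup measured on one function does not carry over one-to-one to a whole encode or decode job. A rough way to see this is Amdahl's law, sketched below in Python. The 10% share of total runtime is an assumption chosen purely for illustration, not a figure published by FFmpeg, and the function name overall_speedup is hypothetical.

```python
# Illustrative sketch of Amdahl's law: overall speedup when only a
# fraction of the total runtime is accelerated.
# `overall_speedup` is a hypothetical helper, not part of FFmpeg.

def overall_speedup(accelerated_fraction: float, function_speedup: float) -> float:
    """Overall speedup when `accelerated_fraction` of the runtime
    (between 0 and 1) becomes `function_speedup` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if function_speedup <= 0.0:
        raise ValueError("function_speedup must be positive")
    untouched = 1.0 - accelerated_fraction
    accelerated = accelerated_fraction / function_speedup
    return 1.0 / (untouched + accelerated)

if __name__ == "__main__":
    # ASSUMPTION: the optimized function takes 10% of total runtime.
    assumed_fraction = 0.10
    gain = overall_speedup(assumed_fraction, 94.0)
    print(f"Overall speedup with a 94x faster function: {gain:.2f}x")
```

Under that assumed 10% share, even a 94x faster function leaves the overall job only slightly faster, because the remaining 90% of the work is unchanged. The real share for any given FFmpeg workload would have to be measured.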

This development is particularly valuable for users running on high-performance, AVX-512-capable hardware, enabling them to process media content far more efficiently. There is an issue, though: Intel disabled AVX-512 on its 12th, 13th, and 14th Generation Core processors, leaving owners of these CPUs without it. On the other hand, AMD's Ryzen 9000-series CPUs feature a fully enabled AVX-512 FPU, so the owners of these processors can take advantage of the FFmpeg achievement.

Unfortunately, due to the complexity and specialized nature of AVX-512, such optimizations are typically reserved for performance-critical applications and require expertise in low-level programming and processor microarchitecture.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

  • edzieba
    Note that this is a 94x speedup for a single function (an 8-tap motion compensation filter), and not a 94x speedup in video encoding.
  • ex_bubblehead
    Hand optimization has always created faster executing code than relying on the compiler/assembler.
  • ET3D
    Also note that this is 94x faster than pure C code, but between -3% and 60% faster than the AVX2 code path. (See this slide.)

    ex_bubblehead said:
    Hand optimization has always created faster executing code than relying on the compiler/assembler.
    And also tends to create code that's hard to update or debug.
  • AdelaideSimone
    ex_bubblehead said:
    Hand optimization has always created faster executing code than relying on the compiler/assembler.
    Wrong. This hasn't been the case in at least ~20 years. There are very few places left where manual assembly code is still desirable.

    In fact, it prevents the compiler from making most optimizations to anything using the assembly code, and it provides no latency information, essentially ruining scheduling pipelines.

    In the majority of cases, code like this would likely be much more performant if it were written using intrinsics (or generic SIMD or a combination thereof) instead of assembly.
  • usertests
    ET3D said:
    Also note that this is 94% faster than pure C code, but between -3% and 60% faster than the AVX2 code path. (See this slide.)
    94% != 94x
  • ex_bubblehead
    AdelaideSimone said:
    Wrong. This hasn't been the case in at least ~20 years. There are very few places left where manual assembly code is still desirable.

    In fact, it prevents the compiler from making most optimizations to anything using the assembly code, and it provides no latency information, essentially ruining scheduling pipelines.

    In the majority of cases, code like this would likely be much more performant if it were written using intrinsics (or generic SIMD or a combination thereof) instead of assembly.
    Given that the basics are no longer taught to prospective programmers, you are correct. However, I have yet to find any compiler/assembler that can rival an experienced assembly language programmer. Running a compiler/assembler against already optimized code is a bad way to be doing things.
  • user7007
    usertests said:
    94% != 94x
    The post on X says 94x.
  • Rob1C
    Depending on where you look there's lots of assembly language code: https://git.ffmpeg.org/gitweb/ffmpeg.git/tree/969c271a5a7bd7681a1f775097cf9039f75768f6:/libavcodec/x86
  • JamesJones44
    AdelaideSimone said:
    Wrong. This hasn't been the case in at least ~20 years. There are very few places left where manual assembly code is still desirable.
    In performance-critical situations it's still often used, even today. Games even use it for critical path situations.

    Should it be the go-to solution? Hell no, but when performance matters and an area is slow, dropping into ASM can yield impressive results.
  • atmapuri
    And FFmpeg developers were looking down on AVX-512 and pretending it was not there for the last 10 years, as long as it was an Intel-only feature. Now that it is an AMD-only feature, it is OK.