Efficient VVC Encoding Using Hierarchical Parallelization: A Comprehensive Analysis

George, Valeri; Brandenburg, Jens; Hege, Gabriel; Hinz, Tobias; Wieckowski, Adam; Bross, Benjamin; Marpe, Detlev

doi:10.1142/S1793351X2450003X

2024

Journal Article

Abstract

This paper presents and analyzes different parallelization strategies in VVenC, an open and optimized software encoder implementation of the Versatile Video Coding (VVC) standard. VVC has been developed to address the increasing demand for higher compression of digital video data, and it reduces the bitrate by around 50% for the same perceived quality compared to its predecessor, the High-Efficiency Video Coding (HEVC) standard. However, this increase in compression efficiency comes with an increase in computational complexity, particularly on the encoder side. VVenC integrates algorithmic optimizations for each coding tool in VVC and defines a set of five presets, from faster to slower, that provide Pareto-optimal tradeoffs between runtime and efficiency. Multithreading is employed to further reduce runtime while preserving most of the compression efficiency of each preset. With a hierarchical combination of pre-processing, picture-level, and in-picture parallelization, VVenC achieves a 4× speedup with four threads. Further speedup with a higher number of threads depends on the video resolution and the encoder preset used. With 16 threads, it ranges from 6× to 9× for high-definition video and from 10× to 12× for ultra-high-definition video. Compared to previous work, the use of temporal prediction for the adaptive loop filter reduces the associated coding efficiency loss from 0.4% to almost zero. Better scaling for higher numbers of threads can be achieved at the cost of a higher coding efficiency loss. Within the presented framework, the additional speedup obtained from smaller Coding Tree Unit (CTU) sizes and from combining wavefront parallel processing with various tile-based picture partitioning configurations is examined. Furthermore, results on a 20-core ARM-based Apple M1 computer indicate better multithreading scaling than on x86-based architectures. The analysis is complemented by profiling, which reveals the overhead caused by idle threads and identifies mode estimation as the main bottleneck of the presented framework.
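
To put the reported speedups in perspective, the following sketch converts them into parallel efficiency (speedup divided by thread count). This metric is a standard one and is not stated in the abstract itself; the function and variable names are hypothetical, and the only inputs are the speedup figures quoted above.

```python
# Illustrative sketch: parallel efficiency derived from the speedups quoted
# in the abstract. Names are hypothetical; the metric is an assumption added
# for illustration and is not taken from the paper.

def parallel_efficiency(speedup: float, threads: int) -> float:
    """Return speedup per thread (1.0 means perfect linear scaling)."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return speedup / threads


# (content class, thread count) -> (lowest, highest) speedup from the abstract.
REPORTED_SPEEDUPS = {
    ("all", 4): (4.0, 4.0),
    ("HD", 16): (6.0, 9.0),
    ("UHD", 16): (10.0, 12.0),
}


def efficiency_ranges(reported):
    """Map each (content, threads) key to its (min, max) efficiency."""
    return {
        key: (
            parallel_efficiency(low, key[1]),
            parallel_efficiency(high, key[1]),
        )
        for key, (low, high) in reported.items()
    }


if __name__ == "__main__":
    for (content, threads), (low, high) in efficiency_ranges(
        REPORTED_SPEEDUPS
    ).items():
        print(f"{content:>3} @ {threads:2d} threads: {low:.1%} to {high:.1%}")
```

Under these figures, four threads correspond to an efficiency of 1.0, while 16 threads give roughly 0.38 to 0.56 for high-definition and 0.63 to 0.75 for ultra-high-definition video, which is consistent with the abstract's observation that scaling beyond four threads depends on resolution.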