librav1e Multithreading and CPU Usage Explained
This article explains how the librav1e library, the
C-compatible wrapper for the Rust-based AV1 encoder rav1e,
manages multithreading and CPU utilization during video encoding. It
covers the underlying threading architecture, the mechanisms used to
distribute encoding tasks across processor cores, and how developers can
optimize thread allocation for maximum performance and resource
control.
The Rayon Thread Pool and Rust Concurrency
Because librav1e is a C-compatible interface for the
Rust-written rav1e encoder, it relies entirely on Rust’s
safety-centric concurrency model. At the heart of its multithreading
framework is rayon, a data-parallelism library for
Rust.
Instead of spawning and destroying threads dynamically during
encoding, which incurs high operating system overhead, librav1e
sets up a persistent, work-stealing thread pool when the encoder
context is created from its configuration. When compute-heavy
operations (such as motion estimation, intra-prediction, or
rate-distortion optimization) need to be executed, they are broken down
into smaller jobs and submitted to the Rayon pool. Idle threads in the
pool “steal” pending jobs from busier threads so that all allocated CPU
cores remain active.
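The idea of keeping every worker busy can be illustrated with a small, self-contained C sketch. It is not how Rayon or rav1e are implemented: Rayon uses per-thread work-stealing queues, whereas this simplified version lets workers claim jobs from a single shared atomic counter. The names JobQueue, process_job, and run_jobs are hypothetical and exist only for this illustration.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_WORKERS 64

/* Shared state: jobs are claimed by index from one atomic counter.
 * (Simplification: Rayon uses per-thread work-stealing deques instead.) */
typedef struct {
    atomic_size_t next_job; /* index of the next unclaimed job */
    size_t job_count;       /* total number of jobs */
    long *results;          /* one output slot per job */
} JobQueue;

/* Stand-in for an expensive encoding task (hypothetical workload). */
static long process_job(size_t index) {
    return (long)index * (long)index;
}

/* Each worker keeps claiming jobs until none remain, so a thread that
 * finishes early takes the next job instead of sitting idle. */
static void *worker_main(void *arg) {
    JobQueue *queue = arg;
    for (;;) {
        size_t index = atomic_fetch_add(&queue->next_job, 1);
        if (index >= queue->job_count) {
            break;
        }
        queue->results[index] = process_job(index);
    }
    return NULL;
}

/* Runs job_count jobs on worker_count threads and returns the sum of all
 * job results, or -1 if the arguments are invalid or setup fails. */
long run_jobs(size_t worker_count, size_t job_count) {
    pthread_t workers[MAX_WORKERS];
    JobQueue queue;
    size_t started = 0;
    long total = 0;

    if (worker_count == 0 || worker_count > MAX_WORKERS) {
        return -1;
    }
    queue.results = calloc(job_count ? job_count : 1, sizeof *queue.results);
    if (queue.results == NULL) {
        return -1;
    }
    atomic_init(&queue.next_job, 0);
    queue.job_count = job_count;

    while (started < worker_count &&
           pthread_create(&workers[started], NULL, worker_main, &queue) == 0) {
        started++;
    }
    for (size_t i = 0; i < started; i++) {
        pthread_join(workers[i], NULL);
    }

    if (started == worker_count) {
        for (size_t i = 0; i < job_count; i++) {
            total += queue.results[i];
        }
    } else {
        total = -1; /* at least one thread could not be started */
    }
    free(queue.results);
    return total;
}
```

Calling run_jobs(4, 100) distributes one hundred jobs over four threads. Which thread handles which job varies from run to run, but the combined result does not.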
Parallelism Strategies in AV1 Encoding
To distribute the workload of a heavy codec like AV1 across multiple
processor cores, librav1e utilizes several parallelization
strategies:
- Tile-Based Parallelism: The AV1 specification allows video frames to be split into a grid of independently coded rectangular regions called “tiles.” librav1e can process these tiles concurrently. If a video is configured to use four tiles, four separate threads can encode those sections of the frame at the same time. While highly effective at increasing CPU utilization, a high tile count can slightly reduce compression efficiency, because prediction and entropy-coding context cannot cross tile boundaries.
- Worker-Based Task Decomposition: Even within a single tile, parts of the work can run in parallel. The encoder splits pixel-level calculations and search algorithms into independent tasks that run concurrently in the thread pool, while inherently sequential steps such as a tile's entropy coding remain serial.
- Frame Pipelining: librav1e processes multiple frames at different stages of the encoding pipeline. While frame \(N\) is undergoing final entropy coding, frame \(N+1\) can undergo motion estimation, ensuring a continuous flow of data that keeps the CPU saturated.
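To make the tile grid concrete, the following C sketch computes the pixel rectangle covered by each tile in a uniform grid. It is a simplified illustration rather than rav1e's actual tiling code: it assumes 64x64 superblocks and evenly sized tiles, and the names TileRect, tile_rect, and tile_passes are hypothetical.

```c
#define SB_SIZE 64 /* assumed superblock size in pixels */

typedef struct {
    int x, y, width, height;
} TileRect;

/* Ceiling division for positive integers. */
static int div_ceil(int a, int b) {
    return (a + b - 1) / b;
}

/* Returns the pixel rectangle of tile (col, row) in a cols x rows grid.
 * Tile edges are aligned to superblocks; tiles on the right and bottom
 * edges are cropped to the frame. Invalid arguments, or a tile that falls
 * entirely outside the frame, yield an all-zero rectangle. */
TileRect tile_rect(int frame_w, int frame_h, int cols, int rows,
                   int col, int row) {
    TileRect rect = {0, 0, 0, 0};
    if (frame_w <= 0 || frame_h <= 0 || cols <= 0 || rows <= 0 ||
        col < 0 || col >= cols || row < 0 || row >= rows) {
        return rect;
    }

    int sb_cols = div_ceil(frame_w, SB_SIZE);
    int sb_rows = div_ceil(frame_h, SB_SIZE);
    int tile_w = div_ceil(sb_cols, cols) * SB_SIZE;
    int tile_h = div_ceil(sb_rows, rows) * SB_SIZE;
    int x = col * tile_w;
    int y = row * tile_h;
    if (x >= frame_w || y >= frame_h) {
        return rect;
    }

    rect.x = x;
    rect.y = y;
    rect.width = (x + tile_w > frame_w) ? frame_w - x : tile_w;
    rect.height = (y + tile_h > frame_h) ? frame_h - y : tile_h;
    return rect;
}

/* Number of sequential rounds needed when `tiles` independent tiles are
 * encoded by `threads` workers, one tile per worker at a time. */
int tile_passes(int tiles, int threads) {
    if (tiles <= 0 || threads <= 0) {
        return 0;
    }
    return div_ceil(tiles, threads);
}
```

For a 1920x1080 frame split 2x2, the top-left tile covers 960x576 pixels and the bottom-right tile covers 960x504. With four threads all four tiles fit in a single pass, while three threads would need two passes, leaving cores idle during the second.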
Controlling CPU Utilization and Thread Allocation
By default, librav1e attempts to auto-detect the number
of logical CPU cores on the host system and scales its thread pool to
match. However, on modern high-core-count processors (such as AMD
Threadripper or Intel Xeon chips), uncapped thread utilization can lead
to diminishing returns, cache thrashing, or thread synchronization
bottlenecks.
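One way to get a feel for these diminishing returns is Amdahl's law, which bounds the speedup when only part of the work can run in parallel. The sketch below is a generic illustration, not a measurement of rav1e. The parallel fraction is an assumed input, and the formula ignores cache and synchronization costs, so real scaling is usually worse.

```c
/* Amdahl's law: upper bound on speedup with `threads` workers when
 * `parallel_fraction` (0.0 to 1.0) of the work can run in parallel.
 * Out-of-range inputs are clamped. */
double amdahl_speedup(double parallel_fraction, int threads) {
    if (threads < 1) {
        threads = 1;
    }
    if (parallel_fraction < 0.0) {
        parallel_fraction = 0.0;
    }
    if (parallel_fraction > 1.0) {
        parallel_fraction = 1.0;
    }
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / threads);
}
```

Assuming 90% of the encode parallelizes, 16 threads give at most a 6.4x speedup and 64 threads only about 8.8x, so quadrupling the thread count buys less than a 40% improvement.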
Developers can fine-tune CPU utilization through the library’s configuration API:
- Thread Limits: Using the C API's generic option setters, for example rav1e_config_parse_int() with the "threads" key, developers can set a hard limit on the number of threads the encoder is allowed to use. This is crucial for cloud encoding instances where CPU resources must be strictly partitioned.
- Tile Customization: Developers can explicitly set the number of tile rows and tile columns. Matching the tile count to the available thread count is often the most effective way to improve CPU scaling on high-resolution videos (such as 4K).
- Speed Presets: The choice of speed preset (ranging from 0 to 10) significantly impacts CPU utilization. Slower presets (lower numbers) perform deep, complex searches that naturally saturate multiple threads for longer periods. Faster presets (higher numbers) simplify these searches, which can sometimes reduce overall CPU utilization because the system bottleneck shifts from raw computation to frame input/output handling.
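The three controls above can be combined in a small configuration helper, sketched below in C. The EncoderTuning struct, pick_tuning(), make_config(), and the USE_RAV1E build flag are hypothetical names for this illustration. The flag only keeps the helper compilable without the library installed. The rav1e calls assume that the C API accepts the "threads", "tile_cols", "tile_rows", and "speed" keys through rav1e_config_parse_int(), which should be checked against the rav1e.h shipped with the installed version.

```c
#ifdef USE_RAV1E /* hypothetical build flag, not part of rav1e */
#include <rav1e.h>
#endif

typedef struct {
    int threads;   /* hard cap on encoder threads */
    int tile_cols; /* number of tile columns */
    int tile_rows; /* number of tile rows */
    int speed;     /* speed preset, 0 (slowest) to 10 (fastest) */
} EncoderTuning;

static int clamp_int(int value, int lo, int hi) {
    return value < lo ? lo : (value > hi ? hi : value);
}

/* Picks a near-square, power-of-two tile grid with at most one tile per
 * thread. This sketch always sets an explicit thread limit, so thread
 * counts below 1 are treated as 1. */
EncoderTuning pick_tuning(int threads, int speed) {
    EncoderTuning tuning;
    tuning.threads = threads < 1 ? 1 : threads;
    tuning.speed = clamp_int(speed, 0, 10);
    tuning.tile_cols = 1;
    tuning.tile_rows = 1;
    /* Alternately double columns and rows while the grid still fits. */
    while (tuning.tile_cols * tuning.tile_rows * 2 <= tuning.threads) {
        if (tuning.tile_cols <= tuning.tile_rows) {
            tuning.tile_cols *= 2;
        } else {
            tuning.tile_rows *= 2;
        }
    }
    return tuning;
}

#ifdef USE_RAV1E
/* Applies the tuning to a fresh rav1e configuration. Returns NULL on
 * failure; the caller releases the result with rav1e_config_unref(). */
RaConfig *make_config(const EncoderTuning *tuning) {
    RaConfig *cfg = rav1e_config_default();
    if (cfg == NULL) {
        return NULL;
    }
    if (rav1e_config_parse_int(cfg, "threads", tuning->threads) < 0 ||
        rav1e_config_parse_int(cfg, "tile_cols", tuning->tile_cols) < 0 ||
        rav1e_config_parse_int(cfg, "tile_rows", tuning->tile_rows) < 0 ||
        rav1e_config_parse_int(cfg, "speed", tuning->speed) < 0) {
        rav1e_config_unref(cfg);
        return NULL;
    }
    return cfg;
}
#endif
```

With eight threads this heuristic yields a 4x2 grid, and with six threads it falls back to 2x2, since the grid only grows in powers of two. Whether one tile per thread is the best trade-off depends on resolution and content, so the values are a starting point for measurement rather than a recommendation.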