How rav1e Calculates Motion Vectors in AV1 Encoding

This article explores the inner workings of motion estimation in the rav1e AV1 video encoder, detailing how the library calculates optimal motion vectors between consecutive frames. We will examine its hierarchical motion estimation strategy, block-matching search patterns, sub-pixel refinement, and the cost metrics used to balance encoding speed with compression efficiency.

Hierarchical Motion Estimation (HME)

To avoid exhaustively testing every candidate position in a reference frame, which is computationally prohibitive, librav1e employs Hierarchical Motion Estimation (HME). This multi-resolution approach divides the process into two stages:

Downscaled Search (Coarse Stage): The encoder creates downscaled versions of the source and reference frames (typically at 4x and 2x downscaling). It performs an initial, broad search on these low-resolution frames to find a rough approximation of where a block has moved.
Full-Resolution Search (Fine Stage): Using the rough coordinates identified in the downscaled stage as a starting point, the encoder scales the search back up to the original frame resolution. It performs a highly localized search to pinpoint the exact motion vector.

This hierarchical progression dramatically reduces the search area, allowing librav1e to calculate motion vectors quickly with little loss of accuracy compared with an exhaustive full-resolution search.
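The coarse-to-fine idea can be sketched in a few lines of Rust. This is a minimal illustration, not rav1e's code: the `Plane` type and the functions `downscale_2x`, `build_pyramid`, and `upscale_mv` are hypothetical names, and a plain 2x2 box filter is assumed for downscaling.

```rust
/// A single 8-bit plane stored row-major. Hypothetical type for illustration.
#[derive(Clone, Debug, PartialEq)]
pub struct Plane {
    pub width: usize,
    pub height: usize,
    pub data: Vec<u8>,
}

/// Halve both dimensions by averaging each 2x2 group of samples (box filter).
/// Assumes even width and height to keep the sketch short.
pub fn downscale_2x(src: &Plane) -> Plane {
    let (w, h) = (src.width / 2, src.height / 2);
    let mut data = Vec::with_capacity(w * h);
    for y in 0..h {
        for x in 0..w {
            let mut sum = 0u32;
            for dy in 0..2 {
                for dx in 0..2 {
                    sum += src.data[(2 * y + dy) * src.width + (2 * x + dx)] as u32;
                }
            }
            // Rounded average of the four samples.
            data.push(((sum + 2) / 4) as u8);
        }
    }
    Plane { width: w, height: h, data }
}

/// Build the three levels used by the search: full, 2x and 4x downscaled.
pub fn build_pyramid(full: Plane) -> [Plane; 3] {
    let half = downscale_2x(&full);
    let quarter = downscale_2x(&half);
    [full, half, quarter]
}

/// Map a motion vector found at a coarse level onto a finer level.
/// `factor` is the resolution ratio between the two levels (2 or 4 here).
pub fn upscale_mv(mv: (i32, i32), factor: i32) -> (i32, i32) {
    (mv.0 * factor, mv.1 * factor)
}
```

A vector of (3, -2) found on the 4x downscaled plane would seed the full-resolution search at (12, -8), so the fine stage only has to examine a small neighborhood around that point.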

Block-Matching and Search Patterns

Within each stage of HME, librav1e uses block-matching algorithms to compare a block in the current frame against candidate blocks in the reference frames. The encoder generates Motion Vector Predictors (MVPs) based on the movement of neighboring blocks. Starting from these predictor coordinates, it executes structured search patterns:

Diamond and Hexagon Searches: Instead of checking every coordinate in a square grid, the encoder checks points in a diamond or hexagonal shape. If the best match is on the edge of the pattern, the center of the pattern shifts to that new point, and the process repeats.
Early Termination: To save CPU cycles, librav1e implements early termination heuristics. If a search point yields a match that is “good enough” based on a predefined threshold, the encoder stops searching and accepts that vector.
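The re-centering pattern search and the early exit can be sketched as follows. This is an illustrative sketch under stated assumptions, not rav1e's implementation: it uses a small four-point diamond, a caller-supplied cost function, and the hypothetical parameters `range`, `good_enough`, and `max_steps`.

```rust
/// Small-diamond offsets (dx, dy) around the current center.
const DIAMOND: [(i32, i32); 4] = [(0, -1), (-1, 0), (1, 0), (0, 1)];

/// Iterative diamond search starting from `start` (for example an MV predictor).
/// `cost` returns the matching cost of a candidate vector (for example SAD).
/// `range` limits how far a candidate may move from `start` on each axis.
/// `good_enough` is a hypothetical early-termination threshold.
/// `max_steps` bounds the number of re-centering steps.
pub fn diamond_search<F>(
    start: (i32, i32),
    range: i32,
    good_enough: u32,
    max_steps: usize,
    cost: F,
) -> ((i32, i32), u32)
where
    F: Fn((i32, i32)) -> u32,
{
    let mut best = start;
    let mut best_cost = cost(start);
    for _ in 0..max_steps {
        // Early termination: accept the current vector if it is good enough.
        if best_cost <= good_enough {
            break;
        }
        let center = best;
        for &(dx, dy) in DIAMOND.iter() {
            let cand = (center.0 + dx, center.1 + dy);
            if (cand.0 - start.0).abs() > range || (cand.1 - start.1).abs() > range {
                continue;
            }
            let c = cost(cand);
            if c < best_cost {
                best = cand;
                best_cost = c;
            }
        }
        // No point on the diamond improved on the center: the search has converged.
        if best == center {
            break;
        }
    }
    (best, best_cost)
}
```

With a cost function whose minimum lies at (3, -2), a search from (0, 0) walks there one diamond step at a time. Raising `good_enough` makes it stop earlier at a slightly worse vector, which is the speed-versus-quality trade the heuristic is making.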

Distortion Metrics and Cost Evaluation

To determine which motion vector is “optimal,” librav1e must calculate the mathematical difference (distortion) between the target block and the reference block, balanced against the bitrate cost of encoding the motion vector itself. It uses three primary metrics:

Sum of Absolute Differences (SAD)

SAD is the simplest and fastest metric. It sums the absolute differences between the sample values of corresponding pixels in the target and reference blocks. Because it requires no transforms, librav1e uses SAD during the early, coarse phases of HME.
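As a concrete reference, SAD over two equally sized 8-bit blocks can be written in a few lines. The function name `sad` and the flat row-major slice layout are assumptions made for this sketch. Production encoders typically use SIMD-optimized versions of the same computation.

```rust
/// Sum of absolute differences between two equally sized 8-bit blocks,
/// both given as row-major slices.
pub fn sad(cur: &[u8], reference: &[u8]) -> u32 {
    assert_eq!(cur.len(), reference.len());
    cur.iter()
        .zip(reference.iter())
        .map(|(&a, &b)| (a as i32 - b as i32).abs() as u32)
        .sum()
}
```

For the blocks `[10, 20, 30, 40]` and `[12, 18, 30, 45]` the per-sample differences are 2, 2, 0 and 5, giving a SAD of 9.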

Sum of Absolute Transformed Differences (SATD)

For mid-to-late stage refinement, librav1e uses SATD. This metric applies a Hadamard transform to the pixel differences, which cheaply approximates a frequency-domain representation of the residual. Because the encoder ultimately transforms and quantizes the residual, SATD tracks the real coding cost more closely than SAD, at a somewhat higher computational cost.
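A minimal sketch of SATD for a single 4x4 block is shown below. The names `hadamard4` and `satd4x4` are hypothetical, the block size is fixed at 4x4 for brevity, and no normalization is applied to the result, whereas real encoders usually scale it and handle larger block sizes.

```rust
/// 4-point unnormalized Hadamard transform computed with butterflies.
fn hadamard4(v: [i32; 4]) -> [i32; 4] {
    let (a, b, c, d) = (v[0] + v[1], v[0] - v[1], v[2] + v[3], v[2] - v[3]);
    [a + c, b + d, a - c, b - d]
}

/// SATD of a 4x4 block: transform the residual along rows, then along
/// columns, and sum the absolute values of the resulting coefficients.
pub fn satd4x4(cur: &[u8; 16], reference: &[u8; 16]) -> u32 {
    let mut rows = [[0i32; 4]; 4];
    for y in 0..4 {
        let mut residual = [0i32; 4];
        for x in 0..4 {
            residual[x] = cur[y * 4 + x] as i32 - reference[y * 4 + x] as i32;
        }
        rows[y] = hadamard4(residual);
    }
    let mut total = 0u32;
    for x in 0..4 {
        let col = hadamard4([rows[0][x], rows[1][x], rows[2][x], rows[3][x]]);
        for c in col.iter() {
            total += c.abs() as u32;
        }
    }
    total
}
```

In this unnormalized form, a residual of 1 on every sample and a residual of 1 on a single sample both score 16, while their SADs are 16 and 1. The flat residual collapses into one coefficient and the isolated spike spreads across all sixteen, which reflects that the spike is comparatively expensive to code.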

Rate-Distortion Optimization (RDO)

In the final decision-making phase, librav1e evaluates the actual rate-distortion cost. This calculation balances the quality loss (distortion) against the number of bits required to store the motion vector (rate). The motion vector that yields the lowest overall RDO cost is selected as the optimal vector.
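The trade-off is commonly written as a Lagrangian cost J = D + lambda * R, where D is distortion, R is the rate in bits, and lambda sets how much a bit is worth. The sketch below illustrates that selection step only. The rate model in `component_bits` is a hypothetical stand-in shaped like an Exp-Golomb code length, not AV1's adaptive entropy coder, and the integer `lambda` is an assumption made to keep the arithmetic exact.

```rust
/// Hypothetical rate estimate in bits for one MV component, coded as a
/// difference from the predictor. Larger differences cost more bits.
fn component_bits(diff: i32) -> u32 {
    let mag = diff.abs() as u32;
    let prefix = 31 - (mag + 1).leading_zeros(); // floor(log2(mag + 1))
    // Prefix and suffix bits, plus one sign bit for nonzero differences.
    2 * prefix + 1 + (diff != 0) as u32
}

/// Estimated bits needed to code `mv` relative to the predictor `pred`.
pub fn mv_rate_bits(mv: (i32, i32), pred: (i32, i32)) -> u32 {
    component_bits(mv.0 - pred.0) + component_bits(mv.1 - pred.1)
}

/// Lagrangian rate-distortion cost J = D + lambda * R.
pub fn rd_cost(distortion: u32, rate_bits: u32, lambda: u32) -> u64 {
    distortion as u64 + lambda as u64 * rate_bits as u64
}

/// Pick the candidate with the lowest RD cost. Each candidate pairs a motion
/// vector with its distortion. Returns `None` for an empty candidate list.
pub fn select_mv(
    candidates: &[((i32, i32), u32)],
    pred: (i32, i32),
    lambda: u32,
) -> Option<((i32, i32), u64)> {
    candidates
        .iter()
        .map(|&(mv, dist)| (mv, rd_cost(dist, mv_rate_bits(mv, pred), lambda)))
        .min_by_key(|&(_, j)| j)
}
```

Given a predictor of (0, 0) and two candidates, (0, 0) with distortion 100 and (3, 0) with distortion 60, a lambda of 10 selects (0, 0) because the extra bits for (3, 0) outweigh its lower distortion. A lambda of 1 selects (3, 0) instead.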

Sub-Pixel Refinement

Objects in a video do not always move in perfect pixel increments. To account for fractional movement, the AV1 standard supports sub-pixel motion vectors down to 1/8th-pel (one-eighth of a pixel) accuracy.

Once librav1e finds the best full-pixel motion vector, it performs sub-pixel refinement:

1. Interpolation: The encoder interpolates the reference frame to construct virtual fractional pixels using specialized AV1 interpolation filters.
2. Fractional Search: It searches a tiny grid around the optimal full-pixel vector at 1/2-pixel accuracy, then 1/4-pixel accuracy, and finally 1/8-pixel accuracy.
3. Selection: The sub-pixel offset that yields the lowest SATD or RDO cost is chosen, resulting in smooth, highly compressed motion representation.
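The three refinement passes can be sketched as below. Two simplifications are assumed and should not be read as rav1e's behavior: bilinear interpolation stands in for AV1's longer separable interpolation filters, and SAD stands in for the SATD or RDO cost. The `RefPlane` type and the function names are hypothetical. Motion vectors are expressed in 1/8-pel units.

```rust
/// Row-major 8-bit reference plane (hypothetical type for illustration).
pub struct RefPlane<'a> {
    pub width: usize,
    pub height: usize,
    pub data: &'a [u8],
}

impl<'a> RefPlane<'a> {
    /// Full-pel sample with coordinates clamped to the plane edges.
    fn at(&self, x: i32, y: i32) -> i32 {
        let xc = x.max(0).min(self.width as i32 - 1) as usize;
        let yc = y.max(0).min(self.height as i32 - 1) as usize;
        self.data[yc * self.width + xc] as i32
    }

    /// Bilinear sample at a position given in 1/8-pel units.
    fn at_eighth(&self, x8: i32, y8: i32) -> i32 {
        let (xi, yi) = (x8 >> 3, y8 >> 3); // integer part (floors negatives)
        let (fx, fy) = (x8 & 7, y8 & 7); // fractional part in eighths
        let top = self.at(xi, yi) * (8 - fx) + self.at(xi + 1, yi) * fx;
        let bot = self.at(xi, yi + 1) * (8 - fx) + self.at(xi + 1, yi + 1) * fx;
        (top * (8 - fy) + bot * fy + 32) >> 6
    }
}

/// SAD between a `size` x `size` source block and the reference block whose
/// top-left corner is (`bx`, `by`) displaced by `mv8` (1/8-pel units).
fn subpel_sad(cur: &[u8], size: usize, r: &RefPlane, bx: i32, by: i32, mv8: (i32, i32)) -> u32 {
    let mut total = 0u32;
    for y in 0..size {
        for x in 0..size {
            let p = r.at_eighth((bx + x as i32) * 8 + mv8.0, (by + y as i32) * 8 + mv8.1);
            total += (cur[y * size + x] as i32 - p).abs() as u32;
        }
    }
    total
}

/// Refine a full-pel vector in three passes: 1/2, 1/4, then 1/8 pel.
/// Each pass tests the 3x3 grid around the best vector found so far.
pub fn refine_subpel(
    cur: &[u8],
    size: usize,
    r: &RefPlane,
    bx: i32,
    by: i32,
    full_mv: (i32, i32),
) -> ((i32, i32), u32) {
    let mut best = (full_mv.0 * 8, full_mv.1 * 8);
    let mut best_cost = subpel_sad(cur, size, r, bx, by, best);
    for &step in [4, 2, 1].iter() {
        let center = best;
        for dy in -1..=1 {
            for dx in -1..=1 {
                let cand = (center.0 + dx * step, center.1 + dy * step);
                let c = subpel_sad(cur, size, r, bx, by, cand);
                if c < best_cost {
                    best = cand;
                    best_cost = c;
                }
            }
        }
    }
    (best, best_cost)
}
```

Each pass halves the step size, so at most 27 fractional candidates are evaluated in addition to the starting vector, instead of every 1/8-pel position around it. Ties keep the first candidate found. A real encoder would normally break them using the rate term.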