How librav1e Calculates Structural Similarity

This article explores how librav1e, the library interface of rav1e (an AV1 video encoder written in Rust), calculates structural similarity (SSIM) to evaluate video quality. We will examine the mathematical principles behind SSIM and Multi-Scale SSIM (MS-SSIM), how the encoder uses these metrics for rate-distortion optimization, and the performance optimizations that keep the calculation fast enough to run during encoding.

The Purpose of SSIM in rav1e

Video encoders constantly decide how to compress each block of a frame. librav1e uses Structural Similarity (SSIM) and Multi-Scale SSIM (MS-SSIM) as perceptual metrics to guide these decisions. Unlike Peak Signal-to-Noise Ratio (PSNR), which is derived from the mean squared error between corresponding pixels, SSIM measures changes in luminance, contrast, and structural information. These measurements tend to track perceived quality more closely than raw pixel error, allowing the encoder to allocate bitrate where it matters most to the viewer.

The Core Mathematical Calculation

To calculate SSIM, librav1e compares local windows of a reference frame (\(x\)) with a distorted frame (\(y\)). The calculation is split into three distinct comparison measurements:

Luminance (\(l\)): Compares the average intensity of the two windows using their means (\(\mu_x\) and \(\mu_y\)).
Contrast (\(c\)): Compares the contrast of the two windows using their standard deviations (\(\sigma_x\) and \(\sigma_y\)).
Structure (\(s\)): Compares the structural association using the covariance (\(\sigma_{xy}\)).
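
For reference, the three comparisons are conventionally written as follows in the original SSIM formulation. The constants \(C_1\), \(C_2\), and \(C_3\) are small stabilizing terms; \(C_3\) comes from that formulation and is not named elsewhere in this article.

\[l(x, y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(x, y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}\]

With the usual choice \(C_3 = C_2 / 2\), the product \(c \cdot s\) simplifies to \((2\sigma_{xy} + C_2) / (\sigma_x^2 + \sigma_y^2 + C_2)\), so the combined expression \(l \cdot c \cdot s\) has only two factors.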

These individual components are combined into the standard SSIM formula:

\[SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}\]

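In this formula, \(C_1\) and \(C_2\) are stabilization constants that prevent division by zero, and unstable results, when a denominator is close to zero. They scale with the dynamic range \(L\) of the pixel values, which is \(2^b - 1\) for \(b\)-bit video (255 for 8-bit, 1023 for 10-bit). The conventional definitions are \(C_1 = (K_1 L)^2\) and \(C_2 = (K_2 L)^2\) with \(K_1 = 0.01\) and \(K_2 = 0.03\).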
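
The formula can be illustrated with a minimal Rust sketch that evaluates it for a single window. This is not librav1e's code: the function names `ssim_constants` and `ssim_window` are hypothetical, the samples are passed as flat `u16` slices, and the constants use the conventional \(K_1 = 0.01\) and \(K_2 = 0.03\), which may differ from what the encoder actually uses.

```rust
/// Stabilization constants (C1, C2) for a given bit depth.
/// K1 = 0.01 and K2 = 0.03 are the conventional values, assumed here.
fn ssim_constants(bit_depth: u32) -> (f64, f64) {
    let range = ((1u32 << bit_depth) - 1) as f64; // 255 for 8-bit, 1023 for 10-bit
    ((0.01 * range).powi(2), (0.03 * range).powi(2))
}

/// SSIM between two equally sized windows of samples (hypothetical helper).
fn ssim_window(x: &[u16], y: &[u16], bit_depth: u32) -> f64 {
    assert!(!x.is_empty() && x.len() == y.len());
    let n = x.len() as f64;
    let (c1, c2) = ssim_constants(bit_depth);

    // Means feed the luminance comparison.
    let mean_x = x.iter().map(|&v| f64::from(v)).sum::<f64>() / n;
    let mean_y = y.iter().map(|&v| f64::from(v)).sum::<f64>() / n;

    // Variances feed the contrast comparison, covariance the structure one.
    let (mut var_x, mut var_y, mut cov) = (0.0f64, 0.0f64, 0.0f64);
    for (&a, &b) in x.iter().zip(y) {
        let (dx, dy) = (f64::from(a) - mean_x, f64::from(b) - mean_y);
        var_x += dx * dx;
        var_y += dy * dy;
        cov += dx * dy;
    }
    var_x /= n;
    var_y /= n;
    cov /= n;

    ((2.0 * mean_x * mean_y + c1) * (2.0 * cov + c2))
        / ((mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2))
}
```

Identical windows score 1, and the score drops as the means, variances, or covariance of the two windows diverge. A window whose structure is inverted relative to the reference can even score below zero, because the covariance becomes negative.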

Block-by-Block Evaluation

librav1e does not compute a global SSIM score for the whole frame in one step. Instead, it computes SSIM locally using a sliding window, typically over \(8 \times 8\) pixel blocks.

A local SSIM index is calculated for each overlapping window across the frame, and these local values are then averaged into a single quality score for the frame. Because the score is built from local values, the encoder can identify the specific regions where compression artifacts are most visible, which enables more precise Rate-Distortion Optimization (RDO) decisions at the block level.
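
The sliding-window averaging can be sketched as follows. This is an illustration under stated assumptions rather than librav1e's implementation: the planes are 8-bit, row-major, and tightly packed, the helper names `window_ssim` and `mean_ssim` are hypothetical, and the window size and step are left as parameters because the exact overlap is not specified here.

```rust
const C1: f64 = 6.5025; // (0.01 * 255)^2, assuming 8-bit samples
const C2: f64 = 58.5225; // (0.03 * 255)^2, assuming 8-bit samples

/// SSIM of the `win` x `win` window whose top-left corner is (x0, y0).
fn window_ssim(r: &[u8], d: &[u8], stride: usize, x0: usize, y0: usize, win: usize) -> f64 {
    let (mut sx, mut sy, mut sxx, mut syy, mut sxy) = (0.0f64, 0.0f64, 0.0f64, 0.0f64, 0.0f64);
    for row in y0..y0 + win {
        for col in x0..x0 + win {
            let a = f64::from(r[row * stride + col]);
            let b = f64::from(d[row * stride + col]);
            sx += a;
            sy += b;
            sxx += a * a;
            syy += b * b;
            sxy += a * b;
        }
    }
    let n = (win * win) as f64;
    let (mx, my) = (sx / n, sy / n);
    let (vx, vy, cov) = (sxx / n - mx * mx, syy / n - my * my, sxy / n - mx * my);
    ((2.0 * mx * my + C1) * (2.0 * cov + C2)) / ((mx * mx + my * my + C1) * (vx + vy + C2))
}

/// Mean of the local SSIM values over all windows, advancing by `step`
/// pixels. Returns None if the frame is smaller than one window.
fn mean_ssim(r: &[u8], d: &[u8], width: usize, height: usize, win: usize, step: usize) -> Option<f64> {
    let len = width * height;
    if win == 0 || step == 0 || width < win || height < win || r.len() < len || d.len() < len {
        return None;
    }
    let (mut total, mut count) = (0.0f64, 0usize);
    for y0 in (0..=height - win).step_by(step) {
        for x0 in (0..=width - win).step_by(step) {
            total += window_ssim(r, d, width, x0, y0, win);
            count += 1;
        }
    }
    Some(total / count as f64)
}
```

A step smaller than the window size gives overlapping windows; for example, `mean_ssim(&reference, &distorted, 16, 16, 8, 4)` averages nine local values on a 16 x 16 plane. Keeping the per-window values instead of only their mean would give the local map that block-level decisions can draw on.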

Multi-Scale SSIM (MS-SSIM) Integration

For a more robust evaluation, librav1e frequently relies on MS-SSIM. This metric evaluates quality across multiple image scales to account for varying viewing distances and resolutions. The calculation involves the following steps:

Downsampling: The reference and distorted frames are progressively downsampled by factors of two, creating a multi-scale pyramid of images.
Scale-Specific Evaluation: Contrast (\(c\)) and structure (\(s\)) comparisons are computed at each scale.
Luminance Evaluation: Luminance (\(l\)) is computed only at the lowest resolution (the most downsampled scale).
Weighted Combination: The results are combined using scale-specific weights to reflect how human eyes prioritize different frequencies of visual detail.
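
The four steps above can be sketched in Rust as follows. Several simplifications are assumptions made to keep the example short and are not taken from librav1e: the downsampler is a plain 2x2 box average, each scale is treated as a single window instead of averaging many local windows, the constants assume 8-bit samples, and the five weights are those published in the original MS-SSIM paper. All function names are hypothetical.

```rust
// Per-scale exponents from the original MS-SSIM paper, finest scale first.
// Assumed here; an encoder may use different weights.
const WEIGHTS: [f64; 5] = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333];
const C1: f64 = 6.5025; // (0.01 * 255)^2, assuming 8-bit samples
const C2: f64 = 58.5225; // (0.03 * 255)^2, assuming 8-bit samples

/// Halve a plane in each dimension with a 2x2 box average (a stand-in for
/// whatever low-pass filter a real implementation uses).
fn downsample(p: &[f64], w: usize, h: usize) -> (Vec<f64>, usize, usize) {
    let (nw, nh) = (w / 2, h / 2);
    let mut out = Vec::with_capacity(nw * nh);
    for y in 0..nh {
        for x in 0..nw {
            let i = 2 * y * w + 2 * x;
            out.push((p[i] + p[i + 1] + p[i + w] + p[i + w + 1]) / 4.0);
        }
    }
    (out, nw, nh)
}

/// Returns (luminance, contrast * structure), treating the whole plane as
/// one window to keep the sketch short.
fn l_and_cs(x: &[f64], y: &[f64]) -> (f64, f64) {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut vx, mut vy, mut cov) = (0.0f64, 0.0f64, 0.0f64);
    for (&a, &b) in x.iter().zip(y) {
        vx += (a - mx) * (a - mx);
        vy += (b - my) * (b - my);
        cov += (a - mx) * (b - my);
    }
    let l = (2.0 * mx * my + C1) / (mx * mx + my * my + C1);
    let cs = (2.0 * cov / n + C2) / (vx / n + vy / n + C2);
    (l, cs)
}

/// MS-SSIM over five scales; planes must be at least 16 x 16.
fn ms_ssim(x: &[f64], y: &[f64], w: usize, h: usize) -> f64 {
    assert!(w >= 16 && h >= 16 && x.len() == w * h && y.len() == w * h);
    let (mut x, mut y, mut w, mut h) = (x.to_vec(), y.to_vec(), w, h);
    let mut score = 1.0f64;
    for (scale, &weight) in WEIGHTS.iter().enumerate() {
        let (l, cs) = l_and_cs(&x, &y);
        // Clamp so a negative term cannot produce NaN under a fractional power.
        score *= cs.max(0.0).powf(weight);
        if scale == WEIGHTS.len() - 1 {
            // Luminance enters only at the coarsest scale.
            score *= l.max(0.0).powf(weight);
        } else {
            let (nx, nw, nh) = downsample(&x, w, h);
            let (ny, _, _) = downsample(&y, w, h);
            x = nx;
            y = ny;
            w = nw;
            h = nh;
        }
    }
    score
}
```

Because the weights act as exponents in a product, a scale with a larger weight pulls the final score down more strongly when its contrast and structure term falls below 1.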

Hardware-Accelerated Implementation in Rust

Calculating SSIM for every frame is computationally intensive, so librav1e relies on Rust’s combination of memory safety and native performance to keep the cost manageable.

The encoder uses heavily optimized assembly paths built on SIMD (Single Instruction, Multiple Data) instruction sets such as AVX2 and NEON. SIMD instructions operate on several pixels at once, which speeds up the sums behind the mean, variance, and covariance calculations. This keeps quality evaluation from becoming a performance bottleneck in the AV1 encoding pipeline.
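
To show why these statistics suit SIMD, here is a portable Rust sketch of the accumulation step. It is not rav1e's assembly and does not use intrinsics: the `BlockSums` type and both functions are hypothetical, and the block size (8 x 8) and 8-bit constants are assumptions. The point is the loop shape, a single pass with five independent integer accumulators, which a compiler can auto-vectorize and which hand-written kernels typically mirror.

```rust
/// Raw sums over one 8x8 block of 8-bit samples (hypothetical type).
#[derive(Debug, PartialEq)]
struct BlockSums {
    sx: u32,
    sy: u32,
    sxx: u32,
    syy: u32,
    sxy: u32,
}

/// One pass, five independent accumulators, no branches. The largest
/// possible sum is 64 * 255 * 255, which fits comfortably in a u32.
fn block_sums(x: &[u8; 64], y: &[u8; 64]) -> BlockSums {
    let mut s = BlockSums { sx: 0, sy: 0, sxx: 0, syy: 0, sxy: 0 };
    for (&a, &b) in x.iter().zip(y.iter()) {
        let (a, b) = (u32::from(a), u32::from(b));
        s.sx += a;
        s.sy += b;
        s.sxx += a * a;
        s.syy += b * b;
        s.sxy += a * b;
    }
    s
}

/// Turns the raw sums into an SSIM value; floating point is needed only
/// in this final step, once per block.
fn ssim_from_sums(s: &BlockSums) -> f64 {
    const N: f64 = 64.0;
    const C1: f64 = 6.5025; // (0.01 * 255)^2, assuming 8-bit samples
    const C2: f64 = 58.5225; // (0.03 * 255)^2, assuming 8-bit samples
    let (mx, my) = (f64::from(s.sx) / N, f64::from(s.sy) / N);
    let vx = f64::from(s.sxx) / N - mx * mx;
    let vy = f64::from(s.syy) / N - my * my;
    let cov = f64::from(s.sxy) / N - mx * my;
    ((2.0 * mx * my + C1) * (2.0 * cov + C2)) / ((mx * mx + my * my + C1) * (vx + vy + C2))
}
```

Separating the integer accumulation from the final division means the per-pixel work consists only of widening, multiplying, and adding, which are the operations that vector instruction sets handle many lanes at a time.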