How librav1e Calculates Structural Similarity
This article explores how librav1e, the library
interface of the Rust-written AV1 video encoder rav1e,
calculates structural similarity (SSIM) to evaluate video quality. We
will examine the mathematical principles behind SSIM and Multi-Scale
SSIM (MS-SSIM), how the encoder uses these metrics for
rate-distortion optimization, and the performance optimizations that
keep the calculation fast enough to run during encoding.
The Purpose of SSIM in rav1e
Video encoders must constantly make decisions about how to compress
blocks of video frame data. librav1e uses Structural
Similarity (SSIM) and Multi-Scale SSIM (MS-SSIM) as perceptual metrics
to guide these decisions. Unlike Peak Signal-to-Noise Ratio (PSNR),
which is derived from the mean squared error between corresponding
pixels, SSIM measures changes in structural information, luminance, and
contrast. This approach tracks how the human visual system perceives
video quality more closely than PSNR does, allowing the encoder to
allocate bitrate where it matters most to the viewer.
The Core Mathematical Calculation
To calculate SSIM, librav1e compares local windows of a
reference frame (\(x\)) with a
distorted frame (\(y\)). The
calculation is split into three distinct comparison measurements:
- Luminance (\(l\)): Compares the average intensity of the two windows using their means (\(\mu_x\) and \(\mu_y\)).
- Contrast (\(c\)): Compares the contrast of the two windows using their standard deviations (\(\sigma_x\) and \(\sigma_y\)).
- Structure (\(s\)): Compares the structural correlation of the two windows using their covariance (\(\sigma_{xy}\)).
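For reference, the SSIM literature (Wang et al., 2004) conventionally writes the three comparisons as shown here. The constant \(C_3\) is not named elsewhere in this article; it is a third stabilizing constant that is usually set to \(C_2 / 2\):
\[l(x, y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}\]
\[c(x, y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\]
\[s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}\]
With \(C_3 = C_2 / 2\), the structure term equals \((2\sigma_{xy} + C_2) / (2\sigma_x\sigma_y + C_2)\), so the factor \(2\sigma_x\sigma_y + C_2\) cancels in the product of contrast and structure:
\[c(x, y) \cdot s(x, y) = \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\]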
These individual components are combined into the standard SSIM formula:
\[SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}\]
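In this formula, \(C_1\) and \(C_2\) are stabilization constants that prevent division by zero when a denominator is close to zero. They are derived from the dynamic range of the pixel values: conventionally \(C_1 = (K_1 L)^2\) and \(C_2 = (K_2 L)^2\), with \(K_1 = 0.01\), \(K_2 = 0.03\), and \(L\) the maximum pixel value (255 for 8-bit content, 1023 for 10-bit content).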
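The formula can be sketched in a few lines of Rust. This is a minimal illustration of the textbook calculation, not code taken from rav1e: the names `WindowStats`, `window_stats`, and `ssim_window` are hypothetical, and the constants assume the conventional \(K_1 = 0.01\) and \(K_2 = 0.03\).

```rust
/// Per-window statistics. Hypothetical helper type, not a rav1e API.
pub struct WindowStats {
    pub mean_x: f64,
    pub mean_y: f64,
    pub var_x: f64,
    pub var_y: f64,
    pub cov_xy: f64,
}

/// Compute means, population variances, and covariance of two windows.
pub fn window_stats(x: &[u16], y: &[u16]) -> WindowStats {
    assert_eq!(x.len(), y.len(), "windows must be the same size");
    assert!(!x.is_empty(), "windows must not be empty");
    let n = x.len() as f64;
    let mean_x = x.iter().map(|&v| f64::from(v)).sum::<f64>() / n;
    let mean_y = y.iter().map(|&v| f64::from(v)).sum::<f64>() / n;
    let (mut var_x, mut var_y, mut cov_xy) = (0.0, 0.0, 0.0);
    for (&a, &b) in x.iter().zip(y) {
        let dx = f64::from(a) - mean_x;
        let dy = f64::from(b) - mean_y;
        var_x += dx * dx;
        var_y += dy * dy;
        cov_xy += dx * dy;
    }
    WindowStats {
        mean_x,
        mean_y,
        var_x: var_x / n,
        var_y: var_y / n,
        cov_xy: cov_xy / n,
    }
}

/// SSIM of one window pair for the given bit depth (for example 8 or 10).
pub fn ssim_window(x: &[u16], y: &[u16], bit_depth: u32) -> f64 {
    // L is the largest representable pixel value at this bit depth.
    let l = f64::from((1u32 << bit_depth) - 1);
    // Assumed conventional constants: K1 = 0.01, K2 = 0.03.
    let c1 = (0.01 * l).powi(2);
    let c2 = (0.03 * l).powi(2);
    let s = window_stats(x, y);
    let numerator = (2.0 * s.mean_x * s.mean_y + c1) * (2.0 * s.cov_xy + c2);
    let denominator =
        (s.mean_x.powi(2) + s.mean_y.powi(2) + c1) * (s.var_x + s.var_y + c2);
    numerator / denominator
}
```

Two identical windows score 1, and the score drops as the windows diverge in mean, variance, or correlation.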
Block-by-Block Evaluation
librav1e does not calculate a global SSIM score for the whole frame
in one pass. Instead, it computes SSIM locally using a sliding
window approach, typically over \(8 \times 8\) pixel blocks.
A local SSIM index is calculated for each overlapping window across the frame. These local indices are then averaged to produce a single overall quality score for the frame. This localized assessment allows the encoder to identify the regions of a frame where compression artifacts are most visible, enabling more precise Rate-Distortion Optimization (RDO) decisions at the block level.
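The sliding-window averaging described above might look like the following sketch. It is an illustration under stated assumptions rather than rav1e's implementation: the function names `ssim_at` and `mean_ssim` are hypothetical, the input is a single 8-bit plane stored row by row, every pixel in a window is weighted equally, and the window size and step are parameters so that overlap can be chosen by the caller.

```rust
/// SSIM of the `win` x `win` window whose top-left corner is (x0, y0).
/// Assumes 8-bit samples, so the peak value L is 255.
fn ssim_at(
    reference: &[u8],
    distorted: &[u8],
    stride: usize,
    x0: usize,
    y0: usize,
    win: usize,
) -> f64 {
    let (c1, c2) = ((0.01f64 * 255.0).powi(2), (0.03f64 * 255.0).powi(2));
    let n = (win * win) as f64;
    let (mut sx, mut sy, mut sxx, mut syy, mut sxy) = (0.0, 0.0, 0.0, 0.0, 0.0);
    for row in y0..y0 + win {
        for col in x0..x0 + win {
            let a = f64::from(reference[row * stride + col]);
            let b = f64::from(distorted[row * stride + col]);
            sx += a;
            sy += b;
            sxx += a * a;
            syy += b * b;
            sxy += a * b;
        }
    }
    let (mx, my) = (sx / n, sy / n);
    // Variance and covariance from raw sums: E[ab] - E[a]E[b].
    let (vx, vy, cov) = (sxx / n - mx * mx, syy / n - my * my, sxy / n - mx * my);
    ((2.0 * mx * my + c1) * (2.0 * cov + c2))
        / ((mx * mx + my * my + c1) * (vx + vy + c2))
}

/// Average the local SSIM indices of all windows placed every `step` pixels.
/// Returns `None` if the frame is smaller than one window.
pub fn mean_ssim(
    reference: &[u8],
    distorted: &[u8],
    width: usize,
    height: usize,
    win: usize,
    step: usize,
) -> Option<f64> {
    if win == 0 || step == 0 || width < win || height < win {
        return None;
    }
    assert_eq!(reference.len(), width * height);
    assert_eq!(distorted.len(), width * height);
    let mut total = 0.0;
    let mut count = 0usize;
    for y0 in (0..=height - win).step_by(step) {
        for x0 in (0..=width - win).step_by(step) {
            total += ssim_at(reference, distorted, width, x0, y0, win);
            count += 1;
        }
    }
    Some(total / count as f64)
}
```

Calling `mean_ssim` with `win = 8` and `step = 4` evaluates 8x8 windows that overlap by half in each direction. Keeping the per-window values instead of only their average would give the local map that block-level decisions can consult.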
Multi-Scale SSIM (MS-SSIM) Integration
For a more robust evaluation, librav1e also relies on MS-SSIM.
This metric evaluates quality across multiple image scales
to account for varying viewing distances and display resolutions. The
calculation involves the following steps:
- Downsampling: The reference and distorted frames are progressively low-pass filtered and downsampled by factors of two, creating a multi-scale pyramid of images.
- Scale-Specific Evaluation: Contrast (\(c\)) and structure (\(s\)) comparisons are computed at each scale.
- Luminance Evaluation: Luminance (\(l\)) is computed only at the lowest resolution (the most downsampled scale).
- Weighted Combination: The per-scale results are raised to scale-specific exponents (weights) and multiplied together, reflecting how sensitive the human eye is to detail at different spatial frequencies.
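The four steps above can be sketched as follows. This is a simplified illustration and not rav1e's code: `Plane`, `downsample`, `mean_ssim_cs`, and `ms_ssim` are hypothetical names, a 2x2 box filter stands in for the low-pass filter, windows are non-overlapping 8x8 blocks, and the weights are the five values published in the original MS-SSIM paper (Wang, Simoncelli, and Bovik, 2003).

```rust
/// Scale weights from the original MS-SSIM paper, finest scale first.
pub const MS_SSIM_WEIGHTS: [f64; 5] = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333];

/// One image plane stored row by row. Hypothetical type for this sketch.
#[derive(Clone)]
pub struct Plane {
    pub data: Vec<f64>,
    pub width: usize,
    pub height: usize,
}

/// Halve the resolution with a 2x2 box filter (an assumed stand-in for the
/// low-pass filter used by reference implementations).
fn downsample(p: &Plane) -> Plane {
    let (w, h) = (p.width / 2, p.height / 2);
    let mut data = Vec::with_capacity(w * h);
    for y in 0..h {
        for x in 0..w {
            let i = 2 * y * p.width + 2 * x;
            let sum = p.data[i] + p.data[i + 1] + p.data[i + p.width] + p.data[i + p.width + 1];
            data.push(sum / 4.0);
        }
    }
    Plane { data, width: w, height: h }
}

/// Mean SSIM and mean contrast-structure (cs) over non-overlapping 8x8 windows.
fn mean_ssim_cs(a: &Plane, b: &Plane, peak: f64) -> (f64, f64) {
    const WIN: usize = 8;
    let (c1, c2) = ((0.01 * peak).powi(2), (0.03 * peak).powi(2));
    let n = (WIN * WIN) as f64;
    let (mut ssim_sum, mut cs_sum, mut count) = (0.0, 0.0, 0.0);
    for y0 in (0..=a.height - WIN).step_by(WIN) {
        for x0 in (0..=a.width - WIN).step_by(WIN) {
            let (mut sx, mut sy, mut sxx, mut syy, mut sxy) = (0.0, 0.0, 0.0, 0.0, 0.0);
            for y in y0..y0 + WIN {
                for x in x0..x0 + WIN {
                    let (p, q) = (a.data[y * a.width + x], b.data[y * b.width + x]);
                    sx += p;
                    sy += q;
                    sxx += p * p;
                    syy += q * q;
                    sxy += p * q;
                }
            }
            let (mx, my) = (sx / n, sy / n);
            let (vx, vy, cov) = (sxx / n - mx * mx, syy / n - my * my, sxy / n - mx * my);
            let l = (2.0 * mx * my + c1) / (mx * mx + my * my + c1);
            let cs = (2.0 * cov + c2) / (vx + vy + c2);
            ssim_sum += l * cs;
            cs_sum += cs;
            count += 1.0;
        }
    }
    (ssim_sum / count, cs_sum / count)
}

/// MS-SSIM: cs at every scale, luminance only at the coarsest scale.
pub fn ms_ssim(reference: &Plane, distorted: &Plane, peak: f64, weights: &[f64]) -> f64 {
    assert!(!weights.is_empty(), "at least one scale is required");
    assert_eq!((reference.width, reference.height), (distorted.width, distorted.height));
    let smallest = reference.width.min(reference.height) >> (weights.len() - 1);
    assert!(smallest >= 8, "frame too small for the requested number of scales");
    let (mut a, mut b) = (reference.clone(), distorted.clone());
    let mut score = 1.0;
    for (i, &w) in weights.iter().enumerate() {
        let last = i + 1 == weights.len();
        let (ssim, cs) = mean_ssim_cs(&a, &b, peak);
        // Full SSIM (including luminance) is used only at the coarsest scale.
        let term = if last { ssim } else { cs };
        // Clamp at zero so that a negative term cannot produce NaN in powf.
        score *= term.max(0.0).powf(w);
        if !last {
            a = downsample(&a);
            b = downsample(&b);
        }
    }
    score
}
```

With five scales and 8x8 windows, this sketch needs frames of at least 128x128 pixels, because the coarsest scale must still hold one full window.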
Hardware-Accelerated Implementation in Rust
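Calculating SSIM for every frame is computationally intensive, because
each window requires sums, sums of squares, and cross-products over all
of its pixels. librav1e keeps this cost manageable by pairing safe Rust
code with hand-tuned low-level routines.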
The encoder uses hand-optimized assembly paths, including AVX2 and NEON SIMD (Single Instruction, Multiple Data) instructions, to compute the sums behind the mean, variance, and covariance for several pixels per instruction. This keeps quality evaluation from becoming a performance bottleneck in the AV1 encoding pipeline.
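The data-parallel pattern behind such kernels can be shown in portable Rust without any intrinsics. The sketch below is not rav1e's assembly and does not call AVX2 or NEON directly; `Sums`, `LANES`, and `accumulate` are hypothetical names. It keeps one accumulator per lane so that each loop iteration performs the same operation on eight independent pixels, which is the shape a SIMD instruction executes in one step and which compilers can often auto-vectorize.

```rust
/// Raw integer sums from which mean, variance, and covariance are derived.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Sums {
    pub x: u64,
    pub y: u64,
    pub xx: u64,
    pub yy: u64,
    pub xy: u64,
}

impl Sums {
    /// Covariance of the `n` accumulated pixel pairs: E[xy] - E[x]E[y].
    pub fn covariance(&self, n: usize) -> f64 {
        let n = n as f64;
        self.xy as f64 / n - (self.x as f64 / n) * (self.y as f64 / n)
    }
}

/// Number of pixels processed per step (assumed width for this sketch).
const LANES: usize = 8;

/// Accumulate all five sums in a single pass over both inputs.
pub fn accumulate(x: &[u8], y: &[u8]) -> Sums {
    assert_eq!(x.len(), y.len(), "inputs must be the same length");
    let x_chunks = x.chunks_exact(LANES);
    let y_chunks = y.chunks_exact(LANES);
    // Pixels left over when the length is not a multiple of LANES.
    let (x_tail, y_tail) = (x_chunks.remainder(), y_chunks.remainder());

    // One accumulator per lane for each of the five sums. A real SIMD kernel
    // would typically use narrower lanes sized to the block being measured.
    let mut acc = [[0u64; LANES]; 5];
    for (cx, cy) in x_chunks.zip(y_chunks) {
        for i in 0..LANES {
            let (a, b) = (u64::from(cx[i]), u64::from(cy[i]));
            acc[0][i] += a;
            acc[1][i] += b;
            acc[2][i] += a * a;
            acc[3][i] += b * b;
            acc[4][i] += a * b;
        }
    }

    // Horizontal reduction: fold the lanes into scalar totals.
    let mut sums = Sums {
        x: acc[0].iter().sum(),
        y: acc[1].iter().sum(),
        xx: acc[2].iter().sum(),
        yy: acc[3].iter().sum(),
        xy: acc[4].iter().sum(),
    };

    // Scalar tail for the remaining pixels.
    for (&a, &b) in x_tail.iter().zip(y_tail) {
        let (a, b) = (u64::from(a), u64::from(b));
        sums.x += a;
        sums.y += b;
        sums.xx += a * a;
        sums.yy += b * b;
        sums.xy += a * b;
    }
    sums
}
```

Working with integer sums and deferring the floating-point division to the end keeps the inner loop to additions and multiplications, which map directly onto vector instructions.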