How JPEG XS Works – A Simple Introduction

JPEG XS uses wavelet-based compression to deliver ultra-low-latency, visually lossless video. But what is the magic that makes it work?

At its core, JPEG XS is an extremely simple yet efficient compression-decompression system for image and video content. Although its algorithmic complexity is low by design, JPEG XS still has intrinsic subtleties and small details that must be handled carefully to deliver the required line-based latency with sufficient compression performance and pristine visual quality. Recall that JPEG XS is designed to replace uncompressed video without any compromise.

The figure below shows a high-level block diagram of the JPEG XS encoder. It processes individual images or video frames, called pictures from here on, line by line and outputs a self-contained codestream for each picture. Because of how XS is designed, a typical low-latency encoder-decoder setup can start outputting decoded lines after the encoder has processed roughly 5 to 32 lines (depending on the profile and the number of vertical wavelet levels), and it can sustain this pace for the entire picture.

This article will provide a description of the blocks that make up a typical JPEG XS encoder under an XS High profile. For clarity and brevity, it leaves the Temporal Differential Coding (TDC) mode that uses a compressed frame buffer for another article.

Pre-processing

The pre-processing block in the figure partitions the picture into slices and precincts (and optionally also into columns), with each slice representing 16 image lines.

A slice is further divided into 4, 8 or 16 precincts, depending on whether the number of vertical wavelet decompositions is 2, 1 or 0, respectively. Precincts represent the smallest visual unit of a picture, and slices are the smallest accessible block of packed information in an XS codestream. In other words, JPEG XS processes picture lines in a slice-by-slice fashion. The only exception is the vertical wavelet transform, which works on the whole picture and runs continuously across precinct and slice boundaries to avoid introducing border artifacts. Dividing pictures into independent slices of 16 lines each is a critical step in achieving the extremely low latencies of 32 lines or less.
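The slice and precinct arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the precinct height of 2 to the power of the vertical level count is inferred from the 4/8/16 figures in the text, and the handling of a final partial slice is an assumption rather than something taken from the standard.

```python
SLICE_HEIGHT = 16  # lines per slice, as stated in the text


def precinct_height(vertical_levels: int) -> int:
    """Picture lines covered by one precinct (assumed to be 2**levels)."""
    return 2 ** vertical_levels


def precincts_per_slice(vertical_levels: int) -> int:
    """Number of precincts in one full 16-line slice."""
    return SLICE_HEIGHT // precinct_height(vertical_levels)


def slice_layout(picture_height: int, vertical_levels: int):
    """Yield (slice_index, first_line, line_count, precinct_count).

    The last slice may be shorter than 16 lines; treating it as a
    partial slice is an assumption made for this sketch.
    """
    height = precinct_height(vertical_levels)
    for index, first in enumerate(range(0, picture_height, SLICE_HEIGHT)):
        lines = min(SLICE_HEIGHT, picture_height - first)
        precincts = -(-lines // height)  # ceiling division
        yield index, first, lines, precincts


if __name__ == "__main__":
    for levels in (2, 1, 0):
        print(levels, "vertical levels ->", precincts_per_slice(levels), "precincts per slice")
```

For a 1080-line picture with two vertical levels, this layout gives 67 full slices followed by one 8-line slice holding two precincts.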

The pre-processing block also handles the optional color transforms, providing a set of individual picture components as a result. In particular:

YC_bC_r input is kept as-is, and all three of the color channels will be further processed as individual picture components.
Regular RGB pictures can be converted to YC_bC_r components by means of a fully reversible color transform (RCT) and are then processed just like native YC_bC_r input.
An alpha channel is optional and will be treated as a fourth individual picture component.
Raw Bayer CFA (RGGB and variations) images are converted into decorrelated YC_bC_rY_h components by a special-purpose reversible color transformation called Star-Tetrix that works directly on the Bayer data – avoiding an expensive and irreversible debayering step. An optional DC level shift and non-linear transformation further allow XS to account for sensor characteristics. The Bayer profiles of JPEG XS employ a unique color transformation that deserves a dedicated article.
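To show what "fully reversible" means for the RGB case, here is a minimal sketch of an integer reversible color transform. It assumes the same RCT formulation known from JPEG 2000 (a floor-divided luma average plus two simple differences); the function names are hypothetical, and the exact component ordering in the XS specification may differ.

```python
def rct_forward(r: int, g: int, b: int) -> tuple[int, int, int]:
    """Integer RGB -> (Y, Cb, Cr), assuming the JPEG 2000-style RCT."""
    y = (r + 2 * g + b) >> 2   # floor of the weighted average
    cb = b - g
    cr = r - g
    return y, cb, cr


def rct_inverse(y: int, cb: int, cr: int) -> tuple[int, int, int]:
    """Exact inverse of rct_forward, using integer arithmetic only."""
    g = y - ((cb + cr) >> 2)   # >> floors for negative values in Python
    r = cr + g
    b = cb + g
    return r, g, b


if __name__ == "__main__":
    pixel = (200, 17, 93)
    assert rct_inverse(*rct_forward(*pixel)) == pixel
```

The inverse works because r + 2g + b equals (cb + cr) + 4g, so the floor division in the forward step can be undone exactly without any rounding loss.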

Discrete Wavelet Transformation (DWT)

JPEG XS uses exclusively the reversible biorthogonal Le Gall–Tabatabai (LGT) 5/3 discrete wavelet transform (DWT). This specific wavelet transform can be implemented in pure integer arithmetic, avoiding complex and inexact floating-point arithmetic. Being integer-based and reversible also means that the 5/3 DWT by itself is a lossless transformation.
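The integer-only, lossless nature of the 5/3 transform is easiest to see in its lifting form. The following is a minimal one-dimensional, single-level sketch using the standard 5/3 lifting steps with symmetric boundary extension. The function names are hypothetical, and the sketch ignores everything XS-specific, such as line-based scheduling and coefficient scaling.

```python
def _mirror(i: int, n: int) -> int:
    """Whole-sample symmetric extension of index i into [0, n)."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def dwt53_forward(x: list[int]) -> tuple[list[int], list[int]]:
    """One level of the reversible 5/3 DWT; returns (low, high)."""
    n = len(x)
    assert n >= 2, "sketch needs at least two samples"
    y = list(x)
    # Predict step: odd samples become high-pass coefficients.
    for i in range(1, n, 2):
        y[i] -= (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)]) >> 1
    # Update step: even samples become low-pass coefficients.
    for i in range(0, n, 2):
        y[i] += (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) >> 2
    return y[0::2], y[1::2]


def dwt53_inverse(low: list[int], high: list[int]) -> list[int]:
    """Exact inverse of dwt53_forward."""
    n = len(low) + len(high)
    y = [0] * n
    y[0::2] = low
    y[1::2] = high
    # Undo the update step, then the predict step.
    for i in range(0, n, 2):
        y[i] -= (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) >> 2
    for i in range(1, n, 2):
        y[i] += (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)]) >> 1
    return y


if __name__ == "__main__":
    samples = [3, 7, 1, 8, 2, 9, 4, 6, 5]
    low, high = dwt53_forward(samples)
    assert dwt53_inverse(low, high) == samples
```

Because each lifting step only adds or subtracts an integer computed from samples that the step itself leaves unchanged, the inverse can recompute the same integer and undo it exactly. A constant input, for example, produces an all-zero high-pass band.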

XS applies the wavelet transformation to each individual picture component repeatedly along the horizontal and, optionally, the vertical dimension, creating a multi-level representation. The figure shows a typical configuration using five horizontal and two vertical wavelet transformation levels, noted as (5, 2).

The wavelet transform takes the samples along its respective dimension and mathematically decorrelates that signal into a low-pass and a high-pass band, each containing one half of the signal. With each subsequent level, the low-pass coefficients are transformed again, while the high-pass coefficients are kept as-is. The result of the forward DWT is a set of so-called sub-bands, with the coefficients reorganized per precinct and grouped per slice, as shown in the right part of the figure. With a (5, 2) configuration for a YUV 4:4:4 or RGB picture, each precinct consists of 30 sub-bands that represent exactly 4 lines of the original picture.
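The figure of 30 sub-bands can be reproduced with a simple count. The sketch below assumes that every level with both a horizontal and a vertical decomposition contributes three detail bands, every remaining horizontal-only level contributes one, and a single low-pass band is left at the end. The function names are hypothetical, and the count only covers components without chroma subsampling.

```python
def subbands_per_component(h_levels: int, v_levels: int) -> int:
    """Sub-band count for one component under the stated assumptions."""
    if not 0 <= v_levels <= h_levels:
        raise ValueError("sketch assumes 0 <= v_levels <= h_levels")
    two_d_levels = v_levels                   # horizontal + vertical split
    horizontal_only = h_levels - v_levels     # horizontal split only
    return 3 * two_d_levels + horizontal_only + 1  # +1 for the final low band


def subbands_per_precinct(h_levels: int, v_levels: int, components: int = 3) -> int:
    """Total sub-bands when all components share the same decomposition."""
    return components * subbands_per_component(h_levels, v_levels)


if __name__ == "__main__":
    print(subbands_per_precinct(5, 2))  # (5, 2) with three components
```

For (5, 2) this gives 3 x 2 + 3 + 1 = 10 bands per component, and therefore 30 for three components, which agrees with the number quoted above.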

Quantization and rate control

The rate control block directly drives the quantization block to deliver a constant-bitrate codestream. Quantization is applied to the wavelet coefficients, reducing their precision before coding. Note that this is the only place in the entire JPEG XS algorithm where image information is actually removed. This is what makes lossy compression possible, and it is done under strict control to maintain visually lossless picture quality.

For each precinct, the rate control selects a Q value (the quantization factor) and an R value (the refinement-priority value). Using these two values in combination with a global band-weighting-priority table, a value T[b], called the truncation point, is calculated for every band b in the precinct. The truncation point T[b] represents the number of lower bit planes to remove (or “quantize”) from the coefficients in each band, as shown in the figure. After quantization, the coefficients are called quantization indices. Note that there is a truncation point for every sub-band in the precinct, yet only Q and R are signaled in the codestream.
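As an illustration of how a per-band truncation point could follow from just Q and R, the sketch below assumes that the band table stores a gain and a priority per band, that T[b] is Q minus the gain, and that bands whose priority is below R get one extra bit plane removed. This formula, the clamp to 15, the plain bit-plane-dropping quantizer, and all names are assumptions made for the example, not a transcription of the standard.

```python
def truncation_point(q: int, r: int, gain: int, priority: int, max_t: int = 15) -> int:
    """Assumed form: T[b] = clamp(Q - gain[b] + (priority[b] < R), 0, max_t)."""
    refinement = 1 if priority < r else 0
    return max(0, min(max_t, q - gain + refinement))


def quantize(coefficient: int, t: int) -> int:
    """Drop the t lowest bit planes of the magnitude, keeping the sign."""
    magnitude = abs(coefficient) >> t
    return -magnitude if coefficient < 0 else magnitude


if __name__ == "__main__":
    # Hypothetical band table: (gain, priority) for four bands.
    bands = [(3, 0), (2, 1), (1, 2), (0, 3)]
    q, r = 5, 2
    for b, (gain, priority) in enumerate(bands):
        t = truncation_point(q, r, gain, priority)
        print(f"band {b}: T = {t}, 37 -> {quantize(37, t)}")
```

The point of such a scheme is that the decoder can rebuild every T[b] from the two signaled values and the shared table, so no per-band quantization parameter has to be transmitted.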

Coding of quantization indices

The quantization indices are subsequently processed in groups of four. For each such group g, the bit-plane count M[g], the data bits, and the sign bits are encoded in a bitstream. For the bit-plane count, either raw or variable-length coding is used. The data bits and the sign bits of non-zero quantization indices are signaled raw, as-is. As a reminder, JPEG XS does not rely on any type of advanced entropy coding.
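A minimal sketch of the grouping step is shown below. It assumes that M[g] is simply the number of bits needed for the largest magnitude in the group, and it lists the magnitude bits per index rather than in the bit-plane order an actual codestream would use. The function name and the output layout are hypothetical.

```python
GROUP_SIZE = 4  # quantization indices per coding group, as stated in the text


def code_groups(indices: list[int]) -> list[tuple[int, list[str], list[int]]]:
    """Return (M[g], magnitude bit strings, sign bits) for each group.

    Sign bits (1 = negative) are listed only for non-zero indices.
    """
    groups = []
    for start in range(0, len(indices), GROUP_SIZE):
        group = indices[start:start + GROUP_SIZE]
        # Bit-plane count: bits needed for the largest magnitude in the group.
        m = max(abs(v) for v in group).bit_length()
        data = [format(abs(v), f"0{m}b") if m else "" for v in group]
        signs = [1 if v < 0 else 0 for v in group if v != 0]
        groups.append((m, data, signs))
    return groups


if __name__ == "__main__":
    for m, data, signs in code_groups([3, -1, 0, 4, 0, 0, 0, 0]):
        print(m, data, signs)
```

In this sketch an all-zero group costs nothing beyond its bit-plane count, which hints at why such a simple scheme can work without advanced entropy coding.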

Finally, the packetization block organizes the coded bits of each sub-band of each precinct into individual packets, along with the necessary precinct and packet headers. It then combines the precinct packets into a final XS codestream, interleaved with the relevant codestream markers to assist the decoder and provide it with the necessary metadata.

The actual placement of the sign bits can happen in two ways and is controlled by the sign packing Fs flag in the codestream header. Either sign bits are signaled as a part of the data bits, or alternatively they are placed in a separate sub-packet that follows right after the corresponding data sub-packet. Packing sign bits with the data bits is easier to implement but is slightly less efficient than placing them in a dedicated sign sub-packet.

Thus, as shown in the figure, the JPEG XS codestream consists of markers at the top level. The slice markers contain the actual encoded picture information, 16 lines each. This information is packed in individual data packets, each containing a specific piece of information needed to reconstruct the original picture samples. This packet-based organization of the encoded data allows JPEG XS to achieve extremely low latencies (below 32 lines) and gives implementers ample opportunity for parallelization. The latter is extremely important for real-time hardware and GPU implementations.