Codec Roadmap#

What compressors currently supports, how each codec is plumbed in, and what’s still on the to-do list. ✅ marks codecs with a working evaluation harness in this repo and at least one rate-distortion + encode-complexity sweep on Kodak under results/<codec>/.

Supported#

Conventional, via Pillow built-ins#

The compressors.pillow harness (src/compressors/pillow/) is a thin wrapper around PIL.Image.save(format=..., quality=q). It accepts any PIL-registered format and routes per-format kwargs through _quality_dispatch.QUALITY_DISPATCH.

✅ JPEG — libjpeg-turbo backend.
✅ AVIF — libavif (Pillow’s built-in or pillow-avif-plugin). Three speed settings evaluated (default, speed=0, speed=10) to characterize the full speed-quality envelope.

The harness also supports — but hasn’t been swept on Kodak yet — every other format Pillow ships, including:

⏳ WebP — libwebp (lossy + lossless).
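⏳ JPEG 2000 — OpenJPEG. Note: under Pillow’s default quality_mode="rates", each quality_layers value is an approximate compression ratio, so the scale is inverted relative to JPEG’s quality kwarg (higher value = lower output quality); already wired in _quality_dispatch.py.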
⏳ PNG / TIFF Adobe Deflate — lossless baselines.
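
As a rough illustration of per-format quality routing, the sketch below maps one 0-100 quality value to keyword arguments for PIL.Image.save. The names SKETCH_DISPATCH and save_kwargs, and the 100 / q mapping for JPEG 2000, are assumptions made for this example; the actual contents of _quality_dispatch.QUALITY_DISPATCH may look different. The keyword names themselves (quality, quality_mode, quality_layers, irreversible) are Pillow's documented save options.

```python
from typing import Callable, Dict


def _plain_quality(q: int) -> dict:
    # JPEG, WebP and AVIF take a direct `quality` kwarg: higher means better.
    return {"quality": int(q)}


def _jpeg2000_rate(q: int) -> dict:
    # In "rates" mode each quality_layers entry is an approximate compression
    # ratio, so a larger number means a smaller, lower-quality file.
    # The 100 / q mapping is a hypothetical choice, not the repo's.
    rate = 100.0 / max(int(q), 1)
    return {"quality_mode": "rates", "quality_layers": [rate], "irreversible": True}


# Hypothetical dispatch table: format name -> builder of save() kwargs.
SKETCH_DISPATCH: Dict[str, Callable[[int], dict]] = {
    "JPEG": _plain_quality,
    "WEBP": _plain_quality,
    "AVIF": _plain_quality,
    "JPEG2000": _jpeg2000_rate,
}


def save_kwargs(fmt: str, q: int) -> dict:
    """Return keyword arguments for PIL.Image.save(fp, **kwargs)."""
    # Formats without an entry (for example PNG) get no quality kwargs.
    builder = SKETCH_DISPATCH.get(fmt.upper())
    return {"format": fmt, **(builder(q) if builder else {})}
```

With this layout, a call such as image.save(buffer, **save_kwargs("JPEG2000", 50)) would request a single quality layer at roughly 2:1, while the same q passed for JPEG goes straight through as quality=50.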

Pillow plugins (third-party PIL backends)#

✅ JPEG-XL — via pillow-jxl-plugin (libjxl). Lives in src/compressors/jxl/, which is a thin wrapper that imports the plugin and forwards to the pillow harness with --format JXL.
✅ JPEG-LS — via pillow_jpls (CharLS). Used internally by FRAPPE, WaLLoC, and LiVeAction to entropy-code their integer latents. No standalone harness yet, but adding one is one line of glue.

Diffusers VQ autoencoders#

src/compressors/diffusers/ evaluates the four pretrained quantized autoencoders shipped in upstream diffusers. None of these has a native rate knob; quality is implemented as a target-pixel-ratio resize-down inside the encoder pipeline. See src/compressors/diffusers/_codec.py for the registry.

✅ LDM-SR (CompVis/ldm-super-resolution-4x-openimages) — VQModel, 4× downsample, codebook 8192.
✅ VQ-Diffusion ITHQ (microsoft/vq-diffusion-ithq) — VQModel, 8× downsample, codebook 4096.
✅ Kandinsky 2.1 MoVQ (kandinsky-community/kandinsky-2-1) — VQModel, 8× downsample, codebook 16384.
✅ Stable Cascade VQGAN (stabilityai/stable-cascade) — PaellaVQModel, 4× downsample, codebook 8192.
❌ Kandinsky 3 MoVQ — intentionally absent. The published kandinsky-community/kandinsky-3 movq/diffusion_pytorch_model.fp16.safetensors ships with quantize.embedding.weight all zeros; every index decodes to the zero vector and PSNR is pinned at ~12 dB. Re-add if HF fixes the upload.
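
The pixel-ratio quality knob can be sketched as follows. This is a minimal illustration, not the code in _codec.py: it assumes the resized image is snapped to a multiple of the downsample factor and that each codebook index is charged a fixed log2(codebook size) bits against the original pixel count. The function names are hypothetical.

```python
import math


def resized_shape(width: int, height: int, pixel_ratio: float, multiple: int):
    """Shrink (width, height) to about pixel_ratio of the pixels, snapped to `multiple`."""
    scale = math.sqrt(pixel_ratio)  # the same scale is applied to both axes
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h


def index_bpp(width, height, pixel_ratio, downsample, codebook_size):
    """Bits per ORIGINAL pixel when every index costs log2(codebook_size) bits."""
    new_w, new_h = resized_shape(width, height, pixel_ratio, downsample)
    n_indices = (new_w // downsample) * (new_h // downsample)
    return n_indices * math.log2(codebook_size) / (width * height)


# A 768x512 Kodak image through an 8x-downsample, 16384-entry codebook model.
native = index_bpp(768, 512, 1.0, downsample=8, codebook_size=16384)
quarter = index_bpp(768, 512, 0.25, downsample=8, codebook_size=16384)
```

Under these assumptions the native operating point costs 14 bits per 8×8 block, and a pixel ratio of 0.25 cuts the rate by a factor of four.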

Vendored learned codecs#

Upstream compressai pins legacy torch versions, so the minimal model + entropy-coding closure is vendored under src/compressors/compressai_baselines/ (BSD-3-Clause-Clear; see src/compressors/compressai_baselines/LICENSE). The CompressAI baselines compute bpp from forward-pass likelihoods (-log2(likelihoods).sum() / n_pixels); no _CXX rANS coder needed. MCUCoder reuses the same vendored CompressAI layers (AttentionBlock, conv, deconv) but ships its own model architecture and a per-channel min/max + Huffman entropy coder.
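
The likelihood-based rate estimate can be written out as a short sketch. NumPy arrays stand in for the tensors a CompressAI model returns under output["likelihoods"]; the helper name bpp_from_likelihoods and the toy shapes are assumptions for illustration only.

```python
import numpy as np


def bpp_from_likelihoods(likelihoods: dict, n_pixels: int) -> float:
    """Sum -log2(p) over every latent tensor, then divide by the image pixel count."""
    total_bits = sum(float(-np.log2(p).sum()) for p in likelihoods.values())
    return total_bits / n_pixels


# Toy stand-in for the "y" (main) and "z" (hyperprior) likelihoods of a 64x64 image.
likelihoods = {
    "y": np.full((1, 8, 4, 4), 0.5),   # 128 symbols at 1 bit each
    "z": np.full((1, 8, 1, 1), 0.25),  # 8 symbols at 2 bits each
}
bpp = bpp_from_likelihoods(likelihoods, n_pixels=64 * 64)
```

This is an information-theoretic estimate rather than a measured byte count; an actual range coder would land slightly above it.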

✅ cheng2020-anchor — Cheng, Sun, Takeuchi, Katto, CVPR 2020, GMM hyperprior without attention. 6 quality levels.
⏳ mbt2018 — Minnen, Ballé, Toderici, NeurIPS 2018, joint autoregressive + hierarchical priors. 8 quality levels. Sweep in flight (q=7,8 are slow on CPU due to the autoregressive context model).
✅ MCUCoder — Hojjat, Haberer, Landsiedel, DAGM-GCPR 2025. Asymmetric variable-rate codec with a tiny (~3-conv) encoder and a heavy AttentionBlock-based decoder; 12 native quality levels via model.p = used_filter / 12. MIT, Kiel University. MS-SSIM-trained checkpoint pulled from zenodo:14988203 and cached under ~/.cache/compressors/mcucoder/. fp32 PyTorch only — the upstream INT8 / TFLite / CMSIS-NN deployment path is out of scope for this baseline.

In-house / authored, via external packages#

✅ FRAPPE — progressive multi-scale autoencoder. Weights on the Hub at danjacobellis/FRAPPE (config.json + FRAPPE_pytorch_model.safetensors). compressors.frappe.load_from_hub() is the canonical loader.
✅ WaLLoC — wavelet-domain learned codec, DCC 2025. Installed via pip install walloc; checkpoint RGB_16x.pth from the Hub (danjacobellis/walloc).
✅ LiVeAction — DCC 2026 (in press). Installed via pip install livecodec; checkpoint lsdir_f16c48.pth from the Hub (danjacobellis/liveaction).

Audio#

All audio harnesses evaluate on danjacobellis/musdb_segments (validation: 262 int16 stereo 44.1 kHz clips of 2²¹ samples) with a shared metric stack: kbps / CR against the int16 PCM original, waveform PSNR_dB on [0, 1]-convention normalized signals, and SSDR / SRDR via src/compressors/spatial_audio_quality.py. That module is an independent PyTorch implementation of the duplex-theory decomposition of Watcharasupat & Lerch (ICASSP 2024, arXiv 2306.08053). It was written in-house because the authors’ GPL-3.0 package is unmaintained (pins Python < 3.11) and, on inspection, solves its gain projection against the unaligned reference, which contradicts the paper’s stated theory for delay errors. Shared sweep loops and helpers live in src/compressors/audio_eval.py. Mono codecs encode the two channels independently (never downmixing the reference) and are charged for the resample to their native rate. Lossless audio codecs (FLAC, WAV) are explicitly out of scope.
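
The rate and PSNR bookkeeping described above might look like the sketch below. The function names are hypothetical, and the reading of "[0, 1]-convention" as an affine map of int16 samples onto [0, 1] with peak value 1 is an assumption; the helpers in audio_eval.py may normalize differently.

```python
import numpy as np


def kbps(n_bytes: int, n_samples: int, sample_rate: int = 44_100) -> float:
    """Bitrate of a compressed clip; n_samples counts samples per channel."""
    duration_s = n_samples / sample_rate
    return 8 * n_bytes / duration_s / 1000.0


def compression_ratio(n_bytes: int, n_samples: int,
                      n_channels: int = 2, bytes_per_sample: int = 2) -> float:
    """Size of the int16 PCM original divided by the compressed size."""
    return n_samples * n_channels * bytes_per_sample / n_bytes


def psnr_db(reference_int16: np.ndarray, decoded_int16: np.ndarray) -> float:
    """Waveform PSNR after mapping int16 samples to [0, 1] (assumed convention)."""
    ref = (reference_int16.astype(np.float64) + 32768.0) / 65535.0
    dec = (decoded_int16.astype(np.float64) + 32768.0) / 65535.0
    mse = np.mean((ref - dec) ** 2)
    return float(10.0 * np.log10(1.0 / mse))
```

For scale, a 2²¹-sample stereo int16 clip occupies 8 MiB as PCM at 1411.2 kbps, so a 131072-byte payload corresponds to CR 64 and about 22 kbps.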

Conventional, via torchcodec (FFmpeg)#

src/compressors/torchcodec/ — real byte streams via the in-memory AudioEncoder / AudioDecoder, --bit-rate sweep in kbps.

✅ MP3 — native 44.1 kHz stereo.
✅ Opus — encoded at 48 kHz (the only rate the encoder accepts), resampled back to 44.1 kHz on decode.
❌ Vorbis / AAC — encoders missing from torchcodec’s bundled FFmpeg.
❌ sphn — dropped; its Rust Opus encoder is hardcoded to VoIP mode with no bitrate knob.

Neural, via transformers / diffusers#

✅ EnCodec (facebook/encodec_24khz) — src/compressors/encodec/, mono 24 kHz, --bandwidth sweep, analytic code-bit rate accounting.
✅ DAC (descript/dac_16khz) — src/compressors/dac/, mono 16 kHz, --n-quantizers sweep.
✅ Mimi (kyutai/mimi) — src/compressors/mimi/, mono 24 kHz at 12.5 frames/s, --num-quantizers sweep.
✅ Stable Audio Open VAE (stabilityai/stable-audio-open-1.0 vae, diffusers AutoencoderOobleck) — src/compressors/oobleck/, native 44.1 kHz stereo, single operating point with fp16 latents (CR 64).
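
A minimal sketch of analytic rate accounting for the codecs above, where the bitrate follows from the code layout rather than from a written byte stream. Apart from Mimi's 12.5 frames/s, the numeric figures used here are assumptions recalled from the upstream model configurations and should be checked against the loaded configs: 75 frames/s and 1024-entry codebooks for EnCodec 24 kHz, 2048-entry codebooks for Mimi, and a 2048× temporal downsample with 64 latent channels for the Oobleck VAE. The function names are hypothetical.

```python
import math


def code_kbps(frame_rate_hz: float, n_quantizers: int,
              codebook_size: int, n_channels: int = 2) -> float:
    """Rate of residual-VQ codes; n_channels=2 models two independent mono passes."""
    bits_per_frame = n_quantizers * math.log2(codebook_size)
    return frame_rate_hz * bits_per_frame * n_channels / 1000.0


def latent_cr(downsample: int, in_channels: int, in_bits: int,
              latent_channels: int, latent_bits: int) -> float:
    """Compression ratio of a continuous latent stored at a fixed bit width."""
    input_bits = downsample * in_channels * in_bits  # per latent frame
    return input_bits / (latent_channels * latent_bits)


encodec_mono = code_kbps(75, 2, 1024, n_channels=1)   # 2 codebooks, one channel
mimi_mono = code_kbps(12.5, 8, 2048, n_channels=1)    # 8 codebooks, one channel
oobleck_cr = latent_cr(2048, in_channels=2, in_bits=16,
                       latent_channels=64, latent_bits=16)
```

Under these assumed figures the fp16-latent compression ratio comes out to 64, which agrees with the CR listed for the Stable Audio Open VAE.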

In-house / authored, vendored model code#

These vendor the model source (the living versions of the codec code) rather than importing the pip packages; single native operating point each, real bytes via the latent_to_pil packing recipes.

✅ WaLLoC stereo (stereo_5x from danjacobellis/walloc) — src/compressors/walloc_audio/, lossless-WebP latent bytes. stereo_20x deferred.
✅ LiVeAction stereo (musdb_stereo_f512c16.pth from danjacobellis/autocodec) — src/compressors/liveaction_audio/, TIFF-deflate latent bytes. The mono musdb_f1024c* variants in danjacobellis/liveaction are deferred.

Deferred (audio)#

Stem-conditional evaluation (--stems selector over the dataset’s vocal/bass/drums/other columns).
Resample-based rate sweeps for the fixed-rate codecs (the audio analog of the image pixel-ratio trick).
Spatial audio (7-ch Aria) and facebook/encodec_48khz.
Additional metrics (ViSQOL, Mel/STFT distance, SI-SDR), decode complexity, and the cross-codec audio comparison notebook (the image_compressors.ipynb analog).

Planned (not yet implemented)#

⏳ VVC intra (H.266) — currently the strongest conventional still-image codec. Candidate implementations:
- VVenC + VVdeC (Fraunhofer HHI) — production C++; BSD-3-Clear. Likely path. Open question: Pillow plugin vs subprocess wrapper.
- VTM (reference) — slow, research-only; BSD-3-Clear.
- uvg266 — fast research encoder; BSD-3.

Pluggable entropy coding#

For learned codecs whose encoder/decoder are essentially lossy analysis/synthesis transforms followed by a separate entropy coding stage — currently FRAPPE, WaLLoC, and LiVeAction — the goal is to mix and match the entropy coder independently of the analysis transform. The same trained transform should be combinable with multiple entropy coders (range / arithmetic with various probability models, ANS, learned context models) so that both the rate-distortion curve and the encoder’s runtime cost can be studied in isolation.

Implications:

Treat the analysis transform and the entropy coder as separate, swappable components rather than baking them together inside a single codec class.
The entropy-coder interface should be generic enough to be reused across analysis transforms.
Evaluation (RD + complexity) should report per (transform × entropy coder) combination.

The FRAPPE harness already has a four-function pluggable contract (compressors.frappe.entropy_coding: arrange_latents / unarrange_latents / encode_latents / decode_latents), with candidate alternative entropy coders living under experiments/encoder_optimization/. Open questions: which broader set of entropy coders to support out of the gate; whether constriction / torchac / CompressAI’s coders can be reused as-is or need wrapping.
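
One way to picture a swappable four-function contract is the sketch below. Only the four function names come from the text above; the signatures, the EntropyCoder container, the channel-stacking arrangement, and the use of zlib as a stand-in coder are all assumptions for illustration and do not describe compressors.frappe.entropy_coding.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Tuple

import numpy as np

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class EntropyCoder:
    """Four callables named after the contract; the signatures are assumed."""
    arrange_latents: Callable[[np.ndarray], np.ndarray]
    unarrange_latents: Callable[[np.ndarray, Shape], np.ndarray]
    encode_latents: Callable[[np.ndarray], bytes]
    decode_latents: Callable[[bytes, Shape], np.ndarray]


def _arrange(latents: np.ndarray) -> np.ndarray:
    # Stack the C channel planes vertically into one 2-D (C*H, W) array.
    c, h, w = latents.shape
    return latents.reshape(c * h, w)


def _unarrange(arranged: np.ndarray, shape: Shape) -> np.ndarray:
    return arranged.reshape(shape)


def _encode(arranged: np.ndarray) -> bytes:
    # zlib (Deflate) is only a stand-in for a real entropy coder.
    return zlib.compress(np.ascontiguousarray(arranged, dtype=np.int8).tobytes(), 9)


def _decode(payload: bytes, shape: Shape) -> np.ndarray:
    return np.frombuffer(zlib.decompress(payload), dtype=np.int8).reshape(shape)


ZLIB_CODER = EntropyCoder(_arrange, _unarrange, _encode, _decode)


def roundtrip(latents: np.ndarray, coder: EntropyCoder):
    """Encode integer latents to bytes and back; returns (payload, restored)."""
    arranged = coder.arrange_latents(latents)
    payload = coder.encode_latents(arranged)
    decoded = coder.decode_latents(payload, arranged.shape)
    return payload, coder.unarrange_latents(decoded, latents.shape)


# Random int8 values in place of a trained transform's quantized output.
rng = np.random.default_rng(0)
latents = rng.integers(-4, 5, size=(16, 32, 32), dtype=np.int8)
payload, restored = roundtrip(latents, ZLIB_CODER)
```

Because the transform only ever sees the four callables, swapping in a range coder or a learned context model would mean constructing another EntropyCoder, and a (transform × entropy coder) sweep reduces to a nested loop over both sets.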

Evaluation#

✅ Rate-distortion — bpp / PSNR / SSIM / LPIPS / DISTS on Kodak via piq.SSIMLoss, piq.LPIPS, piq.DISTS. Same metric stack across every codec module above.
✅ Encode-complexity — wallclock timing on 512×512 Kodak center crops, decomposed into per-stage timings (resize / analysis / transfer / store for codecs that internally resize-down for quality control; single encode stage for native-rate codecs). Uses the throughput.image.wallclock singleton.
✅ Audio rate-distortion + encode-complexity — kbps / CR / PSNR / SSDR / SRDR on full 2²¹-sample musdb clips; encode timing on 2²⁰-sample center crops of the first 24 clips with the same per-stage decomposition (resample for the mono codecs, then analysis / transfer / store; single encode stage for torchcodec), throughput in Msamples/s of the native 44.1 kHz crop.
⏳ Static cost (FLOPs / MACs) — not implemented yet. Candidate libraries: thop, fvcore, ptflops, calflops. Hardware-independent first-order proxy for cost.
⏳ Decode-complexity — out of scope so far; deferred along with FLOP counting.
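
The per-stage wall-clock decomposition can be sketched with a small context-manager timer. StageTimer, center_crop_box, and megasamples_per_second are hypothetical helpers written for this example; they are not the throughput.image.wallclock singleton, whose interface is not shown here.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


def center_crop_box(width: int, height: int, size: int = 512):
    """(left, upper, right, lower) box for a size x size center crop."""
    left = (width - size) // 2
    upper = (height - size) // 2
    return left, upper, left + size, upper + size


class StageTimer:
    """Accumulates wall-clock seconds per named stage."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add to the running total so repeated stages are summed.
            self.seconds[name] += time.perf_counter() - start

    def total(self) -> float:
        return sum(self.seconds.values())


def megasamples_per_second(n_samples: int, seconds: float) -> float:
    """Throughput in Msamples/s for a crop of n_samples encoded in `seconds`."""
    return n_samples / seconds / 1e6


box = center_crop_box(768, 512)  # a landscape Kodak image
timer = StageTimer()
with timer.stage("analysis"):
    checksum = sum(range(100_000))  # placeholder for the codec's transform
with timer.stage("store"):
    blob = bytes(1024)              # placeholder for writing the byte stream
```

When a stage launches GPU work, the device has to be synchronized before the timer stops (for example with torch.cuda.synchronize()), otherwise asynchronous kernels are billed to a later stage.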

In planning#

Once enough measurements have been collected across codecs × testbeds × resolutions, package the results as a public leaderboard in the spirit of the timm leaderboard — a Hugging Face Space combining rate-distortion and computational metrics for side-by-side codec comparison. Available testbeds: Mac mini (Apple Silicon), Raspberry Pi, Intel GPU, NVIDIA GPU, various x86 CPUs.