Vision-Language Models Achieve Human-Aligned Image Compression, Replicating Judgements Zero-Shot

Quantum Zeitgeist

Current methods for evaluating image compression often fail to reflect how humans perceive visual quality, relying on simplistic measures that diverge from the human visual system. Kyle Sargent, Ruiqi Gao, and Philipp Henzler, alongside colleagues from Google DeepMind and Google Research, demonstrate that advanced vision-language models (VLMs) can replicate human judgements about image differences without any task-specific training. Building on this discovery, the team introduces Vision-Language Models as Perceptual Judges for Image Compression, or VLIC, a compression system that incorporates VLM judgements directly during a post-training stage. The approach achieves competitive or state-of-the-art performance in human-aligned image compression, as confirmed by both established perceptual metrics and large-scale user studies, offering a promising new direction for compression technologies that better match human visual perception.

The core technique is a post-training procedure for diffusion autoencoders: a VLM is asked to compare different reconstructions of the same image, and the resulting judgements are used to refine the autoencoder through Diffusion DPO. Evaluations focus on image compression performance, specifically the alignment between automated metrics and human preferences.
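
As a concrete illustration of how a VLM can serve as a pairwise (two-alternative forced choice) perceptual judge, the sketch below asks Gemini to name the reconstruction closer to a reference image. This is a minimal sketch rather than the authors' code: the prompt wording, the "gemini-2.5-flash" model string, and the use of the google-generativeai Python client are assumptions.

```python
# Minimal sketch: a VLM as a pairwise (2AFC) perceptual judge.
# The prompt, model name, and client library are illustrative
# assumptions, not the paper's exact setup.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-2.5-flash")

PROMPT = (
    "You will see a reference image followed by two reconstructions, "
    "A and B. Reply with a single letter, A or B, naming the "
    "reconstruction that is perceptually closer to the reference."
)

def judge_pair(reference: Image.Image,
               recon_a: Image.Image,
               recon_b: Image.Image) -> str:
    """Return 'A' or 'B' according to the VLM's stated preference."""
    response = model.generate_content([PROMPT, reference, recon_a, recon_b])
    return "A" if response.text.strip().upper().startswith("A") else "B"
```

Preferences collected this way over pairs of reconstructions of the same source image form the training signal for the post-training stage described next.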

Results demonstrate substantial improvements in overall reconstruction quality, alongside enhanced alignment with human perception, indicating that traditional distortion functions such as MSE are often poorly aligned with how humans evaluate image fidelity.

LLM-Guided Diffusion for Image Compression

This research details a new approach to image compression that combines diffusion models and large language models for improved perceptual quality. The key innovation lies in using the language model to evaluate reconstructed images and provide a training signal, allowing the compression model to optimise for how humans perceive images rather than simply minimising pixel-level errors. The system employs diffusion probabilistic models and Diffusion DPO, together with a tiling strategy that supports arbitrary image resolutions by breaking images into overlapping tiles for compression and reconstruction.

The architecture centres on a diffusion model for compression and reconstruction, with Gemini 2.5 Flash acting as a perceptual critic that assesses the quality of reconstructions relative to each other. The critic's ratings are converted into a preference signal, and Diffusion DPO, which is well suited to learning from preference data, refines the diffusion model so that its reconstructions align more closely with human perception; sketches of this preference loss and of the tiled inference follow below. Optimising for VLM-judged perceptual similarity reflects human vision more closely than traditional metrics like PSNR or SSIM.

Vision-Language Models Compress Images Without Training

The researchers demonstrate that image compression can be guided by the reasoning capabilities of vision-language models, achieving competitive or state-of-the-art performance on standard benchmarks. The study shows that an off-the-shelf VLM, Gemini 2.5 Flash, can accurately replicate human judgements of visual similarity across multiple datasets, including BAPPS and a newly collected compressed-image dataset, without any task-specific training. This zero-shot capability suggests that advances in VLMs naturally enhance automatic perceptual evaluation.
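
To make that preference loss concrete, here is a minimal PyTorch sketch of a Diffusion-DPO-style objective, following the published formulation of Diffusion DPO (Wallace et al., 2023) rather than this paper's exact implementation; tensor shapes, names, and the beta value are illustrative, and the per-timestep weighting is omitted for brevity.

```python
# Sketch of a Diffusion-DPO-style preference loss (after Wallace et
# al., 2023). Shapes, names, and beta are illustrative; the original
# per-timestep weighting is dropped for brevity.
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_theta_w, eps_theta_l,   # policy predictions
                       eps_ref_w, eps_ref_l,       # frozen reference predictions
                       eps_true_w, eps_true_l,     # noise actually added
                       beta: float = 5000.0) -> torch.Tensor:
    """Loss over a (preferred `w`, dispreferred `l`) reconstruction pair.

    Each tensor is the predicted (or true) noise for one noised sample,
    shape (batch, C, H, W). The policy is rewarded for denoising the
    VLM-preferred sample better than the frozen reference model does,
    relative to the dispreferred sample.
    """
    def sq_err(pred, true):
        # Per-example squared denoising error.
        return ((pred - true) ** 2).flatten(1).sum(dim=1)

    diff_w = sq_err(eps_theta_w, eps_true_w) - sq_err(eps_ref_w, eps_true_w)
    diff_l = sq_err(eps_theta_l, eps_true_l) - sq_err(eps_ref_l, eps_true_l)

    # Logistic preference objective: decrease diff_w relative to diff_l.
    return -F.logsigmoid(-beta * (diff_w - diff_l)).mean()
```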

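The tiled inference mentioned above can likewise be sketched simply: split the image into overlapping tiles, reconstruct each tile independently, and cross-fade the overlaps so no seams appear. Tile size, overlap, and the linear blending below are assumptions (the code also assumes the image is at least one tile in each dimension); `codec` stands in for the compression model.

```python
# Sketch of overlapping-tile inference for arbitrary resolutions.
# Tile size, overlap, and blending are assumptions; `codec` is a
# placeholder for compress-then-reconstruct on one tile.
import numpy as np

def _ramp(n: int, ov: int) -> np.ndarray:
    """1-D blending weight: rises over the first/last `ov` pixels."""
    r = np.ones(n)
    if ov > 0:
        edge = np.linspace(1.0 / (ov + 1), 1.0, ov)
        r[:ov] = edge
        r[-ov:] = edge[::-1]
    return r

def tiled_reconstruct(image: np.ndarray, codec, tile: int = 256,
                      overlap: int = 32) -> np.ndarray:
    """Reconstruct an (H, W, C) image tile by tile, blending overlaps."""
    h, w, _ = image.shape  # assumes h >= tile and w >= tile
    acc = np.zeros(image.shape, dtype=np.float64)
    norm = np.zeros((h, w, 1), dtype=np.float64)
    wgt = np.outer(_ramp(tile, overlap), _ramp(tile, overlap))[..., None]
    step = tile - overlap
    ys = list(range(0, h - tile + 1, step))
    xs = list(range(0, w - tile + 1, step))
    if ys[-1] != h - tile:  # cover the bottom border
        ys.append(h - tile)
    if xs[-1] != w - tile:  # cover the right border
        xs.append(w - tile)
    for y in ys:
        for x in xs:
            patch = codec(image[y:y + tile, x:x + tile])
            acc[y:y + tile, x:x + tile] += wgt * patch
            norm[y:y + tile, x:x + tile] += wgt
    return acc / norm
```
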
The team developed a diffusion-based image compression system, VLIC, which extends the FlowMo architecture with an entropy coder and is post-trained on preference data generated by the VLM. Experimental results indicate that VLM-guided post-training yields reconstructions preferred over those produced with traditional perceptual metrics such as LPIPS. On the CLIC 2022 dataset, VLIC operates at 0.196 bits per pixel (bpp), compared with 0.199 bpp for HiFiC and 0.205 bpp for PO-ELIC. Further evaluation on MS-COCO shows VLIC at 0.198 bpp, outperforming HiFiC (0.287 bpp) and PerCo (0.12 bpp). Variants of the approach achieved 0.203 bpp and 0.247 bpp, while HiFiC-Lo and HiFiC-Mi reached 0.215 bpp and 0.391 bpp, and PO-ELIC-Hi and PO-ELIC-Mi reached 0.321 bpp and 0.196 bpp, respectively. At these rates, VLIC more faithfully preserves perceptually relevant details, including faces and textures, than existing compression methods, highlighting the potential of VLM-guided approaches for high-quality image compression.
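
For reference, the bits-per-pixel figures above are simple bookkeeping: the size of the compressed bitstream in bits divided by the number of pixels. The helper below is an illustrative one-liner, not code from the paper.

```python
def bits_per_pixel(compressed_bytes: int, height: int, width: int) -> float:
    """Rate of a compressed image: bitstream size in bits per pixel."""
    return (8 * compressed_bytes) / (height * width)

# Example: a 12 kB bitstream for a 768x512 image.
print(bits_per_pixel(12_000, 768, 512))  # ~0.244 bpp
```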

Human Perception Guides Image Compression Success

This research demonstrates that readily available vision-language models possess a visual understanding closely aligned with human perception. The scientists found that these models can accurately replicate human judgements when asked to compare images, identifying subtle differences as a human viewer would. Motivated by this finding, the team designed VLIC, an image compression system that leverages these models to optimise image quality for perceptual similarity. By post-training VLIC with preferences derived from the vision-language models, the researchers achieved competitive or state-of-the-art performance in human-aligned image compression, depending on the dataset.

The quality of compression achieved is directly linked to the accuracy of the vision-language model employed, suggesting that ongoing improvements in these models will likely translate into further gains in image compression. The authors acknowledge that the diffusion-based decoder introduces latency, a limitation shared by other diffusion approaches, and that using vision-language models for reward assessment is computationally more expensive than traditional perceptual networks. Future work may focus on mitigating these computational costs and on exploiting increasingly capable vision-language models to further enhance compression performance.

👉 More information
🗞 VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
🧠 ArXiv: https://arxiv.org/abs/2512.15701
