Token Pruning in VLLMs Achieves 50% Cost Reduction, yet Degrades Beyond Layer 20 with 96.9% Information Loss

Vision Large Language Models (VLLMs) demand substantial computational resources, largely because they process images through long sequences of visual tokens. Yahong Wang, Juncheng Wu, and Zhangkai Ni, together with Longzhen Yang, Yihang Liu, and Chengmei Yang, investigate a surprising limitation of existing token pruning techniques: beyond a certain network depth, these methods offer no advantage over removing tokens at random. Their research reveals a phenomenon they term “vanishing token information”, in which visual tokens progressively lose importance as data moves through the network, creating an “information horizon” beyond which pruning criteria become ineffective. By quantifying each token’s information as its impact on the model’s outputs, the team shows that this horizon shifts with the visual complexity of the task and the capacity of the VLLM itself. They ultimately demonstrate that random pruning in deeper layers maintains performance while reducing computational cost, and can even improve existing pruning methods to achieve state-of-the-art results. Experiments on both LLaVA-1.5-7B and Qwen-2.5-VL-7B consistently show that the gains from sophisticated pruning techniques diminish with network depth.
The team found that beyond a certain depth, these methods fail to retain the most informative visual tokens, pointing to a fundamental challenge in how VLLMs process visual information. Their own information-based pruning method, by contrast, improves performance, indicating that focusing on token information content is a viable optimization strategy. The experimental setup involved testing on LLaVA-1.5-7B and Qwen-2.5-VL-7B, using benchmarks including MME and TextVQA. Several pruning methods were compared, including FastV, SparseVLM, DART, DivPrune, and random pruning, across various pruning ratios, with performance evaluated by accuracy. A novel method was developed to quantify the information value of each visual token at different network layers, and this layer-by-layer analysis revealed that the effectiveness of pruning methods declines with depth, exposing the limitations of current approaches (a sketch of such a comparison appears below). On LLaVA-1.5-7B, the performance gap between existing pruning methods and random pruning begins to diminish around the 7th layer on MME and the 21st layer on TextVQA; similar trends on Qwen-2.5-VL-7B confirm the consistency of these findings.
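The layer-by-layer comparison can be pictured as a simple sweep over pruning depth, method, and ratio. The sketch below only illustrates that protocol and is not the authors’ code: `evaluate_with_pruning` is a hypothetical placeholder for a real benchmark run (e.g., MME or TextVQA accuracy with the selected tokens dropped), and the layer count and ratio grid are assumptions.

```python
import random

# Illustrative sweep over pruning depth, method, and ratio (names are assumptions).
PRUNE_METHODS = ["FastV", "SparseVLM", "DART", "DivPrune", "random"]
PRUNE_RATIOS = [0.5, 0.75, 0.9]
NUM_LAYERS = 32  # assumed depth of a 7B decoder such as LLaVA-1.5-7B's language model

def evaluate_with_pruning(layer: int, method: str, ratio: float) -> float:
    """Placeholder: run a benchmark with `ratio` of visual tokens removed at
    `layer` using `method`'s selection rule, and return accuracy."""
    return random.random()  # stand-in value; replace with a real evaluation loop

# Collect accuracies and report each method's advantage over random pruning per layer.
results = {(l, m, r): evaluate_with_pruning(l, m, r)
           for l in range(NUM_LAYERS) for m in PRUNE_METHODS for r in PRUNE_RATIOS}
for l in range(NUM_LAYERS):
    best_informed = max(results[(l, m, 0.5)] for m in PRUNE_METHODS if m != "random")
    gap = best_informed - results[(l, "random", 0.5)]
    print(f"layer {l:2d}: best-method advantage over random = {gap:+.3f}")
```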
The team’s information-based pruning method consistently outperformed random pruning and existing methods, particularly in the deeper layers, demonstrating the potential of this approach. Detailed analysis confirmed that existing pruning methods do not effectively retain more informative visual tokens than random pruning in the deeper layers, explaining the observed performance limitations.
This research challenges the assumption that existing pruning methods are universally beneficial for VLLMs, and underscores the importance of considering the information content of visual tokens when designing pruning strategies. The authors’ information-based pruning method offers a promising direction for future work, potentially leading to more efficient VLLMs with lower computational and memory costs. The study began from the observation that existing training-free pruning techniques become ineffective in deeper layers, performing no better than random pruning, and sought to understand why. To investigate, the team developed a metric that quantifies the information content of each visual token by measuring the change in the model’s output probabilities when that token is removed. Experiments employed the LLaVA-1.5-7B and Qwen-2.5-VL-7B models, systematically pruning tokens at various layers and evaluating performance on benchmarks including MME, ScienceQA, and TextVQA.
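As a rough illustration of that metric, the sketch below scores a single visual token by how much the model’s next-token distribution shifts when the token is removed. It is a minimal sketch assuming a HuggingFace-style causal VLM whose visual tokens are already embedded in the input sequence; the L1 distance between distributions and all argument names are assumptions, since the article does not specify the exact formulation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_information(model, inputs_embeds, attention_mask, visual_token_idx):
    """Score one visual token by how much removing it shifts the model's
    output distribution (hypothetical sketch; not the authors' code)."""
    # Output distribution with the full visual token sequence.
    full_logits = model(inputs_embeds=inputs_embeds,
                        attention_mask=attention_mask).logits
    full_probs = F.softmax(full_logits[:, -1, :], dim=-1)

    # Drop the target visual token and recompute the distribution.
    keep = torch.ones(inputs_embeds.shape[1], dtype=torch.bool)
    keep[visual_token_idx] = False
    pruned_logits = model(inputs_embeds=inputs_embeds[:, keep, :],
                          attention_mask=attention_mask[:, keep]).logits
    pruned_probs = F.softmax(pruned_logits[:, -1, :], dim=-1)

    # One plausible choice of distance: L1 shift in the output probabilities.
    return (full_probs - pruned_probs).abs().sum().item()
```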
The team removed a high percentage of visual tokens within each layer of both models to assess the impact of pruning at different depths. The analysis revealed a phenomenon they term the “information horizon”: an intermediate layer beyond which visual tokens become redundant, losing their salience and contributing little to the model’s output. Quantitatively, the token-information metric shows that visual token information gradually diminishes with network depth, becomes nearly uniform across tokens, and eventually vanishes beyond the horizon, after which visual tokens can be removed without affecting performance. The position of this horizon is not fixed: it extends deeper for visually intensive tasks such as Optical Character Recognition (OCR) than for more general tasks such as Visual Question Answering (VQA), and it correlates strongly with model capacity, with stronger VLLMs making use of visual tokens at deeper layers than weaker ones.

Based on these findings, the researchers propose that simple random pruning in deep layers efficiently balances performance and efficiency, and that integrating random pruning consistently enhances existing pruning methods (a sketch of this hybrid scheme follows below). Combining random pruning with DART improves performance while retaining a high share of the original model’s accuracy with fewer visual tokens, and pairing DivPrune with random pruning achieves state-of-the-art results, maintaining high performance while pruning a substantial fraction of visual tokens, a significant step forward in VLLM efficiency.
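A minimal sketch of that hybrid idea, assuming a per-layer pruning hook: below the (task- and model-dependent) information horizon an informed scorer picks the tokens to keep, while beyond it tokens are dropped uniformly at random. The `horizon_layer` threshold and the `scorer` callable stand in for a DART- or DivPrune-style criterion; both are assumptions for illustration, not the authors’ implementation.

```python
import torch

def hybrid_prune(visual_tokens, layer_idx, horizon_layer, keep_ratio, scorer=None):
    """Keep `keep_ratio` of the visual tokens at one layer: informed selection
    before the information horizon, uniform random selection beyond it."""
    n = visual_tokens.shape[0]
    k = max(1, int(n * keep_ratio))

    if layer_idx < horizon_layer and scorer is not None:
        # Shallow layers: token information still varies, so keep the tokens
        # the scorer (e.g., an attention- or diversity-based rule) rates highest.
        scores = scorer(visual_tokens)              # expected shape: (n,)
        keep_idx = scores.topk(k).indices
    else:
        # Beyond the horizon: information is nearly uniform across tokens, so
        # random selection matches informed criteria at negligible cost.
        keep_idx = torch.randperm(n)[:k]

    # Preserve the original token order after selection.
    return visual_tokens[keep_idx.sort().values]
```

In this picture, pairing an informed method such as DART or DivPrune with random selection in deep layers amounts to a choice of `scorer` and `horizon_layer`.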
More information: All You Need Are Random Visual Tokens? Demystifying Token Pruning in VLLMs. arXiv: https://arxiv.org/abs/2512.07580
