Submitted: 09 March 2025
Posted: 10 March 2025
Abstract
Keywords:
1. Introduction
2. Background and Preliminaries
2.1. Transformers and Computational Bottlenecks
2.2. Model Compression Techniques
- Weight Pruning: This technique removes redundant or less important weights in the model, reducing storage and computation without significantly impacting accuracy. Structured pruning methods remove entire neurons or layers, while unstructured pruning eliminates individual weights.
- Quantization: Quantization reduces the precision of model weights and activations (e.g., from 32-bit floating point to 8-bit integer), leading to smaller model sizes and faster inference times [17].
- Knowledge Distillation: This method transfers knowledge from a large pre-trained model (teacher) to a smaller model (student), allowing the student to achieve comparable performance with fewer parameters.
- Token Pruning: Unlike weight pruning, which operates at the parameter level, token pruning removes or masks tokens from the input sequence to reduce the number of operations in self-attention and feed-forward layers [21].
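
To make the storage arithmetic behind the quantization technique listed above concrete, the following minimal PyTorch sketch converts a single weight matrix from 32-bit floats to 8-bit integers; the matrix size and the symmetric scaling scheme are illustrative assumptions, not details taken from any particular method.

```python
import torch

# Toy post-training quantization at the tensor level: the same weights stored
# as 8-bit integers take a quarter of the memory of 32-bit floats, at the cost
# of a small rounding error.  The matrix below is random placeholder data.
w_fp32 = torch.randn(768, 768)

scale = w_fp32.abs().max() / 127                       # symmetric int8 scale (assumed scheme)
w_int8 = torch.quantize_per_tensor(w_fp32, scale.item(), 0, torch.qint8)

print("fp32 bytes:", w_fp32.numel() * 4)
print("int8 bytes:", w_int8.int_repr().numel() * 1)
print("max abs rounding error:", (w_fp32 - w_int8.dequantize()).abs().max().item())
```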
2.3. Token Importance and Pruning Criteria
- Attention-Based Importance: Many token pruning methods leverage the attention scores produced by the self-attention mechanism to determine token importance [23]. Tokens with lower cumulative attention scores across layers are considered less relevant and are pruned.
- Gradient-Based Importance: Some approaches analyze the gradient magnitudes of token embeddings to assess their contribution to model predictions. Tokens with minimal impact on the gradient-based loss function are pruned [24].
- Reinforcement Learning-Based Policies: Adaptive pruning techniques employ reinforcement learning to learn optimal pruning strategies based on reward functions that balance accuracy and efficiency.
- Heuristic-Based Methods: Some methods use predefined rules, such as removing stop words or low-frequency tokens, to perform static pruning without additional computational overhead.
2.4. Challenges and Open Problems
- Preserving Model Accuracy: Aggressive token pruning can lead to significant loss of information, degrading model performance [26]. Effective strategies are required to balance pruning aggressiveness with accuracy preservation.
- Generalization Across Tasks: Token importance varies across NLP tasks [27]. A token pruning strategy that works well for text classification may not be suitable for machine translation or summarization, necessitating task-specific adaptation.
- Robustness to Distribution Shifts: Pruned models should be robust to variations in input data distributions, as pruning strategies trained on one dataset may not generalize well to unseen text domains [30].
- Hardware Efficiency: While token pruning reduces the number of floating-point operations (FLOPs), its impact on actual hardware efficiency depends on factors such as memory access patterns and parallelization capabilities. Efficient implementation techniques are needed to fully leverage the benefits of token pruning on modern hardware architectures [31].
3. Taxonomy of Token Pruning Methods
3.1. Static vs. Dynamic vs. Adaptive Token Pruning [34]
3.1.1. Static Token Pruning
- Stopword Removal: Common stopwords (e.g., “the,” “is,” “and”) are removed before input processing, as they contribute little semantic information.
- Low-Frequency Token Filtering: Rare words that appear infrequently in the training corpus are eliminated to reduce computational load.
- Fixed-Length Truncation: Input sequences exceeding a predefined length are truncated, discarding tokens beyond a certain limit [36].
- Lacks flexibility, as the same pruning policy is applied to all inputs [39].
- May remove important context-dependent tokens, leading to loss of information.
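
The static filters listed above can be expressed in a few lines of plain Python; in the sketch below the stopword list and the length budget are arbitrary illustrative choices, not values prescribed by any specific method.

```python
# Minimal sketch of static token pruning: stopword removal followed by
# fixed-length truncation.  The same policy is applied to every input.
STOPWORDS = {"the", "is", "and", "a", "an", "of", "to", "in"}
MAX_LEN = 128  # illustrative budget

def static_prune(tokens: list[str], max_len: int = MAX_LEN) -> list[str]:
    # 1) Drop stopwords, regardless of context.
    kept = [t for t in tokens if t.lower() not in STOPWORDS]
    # 2) Truncate anything beyond the fixed length budget.
    return kept[:max_len]

print(static_prune("the model removes the least informative tokens".split()))
# -> ['model', 'removes', 'least', 'informative', 'tokens']
```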
3.1.2. Dynamic Token Pruning
- Attention-Based Pruning: Tokens with low attention scores across transformer layers are discarded.
- Gradient-Based Pruning: Token importance is estimated using gradient-based methods, removing those with minimal impact on loss [42].
- Entropy-Based Pruning: Tokens with high uncertainty (entropy) are retained, while those with low entropy are pruned [43].
- More adaptable than static pruning, as pruning decisions depend on input context [44].
- Can be applied without requiring model retraining.
- Requires additional computations during inference to determine token importance.
- Potentially increases inference time if pruning decisions are computationally expensive [45].
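
As a rough illustration of the entropy criterion described above, the following PyTorch sketch keeps the tokens whose attention distributions have the highest entropy; the tensor layout, head count, and keep ratio are illustrative assumptions.

```python
import torch

def entropy_based_keep_mask(attn: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Dynamic pruning sketch: keep the tokens whose attention rows have the
    highest entropy.  `attn` has the assumed shape (heads, seq, seq), where
    each row is a token's attention distribution over the sequence."""
    probs = attn.mean(dim=0)                                    # average over heads -> (seq, seq)
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)   # per-token entropy
    k = max(1, int(keep_ratio * ent.numel()))
    keep = torch.zeros_like(ent, dtype=torch.bool)
    keep[ent.topk(k).indices] = True
    return keep

attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)            # toy attention maps
print(entropy_based_keep_mask(attn).sum().item(), "of 16 tokens kept")
```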
3.1.3. Adaptive Token Pruning
- Reinforcement Learning-Based Pruning: A pruning policy is learned using a reward function that balances accuracy and efficiency.
- Learned Token Importance Scoring: Token importance is modeled using auxiliary neural networks trained alongside the main model.
- Gumbel-Softmax Sampling: Differentiable relaxation techniques are used to enable end-to-end training of token selection mechanisms.
- Optimized for task-specific pruning, leading to better performance trade-offs [46].
- Can generalize across different input distributions.
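
The differentiable selection idea can be sketched with a straight-through Gumbel-Softmax gate, as below (PyTorch); the scorer architecture, hidden size, and temperature are illustrative assumptions rather than details of a specific published method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSelector(nn.Module):
    """Sketch of a learned, differentiable keep/drop gate per token."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.scorer = nn.Linear(hidden, 2)              # logits for [drop, keep]

    def forward(self, x: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.scorer(x)                         # (batch, seq, 2)
        # Straight-through Gumbel-Softmax: discrete keep/drop decisions in the
        # forward pass, smooth gradients in the backward pass.
        gate = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1]   # (batch, seq)
        return x * gate.unsqueeze(-1)                   # zero out dropped tokens

x = torch.randn(2, 16, 64, requires_grad=True)
pruned = TokenSelector()(x)
pruned.sum().backward()                                 # gradients flow through the gate
```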
3.2. Pruning Granularity: Token-Level vs. Group-Level [49]
3.2.1. Token-Level Pruning
- Per-token importance scoring using attention weights.
- Gradient-based token sensitivity analysis.
3.2.2. Group-Level Pruning
- Attention head pruning, where entire attention heads are removed based on their contribution to the model.
- Phrase pruning, where contiguous spans of tokens are removed instead of individual words.
- Token-level pruning provides more granularity but may introduce irregular memory access patterns.
- Group-level pruning can be more hardware-efficient but may lead to larger losses in information.
3.3. Summary of Taxonomy

| Pruning Type | Characteristics | Examples |
|---|---|---|
| Static | Fixed pruning policy, no runtime adaptation | Stopword removal, truncation |
| Dynamic | Context-aware, input-dependent pruning | Attention-based, entropy-based |
| Adaptive | Learnable pruning, optimized for efficiency | Reinforcement learning, differentiable masking |
| Token-Level | Removes individual tokens | Attention score ranking, gradient-based pruning |
| Group-Level | Removes token sets (e.g., phrases, heads) | Attention head pruning, phrase pruning |
4. Token Pruning Methodologies
4.1. Attention-Based Token Pruning
4.1.1. Methodology
- Compute self-attention scores for each token at different layers.
- Aggregate attention scores across multiple heads and layers using a predefined strategy (e.g., mean or max pooling) [58].
- Rank tokens based on aggregated scores and remove those below a certain threshold [59].
- Adjust model predictions to compensate for removed tokens using interpolation or redistribution techniques.
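
A minimal PyTorch sketch of this pipeline is shown below, using mean pooling over heads and layers and a median threshold; both choices, as well as the tensor shapes, are illustrative assumptions.

```python
import torch

def attention_token_scores(attn_per_layer: list[torch.Tensor]) -> torch.Tensor:
    """Aggregate the attention received by each token across layers and heads
    (mean pooling).  Each tensor has the assumed shape (heads, seq, seq)."""
    # Attention *received* by token j = column j, averaged over query tokens.
    per_layer = [a.mean(dim=0).mean(dim=0) for a in attn_per_layer]   # each (seq,)
    return torch.stack(per_layer).mean(dim=0)

def prune_by_threshold(tokens: list[str], scores: torch.Tensor, thr: float) -> list[str]:
    keep = scores >= thr
    return [t for t, k in zip(tokens, keep) if k]

layers = [torch.softmax(torch.randn(8, 6, 6), dim=-1) for _ in range(4)]   # toy attention maps
scores = attention_token_scores(layers)
print(prune_by_threshold(list("abcdef"), scores, thr=scores.median().item()))
```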
4.1.2. Notable Techniques
- Layer-Wise Pruning: Tokens with the lowest attention scores in each layer are pruned progressively [60].
- Cumulative Attention Pruning: Instead of per-layer pruning, attention scores are summed across all layers, and the least informative tokens are removed.
- Threshold-Based Pruning: A predefined attention score threshold is used to discard low-importance tokens dynamically [61].
4.1.3. Advantages and Limitations
- Attention scores do not always capture true token importance.
- Pruning based solely on attention scores may lead to over-aggressive token removal.
- Can be ineffective for tasks like translation, where low-attention tokens might still be contextually important [64].
4.2. Gradient-Based Token Pruning
4.2.1. Methodology
- Compute token embeddings and pass them through the model.
- Calculate gradients of the loss function with respect to token embeddings.
- Rank tokens based on gradient magnitude, removing those with minimal impact on the loss.
- Recompute model outputs with pruned tokens to maintain consistency [66].
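
The saliency computation in steps 2-3 can be sketched as follows (PyTorch); the toy embedding/classifier model, the labels, and the number of retained tokens are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy "model": embedding layer plus a mean-pooled linear classifier, used only
# to illustrate the gradient-based scoring step.
emb = nn.Embedding(1000, 32)
clf = nn.Linear(32, 2)

token_ids = torch.randint(0, 1000, (1, 12))
labels = torch.tensor([1])

x = emb(token_ids)                       # (1, seq, 32)
x.retain_grad()                          # keep gradients w.r.t. token embeddings
loss = nn.functional.cross_entropy(clf(x.mean(dim=1)), labels)
loss.backward()

saliency = x.grad.norm(dim=-1).squeeze(0)          # per-token gradient magnitude
keep = saliency.topk(k=8).indices.sort().values    # retain the 8 most salient tokens
print("kept token positions:", keep.tolist())
```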
4.2.2. Notable Techniques
- Saliency-Based Pruning: Tokens with the smallest gradient magnitudes are removed, as they contribute the least to model predictions.
- Hessian-Based Pruning: Higher-order derivatives (Hessian matrix) are used to measure the curvature of the loss function, identifying tokens that least affect model confidence [67].
- Gradient Masking: Instead of removing tokens outright, gradients are masked during training to simulate the impact of pruning.
4.2.3. Advantages and Limitations
- Provides a principled approach to measuring token importance.
- More fine-grained than attention-based pruning.
- Can generalize better across different NLP tasks [68].
- Computationally expensive due to gradient computations.
- Prone to instability, as small gradients do not always indicate unimportant tokens.
- Requires additional backpropagation steps during inference, increasing runtime complexity [69].
4.3. Reinforcement Learning-Based Token Pruning
4.3.1. Methodology
- Define a reinforcement learning environment where each token represents an action [72].
- Use a policy network to predict token retention or removal.
- Define a reward function balancing computational efficiency and model accuracy [73].
- Train the agent using reinforcement learning algorithms such as Proximal Policy Optimization (PPO) or Q-learning.
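
The sketch below illustrates the general recipe with a simple REINFORCE-style update rather than PPO or Q-learning; the policy architecture, reward weighting, and the stand-in accuracy value are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class PruningPolicy(nn.Module):
    """Sketch of a per-token keep/drop policy (hidden size is illustrative)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Linear(hidden, 1)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(states)).squeeze(-1)     # keep probability per token

def reward(accuracy: float, kept_fraction: float, alpha: float = 0.5) -> float:
    # Trades task accuracy against compute (fraction of tokens kept);
    # alpha is an assumed weighting, not a value from the literature.
    return accuracy - alpha * kept_fraction

policy = PruningPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

states = torch.randn(16, 64)                        # per-token representations (toy)
probs = policy(states)
actions = torch.bernoulli(probs.detach())           # 1 = keep, 0 = drop
# `accuracy` would come from running the pruned transformer; faked here.
r = reward(accuracy=0.90, kept_fraction=actions.mean().item())
log_prob = (actions * probs.clamp_min(1e-8).log()
            + (1 - actions) * (1 - probs).clamp_min(1e-8).log()).sum()
loss = -r * log_prob                                # REINFORCE objective
opt.zero_grad()
loss.backward()
opt.step()
```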
4.3.2. Notable Techniques
- Binary Policy Networks: A neural network predicts a binary decision (keep or remove) for each token [74].
- Continuous Pruning Policies: Token importance is represented as a continuous value, allowing for probabilistic token retention.
- Meta-Learning for Pruning: The pruning policy adapts across different tasks using meta-learning techniques [75].
4.3.3. Advantages and Limitations
- Learns effective pruning strategies through reward-driven optimization [76].
- Can dynamically adjust pruning policies for different tasks and datasets.
- Balances accuracy and efficiency through reward-based learning.
- Requires extensive training, making it computationally expensive.
- Hard to interpret learned pruning policies.
- Prone to instability in reward signal optimization [77].
4.4. Hybrid Token Pruning Approaches
4.4.1. Examples of Hybrid Approaches
- Attention-Gradient Hybrid Pruning: Uses attention scores to pre-select candidate tokens, then applies gradient-based pruning for fine-grained selection [79].
- Reinforcement Learning with Attention Guidance: Reinforcement learning agents use attention maps as auxiliary information to improve token selection efficiency.
- Multi-Stage Pruning: First applies a lightweight static pruning strategy, followed by dynamic pruning to refine token selection.
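
A minimal sketch of the attention-gradient hybrid described above: attention scores pre-select a coarse candidate set, and gradient saliency refines it. The keep ratios and the score vectors are illustrative placeholders.

```python
import torch

def hybrid_prune(attn_scores: torch.Tensor, grad_saliency: torch.Tensor,
                 coarse_keep: float = 0.75, fine_keep: float = 0.5) -> torch.Tensor:
    """Two-stage attention-then-gradient pruning sketch.  Both score vectors
    have shape (seq,); the keep ratios are assumed values."""
    seq = attn_scores.numel()
    # Stage 1: cheap attention-based pre-selection of candidate tokens.
    coarse = attn_scores.topk(int(coarse_keep * seq)).indices
    # Stage 2: gradient-based refinement within the surviving candidates.
    fine = grad_saliency[coarse].topk(int(fine_keep * seq)).indices
    return coarse[fine].sort().values                # final kept positions

kept = hybrid_prune(torch.rand(16), torch.rand(16))
print("kept positions:", kept.tolist())
```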
4.4.2. Advantages and Limitations
4.5. Summary

| Method | Advantages | Limitations |
|---|---|---|
| Attention-Based | Efficient, lightweight, interpretable | May not always correlate with token importance |
| Gradient-Based | Fine-grained importance estimation | Computationally expensive, requires backpropagation |
| Reinforcement Learning | Dynamically optimized pruning policies | High training cost, complex implementation |
| Hybrid Approaches | Combines benefits of multiple techniques | More complex to design and optimize |
5. Empirical Evaluation of Token Pruning Methods
5.1. Evaluation Metrics
5.1.1. Accuracy Metrics
- Task-Specific Performance: Standard accuracy measures for different NLP tasks, such as:
- Classification Accuracy (e.g., for sentiment analysis and text categorization).
- BLEU Score (for machine translation) [88].
- F1-Score (for named entity recognition and question answering).
- Perplexity: Commonly used for language modeling tasks to measure the uncertainty of the model’s predictions [89].
- AUC-ROC: Applied in tasks involving ranking or probability estimation, such as document retrieval.
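
For reference, perplexity is simply the exponential of the mean token-level cross-entropy, as in the short PyTorch sketch below (random logits and targets serve as placeholders).

```python
import torch
import torch.nn.functional as F

# Perplexity = exp(average cross-entropy per token).
logits = torch.randn(1, 10, 50257)                  # (batch, seq, vocab) -- toy values
targets = torch.randint(0, 50257, (1, 10))

ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
print("perplexity:", ce.exp().item())
```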
5.1.2. Efficiency Metrics
- FLOPs Reduction: Measures the percentage decrease in floating-point operations after token pruning.
- Inference Speedup: Reports the increase in tokens processed per second after pruning.
- Memory Footprint: Evaluates the reduction in GPU/CPU memory usage due to token pruning.
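
One rough way to measure inference speedup is sketched below on a toy encoder layer (PyTorch); the layer configuration, batch shapes, and the 50% pruning ratio are illustrative, and real measurements depend heavily on hardware and batching.

```python
import time
import torch
import torch.nn as nn

def seconds_per_batch(model: nn.Module, batch: torch.Tensor, iters: int = 20) -> float:
    """Average wall-clock time to push one batch through the model."""
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        elapsed = time.perf_counter() - start
    return elapsed / iters

# Toy encoder layer standing in for a transformer; sizes are illustrative.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
full = torch.randn(8, 128, 64)       # original sequences (128 tokens each)
pruned = full[:, :64, :]             # the same inputs after pruning half the tokens

speedup = seconds_per_batch(layer, full) / seconds_per_batch(layer, pruned)
print(f"inference speedup from pruning: {speedup:.2f}x")
```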
5.1.3. Robustness and Generalization Metrics
- Performance Degradation: The absolute or relative drop in accuracy compared to the unpruned baseline model [90].
- Generalization Across Tasks: The ability of a pruning method trained on one dataset to perform well on a different dataset without retraining.
- Performance Under Distribution Shifts: The resilience of pruned models to input variations, such as noisy data or domain shifts.
5.2. Benchmark Datasets
- GLUE Benchmark: A collection of diverse NLP tasks, including sentiment analysis (SST-2), natural language inference (MNLI), and paraphrase detection (QQP) [91].
- SQuAD (Stanford Question Answering Dataset): A widely used benchmark for reading comprehension and question answering [92].
- WMT (Workshop on Machine Translation): A benchmark dataset for evaluating machine translation systems across multiple language pairs [93].
- SuperGLUE: A more challenging successor to GLUE, designed to test model generalization in complex reasoning tasks.
- Long-Document Datasets: Datasets such as WikiText-103 and collections of arXiv papers are used to test token pruning effectiveness on lengthy documents.
5.3. Comparison of Token Pruning Approaches
- Accuracy vs. Efficiency Trade-off: Methods that aggressively prune tokens (e.g., attention-based methods) achieve higher speedups but may suffer greater accuracy degradation.
- Hybrid Methods Offer the Best Balance: Combining multiple pruning techniques often results in superior efficiency gains while minimizing performance degradation.
- Task-Specific Sensitivity: Some pruning methods perform well for classification tasks but struggle with structured prediction tasks like machine translation.

| Method | Accuracy Drop (%) | Speedup (x) | Memory Reduction (%) |
|---|---|---|---|
| Attention-Based Pruning | 1.5 - 3.0 | 1.5 - 2.5x | 30 - 50% |
| Gradient-Based Pruning | 1.0 - 2.5 | 1.2 - 2.0x | 25 - 45% |
| Reinforcement Learning-Based Pruning | 0.5 - 2.0 | 2.0 - 3.5x | 40 - 60% |
| Hybrid Approaches | 0.5 - 1.5 | 2.5 - 4.0x | 50 - 70% |
5.4. Case Studies
5.4.1. Token Pruning for BERT Compression
5.4.2. Token Pruning for Machine Translation
5.4.3. Token Pruning for Long-Document Processing
5.5. Challenges in Empirical Evaluation
- Lack of Standardized Benchmarks: Most studies use different datasets, making direct comparisons difficult.
- Hardware-Dependent Speedup Measurements: Pruning effectiveness varies based on the underlying hardware (e.g., GPUs, TPUs).
- Trade-offs Between Efficiency and Generalization: Methods optimized for a specific dataset may not generalize well across diverse NLP tasks.
5.6. Summary
6. Practical Implementation of Token Pruning
6.1. Integrating Token Pruning into Transformer Models
6.1.1. Pruning at Input Embedding Level
- Reduces computational overhead at the earliest stage of processing.
- Requires minimal modifications to transformer architectures.
- Compatible with pre-trained transformer models without retraining.
- Pruning decisions are static and do not consider contextual importance.
- May lead to loss of critical information, impacting model accuracy [101].
6.1.2. Pruning During Self-Attention Computation
- Compute attention scores for all tokens.
- Identify tokens with attention scores below a predefined threshold.
- Mask or remove these tokens before computing subsequent attention updates.
- Context-aware pruning leads to better retention of important tokens.
- Improves efficiency while preserving task-relevant information [102].
- Requires modifications to the transformer’s attention mechanism [103].
- May introduce computational overhead due to dynamic token filtering.
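
The threshold-based masking step can be sketched as follows (PyTorch); the attention tensor layout and the threshold value are illustrative assumptions.

```python
import torch

def prune_in_attention(x: torch.Tensor, attn: torch.Tensor, thr: float = 0.02):
    """Sketch of in-attention pruning: tokens whose average received attention
    falls below `thr` are dropped before the next layer.
    x: (batch, seq, dim); attn: (batch, heads, seq, seq)."""
    received = attn.mean(dim=1).mean(dim=1)          # (batch, seq): avg attention received
    keep = received >= thr                           # boolean keep mask per token
    # Gather surviving tokens (shown for a single example to keep shapes simple).
    return x[0][keep[0]].unsqueeze(0), keep

x = torch.randn(1, 32, 64)
attn = torch.softmax(torch.randn(1, 8, 32, 32), dim=-1)
pruned_x, keep = prune_in_attention(x, attn)
print(f"{keep.sum().item()} of 32 tokens survive this layer")
```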
6.1.3. Pruning at the Feedforward Layers
- Reduces computation in the most expensive layers of the transformer [105].
- More flexible than input-level pruning, as it considers intermediate token representations.
- Requires additional mechanisms to adjust the remaining token representations.
- Implementation is more complex than input-level pruning.
6.2. Optimization Techniques for Efficient Execution
6.2.1. Efficient Memory Management
6.2.2. Sparse Computation Optimization
- Tensor decomposition techniques for reducing redundant computations [109].
- Hardware-aware sparse matrix multiplication (e.g., NVIDIA’s cuSPARSE library).
- Dynamic batching methods that adapt to varying sequence lengths post-pruning.
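
As a sketch of the dynamic batching point, sequences of unequal length after pruning can be repacked into a dense padded batch with an explicit padding mask (PyTorch); the sequence lengths and hidden size are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# After pruning, sequences in a batch have different lengths.  Repacking them
# into a padded tensor (plus a padding mask) keeps downstream matrix
# multiplications dense and GPU-friendly.
pruned_seqs = [torch.randn(n, 64) for n in (23, 17, 30)]    # per-example survivors

padded = pad_sequence(pruned_seqs, batch_first=True)        # (3, 30, 64)
lengths = torch.tensor([s.shape[0] for s in pruned_seqs])
pad_mask = torch.arange(padded.shape[1])[None, :] >= lengths[:, None]   # True = padding

print(padded.shape, pad_mask.sum(dim=1).tolist())           # padding per example
```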
6.2.3. Distillation-Aided Pruning
6.3. Challenges in Deploying Pruned Models
6.3.1. Compatibility with Pre-Trained Models
6.3.2. Inference-Time Adaptability
6.3.3. Scalability Across Hardware Platforms
6.4. Summary
7. Conclusion and Future Directions
7.1. Key Takeaways
- Effectiveness of Token Pruning: Empirical results show that token pruning can provide up to 3-4x inference speedup with minimal accuracy degradation, depending on the pruning strategy and NLP task.
- Trade-offs Between Efficiency and Performance: Aggressive pruning can lead to significant computational gains but may impact task-specific performance. Hybrid approaches offer a better balance by integrating multiple pruning signals.
- Task-Specific Sensitivity: Token pruning effectiveness varies across NLP tasks [118]. While it performs well in classification and language modeling tasks, structured prediction tasks such as translation and summarization require careful pruning strategies.
- Challenges in Dynamic Pruning: Real-time pruning methods introduce variability in computation, posing challenges for latency-sensitive applications [119].
- Hardware Considerations: While pruning reduces theoretical computation, its practical benefits depend on hardware compatibility, as some accelerators are optimized for dense operations.
7.2. Future Research Directions
7.3. Final Remarks
References
- Bi, X.; Chen, D.; Chen, G.; Chen, S.; Dai, D.; Deng, C.; Ding, H.; Dong, K.; Du, Q.; Fu, Z.; et al. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
- OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv:cs.CL/2303.08774, 2024. [Google Scholar]
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805, 2018. [Google Scholar]
- Bigham, J.P.; Jayant, C.; Ji, H.; Little, G.; Miller, A.; Miller, R.C.; Miller, R.; Tatarowicz, A.; White, B.; White, S.; et al. VizWiz: Nearly real-time answers to visual questions. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 2010.
- Fang, Y.; Liao, B.; Wang, X.; et al. You only look at one sequence: Rethinking transformer in vision through object detection. Advances in Neural Information Processing Systems 2021, 34, 26183–26197. [Google Scholar]
- Alpher, F. Frobnication. IEEE TPAMI 2002, 12, 234–778. [Google Scholar]
- Zobel, J.; Moffat, A. Inverted files for text search engines. ACM computing surveys (CSUR) 2006, 38, 6. [Google Scholar]
- Dehua Zheng, Wenhui Dong, H.H. Less is More: Focus Attention for Efficient DETR. arXiv preprint arXiv:2307.12612, arXiv:2307.12612 2023.
- Amati, G.; Van Rijsbergen, C.J. Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness. ACM Trans. Inf. Syst. 2002, 20, 357–389. [Google Scholar] [CrossRef]
- Lin, H.; Han, G.; Ma, J.; Huang, S.; Lin, X.; Chang, S.F. Supervised masked knowledge distillation for few-shot transformers. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp.
- Zhang, J.; Peng, H.; Wu, K.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. Minivit: Compressing vision transformers with weight multiplexing. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp.
- Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp.
- Gal, Y.; Ghahramani, Z. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Proceedings of the Proceedings of the 30th International Conference on Neural Information Processing Systems, USA, 2016.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929v2, 2021.
- Nogueira, R. From doc2query to docTTTTTquery. 2019.
- Reimers, N.; Gurevych, I. 2020; arXiv:cs.IR/2012.14210].
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, arXiv:1910.01108 2019.
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv:2303.08774, arXiv:2303.08774 2023.
- Herbrich, R.; Graepel, T.; Obermayer, K. Large margin rank boundaries for ordinal regression 2000. 88.
- Zniyed, Y.; Nguyen, T.P.; et al. Enhanced network compression through tensor decompositions and pruning. IEEE Transactions on Neural Networks and Learning Systems 2024. [Google Scholar]
- Xu, D.; Zhao, Z.; Xiao, J.; Wu, F.; Zhang, H.; He, X.; Zhuang, Y. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the Proceedings of the ACM international conference on Multimedia, 2017, pp.
- Byungseok Roh, JaeWoong Shin, W.S. Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity. arXiv preprint arXiv:2111.14330, arXiv:2111.14330 2021.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 2014, 15, 1929–1958. [Google Scholar]
- Touvron, H.; Cord, M.; El-Nouby, A.; Verbeek, J.; Jégou, H. Three Things Everyone Should Know About Vision Transformers. In Proceedings of the Computer Vision – ECCV 2022; Avidan, S.; Brostow, G.; Cissé, M.; Farinella, G.M.; Hassner, T., Eds., Cham; 2022; pp. 497–515. [Google Scholar]
- Peng, B.; Li, C.; He, P.; Galley, M.; Gao, J. Instruction tuning with gpt-4. arXiv:2304.03277, arXiv:2304.03277 2023.
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp.
- Paul, S.; Chen, P.Y. Vision transformers are robust learners. In Proceedings of the Proceedings of the AAAI conference on Artificial Intelligence, 2022, Vol.
- Zhan, J.; Mao, J.; Liu, Y.; Zhang, M.; Ma, S. 2020; arXiv:cs.IR/2006.15498].
- Bolya, D.; Fu, C.Y.; Dai, X.; Zhang, P.; Hoffman, J. Hydra Attention: Efficient Attention with Many Heads. In Proceedings of the Computer Vision – ECCV 2022 Workshops; Karlinsky, L.; Michaeli, T.; Nishino, K., Eds., Cham; 2023; pp. 35–49. [Google Scholar]
- Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved Baselines with Visual Instruction Tuning. arXiv:2310.03744, arXiv:2310.03744 2023.
- Research, G. Vision Transformer. https://github.com/google-research/vision_transformer/, 2023.
- Bojar, O.; Chatterjee, R.; Federmann, C.; Graham, Y.; Haddow, B.; Huck, M.; Yepes, A.J.; Koehn, P.; Logacheva, V.; Monz, C.; et al. Findings of the 2016 conference on machine translation. In Proceedings of the Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016, pp.
- Xiong, L.; Xiong, C.; Li, Y.; Tang, K.F.; Liu, J.; Bennett, P.; Ahmed, J.; Overwijk, A. 2020; arXiv:cs.IR/2007.00808].
- Ouyang, L.; Qu, Y.; Zhou, H.; Zhu, J.; Zhang, R.; Lin, Q.; Wang, B.; Zhao, Z.; Jiang, M.; Zhao, X.; et al. 2024; arXiv:cs.CV/2412.07626].
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems 2017, 30. [Google Scholar]
- Ren, S.; Gao, Z.; Hua, T.; Xue, Z.; Tian, Y.; He, S.; Zhao, H. Co-advise: Cross inductive bias distillation. In Proceedings of the Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, 2022, pp.
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp.
- Kurland, O.; Lee, L. Corpus structure, language models, and ad hoc information retrieval. ArXiv, 0405. [Google Scholar]
- Alpher, F.; Fotheringham-Smythe, F. Frobnication revisited. Journal of Foo 2003, 13, 234–778. [Google Scholar]
- Yang, H.; Yin, H.; Shen, M.; Molchanov, P.; Li, H.; Kautz, J. Global Vision Transformer Pruning With Hessian-Aware Saliency. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp.
- Huang, Z.; Shi, X.; Zhang, C.; Wang, Q.; Cheung, K.C.; Qin, H.; Dai, J.; Li, H. Flowformer: A transformer architecture for optical flow. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 2022, Proceedings, Part XVII. Springer, 2022, October 23–27; pp. 668–685.
- Lin, B.; Zhu, B.; Ye, Y.; Ning, M.; Jin, P.; Yuan, L. Video-llava: Learning united visual representation by alignment before projection. arXiv:2311.10122, arXiv:2311.10122 2023.
- Nicolas Carion, Francisco Massa, G.S. End-to-End Object Detection with Transformers. arXiv preprint arXiv:2005.12872, arXiv:2005.12872 2023.
- Liu, Y.; Li, Z.; Huang, M.; Yang, B.; Yu, W.; Li, C.; Yin, X.C.; Liu, C.L.; Jin, L.; Bai, X. OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences 2024, 67, 220102. [Google Scholar]
- Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; Feichtenhofer, C. Multiscale vision transformers. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp.
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media 2022, 8, 415–424. [Google Scholar]
- Paria, B.; Yeh, C.K.; Yen, I.E.; Xu, N.; Ravikumar, P.; Póczos, B. Minimizing FLOPs to Learn Efficient Sparse Representations. In Proceedings of the International Conference on Learning Representations; 2020. [Google Scholar]
- Shu, R.; Nakayama, H. Compressing Word Embeddings via Deep Compositional Code Learning. In Proceedings of the International Conference on Learning Representations; 2018. [Google Scholar]
- Chen, X.; Cao, Q.; Zhong, Y.; Zhang, J.; Gao, S.; Tao, D. DearKD: Data-efficient early knowledge distillation for vision transformers. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp.
- Brunner, G.; Liu, Y.; Pascual, D.; Richter, O.; Ciaramita, M.; Wattenhofer, R. On Identifiability in Transformers. 02 2020.
- Elena Voita, David Talbot, F.M. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv preprint arXiv:1905.09418, arXiv:1905.09418 2019.
- Janowsky, S.A. Pruning versus clipping in neural networks. Physical Review A 1989, 39, 6600. [Google Scholar]
- Zniyed, Y.; Nguyen, T.P.; et al. Efficient tensor decomposition-based filter pruning. Neural Networks 2024, 178, 106393. [Google Scholar]
- Taylor, M.; Guiver, J.; Robertson, S.; Minka, T. SoftRank: Optimising Non-Smooth Rank Metrics. February 2008.
- Kong, Z.; Dong, P.; Ma, X.; Meng, X.; Niu, W.; Sun, M.; Shen, X.; Yuan, G.; Ren, B.; Tang, H.; et al. SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 2022, Proceedings, Part XI. Springer, 2022, October 23–27; pp. 620–640.
- Dai, W.; Li, J.; Li, D.; Tiong, A.M.H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; Hoi, S. 2023; arXiv:cs.CV/2305.06500].
- Xing, L.; Huang, Q.; Dong, X.; Lu, J.; Zhang, P.; Zang, Y.; Cao, Y.; He, C.; Wang, J.; Wu, F.; et al. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. arXiv preprint arXiv:2410.17247, arXiv:2410.17247 2024.
- He, L.; Ren, X.; Gao, Q.; Zhao, X.; Yao, B.; Chao, Y. The connected-component labeling problem: A review of state-of-the-art algorithms. Pattern Recognition 2017, 70, 25–43. [Google Scholar]
- Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp.
- Terven, J.; Cordova-Esparza, D. A comprehensive review of yolo: From yolov1 and beyond. arXiv preprint arXiv:2304.00501, arXiv:2304.00501 2023.
- Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International conference on machine learning. PMLR; 2020; pp. 5156–5165. [Google Scholar]
- Fang, H.; Zhai, C. An Exploration of Axiomatic Approaches to Information Retrieval. In Proceedings of the Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2005. [CrossRef]
- Wei, S.; Ye, T.; Zhang, S.; Tang, Y.; Liang, J. Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp.
- Alpher, F.; Gamow, F. Can a computer frobnicate? In Proceedings of the CVPR; 2005; pp. 234–778. [Google Scholar]
- Fu, D.Y.; Arora, S.; Grogan, J.; Johnson, I.; Eyuboglu, S.; Thomas, A.W.; Spector, B.; Poli, M.; Rudra, A.; Ré, C. Monarch Mixer: A simple sub-quadratic GEMM-based architecture. arXiv preprint arXiv:2310.12109, arXiv:2310.12109 2023.
- Wang, L.; Li, L.; Dai, D.; Chen, D.; Zhou, H.; Meng, F.; Zhou, J.; Sun, X. Label words are anchors: An information flow perspective for understanding in-context learning. arXiv preprint arXiv:2305.14160, arXiv:2305.14160 2023.
- Yu, S.; Chen, T.; Shen, J.; Yuan, H.; Tan, J.; Yang, S.; Liu, J.; Wang, Z. Unified Visual Transformer Compression. ArXiv, 2203. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems 2022. [Google Scholar]
- Team, G.; Anil, R.; Borgeaud, S.; Wu, Y.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; et al. Gemini: A family of highly capable multimodal models. arXiv:2312.11805, arXiv:2312.11805 2023.
- Zhai, X.; Kolesnikov, A.; Houlsby, N.; Beyer, L. Scaling vision transformers. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, arXiv:2010.11929 2020.
- Yu, F.; Huang, K.; Wang, M.; Cheng, Y.; Chu, W.; Cui, L. Width & Depth Pruning for Vision Transformers. In Proceedings of the AAAI Conference on Artificial Intelligence; 2022. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context, 2014. cite arxiv:1405.0312Comment: 1) updated annotation pipeline description and figures; 2) added new section describing datasets splits; 3) updated author list.
- Chatfield, K.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, arXiv:1405.3531 2014.
- Yanghao Li, Hanzi Mao, R.G. Exploring Plain Vision Transformer Backbones for Object Detection. arXiv preprint arXiv:2203.16527, arXiv:2203.16527 2022.
- Urbano, J.; Marrero, M. The Treatment of Ties in AP Correlation. In Proceedings of the Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, New York, NY, USA, 2017. [CrossRef]
- Baranchuk, D.; Persiyanov, D.; Sinitsin, A.; Babenko, A. Learning to route in similarity graphs. In Proceedings of the International Conference on Machine Learning. PMLR; 2019; pp. 475–484. [Google Scholar]
- Wu, K.; Zhang, J.; Peng, H.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. Tinyvit: Fast pretraining distillation for small vision transformers. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 2022, Proceedings, Part XXI. Springer, 2022, October 23–27; pp. 68–85.
- Hoos, H.H.; Stützle, T. Stochastic local search: Foundations and applications; Elsevier, 2004.
- Cordonnier, J.B.; Loukas, A.; Jaggi, M. On the Relationship between Self-Attention and Convolutional Layers. In Proceedings of the International Conference on Learning Representations; 2020. [Google Scholar]
- Chang, S.E.; Li, Y.; Sun, M.; Shi, R.; So, H.K.H.; Qian, X.; Wang, Y.; Lin, X. Mix and match: A novel fpga-centric deep neural network quantization framework. In Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE; 2021; pp. 208–220. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. CoRR, 1512. [Google Scholar]
- Zamani, H.; Dehghani, M.; Croft, W.B.; Learned-Miller, E.; Kamps, J. From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. In Proceedings of the Proceedings of the 27th ACM International Conference on Information and Knowledge Management, New York, NY, USA, 2018. [CrossRef]
- Adams, D. The Hitchhiker’s Guide to the Galaxy; San Val, 1995.
- Karnin, E.D. A simple procedure for pruning back-propagation trained neural networks. IEEE transactions on neural networks 1990, 1, 239–242. [Google Scholar]
- McDonald, R.; Brokos, G.; Androutsopoulos, I. Deep Relevance Ranking Using Enhanced Document-Query Interactions. In Proceedings of the Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), 2018.
- Lam, S.K.; Pitrou, A.; Seibert, S. Numba: A llvm-based python jit compiler. In Proceedings of the Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, 2015, pp.
- Yi Tay, Dara Bahri, L.Y. Sparse Sinkhorn Attention. arXiv preprint arXiv:2002.11296, arXiv:2002.11296 2020.
- Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M.W. 2020; arXiv:cs.CL/2002.08909].
- Zhang, Y.; Rahman, M.M.; Braylan, A.; Dang, B.; Chang, H.; Kim, H.; McNamara, Q.; Angert, A.; Banner, E.; Khetan, V.; et al. Neural Information Retrieval: A Literature Review. CoRR, 1611. [Google Scholar]
- Alpher, F.; Fotheringham-Smythe, F.; Gamow, F. Can a machine frobnicate? Journal of Foo 2004, 14, 234–778. [Google Scholar]
- Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Physical Review E—Statistical, Nonlinear, and Soft Matter Physics 2004, 69, 066138. [Google Scholar] [PubMed]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. CoRR, 1301. [Google Scholar]
- Koohpayegani, S.A.; Pirsiavash, H. Sima: Simple softmax-free attention for vision transformers. arXiv preprint arXiv:2206.08898, arXiv:2206.08898 2022.
- MacAvaney, S.; Nardini, F.M.; Perego, R.; Tonellotto, N.; Goharian, N.; Frieder, O. , Efficient Document Re-Ranking for Transformers by Precomputing Term Representations. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval; Association for Computing Machinery: New York, NY, USA, 2020; pp. 49–58. [Google Scholar]
- Louizos, C.; Welling, M.; Kingma, D.P. 2018; arXiv:stat.ML/1712.01312].
- Burges, C.J. From RankNet to LambdaRank to LambdaMART: An Overview. Technical report, 2010.
- Babenko, A.; Lempitsky, V. The inverted multi-index. IEEE transactions on pattern analysis and machine intelligence 2014, 37, 1247–1260. [Google Scholar]
- Xiong, Y.; Zeng, Z.; Chakraborty, R.; Tan, M.; Fung, G.; Li, Y.; Singh, V. Nyströmformer: A nyström-based algorithm for approximating self-attention. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, 2021, Vol.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, arXiv:1810.04805 2018.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 2020, 21, 1–67. [Google Scholar]
- Andrew Howard, Mark Sandler, G.C. Searching for MobileNetV3. arXiv preprint arXiv:1905.02244, arXiv:1905.02244 2019.
- Gong, C.; Wang, D.; Li, M.; Chandra, V.; Liu, Q. Vision transformers with patch diversification. arXiv preprint arXiv:2104.12753, arXiv:2104.12753 2021.
- Jégou, H.; Douze, M.; Schmid, C. Product Quantization for Nearest Neighbor Search. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 117–128. [Google Scholar] [PubMed]
- Craswell, N.; Mitra, B.; Yilmaz, E.; Campos, D.; Voorhees, E.M. Overview of the trec 2019 deep learning track. arXiv preprint arXiv:2003.07820, arXiv:2003.07820 2020.
- Li, Y.; Hu, J.; Wen, Y.; Evangelidis, G.; Salahi, K.; Wang, Y.; Tulyakov, S.; Ren, J. Rethinking Vision Transformers for MobileNet Size and Speed. arXiv preprint arXiv:2212.08059, arXiv:2212.08059 2022.
- Li, Y.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp.
- Tonellotto, N.; Macdonald, C. Query Embedding Pruning for Dense Retrieval. CoRR, 2108. [Google Scholar]
- Nogueira, R.; Jiang, Z.; Lin, J. 2020; arXiv:cs.IR/2003.06713].
- Ouyang, L.; Qu, Y.; Zhou, H.; Zhu, J.; Zhang, R.; Lin, Q.; Wang, B.; Zhao, Z.; Jiang, M.; Zhao, X.; et al. 2024; arXiv:cs.CV/2412.07626].
- Voorhees, E.M.; Harman, D.K. TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing); The MIT Press, 2005.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; et al. Language models are unsupervised multitask learners. OpenAI blog 2019. [Google Scholar]
- Liu, H.; Yan, W.; Zaharia, M.; Abbeel, P. 2024; arXiv:cs.LG/2402.08268].
- Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning efficient convolutional networks through network slimming. In Proceedings of the Proceedings of the IEEE international conference on computer vision, 2017, pp.
- Jang, Y.; Song, Y.; Yu, Y.; Kim, Y.; Kim, G. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.
- Li, H. Learning to Rank for Information Retrieval and Natural Language Processing; Morgan & Claypool Publishers, 2011.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International conference on machine learning. PMLR; 2021; pp. 10347–10357. [Google Scholar]
- Clinchant, S.; Gaussier, E. Information-based Models for Ad Hoc IR. In Proceedings of the Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2010. [CrossRef]
- Cai, H.; Li, J.; Hu, M.; Gan, C.; Han, S. EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp.
- Câmara, A.; Hauff, C. Diagnosing BERT with Retrieval Heuristics. In Proceedings of the Advances in Information Retrieval - 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, 2020, Proceedings, Part I; Jose, J.M.; Yilmaz, E.; Magalhães, J.; Castells, P.; Ferro, N.; Silva, M.J.; Martins, F., Eds. Springer, 2020, Vol. 12035, Lecture Notes in Computer Science, April 14-17; pp. 605–618. [CrossRef]
- Lit, Z.; Sun, M.; Lu, A.; Ma, H.; Yuan, G.; Xie, Y.; Tang, H.; Li, Y.; Leeser, M.; Wang, Z.; et al. Auto-ViT-Acc: An FPGA-aware automatic acceleration framework for vision transformer with mixed-scheme quantization. In Proceedings of the 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE; 2022; pp. 109–116. [Google Scholar]
- Zhang, L.; Xu, D.; Arnab, A.; Torr, P.H. Dynamic graph message passing networks. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp.
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression, 2019; arXiv:cs.CV/1902.09630.
- Chen, T.; Li, L.; Sun, Y. Differentiable product quantization for end-to-end embedding compression. In Proceedings of the International Conference on Machine Learning. PMLR; 2020; pp. 1617–1626. [Google Scholar]
- Wang, X.; Zhang, H.; Huang, W.; Scott, M.R. Cross-Batch Memory for Embedding Learning. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp.
- Nogueira, R.; Cho, K. 2019; arXiv:cs.IR/1901.04085].
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
