A Brief Review of Lightweighting Methods for Vision Transformers (ViT)
DOI: https://doi.org/10.62051/ijcsit.v4n2.37

Keywords: Vision Transformer, Lightweight Strategies, Mobile Deployment, Post-Training Modifications, Model Architecture Changes

Abstract
The Vision Transformer (ViT) has emerged as a powerful model in recent years, surpassing traditional Convolutional Neural Networks (CNNs) on various benchmarks. However, its large architecture, high parameter count, and heavy computational demands make deployment on mobile devices difficult, so its strong performance on computer vision tasks often cannot be carried over to mobile platforms. This paper provides a comprehensive review of the literature on lightweight ViT models, covering two families of optimization strategies: post-training modifications (quantization, pruning, and knowledge distillation) and architectural changes (hybrid CNN-Transformer, MLP-based, and sparse models), all aimed at improving efficiency on mobile platforms. The review seeks to clarify current techniques for mobile ViT, guide future research, stimulate innovation, and contribute to the development of efficient ViT models for mobile environments.
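To make the post-training route concrete, the sketch below applies dynamic int8 quantization, one of the post-training modifications discussed above, to a ViT in PyTorch. This is a minimal illustration rather than a method from any specific paper covered by the review; it assumes PyTorch with torchvision installed and uses the untrained vit_b_16 model purely as a stand-in.

    import torch
    from torchvision.models import vit_b_16

    # Load an (untrained) ViT-B/16 purely as a stand-in; any ViT built from
    # nn.Linear submodules can be treated the same way.
    model = vit_b_16(weights=None).eval()

    # Swap the model's fp32 nn.Linear submodules (in torchvision's ViT these
    # are chiefly the MLP blocks and the classification head, which hold most
    # of the parameters) for int8 dynamically quantized equivalents.
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    # Sanity check: a dummy 224x224 RGB image still flows through.
    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        logits = quantized(x)
    print(logits.shape)  # torch.Size([1, 1000])

Dynamic quantization stores the affected weights in int8 (roughly a 4x size reduction for those layers) and needs no calibration data; static post-training quantization goes further by also quantizing activations, at the cost of requiring a small calibration set.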
License
Copyright (c) 2024 International Journal of Computer Science and Information Technology

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.