Fine-Tuning and Optimization of Live2D Facial Expression Recognition Based on Vision Transformer (ViT)

Ziyang Chen

doi:10.62051/v27mjf51

Keywords:

Vision Transformer, Facial Expression Recognition, Live2D Technology, Quantization Optimization, Low-Resource Devices.

Abstract

With the development of fields such as virtual reality, the expressiveness of virtual characters' facial expressions has become increasingly important. Traditional CNN-based facial expression recognition methods suffer from feature extraction that is limited to local regions, sensitivity to pose changes, and high computational complexity. This study builds on the Vision Transformer (ViT) model, explores its optimization and application to the facial expression recognition task, and combines it with Live2D technology to achieve real-time, efficient expression conversion. Through its self-attention mechanism, ViT captures global information and can better model changes in facial expression; it also has potential for transfer learning and is suitable for low-resource devices. Using the FER-2013 dataset, the ViT model was fine-tuned and quantized, then compared with traditional CNN models such as ResNet50. Experiments show that the optimized ViT model achieves higher accuracy and better real-time performance in facial expression recognition, and that it also runs efficiently in low-resource environments. Quantization further reduces the computational overhead, making the model suitable for low-cost consumer software. This research enhances the application value of ViT in affective computing, supports the development of Live2D technology, and promotes the wider use of virtual characters. Future work will explore the potential of ViT in multimodal emotion recognition and optimize its performance in complex scenarios, with the aim of providing a more comprehensive solution for intelligent interaction with virtual characters.
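As a rough illustration of why quantization lowers storage and compute cost, the sketch below applies symmetric per-tensor int8 quantization to a small list of weights. This is an assumption made for illustration only: the abstract does not state which quantization scheme the study uses, and the function names and weight values here are hypothetical.

```python
# Illustrative sketch only: symmetric per-tensor int8 quantization.
# The scheme, function names, and weight values are assumptions,
# not the method reported in the paper.

def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]  # hypothetical weight values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The rounding error per weight is bounded by half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, max_error)
```

Each code fits in one byte instead of the four bytes of a 32-bit float, at the price of a rounding error of at most half a quantization step per weight.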
