Comparative Study on Fusion Method Based on Multimodal Speech Emotion Recognition of Speech and Text

Authors

  • Xinjie Xie

DOI:

https://doi.org/10.62051/qadyms14

Keywords:

Emotion Recognition; Multimodal; Deep Learning; Intelligent NPC Interaction; Game.

Abstract

With the rapid development of artificial intelligence and deep learning, emotion recognition has become an important research area in human-computer interaction. In the gaming industry, however, it is rarely used to make non-player characters (NPCs) more intelligent or to enhance immersion. This study therefore explores the feasibility of applying multimodal emotion recognition in gaming scenarios, aiming to improve recognition accuracy by combining speech and textual information and thereby to optimize NPC interactions within games. The study uses the IEMOCAP dataset, integrates audio and textual features, and trains and evaluates a range of machine learning and deep learning models. It also compares the accuracy and training speed of several advanced fusion models to investigate whether these technologies can meet the accuracy and real-time requirements of gaming applications. The results show that bimodal audio-text models significantly outperform unimodal models, with an improvement exceeding 15%. Current advanced models achieve an accuracy of over 75% with relatively short training times, preliminarily meeting the accuracy and real-time requirements of games.
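To make the idea of combining audio and text concrete, the following is a minimal Python sketch of two generic fusion strategies: feature-level (early) fusion, which concatenates the two feature vectors, and decision-level (late) fusion, which averages per-modality class probabilities. It is an illustration under stated assumptions, not the fusion models compared in the paper: the label set, function names, weights, and probability values are all hypothetical.

```python
from typing import List, Sequence

# Hypothetical four-class label set, chosen only for illustration.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def early_fusion(audio_feat: Sequence[float], text_feat: Sequence[float]) -> List[float]:
    """Feature-level fusion: concatenate both vectors into one joint vector."""
    return list(audio_feat) + list(text_feat)


def late_fusion(
    audio_probs: Sequence[float],
    text_probs: Sequence[float],
    audio_weight: float = 0.5,
) -> List[float]:
    """Decision-level fusion: weighted average of per-modality class probabilities."""
    if len(audio_probs) != len(text_probs):
        raise ValueError("both modalities must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    text_weight = 1.0 - audio_weight
    return [audio_weight * a + text_weight * t for a, t in zip(audio_probs, text_probs)]


def predict(probs: Sequence[float]) -> str:
    """Return the label with the highest fused probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return EMOTIONS[best]


# Made-up classifier outputs for one utterance (not results from the paper).
audio_probs = [0.40, 0.10, 0.35, 0.15]  # the audio model alone leans "angry"
text_probs = [0.10, 0.05, 0.25, 0.60]   # the text model alone leans "sad"

fused = late_fusion(audio_probs, text_probs)
print(predict(audio_probs), predict(text_probs), predict(fused))

# Early fusion only builds the joint input; a classifier would be trained on it.
joint = early_fusion([0.1, 0.2, 0.3], [0.4, 0.5])
print(len(joint))
```

In this toy case the two modalities disagree, and the equally weighted average settles on the label that the text model scores with higher confidence. Early fusion, by contrast, leaves the decision to a single downstream classifier trained on the concatenated vector.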

Published

10-07-2025

How to Cite

Xie, X. (2025) “Comparative Study on Fusion Method Based on Multimodal Speech Emotion Recognition of Speech and Text”, Transactions on Computer Science and Intelligent Systems Research, 9, pp. 197–207. doi:10.62051/qadyms14.