A Survey on Multimodal Emotion Recognition: Integrating Cues for a Deeper Understanding of Affect

Authors

  • Zhanpeng Li
  • Yuming Qi
  • Sanpeng Deng
  • Xiumin Shi

DOI:

https://doi.org/10.62051/ijcsit.v7n3.02

Keywords:

Multimodal Emotion Recognition, Affective Computing, Deep Learning, Feature Fusion, Sentiment Analysis, Transformer Models

Abstract

Multimodal Emotion Recognition (MER) has emerged as a crucial area of research in artificial intelligence and human-computer interaction, aiming to build systems that understand human affective states by integrating information from multiple modalities. This review provides a comprehensive overview of the MER landscape, synthesizing insights from foundational and recent literature. We delve into the primary modalities utilized—including visual (facial expressions), acoustic (speech prosody), textual (language content), and physiological signals—and discuss the state-of-the-art deep learning techniques for feature extraction within each. A central focus is placed on multimodal fusion strategies, ranging from early (feature-level) and late (decision-level) fusion to more sophisticated Transformer-based architectures and attention mechanisms that capture complex inter-modal dynamics. We also examine the role of advanced architectures such as Multimodal Large Language Models (MLLMs), as well as techniques like knowledge distillation for coping with missing modalities in real-world settings. Key benchmark datasets that have propelled the field forward are described. Finally, we outline the persistent challenges, including data scarcity, modality misalignment, and real-world robustness, and propose promising future research directions to advance the development of more accurate, robust, and context-aware affective computing systems.
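
To make the fusion taxonomy in the abstract concrete, the sketch below contrasts early (feature-level) fusion, which concatenates unimodal features before classification, with late (decision-level) fusion, which combines per-modality predictions. It is a minimal illustration under stated assumptions: the feature dimensions, the linear classifiers, and the unweighted averaging are hypothetical choices, not the specific models covered by the survey.

```python
# Minimal sketch of early vs. late fusion for visual/audio/text features.
# All dimensions and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Feature-level fusion: concatenate unimodal features, then classify jointly."""
    def __init__(self, dims=(128, 64, 256), n_classes=6):
        super().__init__()
        self.clf = nn.Linear(sum(dims), n_classes)

    def forward(self, visual, audio, text):
        fused = torch.cat([visual, audio, text], dim=-1)  # one joint feature vector
        return self.clf(fused)

class LateFusion(nn.Module):
    """Decision-level fusion: independent per-modality classifiers, then average logits."""
    def __init__(self, dims=(128, 64, 256), n_classes=6):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d, n_classes) for d in dims)

    def forward(self, visual, audio, text):
        logits = [head(x) for head, x in zip(self.heads, (visual, audio, text))]
        return torch.stack(logits).mean(dim=0)  # simple unweighted decision vote

v, a, t = torch.randn(8, 128), torch.randn(8, 64), torch.randn(8, 256)
print(EarlyFusion()(v, a, t).shape, LateFusion()(v, a, t).shape)  # both (8, 6)
```

Early fusion exposes cross-modal feature interactions directly to the classifier, while late fusion degrades more gracefully when one modality is noisy or absent; the more sophisticated strategies the survey discusses build on this trade-off.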
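For the Transformer-based fusion mentioned above, the core operation is cross-modal attention, in which one modality's sequence queries another's, as in the Multimodal Transformer of Tsai et al. [10]. The snippet below sketches only that single operation, not the full architecture; the batch size, sequence lengths, and embedding dimension are assumptions made for illustration.

```python
# Cross-modal attention sketch: text queries audio (cf. Tsai et al. [10]).
# Shapes are illustrative; the two sequences need not be aligned in length.
import torch
import torch.nn as nn

d_model = 64
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

text  = torch.randn(8, 20, d_model)   # (batch, text steps, features)
audio = torch.randn(8, 50, d_model)   # (batch, audio frames, features)

# Query = text, key/value = audio: each text step attends over all audio
# frames, so the output keeps the text length but absorbs acoustic context.
enriched_text, _ = attn(query=text, key=audio, value=audio)
print(enriched_text.shape)  # torch.Size([8, 20, 64])
```

Because attention aligns the two sequences implicitly, this style of fusion can handle unaligned multimodal streams, which is the setting targeted in [10].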


References

[1] Plutchik, R. (2001). The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist, 89(4), 344-350.

[2] Ekman, P., & Friesen, W. V. (1971). Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2), 124.

[3] Poria, S., Cambria, E., Bajpai, R., & Hussain, A. (2017). A review of affective computing: from unimodal analysis to multimodal fusion. Information Fusion, 37, 98-125.

[4] Gideon, J., McInroe, A., Brophy, C., Wang, Z., & Fitter, N. T. (2023). A Survey of Affective-Computing-Based Multimodal Emotion Recognition. IEEE Transactions on Affective Computing.

[5] Wu, X., Mou, X., Liu, Y., & Liu, X. (2024). A multimodal emotion recognition algorithm based on speech, text and facial expression. Journal of Northwest University (Natural Science Edition), 54(2), 178-187.

[6] Liu, Z., & Lei, Y. (2024). Design and Experiment of Multi-Modal Sentiment Analysis Model by Fusing Multi-scale Features. Research and Exploration in Laboratory, 43(9), 78-83.

[7] Qiang, Y., Chu, S., & Hu, Y. (2025). MSD-Net: Multimodal Soft Knowledge Distillation for Sentiment Analysis in Real-World Modality Missing Scenarios. Journal of Taiyuan University of Technology.

[8] Yang, R., & Ma, J. (2023). A Feature-Enhanced Multi-modal Emotion Recognition Model Integrating Knowledge and Res-ViT. Data Analysis and Knowledge Discovery, 7(11), 14-25.

[9] Ye, J., Zheng, W., Li, Y., Cai, Y., & Cui, Z. (2017). Multimodal emotion recognition based on deep neural network. Journal of Southeast University (English Edition), 33(4), 444-447.

[10] Tsai, Y. H. H., Bai, S., Liang, P. P., Kolter, J. Z., Morency, L. P., & Salakhutdinov, R. (2019). Multimodal Transformer for Unaligned Multimodal Language Sequences. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

[11] Liu, J., Zhang, P., Liu, Y., Zhang, W., & Fang, J. (2021). Summary of Multi-modal Sentiment Analysis Technology. Journal of Frontiers of Computer Science and Technology, 15(7), 1165-1184.

[12] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[13] Eyben, F., Wöllmer, M., & Schuller, B. (2010). openSMILE: The Munich versatile and fast open-source audio feature extractor. Proceedings of the 18th ACM International Conference on Multimedia.

[14] Zheng, W. L., & Lu, B. L. (2015). Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Transactions on Autonomous Mental Development, 7(3), 162-175.

Published

29-10-2025

Issue

Vol. 7 No. 3 (2025)

Section

Articles

How to Cite

Li, Z., Qi, Y., Deng, S., & Shi, X. (2025). A Survey on Multimodal Emotion Recognition: Integrating Cues for a Deeper Understanding of Affect. International Journal of Computer Science and Information Technology, 7(3), 10-15. https://doi.org/10.62051/ijcsit.v7n3.02