Research on Deep Learning-based Speech Emotion Recognition System
DOI: https://doi.org/10.62051/ijcsit.v3n2.32
Keywords: Multimodal Feature Fusion, MFCC, Speech Emotion Recognition
Abstract
Speech, as one of the primary means of human communication, conveys not only rich semantic information but also the emotional cues of the speaker. With the rapid advancement of deep learning, speech emotion recognition technology has been increasingly integrated into daily life, in areas such as telecommunications, automotive systems, and psychological health monitoring, underscoring the importance of research in this field. In this study, we propose a parallel architecture for multimodal feature fusion in speech emotion recognition, and we design and implement a system that addresses two common shortcomings: limited feature diversity and insufficient classification accuracy. Our method integrates complementary features: spectrograms are processed by Convolutional Neural Networks (CNNs) to capture local and global spatial characteristics of speech, while Mel-Frequency Cepstral Coefficients (MFCCs) are processed by Long Short-Term Memory networks (LSTMs) to extract context-dependent temporal dynamics. The proposed CNN+LSTM parallel structure (CL) fuses these spatial and temporal features and yields significant accuracy improvements over models relying solely on spatial or temporal features, with gains of 6.88% and 7.20% in experiments on the EMO-DB and CASIA databases, respectively. Finally, we validate the practicality and efficiency of the complete speech emotion recognition system by porting it to the NVIDIA Jetson Xavier NX platform.
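To make the described architecture concrete, below is a minimal sketch of how such a parallel CNN+LSTM (CL) fusion model could be wired up in PyTorch. The paper does not publish code, so every layer size, the 40-coefficient MFCC input, and the 7-class output (matching EMO-DB's seven emotion categories) are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch (not the authors' released code) of the parallel CNN+LSTM
# ("CL") fusion described in the abstract. All layer sizes, feature
# dimensions, and the 7-class output are illustrative assumptions.
import torch
import torch.nn as nn

class CLParallelNet(nn.Module):
    """Parallel fusion: CNN branch on spectrograms, LSTM branch on MFCCs."""
    def __init__(self, n_mfcc: int = 40, n_classes: int = 7):
        super().__init__()
        # Spatial branch: 2-D CNN over the spectrogram image (1 x F x T).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size output regardless of T
        )
        # Temporal branch: LSTM over the MFCC frame sequence (T x n_mfcc).
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=64,
                            num_layers=1, batch_first=True)
        # Fusion: concatenate the two branch embeddings, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(32 * 4 * 4 + 64, 128), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, n_classes),
        )

    def forward(self, spec: torch.Tensor, mfcc: torch.Tensor) -> torch.Tensor:
        # spec: (B, 1, F, T) spectrogram; mfcc: (B, T, n_mfcc) frame sequence.
        spatial = self.cnn(spec).flatten(1)   # (B, 32*4*4)
        _, (h_n, _) = self.lstm(mfcc)         # h_n: (1, B, 64)
        temporal = h_n[-1]                    # (B, 64) final hidden state
        fused = torch.cat([spatial, temporal], dim=1)
        return self.classifier(fused)         # (B, n_classes) logits

if __name__ == "__main__":
    model = CLParallelNet()
    spec = torch.randn(8, 1, 128, 200)  # batch of 8 spectrograms
    mfcc = torch.randn(8, 200, 40)      # matching MFCC frame sequences
    print(model(spec, mfcc).shape)      # torch.Size([8, 7])

The key design choice reflected here is fusion by concatenation: each branch is reduced to a fixed-length embedding (the CNN via adaptive pooling, the LSTM via its final hidden state), so the spatial and temporal representations can be joined regardless of utterance length before a shared classifier produces the emotion logits.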
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.