Research on Speech Emotion Recognition Method Based on ResSE_CNN1D

Yingcheng Zhang

doi:10.62051/tzs1ab85

Keywords:

ResSE_CNN1D, openSMILE, SER, CASIA.

Abstract

This article examines a speech emotion recognition (SER) method based on ResSE_CNN1D, an enhanced one-dimensional convolutional neural network. With the rapid development of artificial intelligence, SER has a profound impact in many areas. In the proposed pipeline, features are extracted from the input audio with openSMILE and fed into the ResSE_CNN1D model, whose output is classified by a softmax activation function to obtain the final result. The key strengths of the model are efficient learning from small datasets and rapid deployment in resource-constrained environments. ResSE_CNN1D improves on a plain CNN1D by adding residual connections and a squeeze-and-excitation (SE) module, which increases recognition accuracy and mitigates overfitting. The model was trained and evaluated on audio samples from the CASIA dataset. The final accuracy was 0.900, which is 2.9 percent higher than that of the CNN1D method. Analysis of the confusion matrix and of the accuracy and loss curves over the training epochs indicates that the model is robust and effectively avoids overfitting. It also reaches high accuracy with relatively little training, meeting the goal of a lightweight model.
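
The data flow described in the abstract (a feature vector, a 1D convolution, a residual block with an SE module, softmax) can be illustrated with a minimal NumPy sketch. It is not the author's implementation: the layer sizes, the reduction ratio, the feature length of 384, the six output classes, and all function and parameter names are assumptions chosen for illustration. The weights are random and untrained, the input is a random stand-in for an openSMILE feature vector, and layers such as batch normalization and dropout are omitted.

```python
import numpy as np

def conv1d_same(x, w, b):
    """1D convolution with zero padding so the length is preserved.
    x: (C_in, L), w: (C_out, C_in, K) with odd K, b: (C_out,)."""
    c_out, _, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.empty((c_out, x.shape[1]))
    for t in range(x.shape[1]):
        # Contract the (C_in, K) window against every output filter.
        out[:, t] = np.tensordot(w, xp[:, t:t + k], axes=([1, 2], [0, 1]))
    return out + b[:, None]

def se_scale(x, w1, w2):
    """Squeeze-and-excitation: reweight the channels of x (C, L)."""
    z = x.mean(axis=1)                    # squeeze: one value per channel
    s = np.maximum(w1 @ z, 0.0)           # bottleneck layer with ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))   # sigmoid gate per channel, in (0, 1)
    return x * s[:, None]                 # excitation: scale each channel

def res_se_block(x, p):
    """Two convolutions, SE reweighting, then the residual (skip) addition."""
    y = np.maximum(conv1d_same(x, p["w_a"], p["b_a"]), 0.0)
    y = conv1d_same(y, p["w_b"], p["b_b"])
    y = se_scale(y, p["se_w1"], p["se_w2"])
    return np.maximum(x + y, 0.0)

def softmax(z):
    e = np.exp(z - z.max())               # subtract the max for stability
    return e / e.sum()

def make_params(rng, channels=16, reduction=4, kernel=3, n_classes=6):
    """Random, untrained weights; every size here is an assumed placeholder."""
    def w(*shape):
        return rng.normal(0.0, 0.1, size=shape)
    return {
        "stem_w": w(channels, 1, kernel), "stem_b": np.zeros(channels),
        "w_a": w(channels, channels, kernel), "b_a": np.zeros(channels),
        "w_b": w(channels, channels, kernel), "b_b": np.zeros(channels),
        "se_w1": w(channels // reduction, channels),
        "se_w2": w(channels, channels // reduction),
        "fc_w": w(n_classes, channels), "fc_b": np.zeros(n_classes),
    }

def forward(features, p):
    """features: 1D feature vector of length F -> class probabilities."""
    x = features[None, :]                 # treat the vector as (1, F)
    x = np.maximum(conv1d_same(x, p["stem_w"], p["stem_b"]), 0.0)
    x = res_se_block(x, p)
    pooled = x.mean(axis=1)               # global average pooling over length
    return softmax(p["fc_w"] @ pooled + p["fc_b"])

rng = np.random.default_rng(0)
params = make_params(rng)
features = rng.normal(size=384)           # stand-in for an openSMILE vector
probs = forward(features, params)
print(probs.shape, probs.sum())
```

The skip connection adds the block input back to the SE-reweighted output, so the block only has to learn a correction to its input, while the SE gate lets the network emphasize or suppress whole channels using a small number of extra weights.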
