Improvement of Speech-Paraformer Large ASR for Industrial Voice Control in High-Noise Environments
DOI: https://doi.org/10.62051/ijcsit.v8n4.07

Keywords: Speech-Paraformer-Large, FunASR, Industrial Noise Robustness, Post-processing, SNR Degradation, White Noise Comparison

Abstract
This study systematically evaluates the robustness of the Speech-Paraformer-Large automatic speech recognition (ASR) model under simulated industrial noise and proposes an effective post-processing enhancement strategy for safety-critical voice-controlled human-robot interaction in manufacturing environments. The controlled experiment used a dataset of 10 Mandarin Chinese industrial commands recorded in clean conditions (16 kHz, 16-bit PCM). Noisy test conditions were generated by mixing the clean recordings with continuous white noise and authentic industrial machinery noise at signal-to-noise ratios (SNRs) from 20 dB to -10 dB in 5 dB increments. The pre-trained Speech-Paraformer-Large model was evaluated, and a text-based verification layer with three hierarchical matching strategies (fuzzy exact matching, substring containment, and sliding-window similarity) was implemented as post-processing; performance was assessed via word error rate (WER) and accuracy across 50 test utterances per condition. Results show that industrial machinery noise is significantly more detrimental to ASR performance than white noise (24% vs. 70% accuracy, respectively, at -10 dB SNR). The proposed verification layer consistently improved performance across all SNR levels: accuracy rose by 8 percentage points (88% to 96%) under 0 dB white noise and by 10 percentage points (24% to 34%, a 41.6% relative improvement) under -10 dB industrial noise. It also reduced substitution errors by 34%, insertion errors by 31%, and total errors by 32%, with an unexpected efficiency gain (a 51.2% reduction in computation time at 0 dB industrial noise). These findings demonstrate that intelligent post-processing can achieve practical, deployable robustness gains without model retraining or acoustic preprocessing, and that the proposed text-based verification layer offers a cost-effective way to improve voice-control reliability in industrial environments, with direct implications for manufacturing safety and efficiency.
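The noise conditions described above can be reproduced with the standard power-based SNR mixing procedure. The sketch below is a generic illustration of that procedure, not the authors' code; the function name and the assumption of mono floating-point arrays at a shared sample rate are ours.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR, then add it to `clean`.

    SNR(dB) = 10 * log10(P_signal / P_noise), so the noise is rescaled by
    sqrt(P_signal / (P_noise * 10^(SNR/10))).
    """
    # Loop or trim the noise to the length of the clean utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    p_signal = np.mean(clean.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# The paper's grid: 20 dB down to -10 dB in 5 dB steps.
snr_levels = list(range(20, -15, -5))  # [20, 15, 10, 5, 0, -5, -10]
```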
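To make the post-processing idea concrete, here is a minimal Python sketch of a three-tier text verification layer of the kind the abstract names, trying fuzzy exact matching, then substring containment, then sliding-window similarity. The command list, the normalization step, and the 0.8 similarity threshold are illustrative assumptions; the paper's actual 10-command vocabulary and tuned parameters are not given here.

```python
import difflib

# Hypothetical command vocabulary (placeholders, not the paper's command set).
COMMANDS = ["启动设备", "停止设备", "紧急停止", "加速运行", "减速运行"]

def normalize(text: str) -> str:
    """Drop whitespace and punctuation so matching tolerates ASR formatting."""
    return "".join(ch for ch in text if ch.isalnum())

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def verify(hypothesis: str, threshold: float = 0.8) -> str | None:
    """Map a noisy ASR hypothesis onto a known command, or return None."""
    hyp = normalize(hypothesis)

    # 1. Fuzzy exact matching: whole hypothesis vs. each command.
    best = max(COMMANDS, key=lambda c: similarity(hyp, normalize(c)))
    if similarity(hyp, normalize(best)) >= threshold:
        return best

    # 2. Substring containment: a command embedded in a longer hypothesis.
    for cmd in COMMANDS:
        if normalize(cmd) in hyp:
            return cmd

    # 3. Sliding-window similarity over command-sized windows.
    for cmd in COMMANDS:
        c = normalize(cmd)
        for i in range(max(1, len(hyp) - len(c) + 1)):
            if similarity(hyp[i:i + len(c)], c) >= threshold:
                return cmd
    return None
```

In a deployment loop, a function like `verify()` would sit between the FunASR decoder output and the robot controller, rejecting hypotheses that match no known command rather than executing them, which is how a purely text-based layer can recover accuracy without retraining or acoustic preprocessing.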