A Comparative Analysis of LSTM and Transformer-based Automatic Speech Recognition Techniques
DOI:
https://doi.org/10.62051/zq6v0d49
Keywords:
Speech recognition; deep learning; long short-term memory; transformer.
Abstract
Automatic Speech Recognition (ASR) is a technology that uses artificial intelligence to convert spoken language into written text. It relies on machine learning algorithms, specifically deep learning models, to analyze audio signals and extract linguistic features. This technology has changed how people interact with voice-enabled devices, enabling efficient and accurate transcription of human speech in applications such as voice assistants, captioning, and transcription services. Among prior approaches to ASR, Long Short-Term Memory (LSTM) networks and Transformer-based methods are two representative solutions. This paper explores the progression of deep learning methods in ASR and compares them. It begins with a historical perspective, tracing the evolution from pioneering ASR systems to the current benchmarks: LSTM networks and Transformer-based models. It then evaluates these two technologies, examining their strengths, their weaknesses, and their potential for future advances in ASR.
License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.