Intelligent Navigation Dialect Detection and Recognition Based on Multimodal Large Language Model
DOI: https://doi.org/10.62051/ijcsit.v4n1.10

Keywords
Multimodal large language model, Intelligent navigation, Dialect detection and recognition, Cross-language communication, Human-computer interaction

Abstract
This paper examines methods for dialect detection and recognition in intelligent navigation systems based on multimodal large language models. It outlines the development trends of today's intelligent navigation systems and the central role that speech recognition technology plays within them. Through an extensive literature review, it surveys the progress, basic principles, and practical applications of current research and summarizes the key technologies of dialect detection, including data collection, model design, and system integration. Specifically, the paper covers the acquisition and fusion of speech and image data, feature extraction and recognition algorithms based on large language models, multimodal fusion strategies, and methods for optimizing the system's real-time performance and user experience. Together, these techniques aim to improve the adaptability and user experience of intelligent navigation systems in multilingual environments and to provide more accurate, personalized navigation services.
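To make the multimodal fusion idea concrete, the following is a minimal sketch, not the authors' system: an attention-weighted fusion of a speech feature vector (e.g., mean-pooled MFCCs) with an image embedding for dialect classification. All dimensions, the network layout, and the PyTorch framing are illustrative assumptions.

# Minimal illustrative sketch of audio-visual fusion for dialect identification.
# NOT the paper's actual architecture: the feature dimensions, the MFCC front
# end, and the attention-based fusion are assumptions made for illustration.
import torch
import torch.nn as nn

class AttentionFusionDialectClassifier(nn.Module):
    def __init__(self, audio_dim=13, image_dim=512, hidden=128, n_dialects=8):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.image_proj = nn.Linear(image_dim, hidden)
        # A learned score per modality decides how much each contributes.
        self.attn = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, n_dialects)

    def forward(self, audio_feats, image_feats):
        # audio_feats: (batch, audio_dim), e.g. mean-pooled MFCC vectors
        # image_feats: (batch, image_dim), e.g. a CNN embedding of a scene photo
        a = torch.tanh(self.audio_proj(audio_feats))
        v = torch.tanh(self.image_proj(image_feats))
        modalities = torch.stack([a, v], dim=1)            # (batch, 2, hidden)
        weights = torch.softmax(self.attn(modalities), 1)  # (batch, 2, 1)
        fused = (weights * modalities).sum(dim=1)          # attention-weighted sum
        return self.classifier(fused)                      # (batch, n_dialects)

# Example usage with random stand-in features:
model = AttentionFusionDialectClassifier()
audio = torch.randn(4, 13)    # stand-in for pooled MFCC features
image = torch.randn(4, 512)   # stand-in for image embeddings
logits = model(audio, image)  # (4, 8) per-dialect scores

The attention weights let the model lean on whichever modality is more informative for a given input, which is one common way to realize the kind of fusion strategy the abstract refers to.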
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.