Application of Large Language Models in Embodied Artificial Intelligence

Authors

  • Zitian Li

DOI:

https://doi.org/10.62051/05k91t81

Keywords:

Embodied AI; Large Language Models; Robustness and Adaptability.

Abstract

The convergence of Artificial Intelligence (AI) and robotics has given rise to embodied AI, in which intelligent systems equipped with sensors and actuators interact with the physical world and operate alongside humans. These systems are transforming industries such as autonomous driving, healthcare, and household assistance. Despite extensive research, however, embodied AI systems still suffer from significant limitations, including poor generalization and performance degradation in complex environments, which hinder their commercialization. Recent developments in Large Language Models (LLMs) present new opportunities to address these challenges. This study explores the integration of LLMs into embodied AI systems, highlighting their potential to enhance scene understanding, reasoning, and planning. The paper provides a detailed review of LLM applications in embodied AI, demonstrating how these models can improve the robustness and adaptability of intelligent systems. It also examines the limitations of LLMs, such as hallucinations and efficiency challenges, and discusses potential mitigations. Through an in-depth analysis of LLM-powered enhancements in embodied AI, this research underscores the transformative impact of LLMs on intelligent systems. By addressing current limitations and implementing innovative solutions, LLMs can significantly advance the field of embodied AI, paving the way for more versatile and intelligent systems that operate effectively in diverse real-world environments.
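As a concrete illustration of the planning role outlined above, the sketch below shows how an LLM could translate a natural-language instruction into calls to a robot's low-level skills, with unrecognized steps rejected as a simple guard against hallucinated actions. This is a minimal, hypothetical example: the primitive names, the plan format, and the query_llm stub are assumptions made for illustration, not the specific pipeline studied in the paper.

```python
# Minimal sketch: an LLM as a high-level task planner for an embodied agent.
# All names below (PRIMITIVES, query_llm, plan format) are illustrative assumptions.

from typing import Callable, Dict, List

# Hypothetical low-level skills the robot controller already exposes.
PRIMITIVES: Dict[str, Callable[[str], None]] = {
    "navigate_to": lambda target: print(f"navigating to {target}"),
    "pick_up": lambda obj: print(f"picking up {obj}"),
    "place_on": lambda surface: print(f"placing object on {surface}"),
}

def build_prompt(instruction: str, scene_objects: List[str]) -> str:
    """Ground the instruction in the observed scene and constrain the LLM
    to the available primitives, one 'primitive: argument' step per line."""
    return (
        f"Scene objects: {', '.join(scene_objects)}\n"
        f"Available primitives: {', '.join(PRIMITIVES)}\n"
        f"Instruction: {instruction}\n"
        "Respond with one step per line in the form 'primitive: argument'."
    )

def query_llm(prompt: str) -> str:
    """Placeholder for a call to any chat-completion model; a canned plan is
    returned here so the sketch runs without network access."""
    return "navigate_to: kitchen counter\npick_up: mug\nplace_on: dining table"

def execute_plan(plan_text: str) -> None:
    """Parse the LLM output and dispatch each step to a known primitive,
    skipping hallucinated skills instead of executing them blindly."""
    for line in plan_text.strip().splitlines():
        name, _, argument = line.partition(":")
        skill = PRIMITIVES.get(name.strip())
        if skill is None:
            print(f"skipping unknown primitive: {name.strip()}")
            continue
        skill(argument.strip())

if __name__ == "__main__":
    prompt = build_prompt("Bring the mug to the dining table",
                          ["mug", "kitchen counter", "dining table"])
    execute_plan(query_llm(prompt))
```

The whitelist of primitives is the key design choice in this sketch: the LLM contributes open-ended reasoning and task decomposition, while execution is restricted to skills the robot is known to support, which is one simple way to contain hallucinated plan steps.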

Published

25-11-2024

How to Cite

Li, Z. (2024) “Application of Large Language Models in Embodied Artificial Intelligence”, Transactions on Computer Science and Intelligent Systems Research, 7, pp. 119–125. doi:10.62051/05k91t81.