Research on Multimodal AGI Empowering the Development of LLM-based Multi-Agent System

Zirun Bai

doi:10.62051/ijcsit.v8n5.06

Authors

Zirun Bai, ArtCenter College of Art and Design, California, 91103, United States

Keywords:

Multimodal General AI, Large Language Model (LLM), Multi-Agent System, Cross-Modal Fusion, Collaborative Operation

Abstract

This paper analyzes multi-agent systems built on large language models (LLMs) and enabled by multimodal general artificial intelligence, focusing on the internal logic, technical principles, and practical scope of their integration. Drawing on domestic and international literature, technical comparisons, and industrial cases from 2022 to 2025, together with established results in multi-agent and multimodal research, the analysis finds that multimodal fusion can compensate for the shortcomings of traditional text-only agents in perception, interaction, and collaboration, forming a complete perception–inference–execution chain. Once the technology is deployed, the adaptability and stability of the agents improve significantly, which reduces the decision-making bias of individual models. Demand for intelligent physical scenarios has surged, yet traditional text-only agents struggle to adapt to complex, dynamic environments. Existing research mostly focuses on a single technical dimension and lacks a systematic analysis of the fusion. This study fills that gap and provides references and practical support for industrial upgrading. It also identifies existing challenges and viable development pathways, and its conclusions can provide a direct basis for technological iteration, scenario implementation, and improved industry standardization.
