Generative and Discriminative Models in Multimodal AI: An Analysis of Vision-Language Tasks

Authors

  • Fengjiang He

DOI:

https://doi.org/10.62051/hdjsgp39

Keywords:

Multimodal AI; Vision-Language Models; Generative Models; Discriminative Models; Transformer; BERT; GPT.

Abstract

The transformer architecture has triggered groundbreaking work in multimodal vision and language (V+L) research. This article offers a brief look at the two main modeling paradigms, generative and discriminative, tracing them from their roots in natural language processing (NLP): the generative pre-trained transformer (GPT) and bidirectional encoder representations from transformers (BERT), respectively. The core ideas of the two paradigms are then examined to show how they have been adapted to V+L tasks, resulting in distinct architectural paths and pre-training methods. The paradigms are further compared along core dimensions, and the challenges on the path from separate paradigms to unified models (e.g., model hallucination, limited evaluation capability, and scalability) are analyzed. This work aims to give researchers and practitioners a clear, well-organized view of how V+L modeling has evolved and where it may be heading.
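
As an illustrative aside, not taken from the paper itself, the minimal PyTorch sketch below contrasts the two pre-training objectives the abstract refers to: GPT-style autoregressive next-token prediction and BERT-style masked-token prediction. The tensors, vocabulary size, and mask positions are hypothetical placeholders standing in for a real model's outputs.

    import torch
    import torch.nn.functional as F

    # Toy stand-ins for a transformer's per-token vocabulary logits and input token ids.
    vocab_size, seq_len, batch = 100, 8, 2
    logits = torch.randn(batch, seq_len, vocab_size)          # hypothetical model outputs
    tokens = torch.randint(0, vocab_size, (batch, seq_len))   # hypothetical input token ids

    # Generative (GPT-style) objective: each position predicts the next token,
    # so predictions are scored against the sequence shifted left by one.
    ar_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..T-2
        tokens[:, 1:].reshape(-1),               # targets are positions 1..T-1
    )

    # Discriminative (BERT-style) objective: a subset of positions is masked in the
    # input and only those positions are scored against the original tokens.
    mask = torch.zeros(batch, seq_len, dtype=torch.bool)
    mask[:, 2] = mask[:, 5] = True               # fixed toy mask instead of a random ~15%
    mlm_loss = F.cross_entropy(logits[mask], tokens[mask])

    print(f"autoregressive loss: {ar_loss:.3f}, masked-LM loss: {mlm_loss:.3f}")

The point of the contrast is that the generative objective conditions only on the left context (enabling open-ended generation), while the masked objective sees bidirectional context and is naturally suited to understanding and matching tasks.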

References

[1] Soydaner D. Attention mechanism in neural networks: where it comes and where it goes. Neural Computing and Applications, 2022, 34(16): 13371-13385.

[2] Luo Q, Zeng W, Chen M, et al. Self-attention and transformers: Driving the evolution of large language models. In 2023 IEEE 6th International Conference on Electronic Information and Communication Technology (ICEICT), 2023: 401-405.

[3] Charoenkwan P, Nantasenamat C, Hasan MM, et al. BERT4Bitter: a bidirectional encoder representations from transformers (BERT)-based model for improving the prediction of bitter peptides. Bioinformatics, 2021, 37(17): 2556-2562.

[4] Bengesi S, El-Sayed H, Sarker MK, et al. Advancements in generative AI: A comprehensive review of GANs, GPT, autoencoders, diffusion model, and transformers. IEEE Access, 2024.

[5] Bernardo JM, Bayarri MJ, Berger JO, et al. Generative or discriminative? getting the best of both worlds. Bayesian statistics, 2007, 8(3): 3-24.

[6] Park SM, Kim YG. Visual language integration: A survey and open challenges. Computer Science Review, 2023, 48: 100548.

[7] Ding Y, Jia M, Miao Q, et al. A novel time–frequency transformer based on self–attention mechanism and its application in fault diagnosis of rolling bearings. Mechanical Systems and Signal Processing, 2022, 168: 108616.

[8] Shen Z, Zhang M, Zhao H, et al. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021: 3531-3539.

[9] Wang Z, Yao K, Li X, et al. Multi-resolution multi-head attention in deep speaker embedding. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020: 6464-6468.

[10] Chung YA, Zhang Y, Han W, et al. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021: 244-250.

[11] Chen L, Wang Z, Ren S, et al. Next token prediction towards multimodal intelligence: A comprehensive survey. arXiv preprint arXiv:2412.18619, 2024.

[12] Lu J, Batra D, Parikh D, et al. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 2019, 32.

[13] Tan H, Bansal M. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.

[14] Chen YC, Li L, Yu L, et al. Uniter: Universal image-text representation learning. In European conference on computer vision, 2020: 104-120.

[15] Li X, Yin X, Li C, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, 2020: 121-137.

[16] Huang Z, Jin X, Lu C, et al. Contrastive masked autoencoders are stronger vision learners. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 46(4): 2506-2517.

[17] Zhang K, Mao Z, Wang Q, et al. Negative-aware attention framework for image-text matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022: 15661-15670.

[18] Shao Z, Yu Z, Wang M, et al. Prompting large language models with answer heuristics for knowledge-based visual question answering. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, 2023: 14974-14983.

[19] Khan MJ, Breslin JG, Curry E. Common sense knowledge infusion for visual understanding and reasoning: Approaches, challenges, and applications. IEEE Internet Computing, 2022, 26(4): 21-27.

[20] Min B, Ross H, Sulem E, et al. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys, 2023, 56(2): 1-40.

[21] Wang Z, Yu J, Yu AW, et al. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021.

[22] Jin W, Cheng Y, Shen Y, et al. A good prompt is worth millions of parameters: Low-resource prompt-based learning for vision-language models. arXiv preprint arXiv:2110.08484, 2021.

[23] Alayrac JB, Donahue J, Luc P, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 2022, 35: 23716-23736.

[24] Yang Z, Li L, Lin K, et al. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421, 2023.

[25] Vedantam R, Lawrence Zitnick C, Parikh D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2015: 4566-4575.

[26] Anderson P, Fernando B, Johnson M, et al. Spice: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, 2016: 382-398.

[27] Wang P, Yang A, Men R, et al. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International conference on machine learning, 2022: 23318-23340.

[28] Wang W, Bao H, Dong L, et al. Image as a foreign language: Beit pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 19175-19186.

[29] Schuhmann C, Beaumont R, Vencu R, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 2022, 35: 25278-25294.

[30] Huang L, Yu W, Ma W, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 2025, 43(2): 1-55.

[31] Papineni K, Roukos S, Ward T, et al. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002: 311-318.

[32] Chavan A, Liu Z, Gupta D, et al. One-for-all: Generalized lora for parameter-efficient fine-tuning. arXiv preprint arXiv:2306.07967, 2023.

Published

19-08-2025

How to Cite

He, F. (2025) “Generative and Discriminative Models in Multimodal AI: An Analysis of Vision-Language Tasks”, Transactions on Computer Science and Intelligent Systems Research, 10, pp. 152–160. doi:10.62051/hdjsgp39.