Advancements in Deep Learning-Based Image Captioning

Authors

  • Siqi Wang
  • Jihong Zhuang

DOI:

https://doi.org/10.62051/d1jtjx50

Keywords:

Image captioning; deep learning; template-based structure; encoder-decoder-based structure.

Abstract

At the intersection of natural language processing and machine vision, image captioning has grown rapidly since IBM researchers introduced the BLEU evaluation metric in 2002. The field aims to bridge the "semantic gap" between human and machine perception by translating visual information into natural-language descriptions. The technology is widely applied in human-computer interaction, video subtitling, quiz generation, and image-based search. This paper analyzes two primary methodologies in image captioning: template-based and encoder-decoder-based structures. Template-based approaches rely on pre-set templates, which ensures syntactic accuracy but limits flexibility in caption generation. Innovations within this methodology, including paraphrasing via back-translation and the integration of psycholinguistics, have enhanced caption diversity and descriptiveness. The encoder-decoder framework, particularly the CNN-RNN model, instead uses deep neural networks to learn directly from image-caption pairs, which makes caption generation more dynamic and adaptable. Combining Convolutional Neural Networks (CNNs) with Long Short-Term Memory (LSTM) networks within this framework has notably improved the descriptive quality of captions and handles complex image contexts more effectively.
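
The template-based idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Detection` fields, the `TEMPLATE` string, and the `template_caption` helper are hypothetical names chosen for this illustration, and the detected words are assumed to come from some upstream visual recognizer that is not shown.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical output of an upstream visual recognizer (assumed, not from the paper)."""
    attribute: str
    obj: str
    action: str
    scene: str


# One pre-set template: the sentence structure is fixed, only the slots vary.
TEMPLATE = "A {attribute} {obj} is {action} in the {scene}."


def template_caption(d: Detection, template: str = TEMPLATE) -> str:
    """Fill the fixed template with the detected words."""
    return template.format(
        attribute=d.attribute, obj=d.obj, action=d.action, scene=d.scene
    )


caption_a = template_caption(Detection("brown", "dog", "running", "park"))
caption_b = template_caption(Detection("small", "boat", "floating", "harbor"))
print(caption_a)  # A brown dog is running in the park.
print(caption_b)  # A small boat is floating in the harbor.
```

Both captions are grammatical because the template fixes the syntax, but they share one sentence shape, which is the limited flexibility the abstract refers to. An encoder-decoder model would instead generate the word sequence itself from learned image features.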

References

Mahalakshmi, P., & Fatima, N. S. (2022). Summarization of text and image captioning in information retrieval using deep learning techniques. IEEE Access, 10, 18289-18297.

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311-318).

Ma, H., Zhu, J., Lyu, M. R. T., & King, I. (2010). Bridging the semantic gap between image contents and tags. IEEE Transactions on Multimedia, 12(5), 462-473.

Turkerud, I. R., & Mengshoel, O. J. (2021, December). Image captioning using deep learning: text augmentation by paraphrasing via backtranslation. In 2021 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1-10). IEEE.

Umemura, K., Kastner, M. A., Ide, I., Kawanishi, Y., Hirayama, T., Doman, K., ... & Murase, H. (2021). Tell as you imagine: Sentence imageability-aware image captioning. In MultiMedia Modeling: 27th International Conference, MMM 2021, Prague, Czech Republic, June 22–24, 2021, Proceedings, Part II (pp. 62-73). Springer International Publishing.

Chen, X., Jiang, M., & Zhao, Q. (2021). Self-distillation for few-shot image captioning. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 545-555).

Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3128-3137).

Chen, J., & Zhuge, H. (2019, September). News image captioning based on text summarization using image as query. In 2019 15th International Conference on Semantics, Knowledge and Grids (SKG) (pp. 123-126). IEEE.

Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., & Gao, J. (2020, April). Unified vision-language pre-training for image captioning and VQA. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 07, pp. 13041-13049).

Zeng, Y., Zhang, X., Li, H., Wang, J., Zhang, J., & Zhou, W. (2023). X²-VLM: All-in-one pre-trained model for vision-language tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Published

12-08-2024

How to Cite

Wang, S. and Zhuang, J. (2024) “Advancements in Deep Learning-Based Image Captioning”, Transactions on Computer Science and Intelligent Systems Research, 5, pp. 464–469. doi:10.62051/d1jtjx50.