Advances in Text-to-Image Generation: Integrating Transformer Models and Self-Attention Mechanisms
DOI:
https://doi.org/10.62051/je6gc608

Keywords:
Text-to-Image; Transformer Models; Self-Attention Mechanism; GANs.

Abstract
This study provides a comprehensive overview of advancements in Text-to-Image (TTI) generation through the application of Transformer models and Self-Attention Mechanisms. It begins with a review of the evolution of Generative Adversarial Networks (GANs) and highlights the benefits introduced by Self-Attention, such as improved contextual understanding and clearer image generation. The paper explores the theoretical foundations of Transformers and GANs, detailing how their integration can enhance TTI tasks. It also examines several leading models that employ these methodologies and presents quantitative performance evaluations comparing these models with other commonly used approaches. The findings indicate that Transformer-based modifications significantly improve TTI performance. The study concludes by assessing the current state of Self-Attention techniques and identifying potential research directions, such as exploring multi-head, hard, and soft attention mechanisms. These future research efforts are expected to further refine TTI capabilities and address existing challenges, providing deeper insights and more robust solutions for generating diverse and high-quality images from textual descriptions.
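The self-attention mechanism discussed above lets every token in a text description weigh every other token, which is the source of the improved contextual understanding the survey highlights. As a minimal illustration (not taken from any of the surveyed models), the scaled dot-product attention of the Transformer can be sketched in a few lines of NumPy; the function name and toy dimensions here are illustrative assumptions:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # numerically stable row-wise softmax
    return weights @ V                               # attention-weighted sum of values

# Toy example: 4 tokens, 8-dimensional embeddings; self-attention sets Q = K = V = x.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8): one contextualized vector per token
```

Multi-head attention, one of the research directions noted above, simply runs several such attention functions in parallel on learned projections of Q, K, and V and concatenates the results.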
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.