A Survey on Referring Image Segmentation
DOI:
https://doi.org/10.62051/a2t2ec16
Keywords:
Referring image segmentation; deep learning; computer vision; natural language processing.
Abstract
As artificial intelligence models grow in popularity and expectations for their application rise across many fields, referring image segmentation (RIS) has attracted considerable attention from researchers. RIS is one of the most fundamental and challenging vision-language cross-modal tasks at the intersection of computer vision and natural language processing; it aims to segment the instance in an image that corresponds to a given natural language expression. This paper aims to provide as comprehensive an overview as possible, covering the mainstream benchmark datasets and their statistics, common evaluation metrics, a selection of crucial and representative works in RIS, and the performance evaluation of each method. The included RIS methods are described in terms of their core model structure and the procedure they follow to perform RIS, and are grouped into five categories according to how multimodal information is processed. The paper closes with a brief outlook on possible future directions for RIS research.
License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.