Research on a Curvature-Enhanced and Synergistic Attention-Based Multi-Task Perception Method for Transparent Objects
DOI: https://doi.org/10.62051/ijcsit.v8n1.03

Keywords: Transparent object perception, Multi-task learning, Curvature prior, Depth estimation, Semantic segmentation, Attention mechanism

Abstract
Transparent objects challenge monocular perception due to refraction, reflection, and weak textures, which hinder accurate depth estimation and segmentation. To overcome these issues, we propose CESINet, a curvature-enhanced synergistic attention network for transparent object perception. CESINet explicitly incorporates surface curvature as a high-order geometric prior to strengthen spatial representation and introduces a curvature-guided synergistic attention module to enable effective cross-task feature interaction between depth and segmentation branches. A curvature consistency loss further enforces geometric coherence across predictions. Experiments on the ClearPose dataset show that CESINet achieves 94.33% mIoU and 98.27% mAP for segmentation, improving over the multi-task baseline ISGNet by 1.49% and 0.44%, respectively. For depth estimation, CESINet attains an RMSE of 0.112 and REL of 0.060, reducing errors by 8.9% and 11.8% compared with the baseline. Ablation results demonstrate that removing curvature priors or attention modules leads to performance drops of up to 3.5% in segmentation and 12% in depth accuracy, confirming the complementary benefits of explicit geometry and synergistic learning. Overall, CESINet enhances geometric consistency and boundary sharpness while maintaining computational efficiency, providing a unified and scalable framework for multi-task transparent object understanding.
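To make the abstract's curvature consistency idea concrete, the following minimal PyTorch-style sketch (not the authors' code) approximates a curvature map with a finite-difference Laplacian of the depth map and applies an L1 penalty against a supplied curvature prior. The function names, the Laplacian as a stand-in for true surface curvature, and the L1 form are all illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def curvature_from_depth(depth: torch.Tensor) -> torch.Tensor:
    # depth: (B, 1, H, W) predicted depth map.
    # A 3x3 Laplacian serves as a crude second-order (curvature-like)
    # operator; the paper's actual curvature prior is an assumption here.
    lap = torch.tensor([[0.0,  1.0, 0.0],
                        [1.0, -4.0, 1.0],
                        [0.0,  1.0, 0.0]],
                       dtype=depth.dtype, device=depth.device).view(1, 1, 3, 3)
    return F.conv2d(depth, lap, padding=1)

def curvature_consistency_loss(pred_depth: torch.Tensor,
                               curvature_prior: torch.Tensor) -> torch.Tensor:
    # Penalize disagreement between the curvature implied by the predicted
    # depth and the externally computed prior (L1 penalty is an assumption).
    return F.l1_loss(curvature_from_depth(pred_depth), curvature_prior)

# Usage with random tensors standing in for network outputs.
depth_pred = torch.rand(2, 1, 64, 64)
prior = curvature_from_depth(torch.rand(2, 1, 64, 64))
loss = curvature_consistency_loss(depth_pred, prior)
```

Whatever the precise curvature operator, the differentiable pattern is the same: derive curvature from the predicted depth and compare it against a prior, so the geometric constraint propagates gradients into both task branches.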
References
[1] Chen G, Han K, Wong K Y K. TOM-Net: Learning transparent object matting from a single image [C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 9233-9241.
[2] Sajjan S, Moore M, Pan M, et al. ClearGrasp: 3D shape estimation of transparent objects for manipulation [C]//2020 IEEE international conference on robotics and automation (ICRA). IEEE, 2020: 3634-3642.
[3] Xie E, Wang W, Wang W, et al. Segmenting transparent objects in the wild [C]//European conference on computer vision. Cham: Springer International Publishing, 2020: 696-711.
[4] Kalra A, Taamazyan V, Rao S K, et al. Deep polarization cues for transparent object segmentation [C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 8602-8611.
[5] Chen X, Zhang H, Yu Z, et al. ClearPose: Large-scale transparent object dataset and benchmark [C]//European conference on computer vision. Cham: Springer Nature Switzerland, 2022: 381-396.
[6] Wang Y R, Zhao Y, Xu H, et al. MVTrans: Multi-view perception of transparent objects [J]. arXiv preprint arXiv:2302.11683, 2023.
[7] Hamdi A, AlZahrani F, Giancola S, et al. MVTN: Learning multi-view transformations for 3D understanding [J]. International Journal of Computer Vision, 2025, 133(4): 2197-2226.
[8] Woo S, Park J, Lee J Y, et al. CBAM: Convolutional block attention module [C]//Proceedings of the European conference on computer vision (ECCV). 2018: 3-19.
[9] Lu J, Yang J, Batra D, et al. Hierarchical question-image co-attention for visual question answering [J]. Advances in Neural Information Processing Systems, 2016, 29.
[10] Wang Z, Zhou Z, Wang H, et al. MCA: Multidimensional collaborative attention in deep convolutional neural networks for image recognition [J]. Neurocomputing, 2022, 489: 497-508.
[11] Cui Y, Han C, Liu D. Cml-mots: Collaborative multi-task learning for multi-object tracking and segmentation [J]. arXiv preprint arXiv:2311.00987, 2023.
[12] Misra I, Shrivastava A, Gupta A, et al. Cross-stitch networks for multi-task learning [C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 3994-4003.
[13] Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture [C]//Proceedings of the IEEE international conference on computer vision. 2015: 2650-2658.
[14] Hernández-Bautista M, Melero F J. SR-CurvANN: Advancing 3D surface reconstruction through curvature-aware neural networks [J]. Computers & Graphics, 2025: 104260.
[15] Harrison J, Benn J, Sermesant M. Improving neural network surface processing with principal curvatures [J]. Advances in Neural Information Processing Systems, 2024, 37: 122384-122405.
[16] Bhardwaj S, Vinod A, Bhattacharya S, et al. Curvature Informed Furthest Point Sampling [J]. arXiv preprint arXiv:2411.16995, 2024.
[17] da Silva S A, Geiger D, Velho L, et al. Towards Understanding 3D Vision: the Role of Gaussian Curvature [J]. arXiv preprint arXiv:2508.11825, 2025.
[18] Liu J, Ma H, Guo Y, et al. Monocular depth estimation and segmentation for transparent object with iterative semantic and geometric fusion [J]. arXiv preprint arXiv:2502.14616, 2025.
[19] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale [J]. arXiv preprint arXiv:2010.11929, 2020.
[20] Ranftl R, Bochkovskiy A, Koltun V. Vision transformers for dense prediction [C]//Proceedings of the IEEE/CVF international conference on computer vision. 2021: 12179-12188.
[21] Yang G, Tang H, Ding M, et al. Transformer-based attention networks for continuous pixel-wise prediction [C]//Proceedings of the IEEE/CVF international conference on computer vision. 2021: 16269-16279.
[22] Si Y, Xu H, Zhu X, et al. SCSA: Exploring the synergistic effects between spatial and channel attention [J]. Neurocomputing, 2025, 634: 129866.
Copyright (c) 2026 International Journal of Computer Science and Information Technology

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.