A comparative study of Contrastive Self-Supervised Learning (CSSL): Methods, Technologies, and Applications

Authors

  • Zhenghan Li

DOI:

https://doi.org/10.62051/2hgr2j26

Keywords:

Self-Supervised Learning (SSL); Contrastive learning; Feature representation learning; Model interpretability; Efficient training.

Abstract

As the cost of data annotation continues to rise and application demands diversify, traditional supervised learning faces two major problems: the difficulty of labeling data at scale and the resulting bottleneck on expansion to new tasks. Contrastive Self-Supervised Learning (CSSL) offers an effective route to deep feature extraction in label-free settings by constructing sample pairs and maximizing their separability in the feature space. This review takes Contrastive Predictive Coding (CPC), the Simple Framework for Contrastive Learning of Visual Representations (SimCLR), Momentum Contrast (MoCo), Bootstrap Your Own Latent (BYOL), Supervised Contrastive Learning (SupCon), Swapping Assignments between Views (SwAV), and Self-Distillation with No Labels (DINO) as its research objects, systematically surveying their theoretical frameworks and architectural improvements. The techniques examined include the InfoNCE loss, momentum encoders, and negative-free self-distillation. The paper compares each method's Top-1 accuracy and computational cost under the ImageNet linear-evaluation protocol, and presents the core technical differences and relative strengths and weaknesses in three tables. The comparison shows that SupCon and DINO approach or exceed traditional supervised pre-training under different batch settings, while lightweight methods such as BYOL and SwAV perform particularly well in resource-constrained scenarios. CSSL has thus made significant progress in feature-representation quality and generalization ability, and lays a solid foundation for subsequent research on the interpretability, multimodal fusion, and efficient deployment of deep learning models.
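
For reference, the core objectives named above can be summarized as follows; the notation is the standard formulation from the cited works [2–4], not text reproduced from this paper. The InfoNCE loss, instantiated by SimCLR as the NT-Xent loss, treats two augmented views of the same image as a positive pair and the remaining samples in the batch as negatives:

    \mathcal{L}_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}

where z_i and z_j are the projected embeddings of the two views, sim(·,·) denotes cosine similarity, τ is a temperature hyperparameter, and N is the batch size. The momentum encoders of MoCo and BYOL maintain a slowly evolving target network via an exponential moving average of the online parameters, \theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q, with momentum m close to 1 (e.g., 0.999).

A minimal PyTorch sketch of the NT-Xent computation, assuming (N, d) projection-head outputs z1 and z2 for the two views; the function name and the default temperature of 0.5 are illustrative choices, not code from the reviewed methods:

    import torch
    import torch.nn.functional as F

    def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
        """InfoNCE (NT-Xent) loss over two batches of paired view embeddings, each (N, d)."""
        n = z1.size(0)
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit-norm rows
        sim = z @ z.t() / tau                               # scaled cosine similarities, (2N, 2N)
        sim.fill_diagonal_(float("-inf"))                   # exclude each row's self-similarity
        # Row i's positive is the other augmented view of the same image, offset by N.
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
        return F.cross_entropy(sim, targets)

With a batch of 256 images, for example, z1 and z2 would each be (256, d) tensors and the returned scalar is minimized during pre-training; the linear-evaluation results compared in the paper are then obtained by freezing the learned encoder and training only a linear classifier on ImageNet labels.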

References

[1] L. Jing, Y. Tian, Self-supervised visual feature learning with deep neural networks: a survey, IEEE Trans. Pattern Anal. Mach. Intell. 43 (2020) 4037–4058.

[2] A. van den Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748 (2018).

[3] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: Proc. Int. Conf. Mach. Learn. (ICML), PMLR, 2020, pp. 1597–1607.

[4] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, in: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 9729–9738.

[5] J.B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, et al., Bootstrap your own latent: a new approach to self-supervised learning, Adv. Neural Inf. Process. Syst. 33 (2020) 21271–21284.

[6] O.J. Hénaff, Data-efficient image recognition with contrastive predictive coding, in: Proc. Int. Conf. Mach. Learn. (ICML), PMLR, 2020, pp. 4182–4192.

[7] M. Caron, P. Bojanowski, A. Joulin, M. Douze, Deep clustering for unsupervised learning of visual features, in: Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 132–149.

[8] X. Chen, H. Fan, R. Girshick, K. He, Improved baselines with momentum contrastive learning, arXiv preprint arXiv:2003.04297 (2020).

[9] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, et al., Supervised contrastive learning, Adv. Neural Inf. Process. Syst. 33 (2020) 18661–18673.

[10] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, Emerging properties in self-supervised vision transformers, in: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 9650–9660.

[11] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, A. Joulin, Unsupervised learning of visual features by contrasting cluster assignments, Adv. Neural Inf. Process. Syst. 33 (2020) 9912–9924.

[12] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, Y. Wu, CoCa: contrastive captioners are image-text foundation models, arXiv preprint arXiv:2205.01917 (2022).

[13] S. Wan, Y. Zhan, S. Chen, S. Pan, J. Yang, D. Tao, C. Gong, Boosting graph contrastive learning via adaptive sampling, IEEE Trans. Neural Netw. Learn. Syst. (2023).

Published

19-08-2025

How to Cite

Li, Z. (2025) “A comparative study of Contrastive Self-Supervised Learning (CSSL): Methods, Technologies, and Applications”, Transactions on Computer Science and Intelligent Systems Research, 10, pp. 92–97. doi:10.62051/2hgr2j26.