Efficient Self-Supervised Vision Transformers (EsViT)
This is a research project exploring self-supervised learning (SSL) for computer vision. It aims to learn general-purpose image features from raw pixels without relying on manual supervision, so that the learned networks can serve as backbones for various downstream tasks. To improve the efficiency of Transformer-based SSL, this project presents Efficient self-supervised Vision Transformers (EsViT), which combine a multi-stage architecture with a region-based pre-training task for unsupervised representation learning.
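To give a rough feel for the region-based pre-training idea, here is a minimal toy sketch (not the repository's actual implementation): region features from one augmented view are matched to their most similar counterparts in another view, and a cross-entropy is computed between teacher and student distributions over a shared projection head. All function names, shapes, temperatures, and the random prototype head are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def region_matching_loss(student_regions, teacher_regions, temp_s=0.1, temp_t=0.04):
    """Toy sketch of a region-level matching objective (illustrative only).

    student_regions: (N, D) region features from view 1 (student branch)
    teacher_regions: (M, D) region features from view 2 (teacher branch)
    Each teacher region is paired with its most similar student region
    (cosine similarity); a cross-entropy is then computed between teacher
    and student probability distributions over K shared prototype
    directions. Shapes and names are assumptions, not the repo's API.
    """
    # L2-normalize features so dot products are cosine similarities
    s = student_regions / np.linalg.norm(student_regions, axis=1, keepdims=True)
    t = teacher_regions / np.linalg.norm(teacher_regions, axis=1, keepdims=True)

    # For each teacher region, find the best-matching student region
    sim = t @ s.T                 # (M, N) cosine similarities
    match = sim.argmax(axis=1)    # index of the closest student region

    # Project matched pairs onto shared prototypes; sharpen the teacher
    rng = np.random.default_rng(0)
    prototypes = rng.standard_normal((t.shape[1], 8))      # (D, K) toy head
    p_t = softmax(t @ prototypes / temp_t, axis=1)         # teacher targets
    p_s = softmax(s[match] @ prototypes / temp_s, axis=1)  # student predictions
    return float(-(p_t * np.log(p_s + 1e-9)).sum(axis=1).mean())
```

Called with two `(49, 64)` arrays of region features (e.g. a 7x7 grid of 64-dim patch tokens), it returns a positive scalar loss.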