Microsoft Research Lab – Asia

CVPR 2021 highlights: An overview of the cutting-edge progress of vision research

As one of the world's top AI conferences, CVPR has long led the academic and industrial trends in computer vision and pattern recognition. Here, 12 papers from Microsoft Research Asia accepted at CVPR 2021 have been selected to introduce the cutting-edge research being conducted in the field of computer vision. These papers cover topics including pose estimation, image translation, 3D reconstruction, object tracking, pre-training, representation learning, semantic segmentation, and domain adaptation.

Bottom-Up Human Pose Estimation via Disentangled Keypoint Regression

Paper: https://arxiv.org/pdf/2104.02300.pdf

Code: https://github.com/HRNet/DEKR

The proposed direct regression approach DEKR outperforms keypoint detection and grouping based methods and achieves superior performance over previous state-of-the-art bottom-up pose estimation methods on two benchmark datasets, COCO and CrowdPose.

The previous pixel-wise keypoint regression approach, CenterNet, performs reasonably well, but the regressed keypoints are spatially inaccurate, and its performance falls behind keypoint detection and grouping schemes. The researchers argue that accurately regressing keypoint positions requires representation learning that focuses on the keypoint regions.

Starting from this regression-by-focusing concept, researchers presented a simple yet effective approach called disentangled keypoint regression (DEKR). They adopted adaptive convolutions through a pixel-wise spatial transformer to activate the pixels in the keypoint regions and learn representations accordingly from these activated pixels, so that the learned representations can focus on the keypoint regions (Figure 1).

Figure 1: Illustration of the salient regions for regressing the keypoints. Three keypoints, nose and two ankles, are studied as an example for illustration clarity. Left: baseline. Right: The proposed DEKR. It can be seen that the proposed approach is able to focus on the keypoint regions.

Researchers further decoupled the representation learning for one keypoint from other keypoints. They adopted a separate regression scheme through a multi-branch structure: each branch learns a representation with dedicated adaptive convolutions and regresses one keypoint. The multi-branch structure explicitly decouples the representation learning for one keypoint from other keypoints and makes the optimization easier.
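To make the multi-branch idea concrete, here is a minimal, hypothetical sketch of such a separate regression head in PyTorch. It uses plain 3×3 convolutions where the paper uses pixel-wise adaptive convolutions, and the channel sizes and keypoint count are illustrative.

```python
import torch
import torch.nn as nn

class MultiBranchRegressionHead(nn.Module):
    """Sketch of DEKR-style separate regression: one small branch per keypoint.

    Hypothetical simplification: plain 3x3 convolutions stand in for the
    paper's pixel-wise adaptive convolutions; channel sizes are illustrative.
    """
    def __init__(self, in_channels=32, num_keypoints=17, branch_channels=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, 3, padding=1),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(branch_channels, 2, 1),  # (dx, dy) offset to this keypoint
            )
            for _ in range(num_keypoints)
        ])

    def forward(self, feats):
        # feats: (B, C, H, W); each branch regresses a 2-channel offset map
        # for its own keypoint, keeping the representations per keypoint decoupled.
        offsets = [branch(feats) for branch in self.branches]
        return torch.cat(offsets, dim=1)  # (B, 2*K, H, W)

# Example: at a person's center pixel, the head reads out K offsets, one per keypoint.
head = MultiBranchRegressionHead()
out = head(torch.randn(1, 32, 64, 48))
print(out.shape)  # torch.Size([1, 34, 64, 48])
```

Because each keypoint has its own branch and its own offset map, the gradient for one keypoint never flows through another keypoint's representation, which is the decoupling described above.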

The new approach can learn highly concentrative representations, each of which focuses on the corresponding keypoint region (Figure 2), and thus the keypoint regression is spatially more accurate. The proposed direct regression method improves the localization quality of the regressed keypoint positions, outperforms keypoint detection and grouping methods, and achieves superior bottom-up pose estimation results on two benchmark datasets, COCO and CrowdPose.

Figure 2: Activated pixels for nose, left shoulder, left knee, and left ankle from the multi-branch regression in DEKR at the center pixel for each person. One can see that the proposed approach is able to activate the pixels around the keypoint.

CoCosNet v2: Full-Resolution Correspondence Learning for Image Translation

Paper: https://arxiv.org/pdf/2012.02047.pdf

Code: https://github.com/microsoft/CoCosNet-v2

Image-to-image translation has shown great promise in a wide range of applications. Recently, exemplar-based image translation has been proposed to allow style customization according to an exemplar image. The previous work, CoCosNet, achieved state-of-the-art quality while remaining faithful to the exemplar. This is because CoCosNet explicitly establishes the dense semantic correspondence between cross-domain images, so that the network can make use of the fine textures of the exemplar and easily hallucinate the textures for the final output (Figure 3). However, estimating high-resolution correspondence incurs a prohibitive memory footprint, which makes it challenging to apply this method to high-resolution inputs.

Figure 3: Cross-domain correspondence learning for exemplar-based image translation.

In this paper, the newly proposed CoCosNet v2 establishes full-resolution correspondence for cross-domain images. Researchers proposed two techniques to improve the memory efficiency of high-resolution correspondence. First, they adopted a coarse-to-fine strategy (Figure 4), in which images are matched on the hierarchical feature pyramid, and the matching on the coarse level serves as the guidance for the finer level.

Figure 4: Coarse to fine feature matching.

Second, on each level, they used differentiable PatchMatch to iteratively refine the matching. As shown in Figure 5, each patch not only considers its own matched candidate but also leverages the matches of its neighbors. In this way, good matches can be propagated across the whole image.
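The propagation step can be sketched roughly as below. This simplified PyTorch snippet keeps only the core of one differentiable PatchMatch propagation iteration (candidates proposed by neighbors, similarity scoring, and a softmax blend); the field initialization, temperature, and feature shapes are our assumptions, not CoCosNet v2's actual implementation.

```python
import torch
import torch.nn.functional as F

def patchmatch_propagation(corr, feat_a, feat_b, tau=0.07):
    """One PatchMatch-style propagation step on a dense correspondence field.

    Hypothetical, simplified sketch: `corr` holds, for every pixel of image A,
    normalized (x, y) coordinates of its current match in image B. Each pixel
    also considers the matches proposed by its 4 neighbors, scores all candidates
    by feature similarity, and blends them with a softmax (kept differentiable).
    """
    shifts = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]
    candidates = [torch.roll(corr, shifts=s, dims=(2, 3)) for s in shifts]

    scores, warped = [], []
    for cand in candidates:
        # Sample B's features at the candidate locations; grid_sample expects
        # the grid as (B, H, W, 2) with coordinates in [-1, 1].
        grid = cand.permute(0, 2, 3, 1)
        fb = F.grid_sample(feat_b, grid, align_corners=True)
        scores.append((feat_a * fb).sum(dim=1, keepdim=True) / tau)
        warped.append(cand)

    weights = torch.softmax(torch.cat(scores, dim=1), dim=1)   # (B, 5, H, W)
    stacked = torch.stack(warped, dim=1)                       # (B, 5, 2, H, W)
    return (weights.unsqueeze(2) * stacked).sum(dim=1)         # refined field

# Toy usage with random features and an identity initialization of the field.
H = W = 16
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
corr0 = torch.stack([xs, ys]).unsqueeze(0)                     # (1, 2, H, W)
fa, fb = torch.randn(1, 8, H, W), torch.randn(1, 8, H, W)
corr1 = patchmatch_propagation(corr0, F.normalize(fa, dim=1), F.normalize(fb, dim=1))
```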

Figure 5: Differentiable PatchMatch on deep feature.

Researchers also proposed a ConvGRU unit to further enhance the PatchMatch module (Figure 6), which brings crucial benefits. First, it allows the current correspondence to leverage a larger context rather than only the local neighborhood. Second, the GRU memorizes the history of correspondence estimates and better forecasts the next search candidate. Third, the backward gradient can now flow to the pixels in a larger context, rather than only to those at sparse locations, thereby improving the feature learning.

Figure 6: ConvGRU enhances the convergence and gradient backpropagation of PatchMatch.

With the established high-resolution correspondence, CoCosNet v2 is able to synthesize images with stunning quality while being faithful to the exemplar style. Experiments on diverse image translation tasks were conducted, such as pose generation, edge-to-face synthesis, and layout-to-image synthesis, proving the advantage of CoCosNet v2 both qualitatively and quantitatively.

Figure 7: Pose synthesis results (second row) given the skeleton input (first column) and the style image (first row).

Figure 8: Qualitative comparison over prior leading approaches.

Figure 9: Quantitative comparison.

Deep Implicit Moving Least-Squares Functions for 3D Reconstruction

Paper: https://arxiv.org/pdf/2103.12266.pdf

Code: https://github.com/Andy97/DeepMLS

According to the underlying 3D representation, there are two major types of approaches for learning-based 3D reconstruction. One uses explicit representations, e.g., point clouds and voxel grids, output directly by deep neural networks. The other uses implicit representations, e.g., signed distance functions or spatial occupancy functions, which are realized by a neural network mapping spatial coordinates to implicit function values; example works include OccNet and NeRF. Explicit representations are direct and fast in evaluating 3D shapes but cannot model fine details, while implicit representations are expensive to evaluate but can capture details.

Researchers proposed a third approach that combines the benefits of both the explicit point-set representation and the implicit function representation, and enables efficient, high-quality, and generalizable 3D reconstruction. The representation is based on implicit moving least squares (IMLS): given a point set {p_i} with normal vectors {n_i}, an implicit function f is defined for any spatial point x such that the function value f(x) and gradient ∇f(x) are weighted averages over the points near x and their normal vectors (please refer to the paper for details). The IMLS function f approximates the signed distance function locally around the shape defined by the point set and has guaranteed smoothness and low evaluation cost, rendering it a perfect fit for 3D reconstruction.
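For concreteness, the standard IMLS signed-distance approximation that this kind of representation builds on can be written as below; the notation is ours, and the exact weighting used in the paper may differ.

```latex
% Implicit moving least-squares (IMLS) signed distance, standard form.
% Given points p_i with unit normals n_i and a Gaussian-type weight \theta,
% the value at a query point x is a weighted average of point-to-plane distances:
f(x) \;=\; \frac{\sum_{p_i \in \mathcal{N}(x)} \theta\!\left(\lVert x - p_i \rVert\right)\, n_i^{\top}(x - p_i)}
               {\sum_{p_i \in \mathcal{N}(x)} \theta\!\left(\lVert x - p_i \rVert\right)},
\qquad
\theta(r) = \exp\!\left(-r^2 / \sigma^2\right).
```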

Novel deep learning methods were designed to generate such IMLS functions (Figure 10). To generate a variable number of points, researchers employed a two-stage "scaffold + point set" method: generating an octree as the scaffold in the first stage, and filling the leaf octree nodes with points that define the IMLS function in the second stage. To train the two-stage neural network, researchers incorporated loss functions that supervise the octree structure, the signed distance values of the IMLS function, the point-set distribution and regularity, etc. Combining all these designs, this novel approach outperforms existing learning-based implicit reconstruction methods in terms of result quality and generalization (Figure 11).

Figure 10: Conceptual illustration of the framework. Left: The IMLS network generates an octree structure as the scaffold and fills the leaf nodes with points and normal vectors that define the IMLS function, whose zero surface models the 3D shape. Right: a real 3D example showing the point set and the surface extracted from the IMLS function defined by it.

Figure 11: Comparison with other methods on reconstruction from noisy points for unseen object categories. First column: the input noisy and partial points. Second column: the result of [Convolutional Occupancy Networks, ECCV 2020], a SOTA implicit deep learning method. Third column: our results. Fourth column: the ground-truth shapes. For these shapes, whose categories are entirely unseen during training, our results show much better quality and generalization than the other methods.

Learning Invariant Representations and Risks for Semi-supervised Domain Adaptation

Paper: https://arxiv.org/abs/2010.04647

Code: https://github.com/Luodian/Learning-Invariant-Representations-and-Risks

The success of supervised learning hinges on the assumption that the training and test data come from the same underlying distribution, which is often not valid in practice due to potential distribution shifts.

In light of this, most existing methods for unsupervised domain adaptation focus on achieving domain-invariant representations and small source domain error. However, recent works have shown that this is not sufficient to guarantee good generalization on the target domain, and in fact, it is provably detrimental under label distribution shifts. Furthermore, in many real-world applications, it is often feasible to obtain a small amount of labeled data from the target domain and use them to facilitate model training with source data.

Inspired by the above observations, researchers proposed the first method that aims to simultaneously learn invariant representations and risks under the semi-supervised domain adaptation (Semi-DA) setting. First, they provided a finite-sample bound for both classification and regression problems under Semi-DA. The bound suggests a principled way to obtain target generalization: aligning both the marginal and conditional distributions across domains in the feature space.

Motivated by this, researchers then introduced the LIRR algorithm for jointly learning invariant representations and risks. Finally, extensive experiments were conducted on both classification and regression tasks, demonstrating that LIRR consistently achieves state-of-the-art performance and significant improvements compared with methods that only learn invariant representations or invariant risks.
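As a deliberately simplified, hypothetical illustration of jointly optimizing the two kinds of invariance under Semi-DA, the sketch below combines a supervised risk on the labeled source and labeled target samples with an adversarial term that aligns the marginal feature distributions. The module sizes, loss weights, and the use of a plain domain discriminator are our assumptions, not LIRR's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A single shared predictor must fit both the labeled source data and the few
# labeled target samples (a surrogate for invariant risks), while a domain
# discriminator drives the encoder toward domain-invariant representations.
encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU())
classifier = nn.Linear(32, 10)
domain_disc = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))

def lirr_style_loss(xs, ys, xt_l, yt_l, xt_u, lam_adv=0.1):
    zs, zt_l, zt_u = encoder(xs), encoder(xt_l), encoder(xt_u)

    # (1) Supervised risk on all labeled data (source + few labeled target).
    sup = F.cross_entropy(classifier(zs), ys) + F.cross_entropy(classifier(zt_l), yt_l)

    # (2) Marginal alignment: the discriminator separates source from target;
    #     in full training the encoder would receive reversed gradients (GRL)
    #     so that the representations become domain-invariant.
    d_s = domain_disc(zs)
    d_t = domain_disc(torch.cat([zt_l, zt_u]))
    adv = F.binary_cross_entropy_with_logits(d_s, torch.ones_like(d_s)) + \
          F.binary_cross_entropy_with_logits(d_t, torch.zeros_like(d_t))
    return sup + lam_adv * adv

# Toy usage with random tensors standing in for features and labels.
loss = lirr_style_loss(torch.randn(8, 64), torch.randint(0, 10, (8,)),
                       torch.randn(4, 64), torch.randint(0, 10, (4,)),
                       torch.randn(8, 64))
```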

Figure 12: Overview of the LIRR model.

LightTrack: Finding Lightweight Neural Networks for Object Tracking via One-Shot Architecture Search

Paper: https://arxiv.org/abs/2104.14545

Code: https://github.com/researchmm/LightTrack

Object tracking has achieved significant progress over the past few years. However, state-of-the-art trackers have become increasingly heavy and expensive, which limits their deployment in resource-constrained applications. In this paper, researchers present LightTrack, which uses neural architecture search (NAS) to design more lightweight and efficient object trackers.

To search for efficient neural architectures, researchers used depthwise separable convolutions (DSConv) and mobile inverted bottlenecks (MBConv) with squeeze-and-excitation modules to construct a new search space. The detailed search space is elaborated in Table 1.

Table 1: Search space and supernet structure.

The search pipeline of the proposed LightTrack is shown in Figure 13. It consists of three phases: pretraining the backbone supernet, training the tracking supernet, and searching with an evolutionary algorithm on the tracking supernet.
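As a rough illustration of the last phase, the snippet below sketches an evolutionary search over supernet paths under a FLOPs budget; the search-space size, cost model, and fitness function are toy stand-ins rather than LightTrack's real weight-shared evaluation.

```python
import random

# Hypothetical one-shot NAS evolutionary search: paths through a trained supernet
# are mutated and recombined, subject to a FLOPs constraint.
NUM_BLOCKS, NUM_CHOICES = 10, 6           # illustrative search-space size
FLOPS_BUDGET = 550                        # arbitrary units

def flops(path):                          # toy per-choice cost model
    return sum(10 * (c + 1) for c in path)

def fitness(path):                        # stand-in for EAO of the weight-shared subnet
    return -abs(sum(path) - 25) + random.random()

def mutate(path, p=0.1):
    return [random.randrange(NUM_CHOICES) if random.random() < p else c for c in path]

def evolve(pop_size=32, generations=20, topk=8):
    pop = [[random.randrange(NUM_CHOICES) for _ in range(NUM_BLOCKS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)[:topk]   # keep the elite
        children = []
        while len(children) < pop_size - topk:
            a, b = random.sample(scored, 2)
            cut = random.randrange(1, NUM_BLOCKS)                # crossover
            child = mutate(a[:cut] + b[cut:])                    # then mutation
            if flops(child) <= FLOPS_BUDGET:                     # respect the budget
                children.append(child)
        pop = scored + children
    return max(pop, key=fitness)

best_path = evolve()
```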

Figure 13: Search pipeline of the LightTrack

Comprehensive experiments have shown that LightTrack is effective. It can find trackers that achieve superior performance compared to handcrafted SOTA trackers, such as SiamRPN++ and Ocean, while using far fewer FLOPs and parameters. For instance, as shown in Figure 14, LightTrack-Mobile achieves superior performance compared with the state-of-the-art Ocean tracker, while using 13x fewer parameters and 38x fewer FLOPs.

Figure 14: Comparisons with state-of-the-art trackers in terms of EAO performance, model FLOPs, and parameters on the VOT-19 benchmark.

Moreover, when deployed on resource-constrained mobile chipsets, the discovered trackers run much faster. As shown in Figure 15, LightTrack runs at real-time speed, being 3∼6x faster than SiamRPN++ (MobileNetV2 backbone), and 5∼17x faster than Ocean (offline) on Snapdragon 845 GPU and DSP, Apple A10 Fusion PowerVR GPU, and Kirin 985 Mali-G77 GPU. Such improvements might narrow the gap between academic models and industrial deployments in object tracking tasks.

Figure 15: Run-time speed on resource-limited platforms.

Figure 16 visualizes the search progress of the LightTrack-Mobile architecture. There are several interesting phenomena. 1) 50% of the backbone blocks use MBConv with a kernel size of 7×7. The underlying reason may be that large receptive fields can improve localization precision. 2) The searched architecture chooses the second-to-last block as the feature output layer. This reveals that tracking networks might not prefer high-level features. 3) The classification branch contains fewer layers than the regression branch. This may be attributed to the fact that coarse object localization is relatively easier than precise bounding box regression. These findings may inspire future work in designing new tracking networks.

Figure 16: The architecture searched by the proposed LightTrack (Mobile) framework.

Lite-HRNet: A Lightweight High-Resolution Network

Paper: https://arxiv.org/pdf/2104.06403.pdf

HRNet: https://github.com/HRNet

Lite-HRNet: https://github.com/HRNet/Lite-HRNet

Human pose estimation, aimed at recognizing and locating the keypoints for all persons in an image, requires high-resolution representation to achieve high performance. Recently, HRNet (high-resolution network) has shown strong capability among large models on this task. However, it remains unclear whether high resolution is helpful for small models.

In this paper, researchers present an efficient high-resolution network, Lite-HRNet, for human pose estimation. They started by simply applying the efficient shuffle block from ShuffleNet to HRNet, yielding stronger performance than popular lightweight networks such as MobileNet, ShuffleNet, and Small HRNet.

Figure 17: Top: Structure of Lite-HRNet. Bottom: Building block of Lite-HRNet.

Researchers observed that the heavily used pointwise (1 × 1) convolutions in shuffle blocks become the computational bottleneck, as their complexity is much higher than that of the depthwise convolutions. They therefore introduced a lightweight unit, conditional channel weighting, to replace the costly pointwise (1 × 1) convolutions in shuffle blocks. The complexity of channel weighting is linear with respect to the number of channels, lower than the quadratic complexity of pointwise convolutions. The new solution learns the weights from all the channels and over the multiple resolutions that are readily available in the parallel branches of HRNet. It employs two functions to compute cross-resolution weights and spatial weights: the inhomogeneous cross-resolution weights aggregate cross-resolution information for the same position, while the homogeneous spatial weights aggregate spatial information within a single resolution. The weights serve as a bridge to exchange information across channels and resolutions, compensating for the role played by the pointwise (1 × 1) convolution.

Lite-HRNet demonstrates superior results on human pose estimation over popular lightweight networks and achieves a state-of-the-art complexity-accuracy trade-off on the COCO and MPII benchmarks. Moreover, Lite-HRNet can be easily generalized to semantic segmentation tasks in the same lightweight manner.
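A minimal sketch of the cross-resolution channel weighting idea is given below; the branch channel counts, reduction ratio, and pooling/upsampling details are illustrative assumptions, not Lite-HRNet's exact unit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossResolutionWeighting(nn.Module):
    """Hypothetical sketch of conditional channel weighting across parallel
    HRNet branches: pool every branch to the smallest resolution, compute
    per-channel weights from the pooled multi-resolution context, and use them
    to re-weight (rather than 1x1-convolve) each branch's channels."""
    def __init__(self, channels=(16, 32, 64), reduction=4):
        super().__init__()
        self.channels = list(channels)
        total = sum(self.channels)
        self.fc = nn.Sequential(
            nn.Conv2d(total, total // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(total // reduction, total, 1),
            nn.Sigmoid(),
        )

    def forward(self, feats):                       # list of (B, C_i, H_i, W_i)
        size = feats[-1].shape[-2:]                 # smallest resolution
        pooled = torch.cat([F.adaptive_avg_pool2d(f, size) for f in feats], dim=1)
        weights = self.fc(pooled).split(self.channels, dim=1)
        # Upsample the weights back and re-weight each branch: linear in the number
        # of channels, in place of the quadratic pointwise convolution.
        return [f * F.interpolate(w, size=f.shape[-2:], mode="nearest")
                for f, w in zip(feats, weights)]

m = CrossResolutionWeighting()
outs = m([torch.randn(1, 16, 64, 64), torch.randn(1, 32, 32, 32), torch.randn(1, 64, 16, 16)])
```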

M3P: A Multitask Multilingual Multimodal Pre-training Model

Paper: https://arxiv.org/abs/2006.02635

Code: https://github.com/microsoft/M3P

Recently, a new paradigm of natural language processing (NLP) has emerged, where general knowledge is learned from raw texts through self-supervised pre-training and then applied to downstream tasks by task-specific fine-tuning. These state-of-the-art monolingual pre-trained language models, such as BERT, RoBERTa, and GPT-2, have now been expanded to multilingual scenarios, such as Multilingual BERT, XLM/XLM-R, and Unicoder. Moreover, a number of pre-training models for multimodal scenarios, such as Unicoder-VL, UNITER, ERNIE-ViL, VILLA, and Oscar, have also emerged.

However, it is still challenging to extend these pre-trained models to multilingual-multimodal scenarios. Existing multilingual pre-trained language models cannot handle vision data (e.g., images or videos) directly, whereas many pre-trained multimodal models are trained on English corpora and thus cannot perform very well on non-English languages. Therefore, a high-quality multilingual multimodal training corpus is essential for combining multilingual pre-training and multimodal pre-training. However, only a few multilingual multimodal corpora exist, and they have low language coverage. Relying on high-quality machine translation engines to generate such data from English multimodal corpora is both time-consuming and computationally expensive. As a result, the ability to learn explicit alignments between vision and non-English languages during pre-training is lacking.

To address these challenges, this paper presents M3P, a multitask multilingual multimodal pre-trained model, which aims to learn universal representations that can map objects occurring in different modalities or texts expressed in different languages into a common semantic space. To alleviate the lack of non-English labeled data for multimodal pre-training, researchers introduced Multimodal Code-switched Training (MCT) to enforce explicit alignments between images and non-English languages.

Figure 18: Overview of the M3P

M3P uses the self-attentive transformer architecture of BERT. Multitask training is employed in the pre-training stage to optimize all pre-training objectives simultaneously. To pre-train M3P under a multilingual-multimodal scenario, researchers designed two types of pre-training objectives: Multilingual Training aims to learn grammar and syntax from well-formed multilingual sentences, while Multimodal Code-switched Training (MCT) aims to learn different languages through the shared vision modality and the alignment between vision and non-English texts.
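The code-switching step of MCT can be illustrated with a toy example like the one below; the bilingual dictionary, replacement probability, and language set are invented for illustration.

```python
import random

# Toy sketch of Multimodal Code-switched Training (MCT) data preparation:
# words of an English caption paired with an image are randomly replaced with
# translations drawn from bilingual dictionaries, so the model sees images
# aligned with code-switched, partly non-English text.
BILINGUAL_DICT = {
    "dog": {"de": "Hund", "fr": "chien"},
    "ball": {"de": "Ball", "fr": "ballon"},
    "park": {"de": "Park", "fr": "parc"},
}

def code_switch(caption, languages=("de", "fr"), p=0.5, seed=None):
    rng = random.Random(seed)
    out = []
    for word in caption.split():
        trans = BILINGUAL_DICT.get(word.lower())
        if trans and rng.random() < p:
            out.append(trans[rng.choice(languages)])   # swap in a translation
        else:
            out.append(word)                           # keep the English word
    return " ".join(out)

print(code_switch("A dog plays with a ball in the park", seed=0))
# e.g. "A Hund plays with a ball in the parc"
```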

Table 2: Overall results of multilingual image-text retrieval

M3P achieves new state-of-the-art results for the multilingual image-text retrieval task on both Multi30K and MSCOCO for non-English languages, outperforming existing multilingual methods by a large margin. It also achieves results for English on these two datasets that are comparable to state-of-the-art monolingual multimodal models. In addition, experiments show the effectiveness of Multimodal Code-switched Training (MCT) in low-resource settings.

Prototypical Pseudo Label Denoising and Target Structure Learning for Domain Adaptive Semantic Segmentation

Paper: https://arxiv.org/pdf/2101.10979.pdf

Code: https://github.com/microsoft/ProDA

While deep learning has seen tremendous success in the last decade, vast quantities of data are required to achieve high performance. People have attempted to reduce the labeling expense by resorting to synthetic data, but deep neural networks are notoriously sensitive to domain misalignment, and any nuanced unrealism in rendered images will induce poor generalization to real data. Hence, domain adaptation techniques have been proposed to transfer the knowledge learned from synthetic images (source domain) to real ones (target domain) with minimal performance loss. In this paper, researchers propose ProDA for domain adaptive semantic segmentation. Compared with prior works, ProDA improves the adaptation gain (the mIoU gain relative to the model without domain adaptation) by more than 50% on typical datasets.

ProDA builds on self-training but improves upon it in three aspects. First, pseudo-label noise hurts self-training performance, so researchers proposed to rectify the pseudo labels during training. The pseudo labels are rectified by estimating the class-wise likelihoods according to each sample's relative feature distances to the class-wise feature centroids (or prototypes). Since the prototypes are computed on-the-fly, the pseudo labels are progressively corrected throughout the training (as shown in Figure 19). Second, researchers drew inspiration from recent unsupervised learning techniques to learn the intrinsic structure of the target domain. They used the prototypical assignment under weak augmentation to guide the learning for the strongly augmented view, resulting in a more compact feature space. Third, it is shown that distilling the already-learned knowledge into a self-supervised pretrained model further improves the performance significantly.
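A simplified sketch of the prototype-based rectification step might look like the following; the distance-to-weight mapping and temperature here are assumptions, and ProDA's exact weighting differs (see the paper).

```python
import torch
import torch.nn.functional as F

def rectify_pseudo_labels(feats, soft_pseudo, prototypes, tau=1.0):
    """Simplified sketch of prototype-based pseudo-label rectification.

    feats:        (N, D)  target-domain features
    soft_pseudo:  (N, K)  initial pseudo-label probabilities from the source model
    prototypes:   (K, D)  running class-wise feature centroids
    Each sample's class likelihoods are re-weighted by how close its feature is
    to each class prototype, then renormalized.
    """
    dists = torch.cdist(feats, prototypes)                 # (N, K)
    proto_weight = F.softmax(-dists / tau, dim=1)          # closer prototype -> larger weight
    rectified = soft_pseudo * proto_weight
    return rectified / rectified.sum(dim=1, keepdim=True)

# The prototypes themselves would be updated on-the-fly as moving averages of
# the features assigned to each class, so the rectification improves over training.
feats = torch.randn(100, 16)
soft = F.softmax(torch.randn(100, 19), dim=1)
protos = torch.randn(19, 16)
labels = rectify_pseudo_labels(feats, soft, protos).argmax(dim=1)
```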

Figure 19: The proposed ProDA online refines the pseudo labels throughout the training.

Supercharged with the above techniques, ProDA outperforms prior works by a large margin. With the DeepLabv2 network, ProDA achieves 57.5 and 55.5 mIoU on Cityscapes segmentation when adapting from the GTA5 and SYNTHIA datasets, improving the adaptation gain by 52.6% and 58.5% respectively over the prior leading approach. The figure below illustrates the performance progress of domain adaptation techniques on the GTA5-to-Cityscapes benchmark.

Figure 20: Performance milestone of domain adaptation techniques.

Figure 21: Qualitative comparisons of different methods.

SSAN: Separable Self-Attention Network for Video Representation Learning

Paper: https://arxiv.org/abs/2105.13033

Representation learning is crucial for computer vision tasks. Although 2D and 3D CNN based approaches have been extensively explored for images, learning strong and generic video representations is still challenging. Compared with images, videos contain not only rich semantic elements within individual frames, but also temporal reasoning across time, which links those elements to reveal semantic-level information for actions and events. Effective modeling of long-range dependencies is essential to capturing such contextual information, which current CNN operations can hardly achieve.

The self-attention mechanism has been recognized as an effective way to build long-range dependencies. Existing approaches build the dependencies merely by computing the pairwise correlations along spatial and temporal dimensions simultaneously. However, spatial correlations and temporal correlations represent different contextual information. The former often relates to scenes and objects, and the latter often relates to temporal reasoning for actions (short-term activities) and events (long-term activities). Learning correlations along spatial and temporal dimensions together might capture irrelevant information, leading to ambiguity for action understanding. Intuitively, it would be better to separate spatial and temporal attentions.

However, a straightforward separation does not work well. In this paper, researchers propose a separable self-attention (SSA) module that models spatial and temporal correlations sequentially, so that spatial contexts can be efficiently used in temporal modeling. Specifically, researchers carefully designed a separable self-attention module, shown in Figure 22, which follows two principles. First, the spatial and temporal attentions are performed sequentially, so that temporal correlations can fully consider the spatial contexts. Second, spatial attention maps exploit as much contextual information as possible. By adding the SSA module into a 2D CNN, an SSA network (SSAN) is built for video representation learning.
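The spatial-then-temporal factorization can be sketched as below. This toy module only illustrates the sequential ordering using standard multi-head attention and is far simpler than the actual SSA design shown in Figure 22; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    """Rough sketch of separable self-attention: spatial attention within each
    frame, followed by temporal attention across frames at every spatial location."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, T, H*W, C)
        B, T, N, C = x.shape
        # Spatial attention: tokens are the H*W positions of one frame.
        xs = x.reshape(B * T, N, C)
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        # Temporal attention: tokens are the T frames at one spatial position,
        # now computed on spatially-contextualized features.
        xt = xs.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        xt = xt + self.temporal(xt, xt, xt, need_weights=False)[0]
        return xt.reshape(B, N, T, C).permute(0, 2, 1, 3)

ssa = SeparableSelfAttention()
out = ssa(torch.randn(2, 8, 7 * 7, 64))         # (B=2, T=8 frames, 49 tokens, C=64)
```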

Figure 22: Design of the separable self-attention module. The spatial attention (SA) part is highlighted in yellow. The temporal attention (TA) part is highlighted in blue.

On the task of video action recognition, the new approach outperforms state-of-the-art methods on the Something-Something and Kinetics-400 datasets. The new models often outperform counterparts with a shallower network and fewer modalities. Researchers further verified the semantic learning ability of this method in a visual-language task, i.e., video retrieval, which showcases the homogeneity of video representations and text embeddings. On the MSR-VTT and Youcook2 datasets, video representation learned by SSA outperforms the state-of-the-art by a large margin.

Table 3: Results of video action recognition on the Something-Something-V1 validation and test sets.

Style-based Point Generator with Adversarial Rendering for Point Cloud Completion

Paper: https://arxiv.org/abs/2103.02535

Code: https://github.com/microsoft/SpareNet

As depth cameras become increasingly popular, point clouds are getting easier to acquire and have recently attracted a surge of research interest in computer vision. Due to limited sensor resolution and occlusion during data acquisition, raw point cloud data are usually sparse and incomplete, and point cloud completion is essential to enabling various downstream tasks such as scene understanding, shape manipulation, and augmented visualization.

Contemporary point completion methods digest the partial inputs directly and hallucinate the complete point clouds in an end-to-end manner. They typically follow the encoder-decoder paradigm and adopt permutation-invariant losses to regress the ground truth. However, existing methods fail to faithfully preserve the input structure, because they neglect the global context during feature extraction, have limited imaginative capability to infer the global shape from partial clues, and lack reliable metrics to measure perceptual quality.

In this paper, researchers propose a Style-based Point generator with Adversarial REndering, or SpareNet, to circumvent the above issues. First, channel-attentive EdgeConv is proposed to enhance the encoder, which simultaneously leverages the local and global context of the input. Second, researchers observed that the concatenation used by vanilla folding limits its potential for generating complex and faithful shapes. Inspired by the success of StyleGAN, researchers regarded the shape feature as a style code that modulates the normalization layers during the folding, which considerably enhances its capability. Finally, in order to generate visually pleasing results, researchers proposed to project the generated point clouds into view images, whose realism is further examined by adversarial discriminators. Since the renderer used is differentiable, the gradient from the discriminators guides the network to learn completions with high perceptual quality when viewed from different angles. The overall architecture of the proposed SpareNet is illustrated in Figure 23.
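The style-modulation idea can be sketched as below; the AdaIN-style modulation and layer sizes are illustrative assumptions rather than SpareNet's exact generator.

```python
import torch
import torch.nn as nn

class StyleModulatedFolding(nn.Module):
    """Sketch of the style-based generator idea: instead of concatenating the
    shape feature to every point, use it as a style code that modulates the
    normalization statistics of a folding layer (AdaIN-like)."""
    def __init__(self, style_dim=256, hidden=128):
        super().__init__()
        self.affine = nn.Linear(style_dim, 2 * hidden)      # style -> (scale, shift)
        self.fold = nn.Sequential(nn.Conv1d(2, hidden, 1), nn.ReLU(inplace=True))
        self.norm = nn.InstanceNorm1d(hidden)
        self.out = nn.Conv1d(hidden, 3, 1)                  # fold the 2D grid into 3D points

    def forward(self, grid, style):                         # grid: (B, 2, N), style: (B, S)
        h = self.norm(self.fold(grid))
        scale, shift = self.affine(style).chunk(2, dim=1)   # (B, hidden) each
        h = h * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
        return self.out(h)                                  # (B, 3, N) completed points

gen = StyleModulatedFolding()
points = gen(torch.rand(4, 2, 2048), torch.randn(4, 256))
```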

Figure 23: Architecture of SpareNet

Figure 24 depicts the pipeline of differentiable point rendering, where the 3D points are first projected as 2D points and then rasterized with a smooth kernel density. A depth map is ultimately generated through a pixel-wise maximum reduction of the negated point depths. Note that the renderer is fully differentiable, and adversarial training on the rendered depth maps significantly improves the visual quality.
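A simplified, fully differentiable depth renderer in this spirit might look like the following; the orthographic projection, Gaussian splatting kernel, and resolution are our assumptions, and SpareNet's actual renderer differs in its details.

```python
import torch

def render_depth(points, res=64, sigma=2.0):
    """Toy differentiable depth renderer: orthographic projection, Gaussian
    splatting, and a per-pixel maximum over negated point depths.

    points: (B, N, 3) with coordinates roughly in [-1, 1].
    """
    B, N, _ = points.shape
    xy = (points[..., :2] + 1) * 0.5 * (res - 1)             # (B, N, 2) pixel coords
    depth = -points[..., 2]                                   # negated depth: larger = closer

    ys, xs = torch.meshgrid(torch.arange(res, dtype=points.dtype),
                            torch.arange(res, dtype=points.dtype), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(1, 1, res * res, 2)   # (1, 1, P, 2)

    # Gaussian splatting weight of every point on every pixel.
    d2 = ((xy.unsqueeze(2) - grid) ** 2).sum(-1)              # (B, N, P)
    w = torch.exp(-d2 / (2 * sigma ** 2))

    # Maximum reduction of (splat-weighted) negated depths per pixel.
    depth_map = (w * depth.unsqueeze(-1)).max(dim=1).values   # (B, P)
    return depth_map.reshape(B, res, res)

dm = render_depth(torch.rand(2, 1024, 3) * 2 - 1)             # (2, 64, 64) depth maps
```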

Figure 24: Differentiable point rendering

Extensive experiments have been conducted on ShapeNet and KITTI datasets, and SpareNet performs favorably over state-of-the-art methods both quantitatively and qualitatively.

Figure 25: Visualized completion comparison on ShapeNet dataset.

Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision

Paper: https://arxiv.org/pdf/2106.01226.pdf

Code: https://git.io/CPS

In this paper, researchers study the semi-supervised semantic segmentation problem by exploring both labeled data and extra unlabeled data. A novel consistency regularization approach, called cross pseudo supervision (CPS), is proposed. The new approach imposes consistency between two segmentation networks perturbed with different initializations for the same input image. The pseudo one-hot label map output by one perturbed segmentation network is used to supervise the other segmentation network with the standard cross-entropy loss, and vice versa. The benefits of the cross pseudo supervision scheme are two-fold. On one hand, like previously studied consistency schemes, it encourages high similarity between the predictions of the two perturbed networks for the same input image, which leads the networks to learn a more compact feature space. On the other hand, during the later optimization stages, the pseudo segmentation becomes stable and more accurate than the result of normal supervised training conducted only on the labeled data. The pseudo-labeled data behaves like an expansion of the training data, thus improving the quality of the segmentation network training.
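The core of the cross pseudo supervision loss is simple enough to sketch directly. The snippet below shows only the unlabeled-branch loss for one batch; the trade-off weight that combines it with the supervised loss on labeled data is omitted here.

```python
import torch
import torch.nn.functional as F

def cps_loss(logits_a, logits_b):
    """Cross pseudo supervision on one unlabeled batch: each network's hard
    pseudo-label map supervises the other network with standard cross-entropy."""
    pseudo_a = logits_a.argmax(dim=1).detach()   # pseudo one-hot labels from network A
    pseudo_b = logits_b.argmax(dim=1).detach()   # pseudo one-hot labels from network B
    return F.cross_entropy(logits_a, pseudo_b) + F.cross_entropy(logits_b, pseudo_a)

# Two segmentation networks with different initializations predict on the same image.
logits_a = torch.randn(2, 19, 128, 128, requires_grad=True)   # (B, classes, H, W)
logits_b = torch.randn(2, 19, 128, 128, requires_grad=True)
loss = cps_loss(logits_a, logits_b)
loss.backward()
```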

Figure 26: Illustrating the architectures of (a) the proposed cross pseudo supervision approach, (b) cross confidence consistency, (c) mean teacher, and (d) PseudoSeg.

Experiments were conducted on two public benchmarks, Cityscapes and PASCAL VOC 2012, to validate this method. Results show that the new method outperforms state-of-the-art methods consistently under different partition protocols. In particular, CPS outperforms PseudoSeg (an approach from Google researchers), which uses a complicated scheme to compute the pseudo segmentation map. The method was also verified under the full-supervision setting. Researchers used the fine-labeled training set of Cityscapes as the labeled set and randomly sampled 3K images from the coarse-labeled set as the unlabeled set, without using their coarse annotations. Even with a large amount of labeled data, CPS still improved the performance of two very strong baselines significantly: from 80.40 to 81.54 mIoU for DeepLabv3+ (ResNet-101) and from 80.65 to 82.41 mIoU for HRNet-W48.

Unsupervised Visual Representation Learning by Tracking Patches in Videos

Paper: https://arxiv.org/pdf/2105.02545.pdf

Code: https://github.com/microsoft/CtP

We humans can visually track moving objects in our line of sight, even in early childhood, and experiments in cognitive science suggest that tracking can even help babies understand objects. This inspired researchers to ask a natural question: can tracking also help an artificial neural network model develop better representations?

In this paper, researchers propose a self-supervised learning method based on the visual tracking task. It aims to pretrain video representation models such as C3D, R3D, or TSM. Concretely, given an input video clip and the target location in the starting frame, the model is asked to predict the target locations in the remaining frames. This self-supervised task forces the model to learn the temporal motion information across video frames, which plays a vital role in many video understanding tasks.

To leverage large-scale unlabeled data, researchers hope to obtain ground-truth tracking trajectories without access to any human annotation. To this end, various approaches were tried, such as (1) producing pseudo labels with off-the-shelf trackers, (2) exploiting the forward-backward cycle consistency of tracking, and (3) generating synthetic videos by pasting image patches. Among these three, the best has been the synthetic video approach: a sub-patch is randomly picked from a video frame and pasted onto the other frames following predefined rules. If the pasted patch is regarded as a specific object, an exactly accurate ground-truth trajectory is obtained for free.
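A toy version of the synthetic-video construction might look like the following; the linear trajectory and fixed patch size are simplifying assumptions, since the paper's pasting rules are richer.

```python
import numpy as np

def make_synthetic_track(frames, patch_size=32, seed=0):
    """Toy sketch of the synthetic-video idea: crop one patch from the first
    frame and paste it onto every frame along a simple linear trajectory, so
    the pasted boxes form an exactly known, annotation-free ground-truth track.

    frames: (T, H, W, 3) uint8 video clip. Returns the pasted clip and the boxes.
    """
    rng = np.random.default_rng(seed)
    T, H, W, _ = frames.shape
    y0, x0 = rng.integers(0, H - patch_size), rng.integers(0, W - patch_size)
    patch = frames[0, y0:y0 + patch_size, x0:x0 + patch_size].copy()

    start = rng.integers(0, [H - patch_size, W - patch_size])   # (y, x) start
    end = rng.integers(0, [H - patch_size, W - patch_size])     # (y, x) end
    out, boxes = frames.copy(), []
    for t in range(T):
        yx = (start + (end - start) * t / max(T - 1, 1)).astype(int)
        out[t, yx[0]:yx[0] + patch_size, yx[1]:yx[1] + patch_size] = patch
        boxes.append((yx[1], yx[0], patch_size, patch_size))    # (x, y, w, h)
    return out, np.array(boxes)

clip, gt_boxes = make_synthetic_track(np.zeros((8, 112, 112, 3), dtype=np.uint8))
```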

The pretrained video representation model has been applied to a series of downstream tasks, such as action recognition and video retrieval. This method can achieve relatively good performance. For example, on the UCF-101 action recognition dataset, the randomly initialized model has an accuracy of 67.0%, while the pretrained one can reach an accuracy of 88.4%. In future work, researchers hope to further explore suitable ways to leverage the video motion information for self-supervised learning.

Table 4: Comparison with state-of-the-art video representation learning approaches.