Unlocking new dimensions in image-generation research with Manifold Matching via Metric Learning

Published

By , Data & Applied Scientist, Microsoft Bing Multimedia Group , Program Manager, Microsoft Bing Multimedia Group

An image generation model training using MVM. Images show a panda progressively being generated to become more realistic, various paintings generated by the model, an anime character generated to become more realistic, and various cat images generated to become more realistic.

Generative image models offer a unique value by creating new images. Such images can be sharp super-resolution versions of existing images or even realistic-looking synthetic photographs. Generative Adversarial Networks (GANs) and their variants have demonstrated pioneering success with the framework of training two networks against each other: a generator network learns to generate realistic fake data that can trick a discriminator network, and the discriminator network learns to correctly tell apart the generated fake data from the real data.

In order to apply the latest innovations in computer vision to GANs, the research community needs to address two challenges. First, GANs model data distributions with statistical measures, such as the mean and the moments, as opposed to geometric measures. Second, traditional GANs represent the loss of the discriminator network only as a 1D scalar value corresponding to the Euclidean distance between the real and the fake data distributions. Because of these two challenges, the research community has been unable to directly apply breakthrough metric learning methods or experiment with novel loss functions and training techniques to continue to improve generative models.

In this work, “Manifold Matching via Deep Metric Learning for Generative Modeling,” we propose a new framework for generative models, which we call Manifold Matching via Metric Learning (MvM). In the MvM framework, two networks are trained against each other. The metric generator network learns to define a better metric for the distribution generator network’s manifold matching objective, and the distribution generator network learns to produce more hard negative samples for the metric learning objective of the metric generator network. Through the adversarial training, MvM produces a distribution generator network that can generate fake data distribution that is very close to the real data distribution and a metric generator network that can provide an effective metric for capturing the internal geometric structure of the data distribution. This paper was accepted at the International Conference on Computer Vision (ICCV 2021) in October.

Comparing Manifold Matching via Metric Learning to Generative Adversarial Networks

DifferencesGANsMvM
Main point of viewstatisticsgeometry
Matching termsmeans, moments, etc.centroids, p-diameters
Matching criteriastatistical discrepancylearned distances
Underlying metricdefault Euclideanlearned intrinsic
Objective functionsone min-max value functiontwo distinct objectives
Table 1: Comparison between Generative Adversarial Networks (GANs) and Manifold Matching via Metric Learning (MvM)

Table 1 summarizes five important differences between GANs and MvM. One useful characteristic of MvM is that it uses a learned intrinsic metric as opposed to the default Euclidean distance to represent the performance of its networks. This unlocks the opportunity for the research community to bring in latest breakthroughs from metric learning directly to the field of training generative models.

MvM is also more interpretable than GANs are. GANs use a single min-max value function as the objective function. The training loss calculated using this objective function goes down when the discriminator network gets better and goes up when the generator network gets better. As a consequence, the training loss fluctuates up and down as the training progresses, giving a human interpreter no interpretable sign of the behavior or the performance of the networks.

Today, a human interpreter is forced to print generated images and make qualitative decisions to deduce this information. In contrast, MvM uses two distinct objective functions. The first objective function calculates the metric learning loss—fluctuating up and down to demonstrate that the two networks are learning adversarial against each other as desired. The other objective function calculates the manifold matching loss. This objective function monotonically decreases over training epochs as the generated fake data distribution become more similar to the real data distribution. This means a human interpreter is able to deduce quantitative conclusions based on the value of manifold matching loss.

Finally, unlike GANs, MvM outputs multi-dimensional representation of images. This enables the research community to experiment with new or existing training frameworks and techniques, such as unsupervised representation learning or various metric learning techniques. This is expected to accelerate generative model research and unlock new potential directions that were previously considered impossible.

Image Generation and Image Super-Resolution via MvM

To test the effectiveness and the versatility of framework, we applied MvM to two popular image generation tasks: unsupervised image generation and image super-resolution.

Images generated by StyleGAN2 architecture trained in MvM framework on the Flickr-Faces-HQ Dataset
Figure 1: Images generated by StyleGAN2 architecture trained in MvM framework on the Flickr-Faces-HQ Dataset

For the unsupervised image generation task, we trained a StyleGAN2 architecture and generated large images (512 pixels by 512 pixels) based on the Flickr-Faces-HQ (FFHQ) dataset—a popular benchmark for demonstrating the effectiveness of image generation models. Results are shown in Figure 1, and the examples demonstrate that MvM is indeed effective in generating images that closely resemble real data in the dataset.

A comparison of images created with Bicubic, GAN, and MVM against the original High Resolution (HR).  Unlike the results from GANs that have a grid-like result, MvM leads to a line-like result that is closer to ground truth.
Figure 2: Comparison of bicubic, GAN, and MvM against the original high resolution (HR).

For the image super-resolution task, we trained three different generator backbones, ResNet, RDN, and NSRNet, using GAN and MvM. Figure 2 shows the original high-resolution image as the ground truth, a bicubic super-resolution algorithm as the baseline, and two NSRNet generator networks trained respectively with GAN and MvM for comparison. Unlike the results from GANs that have a grid-like result, MvM leads to a line-like result that is closer to ground truth. You can see this is closer to the HR method shown above versus other methods like bicubic (see Figure 2). We qualitatively observe that the generator network trained with MvM surpasses the one trained with GAN in reconstructing fine details like outlines without inaccurate grid-like artifacts. For more examples, details, and benchmark results, please read our full paper.

Dive deeper into Manifold Matching via Metric Learning

You can download a PyTorch implementation of the Manifold Matching via Metric Learning at our GitHub repository. If the opportunity to build pioneering computer vision models like this excites you, please visit our career page to learn about our openings. We thank Haibin Hang, Cynthia Yu, Kun Wu, Meenaz Merchant, Arun Sacheti, and Jordi Ribas for enabling this work.