Artificial intelligence has evolved to become a revolutionary technology. It is rapidly changing the economy, both by creating new opportunities (it’s the backbone of the gig economy) and by bringing venerable institutions, like transportation, into the 21^{st} century. Yet deep at its core something is amiss, and more and more experts are worried: the technology seems to be extremely brittle, a phenomenon epitomized by **adversarial examples**.

Adversarial examples exploit weaknesses in modern AI. Today, most successful AI applications use machine learning (more specifically, supervised learning) by training big neural networks to mimic input-output mappings on sample data. But this technique relies on the assumption that new input data coming in is somewhat similar to the sample data being used for training. By manipulating input data in ways imperceptible to humans, malicious actors have been able to confuse AI into making incorrect predictions. These manipulated inputs, or adversarial examples, pose tremendous threats for applications such as automated insurance claim processing, and they can even be life-threatening if used to target autonomous vehicle systems.

Since 2013, there have been thousands of papers on adversarial examples. Very roughly speaking there are two camps of research: those who try to develop robust training techniques (creating robust neural networks), and those who try to find adversarial examples (breaking down neural networks). There has been a healthy back-and-forth between those camps, with “robust” training methods being broken one after another.

In “Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers,” a paper accepted at the thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019), our team of Microsoft researchers seeks to break that cycle with a method that provides provable guarantees. More specifically, by combining adversarial training with a technique called randomized smoothing, our method achieves state-of-the-art results for training networks that are provably robust against adversarial attacks.

### A change in how AI works caused a need to re-focus on robustness

In the 2000s, it was recognized that robustness is a key component of the success of any AI system. A handful of papers showed that such robustness was in fact achievable for linear models. Specifically, it turns out that for certain types of robustness guarantees, adding these guarantees to the objective function of Support Vector Machines (SVMs) yields an objective that remains convex. Therefore, robust SVMs can be efficiently learned. In other words, **these works showed that there exists a robust training method for linear models.**

But at the end of the 2000s, the deep learning revolution happened, and SVMs fell out of popularity in favor of deep neural networks. For a couple of years, as everyone focused on creating better and bigger networks, the robustness question was ignored—until 2013 came and with it also came the publication of a paper titled “Intriguing properties of neural networks,” by Goodfellow and others. In this work, the authors showed that the ghosts of the AI past were coming back to haunt us yet again: conventionally trained neural networks are extremely non-robust, just like conventional training for SVMs used to produce non-robust linear models. This time around, however, there has been no miracle. So far, nobody has been able to produce a truly robust training method for neural networks.

As mentioned earlier, after this paper a back-and-forth cycle began, where researchers would develop a more robust AI only to have it broken by new adversarial attacks, and the cycle would continue. Today, all non-broken robust training methods rely, in some ways, on adversarial training. In a nutshell, the idea of adversarial training is to attack the network ourselves during training, so that one identifies the weaknesses of the network and incrementally fixes them before deployment.

A critical difference between the neural network problem and the SVM example is that, in this case, **adversarial training is merely an empirical technique**: there is no guarantee of success, and one can only test empirically whether the resulting network is robust on a given input. The question that we address in our research has to do with just this: Is there a way to prove that a network is robust?

### A scalable provable guarantee: randomized smoothing via the Weierstrass transform

Our technique to construct provably robust neural networks uses randomized smoothing, which takes its origin in the work of the 19^{th} century mathematician Karl Weierstrass. The Weierstrass transform takes a bounded function and returns a Lipschitz (that is, smooth) function. It does so by convolving the function with a Gaussian kernel. In the case of neural networks, this transform has no closed form, but we can approximate it in high dimension through Monte Carlo sampling.

The key implications of the Weierstrass transform for neural networks are that smoothness exactly implies **provable robustness** and that the evaluation of the Weierstrass transform is **probabilistically efficient** (meaning that with high probability one can get a very good approximation to the value of the transform). In earlier work this year at ICML by a team led by Zico Kolter at Carnegie Mellon University, it was demonstrated that this approach can provably demonstrate robustness guarantees in the ballpark of their state-of-the-art empirical counterparts.

### Combining the Weierstrass transform with adversarial training

In our work, we combine both modern-day adversarial training methods and randomized smoothing, in a method called **smoothed adversarial training**. To fully unwind this idea, we need to explain how to train a smoothed (Weierstrass-transformed) classifier and how to perform adversarial training in this setting.

We set out to train a neural network so that its Weierstrass transform both has high accuracy and is robust against adversarial examples. As stated before, because the Weierstrass transform is Gaussian convolution, the transform can be evaluated approximately by Monte Carlo using random Gaussian vectors. To train the transformed function, we evaluate this approximation under our loss function of choice and leverage the powerful automatic differentiation features of modern deep learning frameworks to perform gradient descent. As expected, the more Gaussian samples we take, the better the approximation and the resulting model.

Before we explain how to adversarially train Weierstrass-transformed neural networks, we need to describe how to do it for standard neural networks. In each loop of adversarial training, we seek to find adversarial examples to our current model and focus our training on such “hard” instances. Finding these specific adversarial examples is itself a challenge.

An adversarial example might be, for instance, a subtly modified image of a cat that makes a neural network think it’s a dog with high confidence. To find such an image, one could just try “all” small modifications of a cat image and see if any of them causes the neural network to make a mistake. Of course, it is entirely infeasible to try all modifications, not just because there are infinitely many of them to be tried.

However, if there is any one universal lesson that the deep learning era teaches us, it must be that gradient descent works. This is the idea behind Madry et al.’s PGD algorithm, which has now become standard in the adversarial literature. We formulate the search for adversarial examples as a constrained optimization problem and apply gradient descent to (approximately) solve it. The constrained optimization problem goes something like this:

*Constrained to have small norm, find a modification vector v that maximizes the confidence that the model thinks (cat image + v) is a dog.*

Despite having no theoretical guarantee, this PGD algorithm can reliably find adversarial examples. To perform adversarial training, we then train on the adversarial examples discovered by the PGD algorithm.

### Adversarially training a smoothed classifier and results on two datasets

To find adversarial examples of the smoothed classifier, we apply the PGD algorithm described above to a Monte Carlo approximation of it. Likewise, to train on these adversarial examples, we apply a loss function to the same Monte Carlo approximation and backpropagate to obtain gradients for the neural network parameters. Besides this core structure, there are several important details that significantly affect the performances, and we refer to the paper for more on this. (One of our researchers also wrote this blog post with a deeper dive into the technical details as well.)

The simple idea of adversarially training the smoothed classifier establishes new state-of-the-art results in provably robust image classification. Our smoothed classifiers are more provably robust than prior work against the *entire* spectrum of adversarial perturbations on both CIFAR-10 and ImageNet datasets, as shown below.

We trained a variety of smoothed models on CIFAR-10 (Figure 1). Given a radius r, there is a portion of the test set that the model classifies correctly and that* provably *has no adversarial examples within radius r. This is called the *provably robust accuracy*. We plot the provably robust accuracy on the y-axis against the radius on the x-axis above, in blue solid lines, and compare against the prior state of the art, in red solid lines. We also empirically attacked our models using PGD and obtained the empirical robust accuracies on CIFAR-10, in dashed lines, which have no provable guarantee. As expected, the empirical accuracies are higher than the certified accuracy. Notably, our certified accuracies are higher, across the board, than the empirical accuracies of the previous state of the art.

We do the same here for ImageNet (Figure 2). The amount of improvements we gain from adversarially training the smoothed classifiers is strikingly large and consistent!

Our method for adversarially training smooth classifiers raises the bar by obtaining state-of-the-art results in provably robust image classification, which takes a step closer to solving the problem of malicious adversarial attacks. There is still work to do. At present, all robust training methods fall into the category of having small perturbations, which means that when increasing the dimensionality of the problem (such as increasing the resolution of an image), no substantial improvement in robustness occurs as it relates to perturbation size. At a very high resolution, deep networks can still be fooled by modifying a few pixels.

If you are attending NeurIPS 2019, our work will be featured as a spotlight during Track 2 Session 5 at 10:20 AM PST on Thursday, December 12^{th}. We also have a poster in East Exhibition Hall B + C, showing from 10:45 AM-12:45 PM PST also on Thursday, December 12^{th}. We hope you’ll check out our work!

Special thanks to Microsoft AI residents Edward Hu, Tony Duan, and Judy Shen for their contributions to this post. To find out more about the Microsoft AI Residency Program, check out its page.