Microsoft Research Blog

Discovering the best neural architectures in the continuous space

November 30, 2018 | By Fei Tian, Researcher

If you’re a deep learning practitioner, you may regularly face the same critical question: Which neural network architecture should I choose for my current task? The decision depends on a variety of factors and the answers to a number of other questions. Which operation should I choose for this layer: convolution, depthwise separable convolution, or max pooling? What kernel size should the convolution use, 3×3 or 1×1? And which previous node should serve as the input to the current recurrent neural network (RNN) cell? Such decisions are crucial to the architecture’s success. If you’re a domain expert in both neural network modeling and the specific task at hand, it might be easy for you to make the correct decisions. But what if your experience with either of them is limited?

In that case, you might turn to neural architecture search (NAS), an automated process in which an additional machine learning algorithm guides the creation of better neural architectures, given previously observed architectures and their performance. Thanks to NAS, we can pinpoint the neural network architectures that achieve the best results on widely used benchmark datasets, such as ImageNet, without any human intervention.

But while existing methods for automatic neural architecture design, typically based on reinforcement learning or evolutionary algorithms, have generally searched an exponentially large discrete space, my collaborators and I in the Machine Learning group at Microsoft Research Asia have designed a simpler, more efficient method based on optimization in a continuous space. With our new approach, called neural architecture optimization (NAO), we use a gradient-based method to optimize in this more compact space. The work is part of the program at this year’s Conference on Neural Information Processing Systems (NeurIPS).

Figure 1: The workflow of NAO

The key components of NAO

Driving NAO’s ability to perform gradient-based optimization in the continuous space are three components (see Figure 1):

  • an encoder that maps a discrete neural network architecture into a continuous vector, or embedding
  • a performance prediction function that takes the vector as input and generates a real value as a prediction of the performance of the architecture (for example, accuracy)
  • a decoder that recovers the architecture from its continuous vector
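
In symbols, the three components can be read as maps between the discrete architecture space and a continuous embedding space. The notation below is assumed here for illustration and does not appear in this post: the set of architectures is written as \mathcal{X}, the embedding dimension as d, the observed performance of architecture x as s_x, and \lambda is a trade-off weight. The weighted-sum objective is one plausible way to train the components together, combining a prediction loss with a reconstruction loss; it is a sketch, not necessarily the exact loss used in the work.

```latex
E : \mathcal{X} \to \mathbb{R}^{d}, \qquad
f : \mathbb{R}^{d} \to \mathbb{R}, \qquad
D : \mathbb{R}^{d} \to \mathcal{X}

\mathcal{L} \;=\; \lambda \sum_{x} \bigl( s_x - f(E(x)) \bigr)^{2}
\;-\; (1 - \lambda) \sum_{x} \log P_{D}\bigl( x \mid E(x) \bigr)
```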

These three components are trained jointly. After training, we start from an architecture x and map it with the encoder E to its representation e_x. We then move e_x along the gradient direction of the performance prediction function f (the green line in Figure 1) to obtain a new embedding vector e_x′. Because this is gradient ascent, f(e_x′) ≥ f(e_x) is guaranteed as long as the step size is small enough. Finally, we map e_x′ back to a discrete architecture x′ using the decoder D, which gives a potentially better architecture. Repeating this update yields the final architecture, which is expected to perform best.
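
The encode, step, decode loop can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the operation embeddings, the quadratic stand-in for the predictor f, the nearest-neighbor decoder, and the names `encode`, `predict`, `decode`, and `nao_step`. In NAO itself the encoder, predictor, and decoder are learned models, so this is a toy instance of the idea rather than the authors' implementation.

```python
# Toy sketch of one NAO-style update. All names and numbers are illustrative assumptions.

# Hypothetical 2-D embeddings for three candidate operations.
OP_EMBEDDINGS = {
    "conv3x3": (1.0, 0.0),
    "conv1x1": (0.0, 1.0),
    "maxpool": (-1.0, 0.0),
}

# Stand-in for a learned predictor: the score peaks when every layer sits at PEAK.
PEAK = (0.0, 1.0)


def encode(arch):
    """Encoder E: map a list of op names to one flat continuous vector."""
    return [v for op in arch for v in OP_EMBEDDINGS[op]]


def predict(e):
    """Predictor f: toy concave score, higher is better."""
    return -sum((v - PEAK[i % 2]) ** 2 for i, v in enumerate(e))


def predict_grad(e):
    """Analytic gradient of the toy predictor with respect to the embedding."""
    return [-2.0 * (v - PEAK[i % 2]) for i, v in enumerate(e)]


def decode(e):
    """Decoder D: snap each 2-D chunk to the nearest op embedding."""
    arch = []
    for i in range(0, len(e), 2):
        chunk = e[i:i + 2]
        nearest = min(
            OP_EMBEDDINGS,
            key=lambda op: sum((a - b) ** 2 for a, b in zip(chunk, OP_EMBEDDINGS[op])),
        )
        arch.append(nearest)
    return arch


def nao_step(arch, step_size=0.4):
    """Encode, take one gradient-ascent step on f, then decode."""
    e = encode(arch)
    e_new = [v + step_size * g for v, g in zip(e, predict_grad(e))]
    return decode(e_new), predict(e), predict(e_new)


new_arch, score_before, score_after = nao_step(["conv3x3", "maxpool"])
print(new_arch, score_before, score_after)
```

With this toy predictor, the step moves both layer embeddings toward the `conv1x1` vector, so the decoded architecture changes and the predicted score rises. A step size that is too large would overshoot the peak, which is why the guarantee only holds for sufficiently small steps.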

High performance with limited computational resources

We conducted extensive experiments to verify that NAO can automatically discover strong neural architectures. Table 1 (below) shows the results on the CIFAR-10 image classification dataset for convolutional neural network (CNN) architectures discovered by different NAS algorithms. The network discovered by NAO achieves the lowest error rate among the NAS algorithms studied. Furthermore, the search becomes much faster when NAO is combined with the weight sharing mechanism (referred to as “NAO-WS”), which significantly reduces the computational cost of evaluating candidate architectures by letting them share the same copy of weight parameters, so there is no need to train each network from scratch. In our experiments, a single graphics processing unit (GPU) needed seven hours to find a CNN architecture with a 3.53% error rate.
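
To make the weight sharing idea concrete, here is a minimal sketch in which every candidate architecture draws its parameters from one shared pool instead of allocating its own. The pool layout (one entry per layer and operation), the single-number "weights", and the names `build_shared_weights` and `weights_for` are all hypothetical simplifications, not the mechanism as implemented in NAO-WS.

```python
import random

# Hypothetical operation vocabulary, used only for this illustration.
OPS = ("conv3x3", "conv1x1", "maxpool")


def build_shared_weights(num_layers, seed=0):
    """Create one parameter entry per (layer, op); this pool is allocated only once."""
    rng = random.Random(seed)
    return {
        (layer, op): [rng.uniform(-1.0, 1.0)]  # a one-element list stands in for a tensor
        for layer in range(num_layers)
        for op in OPS
    }


def weights_for(arch, shared):
    """Look up the parameters an architecture uses; nothing is copied or re-initialized."""
    return [shared[(layer, op)] for layer, op in enumerate(arch)]


shared = build_shared_weights(num_layers=2)
arch_a = weights_for(["conv3x3", "maxpool"], shared)
arch_b = weights_for(["conv3x3", "conv1x1"], shared)

# Both architectures use conv3x3 at layer 0, so they hold the very same parameters.
arch_a[0][0] += 1.0  # pretend a training step updated architecture A
print(arch_a[0] is arch_b[0], arch_b[0][0] == arch_a[0][0])
```

Because the two architectures reference the same entry for the layer they have in common, an update made while training one is immediately visible to the other, which is what removes the need to train each candidate from scratch.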

Table 2 (below) summarizes the results on Penn Treebank (PTB) language modeling, where lower perplexity indicates better performance. Again, the RNN architecture found by NAO achieves the lowest perplexity in the table (56.0), and NAO-WS comes close (56.6) with very limited computational resources.

By using continuous optimization, NAO outperforms existing NAS methods that search directly in the discrete architecture space. In future work, we plan to use NAO to search for architectures for other important AI tasks, such as neural machine translation. Just as important, simpler and more efficient automatic neural architecture design continues to make machine learning technologies accessible to practitioners of all experience levels.

Method      Error Rate (%)   Resource (#GPU × #Hours)
ENAS        3.54             11
NAO-WS      3.53             7
AmoebaNet   2.13             3150 × 24
Hier-EA     3.75             300 × 24
NAO         2.11             200 × 24

Table 1: Results on CIFAR-10 classification

Method      Perplexity   Resource (#GPU × #Hours)
NASNet      62.4         1e4 CPU days
ENAS        58.6         12
DARTS       56.1         24
NAO         56.0         300 × 24
NAO-WS      56.6         10

Table 2: Results on PTB language modeling
