Microsoft Research Blog

Discovering the best neural architectures in the continuous space

November 30, 2018 | By Fei Tian, Researcher

If you’re a deep learning practitioner, you may regularly face the same critical question: Which neural network architecture should I choose for my current task? The decision depends on a variety of factors and on the answers to a number of other questions. Which operation should I choose for this layer: convolution, depthwise separable convolution, or max pooling? What kernel size should the convolution use, 3×3 or 1×1? And which previous node should serve as the input to the current recurrent neural network (RNN) cell? Such decisions are crucial to the architecture’s success. If you’re a domain expert in both neural network modeling and the specific task at hand, it might be easy for you to make the correct decisions. But what if your experience with either of them is limited?

In that case, you might turn to neural architecture search (NAS), an automated process in which an additional machine learning algorithm guides the creation of better neural architectures, given previously observed architectures and their performance. Thanks to NAS, we can pinpoint the neural network architectures that achieve the best results on widely used benchmark datasets, such as ImageNet, without any human intervention.

Existing methods for automatic neural architecture design, typically based on reinforcement learning or evolutionary algorithms, have generally conducted their searches in an exponentially large discrete space. My collaborators and I in the Machine Learning group at Microsoft Research Asia have designed a simpler, more efficient method based on optimization in a continuous space. With our new approach, called neural architecture optimization (NAO), we leverage the power of a gradient-based method to conduct optimization in this more compact space. The work is part of the program at this year’s Conference on Neural Information Processing Systems (NeurIPS).

Figure 1: The workflow of NAO

The key components of NAO

Driving NAO’s ability to perform gradient-based optimization in the continuous space are three components (see Figure 1):

  • an encoder that maps a discrete neural network architecture into a continuous vector, or embedding
  • a performance prediction function that takes the vector as input and generates a real value as a prediction of the performance of the architecture (for example, accuracy)
  • a decoder that recovers the architecture from its continuous vector

These three components are trained jointly. Once training is finished, we start from an architecture x and map it with the encoder E to its representation e_x. We then move e_x along the gradient direction of the performance prediction function f (the green line in Figure 1) to a new embedding vector, denoted e_x’. Because this is a gradient ascent step, f(e_x’) ≥ f(e_x) is guaranteed as long as the step size is small enough. Finally, we map e_x’ to a discrete architecture x’ using the decoder D, which gives us a potentially better architecture. Repeating this update iteratively yields the final architecture, which is expected to provide the best performance.
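The update step described above can be illustrated with a toy sketch in Python. It is not the NAO implementation: the quadratic predictor, its `target` optimum, the step size, and all function names are assumptions made up for illustration, and the encoder and decoder are left out entirely (the starting vector simply stands in for the encoder's output).

```python
# Toy sketch of one gradient-ascent step on a performance predictor f
# in embedding space. The predictor here is a hypothetical quadratic,
# not a trained model.

def predict(e, target):
    """Hypothetical predictor f: higher is better, maximal at `target`."""
    return -sum((ei - ti) ** 2 for ei, ti in zip(e, target))

def predict_grad(e, target):
    """Analytic gradient of f with respect to the embedding e."""
    return [-2.0 * (ei - ti) for ei, ti in zip(e, target)]

def ascend(e, target, step_size=0.1):
    """One gradient-ascent step: e' = e + step_size * grad f(e)."""
    grad = predict_grad(e, target)
    return [ei + step_size * gi for ei, gi in zip(e, grad)]

target = [1.0, -0.5, 2.0]   # assumed optimum of the toy predictor
e_x = [0.0, 0.0, 0.0]       # stand-in for the encoder output E(x)
e_x_new = ascend(e_x, target)

# With a small enough step, the predicted performance does not decrease.
assert predict(e_x_new, target) >= predict(e_x, target)
```

In a full system, `e_x_new` would then be passed to the decoder to recover a discrete architecture, and the loop would repeat from that architecture.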

High performance with limited computational resources

We conducted extensive experiments to verify that NAO can automatically discover strong neural architectures. Table 1 (below) shows the results on the CIFAR-10 image classification dataset for convolutional neural network (CNN) architectures discovered by different NAS algorithms. The network discovered by NAO achieves the lowest error rate among the NAS algorithms studied. Furthermore, the search becomes much faster when NAO is combined with the weight sharing mechanism (referred to as “NAO-WS”), which reduces the computational cost of evaluating candidate architectures by letting them share the same copy of weight parameters, so there is no need to train each network from scratch. In our experiments, a single graphics processing unit (GPU) needed seven hours to find a CNN architecture with a 3.53% error rate.
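The idea behind weight sharing can be sketched in a few lines of Python. This is only a toy illustration of the general mechanism, not NAO-WS itself: the operation names, the scalar "weights", and the `forward` and `train_step` helpers are all hypothetical.

```python
# Toy illustration of weight sharing: candidate architectures index into
# one shared parameter table instead of owning separate weights.

OPS = ("conv3x3", "conv1x1", "maxpool")
NUM_LAYERS = 3

# One shared scalar "weight" per (layer, operation) pair.
shared_weights = {(layer, op): 1.0 for layer in range(NUM_LAYERS) for op in OPS}

def forward(arch, x):
    """Multiply by the shared weight of the op chosen at each layer."""
    for layer, op in enumerate(arch):
        x = x * shared_weights[(layer, op)]
    return x

def train_step(arch, scale):
    """Pretend update: rescale, in place, only the weights `arch` uses."""
    for layer, op in enumerate(arch):
        shared_weights[(layer, op)] *= scale

arch_a = ("conv3x3", "conv1x1", "maxpool")
arch_b = ("conv3x3", "maxpool", "maxpool")

train_step(arch_a, 2.0)

# arch_b was never trained, but it reuses the layer-0 conv3x3 and
# layer-2 maxpool weights that arch_a just updated.
out_b = forward(arch_b, 1.0)
```

Because both architectures read from the same table, evaluating a new candidate starts from weights that earlier candidates have already updated, rather than from a fresh initialization.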

Table 2 (below) summarizes the results on PTB language modeling, where lower perplexity indicates better performance. Again, the RNN architecture found by NAO achieves impressive performance, and the weight sharing variant does so with very limited computational resources.

By using continuous optimization, NAO achieves better performance than existing NAS methods that search the discrete architecture space directly. In future work, we plan to use NAO to search for architectures for other important AI tasks, such as neural machine translation. Just as important, the availability of a simpler, more efficient method for automatic neural architecture design continues to make machine learning technologies accessible to practitioners of all experience levels.

Method      Error Rate (%)   Resource (#GPU x #Hours)
ENAS        3.54             11
NAO-WS      3.53             7
AmoebaNet   2.13             3150 x 24
Hier-EA     3.75             300 x 24
NAO         2.11             200 x 24

Table 1: Results on CIFAR-10 classification
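To compare the search costs in Table 1 on a single scale, the resource column can be converted to total GPU-hours. The short Python sketch below does that arithmetic. It assumes that a single number in the column is already a GPU-hour total on one GPU; the text confirms this for NAO-WS (one GPU for seven hours), but for ENAS it is only an assumption.

```python
# GPU-hour totals for the CIFAR-10 searches listed in Table 1.
# Each entry is (number of GPUs, hours per GPU).

resources = {
    "ENAS": (1, 11),          # assumed: 1 GPU x 11 hours
    "NAO-WS": (1, 7),         # one GPU for seven hours, per the text
    "AmoebaNet": (3150, 24),
    "Hier-EA": (300, 24),
    "NAO": (200, 24),
}

gpu_hours = {method: gpus * hours for method, (gpus, hours) in resources.items()}

# How many times more GPU-hours the full NAO search used than NAO-WS.
nao_vs_ws = gpu_hours["NAO"] / gpu_hours["NAO-WS"]

# Print the methods from cheapest to most expensive search.
for method, total in sorted(gpu_hours.items(), key=lambda item: item[1]):
    print(f"{method:<10} {total:>6} GPU-hours")
```

Under these assumptions, the full NAO search costs 4,800 GPU-hours against 7 for NAO-WS, which is the trade-off behind the gap in error rate between the two (2.11 versus 3.53).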

Method   Perplexity   Resource (#GPU x #Hours)
NASNet   62.4         1e4 CPU days
ENAS     58.6         12
DARTS    56.1         24
NAO      56.0         300 x 24
NAO-WS   56.6         10

Table 2: Results on PTB language modeling
