Discovering the best neural architectures in the continuous space

Published November 30, 2018

By Fei Tian, Researcher

If you’re a deep learning practitioner, you may find yourself facing the same critical question on a regular basis: Which neural network architecture should I choose for my current task? The decision depends on a variety of factors and on the answers to a number of other questions. Which operation should I choose for this layer: convolution, depthwise separable convolution, or max pooling? What kernel size should the convolution use, 3×3 or 1×1? And which previous node should serve as the input to the current recurrent neural network (RNN) cell? Such decisions are crucial to the architecture’s success. If you’re a domain expert in both neural network modeling and the specific task at hand, it might be easy for you to make the correct decisions. But what if your experience with either of them is limited?

In that case, you might turn to neural architecture search (NAS), an automated process in which an additional machine learning algorithm guides the creation of better neural architectures, given previously observed architectures and their performance. Thanks to NAS, we can pinpoint neural network architectures that achieve the best results on widely used benchmark datasets, such as ImageNet, without any human intervention.

But while existing methods for automatic neural architecture design, typically based on reinforcement learning or evolutionary algorithms, have generally conducted their searches in an exponentially large discrete space, my collaborators and I in the Machine Learning group at Microsoft Research Asia have designed a simpler, more efficient method based on optimization in a continuous space. With our new approach, called neural architecture optimization (NAO), we leverage the power of a gradient-based method to conduct optimization in this more compact space. The work is part of the program at this year’s Conference on Neural Information Processing Systems (NeurIPS).

Figure 1: The workflow of NAO

The key components of NAO

Driving NAO’s ability to perform gradient-based optimization in the continuous space are three components (see Figure 1):

an encoder that maps a discrete neural network architecture into a continuous vector, or embedding
a performance prediction function that takes the vector as input and generates a real value as a prediction of the performance of the architecture (for example, accuracy)
a decoder that recovers the architecture from its continuous vector

These three components are trained jointly. Once training is finished, we start from an architecture x and map it with the encoder E to its representation e_x. We then move e_x to a new embedding vector, denoted e_x’, along the gradient direction of the performance prediction function f (the green line in Figure 1). Since this is essentially gradient ascent, the predicted performance cannot decrease: f(e_x’) ≥ f(e_x) as long as the step size is small enough. Finally, we map e_x’ back to a discrete architecture x’ using the decoder D. In this way, we obtain a potentially better architecture x’. By iteratively updating architectures in this way, we arrive at the final architecture, which we expect to provide the best performance.
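
To make the encode, ascend, decode loop concrete, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not the NAO implementation: the two-choice search space, the hand-written `encode` and `decode` functions, and the quadratic `predict` function are all hypothetical stand-ins for the learned encoder E, decoder D, and predictor f.

```python
import math

# Hypothetical toy search space: an architecture is (kernel_size, num_layers).
CANDIDATES = [(k, n) for k in (1, 3, 5) for n in (1, 2, 3, 4)]

def encode(arch):
    """Toy encoder E: map a discrete architecture to a continuous vector."""
    k, n = arch
    return [float(k), float(n)]

def predict(e):
    """Toy predictor f: a smooth score that peaks at the embedding (3, 4)."""
    return -((e[0] - 3.0) ** 2) - (e[1] - 4.0) ** 2

def predict_grad(e):
    """Analytic gradient of the toy predictor with respect to the embedding."""
    return [-2.0 * (e[0] - 3.0), -2.0 * (e[1] - 4.0)]

def decode(e):
    """Toy decoder D: return the candidate whose encoding is nearest to e."""
    return min(CANDIDATES, key=lambda a: math.dist(encode(a), e))

def step(arch, step_size=0.3):
    """Encode, move one step along the gradient of f, then decode."""
    e = encode(arch)
    g = predict_grad(e)
    e_new = [ei + step_size * gi for ei, gi in zip(e, g)]
    return decode(e_new)

def search(arch, max_rounds=10):
    """Repeat the update until the decoded architecture stops changing."""
    for _ in range(max_rounds):
        new_arch = step(arch)
        if new_arch == arch:
            break
        arch = new_arch
    return arch

start = (1, 1)
found = search(start)
print(start, "->", found, "predicted score:", predict(encode(found)))
```

In this toy setup the search moves from (1, 1) to (3, 4), the candidate with the highest predicted score. In NAO the encoder, predictor, and decoder are learned jointly rather than written by hand, and the gradient comes from the learned predictor.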

High performance with limited computational resources

We conducted extensive experiments to verify the effectiveness of using NAO to automatically discover the best neural architecture. Table 1 (below) shows the results on the CIFAR-10 image classification dataset for convolutional neural network (CNN) architectures discovered by different NAS algorithms. From the table, we can observe that the neural network discovered by NAO achieves the lowest error rate among the studied NAS algorithms. Furthermore, we can achieve a significant search speedup by combining NAO with the weight sharing mechanism (referred to as “NAO-WS”), a method that significantly reduces the computational cost of evaluating different neural network architectures by letting them share the same copy of weight parameters. With weight sharing, there is no need to train different neural networks from scratch. In our experiments, a single graphics processing unit (GPU) needed only seven hours to obtain a CNN architecture achieving a 3.53% error rate.
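
The idea behind weight sharing can be sketched in a few lines of Python. This is a toy illustration, not the code used in our experiments: the operation names, the three-layer search space, and the `SharedWeights` class are hypothetical, and a one-element list stands in for a real weight tensor.

```python
import itertools

# Hypothetical toy search space: one operation choice per layer.
OPS = ("conv3x3", "conv1x1", "sep_conv3x3")
NUM_LAYERS = 3

class SharedWeights:
    """A single pool of parameters, keyed by (layer, op), reused by all candidates."""

    def __init__(self):
        # One tiny parameter list per (layer, op) pair; real weights would be tensors.
        self.params = {
            (layer, op): [0.0]
            for layer, op in itertools.product(range(NUM_LAYERS), OPS)
        }

    def weights_for(self, arch):
        """Return the shared parameter objects that architecture `arch` would use."""
        return [self.params[(layer, op)] for layer, op in enumerate(arch)]

shared = SharedWeights()
arch_a = ("conv3x3", "sep_conv3x3", "conv1x1")
arch_b = ("conv3x3", "conv1x1", "conv1x1")

# Stand-in for one training step on arch_a: nudge every weight it uses.
for w in shared.weights_for(arch_a):
    w[0] += 0.1

# arch_b was never trained, yet it sees the update wherever it overlaps with arch_a.
print(shared.weights_for(arch_b))
print(len(shared.params), "parameter entries for", len(OPS) ** NUM_LAYERS, "architectures")
```

Because every candidate reads from the same pool, the number of parameters grows with the number of (layer, operation) pairs rather than with the number of candidate architectures, which is why a candidate can be evaluated without being trained from scratch.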

Table 2 (below) summarizes the results on Penn Treebank (PTB) language modeling. Lower perplexity indicates better performance. Again, we can observe that the RNN architecture found by NAO achieves impressive performance with very limited computational resources.

By using continuous optimization, NAO achieves better performance than existing NAS methods that search directly in the discrete architecture space. In future work, we plan to use NAO to search for architectures for other important AI tasks, such as neural machine translation. Just as important, the availability of simpler, more efficient automatic neural architecture design continues to make machine learning technologies accessible to practitioners of all experience levels.


Method	Error Rate (%)	Resource (#GPU x #Hours)
ENAS	3.54	11
NAO-WS	3.53	7
AmoebaNet	2.13	3150 * 24
Hier-EA	3.75	300 * 24
NAO	2.11	200 * 24

Table 1: Results on CIFAR-10 classification
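
To compare the search costs in Table 1 on one scale, the resource column can be converted to total GPU-hours. The short Python sketch below does this arithmetic. It assumes that a single number in the resource column is already a GPU-hour total and that the NAO entry reads as 200 GPUs for 24 hours; the dictionary name `gpu_hours` is only for illustration.

```python
# Resource entries from Table 1, expressed as total GPU-hours.
# Assumption: single numbers are GPU-hour totals; "a * b" means a GPUs for b hours.
gpu_hours = {
    "ENAS": 11,
    "NAO-WS": 7,
    "AmoebaNet": 3150 * 24,
    "Hier-EA": 300 * 24,
    "NAO": 200 * 24,
}

baseline = gpu_hours["NAO-WS"]
for method, hours in sorted(gpu_hours.items(), key=lambda kv: kv[1]):
    # Report each method's cost and its ratio to the cheapest NAO variant.
    print(f"{method:10s} {hours:6d} GPU-hours  ({hours / baseline:.0f}x NAO-WS)")
```

Under these assumptions, NAO without weight sharing uses 4,800 GPU-hours, while AmoebaNet uses 75,600, so the two lowest error rates in the table come at very different search costs.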


Method	Perplexity	Resource (#GPU x #Hours)
NASNet	62.4	1e4 CPU days
ENAS	58.6	12
DARTS	56.1	24
NAO	56.0	300 * 24
NAO-WS	56.6	10

Table 2: Results on PTB language modeling