Microsoft Research Blog

Microsoft Researchers’ Algorithm Sets ImageNet Challenge Milestone


Posted by Richard Eckel

The race among computer scientists to build the world’s most accurate computer vision system is more of a marathon than a sprint.

The race’s new leader is a team of Microsoft researchers in Beijing. In a paper published this week, the team reports that its computer vision system, based on deep convolutional neural networks (CNNs), has for the first time eclipsed the ability of people to classify objects in the 1000-class ImageNet challenge.

In their paper, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, the researchers say their system achieved a 4.94 percent error rate on the 1000-class ImageNet 2012 classification dataset, which contains about 1.2 million training images, 50,000 validation images, and 100,000 test images. In previous experiments, humans have achieved an estimated 5.1 percent error rate.
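For readers unfamiliar with how this benchmark is scored: the figures quoted here are top-5 error rates, meaning a prediction counts as correct if the true label appears among the system's five highest-scoring classes. The snippet below is a minimal sketch of that metric, not the official evaluation code; the function name `top_k_error` and the toy scores are illustrative assumptions.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k top-scoring classes.

    scores: array of shape (num_examples, num_classes), higher means more likely.
    labels: array of shape (num_examples,) holding the true class indices.
    (Hypothetical helper for illustration; not the challenge's own tooling.)
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row (order within the k is irrelevant).
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

# Toy data: two examples, six classes.
scores = np.array([
    [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],   # true class 0 has the lowest score
    [0.5, 0.4, 0.3, 0.2, 0.1, 0.0],   # true class 1 has the second-highest score
])
labels = np.array([0, 1])
print(top_k_error(scores, labels, k=5))
print(top_k_error(scores, labels, k=1))
```

On the real test set the same computation would run over 100,000 images and 1,000 classes.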

“To our knowledge, our result is the first to surpass human-level performance…on this visual recognition challenge,” the researchers wrote.

Jian Sun, principal researcher at Microsoft

The research team comprises Jian Sun, a 38-year-old principal researcher; Kaiming He, a 30-year-old researcher in Microsoft Research Asia’s Visual Computing Group; and two academic interns, Xiangyu Zhang of Xi’an Jiaotong University and Shaoqing Ren of the University of Science and Technology of China.

Sun, who joined Microsoft Research Asia a dozen years ago, earned his bachelor’s, master’s and Ph.D. degrees in electrical engineering from Xi’an Jiaotong University, where in 2001 he was a student of Harry Shum (@harryshum), Microsoft’s executive vice president, Technology and Research, and one of the founding members of Microsoft’s research organization in China. Shum, an IEEE Fellow and an ACM Fellow for his contributions to computer vision and computer graphics, is incredibly proud of his former student’s accomplishment.

“The first project Jian worked on with me in 2001, together with Professor Nanning Zheng of Xi’an Jiaotong University, was stereo reconstruction with belief propagation. Jian was among the first to realize the power of using Bayesian belief propagation to solve a large class of computer vision problems with Markov networks such as stereo,” Shum said.

Shum is especially proud of Sun’s Microsoft achievements. “Many of Jian’s research results have been incorporated within Microsoft products, and I am especially excited about the potential of his latest work with deeper neural nets.”

Sun credits the team’s most recent achievement to two key ideas: more adaptable nonlinear units within the neural network, and a better training algorithm that makes the network more powerful.

In the paper, the researchers note that the rectifier neuron is one of several keys to the recent success of deep neural networks being applied to computer vision challenges.

“In this paper, we investigate neural networks from two aspects particularly driven by the rectifiers,” the researchers wrote. “First, we propose a new generalization of ReLU, which we call Parametric Rectified Linear Unit (PReLU). This activation function adaptively learns the parameters of the rectifiers, and improves accuracy at negligible extra computational cost. Second, we study the difficulty of training rectified models that are very deep. By explicitly modeling the nonlinearity of the rectifiers (ReLU/PReLU), we derive a theoretically sound initialization method, which helps with convergence of very deep models (e.g., with 30 weight layers) trained directly from scratch. This gives us more flexibility to explore more powerful network architectures.”
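To make the two ideas concrete, here is a minimal NumPy sketch, written for this post rather than taken from the researchers' code. It assumes the PReLU form f(y) = max(0, y) + a·min(0, y), with a learnable slope a, and a zero-mean Gaussian initialization whose standard deviation is sqrt(2 / ((1 + a²)·n)), where n is the number of inputs feeding each unit. The function names, the fully connected layers, the width of 512 and the slope of 0.25 are all illustrative assumptions; the paper works with convolutional layers and learns a during training.

```python
import numpy as np

def prelu(y, a):
    """PReLU: pass positive inputs through, scale negative inputs by slope `a`.

    With a = 0 this reduces to the ordinary ReLU. In a real network `a` is a
    learned parameter; here it is a fixed number for illustration.
    """
    return np.maximum(0.0, y) + a * np.minimum(0.0, y)

def rectifier_init(fan_in, fan_out, a=0.0, rng=None):
    """Zero-mean Gaussian weights with std = sqrt(2 / ((1 + a^2) * fan_in)).

    Hypothetical helper: `fan_in` is the number of inputs to each unit.
    """
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / ((1.0 + a ** 2) * fan_in))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def preactivation_ratio(depth=30, width=512, a=0.25, naive_std=None, seed=0):
    """Push random data through `depth` layers and return the mean-square
    pre-activation of the last layer divided by that of the first layer.

    A ratio near 1 means the signal neither vanishes nor explodes with depth.
    If `naive_std` is given, weights use that fixed std instead.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(256, width))
    first = last = None
    for _ in range(depth):
        if naive_std is None:
            w = rectifier_init(width, width, a, rng)
        else:
            w = rng.normal(0.0, naive_std, size=(width, width))
        y = x @ w                      # pre-activations of this layer
        last = np.mean(y ** 2)
        first = last if first is None else first
        x = prelu(y, a)                # activations fed to the next layer
    return last / first

print("rectifier-aware init:", preactivation_ratio())
print("fixed std of 0.01:   ", preactivation_ratio(naive_std=0.01))
```

In this toy setting the rectifier-aware initialization keeps the signal at roughly the same scale through all 30 layers, while the fixed small standard deviation drives it toward zero, which is the kind of convergence difficulty the quoted passage refers to.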

Like other researchers in the field, the paper’s authors are excited that their algorithm has surpassed human-level performance on this benchmark, but they emphasize that computer vision still cannot match human vision in general. They note that the system still struggles to understand objects and to handle cases that require contextual understanding or high-level knowledge of a scene.

“While our algorithm produces a superior result on this particular dataset, this does not indicate that machine vision outperforms human vision on object recognition in general…On recognizing elementary object categories…machines still have obvious errors in cases that are trivial for humans. Nevertheless, we believe our results show the tremendous potential of machine algorithms to match human-level performance for many visual recognition tasks.”

With the Chinese New Year (the year of the sheep) approaching on Feb. 19, Sun uses sheep to explain that human-level understanding is still more developed than computer image classification.

“Humans have no trouble distinguishing between a sheep and a cow. But computers are not perfect with these simple tasks,” Sun explains. “However, when it comes to distinguishing between different breeds of sheep, this is where computers outperform humans. The computer can be trained to look at the detail, texture, shape and context of the image and see distinctions that can’t be observed by humans.”

The work of Sun, He and team isn’t confined to research; it’s already being applied to Microsoft services, including Bing image search and OneDrive, the company’s online storage solution. In a recent blog post, Douglas Pearce (@douglasprc), group program manager, noted how OneDrive now can automatically recognize content in your photos.

“Our users will have access to automatically grouped collections of photos and they can easily search for specific ones. You’ll be able to quickly find things such as ‘people,’ ‘dogs,’ ‘whiteboard,’ ‘beach,’ ‘sunsets,’ and dozens of other terms. This makes it even easier to add your photos in to presentations for school, to relive a specific memory, or to share something important with all of your friends on Facebook,” Pearce said.

Pearce later suggested that readers interested in how this technology works read this article, which we posted last fall, about work by these same researchers that speeds deep-learning object-detection systems by as much as 100 times while maintaining accuracy. The team’s advance was documented in this research paper, Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition.

“The Visual Computing team here in Beijing has been devoted to pushing the state-of-art in computer vision, with the ultimate goal of enabling computers to emulate the perceptual capability of humans. I’m proud of their achievements over the years, which have not only impacted the academic world through the contribution of high-quality publications, but also empowered Microsoft products through technology transfers,” said Hsiao-Wuen Hon, chairman of Microsoft’s Asia-Pacific R&D Group, and managing director of Microsoft Research Asia.

The computer vision marathon gained momentum in 2010 when scientists from Stanford, Princeton and Columbia universities started the Large Scale Visual Recognition Challenge. According to an August 2014 New York Times article by noted technology industry journalist John Markoff (@markoff), accuracy almost doubled in the 2014 competition and error rates were cut in half. Most recently, Baidu researchers published a paper in which they claim to have achieved “a top-5 error rate of 5.33%” on the ImageNet classification challenge.

The marathon continues; this year’s challenge will take place in December. But that isn’t the primary focus of Sun, He and team. “Our goal is to develop systems that are as good as, or better, at recognizing images than humans on many useful applications,” Sun said. “For that to happen, we need more training data and more real-world test scenarios. It’s our work with Bing, OneDrive and other services that will help us improve the robustness of our algorithm.”
