Microsoft Computational Network Toolkit offers most efficient distributed deep learning computational performance

December 7, 2015 | Posted by Microsoft Research Blog

By Xuedong Huang, Chief Speech Scientist, Microsoft


Xuedong Huang (Photo by Scott Eklund/Red Box Pictures)

For more than 20 years, Microsoft has invested in advanced speech recognition research and development. It's great to see the return on that investment in products and services such as Windows Cortana, Skype Translator, and the Project Oxford Speech APIs. Microsoft researchers pioneered the use of deep neural networks for speech recognition, and in April 2015 our speech researchers shared those deep learning tools with the speech research community, introducing the Computational Network Toolkit (CNTK) under an open source license at the ICASSP conference.

CNTK is a unified computational network framework that describes deep neural networks as a series of computational steps via a directed graph. In this graph, leaf nodes represent input values or network parameters, while other nodes represent matrix operations upon their children. CNTK provides algorithms to carry out both forward computation and gradient calculation. The most popular computation node types are predefined, and users can easily extend them under the open source license. Combined with Microsoft's upcoming Azure GPU Lab, CNTK gives us a modern, distributed GPU platform that the community can use to advance AI research.
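To make the graph idea concrete, here is a minimal sketch in Python of a computational network of this kind: leaf nodes hold inputs and parameters, interior nodes apply matrix operations to their children, and a topological traversal carries out forward computation followed by gradient calculation. This is an illustrative toy, not CNTK's actual API; the `Node`, `Input`, `Times`, and `Sigmoid` names are assumptions made for the example.

```python
import numpy as np

class Node:
    """A vertex in the computational network."""
    def __init__(self, children=()):
        self.children = list(children)
        self.value = None   # filled in by the forward pass
        self.grad = None    # filled in by the backward pass
    def forward(self):  pass
    def backward(self): pass

class Input(Node):
    """Leaf node: an input value or a network parameter."""
    def __init__(self, value):
        super().__init__()
        self.value = np.asarray(value, dtype=float)

class Times(Node):
    """Matrix product W @ x of its two children."""
    def forward(self):
        W, x = self.children
        self.value = W.value @ x.value
    def backward(self):
        W, x = self.children
        W.grad = np.outer(self.grad, x.value)   # dL/dW
        x.grad = W.value.T @ self.grad          # dL/dx

class Sigmoid(Node):
    """Elementwise logistic nonlinearity."""
    def forward(self):
        (z,) = self.children
        self.value = 1.0 / (1.0 + np.exp(-z.value))
    def backward(self):
        (z,) = self.children
        z.grad = self.grad * self.value * (1.0 - self.value)

def topo_order(root):
    """Children before parents, so each forward() sees evaluated inputs."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for child in node.children:
                visit(child)
            order.append(node)
    visit(root)
    return order

# One layer: y = sigmoid(W @ x)
x = Input([1.0, 2.0])
W = Input(np.full((3, 2), 0.1))
y = Sigmoid([Times([W, x])])

nodes = topo_order(y)
for node in nodes:                 # forward computation
    node.forward()
y.grad = np.ones_like(y.value)     # seed the gradient at the output
for node in reversed(nodes):       # gradient calculation
    node.backward()
print(y.value, W.grad, sep="\n")
```

Describing a network this way means new architectures are just new graphs, and any node type that can compute its value and its local gradient slots into the same forward/backward machinery.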

Since CNTK's debut in April, we've significantly improved machine learning efficiency with Azure GPU Lab. Together, CNTK and Azure GPU Lab let us build and train deep neural nets for Cortana speech recognition up to 10 times faster than our previous deep learning system. Our Microsoft colleagues have also used CNTK for other tasks, such as ImageNet classification and a deep structured semantic model. We've seen firsthand the kind of performance CNTK can deliver, and we think it could make an even greater impact within the broader machine learning and AI community. It's our hope that the community will take advantage of CNTK to share ideas more quickly through the exchange of open source working code.

For mission-critical AI research, we believe efficiency and performance should be among the most important design criteria. There are a number of deep learning toolkits available, from Torch, Theano, and Caffe to the recently open-sourced toolkits from Google and IBM. We compared CNTK with four popular toolkits, focusing on raw computational efficiency using simulated data and an effective minibatch size of 8,192 to fully utilize all GPUs. With a fully connected 4-layer neural network (see our benchmark scripts), the number of frames each toolkit can process per second is illustrated in the chart. We include two configurations on a single Linux machine with 1 and 4 GPUs (Nvidia K40), respectively. We also report CNTK's 8-GPU speed on Azure GPU Lab, using two identical Linux machines (2 x 4 GPUs) as in the baseline benchmark. CNTK compares favorably in computational efficiency for distributed deep learning (4 or 8 GPUs) against all of the toolkits we tested, and it can easily scale beyond 8 GPUs across multiple machines with superior distributed system performance.
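For a sense of what this benchmark measures, below is a rough, CPU-only sketch (plain NumPy, not any of the toolkits above) of the setup: a fully connected 4-layer network fed simulated data in minibatches of 8,192, reporting frames processed per second. Apart from the minibatch size, the layer dimensions and iteration count are assumptions chosen for illustration; the actual configurations are in our published benchmark scripts.

```python
import time
import numpy as np

# Simulated-data throughput for a fully connected 4-layer network, in the
# spirit of the benchmark above. All sizes besides the 8,192 minibatch are
# illustrative assumptions.
rng = np.random.default_rng(0)
minibatch = 8192
dims = [512, 2048, 2048, 2048, 1024]   # input, 3 hidden layers, output
weights = [0.01 * rng.standard_normal((m, n)) for m, n in zip(dims, dims[1:])]

def forward_backward(x):
    # Forward pass with ReLU layers; loss details are omitted, so the
    # backward pass simply propagates a unit gradient back through the
    # same matrix products, which dominate the cost being measured.
    acts = [x]
    for W in weights:
        acts.append(np.maximum(acts[-1] @ W, 0.0))
    grad = np.ones_like(acts[-1])
    for W, out in zip(reversed(weights), reversed(acts[1:])):
        grad = (grad * (out > 0.0)) @ W.T
    return grad

x = rng.standard_normal((minibatch, dims[0]))
forward_backward(x)                     # warm up
iters = 5
start = time.perf_counter()
for _ in range(iters):
    forward_backward(x)
elapsed = time.perf_counter() - start
print(f"{iters * minibatch / elapsed:,.0f} frames/sec")
```

A frames-per-second number from a run like this reflects raw matrix-multiply throughput on fixed-size minibatches, which is why a large effective minibatch matters: it keeps every GPU saturated so the comparison isolates the toolkits' computational efficiency rather than their input pipelines.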

We understand there are many design considerations to balance between computational performance and flexibility, and each toolkit has its unique strengths. TensorFlow offers a user-friendly Python interface; Theano is distinctive for its symbolic computation; Torch uses the Lua programming language; Caffe is popular among computer vision researchers for its efficient performance; and CNTK on Azure GPU Lab offers the most efficient distributed computational performance.

Until now, our focus with CNTK has been primarily within the speech research community, so its superior distributed computational performance isn't yet well known in the broader AI community. We hope to change that with our workshop on CNTK this Friday at the Neural Information Processing Systems (NIPS) conference. Dong Yu and I look forward to sharing CNTK's capabilities there, and to sharing future developments as we work together to deliver computing systems that can see, hear, speak, understand, and even begin to reason.