Microsoft Computational Network Toolkit offers most efficient distributed deep learning computational performance


By Xuedong Huang, Chief Speech Scientist, Microsoft

Xuedong Huang (Photo by Scott Eklund/Red Box Pictures)

For more than 20 years, Microsoft has invested in advanced speech recognition research and development. It’s great to see the return on that investment in products and services such as Windows Cortana, Skype Translator, and Project Oxford Speech APIs. Microsoft researchers pioneered using deep neural networks for speech recognition, and earlier this year, our speech researchers shared our deep learning tools with the speech research community when we introduced the Computational Network Toolkit (CNTK) under an open source license at the ICASSP Conference in April 2015.

CNTK is a unified computational network framework that describes deep neural networks as a series of computational steps via a directed graph. In this directed graph, each leaf node represents an input value or a network parameter, and each non-leaf node represents a matrix operation upon its children. CNTK provides algorithms to carry out both forward computation and gradient calculation. Most popular computation node types are predefined, and users can easily extend node types under the open source license. With the combination of CNTK and Microsoft’s upcoming Azure GPU Lab, we have a modern, distributed GPU platform that the community can use to advance AI research.
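To make the computational network idea concrete, here is a minimal sketch in Python. It is not CNTK code and does not use CNTK's API: the class and function names (`Leaf`, `Times`, `Plus`, `evaluate`, `backpropagate`) are hypothetical, and it works on scalars rather than matrices to stay short. It only illustrates the structure described above: leaves hold inputs and parameters, other nodes apply an operation to their children, and the same graph supports forward computation and gradient calculation.

```python
class Node:
    """A node in a toy computational network; leaves have no children."""

    def __init__(self, children=()):
        self.children = list(children)
        self.value = None
        self.grad = 0.0

    def forward(self):
        raise NotImplementedError

    def backward(self):
        # Leaves have no children to pass gradients to.
        pass


class Leaf(Node):
    """An input value or a network parameter (a scalar in this sketch)."""

    def __init__(self, value):
        super().__init__()
        self.value = value

    def forward(self):
        return self.value


class Times(Node):
    """Product of two children."""

    def forward(self):
        left, right = self.children
        self.value = left.value * right.value
        return self.value

    def backward(self):
        left, right = self.children
        left.grad += self.grad * right.value
        right.grad += self.grad * left.value


class Plus(Node):
    """Sum of two children."""

    def forward(self):
        left, right = self.children
        self.value = left.value + right.value
        return self.value

    def backward(self):
        for child in self.children:
            child.grad += self.grad


def topological_order(root):
    """Return the nodes reachable from root, children before parents."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for child in node.children:
            visit(child)
        order.append(node)

    visit(root)
    return order


def evaluate(root):
    """Forward computation: evaluate every node, children first."""
    for node in topological_order(root):
        node.forward()
    return root.value


def backpropagate(root):
    """Gradient calculation: push d(root)/d(node) from the root to the leaves."""
    order = topological_order(root)
    for node in order:
        node.grad = 0.0
    root.grad = 1.0
    for node in reversed(order):
        node.backward()


# y = w * x + b, a one-unit linear layer
x, w, b = Leaf(3.0), Leaf(2.0), Leaf(1.0)
y = Plus([Times([w, x]), b])
evaluate(y)
backpropagate(y)
print(y.value, w.grad, x.grad, b.grad)
```

Adding a new node type in this sketch means writing one class with a `forward` and a `backward` method, which mirrors the extensibility the paragraph above describes.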

Since the debut of CNTK in April, we’ve significantly improved machine learning efficiency with Azure GPU Lab. The combination of CNTK and Azure GPU Lab allows us to build and train deep neural nets for Cortana speech recognition up to 10 times faster than our previous deep learning system. Our Microsoft colleagues also have used CNTK to run other tasks, such as ImageNet classification and a deep structured semantic model. We’ve seen firsthand the kind of performance CNTK can deliver, and we think it could make an even greater impact within the broader machine learning and AI community. It’s our hope that the community will take advantage of CNTK to share ideas more quickly through the exchange of open source working code.

For mission-critical AI research, we believe efficiency and performance should be among the most important design criteria. There are a number of deep learning toolkits available, from Torch, Theano and Caffe to the recently open-sourced toolkits from Google and IBM. We compared CNTK with four popular toolkits. We focused on the raw computational efficiency of each toolkit, using simulated data with an effective mini-batch size of 8192 in order to fully utilize all GPUs. With a fully connected 4-layer neural network (see our benchmark scripts), the number of frames each toolkit can process per second is illustrated in the chart. We include two configurations on a single Linux machine, with 1 and 4 GPUs (Nvidia K40) respectively. We also report our 8-GPU CNTK speed on Azure GPU Lab with 2 Linux machines (2 x 4 GPUs) identical to the one used in the baseline benchmark. CNTK compares favorably in computational efficiency for distributed deep learning (4 GPUs or 8 GPUs) with all the toolkits we tested. CNTK can easily scale beyond 8 GPUs across multiple machines with superior distributed system performance.
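The frames-per-second metric used in this comparison can be sketched as follows. This is not the benchmark script referenced above; the function names and the `dummy_step` placeholder are hypothetical, and the even split of the effective mini-batch across GPUs is an assumption about how data-parallel training is typically arranged, not a detail stated in this post.

```python
import time

EFFECTIVE_MINIBATCH = 8192  # frames per update, as stated in the post


def per_gpu_minibatch(effective_size, num_gpus):
    """Assumed even split of the effective mini-batch across GPUs."""
    if effective_size % num_gpus != 0:
        raise ValueError("effective mini-batch must divide evenly across GPUs")
    return effective_size // num_gpus


def frames_per_second(train_step, minibatch_size, num_steps,
                      clock=time.perf_counter):
    """Time num_steps calls of train_step and return frames processed per second."""
    start = clock()
    for _ in range(num_steps):
        train_step(minibatch_size)
    elapsed = clock() - start
    return minibatch_size * num_steps / elapsed


def dummy_step(minibatch_size):
    # Hypothetical stand-in for one forward and backward pass over a mini-batch.
    return sum(range(minibatch_size))


for gpus in (1, 4, 8):
    share = per_gpu_minibatch(EFFECTIVE_MINIBATCH, gpus)
    print(gpus, "GPU(s):", share, "frames per GPU per update")

print("frames/s:", frames_per_second(dummy_step, EFFECTIVE_MINIBATCH, 20))
```

Keeping the effective mini-batch fixed while the GPU count changes means every configuration does the same amount of work per update, so the frames-per-second figures stay comparable.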

We understand there are many design considerations to balance between computational performance and flexibility, and each toolkit has its unique strengths. TensorFlow offers a user-friendly Python interface; Theano is unique in its symbolic operation; Torch uses the Lua programming language; Caffe is popular among computer vision researchers for its efficient performance; and CNTK on Azure GPU Lab offers the most efficient distributed computational performance.

Until now, our focus with CNTK has primarily been within the speech research community. As a result, its superior distributed computational performance capabilities aren’t well known within the broader AI community. We hope to change that with our workshop on CNTK this Friday at the Neural Information Processing Systems (NIPS) Conference. Dong Yu and I are looking forward to sharing CNTK’s capabilities with the broader AI community. We’re also looking forward to sharing future developments as we work together to deliver computing systems that can see, hear, speak, understand, and even begin to reason.
