Conquering Distributed Challenges

Published November 4, 2013

Posted by Rob Knies

Distributed computing can be a fiendishly difficult endeavor. Its benefits are manifest: Such systems pass messages across a series of computing devices connected to a network, and those devices interact efficiently to achieve results beyond the capability of any of the individual components.

Such work, though, is not for the faint-hearted. But then, the faint-hearted don’t walk the halls of Microsoft Research Silicon Valley, which has focused on distributed computing since its founding 12 years ago. Still, many vexing hurdles remain to be cleared, and scientists from the Silicon Valley lab will be suggesting solutions to some of them during SOSP 2013, the Association for Computing Machinery’s 24th annual Symposium on Operating Systems Principles, being held from Nov. 3 to 6 in the scenic Laurel Highlands region in southwestern Pennsylvania.

The Dandelion and Naiad projects, subjects of two of seven papers from Microsoft Research to be presented during SOSP 2013, demonstrate research that exhibits computational thinking, a concept discussed by Jeannette Wing, corporate vice president of Microsoft Research, in a video made public Oct. 29 as part of the Microsoft Research Luminaries video series.

Dandelion makes it much easier to program applications that run on CPU-plus-GPU clusters by shifting responsibility for the challenges inherent in programming GPUs on distributed systems from the programmer to the compiler and runtime.

Naiad, a prototype platform for performing incremental, iterative, and interactive data-parallel computation, supports various types of computation on big data that previously could not be handled by a single system.

In the days leading up to SOSP 2013, researchers from both projects took a few minutes to describe what their contributions to the conference entail.

Dandelion: a Compiler and Runtime for Heterogeneous Systems was a joint authorial effort by Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly, all of Microsoft Research Silicon Valley. Rossbach offers an explanation of the project’s contribution.

“Dandelion addresses programmability problems that arise in settings that feature architectural heterogeneity and/or distributed-computing resources,” he explains. “A good example of such a context—the one we focus on in the paper—is a small cluster of machines that both have multicore CPUs and GPUs. Unfortunately, developing applications for these systems is extremely difficult and generally requires experts.

“The presence of heterogeneity in the form of GPUs requires a developer to deal with diverse programming and execution models and specialized front-end language tools. GPUs are still very hard to program. A GPU-plus-CPU cluster combines these challenges with a long list of well-known problems inherent in writing code for a distributed system, and the result is that, in aggregate, these challenges can make using this kind of infrastructure untenable for most programmers.”

Dandelion, though, addresses this obstacle by providing a sequential, managed programming interface for such systems.

“The programmer writes code in a familiar productivity language like C#,” Rossbach says, “and the Dandelion compiler handles all the challenges, automatically generating parallel GPU implementations and managing data partitioning, distribution, and scheduling.

“Ideally, the programmer can develop code without specializing it for the execution context. The SOSP paper describes a prototype of Dandelion that takes important steps toward that goal by automatically executing LINQ queries on such clusters.”
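To make the idea concrete, here is a minimal sketch in Python rather than C#, and it is not Dandelion's API. It assumes a hypothetical runtime (`run_partitioned`) that takes a query written as ordinary sequential code, splits the input into chunks, evaluates the same query on each chunk, and merges the partial results. A thread pool stands in for the GPUs and cluster machines, and every name in the block is invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential_query(values):
    """What the programmer writes: an ordinary, sequential-looking query.

    Filters the even numbers, squares them, and sums the squares.
    """
    return sum(v * v for v in values if v % 2 == 0)


def partition(values, parts):
    """Split the input into contiguous chunks of roughly equal size."""
    size = max(1, -(-len(values) // parts))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def run_partitioned(values, parts=4):
    """Hypothetical runtime: run the same query per chunk, then merge.

    Merging with sum() is valid only because this particular aggregate
    is associative; a real system has to reason about that per operator.
    """
    chunks = partition(values, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(sequential_query, chunks))
    return sum(partials)
```

The caller's code does not change between the two paths: `run_partitioned(data)` returns the same value as `sequential_query(data)`, which is the property the quote describes, with partitioning and scheduling hidden from the programmer.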

The other paper mentioned above, on Naiad, is also an all-Microsoft Research Silicon Valley offering, written by Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. McSherry, who last year wrote a post for this blog that served as a useful primer on Naiad, discusses the latest refinement of the platform.

“Each big-data system does a limited set of things well,” he says, “for example, low-latency streaming computation or iterative graph computation or high-throughput batch computation. But if users want to do some combination, or all three, they are out of luck. The reason is that it is fundamentally hard to do all of these things at the same time.

“We’ve done all three in Naiad, but we had to invent some fundamentally new ways to coordinate distributed computation to get low latency for streaming and loops without involving all of the machines in a clunky synchronous protocol.”

Those new techniques include a computational model called “timely dataflow,” which enables parallelism across a wide set of algorithms.

“The timely dataflow model is meant to unify a lot of the patterns that existing big-data systems have used and to enable Naiad to execute all of these types of systems in a common framework,” McSherry says. “This means that you can do high-throughput data ingestion, fed into a complex iterative graph analysis and served up with low-latency, interactive access to a user—all within the same framework.”
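One way to picture the coordination problem is a toy, single-process sketch. It is not Naiad's API and leaves out loops and distribution entirely. It assumes that every message carries a logical timestamp (here an integer epoch) and that an operator may ask to be told when no more messages for an epoch can arrive. The class and method names below (`EpochCounter`, `ToyScheduler`, `on_recv`, `on_notify`) are hypothetical.

```python
from collections import defaultdict, deque


class EpochCounter:
    """Toy operator: counts words per epoch and emits each epoch's
    result only once it is told that the epoch is complete."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.results = {}

    def on_recv(self, epoch, word):
        # Messages for different epochs may interleave freely.
        self.counts[epoch][word] += 1

    def on_notify(self, epoch):
        # Called at most once per epoch, after its last message.
        self.results[epoch] = dict(self.counts.pop(epoch, {}))


class ToyScheduler:
    """Tracks undelivered messages per epoch and notifies the operator
    when a closed epoch has nothing left in flight."""

    def __init__(self, operator):
        self.operator = operator
        self.pending = defaultdict(int)  # epoch -> undelivered messages
        self.queue = deque()
        self.closed = set()
        self.notified = set()

    def send(self, epoch, word):
        if epoch in self.closed:
            raise ValueError(f"epoch {epoch} is already closed")
        self.pending[epoch] += 1
        self.queue.append((epoch, word))

    def close(self, epoch):
        """The input promises to send nothing more for this epoch."""
        self.closed.add(epoch)

    def run(self):
        # Deliver everything queued, then notify any finished epochs.
        while self.queue:
            epoch, word = self.queue.popleft()
            self.operator.on_recv(epoch, word)
            self.pending[epoch] -= 1
        for epoch in sorted(self.closed - self.notified):
            if self.pending[epoch] == 0:
                self.operator.on_notify(epoch)
                self.notified.add(epoch)
```

In this sketch, a result for epoch 0 can be released while messages for epoch 1 are still arriving, which hints at how one framework might serve both streaming and batch-style workloads. The real system has to track this kind of progress across many machines without a global barrier, which is the hard part McSherry describes.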

Combining these separate computational needs into a single model offers an additional benefit.

“We’re hoping,” McSherry says, “that the unification will simplify a lot of the research into these sorts of systems, because new platforms won’t need to implement everything from scratch, as they do now, and can instead build on and contribute to the Naiad ecosystem.”

To enable that capability, Naiad has been released to the open-source community and is available on GitHub. The download source contains the entire Naiad system for running on a single computer or a cluster of computers, and it is compatible with Windows, Linux, and Mac OS. It includes example applications, as well as frameworks and libraries for building new applications.

Rossbach says that Dandelion also will conserve programming resources.

“We show,” he says, “that C#/.NET programs can derive significant performance benefit from GPUs in the contexts we describe, with almost no additional effort from the programmer.”

Other papers with Microsoft Research co-authors that have been accepted for SOSP 2013:

Douglas B. Terry, Microsoft Research; Vijayan Prabhakaran, Microsoft Research; Ramakrishna Kota, Microsoft Research; Mahesh Balakrishnan, Microsoft Research; Marcos K. Aguilera, Microsoft Research; and Hussam Abu-Libdeh, Cornell University.

IOFlow: A Software-Defined Storage Architecture, by Eno Thereska, Microsoft Research; Hitesh Ballani, Microsoft Research; Greg O’Shea, Microsoft Research; Thomas Karagiannis, Microsoft Research; Antony Rowstron, Microsoft Research; Tom Talpey, Microsoft; Richard Black, Microsoft Research; and Timothy Zhu, Carnegie Mellon University.

Mahesh Balakrishnan, Microsoft Research; Dahlia Malkhi, Microsoft Research; Ted Wobber, Microsoft Research; Ming Wu, Microsoft Research; Vijayan Prabhakaran, Microsoft Research; Michael Wei, University of California, San Diego; John D. Davis, Microsoft Research; Sriram Rao, Microsoft; Tao Zou, Cornell University; and Aviad Zuck, Tel Aviv University.

Levin Ravindranath, Massachusetts Institute of Technology; Jitendra Padhye, Microsoft Research; Ratul Mahajan, Microsoft Research; and Hari Balakrishnan, Massachusetts Institute of Technology.

Yang Zhang, New York University; Russell Power, New York University; Siyuan Zhou, New York University; Yair Sovran, New York University; Marcos K. Aguilera, Microsoft Research; and Jinyang Li, New York University.
