Conquering Distributed Challenges
Distributed computing can be a fiendishly difficult endeavor. Its benefits are manifest: Such systems pass messages across a series of computing devices connected to a network, and those devices interact efficiently to achieve results beyond the capability of any of the individual components.
Such work, though, is not for the faint-hearted. But then, the faint-hearted don’t walk the halls of Microsoft Research Silicon Valley, which has focused on distributed computing since its founding 12 years ago. Still, many vexing hurdles remain to be cleared, and scientists from the Silicon Valley lab will be suggesting solutions to some of them during SOSP 2013, the Association for Computing Machinery’s 24th Symposium on Operating Systems Principles, being held from Nov. 3 to 6 in the scenic Laurel Highlands region in Southwestern Pennsylvania.
The Dandelion and Naiad projects, subjects of two of seven papers from Microsoft Research to be presented during SOSP 2013, demonstrate research that exhibits computational thinking, discussed by Jeannette Wing, corporate vice president of Microsoft Research, in a video made public Oct. 29 as part of the Microsoft Research Luminaries video series.
Dandelion makes it much easier to program applications that run on CPU-plus-GPU clusters, by shifting responsibility for the challenges inherent in programming GPUs on distributed systems from the programmer to the compiler and runtime.
Naiad, a prototype platform for performing incremental, iterative, and interactive data-parallel computation, supports kinds of big-data computation that previously could not be handled by a single system.
In the days leading up to SOSP 2013, researchers from both projects took a few minutes to describe what their contributions to the conference entail.
Dandelion: a Compiler and Runtime for Heterogeneous Systems, was a joint authorial effort by Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly, all of Microsoft Research Silicon Valley. Rossbach offered an explanation of the project’s contribution.
“Dandelion addresses programmability problems that arise in settings that feature architectural heterogeneity and/or distributed-computing resources,” he explains. “A good example of such a context—the one we focus on in the paper—is a small cluster of machines that both have multicore CPUs and GPUs. Unfortunately, developing applications for these systems is extremely difficult and generally requires experts.
“The presence of heterogeneity in the form of GPUs requires a developer to deal with diverse programming and execution models and specialized front-end language tools. GPUs are still very hard to program. A GPU-plus-CPU cluster combines these challenges with a long list of well-known problems inherent in writing code for a distributed system, and the result is that, in aggregate, these challenges can make using this kind of infrastructure untenable for most programmers.”
Dandelion, though, addresses this obstacle by providing a sequential, managed programming interface for such systems.
“The programmer writes code in a familiar productivity language like C#,” Rossbach says, “and the Dandelion compiler handles all the challenges, automatically generating parallel GPU implementations and managing data partitioning, distribution, and scheduling.
“Ideally, the programmer can develop code without specializing it for the execution context. The SOSP paper describes a prototype of Dandelion that takes important steps toward that goal by automatically executing LINQ queries on such clusters.”
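The division of labor Rossbach describes can be sketched in a few lines. The following is a toy illustration in Python, not Dandelion's API: `run_query` and `partition` are hypothetical names, and a thread pool stands in for the cluster and GPUs. The caller states the query as a filter, a projection, and a sum, and the stand-in runtime decides how to partition the data and combine partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_query(items, predicate, selector, workers=4):
    """Hypothetical runtime: computes sum(selector(x)) over items passing predicate.

    The caller never sees the partitioning or the worker pool; that is the
    part a system such as Dandelion is described as taking over.
    """
    def partial_sum(chunk):
        return sum(selector(x) for x in chunk if predicate(x))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is reduced independently; partial sums are then combined.
        return sum(pool.map(partial_sum, partition(items, workers)))


# The "program" is just the query: sum of squares of the even numbers.
data = list(range(1, 101))
total = run_query(data, lambda x: x % 2 == 0, lambda x: x * x)
print(total)
```

The query text stays the same whether `workers` is 1 or 16, which is the property the quote points at: the code is not specialized for its execution context.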
The other paper mentioned above, Naiad: A Timely Dataflow System, was written by Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. McSherry, who last year wrote a post for this blog that served as a useful primer on Naiad, discusses the latest refinement of the platform.
“Each big-data system does a limited set of things well,” he says, “for example, low-latency streaming computation or iterative graph computation or high-throughput batch computation. But if users want to do some combination, or all three, they are out of luck. The reason is that it is fundamentally hard to do all of these things at the same time.
“We’ve done all three in Naiad, but we had to invent some fundamentally new ways to coordinate distributed computation to get low latency for streaming and loops without involving all of the machines in a clunky synchronous protocol.”
Those new techniques included a new computational model called “timely dataflow,” which enables parallelism across a wide set of algorithms.
“The timely dataflow model is meant to unify a lot of the patterns that existing big-data systems have used and to enable Naiad to execute all of these types of systems in a common framework,” McSherry says. “This means that you can do high-throughput data ingestion, fed into a complex iterative graph analysis and served up with low-latency, interactive access to a user—all within the same framework.”
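One way to picture the coordination McSherry describes is a heavily simplified, single-process sketch in which every message carries a logical timestamp and a vertex is told when no more messages for that timestamp can arrive. The Python below is an assumption-laden toy, not Naiad's API: the class and method names (`CountPerEpoch`, `ToyScheduler`, `on_recv`, `on_notify`) are illustrative, and real progress tracking across machines and loops is far more involved.

```python
from collections import Counter, defaultdict


class CountPerEpoch:
    """Toy vertex: counts words per epoch, emits a result once the epoch is complete."""

    def __init__(self):
        self.counts = defaultdict(Counter)
        self.outputs = {}

    def on_recv(self, epoch, word):
        # Messages may arrive in any order; state is kept per timestamp.
        self.counts[epoch][word] += 1

    def on_notify(self, epoch):
        # Called only when no further messages for `epoch` can arrive.
        self.outputs[epoch] = dict(self.counts.pop(epoch, Counter()))


class ToyScheduler:
    """Hypothetical stand-in for a runtime's progress tracking."""

    def __init__(self, vertex):
        self.vertex = vertex
        self.pending = defaultdict(list)  # epoch -> undelivered messages
        self.closed = set()               # epochs the input has finished

    def send(self, epoch, word):
        if epoch in self.closed:
            raise ValueError("epoch already closed")
        self.pending[epoch].append(word)

    def close(self, epoch):
        self.closed.add(epoch)

    def run(self):
        # Deliver everything queued, then notify each newly completed epoch.
        for epoch in list(self.pending):
            for word in self.pending.pop(epoch):
                self.vertex.on_recv(epoch, word)
        for epoch in sorted(self.closed):
            if epoch not in self.vertex.outputs:
                self.vertex.on_notify(epoch)


vertex = CountPerEpoch()
sched = ToyScheduler(vertex)
sched.send(1, "b")   # a later epoch may start before an earlier one ends
sched.send(0, "a")
sched.send(0, "a")
sched.close(0)
sched.run()
first_round = dict(vertex.outputs)   # only epoch 0 has been emitted

sched.send(1, "c")
sched.close(1)
sched.run()
```

Because epoch 0 produces output while epoch 1 is still accepting input, the same structure can serve streaming-style and batch-style work, which is the kind of unification the quote describes.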
Combining these separate computational needs into a single model offers an additional benefit.
“We’re hoping,” McSherry says, “that the unification will simplify a lot of the research into these sorts of systems, because new platforms won’t need to implement everything from scratch, as they do now, and can instead build on and contribute to the Naiad ecosystem.”
To enable that capability, Naiad has been released to the open-source community and is available on GitHub. The download source contains the entire Naiad system for running on a single computer or a cluster of computers, and it is compatible with Windows, Linux, and Mac OS. It includes example applications, as well as frameworks and libraries for building new applications.
Rossbach says that Dandelion also will conserve programming resources.
“We show,” he says, “that C#/.NET programs can derive significant performance benefit from GPUs in the contexts we describe—with almost no additional effort from the programmer.”
Other papers with Microsoft Research co-authors that have been accepted for SOSP 2013:
IOFlow: A Software-Defined Storage Architecture—Eno Thereska, Microsoft Research; Hitesh Ballani, Microsoft Research; Greg O’Shea, Microsoft Research; Thomas Karagiannis, Microsoft Research; Antony Rowstron, Microsoft Research; Tom Talpey, Microsoft; Richard Black, Microsoft Research; and Timothy Zhu, Carnegie Mellon University.
Timecard: Controlling User-Perceived Delays in Server-Based Mobile Applications—Lenin Ravindranath, Massachusetts Institute of Technology; Jitendra Padhye, Microsoft Research; Ratul Mahajan, Microsoft Research; and Hari Balakrishnan, Massachusetts Institute of Technology.