Cluster-based data-parallel frameworks such as MapReduce, Hadoop, and Dryad are increasingly popular for a large class of compute-intensive tasks. Although such systems are designed for large-scale clusters, they also offer a convenient and accessible route to data-parallel programming for small-scale clusters. This potentially allows applications traditionally targeted at supercomputers or remote server farms, such as sophisticated video processing, to be deployed in a small-scale ad-hoc fashion by aggregating the servers and workstations in the home or office network.
The default scheduling algorithms of these frameworks perform well at scale, but are considerably less effective in a small cluster (3 to 10 machines) whose nodes have widely differing performance characteristics. To make effective use of an ad-hoc cluster, we require a “planner” rather than a scheduler: one that takes account of both the predicted resource consumption of each vertex in the dataflow graph and the heterogeneity of the available hardware.
In this talk I will describe our enhancements to DryadLINQ and Dryad for ad-hoc clusters. We have integrated a constraint-based planner that maps the dataflow graph generated by the DryadLINQ compiler onto the cluster. The planner makes use of DryadLINQ operator performance models that are constructed from low-level traces of vertex executions. The performance models abstract the behaviour of each vertex in sufficient detail to predict the bottleneck resource, which can change during vertex execution, on different hardware and with different sizes of input. Experimental evaluation shows reasonable predictive accuracy and good performance gains for parallel jobs on ad-hoc clusters.
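
As a rough illustration of the planning idea only, and not the constraint-based planner or the trace-derived performance models described in the talk, the following Python sketch predicts a bottleneck resource for each vertex on each machine and then searches for the placement with the shortest finishing time. Every name and number in it is a hypothetical assumption: the vertex names, the resource demands, the two machines, and the simple model that a vertex's run time is its slowest single resource. The sketch also treats vertices as independent, ignoring the edges of the dataflow graph and any change of bottleneck during execution.

```python
from itertools import product

# Hypothetical per-vertex demands (illustrative, not measured):
# CPU work in giga-cycles and I/O volume in MB.
VERTICES = {
    "decode": {"cpu": 40.0, "io": 800.0},
    "filter": {"cpu": 90.0, "io": 200.0},
    "encode": {"cpu": 60.0, "io": 400.0},
}

# Hypothetical machines: CPU speed in giga-cycles/s, disk bandwidth in MB/s.
MACHINES = {
    "workstation": {"cpu": 3.0, "io": 50.0},
    "server": {"cpu": 1.0, "io": 200.0},
}


def predict(vertex, machine):
    """Return (seconds, bottleneck resource) for one vertex on one machine.

    Assumed model: run time is set by whichever resource takes longest.
    """
    times = {res: vertex[res] / machine[res] for res in vertex}
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck


def plan(vertices, machines):
    """Try every placement and keep the one whose busiest machine finishes first.

    Exhaustive search is only feasible for tiny inputs; it stands in for a
    real constraint solver here.
    """
    names = list(vertices)
    best, best_makespan = None, float("inf")
    for choice in product(machines, repeat=len(names)):
        load = dict.fromkeys(machines, 0.0)
        for name, machine in zip(names, choice):
            load[machine] += predict(vertices[name], machines[machine])[0]
        makespan = max(load.values())
        if makespan < best_makespan:
            best, best_makespan = dict(zip(names, choice)), makespan
    return best, best_makespan


if __name__ == "__main__":
    for name, vertex in VERTICES.items():
        for machine_name, machine in MACHINES.items():
            print(name, machine_name, predict(vertex, machine))
    print(plan(VERTICES, MACHINES))
```

With these made-up figures the "decode" vertex is I/O-bound on the workstation but CPU-bound on the server, which is the kind of hardware-dependent bottleneck a placement decision has to account for.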