Abstract

The performance of data-parallel computing (e.g., MapReduce, DryadLINQ)
depends heavily on how the data is partitioned. The solutions implemented
by current state-of-the-art systems are far from optimal. Techniques
proposed by the database community to find optimal data partitions are
not directly applicable when complex user-defined functions and data
models are involved. We outline our solution, which draws on expertise
from fields such as programming languages and optimization, and present
our preliminary results.
