Managing data and computation is at the heart of data center computing. Manual management of data can lead to data loss, wasteful consumption of storage, and laborious bookkeeping. Lack of proper management of computation can result in lost opportunities to share common computations across multiple jobs or to compute results incrementally.
Nectar is a system designed to address these problems. Nectar uses a novel approach that automates and unifies the management of data and computation in a data center. With Nectar, the results of a computation, called derived datasets, are uniquely identified by the programs that compute them, and together with those programs are automatically managed by a data-center-wide caching service. All computations and uses of derived datasets are controlled by the system. The system automatically regenerates a derived dataset from its program if the dataset is found to be missing. Nectar greatly improves data center management and resource utilization: obsolete or infrequently used derived datasets are automatically garbage collected, and shared common computations are computed only once and reused by other jobs.
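The caching behavior described above can be illustrated with a minimal Python sketch. It is not Nectar's implementation: the names `fingerprint`, `DerivedDataCache`, `register`, `get`, and `collect_garbage` are hypothetical, the cache key is assumed to be a SHA-256 digest of the program text alone, and the garbage collection policy is assumed to be a simple idle-time threshold. The sketch only shows the three behaviors named in the text: a shared computation runs once, a missing derived dataset is regenerated from its program, and infrequently used datasets are reclaimed.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


def fingerprint(program_source: str) -> str:
    """Hypothetical cache key: SHA-256 digest of the program text (an assumption)."""
    return hashlib.sha256(program_source.encode("utf-8")).hexdigest()


class DerivedDataCache:
    """Toy in-memory stand-in for a data-center-wide caching service."""

    def __init__(self) -> None:
        # Programs are kept so that a missing dataset can be regenerated.
        self._programs: Dict[str, Callable[[], object]] = {}
        # key -> (derived dataset, time of last access)
        self._results: Dict[str, Tuple[object, float]] = {}
        self.computations = 0  # how many times any program actually ran

    def register(self, program_source: str, run: Callable[[], object]) -> str:
        """Record a program and return the key identifying its derived dataset."""
        key = fingerprint(program_source)
        self._programs[key] = run
        return key

    def get(self, key: str, now: Optional[float] = None) -> object:
        """Return the derived dataset, running its program only if it is missing."""
        now = time.time() if now is None else now
        if key in self._results:
            dataset = self._results[key][0]
        else:
            self.computations += 1
            dataset = self._programs[key]()
        self._results[key] = (dataset, now)
        return dataset

    def collect_garbage(self, max_idle_seconds: float, now: Optional[float] = None) -> int:
        """Drop datasets idle longer than the threshold (assumed policy); keep programs."""
        now = time.time() if now is None else now
        stale = [k for k, (_, last) in self._results.items()
                 if now - last > max_idle_seconds]
        for k in stale:
            del self._results[k]
        return len(stale)


cache = DerivedDataCache()
source = "sum(range(10))"
key_a = cache.register(source, lambda: sum(range(10)))  # first job
key_b = cache.register(source, lambda: sum(range(10)))  # second job, same program
cache.get(key_a, now=0.0)    # computed on first use
cache.get(key_b, now=1.0)    # same key, so the cached dataset is reused
cache.collect_garbage(max_idle_seconds=60.0, now=100.0)  # dataset is reclaimed
cache.get(key_a, now=101.0)  # regenerated from the retained program
```

Keeping the program after its dataset is collected is what makes reclamation safe in this sketch: the dataset can always be recomputed on demand. The sketch assumes the program text alone determines the result.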
This paper describes the design and implementation of Nectar and reports our evaluation of the system, based on both an analysis of logs from a number of production clusters and a deployment on a 240-node cluster.