Managing data and computation is at the heart of datacenter
computing. Manual management of data can lead to data loss,
wasteful consumption of storage, and laborious bookkeeping. Lack of
proper management of computation can result in lost opportunities to
share common computations across multiple jobs or to compute
Nectar is a system designed to address the aforementioned
problems. It automates and unifies the management of data and
computation within a datacenter. In Nectar, data and computation are treated interchangeably
by associating data with its computation.
Derived datasets, which are the results of computations, are uniquely identified by the
programs that produce them, and together with their programs, are
automatically managed by a datacenter wide caching service.
Any derived dataset can be transparently regenerated
by re-executing its program, and any computation can be transparently avoided
by using previously cached results. This enables us
to greatly improve datacenter management and resource
utilization: obsolete or infrequently used derived datasets
are automatically garbage collected, and shared common
computations are computed only once and reused by others.
This paper describes the design and implementation of Nectar, and reports on
our evaluation of the system using analytic studies of
logs from several production clusters and an actual deployment on a 240-node cluster.