Provenance Algebra and Materialized View-based Provenance Management

  • Satya S. Sahoo
  • Roger S. Barga
  • Jonathan Goldstein
  • Amit P. Sheth

MSR-TR-2008-170

Provenance, from the French word ‘provenir’ meaning “to come from”, describes the lineage of an entity. In eScience, provenance is critical for accurately interpreting scientific results. Though information provenance has been recognized as a hard problem in computing science (British Computer Society, 2004), many fundamental research issues in provenance have yet to be addressed. First, a common provenance model with well-defined formal semantics, which would facilitate interoperability of provenance metadata from different sources, has not been defined. Second, there has been no systematic study of provenance query characteristics across multiple applications. A classification or taxonomy of provenance queries would not only help to better understand provenance metadata, but would also enable the definition of provenance query operators. Finally, while provenance for a user or an application is a specific view over all available provenance metadata, a provenance management system that supports provenance storage as views has not been implemented. In this paper we propose a novel provenance algebra consisting of a common provenance model called provenir, defined in the description-logic-based W3C Web Ontology Language (OWL-DL), along with a set of provenance query operators derived from a classification of provenance queries. We also introduce a practical provenance storage solution using materialized views over a generic relational database system. Our approach takes advantage of the provenance query operators and well-defined indices to efficiently process complex provenance queries over very large datasets. To support our claims, we present an evaluation of both the performance and the scalability of our initial implementation. To the best of our knowledge, this is the first provenance management system that supports the complete process, from a formal provenance model and query operators to storage and efficient queries over provenance data.