Indexing HDFS data in PDW: splitting the data from the index

Vinitha Reddy Gankidi; Nikhil Teletia; Jignesh M. Patel; Alan Halverson; David J. DeWitt

Very Large Data Bases | July 2014

Published by VLDB Endowment

There is a growing interest in making relational DBMSs work synergistically with MapReduce systems. However, there are interesting technical challenges associated with figuring out the right balance between the use and co-deployment of these systems. This paper focuses on one specific aspect of this balance, namely how to leverage the superior indexing and query processing power of a relational DBMS for data that is often more cost-effectively stored in Hadoop/HDFS. We present a method to use conventional B+-tree indices in an RDBMS for data stored in HDFS and demonstrate that our approach is especially effective for highly selective queries.
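The idea of keeping an index in one system while the rows stay in another can be illustrated with a small, self-contained sketch. This is not the paper's implementation. It assumes a local text file standing in for an HDFS file, a sorted in-memory list of (key, byte offset) pairs standing in for a B+-tree held by the RDBMS, and hypothetical helper names (`build_index`, `lookup`, `demo`). It shows only why a highly selective lookup can avoid scanning the whole data file.

```python
import bisect
import os
import tempfile


def build_index(path, key_of):
    """Scan the data file once and return a sorted list of (key, byte_offset).

    The sorted list is a stand-in for a B+-tree kept outside the data store.
    """
    entries = []
    with open(path, "rb") as f:
        while True:
            offset = f.tell()          # byte position where this record starts
            line = f.readline()
            if not line:
                break
            entries.append((key_of(line.decode("utf-8")), offset))
    entries.sort()
    return entries


def lookup(path, index, key):
    """Return records whose key equals `key`, seeking only to matching offsets."""
    # (key, -1) sorts before every real entry for `key`, since offsets are >= 0.
    start = bisect.bisect_left(index, (key, -1))
    rows = []
    with open(path, "rb") as f:
        for k, offset in index[start:]:
            if k != key:
                break
            f.seek(offset)
            rows.append(f.readline().decode("utf-8").rstrip("\n"))
    return rows


def demo():
    """Build a tiny data file, index it by its first column, and look up key 7."""
    records = ["7,grace", "3,alan", "7,edgar", "1,ada"]
    with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
        tmp.write(("\n".join(records) + "\n").encode("utf-8"))
        path = tmp.name
    try:
        index = build_index(path, key_of=lambda line: int(line.split(",")[0]))
        return lookup(path, index, 7)
    finally:
        os.remove(path)
```

In this sketch the lookup touches only the two matching records rather than every line of the file, which is the intuition behind the claim that such an index helps most when few rows qualify. A real deployment would also have to keep the index consistent as the underlying files change, which this sketch does not address.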