Pig: Dataflow Programming for Map-Reduce Clusters

July 17, 2008
Christopher Olston | Yahoo! Research

There is a growing need for ad-hoc analysis of extremely large data sets, especially at Internet companies, which routinely process petabytes of data. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative SQL style unnatural. The success of the more procedural map-reduce programming model, and of its associated scalable implementations on commodity hardware, is evidence of both points. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse.

In this talk I will describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. Pig is used extensively at Yahoo! and is available to the public as an open-source Apache Incubator project.

Speaker Details

Christopher Olston is a senior research scientist at Yahoo! Research, working in the areas of data management and web search. Olston is occasionally seen behaving as a professor, and has taught undergrad and grad courses at Berkeley, Carnegie Mellon and Stanford. He received his Ph.D. in 2003 from Stanford under fellowships from the university and the National Science Foundation. He earned his bachelor’s degree from Berkeley with highest honors. Olston is an avid Cal fan but likes to rollerblade at Stanford.