Erasure Codes for Big Data over Hadoop and Large-scale Sparse PCA for Twitter Analysis


April 30, 2012


Alex Dimakis and Dimitris S. Papailiopoulos


University of Southern California


1). Erasure Codes for Big Data over Hadoop

As big data grows faster than infrastructure, triple replication becomes prohibitively expensive for distributed storage systems. For this reason, most large systems use erasure codes for archival files to provide high reliability with small storage overheads. The common choice is Reed-Solomon codes, a classical error-correcting construction based on evaluating polynomials over finite fields.
Unfortunately, classical error-correcting codes are not sufficient for distributed systems: repairing even a single failed node requires reading and transferring a large amount of data, incurring substantial disk I/O and network traffic.
We will present a new family of erasure codes that minimize repair communication and disk I/O during single-node failures. We will describe an implementation over HDFS and discuss several tradeoffs and open research directions.
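To make the repair cost concrete, the following is a minimal, purely illustrative sketch of Reed-Solomon-style encoding and decoding via polynomial evaluation. The parameters and the prime field GF(257) are our own simplifications (production systems such as HDFS-RAID operate over GF(2^8)); the point is that recovering even one lost symbol requires downloading k whole symbols, the cost the talk's new codes target.

```python
P = 257  # small prime modulus; toy choice for illustration only

def encode(data, n):
    """Treat the k data symbols as polynomial coefficients and
    evaluate at n distinct points; any k of the n outputs suffice
    to reconstruct the data (the MDS property)."""
    return [sum(c * pow(x, i, P) for i, c in enumerate(data)) % P
            for x in range(1, n + 1)]

def decode(points, k):
    """Lagrange interpolation from any k (x, y) pairs recovers all k
    coefficients. Note that repairing even ONE lost symbol this way
    still requires reading k whole symbols from surviving nodes."""
    coeffs = [0] * k
    for j, (xj, yj) in enumerate(points[:k]):
        basis = [1]  # basis polynomial l_j(x), built incrementally
        denom = 1
        for m, (xm, _) in enumerate(points[:k]):
            if m == j:
                continue
            denom = denom * (xj - xm) % P
            # multiply basis by (x - xm)
            new = [0] * (len(basis) + 1)
            for d, b in enumerate(basis):
                new[d + 1] = (new[d + 1] + b) % P
                new[d] = (new[d] - xm * b) % P
            basis = new
        scale = yj * pow(denom, P - 2, P) % P  # inverse via Fermat
        for d, b in enumerate(basis):
            coeffs[d] = (coeffs[d] + scale * b) % P
    return coeffs
```

For example, encoding three data symbols into six coded symbols tolerates any three erasures, but decoding from the survivors always touches three full symbols.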

2). Large-scale Sparse PCA for Twitter Analysis

Large-scale data analytics is becoming a central ingredient of big-data applications, such as event detection and sentiment analysis in social networks that host millions of users. Sparse PCA, a recent tool in dimensionality reduction, generates sparse collections of data features that identify major trends in a data set. The sparsity of this tool is fundamental to the high interpretability of the results: in event detection applications, for example, a small collection of words that “describe” major events is easier to interpret than a bag of thousands of words. Unfortunately, sparse PCA is NP-hard, and most of its polynomial-time relaxations involve solving subproblems that become intractable in high dimensions.
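To see where the intractability comes from, here is a small, purely illustrative sketch of exact sparse PCA by exhaustive search over supports (maximize x^T A x over unit-norm x with at most s nonzeros). The helper names and toy matrix are our own; the number of candidate supports grows combinatorially in the dimension, which is exactly what rules this approach out at Twitter scale.

```python
from itertools import combinations

def top_eigenvalue(sub):
    """Largest eigenvalue of a small symmetric PSD matrix,
    estimated by plain power iteration."""
    n = len(sub)
    v = [1.0] * n
    for _ in range(500):
        w = [sum(sub[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged vector
    return sum(v[i] * sum(sub[i][j] * v[j] for j in range(n))
               for i in range(n))

def sparse_pca(A, s):
    """Exact s-sparse PCA by brute force: for every size-s support,
    take the top eigenvalue of the principal submatrix. The number
    of supports, C(n, s), explodes with the dimension n."""
    best, best_support = -1.0, None
    for support in combinations(range(len(A)), s):
        sub = [[A[i][j] for j in support] for i in support]
        val = top_eigenvalue(sub)
        if val > best:
            best, best_support = val, support
    return best, best_support
```

On a 4-feature covariance matrix this enumerates only 6 supports for s = 2, but with hundreds of thousands of features the search space is astronomically large, motivating the feature elimination described next.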

We will present a new, parallelizable spectral algorithm for sparse PCA that inspires a novel feature elimination technique, which on real Twitter data sets reduces the problem size from hundreds of thousands of features to hundreds. Our algorithm comes with specific optimality and approximation guarantees and is based on a new way to visualize the span of matrices using auxiliary spherical variables. We will also discuss future directions for implementing our algorithm in the Hadoop MapReduce framework.
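As a rough illustration of the screening idea only, a naive variance-based elimination pass might look like the sketch below. This heuristic (keep the features with the largest diagonal covariance entries) is our own stand-in, not the talk's optimality-preserving elimination rule, but it shows the shape of the pipeline: shrink the feature set first, then run the expensive sparse PCA step on the survivors.

```python
def screen(A, keep):
    """Toy feature elimination: retain the `keep` features with the
    largest variance (diagonal entries of the covariance matrix A)
    and return their indices plus the reduced principal submatrix.
    A heuristic stand-in for an optimality-preserving rule."""
    order = sorted(range(len(A)), key=lambda i: A[i][i], reverse=True)
    kept = sorted(order[:keep])
    reduced = [[A[i][j] for j in kept] for i in kept]
    return kept, reduced
```

Shrinking from hundreds of thousands of features to hundreds, as the talk reports for Twitter data, is what makes the downstream combinatorial or spectral computation tractable.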


Alex Dimakis and Dimitris S. Papailiopoulos

Alex Dimakis has been an Assistant Professor at the Viterbi School of Engineering, University of Southern California, since 2009. He currently holds the Colleen and Roberto Padovani Early Career Chair in Electrical Engineering. He received his Ph.D. in 2008 and M.S. in 2005, both in electrical engineering and computer sciences from UC Berkeley, and the Diploma in electrical and computer engineering from the National Technical University of Athens in 2003. During 2009 he was a postdoctoral scholar at the CMI center at Caltech. He received the NSF CAREER Award in 2011, the Eli Jury dissertation award in 2008, and the IEEE Data Storage Committee best paper award for 2010. His interests include communications, coding theory, and networking, with a current focus on distributed storage, network coding, and message-passing algorithms.

Dimitris S. Papailiopoulos received the Diploma and M.Sc. degrees in electronic and computer engineering from the Technical University of Crete, Greece, in 2007 and 2009, respectively. Since 2009, he has been a graduate student in the Department of Electrical Engineering, University of Southern California, where he is currently working toward the Ph.D. degree. He received the USC Annenberg Fellowship in 2009 and the Myronis Fellowship in 2011. His research interests are in the areas of coding theory, communications, and signal processing, with an emphasis on storage codes for big data, interference alignment methods, and large-scale sparse principal component analysis.