Machine learning from modern datasets presents both novel opportunities and new challenges. Larger and more diverse datasets enable us to answer more complex statistical questions, but they also pose computational challenges in designing algorithms that scale. In this talk I will present two results: the first concerns these computational challenges, and the second an opportunity that modern datasets enable in the context of representation learning.
I will start by presenting a distributed machine learning system we developed to address the problem of computational scalability. Our system achieves state-of-the-art results on many common classification and regression tasks. I will discuss both the communication and computation components of the system, along with an experimental evaluation on industry-scale data as well as large datasets from the academic literature.
In the second part of my talk, I will present my recent work on dictionary learning, also known as sparse coding. The goal is to efficiently learn a basis such that each data point is a combination of only a small number of basis elements; applications arise in signal processing as well as machine learning. I will present an efficient algorithm that is guaranteed to recover the true dictionary, given sufficiently many data samples. This is the first such recovery result for overcomplete dictionaries, and it comes with an easy-to-implement algorithm.
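To make the dictionary-learning setup concrete, here is a minimal NumPy sketch of the generic alternating-minimization approach (sparse coding via greedy orthogonal matching pursuit, then a least-squares dictionary update). This is an illustration of the problem formulation on synthetic data, not the algorithm from the talk; all dimensions, sparsity levels, and iteration counts are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sizes (assumptions): n_atoms > n_features makes the dictionary overcomplete.
n_features, n_atoms, n_samples, sparsity = 16, 32, 500, 3

# Ground-truth dictionary with unit-norm columns (atoms).
D_true = rng.standard_normal((n_features, n_atoms))
D_true /= np.linalg.norm(D_true, axis=0)

# Each data point is a combination of only `sparsity` atoms.
codes = np.zeros((n_atoms, n_samples))
for j in range(n_samples):
    support = rng.choice(n_atoms, sparsity, replace=False)
    codes[support, j] = rng.standard_normal(sparsity)
X = D_true @ codes

def sparse_code(D, x, k):
    """Greedy orthogonal matching pursuit: select k atoms for one sample."""
    residual, support = x.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    z = np.zeros(D.shape[1])
    z[support] = coef
    return z

# Alternating minimization from a random initialization:
# (1) sparse-code every sample, (2) refit the dictionary by least squares.
D = rng.standard_normal((n_features, n_atoms))
D /= np.linalg.norm(D, axis=0)
for _ in range(10):
    Z = np.column_stack([sparse_code(D, X[:, j], sparsity) for j in range(n_samples)])
    Dt, *_ = np.linalg.lstsq(Z.T, X.T, rcond=None)  # minimizes ||X - D Z||_F over D
    D = Dt.T
    D /= np.linalg.norm(D, axis=0) + 1e-12          # renormalize atoms

err = np.linalg.norm(X - D @ Z) / np.linalg.norm(X)
print(f"relative reconstruction error: {err:.4f}")
```

Heuristics of this kind can get stuck in local minima, which is exactly why provable recovery guarantees for overcomplete dictionaries, as in the result above, are of interest.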