MSR NYC Data Science Seminar Series


This seminar series brings together data science researchers from Columbia University, NYU, Cornell Tech, and Microsoft Research. Our goal is to increase interactions within the broader New York data science community and to provide a new forum for discussions on data science research.


The events in this series start with a formal talk session (45 minutes). Invited speakers present short talks, providing their views on the opportunities and challenges in data science research. The second part of the event (2 hours) is a wine and cheese social designed to enable researchers to exchange ideas in a relaxed setting.



Ceren Budak, Post Doc Researcher

Duncan Watts, Principal Researcher

Jennifer Chayes, Distinguished Scientist / Managing Director, Microsoft Research New England & New York City


Michael Kearns, University of Pennsylvania

Michael Macy, Cornell University

Claudia Perlich, Dstillery; NYU Stern

David Blei, Columbia University

Deborah Estrin, Cornell Tech

Yann LeCun, NYU and Facebook

Mor Naaman, Cornell Tech

Tony Jebara, Columbia University


Previous Events

June 25, 2015

Speaker: Michael Kearns, University of Pennsylvania

Title: From “In” to “Over”: Behavioral Experiments on Whole-Network Computation

Abstract:  We report on a series of behavioral experiments in human computation on three different tasks over networks: graph coloring, community detection (or graph clustering), and competitive contagion. While these tasks share similar action spaces and interfaces, they capture a diversity of computational challenges: graph coloring is a search problem, clustering is an optimization problem, and competitive contagion is a game-theoretic problem. In contrast with much of the recent literature on human-subject experiments in networks, in which collectives of subjects are embedded “in” the network, and have only local information and interactions, here individual subjects have a global (or “over”) view and must solve “whole network” problems alone. Our primary findings are that subject performance is impressive across all three problem types; that subjects find diverse and novel strategies for solving each task; and that collective performance can often be strongly correlated with known algorithms.

Joint work with Lili Dworkin.

Bio: Michael Kearns is a professor in the Computer and Information Science department at the University of Pennsylvania, where he holds the National Center Chair and has joint appointments in the Wharton School. He is founding director of Penn’s Networked and Social Systems Engineering (NETS) program, and founding director of Penn’s Warren Center for Network and Data Sciences. His research interests include topics in machine learning, algorithmic game theory, social networks, and computational finance. He has worked and consulted extensively in the technology and finance industries. He is a fellow of the American Academy of Arts and Sciences, the Association for Computing Machinery, and the Association for the Advancement of Artificial Intelligence.

April 16, 2015

Speaker: Michael Macy, Cornell University

Title: On A Scale From 1 to 5, How Much Confidence Do You Have in Survey Results?

Abstract: From astronomy to neuroscience to particle physics, scientific knowledge depends heavily on the available tools for observation. Since the introduction of stratified sampling in 1934, the survey has been the single most important observational tool for social science. During this time, impressive advances have taken place in our ability to reduce sampling error (e.g. Respondent Driven Sampling), measurement error (e.g. SEM), specification error (e.g. linear mixed models), and inferential error (e.g. Propensity Score Matching). Nevertheless, increasing confidence in survey technology has paradoxically reinforced a debilitating theoretical blinder that has compromised the ability of social science to elicit confidence in predictions. What is worse, this blinder has largely escaped notice through a combination of ideological bias and reluctance to pull back the covers on problems for which we have no solution. The good news is that a solution is finally on the horizon, with the potential for social science to begin to close the gap with the physical and life sciences in predictive ability. The bad news is … (to be continued).

Bio: Michael Macy earned his B.A. and Ph.D. from Harvard, along with an M.A. from Stanford. He is currently Goldwin Smith Professor of Arts and Sciences and Director of the Social Dynamics Laboratory at Cornell, with a dual appointment in the Departments of Sociology and Information Science. With support from the National Science Foundation, the Department of Defense, and Google, his research team has used computational models, online laboratory experiments, and digital traces of device-mediated interaction to explore familiar but enigmatic social patterns, such as circadian rhythms, the emergence and collapse of fads, the spread of self-destructive behaviors, cooperation in social dilemmas, the critical mass in collective action, the spread of high-threshold contagions on small-world networks, the polarization of opinion, segregation of neighborhoods, and assimilation of minority cultures. Recent research uses 509 million Twitter messages to track diurnal and seasonal mood changes in 54 countries, telephone logs for 12B calls in the UK to measure the economic correlates of network structure, and hundreds of millions of Yahoo! email logs in 90 countries to test Huntington’s theory of the “clash of civilizations.” His research has been published in leading journals, including Science, PNAS, American Journal of Sociology, American Sociological Review, and Annual Review of Sociology.

February 5, 2015

You can view the recording of the talk here

Speaker: Claudia Perlich, Dstillery & NYU Stern

Title: What makes us human? Machine learning challenges in digital advertising

Abstract: Digital advertising is one of the largest and most open playgrounds for machine learning, data mining, and related analytic approaches. This talk will touch on a number of challenges that arise in this environment: 1) high-volume data streams of around 30 billion daily consumer touch points, 2) low-latency requirements on scoring and automated bidding decisions within 100 ms, and 3) adversarial modeling in light of advertising fraud and bots. Specifically, we will discuss an automated learning system implemented at Dstillery that uses privacy-friendly data representation to build sparse targeting models for thousands of products in millions of dimensions. The solution incorporates ideas from transfer learning, Bayesian priors, stochastic gradient descent, hashing, and learning rate estimation. On the sidelines, but of no less importance, are topics on bid optimization, data reliability, cross-device identification, and observational methods for causal inference. Finally, I will touch on a few higher-level lessons around incentive misalignments and measurement issues in the advertising industry and pose the paradox of big data and predictive modeling: You never have the data you need.

Bio: Claudia Perlich currently acts as Chief Scientist at Dstillery and designs, develops, analyzes, and optimizes the machine learning that drives digital advertising. An active industry speaker and frequent contributor to academic and industry publications, Claudia was recently named winner of the Advertising Research Foundation’s (ARF) Grand Innovation Award, and was selected as a member of the Crain’s NY annual 40 Under 40 list, WIRED’s Smart List, and FastCompany’s 100 Most Creative People. She has published over 50 scientific articles and holds multiple patents in machine learning. Claudia has a PhD in Information Systems from NYU and worked in the Predictive Modeling Group at IBM’s Watson Research Center, concentrating on data analytics and machine learning for real-world applications. She also teaches in the NYU Stern MBA program.

December 4, 2014

You can view the recording of the talk here

Speaker: David Blei, Columbia University

Title: Topic Models and User Behavior

Abstract: Probabilistic topic models provide a suite of tools for analyzing large document collections. Topic modeling algorithms discover the latent themes that underlie the documents and identify how each document exhibits those themes. Topic modeling can be used to help explore, summarize, and form predictions about documents. Topic modeling ideas have been adapted to many domains, including images, music, networks, genomics, and neuroscience.

Traditional topic modeling algorithms analyze a document collection and estimate its latent thematic structure. However, many collections contain an additional type of data: how people use the documents. For example, readers click on articles in a newspaper website, scientists place articles in their personal libraries, and lawmakers vote on a collection of bills. Behavior data is essential both for making predictions about users (such as for a recommendation system) and for understanding how a collection and its users are organized.

In this talk, I will review the basics of topic modeling and describe our recent research on collaborative topic models, models that simultaneously analyze a collection of texts and its corresponding user behavior. We studied collaborative topic models on 80,000 scientists’ libraries from Mendeley and 100,000 users’ click data from the arXiv. Collaborative topic models enable interpretable recommendation systems, capturing scientists’ preferences and pointing them to articles of interest. Further, these models can organize the articles according to the discovered patterns of readership. For example, we can identify articles that are important within a field and articles that transcend disciplinary boundaries.

More broadly, topic modeling is a case study in the large field of applied probabilistic modeling. Finally, I will survey some recent advances in this field. I will show how modern probabilistic modeling gives data scientists a rich language for expressing statistical assumptions and scalable algorithms for uncovering hidden patterns in massive data.

Bio: David Blei is a Professor of Statistics and Computer Science at Columbia University. His research is in statistical machine learning, involving probabilistic topic models, Bayesian nonparametric methods, and approximate posterior inference. He works on a variety of applications, including text, images, music, social networks, user behavior, and scientific data. David earned his Bachelor’s degree in Computer Science and Mathematics from Brown University (1997) and his PhD in Computer Science from the University of California, Berkeley (2004). Before arriving at Columbia, he was an Associate Professor of Computer Science at Princeton University. He has received several awards for his research, including a Sloan Fellowship (2010), Office of Naval Research Young Investigator Award (2011), Presidential Early Career Award for Scientists and Engineers (2011), Blavatnik Faculty Award (2013), and ACM-Infosys Foundation Award (2013).

September 23, 2014

You can view the recording of the talk here

Speaker: Deborah Estrin, Cornell Tech

Title: Small, n=me, data

Abstract: Consider a new kind of cloud-based app that would create a picture of an individual’s life over time by continuously, securely, and privately analyzing the digital traces they generate 24×7. The social networks, search engines, mobile operators, online games, and e-commerce sites that they access every hour of most every day extensively use these digital traces to tailor service offerings, to improve system performance, and in some cases to target advertisements. Our premise is that these diverse and messy, but highly personalized, data can be analyzed to draw powerful inferences about an individual, and for that individual. Use of applications that are fueled by these traces could enhance, and even transform, our experiences as consumers, patients, passengers, customers, family members, as well as users of online media. This talk will discuss precedents for small data in mobile health, and the opportunities and challenges of broadening the scope of small data capture, storage, and use.

Bio: Deborah Estrin (PhD, MIT (1985); BS, UCB (1980)) is a Professor of Computer Science at Cornell Tech in New York City and a Professor of Health Policy and Research at Weill Cornell Medical College. She is a co-founder of Open mHealth. Her current focus is on mobile health and small data, leveraging the pervasiveness of mobile devices and digital interactions for health and life management (TEDMED). Estrin was the founding director of the NSF-funded Science and Technology Center for Embedded Networked Sensing (CENS) at UCLA (2002-12). Awards include: ACM Athena Lecturer (2006) and Anita Borg Institute’s Women of Vision Award for Innovation (2007). She is an elected member of the American Academy of Arts and Sciences (2007) and National Academy of Engineering (2009).

April 24, 2014

Opening Event

You can view the talks and the panel discussion here.

Opening Remarks

Speaker: Duncan Watts

Bio: Prior to joining Microsoft, Duncan Watts was a Senior Principal Research Scientist at Yahoo! Research, where he directed the Human Social Dynamics group. Prior to joining Yahoo!, he was a full professor of Sociology at Columbia University, where he taught from 2000 to 2007. His research on social networks and collective dynamics has appeared in a wide range of journals, from Nature, Science, and Physical Review Letters to the American Journal of Sociology and Harvard Business Review. He is also the author of three books, most recently Everything is Obvious (Once You Know The Answer) (Crown Business, 2011). He holds a B.Sc. in Physics from the Australian Defence Force Academy, and a Ph.D. in Theoretical and Applied Mechanics from Cornell University.

Technical Talks

Yann LeCun, NYU and Facebook

Title & Abstract: Yann presents a demo on deep learning and vision. 

Bio: Yann is Director of AI Research at Facebook, and Silver Professor of Data Science, Computer Science, Neural Science, and Electrical Engineering at New York University, affiliated with the NYU Center for Data Science, the Courant Institute of Mathematical Sciences, the Center for Neural Science, and the Electrical and Computer Engineering Department. He received the Electrical Engineer Diploma from Ecole Superieure d’Ingenieurs en Electrotechnique et Electronique (ESIEE), Paris in 1983, and a PhD in Computer Science from Universite Pierre et Marie Curie (Paris) in 1987. He is the lead faculty at NYU for the Moore-Sloan Data Science Environment, a $36M initiative in collaboration with UC Berkeley and University of Washington to develop data-driven methods in the sciences. He is the recipient of the 2014 IEEE Neural Network Pioneer Award.

Mor Naaman, Cornell Tech

Title: Data and People in Connective Media

Abstract: In five minutes or less, I will talk about how we use methods from social science, people-centered design, data science and machine learning to understand social media data large and small, and build new applications that help us make sense of the city from (public) social media data. I’ll also say a word about Cornell Tech and our Connective Media hub. OK, six minutes may be needed to squeeze it all in.

Bio: Naaman is an associate professor at Cornell Tech’s Jacobs Institute. He is also a co-founder and Chief Scientist at, a startup founded to make sense of the real-time web and social media. Mor’s research applies multidisciplinary methods to gain new insights about people and society from social media data, and to develop novel tools to make this data more accessible and usable in various settings. He gets awards, too, including the NSF Early Faculty CAREER Award, research awards from Google, Yahoo!, and Nokia, and three best paper awards.

Tony Jebara, Columbia University

Title: Learning From Network Connectivity and Mobile Phone Data

Abstract: Many real-world networks are described by both connectivity information and features for every node. While most network growth models are based on link analysis, we explore how an individual’s data profile without any connectivity information can be used to infer their connectivity with other users. For example, in a class of incoming freshmen students with no known friendship connections, can we predict which pairs will become friends at the end of the year using only their profile information? Similarly, can we use co-location to predict communication? In other words, by observing only the mobile location data from users, can we predict which pairs of users are likely to communicate? To learn how to reconstruct these networks, we present structure-preserving metric learning and apply it to Facebook data, Wikipedia data, FourSquare data, and mobile phone call detail records.

Bio: Tony is Associate Professor of Computer Science at Columbia University. He chairs the Center on Foundations of Data Science as well as directs the Columbia Machine Learning Laboratory. His research intersects computer science and statistics to develop new frameworks for learning from data with applications in social networks, spatio-temporal data, vision and text. Jebara has founded or advised startups including Sense Networks (acquired by, AchieveMint, Agolo, and Bookt (acquired by RealPage NASDAQ:RP). He is the author of the book Machine Learning: Discriminative and Generative. In 2004, Jebara was the recipient of the Career award from the National Science Foundation.

Panel Discussion

Panel Topic: Opportunities and Challenges in Data Science Research

Panel Moderator: Jennifer Chayes, Managing Director, MSR New York City

Bio: Jennifer Tour Chayes is Managing Director of Microsoft Research New York City as well as the Microsoft Research New England lab in Cambridge. Before this, she was research area manager for Mathematics, Theoretical Computer Science and Cryptography at Microsoft Research Redmond. Chayes joined Microsoft Research in 1997, when she co-founded the Theory Group. Her research areas include phase transitions in discrete mathematics and computer science, structural and dynamical properties of self-engineered networks, and algorithmic game theory. She is the co-author of almost 100 scientific papers and the co-inventor of more than 20 patents.

Panel Members: Yann LeCun, Mor Naaman, Tony Jebara