Software Engineering Mix Volume 2: Large-scale Data Analysis of Software Repositories


Software Engineering Mix (SE-MIX) provided a forum for our colleagues from academia to interact directly with Microsoft engineers. The program featured two kinds of talks from academics: highlights of published research that is highly relevant to Microsoft, and blue-sky talks summarizing emerging research areas. In addition, practitioners presented theoretical and pragmatic engineering challenges they face and solicited help from academia. A coffee round-table setting facilitated discussions. This session built on the success of SEIF Days, which provided a discussion forum about the future of software engineering.

The topic of this year’s SE-MIX was the large-scale data analysis of software repositories such as GitHub. Many teams use GitHub for their OSS projects and would like a richer understanding of that activity. Projects such as GHTorrent and GitHub Archive exist, and some insights are available for analyzing a single project, but everyone working on this topic sees enormous potential in the data. The SE-MIX was intended to jumpstart connections between academia and Microsoft around the opportunities in leveraging data from GitHub and other software repositories to develop software more efficiently.


Speakers talked about open source and the large-scale data analysis of software repositories such as GitHub. The SE-MIX also featured Open Source Live, a showcase of open source projects related to Microsoft.

8:30-10:30  First Session

  • Welcome (10 minutes)
  • Judith Bishop, Microsoft. Industrial Research and Open Source – Reasons and Results (20 minutes, slides)
  • Jeff McAffer, Microsoft. GitHub Insight: Understanding Open Source (20 minutes)
  • Mei Nagappan, Rochester Institute of Technology. Curating GitHub for Engineered Software Projects (20 minutes, slides)
  • Speed Dating between academics and Microsoft engineers (50 minutes)

10:30-10:50  Break

10:50-12:00  Second Session

  • Vladimir Filkov, University of California, Davis. How to analyze GitHub traces to ask important questions and get actionable answers? (20 minutes)
  • Cristina Manu & Daniel Quirk, Microsoft. Spot – A distributed system for source code analysis (20 minutes)
  • Laura Dabbish, Carnegie Mellon University. The social life of software repositories: What large scale software analysis can learn from small scale qualitative research (20 minutes)
  • Preparation for Group Brainstorming (10 minutes)

12:00-13:20  Lunch break / Group picture

13:20-15:30  Third Session

  • Group Brainstorming (65-80 minutes)
  • Wrap-up (5 minutes)
  • Open Source Live! Showcase (45-60 minutes)