Data mining competition takes center stage in Chicago

September 24, 2013 | Posted by Microsoft Research Blog

In keeping with our mission to collaborate with top academic and scientific researchers to foster innovations in scientific inquiry, Microsoft Research Connections was proud to sponsor the 2013 KDD Cup, arguably the world’s best-known competition in data mining. The winning teams were announced at KDD 2013, the 19th annual conference of ACM SIGKDD (the Association for Computing Machinery’s Special Interest Group on Knowledge Discovery and Data Mining), which took place in Chicago in August. KDD is the premier event for researchers grappling with today’s data deluge, as it’s the only conference spanning big data, data mining, data science, and analytics, along with all the related algorithms, foundations, applications, and practices.

2013 KDD Cup challenge winners, Team Algorithm, from National Taiwan University

The 2013 KDD Cup challenge focused on the ability to search literature and to collect metrics around publications, a capability that is essential to modern research, as academic and industry researchers increasingly rely on search to discover what has been published and by whom. The competition made use of a dataset of 250,000 authors and 2.5 million published papers. The dataset was split into three distinct parts: a labeled training set, a validation set for the leaderboard, and a test set. The competitors faced two tasks: first, a prediction task to determine whether an author had written a paper, and second, a name disambiguation task to identify duplicate author names in a dataset with name variants.

These tasks go to the heart of one of the main challenges of information extraction and curation in any people-centric dataset: resolving people-name ambiguity. In the scholarly publishing world, many authors publish under several variations of their own name, and to add to the complexity of discovery, different authors might share a similar or even the same name. As a result, the profile of an author with an ambiguous name tends to contain noise, resulting in papers that are incorrectly assigned to him or her. The KDD Cup task challenged participants to determine which papers in an author profile were truly written by a given author. Read the full parameters of the challenge.
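To make the name-ambiguity problem concrete, here is a minimal Python sketch. It is not the method used by any KDD Cup team; it is a naive baseline that assumes two names might belong to the same person when they share a last name and a first initial. The function names (`name_key`, `group_candidate_duplicates`) and the sample names are made up for illustration.

```python
from __future__ import annotations

import re
import unicodedata
from collections import defaultdict


def name_key(name: str) -> tuple[str, str]:
    """Reduce a name to (last name, first initial), ignoring case, accents, and punctuation."""
    # Strip accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Handle "Last, First" ordering by moving the part before the comma to the end.
    if "," in plain:
        last, _, first = plain.partition(",")
        plain = f"{first} {last}"
    tokens = re.findall(r"[a-z]+", plain.lower())
    if not tokens:
        return ("", "")
    return (tokens[-1], tokens[0][0])


def group_candidate_duplicates(authors: dict[int, str]) -> list[list[int]]:
    """Return groups of author ids whose names reduce to the same key."""
    buckets = defaultdict(list)
    for author_id, name in authors.items():
        buckets[name_key(name)].append(author_id)
    return [sorted(ids) for ids in buckets.values() if len(ids) > 1]


# Hypothetical author records (id -> name as it appears in a profile).
authors = {
    1: "Maria García",
    2: "M. Garcia",
    3: "Garcia, Maria",
    4: "Miguel Garcia",
    5: "Wei Zhang",
}
groups = group_candidate_duplicates(authors)
print(groups)  # [[1, 2, 3, 4]]
```

The sketch groups the three variants of one name, but it also pulls in "Miguel Garcia", who is presumably a different person. That false match is exactly the kind of noise described above, and it is why a realistic solution needs more evidence than the name alone, such as coauthors, venues, or affiliations.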

The competition was fierce, with more than 800 teams from more than 40 different countries developing approximately 12,000 data-mining models over the course of a few months. The winning solution, created by Professor Chih-Jen Lin and Team Algorithm from National Taiwan University, was the product of outstanding teamwork: eighteen students and three teaching assistants actually designed a graduate course around the competition. Other winners included teams from the University of Illinois at Urbana-Champaign, Moscow State University, and FICO. Winners presented their solutions at a KDD Cup workshop and poster session at the conference. In addition, solutions created for the competition resulted in 10 research papers, which are available through the KDD Cup 2013 Workshop proceedings.

KDD Cup poster session participants at KDD 2013

On behalf of Microsoft Research Connections, I would like to thank the key collaborators who helped make this competition a success. The Microsoft Research Connections proposal for the KDD Cup challenge was selected after careful deliberation by 2013 KDD Cup chairpersons Claudia Perlich and Brian Dalessandro of Media6°. Partnering with me in designing the contest rules and evaluation criteria were Professors Martine De Cock of Ghent University and Senjuti Basu Roy of the University of Washington Tacoma, along with Ben Hamner and Will Cukierski of Kaggle. Swapna Savvana and Yitao Li from the University of Washington Tacoma helped with the logistics of running the contest.

So congrats to the KDD Cup winners, and kudos to everyone who accepted the challenge. The many outstanding solutions showed great creativity, which is exactly what we’ll need as we move forward in this new world of data-intensive scientific inquiry.

—Vani Mandava, Senior Program Manager, Microsoft Research Connections

