Code Completion with Statistical Language Models

August 29, 2014
Veselin Raychev | Microsoft

In this talk, I present the problem of synthesizing code completions for programs using APIs. Given a program with holes, we synthesize completions for the holes with the most likely sequences of method calls that we learn from existing code.

The main idea of our approach is to reduce the problem of code completion to a natural-language processing problem of predicting probabilities of sentences. We design a simple and scalable static analysis that extracts sequences of method calls from a large codebase and indexes them into a statistical language model. We then use the language model to find the highest-ranked sentences and synthesize a code completion from them.

Our technique can synthesize sequences of method calls, together with their arguments, that may span multiple objects and methods. We implemented our approach for Java programs using Android APIs, and our results show that the system is fast and effective. Virtually all computed completions typecheck, and the desired completion appears among the first 3 results for 90% of the test cases.

Speaker Details

Veselin Raychev is a third-year PhD student at ETH Zurich, Switzerland. Before ETH, Veselin completed an MSc degree in Artificial Intelligence at Sofia University in Bulgaria. His current research is in leveraging large codebases to solve program analysis and synthesis tasks such as code completion, refactoring, and code deobfuscation. Previously, he worked as a Software Engineer at Google, where he led the team behind the public transportation routing of Google Maps.