Frontiers in Artificial Intelligence is a series of public lectures at Microsoft Research Cambridge featuring leading researchers in the field, focusing on cutting-edge topics at the intersection of machine learning, statistics, and artificial intelligence. Students, scientists, and engineers in academia and industry are all welcome to join us for these exciting talks and the opportunity to socialize with the Cambridge AI/ML community.
Neural Code Comprehension: A Learnable Representation of Code Semantics
In the era of “Big Code”, research is being conducted into automating the understanding of computer programs. Most current work builds on recently successful techniques from Natural Language Processing and Deep Learning, attempting to process the code directly or through syntactic representations (e.g., ASTs and AST paths). However, to comprehend program semantics robustly, structural features of code have to be taken into account as well, including function calls, branching, and interchangeable order of statements. In this talk, I will present a novel processing technique for applying Machine Learning to code semantics, and show how it applies to a variety of program analysis tasks. In particular, we posit that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, we define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language. We provide a novel definition of contextual flow for this IR, leveraging both the underlying data- and control-flow of the program. We then analyze the embeddings quantitatively using analogies and clustering, and evaluate the learned representation on three different high-level tasks. We show that even without fine-tuning, a single Recurrent Neural Network (RNN) architecture with fixed inst2vec embeddings outperforms specialized approaches on performance prediction (compute device mapping and optimal thread coarsening) and on algorithm classification from raw code (104 classes), where we set a new state of the art.
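To make the idea of context derived from data- and control-flow more concrete, here is a minimal sketch in Python. It is not the inst2vec implementation: the statement labels, the toy graph, and the `context_pairs` helper are all hypothetical, and the sketch assumes that the context of a statement is the set of statements reachable within a fixed number of hops in a flow graph, ignoring edge direction. Pairs like these could then be fed to a skip-gram style embedding model.

```python
from collections import deque


def context_pairs(graph, context_size=1):
    """Return sorted (statement, neighbor) pairs for statements that lie
    within `context_size` hops of each other in the flow graph.

    `graph` maps a statement label to the statements that depend on it
    (data- or control-flow edges). Edge direction is ignored here, which
    is an assumption of this sketch.
    """
    # Build an undirected adjacency view of the graph.
    adjacency = {}
    for src, dsts in graph.items():
        adjacency.setdefault(src, set())
        for dst in dsts:
            adjacency[src].add(dst)
            adjacency.setdefault(dst, set()).add(src)

    pairs = set()
    for start in adjacency:
        # Breadth-first search bounded by context_size hops.
        seen = {start}
        frontier = deque([(start, 0)])
        while frontier:
            node, dist = frontier.popleft()
            if dist == context_size:
                continue
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    pairs.add((start, nxt))
                    frontier.append((nxt, dist + 1))
    return sorted(pairs)


# Hypothetical toy graph: two loads feed an add, whose result is stored.
toy_graph = {
    "load_a": ["add_c"],
    "load_b": ["add_c"],
    "add_c": ["store_c"],
}

pairs = context_pairs(toy_graph, context_size=1)
for statement, neighbor in pairs:
    print(statement, "->", neighbor)
```

Note that the two loads never appear in each other's context at size 1, even though they might be adjacent lines in the source text. They are related only through the add that consumes both, which is the kind of order-independence a flow-based context is meant to capture.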