Microsoft researchers unlock the black box of network embedding


At the ACM Conference on Web Search and Data Mining 2018, my team will introduce research that, for the first time, provides a theoretical explanation of popular methods used to automatically map the structure and characteristics of networks, known as network embedding. We then use this theoretical explanation to present a new network embedding method that performs as well as or better than existing methods.

Networks are fundamental ways of representing knowledge and relating to the world. As humans, we think through association; we naturally draw links between concepts or entities. People are linked through family relationships or through collaborations. Diseases are linked to treatments. Works of art are linked to their creators. Wikipedia represents the interconnectedness of human knowledge.

Microsoft research webinars

Lectures from Microsoft researchers with live Q&A and on-demand viewing.

As humans, we have great capacity to understand knowledge, notice similarities and make connections and inferences. However, our capacity is limited by the amount of information we can process. Computers, on the other hand, can process tremendous amounts of information, but their ability to understand knowledge and make inferences is limited. That’s because humans think in symbols and computers think in math. Recent advances in computer science such as word and network embedding address this problem by using methods that translate discrete symbols into continuous representations such as high-dimensional vectors that computers can process algebraically.

Network embedding methods enable us to map the structure and characteristics of a network algorithmically, without human intervention. Mapping can be used to predict missing links, to identify missing entities and to make inferences that inform, for example, recommender systems on e-commerce websites.

Existing methods of network embedding work empirically, but so far, scientists have considered them a black box: We lacked understanding of how or why these methods work. Our recent research—a collaboration between Microsoft Research and Tsinghua University—contributes to the theoretical understanding of four popular network-embedding methods, including DeepWalk, node2vec, as well as LINE and PTE, which were developed at Microsoft Research Asia.

Our research bridges the gap between the empirical performance and theoretical mechanism of network embedding. We prove that four popular network-embedding models are —in theory— implicitly performing matrix factorizations over a network. More excitingly, the theoretical analysis also provides the closed forms of those factorized matrices, most of which are related to the network’s Laplacian matrix. We show that these models share a common practice: they are unified into a general matrix-factorization framework. As such, this work lays the theoretical foundation for network embedding, yielding a better understanding of latent representation learning over networks.

Theoretical understanding is important because it enables us to go beyond empirical attempts. Once we understand why and how something works, we are better able to predict how it will work in the future and can use it more efficiently. Based on the new theoretical understanding we derived from four existing methods, we proposed a new algorithm, NetMF, that uses matrix factorization explicitly to create network embeddings. The new algorithm performed at comparative or better levels than existing methods in our tests, providing further support for our theoretical explanation.

We use network embedding in the Microsoft Academic Graph – the largest knowledge graph of research publications in existence, accessible via the Academic Knowledge API and the Microsoft Academic website. As machines map the structure and meaning of entities in the graph and their relationships, they can begin making inferences. For example, they can infer that two papers are quite similar in content, or two authors do similar work. In a way, this vector representation is like your genome sequence. If it is similar to someone else’s, you must be related. With network embedding, we can find academic twins that perhaps were not aware of each other. We can help researchers discover scholarship they might not have come across otherwise, and bridge gaps between traditional disciplinary divisions.

The research team included Jiezhong Qiu (a summer intern at Microsoft Research), Yuxiao Dong, Hao Ma, Kuansan Wang from Microsoft Research, and Jian Li and Jie Tang from Tsinghua University.

Figure 1: The closed forms of the matrices that are implicitly approximated and factorized by popular network-embedding models.