Jun Yan

Senior Research Manager


Dr. Jun Yan received his Ph.D. in digital signal processing and pattern recognition from the Department of Information Science, School of Mathematical Sciences, Peking University, P.R. China. During his Ph.D. studies, he was a research intern at MSRA from 2003 to 2005 and was named a Microsoft Fellow in 2004. Before joining Microsoft, he was a research associate at CBI, HMS, Harvard, Cambridge, MA, in 2005. He joined Microsoft Research Asia (MSRA) in 2006. He currently works in the Data Mining and Enterprise Intelligence group of MSRA as a senior research manager.

His research interests include knowledge mining for AI, text data preprocessing, information retrieval, and behavioral targeted online advertising. He has incubated tens of technologies that have been used in Microsoft products. In academia, he has published more than 60 papers in refereed conferences and journals, including SIGKDD, SIGIR, WWW, ICDM, and TKDE. He has served as a PC member for international conferences such as SIGKDD and SIGIR, and as a reviewer for journals such as TKDE and TPAMI.


Digital Me

Established: April 1, 2016

Digital Me: Toward Digitalizing Everybody in the World. Introduction: Artificial Intelligence (AI) applications such as chat bots and software assistants are attracting increasing attention from both academia and industry. Most existing work aims to assist people in information acquisition and task completion scenarios using knowledge from public data sources such as the Web, enterprise documents, and social media. In contrast, Digital Me (DM) aims to digitalize the knowledge of each person to build her personal “Avatar”, through which…

Enterprise Dictionary

Established: April 14, 2014

1. Project Introduction: "Every day we are faced with a sea of acronyms, ever-changing group structures, and fast-tracked projects." Currently, the collation and curation of corporate knowledge is a painstaking manual process. We seek to move these activities into the background so that the relationships between different people, project updates, and emerging milestones can be surfaced in an ambient, lightweight way. This is our project: Enterprise Dictionary. It is a research project which aims to learn…


Established: July 5, 2010

The Conceptualization model aims to map entities mentioned in text to semantic concept categories with associated probabilities, which may depend on the text surrounding the entities. For example, “Microsoft” could be automatically mapped to “Software Company” and “Fortune 500 company”, each with some probability. The model gives computers a common-sense computing capability and makes machines "aware" of the mental world of human beings, so that machines can better understand human communication in text.…
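As a rough illustration of mapping an entity to concept categories with probabilities, the sketch below normalizes entity-concept counts into a probability distribution. This is not the project's actual model: the `ISA_COUNTS` table, its numbers, the extra “Brand” concept, and the `conceptualize` function are all hypothetical, and the context dependence described above is ignored.

```python
from collections import Counter

# Hypothetical entity -> concept co-occurrence counts.
# The numbers are invented for illustration only.
ISA_COUNTS = {
    "Microsoft": Counter({
        "Software Company": 60,
        "Fortune 500 company": 30,
        "Brand": 10,
    }),
}


def conceptualize(entity, top_k=3):
    """Return up to top_k (concept, probability) pairs for an entity.

    Probabilities are simple normalized counts, i.e. an estimate of
    P(concept | entity) that ignores the surrounding context.
    """
    counts = ISA_COUNTS.get(entity)
    if not counts:
        return []  # unknown entity: no concepts to report
    total = sum(counts.values())
    return [(concept, n / total) for concept, n in counts.most_common(top_k)]


if __name__ == "__main__":
    for concept, prob in conceptualize("Microsoft"):
        print(f"{concept}: {prob:.2f}")
```

A context-aware variant would presumably re-weight these probabilities using the words around the entity, but the source does not describe how that is done.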


Research Interests

  • Large scale Web knowledge extraction and mining
  • Behavioral targeted online advertising
  • Large scale data preprocessing
  • Machine learning for information retrieval
  • User modeling and understanding

Selected Projects

  • Knowledge Table. Kable, also known as the Knowledge Table project, aims to automatically extract structured domain knowledge from the semi-structured and unstructured World Wide Web, and then to process and store that knowledge in table format, where each row stands for a domain entity and each column stands for an attribute. The cells in Kable are the attribute values of the corresponding entity-attribute pairs. Constructing this kind of structured knowledge base is important for OSD applications such as Bing search, paid search, and display ads. The Kable research concept map has three layers: the Data Layer, the Model Layer, and the Application Layer.
  • Intent based behavioral targeting project. Description: this project aims to sell “intents” to advertisers in behavioral targeted advertising. We classify user search behaviors into user intent categories, based on which we can accurately deliver ads to the audience. In this project, I mainly focus on algorithm design and on driving the cross-group research efforts within MSRA.
  • Online ad relevance verification project. Description: this project aims to improve ad relevance in Bing paid search. We propose novel features and a classifier to improve ad relevance from a machine learning perspective. In this project, I mainly focus on algorithm design and feature proposal, and lead the research efforts within MSRA.
  • Bing search task classification project. Description: this project aims to understand whether Bing search users intend to compare Web objects in the sports domain. Bing can then return a side-by-side comparison without requiring users to browse the 10 blue links. We propose a classification solution that gives satisfactory performance for online users. In this project, I work together with the product team to design and transfer the intent classifier.
  • Self-service BT prototyping. Description: this project aims to let advertisers customize their user segments for ad delivery. We propose a MinHash-based user clustering solution and implement the prototype. In this project, I mainly focus on scenario design, algorithm design, and leading the team-wide research efforts within MSRA.
  • Office online assets recommendation project. Description: this project aims to recommend assets that are likely to interest “Office Online” users, based on similar users’ behaviors. We develop the algorithm for this online recommendation and transfer the technology to the Office Online AP team. In this project, I mainly focus on algorithm design and on driving the research efforts within MSRA.
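
The MinHash-based user clustering mentioned under Self-service BT prototyping can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the prototype itself: the function names, the use of MD5 as a seeded hash, the choice of four hash functions, and the sample behavior sets are all hypothetical. Users whose behavior sets produce the same MinHash signature are placed in the same segment, so users with highly overlapping behaviors are likely, though not guaranteed, to be grouped together.

```python
import hashlib
from collections import defaultdict


def _seeded_hash(seed, item):
    """Deterministic 64-bit hash of (seed, item); MD5 is an arbitrary choice."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=4):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return tuple(
        min(_seeded_hash(seed, item) for item in items)
        for seed in range(num_hashes)
    )


def cluster_users(user_behaviors, num_hashes=4):
    """Group users whose behavior sets share an identical signature."""
    clusters = defaultdict(list)
    for user, items in user_behaviors.items():
        if items:  # skip users with no recorded behavior
            clusters[minhash_signature(items, num_hashes)].append(user)
    return list(clusters.values())


if __name__ == "__main__":
    # Hypothetical behavior sets (e.g. search query terms per user).
    behaviors = {
        "u1": {"nba", "lakers"},
        "u2": {"lakers", "nba"},
        "u3": {"mortgage", "loan rates"},
    }
    for segment in cluster_users(behaviors):
        print(segment)
```

Using fewer hash functions makes segments coarser, since partially overlapping users are more likely to collide; using more makes them stricter.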


  • Indexing Semantic User Profiles for Targeted Advertising
  • Web Knowledge Extraction for Search Task Simplification
  • Build of Website Knowledge Tables
  • Forecasting Search Queries based on Time Dependencies (Appl. No. 11/770,462)
  • Clustering Aggregator for RSS feeds (Appl. No. 20090327320)
  • Prediction of Future Popularity of Query Terms (Appl. No. 20090222321)
  • Categorizing Online User Behavior Data (MS#327757.01)
  • Representing Queries and Determining Similarity based on An ARIMA Model (Appl. No. 20090006326)
  • Identification of Events of Search Queries (Appl. No. 11/770,423)
  • Forecasting Time-Dependent Search Queries (Appl. No. 11/770,385)
  • Learning Latent Semantic Space for Ranking
  • Identification of Similar Queries based on Overall and Partial Similarity of Time Series
  • Determination of Time Dependency of Search Queries (Appl. No. 11/770,358)
  • Forecasting Time Independent Search Queries (Appl. No. 11/770,445)
  • Scalable Parallel User Clustering in Discrete Time Window (Appl. No. 20100169258)
  • Learning User Intent from Rule-based Training Data (MS# 329229.01)
  • Related Links Recommendation (MS# 329226.01)


Yong Luo, Jian Tang, Jun Yan, Chao Xu, Zheng Chen, Pre-Trained Multi-View Word Embedding Using Two-Side Neural Network, AAAI, 2014

Xiang Ren, Yujing Wang, Xiao Yu, Jun Yan, Zheng Chen, Jiawei Han, Heterogeneous graph-based Intent Learning with Queries, Web Pages and Wikipedia Concepts, WSDM, 2014

Xingxing Zhang, Jianwen Zhang, Junyu Zeng, Jun Yan, Zheng Chen, Zhifang Sui, Towards Accurate Distant Supervision for Relational Facts Extraction, ACL, 2013