
Scott Wen-tau Yih

Senior Researcher


I’m Scott Wen-tau Yih, a researcher in the Natural Language Processing Group. My primary research interest is natural language understanding, in particular solutions enabled by advanced machine learning models. Currently I am working on problems related to semantic parsing and question answering, as well as continuous-space semantic representations.



Established: January 30, 2015

The goal of this project is to develop a class of deep representation learning models. DSSM stands for Deep Structured Semantic Model, or, more generally, Deep Semantic Similarity Model. DSSM, developed by the MSR Deep Learning Technology Center (DLTC), is a deep neural network (DNN) modeling technique for representing text strings (sentences, queries, predicates, entity mentions, etc.) in a continuous semantic space and modeling semantic similarity between two text strings (e.g., Sent2Vec). DSSM has wide applications including information retrieval…

Recurrent Neural Networks for Language Processing

Established: November 23, 2012

This project focuses on advancing the state of the art in language processing with recurrent neural networks. We are currently applying these to language modeling, machine translation, speech recognition, language understanding and meaning representation. A special interest is adding side-channels of information as input, to model phenomena which are not easily handled in other frameworks. A toolkit for doing RNN language modeling with side-information is in the associated download. Sample word vectors for use with this toolkit…


Established: April 4, 2012

Statistical Parsing and Linguistic Analysis Toolkit is a linguistic analysis toolkit. Its main goal is to allow easy access to the linguistic analysis tools produced by the Natural Language Processing group at Microsoft Research. The tools include both traditional linguistic analysis tools, such as part-of-speech taggers and parsers, and more recent developments, such as sentiment analysis (identifying whether a particular piece of text has positive or negative sentiment towards its focus). Demo URL: You can find…





















Question Answering


July 24, 2015


Benjamin Van Durme, Luke Zettlemoyer, Matthew Richardson, Scott Yih, and Yan Ke


Microsoft Research, Microsoft, Johns Hopkins University, University of Washington


Question Sequences for Conversational Question Answering

January 2017

The SQA dataset was created to explore the task of answering sequences of inter-related questions on HTML tables. It has 6,066 sequences with 17,553 questions in total.

MSR FastRDFStore Package – Data Release

January 2017

This data release is part of the MSR FastRDFStore Package and includes the last dump of Freebase, as well as the processed version ready to load directly into FastRDFStore.

NCI-PID-PubMed Genomics Knowledge Base Completion Dataset

October 2016

This dataset includes a database of regulation relationships among genes and corresponding textual mentions of pairs of genes in PubMed article abstracts.

WebQuestions Semantic Parses Dataset

May 2016

Microsoft Research WikiQA Code Package

October 2015

Microsoft Research WikiQA Corpus

August 2015

Data Set of English-Spanish Term Vectors from Wikipedia

August 2011


Academic Services

Conference/Workshop Organizer

  • CoNLL-14 Program Co-chair
  • ICML-2014 Workshop on Knowledge-Powered Deep Learning for Text Mining
  • The Second Workshop on Continuous Vector Space Models and their Compositionality
  • IJCNLP-13 Workshop Co-chair
  • CEAS-09 Program Co-chair
  • ICML-07 Workshop on Constrained Optimization and Learning with Structured Outputs

Editorial Board Member

  • Journal of Artificial Intelligence Research (JAIR), 2013-2016

Program Committee Member

  • Area Chair: HLT-NAACL-12, ACL-14
  • Senior Program Committee: IJCAI-09, AAAI-11, AAAI-14, AAAI-15
  • CEAS: 2004, 2005, 2006, 2007, 2008, 2009 (Program Co-chair), 2010
  • ICML: 2006, 2008, 2009, 2012, 2013, 2014
  • NIPS: 2006, 2007, 2008, 2009, 2012, 2013 (Reviewer Award), 2014, 2015
  • AAAI: 2006, 2008, 2011 (SPC), 2014 (SPC), 2015 (SPC)
  • IJCAI: 2009 (SPC)
  • ACL: 2007, 2008 (w/ HLT), 2009 (w/ IJCNLP), 2010, 2011 (w/ HLT), 2012, 2013, 2014 (Area Chair), 2015 (w/ IJCNLP)
  • EMNLP: 2005, 2007 (w/ CoNLL), 2008, 2010, 2011, 2013, 2014, 2015
  • HLT-NAACL: 2004, 2009, 2010, 2012 (Area Co-chair), 2013, 2015
  • CIKM-08, CoNLL-09, ILPNLP-WS-09, EACL-12, ICLR-15


  • Deep Learning and Continuous Representations for NLP (Tutorial for NAACL-HLT-15, with Xiaodong He & Jianfeng Gao) [Video]
  • Multi-Relational Latent Semantic Analysis (UW/MSR ML Day 2013)

Demo & Web Service