Announcing the fifteenth Symposium in Computational Linguistics, sponsored by the UW Departments of Linguistics, Electrical Engineering, and Computer Science; Microsoft Research; and UW alumni at Microsoft. Come connect with the computational linguistics community at Microsoft and the University of Washington. The symposium is a regular opportunity for computational linguists at both institutions to discuss topics in the field and to meet in a friendly, informal atmosphere. It consists of two invited talks, followed by an informal reception and demos.
Dept. of Computer Science and Engineering, UW
Autonomous Web-scale Information Extraction
Search engines are extremely useful tools for answering questions. However, a significant number of questions users might pose – for example, “which actors have won an Oscar for playing a villain?” – are difficult to answer using existing search engines, because the answers do not lie on a single page. To answer these kinds of queries, users must extract and synthesize information from multiple documents. Currently, this is a tedious and error-prone manual process.
In this talk, I will describe research aimed at automating the extraction of this information from the Web. I begin by presenting a model of the redundancy inherent in the Web, and show that the model can be used to identify correct extractions autonomously, without the manually labeled examples typically assumed in previous information extraction research. However, the model has limited efficacy for the “long tail” of infrequently mentioned facts; my second investigation shows how unsupervised language models can be leveraged in concert with redundancy to overcome this limitation.
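As a rough illustration of why redundancy helps, the sketch below scores an extraction by how often it was seen. It uses a simple noisy-or assumption, which is a stand-in chosen for brevity and not the model presented in the talk; the function names, the per-sighting probability `p`, and the sample counts are all hypothetical.

```python
def noisy_or_confidence(k: int, p: float = 0.5) -> float:
    """Confidence that an extraction is correct after k independent sightings.

    Assumption (illustrative only): each sighting is independently a correct
    extraction with probability p, and the fact counts as correct if at least
    one sighting is correct.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return 1.0 - (1.0 - p) ** k


def rank_extractions(counts: dict, p: float = 0.5) -> list:
    """Return (fact, confidence) pairs sorted from most to least confident."""
    scored = [(fact, noisy_or_confidence(k, p)) for fact, k in counts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical sighting counts for three candidate extractions.
    sightings = {"fact_a": 12, "fact_b": 1, "fact_c": 3}
    for fact, confidence in rank_extractions(sightings):
        print(f"{fact}: {confidence:.3f}")
```

A score of this kind separates frequently repeated facts from rare ones, but it assigns low confidence to every rarely mentioned fact, which is the "long tail" limitation the abstract describes.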
Models for Comparable Corpus Fragment Alignment
The development of broad-domain statistical machine translation (SMT) systems is gated by the availability of parallel data. A promising strategy for mitigating data scarcity is to mine parallel data from comparable corpora. Although comparable corpora seldom contain parallel sentences, they often contain parallel words or phrases. Recent fragment extraction approaches have shown that including parallel fragments in SMT training data can significantly improve translation quality. We describe efficient and effective generative models for extracting fragments, and demonstrate that these algorithms produce competitive improvements on cross-domain test data without suffering in-domain degradation, even at very large scale.
Scott Drellishak, Kelly O’Hara, Emily M. Bender
Dept. of Linguistics, UW
Case and Inflection in the Grammar Matrix
The LinGO Grammar Matrix is a computational foundation for implementing grammars of natural languages in the HPSG framework. To give students and researchers who wish to build such grammars a head start, the Matrix customization system presents the user with a typological questionnaire via a web page and then, based on the answers, creates a limited but functional starter grammar. Recent work has focused on expanding the questionnaire to handle a variety of case systems, and to allow the description of inflectional marking on nouns, verbs, and determiners.
Michael Gamon, Sumit Basu, Dmitriy Belenko, Danyel Fisher, Matthew Hurst, Arnd Christian König
Microsoft Research and Microsoft Live Labs
BLEWS: What the Blogosphere Tells You About the News
While typical news-aggregation sites do a good job of clustering news stories according to topic, they leave the reader without information about which stories figure prominently in political discourse. BLEWS uses political blogs to categorize news stories according to their reception in the conservative and liberal blogospheres. It visualizes information about which stories are linked to from conservative and liberal blogs, and it indicates the level of emotional charge in the discussion of the news story or topic at hand in both political camps. BLEWS also offers a “see the view from the other side” feature, enabling a reader to compare views on the same story from different sides of the political spectrum. BLEWS achieves this by digesting and analyzing a real-time feed of political-blog posts provided by the Live Labs Social Media platform, applying both link analysis and text analysis to the blog posts.
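The link-analysis idea can be pictured with a small sketch that counts, per story, how many links come from each political camp. This is only an illustration under assumed inputs, not the BLEWS implementation: the `(camp, story)` tuple format, the camp labels, and the `skew` measure are hypothetical.

```python
from collections import Counter


def tally_links(posts):
    """Count links to each story, split by the linking blog's camp.

    `posts` is an iterable of (camp, story_id) pairs, where camp is assumed
    to be either "liberal" or "conservative" (hypothetical labels).
    """
    tallies = {}
    for camp, story in posts:
        tallies.setdefault(story, Counter())[camp] += 1
    return tallies


def skew(counts):
    """Return a value in [-1, 1]: -1 means only liberal links, +1 only conservative."""
    liberal = counts["liberal"]
    conservative = counts["conservative"]
    total = liberal + conservative
    return 0.0 if total == 0 else (conservative - liberal) / total


if __name__ == "__main__":
    # Hypothetical feed of blog posts linking to two stories.
    feed = [
        ("liberal", "story-1"),
        ("liberal", "story-1"),
        ("conservative", "story-1"),
        ("conservative", "story-2"),
    ]
    for story, counts in tally_links(feed).items():
        print(story, dict(counts), round(skew(counts), 2))
```

Measuring the emotional charge of the discussion would need text analysis on top of these counts, which the sketch does not attempt.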
Michael Gamon, Chris Brockett, Dmitriy Belenko, Bill Dolan, Jianfeng Gao, Lucy Vanderwende
The MSR ESL Assistant
The Microsoft Research ESL Assistant is a web service that provides correction suggestions for typical ESL (English as a Second Language) errors. Such errors include, for example, the choice of determiners (the/a) and the choice of prepositions. The web service also provides word choice suggestions from a thesaurus. To help the user decide whether to accept a suggestion, the service displays “before and after” web search results so that the user can see real-life examples of the usage of both their original input and the suggested correction. Error detection and correction are based on machine-learned and heuristic modules, combined with a large language model.