Word Sense Induction and Disambiguation at Powerset


September 17, 2008


Chris Biemann




This talk summarizes Powerset’s endeavor to set up a flexible and data-driven approach to handling word senses.
In a traditional keyword search engine setting, word sense disambiguation is believed to play a subordinate role. While keyword queries tend to disambiguate themselves through the presence of other keywords (e.g., “flying jets” vs. “ny jets”), this is not the case in an index expansion setting, where wrong sense expansions lead to spurious semantic matches.
A system is proposed that consists of two steps: word sense induction (WSI) from the target corpus and word sense disambiguation (WSD) in this corpus using features derived at the induction phase. In this way, the method is kept independent from fixed word sense inventories and applies seamlessly to different domains and languages.
The core step in the WSI phase is to cluster distributionally similar words along context features; the WSD step then compares the global contexts of these clusters to the current context in a document.
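
As a rough illustration of the WSD comparison step, the following minimal Python sketch scores each induced sense cluster against the words surrounding a target word. It is not Powerset's implementation: the sense labels, the toy context profiles, the function names, and the choice of cosine similarity over bag-of-words counts are all assumptions made for this example.

```python
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context_tokens, sense_profiles):
    """Return the sense whose global context profile best matches the local context.

    sense_profiles maps a (hypothetical) sense label to a Counter of context
    words, standing in for the global context collected per induced cluster.
    """
    local = Counter(context_tokens)
    scores = {sense: cosine(local, profile)
              for sense, profile in sense_profiles.items()}
    return max(scores, key=scores.get)


# Toy, hand-written profiles for two assumed senses of "jets".
SENSE_PROFILES = {
    "jets/aircraft": Counter({"flying": 3, "engine": 2, "pilot": 2, "airport": 1}),
    "jets/football": Counter({"ny": 3, "game": 2, "quarterback": 2, "season": 1}),
}

if __name__ == "__main__":
    print(disambiguate(["flying", "jets", "over", "the", "airport"], SENSE_PROFILES))
    print(disambiguate(["ny", "jets", "win", "the", "game"], SENSE_PROFILES))
```

In this sketch the profiles are written by hand; in the approach described above they would instead be derived from the clusters found during the induction phase.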
The talk elaborates in more detail on constructing an evaluation corpus using Amazon Mechanical Turk and on computing a distributional thesaurus based on grammatical relations.
An evaluation of a simple bag-of-words implementation showed promising results; a more thorough assessment of system performance is ongoing.
Finally, promising directions are identified and an outlook is provided.


Chris Biemann

Chris is a natural language scientist at Powerset, a semantic search engine recently acquired by Microsoft. His research interests are large-scale lexical acquisition, graph clustering, and weakly supervised and unsupervised natural language processing. He graduated from the University of Leipzig, Germany, in 2007. His recent research community activities include the organization of TextGraphs: Graph-based Algorithms for Natural Language Processing and active support of the ACL Video Archive initiative.