This talk summarizes Powerset’s effort to set up a flexible, data-driven approach to handling word senses.
In a traditional keyword search engine setting, word sense disambiguation is believed to play a subordinate role. While keyword queries tend to disambiguate themselves through the presence of other keywords (e.g. “flying jets” vs. “ny jets”), this is not the case in an index expansion setting, where wrong sense expansions lead to spurious semantic matches.
A system is proposed that consists of two steps: word sense induction (WSI) from the target corpus and word sense disambiguation (WSD) in this corpus using features derived during the induction phase. In this way, the method is kept independent of fixed word sense inventories and applies seamlessly to different domains and languages.
The core step in the WSI phase is to cluster distributionally similar words along context features; the WSD step then compares the global contexts of these clusters to the current context in a document.
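As a rough illustration of the WSD step, the following Python sketch scores each induced sense cluster by the overlap between its global context and the local context of an ambiguous word. It is a minimal sketch under stated assumptions, not the system presented in the talk: the cosine measure, the function names `cosine` and `disambiguate`, the sense labels, and the toy counts are all hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context_tokens, sense_contexts):
    """Return the sense whose global context is most similar to the local one.

    sense_contexts maps a sense label to a Counter of context words that
    would be collected for that cluster during induction (assumed input).
    """
    local = Counter(context_tokens)
    return max(sense_contexts, key=lambda s: cosine(local, sense_contexts[s]))


# Hypothetical global contexts for two induced senses of "jets".
sense_contexts = {
    "jets/aircraft": Counter({"fly": 3, "pilot": 2, "engine": 2, "airport": 1}),
    "jets/team": Counter({"football": 3, "ny": 2, "game": 2, "season": 1}),
}

team_sense = disambiguate(["ny", "game", "tonight"], sense_contexts)
aircraft_sense = disambiguate(["pilot", "fly", "low"], sense_contexts)
```

Because the sense labels come from the induced clusters rather than from a fixed inventory, a scoring function of this kind would carry over to another domain or language once new clusters are induced there.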
The talk elaborates on two components in more detail: constructing an evaluation corpus using Amazon Mechanical Turk and computing a distributional thesaurus based on grammatical relations.
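To make the idea of a distributional thesaurus concrete, the sketch below treats each (relation, related word) pair observed with a word as one feature and ranks other words by the number of features they share. This is only an illustrative simplification: the relation names, the toy triples, the shared-feature count as similarity measure, and the function `build_thesaurus` are assumptions, and a real system would likely use weighted features from a parsed corpus.

```python
from collections import defaultdict


def build_thesaurus(triples, top_n=2):
    """Map each word to its most similar words by shared relation features.

    triples: iterable of (word, relation, other_word) tuples, as a
    dependency parser might produce (hypothetical input format).
    """
    features = defaultdict(set)
    for word, relation, other in triples:
        features[word].add((relation, other))

    thesaurus = {}
    for word, word_feats in features.items():
        scored = []
        for candidate, cand_feats in features.items():
            if candidate == word:
                continue
            shared = len(word_feats & cand_feats)
            if shared:
                scored.append((shared, candidate))
        # Highest overlap first; ties broken alphabetically for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        thesaurus[word] = [candidate for _, candidate in scored[:top_n]]
    return thesaurus


# Toy grammatical relations; all values are made up for illustration.
triples = [
    ("jet", "subj_of", "fly"), ("jet", "mod_by", "private"),
    ("plane", "subj_of", "fly"), ("plane", "mod_by", "private"),
    ("helicopter", "subj_of", "fly"),
    ("team", "subj_of", "win"), ("jet", "subj_of", "win"),
]

thesaurus = build_thesaurus(triples)
```

In this toy data, "jet" shares features with both aircraft words and "team", which is the kind of mixed neighborhood that sense induction would then split into separate clusters.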
An evaluation of a simple bag-of-words implementation showed promising results; a more thorough assessment of system performance is ongoing work.
Finally, promising directions are identified and an outlook is provided.