Entity Disambiguation based on a Probabilistic Taxonomy

  • Masumi Shirakawa
  • Haixun Wang
  • ,
  • Zhongyuan Wang
  • Kotaro Nakayama
  • Takahiro Hara
  • Shojiro Nishio

MSR-TR-2011-125

This paper presents a method for entity disambiguation, one of the most important tasks in enabling machines to understand natural-language text. Terms in natural language are often ambiguous: for example, “Barcelona” usually refers to a Spanish city, but it can also refer to a professional football club. In our work, we utilize a probabilistic taxonomy that is as rich as our mental world in the concepts of worldly facts it contains. We then employ a naive Bayes probabilistic model to disambiguate a term by identifying its related terms in the same document. Specifically, our method consists of two steps: clustering related terms, and conceptualizing the cluster using the probabilistic taxonomy. We cluster related terms probabilistically rather than with a threshold-based deterministic clustering approach. Our method automatically adjusts the relevance weight between two terms by taking the topic of the document into consideration, which lets us perform clustering without a sensitive, predefined threshold. We then conceptualize all possible clusters using the probabilistic taxonomy and aggregate the probabilities of each concept to find the most likely one. Experimental results show that our method outperforms threshold-based methods with optimally set thresholds, as well as several gold standard approaches for entity disambiguation.
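The two steps described above can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the formulation used in the report. The toy taxonomy, the prior, the smoothing constant, and the relevance weights are all invented for illustration, and the names `conceptualize` and `disambiguate` are hypothetical. In particular, the relevance weights are supplied by hand here, whereas the method adjusts them using the topic of the document.

```python
from itertools import product

# Toy P(term | concept) values. These numbers are invented for illustration
# and are NOT taken from the probabilistic taxonomy used in the report.
TAXONOMY = {
    "city": {"Barcelona": 0.05, "Madrid": 0.06, "Paris": 0.07},
    "football club": {"Barcelona": 0.08, "Real Madrid": 0.09, "Chelsea": 0.07},
}
PRIOR = {"city": 0.5, "football club": 0.5}  # assumed uniform prior P(concept)
SMOOTHING = 1e-4  # assumed probability for a term unseen under a concept


def conceptualize(terms):
    """Naive Bayes posterior: P(c | terms) is proportional to P(c) * prod P(t | c)."""
    scores = {}
    for concept, prior in PRIOR.items():
        score = prior
        for term in terms:
            score *= TAXONOMY[concept].get(term, SMOOTHING)
        scores[concept] = score
    total = sum(scores.values())
    return {concept: score / total for concept, score in scores.items()}


def disambiguate(target, context_relevance):
    """Aggregate concept posteriors over every possible cluster around `target`.

    `context_relevance` maps each context term to an assumed probability that
    it belongs in the same cluster as `target`. No threshold is applied:
    each subset of context terms is weighted by its probability instead.
    """
    context = list(context_relevance)
    aggregated = {concept: 0.0 for concept in PRIOR}
    for mask in product([False, True], repeat=len(context)):
        p_cluster = 1.0
        cluster = [target]
        for term, included in zip(context, mask):
            relevance = context_relevance[term]
            p_cluster *= relevance if included else 1.0 - relevance
            if included:
                cluster.append(term)
        for concept, posterior in conceptualize(cluster).items():
            aggregated[concept] += p_cluster * posterior
    return aggregated


if __name__ == "__main__":
    result = disambiguate("Barcelona", {"Real Madrid": 0.9, "Chelsea": 0.8})
    print(max(result, key=result.get), result)
```

Enumerating every subset is exponential in the number of context terms, so this sketch is only practical for a handful of terms. It is meant to show how aggregating over weighted clusters avoids a hard relevance cutoff, not how to do so efficiently.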