Key words: Statistical Natural Language Processing, Machine Learning, Text Mining, and Semantic Integration
Most information in the world exists in the form of text, such as news articles and web pages. Several lines of research, including information retrieval, information extraction, and question answering, aim to let a computer system pinpoint knowledge in text at different levels of precision. The goal of these tasks is to discover, understand, and access knowledge about real-world entities and relations from text. Currently, however, most of them still rely on simple string- and token-level matching techniques to identify relevant information, and the extracted information is typically not integrated effectively.
In this talk, we explain why it is necessary to move from string and token processing to concept-based text understanding and mining, and to integrate information based on real-world concepts. We also present state-of-the-art solutions to the related problems. At the heart of this work is a mechanism, I-Track, that automatically resolves concept ambiguity in text: a given entity (representing a person, a location, or an organization) may be mentioned in text in multiple, ambiguous ways. This talk will cover three related projects that we have been working on in the last few years:
Entity identification in text,
Semantic integration across text and databases,
Supervised clustering framework for entity identification.