Towards Concept-Based Text Understanding and Mining


March 9, 2005


Xin Li


University of Illinois at Urbana-Champaign


Keywords: Statistical Natural Language Processing, Machine Learning, Text Mining, Semantic Integration

Most information in the world exists in the form of text, such as news articles and web pages. Several lines of research, including information retrieval, information extraction, and question answering, aim to let a computer system pinpoint knowledge in text at different levels of precision. The goal of these tasks is to discover, understand, and access knowledge about real-world entities and relations from text. Currently, however, most of them still rely on simple string- and token-level matching techniques to identify relevant information, and the extracted information is typically not integrated effectively.

In this talk, we explain why it is necessary to move from string and token processing to concept-based text understanding and mining, and to integrate information based on real-world concepts. Furthermore, we present state-of-the-art solutions to the related problems. At the heart of this work is a mechanism, I-Track, that can automatically resolve concept ambiguity in text: a given entity (a person, a location, or an organization) may be mentioned in text in multiple, ambiguous ways. This talk will cover three related projects that we have been working on over the last few years: entity identification in text, semantic integration across text and databases, and a supervised clustering framework for entity identification.


Xin Li

Xin Li is currently a graduate student in the CS department at the University of Illinois at Urbana-Champaign. He expects to receive his Ph.D. in May 2005. He received his M.S. in Computer Science from Peking University, China, and his B.S. in Computer Science from Wuhan University, China. He joined the Cognitive Computation Group at UIUC in summer 2000, where he leads the project on intelligent access to textual information. His advisor is Professor Dan Roth.

His research focuses on machine learning and natural language processing (NLP), with an emphasis on intelligent access to textual information. Specific topics include text-mining tasks such as semantic integration across databases and text, question answering, and entity and relation extraction; learning-related tasks such as supervised clustering and multi-class classification; and semantic integration in databases. Xin Li has published many papers in major conferences such as AAAI, COLING, NAACL, and CoNLL.