Scalable Knowledge Harvesting

  • Deepak Ravichandran | University of Southern California

The performance of many Natural Language Processing (NLP) systems has reached a plateau using existing techniques. There is a general consensus that systems must integrate semantic or world knowledge in one form or another to provide the additional information needed to improve the quality of results. But building semantic resources that are both adequate and large enough remains a difficult, unsolved problem. In my work, I attack the problem of very-large-scale acquisition of semantic knowledge by exploiting natural language text available on the Internet. In particular, I concentrate on one problem: extracting is-a relations from a very large corpus (70 million web pages; 26 billion words) downloaded from the Internet. Since the amount of data involved is two orders of magnitude greater than in previously published work, the algorithms had to be designed to be highly scalable. This was achieved by:

  1. Using a novel pattern-based learning algorithm that exploits local features (see the first sketch below).
  2. Using a clustering algorithm that applies randomized techniques to co-occurrence (global) features in linear time (see the second sketch below).
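
As a rough illustration of step 1, the sketch below shows pattern-based is-a extraction over raw sentences, written in Python. It is a minimal, hypothetical rendering: the hand-written Hearst-style patterns and the crude capitalized-word proxy for a noun phrase are illustrative stand-ins for the patterns the system actually learns from data.

    import re

    # Hand-written Hearst-style patterns; the talk's patterns are learned
    # automatically from text, so these are illustrative stand-ins only.
    NP = r"((?:[A-Z][\w-]*\s?)+)"            # crude proper-noun-phrase matcher
    PATTERNS = [
        (re.compile(r"(\w+) such as " + NP), "class_first"),
        (re.compile(r"(\w+), including " + NP), "class_first"),
        (re.compile(NP + r"and other (\w+)"), "instance_first"),
    ]

    def extract_isa(sentence):
        """Return (instance, class) candidate pairs from one sentence."""
        pairs = []
        for pat, order in PATTERNS:
            for m in pat.finditer(sentence):
                cls, inst = m.groups() if order == "class_first" else m.groups()[::-1]
                pairs.append((inst.strip(), cls.strip()))
        return pairs

    print(extract_isa("He toured cities such as Los Angeles before the talk."))
    # -> [('Los Angeles', 'cities')]

At web scale, such local patterns are cheap: each sentence is processed independently, so the extraction step parallelizes trivially across the corpus.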
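Step 2's linear-time behavior comes from locality-sensitive hashing of co-occurrence vectors. The following minimal sketch (assuming NumPy, and Charikar-style random-hyperplane hashing for cosine similarity) shows the core idea: map each vector to a short bit signature, then bucket equal signatures instead of comparing all pairs. A fuller version would also search nearby signatures, e.g. via sorted random permutations of the bit-streams.

    import numpy as np
    from collections import defaultdict

    def lsh_signatures(vectors, n_bits=64, seed=0):
        """Map each row (one noun's co-occurrence vector) to an n_bits bit
        signature using random hyperplanes. Vectors with high cosine
        similarity agree on most bits, so Hamming distance between
        signatures cheaply approximates cosine distance."""
        rng = np.random.default_rng(seed)
        hyperplanes = rng.standard_normal((n_bits, vectors.shape[1]))
        # Each bit records which side of one random hyperplane the vector lies on.
        return vectors @ hyperplanes.T > 0

    # Usage: bucket items whose 16-bit signatures match exactly; each bucket
    # is a candidate cluster, found in time linear in the number of items.
    vectors = np.random.default_rng(1).standard_normal((1000, 300))
    sigs = lsh_signatures(vectors, n_bits=16)
    buckets = defaultdict(list)
    for i, sig in enumerate(sigs):
        buckets[sig.tobytes()].append(i)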

Using these algorithms, I extract is-a relations from text to build a very large table of such relations. The extracted relations are then evaluated through a range of different applications.

Speaker Details

Deepak Ravichandran is a Ph.D. candidate at the University of Southern California (USC) and works as a graduate research assistant at USC's Information Sciences Institute. He holds an M.S. (2002), also from the University of Southern California, and a B.E. (Hons.) in Computer Engineering (2000) from the University of Bombay, India. His primary research interests are Natural Language Processing and scalable machine learning algorithms.
