Robust Automated Topic Identification

PhD Thesis: University of Southern California

As the amount of on-line text keeps growing, it becomes increasingly difficult for humans to process the deluge of information in the time available. We need automatic text processing systems to help us scan through huge volumes of texts, route them to relevant parties, filter them into prespecified categories, or even summarize them. To achieve this, one crucial step is to identify the major topics of the texts, since summarization, text routing, and similar tasks centrally require knowing the topics. In this research, we investigated several topic identification methods and developed three major results:

(1) We extended existing word-based frequency counting methods to form a new concept-based frequency counting method that rests on the assumption "the more a concept is mentioned in a text, the more important it is." We used the knowledge base WordNet to generalize words into concepts and showed how to select concepts of the appropriate degree of generalization.

(2) We studied patterns of word co-occurrence (topic signatures) consisting of sets of keywords that uniquely identify the topics of interest. We showed how to acquire keywords from texts pre-classified for each topic, using the tf * idf measure. We also demonstrated how to identify topics using topic signatures, introduced confusion sets and multi-level topic signatures, and discussed the problems associated with multiple topics in a text.

(3) We described, implemented, and evaluated a method to learn the Optimal Position Policy (OPP) for finding topic-rich sentences in texts. This work is based on the Position Hypothesis: in genres with fixed discourse structure, the (ordinal) position of a sentence is related to its importance in a text. We showed how to verify the Position Hypothesis using topic keywords, empirically identify important sentence positions in a genre or domain, and quantitatively evaluate the results with various measures.
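
The idea behind result (3) can be illustrated with a minimal sketch. It is not the procedure used in the thesis: it assumes the topic keywords are already given, measures the average number of keyword hits at each ordinal sentence position over a small corpus, and ranks positions by that average. The function names position_yield and rank_positions, the toy documents, and the keyword set are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def position_yield(documents, keywords):
    """Average number of topic-keyword hits per ordinal sentence position.

    documents: list of documents, each a list of sentence strings.
    keywords:  set of lowercase topic keywords (assumed to be given).
    """
    totals = defaultdict(float)  # position -> summed keyword hits
    counts = defaultdict(int)    # position -> number of sentences seen
    for sentences in documents:
        for position, sentence in enumerate(sentences, start=1):
            words = sentence.lower().split()
            # Strip simple trailing punctuation before matching keywords.
            hits = sum(1 for w in words if w.strip(".,;:!?") in keywords)
            totals[position] += hits
            counts[position] += 1
    return {p: totals[p] / counts[p] for p in totals}


def rank_positions(yields):
    """Positions sorted by descending average yield; ties favor earlier positions."""
    return sorted(yields, key=lambda p: (-yields[p], p))


# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["Stocks fell sharply on Monday.",
     "Analysts blamed weak earnings.",
     "The weather was mild."],
    ["The market rallied as stocks rose.",
     "Traders were relieved.",
     "Earnings season continues."],
]
keywords = {"stocks", "earnings", "market"}

yields = position_yield(docs, keywords)
policy = rank_positions(yields)
```

In this toy corpus the first sentence position has the highest average keyword yield, so it comes first in the ranking. A ranking of this kind is one simple way to express a position policy for a genre, under the assumptions stated above.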