Abstract

Recently, topic models have emerged as a powerful tool for analyzing document collections in an unsupervised fashion. The seminal work by Blei et al. [1] assumes that each document is a mixture of a set of topics, where each topic is in turn a distribution over the words in the vocabulary. When fit to a document collection, the model inherently uses co-occurrence information to group semantically related words into a single topic. Since then, many extensions have been proposed to improve the semantic coherence of the words in each topic. Because these approaches are unsupervised, a user has no way to specify the intended topics and is, moreover, left with the job of making sense of the topics learnt by the model.

In this paper, we address this problem with a simple yet effective mechanism that guides the model toward the desired topics: the user provides seed words for each topic. For example, the user can gather seed words for each of the dmoz categories and provide them as input. This enables the model to analyze a document collection in terms of these well-known categories.

As in LDA, each document is assumed to be a mixture over topics, but each topic is a convex combination of a seed topic and a traditional LDA-style topic. Here we assume a one-to-one correspondence between seed topics and LDA-style topics, but this can easily be modified to handle the case where a topic is associated with multiple seed topics and vice versa. To understand the intuition, consider a seed topic (say, topic 4) with words {grain, wheat, corn}. By assigning all the related words, such as ‘tonnes’, ‘agriculture’, and ‘production’, to the same topic (i.e. topic 4), the model can put high probability mass on topic 4 for agriculture-related documents. Otherwise, the model has to split the probability mass between topic 4 and whichever other topic contains the new agriculture-related words, and as a result it pays a larger penalty. Thus the model starts from the seed topics and groups related words into the same topic; as a consequence, we hope the document-topic distributions become more focussed.
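
The convex combination described above can be illustrated with a minimal sketch. It is not the paper's implementation: the mixing weight `pi`, the uniform distribution over the seed words, the helper name `mix_topic`, and all probabilities are hypothetical values chosen only to show how a seed topic and an LDA-style topic would blend into one word distribution.

```python
def mix_topic(seed_topic, regular_topic, pi):
    """Blend two word distributions: pi * seed + (1 - pi) * regular.

    `pi` is an assumed mixing weight in [0, 1]; the abstract does not
    say how this weight is set or learnt.
    """
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    vocab = set(seed_topic) | set(regular_topic)
    return {
        w: pi * seed_topic.get(w, 0.0) + (1.0 - pi) * regular_topic.get(w, 0.0)
        for w in vocab
    }


# Seed topic 4 from the text, assumed uniform over its seed words.
seed_topic = {"grain": 1 / 3, "wheat": 1 / 3, "corn": 1 / 3}

# Hypothetical LDA-style topic over a tiny illustrative vocabulary.
regular_topic = {
    "grain": 0.1, "wheat": 0.1, "corn": 0.1,
    "tonnes": 0.3, "agriculture": 0.2, "production": 0.2,
}

topic = mix_topic(seed_topic, regular_topic, pi=0.5)

# The blend is still a probability distribution, and related non-seed
# words such as "tonnes" keep non-zero mass in the same topic.
for word, prob in sorted(topic.items(), key=lambda kv: -kv[1]):
    print(f"{word:12s} {prob:.3f}")
```

Because both inputs sum to one, any `pi` in [0, 1] yields a valid distribution; a larger `pi` keeps the topic closer to its seed words, while a smaller one lets the co-occurrence-driven component dominate.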