Partially Labeled Topic Models for Interpretable Text Mining

KDD 2011

Much of the world’s electronic text is annotated with human-interpretable labels, such as tags on web pages and subject codes on academic publications. Effective text mining in this setting requires models that can flexibly account for the textual patterns that underlie the observed labels while still discovering unlabeled topics. Neither supervised classification, with its focus on label prediction, nor purely unsupervised learning, which does not model the labels explicitly, is appropriate. In this paper, we present two new partially supervised generative models of labeled text, Partially Labeled Dirichlet Allocation (PLDA) and the Partially Labeled Dirichlet Process (PLDP). These models make use of the unsupervised learning machinery of topic models to discover the hidden topics within each label, as well as unlabeled, corpus-wide latent topics. We explore applications with qualitative case studies of tagged web pages and PhD dissertation abstracts, demonstrating improved model interpretability over traditional topic models. We use the many tags present in our dataset to quantitatively demonstrate that the new models correlate more strongly with human relatedness scores than several strong baselines.
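To illustrate the core idea described above, the sketch below shows a collapsed Gibbs sampler in which each document may only draw topics from the labels it carries, plus a shared "latent" label available to every document for corpus-wide topics. This is a minimal illustration under assumed hyperparameters and data structures, not the authors' implementation; the function name `plda_gibbs`, the `topics_per_label` parameter, and the `"latent"` label name are all illustrative choices.

```python
import random
from collections import defaultdict

def plda_gibbs(docs, doc_labels, topics_per_label, iters=200,
               alpha=0.1, beta=0.01, seed=0):
    """Sketch of a PLDA-style collapsed Gibbs sampler (illustrative, not the paper's code).

    docs: list of documents, each a list of word tokens.
    doc_labels: list of label lists, one per document.
    Each label owns its own block of `topics_per_label` topics; the shared
    "latent" label models unlabeled, corpus-wide topics.
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})  # vocabulary size for smoothing

    # Assign each label (plus the shared "latent" label) its own topic ids.
    labels = sorted({l for ls in doc_labels for l in ls} | {"latent"})
    topic_ids, next_id = {}, 0
    for l in labels:
        topic_ids[l] = list(range(next_id, next_id + topics_per_label))
        next_id += topics_per_label

    # Key PLDA restriction: a document may only use topics of its own labels.
    allowed = [sorted(t for l in set(ls) | {"latent"} for t in topic_ids[l])
               for ls in doc_labels]

    # Count tables for collapsed Gibbs sampling.
    ndk = [defaultdict(int) for _ in docs]          # doc-topic counts
    nkw = [defaultdict(int) for _ in range(next_id)]  # topic-word counts
    nk = [0] * next_id                               # topic totals
    z = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.choice(allowed[d])  # random restricted initialization
            zs.append(t)
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # Sample a new topic, restricted to this document's labels.
                weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + V * beta) for k in allowed[d]]
                r = rng.random() * sum(weights)
                acc = 0.0
                for k, wt in zip(allowed[d], weights):
                    acc += wt
                    if acc >= r:
                        t = k
                        break
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return z, topic_ids
```

On a toy corpus, every token's sampled topic stays inside the topic blocks of its document's labels (or the shared latent block), which is exactly the constraint that makes the learned topics interpretable with respect to the labels.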