Partially Labeled Topic Models for Interpretable Text Mining
Much of the world’s electronic text is annotated with human-interpretable labels, such as tags on web pages and subject codes on academic publications. Effective text mining in this setting requires models that can flexibly account for the textual patterns that underlie the observed labels while still discovering unlabeled topics. Neither supervised classification, with its focus on label prediction, nor purely unsupervised learning, which does not model the labels explicitly, is appropriate. In this paper, we present two new partially supervised generative models of labeled text, Partially Labeled Dirichlet Allocation (PLDA) and the Partially Labeled Dirichlet Process (PLDP). These models make use of the unsupervised learning machinery of topic models to discover the hidden topics within each label, as well as unlabeled, corpus-wide latent topics. We explore applications with qualitative case studies of tagged web pages from del.icio.us and PhD dissertation abstracts, demonstrating improved model interpretability over traditional topic models. We use the many tags present in our del.icio.us dataset to quantitatively demonstrate that the new models correlate more strongly with human relatedness scores than several strong baselines.
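The key structural idea, that each observed label owns its own set of hidden topics while all documents may also draw from shared background topics, can be illustrated with a toy generative sketch. This is a simplified illustration, not the paper's exact model: the function name `plda_generate`, the uniform draws (in place of PLDA's Dirichlet-multinomial machinery), and the toy word lists are all assumptions made for brevity.

```python
import random

def plda_generate(doc_labels, topic_word, topics_per_label=2,
                  n_background=1, doc_len=10, rng=None):
    """Toy sketch of a PLDA-style generative story for one document.

    doc_labels: labels observed on this document (e.g. del.icio.us tags).
    topic_word: dict mapping a topic id (label, index) -> list of words,
                standing in for per-topic word distributions.
    """
    rng = rng or random.Random(0)
    # The document's admissible topics are restricted to its labels'
    # hidden topics plus the corpus-wide background topics.
    topics = [(lab, k) for lab in doc_labels for k in range(topics_per_label)]
    topics += [("background", k) for k in range(n_background)]
    doc = []
    for _ in range(doc_len):
        z = rng.choice(topics)         # stand-in for a Dirichlet-multinomial topic draw
        w = rng.choice(topic_word[z])  # stand-in for drawing a word from topic z
        doc.append((z, w))
    return doc
```

Because a document's topic choices are confined to its own labels plus the background set, inference on a real corpus would attribute label-specific vocabulary to per-label topics and shared vocabulary to the corpus-wide ones, which is what makes the learned topics interpretable relative to unconstrained LDA.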