Abstract

A semantic class is a collection of items (words or phrases) that stand in a semantic peer or sibling relationship. This paper studies the use of topic models to automatically construct semantic classes, taking as the source data a collection of raw semantic classes (RASCs), which were extracted by applying predefined patterns to web pages. The primary requirement (and challenge) here is dealing with multi-membership: an item may belong to multiple semantic classes, and we need to discover as many as possible of the different semantic classes the item belongs to. To adopt topic models, we treat RASCs as “documents”, items as “words”, and the final semantic classes as “topics”. Appropriate preprocessing and postprocessing are performed to improve the quality of the results, to reduce computation cost, and to tackle the fixed-k constraint of a typical topic model. Experiments conducted on 40 million web pages show that our approach can yield better results than alternative approaches.
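
As a rough illustration of the mapping described above (RASCs as “documents”, items as “words”), the following minimal Python sketch turns a few toy RASCs into a bag-of-words corpus of the kind a typical topic model consumes. The example RASCs, the function names, and the `(item_id, count)` representation are assumptions made for illustration only; they are not taken from the paper's implementation, and no topic model is actually fitted here.

```python
from collections import Counter

# Hypothetical toy RASCs. Each RASC plays the role of one "document",
# and each item in it plays the role of one "word".
rascs = [
    ["lincoln", "washington", "jefferson"],  # e.g. from a list of presidents
    ["lincoln", "ford", "toyota"],           # e.g. from a list of car brands
    ["lincoln", "omaha", "denver"],          # e.g. from a list of cities
]


def build_vocabulary(rascs):
    """Assign an integer id to every distinct item, in order of first appearance."""
    vocab = {}
    for rasc in rascs:
        for item in rasc:
            vocab.setdefault(item, len(vocab))
    return vocab


def to_bag_of_words(rasc, vocab):
    """Represent one RASC as a sorted list of (item_id, count) pairs."""
    counts = Counter(vocab[item] for item in rasc)
    return sorted(counts.items())


def rascs_containing(item, rascs):
    """Return the indices of all RASCs in which the item occurs."""
    return [i for i, rasc in enumerate(rascs) if item in rasc]


vocab = build_vocabulary(rascs)
corpus = [to_bag_of_words(rasc, vocab) for rasc in rascs]
```

In this toy data, `rascs_containing("lincoln", rascs)` returns all three indices, which is the multi-membership situation the abstract refers to: a single item occurs in RASCs that point to different semantic classes, and a topic model fitted on such a corpus is expected to separate them into distinct “topics”.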