Abstract

This paper presents a study of low-latency, domain-independent online vocabulary adaptation using limited amounts of supporting text data. The target applications include blind indexing of Internet content, low-latency indexing of new content, and domains where out-of-vocabulary (OOV) words are problematic. Several methods for document-specific adaptation that use a small amount of support metadata and the Internet are examined. A combination of word feature fusion and cross-file statistics pooling is shown to provide robust adaptation. The best method evaluated achieved an absolute reduction of 27.6% in OOV detection false alarm rate relative to the baseline word feature thresholding methods.