Random Forests and the Data Sparseness Problem in Language Modeling

  • Peng Xu | Johns Hopkins University

In this talk, we explore the use of Random Forests (RFs) in language modeling: the problem of predicting the next word from the words already seen. The goal of this work is to develop a new language-model smoothing technique based on randomly grown Decision Trees (DTs) and interpolated Kneser-Ney smoothing. This new technique aims to alleviate the data sparseness problem in language modeling and is complementary to many existing techniques.
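For readers unfamiliar with the smoothing baseline, here is a minimal sketch of interpolated Kneser-Ney for bigrams. The function and variable names are hypothetical, and the absolute discount of 0.75 is a common default rather than a value taken from the talk:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Build an interpolated Kneser-Ney bigram model from a token list.

    Illustrative sketch only: names and the fixed discount are assumptions,
    not the speaker's implementation.
    """
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])      # c(v): how often v is a history
    continuations = defaultdict(set)           # distinct histories preceding w
    followers = defaultdict(set)               # distinct words following v
    for v, w in bigram_counts:
        continuations[w].add(v)
        followers[v].add(w)
    total_bigram_types = len(bigram_counts)

    def prob(w, v):
        # P_KN(w|v) = max(c(v,w) - D, 0)/c(v) + lambda(v) * P_cont(w),
        # where P_cont(w) = |{v': c(v',w) > 0}| / |distinct bigram types|.
        p_cont = len(continuations[w]) / total_bigram_types
        c_v = history_counts[v]
        if c_v == 0:                           # unseen history: pure backoff
            return p_cont
        lam = discount * len(followers[v]) / c_v
        return max(bigram_counts[(v, w)] - discount, 0) / c_v + lam * p_cont

    return prob
```

The discounted mass removed from seen bigrams is redistributed via the continuation probability, which credits words that follow many distinct histories rather than words that are merely frequent.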

We study our RF approach in the context of n-gram language modeling. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories are long (more than four words). We show that our RF language models outperform interpolated Kneser-Ney n-gram models, reducing both perplexity (PPL) and word error rate (WER) in large-vocabulary speech recognition systems.
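Perplexity, the intrinsic metric mentioned above, is the exponentiated average negative log-probability the model assigns to the test data. A small sketch for a bigram model (function names are hypothetical):

```python
import math

def perplexity(bigram_prob, tokens):
    """Perplexity of a conditional model bigram_prob(w, v) = P(w | v)
    over a token sequence: PPL = exp(-(1/N) * sum_i log P(w_i | w_{i-1})).

    Illustrative sketch; assumes every predicted probability is nonzero.
    """
    log_prob = sum(math.log(bigram_prob(w, v))
                   for v, w in zip(tokens, tokens[1:]))
    n_predictions = len(tokens) - 1
    return math.exp(-log_prob / n_predictions)
```

A uniform model over a vocabulary of size V has perplexity exactly V, which is why lower perplexity indicates a model that spreads less probability mass over wrong continuations.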

The new technique developed in this work is general. We will show that it works well when combined with other techniques, including word clustering and the structured language model (SLM).
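A standard way to combine complementary language models is linear interpolation of their conditional distributions; the sketch below illustrates that idea (hypothetical names, and a fixed mixing weight where practice would tune it on held-out data, e.g. by EM). The talk does not specify that this is the combination method used:

```python
def interpolate(p1, p2, weight=0.5):
    """Linearly mix two conditional models p(w, v) = P(w | v).

    `weight` is a hypothetical mixing coefficient; if both inputs are
    proper distributions over w, the mixture is too.
    """
    return lambda w, v: weight * p1(w, v) + (1 - weight) * p2(w, v)
```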

Speaker Details

Peng Xu was born in China, where he received two BS degrees (in Engineering Mechanics and in Electronics & Computer Technology) from Tsinghua University in 1995, and an MS degree (in Pattern Recognition & Artificial Intelligence) from the Institute of Automation, Chinese Academy of Sciences, in 1998. After spending one year at Brown University, he transferred to Johns Hopkins University as a Ph.D. candidate in the Department of Electrical and Computer Engineering, where he began his language modeling work in the Center for Language and Speech Processing (CLSP) under the supervision of Prof. Frederick Jelinek. While his research focuses on statistical language modeling, he is also interested in statistical machine learning, information retrieval, and statistical machine translation.
