A Challenge Set for Advancing Language Modeling

  • Geoffrey Zweig
  • Chris J.C. Burges

Workshop on the Future of Language Modeling for HLT, NAACL-HLT 2012

Published by ACL/SIGPARSE

In this paper, we describe a new, publicly available corpus intended to stimulate research into language modeling techniques that are sensitive to overall sentence coherence. The task uses the Scholastic Aptitude Test’s sentence completion format. The test set consists of 1040 sentences, each of which is missing a content word. The goal is to select the correct replacement from among five alternatives. In general, all of the options are syntactically valid and reasonable with respect to local N-gram statistics. The set was generated by using an N-gram language model to produce a long list of likely words, given the immediate context. These options were then hand-groomed to identify four decoys that are globally incoherent yet syntactically correct. To ensure the right to public distribution, all the data is derived from out-of-copyright materials from Project Gutenberg. The test sentences were derived from five of Conan Doyle’s Sherlock Holmes novels, and we provide a large set of nineteenth and early twentieth century texts as training material.
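The task format can be illustrated with a minimal sketch of a local N-gram baseline: fill the blank with each of the five candidates, score the resulting sentence, and pick the highest-scoring one. This is not the authors' system. The add-one smoothed bigram model, the function names (`train_bigram`, `log_prob`, `choose`), the blank marker, and the toy sentences are all assumptions made for illustration and are not taken from the corpus.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a tokenized sentence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

def choose(template, candidates, unigrams, bigrams, blank="_____"):
    """Return the candidate whose filled-in sentence scores highest."""
    def score(candidate):
        filled = [candidate if tok == blank else tok for tok in template]
        return log_prob(filled, unigrams, bigrams)
    return max(candidates, key=score)

# Toy training data, invented for illustration (not from the corpus).
train = [
    "the detective examined the letter".split(),
    "the detective examined the room".split(),
    "the doctor opened the door".split(),
]
unigrams, bigrams = train_bigram(train)

# Hypothetical question in the five-option sentence completion format.
question = "the detective _____ the letter".split()
options = ["examined", "opened", "door", "room", "doctor"]
answer = choose(question, options, unigrams, bigrams)
```

Because the decoys in the actual test set were chosen to be plausible under local N-gram statistics, a baseline of this kind should be expected to separate the options far less cleanly there than it does on this toy example.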