Microsoft Research IME Corpus

Hisami Suzuki, Jianfeng Gao

MSR-TR-2005-168

This document describes the Microsoft Research IME (MSR-IME) Corpus, a test corpus for language modeling research based on the task of Japanese character conversion for text input. The corpus aims to facilitate language modeling research by providing a data set for a task that is realistic and non-trivial yet easy to use. It consists of 6,000 sentences and includes the gold-standard conversion output, the corresponding phonetic input stream in kana, and a 100-best list of conversion candidates prepared specifically for the re-ranking formulation of the task. This report describes the task of Japanese character conversion for text input, the details of the corpus, and the guidelines used to assign readings to the reference file.
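
The re-ranking formulation mentioned above can be illustrated with a minimal Python sketch: given the candidate conversions for one kana input, a scoring function picks the preferred candidate, and the result is compared with the gold-standard conversion. Everything here is an assumption made for illustration and is not taken from the report: the function names, the toy scores, the in-memory data layout, and the use of character error rate as the evaluation measure. The corpus's actual file format is not shown.

```python
from typing import Callable, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rerank(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate (earliest one wins a tie)."""
    return max(candidates, key=score)


def character_error_rate(hyps: Sequence[str], refs: Sequence[str]) -> float:
    """Total character edits divided by total reference characters."""
    errors = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    total = sum(len(r) for r in refs)
    return errors / total


# Hypothetical example: the kana input "きしゃ" has several homophonous
# conversions. The scores below are made up and stand in for a language model.
candidates = ["記者", "汽車", "貴社"]
reference = "貴社"
toy_scores = {"記者": -4.2, "汽車": -5.0, "貴社": -3.1}

best = rerank(candidates, lambda c: toy_scores[c])
baseline_cer = character_error_rate([candidates[0]], [reference])
reranked_cer = character_error_rate([best], [reference])
```

In this toy case the first candidate differs from the reference in both characters, while the re-ranked choice matches it exactly. A real experiment would replace `toy_scores` with scores from the language model under study and run over all sentences in the corpus.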