Microsoft Research IME Corpus

MSR-TR-2005-168

This document describes the Microsoft Research IME (MSR-IME) Corpus, a test corpus for language modeling research based on the task of Japanese character conversion for text input. The aim of the corpus is to facilitate research in language modeling by providing a data set for a realistic, non-trivial task that is nonetheless easy to use. The corpus consists of 6,000 sentences, and includes the gold-standard output of conversion, the corresponding input phonetic stream in kana, and the 100-best list of conversion candidates, specifically prepared for the re-ranking formulation of the task. This report describes the task of Japanese character conversion for text input, the details of the corpus, and the guidelines used to assign the readings to the reference file.

Publication Downloads

Microsoft Research IME Corpus

December 21, 2005

This download consists of data only: it provides a test data set for the task of Japanese character conversion for text input. The data set has three parts: (1) reference files, which contain Japanese sentences randomly extracted from news articles (no more than one sentence per article); (2) reading files, which contain the corresponding kana readings for the sentences in the reference files; (3) n-best files, which contain the 100-best conversion candidates for each sentence in the reading files.
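
As a rough illustration of the re-ranking formulation, the sketch below picks one candidate from each n-best list using a scoring function and compares the choice with the reference sentence. Everything in it is an assumption made for illustration: the on-disk file format is not described here, so the data is held in plain Python lists; the sentences are made-up toy examples, not corpus data; `toy_score` is a hypothetical stand-in for a real language model; and character error rate is used as one plausible metric, not necessarily the one defined in the report.

```python
from typing import Callable, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rerank(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate (the first one wins ties)."""
    return max(candidates, key=score)


def character_error_rate(references: List[str],
                         nbest_lists: List[List[str]],
                         score: Callable[[str], float]) -> float:
    """Total edit distance of the chosen candidates over total reference length."""
    errors = 0
    total = 0
    for ref, cands in zip(references, nbest_lists):
        errors += edit_distance(rerank(cands, score), ref)
        total += len(ref)
    return errors / total


# Toy data (NOT from the corpus): one reference sentence and a 3-best list.
references = ["今日は晴れです"]
nbest_lists = [["京は晴れです", "今日は晴れです", "今日は腫れです"]]


def toy_score(sentence: str) -> float:
    """Hypothetical scorer: rewards two hand-picked words."""
    return sentence.count("今日") + sentence.count("晴れ")


def first_candidate(sentence: str) -> float:
    """Constant score, so rerank() keeps the top-listed candidate."""
    return 0.0


baseline_cer = character_error_rate(references, nbest_lists, first_candidate)
reranked_cer = character_error_rate(references, nbest_lists, toy_score)
```

In practice `toy_score` would be replaced by the language model under study, and the three lists would be loaded from the reference and n-best files once their format is known.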