Abstract

This report documents the participation of Microsoft Research India (MSR India) in the Crosslingual Information Retrieval (CLIR) evaluation organized by the Forum for Information Retrieval Evaluation 2010 [FIRE 2010]. MSR India participated in two crosslingual evaluation tasks, namely the Hindi-English and Tamil-English tasks, in addition to the English-English monolingual task. Our core CLIR engine employed a language-modeling approach, using query-likelihood document ranking and a probabilistic translation lexicon learned from English-Hindi and English-Tamil parallel corpora. In addition, we employed two specific techniques to deal with out-of-vocabulary (OOV) terms in the crosslingual runs: first, generating transliterations directly or transitively, and second, mining possible transliteration equivalents from the documents retrieved in the first pass. We show experimentally that each of these techniques significantly improved the overall retrieval performance of our crosslingual IR system. Using all of the topic, description, and narrative information, our system achieved a peak MAP of 0.5133 in the monolingual English-English task; in the crosslingual tasks, our systems achieved a peak MAP of 0.4977 in Hindi-English and 0.4145 in Tamil-English. The post-task analyses indicate that mining appropriate transliterations from the top results of the first-pass retrieval enhanced the overall crosslingual performance of our system and improved performance on a larger number of individual queries. Our Hindi-English crosslingual retrieval performance was nearly equal (~97%) to the English-English monolingual retrieval performance, indicating the effectiveness of our approaches to handling OOV terms in enhancing the baseline performance of our CLIR system.
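As an illustration of the ranking idea summarized above, the following is a minimal sketch of query-likelihood scoring through a probabilistic translation lexicon. It is not the system's implementation: the function name `cross_lingual_score`, the Jelinek-Mercer smoothing with weight `lam`, and the toy lexicon format are all assumptions made for this example.

```python
import math
from collections import Counter


def cross_lingual_score(query_terms, doc_tokens, coll_counts, coll_len,
                        lexicon, lam=0.5):
    """Log query likelihood of a source-language query given an English document.

    lexicon maps a source-language term to {english_word: P(term | english_word)}.
    Smoothing here is Jelinek-Mercer with weight lam (an assumption of this sketch).
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    log_p = 0.0
    for q in query_terms:
        p_q = 0.0
        # Sum over the English words that the query term may translate to.
        for w, p_trans in lexicon.get(q, {}).items():
            p_doc = doc_counts[w] / doc_len if doc_len else 0.0
            p_coll = coll_counts[w] / coll_len if coll_len else 0.0
            p_q += p_trans * (lam * p_doc + (1.0 - lam) * p_coll)
        if p_q == 0.0:
            # No usable translation (e.g. an OOV term): the term contributes
            # nothing to the score in this sketch.
            continue
        log_p += math.log(p_q)
    return log_p


# Toy data, hypothetical and for illustration only.
doc_a = ["river", "bank", "water"]
doc_b = ["bank", "money", "loan"]
coll_counts = Counter(doc_a + doc_b)
coll_len = len(doc_a) + len(doc_b)
lexicon = {
    "nadi": {"river": 0.9, "stream": 0.1},
    "paani": {"water": 1.0},
}
query = ["nadi", "paani"]

score_a = cross_lingual_score(query, doc_a, coll_counts, coll_len, lexicon)
score_b = cross_lingual_score(query, doc_b, coll_counts, coll_len, lexicon)
```

In this sketch a query term missing from the lexicon is simply skipped, which shows why OOV terms such as names can hurt retrieval, and why adding transliteration candidates for them to the lexicon can help.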