Word-Level Language Identification Using CRF: Code-Switching Shared Task Report of MSR India System

Gokul Chittaranjan; Yogarshi Vyas; Kalika Bali; Monojit Choudhury

Proceedings of the First Workshop on Computational Approaches to Code Switching | October 2014

Published by Association for Computational Linguistics

We describe a CRF-based system for word-level language identification of code-mixed text. Our method uses lexical, contextual, character n-gram, and special character features, and can therefore be replicated easily across languages. Its performance is benchmarked against the test sets provided by the shared task on code-mixing (Solorio et al., 2014) for four language pairs, namely English-Spanish (En-Es), English-Nepali (En-Ne), English-Mandarin (En-Cn), and Standard Arabic-Arabic Dialects (Ar-Ar). The experimental results show consistent performance across the language pairs.
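
To make the four feature families named in the abstract concrete, the following is a minimal sketch of per-token feature extraction for a sequence labeler such as a CRF. The abstract does not list the exact features, so everything here is an assumption made for illustration: the function names `char_ngrams`, `token_features`, and `sentence_features`, the n-gram range, the one-word context window, and the optional `lexicon` word list are all hypothetical and are not the authors' implementation.

```python
import string


def char_ngrams(word, n_min=1, n_max=3):
    """Return all character n-grams of the lowercased word for n in [n_min, n_max]."""
    w = word.lower()
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]


def token_features(tokens, i, lexicon=None):
    """Build a feature dictionary for tokens[i] (illustrative feature set, assumed)."""
    word = tokens[i]
    feats = {
        # Lexical: the word form itself and its capitalization pattern.
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        # Special characters: digits, punctuation, social-media prefixes.
        "has_digit": any(c.isdigit() for c in word),
        "has_punct": any(c in string.punctuation for c in word),
        "starts_with_at_or_hash": word[:1] in ("@", "#"),
        # Contextual: neighbouring words, with boundary markers.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Character n-grams: one binary feature per n-gram.
    for gram in char_ngrams(word):
        feats["ng=" + gram] = True
    # Optional lexicon lookup (hypothetical word list for one of the languages).
    if lexicon is not None:
        feats["in_lexicon"] = word.lower() in lexicon
    return feats


def sentence_features(tokens, lexicon=None):
    """Return one feature dictionary per token in the sentence."""
    return [token_features(tokens, i, lexicon) for i in range(len(tokens))]


if __name__ == "__main__":
    example = ["@maria", "vamos", "to", "the", "party"]
    for tok, f in zip(example, sentence_features(example, lexicon={"to", "the", "party"})):
        print(tok, f["prev"], f["next"], f["in_lexicon"])
```

A list of such dictionaries, one per token, is the kind of input that CRF toolkits accepting per-token feature mappings typically expect. Because none of the features depend on language-specific resources other than the optional word list, the same extractor could be reused unchanged across language pairs, which is the replicability argument the abstract makes.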