I have been a Researcher at Microsoft Research Lab India since 2007. My research interests cut across the areas of Linguistics, Cognition and Computation. Currently, I am working on script and code-mixing, especially in social media and web search. We have introduced the notion of Mixed-Script Information Retrieval, where the query and the documents are in the same language but can be in different, and possibly more than one, scripts; the task is to retrieve the relevant documents across scripts. Such situations arise quite commonly for Indian languages, where the documents (say song lyrics or posts on discussion forums) can be written either in the native script or in Romanized form. In fact, a large amount of Indian language (and also Greek, Arabic, etc.) content on the Web is available in Romanized form. Mixed-script IR entails challenges such as cross-script indexing, handling transliteration-induced spelling variations in queries and documents, code-mixed query understanding and query completion.

Code-mixing, or the use of more than one language in a single conversation or utterance, is a phenomenon observed in all multilingual societies. Due to social media and online forums, code-mixing is now rampant on the Internet. I am interested in developing core NLP techniques for identifying and processing code-mixed text. I am also interested in studying the extent and distribution of code-mixing and the socio-linguistic factors that influence it. Check out more: Project Mélange and the code-mixing blog.

I also work on various NLP and Information Retrieval techniques for Indian languages. In the past I have worked on computational musicology, language evolution, evolution of the structure of Web search queries and complex networks.


Project Mélange

Established: January 1, 2012

Project Mélange: Understanding MixEd LANguaGE and Code-mixing

The goal of Project Mélange is to understand the uses of and build tools around code-mixing. Multilingual communities exhibit code-mixing, that is, mixing of two or more socially stable languages in a single…

POS Tagging of Hindi-English Code Mixed Text from Social Media: Some Machine Learning Experiments

December 2015

We discuss Part-of-Speech (POS) tagging of Hindi-English Code-Mixed (CM) text from social media content. We propose extensions to the existing approaches and present a new feature set that addresses the transliteration problem inherent in social media. We achieve an 84% accuracy with the new feature set. We show that the context and joint modeling of language…

  Website

“I am borrowing ya mixing?” An Analysis of English-Hindi Code Mixing in Facebook

October 2014

Code-Mixing is a frequently observed phenomenon in social media content generated by multi-lingual users. The processing of such data for linguistic analysis as well as computational modelling is challenging due to the linguistic complexity resulting from the nature of the mixing as well as the presence of non-standard variations in spellings and grammar, and transliteration.…

  Website

The use of Melodic Scales in Bollywood Music: An Empirical Study

November 2013

Hindi film music, which is commonly referred to as Bollywood music, is one of the most popular forms of music in the world today. One of the reasons for its popularity has been the willingness of Bollywood composers to adopt and be influenced by various musical forms including Western pop, jazz, rock, and classical music.…

  Website

Entailment: An Effective Metric for Comparing and Evaluating Hierarchical and Non-hierarchical Annotation Schemes

July 2013

Hierarchical or nested annotation of linguistic data often co-exists with simpler non-hierarchical or flat counterparts, a classic example being that of annotations used for parsing and chunking. In this work, we propose a general strategy for comparing across these two schemes of annotation using the concept of entailment that formalizes a correspondence between them. We…

  Website

Challenges in Designing Input Method Editors for Indian Languages: The Role of Word-Origin and Context

November 2011

Back-transliteration based Input Method Editors are very popular for Indian Languages. In this paper we evaluate two such Indic language systems to help understand the challenge of designing a back-transliteration based IME. Through a detailed error-analysis of Hindi, Bangla and Telugu data, we study the role of phonological features of Indian scripts that are reflected…

I am an Associate Editor for ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). Please consider submitting your manuscripts.

I have taught a few courses at IIT Kharagpur as Adjunct Faculty.

Linguistics Olympiad

I am actively involved in the organization of the Panini Linguistics Olympiad, the Indian national version of the International Linguistics Olympiad (IOL). These programs introduce high school students to the fascinating world of languages, linguistics and NLP through puzzles on decoding the rules and patterns of lesser-known, often endangered or extinct, languages. Personally, I enjoy designing and solving such puzzles, and training and interacting with the super-smart kids. We hosted the 14th International Linguistics Olympiad in Mysore, India, in July 2016. I was the coach-cum-team leader of the Indian national team at IOL 2014 (Beijing) and IOL 2015 (Blagoevgrad), and will be again at the upcoming IOL 2017 (Dublin). Currently, I am a member of the IOL Board of Directors, and the co-chair of the National Board, Panini Linguistics Olympiad.

Shared Tasks and Competitions
Mixed Script Information Retrieval, FIRE:
In 2013, along with my former PhD student Dr. Rishiraj Saha Roy and Prof Prasenjit Majumder, I introduced a couple of shared tasks on Mixed-script IR, which involved language identification and ad hoc retrieval of transliterated documents. This has evolved into one of the most popular shared task tracks in FIRE over the years, and new subtasks, e.g., Mixed-script Question-Answering, have been introduced.

CODS 2017 Data Challenge:

Prof Animesh Mukherjee, Jasabanta Patra and I are organizing the Data Challenge in CODS 2017 (the flagship conference of ACM India KDD). The task involves differentiating word borrowing from code-mixing. Please do consider participating in the challenge, especially if you are into Natural Language Processing or Social Media Analytics.