About

I have been a Researcher at Microsoft Research Lab India since 2007. My research interests cut across the areas of Linguistics, Cognition and Computation. Currently, I am working on script-mixing and code-mixing, especially in social media and web search. We have introduced the notion of Mixed-Script Information Retrieval, where the query and the documents can be in different scripts, and possibly in more than one script, but in the same language; the task is to retrieve the relevant documents across scripts. Such situations arise quite commonly for Indian languages, where the documents (say, song lyrics or posts on discussion forums) can be written either in the native script or in Romanized form. In fact, a large amount of Indian language (and also Greek, Arabic, etc.) content on the Web is available in Romanized form. Mixed-script IR entails challenges such as cross-script indexing, handling transliteration-induced spelling variations in queries and documents, code-mixed query understanding and query completion.

Code-mixing, or the use of more than one language in a single conversation or utterance, is a phenomenon observed in all multilingual societies. Due to social media and online forums, code-mixing is now rampant on the Internet. I am interested in developing core NLP techniques for identifying and processing code-mixed text. I am also interested in studying the extent and distribution of code-mixing, and the socio-linguistic factors that influence it. Check out more: Project Mélange and the code-mixing blog.

I also work on various NLP and Information Retrieval techniques for Indian languages. In the past, I have worked on computational musicology, language evolution, the evolution of the structure of Web search queries, and complex networks.

Projects

Publications

Videos

Downloads

Other

Teaching

  • Adjunct Faculty at International Institute of Information Technology, Hyderabad (since July 2017)
  • Adjunct Faculty at Indian Institute of Technology, Kharagpur (Spring 2008-09, Fall 2015-16). I taught the following courses:

Journal/Conference/Committees

2018

2017

Shared Tasks and Competitions

Mixed Script Information Retrieval, FIRE:
   In 2013, together with my former PhD student Dr. Rishiraj Saha Roy and Prof. Prasenjit Majumder, I introduced a couple of shared tasks on Mixed-script IR, which involved language identification and ad hoc retrieval of transliterated documents. Over the years, this has evolved into one of the most popular shared task tracks in FIRE, and new subtasks, e.g., Mixed-script Question-Answering, have been introduced:

CODS 2017 Data Challenge:

The Data Challenge at CODS 2017 (the flagship conference of ACM India KDD) involved differentiating word borrowing from code-mixing.

Linguistics Olympiad

I am actively involved in the organization of the Panini Linguistics Olympiad, the Indian national version of the International Linguistics Olympiad (IOL). These programs introduce high school students to the fascinating world of languages, linguistics and NLP through puzzles on decoding the rules and patterns of lesser-known, often endangered or extinct, languages. Personally, I enjoy designing and solving such puzzles, and training and interacting with the super-smart kids. We hosted the 14th International Linguistics Olympiad in Mysore, India, in July 2016. I was the coach-cum-team leader of the Indian national team at IOL 2014 (Beijing) and IOL 2015 (Blagoevgrad), and take on the same role at the upcoming IOL 2017 (Dublin). Currently, I am a member of the IOL Board of Directors and the co-chair of the National Board, Panini Linguistics Olympiad.