
Monojit Choudhury

Principal Data and Applied Scientist, Turing India

About

Currently, I am a Principal Data and Applied Scientist at Turing India. We build large universal language models that form the backbone of various Microsoft products. Prior to this, I was a Principal Researcher at Microsoft Research Lab India, and I still collaborate closely with my colleagues from MSR. My research interests cut across the areas of Linguistics, Cognition, Computation and Society. I have a B.Tech and a PhD in Computer Science and Engineering from IIT Kharagpur, and I joined Microsoft Research in 2007.

I am interested in understanding the nature of massively multilingual language models (MMLMs), such as Turing ULR, XLM-R and mBERT. While these models offer a promise of equality across the world’s languages, in practice they do not work equally well for all languages, nor do we have pre-trained models for every language. How can we make language technologies more equitable across languages?

As a part of Project LITMUS – Linguistically Aware Testing of Multilingual Systems, I worked on systematic evaluation and estimation of MMLM performance across languages, even in the absence of test datasets. This in turn enables us to understand the factors that affect MMLM performance and build optimal data collection strategies to ensure more equal or equitable performance.

I also collaborate on Project ELLORA, where our aim is to enable speakers of low-resource languages through appropriate language technology. We are working with collaborators and NGOs on extremely low-resourced and lesser-known languages such as Gondi, Mundari, Idu Mishmi, Sheng, Swahili and Igbo. Our decade-long experience of working with low-resource language communities tells us that technology is seldom the bottleneck; more often than not, technological interventions fail when the human and social contexts are not taken into consideration. On the other hand, participatory design and co-design, whenever possible, lead to simpler yet effective technological solutions.

In the past, I have worked extensively in the area of code-mixing and script-mixing. Code-mixing, or the use of more than one language in a single conversation or utterance, is a phenomenon observed in all multilingual societies. Due to social media and online forums, code-mixing is now rampant on the Internet. As a part of Project Mélange, we have developed a set of tools and techniques for processing code-mixed text and speech, as well as a deeper understanding of the sociolinguistic and pragmatic factors that influence the nature of mixing (see also our code-mixing blog).

I also work on various NLP and Information Retrieval techniques for Indian languages. In the past, I have worked on computational musicology, language evolution, the evolution of the structure of Web search queries, and complex networks. I like designing language puzzles, and am closely involved with the organization of the Indian national Linguistics Olympiad (PLO), as well as the Asia Pacific Linguistics Olympiad.