Sunayana Sitaram

Senior Researcher

About

Hello, and thanks for stopping by! I’m a Senior Researcher at Microsoft Research India where I work on Speech and Natural Language Processing. Currently, the focus of my work is on the evaluation of massive language models, multilingual speech and NLP, and code-switching. For up-to-date information about publications, please take a look at my Google Scholar page.

With collaborators from CMU, Columbia University, the University of Houston and MSR, I started the special sessions on Code-switching at Interspeech 2017-2019, which we organized as a two-day workshop at Interspeech 2020. This year, I am organizing the Computational Approaches to Linguistic Code-Switching (CALCS 2021) workshop at NAACL 2021.

My research goal is to make all the content in the world available to all the people in the world, regardless of the language they speak, their level of education, their age, their gender, or their special needs. So far, my main expertise has been in multilingual systems, particularly in enabling speech and NLP applications in languages that have very few linguistic resources. More recently, I have developed a keen interest in the evaluation of NLP systems, particularly in the challenges of evaluating multilingual systems.

I have been fortunate to work with many wonderful interns and Research Fellows: Sunit Sivasankaran (now a PhD student at INRIA), Sai Krishna Rallabandi (now a PhD student at CMU), Brij Mohan Lal Srivastava (now a PhD student at INRIA), Simran Khanuja (now at Google Research), Sanket Shah and Shaily Bhat.

*NEWS*

I was invited to be a speaker at the VAIBHAV summit organized by the Govt. of India for the AI/ML Speech Understanding panel.

I was part of the Students Meet Experts session at Interspeech 2020 organized by the ISCA-SAC.

Our survey paper on code-switching, which covers more than 250 papers, is now available on arXiv.

*NEW* Code and Datasets

GLUECoS, our benchmark for evaluating code-switched NLP, is now open source, along with scripts for pre-processing 11 code-switched datasets! Get the code here.

We built the first code-switched NLI dataset using Bollywood movie data as premises. Check out the paper and data here. We also released a tool for Language Identification from text here.

Code-switched data for the Language Identification shared task organized as part of the First Workshop on Speech Technologies for Code-switching for Multilingual Communities is now available for research use.

I also organized a shared task on ASR for low-resource languages in a special session at Interspeech 2018. As part of this challenge, we released data from three low-resource Indian languages, which is now available for research use.

Prior to coming to MSR India

I finished my PhD in 2015 at the Language Technologies Institute, Carnegie Mellon University. I worked on Text-to-Speech systems with my advisor Alan W Black, and my thesis was on pronunciation modeling for low-resource languages. From 2010 to 2012, I was a Master's student at CMU with Jack Mostow, and I worked on children’s oral reading prosody. I also interned with Microsoft Research India in Summer 2012, where we built a low-vocabulary ASR system for farmers in rural central India.