First Workshop on Speech Technologies for Code-switching in Multilingual Communities 2020

Summary

Code-switching is the use of multiple languages in the same utterance and is common in multilingual communities across the world. It poses many challenges to speech and NLP systems and has recently gained widespread interest in academia and industry. We organized special sessions on code-switching at Interspeech 2017, 2018, and 2019. In 2020, we are organizing this event as a virtual workshop immediately after Interspeech 2020.

We welcome papers related to, but not restricted to, the following aspects of code-switching:

  1. Code-switched speech recognition and synthesis
  2. Language modeling for code-switching
  3. Multilingual models for code-switching
  4. Data and resources for code-switching
  5. Code-switched chatbots and dialogue systems
  6. Code-switched speech analytics

*NEW* You can find the proceedings of the workshop here and view all pre-recorded talks in the Schedule tab.

*NEW* The workshop will be conducted on Microsoft Teams. All registered participants have been sent information about this by email.

Workshop timeline:

Shared task testing period: April 27-29, 2020
First paper submission deadline: June 5, 2020
Paper acceptance notification: July 20, 2020
1-page abstract submission deadline for special track: August 9, 2020
Abstract and paper acceptance notification (special track and second round): September 7, 2020
Camera-ready papers due (both rounds): September 20, 2020
Video submission deadline for accepted papers: October 9, 2020
Registration deadline: October 15, 2020
Workshop: October 30-31, 2020

Contact us:

Please write to sunayana.sitaram@microsoft.com

Schedule

Please note: This is a tentative schedule and is subject to change. All times are in China Standard Time (UTC+8).

Day 1: Friday, 30 October 2020

Time (CST) | Session | Session Chairs | Title | Speaker
20:30-21:30 | Opening remarks and Keynote | | Points of connection between linguistics and speech technology with regard to code-switching | Barbara Bullock, Jacqueline Toribio
21:30-21:40 | Break | | |
21:40-21:55 | Paper S1 | Thamar Solorio and Manuel Mager | A Study of Types and Characteristics of Code-Switching in Mandarin-English Speech | Leijing Hou
21:55-22:10 | Paper S1 | Thamar Solorio and Manuel Mager | Malayalam-English Code-Switched Speech Corpus: Development and Analysis | Sreeram Manghat
22:10-22:25 | Paper S1 | Thamar Solorio and Manuel Mager | Understanding forced alignment errors in Hindi-English code-mixed speech — a feature analysis | Ayushi Pandey
22:25-22:40 | Paper S1 Q&A | Thamar Solorio and Manuel Mager | Q&A |
22:40-22:50 | Break | | |
22:50-23:00 | Sponsor Talk | | Microsoft | Basil Abraham
23:00-23:15 | Paper S2 | Kalika Bali and Khyathi Chandu | Mere account mein kitna balance hai? – On building voice enabled Banking Services for Multilingual Communities | Akshat Gupta
23:15-23:30 | Paper S2 | Kalika Bali and Khyathi Chandu | Investigating Modelling Techniques for Natural Language Inference on Code-Switched Dialogues in Bollywood Movies | Anjana Umapathy
23:30-23:40 | Paper S2 Q&A | Kalika Bali and Khyathi Chandu | Q&A |
Day 2: Saturday, 31 October 2020

Time (CST) | Session | Session Chairs | Title | Speaker
20:30-20:45 | Shared Task | Sunayana Sitaram and Gustavo Aguilar | Opening remarks and description of shared task | Sunayana Sitaram
20:45-20:55 | Shared Task | Sunayana Sitaram and Gustavo Aguilar | Vocapia-LIMSI System for 2020 Shared Task on Code-switched Spoken Language Identification | Claude Barras
20:55-21:05 | Shared Task | Sunayana Sitaram and Gustavo Aguilar | Exploiting Spectral Augmentation for Code-Switched Spoken Language Identification | Pradeep R
21:05-21:15 | Shared Task | Sunayana Sitaram and Gustavo Aguilar | On detecting code mixing in speech using Discrete latent representations | Sai Krishna Rallabandi
21:15-21:25 | Shared Task | Sunayana Sitaram and Gustavo Aguilar | Language Identification for Code-Mixed Indian Languages In The Wild | Parav Nagarsheth
21:25-21:35 | Shared Task | Sunayana Sitaram and Gustavo Aguilar | Utterance-level Code-Switching Identification using Transformer Network | Krishna DN
21:35-21:45 | Shared Task Q&A | Sunayana Sitaram and Gustavo Aguilar | Q&A |
21:45-22:00 | Break | | |
22:00-22:10 | Sponsor Talk | | SpeechOcean | Yufeng Hao
22:10-22:25 | Paper S3 | Genta Indra Winata and Sai Krishna Rallabandi | Learning not to Discriminate: Task Agnostic Learning for Improving Monolingual and Code-switched Speech Recognition | Sanket Shah
22:25-22:40 | Paper S3 | Genta Indra Winata and Sai Krishna Rallabandi | Multilingual Bottleneck Features for Improving ASR Performance of Code-Switched Speech in Under-Resourced Languages | Trideba Padhi
22:40-22:55 | Paper S3 | Genta Indra Winata and Sai Krishna Rallabandi | The ASRU 2019 Mandarin-English Code-Switching Speech Recognition Challenge: Open Datasets, Tracks, Methods and Results | Xian Shi
22:55-23:10 | Paper S3 Q&A | Genta Indra Winata and Sai Krishna Rallabandi | Q&A |
23:10-23:20 | Closing remarks | | |

Keynote

Title: Points of connection between linguistics and speech technology with regard to code-switching

The study of multilingualism presents a unique challenge within the discipline of linguistics since, without exception, the major linguistic theories have been developed from a monolingual orientation. However, no language is completely insulated from all others; there is invariably some evidence of language contact in every grammar. In the speech of multilinguals, these effects can be significant. In this talk, we focus on the overt forms of language contact, as manifested by the phenomena of borrowing and code-switching. We will also touch on the covert form of contact, what we call convergence. Our aim is threefold: (i) to provide a comprehensive overview of the syntactic, lexical, phonetic, and pragmatic effects of borrowing, code-switching, and convergence; (ii) to examine the theories that attempt to account for linguistic patterns of code-switching and borrowing; and (iii) to highlight points of connection to speech technologies.

Barbara E. Bullock (Ph.D., Linguistics, University of Delaware, 1991) is Professor of Linguistics in the Department of French & Italian at the University of Texas. She specializes in the effects of bilingualism and language contact on linguistic structure, particularly on phonetic systems. Her research projects investigate sociophonetics, code-switching and borrowing, language variation and change, and computational approaches to multilingualism. With colleagues and students, she has begun to explore the power of corpus linguistics and NLP as effective tools in research on bilingual speech forms, working to quantify and visualize language mixing and its intermittency in order to enable cross-corpus comparisons and linguistic generalizations.

 

Almeida Jacqueline Toribio (Ph.D., Linguistics, Cornell University 1993) is Professor of Linguistics in the Department of Spanish and Portuguese at the University of Texas. Her research in formal linguistics investigates patterns of morphological and syntactic variation across languages and dialects as well as structural patterns of language mixing in bilingual code-switching; her complementary work in sociolinguistics considers the ways in which variables such as ethnicity, race, gender, literacy, and national origin are encoded through linguistic features and language choices. Her investigations employ diverse methods, from experimental elicitation, to ethnographies of rural and urban communities, to computational analyses of literary texts and popular media.

Shared Task

We will be organizing a shared task on Code-switched Spoken Language Identification (LID) in three language pairs – Gujarati-English, Telugu-English and Tamil-English. The shared task will consist of two subtasks:

Subtask A: Utterance-level identification of monolingual vs. code-switched utterances.

Subtask B: Frame-level identification of language in a code-switched utterance.

Registration for the shared task has started. Please fill in the form available at this link. Participants will receive download links to the data via email once they register with their email address.

More details about the shared task and baselines can be found here.

Shared task rules:

  1. To participate in the Shared Task, you must register and consent to the agreement at the “Register” page and download the data. Participants may not share the data with any person or organization without Microsoft’s prior written consent.
  2. Participants are required to use only the data released for the shared task for models submitted during the testing period.  If desired, they can report scores using additional external data in the paper they submit to the workshop.
  3. Participants may choose to use the corresponding language’s data to build each system or combine the data and use it cross-lingually.
  4. Participants may build systems for any number of language pairs, even if they have access to the data for all of them.
  5. Only the audio for the blind test sets will be released. Participants are expected to run their systems on the blind test sets and submit the label files.
  6. Participants may form teams, and participants can be part of multiple teams. All team members names should be clearly mentioned in the submission email. These team members are expected to be co-authors of the paper that each team will submit to the workshop.
  7. Participants can submit up to three models per task (Task A and Task B) per language pair per team during the testing period. This means each team can submit a total of up to 18 models (3 models × 3 language pairs × 2 tasks). Any additional models submitted after the first three per language pair per task will not be considered for evaluation.
  8. Participants may also use the training and dev monolingual Gujarati, Tamil, and Telugu data previously released by us, available here, to train their models. Participants should not use the test data available at this link.
  9. The systems submitted are expected to beat the baseline system in terms of Accuracy and EER; however, innovative systems that come close to the baseline may also be considered.
  10. Participants must submit the following items for evaluation: (1) the results files; (2) the final LID models; and (3) the research paper so the shared task organizers can reproduce the results against the blind set.
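Rule 9 names Accuracy and EER as the evaluation metrics. As a rough, non-authoritative illustration (the official scoring tool is not described on this page, and the labels and scores below are hypothetical), the two metrics might be computed like this in plain Python:

```python
def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted label matches the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def eer(y_true, scores):
    """Equal Error Rate: the operating point where the false-acceptance
    rate (FAR) equals the false-rejection rate (FRR), approximated here
    by scanning the scores themselves as candidate thresholds.

    Assumes y_true contains both classes (1 = positive, 0 = negative)
    and that a higher score means "more likely positive".
    """
    neg = sum(1 for t in y_true if t == 0)
    pos = sum(1 for t in y_true if t == 1)
    best = None
    for thr in sorted(set(scores)):
        far = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr) / neg
        frr = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < thr) / pos
        if best is None or abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]
```

A perfectly separated system (every positive score above every negative score) yields an EER of 0.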

Submission format:

All participants who registered before 21 April 2020 will be sent an email at their registered email address with links to download the test data on 27 April 2020. Participants will have until 17:00 IST on 29 April 2020 to submit their results files for evaluation.

Participants need to send an email to CSWorkshop2020@microsoft.com with the subject “TaskA/TaskB language CS Workshop 2020 Shared Task Evaluation”. Here language will be TA, TE or GU (Tamil-English, Telugu-English or Gujarati-English). For example, the subject line will be “TaskA TA CS Workshop 2020 Shared Task Evaluation” or “TaskB GU CS Workshop 2020 Shared Task Evaluation”. Please include names of all team members in the body of the email, as well as a team name.

Please follow the exact guidelines for the subject line and file format, as the evaluation will be done automatically. If there is a problem with the format, you will receive an email and can resubmit; a failed submission will not be counted against the allowed attempts.

Participants should attach up to three results files to each email. Results for each language pair must be submitted in separate emails.

Participants should submit CSV files in the formats specified below.

Task A:

File name: TaskA-language-modelname.csv where language = TA/TE/GU and modelname is a name of your choice

File format:

filename1,0/1

filename2,0/1

filename3,0/1

where 0 represents a code-switched utterance and 1 represents a monolingual utterance.
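As a minimal sketch of producing a Task A results file with Python's csv module (the utterance IDs, predictions, and model name below are hypothetical placeholders; real filenames come from the released blind test set):

```python
import csv

# Hypothetical utterance IDs mapped to model outputs
# (0 = code-switched, 1 = monolingual).
predictions = {
    "utt_0001.wav": 0,
    "utt_0002.wav": 1,
    "utt_0003.wav": 0,
}

# File name follows the TaskA-language-modelname.csv convention;
# "TA" and "mymodel" are placeholders for your language pair and model.
with open("TaskA-TA-mymodel.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for filename, label in predictions.items():
        writer.writerow([filename, label])
```

Each row is simply `filename,label` with no header line, matching the format shown above.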

Task B:

File name: TaskB-language-modelname.csv where language = TA/TE/GU and modelname is a name of your choice

File format:

filename1, Space separated sequence of language tags for every 200ms

filename2, Space separated sequence of language tags for every 200ms

filename3, Space separated sequence of language tags for every 200ms

where the language tags are E (English), S (Silence), and the tag for the Indic language of the file's language pair: G for Gujarati, or T for Tamil or Telugu.

Please note: For audio that cannot be divided exactly into 200 ms frames, only the whole frames will be considered for testing. For example, if the audio is 4.56 seconds long, only 4.40 seconds of the audio will be scored and the last 160 ms will be ignored.
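The frame arithmetic above, and the shape of a Task B row, can be sketched as follows (the utterance ID, duration, and tag sequence are hypothetical):

```python
FRAME_MS = 200  # Task B emits one language tag per 200 ms frame

def num_frames(duration_s: float) -> int:
    # Only whole 200 ms frames are scored; the trailing remainder
    # of the audio is ignored, as the note above describes.
    return int(round(duration_s * 1000)) // FRAME_MS

# A 4.56 s clip yields 22 frames (4.40 s scored, last 160 ms dropped).
frames = num_frames(4.56)

# Hypothetical per-frame tags for a Tamil-English (TA) file:
# E = English, T = Tamil, S = Silence, one tag per frame,
# space-separated after the filename in the CSV row.
tags = ["T"] * 10 + ["E"] * 8 + ["S"] * 4
row = "utt_0001.wav," + " ".join(tags)
```

The tag sequence length must equal the number of whole 200 ms frames in the utterance.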

Contact: In case you have questions about the shared task, please contact us at sunayana.sitaram@microsoft.com

Leaderboard

Task A

Gujarati
Team Name | Accuracy | EER
VocapiaLIMSI | 0.75 | 0.12
Swiggy | 0.70 | 0.15
Ground Zero | 0.55 | 0.22
CMU | 0.50 | 0.25
Sizzle | 0.47 | 0.26

Telugu
Team Name | Accuracy | EER
VocapiaLIMSI | 0.79 | 0.10
Swiggy | 0.79 | 0.10
CMU | 0.74 | 0.13
Sizzle | 0.71 | 0.14
Ground Zero | 0.67 | 0.16

Tamil
Team Name | Accuracy | EER
VocapiaLIMSI | 0.79 | 0.10
Swiggy | 0.79 | 0.10
CMU | 0.73 | 0.13
Ground Zero | 0.67 | 0.16
Sizzle | 0.55 | 0.22

Task B

Gujarati
Team Name | Accuracy | EER
VocapiaLIMSI | 0.78 | 0.06
Swiggy | 0.75 | 0.07

Telugu
Team Name | Accuracy | EER
VocapiaLIMSI | 0.79 | 0.06
Swiggy | 0.74 | 0.07

Tamil
Team Name | Accuracy | EER
VocapiaLIMSI | 0.78 | 0.06
Swiggy | 0.74 | 0.07