October 30, 2020 to October 31, 2020

First Workshop on Speech Technologies for Code-switching in Multilingual Communities 2020

Location: Virtual/Online


We will be organizing a shared task on Code-switched Spoken Language Identification (LID) in three language pairs – Gujarati-English, Telugu-English and Tamil-English. The shared task will consist of two subtasks:

Subtask A: Utterance-level identification of monolingual vs. code-switched utterances

Subtask B: Frame-level identification of language in a code-switched utterance.

Registration for the shared task is open. Please fill out the form available at this link. Participants will receive download links to the data by email once they register with their email address.

More details about the shared task and baselines can be found here.

Shared task rules:

  1. To participate in the Shared Task, you must register and consent to the agreement at the “Register” page and download the data. Participants may not share the data with any person or organization without Microsoft’s prior written consent.
  2. For models submitted during the testing period, participants are required to use only the data released for the shared task. If desired, they may report scores obtained with additional external data in the paper they submit to the workshop.
  3. Participants may choose to use the corresponding language’s data to build each system or combine the data and use it cross-lingually.
  4. Participants may build systems for any number of the language pairs, even if all of the systems use the same data.
  5. Only the audio for the blind test sets will be released. Participants are expected to run their systems on the blind test sets and submit the label files.
  6. Participants may form teams, and a participant can be part of multiple teams. All team members' names should be clearly listed in the submission email. These team members are expected to be co-authors of the paper that each team will submit to the workshop.
  7. Participants can submit up to three models per task (task A and B) per language pair per team during the testing period. This means that each team can submit a total of up to 18 models (3 models, 3 language pairs, 2 tasks). Any additional models submitted after the first three per language pair per task will not be considered for evaluation.
  8. Participants may also use the training and dev monolingual Gujarati, Tamil and Telugu data that we previously released, available here, to train their models. Participants should not use the test data available at this link.
  9. The systems submitted are expected to beat the baseline system in terms of Accuracy and EER; however, innovative systems that come close to the baseline may be considered.
  10. Participants must submit the following items for evaluation: (1) the results files; (2) the final LID models; and (3) the research paper so the shared task organizers can reproduce the results against the blind set.

Submission format:

All participants who registered before 21 April 2020 will be sent an email at their registered email address on 27th April 2020 with links to download the test data. Participants will have until 17:00 IST on 29th April 2020 to submit their results files for evaluation.

Participants need to send an email to CSWorkshop2020@microsoft.com with the subject “TaskA/TaskB language CS Workshop 2020 Shared Task Evaluation”. Here language will be TA, TE or GU (Tamil-English, Telugu-English or Gujarati-English). For example, the subject line will be “TaskA TA CS Workshop 2020 Shared Task Evaluation” or “TaskB GU CS Workshop 2020 Shared Task Evaluation”. Please include names of all team members in the body of the email, as well as a team name.
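Since the subject line is parsed automatically, it may help to generate it rather than type it by hand. The following is a minimal sketch of that idea; the function name `build_subject` and its validation are our own illustration and are not part of any official tooling.

```python
# Hypothetical helper (not official tooling): builds the subject line in the
# pattern "TaskA/TaskB language CS Workshop 2020 Shared Task Evaluation".
VALID_TASKS = {"A", "B"}
VALID_LANGUAGES = {"TA", "TE", "GU"}  # Tamil-, Telugu-, Gujarati-English


def build_subject(task: str, language: str) -> str:
    """Return the email subject for one task and one language pair."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}")
    if language not in VALID_LANGUAGES:
        raise ValueError(f"language must be one of {sorted(VALID_LANGUAGES)}")
    return f"Task{task} {language} CS Workshop 2020 Shared Task Evaluation"


if __name__ == "__main__":
    print(build_subject("A", "TA"))
    print(build_subject("B", "GU"))
```

The two printed lines match the two example subject lines given above.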

Please follow the exact guidelines for the subject line and file format, as the evaluation will be done automatically. If there is a problem with the format, you will receive an email and can resubmit. A failed submission will not be counted toward the allowed attempts.

Participants may attach up to 3 results files to each email. Results for each language pair must be submitted in separate emails.

Participants should submit CSV files in the formats specified below:

Task A:

File name: TaskA-language-modelname.csv where language = TA/TE/GU and modelname is a name of your choice

File format:

filename1,0/1

filename2,0/1

filename3,0/1

where 0 represents code-switched and 1 represents monolingual.
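As an illustration of the Task A format, here is a minimal sketch that builds the file name and the CSV contents from a mapping of audio file names to labels. The helper names, the model name and the example audio names are hypothetical; whether the audio names should carry an extension is not stated here, so the sketch simply writes whatever names it is given.

```python
import csv
import io
from typing import Mapping


def task_a_filename(language: str, model_name: str) -> str:
    """File name pattern from the instructions: TaskA-language-modelname.csv."""
    return f"TaskA-{language}-{model_name}.csv"


def format_task_a(predictions: Mapping[str, int]) -> str:
    """Return CSV text with one 'filename,label' row per utterance.

    Label 0 means code-switched and label 1 means monolingual.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for audio_name, label in predictions.items():
        if label not in (0, 1):
            raise ValueError(f"label for {audio_name} must be 0 or 1")
        writer.writerow([audio_name, label])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical predictions for two utterances.
    rows = format_task_a({"utt1": 0, "utt2": 1})
    with open(task_a_filename("TE", "mymodel"), "w", newline="") as handle:
        handle.write(rows)
```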

Task B:

File name: TaskB-language-modelname.csv where language = TA/TE/GU and modelname is a name of your choice

File format:

filename1, space-separated sequence of language tags, one for every 200ms

filename2, space-separated sequence of language tags, one for every 200ms

filename3, space-separated sequence of language tags, one for every 200ms

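where the language tags are E (English), G (Gujarati), T (Tamil), T (Telugu) and S (Silence). The tag T is used for both Tamil and Telugu; results for each language pair are submitted in separate files.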

Please note: Audio that cannot be divided exactly into 200ms frames is truncated to its last complete frame. For example, if the audio is 4.56 seconds long, only the first 4.40 seconds will be considered for testing and the last 160ms will be ignored.
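The frame arithmetic in the note above, together with the Task B row format, can be sketched as follows. This is an illustration under stated assumptions, not the organizers' evaluation script: the helper names are hypothetical, durations are taken in whole milliseconds to avoid floating-point rounding, and the sketch assumes a row should hold exactly one tag per complete 200ms frame.

```python
from typing import Sequence

FRAME_MS = 200  # frame length given in the Task B format
ALLOWED_TAGS = {"E", "G", "T", "S"}  # English, Gujarati, Tamil/Telugu, Silence


def num_frames(duration_ms: int) -> int:
    """Number of complete 200ms frames; a partial final frame is dropped."""
    return duration_ms // FRAME_MS


def ignored_tail_ms(duration_ms: int) -> int:
    """Milliseconds at the end of the audio that fall outside any full frame."""
    return duration_ms % FRAME_MS


def format_task_b_row(audio_name: str, tags: Sequence[str], duration_ms: int) -> str:
    """Return 'filename, tag tag tag ...' after basic consistency checks."""
    expected = num_frames(duration_ms)
    if len(tags) != expected:
        raise ValueError(f"{audio_name}: expected {expected} tags, got {len(tags)}")
    unknown = set(tags) - ALLOWED_TAGS
    if unknown:
        raise ValueError(f"{audio_name}: unknown tags {sorted(unknown)}")
    return f"{audio_name}, {' '.join(tags)}"


if __name__ == "__main__":
    # The 4.56 second example: 22 full frames (4.40 s), 160 ms ignored.
    print(num_frames(4560), ignored_tail_ms(4560))
    print(format_task_b_row("utt1", ["T", "T", "E", "S"], 800))
```

Checking the tag count against the audio duration before submitting is one way to catch off-by-one frame errors early, given that a malformed file has to be resubmitted.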

Contact: If you have questions about the shared task, please contact us at sunayana.sitaram@microsoft.com.