
LINGUA: Announcing the awardees from Microsoft’s AI for Good Lab Open Call
On the European Day of Languages, which celebrates Europe’s rich linguistic and cultural diversity, we released the LINGUA Open Call. The call invited proposals that advanced digital inclusion for Europe’s low-resource languages. These are languages with limited online content and datasets, which leaves them underrepresented in AI technologies compared to high-resource counterparts such as English, Spanish, French, or German. While many vulnerable and endangered languages fall into this category, the call was open to any European language that lacks the digital foundations required for fair representation and participation in the AI era.
LINGUA aims to address this gap by supporting innovative projects that collect high-quality speech and text datasets for Europe’s underrepresented languages. It is part of Microsoft’s commitment to digital sovereignty and linguistic diversity in Europe, ensuring that every language has the opportunity to be represented in the future of AI. Read more about the initiative.
Our commitment
At the Microsoft AI for Good Lab, we are deepening our commitment to Europe’s digital future by supporting linguistic diversity, digital sovereignty, and inclusive innovation. The LINGUA Open Call is part of the EU Digital Unlock initiative, which aims to make Europe’s languages and cultures more open and accessible in the digital era. We are proud to collaborate with nonprofits, universities, research institutes, startups, and cultural organizations to enhance resources for low-resource languages, close digital gaps, and maximize impact through shared knowledge and collective action.
We are excited to launch this initiative in close coordination with the APERTUS project led by EPFL & ETH Zurich, and in consultation with the Council of Europe. Together, we are building data resources for European languages, expanding the supply of multilingual datasets, and improving the performance of LLMs on low-resource languages. Our goal is to ensure that Europe’s rich linguistic and cultural heritage is fully represented in the next generation of AI models (e.g., Apertus, EuroLLM, SmolLM3) by empowering communities, fostering innovation, and recognizing the people and organizations that make Europe a hub of creativity and inclusion.

LINGUA Open Call awardees
The selected projects span 16 languages and dialects across 10 countries, reflecting a diverse mix of low-resource, vulnerable, and underrepresented linguistic communities.
Based on applicant estimates, they collectively cover languages spoken by over 65 million people, including Icelandic, Luxembourgish, Basque, Maltese, Ladino, Romansh, Ladin, Ukrainian, Romani (and Greco-Romani), several Balkan languages (Serbian, Turkish, Bosnian), and Italian dialects (Neapolitan, Sicilian, Roman), alongside multi-language work.
The awardees bring together universities, nonprofits, a government language center, and a public broadcaster, with efforts focused on open dataset creation and digitization, heritage language preservation, and new evaluation resources (including safety benchmarks) to strengthen multilingual AI and help safeguard Europe’s linguistic diversity.
We’re grateful to MILA Quebec, Mozilla, and EPFL for their close collaboration and support throughout the evaluation and selection process.
We’re pleased to announce the selected projects for the LINGUA Open Call:
- BUDOVA: Building Ukrainian Domain-Specific, Open Voice & Text Archives — Kyiv National University of Construction and Architecture (Ukraine) — Ukrainian
- Collection and Digitization of Romani Language Data in Greece: Laying the Foundations for Representation in Artificial Intelligence — ARSIS – Association for the Social Support of Youth (Greece) — Romani, Greco-Romani
- Icelandic AI Safety Benchmarks: Creating Open Evaluation Datasets for LLM Safety in a Low-Resource Language — University of Iceland (Iceland) — Icelandic
- LuxVLD: Luxembourgish Vision-Language Dataset for Education and Digital Inclusion — SnT, University of Luxembourg (Luxembourg) — Luxembourgish
- PARLA CHIARO (Speak Clearly) – Protecting Italian Dialect Speakers from AI-generated Health Misinformation — University of Naples Federico II (Italy) — Neapolitan, Sicilian, Roman
- Protecting Kosovo’s languages through responsible AI — Radio Television of Kosovo – RTK (Kosovo) — Serbian, Turkish, Bosnian, and Romani
- RhaetoChat: LLM Fine-Tuning Data for Rhaeto-Romance Languages — Department of Computational Linguistics, University of Zurich (Switzerland) — Romansh and Ladin
- SaqWI: Korpus Malti ta’ Mistoqsijiet u Tweġibiet / SaqWI: A Maltese Corpus of Qs & As (SaqWI-QA) — Ċentru tal-Ilsien Malti (CIM) (Malta) — Maltese
- Scaling Finweb2-HQ: Multi-Signal Extraction and Quality Enhancement for European Language Models and Beyond — EPFL (Switzerland) — Multi-language
- Speaking Ladino: Open Speech and Text Datasets for AI-Powered Language Preservation — Inalco Paris (France) — Ladino
- Wikispeech for All: Basque Edition — Wikimedia Sverige (Sweden) — Basque
In addition to the selected awardees, further projects will receive support through Azure compute credits.

Eligibility
We encouraged proposals from nonprofits, NGOs, universities, research institutions, social enterprises, cultural organizations, and startups. Proposals with multiple collaborators were welcome, particularly from those committed to the public good and able to demonstrate strong community engagement and ethical data practices.
To be eligible, applicants were required to demonstrate a commitment to producing fully open‑licensed datasets for text‑to‑text, speech‑to‑text, and text‑to‑speech applications. These efforts laid critical groundwork for the inclusion of low‑resource languages in open language and speech models.
