Research Forum Brief | June 2024

Panel Discussion: Generative AI for Global Impact: Challenges and Opportunities

Sunayana Sitaram

“One of the solutions that we’ve been using is to actually design with ‘human in the loop’ in mind because we know that these technologies are not perfect. And so, we really want to figure out ways in which humans and AI systems can work together in order to create the most effective outcome.”

Sunayana Sitaram, Principal Researcher, Microsoft Research India

Transcript: Panel Discussion

Generative AI for Global Impact: Challenges and Opportunities

Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi (host)
Sunayana Sitaram, Principal Researcher, Microsoft Research India
Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge
Tanuja Ganu, Principal Research SDE Manager, Microsoft Research India

Microsoft researchers discuss the challenges and opportunities of making AI more inclusive and impactful for everyone—from data that represents a broader range of communities and cultures to novel use cases for AI that are globally relevant.

Microsoft Research Forum, June 4, 2024

JACKI O’NEILL: I’m delighted to be hosting what promises to be a really engaging panel today with three fabulous panelists. In my talk, I talked about the importance of building globally equitable generative AI systems for diverse communities and application areas, and I hope that I’ve convinced you all of the importance of doing this if generative AI is not going to compound existing systemic inequalities. In this panel, we’re going to dive much deeper into the application areas, the user populations, the problems, and the solutions of doing this with our three expert panelists: Sunayana Sitaram, Tanuja Ganu, and Daniela Massiceti. So without further ado, I’d like to ask each of the panelists to introduce themselves.

TANUJA GANU: Thank you, Jacki, and hello, everyone. My name is Tanuja Ganu, and I’m a principal research engineering manager at Microsoft Research India. My background is in applied AI, and my work is focused on developing and validating technologies that drive positive change in society. I have been leading an incubation center in MSR India called SCAI—Societal Impact through Cloud and AI—and for the last 1½ years, I’ve been spending a lot of time on how we can take the potential of generative AI to empower every individual across the globe and catalyze change in domains like education. Thank you.

SUNAYANA SITARAM: Hi, everyone. I’m Sunayana Sitaram. I’m principal researcher at Microsoft Research India, and my background is in natural language processing. My research involves trying to make sure that large language models, or generative AI as they’re also known, work well for all languages and cultures. And over the last couple of years, my research group has really looked into how to evaluate how well these large language models are doing for different languages across the world, including languages that have smaller amounts of data compared to English but are still spoken by millions of people worldwide. Thank you.

DANIELA MASSICETI: Hi, everyone. My name is Daniela Massiceti, and I’m a senior researcher at Microsoft Research based in Australia. My background is in machine learning, but nowadays, I work much more at the intersection of machine learning and human-computer interaction, particularly looking at multi-modal models. So these are models that work with both image and text input. And my main focus is, how do we ensure that these AI models or AI systems work well for the users who are in the tails of the user distribution? In particular, the research that I’ve done along with my team focuses on people with disabilities, who will, of course, be major beneficiaries of these multi-modal models.

O’NEILL: Thank you so much. I’d like to start by asking you what you see as the core problems we face building equitable generative AI that works well for diverse communities and user groups. Tanuja, would you like to start us off?

GANU: Let me start off by saying that I feel this is an exciting time to be in technology, and I’m really thrilled by the remarkable progress and the vast potential of generative AI. We are already seeing successful deployments of generative AI in enterprise applications like GitHub Copilot for programmers or Office 365 Copilot for enterprise users, which is showing improved efficiency and quality as well as giving users the ability to focus more on their creative work. So the natural next question is, how can we take this power of generative AI and empower every individual across the globe—people who come from different nationalities, ethnicities, and cultures, and who have varied technology access and financial affordability, as well? When we look at this technological evolution, I think it’s crucial that we prioritize and address the digital divide and actively work to reduce this gap. Taking these points into account, there are [a] few sociotechnical challenges we need to address if we want generative AI technology to truly work for every individual. The first important challenge is making sure that these technologies can provide seamless interaction across thousands of world languages. And it’s not only about language; it’s also about incorporating and preserving cultural nuances in these different communities and user groups. The second important challenge is designing for existing infrastructural constraints—for example, supporting low-end mobile phones as the primary interface in some cases, dealing with low or intermittent network connectivity, and addressing overall low affordability, especially when we are looking at the vast majority of populations in the Global South.
The third important problem is varied access levels, depending on literacy as well as on access needs arising from disabilities. And the fourth important challenge is really an overarching one: how can we revisit responsible AI and safe-deployment principles, taking into account these culturally and linguistically varied user groups and expanding them to include the dimensions of equity, access, and inclusion? So I think these are some of the important challenges.

O’NEILL: Thank you so much, Tanuja. I think you’ve really given us a great overview there. Daniela, I wonder if you could deep dive a bit on the accessibility questions that Tanuja raised.

MASSICETI: Yeah, sure thing, Jacki. So, yeah, I can definitely bring some perspectives here from the work that my team and I have done in the accessibility space. So we know, as I said earlier, that these multi-modal models really hold the potential to transform assistive technologies for communities with disabilities. But up until now, very few works have actually quantified how well these models are going to work for these communities. And so a piece of work that we recently did, which was published at CVPR, aimed to do exactly this. Specifically, we looked at images and text captured by users who are blind and then evaluated how well CLIP, which is a very popular multi-modal model, actually works on their data. And I wanted to share three insights that came from this work, which speak to the core challenges that I think lie ahead of us in realizing truly equitable AI systems.

So the first is that the datasets typically used to train these AI models do not include data from communities with disabilities. In our work, we analyzed three large-scale datasets that are typically used to pretrain these large multi-modal models, and we found that disability content—things like guide canes and Braille displays—is significantly underrepresented, or actually just not present at all, in these datasets. And so this means that any model trained on these datasets will perform poorly on any task that involves identifying, locating, or answering questions about any of these particular objects. And I don’t think this problem of data inclusion applies just to the blind and low-vision community but to many, many marginalized communities who may not be included in these datasets. And the second core problem is that I think we’re moving toward a paradigm where we have a very small number of enormous models—these so-called foundation models—which are being widely used by many, many downstream models and applications. But if these foundation models don’t work well in the first instance for marginalized communities, then we have the potential to see this compounding in essentially any downstream application that uses them. And this is exactly what we saw in our CVPR work.

We identified that CLIP, as a base model, significantly underperforms on data from blind and low-vision users. But then when CLIP is embedded as a component in other models, these failures persist and in some cases are even amplified. So, for example, we looked at DALL-E 2, which uses a CLIP vision encoder under the hood, and we basically saw that it couldn’t generate any decent images of any of the disability objects we tested. You know, when we asked it for a guide cane, it gave us very funky-looking walking sticks. And when we asked it for Braille keyboards, it again gave us these random arrangements of white dots on a page.

And the final core problem I’ll reflect on is that I think we don’t often embed ourselves deeply enough in marginalized communities to really understand the ways that AI models need to work for these communities. So, for example, one of the findings in our CVPR paper was that CLIP has trouble recognizing objects if users describe them by their material rather than their color. So, for example, a user might say find my leather bag rather than my brown bag. And we only really knew to test for this because our team collectively has 20-plus years of experience working with the blind and low-vision community and knows that users often use these material-based descriptions when they’re talking about their objects. And so without this insight, we would never have uncovered this particular failure mode. So I think, to achieve truly equitable AI models, we really need to deeply embed ourselves in the communities that we’re working with.

O’NEILL: Thank you, Daniela. So Sunayana, Daniela’s given us a really good overview of the challenges with the multi-modal models and the image models. I know that your research is primarily thinking about how different language communities can interact with these language models. I’m wondering, what do you see as the problems for making these models work well for anyone, anywhere, whatever language they speak?

SITARAM: Right. So as Daniela mentioned, there is a data divide, right, even when it comes to languages because most language models today are trained predominantly on data that comes from the web. And we know that not all languages and cultures are equally represented on the web, right. So at the very first step of the pipeline, you now have this inequity because of different representation of different languages and cultures. But I think that’s not the only problem. There are a lot of other decisions that are taken during the model-building process which could also influence downstream performance. So, for example, in some of our research earlier last year, which was published in EMNLP, we found that the tokenizer, which is the component that actually breaks words down into smaller pieces, that doesn’t work equally well for all languages, and that actually has a significant impact on downstream performance. So things like this, you know, decisions that are taken during the model-building process can also really influence the performance. And finally, you know, one of the biggest challenges I see—and I may be a little biased because this is my area of research—is that, you know, we are not able to actually evaluate these models across all languages and cultures well. And this is because of a variety of reasons, including the fact that, you know, not too many benchmarks exist with the sufficient linguistic and cultural diversity. But because we are not doing a good job of evaluation, we don’t even know how well these models work for different languages and cultures. And so I think, you know, beyond data, there are many other challenges that need to be addressed in order to make these models actually work for all languages and cultures.
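The tokenizer effect Sitaram describes can be made concrete: byte-level tokenizers start from UTF-8 bytes, so scripts outside the ASCII range begin at roughly three bytes per character before any merges are learned, which tends to fragment under-represented languages into many more tokens. The sketch below is purely illustrative—it is not code from the EMNLP paper, and the example strings are our own.

```python
# Illustrative only: UTF-8 cost per character, a rough proxy for how
# aggressively an under-trained byte-level tokenizer fragments a script.
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text`."""
    return len(text.encode("utf-8")) / len(text)

# ASCII English sits at 1 byte per character, while Kannada script
# sits at 3 bytes per character, so the same amount of text starts
# out roughly three times longer in raw bytes before any merges.
print(utf8_bytes_per_char("hello teacher"))  # 1.0
print(utf8_bytes_per_char("ನಮಸ್ಕಾರ"))  # 3.0
```

This is only the first-order effect; learned merge rules can recover some of the gap, but only if the script is well represented in the tokenizer's training data.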

O’NEILL: Yeah, thank you so much. I think it’s really clear from your answers what the biggest challenges are for making these technologies work, at both the societal level and the level of the actual models themselves, whether they’re vision or multi-modal models or language models, and we know that this has a direct impact on various user populations. As Tanuja mentioned in the beginning, you know, we’re seeing a lot of enterprise applications and enterprise technologies being developed, whether that’s for helping you code or ideate or answer emails. But are there other user populations who could really benefit from applications of generative AI that work well? Tanuja?

GANU: Yeah, so I think there are a lot of interesting and impactful applications emerging for generative AI in domains like education, health care, and agriculture. So let me give you an example from our work in education, where we are developing [an] AI assistant called Shiksha copilot that provides agency to teachers in public schools in India for generating personalized and engaging learning experiences—activities, assessments, teaching material—for their students. What is important here is that the content generated is completely grounded in the local curriculum and the interaction is completely in the local language, which is Kannada in this particular case. It’s also important that the content preserves cultural and local norms. So let’s take the example of a teacher teaching components of food, or balanced diet, as the topic. It should include examples that come from the local diet and cuisine—maybe giving an example of biryani, or maybe an example of ragi mudde, which is made from finger millet. It’s also important that the teacher is able to generate and use the lesson plans on their mobile phone or desktop, whichever resources are available to them, and that they are able to use Shiksha copilot in classrooms where AV systems might not be available. So they can generate the lesson plan on the phone, take it to the classroom, and use it in a completely offline manner. So these are all the challenges that we discussed earlier; those become really important when we are doing these kinds of real-world deployments. With Shiksha copilot, we have completed a successful small pilot with 50 teachers in India, and now we are gearing up for a scaled pilot with a thousand teachers.
And I feel like applications like these can have a really transformative effect in the education system and create a positive impact for students and teachers across the globe.

O’NEILL: Thank you. Daniela, for the accessibility populations, what type of applications and populations are important in this space?

MASSICETI: Yeah, sure thing. So an estimated 1.3 billion people—around 16 percent of the global population—live with some level of disability today. So I think it’s really exciting to see these generative AI applications coming online for these communities, and our team has done, as you may already have gathered, a lot of work with the blind and low-vision community. And so I wanted to call out a couple of promising generative AI applications for this particular community. The first is Microsoft’s own, actually: Seeing AI. So Seeing AI is a mobile app for users who are blind and low vision, and they’re really leading the charge in innovating new assistive user experiences using models like GPT-4. So, for example, they’ve built in features which allow users to ask really detailed questions about a document they’ve scanned as well as get these beautifully detailed captions or descriptions of photos that they’ve taken. And you can really see the impact of these. For example, when you’re visiting a museum, you can snap a picture and get these beautiful descriptions of the artworks that are around you. I’ll also call out the partnership announced last year between Be My Eyes and OpenAI. So Be My Eyes is a video-calling app which connects blind users with sighted volunteers when they need help on a particular task. So, for example, they snap a picture of a packet of tomatoes and then ask the sighted volunteer if it’s out of date. And the promise with the OpenAI partnership is that at some point in the future, these sighted volunteers may be replaced by a model like GPT-4 with vision, enabling pretty much instantaneous and fully automated assistance for blind users anywhere in the world. So I think that’s really exciting.
And in fact, I—along with some other colleagues at Microsoft Research—worked very closely with OpenAI and teams across Microsoft to red team the GPT-4 with vision model and really ensure that it met Microsoft’s high bar before it was publicly released. And I think this is a really tangible demonstration of Microsoft’s commitment to delivering safe and responsible AI technologies to its customers.

O’NEILL: Thank you so much. So given these large populations who could really benefit, how do we go about building solutions for them that actually work?

GANU: So maybe I will take this. Given that we are working with really diverse populations, I think it’s extremely useful to work with a user-centered or participatory design approach and collect the voices of the users—especially the marginalized and underserved communities—right from the start, at design time. It’s also important, while we are dealing with this nascent or emerging technology, that we have the right safeguards while deploying the system and that we are able to collect feedback at every stage of deployment, such as using an expert-in-the-loop deployment, where the expert has the ability to verify as well as override the responses as and when required. So to give an example, this was one of the conscious decisions when we started working on Shiksha copilot: to start with the teachers and not with the students first, where the teacher is the expert in the loop, and we can extend the benefits of the technology to the students through teachers to start with and eventually go to the students.

Also, while we are working on various applications at population scale, as I mentioned earlier, in domains like agriculture, education, and health care, what we are seeing is that there are common problems, or universal challenges, which repeat across all these domains. As Sunayana talked about earlier, multilingual interaction is a huge problem across all domains. The other important problem is that most of the knowledge base required for grounding and generating these AI experiences is non-digitally native and multi-modal. So how to extract information from this multi-modal, non-digitally-native content is a challenge across these different domains. So what we are doing as part of our project, which is called Project VeLLM—which stands for “uniVersal Empowerment with Large Language Models”—is building a versatile platform, which you can think of as building blocks, or a tool set, providing all these different functionalities which are common across these different applications. And now other developers do not have to start from scratch. They can use these building blocks and create their equitable AI experiences rapidly across different domains.

SITARAM: Generalizing a little bit from what Tanuja just said about expert in the loop, I think that, you know, one of the solutions that we’ve been using is to actually design with “human in the loop” in mind because we know that these technologies are not perfect. And so, you know, we really want to figure out ways in which humans and AI systems can work together in order to create the most effective outcome. And in our research, we’ve actually been doing this for evaluation of, you know, multilingual scenarios. So, for example, we know that, you know, large language models can do a good job of evaluation, but we also know that they don’t do a very good job on some languages and along some dimensions, right. So those languages and those dimensions should ideally be left to a human to do, whereas for the ones that we are very confident that the LLM is doing a good job, we can actually rely on it more with some human oversight in order to scale up the process of evaluation. So this idea of actually using humans and AI together and designing for this kind of hybrid system, I think, is really crucial. And, of course, we need to keep revisiting this design as these AI systems become more and more capable.
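The hybrid human–AI evaluation Sitaram outlines can be thought of as a simple routing rule: rely on the LLM judge only where it is known to be reliable, and send everything else to human evaluators. The sketch below is an illustrative assumption, not the actual system—the language set and labels are made up for the example, and in practice the trusted set would come from meta-evaluation studies.

```python
# Hypothetical sketch of hybrid evaluation routing. The set of languages
# where the LLM judge is trusted is an illustrative placeholder.
LLM_RELIABLE_LANGUAGES = {"english", "spanish", "german"}

def route_evaluation(language: str) -> str:
    """Decide who evaluates a model output in the given language."""
    if language.lower() in LLM_RELIABLE_LANGUAGES:
        # Scale up with the LLM judge, keeping spot-check human oversight.
        return "llm-with-human-oversight"
    # Languages (or dimensions) where the LLM judge is unreliable
    # go to human evaluators.
    return "human"

print(route_evaluation("English"))  # llm-with-human-oversight
print(route_evaluation("Kannada"))  # human
```

As Sitaram notes, the routing itself needs revisiting as the models' judging ability improves per language and per evaluation dimension.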

MASSICETI: Yeah, so many points I can agree with there and build on. I think what’s common to both Tanuja’s and Sunayana’s answers is really this need to bring models and humans together. And I think one real limitation we’ve seen in our work across many of the models we’ve worked with is that they really often generate quite generic responses, you know. So if you prompt an LLM to write you an email, the tone and style don’t quite feel like yours. And so as we look to this next decade of generative AI solutions, I really hope we’re going to see more personalized AI models and solutions come through much more strongly—solutions where you as the user have much more control, much more agency, around how your model works for you. And I think that’s another example of how human users and the AI model need to come together in order to create something even more powerful. And I think this is going to be even more important for marginalized communities, whose needs often differ a lot from the average or the generic needs.

And to, kind of, just bring one concrete example to the table, our team has been building a personalizable object recognizer over the last year. So here, a blind user can pretty much teach the object recognizer their personal objects, things like their sunglasses, their partner’s sunglasses, maybe their favorite T-shirt. And they do this by taking short videos of these objects, and then the personalized recognizer can then help them locate these things at any point in the future. And so in this sense, the user is really given the agency. It’s really this example of a human-in-the-loop paradigm, where a user is given the agency to personalize their AI system to meet their exact needs. So, yeah, it’s really exciting. This feature has actually just been released in Seeing AI, and so we’re really keen to begin imagining how we might see more personalizable generative AI experiences for users in the near future.

O’NEILL: Yeah, I really love that idea. I think we would all benefit from more personalized AI, even when you’re just trying to craft an email or something like that. The challenge people often face is it doesn’t really sound like them. And then if you have to edit it too much, then, you know, you reduce the benefit. So I think there are so many areas right across the board where personalization could help. So finally, as we’re coming to a close, I’d really love to finish by asking each of you what you think the biggest research questions that are still open are, what the biggest gaps are, and how you would advise the research community to go about solving them.

MASSICETI: Yeah, it’s a big, big question. I’ll maybe take a stab first. So I think a couple of us have already touched on this point before, but the data divide, I think, is really a big, big challenge—you know, the fact that data is widely available for some communities but totally absent or very sparse for others. And I think this is one of the biggest hurdles we need to address as a research community in order to really move the needle on equitable AI, because it’s impacting everything from the way that we can train models to, as Sunayana said, how we can evaluate these models, as well. But I want to call out that even though we’ve identified the problem—we know what the problem is; you know, we need to include data from these communities—I think there are just so many open questions around how we actually do this well and how we actually do this right. And so I want to bring up two specific challenges, or open questions, that I feel are very prevalent.

The first is, what do equitable paradigms actually look like when we’re collecting data from or about a marginalized community? These communities, as we know, have often historically been exploited. And so we really need to find fair ways of not only involving these communities in these data collection efforts, but also compensating them for their efforts as these models are then trained on this data and then deployed and used more broadly. But the second open question, I think, is that we really need deep technical innovation in adapting models to new data. You know, we’ve obviously seen a lot of adaptation methods coming online—fine-tuning, LoRA—and they do really well at adapting these models to new datasets and tasks. But what we’re seeing in our current experiments is that these approaches don’t work so well when the new data coming in is very different from the pretraining dataset. So in one particular example, we gave Stable Diffusion 10 training images of a funky-looking cat statue, and it learned it really well, and it could generate really realistic images of this statue. But when we did the same for a guide cane, Stable Diffusion still couldn’t generate realistic-looking images of guide canes. And so I think, as a research community, we really need to build a deeper understanding of how we get models to learn new concepts, even when they aren’t well represented in the pretraining datasets.
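For readers unfamiliar with the adaptation methods Massiceti names: LoRA freezes the pretrained weight matrix W and learns only a low-rank correction, so the adapted weights are W + BA, with B and A much smaller than W. The toy sketch below illustrates just that arithmetic with plain Python lists; it is a conceptual aid only, not the experimental setup described above.

```python
# Toy illustration of the LoRA idea: adapt frozen weights W (d x k) by
# adding a learned low-rank update B @ A, where B is d x r, A is r x k,
# and r is much smaller than d and k, so few parameters are trained.
def matmul(X, Y):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_adapt(W, B, A):
    """Return the adapted weights W + B @ A."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

W = [[1, 0], [0, 1]]   # frozen pretrained weights (2 x 2)
B = [[1], [1]]         # learned factor, rank r = 1 (2 x 1)
A = [[2, 3]]           # learned factor (1 x 2)
print(lora_adapt(W, B, A))  # [[3, 3], [2, 4]]
```

The point Massiceti makes is that even efficient adapters like this struggle when the new concept is far outside the pretraining distribution, which is exactly the regime that matters for underrepresented communities.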

O’NEILL: Thanks so much, Daniela. Tanuja, is there anything you want to add?

GANU: So for me, it feels like we are just beginning to scratch the surface, and there is a lot more work underway across the dimensions of culture, cost, human values, cognition, universal access, and many other dimensions. So while the journey is long and we are trying to solve some of these hard and important problems, it is important that we continue to make progress systematically and iteratively and that we continue to collect critical feedback at each of these stages. We definitely need to do a lot more work looking at different types of models—large language models for more complex tasks, but also smaller language models, especially when we are looking at the infrastructural challenges I discussed earlier. How can we use a combination of these models? How can we generate and collect data from different cultures and involve these communities? Because cultural knowledge is often implicit and not documented, how we learn it is also an important question. And I think collaboration is the key here. It’s important that we involve experts from multiple disciplines, user communities, researchers, and policymakers and accelerate progress in the right direction. We are already doing some of these collaborations with academia and NGOs through programs like Microsoft Research AI & Society Fellows and some of our existing collaborations with our community and partners in India and Africa. But I think we’ll just need to continue doing more of this and continue making steady progress on this important problem.

SITARAM: I completely agree with what both Daniela and Tanuja said. And talking more about the language and culture aspect, I think we need to figure out a way to involve these local communities in the design and training as well as evaluation phases of model building. And we need to do this at scale if we really want to reach all languages, all cultures, etc., right. So I think that is the thing that we really need to figure out how to do. So there are a couple of projects that we’ve been working on that have attempted to do this. One of them is called DOSA, where we collected a dataset of cultural artifacts from different users in India. And this was meant to be a participatory design approach where people would tell us what cultural artifacts were really important to them, and then we would collect this data from the ground up and try to evaluate whether LLMs did a good job or not, right. That’s one example. The other project that we’ve been working on is called Pariksha, where we employ workers from this ethical data company called Karya to do evaluation of Indian language models. So here we’re really asking the users, who speak multiple languages, to tell us whether these models work for them or not. And so I feel like we need to figure out more ways in which we can involve these local communities but at scale so that we can really impact the model-building process and then so that we can actually make these models work well for everybody.

O’NEILL: I couldn’t agree with you more, Sunayana. I think involving user communities in technology design in general is one of the most important things we can do, and this is even more so with underserved communities. I would just like to add something to that, though, which is that we really need multidisciplinary research that goes beyond anything we’ve done before, involving researchers and practitioners and community members. And it’s important to remember that machine learning engineers and researchers on their own can’t solve the problem of building globally equitable generative AI. This is something that we really need to do at a large scale. We need to transcend disciplinary boundaries if we’re going to build technology that really works for everyone, everywhere. And on that note, I’d like to say thank you to the panelists. It’s been a great discussion, and thank you to the audience.

MASSICETI: Thanks very much.

GANU: Thank you so much.

SITARAM: Thank you.