Social Biases through the Text-to-Image Generation Lens
Text-to-Image (T2I) generation is enabling new applications that support creators, designers, and general end users of productivity software by generating illustrative content with high photorealism starting from a given descriptive text as a prompt. Such models, however, are trained on massive amounts of web data, which raises the risk of harmful biases leaking into the generation process itself. In this paper, we take a multi-dimensional approach to studying and quantifying common social biases as reflected in the generated images, by focusing on how occupations, personality traits, and everyday situations are depicted across representations of (perceived) gender, age, race, and geographical location. Through an extensive set of both automated and human evaluation experiments, we present findings for two popular T2I models: DALLE-v2 and Stable Diffusion. Our results reveal severe occupational biases: for both models, neutral prompts largely exclude certain groups of people from the results. Such biases can be mitigated by adding more specification to the prompt itself, although prompt-based mitigation does not address discrepancies in image quality or other uses of the model and its representations in other scenarios. Further, we observe that personality traits are associated with only a limited set of people at the intersection of race, gender, and age. Finally, an analysis of how geographical locations are represented in everyday situations (e.g., parks, food, weddings) shows that, for most situations, images generated from default location-neutral prompts are closest and most similar to images generated for the United States and Germany.
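As a rough illustration of the prompting setup described in the abstract, the sketch below generates small batches of images for neutral versus specified occupation prompts with the Hugging Face diffusers API. The model id, occupation list, and prompt templates are illustrative assumptions rather than the paper's exact configuration, and the perceived-demographic annotation of the resulting images happens downstream (by human raters or a separate classifier).

```python
# Minimal sketch, assuming a diffusers-compatible Stable Diffusion checkpoint;
# the prompts and occupations below are illustrative, not the paper's exact setup.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

occupations = ["nurse", "software engineer", "housekeeper"]          # illustrative subset
specifications = ["", " of diverse genders, ages, and ethnicities"]  # neutral vs. specified prompt

for occupation in occupations:
    for spec in specifications:
        prompt = f"a photo of a {occupation}{spec}"
        # Generate a small batch per prompt; perceived gender/age/race of the
        # depicted people is annotated downstream by humans or a classifier.
        images = pipe(prompt, num_images_per_prompt=4).images
        for i, img in enumerate(images):
            tag = "specified" if spec else "neutral"
            img.save(f"{occupation.replace(' ', '_')}_{tag}_{i}.png")
```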
Evaluation and Understanding of Foundation Models
Presented by Besmira Nushi at Microsoft Research Forum, Episode 1
Besmira Nushi, Principal Researcher at Microsoft Research AI Frontiers, summarizes timely challenges and ongoing work on the evaluation and in-depth understanding of large foundation models.
Transcript
BESMIRA NUSHI: Hi, everyone. My name is Besmira Nushi, and together with my colleagues at Microsoft Research, I work on evaluating and understanding foundation models. In our team, we see model evaluation and understanding as a guide to AI innovation. Our work measures, informs, and accelerates model improvement and, at the same time, is a contribution that is useful to the scientific community for understanding and studying new forms and levels of intelligence.
But evaluation is hard, and new generative tasks are posing new challenges in evaluation and understanding. For example, it has become really difficult to scale up evaluation for long, open-ended, and generative outputs. At the same time, for emergent abilities, benchmarks often do not exist and we have to create them from scratch. And even when they exist, they may be saturated or leaked into training datasets. In other cases, factors like prompt variability and model updates may be just as important as the quality of the model that is being tested in the first place. When it comes to end-to-end and interactive scenarios, other aspects of model behavior may get in the way and may interfere with task completion and user satisfaction. And finally, there exists a gap between evaluation and model improvement.
In our work, we really see this as just the first step towards understanding new failure modes and new architectures through data and model understanding. So at Microsoft Research, when we address these challenges, we look at four important pillars. First, we build novel benchmarks and evaluation workflows. Second, we put a focus on the evaluation of interactive and multi-agent systems. And in everything we do, in every report that we write, we put responsible AI at the center of testing and evaluation to understand the impact of our technology on society. Finally, to bridge the gap between evaluation and improvement, we pursue efforts in data and model understanding.
But let’s look at some examples. Recently, in the benchmark space, we released KITAB. KITAB is a novel benchmark and dataset for testing constraint satisfaction capabilities on information retrieval queries that include user-specified constraints. And when we tested recent state-of-the-art models with this benchmark, we noticed that these models are able to satisfy user constraints in only 50 percent of the cases.
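To make the constraint satisfaction setup concrete, here is a hedged sketch of the kind of check involved: scoring what fraction of the items a model returns actually satisfy the query constraint. The real KITAB benchmark covers richer constraint types and uses ground-truth author catalogs; the predicate and data below are hypothetical.

```python
# Illustrative constraint-satisfaction check; not the actual KITAB evaluation code.
from typing import Callable, List

def satisfaction_rate(returned_titles: List[str],
                      constraint: Callable[[str], bool]) -> float:
    """Fraction of titles returned by the model that satisfy the constraint."""
    if not returned_titles:
        return 0.0
    return sum(constraint(t) for t in returned_titles) / len(returned_titles)

# Hypothetical query: "books by author X whose title starts with 'S'".
model_output = ["Snow Crash", "Seveneves", "The Diamond Age"]   # hypothetical model answer
starts_with_s = lambda title: title.strip().lower().startswith("s")

print(satisfaction_rate(model_output, starts_with_s))  # 2 of 3 returned items satisfy it
```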
And similarly, in the multimodal space, Microsoft Research just released HoloAssist. HoloAssist is a testbed with extensive amounts of data that come from recording and understanding how people perform tasks in the real, physical world. And this provides us with an invaluable evaluation resource for understanding and measuring how new models are going to assist people in things like task completion and mistake correction. In the responsible AI area, ToxiGen is a new dataset designed to measure and understand toxicity generation from language models. It is able to measure harms that may be generated by such models across 13 different demographic groups.
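As a rough sketch of per-group harm measurement in the spirit of ToxiGen (not its actual pipeline), one can score model continuations written about each demographic group with a toxicity classifier and aggregate by group. Detoxify is used below only as a stand-in scorer, since ToxiGen ships its own prompts and fine-tuned classifier.

```python
# Illustrative per-group toxicity aggregation; ToxiGen's own prompts and
# classifier would replace the stand-in scorer used here.
from statistics import mean
from detoxify import Detoxify

scorer = Detoxify("original")

def per_group_toxicity(generations: dict[str, list[str]]) -> dict[str, float]:
    """Map each demographic group to the mean toxicity of the continuations sampled for it."""
    return {
        group: mean(scorer.predict(text)["toxicity"] for text in texts)
        for group, texts in generations.items()
    }

# Hypothetical usage: `generations` holds model outputs keyed by demographic group.
# report = per_group_toxicity({"group_a": ["...", "..."], "group_b": ["...", "..."]})
```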
Similarly, in the multimodal space, we ran extensive evaluations to measure representational fairness and biases. For example, we tested several image generation models to see how they represent certain occupations, certain personality traits, and geographical locations. And we found that such models can sometimes represent a major setback in how different occupations are depicted when compared to real-world representation. For instance, in some cases, we see as low as 0 percent representation for certain demographic groups.
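A small sketch of the representation comparison described here, assuming perceived-group annotations are already available for the generated images of one occupation; the labels and reference shares below are made-up illustrations, not the study's data.

```python
# Compare the share of each perceived group in generated images against a
# real-world reference distribution; all numbers here are illustrative.
from collections import Counter

def representation_gap(annotations: list[str],
                       reference: dict[str, float]) -> dict[str, float]:
    """Difference between each group's share in generated images and its reference share."""
    counts = Counter(annotations)
    total = len(annotations)
    return {group: counts.get(group, 0) / total - ref_share
            for group, ref_share in reference.items()}

# Hypothetical annotations for 8 generated images of one occupation,
# plus an illustrative reference split.
generated = ["woman"] * 8                        # all images depict perceived women
reference = {"woman": 0.87, "man": 0.13}
print(representation_gap(generated, reference))  # ~{'woman': +0.13, 'man': -0.13}
```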
Now, when it comes to data and model understanding, what we often do is look back at architectural and model behavior patterns to see how they are tied to important and common errors. For example, in the case of constraint satisfaction for user queries, we looked at factual errors and information fabrication and mapped them to attention patterns. And we see that whenever factual errors occur, there are very weak attention patterns within the model that map to these errors. And this is an important finding that is going to inform our next steps in model improvement.
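A hedged sketch of this kind of attention probe, using a small open model as a stand-in and illustrative token spans rather than the study's actual models or analysis: it measures how much attention the answer tokens pay back to the constraint span in the prompt.

```python
# Illustrative attention probe; model id and token spans are assumptions for the sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # stand-in open model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

text = "List books by Jane Austen published after 1813: Emma, Persuasion"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # average over layers and heads
constraint_span = slice(5, 9)                # token positions of the constraint (illustrative)
answer_span = slice(10, attn.shape[-1])      # token positions of the answer (illustrative)

# Mean attention mass flowing from answer tokens back to the constraint tokens;
# weak mass here would be the kind of pattern associated with factual errors.
constraint_attention = attn[answer_span, constraint_span].sum(dim=-1).mean()
print(f"attention mass on constraint tokens: {constraint_attention:.3f}")
```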
So as we push the new frontiers in AI innovation, we are also just as excited about understanding and measuring that progress scientifically. And we hope that many of you are going to join us in that challenge.
Thank you.