
Research Forum Brief | January 2024

Kahani: Visual Storytelling through Culturally Nuanced Images



“[Project Kahani is] trying to produce images that are not only visually stunning but also culturally nuanced. Past work has shown that diffusion models tend to stereotype and fail to understand local words, but that work doesn’t provide ways to overcome these shortcomings without modifying the model or fine-tuning it.”

Sameer Segal, Principal Research Software Development Engineer

Transcript – Lightning Talk 5: Kahani: Visual Storytelling through Culturally Nuanced Images

Sameer Segal, Principal Research Software Development Engineer, Microsoft Research India 

Sameer Segal discusses Kahani, a research prototype that allows the user to create visually stunning and culturally nuanced images just by describing them in their local languages. 

Microsoft Research Forum, January 30, 2024 

SAMEER SEGAL: Hi, everyone. My name is Sameer Segal. I’m a principal research engineer at the Microsoft Research India lab. I’ve always been passionate about technology and societal impact. I was an entrepreneur for 10 years before I joined MSR (Microsoft Research), and it’s been absolutely wonderful the last couple of years that I’ve been here because I’m pursuing my passion at a whole new level of scale.

I’m also a father to a 6-year-old daughter, and like most parents with kids this age, you spend a lot of time making up stories, sometimes to teach important lessons like how to be kind and sometimes just for fun. In India, we have a great repertoire of folktales, but unfortunately, they’re not visually appealing to the kids of today. With all the recent advancements in generative AI, like large language models and diffusion models, wouldn’t it be great if we could create visual stories? That’s what our Project Kahani is trying to do: produce images that are not only visually stunning but also culturally nuanced.

Past work has shown that diffusion models tend to stereotype and fail to understand local words, but that work doesn’t provide ways to overcome these shortcomings without modifying the model or fine-tuning it. The other big problem is that getting that perfect image requires a lot of prompting, and if you have to turn to tools like Adobe Photoshop or to fine-tuning, it’s out of reach for laypeople. And that’s really sad, because these models were meant to be a force for democratization.

Our project started off at an internal hackathon a few months ago and has now evolved into a research project. Let me show you what we have built.  

I’m going to paste a prompt inspired by a story that my daughter and I recently read. It’s about Geetha, a girl who lives near a forest near BR Hills, and about her unexpected friendship with a butterfly and a firefly. And we want to emphasize how important it is to be kind to your friends. The system takes this instruction, picks up the cultural nuances from it, and generates a story. From there, it creates characters and scenes. Here is an example of what it has produced: Geetha meets a butterfly that’s stuck in a cobweb.

If I’d like to add more to the story, I can make changes and just add new instructions. But I can also give specific instructions on a particular scene. In villages in India, we have something called a Nazar Battu, which wards off evil. So I can pull up the scene, make a little hole here to place the object that I want, and give the system a reference image, telling it that this is a Nazar Battu. Let’s see what it does with this. [PAUSE] There you have it. It was pretty easy to get a word that the model doesn’t really understand right where we wanted it in the context of our story.
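The mask-and-fill step in this demo resembles standard diffusion inpainting: the user carves a hole in the scene, and the system fills it with the desired object, guided by a concrete visual description. The talk doesn’t show Kahani’s implementation, so the following is only a minimal sketch using an off-the-shelf inpainting pipeline from Hugging Face diffusers; the model name, file paths, and prompt text are illustrative assumptions.

```python
# Minimal sketch of the mask-and-fill step, assuming an off-the-shelf
# inpainting model via Hugging Face diffusers. This is NOT Kahani's
# actual pipeline; the model, paths, and prompt are illustrative.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

scene = Image.open("scene.png").convert("RGB").resize((512, 512))
# White pixels mark the "hole" the user made for the new object.
mask = Image.open("mask.png").convert("L").resize((512, 512))

# Base models don't know the local word "Nazar Battu", so the prompt
# spells out what the object looks like instead of naming it.
result = pipe(
    prompt="a hanging protective charm on the doorway of an Indian village house",
    image=scene,
    mask_image=mask,
    num_inference_steps=50,
).images[0]
result.save("scene_with_charm.png")
```

In Kahani’s case, that concrete description would itself come from the reference image and the language model, as described next.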

Let me show you what was happening behind the scenes. From my prompt, we were able to extract these cultural elements. Large language models are especially good at this: they were able to understand that BR Hills is a place in southern India. From these cultural nuances, we created a specific prompt that generated this character, and from this character, we composed various scenes. Now, it’s not perfect, but it’s a big step up from where we were with the models alone. This work required a series of benchmarking exercises in which we tried different prompts with names, visual descriptions, and definitions; generated an image; and compared it to a reference image retrieved from a search engine. GPT-4 [with] Vision was used as a judge to decide whether the generated image actually matched the reference image.
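At a high level, that pipeline involves two model calls: an LLM pass that surfaces cultural elements and expands them into concrete visual descriptions, and a GPT-4 with Vision pass that judges whether a generated image matches a reference image. Kahani’s code and exact prompts aren’t shown in the talk, so the sketch below is a reconstruction against the OpenAI Python SDK; the model names, prompt wording, and function names are all assumptions.

```python
# Sketch of the two LLM calls described in the talk, using the OpenAI
# Python SDK. Model names, prompts, and function names are assumptions;
# Kahani's actual benchmarking protocol is only described at a high level.
import base64
from openai import OpenAI

client = OpenAI()

def extract_cultural_elements(user_prompt: str) -> str:
    """Ask an LLM to surface cultural cues (places, names, motifs) in the
    user's story prompt and expand them into concrete visual descriptions."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "List the cultural elements in this story prompt "
                        "(region, names, clothing, flora and fauna, motifs) "
                        "and expand each into a concrete visual description."},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

def _b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def judge_match(generated_path: str, reference_path: str, concept: str) -> str:
    """Use GPT-4 with Vision as a judge: does the generated image depict the
    same concept as a reference image retrieved from a search engine?"""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Image 1 was generated; image 2 is a reference for "
                         f"'{concept}'. Do they depict the same concept? "
                         f"Answer YES or NO with one sentence of reasoning."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{_b64(generated_path)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{_b64(reference_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```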

We believe our work has tremendous potential. It can make local culture a lot more accessible, especially in image generation, and this can have applications not just in storytelling and education but across domains.

Thank you.