Microsoft Research Blog


Bots generate video titles and tags to bring AI researchers one step closer to visual intelligence

October 10, 2016 | By Microsoft blog editor

By Winnie Cui, Senior Research Program Manager, Microsoft Research Asia

When your grandma posts a video to the cloud, there it lies, lonely and unwatched, unless your nana has more tagging and titling skills than mine does. My nana loves taking family videos with her cellphone, but while she’s great at creating content, she’s not so good at creating an audience. While my sisters and I might love to watch a clip, she’s made it practically undiscoverable.

I know your nana (and friends, colleagues, and family) is probably like mine, because a high percentage of user-generated videos stored in the cloud have very few views. Well, grandma, help is here from artificial intelligence researchers. Their research will eventually enable you to easily find and watch user-generated content, including that amazing clip of your grandpa losing his teeth while dancing at your cousin’s wedding!

Chia-Wen Lin and Min Sun, professors in the Electrical Engineering department of National Tsing Hua University in Taiwan, tackled this issue with machine learning. In effect, they’ve created a system in which bots watch a video, determine its highlights, create a relevant title for easy searching, and recommend who might want to be tagged to watch it.

“Our research has taken us one step closer to the holy grail of visual intelligence: understanding visual content in user-generated videos,” said Professor Sun.

Video Title Generation System

Professor Sun created a video title generation method based on deep learning to automatically find the special moments—or highlights—in videos, and generate an accurate and interesting title for the highlights. In parallel, Professor Lin developed a method to detect and cluster the faces in videos to provide richer summaries of the videos and relevant suggestions about whom to share them with. Working together, their algorithms can detect highlights, generate descriptions of highlights and tag potential viewers of user-generated videos.
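To make the highlight-detection step concrete, here is a toy sketch. This is not the professors’ actual deep-learning method, which learns its scoring from data; it only illustrates the idea that, given a per-frame “interestingness” score, a highlight can be picked out as the contiguous window of frames with the highest total score.

```python
def best_highlight(scores, window):
    """Return (start, end) of the contiguous window of `window` frames
    with the highest total score -- a toy stand-in for a learned
    highlight detector. `scores` is one number per video frame."""
    if window > len(scores):
        raise ValueError("window longer than video")
    # Sliding-window sum: update the running total in O(1) per step.
    best_start = 0
    best_sum = current = sum(scores[:window])
    for i in range(1, len(scores) - window + 1):
        current += scores[i + window - 1] - scores[i - 1]
        if current > best_sum:
            best_start, best_sum = i, current
    return best_start, best_start + window

# Hypothetical per-frame scores, as a model might output them:
scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.3]
print(best_highlight(scores, 3))  # -> (2, 5)
```

In the real system, a title generator would then be run only on the frames inside the selected window, rather than on the whole video.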

Their work was inspired by Microsoft Research’s COCO (Common Objects in Context), an image recognition, segmentation, and captioning dataset containing more than 300,000 images annotated in context. Because videos are essentially a succession of images, this dataset could help with video title generation. In 2015, Professor Lin and Professor Sun collaborated with Dr. Tao Mei, a lead researcher in multimedia at Microsoft Research Asia, using MSCOCO captions both for sentence augmentation and to train their system. Their work is published on arXiv.

So far, Professors Sun and Lin have analyzed 18,000 videos for highlights and generated 44,000 titles and descriptions. To improve the system, Professor Sun and his students participated in the VideoToText challenge sponsored by Microsoft Research, using the data released in the challenge for additional validation. Their work will be published at ECCV (the European Conference on Computer Vision), October 8–16, 2016. Professor Sun and Dr. Tao Mei have started looking into the next phase of their collaboration, on storytelling with personal photos.

If you’re also engaged with furthering the state of the art in visual intelligence research, you’ll find our Cognitive Services Computer Vision API useful. It extracts rich information from any image to categorize and process visual data. You can even build an app to caption your own videos using our sample on GitHub. Check it out!
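As a minimal illustration of calling such a service, the sketch below assembles a request for the Computer Vision API’s image-description operation. This is an assumption-laden sketch, not Microsoft’s official sample: the region, API version (v3.2 here), and subscription key are placeholders you would replace with your own deployment’s values.

```python
def build_describe_request(endpoint, key, image_url, max_candidates=1):
    """Assemble the pieces of a POST to the Computer Vision 'describe'
    operation. Returns a plain dict so the caller can hand the parts to
    any HTTP client. All values here are placeholders/assumptions."""
    return {
        "url": f"{endpoint}/vision/v3.2/describe",
        "headers": {
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        "params": {"maxCandidates": max_candidates},
        "json": {"url": image_url},
    }

req = build_describe_request(
    "https://westus.api.cognitive.microsoft.com",  # hypothetical region
    "<your-subscription-key>",                     # placeholder key
    "https://example.com/frame.jpg",               # hypothetical frame
)
print(req["url"])
```

Posting these pieces with an HTTP client (for example, `requests.post(req["url"], headers=req["headers"], params=req["params"], json=req["json"])`) returns a JSON body whose `description.captions` list holds candidate captions with confidence scores.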
