Bots generate video titles and tags to bring AI researchers one step closer to visual intelligence
By Winnie Cui, Senior Research Program Manager, Microsoft Research Asia
When your grandma posts a video to the cloud, there it lies, lonely and unwatched, unless your grandmother has more tagging and titling skills than mine does. My nana loves taking family videos with her cellphone, but while she’s great at creating content, she’s not so good at creating an audience. While my sisters and I might love to watch the clip, she’s made it practically undiscoverable.
I suspect your nana (and your friends, colleagues, and family) is a lot like mine: a high percentage of user-generated videos stored in the cloud get very few views. Well, grandma, help is here from artificial intelligence researchers. Their research will eventually enable you to easily find and watch user-generated content, including that amazing clip of your grandpa losing his teeth while dancing at your cousin’s wedding!
Chia-Wen Lin and Min Sun, professors in the Department of Electrical Engineering at National Tsing Hua University in Taiwan, tackled this issue with machine learning. In effect, they’ve created a system in which bots watch a video, determine its highlights, create a relevant title for easy searching, and recommend who might want to be tagged to watch it.
“Our research has taken us one step closer to the holy grail of visual intelligence: understanding visual content in user-generated videos,” said Professor Sun.
Professor Sun created a video title generation method based on deep learning to automatically find the special moments—or highlights—in videos, and generate an accurate and interesting title for the highlights. In parallel, Professor Lin developed a method to detect and cluster the faces in videos to provide richer summaries of the videos and relevant suggestions about whom to share them with. Working together, their algorithms can detect highlights, generate descriptions of highlights and tag potential viewers of user-generated videos.
Their work was inspired by Microsoft Research’s COCO (Common Objects in Context). COCO is an image recognition, segmentation, and captioning dataset containing more than 300,000 images, each annotated in context. Because videos are essentially a succession of images, this dataset could help with video title generation. Professor Lin and Professor Sun collaborated in 2015 with Dr. Tao Mei, a lead researcher in multimedia at Microsoft Research Asia, using COCO captions both for sentence augmentation and to train their system. Their work is published on arXiv.
To date, Professors Sun and Lin have analyzed 18,000 videos for highlights and generated 44,000 titles and descriptions. To improve the system, Professor Sun and his students participated in the VideoToText challenge sponsored by Microsoft Research, using the data released in the challenge for additional validation. Their work will be published at ECCV (the European Conference on Computer Vision), October 8–16, 2016. Professor Sun and Dr. Tao Mei have started looking into the next phase of their collaboration: storytelling with personal photos.
If you’re also engaged in furthering the state of the art in visual intelligence research, you’ll find our Cognitive Services Computer Vision API useful. It extracts rich information from any image to categorize and process visual data. You can even build an app to caption your own videos using our sample on GitHub. Check it out!
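To give a feel for how such a call might look, here is a minimal Python sketch that assembles a request to the Computer Vision API’s image-description endpoint, which returns caption candidates for an image. The endpoint URL, subscription key, and image URL below are placeholders you would replace with your own; the exact endpoint and parameters may differ for your service region and API version, so treat this as an illustration rather than a definitive client.

```python
import json

# Placeholder endpoint -- substitute the region and API version
# from your own Cognitive Services subscription.
ENDPOINT = "https://westus.api.cognitive.microsoft.com/vision/v1.0/describe"

def build_describe_request(image_url, subscription_key, max_candidates=1):
    """Assemble the URL, headers, and JSON body for a Computer Vision
    'describe' call, which asks the service to caption an image."""
    url = "{}?maxCandidates={}".format(ENDPOINT, max_candidates)
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,  # your API key
        "Content-Type": "application/json",
    }
    body = json.dumps({"url": image_url})  # image is fetched by the service
    return url, headers, body

# Hypothetical example inputs -- replace with a real key and image URL.
url, headers, body = build_describe_request(
    "https://example.com/wedding-clip-frame.jpg", "<your-key>")
```

With a valid key, you would send the request with any HTTP client, for example `requests.post(url, headers=headers, data=body)`, and read the caption candidates from the JSON response. Captioning a video then amounts to sampling frames and describing each one, as the GitHub sample linked above demonstrates.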