Microsoft Research Blog

Bots generate video titles and tags to bring AI researchers one step closer to visual intelligence

October 10, 2016 | By Microsoft blog editor

By Winnie Cui, Senior Research Program Manager, Microsoft Research Asia

When your grandma posts a video to the cloud, there it lies, lonely and unwatched, unless your grandmother has more tagging and titling skills than mine does. My nana loves taking family videos with her cellphone, but while she’s great at creating content, she’s not so good at creating an audience. While my sisters and I might love to watch the clip, she’s made it practically undiscoverable.

I know your nana (and friends, colleagues, and family) is probably like mine, because a high percentage of user-generated videos stored in the cloud have very few views. Well, grandma, help is here from artificial intelligence researchers. Their research will eventually enable you to easily find and watch user-generated content, including that amazing clip of your grandpa losing his teeth while dancing at your cousin’s wedding!

Chia-Wen Lin and Min Sun, professors in the Electrical Engineering department of National Tsing Hua University in Taiwan, tackled this issue with machine learning. In effect, they’ve created a system where bots watch a video, determine its highlights, create a relevant title for easy searching, and recommend who might want to be tagged to watch it.

“Our research has taken us one step closer to the holy grail of visual intelligence: understanding visual content in user-generated videos,” said Professor Sun.

Video Title Generation System

Professor Sun created a video title generation method based on deep learning to automatically find the special moments—or highlights—in videos, and generate an accurate and interesting title for the highlights. In parallel, Professor Lin developed a method to detect and cluster the faces in videos to provide richer summaries of the videos and relevant suggestions about whom to share them with. Working together, their algorithms can detect highlights, generate descriptions of highlights and tag potential viewers of user-generated videos.

Their work was inspired by Microsoft Research’s COCO (Common Objects in Context), a new image recognition, segmentation, and captioning dataset that contains more than 300,000 images of objects in context. Because videos are essentially a succession of images, this dataset could help with video title generation. In 2015, Professor Lin and Professor Sun collaborated with Dr. Tao Mei, a lead researcher in multimedia at Microsoft Research Asia, using COCO captions both for sentence augmentation and to train their system. Their work is published on arXiv.

To date, Professors Sun and Lin have analyzed 18,000 videos for highlights and generated 44,000 titles and descriptions. To improve the system, Professor Sun and his students participated in the VideoToText challenge sponsored by Microsoft Research, using the data released in the challenge for additional validation. Their work will be published at the European Conference on Computer Vision (ECCV), October 8–16, 2016. Professor Sun and Dr. Tao Mei have started looking into the next phase of their collaboration, on storytelling from personal photos.

If you’re also engaged with furthering the state of the art in visual intelligence research, you’ll find our Cognitive Services Computer Vision API useful. It extracts rich information from any image to categorize and process visual data. You can even build an app to caption your own videos using our sample on GitHub. Check it out!
