Multimedia Search and Mining

Established: November 18, 2013

The Multimedia Search and Mining (MSM) group focuses on a wide range of multimedia research and projects, including understanding, analysis, search, data mining, and applications. We work on research problems in image understanding, video analytics, large-scale visual (image and video) indexing and search, 3D reconstruction, and related areas.

Large Scale Weakly Supervised Learning

Established: August 1, 2016

Click-through data accumulated by search engines builds rich connections between images and semantics through massive numbers of user clicks. The data comes for free as search engines serve their users, and it naturally scales to millions or even billions of examples. Unlike purpose-built datasets, click-through data is noisy, unstructured, and unbalanced. In this project, we aim to use click-through data effectively to solve image understanding problems.


Established: August 1, 2016

We study the problem of image captioning, i.e., automatically describing an image with a sentence. This is a challenging problem: unlike other computer vision tasks such as image classification and object detection, image captioning requires not only understanding the image but also knowledge of natural language. We formulate this problem as a multimodal translation task and develop novel algorithms to solve it.

Network Morphism

Established: August 1, 2016

We propose a novel learning scheme called network morphism. It morphs a parent network into a child network, enabling fast knowledge transfer. The child network achieves the performance of the parent network immediately, and its performance continues to improve as training proceeds. The proposed scheme supports any network morphism in an expanding mode for arbitrary non-linear neurons, including depth, width, kernel size, and subnet morphing operations.
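The idea that a child network can match its parent immediately can be illustrated with a minimal sketch. This is not the project's algorithm: it shows only one simple, well-known way to widen a hidden layer without changing the network's output (duplicate a hidden unit and split its outgoing weights in half). The two-layer tanh network and the names `forward` and `widen` are assumptions made for illustration.

```python
import math
import random

def forward(x, w1, w2):
    """Two-layer network y = w2 * tanh(w1 * x); weights are lists of rows."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def widen(w1, w2, unit):
    """Hypothetical width morphing: copy hidden `unit`, halve its outgoing weights."""
    # The new hidden unit receives the same incoming weights as the copied one.
    child_w1 = [row[:] for row in w1] + [w1[unit][:]]
    child_w2 = []
    for row in w2:
        half = row[unit] / 2.0
        new_row = row[:]
        new_row[unit] = half   # original unit keeps half of the outgoing weight
        new_row.append(half)   # the duplicate carries the other half
        child_w2.append(new_row)
    return child_w1, child_w2

random.seed(0)
parent_w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
parent_w2 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
child_w1, child_w2 = widen(parent_w1, parent_w2, unit=1)

x = [0.5, -0.2, 0.8]
parent_out = forward(x, parent_w1, parent_w2)
child_out = forward(x, child_w1, child_w2)
# Largest difference between parent and child outputs (rounding error only).
max_gap = max(abs(p - c) for p, c in zip(parent_out, child_out))
```

Because the duplicated unit produces the same activation as the original, splitting the outgoing weight between the two leaves every output unchanged, so the wider child starts from the parent's function and can then be trained further.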

Fine-grained Image Recognition

Established: July 1, 2016

Recognizing fine-grained categories (e.g., bird species) is difficult because it requires both localizing discriminative regions and learning fine-grained features. In this project, we aim to recognize fine-grained image categories with very high accuracy. For example, we can now recognize more than 1,000 flower species, 200 bird species, 200 dog breeds, and 800+ car models with accuracy higher than 88% on several large-scale real-world datasets. In our work accepted to CVPR 2017, we propose a…

Video Analysis

Established: March 16, 2016

Video has become ubiquitous on the Internet, on broadcasting channels, and on personal devices. This has encouraged the development of advanced techniques for analyzing semantic video content for a wide variety of applications, such as video representation learning [CVPR 2017], video highlight detection [CVPR 2016], video summarization, object detection, action recognition [CVPR 2016, ICMR 2016], and semantic segmentation.

Highlight detection

The emergence of wearable devices such as portable cameras…

Image Chat

Established: February 22, 2016

Images are becoming a popular medium for user communication on social networks. It is therefore a natural requirement for chatbots to chat about images in addition to textual input. Building on MS XiaoIce (微软小冰), we explore image chat and have iterated through several rounds to improve her ability to converse about images.

Food Recognition

Established: January 25, 2016

We study the problem of food image recognition using deep learning techniques. Our goal is to develop a robust service that recognizes thousands of popular Asian and Western foods. Several prototypes have been developed to support diverse applications. The techniques have been shipped to Bing local search and XiaoIce. We are also developing a prototype called Im2Calories, which automatically calculates calories and performs nutrition analysis for an image of a dish.

Photo Story

Established: January 25, 2016

The ability to manage personal photos is becoming crucial. In this work, we have attempted to address the following pain points for mobile users: 1) intelligent photo tagging, best photo selection, event segmentation, and album naming; 2) speech recognition and user intent parsing of time, location, people attributes, and objects; 3) search by arbitrary queries. First, we automatically segment and categorize unstructured photo streams into multiple semantically related albums. Second, we analyze…

Vision and Language

Established: January 14, 2016

Automatically describing visual content with natural language is a fundamental challenge in computer vision and multimedia. Sequence learning (e.g., Recurrent Neural Networks), attention mechanisms, memory networks, and related techniques have attracted increasing attention for visual interpretation. In this project, we focus on the following topics related to the emerging area of "vision and language":

Image and video captioning, including the MSR-VTT video to language grand challenge and datasets
Image and video commenting (conversation)
Visual storytelling (e.g., generation of…

Deep Neural Networks

Established: September 1, 2015

We study how to morph a well-trained neural network into a new one, and how to design advanced deep neural networks.

3D Object Reconstruction and Recognition

Established: May 1, 2015

We study the problem of 3D object reconstruction and recognition. For reconstruction, we aim to develop algorithms and systems that lower the barrier to 3D reconstruction for common users. In this way, we can collect a world-class 3D object repository through crowdsourcing. For recognition, we aim to handle large-scale tasks (e.g., identifying thousands of objects) while providing real-time performance.

1. 3D Object Reconstruction: coming soon ...

2. 3D Object Recognition…

Mobile Video Search

Established: February 17, 2014

Mobile video is quickly becoming a mass consumer phenomenon. More and more people are using their smartphones to search and browse video content while on the move. This project aims to develop an innovative instant mobile video search system through which users can discover videos by simply pointing their phones at a screen to capture just a few seconds of what they are watching. The system is able to index large-scale video…

Image/Video Understanding and Analysis

Established: February 1, 2014

We target core problems in image/video understanding and analysis, such as image recognition, image segmentation, image captioning, image parsing, object detection, and video segmentation.

Picto: A large scale visual indexing and recognition system

Established: September 1, 2009

Object image recognition is a challenging but important problem. To address this problem, we initiated the Picto project. Our research in this project covers three fundamental aspects of the problem: low-level image features, middle-level image representations, and indexing and recognition algorithms. We especially emphasize scalability and applicability in our research.

1. Large-scale indexing techniques

In most object image retrieval systems, images are represented by the so-called bag-of-visual-words (BOF) vectors, in which each entry corresponds…
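For readers unfamiliar with the bag-of-visual-words representation mentioned above, the following is a generic, minimal sketch rather than a description of Picto itself. It assumes that each vector entry counts the local descriptors assigned to one visual word, and that an inverted index maps each word to the images containing it. The toy `codebook`, the descriptors, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def nearest_word(descriptor, codebook):
    """Index of the codebook centroid closest to the descriptor (squared L2)."""
    def sq_dist(centroid):
        return sum((d - c) ** 2 for d, c in zip(descriptor, centroid))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def bag_of_words(descriptors, codebook):
    """Histogram over visual words: entry i counts descriptors assigned to word i."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    return [counts.get(i, 0) for i in range(len(codebook))]

def build_inverted_index(image_vectors):
    """Map each visual word to the set of images in which it occurs."""
    index = defaultdict(set)
    for image_id, vector in image_vectors.items():
        for word, count in enumerate(vector):
            if count > 0:
                index[word].add(image_id)
    return index

# Toy 2-D "descriptors" and a 3-word codebook, both made up for illustration.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
images = {
    "img_a": [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8]],
    "img_b": [[4.8, 5.1], [0.2, 0.1]],
}
vectors = {name: bag_of_words(descs, codebook) for name, descs in images.items()}
index = build_inverted_index(vectors)
```

At query time, only the images listed under the query's visual words need to be scored, which is why inverted indexes are commonly paired with this representation at large scale.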

MindFinder: Finding Images by Sketching

Established: August 12, 2009

Sketch-based image search is a well-known and difficult problem, and little progress has been made over the past decade toward a large-scale, practical sketch-based search engine. We have revisited this problem and developed a scalable solution to sketch-based image search. The MindFinder system has been built by indexing more than 1.5 billion web images to enable efficient sketch-based image retrieval, and many creative applications can be expected…

Video Collage

Established: August 1, 2006

Video Collage is a kind of synthesized image that enables users to quickly browse video content. Given a video, Video Collage selects the most representative images from the video, extracts salient regions of interest (ROIs) from these images, and seamlessly arranges the ROIs on a given canvas. Video Collage can be used for Windows Vista Explorer, Live Search Video, and MSN Soapbox.

Publications: Tao Mei, Bo Yang, Shi-Qiang Yang, Xian-Sheng Hua.…

Multimedia Advertising

Established: August 1, 2006

The ever-increasing multimedia content on the Internet has become a primary source for more effective online advertising. Conventional advertising systems treat multimedia content the same as general text and do not automatically monetize the rich content of images and videos. This research direction leverages content analysis and understanding to enable more effective and efficient advertising on multimedia content, whether on the Internet or on mobile devices.

Projects: ImageSense. ImageSense automatically finds suitable images…