ACM MM: Best Papers
Below are the best papers from ACM Multimedia 2015.
Xavier Alameda-Pineda, Yan Yan (University of Trento, Italy), Elisa Ricci, Oswald Lanz (Fondazione Bruno Kessler, Italy), Nicu Sebe (University of Trento, Italy)
During natural social gatherings, humans tend to organize themselves in the so-called free-standing conversational groups. In this context, robust head and body pose estimates can facilitate the higher-level description of the ongoing interplay. Importantly, visual information typically obtained with a distributed camera network might not suffice to achieve the robustness sought. In this line of thought, recent advances in wearable sensing technology open the door to multimodal and richer information flows. In this paper we propose to cast the head and body pose estimation problem into a matrix completion task. We introduce a framework able to fuse multimodal data emanating from a combination of distributed and wearable sensors, taking into account the temporal consistency, the head/body coupling and the noise inherent to the scenario. We report results on the novel and challenging SALSA dataset, containing visual, auditory and infrared recordings of 18 people interacting in a regular indoor environment. We demonstrate the soundness of the proposed method and the usability for higher-level tasks such as the detection of F-formations and the discovery of social attention attractors.
Michael Stengel, Steve Grogorick (TU Braunschweig, Germany), Elmar Eisemann (TU Delft, The Netherlands), Martin Eisemann (TH Koeln, Germany), Marcus A. Magnor (TU Braunschweig, Germany)
Immersion is the ultimate goal of head-mounted displays (HMD) for Virtual Reality (VR) in order to produce a convincing user experience. Two important aspects in this context are motion sickness, often due to imprecise calibration, and the integration of a reliable eye tracking. We propose an affordable hard- and software solution for drift-free eye-tracking and user-friendly lens calibration within an HMD. The use of dichroic mirrors leads to a lean design that provides the full field-of-view (FOV) while using commodity cameras for eye tracking. Our prototype supports personalizable lens positioning to accommodate for different interocular distances. On the software side, a model-based calibration procedure adjusts the eye tracking system and gaze estimation to varying lens positions. Challenges such as partial occlusions due to the lens holders and eye lids are handled by a novel robust monocular pupil-tracking approach. We present four applications of our work: Gaze map estimation, foveated rendering for depth of field, gaze-contingent level-of-detail, and gaze control of virtual avatars.
Wei Wang (National University of Singapore), Gang Chen (Zhejiang university, China), Anh Tien Tuan Dinh, Jinyang Gao, Beng Chin Ooi, Kian-Lee Tan, Sheng Wang (National University of Singapore)
Recently, deep learning techniques have enjoyed success in various multimedia applications, such as image classification and multi-modal data analysis. Two key factors behind deep learning’s remarkable achievement are the immense computing power and the availability of massive training datasets, which enable us to train large models to capture complex regularities of the data. There are two challenges to overcome before deep learning can be widely adopted in multimedia and other applications. One is usability, namely the implementation of different models and training algorithms must be done by non-experts without much effort. The other is scalability, that is the deep learning system must be able to provision for a huge demand of computing resources for training large models with massive datasets. To address these two challenges, in this paper, we design a distributed deep learning platform called SINGA which has an intuitive programming model and good scalability. Our experience with developing and training deep learning models for real-life multimedia applications in SINGA shows that the platform is both usable and scalable.
Xiangbo Shu (Nanjing University of Science and Technolog, China), Guo-Jun Qi (University of Central Florida, USA), Jinhui Tang (Nanjing University of Science and Technology, China), Jingdong Wang (Microsoft Research, China)
In recent years, deep networks have been successfully applied to model image concepts and achieved competitive performance on many data sets. In spite of impressive performance, the conventional deep networks can be subjected to the decayed performance if we have insufficient training examples. This problem becomes extremely severe for deep networks with powerful representation structure, making them prone to over fitting by capturing nonessential or noisy information in a small data set. In this paper, to address this challenge, we will develop a novel deep network structure, capable of transferring labeling information across heterogeneous domains, especially from text domain to image domain. This weakly-shared Deep Transfer Networks (DTNs) can adequately mitigate the problem of insufficient image training data by bringing in rich labels from the text domain.
Specifically, we present a novel architecture of DTNs to translate cross-domain information from text to image. To share the labels between two domains, we will build multiple weakly shared layers of features. It allows to represent both shared inter-domain features and domain-specific features, making this structure more flexible and powerful in capturing complex data of different domains jointly than the strongly shared layers. Experiments on real world dataset will show its competitive performance as compared with the other state-of-the-art methods.
For more computer science research news, visit ResearchNews.com.