Mini Talks on Multimedia, Interaction and Communication


October 4, 2011


Dinei Florencio, Sanjeev Mehrotra, and Zhengyou Zhang


Microsoft Research


This seminar consists of four mini talks to be presented at the IEEE International Workshop on Multimedia Signal Processing (MMSP), Hangzhou, China, October 17-19, 2011.

Mini Talk 1: Crowdsourcing Region of Interest Determination for Videos, by Flavio Ribeiro, Dinei Florencio
Abstract: The ability to identify and track visually interesting regions has many practical applications – for example, in image and video compression, visual marketing and foveal machine vision. Due to challenges in modeling the peculiarities of human physiological and psychological responses, automatic detection of fixation points is an open problem. Indeed, no objective methods are currently capable of fully modeling the human perception of regions of interest (ROIs). Thus, research often relies on user studies with eye tracking systems. In this paper we propose a cost-effective and convenient alternative, obtained by having internet workers annotate videos with ROI coordinates. The workers use an interactive video player with a simulated mouse-driven fovea, which models the fall-off in resolution of the human visual system. Since this approach is not supervised, we implement methods for identifying inaccurate or malicious results. Using this proposal, one can collect ROI data in an automated fashion, and at a much lower cost than laboratory studies.

Mini Talk 2: Interpolation of Combined Head and Room Impulse Response for Audio Spatialization, by Sanjeev Mehrotra, Wei-ge Chen, Zhengyou Zhang
Abstract: Audio spatialization is becoming an important part of creating realistic experiences needed for immersive video conferencing and gaming. Using a combined head and room impulse response (CHRIR) has recently been proposed as an alternative to using separate head related transfer functions (HRTF) and room impulse responses (RIR). Accurate measurements of the CHRIR at various source and listener locations and orientations are needed to perform good quality audio spatialization. However, it is infeasible to accurately measure or model the CHRIR for all possible locations and orientations. Therefore, low-complexity and accurate interpolation techniques are needed to perform audio spatialization in real time. In this talk, we present a novel frequency domain interpolation technique which naturally interpolates the interaural level difference (ILD) and interaural time difference (ITD) for each frequency component in the spectrum. The proposed technique allows for an accurate and low-complexity interpolation of the CHRIR, and enables a low-complexity audio spatialization technique which can be used with both headphones and loudspeakers.
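As a rough illustration of why interpolating in the frequency domain can preserve level and time cues, the sketch below blends two measured impulse responses by interpolating magnitude and unwrapped phase separately in each frequency bin. This is a generic textbook approach and not necessarily the method presented in the talk; the function name interpolate_ir and the weight w are hypothetical.

```python
import numpy as np

def interpolate_ir(h_a, h_b, w):
    """Blend two equal-length impulse responses; w=0 gives h_a, w=1 gives h_b."""
    H_a = np.fft.rfft(h_a)
    H_b = np.fft.rfft(h_b)
    # Interpolating magnitude per bin blends the level at each frequency.
    mag = (1.0 - w) * np.abs(H_a) + w * np.abs(H_b)
    # Interpolating unwrapped phase per bin blends the delay at each frequency.
    phase = (1.0 - w) * np.unwrap(np.angle(H_a)) + w * np.unwrap(np.angle(H_b))
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(h_a))

# Two pure delays of 4 and 8 samples stand in for measured responses.
h_near = np.zeros(64); h_near[4] = 1.0
h_far = np.zeros(64); h_far[8] = 1.0
h_mid = interpolate_ir(h_near, h_far, 0.5)
```

With these toy inputs, the midpoint response is a single peak at an intermediate delay, whereas averaging the two responses in the time domain would produce two half-height peaks.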

Mini Talk 3: Low-Complexity, Near-Lossless Coding of Depth Maps from Kinect-Like Depth Cameras, by Sanjeev Mehrotra, Zhengyou Zhang, Qin Cai, Cha Zhang, Philip A. Chou
Abstract: Depth cameras are rapidly gaining interest in the market, as depth plus RGB is being used for a variety of applications ranging from foreground/background segmentation, face tracking, and activity detection to free viewpoint video rendering. In this talk, we present a novel low-complexity, near-lossless codec for coding depth maps. This codec requires no buffering of video frames, is table-less, can encode or decode a frame in close to 5 ms with little code optimization, and provides a compression ratio between 7:1 and 16:1 for near-lossless coding of 16-bit depth maps generated by the Kinect camera.
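For readers unfamiliar with the term, "near-lossless" usually means that the reconstruction error of every sample is bounded by a small tolerance. The sketch below shows one generic way to obtain such a bound, uniform quantization with step 2 * max_err + 1; it is an assumption made for illustration only and does not describe the codec presented in the talk. The names quantize, dequantize, and max_err are hypothetical.

```python
import numpy as np

def quantize(depth, max_err):
    """Map 16-bit depth values to bin indices with per-sample error <= max_err."""
    step = 2 * max_err + 1
    return (depth.astype(np.int64) + max_err) // step

def dequantize(idx, max_err):
    """Reconstruct each sample at the center of its quantization bin."""
    step = 2 * max_err + 1
    return idx * step

# Synthetic 16-bit depth frame (random values, not real sensor data).
rng = np.random.default_rng(0)
frame = rng.integers(0, 65536, size=(480, 640), dtype=np.uint16)
recon = dequantize(quantize(frame, 2), 2)
worst = int(np.abs(recon - frame.astype(np.int64)).max())
```

Here worst never exceeds max_err, and the smaller index alphabet is what a subsequent entropy coder would exploit.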

Mini Talk 4: ViewMark: An Interactive Videoconferencing System for Mobile Devices, by Shu Shi, Zhengyou Zhang
Abstract: ViewMark, a server-client based interactive mobile videoconferencing system, is proposed in this paper to enhance the remote meeting experience for mobile users. Compared with state-of-the-art mobile videoconferencing technology, ViewMark is novel in allowing a mobile user to interactively change the viewpoint of the remote video, create viewmarks, and hear with spatial audio. In addition, ViewMark streams the screen of the presentation slides to mobile devices. In this paper, we introduce the system design of ViewMark in detail, compare the devices that can be used to implement interactive videoconferencing, and demonstrate the prototype system we have built on the Windows Mobile platform.


Dinei Florencio, Sanjeev Mehrotra, and Zhengyou Zhang

Dinei Florêncio received the B.S. and M.S. from University of Brasília (Brazil), and the Ph.D. from Georgia Tech, all in Electrical Engineering. He has been a researcher with Microsoft Research since 1999, currently with the Multimedia, Interaction, and Communication group. From 1996 to 1999, he was a member of the research staff at the David Sarnoff Research Center. Dr. Florencio is a senior member of the IEEE, has published over 50 refereed papers, and holds 36 granted US patents (with another 20 currently pending). He received the 1998 Sarnoff Achievement Award, an NCR inventor award, and a SAIC award. His papers have won awards at SOUPS’2010, ICME’2010, and MMSP’2009. His research has enhanced the lives of millions of people, through high impact technology transfers to many Microsoft products, including Live Messenger, Exchange Server, RoundTable, and the MSN toolbar. He is a member of the IEEE SPS Multimedia Technical Committee, and an associate editor for the IEEE Trans. on Information Forensics and Security. Dr. Florencio was general chair of CBSP’2008, MMSP’2009 and WIFS’2011 and technical co-chair of Hot3D’2010, WIFS’2010, and ICME’2011.

Sanjeev Mehrotra is a Principal Software Architect in Microsoft Research, Redmond. Previously, he was the development manager for the audio codecs and DSP team in the Core Media Processing Technologies team. Prior to that, he was the development lead for the Windows Media Audio codec. Before that, he was one of the first employees at VXtreme, a pioneering streaming media startup. He is the primary inventor, designer, and developer for the Windows Media Screen codec, the low bitrate extensions to the Windows Media Professional Audio codec, the first prototype version of adaptive streaming technologies (SmoothStreaming), and numerous other media technologies shipping in Windows, Zune, Xbox, and other Microsoft products. Recently, he has developed the forward error correction code in OC/Lync, the new UDP based transport protocol used by Remote Desktop in Windows 8, and the new bandwidth management solution in Lync, and has helped optimize deduplication code in Windows 8 server. He has also helped develop audio spatialization and depth codec technologies for Viewport and Teleport. He received his Ph.D. from Stanford in 2000. He is an author on more than 70 patent applications and more than 30 peer reviewed publications. He is a senior member of the IEEE, has received the NSF, Tau Beta Pi, and Kodak graduate fellowships, and is also a recipient of a Microsoft Gold Star Award.

Zhengyou Zhang is a Research Manager and a Principal Researcher at MSR Redmond, leading the Multimedia, Interaction, and Communication (MIC) Group, whose mission is to develop novel multimedia technologies involving audio, visual, haptic, and other natural signals to improve people’s experience in interacting with each other and with machines. He has contributed to shipping several Microsoft products including Windows XP Media Edition, Xbox 360, RoundTable, Lync, Kinect, and Avatar Kinect. Before joining Microsoft Research in 1998, Zhengyou was a Senior Research Scientist at INRIA, France. In 1996-1997, he spent a one-year sabbatical at ATR, Kyoto, Japan. He received his Ph.D. degree in computer science from University of Paris, Orsay, France, his M.S. degree in computer science from University of Nancy, France, and his B.S. degree in Electrical Engineering from Zhejiang University, China. He is an IEEE Fellow.