Microsoft Research Blog

  1. Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely 

    September 1, 2024

    Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, the effective deployment of data-augmented LLMs across…

  2. Datacenter power and energy management: past, present, and future 

    September 1, 2024 | Ricardo Bianchini, Christian Belady, and Anand Sivasubramaniam

    This article gives an overview of key past developments in cloud datacenter power and energy management, where we are today, and what the future could be. This topic is gaining enormous, renewed interest in the context of the conflicting needs of the AI revolution and…

  3. COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning 

    September 1, 2024

    We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a Contextual Speech Model with Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal LLM. Using GPT-3.5, we generate Speech Comprehension Test Question-Answer (SQA) pairs from speech transcriptions for supervised instruction tuning. With under…

  4. AI detection of malicious push notifications in augmented reality in the workplace 

    September 1, 2024 | Sarah Katz

    Distraction caused by the visual processing of multiple objects during augmented reality (AR) immersion could make users more susceptible to malicious push notifications, potentially exposing organisations to unwitting insider threats. This case study consulted four experts in the field of AR application development to…

  5. Target conversation extraction: Source separation using turn-taking dynamics 

    September 1, 2024

    Extracting the speech of participants in a conversation amid interfering speakers and noise is a challenging problem. In this paper, we introduce the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker…

  6. Knowledge boosting during low-latency inference 

    September 1, 2024

    Models for low-latency, streaming applications could benefit from the knowledge capacity of larger models, but edge devices cannot run these models due to resource constraints. A possible solution is to transfer hints during inference from a large model running remotely to a small model running…