Audio and acoustics

Publication

Large-Scale Automatic Audiobook Creation

Brendan Walsh, Mark Hamilton, Greg Newby, Xi Wang, Serena Ruan, Sheng Zhao, Lei He, Shaofei Zhang, Eric Dettinger, William T. Freeman, Markus Weimer

Interspeech Show and Tell | September 2023

Video

HyWay: Physical Walk (MSR India – TAB Feb 2023)

August 21, 2023

A key aspect of attending such an event in person is being able to experience the setting in its fullness — hearing the buzz of background conversations and seeing who is around. This can be…

Publication

End-to-End Word-Level Pronunciation Assessment with MASK Pre-training

Yukang Liang, Kaitao Song, Shaoguang Mao, Huiqiang Jiang, Luna K. Qiu, Yuqing Yang, Dongsheng Li, Linli Xu, Lili Qiu

INTERSPEECH 2023 | August 2023

Microsoft Research Blog

Research Focus: Week of August 14, 2023

August 16, 2023

In this issue: HyWay enables hybrid mingling; Auto-Tables transforms non-relational tables into standard relational forms; training dense retrievers to identify high-quality in-context examples for LLMs; improving pronunciation assessment in CAPT.

Microsoft Research Focus 22 | Week of August 14, 2023

Publication

Audio Retrieval with WavText5K and CLAP Training

Soham Deshmukh, Benjamin Elizalde, Huaming Wang

2023 Conference of the International Speech Communication Association | August 2023

Publication

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

August 2023

Publication

Zero-Shot Transfer for Wildlife Bioacoustics Detection

Zhongqi Miao, Benjamin Elizalde, Soham Deshmukh, Justin Kitzes, Huaming Wang, Rahul Dodhia, Juan M. Lavista Ferres

August 2023

Microsoft Research Blog

Thinking beyond audio: Augmenting headphones for everyday digital interactions

July 12, 2023 | Payod Panda

Because headphones rank among the most popular wearables in the market, we have an exciting opportunity to expand their capabilities through integrating existing sensors with supplementary ones to enable a wide variety of experiences that…

Microsoft at DIS 2023: Thinking beyond audio: Augmenting headphones for everyday digital interactions

Microsoft Research Blog

Breaking cross-modal boundaries in multimodal AI: Introducing CoDi, composable diffusion for any-to-any generation

June 29, 2023 | Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal

Imagine an AI model that can seamlessly generate high-quality content across text, images, video, and audio, all at once. Such a model would more accurately capture the multimodal nature of the world and human comprehension,…

CoDi can generate any combination of modalities from any, all at once.

Project

VALL-E

Neural codec language models for speech synthesis

We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural…