Candidate Talk: Conversational Turn-Taking as a Dynamic Decision Process

September 29, 2008
Antoine Raux | Carnegie Mellon University

With the significant progress made in speech technologies over the past two decades, spoken dialog systems have been able to handle progressively more complex tasks, while at the same time becoming more sophisticated in handling the structure and the uncertainties inherent to conversation. Yet, at the lower level of timing and turn-taking, current systems are still rigid and brittle, particularly when natural language input is accepted. In this talk, I will focus specifically on the problem of end-of-turn detection, which has typically been handled by a pause detection mechanism combined with a fixed threshold on the duration of the pause (e.g. “consider that the user has finished their utterance when they pause for 700 ms or more”). The limitations of this approach are obvious. If the threshold is short, the system will be prone to interrupting the user in the middle of their turn (“cut-ins”), whereas if it is long, system latency will suffer. To address these issues, we designed an algorithm to dynamically set the threshold for each pause using features available from different levels of dialog: speech recognition scores, prosody, semantic interpretations, and discourse structure. By combining these features in a single decision tree, we were able to reduce system latency by up to 24% at a fixed cut-in rate in a publicly deployed spoken dialog system.
We then moved one step further and framed turn-taking as a dynamic decision process, i.e. one in which time is an important factor in the utility/cost of each possible action. By grounding the problem in a well-established theoretical framework, we have been able to 1) improve on our previous results, with latency reductions of up to 35% over a fixed-threshold baseline, 2) integrate the turn-taking mechanism in a general spoken dialog architecture that captures the relationship between low levels of interaction and higher discourse structure, and 3) generalize the approach to other aspects of turn-taking, such as interruption detection.
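
To make the contrast between a fixed pause threshold and a per-pause threshold concrete, here is a minimal Python sketch. It is not the algorithm presented in the talk: the feature names, the tree structure, and every threshold value other than the 700 ms example quoted above are hypothetical, chosen only to illustrate the idea of picking a shorter threshold when the evidence suggests the user has finished.

```python
from dataclasses import dataclass

# The only value taken from the abstract: the example fixed threshold.
FIXED_THRESHOLD_MS = 700


@dataclass
class PauseFeatures:
    """Hypothetical features observed at the start of a pause."""
    asr_confidence: float      # speech recognition score in [0, 1]
    final_pitch_falling: bool  # crude prosodic cue
    parse_complete: bool       # semantic cue: utterance parses as complete


def dynamic_threshold_ms(features: PauseFeatures) -> int:
    """Toy hand-written decision tree mapping features to a threshold.

    All branch conditions and leaf values are invented for illustration.
    """
    if features.parse_complete:
        # Strong evidence the turn is over: respond sooner.
        return 300 if features.final_pitch_falling else 500
    if features.asr_confidence < 0.5:
        # Uncertain input: wait longer to avoid a cut-in.
        return 1000
    return 800


def is_end_of_turn(pause_ms: int, threshold_ms: int) -> bool:
    """Declare end of turn once the pause reaches the threshold."""
    return pause_ms >= threshold_ms


# A 400 ms pause after a complete, falling-pitch utterance.
pause = PauseFeatures(asr_confidence=0.9, final_pitch_falling=True,
                      parse_complete=True)
fixed_decision = is_end_of_turn(400, FIXED_THRESHOLD_MS)
dynamic_decision = is_end_of_turn(400, dynamic_threshold_ms(pause))
```

In this toy case the fixed threshold keeps waiting at 400 ms, while the per-pause threshold already lets the system respond, which is the kind of latency saving the abstract describes. A real system would learn the tree from data rather than hard-code it.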

Speaker Details

Antoine Raux is a PhD candidate at the Language Technologies Institute in Carnegie Mellon University’s School of Computer Science. His interests are in speech technologies in general, and his thesis work focuses on low-level interactional aspects of spoken dialog systems. He has published refereed papers in many areas of speech processing, including speech recognition, synthesis, spoken dialog systems and computer-assisted language learning. Prior to his PhD, Antoine earned a Master’s degree in Intelligence Science and Technology from Kyoto University (Japan) and an Engineering Diploma from Ecole Polytechnique (France). He is married and the proud father of a 4-year-old boy and a 1-year-old girl.