Spoken Language Understanding without Transcriptions in a Call Center Scenario

PhD Thesis: Logos Verlag

This dissertation addresses the possibilities of Spoken Language Understanding (SLU) in a call center scenario. It is widely agreed that human understanding involves complicated cognitive structures whose replication by machines is out of reach today. However, many practically important SLU applications do not have to rely on computer cognition; they can adopt a “behaviorist” approach to understanding instead. Under this approach, understanding is present wherever the action the machine takes upon receiving an input message is perceived as intuitively correct. This action can be formally encoded in terms of its semantic function, which determines a rough category of the action, and its semantic attributes, the parameters this function takes to become well-defined. The goal of understanding thus becomes extracting these elements from the input signal wherever possible. For example, in the utterance “I’d like to make a collect call to number 12345”, taken from a call center scenario, the calltype COLLECT CALL is the semantic function and the named entity 12345 is its parameter. In general, calltypes as semantic functions and named entities as semantic attributes are characteristic of many call center applications; in this work we show how spoken utterances can be handled with respect to these two information types. We extract calltypes and consider three categories of named entity processing tasks: detection, localization, and value extraction of named entities.
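The calltype-plus-attributes encoding can be illustrated with a minimal sketch. The frame layout, field names, and the keyword/regex matching rules below are illustrative assumptions, not the representation or classifiers actually used in the thesis:

```python
import re

def understand(utterance: str) -> dict:
    """Map an utterance to a semantic function (calltype) plus
    semantic attributes (named-entity parameters).

    Toy rules for illustration only; the thesis learns such mappings
    from data rather than hand-coding them.
    """
    frame = {"calltype": None, "attributes": {}}
    # Semantic function: a rough category of the requested action.
    if "collect call" in utterance.lower():
        frame["calltype"] = "COLLECT_CALL"
    # Semantic attribute: a named entity that makes the action well-defined.
    match = re.search(r"\b(\d{3,})\b", utterance)
    if match:
        frame["attributes"]["phone_number"] = match.group(1)
    return frame

frame = understand("I'd like to make a collect call to number 12345")
# frame == {"calltype": "COLLECT_CALL",
#           "attributes": {"phone_number": "12345"}}
```

The point of the frame is that, once filled, it fully determines an intuitively correct machine action, which is the behaviorist criterion for understanding described above.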

One distinctive feature of our experiments is that they do not rely on the availability of manually created word-level annotations for training corpora from the target domain. To retain acceptable word accuracy in the ASR output, we suggest using unsupervised language model adaptation. Similarly, we avoid the need to manually annotate instances of named entities in the training data by modeling them with generic, application-independent grammars. For those hard cases where not even an off-the-shelf language model is available to bootstrap the speech recognizer, we show how our algorithms can be ported to the phone level.

In the context of the last task, we also discuss the academic problem of extracting a word lexicon from a continuous phone stream. We exploit particular semantic and syntactic qualities of words to infer the phone subsequences that correspond to them.

All experiments we report on in this thesis were conducted on the “How May I Help You?” speech corpus of over-the-phone interactions of AT&T customers with the company’s partly automated call center.