Abstract

We propose a theoretical framework for speech recognition
with segmental conditional random fields, and describe the implementation of a toolkit for experimenting with these models. This framework allows users to easily incorporate multiple detector streams into
a discriminatively trained direct model for large vocabulary continuous speech recognition. The detector streams can operate at multiple
scales (frame, phone, multi-phone, syllable or word) and are combined at the word level in the CRF training and decoding processes.
A key aspect of our approach is that features are defined at the
word level, and can thus capture long-span phenomena such as the
edit distance between an observed and expected sequence of detection events. Further, a wide variety of features are automatically constructed
from atomic detector streams, allowing the user to focus on the creation of informative detectors. Generalization to unseen words is
possible through the use of decomposable consistency features [1, 2],
and our framework allows for the joint or separate training of the
acoustic and language models.
