The language model (LM) in most state-of-the-art large vocabulary continuous speech
recognition (LVCSR) systems is still the n-gram. A major reason for using such simple
LMs, besides the ease of estimating them from text, is computational complexity.
It is also true, however, that long-span LMs, whether their span comes from a higher n-gram
order or from modeling syntactic, semantic, discourse and other long-distance dependencies,
are much more accurate than low-order n-grams. The standard practice is
to carry out a first pass of decoding using, say, a 3-gram LM to generate a lattice, and to
rescore only the hypotheses in the lattice with a higher-order LM. But even the search space
defined by a lattice is intractable for many long-span LMs. In such cases, only the N-best
full-utterance hypotheses from the lattice are extracted for evaluation. However, the N-best
lists so produced tend to be “biased” towards the model producing them, making the
re-scoring sub-optimal, especially if the re-scoring model is complementary to the initial
n-gram model. For this reason, we seek ways to incorporate information from long-span
LMs by searching in a more unbiased search space.
In this thesis, we first present strategies to combine many complex long- and short-span
language models to form a superior unified model of language. We then show how
this unified model of language can be incorporated for re-scoring dense word graphs, using
a novel search technique, thus removing the need for sub-optimal N-best list rescoring.
We also present an approach based on the idea of variational inference, by virtue of
which long-span models are efficiently approximated by tractable but faithful models,
allowing for the incorporation of long-distance information directly into first-pass decoding.
We have validated the methods proposed in this thesis on many standard and competitive
speech recognition tasks, sometimes outperforming state-of-the-art results. We hope
that these methods will be useful for research with long-span language models not only in
speech recognition but also in other areas of natural language processing such as machine
translation, where decoding is likewise limited to n-gram language models.