Strategies for Optimizing Alphanumeric Recognition Accuracy

Heiko Rahmel
PM Speech Components Group

Introduction
Recognizing alphanumeric strings is a common task used by wide variety of applications. Gathering policy numbers, catalog numbers, license plate numbers, etc., are typical tasks that involve alphanumeric recognition. These particular tasks can however also be one of the most challenging ones for a speech recognition engine since many letters and numbers sound very similar and even present a challenge for human listeners. This article offers insight into the difficulty of recognizing alphanumeric strings and gives tips on how to leverage information about the structure of these strings to significantly improve recognition accuracy.

Why alphanumeric recognition is difficult
Most people probably have experienced how difficult it can sometimes be to get someone on the other end of a telephone line to correctly understand the spelling of a name. I am always amused at the different variations with which my name appears on address labels after having talked to some customer service agent on the phone. My name is always a challenge, because it is unusual and most people have little to go on with which to constrain what letters to expect. Two factors come together to make recognizing alphanumeric strings challenging:

  1. Many of the letters are easily confused with other letters: "N" with "M", "B" with "D", etc.
  2. Each letter in the string presents a new chance for an error.

A speech recognition engine faces this challenge when asked to recognize for example a ten character alphanumeric string using a simple unconstrained grammar like this:

(where the rule A_through_Z is just a list of the letters A through Z, and the rule Zero_through_Nine is a list of the numbers zero through nine (including "oh").)

What recognition accuracy can we expect for a string of eight alphanumeric characters? We can estimate the expected accuracy of getting the entire string right this way [1]:

So for example given a 95% accuracy[2] of recognizing each individual character (a nominally high accuracy rate), the accuracy with which the entire eight character string is recognized can be estimated thusly:

(95%)8 = 66%.

# of Characters

Estimated Accuracy

2

(95%)2 = 90%

3

(95%)3 = 86%

4

(95%)4 = 81%

5

(95%)5 = 77%

6

(95%)6 = 74%

7

(95%)7 = 70%

8

(95%)8 = 66%

9

(95%)9 = 63%

10

(95%)10 = 60%

So you can see that even with a relatively high recognition accuracy estimate of 95% for individual characters, the overall string recognition accuracy estimate rapidly degrades as more characters are added to the string. This illustrates how the accuracy of recognizing a single letter and the length of the alphanumeric string together make alphanumeric recognition a difficult task.

How to improve alphanumeric recognition accuracy by constraining the grammar
An eight character alphanumeric string can express a huge number of possible strings: 26 letters plus 10 digits equals 36 possibilities for each character; so for an eight character string the number of total possibilities is 368 or 2.8 trillion possibilities.

Arguably, no application needs to distinguish between the 2.8 trillion choices an eight character alphanumeric string offers. In most cases there is a pattern to the alphanumeric string. This means the strings are not consecutive, but rather sparse, which means that they differ by more than one character. We can take advantage of this to enhance recognition accuracy.

A catalog may for example list a couple of thousand items. In this case, when designing the grammar, it is simplest and best to explicitly spell out each individual catalog number as a separate phrase element. While there are more compact ways to write the grammar it is sufficient to use this format, since the recognizer automatically compacts the structure of any grammar when loading it for optimal runtime performance.

Let's consider the following example:
A caller is asked for a part number and he says: "A S X four one six nine F". With the unconstrained grammar the utterance "A S X four one six nine F" could been recognized as "A F X four one six nine F" by the recognizer since "S" and "F" are very similar.

However the application is actually only looking for four[3] valid alphanumeric strings and so we write a grammar like this instead:

For a constrained grammar the recognizer can only consider the alphanumeric strings specified by the grammar as possible matches. The recognizer therefore has to confuse multiple characters to return an incorrect string. So with the constrained grammar the most similar string allowed is "P F X four one six nine R". For the recognizer to consider this to be a closer match to the utterance it would have to not only confuse "S" with "F", but also "A" with "P" and F" with "R". While the chance of getting one character out of ten wrong is relatively high, getting two or more characters wrong in just the right way so that they turn one valid string into another valid string is quite low.

This approach of explicitly spelling out each valid alphanumeric string in the grammar typically should work well for static lists up to 100K entries in size. For large list sizes it should be straightforward to create a small script or program which will automatically generate the grammar from a database or other source.

Some tasks require more flexibility, because the list of valid entries changes often or covers millions of strings. Still even here for most tasks the strings follow a distinct pattern that can be exploited to constrain the grammar as much as possible. What is important to keep in mind is that more than constraining individual characters it is much more useful to identify groups of characters that occur together since then typically multiple characters have to be in error in the right way to get the sequence wrong.

As another example lets say the application allowed an alphanumeric string that always starts with one of the following character sequences followed by a specific number of digits:

A. F. X. (followed by 5 digits)
F. H. Q. (followed by 3 digits)
P. W. D. (followed by 6 digits)

The first grammar we consider looks like this:

While this grammar is constrained each of the letters is independent of each other. This grammar still allows the invalid string "A. H. Q. one two three four".

A better way to write this grammar would be:

Here the characters are grouped together and the length of the digit string is constrained to each specific length. In order for the recognizer to get the three letter prefix wrong both multiple incorrect letters and an incorrect number of digits have to match a valid utterance better than the correct prefix.

This approach can be combined with subsequently validating the top N n-best choices against a database containing the valid alphanumeric strings, or by using a checksum or similar method that can easily be calculated, but not easily expressed in a grammar. The optimal number N of n-best choices will depend on the specifics of the application and the time it takes to validate a particular alphanumeric string. Checking a larger number of n-best entries increases the chance of finding a valid match, but increases the time the caller will have to wait for the answer. If a database lookup were to take half a second, then checking up to 20 n-best entries would take 10 seconds, which is much too long to have a caller wait. But testing may show that checking just up to 4 n-best entries together with a sufficiently constrained grammar yields an acceptable accuracy level.

Other tips to consider when writing an alphanumeric grammar

  1. Force the recognizer to use the specially trained letter recognition models.
    By specifying the letters in the grammar as "A.", "B." "C." (i.e. the letter followed by a dot), you can ensure that the recognizer uses recognition models specifically trained for letter recognition. The dot is necessary (in the case of English) to distinguish letter A from the word A ("A man") and the letter I from the word I. Also, do not specify any custom pronunciations for a letter, since this will prevent the recognizer from using the special recognition model for this letter.


  2. Write numbers as words
    You should use "one" instead of "1", "two" instead of "2", etc. This way the recognizer won't have to use text normalization to translate the digits into words. For "0" you may want to specify both "zero" and "oh", but be aware that the "oh" might clash with the letter "O.". As with letters providing a custom pronunciation for a digit will prevent the recognizer from using the special digit model for that digit.


  3. Separate the characters
    This is basically automatic if you follow 1 and 2, but you should always separate the characters with spaces, otherwise you might get unexpected results, e.g.. "BAD" being interpreted as "bad" not "B. A. D.".


  4. Clearly prompt the user
    As with any speech recognition task, providing a clear and concise prompt to the user is important to make sure the expected response is provided by the user. Make sure your application indicates whether you expect the user to speak non-alphanumeric characters like dashes, etc.

Conclusion
While recognizing alphanumeric strings can be a challenge for your application, the above tips should help to substantially improve its perceived accuracy and performance.


[1] This assumes that recognition errors happen for each character independently from each other which is not necessarily true, but it simplifies the calculation and is sufficiently accurate.

[2] This is an example number chosen here and not a statement of actual performance. The actual accuracy depends on many factors like noise conditions, caller population, device used for calling to name a few.

[3] The number of choices is so small so that the grammar fits nicely on the page, but the following holds for much longer lists too.