By Ken Circeo and Walter Isidro
So you want to write a speech application. Great! Judging from consumer trends, you're still ahead of the market, and a good speech app can increase your customer satisfaction while decreasing your operating costs.
But preparation is key. As with any development project, asking the right questions beforehand can save you time, money, and stress down the road. This is particularly true for speech applications because of the aspects that set them apart from more traditional apps: audible user input, recorded voice prompts and follow-up, and speech recognition accuracy. So before you launch full sail into the development process, you'll need to answer these four questions:
What kind of speech application am I going to make? Determining the type of speech application you will create before you begin will give you an immediate idea of how much work is involved in the project and will keep you from getting off track throughout the development process.
Speech applications come in many forms. Perhaps you're writing a web application, or a .NET serviced component. Maybe your speech app is a front-end executable, or just a low-level service that's launched from another application. Are you starting from scratch, or are you "speech-enabling" a legacy app? If a web server is available, perhaps you should consider building a distributed app.
One nice feature of Microsoft Speech Server is that it uses Visual Studio .NET 2003 as its development interface, making app development a process that is both familiar and flexible, regardless of how large or small your project is. It also includes the Speech Web Application Project Wizard, which makes it easier for you to set the application mode, choose a debugging tool, and add prompt projects and grammar files.
Do I have written specs? Written specifications are critical to any enterprise-level development project, and a speech application is no exception. As with a typical enterprise project, you should start by writing a Requirements Spec, which should clearly explain how the application should be built in order to satisfy both the business need and the technological need. It should be "part recipe (where all the particulars of the system are clearly spelled out) and part story (a description of how the system will feel)." (The Art and Business of Speech Recognition by Blade Kotelly, p. 46.)
But in addition to the traditional Requirements Spec, speech projects require two other spec types: the Dialog Spec and the Servicing Spec.
The Dialog Spec describes the conditions for going from one program state to the next. For each state, it should list the kinds of responses you expect from customers and what should happen when things go wrong: if the user stays silent or mumbles, how should the system be programmed to react? No question, writing good dialogs is tough. Many sound unnatural ("Please say 'one' for option 1 and 'two' for option 2"). Plan to spend time working and reworking your application's prompts and call flow.
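To make the idea of a dialog state concrete, here is a minimal sketch in Python of how one entry in a Dialog Spec might be modeled. It is not Microsoft Speech Server code: the class, the event names (SILENCE, NO_MATCH), the prompts, and the state names are all hypothetical, chosen only to show expected responses, silence and mumble handling, and a retry limit in one place.

```python
from dataclasses import dataclass

# Hypothetical event names for what a recognizer might report back.
SILENCE = "silence"      # caller said nothing
NO_MATCH = "no_match"    # caller mumbled or said something unexpected


@dataclass
class DialogState:
    name: str
    prompt: str
    transitions: dict        # expected response -> name of the next state
    silence_prompt: str      # played when the caller stays silent
    no_match_prompt: str     # played when the response is not understood
    max_retries: int = 2     # reprompts allowed before giving up
    fallback: str = "operator"


def step(state, heard, retries):
    """Return (next_state_name, reprompt_or_None, retries) for one caller turn."""
    if heard in state.transitions:
        # Expected response: move on and reset the retry counter.
        return state.transitions[heard], None, 0
    if retries >= state.max_retries:
        # Too many failed attempts: hand the caller to the fallback state.
        return state.fallback, None, 0
    # Stay in the same state and play the matching reprompt.
    reprompt = state.silence_prompt if heard == SILENCE else state.no_match_prompt
    return state.name, reprompt, retries + 1


main_menu = DialogState(
    name="main_menu",
    prompt="Would you like a quote or a trade?",
    transitions={"quote": "get_quote", "trade": "start_trade"},
    silence_prompt="Sorry, I didn't hear you. You can say quote or trade.",
    no_match_prompt="Sorry, I didn't catch that. Please say quote or trade.",
)
```

Writing each state down in this shape forces the spec to answer the awkward questions early: what counts as a valid response, what the caller hears after silence versus a mumble, and when the system should stop asking and transfer the call.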
Another unique aspect to speech development is post-deployment tuning, and for this you should write a Servicing Spec. Unlike a DTMF application, where a customer chooses from a finite number of options (typically by pressing a number on the telephone keypad), a speech application opens the door to an infinite number of verbal responses, making it impossible for you to predict what all customers will say. As you analyze post-deployment data and respond to customer feedback, you'll need to fine-tune your speech application by adjusting your prompts and dialogs. Tuning takes time, sometimes a considerable amount of time. Fortunately, Microsoft Speech Server includes analysis and tuning tools such as the Call Viewer, which presents different views of call data so you can quickly locate and diagnose problem calls. Your Servicing Spec should include your post-deployment tuning plan, and realistic benchmarks that you plan to meet as you increase your application's performance, along with customer satisfaction.
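One way to keep tuning benchmarks honest is to state them as numbers you can check against call data. The Python sketch below assumes a hypothetical record format (one dictionary per call with a "completed" flag and a "reprompts" count) and hypothetical target names; it does not reflect the output of the Call Viewer or any other Speech Server tool.

```python
def tuning_report(calls, targets):
    """Compare simple call statistics against benchmark targets.

    calls:   list of dicts with 'completed' (bool) and 'reprompts' (int)
    targets: dict with 'min_completion_rate' and 'max_avg_reprompts'
    """
    if not calls:
        raise ValueError("need at least one call record")
    total = len(calls)
    # Share of calls in which the caller finished the task.
    completion_rate = sum(1 for c in calls if c["completed"]) / total
    # Average number of times the system had to ask again per call.
    avg_reprompts = sum(c["reprompts"] for c in calls) / total
    return {
        "completion_rate": completion_rate,
        "avg_reprompts": avg_reprompts,
        "meets_completion": completion_rate >= targets["min_completion_rate"],
        "meets_reprompts": avg_reprompts <= targets["max_avg_reprompts"],
    }


# Made-up sample data, for illustration only.
sample_calls = [
    {"completed": True, "reprompts": 0},
    {"completed": True, "reprompts": 1},
    {"completed": True, "reprompts": 0},
    {"completed": False, "reprompts": 3},
]
sample_targets = {"min_completion_rate": 0.8, "max_avg_reprompts": 1.0}
report = tuning_report(sample_calls, sample_targets)
```

With the sample data, three of four calls complete (0.75), which misses the 0.8 target, while the average of one reprompt per call just meets its limit. The specific metrics matter less than agreeing on them before deployment.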
What hardware and software infrastructure am I building on? You've already chosen a versatile and cost-effective platform in Microsoft Speech Server. But what does your current infrastructure look like, and what needs to be done to prepare it for your speech application? Consider that speech applications can place special demands on hardware resources. Because Microsoft Speech Server is designed for deployments of all sizes, its capacity requirements are set at a reasonable level. But that level can go up depending on your operational goals.
As with any computer, your speech computer's requirements come down to three hardware resources: hard drive, memory, and processor. For example, are you planning to store audio files on your hard drive? Which types of files will be stored, and for how long? A month? Six months? Let's say your speech app offers customers time-sensitive information such as stock quotes. You'd probably have no reason to store voice files from customer inquiries. But if those customers are performing transactions, such as buying and selling shares, you would be required to store the voice files to meet federal checks-and-balances requirements, which may mean setting up offsite computers to handle data redundancy.
In addition, a speech app can weigh heavily on your processor, and that weight goes up with the number of simultaneous callers on your system. Think about your peak load, and remember that speech recognition is memory intensive, so processor speed goes hand-in-hand with RAM capacity. How many simultaneous callers can your system handle? Before you start building your app, make sure your system has enough disk space, enough memory, and a fast enough processor to meet your performance goals.
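A rough back-of-the-envelope calculation can turn these questions into numbers. The Python sketch below is only an estimate under stated assumptions: audio is stored as uncompressed 8 kHz, 8-bit mono telephone audio (8,000 bytes per second), the whole call is recorded, and the call volumes are made-up examples. The function names are hypothetical, and actual Microsoft Speech Server sizing should come from the product documentation.

```python
def audio_storage_gb(calls_per_day, avg_call_seconds, retention_days,
                     bytes_per_second=8000):
    """Estimate disk space (in GB, 1 GB = 1e9 bytes) for retained call audio.

    bytes_per_second defaults to 8000, an assumption that matches
    uncompressed 8 kHz, 8-bit mono telephone audio.
    """
    total_bytes = calls_per_day * avg_call_seconds * retention_days * bytes_per_second
    return total_bytes / 1e9


def peak_concurrent_calls(calls_in_peak_hour, avg_call_seconds):
    """Estimate average calls in progress during the busiest hour.

    Uses Little's law: calls in progress = arrival rate x average duration.
    Real peaks will spike above this average, so leave headroom.
    """
    return calls_in_peak_hour * avg_call_seconds / 3600


# Example inputs (illustrative only): 5,000 calls a day, 2 minutes each,
# kept for 180 days; 600 calls arrive in the busiest hour.
storage = audio_storage_gb(5000, 120, 180)
concurrent = peak_concurrent_calls(600, 120)
```

Under these assumptions, six months of recordings need about 864 GB, and the busiest hour averages about 20 simultaneous calls. Changing the retention period or the audio format shifts the storage figure proportionally, which is why the retention question is worth settling before you buy hardware.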
What do I know about developing for Microsoft Speech Server? You already know that you can use Microsoft Speech Server to build and deploy speech applications, but you may be unfamiliar with its internal architecture and components. Do you consider yourself primarily a web developer or an app developer? Have you ever speech-enabled a DTMF app? Microsoft Speech Server contains a rich set of unique APIs that handle Speech-to-Text and Text-to-Speech conversions. You can choose from a variety of languages to write your application, including C# and Visual Basic.
Microsoft Speech Server leverages industry standards such as SQL and Microsoft Internet Information Services (IIS), as well as speech components like Speech Engine Services (SES) and Telephony Application Services (TAS). It supports the Speech Application Language Tags (SALT) markup language, which extends existing markup languages such as HTML, XHTML, and XML to provide voice access to web-based applications.
The following diagram illustrates the Microsoft Speech Server components and the relationships between them.
You can find this and other diagrams in the product Help documentation (go to Introducing Microsoft Speech Server, Speech Server Architecture). Learning about the components and how they work together can help you determine which components are most important to your application. If you evaluate your own body of knowledge against the project you're about to undertake, you may find that some research is necessary to prepare yourself before you get started.
In fact, if you scan through the resources on the Microsoft Speech Server Web site, you can find the answers to many of the questions presented in this article. Others, only you can answer, depending on your company's application requirements and your own personal skill and experience. But keep in mind that proper preparation can help you complete your project on time and on budget, and enable you to meet your company's business need, enhance customer experience, and ultimately improve your bottom line.