Microsoft Speech Server

Enabling people to use speech as part of their everyday interactions with software and services whether they are using telephones, mobile devices, or PCs.

Published: July 2003

For the latest information, please see http://www.microsoft.com/speech/


The Microsoft Speech Server

The Microsoft® Speech Server (MSS) is a flexible and integrated speech solution that delivers the business value of speech technologies at the lowest total cost of ownership. Used in conjunction with the Microsoft Speech Application Software Development Kit (SASDK), the Microsoft Speech Server enables enterprises to cost-effectively deploy speech applications that can improve employee productivity, reduce costs, increase customer satisfaction, and create new revenue opportunities. Together with the SASDK, MSS also enables enterprises to make existing or new Web applications accessible by speech. This allows enterprises to merge their Web and voice infrastructure to create unified applications with both speech and visual access.

The Microsoft Speech Server and Toolset address the needs of those exploring interactive voice response (IVR) systems for the first time, call center veterans, and IT managers exploring breakthrough speech applications for customers and employees. The solution is well suited to a variety of speech scenarios, ranging from traditional call center applications that target customers with phones to newer IT applications targeted at customers or employees with Pocket PC devices. Speech can be used to do these things:

·         Bring in lower-cost, automated IVR systems to improve customer service.

·         Fix an ineffective touch-tone IVR system (speech has been shown by many third-party studies to increase transaction completion 50 percent to 100 percent).

·         Integrate Web applications with call center applications using speech where appropriate.

·         Create new multimodal (speech plus graphical user interface) speech applications using speech for filling out forms or navigating through applications in settings where people have a high degree of mobility, such as hospitals, offices, or factories.

·         Address the needs of customers and employees with mobile devices such as Tablet PCs and Pocket PCs.

Contrasting the Past and the Future

Historical challenges around speech have mostly disappeared. Microsoft Corp. alone has been working for over a decade on speech technologies, tools, and platforms. Today’s speech recognition engines, including Microsoft’s own state-of-the-art solution, produce 95 percent to 99 percent accuracy (better than most people can type). Desktop and server computers have enough power to handle the processor and memory demands of speech recognition. And standardization is finally here, with standards-setting bodies such as the SALT Forum (www.saltforum.org) and the World Wide Web Consortium (www.w3c.org) laying out standards for speech and structured information that enable modern solutions for traditional telephony and new multimodal applications.

Even with these advances in speech technology and standards, many speech solutions still rely on older, non-standard platforms.

Most proprietary IVR systems are not only very expensive, but also use esoteric scripting and programming languages that require companies to pay high-priced experts to make application changes. These IVR systems have not kept up with the advances and cost savings IT managers have come to expect from their other hardware and software. Many IVR systems are single-purpose islands that do not integrate well with other Web or line-of-business applications. It’s unfair for IT and call center managers to be handcuffed to systems that don’t accommodate current needs and aren’t good investments for the future.

What is needed is a cost-effective, integrated speech solution that addresses both IT and call center applications and leverages existing data center infrastructure and accessible expertise within the company.

Microsoft now offers a complete enterprise speech solution that can save companies money, modernize out-of-date IVR applications, and provide a clear path for integrating speech into existing Web and productivity applications. The offering addresses multiple endpoints – telephones, cell phones, Pocket PCs, Tablet PCs, and desktop computers – using tools that are familiar and easy to use for Visual Studio® .NET developers.

Microsoft, its industry partners, and other members of the SALT Forum are creating a wide variety of standards-based speech hardware and software. For medium-sized businesses upgrading their call center software, or large businesses deploying speech technologies throughout the company, the Microsoft Speech Server can provide a solid standards-based platform for deploying speech applications.

The Microsoft Speech Server provides the tools, technologies, platforms, standards, and interfaces users will need.

 

| Scenario | Customer | End User | Examples |
| --- | --- | --- | --- |
| Create a more effective customer support offering | Call Center | Customers | Front end to technical support; product warranty claims |
| Provide new entry points for customers | Call Center | Customers | Automated prescription services; online travel arrangements and flight checks; automated banking requests and alerts |
| Add convenient access to existing communications systems | IT Manager | Customers and Employees | Auto-attendant systems to replace operators and receptionists; access to existing e-mail and scheduling services |
| Add more flexible interfaces to mobile employees and field personnel | IT Manager | Employees | Up-to-date customer information to off-site sales staff; line-of-business applications and data to on-site inspectors or workers |

Table 1. Speech can be used in call centers and elsewhere to reduce costs and improve transaction completion, enhance productivity, increase customer satisfaction, and drive new revenue opportunities.

Introducing the Microsoft Speech Server and Toolset

The Microsoft Speech Server contains a complete solution for developing, testing, deploying, and managing telephony (speech only) and multimodal (speech/visual) applications. Specifically, the product contains the following:

·         Microsoft Speech Server

·         Microsoft Speech Application SDK

The Microsoft Speech Server (MSS) contains all server components for deploying telephony and multimodal applications. MSS runs on Windows Server™ 2003 and performs speech recognition and speech synthesis for telephone, cell phone, and Pocket PC devices. For telephony applications, MSS includes services for connecting to private branch exchange (PBX) and telephony lines in many flexible configurations.

The Microsoft Speech Application Software Development Kit (SASDK) addresses the needs of the speech application developer with APIs, controls, and tools that extend Visual Studio .NET into the speech domain. Developers comfortable using Visual Studio .NET will have little trouble learning speech concepts and creating telephony and multimodal applications. SASDK includes the client-side speech add-ins, Speech Add-in for Microsoft Internet Explorer and Speech Add-in for Microsoft Pocket Internet Explorer, which enable desktop PCs, Tablet PCs, and Pocket PC devices to understand speech tags embedded in HTML pages as defined by the Speech Application Language Tags (SALT) specification, an open industry standard. In all, it’s a cohesive platform with the power and simplicity to address a company’s speech requirements.

The figure below shows a high-level view of a deployed speech solution including telephone and multimodal clients, the speech server components, and a Web server hosting a speech application. Note the use of standards – HTML, Extensible Markup Language (XML), Simple Object Access Protocol (SOAP), and SALT – throughout.

Figure 1. The main parts of a speech solution

 

The Microsoft Speech Server changes the rules for speech. Gone is reliance on proprietary IVR systems that ignore access from Pocket PCs, Tablet PCs, and desktop PCs. Users can replace these systems with a complete, integrated solution able to meet existing demands and ready for the future.

Let’s explore each of the components.

Introducing the Microsoft Speech Server

The server-side components of Microsoft Speech Server (MSS) enable telephones, cell phones, desktop computers, and Pocket PC devices to access speech applications. Specifically, the Microsoft Speech Server includes speech recognition, speech synthesis, connectivity interfaces, and call management interfaces for telephones, cell phones, computers, and Pocket PC devices. MSS includes the following components:

 

·         Speech Engine Services

·         Telephony Application Services

Speech Engine Services (SES) handles speech recognition and speech playback. Telephony Application Services (TAS) controls the connectivity and call management needed to support telephone endpoints over traditional phone lines or PBXs. TAS works with third-party Telephony Interface Manager (TIM) software to support telephone and PBX connectivity. We’ll discuss each of these components in a bit more depth.

The Microsoft Speech Server provides specific benefits both for the company and for call center and IT managers in charge of telephony and speech projects.

For the Company

·         Hardware and software cost savings by using existing Windows® servers

·         Lower-cost deployment leveraging Windows Server, a common and widely distributed platform

·         The ability to deploy traditional IVR (with or without speech) applications and new multimodal applications using the same infrastructure and tools

For the Call Center and IT Manager

·         The integration of IVR, speech and Web development efforts that reuse existing business logic and data access code

·         Decreased reliance on proprietary IVR platforms

·         Robust management through common Windows management tools such as the Microsoft Management Console (MMC)

·         The power to choose from a wide variety of standards-based hardware and software from multiple vendors to build best-of-breed telephony and speech applications

 


As shown in the table below, MSS addresses the needs of both traditional telephony applications and new multimodal applications.

| Application (Supported Devices) | Microsoft Speech Components | Other Required Components |
| --- | --- | --- |
| Telephony (telephone, cell phone) | Speech Engine Services (SES); Telephony Application Services (TAS); Telephony Interface Manager (TIM) | Supported telephony adapter for connecting to phone lines or PBX systems; Microsoft Windows Server 2003 with Internet Information Services; load balancers (optional, for distributing load to multiple SES and Web servers); PBX (optional, for building distributed TAS deployments) |
| Multimodal (Pocket PC) | Speech Engine Services (SES); Speech Add-in for Microsoft Pocket Internet Explorer | Microsoft Windows Server 2003 with Internet Information Services; load balancers (optional, for distributing load to multiple SES and Web servers) |
| Multimodal (desktop PC, Tablet PC) | Speech Add-in for Microsoft Internet Explorer | Microsoft Windows Server 2003 with Internet Information Services; load balancers (optional, for distributing load to multiple SES and Web servers) |

Table 2. Components Necessary for a Telephony and Multimodal Deployment

Speech Engine Services

Speech Engine Services (SES) provides server-side speech recognition and speech playback services for multimodal and telephony clients using the Microsoft Speech Server. A multimodal client on a Pocket PC device accesses SES directly for both speech recognition and speech playback. Desktop and tablet computers perform speech recognition and speech playback locally. A person using a telephone accesses SES through TAS, which serves as a proxy for a telephone and allows telephony endpoints to use the same application framework and speech services as do Pocket PCs (see more on TAS in a later section). SES provides clear benefits for end users, call center managers, and IT managers.

·         Includes Microsoft’s state-of-the-art speech recognition engine for accurately handling speech inputs from the end user

·         Handles speech recognition and speech synthesis in a single platform

·         Supports touch-tone IVR (also known as dual tone multi frequency [DTMF] IVR), IVR with speech, and multimodal applications in a single platform

·         Includes a text-to-speech (TTS) engine using a prompt engine from Microsoft for realistic-sounding speech created from prerecorded speech, and a real-time TTS engine for handling any unknown word or phrase that MSS may encounter

SES contains many of the features one would expect from a standards-based, enterprise, server-side speech engine:

·         Connectivity from remote clients using TCP/IP and SOAP messages

·         Engine pooling that supports multiple instances for handling simultaneous speech requests from clients

·         Caching of application resources such as grammars, prompt databases, and audio files for faster performance

·         Standard Windows Server 2003 facilities for management, monitoring, and logging

SES Components

The figure below illustrates the components and workings of SES.

Figure 2. Speech Engine Services processes multiple requests from clients.

Here is a brief description of how speech recognition (audio into text) works:

1.       SES manages multiple speech recognition instances so it can ably serve multiple client requests simultaneously.

2.       A Pocket PC or telephone (connecting via TAS) makes a speech recognition request to SES using a standard XML/SOAP message. This request is handled by the Lobby, which works with the Broker to locate a suitable speech engine instance.

3.       A suitable speech engine is one that is available and that ideally has the necessary grammars pre-loaded.

4.       The speech engine instance processes the audio input using Microsoft’s state-of-the-art speech recognition engine.

5.       If necessary, the speech engine instance accesses the grammars referenced by the client (accessed from the Web server where the application is deployed) to constrain the possible results, thereby improving the speech recognition accuracy.

6.       The speech engine instance returns (via the Lobby) the speech recognition result to the client (the actual client in the case of a multimodal device, or TAS in the case of a telephone connection).
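
The selection in steps 2 and 3 can be pictured with a short sketch. It is only an illustration of the idea "pick an available instance, preferably one with the needed grammars already loaded"; the class, function, and grammar file names are hypothetical and are not part of the MSS interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class EngineInstance:
    """Hypothetical stand-in for one pooled speech recognition engine."""
    name: str
    busy: bool = False
    loaded_grammars: set = field(default_factory=set)


def pick_engine(pool, needed_grammars):
    """Return the idle engine with the most needed grammars pre-loaded, or None."""
    idle = [engine for engine in pool if not engine.busy]
    if not idle:
        return None
    needed = set(needed_grammars)
    # max() keeps the first engine on ties, so pool order breaks ties.
    return max(idle, key=lambda engine: len(engine.loaded_grammars & needed))


# Illustrative pool; grammar names are made up for the example.
pool = [
    EngineInstance("engine-1", busy=True, loaded_grammars={"dates.grxml"}),
    EngineInstance("engine-2", loaded_grammars={"yesno.grxml"}),
    EngineInstance("engine-3", loaded_grammars={"dates.grxml", "cities.grxml"}),
]
chosen = pick_engine(pool, ["dates.grxml"])
print(chosen.name)  # engine-3
```

In this sketch a busy engine is never chosen, even when it holds the right grammars. That matches the description above, where availability is required and pre-loaded grammars are only preferred.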

Here is a brief description of how speech output (text into audio) works:

1.       SES manages multiple text-to-speech engine instances so it can ably serve multiple client requests simultaneously.

2.       An application requiring speech output makes a request to SES using standard XML/SOAP messages adhering to the Speech Synthesis Markup Language (SSML) format, a W3C specification. This request is handled by the Lobby, which works with the Broker to locate a suitable text-to-speech engine instance.

3.       A suitable text-to-speech engine instance is one that is available and that ideally has the appropriate prompt databases loaded into its memory (accessed from the Web server where the application is deployed).

4.       The text-to-speech engine instance processes the text and searches its prompt databases for suitable matches to create the desired speech output.

5.       If it cannot find one or more of the words or phrases required in its prompt databases, it uses a real-time speech synthesis engine to create the speech.

6.       The text-to-speech engine instance then returns the audio result to the requesting application.
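
Steps 4 and 5 amount to a lookup with a fallback. The sketch below shows one way such a lookup could work, assuming a greedy longest-phrase match over a prompt database; the matching strategy, function names, and file names are assumptions made for illustration, not a description of the actual prompt engine.

```python
def synthesize_fallback(text):
    """Placeholder for a real-time TTS engine (hypothetical)."""
    return f"<synthesized:{text}>"


def render_prompt(text, prompt_db):
    """Build an output plan: prerecorded audio where available, synthesis otherwise.

    Uses a greedy longest match over words. prompt_db maps lowercase
    phrases to identifiers of recorded audio.
    """
    words = text.lower().split()
    plan = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at word i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in prompt_db:
                plan.append(prompt_db[phrase])
                i = j
                break
        else:
            # No recording covers this word: fall back to synthesis.
            plan.append(synthesize_fallback(words[i]))
            i += 1
    return plan


# Illustrative prompt database with two recorded phrases.
prompt_db = {
    "your flight": "rec_001.wav",
    "is on time": "rec_002.wav",
}
print(render_prompt("Your flight to Boston is on time", prompt_db))
# ['rec_001.wav', '<synthesized:to>', '<synthesized:boston>', 'rec_002.wav']
```

Here the two recorded phrases are reused and only the words without a recording ("to", "Boston") go to the synthesizer.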

Note: SES also contains a set of DTMF-engine instances for handling the processing of touch-tone input common in traditional IVR applications accessed by users with telephones. This operation is much simpler than either speech recognition or speech output and will not be discussed here.

Telephony Application Services

Telephony Application Services (TAS) handles the connectivity and call management necessary to support traditional telephones and cell phones connecting to the Microsoft Speech Server. TAS manages a set of SALT interpreters that enable phones and cell phones to communicate with Web applications with embedded speech tags. TAS further brokers the communication link between the telephone system, speech services, and Web server application. TAS works with third-party Telephony Interface Manager (TIM) software for connectivity to phone lines and PBXs. TAS and TIM are not necessary for multimodal endpoints such as Pocket PC devices.

TAS serves as the application proxy that enables telephones to use the same kinds of Web applications (with embedded grammars, dialog, and prompts) that multimodal clients such as Pocket PCs use. TAS overcomes the limitation that telephones do not understand Web pages: TAS does understand Web pages developed with embedded speech tags for call management, speech recognition, and speech output.

TIM, provided by third parties and matched with an appropriate telephony adapter, provides the call management interface and works with specific drivers necessary to communicate with telephony cards that connect to analog phone lines, digital phone lines, or PBXs in a company. TIM uses the advanced capabilities of the underlying telephony adapter to construct complex telephone services from simpler programming elements that are presented to the application developer.

Note: Telephony applications use the same speech recognition and speech output services as multimodal applications, but instead of the speech content coming from and going to a computer or Pocket PC device, it passes through TAS, which serves as an intermediary. In addition, phones can use other parts of SES that handle DTMF (touch-tone) input specific to telephones.

TAS provides clear benefits for the end user, the application developer, and the company wanting to build speech applications.

·         TAS insulates developers from the complexities of telephony connectivity, including traditional analog and digital phone lines, PBXs, and telephone switches.

·         It allows end users of telephones and cell phones to access services and data historically not available to them.

·         TAS integrates telephone access with multimodal access for a single application development and deployment environment.

·         It supports a number of telephony adapters for flexible connectivity to any telephone or call center infrastructure.

TAS contains many of the features one would expect for connectivity and call management using phone lines, PBXs, and switches in customer call centers:

·         Reliable connectivity to existing phone systems

·         SALT Interpreter pooling that supports multiple instances for handling simultaneous incoming calls

·         Clean, layered interface between applications and telephony adapter through TIM and device drivers

·         Support for all standard European Computer Manufacturers Association (ECMA) Computer Supported Telecommunications Applications (CSTA) messages

TAS Components

The figure below illustrates the components and workings of TAS.

Figure 3. Telephony Application Services (TAS) works with the Telephony Interface Manager (TIM) and a telephone card to handle telephone connectivity and call management.

 

Here is a brief description of how TAS handles phone connections through phone lines or PBXs:

1.       TAS manages multiple SALT Interpreter instances so it can ably serve multiple incoming or outgoing calls simultaneously.

2.       The SALT Interpreter, as its name suggests, interprets the SALT tags embedded in pre-loaded Web pages specifically designed by developers to serve telephony applications. It also supports the HTML document object model (DOM) and includes a JScript® .NET host for more flexible application development.

3.       These tags include all the tags a dialog needs for speech recognition and speech output.

4.       The pages also include call control elements (e.g., answering calls or terminating calls) to communicate control instructions to TIM. The call control messages use <smex> as a simple transport to send XML-formatted ECMA-269 and ECMA-323 messages, commonly known as CSTA.

5.       Each SALT Interpreter instance registers itself with TIM using a <SetAgentState> command to make itself available for incoming calls.

6.       A telephone or cell phone makes an incoming call that is handled by TIM.

7.       TIM requests a new TAS session.

8.       TIM and TAS (and its loaded application) communicate using standard XML-based ECMA CSTA messages sent using a very simple send/receive paradigm (<smex> messages).

9.       TAS includes a CSTA tag (AnswerCall) to answer the incoming call; when it gets the appropriate response from TIM (AnswerCallResponse), it initiates the appropriate dialog as specified by the application.

10.   TAS, through its Media and Speech Manager, uses SES when it needs speech recognition or speech output services.

The end result is the ability to create traditional or speech-enabled IVR applications in a Web development and deployment environment.
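
The message exchange in steps 5 through 9 can be sketched as a small send/receive loop. The classes below are hypothetical stand-ins, and the XML is reduced to bare elements named after the messages mentioned above; real ECMA-323 messages carry namespaces and payload elements that are deliberately omitted here, and the canned "name + Response" reply is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def csta_message(name):
    """Serialize a bare element named after a CSTA message (payload omitted)."""
    return ET.tostring(ET.Element(name), encoding="unicode")


class FakeTim:
    """Hypothetical stand-in for a Telephony Interface Manager."""

    def __init__(self):
        self.log = []

    def send(self, xml_text):
        """Record the request and return a canned reply as one send/receive pair."""
        name = ET.fromstring(xml_text).tag
        self.log.append(name)
        return csta_message(name + "Response")


class SaltInterpreter:
    """Tracks the coarse state of one interpreter instance."""

    def __init__(self, tim):
        self.tim = tim
        self.state = "idle"

    def register(self):
        """Announce availability for incoming calls (step 5)."""
        self.tim.send(csta_message("SetAgentState"))
        self.state = "ready"

    def on_incoming_call(self):
        """Answer the call and move to the dialog once TIM confirms (step 9)."""
        reply = self.tim.send(csta_message("AnswerCall"))
        if ET.fromstring(reply).tag == "AnswerCallResponse":
            self.state = "in_dialog"  # the application's dialog would start here
        return self.state


tim = FakeTim()
interpreter = SaltInterpreter(tim)
interpreter.register()
print(interpreter.on_incoming_call())  # in_dialog
print(tim.log)  # ['SetAgentState', 'AnswerCall']
```

The point of the sketch is the ordering: the interpreter registers first, answers only when a call arrives, and starts the dialog only after the response comes back.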

Introducing the Microsoft Speech Application SDK

The Microsoft Speech Application Software Development Kit (SASDK), included with Microsoft Speech Server, is a set of tools, ASP.NET controls, and examples incorporating support for speech using the SALT specification that enables developers to build both telephony and multimodal applications. Developers can incorporate speech functionality into Web applications quickly and easily and can learn the concepts necessary to build a speech application within the familiar Visual Studio .NET development environment. Both the tools and controls integrate seamlessly into Visual Studio .NET.

SASDK contains the following components:

·         ASP.NET controls to initiate speech recognition (Listen) and speech playback (Prompt), to create interactive speech dialogs (QA) especially useful in IVR applications, and for call control functionality

·         GUI tool for creating W3C standard grammars easily within the Visual Studio environment

·         GUI tool for creating prompts and prompt databases easily within the Visual Studio environment

·         A set of controls including common grammar libraries for frequently used content such as yes/no, credit card numbers, calendar dates, and social security numbers
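
To give a feel for what a simple grammar contains, the sketch below builds a yes/no grammar as XML. It assumes the "W3C standard grammars" mentioned above are in the XML form of the W3C Speech Recognition Grammar Specification (SRGS); the function name and rule identifier are hypothetical, and the grammars shipped with the SASDK may be structured differently.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_choice_grammar(rule_id, phrases, lang="en-US"):
    """Return SRGS XML text for a grammar accepting any one of `phrases`."""
    # Serialize SRGS elements in the default namespace (no prefix).
    ET.register_namespace("", SRGS_NS)
    grammar = ET.Element(
        f"{{{SRGS_NS}}}grammar",
        {"version": "1.0", "root": rule_id, f"{{{XML_NS}}}lang": lang},
    )
    rule = ET.SubElement(grammar, f"{{{SRGS_NS}}}rule", {"id": rule_id})
    one_of = ET.SubElement(rule, f"{{{SRGS_NS}}}one-of")
    for phrase in phrases:
        ET.SubElement(one_of, f"{{{SRGS_NS}}}item").text = phrase
    return ET.tostring(grammar, encoding="unicode")


xml_text = build_choice_grammar("yesno", ["yes", "no"])
print(xml_text)
```

A grammar like this restricts the recognizer to two possible answers, which is how grammars constrain results and improve accuracy.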

The Microsoft Speech Application SDK provides specific benefits both for the company developing and deploying IVR or multimodal applications and for the developers assigned to the project.

For the Company

·         Hardware and software cost savings from using existing Windows Server 2003 servers for speech applications

·         Ability to leverage existing investment in Microsoft development tools and technologies including Visual Studio, the .NET Framework, and Windows servers

·         Ability to use existing developers to create speech applications that enhance and improve the customer experience

·         Ability to leverage the .NET environment, including its rich set of foundation classes, for building Web applications

·         Reduced development time and easier code maintenance through leveraging existing business logic and data access code

For the Developer

·         Ability to create speech applications within the familiar Visual Studio environment

·         Ability to learn speech concepts such as dialogs, grammars, and prompts in the well-known Visual Studio environment

·         Project Creation Wizard to enable developers to quickly establish the foundation for new telephony or multimodal applications

·         Powerful Visual Studio tools to create speech grammars (speech input constraints) and prompts (speech output)

·         A palette of Visual Studio speech controls, using familiar concepts such as properties, code-behind, and IntelliSense® completion, to create speech elements in ASP.NET applications

Deploying Speech Applications

The Microsoft Speech Server offers IT managers and call center personnel flexible deployment options for all application needs. Unlike proprietary IVR systems, MSS bridges the gap between datacenter and call center operations, and can provide the foundation for both telephony and multimodal applications. This makes MSS an ideal platform for deploying existing and new IVR applications while keeping an eye to the future when multimodal devices will play a major role in speech applications.

Foundations

Windows Server 2003 provides the foundation for a Microsoft Speech Server deployment that allows IT managers to leverage existing Windows management, security, and networking services. MSS, like other members of the Windows Server family of products, uses Microsoft Management Console and Windows Management Instrumentation. The entire speech solution runs on Windows and leverages standard Intel hardware for cost-effective speech deployments.

For connecting to telephony hardware, MSS leverages call management software and telephony adapters from multiple companies. Through these adapters and the corresponding TIM software, companies can connect to a wide variety of analog lines, digital circuits, and PBXs.

Single-Server, Dual-Server, and Distributed Environments

IT managers or call center personnel can deploy a complete speech solution on a single physical server. This configuration handles the needs of speech applications whose speech processing and telephony requirements are minimal. For larger call center or enterprise-level applications, customers can distribute the components across multiple servers. The customer has three basic deployment options:

| Topology | Benefits | Considerations |
| --- | --- | --- |
| Single server | Simplicity and cost | Will handle lower call and speech request volumes |
| Dual servers | Redundancy | Provides increased performance by separating telephony services from speech services |
| Distributed deployments with multiple servers | Scalability, performance, and reliability | Distribute Web, TAS, and SES services as needed |

Table 3. Flexible deployment options provide customers many options to achieve redundancy, performance, and scalability.

Because they use the standard TCP/IP, HTTP, and SOAP protocols, both the Web servers (housing the applications) and SES can be distributed and load-balanced using standard hardware load-balancing solutions. No proprietary software or hardware is required. To create a set of distributed TAS services, a PBX or other telephone switching equipment should be used. This allows customers to have redundant incoming phone circuits for high-availability applications.

Planning Considerations

When planning a speech deployment, the following factors should be considered:

·         The number of incoming calls or requests expected

·         Anticipated average call duration

·         The speech recognition and speech output load expected

·         The number of Web servers required

·         The number of TAS servers required based on expected incoming calls and average call duration

·         The number of SES servers required based on expected speech recognition and speech output load

·         A robust high-speed data network connecting the servers

·         Load balancers (optional) to distribute load across Web servers and SES servers

·         PBXs (optional) to distribute load across TAS servers

MSS includes a number of helpful configuration scripts and several test applications that can be used to confirm a speech deployment.
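
As a rough planning aid, the relationship between incoming calls, average call duration, and server count can be worked through with simple arithmetic. The sketch below uses Little's law (average simultaneous calls = arrival rate × average duration) plus a flat safety margin; every number in it, including the per-server capacity, is an illustrative assumption and not an MSS specification.

```python
import math


def concurrent_calls(calls_per_hour, avg_call_seconds):
    """Average number of simultaneous calls (arrival rate x duration)."""
    return calls_per_hour * avg_call_seconds / 3600.0


def servers_needed(load, capacity_per_server, headroom=0.3):
    """Servers required to carry `load` with a fractional safety margin."""
    return max(1, math.ceil(load * (1 + headroom) / capacity_per_server))


# All figures below are illustrative assumptions, not MSS specifications.
busy_hour_calls = 1200
avg_seconds = 180
load = concurrent_calls(busy_hour_calls, avg_seconds)
print(load)  # 60.0
print(servers_needed(load, capacity_per_server=24))  # 4
```

The same calculation can be repeated for SES servers by substituting the expected recognition and speech output load. Because an average understates peaks, a traffic model such as Erlang B would give a more careful estimate than the flat margin used here.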

Managing and Securing Microsoft Speech Server

Those familiar with management and security in a Windows Server 2003 environment already have a good idea of how MSS is managed. If not, numerous books and resources (such as MSDN®) are available describing Windows management.

MSS leverages existing Windows Server 2003 management metaphors including the Microsoft Management Console and Windows Management Instrumentation. MSS provides extremely flexible management, giving the administrator the option of GUI or command-line tools, local and remote management, and the ability to view and manage multiple servers at the same time.

The MMC module provides all commonly used configuration and management options. For more involved tasks, Microsoft includes dozens of scripts to perform otherwise time-consuming and advanced configuration routines. For example, here are a few of the commonly used scripts:

·         Postinstallationconfiguration. Allows administrators to easily set up a distributed deployment, with prompts to assign SES and TAS services to individual servers in a deployment

·         Speechservercontrol. Allows administrators to perform common configuration tasks such as stopping and starting speech services

·         Deployment. Allows administrators to add and remove servers to and from a deployment

·         Trustedsites. Allows administrators to add, remove, or modify the trusted servers (servers that MSS trusts to provide grammars, prompts, and applications)

This is only a sampling of the available scripts. Administrators and developers can also create their own custom scripts using the robust Microsoft Speech Server Provider with its classes and methods for such areas as Telephony, Speech, Deployments, and Trusted Sites.

Security in Microsoft Speech Server

Microsoft provides powerful security integrated with the Windows Server 2003 operating system. Administrators can specify Trusted Sites, indicating those Web servers that can be used by MSS to access grammars, prompts, and applications. Access control lists, commonly used by Windows administrators, can provide limits on access by internal and external users, by department, or by role in the company. For administration, only those with Administrator privileges can configure settings in any of the speech server components.
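
The Trusted Sites idea is essentially an allowlist check on where a grammar, prompt, or application is fetched from. The sketch below illustrates that check under the assumption that trust is decided by exact host name; the function name and host names are hypothetical, and MSS may apply different or additional rules.

```python
from urllib.parse import urlsplit


def is_trusted(resource_url, trusted_hosts):
    """True if the URL uses HTTP(S) and its host is on the allowlist."""
    parts = urlsplit(resource_url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Exact host match only: a trusted name used as a prefix does not count.
    return host in {h.lower() for h in trusted_hosts}


# Hypothetical trusted Web server.
trusted = ["apps.example.com"]
print(is_trusted("http://apps.example.com/grammars/dates.grxml", trusted))  # True
print(is_trusted("http://apps.example.com.evil.test/x.grxml", trusted))     # False
```

Matching the whole host name, rather than searching for it inside the URL, prevents look-alike addresses from passing the check.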

Summary

The Microsoft Speech Server and Toolset offer the current generation of call center managers and IT managers the ability to develop and deploy cost-effective telephony and multimodal applications. The future looks bright for speech as a mainstream and pervasive technology. More information about speech technologies can be found at http://www.microsoft.com/speech.


© 2003 Microsoft Corporation. All rights reserved. Microsoft, IntelliSense, JScript, MSDN, Visual Studio, Windows, and Windows Server are either registered trademarks or trademarks of Microsoft Corp. in the United States and/or other countries.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.