Silicon Valley Speaker Series: Cisco, Comverse, Intel, Microsoft, Philips and SpeechWorks Found Speech Application Language Tags Forum to Develop New Standard For Multimodal and Telephony-Enabled Applications and Services

Microsoft Silicon Valley Speaker Series
Oct. 15, 2001

*

BILL MEISEL: Welcome to this forum, the subject of which I guess will require some explanation, which I'll try to give as an introductory brief talk.

This is the first public announcement of something called the SALT Forum, where SALT stands for Speech Application Language Tags, as you can see.

I'm Bill Meisel. I'm not actually part of the SALT Forum, but I'm an industry analyst and publisher of a newsletter called Speech Recognition Update, and I'll try to be an independent moderator of this panel.

SALT is founded by six companies, presented there in alphabetical order: Cisco, Comverse, Intel, Microsoft, Philips and SpeechWorks, if you'll forgive me for reading the slide.

What SALT is, is that it's an attempt -- well, let me back up just a bit. Speech recognition and text-to-speech synthesis, where text is pronounced as speech, are two technologies that have been getting better and better technically over time, and you see them actually deployed in certain areas quite productively. For example, for at least four years now you've been able to call some stock brokers and say the name of a stock and get the price quote on that stock, and that's been very profitable for those companies, for example, for one company saving about $17 million a year. You can call airlines for flight information and get a speech recognition system. With some wireless services you can dial by phone using a dialing list of as many as 500 names. You can get your e-mail by phone. All these services by telephone are becoming more ubiquitous, and I would be very surprised if everyone in the audience didn't at least encounter one of these speech recognition systems by now in their everyday activities.

Similarly, the latest release of Office XP, for example, has built-in speech recognition so you can dictate into your applications as well as that application having been around for quite a while now for some millions of copies sold for dictation purposes.

While these applications suggest that speech technology is at a point where the technology is useful, it doesn't necessarily imply that creating applications that are compelling is easy, and that's part of what the SALT Forum is trying to do is to try to address the issue of how do we make speech easier to use, easier to develop, and available on more devices, and available on devices such that when you develop the speech application you can use it on multiple devices without redeveloping it.

And thus the SALT Forum was created to create multi-modal and telephone access to information, applications and services, whether you're using a standard telephone, a special enhanced telephone, PC, Tablet PC or personal digital assistants that are wirelessly connected to some network, so that while you're using these devices in the multi-modal sense you could use speech to say what you want, speech to hear the results of what you asked for, or also use a keyboard as input, a stylus as input, a mouse as input if you're on a PC, for an example, and get output in the form that's best suited for the application you want. If you say, I want to see my portfolio, then rather than having, say, a list of 10 stocks read to you with prices, maybe it would just put them on the screen if the screen was available. So the output can be in the form that you want.

Now, this SALT will work in a multi-modal sense that is supporting the technologies simultaneously, and it can also be used to support them independently, if that's your interest.

The SALT specification at this point is under development. It's an early stage. It's not available yet. But it's planned to be royalty free and platform independent, and to extend existing Web standards. So the idea is to get developers who are familiar with current development tools and who have existing Web sites or existing talents in developing Web sites to be able to use speech technology more easily and to use it integrated with other modes.

So the idea of SALT is to extend existing Web standards like HTML, XHTML and XML and to use those existing skills, and the intent is to have this specification, an initial version of the specification, available in the first quarter of 2002 and submit it to a standards body by mid-year.

The group of people we have here, and by the way, I want to thank those of you who have traveled here, especially if you traveled here from out of town, given the difficulty of travel, and those among the panel who have gone to all the trouble of traveling in times where it's more difficult to travel. But we have a tremendous group of speakers that I think reflect -- from the founding companies that I think reflect the importance of this initiative and the key focus it has in terms of adding another means, another user interface modality to almost any device. With a graphical user interface you need a screen typically and the smaller the screen the harder it is to use the graphical user interface. With speech, any device can probably fit a microphone in, and if it has the processing power or is connected to a network can conceivably support speech interaction.

The agenda is that we're going to have first a concept demonstration from John Dehlin from Microsoft, who's going to show, not using SALT initially, but show what the objective of SALT is using other technology and give a concept of why the multi-modal aspect of it and the single use aspect of it are important and what you can do with it, and then we're going to have the listed speakers from the companies give talks in the order that's basically based upon trying to create a logical discussion of why their companies are involved in this.

We have Howard Bubb, who's vice president in the Telecommunication and Embedded Group at Intel; Frank Caris, President, Philips Speech Processing North America; Michael Krasner, Vice President and General Manager of the Voice Solutions Division of Comverse; Dr. Kai-Fu Lee, Vice President of the Natural Interactive Services Group at Microsoft; Alistair Woodman, Director of Marketing of the Voice Technology Group at Cisco, and Steve Chambers, Vice President of Worldwide Marketing at SpeechWorks.

What they'll do is each will come up and I won't take your time by making a transition where I repeat what I've just told you; I'll just let them all come up in the order that essentially they're sitting, the order in which I read, starting with Howard on the left here is the order that we have here, and without further ado, I'd ask John to show you what this is intended to go toward.

JOHN DEHLIN: Fantastic. My name is John Dehlin. I'm excited to be here. Can you guys hear me in the back? Great.

As Bill mentioned, SALT stands for Speech Application Language Tags. And the purpose of SALT, both SALT and the SALT Forum, is to help drive the voice activated Web to the mainstream; or another way to say that vision is to make it so that all 6 million Web developers out there, 6 million plus, are building speech into their Web sites, that every one of these Web sites, where it makes sense, has a speech interface, so that it can not only be accessed by a desktop browser but also by a simple telephone and also by a multi-modal device, one of the many multi-modal devices that's out there.

So we feel pretty strongly that the best place to begin to make this happen is with those Web sites that are already in existence and with the 6 million Web developers that are out there.

Now, what SALT represents are some likely extensions to HTML. Why are we starting with HTML? Because the technology has set the standard, it works across multiple platforms, but most importantly it has a rich object model and a rich event programming model, which is perfect for what we want to do with SALT.

So from there, how are we going to make this vision a reality? What we're going to do is we're going to start with the companies that are in existence today, and one example is a demo I'm going to show you. In this demo, the company is called Kontosa Realty, and they're a basic real estate agency that provides some very basic functionality through their Web site. They allow a customer, a tentative customer to come up to their Web site, to select Buy, specify what type of house they want to buy -- in this case let's go with, say, a two-bedroom townhouse for under $200,000. Since location is one of the most important things about a house, they can actually specify what area they want to purchase a home in. And then finally it's going to assign them a real estate agent so that those people can connect and start working on getting that home.

So this is a basic Web site. But wouldn't it be fantastic if this Web site could be extended to the most ubiquitous and engaging device on the planet, and that's the telephone? So all we have to do is season this Web site with a little bit of SALT, implement the appropriate infrastructure behind the scenes, whether it's telephony or speech, et cetera, and all of a sudden this Web site becomes accessible by telephone, as I'll demonstrate now.

Now this is research, so we're going to try some real recognition here, but bear with me.

(Begin demonstration.)

(Telephone dialing)

OPERATOR: Welcome to Kontosa Realty. We have thousands of listings in the Seattle area. Please select Rent, Buy or Sell.

JOHN DEHLIN: Buy.

OPERATOR: How many bedrooms and what style and price are you looking for?

JOHN DEHLIN: I'd like a two-bedroom town home for under $200,000.

OPERATOR: A two-bedroom townhouse for under $200,000. Which neighborhood or school district are you interested in?

JOHN DEHLIN: Make that under $300,000 near Microsoft.

OPERATOR: A two-bedroom townhouse for under $300,000 near Microsoft. We have matched you with our agent Derek. Derek is not available right now, but he will contact you as soon as possible.

We have detected that a Passport account --

(End of demonstration.)

JOHN DEHLIN: So as you can see, by simply adding a little bit of SALT this basic Web page can be enabled to be accessible by a simple telephone.

But what about the explosion of multi-modal devices that are out there, from the smart telephones to the phones and PDAs to the Palms and Pocket PCs and even the tablet PCs that are springing up into the marketplace? These devices are important because they provide users with the means to communicate any time, any place, on any device. And I'll go ahead and show you an instance of an example where a device like this might be appropriate.

Let me go ahead and tab over and it looks like it's time for a presentation.

Now, what I have here is a Pocket PC. It's actually wirelessly connected, so I'm talking peer-to-peer with my server. I have a nifty little display application on my PDA so that you can see actually what I see real time while I'm typing on it.

So the scenario is as follows. The customer has now told the real estate agency, I want to work with you, and the customer is asking the real estate agency to please notify me when a house comes on the market that meets my needs. Now, the customer may be at a Starbucks browsing on their PDA or they may be in the airport, wherever it's connected to an 802.11, but all of a sudden a notification comes online, "Hey, John, a new house has just come on the market that meets your needs." All I have to do is click on the link to that Web site, the same Web site that served up the original Web page that the telephone just connected to, and I could actually interact with it in the most standard way you'd interact with the PDA, with my stylus. In this case I'll click on See Map, and I could go ahead and see a map of the location.

However, many of you who have tried to work with the stylus know that it's perhaps not always the optimum method of input, and sometimes inputs with the stylus can be tough. Wouldn't it be fantastic if there could be a common and a natural means of interfacing with these devices across all these multi-modal devices? And the most common natural interface and form of communication to all of us is, of course, speech. So let's see if we can get speech to work on this Web site as well.

Show kitchen. Yep, my wife likes a nice wide kitchen. That will do, but she also likes a nice and spacious living room. Show living room. Yep, that's nice, wide and spacious. But I've got three children and I need to make sure I've got a backyard that's spacious and they can run around in, spacious being the key here. Show backyard.

So you can see that this is very natural for me. I can speak to it in terms that I'm familiar with, and it's able to recognize and respond accordingly.

Now just to drive this point home a little bit further, something very simple like filling out a form on a device with a stylus can also be a little bit tedious. In this case I could change to the keyboard input panel, and if I want to search for nearby, I've got to scroll down here and pick elementary school within 10 miles, minimize that; it's just not necessarily the easiest way to interact -- 10 miles by foot, submit. I can get the information that I want, but once I become as a user familiar with this form, wouldn't it be fantastic if I could just fill the form out with my voice. So let's give it a try now, switch back to the speech recognizer and I'll query the form.

Show stores within two miles. Show stores within two miles. It's going to be fickle. Let's give it one last chance. Show stores within two miles. There we go. That lets you know it's real, right? And that would be a very easy way to interact with the device.

Now, to round this out, imagine if I really liked a home and now I want to schedule an appointment for my real estate agent, for the selling agent and for myself. It would be very difficult to actually enter in everyone's aliases into Pocket Outlook, reconcile everyone's calendar. If we connect it up to some type of service to make that happen, maybe something like this might be possible.

Schedule a visit on Wednesday night. Schedule a visit on Wednesday night. And there you see that it could reconcile all of our calendars and provide me with the means whereby I could schedule that.

So hopefully you've gotten a flavor for how speech might work as a modality with a multi-modal device, interacting, of course, with the stylus simultaneously.

And to just sort of round out this scenario, imagine now that I am transformed magically into the real estate agent, and as a real estate agent I'm driving around in a car and maybe I have some type of tablet device wirelessly connected as well. And I'm with my customer and there are three or four homes that I want to show the customer as we go drive around. But the customer may be able to tell just by viewing the outside of the home or the floor plan that this home isn't the one for them. A lot of time gets wasted driving around from house to house to house to find that right home. Wouldn't it be great if as the agent I can have some type of device to get that work done up front, and to help us do other things such as driving around?

So what we might do is have some type of tablet device where I could interact with it again with my voice and with the stylus. Homes for John Dehlin. I can say, "Hey, John, I want to go visit a couple homes. Before we actually take the time to drive out, let's go ahead and look at the floor plan. Does that look like something that you're interested in?" If I want to with the stylus I could ink comments on the floor plans. John's happy with that. If we really want to get fancy, we could actually do a 360-degree view of the home. Because we're using a rich device, that type of stuff is possible.

And then finally once John says, "Hey, let's go check it out," we've got to get from wherever we are to the home. Well, if we hook in some type of service that's geographical that actually hooks into my GPS location or coordinate, we could do something like this. Map this property. It will show where the home is located. The customer can check that out. Oh yes, it's close to a freeway. That's really cool. Maybe it's really important for the customer to be near a park maybe within walking distance. So again you could use multi-modal inputs and using voice and clicking here, so I'll say, "How do I get to here," and this could actually draw the directions for me.

And then finally if I really wanted to get fancy and impress that customer, I know the customer doesn't want to be near railroad tracks or back up into some power lines, I can throw in a little bit of snazz and connect up to some type of geographical service and do something like this. Give me the satellite view. And again provide the customer with a satellite view of the home.

So just to conclude, we feel very strongly that SALT will be very important for a lot of people: For the developers, because now they can take a language they're already familiar with, which is HTML, XHTML, and they can use that as a foundation by learning just a few more extensions to speech enable those Web sites and add more value to their customers.

But for customers they'll be able to have an input device, a multi-modal device where they can access their information anytime, anywhere, on any device in a very natural way.

And then finally all the enterprises that have made this tremendous investment in their Web site and in their infrastructure can now extend the leverage of that investment in ways never before thought possible.

And then if we could just conclude on one point, and that's that we feel very strongly that SALT will expand the power for every member, current existing member of the supply team in the speech industry, and we feel like this rising tide will truly lift all ships within the industry, and we're very excited to take voice into the mainstream.

And with that, I would like to switch back over to the next machine and introduce, without any further ado, Howard Bubb, vice president and general manager from Intel.

(Applause.)

HOWARD BUBB: Thank you. Well, good afternoon, everyone. It's my pleasure to address the topic of SALT from Intel's perspective, and that's the perspective of a platform provider. At Intel some may be familiar with how Craig Barrett has changed the mission from simply computing to really all aspects of the Internet. And the reason we view things this way is we believe today that every computer is becoming a communications device and every communications device is becoming a computer.

Now, the first part is very clear: Every computer you buy today has some kind of jack or antenna or something coming out of it, but the second part of it is every communications device within the same small footprint is upping its MIPS, its capability, its functionality and is becoming a computer.

And actually just as Intel invested in Intel architecture in the desktops and server, we're investing in specific architectures around IA and XScale for low-power mobile devices.

We think this is an exciting transformation we're seeing, because every time a new communications paradigm is introduced in a very flexible and open way it creates dramatic opportunity, change and new business for all the ships on this rising tide that John talked about.

We see that enabling speech is something that's really lagged. Certainly in the Internet now we have rich multimedia interfaces, but we've lacked the basic ability for a human being to interact in their most natural way, whether it's a realtor or whether it's just your simple ability without going through a bunch of menus to ask for the information you want. And we think this richer interface not only for the computers that you have on your desk or your laptop, but more appropriately to the devices, whether they be PDA, cell phone or any mobile device, is critical to opening up communications in a very human and interactive way.

The boundary between a computer interaction and a human interaction should be dropped. Many of you using operator services today call up and just ask for a city, state, street, zip code and Intel is heavily invested in these kinds of technologies. In fact, 9 out of the top 10 voice portals today use Intel platforms.

So we've invested in this over a very long period of time, and we believe that speech will become the primary interface to especially mobile computing. I mean, in the future if you don't have a speech interface, we're going to have little points on our fingers to manage those little itty-bitty devices, because you won't be able to handle them any other way.

We think SALT is very important to sparking innovation. With SALT we now have a very rich programming language for multimedia or multi-modal interaction.

The communications industry has lacked what the computer industry has had, and that is a rich programming interface that will allow standardization, allow millions of developers, six million HTML and XML developers to create the kind of exciting interactive media that you've seen here, and we think SALT will open up this innovation. VoiceXML is out there today creating telephony interfaces for Web pages and SALT certainly can do these functions, but it also picks up where VoiceXML is not today, and that is in the multi-modal, rich mixed media type of interactions that have been so well demonstrated.

We think that speech today has become extremely economical, and, in fact, the point that I want to get across is that these are really today technologies. Six years ago or so leading research institutions were doing speech, but it took an entire workstation, some very large, extensive device to get one or two channels. Today we're deploying with different voice portal companies and recognition players, such as are represented here today, the ability on let's say a dual 1 gigahertz Pentium platform to do anywhere from 96 to 192 channels of voice recognition, and we think this is going to move up very rapidly.

We've been investing in this technology for a number of years. We have our research centers in the U.S. and in China doing fundamental research into the platform aspects. Certainly you're familiar with MMX a number of years ago. Some of you may be familiar with SSE or vector processing extensions, which are very useful for media and speech.

More recently we've been introducing the Intel Performance Primitives, a library that enables the speech recognition technologies to be optimized on IA platforms.

And now as we speak, we're working on the next generation of technology, which is instruction level optimizations really tailored towards high-speed media processing. And media is not just graphics, visual, video, but specifically speech technologies. So on and beyond clock speed, the processors themselves are becoming tailored to the requirements of human interaction.

So at Intel we believe in speech. We've been investing in it for a number of years. We think it will open up a very rich environment, and that SALT provides a widely available programming interface, and at Intel we're committed to making that platform the platform of choice for the development of these multimedia technologies. Thanks.

(Applause.)

FRANK CARIS: My name is Frank Caris. I am president of the Philips Speech Processing Group in Dallas, responsible for all speech technology in the Northern American region. And I would like to walk you through what Philips is doing and why we think it is important to support SALT.

Let's look to today's cell phone as you can see it over here. It's a very nice looking cell phone, mobile phone. It has a very nice display. We're all familiar with that. And we have a keypad basically that is useful for some click-through activity on the phone. This is not a regular phone; this is a WAP phone, W-A-P. You will see this phone around because Philips is obviously supplying these telephones into the mobile phone vendor community.

Now, because it's a WAP phone I have access to the Internet, and that was the reason we used this specific telephone.

So now let's assume that I'm trying to find out what the stock price of Microsoft is on this WAP phone. You saw that I had to click once already because I'm looking for a sticker, a stock ticker. I touched it one time, two times, three times, four times, five times, six times, seven times, eight times, nine times, ten times and eleven and there I go. And if everything goes well, I get this beautiful feedback and it tells me what the stock price of Microsoft is. (Laughter.)

That's very elaborate, 11 steps for such a simple thing, and we in the speech recognition business at Philips really don't get that mechanism at all, so we are very happy with this initiative, because we learned that through our technology it's pretty simple to say, "Stock symbol, quote Microsoft" and you get where you want to be.

So what we see here is the slow and tedious process, and the microphone is not used at all, which is a contradiction on a mobile phone.

So with direct voice access to the Web, life is just much simpler. It's much more intuitive. As Howard said before, we get used on our PC to have access in a fast, easy way to information, but when we talk about all these mobile devices like mobile phones, we have to live with a crappy solution.

Let's take the example. We say, "Stock quote Microsoft," and we get a stock quote of Microsoft. "Weather in Mountain View." That's a stupid question because the weather is always good here. (Laughter.) Oh, sorry, I went too fast. Let's do it again. I'm having a little problem here.

So what is the beauty of this story? It's very simple, I think: I use my voice and I have immediate access with SALT to all these beautiful applications I want to use on my phone. So by saying "My mailbox," you go to your mailbox; by saying "My business pages," you go to your business pages. That's not clicking through 11 times. And then if you know the stock price of Microsoft, you want to know the second stock and you have to go to the same iteration. It's very elaborate, very user unfriendly.

Now, with 3G systems coming out, what we see is that there is a range of applications coming to the market, and those applications need a better user interface; otherwise it will never -- 3G as a standard will never be fully exploited. So we need to think about very good user interfaces there, and they are voice based again.

To give you an example, Philips customers like KG Telecom in Taiwan, they use a Philips based voice portal and people are swapping out audio files already. That's part of life, of reality. And again you cannot do these things in the way we tried to do them in the past.

Philips is a supplier of technology. We have a lot of these telecoms around currently where we supply our technology and this voice portal technology. And for these operators it is very important to drive traffic in the future. And we are helping them. We have a very good position in that market.

Now let's see what's happening step by step. So you have your beautiful user interface, whether it's a PDA or a mobile. There is a browser. There is an operator. There is a portal aggregator, an ASP, and there are category aggregators and the content owners. The content owner and the category aggregators basically have all the information available. I would like to listen to my mobile phone, and currently I can't. And again I think that's where SALT comes in. We have these Web sites and with some very simple SALT on the Web site, as the words were used here, it is easy to get access to all these sites without them being completely rebuilt, and I think that's a very strong point.

A little bit about Philips: So basically Philips is in the voice control system. That's DSP-based technology. So we have obviously a link to mobile phones, PDAs, whether that's either in the semiconductor form or the components form or in my group where we supply the technologies and the algorithms for these devices.

We have telephony solutions, which basically supply the technologies into both the telcos for voice portals, for voice dialing, et cetera, and where we focus also on the private networks, where a company has directory assistance; you call into a company and you say, "I want to talk to Mrs. X or Mr. Y," and you immediately get connected to these people. That's where Philips is strong. That's our telephony solutions group.

And there are the dictation systems, where we all know that before it was a professional market, the medical market, legal market, where people dictate in front of their PC, and that is a business, which is much more well-known already.

I think our language portfolio is important in this whole scenario, because speech recognition means that you need to adapt to the local market much more than we did before. We need to have the languages available in every specific region and in every specific market. Philips has those languages available and they will certainly have advantages from this technology.

So where has the keypad gone or the keyboard gone? Again, this is in the next stage of 3G technology there is probably a keyboard, but the information feedback you are getting is very graphical; in this case a beautiful overview on all the houses and sightseeing places where you can go. And we think that the next generations will simply have no keyboard left anymore. It's a telephone without a keyboard.

And again we see more complications in these LCD screens, you know, by saying "Calendar, address book," you go to a specific application. Scrolling through that in a menu manner will kill you. After two, three times you will give up.

A little bit about Philips here, again I think our strength is in the directory assistance and yellow page applications, where you simply say, you know, "I need a broker in Redmond, I need a house near Microsoft." First that is you need to get access to all kinds of services: addresses, telephone numbers. We have deployed a range of applications in the telco industry.

That was it. If you want to have more information, visit us at our Web site. I would like to give it over to the next speaker.

(Applause.)

MICHAEL KRASNER: Thank you. I'd like to spend a few minutes to tell you a little bit about where Comverse is going, our vision, and why we're here today as one of the founders of the SALT Forum.

A few years ago we made a video of where we wanted our company to go.

(Videotape demonstration.)

MICHAEL KRASNER: Well, I think when we made this, we didn't quite realize that everybody would be talking about mobile users and here we had perhaps the prototype of the immobile user. (Laughter.)

Okay, well, why are we part of the Forum? Well, Comverse provides enhanced services for network operators. We have over 375 network operators worldwide, 105 countries, and somewhere over 300 million users who are using our systems.

And as I'm sure is the case in every market, our market is changing. When we look at that market, we see that our network operators, their capital budgets are tighter, their bond with their subscribers is growing weaker, churn is not going down very easily, mobile data services were launched with huge fanfare and looking at great potential and perhaps have not quite lived up to that potential and need to be relaunched in another way. The average monthly revenue per user has been going down worldwide and our customers are looking at ways of how to increase the revenue per user.

Well, when we look at it, we see there is really a demand for applications that will take our customers, the network operators from 2G to 2.5G to 3G, and they're asking for open application environments.

In response to that, we recently launched the Comverse Spark initiative, which is to open our platform based on open standards, such as open SMS, VoiceXML and SALT, so you have access to not only our own applications, but a huge range of other applications that adhere to those standards.

Well, I started out, the first slide said it's all about the user. And it is. It's all about how the network operators, our customers, how do they deliver value to the user, and if they deliver value to the user, then all those issues about revenue and subscriber loyalty, all of those are part of delivering value.

Well, we went and asked users, we interviewed not only our customers worldwide but several thousand end users worldwide, in North America, Europe, Japan and several other places and asked them what is it that they were looking for, what did they believe is value in their eyes.

They said several things: Convenience, personalized convenience, the ability to be able to interact in ways that they wanted to interact.

Connectivity: Being able to use any device any place, any time, whether it was work, play, evening, morning, in the car, at work.

Entertainment: Very strong, wanting to have entertainment and looking at how the entertainment in the future was going to be on mobile terminals.

And finally a richness, a richness of media, speech, text, video and graphics all combined together.

When we look at that, we see that SALT technology will clearly be a key in order to deliver this multimedia world on mobile terminals to our customers, and that's why we're here.

Thank you.

(Applause.)

KAI-FU LEE: Good afternoon. I'm Kai-Fu Lee from Microsoft. I'm really excited to see that from all the exciting presentations and demos we saw that voice is beginning to be really useful, beginning to be used, beginning to be useful. In Office XP, in Windows XP we're seeing the beginnings of speech in our products, and as mentioned by Bill earlier, if you've used a telephone to access Fidelity, Charles Schwab, Sears and I heard Wal-Mart today, you have probably had a chance to dabble in what speech interface looks like.

But my talk today is not about speech today; it's about the future. It's about speech enabling the Web and taking speech into the mainstream, which is why we're participating in this forum.

In some sense when you look at Microsoft, making things mainstream is our core competency. In the past the work that we have done can be summarized, the success that we've had can be summarized as creating platforms for developers to innovate and to create the killer apps that create end user value. And that's where we're going with SALT and that's where we're going with our .NET initiative, which is really our company-wide initiative to bring about an exciting new world of web services.

I'm sure you're all familiar with web services, but certainly some good examples were John Dehlin's demo, the demo from Comverse; those are some really exciting future scenarios.

But if you have to summarize web service in two phrases, I would say first it's all about XML, and secondly, it's all about multi-device.

It's about XML because we don't want developers to have to build large monolithic applications. When you saw John Dehlin's demo, what is really exciting to me is the fact that Kontosa Realty doesn't have the resources to build this gigantic, monolithic app, so they were able to take advantage of a speech recognition and synthesis web service wrapped in XML. They were able to take advantage of an identity authentication system provided by some other company and they were able to tap into the multiple listing XML web service provided by some other company. So the power of web services is the ability to connect these individual components together using the common foundation and language of XML.

And the second exciting thing about web services is that they're accessed anywhere, anytime, on any device. We fully understand and recognize that PCs are an amazing industry. However, the reality is people want services to be customized to them, to people, not to machines, not to personal computers. So individual users want personalized capabilities with storage stored in the cloud, accessible anywhere from any device, so an interface has to work across all these devices.

And that's why we're so excited about speech, because speech is in the future really the ideal interface for these Web and in our case the .NET platform interfaces.

The way we look at it from Microsoft is that speech has two great properties that make it a perfect fit for the web service. First, imagine a world where a lot of companies are successful, pushing out these web services, Kontosa Realty and many, many others, 1-800-Flowers, what have you. People need a way to access, to discover, to find and to use these services. And can you imagine using GUI, using your mouse to be able to enter something like "Townhomes near Microsoft," or to be able to say, "Send flowers to my mom," or to be able to say, if an urgent call comes in, "Page me." These are all things that require not only the naturalness but also the expressiveness, the power of human language, which is intended for communication, that we think is what makes it a perfect fit to discover and use web services.

And the multi device, I think you've heard many times that there is not going to be a keyboard on my telephone ever, and voice is the only modality that spans through all of these devices. So it's scalable, it's natural, it can be used on your telephone, smart phone, television, as well as the personal computer.

Now, coming back to this talk about mainstream, what is it going to take for this or anything to become mainstream? In Microsoft's experience, what makes something mainstream is about four things: One, open industry standards that are supported. Two, a common platform of well-distributed server and client supporting these open standards. Three, great tools that get developers to be excited. And four, killer apps. For the Internet: HTML, HTTP for the open standards, IE and Netscape for the browser platforms. There are server platforms as well and there are great tools such as FrontPage, Dreamweaver and others that get developers excited and the rest was history for the Internet.

So for speech, we're looking for exactly the same set of things. First, we need an open standard that people can all buy into that you can see from this set of companies we in our respective products are going to be supporting this new initiative of SALT.

If there's one thing I'm most excited about SALT is that it's not a new giant, big technology and standard, but it's a lightweight set of extensions to HTML that enable Web developers to continue to use the programming paradigm that they've come to love and use every day. They can use the same events, object models, methods and scripts to continue to create the web service experience that they have and that speech is that simple addition that they can get with an incremental amount of investment. They don't have to learn a new programming language. They can use the paradigms they like today. And their investment in HTML is preserved.

Along with other colleagues here, Microsoft is also interested in doing our part to make SALT more and more used. What we'd like to tell you today is that we intend to make this even easier for developers and for Web developers in particular. We intend to give assurance to Web developers that when they develop, there will be client platforms and server platforms from Microsoft that support this open standard, that ASP.NET, our server side product, and that IE and Pocket IE will support SALT.

But most importantly, because this is about developers, we're going to be providing great tools built on Visual Studio .NET, which has six million users worldwide, and when we make the possibility of developing speech so easy, that's when there's going to be a wide amount of opportunities for developers to build on top of that.

So to conclude, I think the success of speech is all about killer apps. Killer apps are built by excited developers. Excited developers need assurance of great client and server platforms to be distributed and they need great tools, and all of this needs to be built on a single incredibly important but lightweight ingredient, SALT.

Thank you.

(Applause.)

ALISTAIR WOODMAN: Good afternoon. You may ask why Cisco is sort of engaged in the SALT Forum. I think at a very macro level the company has always been interested in standards and protocols. It's sort of part of the company's heritage, and there's an old aphorism at Cisco that we've never really run into a protocol we didn't like. So to a certain extent that's one of the levels that we like to get engaged in these types of things to enable the industry to scale, and I think that's sort of true of most of the founders here that they have an interest in that sort of higher level common good, which essentially equates to raising all boats in this particular marketplace.

We've also got some more parochial interests in this particular space. A good example I think is looking at the amount of sort of PC to phone users out there in the marketplace and just the availability of broadband support. And we see an opportunity where there's technology to sort of drive and lift the interest in being able to drive services over broadband infrastructure by using the ability of PCs as terminal devices and other smart end points to be able to sort of get more and more traffic onto essentially data networks to provide growth in all manners of the industry.

So from our perspective we think it will sort of enable and enhance the user interface and experience on PCs and handheld devices, both fixed and wireless, and that should also then drive the interest in people being able to get killer apps over broadband access. So we think there's a sort of coupling of the business interests in that particular regard as far as access technology is concerned.

As far as the whole sort of telephony interface of going after business opportunities, we see that the marketplace at the moment for packet telephony and some of the underlying areas that we've been active in the industry is driven by increasingly more interest in higher value added services and new access opportunities.

So those are the sort of yellow and the blue lines on these charts here, and this is essentially why we're interested in this particular area. We have business interests in multiple other areas, but it's the thing that's going to be driving new applications and new ways of interacting with networks that's of interest in our engagement with the SALT Forum.

We've here tried to sort of talk about where we see the marketplace in the past and where we think it's going in the future. There has been this sort of discussion about this sort of fixed and modality of devices and how people interact with them. I think everybody is very familiar with the sort of phone and the plain old telephone service. Its primary advantage today is its ubiquity. Even though it is true that still probably about half the world's population has never heard a dial tone, everybody else, the other half of the world population is very familiar with interacting with the phone and used to dealing with and engaging with services through that particular metaphor.

The interest from our side, of course, is to sort of be able to break out of that particular portion of the marketplace, which its advantage at the moment is ubiquity and scale in terms of the number of end points, and drive sort of other types of features and services out of that particular realm and get into the ability to be able to provide services for mobile and IP telephony end points into the marketplace.

The other way of looking at it as well is where you've gotten from the particular sort of PC as an end point and the way that that device is sort of migrating in the marketplace from being sort of just the desktop device, which was sort of high level of ubiquity but nowhere near like the level of ubiquity of the phone itself, so the sort of devices that people use a lot more today like portables and next generation sort of pocket PDA type devices, and we're interested in making sure that we see a unity of those types of devices as well and being able to allow people to span both the synchronous and the asynchronous dimensions of communication, as well as get out from that sort of single modal mechanism of communication to the multi-modal facet.

So the interesting part of the market is getting the new killer applications and services that you can drive across all these terminal types, because I don't know if you as users of these pieces of technology, generally I feel very frustrated when I can't use the services that I'm used to on one particular device and get access to them from another. And this is essentially the drive here of the SALT Forum is to make it able to provide spanning mechanisms of getting from one different access device to another different access device to services and interact with them in relatively similar types of ways across the board.

We believe that this will create an obvious virtuous circle in the industry of being able to allow the sort of growth of voice activated services, drive opportunities for new access and networking demand and services back here, which will then create yet ever more increase in sort of applications on the graphical side. And this was sort of alluded to earlier on that we believe will essentially make the pie bigger. I think for those of you who have looked at any of the economic statistics and interactions with any of these types of things, communications services generally have a positive effect on other communication services, and it's very rarely that one really sort of cannibalizes another. They usually end up driving the whole infrastructure and the business to be more widely used in both different areas, and we believe the ability to create this linkage in the multi-modal area to be able to link up graphical user interfaces and speech built interfaces will have a very positive effect and positive development in the marketplace. This will help drive the industry, help move more traffic, create better applications.

So from the sort of company who likes to do a lot of the plumbing work around here, that's a good thing, because that sort of drives a lot of the infrastructure stuff, and in our particular business that's sort of what it's all about.

So we've heard a lot about standards. I'm not going to spend too much time in this general area. We think there are a lot of things out there that you need to be able to build robust packet telephony infrastructures, and we're very excited to be a family member of the SALT Forum.

So with that, I will hand on to our next speaker.

(Applause.)

STEVE CHAMBERS: Good afternoon, everybody. I am the last speaker and after this we'll be going into Q & A. And I thought it would be interesting for you to hear a little bit from some major companies how they look at multi-modal applications servicing their business in the future. SpeechWorks makes speech recognition and text-to-speech technologies and related tools to build very high-quality applications delivered over the telephone to date, but we've also been investing to enable this multi-modal vision.

So when I say companies like SpeechWorks, this is kind of our internal mantra, to manage their customer conversations. There are a lot of companies out there who have already begun this work just over the telephone. So today we announced new customers in Wal-Mart and also Amtrak, and we've talked to some of the names up here about multi-modal applications.

So rather than maybe showing a demo, I want to share with you United Airlines, for example, a big SpeechWorks customer. They have seven speech applications up and running today. And we sat down with them in a strategic planning session about, well, where do you want speech to go in your company, and one of the first things they said this year was, "Well, a lot of our premier customers when we did research, they're using these PDAs and they're investing and they're turning over their cell phones and PDAs. We're going to need to reach them not only just with visual displays but we'd like them to use voice navigation." We said, "For what applications?"

Well, it's difficult to give people flight itinerary information when there are a lot of results coming back. So right now today our over-the-phone application supports calling United and saying, "I'd like to go from San Jose to Boston next Tuesday," and the system comes back and says, "I have five flights that meet that itinerary." There's a lot of reading back of what those flights are, and what United wants to do is make it economical and just hit them right back with a visual list, which is a lot easier for people and their clients to process.

They also want to deliver frequent flier coupons, electronic coupons and premier incentives for their premier emeritus class of frequent flier.

First Union National Bank is a customer and they're about to roll out a series of seven new banking services all over the telephone, where they want to get as well is introducing account displays for you checking on bill payment status. That again you can get a lot of lift, for bill payment you can get a lot of lift for checks. A visual representation of that is something they feel will make them a lot easier to do business with.

And then Wal-Mart is today doing order tracking, so that's what we're announcing in terms of their use of SpeechWorks technology. What they want to do is display to you in the future all their goods in the store, and then when we get to high bandwidth networks, 2.5G and 3G, they want you to go to virtual store walkthroughs on your device, where you'll navigate, go to toys, go to home goods, et cetera.

So that's just to give you a flavor of this isn't that futuristic. The idea of multi-modal technologies is coming out by major companies who have already invested in speech-based technologies today.

So the way we look at this is when companies want to automate their transactions, when they want to reach customers in an automated way, they started off for years with touch-tone, and this is the interface everyone loves to hate. They added Web, which was fantastic in terms of the power and control it gave customers, the degree of visual navigation and options and personalization.

But as you've heard today, the most common device in the world is the technological device, the telephone, so to reach the most customers obviously you'd want to automate through that common device, which is the telephone, and that leads people to conclude, well, speech is a great way to do that.

The beauty of multi-modal is it ties them all together. It allows you to combine the visual and the reach of speech and the visual interactivity in a way to create a better caller experience and make a more economical transaction, et cetera, which is all companies' grail. They want things to be cheaper, easier to do business with, higher value, et cetera.

So just in a nutshell these are the technologies we are working on to help make this vision real. It starts with you've heard a lot of speech recognition today, you speak, the computer understands you, so we do that in about 20 languages; then text to speech. Today, Yahoo and AOL are using our text to speech and deploy services such that you could dial in and check your e-mail on your commute and cycle through your e-mail and get through your messages, have them read back to you.

So that's called server-side technologies, because you have any old telephone over today's networks, you dial in, you can transact, receive information. But what you've seen today is embedded technologies, basically speech recognition shrunk down such that it fits on a device for what's called command and control purposes. You heard John earlier in a demo make some commands of the device and the device responded. So that's embedded speech recognition, and then text to speech where the device itself might talk back to you is embedded text to speech.

What we're working on are these four quadrants and DSR really is some fancy stuff you can do when you have technology running on the device, the embedded technology communicating with technologies on a server. And we can do some interesting things relative to pre-processing and accuracy improvements and whatnot when both of these things are used in conjunction; so key component technology to the vision of multi-modal speech-based and multi-modal applications that are going to be available.

So the idea that on these devices -- and you've seen this already, so I won't hover on this, but obviously everyone is talking about speech in/visual out is kind of the Holy Grail combination for a lot of applications and it assumes the high quality displays and the embedded speech recognition and text-to-speech technologies.

Now, why we see this as so important is today this has been the success formula in the speech industry, where you have technologies, raw technology, you have services and user interface -- building the user interface for multi-modal applications, for example, is totally new domain. If we turn back the clock 12 years ago and sat everyone down and showed them an icon of a house and two arrows and a red circle and said, "What do you think," you know, people would have no idea what those represented. We have similar obligations to create standards in user interface. So you heard John issue some commands to his device. Those were specific maybe to a device, but the idea for this to take off is that we create the moral equivalent of a nav bar on the Web for multi-modal applications.

So SpeechWorks won a DARPA grant and is working with folks like Compaq and AOL and Lobby 7 and some other players to try to standardize and define via a lot of human factors testing what the optimum interface is.

So the technologies we went over, speech recognition, text to speech and a number of other related technologies; partners who are all involved, like the six founders of the SALT initiative; and then finally standards, because none of this will achieve widespread adoption unless we do have a standards effort. So the charter of this group is really about accelerating standards for multi-modal technologies. SALT can work for speech-only technologies, but part of the power of it is it's really the first attempt to accelerate standards for speech and visual used in one application. So that's why it's so exciting and that's why SpeechWorks is involved.

End

