June 2004 Silicon Valley Speaker Series Transcript - Microsoft Research Road Show 2004: A Preview of Technologies of the Future

Dan Ling, corporate vice president, Microsoft Research (MSR)
Jim Gray, distinguished engineer
Mike Schroeder, senior researcher and assistant director, MSR
Mountain View, California
June 9, 2004

*

JIM GRAY: Good morning. My name is Jim Gray. I work at Microsoft Research in San Francisco. And first I'd like to begin by thanking you very much for coming. We have an exciting afternoon planned for you. We've got demos. Next door there's -- I hope you picked up some of the food on the way in, there will be more food on the way out.

But what we're going to do is first, Dan Ling, who is vice president of Research and head of the Research Lab in Redmond, Washington, which for us is the Mecca, is going to speak first and talk about the things they're doing in Redmond. Then I'll talk a little bit about the work that we're doing in San Francisco, and then Mike Schroeder, who is assistant director of the lab here and he's our host today, is going to talk about the work that we're doing here in Silicon Valley at the research lab.

So with that, I'd like to introduce Dan Ling. Dan?

DAN LING: Thanks. (Applause.) Thank you again, all of you, for coming to this Microsoft Research Silicon Valley Road Show. This is, I guess, the second year that we've done this. I thought I would like to start by telling you a little bit about Microsoft Research, and giving you a bit of an overview about what we do. The organization was founded in 1991, and one of the things I like to do is actually to remind people that Microsoft was really quite a different company back in 1991. It seems like it wasn't that long ago, but actually Microsoft had just crossed the $1 billion in revenue mark around 1990 or so. Windows 3.0 had been released, but not Windows 3.1, which was our big, successful release of Windows. Word and Excel were out there, but not yet leaders in their respective markets; WordPerfect and Lotus 1-2-3 were still the market leaders.

So, Microsoft was really a much smaller company, a much different company than it is today. And yet I think it showed great foresight on the part of Bill Gates and Nathan Myhrvold that they realized that it was important for us to start a basic research lab, and to really do the research work in order to enable Microsoft to become the company that they hoped the company would become. And I always like to remind people of that, because it's just something that people very often forget.

Over the past 13 years or so, we've grown a lot. Microsoft Research now has 700-plus researchers around the world. We have labs up in Redmond on our corporate campus. We have a group in San Francisco that Jim Gray heads. The one here in Silicon Valley Campus that Roy Levin and Mike Schroeder lead. We have a lab in Beijing, China, which is really a lab that works very closely with the university community and the research community in China, and that was started in 1998. And in 1997, we also started a lab in Cambridge, England, on the Cambridge University campus right next to the computer laboratory.

One of the things that also people tend to get confused about is that we are actually a very small "R" next to a very large overall "D" that Microsoft funds. Microsoft every year spends close to $7 billion in research and development. We are the research part of that, and we don't, as a matter of course, we don't really undertake development projects ourselves, although we obviously work very closely with the development groups throughout Microsoft.

Our mission is really very straightforward. It's really important for us to actually be part of the research community, to push forward the state of the art of computer science in the areas that we work; to be a part of that community means to publish, to attend conferences, to be very much a part of the discussions and be very open with having a lot of visitors, and to spend a lot of time really engaging with the worldwide research community, and that's the way we move the state of the art forward.

But, of course, we are a corporate research lab, and, therefore, it's really important for us to take some of this technology that we work on and have it benefit the hundreds of millions of Microsoft customers around the world. And to do that, we work very closely with the various product development groups in Redmond, down here, and other parts of the world to make sure that Microsoft products have the research technology. And one of the things that I'm really quite happy to say is that at this point every single Microsoft product has been touched by some technology from Microsoft Research, whether it's a tool that has been used to test the product, or to improve the quality of the product, or whether it has been something that is a new feature and a new direction, every Microsoft product now has pieces from Microsoft Research.

In order for us to work with our product groups, we do a number of things. One of which is to have a program management team that helps us work together with a product group. One of the things that we actually do as a part of that is have almost what is like an internal trade show every year in March, and we take the whole Microsoft Conference Center in Redmond and set up over 100 different demos, and invite all of the rest of the company to attend. And last year, we had about 8,000 people, Microsoft employees actually, attend the event, which is a very significant fraction of the Microsoft employee base in the Redmond area. So, you can imagine, it's an event that really gets a lot of people excited, both in the Research group, because it's an opportunity to show our technology and for people to show what they've been working on, and, actually, for a lot of the employees around the company it's an opportunity for them to see a little bit about what we do and what the future holds for information technology.

I said something already about participating in the academic research community. We work very closely not only with the universities, but with organizations like the National Research Council, and the NSF, and DARPA, serving on things like ISAT committees to review new technology areas, different program committees, editorial boards, and really do a lot of professional service as part of our work.

We have very strong ties to universities. We have a lot of interns who come and visit us at Microsoft Research. In Redmond alone this summer, I think we had 180 interns from graduate schools around the world who come and spend the summer with us. In addition to that, we have quite a rich visiting researcher program where faculty members come for maybe a couple of days to give a lecture and visit, or maybe a couple of weeks to do a little bit of a joint research project with us, or maybe spend up to a year as part of their sabbatical. So, there's a lot of ways where we interact with the rest of the research community.

We have also a lot of events. Certainly this is an example of such an event. At the same time, this week we're actually also having an Open Day on our Cambridge Campus where we've invited a lot of press and other visitors to come and see the work that we're doing at MSR Cambridge.

In Beijing every year we hold an event called the 21st Century of Computing, and that's mostly for students, and we book a very large conference hall in Beijing, and we invite something like 2,500 students to come and listen to a day of lectures from lots of prominent computer scientists, not just from Microsoft. We invite leading computer scientists from around the world to come and give talks during that day, and the students are incredibly excited, and it's usually standing room only, and I've heard that tickets are in huge demand, and afterwards the speakers are always asked for their autographs, and to have their picture taken, which is something that happens very infrequently to people in computer science.

We also have a number of conferences that we hold for people who are also doing work in areas that we're interested in. We recently had one in Redmond on social computing, sort of the idea of how do you use technology to help promote social groups; blogging, for example, has become a very rich technology for building communities. We're having a conference right here, in fact, later this summer on e-mail and anti-spam, where we've invited people from a number of different companies, e-mail service providers, research labs, and universities to come and talk a little bit about ways in which we can all work together to fight spam, and I'll have a little bit more to say about spam a little later on.

We have a faculty summit in Redmond a little later this summer, and we also have, in collaboration with the University of Washington, institutes during the summer on a variety of different subjects. Every year there is one. This year there's going to be one on testing, but we've had them on data mining, we've had one on looking at analogies between the neurobiology of invertebrates and computers, which is a little bit of a stretch. But it turned out to be a very interesting conference, as well.

We really do, across our labs at Microsoft Research, work in many, many different areas. I've just listed a few here. There's really far too many for me to go into in any detail. So what I thought I'd do is just pick a few projects which are actually not really in evidence in the demo room next door, and talk to you a little bit more about those.

The first one is what I talked about a little bit before, fighting spam, and this is clearly something that is of great importance to the Hotmail group, which is here in Silicon Valley. It's just a tremendous challenge for, I think, everybody who uses e-mail today. If you look at some of these numbers from the study that the Pew Foundation undertook, the percentage of mail that is spam is very high, and the number of users who are really feeling that their entire faith in the Internet has been weakened because of spam is significant. It's really a problem for all of us.

In the Hotmail world, you can see that as of August a year ago, close to 80 percent of the inbound mail was spam, and I've heard that the number has risen even beyond that. So the number of spam messages coming into corporations, to e-mail providers, ISPs, is really very, very large, and a big challenge for everybody.

There are interesting, technical solutions to spam, or to fighting spam, that I'll talk about here, but obviously there are other things that need to be done, as well. There are legal things that need to be done, and it's clearly something that needs to be undertaken by the community as a whole. No single company can successfully fight spam. And I'm pleased to say that leading e-mail providers are really starting to get together in conferences like the conference I described earlier, and at other venues, and at the IETF, and places like that, to look at sort of common strategies to fight spam.

One approach is certainly to do better filtering of mail, and using machine learning techniques to try to automatically identify pieces of mail that look like spam. And in a system like Hotmail where we have millions and millions of users, some of whom are helping us identify new pieces of mail which are spam, these machine-learning-based filters can actually evolve over time, so that they learn to identify new spam, as the spammers try new techniques.
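As a rough illustration of a learned filter that evolves as users report new spam, here is a minimal sketch using a naive Bayes word model. The talk does not say which learning method Hotmail uses, so the classifier, the `NaiveBayesSpamFilter` name, and the sample messages are all assumptions made for the example.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Tiny word-count naive Bayes classifier with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Each labeled message (for example, one a user reported) updates the counts.
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def _log_score(self, words, label):
        counts = self.word_counts[label]
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(counts.values())
        total_messages = sum(self.message_counts.values())
        # Smoothed log prior plus smoothed log likelihood of each word.
        score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
        for word in words:
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary) + 1))
        return score

    def is_spam(self, text):
        words = text.lower().split()
        return self._log_score(words, "spam") > self._log_score(words, "ham")


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize click now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday", "ham")

# A new style of spam is not recognized until someone reports it.
before = spam_filter.is_spam("cheap pills online")
spam_filter.train("cheap pills online", "spam")
after = spam_filter.is_spam("cheap pills online")
```

In this toy run the unfamiliar message is let through at first and caught after a single report, which is the "evolve over time" behavior described above in miniature.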

There are other approaches that try to block IP addresses. There is some very interesting work that I'll talk a little more about, around the idea of postage, namely, having the senders spend a little something, whether it be money, whether it be computation, or whether it be a proof that they are human, and not just a machine. In other words, just change the economics a little bit, to make spam a less attractive economic vehicle for the spammers. There are lots of things going on in anti-spoofing.

Today one of the reasons spam is so common is that it's very easy to pretend you're somebody else, or you're sending mail from somewhere else, other than where you're really sending mail from. And there are some efforts by the leading e-mail, and ISP providers, as well as the technical community to join forces and look at how to have some new standards that will make it much more difficult to spoof the system.

The idea of postage is really, as I said earlier, an economic argument, to just change the economics slightly so that spamming is not entirely free, and there is some cost involved with sending mail. And this is really the idea that if you just change the economics a little bit, it is no longer so attractive to send spam, and you can do this in one of a number of ways. You could have it be like regular postage, and have it cost some money to send mail, but that's very complicated, because you'll have to figure out who is going to collect the money, and who does that money go to, and so on, and so forth. So there are some alternatives that revolve around one of two ideas.

One is that you ought to prove that you are a human. So suppose I get a piece of mail, and it comes from somebody I don't know, and I've never communicated with before, then maybe what I'll do is send a challenge to that person to solve a little puzzle, and the puzzle could be a little test, like being able to read the letters listed here, or the puzzle might be identifying some spoken word.

(Audio played.)

You probably couldn't hear that very well, but it's basically the same letters and numbers here, spoken against a background of noise, and it makes it very hard for some automatic device to understand that, and requires a human listening to it to understand it. And that simply means that by solving this little puzzle you prove that you're human, and once you solve the little puzzle the mail can be forwarded to your inbox.

An alternative is actually not to have you prove that you're human, but to have your computer do a little bit of work. Now, this would happen entirely behind your back, so the user would not know this was going on at all, and this is some work done at the Silicon Valley lab here, where they were looking at, let's try to impose a computational cost to the sender, so that the sender might have to use their computer in the background, again, totally invisible to the user, to compute a difficult problem that might take a few seconds, which, as you aggregate over all the spam mail the spammers send, would mean that the spammers would have to buy lots and lots of machines, and therefore it would change the economics of all of this completely.
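To make the computational-postage idea concrete, here is a minimal sketch of a hash puzzle in the style of hashcash. The talk does not describe the function the Silicon Valley lab actually used, so the SHA-256 puzzle, the `mint_stamp` and `verify_stamp` names, and the difficulty setting are assumptions swapped in for illustration. The point is the asymmetry: the sender searches through many hashes, while the receiver checks the result with one.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def mint_stamp(message: str, difficulty_bits: int) -> int:
    """Sender side: search for a nonce whose hash has enough leading zero bits.

    Expected work is about 2**difficulty_bits hash evaluations.
    """
    for nonce in count():
        digest = hashlib.sha256(f"{message}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce


def verify_stamp(message: str, nonce: int, difficulty_bits: int) -> bool:
    """Receiver side: a single hash evaluation checks the sender's work."""
    digest = hashlib.sha256(f"{message}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


# Hypothetical message header; 12 bits keeps the demo fast (about 4,096 hashes on average).
header = "to:alice@example.com|subject:hello"
stamp = mint_stamp(header, difficulty_bits=12)
accepted = verify_stamp(header, stamp, difficulty_bits=12)
```

A real deployment would tune the difficulty so that one message costs a few seconds, which is negligible for an ordinary sender but adds up quickly across millions of messages.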

Another example I'll give you of something we're working on is more of a grand challenge kind of a thing, and that is sort of the idea of trying to understand normal human language, a natural language like English or French or Chinese or Japanese. This is actually an interesting case, because it illustrates what we do when we undertake projects that have a very long-term horizon. In other words, our end goal here is really to be able to do something like be able to speak in English to a machine, and have it understand, and respond in a reasonable way, which requires speech recognition of the words, but even more important than recognizing the words, actually understanding what it is that you say. And that requires natural language understanding.

Our approach in doing these long-range projects is actually to see, along the way to achieving some of these long-range goals, whether there are some intermediate things that we can do that are already interesting.

So the first thing that you need to do, before you can really understand language, is you need to understand a little bit about the grammar. So we actually have built grammar parsers which try to identify things like the sentence structure, the noun, the verb, the object, and then, clearly, from something like that you can build a grammar checker that tries to identify, for example, that the subject and verb agree in number, so that they're singular or plural and so on.

Beyond that, in order to do something more in natural language, you need to understand a little bit about the meaning of the words. Our approach there has been to build something that we call MindNet, which is really a semantic network. Think of it as a bunch of senses of a particular word, and the relationships between those words. So if I say the word bank, bank has many senses; one sense is as a financial institution. And what are the links for the word bank, in the sense of a financial institution? Well, that might have links to money and tellers and finance and things like that. On the other hand, the word bank could also mean a river bank, but in that sense the word is linked to words like water and river and other such words related to that sense of the word.

So as you can see, by building this semantic network of lots of words, you acquire a very rich set of information on which to analyze sentences.
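The bank example can be sketched as a tiny sense graph. This is only an illustration of the idea, not the structure of the real system: the `SEMANTIC_NET` dictionary, the sense labels, and the overlap-counting rule in `pick_sense` are assumptions, and the linked words are the ones mentioned in the talk.

```python
# Each sense of a word is linked to related words (labels are hypothetical).
SEMANTIC_NET = {
    "bank/finance": {"money", "teller", "finance"},
    "bank/river": {"water", "river"},
}


def pick_sense(word, context_words, net=SEMANTIC_NET):
    """Choose the sense of `word` whose linked words overlap most with the context.

    Returns None if the word has no senses in the network. Ties go to the
    sense listed first.
    """
    context = {w.lower() for w in context_words}
    candidates = {
        sense: links for sense, links in net.items() if sense.split("/")[0] == word
    }
    if not candidates:
        return None
    return max(candidates, key=lambda sense: len(candidates[sense] & context))


financial = pick_sense("bank", "the teller at the bank counted the money".split())
riverside = pick_sense("bank", "we walked along the river bank near the water".split())
```

Even this crude overlap count separates the two senses in these sentences; a real network would hold far more words and typed relationships between them.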

So what we've done with this is actually to look at machine translation. Machine translation is a challenge for a lot of companies like Microsoft that have to produce a lot of manuals in many different languages. Every multinational company has that as a challenge, of actually being able to produce material, manuals, and information in many different languages.

To date, machine translation has always depended upon very heavily customized, manually authored translation systems. Instead we had a different idea of what to do, and namely to try to build a machine translation system that was example based. So given an example of hand-translated material, say, existing manuals-- and Microsoft, for example, has a lot of existing manuals in English, and in Spanish, and in German, and so on, and so forth -- and if the system were given this example base of existing material, could it then take that information from the example base, and then generalize that to new manuals, but in roughly the same domain areas? So if you build the translation system like that it might be good at translating more Microsoft manuals, but it wouldn't be very good at translating medical text, for example.

So that's sort of the strategy that we took here. To give you a little bit more of an idea of exactly how the system works, I thought I'd just walk you through this a little bit. There are two phases in which the machine works, one is a training phase, where you're trying to give the machine the example base of human translated material. And then later on, I'll show you what happens in the run time when we're faced with translating new material.

So, suppose we had manuals that already had an English phrase "click the tab" and a Spanish phrase -- which I won't read, since I don't speak Spanish -- that means "click the tab." The idea is that you parse those sentences with the grammar so that you know that the verb is "click," and in Spanish the verb is "hacer" or something, with "clic" as its object, and in English the object is "tab," and so on and so forth. You learn this, and then you learn word association, so that, ah-ha (Spanish word) must mean tab because I've seen it and can translate it. So, presumably, that's correct. And then I can remember that this fragment translates into the other fragment. And then we will store it away in our MindNet.

So, the idea is that from the example base we parse it. We break it up. We learn word associations. And we learn about fragment translations. Then, when the new text comes, we see we can assemble the fragments together but in a new way. So, suppose we have a related Spanish phrase that says -- again, I can't read Spanish very well, so you'll have to pardon me -- (speaking Spanish). So, there's a click and all of that as before, but it's about something slightly different. Once again, we parse it so that we know that hacer and click is still the verb thing, but there's a new object that you're going to be clicking on. We have fragments that we've gotten before, the one we showed before was click and it was the (Spanish word), but here we're going to have to do it on a text, and then we have little fragments that we've stored away from before in MindNet, and we reassemble the fragments into the new translation. And so you get "click the highlighted sample text." And so that's sort of the idea.
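The reassembly step can be sketched as a lookup over stored fragment pairs. This is a deliberate simplification: the system described above works on parsed sentences, while this sketch matches flat word sequences, and the Spanish fragments in `FRAGMENTS` are illustrative stand-ins rather than the phrases from the slide.

```python
# Fragment pairs as they might be learned from aligned manuals (hypothetical entries).
FRAGMENTS = {
    ("hacer", "clic", "en"): "click",
    ("la", "pestaña"): "the tab",
    ("el", "texto", "de", "ejemplo", "resaltado"): "the highlighted sample text",
}


def translate(sentence, fragments=FRAGMENTS):
    """Greedy longest-match reassembly of known fragments.

    Words with no stored fragment are passed through in angle brackets.
    """
    tokens = sentence.lower().split()
    longest = max(len(key) for key in fragments)
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate fragment first, then shorter ones.
        for length in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + length])
            if key in fragments:
                output.append(fragments[key])
                i += length
                break
        else:
            output.append(f"<{tokens[i]}>")
            i += 1
    return " ".join(output)


seen_before = translate("Hacer clic en la pestaña")
new_sentence = translate("Hacer clic en el texto de ejemplo resaltado")
```

The second sentence never appeared in the example base as a whole, but it can still be translated because each of its fragments did.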

We've actually applied this to a large database that we have of articles that are written to help our customers resolve problems on their computers. It's called our Product Support Services Knowledge Base, and it's a very large database. It changes constantly and is updated constantly. About 5 percent of those articles are hand translated. The most popular articles are hand translated into different languages, but it just isn't cost effective to translate all of those articles into different languages. But we've actually applied this machine translation technique to the entire database, and translated it and provided both a Spanish version and now a Japanese version that is available online, and you can actually look at the machine-translated versions if you go to Microsoft's Web site.

Now, as you can imagine, the translation quality is obviously still not going to be as good as a hand-translated article. But the thing that's very encouraging here is that the percentage of customers who could solve their issues using the knowledge base, which admittedly is not a really high number and is something we have to work on anyway, at least doesn't slip that much when you go to the machine-translated version. It goes from 57 percent to 55 percent, which shows that people are really able to get the general drift of what the article is about, and still solve their problem with the machine-translated database.

One final example, some work that we're doing that is really an attempt to contribute to science, and really isn't related to any of Microsoft's businesses directly. It's a very interesting one, and it has to do with doing an AIDS vaccine design in a rational manner. And the question is, can computer science contribute to designing an AIDS vaccine? You might think the answer is no, but it turns out, interestingly, it's yes. Computer science can contribute in very significant ways to something like designing an AIDS vaccine. And why is that? Well, the problem with designing an AIDS vaccine is that the HIV virus mutates very rapidly, and so there are many, many strains of the HIV virus around. And vaccines normally are designed to basically use maybe only a few strains, and so they don't work very well in the HIV case.

Well, it turns out that what you really want to do here is collect virus samples from lots and lots of patients over a long period of time, so you actually see the virus in its many, many, many different forms, and then you use machine-learning techniques to figure out what kind of a vaccine would likely produce an immunity to as many of those strains as possible. And the way to do that is that, it turns out, the way the human immune system works is that it recognizes little short fragments of amino acids that are present in the proteins on the surface of the virus. These are these things called the epitopes. Each one of these letters is the name of an amino acid, and the little lines are little spans, sort of short spans of amino acids that your human immune system uses to recognize that virus.

So, the problem here is that you want to come up with a sequence of amino acids, a sequence of letters, that matches as many different strains of a virus as possible. And to show you that this is kind of a hard problem, we're doing a really small toy example here, but it sort of illustrates what the problem is. These are actually from real data on the HIV virus. The letters, again, refer to amino acids, and at position 5 we have the following amino acids, and at position 105 we have some other ones. And what you see is, if you look at the pattern there, the initial pattern is a lot of the KGGKLD type pattern, and there's some of the RGGKLD pattern, and in the second pattern there's a lot of the ERFAVN and fewer of the DRFALN pattern. Well, conventionally the kind of thing that people have done is look for the most common pattern. Well, if you look for the most common pattern, then you get this consensus model here, which is KGGKLD and ERFAVN. Well, if you look up above, actually, you know, KGGKLD never actually matches up with ERFAVN, so you really get a lot of errors on that, and it's not a really good pattern match.

So, a more sophisticated approach might be to try to look for two sequences, and where's my arrow, two sequences, and you try to look for something in the front, on the front six that look pretty common, so there's the RGGKLD, and the KGGKLD, and then you look for two sequences on the back that are pretty common, and it turns out that DRFALN is really pretty rare. So the algorithm never actually picks that, and it only picks these two. And, again, the matches aren't very good. But using good machine-learning techniques, you get a really good match, or you can get both the RGGKLD, the ERFAVN, and the KGGKLD match with DRFALN, and if you look up there you see you get sort of both, all of the common six letter sequences in this lower type approach, and you get only two errors in the pattern.
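The pitfall with the consensus model can be reproduced on made-up data. In the sketch below, the strain counts and the extra `RGEKLD` variant are invented so the effect is visible, and the "joint" method simply counts which front and back patterns occur together. It stands in for the machine-learning approach mentioned in the talk and is not the researchers' actual algorithm.

```python
from collections import Counter

# Each strain is (six letters near position 5, six letters near position 105).
# Counts are hypothetical.
STRAINS = (
    [("KGGKLD", "DRFALN")] * 5
    + [("RGGKLD", "ERFAVN")] * 4
    + [("RGEKLD", "ERFAVN")] * 2
)


def regionwise_consensus(strains):
    """Pick the most common pattern in each region independently."""
    front = Counter(f for f, _ in strains).most_common(1)[0][0]
    back = Counter(b for _, b in strains).most_common(1)[0][0]
    return front, back


def best_joint_patterns(strains, k):
    """Pick the k most common (front, back) combinations that really co-occur."""
    return [pair for pair, _ in Counter(strains).most_common(k)]


def fully_matched(strains, patterns):
    """Count strains that match some chosen pattern in both regions."""
    return sum(1 for strain in strains if strain in patterns)


consensus = regionwise_consensus(STRAINS)
joint = best_joint_patterns(STRAINS, k=2)
consensus_hits = fully_matched(STRAINS, [consensus])
joint_hits = fully_matched(STRAINS, joint)
```

On this data the region-by-region consensus is KGGKLD with ERFAVN, a combination that no strain carries, so it fully matches none of the 11 strains, while the two co-occurring pairs fully match 9 of them.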

This is just a toy example. In the real HIV virus, you have to match these patterns over a very, very, very long sequence, and you've got many, many, many different strains of the HIV virus, and the vaccine itself, which should be this string here, can't be infinitely long, right, so you need to come up with a fairly compact pattern that finds little sequences and matches up with as many of the possible strains of virus as possible. And it turns out that machine learning is a very good technique to solve that problem. And we're working with a number of researchers at the University of Washington Medical School on exactly those problems.

So, I'll conclude here. I tried to give you a little bit of an introduction to MSR, about what we do, trying to move the state of the art forward, work with the outside research community, and then deliver that technology into the hands of our customers. And with that, I'll turn it back to Jim Gray.

Thank you so much. (Applause.)

JIM GRAY: So we have a very small lab in San Francisco. This is all of us. There are about six of us. We work on scalable systems and also on collaboration and media management. Many of you know the guy in the middle, Gordon Bell, and Tom Barclay, this guy here, will be in the demo area later talking about the TerraServer.

And what I'd like to do is just tell you a little bit about what we're doing. One of the projects we have ongoing -- and the projects I'm going to talk about are projects that emphasize collaboration, my sell of this thing or my part of this thing is to talk about the collaborative projects that we're doing.

So we've been working with the USGS. The USGS has a picture of your house, has a picture of my house. It has 1-meter photographs of the United States. It has topo maps of the United States. All of this stuff was sitting on disks in various places. We glued it together, put it on the Internet and let you poke around on it.

More recently the USGS has taken 1-foot, full color, natural color pictures of about 120 urban areas. Unfortunately, the Bay Area is not online yet. The data is coming. Your house will be there. But we do have Washington, D.C. and a few other places.

But the thing has been online, it's very popular. An interesting thing is about 2000, 2001 it became a Web service and now we get about a million hits a day to the Web service of people coming to the TerraServer and saying give me exactly the picture I want, so pointing at this spot, this latitude, this longitude, this size and this zoom level.

It's fundamentally if you're working on scalable systems it's a great laboratory to work on scalable systems. It's a real app, it's on the Internet.

What does it take to manage something like this and how can you drive down the costs of doing it?

The astonishing thing is that when we first started the TerraServer we went sort of from there to there. It was in round numbers $3 million or $4 million worth of iron. It's now probably 10 or 20 times more storage than we used to have. It now costs $100,000 for the whole thing.

And, in fact, all the storage lives here. This is one copy of the TerraServer mirrored, and here's a mirror of it, and here's a spare. So this is a 2N-plus-1 kind of architecture and we're moving towards a geoplex of this.

And Tom Barclay, as I say, will talk about this in the demo area.

The project I've been working on is sort of the TerraServer looking the other way. It's the celestial sphere, and astronomers are taking pictures of the sky and we would like to glue all of those pictures together into one great big telescope that has a picture of the entire sky, all spectral bands, all times, and is cross-linked with the literature. So this is a pretty major shift that's happening in the sciences where scientists are putting all of their data online and they're cross-correlating their data with the literature, so you have an ability to go from a question to the data and from the data to the literature or from the literature to the data, and it is a new way of doing science.

The particular thing I've been doing is working with this guy here, Alex Szalay, standing in front of his telescope, but the real telescope is this thing down at the bottom, which is where all of the bytes come from. No one actually looks at the bytes coming out of the telescope. It goes through millions of lines of code before humans look at the information.

So there's a Web site called the SkyServer, which externalizes much of the Sloan Digital Sky Survey's data. And we've been working on building that site. We were very shrewd this time. Rather than operating the site we got the astronomers to operate the site and we just help. Actually operating the site is a lot of work.

And one of the interesting things about this Web site is that it was actually initially designed with a lot of help from Curtis Wong, who's also going to be demoing in the room, and his team in the UI design for this. It was designed to be easy to use and, in fact, it was designed for education, to let kids have access to this information. And it turned out to be so easy to use that many of the astronomers have decided that this is a lot easier interface to use than the ones they've been using, and so there's just a general tendency to move across to this.

The next step after you take one of these things and put it online is to take all of the archives that are available and federate them together. So, in fact, what we did over the last two years is to federate about 15 of the archives from many different places, some of them in Europe, some of them in North America, and we're hoping to get some from Asia real soon. You can go to a portal and ask a question and it will take you to the literature, or it will take you to the many different datasets and glue together answers from those datasets and give them back to you.

And again the current thing we're doing is to expand this portal to have personal databases so that you can have your own personal workbook there, and to have batch scheduling, because some of the questions people ask can't be answered immediately. They take hours or days to run the queries, and so we have batch scheduling and a compute farm at the portal.

So those projects were mostly done with Johns Hopkins, the previous one with the USGS.

Another project we've been working on is to work with Caltech and CERN. CERN is building something called the Large Hadron Collider, which will come online in about 2007. It will produce about a gigabyte a second of data.

The people in Caltech and people actually all around the world want to get a copy of this information. How do they do that? Well, one way is Sneakernet. Another way is to buy a phone line and so they have a 1 gigabyte-per-second phone line or actually probably 1.2 gigabytes-per-second phone line from CERN to Caltech, and we are working on moving the bytes from CERN to Caltech using Windows and TCP/IP and things like that. And my job is, in fact, to move it the last meter from the computer to the disks.

So it turns out that there's a whole bunch of guys at Caltech and at CERN who were really good at networking, and there are also people in Redmond who are very good at networking, but they were having a heck of a time moving data to and from disks very fast. So I've been working with them on improving that. Our goal, of course, is to move it at a gigabyte a second from CERN to Caltech. To give you a sense, that would be your typical DVD moving in like three seconds.
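As a rough check on that figure, here is a back-of-the-envelope sketch. The 4.7 GB single-layer DVD capacity and the use of decimal gigabytes are assumptions that are not stated in the talk; the 1 and 1.2 gigabyte-per-second rates are the ones mentioned above.

```python
# Back-of-the-envelope transfer arithmetic.
# Assumption (not from the talk): a "typical DVD" holds 4.7 decimal gigabytes.
DVD_BYTES = 4.7e9
SECONDS_PER_DAY = 86_400


def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time needed to move size_bytes over a link sustaining rate_bytes_per_s."""
    return size_bytes / rate_bytes_per_s


def bytes_per_day(rate_bytes_per_s: float) -> float:
    """Data volume accumulated in one day at a sustained rate."""
    return rate_bytes_per_s * SECONDS_PER_DAY


for rate in (1.0e9, 1.2e9):  # the two link rates mentioned in the talk
    dvd_s = transfer_seconds(DVD_BYTES, rate)
    daily_tb = bytes_per_day(rate) / 1e12
    print(f"{rate / 1e9:.1f} GB/s: one DVD in {dvd_s:.1f} s, {daily_tb:.1f} TB per day")
```

Under these assumptions a DVD's worth of data moves in a few seconds, and a sustained gigabyte per second adds up to tens of terabytes per day, which is why the last meter from the computer to the disks matters.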

But the fact is, this has to be ready in the year 2007 and it has to be brain-dead simple, it has to be automatic. If there's any problem, it has to self-heal. So those are the problems we're struggling with over the next year or two.

And with that, I'd like to introduce Mike Schroeder, who, as I said, is the assistant director here in Silicon Valley, and he'll tell you about the things that are going on in Silicon Valley. Thank you. (Applause.)

MIKE SCHROEDER: Thanks very much, Jim. It's always a pleasure to be introduced by you. You've been doing it for many years now, as I recall.

So my job is to finish up the presentation here today and tell you a little bit about the newest member of Microsoft Research, the lab that's here on campus. When Rick came to us, Roy Levin and me, in the summer of 2001 and asked us if we'd be willing to start a distributed computing lab for Microsoft here in Silicon Valley, it actually took about three microseconds to say yes. The chance to build up a new lab in general doesn't come along very often, and this seemed like a really great environment to do it in.

It was pretty easy to decide to have it here, and obviously the very first thing that we needed to do was to build up from a staff of two to a bigger group, and we've actually been doing pretty well at that. We've gotten up to 25 PhD researchers at this point who come from a very broad range of technical expertise.

And this follows along with a philosophical idea we have about how this lab should work. The idea is to have a very collaborative approach to solving problems rather than have a lab that focuses on a single set of technologies.

So if you look at the list up there, we span distributed computing from theory to practice, we like to say, and whenever a new problem comes along we often can get together a multi-disciplinary team to approach the problem and sometimes come up with really interesting results.

One question I'm often asked about this is why Silicon Valley, and that's actually a pretty easy question to answer I think. Once Microsoft Research figured out that having remote labs was operationally feasible, and that was well proved by the existence of the San Francisco lab, the Cambridge lab and the Asia lab, then Silicon Valley becomes a really natural place to put a lab.

As we said earlier, one of the prime missions of Microsoft Research is to advance the state-of-the-art, and the way you do that is by interacting ultimately with the research community and the world at large, and obviously Silicon Valley has been a hotbed of innovation in computer technology for many, many years, so this is a way to increase the size of the interface between Microsoft Research and the Valley, and to get a little bit of Valley culture transplanted into the organization up in Redmond and in the rest of the world.

Secondly, Microsoft already had a campus here. There are quite a few product groups on campus and having an arm of research down here gave us a chance to interact with those groups more effectively as well.

And finally, there seemed to be a lot of expertise in distributed computing in the Valley, and we were hoping that, by building a lab here focused on that, we could get some of that expertise into the Microsoft Research system.

The promise of interacting with the Valley technically I think is working out pretty well. By being in the Valley here we're able to collaborate with the universities in a much easier way, we're able to work with the other laboratories in the Valley to share symposia for our summer research interns, we're able to attend retreats that the various universities have about technical areas and participate generally in the culture and, in fact, we've even gotten to the point where we're starting to guest teach at some of the universities around here and that's been a lot of fun too in helping us to learn more about what's going on at the universities.

This is a list of the full set of projects that we're working on right now, and I put it up here mainly to give you an idea that we're covering quite a broad range of topics, and you can see in here the theory and practice aspects coming up together.

The top three up there I'm going to talk to you a little bit more about before we let you go look at the demos, and, in fact, they're also subjects of the demos next door, so when you get done if you want to learn more about this stuff it should be easy to buttonhole the inventors and ask all the questions you want to.

So when we think about how we impact the world, we like to divide it into three levels. First of all, there's the impact on the state of the art, which is the prime mission of Microsoft Research. Secondly, there's the impact on Microsoft products directly in the sense of features that users will notice. But perhaps more so than the other labs in Microsoft, we also spend a fair amount of our time impacting base system capabilities. And I know that these aren't quite as visible to normal users but they can make a huge difference over the long run in the reliability or the cost-effectiveness of computing.

I'm going to give you some examples of each of these areas.

Now, Dan already, I think, did a good job of explaining to you why participating in the open research community is a really good idea, and we certainly follow along with that model and have a lot of external research collaborations with the universities around here. We try to be very open about the work that we're doing, except in the cases where we're directly helping a product group get ready for the next release. We measure our researchers on their ability to publish high-quality technical results, and in general try to be a responsible participant in that community.

At the next level down, we occasionally do have the opportunity to do work that really impacts the way that users see computing and one of the things that we've done recently in this area is to do some statistical analysis of the dynamic Web. The way this worked was that about a year ago we put together a research crawler and put it to work looking at about a billion pages of the Web. And we went back and looked at those billion pages over and over again, maybe once a week for a period of several months, and tried to use this to figure out how that part of the Web was changing as time went by.

And then we did a lot of statistical analyses of the data we were getting, trying to look for statistical anomalies and hoping that those would correspond with interesting things about the Web, like, for example, all of the pages that show this particular statistic are probably spam or probably pornography, and that's a very useful thing for a Web search service to know about because those are probably pages that you don't want to see when you put requests into the service.

And this data was then published in the major Web conferences to help the world understand what the Web is like and how it's evolving, and also the data was used internally to help us inform the MSN Search people how to do a better job, help work with them to make these systems do more of what people really want.

And here are a couple of examples of the kind of things they found. In this first thing, they were keeping track of how many different Internet names map to the same address. And on this graph you have the number of IP addresses in the vertical axis, and that's a log scale so it's going up to 10 million, and on the horizontal axis you have the number of host names mapping to a single address. And you'll notice way over there on the right-hand corner there is a single IP address that has 10 million different names that map to it. That's probably not a page you want to look at.

QUESTION: What is it?

MIKE SCHROEDER: It's part of a porn site. (Laughter.) No, actually that's not right. I believe that one is every word possible in the first position, particular domain name second, and com third, is what it is.

Here's another one. You can look at the number of URLs that come in from other Web pages to a particular Web page and again you notice some strange outliers are along the blue area on the right-hand side of the graph, and these again represent probably some kind of strange clustering that are associated with pages you don't want to look at.

And in the bottom case we have exactly the same phenomenon of 10 million incoming links to a particular page, and it's just not something that's going to happen naturally.
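The host-names-per-address statistic can be sketched roughly as follows. This is only an illustration, not the research crawler's code: the record format, the function names, the threshold and the sample data (which uses reserved documentation addresses and domains) are all assumptions.

```python
from collections import Counter, defaultdict


def names_per_address(records):
    """Map each IP address to the set of distinct host names that resolve to it."""
    names = defaultdict(set)
    for host, ip in records:
        names[ip].add(host.lower())
    return names


def histogram(names):
    """For each k, count how many IP addresses have exactly k host names."""
    return Counter(len(hosts) for hosts in names.values())


def suspicious(names, threshold):
    """IP addresses with at least `threshold` host names (threshold is hypothetical)."""
    return sorted(ip for ip, hosts in names.items() if len(hosts) >= threshold)


# Made-up sample: two ordinary hosts plus one address with many generated names.
records = [("a.example.com", "192.0.2.1"), ("b.example.com", "192.0.2.2")]
records += [(f"word{i}.example.net", "203.0.113.7") for i in range(1000)]

names = names_per_address(records)
hist = histogram(names)
flagged = suspicious(names, threshold=500)
print(sorted(hist.items()))
print(flagged)
```

Plotting the histogram on a log scale would give the kind of picture described above, with the flagged address sitting far out on the right. The same counting applies to incoming links per page if the records are (source URL, target URL) pairs.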

OK, another area where we manage to do some work that could affect the way that consumers use computers is in the area of shortest path routing. This is the thing that goes on inside of a map service to generate driving directions. And in general right now the state-of-the-art is that you have to use a lot of carefully tuned heuristics to get the job done in a reasonable amount of time, and as a result you don't always get the actual shortest path, and certainly you don't get information that represents bridge outages, traffic congestion, et cetera.

But the problem is that a complete, standard, shortest-path-routing algorithm takes way too long to run and uses way too much memory to use in these services. So one of our algorithm guys sat down and scratched his head about this and figured out a way to deliver quite a bit of a performance improvement. The amount of improvement that you get varies with the size of the network that you're routing but, for example, on the road network of the state of California, he can find shortest paths 20 times faster than with the standard algorithm.

And the result of that is it's probably possible, although we haven't actually managed to get this far yet, to put this real shortest-path-routing algorithm into a PDA, and it may even be possible to make it responsive to real-time reports about traffic delays and road outages and so forth.

And let me give you an example of how this actually works. So how many of you can figure out what this is a map of? What we've got here is the Bay Area, right? Here's San Francisco and the South Bay. And what's gone on in this particular representation is we've taken the road network and represented it as a graph and every intersection in that road network down to the dead-end street in your neighborhood is a node in the graph, and every street or highway that connects those nodes is a link in this graph. And then for the purpose of representing it on the page here each dot represents every quarter mile or so of territory in which some node of this graph appears; it's just an easy way to see it. Think of it as kind of like an impressionistic painting of the Bay Area road network.

I then clicked on it and said, all right, this weekend I want to go visit the California State Railway Museum in Sacramento, so I told it to start here at the lab at the bottom left and to end up in Sacramento at the Railway Museum and told it to go.

And this is the result I get from running the standard algorithm. Now, it certainly is going to get the right shortest path because both of these algorithms are correct algorithms that generate the right result, but in this case the green dots show you all of the places that it had to visit in order to figure out what the shortest path is and that's the reason that this algorithm is too expensive. You have to page all of those things into your PDA from the back memory, and you have to investigate them all, and the result is that in this particular case it took like 297 microseconds to do that. It would take quite a bit longer if you had the entire U.S. in there or trying to figure out how to get from New York to San Francisco.

With the new algorithm things look a little bit different, same starting point, same ending point, same representation but notice that the green now is much better under control. We really manage to get by with visiting quite a few fewer false paths in order to construct the real path. I also find it kind of amusing that it got a little bit confused about where to go exactly the same place I do when I'm going to Sacramento -- (laughter) -- which is Livermore, right? Do you want to go over the pass, or do you want to go up through Walnut Creek?

Now, you shouldn't read too much into this particular route because notice it went out Highway 4 and up through the delta, and we know that that's not the right way to go, but that's simply because the representation of the network that's in this model is not very accurate; it uses a very simplistic model of driving time, which is distance divided by speed, and there's a lot more to take into account than that to do it correctly. But I think that this has a lot of promise and we're hoping that it can impact the way that these things work in reality in the next couple of years.
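The difference between the two pictures can be illustrated with a small sketch. The talk does not say which algorithm the lab developed, so this uses a well-known stand-in: Dijkstra's algorithm as the standard method and A* search with a straight-line lower bound as a goal-directed variant. The grid "road network", the unit speed, and all function names are assumptions made for illustration; edge cost follows the simple model of driving time as distance divided by speed.

```python
import heapq
import math


def grid_graph(n, speed=1.0):
    """n x n grid of intersections; each edge costs distance / speed (distance = 1)."""
    graph = {}
    for x in range(n):
        for y in range(n):
            nbrs = []
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < n and 0 <= ny < n:
                    nbrs.append(((nx, ny), 1.0 / speed))
            graph[(x, y)] = nbrs
    return graph


def search(graph, source, target, lower_bound=None):
    """Return (shortest travel time, number of nodes settled).

    With lower_bound=None this is Dijkstra's algorithm. With a lower bound on
    the remaining travel time that never overestimates, it is A* search.
    """
    h = lower_bound or (lambda node: 0.0)
    dist = {source: 0.0}
    settled = set()
    heap = [(h(source), source)]
    while heap:
        _, node = heapq.heappop(heap)
        if node in settled:
            continue
        settled.add(node)
        if node == target:
            return dist[node], len(settled)
        for nbr, cost in graph[node]:
            new = dist[node] + cost
            if new < dist.get(nbr, math.inf):
                dist[nbr] = new
                heapq.heappush(heap, (new + h(nbr), nbr))
    return math.inf, len(settled)


graph = grid_graph(40)
source, target = (20, 20), (35, 30)


def straight_line(node):
    """Straight-line distance to the target at unit speed: a valid lower bound here."""
    return math.hypot(target[0] - node[0], target[1] - node[1])


time_plain, visited_plain = search(graph, source, target)
time_goal, visited_goal = search(graph, source, target, straight_line)
print("standard search:", time_plain, "settled", visited_plain)
print("goal-directed search:", time_goal, "settled", visited_goal)
```

Both searches return the same travel time, but the goal-directed one settles fewer nodes, which corresponds to the smaller green region on the second map. How much is saved depends on how tight the lower bound is, so the numbers from this toy grid say nothing about the 20-times figure quoted for California.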

Finally, here's an example of an area where we're working on some of the guts of the system in a way that I hope can make an impact on the costs of providing large scale systems in the future.

As you saw from what Jim was saying earlier, you can now build very large capacity storage and computation systems out of rack systems at a very reasonable price. Here's an example of a configuration that's actually incredibly heavy on CPU cycles and RAM with a terabyte of disk in each box and in bulk if you buy enough of them you might be able to buy these for $5,000 each. Now, if you don't want so many CPUs and you want fewer gigabytes of RAM you might be able to buy a box like this for $2,000 or even $1,500 if you get to balance it to your job.

And you put these things together without using a storage area network, which is a source of complexity and cost, and instead perhaps stick it all together with gigabit Ethernet, a nice way to go, cheap.

And at least we understand how to scale the physical structure incrementally to many thousands of such nodes. For example, if you build a system that had 1,500 of these nodes you'd have a pretty capable computer system, 5,000 SPECints, 12 terabytes of RAM, 1.5 petabytes of disk storage, 6,000 disk arms. You ought to be able to do quite a bit of computing with that and you probably have invested about $8 million in this data center.
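The totals quoted above can be sanity-checked with a few multiplications. The per-node figures below are assumptions inferred by dividing the quoted totals by 1,500 nodes (8 GB of RAM, 1 TB of disk and 4 disk arms per box); only the $5,000 price and the totals themselves come from the talk.

```python
NODES = 1_500

# Assumed per-node configuration, inferred from the quoted totals.
PER_NODE = {
    "ram_gb": 8,         # 12 TB of RAM / 1,500 nodes
    "disk_tb": 1,        # 1.5 PB of disk / 1,500 nodes
    "disk_arms": 4,      # 6,000 disk arms / 1,500 nodes
    "price_usd": 5_000,  # bulk price mentioned in the talk
}


def cluster_totals(nodes, per_node):
    """Aggregate capacity and cost for a cluster of identical nodes."""
    return {
        "ram_tb": nodes * per_node["ram_gb"] / 1_000,
        "disk_pb": nodes * per_node["disk_tb"] / 1_000,
        "disk_arms": nodes * per_node["disk_arms"],
        "cost_musd": nodes * per_node["price_usd"] / 1e6,
    }


totals = cluster_totals(NODES, PER_NODE)
for name, value in totals.items():
    print(f"{name}: {value}")
```

With these assumptions the hardware comes to $7.5 million, consistent with the "about $8 million" figure once networking and racks are allowed for.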

Well, what's the problem? Well, the problem is software. This kind of a system actually can be made to work pretty well if you have a problem that naturally partitions, for example, Web indexing and search, e-mail, data mining or simulation. I was surprised to learn that all of the car companies have pretty much given up crashing cars into concrete walls for their initial work, and they do it with simulation on farms of machines like this.

But in order to make that actually work you have to do a significant amount of application-specific development, and this I think limits the utility of this approach to people who are willing to make those kinds of investments.

So an interesting research problem, one that we're working on, is how to make it more easily usable for a much broader range of problems and in this regard what I think needs to be done is we need to build up the core software that enables you to stitch these things together and to more quickly stop having to think about the way the data is partitioned.

I think there are several important elements to that, and one of the most important is getting rid of the idea that with our 1,500 computers there's going to be 1,500 separate file systems. That's a pain. And so you need to virtualize the storage to integrate it and to make it all look like one storage system without giving up the performance gain that you get from having lots of separate storage systems that are close to the computers. And it's a hard problem that I personally have been involved in working on for about the last ten years and we have a new project going on in the lab right now where we're taking yet another one and I think we're going to get quite a bit further this time.

Secondly, in order to stitch the system together you need to have some global state that's correct, even though lots of machines are coming up and going down all the time, and there's a set of distributed set maintenance algorithms and also global locking algorithms that keep these things synchronized and running that have been developed over the last ten years. We're putting some more refinements on them, we're trying to figure out how to integrate them into the package here to make it easier for applications programmers to build a system on top of this.

Finally, perhaps the most important area is the area of automatic provisioning and error detection and re-initialization. When you've got a system this big, computers are going to break and you're constantly going to be adding computers to the system to make it bigger as time goes by. And that's very, very labor intensive if you don't get the thing organized correctly, but I think we see a way to make it possible to come in on Monday morning, pull all of the computers out of the racks that have the red lights on, put new computers back in their slots and they'll take over their job automatically and every time you need to add the capacity to put another rack in you never have to think about it. That would be awfully nice if we could figure out how to do that and if we could it would produce significant cost reductions for this kind of a service and I think that would be a good effect on everybody. Anytime you can bring the cost of doing the basic business of the Web, for example, down, then there's a lot more business that can get done.
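One common building block for this kind of automatic error detection is a heartbeat-based membership table. The sketch below is a generic illustration, not the lab's design: the class, the timeout value and the node names are all hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class Membership:
    """Tracks which nodes look alive, based on the time of their last heartbeat."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record a heartbeat; a node that was never seen before simply joins."""
        self.last_seen[node] = now

    def live(self, now):
        """Nodes heard from within the timeout window."""
        return sorted(n for n, t in self.last_seen.items() if now - t <= self.timeout_s)

    def failed(self, now):
        """Nodes that have been silent for longer than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)


members = Membership(timeout_s=10.0)
members.heartbeat("rack1-node1", now=0.0)
members.heartbeat("rack1-node2", now=0.0)
members.heartbeat("rack1-node1", now=8.0)    # node1 keeps reporting, node2 goes quiet

live_at_15 = members.live(now=15.0)
failed_at_15 = members.failed(now=15.0)

members.heartbeat("rack1-node3", now=16.0)   # a replacement box announces itself
live_at_17 = members.live(now=17.0)

print(live_at_15, failed_at_15, live_at_17)
```

A real system also has to agree on this view across machines, which is where the global state and locking algorithms mentioned earlier come in; a single in-memory table like this one sidesteps that problem entirely.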

So that's the end of my part. All of the things that we've talked about today are available on research.microsoft.com. I don't think there's actually a WWW there but it probably works anyway.

Now, what we'd like to do next is to get Dan and Jim back up here and we're happy to answer some questions for you if anybody has them.

(Applause.)

QUESTION: With respect to spam filtering, I was interested in whether or not you had been doing work in personalizing spam filters, that is, that one person's spam is another person's crucial data and how can I personalize my spam filter to get my view of the world.

DAN LING: Well, so the machine-learning techniques that we use could certainly be personalized. However, it turns out that a lot of spam people now want to remove at the inbound gateway. For example, corporations really don't even want to distribute a lot of this mail to their individual employees, and so there's different sort of levels of spam filtering, shall we say. There is some level that is usually built right in the inbound gateway of a spam filter and there can be other personalized ones that can learn your individual preferences that could be on your client.

JIM GRAY: I think Heckerman and Horvitz both have done quite a bit of work on doing a base of that.

DAN LING: Absolutely.

JIM GRAY: So there was a filter built into Outlook that was from our machine-learning group that was client-side filtering.

DAN LING: There is the current one now.

JIM GRAY: That learns and watches your behavior and as you classify things it learns from that and classifies future messages based on that.

QUESTION: What group is that in?

DAN LING: It's our machine-learning group that does a lot of Bayesian statistical analysis, and they use their techniques to look at how to build filters of this kind.

JIM GRAY: David Heckerman would be the man to look for and Eric Horvitz also. And they also do summarization so that you can classify messages to be forwarded to your cell phone and they summarize the message and just send the header, so it pivots on who it's from and what the entropy of the words is in it and how big it is.

QUESTION: (Off mike).

DAN LING: So this particular project does not, no. There are some other projects that have actually been looking at natural language, so, for example, identifying tags in e-mail and flagging that using natural language techniques.

QUESTION: I'm curious how Microsoft Research balances the need for confidentiality and applying for patents with the need to interface with the research community and be open and collaborate with others.

DAN LING: So clearly we are a corporate research lab, and we do understand that one of our core missions is obviously to work with our product groups and take that technology that we produce and put it into our products. Very often we can do that in conjunction with, for example, publishing. For example, what we do is we really ask our researchers to think about the appropriate times for doing certain things.

So, for example, early in a research project, for example, there are times when the research might be fairly pre-competitive, if you want to call it that, when the ideas are just basically at an idea state and a lot of discussion takes place externally and there's not that much concern with intellectual property. Later on there might be more concern, and clearly when you're down to the implementation and the scheduling and delivery of product then you are going to be a little bit more proprietary about things. So I think that one way to think about it is certainly on the timeline of maturity of a particular project.

QUESTION: You mentioned in your presentation there are several labs, San Francisco, Silicon Valley and Beijing. So what's the difference among those labs? Is there any focus or priorities? Thank you.

DAN LING: Our approach to labs has been very much based upon a little bit of sort of looking for opportunities that the different locations present and so there's an opportunistic element to that. So, for example, as Mike mentioned, we started the lab here in Silicon Valley focused mostly upon distributed systems, for example. In Cambridge, Roger Needham was the founding managing director, so there was a strong emphasis on systems and security, for example. But then as time goes on the labs sort of evolve and again that depends a lot on what outstanding researchers we can attract at the lab.

But in general I would say that there are some differences, although they're not so much as really planned differences. So, for example, in the lab in Asia we do a lot of work on Asian language related issues with speech recognition, natural language, search and those kinds of things. In our lab in Cambridge, for example, we do a lot of work on programming languages kinds of things and software development techniques because there's a lot of strength in that community in Europe.

QUESTION: A question for Michael Schroeder. Could you tell us a little bit more about that you said there's a new project on the modular data center? Is there a demo on that today or, if not, could you describe a little bit more the targets of the new project?

MIKE SCHROEDER: There is no demo on that today. The work that we're specifically doing right now is the next version of virtualized storage system, called Boxwood, and there's a Web page about that. I'll be happy to talk to you a little bit afterwards and give you some of the details about what's new about this approach compared to the approaches that were taken in the past.

QUESTION: One of the figures of merit that's sometimes used for R & D efforts is the number of patents that result. I'm wondering if you can disclose that and tell us how many you hold proprietarily for internal use by Microsoft and how many you've made available to the public.

DAN LING: So how many we've made available to the public? I guess we do patent as a corporation, and I think there are some publicly available numbers as to how many patents have been issued to Microsoft. It has clearly been our intention and something that we've made known quite publicly to make available our intellectual property portfolio for licensing to lots of additional other companies, so that's something we're clearly quite interested in doing.

QUESTION: With your collaboration with universities and other such institutions, are you focused solely in the geographics around where your current labs are? For example, if you discover that it is a very good critical mass for research being done in Timbuktu and the world, will you reach out and collaborate over long distance?

MIKE SCHROEDER: The answer is that we're happy to collaborate anywhere there's good people to collaborate with. And, in fact, in the case of our lab we have collaborations that are all over the world.

The point about the local area, though, is that certain kinds of collaborations are a lot easier to do if the guy is down the street, and in particular the idea of perhaps teaching a course at an institution is something that you can only do with a reasonable commitment of time if it's down the street. But in general our collaborations are all over the world.

DAN LING: And in Jim's case, for example, with Johns Hopkins.

JIM GRAY: Yeah, I mean, so I'm working with Johns Hopkins, which is in Maryland, and the USGS, East Coast. There are some people down here in Menlo Park as well, George Lake will be actually here, and the high-speed networking work is between CERN and Pasadena. These are not Timbuktu, any of them, but they are quite remote from San Francisco.

Thank you.

