Remarks by Bill Gates
Microsoft Corporation
SIGMOD 98
(Special Interest Group on Management of Data)
June 2, 1998
Seattle, WA
MR. GATES:
Good morning, and thanks for coming to Seattle. It's great to have you here. I hope you have a fun visit. And I actually hope some of you will stay here in Seattle and help us with all the research challenges that we have that I'll talk about a little bit this morning.
Seattle is actually where I got started in computing. A high school out in the north end of the city had an ASR-33 teletype that connected up to a GE timesharing computer. Now, the problem was that they charged lots and lots of money for every minute of connection, and one of the teachers accidentally ran an infinite loop and used up the computer budget in a few minutes. The teachers were afraid to use this device from that point forward, and so a few students got to kind of dominate what was going on there.
A few months later, we got Digital to loan us a PDP-8 so we could have a computer that we could program ourselves and not have to worry about the charges, and that's when we first started writing interpreters and thinking about the possibilities of software.
The next big milestone for me was when Paul Allen, co-founder of Microsoft, noted the exponential improvement that was taking place in chips. He pointed out that with all that hardware power the thing that really counted was software, and that maybe we should have a company that was dedicated to software. It's kind of a strange idea. We didn't do anything about it for about four or five years, until we were both back on the East Coast and the Altair kit computer came out. That's when we said, boy, this is really going to happen, let's go off and do software, because that's going to make all the difference to whether we can unlock all this power and do great things for everyone who wants to use the computer as a tool.
Well, I'd say that same imperative exists today, that the software industry has grown dramatically. The database part of that is a very significant part of it. It's doing some wonderful things. But with all this power we're getting, the real critical area, the need for better thinking and better research really is in the software realm.
These hardware guys, it's amazing what they're giving us. When you buy a personal computer today, it's hard to buy a machine with less than four gigabytes of storage, simply because the scale economics of making that hard disk are such that a two-gigabyte drive, or a one-gigabyte drive, doesn't really save them much manufacturing cost. And if we take that out for the next five to ten years, then you're going to have hundreds of gigabytes of local storage.
And what you're going to want to do is keep information of all types. You're going to want to keep your photographs, the music that you're interested in, even video clips, all should be on that device. And yet making it easy to find those things and then to relate those things goes beyond any software products that are out there today, whether it's the CPU speed, the move to a 64-bit address space, or the very exciting work that's taking multiple processors and putting those together in symmetric multiprocessing machines, or perhaps most importantly, clustering those together in configurations with very little hardware sharing. That's where we see an opportunity to really break through any of the performance ceilings that might hold us back in these incredible scenarios. So, hardware is not going to be the problem. They're going to give us lots of room to do all the exciting things we want.
I'm really a big believer in research and the payback that it has. And I'm actually a little surprised how little the software industry has invested in research. The payoff is very, very clear. The need is very, very clear. In the information age, this ability to process information, to let people have access to the things they care about, it is the most exciting problem of all. And I think the field that you work in will be particularly exciting in the next decade. The contributions have been fantastic and out there on a very high percentage of PCs today. There are products like Microsoft Access and many others that use relational technology that came out of database research. The relational model, you know, we all take it for granted now, but it's been incredibly valuable, the rich query processing. And now moving into multimedia support and transaction capability, so millions and millions of people are benefitting from the research that has come out of your area. You know, particular companies like IBM, institutions like Wisconsin and Berkeley, and many others, have made these contributions that have allowed the software industry to put these products together.
However, there really can be no doubt that there's a lot left to be done. Information on a PC today is very difficult to find. We've got it stored in many, many different ways. We have different name spaces, different caching, different security, and those things lead to not only difficulty in developing applications, but it shows through to the end user. You know, how many search commands do you have to know. How do you relate data, say, that's in your schedule to data that's in a typical database? Or even more difficult, say you have two Internet sites, how do they agree on what kind of information they have, how do they transform that information so that they make sense to each other, how can that be done in a way that you don't have to get experts involved to connect those things together? So, we have a lot of areas where we need further advances.
I put ease of use at the top of the list. When you look at these very big markets and reaching out essentially to everyone, because everyone cares about data, we have a lot left to do. Data mining to find the interesting information, natural language which to date has not been a popular query technique. I think it will become very mainstream. You know, I don't think in the future we'll be using URLs to find information. I think primarily we'll use natural language. Unifying the different stores, there's going to be lots and lots of stores out there, stores that are specialized for large size, or different data types, and yet we need to create the illusion that all that information is in one big logical store, and a lot of work could be done there, in particular in the PC domain. Componentization will be an important part of that. Rich schema at a semantic level, I don't just mean at the column names and things, I mean things like being able to take scheduling information at a semantic level and match it up, even though the storage representations are quite different.
And then, finally, scalability and reliability are always very important, and the Internet is driving us to incredible new levels of need. We see that some devices will continue to be disconnected from time to time, things like portable machines. And so replication technology becomes very, very important. Well, users today find even so-called "low-end" databases very intimidating. A lot of Access users, of which there are about eight million now, a lot of them just do single table databases. Even the ones who do multiple table databases, they're often confused about how to navigate between one and the other. I've got here the menu bar that you see when you're running Access. And, you know, we feel great, we've got all those commands down in there, but there are still way too many concepts, and sort of hieroglyphic things that we're exposing people to there. So, it's amazing that we've got that many people using databases and taking advantage of them, but we're really not going to get to the next level with that kind of approach.
A lot of users today find the true databases complex enough that they simply go into either the word processor, with the table-type capabilities, or into the spreadsheet, which I'd say is a little more typical, and use that as their way of structuring data. And, of course, you get a huge discontinuity because, as you want to do database-type operations, the spreadsheet isn't set up for that. And so then you have to learn a lot of new commands and move your data into another location. What we'd like to see is that even if you start out in the spreadsheet, there's a very simple way then to bring in software that uses that data in a richer fashion, and so you don't see a discontinuity when you want to move up and do new things. But that's very easy to say that. It's going to require some breakthrough ideas to really make that possible.
The world of PCs is growing at an incredible rate. This year, about 100 million PCs will be sold. That's a serious number, because we've got five billion people, we only have to keep this up for a little over a decade, and we'll be out there to half the people. The PCs are getting dramatically better all the time, the price is coming down. The richness of what's available, the percentage who connect to the Internet is going up. So the PC industry is very healthy.
On top of that, though, we've also got a variety of other devices, hand-held devices that will use wireless communication, things like the Palm Pilot, or some of the Windows CE devices that Microsoft is involved in creating together with hardware companies like Hewlett-Packard. We've got the idea of an intelligent TV set. The set-top box that lets you get at simply analog video today will be a very advanced computer. Not only will the video flow in digital MPEG compressed form, but you'll be able to interact at any time. If you see an ad you're interested in, if you want to chat with your friend who is watching the same show, if you want to play a multiplayer game, the chips in that box will be far more powerful than what any PC makes available today. And so, all of those TV sets will join PCs as devices connected up to the Internet.
Some of these devices will not have keyboards. Some of them will respond to voice commands. Some will have a flat surface where you can use a pen, and so they'll be able to understand handwriting type commands. And a significant number of them will have video cameras, cheap CCD cameras, with a little bit of software logic that will allow them to tell not only who is using the device, but sort of what they're paying attention to, whether they are confused, what kind of gestures they are making, and be able to interface in a way that's more natural than just forcing everything through the keyboard.
We have different guesses inside Microsoft in terms of how quickly non-PC devices connected to the Internet will be greater in numbers than PCs connected to the Internet. Thinking about all those cameras, and TVs, and those pocket-type devices, there's no doubt that sometime in the next three to ten years, we'll cross over that threshold. And, so the central nature of the Internet, and the moving of information in digital form instead of on paper will really become sort of commonsense in terms of how people do things.
Now, we've got some real problems in making this happen. Just hooking the devices together and getting used to them is very difficult. In fact, in the popular media there's always stories about the challenges that people are having with electronic devices. We've put together a short video clip that I think highlights some of the problems people are having with today's devices. Let's go ahead and take a look at that.
(Video clip.)
MR. GATES: So there is still a lot of work to do, before this can all be mainstreamed. I'm sort of thinking what's the database equivalent of some of those things? And probably a close analogy is the work a database administrator has to do to keep track of everything, and keep the system tuned up behind the scenes. And I thought it might be interesting to show a little of the work we're doing in that area to try and automate some of the things that have been manual in the past. So I'd like to ask one of our database research people, Surajit Chaudhuri, to come up and just give us a quick look at his work on auto administration.
Hi, Surajit.
MR. CHAUDHURI: Good morning, everyone. I'm going to talk a little bit about our auto admin project and show you a small demo. This is really taking what Bill is talking about, the ease of use, and examining that in the context of databases. In particular, I like the analogy with auto-focus cameras, you know, 10-15 years ago, I remember that taking a photograph used to be much more complex than it is today. And what we hope is that the technology we are developing eventually will take a step closer to the vision of making it all auto administered and self-tuned.
So what we have done is we've developed a tool for selecting the indexes for a database, automatically, by looking at its workload. So it monitors it and is able to recommend the indexes that serve the database. And I did this work in close collaboration with the SQL Server Group and with -- (inaudible) -- research. So let's start it up. And I'm going to first go to the SQL Server Enterprise Manager, just to familiarize you with the database against which we are going to do this tuning. So the focus of the demo is taking a small database, which has a set of tables, and two of the tables are relatively large. And on our table set, we'll see that the system would automatically select indexes appropriate for this database, and focused on these tables.
Okay. The next piece that we need to look at, at this point, is the SQL Server Profiler. What it allows us to do is to trace the events against the database. And so as I perform work against the database they get logged. And this will form the basis for the Index Tuning Wizard to look at this history and come up with a recommendation that suits the behavior of this database's load. And now the final piece that I want to show, of the tools that I need to make use of, is the SQL Server Query Analyzer. So this is an interface through which you can pose queries to the system. And right now, you know, we are using SQL to -- we have a set of SQL statements here, that have -- I have six statements here, which are over the database that I showed. So in line with your workload and what you want to do, you'll find a set of indexes which are sort of off of this workload.
So what we'll do is that we'll go ahead and execute this workload. And in a sense you can see that these queries have aggregation and sum of -- (inaudible) -- and so on, which indicate a somewhat decision-support kind of flavor. And as we look at these queries they get, in some sense, logged in the databases -- in the workload that we see here. And you can see that the duration is also recorded.
The first query, actually, will take about six-and-a-half minutes -- seconds, sorry. And it hasn't shown up here, yet. So in any case, so what I'll do now, the next thing is, let me just go ahead and go to the query analyzer, I'll just re-execute it one more time, so that you have the first query's duration come up as well. So you can see now, the first query which, at that point, you didn't see it, it was about six-and-a-half seconds. And among the queries that we have here, this is the one that has a relatively longer time, so I'll just try to tune this particular query. So this is what I'm trying to show you, what I want to show you is two ways in which the tool can be used. One is to fix what we call problem queries. Everything else is fine. We got one problem query. How can I help that particular query.
So let me pause the work -- the tracing at this point, and go back to this query, this was the first query which was taking time. And I'll just highlight this query, and tell the system, why don't you do an index analysis, and tell me what indexes I can possibly add to make things better. So it comes back with recommendations for two indexes. And you'll notice that one index has four columns, and part of the reason for that is, what it is trying to do is use indexes, so that you don't have to go to the data file, but you can simply use the index as a vertical partition of the data.
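The idea of answering a query from a wide index alone, without going to the data file, can be tried out in miniature. The sketch below is only an illustration: it uses SQLite from Python's standard library rather than SQL Server, and the `orders` table, its columns, and the index name are hypothetical stand-ins for the demo database.

```python
import sqlite3

def plan(conn, sql):
    """Return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, "
    "customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, customer, amount) VALUES (?, ?, ?)",
    [("west" if i % 2 else "east", f"c{i % 10}", float(i)) for i in range(1000)],
)

query = "SELECT SUM(amount) FROM orders WHERE region = 'west'"

plan_before = plan(conn, query)  # no index yet, so the table itself is scanned

# A multi-column index that holds every column the query reads.
conn.execute("CREATE INDEX ix_region_amount ON orders (region, amount)")
plan_after = plan(conn, query)

print("before:", plan_before)
print("after: ", plan_after)
```

Once the index contains both `region` and `amount`, SQLite reports that it is using a covering index, meaning the base table is not read for this query. That is the same "index as a vertical partition" effect described in the demo, shown on a different engine.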
So we'll go ahead and accept this recommendation. And now would be a good time for us to reset the trace, the tracing process, because what we can do now is go to the query analyzer, and re-execute the query, and see whether it actually gave us a benefit. So you can see that from 6.5 seconds, it dropped to something like 2.6 seconds. So it gave you an improvement. Of course, we have a small improvement in terms of absolute seconds. In a large database, you will see order-of-magnitude improvement.
Okay. So now what I want to do at this point is to remind you of the other way in which we can use this tool, which is that, after all, while it may be good in a well-tuned database to tune one or two problem queries, when we have a large workload, what we really want to do is to tune the design for the entire workload. And so what you would like is that the administrator just logs the events for a while, and then comes back and tells the system, look at those events, and come up with the recommendations of any change or whatever you would like us to do. So it's not picking a particular query.
So in order to do that, let me first clean up the database, because I have cluttered it up with data -- with these indexes. So I'm just going to go ahead and clean it up. Okay. So I have the entire workload here. And let us try to find indexes which would serve all the six queries, and overall give an overall improvement.
So for that we can go back to the profiler, and from the tools we can just go into the Index Tuning Wizard. So that will bring up a fairly simple interface for us to use, to do the workload tuning. So it's asking me to first look at the server, and the database. So I'm going -- so at this point, pick the right database on which you want to do that. While I'm doing that, I want to point out one thing, which is that over here, all the queries are select queries. Now, the fact of the matter is, when you have a workload which is updated, and -- (inaudible) -- you can see that indexes can hurt, it can increase the time. So although we'll not show it in the demo, the Index Tuning Wizard would automatically detect that, and not recommend indexes, and might even recommend that you drop certain indexes, if they are hurting. Okay.
So let's go back to the Index Tuning Wizard and what you want to do is add databases to these databases. And over this database, we want to -- we have already a saved workload file, essentially containing the queries that you saw. So I am -- I'm going to point it to that particular workload file. And here you know, it is set as such that you can just use it in, you know, fully automatic mode. We don't have to think of it, at one's option. But, if -- (inaudible) -- in terms of how much space you would like to give to this index tuning process. You may not have enough space, and you want to restrict it. Right now it is, you know, about roughly twice the size of the database, but you can set it to what you like.
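The tradeoff described here, where indexes speed up queries, slow down updates, and must fit in a space limit, can be sketched with a simple greedy heuristic. This is not the Index Tuning Wizard's actual algorithm, which the talk does not spell out; the candidate names, sizes, and timing estimates below are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    size_mb: float       # estimated storage the index would need
    read_saving: float   # estimated seconds saved across the workload's queries
    update_cost: float   # estimated seconds added to the workload's updates

    @property
    def net_benefit(self) -> float:
        return self.read_saving - self.update_cost

def choose_indexes(candidates, budget_mb):
    """Greedy pick: best net benefit per megabyte first, within the budget."""
    chosen, used = [], 0.0
    # An index whose upkeep outweighs its savings is never worth building.
    useful = [c for c in candidates if c.net_benefit > 0]
    for c in sorted(useful, key=lambda c: c.net_benefit / c.size_mb, reverse=True):
        if used + c.size_mb <= budget_mb:
            chosen.append(c)
            used += c.size_mb
    return chosen

# Hypothetical candidates; none of these figures come from the demo.
candidates = [
    Candidate("ix_orders_wide", size_mb=40, read_saving=3.9, update_cost=0.2),
    Candidate("ix_details_product", size_mb=25, read_saving=1.5, update_cost=0.1),
    Candidate("ix_customers_city", size_mb=10, read_saving=0.3, update_cost=0.9),
    Candidate("ix_orders_date", size_mb=30, read_saving=1.0, update_cost=0.1),
]

picked = choose_indexes(candidates, budget_mb=70)
print([c.name for c in picked])
```

With a 70 MB budget the sketch keeps the two indexes with the best payoff per megabyte, skips the one that no longer fits, and rejects `ix_customers_city` outright because its update cost exceeds its savings. A real tool would estimate these numbers from the optimizer rather than take them as inputs.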
You also have a control on how wide you want the indexes to be. Okay. And the option here is to keep all the existing indexes, let's say, hey, I don't want to hurt any performance of any of these queries, you can add some more, but please don't take away anything from those queries I have. So we'll let the default work here for us, and move on to the next slide, which again, is a way to tune it.
For example, here the database has -- here, what you see is a list of all the tables that the database currently has. And I can say that I don't want to touch anything on the customers table, because I know that, you know, Bill runs a lot of queries on the customers table, I don't want to muck around with it. So we are safe now, so we'll just go ahead, and for this small, baby database we'll just let all the tables be considered for tuning.
Okay. So another thing is that when I was -- one thing I was mentioning, which is the logical point I would make, is that SQL Server 7.0 uses indexes in a variety of ways. So one of the real technical challenges that we face in trying to do this is to be able to anticipate all possible uses of the indexes. And that's very different from the first generation of the systems, where use of the indexes was much less sophisticated. Okay. So I'll just go over this screen, because you have already seen this. So I press the cancel there, you can see.
So, okay, so now we're just going over and doing the actual tuning, and figuring out the right set of indexes which are appropriate for this workload. And you can see what it is doing, it is trying to go over wider and wider indexes, and trying to automate it in the process. So what it has done now, it has gone ahead and selected a set of tables, from the cluster tables on the orders database, and on the order details database, and it is making use of the wider indexes. So at this point you can go ahead and recommend the index -- accept the recommendations and proceed.
But, again, if you are picky, you can ask the question that, give me a little more analysis, give me an idea that if I want to use the indexes, which index will be used a lot, which index will not be used? I mean, you're recommending three indexes, I want to get a sense of the tradeoffs of the storage I'm giving. You can say, I don't want this particular index, it is taking space and it is giving me some improvement, but I don't want to work with it. So there is one kind of analysis you can get.
You can also get a query cost report. What it gives you is that it tells you that you have all these queries in the workload. How much was the improvement for each individual query? So you could have a query of your choice. You can see, did it improve it significantly, or did it hurt a query that we really care about? And in which case you may like to accept the recommendation, and so on. So, again, this is about fine-tune control, and a person who is not going to use it, need not ever look at it, and simply accept the recommendations. So this is very optional.
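A per-query cost report of the kind described here comes down to simple arithmetic on before and after timings. The sketch below is a hypothetical illustration, not the wizard's report format: Q1 reuses the 6.5-second and 2.6-second figures quoted earlier in the demo purely as sample numbers, and Q2 and Q3 are made-up values, with Q3 chosen to show a query that gets slower.

```python
def cost_report(before, after):
    """Return (query, before, after, percent_improvement) rows.

    A negative percentage means the query got slower under the new design.
    """
    rows = []
    for query, old in before.items():
        new = after[query]
        rows.append((query, old, new, round(100.0 * (old - new) / old, 1)))
    return rows

# Q1 uses timings quoted in the demo; Q2 and Q3 are invented for illustration.
before = {"Q1": 6.5, "Q2": 1.2, "Q3": 0.8}
after = {"Q1": 2.6, "Q2": 0.9, "Q3": 1.0}

report = cost_report(before, after)
for query, old, new, pct in report:
    print(f"{query}: {old:.1f}s -> {new:.1f}s ({pct:+.1f}%)")
```

Going from 6.5 to 2.6 seconds is a 60 percent reduction. A negative row such as Q3 is the case where someone might decline a recommendation because it hurts a query they care about.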
Okay. And if you are happy with this, we go off to the next screen, and here I can say, apply the changes. And I also have the option of just saving the script file. When I apply changes, it gives me an option to schedule it at the time of my choice, maybe perhaps in the evening, also, and in the off time, or batch hours, or I can execute the recommendations now. Since the database is small, I'll be bold and execute the recommendations right away.
Okay. So what I want to do now is the final step. I want to go back to these queries, and go back to the profiler, start it up, and essentially take you to the workload, which is the workload we tune for. And, you know, ask it to execute all the queries in this workload. And as it does, in the profiler we'll see how much time it essentially takes to execute all these queries, and that should give us an indication on how much improvement we are currently expecting. Okay. So you can see that essentially you have dropped some of the queries down from what they were. And again, the important thing here is trying to balance off the various other queries that are in the workload, and to get an improvement. So that's what I wanted to show.
Of course, one thing you'll notice in this demo is that we are pro-actively saying, trace it. So what we would like really to see is that the tracing would automatically watch, as the queries are coming by into the system, and at some point notice something has changed sufficiently that the administrator needs to be notified. Or in a system without any administrator, it will take charge and make the changes. So that will be the natural step forward, and obviously this is one piece in the entire zero administration database project.
MR. GATES: That went well. Thanks very much.
MR. CHAUDHURI: Thank you very much.
MR. GATES: I mentioned that one of the big opportunities is unification, and if you take a very broad definition of the word "store," there are a ridiculous number of stores that people have to get familiar with. Some of them are down in the guts and you shouldn't have to think about them, like the registry, but it's surprising how often its limitations or problems poke through and actually affect how the system is working.
Also, some of the descriptions inside the system aren't rich enough. For example, there is such a thing as an application. An application requires a certain number of files to be set up in a certain way. However, today, the operating system doesn't have a rich description of that application, and so it's not able to see if the application is ready, it's not able to repair problems. It's not able to go out over the network and get files that are missing, because as far as it's concerned, it's a bunch of disconnected files that are being executed. And so the error messages tend to be very low level, and very, very confusing.
The file system is very different than the registry, which is different from the message store, and those are different than where we're caching Web pages. Again, we have replications, different replication approaches on those things. Directory services are now coming in as a certain kind of database, very important for the security of this system, but using concepts that make it hard to administer and to develop against these things.
And so, by having richer capabilities, we ought to be able to pull these things together into very few stores, if not, for the typical PC a single very, very rich store, which in some ways you think of as a replacement for the file system, since that has been the centerpiece up until now, but in many ways it would go well beyond that.
This is a challenge that we need to take in stages, but we will be injecting a lot of very rich database technology into that primary store, into the file system store, to simplify and get rid of those other things. One part of that, and the next step, does get rid of the message store as a separate entity, as well as the Web cache. Part of this, of course, is having the navigation be common, taking what has been the shell, what has been the mail client, what has been the browser, and bringing those together, and so techniques like having a history list, or going backward and forward, all of those are shared in the kind of navigation you want to do across the information.
I also mentioned schema is a very important thing. And, you know, we use this word in a very high-level sense. There's a lot of discussion now about taking XML and having that be a description language that allows arbitrary Internet sites to look at their data and see what sort of interchange is possible. We want, ideally, certain world-view type information, user attributes, time-based information, product-type information, with all of that there should be enough of a common agreement that you can go to a site and look at a set of, say, events, like football games or something, and be able to drag and drop that into various things that deal with events, and have it work very well. There's going to have to be a tolerance for ambiguity and incomplete data in these models, and that's why it really is going to push the state of the art in these areas.
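To make the interchange idea concrete, here is a minimal sketch of two sites publishing events in XML with a shared vocabulary, and a consumer that tolerates incomplete data. The element names (`event`, `title`, `start`, `place`) and the sample events are hypothetical; no agreed schema is named in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical event descriptions published by two different sites.
SITE_A = """<events>
  <event><title>Football game</title><start>1998-09-06T13:00</start>
         <place>Seattle</place></event>
</events>"""

SITE_B = """<events>
  <event><title>Database seminar</title><start>1998-06-03T09:00</start></event>
</events>"""

def read_events(xml_text):
    """Parse <event> elements into dicts; missing fields become None."""
    events = []
    for node in ET.fromstring(xml_text).iter("event"):
        events.append({
            "title": node.findtext("title"),
            "start": node.findtext("start"),
            "place": node.findtext("place"),  # tolerated if absent
        })
    return events

# Merge both sites' events into one calendar-like list.
calendar = read_events(SITE_A) + read_events(SITE_B)
for entry in sorted(calendar, key=lambda e: e["start"]):
    print(entry)
```

Because both sites use the same element names, their events drop into one list without site-specific code, and the second site's missing `place` simply comes through as `None` rather than causing a failure.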
Another key area is breaking databases down into their logical components. They can have a rich query processor that works against different stores. For example, something like the spreadsheet. You don't want to have to build a special query capability there, but you would like the data that's inside there to be accessible. One of the approaches we have to this is an API that we're evolving called OLE DB, and that we're implementing pretty broadly across the different products. Partly, disconnected operation leads to the need for this separation; things like the cache manager and a cursor engine will need to be down on the client. In some cases, a client that's way too small to keep the entire storage engine or do arbitrary queries. When we think of devices, we're not just thinking of PCs. We're thinking of that device in your pocket, or that intelligent TV, and so data has to be everywhere. In fact, you don't even want to think about where the data is. If you're ever connected up to the Internet, you want your mail, your schedules, your files to all be available without regard to actually where those things happen to be stored.
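The separation of a query processor from the stores it runs against can be sketched in a few lines. This is not the OLE DB interface; the `RowSource` protocol, the two store classes, and the `select` function are hypothetical names chosen only to show one query routine working over differently shaped stores.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Protocol

Row = Dict[str, object]

class RowSource(Protocol):
    """Anything that can hand out rows, whatever its storage format."""
    def rows(self) -> Iterator[Row]: ...

class SpreadsheetStore:
    """Rows kept as a header line plus lists of cells."""
    def __init__(self, header: List[str], cells: List[list]):
        self.header, self.cells = header, cells

    def rows(self) -> Iterator[Row]:
        for line in self.cells:
            yield dict(zip(self.header, line))

class MailStore:
    """Rows kept as already-structured message records."""
    def __init__(self, messages: List[Row]):
        self.messages = messages

    def rows(self) -> Iterator[Row]:
        return iter(self.messages)

def select(source: RowSource, where: Callable[[Row], bool],
           columns: Iterable[str]) -> List[Row]:
    """One shared query routine that works against any RowSource."""
    cols = list(columns)
    return [{c: r.get(c) for c in cols} for r in source.rows() if where(r)]

sheet = SpreadsheetStore(["customer", "amount"], [["Ann", 120], ["Bob", 40]])
mail = MailStore([{"sender": "Ann", "subject": "Order"},
                  {"sender": "Cy", "subject": "Hi"}])

big_orders = select(sheet, lambda r: r["amount"] > 100, ["customer"])
from_ann = select(mail, lambda r: r["sender"] == "Ann", ["subject"])
print(big_orders, from_ann)
```

The spreadsheet store never needs its own query capability: it only exposes rows, and the filtering and projection live in one shared component.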
Well, the great thing about scalability is that even with all these exponential hardware improvements, people's expectations and needs are going up fast enough that breakthroughs in scalability ideas will continue to have immense value. The number of transactions, the amount of data to analyze that is generated on the Internet, will be at least one if not two or three orders of magnitude greater than what mainframes have been asked to do historically. There are a lot of companies like Cisco and Dell that are moving very quickly to use the Web as their primary way of doing business. And this is only going to accelerate. They want to analyze those customers, specialize things for those customers very much in real time, see new patterns emerging that help them make better business decisions.
We've got an Internet site which is the Hotmail site, which adds over 100,000 new users a day. Now, we can't keep that up forever, we'll have more than everybody on the planet, but it's pretty phenomenal, not only how many people are signing up, but also how much traffic they generate. The beauty of that site is, of course, that anyplace where you have Internet access through a browser, you can get your mail.
Now, what we want to do is synthesize the two worlds. The world of a rich mail client that uses the offline store with the idea of the mail store that's out there on the Internet and available through a simple device. We don't want to block what features you have available simply because you want that universal accessibility. So, we're bringing the work we had been doing that was very client focused, and the Internet work, and pulling those together. As we get richer and richer with things like Buddy Lists where you can identify people or groups that are working together, making sure that we retain the kind of performance scaling that we've had, that's going to be a very big challenge.
So, electronic commerce is really going to redefine how business is done. And we're just starting to see a few companies emerging that are leading the way in that. Behind every one of these things will be a number of very complex databases, and databases that will be mined for information in a way that hasn't been necessary in the past. Microsoft is pushing forward on scalability. It's never been our view that the PC has an upper bound. We'd like to get to the point, and it's simply a matter of time, where PC technology has the advantage of being available from many different manufacturers, and having a much greater wealth of software and peripheral choices than any other part of the computer industry. We want to make sure that it can meet the most demanding requirements, whether it's the requirements of the past or the requirements of the future.
We've been making good progress on this, but it's one of these problems that the more you get involved with it, the more you realize how complex it is. It's not just running the benchmarks very well, it's all the things related to backup, and reliability, and those are fun, challenging problems that we're taking on.
I've got on this slide some of the benchmark results we've got, which are very good. And, of course, even without standing still, the hardware side will be pushing these numbers up fairly dramatically. One of the most interesting of these was the billion transaction demo we did with the clustering approach and the Transaction Server, and it was great to see that come together, and we are now starting to have commercial customers that are taking configurations like that and using it for their business.
Another challenge we took on was building a very big database, and actually our database research group got involved in helping out with this, and it's been a really exciting project, and I'd like to ask Jim Gray to come on up and talk a little bit about this Terra Server project, and what we've learned from it.
MR. GRAY: So, we've got a scalability research group, and it's hard to do research on scalability in the abstract. You actually have to build a big system. So, there are two kinds of scalability. There's scaling out, by buying computing by the slice and just adding computers until you've got enough power. And there's also scaling up, which is to take a fairly big box and then add things to it. And the demo I'm going to be doing here is using, you talked about PCs, this is the world's largest PC. It weighs about eight tons. It's got 324 disks on it. It's running NT, the Enterprise Edition, out of the box. It's got a lot of RAM on it, and it's actually got 2.4 terabytes of RAID disk; that is 2.9 terabytes before you give some of it up to RAID and all of that stuff.
So, the fundamental idea here is that we're trying to build a big database, and fill up those disks. So, we went and looked for some data. And certainly you can take a random number generator, but we wanted to put something on the Internet and show people that you can actually build big databases with PC technology. So, we looked around. We called up the USGS. We decided we'd make a picture of the planet. And so we called up the USGS and we said, how much data do you have? And they said, well, we've got a lot. And we said, well, we'd like it all.
It turns out, they published about 100 gigabytes of data on CD. And we said, well, you know, that's not a terabyte. But you can see that the green stuff on this picture is the USGS data. So, we wanted to get world coverage, so we went to various people, and we found some people in Russia who have been taking pictures of the world for a long time, and have a lot of images on the shelf, and they were enthusiastic about doing something with these images. They don't have any real use for them anymore. And so we made a deal with them. We'll build a gazetteer of their data, put it on the Internet, and people will be able to download, and they'll be able to pay the Russian Space Agency for it. So, everybody was happy about this.
And so, we demoed a prototype of this, what, a year ago, and so what have we been doing in the last year? Well, we've been trying to actually make the thing work. In about three weeks, it's going to go live on the Internet. It's in beta test right now inside of Microsoft, and so this is the beta version. And probably the simplest interface and something we've added is a so-called coverage map. And you can click this map, and zoom in, and the green stuff is areas where we have pictures.
And this is the kind of interesting display. It's a roll-up. It says, well, we've got 32 million rows, which are tiles. And what we've done is taken the photos that we have and broken them up into pieces of about 10 kilobytes that can be downloaded on the Internet. We have a mosaic of the United States. And there's about 32 million little tiles in the thing we're looking at right now. If we were to pan to the right, there we have about 100 million tiles, 109 million tiles. And it's 445 gigabytes of storage.
But we can point to, say, Seattle, and zoom in on that, and there's going to be -- now we're down to a mere 3 million rows. And we can, I hope, find us. So, what we've done -- whoops, that doesn't look very interesting, does it? Let's zoom off to the right. So there's some land at least. So, this is a 16-meter resolution. We can go and look at the 32-meter resolution image, and you can see that this is coming to us over the Internet through an ISDN line. And I still don't recognize where we are, do you?
MR. GATES: That's Seattle.
MR. GRAY: So, where are we? Are we north of this?
MR. GATES: No, aren't we down? Where are we?
MR. GRAY: Head south, okay. That's probably the Hood Canal. Are we getting close?
MR. GATES: Yes. There's the bridge.
MR. GRAY: Okay. Are we right about there?
MR. GATES: Yes. Good.
MR. GRAY: Okay, we're getting closer. So there's the Kingdome down there at the very bottom, and we're about right here. And I think we're in that building right there. So that's us, and we can get a medium-sized image of that.
So, all of these are database queries, and this was about 19 database queries to get the various tiles that went into this image. And there's the fountain, and here's where we are right here. Space Needle is down a ways. I think we can find it.
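The idea that one on-screen image is assembled from a number of tile lookups can be sketched as follows. This is only an illustration, not the Terra Server's actual schema or code: the `TileKey` type, the `tiles_for_view` function, and the 200-pixel tile edge are all assumptions made for the example. The talk states only that tiles are about 10 kilobytes each and are stored as database rows.

```python
from dataclasses import dataclass

# Assumed tile edge in pixels; the talk gives only the approximate tile size in bytes.
TILE_PIXELS = 200


@dataclass(frozen=True)
class TileKey:
    """Hypothetical key for one tile row in the database."""
    resolution_m: int  # meters per pixel, for example 16 or 32
    col: int
    row: int


def tiles_for_view(resolution_m, left_px, top_px, width_px, height_px):
    """Return the keys of every tile that overlaps the requested pixel window."""
    first_col = left_px // TILE_PIXELS
    last_col = (left_px + width_px - 1) // TILE_PIXELS
    first_row = top_px // TILE_PIXELS
    last_row = (top_px + height_px - 1) // TILE_PIXELS
    return [
        TileKey(resolution_m, col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# A 600 x 400 pixel window at 16-meter resolution; each key would be one lookup.
keys = tiles_for_view(16, left_px=1000, top_px=600, width_px=600, height_px=400)
print(len(keys), "tile lookups")
```

Each key in the list corresponds to one single-row query, which is why a larger or unaligned window costs more queries than a small aligned one.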
MR. GATES: Is that it there?
MR. GRAY: Is that it?
MR. GATES: Yes.
MR. GRAY: Okay.
MR. GATES: It doesn't look good from that angle.
MR. GRAY: No, it doesn't look as good. Okay. So, we can take this image, it's from the USGS, we can download it, and here is a version of that image which we can actually save to our hard drive. So it's coming down over the Internet, hopefully, with patience, the worldwide wait, and so pretty soon it's going to pop up and say, okay, where do you want to save this? And I'm going to save it in that file there.
So, that's one thing. And the USGS has been trying to get this data out to customers for years. Much of this data is available at their Website, but you have to be pretty much a geek to understand how to use the data. This is a much more intuitive interface. They also have a Site Server. You can go to the Site Server and order the data that corresponds to this, and buy it, and they'll send it to you through the mail. So, that's the U.S. part of the picture, and it shows basically navigation by pointing at a map.
There's another kind of navigation that we've done, which is navigation by name. So, I'm going to try and go to Victoria, Hong Kong. So, we're going to type in the name "Victoria" and we could type in the country and what kind of thing we're looking for, but let's just give it a sort of unconstrained query. And these are all the Victorias that we know about. We're actually a little bit embarrassed that we don't have more coverage of Africa and South America. But we do have Hong Kong, see, and there is Hong Kong. I'm kind of a sailor, so I always go to the Typhoon Harbor. And so there is the little harbor, and this is taken from 200 miles up. This is a declassified spy satellite photo; these were used for various purposes in the past and are now publicly available. And the Russians put up another satellite in January, and we're beginning to get new images. And while this site is on the Internet, we'll be loading additional imagery and hopefully get more world coverage.
So, I guess the main thing to say about this is, first, it is a big database. It's arguably one of the largest, if not the largest, single Web server on the Internet when it goes online. It's a multimedia database. I think most large databases are going to be multimedia. It's just hard to come up with a trillion numbers. If you take all of the Microsoft customers ever, and put them in a database, you probably wouldn't come to a terabyte. If you take all of the HTML pages and put them in a database, it's hard to come up with a terabyte. So, if you want to get multi-terabyte databases, you've got to do something else. And you can see, we actually don't have very much coverage. There are 500 million square kilometers on the planet. We've taken four terabytes and compressed it down to about 800 gigabytes. If we can fill in that whole thing with green, we'll have something like 100 terabytes uncompressed, and something like, I think, about 20 compressed.
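The storage figures quoted above can be checked against each other with a little arithmetic. The sketch below uses only the numbers from the talk; reading the drop from four terabytes to about 800 gigabytes as a compression ratio is an assumption, and the variable names are illustrative.

```python
# Figures quoted in the talk, in terabytes.
raw_loaded_tb = 4.0      # imagery loaded so far, before reduction
stored_tb = 0.8          # "about 800 gigabytes" after reduction
full_raw_tb = 100.0      # projected uncompressed size for full coverage
full_stored_tb = 20.0    # projected compressed size for full coverage

# Assumption: the reduction is image compression, so these are compression ratios.
ratio_today = raw_loaded_tb / stored_tb
ratio_projected = full_raw_tb / full_stored_tb

# Share of the projected full data set that has been loaded so far.
coverage_fraction = raw_loaded_tb / full_raw_tb

print(f"ratio today: {ratio_today:.1f}:1")
print(f"ratio projected: {ratio_projected:.1f}:1")
print(f"loaded so far: {coverage_fraction:.0%} of the projected total")
```

Under that reading, the current and projected figures imply the same ratio, so the 20-terabyte estimate is a straight extrapolation of what has been loaded so far.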
Why do we do this? It's going to be a vehicle for scalability. It's just been a heck of a lot of fun. It's been a great test of the products we have. And we're looking forward to a lot more research in spatial databases, and also in scalability. We now have something we can back up and restore, and try algorithms for backing up and restoring. That's about it.
MR. GATES: Great. That's fantastic. Thanks, Jim.
MR. GRAY: Okay. Sure. Take care.
MR. GATES: Let me just close by saying a few words about research. As I said earlier, it's really been fantastic putting together our research group at Microsoft. It was about five or six years ago that we got to the size where we decided we could really do this. And so we got going, in areas like natural language and graphics. And I think it was about three or four years ago we started some of the work in databases. This has been the fastest growing part of Microsoft, and that will continue to be the case for many, many years to come. The only thing that holds us back is how quickly we can bring in great people, which has always been true of the company, as a whole. It's just the research people are, at this stage, doing the best job of that.
Right now we're in three locations, Redmond, our headquarters, down in San Francisco, and out in the U.K. We'll probably add one or two more locations, in the next two or three years. There's an interesting paradox, it's actually easier to do distributed research than it is to do distributed product development. Our product development is much more concentrated in one location, although that's another area we're looking at what can we do to spread things out. Really, I think the whole industry gains when companies invest in research. And I would like to see other software companies do more in this area, because most of the good work is shared, and it really can help drive forward the pace at which we're taking advantage of all these computer capabilities.
This slide just lists some of the major groups that we've got. Some of the unusual things we're doing, we are looking at new ways of programming. I mean, after all, we're a software factory, and improving the development process, and equally importantly, improving the testing process, is very critical for us to be able to keep our pace going. In fact, in one of the database groups they came up with a way of generating SQL statements that have known results. And therefore, being able to throw millions of SQL statements at our databases, and find problems. And it was very impressive, they found a lot of problems pretty quickly with that automatic test generation. And it's a technique that we're going to use in other areas, as well.
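The technique of generating SQL statements with known results can be illustrated with a small sketch. The talk does not describe how the Microsoft generator worked, so everything here is an assumption chosen for illustration: the single-column table `t`, the range-query shape, and the function names `make_case` and `run_suite`. It uses SQLite from the Python standard library as the engine under test, and computes the expected answer independently in plain Python.

```python
import random
import sqlite3


def make_case(rng, rows):
    """Build a random range query together with the answer it must return."""
    low = rng.randint(0, 100)
    high = rng.randint(low, 100)
    sql = "SELECT COUNT(*), COALESCE(SUM(v), 0) FROM t WHERE v BETWEEN ? AND ?"
    # The expected result is computed without the database engine.
    hits = [v for v in rows if low <= v <= high]
    return sql, (low, high), (len(hits), sum(hits))


def run_suite(seed=0, cases=1000):
    """Run generated queries against SQLite and collect any mismatches."""
    rng = random.Random(seed)
    rows = [rng.randint(0, 100) for _ in range(200)]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (v INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in rows])
    failures = []
    for _ in range(cases):
        sql, params, expected = make_case(rng, rows)
        actual = conn.execute(sql, params).fetchone()
        if actual != expected:
            failures.append((sql, params, expected, actual))
    conn.close()
    return failures


failures = run_suite()
print(len(failures), "mismatches")
```

Because the seed is fixed, any mismatch can be reproduced exactly, and the same generated cases could be pointed at a different engine by swapping the connection.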
It was interesting, we took that automatic test generator, we not only threw it against our databases, but we threw it against our competitors' databases, too. So if they're interested in a few thousand bugs each, that we found, we'll be glad to share that information with them for a reciprocal contribution on their part. So there's a lot of very important areas, and some of the most exciting things will come out of the intersection of these areas, taking database technology and user interface technology and seeing how those can come together, taking the natural language area, and these rich database schemas, and figuring out how those things can relate to each other, reducing the complexity of what people see when they navigate the information.
It's important to have a long time horizon, in all of this. You know, not expect much in the way of payback for 5 or even 10 years. I have to say, in our case, even though we didn't go into it with any of that expectation, the early collaboration, the early impact of having the research group has already been really quite dramatic. Whether it's the code in the word processor, that comes from the natural language group, or whether it's the things that are going into the SQL database that we come out with this year, that our database research people contributed to.
And so in a software company, there's a natural affinity between the researchers and the people doing product development. You don't have this big discontinuity that you often get in many companies, between research and productization. And having that smooth spectrum makes it all the more worthwhile to make the investment and do the new things. We've had fantastic relationships with universities, and the bulk of the computer science research in the world continues to be done there, so it's very, very important that we work well with all of those groups.
Let me just close on a very optimistic note. I think the kind of work you do is crucial. Databases are at the center of the information age. And getting more knowledge out of them, being able to deal with them in an easier way, dealing with them in a very large scale, I think the people in this room are going to make a huge difference, by coming up with breakthroughs that make that possible. So we need your help, and there's exciting times ahead.
Thank you.
(Applause.)
MR. GATES: With that we can get the lights up a little bit, and I think I did leave a little time here at the end, for anybody who wants to ask questions.
Should I -- how do you want me to do it? Okay. I'll point and then I'll repeat it.
Go ahead.
QUESTION: Much of the research you were talking about, about challenges that Microsoft would like to address, appears to me to be looking to the future or, let's say, providing a repository or whatever for existing Microsoft products. The vast majority of all of the world's productive information systems and data reside in what are pejoratively called legacy systems. I prefer to call them business systems. So when you do this creation of new opportunities, may I implore you, or actually ask you about your view concerning the development of tools to facilitate bringing the older world into these new products, so that we can run our corporations within the frame of these products?
MR. GATES: Well, that's a very good point. I don't know if everybody heard it. But, he was talking about how we bridge from a world where everybody's got their data in legacy environments, how we make sure any new improvements make it easy either to take advantage of those, without moving the data, or simple ways to move to the new environment. The change is starting to take place. People who do new ERP systems, like SAP, they are moving their data off the mainframe and onto either NT or UNIX-based systems. And, you know, it's all in a relational database, and there's a lot of ways that that data is available to other applications that want to navigate through that data.
We ended up doing an incredible amount of work in interoperability. When we did our transaction server, it was kind of a pure product that worked on NT. But, then we did an equal amount of work to the creation of the transaction server, doing things like the CICS interoperability. And, you know, we did it for LU 6.2, we thought that was great. Then people said, no, we don't update our MVS, we had to go back and do it for LU 0, and we had to do it working through the Queue Manager, for certain things. And so interoperability is a huge investment for us.
When we see people moving away from legacy environments, there's two ways they can move. They can move abruptly, where they take the code and data, and move that to the PC environment all at once. And that does happen. Another approach is to leave the data on the mainframe.
In the case of banks, the number of banks we've convinced to move their main account database onto PC technology is modest, it's about 20 medium-sized banks around the world. But, what we have done is gotten a lot of banks to say, any new code you write, write it in the NT environment, and then use these gateways to reach out, you know, to CICS, or however you want to gateway to it. And so you still have your data up on the mainframe, and have the comfort that comes from that. So, you know, we're engaged in about a 5-to-10-year process of moving that legacy data, with lots of interop along the way.
QUESTION: Which research projects do you have that are aimed at managing the info glut?
MR. GATES: Well, in terms of managing the info-glut, today when you search the Internet you're getting essentially more and more data that you want to be able to sort through. Part of that is moving away from simply string matching to actually parsing the text, and having lots of rich tags that go together with that text. And the standards for metadata won't come from a single company, but we're a huge participant in creating XML tags that can go on all these pages, so that as you do the searches, you're more likely to find the things that you're interested in. We also have an approach where we do the parsing of the entire Internet, and so it's not just based on the words that are used there.
I think, you know, that the info-glut is sort of the big problem. Everything you want will be on the Internet. The question is, will you be able to get to it, and most users will just want to ask for the information, either through speech or by typing sentences, in the same way they would ask another human being. And so the commercial payoff to being able to do that well is very dramatic. We won't be able to solve it just by having clever front-end software; we will have to raise some of the semantic tagging of the information itself, which is an industry-wide problem.
QUESTION: What is Microsoft doing to make Java interoperable?
MR. GATES: What are we doing to make Java interoperable? Well, that's not a technical question. Actually, if you look at conformance with Java tests, and running the Java suites that are out there, our implementation has been rated again and again, by third parties, as the best implementation that there is. And so we support Java, in terms of if you want to do least common denominator stuff. We support everything that's there.
We'll also let you make unique operating system calls. It's like anybody who writes an application, you can be neutral to the operating system you're on, or neutral to the database you're using or, sometimes when you want to get performance or do some unique things, you can make direct calls into the unique capabilities of the platform that you're running on. And Java shouldn't be different than any other language, in terms of giving you that choice, of being generic and therefore universal, or calling into the native capabilities that are available. And so we've done that, as well.
The key milestone for Java will be, when does it get into a standards group, where everybody has an equal vote? And to date, there is no such mechanism. So you see people like Hewlett Packard, or ourselves, as we come up with innovative ideas that we want a broad community vote to decide is this a good thing or not a good thing? To date, there is not a forum for that to take place. In every computer language that's come along, there has been a neutral standards body to adjudicate those things, so hopefully that will come along.
Speaking of computer languages, it's something I didn't get into in my presentation. But, I think the boundary between the database world and the language world is probably one of the most ripe boundaries for innovation. We still have a big impedance mismatch between the languages and the data, particularly as we're getting richer and richer structures in the data. And, you know, although Java is nice, it isn't a breakthrough in this respect. And so I think whether it's extensions of Java, or other existing languages, or if somebody wants new languages, it's really ripe for somebody to do that.
Our strategy overall is, in a sense, language neutral. We don't think everybody is going to rewrite their applications in some new language. Java is a good language. Lots of people use that. C is not going away. Visual Basic is not going away. COBOL, ABAP, you name it, those will be there. And so we create a platform that is completely language neutral, in terms of the things that it offers us, and we lobby for standards efforts in all of the languages that run on that platform.
QUESTION: Questioner asked if the Terra Server demonstration requires Internet Explorer?
MR. GATES: No. I mean, what it's doing in terms of its use of the browser, there is nothing particularly sophisticated there. I mean, those are basically, they just come down and get the images, and display them. That works on any browser that you can imagine. There are some rich things going on in browsers, where instead of having to round trip up to the server again and again, where you're browsing table-like data, there is a need to have sort of a richer cache, and the ability to cursor through things, and even do simple table operations, down on the client in a disconnected way.
Again, everything we're doing in those areas, we propose to standards groups. If you look at what's gone on with HTML, and how we participate in that standard, and drive that standard forward, contrast that with what other companies in the browser business have done, you know, we're very proud of our adherence to those standards, and how we're pushing those forward. But, you know, the only case where you get a difference in terms of viewing things in browsers is where you get out into the state of the art issues, things like annotations, footnotes, rich data browsing, which the Terra Server doesn't push any of those boundaries at all.
QUESTION: Do you see the relationship between database technology and XML?
MR. GATES: Well, XML is about as straightforward a thing as there possibly can be. XML is a textual way that supports hierarchy of having attribute names and attributes. And so it's just a standard that we can use to exchange information. The databases themselves are unlikely to be moved around in XML, simply because it's not a very compressed way of exchanging the information. So it's likely there is going to be a lot of standards around XML.
We're driving for standards for user descriptions in XML. We're driving for scheduling standards in XML. We're driving for geographic data standards in XML. So anybody who's got capability in those areas can describe what they're capable of doing, and connect up to other arbitrary databases across the Internet. When the actual data itself flows, you can use XML. But, it's more likely it would be in some sort of binary compressed form, when you get large amounts of data flowing.
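The point about XML being a readable but bulky way to move large amounts of data can be illustrated with a small size comparison. This is a sketch under stated assumptions: the records, the element names `places`, `place`, `latitude` and `longitude`, and the fixed binary layout are all made up for the example and do not correspond to any actual standard mentioned in the talk.

```python
import struct
import xml.etree.ElementTree as ET
import zlib

# Hypothetical geographic records: (id, latitude, longitude).
records = [(i, 47.0 + i * 0.001, -122.0 - i * 0.001) for i in range(1000)]

# Textual, hierarchical encoding with named elements and attributes.
root = ET.Element("places")
for rec_id, lat, lon in records:
    place = ET.SubElement(root, "place", id=str(rec_id))
    ET.SubElement(place, "latitude").text = repr(lat)
    ET.SubElement(place, "longitude").text = repr(lon)
xml_bytes = ET.tostring(root, encoding="utf-8")

# Fixed binary layout: a 4-byte integer and two 8-byte floats per record.
packed = b"".join(struct.pack("<idd", *rec) for rec in records)
compressed = zlib.compress(packed)

print("XML bytes:          ", len(xml_bytes))
print("binary bytes:       ", len(packed))
print("binary + zlib bytes:", len(compressed))
```

The XML form carries its own field names in every record, which is what makes it self-describing for exchange between arbitrary systems and also what makes it larger than the packed form.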
QUESTION: (Inaudible.)
MR. GATES: In terms of trustworthiness of data, there is not a formal group at Microsoft who is focused on that. There is enough sort of informal activity in our research group. There could be somebody working on that, and I wouldn't even know about it. But, in terms of it being a major focus, no. We haven't had a group there. Actually, this whole area of testing has been the most frustrating. It was about four years ago I said to Nathan Myhrvold, who is our chief technology officer, hey, you know, we're getting to the point where we have more testers than developers at this company. There must be some people who are super excited about doing research in testing methodology. And of all the groups in research, it's the one that we've had the hardest time finding somebody who is just super excited about it and wanted to do new things.
There was the work that came out of our database group. But, in terms of a broad attack on software reliability, we still have a lot of open positions to go out and tackle that. So, you know, I hope we can do better there. And I don't know if the results will be extremely specific to software, or whether they'll flow over into techniques for broader data reliability. But, our primary focus will be software reliability.
QUESTION: What's your attitude towards SQL standards?
MR. GATES: What's my attitude towards SQL standards? There should be very few SQL standards. We're supporting all the rich extensions to SQL in our work. This whole area of object orientation, and how that relates to SQL, there seems to be a lot of different opinions there. Whatever mainstream thinking develops in the world of SQL, we are going to support. We've been big believers in allowing applications to be database independent, if they want to. And so we've done a lot of drivers that will allow you to call through in a fairly abstract way into the underlying database engine. OLE DB is just the most recent effort in that area.
We participate in the SQL standards. I expected SQL to sort of emerge as a mainstream language, and to see people writing programs in it. But, it looks now like that's not going to happen. We're not standing in the way of it. If it did, you know, that would be fine from our point of view. It just doesn't seem like that's happening. And so the advances in the SQL standard seem to be at a plateau right now. But, from our point of view, allowing our database implementation to compete on an equal basis is a great thing. So the more standards that are used inside databases the better off we're going to be, and we'll contribute to driving a very strong SQL standard.
QUESTION: (Inaudible.)
MR. GATES: Well, the software industry is primarily made up of small and medium-sized companies, and the number of new jobs being created in the software industry, the number of new start-ups being funded is at an all-time high. It's really quite fantastic what's going on. Partly, it's that the volume of the PC has made it economical to do more specialized software; it's a clear target platform that people go after. There's the excitement of the Internet where all those devices are going to be connected together. And so the number of successful small and medium-sized software companies will be quite amazing.
There will be a few categories like, say, word processing, where probably a large company will be doing the leading work there. Office productivity software is about a $5 billion or $6 billion category, and I certainly hope that a large company continues to be strong in those areas. So, you have things like database, high-end database engines, operating systems and productivity software that really match sort of a global, very research driven, fairly large scale operation. Then you have all the application software, the utility software, the things that fit around that. A lot of them more and more domain specific, and that's where you get these tens of thousands of great software companies. And our primary success, you know, why is Windows popular, it's because we work well with other software companies who build applications on top of Windows.
Windows itself is not a solution to anything. And so, we put three or four hundred million dollars a year into developers conferences, and our developer network, and things like that. That's very competitive. I mean, developers are always going to look to whatever platforms are out there. But it's by listening to those people, doing a good job for them, and facilitating the incredible success that they've had, which has been the growth of the software industry, that we've been able to play a leadership role. So, certainly it's very important for us. And large companies aren't suitable for doing most types of software that are out there. So the industry will always have a structure that's very tilted towards the smaller end of the size range, with room for a few big software companies.
Okay. I guess one last one. Go ahead.
QUESTION: (Inaudible.)
MR. GATES: Well, in terms of communications and computers coming together, it depends on how you look at it. Certainly, increasing the communications bandwidth, both to businesses and homes, is probably the most critical issue for the computer industry, because if you just have dial-up connections to the Internet, it will never become the mainstream way that you get information. So, we've got to get the speed and the instant availability very, very pervasively to fulfill the vision there.
From an architectural point of view, though, you could say that the opposite is taking place. You know, I used to have meetings with AT&T and communications companies, and they'd tell me how their protocols would be the standard, their directory, their security, and they talked about the so-called "intelligent" networks. And we said, yikes, you know, how many people are going to define their own intelligent networks, and our stuff was going to have to try and figure out all these different things.
Well, that's not what happened. What happened was, IP and Internet standards created an architectural separation between the people who provide the capacity to communicate and the way that software relates to the information that's on those networks. And so now we've gotten one central set of standards, the Internet standards, that are driving all of those networks. And the beauty of that is, you get specialization. You get now AT&T very focused on quality of service and their capacity. You get new people like Qwest and Level 3 coming in to do those things. You get satellite guys like Teledesic, you get wireless people, but they don't have to think about going out and evangelizing a bunch of software developers, or inventing their own security or directory approaches. So that kind of separation and specialization is very, very critical.
So the computer industry is more dependent on the communications industry than ever, and yet the boundary of who they are and who we are is clearer than it's ever been. And that's the same revolution that happened in the PC industry. You know, it used to be that a computer company defined its own chips, its own instruction set, its own operating system, and it was all one big vertical stack. As the pace of innovation came along, driven primarily by the microprocessor, it wasn't possible for one company to maintain that kind of a stack, not even IBM with all the good work they had done. And so, you know, Microsoft is a specialized layer, Intel is a fairly specialized layer, and that has led to all sorts of wonderful results in terms of the competition and the fact that people can come in and focus on the piece that they do.
So, the computer industry has a great need, and an important relationship with the communications industry, but it's certainly not a merging of the two businesses.
Okay. Thanks very much.
(Applause and end of presentation.)