Jim Gray, Microsoft Distinguished Engineer, before one of three telescopes at the Apache Point Observatory in New Mexico.
REDMOND, Wash., Oct. 6, 2003 — Finding a needle in a field of haystacks would be easier than the task that confronts Richard McMahon. A researcher at the Cambridge University Institute of Astronomy in England, McMahon is looking for 10 of the most distant celestial objects in the universe. And he needs to find them in the university's data library of 100 million astronomical objects.
The most practical way to narrow his search is to cross-reference Cambridge's data with another database, but that is on the other side of the Atlantic Ocean, at Johns Hopkins University in Baltimore.
Advances in scientific research tools are making dilemmas such as this all too common in astronomy and other sciences. Today's ultra-powerful telescopes double the amount of astronomical information available every year, creating data-management challenges that dwarf those faced by even the largest data-driven businesses. Astronomy databases now collectively hold about 200 terabytes (TBs) of data -- an amount equal to 10 times the information in all the texts currently stored in the U.S. Library of Congress.
Hidden in this data are clues to many of astronomy's biggest, most tantalizing questions: how galaxies form, the shape and size of the universe, even the future of the universe itself. But finding the answers, many astronomers agree, requires better ways to conveniently and cost-effectively analyze these libraries of data and to unite disparate databases via the Internet -- in effect, to produce a virtual, worldwide telescope.
A growing number of scientific organizations are turning to Microsoft to help them achieve these goals. Astronomers say Microsoft technology and the know-how of its researchers are helping them transform the way they do their work, allowing them to do research better, faster and with fewer technical challenges than ever before. And to the joy of amateur astronomers, these technologies are providing unprecedented access to a nearly limitless stockpile of celestial data and information to anyone with an online connection.
The results so far speak for themselves:
A first-of-its-kind Web portal that uses Microsoft .NET Technologies and Web-server software to unite or -- in scientific parlance -- "federate" astronomical data over the Internet, creating what is considered the first prototype of a virtual worldwide telescope.
One of astronomy's largest ongoing space surveys -- and a growing number of other research groups -- uses many of these and other Microsoft technologies to manage and provide professional and amateur astronomers access to their multi-gigabyte data catalogs.
"Microsoft technology is allowing us to easily manage and begin to federate the information within today's massive galactic databases in ways that weren't possible even a few years ago," says Alex Szalay, a theoretical cosmologist and astrophysics professor at Johns Hopkins and principal investigator for the National Virtual Observatory. "We now can handle all this data and put it to maximum use to begin to answer some of science's most complex astronomical questions."
.NET Technologies Unite Astronomical Data Online
Among the questions are those that Cambridge's McMahon -- the researcher looking for 10 needles in the celestial haystacks -- hopes to answer. He is searching for details about the beginning of the universe by determining when specific chemical elements were formed. To do so, he wants to find 10 quasars, the bright centers of distant galaxies, at the edge of the universe. These will serve as search lights, illuminating other nearby objects. The data available to him at Cambridge is restricted to a survey captured in infrared light bands, in which quasars shine brightly. But he needs to cross-reference a similar survey captured in the optical bands, where quasars are faint. The comparison will dramatically narrow his search.
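The cross-referencing described above can be pictured as a positional match between two catalogs. What follows is only a minimal sketch in Python, not SkyQuery's actual algorithm: the catalog layout, the field names (`ra`, `dec`, `ir_mag`, `opt_mag`), the 2-arcsecond match radius and the magnitude thresholds are all assumptions made for illustration, and the sample objects are invented.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions, in degrees (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def cross_match(infrared, optical, radius_arcsec=2.0):
    """Pair each infrared source with its nearest optical source inside the radius, else None."""
    radius_deg = radius_arcsec / 3600.0
    pairs = []
    for ir in infrared:
        best, best_sep = None, radius_deg
        for opt in optical:
            sep = angular_separation_deg(ir["ra"], ir["dec"], opt["ra"], opt["dec"])
            if sep <= best_sep:
                best, best_sep = opt, sep
        pairs.append((ir, best))
    return pairs

def quasar_candidates(pairs, ir_bright=16.0, opt_faint=21.0):
    """Keep sources bright in infrared and faint (or undetected) in optical.

    Astronomical magnitudes run backwards: a smaller number means a brighter object.
    """
    keep = []
    for ir, opt in pairs:
        faint_in_optical = opt is None or opt["opt_mag"] >= opt_faint
        if ir["ir_mag"] <= ir_bright and faint_in_optical:
            keep.append(ir["id"])
    return keep

# Invented sample data, for illustration only.
infrared = [
    {"id": "IR-1", "ra": 180.0000, "dec": 0.0000, "ir_mag": 15.2},
    {"id": "IR-2", "ra": 180.0100, "dec": 0.0100, "ir_mag": 15.8},
    {"id": "IR-3", "ra": 180.0200, "dec": 0.0200, "ir_mag": 18.5},
]
optical = [
    {"id": "OPT-1", "ra": 180.0001, "dec": 0.0001, "opt_mag": 22.3},
    {"id": "OPT-2", "ra": 180.0100, "dec": 0.0100, "opt_mag": 17.0},
]

candidates = quasar_candidates(cross_match(infrared, optical))
print(candidates)
```

The nested loop compares every infrared source with every optical source, which is fine for a handful of objects but would not scale to 100 million; a real system would rely on spatial indexing inside the database.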
Cambridge doesn't have an optical survey, meaning McMahon will have to use databases both from Cambridge and from other locations -- and it's not convenient to put all of this data in one place.
The solution is SkyQuery -- a Web portal that allows astronomers to cross-reference and combine data contained in disparate databases via the Internet. Szalay and several other Johns Hopkins researchers developed SkyQuery, with help from Microsoft distinguished engineer Jim Gray. The portal allows visitors to search and compare data on four online databases on two continents, including McMahon's Cambridge database. As many as six additional databases are scheduled to be added in the coming months.
The researchers built the portal on the .NET Framework, a Microsoft development environment that includes a common language runtime and prepackaged development resources that streamline construction of the Web services that underlie and facilitate the offerings on sites like SkyQuery.
The SkyQuery portal answers search requests by joining the data from the archives. It determines how to best route queries among the databases. A Web service "wrapper" at each participating database hides any differences among the protocols used by the different databases and provides a uniform view to the portal.
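The wrapper idea can be sketched in a few lines of Python. This is an illustration of the general pattern rather than SkyQuery's own code (which, according to the researchers, was written in C#). The class names, the box-shaped sky region, the simple coordinate tolerance and the "smallest archive first" routing rule are all assumptions chosen to keep the sketch short.

```python
from abc import ABC, abstractmethod

def inside(ra, dec, region):
    """True if a position falls in a (ra_min, ra_max, dec_min, dec_max) box."""
    ra_min, ra_max, dec_min, dec_max = region
    return ra_min <= ra <= ra_max and dec_min <= dec <= dec_max

class ArchiveWrapper(ABC):
    """Uniform view the portal sees, whatever the archive stores underneath."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def fetch(self, region):
        """Return a list of (object_id, ra, dec) tuples inside the region."""

    def count(self, region):
        return len(self.fetch(region))

class RowArchive(ArchiveWrapper):
    """Hypothetical archive that keeps one dict per object."""

    def __init__(self, name, rows):
        super().__init__(name)
        self._rows = rows

    def fetch(self, region):
        return [(r["id"], r["ra"], r["dec"]) for r in self._rows
                if inside(r["ra"], r["dec"], region)]

class ColumnArchive(ArchiveWrapper):
    """Hypothetical archive that keeps parallel columns instead of rows."""

    def __init__(self, name, ids, ras, decs):
        super().__init__(name)
        self._ids, self._ras, self._decs = ids, ras, decs

    def fetch(self, region):
        return [(i, ra, dec) for i, ra, dec in zip(self._ids, self._ras, self._decs)
                if inside(ra, dec, region)]

def plan_route(archives, region):
    """Assumed heuristic: visit the archive with the fewest objects first."""
    return sorted(archives, key=lambda a: a.count(region))

def federated_join(archives, region, tolerance_deg=0.001):
    """Carry a shrinking set of matches from archive to archive along the route."""
    route = plan_route(archives, region)
    matches = [({route[0].name: obj_id}, ra, dec)
               for obj_id, ra, dec in route[0].fetch(region)]
    for archive in route[1:]:
        found = archive.fetch(region)
        survivors = []
        for ids, ra0, dec0 in matches:
            for obj_id, ra, dec in found:
                if abs(ra - ra0) <= tolerance_deg and abs(dec - dec0) <= tolerance_deg:
                    survivors.append(({**ids, archive.name: obj_id}, ra0, dec0))
                    break
        matches = survivors
    return [ids for ids, _, _ in matches]

# Invented sample archives, for illustration only.
infrared = RowArchive("infrared", [
    {"id": "IR-1", "ra": 10.0, "dec": 5.0},
    {"id": "IR-2", "ra": 10.5, "dec": 5.5},
])
optical = ColumnArchive("optical", ids=["OPT-1", "OPT-2", "OPT-3"],
                        ras=[10.0002, 10.2, 10.7], decs=[5.0001, 5.2, 5.7])
region = (9.0, 11.0, 4.0, 6.0)

route = [a.name for a in plan_route([optical, infrared], region)]
result = federated_join([optical, infrared], region)
print(route, result)
```

The portal code never looks at how either archive stores its data; it only calls `fetch` and `count`. Adding another database means writing one more wrapper class, not changing the join logic.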
Another Web service on SkyQuery allows amateur and professional astronomers to create advanced overlays of celestial images to help pinpoint stars or galaxies. Others make it possible to home in on specific celestial objects by pointing and clicking on larger galactic images -- or even to extract images from SkyQuery onto another Web site.
The portal's Web services are built on Extensible Markup Language (XML) and Simple Object Access Protocol (SOAP), making it possible to transmit information between different computing platforms. "A platform-independent framework was essential because different astronomers work on different machines and operating systems, and they may use different types of databases," says Tamas Budavari, an assistant research scientist in the Center for Astrophysical Sciences at Johns Hopkins. "Web services are ideal for this purpose."
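To show why an XML message is platform-independent, here is a rough sketch in Python of building and reading a SOAP 1.1 request. The envelope namespace is the standard SOAP 1.1 one; everything else, including the `FindObjects` operation, its parameters and the `urn:example:sky-archive` namespace, is hypothetical and does not describe SkyQuery's real service interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:sky-archive"                     # hypothetical application namespace

def build_request(ra, dec, radius_arcsec):
    """Serialize a search request as a SOAP envelope (plain text, any platform can read it)."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}FindObjects")
    for tag, value in (("ra", ra), ("dec", dec), ("radiusArcsec", radius_arcsec)):
        ET.SubElement(call, f"{{{APP_NS}}}{tag}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

def parse_request(xml_text):
    """Recover the parameters on the receiving side, using only the XML text."""
    root = ET.fromstring(xml_text)
    body = root.find(f"{{{SOAP_NS}}}Body")
    call = body.find(f"{{{APP_NS}}}FindObjects")
    # Each child tag looks like "{namespace}name"; keep only the local name.
    return {child.tag.split("}")[1]: float(child.text) for child in call}

request_xml = build_request(180.0, 0.5, 2.0)
parsed = parse_request(request_xml)
print(request_xml)
print(parsed)
```

The receiver needs nothing from the sender except the text of the message, which is the property Budavari describes: the two ends can run on different machines, operating systems and databases.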
McMahon predicts it will take him a few days of work to complete the cross-matching via SkyQuery. "Given budget and other restrictions, it would have been nearly impossible for me to do this research without SkyQuery and its core Microsoft technology," he says.
The Microsoft technology that built SkyQuery, Szalay says, couldn't have been available at a better time for the astronomy world. "With so much information in so many databases around the world, we cannot move all the data to where the analysis is being done. We need to bring the analysis to the data by dividing the computation up among the archives," he explains. "Web Services and other Microsoft technologies provided a flexible, integrated environment that let us quickly build this sophisticated application."
Microsoft Technology Streamlines Development of Online Resources, Data Management
The ease with which Szalay and his colleagues built SkyQuery still surprises them. Start to finish, the service took two months to build, using the C# programming language and Visual Studio .NET and other Microsoft developer tools.
Prior to .NET, it was somewhat of a "black art" to hook up disparate databases, Szalay says. "With .NET, it took several hundred lines, as opposed to thousands and thousands of lines of code that it would have taken in the past."
"Visual Studio .NET allows you to develop a program, push a button and all of a sudden, it's an object on the Internet," he explains. "That's really a breakthrough. It makes programming for distributed computing very accessible."
Moreover, Visual Studio .NET makes it easier to find faults within the services. "Parts of the SkyQuery code run in SQL language. Some parts run in C# or C++," Szalay says. "Without such an integrated environment, debugging programs that include so many different languages would be really hard. But with Visual Studio .NET, it was really very natural."
Cambridge's McMahon and other researchers report similar efficiencies when using the .NET tools and other Microsoft products, namely Microsoft SQL Server 2000, to get their data into a database and online. It took McMahon, a SQL novice, less than a month in his spare time to figure out how to use the database, download a subset of Cambridge's data into the server and begin making queries on the data.
Astronomers at the University of Edinburgh in Scotland have found SQL Server 2000 similarly easy to use and maintain. "Like a lot of astronomers, we didn't have any experience using databases and couldn't afford to hire a database administrator. Ease of use and manageability were major issues," said Edinburgh researcher Bob Mann. "We have been extremely impressed using SQL. It will greatly improve the way that we do our astronomy."
Yellow Pages of Northern Skies Relies on SQL Server
SQL Server has done more than improve the way the researchers involved in the Sloan Digital Sky Survey (SDSS) do their science. It has transformed their work. SDSS, a collaborative research project making the most complete map of the skies visible from the Northern Hemisphere, chose SQL Server 2000 as the database server for its ongoing Sky Survey and SkyServer Web site. The site and survey, partially funded by the National Science Foundation, currently offer online access to 800 gigabytes (GB) of data -- including images of 80 million stars and galaxies -- from SDSS' first public data release. In all, it presents 3 billion rows of data.
Data for the survey are generated via telescopes at Apache Point Observatory in New Mexico and processed at the U.S. Department of Energy's Fermi National Accelerator Laboratory. When completed in 2007, the survey and site should have close to 5 TBs of catalog data and 25 TBs of data overall, with images and information for more than 200 million astronomical objects, cementing their status as the ultimate repository and online Yellow Pages of the Northern skies.
SQL Server offers built-in data-mining tools to extract complex patterns from deep within large stores of data -- whether it is a database of a billion stars or a retail chain's sales records for the past 10 years. SQL Server completes many queries of the SDSS data in a matter of seconds and all but the most complex searches in less than a minute -- a fraction of the time it took astronomers to make similar searches using other automated methods.
SDSS researchers have also realized significant time savings when building and maintaining the database and when they construct queries of information in the database. They were able to load the first version of the SDSS data catalogs into the SQL database in a few hours. It took several days with a previous system.
Szalay, who is responsible for the database aspects of the SDSS, previously spent too much of his research time sorting, manipulating and filtering data. "This now takes 10 seconds with SQL," he says. "We now can focus more time on the core science because we are not wasting time rewriting code."
To demonstrate the point, Szalay tells of a challenge from colleagues at another university. They challenged him and Microsoft researcher Gray to find a particular fast-moving asteroid in less than the 13 days it took them -- 10 days to develop the search code and three more to search the data stored in a flat file system. Szalay and Gray wrote a SQL query and located the asteroid within hours, using the SDSS database.
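The article does not reproduce the query Szalay and Gray wrote, so the following Python sketch only illustrates the general idea: once the data sit in a relational database, a search becomes a short declarative statement instead of custom code. Python's built-in `sqlite3` module stands in for SQL Server here, and the `detections` table, its columns, the units (arcseconds per hour) and the speed threshold are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE detections (
        obj_id   TEXT,
        ra       REAL,   -- position, degrees
        dec      REAL,   -- position, degrees
        ra_rate  REAL,   -- apparent motion, arcsec per hour (assumed unit)
        dec_rate REAL    -- apparent motion, arcsec per hour (assumed unit)
    )
""")
# Invented rows: two nearly stationary objects and two that move quickly.
conn.executemany(
    "INSERT INTO detections VALUES (?, ?, ?, ?, ?)",
    [
        ("star-1",   185.1, 2.3,  0.0,  0.1),
        ("galaxy-7", 185.4, 2.9,  0.2,  0.0),
        ("mover-42", 185.2, 2.5, 60.0, 80.0),
        ("mover-9",  185.3, 2.6, 45.0, 60.0),
    ],
)

# Select objects whose total apparent speed exceeds a threshold, fastest first.
# Comparing squared speeds avoids needing a square-root function in SQL.
query = """
    SELECT obj_id,
           ra_rate * ra_rate + dec_rate * dec_rate AS speed_sq
    FROM detections
    WHERE ra_rate * ra_rate + dec_rate * dec_rate > :min_speed * :min_speed
    ORDER BY speed_sq DESC
"""
fast_movers = [row[0] for row in conn.execute(query, {"min_speed": 50.0})]
print(fast_movers)
```

The whole search is the `SELECT` statement; changing the question means editing a few lines of SQL, not rewriting and rerunning a search program over flat files.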
The SkyServer site, which launched in 2001, has become a popular destination for professional and amateur astronomers and students, attracting more than 1 million hits a month. The site is offered in three languages -- English, German and Japanese -- and provides 150 hours of educational resources, along with Web services for manipulating images and data on the site.
"The speed and ease of use of SQL Server makes it possible for almost anyone to perform advanced searches on a massive online database," says Ani Thakar of SDSS. "Just as importantly, the stability and easy maintenance of SQL Server 2000 is vital for researchers as we add more data to SkyServer and our other online resources."
Astronomy Setting an Example for Other Sciences
The advances that astronomy is making with database, Web services and other technologies provide a compelling blueprint for other sciences to follow, Szalay says. Like astronomy, biology and other fields of science are getting buried under a flood of new data. "We can help other sciences loosen up their rigid ways of doing certain things," he says. "If they look and see how astronomy is benefiting, they will realize they can do the same thing."
This example will be even more compelling as the virtual worldwide telescope evolves -- one that expands on the example provided by SkyQuery. What's missing? More online bandwidth. More manageable systems that more easily run in parallel. The ability to handle petabyte-scale datasets. "But all of these things seem tractable," Gray says. "It's just a matter of time and energy. This will happen."
And it appears Microsoft technology and researchers will be an intimate part of the effort, helping uncover those proverbial needles in the sky.