SkyServer: the Universe at Your Fingertips

The SkyServer is a multi-terabyte astronomy archive containing the data of the Sloan Digital Sky Survey (SDSS) project, a collaborative effort among public and private organizations to create the most complete map of the northern sky. The SkyServer has its origins in the TerraServer. It presents the SDSS data as a web-accessible database, along with visual tools to analyze the data. The result is an SQL database with approximately 3 billion rows describing approximately 400 million celestial objects and 1 million spectra, with full graphical user interface (GUI) and SQL access to the SDSS data. Now everyone can use one of the world’s best telescopes. The SkyServer also includes 200 hours of online instruction that teaches astronomy and computational science by using this data. That has been a big success: approximately 10 percent of the visitors are students using these online courses. The SkyServer has 1 million distinct users, which is surprising given that there are only about 15,000 professional astronomers in the world. Based on an analysis of citation records, the SkyServer has been the world’s most used astronomy facility in three of the last few years.

The SkyServer gives interactive access to the data via a point-and-click virtual telescope view of the pixel data and via canned reports generated from the online catalogs. It also allows ad hoc catalog queries. All data is accessible via standard browsers. A Java GUI client interface lets users pose SQL queries, while Python and Emacs interfaces allow client scripts to access the database. All of these clients use the same public HTTP/SOAP/XML interfaces. The SkyServer exposes public web services that give users programmatic access to the data and to the data analysis tools, a classic service-oriented architecture. Some astronomy queries run for a very long time, however, so the SkyServer also has a batch job system that lets users submit long-running jobs and create personal databases (MyDB) near the server. MyDB stores intermediate results and uploaded user data, and allows users to perform multi-step analysis on huge datasets.
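The multi-step pattern described above, in which a first query stores an intermediate result that later queries build on, can be illustrated with a small sketch. This is not the SkyServer or MyDB interface: it uses an in-memory SQLite database as a stand-in, and the table names (catalog, mydb_region), the columns, and the function run_multistep_analysis are hypothetical names chosen only for illustration.

```python
import sqlite3


def run_multistep_analysis(objects, ra_min, ra_max, dec_min, dec_max):
    """Two-step analysis that keeps an intermediate result in its own table.

    In-memory SQLite stands in for both the archive and a MyDB-style
    personal database. All table and column names are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical catalog table; the real SDSS schema is far richer.
        conn.execute(
            "CREATE TABLE catalog ("
            "obj_id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)"
        )
        conn.executemany("INSERT INTO catalog VALUES (?, ?, ?, ?)", objects)

        # Step 1: select a rectangular sky region and store it as an
        # intermediate result, as a personal database would.
        conn.execute(
            "CREATE TABLE mydb_region ("
            "obj_id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)"
        )
        conn.execute(
            "INSERT INTO mydb_region "
            "SELECT obj_id, ra, dec, mag FROM catalog "
            "WHERE ra BETWEEN ? AND ? AND dec BETWEEN ? AND ?",
            (ra_min, ra_max, dec_min, dec_max),
        )

        # Step 2: analyze only the stored subset. A smaller magnitude
        # means a brighter object, so MIN(mag) picks the brightest one.
        count, brightest_mag = conn.execute(
            "SELECT COUNT(*), MIN(mag) FROM mydb_region"
        ).fetchone()
        return count, brightest_mag
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        (1, 180.5, 0.5, 17.2),
        (2, 180.9, 0.1, 15.8),
        (3, 200.0, 10.0, 14.0),  # outside the selected region
    ]
    print(run_multistep_analysis(sample, 180.0, 181.0, 0.0, 1.0))
```

Keeping the region selection in its own table means that follow-up questions run against the small stored subset rather than the full catalog, which is the kind of reuse a personal database next to the server is meant to support.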

The ability to pose questions in a few hours and get answers in a few minutes changes the way scientists view the data; they can experiment interactively. When queries take three days and hundreds of lines of code, scientists ask fewer questions and so get far fewer answers. This and similar experiences suggest that interactive access to scientific data and data mining tools can dramatically improve productivity.

The SkyServer is also an educational tool. Several interactive astronomy projects, from elementary to graduate level, have been developed in three languages (English, Japanese, and German). Interest in this aspect of the SkyServer continues to grow. The SkyServer design has been cloned by several other observatories and projects: the Royal Observatory, Edinburgh; the Cornell Arecibo Pulsar Search; Caltech Quest; the Space Telescope Science Institute, for the Hubble and GALEX datasets; and the National Optical Astronomy Observatory in Tucson. The SkyServer’s framework has also been used as the template for radiation oncology and environmental sensing data.

SkyServer was one of the earliest projects to use web services to access and serve large datasets and image files. With many terabytes of data and imagery, the application explored the scaling capability of Microsoft SQL Server database technology. The project also inspired a new generation of astro-informaticians to take advantage of the power of relational database technology.

Primary Researchers

Jim Gray

Jim Gray, Ph.D., was a Distinguished Engineer in the Microsoft Research Bay Area eScience group. Jim’s long career with Microsoft, Digital Equipment, Tandem Computers, and IBM produced seminal work in relational database management systems, transaction processing systems, and the sciences. Jim’s vision and his work on applying computing technologies to data-intensive sciences inspired the collection of insightful essays in The Fourth Paradigm: Data-Intensive Scientific Discovery.

Alexander S. Szalay

Alexander S. Szalay is a professor of Astrophysics and Computer Science at the Johns Hopkins University. He is a cosmologist whose work spans a broad area from astrophysics to statistics and computer science. He collaborated with Jim Gray to extend the ideas from the SkyServer to the Virtual Observatory and to other scientific disciplines. He currently works on building large scientific databases and on data-intensive computing.