Big Data Blows into the Windy City
This week, the annual Microsoft eScience Workshop is being held in Chicago (the “Windy City”), providing an unparalleled opportunity for domain scientists, researchers, and technologists to discuss the benefits and difficulties of incorporating more computing and information technology into the scientific process. Over the years, the eScience workshop has provided a forum where scientists could voice their data and technology challenges and get input from those who’ve confronted similar issues.
Front and center this year are topics related to Big Data—be it the management of the rising data flood, the analysis of the data tsunami, or even the visualization of the data explosion. In addition, this year’s workshop explores questions about how to train and develop data scientists, and how citizen scientists can play a role in gaining insights from the vast amounts of information.
Many of these topics are examined in the book, The Fourth Paradigm: Data-Intensive Scientific Discovery, which is an excellent resource for these discussions. And, as evidenced in that book, the Big Data “opportunity” has actually been building for some time—but now it has reached the tipping point in terms of awareness across more science domains. The commoditization of devices, sensors, storage, and connectivity—paired with technologies like cloud computing—has made the idea of capturing and maintaining all data in those science domains a plausible reality. As a result, scientists are thinking about what can be done, rather than lamenting what could be done if only they had the research infrastructure.
In preparing for this year’s event, I looked back at the very first Microsoft eScience Workshop, held in 2004. I revisited Jim Gray’s keynote and put together this six-slide composite of the main challenges Jim identified back then. As you’ll notice, while some progress has been made, many of those challenges are still being addressed. For instance, global federation remains a key issue for distributed and disparate databases. Do you move all the data to one location? Or do you leave the data with its owners, so that they continue to curate it and safeguard the quality of the datasets? SkyQuery has advanced federation by demonstrating how multiple datasets can be queried seamlessly and by implementing novel approaches, such as spatial join queries. If you want more details, check out the paper, SkyQuery: A WebService Approach to Federate Databases.
Six-slide composite of the main challenges that Jim Gray identified at the first Microsoft eScience Workshop in 2004
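To make the idea of a spatial join a little more concrete, here is a minimal sketch of cross-matching two sky catalogs by angular separation. It is a naive illustration only, not SkyQuery’s actual algorithm (see the paper for that); the function names, the catalog layout of (id, RA, Dec) tuples in degrees, and the sample values are all assumptions made up for this example.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cross_match(catalog_a, catalog_b, max_sep_deg):
    """Naive spatial join: pair every object in A with every object in B
    that lies within max_sep_deg of it. Each catalog is a list of
    (object_id, ra_deg, dec_deg) tuples (a hypothetical layout)."""
    matches = []
    for id_a, ra_a, dec_a in catalog_a:
        for id_b, ra_b, dec_b in catalog_b:
            if angular_separation_deg(ra_a, dec_a, ra_b, dec_b) <= max_sep_deg:
                matches.append((id_a, id_b))
    return matches

# Two tiny made-up catalogs, standing in for datasets held by different owners.
optical = [("a1", 10.0, 20.0), ("a2", 150.0, -5.0)]
infrared = [("b1", 10.0005, 20.0002), ("b2", 200.0, 45.0)]
pairs = cross_match(optical, infrared, max_sep_deg=0.001)
```

The nested loop compares every pair, which is only workable for small inputs. A federated system would want to push as much filtering as possible to where each dataset lives rather than moving all the data to one place.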
To truly tackle these data challenges, scientific datasets need three attributes: discoverability, accessibility, and consumability. If a dataset doesn’t have all three, it might as well be kept in a file cabinet. Much work has been done lately on discoverability: for example, the emergence of various “data.gov” domain science catalogs, and even commercial ones like the Windows Azure Marketplace. The “Open Data for Open Science” session at this year’s eScience Workshop explores how to address some of these challenges from the science side and looks at how simple, Internet-based protocols, such as OData (the Open Data Protocol), can help ensure that the end-user scientist can use the data.
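As a rough illustration of why a simple, Internet-based protocol helps with consumability, the sketch below builds an OData query URL using the standard system query options $filter, $select, and $top. The service root, the entity set name "Observations", and the property names are hypothetical placeholders, not a real endpoint.

```python
from urllib.parse import urlencode, quote

def build_odata_url(service_root, entity_set, filter_expr=None, select=None, top=None):
    """Build an OData query URL from a few standard system query options."""
    options = {}
    if filter_expr:
        options["$filter"] = filter_expr
    if select:
        options["$select"] = ",".join(select)
    if top is not None:
        options["$top"] = str(top)
    url = f"{service_root.rstrip('/')}/{entity_set}"
    if options:
        # Keep "$" and "," readable; percent-encode spaces and other characters.
        url += "?" + urlencode(options, quote_via=quote, safe="$,")
    return url

# Hypothetical service: ask for five observations warmer than 20 degrees.
url = build_odata_url(
    "https://example.org/odata/",
    "Observations",
    filter_expr="Temperature gt 20",
    select=["Station", "Temperature"],
    top=5,
)
```

Because the whole request is an ordinary URL, a scientist can paste it into a browser, a spreadsheet, or a script without needing a special client library.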
The Monday evening event at the Adler Planetarium showcases how scientific data and information can be communicated to the public through amazing 3-D tours powered by Microsoft Research WorldWide Telescope (WWT) and brought to life in the planetarium’s Grainger Sky Theater. Microsoft researcher Jonathan Fay, architect of WWT, has been working with the Adler to ensure that tours originally developed to be shown in a planetarium can be taken home and experienced later. An example of the great work from the Adler is the Welcome to the Universe show and the WWT tour narrated by astronomer Mark SubbaRao. You can play the tour in your browser, and you can find more tours powered by WorldWide Telescope at the Layerscape website.
Whether you’re attending the Microsoft eScience Workshop or just wishing you could, I encourage you to dive into these Big Data challenges.
—Dan Fay, Director, Earth, Energy, and Environment; Microsoft Research Connections
- Microsoft eScience Workshop 2012
- The Fourth Paradigm: Data-Intensive Scientific Discovery
- Science@Microsoft—The Fourth Paradigm in Practice Book (PDF, 10 MB)
- Science@Microsoft Stories
- Windows Azure Marketplace
- Open Data Protocol (OData)
- WorldWide Telescope
- eScience at Microsoft Research Connections