Project Trident: Navigating a Sea of Data
By Rob Knies, Managing Editor, Microsoft Research
How deep is the ocean? Geologically, the answer is straightforward: almost seven miles. This we know from a series of surveys, beginning in the 19th century, of the depth of the Mariana Trench, near Guam in the North Pacific, a boundary between two tectonic plates that is understood to be the deepest point in the world’s oceans.
When it comes to understanding what transpires in the ocean, however, the question becomes immensely more challenging. The complexities of ocean dynamics remain a profound mystery. Water is effectively opaque to electromagnetic radiation, meaning that the floor of the oceans, which drive biological and climatic systems with fundamental implications for terrestrial life, has not been mapped as thoroughly as the surfaces of some of our fellow planets in the solar system. The oceans, covering 70 percent of the globe, represent Earth’s vast, last physical frontier.
Roger Barga is helping to unlock those secrets.
Barga, principal architect for the External Research division of Microsoft Research, heads Project Trident: A Scientific Workflow Workbench, an effort to make complex data visually manageable, enabling science to be conducted at a large scale.
Working with researchers at the University of Washington, the Monterey Bay Aquarium Research Institute, and others, Barga and his colleagues on External Research’s Advanced Research Tools and Services group have developed a mechanism for extending the Windows Workflow Foundation, which is based on the Microsoft .NET Framework, to combine visualization and workflow services, enabling better management and evaluation of, and interaction with, complex data sets.
Project Trident was presented on July 13 during the 10th annual Microsoft Research Faculty Summit. The workbench is available as a research development kit on DVD; future releases will be available on CodePlex.
“Scientific workflow has become an integral part of most e-research projects,” Barga says. “It allows researchers to capture the process by which they go from raw data to actual final results. They are able to articulate these in workflow schedules. They can share them, they can annotate them, they can edit them very easily.
“A repertoire of these workflows becomes a workbench, by which scientists can author new experiments and run old ones. It also is a platform to which you can attach services like provenance [in this case, the origin of a specific set of information or data]. It becomes this wonderful environment in which researchers can do their research, capture the results, and share their knowledge. That’s what scientific workflow is all about.”
Project Trident, which includes fault tolerance and the ability to recover from failures, has the potential to make research more efficient. Scientists spend a lot of time validating and replicating their experiments, and the workbench can capture every step of an experiment and enable others to check or rerun it by setting different parameters.
True to its namesake in classical mythology, Project Trident’s first implementation is to assist in the data management for a seafloor-based research network called the Ocean Observatories Initiative (OOI), formerly known as NEPTUNE.
The OOI, a $400 million effort sponsored by the National Science Foundation, will produce a massive amount of data from thousands of ocean-based sensors off the coast of the Pacific Northwest. The first Regional Cabled Observatory will consist of more than 1,500 kilometers of fiber-optic cable on the seafloor of the Juan de Fuca plate. Affixed to the cable will be thousands of chemical, geological, and biological sensors transmitting continuous streaming data for oceanographic analysis.
The expectation is that this audacious undertaking will transform oceanography from a data-poor discipline to one overflowing with data. Armed with such heretofore inaccessible information, scientists will be able to examine issues such as the ocean’s ability to absorb greenhouse gases and to detect seafloor stresses that could spawn earthquakes and tsunamis.
“It will carry power and bandwidth to the ocean,” Barga says, “and will allow scientists to study long-term ocean processes. I think it’s going to be a rich area for researchers to invest in and Microsoft to be a part of. It’s very compelling.”
Barga, who has been interested in custom scientific workflow solutions throughout his career, got involved with Project Trident in 2006. It should come as little surprise that his initial nudge in the direction that became Project Trident came from computer-science visionary Jim Gray.
“I had been with the group for only six weeks,” Barga recalls. “I wanted to engage in a project with external collaborators, and I reached out to Jim Gray, who consulted with Tony [Hey, corporate vice president of External Research].
“I asked Jim about what he thought would be a good opportunity to engage the scientific community. He introduced me to the oceanographers and computer scientists working on a project called NEPTUNE. He introduced me to a graduate student named Keith Grochow.”
Grochow was a doctoral student at the University of Washington studying visualization techniques to help oceanographers. He was being supervised by Ed Lazowska and Mark Stoermer of the university faculty. Barga met them, too. But it was Gray who put Barga on the Project Trident path.
“Jim described, during the course of an hour-long phone conversation, his idea behind an oceanographer’s workbench that would consist of sensors, data streaming in off the NEPTUNE array, and these beautiful visualizations of what was going on in the ocean appearing on the oceanographer’s desktop, wherever they were in the world,” Barga says. “He noted that we needed to be able to transform raw data coming in off the sensors in the ocean, invoking computational models and producing visualizations. He noted that workflow was exactly what was needed, and he knew my passion in the area.
“Hence, we started off building a specific scientific workflow solution for the oceanographers, for NEPTUNE. That project delivered its first prototype in three months, and we validated that we can support scientific workflow on Windows Workflow.”
Along the way, Barga and associates became aware that their work on Project Trident was extensible to other scientific endeavors.
“We realized we had an incredible amount to offer other groups,” Barga says. “Several groups acknowledged they were spending too much time supporting their platform.”
Before long, Barga found himself collaborating with astronomers from Johns Hopkins University to develop an astronomer’s workbench to support the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS), an effort to combine relatively small mirrors with large digital cameras to produce an economical system that can observe the entire available sky several times each month. The goal of Pan-STARRS, which is being developed at the University of Hawaii’s Institute for Astronomy, is to discover and characterize Earth-approaching objects, such as asteroids and comets, that could pose a danger to Earth.
Such work was made possible by ensuring that the work on Project Trident could be generalized to other scientific domains.
“We were able to look back on all the existing workflow systems and build upon the best design ideas,” Barga says. “That allowed us to move forward very fast. In addition, we chose two or three different problems to work on. Not only were we working on the oceanographic one, we looked at how we could support astronomy with Pan-STARRS, a very different domain, a very different set of requirements.
“If you design a system with two or three different customers in mind, you generalize very well. You come up with a very general architecture. One of the challenges we had to overcome was to not specialize on just one domain, or it would be too specialized a solution. Pick two or three, and balance the requirements so you build a general, extensible framework. We think we’ve done that.”
Project Trident also exploits the powerful graphics capabilities of modern computers.
“The gaming industry has created this amazing graphics engine available on every PC, yet the resource has been largely ignored by the scientific community,” says Grochow, whose doctoral thesis will be based on the NEPTUNE project. He adds that the same graphical tools that enable gamers to battle monsters or to fly virtual aircraft can be used instead of cumbersome text and formula entries to achieve many scientific tasks.
The University of Washington’s Collaborative Observatory Visualization Environment (COVE) was running out of funding when Microsoft Research got involved. Microsoft supplied financial and technical support to enable COVE to thrive, says Stoermer, director of the university’s Center for Environmental Visualization.
“COVE really is about taking a gaming perspective to research,” he says. “And in the long run, we see this as applicable well beyond oceanography.”
John Delaney, professor of oceanography at the University of Washington, and Deb Kelley, an associate professor of marine geology and geophysics at the university, also have been key collaborators on the project, as has Jim Bellingham and his team at the Monterey Bay Aquarium Research Institute.
“They have given us very valuable feedback,” Barga says, “on the role workflow will play in their environment.”
In computer science, the concept of workflow refers to detailed code specifications for running and coordinating a sequence of actions. The workflow can be simple and linear, or it can be a conditional, many-branched series with complex feedback loops. Project Trident enables sophisticated analysis in which scientists can write a desired sequence of computational steps and data flow, ranging from data capture from sensors or computer simulations, through data cleaning and alignment, to the final visualization of the analysis. Scientists can explore data in real time; compose, run, and catalog experiments; and add custom workflows and data transformations for others to use. But the concept required some convincing.
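To make the idea concrete, here is a minimal sketch in Python of a simple, linear workflow: capture, clean, summarize, visualize. It is an illustration only and does not use Project Trident or Windows Workflow. The step names, the sample readings, and the 40-degree spike threshold are all assumptions invented for the example.

```python
from typing import Callable, Dict, List

# Hypothetical step type: each step takes the working data and returns new data.
Step = Callable[[Dict], Dict]

def capture(data: Dict) -> Dict:
    # Stand-in for sensor capture: made-up readings, None marks a dropout.
    return {**data, "readings": [4.1, None, 4.3, 97.0, 4.2]}

def clean(data: Dict) -> Dict:
    # Drop missing values and implausible spikes (the threshold is an assumption).
    good = [r for r in data["readings"] if r is not None and r < 40.0]
    return {**data, "readings": good}

def summarize(data: Dict) -> Dict:
    # Reduce the cleaned readings to a single mean value.
    readings = data["readings"]
    return {**data, "mean": sum(readings) / len(readings)}

def visualize(data: Dict) -> Dict:
    # Stand-in for visualization: a one-line text chart.
    bar = "#" * round(data["mean"])
    return {**data, "chart": f"mean temp {data['mean']:.1f} C {bar}"}

def run_workflow(steps: List[Step], data: Dict) -> Dict:
    # A linear workflow: feed the output of each step into the next.
    for step in steps:
        data = step(data)
    return data

result = run_workflow([capture, clean, summarize, visualize], {})
print(result["chart"])
```

Because the workflow is just an ordered list of steps, it can be stored, shared, and rerun with a different step swapped in. A conditional or branching workflow would need a richer structure than a list.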
“It’s been an interesting journey,” Barga smiles. “When we started this a year and a half ago, in the oceanographic community the response was, ‘What’s workflow?’ It took a long dialogue and a series of demonstrations.
“Fast forward 16 months, and people are keen to embrace a workflow system. They’re actually thinking about their problems as workflows and repeating them back to us: ‘I have a workflow. Let me explain it to you.’ Their awareness has been raised significantly in the oceanographic community.”
The deluge of scientific data requires tools not only to manage the data, but also to use the vast computing resources of data centers. Another Microsoft Research technology, DryadLINQ, can help in that regard.
“Researchers need to have automated pipelines to convert that data into useful research objects,” Barga explains. “That’s where tools like workflow and Trident come into play. Then researchers have a very large cluster, but no means by which to efficiently program against it. That’s where DryadLINQ comes into play. They can take a sequential program and schedule that thing over 3,000 nodes in a cluster and get very high distributed throughput.
“We envision a world where the two actually work together. All that data may invoke a very large computation room, may require very detailed analysis or cleaning. If we use DryadLINQ over a cluster, we may be able to do data-parallel programming and bring the result back into the workflow.”
A group of researchers at Microsoft Research Silicon Valley have been working on the Dryad and DryadLINQ projects for more than four years. The goal of their research is to make distributed data-parallel computing easily accessible to all developers. Developers write programs using LINQ and .NET as if they were programming for a single computer. Dryad and DryadLINQ automatically take care of the hard problems of parallelization and distributed execution on clusters consisting of thousands of computers. Dryad and DryadLINQ have been used on a wide variety of applications, including relational queries, large-scale log mining, Web-graph analysis, and machine learning. The tools are available as no-cost downloads to academic researchers and scientists.
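The general pattern described here, writing ordinary sequential code and letting a runtime spread it over partitions of the data, can be sketched in a few lines of Python. This is not DryadLINQ and does not reflect its API. It uses the standard library's thread pool on a single machine, and the function names and sample log lines are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List

def count_words(lines: List[str]) -> Counter:
    # The "sequential program": counts words in one partition of the log.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def partition(items: List[str], n: int) -> List[List[str]]:
    # Split the input into n roughly equal partitions.
    return [items[i::n] for i in range(n)]

def parallel_count(lines: List[str], workers: int = 4) -> Counter:
    # Run the same function on every partition, then merge the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_words, partition(lines, workers))
    total = Counter()
    for partial in partials:
        total += partial
    return total

log = ["GET /index", "GET /about", "POST /login", "GET /index"]
totals = parallel_count(log, workers=2)
print(totals.most_common(2))
```

The merged result is the same as running `count_words` over the whole log at once, which is what lets the author of the sequential function ignore how the work is distributed.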
The objectives of Project Trident are to enable researchers to dig into large-scale projects and to analyze impenetrably complex problems.
“Beyond showing that Windows Workflow could be used as an underlying engine,” Barga says, “the goal was to explore new services that we could build on top of workflow, such as automatic provenance capture. Project Trident has a feature that allows the researcher to generate a result—an image or a gif or a chart—and to export it to a Word document. Not only do you get the image that you want to put into the document, but you get all the inputs required to rerun that workflow at a later date, should somebody want to reproduce the research. Project Trident has mechanisms by which it versions workflows, versions the data, records all this information, and exports enough information to make it possible to come back and rerun everything.
“This is a new capability that you don’t see in other systems.”
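A rough sketch of what such provenance capture might look like, in Python, follows. It is not Project Trident's mechanism or file format. The record layout, the function names, and the sample salinity values are assumptions chosen to show the idea: export a result together with the workflow version, inputs, and parameters needed to rerun it.

```python
import hashlib
import json
from typing import Any, Callable, Dict

def fingerprint(obj: Any) -> str:
    # Stable hash of JSON-serializable data, used to detect changed inputs.
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def run_with_provenance(name: str, version: str, func: Callable,
                        inputs: Any, params: Dict) -> Dict:
    # Run the workflow and keep everything needed to repeat the run.
    result = func(inputs, **params)
    return {
        "workflow": name,
        "workflow_version": version,
        "inputs": inputs,
        "input_hash": fingerprint(inputs),
        "params": params,
        "result": result,
    }

def rerun(record: Dict, func: Callable) -> Any:
    # Refuse to rerun if the stored inputs no longer match their hash.
    if fingerprint(record["inputs"]) != record["input_hash"]:
        raise ValueError("inputs changed since the record was written")
    return func(record["inputs"], **record["params"])

def scaled_mean(values, scale=1.0):
    # Hypothetical analysis step used as the workflow body.
    return scale * sum(values) / len(values)

record = run_with_provenance("mean-salinity", "1.0", scaled_mean,
                             [34.0, 35.0, 36.0], {"scale": 2.0})
exported = json.dumps(record)    # what would travel with the document
restored = json.loads(exported)  # what a later reader would load
print(rerun(restored, scaled_mean) == record["result"])
```

Storing a hash alongside the inputs is one simple way to notice that data has been altered between the original run and the rerun. Versioning of the data itself, which the quote also mentions, is not shown.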
In addition to the practicality of such technology, there also is a specific research objective.
“As we move into more complex architectures—multicore, programming data centers—workflows are a wonderful abstraction for specifying the exact work that needs to be done and the order in which it needs to be done, a very natural way for users to express their intent. It also leaves this beautiful artifact called a schedule, which is an XML representation of these constraints. You could analyze it and then figure out how to schedule all that work on a multicore machine. You might have an eight- or 12-core machine sitting behind you, but on your next iteration of that same workflow, you may have a 30-node cluster. The scheduler can look at that XML representation of the work and do the scheduling.”
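The scheduling idea in that description can be illustrated with a small Python sketch. The XML layout below is invented for the example and is not the Windows Workflow format. The sketch reads the dependency constraints and groups activities into rounds that could run at the same time, given an assumed number of workers.

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter  # Python 3.9+
from typing import List

# Hypothetical schedule: each activity lists the activities it must follow.
SCHEDULE_XML = """
<workflow name="example">
  <activity id="capture"/>
  <activity id="clean" after="capture"/>
  <activity id="model" after="clean"/>
  <activity id="stats" after="clean"/>
  <activity id="visualize" after="model stats"/>
</workflow>
"""

def plan(xml_text: str, workers: int) -> List[List[str]]:
    # Build a dependency map, then emit rounds of at most `workers` activities.
    root = ET.fromstring(xml_text.strip())
    deps = {a.get("id"): set(a.get("after", "").split())
            for a in root.iter("activity")}
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    rounds = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        for i in range(0, len(ready), workers):
            rounds.append(ready[i:i + workers])
        sorter.done(*ready)
    return rounds

print(plan(SCHEDULE_XML, workers=1))
print(plan(SCHEDULE_XML, workers=8))
```

The same schedule yields five sequential rounds with one worker and four rounds with eight, because `model` and `stats` depend only on `clean` and can run side by side. The description of the work stays the same while the plan adapts to the hardware.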
The project also supports runtime adaptation: If an underwater earthquake occurs, the technology can pause running its normal workflows and initiate higher-priority workflows to enable the processing and visualization of data from the event. And Project Trident can help in cost estimation of required time and system resources.
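One way to picture this kind of runtime adaptation is a priority queue of workflows, sketched below in Python. This is an assumption about how such behavior could be modeled, not a description of Project Trident's internals. The class, the workflow names, and the priority values are hypothetical, and lower numbers mean higher priority.

```python
import heapq
import itertools
from typing import Iterator, Optional

def workflow(name: str, steps: int) -> Iterator[str]:
    # A workflow modeled as a generator that yields one label per step.
    for i in range(1, steps + 1):
        yield f"{name}:{i}"

class Scheduler:
    def __init__(self) -> None:
        self._queue = []
        self._order = itertools.count()  # tie-breaker for equal priorities

    def submit(self, priority: int, wf: Iterator[str]) -> None:
        heapq.heappush(self._queue, (priority, next(self._order), wf))

    def step(self) -> Optional[str]:
        # Run one step of the highest-priority workflow; None when idle.
        while self._queue:
            _, _, wf = self._queue[0]
            try:
                return next(wf)
            except StopIteration:
                heapq.heappop(self._queue)  # finished, move to the next one
        return None

sched = Scheduler()
sched.submit(5, workflow("routine", 3))
log = [sched.step()]                     # routine work begins
sched.submit(0, workflow("quake", 2))    # urgent event arrives
log += [sched.step() for _ in range(4)]  # quake runs first, then routine resumes
print(log)
```

The routine workflow is paused between steps rather than cancelled, so it picks up where it left off once the urgent workflow completes.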
“The team has done a very nice job of building a tool,” Barga says, crediting Hey for supporting the project and Dan Fay, director of Earth, Energy, and Environment for External Research, for providing financial support. Jared Jackson worked on the initial prototype and acted as developmental lead for the project. Nelson Araujo was the software architect for Project Trident’s key features, and Dean Guo managed the dev team. The Aditi partner firm tested the code, built it, and made it robust.
Now, with the code having been released, Barga and team will focus on building a community around Project Trident.
“We have been invited to take it on a handful of major science studies,” he says. “We would love to engage these researchers with it, help them write the workflows to carry out these projects. If we do two or three, the body of workflows and activities is only going to continue to grow. We’re not going to try to push it into any new areas right now. We’re just going to go deeper in oceanography and try to build a community of users around it.”
And that, he hopes, will lead to Project Trident being incorporated into the tool kit available to 21st-century scientists.
“Future scientific-workflow systems will not be built from the ground up,” Barga says, “but instead will leverage commercial workflow engines, and researchers will only build what they need. We’ll see more sustainability for scientific-workflow systems, which will validate the thesis we started with. You’re going to see conventional workflow systems start to take the requirements and features we built in Project Trident—data flow, programming, provenance, versioning.”
Such developments are eagerly awaited.
“We’d like to see entire communities of oceanographers sharing workflows,” Barga concludes. “The NEPTUNE Canadian team came to visit about a month ago, and that’s what they were most excited about, thinking about deploying it in their environments. They could share workflows internationally, from Canada down to the U.S. and other installations around the world.
“That would be fantastic.”