Business Impact Article - Posted 3/18/2008

University of Cambridge

Research Group Unlocks the Secrets of Darwinian Research with Mapping and Database Software

A research group at the University of Cambridge has been using the advanced capabilities in Microsoft® SQL Server® 2008 database software to extract information from data that was originally collected 150 years ago. The timing couldn’t be better because the information relates directly to the ideas of Charles Darwin and is being uncovered just in time for his bicentenary—celebrations for which will be centred in Cambridge in 2009 (see www.darwin2009.cam.ac.uk/).

“The mash-up of maps and data that displays the plant locations in Microsoft Virtual Earth was ready in three days at a cost of five developer days in total. The entire application, including comments, is less than 300 lines of code.”

– MARK WHITEHORN, Research Associate, University of Cambridge

Darwin’s Theory of Evolution—the Modern Software Challenge
In 2005, the Cambridge research group created something of a storm in the scientific world by publishing a paper in Nature—considered by many the world’s most prestigious scientific publication—that rewrote the history of how Darwin developed his theory of evolution by natural selection.
It is a common belief that Darwin’s interest in variation was sparked by his observations during his five-year voyage on HMS Beagle (1831–36)—variation within and between species is one of the central pillars of his theory of evolution in The Origin of Species. Researchers knew that Darwin’s mentor at Cambridge, Professor J.S. Henslow, was instrumental in obtaining for Darwin the position on the Beagle.

Using Microsoft database software, the research team analysed the Cambridge Herbarium (see What Is a Herbarium?), which showed that Henslow had been studying variation since 1821. Further work proved he actively trained Darwin from 1829 to 1831 to study variation—before Darwin ever set foot on the Beagle.

The research in 2005 proved that the Herbarium was the basis on which Darwin’s Theory of Evolution was formed. But the work also highlighted that today we have almost no understanding of how it was constructed. How many people were involved in collecting the plants? Who were they? How were they recruited? Did they all contribute in the same way? Over what distances were the plants collected? How was this amazing knowledge network set up?

But is it important to know this? Well, the creation of this Herbarium was a massive undertaking—the Cambridge one, for example, now has more than a million sheets, each of which can contain several plants. Each plant had to be collected in the wild with associated data, dried and pressed, transported to Cambridge, identified, labelled, mounted, and stored.

It’s important to understand the sheer scale of the work involved. Let’s be conservative and say two person-hours per sheet: that’s 2.4 million hours, or about 30 lifetimes of work. The construction of Henslow’s Herbarium was the CERN project of its day but, until very recently, we knew almost nothing about how it was done.
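The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the sheet count and the hours in a working lifetime are not stated in the article. They are assumptions back-calculated from its totals (2.4 million hours, about 30 lifetimes), and all names are illustrative.

```python
# Back-of-envelope check of the workload estimate.
HOURS_PER_SHEET = 2               # the article's conservative figure
SHEETS = 1_200_000                # ASSUMPTION: implied by the 2.4 million hour total
HOURS_PER_WORKING_LIFE = 80_000   # ASSUMPTION: e.g. 40 years at 2,000 hours a year


def total_hours(sheets, hours_per_sheet):
    """Total person-hours needed to prepare every sheet."""
    return sheets * hours_per_sheet


def lifetimes(hours, hours_per_life):
    """Express a number of hours as working lifetimes."""
    return hours / hours_per_life


if __name__ == "__main__":
    hours = total_hours(SHEETS, HOURS_PER_SHEET)
    print(hours, lifetimes(hours, HOURS_PER_WORKING_LIFE))
```

With different assumptions about the length of a working life, the number of lifetimes changes in proportion. The point about scale stands either way.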

Fortunately, Henslow was an excellent scientist and meticulously recorded the name of the collector, the date, and the location for each plant. But it was hard—if not impossible—to analyse this data in any formal way because the locations were recorded as text: Bottisham Wood, Little Wratting, and so on.

Microsoft Database Technology Introduces Spatial Data Types
It is only recently that database and business intelligence technology have given analysts the tools to collate such location-specific data. For example, Microsoft has introduced spatial data types in Microsoft SQL Server 2008. These data types not only ensure that locations can be mapped anywhere on the surface of the Earth with a high degree of accuracy, but also help with calculations such as, “How many plants were collected by Henslow between 1826 and 1830 within 20 miles of Cambridge?”
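To make the shape of such a question concrete, here is a minimal sketch in plain Python. It is not the team's code and does not use the SQL Server spatial API. It swaps in the haversine great-circle formula on a spherical Earth as a stand-in for the database's distance calculation. The record layout, the coordinates used for Cambridge, and the sample records are all invented for illustration, and the date range is assumed to be inclusive.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8      # mean Earth radius (spherical approximation)
CAMBRIDGE = (52.205, 0.119)      # approximate (lat, lon); an illustrative assumption


@dataclass
class Specimen:
    """Hypothetical record: who collected a plant, when, and where."""
    collector: str
    year: int
    lat: float
    lon: float


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))


def count_collected(specimens, collector, first_year, last_year, centre, radius_miles):
    """Count specimens by one collector, in an inclusive year range, within a radius."""
    return sum(
        1
        for s in specimens
        if s.collector == collector
        and first_year <= s.year <= last_year
        and haversine_miles(centre, (s.lat, s.lon)) <= radius_miles
    )


# Invented sample records, NOT data from the Herbarium.
SAMPLE = [
    Specimen("Henslow", 1827, 52.22, 0.26),    # a few miles east of the centre
    Specimen("Henslow", 1829, 52.10, 0.46),    # further out, still inside 20 miles
    Specimen("Henslow", 1832, 52.22, 0.26),    # close by, but outside the date range
    Specimen("Henslow", 1828, 53.20, 0.119),   # right years, far to the north
    Specimen("Another collector", 1828, 52.22, 0.26),
]

if __name__ == "__main__":
    print(count_collected(SAMPLE, "Henslow", 1826, 1830, CAMBRIDGE, 20))
```

A spatial data type lets the database engine do this filtering itself, typically with the help of a spatial index, rather than looping over every record in application code.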

Even though the software was still in beta, or Microsoft Community Technology Preview (CTP), the researchers at Cambridge successfully developed a database that uses these spatial data types. However, it isn’t just the database that is important. The Cambridge research team wanted to visualise this data on a map. This was achieved using the Microsoft Visual Studio® 2008 development system and Microsoft Virtual Earth™.

Mark Whitehorn, Research Associate, University of Cambridge, is the group’s database specialist. He says: “In theory, we could have plotted the points by hand on a map, but to answer just one of our complex questions would have taken days of work. We wanted to ask questions and get the answers back in real time. The analysis we’ve already done with this software would have been impossible without the spatial data types and has already produced some very interesting findings.”

Whitehorn adds: “Remember that the scientific world, for good or evil, is now as competitive as the business world. We saw a competitive advantage in using the CTP. We balanced that against the drawbacks and there was no real argument. As it happens, the CTP has proved very stable. Also, it wasn’t just the spatial data types we were after. We’re collecting huge amounts of data now—ultimately, multiple terabytes—so the compression (both online and backup) is proving a major bonus to us. We’re also handling many large images so the file-stream capacity in SQL Server 2008 is a ‘killer feature’ as far as we’re concerned.”

SQL Server and Virtual Earth Reveal the Origins of Darwin’s Theory
The team has been using versions of SQL Server for several years. Whitehorn says: “We started this project in 2003 and chose Microsoft database software because it provided the rich set of business intelligence tools—online analytical processing (OLAP), data mining, and so on—that we needed. The truth is that, once started, no project, whether commercial or scientific, will change engine unless there is a very good reason—such as a killer feature in a competing engine. Spatial data wasn’t considered important by us when we started, but it soon became apparent that it was crucial to establish how the Herbarium was created. This feature appeared in SQL Server 2008 at just the right time for us, which is one reason why we’ve used the CTP. I suppose the important question is, ‘If you could magically change engine now, with no pain, would you do so?’ The answer is, ‘Absolutely not.’ Looking at the likely competition for this project, we would lose features that are essential and gain none that we need.”

The team has also gained a huge advantage from the Microsoft tools associated with Microsoft SQL Server, such as Microsoft Visual Studio 2008. “The speed of application development has been astonishing,” says Whitehorn. “The mash-up of maps and data that displays the plant locations in Microsoft Virtual Earth was ready in three days at a cost of five developer days in total. The entire application, including comments, is less than 300 lines of code.”

Using Virtual Earth to Unlock the Past
The team is delighted with the results so far. In a matter of weeks they uncovered fascinating information. Clearly, team members are keen to keep most findings for their scientific publications, but they were happy to share some of them.

Professor John Parker, Director of the Cambridge University Botanic Garden and Leader of the Research Group, says: “We’ve always known that the collection was centred on Cambridge. But, we were very keen to understand the dynamics of the collection in more detail.

“Once we had the data in spatial data types, it was easy to plot the cumulative number of plants against distance from the site of the original Botanic Garden in Cambridge. When we did so, we were surprised to see that the result was clearly biphasic. In other words, there were two phases in the collection with an intersection at around 23 kilometres. We rapidly realised—and equestrian experts were able to confirm—that this represents a reasonable distance for a rider of a horse to travel out and back in a day, with some time set aside for collecting. In other words, the pattern of collection was heavily dependent on horsepower as the prevailing transport system of the day.”

Figure 1. Plotting the cumulative number of plants against distance from Cambridge revealed a biphasic pattern with an intersection at around 23 kilometres, a reasonable distance for a collector on horseback to ride out and back in one day.


Source: University of Cambridge research group
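The analysis Professor Parker describes, cumulative plant count against distance with two phases meeting at a break, can be sketched as follows. This is an illustration, not the group's method: the article does not say how the intersection was located. The sketch tries every split of the curve, fits a least-squares line to each side, keeps the split with the smallest total squared error, and reports where the two lines cross. The data are synthetic, with a break deliberately built in at 23 km to mirror the reported figure.

```python
def cumulative_counts(distances_km):
    """Sort distances and pair each with the running count of plants so far."""
    xs = sorted(distances_km)
    ys = list(range(1, len(xs) + 1))
    return xs, ys


def fit_line(xs, ys):
    """Least-squares line through the points: (slope, intercept, squared error).

    Assumes the segment contains at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def find_breakpoint(xs, ys, min_points=3):
    """Distance at which the two best-fitting lines intersect, or None."""
    best = None
    for i in range(min_points, len(xs) - min_points + 1):
        s1, c1, e1 = fit_line(xs[:i], ys[:i])
        s2, c2, e2 = fit_line(xs[i:], ys[i:])
        if s1 == s2:
            continue  # parallel lines never cross
        total = e1 + e2
        if best is None or total < best[0]:
            best = (total, (c2 - c1) / (s1 - s2))
    return None if best is None else best[1]


# Synthetic distances (km): dense collecting out to 23 km, sparse beyond it.
NEAR = [0.5 * k for k in range(1, 47)]     # 0.5, 1.0, ..., 23.0
FAR = [23 + 5 * k for k in range(1, 11)]   # 28, 33, ..., 73

if __name__ == "__main__":
    xs, ys = cumulative_counts(NEAR + FAR)
    print(find_breakpoint(xs, ys))
```

Real collection data would be noisier than this, so the location of the break would carry some uncertainty and the choice of fitting method would matter more.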

These remarkable insights, and others, are only now coming to light. Thanks to Microsoft database software, the Cambridge research group is pioneering the use of business intelligence and its application to scientific data. By unlocking the secrets of how one of the fundamental research tools of 19th century biology was constructed and used, we have a far better understanding of how Darwin and his peers established the theory of evolution and natural selection.

Adoption of Good Data Structuring by Scientists
Scientists come in all flavours. But botanists, astronomers, physicists, and chemists, to name a few, have one thing in common: they collect and manipulate data. Some areas of research (for example, particle physics and human genome research) have dedicated data specialists who are experts in storing and manipulating data.

However, many scientists have no formal training in data structuring and are still storing data in spreadsheets or even on paper. For these people, a straightforward move to the more structured environment of a relational database engine such as Microsoft® SQL Server® 2008 can be a huge revelation. Mark Whitehorn, Research Associate, University of Cambridge, says:
“There is a paradox here. Relational databases arose in the scientific world—their roots are a mixture of mathematical theory and solid computer science—but the scientific world itself has been slower than the business world to adopt them.”

What Is a Herbarium?
Very simply, a herbarium consists of dried, pressed plants stuck to sheets of paper. But, in the early 19th century, these were one of the major research tools of the pre-eminent scientists who were trying to solve the most important question of their time: how the living world was structured. Without these herbaria, today’s researchers would not understand how the myriad species that surround us are related to each other.


This case study is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
Document published March 2008
Solution Overview



Organization Size: 3000 employees

Organization Profile


The mission of the University of Cambridge is to contribute to society through the pursuit of education, learning, and research at the highest international levels of excellence.


Software and Services
  • Microsoft SQL Server 2008
  • Microsoft Visual Studio 2008 Professional Edition
  • Microsoft Virtual Earth

Vertical Industries
Universities

Country/Region
United Kingdom