DataUp—Data Curation for the Long Tail of Science

October 2, 2012 | Posted by Microsoft Research Blog

The long tail: sure, it’s a well-known concept in business and marketing, but there’s a very important “hidden” long tail in the sciences, too. So, what is this hidden long tail of science? It consists of the millions of datasets that are not stored in a databank and therefore are not available for use by other scientists. Every day, researchers throughout the world are observing, calculating, and compiling data, recording it all on local machines in their labs, often without sharing it even within their own institutions. Regrettably, much of this data never gets deposited in larger web-accessible data repositories where it could be reused by other investigators around the globe.

As a researcher myself, and one who works with other researchers from around the globe, I am acutely aware of scientific data pain points; after all, those of us in the research community understand better than anyone that data preservation, curation, and sharing are critical for the advancement of scientific discovery. We want to share our data beyond our immediate groups, but many times we find ourselves hindered by a lack of tools and services designed to promote data curation and sharing.

Enter DataUp, an open-source tool that helps us document, manage, and archive our tabular data. The DataUp project was born out of this need for seamless integration of data management into the researchers’ current workflows. The University of California Curation Center (UC3) at the California Digital Library (CDL), with sponsorship from Microsoft Research and the Gordon and Betty Moore Foundation (GBMF), focused on creating a tool that could be used by researchers in the environmental sciences. They recognized that this field epitomizes the problems of data management and curation: in particular, data is often stored locally without the description (metadata), such as where it was collected, by whom, and when, that would make it more usable by others.

By conducting surveys at ecological and environmental science events, CDL found that the majority of these scientists use spreadsheets to collect and organize their data. Rather than make them learn a new program, UC3 recognized a need for a tool that works with a program most scientists already know: Microsoft Excel.

Further surveys showed that about half of the scientists preferred a tool that would be installed on their laptop, while the other half wanted a web-based tool that they could use on any device. Well, we sponsors and the UC3 team were not about to let this divided preference thwart the creation of a much-needed tool, so, together, we decided that there needed to be two versions of the tool: an open-source add-in (extension) for Microsoft Excel, and an open-source web application.

To achieve the project goals of facilitating data management, sharing, and archiving, both the add-in and the web application accomplish four main tasks:

  1. Perform a best-practices check to ensure good data organization
  2. Guide users through creation of metadata for their Excel file
  3. Help users obtain a unique identifier for their dataset
  4. Connect users to a major repository, where their data can be deposited and shared with others
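
To make the first task more concrete, here is a minimal sketch in Python of what a best-practices check on tabular data might look like. This is not DataUp's code, and the post does not list the rules DataUp applies; the function name `check_table` and the three rules shown (missing headers, duplicate headers, blank or ragged rows) are assumptions chosen purely for illustration.

```python
import csv
import io


def check_table(rows):
    """Return a list of warnings for a table given as a list of rows.

    Hypothetical helper: the rules below are illustrative assumptions,
    not the checks that DataUp itself performs.
    """
    if not rows:
        return ["table is empty"]

    warnings = []
    header, body = rows[0], rows[1:]

    # Rule 1: every column should have a non-blank, unique name.
    seen = set()
    for i, name in enumerate(header, start=1):
        label = name.strip()
        if not label:
            warnings.append(f"column {i} has no header")
        elif label.lower() in seen:
            warnings.append(f"duplicate header '{label}'")
        seen.add(label.lower())

    # Rule 2: no blank rows, and every row as wide as the header.
    # Row numbers count the header as row 1, as a spreadsheet would.
    for n, row in enumerate(body, start=2):
        if all(not cell.strip() for cell in row):
            warnings.append(f"row {n} is blank")
        elif len(row) != len(header):
            warnings.append(
                f"row {n} has {len(row)} cells, expected {len(header)}"
            )
    return warnings


if __name__ == "__main__":
    # Made-up sample data with a repeated header, a blank row, and a short row.
    sample = "site,date,site\nA,2012-10-02,3\n,,\nB,2012-10-03\n"
    for message in check_table(list(csv.reader(io.StringIO(sample)))):
        print(message)
```

Run on the made-up sample, the sketch reports the duplicate `site` header, the blank third row, and the short fourth row. A real check on an Excel file would need to read the workbook itself rather than CSV text.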

The California Digital Library established the initial repository, ONEShare. Researchers will be able to find tools from the DataUp project as part of the Investigator Toolkit for DataONE.

I want to thank Carly Strasser, Trisha Cruse, John Kunze, and Stephen Abrams from UC3 for their passion and commitment to bringing DataUp to life. I also want to thank Chris Mentzel from GBMF for co-funding the project with Microsoft Research Connections.

Now, get out there and DataUp!

Kristin Tolle, Director, Microsoft Research Connections

