Abstract

Many environmental scientists today are attempting to assemble, use, share, and save data from a diverse set of sources. These “synthesis” efforts are often interdisciplinary and blend data from ground-based sensors, satellites, field observations, and the literature. Today, many of these efforts are campaigns in which the data are gathered, processed, and analyzed through a single concentrated effort, and the data contributors know in advance exactly who will use their data and why. In the future, more of these efforts will need longer-term, continuous data gathering and processing, and the data will be shared with scientists outside the initial collaboration, perhaps in different disciplines or working toward different science goals. At even moderate scales of both data size and diversity, the cost and time required to find, gather, collate, normalize, and customize data in order to build a synthesis dataset can be daunting at best. By explicitly identifying and addressing the different requirements for each data role (author, data valet, publisher, curator, and consumer), our data management architecture for large-scale shared scientific data enables the creation of such synthesis datasets that continue to grow and evolve with new data, data annotations, participants, and use rules. We show the effectiveness of our approach in the context of the FLUXNET Carbon-Climate Synthesis Dataset, one of the largest ongoing biogeophysical field experiments.