Microsoft Research Blog

Helping proteomics scientists share peptide data: Azure does the heavy lifting

July 11, 2018 | By Vani Mandava, Director, Data Science Outreach

Scientific research breakthroughs are often achieved when many scientists, in different labs and organizations, work together on a single task. That happened at the turn of the 21st century with the Human Genome Project, which mapped human DNA for future reference and is now key to many breakthroughs in medicine. It is happening again with a similarly visionary effort: the Human Proteome Project of the Human Proteome Organization (HUPO), which is cataloging all 20,000 human proteins, such as insulin, with the goal of treating and preventing many diseases. Many diseases result from changes in protein configuration and function, so scientists are eager to grow their knowledge in this area.

A key resource in the HUPO project is the Peptide Atlas, an effort spearheaded by the Seattle-based Institute for Systems Biology (ISB). Peptide Atlas is a catalog of peptides, the building blocks of proteins, delineated through their mass spectrometry (MS) data, or “spectra.” It serves as the reference guide for all scientists doing proteomics work: they can see whether the peptides they are working with are cataloged, and which spectra match a particular peptide, a necessary precursor to understanding proteins at a deeper level.

Here’s where Microsoft gets involved, by donating Azure credits to this effort.

Dozens, if not hundreds, of labs submit results and data to the Peptide Atlas, but the plethora of MS manufacturers and lab techniques, in addition to different research agendas, means the data submitted isn’t apples to apples. The datasets need post-processing, and when you’re dealing with 20,000 proteins, each of which may have 50 peptides that need to be identified and cataloged, it becomes a big data challenge.

In addition, the technology used for matching spectra to peptides keeps improving, so much of the existing dataset needs to be reprocessed with more up-to-date algorithms.

Through grants of Azure resources donated by Microsoft and made available through the West Big Data Hub, ISB has the compute resources to do this important data processing work. Instead of tying up all of ISB’s internal computing resources, its software scientists can offload much of the work to thousands of inexpensive, low-priority Azure cores in the cloud. That both speeds up the task through Azure Batch processing and frees internal resources for ISB scientists to use on other projects.
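
To get a feel for the scale, here is a minimal Python sketch that turns the figures quoted above (20,000 proteins, up to 50 peptides each) into a count of independent work items of the kind a batch service could spread across many cores. It is not ISB's pipeline and does not call any Azure API; the `plan_tasks` helper and the chunk size of 500 are hypothetical, chosen only for illustration.

```python
# Figures quoted in the post; 50 peptides per protein is an upper-end estimate.
PROTEINS = 20_000
PEPTIDES_PER_PROTEIN = 50


def plan_tasks(total_items: int, items_per_task: int) -> list[range]:
    """Split item indices 0..total_items-1 into contiguous, independent chunks.

    Hypothetical helper: each returned range stands for one task that
    could run on its own core, with no dependency on the other tasks.
    """
    if items_per_task <= 0:
        raise ValueError("items_per_task must be positive")
    return [
        range(start, min(start + items_per_task, total_items))
        for start in range(0, total_items, items_per_task)
    ]


total_peptides = PROTEINS * PEPTIDES_PER_PROTEIN
tasks = plan_tasks(total_peptides, items_per_task=500)  # hypothetical chunk size
print(f"{total_peptides:,} peptides -> {len(tasks):,} independent tasks")
```

Because the chunks do not depend on one another, a task that is interrupted can simply be rerun, which is what makes this kind of workload a reasonable fit for inexpensive low-priority cores.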

The Big Data Hubs are a National Science Foundation-sponsored effort that supplies cloud computing credits to large scientific projects. In addition to the Azure credits donated by Microsoft, ISB received a Spoke Project Planning Grant from NSF to facilitate interactions between experts from the genomic variant and protein structure communities, with the goal of developing methods for integrating data. Eventually, every proteomics researcher will be able to access this data and run their algorithms on it, in the cloud.

And Microsoft will be there to help.
