FaST-LMM speeds genome analysis

Identifying Genetic Factors in Disease with Big Data

Medical researchers have long known that many serious diseases—including heart disease, asthma, and many forms of cancer—are hereditary. Until fairly recently, however, there was no easy way to identify the particular genes that are associated with a given malady. Now, researchers can conduct genome-wide association studies—by sequencing the DNA of human subjects—enabling them to statistically correlate specific genes to particular diseases.

To study the genetics of a particular disease, researchers need a large sample of people who have the disorder, which means that some of these people are likely to be related to one another, even if only distantly. This skews research results because some of the apparent associations between specific genes and the disease are false positives that result from subjects sharing a common ancestor. In other words, the research sample is not truly random, and researchers must statistically correct for the “confounding” caused by the shared ancestry of the subjects.

This is not an insurmountable statistical problem: there are so-called linear mixed models (LMMs) that can eliminate the confounding. However, running LMMs to account for the relatedness among thousands of research subjects requires an inordinate amount of computer runtime and memory. When working with the large datasets that offer the most promise for finding the connections between genetics and disease, the cost of the computer time and memory that these models require can quickly become prohibitive.
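
For readers who have not met the model before, a linear mixed model for association testing is commonly written as shown below. The notation is standard textbook notation and an assumption on our part, not something taken from the researchers' own description: y is the trait measured on n individuals, X holds the tested genetic variant and other covariates, K is an n-by-n matrix of pairwise genetic similarity, and the two variance terms split the unexplained variation into a shared-ancestry part and independent noise.

```latex
y = X\beta + g + \varepsilon, \qquad
g \sim \mathcal{N}\!\left(0,\ \sigma_g^2 K\right), \qquad
\varepsilon \sim \mathcal{N}\!\left(0,\ \sigma_e^2 I_n\right)
\quad\Longrightarrow\quad
y \sim \mathcal{N}\!\left(X\beta,\ \sigma_g^2 K + \sigma_e^2 I_n\right)
```

Written this way, the source of the cost is easy to see: storing K takes memory proportional to n squared, and factoring or inverting the n-by-n covariance by standard dense methods takes time proportional to n cubed.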

To address this problem, a Microsoft Research team developed the Factored Spectrally Transformed Linear Mixed Model (FaST-LMM), an algorithm for genome-wide association studies that scales linearly in the number of individuals in both runtime and memory use. FaST-LMM can analyze data for 120,000 individuals in just a few hours (whereas the current algorithms fail to run at all with data for just 20,000 individuals). The outcome: large datasets that are indispensable to genome-wide association studies are now computationally manageable from a memory and runtime perspective.
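
To give a feel for how linear scaling is possible, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the FaST-LMM implementation: it assumes the similarity matrix can be factored as K = W Wᵀ, where W is an n-by-k matrix of genetic markers with k much smaller than n, and it treats the two variance components as given rather than estimating them. The function names `loglik_naive` and `loglik_lowrank` are hypothetical. The first function builds the full n-by-n covariance; the second rotates the data with a thin SVD of W so that the covariance becomes diagonal, and never forms an n-by-n matrix.

```python
import numpy as np


def loglik_naive(r, W, sigma_g2, sigma_e2):
    """Gaussian log-likelihood of residual r using the full n x n covariance.

    Memory grows as n^2 and time as n^3.
    """
    n = r.shape[0]
    cov = sigma_g2 * (W @ W.T) + sigma_e2 * np.eye(n)
    _, logdet = np.linalg.slogdet(cov)
    quad = r @ np.linalg.solve(cov, r)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)


def loglik_lowrank(r, W, sigma_g2, sigma_e2):
    """Same log-likelihood, computed from a thin SVD of W (assumed K = W W^T).

    For fixed k, both time and memory grow linearly with n.
    """
    n = r.shape[0]
    # Thin SVD: U is n x m with m = min(n, k); s holds the singular values.
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    m = s.shape[0]

    # Eigenvalues of the covariance in the directions spanned by U.
    d = sigma_g2 * s**2 + sigma_e2
    r_rot = U.T @ r  # data rotated into the spectral basis

    # The remaining n - m directions all have eigenvalue sigma_e2.
    logdet = np.sum(np.log(d)) + (n - m) * np.log(sigma_e2)
    quad = np.sum(r_rot**2 / d) + (r @ r - r_rot @ r_rot) / sigma_e2
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)


if __name__ == "__main__":
    # Synthetic data, for illustration only.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((500, 10))
    r = rng.standard_normal(500)
    print(loglik_naive(r, W, 0.7, 0.3))
    print(loglik_lowrank(r, W, 0.7, 0.3))
```

The two functions should agree up to floating-point rounding. The difference is in the cost: the second touches only n-by-k arrays, which is the kind of saving that makes very large cohorts tractable.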

FaST-LMM will enable researchers to analyze hundreds of thousands of individuals to find relationships between their DNA and their traits, identifying not only which diseases a given patient may get, but also which drugs will work best for that patient. FaST-LMM takes us one step closer to the day when physicians can provide their patients with personalized assessments of their risk of developing certain diseases and devise prevention and treatment protocols that are attuned to their unique hereditary makeup.

Primary Researchers

David Heckerman

David Heckerman, Ph.D., is a Microsoft Distinguished Scientist and senior director of the eScience group at Microsoft Research. His research interest is focused on learning from data. The models and methods he uses are inspired by work in the fields of statistics and data analysis, machine learning, probability theory, decision theory, decision analysis, and artificial intelligence. His recent work has concentrated on using graphical models for data analysis and visualization in biology and medicine with a special focus on the design of HIV vaccines.

Jennifer Listgarten

Jennifer Listgarten, Ph.D., is a researcher in the eScience group at Microsoft Research. Her work focuses on the development and application of statistical and machine learning methods for the analysis of high-throughput, biologically based data.

Christoph Lippert

Christoph Lippert is a Ph.D. student at the Max Planck Institutes for Developmental Biology and for Intelligent Systems in Tübingen, Germany. In summer 2012, he will join the eScience group at Microsoft Research. His research focuses on the development of probabilistic models in genomics. He has contributed analysis tools that assist biologists with their research in genetic mapping of human diseases and the analysis of genetic and phenotypic variation in plants and other model organisms.