Identifying Genetic Factors in Disease with Big Data
Medical researchers have long known that many serious diseases—including heart disease, asthma, and many forms of cancer—are hereditary. Until fairly recently, however, there was no easy way to identify the particular genes that are associated with a given malady. Now, researchers can conduct genome-wide association studies—by sequencing the DNA of human subjects—enabling them to statistically correlate specific genes to particular diseases.
To study the genetics of a particular disease, researchers need a large sample of people who have the disorder, which means that some of these people are likely to be related to one another, even if only distantly. This skews research results because some apparent associations between specific genes and the disease are false positives: they arise because two people share a common ancestor, not because the gene is genuinely linked to the disease. In other words, the research sample is not truly random, and researchers must statistically correct for the “confounding” caused by the subjects' shared ancestry.
This is not an insurmountable statistical problem: so-called linear mixed models (LMMs) can eliminate the confounding. However, running LMMs to account for the relatedness among thousands of research subjects requires an inordinately large amount of computer runtime and memory. When working with the large datasets that offer the most promise for finding the connections between genetics and disease, the cost of the computer time and memory that these models require can quickly become prohibitive.
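To get a rough feel for the memory side of this cost, the short Python sketch below estimates the size of a dense table that holds one similarity value for every pair of individuals. It assumes a straightforward LMM implementation that stores such an n-by-n table as 8-byte floating-point numbers; the function name and the storage format are illustrative assumptions, not details taken from any particular tool.

```python
def similarity_matrix_gib(n_individuals, bytes_per_entry=8):
    """Approximate size, in GiB, of a dense n x n pairwise similarity matrix.

    Assumes one floating-point entry per pair of individuals (hypothetical
    storage layout, used here only to show how the cost grows).
    """
    return n_individuals ** 2 * bytes_per_entry / 2 ** 30


# Memory grows with the square of the number of individuals:
# a 6x larger cohort needs a 36x larger matrix.
for n in (20_000, 120_000):
    print(f"{n:,} individuals: about {similarity_matrix_gib(n):.1f} GiB")
```

The quadratic growth is the point of the example. Runtime for the usual matrix factorizations on such a table grows faster still, roughly with the cube of the number of individuals.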
To address this problem, the Microsoft Research team developed the Factored Spectrally Transformed Linear Mixed Model (FaST-LMM), an algorithm for genome-wide association studies that scales linearly in the number of individuals in both runtime and memory use. FaST-LMM can analyze data for 120,000 individuals in just a few hours, whereas current algorithms fail to run at all with data for just 20,000 individuals. The outcome: large datasets that are indispensable to genome-wide association studies are now computationally manageable from a memory and runtime perspective.
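One way to see how linear scaling in the number of individuals can be possible is the low-rank trick sketched below. It is a minimal illustration, not the FaST-LMM implementation. It assumes that the similarity among n individuals is built as W W^T from a genotype matrix W with far fewer columns (genetic markers) than rows (individuals). Under that assumption, a key quantity in the LMM likelihood, y^T (W W^T + delta I)^-1 y, can be computed from a factorization of W alone, without ever forming the n-by-n matrix. The names W, y, delta, and both function names are hypothetical choices for this example.

```python
import numpy as np


def lowrank_quadratic_form(W, y, delta):
    """Compute y^T (W W^T + delta*I)^-1 y without forming the n x n matrix."""
    # Economy-size SVD: U1 is n x k and s holds k singular values,
    # where k = min(n, number of markers).
    U1, s, _ = np.linalg.svd(W, full_matrices=False)
    eigvals = s ** 2               # nonzero eigenvalues of W W^T
    y_rot = U1.T @ y               # part of y inside the column space of W
    y_perp = y - U1 @ y_rot        # part of y orthogonal to that space
    # Inside the column space each direction is scaled by 1/(eigval + delta);
    # outside it, every direction is scaled by 1/delta.
    return np.sum(y_rot ** 2 / (eigvals + delta)) + (y_perp @ y_perp) / delta


def dense_quadratic_form(W, y, delta):
    """Reference computation that builds the full n x n matrix."""
    n = W.shape[0]
    K = W @ W.T
    return y @ np.linalg.solve(K + delta * np.eye(n), y)


# Small synthetic check: 200 individuals, 10 markers (random data, not real genotypes).
rng = np.random.default_rng(0)
W = rng.standard_normal((200, 10))
y = rng.standard_normal(200)
delta = 0.5

fast = lowrank_quadratic_form(W, y, delta)
dense = dense_quadratic_form(W, y, delta)
print(np.isclose(fast, dense))
```

With the number of markers held fixed, the low-rank version touches only n-by-k arrays, so its memory use and runtime grow in proportion to n, whereas the dense reference needs the full n-by-n matrix.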
FaST-LMM will enable researchers to analyze hundreds of thousands of individuals to find relationships between their DNA and their traits, identifying not only which diseases a given patient may get, but also which drugs will work best for that patient. FaST-LMM takes us one step closer to the day when physicians can provide their patients with personalized assessments of their risk of developing certain diseases and devise prevention and treatment protocols that are attuned to their unique hereditary makeup.
Learn more about this research: