FaST-LMM

Established: October 14, 2006

FaST-LMM (Factored Spectrally Transformed Linear Mixed Models) is a set of tools for efficiently performing genome-wide association studies (GWAS), prediction, and heritability estimation on large data sets. FaST-LMM runs on both Windows and Linux and has been tested on data sets with over one million samples.

The most up-to-date version of FaST-LMM is written in Python and available on GitHub. It supports univariate GWAS [1, 4], tests for epistasis, corrections for cellular heterogeneity via the inclusion of principal components [2], set association tests [3], and heritability estimation [5]. A C++ version, including a Windows binary, a Linux binary, and source, supports univariate GWAS and limited epistatic testing. Another version supporting corrections for cellular heterogeneity is available in Python and R. An example of FaST-LMM with cloud computing is also available.

[1] C. Lippert, J. Listgarten, Y. Liu, C.M. Kadie, R.I. Davidson, D. Heckerman. FaST linear mixed models for genome-wide association studies. Nature Methods, 8: 833-835, Oct 2011 (doi:10.1038/nmeth.1681).

[2] J. Zou, C. Lippert, D. Heckerman, M. Aryee, J. Listgarten. Epigenome-wide association studies without the need for cell-type composition. Nature Methods, 11: 309–311, Jan 2014 (doi:10.1038/nmeth.2815).

[3] C. Lippert, Jing Xiang, Danilo Horta, Christian Widmer, Carl M. Kadie, D. Heckerman, J. Listgarten. Greater power and computational efficiency for kernel-based association testing of sets of genetic variants. Bioinformatics, 30, July 2014 (doi: 10.1093/bioinformatics/btu504).

[4] C. Widmer, C. Lippert, O. Weissbrod, N. Fusi, C.M. Kadie, R.I. Davidson, J. Listgarten, and D. Heckerman. Further Improvements to Linear Mixed Models for Genome-Wide Association Studies. Scientific Reports, 4, 6874, Nov 2014 (doi:10.1038/srep06874).

[5] D. Heckerman, D. Gurdasani, C. Kadie, C. Pomilla, T. Carstensen, H. Martin, K. Ekoru, R.N. Nsubuga, G. Ssenyomo, A. Kamali, P. Kaleebu, C. Widmer, and M.S. Sandhu. Linear mixed model for heritability estimation that explicitly addresses environmental variation. PNAS, 113: 7377–7382, July 2016 (doi: 10.1073/pnas.1510497113).

The full annotated bibliography follows. Its entries are numbered separately from the list above.

Univariate GWAS

[1] H. Kang, N. Zaitlen, C. Wade, A. Kirby, D. Heckerman, M. Daly, and E. Eskin, Efficient Control of Population Structure in Model Organism Association Mapping, Genetics, 178:1709-1723, March, 2008 (doi: 10.1534/genetics.107.080101).

Describes early efforts to make linear mixed models more computationally efficient.

[2] C. Lippert*, J. Listgarten*, Y. Liu, C.M. Kadie, R.I. Davidson, D. Heckerman*. FaST linear mixed models for genome-wide association studies. Nature Methods, 8: 833-835, Oct 2011 (doi:10.1038/nmeth.1681). (*equal contributions)

Shows how exact linear-mixed-model computations can be performed in time and memory linear in the number of individuals when the number of SNPs used in the similarity matrix is less than the number of individuals (i.e., when the similarity matrix is low rank). This work also describes an approach to select SNPs to achieve this condition with linkage-disequilibrium-based pruning. In addition, this work shows that computations are quadratic in time and memory when the similarity matrix is full rank.
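As a rough illustration of the low-rank case, the sketch below evaluates a Gaussian log-likelihood with covariance sigma_g2 * (W W^T + delta * I) using only a thin SVD of the n-by-k SNP matrix W, and checks it against a dense computation that forms the full n-by-n matrix. The function names, the parameterization by delta (the ratio of residual to genetic variance), and the omission of fixed effects are assumptions made for this example; it is not the FaST-LMM implementation.

```python
import numpy as np

def lowrank_loglik(W, r, sigma_g2, delta):
    """Log-likelihood of r ~ N(0, sigma_g2 * (W W^T + delta I)) via a thin SVD.

    W is n x k with k < n (assumed). Cost is O(n k^2); no n x n matrix is formed.
    """
    n, k = W.shape
    U, s, _ = np.linalg.svd(W, full_matrices=False)  # U is n x k
    Ur = U.T @ r                                     # rotate into the span of W
    eig = s**2 + delta                               # eigenvalues inside the span
    # Outside the span, every eigenvalue equals delta.
    quad = np.sum(Ur**2 / eig) + (r @ r - Ur @ Ur) / delta
    logdet = np.sum(np.log(eig)) + (n - k) * np.log(delta)
    return -0.5 * (n * np.log(2 * np.pi * sigma_g2) + logdet + quad / sigma_g2)

def dense_loglik(W, r, sigma_g2, delta):
    """Same quantity computed directly from the n x n matrix, O(n^3)."""
    n = W.shape[0]
    H = W @ W.T + delta * np.eye(n)
    _, logdet = np.linalg.slogdet(H)
    quad = r @ np.linalg.solve(H, r)
    return -0.5 * (n * np.log(2 * np.pi * sigma_g2) + logdet + quad / sigma_g2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((200, 10))   # 200 individuals, 10 SNPs (synthetic)
    r = rng.standard_normal(200)
    print(lowrank_loglik(W, r, 1.0, 0.5), dense_loglik(W, r, 1.0, 0.5))
```

The two functions should agree to numerical precision; the low-rank version scales linearly in the number of individuals for a fixed number of SNPs.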

[3] J. Listgarten*, C. Lippert*, C.M. Kadie, R.I. Davidson, E. Eskin, D. Heckerman*. Improved linear mixed models for genome-wide association studies. Nature Methods, 9: 525-526, June 2012 (doi:10.1038/nmeth.2037). (*equal contributions)

Describes a method for selecting SNPs for the linear-mixed-model similarity matrix by identifying SNPs that are predictive of the phenotype. A later publication [6] shows this approach yields poor control of type I error, whereas the original selection method in [2] performs well. This work also shows that the inclusion of irrelevant SNPs in the similarity matrix leads to inflated test statistics and reduced power, a phenomenon called “dilution”. Although an incorrect explanation for dilution is offered here, a correction is given in [5]. Finally, there is a bug in the analysis of the synthetic data, which makes the prediction-based selection method appear to perform better than it actually does.

[4] J. Listgarten*, C. Lippert*, D. Heckerman*. FaST-LMM-Select for addressing confounding from spatial structure and rare variants. Nature Genetics, 2013 (doi:10.1038/ng.2620). (*equal contributions)

Shows how the feature-selection method in [3] addresses an open problem in statistical genetics that had been published in Nature Genetics. Based on results in [6], however, we recommend that the selection approach in [2] be used instead.

[5] C. Lippert*, Gerald Quon, Eun Yong Kang, Carl M. Kadie, J. Listgarten*, D. Heckerman*. The benefits of selecting phenotype-specific variants for applications of mixed models in genomics. Scientific Reports, 2013 (doi:10.1038/srep01815). (*equal contributions)

Describes additional experiments regarding the feature-selection method in [3] as applied to GWAS and prediction. Again, based on the results in [6], we recommend that the selection approach in [2] be used instead.

[6] C. Widmer*, C. Lippert*, O. Weissbrod, N. Fusi, C.M. Kadie, R.I. Davidson, J. Listgarten, and D. Heckerman*. Further Improvements to Linear Mixed Models for Genome-Wide Association Studies. Scientific Reports, 4, 6874, Nov 2014 (doi:10.1038/srep06874). (*equal contributions)

Describes the latest version of FaST-LMM. It shows that selecting SNPs for the linear-mixed-model similarity matrix through pruning via linkage disequilibrium (as in [2]) works well to control type I error, whereas selecting SNPs that are predictive of the phenotype (as in [3]) does not.
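To make the idea of pruning by linkage disequilibrium concrete, here is a minimal greedy sketch: SNPs are visited in order, and a SNP is kept only if its squared correlation with each recently kept SNP stays below a threshold. The function name `ld_prune`, the threshold `r2_max`, and the `window` over kept SNPs are hypothetical choices for illustration; the procedure used in [6] may differ in its details.

```python
import numpy as np

def ld_prune(G, r2_max=0.5, window=50):
    """Return indices of SNPs kept by a simple greedy LD pruning.

    G is an individuals x SNPs genotype matrix (e.g. 0/1/2 allele counts).
    Each candidate is compared with the last `window` kept SNPs only.
    """
    n, m = G.shape
    Z = G - G.mean(axis=0)          # center each SNP
    sd = Z.std(axis=0)              # population standard deviation per SNP
    kept = []
    for j in range(m):
        if sd[j] == 0:              # monomorphic SNP: no variation, skip it
            continue
        in_ld = False
        for i in kept[-window:]:
            r = (Z[:, i] @ Z[:, j]) / (n * sd[i] * sd[j])  # Pearson correlation
            if r * r >= r2_max:
                in_ld = True
                break
        if not in_ld:
            kept.append(j)
    return kept

if __name__ == "__main__":
    G = np.array([[0, 0, 0, 1],
                  [1, 1, 0, 1],
                  [2, 2, 0, 1],
                  [0, 0, 1, 1],
                  [1, 1, 1, 1],
                  [2, 2, 1, 1]], dtype=float)
    print(ld_prune(G))  # the duplicate and the monomorphic SNP are dropped
```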

[7] C. Lippert and D. Heckerman. Computational and statistical issues in personalized medicine. XRDS 21, 24-27, Summer 2015 (doi:10.1145/2788502).

Describes statistical issues in GWAS with linear mixed models from a graphical-model perspective.

Set Tests for GWAS

[8] J. Listgarten*, C. Lippert*, Eun Yong Kang, Jing Xiang, Carl M. Kadie, D. Heckerman*. A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics, 29:1526-1533, April 2013 (doi:10.1093/bioinformatics/btt177). (*equal contributions)

Shows that the likelihood ratio test (LRT) can be more powerful than a score test for set association tests. This work is limited to similarity matrices that are low rank and includes an efficient algorithm for this case. This limitation is relaxed in [9].

[9] C. Lippert, Jing Xiang, Danilo Horta, Christian Widmer, Carl M. Kadie, D. Heckerman*, J. Listgarten. Greater power and computational efficiency for kernel-based association testing of sets of genetic variants. Bioinformatics, 2014 (doi: 10.1093/bioinformatics/btu504). (*corresponding author)

Makes theoretical arguments and demonstrates empirically that the LRT is often more powerful than the traditionally used score test (e.g., SKAT). It also explains how to perform a number of the algebraic computations for set tests efficiently with either a low-rank or a full-rank background kernel.

Data Transformations/Pre-processing for GWAS

[10] N. Fusi*, C. Lippert, N. D. Lawrence and O. Stegle*. Warped linear mixed models for the genetic analysis of transformed phenotypes. Nature Communications, 2014.

Shows how monotonically transforming the phenotype can increase power in genome-wide association studies and increase the accuracy of heritability estimation and phenotype prediction.
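For a sense of what a monotone phenotype transformation looks like, the sketch below applies a rank-based inverse normal transform, a fixed and widely used transformation. This is a stand-in chosen for illustration: the approach in [10] learns the transformation as part of the model rather than fixing it in advance. The function name and the offset parameter `c` are assumptions of this example.

```python
from statistics import NormalDist

def inverse_normal_transform(values, c=0.5):
    """Map each value to the standard-normal quantile of its rank.

    Ties receive the average of their ranks, so equal inputs give equal
    outputs and the ordering of the original values is preserved.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # average 1-based rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    # With c = 0.5 the probabilities are (rank - 0.5) / n, strictly inside (0, 1).
    return [nd.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]

if __name__ == "__main__":
    skewed = [1.0, 10.0, 100.0, 1000.0, 10000.0]
    print(inverse_normal_transform(skewed))
```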

[11] O. Weissbrod, C. Lippert, D. Geiger, and D. Heckerman. Accurate liability estimation improves power in ascertained case-control studies. Nature Methods, Feb 2015 (doi:10.1038/nmeth.3285).

Describes an approach to pre-process ascertained case-control-study data that leads to improved power when analyzed with a linear mixed model.

Epigenetic Cellular Heterogeneity Correction

[12] J. Zou, C. Lippert, D. Heckerman, M. Aryee, Jennifer Listgarten. Epigenome-wide association studies without the need for cell-type composition. Nature Methods, 11: 309–311, Jan 2014 (doi:10.1038/nmeth.2815).

Shows how FaST-LMM, with the inclusion of principal components (PCs) as covariates, can correct for the confounding effects of multiple cell types. Although a method for selecting PCs is presented here, the method in [6] is now recommended.
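The covariate-construction step can be sketched as follows: principal-component scores are computed from the column-centered methylation matrix and appended to the existing covariates. Only this step is shown; the mixed-model fit and the choice of how many PCs to include are not, and the function name `add_top_pcs` and the matrix layout are assumptions of this example.

```python
import numpy as np

def add_top_pcs(M, X, n_pcs):
    """Append the top principal-component scores of M to the covariates X.

    M: samples x sites matrix (e.g. methylation levels).
    X: samples x covariates matrix (e.g. an intercept column).
    """
    Mc = M - M.mean(axis=0)                       # center each site
    U, s, _ = np.linalg.svd(Mc, full_matrices=False)
    pcs = U[:, :n_pcs] * s[:n_pcs]                # PC scores, one column per PC
    return np.hstack([X, pcs])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.random((20, 30))                      # synthetic data
    X = np.ones((20, 1))                          # intercept only
    print(add_top_pcs(M, X, n_pcs=3).shape)
```

Because the PCs are computed from centered data, their scores have mean zero and are mutually orthogonal, so they do not duplicate an intercept column.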

Epistatic Genome-Wide Association

[13] C. Lippert*, J. Listgarten*, Robert Davidson, Scott Baxter, Hoifung Poon, Carl M. Kadie, D. Heckerman*. An Exhaustive Epistatic SNP Association Analysis on Expanded Wellcome Trust Data. Scientific Reports, 2013 (doi:10.1038/srep01099). (*equal contributions)

Presents results for all pairwise-epistatic tests for all phenotypes in the WTCCC1 data, using a linear mixed model with a low-rank similarity matrix based on the feature-selection method in [3]. As described, based on the results in [6], we now recommend that the feature-selection method in [2] be used instead. The rank order of the hits may be approximately correct, and therefore we have left these results on the Azure marketplace http://datamarket.azure.com/dataset/microsoftresearch/epistasisgwas.

GWAS for “functional traits” such as longitudinal traits

[14] N. Fusi and J. Listgarten. Leveraging Non-Linear Genetic Effects on Functional Traits for GWAS. Proceedings of RECOMB 2016.

Introduces a model for performing GWAS on vector-valued traits that vary smoothly in time. The framework is expressive and computationally efficient, but the null model is not nested within the alternative model, an issue we are addressing in ongoing work.

Heritability estimation

[15] N. Furlotte, D. Heckerman, and C. Lippert. Quantifying the uncertainty in heritability. Journal of Human Genetics 27, March 2014 (doi: 10.1038/jhg.2014.15).

Applies the spectral-decomposition trick from FaST-LMM [2] to speed up Bayesian estimates of heritability.
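The benefit of the spectral decomposition can be sketched as follows: after a single eigendecomposition of the similarity matrix K, the likelihood at any candidate heritability h2 costs only O(n), so it can be evaluated cheaply over a whole grid. The sketch assumes a zero-mean model y ~ N(0, sigma2 * (h2 * K + (1 - h2) * I)), profiles out sigma2, and uses a flat prior over the grid; these simplifications and the function name are assumptions of this example, and the Bayesian treatment in [15] may differ.

```python
import numpy as np

def h2_grid_posterior(K, y, grid):
    """Normalized posterior over a grid of h2 values, plus the log-likelihoods.

    One O(n^3) eigendecomposition is followed by O(n) work per grid point.
    Grid values are assumed to lie strictly between 0 and 1.
    """
    lam, U = np.linalg.eigh(K)         # K = U diag(lam) U^T
    yt = U.T @ y                       # rotated phenotype
    n = y.size
    ll = np.empty(len(grid))
    for t, h2 in enumerate(grid):
        d = h2 * lam + (1.0 - h2)      # eigenvalues of h2*K + (1-h2)*I
        sigma2 = np.mean(yt**2 / d)    # maximum-likelihood scale for this h2
        ll[t] = -0.5 * (n * np.log(2 * np.pi * sigma2) + np.sum(np.log(d)) + n)
    post = np.exp(ll - ll.max())       # subtract the max for numerical stability
    return post / post.sum(), ll

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((100, 300))          # synthetic standardized SNPs
    K = G @ G.T / G.shape[1]
    y = rng.standard_normal(100)
    grid = np.linspace(0.01, 0.99, 99)
    post, _ = h2_grid_posterior(K, y, grid)
    print(grid[np.argmax(post)])
```

The spread of the resulting grid posterior gives a direct picture of the uncertainty in the heritability estimate, rather than a single point value.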

[16] D. Heckerman, D. Gurdasani, C. Kadie, C. Pomilla, T. Carstensen, H. Martin, K. Ekoru, R.N. Nsubuga, G. Ssenyomo, A. Kamali, P. Kaleebu, C. Widmer, and M.S. Sandhu. Linear mixed model for heritability estimation that explicitly addresses environmental variation. PNAS, 113: 7377–7382, July 2016 (doi: 10.1073/pnas.1510497113).

Describes a way to generalize linear mixed models to take spatial location into account when jointly modeling the influences of genomics and environment on traits.
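One way to picture such a generalization is a covariance with two similarity matrices, one genomic and one built from spatial coordinates. The sketch below uses a Gaussian radial-basis-function kernel on Euclidean distances between locations; the kernel form, the length scale, the variance-component names, and the function names are assumptions made for illustration, not necessarily the choices made in [16].

```python
import numpy as np

def rbf_kernel(coords, length_scale):
    """Spatial similarity: exp(-||x_i - x_j||^2 / (2 * length_scale^2))."""
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = np.sum(diff**2, axis=-1)
    return np.exp(-sq_dist / (2.0 * length_scale**2))

def total_covariance(K_g, K_e, sigma_g2, sigma_e2, sigma_r2):
    """Phenotype covariance with genomic, spatial, and residual components."""
    n = K_g.shape[0]
    return sigma_g2 * K_g + sigma_e2 * K_e + sigma_r2 * np.eye(n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords = rng.random((50, 2))                 # synthetic 2-D locations
    G = rng.standard_normal((50, 200))           # synthetic standardized SNPs
    K_g = G @ G.T / G.shape[1]
    K_e = rbf_kernel(coords, length_scale=0.2)
    V = total_covariance(K_g, K_e, 0.4, 0.3, 0.3)
    print(V.shape)
```

With both components in the model, the share of variance attributed to genomics is sigma_g2 relative to the total of the three components, so environmental similarity between nearby individuals is less likely to be counted as genetic.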