Oral Session 10


January 6, 2014


Richard Durbin and Cho-Jui Hsieh


Wellcome Trust Sanger Institute, UT Austin


New Methods for the Analysis of Genome Variation Data – Genetic variation in genome sequences within a species such as humans underpins our biological diversity, is the basis for the genetic contribution to disease, provides information about our ancestry, and is the substrate for evolution. Genetic variation has a complex structure of shared inheritance from a common ancestor at each position in the genome, with the pattern of sharing changing along the genome as a consequence of genetic recombination. The scale of data sets that can be obtained from modern sequencing and genotyping methods, currently of the order of hundreds of terabytes, makes analysis computationally challenging. During the last few years, a number of tools such as BWA, Bowtie have been developed for sequence matching based on suffix array derived data structures, in particular the Burrows-Wheeler tranform (BWT) and Ferragina-Manzini (FM) index, which have the nice property that they not only give asymptotically optimal search, but also are highly compressed data structures (they underlie the bzip compression algorithms). I will discuss a number of approaches based on these data structures for primary data processing, sequence assembly, variation detection and large scale genetic analysis, with applications to very large scale human genetic variation data sets.

BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables – The l1-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program with number of parameters scaling quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20,000 variables. In this paper, we develop an algorithm BigQUIC, which can solve 1 million dimensional l1-regularized Gaussian MLE problems (which would thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations; and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that BigQUIC can achieve super-linear or even quadratic convergence rates.


Richard Durbin and Cho-Jui Hsieh

Richard Durbin is a Senior Group Leader and joint Head of Human Genetics at The Wellcome Trust Sanger Institute. He is currently co-leading the 1000 Genomes Project to produce a deep catalogue of human genetic variation by large scale sequencing, and the UK10K collaboration to extend sequence based genetics to samples with clinically relevant phenotypes. Previously Richard contributed to the human genome project, and development of the Pfam database of protein families and the Ensembl genome data resource. He has also made theoretical and algorithmic contributions to biological sequence analysis. Richard has a BA in Mathematics, and a PhD in Biology from Cambridge University, where he was also a Research Fellow, at King’s College, from 1986 to 1988. He was a Fulbright Visiting Scholar in Biophysics at Harvard University from 1982 to 1983 and a Lucille P Markey visiting Fellow in the Department of Psychology, Stanford University from 1988 to 1990. He was a staff scientist at the MRC Laboratory of Molecular Biology from 1990 to 1996, and was Head of Informatics at the Sanger Institute from 1992-2006 and Deputy Director from 1997 to 2006. He was elected a Fellow of the Royal Society in 2004. Richard’s home page can be found at http://www.sanger.ac.uk/research/faculty/rdurbin/