The Coming Genomics Software Revolution?


July 16, 2013


David Haussler, David Heckerman, David Patterson, and George Varghese


University of California, Microsoft Research


With the cost to sequence a full human genome soon to fall below US$1,000, most people will have their DNA sequenced and stored in a database along with their medical records. The hardware revolution has largely occurred; this session will focus on the software issues that remain. If there were a database of a million or more fully sequenced genomes together with phenotype information such as disease and treatment codes, what software would be necessary to mine it for biological insights that might eventually lead to drug discovery and personalized medicine?

Cancer is a fundamentally genomic disease, and drugs such as Gleevec have shown the effectiveness of combatting the genomic pathways of disease. Could a large genomic database enable progress like this for other cancers, or for diseases other than cancer? What are the medical and systems implications of such a database? Could genomics engender a new software industry?

This session of the 2013 Microsoft Research Faculty Summit looks at the existing TCGA (The Cancer Genome Atlas) database, which currently consists of 5,000 cancer genomes, and at the open algorithmic problems that arise in making use of the data. The presenters describe parallel infrastructure and computing paradigms for genome computation, and new systems issues that must be solved. They discuss issues in scaling genomic inference in the face of confounding factors. The session also explores the challenges that must be solved in order to query a vast genome database interactively in a few seconds.



David Haussler develops new computer-based algorithms to interpret comparative and high-throughput genomics data to understand gene structure, function, and regulation. As a collaborator on the international Human Genome Project, his team assembled the first human genome sequence and produced the UCSC Genome Browser. His group’s informatics work on cancer genomics provides a complete analysis pipeline from raw DNA reads through the detection and interpretation of mutations and altered gene expression in tumor samples. Recently, he built the CGHub database to hold all cancer genomics data for the National Cancer Institute. Haussler received his Ph.D. in computer science from the University of Colorado at Boulder. He is a member of the National Academy of Sciences and the American Academy of Arts and Sciences. He has won a number of awards, including the 2011 Weldon Memorial Prize for application of mathematics and statistics to biology, the 2009 ASHG Curt Stern Award in Human Genetics, and the 2008 Senior Scientist Accomplishment Award from the International Society for Computational Biology.

David Heckerman is Senior Director of eScience and Distinguished Scientist at Microsoft Research. He is known for his work in showing the importance of probability theory in artificial intelligence, for developing methods to learn graphical models from data, and for developing machine learning and statistical approaches for biological and medical applications, including the design of a vaccine for HIV and the identification of genetic causes of disease. At Microsoft, he has developed numerous applications including data-mining tools in SQL Server and Commerce Server, the junk-mail filters in Outlook, Exchange, and Hotmail, handwriting recognition in the Tablet PC, text-mining software in SharePoint Portal Server, troubleshooters in Windows, and the Answer Wizard in Office. David received his Ph.D. (1990) and M.D. (1992) from Stanford University, and is an ACM and AAAI Fellow.

David Andrew Patterson is an American computer pioneer and academic who has held the position of professor of Computer Science at the University of California, Berkeley since 1977. Patterson is noted for his pioneering contributions to RISC (reduced instruction set computing) processor design, having coined the term RISC and led the Berkeley RISC project. He is also noted for his research on RAID (redundant array of independent disks). Patterson’s book on computer architecture (co-authored with John L. Hennessy) is widely used in computer science education. Patterson is a Fellow of the American Association for the Advancement of Science.

George Varghese is a researcher in Victor Bahl’s network research group. He obtained his Ph.D. from MIT in 1992, worked at Washington University from 1993 to 1999, and was a professor of computer science at UCSD from 1999 to 2012. He won the ONR Young Investigator Award in 1996, and was elected a Fellow of the ACM in 2002. Together with colleagues, he has been awarded 16 patents in the general field of Network Algorithmics. Several of the algorithms he helped develop have found their way into commercial systems including Linux (timing wheels), the Cisco GSR (DRR), and Microsoft Windows (IP lookups). His book “Network Algorithmics” was published in December 2004 by Morgan Kaufmann. In May 2004, he co-founded NetSift Inc., where he was the President and CTO. NetSift was acquired by Cisco Systems in 2005. For the 2010-2011 academic year, he was the Distinguished Visitor in the Computer Science department at Stanford University.