With the cost to sequence a full human genome soon to fall below US$1,000, most people will have their DNA sequenced and stored in a database along with their medical records. The hardware revolution has largely occurred; this session will focus on the software issues that remain. If there were a database of a million or more fully sequenced genomes together with phenotype information such as disease and treatment codes, what software would be necessary to mine it for biological insights that might eventually lead to drug discovery and personalized medicine?
Cancer is a fundamentally genomic disease, and drugs such as Gleevec have shown the effectiveness of targeting the genomic pathways of disease. Could a large genomic database enable progress like this for other cancers, or for diseases other than cancer? What are the medical and systems implications of such a database? Could genomics engender a new software industry?
This session of the 2013 Microsoft Research Faculty Summit looks at the existing TCGA (The Cancer Genome Atlas) database, which currently consists of 5,000 cancer genomes, and at the open algorithmic problems that arise in making use of the data. The presenters describe parallel infrastructure and computing paradigms for genome computation, along with the new systems issues that must be resolved. They discuss the difficulty of scaling genomic inference in the face of confounding factors. The session also explores the challenges that must be overcome in order to query a vast genome database interactively, within a few seconds.