Computational Biology Seminar Series

About

The Computational Biology Seminar Series is an occasional forum for delivering academic computational biology talks. All talks are open to the public.

 

Seminar List Subscription

To subscribe to talk announcements for this series, send a blank message to MSRNE-CB-ANNOUNCE-subscribe-request@LISTS.RESEARCH.MICROSOFT.COM. You will be asked to confirm the subscription by email; please check your Spam or Junk mail folder if the confirmation does not arrive. Due to recent broad changes in spam filtering protocols, the list announcements themselves may go to Spam/Junk; we are experimenting with different ways of sending announcement emails, so they may arrive from different email addresses. If you have any questions or concerns, please send us an email.

 

Upcoming Talks

Paolo Casale, U. Cambridge/EBI, Multivariate set tests for genetic association studies

WHO: Paolo Casale

AFFILIATION: University of Cambridge and the European Bioinformatics Institute

TITLE: Multivariate set tests for genetic association studies

HOST: Jennifer Listgarten and Nicolo Fusi

WHEN: Thursday, January 19th, 2017

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA / Conf Room NERD 1/1180 Clara Barton

SCHEDULE: 11:00 AM – 12:00 PM

Abstract

In the last decade, genome-wide association studies (GWAS) have helped to advance our understanding of the genetic architecture of important traits, including human diseases. However, optimal analysis of GWAS data remains a major challenge. First, many traits of interest have polygenic architectures, meaning they are controlled by a large number of loci, each of which can harbor multiple causal variants. Second, these loci tend to have effects on multiple related phenotypes, a phenomenon known as pleiotropy. Finally, genetic effects can depend on the specific cellular and environmental context. In this talk, I will present new statistical models that characterize polygenicity and pleiotropy and elucidate context-specificity by enabling joint genetic analysis across multiple variants, traits, and contexts.

First, I will discuss mtSet, an efficient mixed-model approach that enables genome-wide association testing between sets of genetic variants and multiple traits while accounting for confounding factors. Both in simulations and in applications to real datasets, we find that this joint modelling approach offers power advantages compared to methods that aggregate either across traits or variants in isolation.

In the second part of the talk, I will discuss extensions to this model, deriving a new strategy to test for interactions between sets of genetic variants and categorical contexts (iSet). iSet accounts for polygenic effects and allows for characterizing context-specificity at genetic loci. In an application to a monocyte eQTL dataset, this method increases power for detecting genotype-context interactions and identifies genes associated with different regulatory architectures between contexts. Our results highlight that even “simple” traits such as gene expression levels can have surprisingly complex cis-genetic architectures, with multiple associated variants and changes of genetic regulation between different cellular contexts.
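
To make the set-testing idea concrete, here is a minimal sketch of a generic SKAT-style variance-component score statistic for a single trait. It is not the speaker's mtSet implementation (which handles multiple traits, relatedness, and efficient inference); the simulated data and variable names are illustrative assumptions.

```python
# Minimal sketch (not mtSet itself): a SKAT-style variance-component score
# statistic for testing a *set* of variants jointly against one phenotype.
# The simulated data and all variable names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, p_set = 500, 20                      # individuals, variants in the tested set

G = rng.binomial(2, 0.3, size=(n, p_set)).astype(float)   # genotypes (0/1/2)
X = np.column_stack([np.ones(n), rng.normal(size=n)])      # intercept + covariate
y = X @ np.array([1.0, 0.5]) + G[:, :3] @ np.array([0.2, 0.15, 0.1]) \
    + rng.normal(size=n)                                    # phenotype, 3 causal variants

# Null model: regress phenotype on covariates only, keep residuals.
beta_null, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_null
sigma2 = resid @ resid / (n - X.shape[1])

# Set kernel K = G G^T (linear kernel over the variant set) and score statistic
# Q = r^T K r / sigma^2; a large Q suggests the set explains phenotypic variance.
K = G @ G.T
Q = resid @ K @ resid / sigma2
print(f"score statistic Q = {Q:.1f}")

# In practice Q is compared against its null distribution (a mixture of
# chi-squared variables); mtSet additionally models multiple traits and
# confounding, which this toy example omits.
```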

Biography

Francesco Paolo Casale carried out his PhD research in statistical genetics at the University of Cambridge and at the European Bioinformatics Institute. His PhD work has been concerned with the development of integrative statistical methods to unravel genotype-to-phenotype relationships. At the same time, he has been actively involved in applying these approaches to datasets from major studies, including the latest release of the 1000 Genomes Project, the BLUEPRINT initiative and others. Previously, he obtained a Bachelor’s and Master’s degree in physics from the University of Naples.

 

Talk Title TBA - Michael Hoffman, University of Toronto

Title TBA

Michael Hoffman, University of Toronto

Monday, March 27, 2017, 10:00 AM – 11:00 AM

ABSTRACT: TBA

BIOGRAPHY: TBA

Arrival Guidance

Upon arrival, be prepared to show a picture ID and sign the Building Visitor Log at the Lobby Floor Security Desk. Let the security staff know the name of the event you are attending and ask them to direct you to the appropriate floor. Talks are typically held in the First Floor Conference Center; however, the location occasionally changes.

Parking Information

Guests are allowed to park in our garage located at One Memorial Drive. Microsoft receptionists will not validate parking for any guests. All-day parking is $27.00 on weekdays and $10.00 on weekends. Please note that these rates are subject to change.

 

*Hospitality Notice: Microsoft Research may provide hospitality at this event. Because different universities and legal jurisdictions have differing rules, we rely on you to know whether acceptance of this invitation would be inconsistent with those rules. Accordingly, by accepting our invitation, you confirm that acceptance is compliant with your institution’s policies.

Past Talks

Omer Weissbrod, Weizmann Institute of Science — Multi-kernel linear mixed models for complex phenotype prediction

Multi-kernel linear mixed models for complex phenotype prediction

Omer Weissbrod, Weizmann Institute of Science

January 11, 2017

Abstract

Linear mixed models (LMMs) and their extensions have recently become the method of choice in phenotype prediction for complex traits. However, LMMs have typically been applied under the assumption of simple genetic architectures. Here we present MKLMM (multi-kernel linear mixed model), a predictive modeling framework that extends the standard LMM using multiple-kernel machine learning approaches. MKLMM can model genetic interactions and is particularly suitable for modeling complex local interactions between nearby variants. We additionally present MKLMM-Adapt, which automatically infers interaction types across multiple genomic regions. In an analysis of eight case-control datasets from the Wellcome Trust Case Control Consortium and over a hundred mouse phenotypes, MKLMM-Adapt consistently outperforms competing methods in phenotype prediction. MKLMM is as computationally efficient as standard LMMs and does not require storage of genotypes, thus achieving state-of-the-art predictive power without compromising computational feasibility or genomic privacy.
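
As a rough illustration of the multi-kernel idea (not the published MKLMM software), the sketch below combines a genome-wide linear kernel with a nonlinear RBF kernel over a local region and selects mixing weights by maximizing a Gaussian marginal likelihood. Kernel choices, weights, and the simulated data are assumptions for illustration only.

```python
# Illustrative sketch of the multi-kernel idea behind MKLMM (not the published
# software): combine a genome-wide linear kernel with an RBF kernel on a local
# region and pick mixing weights by maximizing the Gaussian marginal likelihood.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
n, p, p_local = 300, 1000, 50
G = rng.binomial(2, 0.4, size=(n, p)).astype(float)
G = (G - G.mean(0)) / G.std(0)                    # standardize genotypes
y = rng.normal(size=n)                            # placeholder phenotype

K_lin = G @ G.T / p                               # additive genome-wide kernel
D = cdist(G[:, :p_local], G[:, :p_local], "sqeuclidean")
K_rbf = np.exp(-D / p_local)                      # nonlinear kernel on a local region

def neg_log_marglik(y, K, sigma2_e):
    """-log N(y | 0, K + sigma2_e * I), up to a constant."""
    C = K + sigma2_e * np.eye(len(y))
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (logdet + y @ np.linalg.solve(C, y))

# Naive grid search over kernel weights; MKLMM optimizes these (and more
# kernels) with far more efficient algebra than this toy example.
best = min(
    ((w1, w2, neg_log_marglik(y, w1 * K_lin + w2 * K_rbf, 1.0))
     for w1 in (0.1, 0.5, 1.0) for w2 in (0.0, 0.1, 0.5, 1.0)),
    key=lambda t: t[2],
)
print("best (w_linear, w_rbf):", best[:2])
```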

Biography

Omer Weissbrod is currently a postdoctoral researcher in the Computer Science Department of the Weizmann Institute of Science, working in Eran Segal’s lab. His research focuses on understanding and predicting the risk of genetic diseases using high-dimensional genetic data. Omer completed his PhD under the joint guidance of Prof. Saharon Rosset of the Statistics Department at Tel Aviv University and Prof. Dan Geiger of the Computer Science Department at the Technion. Previously, he was a researcher in the Machine Learning for Healthcare and Life Sciences group at IBM Research Haifa, where he worked on a variety of projects involving analysis of genetic and clinical data. His papers have been published in top scientific journals, including Nature Methods and Genome Research.

Michael Baym, Harvard — Evolutionary Strategies to Counter Antibiotic Resistance

Evolutionary Strategies to Counter Antibiotic Resistance

Michael Baym, Harvard Medical School

Friday, December 2, 2016

Abstract

Antibiotics are among the most important tools in medicine, but today their efficacy is threatened by the evolution of resistance. While resistance continues to spread, the development of new antibiotics is slowing. We need new strategies to delay or reverse the course of resistance evolution. In this talk I will describe several approaches to studying and manipulating resistance evolution in detail. First, using a new experimental device, we have been able to study previously elusive aspects of the evolution of antibiotic resistance in spatial environments. Second, I will show how increasing the scale of experiments allows both the discovery of new avenues of attack and potential failure modes of evolutionary interventions. I will conclude with the algorithmic and biological challenges in the practical application of these approaches. (See also https://vimeo.com/180908160.)

Biography

Michael is a research fellow in the Kishony Lab in the Department of Systems Biology at Harvard Medical School. His work focuses on the evolution of antibiotic resistance and algorithms for the efficient analysis of large-scale datasets.

Computational Aspects of Biological Information (CABI) 2016

WHAT: A Full Day Computational Biology Symposium

WHEN: Wednesday, November 30, 2016

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

REGISTRATION: Additional information can be found here

Confirmed speakers include:

Co-organizers: Nicolo Fusi, Max Leiserson, Jennifer Listgarten

Meromit Singer, Broad Institute — Identification and validation of a gene module specific for T cell dysfunction in tumor via population and single-cell transcriptomics

Identification and validation of a gene module specific for T cell dysfunction in tumor via population and single-cell transcriptomics

Meromit Singer, Broad Institute

Friday, November 4, 2016, 4:00 PM – 5:00 PM

Abstract:

Targeted engagement of the immune system to tackle cancer is currently at the forefront of therapeutic development and is achieving unprecedented success in the clinic. One of the most successful approaches involves rehabilitating dysfunctional (“exhausted”) T cells within a tumor so that they regain their full functional capacity for tumor clearance. However, the different T cell subsets found within tumors are not well defined or characterized, and the underlying molecular mechanisms driving T cells in and out of a dysfunctional state are poorly understood. This gap in knowledge is currently the major limiting factor in identifying novel therapeutic targets and achieving improved patient prognosis. In this talk I will describe our path to the discovery of a gene module that is specific to the dysfunctional T cells within a tumor, and how it enabled the identification of a novel dysfunctional T cell subpopulation. We will discuss computational approaches applied to high-throughput transcriptomics data to achieve molecular characterization of T cell dysfunction at the population and single-cell level, as well as the interplay between data analysis, hypothesis generation and experimental follow-up.

Biography:

Meromit Singer is a postdoctoral researcher in Aviv Regev’s group at the Broad Institute. Meromit completed her B.Sc. at Tel-Aviv University and continued to a PhD in Computer Science at UC Berkeley, under the supervision of Lior Pachter.

Lior Pachter, University of California at Berkeley — Developing clinical capability for RNA-Seq

Developing clinical capability for RNA-Seq

Lior Pachter, University of California at Berkeley

Wednesday, October 5, 2016

Abstract:

I’ll discuss a number of recent technological and computational developments in RNA-Seq that are transforming its clinical capabilities.

Biography:

Lior Pachter was born in Ramat Gan, Israel, and grew up in Pretoria, South Africa where he attended Pretoria Boys High School. After receiving a B.S. in Mathematics from Caltech in 1994, he left for MIT where he was awarded a PhD in applied mathematics in 1999. He then moved to the University of California at Berkeley where he was a postdoctoral researcher (1999-2001), assistant professor (2001-2005), associate professor (2005-2009), and is currently the Raymond and Beverly Sackler professor of computational biology at UC Berkeley and professor of mathematics and molecular and cellular biology with a joint appointment in computer science. His research interests span the mathematical and biological sciences, and he has authored over 100 research articles in the areas of algorithms, combinatorics, comparative genomics, algebraic statistics, molecular biology and evolution. He’s been awarded a National Science Foundation Career award, a Sloan Research Fellowship, the Miller Professorship, and a Federal Laboratory Consortium award for the successful technology transfer of widely used sequence alignment software developed in his group.

David Kelley, Harvard — Learning the regulatory code of the accessible genome with deep convolutional neural networks

WHO: David Kelley

AFFILIATION: Harvard University

HOST: Jennifer Listgarten

WHEN: Friday, May 27th, 2016.

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 4pm-5pm

ABSTRACT:

The complex language of eukaryotic gene expression remains incompletely understood. Despite the importance suggested by the many noncoding variants statistically associated with human disease, nearly all such variants have unknown mechanisms. I’ll address this challenge using an approach based on a recent machine learning advance: deep convolutional neural networks (CNNs). My colleagues and I developed an open source package, Basset (https://github.com/davek44/Basset), to apply CNNs to learn the functional activity of DNA sequences from genomics data. We trained Basset on a compendium of accessible genomic sites mapped in 164 cell types by DNaseI-seq and demonstrate far greater predictive accuracy than previous methods. Basset predictions for the change in accessibility between variant alleles were far greater for GWAS SNPs that are likely to be causal than for nearby SNPs in linkage disequilibrium with them. With Basset, a researcher can perform a single sequencing assay in their cell type of interest and simultaneously learn that cell’s chromatin accessibility code and annotate every mutation in the genome with its influence on present accessibility and latent potential for accessibility. Thus, Basset offers a powerful computational approach to annotate and interpret the noncoding genome.
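
For readers unfamiliar with the general recipe, the following PyTorch sketch shows the basic pattern of one-hot-encoded DNA fed into 1D convolutions with multi-task sigmoid outputs over cell types. The layer sizes and names are illustrative assumptions; the actual Basset architecture and training code live in the linked repository.

```python
# A minimal PyTorch sketch of the general recipe (one-hot DNA -> 1D convolutions
# -> per-cell-type accessibility probabilities). Layer sizes are illustrative;
# the real Basset architecture is in the linked repository.
import torch
import torch.nn as nn

def one_hot_dna(seq):
    """Encode an ACGT string as a (4, length) float tensor."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq):
        x[idx[base], i] = 1.0
    return x

class AccessibilityCNN(nn.Module):
    def __init__(self, n_cell_types=164):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=19, padding=9),   # motif-like filters
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                        # max over positions
        )
        self.head = nn.Linear(64, n_cell_types)            # one output per cell type

    def forward(self, x):                                   # x: (batch, 4, length)
        h = self.conv(x).squeeze(-1)
        return torch.sigmoid(self.head(h))                  # P(accessible) per cell type

model = AccessibilityCNN()
seq = "ACGT" * 150                                          # 600-bp toy sequence
probs = model(one_hot_dna(seq).unsqueeze(0))
print(probs.shape)                                          # torch.Size([1, 164])
```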

BIOGRAPHY:

David completed his PhD in the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park, advised by Steven Salzberg, where he developed methods and software for genome assembly and gene prediction. In 2007, he joined John Rinn’s lab in the Department of Stem Cell and Regenerative Biology at Harvard, where he performed multiple analyses of the function and evolution of long noncoding RNAs. David also introduced an approach based on deep convolutional neural networks to predict the functional activity of DNA sequences. He recently joined Calico Labs, where he’ll continue to apply machine learning approaches to genomics toward better understanding the aging process.

Nir Friedman — Dynamics of Chromatin and Transcription

WHO: Nir Friedman

AFFILIATION: The Hebrew University

HOST: Jennifer Listgarten

WHEN: Friday, August 12th, 2016.

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 2pm-3pm

ABSTRACT:

In this talk I will review our ongoing efforts to understand the interplay of transcription and chromatin using combined experimental approaches, systematic analysis and very little modeling. I will try to motivate why this system is interesting and also biologically important, and highlight the challenges on the way.

BIOGRAPHY:

Nir Friedman is a Professor of Computer Science and Biology at the Hebrew University of Jerusalem. His research combines machine learning and statistical learning with systems biology, specifically in the fields of gene regulation, transcription and chromatin. His highly cited research includes work on Bayesian network classifiers, Bayesian structural EM, and the use of Bayesian methods to analyze gene expression data. More recent work focuses on probabilistic graphical models, reconstructing regulatory networks, genetic interactions, and the role of chromatin in transcriptional regulation. In 2009, Friedman and Daphne Koller published a textbook on probabilistic graphical models. Later that year, he joined the Institute of Life Sciences and opened an experimental lab where he uses advanced robotic tools to study transcriptional regulation in the yeast Saccharomyces cerevisiae.

Debora S. Marks & John Ingraham — Inferring the structure and function of biological sequences

WHO: Debora S. Marks and John Ingraham

AFFILIATION: Harvard Medical School

HOST: Jennifer Listgarten

WHEN: Thursday, June 9th, 2016.

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 3pm-4pm

ABSTRACT:

Modern genome sequencing and synthesis can acquire and generate tremendous molecular diversity in a day, but our ability to navigate and interpret the exponentially large space of possible biological sequences remains limited. Central to this challenge is the lack of a priori knowledge about epistasis, i.e. non-additive interactions between positions in a molecule. We discuss a family of probabilistic models based on pairwise interactions that can successfully recover both the 3D structures and the effects of mutations in proteins and RNA in an unsupervised fashion from evolutionary examples. When used to predict 3D structure, the models give insight into a number of challenging cases, including membrane proteins, complexes, disordered proteins, and RNAs. As we apply these interaction-based models to predict the effects of mutations, we find that the tremendous number of parameters involved necessitates new hierarchical priors and inference algorithms that facilitate Bayesian approaches to sparsity in large, undirected models. We discuss the outlook for applications in biomedicine and engineering.

BIOGRAPHIES:

Debora Marks, a computational biologist, is a new assistant professor of Systems Biology at Harvard Medical School and the director of the Raymond and Beverly Sackler Laboratory for Computational Biology, a new collaborative venture. Debora has a track record of using algorithms and statistics to successfully address unsolved biological problems. During her PhD, she quantified the potential pan-genomic scope of microRNA targeting and combinatorial regulation of protein expression. She made headway on the classic, unsolved problem of ab initio 3D structure prediction of proteins using a maximum entropy probability model for evolutionary sequences, and has extended this to RNA genes, interactions and the challenge of biomolecular flexibility. Her new lab is developing algorithms for the important challenge of quantifying the effects of genetic variants, including those involved in antibiotic resistance. Her lab is actively recruiting, and she encourages you to get in touch and visit (see marks.hms.harvard.edu). Debora is a recipient of the 2016 Overton Prize from the International Society for Computational Biology and the 2016 Charles E.W. Grinnell Medical Research Award for computational approaches to critical challenges in biomedical research.

John Ingraham is a graduate student in Systems Biology at Harvard Medical School. With a background in applied math and biochemistry, his work has tended towards many length scales of quantitative biology, including biophysical models of brain cancer, fieldwork and modeling of howler monkey ranging patterns, and a now-daily interaction with the 20 amino acids. In his PhD, John is developing new probabilistic models and methods for learning and leveraging structure in sequence space. John is supported by an NSF graduate fellowship.

Anshul Kundaje — Integrative, interpretable deep learning frameworks for regulatory genomics and epigenomics

WHO: Anshul Kundaje

AFFILIATION: Stanford University

HOST: Jennifer Listgarten

WHEN: Friday, April 29th, 2016.

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 11am-12pm

ABSTRACT:

We present generalizable and interpretable supervised deep learning frameworks to predict the regulatory and epigenetic state of putative functional genomic elements by integrating raw DNA sequence with diverse chromatin assays such as ATAC-seq, DNase-seq or MNase-seq. First, we develop multi-modal convolutional neural networks (CNNs) that can integrate haploid or diploid DNA sequence and chromatin accessibility profiles (DNase-seq or ATAC-seq) to predict in-vivo binding sites of a diverse set of transcription factors (TFs) across cell types with high accuracy. Our integrative models provide significant improvements over other state-of-the-art methods, including recently published deep learning TF binding models. Next, we train multi-task, multi-modal deep CNNs to simultaneously predict multiple histone modifications and combinatorial chromatin state at regulatory elements by integrating DNA sequence, RNA-seq and ATAC-seq or a combination of DNase-seq and MNase-seq. Our models achieve high prediction accuracy even across cell types, revealing a fundamental predictive relationship between chromatin architecture and histone modifications. Finally, we develop DeepLIFT (Deep Learning Important FeaTures), a novel interpretation engine for extracting and ranking predictive and biologically meaningful patterns from deep neural networks (DNNs) for diverse genomic data types. We apply DeepLIFT to our models to obtain unified TF sequence affinity motifs, infer high-resolution point binding events of TFs, dissect regulatory sequence grammars involving homodimeric and heterodimeric binding with co-factors, learn predictive chromatin architectural features, and unravel the sequence and architectural heterogeneity of regulatory elements.
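
DeepLIFT itself assigns importance by propagating activation differences relative to a reference input. As a much simpler, purely illustrative stand-in for that interpretation step, the sketch below computes gradient-times-input attributions for a toy sequence model; the model, shapes, and names are assumptions and are not the speaker's code.

```python
# DeepLIFT propagates activation differences against a reference input; as a
# much simpler, illustrative stand-in, this sketch computes gradient * input
# attributions for a toy sequence model (all shapes and names are assumptions).
import torch
import torch.nn as nn

model = nn.Sequential(                      # toy stand-in for a trained genomics CNN
    nn.Conv1d(4, 8, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(8, 1),
)

x = torch.rand(1, 4, 200, requires_grad=True)   # one-hot-like input, 200 bp
score = model(x).sum()
score.backward()                                 # gradient of the output w.r.t. the input

attribution = (x.grad * x).squeeze(0)            # gradient * input, shape (4, 200)
per_position = attribution.sum(dim=0)            # importance per base position
print(per_position.topk(5).indices)              # positions the model relies on most
```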

BIOGRAPHY:

Anshul Kundaje is an Assistant Professor of Genetics and Computer Science at Stanford University and a 2014 Alfred Sloan Fellow. His primary research interest is computational regulatory genomics. His lab develops statistical and machine learning methods for large-scale integrative analysis of diverse functional genomic data to decipher heterogeneity of regulatory elements, uncover their long-range interactions in the context of 3D genome organization, learn transcriptional regulatory network models and understand the regulatory impact of non-coding genetic variation. Anshul has led the computational analysis efforts of two of the largest functional genomics consortia – The Encyclopedia of DNA Elements (ENCODE) Project and the Roadmap Epigenomics Project.

Finale Doshi-Velez — Learning Cross-Corpora Models of Disease Progression in Autism Spectrum Disorder

WHO: Finale Doshi-Velez

AFFILIATION: Harvard University

HOST: Jennifer Listgarten

WHEN: Tuesday, May 3rd, 2016.

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 3pm-4pm

ABSTRACT:

Patients with developmental disorders, such as autism spectrum disorder (ASD), present with symptoms that change with time even if the named diagnosis remains fixed. For example, language impairments may present as delayed speech in a toddler and difficulty reading in a school-age child. Characterizing these trajectories is important for early treatment. However, deriving these trajectories from observational sources is challenging: electronic health records only reflect observations of patients at irregular intervals and only record what factors are clinically relevant at the time of observation. Meanwhile, caretakers discuss daily developments and concerns on social media.

In this talk, I will present a fully unsupervised approach for learning disease trajectories from incomplete medical records and social media posts, including cases in which we have only a single observation of each patient. In particular, we use a dynamic topic model approach which embeds each disease trajectory as a path in R^D. A Pólya-gamma augmentation scheme is used to efficiently perform inference as well as incorporate multiple data sources. We learn disease trajectories from the electronic health records of 13,435 patients with ASD and the forum posts of 13,743 caretakers of children with ASD, deriving interesting clinical insights as well as good predictions. I’ll end with broader questions about learning disease models from data.

BIOGRAPHY:

Finale Doshi-Velez is an assistant professor in Computer Science at Harvard. She completed her Master’s at the University of Cambridge, her PhD at MIT, and her postdoc at Harvard Medical School.

John Doench — Genetic Screens with CRISPR: A New Hope in Functional Genomics

WHO: John Doench

AFFILIATION: Broad Institute of MIT and Harvard

HOST: Jennifer Listgarten and Nicolo Fusi

WHEN: Wednesday, May 4th, 2016.

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 4pm-5pm.

ABSTRACT:

Functional genomics attempts to understand the genome by disrupting the flow of information from DNA to RNA to protein and then observing how the cell or organism changes in response. Both RNAi and CRISPR technologies are simply hacks of systems that originally evolved to silence viruses, reprogrammed to target genes we are interested in studying; decoding the function of genes is a critical step towards understanding how gene dysfunction leads to disease. Here we will discuss the development and optimization of CRISPR technology for genome-wide genetic screens and its application to multiple biological problems.

BIOGRAPHY:

John Doench is the Associate Director of the Genetic Perturbation Platform at the Broad Institute. He develops and applies the latest approaches in functional genomics, including RNAi, ORF, and CRISPR technologies, to understand the function of genes and how gene dysfunction leads to disease. John collaborates with researchers across the community to develop faithful biological models and execute genetic screens. Prior to joining the Broad in 2009, John did his postdoctoral work at Harvard Medical School, received his PhD from the biology department at MIT, and majored in history at Hamilton College. John lives in Jamaica Plain, MA with his wife and daughter, where he enjoys coaching soccer, cheering on the Red Sox and Patriots, playing volleyball, running, and avoiding imminent death while navigating the streets of Boston on a bicycle.

Hilary Finucane — Insight into the biology of common diseases using summary statistics of large genome-wide association studies

WHO: Hilary Finucane

AFFILIATION: Harvard School of Public Health and MIT Mathematics

HOST: Jennifer Listgarten

WHEN: Tues. February 16th, 2016

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 11am-12pm

ABSTRACT:

Datasets with genotype data for tens of thousands of individuals with and without a given disease contain valuable information about the genetic basis of the disease. However, for most common diseases, obtaining insights from these data is difficult because the signal is very diffuse: there are likely thousands or tens of thousands of genetic variants that each contribute a small amount to disease risk, and that are hidden among roughly a million variants in the dataset. Moreover, for many of the largest genotype datasets, no individual researcher has access to all of the genotype data; rather, the only data available are meta-analyzed marginal effect size estimates for each variant. I will describe a powerful approach to modeling these summary statistics that allows us, for example, to identify disease-relevant tissues or to quantify the degree to which two traits have a common genetic basis. The method, called LD score regression, is based on a commonly used model in genetics in which the effect of each variant on the disease is random. The parameters of this model provide information about the disease such as whether regions of the genome active in a given tissue (e.g., liver) tend to be more associated with disease than regions of the genome active in a second tissue (e.g., brain). The LD score regression method takes into account factors such as the correlational structure of the genome, potential confounding in the data, and the possibility that causal variants not in the dataset might be correlated with variants that are in the dataset.
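
A rough numerical sketch of the core regression described above is given below, with simulated inputs: per-SNP chi-squared statistics are regressed on LD scores, so the slope yields a heritability estimate and the intercept captures confounding. This is an assumption-laden illustration; the actual ldsc software adds weighting, block-jackknife standard errors, and the partitioned and cross-trait extensions discussed in the talk.

```python
# Hedged sketch of the core LD score regression idea:
#   E[chi2_j] ~ 1 + N*a + (N*h2/M) * l_j
# so regressing chi-squared statistics on LD scores gives a heritability slope
# and a confounding intercept. Simulated data; the real ldsc software also uses
# weighted regression and block-jackknife standard errors.
import numpy as np

rng = np.random.default_rng(2)
M, N, h2_true = 100_000, 50_000, 0.3          # SNPs, sample size, true heritability

ld_scores = rng.gamma(shape=2.0, scale=50.0, size=M)         # simulated LD scores
chi2 = 1.0 + (N * h2_true / M) * ld_scores \
       + rng.normal(scale=2.0, size=M)                        # noisy chi-squared stats

# Ordinary least squares of chi2 on LD score.
A = np.column_stack([np.ones(M), ld_scores])
intercept, slope = np.linalg.lstsq(A, chi2, rcond=None)[0]

h2_est = slope * M / N
print(f"intercept = {intercept:.3f} (≈1 if no confounding)")
print(f"estimated h2 = {h2_est:.3f}")
```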

BIOGRAPHY:

Hilary Finucane is a graduate student in the MIT Mathematics department doing research in statistical genetics. Her advisor is Alkes Price, at the Harvard School of Public Health. As an undergraduate at Harvard, she majored in math and wrote her senior thesis on coding schemes for multilevel flash memory with Michael Mitzenmacher. She then completed an MSc in theoretical computer science at the Weizmann Institute of Science, working with Irit Dinur, followed by a year of research in probability theory and geometric group theory with Itai Benjamini, also at the Weizmann Institute of Science. She is supported by a Hertz Foundation Fellowship.

Emily Oster — Using Observed Controls to Infer the Effect of Unobserved Controls

WHO: Emily Oster

AFFILIATION: Brown University

HOST: Jennifer Listgarten

WHEN: Wed, February 3rd, 2016.

WHERE: Microsoft Conference Center located at One Memorial Drive, First Floor, Cambridge, MA

SCHEDULE: 1:30-2:30 PM

ABSTRACT:

Omitted variable bias (equivalently, residual confounding) is a well-known issue in deriving causal effects from observational data. This issue is especially problematic when the confounding arises from variables unobserved by the researcher and when there is no random variation in treatment to rely on. I will discuss a methodology for evaluating the robustness of causal effects to unobserved confounders. The key assumption is a correspondence between the relationship of the treatment variable with the observed controls and its relationship with the unobserved controls. I will discuss the theory – an extension of Altonji, Elder and Taber (2005) – and present evidence that this approach may perform well in some social science settings. I will discuss its application to both the economics literature and the medical literature.
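
For readers who want a concrete handle on this line of work, the sketch below implements the commonly cited bias-adjustment approximation associated with Oster's extension of Altonji, Elder and Taber: the adjusted coefficient moves the controlled estimate further by the amount it moved when observed controls were added, scaled by how much unexplained variance remains. This is an illustration of that approximation under stated assumptions, not the speaker's exact estimator.

```python
# Hedged sketch of the commonly cited bias-adjusted coefficient approximation:
#   beta* ~= beta_tilde - delta * (beta_dot - beta_tilde) * (R_max - R_tilde) / (R_tilde - R_dot)
# where (beta_dot, R_dot) come from the regression with no controls and
# (beta_tilde, R_tilde) from the regression with observed controls.
# Illustration only, not the speaker's exact estimator.

def bias_adjusted_beta(beta_dot, r2_dot, beta_tilde, r2_tilde, r2_max, delta=1.0):
    """Approximate treatment effect if unobservables mattered as much as observables."""
    if r2_tilde <= r2_dot:
        raise ValueError("adding controls should increase R-squared")
    movement = (beta_dot - beta_tilde) * (r2_max - r2_tilde) / (r2_tilde - r2_dot)
    return beta_tilde - delta * movement

# Example: the estimate moves from 0.50 to 0.40 when controls raise R^2 from
# 0.10 to 0.25; projecting to R_max = 0.40 under equal selection (delta = 1):
print(bias_adjusted_beta(0.50, 0.10, 0.40, 0.25, 0.40))   # -> 0.30
```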

BIOGRAPHY:

Emily Oster is an associate professor of economics at Brown University. Within economics, she is one of the leading experts on public health issues, especially in developing-country contexts. Her research primarily uses observational data to reach important and provocative causal conclusions on major public health issues, such as the causes and consequences of infant mortality and HIV/AIDS. Beyond her purely academic work, Emily has worked to communicate the findings of the best research in health to consumers through her book on pregnancy, Expecting Better, and her column “Ask Emily” in the Wall Street Journal.

Computational Aspects of Biological Information 2015

Computational Aspects of Biological Information (CABI) 2015 was the third one-day workshop on challenges and successes in computational biology, which brought together experts in the Boston/Cambridge area to discuss computational solutions to problems in biology, including systems biology, genomics, and related areas.

CABI 2015 took place on Tuesday, December 1, 2015, at Microsoft Research New England in Cambridge, MA.

Workshop speakers included:

  • Bonnie Berger (MIT CSAIL)
  • Arup Chakraborty (MIT Chemistry)
  • Michael Desai (Harvard Systems Biology)
  • Polina Golland (MIT CSAIL)
  • Rafael Irizarry (Harvard University)
  • Leonid Mirny (Harvard-MIT Division of Health Sciences and Technology, MIT)
  • Peter Park (Harvard Medical School)
  • David Sontag (NYU)
  • Shamil Sunyaev (Harvard Medical School)

Organizing committee

Barbara Engelhardt — Recovering usable hidden structure using exploratory data analyses on genomic data

(This will be part of our MSR New England General Colloquium Series, intended for broad audiences of all backgrounds.)

Speaker: Barbara Engelhardt

Affiliation: Princeton

Host: Jennifer Listgarten

Date: Wed. November 4th, 2015

Time: 4pm – 5pm with reception to follow

Abstract

Methods for exploratory data analysis have been the recent focus of much attention in “big data” applications because of their ability to quickly allow the user to explore structure in the underlying data in a controlled and interpretable way. In genomics, latent factor models are commonly used to identify population substructure, identify gene clusters, and control noise in large data sets. In this talk I will describe a series of statistical models for exploratory data analysis to illustrate the structure that they are able to identify in large genomic data sets. I will consider several downstream uses for the recovered latent structure: understanding technical noise in the data, developing undirected networks from the recovered structure, and using this latent structure to study genomic differences among people.

Biography

Barbara Engelhardt is an assistant professor in the Computer Science Department and the Center for Statistics and Machine Learning at Princeton University. Prior to that, she was at Duke University as an assistant professor in Biostatistics and Bioinformatics and Statistical Sciences. She graduated from Stanford University and received her Ph.D. from the University of California, Berkeley, advised by Professor Michael Jordan. She did postdoctoral research at the University of Chicago, working with Professor Matthew Stephens. Interspersed among her academic experiences, she spent two years working at the Jet Propulsion Laboratory, a summer at Google Research, and a year at 23andMe, a personal genomics company. Professor Engelhardt received an NSF Graduate Research Fellowship, the Google Anita Borg Memorial Scholarship, and the Walter M. Fitch Prize from the Society for Molecular Biology and Evolution. She also received the NIH NHGRI K99/R00 Pathway to Independence Award. Professor Engelhardt is currently a PI on the Genotype-Tissue Expression (GTEx) Consortium. Her research interests involve statistical models and methods for analysis of high-dimensional data, with a goal of understanding the underlying biological mechanisms of complex phenotypes and human diseases.

Neil Lawrence — Personalized Health with Gaussian Processes

(This will be part of our MSR New England General Colloquium Series, intended for broad audiences of all backgrounds.)

Speaker: Neil Lawrence

Affiliation: University of Sheffield

Host: Nicolo Fusi

Date: Wed, Aug 19

Time: 4pm – 5pm with reception to follow

Abstract

Modern data connectivity gives us different views of the patient which need to be unified for truly personalized health care. I’ll give a personal perspective on the type of methodological and social challenges we expect to arise in this domain and motivate Gaussian process models as one approach to dealing with the explosion of data.

Biography

Neil Lawrence received his bachelor’s degree in Mechanical Engineering from the University of Southampton in 1994. Following a period as a field engineer on oil rigs in the North Sea, he returned to academia to complete his PhD in 2000 at the Computer Lab in Cambridge University. He spent a year at Microsoft Research in Cambridge before leaving to take up a Lectureship at the University of Sheffield, where he was subsequently appointed Senior Lecturer in 2005. In January 2007 he took up a post as a Senior Research Fellow at the School of Computer Science in the University of Manchester, where he worked in the Machine Learning and Optimisation research group. In August 2010 he returned to Sheffield to take up a collaborative Chair in Neuroscience and Computer Science.

Neil’s main research interest is machine learning through probabilistic models. He focuses on both the algorithmic side of these models and their application. He has a particular focus on applications in personalized health and computational biology, but happily dabbles in other areas such as speech, vision and graphics.

Neil was Associate Editor in Chief for IEEE Transactions on Pattern Analysis and Machine Intelligence (from 2011-2013) and is an Action Editor for the Journal of Machine Learning Research. He was the founding editor of the JMLR Workshop and Conference Proceedings (2006) and is currently series editor. He was an area chair for the NIPS conference in 2005, 2006, 2012 and 2013, Workshops Chair in 2010 and Tutorials Chair in 2013. He was General Chair of AISTATS in 2010 and AISTATS Programme Chair in 2012. He was Program Chair of NIPS in 2014 and is General Chair for 2015.

Oliver Stegle — Modeling molecular heterogeneity between individuals and single cells

Speaker: Oliver Stegle

Affiliation: European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI)

Host: Jennifer Listgarten and Nicolo Fusi

Date: Monday, May 11th, 2015

Time: 2:00 PM – 3:00 PM

Abstract

The analysis of large-scale expression datasets is often compromised by hidden structure between samples. In the context of genetic association studies, this structure can be linked to differences between individuals, which can reflect their genetic makeup (such as population structure) or be traced back to environmental and technical factors. In this talk, I will discuss statistical methods to reconstruct this structure from the observed data to account for it in genetic analyses. By incorporating principles from causal reasoning, we show that critical pitfalls of falsely explaining away true biological signals can be circumvented. In the second part of this talk I will extend the introduced class of latent variable models to account for unwanted heterogeneity in single-cell transcriptome datasets. In applications to a T helper cell differentiation study, we show how this model allows for dissecting expression patterns of individual genes and reveals new substructure between cells that is linked to cell differentiation. I will finish with an outlook of modeling challenges and initial solutions that enable combining multiple omics layers that are profiled in the same set of single cells.
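
A generic sketch of the overall strategy (not the speaker's specific models or software) is shown below: estimate hidden factors from the expression matrix and regress them out, or include them as covariates, before genetic association testing. The factor count, data, and names are illustrative assumptions.

```python
# Generic sketch of the strategy described above (not the speaker's software):
# estimate hidden factors from the expression matrix and residualize on them
# before per-gene genotype-expression association tests. Illustrative data.
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_genes, n_factors = 200, 5000, 10

Y = rng.normal(size=(n_samples, n_genes))        # expression matrix (samples x genes)

# Estimate hidden structure with a truncated SVD of the centered matrix.
Yc = Y - Y.mean(axis=0)
U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
factors = U[:, :n_factors] * S[:n_factors]       # per-sample hidden factors

# Residualize expression on the factors (plus an intercept).
X = np.column_stack([np.ones(n_samples), factors])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_resid = Y - X @ coef

# Y_resid (or Y with `factors` as covariates) then feeds into association tests;
# the talk's contribution is doing this without explaining away true signal.
```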

Biography

Oliver Stegle is a group leader at the European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) in Cambridge, UK. His group develops statistical methods to analyse high-dimensional molecular traits both in the context of genetic association and single-cell biology. He received his Ph.D. from the University of Cambridge, UK, in physics in 2009, working with David MacKay. After a period as a postdoctoral researcher at the Max Planck Campus in Tübingen, Germany, he moved to the EMBL-EBI in November 2012 to establish his own research group.

Dana Pe'er — Mapping single cells: A geometric approach

(This will be part of our MSR New England General Colloquium Series, intended for broad audiences of all backgrounds.)

Speaker: Dana Pe’er

Affiliation: Departments of Biological Sciences and Systems Biology, Columbia University

Host: Jennifer Listgarten

Date: Wed. Nov 5th, 2014

Time: 4:00 PM – 5:00 PM

Abstract

High dimensional single cell technologies are on the rise, rapidly increasing in accuracy and throughput. These offer computational biology both a challenge and an opportunity. One of the big challenges with this data-type is to understand regions of density in this multi-dimensional space, given millions of noisy measurements. Underlying many of our approaches is mapping this high-dimensional geometry onto a nearest neighbor graph and characterizing single-cell behavior using this graph structure. We will discuss a number of approaches: (1) an algorithm that harnesses the nearest neighbor graph to order cells according to their developmental maturity, and its use to identify novel progenitor B-cell sub-populations; (2) using reweighted density estimation to characterize cellular signal processing in T-cell activation; and (3) new clustering and dimensionality reduction approaches to map heterogeneity between cells, with an application to characterizing tumor heterogeneity in Acute Myeloid Leukemia.
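
The first step described above, mapping cells into a nearest-neighbor graph, can be sketched in a few lines with scikit-learn; the ordering, density re-weighting, and clustering built on top of that graph are the speaker's methods and are not reproduced here. Cell counts and marker dimensions below are illustrative assumptions.

```python
# Sketch of the first step described above: embed single-cell measurements as a
# k-nearest-neighbor graph. Downstream ordering/clustering methods are the
# speaker's and are not shown. Data dimensions are illustrative.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(4)
cells = rng.normal(size=(1000, 30))          # 1,000 cells x 30 markers (toy data)

# Sparse adjacency matrix: each cell connected to its 15 nearest neighbors.
knn = kneighbors_graph(cells, n_neighbors=15, mode="distance", include_self=False)
print(knn.shape, knn.nnz)                    # (1000, 1000), 15000 edges
```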

Biography

Dana Pe’er is an associate professor in the Departments of Biological Sciences and Systems Biology. Her team develops computational methods that integrate diverse high-throughput data to provide a holistic, systems-level view of molecular networks. Currently they have two key focuses: developing computational methods to interpret single cell data and understand cellular heterogeneity; and modeling how genetic and epigenetic variation alters regulatory network function and subsequently phenotype in health and disease. This path has led them to explore how systems biology approaches can be used to personalize cancer care. Dana is a recipient of the Burroughs Wellcome Fund Career Award, the NIH Director’s New Innovator Award, an NSF CAREER award, a Stand Up To Cancer Innovative Research Grant, a Packard Fellowship in Science and Engineering, and, very recently, the prestigious 2014 ISCB Overton Prize.

Quaid Morris — Reconstructing tumour subpopulation genotypes and evolution from short-read sequencing of bulk tumour samples

Speaker: Quaid Morris

Affiliation: Donnelly Center for Cellular and Biomolecular Research, University of Toronto

Host: Jennifer Listgarten

Date: Friday, September 12th, 2014

Time: 2:00 PM – 3:30 PM

Abstract

Tumours consist of genetically diverse subpopulations of cells that differ in their response to therapy and their metastatic potential. The short read sequencing used to characterize tumour heterogeneity only provides the allelic frequencies of the tumour somatic mutations, not full genotypes of individual cells. I will describe my lab’s efforts to recover these full genotypes by fitting subpopulation phylogenies to the allele frequency data. In some circumstances, a full, unique reconstruction is possible but often multiple phylogenies are consistent with the data. Our methods (PhyloSub, PhyloWGS, treeCRP) use Bayesian inference to distinguish ambiguous and unambiguous portions of the phylogeny thereby explicitly representing reconstruction uncertainty. Our methods incorporate simple somatic mutations (point mutations and indels) as well as copy number variations; have excellent results on real and simulated data; and can take as input allele frequencies from single or multiple tumour samples where these frequencies are estimated using either targeted or whole genome sequencing.
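
One basic constraint that these reconstructions rely on can be illustrated directly: in a candidate subclonal phylogeny, each clone's cellular prevalence must be at least the sum of its children's prevalences. The checker below is a toy illustration of that rule, not the PhyloSub/PhyloWGS/treeCRP inference itself; the data structures and values are assumptions.

```python
# Sketch of the core consistency rule used when fitting subpopulation phylogenies
# to allele frequencies (not the PhyloSub/PhyloWGS inference itself): in a valid
# tree, each clone's cellular prevalence >= the sum of its children's.
def tree_is_consistent(prevalence, children, tol=1e-6):
    """prevalence: clone -> prevalence in one sample; children: clone -> list of child clones."""
    for parent, kids in children.items():
        if sum(prevalence[k] for k in kids) > prevalence[parent] + tol:
            return False
    return True

# Toy example: clone 0 (all tumour cells) with two child subclones.
prevalence = {0: 1.00, 1: 0.60, 2: 0.30}
children = {0: [1, 2], 1: [], 2: []}
print(tree_is_consistent(prevalence, children))            # True: 0.60 + 0.30 <= 1.00

prevalence_bad = {0: 1.00, 1: 0.40, 2: 0.70}
children_bad = {0: [1], 1: [2], 2: []}
print(tree_is_consistent(prevalence_bad, children_bad))    # False: a child exceeds its parent
```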

Biography

Quaid Morris is an associate professor in the Donnelly Centre at the University of Toronto in Canada. He is a multi-disciplinary researcher with cross-appointments in the Departments of Computer Science, Engineering, and Molecular Genetics. He founded his lab in 2005, after having received his PhD from the Massachusetts Institute of Technology (MIT) in 2003. His doctoral training was in machine learning and computational neuroscience under the supervision of Peter Dayan at MIT and the Gatsby Unit at University College London. His lab uses statistical learning to make biological discoveries and to develop new methodology for analysing large-scale biomedical datasets. He is currently interested in understanding cancer (and other complex diseases) using genomics; post-transcriptional regulation; text mining of medical records; and the automated prediction of gene function (see http://www.genemania.org).

Nicolo Fusi — The Warped Linear Mixed Model: finding optimal phenotype transformations yields a substantial increase in signal in genetic analyses

Speaker: Nicolo Fusi

Affiliation: Microsoft Research, Los Angeles

Host: Jennifer Listgarten

Date: Wed. August 20th, 2014

Time: 2:00 PM – 3:30 PM

Abstract

Genome-wide association studies, now routine, still have many remaining methodological open problems. Among the most successful models for GWAS are linear mixed models, also used in several other key areas of genetics, such as phenotype prediction and estimation of heritability. However, one of the fundamental assumptions of these models—that the data have a particular distribution (i.e., the noise is Gaussian-distributed)—rarely holds in practice. As a result, standard approaches yield sub-optimal performance, resulting in significant losses in power for GWAS, increased bias in heritability estimation, and reduced accuracy for phenotype predictions. In this talk, I will discuss our solution to this important problem—a novel, robust and statistically principled method, the “Warped Linear Mixed Model”—which automatically learns an optimal “warping function” for the phenotype simultaneously as it models the data. Our approach effectively searches through an infinite set of transformations, using the principles of statistical inference to determine an optimal one. In extensive experiments, we find up to twofold increases in GWAS power, significantly reduced bias in heritability estimation and significantly increased accuracy in phenotype prediction, as compared to the standard LMM.
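
The key ingredient of warping models can be made concrete with a small sketch: when a monotone transformation f is applied to the phenotype, the Gaussian log-likelihood must include the Jacobian term sum(log f'(y)). The example below picks a Box-Cox exponent by maximum likelihood under i.i.d. noise; this is only an assumption-laden illustration, since the warped LMM learns a far more flexible warp jointly with the full mixed model.

```python
# Sketch of the warping idea (not the paper's exact method): choosing a monotone
# transform f by maximizing a Gaussian log-likelihood that includes the Jacobian
# term sum(log f'(y)). Here f is a simple Box-Cox family with i.i.d. noise; the
# warped LMM learns a more flexible warp jointly with the mixed model.
import numpy as np

rng = np.random.default_rng(5)
y = np.exp(rng.normal(size=1000))                 # skewed, non-Gaussian phenotype

def boxcox(y, lam):
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam

def warped_loglik(y, lam):
    z = boxcox(y, lam)
    n = len(y)
    gauss = -0.5 * n * np.log(2 * np.pi * z.var()) - 0.5 * n   # mean/variance profiled out
    jacobian = (lam - 1.0) * np.log(y).sum()                    # sum log d/dy boxcox(y, lam)
    return gauss + jacobian

lams = np.linspace(-1.0, 2.0, 61)
best_lam = max(lams, key=lambda lam: warped_loglik(y, lam))
print(f"selected Box-Cox exponent: {best_lam:.2f}")             # ~0 for log-normal data
```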

People