Intra-tumor heterogeneity is one of the key confounding factors in deciphering tumor evolution. Malignant cells, even when they come from a single tumor, vary in their gene expression, copy numbers, and mutations. Bulk sequencing of these diverse cells averages out their genomic profiles, masking the critical differences needed to properly characterize cancer progression and tumor evolution. Single cell sequencing of tumor cells is therefore of paramount importance for unmasking the underlying tumor heterogeneity. However, single cell sequencing comes with its own set of computational challenges. Single cell expression data is very noisy, with many dropout events (lowly expressed genes go undetected in many cells and are hence "dropped out"). In addition, high-fidelity estimation of mutations and copy number variation from single cell DNA requires very high-depth sequencing, which is not always possible due to financial constraints. Given the nature of the data, appropriate computational techniques are necessary to extract meaningful heterogeneity information from single cell genomic data. Here we use a variational autoencoder to transform the data into a latent encoded feature space in which the hidden tumor subpopulations are easier to differentiate. The technique generalizes across different types of genomic data, such as copy number variation from DNA sequencing and gene expression from RNA sequencing. We tested the method on three gene expression datasets with 2.5K to 8K tumor cells and three copy number variation datasets with 100 to 300 tumor cells. Analysis of the encoded feature space revealed subpopulations of cells bearing distinct genomic signatures and the evolutionary relationships between them.
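To make the idea of a latent encoded feature space concrete, here is a minimal sketch of a variational autoencoder forward pass and its loss, written in plain NumPy. It is not the architecture used in this work: the layer sizes, the tanh activations, the squared-error reconstruction term, the toy Poisson "expression" matrix, and every function name are assumptions chosen for illustration, and training (gradient updates) is omitted.

```python
import numpy as np

def init_params(n_features, n_hidden, n_latent, rng):
    """Small random weights for a one-hidden-layer encoder and decoder (hypothetical sizes)."""
    scale = 0.01
    return {
        "W_enc": rng.normal(0, scale, (n_features, n_hidden)),
        "W_mu": rng.normal(0, scale, (n_hidden, n_latent)),
        "W_logvar": rng.normal(0, scale, (n_hidden, n_latent)),
        "W_dec": rng.normal(0, scale, (n_latent, n_hidden)),
        "W_out": rng.normal(0, scale, (n_hidden, n_features)),
    }

def encode(x, p):
    """Map each cell to the mean and log-variance of its latent Gaussian."""
    h = np.tanh(x @ p["W_enc"])
    return h @ p["W_mu"], h @ p["W_logvar"]

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, p):
    """Reconstruct the input profile from a latent sample."""
    return np.tanh(z @ p["W_dec"]) @ p["W_out"]

def negative_elbo(x, p, rng):
    """Squared-error reconstruction term plus KL divergence to N(0, I), averaged over cells."""
    mu, logvar = encode(x, p)
    z = reparameterize(mu, logvar, rng)
    x_hat = decode(z, p)
    recon = np.sum((x - x_hat) ** 2, axis=1)
    kl = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar), axis=1)
    return np.mean(recon + kl)

rng = np.random.default_rng(0)
# Toy stand-in for a cells-by-genes matrix; real data would be loaded instead.
counts = rng.poisson(1.0, size=(50, 200)).astype(float)
x = np.log1p(counts)

params = init_params(n_features=200, n_hidden=32, n_latent=5, rng=rng)
latent_mean, _ = encode(x, params)   # one 5-dimensional encoding per cell
loss = negative_elbo(x, params, rng)
```

After training, the per-cell latent means (`latent_mean` here) are the kind of encoded features that a downstream clustering step could use to separate subpopulations.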
Co-author Sabrina Rashid, PhD intern from CMU:
I am a third-year PhD student at Carnegie Mellon University. My major is Computational Biology, and my PhD advisor is Dr. Ziv Bar-Joseph. My research interests are single cell gene expression data and systems biology. For my thesis, I have developed an algorithm to find temporal trajectories, branching, and cell assignments from single cell time series RNA-seq experiments.