Sohrab Shah

Associate Professor

Relevant Degree Programs


Graduate Student Supervision

Doctoral Student Supervision (Jan 2008 - May 2019)
Tumour evolution at single-cell resolution (2018)

No abstract available.

A systems biology study of alternative splicing regulations and functions (2017)

Alternative splicing is widely recognized as a major contributor to cellular complexity, and its dysregulation has been associated with several diseases. Despite being the focus of numerous studies in recent years, much remains unknown about the functions and regulation of alternative splicing in mammalian systems. Here, I take a systems biology approach to study alternative splicing using high-throughput sequencing data. In Chapter 2, I use tissue-specific high-throughput libraries of Drosophila melanogaster to explore the potential inter-relation of RNA editing and alternative splicing. I first develop a pipeline to accurately detect editing events. Next, I find regions where editing and splicing are likely to influence each other, and report conserved RNA structures that can mediate the inter-relation. In Chapter 3, I study the functions of cyclin-dependent kinase 12 (CDK12) using human cell line data. I show that CDK12 influences the differential usage of alternative last exons. Additionally, the results demonstrate that CDK12 modulates the expression of DNA damage response genes, and increases the tumorigenicity of breast cancer cells by down-regulating the long isoform of the DNAJB6 gene. Finally, in Chapter 4, I first present a review of methods that search for underlying mechanisms explaining variations between high-throughput measurements of two biological conditions. Next, I introduce our RNA-seq data derived from progressively inhibiting splicing-related proteins at multiple concentrations of pharmaceuticals, and I discuss how the reviewed methods should be adapted to benefit most from our type of data. Our systems biology research provides new insights into how the studied components of the splicing machinery contribute to splicing functions and regulation, and these findings can help to improve our understanding of related diseases.

Clinical Implications of inter-tumour, intra-tumour, and tumour microenvironment heterogeneity in B-cell lymphomas (2017)

B-cell lymphomas are lymphoid neoplasms derived from mature B lymphocytes at various stages of B-cell development. Advances in sequencing have contributed to decoding the genomic landscapes underlying many subtypes of B-cell lymphomas. However, it remains unclear why some B-cell lymphoma patients suffer from disease progression. A major factor contributing to disease progression is tumour heterogeneity, a consequence of branched evolutionary processes, and microenvironment heterogeneity leading to variation in the composition and properties of non-malignant cells infiltrating and surrounding the cancer. A thorough characterization of these forms of diversity in B-cell lymphomas and their association with disease progression has not been undertaken. As such, the overarching hypothesis of this thesis is that uncharacterized inter-tumour, intra-tumour, and tumour microenvironment heterogeneity impacts disease progression in B-cell lymphomas. In particular, this thesis is focused on studying these types of heterogeneity in three subtypes of B-cell lymphomas and their implications on disease progression. First, I explored inter-tumour heterogeneity in primary specimens of diffuse large B-cell lymphoma patients. I identified novel RCOR1 deletions and their corresponding transcriptional signature in a subset of patients that stratified patients into good and poor outcome following first-line treatment. Secondly, I explored intra-tumour heterogeneity in histologically transformed and early progressed follicular lymphoma patients using serial samples of their primary and transformed/progressed specimens. Through the inference of clonal dynamic patterns, I revealed divergent evolution patterns and identified novel genes underlying these distinct clinical end points. Thirdly, I explored tumour microenvironment (TME) heterogeneity in classical Hodgkin lymphoma relapse patients through serial sampling of primary pretreatment and relapse specimens. 
I demonstrated how specific TME dynamic patterns can inform on treatment failure. Moreover, I derived a novel, clinically applicable prognostic model (RHL30), based on the TME composition at relapse that predicts response to second-line treatment. Collectively, the work in this thesis constitutes a step forward in our characterization of tumour and microenvironment heterogeneity in B-cell lymphomas and its association with disease progression. The results presented here will aid in the determination of precise therapeutic approaches for individual lymphoma patients.

Computational methods for systems biology data of cancer (2016)

High-throughput genome sequencing and other techniques provide a cost-effective way to study cancer biology and seek precision treatment options. In this dissertation I address three challenges in cancer systems biology research: 1) predicting somatic mutations, 2) interpreting mutation functions, and 3) stratifying patients into biologically meaningful groups. Somatic single nucleotide variants are frequent therapeutically actionable mutations in cancer, e.g., the ‘hotspot’ mutations in known cancer driver genes such as EGFR, KRAS, and BRAF. However, only a small proportion of cancer patients harbour these known driver mutations. Therefore, there is a great need to systematically profile a cancer genome to identify all the somatic single nucleotide variants. I develop methods to discover these somatic mutations from cancer genomic sequencing data, taking into account the noise in high-throughput sequencing data and valuable validated genuine somatic mutations and non-somatic mutations. Of the somatic alterations acquired for each cancer patient, only a few mutations ‘drive’ the initialization and progression of cancer. To better understand the evolution of cancer, as well as to apply precision treatments, we need to assess the functions of these mutations to pinpoint the driver mutations. I address this challenge by predicting the mutations correlated with gene expression dysregulation. The method is based on hierarchical Bayes modelling of the influence of mutations on gene expression, and can predict the mutations that impact gene expression in individual patients. Although probably no two cancer genomes share exactly the same set of somatic mutations because of the stochastic nature of acquired mutations across the three billion base pairs, some cancer patients share common driver mutations or disrupted pathways. These patients may have similar prognoses and potentially benefit from the same kind of treatment options. 
I develop an efficient clustering algorithm to cluster high-throughput and high-dimensional biological datasets, with the potential to put cancer patients into biologically meaningful groups for treatment selection.

Probabilistic models for the identification and interpretation of somatic single nucleotide variants in cancer genomes (2016)

Somatic single nucleotide variants (SNVs) are mutations resulting from the substitution of a single nucleotide in the genome of cancer cells. Somatic SNVs are numerous in the genomes of most types of cancers. SNVs can contribute to the malignant phenotype of cancer cells, though many SNVs likely have negligible selective value. Because many SNVs are selectively neutral, their presence in a measurable proportion of cells is likely due to drift or genetic hitchhiking. This makes SNVs an appealing class of genomic aberrations to use as markers of clonal populations and ultimately tumour evolution. Advances in sequencing technology, in particular the development of high throughput sequencing (HTS) technologies, have made it possible to systematically profile SNVs in tumour genomes. We introduce three probabilistic models to solve analytical problems raised by experimental designs that leverage HTS to study cancer biology. The first experimental design we address is paired sequencing of normal and tumour tissue samples to identify somatic SNVs. We develop a probabilistic model to jointly analyse data from both samples, and reduce the number of false positive somatic SNV predictions. The second experimental design we address is the deep sequencing of SNVs to quantify the cellular prevalence of clones harbouring the SNVs. The key challenge we resolve is that allele abundance measured by HTS is not equivalent to cellular prevalence due to the confounding issues of mutational genotype, normal cell contamination and technical noise. We develop a probabilistic model which solves these problems while simultaneously inferring the number of clonal populations in the tissue. The final experimental design we consider is single cell sequencing. Single cell sequencing provides a direct means to measure the genotypes of clonal populations. However, sequence data from a single cell is inherently noisy which confounds accurate measurement of genotypes. 
To overcome this problem we develop a model to aggregate cells by clonal population in order to pool statistical strength and reduce error. The model jointly infers the assignment of cells to clonal populations, the genotype of the clonal populations, and the number of populations present.

Probabilistic approaches for profiling copy number aberrations and loss of heterozygosity landscapes in cancer genomes (2014)

Genomic aberrations such as copy number alterations (CNA) and loss of heterozygosity (LOH) are hallmarks of human malignancies. These genomic abnormalities can have a measurable effect on the structure and dosage of chromosomal regions. Tumour suppressors and oncogenes altered by CNAs often contribute to a tumourigenic phenotype of increased proliferation. CNA and LOH can accrue through the process of branched evolution, resulting in the emergence of divergent clones with distinct aberrations present at diagnosis. Therefore, measuring and modeling how CNA/LOH distribute in cell populations can elucidate the abundance of specific clones and, ultimately, enable the study of clonal evolution. CNA/LOH events in tumours can be profiled using SNP genotyping arrays and whole genome sequencing (WGS). However, to maximize biological interpretability from these data, accurate and statistically robust computational methods for inferring CNA/LOH are necessary. I present three novel probabilistic approaches that apply hidden Markov models (HMM) to analyze CNA/LOH in tumour genomes. The first method is HMM-Dosage, which distinguishes somatic and germline copy number events. This tool was used to profile 2000 breast cancers, the largest study of this kind in the world. The second method is APOLLOH, which was one of the earliest methods developed to profile LOH in tumour WGS data. Its application to WGS of 23 triple negative breast cancers (TNBC) represents the first time that LOH and its effects on allelic expression were jointly analyzed from sequencing data. The third method is TITAN, which simultaneously infers CNA/LOH and the clonal population dynamics from tumour WGS data. This method provides an analytical route to studying the degree of clonal evolution driven by CNA/LOH. I applied TITAN to a novel set of primary breast tumours and corresponding mouse xenografts, presenting the results of distinct modes of temporal clonal selection patterns. 
In conclusion, this dissertation presents a suite of novel approaches and their application to real-world cancer datasets, contributing to significant discoveries in breast and ovarian cancers. Future applications of these approaches will further facilitate the elucidation of cancer evolution, the genetic basis of metastatic potential, and therapeutic response and resistance.

Master's Student Supervision (2010 - 2018)
E-scape: interactive visualization of single cell phylogenetics and spatio-temporal evolution in cancer (2016)

Cancers evolve over time and space, producing a dynamic, heterogeneous mixture of related cells. Reconstructing the evolution of each cancer requires sequencing tumour cells and processing resulting data with novel computational and statistical methods. These advances have led to numerous insights, both clinical and biological, but the ability for a biologist to interact with these results across an experimental workflow remains limited, with expert intuition often injected only through cumbersome iterations of data analysis. Here we describe E-scape, a visualization tool suite enabling interactive analysis of cancer heterogeneity and evolution. The suite includes three tools: TimeScape and MapScape for visualizing population dynamics over time and space, respectively, and CellScape for visualizing evolution at single cell resolution. The tools integrate phylogenetic, clonal prevalence, mutation and imaging data to generate intuitive views of a cancer's evolution.

Cis-regulatory somatic mutations and gene-expression alteration in B cell lymphomas (2015)

Substantial progress has been achieved in characterizing protein coding (PC) regions of cancer genomes, with large contributions coming from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC). In order to obtain a complete mutational profile of cancer genomes, the whole genome must be analyzed for two reasons: a large proportion of somatic mutations are within the non-coding region, and 80% of the human genome is estimated to have some biological functionality. The dramatic cost reduction afforded by next generation sequencing has now made it tractable to sequence entire cancer genomes, allowing mutational profiling of the functional loci in the non-coding regions, such as cis-regulatory elements. Recent cancer genomic studies observed that somatic mutations within cis-regulatory elements have the capacity to deregulate gene expression, but their impact remains underexplored. Initial attempts to prioritize cis-regulatory mutations did not incorporate RNA-seq data. We used 84 B cell lymphoma samples to address this limitation by prioritizing disruptive cis-regulatory mutations based on their potential to be the cause of observable cascading expression changes throughout biological networks. BCL6, ROBO1, GNA13, HAS2 and MYC were dysregulated genes targeted with somatic mutations through different mechanisms. Mutations targeted the genes directly (PC mutations), indirectly (cis-regulatory mutations), or both. Our analyses demonstrate the importance of identifying genomically altered cis-regulatory elements, along with gene expression data, to interpret the mutational landscapes of cancers.

A novel statistical method for the accurate identification of RNA-edits with application to human cancers (2012)

RNA-editing is the post-transcriptional, enzymatic modification of RNA molecules resulting in an altered nucleotide sequence. These modifications play a critical role in mammalian tissues and are essential for proper function of liver and neuronal development, among other processes. The advent of high-throughput sequencing (HTS) technologies (e.g. Illumina HiSeq) has renewed interest in RNA-editing discovery due to unprecedented opportunities for simultaneous interrogation of whole genome and transcriptome sequences. In the past several months a number of studies have been published describing methods and results of RNA-editing discovery in HTS data. These methods have been ad hoc approaches based on repurposing SNP calling tools designed for genome-based variant detection. However, the statistical properties of RNA-editing warrant specialized analytical strategies that leverage the non-uniform substitution distributions inherent in RNA-editing processes. A novel statistical framework, called Auditor, that simultaneously analyzes the genomic and transcriptomic base-counts and infers the likelihood of an RNA-edit at each position in the transcriptome is reported. This model leverages the inherent correlation present in the RNA and DNA sequence while encoding the non-uniform substitution distributions induced by RNA-editing, conferring increased sensitivity. Further, a Random-Forest based technical artifact removal tool that accurately identifies sequencing and alignment errors has been implemented, greatly increasing the specificity of the method. 
The combination of these approaches leads to a robust, principled method that accurately detects RNA-edits in the presence of both biological and technical noise. It is systematically shown, in both a simulation study and on real matched whole genome and transcriptome data generated from 11 lymphoma samples, that Auditor significantly outperforms similar, but simpler statistical frameworks, including a Samtools/bcftools based approach that is similar to a recently published study. Finally, by profiling 11 diffuse large B-cell lymphomas and 16 triple negative breast cancers with Auditor, it is shown that RNA-editing is an active process in human malignancies. Surprisingly, consistent patterns of nucleotide substitutions and regional enrichment of RNA-edits in 3′ UTRs suggest that RNA-editing processes are invariant between cell lineages and between tumours of similar histological subtypes and even cancers from distinct tissues of origin.
