Paul Pavlidis

Professor

Graduate Student Supervision

Doctoral Student Supervision (Jan 2008 - Mar 2019)
Computational analysis of ribonucleic acid basepairs in RNA structure and RNA-RNA interactions (2016)

Ribonucleic acids (RNA) are an essential part of cellular function, transcribed from DNA and translated into protein. Rather than a passive informational medium, RNA can also be highly functional and regulatory. Certain RNAs fold into specific structures giving them enzymatic properties, while others bind to specific targets to guide regulatory processes. With the advent of next-generation sequencing, a large number of novel non-coding RNAs have been discovered through whole-transcriptome sequencing. Many efforts have been made to study the structure and binding partners of these novel RNAs, in order to determine their function and roles. This work begins with a description of my R package R4RNA for manipulating RNA basepair data, the building blocks of RNA structure and RNA binding. The package deals with the input/output and manipulation of RNA basepair and sequence data, along with statistical and visualization methods for evaluation, interpretation and presentation. We also describe R-chie, a visualization tool and web server built on R4RNA that visualizes complex RNA basepairs in conjunction with sequence alignments. We then conduct the largest known evaluation of RNA-RNA interaction methods to date, running state-of-the-art tools on curated experimentally validated datasets. We end with a review of cotranscriptional RNA basepair formation, summarizing biological, theoretical and computational methods for the process, and future directions for improving classical methods in RNA structure prediction. All content chapters of this thesis have been peer-reviewed and published. The work on R4RNA has led to two publications, with the package used to great visual effect by various publications and also adopted by the RNA structure database Rfam. My assessment of RNA-RNA interaction is at present the only published evaluation of its kind, and will hopefully become a benchmark for future tool development and a guide to selecting appropriate tools and algorithms.
Our published review on RNA cotranscriptional folding is well-received, being the first review specifically on its topic.

Generation of truncated proteoforms in proteolytic networks : modeling and prediction in the protease web (2016)

Primarily controlled by gene expression and fine-tuned by translation and degradation rates, protein activity is governed by a plethora of post-translational modifications such as phosphorylation and glycosylation, which generate a diversity of protein species and thereby control complex biological phenotypes. Processing by proteases is a particular modification leading to the irreversible generation of stable protein truncations. Well understood in examples such as signal- or propeptide removal, recent analyses consistently identify >50% of N-terminal peptides mapping inside the protein sequence as predicted by genomics, indicating an important regulatory role of proteases. All proteins undergo protease cleavage as part of processing or degradation, a second biological process controlled by proteases. Proteases are involved in numerous pathologies and commonly considered as drug targets. However, protease research and drug development are complicated, in part due to widespread crosstalk between proteases. Proteases regulate other proteases through direct cleavage or cleavage of protease inhibitors in a complex network of protease interactions, the protease web. Yet, a comprehensive analysis of the protease web has not been performed, hampering assignment of proteases to clear biological roles, their direct substrates, and protease inhibitor drug targeting. A second problem in the identification of protein processing is the potential confound between protein termini generated by protease processing, alternative splicing, and alternative translation. In this thesis, I computationally analyzed large and diverse datasets of protease interactions and protein truncations to gain insight into complex proteolytic processes and to guide biochemical follow-up experiments.
Analyzing protease cleavage, alternative splicing and alternative translation data incorporated into our database TopFIND, I found that protease cleavage and alternative translation likely generate most protein truncations. Combining protease cleavage and inhibition data in a graph model of the protease web, I demonstrated extensive protease crosstalk and then predicted and validated a proteolytic pathway. Finally, investigating strategies for the prediction of protease inhibition, I predicted hundreds of protease-inhibitor interactions, and validated inhibition of kallikrein-5 by serpin B12. This work thus generated predictions for biochemical follow-up as well as important insights into the regulation of biological systems through proteases.

Bioinformatics for neuroanatomical connectivity (2012)

Neuroscience research is increasingly dependent on bringing together large amounts of data collected at the molecular, anatomical, functional and behavioural levels. This data is disseminated in scientific articles and large online databases. I utilized these large resources to study the wiring diagram of the brain or ‘connectome’. The aims of this thesis were to automatically collect large amounts of connectivity knowledge and to characterize relationships between connectivity and gene expression in the rodent brain. To extract the knowledge embedded in the neuroscience literature I created the first corpus of neuroscience abstracts annotated for brain regions and their connections. These connections describe long distance or macroconnectivity between brain regions. The collection of over 1,300 abstracts allowed accurate training of machine learning classifiers that mark brain region mentions (76% recall at 81% precision) and neuroanatomical connections between regions (50% sentence level recall at 70% precision). By automatically extracting connectivity statements from the Journal of Comparative Neurology I generated a literature based connectome of over 28,000 connections. Evaluations revealed that a large number of brain region descriptions are not found in existing lexicons. To address this challenge I developed novel methods that allow mapping of brain region terms to enclosing structures. To further study the connectome I moved from scientific articles to large online databases. By employing resources for gene expression and connectivity I showed that patterns of gene expression correlate with connectivity. First, two spatially anti-correlated patterns of mouse brain gene expression were identified. These signatures are associated with differences in expression of neuronal and oligodendrocyte markers, suggesting they reflect regional differences in cellular populations. 
Expression level of these genes is correlated with connectivity degree, with regions expressing the neuron-enriched pattern having more incoming and outgoing connections with other regions. Finally, relationships between profiles of gene expression and connectivity were tested. Specifically, I showed that brain regions with similar expression profiles tend to have similar connectivity profiles. Further, optimized sets of connectivity linked genes are associated with neuronal development, axon guidance and autistic spectrum disorder. This demonstration of text mining and large scale analysis provides new foundations for neuroinformatics.

Meta-analysis of expression profiling data in the postmortem human brain (2012)

No abstract available.

Master's Student Supervision (2010-2017)
A study of methods for learning phylogenies of cancer cell populations from binary single nucleotide variant profiles (2015)

An accurate phylogeny of a cancer tumour has the potential to shed light on numerous phenomena, such as key oncogenetic events, relationships between clones, and evolutionary responses to treatment. Most work in cancer phylogenetics to date relies on bulk tissue data, which can resolve only a few genotypes unambiguously. Meanwhile, single-cell technologies have considerably improved our ability to resolve intra-tumour heterogeneity. Furthermore, most cancer phylogenetic methods use classical approaches, such as Neighbor-Joining, which put all extant species on the leaves of the phylogenetic tree. But in cancer, ancestral genotypes may be present in extant populations. There is a need for scalable methods that can capture this phenomenon. We have made progress on this front by developing the Genotype Tree representation of cancer phylogenies, implementing three methods for reconstructing Genotype Trees from binary single-nucleotide variant profiles, and evaluating these methods under a variety of conditions. Additionally, we have developed a tool that simulates the evolution of cancer cell populations, allowing us to systematically vary evolutionary conditions and observe the effects on tree properties and reconstruction accuracy. Of the methods we tested, Recursive Grouping and Chow-Liu Grouping appear to be well-suited to the task of learning phylogenies over hundreds to thousands of cancer genotypes. Of the two, Recursive Grouping has the strongest and most stable overall performance, while Chow-Liu Grouping has a superior asymptotic runtime that is competitive with Neighbor-Joining.

Exploring sources of variability in electrophysiology data of mammalian neurons (2017)

Recently, there has been a major effort by neuroscientists to systematically organize and integrate vast quantities of brain data. However, electrophysiological properties have been shown to be sensitive to experimental conditions, thus directly comparing them between experiments could lead to inconsistent results. Here, I characterize the general effects of experimental solution composition differences on the reported electrophysiological (ephys) measurements. For that purpose, I employ text-mining, supplemented with manual curation, to gather experimental solution information from published neurophysiological articles. I integrate the extracted information into the existing NeuroElectro database, which contains the electrophysiology, neuron type and experimental conditions information (temperature, electrode type, animal age, etc.) from the above neuroscientific literature. Exploring commonly used experimental solution recipes, I found the effect of solution composition in explaining variance in electrophysiological properties to be small relative to the existing ephys variability. Then, I created models for predicting the variability of ephys properties commonly reported by neurophysiologists, using the available experimental conditions information. These models can be used to remove a portion of the ephys variance when comparing results from different experiments, generally making such comparisons more reliable. To validate their performance, I adjusted a portion of NeuroElectro data to experimental conditions used by the Allen Institute for Brain Science and compared the respective ephys properties before and after the adjustment.

Meta-analysis of gene expression in mouse models of neurodegenerative disorders (2017)

There is intense interest in understanding the molecular mechanisms that contribute to neurodegenerative disorders (NDs), which involve complex interplays of genetic and environmental factors. Catching early events involved in disease initiation requires investigating pre-symptomatic brain samples. It is difficult to capture early molecular events using post-mortem human brain samples since these samples represent the late phase of the disorder with progressive brain damage and neurodegeneration. Disease mouse models are developed to study disease progression and pathophysiology. Here, I focus on two of the most studied NDs: Alzheimer’s disease (AD) and Huntington’s disease (HD). Mouse models developed for the disease (AD or HD) often share similar phenotypes mimicking human disease symptoms, which suggests potential common underlying mechanisms of disease initiation and progression across mouse models of the same disease. Investigation of gene expression profiles of pre-symptomatic animals from different mouse models may shed light on the mechanisms occurring in the early disease phase. Gene expression profiling analyses have been performed on mouse models, and some of the studies investigate the molecular changes in the pre-symptomatic phase of AD and HD respectively. However, their findings have not reached a clear consensus. To identify shared molecular changes across mouse models, I conducted a systematic meta-analysis of gene expression in mouse models of AD and HD, consisting of 369 gene expression profiles from 23 independent studies. The goal of this project is to identify transcriptional alterations shared among different mouse models of each disease respectively, especially changes during the early disease phase that may link to disease-causing mechanisms, and potential common cross-disease changes.
For both of the disorders, the results showed subtle but biologically interpretable changes shared across mouse models in the early disease phase that may contribute to the early disease progression: dysregulation of genes involved in cholesterol biosynthesis and complement system in AD mouse models and genes encoding mitochondrial respiratory chain complexes in HD mouse models. Cross-disease similarities in the late phase suggested that different brain regions may share mechanisms in response to neuronal loss and toxic protein aggregates.

Identification and exploration of gene product annotation instability and its impact on current usages (2014)

Proteins are macromolecules responsible for a wide range of activities in the structure and function of cells. Their activities have been described in different contexts as a means to elucidate their "function". These descriptions have been captured across biological databases in a standardized format called Gene Ontology Annotations (GOA), to disseminate the knowledge and extrapolate the information to other proteins whose function is still unknown. Furthermore, the annotations are used to analyse and interpret data from high-throughput studies and also as a benchmark for the assessment of protein function prediction algorithms. Constant changes occur in GOA that can potentially impact such usages, but only limited effort has been put into exploring their instability, or into assessing the impact that these changes have on reproducibility or interpretation of previous analyses. In the present work, I performed the most comprehensive analysis of the annotation instability for 14 representative model organisms (E. coli, fruit fly, mouse, etc.). The results showed important instability patterns that were species-specific. As such information would be of use to the community to trace the instability of annotations of their interest, a web-based visualization tool was built to track these changes on a protein, functional term and species specific basis. Additionally, we identified artifacts in the annotation data that can be attributed to curation patterns. We propose that such artifacts be considered for a more accurate assessment of function prediction algorithms. Furthermore, the impact that changes in the annotations have on common settings like gene set enrichment analyses was also explored. In particular, 2,000 datasets were used to assess the robustness of enrichment results over time. On average, the results would display a 60% similarity after only 2 years.
However, cases were found where the similarity would drop 80% within the same year, demonstrating the impact that the instability has on such applications. In conclusion, the results of this work will prove useful for those who use the annotations to interpret their studies, allowing them to assess their reliability on a case-by-case basis.

Meta-analysis of human methylomes reveals stably methylated sequences surrounding CpG islands associated with high gene expression (2014)

DNA methylation is thought to play an important role in the regulation of mammalian gene expression. Part of the evidence for this role is the observation that lack of CpG island methylation in gene promoters is associated with high transcriptional activity. However, CpG island methylation level only accounts for a fraction of the variance in gene expression, and methylation in other domains is hypothesized to play a role (e.g., island shores and shelves). We set out to improve understanding of the human methylome through a meta-analysis approach, using 1737 samples from 30 publicly available studies. An initial screen identified 15224 CpGs that are “ultra-stable” in their state, being always fully methylated or unmethylated across diverse tissues, cell types and developmental stages (974 always methylated; 14250 always unmethylated). A further analysis of ultra-stable CpGs led us to identify a novel class of CpG islands, “ravines”, that exhibit a markedly consistent pattern of low methylation with highly methylated flanking shores and shelves. Our findings were validated using independent and heterogeneous datasets assayed on the same and different technologies. Building on additional existing data types such as gene expression microarrays, DNase hypersensitive sites, and histone modifications, we found that ravines are associated with higher gene expression, compared to typical unmethylated CpG islands. This finding suggests a novel role for methylation in promoters, markedly different from the traditional view that active promoters need to be unmethylated. We propose ravines are a new class of CpG islands, established early in development and maintained through differentiation, that mark universally active genes and provide new evidence that methylation beyond the CpG island could play a role in gene expression.

Cell type marker enrichment across brain regions and experimental conditions (2013)

The first chapter of this thesis explored the dominant gene expression pattern in the adult human brain. We discovered that the largest source of variation can be explained by cell type marker expression. Across brain regions, expression of neuron cell type markers is anti-correlated with the expression of oligodendrocyte cell type markers. Next, we explored gene function convergence and divergence in the adult mouse brain. Our contributions are as follows. First, we provide candidate cell type markers for investigating specific cell type populations. Second, we highlight orthologous genes that show functional divergence between human and mouse brains. In the second chapter, we present our preliminary work on the effects of tissue types and experimental conditions on human microarray studies. First, we measured the expression and differential expression levels of tissue-enriched genes. Next, we identified modules with similar expression levels and differential expression p-values. Our results show that expression levels reflect tissue type variation. In contrast, differential expression levels are more complex, owing to the large diversity of experimental conditions in the data. In summary, our work provides a different perspective on the functional roles of genes in human microarray studies.

Characterization of gene expression patterns in wild pacific salmon (2013)

No abstract available.

Meta-analysis of gene expression in individuals with Autism Spectrum Disorders (2013)

Autism spectrum disorders (ASD) are clinically heterogeneous and biologically complex. State of the art genetics research has unveiled a large number of variants linked to ASD. But in general it remains unclear what biological factors lead to changes in the brains of autistic individuals. We build on the premise that these heterogeneous genetic or genomic aberrations will converge towards a common impact downstream, which might be reflected in the transcriptomes of individuals with ASD. Similarly, a considerable number of transcriptome analyses have been performed in attempts to address this question, but their findings lack a clear consensus. As a result, each of these individual studies has not led to any significant advance in understanding the autistic phenotype as a whole. The goal of this research is to comprehensively re-evaluate these expression profiling studies by conducting a systematic meta-analysis. Here, we report a meta-analysis of over 1000 microarrays across twelve independent studies on expression changes in ASD compared to unaffected individuals, in blood and brain. We identified a number of genes that are consistently differentially expressed across studies of the brain, suggestive of effects on mitochondrial function. In blood, consistent changes were more difficult to identify, despite individual studies tending to exhibit larger effects than the brain studies. Our results are the strongest evidence to date of a common transcriptome signature in the brains of individuals with ASD.

Wide-scale comparison of transcriptome data and the role of microRNA in major depression and suicide (2011)

The first chapter of this thesis addresses a common problem in genomics experiments: interpreting a resulting "hit list" of interesting genes. We present work on an approach for summarizing and exploring "hit lists" that makes use of the large amount of gene expression data in public repositories such as the Gene Expression Omnibus. We compare the query list with datasets that we have analyzed for differential expression of genes. Studies that have similarities to the given hit list yield additional insights, help contextualize studies, and serve as a basis for future meta-analysis. A conceptually similar problem that we addressed is the classification or clustering of datasets based on patterns of differential expression. Both problems required a method for determining distances between datasets based on rankings of genes. We tested and benchmarked several methods using manually annotated datasets. The method that performed best according to our evaluation process is based on Kendall's Tau top-k distance. We investigated potential sources of confounds, finding that the largest challenge may be posed by the high prevalence of certain gene expression patterns. These highly prevalent patterns tended to dominate search results. Nonetheless, we demonstrated the effectiveness of this approach in a case study. In the second chapter, we investigated the role of microRNAs in the context of major depression and suicide. We profiled microRNA and messenger RNA levels in post-mortem prefrontal cortex and hippocampus brain tissue of depressed suicides, suicides, and controls. In the prefrontal cortex, we found miR-1202 to be down-regulated in suicides versus controls, and LCT (lactase enzyme) was up-regulated in suicides or depressed suicides compared to controls. The former result was independently confirmed using quantitative PCR. 
While further study is needed, our results have the potential to provide insight into molecular changes in the brains of depressed and suicidal individuals.
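The abstract above says only that the best-performing dataset distance is "based on Kendall's Tau top-k distance". As a rough illustration of that idea, the sketch below implements one common formulation for comparing two top-k gene lists, following the case analysis usually attributed to Fagin and colleagues. The function name, the penalty parameter `p` and the example gene lists are assumptions made for illustration and are not taken from the thesis.

```python
from itertools import combinations


def kendall_tau_top_k(list_a, list_b, p=0.5):
    """Kendall tau distance between two top-k lists (hypothetical helper).

    Each list holds distinct items, best-ranked first. `p` is the penalty
    (0 to 1) for a pair whose relative order cannot be inferred.
    """
    rank_a = {gene: r for r, gene in enumerate(list_a)}
    rank_b = {gene: r for r, gene in enumerate(list_b)}
    distance = 0.0
    for i, j in combinations(sorted(set(rank_a) | set(rank_b)), 2):
        in_a = (i in rank_a, j in rank_a)
        in_b = (i in rank_b, j in rank_b)
        if all(in_a) and all(in_b):
            # Both items ranked in both lists: count a disagreement in order.
            if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0:
                distance += 1
        elif all(in_a) and any(in_b):
            # Both in A, only one in B: the one present in B is implicitly
            # ahead of the absent one there, so penalize if A disagrees.
            present, absent = (i, j) if in_b[0] else (j, i)
            if rank_a[absent] < rank_a[present]:
                distance += 1
        elif all(in_b) and any(in_a):
            # Mirror case: both in B, only one in A.
            present, absent = (i, j) if in_a[0] else (j, i)
            if rank_b[absent] < rank_b[present]:
                distance += 1
        elif all(in_a) or all(in_b):
            # Both in one list, neither in the other: order unknown.
            distance += p
        else:
            # Each item appears in a different list only: always discordant.
            distance += 1
    return distance


# Illustrative gene lists (made up for this example).
hits_study_1 = ["geneA", "geneB", "geneC"]
hits_study_2 = ["geneC", "geneB", "geneA"]
reversed_distance = kendall_tau_top_k(hits_study_1, hits_study_2)
```

Setting `p=0` gives the most lenient variant and `p=1` the strictest; which setting, if any, the thesis used is not stated in the abstract.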

Evaluating coexpression analysis for gene function prediction (2010)

Microarray expression data sets vary in size, data quality and other features, but most methods for selecting coexpressed gene pairs use a ‘one size fits all’ approach. There have been many different procedures for selecting coexpressed gene pairs of high functional similarity from an expression dataset. However, it is not clear which procedure performs best as there are few studies reporting comparisons of these approaches. The goal of this thesis is to develop a set of “best practices” in order to select coexpression links of high functional similarity from an expression dataset, along with methods for identifying datasets likely to yield poor information. With these goals, we hope to improve the quality of gene function predictions produced by coexpression analysis. Using 80 human expression datasets we examined the impact of different thresholds, correlation metrics, expression data filtering and transformation procedures on performance in functional prediction. We also investigated the relationship between data quality and other features of expression datasets and their performance in functional prediction. We used the annotations of the Gene Ontology as a primary metric to measure similarity in gene function, and employed additional functional metrics for validation. Our results show that several dataset features have a greater influence on the performance in functional prediction than others. Expression datasets which produce coexpressed gene pairs of poor functional quality can be identified by a similar set of data features. Some procedures used in coexpression analysis have a negligible effect on the quality of functional predictions while others are essential to achieving the best performance in the algorithm. We also find that some procedures interact greatly with features of expression datasets and that these interactions increase the number of high quality coexpressed gene pairs retrieved through coexpression analysis.
This thesis uncovers important information on the many intrinsic and extrinsic factors that influence the performance in functional prediction of coexpression analysis. The information summarized here will help guide future studies using coexpression analysis and improve the quality of gene function predictions.
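To make the notion of "selecting coexpressed gene pairs" with a threshold and a correlation metric concrete, here is a minimal sketch. It assumes Pearson correlation and a cutoff of 0.8, both chosen only for illustration; the abstract does not state which metrics or thresholds performed best, and the function names and toy expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def coexpressed_pairs(expression, threshold=0.8):
    """Return (gene1, gene2, r) for every pair with r >= threshold.

    `expression` maps a gene name to its expression values across samples.
    """
    pairs = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if r >= threshold:
            pairs.append((g1, g2, r))
    return pairs


# Toy data (made up): geneA and geneB rise together, geneC falls.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
links = coexpressed_pairs(toy_expression, threshold=0.8)
```

Swapping `pearson` for another metric or varying `threshold` corresponds to the kinds of procedural choices the thesis compares across datasets.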

 
 
