Paul Pavlidis

Prospective Graduate Students / Postdocs

This faculty member is currently not looking for graduate students or Postdoctoral Fellows. Please do not contact the faculty member with any such requests.


Research Classification

Research Interests

cellular and molecular neuroscience
disorders of the nervous system

Relevant Degree Programs

Research Options

I am available and interested in collaborations (e.g. clusters, grants).
I am interested in and conduct interdisciplinary research.
I am interested in working with undergraduate students on research projects.

Research Methodology

machine learning

Graduate Student Supervision

Doctoral Student Supervision (Jan 2008 - Nov 2019)
The interpretation of gene coexpression in systems biology (2020)

One of the key features of transcriptomic data is the similarity of expression patterns among groups of genes, referred to as coexpression. It has been shown that coexpressed genes tend to share similar functions. Based on this, a common assumption is that gene coexpression is a result of transcriptional regulation and therefore, regulatory relationships could be inferred from coexpression. However, success in inferring such relationships has been limited and there are questions about the source and interpretation of coexpression. Here I explore coexpression as an observed signal from the data, examine its source and assess its relevance for inferring regulatory relationships. In chapter 2 I studied differential coexpression, which refers to the alteration of gene coexpression between biological conditions. It is commonly assumed that differential coexpression can reveal rewiring of transcription regulatory networks, specifically among the genes that maintain their average expression level between the conditions. However, I show that to a large extent and in contrast to this common assumption, differential coexpression is more parsimoniously explained by changes in average expression levels. This finding demonstrates limitations for inference of regulatory rewiring from coexpression and poses questions for the underlying causes of the observed coexpression. In Chapter 3, I studied cellular composition variation among bulk tissue samples as a source of variance and the observed coexpression. I found that for most genes, differences in expression levels across cell types account for a large fraction of their variance and as a result genes with similar cell-type expression profiles appear to be coexpressed. Finally, I showed that this coexpression dominates the underlying intra-cell-type coexpression and also has the two prominent features of coexpression in the bulk tissue: reproducibility and biological relevance. 
Through my studies, I was able to provide an explanation for much of the observed coexpression in the bulk tissue and shed light on its resolution and limitations for inference of regulatory relationships. I also studied coexpression in single-nucleus data and showed that some of the coexpression observed there is likely attributable to transcriptional regulation, which could be a subject for future studies. Supplementary materials available at:
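The Chapter 3 argument about cellular composition can be illustrated with a toy simulation. This is only a sketch, not the analysis from the thesis: it assumes two hypothetical cell types, two genes whose noise is independent (so there is no intra-cell-type coexpression by construction), and bulk samples that differ in their cell-type proportions. All names and numbers are illustrative.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """Residuals of ys after a least-squares fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]


def simulate_bulk(n_samples=200, seed=0):
    """Simulate bulk expression of two genes across tissue samples.

    Assumption: both genes are high in cell type A (mean 10) and low in
    cell type B (mean 2), and each gene has its own independent noise.
    """
    rng = random.Random(seed)
    gene1, gene2, frac_a = [], [], []
    for _ in range(n_samples):
        p = rng.uniform(0.2, 0.8)  # proportion of cell type A in this sample
        gene1.append(p * 10 + (1 - p) * 2 + rng.gauss(0, 0.3))
        gene2.append(p * 10 + (1 - p) * 2 + rng.gauss(0, 0.3))
        frac_a.append(p)
    return gene1, gene2, frac_a


gene1, gene2, frac_a = simulate_bulk()
raw_r = pearson(gene1, gene2)
adj_r = pearson(residuals(gene1, frac_a), residuals(gene2, frac_a))
print(f"bulk correlation: {raw_r:.2f}")
print(f"correlation after regressing out composition: {adj_r:.2f}")
```

In this toy setting the bulk correlation is driven entirely by the shared cell-type profile, and it largely disappears once the cell-type proportion is regressed out.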

Computational analysis of ribonucleic acid basepairs in RNA structure and RNA-RNA interactions (2016)

Ribonucleic acids (RNA) are an essential part of cellular function, transcribed from DNA and translated into protein. Rather than a passive informational medium, RNA can also be highly functional and regulatory. Certain RNAs fold into specific structures that give them enzymatic properties, while others bind to specific targets to guide regulatory processes. With the advent of next-generation sequencing, a large number of novel non-coding RNAs have been discovered through whole-transcriptome sequencing. Many efforts have been made to study the structure and binding partners of these novel RNAs, in order to determine their function and roles. This work begins with a description of my R package R4RNA for manipulating RNA basepair data, the building blocks of RNA structure and RNA binding. The package deals with the input/output and manipulation of RNA basepair and sequence data, along with statistical and visualization methods for evaluation, interpretation and presentation. We also describe R-chie, a visualization tool and web server built on R4RNA that visualizes complex RNA basepairs in conjunction with sequence alignments. We then conduct the largest known evaluation of RNA-RNA interaction methods to date, running state-of-the-art tools on curated experimentally validated datasets. We end with a review of cotranscriptional RNA basepair formation, summarizing biological, theoretical and computational methods for the process, and future directions for improving classical methods in RNA structure prediction. All content chapters of this thesis have been peer-reviewed and published. The work on R4RNA has led to two publications, with the package used to great visual effect by various publications and also adopted by the RNA structure database Rfam. My assessment of RNA-RNA interaction is at present the only published evaluation of its kind, and will hopefully become a benchmark for future tool development and a guide to selecting appropriate tools and algorithms.
Our published review on RNA cotranscriptional folding has been well received, being the first review specifically on its topic.

Generation of Truncated Proteoforms in Proteolytic Networks: Modeling and Prediction in the Protease Web (2016)

Primarily controlled by gene expression and fine-tuned by translation and degradation rates, protein activity is governed by a plethora of post-translational modifications such as phosphorylation and glycosylation, which generate a diversity of protein species and thereby control complex biological phenotypes. Processing by proteases is a particular modification leading to the irreversible generation of stable protein truncations. Well understood in examples such as signal- or propeptide removal, recent analyses consistently identify >50% of N-terminal peptides mapping inside the protein sequence as predicted by genomics, indicating an important regulatory role of proteases. All proteins undergo protease cleavage as part of processing or degradation, a second biological process controlled by proteases. Proteases are involved in numerous pathologies and commonly considered as drug targets. However, protease research and drug development are complicated, in part due to widespread crosstalk between proteases. Proteases regulate other proteases through direct cleavage or cleavage of protease inhibitors in a complex network of protease interactions, the protease web. Yet, a comprehensive analysis of the protease web has not been performed, hampering assignment of proteases to clear biological roles, their direct substrates, and protease inhibitor drug targeting. A second problem in the identification of protein processing is the potential confound between protein termini generated by protease processing, alternative splicing, and alternative translation. In this thesis, I computationally analyzed large and diverse datasets of protease interactions and protein truncations to gain insight into complex proteolytic processes and to guide biochemical follow-up experiments.
Analyzing protease cleavage, alternative splicing and alternative translation data incorporated into our database TopFIND, I found that protease cleavage and alternative translation likely generate most protein truncations. Combining protease cleavage and inhibition data in a graph model of the protease web, I demonstrated extensive protease crosstalk and then predicted and validated a proteolytic pathway. Finally, investigating strategies for the prediction of protease inhibition, I predicted hundreds of protease-inhibitor interactions, and validated inhibition of kallikrein-5 by serpin B12. This work thus generated predictions for biochemical follow-up as well as important insights into the regulation of biological systems through proteases.

Bioinformatics for neuroanatomical connectivity (2012)

Neuroscience research is increasingly dependent on bringing together large amounts of data collected at the molecular, anatomical, functional and behavioural levels. This data is disseminated in scientific articles and large online databases. I utilized these large resources to study the wiring diagram of the brain or ‘connectome’. The aims of this thesis were to automatically collect large amounts of connectivity knowledge and to characterize relationships between connectivity and gene expression in the rodent brain. To extract the knowledge embedded in the neuroscience literature I created the first corpus of neuroscience abstracts annotated for brain regions and their connections. These connections describe long distance or macroconnectivity between brain regions. The collection of over 1,300 abstracts allowed accurate training of machine learning classifiers that mark brain region mentions (76% recall at 81% precision) and neuroanatomical connections between regions (50% sentence level recall at 70% precision). By automatically extracting connectivity statements from the Journal of Comparative Neurology I generated a literature based connectome of over 28,000 connections. Evaluations revealed that a large number of brain region descriptions are not found in existing lexicons. To address this challenge I developed novel methods that allow mapping of brain region terms to enclosing structures. To further study the connectome I moved from scientific articles to large online databases. By employing resources for gene expression and connectivity I showed that patterns of gene expression correlate with connectivity. First, two spatially anti-correlated patterns of mouse brain gene expression were identified. These signatures are associated with differences in expression of neuronal and oligodendrocyte markers, suggesting they reflect regional differences in cellular populations. 
Expression level of these genes is correlated with connectivity degree, with regions expressing the neuron-enriched pattern having more incoming and outgoing connections with other regions. Finally, relationships between profiles of gene expression and connectivity were tested. Specifically, I showed that brain regions with similar expression profiles tend to have similar connectivity profiles. Further, optimized sets of connectivity linked genes are associated with neuronal development, axon guidance and autistic spectrum disorder. This demonstration of text mining and large scale analysis provides new foundations for neuroinformatics.

Meta-analysis of expression profiling data in the postmortem human brain (2012)

Schizophrenia is a severe psychiatric illness for which the precise etiology remains unknown. Studies using postmortem human brain have become increasingly important in schizophrenia research, providing an opportunity to directly investigate the diseased brain tissue. Gene expression profiling technologies have been used by a number of groups to explore the postmortem human brain and seek genes which show changes in expression correlated with schizophrenia. While this has been a valuable means of generating hypotheses, there is a general lack of consensus in the findings across studies. Expression profiling of postmortem human brain tissue is difficult due to the effect of various factors that can confound the data. The first aim of this thesis was to use control postmortem human cortex for identification of expression changes associated with several factors, specifically: age, sex, brain pH and postmortem interval. I conducted a meta-analysis across the control arm of eleven microarray datasets (representing over 400 subjects), and identified a signature of genes associated with each factor. These genes provide critical information towards the identification of problematic genes when investigating postmortem human brain in schizophrenia and other neuropsychiatric illnesses. The second aim of this thesis was to evaluate gene expression patterns in the prefrontal cortex associated with schizophrenia by exploring two methods of analysis: differential expression and coexpression. Seven schizophrenia microarray studies of prefrontal cortex were combined for a total of 153 subjects with schizophrenia and 153 healthy controls. Meta-analysis was conducted with careful consideration for the effects of covariates, revealing a robust list of 98 differentially expressed ‘schizophrenia genes’. 
Using the same seven schizophrenia datasets, coexpression networks were generated for control and schizophrenia cohorts within each dataset and then combined across studies using a rank aggregation approach. Topological properties of our ‘schizophrenia genes’ were evaluated in the context of each network, highlighting differences in correlation structure of these genes in the control and schizophrenia brain. Together these results converge towards a general conclusion, emphasizing that the integration of postmortem human brain expression profiling data improves statistical power and is particularly useful in detecting subtle yet consistent changes in expression associated with schizophrenia.
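The abstract does not say which rank aggregation scheme was used to combine coexpression networks across studies. As a hedged illustration only, the sketch below averages normalised ranks of gene-pair correlations over datasets. The function names, the gene labels and the choice of mean-rank aggregation are all assumptions.

```python
def rank_normalize(scores):
    """Map each gene pair to a rank in (0, 1]; the highest score gets 1.

    Ties are broken by the pair label so the result is deterministic.
    """
    ordered = sorted(scores, key=lambda pair: (scores[pair], pair))
    n = len(ordered)
    return {pair: (i + 1) / n for i, pair in enumerate(ordered)}


def aggregate_networks(datasets):
    """Average normalised ranks over the gene pairs shared by all datasets."""
    shared = set.intersection(*(set(d) for d in datasets))
    ranked = [rank_normalize({pair: d[pair] for pair in shared}) for d in datasets]
    return {pair: sum(r[pair] for r in ranked) / len(ranked) for pair in shared}


# Hypothetical correlations for three gene pairs in two datasets.
dataset_1 = {("A", "B"): 0.9, ("A", "C"): 0.1, ("B", "C"): 0.5}
dataset_2 = {("A", "B"): 0.7, ("A", "C"): 0.2, ("B", "C"): 0.8}

consensus = aggregate_networks([dataset_1, dataset_2])
for pair, score in sorted(consensus.items()):
    print(pair, round(score, 3))
```

Working on ranks rather than raw correlations keeps datasets with different correlation scales comparable, which is one common motivation for rank-based aggregation.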

Master's Student Supervision (2010 - 2018)
A Study of Methods for Learning Phylogenies of Cancer Cell Populations from Binary Single Nucleotide Variant Profiles (2015)

An accurate phylogeny of a cancer tumour has the potential to shed light on numerous phenomena, such as key oncogenetic events, relationships between clones, and evolutionary responses to treatment. Most work in cancer phylogenetics to date relies on bulk tissue data, which can resolve only a few genotypes unambiguously. Meanwhile, single-cell technologies have considerably improved our ability to resolve intra-tumour heterogeneity. Furthermore, most cancer phylogenetic methods use classical approaches, such as Neighbor-Joining, which put all extant species on the leaves of the phylogenetic tree. But in cancer, ancestral genotypes may be present in extant populations. There is a need for scalable methods that can capture this phenomenon. We have made progress on this front by developing the Genotype Tree representation of cancer phylogenies, implementing three methods for reconstructing Genotype Trees from binary single-nucleotide variant profiles, and evaluating these methods under a variety of conditions. Additionally, we have developed a tool that simulates the evolution of cancer cell populations, allowing us to systematically vary evolutionary conditions and observe the effects on tree properties and reconstruction accuracy. Of the methods we tested, Recursive Grouping and Chow-Liu Grouping appear to be well-suited to the task of learning phylogenies over hundreds to thousands of cancer genotypes. Of the two, Recursive Grouping has the strongest and most stable overall performance, while Chow-Liu Grouping has a superior asymptotic runtime that is competitive with Neighbor-Joining.

An analysis of genetic variants associated with autism spectrum disorder (2018)

Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder affecting roughly 1% of the human population. Genomics research to date has discovered only a fraction of the variants causative for ASD. To this end, we whole-genome sequenced a cohort of 119 ASD individuals in order to find likely pathogenic variation. After quality and frequency filters, we prioritized variants as likely causal according to rarity and predicted damage scores (CADD and Snap2). Here, we report five de novo damaging variants and seven likely damaging variants of unknown inheritance. Since much of the variation reported in ASD cases is uncertain both in function and in significance in ASD, we aimed to functionally characterize missense variants from the ASD literature in PTEN and SYNGAP1, two well-characterized ASD genes. We curated missense variants of unknown significance from the ASD literature and assayed their functional effect in yeast using a Synthetic Genetic Array. We chose previously biochemically validated variants, population variants, and other variants in the genes of interest to gain insight into the functional diversity of PTEN and SYNGAP1 variation. We established functional effect of the ASD variants of unknown significance in PTEN and showed that computational predictors of damage are reasonable predictors of variants’ functional effects in yeast. We found that agreement of computational metrics breaks down when predicting damage in certain genes, such as SYNGAP1. Functionalizing variants in this way contributes to our understanding of the range of functional effects of ASD variants. Supplementary materials available at:

Mega-analysis of gene expression patterns across tissues in human and mouse (2018)

Expression patterns across tissues are a primary indicator of gene function. High-throughput technology has created many cross-tissue data sets at the transcriptomic level (tissue panel data sets). However, the existence of multiple tissue panel data sets creates a challenge for the scientific community: deciding whether these data sets are equally valid, or which data set to choose. To date, the multiple tissue panel data sets have not been well compared, nor fully evaluated. In my Master’s thesis, I collected a large number of publicly available tissue panel data sets, harmonized them, integrated them into a tissue expression atlas including human and mouse data, compared and contrasted the data sets across the atlas, and evaluated each data set preliminarily with a gene-specific disagreement index that I developed. I found that, in general, these data sets were in good agreement. However, in certain data sets the amount of disagreement was high, which indicated that the quality of these data sets was suspect. Applying the disagreement index, I was able to offer a summarized expression pattern in the tissue expression atlas with either consensus or disagreements outlined. I also developed a web-based prototype to access this atlas. Furthermore, I explored the range of changes in gene expression patterns that may be caused by experimental conditions, such as diseases or drug treatments. I found that most of the changes were not as dramatic as a change from unexpressed to highly expressed, even though these changes were reported as statistically significant in the literature. Only a couple of conditions, such as cancer or inflammation, could cause an unexpressed-to-highly-expressed change, because tissue composition in those conditions was changed substantially.

Exploring sources of variability in electrophysiology data of mammalian neurons (2017)

Recently, there has been a major effort by neuroscientists to systematically organize and integrate vast quantities of brain data. However, electrophysiological properties have been shown to be sensitive to experimental conditions, so directly comparing them between experiments could lead to inconsistent results. Here, I characterize the general effects of differences in experimental solution composition on the reported ephys measurements. For that purpose, I employ text-mining, supplemented with manual curation, to gather experimental solution information from published neurophysiological articles. I integrate the extracted information into the existing NeuroElectro database, which contains the electrophysiology, neuron type and experimental conditions information (temperature, electrode type, animal age, etc.) from the above neuroscientific literature. Exploring commonly used experimental solution recipes, I found the contribution of solution composition to explaining variance in electrophysiological properties to be small relative to the existing ephys variability. Then, I created models for predicting the variability of ephys properties commonly reported by neurophysiologists, using the available experimental conditions information. These models can be used to remove a portion of the ephys variance when comparing results from different experiments, generally making such comparisons more reliable. To validate their performance, I adjusted a portion of NeuroElectro data to the experimental conditions used by the Allen Institute for Brain Science and compared the respective ephys properties before and after the adjustment.

Meta-analysis of gene expression in mouse models of neurodegenerative disorders (2017)

There is intense interest in understanding the molecular mechanisms that contribute to neurodegenerative disorders (NDs), which involve complex interplays of genetic and environmental factors. Catching early events involved in disease initiation requires investigation of pre-symptomatic brain samples. It is difficult to capture early molecular events using post-mortem human brain samples since these samples represent the late phase of the disorder with progressive brain damage and neurodegeneration. Disease mouse models are developed to study disease progression and pathophysiology. Here, I focus on two of the most studied NDs: Alzheimer’s disease (AD) and Huntington’s disease (HD). Mouse models developed for the disease (AD or HD) often share similar phenotypes mimicking human disease symptoms, which suggests potential common underlying mechanisms of disease initiation and progression across mouse models of the same disease. Investigation of gene expression profiles of pre-symptomatic animals from different mouse models may shed light on the mechanisms occurring in the early disease phase. Gene expression profiling analyses have been performed on mouse models, and some of the studies investigate the molecular changes in the pre-symptomatic phase of AD and HD respectively. However, their findings have not reached a clear consensus. To identify shared molecular changes across mouse models, I conducted a systematic meta-analysis of gene expression in mouse models of AD and HD, consisting of 369 gene expression profiles from 23 independent studies. The goal of this project is to identify transcriptional alterations shared among different mouse models of each disease respectively, especially changes during the early disease phase that may link to disease-causing mechanisms, and potential common cross-disease changes.
For both of the disorders, the results showed subtle but biologically interpretable changes shared across mouse models in the early disease phase that may contribute to the early disease progression: dysregulation of genes involved in cholesterol biosynthesis and complement system in AD mouse models and genes encoding mitochondrial respiratory chain complexes in HD mouse models. Cross-disease similarities in the late phase suggested that different brain regions may share mechanisms in response to neuronal loss and toxic protein aggregates.

Identification and exploration of gene product annotation instability and its impact in current usages (2014)

Proteins are macromolecules responsible for a wide range of activities in the structure and function of cells. Their activities have been described in different contexts as a means to elucidate their "function". These descriptions have been captured across biological databases in a standardized format called Gene Ontology Annotations (GOA), to disseminate the knowledge and extrapolate the information to other proteins whose function is still unknown. Furthermore, the annotations are used to analyse and interpret data from high-throughput studies and also as a benchmark for the assessment of protein function prediction algorithms. Constant changes occur in GOA that can potentially impact such usages, but only limited effort has been put into exploring their instability, or into assessing the impact that these changes have on reproducibility or interpretation of previous analyses. In the present work, I performed the most comprehensive analysis of the annotation instability for 14 representative model organisms (E. coli, fruit fly, mouse, etc.). The results showed important instability patterns that were species-specific. As such information would be of use to the community to trace the instability of annotations of their interest, a web-based visualization tool was built to track these changes on a protein, functional term and species specific basis. Additionally, we identified artifacts in the annotation data that can be attributed to curation patterns. We propose that such artifacts be considered for a more accurate assessment of function prediction algorithms. Furthermore, the impact that changes in the annotations have on common settings like gene set enrichment analyses was also explored. In particular, 2,000 datasets were used to assess the robustness of enrichment results over time. On average, the results would display a 60% similarity after only 2 years.
However, cases were found where the similarity dropped 80% within the same year, demonstrating the impact that the instability has on such applications. In conclusion, the results of this work will prove useful for those who use the annotations to interpret their studies and wish to assess their reliability on a case-by-case basis.
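The abstract reports similarity between enrichment results over time but does not name the similarity measure. As a minimal sketch, assuming a Jaccard index over the sets of enriched terms obtained with different annotation versions, the comparison could look like this. The term names and years are placeholders, not data from the thesis.

```python
def jaccard(terms_a, terms_b):
    """Jaccard similarity between two collections of enriched terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty results are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical enriched terms for one dataset, re-analysed with the
# annotations as they stood in three different years.
enriched_by_year = {
    2010: {"term_A", "term_B", "term_C", "term_D", "term_E"},
    2011: {"term_A", "term_B", "term_C", "term_D", "term_F"},
    2012: {"term_A", "term_B", "term_C", "term_G", "term_H"},
}

baseline = enriched_by_year[2010]
similarity = {
    year: jaccard(baseline, terms) for year, terms in sorted(enriched_by_year.items())
}
for year, value in similarity.items():
    print(year, round(value, 2))
```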

Meta-analysis of Human Methylomes Reveals Stably Methylated Sequences Surrounding CpG Islands Associated with High Gene Expression (2014)

DNA methylation is thought to play an important role in the regulation of mammalian gene expression. Part of the evidence for this role is the observation that lack of CpG island methylation in gene promoters is associated with high transcriptional activity. However, CpG island methylation level only accounts for a fraction of the variance in gene expression, and methylation in other domains is hypothesized to play a role (e.g., island shores and shelves). We set out to improve understanding of the human methylome through a meta-analysis approach, using 1737 samples from 30 publicly available studies. An initial screen identified 15224 CpGs that are “ultra-stable” in their state, being always fully methylated or unmethylated across diverse tissues, cell types and developmental stages (974 always methylated; 14250 always unmethylated). A further analysis of ultra-stable CpGs led us to identify a novel class of CpG islands, “ravines”, that exhibit a markedly consistent pattern of low methylation with highly methylated flanking shores and shelves. Our findings were validated using independent and heterogeneous datasets assayed on the same and different technologies. Building on additional existing data types such as gene expression microarrays, DNase hypersensitive sites, and histone modifications, we found that ravines are associated with higher gene expression, compared to typical unmethylated CpG islands. This finding suggests a novel role for methylation in promoters, markedly different from the traditional view that active promoters need to be unmethylated. We propose ravines are a new class of CpG islands, established early in development and maintained through differentiation, that mark universally active genes and provide new evidence that methylation beyond the CpG island could play a role in gene expression.
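The initial screen for "ultra-stable" CpGs can be sketched as a simple filter over methylation values. The abstract does not give the cut-offs used, so the thresholds below (beta value at most 0.2 for unmethylated, at least 0.8 for methylated), the probe names and the data are assumptions for illustration only.

```python
def classify_cpg(betas, low=0.2, high=0.8):
    """Classify one CpG from its methylation beta values across samples.

    The thresholds are illustrative assumptions, not the thesis's values.
    """
    if not betas:
        raise ValueError("need at least one sample")
    if all(b <= low for b in betas):
        return "always_unmethylated"
    if all(b >= high for b in betas):
        return "always_methylated"
    return "variable"


# Hypothetical beta values for three CpGs across four samples.
methylation = {
    "cg_example_1": [0.02, 0.05, 0.01, 0.04],
    "cg_example_2": [0.95, 0.97, 0.92, 0.99],
    "cg_example_3": [0.10, 0.85, 0.40, 0.05],
}

calls = {cpg: classify_cpg(betas) for cpg, betas in methylation.items()}
ultra_stable = sorted(cpg for cpg, call in calls.items() if call != "variable")
print(calls)
print(ultra_stable)
```

A CpG counts as ultra-stable here only if every sample agrees, which mirrors the "always fully methylated or unmethylated" wording of the abstract.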

Cell type marker enrichment across brain regions and experimental conditions (2013)

The first chapter of this thesis explored the dominant gene expression pattern in the adult human brain. We discovered that the largest source of variation can be explained by cell type marker expression. Across brain regions, expression of neuron cell type markers is anti-correlated with the expression of oligodendrocyte cell type markers. Next, we explored gene function convergence and divergence in the adult mouse brain. Our contributions are as follows. First, we provide candidate cell type markers for investigating specific cell type populations. Second, we highlight orthologous genes that show functional divergence between human and mouse brains. In the second chapter, we present our preliminary work on the effects of tissue types and experimental conditions on human microarray studies. First, we measured the expression and differential expression levels of tissue-enriched genes. Next, we identified modules with similar expression levels and differential expression p-values. Our results show that expression levels reflect tissue type variation. In contrast, differential expression levels are more complex, owing to the large diversity of experimental conditions in the data. In summary, our work provides a different perspective on the functional roles of genes in human microarray studies.

Characterization of gene expression patterns in wild pacific salmon (2013)

Declines in Pacific salmon stocks in recent decades have spurred much research into their physiology and survivorship, but comparatively little into their genomics. Sockeye salmon in particular are experiencing high levels of mortality during their migration upriver, and the numbers of returning sockeye have fluctuated wildly with respect to predictions in recent years. The goal of my project is to gain insight into the basic genomics of Pacific salmon stocks, including the sockeye, through bioinformatic approaches to gene expression profiling. Using microarray technology, I have conducted a large-scale analysis of over 1,000 samples from multiple tissues, stocks, and species of salmon. I identified tissue-specific and housekeeping genes and compared them to orthologs in mouse and human, respectively. I have also classified a number of microarray samples with a support vector machine (SVM) using qPCR data showing the presence of several common pathogens affecting Pacific salmon populations. Using identified housekeeping genes as normalizing factors, I modeled in silico a qPCR assay designed to identify salmon as infected or uninfected with a particular pathogen. With these data I hope to increase basic knowledge of the genomics of the Pacific salmon.

Meta-analysis of gene expression in individuals with autism spectrum disorders (2013)

Autism spectrum disorders (ASD) are clinically heterogeneous and biologically complex. State-of-the-art genetics research has unveiled a large number of variants linked to ASD. But in general it remains unclear what biological factors lead to changes in the brains of autistic individuals. We build on the premise that these heterogeneous genetic or genomic aberrations will converge towards a common impact downstream, which might be reflected in the transcriptomes of individuals with ASD. Similarly, a considerable number of transcriptome analyses have been performed in attempts to address this question, but their findings lack a clear consensus. As a result, each of these individual studies has not led to any significant advance in understanding the autistic phenotype as a whole. The goal of this research is to comprehensively re-evaluate these expression profiling studies by conducting a systematic meta-analysis. Here, we report a meta-analysis of over 1000 microarrays across twelve independent studies on expression changes in ASD compared to unaffected individuals, in blood and brain. We identified a number of genes that are consistently differentially expressed across studies of the brain, suggestive of effects on mitochondrial function. In blood, consistent changes were more difficult to identify, despite individual studies tending to exhibit larger effects than the brain studies. Our results are the strongest evidence to date of a common transcriptome signature in the brains of individuals with ASD.
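The abstract does not state which statistical method was used to combine evidence across studies. Purely as an illustration of how per-study p-values for one gene can be pooled, the sketch below uses Fisher's combined probability test, which is a swapped-in technique and not necessarily the one used in the thesis. The input p-values are made up.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    if not p_values or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("p-values must be a non-empty sequence in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (X/2)^i / i! incrementally
        total += term
    return min(1.0, math.exp(-half_stat) * total)


# Hypothetical p-values for one gene in three independent studies.
study_p_values = [0.04, 0.20, 0.03]
combined = fisher_combined_p(study_p_values)
print(f"combined p-value: {combined:.4f}")
```

Fisher's method assumes the studies are independent and ignores the direction of change, so a real analysis would also need to check that the effects agree in sign.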

View record

Wide-scale comparison of transcriptome data and the role of microRNA in major depression and suicide (2011)

The first chapter of this thesis addresses a common problem in genomics experiments: interpreting a resulting "hit list" of interesting genes. We present work on an approach for summarizing and exploring "hit lists" that makes use of the large amount of gene expression data in public repositories such as the Gene Expression Omnibus. We compare the query list with datasets that we have analyzed for differential expression of genes. Studies that have similarities to the given hit list yield additional insights, help contextualize studies, and serve as a basis for future meta-analysis. A conceptually similar problem that we addressed is the classification or clustering of datasets based on patterns of differential expression. Both problems required a method for determining distances between datasets based on rankings of genes. We tested and benchmarked several methods using manually annotated datasets. The method that performed best according to our evaluation process is based on Kendall's Tau top-k distance. We investigated potential sources of confounds, finding that the largest challenge may be posed by the high prevalence of certain gene expression patterns. These highly prevalent patterns tended to dominate search results. Nonetheless, we demonstrated the effectiveness of this approach in a case study. In the second chapter, we investigated the role of microRNAs in the context of major depression and suicide. We profiled microRNA and messenger RNA levels in post-mortem prefrontal cortex and hippocampus brain tissue of depressed suicides, suicides, and controls. In the prefrontal cortex, we found miR-1202 to be down-regulated in suicides versus controls, and LCT (lactase enzyme) was up-regulated in suicides or depressed suicides compared to controls. The former result was independently confirmed using quantitative PCR. 
While further study is needed, our results have the potential to provide insight into molecular changes in the brains of depressed and suicidal individuals.
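
As an illustration of the dataset-distance idea above, the following is a minimal sketch of a Kendall's Tau distance between two top-k ranked gene lists, following the general definition found in the rank-aggregation literature. It is not taken from the thesis: the function name `kendall_tau_top_k`, the penalty parameter `p`, and the example gene labels are assumptions made for illustration, and the thesis may have used a different variant or normalization.

```python
from itertools import combinations


def kendall_tau_top_k(list_a, list_b, p=0.5):
    """Kendall's Tau distance between two top-k lists (illustrative sketch).

    list_a, list_b: ranked lists of gene identifiers, best first, no repeats.
    p: penalty for a pair found in only one list, whose relative order in the
       other list cannot be inferred (assumed default of 0.5).
    """
    rank_a = {gene: r for r, gene in enumerate(list_a)}
    rank_b = {gene: r for r, gene in enumerate(list_b)}
    union = sorted(set(rank_a) | set(rank_b))

    distance = 0.0
    for i, j in combinations(union, 2):
        i_a, j_a = i in rank_a, j in rank_a
        i_b, j_b = i in rank_b, j in rank_b

        if i_a and j_a and i_b and j_b:
            # Both genes ranked in both lists: penalize opposite orderings.
            if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0:
                distance += 1
        elif i_a and j_a and (i_b != j_b):
            # Both in A, only one in B: that one is implicitly ahead in B.
            present, absent = (i, j) if i_b else (j, i)
            if rank_a[present] > rank_a[absent]:
                distance += 1
        elif i_b and j_b and (i_a != j_a):
            # Both in B, only one in A: that one is implicitly ahead in A.
            present, absent = (i, j) if i_a else (j, i)
            if rank_b[present] > rank_b[absent]:
                distance += 1
        elif (i_a and j_a) or (i_b and j_b):
            # Both genes appear in one list only: order in the other is unknown.
            distance += p
        else:
            # One gene appears only in A, the other only in B: always discordant.
            distance += 1
    return distance


# Hypothetical hit lists sharing two of three genes.
print(kendall_tau_top_k(["x", "y", "z"], ["x", "z", "w"]))  # 2.0
```

A smaller value indicates that two hit lists rank their top genes more similarly. In practice the raw value would likely be normalized by its maximum so that lists of different lengths are comparable.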

View record

Evaluating Coexpression Analysis for Gene Function Prediction (2010)

Microarray expression data sets vary in size, data quality and other features, but most methods for selecting coexpressed gene pairs use a ‘one size fits all’ approach. There have been many different procedures for selecting coexpressed gene pairs of high functional similarity from an expression dataset. However, it is not clear which procedure performs best as there are few studies reporting comparisons of these approaches. The goal of this thesis is to develop a set of “best practices” in order to select coexpression links of high functional similarity from an expression dataset, along with methods for identifying datasets likely to yield poor information. With these goals, we hope to improve the quality of gene function predictions produced by coexpression analysis. Using 80 human expression datasets, we examined the impact of different thresholds, correlation metrics, expression data filtering and transformation procedures on performance in functional prediction. We also investigated the relationship between data quality and other features of expression datasets and their performance in functional prediction. We used the annotations of the Gene Ontology as a primary metric to measure similarity in gene function, and employed additional functional metrics for validation. Our results show that several dataset features have a greater influence on the performance in functional prediction than others. Expression datasets which produce coexpressed gene pairs of poor functional quality can be identified by a similar set of data features. Some procedures used in coexpression analysis have a negligible effect on the quality of functional predictions while others are essential to achieving the best performance in the algorithm. We also find that some procedures interact greatly with features of expression datasets and that these interactions increase the number of high quality coexpressed gene pairs retrieved through coexpression analysis.
This thesis uncovers important information on the many intrinsic and extrinsic factors that influence the performance in functional prediction of coexpression analysis. The information summarized here will help guide future studies using coexpression analysis and improve the quality of gene function predictions.
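
To make the notion of selecting coexpression links by a correlation metric and a threshold concrete, here is a minimal sketch under stated assumptions. It uses Pearson correlation and keeps a top fraction of gene pairs. The function names, the `top_fraction` parameter, and the toy expression profiles are hypothetical and are not the procedures or settings evaluated in the thesis.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        # Assumption: a constant profile is treated as uncorrelated.
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def select_coexpression_links(profiles, top_fraction=0.01):
    """Return the most highly correlated gene pairs as (gene_a, gene_b, r).

    profiles: dict mapping gene identifier to a list of expression values,
              one value per sample, same sample order for every gene.
    top_fraction: share of all gene pairs to keep (assumed threshold style).
    """
    scored = [
        (pearson(profiles[a], profiles[b]), a, b)
        for a, b in combinations(sorted(profiles), 2)
    ]
    scored.sort(reverse=True)
    n_keep = max(1, round(top_fraction * len(scored)))
    return [(a, b, r) for r, a, b in scored[:n_keep]]


# Toy data: g1 and g2 rise together, g3 moves in the opposite direction.
toy = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [1, 3, 2, 4],
}
links = select_coexpression_links(toy, top_fraction=0.2)
print(links)
```

Swapping `pearson` for a rank-based metric, or changing `top_fraction`, corresponds to the kinds of choices (correlation metric, threshold) whose effect on functional prediction the abstract describes comparing.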

View record


