Paul Pavlidis


Research Interests

cellular and molecular neuroscience
disorders of the nervous system
single-cell genomics
Computational Biology
Gene regulation

Relevant Thesis-Based Degree Programs

Research Options

I am available and interested in collaborations (e.g. clusters, grants).
I am interested in and conduct interdisciplinary research.
I am interested in working with undergraduate students on research projects.

Research Methodology

machine learning
Computational Biology
Single-cell genomics


Master's students
Doctoral students
Postdoctoral Fellows
Any time / year round

Please visit for more information on our research.

I support public scholarship, e.g. through the Public Scholars Initiative, and am available to supervise students and Postdocs interested in collaborating with external partners as part of their research.
I support experiential learning experiences, such as internships and work placements, for my graduate students and Postdocs.
I am open to hosting Visiting International Research Students (non-degree, up to 12 months).
I am interested in hiring Co-op students for research placements.

Complete these steps before you reach out to a faculty member!

Check requirements
  • Familiarize yourself with program requirements. You want to learn as much as possible from the information available to you before you reach out to a faculty member. Be sure to visit the graduate degree program listing and program-specific websites.
  • Check whether the program requires you to seek commitment from a supervisor prior to submitting an application. For some programs this is an essential step while others match successful applicants with faculty members within the first year of study. This is either indicated in the program profile under "Admission Information & Requirements" - "Prepare Application" - "Supervision" or on the program website.
Focus your search
  • Identify specific faculty members who are conducting research in your specific area of interest.
  • Establish that your research interests align with the faculty member’s research interests.
    • Read up on the faculty members in the program and the research being conducted in the department.
    • Familiarize yourself with their work, read their recent publications and past theses/dissertations that they supervised. Be certain that their research is indeed what you are hoping to study.
Make a good impression
  • Compose an error-free and grammatically correct email addressed to your specifically targeted faculty member, and remember to use their correct titles.
    • Do not send non-specific, mass emails to everyone in the department hoping for a match.
    • Address the faculty members by name. Your contact should be genuine rather than generic.
  • Include a brief outline of your academic background, why you are interested in working with the faculty member, and what experience you could bring to the department. The supervision enquiry form guides you with targeted questions. Be sure to craft compelling answers to these questions.
  • Highlight your achievements and why you are a top student. Faculty members receive dozens of requests from prospective students and you may have less than 30 seconds to pique someone’s interest.
  • Demonstrate that you are familiar with their research:
    • Convey the specific ways you are a good fit for the program.
    • Convey the specific ways the program/lab/faculty member is a good fit for the research you are interested in/already conducting.
  • Be enthusiastic, but don’t overdo it.
Attend an information session

G+PS regularly provides virtual sessions that focus on admission requirements and procedures and tips on how to improve your application.



Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Mining of differential expression across thousands of conditions (2022)

Differential expression (DE) analysis is performed to identify genes associated with a phenotype based on changes in RNA expression levels. The result of various bioinformatics analyses is a hit list of genes that requires further interpretation to identify the functions of these genes and prioritize the genes for further study; there is currently a lack of objective metrics for gene prioritization. The ease of generating transcriptomic data has resulted in the accumulation of massive amounts of data in repositories (“NCBI GEO”). In my thesis, I investigate means of harnessing this archived data for interpreting hit lists. First, I describe the development of Gemma, a large corpus containing over 10,000 curated and reprocessed datasets made suitable for data mining. I contributed by establishing the curation guidelines for using ontology concepts during dataset annotation, and characterizing Gemma’s features. Next, I describe the evaluation of Connectivity Map (CMap), a hit list interpretation framework designed for in silico repositioning of previously approved drugs for treating human diseases. Through a series of analyses, I demonstrated that drug repositioning results between two versions of CMap are discordant, and that this discordance is caused by low reproducibility of DE profiles both between and within each CMap. This demonstrates the importance of high-quality data and careful evaluation of hit list interpretation frameworks. Finally, in a collaboration, we showed that there are huge differences in how often genes are differentially expressed (“DE prior”) across a large corpus of human datasets. We proposed that the prior could be used to facilitate hit list interpretation, identifying genes that are more specifically DE in a studied phenotype. I expanded this work further by examining variables that may influence the DE prior such as microarray platform gene coverage; I found the DE prior robust to these variables. I also demonstrate that given enough data, context (e.g. tissue) or topic specific DE priors can be developed for topic-specific applications. My work contributes to our knowledge of patterns of gene differential expression and their utility in addressing questions related to gene function in human health and disease.

Prioritizing genes with functionally distinct splice isoforms (2021)

Most mammalian genes generate multiple transcripts via splicing, and we do not know the function of most splice variants. Currently, there is a debate about how many splice variants are likely nonfunctional or “noisy” transcripts. My thesis explores the claim that alternative splicing vastly increases the genome’s functional diversity in the context of noisy splicing, and in doing so attempts to identify candidate cases for which alternative splicing is likely to be of consequence. To ground computational analyses of genes with multiple splice variants in experimental data, the field needs a corpus of genes that have experimental evidence of functionally distinct splice isoforms (FDSIs). We curated the literature for 743 genes and found that ~5% had literature evidence of FDSIs. This suggests that the claim that alternative splicing vastly increases genomic functional diversity is extrapolated from a few key genes. Next, I developed a pipeline to identify candidate genes with FDSIs using long-read RNA-seq data. The output of my pipeline is a computationally-prioritized list of candidate genes likely to have FDSIs based on features such as expression, conservation, functional domains, and coding-potential. From an initial set of 6,799 genes with multiple splice variants, I prioritized 79 candidate genes. While I had limited long-read data, my work aids in establishing guidelines for high-throughput prioritization of genes with FDSIs for future study. With our collaborators, I investigated a specific application of my pipeline to the voltage-gated calcium channel gene Cacna1e. Using novel long-read data, I established a set of 2,110 splice variants for Cacna1e. Based on properties of the channel, I determined that at most 154 splice variants are likely to encode a functional channel. My results highlighted the amount of potential noise produced by one gene’s expression. Through my investigation, I added to the growing body of literature in support of noisy splicing. I also provided the field with a list of interesting genes with multiple splice variants. This includes a gold standard set of genes from the experimental literature, and a novel set of prioritized genes. Both sets of genes will be useful for future studies of gene function.

The interpretation of gene coexpression in systems biology (2020)

One of the key features of transcriptomic data is the similarity of expression patterns among groups of genes, referred to as coexpression. It has been shown that coexpressed genes tend to share similar functions. Based on this, a common assumption is that gene coexpression is a result of transcriptional regulation and therefore, regulatory relationships could be inferred from coexpression. However, success in inferring such relationships has been limited and there are questions about the source and interpretation of coexpression. Here I explore coexpression as an observed signal from the data, examine its source and assess its relevance for inferring regulatory relationships. In Chapter 2, I studied differential coexpression, which refers to the alteration of gene coexpression between biological conditions. It is commonly assumed that differential coexpression can reveal rewiring of transcription regulatory networks, specifically among the genes that maintain their average expression level between the conditions. However, I show that to a large extent and in contrast to this common assumption, differential coexpression is more parsimoniously explained by changes in average expression levels. This finding demonstrates limitations for inference of regulatory rewiring from coexpression and poses questions for the underlying causes of the observed coexpression. In Chapter 3, I studied cellular composition variation among bulk tissue samples as a source of variance and the observed coexpression. I found that for most genes, differences in expression levels across cell types account for a large fraction of their variance and as a result genes with similar cell-type expression profiles appear to be coexpressed. Finally, I showed that this coexpression dominates the underlying intra-cell-type coexpression and also has the two prominent features of coexpression in the bulk tissue: reproducibility and biological relevance. Through my studies, I was able to provide an explanation for much of the observed coexpression in the bulk tissue and shed light on its resolution and limitation for inference of regulatory relationships. I also studied coexpression in single-nucleus data and show that some of the observed coexpression in it is likely to be attributed to transcriptional regulation, which could be a subject for future studies.

Computational analysis of ribonucleic acid basepairs in RNA structure and RNA-RNA interactions (2016)

Ribonucleic acids (RNA) are an essential part of cellular function, transcribed from DNA and translated into protein. Rather than a passive informational medium, RNA can also be highly functional and regulatory. Certain RNAs fold into specific structures giving them enzymatic properties, while others bind to specific targets to guide regulatory processes. With the advent of next-generation sequencing, a large number of novel non-coding RNAs have been discovered through whole-transcriptome sequencing. Many efforts have been made to study the structure and binding partners of these novel RNAs, in order to determine their function and roles. This work begins with a description of my R package R4RNA for manipulating RNA basepair data, the building blocks of RNA structure and RNA binding. The package deals with the input/output and manipulation of RNA basepair and sequence data, along with statistical and visualization methods for evaluation, interpretation and presentation. We also describe R-chie, a visualization tool and web server built on R4RNA that visualizes complex RNA basepairs in conjunction with sequence alignments. We then conduct the largest known evaluation of RNA-RNA interaction methods to date, running state-of-the-art tools on curated experimentally validated datasets. We end with a review of cotranscriptional RNA basepair formation, summarizing biological, theoretical and computational methods for the process, and future directions for improving classical methods in RNA structure prediction. All content chapters of this thesis have been peer-reviewed and published. The work on R4RNA has led to two publications, with the package used to great visual effect by various publications and also adopted by the RNA structure database Rfam. My assessment of RNA-RNA interaction is at present the only published evaluation of its kind, and will hopefully become a benchmark for future tool development and a guide to selecting appropriate tools and algorithms. Our published review on RNA cotranscriptional folding is well-received, being the first review specifically on its topic.

Generation of Truncated Proteoforms in Proteolytic Networks: Modeling and Prediction in the Protease Web (2016)

Primarily controlled by gene expression and fine-tuned by translation and degradation rates, protein activity is governed by a plethora of post-translational modifications such as phosphorylation and glycosylation, which generate a diversity of protein species and thereby control complex biological phenotypes. Processing by proteases is a particular modification leading to the irreversible generation of stable protein truncations. Well understood in examples such as signal- or propeptide removal, recent analyses consistently identify >50% of N-terminal peptides mapping inside the protein sequence as predicted by genomics, indicating an important regulatory role of proteases. All proteins undergo protease cleavage as part of processing or degradation, a second biological process controlled by proteases. Proteases are involved in numerous pathologies and commonly considered as drug targets. However, protease research and drug development are complicated, in part due to widespread crosstalk between proteases. Proteases regulate other proteases through direct cleavage or cleavage of protease inhibitors in a complex network of protease interactions, the protease web. Yet, a comprehensive analysis of the protease web has not been performed, hampering assignment of proteases to clear biological roles, their direct substrates, and protease inhibitor drug targeting. A second problem in the identification of protein processing is the potential confound between protein termini generated by protease processing, alternative splicing, and alternative translation. In this thesis, I computationally analyzed large and diverse datasets of protease interactions and protein truncations to gain insight into complex proteolytic processes and to guide biochemical follow-up experiments. Analyzing protease cleavage, alternative splicing and alternative translation data incorporated into our database TopFIND, I found that protease cleavage and alternative translation likely generate most protein truncations. Combining protease cleavage and inhibition data in a graph model of the protease web, I demonstrated extensive protease crosstalk and then predicted and validated a proteolytic pathway. Finally, investigating strategies for the prediction of protease inhibition, I predicted hundreds of protease-inhibitor interactions, and validated inhibition of kallikrein-5 by serpin B12. This work thus generated predictions for biochemical follow-up as well as important insights into the regulation of biological systems through proteases.

Bioinformatics for neuroanatomical connectivity (2012)

Neuroscience research is increasingly dependent on bringing together large amounts of data collected at the molecular, anatomical, functional and behavioural levels. This data is disseminated in scientific articles and large online databases. I utilized these large resources to study the wiring diagram of the brain or ‘connectome’. The aims of this thesis were to automatically collect large amounts of connectivity knowledge and to characterize relationships between connectivity and gene expression in the rodent brain. To extract the knowledge embedded in the neuroscience literature I created the first corpus of neuroscience abstracts annotated for brain regions and their connections. These connections describe long distance or macroconnectivity between brain regions. The collection of over 1,300 abstracts allowed accurate training of machine learning classifiers that mark brain region mentions (76% recall at 81% precision) and neuroanatomical connections between regions (50% sentence level recall at 70% precision). By automatically extracting connectivity statements from the Journal of Comparative Neurology I generated a literature based connectome of over 28,000 connections. Evaluations revealed that a large number of brain region descriptions are not found in existing lexicons. To address this challenge I developed novel methods that allow mapping of brain region terms to enclosing structures. To further study the connectome I moved from scientific articles to large online databases. By employing resources for gene expression and connectivity I showed that patterns of gene expression correlate with connectivity. First, two spatially anti-correlated patterns of mouse brain gene expression were identified. These signatures are associated with differences in expression of neuronal and oligodendrocyte markers, suggesting they reflect regional differences in cellular populations. Expression level of these genes is correlated with connectivity degree, with regions expressing the neuron-enriched pattern having more incoming and outgoing connections with other regions. Finally, relationships between profiles of gene expression and connectivity were tested. Specifically, I showed that brain regions with similar expression profiles tend to have similar connectivity profiles. Further, optimized sets of connectivity linked genes are associated with neuronal development, axon guidance and autistic spectrum disorder. This demonstration of text mining and large scale analysis provides new foundations for neuroinformatics.

Meta-analysis of expression profiling data in the postmortem human brain (2012)

Schizophrenia is a severe psychiatric illness for which the precise etiology remains unknown. Studies using postmortem human brain have become increasingly important in schizophrenia research, providing an opportunity to directly investigate the diseased brain tissue. Gene expression profiling technologies have been used by a number of groups to explore the postmortem human brain and seek genes which show changes in expression correlated with schizophrenia. While this has been a valuable means of generating hypotheses, there is a general lack of consensus in the findings across studies. Expression profiling of postmortem human brain tissue is difficult due to the effect of various factors that can confound the data. The first aim of this thesis was to use control postmortem human cortex for identification of expression changes associated with several factors, specifically: age, sex, brain pH and postmortem interval. I conducted a meta-analysis across the control arm of eleven microarray datasets (representing over 400 subjects), and identified a signature of genes associated with each factor. These genes provide critical information towards the identification of problematic genes when investigating postmortem human brain in schizophrenia and other neuropsychiatric illnesses. The second aim of this thesis was to evaluate gene expression patterns in the prefrontal cortex associated with schizophrenia by exploring two methods of analysis: differential expression and coexpression. Seven schizophrenia microarray studies of prefrontal cortex were combined for a total of 153 subjects with schizophrenia and 153 healthy controls. Meta-analysis was conducted with careful consideration for the effects of covariates, revealing a robust list of 98 differentially expressed ‘schizophrenia genes’. Using the same seven schizophrenia datasets, coexpression networks were generated for control and schizophrenia cohorts within each dataset and then combined across studies using a rank aggregation approach. Topological properties of our ‘schizophrenia genes’ were evaluated in the context of each network, highlighting differences in correlation structure of these genes in the control and schizophrenia brain. Together these results converge towards a general conclusion, emphasizing that the integration of postmortem human brain expression profiling data improves statistical power and is particularly useful in detecting subtle yet consistent changes in expression associated with schizophrenia.

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Investigations into transcriptomic engram cells (2024)

The physical substrate of memory in the brain is still a subject of debate. It is thought that strongly connected networks of neurons, called engrams, can reproduce the original activity pattern from partial cues. Synapses between engram neurons are thus the most likely candidate for memory storage, but questions remain over their long-term stability. Some groups have proposed that activity-related transcription, while normally considered transient, could be the beginning of a transition to a more permanent cell type that stores long-term memory. Persistent chromatin conformation changes and resulting transcription changes, triggered by reactivation, could be a stable long-lasting storage mechanism which enables remembering at remote time points. Some groups have reported transcription predicted by this model. We trained classifiers using single-cell RNA sequencing (scRNA-seq) data from the earlier transient signature (Jaeger et al., 2018; Lacar et al., 2016) and the long-term engram neuron signature (Chen et al., 2020). The transient early signature was readily identifiable, but when trying to identify engram cells exhibiting the long-term memory signature we found a significant decrease in classifier performance. The important features of the classifier were not genes reported as differentially expressed in the original publication. Reproducing the original authors’ results using their data proved challenging, suggesting the persistent long-term memory signature was not detected or does not exist. Reactivation did not appear to elicit a strong transcriptional response either, which contradicts models of transcriptomic engram cell formation. Unfortunately, the design of the original experiment does not allow for the falsification of the supposed persistent transcriptional program induced by reactivation. My research suggests future directions to take in evaluating transcription’s contribution to synaptic plasticity and memory.

Identification of cell type marker genes of the brain and their use in estimation of cell type proportions (2022)

Establishing the molecular diversity of cell types is crucial for the study of the nervous system. I compiled a cross-laboratory database of mouse brain cell type-specific transcriptomes from 36 major cell types from across the mammalian brain using rigorously curated published data from pooled cell type microarray and single-cell RNA-sequencing (RNA-seq) studies. I used these data to identify cell type-specific marker genes, discovering a substantial number of novel markers, many of which we validated using computational and experimental approaches. By examining datasets with known cell type proportion differences, I further demonstrate that summarized expression of marker gene sets (MGSs) in bulk tissue data can be used to estimate the relative cell type abundance across samples. Using this approach, I show that the majority of genes previously reported as differentially expressed in Parkinson’s disease can be attributed to the reduction in dopaminergic cell number rather than regulatory events. To facilitate use of this expanding resource, I provide a user-friendly web interface at

Cellular composition variation drives coexpression-based gene function prediction (2021)

Coexpression analysis has been widely used for gene function prediction, based on the principle of guilt by association. Most studies use transcriptomic data obtained from bulk tissues, where the expression level of genes reflects the contribution of multiple cell types. Previous work has already documented how variability of cellular composition impacts coexpression analysis. However, the connection between the predictability of gene functions, coexpression networks and cell type profiles has not been studied. I hypothesized that one reason bulk-data-derived coexpression networks contain signals relevant to function prediction is that they contain information about genes’ expression profiles across cell types. Focusing on human brain datasets, I applied several approaches to test this hypothesis, including creating simulated bulk datasets from single-nucleus data and bulk data deconvolution. I find that much predictive power can be attributed to cell type proportion variation. Consequently, a more explicit and interpretable function prediction can be made directly using expression patterns across cell types, which not only yields similar results but also clearly reveals the association between the functional terms and specific cell types. These findings have important implications for coexpression analysis and function prediction.

Large-scale mining of differential expression data for insight into gene function (2021)

A persistent challenge in genetics and genomics is the interpretation of “hit lists” of genes, leading to the development of, and almost universal application of methods such as Gene Ontology (GO) enrichment analysis. While these methods have been of unquestionable utility, GO enrichment and similar approaches based on gene annotations leave much to be desired and they are often used as a “sanity check” rather than a way to make discoveries. To offer a complementary perspective with the potential to remedy some existing challenges, I developed and evaluated an algorithm that helps put hit lists of genes into biological context by performing large-scale mining on patterns of differential expression (DE). In this work, I present the development and evaluation of my algorithm which mines over 10,000 transcriptomic datasets in a process we term “condition enrichment”. The output of the algorithm is a list of biological condition comparisons (drug treatments, diseases, etc.) scored according to their relatedness (in terms of DE) to the query genes. I show that performing searches on gene sets of a priori interest enables my algorithm to effectively identify known gene-condition relationships in real and simulated data, providing a useful summary of the condition comparisons most highly associated with the differential expression of the gene set. Finally, I present a powerful open-source web application to provide researchers access to Gemma DE, in the hope that it will aid future research.

Single-cell analytics for phospho flow cytometry reveals dynamic interactions between molecular pathways (2021)

Quantitative analysis of large single-cell measures acquired by phospho flow cytometry typically involves establishing inclusion gate thresholds and combining measures from accepted cells into a single median metric. Though this analysis method is simple, it overlooks the heterogeneity of cell populations and there could be information missing from the single-cell level. Here, we have formulated approaches that can recognize the heterogeneity and extract additional information involving dose-response and interactions between multiple molecules from phospho flow cytometry datasets. Using phospho flow multiplexed sampling of cell physical features, and primary antibodies against protein markers, including GAPDH as a protein expression control, HA tag as an exogenous gene/variant transfection measurement, and 8 antibodies detecting the activation (phosphorylation) states of 8 proteins within conserved molecular pathways, two panels of phospho-specific antibodies were used simultaneously for multiplexed measures in the same cells. Our approach involves single-cell standardization, fitting loess regression, identifying linear domains in dose-response plots, building linear mixed-effects models, and multi-dimensional analyses to detect interactions between phosphorylated protein markers. We demonstrate the utility of this approach by expressing wild-type and 5 variants (4A, D268E, Y138L, P38H, G129E) of PTEN on 8 markers of molecular pathways downstream of PTEN, and we also expressed RHEB WT testing its impact on markers in the shared associated pathways. We succeeded in differentiating subtypes of PTEN loss-of-function variants and were able to predict that PTEN P38H is a loss-of-lipid-phosphatase-function variant. We were also able to infer that pAKT, p4EBP1, pS6, and pCREB are all downstream targets of PTEN regulation while pAKT is between PTEN and p4EBP1, pS6, or pCREB. In conclusion, our results demonstrate dose response and molecular pathway interactions unavailable from reducing population data to single values, and our approach manifests strong promise in variant function measurement and molecular signaling pathway inference.

An investigation into the utility of guilt by association machine learning algorithms for the prioritization of autism spectrum disorder candidate risk genes (2020)

Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by impairments in social interaction and communication, and restrictive repetitive behaviours or interests, with extreme phenotypic and genetic heterogeneity. Currently, genetic association studies have identified 90 risk genes with high confidence out of an estimated 1000. Researchers have begun to use machine learning methods leveraging heterogeneous biological network data in attempts to aid in discovery of ASD risk genes. However, the real-world utility of these studies is questionable: network-based machine learners are often biased towards well studied genes because they operate on a principle called “guilt by association.” In this thesis, I evaluate and compare genetic and computational approaches to ASD risk gene prioritization. I demonstrate that network-based computational approaches are adding little additional useful information compared to genetic approaches for prioritization. Furthermore, I demonstrate that gene expression profiles and generic measures of disease gene likelihood may provide less biased contextual information that can be used to supplement genetic association data to prioritize ASD risk genes. Lastly, I discuss how data quality and data dependence impact evaluation of machine learning algorithms and genetic association studies.

A Study of Methods for Learning Phylogenies of Cancer Cell Populations from Binary Single Nucleotide Variant Profiles (2015)

An accurate phylogeny of a cancer tumour has the potential to shed light on numerous phenomena, such as key oncogenetic events, relationships between clones, and evolutionary responses to treatment. Most work in cancer phylogenetics to date relies on bulk tissue data, which can resolve only a few genotypes unambiguously. Meanwhile, single-cell technologies have considerably improved our ability to resolve intra-tumour heterogeneity. Furthermore, most cancer phylogenetic methods use classical approaches, such as Neighbor-Joining, which put all extant species on the leaves of the phylogenetic tree. But in cancer, ancestral genotypes may be present in extant populations. There is a need for scalable methods that can capture this phenomenon. We have made progress on this front by developing the Genotype Tree representation of cancer phylogenies, implementing three methods for reconstructing Genotype Trees from binary single-nucleotide variant profiles, and evaluating these methods under a variety of conditions. Additionally, we have developed a tool that simulates the evolution of cancer cell populations, allowing us to systematically vary evolutionary conditions and observe the effects on tree properties and reconstruction accuracy. Of the methods we tested, Recursive Grouping and Chow-Liu Grouping appear to be well-suited to the task of learning phylogenies over hundreds to thousands of cancer genotypes. Of the two, Recursive Grouping has the strongest and most stable overall performance, while Chow-Liu Grouping has a superior asymptotic runtime that is competitive with Neighbor-Joining.

View record

An analysis of genetic variants associated with autism spectrum disorder (2018)

Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder affecting roughly 1% of the human population. Genomics research to date has discovered only a fraction of the variants causative for ASD. To this end, we whole-genome sequenced a cohort of 119 ASD individuals in order to find likely pathogenic variation. After quality and frequency filters, we prioritized variants as likely causal according to rarity and predicted damage scores (CADD and Snap2). Here, we report five de novo damaging variants and seven likely damaging variants of unknown inheritance. Since much of the variation reported in ASD cases is uncertain both in function and in significance in ASD, we aimed to functionally characterize missense variants from the ASD literature in PTEN and SYNGAP1, two well-characterized ASD genes. We curated missense variants of unknown significance from the ASD literature and assayed their functional effect in yeast using a Synthetic Genetic Array. We chose previously biochemically validated variants, population variants, and other variants in the genes of interest to gain insight into the functional diversity of PTEN and SYNGAP1 variation. We established functional effect of the ASD variants of unknown significance in PTEN and showed that computational predictors of damage are reasonable predictors of variants’ functional effects in yeast. We found that agreement of computational metrics breaks down when predicting damage in certain genes, such as SYNGAP1. Functionalizing variants in this way contributes to our understanding of the range of functional effects of ASD variants.

View record

Mega-analysis of gene expression patterns across tissues in human and mouse (2018)

Expression patterns across tissues are a primary indicator of gene function. High-throughput technology has produced many cross-tissue transcriptomic data sets (tissue panel data sets). However, the existence of multiple tissue panel data sets makes it difficult for the scientific community to decide whether these data sets are equally valid or which data set to choose. To date, the multiple tissue panel data sets have been neither well compared nor fully evaluated. In my Master’s thesis, I collected a large number of publicly available tissue panel data sets, harmonized them, integrated them into a tissue expression atlas covering human and mouse data, compared and contrasted the data sets across the atlas, and made a preliminary evaluation of each data set with a gene-specific disagreement index that I developed. I found that, in general, these data sets were in good agreement. However, in certain data sets the amount of disagreement was high, which indicated that the quality of these data sets was suspect. Applying the disagreement index, I was able to offer a summarized expression pattern in the tissue expression atlas with either consensus or disagreements outlined. I also developed a web-based prototype for accessing this atlas. Furthermore, I explored the range of changes in gene expression patterns that may be caused by experimental conditions, such as diseases or drug treatments. I found that most of the changes were not as dramatic as a change from unexpressed to highly expressed, even though these changes were reported as statistically significant in the literature. Only a couple of conditions, such as cancer or inflammation, could cause an unexpressed-to-highly-expressed change, because tissue composition in those conditions was changed substantially.

View record

Exploring sources of variability in electrophysiology data of mammalian neurons (2017)

Recently, there has been a major effort by neuroscientists to systematically organize and integrate vast quantities of brain data. However, electrophysiological properties have been shown to be sensitive to experimental conditions, so directly comparing them between experiments could lead to inconsistent results. Here, I characterize the general effects of differences in experimental solution composition on reported electrophysiology (ephys) measurements. For that purpose, I employ text-mining, supplemented with manual curation, to gather experimental solution information from published neurophysiological articles. I integrate the extracted information into the existing NeuroElectro database, which contains the electrophysiology, neuron type and experimental conditions information (temperature, electrode type, animal age, etc.) from the same neuroscientific literature. Exploring commonly used experimental solution recipes, I found the effect of solution composition in explaining variance in electrophysiological properties to be small relative to the amount of existing ephys variability. Then, I created models for predicting the variability of ephys properties commonly reported by neurophysiologists, using the available experimental conditions information. These models can be used to remove a portion of the ephys variance when comparing results from different experiments, generally making such comparisons more reliable. To validate their performance, I adjusted a portion of NeuroElectro data to the experimental conditions used by the Allen Institute for Brain Science and compared the respective ephys properties before and after the adjustment.

View record

Meta-analysis of gene expression in mouse models of neurodegenerative disorders (2017)

There is intense interest in understanding the molecular mechanisms that contribute to neurodegenerative disorders (NDs), which involve complex interplays of genetic and environmental factors. Catching early events involved in disease initiation requires investigating pre-symptomatic brain samples. It is difficult to capture early molecular events using post-mortem human brain samples since these samples represent the late phase of the disorder with progressive brain damage and neurodegeneration. Disease mouse models are developed to study disease progression and pathophysiology. Here, I focus on two of the most studied NDs: Alzheimer’s disease (AD) and Huntington’s disease (HD). Mouse models developed for the same disease (AD or HD) often share similar phenotypes mimicking human disease symptoms, which suggests potential common underlying mechanisms of disease initiation and progression across mouse models of that disease. Investigating gene expression profiles of pre-symptomatic animals from different mouse models may shed light on the mechanisms occurring in the early disease phase. Gene expression profiling analyses have been performed on mouse models, and some of the studies investigate the molecular changes in the pre-symptomatic phase of AD and HD respectively. However, their findings have not reached a clear consensus. To identify shared molecular changes across mouse models, I conducted a systematic meta-analysis of gene expression in mouse models of AD and HD, consisting of 369 gene expression profiles from 23 independent studies. The goal of this project is to identify transcriptional alterations shared among different mouse models of each disease, especially changes during the early disease phase that may be linked to disease-causing mechanisms, as well as potential common cross-disease changes. For both disorders, the results showed subtle but biologically interpretable changes shared across mouse models in the early disease phase that may contribute to early disease progression: dysregulation of genes involved in cholesterol biosynthesis and the complement system in AD mouse models, and of genes encoding mitochondrial respiratory chain complexes in HD mouse models. Cross-disease similarities in the late phase suggested that different brain regions may share mechanisms in response to neuronal loss and toxic protein aggregates.

View record

Identification and exploration of gene product annotation instability and its impact in current usages (2014)

Proteins are macromolecules responsible for a wide range of activities in the structure and function of cells. Their activities have been described in different contexts as a means to elucidate their “function”. These descriptions have been captured across biological databases in a standardized format called Gene Ontology Annotations (GOA), to disseminate the knowledge and extrapolate the information to other proteins whose function is still unknown. Furthermore, the annotations are used to analyse and interpret data from high-throughput studies and also as a benchmark for the assessment of protein function prediction algorithms. Constant changes occur in GOA that can potentially impact such usages, but only limited effort has been put into exploring their instability, or into assessing the impact that these changes have on the reproducibility or interpretation of previous analyses. In the present work, I performed the most comprehensive analysis of annotation instability for 14 representative model organisms (E. coli, fruit fly, mouse, etc.). The results showed important instability patterns that were species-specific. As such information would be of use to the community for tracing the instability of annotations of interest, a web-based visualization tool was built to track these changes on a protein-, functional term- and species-specific basis. Additionally, we identified artifacts in the annotation data that can be attributed to curation patterns. We propose that such artifacts be considered for a more accurate assessment of function prediction algorithms. Furthermore, the impact that changes in the annotations have on common settings like gene set enrichment analyses was also explored. In particular, 2,000 datasets were used to assess the robustness of enrichment results over time. On average, the results would display a 60% similarity after only 2 years. However, cases were found where the similarity dropped 80% within the same year, demonstrating the impact that the instability has on such applications. In conclusion, the results of this work will prove useful for those who use the annotations to interpret their studies, allowing them to assess reliability on a case-by-case basis.

View record

Meta-analysis of Human Methylomes Reveals Stably Methylated Sequences Surrounding CpG Islands Associated with High Gene Expression (2014)

DNA methylation is thought to play an important role in the regulation of mammalian gene expression. Part of the evidence for this role is the observation that lack of CpG island methylation in gene promoters is associated with high transcriptional activity. However, CpG island methylation level only accounts for a fraction of the variance in gene expression, and methylation in other domains is hypothesized to play a role (e.g., island shores and shelves). We set out to improve understanding of the human methylome through a meta-analysis approach, using 1737 samples from 30 publicly available studies. An initial screen identified 15224 CpGs that are “ultra-stable” in their state, being always fully methylated or unmethylated across diverse tissues, cell types and developmental stages (974 always methylated; 14250 always unmethylated). A further analysis of ultra-stable CpGs led us to identify a novel class of CpG islands, “ravines”, that exhibit a markedly consistent pattern of low methylation with highly methylated flanking shores and shelves. Our findings were validated using independent and heterogeneous datasets assayed on the same and different technologies. Building on additional existing data types such as gene expression microarrays, DNase hypersensitive sites, and histone modifications, we found that ravines are associated with higher gene expression, compared to typical unmethylated CpG islands. This finding suggests a novel role for methylation in promoters, markedly different from the traditional view that active promoters need to be unmethylated. We propose ravines are a new class of CpG islands, established early in development and maintained through differentiation, that mark universally active genes and provide new evidence that methylation beyond the CpG island could play a role in gene expression.
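
The “ultra-stable” screen described in this abstract can be pictured as a simple filter over per-CpG methylation values. The sketch below is only an illustration under stated assumptions: it takes methylation as beta values between 0 and 1, and the cut-offs of 0.2 and 0.8, the function name and the probe labels are all hypothetical, since the abstract does not give the thresholds or procedure actually used.

```python
def classify_ultra_stable(beta_by_cpg, low=0.2, high=0.8):
    """Label CpGs whose methylation state is the same in every sample.

    beta_by_cpg: dict mapping a CpG identifier to a list of beta values
                 (0 = unmethylated, 1 = fully methylated), one per sample.
    low, high:   assumed cut-offs; NOT taken from the thesis.
    """
    labels = {}
    for cpg, betas in beta_by_cpg.items():
        if not betas:
            continue  # no measurements, so nothing can be concluded
        if all(b >= high for b in betas):
            labels[cpg] = "always methylated"
        elif all(b <= low for b in betas):
            labels[cpg] = "always unmethylated"
        # CpGs that vary between samples are left unlabelled.
    return labels


# Hypothetical probe identifiers and values, for illustration only.
toy_betas = {
    "cpg_1": [0.95, 0.91, 0.88],
    "cpg_2": [0.03, 0.10, 0.05],
    "cpg_3": [0.10, 0.85, 0.40],
}
toy_labels = classify_ultra_stable(toy_betas)
```

In this toy input, only the first two probes receive a label; the third varies across samples and is excluded.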

View record

Cell type marker enrichment across brain regions and experimental conditions (2013)

The first chapter of this thesis explored the dominant gene expression pattern in the adult human brain. We discovered that the largest source of variation can be explained by cell type marker expression. Across brain regions, expression of neuron cell type markers is anti-correlated with the expression of oligodendrocyte cell type markers. Next, we explored gene function convergence and divergence in the adult mouse brain. Our contributions are as follows. First, we provide candidate cell type markers for investigating specific cell type populations. Second, we highlight orthologous genes that show functional divergence between human and mouse brains. In the second chapter, we present our preliminary work on the effects of tissue types and experimental conditions on human microarray studies. First, we measured the expression and differential expression levels of tissue-enriched genes. Next, we identified modules with similar expression levels and differential expression p-values. Our results show that expression levels reflect tissue type variation. In contrast, differential expression levels are more complex, owing to the large diversity of experimental conditions in the data. In summary, our work provides a different perspective on the functional roles of genes in human microarray studies.

View record

Characterization of gene expression patterns in wild pacific salmon (2013)

Declines in Pacific salmon stocks in recent decades have spurred much research into their physiology and survivorship, but comparatively little into their genomics. Sockeye salmon in particular are experiencing high levels of mortality during their migration upriver, and the numbers of returning sockeye have fluctuated wildly with respect to predictions in recent years. The goal of my project is to gain insight into the basic genomics of Pacific salmon stocks, including the sockeye, through bioinformatic approaches to gene expression profiling. Using microarray technology, I have conducted a large-scale analysis of over 1,000 samples from multiple tissues, stocks, and species of salmon. I identified tissue-specific and housekeeping genes and compared them to orthologs in mouse and human, respectively. I have also classified a number of microarray samples with a support vector machine (SVM) using qPCR data showing the presence of several common pathogens affecting Pacific salmon populations. Using identified housekeeping genes as normalizing factors, I modeled in silico a qPCR assay designed to identify salmon as infected or uninfected with a particular pathogen. With these data I hope to increase basic knowledge of the genomics of the Pacific salmon.

View record

Meta-analysis of gene expression in individuals with autism spectrum disorders (2013)

Autism spectrum disorders (ASD) are clinically heterogeneous and biologically complex. State-of-the-art genetics research has unveiled a large number of variants linked to ASD. But in general it remains unclear what biological factors lead to changes in the brains of autistic individuals. We build on the premise that these heterogeneous genetic or genomic aberrations will converge towards a common impact downstream, which might be reflected in the transcriptomes of individuals with ASD. Similarly, a considerable number of transcriptome analyses have been performed in attempts to address this question, but their findings lack a clear consensus. As a result, each of these individual studies has not led to any significant advance in understanding the autistic phenotype as a whole. The goal of this research is to comprehensively re-evaluate these expression profiling studies by conducting a systematic meta-analysis. Here, we report a meta-analysis of over 1000 microarrays across twelve independent studies on expression changes in ASD compared to unaffected individuals, in blood and brain. We identified a number of genes that are consistently differentially expressed across studies of the brain, suggestive of effects on mitochondrial function. In blood, consistent changes were more difficult to identify, despite individual studies tending to exhibit larger effects than the brain studies. Our results are the strongest evidence to date of a common transcriptome signature in the brains of individuals with ASD.

View record

Wide-scale comparison of transcriptome data and the role of microRNA in major depression and suicide (2011)

The first chapter of this thesis addresses a common problem in genomics experiments: interpreting a resulting "hit list" of interesting genes. We present work on an approach for summarizing and exploring "hit lists" that makes use of the large amount of gene expression data in public repositories such as the Gene Expression Omnibus. We compare the query list with datasets that we have analyzed for differential expression of genes. Studies that have similarities to the given hit list yield additional insights, help contextualize studies, and serve as a basis for future meta-analysis. A conceptually similar problem that we addressed is the classification or clustering of datasets based on patterns of differential expression. Both problems required a method for determining distances between datasets based on rankings of genes. We tested and benchmarked several methods using manually annotated datasets. The method that performed best according to our evaluation process is based on Kendall's Tau top-k distance. We investigated potential sources of confounds, finding that the largest challenge may be posed by the high prevalence of certain gene expression patterns. These highly prevalent patterns tended to dominate search results. Nonetheless, we demonstrated the effectiveness of this approach in a case study. In the second chapter, we investigated the role of microRNAs in the context of major depression and suicide. We profiled microRNA and messenger RNA levels in post-mortem prefrontal cortex and hippocampus brain tissue of depressed suicides, suicides, and controls. In the prefrontal cortex, we found miR-1202 to be down-regulated in suicides versus controls, and LCT (lactase enzyme) was up-regulated in suicides or depressed suicides compared to controls. The former result was independently confirmed using quantitative PCR. While further study is needed, our results have the potential to provide insight into molecular changes in the brains of depressed and suicidal individuals.
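
For readers unfamiliar with the Kendall's Tau top-k distance named in this abstract, the sketch below shows one standard formulation for comparing two top-k lists that may not contain the same items. It is a minimal illustration, not the thesis implementation: the function name, the penalty parameter `p` (applied to pairs whose relative order cannot be inferred) and the toy gene labels are assumptions made here.

```python
from itertools import combinations


def kendall_tau_top_k(list_a, list_b, p=0.5):
    """Kendall's Tau distance generalized to two top-k lists.

    list_a, list_b: ranked lists (best first) with no repeated items.
    p: penalty for a pair that appears together in one list but is
       entirely absent from the other (assumed default of 0.5).
    """
    rank_a = {item: r for r, item in enumerate(list_a)}
    rank_b = {item: r for r, item in enumerate(list_b)}
    universe = sorted(set(rank_a) | set(rank_b))

    distance = 0.0
    for i, j in combinations(universe, 2):
        in_a = (i in rank_a, j in rank_a)
        in_b = (i in rank_b, j in rank_b)
        if all(in_a) and all(in_b):
            # Both items ranked in both lists: count a disagreement in order.
            if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0:
                distance += 1
        elif all(in_a) and any(in_b):
            # Only one item made list_b, so it implicitly outranks the other there.
            present, absent = (i, j) if in_b[0] else (j, i)
            if rank_a[absent] < rank_a[present]:
                distance += 1
        elif all(in_b) and any(in_a):
            present, absent = (i, j) if in_a[0] else (j, i)
            if rank_b[absent] < rank_b[present]:
                distance += 1
        elif all(in_a) or all(in_b):
            # Both items in one list, neither in the other: order unknown.
            distance += p
        else:
            # Each item appears in a different list only: they must disagree.
            distance += 1
    return distance


# Hypothetical gene labels, for illustration only.
d_same = kendall_tau_top_k(["g1", "g2", "g3"], ["g1", "g2", "g3"])
d_overlap = kendall_tau_top_k(["g1", "g2", "g3"], ["g1", "g3", "g4"])
d_disjoint = kendall_tau_top_k(["g1", "g2"], ["g3", "g4"], p=0.5)
```

Identical lists give a distance of zero, and the distance grows as the lists disagree in order or share fewer items, which is what makes a measure of this kind usable for comparing ranked hit lists of different content.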

View record

Evaluating Coexpression Analysis for Gene Function Prediction (2010)

Microarray expression data sets vary in size, data quality and other features, but most methods for selecting coexpressed gene pairs use a ‘one size fits all’ approach. There have been many different procedures for selecting coexpressed gene pairs of high functional similarity from an expression dataset. However, it is not clear which procedure performs best as there are few studies reporting comparisons of these approaches. The goal of this thesis is to develop a set of “best practices” for selecting coexpression links of high functional similarity from an expression dataset, along with methods for identifying datasets likely to yield poor information. With these goals, we hope to improve the quality of gene function predictions produced by coexpression analysis. Using 80 human expression datasets, we examined the impact of different thresholds, correlation metrics, expression data filtering and transformation procedures on performance in functional prediction. We also investigated the relationship between data quality and other features of expression datasets and their performance in functional prediction. We used the annotations of the Gene Ontology as a primary metric to measure similarity in gene function, and employed additional functional metrics for validation. Our results show that several dataset features have a greater influence on the performance in functional prediction than others. Expression datasets which produce coexpressed gene pairs of poor functional quality can be identified by a similar set of data features. Some procedures used in coexpression analysis have a negligible effect on the quality of functional predictions while others are essential to achieving the best performance in the algorithm. We also find that some procedures interact greatly with features of expression datasets and that these interactions increase the number of high quality coexpressed gene pairs retrieved through coexpression analysis. This thesis uncovers important information on the many intrinsic and extrinsic factors that influence the performance in functional prediction of coexpression analysis. The information summarized here will help guide future studies using coexpression analysis and improve the quality of gene function predictions.

View record


