Wyeth Wasserman: Professor at Department of Medical Genetics, UBC Faculty of Medicine

Professor

Faculty of Medicine

Research Classification

Medical, health and life sciences

Medical and biomedical engineering

Research Interests

Creation of computational methods for the analysis of genome sequences (bioinformatics)

Study of cis-regulatory elements controlling gene transcription

Applied analyses of genome sequences (genomics)

Indigenous genomics


Affiliations to Research Centres, Institutes & Clusters

BC Children's Hospital Research Institute

Centre for Molecular Medicine and Therapeutics

Research Options

I am available and interested in collaborations (e.g. clusters, grants).

I am interested in and conduct interdisciplinary research.

I am interested in working with undergraduate students on research projects.


Research Methodology

bioinformatics

Recruitment

Looking to recruit: Postdoctoral Fellows

Desired start dates: Any time / year round

Potential research project areas:

Gene Regulation

Amongst the most important challenges of this era of life science research is understanding the regulation of gene expression, a process that allows an incredible diversity of cells to be produced from the same genome sequence. During development and across physiological conditions, a set of proteins, called Transcription Factors (TFs), interact with the genome to control the activity of genes. The roughly 1500 TFs in the human genome cooperate in different combinations and interact with other regulatory processes. The lab studies gene regulation along multiple lines. First, the lab creates novel algorithms and software to predict interactions between TFs and DNA. Second, the lab collaborates on the analysis of emerging types of data, to identify active regulatory regions (e.g. enhancer or promoter regions in the genome) in specific biological processes, such as the transition from stem cells into differentiated cells. Third, the lab designs compact DNA sequences, based on regulatory regions in the human genome, to direct gene expression from virus-based gene therapy vectors.

Genome Analysis

Genome Sequencing has accelerated health research, particularly disease genetics. The lab has been developing computational methods and tools to allow researchers and clinicians to identify functional consequences of genetic variations within the human genome, both in the protein coding and in the non-coding space. The latter effort is fueled by the gene regulation bioinformatics research in the lab.

Through engagement with patients and clinicians, both locally at BC Children’s Hospital and through international collaborations, our genomics analyses enable the diagnosis, and in some cases treatment, of previously undiagnosed cases. As DNA sequencing technology has revolutionized the diagnosis and management of rare genetic disorders, the Wasserman lab has embarked on an endeavour to make the technology available to currently underrepresented populations, namely the Indigenous populations of Canada. Learn more about the Silent Genomes Project.

Ideal Applicant Profile:

Join Us!

We are always looking for curious individuals with a talent in computing, genomics and gene regulation. Feel free to contact us to explore matching interests.

Postdoctoral fellows

SILENT GENOMES PROJECT

For the amazing Silent Genomes Project we need a post-doc with an interest in equitable access to genome medicine, who will help create resources in partnership with Canada's Indigenous communities that positively impact clinical genetics and empower choice.

GENE REGULATION

We are developing new approaches based on Deep Learning. Ideally candidates will have experience with machine learning methods, but candidates with experience across the life sciences who have demonstrated a strong commitment to developing programming skills are encouraged to apply.

Graduate students

The lab is not presently seeking graduate students. We do review applications and would consider exceptional candidates at any time. However, we do not currently anticipate taking on new students until 2023. When we do take on students, most pursue their training within the UBC Bioinformatics Graduate Program.

Undergraduate students

We periodically welcome UBC Work-Learn students, coop students from across Canada, and UBC or SFU students conducting undergraduate thesis studies.

Other positions

No other positions are currently posted.

Notice for Potential Applicants

Our team is constantly changing. The students and post-docs in the group have historically done well, with alumni working in both industry and academia. We take pride in teamwork and maintaining a positive research environment. Opportunities are always available for exceptional students and post-docs. Computer programming skills are essential: we work in a Linux environment and develop our own software (primarily in Python).

Other options: I am interested in supervising students to conduct interdisciplinary research.

Complete these steps before you reach out to a faculty member!

Check requirements

Familiarize yourself with program requirements. You want to learn as much as possible from the information available to you before you reach out to a faculty member. Be sure to visit the graduate degree program listing and program-specific websites.
Check whether the program requires you to seek commitment from a supervisor prior to submitting an application. For some programs this is an essential step while others match successful applicants with faculty members within the first year of study. This is either indicated in the program profile under "Admission Information & Requirements" - "Prepare Application" - "Supervision" or on the program website.

Focus your search

Identify specific faculty members who are conducting research in your specific area of interest.
Establish that your research interests align with the faculty member’s research interests.
- Read up on the faculty members in the program and the research being conducted in the department.
- Familiarize yourself with their work, read their recent publications and past theses/dissertations that they supervised. Be certain that their research is indeed what you are hoping to study.

Make a good impression

Compose an error-free and grammatically correct email addressed to your specifically targeted faculty member, and remember to use their correct titles.
- Do not send non-specific, mass emails to everyone in the department hoping for a match.
- Address the faculty members by name. Your contact should be genuine rather than generic.
Include a brief outline of your academic background, why you are interested in working with the faculty member, and what experience you could bring to the department. The supervision enquiry form guides you with targeted questions. Be sure to craft compelling answers to these questions.
Highlight your achievements and why you are a top student. Faculty members receive dozens of requests from prospective students and you may have less than 30 seconds to pique someone’s interest.
Demonstrate that you are familiar with their research:
- Convey the specific ways you are a good fit for the program.
- Convey the specific ways the program/lab/faculty member is a good fit for the research you are interested in/already conducting.
Be enthusiastic, but don’t overdo it.

Attend an information session

G+PS regularly provides virtual sessions that focus on admission requirements and procedures, with tips on how to improve your application.


Supervision Enquiry

If you have reviewed some of this faculty member's publications, understand their research interests and have reviewed the admission requirements, you may submit a supervision enquiry.

Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Advancing our understanding of genome regulation via optimization of stem cell differentiation and interpretable deep learning (2022)

The regulation of gene expression is a core challenge in understanding how diverse types of cells can be produced from the same DNA instructions. Insights about this complex machinery advance not only science but applications in therapy and pharmacology. One example is the differentiation of stem cells for regenerative medicine to treat patients with diabetes. In my second chapter, I address the problem of optimizing the differentiation protocol towards definitive endoderm, the precursor of insulin-producing pancreatic beta cells, by replacing the expensive growth factor with cheap molecule alternatives. I introduce a multiple-step pipeline based on small molecule transcriptome response profiles. The discovered chemicals emphasize the importance of key transcription factors in the process, such as HIF and MYC. The study of transcription factors is of high importance, and will further promote our knowledge about differentiation. Motivated by this thought, I explore the current trends of studying transcription factors in the gene regulation context. With large-scale data generation efforts by public consortia such as ENCODE, deep learning methods have become pervasive. A large training dataset is fundamental to the success of these methods; however, the amount of TF-related data is often small. To tackle this issue, in my third chapter, I perform an in-depth assessment of transfer learning for TF binding prediction and provide biologically motivated guidelines for efficient training of deep models when the data is limited. An additional challenge for deep models beyond data sufficiency is interpretability. In the fourth chapter, I systematically categorize and summarize interpretation approaches, exploring their underlying assumptions, strengths, and weaknesses. Inspired by transparent deep learning architectures, I present ExplaiNN, a new transparent model for genomics tasks. I explore its efficiency and usability on a variety of problems in the fifth chapter of this thesis. Finally, in the last chapter, I apply ExplaiNN to ATAC-seq datasets of mouse and human immune systems to study differences in cis-regulatory logic. Transparency of the new method allowed me to discover a reproducible set of sequence motifs that either individually or combinatorially are responsible for the bulk of the predictions, and tend to have species-specific occurrence patterns.


Approaches to genome analysis through the application of graph theory (2021)

The human reference genome provides a framework against which the analysis and interpretation of an individual’s genome can be performed. Over the past twenty years the cost of genome sequencing has dropped from a prohibitive amount of hundreds of millions of dollars, to just a few thousand dollars. This has brought genome sequencing in line with the cost of other diagnostic medical tests, leading to a rapid uptake in both clinical and research settings. As a consequence of this global spread, deficiencies and population-specific inequities have emerged from the use of a framework that relies upon a single linear reference sequence. Partial, ad-hoc solutions, such as the introduction of alternative sequences for sections of the genome, have provided a stopgap but fail to fully represent the wealth of information now known about the level of variation that exists within and between populations. This thesis presents an alternative perspective on how we can take advantage of new computational methods to enhance the reference genome in the era of widespread sequencing and big data. An argument is given to motivate the re-evaluation of the role of the reference genome, calling for a non-indexed, mutable reference framework in which the crucial indexing methods are shifted from the linear reference to a raw read set. A patented, edge-labelled, cyclic, graph-based model, the GNOmics Graph Model, is introduced as a flexible framework against which read alignment and variant calling can be performed. The value of indexing raw reads is explored through a published tool, FlexTyper, which allows a read set to be screened for informative markers. While there is still an ongoing global discussion as to how best to improve the reference genome, this thesis provides a thought-provoking reconceptualisation of applied human genome analysis.


Evolutionary dynamics of ovarian cancer microenvironments and tumour cells (2021)

High-grade serous ovarian cancer (HGSC) is the most common and lethal histotype of epithelial ovarian cancer. Often presenting as multi-site disease, HGSC exhibits extensive malignant clonal diversity with widespread but non-random patterns of disease dissemination. The proclivity of HGSC toward clonally heterogeneous disease is thought to underlie the prevalence of treatment-resistant disease. Yet, the factors that influence the spatial distribution of cancer clones in HGSC remain largely uncharacterized. Hypothesizing that distinct peritoneal niches formed by microenvironmental cell types shape the observed patterns of clonal dynamics in HGSC, the primary aim of this thesis was to understand how microenvironmental factors influence malignant cell evolutionary dynamics. To establish the experimental substrate for this thesis, I led the construction of a cohort of 148 tumour samples from 41 HGSC cases (Chapter 2). In addition to coordinating clinical case identification, I oversaw and learned how to create patient-derived xenograft models and conduct single cell experiments from patient tumours. Leveraging this resource, I explored whether local immune microenvironment factors shape tumor progression properties at the interface of tumor-infiltrating lymphocytes and cancer cells (Chapter 3). Through multi-region study with whole-genome sequencing, immunohistochemistry, image analysis, gene expression profiling, and T- and B-cell receptor sequencing, I identified three immunologic subtypes across samples associated with patterns of malignant clonal diversity. These findings were consistent with immunological pruning of tumor clones. Finally, in order to explore the non-lymphocytic components of the tumour microenvironment, I developed an automated approach to cell type identification from single cell RNA-seq data that eliminates the manual work involved in traditional workflows reliant on post-hoc expert annotation (Chapter 4). I demonstrated that this method outperforms state-of-the-art workflows for cell type identification and applied the method to profile the HGSC microenvironment. Collectively, this work highlights multiple interfaces of evolutionary interplay between malignant and non-malignant cells in the HGSC microenvironment, identifying novel mechanisms by which tumour cells escape from immune recognition. These results will inform the interpretation of results from immunotherapy clinical trials and set the stage for comprehensive microenvironment profiling in large HGSC cohorts and other cancers.


Towards the identification of causal genes and contributing molecular processes underlying strabismus (2021)

Eye misalignment, or strabismus, has a frequency of up to 4% in a population, and is known to have both environmental and genetic causes. Genes associated with syndromic forms of strabismus (i.e. strabismus concurrent with multiple phenotypes) have emerged, but genes contributing to isolated strabismus remain to be discovered. Only one isolated strabismus locus, STBMS1 on chromosome 7, has been confirmed in more than one family, but the inheritance model of the locus is inconsistent between studied families and no specific causal variant has been reported. The large set of syndromes with strabismus suggests that within the visual system multiple perturbations of an underlying genetic network(s) can have the common output of disrupted eye alignment. Thus, I used a bioinformatics-driven approach to analyze curated genes associated with strabismus to provide insight into the biological mechanisms underlying strabismus, highlighting a link to the Ras-MAPK pathway. During the process, I noticed strabismus presenting within a large number of intellectual disability disorders. Therefore, I studied the co-occurrence of strabismus and other common phenotypes in a series of patients with intellectual disability, which confirmed a significant correlation between eye alignment and intellectual disability. Finally, I resumed efforts from my prior studies to identify the genetic cause in a seven-generation family with isolated strabismus inherited in an autosomal dominant manner. The likely causal gene disruption, altering a likely cis-regulatory region of the FOXG1 gene, was identified through the incorporation of linkage analysis, next generation sequencing, and in-depth bioinformatic analyses. This thesis identifies potential roles for genes participating in the Ras-MAPK pathway, emphasizes the role of the central nervous system, and reveals FOXG1 as a causal gene candidate for isolated strabismus.


Expanding the utility of whole genome sequencing in the diagnosis of rare genetic disorders (2020)

The emergence of whole genome sequencing (WGS) has revolutionized the diagnosis of rare genetic disorders, advancing the capacity to identify the “causal” gene responsible for disease phenotypes. In a single assay, many classes of genomic variants can be detected from small single nucleotide changes to large insertions, deletions and duplications. While WGS has enabled a significant increase in the diagnostic rate compared to previous assays, at least 50% of cases remain unsolved. The lack of a diagnosis is the result of both limitations in variant calling, and in variant interpretation. As the field of genomic medicine continues to advance, the emergence of novel bioinformatic approaches to variant calling and interpretation heralds promise for the future of undiagnosed cases. In the applied setting, innovation is driven by anecdotes of complex diagnoses, which in turn lead to the development of novel tools and approaches. This is a key theme within this thesis work, where in-depth analysis of a single undiagnosed case leads to an appreciation for a challenging class of variants–short tandem repeats–which in turn leads to the development of novel software for detecting these variants in WGS data. Following the anecdote and novel tool development came an appreciation for the role of simulation, both in enabling the development and in the uptake of bioinformatic innovation for diagnostic analysis pipelines. This appreciation led to the development of a rare disease scenario simulator, which can simulate complex variants in multiple inheritance patterns to emulate challenging cases. Lastly, appreciating the limitations of the linear reference genome, I develop a framework for detecting the presence of user-specified sequences within unmapped read sets. This flexible framework can reproduce microarray-like coverage profiles, and genotype SNPs to identify ancestry and sex which can inform the choice of personalized reference genomes in emergent analysis pipelines. Together, the novel short tandem repeat discovery, bioinformatic innovation, and increased capacity to simulate rare disease cases, expand the utility of whole genome sequencing in the diagnosis of rare genetic diseases.


Development of human-computer interactive approaches for rare disease genomics (2019)

Clinical genome sequencing is becoming a tool for standard clinical practice. Many studies have presented sequencing as effective for both diagnosing and informing the management of genetic diseases. However, the task of finding the causal variant(s) of a rare genetic disease within an individual is often difficult due to the large number of identified variants and lack of direct evidence of causality. Current computational solutions harness existing genetic knowledge in order to infer the pathogenicity of the variant(s), as well as filter those unlikely to be pathogenic. Such methods can bring focus to a compact set (less than hundreds) of variants. However, they are not sufficient to interpret causality of variants for patient phenotypes; interpretation involves expert examination and synthesis of complex evidence, clinical knowledge, and experience. To accelerate interpretation and avoid diagnostic delay, computational methods are emerging for automated prioritization that capture, translate, and exploit clinical knowledge. While automation provides efficiency, it does not replace the expert-driven interpretation process. Moreover, knowledge and experience of human experts can be challenging to fully encode computationally. This thesis, therefore, explores an alternative space between expert-driven and computer-driven solutions, where human expertise is deeply embedded within computer-assisted analytic and diagnostic processes via facilitated human-computer interactions. First, clinical experts and their work environment were observed via collaborations in an interdisciplinary exome analysis project as well as in a clinical resource development project. From these observations, we identified two elements of human-computer interaction: characteristic cognitive processes underlying the diagnostic process and information visualization. Exploiting these findings, we designed and evaluated an interactive variant interpretation strategy that augments cognitive processes of clinical experts. We found that this strategy could expedite variant interpretation. We then qualitatively assessed current information visualization practices during clinical exome and genome analyses. Based on the findings of this assessment, we formulated design requirements that can enhance visual interpretation of complex genetic evidence. In summary, this research highlights the synergistic utility of human-computer interaction in clinical exome and genome analyses for rare genetic diagnoses. Furthermore, it exemplifies the importance of empowering the skills of human experts in digital medicine.


Revealing the impact of sequence variants on transcription factor binding and gene expression (2017)

Transcription factors (TFs) can bind to specific regulatory regions to control the expression of target genes. Disruption of TF binding is regarded as one of the key mechanisms by which regulatory variants could act to cause disease. However, predicting the functional impact of variants on TF binding remains a major challenge for the field, standing as a key obstacle to achieving the potential of clinical genome analysis. This thesis confronts this challenge from a bioinformatics perspective and addresses two unresolved problems. The first problem is the determination of which genetic variants alter TF binding. Only a small number of allele-specific binding (ASB) events, in which TFs preferentially bind to one of two alleles at heterozygous sites in the genome, have been determined. To study the impact of variants on TF binding, access to a large, gold standard collection of ASB events could facilitate the development of new predictive methods. In Chapter 2, we implemented a pipeline to identify ASB events from ChIP-seq data and applied it to produce one of the largest ASB datasets. We found that ASB events were associated with allelic alterations of TF motifs, chromatin accessibility and histone modifications. Using the available features, classifiers were trained to predict the impact of variants on TF binding. To improve ASB calling, Chapter 3 evaluated five statistical methods, ultimately supporting a method that pooled ChIP-seq replicates and utilized a binomial distribution to model allelic read counts. The second problem is to determine how altered TF binding events impact the expression of target genes. In Chapter 4, we implemented regression-based models to predict gene expression changes based on altered TF binding events across 358 individuals. The models showed predictive capacity for 19.2% of genes, and the key TF binding events in the model provided mechanistic insights as to how these regulatory variants alter gene expression. In summary, this thesis both generated the largest, high-quality collection of ASB events, and developed algorithms to predict variant impact on TF binding and gene expression. The presented work advances the capacity of the field to interpret regulatory variants and will facilitate future clinical genome analysis.
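
The abstract above mentions pooling ChIP-seq replicates and modelling allelic read counts with a binomial distribution. A minimal sketch of that general idea follows; it is not the thesis's actual pipeline. The function names, the null hypothesis of equal binding to both alleles (p = 0.5), the exact two-sided test, and the example read counts are all assumptions made for illustration.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than
    observing k successes in n trials. Intended for modest read depths;
    very large n would overflow the float conversion of comb(n, i).
    """
    if n == 0:
        return 1.0
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Small tolerance so outcomes tied with pmf[k] are included.
    threshold = pmf[k] * (1 + 1e-9)
    return min(1.0, sum(x for x in pmf if x <= threshold))


def pooled_asb_p(replicates):
    """Pool (ref_reads, alt_reads) pairs across replicates at one
    heterozygous site, then test the pooled counts against p = 0.5
    (assumed null: no allelic preference)."""
    ref = sum(r for r, _ in replicates)
    alt = sum(a for _, a in replicates)
    return ref, alt, binomial_two_sided_p(ref, ref + alt)


# Hypothetical counts for one site in two replicates.
ref, alt, p_value = pooled_asb_p([(18, 4), (15, 5)])
print(f"ref={ref} alt={alt} p={p_value:.2e}")
```

In practice, p-values computed across many sites would also need multiple-testing correction, and reference-mapping bias can shift the expected allelic ratio away from 0.5; neither is handled in this sketch.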


Computational analysis of transcriptional regulation from local sequence features to three dimensional chromatin domains (2016)

Regulation of gene expression spans different levels of complexity: from genomic sequence, transcription factor binding and epigenetics, to three-dimensional chromatin interactions. Data from different individuals, such as genetic variations, present an extra dimension to consider. Abnormal activities at any level may lead to disease phenotypes, motivating deeper exploration of gene regulation. New high-throughput sequencing techniques have empowered genome-wide studies of the regulatory mechanisms within cells. This thesis uses computational approaches to examine gene regulation with high-throughput data in order to address biological hypotheses traversing from short local sequence features to megabase-sized topologically associating domains (TADs). The hypotheses addressed in the thesis have two central themes: 1) the elucidation of local and domain regulation of gene expression, and 2) the application of such knowledge to identify functional phenotypic variants. We developed a computational approach to identify functional variants associated with cancer, and demonstrated how annotating regulatory sequences and linking these regions to target genes can strengthen genome interpretation. The concurrent and intertwined nature of local and domain regulation of gene expression develops as the thesis unfolds. In a study of genes that escape from X-chromosome inactivation, we found the YY1 transcription factor to be a key regulator that is potentially associated with long-distance chromatin looping mechanisms. Similarly, when studying the spread of inactivation to the autosomes in translocated cells, we detected local features associated with inactivation status, and at the domain level, we observed the spreading to be in accordance with TADs. Lastly, when considering TADs as transcriptional units, the identification of cell type-selectively co-expressed and co-localized TADs highlighted an organized and dynamic chromatin architecture across multiple cell types. In summary, this thesis provides insights into the mechanisms involved in gene expression across multiple scales (from local sequences to chromatin domains) using computational analyses on publicly available datasets. The presented methods and results have potential applications to interpret genetic variations and further our understanding of diseases and phenotypes. The findings may contribute to an era of preventative and regenerative medicine to come.


Development and Evaluation of Software for Applied Clinical Genomics (2016)

High-throughput next-generation DNA sequencing has evolved rapidly over the past 20 years. The Human Genome Project published its first draft of the human genome in 2000 at an enormous cost of 3 billion dollars, and was an international collaborative effort that spanned more than a decade. Subsequent technological innovations have decreased that cost by six orders of magnitude down to a thousand dollars, while throughput has increased by over 100 times to a current delivery of gigabases of data per run. In bioinformatics, significant efforts to capitalize on the new capacities have produced software for the identification of deviations from the reference sequence, including single nucleotide variants, short insertions/deletions, and more complex chromosomal characteristics such as copy number variations and translocations. Clinically, hospitals are starting to incorporate sequencing technology as part of exploratory projects to discover underlying causes of diseases with suspected genetic etiology, and to provide personalized clinical decision support based on patients’ genetic predispositions. As with any new large-scale data, a need has emerged for mechanisms to translate knowledge from computationally oriented informatics specialists to the clinically oriented users who interact with it. In the genomics field, the complexity of the data, combined with the gap in perspectives and skills between computational biologists and clinicians, presents an unsolved grand challenge for bioinformaticians to translate patient genomic information to facilitate clinical decision-making. This doctoral thesis focuses on a comparative design analysis of clinical decision support systems and prototypes interacting with patient genomes under various sectors of healthcare to ultimately improve the treatment and well-being of patients. Through a combination of usability methodologies across multiple distinct clinical user groups, the thesis highlights recurring domain-specific challenges and introduces ways to overcome the roadblocks for translation of next-generation sequencing from research laboratory to a multidisciplinary hospital environment. To improve the interpretation efficiency of patient genomes and informed by the design analysis findings, a novel computational approach to prioritize exome variants based on automated appraisal of patient phenotypes is introduced. Finally, the thesis research incorporates applied genome analysis via clinical collaborations to inform interface design and enable mastery of genome analysis.


Improving the Detection of Transcription Factor Binding Regions (2015)

The identification of non-coding regulatory elements in the genome has been the focus of much experimental and computational effort. However, both experimental data, such as ChIP-seq, and computational methods of transcription factor (TF) binding predictions suffer from a degree of non-specificity. ChIP-seq experiments report regions that don’t contain the expected canonical motif for the ChIPped TF, which may arise from indirect binding or a non-TF-specific mechanism. Computational predictions based on sequence-level information alone are plagued by false positives. This thesis explores computational approaches to improve both the interpretation of large-scale TF binding data, and the detection of TF binding regions. In Chapters 2 and 3 we observe that experimentally defined regulatory regions of the human genome are a mixture of sub-groups reflecting distinct properties. On average a third of a ChIP-seq dataset does not contain the targeted TF’s motif, and within this subset up to 45% of the ChIP-seq peaks are unexpectedly enriched for a small class of non-targeted TFs’ motifs. Many of these regions are not specific to a TF but are ChIPped by multiple diverse TFs across multiple cell types. These recurring regions tend to be the lower scoring peaks of a dataset, are less likely to reproduce between experimental replicates, and tend to associate with cohesin and polycomb protein occupied positions in the genome. The regulatory regions with a greater specificity for a TF do not share these properties. Based on these observations we suggest a TF ‘loading-zone’ model to account for the presence of the aforementioned recurrent regions in ChIP-seq data. In Chapter 4 we further explore the regulatory region subgroups with a biophysical simulator of TF occupancy (tfOS). Within tfOS we have incorporated TF-DNA interaction energies, TF search mechanics, cooperative TF interactions, and sequence accessibility data into the model. Simulations with tfOS across sequences reveal distinct features associated with recurrent and non-recurrent regions described in Chapter 3. The research presented has improved our understanding and interpretation of large-scale TF binding data and advanced our understanding of TF regulatory regions, leading to improved annotation and interpretation of the human genome.

Inferring Novel Relationships through Over-Representation Analysis of Medical Subjects in Biomedical Bibliographies (2012)

MEDLINE®/PubMed® is a richly annotated resource of over 21 million article citations, growing at a current rate of over 600,000 citations annually. One grand challenge of bioinformatics is analysing the extensive literature for a biomedical entity such as a gene or disease. This thesis explores using over-representation to extract pertinent biomedical annotation from the research articles for an entity. The quantitative profiles generated are compared to predict novel associations between entities.

Medical Subject Heading Over-representation Profiles (MeSHOPs) are constructed from the primary literature of an entity of interest. Medical subject annotations for each article are extracted. Statistical tests evaluate the significance of each term’s frequency across the set of articles, compared against an appropriate background set. The resulting MeSHOP is composed of each term and its corresponding enrichment p-value. MeSHOPs can be computed for any entity with an associated bibliography of PubMed articles. We evaluate the predictive performance of quantitatively comparing MeSHOPs to discover novel associations between gene and disease entities, achieving up to 16% improvement in accuracy compared to gene or disease baseline features (measured as increased Receiver Operating Characteristic Area Under the Curve). The level of literature annotation strongly biased the predictive performance for future gene-disease associations. We observe similar results in a parallel analysis of associations between drugs and disease.

Efficiently identifying authors with similar research interests is a challenge in science. During the peer review process, authors seek scientists with similar expertise. MeSHOPs are generated for individual authors, identifying their research foci. By extending the methods to allow comparison across large sets of entities, overlapping research interests between researchers were identified. The predictive performance was evaluated for capacity to identify authors working in the same research domains. Biomedical annotation analysis of primary literature provides insight into the areas of research focus, and is demonstrated to link entities through similarities in their MeSHOPs. We quantitatively confirm the trend where well-studied genes, diseases and drugs are more likely to be the focus of further research. MeSHOP analysis demonstrates that knowledge in the annotated primary literature can be efficiently mined, and the untapped knowledge therein can be discovered computationally.
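
The over-representation step described in this abstract can be illustrated with a minimal Python sketch. It assumes a one-sided hypergeometric test; the abstract does not state which statistical test the thesis used, and the function names and toy counts below are hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n articles are drawn from N background articles,
    K of which carry the term. Exact integer arithmetic, so only
    practical for small toy counts."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def build_meshop(entity_counts, n_entity, background_counts, n_background):
    """Return {term: p-value} for every term seen in the entity's articles.
    (Hypothetical helper; an assumed stand-in for the thesis procedure.)"""
    return {
        term: hypergeom_upper_tail(k, n_entity, background_counts[term], n_background)
        for term, k in entity_counts.items()
    }


# Toy data (invented): 5 articles for one entity, 1000 background articles.
entity_counts = {"Strabismus": 4, "Humans": 5}
background_counts = {"Strabismus": 10, "Humans": 900}
profile = build_meshop(entity_counts, 5, background_counts, 1000)

# Print terms from most to least over-represented.
for term, p in sorted(profile.items(), key=lambda item: item[1]):
    print(f"{term}\t{p:.3g}")
```

In this toy case the rare term receives a far smaller p-value than the ubiquitous one, which is the behaviour an over-representation profile relies on. At the scale of PubMed, a log-space or library implementation of the test would be needed instead of exact binomial coefficients.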

Evolutionary conserved regulatory programs (2011)

No abstract available.

Computational Prediction of Regulatory Element Combinations and Transcription Factor Cooperativity (2010)

No abstract available.

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Identification of pharmacogenetic variants influencing the likelihood of developing cancer treatment-induced mucositis using pathway analyses (2024)

The full abstract for this thesis is available in the body of the thesis, and will be available when the embargo expires.

Identification of regulatory mechanisms in governing gene expression from the inactivated X chromosome (2022)

X chromosome inactivation (XCI) is the process in which one copy of the X chromosome in females (XX) is randomly silenced to achieve dosage compensation of gene expression relative to the single X chromosome of males (XY). XCI is known to be incomplete, resulting in a subset of genes on the inactive X (Xi) being expressed. Genes along the X chromosome are classified into three categories based upon their expression from the Xi: subject, escape and variable. Escape genes, corresponding to 15% of X-linked genes, are generally expressed from the Xi at a substantially lower level compared to the active X (Xa). The mechanisms controlling escape from the heterochromatic state of the Xi have been a long-standing question. While there is evidence that supports a role for intrinsic DNA elements in escape from XCI, they have not yet been identified.

With increasing amounts of data for transcription factor (TF) binding sites, the objective of this thesis was to identify the regulatory TFs facilitating the ability of genes to escape XCI via a bioinformatics approach using empirical data. ChIP-seq peaks of 155 TFs from the ReMap database were assessed for enrichment at regulatory regions of 55 escape genes. 19 TFs were identified via enrichment analysis in the transcription start site regions of escape genes. Co-binding of pairs of enriched TFs was characterized by gene set similarity between the target genes. Of the 155 TFs examined, ZFP36, an RNA binding protein that alters RNA stability, showed importance in both enrichment analysis and co-binding analysis.

An initial exploration of methods to compare the structures of Xi and Xa was undertaken using ChIA-PET data, providing insights into the limitations of current data resources and opportunities for future studies to inform the topographical organization of escape genes within Xi.

The results of this thesis refined our knowledge of cis-regulatory elements and trans-acting factors potentially involved in escape from XCI. This list of enriched TFs may be useful in future analyses, experimental or computational, to further determine their sufficiency and necessity in escape from XCI.
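
One simple way to score "gene set similarity between the target genes" of two TFs is the Jaccard index, sketched below in Python. This is an assumption for illustration: the abstract does not name the similarity measure used in the thesis, and the function name and placeholder gene identifiers are hypothetical.

```python
def jaccard(targets_a, targets_b):
    """Jaccard index of two target-gene collections:
    shared targets divided by all distinct targets."""
    a, b = set(targets_a), set(targets_b)
    union = a | b
    # Two empty sets share nothing, so report 0.0 rather than divide by zero.
    return len(a & b) / len(union) if union else 0.0


# Placeholder target sets for two hypothetical TFs.
tf1_targets = {"gene1", "gene2", "gene3"}
tf2_targets = {"gene2", "gene3", "gene4"}
print(jaccard(tf1_targets, tf2_targets))
```

A score near 1 would indicate two TFs bound near largely the same escape genes, while a score near 0 would indicate little overlap.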

Cell-conditional generative adversarial network (2021)

With single cell sequencing advances, research has increasingly focused on understanding cell-specific gene regulation mechanisms. However, single cell sequencing data are often noisy, and the amount of sequence obtained from rare cell types is small. Simulation can be a powerful approach to aid understanding when data are limited, both because the process used to generate such data can provide mechanistic insights into cell-specific regulation and because the data produced can augment the development of analysis methods. We constructed and optimized a stand-alone cell-conditional GAN (ccGAN) to simulate cell-specific ATAC-seq data. We trained our model on published single cell ATAC-seq (scATAC-seq) data that had been produced with different protocols on embryonic mouse forebrain and adult mouse brain. The ccGAN-generated sequence was correlated in both Transcription Factor (TF) binding motif composition and positional distribution with the experimental scATAC-seq data. The ccGAN simulator was able to learn important cell-specific signals amidst noise. The ccGAN architecture holds broad potential for single cell regulatory data simulation beyond ATAC-seq, such as for ChIP-seq or epigenetic properties.

Linking cis-regulatory regions using transcription factor binding signatures (2020)

Linking cooperatively functioning cis-regulatory elements (CREs), specifically enhancers and promoters, is a challenging task. Current strategies include correlation of expression of RNA transcribed from the CREs, experimentally measured chromatin interactions (Promoter Capture Hi-C), or machine-learning-based computational predictions. However, all three approaches require the availability of experimental data, which is sparse for most cells and tissues. We propose a new similarity metric to link enhancers to their target promoters based on transcription factor (TF)-binding “signatures”. TF-binding signatures are binary string representations (e.g. 0011001...), where each position indicates binding (“1”) or not (“0”) of a TF to a CRE. We apply a cosine similarity metric to enhancer-promoter pairs linked in published studies involving CRISPRi-FlowFISH, co-expression (FANTOM), or experimental tiling-deletion (CREST-seq). We find a significant difference between TF signature similarities of linked promoter-enhancer pairs compared to unlinked pairs. Furthermore, we observe that TF-binding similarity scores are CRR specific. Based on the results, new directions are proposed that may allow further improvement towards a reliable mapping of interacting CREs across the genome.
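
The signature comparison described in this abstract can be sketched in a few lines of Python. This is a minimal illustration of cosine similarity applied to binary strings, not the thesis implementation; the function name and the example signatures are hypothetical.

```python
from math import sqrt


def signature_cosine(sig_a, sig_b):
    """Cosine similarity between two binary TF-binding signatures,
    given as equal-length strings of '0' and '1' (one position per TF)."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must cover the same set of TFs")
    a = [int(c) for c in sig_a]
    b = [int(c) for c in sig_b]
    dot = sum(x * y for x, y in zip(a, b))           # TFs bound at both CREs
    norm = sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    # A CRE with no bound TF has a zero vector; report 0.0 in that case.
    return dot / norm if norm else 0.0


# Hypothetical signatures over seven TFs.
enhancer = "0011001"
promoter = "0011000"
print(round(signature_cosine(enhancer, promoter), 4))
```

For binary vectors the score reduces to the number of shared bound TFs divided by the geometric mean of the number of TFs bound at each element, so it ranges from 0 (no shared TFs) to 1 (identical signatures).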

Bioinformatics design of cis-regulatory elements controlling human gene expression (2017)

Gene therapy has the potential to not only treat, but cure individuals suffering from inherited diseases. Advances in understanding the human genome and the discovery of causal genes underlying diseases have heightened the need to solve the gene therapy challenge. Viral vectors are often used as a delivery tool for therapeutics, but their safety and efficacy are still being studied. To contribute to this goal, we have created 49 small viral promoters by bioinformatically annotating cis-regulatory regions, a subset of which are concatenated with the goal of driving cell-specific expression of a reporter gene. We have tested a subset of these in mice in vivo. Regulatory region analysis can take a trained designer multiple weeks. To resolve this issue, we have created a semi-automated approach to regulatory region identification, named OnTarget. The OnTarget database accumulates thousands of cell and tissue-specific experiments in order to identify regions informative of regulatory properties. OnTarget is able to identify regulatory regions consistent with those identified by designers. In this capacity, we expect OnTarget to lead to better and faster identification of cis-regulatory regions for the design of promoters targeting specific sets of cells.

Text based methods for variant prioritization (2017)

Despite improvements in sequencing technologies, DNA sequence variant interpretation for rare genetic diseases remains challenging. In a typical workflow for the Treatable Intellectual Disability Endeavor in B.C. (TIDE BC), a geneticist examines variant calls to establish a set of candidate variants that explain a patient's phenotype. Even with a sophisticated computational pipeline for variant prioritization, the geneticist may need to consider hundreds of variants. This typically involves literature searches on individual variants to determine how well they explain the reported phenotype, which is a time-consuming process. In this work, text-analysis-based variant prioritization methods are developed and assessed for the capacity to distinguish causal variants within exome analysis results for a reference set of individuals with metabolic disorders.

Mapping a New Locus for Non-Syndromic Strabismus with High-Throughput Genome Analysis (2014)

Eye misalignment, called strabismus, occurs in up to 5% of individuals. While misalignment is frequently observed in rare complex syndromes, the majority of strabismus cases are non-syndromic. Over the past decade, genes and pathways associated with syndromic forms of strabismus have emerged, but the genes contributing to non-syndromic strabismus remain elusive. Non-syndromic strabismus is highly heterogeneous, and different loci have been inferred from previous genetics studies. Only a single strabismus locus, STBMS1, on chromosome 7 has been confirmed in more than one family, but the reported inheritance patterns for this locus conflict and no specific variant has been proposed. Here, I analyzed a large non-consanguineous family with multiple individuals affected by strabismus across seven generations. The hypothesis is that a single variant is responsible for the non-syndromic strabismus in this particular family displaying dominant patterns of inheritance. Whole exome sequencing (WES) was performed to uncover large blocks of variation within protein-coding regions of the genome shared by two affected distant relatives. In parallel, chromosome regions segregating with the strabismus phenotype in the family were identified using linkage analysis on 12 individuals. Linkage analysis identified one specific risk locus of high confidence. Based on the lack of protein-coding alterations in the locus, whole genome sequencing (WGS) was performed to find additional shared candidate causal variants. Combining the available information, a 10 Mb region on chromosome 14 was identified with high confidence as associated with strabismus, within which a set of potential regulatory sequence alterations has been highlighted for further study. This study represents the first identified locus for autosomal dominant, non-syndromic strabismus. The project utilizes next-generation sequencing (NGS), linkage analysis, and bioinformatic analyses to prioritize and select both coding and non-coding variants, demonstrating the effectiveness of combining NGS and classical genetic approaches. The research findings improve our understanding of strabismus genetics and define multiple paths for future research, family-specific genetic testing for early diagnosis, and consequent preventive therapy.
