Keegan Korthauer

Faculty of Science

Research Classification

Bioinformatics

Genomics

Statistics

Research Interests

Epigenomics

Single-cell analysis

Statistical genomics

Affiliations to Research Centres, Institutes & Clusters

BC Children's Hospital Research Institute

Recruitment

Looking to recruit:

Master's students, Doctoral students, Postdoctoral Fellows

Desired start dates:

Any time / year round

If you have reviewed some of this faculty member's publications, understand their research interests and have reviewed the admission requirements, you may submit a contact request to this supervisor.

Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Probabilistic modeling of high-throughput sequencing data for enhanced understanding of DNA methylation heterogeneity (2025)

DNA methylation is a key epigenetic mechanism governing gene regulation and cellular identity. Advances in high-throughput sequencing technologies have enabled detailed investigation of methylation landscapes across single cells and complex tissue mixtures. However, the sparsity and noise inherent in single-cell data, as well as the signal distortion in enrichment-based platforms, pose major analytical challenges. This thesis presents two novel statistical frameworks to address these limitations and advance the computational toolkit for DNA methylation analysis.

The first contribution is vmrseq, a probabilistic method and software for detecting variably methylated regions from single-cell bisulfite sequencing data. vmrseq integrates a smoothing-based strategy for candidate region identification with hidden Markov modeling to account for spatial correlation and technical noise. Through extensive benchmarking on synthetic and experimental datasets, vmrseq demonstrates improved precision and biological relevance in identifying methylation heterogeneity, supporting downstream analyses such as unsupervised clustering and cell-type-specific marker discovery.

The second contribution is decemedip, a hierarchical Bayesian model and software for cell type deconvolution of enrichment-based methylation data such as MeDIP-seq. By leveraging reference panels derived from alternative platforms and modeling the complex relationship between methylation levels, CpG density, and read counts, decemedip enables accurate estimation of cell type proportions with uncertainty quantification. Its performance is validated through simulations, cross-platform comparisons, and real-world applications involving patient-derived xenografts and circulating cell-free DNA from cancer cohorts.

Together, these methods address critical gaps in the analysis of high-throughput DNA methylation data, enabling robust detection of epigenetic heterogeneity across biological contexts. The associated open-source software implementations provide practical tools for future epigenomic research and potential clinical applications.

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Predicting sustained remission and maximal disease severity in pediatric Crohn's disease using machine learning (2026)

Crohn's disease (CD) is a chronic inflammatory condition affecting the gastrointestinal tract, with a growing prevalence in children. In affected children, it displays more heterogeneous disease trajectories and treatment responses than adult-onset cases, posing significant management challenges. While early aggressive treatment may benefit patients with severe trajectories, no objective method exists to identify high-risk children at diagnosis. This prognostic gap forces reliance on subjective clinical judgment, potentially delaying critical interventions. This study aimed to use machine learning models to predict two first-year outcomes in the Canadian Children IBD Network inception cohort: 1) sustained remission vs non-sustained remission, defined as maintaining a post-remission Weighted Pediatric Crohn's Disease Activity Index (wPCDAI) 12.5 without inflammatory episodes, and 2) maximal disease severity (remission/mild [post-diagnosis wPCDAI <40, indicating minimal inflammatory activity] vs moderate/severe [wPCDAI ≥40, indicating substantial inflammation and need for treatment escalation]). Nine algorithms were trained on baseline clinical, microbiome, and integrated clinical-microbiome datasets using repeated nested 3-fold cross-validation, with minimum redundancy maximal relevance feature selection, Bayesian hyperparameter optimization, and SHAP for model explainability. For sustained remission prediction, integrated models outperformed microbiome- or clinical-only models, with integrated logistic regression achieving the highest mean AUC (0.763); key features included initial treatment at diagnosis, disease location, and wPCDAI at diagnosis, as well as taxa known to play a role in CD such as Haemophilus and Lachnospiraceae. 
For maximal disease severity prediction, microbiome models performed best, with Gaussian naïve Bayes reaching a mean AUC of 0.801 and highlighting microbes such as Clostridium and Veillonella as predictors of severe disease, while taxa such as Coprococcus and Romboutsia were associated with milder disease. Bayesian decision curve analysis of our top-performing models also demonstrated likely clinical utility at relevant decision thresholds. Our results suggest the potential of integrated machine learning approaches to support clinical decision-making in pediatric Crohn's disease. By enabling early identification of high-risk patients at diagnosis, this work paves the way for personalized treatment strategies that could improve long-term outcomes in this vulnerable population.

Application of supervised learning models to compare epigenetic predictors of gene expression across healthy breast cell types (2024)

A new data-driven framework for simulating Mendelian randomization data (2023)

Evaluating omics-based tests with Bayesian Decision Curve Analysis (2023)

Omics-based tests (OBTs) combine high-dimensional omics features into clinical prediction models that predict diagnosis, prognosis, or treatment effects. Past incidences of premature implementation of OBTs into clinical trials have demonstrated the need for increased rigour in their clinical evaluation. However, their performance assessment is often limited to classification metrics such as sensitivity and specificity, with little regard for formal analysis of clinical decision-making. Decision curve analysis (DCA) complements classification metrics by combining classical assessment of predictive performance with the consequences of using a test or model to guide clinical decisions. In DCA, the best clinical decision strategy, such as diagnosing or treating based on an OBT, is the one that maximizes the concept of net benefit: the net number of true positives (or negatives) provided by a given clinical decision strategy. Before reaching real patients, we must be sufficiently confident that new OBTs actually provide superior clinical decision strategies, as compared to default, standard-of-care strategies. Trained on hundreds to thousands of features, OBTs are particularly prone to chance results. In this context, the present work develops parametric Bayesian approaches to DCA that allow uncertainty quantification around four fundamental concerns when evaluating OBT-guided clinical decision strategies: (i) which strategies are clinically useful, (ii) what is the best available decision strategy, (iii) direct pairwise comparisons between strategies, and (iv) what is the consequence of the current level of uncertainty. We evaluate the methods using simulation studies and present a comprehensive case study. We also provide an application to a recently-developed OBT for multi-cancer early detection. Software implementation of the method is freely available in the bayesDCA R package. Ultimately, the Bayesian DCA workflow may help clinicians and health policymakers make better-informed decisions when choosing and implementing clinical decision strategies based on OBTs.
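
As a rough illustration of the net benefit concept mentioned in the abstract above, the sketch below computes the standard point-estimate net benefit used in classical decision curve analysis: the fraction of true positives minus the fraction of false positives weighted by the odds of the decision threshold. This is not the thesis's Bayesian method and does not use the bayesDCA package; the function name `net_benefit` and the toy outcomes and risk scores are hypothetical and chosen only for illustration.

```python
def net_benefit(outcomes, treat, threshold):
    """Classical (point-estimate) net benefit of a decision strategy.

    outcomes  : list of 0/1 true disease status
    treat     : list of booleans, True if the strategy intervenes on that patient
    threshold : decision threshold probability, strictly between 0 and 1
    """
    n = len(outcomes)
    true_pos = sum(1 for y, d in zip(outcomes, treat) if d and y == 1)
    false_pos = sum(1 for y, d in zip(outcomes, treat) if d and y == 0)
    # False positives are weighted by the odds implied by the threshold.
    odds = threshold / (1 - threshold)
    return true_pos / n - (false_pos / n) * odds


# Hypothetical toy data: true outcomes and model-predicted risks.
outcomes = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
risks = [0.9, 0.6, 0.4, 0.2, 0.1, 0.3, 0.7, 0.05, 0.15, 0.25]
threshold = 0.2

# Three strategies: follow the test, treat everyone, treat no one.
nb_test = net_benefit(outcomes, [r >= threshold for r in risks], threshold)
nb_all = net_benefit(outcomes, [True] * len(outcomes), threshold)
nb_none = net_benefit(outcomes, [False] * len(outcomes), threshold)

print(f"test-guided: {nb_test:.3f}, treat all: {nb_all:.3f}, treat none: {nb_none:.3f}")
```

On this toy data the test-guided strategy has the highest net benefit at a threshold of 0.2. A Bayesian treatment, as the abstract describes, would additionally quantify the uncertainty around such comparisons instead of reporting point estimates alone.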
