Relevant Thesis-Based Degree Programs
Affiliations to Research Centres, Institutes & Clusters
Dr. Anne Condon is a Professor in the Department of Computer Science at U. British Columbia. Her research is in the areas of computational complexity and algorithms, with a current focus on ways to computationally predict nucleic acid structure and on molecular programming - the art and science of writing programs that are realized and executed by DNA or other molecules. She is an ACM Fellow and a Fellow of the Royal Society of Canada. She received her Bachelor's degree (1982) from University College Cork, Ireland, and her Ph.D. (1987) at the University of Washington.
Complete these steps before you reach out to a faculty member!
- Familiarize yourself with program requirements. You want to learn as much as possible from the information available to you before you reach out to a faculty member. Be sure to visit the graduate degree program listing and program-specific websites.
- Check whether the program requires you to seek commitment from a supervisor prior to submitting an application. For some programs this is an essential step while others match successful applicants with faculty members within the first year of study. This is either indicated in the program profile under "Admission Information & Requirements" - "Prepare Application" - "Supervision" or on the program website.
- Identify specific faculty members who are conducting research in your specific area of interest.
- Establish that your research interests align with the faculty member’s research interests.
- Read up on the faculty members in the program and the research being conducted in the department.
- Familiarize yourself with their work, read their recent publications and past theses/dissertations that they supervised. Be certain that their research is indeed what you are hoping to study.
- Compose an error-free and grammatically correct email addressed to your specifically targeted faculty member, and remember to use their correct titles.
- Do not send non-specific, mass emails to everyone in the department hoping for a match.
- Address the faculty members by name. Your contact should be genuine rather than generic.
- Include a brief outline of your academic background, why you are interested in working with the faculty member, and what experience you could bring to the department. The supervision enquiry form guides you with targeted questions. Ensure to craft compelling answers to these questions.
- Highlight your achievements and why you are a top student. Faculty members receive dozens of requests from prospective students and you may have less than 30 seconds to pique someone’s interest.
- Demonstrate that you are familiar with their research:
- Convey the specific ways you are a good fit for the program.
- Convey the specific ways the program/lab/faculty member is a good fit for the research you are interested in/already conducting.
- Be enthusiastic, but don’t overdo it.
G+PS regularly provides virtual sessions that focus on admission requirements and procedures and tips how to improve your application.
ADVICE AND INSIGHTS FROM UBC FACULTY ON REACHING OUT TO SUPERVISORS
These videos contain some general advice from faculty across UBC on finding and reaching out to a potential thesis supervisor.
Graduate Student Supervision
Doctoral Student Supervision
Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.
Nucleic acid molecules are vital constituents of living beings. These molecules arealso utilized for building autonomous nanoscale devices with biological and technologicalapplications, such as toehold switches, algorithmic structures, robots,and logic gates. Predicting the kinetics (non-equilibrium dynamics) of interactingnucleic acid strands, such as hairpin opening and strand displacement reactions,would assist with understanding the functionality of nucleic acids in the cell andwith building nucleic-acid based devices.Continuous-time Markov chains (CTMC) are commonly used to predict thekinetics of these reactions. However, predicting kinetics with CTMC models ischallenging. Because, first, the CTMCs should be defined with accurate and biophysicallyrealistic kinetic models. Second, the state space of the CTMCs may belarge, making predictions time-consuming, particularly for reactions that happenon a long time scale (rare events), such as strand displacement at room temperature.We introduce an Arrhenius kinetic model of interacting nucleic acid strandsthat relates the activation energy of a state transition with the immediate local environmentof the affected base pair. Our model can be used in stochastic simulationsto estimate kinetic properties and is consistent with existing thermodynamic modelsthat make equilibrium predictions. We infer the model’s parameters on a widerange of reactions by using mean first passage time (MFPT) estimates. We estimateMFPTs using exact computations on simplified state spaces. We show thatour new model surpasses the performance of the previously established Metropoliskinetic model.We further address MFPT estimation and the rapid evaluation of perturbed parameters for parameter inference in the full state space of reactions’ CTMCs.We show how to use a reduced variance stochastic simulation algorithm (RVSSA)to estimate MFPTs. We also introduce a fixed path ensemble inference (FPEI)approach for the rapid evaluation of perturbed parameters. These methods arepromising, but they are not suitable for rare events. Thus, we introduce the pathwayelaboration method, a time-efficient and probabilistic truncated-based approach foraddressing both mentioned tasks. We demonstrate the effectiveness of our methodsby conducting computational experiments on nucleic acid kinetics measurementsthat cover a wide range of rates for different type of reactions.
RNA-binding proteins (RBPs) interact with their RNA targets to mediate various critical cellular processes in post-transcriptional gene regulations such as RNA splicing, modification, replication, localization, etc.. Characterizing the binding preferences of RBP is essential for us to decipher the underlying interaction code and to understand the functions of the interaction partners. However, currently only a minority of the numerous RBPs have RNA binding data available from in vivo or in vitro experiments. The binding preferences of experimentally unexplored RBPs remain largely unknown and are challenging to identify. In this thesis, we take machine learning based recommendation approaches to address this problem. We focus on leveraging the binding data currently available to infer the RNA preferences for RBPs that have not been experimentally explored. Firstly, we present a recommendation method based on co-evolutions to predict the RNA binding specificities for experimentally unexplored RBPs, waiving the need of the RBPs' binding data. We first demonstrate the co-evolutionary relationship between RBPs and their RNA targets. We then describe a K-nearest neighbors based algorithm to explore co-evolutions to infer the RNA binding specificity of an RBP using only the specificities information from its homologous RBPs. Secondly, we present a nucleic acid recommender system to predict probe-level binding profiles for unexplored or poorly studied RBPs. We first encode biological sequences to distributed feature representations by adapting word embedding techniques. We then build a neural network to recommend binding profiles for unexplored RBPs by learning the similarities between them and RBPs that have binding data available. Thirdly, we present a graph convolutional network for unexplored RBPs' binding affinities recommendation. Extending from the previous two approaches, this method adopts a transductive message passing setting to incorporate more information from the data. It predicts the interaction affinity between an unexplored RBP and an RNA probe by propagating information from other explored RBP-RNA interactions through a heterogeneous graph of RBPs and RNAs. Overall, the approaches presented here can help to improve the understanding of RBPs' binding mechanisms and provide new opportunities to investigate the complex post-transcriptional regulations.
Next generation sequencing (NGS) technology provides researchers an opportunityto study cancer genomes at different resolutions. In particular, detection andinterpretation of the smallest somatic changes of the genome (single nucleotidevariants) are now tractable at scale. However, significant challenges in the analysisof both bulk tumour and single cell sequencing methods remain to fully exploitthe advance in technology development. Two emerging areas of applying sequencingtechnology to better ascertain properties of cancer evolution are (i) sequencingmultiple tumour biopsies from the same patient, and (ii) single cell genomesequencing. Both of these advances represent computational challenges that I addressthrough development of novel methods in this thesis. The first proposedmethod (Chapter 2) incorporates prior clonal information to improve the accuracyof detecting SNVs across the genome of multiple bulk tumour samples. The secondproposed method (Chapter 3) is a statistical model that exploits the underlyingphylogeny of individually sequenced cells to detect SNVs in every individual cell.The latter method identifies clone specific SNVs without the requirement of deconvolvingthe results from bulk sequencing data. The resultant accurate detection ofSNVs (Chapter 4) helps enhance insight on the evolutionary process of tumoursand genetic pathways. Together, the methods provide a toolbox for comprehensiveprofiling of SNVs for the study of tumour dynamics.
The contributions of this thesis are motivated by an exciting challenge atthe intersection of computer science and biochemistry: Can we programmolecules to do interesting or useful computations? There has been significant progress in programming nucleic acids - particularly DNA molecules -thanks in part to availability of models and algorithms for predicting nucleicacid structure and folding kinetics. At a higher level of abstraction, ChemicalReaction Networks (CRNs) have proven to be valuable as a molecular programmingmodel that enables researchers to understand the potential andlimitations of computing with molecules, unencumbered by low-level details.These two levels of abstraction are linked; it is possible to "compile" CRNprograms into nucleic acid systems that make the programs implementablein a test tube.We design and analyze CRN algorithms for two purposes. First, weshow how any semilinear function can be computed by CRNs, even whenno "leader" species (i.e., initial species with constant but non-zero counts)is present. Our results improve earlier results of Chen et al. (2012) whoshowed that only semilinear functions are computable by error-free CRNsusing leaders. Our new CRN construction can be done in expected timeO(n), which is faster than O(n log n) bound achieved by Chen et al. Second,we provide the most intuitive proofs of correctness and efficiency for threedifferent CRNs computing Approximate Majority: Given a mixture of twotypes of species with an initial gap between their counts, a CRN computationmust reach totality on the majority species with high probability. The CRNsof our interest have the ability to start with an initial gap of Ω(√n log n).In the second part of this thesis, we study the problem of predicting theMinimum Free Energy secondary structure (the set of base pairs) of a givenset of nucleic acid strands with no pseudoknots (crossing base pairs). Weshow that this problem is APX-hard which implies that there does not exista polynomial time approximation scheme for this problem, unless P = NP.We also propose a new Monte-Carlo based method to efficiently estimatenucleic acid folding kinetics.
High-throughput genome sequencing and other techniques provide a cost-effective way to study cancer biology and seek precision treatment options. In this dissertation I address three challenges in cancer systems biology research: 1) predicting somatic mutations, 2) interpreting mutation functions, and 3) stratifying patients into biologically meaningful groups. Somatic single nucleotide variants are frequent therapeutically actionable mutations in cancer, e.g., the ‘hotspot’ mutations in known cancer driver genes such as EGFR, KRAS, and BRAF. However, only a small proportion of cancer patients harbour these known driver mutations. Therefore, there is a great need to systematically profile a cancer genome to identify all the somatic single nucleotide variants. I develop methods to discover these somatic mutations from cancer genomic sequencing data, taking into account the noise in high-throughput sequencing data and valuable validated genuine somatic mutations and non-somatic mutations. Of the somatic alterations acquired for each cancer patient, only a few mutations ‘drive’ the initialization and progression of cancer. To better understand the evolution of cancer, as well as to apply precision treatments, we need to assess the functions of these mutations to pinpoint the driver mutations. I address this challenge by predicting the mutations correlated with gene expression dysregulation. The method is based on hierarchical Bayes modelling of the influence of mutations on gene expression, and can predict the mutations that impact gene expression in individual patients. Although probably no two cancer genomes share exactly the same set of somatic mutations because of the stochastic nature of acquired mutations across the three billion base pairs, some cancer patients share common driver mutations or disrupted pathways. These patients may have similar prognoses and potentially benefit from the same kind of treatment options. I develop an efficient clustering algorithm to cluster high-throughput and high-dimensional bio- logical datasets, with the potential to put cancer patients into biologically meaningful groups for treatment selection.
RNA molecules are crucial in different levels of cellular function, and their functions largely depend on their structures. Since experimental methods for determining RNA structure are expensive, computational methods for prediction of RNA secondary structure (the set of base pairs) are valuable. One such approach is based on the Minimum Free Energy (MFE) folding hypothesis, which states an RNA molecule folds into the structure with the minimum free energy. Another approach is based on the hierarchical folding hypothesis, which posits that an RNA molecule first folds into a psuedoknot- free structure (i.e., no crossing base pairs), then additional base pairs are added that may form pseudoknots (structures with crossing base pairs) to lower the structure’s free energy.The overarching goal of this thesis is to overcome some limitations of the previous methods, namely (1) slow running time, (2) poor prediction accuracy (i.e., low number of correctly predicted base pairs), (3) limited in types of pseudoknots they can predict, and (4) limited opportunity to provide structural information that can guide prediction. We propose two algorithms and study different variations of each method. We propose a relaxed version of the hierarchical folding hypothesis and present Iterative HFold, which is based on this hypothesis. Furthermore, we propose the CCJ algorithm, an MFE-based algorithm that significantly expands the structures handled in O(n⁵) time. Employing a sparsification technique, we propose a sparse version of CCJ that improves its space usage.While CCJ’s and Iterative HFold’s performance are not significantly different on our large data set, Iterative HFold is considerably less expensive to run than CCJ. In addition, Iterative HFold provides the user with ability to incorporate structural information as input constraint. Both CCJ and Iterative HFold predict significantly more accurate structures on key data sets than two of the best performing computational methods currently available (i.e., HotKnots V2.0 and IPknot). Of the folding hypotheses that we have studied, it seems that the relaxed hierarchical folding hypothesis has better potential to predict RNA secondary structure, at least with respect to the energy model used in this work.
Two classes of widely studied markets are auctions, such as eBay, and two-sided matching markets, such as matching medical residents to hospitals. Inboth of these markets, it often happens that one side is in power and can influence the outcome of the market---e.g., the seller in an auction.In the fist part of this thesis we study two-sided matching markets in which participants are initially endowed with partial preference orderings. Our goal is to identify a matching that is stable and optimal for the side of the market that is in power, w.r.t. the underlying true preferences of the agents. We first investigate the extent to which we can learn about this matching given the partial information. We then provide a novel model in which the true preferences are learned through interviews. Our goal is to identify a centralized interviewing policy that returns our matching of interest while minimizing the number of interviews. We introduce three minimizationcriteria, of which the most desirable one is not always achievable. We provide an exponential-time algorithm for computing a policy that satisfies the other two criteria, and exhibit evidence that a more efficient computation may not be possible in general. We then show how to design a computationally efficient interview-minimizing policy for a setting in which agents on one sideof the market are initially endowed with identical partial preferences, and agents must interview before getting matched. In the second part, we study combinatorial auctions (CAs), where multiplegoods are for sale and buyers are allowed to place bids on bundles, i.e. sets of goods. In single-good auctions the auctioneer is at no risk of losing revenue if more bidders compete in the auction. In CAs however, an auctioneer may fi nd it pro table to disqualify bidders in order to be matched to a smaller set of bidders. We investigate the extent to which this counterintuitive phenomenon can occur under CA mechanisms that o er bidders dominant strategies. We show that a broad class of deterministic CAs exhibit non-monotonicity in revenue. However, revenue monotonic CAs exist in the broader domain of randomized mechanisms.
Nucleic acids play vital roles in the cell by virtue of the information encoded into their nucleotide sequence and the folded structures they form. Given their propensity to alter their shape over time under changing environmental conditions, an RNA molecule will fold through a series of structures called a folding pathway. As this is a thermodynamically-driven probabilistic process, folding pathways tend to avoid high energy structures and those which do are said to have a low energy barrier.In the first part of this thesis, we study the problem of predicting low energy barrier folding pathways of a nucleic acid strand. We show various restrictions of the problem are computationally intractable, unless P=NP. We propose an exact algorithm that has exponential worst-case runtime, but uses only polynomial space and performs well in practice. Motivated by recent applications in molecular programming we also consider a number of related problems that leverage folding pathways to perform computation. We show that verifying the correctness of these systems is PSPACE-hard and in doing so show that predicting low energy barrier folding pathways of multiple interacting strands is PSPACE-complete. We explore the computational limits of this class of molecular programs which are capable, in principle, of logically reversible and thus energy efficient computation. We demonstrate that a space and energy efficient molecular program of this class can be constructed to solve any problem in SPACE ---the class of all space-bounded problems. We prove a number of limits to deterministic and also to space efficient computation of molecular programs that leverage folding pathways, and show limits for more general classes.In the second part of this thesis, we continue the study of algorithms and data structures for predicting properties of nucleic acids, but with quite different motivations pertaining to sequence rather than structure. We design a number of compressed text indexes that improve pattern matching queries in light of common biological events such as single nucleotide polymorphisms in genomes and alternative splicing in transcriptomes. Our text indexes and associated algorithms have the potential for use in alignment of sequencing data to reference sequences.
RNA molecules play important roles, including catalysis of chemical reactions and control of gene expression, and their functions largely depend on their folded structures. Since determining these structures by biochemical means is expensive, there is increased demand for computational predictions of RNA structures. One computational approach is to find the secondary structure (a set of base pairs) that minimizes a free energy function for a given RNA conformation. The forces driving RNA folding can be approximated by means of a free energy model, which associates a free energy parameter to a distinct considered feature.The main goal of this thesis is to develop state-of-the-art computational approaches that can significantly increase the accuracy (i.e., maximize the number of correctly predicted base pairs) of RNA secondary structure prediction methods, by improving and refining the parameters of the underlying RNA free energy model.We propose two general approaches to estimate RNA free energy parameters. The Constraint Generation (CG) approach is based on iteratively generating constraints that enforce known structures to have energies lower than other structures for the same molecule. The Boltzmann Likelihood (BL) approach infers a set of RNA free energy parameters which maximize the conditional likelihood of a set of known RNA structures. We discuss several variants and extensions of these two approaches, including a linear Gaussian Bayesian network that defines relationships between features. Overall, BL gives slightly better results than CG, but it is over ten times more expensive to run. In addition, CG requires software that is much simpler to implement.We obtain significant improvements in the accuracy of RNA minimum free energy secondary structure prediction with and without pseudoknots (regions of non-nested base pairs), when measured on large sets of RNA molecules with known structures. For the Turner model, which has been the gold-standard model without pseudoknots for more than a decade, the average prediction accuracy of our new parameters increases from 60% to 71%. For two models with pseudoknots, we obtain an increase of 9% and 6%, respectively. To the best of our knowledge, our parameters are currently state-of-the-art for the three considered models.
Master's Student Supervision
Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.
Nucleic acid strands, which react by forming and breaking Watson-Crick base pairs, can be designed to form complex nanoscale structures or devices. Controlling such systems requires accurate predictions of the reaction rate and folding pathways of the interacting strands. These kinetic properties can be modelled using continuous-time Markov chains (CTMCs), whose states and transitions correspond to secondary structures and elementary base pair changes, respectively. The transient dynamics of a CTMC are determined by a kinetic model, which assigns transition rates to pairs of states. The rate of a reaction can be estimated using its CTMC's mean first passage time (MFPT), which can be computed exactly by solving a linear system, or approximated via Monte Carlo simulation. However, both approaches may be computationally infeasible for rare event reactions in larger systems. This limitation can be addressed by constructing truncated CTMCs, which only include a small subset of states and transitions, selected either manually or through simulation. In recent work, posterior inference in an Arrhenius-type kinetic model was performed using a fixed set of manually truncated CTMCs and a small experimental dataset of DNA reaction rates. We extend this Bayesian approach, using a larger dataset of 1105 reactions, a new prior model that is directly motivated by the physical meaning of the parameters and is compatible with experimental measurements of elementary rates, and larger truncated state spaces, constructed stochastically using the recently introduced pathway elaboration method. Despite a significantly higher computational cost, we find that the larger state spaces do not necessarily lead to more accurate rate predictions than the small, manually designed state spaces. For posterior approximation, we apply the standard random walk Metropolis algorithm and the gradient-based no-u-turn sampler. Our posterior approximations, which are often multimodal, recover an expected correlation structure among the kinetic parameters. However, we also uncover severe numerical instability in the MPFT computations. Due to numerous design limitations in the legacy software, a significant refactoring effort was required to implement the above extensions, resulting also in improvements in performance and reproducibility.
When disallowing error, traditional chemical reaction networks are very limited in computational power: Angluin et al. and Chen et al. showed that only semilinear predicates and functions are stably computable by CRNs. Qian et al. and others have shown that polymer-supplemented chemical reaction networks are capable of Turing-universal computation. However, their model requires that inputs are loaded onto the polymers at protocol initialization, in contrast with the traditional convention that inputs are represented by counts of molecules in solution. Here, we show that polymer-supplemented chemical reaction networks can stably simulate Turing-universal computations even with solution-based inputs. However, such simulations use a unique ''leader'' polymer per input type and thus involve many slow bottleneck reactions. We further refine the polymer-supplemented chemical reaction network model to allow for clone polymers, that is, multiple functionally-identical copies of a polymer, and provide an illustrative example of how bottleneck reactions can be reduced in this new model.
This thesis provides a sufficiency condition for the functions f: ℕ² → ℕ which are stably computable by output-oblivious Stochastic Chemical Reaction Networks (CRNs), i.e., systems of reactions in which output species are never reactants. While it is known that precisely the semilinear functions are stably computable by CRNs, such CRNs sometimes rely on initially producing too many output species, and then consuming the excess in order to reach a correct stable state. These CRNs may be difficult to integrate into larger systems: if the output of a CRN C becomes the input to a downstream CRN C′, then C′ could inadvertently consume too many outputs before C stabilizes. If, on the other hand, C is output-oblivious then C′ may consume C’s output as soon as it is available. In this work, we prove that a semilinear function f: ℕ² → ℕ is stably computable by an output-oblivious CRN with a leader if it is both increasing and either grid-affine (intuitively, its domains are congruence classes), or the minimum of a finite set of fissure functions (intuitively, functions behaving like the min function).This result complements the necessity condition obtained and proven by other contributors. The run-time analysis provided in the end adds more detail to the proof andthe construction.
RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since their functions largely depend on their folded structures, having accurate and efficient methods for RNA structure prediction is increasingly valuable. A number of computational approaches have been developed to predict RNA secondary structure. There have been some recent advances for two of these approaches, namely Minimum Free Energy (MFE) and Maximum Expected Accuracy (MEA). The main goals of this thesis are twofold: 1) to empirically analyze the accuracy (i.e., the number of correctly predicted base pairs) of the MFE and MEA algorithms in different energy models and 2) to estimate the free energy parameters specifically for the MEA by using the constraint generation (CG) approach. We present new results on the degree to which different factors influence the prediction accuracy MFE and MEA. The factors that we consider are: structural information that is provided in addition to RNA sequence, thermodynamic parameters, and data set size. Structural information significantly increases the accuracy of these methods. The relative performance of MFE and MEA changes according to the free energy parameters used; however, both have the best performance when they use Andronescu et al.'s BL* parameter set. Having bigger data sets results in less variation in prediction accuracy of the MFE and MEA algorithms. Furthermore, we try to find better free energy parameters for the MEA algorithm. For our purpose, we adapt the CG approach so that it specifically optimizes parameters for MEA. Overall, when parameters are trained using MFE, they slightly outperform the parameters estimated for MEA. For the MEA algorithm, we also study the effect of parameter γ which controls the relative sensitivity and positive predictive value (PPV). We obtain that when γ=1, the accuracy of MEA (in terms of F-measure on the BL* parameter set) is better than its accuracy for other values of γ. We also find that the sensitivity and PPV of MEA will interestingly change for different values of γ using the BL* parameter set.
- Approximate majority analyses using tri-molecular chemical reaction networks (2020)
Natural Computing, 19 (1), 249--270
- Interpretable dimensionality reduction of single cell transcriptome data with deep generative models (2018)
Nature Communications, 9 (1)
- The complexity of string partitioning (2015)
Journal of Discrete Algorithms,