Doctor of Philosophy in Computer Science (PhD)
Efficient inference and prediction for large continuous-time Markov chains, with applications to nucleic acid kinetic simulators
G+PS regularly provides virtual sessions that focus on admission requirements and procedures and tips how to improve your application.
No abstract available.
The contributions of this thesis are motivated by an exciting challenge atthe intersection of computer science and biochemistry: Can we programmolecules to do interesting or useful computations? There has been significant progress in programming nucleic acids - particularly DNA molecules -thanks in part to availability of models and algorithms for predicting nucleicacid structure and folding kinetics. At a higher level of abstraction, ChemicalReaction Networks (CRNs) have proven to be valuable as a molecular programmingmodel that enables researchers to understand the potential andlimitations of computing with molecules, unencumbered by low-level details.These two levels of abstraction are linked; it is possible to "compile" CRNprograms into nucleic acid systems that make the programs implementablein a test tube.We design and analyze CRN algorithms for two purposes. First, weshow how any semilinear function can be computed by CRNs, even whenno "leader" species (i.e., initial species with constant but non-zero counts)is present. Our results improve earlier results of Chen et al. (2012) whoshowed that only semilinear functions are computable by error-free CRNsusing leaders. Our new CRN construction can be done in expected timeO(n), which is faster than O(n log n) bound achieved by Chen et al. Second,we provide the most intuitive proofs of correctness and efficiency for threedifferent CRNs computing Approximate Majority: Given a mixture of twotypes of species with an initial gap between their counts, a CRN computationmust reach totality on the majority species with high probability. The CRNsof our interest have the ability to start with an initial gap of Ω(√n log n).In the second part of this thesis, we study the problem of predicting theMinimum Free Energy secondary structure (the set of base pairs) of a givenset of nucleic acid strands with no pseudoknots (crossing base pairs). Weshow that this problem is APX-hard which implies that there does not exista polynomial time approximation scheme for this problem, unless P = NP.We also propose a new Monte-Carlo based method to efficiently estimatenucleic acid folding kinetics.
High-throughput genome sequencing and other techniques provide a cost-effective way to study cancer biology and seek precision treatment options. In this dissertation I address three challenges in cancer systems biology research: 1) predicting somatic mutations, 2) interpreting mutation functions, and 3) stratifying patients into biologically meaningful groups. Somatic single nucleotide variants are frequent therapeutically actionable mutations in cancer, e.g., the ‘hotspot’ mutations in known cancer driver genes such as EGFR, KRAS, and BRAF. However, only a small proportion of cancer patients harbour these known driver mutations. Therefore, there is a great need to systematically profile a cancer genome to identify all the somatic single nucleotide variants. I develop methods to discover these somatic mutations from cancer genomic sequencing data, taking into account the noise in high-throughput sequencing data and valuable validated genuine somatic mutations and non-somatic mutations. Of the somatic alterations acquired for each cancer patient, only a few mutations ‘drive’ the initialization and progression of cancer. To better understand the evolution of cancer, as well as to apply precision treatments, we need to assess the functions of these mutations to pinpoint the driver mutations. I address this challenge by predicting the mutations correlated with gene expression dysregulation. The method is based on hierarchical Bayes modelling of the influence of mutations on gene expression, and can predict the mutations that impact gene expression in individual patients. Although probably no two cancer genomes share exactly the same set of somatic mutations because of the stochastic nature of acquired mutations across the three billion base pairs, some cancer patients share common driver mutations or disrupted pathways. These patients may have similar prognoses and potentially benefit from the same kind of treatment options. I develop an efficient clustering algorithm to cluster high-throughput and high-dimensional bio- logical datasets, with the potential to put cancer patients into biologically meaningful groups for treatment selection.
RNA molecules are crucial in different levels of cellular function, and their functions largely depend on their structures. Since experimental methods for determining RNA structure are expensive, computational methods for prediction of RNA secondary structure (the set of base pairs) are valuable. One such approach is based on the Minimum Free Energy (MFE) folding hypothesis, which states an RNA molecule folds into the structure with the minimum free energy. Another approach is based on the hierarchical folding hypothesis, which posits that an RNA molecule first folds into a psuedoknot- free structure (i.e., no crossing base pairs), then additional base pairs are added that may form pseudoknots (structures with crossing base pairs) to lower the structure’s free energy.The overarching goal of this thesis is to overcome some limitations of the previous methods, namely (1) slow running time, (2) poor prediction accuracy (i.e., low number of correctly predicted base pairs), (3) limited in types of pseudoknots they can predict, and (4) limited opportunity to provide structural information that can guide prediction. We propose two algorithms and study different variations of each method. We propose a relaxed version of the hierarchical folding hypothesis and present Iterative HFold, which is based on this hypothesis. Furthermore, we propose the CCJ algorithm, an MFE-based algorithm that significantly expands the structures handled in O(n⁵) time. Employing a sparsification technique, we propose a sparse version of CCJ that improves its space usage.While CCJ’s and Iterative HFold’s performance are not significantly different on our large data set, Iterative HFold is considerably less expensive to run than CCJ. In addition, Iterative HFold provides the user with ability to incorporate structural information as input constraint. Both CCJ and Iterative HFold predict significantly more accurate structures on key data sets than two of the best performing computational methods currently available (i.e., HotKnots V2.0 and IPknot). Of the folding hypotheses that we have studied, it seems that the relaxed hierarchical folding hypothesis has better potential to predict RNA secondary structure, at least with respect to the energy model used in this work.
Two classes of widely studied markets are auctions, such as eBay, and two-sided matching markets, such as matching medical residents to hospitals. Inboth of these markets, it often happens that one side is in power and can influence the outcome of the market---e.g., the seller in an auction.In the fist part of this thesis we study two-sided matching markets in which participants are initially endowed with partial preference orderings. Our goal is to identify a matching that is stable and optimal for the side of the market that is in power, w.r.t. the underlying true preferences of the agents. We first investigate the extent to which we can learn about this matching given the partial information. We then provide a novel model in which the true preferences are learned through interviews. Our goal is to identify a centralized interviewing policy that returns our matching of interest while minimizing the number of interviews. We introduce three minimizationcriteria, of which the most desirable one is not always achievable. We provide an exponential-time algorithm for computing a policy that satisfies the other two criteria, and exhibit evidence that a more efficient computation may not be possible in general. We then show how to design a computationally efficient interview-minimizing policy for a setting in which agents on one sideof the market are initially endowed with identical partial preferences, and agents must interview before getting matched. In the second part, we study combinatorial auctions (CAs), where multiplegoods are for sale and buyers are allowed to place bids on bundles, i.e. sets of goods. In single-good auctions the auctioneer is at no risk of losing revenue if more bidders compete in the auction. In CAs however, an auctioneer may fi nd it pro table to disqualify bidders in order to be matched to a smaller set of bidders. We investigate the extent to which this counterintuitive phenomenon can occur under CA mechanisms that o er bidders dominant strategies. We show that a broad class of deterministic CAs exhibit non-monotonicity in revenue. However, revenue monotonic CAs exist in the broader domain of randomized mechanisms.
Nucleic acids play vital roles in the cell by virtue of the information encoded into their nucleotide sequence and the folded structures they form. Given their propensity to alter their shape over time under changing environmental conditions, an RNA molecule will fold through a series of structures called a folding pathway. As this is a thermodynamically-driven probabilistic process, folding pathways tend to avoid high energy structures and those which do are said to have a low energy barrier.In the first part of this thesis, we study the problem of predicting low energy barrier folding pathways of a nucleic acid strand. We show various restrictions of the problem are computationally intractable, unless P=NP. We propose an exact algorithm that has exponential worst-case runtime, but uses only polynomial space and performs well in practice. Motivated by recent applications in molecular programming we also consider a number of related problems that leverage folding pathways to perform computation. We show that verifying the correctness of these systems is PSPACE-hard and in doing so show that predicting low energy barrier folding pathways of multiple interacting strands is PSPACE-complete. We explore the computational limits of this class of molecular programs which are capable, in principle, of logically reversible and thus energy efficient computation. We demonstrate that a space and energy efficient molecular program of this class can be constructed to solve any problem in SPACE ---the class of all space-bounded problems. We prove a number of limits to deterministic and also to space efficient computation of molecular programs that leverage folding pathways, and show limits for more general classes.In the second part of this thesis, we continue the study of algorithms and data structures for predicting properties of nucleic acids, but with quite different motivations pertaining to sequence rather than structure. We design a number of compressed text indexes that improve pattern matching queries in light of common biological events such as single nucleotide polymorphisms in genomes and alternative splicing in transcriptomes. Our text indexes and associated algorithms have the potential for use in alignment of sequencing data to reference sequences.
RNA molecules play important roles, including catalysis of chemical reactions and control of gene expression, and their functions largely depend on their folded structures. Since determining these structures by biochemical means is expensive, there is increased demand for computational predictions of RNA structures. One computational approach is to find the secondary structure (a set of base pairs) that minimizes a free energy function for a given RNA conformation. The forces driving RNA folding can be approximated by means of a free energy model, which associates a free energy parameter to a distinct considered feature.The main goal of this thesis is to develop state-of-the-art computational approaches that can significantly increase the accuracy (i.e., maximize the number of correctly predicted base pairs) of RNA secondary structure prediction methods, by improving and refining the parameters of the underlying RNA free energy model.We propose two general approaches to estimate RNA free energy parameters. The Constraint Generation (CG) approach is based on iteratively generating constraints that enforce known structures to have energies lower than other structures for the same molecule. The Boltzmann Likelihood (BL) approach infers a set of RNA free energy parameters which maximize the conditional likelihood of a set of known RNA structures. We discuss several variants and extensions of these two approaches, including a linear Gaussian Bayesian network that defines relationships between features. Overall, BL gives slightly better results than CG, but it is over ten times more expensive to run. In addition, CG requires software that is much simpler to implement.We obtain significant improvements in the accuracy of RNA minimum free energy secondary structure prediction with and without pseudoknots (regions of non-nested base pairs), when measured on large sets of RNA molecules with known structures. For the Turner model, which has been the gold-standard model without pseudoknots for more than a decade, the average prediction accuracy of our new parameters increases from 60% to 71%. For two models with pseudoknots, we obtain an increase of 9% and 6%, respectively. To the best of our knowledge, our parameters are currently state-of-the-art for the three considered models.
RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since their functions largely depend on their folded structures, having accurate and efficient methods for RNA structure prediction is increasingly valuable. A number of computational approaches have been developed to predict RNA secondary structure. There have been some recent advances for two of these approaches, namely Minimum Free Energy (MFE) and Maximum Expected Accuracy (MEA). The main goals of this thesis are twofold: 1) to empirically analyze the accuracy (i.e., the number of correctly predicted base pairs) of the MFE and MEA algorithms in different energy models and 2) to estimate the free energy parameters specifically for the MEA by using the constraint generation (CG) approach. We present new results on the degree to which different factors influence the prediction accuracy MFE and MEA. The factors that we consider are: structural information that is provided in addition to RNA sequence, thermodynamic parameters, and data set size. Structural information significantly increases the accuracy of these methods. The relative performance of MFE and MEA changes according to the free energy parameters used; however, both have the best performance when they use Andronescu et al.'s BL* parameter set. Having bigger data sets results in less variation in prediction accuracy of the MFE and MEA algorithms. Furthermore, we try to find better free energy parameters for the MEA algorithm. For our purpose, we adapt the CG approach so that it specifically optimizes parameters for MEA. Overall, when parameters are trained using MFE, they slightly outperform the parameters estimated for MEA. For the MEA algorithm, we also study the effect of parameter γ which controls the relative sensitivity and positive predictive value (PPV). We obtain that when γ=1, the accuracy of MEA (in terms of F-measure on the BL* parameter set) is better than its accuracy for other values of γ. We also find that the sensitivity and PPV of MEA will interestingly change for different values of γ using the BL* parameter set.