Alexandre Bouchard-Cote

Associate Professor

Relevant Degree Programs

 

Graduate Student Supervision

Doctoral Student Supervision (Jan 2008 - May 2019)
Bayesian analysis of continuous time Markov chains with applications to phylogenetics (2019)

Bayesian analysis of continuous time, discrete state space time series is an important and challenging problem, where incomplete observation and large parameter sets call for user-defined priors based on known properties of the process. Generalized Linear Models (GLM) have a largely unexplored potential to construct such prior distributions. We show that an important challenge with Bayesian generalized linear modelling of Continuous Time Markov Chains (CTMCs) is that classical Markov Chain Monte Carlo (MCMC) techniques are too ineffective to be practical in that setup. We propose two computational methods to address this issue. The first algorithm uses an auxiliary variable construction combined with an Adaptive Hamiltonian Monte Carlo (AHMC) algorithm. The second algorithm combines Hamiltonian Monte Carlo (HMC) and Local Bouncy Particle Sampler (LBPS) to take advantage of the sparsity in certain high-dimensional rate matrices. We propose a characterization for a class of sparse factor graphs, where LBPS can be efficient. We also provide a framework to assess the computational complexity for the algorithm. An important aspect of practical implementation of Bouncy Particle Sampler (BPS) is the simulation of event times. Default implementations use conservative thinning bounds. Such bounds can slow down the algorithm and limit the computational performance. In the second algorithm, we develop exact analytical solutions to the random event times in the context of CTMCs. Both sampling algorithms and our model make it efficient both in terms of computation and analyst's time to construct stochastic processes informed by prior knowledge, such as known properties of the states of the process. We demonstrate the flexibility and scalability of our framework using both synthetic and real phylogenetic protein data.

View record

Entangled Monte Carlo (2018)

No abstract available.

Practical Bayesian optimization with application to tuning machine learning algorithms (2016)

Bayesian optimization has recently emerged in the machine learning community as a very effective automatic alternative to the tedious task of hand-tuning algorithm hyperparameters. Although it is a relatively new aspect of machine learning, it has known roots in the Bayesian experimental design (Lindley, 1956; Chaloner and Verdinelli, 1995), the design and analysis of computer experiments (DACE; Sacks et al., 1989), Kriging (Krige, 1951), and multi-armed bandits (Gittins, 1979). In this thesis, we motivate and introduce the model-based optimization framework and provide some historical context to the technique that dates back as far as 1933 with application to clinical drug trials (Thompson, 1933). Contributions of this work include a Bayesian gap-based exploration policy, inspired by Gabillon et al. (2012); a principled information-theoretic portfolio strategy, out-performing the portfolio of Hoffman et al. (2011); and a general practical technique circumventing the need for an initial bounding box. These various works each address existing practical challenges in the way of more widespread adoption of probabilistic model-based optimization techniques. Finally, we conclude this thesis with important directions for future research, emphasizing scalability and computational feasibility of the approach as a general purpose optimizer.

View record

Stochastic processes, statistical inference and efficient algorithms for phylogenetic inference (2016)

Phylogenetic inference aims to reconstruct the evolutionary history of populations or species. With the rapid expansion of genetic data available, statistical methods play an increasingly important role in phylogenetic inference by analyzing genetic variation of observed data collected at current populations or species. In this thesis, we develop new evolutionary models, statistical inference methods and efficient algorithms for reconstructing phylogenetic trees at the level of populations using single nucleotide polymorphism data and at the level of species using multiple sequence alignment data. At the level of populations, we introduce a new inference method to estimate evolutionary distances for any two populations to their most recent common ancestral population using single-nucleotide polymorphism allele frequencies. Our method is based on a new evolutionary model for both drift and fixation. To scale this method to large numbers of populations, we introduce the asymmetric neighbor-joining algorithm, an efficient method for reconstructing rooted bifurcating trees. Asymmetric neighbor-joining provides a scalable rooting method applicable to any non-reversible evolutionary modelling setup. We explore the statistical properties of asymmetric neighbor-joining, and demonstrate its accuracy on synthetic data. We validate our method by reconstructing rooted phylogenetic trees from the Human Genome Diversity Panel data. Our results are obtained without using an outgroup, and are consistent with the prevalent recent single-origin model of human migration. At the level of species, we introduce a continuous time stochastic process, the geometric Poisson indel process, that allows indel rates to vary across sites. We design an efficient algorithm for computing the probability of a given multiple sequence alignment based on our new indel model. We describe a method to construct phylogeny estimates from a fixed alignment using neighbor-joining. Using simulation studies, we show that ignoring indel rate variation may have a detrimental effect on the accuracy of the inferred phylogenies, and that our proposed method can sidestep this issue by inferring latent indel rate categories. We also show that our phylogenetic inference method may be more stable to taxa subsampling in a real data experiment compared to some existing methods that either ignore indels or ignore indel rate variation.

View record

Bayesian phylogenetic inference via Monte Carlo methods (2012)

A main task in evolutionary biology is phylogenetic tree reconstruction, whichdetermines the ancestral relationships among di erent species based on observedmolecular sequences, e.g. DNA data. When a stochastic model, typically continuoustime Markov chain (CTMC), is used to describe the evolution, the phylogenetic inferencedepends on unknown evolutionary parameters (hyper-parameters) in thestochastic model. Bayesian inference provides a general framework for phylogeneticanalysis, able to implement complex models of sequence evolution and toprovide a coherent treatment of uncertainty for the groups on the tree. The conventionalcomputational methods in Bayesian phylogenetics based on Markov chainMonte Carlo (MCMC) cannot e ciently explore the huge tree space, growing superexponentially with the number of molecular sequences, due to di culties ofproposing tree topologies. sequential Monte Carlo (SMC) is an alternative to approximateposterior distributions. However, it is non-trivial to directly apply SMCto phylogenetic posterior tree inference because of its combinatorial intricacies.We propose the combinatorial sequential Monte Carlo (CSMC) method to generalizeapplications of SMC to non-clock tree inference based on the existence ofa flexible partially ordered set (poset) structure, and we present it in a level of generalitydirectly applicable to many other combinatorial spaces. We show that theproposed CSMC algorithm is consistent and fast in simulations. We also investigatetwo ways of combining SMC and MCMC to jointly estimate the phylogenetictrees and evolutionary parameters, particle Markov chain Monte Carlo (PMCMC)algorithms with CSMC at each iteration and an SMC sampler with MCMC moves.Further, we present a novel way to estimate the transition probabilities for a generalCTMC, which can be used to solve the computing bottleneck in a general evolutionary model, string-valued continuous time Markov Chain (SCTMC), that canincorporate a wide range of molecular mechanisms.

View record

Master's Student Supervision (2010 - 2018)
dd-PyClone : improving clonal subpopulation inference from single cells and bulk sequencing data (2016)

Improving our understanding of intra-tumour heterogeneity in cancer has important clinical implications, including an opportunity to understand mechanisms behind relapses and drug resistance. Next generation bulk sequencing is a mature tech- nology that has been used to study subclonal tumour populations at an aggregate level. Inference of populations from bulk sequencing requires sophisticated com- putational deconvolution methods. An alternative is to identify populations directly with single cell sequencing. However, single cell sequencing is a very error-prone process, and this impedes its ability to completely replace bulk sequencing for now.In this work we present dd-PyClone, a statistical model to combine single cell and bulk sequencing data to study clonal subpopulation architecture and improve clustering assignment and cellular prevalence estimates of a set of genomic loci.We introduce a single nucleotide variant and copy number aberration aware genotype simulation scheme based on a phylogenetic tree, termed the Generalized Dollo model. This model is an improvement over previous genotype generator models in that it also accounts for the evolutionary process before a rare event (here the single nucleotide variant) occurs.We show that incorporating genomic loci co-occurrence patterns from single cell sequencing studies in inferring clonal subpopulation structure from bulk se- quencing data is beneficial. Our method outperforms existing methods in simula- tion studies and performs comparably in real dataset benchmarking. We also show that our method is fairly robust as to the choice of hyperparameters and performs reasonably in presence of noise. We hope that our method will further the under- standing of the evolutionary basis of cancer.

View record

Divide and conquer sequential Monte Carlo for phylogenetics (2015)

Recently reconstructing evolutionary histories has become a computational issue due to the increased availability of genetic sequencing data and relaxations of classical modelling assumptions. This thesis specializes a Divide & conquer sequential Monte Carlo (DCSMC) inference algorithm to phylogenetics to address these challenges. In phylogenetics, the tree structure used to represent evolutionary histories provides a model decomposition used for DCSMC. In particular, speciation events are used to recursively decompose the model into subproblems. Each subproblem is approximated by an independent population of weighted particles, which are merged and propagated to create an ancestral population. This approach provides the flexibility to relax classical assumptions on large trees by parallelizing these recursions.

View record

DrSMC : a sequential Monte Carlo sampler for deterministic relationships on continuous random variables (2015)

Computing posterior distributions over variables linked by deterministic constraints is a recurrent problem in Bayesian analysis. Such problems can arise due to censoring, identifiability issues, or other considerations. It is well-known that standard implementations of Monte Carlo inference strategies break down in the presence of these deterministic relationships. Although several alternative Monte Carlo approaches have been recently developed, few are applicable to deterministic relationships on continuous random variables. In this thesis, I propose Deterministic relationship Sequential Monte Carlo (DrSMC), a new Monte Carlo method for continuous variables possessing deterministic constraints. My exposition focuses on developing a DrSMC algorithm for computing the posterior distribution of a continuous random vector given its sum. I derive optimal settings for this algorithm and compare its performance to that of alternative approaches in the literature.

View record

Inference of rates across sites via an expectation maximization algorithm (2013)

The rates of nucleotide substitution can be different from genes to genes. Moreover, different regions of the same gene can have different rates of mutation as well. Many attempts have been tried to allow for the variable rates across different nucleotide sites. A rate factor coming from the continuous distribution has been introduced to deal with the problem. However, for computation reasons, this method can only scale to less than a dozen sequences. Later studies use a discrete gamma distribution to approximate the gamma distribution. The main contribution of our work is that we propose a discrete distribution over the rate factor which is more flexible while preserving attractive computational properties. We make inference about the rate factor and its distribution via an Expectation Maximization (EM) algorithm. We evaluate our method by both simulations and a real dataset. From the real dataset, it reflects that the method is useful for large phylogenies with even thousands of sequences. We analyze the identifiability of our model for a pair of DNA sequences under certain conditions. We also prove for certain types of rate matrices, this model is non-identifiable.

View record

Nonparametric Bayesian models for Markov jump processes (2012)

Markov jump processes (MJPs) have been used as models in various fields such asdisease progression, phylogenetic trees, and communication networks. The mainmotivation behind this thesis is the application of MJPs to data modeled as havingcomplex latent structure. In this thesis we propose a nonparametric prior, thegamma-exponential process (GEP), over MJPs. Nonparametric Bayesian modelshave recently attracted much attention in the statistics community, due to their flexibility,adaptability, and usefulness in analyzing complex real world datasets. TheGEP is a prior over infinite rate matrices which characterize an MJP; this prior canbe used in Bayesian models where an MJP is imposed on the data but the number ofstates of the MJP is unknown in advance. We show that the GEP model we proposehas some attractive properties such as conjugacy and simple closed-form predictivedistributions. We also introduce the hierarchical version of the GEP model;sharing statistical strength can be considered as the main motivation behind the hierarchicalmodel. We show that our hierarchical model admits efficient inferencealgorithms. We introduce two inference algorithms: 1) a “basic” particle Markovchain Monte Carlo (PMCMC) algorithm which is an MCMC algorithm with sequencesproposed by a sequential Monte Carlo (SMC) algorithm; 2) a modifiedversion of this PMCPC algorithm with an “improved” SMC proposal. Finally, wedemonstrate the algorithms on the problems of estimating disease progression inmultiple sclerosis and RNA evolutionary modeling. In both domains, we foundthat our model outperformed the standard rate matrix estimation approach.

View record

 
 

If this is your researcher profile you can log in to the Faculty & Staff portal to update your details and provide recruitment preferences.