#### Nikola Surjanovic

Doctor of Philosophy in Statistics (PhD)

**Research Topic**

Scalable Bayesian Methods for Sampling From Complex Probability Distributions

- Graduate School
- »
- Prospective Students »
- Research Supervisors »
- Alexandre Bouchard-Cote

Professor

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Bayesian models for hierarchical clustering of network data (2023)

Network data represent relational information between interacting entities. They can be described by graphs, where vertices denote the entities and edges mark the interactions. Clustering is a common approach for uncovering hidden structure in network data. The objective is to find related groups of vertices, through their shared edges. Hierarchical clustering identifies groups of vertices across multiple scales, where a dendrogram represents the full hierarchy of clusters. Often, Bayesian models for hierarchical clustering of network data aim to infer the posterior distribution over dendrograms.The Hierarchical Random Graph (HRG) is likely the most popular Bayesian approach to hierarchical clustering of network data. Due to simplifications made in its inference scheme, we identify some potentially undesirable model behaviour. Mathematically, we show that this behaviour presents in two ways: symmetry of the likelihood for graphs and their complements, and non-uniformity of the prior. The latter is exposed by finding an equivalent interpretation of the HRG as a proper Bayesian model, with normalized likelihood. We show that the amount of non-uniformity is exacerbated as the size of the network data increases. In rectifying the issues with the HRG, we propose a general class of models for hierarchical clustering of network data. This class is characterized by a sampling construction, defining a generative process for simple graphs, on fixed vertex sets. It permits a wide range of probabilistic models, via the choice of distribution over edge counts between clusters. We present four Bayesian models from this class, and derive their respective properties, like expected edge density and independence. For three of these models, we derive a closed-form expression for the marginalized posterior distribution over dendrograms, to isolate the problem of inferring a hierarchical clustering. We implement these models in a probabilistic programming language that leverages state-of-the-art approximate inference methods. As our class of models use a uniform prior over dendrograms, we construct an algorithm for sampling from this prior. Finally, the empirical performance of our models is demonstrated on examples of real network data.

View record

Non-reversible parallel tempering on optimized paths (2023)

Parallel tempering (PT) methods are a popular class of Markov chain Monte Carlo schemes used to sample complex high-dimensional probability distributions. They rely on a collection of N interacting auxiliary chains targeting tempered versions of the target distribution to improve the exploration of the state-space. We provide here a new perspective on these highly parallel algorithms and their tuning by identifying and formalizing a sharp divide in the behaviour and performance of reversible versus non-reversible PT schemes. We show theoretically and empirically that a class of non-reversible PT methods dominates its reversible counterparts. These results are exploited to identify the optimal annealing schedule for non-reversible PT and to develop an iterative scheme approximating this schedule. The proposed methodology is applicable to sample from a distribution π₁ with respect to a reference distribution π₀, and to compute normalizing constants. We provide a wide range of numerical examples supporting our theoretical and methodological contributions. The performance of non-reversible PT depends on how quickly a sample from the reference distribution makes its way to the target, which in turn depends on the particular path of annealing distributions. Traditionally PT has used only simple paths constructed from convex combinations of the reference and target log-densities; we demonstrate that this path performs poorly in the setting where the reference and target are nearly mutually singular. To address this issue, we expand the framework of PT to general families of paths, formulate the choice of a path as an optimization problem that admits tractable gradient estimates, and propose a flexible new family of spline interpolation paths for use in practice. We show that PT induces a geometry on the space of probability distributions and characterize these optimal paths as length minimizing geodesics between π₀ and π₁. Theoretical and empirical results demonstrate that our proposed methodology breaks previously-established upper-performance limits for traditional linear paths. Finally, we identify distinct scaling limits for the non-reversible and reversible schemes, the former being a piecewise-deterministic Markov process and the latter a diffusion.

View record

Quantitative fitness modelling in cancer using single cell timeseries population dynamics (2021)

Tumour fitness landscapes underpin selection in cancer, impacting evolution and response to treatment. Quantitative fitness modelling of cancer cells has numerous and diverse implications: attributing clonal dynamics to drift or selection, identifying the determinants of clonal expansion, and forecasting tumour growth trajectories. Why and how drug resistance evolves is among the key unresolved areas of investigation that require advanced understanding of fitness in cancer. Longitudinal xenoengraftment interrogated via next generation single cell sequencing (SCS) has enabled more accurate, quantitative measurements of tumours as they evolve. This process generates timestamped samples comprising thousands of cells each measured at thousands of copy number aberrations (CNA). Major analytical challenges introduced by this new datatype include (i) how to identify biologically meaningful groups of cells (i.e., clones) across multiple timepoints, and (ii) how to quantitatively reason about the underlying evolutionary forces acting on the clones via their observed dynamics.To address the first problem, we describe and provide supplementary tools for sitka, a scalable Bayesian phylogenetic inference method in Chapter 2. It resolves the clonal structure of a heterogeneous tumour cell population sampled over multiple timepoints by reconstructing the evolutionary relationship between single cells from their inferred CNA profiles. We then develop Lumberjack, a tree-cutting algorithm, and use it to assign cells to clones. We address the second problem in Chapter 3 by developing fitClone, a Bayesian probabilistic framework that ascribes quantitative selection coefficients to individual cancer clones and forecasts competitive clonal dynamics over time. In Chapter 4, we exemplify the computational models introduced above on real-world data collected from cancer cells over a multi-year period to verify two key hypotheses, that (i) clonal dynamics in a pre-treatment triple negative breast cancer (TNBC) tumour is quantifiably reproducible, and that (ii) the fitness landscape is reversed under early response to cisplatin treatment. Our results show that population genetic modelling of timeseries tumour measurements to predict clonal evolution is tractable. Further study with timeseries modelling will provide insight into therapeutic strategies promoting early intervention, drug combinations and evolution-aware approaches to clinical management.

View record

Bayesian analysis of continuous time Markov chains with applications to phylogenetics (2019)

Bayesian analysis of continuous time, discrete state space time series is an important and challenging problem, where incomplete observation and large parameter sets call for user-defined priors based on known properties of the process. Generalized Linear Models (GLM) have a largely unexplored potential to construct such prior distributions. We show that an important challenge with Bayesian generalized linear modelling of Continuous Time Markov Chains (CTMCs) is that classical Markov Chain Monte Carlo (MCMC) techniques are too ineffective to be practical in that setup. We propose two computational methods to address this issue. The first algorithm uses an auxiliary variable construction combined with an Adaptive Hamiltonian Monte Carlo (AHMC) algorithm. The second algorithm combines Hamiltonian Monte Carlo (HMC) and Local Bouncy Particle Sampler (LBPS) to take advantage of the sparsity in certain high-dimensional rate matrices. We propose a characterization for a class of sparse factor graphs, where LBPS can be efficient. We also provide a framework to assess the computational complexity for the algorithm. An important aspect of practical implementation of Bouncy Particle Sampler (BPS) is the simulation of event times. Default implementations use conservative thinning bounds. Such bounds can slow down the algorithm and limit the computational performance. In the second algorithm, we develop exact analytical solutions to the random event times in the context of CTMCs. Both sampling algorithms and our model make it efficient both in terms of computation and analyst's time to construct stochastic processes informed by prior knowledge, such as known properties of the states of the process. We demonstrate the flexibility and scalability of our framework using both synthetic and real phylogenetic protein data.

View record

Scalable sequential Monte Carlo methods and probabilistic approach to combinatorial problems (2018)

In this thesis, we consider two combinatorial problems arising in phylogenetics and computational forestry. We develop sequential Monte Carlo based inference methods to tackle the challenges arising from these applications. In the phylogenetics application, the SMC is placed under stringent restrictions in terms of computer memory to the point where using sufficient number of particles may be prohibitive. To that end, a new method, streaming particle filter (SPF), is proposed which alleviates this problem and study some of the theoretical properties of the algorithm, including L² convergence of the estimators derived from the SPF simulation. In the case of the computational forestry, the combinatorial problem that is tackled in this thesis is referred to as the knot matching problem. This problem is formulated as a graph matching problem on a K-partite hypergraph. A model for graph matching that admits efficient parameter inference is proposed along with a sequential Monte Carlo sampler for graph matching based on this model. A detailed background on each of the two problems are presented and evaluated on simulated and real data.

View record

Practical Bayesian optimization with application to tuning machine learning algorithms (2016)

Bayesian optimization has recently emerged in the machine learning community as a very effective automatic alternative to the tedious task of hand-tuning algorithm hyperparameters. Although it is a relatively new aspect of machine learning, it has known roots in the Bayesian experimental design (Lindley, 1956; Chaloner and Verdinelli, 1995), the design and analysis of computer experiments (DACE; Sacks et al., 1989), Kriging (Krige, 1951), and multi-armed bandits (Gittins, 1979). In this thesis, we motivate and introduce the model-based optimization framework and provide some historical context to the technique that dates back as far as 1933 with application to clinical drug trials (Thompson, 1933). Contributions of this work include a Bayesian gap-based exploration policy, inspired by Gabillon et al. (2012); a principled information-theoretic portfolio strategy, out-performing the portfolio of Hoffman et al. (2011); and a general practical technique circumventing the need for an initial bounding box. These various works each address existing practical challenges in the way of more widespread adoption of probabilistic model-based optimization techniques. Finally, we conclude this thesis with important directions for future research, emphasizing scalability and computational feasibility of the approach as a general purpose optimizer.

View record

Stochastic processes, statistical inference and efficient algorithms for phylogenetic inference (2016)

Phylogenetic inference aims to reconstruct the evolutionary history of populations or species. With the rapid expansion of genetic data available, statistical methods play an increasingly important role in phylogenetic inference by analyzing genetic variation of observed data collected at current populations or species. In this thesis, we develop new evolutionary models, statistical inference methods and efficient algorithms for reconstructing phylogenetic trees at the level of populations using single nucleotide polymorphism data and at the level of species using multiple sequence alignment data. At the level of populations, we introduce a new inference method to estimate evolutionary distances for any two populations to their most recent common ancestral population using single-nucleotide polymorphism allele frequencies. Our method is based on a new evolutionary model for both drift and fixation. To scale this method to large numbers of populations, we introduce the asymmetric neighbor-joining algorithm, an efficient method for reconstructing rooted bifurcating trees. Asymmetric neighbor-joining provides a scalable rooting method applicable to any non-reversible evolutionary modelling setup. We explore the statistical properties of asymmetric neighbor-joining, and demonstrate its accuracy on synthetic data. We validate our method by reconstructing rooted phylogenetic trees from the Human Genome Diversity Panel data. Our results are obtained without using an outgroup, and are consistent with the prevalent recent single-origin model of human migration. At the level of species, we introduce a continuous time stochastic process, the geometric Poisson indel process, that allows indel rates to vary across sites. We design an efficient algorithm for computing the probability of a given multiple sequence alignment based on our new indel model. We describe a method to construct phylogeny estimates from a fixed alignment using neighbor-joining. Using simulation studies, we show that ignoring indel rate variation may have a detrimental effect on the accuracy of the inferred phylogenies, and that our proposed method can sidestep this issue by inferring latent indel rate categories. We also show that our phylogenetic inference method may be more stable to taxa subsampling in a real data experiment compared to some existing methods that either ignore indels or ignore indel rate variation.

View record

Bayesian phylogenetic inference via Monte Carlo methods (2012)

A main task in evolutionary biology is phylogenetic tree reconstruction, whichdetermines the ancestral relationships among di erent species based on observedmolecular sequences, e.g. DNA data. When a stochastic model, typically continuoustime Markov chain (CTMC), is used to describe the evolution, the phylogenetic inferencedepends on unknown evolutionary parameters (hyper-parameters) in thestochastic model. Bayesian inference provides a general framework for phylogeneticanalysis, able to implement complex models of sequence evolution and toprovide a coherent treatment of uncertainty for the groups on the tree. The conventionalcomputational methods in Bayesian phylogenetics based on Markov chainMonte Carlo (MCMC) cannot e ciently explore the huge tree space, growing superexponentially with the number of molecular sequences, due to di culties ofproposing tree topologies. sequential Monte Carlo (SMC) is an alternative to approximateposterior distributions. However, it is non-trivial to directly apply SMCto phylogenetic posterior tree inference because of its combinatorial intricacies.We propose the combinatorial sequential Monte Carlo (CSMC) method to generalizeapplications of SMC to non-clock tree inference based on the existence ofa flexible partially ordered set (poset) structure, and we present it in a level of generalitydirectly applicable to many other combinatorial spaces. We show that theproposed CSMC algorithm is consistent and fast in simulations. We also investigatetwo ways of combining SMC and MCMC to jointly estimate the phylogenetictrees and evolutionary parameters, particle Markov chain Monte Carlo (PMCMC)algorithms with CSMC at each iteration and an SMC sampler with MCMC moves.Further, we present a novel way to estimate the transition probabilities for a generalCTMC, which can be used to solve the computing bottleneck in a general evolutionary model, string-valued continuous time Markov Chain (SCTMC), that canincorporate a wide range of molecular mechanisms.

View record

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

A bayesian nonparametric model for RNA-sequencing data (2022)

Cancers from different tissue types can share a latent structure reflecting commonly altered gene pathways. It is difficult to cluster cancer patients based on this latent structure because the tissue of origin often dominates the latent structure effect. We propose a Bayesian nonparametric model that accounts for the tissue effect and clusters based on a latent structure using a Dirichlet Process prior. More specifically, we use an infinite Gaussian mix- ture model where the mean parameter is modelled as the linear combination of tissue, gene, and latent cluster effects. The choice of the Dirichlet Pro- cess prior allows us to side-step a model selection problem as the number of latent clusters is unknown apriori. Our approach learns the tissue effect by using tissue parameters in a supervised learning setting, while simultaneously learning the latent structure based on the residuals in an unsupervised setting. These so-called residuals result from subtracting out the inferred tissue and gene parameters from the observations and can be interpreted as the cluster effect. A key component of the model is its ability to leverage conjugacy between the likelihood model and cluster parameters. The Gaussian form of the model is not effect by our choice of mean parameter therefore conjugacy is preserved. Indeed, the model has the intuitive interpretation of clustering on the cluster effect signal that remains subtracting out the tissue and gene effects. Conjugacy allows for the use of sophisticated Markov chain Monte Carlo techniques used in Bayesian mixture models such as Split-Merge samplers. We demonstrate our model by showing results on synthetic data, semi-synthetic data generated using a publicly available dataset from the Genome-Tissue Expression (GTEx) portal, and another publicly available dataset from the International Cancer Genome Consortium (ICGC).

View record

Differential RNA expression analysis across cancer clonal populations using Bayesian phylogenetic modelling (2022)

Cancer cells follow evolutionary principles leading to clonal populations that are nearly genetically identical, called clones. Using single-cell DNA sequencing technologies, a copy number profile of each clone can be found. This copy number profile can then be used to infer the relatedness of the cells in the sample, represented by a phylogeny. Clades in the phylogeny are determined to be the clones based on similarity in mutation profiles. From the same tumour sample, single-cell RNAsequencing allows the measurement of the gene expression. These cells can then be matched to the clones based on the copy number changes and abundances in expression in those regions. From here, it is natural to compare the gene expression between the clones, as this phenotypic characterization can provide insight into treatment resistance and metastasis. Tools for differential expression analyses that compare the clones cannot determine exactly where in their evolutionary history the changes in gene expression occurred. We propose to use a Bayesian phylogenetic model to infer the expression states at ancestral, unobserved clones. This model takes as input the phylogeny determined from the DNA mutation profiles as well as RNA sequencing counts matched to the clones. The RNA sequencing data is binned into discrete values to be used in the discrete trait phylogenetic model. We use a Markov Chain Monte Carlo Algorithm to infer parameters of the model which are then used to infer the ancestral gene expression states. After the ancestral states have been inferred, we find genes that had changes in expression state alongeach branch. We test the performance of this approach on synthetic and real data, and demonstrate how the results can be used for downstream analysis to identify changes in expression during cancer evolution.

View record

Slice sampling for general completely random measures (2021)

Completely random measures provide a principled approach to creating flexible unsupervised models, where the number of latent features is infinite and the number of features that influence the data grows with the size of the data set. Due to the infinity the latent features, posterior inferencerequires either marginalization---resulting in dependence structures that prevent efficient computation via parallelization and conjugacy---or finite truncation, which arbitrarily limits the flexibility of the model. In this paper we present a novel Markov chain Monte Carlo algorithm for posterior inference that adaptively sets the truncation level using auxiliary slice variables, enabling efficient, parallelized computation without sacrificing flexibility. In contrast to past work that achieved this on a model-by-model basis, we provide a general recipe that is applicable to the broad class of completely random measure-based priors. The efficacy of the proposed algorithmis evaluated on several popular nonparametric models, demonstrating a higher effective sample size per second compared to algorithms using marginalization as well as a higher predictive performance compared to models employing fixed truncations.

View record

dd-PyClone: Improving Clonal Subpopulation Inference from Single Cells and Bulk Sequencing Data (2016)

Improving our understanding of intra-tumour heterogeneity in cancer has important clinical implications, including an opportunity to understand mechanisms behind relapses and drug resistance. Next generation bulk sequencing is a mature tech- nology that has been used to study subclonal tumour populations at an aggregate level. Inference of populations from bulk sequencing requires sophisticated com- putational deconvolution methods. An alternative is to identify populations directly with single cell sequencing. However, single cell sequencing is a very error-prone process, and this impedes its ability to completely replace bulk sequencing for now.In this work we present dd-PyClone, a statistical model to combine single cell and bulk sequencing data to study clonal subpopulation architecture and improve clustering assignment and cellular prevalence estimates of a set of genomic loci.We introduce a single nucleotide variant and copy number aberration aware genotype simulation scheme based on a phylogenetic tree, termed the Generalized Dollo model. This model is an improvement over previous genotype generator models in that it also accounts for the evolutionary process before a rare event (here the single nucleotide variant) occurs.We show that incorporating genomic loci co-occurrence patterns from single cell sequencing studies in inferring clonal subpopulation structure from bulk se- quencing data is beneficial. Our method outperforms existing methods in simula- tion studies and performs comparably in real dataset benchmarking. We also show that our method is fairly robust as to the choice of hyperparameters and performs reasonably in presence of noise. We hope that our method will further the under- standing of the evolutionary basis of cancer.

View record

Poisson Process Infinite Relational Model: A Bayesian nonparametric model for transactional data (2016)

Transactional data consists of instantaneously occurring observations made on ordered pairs of entities. It can be represented as a network---or more specifically, a directed multigraph---with edges possessing unique timestamps. This thesis explores a Bayesian nonparametric model for discovering latent class-structure in transactional data. Moreover, by pooling information within clusters of entities, it can be used to infer the underlying dynamics of the time-series data. Blundell, Beck, and Heller (2012) originally proposed this model, calling it the Poisson Process Infinite Relational Model; however, this thesis derives and elaborates on the necessary procedures to implement a fully Bayesian approximate inference scheme. Additionally, experimental results are used to validate the computational correctness of the inference algorithm. Further experiments on synthetic data evaluate the model's clustering performance and assess predictive ability. Real data from historical records of militarized disputes between nations test the model's capacity to learn varying degrees of structured relationships.

View record

Divide and Conquer Sequential Monte Carlo for Phylogenetics (2015)

Recently reconstructing evolutionary histories has become a computational issue due to the increased availability of genetic sequencing data and relaxations of classical modelling assumptions. This thesis specializes a Divide & conquer sequential Monte Carlo (DCSMC) inference algorithm to phylogenetics to address these challenges. In phylogenetics, the tree structure used to represent evolutionary histories provides a model decomposition used for DCSMC. In particular, speciation events are used to recursively decompose the model into subproblems. Each subproblem is approximated by an independent population of weighted particles, which are merged and propagated to create an ancestral population. This approach provides the flexibility to relax classical assumptions on large trees by parallelizing these recursions.

View record

DrSMC: A Sequential Monte Carlo Sampler for Deterministic Relationships on Continuous Random Variables (2015)

Computing posterior distributions over variables linked by deterministic constraints is a recurrent problem in Bayesian analysis. Such problems can arise due to censoring, identifiability issues, or other considerations. It is well-known that standard implementations of Monte Carlo inference strategies break down in the presence of these deterministic relationships. Although several alternative Monte Carlo approaches have been recently developed, few are applicable to deterministic relationships on continuous random variables. In this thesis, I propose Deterministic relationship Sequential Monte Carlo (DrSMC), a new Monte Carlo method for continuous variables possessing deterministic constraints. My exposition focuses on developing a DrSMC algorithm for computing the posterior distribution of a continuous random vector given its sum. I derive optimal settings for this algorithm and compare its performance to that of alternative approaches in the literature.

View record

Entangled Monte Carlo (2013)

A recurrent problem in statistics is that of computing an expectation involving intractable integration. In particular, this problem arises inBayesian statistics when computing an expectation with respect to a posterior distribution known only up to a normalizing constant. A common solution is to use Monte Carlo simulation to estimate the target expectation. Two of the mostcommonly adopted simulation methods are Markov Chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) methods. However, these methods fail to scale up with the size of the inference problem. For MCMC, the problem takes the form of simulations that must be ran for a long time in order to obtain an accurate inference. For SMC, one may not be able to store enough particles to exhaustively explore the state space. We propose a novel scalable parallelization of Monte Carlo simulation, Entangled Monte Carlo simulation, that can scale up with the size of the inference problem. Instead of transmitting particles over the network, our proposed algorithm reconstructs the particles from the particle genealogy using the notion of stochastic maps borrowed from perfect simulation literature. We propose bounds on the expected time for particles to coalesce based on the coalescent model. Our empirical results also demonstrate the efficacy of our method on datasets from the field of phylogenetics.

View record

Inference of rates across sites via an expectation maximization algorithm (2013)

The rates of nucleotide substitution can be different from genes to genes. Moreover, different regions of the same gene can have different rates of mutation as well. Many attempts have been tried to allow for the variable rates across different nucleotide sites. A rate factor coming from the continuous distribution has been introduced to deal with the problem. However, for computation reasons, this method can only scale to less than a dozen sequences. Later studies use a discrete gamma distribution to approximate the gamma distribution. The main contribution of our work is that we propose a discrete distribution over the rate factor which is more flexible while preserving attractive computational properties. We make inference about the rate factor and its distribution via an Expectation Maximization (EM) algorithm. We evaluate our method by both simulations and a real dataset. From the real dataset, it reflects that the method is useful for large phylogenies with even thousands of sequences. We analyze the identifiability of our model for a pair of DNA sequences under certain conditions. We also prove for certain types of rate matrices, this model is non-identifiable.

View record

Nonparametric Bayesian models for Markov jump processes (2012)

Markov jump processes (MJPs) have been used as models in various fields such asdisease progression, phylogenetic trees, and communication networks. The mainmotivation behind this thesis is the application of MJPs to data modeled as havingcomplex latent structure. In this thesis we propose a nonparametric prior, thegamma-exponential process (GEP), over MJPs. Nonparametric Bayesian modelshave recently attracted much attention in the statistics community, due to their flexibility,adaptability, and usefulness in analyzing complex real world datasets. TheGEP is a prior over infinite rate matrices which characterize an MJP; this prior canbe used in Bayesian models where an MJP is imposed on the data but the number ofstates of the MJP is unknown in advance. We show that the GEP model we proposehas some attractive properties such as conjugacy and simple closed-form predictivedistributions. We also introduce the hierarchical version of the GEP model;sharing statistical strength can be considered as the main motivation behind the hierarchicalmodel. We show that our hierarchical model admits efficient inferencealgorithms. We introduce two inference algorithms: 1) a “basic” particle Markovchain Monte Carlo (PMCMC) algorithm which is an MCMC algorithm with sequencesproposed by a sequential Monte Carlo (SMC) algorithm; 2) a modifiedversion of this PMCPC algorithm with an “improved” SMC proposal. Finally, wedemonstrate the algorithms on the problems of estimating disease progression inmultiple sclerosis and RNA evolutionary modeling. In both domains, we foundthat our model outperformed the standard rate matrix estimation approach.

View record

This is a small sample of students and/or alumni that have been supervised by this researcher. It is not meant as a comprehensive list.