Relevant Degree Programs
Affiliations to Research Centres, Institutes & Clusters
Graduate Student Supervision
Master's Student Supervision (2010 - 2021)
Boosting variational inference (BVI) approximates Bayesian posterior distributions by iteratively building a mixture of components. However, BVI requires greedily optimizing the next component—an optimization problem that becomes increasingly computationally expensive as more components are added to the mixture. Furthermore, previous work has only used simple (i.e., Gaussian) component distributions; in practice, many of these components are needed to obtain a reasonable approximation. These shortcomings can be addressed by considering components that adapt to the target density. However, natural choices such as MCMC chains do not have tractable densities and thus require a density-free divergence for training. As a first contribution, we show that the kernelized Stein discrepancy—which to the best of our knowledge is the only density-free divergence feasible for VI—cannot detect when an approximation is missing modes of the target density. Hence, it is not suitable for boosting components with intractable densities. As a second contribution, we develop locally-adaptive boosting variational inference (LBVI), in which each component distribution is a Sequential Monte Carlo (SMC) sampler, i.e., a tempered version of the posterior initialized at a given simple reference distribution. Instead of greedily optimizing the next component, we greedily choose to add components to the mixture and perturb their adaptivity, thereby causing them to locally converge to the target density; this results in refined approximations with considerably fewer components. Moreover, because SMC components have tractable density estimates, LBVI can be used with common divergences (such as the Kullback–Leibler divergence) for model learning. Experiments show that, when compared to previous BVI methods, LBVI produces reliable inference with fewer components and in less computation time.
The primary goal of factor analysis is to uncover a set of latent factors that can explain the variation in the data. Principal Component Analysis is one approach that estimates the factors by a set of orthogonal vectors. However, it may be difficult to interpret the factors as-is, and so it is common to rotate the estimated factors to make their coefficients as sparse as possible to improve interpretability. Varimax is the most popular method for factor rotations, and its statistical properties have been studied in recent literature. Entromin is another factor rotation method that is less commonly used and not as well-studied, but there exists conventional wisdom that Entromin generally finds sparser rotations compared to Varimax. In this thesis, we aim to explain the sparsity claim for Entromin by studying its statistical properties. Our main contributions include several theoretical results that take steps towards this aim. We show that Varimax is a first-order approximation of Entromin, and that generalizing this connection leads to a family of Entromin approximations. We derive the conditions under which the second-order approximation can be viewed as performing statistical inference in a latent factor model. We then make the connection between optimizing the Entromin criterion and recovering sparsity in the factors. Other contributions of this thesis include novel connections to statistical concepts that have not been made in the literature to our knowledge, and an empirical study of Entromin on a dataset of New York Times articles.
A useful step in data analysis is clustering, in which observations are grouped together in a hopefully meaningful way. The mainstay model for Bayesian nonparametric clustering is the Dirichlet process mixture model, which has one key advantage of inferring the number of clusters automatically. However, the Dirichlet process mixture model has particular characteristics, such as linear growth in the size of clusters and exchangeability, that may not be suitable modelling choices for some data sets, so there is further research to be done into other Bayesian nonparametric models with characteristics that differ from that of the Dirichlet process mixture model while maintaining automatic inference of the number of clusters.In this thesis, we introduce the Neutral-to-the-Left mixture model, a family of Bayesian nonparametric infinite mixture models which serves as a strict generalization of the Dirichlet process mixture model. This family of mixture models has two key parameters: the distribution of arrival times of new clusters, and the parameters of the stick breaking distribution, whose customization allows the user to inject prior beliefs regarding the structure of the clusters into the model. We describe collapsed Gibbs and Metropolis–Hastings samplers to infer the posterior distribution of clusterings given data. We consider one particular parameterization of the Neutral-to-the-Left mixture model with characteristics that are distinct from that of the Dirichlet process mixture model, evaluate its performance on simulated data, and compare these to results from a Dirichlet process mixture model. Finally, we explore the utility of the Neutral-to-the-Left mixture model on real data by applying the model to cluster tweets.