Doctor of Philosophy in Computer Science (PhD)
Optimization for Machine Learning
This thesis makes contributions to two problems in learning theory: prediction with expert advice and learning mixtures of Gaussians.The problem of prediction with expert advice can be cast as a sequential game between an algorithm and an adversary as follows. At each time step, an algorithm chooses one of n options (or experts) and the adversary sets a cost for each expert. The algorithm's goal is to minimize its regret, i.e. its cost relative to the best expert in hindsight. The celebrated multiplicative weights algorithm is known to be optimal if the the game is terminated at a fixed, known time and the number of experts is large. Optimal algorithms are also known when the number of experts is 2, 3, or 4.If the game does not terminate at a known time or is run indefinitely, the optimal algorithm is not known for any number of experts. We contribute to this problem by giving the optimal algorithm when there are 2 experts. Our algorithm is designed by considering a continuous analogue of the problem, which is solved using ideas from stochastic calculus.In the second part of the thesis, we look at distribution learning, which is a fundamental task in statistics that has been studied for over a century. We consider such a problem where the distribution is a mixture of k Gaussians in d dimensions. The objective is density estimation: given i.i.d. samples from the unknown distribution, produce a distribution whose total variation from the unknown distribution is within some desired accuracy. We contribute to this problem by designing an algorithm with near-optimal sample complexity.
Online convex optimization (OCO) is a powerful algorithmic framework that has extensive applications in different areas. Regret is a commonly-used measurement for the performance of algorithms in this framework. Lipschitz continuity of the cost functions is commonly assumed in order to obtain sublinear regret, that is to say, this condition is usually necessary for theoretical guarantees for good performances of OCO algorithms. Moreover, strong convexity of cost functions can sometimes give even better theoretical performance bounds, more specifically, logarithmic regret. Recently, researchers from convex optimization proposed the notions of “relative Lipschitz continuity” and “relative strong convexity”. Both of the notions are generalizations of their classical counterparts. It has been shown that subgradient methods in the relative setting have performance analogous to their performance in the classical setting.In this work, we consider OCO for relative Lipschitz and relative strongly con- vex functions. We extend the known regret bounds for classical OCO algorithms to the relative setting. Specifically, we show regret bounds for the follow the regularized leader algorithms and a variant of online mirror descent. Due to the generality of these methods, these results yield regret bounds for a wide variety of OCO algorithms. Furthermore, we extend the results to algorithms with extra regularization such as regularized dual averaging.
Consider the problem of minimizing functions that are Lipschitz and strongly convex, but not necessarily differentiable. We prove that after T steps of stochastic gradient descent (SGD), the error of the final iterate is O(logT⁄T) with high probability. We also construct a function from this class for which the error of the final iterate of deterministic gradient descent is Ω(logT⁄T). This shows that the upper bound is tight and that, in this setting, the last iterate of stochastic gradient descent has the same general error rate (with high probability) as deterministic gradient descent. This resolves both open questions posed by Shamir (2012). We prove analogous results for functions which are Lipschitz and convex, but not necessarily strongly convex or differentiable. After T steps of stochastic gradient descent, the error of the final iterate is O(logT⁄√T) with high probability, and there exists a function for which the error of the final iterate of deterministic gradient descent is Ω(logT⁄√T). In the strongly-convex setting, several forms of SGD, including suffix averaging, are known to achieve the optimal O(1⁄T) convergence rate in expectation. An intermediate step of our high probability analysis for the error of the final iterate proves that the suffix averaging method achieves error O(1⁄T) with high probability, which is optimal (for any first-order optimization method). This improves results of Rakhlin et al. (2012) and Hazan and Kale (2014), both of which achieved error O(1⁄T), but only in expectation, and achieved a high probability error bound of O(loglogT ⁄T), which is suboptimal. This is the first known high-probability result which attains the optimal O(1⁄T) rate. We also consider a simple, non-uniform averaging strategy of Lacoste-Julien et al. (2012) and prove that it too achieves the optimal O(1⁄T) convergence rate with high probability. This provides a second algorithm which achieves the optimal O(1⁄T) convergence rate with high-probability. Our high-probability results are proven using a generalization of Freedman's Inequality which we develop.
Low-stretch trees are spanning trees which provide approximate distance preservation for edges in the original graph by minimizing stretch. We explore the application of these trees to network visualization. In particular, we present a novel edge bundling technique, LSTB, that computes edge bundles explicitly and efficiently and does not rely on fixed vertex positions. This approach is in contrast to previous methods, which require the user to provide a layout of the input graph. We introduce an abstract framework for edge bundling methods, which provides a unifying formalization of bundling terminology and techniques, as well as a classification of such methods. Based on this framework, LSTB provides algorithmic support for sophisticated visual encodings, including dynamic layout adjustment and interactive bundle querying.In addition, we explore the use of the multiplicative weights update method to compute a distribution over low-stretch trees in order to achieve low stretch for all edges in expectation, rather than on average. We present the results of using this distribution in place of a single low-stretch tree as a routing graph for LSTB. While the distribution provides better stretch guarantees, we find that from a visual perspective a single low-stretch tree provides a better routing graph for the LSTB edge bundling application.
Herding is an algorithm of recent interest in the machine learning community, motivated by inference in Markov random fields. It solves the following Sampling Problem: given a set Χ \subset R^d with mean μ, construct an infinite sequence of points from Χ such that, for every t ≥ 1, the mean of the first t points in that sequence lies within Euclidean distance O(1/t) of μ. The error of a solution to Sampling Problem is defined to be the distance between the empirical mean of the first t samples and the original mean μ.The O(1/t) error bound suppresses the dependence on d and Χ. In this thesis, we study the best dependence on d and |Χ| that can be achieved for the error in Sampling Problem. Known analysis of the Herding algorithm give an error bound that depends on geometric properties of Χ but, even under favorable conditions, this bound depends linearly on d.We first show that any algorithm for the Sampling Problem must have error Ω(√d/t). Afterward, we present a new polynomial-time algorithm that solves the Sampling Problem with error O(√d log^2.5|Χ|/t) assuming that Χ is finite. This implies that our algorithm is optimal to within logarithmic factors. Finally, we prove that the actual error of the Herding Algorithm is strictly worse than the error of our algorithm if we measure the error in the infinity-norm. Our algorithm is randomized and based on recent algorithmic results in discrepancy theory. We implement our algorithm and other potential solutions for the Sampling Problem and evaluate them on various inputs.