Mark Schmidt

Associate Professor

Graduate Student Supervision

Doctoral Student Supervision (Jan 2008 - May 2021)
Efficiently estimating kinetics of interacting nucleic acid strands modeled as continuous-time Markov chains (2021)

No abstract available.

Pragmatic investigations of applied deep learning in computer vision applications (2021)

Deep neural networks have dominated performance benchmarks on numerous machine learning tasks. These models now power the core technology of a growing list of products such as Google Search, Google Translate, Apple Siri, and even Snapchat, to name a few. We first address two challenges in real-world applications of deep neural networks in computer vision: data scarcity and prediction reliability. We present a new approach to data collection through synthetic data via video games that is cost-effective and can produce high-quality labelled training data on a large scale. We validate the effectiveness of synthetic data on multiple problems through cross-dataset evaluation and simple adaptive techniques. We also examine the reliability of neural network predictions in computer vision problems and show that these models are fragile on out-of-distribution test data. Motivated by statistical learning theory, we argue that it is necessary to detect out-of-distribution samples before relying on the predictions. To facilitate the development of reliable out-of-distribution sample detectors, we present a less biased evaluation framework. Using our framework, we thoroughly evaluate more than ten methods from the data mining, deep learning, and Bayesian literatures. We show that on real-world problems, none of the evaluated methods can reliably certify a prediction. Finally, we explore the applications of deep neural networks in high-resolution portrait production pipelines. We introduce AutoPortrait, a pipeline that performs professional-grade colour correction, portrait cropping, and portrait retouching in under two seconds. We release the first large-scale professional retouching dataset.
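
The detect-before-predict idea above can be illustrated with a toy score-and-threshold detector. This is a generic stand-in (distance to the training mean), not one of the methods evaluated in the thesis; the data, shift, and threshold are illustrative assumptions.

```python
import numpy as np

# Toy out-of-distribution (OOD) detector: score test points by distance
# to the training mean and flag the highest scores. A generic stand-in,
# not one of the methods evaluated in the thesis.
rng = np.random.default_rng(8)
train = rng.standard_normal((500, 4))            # in-distribution training data
mu = train.mean(axis=0)

def ood_score(x):
    return np.linalg.norm(x - mu)                # higher score = more suspicious

in_test = rng.standard_normal((200, 4))          # in-distribution test data
out_test = rng.standard_normal((200, 4)) + 4.0   # shifted, out-of-distribution
in_scores = np.array([ood_score(x) for x in in_test])
out_scores = np.array([ood_score(x) for x in out_test])

threshold = np.quantile(in_scores, 0.95)         # flag the top 5% as OOD
detection_rate = float(np.mean(out_scores > threshold))
```

Calibrating the threshold on in-distribution scores (rather than on the OOD data) mirrors the evaluation concern raised in the abstract: the detector must be tuned without peeking at the out-of-distribution samples.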

Practical optimization methods for machine learning models (2020)

This work considers optimization methods for large-scale machine learning (ML). Optimization in ML is a crucial ingredient in the training stage of ML models. Optimization methods in this setting need to have a cheap iteration cost. First-order methods are known to have reasonably low iteration costs. A notable recent class of stochastic first-order methods leverages variance reduction techniques to improve convergence speed. This group includes the stochastic average gradient (SAG), stochastic variance reduced gradient (SVRG), and stochastic average gradient amélioré (SAGA) methods. The SAG and SAGA approaches to variance reduction use additional memory in their algorithms. SVRG, on the other hand, does not need additional memory but requires occasional full-gradient evaluations. We first introduce variants of SVRG that require fewer gradient evaluations. We then present the first linearly convergent stochastic gradient method to train conditional random fields (CRFs) using SAG. Our method addresses the memory issues of SAG and proposes a better non-uniform sampling (NUS) technique. The third part of this work extends the applicability of SAGA to Riemannian manifolds. We modify SAGA with operations that exist on the manifold to improve its convergence speed in these new spaces. Finally, we consider the convergence of classic stochastic gradient methods, based on mirror descent (MD), in the non-convex setting. We analyze MD with a more general divergence function and show its application to variational inference models.
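
The SVRG scheme mentioned above (occasional full gradients, no per-example memory) can be sketched on a toy least-squares problem. The problem, step size, and epoch count here are illustrative assumptions, not the thesis's experimental setup.

```python
import numpy as np

# Minimal SVRG sketch on least squares f(w) = 1/(2n) * ||Aw - b||^2,
# with a consistent (interpolating) system so the optimum is known.
rng = np.random.default_rng(0)
n, d = 30, 4
A = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
b = A @ w_true                          # consistent system

def grad_i(w, i):
    # Gradient of the i-th term f_i(w) = 1/2 * (a_i . w - b_i)^2.
    return (A[i] @ w - b[i]) * A[i]

L_max = np.max(np.sum(A * A, axis=1))   # largest per-example smoothness
step = 1.0 / (10 * L_max)               # conservative constant step size
w = np.zeros(d)
for epoch in range(60):
    w_snap = w.copy()
    mu = A.T @ (A @ w_snap - b) / n     # occasional full-gradient evaluation
    for _ in range(n):                  # inner loop of variance-reduced steps
        i = rng.integers(n)
        v = grad_i(w, i) - grad_i(w_snap, i) + mu
        w -= step * v

error = np.linalg.norm(w - w_true)
```

The correction term `grad_i(w_snap, i) - mu` is what removes the variance of the plain stochastic gradient as `w` approaches the snapshot, at the cost of the periodic full-gradient pass that the thesis's SVRG variants aim to reduce.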

Where are the objects? weakly supervised methods for counting, localization and segmentation (2020)

In 2012, deep learning made a major comeback, breaking records in many machine learning benchmarks, especially in the field of computer vision. By integrating deep learning, machine learning methods have become more practical for many applications like object counting, detection, and segmentation. Unfortunately, in the typical supervised learning setting, deep learning methods can require a lot of labeled data that is costly to acquire. For instance, to acquire segmentation labels, the annotator has to label each pixel in order to draw a mask over each object and delineate the background regions. In fact, each image in the Cityscapes dataset took around 1.5 hours to label. Further, to achieve high accuracy, we might need millions of such images. In this work, we propose four weakly supervised methods. They only require labels that are cheap to collect, yet they perform well in practice. We devised an experimental setup for each proposed method. In the first setup, the model needs to learn to count objects from point annotations. In the second setup, the model needs to learn to segment objects from point annotations. In the third setup, the model needs to segment objects from image-level annotations. In the final setup, the model needs to learn to detect objects using counts only. For each of these setups, the proposed method achieves state-of-the-art results on its respective benchmark. Interestingly, our methods are not much worse than fully supervised methods, despite their training labels being significantly cheaper to acquire. In fact, when the time budget for collecting annotations is fixed, our models perform much better than fully supervised methods. This suggests that carefully designed models can effectively learn from data labeled with low human effort.

Practical optimization for structured machine learning problems (2019)

Recent years have witnessed huge advances in machine learning (ML) and its applications, especially in image, speech, and language applications. Optimization in ML is a key ingredient in both the training and hyperparameter-tuning steps, and it also influences the test phase. In this thesis, we make improvements to the optimization of all three of these problems. The first part of the thesis considers the training problem. We present the first linearly-convergent stochastic gradient method to train conditional random fields (CRFs), using the stochastic average gradient (SAG) method. Our method addresses the memory issues of SAG and proposes a better non-uniform sampling (NUS) technique. The second part of the thesis deals with a memory-free, linearly-convergent stochastic gradient method for training ML models. We consider the stochastic variance reduced gradient (SVRG) algorithm, which normally requires occasional full-gradient evaluations. We show that we can decrease the number of gradient evaluations using growing batches (which improves performance in the early iterations) and support vectors (which improves performance in the later iterations). The third part of this thesis switches to the problem of hyperparameter tuning. Although it is a lower-dimensional problem, it can be more difficult, as it is a non-convex global optimization problem. First, we propose a harmless global optimization algorithm that ensures the performance is not worse than using random search. Moreover, we explore the use of gradient-based Bayesian optimization (BO) to improve performance. We propose the use of directional derivatives to address the memory issues of BO. Finally, in the fourth part of this thesis, we propose a novel optimization method that combines the advantages of BO and Lipschitz optimization (LO).
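
The "harmless" idea above, performance never worse than random search, can be sketched as interleaving random-search iterates with model-based proposals while keeping the best point seen. The objective and the perturbation-based "model" below are toy stand-ins, not the thesis's algorithm.

```python
import numpy as np

# Sketch of a "harmless" global optimizer: half the budget goes to pure
# random search, half to model-based proposals, and only the incumbent
# best is ever returned, so random search's guarantee is preserved.
rng = np.random.default_rng(7)

def objective(x):
    return np.sin(3 * x) + 0.5 * x ** 2          # toy non-convex function

best_x, best_f = None, np.inf
for t in range(100):
    if t % 2 == 0 or best_x is None:
        x = rng.uniform(-3, 3)                   # random-search proposal
    else:
        x = best_x + 0.1 * rng.standard_normal() # toy "model-based" proposal
    fx = objective(x)
    if fx < best_f:                              # keep the incumbent best
        best_x, best_f = x, fx
```

Because the random-search proposals are untouched by the model-based ones, the best value found can only match or improve on what random search alone would return for the same budget split.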

Structured Bandits and Applications (2019)

We study the problem of decision-making under uncertainty in the bandit setting. This thesis goes beyond the well-studied multi-armed bandit model to consider structured bandit settings and their applications. In particular, we learn to make better decisions by leveraging the application-specific problem structure in the form of features or graph information. We investigate the use of structured bandits in two practical applications: online recommender systems with an available network of users and viral marketing in social networks. For each of these applications, we design efficient bandit algorithms and theoretically characterize their performance. We experimentally evaluate the efficiency and effectiveness of these algorithms on real-world datasets. For applications that require modelling complex non-linear feature-reward relationships, we propose a bootstrapping approach and establish theoretical regret bounds for it. Furthermore, we consider the application of multi-class classification with bandit feedback as a test-bed for evaluating our bootstrapping approach.
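
A minimal example of leveraging feature structure in a bandit is a LinUCB-style optimistic rule: pick the arm maximizing estimated reward plus a confidence bonus. This generic sketch, with illustrative features, noise level, and constants, is not one of the thesis's algorithms.

```python
import numpy as np

# LinUCB-style sketch: a ridge estimate of the unknown reward parameter
# is shared across arms, so pulling one arm shrinks uncertainty for all.
rng = np.random.default_rng(1)
d, n_arms, T = 3, 10, 2000
theta_star = np.array([0.5, -0.2, 0.3])       # unknown reward parameter
X = rng.standard_normal((n_arms, d))          # per-arm feature vectors

lam, alpha = 1.0, 1.0
A = lam * np.eye(d)                           # regularized Gram matrix
b = np.zeros(d)
total_reward = 0.0
for t in range(T):
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                     # ridge estimate of theta_star
    bonus = np.sqrt(np.einsum('ij,jk,ik->i', X, A_inv, X))
    a = int(np.argmax(X @ theta_hat + alpha * bonus))   # optimistic choice
    r = X[a] @ theta_star + 0.1 * rng.standard_normal() # noisy reward
    A += np.outer(X[a], X[a])
    b += r * X[a]
    total_reward += r

regret = (X @ theta_star).max() * T - total_reward
```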

Greed is good: greedy optimization methods for large-scale structured problems (2018)

This work looks at large-scale machine learning, with a particular focus on greedy methods. A recent trend caused by big datasets is to use optimization methods that have a cheap iteration cost. In this category are (block) coordinate descent and Kaczmarz methods, as the updates of these methods only rely on a reduced subspace of the problem at each iteration. Prior to our work, the literature cast greedy variations of these methods as computationally expensive with comparable convergence rates to randomized versions. In this dissertation, we show that greed is good. Specifically, we show that greedy coordinate descent and Kaczmarz methods have efficient implementations and can be faster than their randomized counterparts for certain common problem structures in machine learning. We show linear convergence for greedy (block) coordinate descent methods under a revived relaxation of strong convexity from 1963, which we call the Polyak-Łojasiewicz (PL) inequality. Of the proposed relaxations of strong convexity in the recent literature, we show that the PL inequality is the weakest condition that still ensures a global minimum. Further, we highlight the exploitable flexibility in block coordinate descent methods, not only in the different types of selection rules possible, but also in the types of updates we can use. We show that using second-order or exact updates with greedy block coordinate descent methods can lead to superlinear or finite convergence (respectively) for popular machine learning problems. Finally, we introduce the notion of “active-set complexity”, which we define as the number of iterations required before an algorithm is guaranteed to reach the optimal active manifold, and show explicit bounds for two common problem instances when using the proximal gradient or the proximal coordinate descent method.
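
The greedy (Gauss-Southwell) selection rule discussed above can be sketched on a strongly convex quadratic, where the gradient is cheap to maintain and each coordinate step is exact. The problem size and iteration count are illustrative choices, not the dissertation's experiments.

```python
import numpy as np

# Greedy (Gauss-Southwell) coordinate descent on f(x) = 1/2 x'Qx - c'x:
# always update the coordinate with the largest gradient magnitude.
rng = np.random.default_rng(2)
d = 20
M = rng.standard_normal((d, d))
Q = M @ M.T + np.eye(d)              # positive-definite Hessian
c = rng.standard_normal(d)
L = np.diag(Q)                       # coordinate-wise Lipschitz constants

x = np.zeros(d)
g = Q @ x - c                        # gradient, maintained incrementally
for _ in range(5000):
    j = int(np.argmax(np.abs(g)))    # greedy rule: largest gradient entry
    delta = -g[j] / L[j]             # exact minimization along coordinate j
    x[j] += delta
    g += delta * Q[:, j]             # O(d) gradient update after the step

x_star = np.linalg.solve(Q, c)
gap = np.linalg.norm(x - x_star)
```

The O(d) incremental gradient update is what makes the greedy rule affordable here, the kind of efficient implementation the dissertation argues for on structured problems.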

Master's Student Supervision (2010 - 2020)
Interpolation, growth conditions, and stochastic gradient descent (2020)

Current machine learning practice requires solving huge-scale empirical risk minimization problems quickly and robustly. These problems are often highly under-determined and admit multiple solutions which exactly fit, or interpolate, the training data. In such cases, stochastic gradient descent has been shown to converge without decreasing step-sizes or averaging, and can achieve the fast convergence of deterministic gradient methods. Recent work has further shown that stochastic acceleration and line-search methods for step-size selection are possible in this restricted setting. Although pioneering, existing analyses for first-order methods under interpolation have two major weaknesses: they are restricted to the finite-sum setting, and they are not tight with the best deterministic rates. To address these issues, we extend the notion of interpolation to stochastic optimization problems with general, first-order oracles. We use the proposed framework to analyze stochastic gradient descent with a fixed step-size and with an Armijo-type line-search, as well as Nesterov’s accelerated gradient method with stochastic gradients. In nearly all settings, we obtain faster convergence with a wider range of parameters. The improvement for stochastic Nesterov acceleration is comparable to dividing by the square-root of the condition number and addresses criticism that existing rates were not truly “accelerated”. In the case of convex functions, our convergence rates for stochastic gradient descent — both with and without the stochastic Armijo line-search — recover the best-known rates in the deterministic setting. We also provide a simple extension to L2-regularized minimization, which opens the path to proximal-gradient methods and non-smooth optimization under interpolation.
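
The stochastic Armijo line-search analyzed above can be sketched on an interpolating least-squares problem, where the sampled function itself certifies the step size. The initial step size, sufficient-decrease constant, and backtracking factor are illustrative choices, not the thesis's analysis parameters.

```python
import numpy as np

# SGD with a stochastic Armijo line-search: backtrack on the sampled
# component f_i until sufficient decrease holds, then take the step.
rng = np.random.default_rng(3)
n, d = 40, 8
A = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
b = A @ w_star                          # consistent system: interpolation holds

def f_i(w, i):
    return 0.5 * (A[i] @ w - b[i]) ** 2

def grad_i(w, i):
    return (A[i] @ w - b[i]) * A[i]

w = np.zeros(d)
eta_max, c, beta = 10.0, 0.5, 0.8       # initial step, decrease const, backtrack
for _ in range(2000):
    i = rng.integers(n)
    g = grad_i(w, i)
    eta = eta_max
    # Backtrack until the Armijo condition holds on the sampled function.
    while f_i(w - eta * g, i) > f_i(w, i) - c * eta * (g @ g):
        eta *= beta
    w = w - eta * g

residual = np.linalg.norm(A @ w - b)
```

Because the system is consistent, the per-sample noise vanishes at the solution, which is why a non-decreasing step size still drives the residual to zero here.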

Investigating the impact of normalizing flows on latent variable machine translation (2020)

Natural language processing (NLP) has pervasive applications in everyday life, and has recently witnessed rapid progress. Incorporating latent variables in NLP systems can allow for explicit representations of certain types of information. In neural machine translation systems, for example, latent variables have the potential of enhancing semantic representations. This could help improve general translation quality. Previous work has focused on using variational inference with diagonal-covariance Gaussian distributions, which we hypothesize cannot sufficiently encode latent factors of language that may exhibit multi-modal distributional behavior. Normalizing flows are an approach that enables more flexible posterior distribution estimates by introducing a change of variables with invertible functions. They have previously been used successfully in computer vision to enable more flexible posterior distributions of image data. In this work, we investigate the impact of normalizing flows in autoregressive neural machine translation systems. We do so in the context of two currently successful approaches: attention mechanisms and language models. Our results suggest that normalizing flows can improve translation quality in some scenarios, and require certain modelling assumptions to achieve such improvements.
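
The change-of-variables mechanism behind normalizing flows can be illustrated with the simplest invertible map, an elementwise affine transformation; real flows compose many richer invertible layers, so this is only a toy picture of the idea, not a flow used in the thesis.

```python
import numpy as np

# Change of variables: an invertible map transforms a base density, and
# the log-density of the output picks up a log-|det Jacobian| term.
def affine_flow(z, scale, shift):
    x = scale * z + shift                        # invertible elementwise map
    log_det = np.sum(np.log(np.abs(scale)))      # log |det Jacobian|
    return x, log_det

def log_density_x(x, scale, shift):
    z = (x - shift) / scale                      # inverse map
    log_base = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))  # N(0, I) base
    return log_base - np.sum(np.log(np.abs(scale)))

rng = np.random.default_rng(6)
z = rng.standard_normal(3)
scale = np.array([2.0, 0.5, 1.5])
shift = np.array([1.0, -1.0, 0.0])
x, log_det = affine_flow(z, scale, shift)
```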

Stochastic Second-Order Optimization for Over-parameterized Machine Learning Models (2020)

We consider stochastic second-order methods for minimizing smooth and strongly-convex functions under an interpolation condition, which can be satisfied by over-parameterized machine learning models. Under this condition, we show that the regularized subsampled Newton’s method (R-SSN) achieves global linear convergence with an adaptive step-size and a constant batch-size. By growing the batch size for both the subsampled gradient and Hessian, we show that R-SSN can converge at a quadratic rate in a local neighbourhood of the solution. We also show that R-SSN attains local linear convergence for the family of self-concordant functions. Furthermore, we analyze stochastic BFGS algorithms in the interpolation setting and prove their global linear convergence. We empirically evaluate stochastic limited-memory BFGS (L-BFGS) and a “Hessian-free” implementation of R-SSN for binary classification on synthetic, linearly-separable datasets and real datasets under a kernel mapping. Our experimental results demonstrate the fast convergence of these methods, both in terms of the number of iterations and wall-clock time.
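
A regularized subsampled Newton step of the kind analyzed above can be sketched for logistic regression on linearly separable (interpolating) data. The batch size, regularizer, and iteration count are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

# R-SSN-style sketch: Newton steps with a subsampled gradient and
# Hessian, plus a small ridge term to keep the Hessian invertible.
rng = np.random.default_rng(4)
n, d = 200, 4
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = np.sign(X @ w_true)                  # separable labels in {-1, +1}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

w = np.zeros(d)
batch, tau = 64, 0.1                     # subsample size, regularization
for _ in range(15):
    S = rng.choice(n, size=batch, replace=False)
    p = sigmoid(y[S] * (X[S] @ w))       # P(correct label) per example
    g = -(X[S].T @ (y[S] * (1 - p))) / batch              # subsampled gradient
    D = p * (1 - p)
    H = (X[S].T * D) @ X[S] / batch + tau * np.eye(d)     # subsampled Hessian
    w = w - np.linalg.solve(H, g)        # regularized Newton step

train_acc = float(np.mean(np.sign(X @ w) == y))
```

On separable data the unregularized Hessian degenerates as predictions saturate, which is one practical reason for the ridge term `tau` in the sketch.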

Deep kernel mean embeddings for generative modeling and feedforward style transfer (2017)

The generation of data has traditionally been specified using hand-crafted algorithms. However, oftentimes the exact generative process is unknown while only a limited number of samples are observed. One such case is generating images that look visually similar to an exemplar image or as if coming from a distribution of images. We look into learning the generating process by constructing a similarity function that measures how close the generated image is to the target image. We discuss a framework in which the similarity function is specified by a pre-trained neural network without fine-tuning, as is the case for neural texture synthesis, and a framework where the similarity function is learned along with the generative process in an adversarial setting, as is the case for generative adversarial networks. The main point of discussion is the combined use of neural networks and maximum mean discrepancy as a versatile similarity function. Additionally, we describe an improvement to state-of-the-art style transfer that allows faster computation while maintaining the generality of the generating process. The proposed objective has desirable properties such as a simpler optimization landscape, intuitive parameter tuning, and consistent frame-by-frame performance on video. We use 80,000 natural images and 80,000 paintings to train a procedure for artistic style transfer that is efficient but also allows arbitrary content and style images.
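
The maximum mean discrepancy similarity discussed above can be computed directly from kernel evaluations. Below is a biased Gaussian-kernel estimator on toy data; the bandwidth and data are illustrative, and the thesis combines MMD with neural-network features rather than raw inputs.

```python
import numpy as np

# Biased estimator of squared maximum mean discrepancy (MMD^2):
# the squared distance between empirical kernel mean embeddings.
def gaussian_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise sq. dists
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(5)
same = mmd2(rng.standard_normal((100, 2)), rng.standard_normal((100, 2)))
diff = mmd2(rng.standard_normal((100, 2)), rng.standard_normal((100, 2)) + 2.0)
```

Samples from the same distribution give a value near zero, while a shifted second sample gives a clearly larger value, which is what makes MMD usable as a training signal for a generator.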


Membership Status

Member of G+PS
