# William Welch

## Graduate Student Supervision

##### Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Computer models are used as replacements for physical experiments in a wide variety of applications. Nevertheless, direct use of the computer model for the ultimate scientific objective is often limited by the complexity and cost of the model. Historically, Gaussian process (GP) regression has proven to be the almost ubiquitous choice for a fast statistical emulator for such a computer model, due to its flexible form and analytical expressions for predictive uncertainty. In the first part of this dissertation, we consider complications that arise when the design is moderate to large. Fitting a GP regression can be computationally intractable for even moderate designs, due to computing time increasing with the cube of the design size. We propose a new solution to this problem: adaptive design and analysis via partitioning trees (ADAPT). By taking a data-adaptive approach to the development of a design, and choosing to partition the space in the regions of highest variability, we obtain a higher density of points in these regions and hence accurate prediction for complex computer models. Next, we consider the scenario where multiple computer models are available for predicting the same physical process, known as multi-model ensembles (MMEs). Such ensembles are common in many applications, such as climate modelling and weather prediction. We present a new statistical methodology for combining output from such models to best describe the underlying physical process, using field data to estimate the weights assigned to each model. The methodology allows us to make predictions with appropriate measures of uncertainty. Additionally, the weights are allowed to vary with the inputs and thus represent the changing relative importance between the computer models throughout the input space. The methodology is applied to ice sheet models for the deglaciation of North America. Finally, we address several considerations that arise when the MME field data are binary. A new MME model formulation is presented, and applied to ice absence/presence data in the deglaciation application. In summary, this dissertation presents new methods for two scenarios prevalent in the design and analysis of computer experiments: large designs, and the presence of multiple computer models.


A wide range of natural phenomena and engineering processes make physical experimentation hard to apply, or even impossible. To overcome these issues, we can rely on mathematical models that simulate these systems via computer experiments. Nonetheless, if the experimenter wants to explore many runs, complex computer codes can be excessively resource and time-consuming. Since the 1980s, Gaussian Stochastic Processes have been used in computer experiments as surrogate models. Their objective can be predicting outputs at untried input runs, given a model fitted with a training design coming from the computer code. We can exploit different modelling strategies to improve prediction accuracy, e.g., the regression component or the correlation function. This thesis makes a comprehensive exploitation of two additional strategies, which the existing literature has not fully addressed in computer experiments. One of these strategies is implementing non-standard correlation structures in model training and testing. Since the beginning of factorial designs for physical experiments in the first half of the 20th century, there have been basic guidelines for modelling from three effect principles: Sparsity, Heredity, and Hierarchy. We explore these principles in a Gaussian Stochastic Process by suggesting and evaluating novel correlation structures. Our second strategy focuses on output and input transformations via Dimensional Analysis. This methodology pays attention to fundamental physical dimensions when modelling scientific and engineering systems. It goes back at least a century but has recently caught statisticians' attention, particularly in the design of physical experiments. The core idea is to analyze dimensionless quantities derived from the original variables. While the non-standard correlation structures depict additive and low-order interaction effects, applying the three principles above relies on a proper selection of effects. 
Similarly, the implementation of Dimensional Analysis is far from straightforward; choosing the derived quantities is particularly challenging. Hence, we rely on Functional Analysis of Variance as a variable selection tool for both strategies. With the "right" variables, the Gaussian Stochastic Process' prediction accuracy improves for several case studies, which allows us to establish new modelling frameworks for computer experiments.


Conditional generative adversarial networks (cGANs) are state-of-the-art models for synthesizing images dependent on some conditions. These conditions are usually categorical variables such as class labels. cGANs with class labels as conditions are also known as class-conditional GANs. Some modern class-conditional GANs such as BigGAN can even generate photo-realistic images. The success of cGANs has been shown in various applications. However, two weaknesses of cGANs still exist. First, image generation conditional on continuous, scalar variables (termed regression labels) has never been studied. Second, low-quality fake images still appear frequently in image synthesis with state-of-the-art cGANs, especially when training data are limited. This thesis aims to resolve the above two weaknesses of cGANs and explore the applications of cGANs in improving a lightweight model with the knowledge from a heavyweight model (i.e., knowledge distillation). First, existing empirical losses and label input mechanisms of cGANs are not suitable for regression labels, making cGANs fail to synthesize images conditional on regression labels. To solve this problem, this thesis proposes the continuous conditional generative adversarial network (CcGAN), including novel empirical losses and label input mechanisms. Moreover, even the state-of-the-art cGANs may produce low-quality images, so a subsampling method to drop these images is necessary. In this thesis, we propose a density ratio based subsampling framework for unconditional GANs. Then, we introduce its extension to the conditional image synthesis setting called cDRE-F-cSP+RS, which can effectively improve the image quality of both class-conditional GANs and CcGAN. 
Finally, we propose a unified knowledge distillation framework called cGAN-KD suitable for both image classification and regression (with a scalar response), where the synthetic data generated from class-conditional GANs and CcGAN are used to transfer knowledge from a teacher net to a student net, and cDRE-F-cSP+RS is applied to filter out bad-quality images. Compared with existing methods, cGAN-KD has many advantages, and it achieves state-of-the-art performance in both image classification and regression tasks.


Computer experiments have been widely used in practice as important supplements to traditional laboratory-based physical experiments in studying complex processes. However, a computer experiment, which in general is based on a computer model with limited runs, is expensive in terms of its computational time. Sacks et al. (1989) proposed using a Gaussian process (GP) as a statistical surrogate, which has become a standard way to emulate a computer model. In this thesis, we are concerned with the design and analysis of computer experiments based on a GP. We argue that comprehensive, evidence-based assessment strategies are needed when comparing different model options and designs. We first focus on the regression component and the correlation structure of a GP. We use comprehensive assessment strategies to evaluate the effect of the two factors on the prediction accuracy. We also carry out a limited evaluation of empirical Bayes methods and full Bayes methods, from which we notice that Bayes methods with a squared exponential structure do not yield satisfying prediction accuracy in some of the examples considered. Hence, we propose to use hybrid and full Bayes methods with flexible structures. Through empirical studies, we show that the new Bayes methods with flexible structures not only have better prediction accuracy, but also have a better quantification of uncertainty. In addition, we are interested in assessing the effect of design on prediction accuracy. We consider a number of popular designs in the literature and use several examples to evaluate their performances. It turns out that the performance difference between designs is small for most of the examples considered. From the evaluation of designs, we are motivated to use a sequential design strategy. We compare the performances of two existing sequential search criteria and handle several important issues in using the sequential design to estimate an extreme probability/quantile of a floor supporting system.
Several useful recommendations have also been made.


Legislative actions regarding ozone pollution use air quality models (AQMs) such as the Community Multiscale Air Quality (CMAQ) model for scientific guidance, hence the evaluation of AQMs is an important subject. Traditional point-to-point comparisons between AQM outputs and physical observations can be uninformative or even misleading since the two datasets are generated by discrepant stochastic spatial processes. I propose an alternative model evaluation approach that is based on the comparison of spatial-temporal ozone features, where I compare the dominant space-time structures between AQM ozone and observations. To successfully implement feature-based AQM evaluation, I further developed a statistical framework for analyzing and modelling space-time ozone using ozone features. Rather than working directly with raw data, I analyze the spatial-temporal variability of ozone fields by extracting data features using Principal Component Analysis (PCA). These features are then modelled as Gaussian Processes (GPs) driven by various atmospheric conditions and chemical precursor pollution. My method is implemented on CMAQ outputs during several ozone episodes in the Lower Fraser Valley (LFV), BC. I found that the feature-based ozone model is an efficient way of emulating and forecasting a complex space-time ozone field. The framework of ozone feature analysis is then applied to evaluate CMAQ outputs against the observations. Here, I found that CMAQ persistently over-estimates the observed spatial ozone pollution. Through the modelling of feature differences, I identified their associations with the computer model's estimates of ozone precursor emissions, and this CMAQ deficiency is focused on LFV regions where the pollution process transitions from NOx-sensitive to VOC-sensitive. Through the comparison of dynamic ozone features, I found that CMAQ's over-prediction is also connected to the model producing a higher-than-observed ozone plume in the daytime.
However, the computer model did capture the observed pattern of diurnal ozone advection across LFV. Lastly, individual modelling of CMAQ and observed ozone features revealed that even under the same atmospheric conditions, CMAQ tends to significantly over-estimate the ozone pollution during the early morning. In the end, I demonstrated that the AQM evaluation methods developed in this thesis can provide informative assessments of an AQM's capability.


An ensemble of classifiers is proposed for predictive ranking of the observations in a dataset so that the rare class observations are found at the top of the ranked list. Four drug-discovery bioassay datasets, each containing a few active and a majority of inactive chemical compounds, are used in this thesis. The compounds' activity status serves as the response variable while a set of descriptors, describing the structures of chemical compounds, serve as predictors. Five separate descriptor sets are used in each assay. The proposed ensemble aggregates over the descriptor sets by averaging probabilities of activity from random forests applied to the five descriptor sets. The resulting ensemble ensures better predictive ranking than the most accurate random forest applied to a single descriptor set. Motivated by the results of the ensemble of descriptor sets, an algorithm is developed to uncover data-adaptive subsets of variables (which we call phalanxes) in a variable-rich descriptor set. Capitalizing on the richness of variables, the algorithm looks for the sets of predictors that work well together in a classifier. The data-adaptive phalanxes are formed so that they help each other while forming an ensemble. The phalanxes are aggregated by averaging probabilities of activity from random forests applied to the phalanxes. The ensemble of phalanxes (EPX) outperforms random forests and regularized random forests in terms of predictive ranking. In general, EPX performs very well in a descriptor set with many variables, and in a bioassay containing a few active compounds. The phalanxes are also aggregated within and across the descriptor sets. In all of the four bioassays, the resulting ensemble outperforms the ensemble of descriptor sets, and random forests applied to the pool of the five descriptor sets. The ensemble of phalanxes is also adapted to a logistic regression model and applied to the protein homology dataset downloaded from the KDD Cup 2004 competition.
The ensembles are applied to a real test set. The adapted version of the ensemble is found more powerful in terms of predictive ranking and less computationally demanding than the original ensemble of phalanxes with random forests.
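
The aggregation step described in this abstract (averaging probabilities of activity across classifiers, then ranking compounds by the average) can be sketched in a few lines. This is only an illustration under stated assumptions: the probability vectors are made-up toy numbers standing in for the outputs of random forests fitted to different descriptor sets or phalanxes, and the function names `ensemble_rank` and `hits_in_top` are hypothetical, not taken from the thesis.

```python
def ensemble_rank(prob_sets):
    """Average per-compound probabilities of activity across classifiers
    and return (averaged probabilities, compound indices ranked best first)."""
    n_compounds = len(prob_sets[0])
    n_models = len(prob_sets)
    avg = [sum(p[i] for p in prob_sets) / n_models for i in range(n_compounds)]
    order = sorted(range(n_compounds), key=lambda i: -avg[i])
    return avg, order


def hits_in_top(order, active, k):
    """Count how many truly active compounds appear in the first k ranks."""
    return sum(1 for i in order[:k] if i in active)


# Toy probabilities of activity for five compounds from three hypothetical
# classifiers (assumed values, for illustration only).
probs_a = [0.9, 0.2, 0.4, 0.1, 0.3]
probs_b = [0.3, 0.1, 0.8, 0.2, 0.1]
probs_c = [0.6, 0.3, 0.7, 0.1, 0.4]
active = {0, 2}  # assumed indices of the rare active compounds

avg, order = ensemble_rank([probs_a, probs_b, probs_c])
print(order, hits_in_top(order, active, k=2))
```

Ranking by the averaged probability rewards compounds that several classifiers agree on, which is the intuition behind aggregating over descriptor sets; how the phalanxes themselves are formed is not shown here.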


Single nucleotide polymorphisms (SNPs) have been increasingly popular for a wide range of genetic studies. High-throughput genotyping technologies usually involve a statistical genotype calling algorithm. Most calling algorithms in the literature, using methods such as k-means and mixture models, rely on elliptical structures of the genotyping data; they may fail when the minor allele homozygous cluster is small or absent, or when the data have extreme tails or linear patterns. We propose an automatic genotype calling algorithm by further developing a linear grouping algorithm (Van Aelst et al., 2006). The proposed algorithm clusters unnormalized data points around lines rather than around centroids. In addition, we associate a quality value, silhouette width, with each DNA sample and with a whole plate as well. This algorithm shows promise for genotyping data generated from TaqMan technology (Applied Biosystems). A key feature of the proposed algorithm is that it applies to unnormalized fluorescent signals when the TaqMan SNP assay is used. The algorithm could also be potentially adapted to other fluorescence-based SNP genotyping technologies such as Invader Assay. Motivated by the SNP genotyping problem, we propose a partial likelihood approach to linear clustering which explores potential linear clusters in a data set. Instead of fully modelling the data, we assume only that the signed orthogonal distance from each data point to a hyperplane is normally distributed. Its relationships with several existing clustering methods are discussed. Some existing methods to determine the number of components in a data set are adapted to this linear clustering setting. Several simulated and real data sets are analyzed for comparison and illustration purposes. We also investigate some asymptotic properties of the partial likelihood approach. A Bayesian version of this methodology is helpful if some clusters are sparse but there is strong prior information about their approximate locations or properties. We propose a Bayesian hierarchical approach which is particularly appropriate for identifying sparse linear clusters. We show that the sparse cluster in SNP genotyping datasets can be successfully identified after a careful specification of the prior distributions.
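
To make the idea of clustering around lines rather than centroids concrete, the sketch below computes the signed orthogonal distance from a two-dimensional point to a line and assigns each point to its nearest line. It is only the assignment step, under assumptions of our own: the three lines are fixed by hand instead of being estimated, the points are invented, and the names `signed_distance` and `assign_to_lines` are hypothetical. It is not the linear grouping or partial likelihood algorithm of the thesis.

```python
import math


def signed_distance(point, line):
    """Signed orthogonal distance from a 2-D point to the line a*x + b*y = c."""
    a, b, c = line
    x, y = point
    return (a * x + b * y - c) / math.hypot(a, b)


def assign_to_lines(points, lines):
    """Assign each point to the index of the line with the smallest
    absolute orthogonal distance."""
    return [
        min(range(len(lines)), key=lambda k: abs(signed_distance(p, lines[k])))
        for p in points
    ]


# Three assumed lines through the origin: y = 0, y = x, and x = 0.
lines = [(0.0, 1.0, 0.0), (1.0, -1.0, 0.0), (1.0, 0.0, 0.0)]
# Invented two-channel signal pairs, for illustration only.
points = [(5.0, 0.2), (3.0, 3.1), (0.1, 4.0)]

print(assign_to_lines(points, lines))
```

A full linear clustering method would alternate between this assignment step and re-estimating each line from its assigned points; the abstract's partial likelihood approach additionally models the signed distances as normally distributed.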


Cross validation (CV) is widely used for model assessment and comparison. In this thesis, we first review and compare three v-fold CV strategies: best single CV, repeated and averaged CV and double CV. The mean squared errors of the CV strategies in estimating the best predictive performance are illustrated by using simulated and real data examples. The results show that repeated and averaged CV is a good strategy and outperforms the other two CV strategies for finite samples in terms of the mean squared error in estimating prediction accuracy and the probability of choosing an optimal model. In practice, when we need to compare many models, conducting the repeated and averaged CV strategy is not computationally feasible. We develop an efficient sequential methodology for model comparison based on CV. It also takes into account the randomness in CV. The number of models is reduced via an adaptive, multiplicity-adjusted sequential algorithm, where poor performers are quickly eliminated. By exploiting matching of individual observations, it is sometimes even possible to establish the statistically significant inferiority of some models with just one execution of CV. This adaptive and computationally efficient methodology is demonstrated on a large cheminformatics data set from PubChem. Cross validated mean squared error (CVMSE) is widely used to estimate the prediction mean squared error (MSE) of statistical methods. For linear models, we show how CVMSE depends on the number of folds, v, used in cross validation, the number of observations, and the number of model parameters. We establish that the bias of CVMSE in estimating the true MSE decreases with v and increases with model complexity. In particular, the bias may be very substantial for models with many parameters relative to the number of observations, even if v is large. These results are used to correct CVMSE for its bias.
We compare our proposed bias correction with that of Burman (1989), through simulated and real examples. We also illustrate that our method of correcting for the bias of CVMSE may change the results of model selection.
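
As a rough illustration of the repeated and averaged v-fold CV strategy mentioned in this abstract, the sketch below reruns v-fold CV with different random fold assignments and averages the resulting CVMSE values. The model (a straight-line fit with one predictor), the simulated data, and the function names are all assumptions chosen to keep the example self-contained; the sequential comparison algorithm and the bias correction from the thesis are not reproduced.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def vfold_cvmse(xs, ys, v, rng):
    """One execution of v-fold CV: mean squared error over all held-out points."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)                      # random fold assignment
    folds = [idx[k::v] for k in range(v)]
    sq_err = []
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err.extend((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in fold)
    return statistics.fmean(sq_err)


def repeated_cvmse(xs, ys, v=5, repeats=20, seed=0):
    """Average CVMSE over several random fold assignments; also return each run."""
    rng = random.Random(seed)
    runs = [vfold_cvmse(xs, ys, v, rng) for _ in range(repeats)]
    return statistics.fmean(runs), runs


# Simulated data (assumed): y = 1 + 2x plus Gaussian noise.
noise = random.Random(1)
xs = [i / 10 for i in range(30)]
ys = [1 + 2 * x + noise.gauss(0, 0.5) for x in xs]

avg, runs = repeated_cvmse(xs, ys)
print(f"averaged CVMSE: {avg:.4f}, single-run range: {min(runs):.4f} to {max(runs):.4f}")
```

The spread between the smallest and largest single-run values shows the randomness in CV that the abstract refers to; averaging over repeats reduces it at the cost of refitting the model `v * repeats` times.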


Single nucleotide polymorphisms (SNPs) are DNA sequence variations, occurring when a single nucleotide (A, T, C or G) is altered. Arguably, SNPs account for more than 90% of human genetic variation. Dr. Tebbutt's laboratory has developed a highly redundant SNP genotyping assay consisting of multiple probes with signals from multiple channels for a single SNP, based on arrayed primer extension (APEX). The strength of this platform is its unique redundancy, having multiple probes for a single SNP. Using this microarray platform, we have developed fully-automated genotype calling algorithms based on linear models for individual probe signals and using dynamic variable selection at the prediction level. The algorithms combine separate analyses based on the multiple probe sets to give a final confidence score for each candidate genotype. Our proposed classification model achieved an accuracy level of >99.4% with a 100% call rate for the SNP genotype data, which is comparable with existing genotyping technologies. We discussed the appropriateness of the proposed model relative to other existing high-throughput genotype calling algorithms. In this thesis we have explored three new ideas for classification with high dimensional data: (1) ensembles of various sets of predictors with a built-in dynamic property; (2) robust classification at the prediction level; and (3) a proper confidence measure for dealing with failed predictor(s). We found that a mixture model for classification provides robustness against outlying values of the explanatory variables. Furthermore, the algorithm chooses among different sets of explanatory variables in a dynamic way, prediction by prediction. We analyzed several data sets, including real and simulated samples, to illustrate these features. Our model-based genotype calling algorithm captures the redundancy in the system, considering all the underlying probe features of a particular SNP and automatically down-weighting any 'bad data' corresponding to image artifacts on the microarray slide or failure of a specific chemistry. Though motivated by this genotyping application, the proposed methodology would apply to other classification problems where the explanatory variables fall naturally into groups or where outliers in the explanatory variables require variable selection at the prediction stage for robustness.


Computer models or simulators are becoming increasingly common in many fields in science and engineering, powered by the phenomenal growth in computer hardware over the past decades. Many of these simulators implement a particular mathematical model as a deterministic computer code, meaning that running the simulator again with the same input gives the same output. Often running the code involves some computationally expensive tasks, such as solving complex systems of partial differential equations numerically. When simulator runs become too long, it may limit their usefulness. In order to overcome time or budget constraints by making the most out of limited computational resources, a statistical methodology has been proposed, known as the "Design and Analysis of Computer Experiments". The main idea is to run the expensive simulator only at a relatively few, carefully chosen design points in the input space, and based on the outputs construct an emulator (statistical model) that can emulate (predict) the output at new, untried locations at a fraction of the cost. This approach is useful provided that we can measure how much the predictions of the cheap emulator deviate from the real response surface of the original computer model. One way to quantify emulator error is to construct pointwise prediction bands designed to envelope the response surface and make assertions that the true response (simulator output) is enclosed by these envelopes with a certain probability. Of course, to be able to make such probabilistic statements, one needs to introduce some kind of randomness. A common strategy that we use here is to model the computer code as a random function, also known as a Gaussian stochastic process. We concern ourselves with smooth response surfaces and use the Gaussian covariance function that is ideal in cases when the response function is infinitely differentiable. In this thesis, we propose Fast Bayesian Inference (FBI) that is both computationally efficient and can be implemented as a black box. Simulation results show that it can achieve remarkably accurate prediction uncertainty assessments in terms of matching coverage probabilities of the prediction bands and the associated reparameterizations can also help parameter uncertainty assessments.
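
The emulator idea in this abstract can be illustrated with a minimal one-dimensional Gaussian process predictor using the Gaussian correlation function. The sketch makes several simplifying assumptions that are ours, not the thesis's: a zero mean, unit process variance, a fixed correlation parameter `theta` instead of an estimated one, and a tiny `nugget` added for numerical stability. It is not the Fast Bayesian Inference method; it only shows where the predictive mean and variance behind pointwise prediction bands come from.

```python
import math


def gaussian_corr(x1, x2, theta):
    """Gaussian (squared exponential) correlation between two scalar inputs."""
    return math.exp(-theta * (x1 - x2) ** 2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix.
    The cost grows with the cube of the design size."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_chol(low, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def gp_predict(design, outputs, x_new, theta=5.0, nugget=1e-10):
    """Predictive (mean, variance) at x_new for a zero-mean, unit-variance GP
    with fixed theta (all assumed for illustration)."""
    n = len(design)
    corr = [[gaussian_corr(design[i], design[j], theta) + (nugget if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    low = cholesky(corr)
    r = [gaussian_corr(x_new, xi, theta) for xi in design]
    weights = solve_chol(low, r)  # R^{-1} r
    mean = sum(w * y for w, y in zip(weights, outputs))
    variance = max(0.0, 1.0 - sum(w * ri for w, ri in zip(weights, r)))
    return mean, variance


# A cheap stand-in for an expensive deterministic simulator (assumed).
design = [0.0, 0.25, 0.5, 0.75, 1.0]
outputs = [math.sin(2 * math.pi * x) for x in design]

mean, var = gp_predict(design, outputs, 0.6)
half_width = 1.96 * math.sqrt(var)  # approximate 95% pointwise band
print(f"prediction at 0.6: {mean:.3f} +/- {half_width:.3f}")
```

Because the simulator is deterministic, the predictor interpolates the design points and the predictive variance shrinks to (nearly) zero there, growing again between and beyond them.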


##### Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

We propose an algorithm for a family of optimization problems where the objective can be decomposed as a sum of functions with monotonicity properties. The motivating problem is optimization of hyperparameters of machine learning algorithms where we argue that the objective, validation error, can be decomposed into two approximately monotonic functions of the hyperparameters, along with some theoretical justification. Our proposed algorithm adapts Bayesian optimization methods to incorporate monotonicity constraints. We illustrate the improvement in search efficiency for applications of hyperparameter tuning in machine learning on an artificial problem and Penn Machine Learning Benchmarks.


Gaussian Processes (GPs) are commonly used in the analysis of data from a computer experiment. Ideally, the analysis will provide accurate predictions with correct coverage probabilities of credible intervals. A Bayesian method can, in principle, capture all sources of uncertainty and hence give valid inference. Several implementations are available in the literature, differing in choice of priors, etc. In this thesis, we first review three popular Bayesian methods in the analysis of computer experiments. Two prediction criteria are proposed to measure both the prediction accuracy and the actual coverage probability of the predictions. From a simple example, we notice that the performances of the three Bayesian implementations are quite different. Motivated by the performance difference, we specify four important factors in terms of Bayesian analysis and allocate different levels for the factors based on the three existing Bayesian implementations. Full factorial experiments are then conducted on the specified factors both for real computer models and via simulation with the aim of identifying the significant factors. Emphasis is placed on the prediction accuracy, since the performances of the prediction coverage probability for most combinations are satisfactory. Through the analyses described above, we find that among the four factors, two factors are actually significant for the prediction accuracy. The best combination for the levels of the four factors is also identified.

