# William Welch


## Graduate Student Supervision

##### Doctoral Student Supervision (Jan 2008 - May 2019)

Computer experiments have been widely used in practice as important supplements to traditional laboratory-based physical experiments in studying complex processes. However, a computer experiment, which in general is based on a computer model with limited runs, is expensive in terms of its computational time. Sacks et al. (1989) proposed using a Gaussian process (GP) as a statistical surrogate, which has become a standard way to emulate a computer model. In this thesis, we are concerned with the design and analysis of computer experiments based on a GP. We argue that comprehensive, evidence-based assessment strategies are needed when comparing different model options and designs. We first focus on the regression component and the correlation structure of a GP. We use comprehensive assessment strategies to evaluate the effect of these two factors on prediction accuracy. We also carry out a limited evaluation of empirical Bayes and full Bayes methods, from which we notice that Bayes methods with a squared exponential structure do not yield satisfactory prediction accuracy in some of the examples considered. Hence, we propose hybrid and full Bayes methods with flexible structures. Through empirical studies, we show that the new Bayes methods with flexible structures not only have better prediction accuracy but also quantify uncertainty better. In addition, we are interested in assessing the effect of design on prediction accuracy. We consider a number of popular designs from the literature and use several examples to evaluate their performance. It turns out that the performance difference between designs is small for most of the examples considered. Motivated by this evaluation of designs, we turn to a sequential design strategy. We compare the performance of two existing sequential search criteria and handle several important issues in using sequential design to estimate an extreme probability or quantile of a floor supporting system. Several useful recommendations are also made.


Legislative actions regarding ozone pollution use air quality models (AQMs) such as the Community Multiscale Air Quality (CMAQ) model for scientific guidance, so the evaluation of AQMs is an important subject. Traditional point-to-point comparisons between AQM outputs and physical observations can be uninformative or even misleading, since the two datasets are generated by discrepant stochastic spatial processes. I propose an alternative model evaluation approach based on the comparison of spatial-temporal ozone features, in which I compare the dominant space-time structures of AQM ozone and observations. To implement feature-based AQM evaluation successfully, I further developed a statistical framework for analyzing and modelling space-time ozone using ozone features. Rather than working directly with raw data, I analyze the spatial-temporal variability of ozone fields by extracting data features using Principal Component Analysis (PCA). These features are then modelled as Gaussian Processes (GPs) driven by various atmospheric conditions and chemical precursor pollution. My method is implemented on CMAQ outputs during several ozone episodes in the Lower Fraser Valley (LFV), BC. I found that the feature-based ozone model is an efficient way of emulating and forecasting a complex space-time ozone field. The framework of ozone feature analysis is then applied to evaluate CMAQ outputs against the observations. Here, I found that CMAQ persistently over-estimates the observed spatial ozone pollution. Through the modelling of feature differences, I identified their associations with the computer model's estimates of ozone precursor emissions, and this CMAQ deficiency is concentrated in LFV regions where the pollution process transitions from NOx-sensitive to VOC-sensitive. Through the comparison of dynamic ozone features, I found that CMAQ's over-prediction is also connected to the model producing a higher-than-observed ozone plume in the daytime. However, the computer model did capture the observed pattern of diurnal ozone advection across the LFV. Lastly, individual modelling of CMAQ and observed ozone features revealed that even under the same atmospheric conditions, CMAQ tends to significantly over-estimate ozone pollution during the early morning. In the end, I demonstrated that the AQM evaluation methods developed in this thesis can provide informative assessments of an AQM's capability.


An ensemble of classifiers is proposed for predictive ranking of the observations in a dataset so that the rare class observations are found at the top of the ranked list. Four drug-discovery bioassay datasets, each containing a few active and a majority of inactive chemical compounds, are used in this thesis. The compounds' activity status serves as the response variable, while a set of descriptors, describing the structures of the chemical compounds, serves as predictors. Five separate descriptor sets are used in each assay. The proposed ensemble aggregates over the descriptor sets by averaging probabilities of activity from random forests applied to the five descriptor sets. The resulting ensemble gives better predictive ranking than the most accurate random forest applied to a single descriptor set. Motivated by the results of the ensemble of descriptor sets, an algorithm is developed to uncover data-adaptive subsets of variables (which we call phalanxes) in a variable-rich descriptor set. Capitalizing on the richness of variables, the algorithm looks for sets of predictors that work well together in a classifier. The data-adaptive phalanxes are formed so that they help each other when combined in an ensemble. The phalanxes are aggregated by averaging probabilities of activity from random forests applied to the phalanxes. The ensemble of phalanxes (EPX) outperforms random forests and regularized random forests in terms of predictive ranking. In general, EPX performs very well in a descriptor set with many variables and in a bioassay containing few active compounds. The phalanxes are also aggregated within and across the descriptor sets. In all four bioassays, the resulting ensemble outperforms both the ensemble of descriptor sets and random forests applied to the pool of the five descriptor sets. The ensemble of phalanxes is also adapted to a logistic regression model and applied to the protein homology dataset downloaded from the KDD Cup 2004 competition. The ensembles are applied to a real test set. The adapted version of the ensemble is found to be more powerful in terms of predictive ranking and less computationally demanding than the original ensemble of phalanxes with random forests.
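
The aggregation step described in this abstract, averaging the probabilities of activity across descriptor sets and ranking compounds by the average, can be sketched roughly as follows. This is only an illustration under stated assumptions: the probabilities are made-up numbers rather than thesis results, the helper names `average_probabilities` and `rank_compounds` are hypothetical, and in the thesis the per-set probabilities come from random forests rather than being typed in by hand.

```python
def average_probabilities(prob_by_set):
    """Average predicted activity probabilities across descriptor sets.

    prob_by_set[s][i] is the probability that compound i is active
    according to the classifier fit on descriptor set s.
    """
    n_sets = len(prob_by_set)
    n_compounds = len(prob_by_set[0])
    return [
        sum(probs[i] for probs in prob_by_set) / n_sets
        for i in range(n_compounds)
    ]


def rank_compounds(scores):
    """Return compound indices ordered from most to least likely active."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical probabilities from three descriptor sets for four compounds.
prob_by_set = [
    [0.10, 0.80, 0.30, 0.05],
    [0.20, 0.60, 0.70, 0.10],
    [0.15, 0.70, 0.20, 0.00],
]
ensemble = average_probabilities(prob_by_set)
ranking = rank_compounds(ensemble)
```

In this toy input the third compound is ranked highly by only one descriptor set, yet averaging still places it second overall, which is the kind of complementary information the ensemble is meant to exploit.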


Single nucleotide polymorphisms (SNPs) have become increasingly popular for a wide range of genetic studies. A high-throughput genotyping technology usually involves a statistical genotype calling algorithm. Most calling algorithms in the literature, using methods such as k-means and mixture models, rely on elliptical structures in the genotyping data; they may fail when the minor allele homozygous cluster is small or absent, or when the data have extreme tails or linear patterns. We propose an automatic genotype calling algorithm by further developing a linear grouping algorithm (Van Aelst et al., 2006). The proposed algorithm clusters unnormalized data points around lines rather than around centroids. In addition, we associate a quality value, the silhouette width, with each DNA sample and with a whole plate as well. This algorithm shows promise for genotyping data generated by TaqMan technology (Applied Biosystems). A key feature of the proposed algorithm is that it applies to unnormalized fluorescent signals when the TaqMan SNP assay is used. The algorithm could also potentially be adapted to other fluorescence-based SNP genotyping technologies such as the Invader Assay. Motivated by the SNP genotyping problem, we propose a partial likelihood approach to linear clustering which explores potential linear clusters in a data set. Instead of fully modelling the data, we assume only that the signed orthogonal distance from each data point to a hyperplane is normally distributed. Its relationships with several existing clustering methods are discussed. Some existing methods for determining the number of components in a data set are adapted to this linear clustering setting. Several simulated and real data sets are analyzed for comparison and illustration. We also investigate some asymptotic properties of the partial likelihood approach. A Bayesian version of this methodology is helpful if some clusters are sparse but there is strong prior information about their approximate locations or properties. We propose a Bayesian hierarchical approach which is particularly appropriate for identifying sparse linear clusters. We show that the sparse cluster in SNP genotyping datasets can be successfully identified after a careful specification of the prior distributions.
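
To make the idea of clustering around lines rather than centroids concrete, here is a minimal sketch of the assignment step only: each 2-D point goes to the line with the smallest absolute orthogonal distance. It is not the linear grouping algorithm or the partial likelihood fit from the thesis; the lines are fixed by hand instead of estimated, the data are invented, and the function names are hypothetical.

```python
import math


def signed_orthogonal_distance(point, line):
    """Signed distance from a 2-D point to the line a*x + b*y = c."""
    a, b, c = line
    x, y = point
    return (a * x + b * y - c) / math.hypot(a, b)


def assign_to_lines(points, lines):
    """Assign each point to the line with the smallest absolute distance."""
    labels = []
    for p in points:
        dists = [abs(signed_orthogonal_distance(p, ln)) for ln in lines]
        labels.append(dists.index(min(dists)))
    return labels


# Three hand-picked lines through the origin, loosely mimicking the three
# genotype clusters (two homozygous, one heterozygous) in a two-signal plot.
lines = [
    (1.0, 0.0, 0.0),   # x = 0
    (1.0, -1.0, 0.0),  # y = x
    (0.0, 1.0, 0.0),   # y = 0
]
points = [(0.1, 2.0), (1.9, 2.1), (3.0, 0.2)]  # invented signal pairs
labels = assign_to_lines(points, lines)
```

A full method would alternate between this assignment and re-estimating the lines, and would treat the signed distances as normally distributed, as the abstract describes.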


Cross validation (CV) is widely used for model assessment and comparison. In this thesis, we first review and compare three v-fold CV strategies: best single CV, repeated and averaged CV, and double CV. The mean squared errors of the CV strategies in estimating the best predictive performance are illustrated using simulated and real data examples. The results show that repeated and averaged CV is a good strategy and outperforms the other two CV strategies for finite samples, in terms of both the mean squared error in estimating prediction accuracy and the probability of choosing an optimal model. In practice, when we need to compare many models, the repeated and averaged CV strategy is not computationally feasible. We develop an efficient sequential methodology for model comparison based on CV, which also takes into account the randomness in CV. The number of models is reduced via an adaptive, multiplicity-adjusted sequential algorithm, in which poor performers are quickly eliminated. By exploiting the matching of individual observations, it is sometimes even possible to establish the statistically significant inferiority of some models with just one execution of CV. This adaptive and computationally efficient methodology is demonstrated on a large cheminformatics data set from PubChem. Cross validated mean squared error (CVMSE) is widely used to estimate the prediction mean squared error (MSE) of statistical methods. For linear models, we show how CVMSE depends on the number of folds, v, used in cross validation, the number of observations, and the number of model parameters. We establish that the bias of CVMSE in estimating the true MSE decreases with v and increases with model complexity. In particular, the bias may be very substantial for models with many parameters relative to the number of observations, even if v is large. These results are used to correct CVMSE for its bias. We compare our proposed bias correction with that of Burman (1989) through simulated and real examples. We also illustrate that our method of correcting for the bias of CVMSE may change the results of model selection.
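
The repeated and averaged v-fold CV strategy can be illustrated with a short sketch. It assumes the simplest possible model, an intercept-only predictor that returns the training mean, and invented data; the function names are hypothetical, and none of the sequential elimination or bias correction from the thesis is included.

```python
import random


def cv_mse(y, v, rng):
    """One run of v-fold CV for an intercept-only model (predict the mean)."""
    idx = list(range(len(y)))
    rng.shuffle(idx)  # a fresh random split into folds on every call
    folds = [idx[k::v] for k in range(v)]
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [y[i] for i in idx if i not in held_out]
        pred = sum(train) / len(train)
        sq_err += sum((y[i] - pred) ** 2 for i in fold)
    return sq_err / len(y)


def repeated_cv_mse(y, v, n_repeats, seed=0):
    """Average CVMSE over several random splits; also return each run."""
    rng = random.Random(seed)
    runs = [cv_mse(y, v, rng) for _ in range(n_repeats)]
    return sum(runs) / len(runs), runs


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # invented responses
average, runs = repeated_cv_mse(y, v=3, n_repeats=5)
```

The individual entries of `runs` differ because each depends on the random split, which is the randomness that averaging over repeats is meant to smooth out. When v equals the number of observations the split no longer matters and every run returns the same value.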


Single nucleotide polymorphisms (SNPs) are DNA sequence variations, occurring when a single nucleotide (A, T, C or G) is altered. Arguably, SNPs account for more than 90% of human genetic variation. Dr. Tebbutt's laboratory has developed a highly redundant SNP genotyping assay consisting of multiple probes with signals from multiple channels for a single SNP, based on arrayed primer extension (APEX). The strength of this platform is its unique redundancy, with multiple probes for a single SNP. Using this microarray platform, we have developed fully automated genotype calling algorithms based on linear models for individual probe signals and using dynamic variable selection at the prediction level. The algorithms combine separate analyses based on the multiple probe sets to give a final confidence score for each candidate genotype. Our proposed classification model achieved an accuracy level of >99.4% with a 100% call rate for the SNP genotype data, which is comparable with existing genotyping technologies. We discuss the appropriateness of the proposed model relative to other existing high-throughput genotype calling algorithms. In this thesis we have explored three new ideas for classification with high dimensional data: (1) ensembles of various sets of predictors with a built-in dynamic property; (2) robust classification at the prediction level; and (3) a proper confidence measure for dealing with failed predictor(s). We found that a mixture model for classification provides robustness against outlying values of the explanatory variables. Furthermore, the algorithm chooses among different sets of explanatory variables in a dynamic way, prediction by prediction. We analyzed several data sets, including real and simulated samples, to illustrate these features. Our model-based genotype calling algorithm captures the redundancy in the system by considering all the underlying probe features of a particular SNP, automatically down-weighting any ‘bad data’ corresponding to image artifacts on the microarray slide or failure of a specific chemistry. Though motivated by this genotyping application, the proposed methodology would apply to other classification problems where the explanatory variables fall naturally into groups, or where outliers in the explanatory variables require variable selection at the prediction stage for robustness.


Computer models or simulators are becoming increasingly common in many fields in science and engineering, powered by the phenomenal growth in computer hardware over the past decades. Many of these simulators implement a particular mathematical model as a deterministic computer code, meaning that running the simulator again with the same input gives the same output. Often running the code involves some computationally expensive tasks, such as solving complex systems of partial differential equations numerically. When simulator runs become too long, it may limit their usefulness. In order to overcome time or budget constraints by making the most out of limited computational resources, a statistical methodology has been proposed, known as the "Design and Analysis of Computer Experiments". The main idea is to run the expensive simulator only at a relatively few, carefully chosen design points in the input space, and based on the outputs construct an emulator (statistical model) that can emulate (predict) the output at new, untried locations at a fraction of the cost. This approach is useful provided that we can measure how much the predictions of the cheap emulator deviate from the real response surface of the original computer model. One way to quantify emulator error is to construct pointwise prediction bands designed to envelope the response surface and make assertions that the true response (simulator output) is enclosed by these envelopes with a certain probability. Of course, to be able to make such probabilistic statements, one needs to introduce some kind of randomness. A common strategy that we use here is to model the computer code as a random function, also known as a Gaussian stochastic process. We concern ourselves with smooth response surfaces and use the Gaussian covariance function that is ideal in cases when the response function is infinitely differentiable. In this thesis, we propose Fast Bayesian Inference (FBI) that is both computationally efficient and can be implemented as a black box. Simulation results show that it can achieve remarkably accurate prediction uncertainty assessments in terms of matching coverage probabilities of the prediction bands and the associated reparameterizations can also help parameter uncertainty assessments.
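
The emulator idea in this abstract (run the simulator at a few design points, then predict elsewhere with a pointwise band) can be sketched in a few lines. This is a deliberately simplified illustration and not the Fast Bayesian Inference method: it assumes a one-dimensional input, a zero-mean process, a Gaussian correlation with parameters `theta` and `sigma2` fixed by hand rather than inferred, and a toy `simulator` function standing in for an expensive code. All names and values are hypothetical.

```python
import math


def simulator(x):
    """Toy stand-in for an expensive deterministic computer code."""
    return math.sin(2.0 * math.pi * x)


def gauss_corr(x1, x2, theta):
    """Gaussian (squared exponential) correlation between two inputs."""
    return math.exp(-theta * (x1 - x2) ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def gp_predict(x_design, y_design, x_new, theta, sigma2, nugget=1e-10):
    """Predictive mean and approximate 95% pointwise band at x_new."""
    n = len(x_design)
    R = [[gauss_corr(x_design[i], x_design[j], theta) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        R[i][i] += nugget  # small nugget for numerical stability
    r = [gauss_corr(x_new, xi, theta) for xi in x_design]
    w = solve(R, r)  # w = R^{-1} r (R is symmetric)
    mean = sum(wi * yi for wi, yi in zip(w, y_design))
    var = max(sigma2 * (1.0 - sum(wi * ri for wi, ri in zip(w, r))), 0.0)
    half_width = 1.96 * math.sqrt(var)
    return mean, (mean - half_width, mean + half_width)


x_design = [0.0, 0.25, 0.5, 0.75, 1.0]        # hand-picked design points
y_design = [simulator(x) for x in x_design]   # the only simulator runs
mean_new, band_new = gp_predict(x_design, y_design, 0.1, theta=10.0, sigma2=1.0)
```

At a design point the emulator reproduces the simulator output and the band collapses to essentially zero width, while at untried inputs the band widens. Because the parameters here are plugged in rather than inferred, the band ignores parameter uncertainty, so its coverage should not be taken as representative of the method proposed in the thesis.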


##### Master's Student Supervision (2010 - 2018)

Gaussian Processes (GPs) are commonly used in the analysis of data from a computer experiment. Ideally, the analysis will provide accurate predictions with correct coverage probabilities of credible intervals. A Bayesian method can, in principle, capture all sources of uncertainty and hence give valid inference. Several implementations are available in the literature, differing in their choice of priors and other details. In this thesis, we first review three popular Bayesian methods for the analysis of computer experiments. Two prediction criteria are proposed to measure both the prediction accuracy and the actual coverage probability of the predictions. From a simple example, we notice that the performances of the three Bayesian implementations are quite different. Motivated by this difference, we specify four important factors in the Bayesian analysis and assign levels to the factors based on the three existing Bayesian implementations. Full factorial experiments are then conducted on the specified factors, both for real computer models and via simulation, with the aim of identifying the significant factors. Emphasis is placed on prediction accuracy, since the coverage probability is satisfactory for most combinations. Through these analyses, we find that two of the four factors have a significant effect on prediction accuracy. The best combination of levels for the four factors is also identified.

