Matias Salibian-Barrera

Professor


Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay before the latest dissertations are added.

Boosting for regression problems with complex data (2022)

Boosting is a highly flexible and powerful approach when it comes to making predictions in non-parametric settings. By constructing an estimator using a combination of “base learners”, it can achieve high prediction accuracy and scale to data with many explanatory variables. In spite of the popularity and practical success of boosting algorithms, little attention has been paid to their generalizations to “complex data”, such as data with “outliers” or functional variables. For data like these, we develop new boosting algorithms that fit in the framework of gradient boosting machines (GBM). We illustrate our findings on simulated and real datasets and develop openly available R packages implementing our proposals. For data contaminated with outliers, we propose a two-stage boosting algorithm similar to what is done for robust linear MM-regression: it first minimizes a robust residual scale estimator and then improves it by optimizing a bounded loss function. Unlike previous robust boosting proposals, this approach does not require computing an ad hoc residual scale estimator in each boosting iteration. We address the initialization of our boosting algorithm and provide a permutation-based procedure to robustly measure the importance of each variable. For data containing functional predictors, we propose a boosting algorithm that uses tree “base learners” constructed with multiple projections. Our proposal incorporates possible interactions between indices, making it capable of approximating complex regression functions. In addition, our estimator is constructed using relatively simple regression trees, which are notably easier to compute than the multi-dimensional kernel smoothers used in other proposals. Finally, we extend the proposal above to robust functional regression in the presence of outliers, which may appear in the measurements of the response, the functional predictors, or both. We explore robust boosting algorithms derived from M-estimators and MM-estimators, and make suggestions on which method to use based on the type of contamination and the computing budget.
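As a rough illustration of the idea, the following is a minimal sketch under assumed simplifications (a fixed robust residual scale and shallow tree base learners); it is not the dissertation's two-stage algorithm or its R packages:

library(rpart)

# Derivative of the bounded Tukey bisquare loss; large residuals get zero weight.
tukey_psi <- function(r, cc = 4.685) ifelse(abs(r) <= cc, r * (1 - (r / cc)^2)^2, 0)

# Gradient boosting on a bounded loss; `dat` is a data frame of predictors
# (assumed not to contain a column named "g").
robust_boost <- function(dat, y, n_iter = 200, shrinkage = 0.1) {
  f <- rep(median(y), length(y))          # robust initialization
  s <- mad(y - f)                         # robust residual scale, kept fixed
  trees <- vector("list", n_iter)
  for (m in seq_len(n_iter)) {
    g <- tukey_psi((y - f) / s)           # bounded "negative gradient"
    trees[[m]] <- rpart(g ~ ., data = cbind(g = g, dat), maxdepth = 2)
    f <- f + shrinkage * predict(trees[[m]], dat)
  }
  list(init = median(y), trees = trees, shrinkage = shrinkage)
}

Because the loss derivative is bounded, a gross outlier in y contributes nothing to the gradient once its scaled residual is large, which is what keeps the ensemble from chasing contaminated points.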


Methods for preferential sampling in geostatistics (2018)

Preferential sampling in geostatistics occurs when the locations at which observations are made may depend on the spatial process that underlies the correlation structure of the measurements. If ignored, this may affect the parameter estimates of the model and the resulting spatial predictions. In this thesis, we first show that previously proposed Monte Carlo estimates of the likelihood function may not approximate the desired function. Furthermore, we argue that, for preferential sampling models of moderate complexity, alternative and widely available numerical methods for approximating the likelihood function produce better results than Monte Carlo methods. We illustrate our findings on various data sets, including the biomonitoring Galicia dataset analysed previously in the literature. Research on preferential sampling has so far been restricted to fixed sampling locations such as monitoring sites. In this thesis, we also expand the methodology to cases where the sensors are moving through the domain of interest. More specifically, we propose a flexible framework for inference on preferentially sampled fields, where the process that generates the sampling locations is stochastic and moving through a 2-dimensional space. The main application of these methods is the sampling of ocean temperature fields by sensors mounted on marine mammals. This is an area of research that has grown drastically over the past 25 years and is providing scientists with a wealth of new oceanographic information in areas of our oceans previously not well understood. We show that standard geostatistical models may not be reliable for this type of data, due to the possibility that the regions visited by the animals depend on the ocean temperatures, resulting in a type of preferential sampling. Our simulation studies confirm that predictions obtained from the preferential sampling model are more reliable when this phenomenon is present, and compare very well to the standard ones when there is no preferential sampling. We apply our methods to sea surface temperature data collected by southern elephant seals in the Southern Indian Ocean and show how predictions of sea surface temperature fields using these data may change when the preferential movement is accounted for.
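A toy simulation (assumed, simplified ingredients, not the thesis's model) shows the bias that motivates this work: when locations are drawn preferentially in regions where the latent field is high, the naive mean over sampled sites overshoots the true field mean.

set.seed(123)
n <- 30
grid <- expand.grid(x = seq(0, 1, length.out = n), y = seq(0, 1, length.out = n))
Sigma <- exp(-as.matrix(dist(grid)) / 0.2) + diag(1e-8, n^2)  # exponential covariance
S <- drop(t(chol(Sigma)) %*% rnorm(n^2))        # one realization of the latent field
p <- exp(2 * S) / sum(exp(2 * S))               # sampling intensity increases with S
idx <- sample(n^2, 100, prob = p)               # preferentially chosen locations
c(field_mean = mean(S), naive_estimate = mean(S[idx]))  # naive estimate is biased up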


Extending linear grouping analysis and robust estimators for very large data sets (2009)

Cluster analysis is the study of how to partition data into homogeneous subsets so that the partitioned data share some common characteristic. In one to three dimensions, the human eye can distinguish well between clusters of data if they are clearly separated. However, when there are more than three dimensions and/or the data are not clearly separated, an algorithm is required, and it needs a metric of similarity that quantitatively measures the characteristic of interest. Linear Grouping Analysis (LGA, Van Aelst et al. 2006) is an algorithm for clustering data around hyperplanes, and is most appropriate when: 1) the variables are related/correlated, which results in clusters with an approximately linear structure; and 2) it is not natural to assume that one variable is a “response” and the remainder are “explanatories”. LGA measures the compactness within each cluster via the sum of squared orthogonal distances to hyperplanes formed from the data. In this dissertation, we extend the scope of problems to which LGA can be applied. The first extension relates to the linearity requirement inherent in LGA: we propose a new method of non-linearly transforming the data into a Feature Space, using the Kernel Trick, such that in this space the data might then form linear clusters. A possible side effect of this transformation is that the dimension of the transformed space can be significantly larger than the number of observations in a given cluster, which causes problems with orthogonal regression. Therefore, we also introduce a new method for calculating the distance of an observation to a cluster when its covariance matrix is rank deficient. The second extension concerns the combinatorial problem of optimizing an LGA objective function, and adapts an existing algorithm, called BIRCH, to provide fast, approximate solutions, particularly when the data do not fit in memory. We also provide solutions based on BIRCH for two other challenging optimization problems in the field of robust statistics, and demonstrate, via simulation studies as well as applications to real data sets, that the BIRCH solution compares favourably to the existing state-of-the-art alternatives, and in many cases finds a better solution.
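The orthogonal distances at the core of LGA can be sketched through the smallest principal component of each cluster; the following generic construction is an illustration, not the dissertation's code:

# Orthogonal distance from every point in X to the hyperplane best fitting
# one cluster: the hyperplane's normal is the direction of smallest variance.
orthogonal_distances <- function(X, members) {
  Z  <- X[members, , drop = FALSE]
  mu <- colMeans(Z)
  e  <- eigen(cov(Z), symmetric = TRUE)
  a  <- e$vectors[, ncol(X)]         # eigenvector with the smallest eigenvalue
  abs(sweep(X, 2, mu) %*% a)         # |(x - mu)' a| for every point
}

Note that this construction fails when the cluster covariance matrix is rank deficient, which is exactly the situation, common after a kernel transformation, that the new distance proposed in the dissertation is designed to handle.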


Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay before the latest theses are added.

Robust methods for generalized partial linear partial additive models with an application to detection of disease outbreaks (2019)

An essential function of the public health system is to detect and control disease outbreaks. The British Columbia Centre for Disease Control (BC CDC) monitors approximately 60 notifiable disease counts from 8 branch offices in 16 health service delivery areas. These disease counts exhibit a variety of characteristics, such as seasonality in meningococcal disease and a long-term trend in acute hepatitis A. As staff need to determine whether the reported counts are higher than expected, the detection process is both costly and fallible. To alleviate this problem, in the early 2000s the BC CDC commissioned an automated statistical method to detect disease outbreaks. The method is based on a generalized additive partially linear model and appears to capture the characteristics of disease counts. However, it relies on certain ad hoc criteria to flag counts as an outbreak. The BC CDC is interested in considering other alternatives. In this thesis, we discuss an outbreak detection method based on robust estimators. It builds on recently proposed robust estimators for additive, generalized additive, and generalized linear models. Using real and simulated data, we compare our method with that of the BC CDC and other natural competitors, and present promising results.
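In broad strokes, methods of this kind fit a count model with trend and seasonal components and flag counts that exceed an upper limit. The following generic sketch (assumed for illustration; it is neither the BC CDC system nor the thesis's robust estimator) uses the mgcv package:

library(mgcv)
set.seed(1)
t <- 1:260                                    # five years of weekly counts
week <- t %% 52
y <- rpois(260, exp(1 + 0.4 * sin(2 * pi * week / 52)))     # simulated counts
fit <- gam(y ~ s(t) + s(week, bs = "cc"), family = poisson) # trend + seasonality
mu_hat <- predict(fit, type = "response")
flag <- y > qpois(0.995, mu_hat)              # crude rule: flag implausibly high counts
which(flag)

A robust variant replaces the likelihood-based fit with one that downweights atypical counts, so that a past outbreak does not inflate the baseline against which new counts are judged.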


Dimension reduction using Independent Component Analysis with an application in business psychology (2017)

Independent component analysis (ICA) is used for separating a set of mixed signals into statistically independent additive subcomponents. The methodology extracts as many independent components as there are dimensions or features in the original dataset. Since not all of these components may be of importance, a few solutions have been proposed to reduce the dimension of the data using ICA. However, most of these solutions rely on prior knowledge or estimation of the number of independent components to be used in the model. This work proposes a methodology that selects fewer components than the original dimension of the data, such that they best approximate the original dataset, without prior knowledge or estimation of their number. The trade-off between the number of independent components retained in the model and the loss of information is explored. This work presents the mathematical foundations of the proposed methodology as well as the results of its application to a business psychology dataset.
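A generic way to examine this trade-off (an assumed illustration; the thesis's selection criterion may differ) is to reconstruct the data from k components and track the relative reconstruction error:

library(fastICA)
set.seed(1)
X <- scale(matrix(rnorm(200 * 8), 200, 8) %*% matrix(runif(64), 8, 8))
rel_err <- sapply(1:7, function(k) {
  ic <- fastICA(X, n.comp = k)                 # decomposes the data as S %*% A
  sum((ic$X - ic$S %*% ic$A)^2) / sum(ic$X^2)  # relative reconstruction error
})
round(rel_err, 3)    # error falls as more independent components are retained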


A Markov Random Fields Approach to Modelling Habitat (2015)

Habitat modelling presents a challenge due to the variety of data available and their varying accuracy. One option is to use Markov random fields as a way to incorporate these distinct types of data in habitat modelling. In this work, we provide a brief overview of the intuition, mathematical theory, and application considerations behind modelling habitat under this framework. In particular, an auto-logistic model is built and applied to modelling sea lion habitat using synthetic data. First, we explore modelling one sample of data; the framework is then extended to the multi-sample scenario. Finally, the theory behind the methodology is presented, along with the results of the applied implementation.
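A minimal Gibbs sampler for an auto-logistic field on a lattice (a standard construction, shown only as an assumed illustration of the model class):

# Each site is binary; its conditional log-odds of presence increase by beta
# for every occupied neighbour (4-neighbourhood, free boundaries).
autologistic_gibbs <- function(n = 40, alpha = -1, beta = 0.7, sweeps = 100) {
  z <- matrix(rbinom(n * n, 1, 0.5), n, n)
  for (s in seq_len(sweeps)) {
    for (i in seq_len(n)) {
      for (j in seq_len(n)) {
        nb <- (if (i > 1) z[i - 1, j] else 0) + (if (i < n) z[i + 1, j] else 0) +
              (if (j > 1) z[i, j - 1] else 0) + (if (j < n) z[i, j + 1] else 0)
        z[i, j] <- rbinom(1, 1, plogis(alpha + beta * nb))
      }
    }
  }
  z
}
image(autologistic_gibbs(), col = c("white", "grey30"))  # clumped habitat map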


A robust fit for generalized partial linear partial additive models (2013)

In regression studies, semi-parametric models provide both flexibility and interpretability. In this thesis, we focus on a robust model-fitting algorithm for a family of semi-parametric models: the Generalized Partial Linear Partial Additive Models (GAPLMs), a hybrid of the widely used Generalized Linear Models (GLMs) and Generalized Additive Models (GAMs). Traditional model-fitting algorithms are mainly based on likelihood procedures. However, the resulting fits can be severely distorted by the presence of a small proportion of atypical observations (also known as “outliers”) that deviate from the assumed model. Furthermore, traditional model diagnostic methods may also fail to detect outliers. To address these problems systematically, we develop a robust model-fitting algorithm that is resistant to the effect of outliers. Our method combines the backfitting algorithm with the generalized Speckman estimator to fit “partial linear partial additive” models. Instead of using the likelihood-based weights and adjusted response from the generalized local scoring algorithm (GLSA), we apply the robust weights and adjusted response derived from the robust quasi-likelihood proposed by Cantoni and Ronchetti (2001). We also extend previous methods by proposing a model prediction algorithm for GAPLMs, and compare our robust method with the non-robust one given by the R function gam.
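The flavour of the robust weights can be sketched on a plain Poisson GLM. This simplified, assumed version omits the Fisher-consistency correction term of the Cantoni-Ronchetti estimator and the semi-parametric (backfitting/Speckman) components:

# IRLS for a Poisson GLM where observations with large Pearson residuals
# receive Huber-type downweights; X must include an intercept column.
robust_poisson <- function(X, y, cc = 1.345, n_iter = 25) {
  beta <- rep(0, ncol(X))
  for (it in seq_len(n_iter)) {
    mu <- exp(drop(X %*% beta))
    r  <- (y - mu) / sqrt(mu)               # Pearson residuals
    wr <- pmin(1, cc / pmax(abs(r), 1e-8))  # robustness weights in (0, 1]
    z  <- log(mu) + (y - mu) / mu           # working response
    W  <- wr * mu                           # robust weight x canonical IRLS weight
    beta <- solve(crossprod(X, W * X), crossprod(X, W * z))
  }
  drop(beta)
}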


Lower Quantile Estimation of Wood Strength Data (2012)

In wood engineering, lower quantile estimation is vital to the safety of construction with wood materials. In this thesis, we first study the censored Weibull maximum likelihood estimate (MLE) of the lower quantile, as used in the current industrial standard D5457 (ASTM, 2004a), from a statistical point of view. According to our simulations, the lower quantile estimated by the censored Weibull MLE with the 10th empirical percentile as the censoring threshold has a smaller mean squared error (MSE) than the intuitive parametric or non-parametric quantile estimates. This advantage is achieved by a good balance between variance and bias, brought about by the subjective censoring. However, the standard D5457 (ASTM, 2004a) utilizes only a small (10%) and ad hoc proportion of the data in the lower quantile estimation, which motivates us to improve on it. First, we consider fitting a more complex model, such as a Weibull mixture, to a larger (e.g., 70%) proportion of the data set with subjective censoring, which leads to the censored Weibull mixture estimate of the lower quantile. Also, the bootstrap can be used to select a better censoring threshold for the censored Weibull MLE, which leads to the bootstrap censored Weibull MLE. According to our simulations, both proposals yield better lower quantile estimates than the standard D5457, and the bootstrap censored Weibull MLE is better than the censored Weibull mixture.
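The censoring construction can be sketched with the survival package (assumed details; the standard's exact procedure differs):

library(survival)
set.seed(1)
x   <- rweibull(300, shape = 5, scale = 50)   # simulated strength data
thr <- quantile(x, 0.10)                      # 10th empirical percentile
# values above thr are treated as right-censored at thr:
fit <- survreg(Surv(pmin(x, thr), x <= thr) ~ 1, dist = "weibull")
# survreg models log(T) = intercept + scale * (Gumbel error), hence:
shape_hat <- 1 / fit$scale
scale_hat <- exp(coef(fit))
qweibull(0.05, shape = shape_hat, scale = scale_hat)   # lower-quantile estimate

The subjective censoring keeps the Weibull fit focused on the lower tail, which is how the bias of a global fit is reduced at a modest cost in variance.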


Robustification of the sparse K-means clustering algorithm (2011)

Searching a dataset for a “natural grouping/clustering” is an important exploratory technique for understanding complex multivariate datasets. One might expect that the true underlying clusters present in a dataset differ only with respect to a small fraction of the features. Furthermore, one might fear that the dataset contains outliers. Through simulation studies, we find that an existing sparse clustering method can be severely affected by a single outlier. In this thesis, we develop a robust clustering method that is also able to perform variable selection: we robustify sparse K-means (Witten and Tibshirani [28]), based on the idea of trimmed K-means introduced by Gordaliza [7] and Gordaliza [8]. Since high-dimensional datasets often contain quite a few missing observations, we make our proposed method capable of handling datasets with missing values. The performance of the proposed robust sparse K-means is assessed in various simulation studies and two data analyses. The simulation studies show that robust sparse K-means performs better than competing algorithms in terms of both the selection of features and the selection of a partition when datasets are contaminated. The analysis of a microarray dataset shows that robust sparse K-means best reflects the oestrogen receptor status of the patients among all competing algorithms. We also adapt Clest (Dudoit and Fridlyand [5]) to our robust sparse K-means to provide an automatic robust procedure for selecting the number of clusters. Our proposed methods are implemented in the R package RSKC.
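A small usage example of the package (toy data assumed; see ?RSKC for the documented arguments, recalled here as d = data, ncl = number of clusters, alpha = trimming proportion, L1 = bound inducing sparsity in the feature weights):

library(RSKC)
set.seed(1)
d <- matrix(rnorm(90 * 10), 90, 10)
d[ 1:30, 1:2] <- d[ 1:30, 1:2] + 3     # two informative features,
d[31:60, 1:2] <- d[31:60, 1:2] - 3     # three true clusters
d[1, ] <- d[1, ] + 15                  # one gross outlier
res <- RSKC(d, ncl = 3, alpha = 0.1, L1 = 2)
res$labels                             # trimmed, sparse cluster assignments
res$weights                            # weights should concentrate on features 1-2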


 

