Graduate Student Supervision
Doctoral Student Supervision (Jan 2008 - Nov 2020)
Preferential sampling in geostatistics occurs when the locations at which observations are made may depend on the spatial process that underlies the correlation structure of the measurements. If ignored, this may affect the parameter estimates of the model and the resulting spatial predictions. In this thesis, we first show that previously proposed Monte Carlo estimates for the likelihood function may not be approximating the desired function. Furthermore, we argue that for preferential sampling of moderate complexity, alternative and widely available numerical methods to approximate the likelihood function produce better results than Monte Carlo methods. We illustrate our findings on various data sets, including the biomonitoring Galicia dataset analysed previously in the literature. Research on preferential sampling has so far been restricted to stationary sampling locations such as monitoring sites. In this thesis, we also expand the methodology for applicability in cases where the sensors are moving through the domain of interest. More specifically, we propose a flexible framework for inference on preferentially sampled fields, where the process that generates the sampling locations is stochastic and moving through a 2-dimensional space. The main application of these methods is the sampling of ocean temperature fields by marine mammal mounted sensors. This is an area of research which has grown dramatically over the past 25 years and is providing scientists with a wealth of new oceanographic information in areas of our oceans previously not well understood. We show that standard geostatistical models may not be reliable for this type of data, due to the possibility that the regions visited by the animals may depend on the ocean temperatures, hence resulting in a type of preferential sampling.
Our simulation studies confirm that predictions obtained from the preferential sampling model are more reliable when this phenomenon is present, and they compare very well to the standard ones when there is no preferential sampling. We apply our methods to sea surface temperature data collected by southern elephant seals in the southern Indian Ocean and show how predictions of sea surface temperature fields using these data may vary when accounting for the preferential movement.
Cluster analysis is the study of how to partition data into homogeneous subsets so that the partitioned data share some common characteristic. In one to three dimensions, the human eye can distinguish well between clusters of data if clearly separated. However, when there are more than three dimensions and/or the data is not clearly separated, an algorithm is required which needs a metric of similarity that quantitatively measures the characteristic of interest. Linear Grouping Analysis (LGA, Van Aelst et al. 2006) is an algorithm for clustering data around hyperplanes, and is most appropriate when: 1) the variables are related/correlated, which results in clusters with an approximately linear structure; and 2) it is not natural to assume that one variable is a “response”, and the remainder the “explanatories”. LGA measures the compactness within each cluster via the sum of squared orthogonal distances to hyperplanes formed from the data. In this dissertation, we extend the scope of problems to which LGA can be applied. The first extension relates to the linearity requirement inherent within LGA, and proposes a new method of non-linearly transforming the data into a Feature Space, using the Kernel Trick, such that in this space the data might then form linear clusters. A possible side effect of this transformation is that the dimension of the transformed space is significantly larger than the number of observations in a given cluster, which causes problems with orthogonal regression. Therefore, we also introduce a new method for calculating the distance of an observation to a cluster when its covariance matrix is rank deficient. The second extension concerns the combinatorial problem for optimizing an LGA objective function, and adapts an existing algorithm, called BIRCH, for use in providing fast, approximate solutions, particularly for the case when data does not fit in memory.
We also provide solutions based on BIRCH for two other challenging optimization problems in the field of robust statistics, and demonstrate, via simulation study as well as application on actual data sets, that the BIRCH solution compares favourably to the existing state-of-the-art alternatives, and in many cases finds a better solution.
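The compactness measure described in this abstract (the sum of squared orthogonal distances from each cluster's points to a hyperplane fitted to that cluster) can be illustrated with a minimal sketch. This is not the LGA algorithm itself and not code from the thesis: the function names are hypothetical, and the hyperplane is assumed to be the orthogonal-regression fit obtained from the smallest-eigenvalue eigenvector of the cluster covariance matrix.

```python
import numpy as np


def fit_hyperplane(points):
    """Orthogonal-regression hyperplane through `points` (shape n x d).

    Returns the cluster centre and a unit normal vector.
    """
    center = points.mean(axis=0)
    # eigh returns eigenvalues in ascending order, so the first eigenvector
    # is the direction of least variance, i.e. the hyperplane normal.
    _, eigvecs = np.linalg.eigh(np.cov(points, rowvar=False))
    return center, eigvecs[:, 0]


def squared_orthogonal_distances(points, center, normal):
    """Squared distance of each point to the hyperplane (center, normal)."""
    return ((points - center) @ normal) ** 2


def lga_objective(points, labels):
    """Sum over clusters of squared orthogonal distances to each cluster's
    own fitted hyperplane (illustrative objective only, assumed form)."""
    total = 0.0
    for k in np.unique(labels):
        cluster = points[labels == k]
        center, normal = fit_hyperplane(cluster)
        total += squared_orthogonal_distances(cluster, center, normal).sum()
    return total
```

For points lying exactly on two distinct lines, the objective is essentially zero under the correct partition and strictly larger under a partition that mixes the two lines, which is what a search over partitions would exploit.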
Master's Student Supervision (2010 - 2018)
Independent component analysis (ICA) is used for separating a set of mixed signals into statistically independent additive subcomponents. The methodology extracts as many independent components as there are dimensions or features in the original dataset. Since not all of these components may be of importance, a few solutions have been proposed to reduce the dimension of the data using ICA. However, most of these solutions rely on prior knowledge or estimation of the number of independent components that are to be used in the model. This work proposes a methodology that addresses the problem of selecting fewer components than the original dimension of the data that best approximate the original dataset without prior knowledge or estimation of their number. The trade-off between the number of independent components retained in the model and the loss of information is explored. This work presents mathematical foundations of the proposed methodology as well as the results of its application to a business psychology dataset.
Habitat modelling presents a challenge due to the variety of data available and their corresponding accuracy. One option is to use Markov random fields as a way to incorporate these distinct types of data for habitat modelling. In this work, I provide a brief overview of the intuition, mathematical theory, and application considerations behind modelling habitat under this framework. In particular, an auto-logistic model is built and applied to modelling sea lion habitat using synthetic data. First, we explore modelling one sample of data. Afterwards, the framework is extended to the multi-sample scenario. Finally, the theory for the methodology and the results of the applied implementation are presented.
In regression studies, semi-parametric models provide both flexibility and interpretability. In this thesis, we focus on a robust model fitting algorithm for a family of semi-parametric models, the Generalized Partial Linear Partial Additive Models (GAPLMs), which are a hybrid of the widely-used Generalized Linear Models (GLMs) and Generalized Additive Models (GAMs). The traditional model fitting algorithms are mainly based on likelihood procedures. However, the resulting fits can be severely distorted by the presence of a small portion of atypical observations (also known as “outliers”), which deviate from the assumed model. Furthermore, the traditional model diagnostic methods might also fail to detect outliers. In order to systematically solve these problems, we develop a robust model fitting algorithm which is resistant to the effect of outliers. Our method combines the backfitting algorithm and the generalized Speckman estimator to fit the “partial linear partial additive” styled models. Instead of using the likelihood-based weights and adjusted response from the generalized local scoring algorithm (GLSA), we apply the robust weights and adjusted response derived from the robust quasi-likelihood proposed by Cantoni and Ronchetti (2001). We also extend previous methods by proposing a model prediction algorithm for GAPLMs. To compare our robust method with the non-robust one given by the R function gam
Searching a dataset for the “natural grouping / clustering” is an important explanatory technique for understanding complex multivariate datasets. One might expect that the true underlying clusters present in a dataset differ only with respect to a small fraction of the features. Furthermore, one might be concerned that the dataset contains outliers. Through simulation studies, we find that an existing sparse clustering method can be severely affected by a single outlier. In this thesis, we develop a robust clustering method that is also able to perform variable selection: we robustify sparse K-means (Witten and Tibshirani), based on the idea of trimmed K-means introduced by Gordaliza. Since high dimensional datasets often contain quite a few missing observations, we made our proposed method capable of handling datasets with missing values. The performance of the proposed robust sparse K-means is assessed in various simulation studies and two data analyses. The simulation studies show that robust sparse K-means performs better than other competing algorithms in terms of both the selection of features and the selection of a partition when datasets are contaminated. The analysis of a microarray dataset shows that, among all competing algorithms, robust sparse K-means best reflects the oestrogen receptor status of the patients. We also adapt Clest (Dudoit and Fridlyand) to our robust sparse K-means to provide an automatic robust procedure for selecting the number of clusters. Our proposed methods are implemented in the R package RSKC.
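The trimming idea that this abstract builds on can be sketched as follows. This is a plain trimmed K-means loop written for illustration only: it is not the robust sparse K-means of the thesis, it performs no feature weighting or missing-value handling, and it does not reproduce the interface of the RSKC package. The function name, its parameters, and the fixed iteration count are all assumptions.

```python
import numpy as np


def trimmed_kmeans(points, init_centers, trim_fraction=0.1, n_iter=20):
    """Illustrative trimmed K-means (hypothetical helper, not RSKC).

    At every iteration the `trim_fraction` of points farthest from their
    nearest centre are excluded before the centres are recomputed.
    Returns (centers, labels, trimmed_indices); labels and trimmed indices
    come from the assignment step of the final iteration.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    n = len(points)
    n_keep = n - int(np.floor(trim_fraction * n))
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every centre (n x k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        nearest = d2.min(axis=1)
        # Keep the n_keep points closest to their assigned centre.
        keep = np.argsort(nearest)[:n_keep]
        for k in range(len(centers)):
            members = keep[labels[keep] == k]
            if len(members) > 0:
                centers[k] = points[members].mean(axis=0)
    trimmed = np.setdiff1d(np.arange(n), keep)
    return centers, labels, trimmed
```

With two tight groups and a single distant outlier, the outlier ends up in the trimmed set and no longer pulls its nearest centre away from the group, which is the behaviour that motivates combining trimming with sparse K-means.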