Ruben Zamar

Professor

Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

A data-driven ensemble framework for modeling high-dimensional data: theory, methods, algorithms, and applications (2022)

Sparse and ensemble methods are the two main approaches in the statistical literature for modeling high-dimensional data. On the one hand, sparse methods yield a single predictive model that is generally interpretable and possesses desirable theoretical properties. On the other hand, multi-model ensemble methods can generally achieve superior prediction accuracy, but current ensemble methodology relies on randomization or boosting to generate diverse models, which results in uninterpretable ensembles. In this dissertation, we introduce a new data-driven ensemble framework that combines ideas from sparse modeling and ensemble modeling. We search for optimal ways to select and split the candidate predictors into subsets for the different models that will be combined in an ensemble. Each model in the ensemble provides an alternative explanation for the relationship between the predictor variables and the response variable. The degrees of sparsity of the individual models and the diversity among the models are both driven by the data. The task of optimally splitting the candidate predictors into subsets results in a computationally intractable combinatorial optimization problem when the number of predictors is large. We demonstrate the potential of an exhaustive search for the optimal split of the predictors into the different models of an ensemble on specifically designed low-dimensional data that mimic the typical behavior of high-dimensional data. We propose different computational approaches to the optimal split selection problem. We introduce a multiconvex relaxation in the regression case and develop efficient algorithms to compute solutions for any level of sparsity and diversity.
We show that the resulting ensembles yield consistent predictions and consistent individual models, and provide empirical evidence that this method outperforms state-of-the-art sparse and ensemble methods for high-dimensional prediction tasks using simulated data and a chemometrics application. We then extend the methodology, theory and algorithms to classification ensembles, and investigate the performance of the method on simulated data and a large collection of gene expression datasets. We finally propose a direct computational approach to calculate approximate solutions to the optimal split selection problem in the regression case and benchmark the performance of the method on gene expression data. Supplementary materials available at: http://hdl.handle.net/2429/83086
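The split-selection idea can be illustrated on a small toy problem. The sketch below is hypothetical (invented data, a plain least-squares base learner, and test-set error as the selection criterion), not the dissertation's actual method: it exhaustively splits four candidate predictors into two disjoint subsets, fits one model per subset, and averages their predictions, keeping the best-scoring split.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 candidate predictors; the response depends on x0 and x2.
n, p = 200, 4
X = rng.standard_normal((n, p))
y = X[:, 0] + X[:, 2] + 0.1 * rng.standard_normal(n)

def ols_predict(X_tr, y_tr, X_te):
    """Least-squares fit with intercept; returns predictions on X_te."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    beta, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ beta

# Exhaustive search over all splits of the predictors into two disjoint
# subsets; each subset defines one model of the ensemble, and the
# ensemble prediction is the average of the two models.
train, test = slice(0, 150), slice(150, None)
best_mse, best_split = np.inf, None
for r in range(1, p):
    for g1 in itertools.combinations(range(p), r):
        g2 = tuple(j for j in range(p) if j not in g1)
        pred = 0.5 * (ols_predict(X[train][:, g1], y[train], X[test][:, g1]) +
                      ols_predict(X[train][:, g2], y[train], X[test][:, g2]))
        mse = np.mean((y[test] - pred) ** 2)
        if mse < best_mse:
            best_mse, best_split = mse, (g1, g2)

print(best_split, round(best_mse, 3))
```

Even this brute-force version shows why the search becomes intractable: the number of splits grows exponentially in p, which motivates the relaxations developed in the dissertation.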


Hierarchical clustering of observations and features in high-dimensional data (2017)

In this thesis, we present new developments in hierarchical clustering for high-dimensional data. We consider different use cases of hierarchical clustering, namely clustering observations for exploratory analysis and clustering high-dimensional features for adaptive feature grouping and ensembling. We first focus on the clustering of observations. In high-dimensional data, the potential presence of noise features and outliers poses unique challenges to existing hierarchical clustering techniques. We propose Robust Sparse Hierarchical Clustering (RSHC) and Multi-rank Sparse Hierarchical Clustering (MrSHC) to address these challenges. We show that, via robust feature selection techniques, both RSHC and MrSHC can handle noise features and outliers in high-dimensional data, resulting in better clustering accuracy and interpretation compared to existing hierarchical clustering methods. We then consider clustering of features in high-dimensional data. We propose a new hierarchical clustering technique to adaptively divide a large number of features into subgroups called Regression Phalanxes. Features in the same Regression Phalanx work well together as predictors in a pre-defined regression model. Models built on different Regression Phalanxes are then considered for further ensembling. We show that the ensemble of Regression Phalanxes resulting from the hierarchical clustering produces further gains in prediction accuracy when applied to an effective method such as the Lasso or Random Forests.
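The feature-grouping idea can be sketched in a few lines of numpy. This is a deliberately crude stand-in for the thesis's hierarchical clustering of features (invented data; a greedy correlation-threshold grouping instead of a full dendrogram): correlated features are grouped, one least-squares model is fit per group, and the group models are averaged.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two latent signals, three noisy copies of each -> 6 features.
n = 300
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
X = np.column_stack([z1 + 0.2 * rng.standard_normal(n) for _ in range(3)] +
                    [z2 + 0.2 * rng.standard_normal(n) for _ in range(3)])
y = z1 - z2 + 0.1 * rng.standard_normal(n)

# Greedy single-linkage-style grouping of features by absolute
# correlation: a crude surrogate for the hierarchical clustering that
# forms groups of features that work well together.
C = np.abs(np.corrcoef(X, rowvar=False))
groups, unassigned = [], set(range(X.shape[1]))
while unassigned:
    j = unassigned.pop()
    group = [j]
    for k in list(unassigned):
        if C[j, k] > 0.5:
            group.append(k)
            unassigned.remove(k)
    groups.append(group)

# One least-squares model per group; the ensemble averages them.
def fit_predict(cols):
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta

ensemble = np.mean([fit_predict(g) for g in groups], axis=0)
print(len(groups))
```

With the toy data above, the grouping recovers the two latent signals, one group per signal.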


Robust estimation and inference under cellwise and casewise contamination (2017)

Cellwise outliers are likely to occur together with casewise outliers in datasets of relatively large dimension. Recent work has shown that traditional high breakdown point procedures may fail when applied to such datasets. In this thesis, we consider this problem when the goal is to (1) estimate the multivariate location and scatter matrix and (2) estimate regression coefficients and confidence intervals for inference, both of which are cornerstones of multivariate data analysis. To address the first problem, we propose a two-step procedure to deal with casewise and cellwise outliers, which generally proceeds as follows: first, it uses a filter to identify cellwise outliers and replace them by missing values; then, it applies a robust estimator to the incomplete data to down-weight casewise outliers. We show that the two-step procedure is consistent under the central model provided the filter is appropriately chosen. The proposed two-step procedure for estimating location and scatter matrix is then applied in regression for the case of continuous covariates by simply adding a third step, which computes robust regression coefficients from the robust multivariate location and scatter matrix obtained in the second step. We show that the three-step estimator is consistent and asymptotically normal at the central model for the case of continuous covariates. Finally, the estimator is extended to handle both continuous and dummy covariates. Extensive simulation results and real data examples show that the proposed methods handle both cellwise and casewise outliers well.
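A minimal numpy sketch of the filter-then-estimate idea, under strong simplifications (invented data; a univariate robust z-score filter; a simple available-cases estimate standing in for the robust second-step estimator): cells with extreme robust z-scores are set to missing, and location and scatter are then estimated from the remaining cells.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data with cellwise outliers planted in one column.
n, p = 200, 3
X = rng.standard_normal((n, p))
X[:10, 1] = 10.0  # contaminated cells

# Step 1 (filter): flag cells whose robust z-score (median/MAD) is
# extreme and replace them by NaN, treating them as missing.
med = np.median(X, axis=0)
mad = 1.4826 * np.median(np.abs(X - med), axis=0)
z = (X - med) / mad
Xf = np.where(np.abs(z) > 3.0, np.nan, X)

# Step 2 (estimate on incomplete data): a simple available-cases
# estimate stands in for the robust estimator applied to the filtered
# data in the actual two-step procedure.
loc = np.nanmean(Xf, axis=0)
scatter = np.empty((p, p))
for j in range(p):
    for k in range(p):
        ok = ~np.isnan(Xf[:, j]) & ~np.isnan(Xf[:, k])
        scatter[j, k] = np.cov(Xf[ok, j], Xf[ok, k])[0, 1]

print(np.round(loc, 2))
```

The point of the sketch is the division of labor: the filter handles cellwise outliers, so the second-step estimator only needs to cope with missingness and casewise contamination.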


Rare-class classification using ensembles of subsets of variables (2013)

An ensemble of classifiers is proposed for predictive ranking of the observations in a dataset so that the rare-class observations appear at the top of the ranked list. Four drug-discovery bioassay datasets, each containing a few active and a majority of inactive chemical compounds, are used in this thesis. The compounds' activity status serves as the response variable, while a set of descriptors describing the structures of the chemical compounds serve as predictors. Five separate descriptor sets are used in each assay. The proposed ensemble aggregates over the descriptor sets by averaging probabilities of activity from random forests applied to the five descriptor sets. The resulting ensemble achieves better predictive ranking than the most accurate random forest applied to a single descriptor set. Motivated by the results of the ensemble of descriptor sets, an algorithm is developed to uncover data-adaptive subsets of variables (which we call phalanxes) in a variable-rich descriptor set. Capitalizing on the richness of variables, the algorithm looks for sets of predictors that work well together in a classifier. The data-adaptive phalanxes are formed so that they help each other in an ensemble. The phalanxes are aggregated by averaging probabilities of activity from random forests applied to the phalanxes. The ensemble of phalanxes (EPX) outperforms random forests and regularized random forests in terms of predictive ranking. In general, EPX performs very well in a descriptor set with many variables and in a bioassay containing few active compounds. The phalanxes are also aggregated within and across the descriptor sets. In all four bioassays, the resulting ensemble outperforms both the ensemble of descriptor sets and random forests applied to the pool of the five descriptor sets. The ensemble of phalanxes is also adapted to a logistic regression model and applied to the protein homology dataset from the KDD Cup 2004 competition. The ensembles are applied to a real test set. The adapted version of the ensemble is found to be more powerful in terms of predictive ranking, and less computationally demanding, than the original ensemble of phalanxes with random forests.
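The aggregation step (averaging probabilities of activity across descriptor sets and ranking by the averaged probability) can be mimicked with simulated scores. Everything below is invented for illustration: the two "descriptor sets" are just independently noisy probability estimates, not actual random forests.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated scenario: 1000 compounds, 20 active (the rare class).
n, n_active = 1000, 20
active = np.zeros(n, dtype=bool)
active[:n_active] = True

# Each "descriptor set" yields noisy activity probabilities; the
# independent noise mimics classifiers trained on different
# descriptor sets seeing different parts of the signal.
def noisy_prob():
    score = active.astype(float) + rng.normal(0, 0.8, n)
    return 1 / (1 + np.exp(-score))

p1, p2 = noisy_prob(), noisy_prob()
p_ens = (p1 + p2) / 2  # the ensemble averages the probabilities

def hits_in_top(prob, k=50):
    """Number of actives ranked in the top k of the ranked list."""
    top = np.argsort(prob)[::-1][:k]
    return int(active[top].sum())

print(hits_in_top(p1), hits_in_top(p2), hits_in_top(p_ens))
```

Averaging reduces the variance of the per-compound scores, which is why aggregating over descriptor sets (or phalanxes) tends to push more actives toward the top of the ranked list.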


Robust estimation of multivariate scatter in non-affine equivariant scenarios (2010)

We consider the problem of robust estimation of the scatter matrix of an elliptical distribution when the observed data are corrupted in a cell-wise manner. The first half of the thesis develops a framework for dealing with data subject to independent cell-wise contamination. Each data cell (as opposed to each data case, as in traditional robustness) can be contaminated independently of the rest of the case. Instead of downweighting the whole case, we attempt to identify the affected cells, remove the offending values and treat them as missing at random for subsequent likelihood-based processing. We explore several variations of a detection procedure that takes into account the multivariate structure of the data and arrive at a heuristic algorithm that identifies and removes a large proportion of dangerous independent contamination. Although there are not many existing methods to measure against, the proposed covariance estimate compares favorably to naive alternatives such as pairwise estimates or univariate Winsorising. The cell-wise data corruption mechanism that we deal with in the second half of this thesis is missing data. Missing data on their own have been well studied and likelihood methods are well developed. The new setting that we are interested in is when missing data come together with traditional case-wise contamination. Both issues have been studied extensively over the last few decades, but little attention has been paid to addressing them both at the same time. We propose a modification of the S-estimate that allows robust estimation of the multivariate location and scatter matrix in the presence of missing completely at random (MCAR) data. The method is based on the idea of the maximum likelihood of the observed data and extends it into the world of S-estimates. The estimate comes complete with a computation algorithm, an adjusted version of the widely used Fast-S procedure. Simulation results and applications to real datasets confirm the superiority of our method over available alternatives. Preliminary investigation reported in the concluding chapter suggests that combining the two main ideas presented in this thesis can yield an estimate that is robust against case-wise and cell-wise contamination simultaneously.


Linear clustering with application to single nucleotide polymorphism genotyping (2008)

Single nucleotide polymorphisms (SNPs) have become increasingly popular for a wide range of genetic studies. A high-throughput genotyping technology usually involves a statistical genotype calling algorithm. Most calling algorithms in the literature, using methods such as k-means and mixture models, rely on elliptical structures in the genotyping data; they may fail when the minor allele homozygous cluster is small or absent, or when the data have extreme tails or linear patterns. We propose an automatic genotype calling algorithm that further develops a linear grouping algorithm (Van Aelst et al., 2006). The proposed algorithm clusters unnormalized data points around lines rather than around centroids. In addition, we associate a quality value, the silhouette width, with each DNA sample and with each plate as a whole. This algorithm shows promise for genotyping data generated from TaqMan technology (Applied Biosystems). A key feature of the proposed algorithm is that it applies to unnormalized fluorescent signals when the TaqMan SNP assay is used. The algorithm could also potentially be adapted to other fluorescence-based SNP genotyping technologies such as the Invader assay. Motivated by the SNP genotyping problem, we propose a partial likelihood approach to linear clustering which explores potential linear clusters in a data set. Instead of fully modelling the data, we assume only that the signed orthogonal distance from each data point to a hyperplane is normally distributed. Its relationships with several existing clustering methods are discussed. Some existing methods to determine the number of components in a data set are adapted to this linear clustering setting. Several simulated and real data sets are analyzed for comparison and illustration purposes. We also investigate some asymptotic properties of the partial likelihood approach. A Bayesian version of this methodology is helpful if some clusters are sparse but there is strong prior information about their approximate locations or properties. We propose a Bayesian hierarchical approach which is particularly appropriate for identifying sparse linear clusters. We show that the sparse cluster in SNP genotyping datasets can be successfully identified after a careful specification of the prior distributions.
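The core modelling assumption (only the signed orthogonal distance to a hyperplane is normally distributed) can be made concrete for a single linear cluster. The sketch below uses invented data and orthogonal regression via SVD; it is a one-cluster illustration of the distance model, not the thesis's full partial likelihood clustering.

```python
import numpy as np

rng = np.random.default_rng(4)

# Points scattered around the line y = 2x: one linear cluster.
n = 500
x = rng.uniform(-1, 1, n)
pts = np.column_stack([x, 2 * x]) + 0.05 * rng.standard_normal((n, 2))

# Orthogonal regression: the hyperplane normal is the direction of
# smallest variance; the model concerns only the signed orthogonal
# distance of each point to that hyperplane.
centered = pts - pts.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
normal = Vt[-1]          # unit normal of the fitted line
d = centered @ normal    # signed orthogonal distances
sigma = d.std()

# Partial log-likelihood: distances assumed N(0, sigma^2).
loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - d**2 / (2 * sigma**2))
print(round(sigma, 3))
```

In the full method, each candidate hyperplane contributes such a distance likelihood, and cluster membership and the number of linear components are handled on top of this.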


Robust genotype classification using dynamic variable selection (2008)

Single nucleotide polymorphisms (SNPs) are DNA sequence variations occurring when a single nucleotide (A, T, C or G) is altered. Arguably, SNPs account for more than 90% of human genetic variation. Dr. Tebbutt's laboratory has developed a highly redundant SNP genotyping assay consisting of multiple probes with signals from multiple channels for a single SNP, based on arrayed primer extension (APEX). The strength of this platform is its unique redundancy: multiple probes for a single SNP. Using this microarray platform, we have developed fully automated genotype calling algorithms based on linear models for individual probe signals, using dynamic variable selection at the prediction level. The algorithms combine separate analyses based on the multiple probe sets to give a final confidence score for each candidate genotype. Our proposed classification model achieved an accuracy level above 99.4% with a 100% call rate for the SNP genotype data, which is comparable with existing genotyping technologies. We discuss the appropriateness of the proposed model in relation to other existing high-throughput genotype calling algorithms. In this thesis we have explored three new ideas for classification with high-dimensional data: (1) ensembles of various sets of predictors with a built-in dynamic property; (2) robust classification at the prediction level; and (3) a proper confidence measure for dealing with failed predictor(s). We found that a mixture model for classification provides robustness against outlying values of the explanatory variables. Furthermore, the algorithm chooses among different sets of explanatory variables in a dynamic way, prediction by prediction. We analyzed several data sets, including real and simulated samples, to illustrate these features. Our model-based genotype calling algorithm captures the redundancy in the system, considering all the underlying probe features of a particular SNP and automatically down-weighting any 'bad data' corresponding to image artifacts on the microarray slide or failure of a specific chemistry. Though motivated by this genotyping application, the proposed methodology would apply to other classification problems where the explanatory variables fall naturally into groups or where outliers in the explanatory variables require variable selection at the prediction stage for robustness.


Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Robust and sparse regression in the presence of cellwise and casewise contamination with application in data quality modelling (2020)

This thesis considers the problem of robust and sparse estimation of linear regression parameters in data with structural and independent contamination. Independent outliers can propagate in data with relatively many dimensions, resulting in a high fraction of observations with at least one outlying cell. Recent work has shown that traditional robust regression methods are not highly robust to such outliers. We investigate the application of Robust Least Angle Regression (RLARS) to data with independent contamination. We also propose two modified versions of RLARS to further improve its performance. The first method applies RLARS to data that have been filtered of independent outliers. The second method performs RLARS with the Lasso modification for Least Angle Regression (LARS). Extensive simulations show that RLARS is resilient to structural and independent contamination. Compared with RLARS, simulation results show that the first modified version has significantly improved robustness to independent contamination and the second modified version has improved robustness when there is a large number of predictors. We also consider the application of the proposed methods to data quality modelling in a case study for MineSense Technologies Ltd. (MineSense). MineSense develops sensor packages for use in the harsh conditions of an active mine. To maintain high system availability and performance, data must be monitored for deterioration in sensor health or a change in the data-generating process, such as a change in ore body, which can manifest as outliers. We pose the problem of contamination detection (identifying whether a dataset contains outliers) as distinct from outlier detection (identifying which cases or cells are outliers). We propose a contamination detection method based on the comparison of robust and non-robust linear regression estimates. When outliers are present, the robust and non-robust estimates differ significantly, indicating the presence of contamination. Simulation results and analysis of real sensor data provided by MineSense suggest that our method can effectively detect the presence of contamination with a low false detection rate.
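The robust-versus-non-robust comparison can be sketched with simple stand-ins: invented data, ordinary least squares as the non-robust estimate, and a Huber M-estimate (computed by iteratively reweighted least squares) as the robust one. The divergence threshold is made up for illustration; none of this is the thesis's actual detection rule.

```python
import numpy as np

rng = np.random.default_rng(5)

# Clean linear data plus a block of casewise outliers.
n = 200
x = rng.standard_normal(n)
y = 2.0 * x + 0.2 * rng.standard_normal(n)
y[:20] += 15.0  # contaminated cases

A = np.column_stack([np.ones(n), x])

def ols(A, y):
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def huber_irls(A, y, c=1.345, iters=50):
    """Huber M-estimate via iteratively reweighted least squares."""
    beta = ols(A, y)
    for _ in range(iters):
        r = y - A @ beta
        s = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12
        w = np.minimum(1.0, c / (np.abs(r / s) + 1e-12))  # Huber weights
        sw = np.sqrt(w)
        beta = ols(A * sw[:, None], y * sw)
    return beta

b_ols, b_rob = ols(A, y), huber_irls(A, y)
gap = np.max(np.abs(b_ols - b_rob))
contaminated = gap > 0.5  # estimates diverge -> flag contamination
print(np.round(b_ols, 2), np.round(b_rob, 2), contaminated)
```

On clean data the two estimates agree and the gap stays near zero; the outliers above pull the OLS intercept away while the robust fit resists, so the gap flags contamination.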


Linear Model Selection Based on Extended Robust Least Angle Regression (2012)

In variable selection problems, when the number of candidate covariates is relatively large, a "two-step" model building strategy consisting of two consecutive steps, sequencing and segmentation, is often used. Sequencing first orders all the candidate covariates into a list in which the more "important" ones are likely to appear at the beginning. Then, in the segmentation step, subsets of the first m (chosen by the user) covariates at the top of the sequenced list are carefully examined in order to select the final prediction model. This thesis focuses mainly on the sequencing step. Least Angle Regression (LARS), proposed by Efron, Hastie, Johnstone and Tibshirani (2004), is a powerful step-by-step algorithm that can be used to sequence the candidate covariates in order of their importance. Khan, Van Aelst and Zamar (2007) further proposed its robust version, Robust LARS, which is robust against outliers and computationally efficient. However, neither the original LARS nor Robust LARS can carry out the sequencing step when the candidate covariates contain both continuous and nominal variables. To remedy this, we propose Extended Robust LARS, which introduces generalized definitions of correlation that include correlations between nominal and continuous variables. Simulations and real examples show that Extended Robust LARS gives superior performance to two of its competitors, classical Forward Selection and the Group Lasso.
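The sequencing step can be illustrated with a simple forward-stagewise surrogate: covariates are ordered greedily by absolute correlation with the current residual. This sketch uses invented data and plain Pearson correlations; LARS proper moves along equiangular directions, and Robust LARS replaces these correlations with robust ones.

```python
import numpy as np

rng = np.random.default_rng(6)

# 6 candidate covariates; only x0 and x3 drive the response.
n, p = 300, 6
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] + 2 * X[:, 3] + 0.5 * rng.standard_normal(n)

# Sequencing sketch: repeatedly pick the covariate most correlated
# with the current residual, then refit on all covariates chosen so
# far; the pick order is the sequenced list.
order, remaining = [], list(range(p))
residual = y.copy()
while remaining:
    cors = [abs(np.corrcoef(X[:, j], residual)[0, 1]) for j in remaining]
    order.append(remaining.pop(int(np.argmax(cors))))
    A = np.column_stack([np.ones(n)] + [X[:, k] for k in order])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ beta

print(order)
```

With this data the truly important covariates x0 and x3 are sequenced first, which is exactly what the segmentation step then exploits by examining only the top of the list.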

