Graduate Student Supervision
Doctoral Student Supervision
Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.
In this thesis, we present new developments of hierarchical clustering in high-dimensional data. We consider different use cases of hierarchical clustering, namely, clustering observations for exploratory analysis and clustering high-dimensional features for adaptive feature grouping and ensembling.

We first focus on the clustering of observations. In high-dimensional data, the potential presence of noise features and outliers poses unique challenges to existing hierarchical clustering techniques. We propose Robust Sparse Hierarchical Clustering (RSHC) and Multi-rank Sparse Hierarchical Clustering (MrSHC) to address these challenges. We show that, via robust feature selection techniques, both RSHC and MrSHC can handle noise features and outliers in high-dimensional data and achieve better clustering accuracy and interpretation compared to existing hierarchical clustering methods.

We then consider the clustering of features in high-dimensional data. We propose a new hierarchical clustering technique that adaptively divides a large number of features into subgroups called Regression Phalanxes. Features in the same Regression Phalanx work well together as predictors in a pre-defined regression model. Models built on different Regression Phalanxes are then considered for further ensembling. We show that the ensemble of Regression Phalanxes resulting from the hierarchical clustering produces further gains in prediction accuracy when applied to an effective method such as the Lasso or Random Forests.
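The core idea behind sparse hierarchical clustering, down-weighting uninformative features before computing dissimilarities, can be illustrated with a minimal sketch. This is not RSHC or MrSHC themselves: the weighting rule below (a clipped gap between the normal-consistent MAD and the standard deviation, used as a crude bimodality signal) is an illustrative stand-in for the robust sparse feature-selection criteria developed in the thesis.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two groups of 30 observations that differ only in the first 5
# of 50 features; the remaining 45 features are pure noise.
X = rng.normal(size=(60, 50))
X[:30, :5] += 4.0

# Crude robust "informativeness" weight per feature: for unimodal
# Gaussian noise 1.4826*MAD ~ SD, so the clipped gap is near zero;
# for a feature split across two clusters the MAD-based scale
# exceeds the SD and the feature receives a positive weight.
sd = X.std(axis=0)
mad = 1.4826 * np.median(np.abs(X - np.median(X, axis=0)), axis=0)
w = np.maximum(mad - sd, 0.0)

# Weighted Euclidean dissimilarity, then ordinary average linkage.
D = pdist(X * np.sqrt(w))
labels = fcluster(linkage(D, method="average"), t=2, criterion="maxclust")
```

In the actual methods the feature weights are chosen by robust sparse criteria rather than this ad hoc gap; the point of the sketch is only that a weighted dissimilarity suppresses the 45 noise features before the dendrogram is built.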
Cellwise outliers are likely to occur together with casewise outliers in datasets of relatively large dimension. Recent work has shown that traditional high breakdown point procedures may fail when applied to such datasets. In this thesis, we consider this problem when the goal is to (1) estimate the multivariate location and scatter matrix and (2) estimate regression coefficients and confidence intervals for inference, both of which are cornerstones of multivariate data analysis.

To address the first problem, we propose a two-step procedure to deal with casewise and cellwise outliers, which generally proceeds as follows: first, it uses a filter to identify cellwise outliers and replace them by missing values; then, it applies a robust estimator to the incomplete data to down-weight casewise outliers. We show that the two-step procedure is consistent under the central model provided the filter is appropriately chosen.

The proposed two-step procedure for estimating location and scatter is then applied in regression for the case of continuous covariates by simply adding a third step, which computes robust regression coefficients from the robust multivariate location and scatter matrix obtained in the second step. We show that the three-step estimator is consistent and asymptotically normal at the central model for the case of continuous covariates. Finally, the estimator is extended to handle both continuous and dummy covariates. Extensive simulation results and real data examples show that the proposed methods handle both cellwise and casewise outliers well.
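A minimal sketch of the filtering idea in the first step: flag cells with large robust z-scores, set them to missing, and estimate the scatter from the incomplete data. Both pieces below are simplified stand-ins; the thesis's filter exploits the multivariate structure of the data, and its second step uses a robust estimator that also down-weights casewise outliers, whereas `np.ma.cov` gives only an available-case estimate.

```python
import numpy as np

def cell_filter(X, cut=3.5):
    """Flag cells whose robust z-score exceeds `cut` and set them to NaN,
    so a downstream estimator can treat them as missing values."""
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    Xf = X.copy()
    Xf[np.abs(X - med) / mad > cut] = np.nan
    return Xf

rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(3), np.eye(3), size=500)
X[rng.random(X.shape) < 0.02] = 50.0        # 2% cellwise outliers

# Step 1: filter cells.  Step 2 (stand-in): available-case scatter
# estimate computed from the incomplete, filtered data.
Xf = cell_filter(X)
S = np.ma.cov(np.ma.masked_invalid(Xf), rowvar=False).data
```

Because the contamination is cellwise, only the offending cells are discarded; the rest of each affected case still contributes to the estimate.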
An ensemble of classifiers is proposed for predictive ranking of the observations in a dataset so that the rare-class observations appear at the top of the ranked list. Four drug-discovery bioassay datasets, each containing a few active and a majority of inactive chemical compounds, are used in this thesis. The compounds' activity status serves as the response variable, while a set of descriptors describing the structures of the chemical compounds serves as the predictors. Five separate descriptor sets are used for each assay. The proposed ensemble aggregates over the descriptor sets by averaging the probabilities of activity from random forests applied to the five descriptor sets. The resulting ensemble achieves better predictive ranking than the most accurate random forest applied to any single descriptor set.

Motivated by the results of the ensemble of descriptor sets, an algorithm is developed to uncover data-adaptive subsets of variables (which we call phalanxes) in a variable-rich descriptor set. Capitalizing on the richness of variables, the algorithm looks for sets of predictors that work well together in a classifier. The data-adaptive phalanxes are formed so that they help each other when combined in an ensemble. The phalanxes are aggregated by averaging the probabilities of activity from random forests applied to the phalanxes. The ensemble of phalanxes (EPX) outperforms random forests and regularized random forests in terms of predictive ranking. In general, EPX performs very well in a descriptor set with many variables and in a bioassay containing few active compounds.

The phalanxes are also aggregated within and across the descriptor sets. In all four bioassays, the resulting ensemble outperforms both the ensemble of descriptor sets and random forests applied to the pool of the five descriptor sets.

The ensemble of phalanxes is also adapted to a logistic regression model and applied to the protein homology dataset from the KDD Cup 2004 competition. The ensembles are applied to a real test set. The adapted version is found to be more powerful in terms of predictive ranking, and less computationally demanding, than the original ensemble of phalanxes with random forests.
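The aggregation step, averaging probabilities of activity from random forests grown on separate predictor groups and then ranking by the average, can be sketched as follows. The three column blocks stand in for descriptor sets (or phalanxes); the simulated data and block boundaries are assumptions for illustration, not the bioassay data used in the thesis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Imbalanced data: ~10% "active" cases, 30 predictors split into
# three hypothetical descriptor sets of 10 variables each.
X, y = make_classification(n_samples=600, n_features=30, n_informative=12,
                           weights=[0.9], random_state=0)
blocks = [slice(0, 10), slice(10, 20), slice(20, 30)]
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# One random forest per block; the ensemble averages the blocks'
# probabilities of activity and ranks test cases by that average.
probs = [RandomForestClassifier(n_estimators=200, random_state=0)
         .fit(Xtr[:, b], ytr).predict_proba(Xte[:, b])[:, 1]
         for b in blocks]
avg = np.mean(probs, axis=0)
ranking = np.argsort(-avg)       # actives should concentrate at the top
```

Averaging probabilities (rather than votes) is what makes the combined ranking finer-grained than any single forest's.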
We consider the problem of robust estimation of the scatter matrix of an elliptical distribution when the observed data are corrupted in a cell-wise manner. The first half of the thesis develops a framework for dealing with data subjected to independent cell-wise contamination: each data cell (as opposed to each data case, as in traditional robustness) can be contaminated independently of the rest of its case. Instead of downweighting the whole case, we attempt to identify the affected cells, remove the offending values and treat them as missing at random for subsequent likelihood-based processing. We explore several variations of a detection procedure that takes the multivariate structure of the data into account and arrive at a heuristic algorithm that identifies and removes a large proportion of dangerous independent contamination. Although there are few existing methods to measure against, the proposed covariance estimate compares favorably to naive alternatives such as pairwise estimates or univariate Winsorising.

The cell-wise data corruption mechanism that we deal with in the second half of this thesis is missing data. Missing data on their own have been well studied and likelihood methods are well developed. The new setting that we are interested in is when missing data come together with traditional case-wise contamination. Both issues have been studied extensively over the last few decades, but little attention has been paid to addressing both at the same time. We propose a modification of the S-estimate that allows robust estimation of the multivariate location and scatter matrix in the presence of data missing completely at random (MCAR). The method is based on the idea of the maximum likelihood of the observed data and extends it into the world of S-estimates. The estimate comes complete with a computation algorithm, an adjusted version of the widely used Fast-S procedure. Simulation results and applications to real datasets confirm the superiority of our method over available alternatives.

Preliminary investigation reported in the concluding chapter suggests that combining the two main ideas presented in this thesis can yield an estimate that is robust against case-wise and cell-wise contamination simultaneously.
Single nucleotide polymorphisms (SNPs) have become increasingly popular for a wide range of genetic studies. A high-throughput genotyping technology usually involves a statistical genotype calling algorithm. Most calling algorithms in the literature, using methods such as k-means and mixture models, rely on elliptical structures of the genotyping data; they may fail when the minor allele homozygous cluster is small or absent, or when the data have extreme tails or linear patterns.

We propose an automatic genotype calling algorithm by further developing a linear grouping algorithm (Van Aelst et al., 2006). The proposed algorithm clusters unnormalized data points around lines rather than around centroids. In addition, we associate a quality value, the silhouette width, with each DNA sample as well as with each whole plate. This algorithm shows promise for genotyping data generated from TaqMan technology (Applied Biosystems). A key feature of the proposed algorithm is that it applies to unnormalized fluorescent signals when the TaqMan SNP assay is used. The algorithm could also potentially be adapted to other fluorescence-based SNP genotyping technologies such as the Invader assay.

Motivated by the SNP genotyping problem, we propose a partial likelihood approach to linear clustering which explores potential linear clusters in a data set. Instead of fully modelling the data, we assume only that the signed orthogonal distance from each data point to a hyperplane is normally distributed. Its relationships with several existing clustering methods are discussed. Some existing methods to determine the number of components in a data set are adapted to this linear clustering setting. Several simulated and real data sets are analyzed for comparison and illustration purposes. We also investigate some asymptotic properties of the partial likelihood approach.

A Bayesian version of this methodology is helpful if some clusters are sparse but there is strong prior information about their approximate locations or properties. We propose a Bayesian hierarchical approach which is particularly appropriate for identifying sparse linear clusters. We show that the sparse cluster in SNP genotyping datasets can be successfully identified after a careful specification of the prior distributions.
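A minimal sketch of clustering points around lines rather than centroids: alternate between assigning each point to the line with the smallest orthogonal distance and refitting each line by total least squares (the leading principal direction). This is a k-means-style caricature, not the linear grouping algorithm or the partial likelihood method themselves; the restart logic and all tuning constants are illustrative assumptions.

```python
import numpy as np

def fit_line(P):
    """Total-least-squares line through points P: (center, unit direction)."""
    c = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - c)   # leading right singular vector
    return c, Vt[0]

def orth_dist(P, c, d):
    """Orthogonal distance from each row of P to the line (c, d)."""
    r = P - c
    return np.linalg.norm(r - np.outer(r @ d, d), axis=1)

def k_lines(X, k=2, iters=30, n_init=5, seed=0):
    """Alternate nearest-line assignment and total-least-squares refits,
    keeping the best of several random restarts."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        labels = rng.integers(k, size=len(X))
        for _ in range(iters):
            lines = []
            for j in range(k):
                Pj = X[labels == j]
                if len(Pj) < 2:                    # re-seed a dead cluster
                    Pj = X[rng.integers(len(X), size=2)]
                lines.append(fit_line(Pj))
            D = np.column_stack([orth_dist(X, c, d) for c, d in lines])
            labels = D.argmin(axis=1)
        tot = D[np.arange(len(X)), labels].sum()
        if best is None or tot < best[0]:
            best = (tot, labels, lines)
    return best

# Two noisy crossing lines -- a case where centroid-based clustering fails.
rng = np.random.default_rng(4)
t = rng.uniform(-1, 1, size=(2, 100))
X = np.vstack([np.column_stack([t[0], t[0]]),
               np.column_stack([t[1], -t[1]])]) + rng.normal(scale=0.05, size=(200, 2))
tot, labels, lines = k_lines(X)
```

The partial likelihood approach replaces this hard assignment with a probabilistic model for the orthogonal distances, which is what makes model-selection tools for the number of clusters applicable.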
Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide (A, T, C or G) is altered. Arguably, SNPs account for more than 90% of human genetic variation. Dr. Tebbutt's laboratory has developed a highly redundant SNP genotyping assay, based on arrayed primer extension (APEX), consisting of multiple probes with signals from multiple channels for a single SNP. The strength of this platform is its unique redundancy: multiple probes for a single SNP. Using this microarray platform, we have developed fully automated genotype calling algorithms based on linear models for individual probe signals, using dynamic variable selection at the prediction level. The algorithms combine separate analyses based on the multiple probe sets to give a final confidence score for each candidate genotype.

Our proposed classification model achieved an accuracy level above 99.4% with a 100% call rate for the SNP genotype data, which is comparable with existing genotyping technologies. We discuss the appropriateness of the proposed model in relation to other existing high-throughput genotype calling algorithms. In this thesis we explore three new ideas for classification with high-dimensional data: (1) ensembles of various sets of predictors with a built-in dynamic property; (2) robust classification at the prediction level; and (3) a proper confidence measure for dealing with failed predictor(s).

We found that a mixture model for classification provides robustness against outlying values of the explanatory variables. Furthermore, the algorithm chooses among different sets of explanatory variables in a dynamic way, prediction by prediction. We analyzed several data sets, including real and simulated samples, to illustrate these features. Our model-based genotype calling algorithm captures the redundancy in the system, considering all the underlying probe features of a particular SNP and automatically down-weighting any ‘bad data’ corresponding to image artifacts on the microarray slide or failure of a specific chemistry.

Though motivated by this genotyping application, the proposed methodology applies to other classification problems where the explanatory variables fall naturally into groups or where outliers in the explanatory variables require variable selection at the prediction stage for robustness.
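The "dynamic variable selection at the prediction level" idea, dropping a predictor group for a given case when that group's signals look like bad data, can be caricatured in a few lines. Everything here (the robust z-score screen, the toy models, the cutoff) is an illustrative assumption, not the thesis's mixture-model machinery.

```python
import numpy as np

def dynamic_predict(models, groups, x, med, mad, cut=4.0):
    """Average predictions over predictor groups, skipping any group that
    contains an outlying (e.g. failed-probe) value for this particular case."""
    preds = []
    for model, g in zip(models, groups):
        z = np.abs(x[g] - med[g]) / mad[g]
        if np.all(z < cut):            # use the group only if its cells look healthy
            preds.append(model(x[g]))
    return np.mean(preds) if preds else np.nan

# Two toy predictor groups with identical linear models; the third
# cell of this case is an artifact, so the second group is skipped.
groups = [np.array([0, 1]), np.array([2, 3])]
models = [np.sum, np.sum]
med, mad = np.zeros(4), np.ones(4)
x = np.array([1.0, 1.0, 50.0, 1.0])
pred = dynamic_predict(models, groups, x, med, mad)
```

The key point is that the selection happens per prediction: the same group is used for cases where its cells are healthy and dropped for cases where they are not.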
Master's Student Supervision
Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.
This thesis considers the problem of robust and sparse estimation of linear regression parameters in data with structural and independent contamination. Independent outliers can propagate in data with a relatively large number of dimensions, resulting in a high fraction of observations with at least one outlying cell. Recent work has shown that traditional robust regression methods are not highly robust to such outliers. We investigate the application of Robust Least Angle Regression (RLARS) to data with independent contamination. We also propose two modified versions of RLARS to further improve its performance. The first method applies RLARS to data that have been filtered of independent outliers. The second method performs RLARS with the Lasso modification of Least Angle Regression (LARS). Extensive simulations show that RLARS is resilient to structural and independent contamination. Compared with RLARS, simulation results show that the first modified version has significantly improved robustness to independent contamination, and the second has improved robustness when there is a large number of predictors.

We also consider the application of the proposed methods to data quality modelling in a case study for MineSense Technologies Ltd. (MineSense). MineSense develops sensor packages for use in the harsh conditions of an active mine. To maintain high system availability and performance, the data must be monitored for deterioration in sensor health or a change in the data-generating process, such as a change in ore body, which can manifest as outliers. We pose contamination detection, identifying whether a dataset contains outliers, as a problem distinct from outlier detection, identifying which cases or cells are outliers. We propose a contamination detection method based on the comparison of robust and non-robust linear regression estimates. When outliers are present, the robust and non-robust estimates differ significantly, indicating the presence of contamination. Simulation results and analysis of real sensor data provided by MineSense suggest that our method can effectively detect the presence of contamination with a low false detection rate.
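The comparison-based detector can be sketched directly: fit a non-robust and a robust regression to the same data and flag contamination when the coefficient vectors disagree by more than sampling noise. `HuberRegressor` is a stand-in for whichever robust fit is used in practice; the normalisation and the simulated contamination below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

def contamination_score(X, y):
    """Scaled disagreement between robust and non-robust coefficients.
    Large values suggest the data set contains outliers."""
    ols = LinearRegression().fit(X, y)
    hub = HuberRegressor().fit(X, y)
    diff = np.r_[ols.intercept_ - hub.intercept_, ols.coef_ - hub.coef_]
    return np.linalg.norm(diff) / np.std(y)   # crude scale normalisation

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.0, 3.0, 0.0]) + rng.normal(scale=0.5, size=300)
y_bad = y.copy()
y_bad[:30] += 25.0                            # 10% casewise outliers
```

On clean data the two fits nearly coincide, so the score stays near zero; contamination pulls the non-robust fit away and inflates it.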
In variable selection problems, when the number of candidate covariates is relatively large, a "two-step" model building strategy, consisting of two consecutive steps, sequencing and segmentation, is often used. Sequencing first orders all the candidate covariates to form a list in which the more "important" ones are likely to appear at the beginning. Then, in the segmentation step, subsets of the first m candidate covariates (m chosen by the user) at the top of the sequenced list are carefully examined in order to select the final prediction model. This thesis focuses mainly on the sequencing step. Least Angle Regression (LARS), proposed by Efron, Hastie, Johnstone and Tibshirani (2004), is a powerful step-by-step algorithm that can be used to sequence the candidate covariates in order of their importance. Khan, Van Aelst and Zamar (2007) further proposed its robust version, Robust LARS, which is robust against outliers and computationally efficient. However, neither the original LARS nor Robust LARS can carry out the sequencing step when the candidate covariates contain both continuous and nominal variables. To remedy this, we propose the Extended Robust LARS, based on generalized definitions of correlation that include correlations between nominal and continuous variables. Simulations and real examples show that the Extended Robust LARS outperforms two of its competitors, classical Forward Selection and the Group Lasso.
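The sequencing step itself is easy to demonstrate with the classical (non-robust) LARS: `lars_path` reports the order in which covariates enter the model, which is exactly the sequenced candidate list handed to the segmentation step. Robust LARS replaces the correlations inside this procedure with robust ones, and the extension in this thesis further generalizes them to mixed continuous and nominal covariates; the simulated data below are an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
# Only the first three covariates matter for the response.
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(size=200)

# `active` lists covariate indices in their order of entry --
# the sequenced list that the segmentation step would then examine.
alphas, active, coefs = lars_path(X, y, method="lar")
print(list(active))
```

With a clear signal, the three truly relevant covariates enter first, so segmentation only needs to examine a short prefix of the list.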