Relevant Degree Programs
Affiliations to Research Centres, Institutes & Clusters
Graduate Student Supervision
Doctoral Student Supervision (Jan 2008 - Nov 2019)
In this thesis, we present new developments of hierarchical clustering in high-dimensional data. We consider different use cases of hierarchical clustering, namely, clustering observations for exploratory analysis and clustering high-dimensional features for adaptive feature grouping and ensembling.We first focus on the clustering of observations. In high-dimensional data, the existence of potential noise features and outliers poses unique challenges to the existing hierarchical clustering techniques. We propose the Robust Sparse Hierarchical Clustering (RSHC) and the Multi-rank Sparse Hierarchical Clustering (MrSHC) to address these challenges. We show that via robust feature selection techniques, both RSHC and MrSHC can handle the potential existence of noise features and outliers in high-dimensional data and result in better clustering accuracy and interpretation comparing to the existing hierarchical clustering methods.We then consider clustering of features in high-dimensional data. We propose a new hierarchical clustering technique to adaptively divide the large number of features into subgroups called Regression Phalanxes. Features in the same Regression Phalanx work well together as predictors in a pre-defined regression model. Then models built on different Regression Phalanxes are considered for further ensembling. We show that the ensemble of Regression Phalanxes resulting from the hierarchical clustering produces further gains in prediction accuracy when applied to an effective method like Lasso or Random Forests.
Cellwise outliers are likely to occur together with casewise outliers in datasets of relatively large dimension. Recent work has shown that traditional high breakdown point procedures may fail when applied to such datasets. In this thesis, we consider this problem when the goal is to (1) estimate multivariate location and scatter matrix and (2) estimate regression coefficients and confidence intervals for inference, which both are cornerstones in multivariate data analysis. To address the first problem, we propose a two-step procedure to deal with casewise and cellwise outliers, which generally proceeds as follows: first, it uses a filter to identify cellwise outliers and replace them by missing values; then, it applies a robust estimator to the incomplete data to down-weight casewise outliers. We show that the two-step procedure is consistent under the central model provided the filter is appropriately chosen. The proposed two-step procedure for estimating location and scatter matrix is then applied in regression for the case of continuous covariates by simply adding a third step, which computes robust regression coefficients from the estimated robust multivariate location and scatter matrix obtained in the second step. We show that the three-step estimator is consistent and asymptotically normal at the central model, for the case of continuous covariates. Finally, the estimator is extended to handle both continuous and dummy covariates. Extensive simulation results and real data examples show that the proposed methods can handle both cellwise and casewise outliers similarly well.
An ensemble of classifiers is proposed for predictive ranking of the observations in a dataset so that the rare class observations are found in the top of the ranked list. Four drug-discovery bioassay datasets, containing a few active and majority inactive chemical compounds, are used in this thesis. The compounds' activity status serves as the response variable while a set of descriptors, describing the structures of chemical compounds, serve as predictors. Five separate descriptor sets are used in each assay. The proposed ensemble aggregates over the descriptor sets by averaging probabilities of activity from random forests applied to the five descriptor sets. The resulting ensemble ensures better predictive ranking than the most accurate random forest applied to a single descriptor set.Motivated from the results of the ensemble of descriptor sets, an algorithm is developed to uncover data-adaptive subsets of variables (we call phalanxes) in a variable rich descriptor set. Capitalizing on the richness of variables, the algorithm looks for the sets of predictors that work well together in a classifier. The data-adaptive phalanxes are so formed that they help each other while forming an ensemble. The phalanxes are aggregated by averaging probabilities of activity from random forests applied to the phalanxes. The ensemble of phalanxes (EPX) outperforms random forests and regularized random forests in terms of predictive ranking. In general, EPX performs very well in a descriptor set with many variables, and in a bioassay containing a few active compounds.The phalanxes are also aggregated within and across the descriptor sets. In all of the four bioassays, the resulting ensemble outperforms the ensemble of descriptor sets, and random forests applied to the pool of the five descriptor sets.The ensemble of phalanxes is also adapted to a logistic regression model and applied to the protein homology dataset downloaded from the KDD Cup 2004 competition. The ensembles are applied to a real test set. The adapted version of the ensemble is found more powerful in terms of predictive ranking and less computationally demanding than the original ensemble of phalanxes with random forests.
We consider the problem of robust estimation of the scatter matrix of an elliptical distribution when observed data are corrupted in a cell-wise manner. The first half of the thesis develops a framework for dealing with data subjected to independent cell-wise contamination. Each data cell (as opposed to data case in traditional robustness) can be contaminated independently of the rest of the case. Instead of downweighting the whole case we attempt to identify the affected cells, remove the offending values and treat them as missing at random for subsequent likelihood-based processing. We explore several variations of the detection procedure that takes into account the multivariate structure of the data and end up with a heuristic algorithm that identifies and removes a large proportion of dangerous independent contamination. Although there are not many existing methods to measure against, the proposed covariance estimate compares favorably to naive alternatives such as pairwise estimates or univariate Winsorising.The cell-wise data corruption mechanism that we deal with in the second half of this thesis is missing data. Missing data on their own have been well studied and likelihood methods are well developed. The new setting that we are interested in is when missing data come together with the traditional case-wise contamination. Both issues have been studied extensively over that last few decades but little attention has been paid to how to address them both at the same time. We propose a modification of the S-estimate that allows robust estimation of multivariate location and scatter matrix in the presence of missing completely at random (MCAR) data. The method is based on the idea of the maximum likelihood of the observed data and extends it into the world of S-estimates. The estimate comes complete with the computation algorithm, which is an adjusted version of the widely used Fast-S procedure. Simulation results and applications to real datasets confirm the superiority of our method over available alternatives.Preliminary investigation reported in the concluding chapter suggests that combining the two main ideas presented in this thesis can yield an estimate that is robust against case-wise and cell-wise contamination simultaneously.
Single nucleotide polymorphisms (SNPs) have been increasingly popular fora wide range of genetic studies. A high-throughput genotyping technologiesusually involves a statistical genotype calling algorithm. Most callingalgorithms in the literature, using methods such as k-means and mixturemodels,rely on elliptical structures of the genotyping data; they may failwhen the minor allele homozygous cluster is small or absent, or when thedata have extreme tails or linear patterns.We propose an automatic genotype calling algorithm by further developinga linear grouping algorithm (Van Aelst et al., 2006). The proposedalgorithm clusters unnormalized data points around lines as against aroundcentroids. In addition, we associate a quality value, silhouette width, witheach DNA sample and a whole plate as well. This algorithm shows promisefor genotyping data generated from TaqMan technology (Applied Biosystems).A key feature of the proposed algorithm is that it applies to unnormalizedfluorescent signals when the TaqMan SNP assay is used. Thealgorithm could also be potentially adapted to other fluorescence-based SNPgenotyping technologies such as Invader Assay.Motivated by the SNP genotyping problem, we propose a partial likelihoodapproach to linear clustering which explores potential linear clustersin a data set. Instead of fully modelling the data, we assume only the signedorthogonal distance from each data point to a hyperplane is normally distributed.Its relationships with several existing clustering methods are discussed.Some existing methods to determine the number of components in adata set are adapted to this linear clustering setting. Several simulated andreal data sets are analyzed for comparison and illustration purpose. We alsoinvestigate some asymptotic properties of the partial likelihood approach.A Bayesian version of this methodology is helpful if some clusters aresparse but there is strong prior information about their approximate locationsor properties. We propose a Bayesian hierarchical approach which isparticularly appropriate for identifying sparse linear clusters. We show thatthe sparse cluster in SNP genotyping datasets can be successfully identifiedafter a careful specification of the prior distributions.
Single nucleotide polymorphisms (SNPs) are DNA sequence variations, occurring when a single nucleotide –A, T, C or G – is altered. Arguably, SNPs account for more than 90% of human genetic variation. Dr. Tebbutt's laboratory has developed a highly redundant SNP genotyping assay consisting of multiple probes with signals from multiple channels for a single SNP, based on arrayed primer extension (APEX). The strength of this platform is its unique redundancy having multiple probes for a single SNP. Using this microarray platform, we have developed fully-automated genotype calling algorithms based on linear models for individual probe signals and using dynamic variable selection at the prediction level. The algorithms combine separate analyses based on the multiple probe sets to give a final confidence score for each candidate genotypes.Our proposed classification model achieved an accuracy level of >99.4% with 100% call rate for the SNP genotype data which is comparable with existing genotyping technologies. We discussed the appropriateness of the proposed model related to other existing high-throughput genotype calling algorithms. In this thesis we have explored three new ideas for classification with high dimensional data: (1) ensembles of various sets of predictors with built-in dynamic property; (2) robust classification at the prediction level; and (3) a proper confidence measure for dealing with failed predictor(s).We found that a mixture model for classification provides robustness against outlying values of the explanatory variables. Furthermore, the algorithm chooses among different sets of explanatory variables in a dynamic way, prediction by prediction. We analyzed several data sets, including real and simulated samples to illustrate these features. Our model-based genotype calling algorithm captures the redundancy in the system considering all the underlying probe features of a particular SNP, automatically down-weighting any ‘bad data’ corresponding to image artifacts on the microarray slide or failure of a specific chemistry.Though motivated by this genotyping application, the proposed methodology would apply to other classification problems where the explanatory variables fall naturally into groups or outliers in the explanatory variables require variable selection at the prediction stage for robustness.
Master's Student Supervision (2010 - 2018)
Searching a dataset for the ‘‘natural grouping / clustering’’ is an important explanatory technique for understanding complex multivariate datasets. One might expect that the true underlying clusters present in a dataset differ only with respect to a small fraction of the features. Furthermore, one might afraid that the dataset might contain potential outliers. Through simulation studies, we ﬁnd that an existing sparse clustering method can be severely affected by a single outlier. In this thesis, we develop a robust clustering method that is also able to perform variable selection: we robustiﬁed sparse K-means (Witten and Tibshirani ), based on the idea of trimmed K-means introduced by Gordaliza  and Gordaliza . Since high dimensional datasets often contain quite a few missing observations, we made our proposed method capable of handling datasets with missing values. The performance of the proposed robust sparse K-means is assessed in various simulation studies and two data analyses. The simulation studies show that robust sparse K-means performs better than other competing algorithms in terms of both the selection of features and the selection of a partition when datasets are contaminated. The analysis of a microarray dataset shows that robust sparse K-means best reﬂects the oestrogen receptor status of the patients among all other competing algorithms. We also adapt Clest (Duboit and Fridlyand ) to our robust sparse K-means to provide an automatic robust procedure of selecting the number of clusters. Our proposed methods are implemented in the R package RSKC.