Raymond Tak-yan Ng
Relevant Degree Programs
Graduate Student Supervision
Doctoral Student Supervision (Jan 2008 - Nov 2019)
Privacy preservation is a key issue in outsourcing of data mining.When we seek approaches to protect the sensitive information containedin the original data, it is also important to preserve the mining outcome.We study the problem of privacy preservation in outsourcing of classifications, including decision tree classification, support vector machine (SVM), and linear classifications. We investigate the possibility of guaranteeing no-outcome-change (NOC) and consider attack models with prior knowledge. We first conduct our investigation in the context of building decision trees. We propose a piecewise transformation approach using two central ideas of breakpoints and monochromatic pieces. We show that the decision tree is preserved if the transformation functions used for pieces satisfy the global (anti-)monotonicity. We empirically show that the proposed piecewise transformation approach can deliver a secured level of privacy and reduce disclosure risk substantially.We then propose two transformation approaches, (i) principled orthogonal transformation (POT) and (ii) true negative point (TNP) perturbation, for outsourcing SVM. We show that POT always guarantees no-outcome-change for both linear and non-linear SVM. The TNP approach gives the same guaranteewhen the data set is linearly separable. For linearly non-separable data sets, we show that no-outcome-change is not always possible and proposea variant of the TNP perturbation that aims to minimize the change to the SVM classifier. Experimental results showthat the two approaches are effective to counter powerful attack models.In the last part, we extend the POT approach to linear classification models and propose to combine POT and random perturbation. We conduct a detailed set of experiments and show that the proposed combination approach could reduce the change on the mining outcome while still providing high level of protection on privacy by adding less noise.We further investigate the POT approach and propose a heuristic to break down the correlations between the original values and the corresponding transformed values of subsets. We show that the proposed approach could significantly improve the protection level on privacy in the worst cases.
This dissertation studies selectivity estimation of approximate predicates on text. Intuitively, we aim to count the number of strings that are similar to a given query string. This type of problem is crucial in handling text in RDBMSs in an error-tolerant way. A common difficulty in handling textual data is that they may contain typographical errors, or use similar but different textual representations for the same real-world entity. To handle such data in databases, approximate text processing has gained extensive interest and commercial databases have begun to incorporate such functionalities. One of the key components in successful integration of approximate text processing in RDBMSs is the selectivity estimation module, which is central in optimizing queries involving such predicates. However, these developments are relatively new and ad-hoc approaches, e.g., using a constant, have been employed.This dissertation studies reliable selectivity estimation techniques for approximate predicates on text. Among many possible predicates, we focus on two types of predicates which are fundamental building blocks of SQL queries: selections and joins. We study two different semantics for each type of operator. We propose a set of related summary structures and algorithms to estimate selectivity of selection and join operators with approximate matching. A common challenge is that there can be a huge number of variants to consider. The proposed data structures enable efficient counting by considering a group of similar variants together rather than each and every one separately. A lattice-based framework is proposed to consider overlapping counts among the groups. We performed extensive evaluation of proposed techniques using real-world and synthetic data sets. Our techniques support popular similarity measures including edit distance, Jaccard similarity and cosine similarity and show how to extend the techniques to other measures. Proposed solutions are compared with state-of-the-arts and baseline methods. Experimental results show that the proposed techniques are able to deliver accurate estimates with small space overhead.
DNA copy number alterations (CNAs) are genetic changes that can produceadverse effects in numerous human diseases, including cancer. CNAs aresegments of DNA that have been deleted or amplified and can range in sizefrom one kilobases to whole chromosome arms. Development of arraycomparative genomic hybridization (aCGH) technology enables CNAs to bemeasured at sub-megabase resolution using tens of thousands of probes.However, aCGH data are noisy and result in continuous valued measurements ofthe discrete CNAs. Consequently, the data must be processed throughalgorithmic and statistical techniques in order to derive meaningfulbiological insights. We introduce model-based approaches to analysis of aCGHdata and develop state-of-the-art solutions to three distinct analyticalproblems.In the simplest scenario, the task is to infer CNAs from a single aCGHexperiment. We apply a hidden Markov model (HMM) to accurately identifyCNAs from aCGH data. We show that borrowing statistical strength acrosschromosomes and explicitly modeling outliers in the data, improves onbaseline models.In the second scenario, we wish to identify recurrent CNAs in a set of aCGHdata derived from a patient cohort. These are locations in the genomealtered in many patients, providing evidence for CNAs that may be playingimportant molecular roles in the disease. We develop a novel hierarchicalHMM profiling method that explicitly models both statistical and biologicalnoise in the data and is capable of producing a representative profile for aset of aCGH experiments. We demonstrate that our method is more accuratethan simpler baselines on synthetic data, and show our model produces outputthat is more interpretable than other methods.Finally, we develop a model based clustering framework to stratify a patientcohort, expected to be composed of a fixed set of molecular subtypes. Weintroduce a model that jointly infers CNAs, assigns patients to subgroupsand infers the profiles that represent each subgroup. We show our model tobe more accurate on synthetic data, and show in two patient cohorts how themodel discovers putative novel subtypes and clinically relevant subgroups.
Master's Student Supervision (2010 - 2018)
Epigenome-wide association studies are used to link patterns in the epigenome to human phenotypes and disease. These studies continue to increase in num- ber, driven by improving technologies and decreasing costs. However, results from population-scale association studies are often difficult to interpret. One major chal- lenge to interpretation is separating biologically relevant epigenetic changes from changes to the underlying cell type composition. This thesis focuses on computa- tional methods for correcting cell type composition in epigenome-wide association studies measuring DNAm in blood. Specifically, we focus on a class of methods, called reference-based methods, that rely on measurements of DNAm from puri- fied constituent cell types. Currently, reference-based correction methods perform poorly on human cord blood. This is unusual because adult blood, a closely related tissue, is a case-study in successful computational correction. Several previous attempts at improving cord blood estimation were only partially successful. We demonstrate how reference-based estimation methods that rely on for cord blood can be improved. First, we validated that existing methods perform poorly on cord blood, especially in minor cell types. Then, we demonstrated how this low per- formance stems from missing cell type references, data normalization and violated assumptions in signature construction. Resolving these issues improved estimates in a validation set with experimentally generated ground truth. Finally, we com- pared our reference-based estimates against reference-free techniques, an alterna- tive class of computational correction methods. Going forward, this thesis provides a template for extending reference-based estimation to other heterogeneous tissues.
Discourse parsing is a popular technique widely used in text understanding, sentiment analysis, and other NLP tasks. However, for most discourse parsers, the performance varies significantly across different discourse relations. In this thesis, we first validate the underfitting hypothesis, i.e., the less frequent a relation is in the training data, the poorer the performance on that relation. We then explore how to increase the number of positive training instances, without resorting to manually creating additional labeled data. We propose a training data enrichment framework that relies on co-training of two different discourse parsers on unlabeled documents. Importantly, we show that co-training alone is not sufficient. The framework requires a filtering step to ensure that only “good quality” unlabeled documents can be used for enrichment and re-training. We propose and evaluate two ways to perform the filtering. The first is to use an agreement score between the two parsers. The second is to use only the confidence score of the faster parser. Our empirical results show that agreement score can help to boost the performance on infrequent relations, and that the confidence score is a viable approximation of the agreement score for infrequent relations.
Complex cellular functions are carried out by the coordinated activity of networks of genes and gene products. In order to understand mechanisms of disease and disease pathogenesis, it is crucial to develop an understanding of these complex interactions. Microarrays provide the potential to explore large scale cellular networks by measuring the expression of thousands of genes simultaneously. The purpose of our project is to develop a stable and robust method that can identify, from such gene expression data, modules of genes that are involved in a common functional role. These modules can be used as a first step in systems scale analyses to extract valuable information from various gene expression studies. Our method constructs modules by identifying genes that are co-expressed across many diseases. We use peripheral blood microarray samples from patients having one of several diseases and cluster the genes in each disease group separately. We then identify genes that cluster together across all disease groups to construct our modules. We first use our method to construct baseline peripheral blood modules relevant to the lung using 5 groups of peripheral blood microarray samples that were collected as controls for separate studies. An enrichment analysis using gene sets from a number of pathway and ontology databases reveals the biological significance of our modules. We utilize our background modules by doing an enrichment analysis on a list of genes that were differentially expressed in a COPD case vs. control study and identify modules that are enriched in that list.Although a similar approach has been used to identify modules of genes that are coordinately expressed across multiple conditions, we show that our method is an improvement as it is robust to the order in which the different disease datasets are presented to the algorithm. We also apply our procedure to 3 different datasets including a COPD dataset, a COPD normal dataset and a lung tissue dataset. We then assess the stability of our method by performing a resampling experiment on our module construction procedure and find that our method repeatedly produces modules with high concordance which is measured by Jaccard distance.
Community detection is an important aspect of network analysis that has far-reaching consequences, in particular for biological research. In the study of systems biology, it is important to detect communities in biological networks to identify areas that have a heavy correlation between one another or are significant for biological functions. If one were to model networks that evolved over time, a differential network would be a vital part or product of that analysis. One such network could have an edge between two vertices if there is a significant change in the correlation of expression levels between the two genes that the vertices are designed to model.For this particular network, there are no community detection algorithms that suffice. An analysis of the current community detection algorithms shows that most heuristic-based methods are too simple or have too high a cost for detecting communities on such sparse networks. A prototypical algorithm is presented that is preferential to high weight edges when determining community membership. This algorithm, Weighted Sparse Community Finder or WSCF, is an incremental algorithm that develops community structure from highly-weighted community seeds, which are 3-vertex substructures in the network with a high local modularity.A preliminary analysis of this detection algorithm shows that it is functional on data sets consisting of up to 600 genes, with more on a more powerful machine. The communities detected are different than the ones provided by the benchmark algorithms because of the high precedence placed on higher-weight edges. This prototypical algorithm has the potential for refinement and expansion to provide the ability to find significant results for applications in the field of Systems Biology.
No abstract available.