Raymond Tak-yan Ng
Relevant Thesis-Based Degree Programs
Affiliations to Research Centres, Institutes & Clusters
Graduate Student Supervision
Doctoral Student Supervision
Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.
RNA-binding proteins (RBPs) interact with their RNA targets to mediate critical cellular processes in post-transcriptional gene regulation, such as RNA splicing, modification, replication, and localization. Characterizing the binding preferences of an RBP is essential for deciphering the underlying interaction code and understanding the functions of the interaction partners. However, only a minority of the numerous known RBPs currently have RNA binding data available from in vivo or in vitro experiments. The binding preferences of experimentally unexplored RBPs remain largely unknown and are challenging to identify. In this thesis, we take machine-learning-based recommendation approaches to address this problem, leveraging the binding data currently available to infer the RNA preferences of RBPs that have not been experimentally explored. First, we present a recommendation method based on co-evolution to predict the RNA binding specificities of experimentally unexplored RBPs, waiving the need for the RBPs' own binding data. We first demonstrate the co-evolutionary relationship between RBPs and their RNA targets, then describe a K-nearest-neighbours algorithm that exploits this co-evolution to infer the RNA binding specificity of an RBP using only the specificities of its homologous RBPs. Second, we present a nucleic acid recommender system to predict probe-level binding profiles for unexplored or poorly studied RBPs. We first encode biological sequences into distributed feature representations by adapting word embedding techniques, then build a neural network that recommends binding profiles for unexplored RBPs by learning their similarities to RBPs with available binding data. Third, we present a graph convolutional network for recommending binding affinities of unexplored RBPs. Extending the previous two approaches, this method adopts a transductive message-passing setting to incorporate more information from the data: it predicts the interaction affinity between an unexplored RBP and an RNA probe by propagating information from other, explored RBP-RNA interactions through a heterogeneous graph of RBPs and RNAs. Overall, the approaches presented here can improve our understanding of RBPs' binding mechanisms and provide new opportunities to investigate complex post-transcriptional regulation.
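The K-nearest-neighbours idea in the first approach can be illustrated as a similarity-weighted average over homologues' specificity vectors. The similarity scores, specificity vectors, and weighting scheme below are invented placeholders, not the thesis's actual model:

```python
def infer_specificity(query_sims, homolog_specs, k=2):
    """Infer an RBP's binding-specificity vector as the similarity-weighted
    mean of its k most similar homologues' vectors (illustrative sketch)."""
    # Rank homologues by similarity to the query RBP and keep the top k.
    ranked = sorted(zip(query_sims, homolog_specs), reverse=True)[:k]
    total = sum(s for s, _ in ranked)
    n = len(ranked[0][1])
    # Similarity-weighted average, position by position.
    return [sum(s * vec[i] for s, vec in ranked) / total for i in range(n)]

# Toy example: two close homologues dominate the prediction.
sims = [0.9, 0.8, 0.1]
specs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
pred = infer_specificity(sims, specs, k=2)
```

Because the distant third homologue is excluded by the k cutoff, the prediction stays close to the two near neighbours' preference for the first position.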
Privacy preservation is a key issue in outsourcing of data mining. When we seek approaches to protect the sensitive information contained in the original data, it is also important to preserve the mining outcome. We study the problem of privacy preservation in outsourcing of classification, including decision tree classification, support vector machines (SVM), and linear classification. We investigate the possibility of guaranteeing no-outcome-change (NOC) and consider attack models with prior knowledge. We first conduct our investigation in the context of building decision trees. We propose a piecewise transformation approach built on two central ideas: breakpoints and monochromatic pieces. We show that the decision tree is preserved if the transformation functions used for the pieces satisfy global (anti-)monotonicity, and we empirically show that the approach delivers a secure level of privacy and substantially reduces disclosure risk. We then propose two transformation approaches for outsourcing SVM: (i) principled orthogonal transformation (POT) and (ii) true negative point (TNP) perturbation. We show that POT always guarantees no-outcome-change for both linear and non-linear SVM. The TNP approach gives the same guarantee when the data set is linearly separable. For linearly non-separable data sets, we show that no-outcome-change is not always possible and propose a variant of the TNP perturbation that aims to minimize the change to the SVM classifier. Experimental results show that the two approaches are effective against powerful attack models. In the last part, we extend the POT approach to linear classification models and propose combining POT with random perturbation. A detailed set of experiments shows that the combined approach reduces the change to the mining outcome while still providing a high level of privacy protection with less added noise. We further propose a heuristic that breaks the correlations between the original values and the corresponding transformed values of subsets, significantly improving the worst-case protection level.
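The property underlying an orthogonal transformation approach like POT can be shown in a few lines: an orthogonal map preserves all inner products and distances, so a linear or kernel SVM trained on the transformed data reaches the same decisions. A minimal numpy sketch of that invariant, not the thesis's full outsourcing protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random orthogonal matrix Q via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))

X = rng.standard_normal((5, 3))   # original (sensitive) records
Xt = X @ Q.T                      # outsourced, transformed records

# Inner products -- and hence SVM margins and kernel values -- are unchanged:
# Xt @ Xt.T = X (Q.T Q) X.T = X @ X.T.
G_orig = X @ X.T
G_trans = Xt @ Xt.T
```

Since the Gram matrix is identical, any classifier whose training and prediction depend on the data only through inner products produces an unchanged outcome.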
This dissertation studies selectivity estimation of approximate predicates on text. Intuitively, we aim to count the number of strings that are similar to a given query string. This type of problem is crucial in handling text in RDBMSs in an error-tolerant way. A common difficulty in handling textual data is that they may contain typographical errors, or use similar but different textual representations for the same real-world entity. To handle such data in databases, approximate text processing has gained extensive interest, and commercial databases have begun to incorporate such functionality. One of the key components in successfully integrating approximate text processing into RDBMSs is the selectivity estimation module, which is central to optimizing queries involving such predicates. However, these developments are relatively new, and ad-hoc approaches, e.g., using a constant, have been employed. This dissertation studies reliable selectivity estimation techniques for approximate predicates on text. Among many possible predicates, we focus on two types that are fundamental building blocks of SQL queries: selections and joins. We study two different semantics for each type of operator, and propose a set of related summary structures and algorithms to estimate the selectivity of selection and join operators with approximate matching. A common challenge is that there can be a huge number of variants to consider. The proposed data structures enable efficient counting by considering a group of similar variants together rather than each one separately, and a lattice-based framework accounts for overlapping counts among the groups. We performed extensive evaluation of the proposed techniques using real-world and synthetic data sets. Our techniques support popular similarity measures, including edit distance, Jaccard similarity, and cosine similarity, and we show how to extend them to other measures. The proposed solutions are compared with state-of-the-art and baseline methods. Experimental results show that the proposed techniques deliver accurate estimates with small space overhead.
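The quantity being estimated can be pinned down with an exact (but slow) baseline: the fraction of strings within an edit-distance threshold of the query. The data, query, and threshold below are invented; the thesis's summary structures approximate this count without scanning every string:

```python
def edit_distance(a, b):
    """Standard dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def selectivity(strings, query, tau):
    """Exact selectivity of an approximate selection predicate:
    the fraction of strings within edit distance tau of the query."""
    return sum(edit_distance(s, query) <= tau for s in strings) / len(strings)

data = ["smith", "smyth", "smithe", "jones"]
sel = selectivity(data, "smith", 1)   # "smyth" and "smithe" both qualify
```

A query optimizer would consume such an estimate to choose between, say, an index scan and a full scan for the approximate predicate.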
DNA copy number alterations (CNAs) are genetic changes that can produce adverse effects in numerous human diseases, including cancer. CNAs are segments of DNA that have been deleted or amplified and can range in size from one kilobase to whole chromosome arms. Development of array comparative genomic hybridization (aCGH) technology enables CNAs to be measured at sub-megabase resolution using tens of thousands of probes. However, aCGH data are noisy and yield continuous-valued measurements of the discrete CNAs. Consequently, the data must be processed with algorithmic and statistical techniques in order to derive meaningful biological insights. We introduce model-based approaches to the analysis of aCGH data and develop state-of-the-art solutions to three distinct analytical problems.
In the simplest scenario, the task is to infer CNAs from a single aCGH experiment. We apply a hidden Markov model (HMM) to accurately identify CNAs from aCGH data, and show that borrowing statistical strength across chromosomes and explicitly modeling outliers in the data improve on baseline models.
In the second scenario, we wish to identify recurrent CNAs in a set of aCGH data derived from a patient cohort. These are locations in the genome altered in many patients, providing evidence for CNAs that may be playing important molecular roles in the disease. We develop a novel hierarchical HMM profiling method that explicitly models both statistical and biological noise in the data and is capable of producing a representative profile for a set of aCGH experiments. We demonstrate that our method is more accurate than simpler baselines on synthetic data, and that it produces more interpretable output than other methods.
Finally, we develop a model-based clustering framework to stratify a patient cohort expected to be composed of a fixed set of molecular subtypes. We introduce a model that jointly infers CNAs, assigns patients to subgroups, and infers the profiles that represent each subgroup. We show our model to be more accurate on synthetic data, and show in two patient cohorts how the model discovers putative novel subtypes and clinically relevant subgroups.
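The single-experiment setting can be sketched as Viterbi decoding in a small sticky HMM over loss/neutral/gain states with Gaussian emissions on the log-ratios. All parameters (state means, noise level, transition stickiness) and the probe values below are illustrative, not the thesis's fitted model:

```python
import math

def viterbi_cna(logratios, means=(-0.5, 0.0, 0.5), stay=0.9, sigma=0.15):
    """Decode copy-number states (0=loss, 1=neutral, 2=gain) from noisy
    log-ratios with a 3-state sticky HMM. Parameters are illustrative."""
    k = len(means)
    lstay = math.log(stay)
    lswitch = math.log((1.0 - stay) / (k - 1))
    emit = lambda x, m: -((x - m) ** 2) / (2 * sigma ** 2)  # log-Gaussian
    scores = [emit(logratios[0], m) for m in means]
    back = []
    for x in logratios[1:]:
        ptrs, new = [], []
        for j in range(k):
            cand = [scores[i] + (lstay if i == j else lswitch)
                    for i in range(k)]
            best = max(range(k), key=cand.__getitem__)
            ptrs.append(best)
            new.append(cand[best] + emit(x, means[j]))
        scores = new
        back.append(ptrs)
    # Backtrack from the best final state.
    state = max(range(k), key=scores.__getitem__)
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    return path[::-1]

# A noisy run of gain probes flanked by neutral probes is smoothed into
# one contiguous gained segment despite the measurement noise.
segments = viterbi_cna([0.02, -0.03, 0.51, 0.47, 0.55, 0.49, 0.01])
```

The sticky transitions are what turn noisy per-probe measurements into contiguous segments, which is the essential point of applying an HMM here.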
Master's Student Supervision
Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.
Multivariate time series generation is a promising method for sharing sensitive data in numerous medical, financial, and Internet of Things applications. A common type of multivariate time series is derived from a single source, such as the biometric measurements of a patient. Originating from a single source results in intricate dynamical patterns between the individual time series that are difficult for typical generative models, such as Generative Adversarial Networks (GANs), to learn. Machine learning models can use the valuable information in those patterns to better classify, predict, or perform other downstream tasks. GroupGAN is a novel framework that accounts for the time series' common origin and favours preserving inter-channel relationships. The two critical points of the GroupGAN method are: 1) the individual time series are generated from a common point in latent space, and 2) a central discriminator favours the preservation of inter-channel dynamics. We demonstrate empirically that the GroupGAN method helps preserve channel correlations and that the synthetic data it generates performs very well on downstream tasks with medical and financial data.
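The first of the two critical points, generating all channels from one point in latent space, can be illustrated without any GAN machinery: channels driven by a shared latent draw are correlated, while channels with independent draws are not. A numpy sketch with made-up linear "generators":

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Shared latent: both channels are (different) functions of the same z,
# so inter-channel structure emerges automatically.
z = rng.standard_normal(n)
ch1 = 1.5 * z + 0.1 * rng.standard_normal(n)
ch2 = -0.8 * z + 0.1 * rng.standard_normal(n)

# Independent latents: each channel has its own z, so no coupling.
ind1 = rng.standard_normal(n)
ind2 = rng.standard_normal(n)

shared_corr = abs(np.corrcoef(ch1, ch2)[0, 1])
indep_corr = abs(np.corrcoef(ind1, ind2)[0, 1])
```

In GroupGAN the per-channel generators are neural networks rather than linear maps, but the same mechanism is what lets the model preserve inter-channel dynamics.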
Cancer is associated not only with mortality, but also with impacts on physical, mental, and social health. When unmet, the resulting psychosocial needs are associated with worsened quality of life and survival. Cancer centres employ psychiatrists, counsellors, and other allied health clinicians to help address these needs; however, the needs often go unmet even when these resources exist, as it can be difficult for treating oncologists to detect them and refer patients accordingly. In this work, we investigated the use of neural natural language processing (NLP) models to predict these psychosocial needs from initial oncologist consultation documents. We compared a non-neural model, bag-of-words (BOW), with three neural models: convolutional neural networks (CNN), long short-term memory (LSTM), and bidirectional encoder representations from transformers (BERT). We used these models to predict self-reported emotional and informational needs around the time the documents were generated, and to predict whether the patient would have clinician-addressed needs – specifically, seeing a psychiatrist or counsellor within the five years following document generation. We compared the prediction of these psychosocial needs to predicting a non-psychosocial outcome, survival. We found these models can predict whether patients will see a psychiatrist with balanced accuracy and area under the receiver operating characteristic curve (AUC) above 0.70, similar to comparable prior work predicting mental health outcomes. We also predicted seeing a counsellor with AUC above 0.70, but predicting self-reported psychosocial needs proved more difficult, with these metrics usually below 0.70. We predicted the non-psychosocial outcome, survival, with higher performance: balanced accuracy above 0.80 and AUC above 0.90.
Predictions using subsets of our study population suggest that predicting these psychosocial outcomes is easier in females and in patients diagnosed with Stage II illness. We found that the CNN and LSTM models performed best, and investigated how BERT's document size limit may hinder its performance on these tasks. This work is the first of its kind using NLP for this application, and builds a foundation for how these techniques may one day help cancer patients.
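The two headline metrics are worth pinning down: balanced accuracy is the mean of sensitivity and specificity, and AUC is the probability that a random positive outscores a random negative. A sketch on invented toy labels, not the thesis's data:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on positives) and specificity."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly,
    counting ties as half."""
    pairs = [(s_p > s_n) + 0.5 * (s_p == s_n)
             for t_p, s_p in zip(y_true, scores) if t_p == 1
             for t_n, s_n in zip(y_true, scores) if t_n == 0]
    return sum(pairs) / len(pairs)

# Invented example: 3 positives, 3 negatives.
y = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
ba = balanced_accuracy(y, pred)
a = auc(y, scores)
```

Balanced accuracy is the natural choice here because the outcomes (e.g., seeing a psychiatrist) are class-imbalanced, where raw accuracy would be misleading.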
The goal of understanding which microbes are responsible for shifts under different environmental and clinical conditions has motivated a growing number of custom feature selection techniques for microbiome data. When identifying an appropriate feature selection method, researchers are often faced with the question of whether or not to apply a phylogenetic approach, and in many cases it is not possible to know a priori which is most suitable. The motivation behind a phylogenetic approach is biological: the features in microbiome data embody an inherent hierarchical structure that may carry signal from trait conservation, and microbiome shifts correlated with host outcome could be driven by groups of closely phylogenetically related taxa. As such, techniques that leverage phylogenetic information seem highly fitting. Recent studies have shown promising results suggesting a phylogenetic approach could be beneficial, but fewer have sought to thoroughly evaluate the robustness and applicability of the phylogenetic approach across different host outcomes, and guidance for researchers is unclear in the current literature. In this work, we performed an assessment of feature selection methods in order to understand how different classes of methods compare on microbiome data. We evaluated whether a phylogenetic approach is more powerful at finding ground-truth features with phylogenetic correlation structure, and discovered that non-phylogenetic methods are the best all-around methods both in the presence and in the absence of strong phylogenetic signal. Some evidence shows that there is still merit in the phylogenetic approach, such as in scenarios where the phylogenetic signal is very strong. Our observations and findings provide insights into strategies for testing for a phylogenetic signal using a combination of techniques.
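The phylogenetic intuition, that signal may sit at the clade level rather than in individual taxa, is commonly operationalized by aggregating abundances up the tree before feature selection. A toy sketch with a hypothetical two-clade tree (the clade names, taxa, and abundances are invented):

```python
# Hypothetical tree: each clade maps to the leaf taxa it contains.
clades = {
    "cladeA": ["taxon1", "taxon2"],
    "cladeB": ["taxon3", "taxon4", "taxon5"],
}

def aggregate(sample, clades):
    """Add clade-level features: each clade's value is the summed
    abundance of its member taxa (a common phylogenetic encoding)."""
    features = dict(sample)
    for clade, taxa in clades.items():
        features[clade] = sum(sample.get(t, 0.0) for t in taxa)
    return features

sample = {"taxon1": 0.1, "taxon2": 0.3, "taxon3": 0.0,
          "taxon4": 0.2, "taxon5": 0.4}
enriched = aggregate(sample, clades)
```

A downstream selector can then pick either a whole clade or an individual taxon, which is exactly the choice the phylogenetic-versus-non-phylogenetic comparison above is probing.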
The proliferation of financial news sources reporting on companies, markets, currencies, and stocks presents an opportunity for strategic decision making by mining data with the goal of extracting structured representations of financial entities and their inter-relations. These representations can be conveniently stored as (subject, predicate, object) triples in a knowledge graph that can be used to drive new insights through answering complex queries in high-level declarative languages. Towards this goal, we develop a high-precision knowledge extraction pipeline tailored for the financial domain. This pipeline combines multiple information extraction techniques with a financial dictionary that we built, all working together to produce over 342,000 compact extractions from over 288,000 financial news articles, with a precision of 78% at the top-100 extractions. These extractions are stored in a knowledge graph readily available for use in downstream applications. Our pipeline outperforms existing work in precision, the total number of extractions, and the coverage of financial predicates.
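The (subject, predicate, object) representation can be sketched as a minimal in-memory knowledge graph with wildcard lookup, the basic access pattern behind declarative graph queries. The triples and query below are invented examples, not output of the thesis pipeline:

```python
# Invented example triples in (subject, predicate, object) form.
triples = [
    ("AcmeCorp", "acquired", "WidgetInc"),
    ("AcmeCorp", "listed_on", "NYSE"),
    ("WidgetInc", "headquartered_in", "Toronto"),
]

def query(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the bound positions (None = wildcard)."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# "Everything we extracted about AcmeCorp":
acme_facts = query(triples, subject="AcmeCorp")
```

Real deployments would back this with a graph database and a declarative query language such as SPARQL, but the triple-pattern matching above is the underlying primitive.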
Epigenome-wide association studies are used to link patterns in the epigenome to human phenotypes and disease. These studies continue to increase in number, driven by improving technologies and decreasing costs. However, results from population-scale association studies are often difficult to interpret. One major challenge to interpretation is separating biologically relevant epigenetic changes from changes in the underlying cell type composition. This thesis focuses on computational methods for correcting cell type composition in epigenome-wide association studies measuring DNAm in blood. Specifically, we focus on a class of methods, called reference-based methods, that rely on measurements of DNAm from purified constituent cell types. Currently, reference-based correction methods perform poorly on human cord blood. This is unusual because adult blood, a closely related tissue, is a case study in successful computational correction, and several previous attempts at improving cord blood estimation were only partially successful. We demonstrate how reference-based estimation for cord blood can be improved. First, we validated that existing methods perform poorly on cord blood, especially for minor cell types. Then, we demonstrated how this low performance stems from missing cell type references, data normalization, and violated assumptions in signature construction. Resolving these issues improved estimates in a validation set with experimentally generated ground truth. Finally, we compared our reference-based estimates against reference-free techniques, an alternative class of computational correction methods. Going forward, this thesis provides a template for extending reference-based estimation to other heterogeneous tissues.
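At its core, reference-based correction treats a bulk DNAm profile as a proportion-weighted mixture of purified cell-type reference profiles and solves for the proportions. A minimal least-squares sketch with synthetic numbers; real methods add constraints, curated signature sites, and normalization:

```python
import numpy as np

# Synthetic reference signatures: rows = CpG sites, columns = cell types.
R = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.4]])
true_props = np.array([0.7, 0.3])

# A bulk profile is (approximately) a mixture of the reference profiles.
bulk = R @ true_props

# Estimate proportions by least squares, then constrain to a composition:
# clip negatives and renormalize so the proportions sum to one.
est, *_ = np.linalg.lstsq(R, bulk, rcond=None)
est = np.clip(est, 0, None)
est = est / est.sum()
```

The failure modes discussed above map directly onto this sketch: a missing column in R (absent cell type) or a poorly chosen set of rows (bad signature sites) biases the recovered proportions.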
Complex cellular functions are carried out by the coordinated activity of networks of genes and gene products. In order to understand mechanisms of disease and disease pathogenesis, it is crucial to develop an understanding of these complex interactions. Microarrays provide the potential to explore large-scale cellular networks by measuring the expression of thousands of genes simultaneously. The purpose of our project is to develop a stable and robust method that can identify, from such gene expression data, modules of genes involved in a common functional role. These modules can be used as a first step in systems-scale analyses to extract valuable information from various gene expression studies. Our method constructs modules by identifying genes that are co-expressed across many diseases. We use peripheral blood microarray samples from patients having one of several diseases and cluster the genes in each disease group separately. We then identify genes that cluster together across all disease groups to construct our modules. We first use our method to construct baseline peripheral blood modules relevant to the lung, using 5 groups of peripheral blood microarray samples that were collected as controls for separate studies. An enrichment analysis using gene sets from a number of pathway and ontology databases reveals the biological significance of our modules. We utilize our background modules by performing an enrichment analysis on a list of genes that were differentially expressed in a COPD case vs. control study, identifying modules that are enriched in that list. Although a similar approach has been used to identify modules of genes that are coordinately expressed across multiple conditions, we show that our method is an improvement, as it is robust to the order in which the different disease datasets are presented to the algorithm. We also apply our procedure to 3 different datasets: a COPD dataset, a COPD-normal dataset, and a lung tissue dataset. We then assess the stability of our method by performing a resampling experiment on the module construction procedure and find that it repeatedly produces modules with high concordance, as measured by Jaccard distance.
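The concordance measure used in the stability assessment can be sketched directly: the Jaccard index scores the overlap between matched modules from two resampling runs. The gene sets below are hypothetical examples:

```python
def jaccard(a, b):
    """Jaccard index: |A intersect B| / |A union B|.
    1.0 means identical modules; 0.0 means disjoint."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical modules recovered from two resampling runs.
run1 = ["IL6", "TNF", "CXCL8", "CCL2"]
run2 = ["IL6", "TNF", "CXCL8", "STAT3"]
concordance = jaccard(run1, run2)   # 3 shared genes out of 5 distinct
```

Repeating this comparison over many resampled module pairs gives the distribution of concordance scores that the stability claim rests on.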
Community detection is an important aspect of network analysis with far-reaching consequences, in particular for biological research. In systems biology, it is important to detect communities in biological networks to identify areas that are heavily correlated with one another or significant for biological function. If one were to model networks that evolve over time, a differential network would be a vital part or product of that analysis. One such network could have an edge between two vertices if there is a significant change in the correlation of expression levels between the two genes the vertices model. For this particular kind of network, no existing community detection algorithm suffices: an analysis of current algorithms shows that most heuristic-based methods are too simple or too costly for detecting communities on such sparse networks. We present a prototypical algorithm that is preferential to high-weight edges when determining community membership. This algorithm, the Weighted Sparse Community Finder (WSCF), is an incremental algorithm that develops community structure from highly weighted community seeds: 3-vertex substructures in the network with high local modularity. A preliminary analysis shows that the algorithm is functional on data sets of up to 600 genes, with larger sets feasible on a more powerful machine. The communities detected differ from those produced by the benchmark algorithms because of the high precedence placed on higher-weight edges. This prototypical algorithm has the potential for refinement and expansion to deliver significant results for applications in systems biology.
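The seeding step, picking 3-vertex substructures with high total edge weight, can be sketched as a scan over triangles in a weighted graph. The graph, gene names, and weights below are invented, and this brute-force scan stands in for WSCF's local-modularity scoring:

```python
from itertools import combinations

def best_triangle_seed(weights):
    """Return the 3-vertex clique with the highest total edge weight,
    a simplified stand-in for the high-weight community seeds above."""
    nodes = {v for edge in weights for v in edge}
    def w(u, v):
        return weights.get((u, v), weights.get((v, u), 0.0))
    best, best_w = None, 0.0
    for a, b, c in combinations(sorted(nodes), 3):
        if w(a, b) > 0 and w(b, c) > 0 and w(a, c) > 0:  # must be a triangle
            total = w(a, b) + w(b, c) + w(a, c)
            if total > best_w:
                best, best_w = (a, b, c), total
    return best, best_w

# Invented weighted edges, e.g. changes in co-expression strength.
edges = {("g1", "g2"): 0.9, ("g2", "g3"): 0.8, ("g1", "g3"): 0.7,
         ("g3", "g4"): 0.4, ("g4", "g5"): 0.3, ("g3", "g5"): 0.2}
seed, weight = best_triangle_seed(edges)
```

An incremental algorithm like WSCF would then grow each community outward from such seeds, admitting neighbours that keep the community's weight high.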
This thesis presents a domain-independent approach to dialogue act modeling across a comprehensive set of spoken and written conversations, including emails, forums, meetings, and phone conversations. We begin by investigating the performance of unsupervised methods for dialogue act recognition; their low performance motivates tackling the problem in supervised and semi-supervised manners. To this end, we propose a domain-independent feature set for dialogue act modeling on different spoken and written conversations. We then compare the results of SVM-multiclass and two structured predictors, the SVM-hmm and CRF algorithms, for supervised dialogue act modeling, and provide an in-depth analysis of the effectiveness of the proposed domain-independent approaches in different written and spoken conversations. Extensive empirical results, across different conversational modalities, demonstrate the effectiveness of our SVM-hmm model for dialogue act recognition. Furthermore, we use the SVM-hmm algorithm to investigate the effectiveness of using unlabeled data in a semi-supervised dialogue act recognition framework.
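Domain-independent features of the kind described, properties of an utterance that make sense in email, forum, meeting, and phone data alike, might look like this hypothetical extractor (the actual thesis feature set differs):

```python
def utterance_features(text, prev_speaker, speaker):
    """Hypothetical domain-independent features for dialogue act tagging:
    surface cues usable in any spoken or written conversation."""
    tokens = text.split()
    return {
        "length": len(tokens),                       # utterance length
        "is_question": text.rstrip().endswith("?"),  # question cue
        "starts_wh": tokens[0].lower() in
            {"who", "what", "when", "where", "why", "how"} if tokens else False,
        "speaker_change": speaker != prev_speaker,   # turn-taking cue
    }

feats = utterance_features("When is the next meeting?", "alice", "bob")
```

A structured predictor such as SVM-hmm would consume one such feature vector per utterance, labeling the whole sequence jointly so that, for example, a question raises the probability that the next act is an answer.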