Relevant Degree Programs
Graduate Student Supervision
Doctoral Student Supervision (Jan 2008 - May 2021)
Acute myeloid leukemia (AML) is a high-grade malignancy of non-lymphoid cells of the hematopoietic system. AML is a heterogeneous disease, and numerous attempts have been made to risk-stratify AML so that appropriate treatment can be offered. Single-cell analysis methods could provide insights into the biology of AML, leading to risk-stratified and functionally tailored treatments and hence improved outcomes. Recent advances in flow cytometry allow the simultaneous measurement of up to 17 antibody markers per cell for up to millions of cells, and flow cytometry is performed routinely during AML clinical workup. However, despite vast amounts of flow cytometry data being gathered, comprehensive, objective and automated studies of these data have not been undertaken. Another method, strand-seq, elucidates template strand inheritance in single cells, with a range of potential applications, none of which had been automated when this thesis work commenced. I have developed bioinformatic methods enabling research into AML using both these types of data. I present flowBin, a method for faithfully recombining multitube flow cytometry data. I present flowType-DP, a new version of flowType, able to process flow cytometry and other single-cell data having more than 12 markers (including flowBin output). I demonstrate the application of flowBin to AML data, for digitally isolating abnormal cells and classifying AML patients. I also use flowBin in conjunction with flowType to find cell types associated with clinically relevant gene mutations in AML. I present BAIT, a software package for accurately detecting sister chromatid exchanges in strand-seq data. I present functionality to place unbridged contigs in late-build genomes into their correct location, and have, with collaborators, published the corrected locations of more than half the unplaced contigs in the current build of the mouse genome.
I present contiBAIT, a software package for assembling early-build genomes which consist entirely of unanchored, unbridged contigs. ContiBAIT has the potential to dramatically improve the quality of many model organism genomes at low cost. These developments enable rapid, automated, objective and reproducible deep profiling of AML flow cytometry data, subclonal cell analysis of AML cytogenetics, and improvements to model organisms used in AML research.
It is increasingly challenging to analyze the data produced in biomedicine, even more so when relying on manual analysis methods. My hypothesis is that a common representation of knowledge, implemented via standard tools and logically formalized, can make those datasets computationally amenable, help integrate data from multiple sources, and allow complex queries to be answered. The first part of this dissertation demonstrates that ontologies can be used as common knowledge models, and details several use cases where they have been applied to existing information in the domains of biomedical investigations, clinical data and vaccine representation. In the second part, I address current issues in developing and implementing ontologies, and propose solutions to make ontologies and the datasets they are applied to available on the Semantic Web, increasing their visibility and reuse. The last part of my thesis builds upon the first two, and applies their results to pharmacovigilance, specifically to the analysis of reports of adverse events following immunization. I encoded existing standard clinical guidelines from the Brighton Collaboration in Web Ontology Language (OWL) in the Adverse Events Reporting Ontology (AERO), which I developed within the framework of the Open Biological and Biomedical Ontologies Foundry. I show that it is possible to automate the classification of adverse events using the AERO with very high specificity (97%). I also demonstrate that AERO can be used with other types of guidelines. Finally, my pipeline relies on open and widely used data standards (Resource Description Framework (RDF), OWL, SPARQL) for implementation, making the system easily transposable to other domains. This thesis validates the usefulness of ontologies as semantic models in biomedicine, enabling automated, computational processing of large datasets. It also fulfills the goal of raising awareness of semantic technologies in the clinical community of users.
Following my results, the Brighton Collaboration is moving towards providing a logical representation of their guidelines.
Flow Cytometry (FCM) is widely used to investigate and diagnose human disease. Although high-throughput systems allow rapid data collection from large cohorts, manual data analysis can take months. Moreover, identification of cell populations can be subjective, and analysts rarely examine the entirety of the multidimensional dataset (focusing instead on a limited number of subsets, the biology of which has usually already been well described). Thus, the value of Polychromatic Flow Cytometry (PFC) as a discovery tool is largely wasted. In this thesis, I will present three computational tools that, once merged, provide a complete pipeline for analysis and visualization of FCM data: (1) a clustering algorithm for identification of homogeneous groups of cells (cell populations); (2) a set of statistical tools for identifying immunophenotypes (based on the cell populations) that are correlated with an external variable (e.g., a clinical outcome); (3) a tool for identifying the most important parent populations that can best describe a set of related immunophenotypes. In addition to technical advancements, this pipeline represents a conceptual advance that allows a more powerful, automated, and complete analysis of complex flow cytometry data than previously possible. As a side product, this pipeline allows complex information from PFC studies to be translated into clinical or resource-poor settings, where multiparametric analysis is less feasible. I demonstrated the utility of this approach in a large (n = 466), retrospective, 14-parameter PFC study of early HIV infection, where we identified three T-cell subsets that strongly predicted progression to AIDS (only one of which was identified by an initial manual analysis). Before and during the development of this pipeline, a wide range of computational tools for analysis of FCM data were published. However, guidance for end users about appropriate use and application of these methods is scarce.
The Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) is a highly collaborative project for evaluation of these computational tools using real-world datasets. The FlowCAP results presented here will help both computational and biological scientists to better develop and use advanced bioinformatics pipelines.
Master's Student Supervision (2010 - 2020)
Technical complications occurring during the data acquisition process can impact the quality of the cytometry data and its analysis results. Clogs can cause spikes in the data sets in the time domain. Other issues, such as changing machine acquisition speed, can result in a shift in the means of the populations analyzed. The resulting outliers can bias the downstream analysis if left unchecked and, as such, should be identified and removed. To address this need, I developed flowCut, an R package for automated detection of anomalous events and flagging of files for flow cytometry experiments. Results are on par with manual analysis, and it outperforms the existing approaches in data quality control. flowCut has the highest F1 scores in two types of evaluations used in this study and has a zero crash rate on all files tested. I also studied the bone marrow regeneration pattern of acute myeloid leukemia patients after chemotherapy by applying state-of-the-art automated methods. I identified cell populations and biomarkers that are uniquely present in relapsed patients when compared with normal bone marrow data. I also identified cell populations that have different regeneration dynamics between relapsed and non-relapsed patients.
Flow cytometry (FCM) is a technology that allows the rapid quantification of physical and chemical properties of up to millions of cells in a sample. It is commonly used in drug discovery, health research, medical diagnosis and treatment, and vaccine development. Recent technological advancements in optics and reagents allow the quantification of up to 21 parameters per cell, and advancements in robotics allow the use of FCM as a high-throughput technology. The data analysis component lags behind the development of FCM technologies. Conventional analysis of FCM data is labour intensive, subjective, hard to reproduce, error prone and not standardized. Indeed, the traditional analysis represents one of the main bottlenecks for the future adoption of recent technological advancements in biomedical research and the clinical environment. Here, an analysis framework is presented for the automated analysis of FCM data derived from hematopoietic stem cell (HSC) transplant experiments, using data generated in the Terry Fox Laboratory. The data analysis pipeline aims to simplify approaches to analyzing such data and to generate automated tools for accurate analysis and quality control. The tool presented achieves results equivalent to those of the traditional analysis, but avoids the traditional need for continuous user interaction. Incorporated into the analysis pipeline is a model to predict the repopulation outcome of the HSC transplant experiments. Because HSC purification strategies typically achieve purities below 50%, more than half of the mice transplanted with a single cell will not be repopulated. The repopulation prediction model correctly identified 81% of the mice that did not show a positive engraftment, while keeping the misclassification of positive engraftments below 5%.