Doctor of Philosophy in Bioinformatics (PhD)
High Performance Computational Tools for Nucleic Acid Sequence Classification, Assembly and Analysis
No abstract available.
With growing throughput and dropping cost of High-Throughput Sequencing (HTS) technologies, there is a continued need to develop faster and more cost-effective bioinformatics solutions. However, the algorithms and computational power required to efficiently analyze HTS data have lagged considerably. In health and life sciences research organizations, de novo assembly and sequence alignment have become two key steps in everyday research and analysis. The de novo assembly process is a fundamental step in analyzing previously uncharacterized organisms and is one of the most computationally demanding problems in bioinformatics. The sequence alignment is a fundamental operation in a broad spectrum of genomics projects. In genome resequencing projects, they are often used prior to variant calling. In transcriptome resequencing, they provide information on gene expression. They are even used in de novo sequencing projects to help contiguate assembled sequences. As such designing efficient, scalable, and accurate solutions for de novo assembly and sequence alignment problems would have a wide effect in the field. In this thesis, I present a collection of novel algorithms and software tools for the analysis of high-throughput sequencing data using efficient data structures. I also utilize the latest advances in parallel and distributed computing to design and develop scalable and cost-effective algorithms on High-Performance Computing (HPC) infrastructures especially for the de novo assembly and sequence alignment problems. The algorithms and software solutions I develop are publicly available for free for academic use, to facilitate research at health and life sciences laboratories and other organizations worldwide.
Current genome and transcriptome annotation pipelines mostly depend on reference resources. This restricts their annotation capabilities for novel species that might lack reference resources for itself or a closely related species. To address the limitations of these tools and reduce reliance on reference genomes and existing gene models, we present ChopStitch, a method for finding putative exons and constructing splice graphs using transcriptome assembly and whole genome sequencing data as inputs. We implemented a method that identifies exon-exon boundaries in de novo assembled transcripts with the help of a Bloom filter that represents the k-mer spectrum of genomics reads. We have tested our method on characterizing roundworm and human transcriptomes, while using publicly available RNA-Seq and whole genome shotgun sequencing data. We compared our method with LEMONS, Cufflinks and StringTie and found that Chop-Stitch outperforms these state-of-the-art methods for finding exon-exon junctions with and without the help of a reference genome. We have also applied our method for annotating the transcriptome of the American Bullfrog. Chop-Stitch could be used effectively to annotate de novo transcriptome assemblies, and explore alternative mRNA splicing events in non-model organisms, thus exploring new loci for functional analysis, and studying genes that were previously inaccessible. Long non-coding RNA (lncRNA) have shown to contribute towards sub-cellular structural organization, function, and evolution of genomes. With a composite reference transcriptome and a draft genome assembly for the American Bullfrog, we developed a pipeline to find putative lncRNAs in its transcriptome. We used a staged subtractive approach with different strategies to remove coding contigs and reduce our set. This includes predicting coding potentials and open reading frames; running sequence similarity searches with known coding protein sequences and motifs; evaluating contigs through support vector machines. We further refined our set by selecting and keeping contigs with PolyA tails and sequence hexamers. We interrogated our final set for sequences that shared some level of homology with known lncRNAs and amphibian transcriptome assemblies. We selected 7 candidates from our final set for validation through qPCR, out of which 6 were amplified.
The information stored in nucleotide sequences is of critical importance for modern biological and medical research. However, in spite of considerable advancements in sequencing and computing technologies, de novo assembly of whole eukaryotic genomes is still a time-consuming task that requires a significant amount of computational resources and expertise, and remains beyond the reach of many researchers. One solution to this problem is restricting the assembly to a portion of the genome, which is typically a small region of interest. Genes are the most obvious choice for this kind of targeted assembly approach, as they contain the most relevant biological information, which can be acted upon downstream. Here we present Kollector, a targeted assembly pipeline that assembles genic regions using the information from the transcript sequences. Kollector not just enables researchers to take advantage of the rapidly expanding transcriptome data, but is also scalable to large eukaryotic genomes. These features make Kollector a valuable addition to the current crop of targeted assembly tools, a fact we demonstrate by comparing Kollector to the state-of-the-art. Furthermore, we show that by localizing the assembly problem, Kollector can recover sequences that cannot be reconstructed by a whole genome de novo assembly approach. Finally, we also demonstrate several use cases for Kollector, ranging from comparative genomics to viral strain detection.
Obtaining an accurate representation of the microorganisms present in microbial ecosystems presents a considerable challenge. Microbial communities are typically highly complex, and may consist of a variety of differentially abundant bacteria, archaea, and microbial eukaryotes. The targeted sequencing of the 16S rDNA gene has become a standard method for profiling membership and biodiversity of microbial communities, as the bacterial and archaeal community members may be profiled directly, without any intermediate culturing steps. These studies rely upon specialized 16S rDNA gene reference databases, but little systematic and independent evaluation of the annotations assigned to sequences in these databases has been performed. This project examined the quality of the nomenclature annotations provided by the 16S rDNA sequences in three public databases: The Ribosomal Database Project, SILVA, and Greengenes. To do that, first three nomenclature resources – the List of Prokaryotic Names with Standing in Nomenclature, Integrated Taxonomic Information System, and Prokaryotic Nomenclature Up-to-Date – were evaluated to determine their suitability for validating prokaryote nomenclature. A core-set of valid, invalid, and synonymous organism names was then collected from these resources, and used to identify incorrect nomenclature in the public 16S rDNA databases. To assess the potential impact of misannotated reference sequences on microbial gene survey studies, the misannotations identified in the SILVA database were categorized by sample isolation source. Methods for the detection and prevention of nomenclature errors in reference databases were examined, leading to the proposal of several quality assurance strategies for future biocuration efforts. These included phylogenetic methods for the identification of anomalous taxonomic placements, database design principles and technologies for quality control, and opportunities for community assisted curation.
High-throughput RNA sequencing (RNA-seq) is primarily used in measuring gene expression, quantifying transcript abundance, and building reference transcriptomes. Without bias from a reference sequence, de novo RNA-seq assembly is particularly useful for building new reference transcriptomes, detecting fusion genes, and discovering novel spliced transcripts. This is a challenging problem, and to address it at least eight approaches, including Trans-ABySS and Trinity, were developed within the past decade. For instance, using Trinity and 12 CPUs, it takes approximately one and a half day to assemble a human RNA-seq sample of over 100 million read pairs and requires up to 80 GB of memory. While the high memory usage typical of de novo RNA-seq assemblers may be alleviated by distributed computing, access to a high-performance computing environment is a requirement that may be limiting for smaller labs. In my thesis, I present a novel de novo RNA-seq assembler, “RNA-Bloom,” which utilizes compact data structures based on Bloom filters for the storage of k-mer counts and the de Bruijn graph in memory. Compared to Trans-ABySS and Trinity, RNA-Bloom can assemble a human transcriptome with comparable accuracy using nearly half as much memory and half the wall-clock time with 12 threads.
Acute myeloid leukemia (AML) is a genetically heterogeneous disease characterized by the accumulation of acquired somatic genetic abnormalities in hematopoietic progenitor cells. Recurrent chromosomal rearrangements are well-established diagnostic and prognostic markers. However, approximately 50% of AML cases have normal cytogenetics and have variable responses to conventional chemotherapy. Molecular markers have been begun to subdivide cytogenetically normal AML (CN-AML) and have been shown to predict clinical outcome.Despite these achievements, current classification schemes are not completely accurate and improved risk stratification is required. My overall objective was to identify specific gene expression and mutation signatures to define novel subclasses of CN-AML. I hypothesized that CN-AML would be separated into at least two or more subgroups. Gene expression and mutational profiles were established using RNA-Sequencing, clustering, de novo transcriptome assembly, and variant detection. I found the CN-AML could be separated into three groups, two of which had statistically significant survival differences (Kaplan-Meier analysis, log-rank test, p=9.75x10-³). Variant analysis revealed nine fusions that are not detectable via cytogenetic analysis and differential expression analysis identified a set of discriminatory genes to classify each subgroup. These findings contribute to the current understanding of the genetic complexity of AML and highlight gene fusion candidates for follow-up functional analyses.