Inanc Birol
Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Improving sequence analysis with probabilistic data structures and algorithms (2020)

In the biological sciences, sequence analysis refers to analytical investigations that use nucleic acid or protein sequences to elucidate biological insights, such as their function, species of origin, or evolutionary relationships. However, sequences are not very meaningful by themselves, and useful insights generally come from comparing them to other sequences. Indexing sequences using concepts borrowed from the computational sciences can help perform these comparisons. One such concept is a probabilistic data structure, the Bloom filter, which enables low-memory indexing with high computational efficiency at the cost of false-positive queries, by storing a signature of a sequence rather than the sequence itself. This thesis explores high-performance applications of this probabilistic data structure in sequence classification (BioBloom Tools) and targeted sequence assembly (Kollector), and shows how these tools outperform state-of-the-art methods. To remedy some weaknesses of Bloom filters, such as the inability to index multiple targets, I have developed a novel probabilistic data structure called a multi-index Bloom filter (miBF), which facilitates alignment-free classification against thousands of references. The data structure also synergizes with spaced seeds: hash-based algorithms typically break sequences into subsequences, and spaced seeds are subsequences with wildcard positions that improve classification sensitivity and specificity. This novel data structure enables faster classification and higher sensitivity than sequence alignment-based methods, and executes in an order of magnitude less time while using half the memory of other spaced seed-based approaches. This thesis features formulations of classification false-positive rates in relation to the indexed and queried sequences, and benchmarks the data structure on simulated data.
In addition to my work on short read data, I explore and evaluate methods for finding sequence overlaps in error-prone long read datasets.
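The signature-based indexing trade-off described above can be illustrated with a toy Bloom filter over k-mers. This is a minimal sketch for illustration only, not the BioBloom Tools implementation; the SHA-256-based double hashing and all parameter values are assumptions made for simplicity.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for k-mer membership (illustration only)."""

    def __init__(self, size=1 << 20, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, kmer):
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(kmer.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # May answer True for a k-mer never added (a false positive),
        # but never False for one that was added.
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(kmer))

def kmers(seq, k):
    """Slide a window of length k over the sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

bf = BloomFilter()
for km in kmers("ACGTACGTGGA", 5):
    bf.add(km)

print("ACGTA" in bf)  # True: this k-mer was indexed
```

Only the bit array is stored, never the sequences themselves, which is what keeps the memory footprint low and makes occasional false-positive queries the price of admission.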


Efficient assembly of large genomes (2019)

Genome sequence assembly presents a fascinating and frequently-changing challenge. As DNA sequencing technologies evolve, the bioinformatics methods used to assemble sequencing data must evolve along with them. Sequencing technology has evolved from slab gel sequencing, to capillary sequencing, to short-read sequencing by synthesis, to long-read and linked-read single-molecule sequencing. Each evolutionary jump in sequencing technology required developing new bioinformatic tools to address the unique characteristics of its sequencing data. This work reports the development of efficient methods to assemble short-read and linked-read sequencing data, named ABySS 2.0 and Tigmint. ABySS 2.0 reduces the memory requirements of short-read genome sequence assembly tenfold compared to ABySS 1.0. It does so by using a Bloom filter probabilistic data structure to represent a de Bruijn graph. Tigmint uses linked reads to identify large-scale errors in a genome sequence assembly. Correcting assembly errors with Tigmint before scaffolding improves both the contiguity and correctness of a human genome assembly compared to scaffolding without correction. I have also applied these methods to assemble the 12-gigabase genome of western redcedar (Thuja plicata), which is four times the size of the human genome. Although numerous mitochondrial genomes of angiosperms are available, few gymnosperm mitochondrial genomes have been sequenced. I assembled the plastid and mitochondrial genomes of white spruce (Picea glauca) using whole genome short-read sequencing. I assembled the mitochondrial genome of Sitka spruce (Picea sitchensis) using whole genome long-read sequencing, the largest complete genome assembly of a gymnosperm mitochondrion. The mitochondrial genomes of both species include a remarkable number of trans-spliced genes. I have developed two additional tools, UniqTag and ORCA. UniqTag assigns unique and stable gene identifiers to genes based on their sequence content.
This gene labeling system addresses the inconvenience of gene identifiers changing between versions of a genome assembly. ORCA is a comprehensive bioinformatics computing environment, which includes hundreds of bioinformatics tools in a single easily-installed Docker image, and is useful for education and research. The assembly of linked-read and long-read sequencing of large DNA molecules has yielded substantial improvements in the quality of genome assembly projects.


Parallel algorithms and software tools for high-throughput sequencing data (2017)

With the growing throughput and dropping cost of High-Throughput Sequencing (HTS) technologies, there is a continued need for faster and more cost-effective bioinformatics solutions. However, the algorithms and computational power required to efficiently analyze HTS data have lagged considerably. In health and life sciences research organizations, de novo assembly and sequence alignment have become two key steps in everyday research and analysis. De novo assembly is a fundamental step in analyzing previously uncharacterized organisms and is one of the most computationally demanding problems in bioinformatics. Sequence alignment is a fundamental operation in a broad spectrum of genomics projects: in genome resequencing projects, alignments are often used prior to variant calling; in transcriptome resequencing, they provide information on gene expression; they are even used in de novo sequencing projects to help contiguate assembled sequences. As such, designing efficient, scalable, and accurate solutions for the de novo assembly and sequence alignment problems would have a broad effect on the field. In this thesis, I present a collection of novel algorithms and software tools for the analysis of high-throughput sequencing data using efficient data structures. I also utilize the latest advances in parallel and distributed computing to design and develop scalable and cost-effective algorithms on High-Performance Computing (HPC) infrastructures, especially for the de novo assembly and sequence alignment problems. The algorithms and software solutions I develop are publicly available free for academic use, to facilitate research at health and life sciences laboratories and other organizations worldwide.


Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

High throughput in silico discovery of antimicrobial peptides in amphibian and insect transcriptomes (2021)

The full abstract for this thesis is available in the body of the thesis, and will be available when the embargo expires.


Scalable methods for improving genome assemblies (2021)

De novo genome assembly is a cornerstone of modern genomics studies. It is also a useful method for studying genomes with high variation, such as cancer genomes, as it is not biased by a reference. De novo short-read assemblers commonly use de Bruijn graphs, where nodes are sequences of equal length k, also known as k-mers. Edges in this graph are established between nodes that overlap by k - 1 bases, and nodes along unambiguous walks in the graph are then merged. The selection of k is influenced by several factors, and its fine-tuning results in a trade-off between graph connectivity and sequence contiguity. Ideally, multiple k sizes should be used, so that lower values can provide good connectivity in less-covered regions and higher values can increase contiguity in well-covered regions. However, this approach has only been explored with small genomes, without addressing scalability issues with larger ones. Here we present RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly with a starting k as input and uses a k value closer to the read length to resolve repeats. RResolver builds a Bloom filter of sequencing reads, which it uses to evaluate assembly graph path support at branching points, and removes paths with insufficient support. RResolver runs efficiently, taking on average 3% of a typical ABySS human assembly pipeline run time with 48 threads and 40 GB of memory. Compared to a baseline assembly, RResolver improves scaffold contiguity (NGA50) by up to 16% and reduces misassemblies by up to 7%. RResolver adds a missing component to scalable de Bruijn graph genome assembly. By improving the initial and fundamental graph traversal outcome, all downstream ABySS algorithms benefit greatly from working with a more accurate and less complex representation of the genome.
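The de Bruijn graph construction and merging described above can be sketched in a few lines. This is a toy illustration under assumed inputs, not RResolver or ABySS; the example read and the k value are arbitrary.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a toy de Bruijn graph: nodes are k-mers, and an edge joins
    two k-mers that overlap by k - 1 bases."""
    nodes = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for kmer in nodes:
        for base in "ACGT":
            nxt = kmer[1:] + base  # candidate successor sharing k-1 bases
            if nxt in nodes:
                out_edges[kmer].append(nxt)
                in_degree[nxt] += 1
    return nodes, out_edges, in_degree

def merge_walk(start, out_edges, in_degree):
    """Merge nodes along an unambiguous walk into a single contig."""
    contig, node = start, start
    while len(out_edges[node]) == 1:
        nxt = out_edges[node][0]
        if in_degree[nxt] != 1:  # a branching point: stop merging here
            break
        contig += nxt[-1]
        node = nxt
    return contig

nodes, out_edges, in_degree = de_bruijn(["ACGTACA"], 4)
start = next(n for n in nodes if in_degree[n] == 0)  # a source k-mer
print(merge_walk(start, out_edges, in_degree))  # prints "ACGTACA"
```

The stopping conditions in `merge_walk` are exactly where repeats bite: a branching point with multiple supported extensions cannot be merged, which is the kind of ambiguity a larger effective k, as used by a repeat resolver, helps to break.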


Seasonal and sex-dependent gene expression in emu (Dromaius novaehollandiae) fat tissues (2021)

The emu (Dromaius novaehollandiae) is a bird farmed for its oil, which is rendered from its fat for use in therapeutics and cosmetics. Emu oil is valued for its anti-inflammatory and antioxidant properties, which promote wound healing. Adult emus gain fat in spring and summer, and expend the energy from their fat stores during winter breeding, when food is scarce. Since emus go through an annual cycle of fat gain and loss, understanding the genes affecting fat metabolism and deposition is crucial to improving fat production on emu farms. Samples were taken from the back and abdominal fat tissues of the same four male and four female emus in April, June, and November for RNA sequencing. In November, the emus’ body and fat pad weights were recorded. Seasonal and sex-dependent differentially expressed (DE) genes were analyzed, and genes involved in fat metabolism were identified. A total of 100 DE genes (47 seasonally in males; 34 seasonally in females; 19 between sexes) were found. Gene ontology analysis of seasonally DE genes that differed significantly between the sexes, together with supporting studies, suggested that integrin beta chain-2 (ITGB2) influences fat changes. Six seasonally DE genes functioned in more than two enriched pathways (two in females: angiopoietin-like 4 (ANGPTL4) and lipoprotein lipase (LPL); four in males: lumican (LUM), osteoglycin (OGN), aldolase B (ALDOB), and solute carrier family 37 member 2 (SLC37A2)). Two sex-dependent DE genes, follicle stimulating hormone receptor (FSHR) and perilipin 2 (PLIN2), had functional investigations supporting their influence on fat gain and loss. The results suggested these nine genes (ITGB2, ANGPTL4, LPL, LUM, OGN, ALDOB, SLC37A2, FSHR, PLIN2) functionally influence fat metabolism and deposition in emus. This study lays the foundation for further downstream studies to improve emu fat production through selective breeding using single nucleotide polymorphism markers.


Antimicrobial peptide host toxicity prediction with transfer learning for proteins (2020)

Antimicrobial peptides (AMPs) are host defense peptides produced by all multicellular organisms, and can be used as alternative therapeutics in peptide-based drug discovery. In large peptide discovery and validation pipelines, it is important to avoid the time and resource sinks that arise from the need to experimentally validate a large number of peptides for toxicity. Therefore, in silico methods for the prediction of antimicrobial peptide toxicity can be applied in advance to filter out sequences that may be toxic. While many machine learning-based approaches exist for predicting the toxicity of proteins, the problem is often defined as classifying venoms and toxins against nonvenomous proteins. In my thesis I propose a new method called tAMPer that focuses on classifying, from sequence alone, AMPs that may or may not induce host toxicity. I used the deep learning model ELMo, as adapted by SeqVec, to obtain vector embeddings for a dataset of synthetic and natural AMPs that have been experimentally tested in vitro for toxicity through hemolytic and cytotoxicity assays. This is a balanced dataset of ~2600 sequences, split 80/20 into training and test sets. By utilizing the latent representation of the data from SeqVec, and by further applying ensemble learning methods to these embeddings, I built a model capable of predicting the toxicity of antimicrobial peptides with an F1 score of 0.758 and an accuracy of 0.811 on the test set, performing better than state-of-the-art approaches both when trained and tested on our dataset and on the other methods’ respective training and test datasets.


De novo annotation of non-model organisms using whole genome and transcriptome shotgun sequencing (2017)

Current genome and transcriptome annotation pipelines mostly depend on reference resources. This restricts their annotation capabilities for novel species that lack reference resources for themselves or for a closely related species. To address the limitations of these tools and reduce reliance on reference genomes and existing gene models, we present ChopStitch, a method for finding putative exons and constructing splice graphs using transcriptome assembly and whole genome sequencing data as inputs. We implemented a method that identifies exon-exon boundaries in de novo assembled transcripts with the help of a Bloom filter that represents the k-mer spectrum of genomic reads. We tested our method by characterizing roundworm and human transcriptomes, using publicly available RNA-Seq and whole genome shotgun sequencing data. We compared our method with LEMONS, Cufflinks, and StringTie, and found that ChopStitch outperforms these state-of-the-art methods for finding exon-exon junctions with and without the help of a reference genome. We also applied our method to annotate the transcriptome of the American Bullfrog. ChopStitch can be used effectively to annotate de novo transcriptome assemblies and explore alternative mRNA splicing events in non-model organisms, thus opening new loci for functional analysis and enabling the study of genes that were previously inaccessible. Long non-coding RNAs (lncRNAs) have been shown to contribute to sub-cellular structural organization, and to the function and evolution of genomes. With a composite reference transcriptome and a draft genome assembly for the American Bullfrog, we developed a pipeline to find putative lncRNAs in its transcriptome. We used a staged subtractive approach with different strategies to remove coding contigs and reduce our set.
This included predicting coding potential and open reading frames, running sequence similarity searches against known coding protein sequences and motifs, and evaluating contigs with support vector machines. We further refined our set by keeping contigs with polyA tails and sequence hexamers. We interrogated our final set for sequences that shared some level of homology with known lncRNAs and amphibian transcriptome assemblies. We selected 7 candidates from our final set for validation by qPCR, 6 of which were amplified.


Kollector: transcript-informed targeted de novo assembly of gene loci (2017)

The information stored in nucleotide sequences is of critical importance for modern biological and medical research. However, in spite of considerable advancements in sequencing and computing technologies, de novo assembly of whole eukaryotic genomes is still a time-consuming task that requires a significant amount of computational resources and expertise, and remains beyond the reach of many researchers. One solution to this problem is restricting the assembly to a portion of the genome, typically a small region of interest. Genes are the most obvious choice for this kind of targeted assembly approach, as they contain the most relevant biological information, which can be acted upon downstream. Here we present Kollector, a targeted assembly pipeline that assembles genic regions using information from transcript sequences. Kollector not only enables researchers to take advantage of rapidly expanding transcriptome data, but also scales to large eukaryotic genomes. These features make Kollector a valuable addition to the current crop of targeted assembly tools, a fact we demonstrate by comparing Kollector to the state-of-the-art. Furthermore, we show that by localizing the assembly problem, Kollector can recover sequences that cannot be reconstructed by a whole genome de novo assembly approach. Finally, we demonstrate several use cases for Kollector, ranging from comparative genomics to viral strain detection.


Nomenclature errors in public 16S rDNA gene databases: strategies to improve the accuracy of sequence annotations (2017)

Obtaining an accurate representation of the microorganisms present in microbial ecosystems presents a considerable challenge. Microbial communities are typically highly complex, and may consist of a variety of differentially abundant bacteria, archaea, and microbial eukaryotes. Targeted sequencing of the 16S rDNA gene has become a standard method for profiling the membership and biodiversity of microbial communities, as bacterial and archaeal community members may be profiled directly, without any intermediate culturing steps. These studies rely upon specialized 16S rDNA gene reference databases, but little systematic and independent evaluation of the annotations assigned to sequences in these databases has been performed. This project examined the quality of the nomenclature annotations assigned to the 16S rDNA sequences in three public databases: the Ribosomal Database Project, SILVA, and Greengenes. To do so, three nomenclature resources – the List of Prokaryotic Names with Standing in Nomenclature, the Integrated Taxonomic Information System, and Prokaryotic Nomenclature Up-to-Date – were first evaluated to determine their suitability for validating prokaryote nomenclature. A core set of valid, invalid, and synonymous organism names was then collected from these resources and used to identify incorrect nomenclature in the public 16S rDNA databases. To assess the potential impact of misannotated reference sequences on microbial gene survey studies, the misannotations identified in the SILVA database were categorized by sample isolation source. Methods for the detection and prevention of nomenclature errors in reference databases were examined, leading to the proposal of several quality assurance strategies for future biocuration efforts. These included phylogenetic methods for the identification of anomalous taxonomic placements, database design principles and technologies for quality control, and opportunities for community-assisted curation.


RNA-Bloom: de novo RNA-seq assembly with Bloom filters (2017)

High-throughput RNA sequencing (RNA-seq) is primarily used for measuring gene expression, quantifying transcript abundance, and building reference transcriptomes. Free from bias toward a reference sequence, de novo RNA-seq assembly is particularly useful for building new reference transcriptomes, detecting fusion genes, and discovering novel spliced transcripts. This is a challenging problem, and to address it at least eight approaches, including Trans-ABySS and Trinity, have been developed within the past decade. For instance, using Trinity and 12 CPUs, it takes approximately one and a half days to assemble a human RNA-seq sample of over 100 million read pairs, and requires up to 80 GB of memory. While the high memory usage typical of de novo RNA-seq assemblers may be alleviated by distributed computing, access to a high-performance computing environment is a requirement that may be limiting for smaller labs. In my thesis, I present a novel de novo RNA-seq assembler, "RNA-Bloom," which utilizes compact data structures based on Bloom filters to store k-mer counts and the de Bruijn graph in memory. Compared to Trans-ABySS and Trinity, RNA-Bloom can assemble a human transcriptome with comparable accuracy using nearly half as much memory and half the wall-clock time with 12 threads.
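The k-mer counting idea mentioned above can be illustrated with a toy counting Bloom filter, which replaces the bit array of a plain Bloom filter with small counters. This is a hypothetical sketch, not RNA-Bloom's actual data structure; the SHA-256-based double hashing and sizes are assumptions made for simplicity.

```python
import hashlib

class CountingBloom:
    """Toy counting Bloom filter for approximate k-mer counts. Hash
    collisions can overestimate a count, but never underestimate it."""

    def __init__(self, size=1 << 16, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.counts = [0] * size

    def _positions(self, kmer):
        # Derive num_hashes counter slots from one digest (double hashing).
        digest = hashlib.sha256(kmer.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def increment(self, kmer):
        for pos in self._positions(kmer):
            self.counts[pos] += 1

    def count(self, kmer):
        # The true count is bounded above by the minimum over all slots.
        return min(self.counts[pos] for pos in self._positions(kmer))

cbf = CountingBloom()
for km in ["ACGTA", "ACGTA", "CGTAC"]:
    cbf.increment(km)
print(cbf.count("ACGTA"))  # at least 2; exact here barring collisions
```

Because the structure stores fixed-size counter arrays rather than the k-mers themselves, its memory footprint is set up front, independent of how many distinct k-mers the reads contain.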


Gene expression and mutation profiles define novel subclasses of cytogenetically normal acute myeloid leukemia (2016)

Acute myeloid leukemia (AML) is a genetically heterogeneous disease characterized by the accumulation of acquired somatic genetic abnormalities in hematopoietic progenitor cells. Recurrent chromosomal rearrangements are well-established diagnostic and prognostic markers. However, approximately 50% of AML cases have normal cytogenetics and show variable responses to conventional chemotherapy. Molecular markers have begun to subdivide cytogenetically normal AML (CN-AML) and have been shown to predict clinical outcome. Despite these achievements, current classification schemes are not completely accurate, and improved risk stratification is required. My overall objective was to identify specific gene expression and mutation signatures to define novel subclasses of CN-AML. I hypothesized that CN-AML could be separated into at least two subgroups. Gene expression and mutational profiles were established using RNA sequencing, clustering, de novo transcriptome assembly, and variant detection. I found that CN-AML could be separated into three groups, two of which had statistically significant survival differences (Kaplan-Meier analysis, log-rank test, p = 9.75×10⁻³). Variant analysis revealed nine fusions that are not detectable by cytogenetic analysis, and differential expression analysis identified a set of discriminatory genes to classify each subgroup. These findings contribute to the current understanding of the genetic complexity of AML and highlight gene fusion candidates for follow-up functional analyses.

