Application of high performance compute technology in bioinformatics

Research output: Thesis, internal PhD, WU

Abstract

Bioinformatics and computational biology are driven by growing volumes of data from biological systems, data that also tend to increase in complexity. The research presented in this thesis addresses the need to analyze data of such volume and complexity. The results show that the application of high-performance compute technologies, preferably combined with low-cost hardware, is a successful way to develop new bioinformatics approaches that make new types of data analyses and research questions in biology accessible.

An overview of the technologies and recent developments in biology and computer science relevant for this thesis (Chapter 1) identifies current high-throughput sequencing platforms as a key technology. Sequencing platforms now deliver data sets up to terabytes in size for elucidating genome structure, gene content, gene activity, as well as gene variants. The concepts and technologies from computer science to handle these large amounts of data include (a) grid technologies for compute parallelization while making more efficient use of existing low-cost infrastructure; (b) graphics cards for increased compute power and (c) graph databases for large data volume storage and advanced methods for analyses. This thesis presents novel applications and added value of these three concepts for bioinformatics research.  

Small RNAs are important regulators of genome function, yet their prediction in genomes is still a major computational challenge (Chapter 2). They tend to have a minimum free energy (MFE) significantly lower than the MFE of non-small RNA sequences with the same nucleotide composition. Evaluation of many MFEs is, however, too compute-intensive for genome-wide screening. With a local grid infrastructure of desktop computers, MFE distributions of a very large collection of sequence compositions were pre-calculated and used to determine the MFE distribution for any given sequence composition by interpolation. This approach allows on-the-fly calculation for any candidate sequence composition and makes genome-wide screening with this characteristic of a pre-miRNA sequence feasible. MFE evaluation can thus be added as a new parameter for genome-wide selection of potential small RNA candidates. Large-scale pre-calculation of compute-intensive parameters is one of the options for future bioinformatics analyses.

Sequence alignment is essential in the analysis of next-generation sequencing data. The gold standard for sequence alignment is the Smith-Waterman (SW) algorithm. Existing implementations of the full SW algorithm are either not fast enough or limited to dedicated tasks, usually to optimize for speed, whereas popular heuristic SW versions (such as BLAST) suffer from statistical issues. Graphics hardware is well suited to speed up SW alignments, but existing SW implementations on graphics cards do not report the alignment details desired by biologists for further analysis. This thesis presents the CUDA-based Parallel SW Alignment Software (PaSWAS) (Chapter 3). PaSWAS gives (a) easy access to the computational power of NVIDIA-based graphics cards for high-speed sequence alignments, (b) information such as score, number of gaps and mismatches with the accuracy of the full SW algorithm and (c) a report of multiple hits per alignment. Two use cases show the usability and versatility of the new parallel Smith-Waterman implementation for bioinformatics analyses. They demonstrate the added value of low-cost graphics cards in bioinformatics software.
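
As an illustration of the kind of alignment details mentioned above (score, gaps and mismatches), the following is a minimal single-threaded sketch of the full SW algorithm with traceback. It is not the PaSWAS code: the function name `smith_waterman`, the scoring values and the linear gap penalty are assumptions chosen for the example, and no GPU parallelization is attempted.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b with a linear gap penalty.

    Illustrative sketch only; scoring values are assumed, not taken from PaSWAS.
    Returns the best score plus match, mismatch and gap counts from the traceback.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; local alignment never lets a cell drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    matches = mismatches = gaps = 0
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            if a[i - 1] == b[j - 1]:
                matches += 1
            else:
                mismatches += 1
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            gaps += 1
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            gaps += 1
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1

    return {
        "score": best,
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        "aligned_a": "".join(reversed(top)),
        "aligned_b": "".join(reversed(bottom)),
    }


result = smith_waterman("ACGTACGT", "ACGTCGT")
print(result["score"], result["gaps"], result["aligned_a"], result["aligned_b"])
```

Every cell on an anti-diagonal of the matrix depends only on cells from earlier anti-diagonals, which is what makes the algorithm amenable to parallel hardware such as graphics cards.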

To further promote the use of PaSWAS, a new implementation, pyPaSWAS, provides the SW sequence alignment code fully packaged in Python and the more widely accepted OpenCL language (Chapter 4). Moreover, pyPaSWAS now supports an affine gap penalty. pyPaSWAS thus presents an easy Python-based environment for accurate and retrievable parallel SW sequence alignments on GPUs and multi-core systems. The strategy of integrating Python with high-performance parallel compute languages to create a developer- and user-friendly environment is worth considering for other computationally intensive bioinformatics algorithms.
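
To make the notion of an affine gap penalty concrete, here is a minimal score-only sketch of SW with affine gaps, following the well-known Gotoh recurrences. It is not the pyPaSWAS implementation: the function name `sw_affine_score` and the parameter values are assumptions, and the convention used here (a gap of length k costs `gap_open + (k - 1) * gap_extend`) may differ from the one pyPaSWAS uses.

```python
def sw_affine_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score with an affine gap penalty (Gotoh recurrences).

    Illustrative sketch only; parameter values are assumed.
    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols        # H: best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * cols  # F: best score ending in a gap that consumes a[i-1]
    best = 0

    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # E: best score ending in a gap that consumes b[j-1]
        for j in range(1, cols):
            # Either open a new gap from H or extend the gap already open.
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur

    return best


print(sw_affine_score("ACGTACGT", "ACGTCGT"))
```

Separating the cost of opening a gap from the cost of extending it lets one long gap be penalized less than several short ones, which is often considered a better model of insertions and deletions than a linear penalty.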

Thanks to the accuracy and retrieval characteristics of (py)PaSWAS, it was noted that long sequencing reads on the PacBio platform can contain many artificial palindromic sequences. These palindromes are due to errors introduced by whole-genome amplification (WGA). Next-generation sequencing requires sufficient amounts of DNA. If these are not available, WGA is routinely used to generate the amounts of DNA required. The introduction of artificial palindromic sequences hampers assembly and severely limits the value of long sequencing reads. Pacasus is a novel software tool to identify and resolve such artificial palindromic sequences in long sequencing reads (Chapter 5). Two use cases show that Pacasus markedly improves read mapping and assembly of WGA DNA. The resulting quality of mapping and assembly is similar to the quality obtained with non-amplified DNA. Therefore, with Pacasus, long-read technology becomes feasible for the sequencing of samples for which only very small amounts of DNA are available, such as single cells or single chromosomes.

Numerous tools and databases exist to annotate and investigate the functions encoded in properly assembled genomes, such as InterProScan, KEGG, GO and many more. Comparison of functionalities across multiple genomes is, however, not trivial. The concept of graph databases is a promising novel approach from computer science for such multi-genome comparisons. For a data set of all (> 150,000) genes of 17 fungal species functionally annotated with InterProScan, the associated KEGG, GO and annotation data are imported and interconnected in a new Neo4j graph database (Chapter 6). Relationships in this database are visualized and mined with a newly refurbished and extended Neo4j plugin for Cytoscape. Inspection of (sub)graphs of functional annotations is an attractive way to compare and group functional annotation across species. In the use case of the 17 fungal genomes, it helped to outline, compare and explain details of the lifestyle of groups of individual species.

The general discussion of this thesis provides an outlook on the future of bioinformatics in the context of the results presented here (Chapter 7). A grid infrastructure is recommended as a feasible, attractive and cost-effective strategy to create compute power, as is the further inclusion of graphics cards. Full implementation of graph technology is considered necessary for advancing bioinformatics. The work presented in this thesis also shows that the use of grids, graphics cards and graph technology implies the redesign of existing software applications. To be able to create novel stable, predictable and user-friendly applications in bioinformatics, formal training in software engineering principles is highly recommended. Courses and other programs are necessary for the life-long learning that will be crucial for the future of bioinformatics. The main challenges for bioinformatics in the years to come are all data-centered: issues with growing data volumes, with more data types and with higher data complexity. To deal with these challenges, further integration of now separate fields of science is warranted in ways we cannot even imagine yet.

Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Wageningen University
Supervisors/Advisors
  • de Ridder, Dick, Promotor
  • Nap, Jan-Peter (Jp), Co-promotor
Award date: 22 Oct 2019
Place of Publication: Wageningen
Publisher
Print ISBNs: 9789463951128
DOIs
Publication status: Published - 2019
