High-throughput open source computational methods for genetics and genomics

J.C.P. Prins

Research output: Thesis (internal PhD, WU)

Abstract

Biology is increasingly data driven owing to the development of high-throughput technologies such as DNA and RNA sequencing. Computational biology and bioinformatics are scientific disciplines at the intersection of biology, informatics and statistics, which is clearly reflected in this thesis. Bioinformaticians often contribute crucial insights and novelty to scientific research because they are central to data analysis and contribute concrete algorithms and software solutions. In addition, bioinformaticians have an important role to play in organising data and software and making them accessible to others. In this thesis, in addition to contributing to biological questions, I discuss issues around accessing and sharing data, including the challenges of handling large data, input/output (IO) bottlenecks and effective use of multi-core computations.

By creating software solutions together with molecular biologists, I contributed and published insights into biological processes in nematodes and plants. I published software solutions that made it easier for others to analyse data, which benefits the wider research community. I also created solutions that made it easier for others to publish software themselves. Computing and the internet make it possible to share ideas and computational methods. I am convinced it is a good idea to publish software solutions as 'free and open source' software (FOSS) in the public domain so that we can continue to build on the work of others.

Chapter 2 presents a computational method for identifying gene families in a sequenced genome that may be involved in pathogenicity, i.e., those genes that code for proteins that interact with molecules of an infected host. The genes encoding such nematode proteins are known to contain highly variable DNA sections that code for the biochemical properties of an interaction site. By applying phylogenetic analysis by maximum likelihood (PAML) and comparing homologous sequences in other organisms with comparable and different lifestyles, we discovered 77 unique candidate sequence families in the plant pathogen Meloidogyne incognita that deserve further investigation in the laboratory.

Chapter 3 presents GenEST, a computational method for predicting which fragments captured by the cDNA-AFLP high-throughput technology match known expressed sequence tags (ESTs). The cDNA-AFLP biochemical process was simulated in silico, and the predicted fragments were matched against the fragment lengths observed by cDNA-AFLP. Through this technique, novel effectors from the nematode Globodera rostochiensis, putatively involved in pathogenicity, were identified and partly confirmed in the laboratory.
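
The idea of simulating cDNA-AFLP in silico can be sketched roughly as follows. This is a minimal illustration and not the GenEST implementation: the enzyme pair (EcoRI and MseI), the function names `predict_fragments` and `match_ests`, the tolerance parameter and the toy sequences are all assumptions made for the example, and cut offsets and adapter lengths are ignored.

```python
import re

# Recognition sites of two restriction enzymes (EcoRI and MseI). The enzyme
# pair is an assumption for illustration; the abstract does not name one.
SITE_A = "GAATTC"  # EcoRI
SITE_B = "TTAA"    # MseI


def predict_fragments(seq):
    """Return (start, length) for fragments bounded by one site of each enzyme.

    Simplification: lengths are measured between the starts of adjacent
    recognition sites; cut offsets and adapter lengths are ignored.
    """
    seq = seq.upper()
    sites = sorted(
        [(m.start(), "A") for m in re.finditer(SITE_A, seq)]
        + [(m.start(), "B") for m in re.finditer(SITE_B, seq)]
    )
    fragments = []
    for (pos1, enz1), (pos2, enz2) in zip(sites, sites[1:]):
        if enz1 != enz2:  # keep only fragments with two different ends
            fragments.append((pos1, pos2 - pos1))
    return fragments


def match_ests(ests, observed_lengths, tolerance=1):
    """Map each observed fragment length to the EST ids that could explain it."""
    hits = {length: [] for length in observed_lengths}
    for est_id, seq in ests.items():
        for _, frag_len in predict_fragments(seq):
            for length in observed_lengths:
                if abs(frag_len - length) <= tolerance:
                    hits[length].append(est_id)
    return hits


# Toy ESTs (hypothetical): est1 carries one site of each enzyme, est2 none.
ests = {
    "est1": "CCGAATTC" + "ACGT" * 10 + "TTAAGG",
    "est2": "GGGGCCCC" * 5,
}
print(match_ests(ests, [46, 120]))  # {46: ['est1'], 120: []}
```

An observed band length with exactly one matching EST is a strong candidate; lengths with several matches stay ambiguous and would need additional evidence, such as the selective nucleotides used in the experiment.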

Chapter 4 presents GenFrag, a computational method that expands on GenEST for predicting which fragments captured by cDNA-AFLP match fragments of a fully sequenced genome with its known spliced gene variants. Through this in silico technique, genes putatively involved in maternal genomic imprinting were identified in the plant Arabidopsis thaliana and partly confirmed in the laboratory.

Chapter 5 presents multiple QTL mapping (MQM), a high-throughput computational method for predicting which sections of a genome correlate with, for example, gene expression. Finding such expression QTL (eQTL) is challenging, not least because many of them are potentially false positives. The parallelized MQM algorithm is embedded in the R/qtl software package, which makes it widely available to researchers. As a result, it is widely cited by studies on model organisms such as mouse, rat, the nematode Caenorhabditis elegans and the plant A. thaliana.
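
The scale of the false-positive problem can be illustrated with simple arithmetic. The sketch below is not part of MQM; it assumes independent tests (linked markers are in fact correlated), uses hypothetical counts of expression traits and markers, and names the Bonferroni correction only as one well-known, conservative way to set a per-test threshold.

```python
def expected_false_positives(n_traits, n_markers, alpha):
    """Expected number of null tests that pass a per-test threshold alpha."""
    return n_traits * n_markers * alpha


def bonferroni_threshold(n_traits, n_markers, family_alpha=0.05):
    """Per-test threshold that bounds the family-wise error rate."""
    return family_alpha / (n_traits * n_markers)


# Hypothetical experiment: 20,000 expression traits tested at 100 markers.
n_traits, n_markers = 20_000, 100
print(expected_false_positives(n_traits, n_markers, alpha=0.05))
print(bonferroni_threshold(n_traits, n_markers))
```

With two million tests, an uncorrected threshold of 0.05 would let roughly a hundred thousand null associations through, which is why eQTL studies need stricter thresholds or additional prior information.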

Chapter 6 presents a theoretical framework, in the form of a review, for identifying plant-resistance genes (R-genes) that combines the lessons learnt in the previous chapters. Plants lack an adaptive immune system and therefore, next to having physical defences, use R-genes to code for proteins that recognise molecules and proteins from invading pathogens; the chapter uses A. thaliana as an example. These R-genes can be viewed as the counterparts of the effectors identified in Chapter 3 and Chapter 4. By introducing the concept of a prior, the chapter discusses how eQTL, or broader xQTL, techniques as presented in Chapter 5 can narrow down gene candidates involved in plant defence.

Chapter 7 and Chapter 8 present FOSS bioinformatics tools and modules that make use of the Ruby programming language. BioRuby (Chapter 7) has components for sequence analysis, pathway analysis, protein modelling and phylogenetic analysis; it supports widely used data formats and provides access to databases, external programs and public web services. All Ruby software created in the context of this thesis was contributed initially to the main BioRuby project, e.g. the PAML parser of Chapter 2, and later as individual biogems (Chapter 8), e.g. the bio-blastxmlparser, bio-alignment, bigbio and bio-rdf biogems for Chapter 2, and three GenFrag-related biogems for Chapter 4. Over 16 modules were contributed by the author as Ruby FOSS projects and are listed on the http://biogems.info/ website. Because of the open nature of the BioRuby project, both BioRuby and biogem software modules are increasingly used and cited in biomedical research, not only in genomics, but also in phylogenetics, prediction of protein structural complexes and data integration.

Chapter 9 presents sambamba, a software tool for scaling up next-generation sequencing (NGS) alignment processing through the use of multiple cores on a computer. Sambamba is a replacement for samtools, a commonly used software tool for working with aligned output from sequencers. Sambamba makes use of multi-core processing and is written in the D programming language. Not only does sambamba outperform samtools, it also comes with an improved deduplication routine and other facilities, such as easy filtering of data. Sambamba is now used in large sequencing centres around the world.
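
To make the terms "filtering" and "deduplication" concrete, here is a small sketch that operates on SAM-formatted alignment lines. It is not sambamba's code or algorithm and does not demonstrate multi-core processing: the function names are hypothetical, the duplicate rule (same reference, position and strand, keeping the read with the highest mapping quality) is a deliberate simplification, and only the FLAG bit values come from the SAM format.

```python
from collections import defaultdict

# FLAG bits as defined by the SAM format.
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_DUPLICATE = 0x400


def parse_sam_line(line):
    """Parse the first five mandatory SAM fields of one alignment line."""
    qname, flag, rname, pos, mapq = line.rstrip("\n").split("\t")[:5]
    return {"qname": qname, "flag": int(flag), "rname": rname,
            "pos": int(pos), "mapq": int(mapq)}


def filter_reads(reads, min_mapq=30):
    """Keep mapped reads whose mapping quality is at least min_mapq."""
    return [r for r in reads
            if not r["flag"] & FLAG_UNMAPPED and r["mapq"] >= min_mapq]


def mark_duplicates(reads):
    """Set the duplicate flag on all but the best read per (ref, pos, strand).

    Simplified rule for illustration; production tools use more careful
    criteria for choosing the representative read.
    """
    groups = defaultdict(list)
    for r in reads:
        strand = bool(r["flag"] & FLAG_REVERSE)
        groups[(r["rname"], r["pos"], strand)].append(r)
    for group in groups.values():
        best = max(group, key=lambda r: r["mapq"])
        for r in group:
            if r is not best:
                r["flag"] |= FLAG_DUPLICATE
    return reads


# Toy alignment records (hypothetical).
lines = [
    "r1\t0\tchr1\t100\t60",
    "r2\t0\tchr1\t100\t40",   # same position and strand as r1
    "r3\t16\tchr1\t100\t60",  # reverse strand, so not a duplicate of r1
    "r4\t4\t*\t0\t0",         # unmapped
]
reads = mark_duplicates(filter_reads([parse_sam_line(l) for l in lines]))
print([r["qname"] for r in reads if r["flag"] & FLAG_DUPLICATE])  # ['r2']
```

Because reads on different references can be grouped independently, this kind of work splits naturally across cores, which is the property a multi-core tool can exploit.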

Chapter 10, 'Big Data, but are we ready?', responds to a publication on using cloud computing for large data processing. The chapter discusses computational bottlenecks and has proved prescient: the number of citations of this paper increases every year.

Chapter 11, 'Towards effective software solutions for big biology', discusses the need for a change of strategy with regard to bioinformatics software development in the biomedical sciences to realise big biology software projects. This includes improved scientific career tracks for bioinformaticians and dedicated funding for big data software development.

Chapter 12 discusses the computational methods and software solutions presented in this thesis and outlines further challenges for computational solutions in elucidating biological processes. The chapter starts with a discussion of the merits and shortcomings of each individual software solution presented in this thesis, followed by a perspective on next-generation sequencing, data integration and future research in software solutions.

Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Wageningen University
Supervisors/Advisors
  • Bakker, Jaap, Promotor
  • Jansen, R.C., Promotor
  • Smant, Geert, Co-promotor
Award date: 5 Oct 2015
Place of Publication: Wageningen
Publisher: Wageningen University
Print ISBNs: 9789462574595
Publication status: Published - 2015

Fingerprint

genomics
bioinformatics
amplified fragment length polymorphism
genes
Nematoda
methodology
Biological Sciences
sequence analysis
Arabidopsis thaliana
phylogeny
proteins
genome
quantitative trait loci
pathogenicity
Globodera rostochiensis
genomic imprinting
structural proteins
biomedical research
organisms
Meloidogyne incognita

Keywords

  • plant parasitic nematodes
  • dna sequencing
  • next generation sequencing
  • throughput
  • computational science
  • genetics
  • genomics

Cite this

Prins, J. C. P. (2015). High-throughput open source computational methods for genetics and genomics. Wageningen: Wageningen University.
Prins, J.C.P. / High-throughput open source computational methods for genetics and genomics. Wageningen: Wageningen University, 2015. 136 p.
@phdthesis{16fa90b37e35464881fc833dbc2185dd,
title = "High-throughput open source computational methods for genetics and genomics",
keywords = "plant parasitic nematodes, dna sequencing, next generation sequencing, throughput, computational science, genetics, genomics",
author = "J.C.P. Prins",
note = "WU thesis 6141",
year = "2015",
language = "English",
isbn = "9789462574595",
publisher = "Wageningen University",
school = "Wageningen University",

}

Prins, JCP 2015, 'High-throughput open source computational methods for genetics and genomics', Doctor of Philosophy, Wageningen University, Wageningen.

High-throughput open source computational methods for genetics and genomics. / Prins, J.C.P.

Wageningen : Wageningen University, 2015. 136 p.


TY - THES

T1 - High-throughput open source computational methods for genetics and genomics

AU - Prins, J.C.P.

N1 - WU thesis 6141

PY - 2015

Y1 - 2015

KW - plant parasitic nematodes

KW - dna sequencing

KW - next generation sequencing

KW - throughput

KW - computational science

KW - genetics

KW - genomics

M3 - internal PhD, WU

SN - 9789462574595

PB - Wageningen University

CY - Wageningen

ER -

Prins JCP. High-throughput open source computational methods for genetics and genomics. Wageningen: Wageningen University, 2015. 136 p.