Filling the gap between sequence and function: a bioinformatics approach

J.W. Bargsten

Research output: Thesis, internal PhD, WU


The research presented in this thesis focuses on deriving function from sequence information, with an emphasis on plant sequence data. Unravelling the impact of genomic elements, in most cases genes, on the phenotype of an organism is a major challenge in biological research and modern plant breeding. An important part of this challenge is the (functional) annotation of such genomic elements. Currently, wet lab experiments may provide high-quality annotation, but they are laborious and costly. With the advent of next generation sequencing platforms, vast amounts of sequence data are generated. These data are used in connection with the available experimental data to derive function from a bioinformatics perspective.

The connection between sequence information and function was approached on the level of chromosome structure (chapter 2) and of gene families (chapter 3) using combinations of existing bioinformatics tools. The applicability of interaction networks for function prediction was demonstrated first by markedly improving an existing method (chapter 4) and then by exploring the role of network topology in function prediction (chapter 5). Taken together, the combination of methods and results presented indicates the potential as well as the current state of the art of function prediction in (plant) bioinformatics.

Chapter 1 introduces the basis for the approaches used and developed in this thesis. This includes the concepts of genome annotation, comparative genomics, gene function prediction and the analysis of network topology for gene function prediction. A requirement for the study of any new organism is the sequencing and annotation of its genome. Current genome annotation is divided into structural identification and functional categorization of genomic elements. The de facto standard for categorizing functional annotation is provided by the Gene Ontology, which is divided into three domains: molecular function, biological process and cellular component. Approaches to predict molecular function and biological process are outlined. Accurate function prediction generally relies on existing input data, often of experimental origin, that can be transferred to unannotated genomic elements. Plants often lack such input data, which poses a big challenge for current function prediction algorithms. In unravelling the function of genomic elements, comparative genomics is an important approach. By comparing multiple genomes, it gives insight into evolution and function as well as genomic structure and variation. Comparative genomics has become an essential toolkit for the analysis of newly sequenced organisms. Bioinformatics methods often need to be adapted to the specific needs of plant genome research. With a focus on the commercially important crop plants tomato and potato, specific requirements of plant bioinformatics, such as the high content of repetitive elements and the lack of experimental data, are outlined.

In chapter 2, the structural homology of the long arm of chromosome 2 (2L) of tomato, potato and pepper is analyzed. Molecular organization and collinear junctions are delineated using multi-color BAC FISH analysis and comparative sequence alignment. We identify several large-scale rearrangements, including inversions and segmental translocations, that were not reported in previous comparative studies. Some of the structural rearrangements are specific to the tomato clade and differentiate tomato from potato, pepper and other solanaceous species. There are many small-scale synteny perturbations, but local gene vicinity is largely preserved. The data suggest that long-distance intra-chromosomal rearrangements and local gene rearrangements have evolved frequently during speciation in the Solanum genus, and that small changes are more prevalent than large-scale differences. The occurrence of transposable elements and other repeats near or at junction breaks may indicate repeat-mediated rearrangements. The ancestral 2L topology is reconstructed and the evolutionary events leading to the current topology are discussed.

In chapter 3, we analyze the Snf2 gene family. As part of large protein complexes, Snf2 family ATPases are responsible for energy supply during chromatin remodeling, but the precise mechanism of action of many of these proteins is largely unknown. They influence many processes in plants, such as the response to environmental stress. The analysis is the first comprehensive study of Snf2 family ATPases in plants. Some subfamilies of the Snf2 gene family are remarkably stable in number of genes per genome, whereas others show expansion and contraction in several plants. One of these subfamilies, the plant-specific DRD1 subfamily, is absent from lower eukaryote genomes, yet it developed into the largest Snf2 subfamily in plant genomes. This subfamily shows evidence of a complex series of evolutionary events. Its expansion, notably in tomato, suggests novel functionality in processes connected to chromatin remodeling. The results underpin and extend the Snf2 subfamily classification, which could help to determine the various functional roles of Snf2 ATPases and to target environmental stress tolerance and yield in future breeding with these genes.

In chapter 4, a new approach to improve the prediction of protein function in terms of biological processes is developed that is particularly attractive for sparsely annotated plant genomes. The combination of the network-based prediction method Bayesian Markov Random Field (BMRF) with the sequence-based prediction method Argot2 shows significantly improved performance compared to each of the methods separately, as well as compared to Blast2GO. The approach was applied to predict biological processes for the proteomes of rice, barrel clover, poplar, soybean and tomato. Analysis of the relationships between sequence similarity and predicted function similarity identifies numerous cases of divergence of biological processes in which proteins are involved, in spite of sequence similarity. Examples of potential divergence are identified for various biological processes, notably for processes related to cell development, regulation, and response to chemical stimulus. Such divergence in biological process annotation for proteins with similar sequences should be taken into account when analyzing plant gene and genome evolution. This way, the integration of network-based and sequence-based function prediction will strengthen the analysis of evolutionary relationships of plant genomes.
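The comparison of sequence similarity with predicted function similarity described for chapter 4 can be pictured with a small sketch. This is only an illustration and not the method used in the thesis: the Jaccard index as function-similarity measure, the thresholds `min_seq` and `max_func`, the function names, and the toy protein and process labels are all assumptions made for this example.

```python
def jaccard(a, b):
    """Jaccard index of two sets of predicted process labels (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def divergent_pairs(seq_similarity, predicted_terms, min_seq=0.8, max_func=0.2):
    """Flag protein pairs with similar sequences but dissimilar predicted processes.

    seq_similarity: dict mapping (protein, protein) -> similarity in [0, 1]
    predicted_terms: dict mapping protein -> set of predicted process labels
    The thresholds are hypothetical and chosen only for illustration.
    """
    hits = []
    for (p, q), sim in seq_similarity.items():
        terms_p = predicted_terms.get(p, set())
        terms_q = predicted_terms.get(q, set())
        # Skip pairs without predictions: missing data is not divergence.
        if not terms_p or not terms_q:
            continue
        func = jaccard(terms_p, terms_q)
        if sim >= min_seq and func <= max_func:
            hits.append((p, q, sim, func))
    return hits


# Toy data with placeholder names (not taken from the thesis).
seq_similarity = {("P1", "P2"): 0.9, ("P1", "P3"): 0.85, ("P2", "P3"): 0.3}
predicted_terms = {
    "P1": {"process_a", "process_b"},
    "P2": {"process_a", "process_b"},
    "P3": {"process_c"},
}

for pair in divergent_pairs(seq_similarity, predicted_terms):
    print(pair)
```

In this toy input only the pair P1/P3 is flagged: its sequences are similar, but the predicted process sets do not overlap.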

In chapter 5, the influence of network topology on network-based function prediction algorithms is investigated. The analysis of biological networks using algorithms such as Bayesian Markov Random Field (BMRF) is a valuable predictor of the biological processes that proteins are involved in. The topological properties and constraints that determine prediction performance in such networks are, however, largely unknown. This chapter presents analyses based on network centrality measures, such as node degree, to evaluate the performance of BMRF upon progressive removal of highly connected hub nodes (pruning). Three different protein-protein interaction networks with data from Arabidopsis, human and yeast were analyzed. All three show that the average prediction performance can improve significantly after pruning. The chapter paves the way for further improvement of network-based function prediction methods based on node pruning.
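The idea of degree-based hub pruning can be sketched in a few lines. The sketch below is a minimal illustration under stated assumptions, not the procedure of the thesis: the function name `prune_hubs`, the `fraction` parameter, ranking nodes once on the unpruned network, and the toy network are all hypothetical, and the step of re-running a predictor such as BMRF on each pruned network is deliberately left out.

```python
def prune_hubs(adjacency, fraction):
    """Return a copy of the network without the top `fraction` of nodes by degree.

    adjacency: dict mapping node -> set of neighbouring nodes (undirected network)
    fraction: share of nodes to remove, between 0 and 1 (hypothetical parameter)
    """
    # Rank nodes by degree, highest first; ties are broken by node name
    # so that the result is deterministic.
    ranked = sorted(adjacency, key=lambda n: (-len(adjacency[n]), n))
    hubs = set(ranked[: int(len(ranked) * fraction)])
    # Drop the hubs and every edge that pointed to them.
    return {
        node: neighbours - hubs
        for node, neighbours in adjacency.items()
        if node not in hubs
    }


# Toy interaction network with placeholder node names.
network = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": {"A"},
}

# Progressively stronger pruning; a real analysis would evaluate the
# prediction performance on each pruned network at this point.
for fraction in (0.0, 0.2, 0.4):
    pruned = prune_hubs(network, fraction)
    print(fraction, sorted(pruned))
```

An alternative design would recompute degrees after each removal, since removing one hub lowers the degree of its neighbours; which variant is appropriate depends on the analysis.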

Chapter 6 discusses the results and methods developed in this thesis in the context of the vast amount of generated sequencing data. Sequencing or re-sequencing a (plant) genome has become fairly straightforward and affordable, but the interpretation of these sequence data for subsequent use is far from trivial. The topics addressed in this thesis (annotation of function, analysis of genome structure and identification of genomic variation) focus on this main bottleneck of biological research. Issues discussed in connection with this work and its future are data accuracy, error propagation, possible improvements and future implications for biological research in crop plants. In particular, the shift of costs from sequencing to downstream analyses, with functional genome annotation as an essential step, is covered. One of the biggest challenges biology and bioinformatics will face is the integration of results from such downstream analyses and other sources into a complete picture. Only this will allow understanding of complex biological systems.


Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Wageningen University
  • Visser, Richard, Promotor
  • Nap, Jan-Peter (Jp), Co-promotor
Award date: 28 Oct 2014
Place of Publication: Wageningen
Print ISBNs: 9789462570764
Publication status: Published - 2014


  • bioinformatics
  • plants
  • genomics
  • nucleotide sequences
  • functional genomics
  • comparative genomics
  • comparative mapping
  • genomes
  • genetic mapping
  • plant breeding
  • methodology

