Paving the way for FAIR data in plant phenotyping

Evangelia A. Papoutsoglou

Research output: Thesis, internal PhD, WU


The increasing nutritional demands of the world, as well as the need for crops that perform reliably in spite of diverse environmental conditions (abiotic and biotic stresses and variable weather conditions), put the plant sciences at the forefront of domains where progress is urgently needed. To make such progress, plant phenotyping and genotyping are extremely important. Especially in plant phenotyping, research is met with challenges related to poor data management, and thereby inefficient exploitation, let alone reuse, of datasets. The challenges to phenotypic data reuse and integration arise from the highly distributed nature of data in the domain (as there are no central plant phenotypic data repositories) and their multifaceted heterogeneity. The variety of experimental goals and the sheer number of species studied may necessitate different approaches (e.g. for crops, model organisms, forest trees). Experiments may be conducted in open fields, greenhouses or other locations, follow different designs and produce different types of data (e.g. visual observation of a score, images, manual and automatic measurements, molecular assays). Even when everything else matches, the data files produced may have different formats and structures, which is a challenge for data integration. Moreover, good data documentation practices are often lacking, which hinders interpretation and reuse. In the vast majority of cases, plant phenotyping datasets are used only once, solely to address the research question for which they were originally generated. It is the exception, rather than the rule, when different datasets, produced by different, uncoordinated parties, are analyzed to generate further knowledge. Even rarer, though much more useful, are cases where independently created datasets are integrated for the purpose of meta-analyses or improvement of statistical and predictive models.
Such work is crucial, for example, for multi-environment studies investigating the adaptability of crops to different conditions. This relative rarity of meta-analyses and integrative studies indicates that researchers conduct experiments and collect data anew for every new study they wish to undertake, which is in many cases a suboptimal use of resources. This may not be a serious issue at the level of single experiments, but it has a greater impact at a higher level, where multiple independent experiments could be reused and integrated, e.g. in multi-environment trials.

The challenges described above are not specific to plant phenotyping, but generic to the data life cycle in research. To address them, the FAIR (Findable, Accessible, Interoperable, Reusable) data principles have been proposed as guidelines to alleviate generic reusability bottlenecks. When the principles are followed, datasets become more easily discoverable, interpretable, integrable and reusable. Furthermore, the principles emphasize that there should be an equal focus on human and machine readability, so that automated techniques can facilitate every step of the process. However, the FAIR data principles require domain-specific solutions: it is up to each community to devise ways to implement them. In this thesis, we investigated the application of the FAIR data principles in the domain of plant phenotyping.

Our initial research question focused on a core requirement of FAIR: domain-relevant community standards. We identified and tackled shortcomings of the MIAPPE (Minimum Information About a Plant Phenotyping Experiment) metadata standard, which was initially presented as a flat checklist. We produced a new, refined version, MIAPPE 1.1, which can cover experiments involving a broader range of plant species (including forest trees), has improved documentation, and can now support FAIR data through its explicit data model and ontology (Chapter 2). We tested the new version of the standard by using it to describe a wide range of plant phenotyping experiments, which showed that it can accommodate the metadata of those experiments in a variety of formats.

For our second research question, we addressed the need for machine-readable data exchange between plant breeding information systems with the plant Breeding API (BrAPI), a standardized RESTful API (Application Programming Interface) specification, developed by and for the community (Chapter 3). Unlike MIAPPE, which is strictly a metadata standard for phenotyping experiments, BrAPI has a broader scope, which covers phenotypic and genotypic data alike. BrAPI can now be used to interact uniformly with breeding systems, fetching essential genotypic, phenotypic and organizational information, and BrAPI-compliant endpoints can support modular applications for a variety of use cases. Finally, we ensured that BrAPI includes community-relevant metadata by following the MIAPPE community standard and checking that its essential attributes were present. BrAPI is only one of the MIAPPE implementations: in Chapter 2, to make the metadata standard easier to adopt, we provided more of them for different usage contexts. We developed the Plant Phenotype Experiment Ontology for the RDF (Resource Description Framework) implementation, and a configuration supporting ISA-Tab (Investigation Study Assay-Tabular) archives. As a result, all of these implementations can now communicate MIAPPE-compliant datasets, in fulfillment of the FAIR data requirements for reuse (domain-relevant community standards).
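
As an illustration of the uniform interaction described above, the following minimal Python sketch pages through a BrAPI list endpoint. It is not taken from the thesis. It assumes a BrAPI v2 server whose list responses carry records under result.data and paging information under metadata.pagination; the base URL https://example.org and the helper names fetch_json and iter_brapi are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    """Download and decode one JSON document (plain HTTP GET, no authentication)."""
    with urlopen(url) as response:
        return json.load(response)


def iter_brapi(base_url, endpoint, fetch=fetch_json, page_size=100):
    """Yield every record from a paginated BrAPI v2 list endpoint.

    Assumes the BrAPI v2 response layout: records in result.data and the
    page count in metadata.pagination.totalPages (pages are zero-based).
    """
    page = 0
    while True:
        query = urlencode({"page": page, "pageSize": page_size})
        payload = fetch(f"{base_url}/brapi/v2/{endpoint}?{query}")
        yield from payload["result"]["data"]
        total_pages = payload["metadata"]["pagination"]["totalPages"]
        page += 1
        if page >= total_pages:
            break
```

Against a compliant server, a call such as iter_brapi("https://example.org", "studies") would yield one dictionary per study. The fetch argument is injectable so that the paging logic can be exercised without network access.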

For our third research question, we retraced some of the steps of a previous project, which revolved around the integration and reuse of heterogeneous data (phenotypic, genotypic, environmental) from potato experiments. Reuse was challenging in that project mainly due to a lack of organized metadata, which is a central requirement for FAIR data. Without such metadata, resolving the heterogeneity in the presentation of data to arrive at a common format was time-consuming and, in some cases, ambiguous. To improve this, in Chapter 4, we report steps toward better reusability of the data. Relevant subsets of the datasets were made FAIR and placed on a FAIR Data Point, which can be used to discover, acquire and reproducibly reuse the data. This process showed that the MIAPPE standard can support this integration, and highlighted difficulties that may arise when documentation and metadata are not compiled when an experiment is first conducted. It also emphasized that attributes supported by MIAPPE can be used to integrate datasets from different domains (phenotyping, environment), a type of integration crucial to investigations of crop stability. The FAIR Data Point provides the location of an RDF version (distribution) of the phenotypic dataset. We show that, by using a Jupyter notebook that interacts with it, we can easily create different views of the data, and that combining it with (environmental) data obtained from external resources is trivial.

Finally, we took a different approach toward data integration, findability and reusability. The core concern was the accelerating pace of research publications and the limited time that researchers can devote to consuming large volumes of text. Whereas databases and other structured information sources can be readily explored, articles, which primarily consist of unstructured text, do not enjoy the same benefit. To present researchers with a more efficient means toward hypothesis generation, we constructed knowledge networks based on relationships extracted with natural language processing (NLP) methods, in particular IBM’s Watson suite. Using potato tuber flesh color as the trait of interest, we conducted a time analysis to test the viability of our approach, discovering that latent connections hinting at new genotype-phenotype associations between particular metabolites, proteins and genes had already existed in the literature for some time before they were experimentally confirmed. Our knowledge networks included new and testable genes two years ahead of the actual publications (Chapter 5).
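
The idea of a time analysis on a knowledge network can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the thesis, which relied on relationships extracted with IBM's Watson suite. Here relations are assumed to be available as (entity, entity, publication year) triples, a latent connection is approximated as an indirect path found by breadth-first search, and the toy relations and the names build_network and shortest_path are invented for the example.

```python
from collections import defaultdict, deque


def build_network(relations, up_to_year):
    """Undirected network of all relations published up to and including up_to_year."""
    network = defaultdict(set)
    for entity_a, entity_b, year in relations:
        if year <= up_to_year:
            network[entity_a].add(entity_b)
            network[entity_b].add(entity_a)
    return network


def shortest_path(network, start, goal):
    """Breadth-first search: shortest list of entities from start to goal, or None."""
    if start not in network or goal not in network:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(network[path[-1]]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Invented toy relations, for illustration only (not data from the thesis).
RELATIONS = [
    ("gene_A", "metabolite_X", 2008),
    ("metabolite_X", "flesh color", 2010),
    ("gene_A", "flesh color", 2012),  # direct link reported later
]
```

With these toy relations, the network built up to 2010 already connects gene_A to the trait through metabolite_X, two years before the direct relation appears in 2012. A path longer than two entities in an earlier time slice is what this sketch treats as a latent connection.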

This thesis contributes to state-of-the-art methods for making plant phenotyping data FAIR. With metadata standards to aid interpretation and reusability, and better means for computer-readable data exchange, an infrastructure can be set up to benefit farmers as well as academic and industry stakeholders. Not only can better data management pave the way for more reuse and more powerful analyses and models, improving the landscape for plant research and the outlook for advances in the domain; it can also help with gaining new insights which would not have been possible without the linked datasets. We show, using the carrot rather than the stick, that by having FAIR plant phenotyping data we can enhance reuse and further integration of existing datasets and enable a new era of data-driven research.

Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Wageningen University
Supervisors/Advisors
  • Visser, Richard, Promotor
  • Finkers, R., Co-promotor
  • Athanasiadis, Ioannis, Co-promotor
Award date: 16 Jun 2021
Place of Publication: Wageningen
Print ISBNs: 9789463957977
Publication status: Published - 2021

