Genomics data integration for knowledge discovery using genome annotations from molecular databases and scientific literature

Research output: Thesis, internal PhD, WU

Abstract

One of the major global challenges today is meeting the food demands of an ever-increasing population (food demand will increase by 50% by 2030). One approach to this challenge is to breed new crop varieties that yield more even under unfavorable conditions, e.g. varieties with improved tolerance to drought and/or resistance to pathogens. However, designing a breeding program is a laborious and time-consuming effort that often cannot generate new cultivars with the required traits quickly enough. Recent advances in biotechnology and genomics data science have the potential to make breeding programs considerably faster and more precise. Because large-scale genomic data sets for crop species are spread across multiple independent data sources and the scientific literature, this thesis presents technologies that use natural language processing (NLP) and semantic web technologies to address the challenges of integrating genomic data for improved plant breeding.

First, we developed a supervised natural language processing (NLP) model, built with IBM Watson, to extract knowledge networks of genotype-phenotype associations for potato tuber flesh color from the scientific literature. Second, we developed a table mining tool called QTLTableMiner++ (QTM), which enables the discovery of novel genomic regions (such as QTL regions) that positively or negatively affect traits of interest. The objective of both NLP techniques was to extract information that is only implicitly described in the literature and is not available in structured resources such as databases. Third, using semantic web technology, we developed a linked data platform called the Solanaceae linked data platform (pbg-ld) to semantically integrate genotypic and phenotypic data of Solanaceae species. This platform combines unstructured data from the scientific literature with structured data from publicly available biological databases using the Linked Data approach. Lastly, analysis workflows for prioritizing candidate genes within QTL regions were tested using pbg-ld. Together, this research provides in silico knowledge discovery tools and a genomic data infrastructure that help researchers and breeders design more precise and improved breeding programs.
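
The idea of prioritizing candidate genes within a QTL region can be illustrated with a minimal sketch. This is not the pbg-ld workflow itself: the `Interval` record, the `genes_in_qtl` function, and all coordinates below are hypothetical, and the sketch assumes that features are described by a chromosome name and 1-based inclusive base-pair positions. It simply selects the genes whose coordinates overlap a given QTL interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A named genomic feature on one chromosome (1-based, inclusive coordinates)."""
    name: str
    chromosome: str
    start: int
    end: int


def genes_in_qtl(qtl: Interval, genes: list[Interval]) -> list[Interval]:
    """Return the genes that overlap the QTL interval, sorted by start position."""
    hits = [
        g for g in genes
        if g.chromosome == qtl.chromosome   # same chromosome
        and g.start <= qtl.end              # gene starts before the QTL ends
        and g.end >= qtl.start              # gene ends after the QTL starts
    ]
    return sorted(hits, key=lambda g: g.start)


# Illustrative, made-up records; these are not data from the thesis.
qtl = Interval("example_qtl", "ch03", 1_000, 5_000)
genes = [
    Interval("geneA", "ch03", 500, 1_200),    # overlaps the left edge of the QTL
    Interval("geneB", "ch03", 6_000, 7_000),  # same chromosome, outside the QTL
    Interval("geneC", "ch05", 2_000, 3_000),  # different chromosome
    Interval("geneD", "ch03", 4_000, 4_500),  # fully inside the QTL
]
candidates = [g.name for g in genes_in_qtl(qtl, genes)]
print(candidates)
```

In a linked data setting, the same overlap condition would typically be expressed as a filter in a query over the integrated gene and QTL annotations rather than over in-memory records.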

Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Wageningen University
Supervisors/Advisors
  • Visser, Richard, Promotor
  • Bachem, Christian, Co-promotor
  • Finkers, Richard, Co-promotor
Award date: 9 Dec 2019
Place of Publication: Wageningen
Publisher: Wageningen University
Print ISBNs: 9789463952019
DOIs: 10.18174/505685
Publication status: Published - 2019

Fingerprint

genomics
genome
world wide web
Solanaceae
quantitative trait loci
breeding
new crops
cultivars
plant breeding
infrastructure
biotechnology
tubers
researchers
drought
potatoes
breeds
color
pathogens
extracts
crops

Cite this

@phdthesis{b8cafe220d79441ea82b678966f1d500,
title = "Genomics data integration for knowledge discovery using genome annotations from molecular databases and scientific literature",
author = "Gurnoor Singh",
note = "WU thesis 7402 Includes bibliographical references. - With summary in English",
year = "2019",
doi = "10.18174/505685",
language = "English",
isbn = "9789463952019",
publisher = "Wageningen University",
school = "Wageningen University",

}

Genomics data integration for knowledge discovery using genome annotations from molecular databases and scientific literature. / Singh, Gurnoor.

Wageningen : Wageningen University, 2019. 135 p.


TY - THES

T1 - Genomics data integration for knowledge discovery using genome annotations from molecular databases and scientific literature

AU - Singh, Gurnoor

N1 - WU thesis 7402 Includes bibliographical references. - With summary in English

PY - 2019

Y1 - 2019


U2 - 10.18174/505685

DO - 10.18174/505685

M3 - internal PhD, WU

SN - 9789463952019

PB - Wageningen University

CY - Wageningen

ER -