Genetics research is increasingly focusing on mining fully sequenced genomes and their annotations to identify the causal genes associated with traits (phenotypes) of interest. However, a complex trait is typically associated with multiple quantitative trait loci (QTLs), each comprising many genes, that can positively or negatively affect the trait of interest. To help breeders in ranking candidate genes, we developed an analytical platform called pbg-ld that provides semantically integrated geno- and phenotypic data on Solanaceae species. This platform combines both unstructured data from scientific literature and structured data from publicly available biological databases using the Linked Data approach. In particular, QTLs were extracted from tables of full-text articles from the Europe PubMed Central (PMC) repository using QTLTableMiner++ (QTM), while the genomic annotations were obtained from the Sol Genomics Network (SGN), UniProt and Ensembl Plants databases. These datasets were transformed into Linked Data graphs, which include cross-references to many other relevant databases such as Gramene, Plant Reactome, InterPro and KEGG Orthology (KO). Users can query and analyze the integrated data through a web interface or programmatically via the SPARQL and RESTful services (APIs). We illustrate the usability of pbg-ld by querying genome annotations, by comparing genome graphs, and by two biological use cases in Jupyter Notebooks. In the first use case, we performed a comparative genomics study using pbg-ld to compare the difference in the genetic mechanism underlying tomato fruit shape and potato tuber shape. In the second use case, we developed a seamlessly integrated workflow that uses genomic data from pbg-ld knowledge graphs and prioritization pipelines to predict candidate genes within QTL regions for metabolic traits of tomato.
- Linked data
- Plant breeding
- Prioritization of candidate genes
- Semantic web