Integration of multi-omics data for prediction of phenotypic traits using random forest

Animesh Acharjee, Bjorn Kloosterman, Richard G.F. Visser, Chris Maliepaard

Research output: Contribution to journal › Article › Academic › peer-review

23 Citations (Scopus)

Abstract

Background: In order to find genetic and metabolic pathways related to phenotypic traits of interest, we analyzed gene expression data, metabolite data obtained with GC-MS and LC-MS, proteomics data and a selected set of tuber quality phenotypic data from a diploid segregating mapping population of potato. In this study we present an approach to integrate these -omics data sets for the purpose of predicting phenotypic traits. This gives us networks of relatively small sets of interrelated -omics variables that can predict, with higher accuracy, a quality trait of interest.

Results: We used Random Forest regression to integrate multiple -omics data sets for the prediction of four quality traits of potato: tuber flesh colour, DSC onset, tuber shape and enzymatic discoloration. For tuber flesh colour, beta-carotene hydroxylase and zeaxanthin epoxidase were ranked first and forty-fourth, respectively; both have previously been associated with flesh colour in potato tubers. Combining all the significant genes, LC-peaks, GC-peaks and proteins, the variation explained was 75%, only slightly more than what gene expression or LC-MS data explain by themselves, which indicates that there are correlations among the variables across data sets. For tuber shape regressed on the gene expression, LC-MS, GC-MS and proteomics data sets separately, only gene expression data were found to explain significant variation. For DSC onset, we found 12 significant gene expression levels, 5 metabolite levels (GC) and 2 proteins that are associated with the trait. Using those 19 significant variables, the variation explained was 45%. Expression QTL (eQTL) analyses showed many associations with genomic regions on chromosome 2, which also had the highest explained variation compared with other chromosomes. Transcriptomics and metabolomics analysis of enzymatic discoloration after 5 min resulted in 420 significant genes and 8 significant LC metabolites, two of which were putatively identified as caffeoylquinic acid methyl ester and tyrosine.

Conclusions: In this study, we developed a strategy for selecting and integrating multiple -omics data sets using the random forest method, and selected representative individual peaks for networks based on eQTL, mQTL or pQTL information. Network analysis was done to interpret how a particular trait is associated with gene expression, metabolite and protein data.
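The kind of analysis the abstract describes (random forest regression on combined -omics variables, a "variation explained" figure, and a ranking of variables) can be illustrated with a small sketch. This is not the authors' pipeline: the data are synthetic, the names gene_1, lc_peak_1 and so on are hypothetical, and the choices made here (out-of-bag R² as "variation explained", impurity decrease as the importance measure, the tree settings) are assumptions for illustration. The significance selection and QTL steps mentioned in the abstract are not reproduced.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(X, y, rows, n_try, min_leaf, rng, importance):
    """Grow one regression tree on the sample indices in `rows`."""
    targets = [y[i] for i in rows]
    node_sse = sse(targets)
    leaf = sum(targets) / len(targets)  # a leaf predicts the mean trait value
    if len(rows) < 2 * min_leaf or node_sse == 0.0:
        return leaf
    best = None
    for j in rng.sample(range(len(X[0])), n_try):  # random subset of variables
        for t in sorted({X[i][j] for i in rows})[:-1]:  # candidate thresholds
            left = [i for i in rows if X[i][j] <= t]
            right = [i for i in rows if X[i][j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:
        return leaf
    cost, j, t, left, right = best
    importance[j] += node_sse - cost  # credit the variable with the SSE drop
    return (j, t,
            build_tree(X, y, left, n_try, min_leaf, rng, importance),
            build_tree(X, y, right, n_try, min_leaf, rng, importance))

def predict_tree(node, x):
    """Follow the splits until a leaf (a plain number) is reached."""
    while isinstance(node, tuple):
        j, t, low, high = node
        node = low if x[j] <= t else high
    return node

def fit_forest(X, y, n_trees=40, n_try=2, min_leaf=3, seed=0):
    """Fit bagged trees; keep each tree's out-of-bag (OOB) sample indices."""
    rng = random.Random(seed)
    n = len(X)
    importance = [0.0] * len(X[0])
    forest = []
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        oob = sorted(set(range(n)) - set(boot))
        tree = build_tree(X, y, boot, n_try, min_leaf, rng, importance)
        forest.append((tree, oob))
    return forest, importance

def oob_predictions(forest, X):
    """Average, per sample, the predictions of trees that never saw it."""
    sums, counts = [0.0] * len(X), [0] * len(X)
    for tree, oob in forest:
        for i in oob:
            sums[i] += predict_tree(tree, X[i])
            counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

def variation_explained(y, preds):
    """R^2 computed over the samples that have an OOB prediction."""
    pairs = [(obs, p) for obs, p in zip(y, preds) if p is not None]
    m = sum(obs for obs, _ in pairs) / len(pairs)
    ss_res = sum((obs - p) ** 2 for obs, p in pairs)
    ss_tot = sum((obs - m) ** 2 for obs, _ in pairs)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (hypothetical): 3 "gene" and 2 "LC peak" variables.
data_rng = random.Random(1)
n_samples = 60
genes = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(n_samples)]
# lc_peak_1 is correlated with gene_1, mimicking correlation across data sets.
lc_peaks = [[g[0] + data_rng.gauss(0, 0.5), data_rng.gauss(0, 1)] for g in genes]
X = [g + m for g, m in zip(genes, lc_peaks)]  # join the blocks per sample
names = ["gene_1", "gene_2", "gene_3", "lc_peak_1", "lc_peak_2"]
y = [2.0 * g[0] + data_rng.gauss(0, 0.5) for g in genes]  # simulated trait

forest, importance = fit_forest(X, y)
r2 = variation_explained(y, oob_predictions(forest, X))
ranking = [name for _, name in sorted(zip(importance, names), reverse=True)]
print(f"OOB variation explained: {r2:.2f}")
print("Variables ranked by importance:", ranking)
```

Because lc_peak_1 is simulated as a noisy copy of gene_1, adding it to the gene block contributes little independent information. This mirrors the abstract's observation that combining data sets explained only slightly more variation than gene expression or LC-MS data alone.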

Original language: English
Article number: 180
Journal: BMC Bioinformatics
Volume: 17
Issue number: 5
DOIs: 10.1186/s12859-016-1043-4
Publication status: Published - 2016

Fingerprint

Random Forest
Gene expression
Metabolites
Prediction
Solanum tuberosum
Discoloration
Potato
Color
Chromosomes
Proteins
Proteomics
Quantitative Trait Loci
Genes
Gene Expression Data
Protein
Chromosome
Metabolomics
Chromosomes, Human, Pair 2
Electric network analysis

Keywords

  • Data integration
  • Genetical genomics
  • Networks
  • Random forest

Cite this

@article{a58914e84d0b405187a9a70d23b1be6a,
title = "Integration of multi-omics data for prediction of phenotypic traits using random forest",
keywords = "Data integration, Genetical genomics, Networks, Random forest",
author = "Animesh Acharjee and Bjorn Kloosterman and Visser, {Richard G.F.} and Chris Maliepaard",
year = "2016",
doi = "10.1186/s12859-016-1043-4",
language = "English",
volume = "17",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "Springer Verlag",
number = "5",
}

Integration of multi-omics data for prediction of phenotypic traits using random forest. / Acharjee, Animesh; Kloosterman, Bjorn; Visser, Richard G.F.; Maliepaard, Chris.

In: BMC Bioinformatics, Vol. 17, No. 5, 180, 2016.

TY - JOUR

T1 - Integration of multi-omics data for prediction of phenotypic traits using random forest

AU - Acharjee, Animesh

AU - Kloosterman, Bjorn

AU - Visser, Richard G.F.

AU - Maliepaard, Chris

PY - 2016

Y1 - 2016

KW - Data integration

KW - Genetical genomics

KW - Networks

KW - Random forest

U2 - 10.1186/s12859-016-1043-4

DO - 10.1186/s12859-016-1043-4

M3 - Article

VL - 17

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 5

M1 - 180

ER -