TY - JOUR
T1 - On the increase of predictive performance with high-level data fusion
AU - Doeswijk, T.G.
AU - Smilde, A.K.
AU - Hageman, J.A.
AU - Westerhuis, J.A.
AU - van Eeuwijk, F.A.
PY - 2011
Y1 - 2011
N2 - The combination of the different data sources for classification purposes, also called data fusion, can be done at different levels: low-level, i.e. concatenating data matrices, medium-level, i.e. concatenating data matrices after feature selection and high-level, i.e. combining model outputs. In this paper the predictive performance of high-level data fusion is investigated.
Partial least squares is used on each of the data sets and dummy variables representing the classes are used as response variables. Based on the estimated responses for data set j and class k, a Gaussian distribution is fitted. A simulation study is performed that shows the theoretical performance of high-level data fusion for two classes and two data sets. Within-group correlations of the predicted responses of the two models and differences between the predictive ability of each of the separate models and the fused models are studied.
Results show that the error rate is always less than or equal to that of the best-performing subset and can theoretically approach zero. Negative within-group correlations always improve the predictive performance. However, if the data sets have a joint basis, as with metabolomics data, this is not likely to happen. For equally performing individual classifiers the best results are expected for small within-group correlations. Fusion of a non-predictive classifier with a classifier that exhibits discriminative ability leads to increased predictive performance if the within-group correlations are strong. An example with real-life data shows the applicability of the simulation results.
AB - The combination of the different data sources for classification purposes, also called data fusion, can be done at different levels: low-level, i.e. concatenating data matrices, medium-level, i.e. concatenating data matrices after feature selection and high-level, i.e. combining model outputs. In this paper the predictive performance of high-level data fusion is investigated.
Partial least squares is used on each of the data sets and dummy variables representing the classes are used as response variables. Based on the estimated responses for data set j and class k, a Gaussian distribution is fitted. A simulation study is performed that shows the theoretical performance of high-level data fusion for two classes and two data sets. Within-group correlations of the predicted responses of the two models and differences between the predictive ability of each of the separate models and the fused models are studied.
Results show that the error rate is always less than or equal to that of the best-performing subset and can theoretically approach zero. Negative within-group correlations always improve the predictive performance. However, if the data sets have a joint basis, as with metabolomics data, this is not likely to happen. For equally performing individual classifiers the best results are expected for small within-group correlations. Fusion of a non-predictive classifier with a classifier that exhibits discriminative ability leads to increased predictive performance if the within-group correlations are strong. An example with real-life data shows the applicability of the simulation results.
KW - Classification
KW - Data fusion
KW - Error rate
KW - Metabolomics
U2 - 10.1016/j.aca.2011.03.025
DO - 10.1016/j.aca.2011.03.025
M3 - Article
SN - 0003-2670
VL - 705
SP - 41
EP - 47
JO - Analytica Chimica Acta
JF - Analytica Chimica Acta
IS - 1-2
ER -