Developing a discrimination rule between breast cancer patients and controls using proteomics mass spectrometric data: A three-step approach

A.G. Heidema, N. Nagelkerke

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

To discriminate between breast cancer patients and controls, we used a three-step approach to obtain our decision rule. First, we ranked the mass/charge values using random forests, because random forests generate importance indices that take possible interactions into account. We observed that the top-ranked variables consisted of highly correlated contiguous mass/charge values, which were grouped in the second step into new variables. Finally, these newly created variables were used as predictors to find a suitable discrimination rule. In this last step, we compared three methods: classification and regression trees (CART), logistic regression and penalized logistic regression. Logistic regression and penalized logistic regression performed equally well, and both had a higher classification accuracy than CART. The model obtained with penalized logistic regression was chosen because we hypothesized that it would classify more accurately in the validation set. The solution performed well on the training set, with a classification accuracy of 86.3%, a sensitivity of 86.8% and a specificity of 85.7%.
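
The grouping step and the reported performance measures can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_gap` parameter, the use of the mean intensity to form each new variable, and the toy data are all assumptions made for illustration. The abstract states only that contiguous, highly correlated mass/charge values were grouped into new variables, and it does not say how they were combined.

```python
from statistics import mean


def group_contiguous(indices, max_gap=1):
    """Group feature indices into runs of neighbouring positions.

    `indices` are positions of top-ranked mass/charge values on the
    measurement grid; two positions join the same run when they differ
    by at most `max_gap` (hypothetical parameter).
    """
    groups = []
    for i in sorted(set(indices)):
        if groups and i - groups[-1][-1] <= max_gap:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def summarize_groups(spectrum, groups):
    """Build one new variable per group; the mean is an assumed choice."""
    return [mean(spectrum[i] for i in g) for g in groups]


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for labels coded 1 (patient) / 0 (control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy example with made-up positions and labels (not data from the paper).
top_ranked = [12, 13, 14, 40, 41, 97]
groups = group_contiguous(top_ranked)
print(groups)  # [[12, 13, 14], [40, 41], [97]]

spectrum = [float(i) for i in range(100)]  # one hypothetical spectrum
print(summarize_groups(spectrum, groups))  # [13.0, 40.5, 97.0]

print(classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```

In a fuller pipeline, the input to `group_contiguous` would come from a variable-importance ranking, and the grouped variables would be passed to whichever classifier is being compared. A correlation check between neighbouring values could be added to the grouping rule, which this sketch omits.
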
Original language: English
Article number: 5
Number of pages: 9
Journal: Statistical Applications in Genetics and Molecular Biology
Volume: 7
Issue number: 2
DOI: 10.2202/1544-6115.1341
Publication status: Published - 2008

Keywords

  • random forest
  • classification
  • selection
  • bias

Cite this

@article{46e709644b2a4bf0be18f262ca93f90f,
title = "Developing a discrimination rule between breast cancer patients and controls using proteomics mass spectrometric data: A three-step approach",
abstract = "To discriminate between breast cancer patients and controls, we used a three-step approach to obtain our decision rule. First, we ranked the mass/charge values using random forests, because random forests generate importance indices that take possible interactions into account. We observed that the top-ranked variables consisted of highly correlated contiguous mass/charge values, which were grouped in the second step into new variables. Finally, these newly created variables were used as predictors to find a suitable discrimination rule. In this last step, we compared three methods: classification and regression trees (CART), logistic regression and penalized logistic regression. Logistic regression and penalized logistic regression performed equally well, and both had a higher classification accuracy than CART. The model obtained with penalized logistic regression was chosen because we hypothesized that it would classify more accurately in the validation set. The solution performed well on the training set, with a classification accuracy of 86.3{\%}, a sensitivity of 86.8{\%} and a specificity of 85.7{\%}.",
keywords = "random forest, classification, selection, bias",
author = "A.G. Heidema and N. Nagelkerke",
note = "ISI:000254568100002",
year = "2008",
doi = "10.2202/1544-6115.1341",
language = "English",
volume = "7",
journal = "Statistical Applications in Genetics and Molecular Biology",
issn = "1544-6115",
publisher = "De Gruyter",
number = "2",

}

Developing a discrimination rule between breast cancer patients and controls using proteomics mass spectrometric data: A three-step approach. / Heidema, A.G.; Nagelkerke, N.

In: Statistical Applications in Genetics and Molecular Biology, Vol. 7, No. 2, 5, 2008.
