Estimating disease prevalence from drug utilization data using the Random Forest algorithm

Laurentius C.J. Slobbe, Koen Füssenich*, Albert Wong, Hendriek C. Boshuizen, Markus M.J. Nielen, Johan J. Polder, Talitha L. Feenstra, Hans A.M. Van Oers

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Background: Aggregated claims data on medication are often used as a proxy for the prevalence of diseases, especially chronic diseases. However, the linkage between medication and diagnosis tends to be theory-based and not very precise. Modelling disease probability at the individual level using individual-level data may yield more accurate results.

Methods: Individual probabilities of having a certain chronic disease were estimated using the Random Forest (RF) algorithm. A training set was created from a general practitioners' database of 276 723 cases that included diagnosis and claims data on medication. Model performance for 29 chronic diseases was evaluated using receiver operating characteristic (ROC) curves, by measuring the Area Under the Curve (AUC).

Results: The diseases for which model performance was best were Parkinson’s disease (AUC = .89, 95% CI = .77–1.00), diabetes (AUC = .87, 95% CI = .85–.90), osteoporosis (AUC = .87, 95% CI = .81–.92) and heart failure (AUC = .81, 95% CI = .74–.88). Five other diseases had an AUC >.75: asthma, chronic enteritis, COPD, epilepsy and HIV/AIDS. For 16 of 17 diseases tested, the medication categories used in theory-based algorithms were also identified by our method; however, the RF models included a broader range of medications as important predictors.

Conclusion: Data on medication use can be a useful predictor when estimating the prevalence of several chronic diseases. To improve the estimates for a broader range of chronic diseases, research should use better training data, include more details concerning dosages and duration of prescriptions, and add related predictors such as hospitalizations.
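The AUC values reported in the abstract can be read as the probability that a randomly chosen patient with the disease receives a higher predicted probability than a randomly chosen patient without it. The following is a minimal sketch of that calculation in plain Python, not the authors' code: the function name `auc` and the labels and scores are hypothetical and do not come from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    labels: 1 for patients with the disease, 0 for patients without it.
    scores: predicted disease probabilities (for example from a Random Forest).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # diseased patient ranked above non-diseased patient
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical example data (NOT taken from the study).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
print(round(auc(labels, scores), 3))
```

An AUC of 0.5 corresponds to ranking no better than chance, and 1.0 to perfect separation of patients with and without the disease.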
Original language: English
Pages (from-to): 615-621
Number of pages: 6
Journal: European Journal of Public Health
Volume: 29
Issue number: 4
Early online date: 3 Jan 2019
DOI: https://doi.org/10.1093/eurpub/cky270
Publication status: Published - Aug 2019

Fingerprint

Drug Utilization
Area Under Curve
Chronic Disease
Enteritis
Proxy
General Practitioners
Chronic Obstructive Pulmonary Disease
Osteoporosis
Prescriptions
Parkinson Disease
Epilepsy
Acquired Immunodeficiency Syndrome
Hospitalization
Asthma
Heart Failure
HIV
Databases
Research

Cite this

Slobbe, L. C. J., Füssenich, K., Wong, A., Boshuizen, H. C., Nielen, M. M. J., Polder, J. J., ... Van Oers, H. A. M. (2019). Estimating disease prevalence from drug utilization data using the Random Forest algorithm. European Journal of Public Health, 29(4), 615-621. https://doi.org/10.1093/eurpub/cky270
Slobbe, Laurentius C.J. ; Füssenich, Koen ; Wong, Albert ; Boshuizen, Hendriek C. ; Nielen, Markus M.J. ; Polder, Johan J. ; Feenstra, Talitha L. ; Van Oers, Hans A.M. / Estimating disease prevalence from drug utilization data using the Random Forest algorithm. In: European Journal of Public Health. 2019 ; Vol. 29, No. 4. pp. 615-621.
@article{2336da20e8d4490e9db891b6362c8290,
title = "Estimating disease prevalence from drug utilization data using the Random Forest algorithm",
abstract = "Background: Aggregated claims data on medication are often used as a proxy for the prevalence of diseases, especially chronic diseases. However, the linkage between medication and diagnosis tends to be theory-based and not very precise. Modelling disease probability at the individual level using individual-level data may yield more accurate results. Methods: Individual probabilities of having a certain chronic disease were estimated using the Random Forest (RF) algorithm. A training set was created from a general practitioners' database of 276 723 cases that included diagnosis and claims data on medication. Model performance for 29 chronic diseases was evaluated using receiver operating characteristic (ROC) curves, by measuring the Area Under the Curve (AUC). Results: The diseases for which model performance was best were Parkinson’s disease (AUC = .89, 95{\%} CI = .77–1.00), diabetes (AUC = .87, 95{\%} CI = .85–.90), osteoporosis (AUC = .87, 95{\%} CI = .81–.92) and heart failure (AUC = .81, 95{\%} CI = .74–.88). Five other diseases had an AUC >.75: asthma, chronic enteritis, COPD, epilepsy and HIV/AIDS. For 16 of 17 diseases tested, the medication categories used in theory-based algorithms were also identified by our method; however, the RF models included a broader range of medications as important predictors. Conclusion: Data on medication use can be a useful predictor when estimating the prevalence of several chronic diseases. To improve the estimates for a broader range of chronic diseases, research should use better training data, include more details concerning dosages and duration of prescriptions, and add related predictors such as hospitalizations.",
author = "Slobbe, {Laurentius C.J.} and F{\"u}ssenich, Koen and Wong, Albert and Boshuizen, {Hendriek C.} and Nielen, {Markus M.J.} and Polder, {Johan J.} and Feenstra, {Talitha L.} and {Van Oers}, {Hans A.M.}",
year = "2019",
month = "8",
doi = "10.1093/eurpub/cky270",
language = "English",
volume = "29",
pages = "615--621",
journal = "European Journal of Public Health",
issn = "1101-1262",
publisher = "Oxford University Press",
number = "4",
}

Slobbe, LCJ, Füssenich, K, Wong, A, Boshuizen, HC, Nielen, MMJ, Polder, JJ, Feenstra, TL & Van Oers, HAM 2019, 'Estimating disease prevalence from drug utilization data using the Random Forest algorithm', European Journal of Public Health, vol. 29, no. 4, pp. 615-621. https://doi.org/10.1093/eurpub/cky270

Estimating disease prevalence from drug utilization data using the Random Forest algorithm. / Slobbe, Laurentius C.J.; Füssenich, Koen; Wong, Albert; Boshuizen, Hendriek C.; Nielen, Markus M.J.; Polder, Johan J.; Feenstra, Talitha L.; Van Oers, Hans A.M.

In: European Journal of Public Health, Vol. 29, No. 4, 08.2019, p. 615-621.

TY - JOUR

T1 - Estimating disease prevalence from drug utilization data using the Random Forest algorithm

AU - Slobbe, Laurentius C.J.

AU - Füssenich, Koen

AU - Wong, Albert

AU - Boshuizen, Hendriek C.

AU - Nielen, Markus M.J.

AU - Polder, Johan J.

AU - Feenstra, Talitha L.

AU - Van Oers, Hans A.M.

PY - 2019/8

Y1 - 2019/8

AB - Background: Aggregated claims data on medication are often used as a proxy for the prevalence of diseases, especially chronic diseases. However, the linkage between medication and diagnosis tends to be theory-based and not very precise. Modelling disease probability at the individual level using individual-level data may yield more accurate results. Methods: Individual probabilities of having a certain chronic disease were estimated using the Random Forest (RF) algorithm. A training set was created from a general practitioners' database of 276 723 cases that included diagnosis and claims data on medication. Model performance for 29 chronic diseases was evaluated using receiver operating characteristic (ROC) curves, by measuring the Area Under the Curve (AUC). Results: The diseases for which model performance was best were Parkinson’s disease (AUC = .89, 95% CI = .77–1.00), diabetes (AUC = .87, 95% CI = .85–.90), osteoporosis (AUC = .87, 95% CI = .81–.92) and heart failure (AUC = .81, 95% CI = .74–.88). Five other diseases had an AUC >.75: asthma, chronic enteritis, COPD, epilepsy and HIV/AIDS. For 16 of 17 diseases tested, the medication categories used in theory-based algorithms were also identified by our method; however, the RF models included a broader range of medications as important predictors. Conclusion: Data on medication use can be a useful predictor when estimating the prevalence of several chronic diseases. To improve the estimates for a broader range of chronic diseases, research should use better training data, include more details concerning dosages and duration of prescriptions, and add related predictors such as hospitalizations.

U2 - 10.1093/eurpub/cky270

DO - 10.1093/eurpub/cky270

M3 - Article

VL - 29

SP - 615

EP - 621

JO - European Journal of Public Health

JF - European Journal of Public Health

SN - 1101-1262

IS - 4

ER -