
Predicting the onset of Alzheimer’s disease and related dementia using electronic health records: findings from the Cache County Study on Memory in Aging (1995–2008)

Abstract

Introduction

Clinical notes, biomarkers, and neuroimaging have proven valuable in dementia prediction models. Whether commonly available structured clinical data can predict dementia is an emerging area of research. We aimed to predict gold-standard, research-based diagnoses of dementia including Alzheimer’s disease (AD) and/or Alzheimer’s disease related dementias (ADRD), in addition to ICD-based AD and/or ADRD diagnoses, in a well-phenotyped, population-based cohort using a machine learning approach.

Methods

Administrative healthcare data (k = 163 diagnostic features), in addition to census/vital record sociodemographic data (k = 6 features), were linked to the Cache County Study (CCS, 1995–2008).

Results

Among successfully linked UPDB-CCS participants (n = 4206), 522 (12.4%) had incident dementia (AD alone, AD comorbid with ADRD, or ADRD alone) as per the CCS “gold standard” assessments. Random Forest models, with a 1-year prediction window, achieved the best performance with an Area Under the Curve (AUC) of 0.67. Accuracy declined for dementia subtypes: AD/ADRD (AUC = 0.65); ADRD (AUC = 0.49). Accuracy improved when using ICD-based dementia diagnoses (AUC = 0.77).

Discussion

Commonly available structured clinical data (without labs, notes, or prescription information) demonstrate modest ability to predict “gold-standard” research-based AD/ADRD diagnoses, a finding corroborated by prior research. Using ICD diagnostic codes to identify dementia, as done in the majority of machine learning dementia prediction models, rather than “gold-standard” dementia diagnoses can result in higher accuracy, but whether these models are predicting true dementia warrants further research.


Introduction

Overview

An estimated 6.7 million Americans aged 65 and older were living with dementia in 2023, a number expected to rise to 12.7 million by 2050 [1]. Due to the lack of definitive biomarkers and effective treatments, and the stigma of a diagnosis, up to 50% of Americans with dementia never receive a dementia-related diagnosis [1,2,3]. Fewer still receive a correct diagnosis regarding etiology or underlying pathology, such as Alzheimer’s disease dementia (AD) as compared to vascular dementia, frontotemporal dementia, or some other related dementia (ADRD) including mixed pathology [4]. Predictive models that can detect early signs and symptoms of dementia, or prodromal dementia, may be useful for improving care and patient outcomes [5]. Indeed, with recent findings indicating high diagnostic accuracy of phosphorylated tau 217 and other serum biomarkers for detecting AD among cognitively impaired individuals, we are at a time when blood-based biomarkers can be used to identify AD [6, 7]. Routinely collected health data, including electronic health records (EHR) in large population cohorts, are a promising vehicle for even earlier dementia prediction, before cognitive impairment sets in. To date, however, few EHR-based dementia prediction studies have included research cohorts with gold-standard measures for cognitive assessment.

There are over sixty studies using machine learning (ML) or deep learning (DL) to predict dementia. Approximately one-third of the models to date use neuroimaging to predict the transition from mild cognitive impairment to dementia, while another one-third use voice recordings to identify dementia [8]. Of the over twenty-five studies using clinical features to predict dementia, most use features derived from cognitive screening workups, including neuropsychological test scores and genetic/biomarker testing, and include small sample sizes (i.e., < 500) [8]. Of the nearly dozen ML/DL dementia prediction studies using routinely collected health records, all include clinical notes, medication history, and/or lab test results in addition to diagnostic and procedural codes and basic sociodemographic data [5, 9,10,11,12,13,14,15,16,17,18,19,20]. We are unaware of any prior study using commonly available, structured, routinely collected health data to predict “gold-standard” dementia diagnoses within a large population-based sample.

To address this research gap, we used numerous medical features (captured via Medicare, ambulatory, and inpatient record ICD diagnostic codes) along with key social features (captured via vital records or census data) to predict later expert consensus-based incident dementia diagnoses (AD alone, AD comorbid with ADRD, or ADRD alone) within the Cache County Study of Memory in Aging (CCS, 1995–2008) [21].

Methods

Study population

Full details on the CCS study design and methodology, including the validated survey instruments used, have been published previously [17, 18]. In brief, the CCS was a 13-year prospective epidemiological study of dementia that enrolled 90% (N = 5,092) of the Cache County, Utah, permanent resident population aged 65 years and older as of January 1, 1995 [21, 22]. The primary purpose of the CCS was to examine genetic, psychosocial, and environmental risk factors for late-life cognitive decline. The initial interviews with CCS participants occurred in 1995, with three follow-up waves occurring 3, 7, and 10 years later. At the baseline interview, in addition to sociodemographics, participants (or their informants) reported whether a healthcare provider had ever told them that they had, or had treated them for, any of the following conditions: hypertension, hypercholesterolemia, diabetes mellitus, stroke, coronary artery bypass graft surgery (CABG), and myocardial infarction (MI), among others. For each wave, a multi-stage dementia ascertainment protocol was employed, in which a panel of experts in neurology, geriatric psychiatry, neuropsychology, and cognitive neuroscience reviewed all available data and assigned final consensus diagnoses of AD and/or ADRD using National Institute of Neurological and Communicative Disorders and Stroke/Alzheimer’s Disease and Related Disorders Association (NINCDS-ADRDA) [23], National Institute of Neurological Disorders and Stroke and Association Internationale pour la Recherche et l’Enseignement en Neurosciences (NINDS-AIREN) [21], or other standard research criteria.

Over the course of the four triennial waves of thorough dementia ascertainment among the 5,092 participants, 942 (18.5%) persons were identified with dementia (335 prevalent cases; 607 incident cases). Of these, 58% had final diagnoses (after final follow-up) of AD alone (probable or possible), 11% had AD comorbid with other forms of related dementia (AD/ADRD, also known as AD Mixed), and 31% had other related dementias but no AD (ADRD). The remaining 4,150 participants (81.5%) were deemed cognitively normal or cognitively impaired without dementia (CIND) by the end of follow-up [21]. By the end of the study in 2008, 16% of participants were still alive. Prior work has shown high sensitivity and specificity for identifying dementia over the first two waves of ascertainment [24, 25]. The Cache County Study was approved by the Institutional Review Board. All participants gave written, informed consent to participate.

Key variables for CCS enrollees, including consensus-based dementia diagnoses (“gold standard” diagnoses), age at enrollment, race/ethnicity, and sex, were linked with the 1995–2008 Master Beneficiary Summary File of the Medicare data and the 1996–2008 Utah Department of Health Hospital Facilities and Claims Records (Inpatient Hospital Claims and Ambulatory Surgery records) via the Utah Population Database (UPDB). Additional key social variables obtained from the UPDB included education, rural/urban residence, occupation, and number of live births. The UPDB is a comprehensive data resource that links demographic, medical, and genealogical data for nearly all residents of Utah to support medical research [26]. Study approvals were obtained from the Resource for Genetic and Epidemiologic Research, a special review panel authorizing access to the UPDB, and the University of Utah Institutional Review Board.

Outcome variables

CCS consensus-based “gold standard” dementia diagnoses were dichotomously categorized (present/absent) into the following two mutually exclusive groups: AD (AD with or without mixed pathology) and ADRD (related dementia with no AD).

Predictor features

Predictor medical features used in this study consisted of ICD diagnostic codes sourced from inpatient, ambulatory surgery, or Medicare records. First, we removed all ICD records with an ICD-based dementia diagnosis (Supplemental Table 1). Next, we grouped the remaining diagnostic codes into 65 mutually exclusive health conditions based on the Chronic Conditions Warehouse (CCW) grouping protocol (Supplemental Table 2). Because the 65 code groups are not comprehensive, we added the 99 most frequent codes found in the inpatient, ambulatory surgery, and Medicare records that were not included in the CCW groupings (Supplemental Table 2). For our baseline models, we included these 164 medical features along with age at enrollment into the CCS and sex assigned at birth (k = 166 features in the baseline model). For our extended models, we expanded beyond age and sex to include five additional sociodemographic factors captured via UPDB-linked vital and census records: race/ethnicity, education level, number of live births, earliest occupation, and earliest urban or rural residence (k = 171 features in the extended model). This feature construction is sketched below.
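
The following is a minimal pandas sketch of this feature construction, not the study's actual code: the column names (person_id, icd_code) and the small ccw_map dictionary standing in for the CCW grouping protocol are assumptions for illustration.

```python
import pandas as pd

# Toy claims data; person_id / icd_code are assumed names, not the study's schema.
claims = pd.DataFrame({
    "person_id": [1, 1, 2, 3],
    "icd_code": ["401.9", "250.00", "428.0", "V58.69"],
})

# Stand-in for the CCW grouping protocol (65 condition groups in the study).
ccw_map = {"401.9": "hypertension", "250.00": "diabetes", "428.0": "heart_failure"}
claims["feature"] = claims["icd_code"].map(ccw_map)

# Keep the top-k most frequent codes not covered by the CCW groupings
# (k = 99 in the study; 2 here for brevity).
unmapped = claims.loc[claims["feature"].isna(), "icd_code"]
top_codes = unmapped.value_counts().head(2).index
claims.loc[claims["icd_code"].isin(top_codes), "feature"] = claims["icd_code"]

# One row per participant, one binary present/absent column per feature.
features = (
    claims.dropna(subset=["feature"])
    .assign(present=1)
    .pivot_table(index="person_id", columns="feature", values="present",
                 aggfunc="max", fill_value=0)
)
print(features)
```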

Prediction window

Prior research has predicted dementia using prediction windows ranging from 0 to 10 years [9, 11], meaning that the data used for prediction end at least that many years before a participant’s dementia diagnosis date. We used the most common prediction window of 1 year, with up to 13 years of observation (Fig. 1).

Fig. 1

Observation period for Cache County Study on Memory in Aging participants. AD/ADRD was assessed via “gold standard” Cache County Study expert consensus assessments (four triennial waves of dementia ascertainment) according to the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer’s Disease and Related Disorders Association (NINCDS-ADRDA) and National Institute of Neurological Disorders and Stroke and Association Internationale pour la Recherche et l’Enseignement en Neurosciences (NINDS-AIREN) criteria

Since our goal was to predict rather than simply classify dementia diagnoses, all predictive medical features (k = 166) that occurred after, or within the same calendar year as, the CCS dementia diagnosis were removed from our data. This kept our models prospective: removing these codes ensured the model was not trained on data that occurred after the prediction window. For CCS participants not diagnosed with dementia over the 13-year follow-up period (1995–2008), all predictive feature data were included. A sketch of this filtering step follows.
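
A minimal sketch of this prospective filtering step, under assumed column names (record_year, dx_year): a record is kept only if the participant was never diagnosed or the record predates the calendar year of the CCS diagnosis.

```python
import pandas as pd

# Toy data; record_year / dx_year are assumed names, not the study's schema.
records = pd.DataFrame({
    "person_id": [1, 1, 2],
    "record_year": [1998, 2003, 2001],
})
diagnoses = pd.DataFrame({
    "person_id": [1],        # participant 2 was never diagnosed
    "dx_year": [2002],       # CCS consensus dementia diagnosis year
})

merged = records.merge(diagnoses, on="person_id", how="left")
# Keep records with no diagnosis (dx_year is NaN) or dated strictly before
# the diagnosis year, so training never sees post-diagnosis information.
prospective = merged[merged["dx_year"].isna() |
                     (merged["record_year"] < merged["dx_year"])]
print(prospective)
```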

After removing participants who did not link to any inpatient, ambulatory surgery, or Medicare claim record (n = 176), removing records dated after an AD/ADRD diagnosis, and excluding the 335 CCS participants with prevalent dementia at baseline, 4208 (83%) of the original CCS participants, contributing 221,004 medical records, were included in the study (Fig. 2). Of the included participants, 522 (12.4%) had an incident AD and/or ADRD diagnosis over the follow-up period as per the “gold standard” assessment.

Fig. 2

Overview of study cohort: Cache County Study on Memory in Aging participants who could be linked to the Utah Population Database (inpatient, ambulatory surgery, and/or Medicare records). Prediction modeling warranted removing records that occurred after a dementia diagnosis

Model development

Preprocessing addressed categorical features through one-hot encoding, scaled numeric features, and adjusted for class imbalance using sampling techniques. One-hot encoding created a new binary column for each possible value of a categorical variable: one if the value was present and zero otherwise. Numeric features (e.g., age, number of children) were scaled using a MinMaxScaler to bring their values into the range 0 to 1. A variety of imbalance techniques (random oversampling, random undersampling, and the synthetic minority oversampling technique [SMOTE]) were evaluated for each outcome metric (Supplemental Table 3).
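
As a minimal sketch of this preprocessing on toy data (the feature layout is assumed, and only SMOTE of the several imbalance techniques the study evaluated is shown):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from imblearn.over_sampling import SMOTE

# Toy matrix: column 0 is numeric (age), column 1 is categorical (sex).
X = np.array([[70, 0], [82, 1], [75, 0], [90, 1], [68, 0], [79, 1]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1])  # imbalanced outcome

pre = ColumnTransformer([
    ("num", MinMaxScaler(), [0]),    # scale numeric features to [0, 1]
    ("cat", OneHotEncoder(), [1]),   # one binary column per category value
])
X_pre = pre.fit_transform(X)

# Oversample the minority (dementia) class; applied to training data only.
X_bal, y_bal = SMOTE(k_neighbors=1, random_state=0).fit_resample(X_pre, y)
print(X_bal.shape, np.bincount(y_bal))
```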

After preprocessing, the data were separated into training and testing datasets using a stratified 80/20 split for each outcome variable [27]. Four models were trained (Gradient Boosting Classifier [GBC], Random Forest [RF], multi-layer perceptron [MLP], and XGBoost [XGB]) using the 166 features (171 in extended models) for the three outcome metrics (all-cause dementia, AD/ADRD, and ADRD). Each of the four models was trained on the training dataset using nested cross-validation with five outer folds and three inner folds; nested cross-validation procedures yield robust and unbiased estimates of model performance [28, 29]. Hyperparameter optimization (Supplemental Table 4) was conducted using GridSearchCV on the inner folds to determine the best model parameters based on f1_macro scoring. Each model was trained and evaluated against classification thresholds between 0 and 1 in sliding increments of 0.01: any predicted probability at or above the threshold was classified as positive, and anything below it as negative. Once the best threshold and hyperparameters were determined from the highest f1 score, the model was retrained on the full training dataset using the best hyperparameters. The trained model was then tested on ten bootstrapped samples from the testing dataset to assess its generalizability, and each bootstrapped result was compiled across four evaluation metrics: AUC, specificity, sensitivity, and f1 score. The f1 score, the harmonic mean of precision and recall, is an important metric for evaluating imbalanced classes because it gives equal weight to precision and recall. The f1-macro variant averages the f1 score over classes without regard to class prevalence, so the majority class does not mask how well the model predicts the minority class of interest. The ten f1-score test results for the four models were then tested for significance using the Friedman test; if the Friedman test was significant, the Nemenyi post-hoc test was conducted. The Friedman test detects differences among the means of three or more groups, while the Nemenyi test identifies which pairs of groups differ significantly. The best model was selected by f1 score.
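
The following simplified sketch of this training loop runs on synthetic data and is not the study's code: it shows an inner 3-fold GridSearchCV (with an illustrative grid, not the study's), a 0.01-increment threshold sweep on macro F1, bootstrapped evaluation on the held-out test set, and a Friedman test (here fed jittered copies of one model's scores as stand-ins for the other models' results; the outer cross-validation folds are omitted for brevity).

```python
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X, y = make_classification(n_samples=600, n_features=20, weights=[0.88], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Inner 3-fold hyperparameter search scored on macro F1.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
grid.fit(X_tr, y_tr)
best = grid.best_estimator_

# Sweep classification thresholds from 0 to 1 in 0.01 increments.
probs = best.predict_proba(X_tr)[:, 1]
best_t = max(np.arange(0.0, 1.0, 0.01),
             key=lambda t: f1_score(y_tr, (probs >= t).astype(int), average="macro"))

# Evaluate on ten bootstrapped samples of the held-out test set.
rng = np.random.default_rng(0)
scores = []
for _ in range(10):
    idx = rng.integers(0, len(X_te), len(X_te))
    preds = (best.predict_proba(X_te[idx])[:, 1] >= best_t).astype(int)
    scores.append(f1_score(y_te[idx], preds, average="macro"))

# Friedman test across models' bootstrap scores (jittered stand-ins here);
# a Nemenyi post-hoc test would follow a significant result.
stat, p = friedmanchisquare(scores, [s - 0.01 for s in scores], [s + 0.02 for s in scores])
print(f"mean F1 = {np.mean(scores):.3f}, Friedman p = {p:.3f}")
```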

Python 3 was used for all analyses, with the scikit-learn package used in conjunction with the imblearn package for implementing imbalance techniques. We evaluated the best performing model based on the highest average f1 score after 10 rounds of bootstrapping. Built-in feature importance for RF models and SHapley Additive exPlanations (SHAP) analysis for all models were used to identify influential features.
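
A self-contained sketch of these two feature-importance views on synthetic data (not the study's pipeline); note that the shape of TreeExplainer output varies across shap versions, which the snippet normalizes.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

# 1) Built-in impurity-based importances (available for Random Forests).
print(rf.feature_importances_)

# 2) SHAP values via TreeExplainer (fast, model-specific for tree ensembles).
explainer = shap.TreeExplainer(rf)
sv = explainer.shap_values(X)
# Older shap versions return a list of per-class arrays; newer versions
# return a single (n_samples, n_features, n_classes) array.
pos = sv[1] if isinstance(sv, list) else np.asarray(sv)[..., 1]

# Rank features by mean absolute SHAP value for the positive class.
print(np.abs(pos).mean(axis=0).argsort()[::-1])
```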

Sensitivity analysis I: expanding to include ICD-based dementia diagnoses

Given the high specificity of dementia diagnoses within administrative healthcare records [3, 4], we additionally ran models that included both the CCS “gold-standard” dementia diagnoses and the ICD-based diagnoses. Furthermore, given repeated clinic encounters in our ICD-based administrative healthcare database, we estimated models that required at least one ICD dementia diagnosis for a participant to be considered a dementia case, and models that required at least three [11, 30].
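
A minimal sketch of these two case definitions, with hypothetical column names: dementia ICD codes are counted per participant and thresholded at ≥ 1 or ≥ 3.

```python
import pandas as pd

# Toy encounter-level data; is_dementia_icd flags a dementia diagnostic code.
dx = pd.DataFrame({
    "person_id":       [1, 1, 1, 2, 3],
    "is_dementia_icd": [1, 1, 1, 1, 0],
})

counts = dx.groupby("person_id")["is_dementia_icd"].sum()
case_ge1 = counts >= 1   # any dementia code qualifies as a case
case_ge3 = counts >= 3   # stricter definition: at least three codes
print(case_ge1.to_dict(), case_ge3.to_dict())
```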

Sensitivity analysis II: not eliminating records that came after dementia diagnosis (classification modeling)

Given that dementia researchers using administrative healthcare records for retrospective cohort analyses are interested in correctly classifying dementia, we additionally ran models in which we did not exclude medical features occurring after the dementia diagnosis.

Sensitivity analysis III: addressing potential selection bias by including all CCS participants, regardless of whether they had medical record linkage or not

Given potential selection bias by excluding individuals who did not link to ≥ 1 inpatient, ambulatory surgery, and/or Medicare records within the UPDB but who did have sociodemographic features available (n = 5091), we ran our models by including all participants, regardless of whether they had medical record linkage (i.e., included CCS participants who only contributed sociodemographic data).

Results

Study population characteristics

The average age at enrollment among included CCS participants was 76.3 ± 7.0 years; among those who developed dementia over the study period, the average age at dementia diagnosis was 86.0 ± 7.0 years. The study comprised 56% females, 98% white non-Hispanics, and 84% with less than a college education, with an average of 2.1 ± 2.6 live births. Farming and homemaking were the most common occupations, together making up 29% of the total. The majority (88%) lived in urban areas.

Regarding the prevalence of important cardiovascular and cerebrovascular risk factors, 51% of the population self-reported a prior diagnosis of hypertension at baseline, 20% high cholesterol, 11% diabetes mellitus, 19% obesity, 3% stroke, 7% CABG, and 11% MI. Baseline prevalences of CCW ICD diagnoses greater than zero are listed in Supplementary Table 5.

Prediction models using combined datasets

Random Forest (RF) achieved the best performance for predicting dementia on the test set (AUC: 0.67) (Table 1). AUCs declined for dementia subtypes (AD/ADRD: AUC = 0.65; ADRD: AUC = 0.49). Prediction of dementia and dementia subtypes was generally lower for the GBC, MLP, and XGB models (Supplemental Table 6). Results improved slightly when adding sociodemographic features, including number of prior live births, race/ethnicity, education, earliest occupation, and earliest residence (all-cause dementia RF AUC = 0.69; Supplemental Table 7). Implementing techniques to address the dementia class imbalance in our dataset did not improve prediction.

Table 1 Comparison of the random forest model performances for predicting all-cause dementia, AD/ADRD, or ADRD using 166 UPDB administrative healthcare record medical features, enrollment age, and sex in the Cache County study of memory in aging

Feature importance

The most important features were extracted from the fitted model and are shown in Fig. 3 for all-cause dementia. The five most influential features were age, with a weight of 0.1636, followed by heart failure (0.0346), hypertension (0.0343), chronic kidney disease (0.0291), and fibromyalgia (0.0231). Results changed slightly when additional sociodemographic factors were added, with parity and education gaining importance for all-cause dementia (Supplemental Fig. 1). Lead time between various baseline chronic conditions and time to dementia, among CCS participants who developed dementia over the study period, can be found in Supplemental Table 8.

Fig. 3

Top 15 features of importance extracted from the Random Forest fit model for “gold standard” all-cause dementia

Sensitivity analyses

Defining dementia using the UPDB administrative healthcare ICD records (via Medicare claims, ambulatory surgery, and inpatient records) resulted in a marked improvement in accuracy, especially when we defined dementia as having ≥ 1 ICD-based dementia diagnosis over the study period (1995–2008), with an AUC of 0.77 for all-cause dementia in the RF model. Requiring ≥ 3 rather than ≥ 1 ICD-based diagnoses to define dementia did not improve the model.

There was no appreciable improvement in accuracy when classifying rather than predicting dementia, i.e., when we included all medical features over the observation period regardless of whether they came before or after a dementia diagnosis (Supplementary Table 9). Our models addressing selection bias, in which we included all 5091 participants who linked to the UPDB regardless of whether they linked to any medical records, resulted in only a slight improvement in accuracy across the various models (Supplementary Table 10).

Discussion

Main findings

In this study, we evaluated commonly available health records and their ability to predict whether a CCS participant had Alzheimer’s disease or another related dementia, using gold-standard consensus-based diagnoses (1995–2008). Using linked datasets of inpatient, ambulatory surgery, and Medicare data, we sought to determine whether 65 CCW conditions, seven sociodemographic features, and an additional 99 data-driven ICD codes could train a machine learning model. We obtained only modest results, with Random Forest models achieving AUCs of 0.69, 0.61, and 0.50 for all-cause dementia, AD/ADRD, and ADRD, respectively, in the full model of 171 features. Our models improved when using ICD-based dementia diagnoses, which is unsurprising given that the medical features are themselves ICD-based diagnoses. Uncertainty regarding whether ICD-based dementia diagnoses represent true dementia warrants caution in using only ICD-based diagnoses for prediction modeling.

Comparison with previous studies

A recent systematic review reported accuracies between 64% and 99% (mean 86%) in 25 clinical studies using machine or deep learning to predict dementia [8]. However, the majority of these studies, as well as others not listed in this review [9, 10], included neuropsychological test scores, genetic, gait, or other blood-based biomarkers that are not part of commonly available health records in community-based samples.

There are nearly a dozen prior studies strictly using EHR and/or other routinely collected health records (e.g., Medicare claims) in their supervised machine-learning dementia prediction models [5, 9,10,11,12,13,14,15,16,17,18]. All of these studies use a combination of structured data to assemble their predictive features, including sociodemographic data (e.g., age at study entry, race/ethnicity, sex assigned at birth), ICD diagnostic and procedural codes, vital signs, medications, and lab test results, with a handful also including clinical notes [12, 14, 16]. Sample sizes range from 4000 to over 5 million, with prediction windows (time prior to dementia detection) ranging from 0 to 8 years and accuracies between 65% and 94%.

The vast majority of prior EHR AD/ADRD prediction studies used ICD diagnoses (sometimes accompanied by dementia medications) to define their dementia outcome. Only one study was similar to ours in using a community-based sample of individuals participating in an aging cohort study for which gold-standard dementia ascertainment was assessed as the outcome [5]. Barnes et al. conducted a retrospective cohort study among 4330 participants in the Adult Changes in Thought (ACT) study who underwent a comprehensive dementia assessment every two years and had linked EHR data including sociodemographics (age, sex, and race/ethnicity), 31 medical conditions via ICD-9 codes, vital signs (body mass index and blood pressure), healthcare utilization, and medications, for a total of 64 predictors. Splitting the data into a 70:30 training and test set, they applied the LASSO approach to predict unidentified dementia (i.e., dementia identified via ACT assessments but not reported in the EHR) using the EHR information for the prior two years, arriving at an AUC of 0.81 (95% CI 0.78, 0.84). While this study is similar to ours in using a community-based sample and gold-standard assessment for dementia diagnoses, a direct comparison is difficult since we did not remove recognized dementia cases. However, the authors reported that using all dementia cases (recognized and unrecognized) did not improve performance [5]. What this study does reveal is that relatively good prediction (> 80% accuracy) can be achieved in a sample of fewer than 5000 using only 64 knowledge-driven predictors.

The other strictly EHR AD/ADRD structured machine learning studies all used ICD diagnostic codes to identify dementia [11,12,13,14, 17, 18]. Jammeh et al. conducted a study in the UK (2010–2012) among over 26,000 eligible primary care patients (850 with dementia randomly matched to 2213 controls) and used over 15,000 diagnostic, process-of-care, and medication codes over a 2-year period, arriving at an AUC of 0.87 for the Naïve Bayes classification results [18]. Li et al. conducted a study using EHR records from the OneFlorida + Research Consortium of 23,835 ADRD patients randomly matched 1:10 to controls and used over 2500 sociodemographic, PheWAS, RxNorm, CCS, vital sign, and lab value features in their prediction models. Their Gradient Boosting Tree models achieved the best performance, with AUCs of 0.94, 0.91, 0.88, and 0.85 for prediction of ADRD 0, 1, 3, or 5 years, respectively, before diagnosis. A South Korean study (2002–2013) among 40,736 patients using 4894 features captured from the National Health Insurance Service database also found relatively high AUCs: 0.90 for the 0-year and 0.78 for the 1-year model predicting definite AD [13]. A retrospective analysis of 7587 patients from New York (2007–2019) who had at least five years of records (702 with probable AD) found slightly lower AUCs using XGBoost predictive models: 0.76 for the 0-year and 0.75 for the 1-year model predicting probable AD [17]. A study conducted among US Veterans with (n = 1861) and without (n = 9305) dementia used 853 EHR features, including clinical notes, to arrive at an AUC of 0.91 for dementia using logistic regression models [14]. As we saw in our sensitivity analyses, using ICD diagnostic codes to identify dementia (rather than research, consensus-based diagnoses such as those we used from the CCS) can result in higher accuracy, but the question is whether these models are predicting true dementia. Further research using both “gold standard” and ICD-based diagnoses, as well as validation work for dementia ascertainment in medical records, is warranted.

In addition to the Barnes et al. study using gold-standard assessments to capture dementia, the most relevant prior study is that conducted by Miled et al. [12]. This study, for which study dates were not provided, trained classification models on EHR data, captured during the ten years prior to the index date, from the Indiana Network for Patient Care and Research through the Regenstrief Institute. The study was conducted on 2159 patients with dementia and 11,558 controls and extracted prescriptions, diagnoses, and medical notes from the EHR records of each patient. One major strength of the analyses is that the authors reported the performance metrics for the 1-year and 3-year prediction models separately by the datasets used: prescriptions, diagnoses, and medical notes. Sensitivity and specificity for the 1-year models using diagnoses were 0.66 and 0.65, respectively, and for the 3-year models, 0.64 and 0.63. Adding medications and clinical notes resulted in sensitivity and specificity of 0.76 and 0.77 for the 1-year and 0.71 and 0.74 for the 3-year models. While comparisons between Miled et al. and our study are limited given their use of ICD codes to define dementia and their sample of over 13,000 compared to our sample of over 4000, our results using only EHR diagnoses are somewhat comparable with their findings for the 1-year models, although we had higher specificity while they showed higher sensitivity (our RF 1-year prediction models had a sensitivity/specificity of 0.55/0.80 compared to their 0.66/0.65). While etiologic research prioritizes specificity, screening or medical prediction models prioritize sensitivity, which was higher in the Miled et al. study. Regardless of nuances between the two studies, running structured ML models using only EHR ICD diagnoses and procedures results in models of relatively poor performance.

Strengths

One of the strengths of our study is having 13 years of follow-up on a community-based sample of over 4200 well-phenotyped individuals participating in the CCS, with linkages to over 220,000 medical records. To our knowledge, the Barnes et al. study is the only other study that aimed to build predictive models using a gold-standard research cohort linked to administrative healthcare records [5], whereas other studies use some form of ICD-based dementia diagnosis as the main outcome variable. By using gold-standard dementia diagnoses to define our outcome, we strengthen our model’s validity by ensuring that participants who are positive for the disease truly have the disease, and that participants classified as not having dementia are classified with a higher degree of accuracy. We are the first US study to use accessible, standardized Medicare claims and Department of Health facilities data (inpatient and ambulatory surgery records). While our prediction using these standardized records was modest at best, studies using health insurance service databases in other countries, augmented by medication and/or laboratory test results, performed relatively well (AUCs over 0.78), suggesting that these US health databases may also be informative for dementia prediction provided medications and vital signs/lab results are included. Finally, we are the first study to assess whether prediction models differ between Alzheimer’s disease or Alzheimer’s disease comorbid with another dementia (AD/ADRD) and related dementia with no Alzheimer’s disease pathology (ADRD). Across prior machine learning dementia prediction studies using EHR records, those evaluating all-cause dementia have tended towards higher AUCs (> 80%) [5, 10, 11, 14, 16, 18] compared to those evaluating AD alone (AUCs < 80%) [13, 20]. Our finding of better prediction for all-cause dementia compared to AD/ADRD (AUC = 0.65) or ADRD (AUC = 0.49) within our single well-phenotyped CCS needs to be validated in other studies, along with presentation of which features better predict Alzheimer’s disease versus other related dementias.

Weaknesses

Given that our source population was aged 65 years and older and predominantly non-Hispanic white, the generalizability of our findings to younger populations or other races/ethnicities is limited. Indeed, similar to other studies [5], we found little improvement in our models when adding race/ethnicity, likely due to the limited diversity within the sample. Finally, we only compared dementia cases to non-dementia cases. Future work in which non-dementia cases are split between cognitively normal/no dementia and CIND may help with model accuracy.

Conclusion

Overall, our machine learning models had only a modest ability to predict dementia within the CCS. This is due in part to our relatively small sample of just over 4200 participants and, more importantly, to strict reliance on diagnoses with limited information on sociodemographic factors. While our models using ICD diagnoses combined with 7 important sociodemographic characteristics had only modest ability, validation of model effectiveness and transportability should be pursued. Additionally, leveraging a larger dataset with more features beyond diagnosis codes, including medications, vital signs, lab test results, and clinical notes, may enrich the feature set and improve each model’s training and performance in predicting whether an individual develops dementia [12]. Continued leveraging of existing well-characterized research cohorts embedded within administrative healthcare databases such as the UPDB will help improve prediction models by adding features not routinely collected, such as family history of dementia or APOE genotype, in addition to capturing vetted dementia outcome diagnoses. Furthermore, when resources allow, adding imaging, serum, and cerebrospinal fluid biomarkers will help in determining dementia subtypes. This, in the end, will lead to improved early diagnosis, care management, and ultimately better patient outcomes among at-risk individuals.

Data availability

The data that support the findings of this study are from the Utah Population Database (UPDB) and the Cache County Study on Memory, Health and Aging (CCS) and, for privacy reasons, are not publicly available. Requests for use of these data must be made directly to the National Institute on Aging/CCS principal investigators or the UPDB.

Abbreviations

AD:

Alzheimer’s Disease

ADRD:

Alzheimer’s Disease Related Dementias

AUC:

Area under the curve

CCS:

Cache County Study on Memory in Aging

CCW:

Chronic Conditions Data Warehouse

DL:

Deep Learning

EHR:

Electronic Health Record

GBC:

Gradient Boosting Classifier

ICD:

International Classification of Diseases

MLP:

Multi-layer perceptron

RF:

Random Forest

UPDB:

Utah Population Database

XGB:

XGBoost

References

  1. Alzheimer’s Association. 2023 Alzheimer’s Disease Facts and Figures. https://www.alz.org/media/Documents/alzheimers-facts-and-figures.pdf. Accessed September 18th, 2024.

  2. Bradford A, Kunik ME, Schulz P, Williams SP, Singh H. Missed and delayed diagnosis of dementia in primary care: prevalence and contributing factors. Alzheimer Dis Assoc Disord. 2009;23(4):306.


  3. Schliep KC, Ju S, Foster NL, et al. How good are medical and death records for identifying dementia? Alzheimers Dement. 2021 Dec 7. https://doi.org/10.1002/alz.12526.

  4. Wilkinson T, Ly A, Schnier C, et al. Identifying dementia cases with routinely collected health data: a systematic review. Alzheimers Dement. 2018 Aug;14(8):1038–51. https://doi.org/10.1016/j.jalz.2018.02.016.


  5. Barnes DE, Zhou J, Walker RL, et al. Development and validation of eRADAR: a tool using EHR data to detect unrecognized dementia. J Am Geriatr Soc. 2020 Jan;68(1):103–11. https://doi.org/10.1111/jgs.16182.


  6. VandeVrede L, Rabinovici GD. Blood-based biomarkers for Alzheimer disease: ready for primary care? JAMA Neurol. 2024 Jul 28. https://doi.org/10.1001/jamaneurol.2024.2801.

  7. Palmqvist S, Tideman P, Mattsson-Carlgren N, et al. Blood biomarkers to detect Alzheimer disease in primary care and secondary care. JAMA. 2024 Jul 28. https://doi.org/10.1001/jama.2024.13855.

  8. Javeed A, Dallora AL, Berglund JS, Ali A, Ali L, Anderberg P. Machine learning for dementia prediction: a systematic review and future research directions. J Med Syst. 2023 Feb 1;47(1):17. https://doi.org/10.1007/s10916-023-01906-7.


  9. Dallora AL, Minku L, Mendes E, Rennemark M, Anderberg P, Sanmartin Berglund J. Multifactorial 10-Year prior diagnosis prediction model of Dementia. Int J Environ Res Public Health. 2020;17(18):6674.


  10. Ford E, Sheppard J, Oliver S, Rooney P, Banerjee S, Cassell JA. Automated detection of patients with dementia whose symptoms have been identified in primary care but have no formal diagnosis: a retrospective case-control study using electronic primary care records. BMJ Open. 2021 Jan 22;11(1):e039248. https://doi.org/10.1136/bmjopen-2020-039248.


  11. Li Q, Yang X, Xu J, et al. Early prediction of Alzheimer’s disease and related dementias using real-world electronic health records. Alzheimers Dement. 2023 Feb 23. https://doi.org/10.1002/alz.12967.

  12. Ben Miled Z, Haas K, Black CM, et al. Predicting dementia with routine care EMR data. Artif Intell Med. 2020 Jan;102:101771. https://doi.org/10.1016/j.artmed.2019.101771.


  13. Park JH, Cho HE, Kim JH, et al. Machine learning prediction of incidence of Alzheimer’s disease using large-scale administrative health data. NPJ Digit Med. 2020;3:46. https://doi.org/10.1038/s41746-020-0256-0.


  14. Shao Y, Zeng QT, Chen KK, Shutes-David A, Thielke SM, Tsuang DW. Detection of probable dementia cases in undiagnosed patients using structured and unstructured electronic health records. BMC Med Inform Decis Mak. 2019 Jul 9;19(1):128. https://doi.org/10.1186/s12911-019-0846-4.


  15. Tang AS, Oskotsky T, Havaldar S, et al. Deep phenotyping of Alzheimer’s disease leveraging electronic medical records identifies sex-specific clinical associations. Nat Commun. 2022 Feb 3;13(1):675. https://doi.org/10.1038/s41467-022-28273-0.


  16. Nori VS, Hane CA, Sun Y, Crown WH, Bleicher PA. Deep neural network models for identifying incident dementia using claims and EHR datasets. PLoS ONE. 2020;15(9):e0236400.


  17. Xu J, Wang F, Xu Z, et al. Data-driven discovery of probable Alzheimer’s disease and related dementia subphenotypes using electronic health records. Learn Health Syst. 2020 Oct;4(4):e10246. https://doi.org/10.1002/lrh2.10246.


  18. Jammeh EA, Carroll CB, Pearson SW, et al. Machine-learning based identification of undiagnosed dementia in primary care: a feasibility study. BJGP Open. 2018 Jul;2(2):bjgpopen18X101589. https://doi.org/10.3399/bjgpopen18X101589.


  19. Uspenskaya-Cadoz O, Alamuri C, Wang L, et al. Machine learning algorithm helps identify Non-diagnosed Prodromal Alzheimer’s Disease patients in the General Population. J Prev Alzheimers Dis. 2019;6(3):185–91. https://doi.org/10.14283/jpad.2019.10.


  20. Fukunishi H, Nishiyama M, Luo Y, Kubo M, Kobayashi Y. Alzheimer-type dementia prediction by sparse logistic regression using claim data. Comput Methods Programs Biomed. 2020 Nov;196:105582. https://doi.org/10.1016/j.cmpb.2020.105582.


  21. Tschanz JT, Norton MC, Zandi PP, Lyketsos CG. The Cache County Study on Memory in Aging: factors affecting risk of Alzheimer’s disease and its progression after onset. Int Rev Psychiatry. 2013;25(6):673–85. https://doi.org/10.3109/09540261.2013.849663.


  22. Breitner JC, Wyse BW, Anthony JC, et al. APOE-epsilon4 count predicts age when prevalence of AD increases, then declines: the Cache County Study. Neurology. 1999 Jul 22;53(2):321–31. https://doi.org/10.1212/wnl.53.2.321.


  23. McKhann G, Drachman D, Folstein M, Katzman R, Price D, Stadlan EM. Clinical diagnosis of Alzheimer’s disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer’s Disease. Neurology. 1984 Jul;34(7):939–44.


  24. Hayden KM, Warren LH, Pieper CF, et al. Identification of VaD and AD prodromes: the Cache County Study. Alzheimers Dement. 2005 Jul;1(1):19–29. https://doi.org/10.1016/j.jalz.2005.06.002.


  25. Khachaturian AS, Gallo JJ, Breitner JC. Performance characteristics of a two-stage dementia screen in a population sample. J Clin Epidemiol. 2000 May;53(5):531–40. https://doi.org/10.1016/s0895-4356(99)00196-1.


  26. Smith KR, Fraser A, Reed DL, et al. The Utah Population Database: a model for linking medical and genealogical records for population health research. Hist Life Course Stud. 2022;12:58–77.


  27. Biswas A, Saran I, Wilson FP. Introduction to supervised machine learning. Kidney360. 2021 May 27;2(5):878–80. https://doi.org/10.34067/KID.0000182021.

  28. Vabalas A, Gowen E, Poliakoff E, Casson AJ. Machine learning algorithm validation with a limited sample size. PLoS ONE. 2019;14(11):e0224365.


  29. Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics. 2006;7(1):91.


  30. Ostbye T, Taylor DH Jr, Clipp EC, Van Scoyoc L, Plassman BL. Identification of dementia: agreement among national survey data, Medicare claims, and death certificates. Health Serv Res. 2008 Feb;43(1 Pt 1):313–26. https://doi.org/10.1111/j.1475-6773.2007.00748.x.


Acknowledgements

We thank the Pedigree and Population Resource of the Huntsman Cancer Institute, University of Utah (funded in part by the Huntsman Cancer Foundation) for its role in the ongoing collection, maintenance and support of the Utah Population Database (UPDB). We also acknowledge partial support for the UPDB through grant P30 CA2014 from the National Cancer Institute, University of Utah and from the University of Utah’s Program in Personalized Health and Center for Clinical and Translational Science.

Funding

This work was supported by the University of Utah Center on Aging Pilot Grant Program and the Department of Family and Preventive Medicine Health Studies Fund. Research was also supported by National Institute of Aging (NIA) grants: “Hypertensive Disorders of Pregnancy and Subsequent Risk of Vascular Dementia, Alzheimer’s Disease, or Related Dementia: A Retrospective Cohort Study Taking into Account Mid-Life Mediating Factors” (Project 1K01AG058781-01A1; PI: Karen Schliep) and “Early Life Conditions, Survival, and Health: A Pedigree-Based Population Study” (Project: R01AG022095; PI: Ken Smith) and an NCRR grant, “Sharing Statewide Health Data for Genetic Research” (R01 RR021746, G. Mineau, PI) with additional support from the Utah State Department of Health and the University of Utah. The National Institute on Aging grants AG-11380, AG-18712, and AG-031272 supported the Cache County Study on Memory in Aging.

Author information


Contributions

KCS, JTT, KRS and SA conceived, planned, and carried out the study; JT and SA conducted the analyses; KCS and JT took the lead in writing the manuscript. All authors contributed to the interpretation of the results, provided critical feedback, and helped shape the research, analysis, and manuscript.

Corresponding author

Correspondence to Karen C. Schliep.

Ethics declarations

Ethics approval and consent to participate

Study approvals were obtained from the Resource for Genetic and Epidemiologic Research, a special review panel authorizing access to the Utah Population Database (UPDB), and the University of Utah Institutional Review Board (IRB # 116984). The IRB determined this study exempt from Human Subjects Research because it involves data for which subjects cannot be identified. The Cache County Study was approved by the Institutional Review Board. All participants gave written, informed consent to participate.

Consent for publication

The IRB determined this study exempt from Human Subjects Research because it involves the study of data for which subjects cannot be identified.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material


Supplementary Material 1

Supplementary Material 2

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


Cite this article

Schliep, K.C., Thornhill, J., Tschanz, J.T. et al. Predicting the onset of Alzheimer’s disease and related dementia using electronic health records: findings from the cache county study on memory in aging (1995–2008). BMC Med Inform Decis Mak 24, 316 (2024). https://doi.org/10.1186/s12911-024-02728-4
