Development of transient ischemic attack risk prediction model suitable for initializing a learning health system unit using electronic medical records

Wen, Jian; Zhang, Tianmei; Ye, Shangrong; Li, Cheng; Han, Ruobing; Huang, Ran; Shen, Bairong; Chen, Anjun; Li, Qinghua

doi:10.1186/s12911-024-02767-x

Research
Open access
Published: 18 December 2024

Jian Wen¹,
Tianmei Zhang¹,
Shangrong Ye¹,
Cheng Li¹,
Ruobing Han¹,
Ran Huang²,
Bairong Shen²,
Anjun Chen ORCID: orcid.org/0000-0003-4209-8301³ &
…
Qinghua Li ORCID: orcid.org/0000-0002-4547-8513¹

BMC Medical Informatics and Decision Making volume 24, Article number: 392 (2024)

Abstract

Background

Patients with transient ischemic attack (TIA) face a significantly increased risk of stroke. However, TIA screening and early detection rates are low, especially in developing countries. This study aims to develop an inclusive and practical TIA risk prediction model using machine learning (ML) that performs well in both hospital and resource-limited clinic settings. This model is essential for initiating the first ML-enabled learning health system (LHS) unit designed for routine and equitable TIA screening and early detection across broad populations.

Methods

Employing a novel protocol, this study first standardized data from a hospital’s electronic medical records (EMR) to construct inclusive TIA risk prediction ML models using a data-centric approach. Subsequently, a quantitative distribution of TIA risk factors was applied in feature engineering to reduce the number of variables for a practical ML model. This refined model initiated a TIA ML-LHS unit that is capable of continuously updating with new EMR data from hospitals and clinics. Additionally, the practical model underwent external validation using data from another hospital.

Results

The inclusive 150-variable ML models, derived from all available EMR variables for TIA, achieved a recall of 0.868 and an accuracy of 0.886 in predicting TIA risk. Further feature engineering produced a practical XGBoost model with 20 variables, maintaining acceptable performance of 0.855 recall and 0.796 accuracy. The initialized TIA ML-LHS unit, based on the practical model, achieved performance metrics of 0.830 recall, 0.726 precision, 0.816 ROC-AUC, and 0.812 accuracy. The model also performed well in external validation, confirming its effectiveness with patient data from different clinical settings.

Conclusions

This study developed the first inclusive and practical TIA XGBoost model from full hospital EMR data and initiated the first TIA risk prediction ML-LHS unit. This TIA model, which requires only 20 variables, enables the ML-LHS to serve not only patients in hospitals but also those in resource-limited clinics. These results have significant implications for expanding risk-based TIA screening in community and rural clinics, thereby enhancing early detection of TIA among underserved populations and improving health equity. The novel protocol used in this study is also applicable for initiating ML-LHS units for various preventable diseases, providing a new system-level approach to responsible AI development and applications.

Background

Annually, approximately 15 million people worldwide suffer from a stroke [1, 2]. The risk of stroke significantly increases after a transient ischemic attack (TIA) [3,4,5]. Studies from the USA and Australia have observed a downward trend in stroke over the last two decades, suggesting that effective management of TIA can reduce the risk of a major stroke [3, 6]. However, in developing countries, public awareness and diagnosis of TIA are still low [7]. There is a pressing need for more effective approaches to promote early detection and management of TIA to prevent both TIA and subsequent stroke [8, 9]. A recent systematic review highlighted that the risk of subsequent stroke among patients evaluated in TIA clinics was not higher than that among those hospitalized, suggesting that equipping clinics to predict TIA risk could be a key element in stroke prevention [10].

Numerous machine learning (ML) studies have focused on building stroke risk prediction models [11,12,13,14]. Some ML models, alongside the widely accepted ABCD/ABCD2 statistical risk models, have been developed to predict stroke in TIA patients [15,16,17,18,19]. Conversely, only a few ML studies have emphasized predicting the initial onset of TIA. For instance, one study leveraged clinical notes to build a model for predicting TIA-like presentations, achieving an ROC-AUC (area under the curve of receiver operating characteristics) of 0.819 [20]. Other risk score methodologies have also shown potential. Stanciu et al. devised a multinomial classification model that boasted 79.5% accuracy for patients visiting emergency rooms [21]. Dutta employed 17 predictors in a statistical model, achieving an AUC of 0.91 [22]. Lasserson et al. confirmed that the Dawson score was more effective in diagnosing TIA in specialist assessments compared to primary care assessments [23].

Despite the availability of these risk prediction models, global TIA screening rates remain disappointingly low. This indicates that revolutionary, system-level changes in clinical delivery may be required. The vision and frameworks for learning health systems (LHS) proposed by the US National Academy of Medicine (NAM) present a hopeful strategy [24, 25]. For instance, New York University hospital has demonstrated quality improvement in an LHS through rapid learning cycles [26]. The VA hospital has applied the LHS concept to improve the quality of TIA care [27]. However, these examples of LHS have not yet integrated machine learning and artificial intelligence (AI).

We propose a new ML-enabled LHS unit approach to improve the TIA screening rate. This unit requires an inclusive and practical ML model for TIA risk prediction, which can be applied to patients in the communities and rural areas surrounding a central hospital, thereby expanding the screening population. Such an ML model can be developed in two stages, as demonstrated in our earlier study for a nasopharyngeal cancer risk prediction model [28]. Stage 1 involves using all health factors from patients with the target disease in the hospital’s electronic medical records (EMR) to create an inclusive model. Stage 2 involves generating a quantitative distribution of risk factors for the disease from the same EMR dataset using the new patient graph connection delta ratio (CDR) analysis method [29], and then applying the risk factors to reduce the number of variables in the model to a level that is obtainable in small clinics.

Our current study aims to use our hospital EMR to develop an inclusive and practical TIA risk prediction ML model suitable for initializing an ML-LHS unit, designed to enable risk-based TIA screening in community and rural clinics. Our findings could have significant implications for increasing TIA screening and early detection rates among underserved populations.

Methods

Standardized data collection from EMR for ML study

The Institutional Review Board (IRB) of Guilin Medical University Affiliated Hospital (GLMUH) approved this study involving EMR patient data (QTLL202139). The hospital’s informatics department provided a secured data server for this project, granting access to the de-identified patient records between January 2018 and June 2021 from the hospital’s EMR, which covered approximately 1 million patients with any disease and 7 million encounters (see Fig. 1). Personal identifiers such as patient names, birthdates, contact information, and addresses were removed, and patient identifiers were replaced with random numbers. In compliance with the hospital’s policy, our research team received training in patient data security and privacy before accessing the data.

To identify TIA patient records, we queried the dataset using Chinese synonyms for TIA, yielding a total of 737 patients aged 30 or older with a clear physician diagnosis of TIA. Patient records, including inpatient and outpatient visits, diagnoses, lab tests, and procedures were imported into a custom data collection tool on the secured server. Lab test data were automatically collected and stored in a MongoDB database provided by MongoDB Inc. (Palo Alto, CA, USA). Other data types, such as diseases, symptoms, medical history, observations, procedures, medications, treatments, and risk factors, were manually collected. To ensure consistency, data collection rules were established, and synonyms were automatically converted to local “standard terms,” under which data were cataloged. For every TIA patient, a Patient Diagnosis Journey (PDJ) data profile was created, covering data from one or multiple encounters leading up to the final TIA diagnosis. After data collection, PDJ profile data were exported as CSV files for subsequent analysis, selecting only the most recent data from each health factor within the same PDJ.
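The export rule of keeping only the most recent value for each health factor within a PDJ can be sketched as follows. This is a minimal illustration and not the authors' collection tool: the record layout (`patient_id`, `factor`, `date`, `value`), the sample values, and the function name `latest_per_factor` are all assumptions made for the example.

```python
from datetime import date


def latest_per_factor(records):
    """Keep, for each (patient, health factor) pair, the value with the latest date."""
    latest = {}
    for rec in records:
        key = (rec["patient_id"], rec["factor"])
        # Replace the stored record only when this one is more recent.
        if key not in latest or rec["date"] > latest[key]["date"]:
            latest[key] = rec
    return {key: rec["value"] for key, rec in latest.items()}


# Hypothetical PDJ records from two encounters of one patient.
records = [
    {"patient_id": 1, "factor": "blood_pressure", "date": date(2020, 3, 1), "value": "normal"},
    {"patient_id": 1, "factor": "blood_pressure", "date": date(2020, 9, 15), "value": "high"},
    {"patient_id": 1, "factor": "dizziness", "date": date(2020, 9, 15), "value": "true"},
]

profile = latest_per_factor(records)
```

Each patient then contributes one value per health factor, which is the shape needed for a one-row-per-patient ML table.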

For the negative samples used in machine learning, 1,448 non-TIA (background) patients aged 30 or older were randomly selected from the patient records. The data collection process for background patients mirrored that for TIA patients, including all encounters in their patient journey.

Machine learning for inclusive TIA risk prediction models

Because we used a data-centric EMR ML approach, we collected as many health factors as possible. The TIA PDJ profiles contained over 13,000 data items and 1,200 codes (variables), while the background patient profiles included over 49,000 data items and 2,800 codes. Lab test results in the EMR were categorized as normal/abnormal, true/false, positive/negative, high/medium/low, up/down/normal, etc. Continuous numeric data were converted directly into categorical data. The guiding principle was to use fewer value categories when finer binning was not expected to improve prediction accuracy. For example, categories included age: <30, 30–50, 50–70, >70 years; drinking frequency: 0–2, >2 drinks per day; smoking habits: 0, 1–20, >20 cigarettes per day; symptom duration: 0–30 min, >30 min. From the patient profiles, an ML-ready table was generated, consisting of 2,185 patient rows and 636 code columns, with TIA patients assigned the label “1” and background patients given the label “0”. Because the ML dataset table was highly sparse, many variables occurred in only a few patients. To build models, codes were sorted by the number of linked TIA patients in descending order, and we selected variables occurring in a minimum of 10 TIA patients for different variable sets. The number of variables in each set varied from 10 to 175. All TIA-related procedure codes, medication codes, and treatment codes were excluded for the purpose of studying disease risk prediction.
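The coarse binning and the frequency-based variable selection described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the function names `bin_age` and `select_variables` are hypothetical, the toy patient data are invented, and because the text lists overlapping boundaries (30–50, 50–70), the example assumes that boundary ages fall into the lower bin.

```python
from collections import Counter


def bin_age(age):
    """Map age in years to the coarse categories listed in the text.

    Assumption: boundary values (50, 70) are placed in the lower bin.
    """
    if age < 30:
        return "<30"
    if age <= 50:
        return "30-50"
    if age <= 70:
        return "50-70"
    return ">70"


def select_variables(patient_codes, labels, min_tia_patients=10):
    """Return codes present in at least `min_tia_patients` TIA patients,
    sorted by TIA-patient count in descending order (ties broken by name)."""
    counts = Counter()
    for codes, label in zip(patient_codes, labels):
        if label == 1:  # count only TIA (positive) patients
            counts.update(set(codes))
    kept = [(code, n) for code, n in counts.items() if n >= min_tia_patients]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [code for code, _ in kept]


# Toy data: four TIA patients (label 1) and one background patient (label 0).
patient_codes = [
    {"hypertension", "dizziness"},
    {"hypertension"},
    {"hypertension", "dizziness"},
    {"rare_code"},
    {"dizziness"},
]
labels = [1, 1, 1, 1, 0]

# A threshold of 2 is used here only because the toy dataset is tiny.
selected = select_variables(patient_codes, labels, min_tia_patients=2)
```

Taking the first 10, 20, and so on up to 175 entries of such a sorted list would give nested variable sets of the kind compared in the study.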

The XGBoost Python library was downloaded according to the instructions at xgboost.readthedocs.io and used to build XGBoost base models with default settings [30]. The library supports parallel tree boosting and effectively handles datasets with missing data. The scikit-learn Python library from scikit-learn.org performed all other ML tasks [31]. The free Jupyter Notebook tool was used for conducting ML experiments, and the pandas library was employed to read and write CSV files and manipulate data tables.

For the ML model development, the dataset was divided into 60% (1,311 patients) for training, 20% (437 patients) for tuning, and 20% (437 patients) for testing [32]. The XGBoost classifier was trained using the default hyperparameters with the training and tuning subsets, then validated on the testing dataset. Performance metrics such as recall, precision, ROC-AUC, and accuracy were calculated by calling scikit-learn functions to assess the efficacy of the risk prediction model. For algorithmic comparison, XGBoost was compared with Random Forest (RF), Support Vector Machines (SVM), and K-Nearest Neighbors (KNN). The scikit-learn classifiers for these algorithms were also trained using the default hyperparameters and the identical variable set.
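The subset sizes quoted above follow directly from a 60/20/20 partition of the 2,185 patients. The sketch below reproduces that arithmetic with a simple shuffled split. It is an assumption-laden illustration: the text does not say whether the split was stratified or which random seed was used, and the function name `split_60_20_20` is hypothetical.

```python
import random


def split_60_20_20(n_patients, seed=0):
    """Shuffle patient indices and cut them into training, tuning, and testing parts."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_train = n_patients * 60 // 100
    n_tune = n_patients * 20 // 100
    train = indices[:n_train]
    tune = indices[n_train:n_train + n_tune]
    test = indices[n_train + n_tune:]  # the remainder goes to testing
    return train, tune, test


train, tune, test = split_60_20_20(2185)
print(len(train), len(tune), len(test))  # 1311 437 437
```

The three index lists are disjoint and together cover every patient, so no record is used in more than one subset.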

Feature engineering for practical ML models

Because the inclusive ML model may require a large number of variables, which may not be available in small clinics that have limited resources, it is necessary to further reduce the number of variables. In our previous study, we applied the CDR method to generate a quantitative distribution of TIA risk factors from patient graphs, which comprised an equal number of TIA and non-TIA (background) patients [33]. Based on this distribution, we grouped variables in the inclusive model, with each group represented by a single factor in the resulting practical XGBoost ML model. The XGBoost algorithm was utilized to examine and select features critical for developing the desired practical model.

To develop the best-performing models in the process of building and operating ML-LHS for TIA screening, we first adopted a data-centric approach, followed by a model-centric approach. This ML-LHS strategy marks a key difference from conventional or standard model development in the initial stages. The goal of this study was to develop a high-performing, inclusive, and practical model suitable for ML-LHS deployment using only default hyperparameters, without the aid of hyperparameter fine-tuning. By design, our ML-LHS approach postpones the model fine-tuning step to later phases of the LHS process, after the desired large inclusive dataset can be gradually collected within the deployed ML-LHS.

Initialization of a TIA ML-LHS unit

A practical TIA XGBoost model, characterized by a minimal number of variables and high predictive performance, was selected to initialize a TIA ML-LHS unit. For the first internal validation during the initialization phase, new data related to the model’s 20 variables were collected from 148 TIA patients visiting GLMUH between July 2021 and February 2023. The updated dataset, combining both new and previous data, included a total of 885 TIA patients. An updated XGBoost model was reconstructed using the updated dataset to verify the stability of the model’s performance.

External validation of the practical TIA model

The practical XGBoost model underwent external validation using patient data from another hospital. We selected 80 TIA patients (47 men and 33 women) and 50 random background patients (25 men and 25 women) from EMR between January 2020 and October 2021. Data corresponding to the model’s 20 variables were extracted from these records. After processing the data similarly to the original ML model, it was input into the practical XGBoost model for risk prediction. The prediction results were analyzed using IBM SPSS Statistics software version 22.0 (IBM Corp, Armonk, NY, USA) and the confusion matrix.

Results

Protocol design for developing ML models to initialize the TIA ML-LHS unit

To enhance inclusive TIA screening and early detection rates, we designed an ML-LHS unit for providing TIA risk prediction to patients served in a central hospital and its affiliated community and rural clinics. As illustrated in Fig. 2, this unit requires the ML model to be initially trained with patient data from the hospital EMR, representing a diverse population, using a data-centric approach. This model, referred to as the “inclusive model”, is then refined to use fewer than 30 variables while maintaining a recall rate of 80% or better, thus forming what we call a “practical model.” This model serves as the initial basis for initializing the ML-LHS unit for TIA screening. In the initialization phase, the model’s performance is internally validated with new patient data from the hospital and externally validated by different hospitals. Following successful internal and external validations, the TIA ML-LHS unit is technically prepared for deployment in a hospital-led initiative, providing risk prediction in LHS embedded clinical research across a clinical research network (CRN) including resource-limited clinics. Because small clinics may only have data available for a small number of variables, the practical ML model is crucial for ML-LHS to be effectively deployed and used in resource-limited clinics to improve the ethical use of ML/AI.
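The two design criteria for a “practical model” (fewer than 30 variables and a recall of 80% or better) can be written as a simple check. This is only a sketch: the function name `is_practical` and its parameter names are hypothetical, while the example values are the variable counts and recalls reported in this article for the inclusive and pv20 models.

```python
def is_practical(n_variables, recall, max_variables=30, min_recall=0.80):
    """Return True when a model meets both practical-model criteria."""
    return n_variables < max_variables and recall >= min_recall


# Values reported in this article: (number of variables, recall).
reported_models = {
    "inclusive 150-variable model": (150, 0.868),
    "practical pv20 model": (20, 0.855),
    "updated pv20 model": (20, 0.830),
}

results = {name: is_practical(n, r) for name, (n, r) in reported_models.items()}
```

Under these thresholds the inclusive model fails only on variable count, whereas both pv20 models satisfy both criteria.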

Effect of the number of variables in TIA risk prediction

The basic characteristics of the TIA patients and non-TIA patients are shown in Table 1. There were slightly more male (58.5%) than female (41.5%) patients in the TIA patient group. In this data-centric ML approach, we collected health factors (or variables) from patient EMR records without preselecting certain ones, resulting in an ML-ready dataset comprising over 600 variables. We then tested various subsets of these variables from a sorted variable list to assess their impact on the performance of the XGBoost ML base model in predicting TIA risk. Table 2 shows that the key performance metrics for predicting TIA risk plateaued at approximately 0.880 ROC-AUC when the XGBoost model included 150 or more variables (see Fig. 3). The models incorporated only variables potentially contributing to disease risk, such as demographics, diseases, medical histories, symptoms, observations, lab tests and other risk factors.

Table 1 Baseline characteristics of TIA patients and background patients

Table 2 Comparison of performance using different numbers of variables in XGBoost TIA risk prediction models

Comparison of different algorithms for TIA risk prediction

The XGBoost algorithm was compared to three other common ML algorithms: RF, SVM, and KNN. These comparisons used the default settings and the same 150 variables from categories like symptoms, diseases, lab tests, observations, demographics, and medical histories. As Table 3 shows, the 150-variable RF and SVM base models performed similarly to the XGBoost base model. Given the importance of recall in preventive screening applications, and considering the longer running time required by SVM, XGBoost was chosen to create the inclusive model for TIA risk prediction (Fig. 2, step 1). This model achieved a recall of 0.868, a precision of 0.815, an ROC-AUC of 0.882, and an accuracy of 0.886. Figure 4a and b display the ROC curve and the reliability curve of the XGBoost model, respectively.

Table 3 Comparison of different ML algorithms for TIA risk prediction models at various development stages

Practical ML models for TIA risk prediction

To deploy a TIA risk prediction model in community and rural settings, the number of variables in the model was reduced to 20 (referred to as “pv20”, see Table 4). This reduction was accomplished through ML feature engineering, using the quantitative distribution of TIA risk factors derived from the EMR data (see Fig. 2, step 2). The performances of different algorithms using the pv20 variable set are compared in Table 3. The pv20 XGBoost base model proved to be the most effective, demonstrating a recall of 0.855, a precision of 0.660, an ROC-AUC of 0.810, and an accuracy of 0.796. It was closely followed by the RF and SVM base models. Figure 4c and d illustrate the ROC curve and reliability curve of the pv20 XGBoost base model, respectively.

Table 4 List of the 20 practical variables (pv20) used in the practical TIA risk prediction models

Initialization of a TIA ML-LHS unit with the practical model

The pv20 XGBoost model was selected to initialize a TIA ML-LHS unit designed for continuously improving the model for risk prediction and providing risk prediction for TIA screening. For the initial internal validation (see Fig. 2, step 3), the ML dataset was updated with data from 148 new TIA patients at GLMUH, resulting in a new pv20 XGBoost base model. This new model achieved performance metrics of 0.830 recall, 0.726 precision, 0.816 ROC-AUC, and 0.812 accuracy (Table 3; Fig. 4e and f), confirming its suitability for initializing the TIA ML-LHS unit.

External validation of the practical TIA model

For external validation (as depicted in Fig. 2, step 4), the practical pv20 XGBoost base model was tested on data from 80 TIA patients and 50 background patients from another hospital. The test results are shown in the confusion matrix (Table 5). The model yielded the following performance metrics: a recall (or sensitivity) of 0.838, a precision of 0.848, a specificity of 0.760, an AUC of 0.799, and an accuracy of 0.808, indicating that the model performed at a similar level on patient data from a different clinical setting. Considering the results of both internal and external validations, the practical TIA model met the criteria for initializing the first TIA ML-LHS unit for screening, which is expected to perform as intended when the TIA ML model is deployed and used in different hospitals and clinics.
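The relationship between a confusion matrix and these metrics can be made concrete with a short calculation. The counts used below (67 true positives, 13 false negatives, 38 true negatives, 12 false positives) are not taken from Table 5, which is not reproduced here. They are an inferred assumption: the only integer counts consistent with the group sizes stated in the Methods (80 TIA and 50 background patients) and the reported recall and specificity. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)        # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Mean of sensitivity and specificity (balanced accuracy).
    balanced = (recall + specificity) / 2
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "balanced_accuracy": balanced,
    }


# Inferred counts (assumption): 80 TIA patients and 50 background patients.
metrics = binary_metrics(tp=67, fn=13, tn=38, fp=12)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, recall, precision, specificity, and accuracy agree with the reported 0.838, 0.848, 0.760, and 0.808 to within rounding. The mean of sensitivity and specificity also matches the reported AUC of 0.799; this mean equals the ROC-AUC obtained when hard 0/1 predictions rather than probabilities are scored, although the article does not state how the AUC was computed.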

Table 5 Confusion matrix for the external validation of the practical TIA model and the patient group demographics

Discussion

Utilizing a two-stage ML protocol, this study initially constructed a high-performance TIA risk prediction model using the XGBoost algorithm and 150 variables from hospital EMR data. Subsequently, a quantitative distribution of TIA risk factors, derived directly from the EMR data, was employed to develop a more practical 20-variable XGBoost model. Following external validation, this model was used to initialize a TIA ML-LHS unit. Leveraging new patient data, the ML-LHS unit produced the first updated XGBoost model, which achieved performance metrics of 0.830 recall, 0.726 precision, 0.816 ROC-AUC, and 0.812 accuracy, confirming its alignment with the ML-LHS design objectives.

The TIA ML-LHS unit is technically poised for the next phase of LHS development, namely deploying the ML-LHS unit via a CRN. This deployment aims to provide TIA risk prediction to patients and recommend high-risk patients for screening and early detection in community and rural clinical environments. The new TIA ML-LHS unit could be pivotal in broadening risk-based TIA screening across diverse populations, thereby enhancing early TIA detection, especially among underserved groups.

The significance of the TIA ML-LHS unit is underscored by the limited availability of physician resources in community and rural clinics, highlighting the need for augmentation with ML-based AI tools. The unit is designed to assist clinicians in assessing TIA risk in resource-limited settings to improve equitable screening. We have previously demonstrated the feasibility of establishing the first system component – a CRN led by a central hospital in collaboration with a large number of rural clinics [34]. The current study has developed the second system component, a practical XGBoost model for TIA risk prediction, achieving over 80% recall and accuracy. Our future research direction is to implement and operate the TIA ML-LHS unit across the CRN, aiding community and rural physicians in early TIA prediction. As the LHS embeds ML research into routine clinical workflows, the CRN will accumulate more patient data, continually enhancing the TIA ML model’s performance. Due to its scalability, the TIA ML-LHS unit’s success in various underserved areas is feasible, representing a significant step towards reducing global health disparities in TIA and stroke care.

The innovation of this study is threefold:

(1) It designed and demonstrated a novel approach for initializing a tangible system-level solution, termed “ML-LHS unit for predictive screening.” This addresses the difficult issue of low adoption rates in preventive screening. The solution mandates that the risk prediction ML models be both inclusive and practical, as they are intended for use in primary care settings in communities and rural areas. It achieves this by integrating data-centric EMR ML and feature engineering with the quantitative distribution of risk factors. The solution’s design of LHS across CRN ensures that responsible ML/AI is developed from and used for the patient populations served.
(2) The resulting initial ML-LHS unit for TIA risk prediction is the first of its kind in developing system-level or structural solutions necessary for sustainable, large-scale TIA screening in both urban and rural areas. Unlike a few statistical TIA risk models using pre-selected variables and one TIA ML model using clinical notes [20,21,22], the TIA risk model developed in this study is the first ML risk model built from full EMR patient data without preselection of variables, which is significant progress. Designed for predicting TIA risk in the general population, this TIA model is also very different from the common ABCD models for predicting stroke risk in patients who have experienced a TIA.
(3) This study’s protocol for developing ML models and initializing ML-LHS units can be adapted for all preventable diseases that are suitable for predictive screening in community and rural clinics. This ML process validates model performance and reliability with new patient data at multiple steps, which is expected to produce more reliable models than common methods such as cross-validation.

The study also had limitations. Firstly, potential bias in the resulting ML models could arise from missing EMR data [35]. Our ML pipeline prioritized health factors with minimal missing data and favored robust algorithms like XGBoost for handling incomplete data. Secondly, identifying and rectifying data bias within EMRs was challenging, potentially compromising the reliability of ML models [36]. Thirdly, the lack of structured data in EMR required significant efforts in data standardization to prevent loss of usable data for ML. Fourthly, condensing the number of variables might risk overfitting in practical ML models built from a small dataset. However, this challenge is expected to diminish as the patient data volume in the ML-LHS unit grows.

Conclusions

In conclusion, this study developed the first inclusive and practical TIA XGBoost model from EMR data to initialize the first ML-LHS unit for TIA risk prediction. The results enable future studies to deploy this TIA ML-LHS unit over clinical research networks, empowering community and rural clinics to provide risk-based TIA screening for enhancing TIA early detection and health equity for marginalized communities. The study demonstrated a novel system-level approach to responsible ML/AI development applicable to building ML-LHS units for various preventable diseases.

Data availability

Patient datasets from the current study are not available due to patient privacy concerns. Data without privacy implications is available from the corresponding author upon reasonable request.

Abbreviations

TIA: Transient ischemic attack
ML: Machine learning
LHS: Learning health system
ML-LHS: ML-enabled LHS
EMR: Electronic medical records
ROC-AUC: Area under the curve of receiver operating characteristics
AI: Artificial intelligence
CDR: Graph connection delta ratio
GLMUH: Guilin Medical University Affiliated Hospital
PDJ: Patient diagnosis journey
RF: Random Forest
SVM: Support Vector Machines
KNN: K-Nearest Neighbors
CRN: Clinical research network

References

GBD 2019 Stroke Collaborators. Global, regional, and national burden of stroke and its risk factors, 1990–2019: a systematic analysis for the global burden of Disease Study 2019. Lancet Neurol. 2021;20(10):795–820. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/S1474-4422(21)00252-0.
Article Google Scholar
Tsao CW, et al. Heart Disease and Stroke Statistics—2022 update: a Report from the American Heart Associationexternal icon. Circulation. 2022;145(8):e153–639.
Article PubMed Google Scholar
Lioutas V, et al. Incidence of transient ischemic attack and Association with Long-Term risk of stroke. JAMA. 2021;325(4):373–81. https://doiorg.publicaciones.saludcastillayleon.es/10.1001/jama.2020.25071.
Article PubMed PubMed Central Google Scholar
Amarenco P, Steering Committee and Investigators of the TIAregistry.org Project. Five-year risk of stroke after TIA or minor ischemic stroke. N Engl J Med. 2018;379(16):1580–1. https://doiorg.publicaciones.saludcastillayleon.es/10.1056/NEJMc1808913.
Article PubMed Google Scholar
Kleindorfer D, et al. Incidence and short-term prognosis of transient ischemic attack in a population-based study. Stroke. 2005;36:720–3. https://doiorg.publicaciones.saludcastillayleon.es/10.1161/01.STR.0000158917.59233.b7.
Article PubMed Google Scholar
Sundararajan, V, et al. Trends over time in the risk of stroke after an incident transient ischemic attack. Stroke. 2014;45(11):3214–8. https://doiorg.publicaciones.saludcastillayleon.es/10.1161/STROKEAHA.114.006575.
Wang Y, et al. Prevalence, knowledge, and treatment of transient ischemic attacks in China. Neurology. 2015;84(23):2354–61. https://doiorg.publicaciones.saludcastillayleon.es/10.1212/WNL.0000000000001665.
Article PubMed PubMed Central Google Scholar
Lambert CM, Olulana O, Bailey-Davis L, Abedi V, Zand R. Lessons learned preventing recurrent ischemic strokes through Secondary Prevention Programs: a systematic review. J Clin Med. 2021;10(18):4209. https://doiorg.publicaciones.saludcastillayleon.es/10.3390/jcm10184209.
Article PubMed PubMed Central Google Scholar
Giles MF, Rothwell PM. Transient ischaemic attack: clinical relevance, risk prediction and urgency of secondary prevention. Curr Opin Neurol. 2009;22(1):46–53. https://doiorg.publicaciones.saludcastillayleon.es/10.1097/WCO.0b013e32831f1977.
Article PubMed Google Scholar
Shahjouei S, et al. Risk of subsequent stroke among patients receiving outpatient vs Inpatient Care for transient ischemic attack: a systematic review and Meta-analysis. JAMA Netw Open. 2022;5(1):e2136644. https://doiorg.publicaciones.saludcastillayleon.es/10.1001/jamanetworkopen.2021.36644.
Article PubMed PubMed Central Google Scholar
Lip GYH, et al. Improving stroke risk prediction in the General Population: a comparative Assessment of Common Clinical rules, a New Multimorbid Index, and machine-learning-based algorithms. Thromb Haemost. 2022;122(1):142–50. https://doiorg.publicaciones.saludcastillayleon.es/10.1055/a-1467-2993.
Article PubMed Google Scholar
Hung CY, Lin CH, Lan TH, Peng GS, Lee CC. Development of an intelligent decision support system for ischemic stroke risk assessment in a population-based electronic health record database. PLoS ONE. 2019;14(3):e0213007.
Article CAS PubMed PubMed Central Google Scholar
Abedi V, et al. Novel Screening Tool for Stroke using Artificial neural network. Stroke. 2017;48(6):1678–81. https://doiorg.publicaciones.saludcastillayleon.es/10.1161/STROKEAHA.117.017033.
Article PubMed Google Scholar
Weng SF, Reps J, Kai J, Garibaldi JM, Qureshi N. Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS ONE. 2017;12(4):e0174944. https://doiorg.publicaciones.saludcastillayleon.es/10.1371/journal.pone.0174944.
Article CAS PubMed PubMed Central Google Scholar
Perry JJ, et al. Prospective validation of Canadian TIA score and comparison with ABCD2 and ABCD2i for subsequent stroke risk after transient ischaemic attack: multicentre prospective cohort study. BMJ. 2021;372:n49. https://doiorg.publicaciones.saludcastillayleon.es/10.1136/bmj.n49.
Article PubMed PubMed Central Google Scholar
Chaudhary D, et al. Clinical risk score for Predicting Recurrence following a cerebral ischemic event. Front Neurol. 2019;10. https://doiorg.publicaciones.saludcastillayleon.es/10.3389/fneur.2019.01106.
Wardlaw JM, et al. ABCD2 score and secondary stroke prevention: meta-analysis and effect per 1,000 patients triaged. Neurology. 2015;85(4):373–80. https://doiorg.publicaciones.saludcastillayleon.es/10.1212/WNL.0000000000001780.
Article PubMed PubMed Central Google Scholar
Giles MF, Rothwell PM. Systematic review and pooled analysis of published and unpublished validations of the ABCD and ABCD2 transient ischemic attack risk scores. Stroke. 2010;41(4):667–73. https://doiorg.publicaciones.saludcastillayleon.es/10.1161/STROKEAHA.109.571174.
Johnston SC, et al. Validation and refinement of scores to predict very early stroke risk after transient ischaemic attack. Lancet. 2007;369(9558):283–92.
Bacchi S, et al. Deep Learning Natural Language Processing successfully predicts the Cerebrovascular cause of transient ischemic attack-like presentations. Stroke. 2019;50(3):758–60. https://doiorg.publicaciones.saludcastillayleon.es/10.1161/STROKEAHA.118.024124.
Stanciu A, et al. A predictive analytics model for differentiating between transient ischemic attacks (TIA) and its mimics. BMC Med Inf Decis Mak. 2020;20:112. https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12911-020-01154-6.
Dutta D. Diagnosis of TIA (DOT) score–design and validation of a new clinical diagnostic tool for transient ischaemic attack. BMC Neurol. 2016;16:20. https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12883-016-0535-1.
Lasserson DS, Mant D, Hobbs FD, Rothwell PM. Validation of a TIA recognition tool in primary and secondary care: implications for generalizability. Int J Stroke. 2015;10(5):692–6. https://doiorg.publicaciones.saludcastillayleon.es/10.1111/ijs.12201.
Institute of Medicine. The Learning Healthcare System: Workshop Summary. Washington, DC: National Academies; 2007. https://doiorg.publicaciones.saludcastillayleon.es/10.17226/11903.
Institute of Medicine. Digital Infrastructure for the Learning Health System. The Foundation for Continuous Improvement in Health and Health Care: Workshop Series Summary. Washington, DC: National Academies; 2011. https://doiorg.publicaciones.saludcastillayleon.es/10.17226/12912.
Horwitz LI, Kuznetsova M, Jones SA. Creating a Learning Health System through Rapid-Cycle, Randomized Testing. N Engl J Med. 2019;381(12):1175–9. https://doiorg.publicaciones.saludcastillayleon.es/10.1056/NEJMsb1900856.
Bravata DM, et al. Assessment of the protocol-guided Rapid evaluation of Veterans Experiencing New transient neurological symptoms (PREVENT) program for improving quality of care for transient ischemic attack: a Nonrandomized Cluster Trial. JAMA Netw Open. 2020;3(9):e2015920. https://doiorg.publicaciones.saludcastillayleon.es/10.1001/jamanetworkopen.2020.15920.
Chen A, Lu R, Han R, et al. Building practical risk prediction models for nasopharyngeal carcinoma screening with patient graph analysis and machine learning. Cancer Epidemiol Biomarkers Prev. 2023;32(2):274–80. https://doiorg.publicaciones.saludcastillayleon.es/10.1158/1055-9965.EPI-22-0792.
Chen A. A novel graph methodology for analyzing disease risk factor distribution using synthetic patient data. Healthc Analytics. 2022;2:100084. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.health.2022.100084.
Chen T, Guestrin C. XGBoost: a scalable Tree Boosting System. KDD ’16: Proc 22nd ACM SIGKDD Int Conf Knowl Discovery Data Min. 2016;785–794. https://doiorg.publicaciones.saludcastillayleon.es/10.1145/2939672.2939785.
Pedregosa F, et al. Scikit-learn: machine learning in Python. JMLR. 2011;12:2825–30.
Liu Y, Chen PC, Krause J, Peng L. How to read Articles that Use Machine Learning: users’ guides to the Medical Literature. JAMA. 2019;322(18):1806–16. https://doiorg.publicaciones.saludcastillayleon.es/10.1001/jama.2019.16489.
Wen J, Zhang T, Ye S, et al. Quantitative patient graph analysis for transient ischemic attack risk factor distribution based on electronic medical records. Heliyon. 2023;10(1):e22766. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.heliyon.2023.e22766.
Chen A, et al. Feasibility study for implementation of the AI-powered internet + primary care model (AiPCM) across hospitals and clinics in Gongcheng County, Guangxi, China. Lancet. 2019;394(Supplement 1):S44. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/S0140-6736(19)32380-3.
Cesare N, Were LPO. A multi-step approach to managing missing data in time and patient variant electronic health records. BMC Res Notes. 2022;15:64. https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13104-022-05911-w.
Verheij RA, Curcin V, Delaney BC, McGilchrist MM. Possible sources of Bias in Primary Care Electronic Health Record Data Use and Reuse. J Med Internet Res. 2018;20(5):e185. https://doiorg.publicaciones.saludcastillayleon.es/10.2196/jmir.9134.

Acknowledgements

None.

Funding

This work was supported by the Guilin Municipal Science and Technology Bureau (China) [grant number 20190219-2], the Guangxi Provincial Science and Technology Bureau (China) [grant numbers AB23026017 and AB24010167], and the Sichuan Provincial Science and Technology Bureau (China) [grant number 2020YFQ0019].

Author information

Authors and Affiliations

Department of Neurology, Guilin Medical University Affiliated Hospital, 15 Lequn Road, Guilin, Guangxi, 541000, China
Jian Wen, Tianmei Zhang, Shangrong Ye, Cheng Li, Ruobing Han & Qinghua Li
West China Hospital, 2222 Xingchuan Road, Chengdu, Sichuan, 610212, China
Ran Huang & Bairong Shen
Health System Sciences, ELHS Institute, Palo Alto, CA, 94306, USA
Anjun Chen

Contributions

QL and BS obtained the funding. JW directed and supervised the study. TZ conceptualized the clinical study, collected data, collaborated with clinics, and reviewed the manuscript. SY collected data, collaborated with clinics, and reviewed the manuscript. CL collected and analyzed data. RBH collected data. RH developed software and models and analyzed data. AC designed the methods and wrote the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Jian Wen, Anjun Chen or Qinghua Li.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the Institutional Review Board of Guilin Medical University Affiliated Hospital (QTLL202139). Informed consent was waived for this retrospective EMR study by the Institutional Review Board of Guilin Medical University Affiliated Hospital.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Wen, J., Zhang, T., Ye, S. et al. Development of transient ischemic attack risk prediction model suitable for initializing a learning health system unit using electronic medical records. BMC Med Inform Decis Mak 24, 392 (2024). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12911-024-02767-x

Received: 15 January 2024
Accepted: 14 November 2024
Published: 18 December 2024
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12911-024-02767-x
