Method | Description | Advantages | Disadvantages |
---|---|---|---|
Logistic Regression | Models the probability of stroke from risk factors, fitting a sigmoid function by gradient descent; performance is evaluated with and without regularization (see the first sketch after the table) | High accuracy (above 95%), can be improved through regularization, easy to implement | Limited accuracy on nonlinear relationships; sensitive to parameter selection |
Decision Tree | Recursively splits the data by feature thresholds, producing a tree of interpretable decision rules (covered, along with the other classical methods, in the combined sketch after the table) | Visualizes decisions, easy to interpret | Prone to overfitting without pruning or regularization |
Random Forest | Aggregates the results of multiple decision trees, each trained on a different subset of the data, by voting, which reduces the risk of overfitting | High accuracy (96%) and robustness | Can be slow on large datasets |
Naive Bayes | Classifies under the assumption that features are independent | Simple and efficient in many classification tasks | Lower accuracy (82%); limited with complex relationships between features |
k-Nearest Neighbors (k-NN) | Classifies new observations by the majority class of their nearest neighbors in the training set | Simple and intuitive | Scales poorly to large datasets; sensitive to the choice of k |
Support Vector Machine (SVM) | Finds a maximum-margin decision boundary, using kernel functions to handle nonlinear distributions | Highly effective on high-dimensional data | Sensitive to parameter selection; struggles with large datasets |
Deep Learning | Applies convolutional neural networks (CNNs) to medical image analysis (see the CNN sketch after the table) | High accuracy in detecting complex patterns | Requires large datasets and substantial computational resources |
Artificial Neural Networks (ANN) | A neural-network pipeline that combines resampling, data-leakage avoidance, feature selection, and interpretability techniques (permutation importance, LIME) for stroke prediction (see the ANN sketch after the table) | Interpretable via LIME; effective resampling and feature selection; high prediction accuracy (95%) | Depends on external-dataset validation and ongoing tuning for better performance |
XGBoost | A gradient boosting algorithm that combines an ensemble of decision trees for stronger results (see the final sketch after the table) | High predictive performance and interpretability; accuracy above 97% | Hyperparameters are difficult to tune; computationally demanding |
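
For concreteness, the sketches below illustrate the listed methods. They are minimal, self-contained examples on synthetic data, not the implementations from the reviewed studies. The first sketch shows logistic regression trained by batch gradient descent on the sigmoid model, compared with and without L2 regularization; the synthetic risk-factor data and every hyperparameter are assumptions for illustration.

```python
# Minimal NumPy sketch: logistic regression via gradient descent,
# with optional L2 regularization. All values are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000, l2=0.0):
    """Gradient descent on the (optionally L2-regularized) log loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)                  # predicted stroke probability
        grad_w = X.T @ (p - y) / n + l2 * w     # dL/dw plus L2 penalty term
        grad_b = np.mean(p - y)                 # dL/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                   # stand-ins for risk factors
true_w = np.array([1.5, -2.0, 1.0, 0.0])
y = (sigmoid(X @ true_w) > rng.random(500)).astype(float)

for l2 in (0.0, 0.1):                           # without / with regularization
    w, b = fit_logistic(X, y, l2=l2)
    acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
    print(f"l2={l2}: train accuracy = {acc:.3f}")
```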
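The next sketch covers the classical methods from the table (decision tree, random forest, naive Bayes, k-NN, SVM) in one place, using scikit-learn with 5-fold cross-validation. The synthetic imbalanced dataset and the hyperparameters shown are assumptions, not settings from the cited works.

```python
# Compare the table's classical classifiers on a synthetic, imbalanced
# stand-in for a stroke dataset (about 10% positive class).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Naive Bayes":   GaussianNB(),
    # k-NN and SVM are distance/margin based, so their features are scaled.
    "k-NN (k=5)":    make_pipeline(StandardScaler(), KNeighborsClassifier(5)),
    "SVM (RBF)":     make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:15s} ROC-AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

ROC-AUC is used instead of raw accuracy because, on a 90/10 split like this one, a classifier that always predicts "no stroke" already scores 90% accuracy.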
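For the deep-learning row, a minimal CNN sketch, assuming Keras (TensorFlow) and 128x128 grayscale slices; the architecture, input shape, and random placeholder images are assumptions, not the networks used in the cited work.

```python
# Small CNN for binary classification of brain-scan slices (illustrative only).
import numpy as np
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),        # e.g. a grayscale CT/MRI slice
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # P(stroke)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Random placeholder images stand in for a real imaging dataset.
X = np.random.rand(32, 128, 128, 1).astype("float32")
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, batch_size=8, verbose=0)
```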
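For the ANN row, a sketch of two of the listed pipeline pieces: SMOTE resampling applied only to the training split (to avoid data leakage) and permutation importance for interpretability. It assumes the `imbalanced-learn` package; the row's LIME step (from the separate `lime` package) is omitted for brevity, and all hyperparameters are placeholders.

```python
# ANN pipeline sketch: leakage-safe scaling and resampling, MLP training,
# and permutation importance on the held-out split.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scaler = StandardScaler().fit(X_tr)           # fit on train only: no leakage
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # train only

mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                    random_state=0).fit(X_res, y_res)
print("test accuracy:", mlp.score(X_te, y_te))

# Permutation importance ranks features by how much shuffling each one
# degrades held-out performance.
imp = permutation_importance(mlp, X_te, y_te, n_repeats=10, random_state=0)
print("feature importances:", imp.importances_mean.round(3))
```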
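Finally, an XGBoost sketch of gradient boosting for binary classification; the hyperparameters shown are assumptions, not tuned values from the reviewed study.

```python
# Gradient-boosted decision trees with XGBoost on synthetic imbalanced data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = XGBClassifier(
    n_estimators=300,       # number of boosting rounds (trees)
    learning_rate=0.05,     # shrinkage applied to each tree's contribution
    max_depth=4,            # depth of the individual trees
    scale_pos_weight=9,     # rough imbalance correction (neg/pos ratio)
    eval_metric="logloss",
)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```

The sensitivity to hyperparameters noted in the table shows up here directly: `n_estimators`, `learning_rate`, and `max_depth` interact, and in practice they are chosen by cross-validated search rather than set by hand as above.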