Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

Machine Learning Approaches for Stroke Risk Prediction: Findings from the Suita Study

Version 1 : Received: 14 May 2024 / Approved: 14 May 2024 / Online: 14 May 2024 (13:25:32 CEST)

How to cite: Vu, T.; Kokubo, Y.; Inoue, M.; Yamamoto, M.; Mohsen, A.; Martin-Morales, A.; Inoué, T.; Dawadi, R.; Araki, M. Machine Learning Approaches for Stroke Risk Prediction: Findings from the Suita Study. Preprints 2024, 2024050975. https://doi.org/10.20944/preprints202405.0975.v1

Abstract

Stroke is a significant public health concern because of its impact on mortality and morbidity. This study investigates the utility of machine learning algorithms for predicting stroke and identifying key risk factors, using data from the Suita Study comprising 7,389 participants and 53 variables. First, unsupervised K-prototype clustering grouped participants into risk clusters. Five supervised models, namely Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosted Machine (Light-GBM), were then used to predict stroke outcomes. Stroke incidence differed substantially among the risk clusters identified by K-prototype clustering. Among the supervised models, RF was the preferred option because it achieved higher performance metrics. The Shapley Additive Explanations (SHAP) method identified age, systolic blood pressure, hypertension, estimated glomerular filtration rate, metabolic syndrome, and blood glucose level as key predictors of stroke, consistent with the findings from the unsupervised clustering approach in high-risk groups. Additionally, previously unidentified risk factors such as elbow joint thickness, fructosamine, hemoglobin, and calcium level showed potential for stroke prediction. In conclusion, machine learning facilitated accurate stroke risk prediction and highlighted potential biomarkers, offering a data-driven framework for risk assessment and biomarker discovery.
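The abstract does not say how the K-prototype clustering was implemented, so the following is only a minimal sketch of the standard K-prototypes dissimilarity (squared Euclidean distance on numeric variables plus a weighted count of categorical mismatches) and the resulting cluster assignment. The function names, the weight `gamma`, the two prototypes, and the participant values are all hypothetical and are not taken from the Suita data.

```python
def kprototype_distance(num_a, num_b, cat_a, cat_b, gamma):
    """Mixed-type dissimilarity used by K-prototypes.

    Numeric part: squared Euclidean distance.
    Categorical part: number of mismatching categories, weighted by gamma.
    """
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    mismatches = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric + gamma * mismatches


def assign_cluster(num, cat, prototypes, gamma):
    """Return the index of the prototype closest to one participant."""
    distances = [
        kprototype_distance(num, p_num, cat, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return min(range(len(distances)), key=distances.__getitem__)


# Hypothetical prototypes: standardized (age, systolic blood pressure)
# and categorical (hypertension, metabolic syndrome) values.
prototypes = [
    ((-0.5, -0.4), ("no", "no")),    # illustrative lower-risk prototype
    ((0.9, 1.1), ("yes", "yes")),    # illustrative higher-risk prototype
]

# Hypothetical participant, not a real Suita Study record.
participant_num = (0.7, 0.8)
participant_cat = ("yes", "no")

cluster = assign_cluster(participant_num, participant_cat, prototypes, gamma=0.5)
print(cluster)
```

In a full K-prototypes run, this assignment step alternates with updating each prototype (means for numeric variables, modes for categorical ones) until assignments stop changing. The choice of `gamma` controls how much the categorical variables influence the clusters relative to the numeric ones.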

Keywords

Stroke; supervised machine learning; unsupervised machine learning; Logistic Regression; Random Forest; Support Vector Machine (SVM); eXtreme Gradient Boosting (XGBoost); Light Gradient Boosted Machine (Light-GBM); K-prototype clustering; Shapley Additive Explanations (SHAP)

Subject

Public Health and Healthcare, Other
