Version 1
Received: 26 September 2024 / Approved: 26 September 2024 / Online: 26 September 2024 (14:45:33 CEST)
How to cite:
Onah, E.; Eze, U. J.; Abdulraheem, A. S.; Ezigbo, U. G.; Amorha, K. C. Optimizing Unsupervised Feature Engineering and Predictive Models for Thyroid Cancer Recurrence Prediction. Preprints 2024, 2024092121. https://doi.org/10.20944/preprints202409.2121.v1
APA Style
Onah, E., Eze, U. J., Abdulraheem, A. S., Ezigbo, U. G., & Amorha, K. C. (2024). Optimizing Unsupervised Feature Engineering and Predictive Models for Thyroid Cancer Recurrence Prediction. Preprints. https://doi.org/10.20944/preprints202409.2121.v1
Chicago/Turabian Style
Onah, E., Ugochukwu Gabriel Ezigbo and Kosisochi Chinwendu Amorha. 2024. "Optimizing Unsupervised Feature Engineering and Predictive Models for Thyroid Cancer Recurrence Prediction." Preprints. https://doi.org/10.20944/preprints202409.2121.v1
Abstract
Background/Objectives: Thyroid cancer, particularly well-differentiated thyroid cancer, is one of the most prevalent endocrine malignancies, and its incidence is rising. Although it generally has a favorable prognosis, recurrence is common. Accurate prediction of recurrence is crucial for optimizing treatment plans and improving patient outcomes. This study aimed to advance the state of the art in thyroid cancer recurrence prediction by refining feature engineering techniques and exploring a diverse range of machine learning algorithms and an artificial neural network, using the differentiated thyroid cancer dataset from the UCI Machine Learning Repository. Methods: Various unsupervised data engineering methods, such as dimensionality reduction and clustering, were employed to enhance feature quality and mitigate noise, using stratified 10-fold cross-validation. The best-performing dimensionality reduction techniques were used to build classification pipelines, each pairing one of several machine learning models or an artificial neural network with a reduction step. The performance of these classification pipelines was assessed using metrics sensitive to class imbalance. Results: Principal Component Analysis and Truncated Singular Value Decomposition achieved superior clustering performance and moderate variance in their first principal components. Logistic Regression, Random Forest, Support Vector Machine, K-Nearest Neighbors, and Feedforward Neural Network models all achieved high performance. The Logistic Regression pipelines achieved balanced accuracy, F1 score, AUC, sensitivity, specificity, and precision all greater than 0.93 on the test set, with slightly lower values in 10-fold cross-validation. Gradient Boosting classification pipelines performed worst, though their metrics were still respectable. Conclusions: This study shows that employing feature engineering techniques such as Principal Component Analysis or Truncated Singular Value Decomposition in Logistic Regression, Random Forest, Feedforward Neural Network, Support Vector Machine, and K-Nearest Neighbors classification pipelines can improve the accuracy and reliability of thyroid cancer recurrence prediction, supporting more personalized treatment strategies in post-treatment patients.
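The abstract evaluates its pipelines with metrics that stay informative when one class (here, recurrence) is much rarer than the other. As a rough illustration only, and not the authors' code, the sketch below computes several of the named metrics (sensitivity, specificity, precision, F1 score, balanced accuracy) from binary labels using the Python standard library. The function names, the `ConfusionCounts` class, and the toy labels are all hypothetical and do not come from the study or its dataset. AUC is omitted because it needs predicted scores rather than hard labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts; label 1 stands for 'recurrence' (assumed)."""
    tp: int  # recurrences predicted as recurrence
    fp: int  # non-recurrences predicted as recurrence
    tn: int  # non-recurrences predicted as non-recurrence
    fn: int  # recurrences predicted as non-recurrence


def count_outcomes(y_true, y_pred):
    """Tally confusion counts from paired 0/1 true labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    return ConfusionCounts(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
    )


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def imbalance_aware_metrics(c):
    """Compute plain accuracy plus metrics that account for class imbalance."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)  # recall on the positive class
    specificity = _ratio(c.tn, c.tn + c.fp)  # recall on the negative class
    precision = _ratio(c.tp, c.tp + c.fp)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


if __name__ == "__main__":
    # Toy, made-up labels: 3 positives out of 10 samples.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
    for name, value in imbalance_aware_metrics(count_outcomes(y_true, y_pred)).items():
        print(f"{name}: {value:.3f}")
```

On the toy labels, plain accuracy comes out higher than balanced accuracy because the majority class dominates the accuracy count. That gap is the reason imbalance-sensitive metrics are typically reported alongside, or instead of, raw accuracy.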
Keywords
Well-differentiated thyroid cancer; Recurrence prediction; Unsupervised Data Engineering Methods; Dimensionality Reduction Techniques; Machine learning; Clustering; Principal Component Analysis; Truncated Singular Value Decomposition; Logistic Regression
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.