Preprint Article, Version 1. This version is not peer-reviewed.

Optimizing Unsupervised Feature Engineering and Predictive Models for Thyroid Cancer Recurrence Prediction

Version 1: Received: 26 September 2024 / Approved: 26 September 2024 / Online: 26 September 2024 (14:45:33 CEST)

How to cite: Onah, E.; Eze, U. J.; Abdulraheem, A. S.; Ezigbo, U. G.; Amorha, K. C. Optimizing Unsupervised Feature Engineering and Predictive Models for Thyroid Cancer Recurrence Prediction. Preprints 2024, 2024092121. https://doi.org/10.20944/preprints202409.2121.v1

Abstract

Background/Objectives: Thyroid cancer, particularly well-differentiated thyroid cancer, is one of the most prevalent endocrine malignancies, and its incidence is rising. Although it generally has a favorable prognosis, recurrence is common. Accurate prediction of recurrence is crucial for optimizing treatment plans and improving patient outcomes. This study aimed to advance the state of the art in thyroid cancer recurrence prediction by refining feature engineering techniques and exploring a diverse ensemble of machine learning algorithms and an artificial neural network, using the differentiated thyroid cancer dataset from the UCI Machine Learning Repository; Methods: Various unsupervised data engineering methods, such as dimensionality reduction and clustering, were employed with stratified 10-fold cross-validation to enhance feature quality and mitigate noise. The best-performing dimensionality reduction techniques were then used to build classification pipelines, each combining one of these techniques with one of several machine learning models or an artificial neural network. The performance of these classification pipelines was assessed using metrics sensitive to class imbalance; Results: Principal Component Analysis and Truncated Singular Value Decomposition achieved superior clustering performance and moderate variance in their first principal components. Logistic Regression, Random Forest, Support Vector Machine, K-Nearest Neighbors, and Feedforward Neural Network models all achieved high performance, with Logistic Regression pipelines reaching balanced accuracy, F1 score, AUC, sensitivity, specificity, and precision all greater than 0.93 on the test set and slightly lower values in 10-fold cross-validation. Gradient Boosting classification pipelines performed worst, though still with respectable metrics; Conclusions: This study shows that employing feature engineering techniques such as Principal Component Analysis or Truncated Singular Value Decomposition in Logistic Regression, Random Forest, Feedforward Neural Network, Support Vector Machine, and K-Nearest Neighbors classification pipelines can improve the accuracy and reliability of thyroid cancer recurrence prediction, supporting more personalized treatment strategies in post-treatment patients.
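
The abstract reports metrics that are sensitive to class imbalance (balanced accuracy, F1 score, sensitivity, specificity, and precision). As a minimal sketch of how these are derived from a binary confusion matrix, and not the authors' implementation, the Python below uses only the standard library. The names ConfusionCounts, confusion_counts, and imbalance_aware_metrics are hypothetical, the toy labels are invented for illustration and are not study data, and the convention of returning 0.0 for an undefined ratio is an assumption. AUC is left out because it needs predicted scores rather than hard labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts; label 1 stands for recurrence."""
    tp: int
    fp: int
    tn: int
    fn: int


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def _ratio(numerator, denominator):
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def imbalance_aware_metrics(c):
    """Derive the metrics named in the abstract from confusion counts."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)   # recall on the recurrence class
    specificity = _ratio(c.tn, c.tn + c.fp)   # recall on the non-recurrence class
    precision = _ratio(c.tp, c.tp + c.fp)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
    }


if __name__ == "__main__":
    # Toy, invented labels: 3 recurrences among 10 patients.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    always_negative = [0] * 10  # a classifier that never predicts recurrence
    for name, value in imbalance_aware_metrics(
        confusion_counts(y_true, always_negative)
    ).items():
        print(f"{name}: {value:.3f}")
```

On the toy labels, a classifier that never predicts recurrence still reaches a plain accuracy of 0.7, while its balanced accuracy is 0.5 and its sensitivity is 0. This gap is why metrics of this kind are preferred when recurrence is the minority class.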

Keywords

Well-differentiated thyroid cancer; Recurrence prediction; Unsupervised Data Engineering Methods; Dimensionality Reduction Techniques; Machine learning; Clustering; Principal Component Analysis; Truncated Singular Value Decomposition; Logistic Regression

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
