Preprint Article Version 1 This version is not peer-reviewed

Cervical Cancer Prediction based on Imbalanced Data using Machine Learning Algorithms

Version 1 : Received: 13 September 2024 / Approved: 14 September 2024 / Online: 14 September 2024 (08:33:58 CEST)

How to cite: Muraru, M. M.; Simó, Z.; Iantovics, L. B. Cervical Cancer Prediction based on Imbalanced Data using Machine Learning Algorithms. Preprints 2024, 2024091118. https://doi.org/10.20944/preprints202409.1118.v1

Abstract

Cervical cancer affects a large part of the female population, which makes the prediction of this disease using Machine Learning (ML) of utmost importance. ML algorithms can be integrated into complex intelligent agent-based systems that offer decision support to resident medical doctors or even to experienced medical doctors. One example is the situation in which an experienced medical doctor diagnoses a case but needs expert support from another medical specialty. Data imbalance is frequent in healthcare data and negatively affects predictions made with ML algorithms. Cancer data in general, and cervical cancer data in particular, are frequently imbalanced. For this reason, studying the impact of data imbalance on diverse state-of-the-art ML prediction algorithms is important. This research is also motivated by the fact that many studies present experimental evaluations of algorithms without characterizing the data to which they were applied. Such characterizations could give other researchers a clear indication of the applicability of the algorithms to their specific data. Specifically, if the data have the respective characteristics, then performance evaluation results similar to those reported can be expected. For this study we chose a messy real-life cervical cancer dataset, available in a recognized data repository, that includes a large number of missing and noisy values. To identify the best imbalance-handling technique for this medical dataset, we compare the performance of eleven important resampling methods combined with the following state-of-the-art ML models, which are frequently applied in healthcare prediction research: K-Nearest Neighbours (with k values of 2 and 3), binary Logistic Regression, and Random Forest.
The studied resampling methods include seven undersampling methods, namely Condensed Nearest Neighbour, Tomek Links, Edited Nearest Neighbours, Repeated Edited Nearest Neighbours, All K-Nearest Neighbours, NearMiss, Neighbourhood Cleaning Rule, and Instance Hardness Threshold, and four oversampling methods, namely Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN), Support Vector Machine SMOTE, and Borderline SMOTE. The imbalance ratio of the dataset is 12.73, and the corresponding confidence interval at the 95% confidence level lies between 9.23 and 16.22. The obtained results show that resampling methods help the learning models improve their ability to classify cervical cancer. The applied oversampling techniques generally showed better results than the undersampling methods. The balancing techniques had the greatest impact on the Logistic Regression classifier, while Random Forest showed promising performance even before balancing, and KNN2 was generally better than KNN3.
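As an illustration of two notions used in the abstract, the following minimal sketch computes an imbalance ratio and generates synthetic minority samples by SMOTE-style interpolation. It is not the authors' implementation. It assumes that the imbalance ratio is defined as the majority-class count divided by the minority-class count, and the function names `imbalance_ratio` and `smote_like`, the toy data, and the parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base`, excluding `base` itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic


# Hypothetical toy data: 12 majority points and 3 minority points.
majority = [(float(i), float(i % 4)) for i in range(12)]
minority = [(20.0, 20.0), (21.0, 20.5), (20.5, 21.5)]
labels = [0] * len(majority) + [1] * len(minority)

ratio_before = imbalance_ratio(labels)

# Oversample the minority class until both classes have the same size.
new_points = smote_like(minority, n_new=len(majority) - len(minority), k=2)
labels_after = labels + [1] * len(new_points)
ratio_after = imbalance_ratio(labels_after)

print(ratio_before, ratio_after)
```

In this toy setting the ratio falls from 4.0 to 1.0 once the synthetic points are added. In practice, resampling of this kind is normally applied to the training split only, so that the evaluation data keep the original class distribution.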

Keywords

cervical cancer; cancer; artificial intelligence; sampling methods; unbalanced datasets; classification methods; prediction methods; K-Nearest Neighbours; Logistic Regression; Random Forest

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
