Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

An Unsupervised Error Detection Methodology for Detecting Mislabels in Healthcare Analytics

Version 1 : Received: 30 June 2024 / Approved: 4 July 2024 / Online: 4 July 2024 (14:15:25 CEST)

How to cite: Zhou, P.-Y.; Lum, F.; Wang, T. J.; Dan, C.; Lee, S.; Wong, A. K. An Unsupervised Error Detection Methodology for Detecting Mislabels in Healthcare Analytics. Preprints 2024, 2024070425. https://doi.org/10.20944/preprints202407.0425.v1

Abstract

Medical datasets are often imbalanced and contain errors arising from subjective test results and clinical variability. Poor data quality reduces the accuracy and reliability of classification, so detecting abnormal samples in a dataset can help clinicians make better decisions. In this study, we propose an unsupervised error detection method that uses patterns discovered by the Pattern Discovery and Disentanglement (PDD) model developed in our earlier work. Applied to a large dataset, the eICU Collaborative Research Database, for sepsis risk assessment, the proposed algorithm effectively discovers statistically significant association patterns, generates an interpretable knowledge base, clusters samples in an unsupervised manner, and detects abnormal samples in the dataset. In our experiments, the method outperformed K-Means by 38% on the full dataset and 47% on the reduced dataset for unsupervised clustering. After abnormal samples were removed by the proposed error detection approach, the accuracy of multiple supervised classifiers improved by an average of 4%. The proposed algorithm therefore provides a robust and practical solution for unsupervised clustering and error detection in healthcare data.
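
The general idea of flagging possible mislabels from an unsupervised grouping can be illustrated with a minimal sketch. This is not the PDD algorithm: it assumes cluster assignments are already available from some clustering step, and it uses a simple majority-label rule as a stand-in for the pattern-based detection described above. The function name `flag_suspect_labels`, the toy cluster IDs, and the label strings are all hypothetical.

```python
from collections import Counter, defaultdict


def flag_suspect_labels(cluster_ids, labels):
    """Return indices of samples whose label differs from the
    majority label of their cluster (an illustrative rule only,
    not the PDD-based criterion)."""
    # Collect the recorded labels that fall into each cluster.
    labels_by_cluster = defaultdict(list)
    for cid, lab in zip(cluster_ids, labels):
        labels_by_cluster[cid].append(lab)

    # Majority label per cluster.
    majority = {
        cid: Counter(labs).most_common(1)[0][0]
        for cid, labs in labels_by_cluster.items()
    }

    # A sample is suspect when its label disagrees with its cluster's majority.
    return [
        i
        for i, (cid, lab) in enumerate(zip(cluster_ids, labels))
        if lab != majority[cid]
    ]


# Toy data (hypothetical): two clusters, one disagreeing label in each.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
labels = ["sepsis", "sepsis", "sepsis", "no_sepsis",
          "no_sepsis", "no_sepsis", "sepsis", "no_sepsis"]

suspects = flag_suspect_labels(cluster_ids, labels)

# Drop the flagged samples before training a downstream classifier.
suspect_set = set(suspects)
cleaned = [
    (cid, lab)
    for i, (cid, lab) in enumerate(zip(cluster_ids, labels))
    if i not in suspect_set
]
```

In a workflow like the one summarized above, the cleaned set would then be passed to the supervised classifiers, and their accuracy compared with and without the flagged samples.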

Keywords

Unsupervised Learning; Error detection; Pattern Discovery and Disentanglement; Healthcare Data Analysis

Subject

Public Health and Healthcare, Public Health and Health Services
