Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

Application of Transcriptome-Based Gene Set Featurization for Machine Learning Model to Predict the Origin of Metastatic Cancer

Version 1 : Received: 29 May 2024 / Approved: 29 May 2024 / Online: 29 May 2024 (08:31:17 CEST)

A peer-reviewed article of this Preprint also exists.

Jeong, Y.; Chu, J.; Kang, J.; Baek, S.; Lee, J.-H.; Jung, D.-S.; Kim, W.-W.; Kim, Y.-R.; Kang, J.; Do, I.-G. Application of Transcriptome-Based Gene Set Featurization for Machine Learning Model to Predict the Origin of Metastatic Cancer. Curr. Issues Mol. Biol. 2024, 46, 7291-7302.

Abstract

Identifying the primary site of origin of metastatic cancer is vital for guiding treatment decisions, especially for patients with cancer of unknown primary (CUP). Despite advanced diagnostic techniques, the primary site in CUP remains difficult to pinpoint, and CUP is responsible for a considerable number of cancer-related fatalities. Understanding its origin is crucial for effective management and may improve patient outcomes. This study introduces a machine learning framework, ONCOfind-AI, that leverages transcriptome-based gene set features to enhance the accuracy of predicting the origin of metastatic cancers. By ensuring compatibility between RNA-sequencing and microarray data, we were able to construct a more comprehensive training dataset, and integrating data from the different platforms improved the accuracy of our machine learning models for predicting cancer origins. Our method was validated using external data from clinical samples collected through Kangbuk Samsung Medical Center and from the Gene Expression Omnibus. External validation yielded a top-1 accuracy ranging from 0.80 to 0.86 and a top-2 accuracy of 0.90. This study highlights that incorporating biological knowledge through curated gene sets can make gene expression data from different platforms compatible enough to merge, which supports more effective machine learning prediction models.
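To make the two ideas in the abstract concrete (gene set features and top-k accuracy), the following is a minimal sketch under stated assumptions. It is not the ONCOfind-AI implementation: the abstract does not specify the scoring method, so the within-sample rank averaging used here is an assumed stand-in, and the function names (`rank_normalize`, `gene_set_features`, `top_k_accuracy`), gene identifiers, and tissue labels are all hypothetical.

```python
from typing import Dict, List


def rank_normalize(expr: Dict[str, float]) -> Dict[str, float]:
    """Scale each gene's within-sample rank to [0, 1].

    Assumption: ranks are used because they do not depend on the
    platform-specific scale (RNA-seq counts vs. microarray intensities).
    Ties are broken by input order in this sketch.
    """
    genes = sorted(expr, key=lambda g: expr[g])
    n = len(genes)
    if n == 1:
        return {genes[0]: 0.5}
    return {g: i / (n - 1) for i, g in enumerate(genes)}


def gene_set_features(
    expr: Dict[str, float], gene_sets: Dict[str, List[str]]
) -> Dict[str, float]:
    """Summarize one sample as one score per gene set (mean scaled rank)."""
    ranks = rank_normalize(expr)
    features = {}
    for name, genes in gene_sets.items():
        # Genes missing from this platform are skipped, not imputed.
        present = [ranks[g] for g in genes if g in ranks]
        features[name] = sum(present) / len(present) if present else 0.0
    return features


def top_k_accuracy(
    scores: List[Dict[str, float]], labels: List[str], k: int
) -> float:
    """Fraction of samples whose true label is among the k highest-scored classes."""
    hits = 0
    for sample_scores, true_label in zip(scores, labels):
        top = sorted(sample_scores, key=sample_scores.get, reverse=True)[:k]
        hits += true_label in top
    return hits / len(labels)


# Hypothetical toy data, for illustration only.
sample = {"A": 1.0, "B": 5.0, "C": 3.0, "D": 9.0}
sets = {"set1": ["A", "B"], "set2": ["C", "D", "Z"]}
features = gene_set_features(sample, sets)

predictions = [
    {"lung": 0.7, "colon": 0.2, "breast": 0.1},
    {"lung": 0.5, "colon": 0.4, "breast": 0.1},
    {"lung": 0.1, "colon": 0.3, "breast": 0.6},
]
truth = ["lung", "colon", "lung"]
top1 = top_k_accuracy(predictions, truth, k=1)
top2 = top_k_accuracy(predictions, truth, k=2)
```

Because each feature is computed from ranks within a single sample, the same gene set definitions can be applied to samples from either platform, which is the kind of compatibility the abstract describes. Top-2 accuracy counts a prediction as correct when the true origin is either of the two highest-scored classes, so it is never lower than top-1 accuracy.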

Keywords

Cancer of Unknown Primary; Metastatic Cancer; Machine Learning; Gene Expression; Transcriptome

Subject

Medicine and Pharmacology, Oncology and Oncogenics
