1. Introduction
The lifetime risk of women developing endometrial cancer is approximately 3% and, over the past 30 years, the overall incidence has increased by 132%, reflecting an increase in risk factors (particularly obesity and aging) [1]. Endometrial cancer is classically classified into two groups, namely, Type I and Type II tumors [2,3]. Type I endometrial tumors are associated with excess estrogen, obesity, hormone receptor positivity, and hormone receptor abnormalities. On the other hand, Type II tumors, which are mainly serous, are often observed in older, non-obese women and are considered to have a worse prognosis [2,4].
In recent years, there has been a growing focus on the molecular classification of endometrial cancers. The classification of endometrial cancer proposed by The Cancer Genome Atlas (TCGA) in 2013 (a joint project of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI)) provided milestone data for the molecular classification of endometrial cancer [5]. The TCGA proposed a classification into four classes, POLE (ultramutated), MSI (hypermutated), copy-number low (endometrioid), and copy-number high (serous-like), based on next-generation sequencing (NGS) data obtained from 232 cases of endometrial cancer. Following this TCGA classification, Talhouk et al. developed and validated a modified molecular classification method called "ProMisE" [6,7]. This method replaces the detection of TP53 gene abnormalities and microsatellite status, which are dependent on sequencing, with immunohistochemical staining (IHC), making the molecular classification of endometrial cancers more clinically accessible. The ProMisE method classifies endometrial cancer into four molecular sub-types: POLEmut, Mismatch Repair Deficient (MMRd), p53abn, and NSMP (No Specific Molecular Profile, p53wt). These four classes correspond to the POLE, MSI, copy-number high, and copy-number low TCGA classes, respectively. In 2020, the World Health Organization (WHO) also recommended a molecular classification for endometrial cancer [8,9]. In 2023, molecular classification was also incorporated into the staging system of the International Federation of Gynecology and Obstetrics (FIGO) [10]. Therefore, the provision of a stable and easy method for the molecular profiling of endometrial cancer is anticipated to become clinically significant in the near future.
Based on these molecular classification results, one of the most crucial therapeutic agents to consider for classification-matched treatment is the immune checkpoint inhibitor (ICI). ICIs are being investigated and gaining interest for various types of tumors, including endometrial cancer [11]. Programmed death receptor-1 (PD-1) is an immune checkpoint molecule expressed on activated T-cells, with programmed cell death ligand 1 (PD-L1) being a representative ligand [12]. PD-1 and PD-L1 inhibitors accelerate cancer cell elimination, mainly mediated through cytotoxic T-cells. One of the accepted surrogate markers for the effectiveness of ICIs is deficient mismatch repair (dMMR) and the resulting microsatellite instability (MSI) [13]. According to a cross-organ analysis of solid tumors, endometrial cancer had the highest frequency of MSI, occurring in 17% of cases [14].
Commonly used methods for determining dMMR/MSI status are based on the polymerase chain reaction (PCR) [15,16], IHC for MMR proteins [17], and NGS [18,19]. The use of IHC for MMR status classification involves examining the expression of MutL homolog 1 (MLH1), MutS homolog 2 (MSH2), MutS homolog 6 (MSH6), and postmeiotic segregation increased 2 (PMS2) [17]. In endometrial cancer, the high detection rate of dMMR underscores the importance of immunologic profiles [14,20]; however, testing for detailed molecular profiles (including MMR status) in every endometrial cancer patient can be expensive and time-consuming, which could complicate the course of treatment. Therefore, it is necessary to develop alternative molecular classification methods in order to reduce the associated financial and time costs.
As a novel classification approach, we focused on machine learning. Deep learning, which emerged as a prominent sub-field of machine learning [21], has already been demonstrated to be useful for the classification of medical images in the field of healthcare. There have been many reports on the effective clinical use of deep learning approaches for image classification, such as magnetic resonance imaging (MRI) of the brain [22], retinal images [23], and computed tomography (CT) of the lungs [24], among others [25]. Unlike conventional machine learning methods, deep learning relies on deep neural networks, which mimic the operation of neurons in the human brain. Deep learning networks can automatically extract the significant features necessary for the corresponding learning tasks with minimal human effort [21].
Among the various deep learning algorithms, convolutional neural networks (CNNs) have been the most commonly used [26]. Each convolutional layer extracts different information and, by stacking multiple convolutional layers, the network can progressively extract more complex and abstract features. The activation functions between the convolutional layers enhance the network's ability to handle non-linear problems and adapt to different distributions [27]. The network may learn features that humans are not consciously aware of; however, many of these features can be challenging to articulate. Some reports [28,29,30] in the field of cancer have suggested that hematoxylin and eosin (H&E) staining can be used to predict genetic alterations and features of the tumor microenvironment without the need for further laboratory testing. In particular, the determination of microsatellite instability (MSI) status by deep learning analysis of H&E-stained slides has been described in several reports focused on gastric cancer [31] and colon cancer [32,33,34,35,36]. In endometrial cancer, Hong et al. [28] attempted a comprehensive assessment using deep learning for the detection of histological subtypes and genetic alterations, achieving an area under the receiver operating characteristic curve (AUROC) ranging from 0.613 to 0.969 depending on the assessment criteria. Additionally, Fremond et al. [30] similarly used deep learning approaches for decision making and visualized the histological features specific to each molecular class of endometrial cancer.
Thus, the use of artificial intelligence, particularly deep learning, for medical image analysis has been rapidly expanding [25,37]. We therefore considered the potential application of deep learning to address issues related to endometrial cancer. In this study, we examined the utility of CNNs and novel attention-based networks for the prediction of the MMR status of endometrial cancer.
2. Materials and Methods
2.1. Ethical Compliance
In accordance with the guidelines of the Declaration of Helsinki, informed consent was obtained through an opt-out form on the website of Sapporo Medical University. The Institutional Review Board of Sapporo Medical University Hospital granted approval for this study under permission number 332-158.
2.2. Patients and Specimens
For this study, formalin-fixed paraffin-embedded (FFPE) tumor samples from Sapporo Medical University Hospital were used. Surgical specimens were obtained from patients with a primary site of endometrial cancer. A pathologist and a gynecologic oncologist chose representative H&E-stained slides of endometrial cancer resection specimens. Of the 127 patients treated from 2005 to 2009 in total, we excluded 7 patients whose tumors proved to be non-endometrial cancers and 6 patients without a sufficient number of available tumor component tiles (Figure 1).
2.3. Immunohistochemistry Staining and Evaluation of MMR Status
FFPE tumor tissues were cut into 4 μm slices, and Target Retrieval Solution at pH 9 (DAKO, Glostrup, Denmark) was used for epitope retrieval. The tissues were then stained with a rabbit anti-MSH6 monoclonal antibody (clone EP51; DAKO) and a mouse anti-PMS2 monoclonal antibody (clone ES05; DAKO) to detect the MMR proteins. The slides then underwent incubation with a secondary antibody, followed by hematoxylin counterstaining, rinsing, alcohol dehydration, and cover-slipping with mounting medium. Two gynecologists and one pathologist evaluated the resulting IHC and MMR status. As previously reported, negative staining for MSH6 corresponds to a lack of the MSH2 and/or MSH6 proteins, as the stability of MSH6 depends on MSH2 [38]. In the same way, PMS2 staining covers the protein expression of PMS2 and/or MLH1. Therefore, if either PMS2 or MSH6 expression was deficient, the case was determined as dMMR; otherwise, it was determined as proficient MMR (pMMR) (Figure 2B). In total, 29 patients were classified as dMMR, while 85 patients were classified as pMMR (Figure 1).
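The two-antibody surrogate rule described above can be summarized in a small helper (the function name is our own; the paper describes the rule, not code):

```python
def mmr_status(msh6_retained: bool, pms2_retained: bool) -> str:
    """IHC rule from the text: MSH6 staining covers MSH2/MSH6 loss and
    PMS2 staining covers MLH1/PMS2 loss, so loss of either stain -> dMMR."""
    if msh6_retained and pms2_retained:
        return "pMMR"  # both stains retained: proficient mismatch repair
    return "dMMR"      # either stain deficient: deficient mismatch repair
```
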
2.4. Pre-Processing of Whole Slide Images
The H&E slides were digitized using a NanoZoomer whole slide scanner (Hamamatsu Photonics, Japan). Each whole slide image (WSI) was divided into non-overlapping square tiles of 942 μm per side at a magnification of 5×, each with dimensions of 512 × 512 pixels (Figure 2A). On average, each WSI was divided into 813 tiles, and processing the WSIs from the 120 cases of endometrial cancer resulted in the creation of 97,547 tiles.
We first constructed an image exclusion program, in which we conducted the following pre-processing steps using OpenCV: (i) excluding edge tiles with different numbers of pixels in height and width; and (ii) converting each tile to HSV format and binarizing it by treating pixels that fell within the specified pink color range, from (100, 50, 50) to (179, 255, 255), as white (255) and all other pixels as black (0). The program calculated the average value of the binarized pink color area and excluded the tile if this value was greater than 25 (i.e., if there was a large amount of pink color within one tile). In total, 38,699 tiles were excluded automatically using the constructed program. Furthermore, we manually excluded 53,451 tiles without a sufficient tumor component. The exclusion criteria were specified as follows: tiles in which more than 25% of the tile area consisted of non-tumor components (e.g., stroma); tiles containing irrelevant contaminants within the slide; tiles with folding due to poor tissue extension during sample preparation or air trapping; and tiles with artifacts introduced during scanning.
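The automatic exclusion steps can be sketched as follows. This is a minimal reconstruction, not the original program: the study used OpenCV, whereas here we substitute matplotlib's rgb_to_hsv (matplotlib is part of the study's software stack) and rescale its output to OpenCV-style units (H in 0-179, S and V in 0-255), so threshold behavior is only approximately equivalent.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

# HSV bounds from the text, in the OpenCV convention (H: 0-179, S/V: 0-255)
PINK_LOWER = np.array([100.0, 50.0, 50.0])
PINK_UPPER = np.array([179.0, 255.0, 255.0])

def exclude_tile(tile_rgb, threshold=25):
    """Return True if a tile should be excluded: (i) edge tiles whose height
    and width differ, or (ii) tiles whose binarized pink area is too large."""
    h, w = tile_rgb.shape[:2]
    if h != w:                                       # (i) edge tile
        return True
    hsv = rgb_to_hsv(np.asarray(tile_rgb, dtype=float) / 255.0)  # all in [0,1]
    hsv = hsv * np.array([180.0, 255.0, 255.0])      # rescale to OpenCV-like units
    in_range = np.all((hsv >= PINK_LOWER) & (hsv <= PINK_UPPER), axis=-1)
    mask = np.where(in_range, 255, 0)                # binarize: pink -> 255
    # (ii) exclude when the mean of the binarized tile exceeds the threshold
    return bool(mask.mean() > threshold)
```

With cv2 available, the binarization step corresponds to `cv2.inRange` on a `cv2.cvtColor(tile, cv2.COLOR_BGR2HSV)` image.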
Supplementary Figure 1A shows an overview of the tile exclusion process through the program and manual inspection. The total number of excluded tiles (Supplementary Figure 1B) amounted to 92,150, while the eligible tiles (Supplementary Figure 1C) amounted to 5,397, accounting for 5.5% of the total number of divided tiles. Supplementary Table 1 details the number of tiles and the characteristics of each patient.
2.5. Hardware and Software Libraries Used
The experiments were carried out with Python (version 3.8.10), making use of the following packages: torch (version 2.0.0), torchvision (0.15.1), numpy (version 1.24.1), scikit-learn (version 1.2.2), matplotlib (version 3.7.1), and timm (version 0.6.13). Model development and evaluation were performed on a workstation with GeForce RTX 3080 (NVIDIA, Santa Clara, CA) graphics processing units, a Ryzen Threadripper 3960X (24 cores, 3.8 GHz) central processing unit (Advanced Micro Devices, Santa Clara, CA), and 256 GB of memory.
2.6. Data Split and Training Data Preparation
The eligible tiles were divided into separate data sets for training, validation, and testing. For each prediction task, cases were randomly split such that tiles from the same patient were contained in only one of these sets (a patient-level split), ensuring that the test data set was independent of the training process. The split ratio for training:validation:testing was set at 70%:15%:15%.
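One way to implement such a patient-level 70/15/15 split is with scikit-learn's GroupShuffleSplit, grouping tiles by patient (the paper reports the ratios and the patient-level constraint, not the implementation; this is a sketch):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(n_tiles, patient_ids, seed=0):
    """Randomly split tile indices 70/15/15 so that all tiles from one
    patient land in exactly one of the training/validation/test sets."""
    idx = np.arange(n_tiles)
    patient_ids = np.asarray(patient_ids)
    # First split off 70% of patients for the training set
    outer = GroupShuffleSplit(n_splits=1, train_size=0.70, random_state=seed)
    train_idx, rest_idx = next(outer.split(idx, groups=patient_ids))
    # Split the remaining 30% of patients in half: 15% validation, 15% test
    inner = GroupShuffleSplit(n_splits=1, train_size=0.50, random_state=seed)
    val_rel, test_rel = next(inner.split(rest_idx, groups=patient_ids[rest_idx]))
    return train_idx, rest_idx[val_rel], rest_idx[test_rel]
```

Because GroupShuffleSplit splits at the group (patient) level, the tile-level proportions only approximate 70/15/15 when patients contribute unequal tile counts.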
2.7. Classification Model Construction Using Convolutional Neural Networks
Construction of the CNN-based binary classification model for MMR status (pMMR or dMMR) was conducted using pre-trained CNN models from torchvision in the PyTorch library, including GoogLeNet [39], VGG19 [40], ResNet50 [41], ResNet101 [41], WideResNet101-2 [42], and EfficientNet-B7 [43]. We constructed a model that takes a non-overlapping image tile of size 512 × 512 pixels at a resolution of 1.84 μm/pixel as input and outputs a tile-level probability for MMR status. We fine-tuned the pre-trained torchvision models using the prepared training data set and validated the results using the validation data set, following the provided instructions. The trainable parameters were fine-tuned using a stochastic gradient descent optimization method, and we examined the conditions for data pre-processing and the hyperparameters needed for model training. To address the imbalance in the number of tiles in each class, we down-sampled the larger pMMR class, randomly reducing cases to align with the smaller class in terms of slide numbers. The detailed results of the down-sampling process are presented in Supplementary Table 2. We also examined changes in model performance with data augmentation, using the following four patterns: (i) no data expansion (original tiles only); (ii) original tiles with added 90° and 270° rotations (resulting in three times the data); (iii) original tiles with added vertical and horizontal flips (resulting in four times the data); and (iv) original tiles with both the rotations in (ii) and the flips in (iii) (resulting in six times the data). Furthermore, we examined the hyperparameter conditions, provisionally using ResNet50 [41] as the validation network. We varied the batch size (8, 16, 32), the number of epochs (30, 60, 90, 120), and the learning rate (1e-2, 1e-3, 1e-4).
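The four augmentation patterns can be sketched as below. One assumption: since pattern (iii) is described as yielding four times the data, we take "flips" to include the combined vertical-plus-horizontal flip alongside the two single flips.

```python
import numpy as np

def augment_tiles(tile, pattern):
    """Generate the augmented set for one tile under patterns (i)-(iv)."""
    rotations = [np.rot90(tile, k=1), np.rot90(tile, k=3)]   # 90° and 270°
    flips = [np.flipud(tile), np.fliplr(tile),
             np.flipud(np.fliplr(tile))]                     # vertical, horizontal, both
    if pattern == 1:
        return [tile]                        # (i) original only
    if pattern == 2:
        return [tile] + rotations            # (ii) three times the data
    if pattern == 3:
        return [tile] + flips                # (iii) four times the data
    if pattern == 4:
        return [tile] + rotations + flips    # (iv) six times the data
    raise ValueError("pattern must be 1-4")
```

In a torchvision pipeline, the same variants would typically be produced with deterministic rotation and flip transforms applied offline, since the counts above assume every variant is materialized.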
2.8. Classification Model Construction Using Attention Networks and Our API-Net-Based Model
We also verified the performance differences between CNNs and attention-based networks, such as the Vision Transformer (ViT) [44]. We selected pre-trained ViT models from the torchvision models in the PyTorch library, as mentioned above, and chose the hyperparameters and data set in the same way as for the CNNs. We examined two ViT models in this study: ViT_b16 and ViT_b32. Additionally, we examined a modified network based on API-Net [45]. This modified network is a class-aware visualization and classification technique that employs attention mechanisms, which we previously developed for cytopathological classification and feature extraction. The API-Net-based model takes pairs of images as input and learns embeddings of the input features along with representative embeddings, called prototypes, for each MMR class. We used the existing API-Net to estimate attention vectors. Given an unknown image, the classification model predicts its class by comparing the image to the prototypes and recognizing their similarity.
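The prototype-comparison step can be illustrated conceptually. This is a simplified stand-in for the API-Net-based model, not its actual implementation: each class keeps a representative embedding, and an unknown image is assigned to the class whose prototype is most similar to the image's embedding.

```python
import numpy as np

def predict_from_prototypes(embedding, prototypes):
    """Assign the class whose prototype has the highest cosine similarity to
    the input embedding (conceptual sketch of prototype-based classification)."""
    emb = np.asarray(embedding, dtype=float)
    best_class, best_sim = None, -np.inf
    for label, proto in prototypes.items():
        proto = np.asarray(proto, dtype=float)
        # cosine similarity between the image embedding and the class prototype
        sim = emb @ proto / (np.linalg.norm(emb) * np.linalg.norm(proto))
        if sim > best_sim:
            best_class, best_sim = label, sim
    return best_class
```

In the actual model, the embeddings come from a CNN backbone and the comparison is mediated by learned attention vectors rather than raw cosine similarity.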
2.9. Evaluation of Constructed Model Performance
The following calculated parameters were used as indicators of model performance: Accuracy = (TP + TN)/(TP + FP + FN + TN); Precision = TP/(TP + FP); Recall = TP/(TP + FN); and F-score = 2 × Precision × Recall/(Precision + Recall), where TP, TN, FP, and FN represent the numbers of true positive, true negative, false positive, and false negative tiles, respectively. A receiver operating characteristic (ROC) curve is a probability curve for a classification problem at various threshold settings; it plots the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis. The AUROC represents the area under the ROC curve.
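These definitions translate directly into code. A minimal sketch, assuming the positive class (label 1) is dMMR; the AUROC is computed from predicted probabilities via scikit-learn, which is part of the study's software stack:

```python
from sklearn.metrics import roc_auc_score

def tile_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F-score from binary
    tile-level labels (1 = positive class) using the formulas in the text."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# AUROC uses predicted probabilities rather than hard labels:
# auroc = roc_auc_score(y_true, y_prob)
```
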
4. Discussion
The incidence of endometrial cancer is increasing worldwide [46] and, considering the rising importance of molecular biological tests, we need to consider future approaches to diagnosis and treatment in this context. The PORTEC-RAINBO trial [47] is one of the largest clinical trials investigating genotype-matched therapy for endometrial cancer; it aims to improve clinical outcomes and reduce the toxicity of unnecessary treatments in patients with endometrial cancer through molecularly directed adjuvant therapy strategies. One of the RAINBO trials, the MMRd-GREEN trial, enrolled patients with mismatch repair-deficient endometrial cancer at stage II with substantial lymphovascular space invasion (LVSI) or at stage III. It then compared a group receiving adjuvant radiotherapy with concurrent and adjuvant durvalumab for one year against a group receiving radiotherapy alone. In this trial, IHC was used for the determination of dMMR, as in the present study. Assessment of MMR status will thus become increasingly important in the future.
Both MSI and MMR status can serve as surrogate markers for selecting candidates to receive ICIs and refer to similar biological entities; however, various methods are used for their detection in clinical laboratories. In the MSI test, five microsatellite regions (BAT-25, BAT-26, MONO-27, NR-21, and NR-24) in DNA obtained from tumor and normal tissue from the same patient are amplified using PCR [15]. Tumors are classified as having high microsatellite instability (MSI-H) if two or more of the five microsatellite markers present a length difference between the tumor and normal samples; low microsatellite instability (MSI-L) if only one microsatellite marker presents a length difference; and as microsatellite stable (MSS) if no length difference is observed. In addition, there are NGS methods that specifically target only microsatellite regions, as well as those that evaluate MMR function as part of comprehensive cancer genome profiling approaches. When targeting microsatellite regions only, the lengths of a total of 18 microsatellite marker regions are measured through NGS, and MSI-H is diagnosed when 33% or more of the markers present instability [18].
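The decision rules above can be summarized in a short helper (the function name and the "not MSI-H" label for the NGS panel are our own; the source defines only the thresholds):

```python
def classify_msi(n_unstable, panel="pcr5"):
    """Classify MSI status from the number of unstable microsatellite markers.
    'pcr5'  : 5-marker PCR panel  -> MSI-H / MSI-L / MSS
    'ngs18' : 18-marker NGS panel -> MSI-H when >= 33% of markers are unstable
    """
    if panel == "pcr5":
        if n_unstable >= 2:
            return "MSI-H"   # two or more markers differ in length
        if n_unstable == 1:
            return "MSI-L"   # exactly one marker differs
        return "MSS"         # no length difference observed
    if panel == "ngs18":
        return "MSI-H" if n_unstable / 18 >= 0.33 else "not MSI-H"
    raise ValueError(f"unknown panel: {panel}")
```
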
Regarding MMR status, a concordance rate of 90% or higher between IHC staining and MSI testing has been reported in colorectal cancer [48]; however, another report has suggested lower concordance rates in other types of cancer [49]. In an evaluation of IHC staining and MSI testing for endometrial cancer, the overall concordance rate was 93.3%, and in discordant cases the reason was promoter hypermethylation of MLH1 [50]. Moreover, in endometrial cancer, although specific discrepancies are observed in the dMMR sub-group, IHC results are considered a better predictive factor for MMR gene status than determination using PCR [51].
Although IHC was used for assessment in the current study, it is important to recognize the limitations of MSI testing. When the DNA extraction quantity is low or the DNA quality is poor, there is a 14% probability that the test cannot provide an accurate evaluation. Furthermore, if the purity of the tumor cells in the sample is less than 30%, the results are likely to be false negatives [52].
We previously constructed a model using the TCGA data set annotated with MSI status, following an approach similar to that used in the present study; however, its accuracy was low (data not shown). Thus, although there is a certain degree of agreement between the assessment methods, differences between them could lead to variations in the obtained results.
While most previous investigations of medical image classification have used CNNs, a combined analysis of the PORTEC randomized trials and a clinical cohort conducted by Fremond et al. [30] used attention-based models for classification. Traditionally, CNNs have been widely used for image classification tasks; however, the introduction of the attention mechanism [53] has allowed for more accurate execution. CNNs capture relationships between adjacent pixels in images and recognize the displayed content through structures called convolutional layers. However, a disadvantage of CNNs is that they are influenced by elements other than the intended target, such as background objects.
On the other hand, the attention mechanism, originally developed primarily for natural language processing, has also proven useful in the field of image recognition. For image recognition tasks, a technology derived from Transformers [54] incorporating attention, known as the Vision Transformer, has emerged [44]. ViT models are capable of visualizing how much attention is paid to which areas within an image. Unlike CNNs, the pure ViT does not include convolutional structures and is composed solely of attention mechanisms, although these can also be used in conjunction with CNNs. By identifying areas of interest within images using the attention mechanism, we believe that recognition accuracy can be improved, thus addressing the disadvantage of CNNs when they are combined with attention mechanisms.
A further structural difference is that CNNs rely on fixed local receptive fields in the early layers, while ViTs use self-attention to aggregate global information even in the early layers [55]. As models utilizing an attention mechanism, we compared the performance of ViT and our API-Net-based model, which incorporates a CNN structure internally. In this study, the accuracy of ViT on the test data set was lower than that of the other networks (Table 2). This could be attributed to the fact that, unlike the other networks, ViT does not use convolution in its internal structure, suggesting that incorporating a CNN architecture could be beneficial for the pathological image diagnosis of endometrial cancer. Additionally, in the comparison of CNNs, ResNet showed higher accuracy. Considering that ResNet serves as the classification backbone within the API-Net-based model, ResNet can be considered highly useful in this context. Further performance improvement through combining CNNs with attention mechanisms may be possible for the molecular classification of cancers.
H&E-stained slides are the most widely used means in the clinical context for pathologists to confirm the histological type of endometrial cancer. In this study, we confirmed that the MMR status of endometrial cancer could be predicted from H&E-stained slides using deep learning. To the best of our knowledge, only a few studies [27,28,30,56] worldwide have tested this concept in endometrial cancer. ResNet, which performed particularly well in the present study, has also been used in several previous studies [27,30,31,34,56] in the field of medical imaging, although the number of layers and the target organs differed.
For example, in colorectal cancer, the use of artificial intelligence (AI) in the diagnostic algorithm is expected to reduce testing costs and avoid treatment-related expenses [57]. A strategy using a high-sensitivity AI followed by a high-specificity panel is expected to achieve the most significant cost reduction (about USD 400 million, or 12.9%) compared to a strategy using NGS alone, while a strategy using only a high-specificity AI may achieve the highest diagnostic accuracy (97%) and the shortest time to initiation of treatment [57]. This report was based on cost assumptions for colorectal cancer in the United States from 2017 to 2020. Although it is necessary to assess how much of a cost reduction can be achieved in other contexts, a similar approach for endometrial cancer may have the potential to save time and costs. Additionally, the use of AI-based approaches to assist the decision making of oncologists has the potential to allow optimal treatment to be provided to cancer patients sooner [58].
Author Contributions
Conceptualization, Mina Umemoto and Tasuku Mariya; Data curation, Mina Umemoto; Funding acquisition, Tasuku Mariya; Investigation, Mina Umemoto, Tasuku Mariya, Shintaro Sugita, Takayuki Kanaseki, Yuka Takenaka and Shota Shinkai; Methodology, Mina Umemoto and Tasuku Mariya; Project administration, Tasuku Mariya; Software, Yuta Nambu, Mai Nagata, Toshihiro Horimai and Shota Shinkai; Supervision, Tasuku Mariya, Motoki Matsuura, Masahiro Iwasaki, Yoshihiko Hirohashi, Tadashi Hasegawa, Toshihiko Torigoe, Yuichi Fujino and Tsuyoshi Saito; Validation, Mina Umemoto, Tasuku Mariya, Yuta Nambu, Mai Nagata and Toshihiro Horimai; Visualization, Mina Umemoto; Writing – original draft, Mina Umemoto; Writing – review & editing, Tasuku Mariya.