1. Introduction
Breast cancer is one of the most common and deadly diseases affecting women worldwide [1]. According to the World Health Organization (WHO), about 2.3 million women were diagnosed with breast cancer in 2020, and more than 685,000 of them died from the disease. Despite improved survival rates due to early detection and better treatment, breast cancer remains a major health threat [2]. Early detection is crucial for improving the prognosis and survival rates of breast cancer patients [3]. Routine examinations and screening using medical imaging technology play a vital role in detecting cancer at an early stage, when treatment is more effective. Mammography, ultrasound, and magnetic resonance imaging (MRI) are among the methods available for breast cancer detection [4]. Ultrasound, in particular, is frequently used in breast examinations because it is non-invasive, relatively inexpensive, and easily accessible [3,5]. However, ultrasound image datasets suffer from limited availability and class imbalance, which hinder the development of reliable diagnostic models. Enhancing both the quality and the quantity of ultrasound image datasets is therefore a significant focus in medical research [6,7,8].
Although ultrasound is a highly useful tool for breast cancer detection, it faces several significant challenges. Ultrasound images often suffer from suboptimal quality caused by noise and artifacts, which can obscure critical details and make cancer detection more difficult [9,10]. Such noise and artifacts can lead to diagnostic errors or necessitate unnecessary additional examinations. The limited availability of medical data and the high cost of annotation also hinder the development of accurate machine learning models [11,12,13]. Collecting and annotating large amounts of high-quality medical data requires significant resources and access to medical facilities [14]. In addition, interpreting ultrasound images depends heavily on the expertise and experience of radiologists, which introduces variability between examiners; this variability can result in differences in the diagnosis and treatment patients receive [15]. Although several automatic detection methods have been developed, many still struggle to handle the complexity and variability of medical images adequately, and existing detection algorithms may not be robust or accurate enough for widespread clinical application [16].
Combining ultrasound image data with advances in deep learning technology enables earlier detection of breast cancer without the need for invasive procedures from the outset [17,18,19]. Deep learning models can learn features from the available data, allowing breast conditions to be classified into several categories, namely normal, benign tumor, and cancer [20,21]. However, the limited availability of medical data and class imbalance often pose obstacles to developing accurate deep learning models [22]. The collection and annotation of high-quality medical data require significant resources, access to medical facilities, and attention to patient privacy. One common approach to addressing data limitations and class imbalance, both frequent issues in training CNN models, is data augmentation [23].
Data augmentation is commonly performed using traditional techniques, which include geometric augmentation, color augmentation, and noise augmentation. These techniques encompass operations such as reflecting images, cropping and translating images, and altering the image color palette [24]. However, conventional augmentation methods have several drawbacks. Although they can increase data variability, they tend to be limited in generating sufficiently realistic variations, because the transformations are deterministic and often do not reflect the natural diversity of the actual data [25,26]. As an alternative, the Generative Adversarial Network (GAN) is a generative data augmentation method that produces synthetic data by learning the distribution of the original data [27]. The GAN introduced by Goodfellow uses the Jensen-Shannon (JS) divergence in the calculation of its loss function [28,29]. A drawback of the JS divergence is that it can saturate and cause vanishing gradients, leading to unstable GAN training. In addition, GANs often experience mode collapse, where the model fails to capture the diversity of the entire data distribution and becomes fixated on generating data with certain patterns [30]. To address these limitations, a variant known as the Wasserstein Generative Adversarial Network (WGAN) was introduced [31,32]. WGAN uses the Wasserstein distance as a metric to measure the difference between the original data distribution and the generated data distribution, enabling more stable training and more realistic results [33,34].
In 2021, Xiao et al. [35] utilized the Wasserstein GAN model for data augmentation to address class imbalance. The model was applied to three RNA-seq cancer patient datasets obtained from the TCGA cancer gene expression database: Breast Invasive Carcinoma (BRCA), Lung Adenocarcinoma (LUAD), and Stomach Adenocarcinoma (STAD). Each dataset consisted of two classes, normal (N) and tumor (T), which were divided into training and testing data. Data augmentation with WGAN was performed only on the training data, expanding the minority class to match the majority class and thereby achieve class balance. The LUAD dataset was expanded from 22 N and 110 T to 110 N and 110 T, the STAD dataset from 18 N and 223 T to 223 N and 223 T, and the BRCA dataset from 73 N and 745 T to 745 N and 745 T. Cancer condition classification was then performed using a Support Vector Machine (SVM) model. The results showed that, compared to using the original dataset alone, the SVM model performed significantly better with the augmented dataset: its accuracy increased from 50% to 90% on the LUAD dataset, from 50% to 93.33% on the STAD dataset, and from 50% to 98.33% on the BRCA dataset. Building on this work, the present study applies WGAN-based augmentation to breast ultrasound image data to generate synthetic images that address class imbalance in each class.
In WGAN, the images produced by the Generator originate from the mapping of a random latent vector of dimension n. The Generator transforms this random vector into synthetic images that increasingly resemble the real image data. According to the original WGAN training algorithm, training continues until the Generator converges [33,34]. In practice, however, WGAN training fixes one of the hyperparameters prior to training, namely the number of epochs or training iteration steps [36,37,38]. Based on the above, this study investigates the generation of synthetic images using the WGAN model. It is hoped that this research will contribute to the creation of image datasets of the best possible quality to address issues of dataset availability and imbalance.
2. Materials and Methods
This study will use an annotated breast ultrasound image dataset to train and test the WGAN model [39]. The training process of WGAN will involve two neural networks: a generator that produces synthetic ultrasound images and a discriminator that assesses the authenticity of these images [40]. The generator and discriminator will be trained iteratively until the generator is capable of producing images that closely resemble the original ones. This research focuses on the data augmentation process using WGAN, as illustrated in the block diagram in Figure 1 and the flowchart in Figure 2.
The study begins with the collection of an annotated breast ultrasound image dataset to be used for model training. Pre-processing is then conducted, which includes image data normalization, resizing, and conversion of the images to grayscale [41]. The training of the Wasserstein GAN, involving the generator (a neural network that produces synthetic ultrasound images) and the discriminator (a neural network that measures the distribution difference between original and synthetic ultrasound images), is performed iteratively, with feedback from the discriminator used to improve the generator's performance. The output of the WGAN generator consists of high-quality synthetic ultrasound images that closely resemble the original images. These synthetic images are then used for data augmentation to increase both the size and the variability of the dataset, which then serves as the input data for the classification process.
2.1. Breast Ultrasound Image Data Acquisition
The dataset used in this study is derived from research conducted by Al-Dhabyani et al. (2020). It contains breast ultrasound images from several individuals with varying conditions: 437 images classified as Benign, 133 images categorized as Normal, and 210 images classified as Malignant. In this study, the Benign, Normal, and Malignant classes are referred to as classes 0, 1, and 2, respectively, as shown in Figure 3.
Figure 3 presents sample images from the breast ultrasound dataset, grouped into the three classes. On the left are images from class 0 (Benign), which depict lesions or changes that show no signs of cancer. In the center are images from class 1 (Normal), showing healthy breast tissue without any detected abnormalities; this class serves as a reference for distinguishing normal from abnormal conditions. On the right are images from class 2 (Malignant), which show abnormalities that may be cancerous and are therefore crucial for further diagnosis and management. By presenting the three classes side by side, Figure 3 offers a clear visualization of the characteristic differences between benign, normal, and malignant conditions, which aids researchers and medical practitioners in understanding and developing better detection methods for various breast conditions.
2.2. Pre-Processing
Pre-processing in the augmentation of breast cancer ultrasound images involves a series of steps to prepare the data before it is used to train the model [42,43]. The two main aspects of pre-processing are normalization and image resizing. Below is the process undertaken for pre-processing the breast ultrasound image dataset used with WGAN.
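As an illustration, a minimal pre-processing sketch in Python is shown below. The target resolution (128×128) and the normalization range [-1, 1] are assumptions for illustration, since a tanh-output generator is commonly paired with that range; the actual sizes should match the settings in Table 2.

```python
import cv2
import numpy as np

def preprocess(path, size=128):
    """Load one ultrasound image, convert to grayscale, resize, and normalize.

    The 128x128 target size and the [-1, 1] range are illustrative
    assumptions; they should match the generator's output layer.
    """
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # grayscale conversion
    img = cv2.resize(img, (size, size))            # uniform spatial size
    img = img.astype(np.float32) / 127.5 - 1.0     # scale pixels to [-1, 1]
    return img[..., np.newaxis]                    # add channel axis
```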
2.3. Wasserstein GAN Training
The WGAN training process begins with initializing the parameters [47,48]. It then proceeds in a main loop that, in the original formulation, continues until the generator's parameters converge. Each iteration of the main loop involves several updates to the Critic (Discriminator). In each Discriminator iteration, a batch of real data is first sampled from the original data distribution, and a batch of data is sampled from random noise in the latent space. The Discriminator's gradient is computed to update its parameters, followed by weight clipping. After several Discriminator updates, the Generator's parameters are updated: a batch of data is sampled from the latent space again, and the generator's gradient is computed to update the Generator's parameters. While the original algorithm repeats this process until the generator's parameters converge, in this study the iterations of the main loop are limited by the number of epochs or steps.
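A minimal PyTorch sketch of this loop is given below, following the original WGAN algorithm of Arjovsky et al. The network definitions are assumed to exist elsewhere, and the hyperparameter values (RMSprop with lr = 5e-5, clip value 0.01, n_critic = 5) are the defaults from that paper, not necessarily the exact settings of this study (see Table 2).

```python
import torch

# G, D, dataloader, and latent_dim are assumed to be defined elsewhere.
def train_wgan(G, D, dataloader, latent_dim, epochs=5000,
               n_critic=5, clip=0.01, lr=5e-5):
    opt_g = torch.optim.RMSprop(G.parameters(), lr=lr)
    opt_d = torch.optim.RMSprop(D.parameters(), lr=lr)
    for epoch in range(epochs):
        for real in dataloader:
            # --- Critic (Discriminator) updates ---
            # For brevity, the same real batch is reused across critic steps;
            # the original algorithm samples a fresh batch each step.
            for _ in range(n_critic):
                z = torch.randn(real.size(0), latent_dim)
                fake = G(z).detach()
                # Critic loss: E[D(fake)] - E[D(real)], minimized
                loss_d = D(fake).mean() - D(real).mean()
                opt_d.zero_grad(); loss_d.backward(); opt_d.step()
                # Weight clipping keeps D approximately 1-Lipschitz
                for p in D.parameters():
                    p.data.clamp_(-clip, clip)
            # --- Generator update ---
            z = torch.randn(real.size(0), latent_dim)
            loss_g = -D(G(z)).mean()
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```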
In WGAN, the Wasserstein distance is implemented in the Discriminator's loss function, as shown in equation (2), calculated from the average scores for real and fake images [49,50]. The difference between the average scores of fake and real images is used as the loss, which the discriminator seeks to maximize so that it can effectively distinguish between real and fake images. For the Generator, the loss function is the negative of the average score the Discriminator assigns to the fake images. This encourages the generator to create images that receive high scores from the discriminator, indicating that the images appear more realistic.
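In the standard WGAN formulation, which the descriptions above follow, the two losses can be written as

$$L_D = \mathbb{E}_{\tilde{x} \sim p_g}\big[D(\tilde{x})\big] - \mathbb{E}_{x \sim p_r}\big[D(x)\big], \qquad L_G = -\,\mathbb{E}_{\tilde{x} \sim p_g}\big[D(\tilde{x})\big],$$

where $p_r$ is the real data distribution, $p_g$ is the distribution induced by the generator, and $D(\cdot)$ is the critic's scalar score. Minimizing $L_D$ drives the critic to assign higher scores to real images than to synthetic ones, while minimizing $L_G$ pushes the generator toward images that the critic scores highly.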
Table 1 details the WGAN training algorithm as given in the original WGAN research by Arjovsky et al.
2.4. Evaluating the Effectiveness of WGAN-Based Augmentation
The original pre-processed image dataset goes through two different processes. First, the complete pre-processed dataset is used for WGAN training. Separately, the pre-processed dataset is split into a training set and a test set. The training set, combined with the synthetic images generated by the WGAN generator, forms the expanded dataset. The original training set and the expanded dataset are then used as input for the CNN classifiers during training, and the trained classifiers are evaluated on the test set to measure the performance difference between classifiers trained on the different datasets. Performance is measured by four metrics, namely accuracy, precision, recall, and F1-score. In this work, we use transfer learning classifiers: VGG16, ResNet50, MobileNetV2, and YOLOv8.
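The four metrics can be computed as in the following sketch, where `y_true` and `y_pred` are hypothetical label arrays from the held-out test set. Macro averaging over the three classes is an assumption here, as the averaging scheme is not stated in this section.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate(y_true, y_pred):
    """Return the four evaluation metrics for a 3-class prediction.

    Macro averaging (unweighted mean over classes) is assumed here.
    """
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
    }
```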
3. Results and Discussion
3.1. Results of WGAN Training
The generator and discriminator models of the WGAN in this study are each constructed as convolutional neural networks with the layer architectures shown in Figure 4. For the training process, several hyperparameters were defined by the researchers to adapt the WGAN model to the dataset, ensuring that the model can generate synthetic data of good quality consistent with the dataset used. Table 2 displays the hyperparameters involved in the WGAN training process and their values.
During training over 5000 epochs for each dataset class, the loss of the generator and discriminator is recorded. Figure 5, Figure 6 and Figure 7 present the loss curves for each WGAN training process, showing the loss of each Generator and Discriminator model during training on image data from classes 0, 1, and 2, respectively.
The loss behavior of the Wasserstein GAN (WGAN) training process on the three datasets reveals several important aspects of model convergence and stability. Unlike a traditional GAN, WGAN's Discriminator, or Critic, does not evaluate input data by classifying it as real or fake; rather, it computes the Wasserstein distance between two distributions, namely the real data distribution and the synthetic data distribution generated by the Generator.
An evaluation of the loss curves in Figure 5, Figure 6 and Figure 7 indicates that the discriminator loss stabilizes more rapidly than the generator loss. The curves show marked fluctuations in the loss values during the early epochs of training. This indicates that the WGAN model makes large initial adjustments to its weights in response to the differing data distributions: the synthetic images produced by the Generator early in training are still poor, resulting in a large difference between the distribution of the real images and that of the generated images. As training progresses, the WGAN Generator improves at producing synthetic images.
Based on the analysis of the loss patterns, the model for class 0 (Figure 5) begins to show stability after approximately 3000 epochs, with minimal fluctuations thereafter; if training were continued without an epoch limit, the model would likely remain stable, with slight improvements in the quality of the synthetic images. The model for class 1 (Figure 6) shows stabilization after 2500 epochs, but with significant variation still present; its loss pattern indicates that it requires more epochs to achieve the level of stability reached by the model for class 0. The model for class 2 (Figure 7) shows a loss pattern similar to that of the model for class 0 and achieves stability around 3000 epochs. Adding more epochs could further reduce the loss and enhance stability, though the improvement in synthetic data quality may not be very significant.
Thus, the stability evaluation of the WGAN training shows that the models for classes 0 and 2 achieve stability faster than the model for class 1, which requires more epochs to reach stability. The increasingly stable loss patterns at the end of training indicate that the models have successfully approximated the real data distribution, signifying good convergence. The differing behaviors of the models relate to the complexity and variability of the data in each class: the training dataset for class 1 is the smallest, and its loss curve shows that it requires more time to reach the stability level achieved by the models for classes 0 and 2. This evaluation provides a clearer picture of the stability and convergence of the WGAN models used and shows how the models could be further improved with additional training if necessary. The implementation of the Wasserstein distance in the discriminator loss provides an indicator of training progress, enabling researchers to monitor and adjust the model as needed. With stabilization achieved, the WGAN model can reliably generate high-quality synthetic images, which is crucial in addressing data limitations in the medical field, especially for breast ultrasound image data.
3.2. Synthetic Images Generated by WGAN
Table 3 compares the size of the original image dataset with the size of the dataset after the augmentation process. Figure 8, Figure 9 and Figure 10 display five synthetic image samples for each class, generated during the dataset-expansion process by the WGAN Generator previously trained on image data from the corresponding class.
Figure 8 shows synthetic image samples for class 0 (Benign category) generated by the generator trained on class 0 data. Figure 9 shows synthetic image samples for class 1 (Normal category) generated by the WGAN generator trained on class 1 data. Figure 10 shows synthetic image samples for class 2 (Malignant/Cancer category) generated by the WGAN generator trained on class 2 image data.
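Once training is complete, synthetic samples of this kind are obtained by drawing new latent vectors and passing them through the trained generator. The sketch below illustrates this, reusing the hypothetical G and latent_dim from the training sketch in Section 2.3 and assuming a tanh-style output in [-1, 1] that is rescaled to 8-bit pixel values.

```python
import torch

@torch.no_grad()  # no gradients are needed for generation
def generate_samples(G, latent_dim, n=5):
    """Draw n latent vectors and map them to synthetic images.

    Assumes G outputs values in [-1, 1]; rescale to [0, 255] for saving.
    """
    G.eval()
    z = torch.randn(n, latent_dim)
    imgs = G(z)                              # shape: (n, 1, H, W)
    return ((imgs + 1.0) * 127.5).clamp(0, 255).to(torch.uint8)
```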
3.3. Prediction Using CNN Classifiers
After expanding the original dataset, the performance of the classifiers trained on the different datasets is examined using the evaluation metrics. The performance of the classifiers with each dataset is presented in Table 4, Table 5, Table 6 and Table 7.
We compared multiple pretrained models well known for classification, including VGG16, ResNet50, and MobileNetV2. In addition, we incorporated YOLOv8, a pretrained model known for object detection that can also be used for classification tasks. Each model was evaluated on accuracy, precision, recall, and F1-score. Based on the results in Table 4, Table 5, Table 6 and Table 7, all models exhibit similar behavior in their predictive performance across the evaluation metrics, and all metrics indicate that the models predict more accurately when trained on the augmented dataset than on the original dataset alone.
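A minimal transfer-learning setup for one of these backbones is sketched below using torchvision's pretrained VGG16. Freezing the convolutional features and replacing the final layer with a 3-class head is a common pattern and an assumption here, as the exact fine-tuning scheme is not detailed in this section; grayscale ultrasound inputs are assumed to be replicated across three channels to match the pretrained input format.

```python
import torch.nn as nn
from torchvision import models

def build_vgg16_classifier(num_classes=3):
    """VGG16 backbone with ImageNet weights and a new 3-class head.

    Freezing the feature extractor and training only the head is one
    common transfer-learning scheme, assumed here for illustration.
    """
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    for p in model.features.parameters():
        p.requires_grad = False             # keep pretrained features fixed
    model.classifier[6] = nn.Linear(4096, num_classes)  # replace final layer
    return model
```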
VGG16 consistently achieved the highest results, with an accuracy of 83.33%, outperforming the other models on all metrics. ResNet50 and MobileNetV2 also exhibited notable improvements, particularly when the augmented data generated by WGAN was used. The results achieved by YOLOv8 differ by only about 2 percentage points from those of VGG16. Despite YOLOv8's strong performance in object detection tasks, its evaluation metrics in this classification context did not surpass those of VGG16, indicating that its architecture, while powerful for object detection, might not be as well suited to direct image classification as models like VGG16. This suggests that while YOLOv8 may be effective for other types of image-based tasks, more specialized models like VGG16 may offer better results for ultrasound image classification.
Overall, the best accuracy, precision, and F1-score were achieved by the VGG16 model, with scores of 83.33%, 84.90%, and 82.19%, respectively. In contrast, the best recall was obtained by the MobileNetV2 model, with a score of 81.51%. The combination of the VGG16 model with the expanded dataset proved to be the most effective, showing the most significant improvements across all evaluation metrics, including accuracy, precision, recall, and F1-score. This improvement suggests that the model can leverage additional data to learn more complex and accurate features, which is crucial in medical applications such as breast cancer detection. The increase in precision and F1-score particularly indicates that the model becomes more reliable in correctly predicting positive cases, thereby reducing misclassification errors, which can have serious consequences in a clinical context.
4. Conclusions
The potential of WGAN for data augmentation in medical imaging is very promising. This study demonstrates that Wasserstein GAN can be applied to limited breast ultrasound data with a stable training process, producing synthetic images that closely resemble the original breast ultrasound images. The stability of the WGAN training process is related to the use of the Wasserstein distance as the Discriminator's loss function, which also makes model performance easier to monitor and interpret. The differences in stability among the WGAN models for each class are influenced by the size of each class's dataset. Overall, the issue of limited medical data, which often hinders research, can be addressed through data augmentation with WGAN. The effectiveness of WGAN-based augmentation is apparent in the classifiers' performance: all evaluation metrics of every classifier increase, with the best accuracy, 83.33%, achieved by the VGG16 model. This comprehensive analysis underscores the importance of data augmentation in enhancing model performance, especially in critical domains where accuracy and reliability are paramount.
The researchers recommend that in future studies each model be trained with a number of epochs suited to its needs. Additionally, the weight clipping used in WGAN in this study could be replaced with an alternative such as a gradient penalty, which provides smoother weight constraints and can yield more stable training and better synthetic images.
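For reference, a minimal sketch of the gradient-penalty term from WGAN-GP (Gulrajani et al.) is shown below; it would replace the weight-clipping step in the training loop sketched in Section 2.3 and is not part of the present study's implementation. The coefficient lam = 10 is the value suggested in that paper and is an assumption here.

```python
import torch

def gradient_penalty(D, real, fake, lam=10.0):
    """WGAN-GP penalty: push the critic's gradient norm toward 1
    at points interpolated between real and fake samples.

    lam = 10 follows Gulrajani et al. and is assumed for illustration.
    """
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = D(x_hat)
    grads, = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```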
Author Contributions
The writing of this paper involved significant contributions from several researchers, each playing a vital role in various aspects of the study. I Gede Susrama Mas Diyasa served as the lead author, guiding the overall manuscript preparation, formulation of the research idea, and result analysis. Sayyidah Humairah was responsible for conducting the coding and testing of the algorithms used in this research, ensuring that all technical procedures were executed effectively and yielded accurate data. Eva Yulia Puspaningrum focused on the machine learning analysis, evaluating the performance of the applied models and interpreting the results from a scientific data perspective. To ensure the accuracy of the findings in a medical context, Fara Disa Durry, an experienced medical doctor, handled the validation of the results, confirming that the research outcomes were applicable and relevant in the medical field. In terms of overall supervision, Caesarendra acted as the supervisor, providing guidance and direction throughout the research development process and ensuring that each phase adhered to academic standards. Finally, Wahyu Dwi Lestari played a crucial role in funding acquisition, securing the resources necessary to support this research. This solid collaboration among the researchers enabled the study to proceed smoothly and produced valuable findings that contribute to the advancement of science.