1. Introduction
Natural disasters have long affected humanity, with climate change intensifying their frequency and severity. These events cause significant loss of life, property damage, and service disruptions, such as electricity and transportation, while posing serious health risks. The economic and psychological toll is profound [1].
Technological advances such as Pattern Recognition (PR), Deep Learning (DL), Machine Learning (ML), and Artificial Intelligence (AI) provide powerful tools for disaster detection, risk reduction, and response management. As reviewed in [2,3,4], these technologies hold great promise for future disaster response, especially AI and ML in the domain of computer vision (CV), through predictive models that analyze large datasets, identify patterns, forecast potential disasters, and provide early warnings of hazardous events [4,5]. DL is increasingly used for flood detection and segmentation, overcoming the limitations of traditional mapping [6]. However, DL models require extensive labeled data, which is difficult to obtain in disaster scenarios, requires multiple experts, is time-consuming, and is prone to human annotation errors.
To provide a comprehensive overview of the available resources in flood-related visual understanding from aerial imagery, we compiled a selection of ten publicly available datasets, namely AIDER [7], ISBDA [8,9], FloodNet [10], FAD [11], Spacenet-8 [12], FSSD [13], WaterBodies [14], Incidents1M [15], RescueNet [16], and BlessemFlood21 [17], released between 2019 and 2024. These are illustrated in Figure 1, which depicts a bubble chart where each bubble represents a dataset. The horizontal axis indicates the year of publication, while the vertical axis corresponds to the dataset image size. The area of each bubble is scaled proportionally to the dataset size to emphasize the disparity in scale across datasets. Each bubble is also annotated with the dataset name, and a distinct color is used for visual separation. Gray bubbles indicate datasets without flood annotations (AIDER [7], ISBDA [8,9], Incidents1M [15]), and for AIDER [7] and Incidents1M [15] only the subset of flood-related aerial images is considered in the reported sizes.
The increasing trend in dataset size over recent years highlights the growing effort to support deep learning-based flood analysis and emergency response using aerial visual data. However, the creation of large-scale annotated datasets remains a significant challenge. Manual annotation is labor-intensive, often requiring expert interpretation of complex visual patterns. Furthermore, achieving scene diversity is difficult, as aerial imagery of floods is typically limited to specific geographical regions and events, which may reduce generalizability across diverse flood scenarios.
To overcome these challenges, one viable solution is the use of synthetic data generated through advanced generative methods, such as diffusion models or generative adversarial networks (GANs). Synthetic imagery enables the creation of diverse flooding scenarios, including rare or extreme cases, thereby improving the model’s generalizability and robustness. Moreover, in the absence of ground truth labels, pseudo-labeling through automated segmentation algorithms offers a promising alternative. These methods can generate approximate annotations at scale, significantly reducing the dependence on expert input and enabling effective training of deep learning models in a weakly- or unsupervised manner.
In this paper, we address the problem of flood segmentation in UAV imagery using synthetically generated images of flooding events instead of real ones, answering the following open question: Can pseudo-labels from synthetic UAV data enable real-world flood segmentation? In our work, we propose a framework to create synthetic images of flooding events using two algorithms: (i) text-to-image synthesis and (ii) image inpainting based on real segmentation masks. The generated images undergo unsupervised pseudo-labeling to obtain flood segmentation masks, followed by filtering based on feature embeddings and clustering to refine the synthetic dataset. We train several popular Convolutional Neural Network (CNN) architectures for flood segmentation. Our results show that these models, when trained on both synthetic and real images, achieve higher segmentation accuracy. We evaluate their performance on an unseen test set and compare the results to models trained with human-annotated labels and with the unsupervised pseudo-labeling approach. A schematic overview of the proposed framework is shown in Figure 2.
The contributions of this work can be summarized as listed below.
We introduce, to the best of our knowledge, the first scalable pipeline for the unsupervised generation of synthetic aerial flood imagery, utilizing text-to-image diffusion models guided by semantically enriched prompts. To enable segmentation training without the need for manual annotation, we integrate an unsupervised pseudo-labeling approach [18], which automatically produces segmentation masks by exploiting the distinct color characteristics of floodwater and surrounding background elements.
We demonstrate, through extensive experiments with state-of-the-art flood segmentation models, that models trained solely on filtered synthetic data achieve performance close to that of real-data-trained models, with only minor performance drops, and we introduce an approach that combines real and synthetic data to boost performance.
We systematically examine how the structure and semantics of text prompts affect the quality and realism of the generated flood imagery, identifying factors that influence scene consistency and visual fidelity.
The remainder of this paper is structured as follows: Section 2 provides a summary of related research. Our proposed unsupervised framework is described in Section 3. Section 4 outlines the experimental framework of this study. Section 5 presents the experimental results along with an in-depth discussion. Finally, Section 6 concludes the paper and outlines directions for future research.
2. Related Work
Flood Segmentation in UAV and Satellite Imagery: Deep learning techniques, particularly CNNs, are increasingly used for flood segmentation in remote sensing imagery, surpassing traditional methods by enabling more accurate and efficient delineation of flooded areas and enhancing decision-making processes [6]. CNNs have shown strong capabilities in flood detection from satellite imagery by leveraging temporal variations in synthetic aperture radar (SAR) and multispectral data to differentiate between permanent water bodies and flood-affected areas [19,20]. However, their effectiveness is often constrained by the reliance on pre-disaster imagery for accurate change detection. To address uncertainty in SAR-based water segmentation, Bayesian CNNs have been proposed due to their ability to estimate both the mean and variance of model parameters, providing a probabilistic understanding of predictions [21].
U-Net variants have been widely adopted for water body segmentation and flood extent mapping tasks. For instance, in [22], a modified U-Net architecture was proposed that incorporated geomorphic features and utilized pre-processed Sentinel-1 radar imagery to achieve three-class classification. This model successfully differentiated flood water from permanent water and background. Similarly, in [12,23] it was demonstrated that lightweight U-Net configurations can offer an effective balance between accuracy, computational efficiency, and robustness. The use of transfer learning and targeted data augmentation proved essential in enabling the detection of flooded infrastructure, including roads and buildings. Furthermore, in [14], the performance of various CNN architectures for water body semantic segmentation was evaluated using high-resolution satellite and aerial imagery. The U-Net model with a MobileNet-V3 backbone, along with auxiliary features and data augmentation, achieved superior segmentation accuracy.
Benchmark experiments involving semantic segmentation have validated state-of-the-art deep learning models, including XceptionNet and ENet, for distinguishing floodwaters from natural water bodies and detecting inundated roads and buildings in UAV-acquired high-resolution post-disaster imagery [10]. Also, a CNN integrated into the Deep Earth Learning, Tools, and Analysis (DELTA) framework achieved high precision and recall for water segmentation across diverse datasets [24]. In [25], a multiscale attentive decoder network (ADNet) was proposed for automatic flood identification using Sentinel-1 images. When evaluated on the Sen1Floods11 benchmark dataset, ADNet outperformed recent deep learning and threshold-based approaches.
In [26], an enhanced version of the Efficient Neural Network (ENet) architecture was adopted for the semantic segmentation of UAV footage captured during flood events. The approach integrates atrous (dilated) separable convolutions in the encoder, enlarging the receptive field without increasing computational complexity [27], and depth-wise separable convolutions in the decoder, enabling efficient feature extraction with a reduced number of parameters. Atrous convolutions have been further utilized in disaster response scenarios to enhance the efficiency of search and rescue operations during events such as floods, high tides, and tsunamis. A notable example is FASegNet, a recently proposed CNN architecture designed for the semantic segmentation of flood- and tsunami-affected areas [28].
Transformer-based architectures have also demonstrated strong performance in semantic segmentation of remote sensing imagery. In [29], a novel approach was introduced employing the Swin Transformer as the backbone to enhance contextual feature representation, coupled with a densely connected feature aggregation module as the decoder. Additionally, the Bitemporal Image Transformer (BiT) model proposed in [19] showed superior performance in change detection tasks by effectively identifying and localizing regions of change between image pairs.
In [30], an interactive semantic segmentation model for multi-source UAV flood images using four prompt types was proposed. A prompt encoder maps prompts into a three-channel space to lower labeling costs, while an image encoder, combining Mamba and convolution operations, extracts global features. The model further improves prompt utilization by incorporating a spatial and channel attention module with residual connections. This enables multiscale fusion and filtering of prompt information and image features across both spatial and channel dimensions.
Several unsupervised flood segmentation methods have been developed using clustering and region-growing techniques. These include object-based K-means with region growing on SAR data [31] and UAV imagery [32], datacube-based flood mapping with probabilistic thresholds [33], tile-based histogram thresholding with contextual filters [34], and graph-based segmentation using Bayesian Markov random fields [35], all demonstrating effectiveness across various imaging sources and scenarios.
In [18], a fully unsupervised segmentation method, UFS-HT-REM, was introduced for fast and accurate flood area detection using UAV-acquired color imagery without requiring pre-disaster reference images. This framework addresses flood segmentation through a parameter-free, unsupervised image analysis pipeline that progressively eliminates non-flood regions using binary masks derived from color and edge information. Specifically, non-flood areas are excluded through mask calculations applied to each channel of the LAB color space, an RGB-based vegetation index, and edge maps of the original image. A probability map of flood presence is then generated using a weighted fusion strategy, followed by a modified hysteresis thresholding process to produce the final segmentation. The method demonstrates both high accuracy and computational efficiency, making it well suited for real-time, on-board processing during UAV operations. This methodology has been used as a pseudo-label generator for an unsupervised DL approach [36], and also serves the same purpose as a module in this work.
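For illustration, the sketch below outlines how a color- and edge-based flood pseudo-label of this kind could be computed in Python. It is a simplified approximation of the described pipeline, not the published UFS-HT-REM implementation; all thresholds, fusion weights, and the excess-green vegetation index are illustrative assumptions.

```python
import cv2
import numpy as np
from skimage.filters import apply_hysteresis_threshold

def flood_pseudo_label(bgr: np.ndarray) -> np.ndarray:
    """Simplified color/edge-based flood pseudo-label (sketch, not UFS-HT-REM).

    bgr: uint8 BGR image as loaded by cv2.imread; returns a binary mask.
    """
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32) / 255.0
    b, g, r = [bgr[..., i].astype(np.float32) / 255.0 for i in range(3)]

    # Excess-green vegetation index: high values suggest vegetation, not flood.
    exg = 2 * g - r - b
    veg_mask = exg < 0.05

    # Smooth water surfaces tend to have low local edge density.
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    edge_density = cv2.blur(cv2.Canny(gray, 50, 150).astype(np.float32) / 255.0, (15, 15))
    smooth_mask = edge_density < 0.1

    # Muddy floodwater typically shows moderate lightness in the L channel.
    l_mask = (lab[..., 0] > 0.25) & (lab[..., 0] < 0.85)

    # Weighted fusion into a crude flood-probability map, then hysteresis.
    prob = 0.4 * veg_mask + 0.3 * smooth_mask + 0.3 * l_mask
    return apply_hysteresis_threshold(prob, low=0.5, high=0.8).astype(np.uint8)
```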
Pseudo-Label-Based Methods for Semantic Segmentation: Pseudo-labels in image segmentation are a common strategy in semi-supervised and unsupervised learning. The idea is to use automatically generated labels, often produced by an initial model, unsupervised method, or rule-based system to train deep learning models without requiring large amounts of manually annotated data. This approach has been widely explored in general semantic segmentation tasks, such as urban scene understanding, medical imaging, and object detection.
A recent review and analysis of various PL methods and their applications in semi-supervised semantic segmentation (SSSS) underlined that training with limited labeled data by leveraging automatically generated labels can be effective [37]. Limitations of existing pseudo-label generation have been addressed in [38] by leveraging enhanced class activation maps and dual attention mechanisms to produce semantically rich labels, achieving competitive or superior performance. Furthermore, self-supervised learning and pseudo-label refinement were integrated in a novel weakly-supervised semantic segmentation (WSSS) approach, achieving near fully-supervised performance by enhancing feature representation and mitigating label noise [39]. In [40], PseudoSeg was introduced, a method that generates structured pseudo-labels for training with unlabeled or weakly-labeled data, demonstrating effectiveness in both low-data and high-data regimes.
In the context of flood segmentation, pseudo-labeling is less established. A few recent studies have started exploring automatic label generation for SAR and optical imagery, particularly when annotated flood datasets are limited or unavailable. A semi-supervised learning method for flood segmentation using Sentinel-1 SAR imagery, employing a cyclical training process with an ensemble of U-Net models trained on both high-confidence hand-labeled data and generated pseudo-labels, was introduced in [41]. Moreover, an unsupervised deep learning framework for water extraction from multispectral imagery, combining NDWI with a binarization algorithm to generate pseudo-labels for training, was proposed in [42]. A novel WSSS framework, TFCSD, has been introduced for efficient urban flood mapping, significantly reducing manual annotation by decoupling the generation of positive and negative samples [43]. The method enhances edge delineation and stability and maintains high performance even without pre-disaster data by incorporating SAM-assisted interactive labeling [44].
In our recent work [36], we proposed a novel unsupervised deep learning approach for flood segmentation in Unmanned Aerial Vehicle (UAV) imagery, which leverages automatically generated pseudo-labels as training and validation masks, thereby eliminating the need for manually annotated ground truth data. Two widely used Convolutional Neural Network (CNN) architectures for semantic segmentation were trained under this framework. The results demonstrated that training with pseudo-labels alone can achieve performance levels comparable to those obtained using conventional ground truth annotations. Finally, in [45] a semi-supervised semantic segmentation algorithm for accurate flood delineation in SAR data was proposed. The method exhibited promising results utilizing a pseudo-label generation strategy and self-supervised teacher-student models.
Deep Generative Models and Diffusion-based Image Synthesis: The performance of modern DL models is fundamentally tied to the availability of large-scale datasets. In many real-world domains, such as disaster monitoring or remote sensing, data acquisition is costly. To address this bottleneck, a promising direction is the generation of synthetic data, which can serve as a scalable and controllable alternative to real-world data collection. Synthetic data can be created through simulation, procedural generation, or by leveraging deep generative models. Among generative approaches, models are typically grouped into four main categories: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), auto-regressive models, and, more recently, diffusion models [46,47].
Diffusion models have emerged as the most widely adopted family of generative models across a variety of domains, such as image and text-to-image synthesis [48,49,50], image inpainting [47,51], and image-to-image translation [52,53]. These models operate by learning to denoise a sample starting from pure Gaussian noise, gradually transforming it into a realistic image. They exhibit greater training stability compared to GANs, offer fine-grained control over the generation process via conditioning mechanisms such as textual prompts or image masks, and have proven to generate sharper images than VAEs, which often produce blurry outputs due to their reliance on approximate posterior inference [46]. The ability of diffusion-family models to learn strong latent representations that are aligned with textual prompts, and to generalize to unseen scenarios (textual prompts) while generating high-fidelity images, is evidenced by the emergence of widely adopted models such as the DALL-E series [54,55], Stable Diffusion [56], and MidJourney, which generate high-quality, customizable images from textual descriptions. Their capacity to synthesize diverse, domain-specific imagery makes diffusion models particularly suitable for augmenting datasets in applications where labeled data is scarce or costly to produce.
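As a concrete illustration of the two generation modes used later in this work, the following sketch shows how text-to-image synthesis and mask-conditioned inpainting can be driven with the Hugging Face diffusers library. The checkpoint identifiers, file names, prompt wording, and sampler settings are assumptions chosen for illustration, not necessarily those used in our pipeline, and a CUDA GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionInpaintPipeline
from PIL import Image

# (i) Text-to-image synthesis guided by a semantically rich prompt.
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
prompt = ("aerial UAV view of a flooded peri-urban neighborhood, muddy brown "
          "floodwater covering roads, partially submerged cars, no sky visible")
synthetic = t2i(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]

# (ii) Inpainting conditioned on a real flood segmentation mask: white regions
# of the mask are re-synthesized as floodwater (file names are placeholders).
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to("cuda")
real_image = Image.open("real_uav_scene.jpg").convert("RGB").resize((512, 512))
flood_mask = Image.open("flood_mask.png").convert("L").resize((512, 512))
semi_synthetic = inpaint(
    prompt="muddy brown floodwater seen from above",
    image=real_image, mask_image=flood_mask, num_inference_steps=50).images[0]
```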
4. Experimental Setup
To assess the generalization capability of a segmentation model trained on synthetic flood imagery, we conducted a series of controlled experiments using state-of-the-art deep learning architectures. Specifically, we selected three top-performing segmentation models, which were trained under six different data configurations: (a) exclusively on the real-world dataset Dr with the actual ground truth, (b) on the real-world dataset with pseudo-labels (Dpl), assuming no ground truth is available, (c) on the synthetic dataset SDs (see Equation 6) with corresponding pseudo-labels (PLs), (d) on the semi-synthetic dataset SDip (see Equation 7) with corresponding pseudo-labels (PLip), (e) on the combination of real-world and both synthetic and semi-synthetic datasets (Dr ∪ SDall, as defined in Equation 8) with the actual ground truth and corresponding pseudo-labels (GT ∪ PLall), and (f) on the combined real-world and filtered synthetic and semi-synthetic datasets (Dr ∪ SDfilt, as defined in Equation 10) with the actual ground truth and corresponding pseudo-labels (GT ∪ PLfilt). The evaluation was performed on an independent real-world dataset with respective expert annotations, which was not used during training, to measure the effectiveness of the synthetic data in bridging the domain gap.
4.1. Datasets and Methods
Datasets: For the real-world dataset baselines, we employed two publicly available datasets that depict flood-affected regions, each accompanied by ground truth segmentation masks delineating flooded areas. These datasets consist of aerial imagery captured by UAVs and helicopters, encompassing a diverse range of environmental contexts, including urban, peri-urban, and rural landscapes. The images exhibit significant variability in scene composition, featuring elements such as vegetation, rivers, buildings, roads, mountainous terrain, and the sky. Additionally, they were acquired from multiple altitudes and viewing angles, ensuring a comprehensive representation of flood scenarios. Notably, both datasets maintain a similar balance between flood and background pixels, which mitigates potential class imbalance issues during model training.
As a baseline training set Dr, we opted to use the well-known Flood Area Dataset (FAD) [11]. This dataset comprises 290 RGB images accompanied by manually annotated segmentation masks. The dataset exhibits variability in image resolution and dimensions. Contributing to the dataset's heterogeneity is its diverse range of environmental contexts, including 203 urban and peri-urban scenes and 87 rural scenes. In terms of visual composition, 108 images contain full or partial views of the sky, while 182 images lack any visible sky. This diversity supports robust model training across varying landscape types and viewing conditions.
To assess the generalization capability of the segmentation models, we utilized, only for inference, the Flood Semantic Segmentation Dataset (FSSD) [13] as an independent test set, which consists of 663 RGB images and corresponding ground truth segmentation masks. Similarly to the FAD, this dataset comprises images obtained from UAVs portraying diverse flooded scenes captured from various camera perspectives. The image sizes and resolutions also vary, but all images were resized and, if necessary, zero-padded to 512 × 512 by the dataset creator. All 663 images were used as our test set. Representative samples from both datasets, along with the corresponding ground truth masks, are illustrated in Figure 7.
Flood Segmentation Methods: We trained three well-established Convolutional Neural Networks (CNNs) for semantic segmentation tasks. The first model, the original U-Net architecture introduced in [59], is composed of a symmetric encoder–decoder structure. The encoder consists of repeated convolutional blocks followed by max pooling operations, progressively reducing spatial dimensions while enriching semantic abstraction. The decoder mirrors the encoder structure through transposed convolutions and skip connections, allowing precise localization by fusing high-resolution features from earlier layers. The model contains approximately 31 million trainable parameters.
The second architecture is the Fully Convolutional Network (FCN) integrated with a ResNet-50 backbone, following the design principles of [60]. FCNs are designed for pixel-wise classification by replacing fully connected layers with convolutional ones, preserving spatial resolution throughout the network. The ResNet-50 backbone introduces residual connections, which facilitate the training of deeper architectures by mitigating the vanishing gradient problem. In the FCN, skip connections from intermediate layers to the decoder enhance fine-grained segmentation by integrating both low-level and high-level features. The model contains approximately 33 million trainable parameters.
The third model is DeepLabV3, as proposed by Chen et al. in [61], which introduces atrous (dilated) convolution to expand the receptive field without loss of resolution. The architecture leverages Atrous Spatial Pyramid Pooling (ASPP), a module that captures multi-scale contextual information by applying parallel atrous convolutions with varying dilation rates. This enables the model to efficiently aggregate information at multiple spatial scales. DeepLabV3 is typically built on a ResNet backbone, where the final layers are adapted to maintain spatial resolution and support the ASPP module. The segmentation output is then upsampled to match the original input dimensions. This architecture is particularly effective in handling objects at multiple scales and segmenting fine structures in complex scenes. The DeepLabV3 model used in this study consists of approximately 41 million trainable parameters.
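The FCN-ResNet50 and DeepLabV3 variants described above are available off the shelf in recent torchvision releases; the minimal sketch below shows how they can be instantiated for binary flood segmentation (the original U-Net is not part of torchvision and is typically implemented separately). The two-class output convention, random initialization, and input size are illustrative assumptions rather than our exact configuration.

```python
import torch
from torchvision.models.segmentation import fcn_resnet50, deeplabv3_resnet50

NUM_CLASSES = 2  # flood vs. background

# Randomly initialized networks (no pretrained segmentation or backbone weights).
fcn = fcn_resnet50(weights=None, weights_backbone=None, num_classes=NUM_CLASSES)
deeplab = deeplabv3_resnet50(weights=None, weights_backbone=None, num_classes=NUM_CLASSES)

fcn.eval()
deeplab.eval()
x = torch.randn(1, 3, 512, 512)  # dummy UAV frame
with torch.no_grad():
    print(fcn(x)["out"].shape)      # torch.Size([1, 2, 512, 512])
    print(deeplab(x)["out"].shape)  # torch.Size([1, 2, 512, 512])
```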
4.2. Training Protocol
Implementation Details: All models were implemented in PyTorch and trained for 100 epochs with randomly generated batches of 8 images. A dynamic learning rate adjustment strategy was employed, whereby the initial learning rate was reduced by a factor of 0.5 if no improvement in validation performance was observed over five consecutive epochs. Optimization was conducted using the Adam optimizer with weight decay to prevent overfitting by penalizing large weights and encouraging model generalization. Adam was selected due to its adaptive learning rate strategy, which computes individual learning rates for each parameter based on estimates of the first and second moments of the gradients. This facilitates efficient convergence and robustness in training deep neural networks, particularly in high-dimensional and non-convex optimization landscapes.
The model weights were initialized using the Kaiming normal initialization, which is specifically designed for layers with ReLU activations. This initialization strategy maintains the variance of activations across layers, thereby promoting stable gradient flow during training and preventing issues such as vanishing or exploding gradients. Each utilized dataset was randomly partitioned into 90% for training and 10% for validation. The Dice loss function was employed as the objective to address class imbalance, and model performance was evaluated using the accuracy on the validation set.
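The optimization recipe above can be condensed into a short PyTorch setup, sketched below. Since the exact learning rate, weight decay, and other numeric values were not recoverable from the text, the values used here are placeholders; the Dice loss is written for a single-channel (binary) output.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

def init_weights(m):
    # Kaiming (He) normal initialization for convolutional layers with ReLU activations.
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def dice_loss(logits, targets, eps=1e-6):
    # Soft Dice loss for a single-channel flood/background output.
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

model = deeplabv3_resnet50(weights=None, weights_backbone=None, num_classes=1)
model.apply(init_weights)

# Learning rate and weight decay values are placeholders, not the paper's exact values.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=5)  # halve LR after 5 stagnant epochs

# Inside the epoch loop: minimize dice_loss on training batches, then call
# scheduler.step(validation_accuracy) once per epoch.
```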
Notably, improved model convergence and higher validation accuracy were observed when training was performed on a combined dataset comprising real-world, synthetic, and semi-synthetic images (see also Section 5). In this setting, each training batch was composed of a fixed ratio of samples: specifically, two randomly selected images from the real-world dataset and three images each from the synthetic and semi-synthetic subsets. This sampling strategy enabled the model to benefit from the diversity and volume of synthetic data while maintaining grounding in real-world examples, effectively enhancing generalization and training stability.
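A minimal sketch of this fixed-ratio batch composition is given below. It assumes three map-style PyTorch datasets returning (image, mask) tensor pairs of identical spatial size and samples indices with replacement, which may differ from our exact sampler.

```python
import random
import torch

def sample_mixed_batch(real_ds, synth_ds, semi_ds, n_real=2, n_synth=3, n_semi=3):
    """Draw one training batch with a fixed real/synthetic/semi-synthetic ratio (2:3:3)."""
    items = (
        [real_ds[i] for i in random.choices(range(len(real_ds)), k=n_real)]
        + [synth_ds[i] for i in random.choices(range(len(synth_ds)), k=n_synth)]
        + [semi_ds[i] for i in random.choices(range(len(semi_ds)), k=n_semi)]
    )
    random.shuffle(items)  # avoid a fixed real/synthetic ordering inside the batch
    images = torch.stack([img for img, _ in items])
    masks = torch.stack([msk for _, msk in items])
    return images, masks
```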
Training was performed on a system equipped with an Intel i7 CPU (2.3 GHz), 40 GB of RAM, and two NVIDIA Quadro RTX 4000 GPUs. The total training times ranged from less than an hour to approximately four hours, depending on the deep learning architecture and the size of the training dataset.
Dataset Pre-processing: For model training on real-world data, as previously stated, we utilized the FAD dataset. In the absence of ground truth, corresponding pseudo-label segmentation masks were automatically generated using an adapted UFS-HT-REM pipeline. All images were resized to a common resolution to speed up training and normalized to zero mean and unit variance. These processes were also applied to the synthetic dataset, as well as to the FSSD dataset used for model performance evaluation.
Data Augmentation: To mitigate the limited amount of training data, we applied on-the-fly augmentation with a per-transformation probability of 0.5. Augmentations included horizontal flipping, random rotations within a limited angular range, and additive uniform noise. Over 100 epochs, this strategy effectively expanded training variability.
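The sketch below illustrates one way to apply these augmentations identically to an image tensor and its mask with torchvision. The rotation range and noise amplitude are placeholders, since the exact values were not recoverable from the extracted text.

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(image: torch.Tensor, mask: torch.Tensor, p: float = 0.5):
    """On-the-fly augmentation applied jointly to a (C, H, W) image and its mask."""
    if random.random() < p:
        image, mask = TF.hflip(image), TF.hflip(mask)
    if random.random() < p:
        angle = random.uniform(-15.0, 15.0)           # assumed rotation range
        image, mask = TF.rotate(image, angle), TF.rotate(mask, angle)
    if random.random() < p:
        noise = (torch.rand_like(image) - 0.5) * 0.1  # assumed uniform noise range
        image = image + noise                         # noise on the image only
    return image, mask
```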
5. Experimental Results
In this section, we present an evaluation of the proposed synthetic dataset generation method in the UAV flood segmentation task. First, we assess the effectiveness of training segmentation models using only synthetic data, comparing their performance against models trained on real-world datasets. Our experiments include three widely used architectures: DeepLabV3, FCN-ResNet50, and U-Net. In the second part of this section, we analyze the role of prompt structure and semantic context in the quality and utility of the generated synthetic images. Additionally, we evaluate the similarity between the synthetic and real image feature distributions by providing clustering statistics.
5.1. Impact of Synthetic Data on Model Performance
We evaluated the impact of the proposed synthetic dataset creation framework on the performance of three widely used semantic segmentation architectures: DeepLabV3 [61], FCN-ResNet50 [60], and U-Net [59]. For this, each model was trained exclusively on both intermediate synthetic datasets (text-to-image, SDs, and image inpainting, SDip), on the combined real-world and synthetic dataset before (Dr ∪ SDall) and after (Dr ∪ SDfilt) the filtering process, and was compared against its counterpart trained on the real-world dataset with the actual ground truth (Dr) and with pseudo-label segmentation masks (Dpl) instead of the ground truth. As reported in Table 1, Table 2, and Table 3, across all architectures, we observed that models trained solely on synthetic data achieved segmentation performance remarkably close to that of models trained on real-world data, and in the case of the U-Net even better. On average, the observed performance was above 70% in F1-score, with a drop in the range of 2-5% compared to training with real-world data and a rise of 1.08% in the case of the U-Net, highlighting the high fidelity and generalizability of the generated synthetic samples.
Among the three evaluated architectures, we observed that DeepLabV3 exhibited the highest performance drop when trained solely on synthetic data, while U-Net consistently achieved a slight increase in performance—approximately 1% across all evaluation metrics—compared to its training on real-world data. This divergence in behavior can be attributed to architectural differences: DeepLabV3, with its atrous spatial pyramid pooling and deeper encoder, may rely more heavily on fine-grained real-world features and textures that the synthetic data only partially captures. In contrast, U-Net’s symmetric encoder-decoder structure with skip connections may better exploit the spatial coherence and regularity present in the synthetic masks and images, thus benefiting from the structured nature of the generated dataset.
A particularly noteworthy outcome of our experimental analysis is the consistent performance improvement observed across all examined segmentation models, DeepLabV3, FCN-ResNet50, and U-Net, when trained on the combined dataset comprising both real-world data (images and manually annotated masks) and synthetic data (generated images and pseudo-label masks). These gains range between 1% and 6% in the F1-score, underscoring the complementary nature of the synthetic dataset. The inclusion of synthetically generated samples, which capture diverse flood-affected scenarios and augment underrepresented visual patterns, likely aids in regularizing the models and enhancing their generalization capabilities. Additionally, the pseudo-label masks, though automatically generated, provide reasonably accurate supervision signals that help the models learn more robust decision boundaries. This result highlights the value of synthetic data as a scalable and effective means to enrich limited annotated datasets in domain-specific applications such as disaster response.
The introduction of our filtering strategy following the initial creation of the synthetic dataset (row 5 vs. row 6 in Table 1, Table 2, and Table 3) further enhances model performance across all segmentation architectures. Specifically, we observe improvements ranging from 1% to 4% in key evaluation metrics when the filtered synthetic dataset is used instead of the unfiltered one. Among the evaluated models, U-Net consistently exhibits the highest performance gain, suggesting that its architecture may be particularly sensitive to noisy or low-quality training examples. By eliminating synthetic samples with low semantic alignment to real data distributions, the filtering stage effectively increases the signal-to-noise ratio in the training set, leading to better convergence and generalization. This demonstrates the importance of curating synthetic data not only in terms of diversity, but also in maintaining fidelity to real-world distributions.
Overall, among the evaluated architectures, as observed in Figure 8, DeepLabV3 yielded the lowest performance when trained solely on synthetic data, though it was able to marginally surpass training on real-world data when a combined real and synthetic dataset was used. FCN-ResNet50 consistently achieved the highest performance in most scenarios, yet its performance declined slightly under synthetic-only training conditions. U-Net, in contrast, demonstrated notably stronger performance with synthetic data and significantly outperformed the other deep learning models when trained with a filtered combination of real and synthetic data. This can be attributed to U-Net's encoder-decoder structure with skip connections, which excels at capturing both fine-grained local details and global spatial context. Such architectural characteristics are particularly well-suited for segmentation tasks involving structured patterns. Consequently, the U-Net architecture excels on our synthetic flood imagery, which features regular shapes and consistent textural cues, as generated via Stable Diffusion. U-Net's inductive biases align well with the spatial regularity present in the synthetic data, enhancing its ability to generalize effectively in this context. Statistically, when trained with combined real and synthetic data, U-Net produced the best inference for 58.22% of all test images, FCN-ResNet50 for 29.26%, and DeepLabV3 for only 12.67%, indicating that the latter is the worst performing model in this setting. Note that an image's best inference can be shared by more than one model.
5.2. Real and Synthetic Dataset Similarity and the Role of Prompt Semantics in Dataset Quality
Real and Synthetic Data Similarity: To quantitatively assess the distributional similarity between real and synthetic datasets, we computed the Maximum Mean Discrepancy (MMD) between their respective feature representations. Specifically, we extracted deep features using a pre-trained ResNet50 model and computed the MMD between the real dataset and each of the two variants of the synthetic dataset. Both resulting scores indicate a reasonable alignment with the real data distribution, as MMD values below 0.1 are typically indicative of good distributional similarity in high-dimensional spaces. Notably, the second synthetic dataset demonstrates a significantly closer alignment to the real data, suggesting improvements in generation fidelity, possibly due to enhanced prompt structure, better context grounding, or more effective filtering. These results support the hypothesis that high-quality synthetic data, when properly curated, can closely mimic real-world distributions and serve as a valuable resource for training deep segmentation models.
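A sketch of this measurement is shown below: ResNet50 penultimate-layer embeddings are extracted and a radial-basis-function MMD is computed between two feature sets. The kernel choice and bandwidth are assumptions, as they are not specified in the text, and the function returns the squared MMD.

```python
import torch
from torch import nn
from torchvision import models

# ResNet50 feature extractor returning 2048-d penultimate-layer embeddings.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, H, W) ImageNet-normalized batch -> (N, 2048) embeddings."""
    return backbone(images)

def mmd2_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 10.0) -> torch.Tensor:
    """Squared MMD with an RBF kernel between two embedding sets (sketch).

    The bandwidth sigma is an assumption; the paper does not report it.
    """
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```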
Additionally, as a second way to assess the similarity between the real-world dataset (Dr) and the two synthetic datasets (SDs and SDip) in a more direct and visual manner, we also used the feature embeddings from each image and applied Principal Component Analysis (PCA) for dimensionality reduction and visualization. The resulting 2D PCA plot, shown in Figure 9, demonstrates a substantial overlap between the three datasets in the feature space (the best-fit ellipses overlap), indicating a high degree of visual similarity. Notably, the synthetic datasets exhibit significant alignment with the real-world distribution, suggesting that the synthetic data generation processes effectively capture the global visual characteristics of real imagery. Among them, SDip shows a tighter overlap with Dr, which can be attributed to the ground truth mask prior, while SDs exhibits a slightly broader spread, which may imply higher visual diversity or variability in synthesis quality. This kind of deviation may reflect that, while the synthetic dataset captures much of the variance of the real data (hence the overlap), it might also be exploring new areas or patterns in the feature space that are not fully represented in the real data. This is also suggested by the slight deviation in orientation of the fitted ellipses between the real and synthetic datasets. The novel-pattern assumption appears to be verified by the model performance increase when training on the union of the real and synthetic datasets. These findings support the potential utility of synthetic datasets as substitutes or complements to real-world data for training DL models, especially in scenarios where labeled real data is limited or costly to obtain.
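The visualization can be reproduced with a short script along the following lines; the 2-sigma ellipse size is an illustrative choice, and feature_sets is assumed to map dataset names to (N, 2048) embedding arrays such as those produced by the extractor sketched above.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from sklearn.decomposition import PCA

def plot_pca(feature_sets: dict[str, np.ndarray]):
    """Project per-image embeddings to 2D and draw a 2-sigma ellipse per dataset."""
    pca = PCA(n_components=2).fit(np.vstack(list(feature_sets.values())))
    fig, ax = plt.subplots()
    for name, feats in feature_sets.items():
        xy = pca.transform(feats)
        ax.scatter(xy[:, 0], xy[:, 1], s=8, alpha=0.5, label=name)
        mean, cov = xy.mean(axis=0), np.cov(xy, rowvar=False)
        vals, vecs = np.linalg.eigh(cov)                      # ascending eigenvalues
        angle = np.degrees(np.arctan2(vecs[1, -1], vecs[0, -1]))
        width, height = 4 * np.sqrt(vals[-1]), 4 * np.sqrt(vals[0])  # 2-sigma axes
        ax.add_patch(Ellipse(mean, width, height, angle=angle, fill=False))
    ax.legend()
    plt.show()
```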
Prompt Structure and Semantics in Synthetic Image Quality: The structure and semantics of textual prompts play a critical role in guiding the fidelity of generated synthetic images. As shown in Figure 10, minor variations in phrasing, specificity, or contextual richness can substantially affect visual realism and alignment with the intended flood scenario. Detailed prompts that include spatial relationships, lighting conditions, environmental context (e.g., urban vs. rural), and object-level cues (e.g., "partially submerged tractors", "muddy floodwater", "sky reflections") consistently result in higher quality images. Conversely, vague or semantically sparse prompts (e.g., "flooded city" or "flooded river") often lead to artifacts, inconsistencies in water boundaries, or unnatural object placements.
For instance, as illustrated in Figure 10, the use of a vague or underspecified prompt in the text-to-image synthesis pipeline leads to the generation of semantically and structurally inconsistent elements, such as the appearance of building rooftops that were not explicitly requested and exhibit unrealistic geometries (highlighted with red boxes). Similarly, in the image-to-image generation approach conditioned on segmentation masks, employing a more generic prompt results in the synthesis of implausible structures, such as slanted apartment-like buildings, demonstrating that prompt specificity directly impacts the semantic and geometric fidelity of the generated content.
5.3. Ablation on Filtering Threshold
As a final task, to investigate the sensitivity of our filtering scheme to the choice of strictness parameter, we varied the k-value in the z-score threshold formulation presented in Section 3.3. Specifically, we examined k = 3, k = 2, and k = 1.5, corresponding to progressively stricter inclusion criteria for synthetic images based on their distance to real data clusters in feature space. Our results, shown in Table 4, demonstrate that a stricter threshold (i.e., k = 1.5) consistently yields better segmentation performance across the majority of evaluated models. This suggests that retaining only the most semantically aligned synthetic samples, those with the highest fidelity to real data distributions, is beneficial for generalization.

The performance gains are particularly evident for DeepLabV3 and FCN-ResNet50, which both benefit significantly from the more selective filtering. This is likely due to the fact that these architectures possess deep and complex feature extractors, which are more sensitive to domain discrepancies introduced by low-quality or semantically inconsistent synthetic data. By aggressively filtering out such samples, the learned representations remain more robust and transferable to real-world data. Interestingly, the intermediate threshold (k = 2) shows the greatest benefit for the U-Net architecture. Unlike the others, U-Net has a more symmetric and shallow encoder-decoder structure, which may allow it to benefit from a slightly larger, more diverse synthetic training set, so long as the noise introduced remains within tolerable bounds. The less favorable performance of DeepLabV3 and FCN-ResNet50 with k = 2 may highlight their greater sensitivity to noisy or out-of-distribution synthetic examples.
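For reference, a simplified version of such a z-score-based filter is sketched below: it clusters the real embeddings with K-means and discards synthetic samples whose distance to the nearest real cluster centroid exceeds the mean plus k standard deviations of the real samples' own distances. The cluster count and the exact statistic are assumptions and may differ from the formulation in Section 3.3.

```python
import numpy as np
from sklearn.cluster import KMeans

def filter_synthetic(real_feats, synth_feats, k=1.5, n_clusters=5):
    """Keep synthetic samples close to the real feature distribution (sketch).

    real_feats, synth_feats: (N, D) embedding arrays; returns a boolean keep-mask
    over the synthetic samples. Lower k means stricter filtering.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(real_feats)

    def nearest_dist(feats):
        d = np.linalg.norm(feats[:, None, :] - km.cluster_centers_[None], axis=-1)
        return d.min(axis=1)

    real_d = nearest_dist(real_feats)
    thresh = real_d.mean() + k * real_d.std()
    return nearest_dist(synth_feats) <= thresh
```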
5.4. Qualitative Segmentation Results
To evaluate the qualitative performance of the proposed framework, we present segmentation results from the best-performing model, namely U-Net trained on the combined real-world and filtered synthetic dataset. This configuration consistently achieved the highest F1-scores across validation and test sets, confirming its robustness and generalization capability under varying image conditions.
For interpretability and comprehensive assessment, representative segmentation outputs were selected based on percentile sampling of the F1-score distribution over the test set. Specifically, we used the best performing U-Net model variant, i.e., the variant trained with real and synthetic data, sorted all segmented test images in descending order of F1-score, and extracted five representative cases corresponding to the 0th (best), 20th, 40th, 60th, and 80th percentiles. This approach enables a structured visualization of the model's performance across different levels of difficulty, from highly accurate segmentations to more challenging scenarios.
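This percentile-based selection can be expressed compactly as in the following sketch, which assumes a one-dimensional array of per-image F1-scores over the test set.

```python
import numpy as np

def percentile_cases(f1_scores, percentiles=(0, 20, 40, 60, 80)):
    """Indices of representative test images at the given percentiles of the
    descending-sorted per-image F1-scores (0th percentile = best case)."""
    order = np.argsort(f1_scores)[::-1]  # best first
    return [order[int(p / 100 * (len(order) - 1))] for p in percentiles]
```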
As illustrated in Figure 11, the top-ranked examples, e.g., the 0th and 20th percentiles (Figure 11(c) and (h)), show high-quality segmentation with near-precise delineation of flood boundaries and minimal false positives or negatives. These cases typically involve well-lit, high-contrast scenes with distinct flood regions and minimal occlusions. In contrast, lower percentile examples, e.g., the 60th and 80th percentiles (Figure 11(r) and (w)), demonstrate the model's limitations, often corresponding to visual ambiguities such as flooded vegetation, low contrast between land and water surfaces, or significant reflections and shadow artifacts. Nonetheless, even in these more difficult cases, the U-Net generally preserves the structural integrity of the flooded regions, indicating resilience to challenging input conditions.

These results further substantiate the effectiveness of combining real-world data with selectively filtered synthetic images, which likely enhanced the diversity and coverage of the training set, enabling the U-Net to generalize well across both typical and edge-case scenarios. For comparative reasons, we also present in Figure 11 the segmentation results of the other two models, FCN-ResNet50 and DeepLabV3, trained with the same configuration as the U-Net, together with the respective F1-scores achieved. FCN-ResNet50 performed second best and generally produced better results than the DeepLabV3 architecture, as observed in Figure 11(d), (i), and (s), compared to (e), (j), and (t).
In Figure 12, we showcase the effectiveness of synthetic and semi-synthetic data for flood segmentation by presenting representative segmentation results from the same U-Net architecture trained under four distinct configurations: (i) using real-world images with manually annotated ground truth masks, (ii) using purely synthetic images generated via Stable Diffusion with automatically produced pseudo-labels, (iii) using semi-synthetic images created through image inpainting with corresponding pseudo-labels, and (iv) a combined dataset comprising real, synthetic, and semi-synthetic samples with their respective ground truth and pseudo-label masks. Five representative cases were chosen among the descending sorted differences of the F1-scores of the best and worst performing configurations, (iv) and (i), corresponding to the highest difference (first row), the 25th, 50th, and 75th percentile differences, and the lowest difference (last row).
Among the four cases, the combined dataset (case iv in the last column) achieved the highest performance in terms of segmentation accuracy, as measured by the F1-score. This indicates that the integration of real-world examples with synthetic and semi-synthetic data can significantly enhance the model’s generalization capability. The observed improvement is attributed to the increased variability and diversity introduced by the synthetic samples, which augment the training distribution and expose the model to a wider range of environmental conditions, textures, and structural layouts. This diversity appears to regularize the training and prevent overfitting to the limited real-world data.
Interestingly, both synthetic (case ii, third column) and semi-synthetic (case iii, fourth column) training independently produced segmentation results that were not only comparable to, but in some cases exceeded, the performance of models trained solely on real-world data (case i, second column), as observed in Figure 12(c), (d), and (n) compared to (b) and (l). This highlights the capacity of U-Net to learn robust features even when trained exclusively on artificially generated data, provided that the pseudo-labels contain sufficient regions of interest despite their inherent noise. It also underscores the model's resilience to imperfect supervision and demonstrates the potential of unsupervised or weakly supervised learning approaches for remote sensing tasks where high-quality annotated datasets are often scarce. The synthetic data, generated entirely from random noise using a generative diffusion process, provided a broad distribution of flood-like appearances, while the semi-synthetic data retained structural realism from the real imagery due to inpainting guided by true segmentation masks. Of course, there are also failure cases where complex patterns could not be learned, and the synthetic data as well as their combination with real data seemingly confused the model (see Figure 12(w), (x), and (y)).
These findings collectively suggest that in the absence of annotated datasets, the use of synthetic or semi-synthetic imagery in conjunction with automatic pseudo-labeling can offer a viable alternative for training deep segmentation models. Furthermore, combining such data with limited real-world samples results in a synergistic effect that further improves segmentation accuracy, advocating for hybrid training strategies in future work.
6. Conclusions
In this work, we presented a framework for constructing a synthetic dataset aimed at the task of flood segmentation in UAV imagery. Our approach integrates two distinct generative strategies: a text-to-image synthesis pipeline guided by flood-related textual prompts, and an image-to-image translation paradigm that leverages real segmentation masks as structural priors to generate realistic flood scenes via inpainting. These complementary approaches are unified to produce a diverse and semantically meaningful dataset.
To address the challenge of missing ground truth annotations for segmentation, we employ an unsupervised pseudo-labeling strategy to generate segmentation masks for the synthetic images. This allows us to construct paired image-mask samples without the need for manual annotation, significantly reducing the cost and effort typically associated with dataset curation. Furthermore, we incorporate a filtering stage based on outlier detection to ensure the realism and structural fidelity of the generated images, discarding samples that do not meet quality standards. To evaluate the effectiveness of the synthetic dataset, we conducted experiments with three top-performing flood segmentation models, assessing their performance on a real-world benchmark dataset. Our findings demonstrate that training with synthetic data, even when annotated via unsupervised pseudo-labeling, leads to only minor performance drops, while combining synthetic and real-world data during training improves model generalization and robustness. Our framework offers a scalable and low-cost solution for generating annotated flood segmentation datasets, with practical applications in disaster monitoring, remote sensing, and other vision-based environmental analysis tasks.
While our current framework provides a scalable solution for generating synthetic flood segmentation datasets, several promising directions remain open for future exploration. These include improving the accuracy of the unsupervised pseudo-labeling method by integrating stronger segmentation priors or combining multiple weak supervision signals. We also plan to explore domain adaptation techniques—such as adversarial training and feature alignment—to further reduce the gap between synthetic and real data. Lastly, we aim to extend the framework by considering multi-modal data synthesis for broader environmental monitoring applications.
Figure 1.
Flood-related aerial image datasets and their sizes (in images). Gray bubbles indicate absence of explicit flood annotations. Sizes for AIDER and Incidents1M reflect only flood-related images.
Figure 2.
Schematic overview of the proposed approach.
Figure 3.
Overview of synthetic dataset creation pipeline. The process begins with the generation of synthetic flood images using two methodologies: (i) text-to-image synthesis and (ii) image inpainting based on real segmentation masks. The generated images undergo an unsupervised pseudo-labeling process (PL Gen) to obtain corresponding segmentation masks. Next, a filtering stage is applied to refine the synthetic dataset by leveraging feature embeddings and clustering techniques.
Figure 4.
Illustration of the image generation process using Stable Diffusion over 50 denoising steps. The sequence shows the initial random noise and intermediate outputs at every 10th step (t) until the final synthesized image. Diversity in the generated outputs is achieved through carefully constructed text prompts. The model successfully generates realistic flooded environments, including urban/peri-urban scenes (a), (b), and rural landscapes (c), (d). Variations in sky presence which can include sky (b), (d) or not (a), (c), simulate different camera viewpoints and orientations.
Figure 5.
Synthetic images (a) from Figure 4 representing flooded urban/peri-urban environments with sky absence and presence, and rural scenes also without and with sky. Real-world UAV-captured flood-related images with the same scene diversity (c), along with their respective ground truths (d), which are used, as described in Section 3.2, to generate semi-synthetic images (e) via inpainting. Respective pseudo-labels, PLs and PLip, are overlaid in blue for the synthetic images Is (b) and semi-synthetic images Iip (f).
Figure 6.
Representative examples of synthetic (a) and semi-synthetic images (d) which were filtered with the outlier threshold. Semi-synthetic images were generated as described in Section 3.2, with the ground truth (c) used for inpainting derived from the real-world UAV-captured flood-related images (b).
Figure 7.
Sample images from the training dataset FAD [11] (a) and the corresponding ground truths (b), which were used in Section 3.2 to generate semi-synthetic images, and from the test dataset FSSD [13] (c) together with their respective ground truths (d).
Figure 8.
The F1-score of DeepLabV3, FCN-ResNet50, and U-Net on the test dataset (FSSD) according to training with the real dataset (FAD), the text-to-image synthetic dataset (SDs), the semi-synthetic image inpainting dataset (SDip), and their union, unfiltered (SDall) and filtered (SDfilt), when combined with the FAD dataset, with the usage of ground truth (GT) and pseudo-label (PL) masks.
Figure 9.
2D PCA visualization of ResNet50 feature embeddings extracted from real (FAD) and synthetic (SDs, SDip) datasets with the best fit ellipses of each dataset calculated by 2D normal distribution fitting. Each point represents an image, colored by dataset. The substantial overlap suggests strong visual alignment between real and synthetic domains.
Figure 10.
Comparison of accepted (top) and rejected (bottom) synthetic flood images generated via prompt-to-image (left) and image-to-image (right) translation. Problematic regions, including artifacts, erroneous configurations, and undesired objects, are highlighted with red bounding boxes.
Figure 11.
Original images (first column), ground truth masks (second column), and representative segmentation results of the U-Net (third column), FCN-ResNet50 (fourth column), and DeepLabV3 (fifth column) from the FSSD test dataset. The masks are overlaid in blue. Rows correspond to the 0th (best results), 20th, 40th, 60th, and 80th percentiles of the descending sorted F1-score values of the top performing U-Net architecture trained with real and synthetic data. FCN-ResNet50 and DeepLabV3 were also trained under the same conditions.
Figure 12.
Original images (first column), and representative segmentation results from the FSSD test dataset of the U-Net architecture trained with different datasets: real-world images and ground truths (second column), synthetic images and corresponding pseudo-label masks (third column), semi-synthetic images and corresponding pseudo-label masks (fourth column), and combined real-world data and filtered synthetic datasets with corresponding ground truth and pseudo-labels (fifth column). Segmentations are overlaid in blue. Rows correspond to the highest difference (best results), the 25th, 50th, and 75th percentile differences, and the lowest difference (worst results) of the descending sorted F1-score differences between the top performing U-Net architecture (last column) and its counterpart trained with real-world images and ground truth (second column).
Table 1.
DeepLabV3 performance on the test dataset (FSSD) based on training with the real dataset and ground truths (Dr), the real dataset and pseudo-labels (Dpl), the text-to-image synthetic dataset (SDs), the semi-synthetic image inpainting dataset (SDip), and their union, unfiltered (SDall) and filtered (SDfilt), when it is combined with the real-world dataset Dr.
| Training Dataset | Acc (%) | IoU (%) | Pr (%) | Rec (%) | F1 (%) |
| Dr | 79.48 | 60.03 | 67.39 | 84.62 | 75.03 |
| Dpl | 78.03 | 58.75 | 65.04 | 85.85 | 74.01 |
| SDs | 76.48 | 55.02 | 64.47 | 78.97 | 70.99 |
| SDip | 74.54 | 54.70 | 60.85 | 84.42 | 70.72 |
| Dr ∪ SDall | 79.21 | 60.94 | 65.89 | 89.02 | 75.73 |
| Dr ∪ SDfilt | 79.55 | 61.34 | 66.34 | 89.06 | 76.04 |
Table 2.
FCN-ResNet50 performance on the test dataset (FSSD) based on training with the real dataset and ground truths (Dr), the real dataset and pseudo-labels (Dpl), the text-to-image synthetic dataset (SDs), the semi-synthetic image inpainting dataset (SDip), and their union, unfiltered (SDall) and filtered (SDfilt), when it is combined with the real-world dataset Dr.
| Training Dataset | Acc (%) | IoU (%) | Pr (%) | Rec (%) | F1 (%) |
| Dr | 83.60 | 66.90 | 71.65 | 90.98 | 80.16 |
| Dpl | 83.64 | 67.67 | 70.72 | 94.00 | 80.72 |
| SDs | 81.35 | 64.80 | 67.48 | 94.22 | 78.64 |
| SDip | 81.28 | 64.44 | 67.65 | 93.14 | 78.38 |
| Dr ∪ SDall | 83.74 | 67.82 | 70.84 | 94.09 | 80.83 |
| Dr ∪ SDfilt | 84.22 | 68.12 | 72.08 | 92.54 | 81.04 |
Table 3.
U-Net performance on the test dataset (FSSD) based on training with the real dataset and ground truths (Dr), the real dataset and pseudo-labels (Dpl), the text-to-image synthetic dataset (SDs), the semi-synthetic image inpainting dataset (SDip), and their union, unfiltered (SDall) and filtered (SDfilt), when it is combined with the real-world dataset Dr.
| Training Dataset | Acc (%) | IoU (%) | Pr (%) | Rec (%) | F1 (%) |
| Dr | 81.03 | 62.62 | 68.95 | 87.21 | 77.01 |
| Dpl | 80.33 | 63.04 | 66.65 | 92.07 | 77.33 |
| SDs | 81.45 | 64.04 | 68.56 | 90.68 | 78.08 |
| SDip | 81.11 | 64.04 | 67.63 | 92.34 | 78.08 |
| Dr ∪ SDall | 81.91 | 65.52 | 68.21 | 94.32 | 79.17 |
| Dr ∪ SDfilt | 86.97 | 72.00 | 76.85 | 91.94 | 83.72 |
Table 4.
Impact of the z-score threshold parameter k on the segmentation performance of different architectures trained on the filtered synthetic dataset. A lower k value corresponds to stricter filtering of synthetic images.
| Method | Thresh. Par. (k) | Filtered Imgs | Acc (%) | IoU (%) | Pr (%) | Rec (%) | F1 (%) |
| DeepLabV3 | - | 0/580 | 79.21 | 60.94 | 65.89 | 89.02 | 75.73 |
| DeepLabV3 | 3 | 3/580 | 79.11 | 60.76 | 65.82 | 88.77 | 75.59 |
| DeepLabV3 | 2 | 23/580 | 78.68 | 59.69 | 65.73 | 86.67 | 74.76 |
| DeepLabV3 | 1.5 | 53/580 | 79.55 | 61.34 | 66.34 | 89.06 | 76.04 |
| FCN-ResNet50 | - | 0/580 | 83.74 | 67.82 | 70.84 | 94.09 | 80.83 |
| FCN-ResNet50 | 3 | 3/580 | 83.56 | 67.64 | 70.51 | 94.32 | 80.70 |
| FCN-ResNet50 | 2 | 23/580 | 82.85 | 66.90 | 69.27 | 95.14 | 80.17 |
| FCN-ResNet50 | 1.5 | 53/580 | 84.22 | 68.12 | 72.08 | 92.54 | 81.04 |
| U-Net | - | 0/580 | 81.91 | 65.52 | 68.21 | 94.32 | 79.17 |
| U-Net | 3 | 3/580 | 84.67 | 68.82 | 72.67 | 92.85 | 81.53 |
| U-Net | 2 | 23/580 | 87.57 | 72.83 | 78.17 | 91.42 | 84.28 |
| U-Net | 1.5 | 53/580 | 86.97 | 72.00 | 76.85 | 91.94 | 83.72 |