1. Introduction
Attention mechanisms [1,2,3] have become a fundamental component of deep learning models, including in computer vision, where spatial understanding plays a crucial role. A wide range of attention mechanisms have been designed to enable models to focus on the most relevant parts of input images, leading to improved performance and enhanced interpretability of the models’ decision-making processes [4,5,6,7].
The origins of attention in deep learning can be traced back to revolutionary works in natural language processing (NLP), where attention was introduced as a way to address the limitations of traditional sequence-to-sequence models [1], such as their inability to effectively capture long-range dependencies. The initial success of attention mechanisms in NLP tasks such as machine translation and language modeling [8,9,10] inspired researchers to explore their potential in other domains, including computer vision.
In the computer vision domain, attention mechanisms have been adapted to various tasks, such as image classification, object detection, and semantic segmentation. The key idea behind these attention mechanisms is to allow the model to dynamically focus on the most relevant spatial regions of the input image, rather than treating all regions equally. This selective focus has been shown to improve the model’s performance and provide better interpretability of its decision-making [11].
However, the traditional approaches to attention mechanisms in computer vision often suffer from a lack of smoothness in the attention maps, resulting in sharp transitions that negatively affect model generalization. This problem of spatial incoherence is particularly pronounced in the task of semantic segmentation, where accurate pixel-level predictions require a detailed understanding of the spatial relationships within the image [12].
Semantic segmentation, the task of assigning a semantic label to each pixel in an image, relies heavily on capturing fine-grained details and understanding the spatial context. Existing attention mechanisms struggle to effectively capture these detailed relationships due to the abrupt pixel-level transitions in their attention maps [13,14]. This inconsistency in the attention maps often leads to inaccurate segmentation boundary predictions and a tendency for the model to overfit the training data, limiting its ability to generalize to unseen images.
Additionally, the lack of smoothness in attention maps can make models susceptible to noise [15]. Small perturbations in the input image may drastically alter the attention weights, causing unstable predictions [16,17]. Such sensitivity to noise is particularly problematic in real-world scenarios, where images are often corrupted by artifacts or imperfections.
To address these limitations, we present a new approach called Smooth Attention, which incorporates a smoothness constraint that encourages gradual changes in attention weights and mitigates the risks of sharp transitions and noise sensitivity. By enabling a detailed understanding of spatial relationships within the image, Smooth Attention delivers stronger performance on tasks where spatial coherence is crucial, such as semantic segmentation.
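To make the idea concrete, the following is a minimal sketch of one way such a constraint could be realized, assuming a total-variation-style penalty on a single-channel spatial attention map; the module name, the sigmoid scoring head, and the thresholded penalty are illustrative assumptions rather than necessarily the exact formulation used in our experiments.

```python
import torch
import torch.nn as nn

class SmoothSpatialAttention(nn.Module):
    """Spatial attention with a smoothness penalty on the attention map.

    Neighbor-to-neighbor differences in the attention weights that exceed
    `smooth_threshold` are penalized; since the weights lie in [0, 1], any
    threshold above 1.0 disables the constraint, matching the no-smoothing
    baseline used in the experiments.
    """

    def __init__(self, in_channels: int, smooth_threshold: float = 0.4):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.smooth_threshold = smooth_threshold

    def forward(self, x: torch.Tensor):
        attn = torch.sigmoid(self.score(x))  # (B, 1, H, W), values in [0, 1]
        # Absolute differences between vertically and horizontally adjacent weights.
        dh = (attn[:, :, 1:, :] - attn[:, :, :-1, :]).abs()
        dw = (attn[:, :, :, 1:] - attn[:, :, :, :-1]).abs()
        # Penalize only transitions sharper than the allowed threshold.
        penalty = torch.relu(dh - self.smooth_threshold).mean() \
                + torch.relu(dw - self.smooth_threshold).mean()
        return x * attn, penalty  # the penalty term is added to the task loss
```

During training, the returned penalty would be weighted and added to the segmentation loss, so that gradients push neighboring attention weights toward gradual transitions.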
4. Experiments
To demonstrate the effectiveness of Smooth Attention, we perform experiments on five diverse image segmentation datasets: the Caltech-UCSD Birds-200-2011 [37], Large-Scale Dataset for Segmentation and Classification [38], Fire Segmentation Image Dataset [39], Kvasir-Instrument: Diagnostic and Therapeutic Tool Segmentation Dataset in Gastrointestinal Endoscopy [40], and Flood Semantic Segmentation Dataset [41].
To explore the influence of the attention module, we implement a custom model based on the U-Net architecture [42,43,44], a popular choice for semantic segmentation tasks. We use ResNet18 [45] as the encoder, followed by the Smooth Attention mechanism to help the model focus on important features. The decoder is implemented as a series of transposed convolutions that upscale the feature map [46]. Each transposed convolution is followed by a ReLU activation, except for the last one. The decoder gradually increases the spatial dimensions while reducing the number of channels [47,48]. The final layer outputs as many channels as there are segmentation classes. The complete model architecture is demonstrated in Figure 1.
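For illustration, the following is a condensed sketch of the described pipeline, assuming torchvision’s ResNet18 as the encoder and reusing the SmoothSpatialAttention sketch above; the decoder’s channel progression is an assumption chosen to undo the encoder’s 32× downsampling, and U-Net skip connections are omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SmoothAttentionUNet(nn.Module):
    """ResNet18 encoder -> Smooth Attention -> transposed-convolution decoder."""

    def __init__(self, num_classes: int, smooth_threshold: float = 0.4):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep all ResNet18 stages up to the final 512-channel feature map
        # (drops the average-pooling and fully connected layers).
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.attention = SmoothSpatialAttention(512, smooth_threshold)
        # Five stride-2 transposed convolutions undo the encoder's 32x
        # downsampling while shrinking channels; ReLU follows every layer
        # except the last, which emits one channel per segmentation class.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, num_classes, kernel_size=2, stride=2),
        )

    def forward(self, x: torch.Tensor):
        features = self.encoder(x)              # (B, 512, H/32, W/32)
        attended, penalty = self.attention(features)
        return self.decoder(attended), penalty  # logits at input resolution
```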
We perform experiments across multiple smoothness thresholds (from 0.1 to 0.9 in steps of 0.1) for each dataset and compare the results with a threshold of 2.0, which represents the model without the attention mechanism’s smoothing in place (any value above 1.0 means no smoothing constraint is applied). All datasets are split into 80% for training and 20% for testing. We choose IoU, Dice coefficient, Test Accuracy, Test Precision, Test Recall, and Test F1 Score as the metrics for comparing the final results [49].
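For reference, here is a minimal sketch of the two overlap metrics for binary masks, following their standard definitions; the helper name and the epsilon stabilizer are illustrative choices.

```python
import torch

def iou_and_dice(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """Mean IoU and Dice coefficient over a batch of binary masks (B, H, W)."""
    pred, target = pred.bool(), target.bool()
    intersection = (pred & target).sum(dim=(1, 2)).float()
    union = (pred | target).sum(dim=(1, 2)).float()
    total = pred.sum(dim=(1, 2)).float() + target.sum(dim=(1, 2)).float()
    iou = (intersection + eps) / (union + eps)       # |A ∩ B| / |A ∪ B|
    dice = (2 * intersection + eps) / (total + eps)  # 2|A ∩ B| / (|A| + |B|)
    return iou.mean().item(), dice.mean().item()
```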
The results for Caltech-UCSD Birds-200-2011 [37] can be observed in Table 1. The superior performance of the Smooth Attention mechanism at lower smoothness thresholds (0.1, 0.2, 0.3) compared to the model without attention (threshold of 2.0) suggests that the attention module effectively helps the model focus on the most relevant features in the image segmentation task. By applying the smoothness constraint, the attention mechanism selectively highlights the most informative regions of the input, leading to better segmentation accuracy, IoU, and Dice coefficient.
Test Recall values remain consistently high across all smoothness thresholds, showing that the model correctly identifies most of the positive instances in the test set regardless of the attention mechanism’s configuration. Similar to IoU, the Dice coefficient peaks at a lower smoothness threshold, demonstrating that low levels of attention smoothness provide a better balance between focus and flexibility in feature selection. At the same time, the highest accuracy and precision occur at a higher threshold, suggesting that for overall pixel-wise classification a smoother attention map reduces noise and the number of false positives. Recall peaks at the lowest smoothness threshold, since true positives are captured more easily under a strict constraint: by applying a stricter smoothness constraint, the attention module captures the important visual cues and maximizes the model’s ability to detect positive instances in the data.
The experimental results for the Large-Scale Dataset for Segmentation and Classification [38] can be observed in Table 2. The stronger IoU and Dice coefficient performance at lower smoothing thresholds indicates that the model is more effective at identifying variation in the input images under strict thresholds. This behavior reflects the considerable visual diversity within the fish segmentation data, highlighting the need for fine-grained approaches that capture the complexity of the images.
Consistently high accuracy scores across all thresholds, including 2.0 (no attention), indicate low background complexity in the image data, which helps the model segment objects accurately with and without smoothing. Still, slight improvements can be seen when attention smoothing is applied, with peak scores at thresholds of 0.2, 0.5, and 0.7.
Precision is best at a threshold of 0.7, showing accurate prediction of the pixel-level distribution. Higher thresholds can be more beneficial for tasks requiring broader feature recognition, helped by greater noise suppression and reduced sensitivity to variation in the input data.
Moreover, the high Recall and F1 Score values at lower thresholds clearly demonstrate the model’s effectiveness in capturing critical details under strict smoothing masks.
Table 3 demonstrates the results for the Fire Segmentation Image Dataset [39]. The IoU and Dice coefficient peak at a smoothing threshold of 0.4, with strong performance also observed at 0.5 and 0.6. This behavior can be attributed to the particular characteristics of the fire images in the dataset: fire scenes typically exhibit complex, irregular shapes with varying intensity and color gradients, making them challenging to segment accurately.
The dataset’s nature, featuring dynamic fire boundaries and potential smoke interference, explains why moderate smoothing thresholds (0.4-0.6) outperform others. At these levels, the algorithm effectively balances detail preservation and noise reduction. Lower thresholds (0.1-0.3) likely retain too much noise and too many small, irrelevant features, while higher thresholds (0.7-0.9) risk oversimplifying the fire’s complex structure.
Consistently high Accuracy and Recall values reflect the model’s ability to correctly identify true positives and relevant instances in the image dataset, regardless of whether smoothing is applied. However, attention smoothing does yield marginally improved performance on these metrics.
The balanced approach to smoothing not only improves traditional segmentation metrics like IoU and Dice coefficient but also enhances the model’s overall predictive capabilities across a broader range of performance indicators.
Table 4 presents the results for the Kvasir-Instrument: Diagnostic and Therapeutic Tool Segmentation Dataset in Gastrointestinal Endoscopy [40]. The dataset contains images of commonly used gastrointestinal endoscopy surgical tools, which pose challenges for segmentation models due to the shape complexity of the tools captured in the scenes.
The best performance across most evaluation metrics is observed at lower thresholds (0.1-0.3), with peak values achieved at a threshold of 0.2. This behavior clearly indicates the importance of fine-grained attention in the medical domain, where decisions based on the spatial understanding of organ imagery and surgical tools can play a major role in life-and-death situations.
In segmentation and classification tasks, medical instruments are commonly segmented incorrectly and mistaken for one another. A lower attention threshold strictly penalizes false positives, leading to more accurate segmentation results, as reflected in the metric values.
Additionally, the high Dice and IoU values at the threshold of 0.2 show that the model captures image-specific details and accurately identifies both linear and non-linear segmentation boundaries. At the same time, the high Accuracy, Precision, Recall, and F1 Score values at the lower thresholds highlight the model’s sensitivity to local pixel-level variations, which is crucial in the medical field.
Table 5 reveals the results for the Flood Semantic Segmentation Dataset [41]. The dataset covers a wide range of flood disaster imagery sensed across multiple geographical regions, introducing considerable variation into the segmentation task.
Optimal performance is observed at mid-range thresholds (0.4-0.6), with strong IoU, Dice, and Precision results. Given the fluid variations in image structure, it is important to strike a balance between border smoothing and fine-grained detail retention. Such a balance ensures that the model captures segmentation features while minimizing artifacts arising from overly aggressive smoothing [49,50].
Lower thresholds yield high recall but lower precision. The fine detail retained at these thresholds ensures that most true positives are detected; however, it also leaves room for false positives, since the complexity of flooded areas across varying geographical terrains (such as reflections, shallow water, and debris) can cause the model to overreact to minor variations in the pixel distribution.
Higher thresholds tend to perform less effectively due to a tendency to under-segment the shape complexities of flooded and non-flooded areas. This loss of detail obscures critical features, making it challenging to accurately distinguish between waterlogged regions and surrounding terrain. Consequently, important contextual information is lost, leading to inaccuracies in assessing the extent of flooding.
We compare the attention values as heatmaps for the Flood Semantic Segmentation Dataset [41] at different thresholds in Figure 2 and Figure 3.
Figure 2 illustrates the attention heatmap at a threshold of 2.0 (without smoothing applied), resulting in a more fragmented representation of attention values across the map. This higher threshold leads to sharper, isolated areas of focus, which obscure subtle relationships in the image data. Conversely, Figure 3 presents the attention heatmap at a threshold of 0.4 (with effective smoothing applied), showing a more cohesive and visually integrated distribution of attention values. The smoothing process creates gradual transitions between areas of high and low attention, enhancing spatial interpretability and yielding more stable segmentation performance.
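A short sketch of how such side-by-side heatmaps can be rendered, assuming the attention maps are available as 2D arrays with values in [0, 1]; the function name, colormap, and figure layout are illustrative choices.

```python
import matplotlib.pyplot as plt

def compare_attention_heatmaps(attn_no_smoothing, attn_smoothed):
    """Render two attention maps (H, W) side by side on a shared color scale."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    panels = [(attn_no_smoothing, "threshold 2.0 (no smoothing)"),
              (attn_smoothed, "threshold 0.4 (smoothed)")]
    for ax, (attn, title) in zip(axes, panels):
        image = ax.imshow(attn, cmap="viridis", vmin=0.0, vmax=1.0)
        ax.set_title(title)
        ax.axis("off")
    fig.colorbar(image, ax=list(axes), shrink=0.8)  # shared scale for comparison
    plt.show()
```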
Building on the insights from the previous figures, Figure 4 and Figure 5 introduce 3D heatmap representations of attention for the Flood Semantic Segmentation Dataset [41] at the same respective thresholds.
Figure 4 shows the 3D heatmap at a threshold of 2.0 (without smoothing applied), clearly revealing a sparse and jagged attention landscape. The isolated peaks in this visualization indicate areas of high attention, but the overall structure appears disjointed, making it difficult to discern the relationships between different attention regions. In contrast, Figure 5 presents the 3D heatmap at a threshold of 0.4 (with effective smoothing applied), which significantly alters the visual interpretation of attention. The smoothing process results in a more fluid attention surface, giving a clearer picture of which regions help the model complete the task accurately. After the smoothing constraint is applied, the attention regions shift due to the averaging effect, which reduces noise and emphasizes broader trends in the distribution of values. This cohesive representation not only enhances the visibility of region-level importance but also facilitates the identification of underlying patterns within the image data [51].
Figure 6 and Figure 7 explore the representation of attention values using 3D scatter plots, providing a distinct perspective on the attention distribution under varying levels of smoothing for the Large-Scale Dataset for Segmentation and Classification [38]. Figure 6 illustrates the attention values at a threshold of 0.9 (with light smoothing applied). The attention points are relatively sparse, with some clusters indicating areas of significant focus; however, the higher threshold limits the mechanism’s ability to identify finer details in the distribution map. The light smoothing enhances the overall visual coherence, giving a clearer view of the focus area where the fish is depicted, yet it still retains some of the original fragmentation.
In contrast, Figure 7 presents the attention values at a threshold of 0.1 (with strong smoothing applied). This setting results in a dense, interconnected scatter plot in which attention points are distributed uniformly across the attention region. The strong smoothing effectively blurs the boundaries between areas of high and low attention, creating a continuous representation of attention across the image data. Figure 7 also highlights the tight relationships between different regions, making it easier to identify patterns that could otherwise be missed due to the original attention map’s spatial inconsistency.
Figure 8 and Figure 9 present the final predicted segmentation results for the Flood Semantic Segmentation Dataset [41]. Figure 8 illustrates the segmentation output at a threshold of 2.0 (without smoothing applied): the segmented images are characterized by abrupt edges and fragmented regions. While some areas of interest are accurately captured, the lack of smoothing leads to a disjointed segmentation that does not effectively represent the underlying structural details in the spatial data.
In contrast, Figure 9 demonstrates the predicted segmentation mask at a threshold of 0.4 (with effective smoothing applied). With the Smooth Attention method, the result is a more coherent and unified segmentation, with smoother transitions across object boundaries. Applying the smoothing constraint enhances the model’s ability to capture complex shapes and pixel-level relationships, producing a segmentation mask that outlines the detailed variation within the images.
Overall, the introduction of Smooth Attention notably improves the spatial distribution of values within the attention map, helping the model achieve better segmentation results. By incorporating a smoothness constraint, the Smooth Attention method encourages gradual changes in attention weights, which effectively mitigates noise sensitivity in the attention value distribution, reduces sharp transitions, and fosters a deeper understanding of the spatial relationships in the image data.