1. Introduction
In the field of autonomous driving, object detection is a crucial technology [1], and its accuracy and robustness are of paramount importance for practical applications [2]. In foggy weather, however, attenuated light and blurred object edges degrade algorithm performance, which in turn affects the safety and reliability of autonomous vehicles [3]. Research on object detection in foggy scenes is therefore of considerable significance.
In recent years, researchers have made notable progress on object detection in foggy weather conditions [4,5]. Traditional methods rely primarily on conventional computer vision techniques such as edge detection, filtering, and background modeling. While these methods can partially handle foggy images, their effectiveness in complex scenes and under heavy fog is limited. To address detection in complex foggy scenes, scholars have explored physical models of foggy image formation. He et al. [6] proposed a single-image dehazing method based on the dark channel prior, while Zhu et al. [7] presented a fast single-image dehazing approach based on the color attenuation prior. Such dehazing methods improve the visibility of foggy images and thereby enhance detection accuracy. However, physical-model-based methods require an estimate of fog density, which makes it difficult to handle the varying fog densities found in complex scenes.
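For illustration, the following minimal sketch shows the dark-channel-prior idea described above. It is a simplified stand-in for He et al.'s method [6]: the patch size, the omega weighting factor, and the atmospheric-light heuristic are assumptions chosen here, not values from the original paper.

```python
import numpy as np
import cv2  # OpenCV, used here only for the local minimum filter (erosion)

def dark_channel(img, patch=15):
    """Per-pixel minimum over RGB, followed by a local minimum filter."""
    min_rgb = img.min(axis=2)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    return cv2.erode(min_rgb, kernel)

def dehaze_dark_channel(img, omega=0.95, t0=0.1, patch=15):
    """Dark-channel-prior dehazing sketch: I = J*t + A*(1-t), solve for J.
    img: HxWx3 RGB array scaled to [0, 1]."""
    dark = dark_channel(img, patch)
    # Estimate atmospheric light A from the brightest 0.1% of dark-channel pixels.
    n = max(int(dark.size * 0.001), 1)
    idx = np.unravel_index(np.argsort(dark, axis=None)[-n:], dark.shape)
    A = img[idx].max(axis=0)
    # Transmission estimate and scene radiance recovery.
    t = 1.0 - omega * dark_channel(img / A, patch)
    t = np.clip(t, t0, 1.0)[..., None]
    return np.clip((img - A) / t + A, 0.0, 1.0)
```

Even such a rough recovery of scene radiance illustrates why dehazing can help a downstream detector, and also why a single fog-density estimate breaks down when fog density varies across a scene.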
With the continuous development of deep learning techniques, deep learning has gradually become a research hotspot in object detection [8,9]. Compared with traditional methods, deep learning models learn the task directly from raw data and generalize better when trained on large-scale datasets. Deep learning-based object detection algorithms can be categorized into two-stage detectors and one-stage detectors. Two-stage detectors first generate a set of candidate boxes and then perform classification and position regression for each candidate box. Faster R-CNN [10] is the most representative algorithm in this category; it employs a Region Proposal Network (RPN) [10] to generate candidate boxes and uses ROI Pooling [11] for per-candidate classification and position regression. For object detection in foggy weather, Chen et al. [12] proposed a domain-adaptive method that aligns features between the source and target domains, thereby improving detection performance in the target domain. However, region-proposal-based methods require more computational resources and incur higher costs, making them less suitable for real-time applications with stringent timing requirements.
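As a point of reference for the two-stage pipeline, the snippet below runs an off-the-shelf Faster R-CNN from torchvision (assuming a recent torchvision release). It only illustrates the propose-then-classify workflow; it is not the domain-adaptive detector of Chen et al. [12], and the confidence threshold is an arbitrary choice.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pre-trained two-stage detector: an RPN proposes candidate boxes,
# then the RoI heads classify and refine each proposal.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)        # stand-in for a (foggy) RGB frame in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]     # dict with 'boxes', 'labels', 'scores'

keep = prediction["scores"] > 0.5      # simple confidence threshold
print(prediction["boxes"][keep], prediction["labels"][keep])
```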
One-stage detectors perform classification and position regression directly on the input image, without generating candidate boxes. The most representative algorithms in this category are the YOLO series [13,14,15] and SSD [16]. YOLO divides the image into a grid and predicts bounding boxes and class probabilities for each grid cell, while SSD predicts bounding boxes of different sizes on different feature layers. Compared with two-stage detectors, one-stage detectors offer a significant speed advantage, making them particularly suitable for real-time applications; however, improving their accuracy under complex weather and lighting conditions remains a challenge. Hnewa et al. [17] proposed a cross-domain object detection method that uses multi-scale features and domain adaptation to enhance detection performance in complex weather. Liu et al. [18] designed a fully differentiable image processing module on top of YOLOv3 [13] for object detection in foggy and low-light scenes. Although this image-adaptive approach improves detection accuracy, it also introduces some undesirable noise.
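To make the grid-based prediction concrete, the toy function below decodes a YOLO-style output tensor into boxes and scores. The tensor layout, anchor handling, and size decoding are simplified assumptions for illustration and do not reproduce the exact YOLOv5 decoding.

```python
import torch

def decode_yolo_grid(pred, stride=32, conf_thresh=0.25):
    """pred: (S, S, 5 + C) raw head output for one image and one anchor.
    Channels: tx, ty, tw, th, objectness, class logits."""
    S = pred.shape[0]
    gy, gx = torch.meshgrid(torch.arange(S), torch.arange(S), indexing="ij")
    xy = (pred[..., :2].sigmoid() + torch.stack((gx, gy), dim=-1)) * stride  # cell offset -> pixels
    wh = pred[..., 2:4].exp() * stride                                       # simplified size decoding
    obj = pred[..., 4].sigmoid()
    cls_prob, cls_id = pred[..., 5:].sigmoid().max(dim=-1)
    score = obj * cls_prob
    keep = score > conf_thresh
    boxes = torch.cat((xy - wh / 2, xy + wh / 2), dim=-1)                    # (x1, y1, x2, y2)
    return boxes[keep], score[keep], cls_id[keep]

# Example: a random 20x20 grid with 80 classes
boxes, scores, classes = decode_yolo_grid(torch.randn(20, 20, 85))
```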
Although the studies above improve detection accuracy in foggy weather, most of them focus primarily on defogging and image enhancement [19]. This study aims to enable object detection algorithms to detect clearly in foggy scenes without any preprocessing of the original image. In recent years, Transformer models [20,21,22] have seen increasing use in computer vision; they leverage self-attention mechanisms to capture relationships within an image and thereby enhance model performance. In this study, the Swin Transformer [22] component is incorporated into the YOLOv5s model to improve detection accuracy in adverse weather conditions.
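The operation that Swin Transformer confines to local windows is ordinary scaled dot-product self-attention. The sketch below shows that core computation on a batch of windows; the relative position bias, window shifting, and trained projection weights of the real Swin block are omitted or replaced with untrained stand-ins.

```python
import torch
import torch.nn.functional as F

def window_self_attention(x, num_heads=4):
    """x: (num_windows, tokens_per_window, dim). Plain multi-head
    self-attention applied independently inside each window."""
    B, N, C = x.shape
    head_dim = C // num_heads
    qkv = torch.nn.Linear(C, 3 * C)(x)                        # (B, N, 3C); toy, untrained weights
    q, k, v = qkv.reshape(B, N, 3, num_heads, head_dim).permute(2, 0, 3, 1, 4)
    attn = (q @ k.transpose(-2, -1)) / head_dim ** 0.5         # (B, heads, N, N) similarity scores
    out = F.softmax(attn, dim=-1) @ v                          # attention-weighted sum of values
    return out.transpose(1, 2).reshape(B, N, C)

# 16 windows of 7x7 tokens with 96 channels, as in an early Swin-T stage
y = window_self_attention(torch.randn(16, 49, 96))
```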
The main contributions of this study are as follows:
On the basis of the YOLOv5s model, we introduce a multi-scale attention feature detection layer called SwinFocus, based on the Swin Transformer, to better capture the correlations among different regions in foggy images;
The traditional YOLO Head is replaced with a decoupled head, which decomposes the object detection task into different subtasks, reducing the model’s reliance on specific regions in the input image;
In the non-maximum suppression (NMS) stage, Soft-NMS is employed to better preserve target information, effectively reducing false positives and false negatives (a brief sketch is given after this list).
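As referenced above, the following is a minimal sketch of the linear-decay variant of Soft-NMS. The decay function, thresholds, and any coupling with the rest of the YOLOv5s-Fog pipeline are assumptions made for illustration, not the paper's exact configuration.

```python
import numpy as np

def soft_nms(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    """Linear Soft-NMS: instead of discarding boxes that overlap the current
    top box, decay their scores in proportion to the IoU.
    boxes: (N, 4) numpy array as x1, y1, x2, y2; scores: (N,) numpy array."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep, order = [], list(range(len(scores)))
    while order:
        i = max(order, key=lambda j: scores[j])   # current highest-scoring box
        order.remove(i)
        keep.append(i)
        for j in order:
            x1 = max(boxes[i, 0], boxes[j, 0]); y1 = max(boxes[i, 1], boxes[j, 1])
            x2 = min(boxes[i, 2], boxes[j, 2]); y2 = min(boxes[i, 3], boxes[j, 3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            iou = inter / (areas[i] + areas[j] - inter)
            if iou > iou_thresh:
                scores[j] *= 1.0 - iou            # decay instead of hard suppression
        order = [j for j in order if scores[j] > score_thresh]
    return keep, scores
```

Because heavily overlapping boxes are down-weighted rather than removed outright, nearby objects that genuinely overlap (a common situation in dense foggy traffic scenes) are less likely to be missed.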
The remaining sections of this paper are organized as follows. Section 2 provides a brief overview of the original YOLOv5s model and elaborates on the innovations proposed in this study. Section 3 presents the dataset, experimental details, and results. Finally, Section 4 summarizes our work and outlines future research directions.
Figure 1.
Operational procedure of YOLOv5s-Fog. This framework incorporates an augmented predictive feature layer to bolster the network’s regional comprehension. Additionally, we employ a decoupled head to effectively address scenarios characterized by diminished contrast and indistinct boundaries. Lastly, the Soft-NMS technique is employed for the integration of bounding boxes.
Figure 2.
Swin Transformer Architecture.
Figure 3.
The consecutive Swin Transformer blocks. The Swin Transformer block first introduces Window Multi-Head Self-Attention (W-MSA), which significantly reduces computational complexity compared to standard MSA but performs self-attention only within each window. To enable information propagation between windows, Shifted Window Multi-Head Self-Attention (SW-MSA) is subsequently introduced.
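For intuition about W-MSA and SW-MSA, the sketch below partitions a feature map into non-overlapping windows and applies the cyclic shift used between consecutive blocks. The attention masking that accompanies the shift in the actual Swin Transformer is omitted, and the resolutions are illustrative.

```python
import torch

def window_partition(x, window=7):
    """x: (B, H, W, C) -> (B * num_windows, window*window, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)

x = torch.randn(1, 56, 56, 96)                          # illustrative stage-1 resolution
wins = window_partition(x)                               # W-MSA: attention runs inside each 7x7 window
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))    # SW-MSA: cyclic shift by window // 2 first
wins_shifted = window_partition(shifted)                 # windows now straddle the previous boundaries
```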
Figure 4.
Decoupled head structure. The decoupled head is a multi-task learning approach that divides object detection into two steps: image classification and object localization within the image. (YOLOv5s-Fog incorporates four detection heads.)
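A minimal PyTorch sketch of the decoupled-head idea is shown below; the channel widths, activation choice, and branch depths are illustrative assumptions, not the exact YOLOv5s-Fog configuration.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Separate branches for classification and for box regression + objectness."""
    def __init__(self, in_ch=256, num_classes=5, num_anchors=3):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, in_ch, 1)
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_anchors * num_classes, 1))
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_anchors * (4 + 1), 1))   # box (4) + objectness (1)

    def forward(self, x):
        x = self.stem(x)
        return self.cls_branch(x), self.reg_branch(x)

cls_out, reg_out = DecoupledHead()(torch.randn(1, 256, 20, 20))
```

Keeping the two branches separate lets each specialize, which is the usual motivation for decoupled heads when contrast is low and boundaries are indistinct.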
Figure 5.
The issues that can occur during the NMS post-processing stage. In panel (a), there are two reliable pedestrian detections (the green and red bounding boxes) with scores of 0.85 and 0.75, respectively. However, because the two boxes overlap significantly, the lower-scoring red box is suppressed and its pedestrian is missed. The situation in panel (b) is similar to that in panel (a).
Figure 6.
The network architecture of YOLOv5s-Fog introduces the following enhancements compared to the original version: (a) Addition of a target detection layer called SwinFocus based on Swin Transformer. (b) Use of a decoupled detection head to accomplish the final stage of the detection task.
Figure 7.
Partial detection results of IA-YOLO, YOLOv5s, and YOLOv5s-Fog on RTTS are shown below. The first row corresponds to IA-YOLO, the second row corresponds to YOLOv5s, and the third row corresponds to YOLOv5s-Fog.
Figure 8.
The loss curve during the training process of YOLOv5s-Fog and the performance of each training stage on RTTS (bottom right) are shown.
Figure 9.
Visualization of the detection results of YOLOv5s-Fog on the RTTS dataset. The green, blue, and red boxes represent true positive (TP), false positive (FP), and false negative (FN) detections, respectively.
Table 1.
The relevant datasets used for training and testing purposes include V_C_t from VOC and COCO, V_n_ts from VOC2007_test, and RTTS, which is currently the only real-world foggy scene object detection dataset with multi-class detection labels.
| Dataset | Images | Person | Car | Bus | Bicycle | Motorcycle | Total |
|---|---|---|---|---|---|---|---|
| V_C_t | 8201 | 14012 | 3471 | 850 | 1478 | 1277 | 21088 |
| V_n_ts | 2734 | 4528 | 337 | 1201 | 213 | 325 | 6604 |
| RTTS | 4322 | 7950 | 18413 | 1838 | 534 | 862 | 29597 |
Table 2.
Experimental Setup of YOLOv5s-Fog.
| Configuration | Parameter |
|---|---|
| CPU | Intel Xeon(R) CPU E5-2678 v3 |
| GPU | Nvidia Titan Xp ×2 |
| PyTorch | 1.12 |
| CUDA | 11.1 |
| cuDNN | 8.5.0 |
Table 3.
Comparison of the performance of each method on the conventional dataset (V_n_ts) and the foggy-weather dataset (RTTS). The two rightmost columns report the mAP (%) on the two test sets, V_n_ts and RTTS.
| Methods | V_n_ts mAP (%) | RTTS mAP (%) |
|---|---|---|
| YOLOv3 [13] | 64.13 | 28.82 |
| YOLOv3-SPP [35] | 70.10 | 30.80 |
| YOLOv4 [14] | 79.84 | 35.15 |
| MSBDN [32] | / | 30.20 |
| GridDehaze [33] | / | 32.41 |
| DAYOLO [17] | 56.51 | 29.93 |
| DSNet [34] | 53.29 | 28.91 |
| IA-YOLO [18] | 72.65 | 36.73 |
| YOLOv5 [15] | 87.56 | 68.00 |
| Ours | 92.23 | 73.40 |
Table 4.
The Ablation Experiment on the RTTS Dataset.
| Methods | mAP (%) | mAP50-95 (%) | GFLOPs |
|---|---|---|---|
| YOLOv5s | 68.00 | 41.17 | 15.8 |
| YOLOv5s + SwinFocus | 70.15 (↑2.15) | 43.40 (↑2.23) | 56.2 |
| YOLOv5s + SwinFocus + Decoupled Head | 71.79 (↑1.64) | 44.38 (↑0.98) | 57.4 |
| YOLOv5s + SwinFocus + Decoupled Head + Soft-NMS | 73.40 (↑1.61) | 45.58 (↑1.20) | 59.0 |
Table 5.
The impact of incorporating each component on the precision (P) and recall (R) of the model, evaluated on the RTTS dataset.
| Methods | P (All) | P (Person) | P (Car) | P (Bus) | P (Bicycle) | P (Motorcycle) | R (All) | R (Person) | R (Car) | R (Bus) | R (Bicycle) | R (Motorcycle) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv5s | 0.87 | 0.912 | 0.926 | 0.795 | 0.86 | 0.856 | 0.489 | 0.725 | 0.504 | 0.318 | 0.485 | 0.413 |
| YOLOv5s-Fog-1 | 0.74 | 0.69 | 0.911 | 0.753 | 0.647 | 0.7 | 0.635 | 0.641 | 0.632 | 0.496 | 0.697 | 0.712 |
| YOLOv5s-Fog-2 | 0.88 | 0.924 | 0.938 | 0.835 | 0.83 | 0.88 | 0.55 | 0.735 | 0.51 | 0.397 | 0.614 | 0.493 |
| YOLOv5s-Fog-3 | 0.78 | 0.851 | 0.762 | 0.675 | 0.81 | 0.807 | 0.70 | 0.809 | 0.793 | 0.601 | 0.694 | 0.631 |