2.2. ECA Attention Mechanism Module
In the operational scenarios of railway construction sites, protective equipment often occupies only a small portion of the image and has indistinct boundaries. To enhance the network's ability to recognize small objects, this paper introduces an attention mechanism into the YOLOv5 network model. Although the shape and color of protective equipment are the key features to attend to, monitoring devices are typically positioned far from the actual working scene, so the equipment appears as small objects with unclear boundary features that occupy a very small proportion of the frame. If the model focuses only on easily confusable features, missed detections may result; moreover, during training the model may become biased toward larger objects and neglect feature extraction for small ones. To address this issue, this study incorporates the ECA attention mechanism module [20], which helps extract features from small objects more effectively and thus improves the model's attention to them.
Attention mechanism modules, from the SE attention mechanism [21] to the more recent BiFormer attention mechanism [22], have proven significant for improving network robustness and accuracy. Numerous studies have demonstrated the effectiveness of attention mechanisms in enhancing model performance, particularly in tasks involving small objects. SENet is a classic example, in which the introduction of the SE attention mechanism module significantly improved model performance, and Wang et al. enhanced small object detection by incorporating the BiCAM attention mechanism module [23]. Building upon these research findings, this study introduces the ECA mechanism into the model, enhancing its capability to handle small objects. The structure of the ECA attention mechanism module is depicted in Figure 2.
The ECA attention mechanism is an efficient channel attention mechanism that avoids dimension reduction and enables local cross-channel interactions. Firstly, we apply channel-wise global average pooling to the input feature map, aggregating the convolutional features and obtaining a higher-level representation that summarizes the information across channels, as shown in (1).
$$y = g(X) = \frac{1}{WH}\sum_{i=1}^{W}\sum_{j=1}^{H} X_{ij} \tag{1}$$

In this context, $X \in \mathbb{R}^{W \times H \times C}$ represents the input feature map, and we define $y \in \mathbb{R}^{C}$ as the aggregated channel descriptor. Next, we employ a weight matrix $W_k$ to learn channel attention, which effectively captures local cross-channel interaction. The weight matrix $W_k$ is defined as

$$W_k = \begin{bmatrix}
w^{1,1} & \cdots & w^{1,k} & 0 & 0 & \cdots & \cdots & 0 \\
0 & w^{2,2} & \cdots & w^{2,k+1} & 0 & \cdots & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & \cdots & 0 & 0 & \cdots & w^{C,C-k+1} & \cdots & w^{C,C}
\end{bmatrix} \tag{2}$$

The weight matrix $W_k$ in this context involves $k \times C$ parameters. Channel attention can be learned by

$$\omega_i = \sigma\Bigg(\sum_{j=1}^{k} w^j y_i^j\Bigg), \quad y_i^j \in \Omega_i^k \tag{3}$$
In this equation, $\omega_i$ represents the attention weight of the $i$-th channel, $\sigma$ denotes the sigmoid function, $w^j$ is the $j$-th parameter of the convolutional kernel, $y_i^j$ represents the aggregated feature of the $(i+j)$-th channel, and $\Omega_i^k$ denotes the set consisting of the $i$-th channel and its $k$ adjacent channels. This design enables all channels to share the same set of learnable parameters. Moreover, this strategy can be efficiently implemented using a fast 1D convolution with a kernel size of $k$, as shown in (4):

$$\omega = \sigma\big(\mathrm{C1D}_k(y)\big) \tag{4}$$
In (4), $\mathrm{C1D}$ represents the one-dimensional convolution operation. The kernel size $k$ of this convolution determines the coverage range of cross-channel interaction and grows with the channel dimension $C$. Given the channel dimension $C$, the kernel size $k$ can therefore be adaptively determined, as shown in (5):
$$k = \psi(C) = \left|\frac{\log_2(C)}{\gamma} + \frac{b}{\gamma}\right|_{\mathrm{odd}} \tag{5}$$

In the equation, the term $|t|_{\mathrm{odd}}$ denotes the odd number nearest to $t$. In all experiments conducted in this study, we set $\gamma$ and $b$ to 2 and 1, respectively; for example, a layer with $C = 256$ channels yields $k = |8/2 + 1/2|_{\mathrm{odd}} = |4.5|_{\mathrm{odd}} = 5$. Finally, the original feature map is element-wise multiplied by the attention weights generated by ECA, resulting in (6):

$$\tilde{X} = \omega \odot X \tag{6}$$
In the equation, $\odot$ represents the Hadamard product, $\tilde{X}$ represents the feature map after being weighted by attention, $\omega$ denotes the attention weights, and $X$ represents the original feature map. By employing this approach, the ECA attention mechanism can effectively learn and exploit the local cross-channel interactions in the input feature map, providing favorable conditions for further optimization of the model.
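For concreteness, the following is a minimal PyTorch sketch of an ECA module following Eqs. (1)–(6). The class name, parameter names, and tensor layout are illustrative assumptions, not the original implementation.

```python
# A minimal ECA sketch (names and layout are illustrative assumptions).
import math
import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient Channel Attention: GAP -> fast 1D conv across channels -> sigmoid."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Eq. (5): adaptive kernel size k = psi(C), rounded to the nearest odd number.
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1
        # Eq. (4): 1D convolution over the channel descriptor, no dimension reduction.
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Eq. (1): channel-wise global average pooling, (B, C, H, W) -> (B, C, 1, 1).
        y = x.mean(dim=(2, 3), keepdim=True)
        # Reshape so channels form the 1D sequence: (B, C, 1, 1) -> (B, 1, C).
        y = self.conv(y.squeeze(-1).transpose(-1, -2))
        # Eq. (4): sigmoid yields per-channel weights; restore shape (B, C, 1, 1).
        w = self.sigmoid(y).transpose(-1, -2).unsqueeze(-1)
        # Eq. (6): Hadamard reweighting of the original feature map.
        return x * w


# Example: attach ECA after a backbone stage with 256 channels (k adapts to 5).
eca = ECA(256)
out = eca(torch.randn(1, 256, 40, 40))  # same shape as the input
```

Because the weights are produced by a single shared 1D convolution, the module adds only $k$ learnable parameters per layer while preserving the channel dimension.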
In summary, the incorporation of the ECA module improves the network's capability to detect small objects by avoiding dimension reduction and facilitating local cross-channel interaction. This attention mechanism plays a crucial role in enhancing the accuracy of detecting small objects at railway construction sites.
2.3. EIoU Loss Function
YOLOv5 employs the GIoU loss function [23], which incorporates the minimum enclosing rectangle of the predicted and ground truth boxes. Compared with the IoU loss, this addresses the problem that IoU provides no useful distance signal when the predicted and ground truth boxes do not intersect. The GIoU loss function is defined in (7):

$$L_{GIoU} = 1 - IoU + \frac{|C| - |A \cup B|}{|C|} \tag{7}$$
In this context, $|C|$ refers to the minimum enclosing area of the predicted bounding box and the ground truth bounding box, $IoU$ represents the ratio of the intersection to the union between the predicted box and the ground truth box, and $|A \cup B|$ represents the area of the union of the predicted box and the ground truth box.
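As a concrete reference, below is a minimal sketch of the GIoU computation in Eq. (7) for axis-aligned boxes given as (xmin, ymin, xmax, ymax); the function name and box format are illustrative assumptions.

```python
# A minimal GIoU sketch for axis-aligned boxes (xmin, ymin, xmax, ymax).

def giou(a: tuple, b: tuple) -> float:
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Intersection area (zero if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    # Union area |A ∪ B|.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Smallest enclosing box C of the two boxes.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c = cw * ch
    # GIoU = IoU - (|C| - |A ∪ B|) / |C|; the loss in Eq. (7) is 1 - GIoU.
    return iou - (c - union) / c
```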
However, in practical railway construction scenarios, various equipment and construction materials often partially occlude the targets. This occlusion degrades feature extraction and can lead to missed detections and false alarms. During the model training process, it is common for the predicted bounding boxes to capture only the unobstructed portions of the targets, meaning that the predicted boxes are confined within the unoccluded regions of the ground truth boxes. In such cases, the GIoU (Generalized Intersection over Union) metric cannot accurately determine the positional relationship between the predicted and ground truth boxes: the loss value remains largely unchanged, leading to slow convergence and reduced detection accuracy of the model.
The following section illustrates the limitations of GIoU through a specific example, as shown in Figure 3 and Figure 4, which depict a sample image with occluded targets. In the figures, the blue boxes (smaller boxes) represent the predicted bounding boxes, while the green boxes (larger boxes) represent the ground truth bounding boxes. Figure 3 and Figure 4 correspond to the predicted results before and after fitting, respectively.
Table 1 provides the coordinates of the top-left and bottom-right corners of the boxes shown in the figures.

Table 1. Coordinates of Boxes in Figures.

| Box | Xmin | Ymin | Xmax | Ymax |
|---|---|---|---|---|
| Ground truth bounding box | 213 | 83 | 257 | 146 |
| Predicted box (before fitting) | 219 | 107 | 247 | 145 |
| Predicted box (after fitting) | 219 | 100 | 247 | 138 |
When GIoU is computed for the boxes before and after fitting, the value is 0.3838 in both cases, so the GIoU loss registers no improvement from fitting. Yet we can intuitively see that the prediction after fitting is significantly better than before. The reason is that both predicted boxes lie entirely inside the ground truth box, so the enclosing rectangle coincides with the ground truth box and GIoU degenerates to plain IoU, which is identical for the two predictions. Therefore, when training the protective equipment recognition model, it is necessary to introduce a loss function that can determine the positional relationship between the predicted bounding box and the ground truth box even when the predicted box lies inside it.
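Plugging the coordinates from Table 1 into the giou() sketch above reproduces this behavior:

```python
# Checking the GIoU values for the boxes in Table 1, reusing the giou()
# sketch shown after Eq. (7).
gt = (213, 83, 257, 146)
pred_before = (219, 107, 247, 145)
pred_after = (219, 100, 247, 138)

# Both predicted boxes lie entirely inside the ground truth box, so the
# enclosing box equals the ground truth box and GIoU reduces to plain IoU:
# (28 * 38) / (44 * 63) = 1064 / 2772 ≈ 0.3838 in both cases.
print(round(giou(gt, pred_before), 4))  # 0.3838
print(round(giou(gt, pred_after), 4))   # 0.3838
```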
To tackle this issue, we propose the utilization of the EIoU loss function [25], formulated as shown in (8):

$$L_{EIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{c_w^2} + \frac{\rho^2(h, h^{gt})}{c_h^2} \tag{8}$$

where $IoU$ represents the ratio of the intersection to the union between the predicted bounding box and the ground truth box; $\rho(\cdot,\cdot)$ denotes the Euclidean distance, so $\rho(b, b^{gt})$ is the distance between the centers of the predicted box and the ground truth box, $\rho(w, w^{gt})$ is the distance between their widths, and $\rho(h, h^{gt})$ is the distance between their heights; $c$ is the diagonal length of the minimum bounding rectangle covering the predicted box and the ground truth box; $c_w$ represents the closure width between the predicted box and the ground truth box; and $c_h$ represents the closure height between the predicted box and the ground truth box.
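The following is a minimal sketch of the EIoU loss in Eq. (8) under the same box conventions as the GIoU sketch above; the function name is illustrative, and degenerate boxes with zero closure width or height are not handled.

```python
# A minimal EIoU-loss sketch for boxes (xmin, ymin, xmax, ymax);
# names are illustrative, not taken from the YOLO-EA code.

def eiou_loss(pred: tuple, gt: tuple) -> float:
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # IoU term.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union
    # Closure width c_w, closure height c_h, and squared diagonal c^2.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    # Squared Euclidean distance between the box centers.
    center_dist2 = ((px1 + px2 - gx1 - gx2) ** 2
                    + (py1 + py2 - gy1 - gy2) ** 2) / 4.0
    # Direct squared penalties on width and height differences.
    dw2 = ((px2 - px1) - (gx2 - gx1)) ** 2
    dh2 = ((py2 - py1) - (gy2 - gy1)) ** 2
    return (1 - iou
            + center_dist2 / (cw ** 2 + ch ** 2)
            + dw2 / cw ** 2
            + dh2 / ch ** 2)
```

Unlike GIoU, this loss keeps changing as a predicted box contained inside the ground truth box moves toward its center or approaches its width and height, which is exactly the situation illustrated in Figure 3 and Figure 4.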
Compared to the GIoU loss function, the EIoU loss function takes into account the distance between the centers of the predicted box and the ground truth box. This resolves GIoU's inability to determine the positional relationship between the predicted and ground truth boxes when one contains the other, i.e., when the ground truth box contains the predicted box. By considering the overlap between the two boxes and directly penalizing differences in width and height, the EIoU loss function improves the convergence speed and regression accuracy of the model.
When the EIoU function is used to evaluate the example before and after fitting, its value increases from 0.0660 to 0.0895, showing that, unlike GIoU, EIoU distinguishes the two predictions and registers the improvement achieved by fitting. By utilizing the EIoU loss function, it becomes possible to determine the positional relationship between the predicted and ground truth boxes when there is an inclusion relationship, thereby accelerating convergence and improving regression accuracy. This has been validated through experimentation, confirming the enhanced generalization ability of the model. In practical applications, the improved YOLO-EA model exhibits faster convergence and lower loss values, achieving the desired accurate detection of protective equipment.