1. Introduction
Various types of small unmanned aerial vehicles (UAVs) pose a serious threat to infrastructure, hardware, and people [1]. At the same time, accurately detecting drone targets in low-resolution, visually blurred infrared images is a challenging task. There are two main problems:
1) The influence of the target itself: due to the long imaging distance, infrared targets are generally small, occupying only a few to several tens of pixels in the image. In addition, infrared targets usually have a low signal-to-clutter ratio (SCR) and are easily submerged in strong noise and cluttered backgrounds. As a result, the radiation intensity of the target is low and it lacks significant morphological features, making target detection in infrared images difficult.
2) The contradiction between the target and the detection algorithm: compared with targets in visible-light images, targets in infrared images present more challenging problems, such as the lack of shape and texture features, which leads to the weakening or even loss of the high-frequency amplitude of small targets after filtering and convolution. Moreover, although building shallow networks can improve performance in deep learning algorithms, the contradiction between high-level semantic features and high resolution remains unresolved.
Overall, because of the large variation in target size and the extremely low proportion of target pixels in infrared images, there are too many negative samples, and most of the available information is lost during algorithm operation [2]. In addition, most negative samples are easily classified, which makes it difficult to optimize the algorithm in the expected direction. Therefore, networks designed for generic objects can hardly be used to detect small infrared targets.
To detect small infrared targets, researchers have proposed many traditional methods over the past few decades. Traditional methods implement SIRST (Single-frame InfraRed Small Target) detection by measuring the incoherence between target and background. Typical approaches include filter-based methods [3,4,5], which can only suppress uniform and smooth background noise, resulting in high false-alarm rates and unstable performance in complex backgrounds. HVS-based methods [6,7,8,9] use the ratio of gray values between each pixel and its neighbouring region as an enhancement factor, which can effectively enhance the real target but cannot effectively suppress background noise. Methods based on low-rank representation [10,11,12] can adapt to infrared images with low SCR; however, in complex backgrounds they still yield high false-alarm rates for small and shape-varying targets. Most traditional methods rely heavily on hand-crafted features. They are computationally simple and require no training or learning, but designing hand-crafted features and tuning hyperparameters demand expert knowledge and considerable engineering effort.
With the development of CNNs, more data-driven methods have been applied to infrared small target detection [13,14,15,16]. Data-driven methods are suitable for more complex real scenarios and are less affected by changes in target size, shape, and background. Given a large amount of data, they exhibit strong model-fitting ability and have achieved better detection performance than traditional methods. Building on data-driven methods, convolutional segmentation networks can simultaneously produce pixel-level classification and localization output [17]. The first segmentation-based SIRST detection method, ACM [18], designed a semantic segmentation network using an asymmetric context module. On this basis, Dai et al. [19] further introduced dilated local contrast to improve the model; by combining traditional methods with deep learning and using bottom-up local attention modulation modules to embed subtle low-level details at higher levels, excellent detection performance was achieved. In [20], a balance between missed detections (MD) and false alarms (FA) was achieved by using a cGAN to build separate generator models for MD and FA as two subtasks; a discriminator for image classification then distinguishes the outputs of the two generators from ground-truth images. Zhang et al. [21] use attention mechanisms to guide a pyramid context network to detect targets: first, the feature map is partitioned to compute local correlations; second, global contextual attention computes the correlation between semantics; finally, decoded images at different scales are fused to improve detection performance. Cheng et al. [22] use visible-light imagery for drone detection: the backbone network is lightly improved with a multi-scale fusion method to better exploit shallow features, and, to address drone loss in multi-scale detection, a novel non-maximum suppression method is developed, ultimately achieving real-time detection. However, the above methods still have shortcomings. First, small-target features are still lost in the deep layers of the network, and the contradiction between high-level semantic features and high resolution remains unresolved. Second, the encoding maps generated by each downsampling layer are not well exploited. Overall, these methods overlook the characteristics of drones, which makes the detection algorithms less robust to scene changes (such as cluttered backgrounds and targets with different SCRs, shapes, and sizes).
To solve these problems, we propose a data-driven progressive feature fusion detection method (PFFNet) for infrared UAV target detection. First, global features are extracted from the input infrared image. Then, the encoding maps produced by downsampling are passed to the FSM and PFM modules, so that the deep features containing high-level semantic information and the shallow features containing rich image contours and position information can be fully fused, improving the utilization of the downsampling layers' output encoding maps. In addition, the output feature maps are fused across scales to enhance the response amplitude of infrared UAV targets in the deep network and to solve the problem of small-target feature loss in the deep layers. The high-level and shallow semantic information are superimposed through dimensional concatenation, and the confidence map is thresholded to produce the final detection result. Finally, to verify the effectiveness of PFFNet, we conducted extensive ablation studies on the FSM and PFM, as well as comparative experiments with existing methods on the SIRST Aug and IRSTD datasets. The experimental results show that each module of PFFNet improves the detection of infrared UAV targets, and that our algorithm offers stronger robustness, better detection performance, and faster target detection.
2. Methods
Given an input image I, we aim to classify each pixel with an end-to-end convolutional neural network to determine whether it belongs to a drone target, and finally output a segmentation result of the same size as I. The PFFNet detection algorithm consists of two parts: a global feature extractor and a progressive feature fusion network. The global feature extractor extracts the basic features of the input infrared image I by viewing the entire image; obtaining these basic features effectively reduces the redundant information in the image.
The progressive fusion network is divided into two modules: the Neck and the Head. The Neck includes the Pooling Pyramid Fusion Module (PFM) and the Feature Selection Module (FSM). The former enhances the feature response amplitude of the infrared drone target in the deep network. The latter acts as a bridge for information interaction between high and low layers, increasing the utilization of the downsampling output encoding maps. The Head implements the progressive fusion of feature maps of different scales and generates a segmentation mask.
As shown in Figure 1, the input image I is encoded into different dimensions and resolutions by the backbone to generate encoding maps ba (a=2,3,4). The FSM obtains the low-level spatial position information of the target's salient features from ba (a=2,3), locating the high-frequency response area to reduce the influence of redundant signals on the target's position information, and outputs the feature maps fa (a=2,3). The map b4 serves as the input of the PFM, which outputs the decoded map p. The PFM is composed of four different pooling structures in parallel, forming a pyramid network; it enhances the high-frequency response amplitude of deep target features and passes them to the FSM after upsampling. The FSM and PFM extract local target features and use the progressive fusion method to compute the stage-wise output feature maps ya (a=1,2,3). After being processed by the Ghost Module [23], ya is doubled in size and added element-wise. This process greatly simplifies small target detection by sharing the same weights across all convolution blocks, and reduces the algorithm's parameters through element-wise addition while shortening network inference time.
Then, the fused output is upsampled and concatenated along the channel dimension through convolution. We propose a multi-scale fusion strategy to progressively fuse feature maps of different sizes. The confidence map O is then obtained by applying the final threshold segmentation to the fused feature map. The backbone is mainly used to expand the receptive field and extract deep semantic features, upsampling restores the size of the feature maps, and progressive multi-scale feature fusion is achieved through upsampling and downsampling. The FSM and PFM modules ensure the feature representation of small targets in the network.
To achieve good contextual modeling ability, the simplest way is to repeatedly stack layers and increase network depth: the more layers the network has, the richer the semantic information and the larger the receptive field [24,25,26,27]. However, infrared small targets vary significantly in size and occupy a very low pixel ratio. If network depth is increased blindly, the drone target's features may disappear after multiple downsampling operations. Therefore, special modules should be designed to extract high-level features while preserving the representation of small targets with a very small pixel ratio in the deep network.
2.1. Feature Selection Module
The Feature Selection Module (FSM) consists of two parts: the Location Selection Model (LSM) and the Channel Selection Model (CSM). Because drone targets occupy a small proportion of the image, the response amplitude of the target area is easily lost or weakened during downsampling and upsampling. Our experiments show that high-level semantic features contain rich target contour information, while low-level semantic features contain accurate target location information. The FSM exploits the semantic information of each dimension to achieve information interaction between different encoding maps. Through this module, the utilization of the downsampling and upsampling output encoding maps can be effectively increased. In addition, locating and enhancing the high-frequency response areas guarantees the effectiveness of multi-scale feature fusion.
Figure 2 shows how the feature representation of small targets is retained in deep networks without losing the spatial encoding of target positions. The CSM enhances information interaction between high and low levels, while the LSM obtains the target position; combined, they form the Feature Selection Module, which fully integrates the deep features containing high-level semantic information with the shallow features containing rich image contour and position information, thereby improving the utilization of the encoding output maps. The output of the Feature Selection Module can be represented as:
where XH is the deep feature containing high-level semantic information, XL is the shallow feature containing rich image contour and position information, ⊗ and ⊕ denote element-wise multiplication and addition of vectors, and C and L denote the CSM and LSM modules, respectively.
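Since the fusion equation is not restated in text form here, the following PyTorch sketch is a hypothetical reading of it, assuming a cross-gating pattern in which the channel weights C(XH) gate the shallow feature and the location weights L(XL) gate the deep feature; `csm` and `lsm` stand for the weight-producing modules of Sections 2.2 and 2.3:

```python
import torch
import torch.nn as nn

class FSM(nn.Module):
    """Feature Selection Module (hypothetical sketch). We assume the
    cross-gating form Y = C(XH) * XL + L(XL) * XH, i.e. the per-channel
    weights from the deep feature gate the shallow feature, and the
    location map from the shallow feature gates the deep feature.
    The authors' exact equation may differ."""
    def __init__(self, csm: nn.Module, lsm: nn.Module):
        super().__init__()
        self.csm = csm   # returns per-channel weights, shape (B, C, 1, 1)
        self.lsm = lsm   # returns a location map, shape (B, 1, H, W)

    def forward(self, xh: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        # assumed fusion: gated shallow branch plus gated deep branch
        return self.csm(xh) * xl + self.lsm(xl) * xh
```

Broadcasting handles the two weight shapes automatically, so the module works with any CSM/LSM implementations that return channel and spatial weight maps respectively.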
2.2. Channel Selection Model
To solve the problem of the target-area response value being lost or weakened during upsampling of drone targets, we use the CSM to enhance the response amplitude of the target area. As shown in Figure 3(a), the channel features at each spatial position are aggregated individually. The subtle details of deep drone targets are highlighted by directionally enhancing the channel weights of the small targets' high-frequency responses. This module first performs average-pooling and max-pooling operations on the input feature map X to generate different 3D tensors xi, coupling the global information of the feature map X within its channels. Then, a 1×1 convolution evaluates the importance of each channel and computes the corresponding weight. The aggregated output can be represented as:
When i=1, x1 is the feature vector obtained by average-pooling; when i=2, x2 is the feature vector obtained by max-pooling. The point-wise convolutions use two 1×1 kernels of different dimensions, δ represents the sigmoid function, and σ represents the rectified linear unit. Inspired by [28], this paper takes r=8 as the downsampling ratio for channel reduction.
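The aggregation above can be sketched in PyTorch as follows; this is a minimal sketch assuming a CBAM-style layer order (shared bottleneck followed by a sigmoid), since the exact layer dimensions are not restated here:

```python
import torch
import torch.nn as nn

class CSM(nn.Module):
    """Channel Selection Model (sketch). The average- and max-pooled
    channel descriptors x1, x2 are passed through a shared two-layer
    point-wise (1x1) convolution bottleneck with reduction ratio r=8
    and summed; the sigmoid (delta) squashes the result into
    per-channel weights. The exact layer order is an assumption."""
    def __init__(self, channels: int, r: int = 8):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # x1: global average descriptor
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # x2: global max descriptor
        self.mlp = nn.Sequential(                 # shared 1x1 convs: C -> C/r -> C
            nn.Conv2d(channels, channels // r, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),                # sigma: rectified linear unit
            nn.Conv2d(channels // r, channels, kernel_size=1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()               # delta: sigmoid

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # per-channel importance weights, shape (B, C, 1, 1)
        return self.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
```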
2.3. Location Selection Model
The number of pixels occupied by infrared small targets is extremely low, which makes it easy to introduce interference signals during feature extraction. The LSM is used to quickly locate local regions with visual saliency. As shown in Figure 3(b), this module calculates the maximum and mean values of the input feature map X and concatenates them along the channel dimension. The module then performs a convolution on the concatenated feature map; here, a 7×7 kernel further expands the receptive field of the convolution, capturing areas with higher local response amplitudes from the lower-level network. In this way, the accurate position of the drone target within the entire feature map is computed. The high-response-amplitude area can be calculated using the following formula:
When i=1, M(*) takes the mean of the feature map X; when i=2, M(*) takes the maximum value of the feature map X. ℂ represents the channel-wise concatenation operation. The final output feature map of this module has size (1, W, H).
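A minimal PyTorch sketch of the steps above; the 7×7 convolution and the single-channel (1, W, H) output follow the description in this section, while the final sigmoid is an assumption:

```python
import torch
import torch.nn as nn

class LSM(nn.Module):
    """Location Selection Model (sketch). Channel-wise mean (i=1) and
    max (i=2) maps of X are concatenated along the channel dimension
    and passed through a 7x7 convolution, yielding a single-channel
    map of high-response locations."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        self.sigmoid = nn.Sigmoid()   # normalization to [0, 1] is an assumption

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean_map = torch.mean(x, dim=1, keepdim=True)    # i=1: mean over channels
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # i=2: max over channels
        # dimension cascade (concatenation), then 7x7 conv -> (B, 1, H, W)
        return self.sigmoid(self.conv(torch.cat([mean_map, max_map], dim=1)))
```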
2.4. Pooling Pyramid Fusion Module
Deeper neural networks can capture more detailed semantic information about the target, but this approach is not suitable for smaller targets: as the number of downsampling operations increases, drone target features (such as propellers and arms) weaken or even disappear. To solve this problem, this paper proposes a Pooling Pyramid Fusion Module (PFM) for infrared small target detection, which processes the encoding map of the deepest downsampling layer. Because the targets are small, spatial compression can be achieved through global adaptive pooling layers of different sizes, and the corresponding per-dimension mean values can be extracted to enhance the feature representation of small targets in deep networks. As shown in Figure 4, the input feature map is fed in parallel into the pyramid pooling module for decoding, generating four encoding structures of different sizes: 1×1, 2×2, 3×3, and 6×6. Then, a 1×1 convolution reduces the feature dimension to C/4. The four feature maps of different sizes are upsampled by bilinear interpolation and concatenated with the input feature map along the channel dimension. Finally, a 3×3 convolution outputs the feature map, forming a contextual pyramid from five feature maps of the same dimension but different scales.
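The decoding pipeline above can be sketched in PyTorch as follows; this is a minimal sketch in the spirit of PSPNet-style pyramid pooling, with layer details (bias, absence of normalization) being assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PFM(nn.Module):
    """Pooling Pyramid Fusion Module (sketch). The deepest encoding map
    is adaptively pooled to 1x1, 2x2, 3x3, and 6x6 grids; each branch is
    reduced to C/4 channels by a 1x1 conv, upsampled back bilinearly,
    concatenated with the input, and fused by a 3x3 conv."""
    def __init__(self, channels: int, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                           # spatial compression
                nn.Conv2d(channels, channels // 4, 1, bias=False), # reduce to C/4
            )
            for b in bins
        )
        # input C plus four C/4 branches -> 2C, fused back to C
        self.fuse = nn.Conv2d(channels + 4 * (channels // 4), channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        feats = [x] + [
            F.interpolate(branch(x), size=(h, w), mode="bilinear", align_corners=False)
            for branch in self.branches
        ]
        # contextual pyramid of five same-size maps, concatenated on channels
        return self.fuse(torch.cat(feats, dim=1))
```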
2.5. Segmentation Head
After multiple downsampling and convolution operations, the targets' feature response in the deepest layer of the convolutional network weakens. To address this problem, we propose a progressive feature fusion structure better suited to drones. As shown in Figure 5, this segmentation head fuses feature maps of different sizes and stacks information between high and low layers to enhance the high-frequency response amplitude of the target. The inputs of different sizes are processed by the Ghost Module, which generates encoding maps with the same channel count and texture information through simple linear computations. This reduces the number of convolution parameters and improves training and inference efficiency.
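A minimal PyTorch sketch of a Ghost Module in the sense of [23], with illustrative hyperparameters (the 1:1 split between primary and cheap features is an assumption):

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost Module (sketch of the GhostNet building block). Half of the
    output channels come from an ordinary 1x1 convolution; the other half
    are 'ghost' features generated from them by a cheap depthwise 3x3
    convolution, cutting parameters relative to a full convolution."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        primary = out_channels // 2
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_channels, primary, 1, bias=False),
            nn.BatchNorm2d(primary),
            nn.ReLU(inplace=True),
        )
        # cheap operation: depthwise 3x3 conv over the primary features
        self.cheap_op = nn.Sequential(
            nn.Conv2d(primary, out_channels - primary, 3, padding=1,
                      groups=primary, bias=False),
            nn.BatchNorm2d(out_channels - primary),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary_conv(x)
        return torch.cat([y, self.cheap_op(y)], dim=1)  # primary + ghost features
```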
PFFNet uses SoftIoULoss and CELoss to train the entire network and optimizes the weighted loss between the predicted and ground-truth segmentation images. They can be expressed by the following formulas:
where T and P represent the pixel values of the real target and the output prediction, respectively. Based on the initial loss values during training, α=3 and β=1 were set to balance each individual loss against the total loss so that the algorithm is optimized in the expected direction. To ensure numerical stability, this paper sets smooth=1. Different weight balances may affect performance indicators [29].
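The weighted training objective can be sketched as follows, using the α=3, β=1, and smooth=1 values from the text; binary cross-entropy with logits stands in for CELoss in this two-class setting, and the exact weighting form is our reading of the description:

```python
import torch
import torch.nn.functional as F

def soft_iou_loss(pred: torch.Tensor, target: torch.Tensor,
                  smooth: float = 1.0) -> torch.Tensor:
    """SoftIoULoss with the smoothing term smooth=1 from the paper.
    `pred` holds raw logits; `target` is the binary ground-truth mask."""
    p = torch.sigmoid(pred)
    inter = (p * target).sum()
    union = p.sum() + target.sum() - inter
    return 1.0 - (inter + smooth) / (union + smooth)

def pffnet_loss(pred: torch.Tensor, target: torch.Tensor,
                alpha: float = 3.0, beta: float = 1.0) -> torch.Tensor:
    """Assumed total loss: alpha * SoftIoU + beta * CE, with alpha=3 and
    beta=1 as set in the paper. Binary cross-entropy with logits plays
    the role of CELoss for target/background segmentation."""
    ce = F.binary_cross_entropy_with_logits(pred, target)
    return alpha * soft_iou_loss(pred, target) + beta * ce
```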
4. Conclusions
This paper proposes a fast detection method for infrared small targets: the Progressive Feature Fusion Network (PFFNet). To address the loss of target-area response values during downsampling of drone targets, the FSM is proposed; it fully fuses deep features carrying high-level semantic information with shallow features carrying rich image contour and location information, successfully achieving information exchange between downsampling layers. Then, the PFM is proposed to integrate deep features and enhance the high-frequency response amplitude from a multi-scale perspective, addressing the weakened representation of small-target features in deep networks. Meanwhile, a lightweight segmentation head suited to infrared small targets is designed to progressively fuse low-level and high-level semantics from a feature fusion perspective, further improving the utilization of features in the downsampling layers. Finally, module comparisons on two datasets of different complexities fully demonstrate the effectiveness of each module. Practicality verification and inference-time statistics on the SIRST Aug dataset confirm that PFFNet performs well for fast detection of infrared small targets. In addition, extensive comparisons with data-driven algorithms demonstrate PFFNet's ability to cope with complex scene detection tasks in terms of numerical evaluation, with better detection performance and shorter inference time.
However, some aspects of the algorithm still require further research, such as handling network overfitting and exploiting contextual information more efficiently. In future work, attention mechanisms and fusion structures will be further explored for their application in infrared drone target detection.
Author Contributions
Conceptualization, C.Z. and Z.H.; methodology, Z.H.; software, Z.H.; validation, Z.H., K.Q. and M.Y.; formal analysis, M.Y.; investigation, H.F.; resources, C.Z.; data curation, C.Z.; writing—original draft preparation, Z.H.; writing—review and editing, C.Z.; visualization, Z.H.; supervision, C.Z.; project administration, C.Z.; funding acquisition, C.Z. and M.Y.