In this section, a sky-cloud dataset for training the image inpainting method is introduced. The contents of the datasets and the evaluation metrics used in the experiments are described. Then, several existing infrared small target detection methods are compared with the proposed framework. Finally, qualitative and quantitative analyses and a discussion of the experimental results are presented.
4.1. Datasets
Most recent image inpainting methods specialize in processing RGB images and are therefore not directly applicable to infrared images. To obtain high-quality inpainting results on infrared images, a dataset of pure infrared cloud images is constructed for the training phase of image inpainting. The dataset is built by cropping and segmenting original images in MATLAB 2019b, yielding 43,500 infrared background images covering cirrus, stratus, cirrocumulus, and other cloud types. A part of the dataset is displayed in Figure 4.
Moreover, to evaluate the performance of the proposed framework, three public datasets are utilized: the SIRST dataset [43], the IRSTD-1k dataset [44] and the IRST640 dataset [16]. The SIRST dataset consists of 427 typical images and 480 instances of various scenes extracted from hundreds of real-world videos, and is widely applied to infrared dim small target detection [43]. The IRSTD-1k dataset is a publicly available dataset of infrared dim small target images with complex backgrounds; it contains 1,000 images with targets of varying shapes, low contrast and low-SCR backgrounds corrupted by clutter and noise, together with the corresponding ground-truth images [44]. The NUDT-SIRST dataset is composed of synthetic images of five main background scenes and a few real IR images, proposed in [45]. The MFIRST dataset [32] includes 9,956 training images and 100 test images drawn from realistic infrared sequences and synthetic infrared images [43]. MFIRST is utilized to train the model of the coarse detection module.
4.3. Evaluation Metrics
To examine the performance of the proposed framework and compare the effectiveness of different methods, several evaluation metrics common to segmentation-based dim small target detection are adopted: precision rate, recall rate, F-measure and the receiver operating characteristic (ROC) curve.
F-measure: The F-measure is a primary evaluation metric in object segmentation methods [33]. It balances the precision rate (Prec) and the recall rate (Rec) to represent the ability to precisely detect targets with few false alarms. Prec and Rec are defined as

Prec = TP / (TP + FP),    Rec = TP / (TP + FN),

where TP and FP denote the number of successfully detected target pixels and the number of background pixels incorrectly detected as targets, respectively, and FN denotes the number of ground-truth target pixels mistakenly recognized as background. The F-measure is then defined by Prec and Rec as [32]

F_beta = (1 + beta^2) * Prec * Rec / (beta^2 * Prec + Rec),

where beta is a constant. In this paper it is set to 1, so the F-measure is the F1-measure (F1).
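As an illustration, the pixel-level Prec, Rec and F-measure above can be computed from binary masks as in the following sketch (the function and variable names are our own, not from any compared implementation):

```python
import numpy as np

def precision_recall_f1(pred, gt, beta=1.0):
    """Pixel-level Prec, Rec and F-measure for binary masks.

    pred, gt: arrays of the same shape (nonzero = target pixel).
    TP: target pixels correctly detected; FP: background pixels
    wrongly detected as targets; FN: target pixels missed.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    prec = tp / (tp + fp) if tp + fp > 0 else 0.0
    rec = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = beta**2 * prec + rec
    f = (1 + beta**2) * prec * rec / denom if denom > 0 else 0.0
    return prec, rec, f
```

With beta = 1 this reduces to the harmonic mean of Prec and Rec, i.e., the F1 used throughout the experiments.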
Receiver Operating Characteristic: The receiver operating characteristic (ROC) curve is a target-level metric that records the tendency of the detection probability (Pd) under different false alarm rates (Fa). The detection probability and false alarm rate are defined as [33]

Pd = N_pred / N_all,    Fa = P_false / P_all,

where N_pred and N_all denote the number of truly detected targets and the number of ground-truth targets, respectively, and P_false and P_all represent the number of falsely detected pixels and the total number of pixels in the images. Here, if the distance between the centers of the ground truth and the predicted result is less than 4 pixels, the predicted result is regarded as a correct detection; otherwise, it is counted as false.
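A minimal sketch of the target-level Pd and pixel-level Fa computation, including the 4-pixel center-distance rule, might look as follows (the centroid-list inputs and names are illustrative assumptions, not the authors' code):

```python
import numpy as np

def pd_fa(pred_centers, gt_centers, pred_mask, gt_mask, dist_thresh=4.0):
    """Target-level detection probability (Pd) and pixel-level false
    alarm rate (Fa).

    pred_centers, gt_centers: lists of (row, col) target centroids.
    pred_mask, gt_mask: binary maps of predicted / true target pixels.
    A prediction counts as correct when its center lies within
    dist_thresh pixels of a not-yet-matched ground-truth center.
    """
    matched = set()
    hits = 0
    for pr, pc in pred_centers:
        for i, (gr, gc) in enumerate(gt_centers):
            if i not in matched and np.hypot(pr - gr, pc - gc) < dist_thresh:
                matched.add(i)
                hits += 1
                break
    pd = hits / len(gt_centers) if gt_centers else 0.0
    # False-alarm pixels: predicted target pixels outside the ground truth.
    false_pixels = np.logical_and(pred_mask.astype(bool),
                                  ~gt_mask.astype(bool)).sum()
    fa = false_pixels / pred_mask.size
    return pd, fa
```

Sweeping a detection threshold and recording (Fa, Pd) pairs from this routine yields one ROC curve per method.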
4.4. Contrast Methods and Parameter Setting
To assess the detection performance of the proposed framework in different scenarios, we carried out comparison experiments with methods based on different principles. Conventional methods include background prediction-based, local feature-based and sparse matrix-based approaches such as Top-Hat [24], Local Intensity and Gradient properties (LIG) [29], Average Absolute Gray Difference (AAGD) [30], Multiscale Tri-layer Local Contrast Measure (TLLCM) [9], Non-convex Rank Approximation Minimization (NRAM) [31] and Partial Sum of the Tensor Nuclear Norm (PSTNN) [46]. In recent years, many deep learning-based infrared small target detection methods have been proposed, such as the Attentional Local Contrast Network (ALCNet) [14], Dense Nested Attention Network (DNANet) [33] and Attention-Guided Pyramid Context Network (AGPCNet) [34]. The parameter settings of the traditional methods follow those suggested in their original papers, as illustrated in Table 1. For the deep learning-based methods, the optimal models from the original papers are adopted as the detection network models.
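For readers unfamiliar with the conventional baselines, the core of the Top-Hat method is the white top-hat transform: the image minus its grey opening, which suppresses large-scale background while keeping small bright targets. A naive numpy sketch with a flat k x k structuring element follows (the exact filter and parameters in [24] may differ; real code would use library morphology routines):

```python
import numpy as np

def white_tophat(img, k=3):
    """White top-hat: img minus its grey opening (erosion then
    dilation) with a k x k flat structuring element."""
    pad = k // 2
    h, w = img.shape
    padded = np.pad(img, pad, mode='edge')
    # Grey erosion: local minimum over the k x k window.
    eroded = np.empty((h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            eroded[i, j] = padded[i:i + k, j:j + k].min()
    # Grey dilation of the eroded image: local maximum -> opening.
    padded2 = np.pad(eroded, pad, mode='edge')
    opened = np.empty((h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            opened[i, j] = padded2[i:i + k, j:j + k].max()
    return img - opened
```

An isolated bright pixel smaller than the structuring element survives the subtraction, while smooth background regions map to zero, which is why residual structured clutter (edges, buildings) remains in Top-Hat results.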
4.5. Contrast Experiment Results
Qualitative Comparison. To compare the performance of these algorithms visually, several images with various backgrounds are selected as comparative data. Figure 5 shows the raw infrared images, where targets are marked by red boxes. Figure 5(1, 2, 5-7) contain complex background clutter. The background of Figure 5(1) consists of trees. In Figure 5(2), the images are filled with mountains and a number of bright background regions. Building noise affects the infrared dim and small targets in Figure 5(6) and (7). The backgrounds in Figure 5(3), (4), (5) and (8) are flatter, composed of irregular clouds and smooth natural and artificial scenes. Figure 6 demonstrates the results of the comparative methods. For the conventional methods, whose results are saliency maps, a uniform threshold is applied. Meanwhile, to conveniently compare the detection performance of the various methods, the results of all algorithms are normalized to the range 0-1. In Figure 6, small targets are marked by red boxes, and blue circles indicate clutter in the results.
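The 0-1 normalization applied to the saliency maps before thresholding is a standard min-max scaling; a sketch:

```python
import numpy as np

def normalize01(saliency):
    """Min-max normalize a saliency map to [0, 1] so the outputs of
    different detectors can be compared under one threshold."""
    s = saliency.astype(np.float64)
    lo, hi = s.min(), s.max()
    if hi == lo:                      # constant map: return all zeros
        return np.zeros_like(s)
    return (s - lo) / (hi - lo)
```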
Figure 6(a) and (g) are the 3-D representations of the original images and the ground truth, respectively. Intuitively, Top-Hat has a weak ability to detect small targets. Because of the morphological filtering of backgrounds in local regions, a large amount of background clutter remains in its results, sometimes even drowning out the targets. LIG, AAGD and TLLCM all focus on local information, so their detection results are visually better than Top-Hat's. For relatively smooth backgrounds, LIG and AAGD are able to detect small targets, although background noise remains in the results. In contrast, they achieve poor detection performance on complicated backgrounds with building edges, complex ground and strong cloud clutter. TLLCM separates small targets from the background better, except for the scenes of Figure 6(a2), (a3) and (a6). In particular, when TLLCM processes multi-target images, the second target is severely weakened, as shown in Figure 6(k3) and (k6). NRAM and PSTNN convert the target detection problem into a sparse matrix separation problem; nevertheless, in Figure 6(f1-f7) and (h1-h7), background clutter is separated as targets with a high false alarm rate, and they even fail to detect Figure 6(a4). ALCNet, DNANet and AGPCNet are state-of-the-art deep learning-based methods with higher accuracy and lower false alarm rates than the traditional methods. Unfortunately, the results of ALCNet contain substantial noise on complex backgrounds, and in the scenarios of Figure 6(a3) and (a4) it cannot even successfully detect the targets. DNANet misses the real target and detects the background as a target on Figure 6(a4). In multi-target cases, such as Figure 6(a3), (a4) and (a8), the three deep learning-based methods suffer from missed detections. Our method performs local background prediction on the target candidate regions generated by the coarse detection module to detect dim and small targets. In comparison, as demonstrated in Figure 6(l1-l5), the proposed algorithm detects targets better, although the results contain isolated clutter in some scenarios.
Quantitative Comparison. The ROC curves and F1-measure-threshold curves of the compared methods on the three public datasets are shown in Figure 7 and Figure 8, respectively. In Figure 7, the horizontal axis is the false alarm rate and the vertical axis is the detection rate. In Figure 8, the vertical and horizontal axes are the F1-measure (F1) and the threshold, respectively. The ROC curve shows the effectiveness of a method under different thresholds. It can be seen that the proposed method performs better than most compared methods. In particular, in Figure 7(ii), the proposed algorithm obviously outperforms the other comparison algorithms except DNANet. Our method rapidly converges to its highest detection rate at a low false alarm rate, even reaching 95%. As demonstrated in Figure 7(iii), although the proposed algorithm stabilizes faster, its true positive rate is lower than those of ALCNet, AGPCNet and DNANet. For the IRSTD-1k dataset, although many algorithms have good, stable detection curves, our framework is overall second only to DNANet, while reaching its steady state faster than DNANet. F1 is an evaluation metric that balances precision and recall. The F1 curves of our framework on the IRSTD-1k and IRST640 datasets are better than those of the other comparison algorithms. Nonetheless, the F1 scores of ALCNet, AGPCNet, DNANet and PSTNN are higher than our framework's on the SIRST dataset when the threshold is in the range 0 to 0.15. As the threshold increases, the F1 of our algorithm gradually exceeds the other algorithms', second only to LIG.
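The F1-threshold curves of Figure 8 can in principle be reproduced by sweeping a threshold over each normalized saliency map and scoring the resulting binarization against the ground truth; a sketch under that assumption (names are ours):

```python
import numpy as np

def f1_threshold_curve(saliency, gt_mask, thresholds):
    """F1 at each threshold: binarize a [0, 1] saliency map and
    compare with the ground-truth mask, as in F1-threshold plots."""
    gt = gt_mask.astype(bool)
    curve = []
    for t in thresholds:
        pred = saliency >= t
        tp = np.logical_and(pred, gt).sum()
        fp = np.logical_and(pred, ~gt).sum()
        fn = np.logical_and(~pred, gt).sum()
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        curve.append(f1)
    return curve
```

Averaging such per-image curves over a dataset gives one curve per method, which is how the relative rankings discussed above can be read off at any fixed threshold.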
To further compare the detection capabilities of each algorithm, the data in Table 2 were obtained by selecting a fixed threshold from Figure 8. The Prec, Rec and F-measure in Table 2 are calculated with the F-measure computation described in [32]; here, the threshold is set to 0.15. Prec reflects the proportion of correct detections among all results detected as targets, as shown in Equation (2). Rec denotes the ratio of the number of true detections to the total number of targets, as defined in Equation (3). The F1-measure balances Prec and Rec; ideally, Prec, Rec and F1 are all 1. In the tables, the largest value is bolded and the second largest is underlined. For IRSTD-1k and IRST640, the Rec and F1-measure of the proposed method are better than those of the other methods, but its precision rate is worse than DNANet's. Unfortunately, our framework is less advantageous on the SIRST dataset, since most comparison methods have metric performance similar to the proposed algorithm, as demonstrated in Table 2 (SIRST). Though our detection framework is flawed in some aspects, its capability to detect dim and small targets is improved over most algorithms. Finally, the quantitative evaluation shows that our framework performs better than existing methods and works more stably on various cluttered and noisy backgrounds.