In this section, we present in detail the structure of D2MFNet and the benchmark dataset. First, we describe the main network structure, which consists of MFCAM and D2FPN. MFCAM attends to information in different frequency components to better perceive the target's grayscale changes and edge contours, while D2FPN introduces frequency-domain features to further mine image feature information. Finally, we process the collected data, and build and analyze the benchmark.
3.1. Multi-Frequency Combined Attention Module
To obtain more feature information in situations where sonar image quality is so poor that even the human eye cannot identify the specific target type, we first elaborate on the Fast Fourier Transform (FFT), the algorithm underlying the following module, and then construct MFCAM to realize feature extraction and fusion in the time and frequency domains.
3.1.1. Fast Fourier Transform
The Fast Fourier Transform (FFT) is an efficient algorithm that reduces the computational complexity of the Discrete Fourier Transform (DFT). The two-dimensional DFT converts an image from the spatial domain to the frequency domain, and the formula of the two-dimensional Discrete Fourier Transform (2D DFT) is as follows:
$$F(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\, e^{-j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)},$$

in which $f(x,y)$ represents the spatial domain matrix of size $M \times N$, $F(u,v)$ represents the Fourier transform of $f(x,y)$, $u$ and $v$ can be used to determine the frequencies of the sinusoids, $u = 0, 1, \ldots, M-1$, $v = 0, 1, \ldots, N-1$, and $M \times N$ is the size of the frequency domain matrix after the transform.
Correspondingly, the inverse 2D DFT can be written as:
$$f(x,y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u,v)\, e^{j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)},$$

in which $x = 0, 1, \ldots, M-1$ and $y = 0, 1, \ldots, N-1$. The size of the frequency domain matrix is the same as that of the original spatial domain matrix. Each point in the frequency domain matrix represents a basis function with frequencies $(u, v)$, and the combination of these functions in the spatial domain is the original function $f(x,y)$.
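For concreteness, the transform pair above can be evaluated with the FFT in a few lines. The following is a minimal NumPy sketch (the input here is random placeholder data rather than an SSS image):

```python
import numpy as np

# f(x, y): a stand-in single-channel image of size M x N (placeholder data).
rng = np.random.default_rng(0)
image = rng.random((64, 64))

F = np.fft.fft2(image)       # F(u, v): complex frequency-domain matrix, same M x N size
recovered = np.fft.ifft2(F)  # inverse 2D DFT back to the spatial domain

# The round trip recovers the original function up to floating-point error,
# matching the statement that the combination of the basis functions is f(x, y).
assert np.allclose(recovered.real, image)
```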
Using the 2D FFT, we choose three images from different classes in SCTD to show the result of the frequency-domain transform. As shown in Figure 1, the amplitude and phase spectra after spectral centralization are presented. Spectrum centralization shifts the zero-frequency component to the center of the spectrum, so that low frequencies lie in the middle and high frequencies surround them. The amplitude spectrum has obvious structural characteristics: it contains only the periodic structures present in the image itself, without indicating where they are located. The phase spectrum resembles a random pattern, but since the movement of an object in space is equivalent to a phase shift in the frequency domain, the phase spectrum is equally important.
The high-frequency components in the frequency domain correspond to the detailed information of the image, while the low-frequency components correspond to its contour information. The high-frequency components represent the abrupt parts of the signal, whereas the low-frequency components determine its overall appearance. In the spectrogram, points of different brightness can be seen: a bright point indicates a large gradient at that location, i.e., a high-frequency component, and a dim point indicates a small gradient, i.e., a low-frequency component.
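As an illustration of these spectra and of the high/low-frequency split, the sketch below computes the centralized amplitude and phase spectra and separates the two bands with a circular mask; the mask radius is an illustrative choice, not a value from the paper.

```python
import numpy as np

def spectra_and_bands(image: np.ndarray, radius: int = 16):
    # Spectrum centralization: shift the zero-frequency component to the center.
    F = np.fft.fftshift(np.fft.fft2(image))
    amplitude = np.abs(F)   # periodic structure, without spatial location
    phase = np.angle(F)     # encodes where structures sit in space

    # Circular low-frequency mask around the (centered) zero frequency.
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    low_mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2

    # Low frequencies -> contours / overall appearance; high -> edges / detail.
    low = np.fft.ifft2(np.fft.ifftshift(F * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(F * ~low_mask)).real
    return amplitude, phase, low, high
```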
3.1.2. Construction of MFCAM
The high-frequency components in the frequency domain of an image correspond to its detailed information, and the low-frequency components correspond to its contour information. In target detection, common methods use only low-frequency information to learn image features, which is a disadvantage when SSS data are scarce. In this case, both channel and spatial attention mechanisms are essential, and additional attention information from the high-frequency components is required. Therefore, based on the CBAM structure [57], we add attention structures covering different frequency ranges, as shown in Figure 2, to achieve more accurate detection of underwater targets.
Multi-frequency channel attention module.
Among attention mechanisms that fuse global and local information, the channel attention mechanism better fuses global information to express the overall features of the feature map across channels during feature extraction. Based on the existing channel attention structure, we add a frequency-domain feature extraction step that extracts and combines channel weights in the high-frequency, low-frequency, and remaining frequency ranges, which better integrates the limited feature information.
In the structure shown in Figure 3, when extracting the channel weights, the original feature map is first converted to the frequency domain, and filtering operations over different frequency ranges are performed to extract feature maps for the high-frequency, low-frequency, and remaining frequency ranges. Each filtered feature map then undergoes global average pooling to extract channel weights for each channel and frequency range, and the weights of the three frequency ranges are summed to obtain the final channel weights. These one-dimensional channel weights are multiplied element-wise with the original feature map to obtain the feature map processed by the multi-frequency channel attention module, which then enters the next module.
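A minimal PyTorch sketch of this module is given below, under stated assumptions: the three bands are cut with circular masks whose radii (given as fractions of the half-spectrum) are illustrative choices, and a shared MLP followed by a sigmoid, as in CBAM, turns the pooled values into channel weights.

```python
import torch
import torch.nn as nn

class MultiFreqChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16,
                 low_r: float = 0.25, high_r: float = 0.75):
        super().__init__()
        self.low_r, self.high_r = low_r, high_r
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def _band_mask(self, h, w, r_in, r_out, device):
        # Normalized radial distance from the (centered) zero frequency.
        yy = torch.arange(h, device=device).view(-1, 1).float() - h // 2
        xx = torch.arange(w, device=device).view(1, -1).float() - w // 2
        r = torch.sqrt(yy ** 2 + xx ** 2) / (min(h, w) / 2)
        return ((r >= r_in) & (r < r_out)).float()

    def forward(self, x):
        _, _, h, w = x.shape
        spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        bands = [(0.0, self.low_r), (self.low_r, self.high_r), (self.high_r, 10.0)]
        weights = 0.0
        for r_in, r_out in bands:  # low, remaining, and high frequency ranges
            mask = self._band_mask(h, w, r_in, r_out, x.device)
            filtered = torch.fft.ifft2(
                torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
            gap = filtered.mean(dim=(-2, -1))  # global average pooling per band
            weights = weights + self.mlp(gap)  # sum the three bands' channel weights
        return x * torch.sigmoid(weights)[..., None, None]
```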
Multi-frequency spatial attention module.
Unlike the channel attention mechanism, the spatial attention mechanism assigns weights to the feature map within a channel according to a set window size, so as to extract the locally more important feature information within the same space. Building on the existing spatial attention structure, and considering that amplitude and phase carry different feature information after the frequency-domain transform, we combine the two according to their different distribution characteristics to better express the key feature information in a region. This section proposes the structure of the multi-frequency spatial attention mechanism based on the FFT.
In the structure shown in Figure 4, the frequency-domain transform is performed first, and the amplitude and phase information are extracted separately. With a window size of 2×2 or 3×3, maximum pooling and minimum pooling are applied to the amplitude information to extract the weights of the high-frequency and low-frequency ranges, while the phase information undergoes average pooling with the same window size. Each set of weights is multiplied with the original feature map, yielding three differently treated feature maps in the same space. The two amplitude-derived feature maps are combined first and then combined with the phase-derived feature map to form the final output of the multi-frequency spatial attention module, which ends the overall attention process.
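The sketch below is one possible reading of this module in PyTorch, under stated assumptions: stride-1 pooling with padding keeps the spatial size so the pooled maps can act as per-pixel weights, a sigmoid squashes them into [0, 1] (in practice the raw FFT amplitude may need normalization first), and "combined" is taken to mean element-wise addition.

```python
import torch
import torch.nn.functional as Fn

def multi_freq_spatial_attention(x: torch.Tensor, k: int = 3) -> torch.Tensor:
    # Frequency-domain transform; amplitude and phase extracted separately.
    spec = torch.fft.fft2(x)
    amplitude = spec.abs()
    phase = spec.angle()

    pad = k // 2  # stride-1 pooling with padding preserves H x W
    amp_max = Fn.max_pool2d(amplitude, k, stride=1, padding=pad)    # high-frequency weights
    amp_min = -Fn.max_pool2d(-amplitude, k, stride=1, padding=pad)  # minimum pooling: low-frequency weights
    pha_avg = Fn.avg_pool2d(phase, k, stride=1, padding=pad)        # averaged phase weights

    # Multiply each weight map with the original feature map: three treatments.
    f_high = x * torch.sigmoid(amp_max)
    f_low = x * torch.sigmoid(amp_min)
    f_phase = x * torch.sigmoid(pha_avg)

    # Combine the two amplitude branches first, then the phase branch.
    return (f_high + f_low) + f_phase
```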
3.4. Benchmark Dataset
As introduced in the related work, since the number and scale of currently public SSS image datasets are far smaller than those in other target detection fields, we searched for and use two publicly available SSS image datasets, SCTD and KLSG, for the related research. Running experiments on both datasets also demonstrates that our proposed D2MFNet is more robust.
The first dataset, SCTD, is designed for SSS image target detection and already contains the target annotations required by this model, so no additional annotation processing is performed. SCTD contains a total of 357 images in three categories, including 271 ships, 35 humans, and 57 aircraft. The second dataset, KLSG, contains two categories and a total of 447 images, including 395 ships and 62 aircraft; however, KLSG is designed for SSS image classification and has no object-level annotations. We therefore annotated the target objects required for the detection task in KLSG, using the VOC annotation format as the standard.
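For reference, a VOC-format annotation can be read back with the Python standard library; the sketch below extracts the class name and bounding box of each annotated target (the file name is hypothetical).

```python
import xml.etree.ElementTree as ET

def read_voc_boxes(xml_path: str):
    # Parse one VOC-format annotation file, such as those produced for KLSG.
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text  # e.g., "ship" or "aircraft"
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.find("xmin").text), int(bb.find("ymin").text),
                      int(bb.find("xmax").text), int(bb.find("ymax").text)))
    return boxes

# Hypothetical usage: boxes = read_voc_boxes("klsg_0001.xml")
```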
Since SCTD lacks an evaluation-metric analysis based on multiple detection methods, and the original KLSG is a classification dataset with no such evaluation content, we put the two datasets together for benchmark building and comparison. To establish the benchmark for comparing and analyzing general target detection methods, we chose well-known one-stage and two-stage detectors with generally strong results: Faster R-CNN, Cascade R-CNN, Sparse R-CNN, SSD512, RetinaNet, YOLOv5, YOLOv7, Deformable DETR, DAB-DETR, DINO, and our proposed D2MFNet. AP50 (Average Precision at an Intersection over Union threshold of 50%) and mAP (mean Average Precision) are the main evaluation metrics in this benchmark; these and the other metrics used in the experiment section are detailed in the Evaluation Metrics subsection. The benchmark test results are shown in Table 1.
At the same time, when training the methods on these datasets, we considered that the smaller the dataset, the greater the impact on method training. We briefly verified this concern by training with and without pre-trained parameters: without pre-trained parameters, the AP values of most models fall below 0.1%, which makes analysis and comparison difficult. Therefore, we use pre-trained parameters such as VGG16 and ResNet50 when training the one-stage and two-stage methods to achieve a more meaningful comparison.
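As a sketch, pre-trained backbones of the kind referred to above can be loaded from torchvision; the detection frameworks then initialize from these ImageNet weights.

```python
import torchvision

# ImageNet-pre-trained backbones used to initialize the detectors
# (legacy `pretrained=True` flag; newer torchvision uses the `weights` argument).
vgg16 = torchvision.models.vgg16(pretrained=True)
resnet50 = torchvision.models.resnet50(pretrained=True)
```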
We do not use data augmentation methods such as cropping, flipping, Mixup, Cutout, or Mosaic, because augmenting very small datasets makes training more prone to overfitting. In effect, such augmentation on a very small dataset merely replicates a small number of target features several times over, producing a large number of duplicate features. In ordinary large datasets this effect is relatively small, and the positive effect of augmentation far outweighs the negative one.
Moreover, to reduce the adverse effect of overfitting on metric evaluation, the data in the table are tested and inferred using the converged models. The YOLO series also has multiple versions: YOLOv5 has four variants (YOLOv5-s, YOLOv5-m, YOLOv5-l, and YOLOv5-x), and YOLOv7 has five (YOLOv7-l, YOLOv7-x, YOLOv7-w, YOLOv7-e, and YOLOv7-d). In the benchmark, the basic settings YOLOv5-s and YOLOv7-l are used.
As shown in Table 1, compared with the general object detection methods with strong recent results, the D2MFNet proposed in this paper has obvious advantages. Certain patterns of difference also emerge between categories, between datasets, between methods, and among their pairwise or three-way combinations.
(1) Between categories:
The AP50 of the ship category, which has more data, is clearly higher, while the AP50 of the aircraft and human categories is much lower. A main reason is that the ship category has far more samples than the other categories and therefore provides far more effective target features.
(2) Between datasets:
With Faster R-CNN, Cascade R-CNN, and YOLOv5, the SCTD dataset yields better results. One reason may be that SCTD is an RGB dataset whose original input has 3 channels, which better matches the structure of conventional general target detection methods, including Cascade R-CNN, the basic method used in this paper; the benchmark results support this point. It is also possible that the category distribution of SCTD is more reasonable. However, since both datasets are small, we cannot further verify how much this cause contributes to the phenomenon, and the small sample size rules out cross-validation, so this possible issue is set aside. We also conjecture that KLSG was generated as an RGB dataset at acquisition time: when producing SSS images, many SSS equipment manufacturers prefer to render them as color drawings, which lets the naked eye distinguish targets more easily and better promotes their products. However, SSS images are inherently grayscale, and KLSG was converted back to grayscale when the dataset was made. This process of converting a monochromatic image to a multi-channel image and back again degrades dataset quality.
(3) Between methods:
The overall mAP of the one-stage methods is lower than that of the two-stage methods. This is mainly because one-stage methods do not generate candidate regions: they directly regress the class probabilities and position coordinates of objects, so the final result is obtained after a single pass. One-stage methods therefore detect faster but are less accurate than two-stage methods; however, due to the small amount of data, the one-stage methods show no speed advantage in the experiments in this paper. Furthermore, methods that perform poorly here, such as Sparse R-CNN, may have been designed around large-scale data to maintain accuracy while reducing training parameters and time; their results are in line with our expectations and are most likely due to the data scale and category distribution.
These are all possible factors that affect the method training results.