2.1. Overview
In this paper, we propose a novel MMBSN method for HAD as illustrated in
Figure 1. Specifically, the proposed method comprises three distinct phases:
1) Sample Preparation Stage: The raw HSI initially undergoes pixel-shuffle down-sampling (PD) to obtain a set of down-sampled HSI samples. Subsequently, spectral and spatial screening modules are employed to select a certain proportion of these samples for network training.
2) Training Stage: The selected samples are sequentially fed into the blind spot network, which comprises a multi-scale mask convolution module, a spatial-spectral joint module, a background feature attention module, and a dynamic learnable fusion module. Ultimately, the reconstructed HSI is obtained through supervised reconstruction of the multiple training samples using the $L_{1}$ loss.
3) Detection Stage: The original HSI is down-sampled by the PD operation to obtain a set of down-sampled HSIs, each of which is sequentially input into the trained MMBSN model. Finally, the background HSI is reconstructed using the inverse PD operation, and the resulting reconstruction error serves as the HAD result.
The subsequent sections offer a thorough exposition of these facets.
2.2. Sample Preparation Stage
Due to the inherent limitations of self-supervision, anomalous pixels are reconstructed under the supervision of anomalous samples, which at training time inevitably results in a proficient reconstruction of anomalies, i.e., an identity mapping of anomalies. Although the blind spot network partially mitigates this issue, its impact remains substantial. This phase primarily aims to address this problem (as depicted in Figure 1). Through PD down-sampling, we can obtain samples with disrupted local correlation while preserving the spatial correlation of the background. However, due to the characteristics of PD down-sampling, a sampled sub-image may contain either anomalous pixels or background pixels within the original anomaly region. Therefore, we propose a combined spatial-spectral screening method to extract purer samples. The screened samples contain more background pixels within the original anomaly region, as shown in Figure 2. Therefore, during the training phase, emphasis is placed on learning to reconstruct background pixels rather than preserving the identity mapping of anomalies. To effectively address overfitting, we adopt a strategy of partial-sample training and complete-image testing.
1) Pixel-Shuffle Down-Sampling: The primary objective of the PD operation is to disrupt the spatial correlation among anomalies while preserving the spatial correlation among backgrounds as much as possible, thereby enhancing the distinction between backgrounds and anomalies. Since all the HSIs obtained after the PD operation exhibit remarkably strong correlations, we can train with only a subset of the samples. Figure 3 illustrates the PD and $\mathrm{PD}^{-1}$ operations with a stride factor of 2. In the visualization, the blue box signifies the sampling box with a stride factor of 2, and each number inside represents the index of a pixel, from which the basic processes of PD and $\mathrm{PD}^{-1}$ can be seen intuitively. A given HSI $\mathbf{X} \in \mathbb{R}^{H \times W \times B}$, where $H$, $W$, and $B$ are the row number, column number, and spectral dimension (the number of spectral channels) of the HSI, respectively, is decomposed into four sub-images; this decomposition is referred to as $\mathrm{PD}(\cdot)$, and the operation recovering the four sub-images into an HSI is referred to as $\mathrm{PD}^{-1}(\cdot)$. In the sub-images obtained through the PD operation, the scale of the anomaly target within the original anomaly region is effectively reduced. However, due to the inherent characteristics of this process, it is difficult to determine whether a given position in this region is sampled as an anomalous or a background pixel, and which of the two dominates.
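As a point of reference, a minimal NumPy sketch of PD and $\mathrm{PD}^{-1}$ might look as follows (the function names and implementation are ours, for illustration only):

```python
import numpy as np

def pd(x: np.ndarray, s: int = 2) -> list[np.ndarray]:
    """PD(.): decompose an H x W x B HSI into s**2 down-sampled sub-images."""
    H, W, B = x.shape
    assert H % s == 0 and W % s == 0, "H and W must be divisible by s"
    # Sub-image (i, j) keeps every s-th pixel starting from offset (i, j).
    return [x[i::s, j::s, :] for i in range(s) for j in range(s)]

def pd_inverse(subs: list[np.ndarray], s: int = 2) -> np.ndarray:
    """PD^{-1}(.): interleave the s**2 sub-images back into the full HSI."""
    h, w, B = subs[0].shape
    x = np.empty((h * s, w * s, B), dtype=subs[0].dtype)
    for k, sub in enumerate(subs):
        i, j = divmod(k, s)  # recover the (row, column) offset of sub-image k
        x[i::s, j::s, :] = sub
    return x
```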
2) Spatial-Spectral Joint Screening: Samples are selected based on their spatial and spectral characteristics. The classical GRX method is utilized to obtain the spectral distribution deviation score, which indicates the degree of deviation from the background distribution. A lower score implies fewer pixels in the sample deviate from the background distribution. We aim to obtain samples with the overall minimum deviation in pixel distribution. The overall bias score on spectral characteristics can be expressed as follows:
$$S_{\mathrm{spe}}^{(n)} = \sum_{i=1}^{HW/s^{2}} \left( \mathbf{x}_{i}^{(n)} - \boldsymbol{\mu} \right)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \left( \mathbf{x}_{i}^{(n)} - \boldsymbol{\mu} \right)$$

where $s$ is the stride factor for down-sampling, $\mathbf{x}_{i}^{(n)}$ is the $i$-th spectral vector in the down-sampled HSI sample $\mathbf{X}^{(n)}$, and $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are the mean vector and the covariance matrix of $\mathbf{X}^{(n)}$, respectively.
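A minimal NumPy sketch of this spectral screening score, assuming the standard global RX formulation (the regularization term and the function name are our own additions):

```python
import numpy as np

def grx_score(sub: np.ndarray) -> float:
    """Overall spectral deviation score of one down-sampled sample (global RX).

    sub: (h, w, B) down-sampled HSI. Returns the sum of squared Mahalanobis
    distances of all pixels from the sample's own mean spectrum.
    """
    h, w, B = sub.shape
    pixels = sub.reshape(-1, B)                             # spectral vectors
    mu = pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False) + 1e-6 * np.eye(B)   # regularized covariance
    diff = pixels - mu
    # (x - mu)^T Sigma^{-1} (x - mu) for every pixel, then summed over the sample
    m = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return float(m.sum())
```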
The proposed method also utilizes a spatial-domain screening approach, inspired by the local mean filtering algorithm, to calculate the spatial structural similarity between each test pixel and its neighborhood background. This can be mathematically expressed as follows:

$$S_{\mathrm{spa}}^{(n)} = \sum_{p} \frac{1}{8} \sum_{j=1}^{8} \left\| \Omega_{c}(p) - \Omega_{j}(p) \right\|_{2}$$
As shown in Figure 4, the pixel to be measured is selected as the center of an outer window, which is then divided into a central inner window and eight neighborhood background inner windows (with sizes 9 and 3, respectively). Subsequently, the Euclidean distance between the central inner window $\Omega_{c}(p)$ and each of the eight neighboring background inner windows $\Omega_{j}(p)$, $j = 1, \ldots, 8$, is calculated to quantify the spatial structural similarity between them. Since the similarity among background regions is high, a higher similarity score (i.e., a smaller distance) indicates a higher likelihood that the central window represents the background, while a lower similarity score suggests potential local spatial anomalies within the central inner window. By calculating this measure for all central inner windows, the number of locally spatially anomalous pixels present in the sample can be assessed. Combined with the spectral analysis above, a comprehensive screening score for each sample can be obtained:
$$S^{(n)} = \mathrm{Norm}\left( S_{\mathrm{spe}}^{(n)} \right) + \mathrm{Norm}\left( S_{\mathrm{spa}}^{(n)} \right)$$

where $\mathrm{Norm}(\cdot)$ stands for normalization. Finally, according to the comprehensive score, a certain proportion of the samples is selected, from smallest to largest, as the training samples $\{\mathbf{X}^{(n)}\}_{n=1}^{N_{t}}$, $N_{t} = \lceil \beta s^{2} \rceil$, where $\beta$ is the proportion of screened samples.
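A sketch of the spatial screening and the final sample selection, assuming min-max normalization for $\mathrm{Norm}(\cdot)$ and the window layout of Figure 4 (all function names are ours):

```python
import numpy as np

def spatial_score(sub: np.ndarray, inner: int = 3) -> float:
    """Sum, over all valid pixels, of the mean Euclidean distance between the
    central inner window and its eight neighboring inner windows (Fig. 4)."""
    h, w, _ = sub.shape
    r = inner // 2
    total = 0.0
    for y in range(inner + r, h - inner - r):
        for x in range(inner + r, w - inner - r):
            center = sub[y - r:y + r + 1, x - r:x + r + 1, :]
            d = [np.linalg.norm(center - sub[y + dy - r:y + dy + r + 1,
                                             x + dx - r:x + dx + r + 1, :])
                 for dy in (-inner, 0, inner) for dx in (-inner, 0, inner)
                 if (dy, dx) != (0, 0)]
            total += np.mean(d)  # mean distance to the 8 background windows
    return float(total)

def select_samples(subs, beta, spectral_score, spatial_score):
    """Keep the beta fraction of sub-images with the smallest combined score."""
    spe = np.array([spectral_score(s) for s in subs])
    spa = np.array([spatial_score(s) for s in subs])
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-12)  # min-max Norm(.)
    score = norm(spe) + norm(spa)
    k = int(np.ceil(beta * len(subs)))
    return [subs[i] for i in np.argsort(score)[:k]]
```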
With the spatial-spectral joint screening, we can ensure that the background pixels sampled from the original anomaly regions dominate the filtered samples. During supervised learning, since background pixels dominate the supervision samples, the network is more inclined to reconstruct the background. Even if a few anomalous pixels are present, they compete, under supervision, with the background pixels from other samples, resulting in a larger reconstruction error.
2.3. Training Stage
1) Multi-Scale Mask Convolution Module (MMCM): The multi-scale mask convolution module is designed to adapt to the detection of anomalous targets at different scales. Due to the characteristics of the blind spot network, the center pixel is reconstructed from surrounding pixel information. Therefore, we utilize small-scale mask convolutions to mask small targets at the center and large-scale mask convolutions to isolate similar anomalous pixels around large targets. As illustrated in Figure 1, the multi-scale mask convolution module comprises a 1×1×B×128 convolution and six mask convolutions of varying scales, with inner and outer window sizes of (1,3), (1,5), (1,7), (3,5), (3,7), and (3,9), respectively. These mask convolutions combine two inner-window scales with outer windows providing background receptive fields of different sizes. Given a training sample $\mathbf{X}^{(n)} \in \mathbb{R}^{(H/s) \times (W/s) \times B}$, we first extract features using the 1×1 convolution, then divide them into six branches and use background features from the various receptive fields to reconstruct the masked center pixels. The output of the mask convolution module has 64 channels. Because the center masks of the different-scale mask convolutions vary, so does their detection performance for anomalous objects: small-scale mask convolutions are more suitable for detecting small targets, while large-scale mask convolutions are better suited for detecting large targets.
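A minimal PyTorch sketch of one such mask convolution and the six branches (class and variable names are ours; the paper's exact implementation may differ):

```python
import torch
import torch.nn as nn

class MaskConv2d(nn.Conv2d):
    """Mask (blind-spot) convolution: kernel weights inside the inner window
    are zeroed, so each pixel is reconstructed only from its outer context."""

    def __init__(self, in_ch: int, out_ch: int, inner: int, outer: int):
        super().__init__(in_ch, out_ch, kernel_size=outer, padding=outer // 2)
        mask = torch.ones_like(self.weight)
        lo = (outer - inner) // 2
        mask[:, :, lo:lo + inner, lo:lo + inner] = 0.0  # blind the inner window
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-apply the mask on every forward pass so training cannot unmask it.
        return self._conv_forward(x, self.weight * self.mask, self.bias)

# The six branches with the (inner, outer) window sizes listed above,
# mapping the 128 shared channels down to 64 output channels each.
branches = nn.ModuleList(
    MaskConv2d(128, 64, inner, outer)
    for inner, outer in [(1, 3), (1, 5), (1, 7), (3, 5), (3, 7), (3, 9)]
)
```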
2) Spatial-Spectral Joint Module (SSJM): To enhance the utilization of spatial information and the interaction between spatial and spectral information, we propose a Spatial-Spectral Joint Module (shown in Figure 5) that leverages depth-wise convolution (DWConv) to extract features from the different frequency bands. Additionally, depth-wise dilated convolution (DDConv) is employed to capture background features at greater distances, aiding the reconstruction of the center pixel. On the other branch, the importance of each band's features is estimated and transformed, through a Sigmoid activation function, into weights that enhance the features with significant contributions. By focusing on the most influential features, redundancy in the spectral characteristics can be reduced while the utilization of spatial attributes is improved. Finally, point-wise convolution (PWConv) is applied to enhance the interaction between spatial and spectral features. To prevent the focus polarization caused by self-supervision, a skip-connection-based feature fusion approach is adopted. SSJM can effectively facilitate the interaction of spatial information across different bands. The entire process can be summarized as follows:
$$\mathbf{F}_{\mathrm{out}} = \mathrm{PWConv}\left( \mathrm{DDConv}\left( \mathrm{DWConv}\left( \mathbf{F}_{\mathrm{in}} \right) \right) \odot \sigma(\mathbf{w}) \right) + \mathbf{F}_{\mathrm{in}}$$

where $\sigma$ denotes the Sigmoid function, $\mathbf{w}$ the band-importance values produced by the attention branch, $\odot$ channel-wise multiplication, $\mathbf{F}_{\mathrm{in}}$ the feature extracted by the multi-scale mask convolution module, and $\mathbf{F}_{\mathrm{out}}$ the enhanced feature produced by the spatial-spectral joint module.
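A hedged PyTorch sketch of SSJM under these assumptions (the exact wiring of the branches is our reading of Figure 5; all names are ours):

```python
import torch
import torch.nn as nn

class SSJM(nn.Module):
    """Sketch of the spatial-spectral joint module: per-band depth-wise
    convolution, dilated depth-wise convolution for distant context, a
    Sigmoid-gated band-importance branch, point-wise fusion, and a skip."""

    def __init__(self, ch: int = 64):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)               # per-band spatial features
        self.ddw = nn.Conv2d(ch, ch, 3, padding=2, dilation=2, groups=ch)  # distant background context
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, ch, 1), nn.Sigmoid())      # band-importance weights
        self.pw = nn.Conv2d(ch, ch, 1)                                     # spatial-spectral interaction

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        spatial = self.ddw(self.dw(f_in))
        weighted = spatial * self.gate(f_in)   # enhance the most influential bands
        return self.pw(weighted) + f_in        # skip connection avoids focus polarization
```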
3) Background Feature Attention Module (BFAM): The function of the background feature attention mechanism is to address the problem that mask convolutions with large background receptive fields may introduce adjacent anomalous features. We need the network to pay more attention to the background features so that it ignores the small number of introduced anomalous features. The fundamental concept involves computing the similarity between the feature vector at a given position and all other feature vectors, summing these cosine similarities, and subsequently obtaining the background confidence of that position through a Sigmoid layer. Since the background accounts for most of an HSI, a background feature vector has high similarity with the other background feature vectors, while an anomalous feature behaves in the opposite way. Finally, we obtain the background confidence of each position, which is used to weight the input features and enhance the expression of the background features. To prevent information loss in extreme cases of attention, we also add a skip connection.
Figure 6 shows the process of background feature enhancement for the features extracted by the mask convolutions with an inner window of 3. It can be expressed as:

$$\mathbf{F}_{\mathrm{out}} = \mathrm{Sigmoid}\left( \mathrm{RowSum}\left( \mathbf{F} \otimes \mathbf{F}^{\mathrm{T}} \right) \right) \odot \mathbf{F} + \mathbf{F}$$

where $\mathbf{F}$ and $\mathbf{F}^{\mathrm{T}}$ are the feature matrix and its transpose, respectively, and $\otimes$ is matrix multiplication. The $i$-th row of $\mathbf{F} \otimes \mathbf{F}^{\mathrm{T}}$ contains the inner products of the spectral vector at position $i$ with the spectral vectors at the other $HW/s^{2} - 1$ positions; $\mathrm{RowSum}(\cdot)$ sums each row, and the Sigmoid layer converts the result into the background confidence of each position.
Finally, the background enhancement features under the three different background receptive fields are fused by concatenation and 1×1 convolution. Through the background feature attention, the network pays more attention to the background features, thereby widening the distance between anomalies and the background. Only the feature enhancement process for the mask convolutions with inner window 3 is shown here; the process for the mask convolutions with inner window 1 is the same.
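A minimal PyTorch sketch of the attention computation for one branch (names are ours; note that the $HW \times HW$ similarity matrix can be memory-hungry for large samples):

```python
import torch

def bfam(f: torch.Tensor) -> torch.Tensor:
    """Sketch of background feature attention for one branch.

    f: (N, C, H, W) features. Each position's spectral vector is compared with
    all others; a high total similarity yields a high background confidence."""
    n, c, h, w = f.shape
    mat = f.flatten(2).transpose(1, 2)        # (N, HW, C) feature matrix F
    sims = mat @ mat.transpose(1, 2)          # (N, HW, HW) inner products F F^T
    conf = torch.sigmoid(sims.sum(dim=-1))    # (N, HW) background confidence
    conf = conf.view(n, 1, h, w)
    return f * conf + f                       # weight the features, keep the skip
```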
4) Dynamic Learnable Fusion Module (DLFM): The detection performance of mask convolution varies with scale. Small-scale mask convolution exhibits good detection performance for small anomalous targets, but its performance degrades when detecting large anomalous targets due to interference from neighboring anomalous pixels. Conversely, large-scale mask convolution incorporates a wide center shielding window and demonstrates excellent detection performance for large anomalous targets; however, excessive center shielding hinders the utilization of background information in the surrounding area, making it challenging to detect smaller anomalous targets. To address the detection of multi-scale anomalous targets, we propose a dynamic learnable fusion module, as shown in Figure 7. Specifically, small-scale mask convolution effectively identifies small-scale anomalous targets and reconstructs large-scale ones, whereas large-scale mask convolution efficiently detects large-scale anomalous targets and reconstructs small-scale ones. Therefore, we introduce three dynamic learnable parameters $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ to fuse the advantages of the mask convolutions at different scales: $\lambda_{1}$ represents the weight of the feature $\mathbf{F}_{1}$ extracted by the mask convolutions with a mask window of 1, $\lambda_{2}$ represents the weight of the feature $\mathbf{F}_{3}$ extracted by the mask convolutions with a mask window of 3, and $\lambda_{3}$ represents the weight of the difference feature $(\mathbf{F}_{1} - \mathbf{F}_{3})$. The features extracted by the two groups of mask convolutions are weighted and summed, and the weighted difference feature is then subtracted. The resulting features undergo dynamic weight learning and adaptation to obtain the final fused features, which enhance the background and suppress anomalies. The dynamic fusion process can be expressed as:
$$\mathbf{F}_{\mathrm{DLFM}} = \lambda_{1} \mathbf{F}_{1} + \lambda_{2} \mathbf{F}_{3} - \lambda_{3} \left( \mathbf{F}_{1} - \mathbf{F}_{3} \right)$$

where $\mathbf{F}_{1}$ is the feature extracted by the small-scale mask convolutions, $\mathbf{F}_{3}$ is the feature extracted by the large-scale mask convolutions, and $\mathbf{F}_{\mathrm{DLFM}}$ is the output of the DLFM.
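A minimal PyTorch sketch of DLFM with three learnable scalars (the initial values are our assumption):

```python
import torch
import torch.nn as nn

class DLFM(nn.Module):
    """Sketch of dynamic learnable fusion: three learnable scalars weight the
    small-scale features, the large-scale features, and their difference."""

    def __init__(self):
        super().__init__()
        self.l1 = nn.Parameter(torch.tensor(1.0))  # weight of F_1 (inner window 1)
        self.l2 = nn.Parameter(torch.tensor(1.0))  # weight of F_3 (inner window 3)
        self.l3 = nn.Parameter(torch.tensor(0.5))  # weight of the difference feature

    def forward(self, f1: torch.Tensor, f3: torch.Tensor) -> torch.Tensor:
        # Weighted sum of the two branches minus the weighted difference feature.
        return self.l1 * f1 + self.l2 * f3 - self.l3 * (f1 - f3)
```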
The reconstructed background sample $\hat{\mathbf{X}}^{(n)} \in \mathbb{R}^{(H/s) \times (W/s) \times B}$ is finally obtained by convolving the fused features back to the spectral dimension, which can be expressed as:

$$\hat{\mathbf{X}}^{(n)} = \mathrm{Conv}_{1 \times 1}\left( \mathbf{F}_{\mathrm{DLFM}} \right)$$
We opt for the $L_{1}$ loss as the reconstruction loss. During the reconstruction process, our supervision samples undergo the combined spatial-spectral screening method, so a higher proportion of background pixels than anomalous pixels is sampled within the original anomaly regions. Consequently, during the supervised reconstruction of anomalous pixels, background and anomalous pixels coexist as supervision. However, owing to the numerical advantage of the background pixels, network learning prioritizes their reconstruction while inhibiting the reconstruction of anomalous pixels. The loss function can be expressed as:

$$\mathcal{L}_{1} = \frac{1}{N_{t}} \sum_{n=1}^{N_{t}} \left\| \hat{\mathbf{X}}^{(n)} - \mathbf{X}^{(n)} \right\|_{1}$$
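For completeness, a one-line PyTorch version of this reconstruction loss, averaged over the selected training samples:

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 reconstruction loss between reconstructed and original samples."""
    return F.l1_loss(recon, target, reduction="mean")
```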