1. Introduction
Automatic modulation recognition (AMR) [1,2] is the process of identifying the modulation of a received signal in the absence of sufficient a priori information. Determining the modulation is necessary for correct demodulation, which is fundamental in spectrum monitoring [3], information countermeasures [4], cognitive radio [5], etc. With the continuing development of wireless communication technology, signal modulations are becoming increasingly diverse and the number of spectrum-using devices keeps growing. Therefore, the study of real-time and efficient AMR is of great practical significance.
The mainstream AMR methods fall into two categories, i.e., likelihood theory-based (LB-AMR) [1,6,7] and feature-based (FB-AMR) [2,8] methods. However, the performance of these traditional methods relies on manually estimated parameters [9], which makes feature extraction difficult under high data transmission rates [10]. Instead of relying on handcrafted derivations to extract features, deep learning models feed signals directly into the network for end-to-end learning. Experiments have confirmed that deep learning-based methods achieve better recognition accuracy than the traditional LB-AMR and FB-AMR methods [11]. At present, a large number of deep neural networks, such as the Convolutional Neural Network (CNN) [12], the Denoising Autoencoder (DAE) [13], and the Recurrent Neural Network (RNN) [14], have been introduced into AMR tasks. Most existing DL-AMR methods take a single modality as the input data type, such as in-phase/quadrature (I/Q) [14], amplitude/phase series (A/P) [15], or the Welch spectrum, square spectrum, and fourth power spectrum [16,17]. However, a single modality contains only limited identifying information drawn from a specific domain.
For DL-AMR methods [12,13,14], different input data types have their own advantages. As shown in Table 1, input data from different modalities perform distinctively well for particular modulations due to the domain gap. In particular, the I/Q, A/P, and spectral data have significant distinguishing ability for PAM, QAM, and PSK modulations, respectively. However, a single-domain data format does not provide a sufficiently efficient and complete view for recognition, because each modality captures only its own specific properties.
In recent years, several studies have also focused on the advantages of multimodal information fusion for AMR tasks. In [21], modality-discriminative features are captured separately by three ResNet networks, and the I/Q, A/P, and spectral amplitude (Welch spectrum, square spectrum, and fourth power spectrum) features are concatenated with the corresponding element-wise summation. In [22], a dual-stream CNN-LSTM structure (DSCLDNN) is proposed, which combines the characteristics of I/Q and A/P by pairwise cross-interaction between the two streams; specifically, DSCLDNN multiplies the I/Q and A/P features with an outer product. Unlike the above direct addition or multiplication fusion approaches, [20] uses a PNN model to cross-fuse the three modal features in a fixed order. However, most of these methods fuse multimodal features by direct or crosswise summation or outer products, which tends to ignore the variability of different modalities and their different impacts on modulation identification.
Generally, the attention mechanism [23,24] can identify channel-wise importance, so that each modality can adaptively obtain its respective attention weight. For a feature map, attention weights need to be computed along both the channel and spatial dimensions. Channel attention methods such as SENet [23], GSoPNet [25], and SRM [26] extract attention information over different channels to assign greater weight to important channels. For the spatial dimension, attention mechanisms such as GENet [24], RAM [27], and self-attention [28] are used to extract important spatial regions or highly relevant spatial locations. For multi-channel inputs composed of multimodal signals, we borrow the structure of the channel and spatial attention mechanisms for the dual-channel attention fusion (DAF) we design. Specifically, the dual channels are a local branch and a global branch: on the local branch, a spatial attention mechanism extracts local high-level feature details, while on the global branch a channel attention mechanism assigns attention weights to the different modal channels.
The main contributions of this work can be summarized as follows:
We propose a deep learning method based on iterative dual-scale attentional fusion (iDAF), which exploits the complementary properties of multimodal information to achieve better recognition.
We design two embedding layers to extract the local and global information, capturing recognition-relevant features from receptive fields of different sizes. The extracted features are sent into the iterative dual-scale channel attention module (iDCAM), which consists of a local and a global branch; the branches focus on the details of the high-level features and the variability across modalities, respectively.
Experiments on the RML2016.10A dataset demonstrate the validity and rationality of iDAF. The highest accuracy of 93.5% is achieved at 10 dB, and the recognition accuracy over the full SNR range is 62.32%.
3. The Proposed Method
In this section, we first preprocess the initial data to obtain three modal representations. Then we introduce the iterative dual-scale attentional fusion (iDAF), which consists of the data embedding layers and the iterative dual-scale channel attention module (iDCAM).
3.1. Data Preprocessing
This paper aims to identify the modulation in a single-input single-output (SISO) radio transmission system. The transmitted signal $s$ passes through the transmission channel $h$, and the receiver obtains the baseband signal

$$r(i) = A\, e^{\,j(2\pi f_{o} i + \varphi_{o})}\, \big(s * h\big)(i) + n(i),$$

where $s$ is the complex baseband signal transmitted by the transmitter under some modulation scheme, $f_{o}$ is the frequency offset, $\varphi_{o}$ is the phase offset, $A$ is the communication channel gain, $n$ is the Additive White Gaussian Noise (AWGN), and $i$ denotes the $i$-th received value. The purpose of the automatic modulation recognition task is to take the baseband signal at the receiver and determine its modulation scheme, which can be formulated as an estimation problem of identifying $K$ types of radio modulations.
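To make the signal model concrete, the following NumPy sketch synthesizes a received frame under the model above; the gain, offsets, channel taps, noise level, and QPSK symbols are illustrative assumptions rather than parameters of the RML2016.10A data.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 128                          # number of received samples (matches the 128-sample frames used later)
A = 0.8                          # assumed channel gain
f_o, phi_o = 1e-3, 0.3           # assumed normalized frequency offset and phase offset
h = np.array([1.0, 0.2 + 0.1j])  # assumed short multipath channel impulse response

# QPSK complex baseband symbols as an illustrative modulation scheme
symbols = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=N) / np.sqrt(2)

i = np.arange(N)
faded = np.convolve(symbols, h)[:N]                              # (s * h)(i)
carrier = A * np.exp(1j * (2 * np.pi * f_o * i + phi_o))         # gain and carrier offsets
noise = 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)  # AWGN n(i)

r = carrier * faded + noise      # r(i) = A e^{j(2*pi*f_o*i + phi_o)} (s*h)(i) + n(i)
```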
The key to the recognition task is to obtain effective features of the signal, while the representational ability of features extracted from a single modality is limited, especially at low SNR. In order to cover the amplitude, phase, and spectrum characteristics required for modulation recognition, three modalities are selected to ensure that the necessary identifying information is included. I/Q and A/P contain the instantaneous amplitude, phase, and frequency information and serve as modality one (IQ) and modality two (AP), respectively. The Welch spectrum, square spectrum, and fourth power spectrum are selected as the third modality (SP) and represent the spectral characteristics of the signal in the frequency domain.
Therefore, prior to being input into the neural network, the original signal is transformed into the three modal representations in the following ways:
In-phase/quadrature (IQ): Generally, the receiver stores the signal in I/Q form to facilitate mathematical operations and hardware design, which is expressed as follows:

$$I(n) = \operatorname{Re}[r(n)], \qquad Q(n) = \operatorname{Im}[r(n)],$$

where $I$ and $Q$ represent the in-phase and quadrature components, and $\operatorname{Re}[\cdot]$ and $\operatorname{Im}[\cdot]$ refer to the real and imaginary parts of the signal, respectively.
Amplitude/phase (AP): The instantaneous amplitude and phase of the signal are calculated as

$$A(n) = \sqrt{I(n)^{2} + Q(n)^{2}}, \qquad \phi(n) = \arctan\!\left(\frac{Q(n)}{I(n)}\right),$$

where the values of $n$ are $1, 2, \ldots, N$, with $N$ the length of the signal frame.
Spectrum (SP): The spectrum expresses the distribution of the signal over frequency, which is an important discriminator between different modulations. The calculation of the spectrum is expressed as

$$S^{(n)}(f) = \left|\mathcal{F}\{ r^{\,n}(i) \}\right|^{2}, \qquad n \in \{1, 2, 4\},$$

where $n$ represents the $n$-th power of the signal, and $n = 1, 2, 4$ correspond to the Welch spectrum, square spectrum, and fourth power spectrum, respectively. Here, M1 and M2 represent the signal waveform and frequency information, and M3 refers to the signal time-frequency characteristics. The feature vectors of the three modalities are normalized to a shape of (batchsize × 128).
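As a concrete illustration of this preprocessing, a minimal NumPy/SciPy sketch is given below. It assumes the received frame is available as a complex array `r` and uses `scipy.signal.welch` for the spectral estimates and a simple max-amplitude normalization; both are reasonable choices but not necessarily the authors' exact procedure.

```python
import numpy as np
from scipy.signal import welch

def to_modalities(r: np.ndarray):
    """Convert a complex baseband frame into the IQ, AP, and SP representations."""
    # Modality 1 (IQ): in-phase and quadrature components
    iq = np.stack([r.real, r.imag])                      # shape (2, N)

    # Modality 2 (AP): instantaneous amplitude and phase
    amp = np.abs(r)
    phase = np.angle(r)                                  # arctan(Q/I) with quadrant handling
    ap = np.stack([amp, phase])                          # shape (2, N)

    # Modality 3 (SP): Welch, square, and fourth-power spectra
    spectra = []
    for n in (1, 2, 4):
        _, psd = welch(r ** n, nperseg=len(r), return_onesided=False)
        spectra.append(psd)
    sp = np.stack(spectra)                               # shape (3, N)

    # Normalize each modality to unit maximum amplitude (one simple choice)
    def normalize(x):
        return x / (np.max(np.abs(x)) + 1e-12)

    return normalize(iq), normalize(ap), normalize(sp)

# Example: r is a length-128 complex frame such as the one produced by the signal model above
r = np.exp(1j * 2 * np.pi * 0.05 * np.arange(128))
x_iq, x_ap, x_sp = to_modalities(r)
```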
In order to observe the specific behavior of the modulations in different modalities, we plot the IQ, AP, and SP data of the 11 modulations. It can be seen that several modulations are separated similarly well in different modalities, while certain modulations behave distinctively in a single modality. Therefore, we introduce an attentional fusion to integrate the above similar and distinct features.
3.2. Iterative dual-scale attentional fusion (iDAF)
In the iDAF, we design two data embedding layers to construct the local and global feature maps, which are then sent into the iterative dual-scale channel attention module (iDCAM) for attention weight assignment, as shown in Figure 1.
3.2.1. Data embedding
The signal data consist of three modal inputs, namely I/Q, A/P, and the spectrum. The original signal is preprocessed into the three modal inputs, denoted as $x_{\mathrm{IQ}}$, $x_{\mathrm{AP}}$, and $x_{\mathrm{SP}}$, which represent the orthogonal (in-phase/quadrature) information, the amplitude-phase domain, and the spectral features, respectively.
Due to the variability of multimodal features, direct fusion would ignore the properties unique to different modalities. Therefore, we capture features from both local and global feature maps: the local feature map extracts detailed high-level semantic features, while the global feature map focuses on inter-modal salient characteristics. We construct these two feature maps separately using feature extraction networks with different-sized receptive fields.
For the local feature map $X$, the feature extraction network is expected to focus on local details and contextual information. Inspired by [35], we propose a local embedding layer composed of CNN, LSTM, and DNN components, fine-tuned to extract local attention information. First, the preprocessed data pass through a few convolution layers to model the frequency content. The long-term features are obtained by undistorted convolution (UD-Conv) layers with channel dimensions 128, 64, 32, and 16. Concretely, a UD-Conv block consists of a zero-padding layer of size (2,0,0,0), a convolution layer, the ReLU function, and batch normalization. The zero-padding adds two columns so that the signal features are propagated with as little loss of time-frequency information as possible. Following [36], the outputs of the CNN are sent into the LSTM and DNN. The LSTM layer is a bidirectional recurrent model with 100 units, which makes predictions using information both before and after the current moment in the sequence: the input is passed to the model in its original order and again in reverse order, and the forward and backward outputs are then merged. The long-short time series learning capability of the LSTM identifies temporal correlations in I/Q data through its inherent memory properties and benefits the learning of temporal dependencies of the instantaneous amplitude and phase [15].
Figure 2. The IQ data plot of 11 modulations.
Figure 3. The AP data plot of 11 modulations.
Figure 4. The SP data plot of 11 modulations.
The residual mapping function is a shortcut path between different layers, which strengthens the communication between deep and shallow network features. As shown in [14], a ResNet with a four-convolution-layer structure achieved the best performance on classifying signal modulations. After the four UD-Conv layers, the long-term features have been extracted by the convolution layers, while short-term information may be lost during the convolution process. Therefore, the original data containing both long- and short-term features are fed into the LSTM together with the extracted long-term features via a residual connection. Inspired by [37], the extraction capability of the CNN is combined with the LSTM and DNN. As shown in Figure 5a, the learned short-term features are fed into the dense layer together with the long-term features previously extracted by the CNN. The local embedding layer captures the data characteristics of each modality with unshared parameters, which is expressed as $X = F_{L}(x; \theta_{L})$, where $F_{L}$ represents the local embedding layer, $\theta_{L}$ indicates the local network parameters, and $x \in \{x_{\mathrm{IQ}}, x_{\mathrm{AP}}, x_{\mathrm{SP}}\}$.
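A minimal PyTorch sketch of such a local embedding layer is shown below. Only the channel progression (128, 64, 32, 16), the 100-unit bidirectional LSTM, and the residual connection of the raw input into the LSTM follow the description above; the kernel size, padding layout, pooling, and dense width are assumptions made for illustration. One instance would be created per modality (e.g., with `in_ch=3` for the SP input) so that parameters remain unshared.

```python
import torch
import torch.nn as nn

class UDConv(nn.Module):
    """Undistorted convolution block: left zero-padding, Conv1d, ReLU, BatchNorm."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConstantPad1d((kernel_size - 1, 0), 0.0),   # pad so the sequence length is preserved
            nn.Conv1d(in_ch, out_ch, kernel_size),
            nn.ReLU(),
            nn.BatchNorm1d(out_ch),
        )

    def forward(self, x):
        return self.block(x)

class LocalEmbedding(nn.Module):
    """CNN + bidirectional LSTM + DNN local embedding layer (one instance per modality)."""
    def __init__(self, in_ch=2, embed_dim=128):
        super().__init__()
        chans = [128, 64, 32, 16]                          # channel progression from the text
        layers, prev = [], in_ch
        for c in chans:
            layers.append(UDConv(prev, c))
            prev = c
        self.cnn = nn.Sequential(*layers)
        # Residual path: the raw input is concatenated with the CNN output before the LSTM
        self.lstm = nn.LSTM(input_size=chans[-1] + in_ch, hidden_size=100,
                            batch_first=True, bidirectional=True)
        self.dense = nn.Linear(2 * 100 + chans[-1], embed_dim)

    def forward(self, x):                                  # x: (batch, in_ch, seq_len)
        long_term = self.cnn(x)                            # (batch, 16, seq_len) long-term features
        lstm_in = torch.cat([long_term, x], dim=1)         # residual connection with the raw signal
        lstm_out, _ = self.lstm(lstm_in.transpose(1, 2))   # (batch, seq_len, 200)
        short_term = lstm_out[:, -1, :]                    # last time step as the short-term summary
        pooled_long = long_term.mean(dim=-1)               # (batch, 16) summary of the CNN features
        return self.dense(torch.cat([short_term, pooled_long], dim=1))

# Example: embed a batch of 8 I/Q frames of length 128
x_iq = torch.randn(8, 2, 128)
feat = LocalEmbedding()(x_iq)                              # (8, 128)
```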
To obtain the global feature map $Y$, an optimized CNN with three convolutional layers is utilized to extract features over a global receptive field, as shown in Figure 5b.
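For comparison, a sketch of the global embedding layer is given below; only the three-convolution-layer structure follows the text, while the kernel sizes, channel widths, and final pooling are assumptions.

```python
import torch
import torch.nn as nn

class GlobalEmbedding(nn.Module):
    """Three-convolution-layer CNN with a large (global) receptive field."""
    def __init__(self, in_ch=2, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, 64, kernel_size=7, padding=3), nn.ReLU(), nn.BatchNorm1d(64),
            nn.Conv1d(64, 32, kernel_size=7, padding=3), nn.ReLU(), nn.BatchNorm1d(32),
            nn.Conv1d(32, embed_dim, kernel_size=7, padding=3), nn.ReLU(), nn.BatchNorm1d(embed_dim),
            nn.AdaptiveAvgPool1d(1),                       # collapse time to obtain a global descriptor
        )

    def forward(self, x):                                  # x: (batch, in_ch, seq_len)
        return self.net(x).squeeze(-1)                     # (batch, embed_dim)

y = GlobalEmbedding()(torch.randn(8, 2, 128))              # (8, 128)
```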
3.2.2. Dual-scale channel attention module
After the feature maps are constructed as in the previous section, they are fed into the iterative dual-scale channel attention module (iDCAM). The dual-scale channel attention module (DCAM) is a computational unit containing two branches that can be constructed and stacked for feature map transformation.
Figure 6. Architecture of the proposed dual-scale channel attention module (DCAM).
The two branches, a local attention branch and a global attention branch, are used to extract the local identification properties and the channel variability between modalities, respectively. The local attention branch extracts intra-modal attention through the self-attention mechanism of the Transformer, capturing the local recognition properties of the specific modality features. Meanwhile, the global attention branch enlarges the receptive field by pooling to obtain inter-modal global attention along the channel dimension. The feature maps are fed into the dual-scale channel attention module, and the following steps are performed:
1) Passing through the encoder.
To capture the attention information between different modalities, the feature map is first sent into the encoder layer of the Transformer [38]. The encoder consists of a self-attention module and a feed-forward neural network. Concretely, the self-attention mechanism lets the vectors converted from different sequence tokens interact with each other, giving attention information about the correlation between different modalities. The basic formula of the self-attention mechanism is first expressed as follows:

$$Q = xW_{Q}, \quad K = xW_{K}, \quad V = xW_{V}.$$

Therefore, the input $x$ is converted into a query $Q$, a key $K$, and a value $V$ by means of three learnable weight matrices $W_{Q}$, $W_{K}$, and $W_{V}$. Here, $Q$ is used to query the similarity of other vectors to itself, and $K$ is used for indexing. By taking the dot product of $Q$ and $K$, the similarity between the two is computed and then converted into a weight probability distribution that expresses the importance of different modalities in different signal sequences as attention information. Specifically, the attention information is normalized by a scaling factor and the softmax function:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V,$$

where $d_{k}$ is the dimension of the key vectors. Finally, the output of the self-attention layer is obtained by weighting the value $V$, which aids the classification, with the attention information and accumulating the result. Utilizing multiple self-attention operations in parallel yields the multi-head attention layer, as shown in the following equation:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \ldots, \mathrm{head}_{h})W^{O}, \qquad \mathrm{head}_{i} = \mathrm{Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V}).$$
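The sketch below shows one way this encoder step could be realized with PyTorch's built-in Transformer encoder layer, treating the three modal embeddings as a short token sequence; the embedding dimension, head count, and feed-forward width are illustrative assumptions.

```python
import torch
import torch.nn as nn

embed_dim, n_heads = 128, 4                      # assumed sizes, not taken from the paper

# Stack the three modal embeddings (IQ, AP, SP) as a token sequence of length 3
tokens = torch.randn(8, 3, embed_dim)            # (batch, modalities, embed_dim)

# Self-attention + feed-forward encoder layer; Q, K, V are produced internally
# from the input tokens by learnable projections, as in the equations above.
encoder = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                     dim_feedforward=256, batch_first=True)
attended = encoder(tokens)                       # (8, 3, embed_dim), inter-modal attention applied
```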
2) Construct the global channel attention matrix.
First, the feature maps across the spatial dimensions $H \times W$ are aggregated by a squeeze compression operation. A channel descriptor containing global attention information is generated by global average pooling, which is denoted as follows:

$$z_{c} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{c}(i, j).$$

After the squeeze compression, the aggregated information is sent into two convolution layers to capture the channel dependencies:

$$g(X) = B\!\left(\mathrm{PWConv}_{2}\!\left(\delta\!\left(B\!\left(\mathrm{PWConv}_{1}(z)\right)\right)\right)\right),$$

where $\delta$ and $B$ represent the Rectified Linear Unit (ReLU) function and Batch Normalization (BN), respectively. Specifically, the convolutions we use are point-wise convolutions, which enhance the nonlinear capability of the network. The kernel sizes of $\mathrm{PWConv}_{1}$ and $\mathrm{PWConv}_{2}$ are $C/r \times C \times 1 \times 1$ and $C \times C/r \times 1 \times 1$, respectively, where $r$ is the channel reduction ratio.
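A minimal PyTorch sketch of this global branch is given below; it assumes the fused input is a (batch, C, H, W) feature map and uses a channel reduction ratio r = 4, which is an assumption rather than a value stated in the text.

```python
import torch
import torch.nn as nn

class GlobalChannelAttention(nn.Module):
    """Global average pooling followed by two point-wise convolutions."""
    def __init__(self, channels, r=4):
        super().__init__()
        mid = max(channels // r, 1)
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # z_c: aggregate over the H x W spatial dims
            nn.Conv2d(channels, mid, kernel_size=1),       # PWConv1 (channel reduction)
            nn.BatchNorm2d(mid),
            nn.ReLU(),
            nn.Conv2d(mid, channels, kernel_size=1),       # PWConv2 (channel restoration)
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):                                  # x: (batch, C, H, W)
        return self.net(x)                                 # (batch, C, 1, 1) channel attention logits

g = GlobalChannelAttention(channels=16)(torch.randn(8, 16, 4, 32))   # (8, 16, 1, 1)
```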
3) Matrix multiplication between the attention matrix and the original features.
$$Z = M \otimes X \oplus (1 - M) \otimes Y, \qquad M = \sigma\!\left(L(X \oplus Y) \oplus g(X \oplus Y)\right),$$

where ⊗ represents element-wise (matrix) multiplication, ⊕ denotes element-wise (matrix) addition, and $\sigma$ is the sigmoid function. The attention matrix $M$ contains the summation information of the local attention $L(\cdot)$ and the global attention $g(\cdot)$ extracted through the DCAM.
3.2.3. Iterative dual-scale channel attention module (iDCAM)
The inputs are the high-level feature map $X$ and the low-level feature map $Y$. $X$ utilizes the local sensing and context-aware inference capabilities of the CNN and LSTM to capture the discriminative properties of each modality. However, the extracted high-level features are rich in local semantic information but ignore inter-modal difference information. In contrast, $Y$ is extracted with a larger receptive field to capture global information, and the resulting low-level features express the distinctiveness between different modalities from a holistic perspective. However, because fewer convolutional layers are used, deep semantic information is difficult to mine from them. Therefore, to complement the advantages of the low-level and high-level features, the iterative dual-scale channel attention module (iDCAM) is designed.
Figure 7. The iterative attention mechanism iDCAM.
By stacking the DCAM designed in the previous section, the iDCAM assigns multimodal attention weights to the different modality features:

$$X \uplus Y = M_{1}(X \oplus Y) \otimes X \oplus \left(1 - M_{1}(X \oplus Y)\right) \otimes Y,$$

$$Z = M_{2}(X \uplus Y) \otimes X \oplus \left(1 - M_{2}(X \uplus Y)\right) \otimes Y,$$

where $X \uplus Y$ represents the summation information of the local feature map $X$ and the global feature map $Y$ produced by the first DCAM, and $M_{1}(\cdot)$ and $M_{2}(\cdot)$ are the attention weights generated by the first and second DCAM, respectively.
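Putting the two branches together, the following self-contained sketch illustrates the iterative fusion described above: a DCAM combines a local self-attention branch and a global channel-attention branch into sigmoid weights, and two such modules are stacked so that the first pass produces the intermediate map and the second refines the fusion of X and Y. All layer sizes are assumptions, and the code mirrors the reconstructed equations rather than reproducing the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DCAM(nn.Module):
    """Dual-scale channel attention: local self-attention branch + global channel-attention branch."""
    def __init__(self, channels, n_heads=4, r=4):
        super().__init__()
        self.local = nn.MultiheadAttention(embed_dim=channels, num_heads=n_heads, batch_first=True)
        mid = max(channels // r, 1)
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, f):                                   # f: (batch, C, H, W), e.g. X + Y
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)               # (batch, H*W, C) tokens for self-attention
        local, _ = self.local(tokens, tokens, tokens)       # local branch L(f)
        local = local.transpose(1, 2).reshape(b, c, h, w)
        return torch.sigmoid(local + self.global_branch(f)) # M = sigmoid(L(f) + g(f))

class iDCAM(nn.Module):
    """Iterative fusion: the first DCAM builds the intermediate map, the second refines the weights."""
    def __init__(self, channels):
        super().__init__()
        self.dcam1, self.dcam2 = DCAM(channels), DCAM(channels)

    def forward(self, x, y):
        m1 = self.dcam1(x + y)
        intermediate = m1 * x + (1.0 - m1) * y               # first-pass fusion of X and Y
        m2 = self.dcam2(intermediate)
        return m2 * x + (1.0 - m2) * y                       # final fused feature map Z

# Example: fuse a local feature map X and a global feature map Y with 16 channels
x, y = torch.randn(8, 16, 4, 32), torch.randn(8, 16, 4, 32)
z = iDCAM(16)(x, y)                                          # (8, 16, 4, 32)
```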
3.2.4. Residual encoder
After passing through the iDCAM, the weighted features are sent into the decoder together with the sum of the intermediate features, and decoding is guided by a cross-attention mechanism that uses this sum of the intermediate feature maps.