1. Introduction
Vehicle re-identification (ReID) [1,2,3,4,5,6,7,8,9] holds great importance for Intelligent Transportation Systems (ITS) in the context of smart cities. This technology identifies the same vehicle across different surveillance cameras by analyzing vehicle images. Traditionally, license plate images have been used for vehicle identification; however, clear license plate information can be difficult to obtain due to external factors such as obstructed license plates, obstacles, and image blur.
Thanks to the success of deep learning, vehicle identification algorithms for surveillance cameras have achieved impressive results [10,11,12,13,14,15,16,17]. Typically, these methods [11,17,18,19,20] employ a deep metric learning model built on feature extraction networks, trained to distinguish between vehicles with the same ID and those with different IDs in order to accomplish vehicle ReID. However, as shown in Figure 1, there are discernible disparities between vehicle images captured by Unmanned Aerial Vehicles (UAVs) and those acquired by stationary cameras. ReID on UAV imagery introduces unique complexities stemming from intricate shooting angles, occlusions, the limited discriminative power of top-down features, and substantial variations in vehicle scale.
It is worth mentioning that traditional vehicle ReID methods, primarily designed for stationary cameras, struggle to deliver optimal performance when adapted to UAV-based ReID. First, the shooting angles of UAVs are complex: a UAV can shoot from different positions and angles, and the camera viewpoint changes accordingly. Such viewpoint changes can give the same object or scene different appearances and characteristics in different images. Second, a UAV can look down on or view a target or scene obliquely from different angles, producing viewpoint changes in the image that may deform or occlude the target's shape and thus hinder feature extraction. To solve these problems, the feature extractor used for ReID needs a mechanism [3,21,22,23] that can extract finer-grained features to cope with the challenges of the drone perspective. Viewpoint changes require the feature extraction algorithm to be robust enough to correctly identify and describe the target under significant viewpoint variation, and to adapt to shape changes and occlusions so as to improve feature reliability and robustness across views.
In recent years, the attention mechanism has gained significant popularity across many domains of deep convolutional neural networks. Its fundamental idea is to identify, from a vast volume of available data, the information most crucial to a given task. The attention mechanism selectively focuses on different regions or feature channels of an image to improve the model's attention to and perception of crucial visual content. In UAV-based vehicle ReID, the attention mechanism enables the model to enhance its perception capabilities by selectively highlighting specific regions or feature channels of the vehicle.
However, most attention mechanisms [23,24,25,26] extract features only from channels or only from spatial locations. The channel attention mechanism can effectively enhance essential channels, but it cannot deal with subtle inter-class similarity. Spatial attention mechanisms can selectively amplify or suppress features in specific regions, but they ignore the relationships between channels. To overcome the shortcomings of a single attention mechanism, recent studies have begun to combine channel and spatial attention [27,28,29]. Such hybrid attention mechanisms consider the relationship between channel and space simultaneously to better capture the critical information in the input feature tensor. By introducing multiple attention branches or fusing different attention weights, the interaction between features can be modeled more comprehensively. Shuffle Attention (SA) [27] divides the channels into sub-channels to extract key channel features and locally fused spatial features, with each sub-channel acquiring fused channel and spatial attention. The Bottleneck Attention Module (BAM) [28] generates an attention map through two distinct pathways, channel and spatial. The Dual Attention Network (DANet) [29] attaches two types of attention modules to a dilated Fully Convolutional Network (FCN); these modules effectively capture semantic dependencies in both the spatial and channel dimensions.
Most methods pass the input feature map directly through the fused attention. In contrast, SA [27] provides a richer feature representation by dividing the channels into sub-channels, which better captures the structure and associations in images and other data. However, because SA [27] divides only the channel into sub-channels, it mainly weights the input features along the channel dimension and ignores details that may exist in the spatial dimension.
To address this, we propose the Dual Mixing Attention Module (DMAM), which combines Spatial Mixing Attention (SMA) with Channel Mixing Attention (CMA). The original feature is divided along the spatial and channel dimensions to obtain multiple subspaces. Each sub-feature map is processed independently, so the features of different channels and local regions can be extracted and the network can better associate local features with the whole feature. A learnable weight is then applied to capture the dependencies between local features in the mixing space. Finally, the features extracted from the multiple subspaces are merged to enhance their comprehensive interaction. This approach extracts more resilient features and improves recognition accuracy. The key contributions of this method are outlined as follows:
• We introduce a novel Dual Mixing Attention Network (DMANet) designed to handle the challenges of Unmanned Aerial Vehicle (UAV)-based vehicle re-identification (ReID). DMANet effectively addresses issues related to shooting angles, occlusions, top-down features, and scale variations, resulting in enhanced viewpoint-robust feature extraction.
• Our proposed Dual Mixing Attention Module (DMAM) employs Spatial Mixing Attention (SMA) and Channel Mixing Attention (CMA) to capture pixel-level pairwise relationships and channel dependencies. This modular design fosters comprehensive feature interactions, improving discriminative feature extraction under varying viewpoints.
• The versatility of DMAM allows its seamless integration into existing backbone networks at varying depths, significantly enhancing vehicle discrimination performance. Extensive experiments demonstrate that our approach outperforms representative methods in the UAV-based vehicle re-identification task, affirming its efficacy in challenging aerial scenarios.
The remainder of this paper is structured as follows: Section 2 presents a comprehensive review and discussion of related studies; Section 3 elaborates the proposed approach in detail; Section 4 presents the experimental results along with comparisons; and Section 5 concludes the paper.
3. Proposed Method
3.1. Dual Mixing Attention Module
We use a standard ResNet-50 as our backbone to extract features. Our proposed DMAM is shown in Figure 2. The vehicle appearance captured from the UAV perspective varies significantly because the shooting angles of UAVs are complex: a UAV can look down on or view a target or scene obliquely from different angles, resulting in viewpoint changes in the image. Some existing attention mechanisms use channel or spatial attention alone. However, spatial attention focuses on spatial regions while ignoring the distinct characteristics of the channels, and channel attention filters important feature channels while missing spatial features. Although some attention mechanisms fuse channel and spatial attention, such mixed attention tends to learn along a single dimension, ignoring the features of the remaining dimensions. To address this, we propose a novel DMAM to capture pixel-level pairwise relationships and channel dependencies, where DMAM comprises SMA and CMA. To enhance vehicle ReID, DMAM can also be easily added to backbone networks at any depth. We denote the input original feature as $X \in \mathbb{R}^{C \times H \times W}$, which goes through DMAM to produce the enhanced output feature $X' \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the number of channels and $H$ and $W$ denote the height and width, respectively.
Firstly, the original feature map is split along the spatial and channel dimensions to obtain multiple subspaces, namely $X_c = \{x_c^1, \ldots, x_c^n\}$ and $X_s = \{x_s^1, \ldots, x_s^n\}$. Each sub-feature map is processed independently, so the features of different channels and local regions can be extracted and the network can better associate local features with the whole feature. Secondly, the channel subfeature $X_c$ and the spatial subfeature $X_s$ are sent into CMA and SMA, respectively, to learn channel- and spatial-mixed features, with outputs $Y_c$ and $Y_s$. The feature maps $X_c$ and $Y_c$ have dimension $C \times H \times W$, as do the feature maps $X_s$ and $Y_s$. The hybrid features increase the model's capacity to represent diverse input data, which helps it extract complex feature representations from the UAV perspective. Thirdly, the features extracted from the respective subspaces are aggregated: after reshaping, the feature maps are transformed into $Y_c'$ and $Y_s'$, and after concatenation the output of the two feature maps is $Y \in \mathbb{R}^{2C \times H \times W}$. Aggregating the features of multiple subspaces enhances the correlation between features, promotes the interaction and information transfer between different features, and also improves generalization. Fourthly, the feature map $Y$ passes through a set of 1×1 convolution (Conv), Batch Normalization (BN), and Rectified Linear Unit (ReLU) layers, with the feature map $Z$ as output. The 1×1 Conv changes the dimension of the feature map from $2C \times H \times W$ back to $C \times H \times W$; BN normalizes the data distribution and prevents gradient explosion; and ReLU, an activation function with low computational complexity, speeds up the gradient descent of the neural network, helping it cope with significant changes in vehicle size. Together, Conv, BN, and ReLU effectively enhance the performance of the model and help it learn complex or occluded features. Finally, a residual structure connects the original feature map through a skip connection, $X' = X + Z$, which accelerates the model's convergence, makes better use of the information from previous levels, and better identifies the features in the top-down views shot by the UAV.
Summing up these steps, the proposed Dual Mixing Attention Module comprises subspace segmentation, channel- and spatial-mixed feature learning, feature aggregation, convolution-normalization-activation, and a residual structure. DMAM can be attached behind the backbone network to make the features deeper and more expressive, effectively coping with complicated shooting angles, occlusions, the low discriminative power of top-down features, and significant changes in vehicle scale. By improving the robustness and discrimination of the features, DMAM improves the model's performance on ReID tasks.
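To make the pipeline concrete, the following is a minimal PyTorch sketch of DMAM as described above; it is a sketch under our stated assumptions, not a reference implementation. The branch modules are injected (identity placeholders keep the sketch self-contained and runnable), the subspace splitting is assumed to happen inside each branch, and the 1×1 Conv-BN-ReLU fusion reduces the concatenated $2C$ channels back to $C$ before the residual addition.

```python
import torch
import torch.nn as nn

class DMAM(nn.Module):
    """Dual Mixing Attention Module sketch:
    branches -> aggregate -> 1x1 Conv-BN-ReLU -> residual.

    `cma` and `sma` are the mixing-attention branches (sketched in
    Sections 3.2 and 3.3); identity placeholders are used by default.
    """
    def __init__(self, channels: int, cma: nn.Module = None, sma: nn.Module = None):
        super().__init__()
        self.cma = cma if cma is not None else nn.Identity()
        self.sma = sma if sma is not None else nn.Identity()
        # 1x1 Conv reduces the concatenated 2C channels back to C.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_c = self.cma(x)                 # channel-mixed features, C x H x W
        y_s = self.sma(x)                 # spatial-mixed features, C x H x W
        y = torch.cat([y_c, y_s], dim=1)  # aggregated subspace features, 2C x H x W
        z = self.fuse(y)                  # Conv-BN-ReLU, back to C x H x W
        return x + z                      # residual connection to the original feature


# Example: enhance a ResNet-50 stage-4 feature map (batch 2, 2048 channels, 16x16).
feat = torch.randn(2, 2048, 16, 16)
enhanced = DMAM(2048)(feat)
print(enhanced.shape)  # torch.Size([2, 2048, 16, 16])
```

Because the output shape matches the input shape, a module of this form can be inserted after any backbone stage, which is what allows DMAM to be attached at arbitrary depths.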
3.2. Channel Mixing Attention
Channel Mixing Attention splits the feature map along the channel dimension and then progressively merges channel and spatial attention to obtain the CMA output. Through this synthesis, the model can understand and represent the input data more comprehensively, improving the distinguishing ability and generalization performance of the features.
CMA is broken down into three phases, as depicted in Figure 3. Firstly, the channel subfeature $X_c \in \mathbb{R}^{C \times H \times W}$ is split along the channel into feature maps $X_{c1}$ and $X_{c2}$, which serve as the input feature maps of the spatial and channel attention branches, respectively. Secondly, the spatial input feature map $X_{c1}$ is multiplied by the output of group norm (GN) and shuffling to extract spatial features; by focusing on the critical spatial areas of vehicles, this spatial attention extracts more discriminative features. GN divides $X_{c1}$ into $g$ groups along the channel and computes the mean and variance of each group. The formula for the mean $\mu$ is as follows:
$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i,$$
where the parameter $\mu$ represents the mean over the $m$ elements $x_i$ of a group, and the parameter $\sigma^2$ denotes the variance. The formula for $\sigma^2$ is as follows:
$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2.$$
The formula for GN is as follows:
$$\mathrm{GN}(x_i) = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}},$$
where the parameter $\epsilon$ is a tiny constant that avoids division by zero. Shuffling uses learnable parameters to extract the feature weights of the subspace so that the model can better filter and focus on essential features. The calculation formula is as follows:
$$X_{c1}' = \sigma\left(W \cdot \mathrm{GN}(X_{c1}) + b\right) \cdot X_{c1},$$
where the parameters $W$ and $b$ are learnable and $\sigma(\cdot)$ is the sigmoid function. By calculating a weight for each element of the feature map, different importance can be assigned to different features, and the model's performance can be improved by focusing more intensively on the features that are most helpful to the task.
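As a quick numerical check of the three GN formulas above, the following sketch normalizes each group by hand and compares the result with PyTorch's built-in group normalization; the tensor shape and group count are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 4, 4)  # C=8 channels; g=2 groups of 4 channels each
g, eps = 2, 1e-5

# Manual group normalization following the formulas above.
grouped = x.view(1, g, -1)                              # each group has m = 4*4*4 elements
mu = grouped.mean(dim=2, keepdim=True)                  # per-group mean
var = grouped.var(dim=2, unbiased=False, keepdim=True)  # per-group variance
manual = ((grouped - mu) / torch.sqrt(var + eps)).view_as(x)

reference = F.group_norm(x, num_groups=g, eps=eps)      # built-in GN, no affine
print(torch.allclose(manual, reference, atol=1e-6))     # True
```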
In the other branch, the feature map $X_{c2}$ is sequentially subjected to adaptive average pooling (AAP) and shuffling, and the output is multiplied by $X_{c2}$ to extract the channel features of the feature map. This channel attention learns the importance of the different channels and weights the features according to this importance. AAP changes the dimensions of the feature map from $C/2 \times H \times W$ to $C/2 \times 1 \times 1$, aggregating the global spatial information of each channel.
Finally, the output dimension after concatenation is $C \times H \times W$. The output feature map contains attention over both the channels and the spatial locations of the feature map. By mixing channel and spatial attention, the model can focus more precisely on different channels, locations, and their correlations, and can better adapt to the different scales, shapes, and positions of vehicles photographed by UAVs.
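A minimal PyTorch sketch of CMA under the assumptions above (a channel split into two halves, a GN-gated spatial branch, an AAP-gated channel branch, and per-channel learnable weights $W$ and $b$ for the shuffling step) might look as follows; the parameter shapes are our reading of Figure 3, not the reference code.

```python
import torch
import torch.nn as nn

class CMA(nn.Module):
    """Channel Mixing Attention sketch: split along the channel into a
    GN-gated spatial-attention branch and an AAP-gated channel-attention branch."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        half = channels // 2
        self.gn = nn.GroupNorm(groups, half)   # group norm for the spatial branch
        # Learnable per-channel weights/biases for the "shuffling" gating.
        self.w_s = nn.Parameter(torch.ones(1, half, 1, 1))
        self.b_s = nn.Parameter(torch.zeros(1, half, 1, 1))
        self.w_c = nn.Parameter(torch.ones(1, half, 1, 1))
        self.b_c = nn.Parameter(torch.zeros(1, half, 1, 1))
        self.pool = nn.AdaptiveAvgPool2d(1)    # AAP: C/2 x H x W -> C/2 x 1 x 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)  # split along the channel into two halves
        spatial = x1 * torch.sigmoid(self.w_s * self.gn(x1) + self.b_s)   # spatial branch
        channel = x2 * torch.sigmoid(self.w_c * self.pool(x2) + self.b_c) # channel branch
        return torch.cat([spatial, channel], dim=1)  # concat restores C x H x W
```

A module of this form can be passed as the `cma` argument of the DMAM sketch in Section 3.1.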
3.3. Spatial Mixing Attention
SMA is similar to CMA. The difference between the two is that CMA groups the feature maps along the channel dimension, mainly focusing attention within each set of channels, whereas SMA groups the feature maps along the spatial dimensions. The input matrix is $X_s \in \mathbb{R}^{C \times H \times W}$. In SMA, attention is used to weight and select features in the spatial dimension, while in CMA the attention mechanism is used to weight and select features in the channel dimension.
SMA is likewise divided into two branches: one branch passes $X_{s1}$ through GN and shuffling and multiplies the result with $X_{s1}$ to extract channel features; the other passes $X_{s2}$ through adaptive average pooling and shuffling and multiplies the result with $X_{s2}$ to extract spatial features. Finally, the output dimension after concatenation is $C \times H \times W$.
By dividing the feature map into spatial subregions, the model can compute an attention weight for each subregion. This focuses more accurately on the importance of different spatial locations, allowing the model to better capture local information about the vehicle. The approach considers the relationship between different locations and channels while retaining the importance of spatial location, helping the model understand the interaction of different channels at various locations and thus improving the consistency and accuracy of the feature representation. It has great potential when dealing with complex UAV perspectives.
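Under the same assumptions as the CMA sketch, SMA can be sketched by splitting along a spatial dimension instead of the channels; splitting along the height (rather than the width) and an even height are our illustrative choices, not details confirmed by the figure.

```python
import torch
import torch.nn as nn

class SMA(nn.Module):
    """Spatial Mixing Attention sketch: split along the height; one half gets
    a GN-gated channel branch, the other an AAP-gated spatial branch."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.w1 = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.b1 = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.w2 = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.b2 = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x.shape[2]  # assumes h is even
        x1, x2 = x[:, :, : h // 2], x[:, :, h // 2 :]  # split along the height
        y1 = x1 * torch.sigmoid(self.w1 * self.gn(x1) + self.b1)    # channel features
        y2 = x2 * torch.sigmoid(self.w2 * self.pool(x2) + self.b2)  # spatial features
        return torch.cat([y1, y2], dim=2)  # concat along height restores C x H x W
```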
5. Conclusions
In this paper, we proposed a novel DMANet to extract discriminative features robust to variations in viewpoint. Specifically, we presented a plug-and-play DMAM composed of SMA and CMA: first, the original feature is divided along the spatial and channel dimensions to obtain multiple subspaces; then, a learnable weight is applied to capture the dependencies; finally, the features extracted from all subspaces are aggregated to promote comprehensive feature interaction. Experiments showed that the proposed structure outperforms representative methods in the UAV-based vehicle ReID task.
Further Work. There are few datasets for vehicle ReID based on the UAV perspective, so the research space is ample. Future work could extend the datasets with different scenes, lighting conditions, and resolutions. Furthermore, datasets could capture changes in vehicle details, such as changes in the position of vehicle decorations or in the passengers; the model would then need to determine from these details whether the changed vehicle belongs to the same ID. Such changes align with reality and will present a significant challenge for vehicle ReID.