1. Introduction
The rapid advancement of space information technology and the exponential growth of remote sensing image data have created a pressing need to extract valuable information from vast volumes of remote sensing imagery efficiently and conveniently. In response to this demand, cross-modal retrieval between remote sensing images and text descriptions has emerged as a valuable approach: given a remote sensing image, it finds matching text descriptions, and given a text description, it identifies remote sensing images with relevant content. The growing attention this field has received highlights its potential for addressing this need.
Significant advances have been achieved in cross-modal retrieval of natural images and texts, with impressive average R@1 accuracies of 75.8% and 95.3% on the MS COCO and Flickr30k datasets, respectively [1]. However, compared with natural images, remote sensing images possess three distinct characteristics. First, they are objective representations of ground objects, so the semantic details within an image are intricate and diverse; a remote sensing image can therefore be decomposed into multiple basic units of semantic expression. Second, unlike natural images, remote sensing images lack specific themes and focal points [2], which gives them a pronounced multi-perspective nature: the same remote sensing image can generate various descriptions from different perspectives, encompassing different combinations and permutations of the underlying fine-grained semantic units. Third, remote sensing images of the same geographical area may vary in color, brightness, resolution, and shooting angle owing to factors such as weather conditions, photography equipment, and aircraft position. These inherent characteristics pose substantial challenges for effective cross-modal retrieval of remote sensing images. Addressing the fine-grained and multi-perspective attributes requires capturing the intricate semantic correlations between images and texts, including the associations between image regions and text descriptors. Furthermore, it is necessary to overcome the influence of large differences in image resolution, color, and angle on the precise alignment of image-text correlations.
Recent studies on cross-modal retrieval of remote sensing images and texts have predominantly followed a two-step approach, involving unimodal feature extraction (Figure 1a) and multimodal interaction (Figure 1b). During the unimodal feature extraction stage, remote sensing images and text data are transformed into numerical representations that capture their semantic content for further statistical modeling. Deep learning techniques, such as convolutional neural networks (CNNs) (e.g., VGGNet [3], ResNet [4]) and vision Transformer networks [5], are commonly employed for extracting image features. Similarly, recurrent neural networks (RNNs) (e.g., LSTM [6], GRU [7]) and Transformer models (e.g., BERT [8]) are utilized for extracting textual features. In the subsequent multimodal interaction stage, the semantic consistencies between image and text features are leveraged to generate comprehensive feature representations that effectively summarize the multimodal data. Baltrusaitis et al. [9] classified multimodal feature representations into joint representations and coordinated representations. Joint representations merge multiple unimodal signals and map them into a unified representation, whereas coordinated representations process each modality independently while imposing similarity constraints between the modalities. Following this framework, recent methods for multimodal interaction between remote sensing images and texts can be categorized into two groups: multimodal semantic alignment and multimodal fusion encoding.
The upper part of Figure 1b illustrates multimodal semantic alignment methods [10,11,12,13,14,15,16,17,18,19,20]. These approaches aim to align image and text data in a shared embedding space according to their semantic information, so that images and texts with similar semantics are positioned close to each other. During cross-modal retrieval, the similarity between image and text features is determined by measuring their distance in the shared embedding space, and the results are then sorted. For the multimodal interaction itself, a simple dot product or shallow attention mechanism is commonly employed to calculate the similarity between images and texts. Triplet loss [21] and InfoNCE loss [22] are applied, either directly or through intermediate variables, to constrain the positions and distances of image and text features within the shared embedding space.
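As a concrete, hedged illustration of this coordinated-representation idea, the sketch below computes a symmetric InfoNCE loss over a batch of image and text embeddings in a shared space. The embedding layout (matched pairs sharing a row index), the temperature value, and the use of PyTorch are our assumptions for illustration, not details taken from the cited methods.

```python
import torch
import torch.nn.functional as F

def infonce_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired (B, D) image/text embeddings.

    Matching image-text pairs share the same row index; every other row in the
    batch acts as a negative. Both retrieval directions are averaged.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (B, B) scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```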
The bottom half of Figure 1b depicts multimodal fusion encoding methods [23]. These approaches feed remote sensing image and text features into a unified fusion encoder to obtain joint representations of image-text pairs, and then perform a binary classification task, known as image-text matching (ITM), to determine how well the image and text match. During retrieval, the ITM score is employed as the measure of similarity between the image and text.
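For the fusion-encoding branch, a minimal sketch of such an ITM head follows: a pooled joint representation from the fusion encoder is classified as matched or unmatched, and the softmax probability of the matched class is reused as the retrieval similarity. The hidden size and the single linear layer are illustrative assumptions, not the exact head used in [23] or in MTGFE.

```python
import torch
import torch.nn as nn

class ITMHead(nn.Module):
    """Binary matched/unmatched classifier over the joint image-text representation."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 2)

    def forward(self, joint_cls: torch.Tensor) -> torch.Tensor:
        # joint_cls: (B, hidden_dim) pooled output of the fusion encoder
        return self.classifier(joint_cls)                # (B, 2) logits: [unmatched, matched]

def itm_score(head: ITMHead, joint_cls: torch.Tensor) -> torch.Tensor:
    """Probability of the 'matched' class, used as the image-text similarity at retrieval time."""
    return head(joint_cls).softmax(dim=-1)[:, 1]
```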
In earlier multimodal semantic alignment methods, cosine similarity was computed directly between the global feature vectors of images and texts [14,15]. However, the global similarity between an image and a text usually arises from an intricate aggregation of local similarities between image fragments (such as objects) and sentence fragments (such as words) [24]. Global feature vectors alone therefore struggle to capture the fine-grained correlations needed to address the fine-grained and multi-perspective characteristics of remote sensing images. To overcome this limitation, researchers have replaced global features with fine-grained unimodal features, such as region features [25] and patch features [23] for images and word features for texts [16,23,25], and have established fine-grained correlations between images and texts through cross-attention mechanisms between the modalities. To guide the unimodal features in the shared embedding space, previous studies such as Yuan, Zhang, Fu, et al. [10], Zheng et al. [25], and Cheng, Zhou, Fu, et al. [16] employed shallow cross-attention mechanisms. However, even with high-performance unimodal encoders, such simple interaction calculations between features may still fall short on complex vision-and-language tasks [26]. To address this limitation, Li et al. [23] introduced a large-scale Transformer network as a multimodal fusion encoder. By stacking multiple multi-head cross-attention modules, this approach enables complex interaction calculations on fine-grained features across modalities, thereby further exploring potential fine-grained correlations between them.
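The sketch below shows the kind of fusion layer such an encoder stacks: text tokens first attend to themselves and then query image patch features through multi-head cross-attention. The dimensions, depth, and pre-norm layout are assumptions for illustration, not the exact configuration of [23] or of MTGFE.

```python
import torch
import torch.nn as nn

class CrossAttentionFusionLayer(nn.Module):
    """One illustrative fusion layer: text tokens query image patch features."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
        # Self-attention over the text stream.
        x = self.norm1(text_tokens)
        text_tokens = text_tokens + self.self_attn(x, x, x)[0]
        # Cross-attention: queries come from the text stream, keys/values from image patches.
        x = self.norm2(text_tokens)
        text_tokens = text_tokens + self.cross_attn(x, image_patches, image_patches)[0]
        # Position-wise feed-forward block.
        text_tokens = text_tokens + self.ffn(self.norm3(text_tokens))
        return text_tokens
```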
However, existing multimodal fusion encoding models rely primarily on the ITM task as the sole training objective and lack precise supervision signals for capturing fine-grained correlations between images and texts, which makes it difficult to supervise the correspondence between specific words in the text and particular regions in the image. To address this issue, we incorporate the masked language modeling (MLM) task from recent vision-language pre-training (VLP) models [27,28,29]. In the MLM task, some words in the text are masked, and the model is trained to predict the masked words from the remaining textual context together with region-level information from the image. This facilitates more effective capture of fine-grained image-text correlations.
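A minimal sketch of the masking step is shown below. It follows the standard BERT recipe but omits the 80/10/10 replacement rule and the handling of special tokens; the mask token id and vocabulary size are placeholders supplied by whatever tokenizer is used.

```python
import torch
import torch.nn.functional as F

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float = 0.15,
                ignore_index: int = -100):
    """Randomly mask tokens for the MLM objective (simplified sketch).

    Returns the corrupted input and labels that keep the original ids only at
    masked positions, so the loss is computed over masked tokens alone.
    """
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mask_prob, dtype=torch.float)).bool()
    labels[~masked] = ignore_index
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id
    return corrupted, labels

# The fusion encoder, conditioned on the image, produces logits of shape (B, L, vocab_size);
# the MLM loss is then F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1),
# ignore_index=-100), where vocab_size comes from the tokenizer.
```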
To address the imaging differences in remote sensing images caused by factors such as weather conditions, photography equipment, and spacecraft position, this paper introduces the multi-view joint representations contrast (MVJRC) task. The task applies automatic contrast, histogram equalization, brightness adjustment, sharpness adjustment, flipping, rotation, and offset operations to simulate imaging differences. A weight-sharing twin network is designed to maximize, during training, the similarity between the joint representations formed by different augmented views of the same remote sensing image and the corresponding text. By alternating the gradient updates, the model uses the mutual information shared by the joint representations of the same remote sensing image under different views as a supervision signal. MVJRC thus filters out noise introduced by imaging differences, enforces strong consistency among the joint representations of different views of an image and its text, and makes paired samples easier to discriminate. Furthermore, MVJRC provides additional complementary signals to the complex cross-attention modules between modalities, thereby encouraging consistent fine-grained correlations.
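The following sketch shows the SimSiam-style objective that this kind of weight-sharing twin-network contrast typically reduces to: joint representations of two augmented views pass through projection and prediction heads, and a symmetric negative cosine similarity is minimized with stop-gradient so that the two branches are updated alternately. The head architectures and dimensions below are illustrative assumptions, not the exact MVJRC design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MVJRCHeads(nn.Module):
    """Illustrative projection and prediction heads over joint image-text representations."""

    def __init__(self, dim: int = 768, proj_dim: int = 256):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(dim, proj_dim), nn.BatchNorm1d(proj_dim), nn.ReLU(),
            nn.Linear(proj_dim, proj_dim))
        self.predict = nn.Sequential(
            nn.Linear(proj_dim, proj_dim // 2), nn.ReLU(),
            nn.Linear(proj_dim // 2, proj_dim))

def _neg_cosine(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # Stop-gradient on z: only the prediction branch receives gradients from this term.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def mvjrc_loss(heads: MVJRCHeads, joint_view1: torch.Tensor, joint_view2: torch.Tensor):
    """Symmetric contrastive loss between joint representations of two augmented views."""
    z1, z2 = heads.project(joint_view1), heads.project(joint_view2)
    p1, p2 = heads.predict(z1), heads.predict(z2)
    return 0.5 * (_neg_cosine(p1, z2) + _neg_cosine(p2, z1))
```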
The growing computational complexity of large-scale networks reduces the efficiency of measuring multimodal similarity during cross-modal retrieval. Negative samples with low similarity (easy negatives) are straightforward to identify, whereas negative samples with high similarity (hard negatives) typically require a more elaborate model. To address this challenge, we propose the retrieval filtering (RF) method, which employs a small-scale network as a filter and uses knowledge distillation [30] to transfer the "knowledge" of similarity measurement from the complex fusion network to the filter. During retrieval, the small-scale filter first screens out easy negatives, and the top k samples with high similarity are then fed into the complex fusion encoder for similarity calculation and re-ranking. The RF method thus significantly improves retrieval efficiency with minimal loss of accuracy, even for large sample sizes.
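A hedged sketch of this two-stage retrieval is shown below: the distilled filter scores every candidate cheaply, only the top-k survivors are re-scored by the full fusion encoder, and the final ranking follows the expensive scores. The scoring functions, the value of k, and the data layout are placeholders, not the paper's implementation.

```python
import torch

def retrieve_with_filter(query, candidates, filter_score_fn, fusion_score_fn, k: int = 128):
    """Two-stage retrieval: a lightweight filter prunes easy negatives, then the
    large fusion encoder re-ranks only the top-k remaining candidates.

    filter_score_fn and fusion_score_fn stand in for the distilled filter and the
    full MTGFE ITM scorer; both return a 1-D tensor of similarity scores.
    """
    coarse = filter_score_fn(query, candidates)                   # (N,) cheap scores
    k = min(k, coarse.numel())
    _, top_idx = coarse.topk(k)                                   # hardest candidates survive
    kept = top_idx.tolist()
    fine = fusion_score_fn(query, [candidates[i] for i in kept])  # (k,) expensive ITM scores
    order = fine.argsort(descending=True).tolist()
    return [kept[i] for i in order]                               # candidate indices, best first
```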
In this research, we introduced a multi-task guided fusion encoder (MTGFE) for cross-modal retrieval of remote sensing images and texts. The key contributions of this paper can be summarized as follows:
(1) The model was trained using a combination of the ITM, MLM, and MVJRC tasks, enhancing its ability to capture fine-grained correlations between remote sensing images and texts.
(2) The introduction of the MVJRC task improved the consistency of feature expression and fine-grained correlation, particularly when dealing with variations in colors, resolutions, and shooting angles.
(3) To address the computational complexity and retrieval efficiency limitations of large-scale fusion encoding networks, we proposed the RF method. This method filters out easy negative samples, ensuring both high retrieval accuracy and efficient retrieval performance.
Figure 1.
General framework for remote sensing image and text retrieval. (a) Unimodal feature extraction stage. (b) Multimodal interaction stage. The methods can be categorized into two groups based on the generation of a unified multimodal representation: multimodal semantic alignment and multimodal fusion encoding.
Figure 2.
Overview of the MTGFE model. It comprises two components: (a) a unimodal encoder that utilizes the ViT and BERT (first 6 layers) models to extract features from images and texts, and (b) a multimodal fusion encoder that generates joint image-text representations through ITM, MLM, and MVJRC tasks. Additionally, (c) a retrieval filter is trained via knowledge distillation. During retrieval, the filter eliminates easy negatives, and the teacher network performs re-ranking.
Figure 3.
Diagram of the MLM task, where the masked tokens are predicted by leveraging contextual information from both the input image and the text.
Figure 4.
The MVJRC task sets up a twin network that shares parameters with MTGFE. The cosine similarity of the joint features is calculated after the projection and prediction heads, and the gradients are updated alternately.
Figure 5.
Retrieval filtering architecture. Knowledge distillation is utilized to transfer the knowledge from MTGFE to the filter. During the retrieval process, the filter is employed to exclude easily distinguishable negatives, while samples with higher similarity are forwarded to MTGFE for similarity re-calculation and re-ranking.
Figure 6.
Attention heat maps of sentence words on the image area in the image-text fusion encoder. (a) ITM task only, (b) ITM+MLM tasks, (c) ITM+MVJRC tasks, and (d) ITM+MLM+MVJRC tasks simultaneously.
Figure 7.
Evaluation of correlation quality between text words and image regions for image-text pairs with more complex semantics. (a) Results obtained using the ITM task, (b) Results obtained using ITM+MLM tasks, (c) Results obtained using ITM+MVJRC tasks, and (d) Results obtained using ITM+MLM+MVJRC tasks simultaneously.
Table 1.
Basic information of datasets.
| Dataset | Images | Captions | Captions per image | No. of classes | Image size |
|---|---|---|---|---|---|
| UCM-captions | 2,100 | 10,500 | 5 | 21 | 256×256 |
| Sydney-captions | 613 | 6,035 | 5 | 7 | 500×500 |
| RSICD | 10,921 | 54,605 | 5 | 31 | 224×224 |
| RSITMD | 4,743 | 23,715 | 5 | 32 | 256×256 |
Table 2.
Experimental results of remote sensing image-text cross-modal retrieval on UCM-captions, Sydney-captions, RSICD, RSITMD datasets and comparison with baseline models.
UCM-captions dataset:

| Approach | Text Retrieval R@1 | Text Retrieval R@5 | Text Retrieval R@10 | Image Retrieval R@1 | Image Retrieval R@5 | Image Retrieval R@10 | mR |
|---|---|---|---|---|---|---|---|
| VSE++ | 12.38 | 44.76 | 65.71 | 10.1 | 31.8 | 56.85 | 36.93 |
| SCAN | 14.29 | 45.71 | 67.62 | 12.76 | 50.38 | 77.24 | 44.67 |
| MTFN | 10.47 | 47.62 | 64.29 | 14.19 | 52.38 | 78.95 | 44.65 |
| SAM | 11.9 | 47.1 | 76.2 | 10.5 | 47.6 | 93.8 | 47.85 |
| AMFMN | 16.67 | 45.71 | 68.57 | 12.86 | 53.24 | 79.43 | 46.08 |
| LW-MCR | 13.14 | 50.38 | 79.52 | 18.1 | 47.14 | 63.81 | 45.35 |
| MAFA-Net | 14.5 | 56.1 | 95.7 | 10.3 | 48.2 | 80.1 | 50.82 |
| FBCLM | 28.57 | 63.81 | 82.86 | 27.33 | 72.67 | 94.38 | 61.6 |
| MTGFE | 47.14 | 78.1 | 90.95 | 40.19 | 74.95 | 94.67 | 71 |

Sydney-captions dataset:

| Approach | Text Retrieval R@1 | Text Retrieval R@5 | Text Retrieval R@10 | Image Retrieval R@1 | Image Retrieval R@5 | Image Retrieval R@10 | mR |
|---|---|---|---|---|---|---|---|
| VSE++ | 24.14 | 53.45 | 67.24 | 6.21 | 33.56 | 51.03 | 39.27 |
| SCAN | 18.97 | 51.72 | 74.14 | 17.59 | 56.9 | 76.21 | 49.26 |
| MTFN | 20.69 | 51.72 | 68.97 | 13.79 | 55.51 | 77.59 | 48.05 |
| SAM | 9.6 | 34.6 | 53.8 | 7.7 | 28.8 | 59.6 | 32.35 |
| AMFMN | 29.31 | 58.62 | 67.24 | 13.45 | 60 | 81.72 | 51.72 |
| LW-MCR | 20.69 | 60.34 | 77.59 | 15.52 | 58.28 | 80.34 | 52.13 |
| MAFA-Net | 22.3 | 60.5 | 76.4 | 13.1 | 61.4 | 81.9 | 52.6 |
| FBCLM | 25.81 | 56.45 | 75.81 | 27.1 | 70.32 | 89.68 | 57.53 |
| MTGFE | 44.83 | 68.97 | 86.21 | 38.28 | 69.31 | 83.1 | 61.52 |
RSICD dataset:

| Approach | Text Retrieval R@1 | Text Retrieval R@5 | Text Retrieval R@10 | Image Retrieval R@1 | Image Retrieval R@5 | Image Retrieval R@10 | mR |
|---|---|---|---|---|---|---|---|
| VSE++ | 3.38 | 9.51 | 17.46 | 2.82 | 11.32 | 18.1 | 10.43 |
| SCAN | 5.85 | 12.89 | 19.84 | 3.71 | 16.4 | 26.73 | 14.24 |
| MTFN | 5.02 | 12.52 | 19.74 | 4.9 | 17.17 | 29.49 | 14.81 |
| SAM | 12.8 | 31.6 | 47.3 | 11.5 | 35.7 | 53.4 | 32.05 |
| AMFMN | 5.39 | 15.08 | 23.4 | 4.9 | 18.28 | 31.44 | 16.42 |
| LW-MCR | 4.39 | 13.35 | 20.29 | 4.3 | 18.85 | 32.34 | 15.59 |
| MAFA-Net | 12.3 | 35.7 | 54.41 | 12.9 | 32.4 | 47.6 | 32.55 |
| FBCLM | 13.27 | 27.17 | 37.6 | 13.54 | 38.74 | 56.94 | 31.21 |
| GaLR | 6.59 | 19.9 | 31 | 4.69 | 19.5 | 32.1 | 18.96 |
| MTGFE | 15.28 | 37.05 | 51.6 | 8.67 | 27.56 | 43.92 | 30.68 |

RSITMD dataset:

| Approach | Text Retrieval R@1 | Text Retrieval R@5 | Text Retrieval R@10 | Image Retrieval R@1 | Image Retrieval R@5 | Image Retrieval R@10 | mR |
|---|---|---|---|---|---|---|---|
| VSE++ | 10.38 | 27.65 | 39.6 | 7.79 | 24.87 | 38.67 | 24.83 |
| SCAN | 11.06 | 25.88 | 39.38 | 9.82 | 29.38 | 42.12 | 26.27 |
| MTFN | 10.4 | 27.65 | 36.28 | 9.96 | 31.37 | 45.84 | 26.92 |
| SAM | - | - | - | - | - | - | - |
| AMFMN | 10.63 | 24.78 | 41.81 | 11.51 | 34.69 | 54.87 | 29.72 |
| LW-MCR | 9.73 | 26.77 | 37.61 | 9.25 | 34.07 | 54.03 | 28.58 |
| MAFA-Net | - | - | - | - | - | - | - |
| FBCLM | 12.84 | 30.53 | 45.89 | 10.44 | 37.01 | 57.94 | 32.44 |
| GaLR | 14.82 | 31.64 | 42.48 | 11.15 | 36.68 | 51.68 | 31.41 |
| MTGFE | 17.92 | 40.93 | 53.32 | 16.59 | 48.5 | 67.43 | 40.78 |
Table 3.
Retrieval accuracies of different task combinations on the RSITMD dataset.
| Task | Text Retrieval R@1 | Text Retrieval R@5 | Text Retrieval R@10 | Image Retrieval R@1 | Image Retrieval R@5 | Image Retrieval R@10 | mR |
|---|---|---|---|---|---|---|---|
| ITM | 15.71 | 35.62 | 50.44 | 13.41 | 44.78 | 65.66 | 37.6 |
| ITM+MLM | 16.37 | 38.05 | 52.88 | 16.46 | 47.92 | 67.43 | 39.85 |
| ITM+MVJRC | 12.39 | 33.19 | 49.56 | 10.66 | 40.35 | 61.64 | 34.63 |
| ITM+MLM+MVJRC | 17.92 | 40.93 | 53.32 | 16.59 | 48.5 | 67.43 | 40.78 |
Table 4.
Performance of model migration on the RSICD dataset.
| Method | Text Retrieval R@1 | Text Retrieval R@5 | Text Retrieval R@10 | Time (ms) | Image Retrieval R@1 | Image Retrieval R@5 | Image Retrieval R@10 | Time (ms) | mR |
|---|---|---|---|---|---|---|---|---|---|
| MTGFE | 15.28 | 37.05 | 51.6 | 472.1 | 8.67 | 27.56 | 43.92 | 94.41 | 30.68 |
| MTGFE+Filter | 13.82 | 36.32 | 50.41 | 24.7 | 8.27 | 27.17 | 42.8 | 14.27 | 29.8 |