1. Introduction
Remote sensing data classification is of great research significance in fields such as urban planning, mining exploration, and agriculture [1,2,3,4,5]. In recent years, with the rapid development of sensor technology, a variety of remote sensing data sources have become available to support remote sensing data classification tasks. For example, hyperspectral image (HSI) data reflect the fine spectral information of ground objects [6,7], but they are susceptible to factors such as weather and struggle to distinguish ground objects with similar spectral reflectance. In contrast, light detection and ranging (LiDAR) is insensitive to weather conditions, and the elevation information it contains helps the model distinguish objects with similar spectral information but different heights [8]. Multimodal remote sensing data classification can improve the model’s ability to distinguish ground objects through multimodal data fusion, and it has therefore attracted extensive attention from researchers in recent years.
In the process of multimodal data fusion, a key problem is: how can models be helped to understand the relationship between multimodal data and the target task? The difficulty of this problem is twofold. On the one hand, task-related and task-unrelated information in multimodal data are often tightly and intricately coupled, which makes it difficult for the model to distinguish between the two accurately. For example, HSI may contain spectral noise, and LiDAR may contain noise caused by multipath scattering effects. Such noise is not only difficult to observe and remove directly, but also often severely interferes with the model learning process. On the other hand, the association pattern between multimodal data and the target task often changes dynamically with the specific context. Take HSI and LiDAR as an example: in urban scenes, different classes of ground objects often have different heights and different spectral reflectance (owing to differences in materials), so multimodal data fusion tends to make comprehensive use of both HSI and LiDAR. However, in forest scenes where trees have similar heights, attending to LiDAR information can introduce inter-modal interference. The dynamic change of the association pattern requires the model to fully understand the dependency between context information and multimodal data fusion, which places higher demands on its intelligence.
In order to accurately capture task-related information from multimodal data, and inspired by the human visual system’s focus on only the key information in a visual scene, researchers have developed a variety of attention mechanisms. These attention mechanisms are designed to adaptively emphasize task-related information and suppress task-unrelated information during model learning. Gao et al. [9] proposed an adaptive feature fusion module that adjusts the model’s degree of attention to different modalities through dynamic attention allocation. However, this method only models the global importance of each modality, so when a certain modality is emphasized, the noise it contains is emphasized as well. Considering the consistency of multimodal data in describing the same objects, mutual supervision between modalities can help the model capture task-related information. To address this problem, Zhang et al. [10] proposed a mutual guidance attention module, which highlights key task-related information and suppresses useless noise by establishing mutual supervision between the information flows of different modalities. To mine the consistency among multimodal data more fully, Song et al. [11] developed a cross-modal attention fusion module, which learns the global dependency between modalities by establishing deep interactions between them, and then uses this dependency to improve the model’s ability to capture task-related information. However, the optimization of the attention mechanisms in the above methods often relies on establishing statistical associations between multimodal data and labels, thereby minimizing the empirical risk of the model on the training data. It should be noted that overemphasizing the statistical dependency between multimodal data and labels can lead models to establish spurious statistical associations between non-causal factors (such as spectral noise) and labels. For example, ground objects in a certain region may show abnormal spectral responses in certain bands due to weather; if the model establishes a statistical correlation between these abnormal responses and the labels, it will often mistakenly emphasize the abnormal spectral information. In addition, multimodal remote sensing data classification tasks usually face high labeling costs, which makes it difficult to obtain sufficient labeled training data in practical applications. This data sparsity means that the training data may not accurately reflect the real data distribution. In this context, simply minimizing the empirical risk on the training data may make it difficult for the model to achieve satisfactory generalization performance on unseen multimodal data.
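As an illustration of the general idea behind such attention-based fusion (and not the exact modules of [9], [10], or [11]), the following PyTorch sketch shows a minimal cross-modal attention block in which the HSI feature stream attends to the LiDAR stream; the class name, token layout, and dimensions are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Minimal cross-modal attention: the HSI stream queries the LiDAR stream.

    Illustrative sketch only; not the exact module of the cited works.
    """
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hsi_tokens, lidar_tokens):
        # Queries come from HSI, keys/values from LiDAR, so HSI features are
        # re-weighted according to their global dependency on LiDAR features.
        fused, _ = self.attn(hsi_tokens, lidar_tokens, lidar_tokens)
        return self.norm(hsi_tokens + fused)  # residual fusion

# Example: a batch of 8 pixels, each described by 49 spatial tokens of dimension 64.
hsi = torch.randn(8, 49, 64)
lidar = torch.randn(8, 49, 64)
fused = CrossModalAttention(dim=64)(hsi, lidar)  # -> shape (8, 49, 64)
```

Note that such a block is trained only through the classification loss, i.e., through the statistical association between fused features and labels, which is precisely the limitation discussed above.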
Fortunately, human causal reasoning and knowledge induction mechanisms provide solutions to the above problems. On the one hand, humans are able to identify genuinely causal associations from appearances rather than relying on statistical co-occurrence. For example, although “fish” are usually found in water and the two show a high statistical dependence, humans can clearly recognize that the relationship between “water” and “fish” is not causal. Inspired by this, if a causal reasoning mechanism can be built into the model learning process, the model can cut off the spurious statistical associations between non-causal factors and labels and learn the real causal effect between data and labels. On the other hand, humans are able to induce core knowledge from different tasks and can therefore deal effectively with unseen tasks. For example, by practicing addition and subtraction exercises, students can deduce the underlying rules of addition and subtraction, so that they can correctly solve exercises they have never seen before. Inspired by this, if a knowledge induction mechanism can be established in the model optimization process, the model can move beyond merely fitting the training data and induce cross-task shared knowledge from different multimodal remote sensing data classification tasks, thereby improving its generalization ability to unseen multimodal data.
In order to capture the complex association patterns among multimodal data, researchers have conducted extensive explorations. Considering that these association patterns are often closely related to contextual information, Xue et al. [12] proposed a self-calibrated convolution that captures multi-scale contextual information from multimodal data and adaptively adjusts its weights through spectral and spatial self-attention. Taking into account the differences among modalities, Roy et al. [13] customized modality-specific feature learning for HSI and LiDAR, capturing the rich spatial-spectral information of HSI with convolutions and the spatial structure information of LiDAR with morphological dilation and erosion layers; the multimodal features were finally integrated by an additive operation. However, simple additive operations struggle to capture the complex nonlinear relationships between multimodal data. Therefore, Wang et al. [14] proposed a spatial-spectral mutual-guidance module to fuse multimodal spatial and spectral information across modalities, improving the semantic correlation between them through adaptive, multi-scale, and mutual learning techniques. Further, in order to fully capture both local and global associations of multimodal data, Du et al. [15] proposed a spatial-spectral graph network, which uses patch-based convolution to preserve the local structures of multimodal data and uses the message passing mechanism of graph neural networks to capture global associations. These one-step multimodal data fusion methods have a significant limitation in mining complex association patterns: the multimodal data pass through only one forward process during fusion, so the model cannot evaluate the quality of its current fusion strategy or adaptively adjust it. Such a one-step fusion process limits the model’s ability to mine complex association patterns. In addition, in these methods the optimization of the multimodal fusion strategy is often implicit and is not directly guided by the gradient of the classification loss; it is therefore difficult to ensure that the model truly understands how the fused multimodal features act on the labels, and the model may incorrectly suppress task-related information or emphasize task-unrelated information.
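To make the one-step-fusion limitation concrete, the sketch below shows a typical single-forward-pass pipeline, loosely in the spirit of the additive fusion in [13]: modality-specific encoders followed by element-wise addition and a linear classifier. The encoders, patch sizes, and band/class counts are placeholders of our own choosing, not the cited architectures.

```python
import torch
import torch.nn as nn

class OneStepFusionClassifier(nn.Module):
    """Single-pass fusion baseline: encode each modality, add, classify.

    Illustrative sketch; the encoders are placeholders, not the cited designs.
    """
    def __init__(self, hsi_bands: int, num_classes: int, dim: int = 64):
        super().__init__()
        self.hsi_encoder = nn.Sequential(   # spatial-spectral features from HSI patches
            nn.Conv2d(hsi_bands, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lidar_encoder = nn.Sequential( # elevation/structure features from LiDAR patches
            nn.Conv2d(1, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, hsi_patch, lidar_patch):
        # The two feature streams are merged by simple addition in a single
        # forward pass: the model never evaluates or revises its fusion strategy.
        fused = self.hsi_encoder(hsi_patch) + self.lidar_encoder(lidar_patch)
        return self.classifier(fused)

# Example: 16 patches of size 11x11 with 144 spectral bands and 1 LiDAR channel.
logits = OneStepFusionClassifier(hsi_bands=144, num_classes=15)(
    torch.randn(16, 144, 11, 11), torch.randn(16, 1, 11, 11))
```

Because fusion happens in a single forward pass and is shaped only by end-to-end classification gradients, the network receives no explicit signal about the quality of its fusion strategy itself, which motivates the feedback mechanism discussed next.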
Fortunately, the human feedback learning mechanism provides a solution to this problem: humans understand the complex relationships between things through interactive feedback with the environment. For example, chemists combine different chemicals during experiments and obtain feedback signals by observing the resulting reactions. Based on these feedback signals, chemists can better understand the interactions between chemical substances and how those interactions shape the observed reactions, and then modify their experimental operations to obtain the expected results. Inspired by this, if a task-oriented feedback mechanism can be established in the multimodal feature fusion process, the model can fully understand how its current fusion strategy acts on the target task and adjust that strategy according to its task performance, thereby fully capturing and intelligently adapting to the complex association patterns among multimodal data.
In summary, the key issue this paper aims to address is: how can feedback learning, causal reasoning, and knowledge induction mechanisms be established for multimodal remote sensing data classification tasks?
To address this issue, we integrate reinforcement learning, causal learning, and meta-learning into a unified framework for multimodal remote sensing data classification and propose a causal meta-reinforcement learning framework. Specifically, to establish the feedback learning and causal reasoning mechanisms, we customize a reinforcement learning environment for multimodal remote sensing data classification and design causal distribution prediction actions, classification rewards, and causal intervention rewards. In this way, on the one hand, the agent can understand the complex relationships between multimodal data and labels through its interactive feedback with the environment; on the other hand, driven by the reward signals, the agent can distinguish causal from non-causal factors in multimodal data and accurately capture pure causal factors by mining multimodal complementary information. To establish the knowledge induction mechanism, inspired by the scenario-simulation training of meta-learning, we construct meta-training tasks and meta-validation tasks and, on this basis, build a bi-layer optimization mechanism. By simulating generalization from seen multimodal data to unseen multimodal data, we encourage the agent to induce cross-task shared knowledge, thereby improving its generalization ability on unseen multimodal data. The contributions of this paper are as follows:
- 1)
A causal meta-reinforcement learning (CMRL) framework is proposed, which simulates the human feedback learning, causal reasoning, and knowledge induction mechanisms to fully mine multimodal complementary information, accurately capture true causal relationships, and reasonably induce cross-task shared knowledge;
- 2)
Breaking the limitation of implicitly optimized fusion features, a reinforcement learning environment is customized for the multimodal remote sensing data classification task; through the interactive feedback between the agent and the multimodal data, multimodal complementary information is fully mined;
- 3)
Breaking through the traditional learning paradigm of establishing statistical associations between data and labels, causal reasoning is introduced into multimodal remote sensing data classification for the first time; causal distribution prediction actions, classification rewards, and causal intervention rewards are designed to encourage the agent to capture pure causal factors and cut off spurious statistical associations between non-causal factors and labels;
- 4)
The shortcomings of simply minimizing the empirical risk of the model on sparse training samples are revealed, and a targeted bi-layer optimization mechanism based on meta-learning is proposed (see the sketch after this list); by encouraging the agent to induce cross-task shared knowledge through scenario simulation, its generalization ability on unseen test data is improved.
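As a concrete, simplified picture of the bi-layer optimization idea in contribution 4, the sketch below implements a generic MAML-style inner/outer loop over meta-training and meta-validation tasks using PyTorch’s `torch.func.functional_call`. It is written under our own assumptions (single-input model, one inner gradient step, no reward terms) and is not the exact CMRL update rule.

```python
import torch
from torch.func import functional_call

def bilayer_meta_step(model, loss_fn, meta_train_tasks, meta_val_tasks,
                      outer_optimizer, inner_lr=0.01):
    """One schematic bi-layer update (generic MAML-style, not the exact CMRL rule).

    Inner loop: adapt the shared parameters to each meta-training task.
    Outer loop: update the shared parameters from meta-validation performance.
    """
    params = dict(model.named_parameters())
    outer_loss = 0.0
    for (x_tr, y_tr), (x_val, y_val) in zip(meta_train_tasks, meta_val_tasks):
        # Inner loop: one gradient step of task-specific adaptation, keeping the
        # graph so that the outer update can differentiate through the adaptation.
        inner_loss = loss_fn(model(x_tr), y_tr)
        grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
        adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
        # Outer loop: evaluate the adapted parameters on the held-out
        # meta-validation task, simulating generalization to unseen data.
        outer_loss = outer_loss + loss_fn(functional_call(model, adapted, (x_val,)), y_val)
    outer_optimizer.zero_grad()
    outer_loss.backward()
    outer_optimizer.step()
    return float(outer_loss)
```

In CMRL the outer objective would additionally reflect the classification and causal intervention rewards described above; the sketch only conveys how meta-training and meta-validation tasks drive the shared parameters toward cross-task knowledge.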