1. Introduction
The 2021 UNESCO Engineering for Sustainable Development report highlights the profession's role in the 2030 Agenda. The third goal, "ensure healthy lives and promote well-being for all at all ages," underscores advances in medical diagnosis and care enabled by low-cost tools [1]. Among the most affordable and widely studied tools in neuroscience is Electroencephalography (EEG), valued for its high temporal resolution and portability. EEG captures the bioelectric activity generated by neuronal activation through electrodes placed on the scalp [2]. Techniques such as Event-Related Potentials (ERPs) extract representative time-frequency information from the time-series data to understand the neurological activity of a subject or a group [3]. Likewise, Brain-Computer Interfaces (BCIs) exploit those patterns to control external devices, such as prostheses [4]. To learn the brain patterns, a BCI paradigm presents stimuli and asks the subject to perform tasks. For instance, the Motor Imagery (MI) paradigm, the mental rehearsal of motor tasks without physical movement, has been used to support the diagnosis, treatment, and follow-up of brain diseases [5].
However, EEG is far from a panacea for all neuroimaging needs. Because of its non-invasive nature, scalp EEG sacrifices spatial resolution and is highly susceptible to electromagnetic artifacts (e.g., electrical devices) and physiological noise (e.g., eye movement and muscle activity), yielding less useful features [6,7]. Moreover, the volume conduction effect introduces noise and cross-talk between electrodes, hampering the source localization of brain activity and the interpretation of EEG data [8]. Lastly, both between- and within-subject variability challenge the development of a universal MI-EEG algorithm. Differences in genetic, cognitive, and neurodevelopmental factors cause the same task or stimulus to evoke distinct brain patterns across individuals, complicating the creation of subject-independent solutions [9,10]. Additionally, as users become more familiar with the BCI device or task over time, their performance and initial brain patterns evolve, demanding personalized calibration sessions [11].
Traditional MI-EEG algorithms extract features by analyzing the signal's power levels. The baseline Common Spatial Patterns (CSP) method exploits the power distribution over the scalp to find spatial patterns that discriminate two tasks [12]. Variations of CSP include L1-CSP, which regularizes the patterns using the L1-norm; Sparse Filter Band CSP (SFBCSP), which automatically selects useful spectral bands from a precomputed set [13]; and Multi-Kernel Stein Spatial Patterns (MKSSP), which extracts nonlinear patterns in a low-dimensional Riemannian manifold [14]. However, a low signal-to-noise ratio can cause CSP and its variants to extract features from artifacts rather than from the actual EEG [15].
Conversely, machine learning algorithms exploit their ability to automatically learn features from a training dataset so as to minimize the prediction error [16]. Deep Learning (DL) methods extract nonlinear patterns, coping with EEG noise and the volume conduction effect [17]. Convolutional Neural Networks (CNNs), a family of DL models, are the most successful EEG feature extraction architectures because they search for spatial and temporal patterns [18]. Examples of CNNs for MI-EEG classification are EEGNet, ShallowConvNet, and DeepConvNet, which, like SFBCSP, extract spatial patterns from specific frequency bands [19]. Unlike SFBCSP, these architectures unravel nonlinear, deeper, and more complex patterns [20]. Other DL models for EEG applications include Autoencoders, which embed EEG signals into a generative noise-reduced feature space [21]; Recurrent Neural Networks (RNNs), which exploit the sequential nature of EEG features [22]; and, more recently, Transformers, which use their long-term memory to capture both global and local patterns [23].
However, there are two main concerns regarding medical applications of the above DL models. First, they become "black boxes," lacking interpretable information to understand each subject's neurological abilities [24]. Second, they ignore the close link between neurological, physiological, and personal behaviors [25]. Analyzing how these factors influence the model provides users and scientists with valuable information to improve the results beforehand. Thus, multimodal and multidomain strategies can couple information about patients' moods and habits to understand their neurophysiological responses and exploit this additional information to improve model performance and interpretability.
This work proposes Multimodal and Explainable Deep Learning (MEDL) as an approach for MI-EEG discrimination and physiological interpretability. Specifically, our proposal is threefold: i) different DL models are tested for subject-dependent MI-EEG discrimination; ii) a Class Activation Map (CAM)-based approach is used to quantify and visualize relevant MI-EEG features; and iii) a Questionnaire-MI Performance Canonical Correlation Analysis (QMIP-CCA) strategy is introduced as a multidomain explainability stage that non-linearly matches physiological information with MI-EEG discrimination features. Experiments are carried out with the GigaScience MI dataset because of its relatively large number of subjects and the accompanying questionnaire that provides physiological subject information [26]. The results demonstrate that shallow networks achieve acceptable MI discrimination. In addition, our CAM-based method codes MI spatio-frequency group patterns and measures EEG features over the sensorimotor cortex in subjects with improving MI performance. Finally, our QMIP-CCA quantifies and visualizes relevant physiological questions from tabular data and matches them to MI-EEG performance measures.
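For concreteness, the snippet below sketches how questionnaire answers and per-subject MI performance scores can be related through a plain linear CCA (scikit-learn). The matrices are synthetic placeholders, and the kernel-based QMIP-CCA introduced later extends this idea non-linearly.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical data: 52 subjects, 30 questionnaire items, 6 performance measures.
rng = np.random.default_rng(0)
Q = rng.normal(size=(52, 30))   # questionnaire answers (standardized)
P = rng.normal(size=(52, 6))    # per-subject MI-EEG performance scores

cca = CCA(n_components=2)
Q_c, P_c = cca.fit_transform(Q, P)  # project both views onto correlated components

# The correlation of each canonical pair indicates how strongly the domains match.
for k in range(2):
    r = np.corrcoef(Q_c[:, k], P_c[:, k])[0, 1]
    print(f"canonical pair {k}: r = {r:.2f}")

# Loadings highlight which questions and measures drive each canonical component.
print(cca.x_loadings_.shape, cca.y_loadings_.shape)
```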
The agenda for this paper is as follows. Section 2 summarizes the related work. Section 3 describes the materials and methods. Section 4 describes the experiments and discusses the results. Lastly, Section 5 outlines the conclusions and future work.
2. Related work
Since Koles et al.'s work in 1990, CSP has been the go-to tool for feature extraction in EEG data [27]. CSP provides spatial filters that maximize the variance ratio between two multivariate signals, enabling the discrimination of two classes for classification purposes. Unfortunately, since the technique depends on the signals' variance, it is sensitive to noise and struggles with small datasets [28]. In turn, variants of CSP have sprung up to enhance the original algorithm. L1-CSP redefines the objective in terms of the L1-norm instead of the usual L2-norm to reduce the influence of artifacts on the signal [29]. In contrast, FBCSP adds a bandpass filtering stage for multiple, manually selected frequency bands before the usual spatial filtering and then applies a mutual information algorithm to select discriminant features [30]. Lastly, SFBCSP selects the bandpass filters semi-automatically, rather than manually like FBCSP, by integrating a sparse regression model that learns the optimal features from each input filter band [31]. However, these models remain power-reliant and, in the case of FBCSP and SFBCSP, require some form of manual input. Once features have been extracted, classification can be performed by various methods; one such technique is the Support Vector Machine (SVM), which finds the optimal hyperplane separating the classes [32]. However, SVMs fail to classify data when features are too similar between classes [33].
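For reference, the following is a minimal NumPy/SciPy sketch of the basic CSP computation described above, assuming two sets of band-pass-filtered trials of shape (trials, channels, samples). It implements the plain generalized-eigenvalue formulation, not the regularized or filter-bank variants.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_pairs=3):
    """Spatial filters maximizing the variance ratio between two classes (basic CSP)."""
    def mean_cov(trials):
        # Average normalized spatial covariance over trials (n_trials, channels, samples).
        covs = []
        for x in trials:
            c = x @ x.T
            covs.append(c / np.trace(c))
        return np.mean(covs, axis=0)

    ca, cb = mean_cov(trials_a), mean_cov(trials_b)
    # Generalized eigenvalue problem: Ca w = lambda (Ca + Cb) w.
    eigvals, eigvecs = eigh(ca, ca + cb)
    order = np.argsort(eigvals)            # small eigenvalues favor class B, large favor class A
    idx = np.concatenate([order[:n_pairs], order[-n_pairs:]])
    return eigvecs[:, idx].T               # (2 * n_pairs, channels) spatial filters

def csp_features(trials, filters):
    """Log-variance of spatially filtered trials, the usual CSP feature vector."""
    filtered = np.einsum("fc,ncs->nfs", filters, trials)
    var = filtered.var(axis=2)
    return np.log(var / var.sum(axis=1, keepdims=True))
```

The resulting log-variance features would then feed a classifier such as an SVM.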
Thanks to their ability to recognize and extract non-linear features from raw EEG data [34], DL models solve many of these traditional algorithms' shortcomings. CNNs are a family of DL strategies that scan the input signal for representative patterns, starting from simple structures and building up to more complex combinations of these initial features. These algorithms are commonly used for image processing, but by tuning the size and number of filters, it is possible to extract information from the temporal, spatial, and frequency domains, e.g., for MI-EEG classification. Also, unlike CSP, these models automatically fine-tune the best filters for the given task through gradient descent [35]. Relevant examples of CNNs for MI-EEG include EEGNet, KREEGNet, ShallowConvNet, DeepConvNet, TCFusionNet, and KCS-FCNet.
In particular, EEGNet extracts features from EEG data in three steps: a temporal convolution, a depthwise convolution (to find patterns in time and space), and a separable convolution (to combine the outputs of the first two steps) [36]. Variations of this original architecture have since emerged. TCFusionNet, for example, integrates the EEGNet architecture with residual blocks composed of dilated convolutions, adding these residual features to those of the separable convolution to enlarge the receptive field while avoiding the exploding/vanishing gradient problem of deeper models [37]. ShallowConvNet works similarly to EEGNet but skips the separable convolution stage and uses a regular convolution for spatial filtering; this reduces the number of trainable parameters, leading to faster training and better interpretability, at the cost of some performance. DeepConvNet, on the other hand, adds a series of progressively larger 2D convolutions that pull information from the earlier stages to find complex structures [19]. Regarding Deep Kernel Learning (DKL) methods, KREEGNet computes functional connectivities from the temporal convolution through a Gaussian similarity, alongside a Delta kernel for the label outputs, and uses Centered Kernel Alignment (CKA) as a regularizer [38]. Likewise, KCS-FCNet computes functional connectivities from a temporal convolution through a Gaussian kernel, much like KREEGNet; however, it then uses the measured connectivities as input to a Fully-Connected block to classify the MI data in a high-dimensional space [39].
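To make the three-stage design concrete, the following is a hedged Keras sketch of an EEGNet-style model; the filter counts and kernel lengths follow commonly used EEGNet defaults and are assumptions rather than the exact configuration of the models compared in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def eegnet_like(n_channels=64, n_samples=512, n_classes=2):
    """EEGNet-style CNN: temporal conv -> depthwise (spatial) conv -> separable conv."""
    inp = layers.Input(shape=(n_channels, n_samples, 1))
    # 1) Temporal convolution: learns frequency-selective filters along time.
    x = layers.Conv2D(8, (1, 64), padding="same", use_bias=False)(inp)
    x = layers.BatchNormalization()(x)
    # 2) Depthwise convolution over channels: spatial filters per temporal filter.
    x = layers.DepthwiseConv2D((n_channels, 1), depth_multiplier=2, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("elu")(x)
    x = layers.AveragePooling2D((1, 4))(x)
    x = layers.Dropout(0.5)(x)
    # 3) Separable convolution: mixes the temporal-spatial feature maps.
    x = layers.SeparableConv2D(16, (1, 16), padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("elu")(x)
    x = layers.AveragePooling2D((1, 8))(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Flatten()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)

model = eegnet_like()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```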
Nevertheless, as computational power has increased, more powerful and complex algorithms have been developed to extract additional information from EEG signals. Deep Belief Networks (DBNs) stack multiple Restricted Boltzmann Machines (RBMs), which learn to reconstruct a given input through unsupervised training; however, because they rely on RBMs, DBNs require a pre-training stage on a separate dataset, a luxury only available with larger databases [40]. Another family of neural networks for EEG data is Autoencoders, which have been used as a dynamic Principal Component Analysis (PCA) to select the most relevant characteristics before classification [41]. Ideally, the model filters out any noise contaminating the signals, treating it as redundant information and leaving only features intrinsic to the EEG for classification [42]. Unfortunately, physiological noise is extremely difficult to filter properly, as it is influenced by factors such as stress levels, personal background, and even the testing environment [43]. Alternatively, RNNs are a natural choice for EEG analysis: their ability to remember previous inputs makes them well suited to time-series data and to leveraging cross-series information, even when handling heterogeneous signals [44]. More recently, Transformer networks use an attention mechanism to capture global information from EEG by encoding temporal, frequency, and spatial features, allowing the model to identify relevant sample dependencies while dealing with the low signal-to-noise ratio [45]. Despite these models' advantages, their use is severely limited in tasks requiring interpretability, as the architectures lack tools that allow users to assess the cause of a given output; it is difficult to determine whether their performance stems from extracting meaningful information or from noise [46]. Furthermore, precisely because of their complexity, they are vulnerable to overfitting [47]. EEG-BCI classification methods are summarized in Figure 1.
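As a small illustration of the autoencoder idea mentioned above, here is a hedged Keras sketch of a fully connected autoencoder that compresses flattened EEG trials into a low-dimensional code, analogous to a nonlinear PCA; the layer sizes and trial shape are arbitrary assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

n_features = 64 * 512          # flattened channels x samples (assumed trial shape)
code_dim = 32                  # size of the compressed representation

inp = layers.Input(shape=(n_features,))
# Encoder: compress the trial into a low-dimensional code (nonlinear "PCA").
code = layers.Dense(256, activation="relu")(inp)
code = layers.Dense(code_dim, activation="relu")(code)
# Decoder: reconstruct the trial from the code; noise that does not help
# reconstruction tends to be discarded.
rec = layers.Dense(256, activation="relu")(code)
rec = layers.Dense(n_features, activation="linear")(rec)

autoencoder = models.Model(inp, rec)
encoder = models.Model(inp, code)     # the encoder output feeds a downstream classifier
autoencoder.compile(optimizer="adam", loss="mse")
```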
Multiple strategies have emerged to provide interpretable results from DL-based methods. The authors in [48] propose four types of explanations offered by such algorithms: Example, Attribution, Hidden Semantics, and Rules. Example algorithms check which inputs are similar to each other according to the model. Attribution looks at which elements of the input had the greatest influence on the given output. Hidden Semantics employs the network's neurons to explain the output; for instance, when classifying animals, it determines whether the neurons focus on the head, legs, or other parts of the creature. Finally, Rules explain the output in terms of a series of decisions taken by the model; in their simplest form, these rules read "If X is present, then Y." However, when working with EEG data, not every algorithm provides useful insight. Rules as explanations reduce the model to a Decision Tree but do not necessarily create explainable rules themselves [49]. This leaves the other three types of explanations. Hidden Semantics is useful for evaluating the importance of specific features at certain points, but the results become more abstract the deeper the neurons lie [50]. For inter-subject analysis, Example algorithms can find similarities between subjects, indicating which patients share similar attributes, but they struggle to explain extracted features, since these must already make sense to BCI professionals beforehand [51]. Attribution holds the most potential for BCI insight, as these techniques map the importance of the output back to the original input (the EEG data). Common attribution algorithms include Class Activation Maps (CAMs), which create a mask highlighting which pixels of the input image supply the most important information for the output. Introduced by [52], CAMs map out the elements within an image that are most relevant to the CNN; the method weights the activation maps of a given layer and finds the average contribution of each pixel to the model's decision. Grad-CAM [53] generalizes CAMs by redefining the weights in terms of the gradients produced by the model; however, it uses a global average for the weight calculation, assuming each activation to be equally important. Grad-CAM++ [54] redefines Grad-CAM's weights as a weighted, rather than global, average. Finally, LayerCAM enables the use of CAMs on any convolutional layer [55]: by exploiting the backward class-specific gradients, it generates a separate weight for each spatial location.
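To ground the attribution idea, the following is a minimal Grad-CAM sketch in TensorFlow, assuming a hypothetical trained functional Keras model and a chosen convolutional layer name; it illustrates the gradient-weighted averaging principle rather than the exact CAM variant employed later in this work.

```python
import tensorflow as tf

def grad_cam(model, x, class_idx, conv_layer_name):
    """Weight a conv layer's activation maps by the averaged class gradients (Grad-CAM)."""
    conv_layer = model.get_layer(conv_layer_name)
    grad_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(x)          # x: batch of EEG trials
        class_score = preds[:, class_idx]        # score of the class to explain
    grads = tape.gradient(class_score, conv_out)             # d(score)/d(activation)
    weights = tf.reduce_mean(grads, axis=(1, 2))             # global-average gradient weights
    cam = tf.einsum("bhwc,bc->bhw", conv_out, weights)       # weighted sum of activation maps
    cam = tf.nn.relu(cam)                                    # keep positive evidence only
    cam /= (tf.reduce_max(cam, axis=(1, 2), keepdims=True) + 1e-8)  # normalize to [0, 1]
    return cam.numpy()
```

The resulting map can be upsampled to the input resolution and overlaid on the channel-time representation to highlight which electrodes and time segments drove the decision.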
Another strategy to improve EEG classification and interpretability is to incorporate information from different domains into the DL structure. The Multi-domain Fusion Deep Graph CNN (MdGCNN) of [56] fuses time-frequency and spatial information through graph convolutions, which learn the discriminant features across domains, followed by a sort-pooling layer acting as a fusion stage that bridges the extracted information to regular convolutions, which then produce the output. Still, this approach does not include information from sources external to the EEG, limiting its interpretability. In contrast, the authors in [57] perform emotion recognition with RNNs using visual information from videos in addition to the EEG, fused through a hierarchical attention mechanism that organizes features based on their perceived significance; this allows analyzing how physiological and neurological responses relate to each other during classification. Lastly, [58] uses a deep and wide CNN to extract features and perform an initial classification, and then applies kernel matching via Gaussian embedding to combine questionnaire data and refine the model output.
Ultimately, traditional feature extraction algorithms such as CSP provide the best interpretability but are also the least powerful for EEG-based classification, primarily because of their need for manual tuning. Complex DL solutions such as Autoencoders, RNNs, and Transformers can extract much more information than other solutions, but their results are not easily interpretable or they require large datasets [59]. This leaves CNN-based EEGNet variants as the best compromise between the two: they work similarly to traditional CSP algorithms while requiring no manual tuning of the filters; they can be extended to exploit information present in much deeper structures; and they possess extremely useful interpretation tools in the form of CAMs.
Figure 1.
EEG-BCI classification methods. Both traditional and DL approaches are presented.
Figure 2.
GigaScience database experiment for MI-EEG classification (left vs. right hand). Left: trial timing: a marker appears onscreen; after two seconds, an instruction tells the subject to imagine moving either the left or right hand, and it stays onscreen for three seconds before disappearing. Right: spatial EEG montage: electrodes are placed starting at the left-frontal sites and follow a serpentine pattern toward the back of the head, then return toward the front along the midline until reaching CPz (10-10 system).
Figure 3.
Shannon's entropy for the GigaScience database questionnaire answers. Questions are sorted by entropy in decreasing order. The dotted line marks the 25th-percentile threshold used to select questions.
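As context for the entropy-based selection shown in Figure 3, the following is a small NumPy sketch, under the assumption that low-entropy (near-constant) questionnaire answers below the 25th-percentile threshold are dropped.

```python
import numpy as np

def select_questions(answers, percentile=25):
    """Drop questionnaire items whose answer distribution has below-threshold Shannon entropy."""
    entropies = []
    for col in answers.T:                          # answers: (n_subjects, n_questions), categorical codes
        _, counts = np.unique(col, return_counts=True)
        p = counts / counts.sum()
        entropies.append(float(-(p * np.log2(p)).sum()))   # Shannon entropy in bits
    entropies = np.asarray(entropies)
    threshold = np.percentile(entropies, percentile)        # dotted line in Figure 3
    keep = entropies > threshold                            # near-constant questions fall below it
    return keep, entropies
```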
Figure 4.
MI-EEG classification models based on Deep Learning. A softmax activation is always used after the final dense layer for label prediction.
Figure 5.
Experimental workflow using MEDL. Each model generates CAMs, which are then used to enhance the original EEG input. Model- and CAM-based performance measures, along with the questionnaire, are used to perform the multimodal analysis via QMIP-CCA.
Figure 6.
MI-EEG GigaScience classification results. Blue: accuracy; orange: AUC; green: kappa.
Figure 7.
Inter-subject accuracy results. Subjects are sorted by EEGNet performance.
Figure 8.
MI-EEG classification results grouped by subject performance.
Figure 9.
Class score percentage gain per MI class for EEGNet.
Figure 10.
Class score percentage gain per MI class for ShallowConvNet.
Figure 11.
Class score percentage gain per MI class for TCFusion.
Figure 16.
Violin plot of change in accuracy after CAM enhancements for different subject groups across DL models.
Figure 17.
QMIP-CCA relevance analysis results derived from the multimodal GigaScience dataset. Questionnaire items (left) and MI-EEG classification performance measures (right) are studied. Linear CCA and our kernel-based CCA enhancement are presented. Background colors for the questionnaire divide the questions into pre-MI, Runs 1 through 5, and post-MI. For the MI-EEG classification measures, colors indicate the corresponding DL model.
Table 1.
Subject-dependent MI-EEG classification DL hyperparameters.
| Training Hyperparameter | Argument | Value |
|---|---|---|
| Reduce learning rate on plateau | Monitor | Training Loss |
| | Factor | 0.1 |
| | Patience | 30 |
| | Min Delta | 0.01 |
| | Min Learning Rate | 0 |
| Adam | Learning Rate | 0.01 |
| Stratified Shuffle Split | Splits | 5 |
| | Test size | 0.2 |
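As a hedged illustration, the snippet below wires the Table 1 hyperparameters together with Keras and scikit-learn; the data, the stand-in model, and the epoch count are placeholders rather than the actual experimental setup.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic placeholder data: 100 trials, 64 channels, 512 samples, binary MI labels.
X = np.random.randn(100, 64, 512, 1).astype("float32")
y = np.random.randint(0, 2, size=100)

def tiny_model():
    # Stand-in classifier; any of the DL models under study would be plugged in here.
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(64, 512, 1)),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])

# Reduce-learning-rate-on-plateau schedule from Table 1: monitor the training loss,
# factor 0.1, patience 30, min delta 0.01, minimum learning rate 0.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="loss", factor=0.1,
                                                 patience=30, min_delta=0.01, min_lr=0.0)

# Five stratified shuffle splits with a 20% test partition, as in Table 1.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

for train_idx, test_idx in splitter.split(X, y):
    model = tiny_model()
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(X[train_idx], y[train_idx], epochs=5,   # epoch count is a placeholder
              callbacks=[reduce_lr], verbose=0)
    print(model.evaluate(X[test_idx], y[test_idx], verbose=0))
```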
Table 2.
MI-EEG classification performance comparison: ACC vs. CAM-enhanced ACC. Average ACC ± standard deviation.
| Model | ACC | CAM-enhanced ACC | Difference |
|---|---|---|---|
| EEGNet | | | |
| KREEGNet | | | |
| KCS-FCNet | | | |
| DeepConvNet | | | |
| ShallowConvNet | | | |
| TCFusion | | | |