1. Introduction
In everyday life, we constantly face high-stakes situations that demand maximum performance, for instance, when rapidly acquiring complex cognitive skills or making decisions under pressure. In such situations, the fit between a person’s skills and the task’s requirements determines the quality of the outcome. This fit is vital, especially in performance-oriented contexts such as learning and training, safety-critical monitoring, or high-risk decision-making. A person’s performance can be affected by several factors: 1) level of experience and skills, 2) current physical condition (e.g., illness or fatigue), 3) current psychological condition (e.g., stress, motivation, or emotions), or 4) external circumstances (e.g., noise, temperature, or distractions; Hart and Staveland [1], Young et al. [2]).
To reliably quantify the mental effort invested in a particular task, three types of measures can be used: 1) behavioural (i.e., performance-based), 2) subjective, and 3) neurophysiological measures [3,4,5]. While performance can be inspected by tracking the user’s task-related progress, the actual pattern of invested cognitive resources can only be derived by measuring brain activity with neuroimaging techniques. Coupled with sophisticated signal processing and machine learning, advances in portable neuroimaging techniques have paved the way for studying mental effort and its possible influences from a neuroergonomic perspective [6,7]. Recently, functional near-infrared spectroscopy (fNIRS) has been used to study cognitive and emotional processes with high ecological validity [6,8,9,10]. fNIRS is an optical imaging technology that allows researchers to measure local changes in oxy-haemoglobin (HbO) and deoxy-haemoglobin (HbR) in cortical regions. Higher mental effort is associated with an increase in HbO and a decrease in HbR in the prefrontal cortex (PFC) [11,12,13]. The PFC is crucial for executive functions such as maintaining goal-directed behaviour and suppressing goal-irrelevant distractions [14,15]. In addition to changes in the central nervous system, increased mental effort also leads to changes in the autonomic nervous system. The autonomic nervous system, as part of the peripheral nervous system, regulates automatic physiological processes to maintain homeostasis in bodily functioning [16,17]. Increased mental effort is associated with decreased parasympathetic and increased sympathetic nervous system activity [18,19,20]. Typical autonomic correlates of cognitive demand, engagement, or mental effort are cardiac activity (e.g., heart rate and heart rate variability), respiration (rate, airflow, and volume), electrodermal activity (skin conductance level and response), blood pressure, body temperature, and ocular measures such as pupil dilation, blinks, and eye movements [7,19,21,22,23,24].
Not surprisingly, all these measures are thus often used as stand-alone indicators of mental effort (i.e., in a unimodal approach). However, a multimodal approach has several advantages over using only one measure. It can compensate for the specific weaknesses and profit from the strengths of the different complementary measurement methods (performance, subjective experience, and neuro- and peripheral physiological measures) [25,26,27]. For instance, (neuro-)physiological measures can be obtained without imposing an additional task [16] and allow for capturing cognitive subprocesses involved in executing the primary task [28]. A multimodal approach hence provides a more comprehensive view of the (neuro-)physiological processes related to mental effort [4,5,25,29], as it can capture both central and peripheral nervous system processes [21,27]. However, fusing data from different sources remains a major challenge for multimodal approaches. Machine learning (ML) methods provide solutions to compare and combine data streams from different measurements, and ML algorithms are becoming increasingly popular in computational neuroscience [30,31]. The rationale behind these algorithms is that the relationship between several input data streams and a particular outcome variable, e.g., mental effort, can be estimated from the data by iteratively fitting and adapting the respective models. This allows for data-driven analyses and provides ways to exploratorily identify informative patterns in the data [32].
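As a minimal sketch of this data-driven logic, the following example fits a generic classifier to placeholder feature windows and estimates its predictive value via cross-validation; all data, names, and model choices here are illustrative assumptions, not the study's actual configuration:

```python
# Minimal sketch: learn the mapping from physiological feature windows to a
# binary mental-effort label purely from data (placeholder data and model).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))      # e.g., 200 time windows x 20 features
y = rng.integers(0, 2, size=200)    # 0 = low, 1 = high mental effort

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=5, scoring="f1").mean())
```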
Data-driven approaches can also help bridge the gap between laboratory research and real-world applications, for instance, when specific temporal events (such as stimulus onsets) or the brain correlates of interest are not precisely known. In contrast to traditional laboratory studies, which typically rely on simplified and artificial stimuli and tasks, a naturalistic approach seeks to emulate, to some extent, the intricacy of real-world situations. Hence, such studies can provide insights into how the brain processes information and responds to complex stimuli in the real world [33].
Real-world settings are usually characterised by multiple situational factors, including concurrent distractions, that affect the allocation of attentional and cognitive resources [34]. According to the working memory model by Baddeley and Hitch [35], performance is notably diminished when distractions deplete resources from the same modality as the primary task. However, Soerqvist et al. [36] propose the involvement of cognitive control mechanisms that result in reduced processing of task-irrelevant information under higher mental effort. To uphold task-relevant cognitive processes, high-level cortical areas, particularly the PFC, which govern top-down regulation and executive functioning, suppress task- or stimulus-irrelevant neural activity by inhibiting the processing of distractions [28]. Consequently, the effects of distractors are mitigated. In light of these considerations, the capacity of a stimulus to capture attention in a bottom-up manner, known as salience, emerges as a crucial aspect. A salient stimulus has the potential to disrupt top-down, goal-oriented, and intentional attention processes [37] and to impair performance in a primary task [38,39,40]. Previous studies found that irrelevant yet intelligible speech exerts such disruptive effects on participants’ performance in complex cognitive tasks [41,42]; intelligible speech might therefore heighten the salience of a distracting stimulus. Moreover, further studies revealed that the emotional intensity and valence of a stimulus also influence its salience [37,43]. Despite their detrimental impact on performance, people frequently encounter such salient distractions (such as verbal utterances from colleagues) at work, even in highly demanding, safety-relevant tasks. Therefore, understanding the underlying cognitive processes in naturalistic scenarios and identifying critical moments that lead to performance decreases in real-world settings are crucial research topics in the field of neuroergonomics.
To decode and predict cognitive states, most research so far has focused on subject-dependent classification. These approaches face the challenge of high inter-individual variability in physiological signals when generalising the model to other individuals [44]. Recently, pioneering efforts have been made to develop cross-subject models that overcome the need for subject-specific information during training [45,46]. Solutions to the challenge of inter-individual variability [47] are crucial for the development of “plug and play” real-time state recognition systems [48] as well as for the resource-conserving exploitation of already available large datasets without time-consuming individual calibration sessions. Taking these considerations into account, we present a feasibility study to decode mental effort from multimodal physiological and behavioural signals in a quasi-realistic scenario. We used an adapted warship commander task that induces mental effort through a combination of attentional and cognitive processes, such as object perception, object discrimination, rule application, and decision-making [49]. To create a complex, close-to-naturalistic scenario, three types of emotional auditory speech stimuli with neutrally, positively, and negatively connotated prosody were presented during the task as concurrent distractions [50]. Concurrently, both brain-related and peripheral physiological signals associated with mental effort were recorded.
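Cross-subject evaluation of this kind is commonly implemented by holding out entire subjects, so the model never sees data from the test subject during training. The sketch below expresses this with scikit-learn's LeaveOneGroupOut; the data, group labels, and classifier are placeholders, and the study's exact validation scheme (including a separate validation subject) is described in the Methods:

```python
# Sketch of subject-wise ("plug and play") evaluation: each fold holds out
# one whole subject, so no subject-specific data is seen during training.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_subjects, windows_per_subject, n_features = 18, 40, 20
X = rng.normal(size=(n_subjects * windows_per_subject, n_features))
y = rng.integers(0, 2, size=len(X))
groups = np.repeat(np.arange(n_subjects), windows_per_subject)  # subject IDs

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         groups=groups, cv=LeaveOneGroupOut(), scoring="f1")
print(scores.mean())  # average F1 across held-out subjects
```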
We hypothesised that a well-designed multimodal voting ML architecture is preferable to a classifier based on a) only one modality (unimodal approach) or b) a combined, unbalanced feature set of all modalities. We expected a multimodal voting ML model to be capable of predicting subjectively experienced mental effort induced by the task itself but also by the suppression of situational auditory distractions in a complex, close-to-realistic environment. Thus, we first investigated whether a combined prediction of various ML models is superior to the prediction of a single model (RQ1) and, second, explored whether a multimodal classification that combines and prioritises the predictions of different modalities is superior to a unimodal prediction (RQ2).
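A minimal sketch of such a two-level voting architecture, assuming scikit-learn-style estimators, is given below; the classifier set, feature-column indices, and weights are placeholders rather than the study's tuned configuration:

```python
# Sketch of a two-level voting architecture: one voting ensemble per modality
# (level 1), combined by a weighted vote across modalities (level 2).
# Column indices, classifier set, and weights are illustrative placeholders.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

brain_cols, physio_cols = list(range(0, 10)), list(range(10, 16))
ocular_cols, perf_cols = list(range(16, 20)), list(range(20, 22))

def unimodal_voter():
    # Level 1 (RQ1): combine heterogeneous classifiers within one modality.
    return VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("svm", SVC(probability=True)),
                    ("knn", KNeighborsClassifier()),
                    ("rfc", RandomForestClassifier(random_state=0))],
        voting="soft")

def modality_pipeline(cols):
    # Restrict one unimodal voter to its modality's feature columns.
    return make_pipeline(ColumnTransformer([("sel", "passthrough", cols)]),
                         unimodal_voter())

# Level 2 (RQ2): weighted vote across the modality-specific voters.
multimodal = VotingClassifier(
    estimators=[("brain", modality_pipeline(brain_cols)),
                ("physio", modality_pipeline(physio_cols)),
                ("ocular", modality_pipeline(ocular_cols)),
                ("perf", modality_pipeline(perf_cols))],
    voting="soft", weights=[1.0, 1.0, 1.0, 2.0])

rng = np.random.default_rng(0)
X, y = rng.normal(size=(120, 22)), rng.integers(0, 2, size=120)
print(multimodal.fit(X, y).predict(X[:5]))
```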
3. Results
We compare the results of predicting mental effort based on a subject-wise (1a) median and (1b) upper quartile split of the NASA-TLX effort scale, as well as based on (2) the experimentally induced task load. Further, we compare two sizes of the validation set (one subject vs. two subjects).
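The subject-wise label construction can be sketched as follows; the data frame, rating values, and tie handling are hypothetical placeholders standing in for the NASA-TLX effort ratings:

```python
# Sketch of the subject-wise label construction: each subject's own median
# (1a) or upper quartile (1b) of the effort ratings is the cut-off.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"subject": np.repeat([1, 2, 3], 8),
                   "effort": rng.integers(0, 21, size=24)})  # TLX-like scale

def subject_wise_label(df, q):
    # q = 0.50 -> median split; q = 0.75 -> upper quartile split
    cutoff = df.groupby("subject")["effort"].transform(lambda s: s.quantile(q))
    return (df["effort"] > cutoff).astype(int)  # 1 = high mental effort

df["high_median"] = subject_wise_label(df, 0.50)
df["high_q75"] = subject_wise_label(df, 0.75)
print(df.groupby("subject")[["high_median", "high_q75"]].mean())
```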
3.1. Unimodal Predictions
The performance of the different modalities and classifiers is visualised in Figure 5. We do not see substantially better performance when using a larger validation set of two subjects, either for the median split (compare Figure 5 and Supplementary Figure 3), the upper quartile split (compare Supplementary Figures 7 and 11), or the prediction of the experimentally induced task load (Supplementary Figures 16 and 20). We will, therefore, focus on the models fitted with a validation set of one subject, as this is more time- and resource-efficient.
Figure 5 shows the performance of the median-split-based unimodal (Figure 5A, B, D, E) as well as the multimodal approach (Figure 5C; elaborated on in Section Multimodal Predictions). Regarding the unimodal classifications, we see the best predictions of subjectively perceived mental effort for performance data (Figure 5E) compared with ocular, physiological, or brain activity measures (Figure 5A, B, and D). Except for the performance-based model, we observe overfitting, indicated by the large deviation between training and test performance (Figure 5A, B, D). None of the brain activity-based models performs significantly better than the dummy classifier (Figure 5A and G) in the test data set. When examining the single classification models within each modality, the KNN, RFC, and SVM were more likely to be overfitted, as seen by their good performance in the training set but significantly worse performance for the test subject. We combined the different classifiers using a voting classifier, for which we determined the voting procedure (soft vs. hard voting) and the weights with a randomised grid search; see Figure 6 for an overview of the selected voting procedures and the allocated weights per modality.
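This tuning step can be sketched as follows; the parameter grid (random candidate weight vectors plus the soft/hard choice) and the data are plausible placeholders, not the study's exact search space:

```python
# Sketch of the second-level randomised grid search: sample the voting
# procedure (soft vs. hard) and per-classifier weights for a unimodal
# voting classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

voter = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True)),   # probabilities needed for soft voting
    ("rfc", RandomForestClassifier(random_state=0)),
    ("gnb", GaussianNB())])

rng = np.random.default_rng(0)
param_distributions = {
    "voting": ["soft", "hard"],
    "weights": rng.uniform(0, 1, size=(50, 4)).tolist(),  # candidate weights
}
search = RandomizedSearchCV(voter, param_distributions, n_iter=50,
                            scoring="f1", cv=3, random_state=0)
X, y = rng.normal(size=(120, 10)), rng.integers(0, 2, size=120)
search.fit(X, y)
print(search.best_params_)
```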
Interestingly, for eight out of eighteen participants, we observed high prediction performance with F1 scores ranging between 0.7 and 1.0. However, we also identified several subjects whose subjectively perceived mental effort was hard to predict based on the training data of the other subjects. See Tables 1-3 in the Supplementary Material for a detailed comparison of the classifiers’ performance for the different test subjects. In conclusion, the results indicate that transfer learning and generalisation across subjects are much more challenging with neurophysiological than with performance-based features.
3.2. Unimodal Predictions – Brain Activity
The unimodal voting classifiers for brain activity mainly used hard voting (94.4%) and gave the highest weights to the LDA classifier. The classifiers revealed strong overfitting (Figure 5A) and performed no better than either the single classifiers or the dummy classifier. We then compared the performance of the classifiers with respect to the percentage of correctly and falsely classified cases in a confusion matrix (Figure 7). For this comparison, we used the best-performing classifier for each test subject and then summed over all test subjects, comparing the distribution of true positives, true negatives, false positives, and false negatives of these classifiers with the respective distribution of the voting classifier. Here (Figure 7A), both distributions indicate a high number of falsely identified “High Mental Effort” cases (false positives), leading to a recall of 45.6% and a precision of only 39.3% for the voting classifier, and a recall of 57.5% and a precision of 49.8% for the single classifiers.
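For reference, the reported recall and precision follow directly from the summed confusion counts; the counts below are hypothetical, not the study's actual numbers:

```python
# Recall and precision as derived from the summed confusion counts.
def recall(tp, fn):
    return tp / (tp + fn)   # share of true "High Mental Effort" cases found

def precision(tp, fp):
    return tp / (tp + fp)   # share of "High" predictions that are correct

tp, fp, fn = 46, 71, 55     # hypothetical summed counts
print(f"recall={recall(tp, fn):.1%}, precision={precision(tp, fp):.1%}")
```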
3.3. Unimodal Predictions – Physiological Measures
For classifying subjectively perceived mental effort based on physiological measures such as heart rate, respiration, and body temperature, soft voting was chosen for half of the test subjects (Figure 6B). The weighting of the classifiers varied considerably, with the KNN obtaining the highest average weights. The voting classifier (Figure 5B) showed strong overfitting, and its performance for the test subject was not significantly better than that of any single classifier or the dummy classifier. Regarding the percentage of correctly and falsely classified cases (Figure 7B), the distributions for the best-performing single classifiers appear slightly better than those of the voting classifier. The latter had difficulties correctly identifying the conditions with high mental effort, as can be seen in the high number of false negatives. Comparing the recall and precision of both approaches, the voting classifier achieved a recall of only 29.9% (precision: 38.6%), whereas the best single classifiers achieved an average recall of 51.0% (precision: 50.2%).
3.4. Unimodal Predictions – Ocular Measures
For the classification of subjectively perceived mental effort based on ocular measures such as pupil dilation and fixations, the split between soft and hard voting was 5.6% vs. 94.4% (Figure 6C). KNN and SVM were weighted highest. The F1 score of the voting classifier (F1 = .35, Figure 5D) did not indicate a significantly better classification performance than the dummy classifier (F1 = .37). The percentage of correctly and falsely classified cases (Figure 7D) was similar to that of the brain models, with a recall of 39.1% for the voting classifier (precision: 35.1%) and an average recall of 57.9% for the best single classifiers (average precision: 47.6%).
3.5. Unimodal Predictions – Performance
Finally, we predicted subjectively perceived mental effort based on performance (accuracy and speed). Soft voting was used for 27.8% of the test subjects and hard voting for 72.2% (Figure 6D), with SVM being weighted highest. GNB, RFC, and SVM performed significantly better than the dummy classifier, and the performance of the combined voting classifier for the test subject was also significantly better than that of the dummy classifier. The percentage of correctly and falsely classified cases (Figure 7E) reveals superior classification performance compared with the brain-, physiology-, and ocular-based models. However, the voting classifier still had a high number of falsely identified “High Mental Effort” cases (false positives), leading to a recall of 78.5% and a precision of 57.6%. The best-performing single classifiers achieved an average recall of 82.4% and an average precision of 62.3%.
3.6. Unimodal Predictions based on the Upper Quartile Split
To identify informative measures for very high perceived mental effort, potentially reflecting cognitive overload, we also performed predictions based on a subject-wise split at the upper quartile. Compared with the median-split-based results, we observed a decrease in classifier performance, even below the performance of the dummy classifier (Supplementary Figure 7). This might be explained by the fact that we reframed a binary prediction problem with evenly distributed classes into an outlier detection problem. The upper quartile split created classes with imbalanced sample counts, which made the reliable identification of the less well-represented class in the training set more difficult (reflected in the recall; Supplementary Figure 9).
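The imbalance effect can be illustrated with a trivial baseline: after an upper quartile split, roughly a quarter of the samples belong to the positive class, and a majority-class model never detects it. The data below are placeholders:

```python
# Illustration of the imbalance problem created by the upper quartile split:
# a majority-class baseline scores zero recall on the rare positive class.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (rng.uniform(size=400) > 0.75).astype(int)   # ~25% positive class

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(recall_score(y, baseline.predict(X)))       # 0.0: minority class missed
```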
3.7. Unimodal Predictions based on the Experimental Condition
We further fitted models to predict the experimentally induced task load instead of the subjectively perceived mental effort. The prediction of mental effort operationalised by the task load was substantially more successful than the prediction of subjectively perceived mental effort. All modalities, including brain activity and physiological activity, yielded at least one classifier that predicted the current task load above chance level, and the unimodal voting classifiers were all significantly better than the dummy classifier. The best unimodal voting classifications were obtained with performance measures. Interestingly, other classification models were favoured in the unimodal voting, and the distribution between soft and hard voting differed from the subjectively based approach, with soft voting being used more often (Supplementary Figure 17).
3.8. Multimodal Predictions based on the Median Split
In the final step, we combined the different modalities into a multimodal prediction. Figure 5C and Figure 7C show the performance of the multimodal voting classifier, and Figure 8A the average weights allocated to the different modalities. To compare the rather complex feature set construction of the multimodal voting with a simpler approach, we also trained two exemplary classifiers (LR without feature selection and RFC with additional feature selection) on the whole feature set without previously splitting it into the different modalities (Figure 5F).
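These two whole-feature-set baselines can be sketched as follows, assuming scikit-learn estimators; the specific feature-selection step shown (univariate SelectKBest) and the data dimensions are assumptions for illustration, not necessarily the method used:

```python
# Sketch of the two whole-feature-set baselines: LR on all features, and RFC
# preceded by a feature-selection step.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lr_all = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
rfc_selected = make_pipeline(SelectKBest(f_classif, k=50),
                             RandomForestClassifier(random_state=0))

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 400)), rng.integers(0, 2, size=200)
for name, model in [("LR (all features)", lr_all),
                    ("RFC (selected features)", rfc_selected)]:
    print(name, model.fit(X, y).score(X, y))
```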
For most test subjects (55.6%), soft voting was selected to combine the predictions of the different modalities; 44.4% used hard voting. In line with the results outlined above, the multimodal classifier relied solely on performance to predict subjectively perceived mental effort, thereby effectively turning it into a unimodal classifier. The voting classifier led to a significantly better classification than the dummy classifier (Figure 5C). The multimodal classifier exhibited an equivalent percentage of correctly and falsely classified cases (Figure 7C) compared with the performance-based classifier, demonstrating an average recall of 78.5% and an average precision of 57.6%. On average, it performed better than the classifiers trained on the combined whole feature set, which showed substantial overfitting.
To assess the performance of the multimodal classifier without incorporating performance-based information such as speed and accuracy, we constrained the classifier to use only (neuro-)physiological and ocular measures. This approach is especially relevant for naturalistic applications in which an accurate assessment of behavioural performance is challenging or impossible within the critical time window. For the multimodal prediction without performance, brain activity was weighted highest (Figure 8B). However, the classifiers revealed strong overfitting during training, and the average performance decreased to chance level (average recall: 40.6% and average precision: 38.3%; Figure 5C).
3.9. Multimodal Predictions based on the Upper Quartile Split
With the upper quartile split, we observed a fundamentally different allocation of weights: high weights were assigned to brain and ocular activity (Figure 8B), while performance received only minimal weights. Hence, the exclusion of performance-based measures had minimal impact on the allocation of weights (Supplementary Figure 10B), and the overall performance of the multimodal classifiers remained largely unaffected (Supplementary Figure 7C). Among the eighteen test subjects, the multimodal classification demonstrated the highest performance in two cases (Supplementary Table 2). However, on average, the multimodal classification based on the upper quartile split was not superior to the unimodal classifiers. It also did not significantly outperform the dummy classifier or classifiers trained on a feature set of simply concatenated modalities without weight assignment (average recall: 21.9% and average precision: 18.5%; Supplementary Figure 7).
3.10. Multimodal Predictions based on the Experimental Condition
Similar to the multimodal voting classifier based on the subject-wise median split of perceived mental effort, the classifiers predicted the experimentally induced task load solely from the performance measures. The average prediction performance was exceptionally high, significantly outperforming the dummy classifier, and comparable to the performance of the classifiers trained on the combined feature set (average recall: 99.7% and average precision: 91.3%; Supplementary Figure 16). When we allowed only (neuro-)physiological and ocular measures as features, ocular measures were weighted highest (Supplementary Figure 19B). In this case, the average performance of the multimodal classifiers was also significantly above chance level, with an average recall of 82.7% and a precision of 58.7%, indicating a successful identification of mental effort based on neurophysiological, physiological, and ocular measures (Supplementary Figure 16C).
Figure 1.
Elements of the WCT interface. Left side of the screen (map): Participants had to monitor the aerial space of the airport. When an unregistered drone entered the yellow area (outer circle), participants had to warn that drone; when an unregistered drone entered the red area (inner circle), participants had to repel it. Right side of the screen (graphical user interface): Participants had to request codes and pictures of unknown flying objects and then classify them as birds, registered drones, or unregistered drones.
Figure 2.
Procedure of the experiment. The presented procedure is exemplary, as the task load condition alternated and the concurrent emotional condition was pseudo-randomised across the different blocks.
Figure 3.
fNIRS optodes’ location. Montage of optodes on the fNIRS cap based on the standard 10-20 EEG system; red optodes: sources; blue optodes: detectors; green lines: long channels; blue lines: short channels. Setup with 41 (source-detector pairs) × 2 (wavelengths) = 82 optical channels of interest.
Figure 4.
Classification procedure with cross-validated randomised grid searches (maximum number of 100 iterations) and a validation set consisting of one or two subjects. The first grid search optimises the hyperparameters for the different individual and unimodal classifiers. The second grid search optimises the weights as well as voting procedure (soft or hard) for the unimodal voting classifier. The third grid search optimises the weights as well as the voting procedure (soft or hard) for the multimodal voting classifier.
Figure 5.
Prediction of the subjectively perceived mental effort based on a median split; validation set: one subject. Bootstrapped 95% confidence intervals (CI; 5000 iterations) of the mean F1 scores for the training set (left, orange) and the test set (right, blue) of the different unimodal and multimodal models. Notches in the boxes visualise the upper and lower boundary of the CI, with the solid line representing the mean and the dashed grey line the median. The box comprises 50% of the distribution, from the 25th to the 75th percentile. The ends of the whiskers represent the 5th and 95th percentiles of the distribution. The grey dashed line shows the upper boundary of the CI of the dummy classifier.
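The bootstrapped CI described in this caption can be sketched as follows; the F1 values are placeholders, e.g., one score per test subject:

```python
# Sketch of the bootstrapped 95% CI of the mean F1 score (5000 resamples).
import numpy as np

rng = np.random.default_rng(0)
f1_scores = rng.uniform(0.3, 0.8, size=18)          # placeholder F1 scores
boot_means = [rng.choice(f1_scores, size=f1_scores.size, replace=True).mean()
              for _ in range(5000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={f1_scores.mean():.2f}, 95% CI=[{ci_low:.2f}, {ci_high:.2f}]")
```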
Figure 6.
Weights and procedure of a unimodal voting classifier to predict subjectively perceived mental effort based on a median split; validation set: one subject. Error bars represent the standard deviation.
Figure 7.
Prediction of the subjectively perceived mental effort (confusion matrix of the test set) based on a median split; validation set: one subject. Percentage of correctly and falsely classified perceived mental effort per model across all test subjects: TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives, with “positives” representing “High Mental Effort” and “negatives” representing “Low Mental Effort”. For the “Best Performing Single Classifier”, we selected the classifier (LDA, LR, SVM, KNN, RFC, or GNB) with the best F1 score for each subject.
Figure 8.
Weights and procedure of a multimodal voting classifier to predict subjectively perceived mental effort based on a median split; validation set: one subject. (B) shows the allocation of weights when performance measures are not included in the multimodal classification. Error bars represent the standard deviation.
Table 1.
Included Features per Modality
| Modality | Features |
| --- | --- |
| Brain Activity | Mean, standard deviation, peak-to-peak (PTP) amplitude, skewness, and kurtosis of the 82 optical channels |
| Physiology – Heart Rate | Mean, standard deviation, skewness, and kurtosis of heart rate; mean, standard deviation, skewness, and kurtosis of heart rate variability |
| Physiology – Respiration | Mean, standard deviation, skewness, and kurtosis of respiration rate; mean, standard deviation, skewness, and kurtosis of respiration amplitude |
| Physiology – Temperature | Mean, standard deviation, skewness, and kurtosis of body temperature |
| Ocular Measures – Fixations | Number of fixations, total and average duration of fixations, and standard deviation of fixation durations |
| Ocular Measures – Pupillometry | Mean, standard deviation, skewness, and kurtosis of pupil dilation |
| Performance | Average reaction time and cumulative accuracy |
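A sketch of the per-window statistics listed in Table 1 is given below, using scipy; windowing and signal preprocessing are omitted for brevity, and the data are placeholders:

```python
# Sketch of the per-window feature extraction in Table 1: mean, standard
# deviation, skewness, and kurtosis per signal, plus peak-to-peak (PTP)
# amplitude for the optical channels.
import numpy as np
from scipy.stats import kurtosis, skew

def basic_features(window):
    return {"mean": float(np.mean(window)), "std": float(np.std(window)),
            "skew": float(skew(window)), "kurtosis": float(kurtosis(window))}

def optical_features(window):
    feats = basic_features(window)
    feats["ptp"] = float(np.ptp(window))  # peak-to-peak amplitude
    return feats

window = np.random.default_rng(0).normal(size=500)  # one channel, one window
print(optical_features(window))
```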