1. Introduction
Numerous studies in the healthcare sector have demonstrated the promising potential of artificial intelligence, and its subcategories of machine learning (ML) and deep learning, to provide objective, data-based support for data interpretation. These techniques have proven beneficial for analyzing complex, multivariate data; finding discriminative, class-specific differences; and ultimately providing objective, data-based decision support to medical practitioners [1,2]. Furthermore, an advantage over the commonly used inference-based statistical analysis methods has been reported [3,4], and ML-based systems have even been shown to surpass human judgment in disease detection [5,6]. In addition, a reduction in false-positive errors and a mitigation of the differing experience levels of medical practitioners have been reported [7]. In concrete biomechanical use cases, ML has proven useful in the diagnosis of gait disorders [8,9], the recognition of human activities [10], age-related assessments [11,12], and the optimization of the rehabilitation phase [13]. Various pathologies have been considered, e.g., after a stroke [8], in Parkinson’s disease [9], in osteoarthritis [14], and in total hip arthroplasty [15]. However, little research has been conducted on the application of ML methods for the evaluation of posture parameters [16].
A common way to check a person’s posture is to assess the back contour through visual inspection. However, this procedure is susceptible to subjectivity and error. In a comparative study in which 28 chiropractors, physical therapists, rheumatologists, and orthopedic surgeons evaluated the posture of subjects from lateral photographs, the intra-rater reliability was only moderate (kappa = 0.50) and the inter-rater reliability was weak (kappa = 0.16) [17]. There is therefore a great need for research to support medical diagnostics with data-based yet transparent ML methods. This has recently led to the development of smartphone apps that assess posture in a semi-automated way [18]. Methods that support these diagnostics while allowing assessments that are comprehensible to the user may therefore be of great medical benefit.
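Agreement statistics of the kind reported in [17] can be computed with Cohen's kappa, which corrects raw agreement for chance. The following minimal sketch uses scikit-learn and entirely hypothetical ratings from two raters (1 = "weak" posture, 0 = "good"); it is an illustration of the metric, not a reproduction of the cited study's data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical posture ratings from two raters (1 = "weak", 0 = "good")
rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Kappa = (observed agreement - chance agreement) / (1 - chance agreement)
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # → 0.40
```

Here the raters agree on 7 of 10 cases (70%), but since roughly 50% agreement is expected by chance, kappa drops to 0.40, which is conventionally read as weak-to-moderate agreement.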
Given the difficulty and error-proneness of expert posture assessment, it must be assumed that both the training and the test posture data for ML may contain wrongly assigned labels. On the one hand, this negatively affects the training process of an ML classifier; on the other hand, the true performance on the test data may be underestimated. This problem is known even for the test data of benchmark datasets (e.g., MNIST, ImageNet) [19]. A re-assessment of class labels is often not feasible simply because the datasets are large, so not all cases can be re-examined economically. A recently described approach for dealing with these problems is confident learning for estimating uncertainty in dataset labels [20]. This approach enables both the supervised training of a model on training data with incorrect labels and the identification of possible errors in the test data, which can then be re-evaluated by experts and corrected. Although there are promising results from confident learning [19,20,21,22], and the error-prone nature of biomechanical expert evaluations highlights the importance of such approaches, to the best of the authors’ knowledge no work has yet applied confident learning to biomechanical or sports science problems.
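The core idea of confident learning [20] can be sketched in a few lines: obtain out-of-sample predicted probabilities, compute each sample's "self-confidence" (the predicted probability of its given label), and flag samples whose self-confidence falls below the average self-confidence of their labeled class. The following is a deliberately simplified illustration on synthetic data with artificially flipped labels; a real application would use a dedicated implementation such as the cleanlab library rather than this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy data with a few labels flipped to simulate annotation errors
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
y_noisy = y.copy()
flip = rng.choice(len(y), size=20, replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

# Out-of-sample predicted probabilities (avoids fitting the noise directly)
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                          cv=5, method="predict_proba")

# Self-confidence: predicted probability of the *given* (possibly wrong) label
self_conf = proba[np.arange(len(y_noisy)), y_noisy]

# Per-class confidence thresholds: mean self-confidence within each class
thresholds = np.array([self_conf[y_noisy == c].mean() for c in (0, 1)])

# Flag samples below their class threshold whose most likely class
# disagrees with the given label
suspect = (self_conf < thresholds[y_noisy]) & (proba.argmax(axis=1) != y_noisy)
print(f"{suspect.sum()} candidate label errors flagged for re-evaluation")
```

In the workflow of the present study, the flagged samples would then be re-examined by experts instead of re-checking the entire dataset.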
Regarding the use of ML models, a model’s opacity often makes it difficult for users to trust and understand its decisions [28]. Such a lack of transparency violates the requirements of the European General Data Protection Regulation (GDPR, EU 2016/679) [23], which greatly limits the practicality of using such models in clinical settings [24]. Recent advances in explainable artificial intelligence (XAI) have made ML increasingly applicable in practical clinical contexts, for example, in the biomechanical domain [25,26]. XAI provides various methods for increasing the trustworthiness and transparency of black box models [27], such as local interpretable model-agnostic explanations (LIME) [28], Shapley additive explanations (SHAP) [29], and deep learning important features (DeepLIFT) [30]. The use of XAI has proven especially valuable in understanding personalized differences in pathology, such as in monitoring pre- and post-operative therapy measures, and is thus highly relevant to the field of personalized medicine [26].
Although these works have shown interesting perspectives, local interpretations in the biomedical domain have so far mainly focused on a few XAI methods, e.g., LIME [25] or layer-wise relevance propagation [2]. Furthermore, these methods have not shown to what extent changes in the input features would influence the model prediction. This, however, would be highly relevant both for good comprehensibility and for the planning of therapy measures, which normally depend on the classification of a human examiner.
Counterfactual explanations (CFs), an XAI tool, could address these aspects but, to the best of the authors’ knowledge, have not yet found their way into the biomedical context. CFs examine which features would need to be changed to achieve a desired prediction. Since human posture is multifactorial, i.e., a large number of individual posture parameters (e.g., depth of lumbar lordosis, forward tilt of the pelvis, degree of thoracic kyphosis) enter a physician’s summary assessment, it would be interesting to know for which combinations and expressions of these individual parameters the physician would assess the posture as correct. In the context of this binary classification problem of a posture assessment (“good” or “weak”, meaning “no therapy” or “therapy”), this could mean that, for a subject classified with an 80% probability as pathologic: “if we could improve the pelvic tilt by X degrees, the patient would be classified as not having poor posture with a probability of 80%”, whereby individual personal characteristics (e.g., gender, age) could additionally be included. By providing explanations in this way (contrastive to the instance of interest) and by usually focusing on a small number of features to change, CFs are particularly human-friendly explanations [31].
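To make the mechanism concrete, the following is a deliberately simplified, greedy CF search on synthetic stand-in data: starting from an instance predicted as the undesired class, it repeatedly nudges the single most influential feature until the prediction flips. This is an illustration of the concept only, not the algorithm used in this study; dedicated libraries such as DiCE generate multiple diverse CFs via an optimized objective.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for posture parameters (hypothetical, 4 features)
X, y = make_classification(n_samples=300, n_features=4, n_informative=4,
                           n_redundant=0, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X, y)

def simple_counterfactual(x, model, target=1, step=0.05, max_iter=1000):
    """Greedy sketch: nudge the most influential feature toward the target class."""
    x_cf = x.copy()
    direction = 1.0 if target == 1 else -1.0
    coefs = model.coef_[0]
    j = np.argmax(np.abs(coefs))  # feature with the largest effect on the logit
    for _ in range(max_iter):
        if model.predict([x_cf])[0] == target:
            break
        x_cf[j] += step * np.sign(coefs[j]) * direction
    return x_cf

# Pick an instance predicted as class 0 ("weak posture" in the analogy)
idx = np.where(clf.predict(X) == 0)[0][0]
x_cf = simple_counterfactual(X[idx].copy(), clf, target=1)
print("original class:", clf.predict([X[idx]])[0],
      "-> counterfactual class:", clf.predict([x_cf])[0])
```

Note that this toy search changes exactly one feature, which conveniently matches the sparsity property that makes CFs human-friendly; real CF generators trade this off against proximity and diversity explicitly.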
Given the above-mentioned research deficits, the aim of the present work, using the posture data of subjects with hyperkyphosis or hyperlordosis as well as healthy subjects, was twofold: First, we wanted to evaluate the general modeling abilities and check whether the presence of hyperkyphosis or hyperlordosis can be classified to give an objective, data-based orientation. In parallel, we wanted to evaluate confident learning for model training and for the identification of test data label errors, and to check whether the re-evaluation and potential correction of flagged test labels improves the measured performance of the model. Second, we wanted to analyze whether CFs add useful insights into the trained models and provide plausible suggestions, in biomechanical terms, for the improvement of parameters.
4. Discussion
The present results show that it is possible to classify the presence of hyperlordosis or hyperkyphosis by means of ML, based on postural data measured using stereophotogrammetry. Using confident learning to flag possible class label errors in the test set, followed by the re-evaluation and correction of the respective cases by experts, revealed that the original labels of the test data were partially incorrect. After correcting the class labels for both hyperlordosis and hyperkyphosis, the best mean PRAUC value of 0.97 was achieved. The erroneous test labels, therefore, led to the actual performance of the model being underestimated.
In the present case of the ML-based classification of hyperlordosis and hyperkyphosis, around 10% of the test labels were incorrect. Particularly when datasets are not labeled by combining the judgments of several experts, as was the case for the present dataset, the described approach can help to identify errors in the existing data without having to re-check all data samples, which is often not feasible for economic reasons. Although the results highlight the benefits of using confident learning to identify potentially mislabeled test set labels, no performance benefits were found when using confident learning for model training with partially mislabeled training data.
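The effect of wrong test labels on the measured PRAUC can be illustrated with a small hypothetical example: a single mislabeled positive in a well-ranked test set is enough to depress the score, and correcting it recovers the model's true performance. The numbers below are invented for illustration and are unrelated to the study's data.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical predicted probabilities and test labels with one label error
proba = np.array([0.95, 0.9, 0.8, 0.7, 0.4, 0.35, 0.2, 0.1])
y_orig = np.array([1, 1, 0, 1, 0, 0, 0, 0])   # third sample wrongly labeled 0
y_fixed = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # after expert re-evaluation

# average_precision_score summarizes the precision-recall curve (PRAUC)
print("PRAUC with original labels: ", average_precision_score(y_orig, proba))
print("PRAUC with corrected labels:", average_precision_score(y_fixed, proba))
```

With the corrected labels the ranking is perfect and the PRAUC reaches 1.0, whereas the single wrong label pulls it down to about 0.92, mirroring on a small scale how the study's corrected test set raised the best mean PRAUC to 0.97.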
Since feature extraction is an important step for improving the accuracy of a model, avoiding overfitting, reducing the required computing power, and improving interpretability [46], a reduction to a suitable number of features should be aimed for. With regard to interpretability, especially in relation to previous research and existing knowledge, expert-based features, which are common in practice and reported in the literature, proved to be superior [25,35]. In this study, the selected, interpretable, and practice-relevant features led to improved classification results compared to the use of all available features. Nevertheless, a possible a priori loss of information due to feature selection should be critically discussed in this context, particularly for non-data-based selections [1]. However, the potential a priori loss of information through expert-based feature construction and selection appears to be low overall, since the selected features achieved improved classification results compared to the full set of available features as the model input. It can therefore be assumed that the present expert-based feature set is highly suitable and superior to the use of all available features.
According to [32], the criteria for good CFs include the following: (a) a CF with the predefined class prediction can be generated; (b) a CF should be close to the instance in terms of the feature values and should change as few features as possible; (c) several different CFs should be provided; and (d) a CF should have probable or realistic feature values. These aspects are discussed below for evaluation:
(a) In this study, ten different CFs could be found for each person. Consequently, the results showed that it was, in general, possible to find CFs for the specified task. (b) Considering the global feature changes, the CFs were relatively close to the original feature values, and a maximum of two features were dominantly varied per class. The changes appeared to be necessary to change the class membership, since healthy subjects and subjects with hyperkyphosis or hyperlordosis, according to the results of this study and other research [36], showed differences in the respective features. Accordingly, the analysis of the exemplary local CFs also showed that these were relatively close to the original feature values and that changes to individual features predominated. Overall, this corresponds to the criterion mentioned.
In the present study, the proximity and diversity were set to the default values of scikit-learn. Depending on the area of application, further tuning of the parameters can be useful. For example, increasing the proximity weight might result in features that are closer to the original query instance and less diverse.
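To clarify what these two weights control, one common formulation of the CF objective, used, e.g., by the gradient-based method of the DiCE library, can be written as follows; whether the tool used in this study implements exactly this objective is an assumption, so the formula is given only as an orientation:

```latex
% lambda_1 = proximity weight, lambda_2 = diversity weight; f is the
% classifier, x the query instance, y* the desired class, and c_1..c_k
% the k generated counterfactuals.
\min_{c_1,\dots,c_k} \;
  \frac{1}{k}\sum_{i=1}^{k} \mathrm{yloss}\bigl(f(c_i), y^{*}\bigr)
  \;+\; \frac{\lambda_1}{k}\sum_{i=1}^{k} \mathrm{dist}(c_i, x)
  \;-\; \lambda_2\,\mathrm{diversity}(c_1,\dots,c_k)
```

Increasing \(\lambda_1\) therefore pulls the CFs toward the original query instance, while decreasing \(\lambda_2\) reduces how much the CFs are pushed apart from one another, matching the tuning behavior described above.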
(c) Ten different CFs were given for each instance, which again speaks to the fulfillment of the criterion. However, providing multiple solutions is both advantageous and disadvantageous. The question remains of how to find a reasonable, context-relevant, and meaningful explanation from all the explanations provided. A possible approach could be either the definition of context-specific external criteria to select the most appropriate CF or an expert-based selection based on prior knowledge and suitability for individual subject characteristics.
(d) Looking at the features that were globally modified to change the class prediction of subjects with postural deficits, it can be seen that differences between healthy subjects and subjects with hyperlordosis were mainly observable for the features KI% and FL%. Persons with hyperkyphosis differed from healthy persons in the features KI% and FC%. This is consistent with the differences reported in the literature for hyperkyphosis and hyperlordosis [47], as well as with the statistical comparison of the healthy subjects and the subjects with postural deficits in this study. The XAI interpretations thus appear plausible overall.
The results showed that for hyperkyphosis, the CFs that shifted the feature FL% of persons with postural deficits towards healthy persons did not agree with the feature values of the healthy group according to the Mann–Whitney U test. However, the small effect size does not appear to indicate a greater implausibility. No statistical differences were found for any of the other features, which in turn speaks to the general plausibility of the CFs.
On closer inspection, the distributions of the feature values did not match exactly, but the values of the CFs appeared to lie close to the feature values in the distribution of the healthy subjects and were, therefore, at least realistic. Thus, it seems likely that CFs can meaningfully shift the class affiliation of individuals with postural deficits using the postural parameters of healthy individuals and small feature changes. Since this is one of the first works in this field and sufficient comparative studies are not yet available, these findings need to be further evaluated in future studies. Furthermore, optimizing the proximity and diversity parameters could also help the CFs correspond better with the actual characteristics of healthy people.
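A plausibility check of this kind can be reproduced with scipy: compare the feature distribution of the healthy group against the CF suggestions with the Mann–Whitney U test and report the rank-biserial correlation as an effect size. The data below are randomly generated placeholders, not the study's measurements.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical feature values (e.g., FL%) for healthy subjects vs. the
# corresponding counterfactual suggestions
rng = np.random.default_rng(42)
healthy = rng.normal(loc=50.0, scale=5.0, size=30)
cf_values = rng.normal(loc=52.0, scale=5.0, size=30)

u_stat, p_value = mannwhitneyu(healthy, cf_values, alternative="two-sided")

# Rank-biserial correlation as a simple effect size for the U test
effect_size = 1.0 - 2.0 * u_stat / (len(healthy) * len(cf_values))
print(f"U = {u_stat:.1f}, p = {p_value:.3f}, r = {effect_size:.2f}")
```

A significant p-value with a small |r| would correspond to the situation described above: a detectable but practically minor deviation of the CF values from the healthy distribution.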
Based on [20], the black box problem (a) and the problem of labeling the data (b) can be characterized as central challenges when using AI with biomechanical data. The present work contributes to solving problem (a) with an approach that, in contrast to other methods from the XAI area, is particularly user-friendly, and problem (b) through label error detection. In the present study, CFs were used as the XAI tool for interpretation. However, it should be noted that it was not analyzed intensively whether other XAI methods match the results found for the CFs and thus support the local suggestions. In general, the agreement between different XAI methods, and between the XAI results of different classifiers, has received little attention, although more or less strong variations of the XAI results are to be expected [26]. Therefore, future work should try to combine different XAI interpretation methods into an ensemble approach to generate more robust interpretations.
Although very good modeling results were obtained, several points related to the remaining modeling error should be discussed, as they could help to further reduce it; e.g., the experimental design could be optimized to improve the class separation. It should also be noted that logistic regression showed reduced performance only in the classification of hyperkyphosis and otherwise performed similarly to the Gaussian process classifier. Since logistic regression is itself a highly interpretable approach, it may also be useful, depending on the area of application, to use logistic regression only for classification and to interpret the model directly rather than generating CFs. Nevertheless, promising results have also been reported for the use of logistic regression in combination with CFs [48].
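The direct interpretability mentioned here follows from the model form: each logistic regression coefficient maps to an odds ratio per unit (here, per standard deviation) of the feature. The sketch below uses synthetic data, and the feature names KI%, FL%, and FC% are attached purely for illustration; the fitted coefficients have no relation to the study's actual models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins; the posture feature names below are illustrative only
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coef) is the multiplicative change in the odds of the positive class
# per one standard deviation of the feature
for name, coef in zip(["KI%", "FL%", "FC%"], clf.coef_[0]):
    print(f"{name}: coef = {coef:+.2f}, odds ratio = {np.exp(coef):.2f}")
```

This kind of global, model-intrinsic explanation complements CFs, which instead answer the local question of what would have to change for a specific subject.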
For the evaluation, the current study compared the statistical characteristics of the feature values of healthy test persons with the CFs, which suggest what the feature values of the test persons with hyperlordosis and hyperkyphosis should look like so that they would be classified as healthy. For the global analysis, the ten CFs of each person were aggregated to a median, which might eliminate the original relationships between the features. Consequently, a future analysis that takes the relationships between the features into account could be the individual assessment of the local CFs by experts.
Another practical limitation is that the resulting models can only recognize characteristics for which they have been trained (here, hyperlordosis and hyperkyphosis) and are therefore pathology-dependent. Recently, interpretable, pathology-independent classifiers have been proposed to deal with this limitation [16,49]. Transferring the methodology of the present study to these classifiers could potentially create a powerful tool and could further increase the practical relevance of the ML methodology in biomechanical research.
Author Contributions
Conceptualization, C.D. and O.L.; methodology, C.D. and O.L.; software, C.D.; validation, C.D., O.L., S.S., and S.B.; formal analysis, C.D.; investigation, O.L.; resources, M.F.; data curation, C.D.; writing—original draft preparation, C.D., O.L., S.S., and S.B.; writing—review and editing, C.D., O.L., S.S., S.B., and M.F.; visualization, C.D.; supervision, M.F.; project administration, M.F.; funding acquisition, M.F. All authors have read and agreed to the published version of the manuscript.