1. Introduction
Typhoid fever and malaria are two of the most prevalent febrile diseases in the world and present serious public health challenges, especially in tropical and subtropical areas. Both diseases are common in these areas due to high humidity and temperatures, inadequate healthcare facilities, and a shortage of qualified healthcare providers [1]. Typhoid fever, also called enteric fever, is caused by the bacterium Salmonella enterica serotype Typhi; it affects millions of people worldwide and can have serious consequences if left untreated [2,3,4]. Malaria, on the other hand, is caused by Plasmodium parasites transmitted through the bites of Anopheles mosquitoes, infecting millions of people and claiming the lives of hundreds of thousands every year [5,6,7]. Malaria is the second most studied disease [8] owing to its widespread prevalence, high mortality rate, drug resistance, and environmental factors such as climate change. Prompt and effective diagnosis of these febrile diseases is essential for efficient treatment and care, but current diagnostic techniques often face limitations in accessibility, specificity, and sensitivity.
Machine learning (ML) algorithms are frequently used in the healthcare sector to help decision-makers make well-informed decisions [9,10]. ML has proven to be a potent tool in medical diagnostics, capable of improving the efficiency and accuracy of diagnosis; however, to guarantee that medical professionals can rely on and comprehend the judgments made by these models, their use in clinical settings demands a high level of interpretability and transparency. According to Anderson & Thomas [11], concerns about the lack of interpretability of ML algorithms frequently impede their acceptance in the healthcare sector. The capacity to comprehend and interpret the choices made by ML models is critical in this sector, where decisions can have a significant impact on patient outcomes. To address this challenge, explainable AI (XAI) techniques such as Local Interpretable Model-agnostic Explanations (LIME) offer insights into how models arrive at their predictions, thereby promoting trust and aiding clinical decision-making by healthcare professionals. XAI is becoming increasingly important in healthcare, where decisions carry extremely high stakes and it is challenging for professionals to trust and comprehend the decisions made by traditional ML models. In clinical settings, where understanding the reasoning behind a diagnosis is critical for patient safety, regulatory compliance, and ethical considerations, the lack of interpretability may impede the adoption of AI [12]. XAI addresses these problems by making the decision-making process of AI models transparent and intelligible. Techniques such as LIME are widely used to clarify the inner workings of complex models. LIME approximates a black-box model with an interpretable model that is local to the prediction: it perturbs the input data, tracks how the predictions change as a result, and then fits a simple, understandable model to these perturbed samples [13,14]. LIME is especially helpful where explanations for individual cases are required, as it identifies which features are most important for a particular prediction. It greatly enhances the interpretability of ML models in healthcare, allowing physicians to better comprehend and rely on AI-driven insights, and its concise, actionable explanations improve the usefulness of AI systems in diagnosis and treatment planning. LIME has been applied in several healthcare settings, such as diagnosing diabetes [15], classifying co-morbidities associated with febrile diseases in children and pregnant women [16], and producing transparent health predictions [17]. To further improve accuracy and explainability, incorporating large language models (LLMs) into diagnostic processes appears promising alongside XAI techniques. These models can bridge the gap between complex ML algorithms and clinical understanding: trained on a wealth of medical data, they can provide distinctive interpretations and generate detailed, contextually relevant explanations for diagnostic outcomes.
The use of LLMs in medical contexts has advanced significantly thanks to projects such as the Generative Pre-trained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT). These models can produce human-like text and comprehend intricate linguistic patterns because they have been trained on enormous volumes of text data. The applications of BERT go beyond identifying pandemic illnesses; it can also be used to process electronic medical records and to evaluate the results of goals-of-care discussions in clinical trials [18,19,20]. GPT has proven remarkably adept at producing coherent and contextually relevant text across domains [21]. In healthcare, GPT can deliver comprehensive patient reports, produce justifications for medical diagnoses, and offer assistance during clinical decision-making [22]. Integrating these LLMs can greatly improve the accuracy and explainability of diagnostic systems: they can produce thorough narratives that clarify the reasoning behind diagnostic predictions, which facilitates clinician comprehension and validation of AI recommendations. This ability is essential for bridging the gap between cutting-edge AI models and routine clinical use, raising the standard of healthcare delivery as a whole.
This study aims to enhance the interpretability of typhoid and malaria diagnosis using ML techniques, namely Extreme Gradient Boosting (XGBoost), Random Forest (RF), and Support Vector Machine (SVM), together with LIME and LLMs such as GPT, Gemini, and Perplexity. It emphasizes the potential of integrating these tools to interpret and contextualize medical data, thereby bridging the gap between complex ML diagnoses and healthcare workers' comprehension. A real-world patient dataset consisting of symptoms and diagnoses of malaria and typhoid was collected from healthcare facilities across the Niger Delta region of Nigeria. By leveraging these advanced tools, we seek to develop a diagnostic model that delivers precise diagnoses and provides transparent, understandable insights into its decision-making process. This research can advance the field of diagnostic medicine and enhance diagnostic procedures, ultimately leading to better patient outcomes. This study's primary contributions are:
The consideration of multiple diseases (typhoid fever and malaria) allows for a thorough evaluation of the patient's health, which is essential for managing co-infection and comorbidity.
Using real-world data ensures that the models are trained and validated on clinical cases, thereby enhancing the practical relevance and applicability of our findings.
The black-box nature of many ML models is addressed by integrating an XAI method, which gives medical professionals transparent and comprehensible insights into how each feature influences the diagnosis and ensures that diagnostic results are presented in a way that is meaningful and easy to interpret. This focus on interpretability helps healthcare workers make more accurate and timely diagnoses.
LLMs give the diagnosis process an extra layer of context-aware understanding; incorporating them makes it possible to better understand and analyze complex medical outcomes.
The combination of LLMs and conventional ML models enables a thorough comparison of various diagnosis strategies. This not only demonstrates the models' efficacy but also reveals the advantages and disadvantages of each approach to handling medical data.
The integration of XAI, LLMs, and ML puts this work at the forefront of medical AI research. It demonstrates the viability and benefits of using a hybrid approach to address difficult diagnostic problems, establishing a standard for further study in the area.
The rest of the study is organized as follows: Section 2 presents the methodology, including data collection, preprocessing, and the application of XAI and ML models, along with the incorporation of LLMs for improved diagnostic interpretability. Section 3 discusses the results, evaluating the effectiveness of the various algorithms and illustrating how XAI offers insights into model decisions, along with the implications for clinical practice. Section 4 concludes the study, highlighting its limitations and offering recommendations for further research to advance diagnostic techniques.
3. Results and Discussion
This section presents the assessment of the models' performance, the XAI method adopted, and the experimental evaluation of the large language models for febrile disease diagnosis (malaria and typhoid fever).
Figure 7, Figure 8 and Figure 9 present the confusion matrices, an essential instrument for assessing how well a classification model performs. Table 3 presents the values of these metrics and the computation time of each model, while Figure 10 is a pictorial representation of the models' performance on the considered metrics. The results show that RF (accuracy = 71.99%, precision = 71.29%, recall = 71.99%, F1-score = 71.45%) demonstrates superior performance, outperforming XGBoost (accuracy = 71.29%, precision = 70.56%, recall = 71.29%, F1-score = 70.66%) and SVM (accuracy = 68.60%, precision = 68.65%, recall = 68.60%, F1-score = 68.21%). High recall and precision are essential for diagnosing diseases like typhoid and malaria: high recall guarantees that the majority of real cases are identified, while high precision helps prevent needless treatments for illnesses that are not present. Because both XGBoost and RF balance these metrics well, they are better suited for clinical applications, where false positives and false negatives can have detrimental effects. XGBoost also has a smaller log loss, indicating more accurate and better-calibrated probability estimates and thus stronger diagnostic confidence; this matters in medical diagnostics, where confidence in the presence of a disease can be as important as raw accuracy. In medical scenarios where treatment decisions depend on the certainty of a diagnosis, the lower log loss of XGBoost indicates that its probability estimates are more reliable, whereas the higher log loss of RF makes its probability estimates less trustworthy and could cause uncertainty in decision-making. SVM performs worse than the other two models in terms of both performance metrics and computation time (its running time exceeds one hour), implying that it may be less suitable for diagnosing typhoid and malaria on this dataset. The ensemble techniques (XGBoost and RF) therefore appear better at capturing the intricate relationships between symptoms and diseases than the SVM model.
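For reference, the reported metrics follow their standard definitions; a small sketch computes precision, recall, F1-score, and log loss from scratch (the toy labels and probabilities below are invented for illustration, not the study's data):

```python
import math

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class from paired labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss: lower is better; confident wrong probabilities are punished hardest."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(precision_recall_f1(y_true, y_pred))
print(log_loss(y_true, [0.9, 0.2, 0.4, 0.8, 0.6]))
```

The log-loss definition makes the calibration argument concrete: a model that assigns probability 0.9 to a true case incurs far less loss than one that assigns 0.5, which is why a lower log loss signals more trustworthy probability estimates even at similar accuracy.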
The LIME plots (Figure 11, Figure 12 and Figure 13) provide a global view of how the features (symptoms) contribute to the model's diagnoses across the entire test dataset, identifying the features with the highest average contributions, both positive and negative, across all diagnoses. The XGBoost LIME diagram in Figure 11 shows that symptoms such as SWRFVR, HDACH, and CNST, as indicated by their negative contributions on the left side of the plot, are associated with a lower likelihood of a patient having malaria and typhoid when absent or at lower levels, whereas BITAIM, LTG, CHLNRIG, MSCBDYPN, and FVR are the most influential symptoms, consistently contributing to the diagnoses of malaria and typhoid across numerous patients.
The RF LIME diagram in Figure 12 likewise indicates that the same symptoms (SWRFVR, HDACH, and CNST) are associated with a lower likelihood of having malaria and typhoid, whereas BITAIM, CHLNRIG, ABDPN, LTG, GENBDYPN, MSCBDYPN, FTG, and HGGDFVR are influential symptoms that contribute to the diagnoses of malaria and typhoid among patients.
Figure 13 shows the SVM LIME diagram, indicating that CHLNRIG has the highest feature importance, followed by MSCBDYPN, LTG, ABDPN, BITAIM, FTG, and CNST as influential symptoms contributing to the diagnoses of malaria and typhoid, while GENBDYPN, SWRFVR, FVR, HGGDFVR, and HDACH are associated with a lower likelihood of having malaria and typhoid.
These observations suggest that medical experts should focus on the influential symptoms BITAIM, CHLNRIG, LTG, ABDPN, MSCBDYPN, FVR, GENBDYPN, FTG, and HGGDFVR when diagnosing malaria and typhoid fever in patients. This is consistent with the results of Asuquo et al. [34], where GENBDYPN, CHNLNRIG, ABPN, FVR, FTG, and HGDFVR were observable symptoms. LIME has numerous advantages. It explains individual diagnoses in a form that is relatively easy for humans to comprehend, helping healthcare workers understand why a model made a specific diagnosis. LIME can also be applied to many ML models, and this versatility makes it suitable for various medical diagnostic systems. Besides, LIME is well suited to generating explanations using local approximations [35]. Its limitation is that generating explanations for individual diagnoses is computationally intensive and expensive, especially for large datasets and complex models.
Furthermore, three sets of experiments were conducted to evaluate the performance of ChatGPT, Gemini, and Perplexity in diagnosing malaria and typhoid. In Experiment 1, prompts were sent one at a time to the LLMs for the first 100 patients in the dataset, and the outputs were recorded in CSV format to see how the models performed with a single set of prompts. In Experiment 2, 100 prompts from the first 100 patients in the dataset were sent to the LLMs, and the outputs were stored in CSV format to observe their responses to a series of prompts. In Experiment 3, 100 unique prompts were sent to the models repeatedly until the entire dataset was exhausted, to assess how the models performed when given large sets of unique prompts.
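Experiment 1's per-patient prompting could be scripted along the following lines. The column names, the prompt template, and the query_llm stub are illustrative placeholders, not the study's actual prompts; a real run would replace the stub with calls to the ChatGPT, Gemini, or Perplexity APIs.

```python
import csv
import io

PROMPT = ("A patient presents with the following symptoms: {symptoms}. "
          "Diagnose: malaria, typhoid, both, or neither. Answer with one word.")

def build_prompts(csv_text, limit=100):
    """Turn the first `limit` patient rows into one prompt each (Experiment 1 style)."""
    prompts = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        if i >= limit:
            break
        # Keep only the symptoms recorded as present (encoded "1" in this sketch).
        present = [name for name, value in row.items()
                   if name != "diagnosis" and value == "1"]
        prompts.append(PROMPT.format(symptoms=", ".join(present)))
    return prompts

def query_llm(prompt):
    # Stub standing in for a real ChatGPT/Gemini/Perplexity API call.
    return "malaria"

sample = "FVR,HDACH,CHLNRIG,diagnosis\n1,0,1,malaria\n0,1,0,typhoid\n"
for p in build_prompts(sample):
    print(query_llm(p))
```

Collecting each response next to the ground-truth diagnosis column is then enough to compute the accuracy, precision, recall, and F1-score reported for each experiment.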
Table 4 presents the results of the three experiments. In Experiment 1, ChatGPT 3.5 performs slightly better, with the highest F1-score (30.99%); the F1-score is crucial because it balances recall and precision, providing a comprehensive measure of model performance. Although ChatGPT 3.5 and Gemini achieve better accuracy and recall (30%), Perplexity is better at minimizing false positives, with the highest precision (38.90%). In Experiment 2, Perplexity performs best, with the highest F1-score (26.29%), accuracy (28%), and recall (28%), while ChatGPT 3.5 is better at reducing false positives with the highest precision, and Gemini has the lowest performance. In Experiment 3, ChatGPT 3.5 achieves better accuracy, precision, and recall, followed by Gemini and Perplexity; however, its relatively low F1-score suggests an imbalance between precision and recall, indicating that the model may have trouble balancing the minimization of false positives with the identification of true positives.
ChatGPT is an innovative tool for comprehending and processing natural language, making it suitable for interpreting and summarizing complex, up-to-date information. Gemini is an adaptable tool that can handle various data types, such as images and text, making it suitable for diagnostic purposes. Perplexity specializes in comprehending and generating complex queries and in maintaining context, which can be vital for the retrieval and analysis of medical research. However, these LLMs lack specialized medical knowledge and can produce inaccurate answers, which is critical in a medical context. They require high computational power to generate and process responses, which could limit their use in real-time systems. Data security and patient privacy are further concerns when handling sensitive medical data, and the models require proper validation and regulatory approval before they can be trusted and adopted for clinical use. To facilitate healthcare professionals' comprehension of the reasoning behind a diagnostic output, LLMs integrate and analyze large amounts of medical data and produce human-readable explanations for their decisions.
The overall performance of the ML models in the study was moderate, suggesting the need for a larger dataset to enhance the diagnostic models. While the traditional SMOTE aided in balancing the dataset, an advanced oversampling method may improve model performance. Even with GridSearchCV, the hyperparameters might still be improved, particularly for SVM; alternative tuning techniques such as RandomizedSearchCV or Bayesian optimization could yield better configurations. To improve the results of the LLMs, they will be fine-tuned on a larger dataset, and an ensemble method will be employed to combine the strengths of different LLMs.
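The randomized-search idea mentioned above can be sketched without any ML library: sample configurations from the grid at random and keep the best-scoring one. The SVM-style parameter grid and the scoring function below are illustrative stand-ins; in practice cv_score would be a cross-validated model fit, as RandomizedSearchCV does internally.

```python
import random

param_grid = {                      # illustrative SVM-style search space
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["rbf", "linear"],
}

def cv_score(params):
    # Placeholder for cross-validated accuracy of a model fit with `params`;
    # peaks at an arbitrary "best" configuration for demonstration.
    return 1.0 / (1.0 + abs(params["C"] - 10) + abs(params["gamma"] - 0.01))

def randomized_search(grid, score, n_iter=20, seed=42):
    """Draw n_iter random configurations and return the best one with its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in grid.items()}
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

best, best_score = randomized_search(param_grid, cv_score)
print(best)
```

Unlike an exhaustive grid search, the cost is fixed by n_iter rather than by the size of the grid, which is what makes this approach attractive for the expensive SVM fits noted above.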
To integrate ML, XAI, and LLM techniques into an app, we propose two methods.
Method 1: Separate Training and Validation for ML and LLM
Train, test, and validate an ML model to diagnose malaria and typhoid based on the patient dataset
Apply LIME to explain the ML models' diagnoses and how each symptom contributed to the diagnoses
Train, test, and validate an LLM model independently for generating explanations based on the patient dataset
Integrate the outputs from ML, LIME, and LLM to provide a comprehensive and interpretable diagnosis.
The advantage of Method 1 is that it might yield higher diagnostic performance, given that the two models (ML and LLM) are trained specifically for this task. The disadvantage is that training and validating two independent models increases the computational complexity of the diagnostic system, especially when combining the results to ensure consistency and coherence.
Method 2: Integrated ML, LIME, and LLM Process
Train, test, and validate an ML model to diagnose malaria and typhoid based on the patient dataset
Apply LIME to explain the ML models' diagnoses and how each symptom contributed to the diagnoses
Use LLM for further explainability by passing the patient symptoms and ML results (with LIME explanations) through the model to generate diagnostic explanations in natural language.
The advantage of Method 2 is its simplicity: an integrated pipeline reduces complexity, making the system easier to develop, test, and maintain, which is why we recommend it for implementation. Streamlining the procedure into a single pipeline could increase performance and decrease computational overhead. The explanations produced by LIME can be considered directly by the LLM, which may result in more logical and contextually appropriate explanations. The disadvantage is that the quality of the LLM's explanations depends on the quality of the initial ML and LIME outputs.
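The Method 2 hand-off from the ML model and LIME to the LLM could look like the following sketch, where the diagnosis string, the LIME weights, and the prompt wording are illustrative placeholders rather than the study's actual pipeline:

```python
def build_explanation_prompt(symptoms, diagnosis, lime_weights):
    """Assemble an LLM prompt from the ML diagnosis and LIME feature weights."""
    # Rank symptoms by the magnitude of their LIME contribution.
    ranked = sorted(lime_weights.items(), key=lambda kv: -abs(kv[1]))
    lines = [f"- {name}: weight {weight:+.2f}" for name, weight in ranked]
    return (
        f"Patient symptoms: {', '.join(symptoms)}.\n"
        f"Model diagnosis: {diagnosis}.\n"
        "LIME feature contributions (positive pushes toward the diagnosis):\n"
        + "\n".join(lines)
        + "\nExplain this diagnosis to a clinician in plain language."
    )

prompt = build_explanation_prompt(
    symptoms=["FVR", "CHLNRIG", "LTG"],
    diagnosis="malaria and typhoid",                       # hypothetical ML output
    lime_weights={"FVR": 0.31, "CHLNRIG": 0.42, "LTG": -0.05},  # hypothetical LIME output
)
print(prompt)
```

Because the LIME weights travel inside the prompt, the LLM's narrative can ground its explanation in the same feature contributions the clinician sees in the LIME plot, which is the coherence benefit argued for above.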
Whereas previous studies [36,37,38,39] applied ML models for disease diagnosis, this paper integrated XAI and LLMs to enhance transparency and interpretability in the diagnostic process. The use of LIME for feature-importance analysis and of LLMs for generating context-aware explanations distinguishes the present study. Several factors can contribute to the low performance scores in Table 4: 1) the dataset used during training is limited in size and diversity, affecting the models' ability to generalize to unseen cases; 2) the LLMs may require further fine-tuning and optimization, as the diseases being diagnosed may overlap with other illnesses, challenging the models to differentiate between them accurately. Furthermore, the LLMs did not show strong domain adaptation to the investigated illnesses; fine-tuning them on domain-specific data could significantly improve their performance.
4. Conclusions
This study develops a medical diagnostic framework for malaria and typhoid fever by integrating XAI, LLMs, and ML models. The approach aims to demystify the black-box nature of ML models, offering transparent insights into how each feature or symptom affects the diagnosis. The RF model showed superior predictive performance across all metrics compared to XGBoost and SVM; its high recall and precision are crucial for accurately diagnosing these diseases and preventing unnecessary treatments. However, XGBoost exhibited the lowest log loss (0.7808) and the fastest computation time, indicating more reliable probability estimates and stronger diagnostic confidence, which is vital for treatment decisions. Further analysis indicates that SVM performs worse than the other two models in terms of both performance metrics and computation time, making it less suitable for this dataset. The study suggests that ensemble techniques like RF and XGBoost better capture the complex relationships between symptoms and diseases. The XAI analysis identified BITAIM, CHLNRIG, LTG, ABDPN, MSCBDYPN, FVR, GENBDYPN, FTG, and HGGDFVR as key features for predicting malaria and typhoid. Among the LLMs, ChatGPT 3.5 performed slightly better than Gemini and Perplexity. The study recommends the integrated ML, LIME, and LLM process because of its simplified development and maintenance workflow, its resource efficiency, and its improved explainability: passing both the patient symptoms and the ML results through the LLM allows it to take into account the full context provided by LIME.
Author Contributions
Conceptualization, F.-M.U. and K.A.; methodology, K.A., C.A., D.A., and M.E.; validation, F.-M.U., O.O., K.A., C.A., D.A., and M.E.; formal analysis, K.A., P.A., and D.A.; data curation, K.A. and P.A.; writing—original draft preparation, K.A., D.A., P.A., E.J.A.J., and M.E.; writing—review and editing, K.A., M.E., D.A., O.O., C.A., O.M., and F.-M.U.; supervision, F.-M.U., C.A., D.A., O.O., and M.E.; project administration, F.-M.U. and O.O.; funding acquisition, F.-M.U. All authors have read and agreed to the published version of the manuscript.