3.3. Confusion Matrix and Performance Metrics
The predictive model developed for neonatal sepsis detection demonstrated strong overall performance. Using data from pulse oximetry (PO), near-infrared spectroscopy (NIRS), and skin temperature (ST), the model achieved an accuracy of 87.67 ± 7.42%. This indicates that the model can identify sepsis episodes (s-SEPs) 6 to 48 hours before clinical diagnosis. The feature sensitivity analysis showed that NIRS and ST had the largest impact on the model’s predictive power. Integrating the multimodal biosignal data considerably improved the model’s accuracy, underscoring the importance of comprehensive monitoring for the early detection of sepsis in neonates.
Figure 2 shows the confusion matrix, from which several performance metrics can be derived. The model achieved an accuracy of 83%, meaning that 83% of its predictions (both sepsis and non-sepsis) were correct. The sensitivity (or recall) of 76% indicates that the model correctly identified 76% of the actual sepsis cases and missed the remaining 24%.
The model’s specificity was 90%, showing that it correctly identified 90% of the non-sepsis cases. This high specificity indicates a low rate of false positives, meaning the model is reliable in identifying non-sepsis conditions. Precision, calculated at 88.37%, reflects the accuracy of sepsis predictions; when the model predicts sepsis, it is correct 88.37% of the time. This high precision minimizes the number of false alarms. The F1 score, which balances precision and recall, was 81.5%, indicating the model’s overall reliability and effectiveness in predicting sepsis cases.
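The metrics summarized above all follow from the four cells of a binary confusion matrix. The sketch below shows one way to compute them. The class name `ConfusionMatrix`, the function `metrics`, and the example counts are assumptions made for illustration. The counts are not read from Figure 2: they are one set of values consistent with the reported percentages, and any proportional scaling of them would give the same ratios.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Cells of a binary confusion matrix (sepsis = positive class)."""
    tp: int  # sepsis cases predicted as sepsis
    fn: int  # sepsis cases predicted as non-sepsis
    fp: int  # non-sepsis cases predicted as sepsis
    tn: int  # non-sepsis cases predicted as non-sepsis


def metrics(cm: ConfusionMatrix) -> dict[str, float]:
    """Derive the five metrics discussed in this section.

    Assumes every denominator is non-zero, i.e. the evaluation set
    contains both classes and the model predicts both classes.
    """
    total = cm.tp + cm.fn + cm.fp + cm.tn
    return {
        "accuracy": (cm.tp + cm.tn) / total,
        "sensitivity": cm.tp / (cm.tp + cm.fn),
        "specificity": cm.tn / (cm.tn + cm.fp),
        "precision": cm.tp / (cm.tp + cm.fp),
        # Equivalent to the harmonic mean of precision and sensitivity.
        "f1": 2 * cm.tp / (2 * cm.tp + cm.fp + cm.fn),
    }


# Hypothetical counts chosen to match the reported percentages;
# they are NOT taken from Figure 2.
example = ConfusionMatrix(tp=38, fn=12, fp=5, tn=45)
for name, value in metrics(example).items():
    print(f"{name}: {value:.2%}")
```

With these illustrative counts the function reproduces the reported accuracy, sensitivity, specificity, and precision, and gives an F1 score of about 81.7%, close to the reported 81.5%.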
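Accuracy: The proportion of total correct predictions (both true positives and true negatives) out of all predictions:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.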
Substituting the values from the confusion matrix (Figure 2) gives an accuracy of 83%: the model correctly classified 83% of all cases, sepsis and non-sepsis combined. Accuracy provides a general overview of how often the model is correct, but it does not distinguish between the two types of error, which is why the remaining metrics are reported alongside it.
In this study, the accuracy of 83% suggests that integrating multiple biosignals (pulse oximetry, NIRS, and skin temperature) supports the model’s ability to predict sepsis.
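Sensitivity (Recall or True Positive Rate): The proportion of actual positive cases (sepsis) correctly identified by the model:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
where TP is the number of true positives and FN the number of false negatives.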
The model demonstrated a sensitivity of 76%, meaning it correctly identified 76% of the actual sepsis cases. In the context of neonatal sepsis detection, this indicates that the model detects most sepsis cases, so that the majority of affected infants can be flagged in a timely manner.
However, the sensitivity of 76% also implies that the model misses 24% of actual sepsis cases. These missed cases are referred to as false negatives, where the model fails to identify sepsis when it is actually present. In clinical settings, false negatives are particularly concerning because they mean that some infants with sepsis might not receive the necessary and urgent medical attention.
Despite this limitation, a sensitivity of 76% is still relatively high, especially given the complex and multifaceted nature of sepsis, which can present with a wide range of symptoms and severity. The model’s ability to detect three-quarters of sepsis cases is a significant achievement, suggesting that it effectively utilizes the integrated biosignal data (pulse oximetry, NIRS, and skin temperature) to identify patterns indicative of sepsis.
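Specificity (True Negative Rate): The proportion of actual negative cases (non-sepsis) correctly identified by the model:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
where TN is the number of true negatives and FP the number of false positives.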
Our model achieved a specificity of 90%, which means that it correctly identified 90% of the non-sepsis cases and incorrectly flagged the remaining 10% as sepsis. This indicates that the model reliably rules out sepsis in infants who do not have the condition.
Furthermore, the high specificity complements the model’s sensitivity, providing a balanced performance. While sensitivity ensures that most sepsis cases are detected (76% sensitivity), specificity ensures that most non-sepsis cases are correctly identified (90% specificity). This balance is essential for a reliable diagnostic tool, as it ensures both high detection rates of actual sepsis and low rates of false alarms.
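Precision (Positive Predictive Value): The proportion of positive predictions (sepsis) that are actually positive:
$$\text{Precision} = \frac{TP}{TP + FP}$$
where TP is the number of true positives and FP the number of false positives.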
With a precision of 88.37%, the model is correct 88.37% of the time when it predicts sepsis. This metric is particularly important for assessing the reliability of the model’s positive predictions.
A precision of 88.37% indicates that the large majority of the cases flagged by the model are true sepsis cases. This matters in a clinical setting because it limits the number of false positives, that is, instances where the model incorrectly predicts sepsis in infants who do not have the condition.
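Unlike sensitivity and specificity, precision depends on how common sepsis is in the evaluated population. The sketch below applies Bayes' rule to the reported sensitivity (76%) and specificity (90%). The function name and the prevalence values are hypothetical and chosen only for illustration. The source does not state the class balance of the evaluation set.

```python
def positive_predictive_value(
    sensitivity: float, specificity: float, prevalence: float
) -> float:
    """Precision (PPV) implied by sensitivity, specificity, and prevalence."""
    # Expected fraction of the population that is a true positive.
    true_pos = sensitivity * prevalence
    # Expected fraction of the population that is a false positive.
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical prevalence values; only sensitivity and specificity
# come from the reported results.
for prevalence in (0.50, 0.20, 0.05):
    ppv = positive_predictive_value(0.76, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.2%}")
```

Under the assumption of a balanced evaluation set (50% prevalence), this reproduces the reported precision of 88.37%. At lower prevalence the same sensitivity and specificity would yield a lower precision, which is worth keeping in mind when interpreting the figure.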
F1 Score: The harmonic mean of precision and recall, providing a balance between the two:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
The F1 score of 81.5% reflects a good balance between precision and recall, providing a single metric that summarizes the trade-off between these two measures.
This balance is particularly important in medical diagnostics, where both false positives (incorrectly predicting sepsis) and false negatives (failing to predict sepsis) have significant implications.
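As a quick check of the harmonic-mean definition, the following sketch recomputes the F1 score from the reported precision and recall. The function name `f1_score` is an illustrative choice, and the zero-denominator convention is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Convention assumed here: F1 is 0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values: precision 88.37%, recall (sensitivity) 76%.
harmonic = f1_score(precision=0.8837, recall=0.76)
arithmetic = (0.8837 + 0.76) / 2

print(f"F1 (harmonic mean): {harmonic:.2%}")
print(f"Arithmetic mean:    {arithmetic:.2%}")
```

The harmonic mean comes out at roughly 81.7%, slightly below the arithmetic mean because it is pulled toward the lower of the two values. The small gap to the reported 81.5% may reflect rounding or the evaluation differences noted in this section.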
The metrics calculated from the confusion matrix indicate that the model performs well in detecting sepsis in neonates, with high accuracy, sensitivity, and specificity. The slight discrepancies between the calculated metrics (for example, an accuracy of 83%) and the reported overall accuracy (87.67 ± 7.42%) could be due to different evaluation datasets or inherent variability in model performance.