Background: Nonsurgical treatment of uncomplicated appendicitis is a reasonable option in many cases, despite the scarcity of robust, easily accessible, externally validated and multimodally informed clinical decision support systems (CDSS). The Generative Pre-trained Transformer 3.5 model (GPT-3.5), developed by OpenAI, may provide enhanced decision support for surgeons in less certain appendicitis cases or in patients at higher risk of (relative) operative contraindications. Our objective was to determine whether GPT-3.5, when provided with high-throughput clinical, laboratory and radiological text-based information, comes to similar clinical decisions as a machine learning model and a board-certified surgeon (reference standard) when deciding between appendectomy and conservative treatment.
Methods: In this cohort study we randomly sampled patients presenting to the Emergency Department (ED) of two German hospitals (GFO, Troisdorf and University Hospital Cologne) with right abdominal pain between October 2022 and October 2023. Statistical analysis was performed in R, version 3.6.2, using RStudio, version 2023.03.0+386. Overall agreement between the GPT-3.5 output and the reference standard was assessed using inter-observer kappa values as well as accuracy, sensitivity, specificity, and positive and negative predictive values, computed with the “caret” and “irr” packages. Statistical significance was defined as p < 0.05.
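The agreement measures named in the Methods can all be derived from a single 2×2 table of reference-standard decisions against model decisions. The following is a minimal Python sketch of those formulas, not the authors' R code: the function name `agreement_metrics`, the choice of "surgery" as the positive class, and the example counts are illustrative assumptions and do not come from the study.

```python
def agreement_metrics(tp, fn, fp, tn):
    """Agreement measures from a 2x2 table (hypothetical helper).

    Assumed convention: the positive class is "surgery".
      tp: reference surgery,      model surgery
      fn: reference surgery,      model conservative
      fp: reference conservative, model surgery
      tn: reference conservative, model conservative
    Assumes every row and column total is non-zero.
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n

    # Cohen's kappa: observed agreement corrected for the agreement
    # expected by chance from the marginal totals.
    p_observed = accuracy
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)

    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": kappa,
    }


# Made-up counts for illustration only; these are NOT study data.
example = agreement_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With two raters and two categories, this unweighted kappa is the quantity that two-rater kappa routines typically report; the sketch only makes the arithmetic explicit.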
Results: The surgeon's decision and GPT-3.5 agreed in 102 of 113 cases, and all cases in which the surgeon decided on conservative treatment were correctly classified by GPT-3.5. The estimated training accuracy of the machine learning model was 83.3% (95% CI: 74.0, 90.4), and its validation accuracy was 87.0% (95% CI: 66.4, 97.2). By comparison, GPT-3.5 reached an accuracy of 90.3% (95% CI: 83.2, 95.0), which was not significantly better than the machine learning model (p = 0.21).
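The reported accuracy follows directly from the agreement count (102 of 113 cases). As a rough check, the sketch below recomputes it in Python together with an exact (Clopper-Pearson) binomial 95% confidence interval. That the published interval is of this exact binomial type is an assumption, and the helper names `binom_cdf`, `_bisect` and `clopper_pearson` are illustrative, not taken from the study.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, iters=100):
    """Root of a function that is increasing on [lo, hi], by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


# Counts from the Results: agreement in 102 of 113 cases.
agree, total = 102, 113
accuracy = agree / total
ci_low, ci_high = clopper_pearson(agree, total)
print(f"accuracy = {accuracy:.1%}, 95% CI = ({ci_low:.1%}, {ci_high:.1%})")
```

If the assumption about the interval type holds, the printed interval should land close to the reported (83.2, 95.0).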
Conclusions: To our knowledge, this is the first “intended use” study of GPT-3.5 for surgical treatment that compares surgical decision making with an algorithm. We found a high degree of agreement between board-certified surgeons and GPT-3.5 for surgical decision making in patients presenting to the emergency department with lower abdominal pain.