Preprint Article Version 1 This version is not peer-reviewed

Feasibility of GPT-3.5 Versus Machine Learning for Automated Surgical Decision-Making Determination: A Multicenter Study in Suspected Appendicitis

Version 1 : Received: 29 September 2024 / Approved: 30 September 2024 / Online: 30 September 2024 (07:47:37 CEST)

How to cite: Sanduleanu, S.; Ersahin, K.; Bremm, J.; Talibova, N.; Damer, T.; Erdogan, M.; Kottlors, J.; Goertz, L.; Bruns, C.; Maintz, D.; Abdullayev, N. Feasibility of GPT-3.5 Versus Machine Learning for Automated Surgical Decision-Making Determination: A Multicenter Study in Suspected Appendicitis. Preprints 2024, 2024092358. https://doi.org/10.20944/preprints202409.2358.v1

Abstract

Background: Nonsurgical treatment of uncomplicated appendicitis is in many cases a reasonable option, despite the sparsity of robust, easily accessible, externally validated and multimodally informed clinical decision support systems (CDSS). The Generative Pre-trained Transformer 3.5 model (GPT-3.5), developed by OpenAI, may provide enhanced decision support for surgeons in less certain appendicitis cases or in those posing a higher risk of (relative) operative contraindications. Our objective was to determine whether GPT-3.5, when provided with high-throughput clinical, laboratory and radiological text-based information, would come to similar clinical decisions as a machine learning model and a board-certified surgeon (reference standard) in decision-making for appendectomy versus conservative treatment.
Methods: In this cohort study, we randomly selected patients presenting to the Emergency Department (ED) of two German hospitals (GFO, Troisdorf and University Hospital Cologne) with right abdominal pain between October 2022 and October 2023. Statistical analysis was performed using R, version 3.6.2, on RStudio, version 2023.03.0+386. Overall agreement between the GPT-3.5 output and the reference standard was assessed by means of interobserver kappa values as well as accuracy, sensitivity, specificity, and positive and negative predictive values, using the “caret” and “irr” packages. Statistical significance was defined as p < 0.05.
Results: There was agreement between the surgeon's decision and GPT-3.5 in 102 of 113 cases, and all cases in which the surgeon decided upon conservative treatment were correctly classified by GPT-3.5. The estimated training accuracy of the machine learning model was 83.3% (95% CI: 74.0, 90.4), while its validation accuracy was 87.0% (95% CI: 66.4, 97.2). By comparison, the GPT-3.5 accuracy was 90.3% (95% CI: 83.2, 95.0), which was not significantly better than that of the machine learning model (p = 0.21).
Conclusions: To our knowledge, this is the first “intended use” study of GPT-3.5 for surgical treatment that compares surgeon decision-making with algorithmic output. It found a high degree of agreement between board-certified surgeons and GPT-3.5 for surgical decision-making in patients presenting to the emergency department with lower abdominal pain.
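
The reported GPT-3.5 accuracy can be retraced from the agreement counts given in the Results (102 of 113 cases). The following is a minimal Python sketch, not the authors' R code. It assumes the 95% CI is an exact (Clopper-Pearson) binomial interval, which the abstract does not state. The function names `binom_cdf` and `clopper_pearson` are illustrative and do not come from the study.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for the proportion x / n.

    Assumption: the interval type is Clopper-Pearson; the abstract does not
    say which interval the authors used.
    """

    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect until f(p) is close to target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


# Counts taken from the Results: agreement in 102 of 113 cases.
agreements, n_cases = 102, 113
accuracy = agreements / n_cases
ci_low, ci_high = clopper_pearson(agreements, n_cases)
print(f"accuracy = {accuracy:.1%}, 95% CI = ({ci_low:.1%}, {ci_high:.1%})")
```

If the exact-interval assumption holds, the printed values should be close to the reported 90.3% (95% CI: 83.2, 95.0). Kappa, sensitivity, specificity and the predictive values cannot be retraced the same way, because the abstract does not give the full 2×2 table.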

Keywords

appendectomy; surgical decision making; artificial intelligence

Subject

Public Health and Healthcare, Other
