Preprint Article, Version 1 (preserved in Portico). This version is not peer-reviewed.

A Performance Evaluation of Large Language Model in Keratoconus: A Comparative Study of ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity

Version 1 : Received: 3 September 2024 / Approved: 3 September 2024 / Online: 5 September 2024 (07:01:17 CEST)

How to cite: REYHAN, A. H.; MUTAF, Ç.; UZUN, İ.; YÜKSEKYAYLA, F. A Performance Evaluation of Large Language Model in Keratoconus: A Comparative Study of ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity. Preprints 2024, 2024090257. https://doi.org/10.20944/preprints202409.0257.v1

Abstract

Background: This study evaluates the ability of six popular chatbots (ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity) to provide reliable answers to questions concerning keratoconus. Methods: Chatbot responses were assessed using mDISCERN and Global Quality Score (GQS) metrics. Readability was evaluated using nine validated readability assessments. We also assessed the quality and accountability of the websites from which the questions originated. Results: We analyzed 20 websites, 65% categorized as "Private practice or independent user" and 35% as "Official patient education materials." The mean JAMA Benchmark score was 1.5±0.68, indicating low accountability. Reliability, measured using mDISCERN, ranged from 42.9±3.16 (ChatGPT-3.5) to 46.95±3.53 (Copilot). The most frequent question was "What is Keratoconus?", addressed by 70% of the websites. This question received the highest mDISCERN score (49.33±4.96) and a relatively high GQS score (3.50±0.55), with an Automated Readability Level Calculator score of 13.17±2.13. Moderate positive correlations were observed between the number of websites and both the mDISCERN (r=0.265, p=0.25) and GQS (r=0.453, p=0.05) scores. The quality of information, assessed using the GQS, ranged from 3.01±0.51 (ChatGPT-3.5) to 3.3±0.65 (Gemini) (p=0.34). The differences in readability between the texts were statistically significant: Gemini emerged as the easiest to read, while ChatGPT-3.5 and Perplexity were the most difficult. Based on mDISCERN scores, Gemini and Copilot exhibited the highest percentage of responses in the "Good" range (51-62 points). For the GQS, the Gemini model had the highest percentage of responses in the "Good" quality range, with 40% of its responses scoring 4-5. Conclusions: While all chatbots performed well, Gemini and Copilot showed better reliability and quality. However, their readability often exceeded recommended levels. Continuous improvements are essential to match information with patients' health literacy for effective use in ophthalmology.
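To illustrate the kind of analysis summarized above, the sketch below computes a grade-level readability estimate using the Automated Readability Index (one standard readability formula, used here as a stand-in for the nine assessments applied in the study) and a Spearman rank correlation of the type reported between website counts and chatbot scores. This is a minimal, assumption-laden sketch, not the authors' actual pipeline, and all numeric inputs are invented for illustration.

```python
# Minimal sketch (not the authors' pipeline): estimate a grade-level
# readability score with the Automated Readability Index (ARI) and test
# a rank correlation with SciPy's spearmanr. All data below are
# hypothetical, for illustration only.
import re
from scipy.stats import spearmanr

def automated_readability_index(text: str) -> float:
    """ARI = 4.71*(characters/words) + 0.5*(words/sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    characters = sum(len(w) for w in words)
    return (4.71 * characters / max(len(words), 1)
            + 0.5 * len(words) / max(len(sentences), 1)
            - 21.43)

sample = "Keratoconus is a progressive thinning and bulging of the cornea."
print(f"ARI grade level: {automated_readability_index(sample):.1f}")

# Hypothetical per-question data: how many source websites raised each
# question, and the mean mDISCERN score of the chatbot answers to it.
website_counts = [14, 9, 7, 11, 8, 5]
mdiscern_means = [49.3, 45.1, 44.0, 46.2, 43.8, 42.5]
rho, p_value = spearmanr(website_counts, mdiscern_means)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```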

Keywords

Keratoconus; Chatbots; Large Language Models

Subject

Public Health and Healthcare, Primary Health Care


