Introduction
The use of artificial intelligence (AI) in healthcare is expanding rapidly, with applications ranging from diagnostic support to patient education. However, the medical field demands a high level of accuracy and reliability, since misinformation can lead to adverse health outcomes. The advent of large language models (LLMs) has revolutionized the field of natural language processing (NLP), enabling machines to generate human-like and contextually appropriate responses. Models such as ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity have attracted significant attention for their potential applications across various domains, particularly in healthcare. Nevertheless, the accuracy and reliability of LLMs in specific medical contexts remain underexplored.
Keratoconus is a progressive eye disease characterized by thinning and bulging of the cornea, leading to visual impairment. Patients and caregivers frequently seek information regarding the symptoms, diagnosis, and therapeutic options for this condition, which affects a significant part of the population [1]. Given the complexity and specificity of medical information, it is crucial to evaluate the performance of LLMs in providing accurate and reliable answers to questions related to keratoconus. Obtaining early and accurate information is essential for effective management and treatment. The growing use of LLMs to answer health-related queries underscores the importance of evaluating the quality of the information they provide.
This study was intended to assess the performance of six leading LLMs (ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity) in the context of keratoconus. We sought to determine the extent to which these models can be considered reliable sources of medical information by comparing their responses to frequently asked questions (FAQs) about keratoconus. The findings of this study will yield valuable insights into the capabilities and limitations of LLMs in the medical sphere, guiding future developments in and applications of this technology in the healthcare setting.
Materials and Methods
Ethics
Since the LLMs used in this study are public applications and no patients were involved, ethics committee approval was not required.
Data Collection and Search Strategy
All Google searches used in data collection were executed using a clean-installed Google Chrome browser (Google LLC, Mountain View, CA, USA) in Incognito Mode. To avoid bias from previous searches and geographically targeted search results, we disabled all location filters, advertisements, and sponsored results. The search term used was “Keratoconus FAQ,” and the “People also ask” box was used to obtain FAQs generated by Google’s machine learning algorithms.
Question Selection and Categorization
The first 20 websites returned were reviewed, and the 20 most frequently asked questions concerning keratoconus were selected by two experienced ophthalmologists (AHR, ÇM), who then consolidated similar question patterns into a common question template. The websites used to answer each of the 20 FAQs in this study were first categorized according to the information source: (1) educational institution, including academic medical centers; (2) private practice or independent user; (3) crowd-sourced reference (such as Wikipedia); or (4) official patient education materials published by a national organization (such as the American Academy of Ophthalmology).
JAMA Accountability Analysis
All websites were evaluated for accountability (scores of 0–4) using JAMA benchmarks. According to JAMA guidelines, a website containing patient education materials should (1) include all authors and their relevant credentials, (2) list references, (3) provide disclosures, and (4) provide the date of the most recent update.
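To make the scoring procedure concrete, the following is a minimal sketch in Python of how a website's JAMA accountability score can be tallied from the four benchmark criteria; the record fields and function name are hypothetical illustrations, not the instrument used in this study.

```python
from dataclasses import dataclass


@dataclass
class WebsiteRecord:
    # Hypothetical flags marking whether each JAMA benchmark is satisfied.
    lists_authors_and_credentials: bool
    lists_references: bool
    provides_disclosures: bool
    provides_last_update_date: bool


def jama_accountability_score(site: WebsiteRecord) -> int:
    """Return the JAMA benchmark score (0-4): one point per criterion met."""
    criteria = (
        site.lists_authors_and_credentials,
        site.lists_references,
        site.provides_disclosures,
        site.provides_last_update_date,
    )
    return sum(int(met) for met in criteria)


# Example: a site that lists references and an update date but no authors or disclosures.
example = WebsiteRecord(False, True, False, True)
print(jama_accountability_score(example))  # 2
```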
Large Language Models (LLMs)
LLMs are trained on extensive bodies of text data, including books, scholarly articles, and web pages, covering a wide array of subjects including medicine, sports, and politics. The models employed in this study were ChatGPT-3.5, ChatGPT-4, Gemini, Copilot, Chatsonic, and Perplexity. Each was asked the 20 keratoconus-related FAQs, and its responses were recorded.
Evaluation of LLM-Chatbot Responses
As shown in Table 3, DISCERN is a scoring system developed by the University of Oxford, consisting of three parts and 16 questions, and is used to evaluate the reliability and quality of online health information. The total DISCERN score ranges from 15 to 75, and results are classified as excellent (63-75 points), good (51-62), reasonable (39-50), poor (27-38), or very poor (15-26). The Global Quality Scale (GQS) was applied to assess the quality of LLM responses, with 1 point indicating poor quality and 5 points indicating excellent quality (Table 3). This scale was also used for quality classification, with 1-2 points representing low quality, 3 points moderate quality, and 4-5 points high quality.
Table 3. mDISCERN and GQS content and readability indexes.

DISCERN scoring system | Total score (15-75 points)
1. Are the aims clear? | 1-5 points
2. Does it achieve its aims? | 1-5 points
3. Is it relevant? | 1-5 points
4. Is it clear what sources of information were used to compile the publication (other than the author or producer)? | 1-5 points
5. Is it clear when the information used or reported in the publication was produced? | 1-5 points
6. Is it balanced and unbiased? | 1-5 points
7. Does it provide details of additional sources of support and information? | 1-5 points
8. Does it refer to areas of uncertainty? | 1-5 points
9. Does it describe how each treatment works? | 1-5 points
10. Does it describe the benefits of each treatment? | 1-5 points
11. Does it describe the risks of each treatment? | 1-5 points
12. Does it describe what would happen if no treatment is used? | 1-5 points
13. Does it describe how the treatment choices affect overall quality of life? | 1-5 points
14. Is it clear that there may be more than one possible treatment choice? | 1-5 points
15. Does it provide support for shared decision-making? | 1-5 points
16. Based on the answers to all of these questions, rate the overall quality of the publication | 1-5 points

Global Quality Scale (GQS) | Score
Poor quality, very unlikely to be of any use to patients | 1 point
Poor quality but some information present, of very limited use to patients | 2 points
Suboptimal flow, some information covered but important topics missing, somewhat useful | 3 points
Good quality and flow, most important topics covered, useful to patients | 4 points
Excellent quality and flow, highly useful to patients | 5 points

Readability indexes | Formula
Flesch Reading Ease score (FRE) | 206.835 − 1.015 × (W/S) − 84.6 × (B/W)
Flesch-Kincaid Grade Level (FKGL) | 0.39 × (W/S) + 11.8 × (B/W) − 15.59
Gunning Fog Index (GFI) | 0.4 × [(W/S) + 100 × (C*/W)]
Coleman-Liau Readability Index (CLI) | (0.0588 × L) − (0.296 × S*) − 15.8
Automated Readability Index (ARI) | 4.71 × (C/W) + 0.5 × (W/S) − 21.43
Simple Measure of Gobbledygook (SMOG) | 1.0430 × √(30 × C*/S) + 3.1291
Linsear Write Readability Formula (LW) | (ASL + 2 × HDW) / SL
FORCAST Readability Formula (FORCAST) | 20 − (number of single-syllable words × 150) / (W × 10)
Average Reading Level Consensus Calculator (ARLC) | Consensus grade level derived from the eight formulas above

W = words; S = sentences; B = syllables; C = characters; C* = complex (polysyllabic) words; L = average number of letters per 100 words; S* = average number of sentences per 100 words; ASL = average sentence length; HDW = hard words; SL = sentences.
The LLM-chatbot responses were evaluated and scored in a double-blinded manner by two experienced ophthalmologists (AHR, ÇM) using DISCERN (15-75 points) and the GQS (1-5 points). The score for each response was the average of the two raters' scores, and a consensus score was then determined.
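The band cut-offs described above can be expressed as a short sketch. The following Python snippet is illustrative only (function names and example scores are hypothetical): it averages two raters' totals and maps them to the DISCERN reliability and GQS quality categories used in this study.

```python
def classify_discern(total: float) -> str:
    """Map a total DISCERN score (15-75) to the reliability bands used here."""
    if total >= 63:
        return "excellent"
    if total >= 51:
        return "good"
    if total >= 39:
        return "reasonable"
    if total >= 27:
        return "poor"
    return "very poor"


def classify_gqs(score: float) -> str:
    """Map a GQS score (1-5) to the low/moderate/high quality bands."""
    if score >= 4:
        return "high"
    if score >= 3:
        return "moderate"
    return "low"


# Example: average the two ophthalmologists' scores for one chatbot response.
rater1_discern, rater2_discern = 52, 48
rater1_gqs, rater2_gqs = 4, 3
mean_discern = (rater1_discern + rater2_discern) / 2  # 50.0
mean_gqs = (rater1_gqs + rater2_gqs) / 2              # 3.5
print(classify_discern(mean_discern))  # "reasonable"
print(classify_gqs(mean_gqs))          # "moderate"
```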
Readability Analysis
Each of the 20 websites that provided answers to the 20 FAQs examined in this study was evaluated for readability using nine validated readability assessments: Flesch Reading Ease (FRE), Gunning Fog Index (GFI), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Coleman-Liau Index (CLI), Automated Readability Index (ARI), Linsear Write Formula (LW), FORCAST Readability Formula, and the Average Reading Level Consensus Calculator (ARLC).
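As an illustration of how the first two indexes in Table 3 are computed, the Python sketch below implements FRE and FKGL from raw word, sentence, and syllable counts. The syllable counter is a naive vowel-group heuristic included only for demonstration; the study itself relied on validated readability calculators.

```python
import re


def count_syllables(word: str) -> int:
    """Naive syllable estimate: count groups of consecutive vowels (demo only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text: str) -> dict:
    """Compute FRE and FKGL using the formulas listed in Table 3."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    w, s = len(words), max(1, len(sentences))
    b = sum(count_syllables(word) for word in words)  # total syllables
    fre = 206.835 - 1.015 * (w / s) - 84.6 * (b / w)   # Flesch Reading Ease
    fkgl = 0.39 * (w / s) + 11.8 * (b / w) - 15.59     # Flesch-Kincaid Grade Level
    return {"FRE": round(fre, 1), "FKGL": round(fkgl, 1)}


sample = ("Keratoconus is a progressive eye disease characterized by thinning "
          "and bulging of the cornea, leading to visual impairment.")
print(readability(sample))
```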
Statistical Analysis
Statistical analyses were conducted using R software (Version 4.1.1, R Foundation for Statistical Computing, Vienna, Austria). Descriptive statistics were used to categorize the sources of online information regarding keratoconus, and categorical variables were expressed as numbers and percentages. Differences in the length and readability of responses across the LLM-Chatbots were compared using one-way ANOVA with Tukey's honestly significant difference (HSD) post-hoc test, since the samples met parametric assumptions. Relationships between categorical variables were evaluated with a two-tailed Pearson's χ² test. A p-value of less than 0.05 was considered statistically significant.
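The analysis described above was run in R. Purely as an illustration of the equivalent workflow, the sketch below uses Python (SciPy and statsmodels) with placeholder, randomly generated data to show a one-way ANOVA, Tukey HSD post-hoc comparisons, and a Pearson chi-square test; the chatbot names and counts are hypothetical, not the study data.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical word counts of responses from three chatbots (placeholder data).
lengths = {
    "ChatGPT-3.5": rng.normal(260, 40, 20),
    "Gemini": rng.normal(220, 40, 20),
    "Copilot": rng.normal(235, 40, 20),
}

# One-way ANOVA across chatbots, as described in the Methods.
f_stat, p_value = stats.f_oneway(*lengths.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")

# Tukey's HSD post-hoc comparisons between chatbot pairs.
values = np.concatenate(list(lengths.values()))
groups = np.repeat(list(lengths.keys()), [len(v) for v in lengths.values()])
print(pairwise_tukeyhsd(values, groups, alpha=0.05))

# Pearson chi-square test on a hypothetical contingency table
# (e.g., counts of DISCERN categories by chatbot).
table = np.array([[8, 12, 10],
                  [12, 8, 10]])
chi2, p, dof, _ = stats.chi2_contingency(table)
print(f"Chi-square: chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```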
Discussion
This study evaluated the efficacy of six LLMs in terms of accurately responding to medical queries by comparing their performance on common keratoconus-related questions sourced from Google searches. The findings indicate that LLM-Chatbots have the potential to provide comprehensive responses to keratoconus-related inquiries.
LLMs can provide keratoconus patients with up-to-date, evidence-based information, facilitating rapid access to the latest therapeutic options and research findings. Patients can use LLMs for a better understanding of their condition and to make informed healthcare decisions. Accurate and comprehensible information from LLMs can enhance patient adherence to treatment plans and alleviate concerns, thereby improving their emotional well-being. LLMs also have the potential to empower keratoconus patients by equipping them with the knowledge required for active participation in their healthcare journeys.
This study investigated practical scenarios in which concerned patients might seek assistance from emerging resources. To the best of our knowledge, this is the first study to evaluate LLM responses to keratoconus-related queries. The research builds on previous studies examining the applicability of LLM chatbots, such as ChatGPT, across various medical subspecialties. Prior research has explored the use of LLMs for providing medical information, patient education, and diagnostic and treatment recommendations, albeit with mixed results [2,3]. One significant cause for concern is the potential for misinformation in medical chatbots, which can be manipulated by special interest groups [4]. AI models are designed to generate content based on probable word sequences rather than producing factual answers. While generative AI chatbots can debunk misinformation, they can also spread falsehoods if not regularly updated with the latest scientific evidence [5]. For example, Lim et al. reported that ChatGPT-3.5 incorrectly stated that “Atropine eye drops are a new treatment for myopia and their optimal dosage has not yet been determined” [6]. Giuffrè et al. evaluated LLMs in the context of digestive diseases and concluded that, despite their potential, their current accuracy and reliability are inadequate for clinical use [7]. Conversely, LLMs such as ChatGPT and Google Bard have exhibited impressive medical knowledge and capabilities, proving beneficial for patient communication [8,9,10].
In the field of ophthalmology, LLM chatbots have exhibited promise in addressing common patient queries concerning eye health [11]. Cohen et al. determined that human responses to ophthalmology-based questions contained a similar rate of incorrect or inappropriate material (27%), as also reported by Bernstein et al. [12,13]. However, AI responses in the current study were more accurate (94%) than those provided by ChatGPT in Bernstein et al.’s study (77%) [13].
In this study, responses from private practice or independent organization websites (65%, n=13) registered a mean JAMA accountability score of 1.5±0.68, indicating infrequent adherence to JAMA criteria. Official patient education materials from national organizations (35%, n=7) had a slightly higher mean score of 1.57±0.75, suggesting marginally better compliance. The overall mean JAMA Benchmark score was 1.5±0.68, indicating low accountability across the websites. This trend is consistent with numerous previous studies. A comprehensive analysis of five studies regarding the readability and accountability of online ophthalmology patient education materials reported a mean JAMA accountability score of 1.13 with a standard deviation of 1.15, reflecting substantial deficiencies in both quality and accountability [14,15,16,17,18]. These findings highlight the need for improved standards in creating and disseminating online patient education materials.
When considering keratoconus-related FAQs, metrics such as mDISCERN, GQS, and ARLC scores provide insights into the performance of LLMs in the medical sphere. The question “What is keratoconus?,” addressed by 70% of websites, registered the highest mDISCERN and GQS scores. In contrast, less frequently addressed questions such as “Are There Multiple Forms of Keratoconus?” (15%) received lower mDISCERN scores. Similarly, “Does Keratoconus Cause Eye Pain?” (20%) and “Can Keratoconus Go Away On Its Own?” (15%) registered lower GQS scores. The ARLC score, indicating readability, exhibited less variability, with most questions scoring between 12 and 16. For instance, “What is Keratoconus?” achieved an ARLC score of 13.17 ± 2.13, while “Can LASIK or RK Surgery Cause Keratoconus?” (25%) scored 16.33 ± 2.65. These findings highlight the importance of question frequency in determining response quality and the potential of LLMs to provide high-quality, reliable medical information, especially for frequently asked questions. However, readability did not correlate strongly with the number of websites addressing a question.
mDISCERN indices evaluate the performance of LLMs in providing medical information, assessing the informativeness, accuracy, and safety of the content. Wilhelm et al. identified significant quality differences among LLMs, with notable variability in mDISCERN scores; the Claude-instant-v1.0 model received the highest score and Bloomz the lowest [19]. The present study indicates that although all LLMs performed reasonably well, their ability to provide accurate and reliable medical information differs significantly. Models such as Gemini and Copilot scored higher, suggesting better performance. The significant variability in mDISCERN scores underscores the need for continuous improvement and validation. Standardized evaluation metrics and rigorous testing protocols are essential for assessing AI model performance and identifying potential areas for improvement.
The mDISCERN score distribution in this study reveals that Gemini and Copilot performed better in the “Good” range (51-62 points) compared to other LLMs, probably due to superior training data, fine-tuning, or algorithms. Conversely, Perplexity and Chatsonic registered the highest percentage of responses in the “Poor” range (27-38 points), indicating potential weaknesses due to less comprehensive training and suboptimal fine-tuning. These findings suggest that while LLMs can generate reasonably reliable medical information, there is a significant gap in achieving high reliability across all models. No model reached the “Excellent” range in mDISCERN scores, indicating that current LLMs are not yet capable of providing highly reliable information for all questions. Onder et al. evaluated ChatGPT-4 responses concerning hypothyroidism during pregnancy using DISCERN tools, reporting that most responses were either Fair (78.9%) or Good (21.1%) [20]. This highlights the model’s capability to generate dependable information in most instances. The performance differences among LLMs emphasize the need for ongoing research and development in order to enhance the reliability and quality of information generated by these models.
Evaluating LLMs using the GQS provided valuable insights into the quality of the medical information they generate. Although no significant differences were observed among LLMs, models such as Gemini and Copilot consistently scored higher, indicating better overall quality and more robust mechanisms for generating accurate content. Ostrowska et al. used the GQS to evaluate the reliability and safety of LLM responses in the context of laryngeal cancer, describing ChatGPT-3.5 as the most successful model [21]. This emphasizes the need for model-specific evaluations in order to identify the best-performing models for particular medical spheres.
GQS score analysis revealed varying levels of quality in the medical information produced by LLMs. The majority of models scored in the 3-3.5 range on a five-point scale, indicating moderate quality. Gemini emerged as the top performer, with 40% of its outputs in the “Good” quality range (4-5 points). Copilot and Chatsonic also performed well, with 30% of their responses in the “Good” range. By comparison, the ChatGPT models (3.5 and 4.0) achieved lower rates of “Good” quality responses (15% and 10%, respectively). In contrast to our findings, Onder et al. reported that 84.2% of ChatGPT-4’s responses regarding hypothyroidism during pregnancy were of high quality, followed by 10.5% of medium quality [20]. This discrepancy suggests that the specific medical sphere or the nature of the questions in the present study may have been particularly challenging for these models, a subject warranting further investigation.
Although our expert evaluators preferred the chatbot responses, their readability frequently exceeded the American Medical Association’s (AMA) recommendation of a sixth-grade reading level for patient education materials. Using eight popular readability formulae, the final ARLC scores indicated the following reading levels: ChatGPT-3.5, ChatGPT-4, and Perplexity were rated as extremely difficult, Gemini and Copilot as difficult, and Chatsonic as very difficult. The corresponding grade levels were ChatGPT-3.5, ChatGPT-4, and Perplexity at the College Graduate level, Gemini and Copilot at the Twelfth Grade level, and Chatsonic at the College Entry level (Figure 1). These findings align with previous research showing that chatbot-generated patient education information is frequently written at reading levels significantly exceeding the comprehension of the average patient [12,22].
Research indicates that tailoring patient education materials to patients’ health literacy levels can significantly enhance compliance and optimize health outcomes [23]. A scoping review of visual aids in health literacy reported that materials intended for individuals with low literacy levels significantly improved health literacy outcomes, including medication adherence and comprehension [24].
While some chatbots, such as ChatGPT-4 and Chatsonic, produce detailed and complex responses, others, including Perplexity, generate shorter and simpler answers. These differences in response length and complexity highlight the varying capabilities of LLM-Chatbots in addressing keratoconus-related FAQs. This information is crucially important for selecting an appropriate chatbot for specific informational needs, particularly in medical and educational contexts in which the depth and clarity of information are paramount.
The adaptability of chatbots to user requests is significant for their potential application in ophthalmology. Despite challenging reading levels, providing patient education materials remains highly beneficial. This study demonstrates the usefulness of chatbots in providing keratoconus-related information for patients. Ophthalmologists report a loss of efficiency due to excessive time spent on non-clinical tasks, and chatbots can help alleviate this burden. A semi-supervised model, in which the ophthalmologist reviews AI-generated responses, represents the future of AI and can be a highly beneficial tool for ophthalmologists.
While this study provides insights into the differences in responses from six LLMs to common keratoconus-related questions, a number of limitations must also be considered. In particular, the questions were sourced from Google, and the manner in which patients interpret these responses was not investigated. When ophthalmologists provide information regarding keratoconus and advice on using AI tools such as LLMs, it is essential that the patient’s health literacy level be taken into account.