Evaluation of the accuracy and readability of ChatGPT-4 and Google Gemini in providing information on retinal detachment: a multicenter expert comparative study (2024)



Int J Retina Vitreous. 2024; 10: 61.

Published online 2024 Sep 2. doi:10.1186/s40942-024-00579-9

PMCID: PMC11367851

PMID: 39223678

Piotr Strzalkowski,1 Alicja Strzalkowska,1 Jay Chhablani,2 Kristina Pfau,3 Marie-Hélène Errera,2 Mathias Roth,1 Friederike Schaub,4 Nikolaos E. Bechrakis,5 Hans Hoerauf,6 Constantin Reiter,7 Alexander K. Schuster,8 Gerd Geerling,1 and Rainer Guthoff1


Abstract

Background

Large language models (LLMs) such as ChatGPT-4 and Google Gemini show potential for patient health education, but concerns about their accuracy require careful evaluation. This study evaluates the readability and accuracy of ChatGPT-4 and Google Gemini in answering questions about retinal detachment.

Methods

Comparative study analyzing responses from ChatGPT-4 and Google Gemini to 13 retinal detachment questions, categorized by difficulty levels (D1, D2, D3). Masked responses were reviewed by ten vitreoretinal specialists and rated on correctness, errors, thematic accuracy, coherence, and overall quality grading. Analysis included Flesch Readability Ease Score, word and sentence counts.

Results

Both artificial intelligence tools required college-level understanding for all difficulty levels. Google Gemini was easier to understand (p = 0.03), while ChatGPT-4 provided more correct answers for the more difficult questions (p = 0.0005) with fewer serious errors. ChatGPT-4 scored highest on the most challenging questions, showing superior thematic accuracy (p = 0.003). ChatGPT-4 outperformed Google Gemini in 8 of 13 questions, with higher overall quality grades at the easiest (p = 0.03) and hardest levels (p = 0.0002); both models received lower grades as question difficulty increased.

Conclusions

ChatGPT-4 and Google Gemini effectively address queries about retinal detachment, offering mostly accurate answers with few critical errors, though patients require higher education for comprehension. The implementation of AI tools may contribute to improving medical care by providing accurate and relevant healthcare information quickly.

Keywords: Retinal detachment, ChatGPT-4, Google Gemini, Artificial intelligence, Language learning models, Patient education

Background

Our clinical practice has already been transformed by the internet over the last few decades [1]. In particular, recently introduced data-driven tools such as artificial intelligence (AI) have the potential to revolutionize healthcare even further [2–4]. This change is already underway, with more people turning to online platforms and self-diagnosis tools, such as symptom checkers [5], for healthcare information [6, 7], particularly as accessing traditional face-to-face medical care becomes more challenging. However, these online tools often lack essential details to accurately assess symptom urgency [7]. Yet, there may be a shift on the horizon. Recent initiatives by the World Health Organization (WHO) seek to set standards for AI-assisted healthcare technologies, encouraging additional exploration of their potential benefits [8].

Large language models (LLMs) such as ChatGPT, launched for public use in November 2022, and Google Gemini, released in December 2023 and renamed in February 2024, also offer advantages in patient health education [9]. However, there are concerns that, while they can write persuasive texts, these can be inaccurate, distorting scientific facts and spreading misinformation [9].

Providing accurate and timely healthcare information is critical in a serious eye condition that requires immediate treatment, such as acute retinal detachment (RD) or endophthalmitis. Prompt treatment is essential to reduce the risk of permanent visual deterioration, as the duration of macula-involving RD is one of the few modifiable factors for a better postoperative visual outcome [10]. Patients with acute RD often seek medical care sooner, are more conscious of the symptoms of RD [11], and also tend to be better educated [12]. This suggests that raising awareness and educating patients about the classic signs of RD could not only lead to more patients seeing an ophthalmologist while their macula is still attached but also to better postoperative outcomes.

The aim of this study is to evaluate the readability and accuracy of ChatGPT-4 and Google Gemini in responding to queries about RD.

Methods

In our comparative study, we included 13 questions frequently asked by patients on topics such as symptoms, causes of retinal detachment, surgical techniques and follow-up treatment. These questions were categorized into three difficulty levels (D1-D3) by two vitreoretinal specialists (P.S. and R.G.) (Table 1).

Table 1

All 13 questions sorted by difficulty level

Difficulty level 1
  Q1: I see a shadow. What should I do?
  Q2: I see flashes of light. What should I do?
  Q3: I see floaters in one eye. What should I do?

Difficulty level 2
  Q4: What are the risk factors for retinal detachment?
  Q5: What forms of retinal detachment are there?
  Q6: How does a retinal detachment develop?
  Q7: How quickly does a retinal detachment need to be treated?
  Q8: What are the chances of success of vitrectomy for retinal detachment?

Difficulty level 3
  Q9: What are the treatment options for retinal detachment?
  Q10: How exactly is a vitrectomy performed to treat a retinal detachment?
  Q11: Which tamponades are used in vitrectomy for retinal detachment?
  Q12: How do gas tamponades differ from silicone oil tamponades in retinal surgery?
  Q13: What needs to be considered during postoperative care after vitrectomy?


To obtain the most precise and specialized answers possible, ChatGPT-4 (Generative Pre-trained Transformer), developed by OpenAI (San Francisco, CA, USA), and Google Gemini (Google DeepMind, London, United Kingdom) were instructed via a prompt to assume the role of an ophthalmologist when answering:

Take the role of an ophthalmologist who answers patients’ questions. Write in continuous text and exclude images and illustrations for explanation. Your task is to give a concise, specific answer that is accurate by current standards. The length of the answer should not exceed 150 words.

Each question was asked independently in a new chat window after the prompt was repeated, and the previous dialogue was deleted to avoid any possible interference of the previous questions and answers with the following ones. The evaluation criteria included the correctness, errors, thematic accuracy and coherence of the answers.
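
For reproducibility, the querying step described above could also be scripted. The following is a minimal sketch, assuming access to the OpenAI Python client and an API key; the study itself used the web chat interface, and the model identifier below is an assumption rather than the study's setup.

```python
# Sketch only: the study queried the web chat interface; this shows how the same
# standardized prompt could be issued programmatically, one question per fresh conversation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ROLE_PROMPT = (
    "Take the role of an ophthalmologist who answers patients' questions. "
    "Write in continuous text and exclude images and illustrations for explanation. "
    "Your task is to give a concise, specific answer that is accurate by current standards. "
    "The length of the answer should not exceed 150 words."
)

def ask_fresh(question: str, model: str = "gpt-4") -> str:
    """Send one question in a fresh conversation so earlier answers cannot influence it."""
    response = client.chat.completions.create(
        model=model,  # model name is an assumption, not an identifier reported in the study
        messages=[
            {"role": "system", "content": ROLE_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Example: answer = ask_fresh("I see flashes of light. What should I do?")
```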

Evaluation of the answers

The answer options for each question in the online survey were organized as follows:

Correctness (single answer)

  • Correct: The entire answer is correct.

  • Partially incorrect: The core statement of the answer is correct, but the rest of the answer contains one or more errors.

  • Incorrect: The core statement of the answer is incorrect.

Error rating if applicable (multiple answers)

  • Serious errors in content: The core statement of the answer AND / OR the rest of the answer contains one or more serious errors in content that could have serious consequences / pose a risk to patients.

  • Content errors: The core statement of the answer contains one or more content errors that do not pose a risk to patients OR the core statement of the answer is correct, but the rest of the answer contains one or more content errors that do not or only slightly change the core statement of the answer and do not pose a risk to patients.

  • Formal errors: The answer contains one or more grammatical or linguistic errors, for example, but these do not affect the core message of the answer or have any other significant consequences.

Thematic accuracy (single answer)

  • Applicable: The answer identifies the central concept and is thematically specific.

  • Partially applicable: The answer identifies the central concept, but also partially addresses an unrelated topic.

  • Not applicable: The answer does not identify the central concept and / or targets an unrelated topic.

Coherence (single answer)

  • Coherent: The core message of the answer is fully supported by the rest of the answer.

  • Partially coherent: The core statement of the answer is essentially confirmed by the rest of the answer, but there are deviating statements / contradictions in the rest of the answer.

  • Incoherent: The core statement of the answer contradicts the rest of the answer.

For the parameters correctness, thematic accuracy and coherence, only a single answer was possible; for the error assessment, multiple answers or assessments of individual parts of the answer were possible due to the different error categories (content vs. formal errors).

The 13 questions and the corresponding masked answers from ChatGPT-4 and Google Gemini were sent online to ten experienced vitreoretinal specialists via the REDCap platform [13, 14].

In addition to the assessment of correctness, errors, thematic accuracy and coherence, each answer was given an overall quality grade. The overall quality grades were categorized based on the American GPA scoring system as follows: excellent = 4 points, good = 3 points, satisfactory = 2 points, sufficient = 1 point, bad = 0 points [15].
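
As an illustration of how such grades can be aggregated into the mean ± SD values reported below, here is a short sketch; the function name and the example ratings are hypothetical, not taken from the study data.

```python
# Illustrative sketch: map verbal quality grades to GPA points and summarize them.
from statistics import mean, stdev

GRADE_POINTS = {"excellent": 4, "good": 3, "satisfactory": 2, "sufficient": 1, "bad": 0}

def summarize(grades: list[str]) -> tuple[float, float]:
    """Return mean and standard deviation of the GPA-style quality grades."""
    points = [GRADE_POINTS[g] for g in grades]
    return mean(points), stdev(points)

# Example with made-up ratings from ten reviewers for one answer:
# summarize(["excellent", "good", "good", "excellent", "satisfactory",
#            "good", "excellent", "good", "good", "excellent"])
```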

Evaluation of readability

The readability of all generated answers was analyzed with the online tool Readable (Readable.com, Horsham, United Kingdom) with regard to the number of words, number of sentences, number of words per sentence, number of long words (> 6 letters), Flesch Reading Ease Score (FRES) [16] and reading level.

The formula for calculating the FRES is:

206.835 − 1.015 × (Total Words / Total Sentences) − 84.6 × (Total Syllables / Total Words).
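
The readability metrics can be approximated programmatically from this formula. The sketch below computes word, sentence and long-word counts and the FRES; the syllable heuristic is a simplification, so values will differ slightly from those produced by Readable.com.

```python
# Rough sketch of the readability metrics used in the study (approximation only).
import re

def readability_stats(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    long_words = [w for w in words if len(w) > 6]

    def syllables(word: str) -> int:
        # crude heuristic: count vowel groups, subtract a trailing silent 'e'
        groups = re.findall(r"[aeiouy]+", word.lower())
        count = len(groups)
        if word.lower().endswith("e") and count > 1:
            count -= 1
        return max(count, 1)

    total_syllables = sum(syllables(w) for w in words)
    fres = (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (total_syllables / len(words)))
    return {
        "words": len(words),
        "sentences": len(sentences),
        "words_per_sentence": len(words) / len(sentences),
        "long_words": len(long_words),
        "fres": round(fres, 1),
    }
```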

The interpretation of the FRES for evaluating the readability of a text is shown in Table 2.

Table 2

The table shows the FRE score with corresponding school level and description of the reading difficulty level [17]

FRE score | School level | Interpretation
100.0–90.0 | 5th grade | Very easy to read. Easily understood by an average 11-year-old student.
90.0–80.0 | 6th grade | Easy to read. Conversational English for consumers.
80.0–70.0 | 7th grade | Fairly easy to read.
70.0–60.0 | 8th–9th grade | Plain English. Easily understood by 13- to 15-year-old students.
60.0–50.0 | 10th–12th grade | Fairly difficult to read.
50.0–30.0 | College | Difficult to read.
30.0–10.0 | College graduate | Very difficult to read. Best understood by university graduates.
10.0–0.0 | Professional | Extremely difficult to read. Best understood by university graduates.


Statistical analysis

Statistical analysis was performed using GraphPad Prism 10, version 10.2.2 (341) (GraphPad Software, San Diego, CA, USA) for Mac. Categorical variables were presented as absolute and relative frequencies, whereas mean and standard deviation were computed for approximately normally distributed continuous variables, and median and interquartile range otherwise. Data normality was evaluated using the Shapiro-Wilk test. Welch's t-test was used to evaluate the difference in means between the two large language models. Fisher's exact test was used to evaluate the association between categorical variables. Non-normally distributed continuous variables were compared using the Mann-Whitney U test. For multiple comparisons, the Brown-Forsythe and Welch ANOVA test or the non-parametric Kruskal-Wallis test with post hoc Dunn's test and correction for multiple testing were used. All statistical tests were two-sided and a p-value < 0.05 was considered statistically significant.
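
For readers who prefer open-source tooling, the same families of tests can be run in Python with SciPy. The sketch below uses placeholder arrays and an illustrative 2x2 table; it is not the original GraphPad analysis and will not reproduce the reported p-values without the original data.

```python
# Hedged sketch of the comparisons described above using SciPy (placeholder data).
from scipy import stats

chatgpt_fres = [39.1, 31.5, 37.7]   # placeholder per-level FRES values
gemini_fres = [55.5, 39.9, 43.4]    # placeholder per-level FRES values

# Welch's t-test for approximately normally distributed continuous variables
t_stat, p_welch = stats.ttest_ind(chatgpt_fres, gemini_fres, equal_var=False)

# Mann-Whitney U test for non-normally distributed continuous variables
u_stat, p_mwu = stats.mannwhitneyu(chatgpt_fres, gemini_fres, alternative="two-sided")

# Fisher's exact test for a 2x2 table of categorical ratings
# (illustrative: correct vs. partially correct answers for the two models)
odds_ratio, p_fisher = stats.fisher_exact([[36, 13], [18, 30]])

# Shapiro-Wilk test for normality of a sample
w_stat, p_shapiro = stats.shapiro(chatgpt_fres)
```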

Ethical considerations

In concordance with German legislation, an approval of a medical ethical committee was not needed for a study that did not include patient data. The study was performed in accordance with the ethical standards set forth in the 1964 Declaration of Helsinki.

Results

Readability

Flesch Readability Ease Score (FRES)

The overall FRES was 36 ± 9.7 for ChatGPT-4 and 45 ± 11 for Google Gemini (p = 0.03). For difficulty level D1 (low), the FRES was significantly lower for ChatGPT-4 (39.1 ± 4.9) than for Google Gemini (55.5 ± 3.1; p = 0.01). For D2 (medium) and D3 (high), the FRES was 31.5 ± 5.4 vs. 39.9 ± 11.4 (p = 0.2) and 37.7 ± 14.4 vs. 43.4 ± 10.5 (p = 0.5) for ChatGPT-4 and Google Gemini, respectively. While no statistically significant difference in FRES was found between D1, D2 and D3 for ChatGPT-4, a significant difference was found between D1 and D2 for Google Gemini (p = 0.04) (Fig. 1).


Fig. 1

Flesch Readability Ease Score (FRES) for ChatGPT-4 and Google Gemini overall and for all difficulty levels (D1, D2, D3). The bars represent the mean FRES values, and the whiskers indicate the standard deviation (SD)

Number of words

The mean number of words was 159 ± 20.6 and 155 ± 42.3 for ChatGPT-4 and Google Gemini, respectively (p = 0.76). Answers generated by ChatGPT-4 for D1, D2 and D3 counted 179 ± 39.1, 150 ± 6.7 and 156 ± 7.7 words (p = 0.17). For Google Gemini, the mean number of words in D1, D2 and D3 was 155 ± 20.8, 115 ± 14.3 (p = 0.05) and 195 ± 31.3 (p = 0.003). The mean difference in the number of words was significant between ChatGPT-4 and Google Gemini for D2 (+34.6 words, p = 0.003) and D3 (−38.4 words, p = 0.04).

Number of sentences

The mean number of sentences was 9.1 ± 1.9 and 8.7 ± 3.2 for ChatGPT-4 and Google Gemini, respectively (p = 0.72). Answers generated by ChatGPT-4 for D1, D2 and D3 showed no significant difference (8.7 ± 3.2, 8.8 ± 1.9 and 9.6 ± 1.5 sentences; p = 0.67). In contrast, for Google Gemini the mean number of sentences in D1, D2 and D3 was 7.3 ± 1.5, 6.0 ± 1.0 (p = 0.01) and 12.2 ± 1.9 (p = 0.0007). The mean difference in the number of sentences was significant between ChatGPT-4 and Google Gemini for D2 (+2.8 sentences, p = 0.03) and D3 (−2.6 sentences, p = 0.047).

Number of words per sentence

The mean number of words per sentence was 18.3 ± 4.2 for ChatGPT-4 and 18.6 ± 3.1 for Google Gemini (p = 0.76). Answers generated by ChatGPT-4 for D1, D2 and D3 counted 21.8 ± 5.7, 17.7 ± 4.4 and 16.7 ± 2.8 words per sentence (p = 0.21). For Google Gemini, the mean number of words per sentence in D1, D2 and D3 was 21.3 ± 1.9, 19.5 ± 3.3 (p = 0.35) and 16.0 ± 1.2 (p = 0.02).

Number of long words

The mean number of long words (more than 6 letters) was 34.3 ± 4.5 and 29.7 ± 7.0 for ChatGPT-4 and Google Gemini, respectively (p = 0.76). Answers generated by ChatGPT-4 for D1, D2 and D3 counted 31.1 ± 3.0, 35.7 ± 5.0 and 34.8 ± 4.4 long words (p = 0.17). For Google Gemini, the mean number of long words in D1, D2 and D3 was 24.4 ± 2.4, 30.0 ± 8.1 (p = 0.05) and 32.6 ± 6.9 (p = 0.003). The mean difference in the number of long words was significant between ChatGPT-4 and Google Gemini for D1 (+6.7 words, p = 0.04).

Correctness

For difficulty levels 1 and 2, there was no significant difference between ChatGPT-4 and Google Gemini in terms of correctness (p = 0.5). The total number of correct versus partially correct answers in difficulty level 3 was 36 vs. 13 for ChatGPT-4 and 18 vs. 30 for Google Gemini (p = 0.0005) (Table 3).

Table 3

Correctness - number of correct, partially correct and incorrect answers for all 13 questions and difficulty levels

Question | Difficulty level | ChatGPT-4 (correct / partially correct / incorrect) | Google Gemini (correct / partially correct / incorrect) | p-value
Q1 | D1 | 9 / 1 / 0 | 7 / 2 / 1 | 0.5
Q2 | D1 | 10 / 0 / 0 | 9 / 1 / 0 | 0.9
Q3 | D1 | 10 / 0 / 0 | 10 / 0 / 0 | 1
D1 total | | 29 / 1 / 0 | 26 / 3 / 1 | 0.4
Q4 | D2 | 8 / 2 / 0 | 10 / 0 / 0 | 0.5
Q5 | D2 | 7 / 3 / 0 | 7 / 3 / 0 | 1
Q6 | D2 | 6 / 3 / 1 | 7 / 2 / 1 | 0.9
Q7 | D2 | 5 / 4 / 1 | 4 / 6 / 0 | 0.5
Q8 | D2 | 9 / 1 / 0 | 4 / 5 / 1 | 0.06
D2 total | | 35 / 13 / 2 | 32 / 16 / 2 | 0.7
Q9 | D3 | 8 / 2 / 0 | 1 / 9 / 0 | 0.005
Q10 | D3 | 10 / 0 / 0 | 5 / 4 / 1 | 0.04
Q11 | D3 | 5 / 4 / 1 | 2 / 7 / 1 | 0.4
Q12 | D3 | 4 / 6 / 0 | 4 / 6 / 0 | 1
Q13 | D3 | 9 / 1 / 0 | 6 / 4 / 0 | 0.3
D3 total | | 36 / 13 / 1 | 18 / 30 / 2 | 0.0005


Errors (multiple answers possible)

The number of serious errors was higher for all difficulty levels in Google Gemini compared to ChatGPT-4 (D1: 1 vs. 0; D2: 4 vs. 2; D3: 4 vs. 1). Google Gemini also showed more content errors (D1: 3 vs. 1; D2: 14 vs. 12; D3: 21 vs. 9, p = 0.03) (Table 4).

Table 4

Errors - number of serious errors, content and formal errors for all 13 questions and difficulty levels

Question | Difficulty level | ChatGPT-4 (serious / content / formal errors) | Google Gemini (serious / content / formal errors) | p-value
Q1 | D1 | 0 / 1 / 0 | 1 / 2 / 0 | 0.9
Q2 | D1 | 0 / 0 / 0 | 0 / 1 / 0 | -
Q3 | D1 | 0 / 0 / 0 | 0 / 0 / 0 | -
D1 total | | 0 / 1 / 0 | 1 / 3 / 0 | 0.9
Q4 | D2 | 0 / 2 / 1 | 0 / 0 / 0 | -
Q5 | D2 | 0 / 3 / 1 | 0 / 3 / 0 | 0.9
Q6 | D2 | 1 / 2 / 2 | 1 / 2 / 0 | 0.8
Q7 | D2 | 1 / 4 / 0 | 1 / 5 / 0 | 0.9
Q8 | D2 | 0 / 1 / 0 | 2 / 4 / 0 | 0.9
D2 total | | 2 / 12 / 4 | 4 / 14 / 0 | 0.7
Q9 | D3 | 0 / 2 / 0 | 1 / 8 / 0 | 0.9
Q10 | D3 | 0 / 0 / 0 | 1 / 4 / 0 | -
Q11 | D3 | 0 / 3 / 0 | 0 / 4 / 0 | 0.9
Q12 | D3 | 1 / 3 / 0 | 1 / 2 / 0 | 0.9
Q13 | D3 | 0 / 1 / 0 | 1 / 3 / 0 | 0.9
D3 total | | 1 / 9 / 0 | 4 / 21 / 0 | 0.03


Thematic accuracy and coherence

Thematic accuracy (Table 5) and coherence (Table 6) showed higher scores for ChatGPT-4 compared to Google Gemini for difficulty level 3 (p = 0.003), whereas there was no statistically significant difference between the two LLMs for difficulty levels 1 and 2.

Table 5

Thematic accuracy - number of applicable, partially applicable and not applicable answers for all 13 questions and difficulty levels

Question | Difficulty level | ChatGPT-4 (applicable / partially applicable / not applicable) | Google Gemini (applicable / partially applicable / not applicable) | p-value
Q1 | D1 | 8 / 2 / 0 | 8 / 2 / 0 | 1
Q2 | D1 | 10 / 0 / 0 | 7 / 3 / 0 | 0.2
Q3 | D1 | 9 / 1 / 0 | 9 / 1 / 0 | 1
D1 total | | 27 / 3 / 0 | 24 / 6 / 0 | 0.5
Q4 | D2 | 9 / 1 / 0 | 9 / 1 / 0 | 1
Q5 | D2 | 9 / 1 / 0 | 9 / 1 / 0 | 1
Q6 | D2 | 4 / 6 / 0 | 9 / 1 / 0 | 0.06
Q7 | D2 | 8 / 2 / 0 | 5 / 5 / 0 | 0.4
Q8 | D2 | 8 / 2 / 0 | 8 / 2 / 0 | 1
D2 total | | 38 / 12 / 0 | 40 / 10 / 0 | 0.8
Q9 | D3 | 10 / 0 / 0 | 6 / 4 / 0 | 0.09
Q10 | D3 | 9 / 1 / 0 | 7 / 3 / 0 | 0.6
Q11 | D3 | 10 / 0 / 0 | 8 / 2 / 0 | 0.5
Q12 | D3 | 9 / 1 / 0 | 9 / 1 / 0 | 1
Q13 | D3 | 10 / 0 / 1 | 6 / 4 / 0 | 0.05
D3 total | | 48 / 2 / 1 | 36 / 14 / 0 | 0.003


Table 6

Coherence - number of coherent, partially coherent and incoherent answers for all 13 questions and difficulty levels

Question | Difficulty level | ChatGPT-4 (coherent / partially coherent / incoherent) | Google Gemini (coherent / partially coherent / incoherent) | p-value
Q1 | D1 | 9 / 1 / 0 | 7 / 3 / 0 | 0.58
Q2 | D1 | 10 / 0 / 0 | 9 / 1 / 0 | 0.9
Q3 | D1 | 8 / 2 / 0 | 9 / 1 / 0 | 0.9
D1 total | | 27 / 3 / 0 | 25 / 5 / 0 | 0.7
Q4 | D2 | 7 / 3 / 0 | 10 / 0 / 0 | 0.2
Q5 | D2 | 10 / 0 / 0 | 9 / 1 / 0 | 0.9
Q6 | D2 | 4 / 6 / 0 | 10 / 0 / 0 | 0.01
Q7 | D2 | 7 / 2 / 1 | 6 / 4 / 0 | 0.7
Q8 | D2 | 10 / 0 / 0 | 4 / 6 / 0 | 0.01
D2 total | | 38 / 11 / 1 | 39 / 11 / 0 | 0.9
Q9 | D3 | 10 / 0 / 0 | 7 / 3 / 0 | 0.2
Q10 | D3 | 9 / 1 / 0 | 7 / 3 / 0 | 0.6
Q11 | D3 | 9 / 1 / 0 | 9 / 1 / 0 | 1
Q12 | D3 | 8 / 2 / 0 | 8 / 2 / 0 | 1
Q13 | D3 | 9 / 1 / 0 | 7 / 3 / 0 | 0.6
D3 total | | 45 / 5 / 0 | 38 / 12 / 0 | 0.1


Overall quality grading for each question

ChatGPT-4 outperformed Google Gemini in 8 out of 13 (62%) questions. ChatGPT-4 achieved significantly better grades in Q1 (3.5 ± 0.7 vs. 2.1 ± 0.9; p = 0.001), Q2 (3.7 ± 0.5 vs. 2.6 ± 0.7; p = 0.01), Q8 (3.3 ± 0.7 vs. 1.7 ± 1.1; p = 0.001), Q9 (3.3 ± 0.7 vs. 1.7 ± 1.3; p = 0.002), Q10 (3.4 ± 0.5 vs. 2.1 ± 1.1; p = 0.005) and Q13 (3.2 ± 1.3 vs. 2.0 ± 1.2; p = 0.005). In comparison, Google Gemini achieved significantly higher scores only in Q6 (3.1 ± 0.9 vs. 1.8 ± 0.9; p = 0.004).

Overall quality grading vs. difficulty level

The overall quality grading was significantly higher for ChatGPT-4 compared to Google Gemini (3.0 ± 1.0 vs. 2.4 ± 1.1, respectively; p < 0.01). For difficulty level D1, ChatGPT-4 scored significantly better (3.5 ± 0.6) than Google Gemini (2.8 ± 0.9; p < 0.01). There was no significant difference between ChatGPT-4 (2.7 ± 1.1) and Google Gemini (2.6 ± 1.1) for D2. For D3, ChatGPT-4 received better grades (2.9 ± 1.1) than Google Gemini (2.1 ± 1.1; p < 0.01). In addition, both ChatGPT-4 (D1: 3.5 ± 0.6; D2: 2.7 ± 1.1; D3: 2.9 ± 1.1; p < 0.01) and Google Gemini (D1: 2.8 ± 0.9; D2: 2.6 ± 1.1; D3: 2.1 ± 1.1; p < 0.01) showed significantly lower grades as the difficulty level increased (Fig. 2).


Fig. 2

Quality grading in relation to the difficulty level (D1, D2, D3) for ChatGPT-4 and Google Gemini. The bars represent the mean quality grading values, and the whiskers indicate the standard deviation (SD)

Discussion

Retinal detachment (RD) is a sight-threatening eye condition that requires immediate surgical intervention to prevent permanent visual impairment. Providing timely and accurate health information is critical to patient understanding and treatment outcomes [10, 12]. In our study, ChatGPT-4 and Google Gemini showed promise in answering typical patient questions about RD. They delivered mostly correct and accurate responses with few serious errors. However, a college-level education is needed to comprehend the answers across various difficulty levels.

Large language models (LLMs) such as ChatGPT-4 and Google Gemini can provide health-related information to users [18]. ChatGPT-4 is an autonomous machine-learning system capable of quickly generating complex and seemingly intelligent text in a conversational style in multiple languages, including English [9, 19]. Key benefits include its accessibility, cost-free usage, user-friendliness, and ongoing enhancements [9]. Consequently, it is conceivable that ChatGPT-4 could be used to help patients answer their health questions. The ability of ChatGPT-4 to respond to questions about medical examinations, including those related to ophthalmology [20, 21], has been the subject of great interest and has been investigated in several studies [22, 23].

In our study, based on the Flesch Readability Ease Score (FRES), both ChatGPT-4 and Google Gemini required at least a university degree to understand the answers, regardless of the difficulty level of the questions (D1-D3). However, Google Gemini was found to be easier to understand than ChatGPT-4, with scores of 45 ± 11 vs. 36 ± 9.7, respectively (p = 0.03). This trend persisted for D1 questions separately, with scores of 55.5 ± 3.1 for Google Gemini vs. 39.1 ± 4.9 for ChatGPT-4 (p = 0.01). For more challenging D2-D3 questions, there was no significant difference between Google Gemini and ChatGPT-4, with scores of 39.9 ± 11.4 vs. 31.5 ± 5.4 (p = 0.2) and 43.4 ± 10.5 vs. 37.7 ± 14.4 (p = 0.5), respectively.

Both LLMs were instructed to provide answers of up to 150 words in length. However, the mean number of words exceeded this limit, with an average of 159 ± 20.6 for ChatGPT-4 and 155 ± 42.3 for Google Gemini. Regarding the mean number of sentences, there was no significant difference between both models, with averages of 9.1 ± 1.9 for ChatGPT-4 and 8.7 ± 3.2 for Google Gemini (p = 0.72). LLMs can exceed the word limits suggested in the prompts for several reasons. They interpret prompts based on patterns from their training data, which may include longer responses. In particular, different text lengths in the training data can explain this behavior. Complex prompts may also require detailed explanations, leading to longer responses. Ambiguity in the instructions and the model’s goal of providing relevant and coherent responses can also lead to exceeding the limit. Interestingly, Google Gemini required more sentences for the more difficult questions, with averages of 7.3 ± 1.5 for D1 and 12.2 ± 1.9 for D3 (p = 0.0007). There was no difference between ChatGPT-4 and Google Gemini concerning the mean number of words per sentence. It was 18.3 ± 4.2 for ChatGPT-4 and 18.6 ± 3.1 for Google Gemini on average. The mean number of long words (defined as those with more than 6 letters) was 34.3 ± 4.5 for ChatGPT-4 and 29.7 ± 7.0 for Google Gemini (p = 0.76). The mean difference in the number of long words was significant between both AI tools for D1, with ChatGPT-4 exhibiting a higher count by 6.7 words on average (p = 0.04).

In terms of correctness, for the D1 and D2 questions there was no significant difference between ChatGPT-4 and Google Gemini (p = 0.5). For D3, the total number of correct versus partially correct answers was 36 vs. 13 for ChatGPT-4 and 18 vs. 30 for Google Gemini (p = 0.0005). However, it is important to note that opinions on specific retinal disease treatments may vary, even among retinal specialists, and thus may affect the analysis of correctness. The number of serious errors was altogether low, but higher for all difficulty levels in Google Gemini compared to ChatGPT-4 (D1: 1 vs. 0; D2: 4 vs. 2; D3: 4 vs. 1). In terms of thematic accuracy and coherence, ChatGPT-4 showed higher scores than Google Gemini for the high difficulty level (p = 0.003), whereas there was no statistically significant difference between the two LLMs for the low and medium difficulty levels.

Considering the overall quality grades for each question, ChatGPT-4 outperformed Google Gemini in 8 out of 13 questions. ChatGPT-4 also received better grades in difficulty levels D1 and D3: 3.5 ± 0.6 compared to 2.7 ± 0.9 (p = 0.002) and 2.9 ± 1.1 compared to 2.1 ± 1.1 (p = 0.0002), respectively. Moreover, both ChatGPT-4 (p = 0.007) and Google Gemini (p = 0.02) achieved significantly lower grades with increasing difficulty.

Public health professionals should pay attention to online health-seeking behaviors, weighing potential benefits, addressing quality concerns, and outlining criteria for evaluation of online health information [24].

More than one-third of adults in the United States routinely use the internet for self-diagnosis, for both non-urgent and urgent symptoms [6, 7]. Patients search for information via search engines such as Google or Yahoo, as well as on health websites. This can help individuals gain a deeper understanding of medical conditions alongside professional healthcare advice [25]. However, the popular symptom-related websites of the major search engines often lack most of the information needed to decide whether a particular symptom requires immediate medical attention [7].

Misdiagnosis by physicians occurs in approximately 5% of outpatients [26]. In a study of 118 physicians in the US, 55.3% of easier and 5.8% of more difficult cases were diagnosed correctly (p < 0.001) [27]. When asked about the accuracy of their initial diagnosis obtained via a symptom checker, 41% of patients said that a doctor had confirmed their diagnosis and 35% said that they had not seen a doctor for a professional assessment [6]. An evaluation of 23 well-known symptom checker apps found that an appropriate categorization recommendation was made in 80% of emergencies, a rate comparable to that of doctors and nurses in training [27]. An AI system known as Babylon AI, which is used in Africa for triage and diagnostic purposes, has been shown to recognize the disease presented in a clinical case with an accuracy comparable to that of human doctors [28].

Importantly, ChatGPT-4, like other LLMs, can generate persuasive and subtle [29] but often inaccurate text, sometimes referred to as 'hallucination' [30], leading to the distortion of scientific facts and the spread of misinformation [9]. The content generated by LLMs therefore needs to be reviewed [29]. Future discussion should focus on the how rather than the if of introducing this technology [19].

Our study has certain limitations. We only used the two best-known LLMs to assess the questions; further validation with multiple LLMs is needed. We only included the most common questions asked by patients, which may not fully reflect the complexity of patient education. In addition, treatment recommendations may vary between different ophthalmologists. Human-generated responses may likewise contain controversial opinions and should be investigated in subsequent studies. The study is also limited to the English language, which may not take into account the different levels of education and understanding of patients in other languages. We did not address potential accessibility issues, such as visual impairment, that may hinder access to AI-based tools. Finally, the instructions were specific to the LLMs, which may not fully reflect how patients would use such technology.

Conclusions

To summarize, ChatGPT-4 and Google Gemini showed promise in answering questions about retinal detachment, providing mostly correct answers with few critical errors, even though patients need higher education with good reading comprehension to understand them. The use of AI tools may help to improve medical care by providing accurate and relevant health information quickly. Based on the results of our study, LLMs show promise but are not yet suitable as a sole resource for patient education due to the risk of critical errors. We would suggest using these AI tools as supplementary rather than primary sources of information until further improvements are made to minimize errors and improve accessibility for a wider patient population.

Acknowledgements

Not applicable.

Abbreviations

AI: Artificial intelligence
FRES: Flesch Readability Ease Score
GPT: Generative Pre-trained Transformer
LLMs: Large language models
WHO: World Health Organization

Author contributions

P.S. wrote the main manuscript text. P.S. and R.G. conceived and designed the study. P.S., A.S. and R.G. collected the data. P.S., M.R. and A.K.S. performed the analysis. All authors provided input on the interpretation of the results. All authors revised the manuscript. All authors read and approved the final manuscript.

Funding

None.

Open Access funding enabled and organized by Projekt DEAL.

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations

Ethics approval and consent to participate

In concordance with German legislation, an approval of a medical ethical committee was not needed for a study that did not include patient data. The study was performed in accordance with the ethical standards set forth in the 1964 Declaration of Helsinki.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Hartzband P, Groopman J. Untangling the Web--patients, doctors, and the internet. N Engl J Med. 2010;362:1063–6. doi:10.1056/NEJMp0911938

2. Rich AS, Gureckis T. Lessons for artificial intelligence from the study of natural stupidity. Nat Mach Intell. 2019;1:174–80. doi:10.1038/s42256-019-0038-z

3. Powles J, Hodson H. Google DeepMind and healthcare in an age of algorithms. Health Technol. 2017;7:351–67. doi:10.1007/s12553-017-0179-1

4. Bini SA. Artificial intelligence, machine learning, deep learning, and cognitive computing: what do these terms mean and how will they impact health care? J Arthroplasty. 2018;33:2358–61. doi:10.1016/j.arth.2018.02.067

5. Millenson ML, Baldwin JL, Zipperer L, Singh H. Beyond Dr. Google: the evidence on consumer-facing digital tools for diagnosis. Diagnosis (Berl). 2018;5:95–105. doi:10.1515/dx-2018-0009

6. Fox S, Duggan M. Health online 2013. Health. 2013;2013:1–55.

7. North F, Ward WJ, Varkey P, Tulledge-Scheitel SM. Should you search the internet for information about your acute symptom? Telemed J E Health. 2012;18:213–8. doi:10.1089/tmj.2011.0127

8. Wiegand T, Krishnamurthy R, Kuglitsch M, Lee N, Pujari S, Salathé M, et al. WHO and ITU establish benchmarking process for artificial intelligence in health. Lancet. 2019;394:9–11. doi:10.1016/S0140-6736(19)30762-7

9. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: five priorities for research. Nature. 2023;614:224–6. doi:10.1038/d41586-023-00288-7

10. Yorston D, Donachie PHJ, Laidlaw DA, Steel DH, Sparrow JM, Aylward GW, et al. Factors affecting visual recovery after successful repair of macula-off retinal detachments: findings from a large prospective UK cohort study. Eye. 2021;35:1431–9. doi:10.1038/s41433-020-1021-y

11. Eijk ESV, Busschbach JJV, Timman R, Monteban HC, Vissers JMH, van Meurs JC. What made you wait so long? Delays in presentation of retinal detachment: knowledge is related to an attached macula. Acta Ophthalmol. 2016;94:434–40. doi:10.1111/aos.13016

12. Xu D, Uhr J, Patel SN, Pandit RR, Jenkins TL, Khan MA, et al. Sociodemographic factors influencing rhegmatogenous retinal detachment presentation and outcome. Ophthalmol Retina. 2021;5:337–41. doi:10.1016/j.oret.2020.08.001

13. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42:377–81. doi:10.1016/j.jbi.2008.08.010

14. Harris PA, Taylor R, Minor BL, Elliott V, Fernandez M, O'Neal L, et al. The REDCap consortium: building an international community of software platform partners. J Biomed Inform. 2019;95:103208. doi:10.1016/j.jbi.2019.103208

15. Nord C, Roey S, Perkins R, Lyons M, Lemanski N, Brown J, et al. The nation's report card: America's high school graduates. Results of the 2009 NAEP high school transcript study. NCES 2011-462. National Center for Education Statistics. 2011. https://eric.ed.gov/?id=ED518324

16. Flesch R. A new readability yardstick. J Appl Psychol. 1948;32:221–33. doi:10.1037/h0057532

17. Flesch R. How to write plain English: a book for lawyers and consumers. New York: Harper & Row; 1979.

18. Rowland SP, Fitzgerald JE, Holme T, Powell J, McGregor A. What is the clinical value of mHealth for patients? NPJ Digit Med. 2020;3:4. doi:10.1038/s41746-019-0206-x

19. Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digit Health. 2023:e107–8.

20. Mihalache A, Huang RS, Popovic MM, Muni RH. Performance of an upgraded artificial intelligence chatbot for ophthalmic knowledge assessment. JAMA Ophthalmol. 2023:798–800.

21. Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the performance of ChatGPT in ophthalmology: an analysis of its successes and shortcomings. Ophthalmol Sci. 2023;3:100324. doi:10.1016/j.xops.2023.100324

22. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2:e0000198. doi:10.1371/journal.pdig.0000198

23. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312. doi:10.2196/45312

24. Cline RJ, Haynes KM. Consumer health information seeking on the internet: the state of the art. Health Educ Res. 2001;16:671–92. doi:10.1093/her/16.6.671

25. White RW, Horvitz E. Experiences with web search on medical concerns and self diagnosis. AMIA Annu Symp Proc. 2009;2009:696–700.

26. Singh H, Meyer AND, Thomas EJ. The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving US adult populations. BMJ Qual Saf. 2014;23:727–31. doi:10.1136/bmjqs-2013-002627

27. Semigran HL, Linder JA, Gidengil C, Mehrotra A. Evaluation of symptom checkers for self diagnosis and triage: audit study. BMJ. 2015;351:h3480. doi:10.1136/bmj.h3480

28. Baker A, Perov Y, Middleton K, Baxter J, Mullarkey D, Sangar D, et al. A comparison of artificial intelligence and human doctors for the purpose of triage and diagnosis. Front Artif Intell. 2020;3:543405. doi:10.3389/frai.2020.543405

29. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388:1233–9. doi:10.1056/NEJMsr2214184

30. González Corbelle J, Bugarín-Diz A, Alonso-Moral J, Taboada J. Dealing with hallucination and omission in neural natural language generation: a use case on meteorology. In: Shaikh S, Ferreira T, Stent A, editors. Proceedings of the 15th International Conference on Natural Language Generation. Waterville, Maine, USA and virtual meeting: Association for Computational Linguistics; 2022. pp. 121–30.
