Article Open Access

Comparison of ChatGPT, Gemini, and Le Chat with physician interpretations of medical laboratory questions from an online health forum

  • Annika Meyer, Ari Soleman, Janik Riese and Thomas Streichert
Published/Copyright: May 29, 2024

Abstract

Objectives

Laboratory medical reports are often not intuitively comprehensible to non-medical professionals. Given the recent advancements, easier accessibility and remarkable performance of artificial intelligence-based chatbots on medical licensing exams, patients are likely to turn to these tools to understand their laboratory results. However, empirical studies assessing the efficacy of these chatbots in responding to real-life patient queries regarding laboratory medicine are scarce.

Methods

Thus, this investigation included 100 patient inquiries from an online health forum, specifically addressing Complete Blood Count interpretation. The aim was to evaluate the proficiency of three artificial intelligence-based chatbots (ChatGPT, Gemini and Le Chat) against the online responses from certified physicians.

Results

The findings revealed that the chatbots’ interpretations of laboratory results were inferior to those from online medical professionals. While the chatbots exhibited a higher degree of empathetic communication, they frequently produced erroneous or overly generalized responses to complex patient questions. The appropriateness of chatbot responses ranged from 51 to 64 %, with 22 to 33 % of responses overestimating patient conditions. A notable positive aspect was the chatbots’ consistent inclusion of disclaimers regarding their non-medical nature and recommendations to seek professional medical advice.

Conclusions

The chatbots’ interpretations of laboratory results from real patient queries highlight a dangerous dichotomy – a perceived trustworthiness potentially obscuring factual inaccuracies. Given the growing inclination towards self-diagnosis using AI platforms, further research and improvement of these chatbots is imperative to increase patients’ awareness and avoid future burdens on the healthcare system.

Introduction

Laboratory medical reports are crucial in guiding clinical decision-making. Nonetheless, their technical nature often poses comprehension challenges for individuals without medical training [1]. Consequently, many seek clarification online [1], increasingly trusting artificial intelligence (AI)-powered chatbots over conventional search engines for medical advice [2].

This shift has been notably influenced by the launch of the AI chatbot “ChatGPT” (Chat Generative Pre-trained Transformer) in late 2022, which has not only provided the general public with access to an advanced AI [3] but also achieved unprecedented user growth [4]. In the medical domain, research indicates that 78 % of its users are inclined to employ ChatGPT for self-diagnosis purposes [5]. The emergence of other AI chatbots such as Gemini and Le Chat has further broadened user options [6].

While these chatbots demonstrate proficiency in handling binary medical multiple-choice questions [7], [8], [9], [10], [11], [12], [13], their performances in academic tests may not adequately reflect the complex and nuanced reality of clinical practice [14]. This discrepancy highlights the importance of evaluating these chatbots in real-world scenarios, particularly as they were not specifically designed for medical inquiries [15], [16], [17].

In this context, an initial study by Cadamuro et al., which employed 10 fictive laboratory scenarios, revealed ChatGPT’s ability to identify laboratory tests, categorize values within given reference intervals and provide superficial interpretations. However, this study also highlighted the need for more extensive research involving a wider range of medical laboratory reports [1] to reduce misinterpretations in the post-analytical phase [18].

Building on this foundation, our study evaluates the capabilities of three chatbots (ChatGPT [GPT-4], Gemini [Gemini Pro], and Le Chat [Mistral Large]) using 100 patient inquiries focused on laboratory medical reports from an online health forum. This approach seeks to bridge the gap between theoretical data and real-life applications. Thus, the objective of this research is to explore the practical applicability and reliability of these chatbots in the field of laboratory medicine, offering insights into their potential utility and limitations in a genuine medical context.

Materials and methods

Chatbot selection

For this retrospective study, we selected chatbots based on Cascella et al.’s publication [6]. We excluded chatbots without a web-based user interface and those based on, or performing below the level of, the large language model GPT-3.5 and its predecessors. This refined our selection to three advanced chatbots: ChatGPT (GPT-4), Gemini (Gemini Pro), and Le Chat (Mistral Large).

Data collection

To assess the efficacy of these chatbots in interpreting laboratory reports, we sourced real-life patient queries from the ‘AskDocs’ subreddit on Reddit. This platform allows users to engage anonymously within specialized communities [19]. In ‘AskDocs’, users anonymously post medical questions which are then answered by verified physicians [20]. A comprehensive description of this forum is provided in the publication by Nobles et al. [20]. Our research methodology was designed to only observe, avoiding any direct interaction, thereby preserving the community’s integrity and ensuring compliance with its guidelines [21].

In compliance with Reddit’s Data API Terms [22] and Developer Terms [23], we utilized the search term “CBC” (Complete Blood Count) to identify relevant posts. This term was selected for its prominence in general medical practice, as outlined by the European Federation of Clinical Chemistry and Laboratory Medicine Working Group on Artificial Intelligence (WG-AI) [1].
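To illustrate the retrieval step, the sketch below shows how posts mentioning “CBC” could be pulled from the subreddit’s public JSON search interface in R. This is a hypothetical reconstruction for illustration only: the endpoint, parameters, and field names are those of Reddit’s standard public search interface, not necessarily the exact retrieval route used in this study.

```r
library(httr)
library(jsonlite)

# Hypothetical retrieval sketch: query r/AskDocs' public JSON search interface
# for posts mentioning "CBC", sorted by relevance (cf. [22, 23, 25]).
resp <- GET(
  "https://www.reddit.com/r/AskDocs/search.json",
  query = list(q = "CBC", restrict_sr = 1, sort = "relevance", limit = 100),
  user_agent("laboratory-report-study/0.1")  # identify the client, as the API terms require
)

listing <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
posts <- listing$data$children$data

# Keep the fields analysed in the study: title, body text, upvotes, comments.
head(posts[, c("title", "selftext", "ups", "num_comments")])
```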

Our initial search yielded 635 posts. After applying exclusion criteria such as posts dated before ChatGPT’s knowledge cut-off date, absence of physician responses, presence of images, and deviation from the topic, we narrowed this selection down to 135 relevant posts. To determine an adequate sample size for robust statistical analysis, we conducted a Monte Carlo simulation, assuming moderate factor loadings (lambda=0.5) and aiming for a power of 0.8 [24]. This simulation indicated a minimum required sample size of 77. To enhance the stability of our results, we included 100 posts in our analysis, exceeding the minimum requirement to increase the study’s statistical power and potential generalizability. To this end, posts were ordered using Reddit’s internal sorting mechanisms, prioritizing relevance first and then the ‘hot’, ‘top’, ‘new’, and ‘comments’ sorts. The first 100 posts were selected to ensure a balanced representation that captures both relevance and current trends in the discussions [25].
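As an illustration of the sample-size step, the following R sketch approximates power by simulation across candidate sample sizes. It is a minimal sketch only: the published calculation assumed factor loadings of lambda=0.5 in a latent-variable framework [24], whereas the simulation below uses a simple paired Wilcoxon comparison with an assumed moderate effect.

```r
# Minimal power sketch: the published analysis assumed factor loadings of
# lambda = 0.5 and a target power of 0.8 [24]; here power is approximated by
# simulating paired ratings with an assumed moderate shift.
set.seed(42)

power_at_n <- function(n, effect = 0.3, n_sim = 1000) {
  p_values <- replicate(n_sim, {
    physician <- rnorm(n)                              # simulated physician ratings
    chatbot   <- physician + rnorm(n, mean = effect)   # shifted chatbot ratings
    wilcox.test(physician, chatbot, paired = TRUE)$p.value
  })
  mean(p_values < 0.05)                                # proportion of significant runs
}

# Scan candidate sample sizes and report the smallest one reaching 80 % power.
candidates  <- seq(50, 150, by = 5)
power_curve <- sapply(candidates, power_at_n)
min(candidates[power_curve >= 0.8])
```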

For each selected post, we gathered data including the title, text, number of upvotes, and comments, along with the most upvoted physician response. Upvotes serve as an internal Reddit rating system, allowing users to express approval of specific content [19]. Because users can upvote their own posts and the upvotes of the chosen posts showed limited variability, as evidenced by a narrow interquartile range, we excluded upvote data from the primary analysis and used them solely for descriptive statistics. This approach mitigates potential biases in user engagement metrics related to self-upvoting. Moreover, some Reddit users provided reference ranges and some did not, addressing a potential confounder previously noted in the literature on ChatGPT [18]. Each post, along with its title, was then presented to all three chatbots, each in a new chat session, to obtain their responses.

Evaluation by medical experts

Two medical experts, an early-career physician and a professor, independently assessed the responses from the online physicians and the chatbots. Discrepancies were resolved through structured discussion using a consensus approach, reducing biases related to training level [1]. In instances of ambiguity, relevant literature databases such as NCBI, PubMed, and Amboss, or relevant medical textbooks [26], were consulted.

To prevent identification bias, all responses were anonymized and edited to remove any language that might reveal the non-human nature of the chatbots. Phrases such as “I am not a doctor”, “not a medical professional”, “AI”, “artificial”, “language model”, and “not a physician” were systematically searched for and removed.
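A minimal R sketch of this blinding step is shown below, using the ‘stringr’ package from the tidyverse already employed in the analysis. The phrase list mirrors the one given above; the rule of dropping entire sentences that contain a revealing phrase is an assumption of this sketch, not a description of the exact editing procedure.

```r
library(stringr)

# Phrases that could reveal a chatbot author (as listed above).
reveal_patterns <- c(
  "I am not a doctor", "not a medical professional", "\\bAI\\b",
  "artificial", "language model", "not a physician"
)

# Assumption of this sketch: drop every sentence containing such a phrase,
# then rejoin the remainder for blinded expert review.
anonymize <- function(text) {
  sentences <- str_split(text, "(?<=[.!?])\\s+")[[1]]
  reveals <- str_detect(
    sentences,
    regex(paste(reveal_patterns, collapse = "|"), ignore_case = TRUE)
  )
  str_squish(paste(sentences[!reveals], collapse = " "))
}

anonymize("I am not a doctor, but your CBC looks unremarkable. Please see a physician.")
#> [1] "Please see a physician."
```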

The responses were then ranked relative to each other on a scale from 1 (best) to 4 (worst) to allow a comparative perspective on the effectiveness of each respondent (chatbots and physicians). The ranking process involved assessing several criteria across the outputs: potentially dangerous content, medical errors, technical errors (supplying faulty links, changing the output language), failure to answer the question, a content-wise and technically correct answer to the question, and an empathetic and correct answer to the question.

Each response was also evaluated on a scale from 1 (excellent) to 6 (inadequate), focusing on the criteria of quality, clarity, medical accuracy and empathy. The overall quality assessment encompassed all response dimensions, while medical accuracy was specifically measured against medical content alone. Responses with incorrect information were rated lower than those with omissions or incomplete information.

Furthermore, appropriateness was scrutinized. In this context, overestimations were defined as evaluations that incorrectly identified healthy conditions as pathological, exaggerated the severity of a pathology, or recommended unnecessarily severe or invasive diagnostic steps or interventions.

Statistical analysis

All assessments were then statistically analyzed using the R programming language, specifically utilizing the packages ‘rio’ [27], ‘tidyverse’ [28], ‘gtsummary’ [29], ‘labelled’ [30], ‘ggpubr’ [31], and ‘fastDummies’ [32].

For the descriptive statistics, categorical variables were presented as frequency and percentage, and continuous variables as median and interquartile range. Normality of the continuous data was assessed, and refuted, using the Kolmogorov–Smirnov and Shapiro–Wilk tests (Supplementary Material, Appendix 1); these tests were chosen to verify the normality assumption that underpins parametric statistical analyses [33]. The analysis of categorical variables was conducted using the McNemar test, whereas the paired Wilcoxon signed rank test was applied for ordinal data. For models with continuous predictor variables, random intercept logistic regression was employed. A p-value of less than 0.05 was considered statistically significant. The Bonferroni correction was implemented to mitigate the heightened likelihood of a type I error arising from multiple comparisons; this adjustment was applied to the p-values using the R package ‘gtsummary’ [29]. While alternative methodologies such as the Hochberg method were available, Bonferroni’s approach was favored for its conservative nature and its ability to control the expected number of type I errors per family [34].
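The following condensed R sketch illustrates this comparison pipeline on simulated data: paired Wilcoxon signed-rank tests for the ordinal ratings, McNemar tests for binary appropriateness, and a Bonferroni adjustment across the family of comparisons. The data layout and variable names are assumptions made for illustration, not the study’s actual dataset.

```r
library(tidyverse)

# Illustrative data layout (simulated here): one row per post and respondent,
# with an accuracy rating (1 = excellent ... 6 = inadequate) and a binary
# appropriateness flag; names and values are assumptions of this sketch.
set.seed(1)
ratings <- expand_grid(
  post_id = 1:100,
  respondent = c("Physician", "ChatGPT", "Gemini", "Le Chat")
) %>%
  mutate(accuracy = sample(1:6, n(), replace = TRUE),
         appropriate = rbinom(n(), 1, 0.6))

compare_pair <- function(data, a, b) {
  wide <- data %>%
    filter(respondent %in% c(a, b)) %>%
    pivot_wider(id_cols = post_id, names_from = respondent,
                values_from = c(accuracy, appropriate))
  list(
    # Ordinal ratings: paired Wilcoxon signed-rank test.
    accuracy = wilcox.test(wide[[paste0("accuracy_", a)]],
                           wide[[paste0("accuracy_", b)]],
                           paired = TRUE)$p.value,
    # Binary appropriateness: McNemar test on the paired 2x2 table.
    appropriateness = mcnemar.test(table(wide[[paste0("appropriate_", a)]],
                                         wide[[paste0("appropriate_", b)]]))$p.value
  )
}

# Bonferroni adjustment across the family of chatbot-vs-physician comparisons.
raw_p <- sapply(c("ChatGPT", "Gemini", "Le Chat"),
                function(bot) compare_pair(ratings, bot, "Physician")$accuracy)
p.adjust(raw_p, method = "bonferroni")
```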

Results

General characteristics

The analyzed posts (i.e., the patient questions) had a median word count of 146 (IQR: 82, 231), with titles containing a median of 7 (IQR: 5, 9) words. Posts received a median of 1 (IQR: 1, 2) upvote and 4.5 (IQR: 3.0, 6.0) comments. The median age of the askers was 27 (IQR: 22, 31) years, and 52 % (49/94) identified as female. Online physicians used the fewest words (median: 20; IQR: 10, 35), while ChatGPT used the most (median: 337; IQR: 281, 379; p<0.001 for all comparisons). Le Chat and Gemini did not differ in their word count (p>0.9), with medians of 261 (IQR: 216, 305) and 257 (IQR: 208, 301), respectively (Table 1, Figure 1A, Supplementary Material, Appendices 1–3).

Table 1:

Summary statistics for included posts and answers.

Characteristics n=100 a
Age of asker, years 27 (22, 31)
 (Missing) 7
Gender of asker
 Female 49 (52 %)
 Male 45 (48 %)
 (Missing) 6
Interactions
 Upvotes 1 (1, 2)
 Comments 4.5 (3.0, 6.0)
Word count
 Title’s word count 7 (5, 9)
 Post’s word count 146 (82, 231)
  a Median (interquartile range) or frequency (%).

Figure 1:
Comparison of laboratory medicine report interpretation by chatbots and physicians in an online health forum by rank, adequacy and word count. (A) Illustrates the word count distributions for ChatGPT, Gemini, Le Chat, and physicians, using combined violin and box plots. (B) Depicts the adequacy of responses in a stacked bar chart, categorizing them as overestimations, appropriate estimations, underestimations, or no estimations. (C) Presents a density plot comparing the ranking frequencies of ChatGPT, Gemini, Le Chat, and physicians.

Assessment by medical professionals

The evaluation highlighted significant differences in rankings between chatbots and physicians (p<0.001 for all comparisons). Online physicians were frequently ranked highest, leading in 60 % (60/100) of cases, while Gemini was ranked last 39 % (39/100) of the time. Although there was no significant difference between online doctors and chatbots regarding the absence of estimations or underestimations, ChatGPT alone matched the online physicians in both quality (p=0.3) and accuracy (p=0.057).

In comparisons among the chatbots, ChatGPT and Gemini varied significantly in rank (p<0.001), quality (p<0.001) and accuracy (p=0.007). Le Chat, however, showed lower levels of empathy compared to ChatGPT (p<0.001) and Gemini (p=0.007), the latter two exhibiting similar empathy levels (p>0.9). Moreover, chatbots tended to overestimate conditions, with overestimation rates ranging from 22 to 33 %, in contrast to the 1.0 % (1/100) rate observed among online physicians (p<0.001 for all comparisons) (see Supplementary Material, Appendices 2–7, Figures 1B and C, and 2).

Figure 2:
Assessment of laboratory medicine report interpretation by chatbots and physicians in regard to quality, clarity, empathy and accuracy. Violin plots represent the distribution of performance metrics – accuracy, clarity, empathy, and overall quality – among ChatGPT, Gemini, Le Chat, and physicians in an online health forum.

Furthermore, medical professionals observed several challenges in the chatbots’ interpretation of laboratory results. Notably, the chatbots exhibited difficulties in maintaining consistency, particularly in interpreting complex contexts, distinguishing between abnormal and critical laboratory values, and providing diagnostic recommendations. Inconsistencies were particularly evident in their application of reference intervals, where they applied differing standards to patients of identical sex and similar age without citing the underlying sources for such varying reference ranges. This absence of standardized reference values also led to divergent interpretations of the same laboratory data for a single patient between the chatbots (e.g., ChatGPT: “Your ALP value is 36 which is slightly low.”; Gemini: “Alkaline Phosphatase: Your level is 36 U/L, which is slightly high.”; Le Chat: “Alkaline Phosphatase (ALP) 36 U/L: This is a liver enzyme, and your result is within the normal range, which is typically 20–140 U/L.”). Furthermore, issues with the reliability of cited sources were frequently noted in Gemini and occasionally in ChatGPT, where the sources provided were often invalid, unsuitable, or misleading. Despite these limitations, several aspects of the chatbots’ responses were positively acknowledged, including their consistent recommendations for professional medical consultation, interpersonal capabilities, adept use of verbal imagery, and incorporation of gender-sensitive language. Gemini was particularly noted for its structured approach and ongoing recommendations for preparing for medical appointments (Table 2).

Table 2:

Narrative review of benefits and drawbacks in the laboratory result interpretation by chatbots.

Evaluated aspects Description of performance Examples/notes – ChatGPT Examples/notes – Gemini Examples/notes – Le Chat
Negative

Reference intervals Inconsistency in reference interval (27 year old male): ‘red blood cells (RBC): the normal range for men is generally 4.7 to 6.1 million cells per microliter (mµL).’ (36 year old male): ‘RBC (red blood cell count): these cells carry oxygen throughout your body. A typical range for men is roughly between 4.5 and 5.5 million cells/µL’ (19 year old female): ‘these are borderline low (normal range for hemoglobin is 12.3–15.2 g/dL and hematocrit is 36.1–46.4 %).’ (17 year old female): ‘a hemoglobin of 13 might be high for you based on your previous results, but it’s generally within the normal range for most females (around 12–16 g/dL)’ (28 year old female): ‘white blood cells (WBC): your WBC count is slightly below the normal range, which is typically 4.5–11.0 thou/cu mm. This could be due to various reasons, such as a viral infection, nutritional deficiencies, or stress.’ (19 year old female): ‘WBC (white blood cell count) – 4.3 (normal range: 4.0–11.0 × 103/µL)’
Inconsistency between given and used reference intervals None found Gemini applies different reference ranges than the laboratory given pH of 5.0–8.0: ‘normal findings except for a slightly high urine pH (7.5). Ideally, urine pH should be between 6.0 and 7.0’ Le Chat applies different upper limit for CRP than the 0.9 given by the laboratory: ‘a normal CRP level is typically less than 1.0 mg/dL’
Value not correctly categorized within reference limits While one Reddit user’s RBC is at 5.38 million cells/µL, ChatGPT states: ‘RBC is slightly high but not alarmingly so. Normal range for females is typically around 4.2 to 5.4 million cells per microliter (µL), although this can slightly vary between laboratories’ ‘Hemoglobin: yours is at 17.4, again within the normal range (13.0–17.0)’ ‘Hemoglobin: your result is 17.4 g/dL, which is within the normal range for men (13.0–17.0 g/dL)’
Others ChatGPT applies differing units within one response: ‘white blood cells (WBC): a normal WBC count ranges from 4,500 to 11,000 white blood cells per microliter of blood. Your numbers have gone from 5.1 to 4.2, which is a slight decrease but still within the normal range’ While the Reddit user gives the percentage of immature granulocytes as 2 % and the absolute number as 0.11, Gemini makes the following statement: ‘2.11 % – this is a very small percentage’ None found

Contexts Citing faulty sources According to the cancer society, Hodgkin’s lymphomas account for 12 % of childhood cancers [35], but ChatGPT describes 3 % (‘in fact, according to the American cancer society, only about 3 % of childhood cancers are lymphomas’) Links often invalid (‘[invalid URL removed]’), with misleading content or in a language other than the input language (e.g., ‘www.toutiao.com/article/6482582255939093006/’) Neither referring to sources nor supplying links
Misinterpretation of modality and its diagnostic value ChatGPT misinterprets falling neutrophils and rising lymphocytes without additional blood smear as ‘your slight shift in neutrophil and lymphocyte percentages is known as a ‘shift to the left’’ Instead of ferritin, Gemini attributes the highest diagnostic value to iron in iron deficiency anaemia: ‘iron saturation: Low at 7 % (normal starts at 15 %). This is the key indicator of iron deficiency’ Le Chat overestimates the diagnostic value of CBC results when deciding on antibiotic therapy: ‘it is possible that they determined that the CBC results were not conclusive enough to warrant antibiotic treatment’
Limited understanding of complex interrelationships While metabolic alkalosis due to the loss of stomach acid would be compensated by hypoventilation and thus the increase in pCO2 [35], ChatGPT misinterprets: ‘the low potassium and CO2 in the second visit could also be related to the vomiting (losing stomach acid can lead to metabolic alkalosis, which can lower blood potassium)’ Gemini fails to establish a connection between non-fasting and glucose: ‘glucose (121, normal range 65–99) – while mildly elevated, a single reading isn’t conclusive for diabetes. However, it warrants further investigation considering the other symptoms’ Le Chat fails to establish a connection between non-fasting and glucose: ‘glucose 121 (65–99): this is higher than the normal range. It could indicate hyperglycemia, which is high blood sugar. This could be due to many reasons, including diet, stress, or an underlying condition like diabetes. However, a single high reading does not necessarily mean diabetes, and further testing would be needed to confirm this’

Positive

Promoting medical professionals Consistent recommendation for professional medical consultations ‘It’s important that she continues to work with her healthcare providers to manage her symptoms and monitor her health. You might also consider seeking a second opinion if you feel it would be beneficial’ ‘Don’t self-diagnose: avoid searching the internet for diagnoses based on your results. Let your doctor guide you’ ‘I would recommend reaching out to your healthcare provider to discuss these results. They have a full understanding of your medical history and can provide you with the most accurate advice’
Inclusion of prior medical assessments ‘They don’t necessarily indicate something more serious like lymphoma, especially considering your doctor’s lack of concern’ ‘Doctor’s opinion: it’s positive that your doctor isn’t concerned, especially since your BUN and GFR are normal’ ‘Your doctor, who has access to your complete medical history and test results, has mentioned that they are not concerned. This should be reassuring, as they have a better understanding of your overall health’
Advice on the preparation for a physician’s appointments ‘In the meantime, continue to advocate for yourself. Record your symptoms, make notes about anything that might be relevant (including diet, sleep, stress, etc.), and don’t hesitate to ask your doctors for clarification or more information’ ‘Gather past CBC results: bring copies of previous blood tests to the hematology appointment. This will give the doctor a clearer picture. Prepare a list of questions: write down any questions you have about your son’s low MPV and his medications. I Hope this information is helpful! Again, it’s best to wait for your son’s pediatrician to give specific advice based on his individual case’ None found
Linguistic capabilities Recognition and usage of figurative language ChatGPT responds to ‘Doctor Google has officially scared me’ with: ‘don’t let Dr. Google scare you’ Gemini responds to ‘feeling like death’ with: ‘feeling like death: your CBC results alone likely wouldn’t explain feeling that bad. There could be other factors at play, and it’s important to discuss this with your doctor’ Le Chat does not respond to users’ figurative language, but employs imagery itself: ‘medical interpretation requires a holistic view of your health status, and a single test result is just one piece of the puzzle’
Utilization of gender-conscious language ‘However, it’s important to note that while these could be potential explanations, it’s best to continue the conversation with your doctor. He or she knows your medical history best and will be able to guide you through the next steps, which might involve further testing or treatment to manage any ongoing issues that could be contributing to the elevated CRP’ ‘If you’re still feeling anxious, talking to your doctor about your health anxiety might be helpful. They can provide resources and strategies for managing it’ ‘I would recommend reaching out to your healthcare provider to discuss these results. They have a full understanding of your medical history and can provide you with the most accurate advice’
Interpersonal capabilities Empathy ‘I’m truly sorry to hear about your sister’s condition. It’s clear you are going through a very challenging time, and it’s only natural to be concerned and want answers’ ‘I understand your concern about your CBC results. It’s completely normal to feel anxious waiting for test results, especially when some values fall outside the listed reference ranges’ ‘Please try to stay calm while waiting for your doctor’s response. It’s natural to feel anxious, but remember that many conditions can cause these changes in your blood counts, and many of them are treatable’
Positive reaffirmation ‘First of all, congratulations on your commitment to improving your health and the weight loss, that’s a tremendous achievement!’ ‘Alcohol and hypertension: it’s commendable that you’ve cut down on alcohol consumption. Excessive alcohol intake can contribute to hypertension’ ‘It’s great that you’re being proactive about your health’

Discussion

Comprehending laboratory results presents a significant challenge for those outside the medical profession, largely due to the detailed and data-heavy nature of these results [1]. Consequently, patients may turn to AI-based chatbots for medical advice [1, 2, 36], a trend accelerated by digital advancements that enable patients to access their laboratory reports directly, bypassing initial consultations with physicians [36, 37]. However, several issues arise from the reliance on chatbots for interpreting laboratory reports.

Despite their user-friendly interfaces and their rapid, personalized, and seemingly expert responses [1, 6, 38], none of the three chatbots matched the proficiency of online physicians in interpreting laboratory reports. This finding contradicts earlier research by Ayers et al., in which evaluators preferred ChatGPT’s responses over physicians’ answers to general patient inquiries in the same forum [15]. As ChatGPT’s proficiency differs by specialty [13], the discrepancy may be due to this study’s focus on laboratory reports, indicating that their interpretation poses a particular challenge for these chatbots.

In this context, all three chatbots occasionally treated laboratory results as “standalone information”, especially when clinical information was limited. This approach is broadly recognized as problematic [18] and likely contributed to their tendency to overestimate. This pattern may be further exacerbated by all three chatbots’ inability to distinguish between critical and abnormal laboratory values, aligning with previous literature on ChatGPT [1]. For ChatGPT, this overestimation tendency is also evident in fictive scenarios across various medical contexts, such as plastic surgery [39] and multidisciplinary patient vignettes [40]. Overestimations, deemed less harmful than underestimations, might be strategically employed by developers to avoid legal repercussions [39]. Instead of mitigating unfounded patient concerns, however, such overestimations might inadvertently amplify them, thereby increasing logistical and financial burdens on the healthcare system [39, 40].

Furthermore, reference ranges posed another notable challenge to all three chatbots. Despite literature suggesting that ChatGPT and other chatbots (CopyAI and Writesonic) can navigate within provided reference intervals [1, 38], ChatGPT, Le Chat and Gemini struggled in their absence, often showing inconsistencies in ranges and units, as well as misclassifying laboratory values within them. In line with previous research [38], the use of different reference ranges by each chatbot led to varying interpretations of identical blood values – a situation reminiscent of the saying “two doctors, three opinions”. Notably, even when reference ranges were specified, ChatGPT occasionally failed to classify laboratory values accurately, while Le Chat and Gemini struggled more frequently. Despite ChatGPT’s superior performance compared to its competitors [16, 17, 38], none of the three chatbots proved consistently reliable. This inconsistency extends to other medical specialties, such as risk stratification in non-traumatic chest pain scenarios, further compromising their clinical reliability [41]. This emphasizes the need for substantial medical expertise to accurately interpret and validate their outputs, thus limiting their standalone use in interpreting laboratory reports.

Furthermore, the challenge of detecting inaccuracies is amplified by the sophisticated linguistic capabilities and empathetic responses of ChatGPT [15, 42], Gemini [43], and Le Chat [16, 44]. These features, while enhancing user engagement, may obscure the detection of errors [45]. Although all chatbots employ a cautionary tone, frequent disclaimers about their non-medical status, and recommendations for professional medical consultations similar to those described for ChatGPT in the literature [1, 15], it remains to be seen whether these measures can compensate for the misplaced trust in the chatbots’ ability to interpret laboratory reports [46, 47].

Despite these challenges, the chatbots’ rapid, personalized, and clear responses demonstrate significant potential for future applications in patient-centric communication. To optimize the utility and safety of chatbots in interpreting laboratory values, a strategic integration of the distinct strengths of the various chatbots could prove beneficial. For instance, combining ChatGPT’s adept handling of reference ranges with Gemini’s structured advice and practical tips for medical appointment preparation could streamline the logistical aspects of healthcare interactions. In fact, the implementation of AI in laboratory medicine could lead to significant financial savings, estimated at around 883.5 billion euros annually [48], and a reduction in labor hours by approximately 53.4 million [49]. The potential of AI to transform healthcare efficiency and cost management is also reflected in the large number of health start-ups focusing on screening and diagnostics [50]. This underscores the economic and safety imperatives of further developing AI-driven chatbots, emphasizing their role in enhancing patient safety and healthcare efficiency.

However, it is essential to recognize that laboratory results are medical reports intended for interpretation by trained professionals. Given the current limitations of AI-based chatbots in accurately interpreting such results, they may serve as tools for medical professionals, but their unsupervised use by patients is not advisable.

Limitations

Overall, this study has limitations. The inherent variability of AI output, despite the high consistency observed in ChatGPT’s responses to simulated laboratory reports [1], poses a challenge to reproducibility. Furthermore, as AI models continue to evolve, the results of this study may not be applicable to future iterations of these chatbots.

Although the medical experts evaluated each physician and chatbot’s response to the best of their knowledge, subjective variation cannot be fully accounted for.

Moreover, the study’s focus on English language laboratory results from a single online health forum limits its generalizability across different specialties, forums, or languages, particularly given the English language bias in multilingual large language models like ChatGPT.

Conclusions

Overall, the chatbots’ perceived high trustworthiness, coupled with content inaccuracies and a tendency to overestimate medical laboratory reports, creates a risky scenario for medical non-professionals. Instead of alleviating unfounded patient concerns and thereby relieving the burden on the healthcare system, the chatbots may inadvertently promote over- and misdiagnosis. Thus, given the substantial patient inclination to self-diagnose using chatbots, it is crucial to enhance the chatbots’ development to safeguard patients and prevent future burdens on the healthcare system.


Corresponding author: Annika Meyer, Institute of Clinical Chemistry, Faculty of Medicine and University Hospital, University Hospital Cologne, Kerpener Str. 62, 50937 Cologne, Germany, E-mail:

Acknowledgments

We acknowledge the support of DeepL, DeepL Write and ChatGPT (GPT-4) in the linguistic improvement and translation of the manuscript. Furthermore, ChatGPT was used to facilitate and accelerate the programming of the statistics. Any output from these artificial intelligence tools was critically reviewed by the authors.

  1. Research ethics: This project was reviewed and approved by the local Ethics Committee on 11 August 2023 (23-1288-retro).

  2. Informed consent: Not applicable.

  3. Author contributions: TS and AM designed the study. AM collected the patient questions as well as the physician and ChatGPT responses. TS, AM, and AS evaluated the responses to the questions. AS performed the statistical analysis and interpretation of the data and wrote the manuscript. TS and AS critically reviewed and improved the manuscript. All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Competing interests: The authors state no conflict of interest.

  5. Research funding: None declared.

  6. Data availability: In compliance with data protection regulations and to maintain confidentiality and privacy information, the authors are unable to share the raw data from this study. We are committed to upholding ethical standards and safeguarding sensitive data, and as such, cannot provide access to the dataset.

References

1. Cadamuro, J, Cabitza, F, Debeljak, Z, Bruyne, SD, Frans, G, Perez, SM, et al. Potentials and pitfalls of ChatGPT and natural-language artificial intelligence models for the understanding of laboratory medicine test results. An assessment by the European federation of clinical chemistry and laboratory medicine (EFLM) working group on artificial intelligence (WG-AI). Clin Chem Lab Med 2023;61:1158–66. https://doi.org/10.1515/cclm-2023-0355.

2. Nov, O, Singh, N, Mann, D. Putting ChatGPT’s medical advice to the (Turing) test: survey study. JMIR Med Educ 2023;9:e46939. https://doi.org/10.2196/46939.

3. Liebrenz, M, Schleifer, R, Buadze, A, Bhugra, D, Smith, A. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. Lancet Digit Health 2023;5:e105–6. https://doi.org/10.1016/s2589-7500(23)00019-5.

4. Hu, K. ChatGPT sets record for fastest-growing user base – analyst note; 2023. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/ [Accessed 28 Dec 2023].

5. Shahsavar, Y, Choudhury, A. User intentions to use ChatGPT for self-diagnosis and health-related purposes: cross-sectional survey study. JMIR Hum Factors 2023;10:e47564. https://doi.org/10.2196/47564.

6. Cascella, M, Semeraro, F, Montomoli, J, Bellini, V, Piazza, O, Bignami, E. The breakthrough of large language models release for medical applications: 1-year timeline and perspectives. J Med Syst 2024;48:22. https://doi.org/10.1007/s10916-024-02045-3.

7. Huh, S. Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination? A descriptive study. J Educ Eval Health Prof 2023;20:1. https://doi.org/10.3352/jeehp.2023.20.01.

8. Gilson, A, Safranek, CW, Huang, T, Socrates, V, Chi, L, Taylor, RA, et al. How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 2023;9:e45312. https://doi.org/10.2196/45312.

9. Kung, TH, Cheatham, M, Medenilla, A, Sillos, C, De Leon, L, Elepaño, C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digital Health 2023;2:e0000198. https://doi.org/10.1371/journal.pdig.0000198.

10. Takagi, S, Watari, T, Erabi, A, Sakaguchi, K. Performance of GPT-3.5 and GPT-4 on the Japanese medical licensing examination: comparison study. JMIR Med Educ 2023;9:e48002. https://doi.org/10.2196/48002.

11. Jung, LB, Gudera, JA, Wiegand, TLT, Allmendinger, S, Dimitriadis, K, Koerte, IK. ChatGPT besteht schriftliche medizinische Staatsexamina nach Ausschluss der Bildfragen [ChatGPT passes written German medical licensing examinations after exclusion of image-based questions]. Dtsch Arztebl Int 2023;120:373–4. https://doi.org/10.3238/arztebl.m2023.0113.

12. Pal, A, Sankarasubbu, M. Gemini goes to med school: exploring the capabilities of multimodal large language models on medical challenge problems & hallucinations. ArXiv 2024;abs/2402.07023. https://doi.org/10.18653/v1/2024.clinicalnlp-1.3.

13. Meyer, A, Riese, J, Streichert, T. Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination: observational study. JMIR Med Educ 2024;10:e50965. https://doi.org/10.2196/50965.

14. Mbakwe, AB, Lourentzou, I, Celi, LA, Mechanic, OJ, Dagan, A. ChatGPT passing USMLE shines a spotlight on the flaws of medical education. PLOS Digital Health 2023;2:e0000205. https://doi.org/10.1371/journal.pdig.0000205.

15. Ayers, J, Poliak, A, Dredze, M, Leas, E, Zhu, Z, Kelley, J, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 2023;183:589–96. https://doi.org/10.1001/jamainternmed.2023.1838.

16. Mistral AI. Mistral Large, our new flagship model; 2024. https://mistral.ai/news/mistral-large/ [Accessed 26 Feb 2024].

17. Gemini Team, Anil, R, Borgeaud, S, Wu, Y, Alayrac, J-B, Yu, J, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805; 2023.

18. Plebani, M. ChatGPT: angel or demond? Critical thinking is still needed. Clin Chem Lab Med 2023;61:1131–2. https://doi.org/10.1515/cclm-2023-0387.

19. Anderson, KE. Ask me anything: what is Reddit? Library Hi Tech News 2015;32:8–11. https://doi.org/10.1108/lhtn-03-2015-0018.

20. Nobles, AL, Leas, EC, Dredze, M, Ayers, JW. Examining peer-to-peer and patient-provider interactions on a social media community facilitating ask the doctor services. Proc Int AAAI Conf Web Soc Media 2020;14:464–75. https://doi.org/10.1609/icwsm.v14i1.7315.

21. Reddit. Rules. https://www.reddit.com/r/AskDocs/about/rules/ [Accessed 02 Apr 2024].

22. Reddit. Data API terms; 2023. https://www.redditinc.com/policies/data-api-terms [Accessed 02 Apr 2024].

23. Reddit. Developer terms; 2024. https://www.redditinc.com/policies/developer-terms [Accessed 02 Apr 2024].

24. Beaujean, A. Sample size determination for regression models using Monte Carlo methods in R. Practical Assess Res Eval 2014;19:1–16.

25. Reddit. What filters and sorts are available? https://support.reddithelp.com/hc/en-us/articles/19695706914196-What-filters-and-sorts-are-available [Accessed 07 May 2024].

26. Kreuzer, KA. Referenz Hämatologie [Reference hematology]. New York: Georg Thieme Verlag; 2019. https://doi.org/10.1055/b-004-140282.

27. Chan, CH, Leeper, TJ, Becker, J, Schoch, D. rio: a Swiss-army knife for data file I/O; 2023.

28. Wickham, H, Averick, M, Bryan, J, Chang, W, McGowan, LDA, François, R, et al. Welcome to the tidyverse. J Open Source Softw 2019;4:1686. https://doi.org/10.21105/joss.01686.

29. Sjoberg, DD, Whiting, K, Curry, M, Lavery, JA, Larmarange, J. Reproducible summary tables with the gtsummary package. R J 2021;13:570–80. https://doi.org/10.32614/rj-2021-053.

30. Larmarange, J. labelled: manipulating labelled data; 2023.

31. Kassambara, A. ggpubr: ‘ggplot2’ based publication ready plots; 2023.

32. Kaplan, J. fastDummies: fast creation of dummy (binary) columns and rows from categorical variables; 2023.

33. Razali, NM, Wah, YB. Power comparisons of Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling tests. J Stat Model Anal 2011;2:21–33.

34. Andrade, C. Multiple testing and protection against a type 1 (false positive) error using the Bonferroni and Hochberg corrections. Indian J Psychol Med 2019;41:99–100. https://doi.org/10.4103/ijpsym.ijpsym_499_18.

35. Do, C, Vasquez, PC, Soleimani, M. Metabolic alkalosis pathogenesis, diagnosis, and treatment: core curriculum 2022. Am J Kidney Dis 2022;80:536–51. https://doi.org/10.1053/j.ajkd.2021.12.016.

36. Nancy, CE. Laboratory testing in general practice: a patient safety blind spot. BMJ Qual Saf 2015;24:667. https://doi.org/10.1136/bmjqs-2015-004644.

37. López Yeste, ML, Izquierdo Álvarez, S, Pons Mas, AR, Álvarez Domínguez, L, Marqués García, F, Rodríguez, MPC, et al. Management of postanalytical processes in the clinical laboratory according to ISO 15189:2012 standard requirements: considerations on the review, reporting and release of results. Adv Lab Med 2021;2:51–9. https://doi.org/10.1515/almed-2020-0110.

38. Abusoglu, S, Serdar, M, Unlu, A, Abusoglu, G. Comparison of three chatbots as an assistant for problem-solving in clinical laboratory. Clin Chem Lab Med 2024;62:1362–6. https://doi.org/10.1515/cclm-2023-1058.

39. Abi-Rafeh, J, Hanna, S, Bassiri-Tehrani, B, Kazan, R, Nahai, F. Complications following facelift and neck lift: implementation and assessment of large language model and artificial intelligence (ChatGPT) performance across 16 simulated patient presentations. Aesthetic Plast Surg 2023;47(6). https://doi.org/10.1007/s00266-023-03538-1.

40. Nastasi, AJ, Courtright, KR, Halpern, SD, Weissman, GE. A vignette-based evaluation of ChatGPT’s ability to provide appropriate and equitable medical advice across care contexts. Sci Rep 2023;13:17885. https://doi.org/10.1038/s41598-023-45223-y.

41. Heston, TF, Lewis, LM. ChatGPT provides inconsistent risk-stratification of patients with atraumatic chest pain. PLoS One 2024;19:e0301854. https://doi.org/10.1371/journal.pone.0301854.

42. Orrù, G, Piarulli, A, Conversano, C, Gemignani, A. Human-like problem-solving abilities in large language models using ChatGPT. Front Artif Intell 2023;6:1199350. https://doi.org/10.3389/frai.2023.1199350.

43. Rane, N, Choudhary, S, Rane, J. Gemini versus ChatGPT: applications, performance, architecture, capabilities, and implementation. J Appl Artif Intell 2024;5:69–93. https://doi.org/10.2139/ssrn.4723687.

44. Lee, YK, Suh, J, Zhan, H, Li, JJ, Ong, DC. Large language models produce responses perceived to be empathic. ArXiv 2024;abs/2403.18148.

45. Chew, HSJ. The use of artificial intelligence-based conversational agents (chatbots) for weight loss: scoping review and practical recommendations. JMIR Med Inform 2022;10:e32578. https://doi.org/10.2196/32578.

46. Sofroniou, S. How I analysed my blood test results with ChatGPT: my personal experience; 2023. https://medium.com/@sophia.sofroniou/how-i-analysed-my-blood-test-results-with-chatgpt-my-personal-experience-d5fa1ed6c5a9 [Accessed 15 Apr 2024].

47. Medium. Steps to use ChatGPT-4 for blood test translation; 2023. https://generativeai.pub/steps-to-use-chatgpt-4-for-blood-work-translation-da99f266cbe3 [Accessed 15 Apr 2024].

48. Deloitte & MedTech Europe. Potenzielle finanzielle Einsparungen durch ausgewählte KI-Anwendungen im europäischen Gesundheitswesen im Jahr 2020 (in Milliarden Euro) [Potential financial savings from selected AI applications in European healthcare in 2020 (in billion euros)]. Belgium: Statista; 2020.

49. Deloitte & MedTech Europe. Eingesparte Zeit durch ausgewählte KI-Anwendungen im europäischen Gesundheitswesen im Jahr 2020 (in Millionen Stunden) [Time saved through selected AI applications in European healthcare in 2020 (in million hours)]. Belgium: Statista; 2020.

50. CB Insights. Verteilung der 150 vielversprechendsten Digital Health-Start-ups nach Segment im Jahr 2020 [Distribution of the 150 most promising digital health start-ups by segment in 2020]. New York: Statista; 2020.


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/cclm-2024-0246).


Received: 2024-02-24
Accepted: 2024-05-13
Published Online: 2024-05-29
Published in Print: 2024-11-26

© 2024 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
