Comparing the Accuracy of ChatGPT-4o, DeepSeek-V3, and Gemini 2.5 Flash in Answering Frequently Asked Questions About Systemic Lupus Erythematosus: Quantitative Study

Background: Systemic lupus erythematosus (SLE) is a complex, fluctuating disease, creating a continuous need for reliable patient information. A prior study concluded that patients with SLE often turn to the internet, including artificial intelligence (AI) chatbots, for information regarding SLE. The rise of AI chatbots as a primary information source presents a critical challenge regarding the accuracy of the information they provide. Objective: This study aimed to evaluate the performance of the latest generation of AI chatbots (ChatGPT-4o, DeepSeek-V3, and Gemini 2.5 Flash) in answering frequently asked questions about SLE. Methods: Twenty-two frequently asked questions about SLE in Bahasa Indonesia (the Indonesian language) were posed to each chatbot. Responses were independently and blindly evaluated for accuracy by 5 clinical immunologists using a 4-point Likert scale. Readability was assessed using the Flesch reading ease score formula. Statistical comparisons for accuracy and readability were performed using repeated-measures ANOVA or the Friedman test, followed by the Bonferroni test for pairwise comparisons. The Spearman ρ was used to evaluate correlations among accuracy, readability, and word count. Results: Gemini 2.5 Flash demonstrated the highest accuracy, with a mean score of 1.25 (SD 0.53), significantly outperforming ChatGPT-4o (mean 1.71, SD 0.61; P<.001). Gemini 2.5 Flash significantly outperformed ChatGPT-4o in 2 evaluated domains. The interrater reliability analysis revealed a statistically significant level of agreement among the 5 evaluators across all responses (Kendall W=0.389; P<.001). Readability for all 3 chatbots was low (median Flesch reading ease score 42.22-46.66). Gemini 2.5 Flash produced the longest responses (8509 total words), followed by DeepSeek-V3 (5410 words) and ChatGPT-4o (3632 words). A significant negative correlation was found between word count and accuracy score (ρ=−0.401; P=.001).
Conclusions: Our study found that ChatGPT-4o, DeepSeek-V3, and Gemini 2.5 Flash provided overall satisfactory responses to SLE-related questions. The highest accuracy was demonstrated by Gemini 2.5 Flash; however, the absolute differences in scores among the 3 AI chatbots were relatively small. All 3 AI chatbots demonstrated low readability, which may limit accessibility for patient use. This finding highlights a critical “blind spot” in which clinical accuracy, as rated by experts, does not equate to patient accessibility. Thus, further research is required to develop more comprehensive evaluation frameworks incorporating safety, factuality, and calibration of AI chatbots across different medical fields and topics.
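
For readers unfamiliar with the readability metric named in the Methods, the following is a minimal sketch of the standard Flesch reading ease formula. It assumes the original English-calibrated coefficients; the abstract does not say whether an adaptation for Bahasa Indonesia was used or how syllables were counted, so the function takes precomputed counts. The function names and the band boundaries' tie-handling are illustrative choices, not the authors' implementation.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard Flesch reading ease score from precomputed text counts.

    Assumption: English-calibrated coefficients (206.835, 1.015, 84.6).
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    avg_sentence_length = total_words / total_sentences      # words per sentence
    avg_syllables_per_word = total_syllables / total_words   # syllables per word
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


def flesch_band(score: float) -> str:
    """Map a score to Flesch's conventional difficulty labels.

    Boundary scores are assigned to the easier band here; that is a convention
    chosen for this sketch.
    """
    bands = [
        (90, "very easy"),
        (80, "easy"),
        (70, "fairly easy"),
        (60, "standard"),
        (50, "fairly difficult"),
        (30, "difficult"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "very difficult"


if __name__ == "__main__":
    # Illustrative counts only (not data from the study).
    print(round(flesch_reading_ease(100, 5, 150), 3))
    # The median range reported in the abstract falls in a single band.
    print(flesch_band(42.22), flesch_band(46.66))
```

Under these conventional labels, both ends of the reported median range (42.22 to 46.66) land in the "difficult" band, which is consistent with the abstract's description of readability as low.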

Supervised Fine-Tuning of Large Language Models With Chain-of-Thought Reasoning for Pediatric Heart Disease Detection in Unstructured Echocardiogram Reports: Algorithm Development and Validation

Background: Pediatric heart disease (PHD), including congenital heart defects, is often incompletely captured in electronic health records, particularly when clinical significance must be inferred from unstructured echocardiogram reports. Automated methods capable of extracting clinically meaningful PHD from narrative reports could improve clinical decision support and research applications. Objective: The aim of the study is to evaluate the feasibility of using supervised fine-tuning of large language models (LLMs), with and without chain-of-thought (CoT) reasoning, to characterize patients with clinically significant or historical PHD from unstructured echocardiogram reports. Methods: We developed a PHD detection algorithm using fine-tuned open-source LLMs, including LLaMA (Meta) and Qwen (Alibaba), to analyze 9749 echocardiogram reports. A subset of 712 reports was adjudicated by 2 pediatric cardiac anesthesiologists, classifying 506 (71.1%) as clinically significant PHD and 206 (28.9%) as not significant. While DeepSeek R1 has shown improved performance with CoT reasoning, its application in medical contexts is underexplored. We incorporated R1-generated CoT into model prompts and fine-tuned backbone LLMs. Results: The fine-tuned Qwen-7B-10k-overthink-CoT achieved the highest accuracy (92.4%), outperforming Qwen-7B-without-CoT (90%), LLaMA-3B-without-CoT (87.9%), Qwen-3B-without-CoT (85.6%), Qwen-3B-10k-overthink-CoT (68.5%), and LLaMA-3B-10k-overthink-CoT (46.2%). In external validation on a second dataset (n=113; 64 positive, 49 negative), Qwen-7B-10k-overthink-CoT sustained a strong, balanced performance (82.7%), compared with Qwen-7B-without-CoT (88.4%), LLaMA-3B-without-CoT (86.8%), Qwen-3B-without-CoT (84.5%), Qwen-3B-10k-overthink-CoT (58.9%), and LLaMA-3B-10k-overthink-CoT (46.2%). The fine-tuned Qwen-7B model with overthinking CoT (10,000 tokens) achieved the highest internal accuracy (92.4%), with balanced sensitivity and specificity.
Across repeated runs, CoT-enhanced models demonstrated improved classification consistency compared to non-CoT models (Qwen-7B-without-CoT: 90%, LLaMA-3B-without-CoT: 87.9%, Qwen-3B-without-CoT: 85.6%). In external validation (n=113), non-CoT variants achieved higher accuracy (up to 88.4%), whereas the Qwen-7B CoT model demonstrated more balanced class performance (accuracy=82.7%). Conclusions: Supervised fine-tuning of LLMs with CoT offers an effective approach for automated PHD detection within unstructured data in the electronic medical record. While CoT-enhanced models demonstrated improved internal performance and more balanced classification, they did not consistently achieve higher accuracy in external validation, highlighting trade-offs between accuracy and class balance. These findings highlight the promise of LLM-based approaches for clinical text phenotyping while underscoring the need for larger, multicenter validation and careful calibration for real-world deployment. Continued validation and integration into the electronic medical record are essential for real-world, artificial intelligence–driven clinical decision support.
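
The trade-off between accuracy and class balance mentioned in the Conclusions can be illustrated with a small sketch. It assumes the usual confusion-matrix definitions of accuracy, sensitivity, specificity, and balanced accuracy; the class and method names are hypothetical, and the only numbers taken from the abstract are the external validation class sizes (64 positive, 49 negative). The classifier shown is a trivial "always positive" baseline, not any model from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical helper for this sketch)."""
    tp: int  # true positives
    fn: int  # false negatives
    tn: int  # true negatives
    fp: int  # false positives

    def sensitivity(self) -> float:
        positives = self.tp + self.fn
        if positives == 0:
            raise ValueError("no positive cases")
        return self.tp / positives

    def specificity(self) -> float:
        negatives = self.tn + self.fp
        if negatives == 0:
            raise ValueError("no negative cases")
        return self.tn / negatives

    def accuracy(self) -> float:
        total = self.tp + self.fn + self.tn + self.fp
        if total == 0:
            raise ValueError("no cases")
        return (self.tp + self.tn) / total

    def balanced_accuracy(self) -> float:
        # Mean of per-class recall; insensitive to class prevalence.
        return (self.sensitivity() + self.specificity()) / 2


if __name__ == "__main__":
    # Trivial baseline that labels every report positive, evaluated on the
    # external validation class sizes from the abstract (64 positive, 49 negative).
    always_positive = ConfusionCounts(tp=64, fn=0, tn=0, fp=49)
    print(f"accuracy          = {always_positive.accuracy():.3f}")
    print(f"sensitivity       = {always_positive.sensitivity():.3f}")
    print(f"specificity       = {always_positive.specificity():.3f}")
    print(f"balanced accuracy = {always_positive.balanced_accuracy():.3f}")
```

Because the positive class is the majority, this baseline scores above 50% on raw accuracy while its balanced accuracy sits at chance, which is why a model with lower headline accuracy can still show better class balance.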

User Perspectives on a Clinical Decision Tool to Support Individualized Exercise Prescriptions for Breast Cancer Survivors Not Meeting Exercise Guidelines: Cross-Sectional Survey

Background: More than 80% of breast cancer survivors do not meet the recommended levels of exercise, and <50% of health care providers promote exercise as part of survivorship care. Patient-provider communication may enhance exercise engagement by increasing patients’ understanding of exercise benefits and linking patients to resources, such as rehabilitation and exercise programs. Objective: This study aimed to explore perspectives on a novel clinical decision tool designed to support individualized exercise discussions and prescriptions among breast cancer survivors who do not meet exercise guidelines and health care providers who primarily treat such survivors. Methods: We conducted a cross-sectional online survey among US breast cancer survivors and health care providers. Participants were (1) female breast cancer survivors aged ≥35 years engaging in ≤150 minutes/week of moderate-intensity aerobic exercise or ≤2 days/week of muscle-strengthening exercise and (2) health care providers who had cared for breast cancer survivors within the past 12 months and reported below-average guideline adherence among their patients. Respondents reviewed a paper draft of a web-based clinical decision prototype tool for supporting individualized exercise discussions and prescriptions based on patients’ demographic, clinical, and contextual characteristics. We assessed perceived usefulness, potential uses (eg, counseling), preferred timing of access within clinical encounters, and preferences for tool characteristics (inputs/outputs). Results: The analytic sample comprised 26 breast cancer survivors and 69 health care providers. The survivors’ median age was 48 (IQR 37-65) years. Providers included patient navigators/social workers/nurses (29/69, 42.0%), breast oncologists (13/69, 18.8%), and occupational/physical therapists (12/69, 17.4%).
The majority of providers (62/69, 89.9%, 95% CI 80.2%-95.8%) and survivors (23/26, 88.5%, 95% CI 69.8%-97.6%) reported that they would find the tool useful. Similarly, 85.5% of providers (59/69, 95% CI 75.0%-92.8%) and 84.6% of survivors (22/26, 95% CI 65.1%-95.6%) reported that the tool would increase their confidence to discuss exercise in a clinical setting. Both groups preferred that survivors access the tool with staff after a medical appointment (survivors: 20/26, 76.9%, 95% CI 56.4%-91.0%; providers: 58/67, 86.6%, 95% CI 76.0%-93.7%). Both groups also endorsed treatment history and readiness to exercise as key inputs to consider, and improved quality of life and reduced treatment-related side effects as the exercise benefits to communicate as tool outputs. Conclusions: The prototype tool concept was well received, with high endorsement of individual characteristics to consider and clinical benefits of exercise to communicate. Findings will inform refinement of the tool and future implementation testing in an understudied population of breast cancer survivors.
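
The proportions above are reported with 95% CIs, but the abstract does not state which interval method was used. As a sketch of how such an interval can be computed for a small sample, the following assumes the exact (Clopper-Pearson) binomial method and solves for the bounds by bisection using only the standard library. The function names are hypothetical, and because the method is an assumption, the output need not match the published intervals.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(f, target: float, increasing: bool, iterations: int = 100) -> float:
    """Bisection on [0, 1] for a monotone function f with f(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies at or to the left of mid
    return (lo + hi) / 2


def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) CI for a binomial proportion x/n."""
    if not 0 <= x <= n or n <= 0:
        raise ValueError("require 0 <= x <= n and n > 0")
    # Lower bound: smallest p with P(X >= x | p) = alpha/2 (increasing in p).
    lower = 0.0 if x == 0 else _solve(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2, increasing=True
    )
    # Upper bound: largest p with P(X <= x | p) = alpha/2 (decreasing in p).
    upper = 1.0 if x == n else _solve(
        lambda p: binom_cdf(x, n, p), alpha / 2, increasing=False
    )
    return lower, upper


if __name__ == "__main__":
    # Counts taken from the abstract: 62 of 69 providers, 23 of 26 survivors.
    for label, x, n in [("providers", 62, 69), ("survivors", 23, 26)]:
        lo, hi = clopper_pearson(x, n)
        print(f"{label}: {x}/{n} = {x / n:.1%}, 95% CI {lo:.1%}-{hi:.1%}")
```

With only 26 survivors, any such interval is noticeably wider than the one for the 69 providers, which is worth keeping in mind when comparing the two groups' endorsement rates.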

Effects of SGLT2 inhibition on incident heart failure in carriers of cardiomyopathy-associated genetic variants

Nature Medicine, Published online: 08 June 2026; doi:10.1038/s41591-026-04439-x

In a whole-exome sequencing analysis, the beneficial effects of the SGLT2 inhibitor dapagliflozin in reducing the risk of future heart failure hospitalization in individuals with type 2 diabetes were markedly greater in individuals who carried a cardiomyopathy-associated genetic variant compared with noncarriers, suggesting a personalized preventative therapy based on genetic information.

Single-cell spatial pharmacobiology for imaging antibody-based therapies in solid tumors

Nature Biotechnology, Published online: 08 June 2026; doi:10.1038/s41587-026-03171-8

We have developed single-cell spatial pharmacobiology (SSP), which combines in situ imaging of a systemically infused fluorescent therapeutic antibody with high-plex spatial proteomics. Applied to head and neck and pancreatic tumors from patients treated in phase 1 trials, SSP revealed marked spatial heterogeneity in antibody delivery and target engagement, which was shaped by conserved stromal barriers.

Solventum executive pay disclosure includes $11M severance for ex-MedSurg leader

Solventum recently reported compensation for its top executives and median employee for its first full year as a standalone company. The business formerly known as 3M Health Care spun off from 3M and became an independently publicly traded company on April 1, 2024. As part of the move, some Solventum executives received hiring bonuses, make-whole…

The post Solventum executive pay disclosure includes $11M severance for ex-MedSurg leader appeared first on Medical Design and Outsourcing.