Structured stress test reveals safety gaps in ChatGPT Health triage

1 min read
Source: Nature
TL;DR Summary

A Nature Medicine study stress-tested ChatGPT Health with 60 clinician-authored vignettes spanning 21 clinical domains under 16 factorial conditions (960 responses). Performance followed an inverted U-shape, with the most dangerous error rates at the extremes of acuity: 35% for non-urgent cases and 48% for emergencies. Among gold-standard emergencies, 52% were under-triaged (e.g., diabetic ketoacidosis or impending respiratory failure could be directed to care within 24–48 hours instead of the emergency department), while classic emergencies such as stroke and anaphylaxis were triaged correctly. Anchoring statements attributed to family or friends shifted edge-case triage toward less urgent care (OR 11.7). Crisis-intervention messages activated inconsistently across suicidal-ideation presentations. No significant effects were found for patient race, gender, or barriers to care. Overall, the findings raise safety concerns and call for prospective validation before consumer deployment of AI triage tools.

