Structured stress test reveals safety gaps in ChatGPT Health triage

1 min read
Source: Nature
TL;DR Summary

A Nature Medicine study stress-tested ChatGPT Health with 60 clinician-authored vignettes spanning 21 clinical domains under 16 factorial conditions (960 responses). Performance followed an inverted U-shape, with the most dangerous error rates at the extremes of acuity: 35% for non-urgent cases and 48% for emergencies. Among gold-standard emergencies, 52% were under-triaged (e.g., diabetic ketoacidosis or impending respiratory failure could be directed to care within 24–48 hours instead of the emergency department), while classic emergencies such as stroke and anaphylaxis were triaged correctly. Anchoring statements attributed to family or friends shifted edge-case triage toward less urgent care (OR 11.7). Crisis-intervention messages activated inconsistently across suicidal-ideation presentations. No significant effects were found for patient race, gender, or barriers to care. Overall, the findings raise safety concerns and call for prospective validation before consumer deployment of AI triage tools.

