AI’s Limitations in Clinical Decision-Making

Generative artificial intelligence (AI) tools have the potential to be valuable allies in healthcare, assisting doctors in making rapid diagnoses and selecting appropriate treatments. However, their reliability remains a concern. A study published in July in the journal npj Digital Medicine highlights these limitations.
A Promising Ally
“AI has the potential to assist healthcare professionals by improving efficiency, access to quality care for all, and health equity. In clinical settings, it can serve as a decision-support tool, saving doctors valuable time during diagnoses,” said Dr Zhiyong Lu, a senior researcher at the National Institutes of Health and adjunct professor of computer science at the University of Illinois Urbana-Champaign. Lu, the study’s corresponding author, shared his insights in an interview with Medscape Medical News.
The study evaluated the performance of GPT-4V, a multimodal AI model recently released by OpenAI that can process various types of data, including text and images. It focused on the model’s ability to answer medical questions and on the justifications it provided for its answers.
GPT-4V Versus Doctors
The study involved 207 multiple-choice questions from the New England Journal of Medicine’s Image Challenge, which is commonly used to evaluate doctors’ diagnostic abilities. The questions covered dermatology (34 questions), pathology (17), pulmonology (21), gastroenterology (29), neurology (13), ophthalmology (25), cardiology (13), infectious diseases (21), and a mix of other cases (34).
GPT-4V’s answers were compared with those provided by nine doctors from different specialties. Each participant, including GPT-4V, was presented with real clinical images and brief case summaries that included the patient’s medical history and details of the symptoms. They were then tasked with selecting the correct diagnosis from a set of options.
In this “closed” testing scenario, where no external sources could be consulted, GPT-4V achieved an accuracy rate of 81.6%, slightly outperforming the doctors, who had an accuracy rate of 77.8%. Notably, the tool correctly diagnosed 78.3% of the cases that the doctors got wrong.
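To make the comparison concrete, the sketch below shows how head-to-head figures of this kind can be computed from per-case results. It is a minimal Python illustration; the record structure and field names are hypothetical and not drawn from the study’s actual code.

```python
# Minimal scoring sketch (hypothetical data, not the study's actual code).
# Each record notes whether GPT-4V and the physician answered a case correctly.
cases = [
    {"gpt4v_correct": True,  "doctor_correct": False},
    {"gpt4v_correct": True,  "doctor_correct": True},
    {"gpt4v_correct": False, "doctor_correct": True},
    # ... one record per Image Challenge question (207 in the study)
]

n = len(cases)
gpt4v_accuracy = sum(c["gpt4v_correct"] for c in cases) / n
doctor_accuracy = sum(c["doctor_correct"] for c in cases) / n

# Of the cases the doctors missed, what share did the model get right?
doctor_misses = [c for c in cases if not c["doctor_correct"]]
rescue_rate = sum(c["gpt4v_correct"] for c in doctor_misses) / len(doctor_misses)

print(f"GPT-4V accuracy: {gpt4v_accuracy:.1%}")        # 81.6% in the study
print(f"Doctor accuracy: {doctor_accuracy:.1%}")       # 77.8% in the study
print(f"Correct on doctor misses: {rescue_rate:.1%}")  # 78.3% in the study
```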
However, when asked to describe the images and provide written justifications for its diagnoses, GPT-4V struggled. It presented flawed justifications in 35.5% of the cases in which it had made the correct diagnosis. Interpreting the images accurately was the tool’s biggest challenge, with a 27.2% error rate in image comprehension.
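It is worth spelling out that the 35.5% figure is a conditional rate: it applies only to cases the model diagnosed correctly. A short, hypothetical sketch of that breakdown follows; the field names are assumptions, not the study’s actual schema.

```python
# Hypothetical sketch of the rationale-quality breakdown.
# Each reviewed case records whether the diagnosis was right, whether the
# written justification was flawed, and whether the flaw involved the image.
rated_cases = [
    {"diagnosis_correct": True,  "rationale_flawed": True,  "image_misread": True},
    {"diagnosis_correct": True,  "rationale_flawed": False, "image_misread": False},
    {"diagnosis_correct": False, "rationale_flawed": True,  "image_misread": False},
    # ... one record per reviewed case
]

# Conditional rate: flawed justifications among correct diagnoses only.
correct = [c for c in rated_cases if c["diagnosis_correct"]]
flawed_rationale_rate = sum(c["rationale_flawed"] for c in correct) / len(correct)

# Image-comprehension errors; all reviewed cases are used as the denominator
# here, though the study's exact denominator may differ.
image_error_rate = sum(c["image_misread"] for c in rated_cases) / len(rated_cases)

print(f"Flawed rationale among correct diagnoses: {flawed_rationale_rate:.1%}")  # 35.5%
print(f"Image-comprehension error rate: {image_error_rate:.1%}")                 # 27.2%
```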
For instance, in one case, GPT-4V correctly identified malignant syphilis and provided multiple pieces of evidence to support its diagnosis. However, it failed to recognize that two skin lesions presented at different angles were manifestations of the same condition.
The tool also had difficulties in ruling out certain diagnoses based on available evidence and distinguishing between similar clinical manifestations in different medical conditions. Its performance was further hindered when faced with complex cases or those involving new information.
The researchers noted that the tool’s strong performance without access to external sources suggests that it can support doctors by enhancing data-driven decision-making, enabling quicker and more precise diagnoses. However, it replaces neither the invaluable experience and knowledge that professionals bring to the table nor the utility of external sources.
Understanding AI’s Limits
AI tools are not yet sophisticated enough to replace human expertise, which remains essential for minimizing risks in medical care. The researchers stressed that understanding the limitations of AI is crucial before fully integrating it into daily clinical practice. Ensuring that AI is used safely and effectively in medicine depends on recognizing these shortcomings.
“There’s no guarantee that AI’s reasoning is always correct. Doctors must understand the reasons behind AI-generated results rather than blindly trusting them, despite the high accuracy. Doctors should continue to rely on their expert judgment when treating patients,” advised Lu.
The study underscores the need for further research to evaluate AI’s role in real-world medical scenarios, emphasizing the importance of quantitative data analysis and active involvement from healthcare professionals.
“Our study highlights some of the challenges in integrating AI into clinical decision support, which is becoming increasingly crucial with advances in technology. But to fully realize AI’s potential responsibly, more research is needed,” Lu concluded.
This story was translated from the Medscape Portuguese edition using several editorial tools, including AI, as part of the process. Human editors reviewed this content before publication.
 
Send comments and news tips to [email protected].
