Do you ask artificial intelligence about your health? Error rates exceed 20%, according to US researchers

MIAMI.- In these hectic times, it is increasingly common to turn to artificial intelligence (AI) with questions about medical topics and even health problems, but a study carried out in the US found in its initial phase that errors in its answers can exceed 20%.

The study, led by researchers at Pennsylvania State University (Penn State), found that AI-powered virtual assistants answer users’ everyday health questions with an accuracy of about 76%.

The initial results, which have raised concerns about the reliability of AI in real-world medical applications, indicated that the tools may be more effective in the hands of trained doctors than in those of patients.

Research on AI and health

According to the report, the researchers investigated how the average person uses AI for health-related topics and how accurately virtual tools such as ChatGPT respond to increasingly frequent medical queries.

The scientific team found that people asked more questions about specialized areas such as neurology and dermatology.

The work explicitly focused on healthcare scenarios that the average internet user might ask an AI about, “a perspective that previous research on large language models (LLMs) and healthcare has not addressed,” according to the study.

“We wanted to understand how accurate LLMs like ChatGPT are in answering questions about health symptoms, similar to how we have historically used Google, and how harmful those answers could be,” Amulya Yadav, co-author of the study and associate professor of computer science and intelligent systems in the College of Information Sciences and Technology (IST) at Penn State, told a publication.

The questions at the center of the study

For the study, 34 invited participants, including professors, administrative staff, and undergraduate and graduate students, submitted 212 questions, together with the AI-generated answers, about real and hypothetical health problems.

The questions were written from the perspective of both the patient and the doctor. Participants could choose one of four LLMs to use in the contest: ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro and Llama3-8b, it was reported.

“This type of participatory research is essential to understanding how the public uses AI in their daily lives,” stressed Bonam Mingole, lead author of the study and a doctoral candidate in information science and technology.

The researchers then asked nine certified doctors to evaluate the accuracy of the AI-generated responses and their potential harm.

Response accuracy

The doctors used a six-point scale that ranged from very low to very high. An evaluation committee subsequently gave awards to the eight submissions that elicited the most medically accurate information, as well as to the submission that elicited the answer most likely to cause harm.

The team of doctors found that, overall, 76.2% of the responses generated by the LLMs provided accurate information.

According to the researchers, specialties such as obstetrics and gynecology and otolaryngology showed the best LLM performance, with high validity scores and low risk scores. Internal medicine, neurology and dermatology showed the worst, with low validity scores and higher risk scores.

The full research findings are to be presented at the 2026 Association for Computing Machinery (ACM) Conference on Fairness, Accountability, and Transparency (FAccT), which will be held in Montreal, Canada, from June 25 to 28.