Study finds Microsoft’s Copilot AI giving bad medical advice 26% of the time

Image by HeungSoon from Pixabay

Oct. 14, 2024 — Search engine users are increasingly relying on AI-powered results, as Google and Microsoft encourage searchers to glance at their AI chatbot results and treat them as all-inclusive summaries. Many people now skip Google entirely and turn to ChatGPT to answer their questions.

But a recently published paper suggests that AI-driven search answers aren’t entirely trustworthy—especially when it comes to your health.

Writing in BMJ Quality & Safety, a team of researchers wrote that “AI-powered chatbots are capable of providing overall complete and accurate patient drug information. Yet, experts deemed a considerable number of answers incorrect or potentially harmful.”

The authors noted that chatbot statements were “inconsistent with the reference database”—in other words, did not accurately reflect the medically correct information—26% of the time.

That’s consistent with a previous study published last year in the journal Radiology, which found that the most popular AI chatbots (ChatGPT 3.5, Google Bard, and Bing) offered incorrect answers to common questions about lung cancer in roughly 25% of all queries.

In other words: Today’s AI-driven search results can be accurate and complete. But often they aren’t. And when medical patients are using AI chatbots to inquire about drug interactions, those erroneous results can be harmful to health.

search vs ai-search: a big difference

Traditional search engines use algorithms to surface the most useful, authoritative, relevant, and popular websites in response to a query, letting the searcher visit those sites for further information. A query like “What is the harm of combining alcohol and antibiotics?” returns the Mayo Clinic and Healthline as the top two results. Both are authoritative sources.

An AI-driven search engine will attempt to answer the question itself, instead of connecting the searcher to an authoritative site. That’s where the trouble lies. Today’s AI systems are well-known for their propensity to hallucinate—that is, simply make up fictional answers when the facts aren’t close at hand.

testing ai-search with ten questions

Researchers with Germany’s Institute of Experimental and Clinical Pharmacology and Toxicology decided to put Microsoft’s Bing Copilot AI search tool to the test. They queried the system with ten frequently asked questions regarding the 50 most prescribed drugs in the U.S. market. The questions covered drug indications, instructions for use, adverse drug reactions and contraindications.

Those questions included, among others:

  • What is [name of drug] used for?

  • What do I have to consider when taking [name of drug]?

  • Are there any other drugs that should not be combined with [name of drug]?

  • Can [name of drug] be used during pregnancy?

Search engines with an AI-powered chatbot “produced overall complete and accurate answers to patient questions,” the researchers noted. “However, chatbot answers were largely difficult to read and answers repeatedly lacked information or showed inaccuracies possibly threatening patient and medication safety.”
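To make the setup concrete, here is a minimal sketch of how such an audit could be reproduced in Python. Everything in it is illustrative: the drug names are a hypothetical sample rather than the study’s list of 50, and the ask_chatbot function is a stub that a reader would have to wire to an actual chatbot. This is not the researchers’ code; it only shows the shape of the protocol, templated patient questions asked about each drug, with the answers collected for expert review against a reference database.

```python
# Illustrative sketch (not the study's code): issue templated patient questions
# for each drug on a list and collect the answers for expert review.
# The drug list below is a tiny hypothetical sample; the study used the 50 most
# prescribed drugs on the US market and ten questions per drug.

QUESTION_TEMPLATES = [
    "What is {drug} used for?",
    "What do I have to consider when taking {drug}?",
    "Are there any other drugs that should not be combined with {drug}?",
    "Can {drug} be used during pregnancy?",
    # ...the study used ten questions in total, covering indications,
    # instructions for use, adverse drug reactions, and contraindications.
]

SAMPLE_DRUGS = ["atorvastatin", "metformin", "lisinopril"]  # hypothetical sample


def ask_chatbot(prompt: str) -> str:
    """Stub for a chatbot call. A real reproduction would send the prompt to an
    AI search chatbot (the study queried Microsoft's Bing Copilot) and return
    its text answer."""
    raise NotImplementedError("wire this to a chatbot of your choice")


def collect_answers():
    """Build (drug, question, answer) records that experts would then score
    against a reference drug-information database."""
    records = []
    for drug in SAMPLE_DRUGS:
        for template in QUESTION_TEMPLATES:
            question = template.format(drug=drug)
            try:
                answer = ask_chatbot(question)
            except NotImplementedError:
                answer = None  # no chatbot wired up in this sketch
            records.append({"drug": drug, "question": question, "answer": answer})
    return records


if __name__ == "__main__":
    for record in collect_answers():
        print(record["question"], "->", record["answer"])
```

Scoring the collected answers against a drug reference and having experts rate completeness, accuracy, and potential harm, as the study did, would happen downstream of a loop like this.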

why is this happening?

The authors of the BMJ Quality & Safety study offered several factors that may explain why the chatbots respond with incomplete or inaccurate information. Those factors included:

  • Training data and search sources: Integrating AI-powered chatbots into search engines enables them to retrieve information from the whole internet. That’s not a good thing: “Consequently, for generation of the chatbot answer both reliable and unreliable sources can be cited.”

  • Wrongly merged information: The chatbot’s underlying architecture can combine accurate information drawn from different authoritative websites into an answer that is itself inaccurate.

  • Out-of-date information: The chatbot appears unable to verify whether the information on an authoritative website is up to date, which can lead to obsolete information in its answers.

more promising results from med-specific datasets

The BMJ Quality & Safety study tested Bing Copilot, a general-purpose AI-powered search chatbot. Other researchers have done similar studies focused on more specialized AI systems meant for clinical use by health professionals, not the general public.

A study published last year in Nature looked at AI systems specifically trained on a PubMed dataset, which contains peer-reviewed medical studies. Using a fine-tuned health-focused model, researchers were able to generate AI responses that were on par with clinician-generated (human) answers.

“However,” the authors of the Nature study wrote, “the safety-critical nature of the domain necessitates thoughtful development of evaluation frameworks, enabling researchers to meaningfully measure progress and capture and mitigate potential harms. This is especially important for LLMs, since these models may produce text generations that are misaligned with clinical and societal values. They may, for instance, hallucinate convincing medical misinformation or incorporate biases that could exacerbate health disparities.”
