AI chatbots remain overconfident -- even when they’re wrong

Large Language Models appear to be unaware of their own mistakes, prompting concerns about common uses for AI chatbots.

2025-07-22

Artificial intelligence chatbots are everywhere these days, from smartphone apps and customer service portals to online search engines. But what happens when these handy tools overestimate their own abilities?

Researchers asked both human participants and four large language models (LLMs) how confident they felt in their ability to answer trivia questions, predict the outcomes of NFL games or Academy Award ceremonies, or play a Pictionary-like image identification game. Both the people and the LLMs tended to be overconfident about how they would hypothetically perform. Interestingly, they also answered questions or identified images with relatively similar success rates.

However, when the participants and LLMs were asked retroactively how well they thought they did, only the humans appeared able to adjust expectations, according to a study published today in the journal Memory & Cognition.

“Say the people told us they were going to get 18 questions right, and they ended up getting 15 questions right. Typically, their estimate afterwards would be something like 16 correct answers,” said Trent Cash, who recently completed a joint Ph.D. at Carnegie Mellon University in the departments of Social and Decision Sciences and Psychology. “So, they’d still be a little bit overconfident, but not as overconfident.”

“The LLMs did not do that,” said Cash, who was lead author of the study. “They tended, if anything, to get more overconfident, even when they didn’t do so well on the task.”

The world of AI is changing rapidly each day, which makes drawing general conclusions about its applications challenging, Cash acknowledged. However, one strength of the study was that the data was collected over the course of two years, which meant using continuously updated versions of the LLMs known as ChatGPT, Bard/Gemini, Sonnet and Haiku. This means that AI overconfidence was detectable across different models over time.

“When an AI says something that seems a bit fishy, users may not be as skeptical as they should be because the AI asserts the answer with confidence, even when that confidence is unwarranted,” said Danny Oppenheimer, a professor in CMU’s Department of Social and Decision Sciences and coauthor of the study.

“Humans have evolved over time and practiced since birth to interpret the confidence cues given off by other humans. If my brow furrows or I’m slow to answer, you might realize I’m not necessarily sure about what I’m saying, but with AI, we don’t have as many cues about whether it knows what it’s talking about,” said Oppenheimer.

Asking AI the Right Questions

While the accuracy of LLMs at answering trivia questions and predicting football game outcomes is relatively low stakes, the research hints at the pitfalls associated with integrating these technologies into daily life.

For instance, a recent study conducted by the BBC found that when LLMs were asked questions about the news, more than half of the responses had “significant issues,” including factual errors, misattribution of sources and missing or misleading context. Similarly, another study from 2023 found LLMs “hallucinated,” or produced incorrect information, in 69 to 88 percent of legal queries.

Clearly, the question of whether AI knows what it’s talking about has never been more important. And the truth is that LLMs are not designed to answer everything users are throwing at them on a daily basis.

“If I'd asked ‘What is the population of London,’ the AI would have searched the web, given a perfect answer and given a perfect confidence calibration,” said Oppenheimer.

However, by asking questions about future events – such as the winners of the upcoming Academy Awards – or more subjective topics, such as the intended identity of a hand-drawn image, the researchers were able to expose the chatbots’ apparent weakness in metacognition – that is, the ability to be aware of one’s own thought processes.

“We still don’t know exactly how AI estimates its confidence,” said Oppenheimer, “but it appears not to engage in introspection, at least not skillfully.”

The study also revealed that each LLM has strengths and weaknesses. Overall, the LLM known as Sonnet tended to be less overconfident than its peers. ChatGPT-4, for its part, performed similarly to human participants in the Pictionary-like trial, accurately identifying 12.5 hand-drawn images out of 20, while Gemini could identify just 0.93 sketches, on average.

In addition, Gemini predicted it would get an average of 10.03 sketches correct, and even after answering fewer than one out of 20 questions correctly, the LLM retrospectively estimated that it had answered 14.40 correctly, demonstrating its lack of self-awareness.

“Gemini was just straight up really bad at playing Pictionary,” said Cash. “But worse yet, it didn’t know that it was bad at Pictionary. It’s kind of like that friend who swears they’re great at pool but never makes a shot.”
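To make the arithmetic behind these figures concrete, here is a minimal sketch that treats overconfidence as the estimated score minus the actual score. This simple difference is an assumption made for illustration, not necessarily the measure used in the study, and the function and variable names are hypothetical. The human figures come from Cash's illustrative example; the Gemini figures are the averages reported above.

```python
def overconfidence(estimate: float, actual: float) -> float:
    """Estimated score minus actual score; a positive value means overconfident."""
    return estimate - actual


def before_and_after(predicted: float, actual: float, retrospective: float) -> tuple[float, float]:
    """Return overconfidence before the task and after it (hypothetical helper)."""
    return overconfidence(predicted, actual), overconfidence(retrospective, actual)


# Cash's illustrative human example: predicted 18, scored 15, then estimated 16.
human_before, human_after = before_and_after(18, 15, 16)

# Gemini's reported averages on the 20-sketch Pictionary-like task.
gemini_before, gemini_after = before_and_after(10.03, 0.93, 14.40)

print(f"Human:  {human_before:+.2f} before, {human_after:+.2f} after")
print(f"Gemini: {gemini_before:+.2f} before, {gemini_after:+.2f} after")
```

Under this simple measure, the hypothetical human's overconfidence shrinks from 3 questions to 1 after the task, while Gemini's grows from about 9.1 sketches to about 13.5.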

Building Trust with Artificial Intelligence

For everyday chatbot users, Cash said the biggest takeaway is to remember that LLMs are not inherently correct and that it might be a good idea to ask them how confident they are when answering important questions. Of course, the study suggests LLMs might not always be able to accurately judge confidence, but in the event that the chatbot does acknowledge low confidence, it's a good sign that its answer cannot be trusted.

The researchers note that it’s also possible that the chatbots could develop a better understanding of their own abilities over vastly larger data sets.

“Maybe if it had thousands or millions of trials, it would do better,” said Oppenheimer.

Ultimately, exposing weaknesses such as overconfidence will only help those in the industry who are developing and improving LLMs. And as AI becomes more advanced, it may develop the metacognition required to learn from its mistakes.

“If LLMs can recursively determine that they were wrong, then that fixes a lot of the problem,” said Cash.

“I do think it’s interesting that LLMs often fail to learn from their own behavior,” said Cash. “And maybe there’s a humanist story to be told there. Maybe there’s just something special about the way that humans learn and communicate.”
