Science 2026-03-16 3 min read

Ask ChatGPT the same question ten times and you may get five different answers

WSU researchers tested over 700 scientific hypotheses and found the AI performed only about 60% better than a coin flip - with alarming inconsistency

True or false: does this scientific hypothesis hold up? It is the kind of question that demands careful reasoning, a weighing of evidence, and an understanding of nuance. It is also the kind of question that ChatGPT gets wrong in ways that are difficult to detect - because even when the answer is wrong, the explanation sounds perfectly convincing.

Seven hundred hypotheses, ten tries each

Mesut Cicek, an associate professor of marketing at Washington State University, and his colleagues took 719 hypotheses from business research papers published since 2021 and presented each one to ChatGPT as a simple true-or-false question. Each hypothesis was submitted ten times with identical wording.
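
As a rough sketch of what that setup could look like in code - the model name, prompt wording, and answer handling below are illustrative assumptions, not the researchers' actual script - each hypothesis is posed as an identically worded true-or-false question ten times and the replies are collected:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_true_false(hypothesis: str, runs: int = 10, model: str = "gpt-4o-mini"):
    """Pose the same true/false question `runs` times and collect the answers.

    The prompt wording and model name are assumptions for illustration; the
    article only says each hypothesis was submitted ten times with identical
    wording, not how the question was phrased.
    """
    prompt = (
        "Answer with a single word, 'true' or 'false': "
        f"does the following hypothesis hold up? {hypothesis}"
    )
    answers = []
    for _ in range(runs):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(resp.choices[0].message.content.strip().lower())
    return answers


# Example: ten runs of one (made-up) hypothesis, identical wording every time.
print(ask_true_false("Higher ad spending increases brand recall."))
```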

The raw accuracy looked passable at first glance: 76.5% correct using ChatGPT-3.5 in 2024, improving to 80% with ChatGPT-5 mini in 2025. But those numbers are misleading. Since every question has only two possible answers, random guessing would get 50% right. After adjusting for that baseline, the AI's performance was only about 60% better than chance. In letter-grade terms, that is a low D.
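
To see where that figure comes from - assuming the straightforward arithmetic the article implies rather than the paper's exact formula - an 80% raw score sits 30 points above the 50% guessing floor, which is both a 60% relative improvement over chance and a score of 60 once the scale is stretched so that guessing counts as zero and perfection as 100 (hence the low D):

```python
# Assumed reading of the article's figures, not necessarily the paper's method.
raw_accuracy = 0.80   # ChatGPT-5 mini in 2025, per the article
chance = 0.50         # coin-flip baseline for a two-option question

relative_improvement = (raw_accuracy - chance) / chance       # 0.60 -> "60% better than chance"
rescaled = (raw_accuracy - chance) / (1.0 - chance) * 100     # 60.0 -> a low D as a grade

print(f"{relative_improvement:.0%} better than chance; rescaled score {rescaled:.0f}/100")
```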

The accuracy also masked a dangerous asymmetry. ChatGPT was reasonably good at confirming hypotheses that research had supported - but when a hypothesis had been disproven, it identified it correctly just 16.4% of the time. In other words, the AI has a strong bias toward saying "true."
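
A hypothetical back-of-the-envelope calculation shows how both things can be true at once; the 600/119 split and the 90% figure below are invented for illustration, since the article does not give the actual class mix - only the 16.4% figure comes from the reporting:

```python
# Hypothetical illustration: invented class mix, real 16.4% figure from the article.
supported, disproven = 600, 119              # assumed split of the 719 hypotheses
acc_supported, acc_disproven = 0.90, 0.164   # assumed vs. reported accuracy

overall = (supported * acc_supported + disproven * acc_disproven) / (supported + disproven)
print(f"overall accuracy ≈ {overall:.1%}")   # ≈ 77.8%, despite catching few disproven ones
```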

The consistency problem

Raw accuracy, however, was not the most troubling finding. The inconsistency was. Asked the same identically worded question ten times, ChatGPT gave a consistent answer only 73% of the time. In several cases, the split was exactly five true and five false - a coin flip masquerading as analysis.
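
As a simple illustration - not necessarily the metric the paper uses - one way to score that kind of consistency is to check, for each hypothesis, whether its ten answers all agree, or how large the majority is:

```python
from collections import Counter

# Illustrative consistency check, assuming ten recorded true/false answers per
# hypothesis; the sample data and the "all ten agree" criterion are assumptions,
# not necessarily what the paper reports as 73%.
answers_by_hypothesis = {
    "H1": ["true"] * 10,                  # fully consistent
    "H2": ["true"] * 5 + ["false"] * 5,   # the coin-flip split described above
}


def is_unanimous(answers):
    return len(set(answers)) == 1


def majority_share(answers):
    # Fraction of runs that agree with the most common answer (1.0 = unanimous).
    return Counter(answers).most_common(1)[0][1] / len(answers)


unanimous = sum(is_unanimous(a) for a in answers_by_hypothesis.values())
print(f"unanimous on {unanimous} of {len(answers_by_hypothesis)} hypotheses")
for name, answers in answers_by_hypothesis.items():
    print(name, f"majority share = {majority_share(answers):.1f}")
```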

This is not a minor technical quibble. If a business manager uses AI to evaluate a strategic hypothesis and gets an answer, they have no way of knowing whether asking again would produce the opposite conclusion. The model's linguistic confidence does not waver even when its answers do. It delivers wrong answers with the same polished fluency as correct ones.

Fluency without comprehension

The findings, published in the Rutgers Business Review, reinforce a distinction that AI researchers have long emphasized but that the public often misses: large language models are pattern-matching engines, not reasoning systems. They can produce grammatically perfect, contextually plausible text because they have absorbed vast amounts of language. But they do not understand what they are saying in any meaningful sense.

Cicek put it directly: current AI tools do not have a brain. They memorize patterns and can provide useful insights, but they do not understand the material they are working with. The study suggests the much-discussed arrival of artificial general intelligence - AI that can truly think - is further away than some technology advocates claim.

Not just ChatGPT

The published paper focused on ChatGPT, but Cicek has run similar tests with other AI tools and found comparable results. The problem appears to be structural - baked into how large language models work - rather than specific to a single product. This matters because organizations are increasingly embedding AI into decision-making processes that require exactly the kind of nuanced reasoning these tools struggle with.

The study also builds on Cicek's earlier work showing that consumers are actually less likely to buy products when marketing emphasizes AI involvement - suggesting that public enthusiasm for AI may be more tempered than the technology sector assumes.

What this means for using AI at work

The researchers are not arguing that AI is useless. Cicek himself uses it. But the findings make a strong case for treating AI output as a starting point rather than an answer - particularly for questions that involve nuance, conflicting evidence, or complex reasoning.

For business managers, the practical recommendation is straightforward: verify AI results independently, provide training on what AI can and cannot do well, and resist the temptation to treat fluent language as a proxy for accurate analysis. The gap between how confident AI sounds and how reliable it actually is remains wide enough to cause real damage when the stakes matter.

Source: Mesut Cicek (Washington State University), Sevincgul Ulu (Southern Illinois University), Can Uslay (Rutgers University), and Kate Karniouchina (Northeastern University). Published in the Rutgers Business Review.