Top AI coding tools make mistakes one in four times

Benchmarking research shows leading AI models still struggle to reliably produce structured outputs used in software development

2026-03-17

(Press-News.org) New research from the University of Waterloo shows that artificial intelligence (AI) still struggles with some basic software development tasks, raising questions about how reliably AI systems can assist developers.

As Large Language Models (LLMs) are increasingly incorporated into software development, developers have struggled to ensure that AI-generated responses are accurate, consistent, and easy to integrate into larger development workflows.

Previously, LLMs responded to software development prompts with free-form natural-language answers, which are difficult for other programs to process. To address this problem, several AI companies, including OpenAI, Google and Anthropic, have introduced “structured outputs.” These outputs constrain LLM responses to predefined formats such as JSON, XML, or Markdown, making them easier for both humans and software systems to read and process.
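To illustrate why a predefined format matters for integration, here is a minimal Python sketch of a check a developer might run before passing a model reply to downstream code. The function name `parse_structured_reply` and the `required_keys` parameter are hypothetical and are not taken from the study or from any vendor's API; the sketch only assumes the reply is supposed to be a JSON object.

```python
import json
from typing import Optional, Set


def parse_structured_reply(reply: str, required_keys: Set[str]) -> Optional[dict]:
    """Return the parsed JSON object, or None if the reply is unusable.

    Hypothetical helper: the reply is accepted only if it is valid JSON,
    is a JSON object, and contains every key in required_keys.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        # Free-form text, or JSON wrapped in prose, fails here.
        return None
    if not isinstance(data, dict):
        # Valid JSON, but not the object shape the caller expects.
        return None
    if not required_keys.issubset(data.keys()):
        # Parses fine, but is missing fields the workflow depends on.
        return None
    return data
```

A reply such as `Sure! Here is the JSON: {"name": "x"}` is rejected by this check even though it contains JSON, which is the kind of integration failure that a fixed output format is meant to avoid.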

A new benchmarking study from Waterloo, however, shows that the technology is not yet as reliable as many developers had hoped. Even the most advanced models achieved only about 75 per cent accuracy in the tests, while open-source models scored closer to 65 per cent.

The study evaluated 11 LLMs across 18 structured output formats and 44 tasks designed to assess how reliably the systems followed structured rules.

“With this kind of study, we want to measure not only the syntax of the code – that is, whether it’s following the set rules – but also whether the outputs produced for various tasks were accurate,” said Dongfu Jiang, a PhD student in computer science and co-first author on the research. “We found that while they do okay with text-related tasks, they really struggle on tasks involving image, video, or website generation.”
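The distinction Jiang draws between syntax and accuracy can be sketched in a few lines of Python. This is not the StructEval scoring method; it is a simplified illustration that assumes JSON outputs and exact-match answers, and the function name `evaluate` and its parameters are hypothetical.

```python
import json
from typing import Any, List, Tuple


def evaluate(outputs: List[str], expected: List[Any]) -> Tuple[float, float]:
    """Return (syntax_rate, accuracy_rate) for a batch of model outputs.

    Illustrative only: "syntax" means the output parses as JSON, and
    "accuracy" means the parsed value equals the expected value exactly.
    """
    total = len(outputs)
    if total == 0:
        return 0.0, 0.0

    syntax_ok = 0
    correct = 0
    for raw, want in zip(outputs, expected):
        try:
            got = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Fails the syntax check, so it cannot be correct.
        syntax_ok += 1
        if got == want:
            correct += 1
    return syntax_ok / total, correct / total


# Hypothetical outputs: three of four parse, but only two match.
outputs = ['{"a": 1}', '{"a": 2}', 'not json', '{"a": 4}']
expected = [{"a": 1}, {"a": 1}, {"a": 3}, {"a": 4}]
syntax_rate, accuracy_rate = evaluate(outputs, expected)
```

Reporting the two rates separately shows that an output can follow the format rules and still give the wrong answer.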

The study was a collaborative effort involving Waterloo’s Jialin Yang, an undergraduate student, and Dr. Wenhu Chen, an assistant professor of computer science, and incorporated annotations from 17 other researchers at Waterloo and around the world.

“There have been a lot of similar benchmarking projects happening in our labs recently,” Chen said. “At Waterloo, students often begin as annotators, then organize projects and create their own benchmarking studies. They’re not just using AI in their studies – they’re building, researching and evaluating it.”

While structured outputs from LLMs are an exciting step for software development, the researchers say the systems are not yet reliable enough to operate without human oversight. “Developers might have these agents working for them, but they still need significant human supervision,” Jiang said.

The research, “StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs,” appears in Transactions on Machine Learning Research and will be presented at ICLR 2026.

END
