Even the best AI coding tools get structured outputs wrong 25% of the time
The structured output promise
Software development has been quietly reorganizing itself around a bet: that AI models can produce not just natural language responses but precisely formatted outputs - JSON, XML, Markdown - that slot directly into automated workflows. Companies like OpenAI, Google, and Anthropic have built structured output features into their models specifically to enable this. The idea is that developers can treat an LLM like a function that returns clean, machine-readable data rather than conversational text.
A new benchmarking study from the University of Waterloo suggests the bet is not paying off yet. The most advanced commercial models achieved roughly 75% accuracy on structured output tasks. Open-source alternatives landed around 65%. In a domain where a single malformed JSON object can crash a pipeline, those error rates are not minor.
44 tasks, 18 formats, 11 models
The study, titled "StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs," tested 11 large language models across 18 structured output formats and 44 tasks. The evaluation measured not just whether the outputs followed syntactic rules - proper brackets, valid schemas - but also whether the content itself was accurate.
"We want to measure not only the syntax of the code - that is, whether it's following the set rules - but also whether the outputs produced for various tasks were accurate," said Dongfu Jiang, a PhD student in computer science and co-first author. "We found that while they do okay with text-related tasks, they really struggle on tasks involving image, video, or website generation."
That distinction matters. A model might produce valid JSON that parses without errors but contains fabricated data or mischaracterized content. Syntactic correctness is necessary but not sufficient. The 25% failure rate for top models encompasses both structural errors and content errors.
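The gap between the two failure modes is easy to show in a few lines of Python. This is an illustrative sketch, not from the study: the JSON payload and the "ground truth" record are invented to demonstrate how a response can pass every parser and schema check while still being factually wrong.

```python
import json

# Hypothetical model response: it parses cleanly, so a syntax-only
# check (the kind a JSON parser or schema validator performs) passes.
model_output = '{"title": "Q3 Report", "revenue": 125000, "currency": "USD"}'
record = json.loads(model_output)  # no exception raised

# But parsing says nothing about content. Suppose the source document
# actually reported revenue of 1,250,000: the output is wrong even
# though every structural check succeeds.
ground_truth = {"title": "Q3 Report", "revenue": 1250000, "currency": "USD"}

syntax_ok = isinstance(record, dict)   # structural check: passes
content_ok = record == ground_truth    # content check: fails

print(syntax_ok, content_ok)  # True False
```

A pipeline that only validates structure would wave this record straight through, which is why the benchmark scores syntax and content separately.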
The gap between demo and deployment
Structured outputs have been marketed as a solution to one of the biggest practical problems in AI-assisted development: getting model responses into a format that other software can consume. In a demo, asking a model to return a JSON object with specific fields looks seamless. In production, where thousands of API calls per hour need to produce valid, accurate, consistent outputs, a 25% error rate creates a reliability problem that human developers must absorb.
"Developers might have these agents working for them, but they still need significant human supervision," Jiang said. That supervision cost - reviewing AI outputs for both structural validity and factual accuracy - partially erodes the productivity gains that motivated adopting AI tools in the first place.
Where the failures cluster
The study found that text-based structured tasks - generating formatted reports, converting between text formats - performed reasonably well. The failures concentrated in multimodal tasks: generating code that produces images, creating website layouts, or producing structured descriptions of video content. These tasks require the model to maintain spatial, temporal, or visual reasoning while simultaneously adhering to strict formatting rules. Current architectures struggle to do both.
The research was a collaborative effort involving Jiang, undergraduate student Jialin Yang, and Dr. Wenhu Chen, an assistant professor at Waterloo, with annotations from 17 additional researchers worldwide. It appears in Transactions on Machine Learning Research and will be presented at ICLR 2026.
The honest assessment
The study does not argue that structured outputs are a dead end. It argues that the technology is not yet reliable enough for unsupervised use. For development teams considering how much to trust AI-generated structured data, the benchmark provides a concrete answer: trust, but verify. And budget for the verification.
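In code, "trust, but verify" amounts to a validation gate between the model and the rest of the pipeline. The sketch below is a minimal, assumed design - the field names and the `verify` helper are hypothetical, not part of the study or any vendor API - showing how a team might catch malformed or incomplete structured outputs and route them to human review instead of letting them crash downstream consumers.

```python
import json

# Hypothetical contract for this example: downstream code expects an
# object with a string "name" and a numeric "score".
REQUIRED = {"name": str, "score": (int, float)}

def verify(raw: str):
    """Return (record, errors); a non-empty error list means the
    output should go to human review rather than into the pipeline."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return None, ["top-level value is not an object"]
    errors = []
    for field, types in REQUIRED.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], types):
            errors.append(f"wrong type for field: {field}")
    return record, errors

ok, errs = verify('{"name": "alpha", "score": 0.9}')    # errs == []
bad, errs2 = verify('{"name": "alpha"}')                # flags missing score
```

Note that a gate like this only covers the structural half of the problem; content accuracy, the other failure mode the benchmark measures, still requires a reviewer or an independent check against the source data.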
Dr. Chen noted a broader trend at Waterloo: "Students often begin as annotators, then organize projects and create their own benchmarking studies. They're not just using AI in their studies - they're building, researching and evaluating it." That pipeline of critical evaluation may prove as valuable as the AI tools themselves. Without honest benchmarks, the gap between marketing claims and actual reliability remains invisible until something breaks in production.