Technology 2026-02-19 3 min read

AI Image Models Still Struggle with 'Left' and 'Right' - This Method Fixes That

Learn-to-Steer raised spatial accuracy in Stable Diffusion from 7% to 54% without retraining the model, by reading and redirecting internal attention patterns

Ask an image-generation model to create "a cat under the table" and there is roughly a 1 in 14 chance it places the cat in the correct spatial relationship. That 7% figure is not a quirk of a specific model - it reflects a fundamental gap between how these models process language and how they execute the geometric logic of spatial terms in an image.

The problem is well known. Models like Stable Diffusion and Flux.1 can produce photorealistic images of extraordinary detail, but ask them to arrange two objects in a specific spatial relationship - above, below, left of, right of - and they routinely get it wrong. The words are understood in a linguistic sense; the spatial logic is not reliably executed in the output.

A collaboration between researchers at Bar-Ilan University's Department of Computer Science and NVIDIA's AI research center in Israel has produced a method that substantially narrows that gap - without touching the models themselves.

Reading the model's internal geometry

The key insight behind Learn-to-Steer is that image-generation models are not completely blind to spatial relationships. They have internal attention patterns that encode, in some form, how objects are being arranged relative to one another. Those patterns are not directly visible in the output image, but they are present in the model's internal computations at every step of the generation process.

The Bar-Ilan team trained a lightweight classifier to read those internal attention patterns and determine, at each step, whether the model is tracking the spatial instruction correctly. When it detects that the model's internal representation is heading toward an incorrect arrangement, the classifier applies a small, targeted correction to the model's internal activations - steering the generation process back toward the intended spatial relationship.
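The loop described above can be sketched in miniature. This is a toy illustration, not the paper's implementation: the "classifier" is a random linear model over a flattened 8x8 attention map, and the correction is plain gradient ascent (by finite differences) on the classifier's score for the intended relation. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the trained relation classifier: 2-class linear model
# over a flattened 8x8 attention map (class 1 = "correct arrangement").
# In the actual method this would be trained on real attention patterns.
W = rng.normal(size=(2, 64))

def relation_score(attn):
    """Softmax probability that the attention map encodes the intended relation."""
    logits = W @ attn.ravel()
    e = np.exp(logits - logits.max())
    return (e / e.sum())[1]

def steer(attn, lr=0.5, steps=5):
    """Nudge the attention map uphill on the classifier score.

    Numerical gradient ascent stands in for backpropagating the
    classifier's signal into the model's activations at each step.
    """
    attn = attn.copy()
    eps = 1e-4
    for _ in range(steps):
        base = relation_score(attn)
        grad = np.zeros_like(attn)
        for i in np.ndindex(attn.shape):
            attn[i] += eps
            grad[i] = (relation_score(attn) - base) / eps
            attn[i] -= eps
        attn += lr * grad  # small, targeted correction toward the target relation
    return attn

attn = rng.normal(size=(8, 8))   # one cross-attention map at one denoising step
before = relation_score(attn)
after = relation_score(steer(attn))
assert after >= before  # steering should raise the "correct relation" score
```

In the real system this check-and-correct cycle would run at every denoising step, on the model's actual cross-attention maps rather than random ones.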

"Instead of assuming we know how the model should think, we allowed it to teach us," said Sapir Yiflach, the study's lead researcher. "This enabled us to guide its reasoning in real time, essentially reading and steering the model's thought patterns to produce more accurate results."

Measured performance gains

In Stable Diffusion SD2.1, the baseline accuracy on spatial relationship prompts was 7%. With Learn-to-Steer applied, it rose to 54% - an improvement by a factor of roughly 7.7. In Flux.1, the baseline was higher at 20%, and Learn-to-Steer brought it to 61%. Critically, both models maintained their overall image quality and general performance. The correction is precise enough to improve spatial reasoning without degrading other capabilities.

The method requires no retraining of the base model - a significant practical advantage. Retraining a large image-generation model is computationally expensive, requires large labeled datasets, and risks degrading performance on tasks that were not part of the retraining objective. Learn-to-Steer sidesteps all of those costs by operating as an inference-time intervention on a frozen model.
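The "frozen model, inference-time intervention" pattern is simple to state in code. A minimal sketch, with entirely hypothetical functions: the base layer's behavior is never modified; a wrapper adjusts only the activations flowing through it at inference time.

```python
def frozen_layer(x):
    """Stand-in for a pretrained layer; its 'weights' are never updated."""
    return [2.0 * v for v in x]

def with_steering(layer, correction):
    """Wrap a frozen layer with an inference-time correction to its output."""
    def steered(x):
        out = layer(x)
        return [v + c for v, c in zip(out, correction)]
    return steered

steered = with_steering(frozen_layer, correction=[0.1, -0.1])

assert frozen_layer([1.0, 1.0]) == [2.0, 2.0]  # base model untouched
assert steered([1.0, 1.0]) == [2.1, 1.9]       # activations nudged at inference
```

Because the base model is untouched, the intervention can be switched off at any time and cannot degrade capabilities learned during pretraining.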

Why spatial reasoning is hard for these models

Image-generation models learn from massive datasets of image-text pairs, but spatial language in those pairs is often ambiguous or inconsistently applied. A caption might say "a dog beside a car" without specifying left or right. Training on such data teaches statistical associations between words and visual patterns, not the geometric logic of spatial relationships. The result is a model that knows what "left of" means in a linguistic sense but does not reliably execute the spatial constraint in its outputs.

The attention mechanism that Learn-to-Steer exploits is the same mechanism responsible for linking text tokens to image regions during generation. By analyzing how spatial tokens like "left," "under," and "beside" are attended to relative to object tokens, the classifier can infer whether the model is correctly maintaining the spatial constraint at any given generation step.
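The kind of signal the classifier reads can be illustrated with toy cross-attention. This sketch assumes random embeddings and invented token names; it only shows the mechanics: each text token induces an attention map over image regions, and simple statistics of those maps (here, a horizontal center of mass) carry layout information.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
# Hypothetical tokens for a prompt like "cat left of dog".
tokens = ["cat", "left", "dog"]
text_emb = rng.normal(size=(3, d))    # token embeddings (random stand-ins)
img_emb = rng.normal(size=(64, d))    # features for an 8x8 grid of image patches

def cross_attention(queries, keys):
    """Rows: image patches; columns: softmax attention over text tokens."""
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

A = cross_attention(img_emb, text_emb)                         # shape (64, 3)
# Per-token spatial maps: how strongly each region attends to each token.
maps = {tok: A[:, i].reshape(8, 8) for i, tok in enumerate(tokens)}

# A crude layout readout: the horizontal center of mass of the "cat" map.
xs = np.arange(8)
cat_cx = (maps["cat"].sum(axis=0) * xs).sum() / maps["cat"].sum()
```

A trained classifier would consume features like these maps, rather than a hand-picked statistic, to judge whether the generation is on track.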

Potential applications and current limitations

Accurate spatial reasoning in generated images has practical value in design tools, educational content, assistive technology, and any application where precise layout matters. A graphic designer using AI assistance to mock up a diagram, or an educator generating instructional illustrations, depends on objects ending up where they are described as being.

The current study evaluated performance on a specific benchmark of spatial relationship prompts. How well Learn-to-Steer generalizes to more complex scenes with three or more objects, or to ambiguous spatial language, was not fully characterized. The method also assumes that the model's internal attention patterns encode spatial information in a form the classifier can read - an assumption that may not hold equally well across all model architectures. The research is being presented at the WACV 2026 Conference in Tucson, Arizona, in March 2026.

Source: Bar-Ilan University. Study presented at WACV 2026. Lead researcher: Sapir Yiflach. Co-authors: Prof. Gal Chechik (Bar-Ilan University / NVIDIA), Dr. Yuval Atzmon (NVIDIA). Media contact: Elana Oberlander, elanadovrut@gmail.com, +972-353-17395.