Steering AI Output by Adjusting Concepts Inside the Model - and the Security Risks It Reveals
The inner workings of large language models remain opaque even to the researchers who build them. These systems process inputs through billions of mathematical operations, transforming text into predictions in ways that have resisted straightforward interpretation. That opacity makes it difficult to correct errors, understand failures, or guarantee that safety measures will hold under adversarial conditions.
A team led by Mikhail Belkin at the University of California San Diego and Adit Radhakrishnan at MIT has now demonstrated a method to directly access and adjust the internal conceptual representations inside several major open-source language models. The work, published in Science, builds on a 2024 paper by the same team that introduced algorithms called Recursive Feature Machines, which identify patterns within a language model's internal operations that correspond to specific concepts. The new paper shows how to modify those patterns - and what happens when you do.
What the Method Does
The researchers applied their steering approach to large open-source models including Llama and DeepSeek, working with 512 distinct concepts spanning five categories, including fears, moods, and locations. They then mathematically increased or decreased how strongly each concept was represented in the model's outputs.
The work required relatively modest computational resources: on a single NVIDIA A100 GPU, the team needed less than one minute and fewer than 500 training samples to identify the relevant patterns and steer the model toward a concept of interest. That efficiency is notable. Existing methods for modifying model behavior generally require retraining or fine-tuning with much larger datasets and far more compute time. The Recursive Feature Machine approach instead operates on the already-trained model's internal structure, without updating its weights through additional training.
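To make the general idea concrete, the toy sketch below uses a difference-of-means steering vector, a common baseline in the activation-steering literature. This is not the paper's Recursive Feature Machine algorithm; the array sizes, sample data, and the `steer` function are illustrative assumptions. It shows only the core mechanic: estimate a direction in the model's hidden-state space associated with a concept, then shift activations along it, leaving the weights untouched.

```python
import numpy as np

# Illustrative sketch only: a difference-of-means "steering vector",
# a simple baseline for concept steering. The published method (RFM)
# is different; this conveys the general mechanic, not the paper's code.

rng = np.random.default_rng(0)
d = 64  # toy hidden-state dimensionality

# Toy stand-ins for hidden states collected from prompts that do /
# do not express the target concept (e.g. a particular mood).
concept_acts = rng.normal(loc=0.5, scale=1.0, size=(200, d))
neutral_acts = rng.normal(loc=0.0, scale=1.0, size=(200, d))

# Estimate the concept direction from a few hundred samples,
# then normalize it to unit length.
direction = concept_acts.mean(axis=0) - neutral_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(hidden_state: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the concept direction.

    alpha > 0 amplifies the concept; alpha < 0 suppresses it.
    The model's weights are never modified.
    """
    return hidden_state + alpha * direction

h = rng.normal(size=d)
h_up = steer(h, alpha=3.0)     # concept amplified
h_down = steer(h, alpha=-3.0)  # concept suppressed

# The projection onto the (unit) concept direction moves by exactly alpha.
print(round(float((h_up - h) @ direction), 3))    # 3.0
print(round(float((h_down - h) @ direction), 3))  # -3.0
```

Because the adjustment is a single vector addition per hidden state, it is cheap to apply at inference time, which is consistent with the sub-minute, few-hundred-sample budget the team reports.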
The concept representations also proved transferable across languages. The same steering approach that worked in English also functioned in Chinese and Hindi, suggesting that the concepts are encoded in ways that transcend specific linguistic forms.
Useful Applications and Demonstrated Risks
On the beneficial side, the team showed that steering improved model performance on narrow, precise tasks. Translating from Python to C++ code - a task with a clear correct answer - became more accurate when the relevant conceptual representations were enhanced. The method also identified hallucinations: responses where the model generated false information with apparent confidence.
The same capability has a more troubling face. By reducing the strength of the concept of "refusal" inside a model, the researchers caused the model to operate outside its safety guardrails - a practice known as jailbreaking. In demonstrated examples, a model provided instructions for illicit drug use. It also generated Social Security numbers, though whether these were real numbers or fabricated ones was unclear.
The method could additionally be used to amplify political bias and conspiracy-theory thinking. In documented tests, a model claimed that satellite imagery showing Earth's curvature was the result of a NASA conspiracy, and another claimed that a COVID-19 vaccine was poisonous.
The dual-use nature of the findings is explicit in the paper. The researchers note that "newer and larger LLMs were more steerable," meaning the most capable models are also the most susceptible to this form of manipulation.
Limits and What Was Not Tested
The researchers could not test their approach on commercial, closed-source models such as ChatGPT or Claude because those systems do not provide access to internal representations. The steering method requires the ability to examine the model's internal structure, which is available only for open-source models. Whether similar vulnerabilities exist in closed models cannot be determined from this work.
The method was also tested in controlled research conditions. Real-world adversarial use would face additional obstacles, including the need to implement the steering technique on a local copy of the model. The findings are most directly relevant to security researchers, AI safety teams, and organizations deploying open-source models in sensitive contexts.
Next steps the team identified include improving the steering method to adapt to specific inputs and applications, rather than applying global conceptual adjustments. The research was supported by the National Science Foundation, the Simons Foundation, the UC San Diego-led TILOS institute, and the U.S. Office of Naval Research.