Recursive Feature Machines Extract the Hidden Concepts Inside AI Models
When a large language model produces a confident but false answer - a hallucination - why does it do so? When a carefully phrased prompt bypasses a model's built-in safety refusals, what made the bypass possible? The answers to both questions lie inside the model, in the mathematical representations it builds of concepts, knowledge, and meaning. Those internal representations are generally inaccessible, obscured behind billions of numerical parameters that interact in ways no individual researcher can trace.
Daniel Beaglehole and colleagues have now introduced a technique that makes those representations accessible. Published in Science, their method uses a feature extraction algorithm called the Recursive Feature Machine (RFM) to identify and manipulate how multiple large-scale AI models - language models, reasoning models, and vision models - internally represent specific concepts.
What Recursive Feature Machines Do
The Recursive Feature Machine was developed in earlier work by the same team as a way to identify which features of its input a predictive model actually relies on - features that, in this setting, encode specific, recognizable concepts such as emotions, locations, physical properties, or abstract relationships. Applied to large language models, the technique maps from the model's internal numerical state to a structured representation of the concept the model is currently "thinking about" at each layer of its processing.
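The recursion at the heart of the method can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a Gaussian kernel, a fixed bandwidth, synthetic data, and a trace normalization. The core loop - fit a kernel ridge regression, then replace the feature matrix with the average outer product of the fitted predictor's input gradients (AGOP) - is what makes the feature machine "recursive":

```python
import numpy as np

def gauss_kernel(X, Z, M, sigma):
    # Gaussian kernel using the Mahalanobis distance ||x - z||_M
    XMX = np.einsum('ij,jk,ik->i', X, M, X)
    ZMZ = np.einsum('ij,jk,ik->i', Z, M, Z)
    d2 = XMX[:, None] - 2.0 * X @ M @ Z.T + ZMZ[None, :]
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def rfm(X, y, iters=5, reg=1e-3, sigma=2.0):
    """Recursively learn a feature matrix M via kernel ridge + AGOP."""
    n, d = X.shape
    M = np.eye(d)
    for _ in range(iters):
        K = gauss_kernel(X, X, M, sigma)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)  # kernel ridge fit
        # Input gradient of the fitted predictor at each training point:
        # grad f(x_i) is proportional to M @ sum_j alpha_j K_ij (x_i - x_j)
        W = K * alpha[None, :]
        G = (W.sum(axis=1)[:, None] * X - W @ X) @ M.T / sigma ** 2
        M = G.T @ G / n               # AGOP becomes the new feature matrix
        M = M * d / np.trace(M)       # normalization to keep the scale stable
    return M

# Toy check: the label depends only on coordinate 0 of a 5-d input,
# so the learned feature matrix should concentrate on that coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0])
M = rfm(X, y)
```

After a few iterations, the diagonal of M is dominated by the one coordinate the labels depend on - the machine has "found" the relevant feature. In the Science paper, the inputs are a model's hidden states rather than synthetic vectors, and the recovered features correspond to concepts.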
Because the technique extracts representations rather than just outputs, it operates at a level of the model that is normally invisible. Traditional analysis of AI behavior examines what a model says; this approach examines how the model has organized the internal knowledge that produces those outputs.
Transferable Concepts Across Languages and Models
One significant finding was that concept representations extracted from AI models trained primarily on English transferred to other languages. The same representation of a concept like "fear" or "anger" applied in Chinese and Hindi without language-specific retraining. This cross-lingual transferability suggests that the models have converged on a language-independent way of organizing conceptual knowledge - a finding with implications for both the science of how neural networks learn and the practical development of multilingual AI systems.
The team also demonstrated multi-concept steering: combining multiple concept representations to produce more complex behavioral changes than any single concept adjustment could achieve alone.
Exposing What Models Know But Don't Say
The research found clear evidence that models encode substantially more information than they express in their responses. By accessing internal representations directly, the team could identify when a model was in a state likely to produce a hallucination - before the hallucination appeared in the output. This real-time monitoring capability could have significant value for AI systems deployed in high-stakes settings, where false outputs have real consequences.
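A simplified version of such a monitor is a probe trained on hidden states labeled by whether they preceded a correct or a hallucinated answer. The sketch below uses a plain ridge-regression probe on synthetic states - the paper's monitor is built from RFM-derived features, not this probe:

```python
import numpy as np

def fit_probe(H, labels, reg=1e-2):
    # Ridge regression from hidden states H (n, d) to +/-1 risk labels
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + reg * np.eye(d), H.T @ labels)

def risk_score(w, h):
    # Higher score = state more likely to precede a hallucination
    return float(h @ w)

# Synthetic hidden states: one direction carries the "risk" signal
rng = np.random.default_rng(1)
true_dir = rng.normal(size=8)
H = rng.normal(size=(200, 8))
labels = np.where(H @ true_dir > 0, 1.0, -1.0)

w = fit_probe(H, labels)
acc = np.mean(np.sign(H @ w) == labels)
```

Because the score is computed from the hidden state, it is available before the model emits a token - which is what makes pre-output monitoring possible in principle.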
The same access that enables monitoring also enables manipulation. By adjusting concept representations toward or away from specific states, the researchers could steer model outputs in controlled directions - improving performance on some tasks, but also demonstrating that safety-relevant concepts like "refusal" can be weakened, leading models to provide information they would normally decline to give. These dual-use implications are discussed explicitly in the paper.
"Together, these results suggest that the models know more than they express in responses and that understanding internal representations could lead to fundamental performance and safety improvements," the researchers write.
What This Work Does Not Address
The technique works on open-source models where internal parameters are accessible. Commercial closed-source models do not permit the kind of internal access required for RFM analysis, so whether similar vulnerabilities exist in those systems cannot be determined from this work. The method also requires moderate computational resources and expertise to apply, limiting immediate widespread use to research teams rather than general practitioners.
The approach represents a significant step toward AI interpretability - understanding not just what AI systems produce but why - but interpretability of complex neural networks remains an open and active research problem.