AI Models for Wildlife ID Work Well in the Lab. In the Field, They Often Fall Apart.
A model that can identify 97% of cats in stock photography may be nearly useless at spotting a cat in the wild. That gap - between benchmark performance and real-world capability - is at the center of a pointed critique published in PLOS Biology by two researchers at the University of Exeter. Their target is the growing use of AI imaging systems in ecology and wildlife conservation, and their argument is that the field has a transferability problem it has not yet reckoned with honestly.
The Problem With Benchmarks
When developers test an AI model, they typically hold back a portion of their training dataset and measure how well the model performs on those held-out examples. The accuracy score that results becomes the benchmark - the number that goes into press releases, grant applications, and product marketing. The implicit claim is that this number tells you something about how the model will behave on new data in the real world.
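For readers who want to see the mechanics, here is a minimal sketch of that procedure in Python with scikit-learn. The data and model are hypothetical stand-ins for an image classifier; the detail to notice is that the training and test examples are carved from the same dataset.

```python
# Minimal sketch of held-out benchmark evaluation (toy data, not a real model).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))    # stand-in for image features
y = (X[:, 0] > 0).astype(int)      # stand-in for species labels

# Hold back 20% of the SAME dataset to produce the benchmark number.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f"benchmark accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
```

The accuracy printed at the end is the number that travels; everything about where the test images came from does not.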
Dr. Thomas O'Shea-Wheller, from the Environment and Sustainability Institute at Exeter's Penryn Campus in Cornwall, says that claim is frequently wrong. "The take-home message is that despite being considered as the gold standard, performance benchmarks do not reliably indicate the true ability of AI models," he said. "We see lots of claims purporting to compare the ability of the latest models to humans across very broad scenarios. However, these are derived from performance testing on datasets that do not always carry over to real-world tasks."
The problem is that training datasets and test datasets are typically drawn from the same source. A model trained and tested on stock images of cats learns what stock-image cats look like. Deployed to identify cats from camera traps in a forest, it encounters lighting conditions, angles, partial occlusions, and background contexts it has never seen. Performance degrades - sometimes catastrophically.
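A toy continuation of the sketch above makes the failure mode concrete (this is an illustration, not the paper's experiment): the model learns to lean on a shortcut feature that is predictive in the training source but absent in the field, so the benchmark looks excellent while field accuracy collapses.

```python
# Toy illustration of the transferability gap (not from the paper).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_images(n, shortcut=True):
    """Stand-in features; feature 0 carries the real (noisy) signal."""
    X = rng.normal(size=(n, 16))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    if shortcut:
        # In the training source, feature 1 is a near-perfect shortcut,
        # like a consistent background or lighting cue in stock photos.
        X[:, 1] = y + 0.1 * rng.normal(size=n)
    return X, y

X, y = make_images(2000, shortcut=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f"benchmark accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")

# "Field" data: the animal looks the same, but the shortcut cue is gone.
X_field, y_field = make_images(2000, shortcut=False)
print(f"field accuracy:     {accuracy_score(y_field, model.predict(X_field)):.2%}")
```

Because the held-out test set came from the same source, the shortcut is present there too, and the benchmark never detects the dependence on it.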
Confident and Wrong
Katie Murray, from the Centre for Ecology and Conservation, identifies what may be the most practically dangerous aspect of the problem: the model often does not know it is failing. "In the case of wildlife identification, you can end up with something that is not working well, but seems very confident in its conclusions," she said. "Put simply, AI struggles with things it has not seen before, but it will not necessarily express that to the user."
This is a well-documented pathology in machine learning known as overconfidence: models trained on narrow datasets produce high confidence scores even for inputs that fall far outside their training distribution. For a wildlife biologist relying on automated species counts to inform conservation decisions, confident-but-wrong identifications can corrupt entire datasets before anyone notices something is off. "When a model fails, it is often not detected until extensive damage is done," O'Shea-Wheller said.
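Again as a toy illustration rather than anything from the paper: most classifiers return a confidence score alongside each prediction, and on inputs far outside the training distribution that score can remain as decisive as ever.

```python
# Toy illustration of overconfidence on out-of-distribution inputs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))    # in-distribution training data
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)

# An input unlike anything in training: every feature is ~10 sigma out.
x_ood = np.full((1, 16), 10.0)
proba = model.predict_proba(x_ood)[0]
print(f"predicted class {proba.argmax()} with {proba.max():.0%} confidence")
# The score looks just as decisive as on familiar data; nothing in the
# output signals that the input is far outside the training distribution.
```

The confidence score measures agreement within the model, not whether the input resembles anything the model was trained on.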
Ecology Is Particularly Exposed
The paper uses examples from both species identification and medical diagnostic imaging to illustrate the transferability gap, but the authors argue ecology faces particular risks. AI imaging systems are increasingly sold to conservation programs as tools that can scale up monitoring without scaling up human labor. The appeal is real: automated camera trap analysis, species abundance estimation, and behavioral observation at landscape scale are genuinely useful if they work.
The problem is that ecosystems are not standardized datasets. A model trained on camera trap images from one habitat will encounter different vegetation, different lighting patterns, different animal behaviors, and sometimes different subspecies or age classes in a new location. O'Shea-Wheller emphasizes that the issue is not with AI as a technology but with how it is being evaluated and deployed. "AI can be incredibly powerful, but context is key - models need to be evaluated in their real-world use cases, and if they are not, it can lead to serious issues down the line."
What the Researchers Recommend
The paper stops short of calling for a moratorium on AI in ecology. Instead, the authors call for more honest communication about benchmark limitations and wider adoption of tools that let models be tested quickly in their actual deployment environments before they are relied on for decisions.
Their core argument is simple: benchmarks should not be used to estimate generalized model performance. "As things currently stand, the only reliable way to evaluate how well an AI model will work is to actually test it within your specific use case," O'Shea-Wheller said. That means more domain-specific validation studies and less trust in numbers generated from datasets that may bear only superficial resemblance to the environments where the model will actually operate.
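As a hypothetical sketch of what "test it within your specific use case" can look like in practice, the snippet below scores a model's output against a small hand-labeled sample of deployment images and reports an uncertainty interval rather than a bare accuracy figure (the counts are invented):

```python
# Sketch: estimating field performance from a small hand-labeled sample
# drawn from the actual deployment setting (hypothetical numbers).
import math

def field_accuracy_ci(n_correct: int, n_total: int, z: float = 1.96):
    """Accuracy on a labeled field sample, with a 95% Wilson interval."""
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    centre = (p + z**2 / (2 * n_total)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / n_total + z**2 / (4 * n_total**2)
    )
    return p, (max(0.0, centre - half), min(1.0, centre + half))

# e.g. 163 of 200 camera-trap images identified correctly by the model
acc, (lo, hi) = field_accuracy_ci(163, 200)
print(f"field accuracy {acc:.1%}, 95% interval [{lo:.1%}, {hi:.1%}]")
```

A few hundred labeled images from the deployment site will not settle every question, but even a rough, honestly bounded field estimate says more about a model's usefulness than a benchmark computed on someone else's data.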