(Press-News.org) By Alistair Jones
SMU Office of Research Governance & Administration – A dystopian future where advanced artificial intelligence (AI) systems replace human decision-making has long been a trope of science fiction. The malevolent computer HAL, which takes control of the spaceship in Stanley Kubrick's film, 2001: A Space Odyssey, is a chilling example.
But rather than being fearful of automation, a more useful response is to consider what types of repetitive human tasks could be safely offloaded to AI, particularly with the advances of large language models (LLMs) that can sort through vast amounts of data, see patterns and make predictions.
Such is the area of recent research co-authored by Christoph Treude, an Associate Professor of Computer Science at Singapore Management University (SMU). The team explores potential roles for LLMs in annotating software engineering artifacts, a process that is expensive and time-consuming when done manually.
"Our work has investigated when and how LLMs can reduce – not replace – human effort in software engineering annotation," Professor Treude says of his paper Can LLMs Replace Manual Annotation of Software Engineering Artifacts?, which won the ACM SIGSOFT Distinguished Paper Award at the 22nd International Conference on Mining Software Repositories (MSR2025).
"We examined 10 cases from previous work where multiple humans had annotated the same samples, covering subjective tasks such as rating the quality of code summaries. We found that for low-context, deductive tasks, one human can be replaced by an LLM to save effort without losing reliability. However, for high-context tasks, LLMs are unreliable.
"[In terms of impact], our study has already been cited many times as guidance on integrating LLMs into annotation workflows, though it is sometimes over-interpreted as endorsing full replacement of humans by LLMs, which we explicitly caution against."
Consistent agreement
So, what is the process of annotation in software engineering?
"This kind of annotation assigns predefined labels to artifacts when quality or meaning can’t be judged with simple metrics. For example, whether a generated code summary is accurate, adequate, concise, and so on," Professor Treude says.
"This annotated data allows researchers to evaluate tools, analyse developer behaviour, or train new models."
For the study, raters were arranged in pairs and the degree of agreement between each pair was scored.
"Their agreement reflects label reliability," Professor Treude says. "Our study compares human-human, human-model and model-model agreement to see whether LLMs behave like human raters."
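Pairwise agreement of this kind is commonly scored with a chance-corrected statistic such as Cohen's kappa. The paper's exact metric is not named in this article, so the following is an illustrative sketch of how agreement between two raters (human or model) might be computed; the labels are made up:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same samples."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of samples where the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled at random with their own marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

human = ["good", "bad", "good", "good", "bad", "good"]
model = ["good", "bad", "good", "bad", "bad", "good"]
print(round(cohens_kappa(human, model), 2))  # prints 0.67
```

The same function applies unchanged to human-human, human-model and model-model pairs, which is what makes the three comparisons in the study directly commensurable.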
Before the LLMs began labelling, the researchers provided them with a few examples of correct input-output pairs.
"These 'few-shot' prompts act as short training examples, helping the model learn the labelling style and increase consistency. It is a widely used best practice in LLM research," Professor Treude says.
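A few-shot prompt of the kind described simply prepends a handful of solved examples before the sample to be labelled. This sketch uses invented examples and label names; the study's actual prompts may differ:

```python
# Illustrative few-shot examples: (code, summary, correct label).
FEW_SHOT_EXAMPLES = [
    ("def add(a, b): return a + b", "Summary: 'Adds two numbers.'", "accurate"),
    ("def add(a, b): return a + b", "Summary: 'Sorts a list.'", "inaccurate"),
]

def build_prompt(code, summary):
    """Assemble a few-shot labelling prompt for one new sample."""
    lines = ["Label each code summary as 'accurate' or 'inaccurate'.", ""]
    for ex_code, ex_summary, ex_label in FEW_SHOT_EXAMPLES:
        lines += [f"Code: {ex_code}", ex_summary, f"Label: {ex_label}", ""]
    # The new sample ends with an open "Label:" for the model to complete.
    lines += [f"Code: {code}", summary, "Label:"]
    return "\n".join(lines)

print(build_prompt("def mul(a, b): return a * b",
                   "Summary: 'Multiplies two numbers.'"))
```

Because the examples fix both the output vocabulary and the expected format, the model's answers become easier to parse and compare against human labels.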
The researchers compared different large language models, including GPT-4, Claude 3.5 and Gemini 1.5. Did one stand out?
"Newer, stronger models show slightly higher consistency, but the main signal is the agreement across models. When several capable LLMs independently agree, the task is likely suitable for some automation, regardless of which specific model is used," Professor Treude says.
And for which tasks are LLMs most suitable as substitutes for human effort?
"LLMs work well for low-context, deductive labelling, tasks with clear, pre-defined categories, such as checking if a variable name matches its value, or assessing code summary quality," Professor Treude says.
"They struggle with high-context tasks, such as deciding whether a bug report was truly resolved. In our 10-task dataset, seven allowed the safe substitution of one human."
Situational awareness
According to the researchers, model-model agreement is a predictor of whether a given task may be suitable for LLMs.
"We found a strong positive correlation between model-model and model-human agreement. In other words, when multiple models give the same answer, they usually match human judgment as well," Professor Treude says.
"High model-model agreement can indicate that LLMs are able to perform a task well, while low agreement is an early warning to avoid automation."
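The predictor described above can be sketched as a simple check: run two models independently over the same samples and measure how often they agree. The threshold below is illustrative, not the paper's calibrated cut-off:

```python
def model_model_agreement(labels_m1, labels_m2):
    """Fraction of samples on which two independent models give the same label."""
    assert len(labels_m1) == len(labels_m2) and labels_m1
    return sum(a == b for a, b in zip(labels_m1, labels_m2)) / len(labels_m1)

# Hypothetical labels from two different LLMs on the same five samples.
model_a = ["yes", "no", "yes", "yes", "no"]
model_b = ["yes", "no", "yes", "no", "no"]

agreement = model_model_agreement(model_a, model_b)
if agreement >= 0.8:  # illustrative threshold only
    print("task looks suitable for partial automation")
else:
    print("low agreement: keep humans in the loop")
```

The appeal of this predictor is that it needs no human labels at all: disagreement between capable models is itself the early-warning signal.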
The researchers say it is unsafe to substitute LLMs for humans in high-context tasks, for example when working with static analysis warnings, which identify potential issues in source code, such as bugs, security vulnerabilities and coding standard violations.
"That task requires substantial contextual understanding: examining code changes, project history and warning semantics. Humans achieve high agreement, but models perform poorly. This demonstrates that LLMs still lack the deep situational awareness needed for complex software engineering tasks," Professor Treude says.
While humans can be replaced for selected samples, there are risks.
"High agreement doesn’t eliminate bias. LLMs may systematically err in one direction – for example, always giving higher quality scores – or reinforce patterns from their training data," Professor Treude says.
"There’s also a methodological risk of over-generalising these findings to other qualitative methods. To some extent, LLMs can support deductive coding, but they cannot replace the reflexivity and interpretation that are central to qualitative software engineering research.
"Our view is pragmatic: use LLMs to accelerate annotation where it’s safe, not to eliminate human judgment."
Beyond the scenario
"Our study is the first to systematically evaluate LLM substitution across software engineering annotation tasks and different LLMs," Professor Treude says.
"We introduced two data-driven predictors for task suitability and sample selection: model-model agreement (how often two LLMs independently agree) and model confidence (how sure the model is about its own label). Together, these imply a two-step workflow for deciding when LLM assistance is appropriate: first for the task, then for each sample."
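The two-step workflow can be sketched as a pair of gates: one at the task level using model-model agreement, then one per sample using model confidence. The function and thresholds below are illustrative assumptions, not the paper's calibrated values:

```python
def should_use_llm(task_agreement, sample_confidence,
                   task_threshold=0.8, confidence_threshold=0.9):
    """Two-step gate for LLM-assisted annotation.

    Step 1: is the task suitable at all (model-model agreement high enough)?
    Step 2: is the model sure enough about this particular sample?
    Thresholds are illustrative, not the paper's values.
    """
    if task_agreement < task_threshold:
        return False  # step 1: task unsuitable for any LLM substitution
    return sample_confidence >= confidence_threshold  # step 2: per-sample check

print(should_use_llm(0.85, 0.95))  # suitable task, confident sample -> True
print(should_use_llm(0.85, 0.60))  # suitable task, unsure sample -> False
print(should_use_llm(0.50, 0.99))  # unsuitable task -> False regardless
```

Samples that fail either gate fall back to a human annotator, which is what keeps the workflow a reduction of human effort rather than a replacement of it.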
But the researchers are aware that their study has limitations.
"We focused on a narrow slice of annotation: discrete, multiple-choice labelling with pre-defined categories on pre-selected samples," Professor Treude says.
"We didn’t study open-ended interpretive analysis, continuous ratings, or free-text outputs. We also didn’t test for demographic bias, longitudinal model drift, or prompt variation. Therefore, our results apply specifically to structured, deductive annotation."
The researchers describe their work as a first step.
"The next step is to go beyond the scenario of replacing a single annotator. We plan to investigate situations where the annotation categories are not pre-defined and where models must deal with open-ended or evolving codebooks. We are also exploring other types of software engineering tasks to test whether the same reliability patterns hold," Professor Treude says.
"More broadly, this is part of a larger research agenda on the use of AI in science, examining how these models might assist, and where they fall short, across different qualitative methods such as thematic analysis, grounded theory, or interview coding."
END
Reducing human effort in rating software
2025-11-28