PRESS-NEWS.org - Press Release Distribution

Reducing human effort in rating software

2025-11-28
(Press-News.org) By Alistair Jones

SMU Office of Research Governance & Administration – A dystopian future where advanced artificial intelligence (AI) systems replace human decision-making has long been a trope of science fiction. The malevolent computer HAL, which takes control of the spaceship in Stanley Kubrick's film, 2001: A Space Odyssey, is a chilling example.

But rather than being fearful of automation, a more useful response is to consider what types of repetitive human tasks could be safely offloaded to AI, particularly with the advances of large language models (LLMs) that can sort through vast amounts of data, see patterns and make predictions.

Such is the area of recent research co-authored by Christoph Treude, an Associate Professor of Computer Science at Singapore Management University (SMU). The team explores potential roles for LLMs in annotating software engineering artifacts, a process that is expensive and time-consuming when done manually.

"Our work has investigated when and how LLMs can reduce – not replace – human effort in software engineering annotation," Professor Treude says of his paper Can LLMs Replace Manual Annotation of Software Engineering Artifacts?, which won the ACM SIGSOFT Distinguished Paper Award at the 22nd International Conference on Mining Software Repositories (MSR 2025).

"We examined 10 cases from previous work where multiple humans had annotated the same samples, covering subjective tasks such as rating the quality of code summaries. We found that for low-context, deductive tasks, one human can be replaced by an LLM to save effort without losing reliability. However, for high-context tasks, LLMs are unreliable. 

"[In terms of impact], our study has already been cited many times as guidance on integrating LLMs into annotation workflows, though it is sometimes over-interpreted as endorsing full replacement of humans by LLMs, which we explicitly caution against."

Consistent agreement

So what does annotation involve in software engineering?

"This kind of annotation assigns predefined labels to artefacts when quality or meaning can’t be judged with simple metrics. For example, whether a generated code summary is accurate, adequate, concise and similar," Professor Treude says. 

"This annotated data allows researchers to evaluate tools, analyse developer behaviour, or train new models."

For the study, raters were arranged in pairs and the degree of agreement between each pair was scored.

"Their agreement reflects label reliability," Professor Treude says. "Our study compares human-human, human-model and model-model agreement to see whether LLMs behave like human raters."
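
To make pairwise agreement concrete, here is a minimal Python sketch. The article does not name the agreement statistic the study used, so Cohen's kappa appears here only as one common chance-corrected choice. The function name and the labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty sample set")
    n = len(labels_a)
    # Observed agreement: share of samples where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled at random using their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six code summaries (illustrative only).
human_1 = ["good", "good", "bad", "good", "bad", "bad"]
human_2 = ["good", "bad", "bad", "good", "bad", "bad"]
model_1 = ["good", "good", "bad", "good", "bad", "good"]

human_human = cohen_kappa(human_1, human_2)
human_model = cohen_kappa(human_1, model_1)
```

Comparing `human_human` with `human_model` (and, with a second model, a model-model score) mirrors the three kinds of pairing Professor Treude describes.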

Before the LLMs began labelling, the researchers provided them with a few examples of correct input-output pairs. 

"These 'few-shot' prompts act as short training examples, helping the model learn the labelling style and increase consistency. It is a widely used best practice in LLM research," Professor Treude says.
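
As an illustration of what such a prompt might look like, the sketch below assembles an instruction, a handful of worked input-output pairs and the new item to label. The prompt wording, the function name and the examples are assumptions made for this sketch. The article does not reproduce the prompts used in the study.

```python
def build_few_shot_prompt(instruction, labels, examples, target):
    """Assemble a labelling prompt: instruction, worked examples, then the new item."""
    lines = [instruction, "Allowed labels: " + ", ".join(labels), ""]
    for text, label in examples:
        if label not in labels:
            raise ValueError(f"example label {label!r} is not an allowed label")
        lines += [f"Input: {text}", f"Label: {label}", ""]
    # The prompt ends with an open "Label:" slot for the model to complete.
    lines += [f"Input: {target}", "Label:"]
    return "\n".join(lines)


# Hypothetical task and examples (not taken from the study).
prompt = build_few_shot_prompt(
    instruction="Rate whether the summary accurately describes the code.",
    labels=["accurate", "inaccurate"],
    examples=[
        ("code: return a + b | summary: adds two numbers", "accurate"),
        ("code: return a - b | summary: multiplies two numbers", "inaccurate"),
    ],
    target="code: return max(xs) | summary: returns the largest element",
)
```

The resulting string would be sent to whichever model is being evaluated. Restricting the answer to a fixed label set keeps the output comparable with human labels.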

The researchers compared different large language models, including GPT-4, Claude 3.5 and Gemini 1.5. Did one stand out?

"Newer, stronger models show slightly higher consistency, but the main signal is the agreement across models. When several capable LLMs independently agree, the task is likely suitable for some automation, regardless of which specific model is used," Professor Treude says.

And for which tasks are LLMs best suited to standing in for human effort?

"LLMs work well for low-context, deductive labelling, tasks with clear, pre-defined categories, such as checking if a variable name matches its value, or assessing code summary quality," Professor Treude says. 

"They struggle with high-context tasks, such as deciding whether a bug report was truly resolved. In our 10-task dataset, seven allowed the safe substitution of one human."

Situational awareness

According to the researchers, model-model agreement is a predictor of whether a given task may be suitable for LLMs.

"We found a strong positive correlation between model-model and model-human agreement. In other words, when multiple models give the same answer, they usually match human judgment as well," Professor Treude says. 

"High model-model agreement can indicate that LLMs are able to perform a task well, while low agreement is an early warning to avoid automation."

The researchers say it is unsafe to substitute LLMs for humans in high-context tasks, for example when working with static analysis warnings, which identify potential issues in source code, such as bugs, security vulnerabilities and coding standard violations.

"That task requires substantial contextual understanding: examining code changes, project history and warning semantics. Humans achieve high agreement, but models perform poorly. This demonstrates that LLMs still lack the deep situational awareness needed for complex software engineering tasks," Professor Treude says.

While humans can be replaced for selected samples, there are risks.

"High agreement doesn’t eliminate bias. LLMs may systematically err in one direction: for example, always giving higher quality scores, or reinforce patterns from their training data," Professor Treude says. 

"There’s also a methodological risk of over-generalising these findings to other qualitative methods. To some extent, LLMs can support deductive coding, but they cannot replace the reflexivity and interpretation that are central to qualitative software engineering research.

"Our view is pragmatic: use LLMs to accelerate annotation where it’s safe, not to eliminate human judgment."

Beyond the scenario

"Our study is the first to systematically evaluate LLM substitution across software engineering annotation tasks and different LLMs," Professor Treude says. 

"We introduced two data-driven predictors for task suitability and sample selection: model-model agreement (how often two LLMs independently agree) and model confidence (how sure the model is about its own label). Together, these imply a two-step workflow for deciding when LLM assistance is appropriate: first for the task, then for each sample."
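
The two-step workflow can be sketched roughly as follows. Both thresholds are hypothetical placeholders, since the article gives no cut-off values. The sketch also uses raw pairwise agreement for simplicity, where a real pipeline might prefer a chance-corrected statistic.

```python
from itertools import combinations

# Hypothetical cut-offs; the article does not state the values used in the study.
TASK_AGREEMENT_THRESHOLD = 0.7
SAMPLE_CONFIDENCE_THRESHOLD = 0.9


def mean_pairwise_agreement(model_labels):
    """Average share of identical labels over every pair of models."""
    pairs = list(combinations(model_labels.values(), 2))
    if not pairs:
        raise ValueError("need labels from at least two models")
    rates = [sum(a == b for a, b in zip(x, y)) / len(x) for x, y in pairs]
    return sum(rates) / len(rates)


def route_samples(model_labels, confidences):
    """Decide per sample whether an LLM may stand in for one human annotator."""
    # Step 1 (task level): low model-model agreement means no automation at all.
    if mean_pairwise_agreement(model_labels) < TASK_AGREEMENT_THRESHOLD:
        return ["human"] * len(confidences)
    # Step 2 (sample level): only confident samples go to the LLM.
    return [
        "llm" if c >= SAMPLE_CONFIDENCE_THRESHOLD else "human"
        for c in confidences
    ]


# Hypothetical labels from three models on four samples, plus one model's confidences.
labels = {
    "model_a": [1, 0, 1, 1],
    "model_b": [1, 0, 1, 0],
    "model_c": [1, 0, 1, 1],
}
routing = route_samples(labels, confidences=[0.95, 0.99, 0.60, 0.92])
```

Here "llm" means the model's label replaces one of the human annotators for that sample, in line with the study's single-annotator scenario, while "human" keeps the sample fully with human raters.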

But the researchers are aware that their study has limitations. 

"We focused on a narrow slice of annotation: discrete, multiple-choice labelling with pre-defined categories on pre-selected samples," Professor Treude says. 

"We didn’t study open-ended interpretive analysis, continuous ratings, or free-text outputs. We also didn’t test for demographic bias, longitudinal model drift, or prompt variation. Therefore, our results apply specifically to structured, deductive annotation."

The researchers describe their work as a first step.

"The next step is to go beyond the scenario of replacing a single annotator. We plan to investigate situations where the annotation categories are not pre-defined and where models must deal with open-ended or evolving codebooks. We are also exploring other types of software engineering tasks to test whether the same reliability patterns hold," Professor Treude says. 

"More broadly, this is part of a larger research agenda on the use of AI in science, examining how these models might assist, and where they fall short, across different qualitative methods such as thematic analysis, grounded theory, or interview coding."

END