Method prevents an AI model from being overconfident about wrong answers

More efficient than other approaches, the “Thermometer” technique could help someone know when they should trust a large language model.

2024-07-31

(Press-News.org) CAMBRIDGE, MA — People use large language models for a huge array of tasks, from translating an article to identifying financial fraud. However, despite the incredible capabilities and versatility of these models, they sometimes generate inaccurate responses.

On top of that problem, the models can be overconfident about wrong answers or underconfident about correct ones, making it tough for a user to know when a model can be trusted.

Researchers typically calibrate a machine-learning model to ensure its level of confidence lines up with its accuracy. A well-calibrated model should be less confident about an incorrect prediction and more confident about a correct one. But because large language models (LLMs) can be applied to a seemingly endless collection of diverse tasks, traditional calibration methods are ineffective.
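To make "confidence lines up with accuracy" concrete, the sketch below computes expected calibration error (ECE), a commonly used calibration measure. The release does not say which metric the researchers used, so treat the metric choice, the function name, and the toy data as assumptions for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy, per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical toy data: both models are right half the time,
# but only one of them reports a confidence that matches.
outcomes = [True, False, True, False]
overconfident = expected_calibration_error([0.95] * 4, outcomes)
calibrated = expected_calibration_error([0.5] * 4, outcomes)
```

In this toy case the overconfident model has a large gap between stated confidence and actual accuracy, while the second model has none, even though both answer the same questions correctly.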

Now, researchers from MIT and the MIT-IBM Watson AI Lab have introduced a calibration method tailored to large language models. Their method, called Thermometer, involves building a smaller, auxiliary model that runs on top of a large language model to calibrate it.

Thermometer is more efficient than other approaches — requiring less power-hungry computation — while preserving the accuracy of the model and enabling it to produce better-calibrated responses on tasks it has not seen before.

By enabling efficient calibration of an LLM for a variety of tasks, Thermometer could help users pinpoint situations where a model is overconfident about false predictions, ultimately preventing them from deploying that model in a situation where it may fail.

“With Thermometer, we want to provide the user with a clear signal to tell them whether a model’s response is accurate or inaccurate, in a way that reflects the model’s uncertainty, so they know if that model is reliable,” says Maohao Shen, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on Thermometer.

Shen is joined on the paper by Gregory Wornell, the Sumitomo Professor of Engineering who leads the Signals, Information, and Algorithms Laboratory in the Research Laboratory of Electronics, and is a member of the MIT-IBM Watson AI Lab; senior author Soumya Ghosh, a research staff member in the MIT-IBM Watson AI Lab; as well as others at MIT and the MIT-IBM Watson AI Lab. The research was recently presented at the International Conference on Machine Learning.

Universal calibration

Since traditional machine-learning models are typically designed to perform a single task, calibrating them usually involves one task-specific method. An LLM, by contrast, can perform many tasks, so using a traditional method to calibrate it for one task might hurt its performance on another.

Calibrating an LLM often involves sampling from the model multiple times to obtain different predictions and then aggregating these predictions to obtain better-calibrated confidence. However, because these models have billions of parameters, the computational costs of such approaches rapidly add up.

“In a sense, large language models are universal because they can handle various tasks. So, we need a universal calibration method that can also handle many different tasks,” says Shen.

With Thermometer, the researchers developed a versatile technique that leverages a classical calibration method called temperature scaling to efficiently calibrate an LLM for a new task.

In this context, a “temperature” is a scaling parameter used to adjust a model’s confidence to be aligned with its prediction accuracy. Traditionally, one determines the right temperature using a labeled validation dataset of task-specific examples.
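As a rough illustration of classical temperature scaling with a labeled validation set, the sketch below divides a model's raw scores (logits) by a temperature before the softmax and picks the temperature that minimizes negative log-likelihood on the labeled examples. The grid search, the function names, and the toy data are assumptions made for this example; the release does not describe how the temperature is fitted in practice.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(logit_rows, labels, temperature):
    """Average NLL of the true labels under temperature-scaled probabilities."""
    nll = 0.0
    for logits, label in zip(logit_rows, labels):
        nll -= math.log(softmax(logits, temperature)[label])
    return nll / len(labels)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(
        candidates,
        key=lambda t: negative_log_likelihood(logit_rows, labels, t),
    )


# Hypothetical validation set: the model is very sure of every answer
# but is right on only three of the four examples.
val_logits = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]
val_labels = [0, 1, 0, 0]
grid = [0.5 + 0.25 * i for i in range(39)]  # 0.5, 0.75, ..., 10.0

best_t = fit_temperature(val_logits, val_labels, grid)
scaled_confidence = max(softmax(val_logits[0], best_t))
```

Because the toy model is overconfident, the fitted temperature comes out above 1, which pulls its reported confidence down toward its actual accuracy of 75 percent.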

Since LLMs are often applied to new tasks, labeled datasets can be nearly impossible to acquire. For instance, a user who wants to deploy an LLM to answer customer questions about a new product likely does not have a dataset containing such questions and answers.

Instead of using a labeled dataset, the researchers train an auxiliary model that runs on top of an LLM to automatically predict the temperature needed to calibrate it for this new task.

The researchers train the Thermometer model on labeled datasets from a few representative tasks. Once trained, it can generalize to new tasks in a similar category without additional labeled data.

A Thermometer model trained on a collection of multiple-choice question datasets, perhaps including one with algebra questions and one with medical questions, could be used to calibrate an LLM that will answer questions about geometry or biology, for instance.

“The aspirational goal is for it to work on any task, but we are not quite there yet,” Ghosh says.

The Thermometer model only needs to access a small part of the LLM’s inner workings to predict the right temperature that will calibrate its prediction for data points of a specific task.
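The idea of predicting a temperature from a model's internal signals, without labels, can be pictured with the toy sketch below. Everything in it is hypothetical: the feature vectors, the single linear layer, the softplus output, the averaging over examples, and the made-up weights are illustrative assumptions, not the architecture or parameters described by the researchers.

```python
import math


def softplus(x):
    """Smooth map from any real number to a positive value."""
    return math.log1p(math.exp(x))


def predict_temperature(feature_rows, weights, bias):
    """Map each unlabeled example's features to a positive score, then average.

    Hypothetical stand-in for an auxiliary model: one linear layer plus
    softplus, so the predicted temperature is always greater than zero.
    """
    per_example = []
    for features in feature_rows:
        score = sum(w * f for w, f in zip(weights, features)) + bias
        per_example.append(softplus(score))
    return sum(per_example) / len(per_example)


# Made-up parameters standing in for a trained auxiliary model.
weights = [0.8, -0.3, 0.5]
bias = 0.2

# Made-up feature vectors for three unlabeled examples from a new task.
unlabeled_features = [
    [1.0, 0.2, 0.4],
    [0.5, 1.0, 0.1],
    [0.9, 0.0, 0.7],
]

task_temperature = predict_temperature(unlabeled_features, weights, bias)
```

Note that no answer labels appear anywhere in this sketch: the single temperature for the task is computed from the examples' features alone, which is the property the article highlights.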

An efficient approach

Importantly, the technique does not require multiple training runs and only slightly slows the LLM. Plus, since temperature scaling does not alter a model’s predictions, Thermometer preserves its accuracy.
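The claim that temperature scaling leaves predictions unchanged can be checked directly: dividing every logit by the same positive number preserves their order, so the top-ranked answer stays the same while the confidence attached to it changes. The logits and temperatures below are hypothetical values chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_class(probs):
    """Index of the most probable answer."""
    return max(range(len(probs)), key=lambda i: probs[i])


logits = [2.0, 0.5, -1.0]  # hypothetical scores for three answer choices
temperatures = (0.5, 1.0, 3.0)

# The chosen answer at each temperature, and the confidence assigned to it.
predictions = {t: predicted_class(softmax(logits, t)) for t in temperatures}
confidences = {t: max(softmax(logits, t)) for t in temperatures}
```

Across all three temperatures the same answer wins, so accuracy is untouched; only the reported confidence shrinks as the temperature rises.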

When they compared Thermometer to several baselines on multiple tasks, it consistently produced better-calibrated uncertainty measures while requiring much less computation.

“As long as we train a Thermometer model on a sufficiently large number of tasks, it should be able to generalize well across any new task. Just like a large language model, it is also a universal model,” Shen adds.

The researchers also found that if they train a Thermometer model for a smaller LLM, it can be directly applied to calibrate a larger LLM within the same family.

In the future, they want to adapt Thermometer for more complex text-generation tasks and apply the technique to even larger LLMs. The researchers also hope to quantify the diversity and number of labeled datasets one would need to train a Thermometer model so it can generalize to a new task.

###

This research was funded, in part, by the MIT-IBM Watson AI Lab.
