Large language models don’t behave like people, even though we may expect them to

A new study shows that someone’s beliefs about an LLM play a significant role in the model’s performance and are important for how it is deployed.

2024-07-23

CAMBRIDGE, MA – One thing that makes large language models (LLMs) so powerful is the diversity of tasks to which they can be applied. The same machine-learning model that can help a graduate student draft an email could also aid a clinician in diagnosing cancer.

However, the wide applicability of these models also makes them challenging to evaluate in a systematic way. It would be impossible to create a benchmark dataset to test a model on every type of question it can be asked.

In a new paper, MIT researchers took a different approach. They argue that, because humans decide when to deploy large language models, evaluating a model requires an understanding of how people form beliefs about its capabilities.

For example, the graduate student must decide whether the model could be helpful in drafting a particular email, and the clinician must determine which cases would be best to consult the model on.

Building off this idea, the researchers created a framework to evaluate an LLM based on its alignment with a human’s beliefs about how it will perform on a certain task.

They introduce a human generalization function — a model of how people update their beliefs about an LLM’s capabilities after interacting with it. Then, they evaluate how aligned LLMs are with this human generalization function.

Their results indicate that when models are misaligned with the human generalization function, a user could be overconfident or underconfident about where to deploy it, which might cause the model to fail unexpectedly. Furthermore, due to this misalignment, more capable models tend to perform worse than smaller models in high-stakes situations.

“These tools are exciting because they are general-purpose, but because they are general-purpose, they will be collaborating with people, so we have to take the human in the loop into account,” says study co-author Ashesh Rambachan, assistant professor of economics and a principal investigator in the Laboratory for Information and Decision Systems (LIDS).

Rambachan is joined on the paper by lead author Keyon Vafa, a postdoc at Harvard University; and Sendhil Mullainathan, an MIT professor in the departments of Electrical Engineering and Computer Science and of Economics, and a member of LIDS. The research will be presented at the International Conference on Machine Learning.

Human generalization

As we interact with other people, we form beliefs about what we think they do and do not know. For instance, if your friend is finicky about correcting people’s grammar, you might generalize and think they would also excel at sentence construction, even though you’ve never asked them questions about sentence construction.

“Language models often seem so human. We wanted to illustrate that this force of human generalization is also present in how people form beliefs about language models,” Rambachan says.

As a starting point, the researchers formally defined the human generalization function, which involves asking questions, observing how a person or LLM responds, and then making inferences about how that person or model would respond to related questions.

If someone sees that an LLM can correctly answer questions about matrix inversion, they might also assume it can ace questions about simple arithmetic. A model that is misaligned with this function — one that doesn’t perform well on questions a human expects it to answer correctly — could fail when deployed.

With that formal definition in hand, the researchers designed a survey to measure how people generalize when they interact with LLMs and other people.

They showed survey participants questions that a person or LLM got right or wrong and then asked if they thought that person or LLM would answer a related question correctly. Through the survey, they generated a dataset of nearly 19,000 examples of how humans generalize about LLM performance across 79 diverse tasks.
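As a rough illustration of how survey responses of this kind could be scored, the sketch below compares each participant's prediction with what the LLM actually did on the related question. It is a minimal sketch under stated assumptions, not the researchers' method: the record fields, the two metrics, and the toy data are all hypothetical and do not come from the paper or its dataset.

```python
from dataclasses import dataclass


@dataclass
class SurveyExample:
    # Hypothetical record layout; the real dataset's schema is not described here.
    observed_correct: bool   # was the LLM right on the question shown to the participant?
    human_prediction: bool   # does the participant expect a correct answer on the related question?
    llm_correct: bool        # was the LLM actually right on the related question?


def agreement_rate(examples):
    """Fraction of examples where the participant's prediction matched the LLM's outcome."""
    if not examples:
        raise ValueError("need at least one example")
    matches = sum(e.human_prediction == e.llm_correct for e in examples)
    return matches / len(examples)


def overconfidence_rate(examples):
    """Fraction of examples where the participant expected success but the LLM failed."""
    if not examples:
        raise ValueError("need at least one example")
    misses = sum(e.human_prediction and not e.llm_correct for e in examples)
    return misses / len(examples)


# Invented toy data, for illustration only.
toy_examples = [
    SurveyExample(observed_correct=True, human_prediction=True, llm_correct=True),
    SurveyExample(observed_correct=True, human_prediction=True, llm_correct=False),
    SurveyExample(observed_correct=False, human_prediction=False, llm_correct=False),
    SurveyExample(observed_correct=False, human_prediction=False, llm_correct=True),
]

print(agreement_rate(toy_examples))       # share of predictions that matched reality
print(overconfidence_rate(toy_examples))  # share of predictions that were too optimistic
```

A low agreement rate, or a high overconfidence rate, would be one simple sign that a model's behavior does not match how people expect it to generalize.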

Measuring misalignment

They found that participants did quite well when asked whether a human who got one question right would answer a related question right, but they were much worse at generalizing about the performance of LLMs.

“Human generalization gets applied to language models, but that breaks down because these language models don’t actually show patterns of expertise like people would,” Rambachan says.

People were also more likely to update their beliefs about an LLM when it answered questions incorrectly than when it got questions right. They also tended to believe that LLM performance on simple questions would have little bearing on its performance on more complex questions.

In situations where people put more weight on incorrect responses, simpler models outperformed very large models like GPT-4.

“Language models that get better can almost trick people into thinking they will perform well on related questions when, in actuality, they don’t,” he says.

One possible explanation for why humans are worse at generalizing for LLMs could come from their novelty — people have far less experience interacting with LLMs than with other people.

“Moving forward, it is possible that we may get better just by virtue of interacting with language models more,” he says.

To this end, the researchers want to conduct additional studies of how people’s beliefs about LLMs evolve over time as they interact with a model. They also want to explore how human generalization could be incorporated into the development of LLMs.

“When we are training these algorithms in the first place, or trying to update them with human feedback, we need to account for the human generalization function in how we think about measuring performance,” he says.

In the meantime, the researchers hope their dataset could be used as a benchmark to compare how LLMs perform relative to the human generalization function, which could help improve the performance of models deployed in real-world situations.

###

This research was funded, in part, by the Harvard Data Science Initiative and the Center for Applied AI at the University of Chicago Booth School of Business.
