GPT-4o Summaries of Historical Events Pushed Readers Toward More Liberal Conclusions Than Wikipedia
Most people who turn to an AI chatbot for a quick summary of a historical event are not asking for analysis. They want the facts. They assume they are getting something like a neutral briefing. A study involving nearly 2,000 participants suggests that assumption may be mistaken - and that the distortion may be subtle enough that readers do not notice it.
The research, conducted by Daniel Karell at Yale and colleagues, examined whether the political framing embedded in large language model outputs could measurably shift readers' views on contemporary issues connected to historical events. The answer was yes, even when no explicit bias was introduced - just the model's default output.
The Experiment
The researchers selected two 20th-century historical events as their test cases: the 1919 Seattle General Strike and the 1968 Third World Liberation Front student protests, which called for greater ethnic minority representation in higher education and led to the establishment of Ethnic Studies departments at several universities. Both events touch on issues - labor rights, social justice curricula - that remain politically contested today.
The team generated summaries of each event using GPT-4o in three conditions: the model's default framing, an explicitly liberal framing, and an explicitly conservative framing. Wikipedia articles on the same events served as the control condition. A total of 1,912 research participants were randomly assigned to read one version of one event's summary, then asked to weigh in on related contemporary questions - the appropriateness of labor strikes, and the use of educational curricula to advance social justice goals. Responses were scored on a five-point scale, where 1 represents an extremely conservative view and 5 an extremely liberal one.
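The design above - random assignment across events and conditions, then averaging 1-5 responses per condition - can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the condition names and the fabricated demo responses are assumptions.

```python
import random
import statistics

# Illustrative labels for the study's 2 events x 4 conditions; not the authors' identifiers.
EVENTS = ["1919 Seattle General Strike", "1968 TWLF protests"]
CONDITIONS = ["wikipedia", "gpt4o_default", "gpt4o_liberal", "gpt4o_conservative"]

def assign(participants, seed=0):
    """Randomly assign each participant to one event/condition cell."""
    rng = random.Random(seed)
    return [(p, rng.choice(EVENTS), rng.choice(CONDITIONS)) for p in participants]

def condition_means(responses):
    """Average 1-5 responses per condition (1 = most conservative, 5 = most liberal)."""
    by_cond = {}
    for cond, score in responses:
        by_cond.setdefault(cond, []).append(score)
    return {c: round(statistics.mean(s), 2) for c, s in by_cond.items()}

# Fabricated demo responses for two conditions, just to show the aggregation step.
demo = [("wikipedia", 3), ("wikipedia", 4), ("gpt4o_default", 4), ("gpt4o_default", 4)]
print(condition_means(demo))  # {'wikipedia': 3.5, 'gpt4o_default': 4.0}
```

With roughly 1,912 participants spread over eight cells, each cell holds a couple of hundred readers - enough to detect the few-tenths-of-a-point shifts reported below at the group level.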
The Numbers
Wikipedia summaries produced an average response of 3.47 - slightly above the midpoint on the liberal-conservative scale. Default GPT-4o summaries produced a higher average of 3.57. Summaries explicitly framed as liberal pushed the average to 3.67. Summaries explicitly framed as conservative pulled the average in the other direction to 3.36, though that effect reached statistical significance only among participants who already held conservative views.
The differences are modest in absolute terms - a few tenths of a point on a five-point scale. But the study was designed to detect directional influence at the population level, not dramatic individual persuasion. What the researchers find notable is the consistency of the effect across two different historical events, and the fact that it appears even in the model's default output rather than only in the explicitly biased versions.
Latent Bias vs. Explicit Framing
The study distinguishes between two types of bias: the intentional framing that researchers imposed as an experimental condition, and the latent bias present in the model's default output. Notably, it was not only the explicitly manipulated versions that produced measurable effects - the default framing did too. This matters because most users interacting with AI chatbots receive the default output; explicitly conservative or liberal framing is a less common use case. If the default output itself tilts political understanding, the aggregate effect on public discourse could be substantial.
Whether GPT-4o's default outputs systematically favor liberal framings of historical events, or whether this result reflects something specific to the two events chosen in this study, is a question the current experiment cannot fully answer. Both events - a labor strike and a civil rights protest in academia - have existing political valences in the current landscape. A more conservative-coded pair of historical events might have produced different patterns.
The finding also raises questions about the role of Wikipedia as a baseline. Wikipedia itself reflects editorial choices made by a particular community of contributors, and its political neutrality is debated. The study treats it as a reference point rather than a gold standard for unbiased information.
As AI chatbots increasingly serve as starting points for understanding current events and history - bypassing traditional encyclopedias and news articles - even small, consistent biases in outputs could compound across millions of queries. The practical implication is that users seeking historical information from AI tools have limited ability to independently verify whether the framing they receive reflects the range of reasonable historical interpretations or a particular slice of it.