Britannica Is Suing OpenAI for Stealing the Encyclopedia

Encyclopedia Britannica was founded in 1768. It has outlasted the typewriter, survived the internet, and held its own against Wikipedia by pivoting to digital. Now, 258 years after its first edition, it's in federal court claiming that OpenAI scraped its entire catalog without permission and built a product that competes directly with it, all while putting Britannica's name on wrong answers.

The lawsuit, filed last week alongside Merriam-Webster (which Britannica owns), is one of the most interesting copyright cases in the AI wave. Not because it's the biggest; the New York Times lawsuit was splashier. It's interesting because Britannica specifically targets retrieval-augmented generation (RAG), the step where a chatbot fetches live web content to answer a query. That's new. And that changes what's at stake.

What They're Actually Claiming

Three main accusations. Let's go through them because they're all different beasts.

First: training data. Britannica holds copyright to roughly 100,000 online articles. The complaint alleges OpenAI scraped all of them and used them to train GPT models without permission or payment. This is the same argument the New York Times is making, and it's the same argument Ziff Davis, a bunch of newspapers, and dozens of book authors have made. Whether using content as training data constitutes copyright infringement, or is "transformative" enough to qualify as fair use, is still largely unsettled. Anthropic got a mixed ruling last year: it survived on the fair use argument for the training itself, then settled for $1.5B over the book-downloading part. So this is contested ground.

Second: RAG outputs. This is the more novel claim. Britannica argues that when ChatGPT uses its RAG pipeline (the tool that pulls live web content to answer queries), pulls from Britannica articles, and then reproduces "full or partial verbatim copies" in responses, that's direct infringement. Not training. Not transformation. Literal reproduction. This is a much cleaner argument legally, and it's one that publishers haven't hammered as hard as training data.

Third: trademark infringement. The complaint alleges that ChatGPT generates incorrect answers and then attributes them to Encyclopedia Britannica. Not just misusing the content, but actively misusing the brand: hallucinating something false and stamping "according to Britannica" on it. That's the Lanham Act, a different statute entirely, and honestly it might be the most serious allegation from a brand standpoint.

Why RAG Is the New Battleground

The AI industry has been quietly banking on the fair use doctrine for training data. The argument is: we transformed the content into model weights, not a competing product. A court bought that for Anthropic, at least partially. It's a plausible defense.

RAG doesn't work the same way. RAG isn't transforming your content into weights — it's retrieving your actual text and handing pieces of it to a user, in real time, as a substitute for reading the original article. The "transformation" argument gets significantly weaker when you're basically running a live quote-and-summarize machine pointed at someone's copyrighted database.
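
To make that concrete, here's a minimal sketch of the retrieval step in Python. Everything in it is an assumption for illustration: the `Passage` type, the toy corpus, the made-up `.example` sources, and the word-overlap ranking are mine, and this is not how ChatGPT's pipeline is actually built. The only point is the shape of the thing: the retrieved text lands in the model's context unmodified.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # hypothetical publisher label
    text: str    # placeholder text, not real article content


# Toy corpus standing in for fetched web pages (all contents are made up).
CORPUS = [
    Passage("encyclopedia.example", "The aardvark is a nocturnal mammal native to Africa."),
    Passage("news.example", "Local council approves a new bridge over the river."),
    Passage("blog.example", "My ten favorite sourdough recipes for busy weeknights."),
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[Passage], k: int = 1) -> list[Passage]:
    """Rank passages by how many query words they share; return the top k."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p.text)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[Passage]) -> str:
    """Paste the retrieved passages, unmodified, into the model's context."""
    context = "\n".join(f"[{p.source}] {p.text}" for p in passages)
    return f"Answer using these sources:\n{context}\n\nQuestion: {query}"


query = "What is an aardvark?"
top = retrieve(query, CORPUS)
prompt = build_prompt(query, top)
print(prompt)
```

Real systems use web search and embedding-based ranking rather than word overlap, but the structure is the same: the source text is copied into the prompt as-is, and whatever the model writes next is generated with that text sitting directly in front of it.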

Publishers have known this was coming. Perplexity is facing a similar lawsuit from Britannica that's still working through the courts. Google's AI Overviews have drawn complaints from content creators for years. But OpenAI is the biggest target, and the RAG-specific argument puts pressure on a core piece of how ChatGPT's search and web-browsing features work.

If Britannica wins on the RAG claim, every AI company running a retrieval pipeline against web content has a problem. Not hypothetically — immediately.

The Revenue Substitution Argument Is Real

There's a line in the complaint that I keep coming back to:

"ChatGPT starves web publishers like Britannica of revenue by generating responses to users' queries that substitute, and directly compete with, the content from publishers."

This is the core economic injury, and it's hard to argue with the logic. People used to Google something, see a Britannica result at the top, click it, and Britannica got the pageview and the ad revenue. Now they ask ChatGPT, get a summary pulled from Britannica's articles, never visit the site, and Britannica gets nothing. Meanwhile OpenAI is charging $20/month for ChatGPT subscriptions partly powered by Britannica's editorial work.

The substitution effect is measurable. Traffic to reference and information sites has declined materially since AI search took off. Wikipedia has reportedly seen drops. News sites have documented it. Britannica, which went all-in on digital subscriptions after killing the print edition in 2012, is watching its core product being commoditized by the systems that consumed it.

The Hallucination-Trademark Problem Is Underrated

I don't think enough people are talking about the trademark angle. The allegation is that ChatGPT will sometimes hallucinate incorrect information and attribute it to Britannica. Wrong facts, Britannica's name on them.

Think about what that means for a brand whose entire value proposition is accuracy and credibility. Britannica has been a byword for "reliable information" since before the United States existed. They survived Wikipedia by pivoting to fact-checked, expert-authored digital content as a premium product. Their brand equity is almost entirely tied to the idea that Britannica = trustworthy.

And ChatGPT is out here saying "according to Encyclopedia Britannica" before stating something false. That's not a theoretical harm. Every time that happens, it slightly degrades the public's trust in the Britannica brand. Over millions of queries, that's real damage.

Where This Fits in the Larger War

The AI copyright litigation wave isn't random. It's a pattern:

News publishers: NYT, Ziff Davis, and dozens of newspapers, focused on training data and the substitution effect on journalism
Book authors: the Anthropic settlement was the big one, with writers getting a $1.5B class action payout
Reference publishers: Britannica's suits against OpenAI and Perplexity, targeting RAG specifically
Stock photo agencies: Getty Images against Stability AI, targeting image generation

Each category is probing a different part of how AI systems consume and redistribute content. Collectively, they're mapping the legal boundaries of what AI companies can do. The courts are going to spend the next five years drawing lines that the industry has been pretending don't exist.

OpenAI has a few standard defenses. Fair use. Transformative use. The argument that training data isn't reproduced in outputs (complicated by the RAG claims here). They've also been doing licensing deals — they've signed agreements with some publishers, the Associated Press among them — which suggests they know the legal exposure is real, even if they've never admitted it publicly.

What I Actually Think

Here's my take: the training-data copyright argument was always going to be a messy fight, and it probably ends with some settlement formula or a licensing market that didn't exist before. That's been the trajectory with music, with photos, with every prior content type the internet ran through the disruption machine.

The RAG argument is different and more interesting. RAG isn't training — it's retrieval. It's closer to a search engine citing a source, except it's pulling the content wholesale into the response and removing the user's incentive to visit the original. If courts decide that's infringement, AI search products have to rebuild their retrieval systems with explicit licensing agreements. That's a much bigger structural change than paying settlement funds.

The trademark claim, hallucinating under someone's name, is probably the easiest win legally. False attribution of that kind is what the Lanham Act already covers. Whether it produces meaningful damages is another question, but the theory itself isn't a stretch.

Britannica isn't a scrappy startup swinging for a payday. It's a 258-year-old institution that built something genuinely valuable and watched one of the most valuable companies in tech consume it. The lawsuit is measured, built on distinct legal theories, and designed to put pressure on the RAG workflow in particular. That's not desperation. That's strategy.

The internet ate the encyclopedia once already. Britannica adapted. The question this lawsuit is really asking is: does adapting count for anything if the next wave just takes what you built without asking?

We'll find out. The courts are slow, but they do eventually answer.