Some Legal Clarity For Copyright And AI

By Ian Hollidae, 2025/06/24

So we finally have some legal direction regarding LLMs training on public material.

The tech industry will call this a win. The content industry, not so much. From the court summary (full PDF at the link above):

The copies used to train specific LLMs were justified as a fair use. Every factor but the nature of the copyrighted work favors this result. The technology at issue was among the most transformative many of us will see in our lifetimes.

The copies used to convert purchased print library copies into digital library copies were justified, too, though for a different fair use. The first factor strongly favors this result, and the third favors it, too. The fourth is neutral. Only the second slightly disfavors it. On balance, as the purchased print copy was destroyed and its digital replacement not redistributed, this was a fair use.

The downloaded pirated copies used to build a central library were not justified by a fair use. Every factor points against fair use. Anthropic employees said copies of works (pirated ones, too) would be retained forever for general purpose even after Anthropic determined they would never be used for training LLMs. A separate justification was required for each use. None is even offered here except for Anthropic's pocketbook and convenience.

And, as for any copies made from central library copies but not used for training, this order does not grant summary judgment for Anthropic. On this record in this posture, the central library copies were retained even when no longer serving as sources for training copies, hundreds of engineers could access them to make copies for other uses, and engineers did make other copies. Anthropic has dodged discovery on these points. We cannot determine the right answer concerning such copies because the record is too poorly developed as to them. Anthropic is not entitled to an order blessing all copying that Anthropic has ever made after obtaining the data, to use its words.

As someone who spends time on both the tech and the content side of things, I think the content side will eventually come out just fine after this. I'm not sure I could have said that before now.

First, it appears that copyright precedent was followed. If you obtain something legally, training on it falls under fair use. This seems pretty straightforward. Some part of what Anthropic did cleared the established legal hurdles.

Second, downloading pirated material and storing it in a permanent library without compensating the owner is not fair use. This part of what Anthropic did fell short of the established legal hurdles and will go to trial later this year. I think it was this aspect of LLM training that really caused a lot of angst. The idea of web crawlers scouring the web without restraint and with no regard for ownership seemed inevitable. I'm glad to see some rational lines being drawn.

In some ways, this ruling is a bit of a relief. Enough talk and speculation; this is the start of rubber-meets-the-road action. It looks like things are off to a good start.

Tags: AI