When is AI stealing?

Published on 2024-07-15 by TomatoSoup

I've heard language and graphics models referred to as "copyright infringement as a service." I'm going to philosophize about when, exactly, LLMs and models like DALL-E and Stable Diffusion are committing copyright infringement and/or "stealing".

I'm going to conflate the issues of copyright infringement and stealing. The first is a fact-based legal question; the second can refer to anything from art and text being stolen (copied with either no or only a vague legal basis for not being copyright infringement) in order to train the models, to the output of those models stealing jobs from artists and authors.


Is content stolen when the model is trained? Maybe! Certainly there's a lot of content out there that is already permissively licensed enough to be used as input, no questions asked. There's also the tremendous gray zone of content that is published for "public consumption," where nobody would be too put out if it were simply republished, such as reddit comments or DeviantArt posts. As an example, I've been quoted by Professor O'Neill when she was describing why people might prefer her random number generator over Doctor Vigna's. I was quoted without attribution, even! But I have zero problem with this. Certainly, some people might be frustrated, but when I say "too put out" I mean nobody would claim their work was stolen, or that they were owed compensation, because their work was simply republished without proper attribution. Of course it's a completely different story if someone slaps their watermark on work that isn't theirs! But simply saying "I found this on the net" and reprinting a publicly posted comment or picture isn't stealing. Is it copyright infringement? Yeah, probably. Increasingly, services are saying that they'll train models on your work, which makes it legal, but legality is not morality, and there's a buttload of content out there that predates these changes.

But those are cases where something is repeated with merely inadequate attribution. I'll get to things being repeated with downright false attribution later.

What kind of stolen content are these things trained on? Well, there's a dataset known as The Pile (ominous noises) which includes a set called books3. It is widely believed that this dataset includes the full contents of several book piracy websites. Broadly speaking, if you have to pay for some content, and that content's license doesn't say you can train on it or use it for broad purposes, then training on that content is stealing. Now, there's an optimal amount of crime in the world and it's not 0, so we shouldn't be too hard on models that contain trace amounts of theft; we should just strive to minimize it as much as possible. But we'll return to that.


Is theft occurring when the model is evaluated? Again, maybe! Certainly I'm committing copyright infringement, or at least attempting to, if I prompt a language model with the first few sentences of Harry Potter and expect the rest to pop out, or ask a graphics model to give me Mario and Luigi fighting Bowser. Similarly, these models will happily spit out code that is under a license demanding proper attribution and then append the wrong license entirely. But surely that's the operator's fault, right? I don't care how sheltered and unaware of Mario you claim to be, if you make a platformer featuring a red plumber fighting a giant turtle, Nintendo's gonna slap your shit! Just because a tool is capable of doing something illegal or unethical doesn't mean that the tool itself is illegal or unethical.

This problem exists even without these models! Back in 2009, Mike Pall elucidated some of the internal design decisions for LuaJIT, a just-in-time compiler so performant that it's faster at calling native code than C is. He released the intellectual property and design concepts into the public domain. He noted that the code itself is already permissively licensed, but that the IP and the license on the code are orthogonal issues. However, he went on to say:

I cannot guarantee it to be free of third-party IP however. In fact nobody can. Writing software has become a minefield and any moderately complex piece of software is probably (unknowingly to the author) encumbered by hundreds of dubious patents. This especially applies to compilers. The current IP system is broken and software patents must be abolished. Ceterum censeo.

But what about stealing jobs? ...Maybe! I run a Pathfinder campaign and most of the player character tokens are generated. At this point in the campaign, would any of us pay a human for character art? Probably not! But if I were running a broadcast campaign like Critical Role or Dimension 20, then of course the art would be commissioned. And what's more, we'd probably generate a few dozen different concepts, bring them to an artist whose style we like, and use them as reference art. I have artist friends who say that graphics models have been a slight boon because people whine less about revisions costing extra.

But if a major studio just drops some AI art into a game? Yeah, theft. Certainly these tools can speed up an artist's workflow, generating bits and bobs to composite together, or getting a start on a scrapboard of concept art, but wholesale creating art to sell from a simple invocation of these tools is theft. What about a minor studio, or a free-to-play game? There's a fitness app that's been a one-dev labor of love, where you level a character in a fantasy world by walking to gain XP, and the art for the different regions is generated with Midjourney. Is this theft? Ehh... it's an early-stage startup. If the app starts making a load of cash, then I'll expect the dev to start replacing the assets with genuine ones. But for a small project with low odds of commercialization, I don't see it as theft.


At what point does flagrantly copying a bunch of other artists' work stop being stealing? If you swing by YouTube and search Extreme Soundclown Megamix V you'll find a 12-minute video without a single novel chord or beat. If you search Project Discovery by pluffaduff you'll find a faithful recreation of Daft Punk's second album, rebuilt from the original samples they used along with many samples they definitely did not. Are these graphics models not just consuming a tremendous amount of commercial data and creating remixes? If human authorship is a necessary component of an original work, then do these models not count because they don't involve enough effort? Why does work made with tools like Krita or Blender count as art? How do we characterize the exact level of effort an artist must put in for their product to count as art?

There's a game called StarSector which features AI cores as salvageable objects. The highest tier has its capabilities described thusly:

The alpha-level AI core is the physical soul of a fearsome alien intelligence. An alpha can create art which perfectly simulates human pathos, plausibly debate any philosophical position, and form what appear to be deep and meaningful bonds with human beings.

None of the descriptions of the AI cores says they actually have these human experiences; they only simulate them. Our current 2024 models can't even be described as intelligent; they're just passable at simulating intelligent output given the right inputs. They're labor-saving devices in the same way the gradient tool is: a perfect gradient might have been difficult to paint, and worthy of a gallery, a hundred years ago, but now anyone can whip up an identical product in minutes. As a child I used MS Paint's line and bucket tools to create a hundred Piet Mondrian paintings without even knowing of the man.

It's not that machines can't be part of the creative process, but that the machine can't run off on its own. These models are the result of brutal optimization down a mathematical gradient. Compare that to an art-making robot that kills itself in the process: a paintbrush attached to an axle, with the power cable wrapped around that same axle, so every stroke winds it closer to pulling its own plug. This is art! It starts lame, but at the end its final strokes create something worthy of putting on a wall. It looks, to my philistine eye, like a brutalist piece of wall-scroll calligraphy. And it has a story to it!
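
To make "optimization down a mathematical gradient" concrete, here's a toy sketch in Python: one made-up parameter, one made-up loss, and repeated nudges downhill. It isn't any real model's training code, just the shape of the loop, scaled down from billions of parameters to one.

    # Toy gradient descent: nothing below comes from a real training stack.
    def loss(w):
        return (w - 3.0) ** 2          # made-up objective with its minimum at w = 3

    def grad(w):
        return 2.0 * (w - 3.0)         # derivative of that objective

    w = 0.0                            # arbitrary starting point
    learning_rate = 0.1
    for step in range(100):
        w -= learning_rate * grad(w)   # nudge the parameter against the gradient

    print(w)                           # crawls toward 3.0; no intent, just arithmetic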

It's the fact that literally nobody knows how these models work that makes them not artistic in and of themselves. Nobody consciously set out to create this model. It's like Doctor Malcolm's line before his famous "your scientists were so preoccupied" line:

I'll tell you the problem with the scientific power that you're using here: it didn't require any discipline to attain it. You know, you read what others had done and you took the next step. You didn't earn the knowledge for yourselves so you don't take any responsibility for it.

Didn't earn the art for themselves indeed.


How did we get here?

Well, you've gotta remember that copyright has a concept of fair use, and research is one of those exceptions. OpenAI made GPT-1 and GPT-2 and released their weights. They also provided inference on these models as a service. It's research! And there's nothing wrong with being compensated for a service you're providing, right? You're not paying for the content, you're paying for the compute time!

Then GPT-3 came out and they said they were scared of how powerful it was, so they wouldn't release the weights, but it was still a research product and you could pay for the compute time. These were not instruction-tuned models, only exceptionally powerful autocompletes, so it's not like any random person could use them well anyway.

Then GPT-3.5, ChatGPT, came out. Now everybody knows about it.

Similar stories play out for other models: they hide as limited-access research tools where the users are only compensating the researchers for the server time. Think about how crappy the first DALL-E model was, and then how quickly it got to acceptable results.

Why aren't they being sued into the ground? Because they hid behind research long enough to amass enough money to make legal fights a losing battle. Even The Mouse hasn't tried suing them.

Speaking of The Mouse, why are there all these exceptions to copyright? Sure, education, criticism, research, these are all good things, and if copyright is intended to promote the progress of science and useful arts then those are noble goals. But how did these companies get away with it for so long, to such an aggressive degree?


Because copyright is too fuckin' long!

Surprise, this is a rant about copyright!

From my US-centric perspective, copyright started as 14 to 28 years, depending on renewal. After The Mouse got to it, it became 95 years after publication or 120 years after creation (whichever comes first) for corporate works, or 70 years after the author's death for everything else.

GPT-1 was 2018. Under the original 14-year term, they could have trained, 100% legally, on the tremendous amount of data published in 2004 and earlier. Sure, it would suck in some ways if fair use protections were weaker, but we wouldn't be waiting 'til 2024 to legally reproduce something published in 1928! Steamboat Willie is public domain now and people are still having a hard time publishing it, because everyone is working under the mental model that copyright never expires. Automated tools still take it down.
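
The arithmetic behind those dates fits in a few lines. This is a rough sketch only; it ignores renewals, end-of-calendar-year rules, work-for-hire terms, and every other wrinkle of real copyright law.

    # Rough cutoffs only; real terms depend on renewal, work-for-hire status,
    # and the calendar year, none of which is modeled here.
    def newest_free_year(current_year, term_years):
        return current_year - term_years

    print(newest_free_year(2018, 14))   # ~2004: the pool GPT-1 could have drawn on
    print(newest_free_year(2018, 95))   # ~1923: the pool the current regime allowed
    print(1928 + 95 + 1)                # 2024: the year Steamboat Willie finally cleared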

As an aside, that's why Let's Encrypt only issues certs that are valid for 90 days. Long expiration times lead to complacency that "this is how things have always run," and then nobody knows how to renew the certs when the time comes. Mozilla fucked it up and none of the Firefox extensions worked for a day or so. Microsoft fucked it up and their CDN for Office went down.
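
Short-lived certs force you to automate, and even just checking how long you've got left is a few lines of Python. A minimal sketch using only the standard library; letsencrypt.org is just an example host, not a recommendation of what to monitor.

    # Ask a host for its TLS certificate and report how many days remain
    # before it expires. Standard library only; swap in your own host.
    import datetime
    import socket
    import ssl

    def days_until_cert_expiry(host, port=443):
        context = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        expires = datetime.datetime.utcfromtimestamp(
            ssl.cert_time_to_seconds(cert["notAfter"])
        )
        return (expires - datetime.datetime.utcnow()).days

    print(days_until_cert_expiry("letsencrypt.org"))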

There's an ongoing lawsuit by the New York Times against OpenAI for training on its articles without permission. Is web scraping legal? Did they have sufficient modal dialogs proving that OpenAI consented to their terms? How much is this database worth? Right now it's worth a lot, because it's not gonna be legally free until like 2070, assuming the authors are in their thirties or so. Nobody would give a shit if everything from 1996 and earlier were unambiguously legal to use for any purpose.

Reddit wouldn't be doing its giant anti-scraping, anti-third-party-client crackdown if they didn't think they could license all of their content for 6,500,000,000 dollars.


So where do I think the theft is actually occurring?

At the risk of anthropomorphizing the models, training a model on stuff that anyone can easily see is kinda-sorta-not-really similar to an artist copying a style they saw. Obviously if they misrepresent themselves it's theft and infringement, but when the output is used for throwaway assets in free games or shared among small groups of friends... who gives a shit?

The actual location of the theft is when models like Midjourney or GPT-x hoover up every last bit of humanity available and are then hidden behind pay-per-token business models. These models are representative of some sum total of human creation, and the fact that I can't run GPT-4o at 0.1 tokens/s, desperately swapping weights between RAM and disk, is a disgrace. This is our cultural birthright, and these companies saying "Yeah, we can consume everything you've ever done but you can never have the fruits of our labors" is bullshit.
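
For what it's worth, the tooling for that kind of slow, swap-to-disk local inference already exists for open-weight models. A minimal sketch, assuming the Hugging Face transformers and accelerate libraries are installed; the model ID below is a hypothetical placeholder, not a real checkpoint, and none of this works for GPT-4o because the weights simply aren't available.

    # Sketch of running an open-weight model with layers spilled to disk.
    # "some-org/some-open-model" is a hypothetical placeholder ID.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "some-org/some-open-model"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",         # place layers on GPU/CPU as memory allows
        offload_folder="offload",  # spill whatever doesn't fit to disk
    )

    inputs = tokenizer("Is it stealing when", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(output[0], skip_special_tokens=True))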

The location of the theft is when corporations use these models to make people pay top dollar for products that should be the result of human work. It's when people and organizations use the near-direct output of these models for purposes where they otherwise have both the financial incentive and the means to hire an artist. Why are we paying for things if not to compensate the humans that brought those things to us?

Corporations have already made the vast majority of the money they're ever going to make on a work within the first 14 years of its existence, and certainly all they plausibly could within the first 28. Stop hoarding it. The point of copyright is that nothing exists in a vacuum: your original work is built on everything that came before, and for your novel contribution you deserve a period of exclusive benefit. After that it's time to return your work to the pool of things that have come before, thank you very much; it's time for the next generation of ideas.

The ultimate irony is that, because these models are not the result of any human authorship, they aren't copyrightable! Sure, you have to agree to a license to download some of them, but if somebody else hucks the weights into a torrent and you download them, you've committed zero crimes and can use them for any purpose.