Listen "Training AI"
Episode Synopsis
From John Gruber today:
It’s fair for public data to be excluded on an opt-out basis, rather than included on an opt-in one [...]
No, no it’s not. This is a critical thing about ownership and copyright in the world. We own what we make the moment we make it. Publishing text or images on the web does not make them fair game to train AI on. The “public” in “public web” means free to access; it does not mean free to use.
Besides that, I’d also add something I’ve seen no one else mention so far: people post content on the web that they don’t own all the time. No one has to prove ownership to post anything.
Someone who publishes my work as their own (theft) or republishes my work (like quoting or linking back) doesn’t have the right to make the choice for me to let my content be used for training AI. This is where I struggle the most with the “opt-out” style of AI training on the web.
Whether reposting my content elsewhere is in good faith or not, it is now up to someone other than me to decide whether or not to disallow AI training web crawlers in their robots.txt file. To add insult to injury, that person may not have the knowledge, or even the power, to do so if they’re posting content they don’t own on a site they also don’t own, like social media.
I can play whac-a-mole with those bots on servers I control—which I don’t like doing, for the record—but I have none of that control anywhere else.
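For anyone unfamiliar with what that opt-out looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt body, the `is_allowed` helper, and the example.com URL are assumptions made up for illustration; GPTBot and CCBot are just two examples of crawler names, not a complete list. Note too that robots.txt is advisory: it only works if the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site the author controls.
# The crawler names are examples only; a real list would be longer
# and would need updating as new crawlers appear.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/my-post"
    for agent in ("GPTBot", "CCBot", "SomeBrowser"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

The point of the sketch is the limitation, not the mechanism: these rules apply only to the server that hosts this file. A copy of the same post on someone else's site is governed by whatever robots.txt that site does or does not publish.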