Training AI

22/06/2024


Episode Synopsis


From John Gruber today:
“It’s fair for public data to be excluded on an opt-out basis, rather than included on an opt-in one [...]”
No, no it’s not. This is a critical thing about ownership and copyright in the world. We own what we make the moment we make it. Publishing text or images on the web does not make it fair game to train AI on. The “public” in “public web” means free to access; it does not mean it’s free to use.
Besides that, I’d also add something I’ve seen no one else mention so far: People post content on the web that they don’t own all the time. No one has to prove ownership to post anything.
Someone who publishes my work as their own (theft) or republishes my work (like quoting or linking back) doesn’t have the right to decide for me whether my content can be used for training AI. This is where I struggle the most with the “opt-out” style of AI training on the web.
Whether my content is reposted elsewhere in good faith or not, it is now up to someone other than me to decide whether or not to disallow AI training web crawlers in their robots.txt file. To add insult to injury, that person may not have the knowledge (or even the power) to do so if they’re posting content they don’t own on a site they also don’t own, like social media.
I can play whac-a-mole with those bots on servers I control—which I don’t like doing, for the record—but I have none of that control anywhere else.
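For illustration, here is a minimal sketch of what that opt-out looks like on a server someone does control. The robots.txt contents, the `is_allowed` helper, and the example.com URL are assumptions made up for this example. GPTBot and CCBot are used only as examples of crawler user agents, and a real list would be longer and would keep changing. The check uses Python's standard `urllib.robotparser`, which only reports what the file asks for; it does nothing to make a crawler comply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block two example AI-related crawlers,
# allow everyone else. Every new bot needs its own entry.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/my-post"  # placeholder URL
    for agent in ("GPTBot", "CCBot", "SomeBrowser"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

The file only covers the one site it lives on. A copy of the same post on someone else's site is governed by their robots.txt, not this one.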
