AI training opt-out
AI training opt-out is the bundle of mechanisms for keeping your content out of language- and image-model training sets: robots.txt blocks for named user-agents, meta `noai`/`noimageai` tags, `X-Robots-Tag` HTTP headers, and — in the EU — the text-and-data-mining reservation under DSM Directive Article 4(3).
Long definition
There is no single switch. AI training opt-out is a layered practice, and each layer addresses a different actor and legal regime.
Layer 1 — robots.txt user-agent blocks. The baseline, and an honor system: it stops only crawlers that choose to comply. Block GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider, PerplexityBot, OAI-SearchBot, and any newly published agents as they appear. Publishers such as Reuters, The New York Times, and CNN each run robots.txt files with 20+ AI-related user-agent blocks.
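A minimal robots.txt implementing this layer could look like the following sketch. The agent list is illustrative and should be checked against each operator's current documentation, since crawler names change; grouping several User-agent lines over one Disallow rule is valid under RFC 9309.

```text
# robots.txt — block named AI training crawlers (honor-system only)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
User-agent: PerplexityBot
User-agent: OAI-SearchBot
Disallow: /
```

Ordinary search crawlers fall through to the default (or a separate `User-agent: *` group), so blocking AI agents this way does not affect search indexing.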
Layer 2 — meta tags. The noai and noimageai meta robots directives, popularized by DeviantArt and adopted by some image hosting platforms, signal "do not train on this content / image." Adoption is uneven. Stable Diffusion's training operators have publicly committed to honoring noai; others are silent. Useful as a belt-and-braces signal alongside robots.txt.
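In HTML, the convention rides on the standard meta robots mechanism. A minimal example:

```html
<!-- Community-convention directives: a signal, not an enforceable standard;
     honored by some operators, ignored by others -->
<meta name="robots" content="noai, noimageai">
```

Because these are conventions rather than a ratified standard, the tag costs nothing to add but should never be the only layer in place.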
Layer 3 — X-Robots-Tag HTTP header. The same directives as meta robots, delivered at the HTTP level. `X-Robots-Tag: noai, noimageai` covers non-HTML resources (PDFs, images) where a meta tag isn't possible.
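As a header, the signal can be attached server-wide in one line. A sketch for nginx (Apache's `Header set` directive works analogously):

```nginx
# Inside a server or location block — emits the signal on every
# response, including PDFs and images where no meta tag can live
add_header X-Robots-Tag "noai, noimageai" always;
```

The `always` flag makes nginx add the header on error responses too, not just 200s.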
Layer 4 — legal reservation, EU. The Digital Single Market Directive Article 4(3) gives rightholders a TDM (text-and-data-mining) opt-out: a "machine-readable" reservation against commercial mining of works. The Hamburg Regional Court's LAION ruling (September 2024) and follow-up case law have started defining what "machine-readable" means in practice — robots.txt blocks for AI bots and explicit ToS clauses both qualify in current jurisprudence. EU rightholders gain an enforceable position; non-EU sites do not.
Layer 5 — paywalls, login walls, and access control. The only mechanism that actually stops a non-compliant scraper. If the data isn't reachable without authentication, it isn't trained on (until credential leaks or contract violations come into play).
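The access-control idea reduces to: no valid credential, no content. A minimal sketch in Python — the header name and presence check are illustrative only; a real gate would validate the token against a session or subscription store:

```python
from http import HTTPStatus

def gate(headers: dict) -> int:
    """Return 200 only when the request carries a bearer token.

    Illustrative: token *presence* stands in for real validation.
    """
    token = headers.get("Authorization", "")
    if token.startswith("Bearer ") and len(token) > len("Bearer "):
        return int(HTTPStatus.OK)           # authenticated reader
    return int(HTTPStatus.UNAUTHORIZED)     # crawler or anonymous visitor

print(gate({}))                                   # prints 401
print(gate({"Authorization": "Bearer s3cr3t"}))   # prints 200
```

A scraper that ignores robots.txt still hits the 401 branch here, which is what distinguishes this layer from the signaling layers above it.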
For most publishers, Layers 1-3 plus a Terms of Service clause make up the working configuration. EU publishers should add Layer 4 by ensuring their reservation is machine-readable and unambiguous. Sites with high-value paid content should consider Layer 5 for the most sensitive material.
Note that none of these layers retroactively removes content from already-trained models. Opt-out affects future cycles only; the window to act closes with the next crawl in which your content appears.
Common misconceptions
- "A robots.txt block legally prevents training." Outside the EU, robots.txt is honor-system only. Inside the EU, it can serve as the "machine-readable" reservation for TDM purposes. The legal weight depends on jurisdiction.
- "Meta noai is a recognized standard." It's a community convention with growing but partial adoption. Treat it as a signal, not a guarantee.
- "Opt-out removes my content from existing models." It doesn't. GPT-4, Claude 3, Gemini Pro — anything already trained — keeps the content. Opt-out applies to future training runs.
- "I only need to block GPTBot." That covers OpenAI's first-party crawler. It doesn't touch CCBot, ClaudeBot, Google-Extended, Bytespider, or the dozen other named agents. A real opt-out lists all of them.
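The multi-agent point can be checked mechanically. A sketch using Python's standard `urllib.robotparser` to confirm that a generated robots.txt blocks each named crawler while leaving an ordinary search bot untouched — the agent list mirrors the one above and may lag behind newly announced crawlers:

```python
import urllib.robotparser

# Named AI crawlers from the layers above; check operator docs for new ones.
AI_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot",
             "Bytespider", "PerplexityBot", "OAI-SearchBot"]

# One Disallow group per agent, as a real robots.txt would carry.
robots_txt = "\n".join(f"User-agent: {a}\nDisallow: /\n" for a in AI_AGENTS)

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Every named AI agent is blocked from the whole site...
for agent in AI_AGENTS:
    assert not rp.can_fetch(agent, "https://example.com/article"), agent

# ...while an agent with no matching group falls through to default-allow.
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # prints True
```

Running a check like this against your live robots.txt after each edit catches the "I blocked GPTBot and thought I was done" failure mode.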