GEO & AI Search · Glossary · Updated Apr 2026

AI training opt-out

Definition

AI training opt-out is the bundle of mechanisms that prevent your content from being used to train language and image models: robots.txt blocks for named user-agents, meta `noai`/`noimageai` tags, `X-Robots-Tag` HTTP headers, and — in the EU — text-and-data-mining reservation under DSM Directive Article 4(3).

Long definition

There is no single switch. AI training opt-out is a layered practice, and each layer addresses a different actor and legal regime.

Layer 1 — robots.txt user-agent blocks. The baseline for honor-system actors. Block GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider, PerplexityBot, OAI-SearchBot, and any newly published agents. Publishers like Reuters, The New York Times, and CNN run robots.txt files with 20+ AI-related user-agent blocks each. This works for compliant operators only.
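A minimal robots.txt along these lines might look like the following (the agent list mirrors the names above; verify each vendor's currently documented user-agent string before relying on it):

```
# Block named AI training crawlers (honor-system only)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /
```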

Layer 2 — meta tags. The `noai` and `noimageai` meta robots directives, popularized by DeviantArt and adopted by some image hosting platforms, signal "do not train on this content / image." Adoption is uneven. Stable Diffusion's training operators have publicly committed to honoring `noai`; others are silent. Useful as a belt-and-braces signal alongside robots.txt.
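In HTML, the convention is carried in a standard robots meta tag, for example:

```html
<!-- Belt-and-braces signal alongside robots.txt; honored only by
     operators that recognize the community convention -->
<meta name="robots" content="noai, noimageai">
```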

Layer 3 — X-Robots-Tag HTTP header. Same directives as meta robots, delivered at the HTTP level. `X-Robots-Tag: noai, noimageai` covers non-HTML resources (PDFs, images) where a meta tag isn't possible.
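One way to emit the header for non-HTML resources, sketched here as an nginx location block (the file-extension list is illustrative; adjust to your setup):

```nginx
# Attach the directives to PDFs and images, where a <meta> tag is impossible
location ~* \.(pdf|png|jpe?g|webp)$ {
    add_header X-Robots-Tag "noai, noimageai";
}
```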

Layer 4 — legal reservation, EU. The Digital Single Market Directive Article 4(3) gives rightholders a TDM (text-and-data-mining) opt-out: a "machine-readable" reservation against commercial mining of works. The Hamburg Regional Court's LAION ruling (September 2024) and follow-up case law have started defining what "machine-readable" means in practice — robots.txt blocks for AI bots and explicit ToS clauses both qualify in current jurisprudence. EU rightholders gain an enforceable position; non-EU sites do not.

Layer 5 — paywalls, login walls, and access control. The only mechanism that actually stops a non-compliant scraper. If the data isn't reachable without authentication, it isn't trained on (until credential leaks or contract violations come into play).

For most publishers, the working configuration is Layers 1-3 plus a Terms of Service clause. EU publishers should add Layer 4 by ensuring their reservation is machine-readable and unambiguous. Sites with high-value paid content should consider Layer 5 for the most sensitive material.
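A quick way to sanity-check Layer 1 is to parse your own robots.txt and confirm each AI agent is actually disallowed. A minimal sketch using Python's standard-library parser (the agent list, URL, and the sample robots.txt are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Named AI training crawlers from the list above (not exhaustive)
AI_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot",
             "Bytespider", "PerplexityBot", "OAI-SearchBot"]

def blocked_agents(robots_txt: str, url: str = "https://example.com/") -> list:
    """Return the AI agents that robots_txt disallows for the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, url)]

# A deliberately incomplete file: only two of the seven agents are blocked
robots = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""
print(blocked_agents(robots))  # → ['GPTBot', 'ClaudeBot']
```

Running this against a live site (fetch `/robots.txt` first) surfaces the common gap the misconceptions below describe: blocking GPTBot alone while the other named agents remain allowed.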

Note that none of these layers retroactively removes content from already-trained models. Opt-out affects future cycles only. The window to act is before the next crawl that includes your content.

Common misconceptions

  • "A robots.txt block legally prevents training." Outside the EU, robots.txt is honor-system only. Inside the EU, it can serve as the "machine-readable" reservation for TDM purposes. The legal weight depends on jurisdiction.
  • "Meta noai is a recognized standard." It's a community convention with growing but partial adoption. Treat it as a signal, not a guarantee.
  • "Opt-out removes my content from existing models." It doesn't. GPT-4, Claude 3, Gemini Pro — anything already trained — keeps the content. Opt-out applies to future training runs.
  • "I only need to block GPTBot." That covers OpenAI's first-party crawler. It doesn't touch CCBot, ClaudeBot, Google-Extended, Bytespider, or the dozen other named agents. A real opt-out lists all of them.