Managing LLM Crawlers: GPTBot, ClaudeBot, Google-Extended
Eight bots, two jobs, and the trade-off matrix between training opt-out and AI visibility
The robots.txt file on the average mid-sized site in 2026 is wrong about LLM crawlers in three different ways at once. It blocks a retrieval bot the team wanted to allow. It allows a training bot the team meant to block. And it relies on directives the bot's owner has already publicly stated they ignore. The cost shows up as missing citations in ChatGPT, Claude, and Perplexity that a cleaner robots.txt would have earned.
Part of the problem is that the LLM crawler population doubled between January 2024 and January 2026. Eight bots from six different operators now hit your access logs regularly, and at least four more are in beta or partial rollout. Each one does one of two jobs — training or retrieval — and a few do both under a single user-agent, which makes the two jobs impossible to separate with robots.txt alone. The right policy depends on which job each bot does and how much you value the trade-off between training opt-out and AI-channel visibility.
This article maps the eight crawlers that matter, the training-vs-retrieval split for each, the robots.txt directives they actually honor, and the policy frameworks that fit different business models. It assumes you already understand the basics of robots.txt and have decided that AI-channel visibility is worth measuring. If you haven't decided, read Should You Block AI Training Crawlers? first.
The two jobs every LLM crawler does
Every LLM crawler is doing one of two things, and the policy that's right for each is different.
Training crawlers fetch pages to add to a dataset that will be used to train future model versions. They run on a slow cadence — weeks to months — and the content they fetch shapes models that will be deployed in a future version, not the current one. Blocking a training crawler opts your domain out of future training runs. It does not affect current retrieval-time behavior.
Retrieval crawlers fetch pages on demand when a live user query triggers a web search. They run on a fast cadence — sometimes seconds after a query — and the content they fetch is fed into the answer generation pipeline immediately. Blocking a retrieval crawler removes you from that surface's AI Search results entirely.
A few crawlers do both jobs under the same user-agent, which is a problem. ByteDance's Bytespider is the clearest case — it serves both ByteDance's training pipeline and ByteDance's retrieval needs without exposing a way to opt out of one without the other. Where the operator has published a clear split (OpenAI, Anthropic, Google, Perplexity), you can be selective. Where they haven't, you accept a binary.
The trade-off matrix is straightforward in shape, complex in implementation. Block training to deny LLMs your content for future model improvement. Allow retrieval to stay visible in current AI Search surfaces. The interesting case is the middle: split policies that block training and allow retrieval, which is what most publishers and SaaS sites should default to in 2026.
The eight crawlers that matter in 2026
Below is the field guide. User-agents listed are what shows up in access logs as of April 2026.
GPTBot (OpenAI, training). Documented at openai.com/gptbot. Honors robots.txt. Blocking opts you out of training data for future GPT models. Does not affect ChatGPT Search visibility. See GPTBot glossary entry for the full directive list.
ChatGPT-User and OAI-SearchBot (OpenAI, retrieval). Used when a ChatGPT user's query triggers a web search. Honors robots.txt. Blocking removes you from ChatGPT Search citations entirely. The two user-agents serve overlapping jobs; treat them as a pair.
ClaudeBot (Anthropic, training). Documented at anthropic.com. Honors robots.txt. Blocking opts you out of training data for future Claude models. Does not affect Claude's ability to retrieve your content when a user pastes a URL or runs a Claude-with-search query. See ClaudeBot glossary entry.
Claude-User and claude-web (Anthropic, retrieval). Used when a Claude user runs a tool-enabled query that fetches a URL. Honors robots.txt. Blocking removes you from Claude's at-query-time retrieval. The exact user-agent string has changed twice in eighteen months; check the current docs at anthropic.com before writing rules.
Google-Extended (Google, training). Not a crawler — it's a directive. Google's regular Googlebot crawls your site as always, but a User-agent: Google-Extended rule in robots.txt tells Google not to use the fetched content for training Gemini or Vertex AI generative features. Honors disallow. Does not affect Google Search ranking or AI Overview citation. See Google-Extended glossary entry.
PerplexityBot (Perplexity, retrieval). Documented at perplexity.ai. Honors robots.txt — most of the time. Perplexity was caught in mid-2024 fetching content via headless browser when blocked at the bot level, which prompted public criticism and a partial walk-back. As of April 2026, the company claims to honor robots.txt for PerplexityBot. Audit your access logs to verify.
Bytespider (ByteDance, training + retrieval). Used by ByteDance for training Doubao and other models, and for retrieval in ByteDance products. Does not cleanly separate the two jobs. Honors robots.txt. Blocking is binary — you opt out of both training and retrieval. See Bytespider glossary entry.
CCBot (Common Crawl, training-adjacent). Common Crawl is not an LLM company; it's a non-profit that publishes a public web crawl that is used as training data by every major LLM operator. Blocking CCBot is the upstream lever — it removes your content from the dataset that feeds OpenAI, Anthropic, Mistral, Meta, and many smaller operators. See Common Crawl glossary entry.
A few minor or partially-deployed crawlers also show up: Applebot-Extended (Apple's training opt-out, parallel to Google-Extended), cohere-ai (sometimes seen capitalized as Cohere-AI), Diffbot, Meta-ExternalAgent, FacebookBot, and Bingbot (which also feeds Bing Chat retrieval). The matrix below lists the canonical eight; the long tail follows the same logic.
The robots.txt directives table
Below is a reference table for what each crawler actually honors. "Honors disallow" means it has been observed to respect Disallow directives in real-world testing. "Honors crawl-delay" means it slows down when asked. "Public docs" means the operator publishes a stable URL describing the crawler.
| User-agent | Operator | Job | Honors Disallow | Honors Crawl-delay | Public docs |
|---|---|---|---|---|---|
| GPTBot | OpenAI | Training | Yes | No | openai.com/gptbot |
| ChatGPT-User | OpenAI | Retrieval | Yes | No | openai.com docs |
| OAI-SearchBot | OpenAI | Retrieval | Yes | No | openai.com docs |
| ClaudeBot | Anthropic | Training | Yes | No | anthropic.com |
| Claude-User | Anthropic | Retrieval | Yes | No | anthropic.com |
| Google-Extended | Google | Training (directive) | Yes | N/A | developers.google.com |
| PerplexityBot | Perplexity | Retrieval | Mostly | No | perplexity.ai |
| Bytespider | ByteDance | Training+Retrieval | Yes | No | Sparse |
| CCBot | Common Crawl | Training-adjacent | Yes | Yes | commoncrawl.org |
A few honest caveats. None of these crawlers honor crawl-delay reliably except CCBot, which is a non-profit with a different operating philosophy. Published IP ranges are the exception rather than the rule: OpenAI, Google, and (as of late 2025) Anthropic publish verifiable ranges, but most other operators don't, which means user-agent spoofing remains a real attack surface. Anyone can claim to be ChatGPT-User, and unless you cross-check the requesting IP against a published range, the claim goes unchallenged. And documentation lag is the rule, not the exception. The user-agent strings in this table are accurate as of April 2026 and may shift; verify against the operator's docs before shipping a rule.
A trade-off matrix you can actually decide against
Here's the framework I use when sitting with a client to set LLM crawler policy. The matrix has two axes — content sensitivity and AI-channel revenue exposure — and four resulting postures.
Quadrant 1: Low sensitivity, high AI exposure. Marketing content, blog posts, public documentation, product info. Allow everything. Training opt-out has no upside; retrieval blocking actively costs you brand exposure. Policy: allow GPTBot, ClaudeBot, Google-Extended, PerplexityBot, ChatGPT-User, Claude-User, Bytespider, CCBot.
Quadrant 2: Low sensitivity, low AI exposure. Internal-facing pages, low-traffic product pages, sites whose audience doesn't use LLMs to discover content. Allow training, allow retrieval, but with no special effort. Policy: same as Quadrant 1; the difference is that you don't invest in AI-channel measurement.
Quadrant 3: High sensitivity, high AI exposure. This is the interesting case — publishers, SaaS docs, premium content sites that want AI visibility but don't want their content used as free training data. Block training, allow retrieval. Policy: block GPTBot, ClaudeBot, Google-Extended, Bytespider, CCBot. Allow ChatGPT-User, Claude-User, PerplexityBot, OAI-SearchBot.
Quadrant 4: High sensitivity, low AI exposure. Paywalled content, member-only sites, enterprise documentation, regulated industries. Block everything except where there's a specific contractual or strategic reason to allow. Policy: block all eight crawlers. Use llms.txt and x-robots-tag headers as defense-in-depth.
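If it helps to see the matrix as logic rather than prose, here is a minimal sketch. The axes and recommended postures come straight from the four quadrants above; the function name and return strings are illustrative labels, not part of any standard.

```python
def crawler_posture(high_sensitivity: bool, high_ai_exposure: bool) -> str:
    """Map the two-axis matrix to a recommended crawler posture."""
    if not high_sensitivity:
        # Quadrants 1 and 2: nothing to protect, so stay visible everywhere.
        return "allow training and retrieval"
    if high_ai_exposure:
        # Quadrant 3: protect the content, keep the AI-search citations.
        return "block training, allow retrieval"
    # Quadrant 4: no AI-channel upside, so lock it down and add extra layers.
    return "block training and retrieval"
```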
The mistake I see most often: Quadrant 3 sites running Quadrant 4 policies. They block ChatGPT-User along with GPTBot because the team conflated the two, and they're invisible in ChatGPT Search for no benefit. The reverse mistake — Quadrant 4 sites with Quadrant 1 policies because nobody audited the robots.txt — is also common, and worse, because it leaks training data the operator did not consent to.
Practical robots.txt snippets
Here are working snippets for each posture. Drop them into your existing robots.txt; they don't conflict with regular Googlebot or Bingbot rules.
Posture A — Allow everything (Quadrant 1 & 2):
No additional rules needed. Your existing robots.txt with User-agent: * rules already governs LLM crawlers. The default behavior is "allow."
Posture B — Block training, allow retrieval (Quadrant 3, the most common):
```
# OpenAI
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Anthropic
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Allow: /

# Google
User-agent: Google-Extended
Disallow: /

# Perplexity (retrieval-only operator)
User-agent: PerplexityBot
Allow: /

# ByteDance (binary; choose based on Doubao exposure)
User-agent: Bytespider
Disallow: /

# Common Crawl (upstream training data feed)
User-agent: CCBot
Disallow: /
```
Posture C — Block everything (Quadrant 4):
```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /
```
For Posture C, robots.txt alone is not enough. Add X-Robots-Tag: noai, noimageai response headers as a second layer (some operators read these), and consider rate-limiting or IP-blocking known bot IP ranges as a third. None of these are bulletproof, but together they signal intent clearly enough that most operators comply and the rest become a litigation question.
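As a concrete illustration of the second layer: if your pages happen to be served by a Python application, a minimal WSGI middleware sketch that attaches the header looks like the following. This is illustrative only; most sites set the same header at the web server or CDN, and, as noted above, noai and noimageai are advisory values that only some operators read.

```python
def with_ai_policy_header(app):
    """Wrap a WSGI app so every response carries the advisory AI-policy header."""
    def middleware(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Append the header without disturbing whatever the app already set.
            headers = list(headers) + [("X-Robots-Tag", "noai, noimageai")]
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return middleware

# Usage (hypothetical application object):
# application = with_ai_policy_header(application)
```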
Where robots.txt fails and what to layer on
Robots.txt has three structural weaknesses against LLM crawlers, and serious policies layer additional defenses on top.
No verification. Anyone can claim to be ChatGPT-User. Real OpenAI traffic comes from a published IP range; impostor traffic comes from anywhere. If you want certainty that the bot you're allowing is real, log requests by IP and cross-reference against the operator's published ranges (when published — most aren't).
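That cross-reference takes only a few lines. A sketch, assuming you have already pulled the operator's current CIDR list from its documentation; the ranges below are placeholders, not real OpenAI ranges.

```python
import ipaddress

def ip_matches_published_ranges(client_ip: str, published_cidrs: list[str]) -> bool:
    """Return True if the requesting IP falls inside one of the operator's ranges."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in published_cidrs)

# Placeholder ranges for illustration; fetch the real list from the operator's docs.
gptbot_cidrs = ["192.0.2.0/24", "198.51.100.0/24"]
print(ip_matches_published_ranges("192.0.2.17", gptbot_cidrs))   # True (placeholder range)
print(ip_matches_published_ranges("203.0.113.9", gptbot_cidrs))  # False
```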
No retroactive effect. Blocking GPTBot today does not remove your content from training datasets that already exist. OpenAI has not committed to retraining without your data; they've committed to not fetching new data. The damage from prior crawls is permanent.
No cross-operator coverage. Blocking GPTBot does not block models trained on your content obtained through Common Crawl, scraped into third-party datasets, or licensed from data brokers. The training-data supply chain has many tributaries; robots.txt only affects the headwaters.
The defenses that layer well: X-Robots-Tag: noai headers, llms.txt for explicit AI-policy declarations, EU TDM Article 4(3) reservation language in your terms of service for the legal lever, and active log monitoring to catch crawlers that ignore your rules. None of these are silver bullets. Together they constitute a defensible posture.
Audit checklist for your robots.txt today
Run this checklist now:
- Open `your-domain.com/robots.txt`. List every `User-agent:` block.
- For each LLM-related user-agent, verify the directive against the operator's current docs. User-agent strings have changed; rules referencing old strings do nothing.
- Decide your posture (A, B, or C above) based on the four-quadrant matrix.
- Update robots.txt to match. Ship the change.
- Monitor access logs for two weeks. Expect to see new bots you've never noticed; the LLM crawler population grows monthly. (A log-tally sketch follows this checklist.)
- Cross-check vendor citation tools (Profound, Otterly, Athena) against your sampling — see Tracking Your Brand's Visibility in AI Answers — to verify you haven't accidentally blocked a retrieval surface you wanted to be on.
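A minimal tally script for the monitoring step, assuming a combined-format access log where the user-agent is the last quoted field. The user-agent substrings mirror the field guide above; extend the list as new bots appear.

```python
import re
import sys
from collections import Counter

LLM_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
            "Claude-User", "PerplexityBot", "Bytespider", "CCBot"]

def tally_llm_hits(log_path: str) -> Counter:
    """Count hits per LLM crawler by matching user-agent substrings."""
    counts = Counter()
    ua_pattern = re.compile(r'"([^"]*)"\s*$')  # last quoted field is the user-agent
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = ua_pattern.search(line)
            if not match:
                continue
            user_agent = match.group(1)
            for bot in LLM_BOTS:
                if bot in user_agent:
                    counts[bot] += 1
    return counts

if __name__ == "__main__":
    for bot, hits in tally_llm_hits(sys.argv[1]).most_common():
        print(f"{bot}\t{hits}")
```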
For the strategic frame on whether to opt out of training at all, see Should You Block AI Training Crawlers? A Strategic Framework. For the broader playbook this fits into, see Generative Engine Optimization: The 2026 Playbook.
Frequently asked questions
If I block GPTBot today, does my content get removed from existing GPT models?
No. Blocking GPTBot prevents future training fetches. Content already in the training corpus stays in the corpus. OpenAI has not committed to retraining models without your data; the lever you're pulling is forward-looking only.
Does Google-Extended affect my AI Overview visibility?
No. Google-Extended controls whether your content is used to train Gemini and Vertex AI generative products. AI Overviews use a separate retrieval layer that draws from the regular index, governed by regular Googlebot. You can block Google-Extended and still appear in AI Overviews. See the Google-Extended glossary entry.
What about user-agent spoofing? How do I know the bot I'm allowing is the real one?
Major operators publish IP ranges, but unevenly. OpenAI publishes for GPTBot. Google publishes for Googlebot (which extends to Google-Extended). Anthropic publishes for ClaudeBot as of late 2025. For high-stakes sites, IP verification is the layer above user-agent matching. For most sites, user-agent matching plus access-log monitoring is sufficient.
Should I use llms.txt instead of robots.txt for AI policies?
Use both. llms.txt is a proposed standard from September 2024 that some tooling reads and some ignores. It's a useful explicit declaration, but it has not displaced robots.txt as the operative directive layer. Read Implementing llms.txt: A Practical Guide for the implementation pattern.
What's the legal weight of robots.txt for AI training?
Variable by jurisdiction. In the EU, Article 4(3) of the Copyright in the DSM Directive lets you reserve text-and-data-mining rights through machine-readable means; robots.txt directives are widely interpreted as a valid reservation. In the US, the legal status is less settled and currently in active litigation. Treat robots.txt as a strong norm and a good-faith signal, not as an enforceable contract everywhere.
How often should I revisit my LLM crawler policy?
Quarterly, at minimum. New crawlers launch every few months, user-agent strings change, and your business-model exposure to AI search shifts as the surfaces grow. A policy set in 2024 and never revisited is almost certainly wrong by now.
The summary for the impatient: most sites should be in Posture B — block training, allow retrieval, with a current and audited robots.txt. The cost of getting this wrong is invisible until you measure citation rate and notice you're absent from surfaces you intended to be on. Audit, decide, ship, monitor. The matrix doesn't change much; the user-agent strings do.
Related articles
Optimizing for Perplexity: What Sources Get Cited
Perplexity citations don't follow Google's logic. Older domains, .edu and .gov bias, deeper retrieval, and a freshness signal that punishes thin update cycles. Here's the playbook for the second-largest answer engine.
Tracking Your Brand's Visibility in AI Answers
Five vendors now sell AI-answer visibility tracking. The metrics they report don't match. Here's the toolset, the metric definitions worth using, and a manual sampling protocol when budget rules out vendors.
Citation Rate: The KPI Your SEO Dashboard Is Missing
Citation rate is the GEO equivalent of organic CTR — and your dashboard does not show it. Here is how to define it, instrument it, and report it without lying.