GEO & AI Search · Glossary · Updated Apr 2026

Common Crawl

Definition

Common Crawl is a non-profit open-web archive that has crawled the public internet since 2008, publishing new snapshots roughly monthly in recent years. Its crawler is CCBot, and the resulting petabytes of HTML and text are the foundation of most public LLM training datasets, including those used to train GPT-3, LLaMA, and many others. The project is hosted at commoncrawl.org.

Long definition

The Common Crawl Foundation runs the closest thing the AI industry has to a shared corpus. Each monthly crawl ingests billions of pages and publishes the result as compressed WARC, WET, and WAT files on AWS Open Data, free for anyone to download. As of 2025 the cumulative archive sits in the multi-petabyte range and continues to grow.
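
To see whether a given page was captured in a particular snapshot, the public index server at index.commoncrawl.org can be queried directly. The sketch below is a minimal example using Python's standard library; the crawl ID and domain are placeholders, not recommendations.

import json
import urllib.parse
import urllib.request

CRAWL = "CC-MAIN-2024-33"   # placeholder crawl ID; substitute any published snapshot
TARGET = "example.com/*"    # placeholder URL pattern to look up

query = urllib.parse.urlencode({"url": TARGET, "output": "json"})
endpoint = f"https://index.commoncrawl.org/{CRAWL}-index?{query}"

# The index returns one JSON record per captured page, one record per line.
# A 404 response means the pattern was not found in that snapshot.
with urllib.request.urlopen(endpoint) as resp:
    for line in resp.read().decode("utf-8").splitlines():
        record = json.loads(line)
        print(record["timestamp"], record["url"], record.get("status"))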

The crawler identifies itself as CCBot. Block it with the standard rule:

User-agent: CCBot
Disallow: /

Why this single bot matters more than any first-party AI crawler: most public LLM training corpora derive from Common Crawl. GPT-3's training data was weighted at 60% Common Crawl in the training mix, per the original paper. Meta's LLaMA corpora, EleutherAI's Pile, MassiveText, RefinedWeb, and FineWeb all draw heavily on Common Crawl extracts, and filtered subsets like C4 (the Colossal Clean Crawled Corpus used to train T5) are derived datasets. Blocking GPTBot or ClaudeBot does nothing if your content was already harvested by CCBot a year earlier and is sitting in someone's training extract.

That said, blocking CCBot today has real leverage over future datasets. A site that blocks CCBot will not appear in next month's WARC and will be progressively dropped from refreshed training extracts as model labs re-derive their corpora. A site that goes from open to blocked in 2024-2025 may still appear in models trained on 2023 snapshots, but should fall out of corpora derived from post-block crawls.

For verification: Common Crawl documents CCBot at commoncrawl.org. The crawler runs from AWS infrastructure, so reverse DNS is a weaker signal here than it is for search-engine bots; rely on the documented user agent plus log-file analysis of the source IPs, since spoofers exist for every AI-relevant bot.
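
A first pass at that log-file analysis can be as simple as tallying which source IPs present the CCBot user agent. The sketch below assumes a combined-format access log at a placeholder path; adjust both for your server.

import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # placeholder path, assumption only
ip_counts = Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "CCBot" not in line:
            continue
        # Combined log format begins with the client IP.
        match = re.match(r"^(\S+)", line)
        if match:
            ip_counts[match.group(1)] += 1

# IPs that claim to be CCBot, ready to cross-check against Common Crawl's documentation.
for ip, hits in ip_counts.most_common(20):
    print(f"{ip}\t{hits}")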

Practical posture: if you block GPTBot, ClaudeBot, and Google-Extended but allow CCBot, your AI opt-out has a hole the size of the open web. Add CCBot to any serious training-opt-out list.
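
One way to audit that posture is to fetch a site's live robots.txt and test it against each bot token. The sketch below uses Python's standard urllib.robotparser; the site URL and bot list are illustrative assumptions.

from urllib import robotparser

SITE = "https://example.com"   # placeholder site
BOTS = ["CCBot", "GPTBot", "ClaudeBot", "Google-Extended"]

parser = robotparser.RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()

# A bot that can still fetch "/" is a hole in the opt-out.
for bot in BOTS:
    status = "allowed (opt-out hole)" if parser.can_fetch(bot, f"{SITE}/") else "blocked"
    print(f"{bot:16} {status}")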

Common misconceptions

  • "Common Crawl is run by OpenAI." It's an independent non-profit founded by Gil Elbaz in 2007. OpenAI is one of many users of the data, not the operator.
  • "Blocking CCBot retroactively removes me from past datasets." No. Once a WARC is published and downloaded, your content is in someone's local copy regardless of future blocks. The block protects future crawls only.
  • "CCBot is a small crawler." Each monthly crawl ingests billions of pages. CCBot is one of the largest non-search crawlers on the web by volume, comparable in scope to Bing's index.
  • "Filtered datasets like C4 are unrelated to my robots.txt." C4 derives from Common Crawl. If you blocked CCBot before the snapshot C4 used, your content is not in C4. The chain of derivation matters.