GEO & AI Search · Glossary · Updated Apr 2026

Bytespider

Definition

Bytespider is the web crawler operated by ByteDance, the parent company of TikTok, used to gather training data for ByteDance's models, including the Doubao family. It is known for aggressive request rates and frequently appears at the top of Cloudflare's "most-blocked AI bots" reports. ByteDance states that it honors robots.txt rules addressed to the Bytespider user-agent.

Long definition

Bytespider identifies itself with a user-agent string containing the token Bytespider. ByteDance documents the crawler at toutiao.com and on its developer pages, listing IP ranges and verification methods. The bot fetches public web content used to train ByteDance's internal models, including those powering TikTok features and the Doubao chatbot family deployed in China.

What sets it apart from GPTBot or ClaudeBot is the volume. Cloudflare's 2024 Radar data placed Bytespider as the most-blocked AI crawler on customer sites, with request rates that hit small servers hard during peak indexing windows. Operators have reported sustained 50-200 requests per second from Bytespider IPs against single domains, dwarfing typical Googlebot rates for the same site.
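To check whether a site is seeing rates in that range, the peak per-second request count can be estimated from access logs. The following is a minimal sketch, assuming logs in the common "combined" format with the timestamp in square brackets; the sample lines, the user-agent text, and the function name peak_requests_per_second are illustrative, not taken from ByteDance documentation.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access-log lines (combined log format). The IPs come from
# documentation-only ranges and the user-agent text is illustrative.
SAMPLE_LOG = [
    '192.0.2.10 - - [12/Mar/2025:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleAgent (compatible; Bytespider)"',
    '192.0.2.11 - - [12/Mar/2025:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "ExampleAgent (compatible; Bytespider)"',
    '192.0.2.12 - - [12/Mar/2025:10:00:02 +0000] "GET /c HTTP/1.1" 200 512 "-" "ExampleAgent (compatible; Bytespider)"',
    '198.51.100.7 - - [12/Mar/2025:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleAgent (compatible; Googlebot)"',
]


def peak_requests_per_second(lines, agent_token="Bytespider"):
    """Return the largest number of matching requests seen in any one second.

    Simplification: a line matches if agent_token appears anywhere in it,
    so a URL path containing the token would also be counted.
    """
    per_second = Counter()
    token = agent_token.lower()
    for line in lines:
        if token not in line.lower():
            continue
        # The timestamp sits between the first '[' and the following ']'.
        start = line.find("[")
        end = line.find("]", start)
        if start == -1 or end == -1:
            continue  # skip lines that do not look like combined-format entries
        stamp = datetime.strptime(line[start + 1:end], "%d/%b/%Y:%H:%M:%S %z")
        per_second[stamp] += 1
    return max(per_second.values(), default=0)


peak = peak_requests_per_second(SAMPLE_LOG)
```

On a real server the lines would be read from the log file rather than a list, and a sustained rate would be better judged over a window of several minutes than from a single peak second.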

The robots.txt block is standard:

User-agent: Bytespider
Disallow: /

ByteDance has publicly stated that Bytespider honors robots.txt. In practice, compliance has improved since 2023, when operators reported the crawler ignoring robots.txt rules, but operators still recommend verifying with log analysis against the published IP ranges. Spoofers using the Bytespider user-agent are common, so block at the WAF level (by ASN or IP range) any traffic that claims to be Bytespider but doesn't come from a published source.
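The log check described above can be sketched as follows. This is a minimal illustration, not ByteDance's verification method: the CIDR blocks in PUBLISHED_RANGES are placeholders from the RFC 5737 documentation ranges and must be replaced with whatever ranges ByteDance actually publishes, and the function names are hypothetical.

```python
import ipaddress

# Placeholder CIDR blocks (RFC 5737 documentation ranges).
# These are NOT real ByteDance ranges; substitute the published list.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def is_published_source(ip, ranges=PUBLISHED_RANGES):
    """Return True if ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


def classify(ip, user_agent):
    """Label one request as 'other', 'verified', or 'suspected-spoof'."""
    if "bytespider" not in user_agent.lower():
        return "other"  # request does not claim to be Bytespider
    return "verified" if is_published_source(ip) else "suspected-spoof"


label = classify("203.0.113.9", "ExampleAgent (compatible; Bytespider)")
```

Requests labeled suspected-spoof are the candidates for a WAF rule, while verified requests that still fetch disallowed paths point to a genuine compliance problem.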

Cloudflare offers a one-click "Block AI Bots" toggle that includes Bytespider alongside GPTBot, ClaudeBot, CCBot, and other named training crawlers. For sites that don't run Cloudflare, the equivalent is a robots.txt section listing all the agents and a WAF rule for non-compliant repeat offenders.
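For sites without the toggle, the multi-agent robots.txt section can be generated and sanity-checked with Python's standard library. The sketch below assumes one group per crawler and uses only the four agents named above; the helper names build_robots_txt and is_allowed and the example.com URL are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Agents named in the text above; extend the list as needed.
AGENTS = ["Bytespider", "GPTBot", "ClaudeBot", "CCBot"]


def build_robots_txt(agents):
    """Return robots.txt text with one full-site Disallow group per agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt, agent, url="https://example.com/page"):
    """Parse robots_txt and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(AGENTS)
blocked = not is_allowed(robots_txt, "Bytespider")
```

This only confirms that the file says what was intended. Whether a given crawler obeys it still has to be checked in the server logs.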

For sites serving non-Chinese audiences, the cost-benefit of allowing Bytespider is lopsided. Visibility in Doubao or TikTok search rarely converts to traffic for sites outside that ecosystem, while the bandwidth and server costs of aggressive crawling are real. Most non-Chinese publishers default to a block.

Common misconceptions

  • "Bytespider only crawls Chinese-language sites." It crawls globally. ByteDance trains models on multilingual data, and English-language sites see steady Bytespider traffic regardless of localization.
  • "Bytespider always ignores robots.txt." Behavior has improved since 2023. Current ByteDance policy is to honor robots.txt; verify with your logs against published IPs before assuming non-compliance.
  • "Blocking Bytespider blocks TikTok crawlers entirely." TikTok runs additional bots for embed previews, OG tag fetching, and link sharing. Block patterns for those are documented separately.
  • "Bytespider is the same as Common Crawl." Different operator, different purpose. ByteDance runs Bytespider for first-party training. Common Crawl (CCBot) is a public dataset that many AI labs also use, including some that don't run their own crawlers.