Robots.txt Patterns From the Wild (And Why They Break)
Real-world robots.txt examples with annotations, mistakes, and the cases they don't cover
Most robots.txt files I audit break in subtle ways. Some block more than intended; some block less; some rely on directives Google ignores entirely. The problem isn't that the file is complicated — it's that the failure modes are silent. Your robots.txt could be blocking half your product pages and you wouldn't notice until traffic dropped three weeks later.
This article walks through real-world robots.txt examples for different site types, annotates each line with what it's doing and the edge cases, and ends with the testing workflow that catches issues before deploy.
Syntax refresher (30 seconds)
User-agent: <bot-name or * for all>
Disallow: <path pattern to block>
Allow: <path pattern to allow, overrides Disallow when more specific>
Sitemap: <full URL to sitemap.xml>
Rules:
- The most specific matching User-agent block wins for that crawler; * is the fallback.
- Within a block, the longest matching Allow or Disallow wins (more specific beats less specific).
- The * wildcard in paths matches any sequence of characters; $ anchors to the end of the URL.
- Crawl-delay: N is ignored by Google. Only Bing, Yandex, and a few smaller crawlers honor it.
- Lines starting with # are comments.
One file per host. /robots.txt at the root. Must be served as plain text with HTTP 200 to be valid. A 404 is treated as "no restrictions"; a 503 means "slow down, retry soon"; a 5xx for many days leads to eventual full crawl halt.
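The longest-match rule is the part people trip over most, so here is a minimal Python sketch of how a Google-style evaluator picks between Allow and Disallow. The function names and the simplified tie-breaking are mine, not any crawler's actual implementation.

```python
import re

def to_regex(pattern: str) -> re.Pattern:
    # '*' matches any sequence of characters; a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    # rules: [("allow" | "disallow", pattern), ...] from the matched User-agent block.
    # The longest matching pattern wins; ties go to Allow; no match means crawlable.
    best_len, verdict = -1, True
    for kind, pattern in rules:
        if to_regex(pattern).match(path):
            if len(pattern) > best_len or (len(pattern) == best_len and kind == "allow"):
                best_len, verdict = len(pattern), (kind == "allow")
    return verdict

rules = [("disallow", "/search"), ("allow", "/search/help")]
print(is_allowed("/search?q=shoes", rules))  # False: only /search matches
print(is_allowed("/search/help", rules))     # True: the longer Allow wins
```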
Example 1: Ecommerce site
# robots.txt for shop.example.com
User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /login
Disallow: /logout
# Kill internal search — infinite URL space with no SEO value
Disallow: /search
Disallow: /*?q=
# Block parameter noise (tracking, session, referral)
Disallow: /*?utm_
Disallow: /*?ref=
Disallow: /*?session=
Disallow: /*?fbclid=
Disallow: /*?gclid=
# Allow faceted URLs with SEO value
Allow: /category/*?brand=
Allow: /category/*?gender=
# ...but block deep facet combinations
Disallow: /*?*&*&*
# (matches any URL with 3+ parameters; excludes them from crawl)
Sitemap: https://shop.example.com/sitemap.xml
What this does line by line:
- Transactional pages blocked: cart, checkout, account, login. These have zero SEO value and shouldn't be crawled.
- Internal search blocked: the /search path plus any URL with a ?q= parameter.
- Parameter noise blocked: UTM, referral, session IDs, Facebook/Google click IDs. These create duplicate URLs with no content differences.
- Single-facet category URLs allowed (brand, gender filters have real search intent).
- Multi-facet combinations blocked via the /*?*&*&* pattern, which matches URLs with 3+ parameters, almost always low-value combinatorial noise.
The silent failure mode: the Disallow: /*?*&*&* pattern catches any URL with 3+ parameters, including ones you might want indexable (a PDP with color + size + warranty selectors). Test against real URLs before deploying.
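One way to do that testing without waiting on a crawler: translate the pattern into the regex it effectively becomes under robots.txt wildcard semantics and run real URLs through it. The URLs below are made-up placeholders.

```python
import re

# Under robots.txt wildcard semantics, /*?*&*&* behaves like this regex:
three_plus_params = re.compile(r"/.*\?.*&.*&.*")

samples = [
    "/category/shoes?brand=nike",                       # single facet: crawlable
    "/category/shoes?brand=nike&gender=m&sort=price",   # facet noise: blocked
    "/product/runner-x?color=red&size=10&warranty=2y",  # PDP you may want crawled: also blocked
]
for url in samples:
    verdict = "BLOCKED" if three_plus_params.match(url) else "allowed"
    print(f"{verdict:8} {url}")
```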
Example 2: News / publication site
# robots.txt for news.example.com
User-agent: *
Disallow: /admin
Disallow: /edit
Disallow: /drafts
Disallow: /api
# Block print versions (duplicate content)
Disallow: /*/print/
Disallow: /*?print=
# Archive pagination from page 10 on: low-value deep pages
# (robots.txt has no regex ranges; enumerate Allow rules for the pages you keep)
Disallow: /archive/*?page=
Allow: /archive/*?page=1$
Allow: /archive/*?page=2$
Allow: /archive/*?page=3$
Allow: /archive/*?page=4$
Allow: /archive/*?page=5$
Allow: /archive/*?page=6$
Allow: /archive/*?page=7$
Allow: /archive/*?page=8$
Allow: /archive/*?page=9$
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Googlebot-News
Allow: /
Sitemap: https://news.example.com/sitemap-index.xml
Sitemap: https://news.example.com/news-sitemap.xml
Highlights:
- Standard blocks on internal routes (admin, API).
- Print-version URLs blocked (they're duplicates).
- Archive pagination at page 10 and beyond blocked: all /archive/*?page= URLs are disallowed, then pages 1 through 9 are re-allowed with $-anchored rules (the anchor keeps page=1 from also matching page=10). robots.txt supports only * and $ as special characters, no [0-9]-style ranges, so enumeration is the only way to say "past page N". Note: if your pagination is /archive/page-10/ instead of a ?page= parameter, these patterns miss entirely. Test.
- AI training bots (GPTBot, ClaudeBot, CCBot) disallowed. Increasingly common for content sites. Googlebot-News explicitly allowed in a more-specific block.
- Two sitemap references: sitemap-index.xml for general content, news-sitemap.xml for Google News freshness indexing.
The pitfall here: User-agent: Googlebot-News only applies to Googlebot-News-specific behavior. The regular Googlebot would fall under User-agent: *. If you need different rules for Googlebot-News vs Googlebot, declare each explicitly.
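Here is a rough sketch of that group-selection logic in Python, using the crawler names from the example above. Real parsers match on product tokens with more care, but the most-specific-name-wins idea is the same.

```python
# User-agent group names from the example above, lowercased the way parsers compare them.
GROUPS = ["*", "gptbot", "claudebot", "ccbot", "googlebot-news"]

def group_for(crawler: str) -> str:
    # A crawler obeys the group whose name is the most specific match for its
    # own product token, falling back to '*' when nothing else matches.
    crawler = crawler.lower()
    matches = [g for g in GROUPS if g != "*" and crawler.startswith(g)]
    return max(matches, key=len) if matches else "*"

print(group_for("Googlebot-News"))  # googlebot-news: gets the explicit Allow: /
print(group_for("Googlebot"))       # *: regular Googlebot falls under the catch-all
print(group_for("GPTBot"))          # gptbot: disallowed entirely
```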
Example 3: SaaS / documentation
# robots.txt for app.example.com
User-agent: *
# App routes — require authentication, no SEO value
Disallow: /dashboard
Disallow: /settings
Disallow: /projects
Disallow: /billing
# Docs are public and should be crawled
Allow: /docs
# API docs
Allow: /api-docs
# Block auth flow pages
Disallow: /auth
Disallow: /reset-password
# Block old docs that still 200 OK for backward-compat
Disallow: /docs/v1
Disallow: /docs/v2
Sitemap: https://app.example.com/docs-sitemap.xml
# NOTE: /robots.txt for the marketing site is separate
# See example.com/robots.txt
Highlights:
- App subdomain (app.example.com) gets its own robots.txt. Every subdomain = separate file.
- Authenticated routes blocked en masse. Nothing under /dashboard, /settings, etc. has SEO value.
- Current docs allowed by an explicit Allow: (overriding nothing specifically, just for clarity).
- Old doc versions blocked to consolidate rankings on current versions (while keeping them accessible via deep links for users who need them).
Gotcha: marketing site and app are typically on different subdomains (example.com vs app.example.com). Each subdomain needs its own /robots.txt. Making this clear via comments prevents future confusion.
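A small script that fetches every subdomain's robots.txt and prints the status plus the first line is a cheap way to catch drift between those files. The hostnames below are placeholders.

```python
import urllib.request
import urllib.error

# Placeholder hostnames: each subdomain serves its own /robots.txt.
HOSTS = ["example.com", "app.example.com"]

for host in HOSTS:
    url = f"https://{host}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            first = (resp.read().decode("utf-8", "replace").splitlines() or [""])[0]
            print(f"{url}: HTTP {resp.status}  {first}")
    except urllib.error.HTTPError as err:
        # A 404 here means "no restrictions"; a 5xx is the case that slows crawling down.
        print(f"{url}: HTTP {err.code}")
```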
The 10 most common mistakes
1. Blocking CSS/JS. Disallow: /css or Disallow: /assets looks harmless, but Google needs CSS and JS to render pages properly. If it can't fetch them, rendering breaks. Never block these.
2. Disallow: / in production. This blocks everything. Staging sites accidentally deployed to production with this in place have dropped entire sites from the index.
3. Relying on Crawl-delay. Google ignores it; Bing respects it. Search Console's crawl rate limiter was retired in January 2024, so if you need to slow Googlebot down, return 503 or 429 responses when your servers are overloaded, not a robots.txt directive.
4. Trailing slash mismatches. Disallow: /admin blocks /admin, /admin/settings, /admin/users. Disallow: /admin/ only blocks /admin/...; /admin itself isn't blocked. Be intentional about the trailing slash.
5. Patterns that block too much. Disallow: /page blocks /page, /page/2, /pagebreak, and /pages. Use Disallow: /page/ or anchor with $ for exact matches.
6. Patterns that block too little. Disallow: /search?q= only blocks URLs with exactly that prefix. /search?q=shoes is blocked; /search?query=shoes isn't. Use the broader prefix Disallow: /search, or a wildcard like Disallow: /*?q=, instead.
7. Noindex + Disallow combo. Already covered in the noindex glossary entry: the Disallow defeats the noindex. Disallowed pages never get crawled, so Google never sees the noindex directive, and the URL can stay in the index with a blank snippet.
8. Forgetting the sitemap reference. Not strictly required (submitting in GSC works), but it helps other crawlers (Bing, smaller engines) discover your sitemap.
9. Case sensitivity assumptions. Paths in robots.txt are case-sensitive. Disallow: /Admin does not block /admin. Pick a convention and stick with it.
10. Blocking staging and production with one file. If the file syncs from staging to production, Disallow: / in staging becomes Disallow: / in production. Keep them separate, or build robots.txt dynamically per environment (see the sketch below).
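As a sketch of that last point, here is an environment-aware robots.txt route, assuming a Flask app and an APP_ENV variable (both stand-ins for whatever your stack actually uses). The staging Disallow: / never lives in a file that could sync to production.

```python
import os
from flask import Flask, Response

app = Flask(__name__)

# Placeholder rule sets: production allows crawling, staging blocks everything.
PRODUCTION_RULES = """User-agent: *
Disallow: /cart
Disallow: /checkout
Sitemap: https://shop.example.com/sitemap.xml
"""

STAGING_RULES = "User-agent: *\nDisallow: /\n"

@app.route("/robots.txt")
def robots_txt():
    body = PRODUCTION_RULES if os.environ.get("APP_ENV") == "production" else STAGING_RULES
    # Must be served as text/plain with HTTP 200 to be read as a valid robots.txt.
    return Response(body, mimetype="text/plain")
```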
The testing workflow
Before deploying any robots.txt change:
Step 1: Run it through a robots.txt tester. Search Console's robots.txt report (Settings → robots.txt) shows the file Google last fetched and any parse errors, but the old interactive tester is gone; to check a proposed file before deploy, use Google's open-source robots.txt parser or an equivalent tester. Test against 20-30 real URLs from your site, including:
- URLs you want blocked (confirm they're blocked)
- URLs you want allowed (confirm they're allowed)
- Edge cases (URLs with parameters, trailing slashes, uppercase, unusual characters)
Step 2: Diff against current. diff old-robots.txt new-robots.txt. Every changed line should have an explicit reason documented.
Step 3: Staging deploy with monitoring. Push to staging environment, run automated tests that fetch 50+ URLs as Googlebot user-agent and confirm 200/403/etc match expectations.
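That automated check can be as simple as the sketch below: fetch each URL with a Googlebot user-agent string and compare the response status to an expectation table. The URLs and expected codes here are placeholders.

```python
import urllib.request
import urllib.error

# Placeholder expectations: URL -> status we expect when fetched as Googlebot.
EXPECTED = {
    "https://staging.example.com/category/shoes": 200,
    "https://staging.example.com/cart": 200,   # robots.txt blocks crawling, not fetching
    "https://staging.example.com/admin": 403,
}

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

for url, want in EXPECTED.items():
    req = urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            got = resp.status
    except urllib.error.HTTPError as err:
        got = err.code
    mark = "ok  " if got == want else "FAIL"
    print(f"{mark} {url} expected {want}, got {got}")
```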
Step 4: Production deploy at low-traffic window. Friday afternoon is safer than Monday morning — if you break something, you have Saturday to notice and revert without peak traffic pressure.
Step 5: Post-deploy log check. Within 24h, check Googlebot activity in logs. Any major shift (50%+ change in request volume to a URL pattern) is a signal that the change did something you didn't expect.
When to use robots.txt vs alternatives
| Goal | Use robots.txt? | Better alternative |
|---|---|---|
| Prevent crawl of URL pattern | ✅ Yes | — |
| Prevent indexing | ❌ No | noindex meta or header |
| Pass/prevent link equity | ❌ No | rel=nofollow on links |
| Consolidate duplicates | ❌ No | rel=canonical |
| Slow Googlebot down | ❌ No | Return 503/429 when overloaded (the GSC crawl rate limiter is retired) |
| Block AI training bots | ✅ Yes | — (this is the main remaining growth area for robots.txt rules) |
The golden rule: robots.txt controls crawling. It doesn't control indexing. For indexing control, use noindex.
Frequently asked questions
Does blocking a URL in robots.txt remove it from the index?
No. Blocked URLs can remain in the index if they have inbound links. Google can't crawl them to see noindex, so they sometimes sit in the index with a "No information is available for this page" note. For actual removal, use noindex (keep crawlable) or return 404/410.
Can robots.txt have conditional rules (e.g., different per crawler)?
Not conditional — declarative per-crawler. Each User-agent: block applies to that specific crawler. You can have User-agent: Googlebot, User-agent: Bingbot, User-agent: * with different rules in each.
How often does Googlebot re-read robots.txt?
Typically every 24 hours, but can be more frequent on active sites. Changes to robots.txt take effect within a day. If you need immediate application, also use noindex on pages whose crawlability status just changed.
Does robots.txt affect PageRank flow?
Not directly. Disallowed URLs can still receive and hold external link equity. What they can't do is pass link equity through themselves — Google can't crawl the outbound links. If you want to block link flow from a page, use rel="nofollow" on the page's links, not robots.txt on the page.
What if my robots.txt returns 404?
Google treats a 404 as "no restrictions — crawl everything." That's probably what you want for most sites if you have no rules. But a 5xx is different — Google treats 5xx as "can't determine restrictions, slow way down." Sustained 5xx can trigger a full crawl halt.
What to read next
- The Complete Guide to Technical SEO Audits — robots.txt as one component of the crawlability layer.
- How to prioritize crawl budget for large sites — the crawl-budget implications of robots.txt decisions.
- Log file analysis — verifying that robots.txt changes had the intended effect on actual crawl behavior.