How to Prioritize Crawl Budget for Large Sites
A log-driven framework that cuts wasted crawls by 40%+
For sites under 10,000 URLs, crawl budget is usually not your problem. Google will crawl you faster than you can publish. The diagnostic question is simpler: is your content good enough to rank?
For sites past 100,000 URLs — large ecommerce catalogs, news archives, user-generated content platforms, SaaS with deep documentation — crawl budget starts to bite. Googlebot has to decide which fraction of your URLs to fetch, how often, and in what order. If you haven't shaped those decisions, Google shapes them for you, usually suboptimally. Pages you want indexed get crawled every 60+ days; parameter-noise URLs get crawled daily.
This article is the framework I use to find and fix crawl budget waste on large sites. It's log-file-driven because logs are the only source of truth about what Googlebot actually does.
The three-layer problem
Crawl budget waste lives in three layers. Most site owners fix layer 1 and call it done; the compounding gains come from addressing all three.
- Crawl waste — Googlebot spending capacity on URLs that shouldn't be crawled at all (parameter noise, faceted explosion, soft 404s, infinite calendars).
- Priority inversion — Googlebot spending capacity on low-value URLs at the expense of high-value ones (new product pages starving while /tag/archive-page-347 gets hit weekly).
- Capacity ceiling — Googlebot throttled below your site's actual capacity due to slow responses or 5xx errors.
Work them in that order. Cleaning waste reveals the priority problem; fixing priority makes the capacity ceiling visible as the next bottleneck.
Diagnosing: logs + GSC
Start with the 30-day view from log file analysis and Search Console's Crawl Stats. You're looking for:
- Total Googlebot requests per day and the trend. Stable is fine; a 30% drop is an emergency.
- Distribution by URL path prefix. Where are the requests going — product pages, category pages, faceted URLs, search endpoints, admin?
- Response code distribution for Googlebot. A healthy site shows 90%+ 200s (plus 304s for unchanged content, which are fine), some 301s, minimal 404s, and ~zero 5xx.
- Average response time for bot requests. Googlebot ramps down when TTFB exceeds ~1 second consistently.
Export both datasets into the same spreadsheet. The first question to answer: where does the typical Googlebot request actually go? On a healthy large site, you'd expect 60%+ of requests to hit product, article, and category URLs. When it's 20%, with another 50% going to /search?q=... or ?filter_color=red&filter_size=..., you have your diagnosis.
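A minimal sketch of the log-side aggregation, assuming an access log already filtered to verified Googlebot requests; the regex, file name, and path-prefix buckets are assumptions to adapt to your own log format:

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumes combined-log-style lines: ... "GET /path?x=y HTTP/1.1" 200 ...
LINE = re.compile(r'"(?:GET|HEAD) (?P<url>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

# Hypothetical path buckets for a commerce site; replace with your own prefixes.
BUCKETS = ["/product/", "/category/", "/search", "/tag/", "/events/"]

def bucket(path: str) -> str:
    for prefix in BUCKETS:
        if path.startswith(prefix):
            return prefix
    return "other"

by_bucket, by_status, with_params = Counter(), Counter(), 0
with open("googlebot.log", encoding="utf-8") as fh:  # pre-filtered to Googlebot
    for line in fh:
        m = LINE.search(line)
        if not m:
            continue
        parts = urlsplit(m.group("url"))
        by_bucket[bucket(parts.path)] += 1
        by_status[m.group("status")] += 1
        with_params += bool(parts.query)

total = sum(by_bucket.values())
print(f"{total} Googlebot requests")
for name, count in by_bucket.most_common():
    print(f"  {name:<12} {count:>8}  {count / total:6.1%}")
print("status mix:", dict(by_status.most_common()))
print(f"requests carrying query parameters: {with_params / total:.1%}")
```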
Layer 1: Reduce crawl waste
The biggest waste sources, in order of how often they bite:
Faceted navigation explosion. /category?color=red&size=10&brand=nike generates combinatorial URL counts that Google will happily crawl. Fix by picking the small set of facet combinations worth indexing (high search volume, unique content) and shutting the rest down: canonical to the parent category, or <meta name="robots" content="noindex,follow"> where the variant differs too much for a canonical to stick (don't stack both on one page; they send conflicting signals). For the worst offenders, robots.txt Disallow cuts the crawl at the source.
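Before deciding which combinations to keep, it helps to see which ones Googlebot actually burns requests on. A rough sketch, assuming you've exported the crawled URLs to a text file (one per line); the facet parameter names are placeholders:

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

FACET_PARAMS = {"color", "size", "brand", "price"}  # placeholder facet keys

combos = Counter()
with open("googlebot_urls.txt", encoding="utf-8") as fh:
    for url in fh:
        parts = urlsplit(url.strip())
        facets = sorted(k for k, _ in parse_qsl(parts.query) if k in FACET_PARAMS)
        if facets:
            combos[(parts.path, tuple(facets))] += 1

# The biggest crawl consumers are your shortlist: keep the ones that map to
# real search demand, shut down (canonical/noindex/Disallow) the rest.
for (path, facets), hits in combos.most_common(20):
    print(f"{hits:>6}  {path}?{'&'.join(facets)}")
```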
Search endpoints. Your internal search /search?q=... is infinite. Googlebot will find deep-link patterns in user-generated content and crawl thousands of them. Disallow: /search in robots.txt is the fix on 99% of sites.
Parameter noise. UTM parameters, session IDs, ?ref=, ?fbclid, ?gclid — any parameter that doesn't change the displayed content creates duplicates. Self-canonicals that strip parameters work; Disallow on parameter patterns works better for the highest-volume noise.
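To size the duplication, strip the tracking parameters and count how many crawled URLs collapse into the same clean form. A sketch over the same URL export; the parameter list is an assumption to extend with whatever your logs show:

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING = {"fbclid", "gclid", "ref", "sessionid"}  # plus any utm_* parameter

def strip_tracking(url: str) -> str:
    parts = urlsplit(url.strip())
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING and not k.startswith("utm_")]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

crawled = [u for u in open("googlebot_urls.txt", encoding="utf-8") if u.strip()]
clean = Counter(strip_tracking(u) for u in crawled)
wasted = sum(n - 1 for n in clean.values() if n > 1)
print(f"{len(crawled)} crawled URLs -> {len(clean)} after stripping tracking params")
print(f"{wasted} requests ({wasted / len(crawled):.1%}) were parameter-noise duplicates")
```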
Soft 404 content. Pages that return 200 OK but should be errors — "This product is no longer available," empty search result pages, expired event pages. Google flags these in the Page Indexing report. Converting them to real 404/410 responses, or redirecting them to the nearest relevant URL, frees up crawl budget either way.
Infinite calendars and date archives. /events/2019/01/01 leads to 2020/01/02 leads to 2025/01/15 with mostly empty pages. If the archive isn't genuinely useful to users, noindex,nofollow the whole structure.
Low-value paginated deep pages. Page 247 of a category listing will never be the landing page for any meaningful query. Keep it crawlable, but canonical it to the parent listing to consolidate signals.
Layer 2: Prioritize the right URLs
Once waste is reduced, the capacity freed up needs to flow to URLs that matter. Googlebot uses signals to pick what to crawl first:
- Internal link frequency — URLs heavily linked from high-traffic pages get prioritized.
- Sitemap inclusion — URLs in sitemap.xml signal "we care about this."
- lastmod signals — URLs with a recent lastmod (verified by Google as actually changed) get recrawled sooner.
- External backlinks — URLs with high-quality inbound links get prioritized.
- User-click signals from SERPs — URLs that Google has seen clicked on get recrawled more often.
The practical moves:
Sitemap hygiene. A bloated sitemap (100k URLs, 30k are noindex'd, 10k are 404) dilutes the signal. Prune ruthlessly. Include only indexable, canonical, currently-live URLs.
Sitemap sharding. For sites past 50k URLs, use multiple sitemaps under a sitemap index. Shard by freshness: one sitemap for URLs updated in the last 7 days, one for the last 30 days, one for evergreen. Small, fresh shards are cheap for Google to refetch often, and GSC reports indexing per sitemap, so you can see exactly which shard is lagging.
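A sketch of the sharding step, assuming you can pull (url, lastmod, indexable) rows from your catalog or CMS; the shard names mirror the 7/30-day split above, and the index file that lists the shards is left out for brevity:

```python
from datetime import datetime, timedelta, timezone
from xml.sax.saxutils import escape

# Hypothetical input pulled from your catalog: (loc, lastmod, indexable).
pages = [
    ("https://www.example.com/product/widget-a",
     datetime(2025, 6, 1, tzinfo=timezone.utc), True),
    # ...
]

now = datetime.now(timezone.utc)
shards = {"sitemap-7d.xml": [], "sitemap-30d.xml": [], "sitemap-evergreen.xml": []}
for loc, lastmod, indexable in pages:
    if not indexable:  # hygiene: noindexed / non-canonical / dead URLs never go in
        continue
    age = now - lastmod
    name = ("sitemap-7d.xml" if age <= timedelta(days=7)
            else "sitemap-30d.xml" if age <= timedelta(days=30)
            else "sitemap-evergreen.xml")
    shards[name].append((loc, lastmod))

for name, urls in shards.items():
    with open(name, "w", encoding="utf-8") as fh:
        fh.write('<?xml version="1.0" encoding="UTF-8"?>\n'
                 '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for loc, lastmod in urls[:50000]:  # protocol limit per sitemap file
            fh.write(f"  <url><loc>{escape(loc)}</loc>"
                     f"<lastmod>{lastmod.date().isoformat()}</lastmod></url>\n")
        fh.write("</urlset>\n")
```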
Internal linking audit. The orphan page problem cuts both ways: orphans don't get crawled, and they receive none of the internal PageRank the missing links would have passed. Fix from both ends — add links to orphans, and make sure high-priority URLs are linked from high-priority sources.
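Finding orphans is set arithmetic: the URLs you care about (say, the pruned sitemap) minus every URL that is the target of at least one internal link. A sketch assuming a crawler export of internal link edges as a source,target CSV; the file names are placeholders:

```python
import csv
from urllib.parse import urlsplit

def normalize(url: str) -> str:
    parts = urlsplit(url.strip())
    return parts.netloc.lower() + (parts.path.rstrip("/") or "/")

# URLs that should be reachable, e.g. exported from the pruned sitemap.
wanted = {normalize(u) for u in open("sitemap_urls.txt", encoding="utf-8") if u.strip()}

# Internal link edges from a crawler export: source_url,target_url per row.
linked = set()
with open("internal_links.csv", newline="", encoding="utf-8") as fh:
    for row in csv.reader(fh):
        if len(row) >= 2 and row[1].startswith("http"):
            linked.add(normalize(row[1]))

orphans = wanted - linked
print(f"{len(orphans)} of {len(wanted)} sitemap URLs have no internal inbound link")
for url in sorted(orphans)[:50]:
    print(" ", url)
```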
Lastmod accuracy. Google eventually ignores lastmod on sites where it repeatedly contradicts the content it fetches. Don't lie. If the content genuinely changed, update lastmod; if not, leave it.
Layer 3: Improve server response
After clearing waste and rebalancing priority, server capacity becomes the next bottleneck. Symptoms:
- Googlebot request rate plateaus despite increased demand (new sitemap entries, more backlinks).
- Response times to Googlebot creep above 1 second.
- 5xx errors to bot requests, even briefly, correlate with crawl rate drops.
Fixes:
CDN caching for publicly cacheable pages. Google doesn't discriminate against CDN-cached responses; what matters is the Cache-Control header and whether the cached content matches what the origin would serve. Aggressive caching for product pages, articles, and category pages (TTL 1-24 hours) dramatically reduces origin load.
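A quick way to check what the CDN is actually serving for a sample of URLs, assuming the requests library. The URLs are placeholders, and the hit/miss header name varies by provider (CF-Cache-Status on Cloudflare, X-Cache on CloudFront and Fastly), so treat those as assumptions:

```python
import requests

SAMPLE = [
    "https://www.example.com/product/widget-a",   # placeholder URLs
    "https://www.example.com/category/widgets",
]

for url in SAMPLE:
    resp = requests.get(url, timeout=10, headers={"User-Agent": "cache-audit"})
    print(url)
    print("  status       :", resp.status_code)
    print("  cache-control:", resp.headers.get("Cache-Control", "(none)"))
    print("  age          :", resp.headers.get("Age", "(none)"))
    # The hit/miss header differs per CDN; print whichever ones are present.
    for header in ("X-Cache", "CF-Cache-Status", "X-Served-By"):
        if header in resp.headers:
            print(f"  {header.lower():<13}:", resp.headers[header])
```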
Origin response time under load. When Googlebot spikes its crawl rate, does origin TTFB stay under 1s? If not, the bottleneck is in rendering (server-side templating, database queries, external API calls). Profile during Googlebot activity windows.
5xx rate control. Aim for <0.5% 5xx rate on bot traffic over any 24h window. Anything higher and Googlebot throttles. Investigate sources — often a specific URL pattern that errors under load (search autocomplete, recommendation feed, expired product handling).
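Tracking that threshold is a one-pass aggregation over bot logs. A sketch that computes a per-day 5xx rate (a true rolling 24-hour window is a small extension), assuming one ISO-timestamp/status pair per line for Googlebot requests only; adjust the parsing to your pipeline:

```python
from collections import Counter
from datetime import datetime

# Assumed input lines: "2025-06-01T13:45:12Z 503"
daily_total, daily_5xx = Counter(), Counter()
with open("googlebot_status.log", encoding="utf-8") as fh:
    for line in fh:
        if not line.strip():
            continue
        ts, status = line.split()
        day = datetime.fromisoformat(ts.replace("Z", "+00:00")).date()
        daily_total[day] += 1
        if status.startswith("5"):
            daily_5xx[day] += 1

for day in sorted(daily_total):
    rate = daily_5xx[day] / daily_total[day]
    flag = "  <-- above the 0.5% threshold" if rate > 0.005 else ""
    print(f"{day}  {daily_total[day]:>7} reqs  {rate:.2%} 5xx{flag}")
```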
TTFB from geo-distant origins. Most Googlebot requests originate from US data centers. If your origin is single-region in Europe, TTFB to Googlebot starts 150-200ms in the hole before any processing happens. A CDN with edge locations near Google's crawl origins brings that down to 20-40ms — see CDN configuration for SEO for the cache-header rules that keep Googlebot happy.
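You can't measure Googlebot's own latency without a vantage point in its region, but a cold-versus-warm check from anywhere still shows whether the edge cache is absorbing repeat requests. A rough sketch with the requests library; the URL is a placeholder, and resp.elapsed (which stops once response headers are parsed) is only a TTFB proxy:

```python
import requests

def ttfb_seconds(url: str) -> float:
    # stream=True skips downloading the body; elapsed covers request sent
    # through response headers parsed, a reasonable TTFB approximation.
    resp = requests.get(url, stream=True, timeout=10,
                        headers={"User-Agent": "ttfb-check"})
    try:
        return resp.elapsed.total_seconds()
    finally:
        resp.close()

url = "https://www.example.com/category/widgets"  # placeholder URL
cold, warm = ttfb_seconds(url), ttfb_seconds(url)
print(f"cold: {cold * 1000:.0f} ms   warm: {warm * 1000:.0f} ms")
```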
A real 90-day crawl budget plan
If I walked into a site with >500k URLs and 60%+ crawl waste today, here's the 90-day shape:
Days 0-30 — Diagnose + quick wins.
- Log file analysis baseline: current crawl distribution by URL pattern.
- GSC Crawl Stats trend + Page Indexing state counts.
- Identify top 3 crawl waste sources (usually: internal search, faceted URLs, parameter noise).
- Ship the Disallow/canonical fixes for those three worst offenders.
- Expected impact: 20-30% reduction in crawl waste within 2-3 weeks.
Days 30-60 — Priority rebalance.
- Sitemap audit: prune non-indexable URLs, shard by freshness.
- Internal linking fixes for top 100 high-value URLs (ensure each has 5+ inbound internal links from high-priority sources).
- Orphan page cleanup: delete or reconnect.
- Expected impact: high-priority URLs get crawled 2-4x more often; new content indexes in days instead of weeks.
Days 60-90 — Server capacity.
- CDN tuning: extend cacheable TTLs, push more URL patterns through edge cache.
- Origin profiling under bot load: find and fix the slow 10% of URL templates.
- Investigate + fix any 5xx spikes.
- Expected impact: the Googlebot request ceiling rises 30-50%; the capacity is now available when needed.
By day 90, you're operating at 40-50% better crawl efficiency and have visibility into the next bottleneck (usually content quality or structural issues beyond crawling's reach).
Common pitfalls
Asking Google for a crawl rate increase without fixing waste. You're just inviting Googlebot to spend more time on the same garbage.
Treating noindex as a crawl budget fix. A noindexed URL still gets crawled (that's how Google sees the directive). For crawl savings, use robots.txt Disallow on URL patterns that have no indexation value.
Canonicalizing everything instead of fixing. A canonical doesn't stop the crawl — Google still fetches the URL to read the canonical. For true crawl-budget savings, structural fixes (fewer URLs generated in the first place) beat canonicals.
Ignoring sitemap lastmod. A sitemap with stale lastmod values is roughly as useful as no lastmod. Get it right; Google's recrawl prioritization depends on it.
Frequently asked questions
How do I know if I have a crawl budget problem?
Three signals: (1) Sitemap URLs stuck as "Discovered - currently not indexed" in GSC's Page Indexing report (i.e. never crawled). (2) New content taking weeks or months to index despite strong internal linking. (3) Log analysis showing 30%+ of Googlebot requests going to URLs you don't want indexed. If none of these apply, you probably don't have a crawl budget problem.
Will Googlebot crawl my new content faster if I submit it manually in GSC?
A modest, temporary bump. Don't use it as a substitute for proper internal linking and sitemap inclusion. The effect doesn't stack — submitting the same URL repeatedly doesn't increase priority.
Do pages blocked in robots.txt still pass link equity?
Disallowed pages still receive external backlinks, and Google is aware of those links even though it can't crawl the target. What's lost is Google's ability to follow links out of the disallowed page. For pages where you want to block both indexing and link flow, an in-page noindex, nofollow is the right tool (and the page must remain crawlable for Google to see it).
How does crawl budget interact with indexability?
Crawl is the gate; indexability is the filter. A URL has to be crawlable to be indexable. Crawl budget optimization ensures Google has enough capacity to reach indexable URLs; indexability fixes ensure that once crawled, the URL actually enters the index.
Should I use a dedicated crawler tool or rely on GSC for this?
Both. GSC tells you the end state (what's indexed, what isn't). A dedicated crawler tells you the intermediate state (what Googlebot sees on each URL). Log analysis tells you the observed behavior. You need at least two of the three for any meaningful diagnosis.
What to read next
- The Complete Guide to Technical SEO Audits — the broader audit framework this fits into.
- Log file analysis: what to look for, tool by tool — the detailed how-to for the diagnostic step.
- Site architecture for SEO — structural fixes that prevent crawl waste at the source.