XML Sitemaps at Scale: Multi-Sitemap Strategies
The 50k / 50MB limits hit surprisingly often — and most sites handle the overflow wrong
The XML sitemap protocol is tiny — a few tags, simple semantics — and yet most large sites get it wrong in ways that cost indexation speed. The mistakes are usually: letting the sitemap bloat with non-indexable URLs, blowing the 50,000-URL limit without a sitemap index, and misusing lastmod until Google ignores it.
This article is for sites past 50,000 URLs where sitemap structure starts to matter. Smaller sites can skim — one correctly-shaped sitemap at the root is usually all they need.
The limits that hit real sites
- 50,000 URLs per sitemap (hard limit, enforced by Google).
- 50 MB uncompressed per sitemap (hard limit).
- Up to 50,000 sitemap files in one sitemap index (for a theoretical ceiling of 2.5 billion URLs).
Most sites cross the 50k-URL limit first, usually between years 2-5 of publishing. The 50 MB limit mostly catches video sitemaps (verbose <video:video> entries) and sites with very long URLs.
When you cross a limit, Google's behavior depends:
- A sitemap over 50k URLs: Google ignores anything past the first 50,000.
- A sitemap over 50MB: Google may reject the entire sitemap.
- Neither triggers GSC alerts by default. You find out when indexation stalls for the URLs that didn't make it.
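Since neither overflow triggers an alert, it's worth checking the limits yourself before submission. A minimal sketch of such a check, assuming you have the sitemap's raw bytes (the thresholds are Google's documented hard limits; everything else here is illustrative):

```python
# Sketch: check one sitemap's raw bytes against Google's hard limits.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB uncompressed

def check_sitemap(xml_bytes: bytes) -> list[str]:
    """Return a list of limit violations for one sitemap's raw bytes."""
    problems = []
    if len(xml_bytes) > MAX_BYTES:
        problems.append(f"size {len(xml_bytes)} bytes exceeds 50 MB")
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall(f"{SITEMAP_NS}url"))
    if url_count > MAX_URLS:
        problems.append(f"{url_count} URLs exceeds the 50,000-URL limit")
    return problems
```

Run it in CI against generated sitemaps so an overflow fails the build instead of silently dropping URLs.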
The sitemap index pattern
For anything past ~40,000 URLs, move to a sitemap index. The structure:
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap-index.xml -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-articles-1.xml</loc>
    <lastmod>2026-04-24</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-articles-2.xml</loc>
    <lastmod>2026-04-23</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-04-24</lastmod>
  </sitemap>
</sitemapindex>
Each child sitemap then contains up to 50,000 <url> entries. Submit the index URL in Search Console; it discovers all children.
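If you generate the index programmatically, a small builder keeps the structure valid. A sketch using the standard library (the child URLs and dates are the placeholders from the example above):

```python
# Sketch: emit a sitemap index from (child_url, lastmod) pairs.
import xml.etree.ElementTree as ET

def build_sitemap_index(children: list[tuple[str, str]]) -> bytes:
    """Serialize a sitemap index listing the given child sitemaps."""
    root = ET.Element("sitemapindex",
                      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in children:
        sm = ET.SubElement(root, "sitemap")
        ET.SubElement(sm, "loc").text = loc
        ET.SubElement(sm, "lastmod").text = lastmod
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

Letting a serializer produce the XML (rather than string templates) avoids escaping bugs when URLs contain `&`.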
Sharding strategies
You have to decide how to split URLs across sitemaps. The strategy affects two things:
- Crawl priority — Google processes sitemaps in an order influenced by lastmod. Sitemaps with a recent lastmod get crawled more often.
- Diagnostic clarity in GSC — Search Console's Page Indexing report filters by sitemap, so if you know which sitemap your "crawled - not indexed" URLs come from, you know where the problem pattern lives.
Strategy 1: By content type
sitemap-articles.xml
sitemap-categories.xml
sitemap-products.xml
sitemap-glossary.xml
sitemap-about.xml
Best for sites with distinct content types that have different publishing cadences. Products change daily, glossary rarely, categories intermittently.
Strategy 2: By freshness tier
sitemap-last-7-days.xml # URLs updated in the last week
sitemap-last-30-days.xml # updated in the last month
sitemap-evergreen.xml # stable, older content
Best for sites where Google's recrawl cadence needs to vary dramatically. News sites, publishers with large archives.
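Tier assignment is a simple age bucket. A sketch using the 7/30-day cutoffs from the filenames above (the tier names mirror those files; the cutoffs are this article's convention, not a standard):

```python
# Sketch: route a URL to a freshness-tier sitemap by days since its
# last material update.
from datetime import date

def freshness_tier(last_update: date, today: date) -> str:
    """Return the freshness-tier sitemap filename for this URL."""
    age = (today - last_update).days
    if age <= 7:
        return "sitemap-last-7-days.xml"
    if age <= 30:
        return "sitemap-last-30-days.xml"
    return "sitemap-evergreen.xml"
```

Note that URLs migrate between tiers over time, so the tiered sitemaps must be regenerated on a schedule, not just on publish.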
Strategy 3: Hybrid (what I usually recommend for ecommerce / publishers)
sitemap-index.xml
├── sitemap-articles-recent.xml # last 30 days
├── sitemap-articles-archive.xml # older, max 50k
├── sitemap-products-in-stock.xml
├── sitemap-products-discontinued.xml # redirected, for cleanup
├── sitemap-categories.xml
└── sitemap-pages.xml # About, contact, legal
Combines content-type and freshness. Scales cleanly past millions of URLs.
Strategy 4: By ID range (mechanical, used by very large sites)
sitemap-products-0-50000.xml
sitemap-products-50001-100000.xml
sitemap-products-100001-150000.xml
Best for sites generating sitemaps from database IDs where other groupings are impractical. Diagnostic value is lower (ranges don't map to meaning), but scales to any catalog size.
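The shard for a given ID is pure arithmetic. A sketch with non-overlapping 50k buckets (the filename pattern is illustrative; note the listing above uses inclusive boundaries like 0-50000, which actually spans 50,001 IDs, so non-overlapping ranges are the safer choice):

```python
# Sketch: map a database ID to its mechanical shard filename (Strategy 4).
SHARD_SIZE = 50_000

def shard_name(product_id: int) -> str:
    """Return the sitemap filename whose ID range covers this product."""
    start = (product_id // SHARD_SIZE) * SHARD_SIZE
    return f"sitemap-products-{start}-{start + SHARD_SIZE - 1}.xml"
```

Because the mapping is deterministic, a single changed product only requires regenerating one shard, not the whole set.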
What to include and exclude
Rule: the sitemap should list URLs you want indexed. Everything else is noise that dilutes the signal.
Include:
- Canonical URLs only. If URL A and URL B both exist, with A canonical to B, only list B in the sitemap.
- Indexable URLs only. URLs returning 200 OK, not noindex, not blocked by robots.txt.
- Stable URLs. Short-lived or expiring URLs (search result pages, dynamic filter combinations) don't belong.
Exclude:
- URLs carrying noindex — including them confuses Google's signal.
- Redirects — Google will follow them, but listing the target URL directly is cleaner.
- URLs blocked by robots.txt — can't be crawled anyway.
- 404s / 410s — dead URLs dilute the sitemap.
- Pagination beyond the first few pages unless there's real intent to index them.
A bloated sitemap (100k URLs, 40k non-indexable) is worse than a clean one (60k all indexable). Google treats the list as your declaration of priority; lying wastes everyone's time.
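The include/exclude rules above reduce to a predicate over your page inventory. A minimal sketch, assuming a record type your CMS or crawler could expose (the field names are hypothetical):

```python
# Sketch: keep only sitemap-worthy URLs from a crawl inventory.
from dataclasses import dataclass

@dataclass
class PageRecord:
    url: str
    status: int           # HTTP status of the URL itself
    canonical: str        # canonical URL declared on the page
    noindex: bool         # robots meta / X-Robots-Tag noindex
    robots_blocked: bool  # disallowed by robots.txt

def sitemap_eligible(p: PageRecord) -> bool:
    """True if this URL belongs in the sitemap per the rules above."""
    return (p.status == 200
            and not p.noindex
            and not p.robots_blocked
            and p.canonical == p.url)  # list only self-canonical URLs
```

Filtering with `sitemap_eligible` before generation is what turns the 100k-URL bloated sitemap into the clean 60k one.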
lastmod strategy
lastmod is the single most valuable optional field. It tells Google when the content was last meaningfully changed, influencing recrawl priority.
Rules that matter:
- Accurate. Set lastmod only when content actually changed. A <link> update, a typo fix, a new comment — not material change.
- Consistent format. ISO 8601: 2026-04-24 or 2026-04-24T09:00:00+00:00. Not "April 24, 2026."
- Not the same for every URL. If your CMS sets every sitemap entry's lastmod to today's date, Google notices and starts ignoring the field site-wide. lastmod has to differ per URL based on actual change history.
Common failures:
- Static build regeneration updates everything. Gatsby/Hugo/Next.js site rebuilds overwrite every URL's build timestamp. Fix: pull lastmod from git blame on the source file, not the build time.
- Plugin-generated sitemaps defaulting to "now." WordPress sitemap plugins sometimes default to the current date if the post's modified_date isn't available. Check your plugin's behavior.
- Database timestamps vs content timestamps. updated_at in your database updates on every row touch, including view counter increments. Use a separate content_updated_at that changes only on content edits.
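The database-timestamp fix is the one most teams get wrong, so here is a minimal sketch of the split: a row-level timestamp that moves on every touch, and a content timestamp that moves only on material edits (the model and field names are illustrative, not a specific ORM):

```python
# Sketch: separate content_updated_at (feeds lastmod) from updated_at.
from dataclasses import dataclass, field
from datetime import datetime, timezone

def _now() -> datetime:
    return datetime.now(timezone.utc)

@dataclass
class Post:
    body: str
    view_count: int = 0
    updated_at: datetime = field(default_factory=_now)
    content_updated_at: datetime = field(default_factory=_now)

    def record_view(self) -> None:
        self.view_count += 1
        self.updated_at = _now()  # row touched, but not the content:
        # content_updated_at stays put, so lastmod does not move

    def edit_body(self, new_body: str) -> None:
        if new_body != self.body:  # only a material change moves lastmod
            self.body = new_body
            self.updated_at = self.content_updated_at = _now()

    def lastmod(self) -> str:
        """ISO 8601 date for the sitemap's lastmod field."""
        return self.content_updated_at.date().isoformat()
```

The sitemap generator then reads `lastmod()` and never sees `updated_at` at all.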
Image and video sitemaps
Extension sitemaps for image and video content:
Image sitemap (for sites that want image URLs indexed for Google Image Search):
<url>
  <loc>https://example.com/product/sneakers</loc>
  <image:image>
    <image:loc>https://example.com/images/sneakers.jpg</image:loc>
    <image:caption>Nike Pegasus 41 running shoes, side view</image:caption>
  </image:image>
</url>
Useful for ecommerce and visual content sites. The image URLs must be crawlable separately.
Video sitemap (for video content pages):
<url>
  <loc>https://example.com/videos/product-demo</loc>
  <video:video>
    <video:thumbnail_loc>https://example.com/thumb.jpg</video:thumbnail_loc>
    <video:title>Product demo: getting started in 3 minutes</video:title>
    <video:description>...</video:description>
    <video:content_loc>https://example.com/video.mp4</video:content_loc>
    <video:duration>186</video:duration>
  </video:video>
</url>
Required for sites that want video rich results and video-specific SERP features. The video file needs to be accessible; thumbnails need to match the video content.
Discovery and submission
Three ways Google discovers your sitemap:
- Sitemap: line in /robots.txt — standard, recommended.
- Search Console submission — direct submit of sitemap or sitemap index URL. Gives you per-sitemap indexing stats.
- HTTP ping — deprecated by Google. Don't rely on it.
Submit the sitemap index, not individual child sitemaps. Google discovers children via the index. If you submit children directly, Search Console's reports become fragmented and harder to interpret.
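The robots.txt reference is one line with an absolute URL. A sketch (the hostname, sitemap filename, and Disallow path are placeholders):

```
User-agent: *
Disallow: /internal-search/

Sitemap: https://example.com/sitemap-index.xml
```

The Sitemap: line can appear anywhere in the file and is independent of any User-agent group.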
Verification and monitoring in GSC
After submission, check:
- Status: Success — sitemap parsed without errors.
- Discovered URLs count — matches what you sent (±small margin for URL normalization).
- Last read — Google's cadence. Large sitemaps are re-read every few days; small ones daily.
- Indexed URLs (per-sitemap view) — the ratio of indexed-to-discovered per sitemap is the most valuable diagnostic. If sitemap-articles.xml indexes 95% but sitemap-products.xml indexes 40%, the problem lives in product URLs specifically.
Set up weekly monitoring of the indexed ratio per sitemap. A drop of 10% over 2 weeks is a signal that something changed in indexability for that URL group.
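That weekly check is easy to script once you export the per-sitemap numbers. A sketch, assuming two snapshots shaped as `{sitemap_name: {"indexed": n, "discovered": n}}` (the export format is an assumption, not a GSC API shape):

```python
# Sketch: flag sitemaps whose indexed-to-discovered ratio dropped by
# 10 points or more between two snapshots.
def ratio(stats: dict) -> float:
    """Indexed-to-discovered ratio; 0.0 when nothing was discovered."""
    return stats["indexed"] / stats["discovered"] if stats["discovered"] else 0.0

def flag_drops(previous: dict, current: dict, threshold: float = 0.10) -> list[str]:
    """Return sitemap names whose indexed ratio fell by >= threshold."""
    flagged = []
    for name, now in current.items():
        before = previous.get(name)
        if before and ratio(before) - ratio(now) >= threshold:
            flagged.append(name)
    return flagged
```

A flagged sitemap name tells you which URL group to investigate in the Page Indexing report.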
Common mistakes
One giant sitemap past 50k URLs. Classic oversight; the URLs past 50k silently disappear. Move to a sitemap index.
Listing noindex URLs in sitemap. Confuses Google: "you want this indexed, but also you don't?" Strips your signal.
Every URL has <priority>1.0. Google ignores <priority> entirely. Don't bother with it — focus on lastmod instead.
<changefreq> set without regard to actual change. Google mostly ignores this too. Waste of bytes.
Sitemap at a different host than the content. Sitemaps are per-host. A sitemap at example.com/sitemap.xml listing URLs at www.example.com/ can cause problems. Match the host.
Including query-parameter URLs. ?utm_source=... URLs in sitemap are noise. The sitemap should list canonical versions only.
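Stripping tracking parameters before a URL enters the sitemap is a one-function fix. A sketch with a common blocklist (the parameter list is an assumption; extend it for your analytics stack):

```python
# Sketch: drop known tracking parameters from a URL before sitemap entry.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "gclid", "fbclid"}

def strip_tracking(url: str) -> str:
    """Return the URL with tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

This only handles tracking noise; legitimate functional parameters still need a canonical-URL decision of their own.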
Frequently asked questions
How often should I regenerate the sitemap?
As often as content changes. For a news site, hourly or in real time. For an ecommerce site with daily catalog updates, daily. For a slow-moving site, weekly. What matters is that lastmod accurately reflects content changes.
Does Google read my sitemap for every URL?
No. Google reads the sitemap periodically (daily or every few days, depending on size and change signals) and uses it to discover URLs and update its recrawl queue. The sitemap isn't the primary discovery mechanism — internal linking is.
Should I use <priority> or <changefreq>?
No. Google ignores both. Skip them to keep sitemap file size smaller.
Can I have too many URLs in my sitemap?
You can't hit a Google-enforced "too many" past splitting into an index. But there's a signal-quality ceiling: a sitemap with millions of URLs, half of which are weak, tells Google your site is half weak. Prune aggressively; quality over quantity.
How do I handle URLs that 404 in my sitemap?
Remove them immediately. A sitemap that lists 404s is lying about indexable URLs. If you know a URL 404'd recently, pull it from the sitemap ASAP — don't wait for the next full regeneration.
What to read next
- The Complete Guide to Technical SEO Audits — sitemap as one component of crawlability + discovery.
- How to prioritize crawl budget for large sites — how sitemap sharding interacts with crawl budget.
- Robots.txt patterns from the wild — how sitemap references fit in the robots.txt file.
Related articles
The Complete Guide to Technical SEO Audits
Most technical SEO audits fail the same way: they generate 80-page PDFs with 200 findings, and clients execute none of them. The audits that move rankings solve for one thing: which of five layers is broken, and which single fix restores the most value.
Hreflang Implementation: Mistakes and How to Avoid Them
Hreflang breaks silently. Bidirectionality errors, region code confusion, and mixed delivery methods cause international SEO issues that don't show up as explicit errors — just underperformance in secondary markets.
Core Web Vitals in 2026: What Still Matters
Core Web Vitals is a real but modest ranking signal — and the metrics keep shifting. INP replaced FID in March 2024. Here's what the current three metrics actually measure, what they don't, and where optimization actually moves the needle.