XML Sitemaps at Scale: Multi-Sitemap Strategies

The 50k / 50MB limits hit surprisingly often — and most sites handle the overflow wrong

Enric Ramos · 7 min read

The XML sitemap protocol is tiny — a few tags, simple semantics — and yet most large sites get it wrong in ways that cost indexation speed. The mistakes are usually: letting the sitemap bloat with non-indexable URLs, blowing the 50,000-URL limit without a sitemap index, and misusing lastmod until Google ignores it.

This article is for sites past 50,000 URLs, where sitemap structure starts to matter. Smaller sites can skim — a single correctly shaped sitemap at the root is usually all they need.

The limits that hit real sites

  • 50,000 URLs per sitemap (hard limit, enforced by Google).
  • 50 MB uncompressed per sitemap (hard limit).
  • Up to 50,000 sitemap files in one sitemap index (for a theoretical ceiling of 2.5 billion URLs).

Most sites cross the 50k-URL limit first, usually between years 2 and 5 of publishing. The 50MB limit mostly catches video sitemaps (large <video> entries) and sites with very long URLs.

When you cross a limit, Google's behavior depends on which one:

  • A sitemap over 50k URLs: Google ignores anything past the first 50,000.
  • A sitemap over 50MB: Google may reject the entire sitemap.
  • Neither triggers GSC alerts by default. You find out when indexation stalls for the URLs that didn't make it.
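Because Google fails silently, a pre-submission check is worth automating. A minimal sketch (the function name is mine, not part of any library) that validates one sitemap file against both hard limits:

```python
import xml.etree.ElementTree as ET

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, uncompressed

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def check_sitemap(xml_bytes: bytes) -> list:
    """Return a list of limit violations for one sitemap file."""
    problems = []
    if len(xml_bytes) > MAX_BYTES:
        problems.append(f"size {len(xml_bytes)} bytes exceeds the 50 MB limit")
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall("sm:url", NS))
    if url_count > MAX_URLS:
        problems.append(f"{url_count} URLs exceeds the 50,000-URL limit")
    return problems
```

Run it in CI on every generated sitemap; an empty list means the file is within both limits.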

The sitemap index pattern

For anything past ~40,000 URLs, move to a sitemap index. The structure:

<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap-index.xml -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-articles-1.xml</loc>
    <lastmod>2026-04-24</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-articles-2.xml</loc>
    <lastmod>2026-04-23</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-04-24</lastmod>
  </sitemap>
</sitemapindex>

Each child sitemap then contains up to 50,000 <url> entries. Submit the index URL in Search Console; it discovers all children.
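Generating the shards and the index is mechanical. A sketch, assuming a flat list of canonical URLs; the base filename and hostname are illustrative, swap in your own:

```python
from xml.sax.saxutils import escape

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
CHUNK = 50_000  # protocol maximum per child sitemap

def build_children(urls, base="https://example.com/sitemap-articles"):
    """Split URLs into <=50k chunks; return {child_url: xml_string}."""
    files = {}
    for i in range(0, len(urls), CHUNK):
        body = "".join(
            f"<url><loc>{escape(u)}</loc></url>" for u in urls[i:i + CHUNK]
        )
        name = f"{base}-{i // CHUNK + 1}.xml"
        files[name] = (
            '<?xml version="1.0" encoding="UTF-8"?>'
            f'<urlset xmlns="{SM_NS}">{body}</urlset>'
        )
    return files

def build_index(child_urls, lastmod="2026-04-24"):
    """One <sitemap> entry per child, wrapped in a sitemapindex."""
    entries = "".join(
        f"<sitemap><loc>{escape(u)}</loc><lastmod>{lastmod}</lastmod></sitemap>"
        for u in child_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<sitemapindex xmlns="{SM_NS}">{entries}</sitemapindex>'
    )
```

In production you would write each child to disk and set its `lastmod` from the newest URL inside it, not a constant.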

Sharding strategies

You have to decide how to split URLs across sitemaps. The strategy affects two things:

  1. Crawl priority — Google processes sitemaps in an order influenced by lastmod. Sitemaps with recent lastmod get crawled more often.
  2. Diagnostic clarity in GSC — Search Console's Page Indexing report filters by sitemap, so if you know which sitemap your "crawled - not indexed" URLs come from, you know where the problem pattern lives.

Strategy 1: By content type

sitemap-articles.xml
sitemap-categories.xml
sitemap-products.xml
sitemap-glossary.xml
sitemap-about.xml

Best for sites with distinct content types that have different publishing cadences. Products change daily, glossary rarely, categories intermittently.

Strategy 2: By freshness tier

sitemap-last-7-days.xml       # URLs updated in the last week
sitemap-last-30-days.xml       # updated in the last month
sitemap-evergreen.xml          # stable, older content

Best for sites where Google's recrawl cadence needs to vary dramatically. News sites, publishers with large archives.
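Assigning a URL to a tier is a one-line bucketing decision on the date of its last content change. A sketch using the tier filenames above (cutoffs match the listing; adjust to taste):

```python
from datetime import date

def freshness_tier(last_updated: date, today: date) -> str:
    """Map a URL's last content change to one of the three tier sitemaps."""
    age_days = (today - last_updated).days
    if age_days <= 7:
        return "sitemap-last-7-days.xml"
    if age_days <= 30:
        return "sitemap-last-30-days.xml"
    return "sitemap-evergreen.xml"
```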

Strategy 3: Hybrid (what I usually recommend for ecommerce / publishers)

sitemap-index.xml
├── sitemap-articles-recent.xml       # last 30 days
├── sitemap-articles-archive.xml      # older, max 50k
├── sitemap-products-in-stock.xml
├── sitemap-products-discontinued.xml # redirected, for cleanup
├── sitemap-categories.xml
└── sitemap-pages.xml                 # About, contact, legal

Combines content-type and freshness. Scales cleanly past millions of URLs.

Strategy 4: By ID range (mechanical, used by very large sites)

sitemap-products-0-50000.xml
sitemap-products-50001-100000.xml
sitemap-products-100001-150000.xml

Best for sites generating sitemaps from database IDs where other groupings are impractical. Diagnostic value is lower (ranges don't map to meaning), but scales to any catalog size.
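The shard name is pure arithmetic on the ID. A sketch assuming 1-based database IDs and 50,000-ID shards (the naming scheme is illustrative):

```python
def shard_name(product_id: int, shard_size: int = 50_000) -> str:
    """Return the ID-range sitemap filename a product belongs to,
    assuming IDs start at 1."""
    lo = ((product_id - 1) // shard_size) * shard_size + 1
    return f"sitemap-products-{lo}-{lo + shard_size - 1}.xml"
```

Note that ID ranges can be sparse (deleted products), so a shard may hold far fewer than 50,000 live URLs; the cap only guarantees you never exceed the limit.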

What to include and exclude

Rule: the sitemap should list URLs you want indexed. Everything else is noise that dilutes the signal.

Include:

  • Canonical URLs only. If URL A canonicalizes to URL B (A's rel=canonical points at B), list only B in the sitemap.
  • Indexable URLs only. URLs returning 200 OK, not noindex, not blocked by robots.txt.
  • Stable URLs. Short-lived or expiring URLs (search result pages, dynamic filter combinations) don't belong.

Exclude:

  • Noindex URLs — including them contradicts your own signal: the sitemap says "index this," the page says "don't."
  • Redirects — Google will follow them, but listing the target URL directly is cleaner.
  • URLs blocked by robots.txt — can't be crawled anyway.
  • 404s / 410s — dead URLs dilute the sitemap.
  • Pagination beyond the first few pages unless there's real intent to index them.

A bloated sitemap (100k URLs, 40k non-indexable) is worse than a clean one (60k all indexable). Google treats the list as your declaration of priority; lying wastes everyone's time.
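The include/exclude rules above collapse into a single predicate your generator can apply to crawl data. A sketch — the `Page` fields are my assumed data shape, not a real library:

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str             # the URL as it would appear in the sitemap
    status: int          # HTTP status the URL returns
    noindex: bool        # robots meta tag or X-Robots-Tag
    is_canonical: bool   # url equals its own rel=canonical target
    robots_blocked: bool # disallowed by robots.txt

def sitemap_eligible(p: Page) -> bool:
    """A URL belongs in the sitemap only if it is indexable as-is."""
    return (
        p.status == 200
        and p.is_canonical
        and not p.noindex
        and not p.robots_blocked
    )
```

Filtering with this predicate at generation time is what keeps the 100k-with-40k-junk sitemap from existing in the first place.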

lastmod strategy

lastmod is the single most valuable optional field. It tells Google when the content was last meaningfully changed, influencing recrawl priority.

Rules that matter:

  • Accurate. Set lastmod only when content actually changed. A <link> update, a typo fix, or a new comment is not a material change.
  • Consistent format. ISO 8601: 2026-04-24 or 2026-04-24T09:00:00+00:00. Not "April 24, 2026."
  • Not the same for every URL. If your CMS sets every sitemap entry's lastmod to today's date, Google notices and starts ignoring the field site-wide. Lastmod has to differ per URL based on actual change history.
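Python's datetime library emits the second format directly; a small sketch for producing a compliant value (function name is mine):

```python
from datetime import datetime, timezone

def lastmod_value(changed_at: datetime) -> str:
    """Full ISO 8601 timestamp with offset, e.g. 2026-04-24T09:00:00+00:00.
    Expects a timezone-aware datetime; normalizes it to UTC."""
    return changed_at.astimezone(timezone.utc).isoformat(timespec="seconds")
```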

Common failures:

  • Static build regeneration updates everything. Gatsby/Hugo/Next.js rebuilds stamp every URL with the build time. Fix: pull lastmod from the source file's git history, not from the build time.
  • Plugin-generated sitemaps defaulting to "now." WordPress sitemap plugins sometimes default to the current date if the post's modified_date isn't available. Check your plugin's behavior.
  • Database timestamps vs content timestamps. updated_at in your database updates on every row touch, including view counter increments. Use a separate content_updated_at that changes only on content edits.
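For the static-build case, a sketch that asks git for a file's last commit date (`%cI` is git's strict-ISO committer date; the injectable `run` parameter exists only to make the function testable without a repository):

```python
import subprocess

def git_lastmod(path: str, run=subprocess.run):
    """Last commit date (ISO 8601) of a source file, independent of
    build time. Returns None for untracked files."""
    result = run(
        ["git", "log", "-1", "--format=%cI", "--", path],
        capture_output=True, text=True,
    )
    out = result.stdout.strip()
    return out or None
```

Wire this into the sitemap generator so each entry's lastmod tracks the content file, and rebuilds stop resetting the whole sitemap to "today."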

Image and video sitemaps

Extension sitemaps for image and video content:

Image sitemap (for sites that want image URLs indexed for Google Image Search):

<url>
  <loc>https://example.com/product/sneakers</loc>
  <image:image>
    <image:loc>https://example.com/images/sneakers.jpg</image:loc>
    <image:caption>Nike Pegasus 41 running shoes, side view</image:caption>
  </image:image>
</url>

Useful for ecommerce and visual content sites. The image URLs must be crawlable separately.
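Namespaced entries are easy to get wrong by hand; a sketch generating one with Python's standard ElementTree (the namespace URIs are the standard sitemap and Google image-extension ones; the function name is mine):

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMG = "http://www.google.com/schemas/sitemap-image/1.1"

def image_url_entry(page_url, images):
    """One <url> entry with namespaced <image:image> children.
    images: iterable of (image_url, caption) pairs."""
    url = ET.Element(f"{{{SM}}}url")
    ET.SubElement(url, f"{{{SM}}}loc").text = page_url
    for img_url, caption in images:
        img = ET.SubElement(url, f"{{{IMG}}}image")
        ET.SubElement(img, f"{{{IMG}}}loc").text = img_url
        ET.SubElement(img, f"{{{IMG}}}caption").text = caption
    return url
```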

Video sitemap (for video content pages):

<url>
  <loc>https://example.com/videos/product-demo</loc>
  <video:video>
    <video:thumbnail_loc>https://example.com/thumb.jpg</video:thumbnail_loc>
    <video:title>Product demo: getting started in 3 minutes</video:title>
    <video:description>...</video:description>
    <video:content_loc>https://example.com/video.mp4</video:content_loc>
    <video:duration>186</video:duration>
  </video:video>
</url>

Required for sites that want video rich results and video-specific SERP features. The video file needs to be accessible; thumbnails need to match the video content.

Discovery and submission

Three ways Google discovers your sitemap:

  1. Sitemap: line in /robots.txt — standard, recommended.
  2. Search Console submission — direct submit of sitemap or sitemap index URL. Gives you per-sitemap indexing stats.
  3. HTTP ping — deprecated by Google. Don't rely on it.

Submit the sitemap index, not individual child sitemaps. Google discovers children via the index. If you submit children directly, Search Console's reports become fragmented and harder to interpret.
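The robots.txt side is a single directive pointing at the index (hostname illustrative):

```text
# https://example.com/robots.txt
Sitemap: https://example.com/sitemap-index.xml
```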

Verification and monitoring in GSC

After submission, check:

  • Status: Success — sitemap parsed without errors.
  • Discovered URLs count — matches what you sent (±small margin for URL normalization).
  • Last read — Google's cadence. Large sitemaps are re-read every few days; small ones daily.
  • Indexed URLs (per-sitemap view) — the ratio of indexed-to-discovered per sitemap is the most valuable diagnostic. If sitemap-articles.xml indexes 95% but sitemap-products.xml indexes 40%, the problem lives in product URLs specifically.

Set up weekly monitoring of the indexed ratio per sitemap. A drop of 10% over 2 weeks is a signal that something changed in indexability for that URL group.
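The weekly check can be a few lines against exported GSC numbers. A sketch — the `history` shape is my assumption for hand-recorded weekly figures, not a GSC API response:

```python
def ratio_alerts(history, drop_threshold=0.10):
    """history: {sitemap_name: [(week_label, indexed, discovered), ...]},
    oldest row first. Flag sitemaps whose indexed/discovered ratio fell
    by at least drop_threshold between the first and last row."""
    alerts = []
    for sitemap, rows in history.items():
        ratios = [idx / disc for _, idx, disc in rows if disc]
        if len(ratios) >= 2 and ratios[0] - ratios[-1] >= drop_threshold:
            alerts.append(sitemap)
    return alerts
```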

Common mistakes

One giant sitemap past 50k URLs. Classic oversight; the URLs past 50k silently disappear. Move to a sitemap index.

Listing noindex URLs in sitemap. Confuses Google: "you want this indexed, but also you don't?" Strips your signal.

Every URL has <priority>1.0. Google ignores <priority> entirely. Don't bother with it — focus on lastmod instead.

<changefreq> set without regard to actual change. Google mostly ignores this too. Waste of bytes.

Sitemap at a different host than the content. Sitemaps are per-host. A sitemap at example.com/sitemap.xml listing URLs at www.example.com/ can cause problems. Match the host.

Including query-parameter URLs. ?utm_source=... URLs in sitemap are noise. The sitemap should list canonical versions only.

Frequently asked questions

How often should I regenerate the sitemap?

As often as content changes. For a news site, hourly or in real time. For an ecommerce site with daily catalog updates, daily. For a slow-moving site, weekly. What matters is that lastmod accurately reflects content changes.

Does Google read my sitemap for every URL?

No. Google reads the sitemap periodically (daily or every few days, depending on size and change signals) and uses it to discover URLs and update its recrawl queue. The sitemap isn't the primary discovery mechanism — internal linking is.

Should I use <priority> or <changefreq>?

No. Google ignores both. Skip them to keep sitemap file size smaller.

Can I have too many URLs in my sitemap?

You can't hit a Google-enforced "too many" past splitting into an index. But there's a signal-quality ceiling: a sitemap with millions of URLs, half of which are weak, tells Google your site is half weak. Prune aggressively; quality over quantity.

How do I handle URLs that 404 in my sitemap?

Remove them immediately. A sitemap that lists 404s is lying about indexable URLs. If you know a URL 404'd recently, pull it from the sitemap ASAP — don't wait for the next full regeneration.
