Technical SEO · Glossary · Updated Apr 2026

XML sitemap

Definition

An XML sitemap is a machine-readable file listing URLs that a site wants search engines to discover. A single sitemap can hold up to 50,000 URLs and must be no larger than 50 MB uncompressed; sites that exceed either limit use a sitemap index that references multiple sitemap files.

Long definition

The sitemap protocol (sitemaps.org) defines a tiny XML schema: each <url> entry has a required <loc> and optional <lastmod>, <changefreq>, and <priority>. Google ignores <changefreq> and <priority>; it uses <lastmod> only when the value is reliable, meaning it matches the date the content actually changed.
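The schema described above is small enough to generate with a standard XML library. The following is a minimal sketch using Python's xml.etree.ElementTree; the function name build_sitemap and the example.com URLs are illustrative assumptions, not part of the protocol.

```python
import xml.etree.ElementTree as ET

# Namespace URI defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap from (loc, lastmod) pairs; lastmod may be None.

    Returns UTF-8 encoded bytes including the XML declaration.
    (build_sitemap is a hypothetical helper name.)
    """
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # <loc> is the only required child.
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        # Emit <lastmod> only when a real modification date is known.
        if lastmod:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    data = build_sitemap([
        ("https://example.com/", "2026-04-01"),
        ("https://example.com/about", None),
    ])
    print(data.decode("utf-8"))
```

The sketch leaves out <changefreq> and <priority> on purpose, and it omits <lastmod> rather than guessing a date when none is known.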

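For multilingual sites, the protocol is extended with <xhtml:link rel="alternate" hreflang="..." href="..."> elements inside each URL entry. Every entry lists all language variants of the page, including itself. This is one of the hreflang delivery methods Google supports (alongside HTML <link> tags and HTTP headers), and usually the most practical one on sites with many locales.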
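As a rough illustration of the hreflang extension, the sketch below emits one <url> entry per language variant and attaches the full set of alternates to each entry. The function name build_hreflang_sitemap, the input shape (a dict of hreflang code to URL per page), and the example.com URLs are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_hreflang_sitemap(groups):
    """groups: list of dicts mapping an hreflang code to that variant's URL.

    Each variant gets its own <url> entry, and every entry lists all
    variants in its group, itself included. Returns UTF-8 bytes.
    (build_hreflang_sitemap is a hypothetical helper name.)
    """
    ET.register_namespace("", SM_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SM_NS}}}urlset")
    for variants in groups:
        for loc in variants.values():
            url = ET.SubElement(urlset, f"{{{SM_NS}}}url")
            ET.SubElement(url, f"{{{SM_NS}}}loc").text = loc
            # Repeat the complete alternate set on every entry in the group.
            for lang, href in variants.items():
                ET.SubElement(
                    url,
                    f"{{{XHTML_NS}}}link",
                    {"rel": "alternate", "hreflang": lang, "href": href},
                )
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    page = {"en": "https://example.com/en/", "de": "https://example.com/de/"}
    print(build_hreflang_sitemap([page]).decode("utf-8"))
```

Repeating the alternates on every entry makes the file larger, but it keeps the annotations reciprocal, which hreflang requires.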

Sitemaps are discovered via:

  • A Sitemap: line in /robots.txt
  • Direct submission in Google Search Console (gives you per-sitemap indexing stats)
  • Manual ping (no longer supported by Google, which announced the deprecation of its ping endpoint in 2023; use a robots.txt entry or Search Console submission instead)
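To see the robots.txt route from a crawler's side, the sketch below reads Sitemap: lines with Python's standard urllib.robotparser module (its site_maps() method is available from Python 3.8). The helper name sitemaps_from_robots and the sample robots.txt content are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body.

    (sitemaps_from_robots is a hypothetical helper name.)
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: /cart\n"
        "Sitemap: https://example.com/sitemap_index.xml\n"
        "Sitemap: https://example.com/news.xml\n"
    )
    for url in sitemaps_from_robots(sample):
        print(url)
```

Sitemap: lines are independent of User-agent groups, so they can appear anywhere in the file and more than one may be listed.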

For sites that exceed either per-file limit (50,000 URLs or 50 MB uncompressed), use a sitemap index: a sitemap of sitemaps. The index itself can reference up to 50,000 sitemap files. Common decomposition strategies: by content type (articles, products, categories), by publish date (monthly shards), or by freshness tier (daily-updated vs static).
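One possible way to split a large URL list and build the matching index is sketched below. It shards purely by count, which is the simplest strategy and not one of the content-based ones listed above, and it does not check the 50 MB size limit. The names shard_urls and build_sitemap_index and the shard file naming scheme are assumptions.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit from the protocol


def shard_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive chunks of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_locations):
    """Build a <sitemapindex> that points at each child sitemap file.

    Returns UTF-8 bytes. (Both function names are hypothetical.)
    """
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc in sitemap_locations:
        entry = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = loc
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    urls = [f"https://example.com/p/{n}" for n in range(120_001)]
    shards = shard_urls(urls)
    # Assumed naming scheme for the shard files.
    locations = [
        f"https://example.com/sitemaps/products-{i}.xml"
        for i in range(1, len(shards) + 1)
    ]
    print(len(shards), "shards")
    print(build_sitemap_index(locations).decode("utf-8"))
```

Sharding by content type or publish date follows the same shape: only the grouping step changes, and the index is built the same way.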

A sitemap is a hint, not a guarantee: listing a URL does not force indexing. It just ensures the URL is discoverable and, via <lastmod>, signals when it's worth recrawling.
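Because <lastmod> is only useful when it tracks real changes, one common approach is to update it only when the page content itself changes. The sketch below compares a content hash against the previously stored one; the function name next_lastmod and the idea of storing a hash per URL are assumptions, not something the protocol prescribes.

```python
import hashlib


def next_lastmod(content, previous_hash, previous_lastmod, today):
    """Return (content_hash, lastmod) for a page.

    lastmod moves to `today` only when the content hash differs from the
    stored one; otherwise the earlier date is kept.
    (next_lastmod is a hypothetical helper name.)
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest == previous_hash:
        # Nothing changed, so keep the date search engines already saw.
        return digest, previous_lastmod
    return digest, today


if __name__ == "__main__":
    h1, d1 = next_lastmod("<h1>Hello</h1>", None, None, "2026-04-01")
    h2, d2 = next_lastmod("<h1>Hello</h1>", h1, d1, "2026-04-15")
    h3, d3 = next_lastmod("<h1>Hello again</h1>", h2, d2, "2026-04-20")
    print(d1, d2, d3)
```

In practice the hash would be computed over the main content only, so that a changed footer or timestamp does not count as a modification.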

Common misconceptions

  • "A sitemap gets my pages indexed faster." It makes them discoverable faster. Indexability decisions (quality, duplicates, canonicals) still apply. A sitemap full of thin or duplicate URLs works against you: it declares "these are the pages I think are worth indexing" and leaves Google to disagree, which shows up as excluded URLs in Search Console's indexing reports.
  • "Bigger sitemap = better." Include only indexable, canonical, currently-live URLs. Noindex'd pages, redirects, and 404s in a sitemap pollute the signal and confuse Search Console's reporting.
  • "Lastmod can be set to 'now' to trigger recrawls." Google ignores lastmod when it contradicts what its own crawls observe. Lying in sitemaps doesn't accelerate crawling — it eventually makes Google ignore the field for your site.
  • "Sitemaps must be listed in robots.txt." They don't have to be; Search Console submission is enough. But adding Sitemap: lines to robots.txt helps other crawlers (Bing, and any search engine you don't actively submit to) find them.
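The "include only indexable, canonical, currently-live URLs" rule above can be expressed as a filter applied before the sitemap is written. This is a minimal sketch: the Page record, its fields, and the function name sitemap_eligible are hypothetical and stand in for whatever crawl or CMS data a site actually has.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical record describing one crawled or published page."""
    url: str
    status: int      # HTTP status code the URL returns
    noindex: bool    # True if the page carries a noindex directive
    canonical: str   # canonical URL the page declares


def sitemap_eligible(pages):
    """Keep only URLs that are live, indexable, and self-canonical."""
    return [
        p.url
        for p in pages
        if p.status == 200           # drops redirects and 404s
        and not p.noindex            # drops noindex'd pages
        and p.canonical == p.url     # drops non-canonical duplicates
    ]


if __name__ == "__main__":
    pages = [
        Page("https://example.com/a", 200, False, "https://example.com/a"),
        Page("https://example.com/old", 301, False, "https://example.com/a"),
        Page("https://example.com/gone", 404, False, "https://example.com/gone"),
        Page("https://example.com/thanks", 200, True, "https://example.com/thanks"),
        Page("https://example.com/a?ref=x", 200, False, "https://example.com/a"),
    ]
    print(sitemap_eligible(pages))
```

Running the filter at generation time keeps the sitemap aligned with what the site actually wants indexed, which in turn keeps Search Console's per-sitemap reporting readable.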