Technical SEO · Glossary · Updated Apr 2026

Log file analysis

Definition

Log file analysis is the parsing of web server access logs to see exactly which URLs search crawlers (Googlebot, Bingbot, others) have requested, at what rate, and with what response codes. It's the only ground-truth source for what crawlers actually do on your site.

Long definition

Every request a crawler makes shows up in your web server's access log — Apache, Nginx, load balancers, CDNs all write one line per request. A typical line records the IP, timestamp, method + URL, HTTP status, response size, referrer, and User-Agent. For SEO, the rows with search-crawler User-Agents (verified by reverse DNS) are the signal.
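
A minimal parsing sketch in Python, assuming the Apache/Nginx "combined" log format described above; the access.log path is a placeholder, and the User-Agent screen is only a first pass before the DNS verification covered further down.

    import re

    # Combined log format: IP, identity, user, [timestamp],
    # "METHOD URL PROTOCOL", status, bytes, "referrer", "user-agent".
    LINE_RE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<url>\S+) \S+" '
        r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<ua>[^"]*)"'
    )

    def parse_line(line):
        """Return the fields of one access-log line as a dict, or None on mismatch."""
        m = LINE_RE.match(line)
        return m.groupdict() if m else None

    def looks_like_search_crawler(user_agent):
        """Cheap User-Agent screen; genuine verification needs DNS (see below)."""
        return any(bot in user_agent for bot in ("Googlebot", "bingbot"))

    with open("access.log", encoding="utf-8", errors="replace") as fh:
        crawler_hits = [row for row in map(parse_line, fh)
                        if row and looks_like_search_crawler(row["ua"])]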

What log analysis answers that no other data source can:

  1. Is Googlebot actually crawling the URLs you care about? Your sitemap says they exist; GSC hints at coverage; logs confirm whether the bot hit them.
  2. What's the crawl distribution? Percentage of bot requests going to canonical vs. non-canonical URLs, to low-value faceted URLs, to 404s, to redirects (a counting sketch follows this list).
  3. Response-code patterns — a spike in 5xx to Googlebot is a ranking risk you won't see in GSC until a week later.
  4. Orphan pages that still get crawled — URLs with no internal links but still fetched by Googlebot (via old backlinks, stale sitemap entries, or Google's own memory of previously discovered URLs).
  5. Crawl cadence per URL group — daily vs weekly vs monthly, correlated with importance.
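
A counting sketch for points 2 and 3, assuming rows shaped like the crawler_hits list from the parsing sketch above; the first path segment stands in for whatever URL grouping (templates, directories, parameters) makes sense for the site.

    from collections import Counter
    from urllib.parse import urlsplit

    def crawl_distribution(rows):
        """Summarise crawler hits by HTTP status and by first URL path segment."""
        by_status = Counter(row["status"] for row in rows)
        by_section = Counter()
        parameterised = 0
        for row in rows:
            path = urlsplit(row["url"]).path.strip("/")
            by_section["/" + path.split("/")[0] if path else "/"] += 1
            if "?" in row["url"]:
                parameterised += 1  # faceted/query-string URLs, the usual crawl-budget sink
        return by_status, by_section, parameterised

    by_status, by_section, parameterised = crawl_distribution(crawler_hits)
    print(by_status.most_common())       # a 5xx or 404 spike shows up here
    print(by_section.most_common(10))    # e.g. /search eating a large share of the crawl
    print(parameterised)                 # requests to URLs carrying query strings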

Tools: Screaming Frog Log File Analyser, Splunk, ELK, custom parsers. At scale (>100M rows/month), dedicated log pipelines (Kafka + ClickHouse, BigQuery) beat desktop tools.

Verification is non-negotiable. Anyone can put "Googlebot/2.1" in a User-Agent string, and scraper and spam scripts do it constantly. Google publishes its crawler IP ranges and documents a reverse-then-forward DNS check for confirming that a hit is genuine. Unverified "Googlebot" hits are noise at best and hostile traffic at worst.
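
A sketch of that reverse-then-forward DNS check in Python; for bulk verification it is cheaper to match IPs against Google's published ranges than to do a DNS round-trip per log row.

    import socket

    def verify_googlebot(ip):
        """Reverse-DNS the IP, check the hostname, then forward-resolve it back."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)  # e.g. crawl-66-249-66-1.googlebot.com
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            _, _, forward_ips = socket.gethostbyname_ex(host)
            return ip in forward_ips  # the forward lookup must round-trip to the same IP
        except (socket.herror, socket.gaierror):
            return False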

Common misconceptions

  • "GSC's Crawl Stats report replaces log analysis." GSC's Crawl Stats is aggregated, sampled, and has a ~3-day delay. Logs give you per-URL, real-time, complete data. GSC tells you the average; logs tell you the distribution.
  • "Logs only matter for huge sites." False. Even 10k-URL sites benefit from catching crawl anomalies (Googlebot hammering your /search endpoint, old redirect chains still being crawled, admin URLs being fetched repeatedly).
  • "Setting up log analysis is expensive." For small-medium sites (under 10M requests/month), Screaming Frog's tool runs on a laptop. For larger, the infra cost is a few hundred euros/month — cheap compared to the SEO cost of missing a crawl regression.
  • "CDN logs are enough." Edge CDN logs show cached responses; miss-to-origin requests tell a different story. For complete picture you need origin logs, or at minimum CDN logs with cache_status annotation.