Log File Analysis: What to Look For, Tool by Tool
The only SEO diagnostic with ground truth about what crawlers actually do
If you're doing technical SEO on a site with more than 10,000 URLs and you've never looked at the logs, you're diagnosing with one eye closed.
Log file analysis is the only source of ground truth for what search crawlers actually do. Search Console's Crawl Stats aggregates and samples (and lags by 2-3 days). Tool-based crawls show you what a crawler could fetch. Logs show what Googlebot actually did — every request, every timestamp, every response code. No other diagnostic gets you the same level of specificity.
This article covers the patterns that matter, the tools that handle the work, and the Googlebot verification step most people skip (and shouldn't).
Why log analysis still matters in 2026
In 2026, Search Console is more useful than it was in 2018 — Crawl Stats improved, Page Indexing got granular. But the fundamental limitations persist:
- Aggregation — Crawl Stats gives you averages per host, not per URL. You can't tell which URLs are getting crawled weekly vs never.
- Sampling — for sites above certain thresholds, GSC samples; the "Total crawl requests" figure is an estimate, not a count.
- Lag — GSC data is 2-3 days behind. A regression that hits on Monday doesn't surface in GSC until midweek.
- No attribution — GSC doesn't tell you which 5xx errors correlated with Googlebot activity vs your user traffic vs your monitoring bots.
Logs answer all of those cleanly. And there's no substitute — crawl tools that approximate logs (simulating what Googlebot would fetch) always miss the edge cases that matter.
What you uniquely learn from logs
Five patterns only logs surface:
1. Crawl distribution per URL pattern
Export the last 30 days of Googlebot requests, group by URL prefix, and sort by request count. You'll almost always find surprises:
- A rarely-used search endpoint (/search?q=...) absorbing 20-40% of crawl.
- Parameter-noise URLs (?ref=..., ?utm_*) showing up as 10-15% of crawl even though canonicals point elsewhere.
- Old staging paths (/preview/...) or archived sections (/2019/...) still being hit regularly.
- Faceted URLs you thought were blocked but aren't.
Each pattern is a crawl-budget leak. Fixing the top 3 often reclaims 30-50% of crawl capacity.
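A minimal sketch of that grouping step in Python, assuming an access log in combined format that has already been filtered to verified Googlebot requests; the file name and the one-segment prefix depth are placeholders to adjust for your URL structure.

```python
import re
from collections import Counter

# Count Googlebot requests per top-level URL prefix over a log export.
# Assumes combined-format lines already filtered to verified Googlebot;
# "googlebot_30d.log" and the prefix depth are placeholders.
REQUEST = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+"')

def prefix(path: str, depth: int = 1) -> str:
    path = path.split("?", 1)[0]                 # drop the query string
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(parts[:depth])

counts = Counter()
with open("googlebot_30d.log") as fh:
    for line in fh:
        m = REQUEST.search(line)
        if m:
            counts[prefix(m.group("path"))] += 1

total = sum(counts.values())
for pfx, n in counts.most_common(20):
    print(f"{pfx:<40} {n:>10,}  {n / total:6.1%}")
```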
2. Response code patterns per crawler
Filter to Googlebot and bucket by status code. Healthy shape for a typical site:
- 200 OK: 85-95%
- 301/302 redirects: 3-8%
- 304 Not Modified: 1-5% (higher is better — indicates effective caching)
- 404/410: <2%
- 5xx: <0.5%
Anomalies to investigate:
- 5xx spike — Googlebot throttles on persistent 5xx. A 2% 5xx rate sustained for 24h will measurably drop crawl rate.
- High 301 — means Googlebot is still crawling old URLs and being redirected. Check for internal links still pointing at the old patterns; fix at the source.
- 404 cluster on one URL pattern — a broken category, a deleted batch of products, a routing bug. High-value to catch early.
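A sketch of that bucketing step, under the same assumptions as above (combined log format, pre-filtered to verified Googlebot); the status code is parsed as the first field after the quoted request line, which you may need to adjust for your log format.

```python
from collections import Counter

# Bucket Googlebot requests by HTTP status code and print the distribution.
# Assumes combined log format where the status code is the first field
# after the quoted request; "googlebot_30d.log" is a placeholder.
def status_distribution(log_path: str) -> Counter:
    counts = Counter()
    with open(log_path) as fh:
        for line in fh:
            parts = line.split('"')
            if len(parts) < 3:
                continue
            fields = parts[2].split()
            if fields and fields[0].isdigit():
                counts[fields[0]] += 1
    return counts

counts = status_distribution("googlebot_30d.log")
total = sum(counts.values())
for status, n in counts.most_common():
    print(f"{status}  {n:>10,}  {n / total:6.1%}")
```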
3. Crawl rate regressions early
Daily Googlebot request counts, charted, reveal regressions days before GSC does. A sudden 40% drop or a steady downward trend over 2-3 weeks is usually a technical issue you can fix if you catch it.
Common causes of crawl rate drops that logs surface first:
- A misconfigured WAF or bot manager blocking Googlebot IPs (check 403s to verified Googlebot).
- A robots.txt change that accidentally broadened Disallow patterns.
- A server or hosting change that moved TTFB from 200ms to 800ms.
- A new noindex deployed to a major template, causing Google to deprioritize the whole section.
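A sketch of the trend check, assuming the daily Googlebot request counts have already been aggregated into a date,count CSV (a placeholder file name below); it flags any day that falls 40% or more below its trailing 7-day average.

```python
import csv
from statistics import mean

# Flag crawl-rate regressions from daily verified-Googlebot request counts.
# Assumes a chronological CSV of "date,count" rows (placeholder file name);
# each day is compared against its trailing 7-day average.
def crawl_rate_alerts(csv_path: str, drop_threshold: float = 0.4):
    with open(csv_path) as fh:
        counts = [(date, int(n)) for date, n in csv.reader(fh)]
    alerts = []
    for i in range(7, len(counts)):
        baseline = mean(n for _, n in counts[i - 7:i])
        date, today = counts[i]
        if baseline > 0 and (baseline - today) / baseline >= drop_threshold:
            alerts.append((date, today, round(baseline)))
    return alerts

for date, today, baseline in crawl_rate_alerts("googlebot_daily.csv"):
    print(f"{date}: {today:,} requests, trailing 7-day average was {baseline:,}")
```

A slow slide over 2-3 weeks won't trip a single-day threshold, so chart the series as well rather than relying only on alerts.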
4. Orphan URL crawl behavior
Orphan pages (URLs with no internal links pointing at them) are invisible to most tools but visible in logs. Googlebot still fetches them, just rarely, usually via backlinks or stale sitemap entries. If they rank for anything, worth reconnecting them; if they're truly dead, serve 410.
5. Pre-indexation signals
URLs that Googlebot has fetched but that haven't yet appeared in GSC Coverage sit in the limbo between "crawled" and "indexed or rejected." Spotting them early (they show up in logs within hours of the fetch, but in GSC only days later) lets you address indexability issues before Google concludes the URL is low-value.
Setup options: scale-appropriate tools
Small sites (under 1M Googlebot requests / month):
- Screaming Frog Log File Analyzer — desktop app, reads raw log files. £99/year license. Handles up to ~10M rows comfortably. Best entry point.
- Ryte or Botify log analyzers — more expensive, more features (cross-tool correlation with crawls, GSC integration).
- Manual awk/grep pipelines — viable for one-off analysis, painful for regular monitoring.
Medium sites (1M-50M requests / month):
- A BigQuery pipeline: nightly logrotate to GCS → BigQuery external table → SQL queries (a query sketch follows this list). Costs €50-200/month depending on query frequency.
- ElasticSearch + Kibana — self-hosted, free, significant setup and maintenance cost. Popular choice at enterprises.
- Datadog Log Management — if you already have Datadog, adding log ingestion is fast.
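For the BigQuery route, the per-pattern crawl distribution reduces to a short query; the sketch below runs it through the official Python client, with the project, dataset, table, and column names as placeholders for whatever your ingestion job writes.

```python
from google.cloud import bigquery

# Crawl distribution by URL prefix over the last 30 days, from a BigQuery
# logs table. Project/dataset/table and column names are placeholders;
# assumes one row per verified-Googlebot request with url, status, timestamp.
client = bigquery.Client()
query = """
SELECT
  REGEXP_EXTRACT(url, r'^(/[^/?]*)') AS url_prefix,
  COUNT(*) AS requests,
  COUNTIF(status >= 500) AS errors_5xx
FROM `my_project.logs.googlebot_requests`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY url_prefix
ORDER BY requests DESC
LIMIT 25
"""
for row in client.query(query).result():
    print(row.url_prefix, row.requests, row.errors_5xx)
```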
Large sites (50M+ requests / month):
- Custom pipelines with Kafka → Clickhouse or BigQuery partitioned tables.
- Pre-aggregation before storage (hourly rollups with dimensions) to keep query costs manageable.
The main decision isn't technical — it's how often you'll look at the data. Weekly review → a spreadsheet export from Screaming Frog is enough. Daily monitoring → you need a dashboarded pipeline.
Googlebot verification (non-negotiable)
Anyone can spoof Googlebot/2.1 in their User-Agent. Spammers do it constantly. Treating unverified "Googlebot" hits as real data produces garbage analysis.
Google's official verification method: reverse DNS lookup of the IP, confirm the hostname ends in .googlebot.com or .google.com, then forward DNS lookup to confirm the IP matches.
Shortcut for analysis pipelines: Google publishes IP ranges as JSON. Filter your logs to just requests from those IPs and treat only those as verified.
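A sketch of the reverse-then-forward lookup using only the standard library; in a pipeline you'd cache results (or filter on the published IP ranges instead), since per-request DNS lookups are slow at log scale.

```python
import socket

# Verify a claimed Googlebot IP: reverse DNS must land on googlebot.com
# or google.com, and the hostname must resolve forward to the same IP.
def is_verified_googlebot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)               # reverse lookup
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)   # forward lookup
    except OSError:
        return False
    return ip in forward_ips

print(is_verified_googlebot("66.249.66.1"))  # sample IP from Googlebot's published ranges
```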
Impact of skipping verification: on sites with any measurable spam bot traffic, "Googlebot" in raw logs can be 2-5× the actual verified Googlebot volume. Your distributions, response-code stats, and trend charts all get polluted.
The 5 patterns I check first on any new site
When doing a log analysis for the first time on a site, this is the order:
- Request volume by URL pattern — where is Googlebot spending capacity? Flag anything over 5% of total going to non-indexable patterns.
- Response code distribution — healthy shape or anomalies? Escalate any 5xx rate above 1%.
- Trend over 30 days — rising, flat, or falling? A falling trend is the single most actionable finding.
- Coverage gap — URLs in sitemap not hit by Googlebot in 60+ days. List them; this is your priority-inversion problem.
- Orphan signal — URLs in logs but not in sitemap and not linked internally. Reconnect or retire.
Each of these takes 15-30 minutes once the pipeline is set up. Two hours gets you a complete log-based diagnostic.
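Checks 4 and 5 are set differences between three URL lists: the sitemap, the internal-link graph from your crawler, and verified-Googlebot hits in the logs. A minimal sketch, assuming each list has already been exported as a plain-text file of normalized URLs (the file names are placeholders):

```python
# Coverage gap and orphan signal as set differences.
# Assumes three plain-text files of normalized URLs, one per line
# (placeholder names): sitemap URLs, internally linked URLs from a crawl,
# and verified-Googlebot hits from the last 60 days of logs.
def load_urls(path: str) -> set[str]:
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}

sitemap = load_urls("sitemap_urls.txt")
linked = load_urls("internally_linked_urls.txt")
crawled = load_urls("googlebot_urls_60d.txt")

coverage_gap = sitemap - crawled           # in the sitemap, never hit by Googlebot
orphans = crawled - sitemap - linked       # crawled, but unreachable on-site

print(f"{len(coverage_gap):,} sitemap URLs with no Googlebot hit in 60 days")
print(f"{len(orphans):,} orphan URLs still being crawled")
```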
Integration with action
Log analysis is most valuable when it closes the loop. The pattern:
- Monday morning: export the previous 7 days of Googlebot activity and run the 5 checks above, flagging anomalies.
- Monday afternoon: ticket the fixes (crawl waste blocks, internal linking, sitemap hygiene).
- Following Monday: check the previous week's fixes worked (crawl distribution shifts visibly within 5-7 days).
Without this weekly cadence, log analysis becomes a one-off deep dive that answers questions you've forgotten the context for. The value compounds in the routine.
Common mistakes
Analyzing without verifying. On a site where spam traffic wasn't filtered out, the first run of log analysis produces a distribution that looks totally different (and wrong). Always verify first.
Looking at raw request counts instead of unique URL fetches. A URL fetched 50 times spread over 30 days is a different signal from one fetched 50 times in a single day (often a misconfigured retry loop). Track unique-URLs-per-period metrics too.
Ignoring the difference between CDN logs and origin logs. The CDN edge sees the full traffic; the origin only sees what bypassed the cache. Different stories. Both are useful; know which you're looking at.
Assuming log patterns generalize across user agents. Googlebot, Bingbot, GPTBot, and ClaudeBot have different crawl patterns, different respected-directive subsets, different use cases. Analyze per-agent, especially when making decisions about blocking or allowing patterns.
Frequently asked questions
What if I don't have access to server logs?
Start with CDN logs (most major CDNs expose bot request logs). If that's not possible, some SEO tools approximate log data from their own crawls — imperfect but better than nothing. Long-term, getting log access is a one-time conversation with infra; worth pushing.
How much storage do logs take?
Typical: one log line is 200-400 bytes. 10M requests/day = ~3GB/day raw, compressible to 300-500MB. One year of compressed logs at moderate scale is 100-200GB — cheap to store in object storage.
Are bot logs different from user logs?
Usually the same log file, differentiated by User-Agent. Some teams split bot vs user traffic at the LB/CDN level for clarity. Either works; unified is simpler for first-time analysis.
Do I need to look at logs if I'm already using GSC?
GSC tells you aggregated truth about the past 2-3 days, per-site. Logs tell you per-URL truth in real time. On sites over 10k URLs, you need both — GSC for the status report, logs for the diagnostic detail.
Can I automate log analysis?
Yes. The common pipeline: logs → S3/GCS → daily aggregation job → stored summary tables → dashboard. Automate the ingestion and aggregation; the interpretation (what's an anomaly, what to fix) stays human.
What to read next
- How to prioritize crawl budget for large sites — what to do with the insights logs surface.
- The Complete Guide to Technical SEO Audits — log analysis in the broader audit context.
- Robots.txt patterns from the wild — turning log findings into robots.txt fixes.