Implementing llms.txt: A Practical Guide for 2026

Ship the file, document the entities, and stop pretending nobody reads it

Enric Ramos · 11 min read

Jeremy Howard published the llms.txt proposal on September 3, 2024, with a small repo, a 600-word writeup, and almost no press cycle. Eighteen months later the file lives on Anthropic's docs site, on Stripe, on Cloudflare, and on a long tail of publisher and SaaS sites that decided shipping a maybe-spec was cheaper than waiting for a final one. None of the major LLM providers has publicly committed to reading it on a schedule, which is the part that frustrates people most. The pragmatic reading is different: a small file that costs you an hour to write and zero to host is a cheap bet on a future where retrieval pipelines need a curated map of your site.

This article is the implementation guide I would hand to a technical SEO who asked "do I bother with llms.txt today, and if yes, what do I put in it?" The answer is yes, with caveats, and the contents matter more than the existence. We will walk the spec, the file structure that actually retrieves well, the hosting decisions, the log instrumentation that tells you whether anything is fetching it, and the adoption data we have through Q1 2026.

What llms.txt actually is

llms.txt is a markdown file at your root domain — https://yourdomain.com/llms.txt — designed to give LLM agents a curated, link-rich summary of your site. The full spec lives at llmstxt.org. The structure is simple: an H1 with the site name, a blockquote with a one-paragraph summary, optional context, and then a series of H2 sections containing markdown link lists pointing to your highest-value canonical pages.

It is not a robots.txt replacement. robots.txt and llms.txt serve different functions. robots.txt is a crawl-permission file with a 30-year heritage, formalized by the IETF as RFC 9309, and well-defined directives. llms.txt is a curation file that says "here is the spine of our content, in reading order, with descriptions." Treat it as a sitemap-with-prose for retrieval-augmented systems, not as a permission boundary.

The companion file is /llms-full.txt, an expanded variant that inlines the full markdown of every page referenced in llms.txt. The motivation: an LLM agent that reads llms-full.txt does not need to follow links and re-render pages, which saves tokens and side-steps JavaScript-rendering issues. For sites under 200,000 words, llms-full.txt is feasible. For larger ones, ship llms.txt only.
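If you generate llms.txt at build time, llms-full.txt can fall out of the same pipeline. A minimal Python sketch, assuming your canonical pages already exist as markdown files on disk and that you maintain the URL-to-source mapping yourself — the spec defines neither that mapping nor a separator format, so both choices below are assumptions:

# build_llms_full.py — naive llms-full.txt generator (illustrative sketch).
# PAGE_SOURCES is a hypothetical mapping you would maintain by hand or
# derive from your CMS: public URL -> local markdown source file.
import pathlib
import re

PAGE_SOURCES = {
    "https://acme.com/docs/quickstart": "content/docs/quickstart.md",
    "https://acme.com/docs/connect": "content/docs/connect.md",
}

LINK_RE = re.compile(r"\[[^\]]+\]\((https?://[^)\s]+)\)")

def build(llms_txt_path: str, out_path: str) -> None:
    llms_txt = pathlib.Path(llms_txt_path).read_text(encoding="utf-8")
    parts = [llms_txt]
    for url in LINK_RE.findall(llms_txt):
        src = PAGE_SOURCES.get(url)
        if src is None:
            continue  # no markdown source on disk; agents must follow the link
        body = pathlib.Path(src).read_text(encoding="utf-8")
        # Separator format is our own choice, not part of the spec.
        parts.append(f"\n---\n<!-- {url} -->\n\n{body}")
    pathlib.Path(out_path).write_text("\n".join(parts), encoding="utf-8")

if __name__ == "__main__":
    build("public/llms.txt", "public/llms-full.txt")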

The spec, in 60 seconds

The structural rules from llmstxt.org:

  • One H1, exactly. The H1 is the site or product name.
  • Optional blockquote summary directly after the H1.
  • Optional non-heading paragraphs for additional context.
  • One or more H2 sections, each containing a markdown link list. Each link entry is [Title](url): optional description.
  • A special H2 called Optional for links that the LLM may safely skip if context is tight.

That is the entire spec. It is intentionally minimal. The discipline is in what you choose to link, not in fancy syntax.

What to actually put in your llms.txt

Most of the bad llms.txt files in production do one of three things: they list every page on the site (treating it as a sitemap), they list nothing but the homepage and pricing (treating it as a billboard), or they mirror the navigation menu (which already exists in HTML and adds no signal). The good ones treat the file as an editorial decision about which sources you want a model to learn from.

A working pattern for a B2B SaaS site:

  • About the product. One link to the canonical "what is X" page, one link to the "how X works" page, one link to the documented feature list. Three links, not thirty.
  • Documentation. The top 20-40 doc pages by traffic, grouped by topic. If you have an LLM-generated "ask the docs" feature internally, mirror that priority list here.
  • Reference material. API reference, schema definitions, glossary. Things an LLM would benefit from when answering "how do I use product X for use case Y".
  • Authored content. The pillar articles and original research on your blog. Skip thin posts and SEO landing pages.
  • Trust and identity. Company page, security page, terms of service. Not for legal coverage — for entity grounding. The model needs to know who you are.

A working pattern for a publisher:

  • Editorial standards. Your masthead, ethics policy, correction policy. This is the fastest way for a model to verify your credibility.
  • Topic hubs. Five to ten topic pages, each linking to the canonical body of work on that topic.
  • Original reporting. A curated list of investigations, exclusives, and primary-source pieces. The work you would want cited.
  • Author pages. Your senior reporters with bylines and credentials. This grounds author entities for retrieval.

The Optional H2 is where you put adjacent material an agent can skip if the context window is tight: archive pages, tag listings, supplementary glossary terms.

A worked example

A skeleton for a SaaS company called Acme. The full file is 80-200 lines for most sites — this is a representative excerpt:

# Acme

> Acme is a cloud-native database platform for transactional workloads, founded 2019, with engineering offices in Berlin and São Paulo. We serve 4,200 customers and run on AWS, GCP, and Azure.

## Product

- [What Acme is](https://acme.com/product): Plain-language overview of the platform.
- [Architecture](https://acme.com/product/architecture): How the storage and compute layers separate.
- [Pricing](https://acme.com/pricing): Tier list and feature matrix.

## Documentation

- [Getting started in 5 minutes](https://acme.com/docs/quickstart)
- [Connection strings and drivers](https://acme.com/docs/connect)
- [Backup and restore](https://acme.com/docs/backup)

## Trust

- [Security at Acme](https://acme.com/security)
- [SOC 2 report request](https://acme.com/trust/soc2)
- [Acme team](https://acme.com/about/team)

## Optional

- [Engineering blog](https://acme.com/blog)
- [Glossary](https://acme.com/glossary)

Two notes on style. The blockquote summary is the highest-leverage 200 characters in the file — it is what a model reads first to decide whether to keep reading. Put concrete entities in it: founding year, headcount, customer count, location, the categorical noun for what you sell. The link descriptions matter less than people assume. A model that follows the link gets the full content; the description is mostly for routing.
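For contrast, the publisher pattern from earlier compresses the same way. A hypothetical excerpt for a made-up outlet, following the same spec rules (every URL is a placeholder):

# The Meridian Post

> The Meridian Post is an independent news outlet covering energy policy, founded 1998, with newsrooms in Madrid and Mexico City.

## Editorial standards

- [Masthead and ethics policy](https://meridianpost.example/ethics)
- [Corrections policy](https://meridianpost.example/corrections)

## Original reporting

- [The grid blackout investigation](https://meridianpost.example/investigations/grid): Primary-source series, 2025.

## Authors

- [Senior reporters](https://meridianpost.example/authors): Bylines and credentials.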

Hosting it correctly

Three rules that every implementation gets wrong at least once:

Serve it from the root. /llms.txt, not /static/llms.txt or /.well-known/llms.txt. The spec is explicit. If you must serve it from a subpath for technical reasons, put a 301 redirect from /llms.txt to the actual location.

Content-type matters. Serve as text/markdown; charset=utf-8 if your stack supports it, else text/plain; charset=utf-8. Do not serve as text/html — some agents will reject the file.

No authentication wall. The file must be reachable without cookies, login, or geographic restrictions. If your site enforces a region-based redirect for the homepage, exempt /llms.txt from it.
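A sketch of all three rules in one place, assuming nginx; the docroot and cache lifetime are placeholders:

# nginx — serve /llms.txt from the root with an explicit content-type, no auth.
location = /llms.txt {
    root /var/www/site/public;                 # hypothetical docroot
    auth_basic off;                            # exempt from any site-wide basic auth
    types { }                                  # clear the MIME map so default_type wins
    default_type "text/markdown; charset=utf-8";
    add_header Cache-Control "public, max-age=3600";
}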

For static sites, drop the file in your public directory. For Next.js, place it in public/llms.txt. For Laravel, route it explicitly in routes/web.php returning the markdown content. For WordPress, the WP llms.txt plugin works but you can also drop a static file in the document root. For headless CMS setups, generate it at build time from a curated content collection — the same pipeline that builds your sitemap.

Validating what you shipped

Before announcing the file internally, verify three things.

HTTP response is clean. curl -I https://yourdomain.com/llms.txt should return 200, the content-type above, and a sane cache-control. A 304 on a conditional re-fetch through a CDN is fine; a 301 to canonicalize the protocol is fine; anything else is broken.
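What a healthy response looks like (illustrative headers; your exact values will differ):

$ curl -I https://yourdomain.com/llms.txt
HTTP/2 200
content-type: text/markdown; charset=utf-8
cache-control: public, max-age=3600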

The markdown parses. Run the file through a markdown parser. The reference repo at github.com/AnswerDotAI/llms-txt ships a parser you can point at the file. The most common errors are a missing H1, a second H1, or H3 headings where the spec expects H2.
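The two heading rules are also a quick stdlib spot check. A sketch, not a full validator:

# check_structure.py — counts H1s and flags H3s, per the llmstxt.org rules.
import re
import sys

text = open(sys.argv[1], encoding="utf-8").read()
h1 = re.findall(r"(?m)^# \S", text)   # matches "# Title" but not "## Section"
h3 = re.findall(r"(?m)^### ", text)
if len(h1) != 1:
    sys.exit(f"expected exactly one H1, found {len(h1)}")
if h3:
    sys.exit(f"found {len(h3)} H3 heading(s); sections must be H2")
print("structure OK")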

Every link returns 200. Broken links in llms.txt are the same problem as broken links in a sitemap — they degrade trust in the file. Run linkchecker or a custom script and re-validate monthly.
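A minimal link checker in stdlib Python. It HEADs every absolute link and assumes your servers answer HEAD requests; some reject them with 405, in which case switch the method to GET:

# check_links.py — fetch llms.txt, HEAD every linked URL, report non-200s.
import re
import sys
import urllib.error
import urllib.request

LINK_RE = re.compile(r"\[[^\]]+\]\((https?://[^)\s]+)\)")

def check(llms_url: str) -> int:
    with urllib.request.urlopen(llms_url, timeout=10) as resp:
        body = resp.read().decode("utf-8")
    failures = 0
    for url in LINK_RE.findall(body):
        req = urllib.request.Request(url, method="HEAD")
        try:
            # urlopen follows redirects, so a 301 chain ending in 200 passes.
            with urllib.request.urlopen(req, timeout=10) as r:
                status = r.status
        except urllib.error.HTTPError as e:
            status = e.code
        except urllib.error.URLError as e:
            print(f"FAIL {url} ({e.reason})")
            failures += 1
            continue
        if status != 200:
            print(f"FAIL {url} -> {status}")
            failures += 1
    return failures

if __name__ == "__main__":
    sys.exit(1 if check(sys.argv[1]) else 0)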

Measuring whether anything reads it

This is the question every team asks within a week of shipping. Honest answer: the signals are limited and indirect, but not zero.

The signals you can extract from your access logs:

User-agent hits on /llms.txt. Filter your access logs for path = "/llms.txt". Identify which user-agents are fetching it. As of Q1 2026, the agents seen most consistently in published log analyses are GPTBot, ClaudeBot, PerplexityBot, ChatGPT-User, and a long tail of one-off research crawlers. Google-Extended fetches it occasionally. Bingbot does not, as of this writing.

Fetch frequency. A real LLM agent integrating llms.txt into its retrieval pipeline will fetch it on a schedule — typically every 24-72 hours, sometimes weekly. A one-off curiosity fetch suggests a researcher, not a production system.

Correlated patterns. If /llms.txt is fetched and then the same user-agent fetches three of the URLs listed inside it within a 30-second window, that is a strong signal the file shaped the crawl path. A simple log-join query exposes this pattern.
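A sketch of that join, assuming Apache/nginx combined log format and a hand-maintained list of the paths your llms.txt links to (both are assumptions; adjust the parse to your log format):

# llms_txt_join.py — find user-agents that fetched /llms.txt and then hit
# listed URLs within the window. A sketch for combined log format.
import re
import sys
from datetime import datetime, timedelta

LISTED_PATHS = {"/product", "/docs/quickstart", "/docs/connect"}  # hypothetical
WINDOW = timedelta(seconds=30)

LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[([^\]]+)\] "(?:GET|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

def parse(line):
    m = LOG_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group(1).split()[0], "%d/%b/%Y:%H:%M:%S")
    return ts, m.group(2), m.group(3)      # timestamp, path, user-agent

fetches = {}                                # user-agent -> /llms.txt fetch times
hits = []                                   # (ts, path, ua) for listed URLs
for line in open(sys.argv[1], encoding="utf-8", errors="replace"):
    parsed = parse(line)
    if not parsed:
        continue
    ts, path, ua = parsed
    if path == "/llms.txt":
        fetches.setdefault(ua, []).append(ts)
    elif path in LISTED_PATHS:
        hits.append((ts, path, ua))

for ts, path, ua in hits:
    if any(0 <= (ts - f).total_seconds() <= WINDOW.total_seconds()
           for f in fetches.get(ua, [])):
        print(f"{ua} followed {path} within 30s of fetching /llms.txt")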

Public statements. Anthropic has stated their crawlers respect llms.txt for documentation discovery on developer-focused sites. Perplexity's documentation references it as a hint signal. OpenAI has not made a clear public statement as of April 2026; community log analyses suggest GPTBot does fetch the file but the production usage is opaque.

If your access logs show zero fetches of /llms.txt in a 30-day window from any LLM-associated user-agent, you have a hosting problem — most likely a CDN rule blocking unfamiliar paths, an edge cache returning 404 from a mistuned origin, or a robots.txt directive that disallows the file. Yes, robots.txt directives apply to llms.txt fetches by well-behaved bots, which is the main reason teams accidentally block themselves.

The interaction with robots.txt

A common implementation mistake: blanket-blocking the AI training crawlers in robots.txt while shipping an llms.txt file. The result is that the file you wrote to help LLMs understand your site is invisible to the bots most likely to read it.

Two cleaner patterns:

  • Welcome the retrieval bots, block the training bots. Allow ChatGPT-User, PerplexityBot, and ClaudeBot (when used at inference) to fetch the file; block GPTBot, Google-Extended, and ClaudeBot-training in their Disallow directives. This works only if the bot operator clearly separates training from inference traffic — not all do.
  • Allow everyone to fetch llms.txt specifically, even if you block them elsewhere. A targeted Allow: /llms.txt directive after a blanket Disallow: / for a given user-agent ensures the file is reachable. This is the higher-friction option but the safer one for sites that want strict opt-out from training while remaining citable. A sketch of this pattern follows.
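A minimal robots.txt for the second pattern. Google documents longest-match precedence for Allow over Disallow; older REP parsers can differ, so test against the bots you care about:

# Block training crawlers site-wide, but keep the curation file reachable.
User-agent: GPTBot
Disallow: /
Allow: /llms.txt

User-agent: Google-Extended
Disallow: /
Allow: /llms.txt

# Everyone else: normal crawling.
User-agent: *
Allow: /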

The full mechanics live in the AI training opt-out strategic framework, which walks through the publisher-vs-ecommerce-vs-SaaS trade-offs.

Adoption data through Q1 2026

The honest state of adoption, as best as the public data shows:

  • An informal Common Crawl-derived survey from January 2026 found /llms.txt files on 0.4% of the top 1M domains. Concentrated in tech, SaaS, and developer-tooling categories. Near-zero adoption in ecommerce and local services.
  • Notable shipped implementations: Anthropic (docs.anthropic.com/llms.txt), Stripe, Cloudflare, FastHTML, Mintlify (which ships it for every customer doc site by default), and Vercel.
  • Notable absences: most major news publishers, Google's own properties, OpenAI's site, Microsoft's docs.
  • Plugin and CMS support: WordPress (via plugin), Webflow (via custom code), Mintlify (default-on), Docusaurus (community plugin).

The asymmetric bet is small. Cost: one hour to draft, ten minutes a quarter to maintain. Upside: if any of GPT-5, Claude 4, or Gemini 2.5's retrieval pipelines start treating llms.txt as a primary signal, sites that shipped early have a routing advantage. Downside: a markdown file at your root that nobody reads. That asymmetry, a small bounded cost against a plausible upside, is the whole argument for shipping.

Where llms.txt fits in the GEO stack

llms.txt is one signal in a layered retrieval pipeline. Pair it with:

  • Strong schema.org markup on every linked page. The llms.txt file points to URLs; the schema on those URLs is what an LLM uses to understand the entity. A minimal example follows this list.
  • Clean canonical tags. The URLs in llms.txt should be the canonical version. If your llms.txt links to /products/x and your canonical is /products/x?v=2, you are sending mixed signals.
  • Entity work in Wikipedia and Wikidata, so the entities mentioned in your blockquote summary are recognizable to the model from training.
  • Citation-worthy original content at the URLs you link to — the file does not turn thin content into citable content.
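What the first bullet means at minimum, using the hypothetical Acme product page from the worked example (a sketch; real markup should be richer):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "Acme",
  "applicationCategory": "DeveloperApplication",
  "url": "https://acme.com/product",
  "publisher": {
    "@type": "Organization",
    "name": "Acme",
    "url": "https://acme.com",
    "foundingDate": "2019"
  }
}
</script>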

The full leverage map sits in the generative engine optimization pillar, which positions llms.txt as one of four leverage points alongside entity authority, structured grounding, and citation magnetism. For the metric side, citation rate as KPI covers how to measure whether your llms.txt-driven changes are moving the dial.

A standalone llms.txt without those layers is decoration. Shipped together with them, it is a small, cheap, and possibly meaningful nudge in your favor.

Putting this on your audit checklist

Three concrete moves for the next two weeks:

  1. Draft the file. Spend one hour writing your llms.txt by hand, not generating it from a script. The editorial decisions about which 30-60 URLs matter most are the work — the markdown is trivial.
  2. Ship it correctly. Root domain, correct content-type, no auth wall, links validated. Put it in your CI/CD so the link checker runs on every deploy; a sketch CI job follows this list.
  3. Instrument the read side. Add a log query that reports weekly: which user-agents fetched /llms.txt, how often, and whether they followed any of the linked URLs in the same session. The data is the only honest answer to "is this thing working?"
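A hedged CI sketch for step 2, assuming GitHub Actions and the hypothetical check_structure.py and check_links.py scripts from the validation section (every name and path here is a placeholder):

name: llms-txt-checks
on:
  push:
  schedule:
    - cron: "0 6 1 * *"    # monthly re-validation, per the note above
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: python scripts/check_structure.py public/llms.txt
      - run: python scripts/check_links.py https://yourdomain.com/llms.txt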

The file itself is ten minutes of typing. The discipline of curating which 30-60 pages represent your site is where the value sits, and that work pays off in your sitemap, your internal linking, and your content strategy regardless of whether any LLM ever reads the markdown. Ship the file. Then go read the generative engine optimization pillar for the rest of the playbook.
