Entity SEO: Building the Knowledge Graph LLMs Read
From indexed page to recognized topic authority, via Wikidata
Ask ChatGPT about your competitor and you will get a paragraph of confident facts. Ask it about you and you will get "I don't have specific information about that company." That gap is not random. It is the difference between an entity the model recognizes and a string of text it has merely seen during training. Entity SEO is the discipline of closing that gap on purpose.
For 15 years, SEO worked on pages and links. The unit of optimization was a URL. That model still runs Google's blue links, but it is not what large language models do. LLMs compress the open web into a knowledge graph during pre-training, then enrich that graph at inference time via retrieval. If your company, product, or author is not a node in that graph, no amount of on-page optimization gets you cited.
This article walks through what an entity actually is to a model, how the knowledge graph gets built, the role Wikipedia and Wikidata play as the spine, and the practical steps that move you from "indexed string" to "recognized topic authority". Real schemas, real verification queries, real timelines.
What an entity is to a language model
A search engine entity is a distinct, identifiable thing — a person, organization, product, place, concept — with a stable identifier and a cluster of attributes. Google's Knowledge Graph has used machine-readable entity IDs (the kg:/m/ MIDs inherited from Freebase) since 2012. Wikidata uses Q-numbers (Barcelona is Q1492). Both serve the same function: a primary key the system can attach facts to.
LLMs build something analogous during pre-training, but it is fuzzier. A model does not store "Stripe = Q24278997" as an explicit row. It stores a dense embedding region where the token sequence "Stripe" co-occurs with "payments", "API", "Patrick Collison", "Dublin", "Y Combinator", and "Bridge". When a user asks "what payment processors integrate well with Next.js", the model retrieves that region and generates a response constrained by it.
The practical implication: you are not optimizing for a keyword match. You are optimizing for a stable cluster of co-occurrences strong enough that the model resolves your name to a single, correct entity rather than a generic noun phrase or, worse, a confused merge with another company.
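The co-occurrence idea above can be sketched with a toy counter. The corpus sentences and attribute terms below are invented for illustration, not real training data — the point is only that sentences pairing the name with distinctive terms strengthen the cluster, while homonym sentences contribute nothing:

```python
from collections import Counter

# Count how often a brand name appears in the same sentence as its
# distinctive attribute terms — a crude proxy for the co-occurrence
# cluster a model builds during pre-training.
corpus = [
    "Stripe is a payments company with a developer-first API.",
    "Patrick Collison co-founded Stripe after Y Combinator.",
    "Stripe opened an engineering hub in Dublin.",
    "A stripe of paint ran down the wall.",  # the homonym collision case
]
attributes = {"payments", "api", "collison", "dublin", "combinator"}

cooccurrence = Counter()
for sentence in corpus:
    tokens = {t.strip(".,").lower() for t in sentence.split()}
    if "stripe" in tokens:
        cooccurrence.update(tokens & attributes)

# The paint sentence adds no attribute hits — exactly the ambiguity an
# entity-resolution system has to overcome.
print(cooccurrence.most_common())
```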
Why Wikipedia and Wikidata are the training spine
Every major foundation model — GPT-4, Claude 3, Gemini, Llama 3, Mistral — trained on a snapshot of Wikipedia. It is not the largest source by token count, but it is the highest-quality structured source. Wikipedia articles are explicitly entity-keyed, cross-linked, fact-checked, and machine-readable via the Wikimedia API.
Wikidata sits underneath Wikipedia. It is the collaboratively edited, multilingual knowledge base that powers infoboxes, the Google Knowledge Panel, and entity disambiguation pipelines across major search systems. When you see a knowledge panel on a Google SERP, the canonical facts almost always trace back to a Wikidata Q-number.
For LLMs, this means Wikidata acts as a high-precedence source during both pre-training and retrieval-augmented inference. A fact that appears in Wikidata tends to be treated as canonical unless contradicted by a higher-confidence source. A claim that exists only on your own site carries lower epistemic weight in the model's internal reasoning.
The path to entity status almost always runs through here. Notable companies, products, and people get Wikipedia articles; those articles get linked to Wikidata items; those items get pulled into the training data. If you skip this layer, you are asking the model to construct your entity from scattered web mentions alone — and it will often fail or hallucinate.
The "indexed string" trap
Most B2B SaaS companies are stuck at indexed-string status. Google has crawled their site. Their pages rank for branded queries. Their docs show up in technical searches. None of that makes them an entity. Three diagnostic queries reveal the gap:
- Ask GPT-4o, Claude 3.5, and Gemini 1.5 the same question: "Tell me about [your company] in three sentences. Include founding year, founders, and primary product."
- Score the answers on factual accuracy and confidence. Hedging language ("I'm not certain", "based on limited information") is a direct signal of weak entity status.
- Ask the inverse: "What companies compete with [established competitor] in [your category]?" If you are not in the list, the model does not associate you with the category cluster.
A company with strong entity status passes all three. The Wikipedia article exists, the Wikidata item is populated with `instance of`, `industry`, `founded by`, and `inception` properties, and the schema.org `Organization` markup on the corporate site reinforces the same facts. A company stuck at indexed-string fails at least two.
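A minimal scorer for those diagnostic answers might look like this. The hedge phrases and the two example answers are assumptions for illustration; in practice you would paste in real model outputs:

```python
import re

# Phrases that signal weak entity status, per the diagnostic above.
HEDGES = [
    r"i('m| am) not (certain|sure)",
    r"i don'?t have (specific|detailed) information",
    r"based on limited information",
    r"i couldn'?t find",
]

def entity_signal(answer: str, brand: str) -> dict:
    """Score one model answer: brand inclusion and absence of hedging."""
    text = answer.lower()
    hedged = any(re.search(p, text) for p in HEDGES)
    return {
        "brand_mentioned": brand.lower() in text,
        "hedged": hedged,
        "passes": brand.lower() in text and not hedged,
    }

weak = entity_signal(
    "I don't have specific information about that company.", "Acme"
)
strong = entity_signal(
    "Acme, founded in 2019, sells a billing API for SaaS teams.", "Acme"
)
print(weak["passes"], strong["passes"])  # False True
```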
Building the entity, layer by layer
Treat entity construction as a layered project, not a single fix. Each layer reinforces the next; skipping layers leaves the structure unstable.
Layer 1 — Canonical entity definition. Pick the exact name you want recognized. "Stripe" not "Stripe, Inc." not "Stripe Payments". Variants confuse the disambiguation pipeline. Document the canonical form and use it consistently in schema, press releases, social bios, and Wikipedia.
Layer 2 — Schema.org Organization on your site. Implement `Organization` structured data on the homepage and an `AboutPage` if you have one. Include `name`, `legalName`, `url`, `logo`, `foundingDate`, `founder` (each as a `Person` with their own URL), `address`, and critically, `sameAs` pointing to every authoritative profile: LinkedIn company page, Crunchbase, GitHub org, Wikidata Q-number, Wikipedia article URL.
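The Layer 2 markup can be sketched as JSON-LD built from a Python dict. The company name, dates, and profile URLs below are placeholders, not a real organization:

```python
import json

# Organization structured data for the homepage; serialize and embed in a
# <script type="application/ld+json"> tag.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Acme",
    "legalName": "Acme, Inc.",
    "url": "https://www.acme.example/",
    "logo": "https://www.acme.example/logo.png",
    "foundingDate": "2019-03-01",
    "founder": [{
        "@type": "Person",
        "name": "Jane Doe",
        "url": "https://www.acme.example/about/jane-doe",
    }],
    "sameAs": [
        "https://en.wikipedia.org/wiki/Acme",
        "https://www.wikidata.org/wiki/Q12345678",
        "https://www.linkedin.com/company/acme",
        "https://www.crunchbase.com/organization/acme",
        "https://github.com/acme",
    ],
}

print(json.dumps(organization, indent=2))
```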
Layer 3 — Wikidata item. Even before Wikipedia notability, you can create or augment a Wikidata item. Populate `instance of` (Q4830453 for business), `industry`, `country`, `inception`, `founded by`, `headquarters location`, `official website`, and `subsidiary` / `parent organization` if applicable. Each property cites a source — a press article, an SEC filing, the company's About page. Wikidata is permissive about creation but ruthless about poorly sourced edits.
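Those Layer 3 properties map to real Wikidata property IDs (the PIDs below and Q4830453 are actual Wikidata identifiers). A small audit helper, assuming the claims-dict shape returned by the `wbgetentities` API, might look like:

```python
# Wikidata properties for a company item, keyed by PID.
WIKIDATA_CHECKLIST = {
    "P31":  "instance of",            # e.g. Q4830453 (business)
    "P452": "industry",
    "P17":  "country",
    "P571": "inception",
    "P112": "founded by",
    "P159": "headquarters location",
    "P856": "official website",
    "P749": "parent organization",    # if applicable
}

def missing_properties(item_claims: dict) -> list:
    """Given the 'claims' dict of a Wikidata item, list unfilled PIDs."""
    return [pid for pid in WIKIDATA_CHECKLIST if pid not in item_claims]

# Example: an item with only "instance of" and "official website" set.
print(missing_properties({"P31": [], "P856": []}))
```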
Layer 4 — Wikipedia article. This is the gate. Wikipedia notability requires "significant coverage in reliable, secondary sources independent of the subject". Translation: at least three full-length articles in tier-one publications (TechCrunch alone is no longer enough; you want WSJ, FT, NYT, The Information, or strong industry publications) explicitly about the company, not just quoting it. Once notability is met, the article should be drafted by someone other than your team — agencies that "do Wikipedia for SEO" frequently get articles flagged and deleted because of conflict-of-interest editing.
Layer 5 — Reinforcement at scale. Press mentions, podcast appearances, conference talks, GitHub repos, academic citations. Each adds a co-occurrence signal that travels into the next training run. The compounding here is real: by year three, the entity is so well-established that minor errors in any single source no longer perturb the model's representation.
sameAs: the property that does the heavy lifting
If you implement only one schema property well, make it sameAs. It is the equal sign across the open web's identity graph. Your Organization schema's sameAs should list, at minimum:
- Wikipedia article URL
- Wikidata entity URL (https://www.wikidata.org/wiki/Q12345678)
- LinkedIn company URL
- Crunchbase profile URL
- GitHub organization URL (if you ship code)
- X / Twitter handle URL
- Official YouTube channel URL
Each sameAs target should reciprocally point back to your canonical site. Wikipedia's external links section, Wikidata's official website property, the URL field on LinkedIn, the website field on GitHub. A bidirectional graph is much harder for any system to misresolve than a unidirectional one.
For founders and key authors, replicate this pattern with Person schema. Author entities matter for E-E-A-T and increasingly for AI citation — Perplexity weights author authority signals when ranking sources for a query.
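Replicating the pattern for a founder might be sketched like this — the name, title, and URLs are placeholders:

```python
import json

# Person structured data for an About page, with reciprocal sameAs links.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Doe",
    "jobTitle": "Co-founder and CEO",
    "url": "https://www.acme.example/about/jane-doe",
    "worksFor": {
        "@type": "Organization",
        "name": "Acme",
        "url": "https://www.acme.example/",
    },
    "sameAs": [
        "https://www.linkedin.com/in/janedoe",
        "https://x.com/janedoe",
        "https://github.com/janedoe",
    ],
}

print(json.dumps(person, indent=2))
```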
Disambiguation: when your name collides
A surprising number of SEO problems are entity collisions. "Notion" is also a verb. "Stripe" is also a pattern. "Anthropic" is also an adjective. "Apollo" is at least a dozen different companies, a Greek god, and a NASA program. The model's job at retrieval time is to pick the right one given the query context.
You help disambiguation through three signals:
- Distinctive co-occurrence terms in your canonical content. "Notion" + "workspace" + "blocks" + "Ivan Zhao" forms a tight cluster the model can resolve.
- Schema `disambiguatingDescription` on the `Organization` or `Thing`. A 1-2 sentence definition that includes the category and a distinguishing feature.
- Wikidata `description`, which appears under the entity name in many AI search interfaces and trains the model on the concise canonical phrasing.
When users query "Apollo" + "sales engagement", the model should resolve to your company. When they query "Apollo" + "moon landing", it should not. The work is making the cluster legible.
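A toy resolver illustrates the context-cluster idea. The candidate entities and their distinctive term sets are invented for illustration:

```python
# Candidate "Apollo" entities, each with its distinctive co-occurrence terms.
CANDIDATES = {
    "Apollo (sales platform)": {"sales", "engagement", "outbound", "crm"},
    "Apollo (NASA program)": {"moon", "landing", "nasa", "saturn"},
    "Apollo (Greek god)": {"greek", "god", "delphi", "lyre"},
}

def resolve(query: str) -> str:
    """Pick the candidate whose term set overlaps the query most."""
    terms = set(query.lower().split())
    return max(CANDIDATES, key=lambda name: len(CANDIDATES[name] & terms))

print(resolve("apollo sales engagement pricing"))  # Apollo (sales platform)
print(resolve("apollo moon landing timeline"))     # Apollo (NASA program)
```

The tighter and more distinctive your term set, the less context a query needs before the right entity wins.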
Measuring entity strength over time
Entity status is not binary, and it is not visible in Google Search Console. You need an out-of-band measurement loop. The lightest version is a monthly prompt panel:
- A fixed list of 20 questions about your company, category, and competitive set
- Run against GPT-4o, Claude 3.5, Gemini 1.5, and Perplexity
- Score each response on (a) inclusion of your brand, (b) factual accuracy, (c) citation rate of your owned URLs
Track the trend. A new Wikipedia article typically lifts inclusion scores within 3-6 months — the next training cutoff for at least one major model. Wikidata edits show up faster because Wikidata is a live retrieval source for some systems, not just a training source. Schema additions on your own site help retrieval-augmented systems immediately and pre-training-only models on a 12-18 month lag.
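Aggregating the panel into a trend is a few lines. The monthly scores below are invented to show the shape of the data, not real results:

```python
import statistics

# Each run is a list of per-question scores: 1 if the brand was included
# and factually accurate, else 0.
panel_runs = {
    "2026-01": [0, 0, 1, 0, 0, 1, 0, 0],
    "2026-04": [1, 0, 1, 1, 0, 1, 0, 1],  # after Wikidata + schema work
    "2026-07": [1, 1, 1, 1, 0, 1, 1, 1],  # after Wikipedia article lands
}

inclusion_rate = {
    month: statistics.mean(scores) for month, scores in panel_runs.items()
}
for month, rate in inclusion_rate.items():
    print(month, f"{rate:.0%}")
```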
For a deeper dive on the measurement framework, see tracking brand visibility in LLM answers — the toolset has matured in 2025 and 2026.
Where this connects to the rest of GEO
Entity SEO is one of four leverage points in the broader generative engine optimization discipline. The others — citation magnetism, structured grounding, and brand mention density — all depend on having a strong entity foundation. You cannot earn AI citations for a vaguely defined company that the model cannot resolve. You cannot ground retrieval in your content if the entity behind it is not recognized.
The full GEO playbook lives in the generative engine optimization pillar, which connects entity work to the retrieval, grounding, and measurement pieces.
Putting this on your audit checklist
For your next quarterly SEO audit, add an entity layer. Five concrete checks, in priority order:
- Wikidata item exists and is populated. If not, create or augment. Allow 60 days for it to propagate.
- `Organization` schema includes `sameAs` to at least 6 authoritative profiles, including Wikidata. Validate with the Rich Results Test on developers.google.com.
- Founder `Person` schema on About page, with `sameAs` to LinkedIn and any author profiles.
- Wikipedia article exists or notability path is documented. If notability is not yet met, the work is press and citations, not Wikipedia editing.
- Monthly LLM prompt panel running, with at least 12 months of trend data planned. The discipline of running it monthly is what turns entity work from one-off project into a measurable program.
The companies that will dominate AI search in 2027 are the ones doing this work in 2026. The training cutoffs are unforgiving, and the compounding favors whoever built their entity first.
For the deeper retrieval mechanics that decide whether your entity actually gets cited once recognized, read the generative engine optimization pillar next.
Related articles
Managing LLM Crawlers: GPTBot, ClaudeBot, Google-Extended
Eight LLM crawlers now hit your site. Some train, some retrieve, some do both. Blocking the wrong one costs you AI-channel visibility for nothing. Here's the matrix and the robots.txt that maps to it.
Optimizing for Perplexity: What Sources Get Cited
Perplexity citations don't follow Google's logic. Older domains, .edu and .gov bias, deeper retrieval, and a freshness signal that punishes thin update cycles. Here's the playbook for the second-largest answer engine.
Tracking Your Brand's Visibility in AI Answers
Five vendors now sell AI-answer visibility tracking. The metrics they report don't match. Here's the toolset, the metric definitions worth using, and a manual sampling protocol when budget rules out vendors.