Semantic HTML
Semantic HTML is the practice of using HTML elements that carry meaning about their content (`<article>`, `<section>`, `<nav>`, `<main>`, `<aside>`, `<header>`, `<footer>`) instead of generic `<div>` containers. Crawlers, screen readers, and LLM scrapers rely on these tags to extract structure.
Long definition
Semantic HTML is the difference between a page that's structured and a page that just looks structured. A `<div class="article">` renders the same as `<article>`, but the second carries machine-readable meaning that the first does not.
The core structural elements, per the WHATWG HTML spec:
- `<main>`: the page's primary content (one per page).
- `<article>`: a self-contained piece of content (a blog post, a product card, a forum reply).
- `<section>`: a thematic grouping inside the page; usually has a heading.
- `<nav>`: primary navigation links.
- `<aside>`: content tangentially related (sidebar, callout, related-posts).
- `<header>` and `<footer>`: header/footer for the page or for an `<article>`.
- `<figure>` / `<figcaption>`: images or media with a caption.
- `<time datetime="...">`: machine-readable dates.
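As a rough illustration of how these elements fit together, here is a minimal page skeleton. The titles, URLs, image path, and author name are placeholders invented for the example, not taken from any real site.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Example post</title>
</head>
<body>
  <!-- Page-level header holding the primary navigation -->
  <header>
    <nav aria-label="Primary">
      <a href="/">Home</a>
      <a href="/blog/">Blog</a>
    </nav>
  </header>

  <!-- The one <main> for the page -->
  <main>
    <!-- Self-contained content that would still make sense on its own -->
    <article>
      <header>
        <h1>Example post title</h1>
        <p>Published <time datetime="2024-01-15">15 January 2024</time></p>
      </header>

      <!-- Thematic grouping with its own heading -->
      <section>
        <h2>First topic</h2>
        <p>Body text for this part of the post.</p>
        <figure>
          <img src="/img/diagram.png" alt="Diagram of the page layout">
          <figcaption>The page layout described in this section.</figcaption>
        </figure>
      </section>

      <!-- Footer scoped to the article, not the page -->
      <footer>
        <p>Written by Example Author.</p>
      </footer>
    </article>
  </main>

  <!-- Tangential content kept outside the primary content -->
  <aside>
    <h2>Related posts</h2>
    <ul>
      <li><a href="/blog/another-post/">Another post</a></li>
    </ul>
  </aside>

  <!-- Page-level footer -->
  <footer>
    <p>Site-wide footer content.</p>
  </footer>
</body>
</html>
```

Placing `<aside>` as a sibling of `<main>` rather than inside it is one reasonable layout choice; the spec also allows an `<aside>` nested in an `<article>` when the aside relates only to that article.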
These tags are read by three audiences. Screen readers expose them as landmarks, so users can jump between `<nav>`, `<main>`, and `<aside>` from the keyboard. Search crawlers use them as content-vs-chrome signals; Google's documentation explicitly mentions `<main>` and `<article>` as helpful for primary-content extraction. LLM scrapers (GPTBot, ClaudeBot, PerplexityBot) extract `<article>` content as the canonical chunk for retrieval and quoting; divs are noisier and quoted less reliably.
Semantic HTML compounds with structured data. JSON-LD Article schema works better when there's an actual `<article>` element wrapping the content; Person and Organization schemas pair naturally with `<header>` and contact `<footer>` blocks.
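A minimal sketch of that pairing, assuming a simple blog post: the visible markup and the JSON-LD describe the same headline, date, and author. All values are placeholders, and putting the `<script>` inside the `<article>` is just one option; many sites emit it in `<head>` instead.

```html
<article>
  <header>
    <h1>Example post title</h1>
    <p>
      By Example Author,
      <time datetime="2024-01-15">15 January 2024</time>
    </p>
  </header>

  <p>Body text of the post.</p>

  <!-- Structured data mirroring what the article element already shows -->
  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example post title",
    "datePublished": "2024-01-15",
    "author": {
      "@type": "Person",
      "name": "Example Author"
    }
  }
  </script>
</article>
```

Keeping the `datetime` attribute and `datePublished` identical, and the `<h1>` text and `headline` identical, means the markup and the schema cannot drift apart without it being easy to spot.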
The cost of semantic HTML is roughly zero — it's a tag substitution. The cost of not using it is silent: lower content-extraction confidence, weaker accessibility, less reliable AI quoting. For new templates, default to semantic. For legacy templates, the migration is mostly mechanical.
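To show what "mostly mechanical" can look like, here is a hypothetical before-and-after for a legacy template. The class names are invented for the example; keeping them on the new elements is an assumption that lets existing class-based CSS keep matching.

```html
<!-- Before: generic containers; meaning lives only in class names -->
<div class="main">
  <div class="article">
    <div class="title">Example post title</div>
    <div class="body">Body text of the post.</div>
  </div>
</div>

<!-- After: same content and class hooks, semantic elements -->
<main class="main">
  <article class="article">
    <h1 class="title">Example post title</h1>
    <p class="body">Body text of the post.</p>
  </article>
</main>
```

The non-mechanical part is usually the stylesheet: selectors tied to the element type (for example `div.article`) stop matching, and elements such as `<h1>` and `<p>` bring default margins and font sizes that a bare `<div>` does not.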
Common misconceptions
- "Semantic HTML is a direct ranking factor." It isn't a ranking factor in its own right. It improves the inputs to ranking (content extraction, accessibility, structured-data validity), which in turn improve the signals algorithms read.
- "Divs are fine if I add ARIA roles." ARIA is a fallback when semantic HTML can't express the role. The first rule in the W3C's "Using ARIA" guidance is to use a native HTML element whenever one already provides the semantics and behavior you need. `<button>` beats `<div role="button">` every time.
- "`<section>` and `<div>` are interchangeable." A `<section>` should have a heading and represent a thematic chunk. A `<div>` is a generic styling box with no meaning. Mixing them up produces broken document outlines.
- "LLMs ignore HTML structure and just read text." Modern LLM scrapers parse the DOM and prefer semantic landmarks. Sites with clean `<article>` markup get cleaner quotations in AI Overviews and Perplexity citations.
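Two of these points are easier to see in markup. The sketch below is illustrative only: the ids, labels, and the `save` function are hypothetical. It contrasts a `<div role="button">` with a native `<button>`, then shows a `<section>` and a `<div>` doing different jobs.

```html
<!-- ARIA version: works, but focus and keyboard behavior are rebuilt by hand -->
<div role="button" tabindex="0" id="save-div">Save</div>
<script>
  const save = () => console.log("saved");
  const fakeButton = document.getElementById("save-div");
  fakeButton.addEventListener("click", save);
  // A div does not respond to Enter or Space unless this is wired manually.
  fakeButton.addEventListener("keydown", (event) => {
    if (event.key === "Enter" || event.key === " ") {
      event.preventDefault(); // stop Space from scrolling the page
      save();
    }
  });
</script>

<!-- Native version: focusable, keyboard-operable, announced as a button -->
<button type="button" id="save-button">Save</button>
<script>
  document.getElementById("save-button").addEventListener("click", save);
</script>

<!-- <section> marks a thematic chunk with a heading; <div> is only a layout hook -->
<section>
  <h2>Pricing</h2>
  <div class="card-grid">
    <p>Plan details go here.</p>
  </div>
</section>
```

The native `<button>` needs one line of script to do what the `<div>` needs a dozen for, and it also picks up behavior the sketch does not rebuild, such as the `disabled` attribute and form submission.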