Embedding
An embedding is a fixed-length numeric vector that represents a piece of text (or image, audio, code) inside a learned semantic space, where similar meanings sit close together. Typical dimensions are 768, 1536, or 3072. Embeddings are the atomic data unit behind semantic search, RAG, and recommendation systems.
Long definition
Take the sentence "best running shoes for flat feet" and pass it through an embedding model. The output is a list of, say, 1,536 floating-point numbers — coordinates in a high-dimensional space the model learned during training. "running shoes for fallen arches" lands geometrically close. "blue Italian leather loafers" lands far away. That spatial closeness, measured by cosine similarity or dot product, is what makes embeddings useful.
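The math is simple enough to sketch. Below, toy 4-dimensional vectors stand in for real 1,536-dimensional embeddings; the numbers are invented for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between the vectors: ~1.0 means same direction
    # (similar meaning), ~0.0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for real embeddings (values made up for illustration).
flat_feet     = np.array([0.8, 0.1, 0.3, 0.0])  # "best running shoes for flat feet"
fallen_arches = np.array([0.7, 0.2, 0.4, 0.1])  # "running shoes for fallen arches"
loafers       = np.array([0.0, 0.9, 0.1, 0.8])  # "blue Italian leather loafers"

print(cosine_similarity(flat_feet, fallen_arches))  # ~0.97: close in meaning
print(cosine_similarity(flat_feet, loafers))        # ~0.12: far apart
```

One practical detail: when a model returns unit-length vectors (OpenAI's do), the dot product and cosine similarity are the same number, so the normalization step can be skipped.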
Modern embedding models are produced by the major foundation-model labs and a few specialists. OpenAI offers text-embedding-3-small (1,536 dims) and text-embedding-3-large (3,072 dims). Cohere's Embed v3 family runs at 1,024 dims. Google's Gecko sits at 768 dims. Voyage AI and Mistral offer commercial alternatives, while BAAI's bge family anchors the open-source side, with bge-large-en-v1.5 (1,024 dims) a common self-hosted choice.
Two technical points matter:
- Dimension count is not quality. A well-trained 768-dim model can outperform a poorly trained 3,072-dim model on a given task. Bigger vectors also cost more storage and compute at search time.
- Embeddings are model-specific. A vector from text-embedding-3-small cannot be compared to a vector from bge-large. They live in different geometric spaces. Standardize on one model per index (see the generation sketch after this list).
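To make the model-specific point concrete, here is a sketch of generating a vector with OpenAI's Python SDK (assuming openai >= 1.0 and an API key in the environment) and tagging it with its source model so an index can reject mismatches:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["best running shoes for flat feet"],
)
vector = resp.data[0].embedding
print(len(vector))  # 1536

# Store the producing model next to every vector so the index
# never silently mixes incompatible geometric spaces.
record = {"model": "text-embedding-3-small", "embedding": vector}
```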
For SEO and content work, embeddings power three concrete things: semantic search inside your own site (retrieving relevant docs without exact keyword match), RAG pipelines (feeding retrieved docs into LLM prompts so answers stay grounded in your content), and content gap analysis (clustering query and content vectors to find missing topical coverage).
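All three uses share the same retrieval core: score every stored vector against a query vector and keep the top k. A minimal in-memory sketch, assuming unit-normalized vectors from a single model (random vectors stand in for real embeddings here):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5) -> np.ndarray:
    # With unit-normalized rows, one matrix-vector product yields
    # every cosine similarity at once.
    scores = doc_matrix @ query_vec
    return np.argsort(scores)[::-1][:k]  # indices of the k best matches

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 1536))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit-normalize rows
query = docs[42] + 0.05 * rng.normal(size=1536)      # near-duplicate of doc 42
query /= np.linalg.norm(query)
print(top_k(query, docs))  # doc 42 should rank first
```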
Generation is cheap: pennies per million tokens for small models. Storage at scale is where costs and architecture decisions live. A 1M-document index at 1,536 dims is roughly 6 GB of raw float32 vectors (1,000,000 × 1,536 × 4 bytes) before metadata, and that's where vector databases (pgvector, Pinecone, Weaviate, Qdrant) earn their keep.
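Once the index outgrows memory, the top-k query becomes a SQL ORDER BY. A minimal pgvector sketch, assuming Postgres with the pgvector extension and the psycopg 3 driver (database and table names are illustrative):

```python
import psycopg  # pip install "psycopg[binary]"

conn = psycopg.connect("dbname=search", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        id        bigserial PRIMARY KEY,
        content   text,           -- source text stored alongside the vector
        embedding vector(1536)    -- one model, one dimension, per index
    )
""")
# HNSW trades exact scans for fast approximate nearest-neighbor search.
conn.execute(
    "CREATE INDEX IF NOT EXISTS docs_hnsw ON docs "
    "USING hnsw (embedding vector_cosine_ops)"
)

def nearest(query_vec: list[float], k: int = 5):
    # <=> is pgvector's cosine-distance operator; the Python list serializes
    # to a '[...]' literal that casts cleanly to vector.
    return conn.execute(
        "SELECT id, content FROM docs ORDER BY embedding <=> %s::vector LIMIT %s",
        (str(query_vec), k),
    ).fetchall()
```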
Common misconceptions
- "Embeddings store the original text." They don't. The vector is a lossy semantic fingerprint. To retrieve text you also store the source document or its ID alongside the vector and look it up after the similarity search.
- "Higher dimensions always mean better quality." Diminishing returns kick in fast. Beyond ~1,500 dims, retrieval quality on most tasks plateaus while storage and search costs keep climbing linearly.
- "You can mix vectors from different models in one index." No. Vectors are only meaningful within the geometric space of the model that produced them. Cross-model comparison is undefined.
- "Embeddings replace keyword search." They complement it. Hybrid retrieval (lexical BM25 plus vector similarity, often with a re-ranker) consistently beats pure vector search on real-world query mixes — especially for proper nouns, exact codes, and rare terms.