Embeddings
How we turn discrete things — words, users, items — into dense vectors a network can reason about.
The Idea
A neural network can't directly consume the word "cat" or user ID #42. An embedding is a learned lookup table mapping each discrete token to a dense vector in ℝ^d.
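To make the "lookup table" view concrete, here is a minimal NumPy sketch. The table sizes and the random initial values are assumptions chosen for illustration; in a real model the rows are learned. It shows that indexing a row gives the same vector as multiplying a one-hot vector by the table.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 10, 4                      # tiny, made-up sizes for illustration
table = rng.normal(size=(num_tokens, dim))   # one row per token; learned in practice

token_id = 7
v_lookup = table[token_id]                   # embedding lookup = row indexing

one_hot = np.zeros(num_tokens)
one_hot[token_id] = 1.0
v_matmul = one_hot @ table                   # same vector via one-hot times matrix

print(np.allclose(v_lookup, v_matmul))       # True
```

Row indexing is preferred in practice because it avoids building and multiplying the mostly-zero one-hot vector.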
import torch
from torch import nn
embed = nn.Embedding(num_embeddings=50_000, embedding_dim=768)
v = embed(torch.tensor(42))  # token ID 42 -> vector of shape (768,)
Why Vectors?
Geometric structure encodes meaning. After training, semantically similar items end up close together. The classic example:
vec("king") − vec("man") + vec("woman") ≈ vec("queen")
Distance and direction become meaningful: you can do arithmetic on concepts.
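The arithmetic can be tried on a toy example. The 2-D vectors below are hand-made for illustration (hypothetical values, not taken from any trained model), and the helper names `cosine` and `analogy` are also assumptions. Excluding the query words from the candidates follows common practice for analogy tests.

```python
import numpy as np

# Hand-made 2-D toy vectors (hypothetical, not from a trained model).
# Axis 0 loosely means "royalty", axis 1 loosely means "gender".
vecs = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.7, 0.05]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vecs[a] - vecs[b] + vecs[c]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vecs[w]))

print(analogy("king", "man", "woman"))  # queen
```

With real embeddings the match is approximate, which is why the result is read off as the nearest neighbor rather than an exact equality.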
Where Embeddings Show Up
- Word embeddings — Word2Vec, GloVe, then learned end-to-end in transformers.
- Token embeddings — the input layer of every LLM.
- Positional embeddings — encode where a token is in a sequence.
- Item/user embeddings — recommender systems.
- Sentence/image embeddings — for retrieval, clustering, RAG.
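As one illustration of how two of these combine, the sketch below adds a positional embedding to each token embedding. It assumes learned absolute positions summed with token vectors, which is one common scheme among several; the table sizes and the function name `embed_sequence` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, dim = 100, 16, 8              # made-up sizes

token_table = rng.normal(size=(vocab_size, dim))   # one row per token ID
pos_table = rng.normal(size=(max_len, dim))        # one row per position

def embed_sequence(token_ids):
    """Sum token and position embeddings for a sequence of token IDs."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + pos_table[positions]

x = embed_sequence([5, 17, 5, 42])
print(x.shape)                   # (4, 8)
print(np.allclose(x[0], x[2]))   # False: same token, different positions
```

Token 5 appears at positions 0 and 2, and its two input vectors differ only by the positional term, which is how the network can tell the occurrences apart.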
Contrastive Learning
Train embeddings by pulling positive pairs together and pushing negatives apart:
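L = −log( exp(sim(a, a⁺)/τ) / Σ_b exp(sim(a, b)/τ) )
Here a is the anchor, a⁺ is its positive, the sum over b runs over the positive and all negatives, and τ is a temperature. This is how CLIP aligns images and captions, and how modern retrieval models (E5, BGE) are trained.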
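A minimal NumPy sketch of this loss follows. It assumes cosine similarity for sim and uses the other positives in the batch as negatives, which is a common setup but not the only one. It computes a single direction (anchor to positive), and the function name `info_nce` is an assumption for illustration.

```python
import numpy as np

def info_nce(anchors, positives, tau=0.1):
    """Mean of -log( exp(sim(a_i, p_i)/tau) / sum_j exp(sim(a_i, p_j)/tau) )."""
    # L2-normalize so the dot product is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / tau                              # (n, n) similarities / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize exp
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # positives sit on the diagonal

aligned = np.eye(4)                                       # 4 toy anchors, orthogonal
loss_good = info_nce(aligned, aligned)                    # each positive matches its anchor
loss_bad = info_nce(aligned, np.roll(aligned, 1, axis=0)) # positives paired with wrong anchors
print(loss_good < loss_bad)  # True
```

The loss is near zero when each anchor is closest to its own positive and grows when the pairing is scrambled, which is the gradient signal that pulls positives together and pushes negatives apart.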
Modern AI runs on embeddings. RAG, semantic search, recommendations, and multimodal models are all just clever uses of well-trained vector spaces.
An embedding layer is essentially:
The classic 'king − man + woman ≈ queen' result shows that:
What do positional embeddings encode in a transformer?
In a contrastive loss with temperature τ, lowering τ:
Which similarity metric is most commonly used to compare normalized embeddings?
How does CLIP learn its image-text embedding space?
For a vocabulary of 50,000 tokens with embedding dim 768, the lookup table has roughly: