Word Embeddings Explained: From One-Hot to Dense Vectors
Learn how word embeddings transform words into meaningful numerical vectors. Understand one-hot encoding, learned embeddings, Word2Vec, GloVe, and when to use pretrained vs learned embeddings.
Computers don't understand words; they only understand numbers. So how do we teach a machine learning model about language? The answer is word embeddings: mathematical representations that capture the meaning of words as dense numerical vectors.
In this guide, we'll explore how word embeddings work, why they're so powerful, and how to use them effectively in your NLP projects.
Before we can process text with neural networks, we need to convert words to numbers. The naive approach is one-hot encoding:
# Vocabulary: ["cat", "dog", "bird", "fish"]
# One-hot encoding:
cat = [1, 0, 0, 0]
dog = [0, 1, 0, 0]
bird = [0, 0, 1, 0]
fish = [0, 0, 0, 1]
# Each word is a sparse vector with:
# - Length = vocabulary size
# - Exactly one 1, rest are 0s

One-hot encoding has serious problems:

- Huge dimensionality: For a vocabulary of 10,000 words, each word is a 10,000-dimensional vector!
- No semantic meaning: "cat" and "dog" are just as different from each other as "cat" and "democracy"
- Sparse: 99.99% of values are zeros, wasting memory and computation
- No relationships: Can't capture that "king" - "man" + "woman" ≈ "queen"
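To make the dimensionality and sparsity problems concrete, here is a quick sketch with toy numbers (plain NumPy, nothing from the article's later code):

```python
import numpy as np

vocab_size = 10_000

# A full one-hot vocabulary is just the identity matrix: one row per word
one_hot = np.eye(vocab_size, dtype=np.float32)

# Memory cost: every entry takes 4 bytes, even the zeros
memory_mb = one_hot.nbytes / 1024 ** 2

# Fraction of entries that are zero
sparsity = 1 - vocab_size / one_hot.size

print(f"{memory_mb:.0f} MB, {sparsity:.2%} zeros")  # ~381 MB, 99.99% zeros
```

Hundreds of megabytes just to represent 10,000 words, and almost all of it is zeros.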
# One-hot vectors are all orthogonal (perpendicular)
import numpy as np
cat = np.array([1, 0, 0, 0])
dog = np.array([0, 1, 0, 0])
# Cosine similarity (measures how similar vectors are)
similarity = np.dot(cat, dog) / (np.linalg.norm(cat) * np.linalg.norm(dog))
print(f"Similarity between cat and dog: {similarity}") # 0.0
# All words are equally dissimilar!

Word embeddings solve these problems by representing words as dense, low-dimensional vectors where similar words have similar vectors.
# Word embeddings (typically 50-300 dimensions)
cat = [0.2, -0.5, 0.8, 0.1, -0.3, ...] # 128-dim vector
dog = [0.3, -0.4, 0.7, 0.2, -0.2, ...] # Similar to cat!
bird = [0.1, -0.6, 0.9, 0.0, -0.4, ...] # Also similar (animals)
car = [-0.8, 0.3, -0.2, 0.9, 0.5, ...] # Very different
# Now we can measure semantic similarity:
similarity(cat, dog) = 0.92 # Very similar!
similarity(cat, bird) = 0.78 # Somewhat similar
similarity(cat, car) = 0.15  # Not similar

The benefits of word embeddings:

- Semantic similarity: Similar words have similar vectors
- Dimensionality reduction: 10,000 words → 128-300 dimensions
- Dense: All values are meaningful (no zeros)
- Learned relationships: Captures analogies and relationships
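The similarity scores above are illustrative. With real vectors they come from cosine similarity, the same formula used earlier for one-hot vectors. A toy sketch with made-up 3-dimensional vectors (real embeddings have far more dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy dense vectors, invented for illustration
cat = np.array([0.2, -0.5, 0.8])
dog = np.array([0.3, -0.4, 0.7])
car = np.array([-0.8, 0.3, -0.2])

print(f"cat vs dog: {cosine_similarity(cat, dog):.2f}")  # 0.99
print(f"cat vs car: {cosine_similarity(cat, car):.2f}")  # -0.56
```

Unlike one-hot vectors, dense vectors can point in similar or opposite directions, so the similarity score actually carries meaning.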
There are two main approaches to creating word embeddings:
The first is to train embeddings from scratch as part of your model. The embeddings learn to be useful for your specific task.
import torch.nn as nn
# Create an embedding layer
vocab_size = 10000
embed_dim = 128
embedding = nn.Embedding(
    num_embeddings=vocab_size,  # vocabulary size
    embedding_dim=embed_dim     # vector dimension
)
# Initially, embeddings are random
print(embedding.weight[0]) # Random values
# During training, backpropagation updates these vectors
# to be useful for your task (e.g., sentiment classification)

How it works:
# Example: Sentiment classification
# Input: "This movie is great"
# Word indices: [45, 892, 12, 234]
# Step 1: Look up embeddings
word_indices = torch.tensor([45, 892, 12, 234])
embeddings = embedding(word_indices)
# Shape: (4, 128) - 4 words, each is 128-dim vector
# Step 2: Process with neural network
output = model(embeddings)
# Step 3: Compute loss
loss = criterion(output, label)
# Step 4: Backpropagation updates embedding weights!
loss.backward()
optimizer.step()
# The embedding for "great" learns to have positive sentiment

The second approach is to use embeddings trained on massive text corpora (like Wikipedia). These capture general language knowledge.
Popular pretrained embeddings:
- Word2Vec (Google, 2013): Trained on Google News
- GloVe (Stanford, 2014): Trained on Wikipedia + web text
- FastText (Facebook, 2016): Handles out-of-vocabulary words
Word2Vec is based on a simple but powerful idea: "You shall know a word by the company it keeps" (J.R. Firth, 1957).
Words that appear in similar contexts should have similar meanings.
Given a word, predict its surrounding words (context).
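Generating these (center word, context word) training pairs is simple to sketch. The `skipgram_pairs` helper below is hypothetical, written just for illustration, using a window of one word on each side:

```python
def skipgram_pairs(tokens, window=1):
    """Yield (center, context) training pairs for a skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        # Look `window` words to the left and right of the center word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens)
print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Real Word2Vec implementations use larger windows (typically 5) and additional tricks like subsampling frequent words, but the core pair generation looks like this.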
# Training data from text:
# "The cat sat on the mat"
# Create training pairs (center word -> context words):
# Input: "cat" -> Output: ["The", "sat"]
# Input: "sat" -> Output: ["cat", "on"]
# Input: "on" -> Output: ["sat", "the"]
# Input: "the" -> Output: ["on", "mat"]
# The model learns:
# - "cat" and "mat" appear in similar contexts
# - So their embeddings should be similar!

Word2Vec embeddings capture amazing semantic relationships:
# Vector arithmetic works!
king - man + woman ≈ queen
# In code (find_nearest_word stands in for a nearest-neighbor lookup):
result = embedding['king'] - embedding['man'] + embedding['woman']
nearest = find_nearest_word(result)
print(nearest) # "queen"
# Other examples:
Paris - France + Italy ≈ Rome
bigger - big + small ≈ smaller
walking - walk + swim ≈ swimming

GloVe (Global Vectors for Word Representation) takes a different approach: it uses global word co-occurrence statistics.
# Step 1: Build co-occurrence matrix
# Count how often words appear together in a context window
#        ice  steam  solid  gas  water
# ice      0      0      5    0      3
# steam    0      0      0    4      2
# solid    5      0      0    0      1
# gas      0      4      0    0      1
# water    3      2      1    1      0
# Step 2: Train embeddings to reconstruct this matrix
# Goal: dot(embedding[i], embedding[j]) ≈ log(co_occurrence[i,j])

GloVe combines the benefits of:
- Global statistics (like LSA/SVD methods)
- Local context (like Word2Vec)
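Counting co-occurrences like the matrix above can be sketched in a few lines. The `cooccurrence_counts` helper is hypothetical, for illustration, using a toy corpus and a symmetric window of 1:

```python
from collections import Counter

def cooccurrence_counts(corpus, window=1):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["ice is solid", "steam is gas", "ice melts into water"]
counts = cooccurrence_counts(corpus)
print(counts[("ice", "is")])     # 1: adjacent in the first sentence
print(counts[("steam", "gas")])  # 0: never within the window
```

GloVe builds this matrix over billions of tokens with a larger, distance-weighted window, then trains embeddings to reconstruct the log counts.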
import torch
import torch.nn as nn
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, n_classes):
        super().__init__()
        # Learnable embedding layer
        self.embedding = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=embed_dim,
            padding_idx=0  # Don't update padding token
        )
        # Rest of model...
        self.fc = nn.Linear(embed_dim, n_classes)

    def forward(self, x):
        # x: word indices, shape (batch, seq_len)
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        # Average pooling
        pooled = embedded.mean(dim=1)  # (batch, embed_dim)
        return self.fc(pooled)
# During training, embeddings are updated via backprop
model = TextClassifier(vocab_size=10000, embed_dim=128, n_classes=2)
optimizer = torch.optim.Adam(model.parameters())
for batch_x, batch_y in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(batch_x), batch_y)
    loss.backward()   # Computes gradients, including for embedding weights
    optimizer.step()  # Updates embedding weights!

# Load pretrained GloVe embeddings
import numpy as np
def load_glove_embeddings(glove_file, word2idx, embed_dim=100):
    """
    Load GloVe embeddings and create embedding matrix
    """
    # Words missing from GloVe keep small random vectors
    embeddings = np.random.randn(len(word2idx), embed_dim) * 0.01
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            if word in word2idx:
                idx = word2idx[word]
                embeddings[idx] = vector
    return embeddings
# Load embeddings
pretrained_embeddings = load_glove_embeddings(
    'glove.6B.100d.txt',
    word2idx,
    embed_dim=100
)
# Initialize embedding layer with pretrained weights
embedding = nn.Embedding(vocab_size, embed_dim)
embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))
# Option A: Freeze embeddings (don't update during training)
embedding.weight.requires_grad = False
# Option B: Fine-tune embeddings (update during training)
embedding.weight.requires_grad = True

When should you freeze, and when should you fine-tune?

| Scenario | Recommendation | Reason |
|---|---|---|
| Small dataset (< 10K) | Freeze | Not enough data to learn good embeddings |
| Large dataset (> 100K) | Fine-tune | Can adapt embeddings to your task |
| Domain-specific (medical, legal) | Fine-tune | Pretrained embeddings may not fit your domain |
| General domain | Freeze or fine-tune | Both work well |
| Limited compute | Freeze | Fewer parameters to update = faster training |
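A common middle ground, not covered in the table, is to fine-tune embeddings with a smaller learning rate than the rest of the model using optimizer parameter groups. A hedged sketch, assuming a hypothetical two-part model:

```python
import torch
import torch.nn as nn

# Hypothetical model layout: an embedding table plus a classifier head
model = nn.ModuleDict({
    "embedding": nn.Embedding(10_000, 128),
    "fc": nn.Linear(128, 2),
})

# Embeddings get a 10x smaller learning rate than the classifier head,
# so pretrained knowledge changes slowly while the head adapts quickly
optimizer = torch.optim.Adam([
    {"params": model["embedding"].parameters(), "lr": 1e-4},
    {"params": model["fc"].parameters(), "lr": 1e-3},
])

print([group["lr"] for group in optimizer.param_groups])
```

This keeps pretrained vectors close to their initial values while still letting them drift toward your task.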
What happens when you encounter a word not in your vocabulary?
# Reserve index 1 for unknown words
word2idx = {"<PAD>": 0, "<UNK>": 1, "cat": 2, "dog": 3, ...}
def encode_word(word, word2idx):
    return word2idx.get(word, 1)  # Return 1 if word not found
# Example
print(encode_word("cat", word2idx)) # 2
print(encode_word("elephant", word2idx)) # 1 (unknown)
print(encode_word("xyzabc", word2idx)) # 1 (unknown)

FastText represents words as bags of character n-grams, allowing it to generate embeddings for unseen words.
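The decomposition itself is easy to sketch. The `char_ngrams` helper below is hypothetical, written for illustration, with `<` and `>` marking word boundaries as FastText does:

```python
def char_ngrams(word, n=3):
    """Break a word into character n-grams, with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']

# Morphological variants share n-grams even if one is out-of-vocabulary:
shared = set(char_ngrams("playing")) & set(char_ngrams("played"))
print(shared)  # {'<pl', 'pla', 'lay'} (set order may vary)
```

FastText actually uses a range of n-gram sizes (3 to 6 by default) and sums the n-gram vectors to form the word vector.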
# FastText breaks words into character n-grams
# Word: "playing"
# 3-grams: ["<pl", "pla", "lay", "ayi", "yin", "ing", "ng>"]
# Even if "playing" is unseen, we can combine embeddings of its n-grams
# This works especially well for:
# - Morphological variations: play, playing, played, player
# - Typos: plaing, playng
# - Rare words: uncommon words share n-grams with common ones

Choosing the right embedding dimension is important:
| Dimension | Use Case | Pros | Cons |
|---|---|---|---|
| 50-100 | Small datasets, simple tasks | Fast, less overfitting | May not capture complex semantics |
| 128-256 | Most NLP tasks (recommended) | Good balance; standard choice | Few drawbacks for typical tasks |
| 300-512 | Large datasets, complex tasks | Rich representations | More parameters, slower |
| 768+ | Transformer models (BERT, etc.) | State-of-the-art | Very expensive |
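The cost column in the table follows directly from parameter count: the embedding table alone holds vocab_size × embed_dim weights. A quick back-of-the-envelope sketch (the helper is just for illustration):

```python
def embedding_table_size(vocab_size, embed_dim):
    """Number of weights in the embedding table alone."""
    return vocab_size * embed_dim

for dim in (64, 128, 300, 768):
    params = embedding_table_size(10_000, dim)
    # float32 = 4 bytes per parameter
    print(f"dim={dim}: {params:,} params, {params * 4 / 1024**2:.1f} MB")
```

For a 10,000-word vocabulary at 128 dimensions, that is 1.28 million parameters before the rest of the model even starts.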
# Rule of thumb: Start with 128 or 256
embed_dim = 128 # Good default
# For very large vocabularies (100K+), you might need more
embed_dim = 256
# For small vocabularies (< 5K), you can use less
embed_dim = 64

Embeddings are high-dimensional, but we can visualize them using dimensionality reduction:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Get embeddings for some words
words = ['king', 'queen', 'man', 'woman', 'cat', 'dog', 'car', 'truck']
embeddings = np.array([embedding[word] for word in words])
# Reduce to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings)
# Plot
plt.figure(figsize=(10, 8))
for i, word in enumerate(words):
    x, y = embeddings_2d[i]
    plt.scatter(x, y)
    plt.annotate(word, (x, y), fontsize=12)
plt.title('Word Embeddings Visualization')
plt.show()
# You'll see:
# - king, queen, man, woman cluster together (royalty/gender)
# - cat, dog cluster together (animals)
# - car, truck cluster together (vehicles)

Watch out for these common pitfalls.

# ❌ Wrong: Padding tokens get updated
embedding = nn.Embedding(vocab_size, embed_dim)
# Padding tokens (index 0) will have non-zero gradients!
# ✅ Correct: Padding tokens stay at zero
embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
# Padding tokens don't contribute to loss or gradients

# ❌ Wrong: Building vocab from all data (including test)
vocab = build_vocab(train_texts + val_texts + test_texts)
# This is data leakage!
# ✅ Correct: Build vocab only from training data
vocab = build_vocab(train_texts)
# Val and test may have unknown words - that's okay!

# ❌ Wrong: Including every word (even rare ones)
vocab_size = 100000 # Way too large!
# Most words appear only once or twice
# ✅ Correct: Filter by frequency
min_freq = 2 # Only keep words that appear at least twice
max_vocab = 10000 # Cap vocabulary size
# This reduces:
# - Model size (fewer embedding parameters)
# - Overfitting (rare words are noise)
# - Training time

Learned vs pretrained embeddings at a glance:

| Aspect | Learned Embeddings | Pretrained Embeddings |
|---|---|---|
| Training data needed | 10K+ examples | Can work with 1K+ |
| Domain adaptation | Automatic | May need fine-tuning |
| Training time | Longer | Shorter (embeddings fixed) |
| Performance | Better with lots of data | Better with little data |
| Vocabulary coverage | Only your data | Millions of words |
| Unknown words | Many | Fewer |
| Use case | Large datasets, specific domains | Small datasets, general domains |
Traditional word embeddings have one limitation: each word has a single embedding, regardless of context.
# Problem: "bank" has the same embedding in both sentences
sentence1 = "I went to the bank to deposit money" # financial institution
sentence2 = "I sat on the river bank" # riverside
# Traditional embeddings:
embedding['bank'] # Same vector for both!
# Contextual embeddings (BERT, ELMo):
bert_embedding(sentence1, word='bank') # Different vector!
bert_embedding(sentence2, word='bank') # Different vector!

Modern models like BERT, GPT, and RoBERTa use contextual embeddings that change based on surrounding words. However, they're much more expensive to train and use.
- Start with learned embeddings (128-256 dimensions)
- Use padding_idx=0 for padding tokens
- Filter vocabulary by frequency (min_freq=2)
- Cap vocabulary size (max_vocab=10000-20000)
- Try pretrained embeddings first (GloVe, FastText)
- Fine-tune if you have enough data (> 50K examples)
- Use FastText for handling unknown words
- Consider contextual embeddings (BERT) for state-of-the-art results
Word embeddings are a fundamental building block of modern NLP. They transform discrete words into continuous vectors that capture semantic meaning, enabling neural networks to understand language.
Key takeaways:
- One-hot encoding is inefficient and doesn't capture semantics
- Word embeddings are dense, low-dimensional vectors that capture meaning
- Similar words have similar embeddings (cosine similarity)
- Learned embeddings adapt to your task but need more data
- Pretrained embeddings (Word2Vec, GloVe) work well with less data
- Contextual embeddings (BERT) are state-of-the-art but expensive
Understanding word embeddings is crucial for any NLP practitioner. They're the foundation upon which more complex models like RNNs, LSTMs, and Transformers are built.
Related Articles
Text Preprocessing and Tokenization for NLP: A Complete Guide
Master text preprocessing and tokenization for NLP. Learn about vocabulary building, padding, truncation, and handling variable-length sequences in deep learning models.
BiLSTM for Text Classification: Understanding Sequential Deep Learning
Learn how Bidirectional LSTM networks process text sequentially to capture context, word order, and meaning. A complete guide to building your first sequence model for NLP.
From Words to Intelligence: Building an MLP Classifier on Pretrained Sentence Embeddings
A deep dive into pretrained sentence embeddings, MLP architecture, BatchNorm, Dropout, Adam, and early stopping, with a full PyTorch implementation.