Word Embeddings Explained: From One-Hot to Dense Vectors

Learn how word embeddings transform words into meaningful numerical vectors. Understand one-hot encoding, learned embeddings, Word2Vec, GloVe, and when to use pretrained vs learned embeddings.

14 min read

Computers don't understand words; they only understand numbers. So how do we teach a machine learning model about language? The answer is word embeddings: mathematical representations that capture the meaning of words as dense numerical vectors.

In this guide, we'll explore how word embeddings work, why they're so powerful, and how to use them effectively in your NLP projects.

Before we can process text with neural networks, we need to convert words to numbers. The naive approach is one-hot encoding:

python
# Vocabulary: ["cat", "dog", "bird", "fish"]

# One-hot encoding:
cat  = [1, 0, 0, 0]
dog  = [0, 1, 0, 0]
bird = [0, 0, 1, 0]
fish = [0, 0, 0, 1]

# Each word is a sparse vector with:
# - Length = vocabulary size
# - Exactly one 1, rest are 0s

One-hot encoding has serious problems:

  1. Huge dimensionality: For a vocabulary of 10,000 words, each word is a 10,000-dimensional vector!
  2. No semantic meaning: "cat" and "dog" are as different from each other as "cat" and "democracy"
  3. Sparse: 99.99% of values are zeros, wasting memory and computation
  4. No relationships: Can't capture that "king" - "man" + "woman" ≈ "queen"
python
# One-hot vectors are all orthogonal (perpendicular)
import numpy as np

cat = np.array([1, 0, 0, 0])
dog = np.array([0, 1, 0, 0])

# Cosine similarity (measures how similar vectors are)
similarity = np.dot(cat, dog) / (np.linalg.norm(cat) * np.linalg.norm(dog))
print(f"Similarity between cat and dog: {similarity}")  # 0.0

# All words are equally dissimilar!

The Core Problem: One-hot encoding treats all words as completely independent, ignoring the rich semantic relationships in language.

Word embeddings solve these problems by representing words as dense, low-dimensional vectors where similar words have similar vectors.

python
# Word embeddings (typically 50-300 dimensions)
cat  = [0.2, -0.5, 0.8, 0.1, -0.3, ...]  # 128-dim vector
dog  = [0.3, -0.4, 0.7, 0.2, -0.2, ...]  # Similar to cat!
bird = [0.1, -0.6, 0.9, 0.0, -0.4, ...]  # Also similar (animals)
car  = [-0.8, 0.3, -0.2, 0.9, 0.5, ...]  # Very different

# Now we can measure semantic similarity:
similarity(cat, dog)  = 0.92  # Very similar!
similarity(cat, bird) = 0.78  # Somewhat similar
similarity(cat, car)  = 0.15  # Not similar

Key advantages:

  1. Semantic similarity: Similar words have similar vectors
  2. Dimensionality reduction: 10,000 words → 128-300 dimensions
  3. Dense: All values are meaningful (no zeros)
  4. Learned relationships: Captures analogies and relationships
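To see the contrast with the one-hot case, here is the same cosine-similarity check applied to dense vectors. The values below are made up for illustration (real embeddings have 50-300 dimensions and are learned, not hand-picked):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings"
cat = np.array([0.2, -0.5, 0.8])
dog = np.array([0.3, -0.4, 0.7])
car = np.array([-0.8, 0.3, -0.2])

print(cosine_similarity(cat, dog))  # high: vectors point the same way
print(cosine_similarity(cat, car))  # negative: vectors point apart
```

Unlike the one-hot case, where every pair of distinct words scores exactly 0, dense vectors give a graded notion of similarity.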

There are two main approaches to creating word embeddings:

The first approach is learned embeddings: train embeddings from scratch as part of your model, so they learn to be useful for your specific task.

python
import torch.nn as nn

# Create an embedding layer
vocab_size = 10000
embed_dim = 128

embedding = nn.Embedding(
    num_embeddings=vocab_size,  # vocabulary size
    embedding_dim=embed_dim      # vector dimension
)

# Initially, embeddings are random
print(embedding.weight[0])  # Random values

# During training, backpropagation updates these vectors
# to be useful for your task (e.g., sentiment classification)

How it works:

python
# Example: Sentiment classification
# Input: "This movie is great"
# Word indices: [45, 892, 12, 234]

# Step 1: Look up embeddings
word_indices = torch.tensor([45, 892, 12, 234])
embeddings = embedding(word_indices)
# Shape: (4, 128) - 4 words, each is 128-dim vector

# Step 2: Process with neural network
output = model(embeddings)

# Step 3: Compute loss
loss = criterion(output, label)

# Step 4: Backpropagation updates embedding weights!
loss.backward()
optimizer.step()

# The embedding for "great" learns to have positive sentiment

The second approach is pretrained embeddings: use embeddings trained on massive text corpora (like Wikipedia). These capture general language knowledge.

Popular pretrained embeddings:

  • Word2Vec (Google, 2013): Trained on Google News
  • GloVe (Stanford, 2014): Trained on Wikipedia + web text
  • FastText (Facebook, 2016): Handles out-of-vocabulary words

Word2Vec is based on a simple but powerful idea: "You shall know a word by the company it keeps" (J.R. Firth, 1957).

Words that appear in similar contexts should have similar meanings.

The most common Word2Vec variant, skip-gram, works like this: given a word, predict its surrounding words (context).

python
# Training data from text:
# "The cat sat on the mat"

# Create training pairs (center word -> context words):
# Input: "cat"  -> Output: ["The", "sat"]
# Input: "sat"  -> Output: ["cat", "on"]
# Input: "on"   -> Output: ["sat", "the"]
# Input: "the"  -> Output: ["on", "mat"]

# The model learns:
# - "cat" and "mat" appear in similar contexts
# - So their embeddings should be similar!
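The pair-generation step above can be sketched in a few lines of Python (window size 1, as in the example; `make_pairs` is a name chosen here for illustration):

```python
def make_pairs(tokens, window=1):
    """Generate (center word, context words) training pairs for skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context = up to `window` words on each side of the center word
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((center, context))
    return pairs

tokens = "The cat sat on the mat".split()
for center, context in make_pairs(tokens):
    print(center, "->", context)
# cat -> ['The', 'sat'], sat -> ['cat', 'on'], on -> ['sat', 'the'], ...
```

Real Word2Vec implementations use a larger window (typically 5) and sample pairs from billions of sentences.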

Word2Vec embeddings capture amazing semantic relationships:

python
# Vector arithmetic works!
king - man + woman ≈ queen

# In code:
result = embedding['king'] - embedding['man'] + embedding['woman']
nearest = find_nearest_word(result)
print(nearest)  # "queen"

# Other examples:
Paris - France + Italy ≈ Rome
bigger - big + small ≈ smaller
walking - walk + swim ≈ swimming

Mind-Blowing Fact: These relationships emerge automatically from training on text! The model was never explicitly told that "king" relates to "man" like "queen" relates to "woman".
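A minimal sketch of how `find_nearest_word` could work, using hand-constructed toy vectors chosen so the analogy holds exactly (real embeddings are learned and the match is only approximate):

```python
import numpy as np

# Hand-made toy vectors: dimensions roughly mean (male, female, royal)
vectors = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
}

def find_nearest_word(query, vectors, exclude=()):
    """Return the word whose vector is most cosine-similar to `query`."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(query, candidates[w]))

result = vectors["king"] - vectors["man"] + vectors["woman"]  # [0, 1, 1]
print(find_nearest_word(result, vectors, exclude={"king", "man", "woman"}))  # queen
```

Excluding the query words is standard practice, since the result vector is usually closest to one of its own inputs.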

GloVe (Global Vectors for Word Representation) takes a different approach: it uses global word co-occurrence statistics.

python
# Step 1: Build co-occurrence matrix
# Count how often words appear together in a context window

#           ice    steam   solid   gas    water
# ice       0      0       5       0      3
# steam     0      0       0       4      2
# solid     5      0       0       0      1
# gas       0      4       0       0      1
# water     3      2       1       1      0

# Step 2: Train embeddings to reconstruct this matrix
# Goal: dot(embedding[i], embedding[j]) ≈ log(co_occurrence[i,j])

GloVe combines the benefits of:

  • Global statistics (like LSA/SVD methods)
  • Local context (like Word2Vec)
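The co-occurrence counting in step 1 can be sketched like this (symmetric window of 1; `build_cooc` is a name used here for illustration, and the toy corpus is made up):

```python
from collections import defaultdict

def build_cooc(sentences, window=1):
    """Count how often each ordered pair of words appears within `window` positions."""
    cooc = defaultdict(int)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    cooc[(w, tokens[j])] += 1
    return cooc

corpus = [["ice", "is", "solid"], ["steam", "is", "gas"]]
cooc = build_cooc(corpus)
print(cooc[("ice", "is")])   # 1
print(cooc[("is", "gas")])   # 1
```

GloVe then fits embeddings so that dot products reproduce the log of these counts, weighting frequent pairs more heavily.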

Here's how a learned embedding layer fits into a simple classifier:

python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, n_classes):
        super().__init__()
        
        # Learnable embedding layer
        self.embedding = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=embed_dim,
            padding_idx=0  # Don't update padding token
        )
        
        # Rest of model...
        self.fc = nn.Linear(embed_dim, n_classes)
    
    def forward(self, x):
        # x: word indices, shape (batch, seq_len)
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        
        # Average pooling
        pooled = embedded.mean(dim=1)  # (batch, embed_dim)
        
        return self.fc(pooled)

# During training, embeddings are updated via backprop
model = TextClassifier(vocab_size=10000, embed_dim=128, n_classes=2)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

for batch_x, batch_y in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(batch_x), batch_y)
    loss.backward()  # Updates embedding weights!
    optimizer.step()

To use pretrained embeddings instead, load them into your embedding layer:

python
# Load pretrained GloVe embeddings
import numpy as np

def load_glove_embeddings(glove_file, word2idx, embed_dim=100):
    """
    Load GloVe embeddings and create embedding matrix
    """
    embeddings = np.random.randn(len(word2idx), embed_dim) * 0.01
    
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            
            if word in word2idx:
                idx = word2idx[word]
                embeddings[idx] = vector
    
    return embeddings

# Load embeddings
pretrained_embeddings = load_glove_embeddings(
    'glove.6B.100d.txt',
    word2idx,
    embed_dim=100
)

# Initialize embedding layer with pretrained weights
embedding = nn.Embedding(vocab_size, embed_dim)
embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))

# Option A: Freeze embeddings (don't update during training)
embedding.weight.requires_grad = False

# Option B: Fine-tune embeddings (update during training)
embedding.weight.requires_grad = True

| Scenario | Recommendation | Reason |
|----------|----------------|--------|
| Small dataset (< 10K) | Freeze | Not enough data to learn good embeddings |
| Large dataset (> 100K) | Fine-tune | Can adapt embeddings to your task |
| Domain-specific (medical, legal) | Fine-tune | Pretrained embeddings may not fit your domain |
| General domain | Freeze or fine-tune | Both work well |
| Limited compute | Freeze | Fewer parameters to update = faster training |

What happens when you encounter a word not in your vocabulary?

python
# Reserve index 1 for unknown words
word2idx = {"<PAD>": 0, "<UNK>": 1, "cat": 2, "dog": 3, ...}

def encode_word(word, word2idx):
    return word2idx.get(word, 1)  # Return 1 if word not found

# Example
print(encode_word("cat", word2idx))      # 2
print(encode_word("elephant", word2idx)) # 1 (unknown)
print(encode_word("xyzabc", word2idx))   # 1 (unknown)

FastText represents words as bags of character n-grams, allowing it to generate embeddings for unseen words.

python
# FastText breaks words into character n-grams
# Word: "playing"
# 3-grams: ["<pl", "pla", "lay", "ayi", "yin", "ing", "ng>"]

# Even if "playing" is unseen, we can combine embeddings of its n-grams
# This works especially well for:
# - Morphological variations: play, playing, played, player
# - Typos: plaing, playng
# - Rare words: uncommon words share n-grams with common ones
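The n-gram extraction above can be reproduced with a short helper (`char_ngrams` is a name chosen here; real FastText additionally hashes n-grams into a fixed number of buckets):

```python
def char_ngrams(word, n=3):
    """Split a word into character n-grams, with < and > marking word boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']

# Morphological variants share n-grams, so their embeddings end up related:
print(set(char_ngrams("playing")) & set(char_ngrams("played")))
# {'<pl', 'pla', 'lay'}
```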

Choosing the right embedding dimension is important:

| Dimension | Use Case | Pros | Cons |
|-----------|----------|------|------|
| 50-100 | Small datasets, simple tasks | Fast, less overfitting | May not capture complex semantics |
| 128-256 | Most NLP tasks (recommended) | Good balance | Standard choice |
| 300-512 | Large datasets, complex tasks | Rich representations | More parameters, slower |
| 768+ | Transformer models (BERT, etc.) | State-of-the-art | Very expensive |
python
# Rule of thumb: Start with 128 or 256
embed_dim = 128  # Good default

# For very large vocabularies (100K+), you might need more
embed_dim = 256

# For small vocabularies (< 5K), you can use less
embed_dim = 64
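The main cost of a larger dimension is the embedding table itself, which has vocab_size × embed_dim weights. A quick back-of-the-envelope check:

```python
def embedding_params(vocab_size, embed_dim):
    """Number of weights in the embedding table."""
    return vocab_size * embed_dim

# e.g. a 10,000-word vocabulary:
for dim in (64, 128, 256):
    print(dim, embedding_params(10_000, dim))
# 64 -> 640,000 weights; 128 -> 1,280,000; 256 -> 2,560,000
```

For small models, the embedding table often dominates the total parameter count, which is why capping vocabulary size matters as much as choosing the dimension.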

Embeddings are high-dimensional, but we can visualize them using dimensionality reduction:

python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Get embeddings for some words
words = ['king', 'queen', 'man', 'woman', 'cat', 'dog', 'car', 'truck']
embeddings = np.array([embedding[word] for word in words])

# Reduce to 2D using t-SNE
tsne = TSNE(n_components=2, perplexity=3, random_state=42)  # perplexity must be < number of samples
embeddings_2d = tsne.fit_transform(embeddings)

# Plot
plt.figure(figsize=(10, 8))
for i, word in enumerate(words):
    x, y = embeddings_2d[i]
    plt.scatter(x, y)
    plt.annotate(word, (x, y), fontsize=12)

plt.title('Word Embeddings Visualization')
plt.show()

# You'll see:
# - king, queen, man, woman cluster together (royalty/gender)
# - cat, dog cluster together (animals)
# - car, truck cluster together (vehicles)

A few common mistakes to watch out for. First, forgetting padding_idx:

python
# ❌ Wrong: Padding tokens get updated
embedding = nn.Embedding(vocab_size, embed_dim)
# Padding tokens (index 0) will have non-zero gradients!

# ✅ Correct: Padding tokens stay at zero
embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
# Padding tokens don't contribute to loss or gradients

Second, building the vocabulary from all your data:

python
# ❌ Wrong: Building vocab from all data (including test)
vocab = build_vocab(train_texts + val_texts + test_texts)
# This is data leakage!

# ✅ Correct: Build vocab only from training data
vocab = build_vocab(train_texts)
# Val and test may have unknown words - that's okay!

Third, an unfiltered, oversized vocabulary:

python
# ❌ Wrong: Including every word (even rare ones)
vocab_size = 100000  # Way too large!
# Most words appear only once or twice

# ✅ Correct: Filter by frequency
min_freq = 2  # Only keep words that appear at least twice
max_vocab = 10000  # Cap vocabulary size

# This reduces:
# - Model size (fewer embedding parameters)
# - Overfitting (rare words are noise)
# - Training time
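
A minimal vocabulary builder that applies both filters (the function name and special tokens follow the conventions used earlier in this article):

```python
from collections import Counter

def build_vocab(texts, min_freq=2, max_vocab=10000):
    """Map words to indices, keeping only frequent words. 0 = padding, 1 = unknown."""
    counts = Counter(word for text in texts for word in text.split())
    word2idx = {"<PAD>": 0, "<UNK>": 1}
    for word, freq in counts.most_common(max_vocab - 2):  # 2 slots reserved
        if freq < min_freq:
            break  # most_common is sorted, so everything after this is too rare
        word2idx[word] = len(word2idx)
    return word2idx

vocab = build_vocab(["the cat sat", "the dog sat", "a rare zebra"])
print(vocab)  # 'the' and 'sat' are kept (freq 2); rare words map to <UNK>
```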
| Aspect | Learned Embeddings | Pretrained Embeddings |
|--------|--------------------|-----------------------|
| Training data needed | 10K+ examples | Can work with 1K+ |
| Domain adaptation | Automatic | May need fine-tuning |
| Training time | Longer | Shorter (embeddings fixed) |
| Performance | Better with lots of data | Better with little data |
| Vocabulary coverage | Only your data | Millions of words |
| Unknown words | Many | Fewer |
| Use case | Large datasets, specific domains | Small datasets, general domains |

Traditional word embeddings have one limitation: each word has a single embedding, regardless of context.

python
# Problem: "bank" has the same embedding in both sentences
sentence1 = "I went to the bank to deposit money"  # financial institution
sentence2 = "I sat on the river bank"              # riverside

# Traditional embeddings:
embedding['bank']  # Same vector for both!

# Contextual embeddings (BERT, ELMo):
bert_embedding(sentence1, word='bank')  # Different vector!
bert_embedding(sentence2, word='bank')  # Different vector!

Modern models like BERT, GPT, and RoBERTa use contextual embeddings that change based on surrounding words. However, they're much more expensive to train and use.

If you're training embeddings from scratch:

  1. Start with learned embeddings (128-256 dimensions)
  2. Use padding_idx=0 for padding tokens
  3. Filter vocabulary by frequency (min_freq=2)
  4. Cap vocabulary size (max_vocab=10000-20000)

If you have limited data:

  1. Try pretrained embeddings first (GloVe, FastText)
  2. Fine-tune if you have enough data (> 50K examples)
  3. Use FastText for handling unknown words
  4. Consider contextual embeddings (BERT) for state-of-the-art results

Word embeddings are a fundamental building block of modern NLP. They transform discrete words into continuous vectors that capture semantic meaning, enabling neural networks to understand language.

Key takeaways:

  • One-hot encoding is inefficient and doesn't capture semantics
  • Word embeddings are dense, low-dimensional vectors that capture meaning
  • Similar words have similar embeddings (cosine similarity)
  • Learned embeddings adapt to your task but need more data
  • Pretrained embeddings (Word2Vec, GloVe) work well with less data
  • Contextual embeddings (BERT) are state-of-the-art but expensive

Start Simple: Begin with learned embeddings (128-256 dim) for most tasks. Only move to pretrained or contextual embeddings if you need better performance or have limited data.

Understanding word embeddings is crucial for any NLP practitioner. They're the foundation upon which more complex models like RNNs, LSTMs, and Transformers are built.
