Word Embeddings Explained: From One-Hot to Dense Vectors
Learn how word embeddings transform words into meaningful numerical vectors. Understand one-hot encoding, learned embeddings, Word2Vec, GloVe, and when to use pretrained vs learned embeddings.
Computers don't understand words; they only understand numbers. So how do we teach a machine learning model about language? The answer is word embeddings: mathematical representations that capture the meaning of words as dense numerical vectors.
In this guide, we'll explore how word embeddings work, why they're so powerful, and how to use them effectively in your NLP projects.
Before we can process text with neural networks, we need to convert words to numbers. The naive approach is one-hot encoding:
# Vocabulary: ["cat", "dog", "bird", "fish"]
# One-hot encoding:
cat = [1, 0, 0, 0]
dog = [0, 1, 0, 0]
bird = [0, 0, 1, 0]
fish = [0, 0, 0, 1]
# Each word is a sparse vector with:
# - Length = vocabulary size
# - Exactly one 1, rest are 0s

One-hot encoding has serious problems:

- Huge dimensionality: For a vocabulary of 10,000 words, each word is a 10,000-dimensional vector!
- No semantic meaning: "cat" and "dog" are just as different from each other as "cat" and "democracy"
- Sparse: 99.99% of values are zeros, wasting memory and computation
- No relationships: Can't capture that "king" - "man" + "woman" ≈ "queen"
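To make the dimensionality and sparsity problems concrete, here is a quick sketch with toy numbers (plain NumPy, nothing from the article's later code):

```python
import numpy as np

vocab_size = 10_000

# A full one-hot vocabulary is just the identity matrix: one row per word
one_hot = np.eye(vocab_size, dtype=np.float32)

# Memory cost: every entry takes 4 bytes, even the zeros
memory_mb = one_hot.nbytes / 1024 ** 2

# Fraction of entries that are zero
sparsity = 1 - vocab_size / one_hot.size

print(f"{memory_mb:.0f} MB, {sparsity:.2%} zeros")  # ~381 MB, 99.99% zeros
```

Hundreds of megabytes just to represent 10,000 words, and almost all of it is zeros.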
# One-hot vectors are all orthogonal (perpendicular)
import numpy as np
cat = np.array([1, 0, 0, 0])
dog = np.array([0, 1, 0, 0])
# Cosine similarity (measures how similar vectors are)
similarity = np.dot(cat, dog) / (np.linalg.norm(cat) * np.linalg.norm(dog))
print(f"Similarity between cat and dog: {similarity}") # 0.0
# All words are equally dissimilar!

Word embeddings solve these problems by representing words as dense, low-dimensional vectors where similar words have similar vectors.
# Word embeddings (typically 50-300 dimensions)
cat = [0.2, -0.5, 0.8, 0.1, -0.3, ...] # 128-dim vector
dog = [0.3, -0.4, 0.7, 0.2, -0.2, ...] # Similar to cat!
bird = [0.1, -0.6, 0.9, 0.0, -0.4, ...] # Also similar (animals)
car = [-0.8, 0.3, -0.2, 0.9, 0.5, ...] # Very different
# Now we can measure semantic similarity:
similarity(cat, dog) = 0.92 # Very similar!
similarity(cat, bird) = 0.78 # Somewhat similar
similarity(cat, car) = 0.15  # Not similar

The benefits of word embeddings:

- Semantic similarity: Similar words have similar vectors
- Dimensionality reduction: 10,000 words → 128-300 dimensions
- Dense: All values are meaningful (no zeros)
- Learned relationships: Captures analogies and relationships
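The similarity scores above are illustrative. With real vectors they come from cosine similarity, the same formula used earlier for one-hot vectors. A toy sketch with made-up 3-dimensional vectors (real embeddings have far more dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy dense vectors, invented for illustration
cat = np.array([0.2, -0.5, 0.8])
dog = np.array([0.3, -0.4, 0.7])
car = np.array([-0.8, 0.3, -0.2])

print(f"cat vs dog: {cosine_similarity(cat, dog):.2f}")  # 0.99
print(f"cat vs car: {cosine_similarity(cat, car):.2f}")  # -0.56
```

Unlike one-hot vectors, dense vectors can point in similar or opposite directions, so the similarity score actually carries meaning.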
There are two main approaches to creating word embeddings:
The first is to train embeddings from scratch as part of your model. The embeddings learn to be useful for your specific task.
import torch.nn as nn
# Create an embedding layer
vocab_size = 10000
embed_dim = 128
embedding = nn.Embedding(
    num_embeddings=vocab_size,  # vocabulary size
    embedding_dim=embed_dim     # vector dimension
)
# Initially, embeddings are random
print(embedding.weight[0]) # Random values
# During training, backpropagation updates these vectors
# to be useful for your task (e.g., sentiment classification)

How it works:
# Example: Sentiment classification
# Input: "This movie is great"
# Word indices: [45, 892, 12, 234]
# Step 1: Look up embeddings
word_indices = torch.tensor([45, 892, 12, 234])
embeddings = embedding(word_indices)
# Shape: (4, 128) - 4 words, each is 128-dim vector
# Step 2: Process with neural network
output = model(embeddings)
# Step 3: Compute loss
loss = criterion(output, label)
# Step 4: Backpropagation updates embedding weights!
loss.backward()
optimizer.step()
# The embedding for "great" learns to have positive sentiment

The second approach is to use embeddings trained on massive text corpora (like Wikipedia). These capture general language knowledge.
Popular pretrained embeddings:
- Word2Vec (Google, 2013): Trained on Google News
- GloVe (Stanford, 2014): Trained on Wikipedia + web text
- FastText (Facebook, 2016): Handles out-of-vocabulary words
Word2Vec is based on a simple but powerful idea: "You shall know a word by the company it keeps" (J.R. Firth, 1957).
Words that appear in similar contexts should have similar meanings.
Given a word, predict its surrounding words (context).
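Generating these (center word, context word) training pairs is simple to sketch. The `skipgram_pairs` helper below is hypothetical, written just for illustration, using a window of one word on each side:

```python
def skipgram_pairs(tokens, window=1):
    """Yield (center, context) training pairs for a skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        # Look `window` words to the left and right of the center word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens)
print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Real Word2Vec implementations use larger windows (typically 5) and additional tricks like subsampling frequent words, but the core pair generation looks like this.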
# Training data from text:
# "The cat sat on the mat"
# Create training pairs (center word -> context words):
# Input: "cat" -> Output: ["The", "sat"]
# Input: "sat" -> Output: ["cat", "on"]
# Input: "on" -> Output: ["sat", "the"]
# Input: "the" -> Output: ["on", "mat"]
# The model learns:
# - "cat" and "mat" appear in similar contexts
# - So their embeddings should be similar!

Word2Vec embeddings capture amazing semantic relationships:
# Vector arithmetic works!
king - man + woman ≈ queen
# In code (find_nearest_word stands in for a nearest-neighbor lookup):
result = embedding['king'] - embedding['man'] + embedding['woman']
nearest = find_nearest_word(result)
print(nearest) # "queen"
# Other examples:
Paris - France + Italy ≈ Rome
bigger - big + small ≈ smaller
walking - walk + swim ≈ swimming

GloVe (Global Vectors for Word Representation) takes a different approach: it uses global word co-occurrence statistics.
# Step 1: Build co-occurrence matrix
# Count how often words appear together in a context window
#        ice  steam  solid  gas  water
# ice      0      0      5    0      3
# steam    0      0      0    4      2
# solid    5      0      0    0      1
# gas      0      4      0    0      1
# water    3      2      1    1      0
# Step 2: Train embeddings to reconstruct this matrix
# Goal: dot(embedding[i], embedding[j]) ≈ log(co_occurrence[i,j])

GloVe combines the benefits of:
- Global statistics (like LSA/SVD methods)
- Local context (like Word2Vec)
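Counting co-occurrences like the matrix above can be sketched in a few lines. The `cooccurrence_counts` helper is hypothetical, for illustration, using a toy corpus and a symmetric window of 1:

```python
from collections import Counter

def cooccurrence_counts(corpus, window=1):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["ice is solid", "steam is gas", "ice melts into water"]
counts = cooccurrence_counts(corpus)
print(counts[("ice", "is")])     # 1: adjacent in the first sentence
print(counts[("steam", "gas")])  # 0: never within the window
```

GloVe builds this matrix over billions of tokens with a larger, distance-weighted window, then trains embeddings to reconstruct the log counts.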
import torch
import torch.nn as nn
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, n_classes):
        super().__init__()
        # Learnable embedding layer
        self.embedding = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=embed_dim,
            padding_idx=0  # Don't update padding token
        )
        # Rest of model...
        self.fc = nn.Linear(embed_dim, n_classes)

    def forward(self, x):
        # x: word indices, shape (batch, seq_len)
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        # Average pooling
        pooled = embedded.mean(dim=1)  # (batch, embed_dim)
        return self.fc(pooled)
# During training, embeddings are updated via backprop
model = TextClassifier(vocab_size=10000, embed_dim=128, n_classes=2)
optimizer = torch.optim.Adam(model.parameters())
for batch_x, batch_y in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(batch_x), batch_y)
    loss.backward()   # Computes gradients, including for embedding weights
    optimizer.step()  # Updates embedding weights!

# Load pretrained GloVe embeddings
import numpy as np
def load_glove_embeddings(glove_file, word2idx, embed_dim=100):
    """
    Load GloVe embeddings and create embedding matrix
    """
    # Words missing from GloVe keep small random vectors
    embeddings = np.random.randn(len(word2idx), embed_dim) * 0.01
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            if word in word2idx:
                idx = word2idx[word]
                embeddings[idx] = vector
    return embeddings
# Load embeddings
pretrained_embeddings = load_glove_embeddings(
    'glove.6B.100d.txt',
    word2idx,
    embed_dim=100
)
# Initialize embedding layer with pretrained weights
embedding = nn.Embedding(vocab_size, embed_dim)
embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))
# Option A: Freeze embeddings (don't update during training)
embedding.weight.requires_grad = False
# Option B: Fine-tune embeddings (update during training)
embedding.weight.requires_grad = True

When should you freeze, and when should you fine-tune?

| Scenario | Recommendation | Reason |
|---|---|---|
| Small dataset (< 10K) | Freeze | Not enough data to learn good embeddings |
| Large dataset (> 100K) | Fine-tune | Can adapt embeddings to your task |
| Domain-specific (medical, legal) | Fine-tune | Pretrained embeddings may not fit your domain |
| General domain | Freeze or fine-tune | Both work well |
| Limited compute | Freeze | Fewer parameters to update = faster training |
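A common middle ground, not covered in the table, is to fine-tune embeddings with a smaller learning rate than the rest of the model using optimizer parameter groups. A hedged sketch, assuming a hypothetical two-part model:

```python
import torch
import torch.nn as nn

# Hypothetical model layout: an embedding table plus a classifier head
model = nn.ModuleDict({
    "embedding": nn.Embedding(10_000, 128),
    "fc": nn.Linear(128, 2),
})

# Embeddings get a 10x smaller learning rate than the classifier head,
# so pretrained knowledge changes slowly while the head adapts quickly
optimizer = torch.optim.Adam([
    {"params": model["embedding"].parameters(), "lr": 1e-4},
    {"params": model["fc"].parameters(), "lr": 1e-3},
])

print([group["lr"] for group in optimizer.param_groups])
```

This keeps pretrained vectors close to their initial values while still letting them drift toward your task.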
What happens when you encounter a word not in your vocabulary?
# Reserve index 1 for unknown words
word2idx = {"<PAD>": 0, "<UNK>": 1, "cat": 2, "dog": 3, ...}
def encode_word(word, word2idx):
    return word2idx.get(word, 1)  # Return 1 if word not found
# Example
print(encode_word("cat", word2idx)) # 2
print(encode_word("elephant", word2idx)) # 1 (unknown)
print(encode_word("xyzabc", word2idx)) # 1 (unknown)

FastText represents words as bags of character n-grams, allowing it to generate embeddings for unseen words.
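The decomposition itself is easy to sketch. The `char_ngrams` helper below is hypothetical, written for illustration, with `<` and `>` marking word boundaries as FastText does:

```python
def char_ngrams(word, n=3):
    """Break a word into character n-grams, with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']

# Morphological variants share n-grams even if one is out-of-vocabulary:
shared = set(char_ngrams("playing")) & set(char_ngrams("played"))
print(shared)  # {'<pl', 'pla', 'lay'} (set order may vary)
```

FastText actually uses a range of n-gram sizes (3 to 6 by default) and sums the n-gram vectors to form the word vector.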
# FastText breaks words into character n-grams
# Word: "playing"
# 3-grams: ["<pl", "pla", "lay", "ayi", "yin", "ing", "ng>"]
# Even if "playing" is unseen, we can combine embeddings of its n-grams
# This works especially well for:
# - Morphological variations: play, playing, played, player
# - Typos: plaing, playng
# - Rare words: uncommon words share n-grams with common ones

Choosing the right embedding dimension is important:
| Dimension | Use Case | Pros | Cons |
|---|---|---|---|
| 50-100 | Small datasets, simple tasks | Fast, less overfitting | May not capture complex semantics |
| 128-256 | Most NLP tasks (recommended) | Good balance; standard choice | Few drawbacks for typical tasks |
| 300-512 | Large datasets, complex tasks | Rich representations | More parameters, slower |
| 768+ | Transformer models (BERT, etc.) | State-of-the-art | Very expensive |
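The cost column in the table follows directly from parameter count: the embedding table alone holds vocab_size × embed_dim weights. A quick back-of-the-envelope sketch (the helper is just for illustration):

```python
def embedding_table_size(vocab_size, embed_dim):
    """Number of weights in the embedding table alone."""
    return vocab_size * embed_dim

for dim in (64, 128, 300, 768):
    params = embedding_table_size(10_000, dim)
    # float32 = 4 bytes per parameter
    print(f"dim={dim}: {params:,} params, {params * 4 / 1024**2:.1f} MB")
```

For a 10,000-word vocabulary at 128 dimensions, that is 1.28 million parameters before the rest of the model even starts.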
# Rule of thumb: Start with 128 or 256
embed_dim = 128 # Good default
# For very large vocabularies (100K+), you might need more
embed_dim = 256
# For small vocabularies (< 5K), you can use less
embed_dim = 64

Embeddings are high-dimensional, but we can visualize them using dimensionality reduction:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Get embeddings for some words
words = ['king', 'queen', 'man', 'woman', 'cat', 'dog', 'car', 'truck']
embeddings = np.array([embedding[word] for word in words])
# Reduce to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings)
# Plot
plt.figure(figsize=(10, 8))
for i, word in enumerate(words):
    x, y = embeddings_2d[i]
    plt.scatter(x, y)
    plt.annotate(word, (x, y), fontsize=12)
plt.title('Word Embeddings Visualization')
plt.show()
# You'll see:
# - king, queen, man, woman cluster together (royalty/gender)
# - cat, dog cluster together (animals)
# - car, truck cluster together (vehicles)

Watch out for these common pitfalls.

# ❌ Wrong: Padding tokens get updated
embedding = nn.Embedding(vocab_size, embed_dim)
# Padding tokens (index 0) will have non-zero gradients!
# ✅ Correct: Padding tokens stay at zero
embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
# Padding tokens don't contribute to loss or gradients

# ❌ Wrong: Building vocab from all data (including test)
vocab = build_vocab(train_texts + val_texts + test_texts)
# This is data leakage!
# ✅ Correct: Build vocab only from training data
vocab = build_vocab(train_texts)
# Val and test may have unknown words - that's okay!

# ❌ Wrong: Including every word (even rare ones)
vocab_size = 100000 # Way too large!
# Most words appear only once or twice
# ✅ Correct: Filter by frequency
min_freq = 2 # Only keep words that appear at least twice
max_vocab = 10000 # Cap vocabulary size
# This reduces:
# - Model size (fewer embedding parameters)
# - Overfitting (rare words are noise)
# - Training time

Learned vs pretrained embeddings at a glance:

| Aspect | Learned Embeddings | Pretrained Embeddings |
|---|---|---|
| Training data needed | 10K+ examples | Can work with 1K+ |
| Domain adaptation | Automatic | May need fine-tuning |
| Training time | Longer | Shorter (embeddings fixed) |
| Performance | Better with lots of data | Better with little data |
| Vocabulary coverage | Only your data | Millions of words |
| Unknown words | Many | Fewer |
| Use case | Large datasets, specific domains | Small datasets, general domains |
Traditional word embeddings have one limitation: each word has a single embedding, regardless of context.
# Problem: "bank" has the same embedding in both sentences
sentence1 = "I went to the bank to deposit money" # financial institution
sentence2 = "I sat on the river bank" # riverside
# Traditional embeddings:
embedding['bank'] # Same vector for both!
# Contextual embeddings (BERT, ELMo):
bert_embedding(sentence1, word='bank') # Different vector!
bert_embedding(sentence2, word='bank') # Different vector!

Modern models like BERT, GPT, and RoBERTa use contextual embeddings that change based on surrounding words. However, they're much more expensive to train and use.
- Start with learned embeddings (128-256 dimensions)
- Use padding_idx=0 for padding tokens
- Filter vocabulary by frequency (min_freq=2)
- Cap vocabulary size (max_vocab=10000-20000)
- Try pretrained embeddings first (GloVe, FastText)
- Fine-tune if you have enough data (> 50K examples)
- Use FastText for handling unknown words
- Consider contextual embeddings (BERT) for state-of-the-art results
Word embeddings are a fundamental building block of modern NLP. They transform discrete words into continuous vectors that capture semantic meaning, enabling neural networks to understand language.
Key takeaways:
- One-hot encoding is inefficient and doesn't capture semantics
- Word embeddings are dense, low-dimensional vectors that capture meaning
- Similar words have similar embeddings (cosine similarity)
- Learned embeddings adapt to your task but need more data
- Pretrained embeddings (Word2Vec, GloVe) work well with less data
- Contextual embeddings (BERT) are state-of-the-art but expensive
Understanding word embeddings is crucial for any NLP practitioner. They're the foundation upon which more complex models like RNNs, LSTMs, and Transformers are built.
Related Articles
Text Preprocessing and Tokenization for NLP: A Complete Guide
Master text preprocessing and tokenization for NLP. Learn about vocabulary building, padding, truncation, and handling variable-length sequences in deep learning models.
BiLSTM for Text Classification: Understanding Sequential Deep Learning
Learn how Bidirectional LSTM networks process text sequentially to capture context, word order, and meaning. A complete guide to building your first sequence model for NLP.
From Words to Intelligence: Building an MLP Classifier on Pretrained Sentence Embeddings
A deep dive into pretrained sentence embeddings, MLP architecture, BatchNorm, Dropout, Adam, and early stopping, with a full PyTorch implementation.