
Transfer Learning in NLP: Standing on the Shoulders of Giants
A complete beginner's guide to transfer learning in NLP - how pretrained models work, why freezing encoders makes sense, and how to use sentence transformers effectively.
Imagine you want to become a chef. You could start from scratch — learning what fire is, how heat works, basic chemistry. Or you could start with knowledge that master chefs have already figured out, and focus on your specific recipes. Transfer learning is the second approach: borrowing intelligence from models trained on massive datasets, and adapting it to your specific problem. This post explains how transfer learning revolutionized NLP, why it works so well, and how to use it effectively.
Before transfer learning, every NLP project started from zero. Want to classify movie reviews? Train a model from scratch on your 10,000 reviews. Want to detect spam? Train from scratch on your emails. Want to answer questions? Train from scratch on your Q&A pairs.
This had three major problems:
The Data Hunger Problem
Deep learning models have millions of parameters. To train them well, you need millions of examples. But most real-world projects have thousands, not millions. Training from scratch on small datasets leads to severe overfitting — the model memorizes the training data but fails on new examples.
The Wasted Compute Problem
Training a language model from scratch takes weeks on expensive GPUs. Every project repeats this expensive process, even though they're all learning the same basic things: what words mean, how grammar works, how sentences relate to each other. It's like every chef learning from scratch that water boils at 100°C — wasteful duplication of effort.
The Shallow Patterns Problem
With limited data, models learn superficial patterns. A spam classifier might learn 'if email contains "free money", it's spam' — but miss deeper patterns like writing style, urgency markers, or social engineering tactics. These deeper patterns require massive datasets to learn.
Transfer learning flips the script. Instead of starting from zero, you start with a model that's already been trained on billions of words. This model has already learned:
- What words mean and how they relate to each other
- Grammar and syntax patterns
- Common phrases and idioms
- Semantic relationships (synonyms, antonyms, analogies)
- Context and how meaning changes based on surrounding words
You take this pretrained model and adapt it to your specific task. This is called transfer learning — transferring knowledge from one task (general language understanding) to another (your specific problem).
Let's understand what happens when a model is 'pretrained'. The most common approach is called masked language modeling:
The model is shown billions of sentences with random words masked out, and learns to predict the missing words:
```
Original: "The cat sat on the mat"
Masked:   "The cat [MASK] on the mat"
Task:     Predict that [MASK] = "sat"

Original: "I love eating pizza for dinner"
Masked:   "I love [MASK] pizza for dinner"
Task:     Predict that [MASK] = "eating"

Original: "The weather is beautiful today"
Masked:   "The [MASK] is beautiful today"
Task:     Predict that [MASK] = "weather"
```
To predict the masked word, the model must understand:
- Context: What words appear before and after
- Grammar: What part of speech fits here (noun, verb, adjective)
- Semantics: What meaning makes sense in this context
- World knowledge: Common patterns and relationships
After training on billions of sentences, the model develops a rich internal representation of language. It hasn't just memorized words — it's learned the deep structure of how language works.
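The masking step itself can be sketched in a few lines of plain Python. This is a toy illustration: real pretraining masks subword tokens rather than whole words, and masks about 15% of them instead of exactly one.

```python
import random

def mask_one_word(sentence, rng=None):
    """Replace one random word with [MASK]; return (masked sentence, target word)."""
    rng = rng or random.Random(0)  # fixed seed so the example is reproducible
    words = sentence.split()
    i = rng.randrange(len(words))
    target = words[i]
    words[i] = "[MASK]"
    return " ".join(words), target

masked, target = mask_one_word("The cat sat on the mat")
print(masked)
print("target:", target)
```

The pretraining objective is then simply: given the masked sentence, predict the target word.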
Sentence transformers are pretrained models specifically trained to produce good sentence embeddings. They're trained using contrastive learning:
Training pairs:
```
Similar sentences (should have similar embeddings):
- "Book a flight to Tokyo" ↔ "Reserve a plane ticket to Tokyo"
- "What's the weather?" ↔ "How's the weather today?"

Dissimilar sentences (should have different embeddings):
- "Book a flight" ↔ "What's the weather?"
- "I love pizza" ↔ "The sky is blue"
```
The model learns to make similar sentences close in embedding space, and dissimilar sentences far apart. This training creates embeddings where semantic similarity = geometric proximity. Sentences that mean similar things end up close together in the 384-dimensional space.
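'Close' and 'far apart' here mean cosine similarity. A minimal sketch with made-up 3-dimensional vectors (real sentence-transformer embeddings have 384 dimensions, but the geometry works the same way):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: the two flight sentences point in a similar direction
flight_a = [0.9, 0.1, 0.2]  # "Book a flight to Tokyo"
flight_b = [0.8, 0.2, 0.1]  # "Reserve a plane ticket to Tokyo"
weather  = [0.1, 0.9, 0.1]  # "What's the weather?"

print(cosine_similarity(flight_a, flight_b))  # high, near 1
print(cosine_similarity(flight_a, weather))   # much lower
```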
There are two main ways to use a pretrained model:
Approach 1: Feature Extraction (Frozen Encoder)
Use the pretrained model as a frozen feature extractor. You never update its weights — you just use it to convert text into embeddings, then train a small classifier on top.
```python
from sentence_transformers import SentenceTransformer
import torch.nn as nn

# 1. Load pretrained model (frozen)
encoder = SentenceTransformer('all-MiniLM-L6-v2')
# We NEVER update the encoder's weights

# 2. Convert all text to embeddings once
train_embeddings = encoder.encode(train_texts)  # (N, 384)

# 3. Train a small classifier on the embeddings
classifier = nn.Sequential(
    nn.Linear(384, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, num_classes),  # num_classes = number of labels in your task
)
# Only classifier weights are updated during training
```
Pros:
- Fast: Only training a small classifier, not the entire encoder
- Low memory: Don't need to store gradients for the encoder
- Works on CPU: No need for expensive GPUs
- Can't overfit the encoder: The pretrained weights stay untouched
Cons:
- Can't adapt encoder: If your domain is very different from the pretraining data, you're stuck
- Slightly lower accuracy: Fine-tuning usually gives 2-5% better accuracy
Approach 2: Fine-Tuning (Update the Encoder)
Update the pretrained model's weights on your specific task. You start with the pretrained weights and continue training, but with a very small learning rate.
```python
from transformers import AutoModel
import torch
import torch.nn as nn

# 1. Load pretrained model (will be updated)
encoder = AutoModel.from_pretrained('bert-base-uncased')

# 2. Add a classifier head (BERT's hidden size is 768)
classifier = nn.Linear(768, num_classes)

# 3. Train the ENTIRE model (encoder + classifier)
# Use a small learning rate for the encoder
optimizer = torch.optim.Adam([
    {'params': encoder.parameters(), 'lr': 1e-5},     # Small LR for encoder
    {'params': classifier.parameters(), 'lr': 1e-3},  # Normal LR for classifier
])

# Each forward pass runs the encoder, pools its output, then classifies:
# outputs = encoder(**tokenized_batch)
# logits = classifier(outputs.last_hidden_state[:, 0])  # [CLS] token
```
Pros:
- Best accuracy: Usually 2-5% better than frozen features
- Adapts to your domain: Can learn domain-specific patterns
Cons:
- Slow: Training the entire encoder takes much longer
- Needs GPU: Too slow on CPU
- High memory: Need to store gradients for millions of parameters
- Can overfit: With small datasets, you might make the encoder worse
Which Approach to Choose?
For most projects, start with frozen features; fine-tune only if you have a large labeled dataset, a GPU, and a measured need for the extra accuracy. Let's dig deeper into why freezing the encoder is often the right choice, especially for small datasets.
The pretrained encoder was trained on billions of sentences. Your dataset has thousands. If you try to 'improve' it with your tiny dataset, you'll almost certainly make it worse. This is called catastrophic forgetting — the model forgets its general knowledge while memorizing your specific examples.
Computing gradients through a transformer encoder is expensive. By freezing it, you:
- Compute embeddings once: Convert all text to embeddings before training starts
- No backprop through encoder: Only compute gradients for the small classifier
- Train on CPU: The classifier is small enough to train without a GPU
- Iterate faster: Training takes minutes instead of hours
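In PyTorch, freezing is just setting requires_grad = False on the encoder's parameters. A minimal sketch, using a small stand-in module in place of a real transformer encoder (which would have millions of parameters):

```python
import torch.nn as nn

# Stand-in for a pretrained encoder; a real one has millions of parameters
encoder = nn.Linear(384, 384)
classifier = nn.Linear(384, 10)  # hypothetical 10-class task

# Freeze the encoder: no gradients are computed or stored for it
for p in encoder.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in classifier.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in encoder.parameters() if not p.requires_grad)
print(f"trainable: {trainable:,}  frozen: {frozen:,}")
```

Only the classifier's parameters go to the optimizer, so each training step updates a few thousand weights instead of millions.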
With a frozen encoder, embeddings never change. You can compute them once and save to disk:
```python
import torch
from pathlib import Path

def get_embeddings(texts, cache_path):
    # Check if we already computed these
    if Path(cache_path).exists():
        print("Loading from cache (instant!)")
        return torch.load(cache_path)
    # First time - compute and save
    print("Computing embeddings (takes a few minutes)...")
    embeddings = encoder.encode(texts)
    torch.save(embeddings, cache_path)
    return embeddings

# First run: computes embeddings (slow)
train_emb = get_embeddings(train_texts, 'train_embeddings.pt')
# Subsequent runs: loads from cache (instant)
train_emb = get_embeddings(train_texts, 'train_embeddings.pt')
```
This is a huge time saver. Computing 15,000 embeddings takes 2-3 minutes. If you're experimenting with different classifier architectures, you'd waste hours recomputing the same embeddings. With caching, subsequent runs start instantly.
Let's see a complete example of using sentence transformers for classification:
```python
from sentence_transformers import SentenceTransformer
import torch
import torch.nn as nn

# 1. Load pretrained sentence transformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# This model produces 384-dimensional embeddings

# 2. Encode your texts
train_texts = [
    "Book a flight to Tokyo",
    "What's the weather today?",
    "Play some music",
    # ... thousands more
]
train_embeddings = model.encode(
    train_texts,
    batch_size=256,           # Process 256 at a time
    show_progress_bar=True,   # Show progress
    convert_to_tensor=True,   # Return PyTorch tensor
)
print(train_embeddings.shape)  # (N, 384)

# 3. Build a classifier on top
# (num_classes and train_labels come from your dataset)
classifier = nn.Sequential(
    nn.Linear(384, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(128, num_classes),
)

# 4. Train the classifier (not the encoder!)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(50):
    # Forward pass
    logits = classifier(train_embeddings)
    loss = criterion(logits, train_labels)
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# 5. Inference on new text
# eval() matters here: BatchNorm fails in train mode on a batch of one
classifier.eval()
with torch.no_grad():
    new_text = "Reserve a plane ticket"
    new_embedding = model.encode([new_text], convert_to_tensor=True)
    prediction = classifier(new_embedding).argmax(dim=1)
```
There are many pretrained sentence transformers. Here are the most popular:
| Model | Embedding Dim | Disk Size | Speed | Quality | Best For |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 80MB | Fast | Good | General purpose, CPU-friendly |
| all-mpnet-base-v2 | 768 | 420MB | Medium | Best | When you need highest quality |
| paraphrase-MiniLM-L6-v2 | 384 | 80MB | Fast | Good | Paraphrase detection |
| multi-qa-MiniLM-L6-cos-v1 | 384 | 80MB | Fast | Good | Question answering |
Start with all-MiniLM-L6-v2 unless you have a specific reason to choose another model; it is small, fast, and produces high-quality embeddings.
Common mistakes to avoid:
- Fine-tuning on tiny datasets: With <5,000 examples, stick to frozen features. Fine-tuning will overfit.
- Not caching embeddings: Computing embeddings takes minutes. Cache them to disk and reuse.
- Using the wrong model: all-MiniLM-L6-v2 is for general text. For code, use code-specific models. For scientific text, use scientific models.
- Forgetting to normalize: Some models require L2 normalization of embeddings. Check the model card.
- Comparing embeddings with wrong metric: Use cosine similarity, not Euclidean distance, for sentence embeddings.
- Training the encoder on small data: If you have <10,000 examples, don't fine-tune. You'll make it worse.
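The normalization point is easy to verify by hand: after L2 normalization every vector has length 1, and cosine similarity reduces to a plain dot product (a toy sketch with made-up 2-dimensional vectors):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

a = l2_normalize([3.0, 4.0])  # becomes [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # becomes [0.8, 0.6]

# For unit vectors, cosine similarity is just the dot product
dot = sum(x * y for x, y in zip(a, b))
print(dot)
```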
Transfer learning has revolutionized NLP. Here's what changed:
| Before Transfer Learning | After Transfer Learning |
|---|---|
| Need millions of labeled examples | Need thousands of labeled examples |
| Train for weeks on expensive GPUs | Train for hours on CPU |
| Each project starts from scratch | Each project starts with pretrained knowledge |
| Accuracy: 60-70% on hard tasks | Accuracy: 85-95% on hard tasks |
| Only big companies can do NLP | Anyone can do NLP |
Key Takeaways
- Transfer learning means starting with pretrained knowledge instead of training from scratch.
- Pretrained models learned from billions of sentences and understand deep language patterns.
- Two approaches: Feature extraction (frozen encoder) and fine-tuning (update encoder).
- Start with frozen features: Faster, works on CPU, can't overfit the encoder.
- Only fine-tune if: You have 10,000+ examples, a GPU, and need that extra 2-5% accuracy.
- Cache embeddings: Compute once, save to disk, reuse forever.
- all-MiniLM-L6-v2 is a great default: Small, fast, high-quality embeddings.
- Transfer learning democratized NLP: Anyone can now build state-of-the-art systems.