Text Preprocessing and Tokenization for NLP: A Complete Guide
Master text preprocessing and tokenization for NLP. Learn about vocabulary building, padding, truncation, and handling variable-length sequences in deep learning models.
Before you can train a neural network on text, you need to convert raw text into a format the model can understand. This process—text preprocessing and tokenization—is often overlooked but critically important. Poor preprocessing can tank your model's performance, while good preprocessing can give you a significant boost.
In this guide, we'll cover everything you need to know about preparing text for deep learning models.
Here's the typical pipeline for processing text:

1. Clean the raw text (strip URLs, HTML tags, extra whitespace)
2. Tokenize it into words, characters, or subwords
3. Build a vocabulary mapping tokens to integer indices
4. Encode each text as a sequence of indices
5. Pad or truncate sequences to a uniform length
Let's walk through each step with practical examples.
Raw text is messy. It contains special characters, HTML tags, URLs, and inconsistent formatting. Cleaning prepares text for tokenization.
```python
import re

def clean_text(text):
    """
    Clean raw text for NLP processing
    """
    # 1. Lowercase (optional - depends on task)
    text = text.lower()

    # 2. Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    # 3. Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # 4. Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # 5. Remove special characters (keep letters, numbers, spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # 6. Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Example
raw = "Check out https://example.com! Email: test@email.com <b>Amazing</b> product!!!"
cleaned = clean_text(raw)
print(cleaned)
# Output: "check out email amazing product"
# (only the address is removed; the label word "Email" survives as "email")
```

Should you lowercase? It depends on the task:

| Task | Lowercase? | Reason |
|---|---|---|
| Sentiment analysis | Yes | "GREAT" and "great" have same sentiment |
| Named entity recognition | No | "Apple" (company) vs "apple" (fruit) |
| Question answering | No | Proper nouns are important |
| Text classification | Usually yes | Reduces vocabulary size |
| Machine translation | No | Case carries meaning in many languages |
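Because lowercasing is task-dependent, it's common to make it a switch on the cleaning function. Here's a minimal sketch (a trimmed-down variant of the `clean_text` above, with a hypothetical `lowercase` flag):

```python
import re

def clean_text(text, lowercase=True):
    """Clean raw text; lowercasing is optional because it is task-dependent."""
    if lowercase:
        text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # URLs
    text = re.sub(r'<.*?>', '', text)           # HTML tags
    text = re.sub(r'\s+', ' ', text).strip()    # extra whitespace
    return text

print(clean_text("Visit <b>Apple</b> at https://apple.com", lowercase=False))
# Output: "Visit Apple at" -- case preserved for NER-style tasks
```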
Tokenization splits text into individual units (tokens). The most common approach is word tokenization, but there are others.
Word tokenization splits text into words. The simplest approach is splitting on whitespace:
```python
def simple_tokenize(text):
    """Simple whitespace tokenization"""
    return text.lower().split()

text = "I love machine learning!"
tokens = simple_tokenize(text)
print(tokens)
# Output: ['i', 'love', 'machine', 'learning!']
# Problem: Punctuation is attached to words!
```

A better approach handles punctuation:
```python
import re

def word_tokenize(text):
    """Tokenize text into words, handling punctuation"""
    # \b\w+\b matches runs of word characters between word boundaries
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

text = "I love machine learning!"
tokens = word_tokenize(text)
print(tokens)
# Output: ['i', 'love', 'machine', 'learning']
# Punctuation is removed
```

Character tokenization splits text into individual characters. It's useful for tasks like text generation or handling typos.
```python
def char_tokenize(text):
    """Tokenize text into characters"""
    return list(text.lower())

text = "hello"
tokens = char_tokenize(text)
print(tokens)
# Output: ['h', 'e', 'l', 'l', 'o']

# Pros: Small vocabulary, handles any word
# Cons: Very long sequences, loses word boundaries
```

Subword tokenization is the modern approach used by BERT, GPT, and other transformer models. It splits words into subword units.
```python
# Example: BPE (Byte Pair Encoding) / WordPiece-style splits
# Word: "playing"      -> Subwords: ["play", "##ing"]
# Word: "unbelievable" -> Subwords: ["un", "##believ", "##able"]

# Benefits:
# - Handles unknown words (break into known subwords)
# - Smaller vocabulary than word-level
# - Captures morphology (play, playing, played share "play")
```

A vocabulary maps words to integer indices. This is crucial for converting text to numbers.
```python
from collections import Counter

class Vocabulary:
    def __init__(self, min_freq=2, max_vocab=10000):
        # Special tokens
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}
        self.min_freq = min_freq
        self.max_vocab = max_vocab

    def build_from_texts(self, texts):
        """Build vocabulary from list of texts"""
        # Count word frequencies
        word_counts = Counter()
        for text in texts:
            tokens = simple_tokenize(text)
            word_counts.update(tokens)

        # Filter by frequency
        valid_words = [
            word for word, count in word_counts.items()
            if count >= self.min_freq
        ]

        # Sort by frequency and take top max_vocab
        valid_words = sorted(
            valid_words,
            key=lambda w: word_counts[w],
            reverse=True
        )[:self.max_vocab - 2]  # -2 for PAD and UNK

        # Assign indices (starting from 2)
        for idx, word in enumerate(valid_words, start=2):
            self.word2idx[word] = idx
            self.idx2word[idx] = word

    @property
    def vocab_size(self):
        return len(self.word2idx)

# Example usage
texts = [
    "I love machine learning",
    "Machine learning is amazing",
    "I love deep learning"
]

vocab = Vocabulary(min_freq=1, max_vocab=100)
vocab.build_from_texts(texts)
print(f"Vocabulary size: {vocab.vocab_size}")
print(f"Word to index: {vocab.word2idx}")
```

Most vocabularies include special tokens:
| Token | Index | Purpose |
|---|---|---|
| <PAD> | 0 | Padding token (for variable-length sequences) |
| <UNK> | 1 | Unknown words (not in vocabulary) |
| <SOS> | 2 | Start of sequence (for generation tasks) |
| <EOS> | 3 | End of sequence (for generation tasks) |
| <MASK> | 4 | Masked token (for BERT-style pretraining) |
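If your task needs generation or masking, reserve all of these tokens up front so their indices are fixed before any real words are added. A minimal sketch (the exact token set depends on your model):

```python
SPECIAL_TOKENS = ["<PAD>", "<UNK>", "<SOS>", "<EOS>", "<MASK>"]

def init_special_tokens():
    """Build initial word2idx/idx2word maps with special tokens reserved."""
    word2idx = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    idx2word = {idx: tok for idx, tok in enumerate(SPECIAL_TOKENS)}
    return word2idx, idx2word

word2idx, idx2word = init_special_tokens()
print(word2idx)
# {'<PAD>': 0, '<UNK>': 1, '<SOS>': 2, '<EOS>': 3, '<MASK>': 4}
```

Regular words would then be assigned indices starting from `len(SPECIAL_TOKENS)`.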
```python
# Analyze your data first
import numpy as np
from collections import Counter

def analyze_vocabulary(texts):
    """Analyze vocabulary statistics"""
    word_counts = Counter()
    for text in texts:
        tokens = simple_tokenize(text)
        word_counts.update(tokens)

    total_words = sum(word_counts.values())
    unique_words = len(word_counts)
    print(f"Total words: {total_words:,}")
    print(f"Unique words: {unique_words:,}")

    # Coverage analysis
    sorted_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
    for vocab_size in [1000, 5000, 10000, 20000]:
        if vocab_size <= len(sorted_words):
            covered = sum(count for _, count in sorted_words[:vocab_size])
            coverage = covered / total_words * 100
            print(f"Top {vocab_size} words cover {coverage:.1f}% of text")

# Example output:
# Total words: 1,000,000
# Unique words: 50,000
# Top 1000 words cover 75.2% of text
# Top 5000 words cover 89.1% of text
# Top 10000 words cover 93.8% of text
# Top 20000 words cover 96.2% of text
```

Rule of thumb: Choose vocabulary size to cover 90-95% of your text. Beyond that, you're mostly adding noise.
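Following that rule of thumb, you can pick the vocabulary size programmatically instead of eyeballing the printout. A sketch, where `target` is the coverage fraction you want (a helper name of my own, not from a library):

```python
from collections import Counter

def min_vocab_for_coverage(texts, target=0.95):
    """Smallest vocabulary size whose top words cover `target` of all tokens."""
    word_counts = Counter()
    for text in texts:
        word_counts.update(text.lower().split())
    total = sum(word_counts.values())
    covered = 0
    # Walk words from most to least frequent, accumulating coverage
    for size, (_, count) in enumerate(word_counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return size
    return len(word_counts)

texts = ["the cat sat", "the dog sat", "the cat ran"]
print(min_vocab_for_coverage(texts, target=0.5))  # 2 ("the" + one more word)
```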
Convert words to integer indices using the vocabulary.
```python
class Vocabulary:
    # ... (previous methods)

    def encode(self, text):
        """Convert text to list of indices"""
        tokens = simple_tokenize(text)
        return [self.word2idx.get(token, 1) for token in tokens]  # 1 = <UNK>

    def decode(self, indices):
        """Convert indices back to text"""
        tokens = [self.idx2word.get(idx, "<UNK>") for idx in indices]
        return " ".join(tokens)

# Example (min_freq=1 so single-occurrence words make it into the vocab)
vocab = Vocabulary(min_freq=1)
vocab.build_from_texts(["I love machine learning"])

text = "I love deep learning"
encoded = vocab.encode(text)
print(f"Original: {text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {vocab.decode(encoded)}")
# Output:
# Original: I love deep learning
# Encoded: [2, 3, 1, 5]  # "deep" is unknown (1)
# Decoded: i love <UNK> learning
```

Neural networks need fixed-size inputs, but sentences have different lengths. We solve this with padding and truncation.
```python
import numpy as np

def pad_sequences(sequences, max_len=None, padding='post', truncating='post', value=0):
    """
    Pad sequences to uniform length

    Args:
        sequences: List of sequences (lists of integers)
        max_len: Maximum length (if None, use longest sequence)
        padding: 'pre' or 'post' (where to add padding)
        truncating: 'pre' or 'post' (where to truncate)
        value: Padding value (usually 0)

    Returns:
        Padded sequences as numpy array
    """
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)

    padded = []
    for seq in sequences:
        # Truncate if too long
        if len(seq) > max_len:
            if truncating == 'post':
                seq = seq[:max_len]
            else:  # 'pre'
                seq = seq[-max_len:]

        # Pad if too short
        if len(seq) < max_len:
            pad_len = max_len - len(seq)
            if padding == 'post':
                seq = seq + [value] * pad_len
            else:  # 'pre'
                seq = [value] * pad_len + seq

        padded.append(seq)

    return np.array(padded)

# Example
sequences = [
    [1, 2, 3],        # "I love you"
    [4, 5, 6, 7, 8],  # "Machine learning is very cool"
    [9, 10]           # "Hello world"
]

padded = pad_sequences(sequences, max_len=5, padding='post')
print(padded)
# Output:
# [[ 1  2  3  0  0]
#  [ 4  5  6  7  8]
#  [ 9 10  0  0  0]]
```

How do you choose `max_len`? Analyze the length distribution of your data first:

```python
def analyze_sequence_lengths(texts):
    """Analyze sequence length distribution"""
    lengths = [len(simple_tokenize(text)) for text in texts]

    print(f"Min length: {min(lengths)}")
    print(f"Max length: {max(lengths)}")
    print(f"Mean length: {np.mean(lengths):.1f}")
    print(f"Median length: {np.median(lengths):.1f}")

    # Percentiles
    for p in [50, 75, 90, 95, 99]:
        print(f"{p}th percentile: {np.percentile(lengths, p):.0f}")

    return lengths

# Example output:
# Min length: 3
# Max length: 127
# Mean length: 12.4
# Median length: 11.0
# 50th percentile: 11
# 75th percentile: 16
# 90th percentile: 22
# 95th percentile: 28
# 99th percentile: 45

# Choose max_len to cover 95-99% of sequences
# For this data: max_len=30 would be good
```

Padding can go before or after the tokens:

| Padding Type | Example | Use Case |
|---|---|---|
| Post (default) | [1, 2, 3, 0, 0] | Most tasks (classification, etc.) |
| Pre | [0, 0, 1, 2, 3] | Generation tasks (important words at end) |
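To see the difference concretely, here's pre- vs post-padding on the same sequence. A minimal standalone sketch (it doesn't reuse the `pad_sequences` helper above, just a single-sequence `pad` for illustration):

```python
def pad(seq, max_len, where="post", value=0):
    """Pad a single sequence to max_len on the chosen side."""
    fill = [value] * (max_len - len(seq))
    return seq + fill if where == "post" else fill + seq

print(pad([1, 2, 3], 5, where="post"))  # [1, 2, 3, 0, 0]
print(pad([1, 2, 3], 5, where="pre"))   # [0, 0, 1, 2, 3]
```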
For most NLP tasks, use post-padding: the model learns to ignore padding tokens at the end. For generation tasks, pre-padding can be better, since the last token is often the most important for prediction.

Let's put it all together in a complete pipeline:
```python
import torch
from torch.utils.data import Dataset, DataLoader

class TextProcessor:
    def __init__(self, min_freq=2, max_vocab=10000, max_len=50):
        self.vocab = Vocabulary(min_freq, max_vocab)
        self.max_len = max_len

    def fit(self, texts):
        """Build vocabulary from training texts"""
        cleaned_texts = [clean_text(text) for text in texts]
        self.vocab.build_from_texts(cleaned_texts)
        return self

    def transform(self, texts):
        """Convert texts to padded sequences"""
        # Clean and encode
        sequences = []
        for text in texts:
            cleaned = clean_text(text)
            encoded = self.vocab.encode(cleaned)
            sequences.append(encoded)

        # Pad sequences
        padded = pad_sequences(
            sequences,
            max_len=self.max_len,
            padding='post',
            value=0  # <PAD> token
        )
        return torch.tensor(padded, dtype=torch.long)

    def fit_transform(self, texts):
        """Fit and transform in one step"""
        return self.fit(texts).transform(texts)

class TextDataset(Dataset):
    def __init__(self, texts, labels, processor=None):
        if processor is None:
            processor = TextProcessor()
            self.X = processor.fit_transform(texts)
        else:
            self.X = processor.transform(texts)
        self.y = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Usage
train_texts = ["I love this movie", "This film is terrible", ...]
train_labels = [1, 0, ...]  # 1=positive, 0=negative

# Create processor, fit it on training data, then build the dataset
processor = TextProcessor(max_len=30)
processor.fit(train_texts)
train_dataset = TextDataset(train_texts, train_labels, processor)

# Create data loader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Use in training loop
for batch_x, batch_y in train_loader:
    # batch_x: (32, 30) - padded sequences
    # batch_y: (32,) - labels
    logits = model(batch_x)
    loss = criterion(logits, batch_y)
    # ...
```

When using RNNs with very different sequence lengths, packed sequences can improve efficiency:
```python
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

def create_packed_batch(sequences, lengths):
    """Create packed sequence for efficient RNN processing"""
    # Sort by length (required for packing)
    sorted_lengths, sorted_idx = lengths.sort(0, descending=True)
    sorted_sequences = sequences[sorted_idx]

    # Pack sequences
    packed = pack_padded_sequence(
        sorted_sequences,
        sorted_lengths.cpu(),
        batch_first=True
    )
    return packed, sorted_idx

# In your RNN forward pass:
def forward(self, x, lengths):
    packed, sorted_idx = create_packed_batch(x, lengths)
    packed_out, (h_n, c_n) = self.lstm(packed)

    # Unpack and restore the original batch order
    output, _ = pad_packed_sequence(packed_out, batch_first=True)
    unsorted_idx = sorted_idx.argsort()
    output = output[unsorted_idx]
    return output
```

For transformer models, create attention masks to ignore padding tokens:
```python
import torch

def create_attention_mask(sequences, pad_token_id=0):
    """Create attention mask (1 for real tokens, 0 for padding)"""
    return (sequences != pad_token_id).float()

# Example
sequences = torch.tensor([
    [1, 2, 3, 0, 0],  # "I love you <PAD> <PAD>"
    [4, 5, 6, 7, 0]   # "This is great movie <PAD>"
])

mask = create_attention_mask(sequences)
print(mask)
# Output:
# tensor([[1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.]])

# Use in a transformer:
# output = transformer(sequences, attention_mask=mask)
```

```python
# ❌ Wrong: Building vocab from all data
all_texts = train_texts + val_texts + test_texts
vocab.build_from_texts(all_texts)  # Data leakage!

# ✅ Correct: Build vocab only from training data
vocab.build_from_texts(train_texts)
# Val/test may have unknown words - that's realistic!
```
```python
# ❌ Wrong: Different preprocessing for train/test
train_clean = [text.lower() for text in train_texts]
test_clean = [text.upper() for text in test_texts]  # Different!

# ✅ Correct: Same preprocessing everywhere
processor = TextProcessor()
processor.fit(train_texts)
train_X = processor.transform(train_texts)
test_X = processor.transform(test_texts)  # Same preprocessing
```

```python
# ❌ Wrong: Arbitrary choice
max_len = 100  # Why 100? No analysis!

# ✅ Correct: Data-driven choice
lengths = analyze_sequence_lengths(train_texts)
max_len = int(np.percentile(lengths, 95))  # Cover 95% of data
print(f"Chosen max_len: {max_len}")
```

```python
# Memory usage scales with:
# - Vocabulary size (embedding parameters)
# - Sequence length (computation and memory)
# - Batch size (memory)

# Example memory calculation:
vocab_size = 10000
embed_dim = 128
max_len = 50
batch_size = 32

# Embedding layer: vocab_size * embed_dim * 4 bytes (float32)
embedding_memory = vocab_size * embed_dim * 4 / (1024**2)  # MB
print(f"Embedding memory: {embedding_memory:.1f} MB")
# Embedding memory: 4.9 MB

# Batch memory: batch_size * max_len * embed_dim * 4 bytes
batch_memory = batch_size * max_len * embed_dim * 4 / (1024**2)  # MB
print(f"Batch memory: {batch_memory:.1f} MB")
# Batch memory: 0.8 MB
```

- Smaller vocabulary: Reduces embedding layer size
- Shorter sequences: Less computation in RNNs/Transformers
- Larger batches: Better GPU utilization (up to memory limits)
- Packed sequences: Skip computation on padding (RNNs only)
- Analyze your data first: Understand length distributions and vocabulary
- Build vocabulary only from training data: Avoid data leakage
- Choose max_len to cover 95-99% of sequences: Balance coverage and efficiency
- Use consistent preprocessing: Same pipeline for train/val/test
- Reserve index 0 for padding: Makes masking easier
- Filter vocabulary by frequency: Remove rare words (noise)
- Consider subword tokenization: For handling unknown words
- Monitor memory usage: Especially with large vocabularies/sequences
Text preprocessing and tokenization are foundational skills for NLP. While they might seem mundane compared to designing neural architectures, they can make or break your model's performance.
Key takeaways:
- Clean text appropriately for your task (don't over-clean)
- Build vocabulary from training data only to avoid leakage
- Choose sequence length based on data analysis, not arbitrary numbers
- Use padding and truncation to handle variable-length sequences
- Be consistent in preprocessing across train/val/test splits
Master these fundamentals, and you'll have a solid foundation for any NLP project, from simple classification to complex language generation.