Text Preprocessing and Tokenization for NLP: A Complete Guide
Master text preprocessing and tokenization for NLP. Learn about vocabulary building, padding, truncation, and handling variable-length sequences in deep learning models.
Before you can train a neural network on text, you need to convert raw text into a format the model can understand. This process—text preprocessing and tokenization—is often overlooked but critically important. Poor preprocessing can tank your model's performance, while good preprocessing can give you a significant boost.
In this guide, we'll cover everything you need to know about preparing text for deep learning models.
Here's the typical pipeline for processing text:

1. Clean the raw text (strip URLs, HTML tags, extra whitespace)
2. Tokenize it into words, characters, or subwords
3. Build a vocabulary mapping tokens to integer indices
4. Encode each text as a sequence of indices
5. Pad or truncate sequences to a uniform length
Let's walk through each step with practical examples.
Raw text is messy. It contains special characters, HTML tags, URLs, and inconsistent formatting. Cleaning prepares text for tokenization.
```python
import re

def clean_text(text):
    """
    Clean raw text for NLP processing
    """
    # 1. Lowercase (optional - depends on task)
    text = text.lower()

    # 2. Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    # 3. Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # 4. Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # 5. Remove special characters (keep letters, numbers, spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # 6. Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Example
raw = "Check out https://example.com! Email: test@email.com <b>Amazing</b> product!!!"
cleaned = clean_text(raw)
print(cleaned)
# Output: "check out email amazing product"
# (only the address is removed; the label word "Email" survives as "email")
```

Should you lowercase? It depends on the task:

| Task | Lowercase? | Reason |
|---|---|---|
| Sentiment analysis | Yes | "GREAT" and "great" have same sentiment |
| Named entity recognition | No | "Apple" (company) vs "apple" (fruit) |
| Question answering | No | Proper nouns are important |
| Text classification | Usually yes | Reduces vocabulary size |
| Machine translation | No | Case carries meaning in many languages |
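Because lowercasing is task-dependent, it's common to make it a switch on the cleaning function. Here's a minimal sketch (a trimmed-down variant of the `clean_text` above, with a hypothetical `lowercase` flag):

```python
import re

def clean_text(text, lowercase=True):
    """Clean raw text; lowercasing is optional because it is task-dependent."""
    if lowercase:
        text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # URLs
    text = re.sub(r'<.*?>', '', text)           # HTML tags
    text = re.sub(r'\s+', ' ', text).strip()    # extra whitespace
    return text

print(clean_text("Visit <b>Apple</b> at https://apple.com", lowercase=False))
# Output: "Visit Apple at" -- case preserved for NER-style tasks
```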
Tokenization splits text into individual units (tokens). The most common approach is word tokenization, but there are others.
Word tokenization splits text into words. The simplest approach is splitting on whitespace:
```python
def simple_tokenize(text):
    """Simple whitespace tokenization"""
    return text.lower().split()

text = "I love machine learning!"
tokens = simple_tokenize(text)
print(tokens)
# Output: ['i', 'love', 'machine', 'learning!']
# Problem: Punctuation is attached to words!
```

A better approach handles punctuation:
```python
import re

def word_tokenize(text):
    """Tokenize text into words, handling punctuation"""
    # \b\w+\b matches runs of word characters between word boundaries
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

text = "I love machine learning!"
tokens = word_tokenize(text)
print(tokens)
# Output: ['i', 'love', 'machine', 'learning']
# Punctuation is removed
```

Character tokenization splits text into individual characters. It's useful for tasks like text generation or handling typos.
```python
def char_tokenize(text):
    """Tokenize text into characters"""
    return list(text.lower())

text = "hello"
tokens = char_tokenize(text)
print(tokens)
# Output: ['h', 'e', 'l', 'l', 'o']

# Pros: Small vocabulary, handles any word
# Cons: Very long sequences, loses word boundaries
```

Subword tokenization is the modern approach used by BERT, GPT, and other transformer models. It splits words into subword units.
```python
# Example: BPE (Byte Pair Encoding) / WordPiece-style splits
# Word: "playing"      -> Subwords: ["play", "##ing"]
# Word: "unbelievable" -> Subwords: ["un", "##believ", "##able"]

# Benefits:
# - Handles unknown words (break into known subwords)
# - Smaller vocabulary than word-level
# - Captures morphology (play, playing, played share "play")
```

A vocabulary maps words to integer indices. This is crucial for converting text to numbers.
```python
from collections import Counter

class Vocabulary:
    def __init__(self, min_freq=2, max_vocab=10000):
        # Special tokens
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}
        self.min_freq = min_freq
        self.max_vocab = max_vocab

    def build_from_texts(self, texts):
        """Build vocabulary from list of texts"""
        # Count word frequencies
        word_counts = Counter()
        for text in texts:
            tokens = simple_tokenize(text)
            word_counts.update(tokens)

        # Filter by frequency
        valid_words = [
            word for word, count in word_counts.items()
            if count >= self.min_freq
        ]

        # Sort by frequency and take top max_vocab
        valid_words = sorted(
            valid_words,
            key=lambda w: word_counts[w],
            reverse=True
        )[:self.max_vocab - 2]  # -2 for PAD and UNK

        # Assign indices (starting from 2)
        for idx, word in enumerate(valid_words, start=2):
            self.word2idx[word] = idx
            self.idx2word[idx] = word

    @property
    def vocab_size(self):
        return len(self.word2idx)

# Example usage
texts = [
    "I love machine learning",
    "Machine learning is amazing",
    "I love deep learning"
]

vocab = Vocabulary(min_freq=1, max_vocab=100)
vocab.build_from_texts(texts)
print(f"Vocabulary size: {vocab.vocab_size}")
print(f"Word to index: {vocab.word2idx}")
```

Most vocabularies include special tokens:
| Token | Index | Purpose |
|---|---|---|
| <PAD> | 0 | Padding token (for variable-length sequences) |
| <UNK> | 1 | Unknown words (not in vocabulary) |
| <SOS> | 2 | Start of sequence (for generation tasks) |
| <EOS> | 3 | End of sequence (for generation tasks) |
| <MASK> | 4 | Masked token (for BERT-style pretraining) |
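If your task needs generation or masking, reserve all of these tokens up front so their indices are fixed before any real words are added. A minimal sketch (the exact token set depends on your model):

```python
SPECIAL_TOKENS = ["<PAD>", "<UNK>", "<SOS>", "<EOS>", "<MASK>"]

def init_special_tokens():
    """Build initial word2idx/idx2word maps with special tokens reserved."""
    word2idx = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    idx2word = {idx: tok for idx, tok in enumerate(SPECIAL_TOKENS)}
    return word2idx, idx2word

word2idx, idx2word = init_special_tokens()
print(word2idx)
# {'<PAD>': 0, '<UNK>': 1, '<SOS>': 2, '<EOS>': 3, '<MASK>': 4}
```

Regular words would then be assigned indices starting from `len(SPECIAL_TOKENS)`.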
```python
# Analyze your data first
import numpy as np
from collections import Counter

def analyze_vocabulary(texts):
    """Analyze vocabulary statistics"""
    word_counts = Counter()
    for text in texts:
        tokens = simple_tokenize(text)
        word_counts.update(tokens)

    total_words = sum(word_counts.values())
    unique_words = len(word_counts)
    print(f"Total words: {total_words:,}")
    print(f"Unique words: {unique_words:,}")

    # Coverage analysis
    sorted_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
    for vocab_size in [1000, 5000, 10000, 20000]:
        if vocab_size <= len(sorted_words):
            covered = sum(count for _, count in sorted_words[:vocab_size])
            coverage = covered / total_words * 100
            print(f"Top {vocab_size} words cover {coverage:.1f}% of text")

# Example output:
# Total words: 1,000,000
# Unique words: 50,000
# Top 1000 words cover 75.2% of text
# Top 5000 words cover 89.1% of text
# Top 10000 words cover 93.8% of text
# Top 20000 words cover 96.2% of text
```

Rule of thumb: Choose vocabulary size to cover 90-95% of your text. Beyond that, you're mostly adding noise.
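Following that rule of thumb, you can pick the vocabulary size programmatically instead of eyeballing the printout. A sketch, where `target` is the coverage fraction you want (a helper name of my own, not from a library):

```python
from collections import Counter

def min_vocab_for_coverage(texts, target=0.95):
    """Smallest vocabulary size whose top words cover `target` of all tokens."""
    word_counts = Counter()
    for text in texts:
        word_counts.update(text.lower().split())
    total = sum(word_counts.values())
    covered = 0
    # Walk words from most to least frequent, accumulating coverage
    for size, (_, count) in enumerate(word_counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return size
    return len(word_counts)

texts = ["the cat sat", "the dog sat", "the cat ran"]
print(min_vocab_for_coverage(texts, target=0.5))  # 2 ("the" + one more word)
```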
Convert words to integer indices using the vocabulary.
```python
class Vocabulary:
    # ... (previous methods)

    def encode(self, text):
        """Convert text to list of indices"""
        tokens = simple_tokenize(text)
        return [self.word2idx.get(token, 1) for token in tokens]  # 1 = <UNK>

    def decode(self, indices):
        """Convert indices back to text"""
        tokens = [self.idx2word.get(idx, "<UNK>") for idx in indices]
        return " ".join(tokens)

# Example (min_freq=1 so single-occurrence words make it into the vocab)
vocab = Vocabulary(min_freq=1)
vocab.build_from_texts(["I love machine learning"])

text = "I love deep learning"
encoded = vocab.encode(text)
print(f"Original: {text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {vocab.decode(encoded)}")
# Output:
# Original: I love deep learning
# Encoded: [2, 3, 1, 5]  # "deep" is unknown (1)
# Decoded: i love <UNK> learning
```

Neural networks need fixed-size inputs, but sentences have different lengths. We solve this with padding and truncation.
```python
import numpy as np

def pad_sequences(sequences, max_len=None, padding='post', truncating='post', value=0):
    """
    Pad sequences to uniform length

    Args:
        sequences: List of sequences (lists of integers)
        max_len: Maximum length (if None, use longest sequence)
        padding: 'pre' or 'post' (where to add padding)
        truncating: 'pre' or 'post' (where to truncate)
        value: Padding value (usually 0)

    Returns:
        Padded sequences as numpy array
    """
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)

    padded = []
    for seq in sequences:
        # Truncate if too long
        if len(seq) > max_len:
            if truncating == 'post':
                seq = seq[:max_len]
            else:  # 'pre'
                seq = seq[-max_len:]

        # Pad if too short
        if len(seq) < max_len:
            pad_len = max_len - len(seq)
            if padding == 'post':
                seq = seq + [value] * pad_len
            else:  # 'pre'
                seq = [value] * pad_len + seq

        padded.append(seq)

    return np.array(padded)

# Example
sequences = [
    [1, 2, 3],        # "I love you"
    [4, 5, 6, 7, 8],  # "Machine learning is very cool"
    [9, 10]           # "Hello world"
]

padded = pad_sequences(sequences, max_len=5, padding='post')
print(padded)
# Output:
# [[ 1  2  3  0  0]
#  [ 4  5  6  7  8]
#  [ 9 10  0  0  0]]
```

How do you choose `max_len`? Analyze the length distribution of your data first:

```python
def analyze_sequence_lengths(texts):
    """Analyze sequence length distribution"""
    lengths = [len(simple_tokenize(text)) for text in texts]

    print(f"Min length: {min(lengths)}")
    print(f"Max length: {max(lengths)}")
    print(f"Mean length: {np.mean(lengths):.1f}")
    print(f"Median length: {np.median(lengths):.1f}")

    # Percentiles
    for p in [50, 75, 90, 95, 99]:
        print(f"{p}th percentile: {np.percentile(lengths, p):.0f}")

    return lengths

# Example output:
# Min length: 3
# Max length: 127
# Mean length: 12.4
# Median length: 11.0
# 50th percentile: 11
# 75th percentile: 16
# 90th percentile: 22
# 95th percentile: 28
# 99th percentile: 45

# Choose max_len to cover 95-99% of sequences
# For this data: max_len=30 would be good
```

Padding can go before or after the tokens:

| Padding Type | Example | Use Case |
|---|---|---|
| Post (default) | [1, 2, 3, 0, 0] | Most tasks (classification, etc.) |
| Pre | [0, 0, 1, 2, 3] | Generation tasks (important words at end) |
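To see the difference concretely, here's pre- vs post-padding on the same sequence. A minimal standalone sketch (it doesn't reuse the `pad_sequences` helper above, just a single-sequence `pad` for illustration):

```python
def pad(seq, max_len, where="post", value=0):
    """Pad a single sequence to max_len on the chosen side."""
    fill = [value] * (max_len - len(seq))
    return seq + fill if where == "post" else fill + seq

print(pad([1, 2, 3], 5, where="post"))  # [1, 2, 3, 0, 0]
print(pad([1, 2, 3], 5, where="pre"))   # [0, 0, 1, 2, 3]
```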
For most NLP tasks, use post-padding: the model learns to ignore padding tokens at the end. For generation tasks, pre-padding can be better, since the last token is often the most important for prediction.

Let's put it all together in a complete pipeline:
```python
import torch
from torch.utils.data import Dataset, DataLoader

class TextProcessor:
    def __init__(self, min_freq=2, max_vocab=10000, max_len=50):
        self.vocab = Vocabulary(min_freq, max_vocab)
        self.max_len = max_len

    def fit(self, texts):
        """Build vocabulary from training texts"""
        cleaned_texts = [clean_text(text) for text in texts]
        self.vocab.build_from_texts(cleaned_texts)
        return self

    def transform(self, texts):
        """Convert texts to padded sequences"""
        # Clean and encode
        sequences = []
        for text in texts:
            cleaned = clean_text(text)
            encoded = self.vocab.encode(cleaned)
            sequences.append(encoded)

        # Pad sequences
        padded = pad_sequences(
            sequences,
            max_len=self.max_len,
            padding='post',
            value=0  # <PAD> token
        )
        return torch.tensor(padded, dtype=torch.long)

    def fit_transform(self, texts):
        """Fit and transform in one step"""
        return self.fit(texts).transform(texts)

class TextDataset(Dataset):
    def __init__(self, texts, labels, processor=None):
        if processor is None:
            processor = TextProcessor()
            self.X = processor.fit_transform(texts)
        else:
            self.X = processor.transform(texts)
        self.y = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Usage
train_texts = ["I love this movie", "This film is terrible", ...]
train_labels = [1, 0, ...]  # 1=positive, 0=negative

# Create processor, fit it on training data, then build the dataset
processor = TextProcessor(max_len=30)
processor.fit(train_texts)
train_dataset = TextDataset(train_texts, train_labels, processor)

# Create data loader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Use in training loop
for batch_x, batch_y in train_loader:
    # batch_x: (32, 30) - padded sequences
    # batch_y: (32,) - labels
    logits = model(batch_x)
    loss = criterion(logits, batch_y)
    # ...
```

When using RNNs with very different sequence lengths, packed sequences can improve efficiency:
```python
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

def create_packed_batch(sequences, lengths):
    """Create packed sequence for efficient RNN processing"""
    # Sort by length (required for packing)
    sorted_lengths, sorted_idx = lengths.sort(0, descending=True)
    sorted_sequences = sequences[sorted_idx]

    # Pack sequences
    packed = pack_padded_sequence(
        sorted_sequences,
        sorted_lengths.cpu(),
        batch_first=True
    )
    return packed, sorted_idx

# In your RNN forward pass:
def forward(self, x, lengths):
    packed, sorted_idx = create_packed_batch(x, lengths)
    packed_out, (h_n, c_n) = self.lstm(packed)

    # Unpack and restore the original batch order
    output, _ = pad_packed_sequence(packed_out, batch_first=True)
    unsorted_idx = sorted_idx.argsort()
    output = output[unsorted_idx]
    return output
```

For transformer models, create attention masks to ignore padding tokens:
```python
import torch

def create_attention_mask(sequences, pad_token_id=0):
    """Create attention mask (1 for real tokens, 0 for padding)"""
    return (sequences != pad_token_id).float()

# Example
sequences = torch.tensor([
    [1, 2, 3, 0, 0],  # "I love you <PAD> <PAD>"
    [4, 5, 6, 7, 0]   # "This is great movie <PAD>"
])

mask = create_attention_mask(sequences)
print(mask)
# Output:
# tensor([[1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.]])

# Use in a transformer:
# output = transformer(sequences, attention_mask=mask)
```

```python
# ❌ Wrong: Building vocab from all data
all_texts = train_texts + val_texts + test_texts
vocab.build_from_texts(all_texts)  # Data leakage!

# ✅ Correct: Build vocab only from training data
vocab.build_from_texts(train_texts)
# Val/test may have unknown words - that's realistic!
```
```python
# ❌ Wrong: Different preprocessing for train/test
train_clean = [text.lower() for text in train_texts]
test_clean = [text.upper() for text in test_texts]  # Different!

# ✅ Correct: Same preprocessing everywhere
processor = TextProcessor()
processor.fit(train_texts)
train_X = processor.transform(train_texts)
test_X = processor.transform(test_texts)  # Same preprocessing
```

```python
# ❌ Wrong: Arbitrary choice
max_len = 100  # Why 100? No analysis!

# ✅ Correct: Data-driven choice
lengths = analyze_sequence_lengths(train_texts)
max_len = int(np.percentile(lengths, 95))  # Cover 95% of data
print(f"Chosen max_len: {max_len}")
```

```python
# Memory usage scales with:
# - Vocabulary size (embedding parameters)
# - Sequence length (computation and memory)
# - Batch size (memory)

# Example memory calculation:
vocab_size = 10000
embed_dim = 128
max_len = 50
batch_size = 32

# Embedding layer: vocab_size * embed_dim * 4 bytes (float32)
embedding_memory = vocab_size * embed_dim * 4 / (1024**2)  # MB
print(f"Embedding memory: {embedding_memory:.1f} MB")
# Embedding memory: 4.9 MB

# Batch memory: batch_size * max_len * embed_dim * 4 bytes
batch_memory = batch_size * max_len * embed_dim * 4 / (1024**2)  # MB
print(f"Batch memory: {batch_memory:.1f} MB")
# Batch memory: 0.8 MB
```

- Smaller vocabulary: Reduces embedding layer size
- Shorter sequences: Less computation in RNNs/Transformers
- Larger batches: Better GPU utilization (up to memory limits)
- Packed sequences: Skip computation on padding (RNNs only)
- Analyze your data first: Understand length distributions and vocabulary
- Build vocabulary only from training data: Avoid data leakage
- Choose max_len to cover 95-99% of sequences: Balance coverage and efficiency
- Use consistent preprocessing: Same pipeline for train/val/test
- Reserve index 0 for padding: Makes masking easier
- Filter vocabulary by frequency: Remove rare words (noise)
- Consider subword tokenization: For handling unknown words
- Monitor memory usage: Especially with large vocabularies/sequences
Text preprocessing and tokenization are foundational skills for NLP. While they might seem mundane compared to designing neural architectures, they can make or break your model's performance.
Key takeaways:
- Clean text appropriately for your task (don't over-clean)
- Build vocabulary from training data only to avoid leakage
- Choose sequence length based on data analysis, not arbitrary numbers
- Use padding and truncation to handle variable-length sequences
- Be consistent in preprocessing across train/val/test splits
Master these fundamentals, and you'll have a solid foundation for any NLP project, from simple classification to complex language generation.