Text Preprocessing and Tokenization for NLP: A Complete Guide

Master text preprocessing and tokenization for NLP. Learn about vocabulary building, padding, truncation, and handling variable-length sequences in deep learning models.

Before you can train a neural network on text, you need to convert raw text into a format the model can understand. This process—text preprocessing and tokenization—is often overlooked but critically important. Poor preprocessing can tank your model's performance, while good preprocessing can give you a significant boost.

In this guide, we'll cover everything you need to know about preparing text for deep learning models.

Here's the typical pipeline for processing text: clean the raw text, tokenize it into units, build a vocabulary, encode tokens as integer indices, and finally pad or truncate to a fixed length.

Let's walk through each step with practical examples.

Raw text is messy. It contains special characters, HTML tags, URLs, and inconsistent formatting. Cleaning prepares text for tokenization.

python
import re

def clean_text(text):
    """
    Clean raw text for NLP processing
    """
    # 1. Lowercase (optional - depends on task)
    text = text.lower()
    
    # 2. Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # 3. Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # 4. Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
    # 5. Remove special characters (keep letters, numbers, spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    
    # 6. Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Example
raw = "Check out https://example.com! Email: test@email.com <b>Amazing</b> product!!!"
cleaned = clean_text(raw)
print(cleaned)
# Output: "check out email amazing product"
# (the "email" label survives - only the address itself matched the regex)

Important: Don't over-clean! Removing too much information can hurt performance. For example, keeping punctuation might help with sentiment analysis ("Great!" vs "Great.").
Task | Lowercase? | Reason
Sentiment analysis | Yes | "GREAT" and "great" have the same sentiment
Named entity recognition | No | "Apple" (company) vs "apple" (fruit)
Question answering | No | Proper nouns are important
Text classification | Usually yes | Reduces vocabulary size
Machine translation | No | Case carries meaning in many languages
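
Since the right choice is task-dependent, it helps to make lowercasing a flag instead of hard-coding it. A minimal sketch (`clean_text_v2` and its `lowercase` parameter are illustrative additions, not part of the earlier function):

```python
import re

def clean_text_v2(text, lowercase=True):
    """Clean text; lowercasing is optional because it is task-dependent."""
    if lowercase:
        text = text.lower()                      # e.g. sentiment, classification
    text = re.sub(r'http\S+|www\S+', '', text)   # strip URLs
    text = re.sub(r'\s+', ' ', text).strip()     # collapse whitespace
    return text

print(clean_text_v2("Visit https://example.com  NOW", lowercase=False))
# Output: "Visit NOW"
```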

Tokenization splits text into individual units (tokens). The most common approach is word tokenization, but there are others.

Split text into words. The simplest approach is splitting on whitespace:

python
def simple_tokenize(text):
    """Simple whitespace tokenization"""
    return text.lower().split()

text = "I love machine learning!"
tokens = simple_tokenize(text)
print(tokens)
# Output: ['i', 'love', 'machine', 'learning!']

# Problem: Punctuation is attached to words!

A better approach handles punctuation:

python
import re

def word_tokenize(text):
    """Tokenize text into words, handling punctuation"""
    # Split on word boundaries
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

text = "I love machine learning!"
tokens = word_tokenize(text)
print(tokens)
# Output: ['i', 'love', 'machine', 'learning']
# Punctuation is removed

Split text into individual characters. Useful for tasks like text generation or handling typos.

python
def char_tokenize(text):
    """Tokenize text into characters"""
    return list(text.lower())

text = "hello"
tokens = char_tokenize(text)
print(tokens)
# Output: ['h', 'e', 'l', 'l', 'o']

# Pros: Small vocabulary, handles any word
# Cons: Very long sequences, loses word boundaries

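The sequence-length cost of character tokenization shows up even on a single sentence:

```python
sentence = "the quick brown fox jumps over the lazy dog"
word_tokens = sentence.split()   # word-level tokens
char_tokens = list(sentence)     # character-level tokens (spaces included)
print(len(word_tokens), len(char_tokens))
# Output: 9 43
```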
Modern approach used by BERT, GPT, etc. Splits words into subword units. BERT uses WordPiece (continuation pieces carry a "##" prefix), while GPT models use BPE (Byte Pair Encoding); both share the same core idea.

python
# Example: WordPiece-style subwords ("##" marks a continuation piece)
# Word: "playing"
# Subwords: ["play", "##ing"]

# Word: "unbelievable"
# Subwords: ["un", "##believ", "##able"]

# Benefits:
# - Handles unknown words (break into known subwords)
# - Smaller vocabulary than word-level
# - Captures morphology (play, playing, played share "play")

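To make the subword idea concrete, here is a toy greedy longest-match tokenizer in the WordPiece style. The vocabulary below is hand-picked for illustration; real tokenizers learn theirs from a corpus:

```python
def wordpiece_tokenize(word, vocab):
    """Greedily match the longest known subword at each position.
    Continuation pieces are prefixed with '##', WordPiece-style."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # mark as a continuation piece
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["<UNK>"]              # no known subword fits here
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "un", "##believ", "##able"}
print(wordpiece_tokenize("playing", vocab))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
```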
A vocabulary maps words to integer indices. This is crucial for converting text to numbers.

python
from collections import Counter

class Vocabulary:
    def __init__(self, min_freq=2, max_vocab=10000):
        # Special tokens
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}
        self.min_freq = min_freq
        self.max_vocab = max_vocab
    
    def build_from_texts(self, texts):
        """Build vocabulary from list of texts"""
        # Count word frequencies
        word_counts = Counter()
        for text in texts:
            tokens = simple_tokenize(text)
            word_counts.update(tokens)
        
        # Filter by frequency
        valid_words = [
            word for word, count in word_counts.items()
            if count >= self.min_freq
        ]
        
        # Sort by frequency and take top max_vocab
        valid_words = sorted(
            valid_words,
            key=lambda w: word_counts[w],
            reverse=True
        )[:self.max_vocab - 2]  # -2 for PAD and UNK
        
        # Assign indices (starting from 2)
        for idx, word in enumerate(valid_words, start=2):
            self.word2idx[word] = idx
            self.idx2word[idx] = word
    
    @property
    def vocab_size(self):
        return len(self.word2idx)

# Example usage
texts = [
    "I love machine learning",
    "Machine learning is amazing",
    "I love deep learning"
]

vocab = Vocabulary(min_freq=1, max_vocab=100)
vocab.build_from_texts(texts)

print(f"Vocabulary size: {vocab.vocab_size}")
print(f"Word to index: {vocab.word2idx}")

Most vocabularies include special tokens:

Token | Index | Purpose
<PAD> | 0 | Padding token (for variable-length sequences)
<UNK> | 1 | Unknown words (not in vocabulary)
<SOS> | 2 | Start of sequence (for generation tasks)
<EOS> | 3 | End of sequence (for generation tasks)
<MASK> | 4 | Masked token (for BERT-style pretraining)
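
For generation tasks, encoding usually wraps each sequence in <SOS>/<EOS>. A small sketch using the index assignments from the table (`encode_for_generation` is an illustrative helper, not a library function):

```python
SPECIALS = {"<PAD>": 0, "<UNK>": 1, "<SOS>": 2, "<EOS>": 3}

def encode_for_generation(tokens, word2idx):
    """Map tokens to indices and wrap with start/end markers."""
    ids = [word2idx.get(t, SPECIALS["<UNK>"]) for t in tokens]
    return [SPECIALS["<SOS>"]] + ids + [SPECIALS["<EOS>"]]

word2idx = {"hello": 4, "world": 5}  # toy vocabulary; 0-3 reserved for specials
print(encode_for_generation(["hello", "world"], word2idx))
# Output: [2, 4, 5, 3]
```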

Best Practice: Always reserve index 0 for padding. This makes it easy to ignore padding in loss calculations and attention mechanisms.
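For token-level objectives (sequence labeling, language modeling), reserving 0 for <PAD> pays off directly: PyTorch's CrossEntropyLoss can skip padded positions via its ignore_index argument. A quick sketch with random logits:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# (batch, seq_len, num_classes) logits and padded token-level targets
logits = torch.randn(2, 5, 10)
targets = torch.tensor([[3, 1, 4, 0, 0],
                        [2, 2, 5, 7, 0]])  # 0 = <PAD>

# ignore_index=0 excludes padded positions from the loss average
criterion = nn.CrossEntropyLoss(ignore_index=0)
loss = criterion(logits.reshape(-1, 10), targets.reshape(-1))
print(loss.item())
```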
How large should the vocabulary be? Analyze your data before choosing:

python
# Analyze your data first
import numpy as np
from collections import Counter

def analyze_vocabulary(texts):
    """Analyze vocabulary statistics"""
    word_counts = Counter()
    for text in texts:
        tokens = simple_tokenize(text)
        word_counts.update(tokens)
    
    total_words = sum(word_counts.values())
    unique_words = len(word_counts)
    
    print(f"Total words: {total_words:,}")
    print(f"Unique words: {unique_words:,}")
    
    # Coverage analysis
    sorted_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
    
    for vocab_size in [1000, 5000, 10000, 20000]:
        if vocab_size <= len(sorted_words):
            covered = sum(count for _, count in sorted_words[:vocab_size])
            coverage = covered / total_words * 100
            print(f"Top {vocab_size} words cover {coverage:.1f}% of text")

# Example output:
# Total words: 1,000,000
# Unique words: 50,000
# Top 1000 words cover 75.2% of text
# Top 5000 words cover 89.1% of text
# Top 10000 words cover 93.8% of text
# Top 20000 words cover 96.2% of text

Rule of thumb: Choose vocabulary size to cover 90-95% of your text. Beyond that, you're mostly adding noise.
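
This rule of thumb is easy to automate: pick the smallest vocabulary that reaches a target coverage (`choose_vocab_size` is an illustrative helper, not a library function):

```python
from collections import Counter

def choose_vocab_size(word_counts, target_coverage=0.95):
    """Smallest number of most-frequent words reaching target coverage."""
    total = sum(word_counts.values())
    running = 0
    for i, (_, count) in enumerate(
        sorted(word_counts.items(), key=lambda x: x[1], reverse=True), start=1
    ):
        running += count
        if running / total >= target_coverage:
            return i
    return len(word_counts)

counts = Counter({"the": 90, "cat": 5, "sat": 3, "mat": 2})
print(choose_vocab_size(counts, target_coverage=0.9))   # 1 (top word covers 90%)
print(choose_vocab_size(counts, target_coverage=0.95))  # 2
```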

Convert words to integer indices using the vocabulary.

python
class Vocabulary:
    # ... (previous methods)
    
    def encode(self, text):
        """Convert text to list of indices"""
        tokens = simple_tokenize(text)
        return [self.word2idx.get(token, 1) for token in tokens]  # 1 = <UNK>
    
    def decode(self, indices):
        """Convert indices back to text"""
        tokens = [self.idx2word.get(idx, "<UNK>") for idx in indices]
        return " ".join(tokens)

# Example
vocab = Vocabulary(min_freq=1)  # default min_freq=2 would drop every word in this one-sentence corpus
vocab.build_from_texts(["I love machine learning"])

text = "I love deep learning"
encoded = vocab.encode(text)
print(f"Original: {text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {vocab.decode(encoded)}")

# Output:
# Original: I love deep learning
# Encoded: [2, 3, 1, 5]  # "deep" is unknown (1)
# Decoded: i love <UNK> learning

Neural networks need fixed-size inputs, but sentences have different lengths. We solve this with padding and truncation.

python
def pad_sequences(sequences, max_len=None, padding='post', truncating='post', value=0):
    """
    Pad sequences to uniform length
    
    Args:
        sequences: List of sequences (lists of integers)
        max_len: Maximum length (if None, use longest sequence)
        padding: 'pre' or 'post' (where to add padding)
        truncating: 'pre' or 'post' (where to truncate)
        value: Padding value (usually 0)
    
    Returns:
        Padded sequences as numpy array
    """
    import numpy as np
    
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    
    padded = []
    for seq in sequences:
        # Truncate if too long
        if len(seq) > max_len:
            if truncating == 'post':
                seq = seq[:max_len]
            else:  # 'pre'
                seq = seq[-max_len:]
        
        # Pad if too short
        if len(seq) < max_len:
            pad_len = max_len - len(seq)
            if padding == 'post':
                seq = seq + [value] * pad_len
            else:  # 'pre'
                seq = [value] * pad_len + seq
        
        padded.append(seq)
    
    return np.array(padded)

# Example
sequences = [
    [1, 2, 3],           # "I love you"
    [4, 5, 6, 7, 8],     # "Machine learning is very cool"
    [9, 10]              # "Hello world"
]

padded = pad_sequences(sequences, max_len=5, padding='post')
print(padded)
# Output:
# [[1 2 3 0 0]
#  [4 5 6 7 8]
#  [9 10 0 0 0]]

How do you choose max_len? Analyze the sequence-length distribution first:

python
def analyze_sequence_lengths(texts):
    """Analyze sequence length distribution"""
    lengths = [len(simple_tokenize(text)) for text in texts]
    
    print(f"Min length: {min(lengths)}")
    print(f"Max length: {max(lengths)}")
    print(f"Mean length: {np.mean(lengths):.1f}")
    print(f"Median length: {np.median(lengths):.1f}")
    
    # Percentiles
    for p in [50, 75, 90, 95, 99]:
        print(f"{p}th percentile: {np.percentile(lengths, p):.0f}")
    
    return lengths

# Example output:
# Min length: 3
# Max length: 127
# Mean length: 12.4
# Median length: 11.0
# 50th percentile: 11
# 75th percentile: 16
# 90th percentile: 22
# 95th percentile: 28
# 99th percentile: 45

# Choose max_len to cover 95-99% of sequences
# For this data: max_len=30 would be good
Padding Type | Example | Use Case
Post (default) | [1, 2, 3, 0, 0] | Most tasks (classification, etc.)
Pre | [0, 0, 1, 2, 3] | Generation tasks (important words at end)
python
# For most NLP tasks, use post-padding
# The model learns to ignore padding tokens at the end

# For generation tasks, pre-padding might be better
# The last token is often most important for prediction

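The trade-off above is easy to see with a stripped-down padding helper (`pad` is a one-sequence sketch, separate from pad_sequences):

```python
def pad(seq, max_len, mode='post', value=0):
    """Pad a single sequence at the front ('pre') or back ('post')."""
    fill = [value] * (max_len - len(seq))
    return seq + fill if mode == 'post' else fill + seq

print(pad([1, 2, 3], 5, 'post'))  # [1, 2, 3, 0, 0]
print(pad([1, 2, 3], 5, 'pre'))   # [0, 0, 1, 2, 3]
```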
Let's put it all together in a complete pipeline:

python
import torch
from torch.utils.data import Dataset, DataLoader

class TextProcessor:
    def __init__(self, min_freq=2, max_vocab=10000, max_len=50):
        self.vocab = Vocabulary(min_freq, max_vocab)
        self.max_len = max_len
    
    def fit(self, texts):
        """Build vocabulary from training texts"""
        cleaned_texts = [clean_text(text) for text in texts]
        self.vocab.build_from_texts(cleaned_texts)
        return self
    
    def transform(self, texts):
        """Convert texts to padded sequences"""
        # Clean and encode
        sequences = []
        for text in texts:
            cleaned = clean_text(text)
            encoded = self.vocab.encode(cleaned)
            sequences.append(encoded)
        
        # Pad sequences
        padded = pad_sequences(
            sequences,
            max_len=self.max_len,
            padding='post',
            value=0  # <PAD> token
        )
        
        return torch.tensor(padded, dtype=torch.long)
    
    def fit_transform(self, texts):
        """Fit and transform in one step"""
        return self.fit(texts).transform(texts)

class TextDataset(Dataset):
    def __init__(self, texts, labels, processor=None):
        if processor is None:
            processor = TextProcessor()
            processor.fit(texts)
        self.processor = processor  # keep a reference so val/test can reuse the same preprocessing
        self.X = processor.transform(texts)
        
        self.y = torch.tensor(labels, dtype=torch.long)
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Usage
train_texts = ["I love this movie", "This film is terrible", ...]
train_labels = [1, 0, ...]  # 1=positive, 0=negative

# Create processor and dataset
processor = TextProcessor(max_len=30)
train_dataset = TextDataset(train_texts, train_labels, processor)

# Create data loader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Use in training loop
for batch_x, batch_y in train_loader:
    # batch_x: (32, 30) - padded sequences
    # batch_y: (32,) - labels
    logits = model(batch_x)
    loss = criterion(logits, batch_y)
    # ...

When using RNNs with very different sequence lengths, packed sequences can improve efficiency:

python
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

def create_packed_batch(sequences, lengths):
    """Create packed sequence for efficient RNN processing"""
    # Sort by length (required for packing)
    sorted_lengths, sorted_idx = lengths.sort(0, descending=True)
    sorted_sequences = sequences[sorted_idx]
    
    # Pack sequences
    packed = pack_padded_sequence(
        sorted_sequences,
        sorted_lengths.cpu(),
        batch_first=True
    )
    
    return packed, sorted_idx

# In your RNN forward pass:
def forward(self, x, lengths):
    packed, sorted_idx = create_packed_batch(x, lengths)
    packed_out, (h_n, c_n) = self.lstm(packed)
    
    # Unpack and unsort
    output, _ = pad_packed_sequence(packed_out, batch_first=True)
    unsorted_idx = sorted_idx.argsort()
    output = output[unsorted_idx]
    
    return output

For transformer models, create attention masks to ignore padding tokens:

python
def create_attention_mask(sequences, pad_token_id=0):
    """Create attention mask (1 for real tokens, 0 for padding)"""
    return (sequences != pad_token_id).float()

# Example
sequences = torch.tensor([
    [1, 2, 3, 0, 0],  # "I love you <PAD> <PAD>"
    [4, 5, 6, 7, 0]   # "This is great movie <PAD>"
])

mask = create_attention_mask(sequences)
print(mask)
# Output:
# [[1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]]

# Hugging Face-style models accept this mask directly:
# output = model(input_ids=sequences, attention_mask=mask)
# (PyTorch's nn.Transformer instead takes src_key_padding_mask,
#  where True marks positions to ignore)

A few mistakes come up again and again. First, building the vocabulary from all of your data:

python
# ❌ Wrong: Building vocab from all data
all_texts = train_texts + val_texts + test_texts
vocab.build_from_texts(all_texts)  # Data leakage!

# ✅ Correct: Build vocab only from training data
vocab.build_from_texts(train_texts)
# Val/test may have unknown words - that's realistic!

Second, using different preprocessing for train and test:

python
# ❌ Wrong: Different preprocessing for train/test
train_clean = [text.lower() for text in train_texts]
test_clean = [text.upper() for text in test_texts]  # Different!

# ✅ Correct: Same preprocessing everywhere
processor = TextProcessor()
processor.fit(train_texts)

train_X = processor.transform(train_texts)
test_X = processor.transform(test_texts)  # Same preprocessing

Third, picking max_len arbitrarily instead of from the data:

python
# ❌ Wrong: Arbitrary choice
max_len = 100  # Why 100? No analysis!

# ✅ Correct: Data-driven choice
lengths = analyze_sequence_lengths(train_texts)
max_len = int(np.percentile(lengths, 95))  # Cover 95% of data
print(f"Chosen max_len: {max_len}")

Memory usage is a separate practical concern:

python
# Memory usage scales with:
# - Vocabulary size (embedding parameters)
# - Sequence length (computation and memory)
# - Batch size (memory)

# Example memory calculation:
vocab_size = 10000
embed_dim = 128
max_len = 50
batch_size = 32

# Embedding layer: vocab_size * embed_dim * 4 bytes (float32)
embedding_memory = vocab_size * embed_dim * 4 / (1024**2)  # MB
print(f"Embedding memory: {embedding_memory:.1f} MB")

# Batch memory: batch_size * max_len * embed_dim * 4 bytes
batch_memory = batch_size * max_len * embed_dim * 4 / (1024**2)  # MB
print(f"Batch memory: {batch_memory:.1f} MB")

Ways to reduce memory and computation:

  • Smaller vocabulary: Reduces embedding layer size
  • Shorter sequences: Less computation in RNNs/Transformers
  • Larger batches: Better GPU utilization (up to memory limits)
  • Packed sequences: Skip computation on padding (RNNs only)

Best practices checklist:

  1. Analyze your data first: Understand length distributions and vocabulary
  2. Build vocabulary only from training data: Avoid data leakage
  3. Choose max_len to cover 95-99% of sequences: Balance coverage and efficiency
  4. Use consistent preprocessing: Same pipeline for train/val/test
  5. Reserve index 0 for padding: Makes masking easier
  6. Filter vocabulary by frequency: Remove rare words (noise)
  7. Consider subword tokenization: For handling unknown words
  8. Monitor memory usage: Especially with large vocabularies/sequences

Text preprocessing and tokenization are foundational skills for NLP. While they might seem mundane compared to designing neural architectures, they can make or break your model's performance.

Key takeaways:

  • Clean text appropriately for your task (don't over-clean)
  • Build vocabulary from training data only to avoid leakage
  • Choose sequence length based on data analysis, not arbitrary numbers
  • Use padding and truncation to handle variable-length sequences
  • Be consistent in preprocessing across train/val/test splits

Remember: Good preprocessing is invisible when it works, but bad preprocessing will sabotage even the best models. Invest time in getting it right!

Master these fundamentals, and you'll have a solid foundation for any NLP project, from simple classification to complex language generation.
