
Understanding Neural Networks: From Word Counting to Understanding Meaning
A beginner-friendly guide to pretrained sentence embeddings, multi-layer perceptrons, and the building blocks that make modern NLP work — explained with simple examples and zero jargon.
Imagine teaching a computer to understand what you mean when you say something. Not just matching keywords, but actually understanding that 'book a flight' and 'reserve a plane ticket' mean the same thing, even though they share almost no words. This post explains how modern AI does exactly that, using two powerful ideas: sentence embeddings (a way to capture meaning) and multi-layer perceptrons (a simple but effective neural network). We'll break down every concept into plain English with real examples.
What You'll Learn
Let's start with the old way of teaching computers to understand text: counting words. This approach is called TF-IDF (Term Frequency-Inverse Document Frequency), and it's like giving each word a score based on how often it appears.
Here's how it works: if you have the sentence 'book a flight to Tokyo', the computer creates a list of all possible words it knows (maybe 10,000 words), and marks which ones appear in your sentence. So 'book' gets a 1, 'flight' gets a 1, 'Tokyo' gets a 1, and the other 9,997 words get a 0.
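To make this concrete, here is a toy sketch of the word-counting idea. The six-word vocabulary is made up for illustration; a real system would use thousands of words and weight them by frequency.

```python
# Toy bag-of-words sketch (illustrative; a real vocabulary has thousands of words)
vocab = ["book", "flight", "tokyo", "weather", "reserve", "ticket"]

def bag_of_words(sentence):
    words = sentence.lower().split()
    # One slot per vocabulary word: 1 if the word appears, 0 otherwise
    return [1 if word in words else 0 for word in vocab]

print(bag_of_words("book a flight to tokyo"))  # [1, 1, 1, 0, 0, 0]
print(bag_of_words("reserve a plane ticket"))  # [0, 0, 0, 0, 1, 1]
```

Notice that the two travel sentences share no 1s at all: to a word counter, they look completely unrelated, even though they mean nearly the same thing.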
The Fatal Flaws of Word Counting
Think about it like this: imagine trying to understand a recipe by only counting how many times each ingredient appears, without knowing the order or how they relate to each other. You'd know there's flour and eggs, but not whether you're making a cake or scrambled eggs!
Now for the magic trick. Instead of counting words, what if we could capture the meaning of a sentence as a set of numbers? That's exactly what sentence embeddings do.
Think of it like GPS coordinates. Every location on Earth can be described by two numbers (latitude and longitude). Similarly, every sentence can be described by a list of numbers (typically 384 or 768 numbers) that capture its meaning. Sentences that mean similar things get similar numbers, like how nearby places have similar GPS coordinates.
The Map of Meaning
Here's the best part: someone else already built this map for us! Companies like Google and Hugging Face trained models on billions of sentences to learn these coordinates. We can download their work for free and use it. This is called transfer learning — borrowing intelligence from someone else's hard work.
```python
# Let's see sentence embeddings in action
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    'Book a flight to New York',
    'Reserve a plane ticket to NYC',
    'What is the weather today?'
]

# Convert each sentence to 384 numbers
embeddings = model.encode(sentences)

# Now we can measure how similar sentences are
# 'Book a flight' and 'Reserve a plane ticket' will be very similar (close to 1.0)
# 'Book a flight' and 'What is the weather' will be very different (close to 0.0)
```

The model we use is called all-MiniLM-L6-v2. It's small (only about 80MB), fast (works on regular computers), and produces 384-dimensional embeddings. Think of those 384 numbers as 384 different aspects of meaning — like 'is this about travel?', 'is this a question?', 'is this urgent?', and 381 other subtle aspects.
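How is that similarity actually measured? The standard choice is cosine similarity: treat the two embeddings as arrows and measure the angle between them. Here is a minimal sketch using made-up 3-number vectors (real embeddings have 384 numbers, but the math is identical):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up tiny "embeddings" for illustration only
flight  = [0.9, 0.1, 0.2]
ticket  = [0.8, 0.2, 0.3]
weather = [0.1, 0.9, 0.1]

print(cosine_similarity(flight, ticket))   # close to 1.0 (similar meaning)
print(cosine_similarity(flight, weather))  # much lower (different meaning)
```

A score near 1.0 means "pointing the same direction" (similar meaning); a score near 0.0 means "unrelated".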
Here's an important concept: we freeze the embedding model, which means we never change it. We use it exactly as we downloaded it. Why?
Think of it like using a dictionary. The dictionary was created by experts who studied millions of words. When you look up a word, you don't rewrite the dictionary — you just use what's already there. Same idea here: the embedding model was trained on billions of sentences, and we only have thousands. If we tried to 'improve' it with our small dataset, we'd actually make it worse.
The Cooking Analogy
Converting 15,000 sentences into embeddings takes a few minutes. If we had to do this every time we train our model, we'd waste hours. So we do something smart: caching.
Caching means: compute the embeddings once, save them to a file, and then just load that file every time you need them. It's like meal prepping — cook once on Sunday, eat all week.
```python
# Simple caching pattern
import torch
from pathlib import Path

def load_or_compute_embeddings(texts, cache_file):
    # Check if we already computed these
    if Path(cache_file).exists():
        print("Loading from cache (instant!)")
        return torch.load(cache_file)
    # First time — compute and save
    print("Computing embeddings (takes a few minutes)...")
    embeddings = model.encode(texts)  # model from the earlier snippet
    torch.save(embeddings, cache_file)
    return embeddings
```

Now we get to the heart of it: neural networks. Let's build up the intuition step by step.
Imagine you're trying to separate apples from oranges on a table. If all the apples are on one side and all the oranges are on the other, you can draw a straight line between them. Easy!
But what if the apples are in the middle and the oranges are in a circle around them? No straight line can separate them. You need a curved boundary. That's exactly the problem with simple models — they can only draw straight lines (or flat surfaces in higher dimensions).
The XOR Problem
A hidden layer is a transformation step between input and output. Think of it like this:
- Input layer: Your data comes in (the 384 embedding numbers)
- Hidden layer 1: Transform those 384 numbers into 256 new numbers that capture useful patterns
- Hidden layer 2: Transform those 256 numbers into 128 numbers that capture even more refined patterns
- Output layer: Transform those 128 numbers into 151 final scores (one for each possible intent)
Each hidden layer is like a filter that extracts increasingly sophisticated patterns. The first layer might detect simple things like 'contains travel words' or 'sounds like a question'. The second layer might detect combinations like 'travel question about weather' or 'urgent booking request'.
Here's a crucial detail: between each layer, we apply something called ReLU (Rectified Linear Unit). It's incredibly simple: if a number is positive, keep it; if it's negative, change it to zero.
```python
# ReLU in action
def relu(x):
    return max(0, x)

# Examples:
relu(5)   # Returns 5 (positive, keep it)
relu(-3)  # Returns 0 (negative, zero it out)
relu(0)   # Returns 0
```

Why is this important? Without ReLU (or some other nonlinear function), stacking multiple layers would be pointless — they'd collapse into a single layer mathematically. ReLU breaks this equivalence and lets the network learn curves instead of just straight lines.
The Light Switch Analogy
As data flows through multiple layers, the numbers can get out of control — some might be in the thousands, others near zero. This makes training unstable. Batch Normalization fixes this.
Here's the idea: after each layer, normalize the numbers so they have a consistent scale (roughly mean=0, standard deviation=1). It's like adjusting the volume on different audio tracks so they're all at the same level before mixing them together.
The Temperature Analogy
One critical detail: Batch Normalization behaves differently during training versus testing. During training, it uses the current batch's statistics. During testing, it uses stable averages computed during training. This is why you must call model.eval() before testing — forgetting this is the #1 cause of mysteriously bad test results!
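A small PyTorch sketch makes both points visible: in training mode BatchNorm rescales each feature using the current batch, and switching to eval mode changes the output for the exact same input. The layer size and batch here are made up for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)          # normalizes 4 features
batch = torch.randn(8, 4) * 50  # wildly scaled inputs

bn.train()                      # training mode: use this batch's statistics
out = bn(batch)
print(out.mean(dim=0))          # roughly 0 for every feature
print(out.std(dim=0))           # roughly 1 for every feature

bn.eval()                       # test mode: use running averages instead
out_eval = bn(batch)            # same input now produces different output
```

Calling `model.eval()` on a whole network flips every BatchNorm (and Dropout) layer into test mode at once; `model.train()` flips them back.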
Here's a problem: if you train a model too long on the same data, it starts to memorize instead of learn. It's like a student who memorizes answers without understanding the concepts — they ace the practice test but fail the real exam.
Dropout is a clever solution: during training, randomly turn off 30% of the neurons in each layer. Which 30%? Different ones each time, chosen randomly.
The Sports Team Analogy
During testing, all neurons are active (no dropout). The model has learned to work with any subset of neurons, so when all are present, it performs even better.
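Here is a hand-rolled sketch of that behavior. Like PyTorch's `nn.Dropout`, it uses "inverted dropout": survivors are scaled up by 1/(1-p) during training, so the average activation stays the same and nothing needs to change at test time.

```python
import random

def dropout(values, p=0.3, training=True):
    if not training:
        return values  # testing: all neurons active, nothing changes
    result = []
    for v in values:
        if random.random() < p:
            result.append(0.0)          # this neuron is "off" for this pass
        else:
            result.append(v / (1 - p))  # scale survivors so the average stays the same
    return result

random.seed(42)
activations = [1.0, 2.0, 3.0, 4.0, 5.0]
print(dropout(activations))                  # some zeros, survivors scaled up
print(dropout(activations, training=False))  # unchanged at test time
```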
Training a neural network means adjusting millions of numbers (the weights) to minimize errors. The old way (called SGD — Stochastic Gradient Descent) uses one global learning rate: every weight moves by its gradient times the same fixed step size. Adam is smarter.
Adam gives each weight its own personalized learning rate based on its history. If a weight's gradient has been consistently pointing in one direction, Adam lets it move faster. If a weight's gradient has been bouncing around randomly, Adam makes it move more cautiously.
The Driving Analogy
Adam is often paired with weight decay (as in PyTorch's AdamW variant), which is a fancy term for 'penalize weights that get too large'. This prevents the model from becoming too confident about any single pattern, which helps it generalize better to new data.
How do you know when to stop training? If you stop too early, the model hasn't learned enough. If you train too long, it starts memorizing the training data and performs poorly on new data.
Early stopping solves this automatically. Here's how it works:
- After each training epoch, test the model on validation data (data it hasn't trained on)
- If the validation accuracy improves, save the model weights and reset a counter
- If the validation accuracy doesn't improve, increment the counter
- If the counter reaches a threshold (say, 5 epochs without improvement), stop training and restore the best weights
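The four steps above boil down to a best-so-far tracker with a patience counter. Here is a minimal sketch of just that logic, run over a made-up list of validation accuracies (a real loop would also save and restore model weights at the best epoch):

```python
def early_stopping_epoch(val_accuracies, patience=5):
    """Return (best_epoch, best_accuracy) under the patience rule."""
    best_acc, best_epoch, counter = float("-inf"), 0, 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch, counter = acc, epoch, 0  # improvement: save, reset
        else:
            counter += 1                                   # no improvement
            if counter >= patience:
                break                                      # stop, keep the best epoch
    return best_epoch, best_acc

# Made-up validation accuracies: improves, then plateaus
accs = [0.60, 0.72, 0.80, 0.85, 0.84, 0.85, 0.83, 0.84, 0.82, 0.83]
print(early_stopping_epoch(accs, patience=5))  # (3, 0.85): stop, restore epoch 3
```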
The Studying Analogy
Let's see how all these pieces fit together in our MLP (Multi-Layer Perceptron) classifier:
Here's what happens step by step:
- Input: A sentence like 'Book a flight to Tokyo'
- Sentence Transformer: Converts it to 384 numbers capturing its meaning (frozen, never changes)
- First Hidden Layer: 384 → 256 numbers, normalized, ReLU applied, 30% randomly dropped
- Second Hidden Layer: 256 → 128 numbers, normalized, ReLU applied, 30% randomly dropped
- Output Layer: 128 → 151 final scores (one for each possible intent)
- Prediction: Pick the intent with the highest score
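The five steps above (everything after the frozen encoder) can be sketched as a plain PyTorch `nn.Sequential`, using the dimensions from the post. The batch of random "embeddings" stands in for real encoder output:

```python
import torch
import torch.nn as nn

# Sketch of the MLP head described above (384 -> 256 -> 128 -> 151)
mlp = nn.Sequential(
    nn.Linear(384, 256),   # first hidden layer
    nn.BatchNorm1d(256),   # keep numbers at a consistent scale
    nn.ReLU(),             # nonlinearity
    nn.Dropout(0.3),       # randomly drop 30% of neurons while training
    nn.Linear(256, 128),   # second hidden layer
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(128, 151),   # output layer: one score per intent
)

fake_embeddings = torch.randn(32, 384)  # a batch of 32 "sentence embeddings"
scores = mlp(fake_embeddings)
print(scores.shape)                     # torch.Size([32, 151])
predictions = scores.argmax(dim=1)      # pick the highest-scoring intent
```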
Training is an iterative process. Here's the cycle that repeats thousands of times:
- Forward Pass: Feed a batch of examples through the network, get predictions
- Compute Loss: Measure how wrong the predictions are (using CrossEntropyLoss)
- Backward Pass: Calculate how to adjust each weight to reduce the error (using backpropagation)
- Update Weights: Adjust the weights using Adam optimizer
- Repeat: Do this for thousands of batches across many epochs
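Those five steps map almost line for line onto a PyTorch training loop. This sketch uses a simplified model and random stand-in data (real training would iterate over real embedding/label batches for many epochs), with AdamW providing the weight decay mentioned earlier:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Simplified stand-in for the MLP (the real one adds BatchNorm and Dropout)
model = nn.Sequential(nn.Linear(384, 256), nn.ReLU(), nn.Linear(256, 151))
loss_fn = nn.CrossEntropyLoss()
# Adam with weight decay (AdamW); lr=1e-3 is a common starting point
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Made-up data: 32 random "embeddings" with random intent labels
inputs = torch.randn(32, 384)
labels = torch.randint(0, 151, (32,))

model.train()
losses = []
for step in range(5):               # in practice: thousands of batches
    logits = model(inputs)          # 1. forward pass: get predictions
    loss = loss_fn(logits, labels)  # 2. compute loss: how wrong are we?
    optimizer.zero_grad()
    loss.backward()                 # 3. backward pass: compute gradients
    optimizer.step()                # 4. update weights with Adam
    losses.append(loss.item())
print(f"loss went from {losses[0]:.3f} to {losses[-1]:.3f}")
```

Even on random data, repeatedly fitting the same batch drives the loss down, which is exactly the memorization risk that dropout and early stopping guard against.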
The Archery Analogy
When you combine all these techniques, something magical happens. On a dataset with 151 different intents (like 'book_flight', 'weather_query', 'play_music', etc.), this approach achieves over 90% accuracy. Compare that to the old word-counting approach which maxes out around 78%.
Why such a big jump? The sentence embeddings do most of the heavy lifting. They already understand that 'book', 'reserve', and 'schedule' are related. They already know that 'flight' and 'plane ticket' mean similar things. The MLP just needs to learn simple decision boundaries in this well-organized space.
| Approach | Accuracy | Why |
|---|---|---|
| Word Counting (TF-IDF) | ~78% | Can't understand synonyms or word order |
| Sentence Embeddings + MLP | ~90% | Understands meaning, learns complex patterns |
| Fine-tuned Transformer | ~95% | Even better but much more expensive |
Here are the most common mistakes beginners make, and how to avoid them:
- Forgetting model.eval(): Always call this before testing. If you don't, Dropout stays active and BatchNorm uses wrong statistics. Your test accuracy will be mysteriously bad.
- Not restoring best weights: Early stopping finds the best epoch, but you must load those weights back. Otherwise you're using the last epoch's weights, which are often overfit.
- Learning rate too high or too low: For Adam, 1e-3 (0.001) is a good starting point. Too high and training explodes, too low and nothing happens.
- Hidden layers too small: With 151 output classes, you need enough capacity. [256, 128] works well. [32] is too small.
- Not caching embeddings: Computing embeddings takes minutes. Cache them to disk so you only do it once.
| Concept | Simple Explanation | Why It Matters |
|---|---|---|
| Sentence Embeddings | Converting text to numbers that capture meaning | Synonyms become similar numbers; no more word-counting limitations |
| Transfer Learning | Using someone else's pre-trained model | Start with intelligence instead of starting from scratch |
| Frozen Encoder | Never changing the embedding model | Prevents overfitting on small data; much faster training |
| Hidden Layers | Transformation steps between input and output | Lets the network learn curves instead of just straight lines |
| ReLU | Keep positive numbers, zero out negative ones | Enables learning complex patterns by adding nonlinearity |
| Batch Normalization | Keeping numbers at consistent scale | Makes training stable and faster |
| Dropout | Randomly turning off neurons during training | Prevents memorization, forces robust learning |
| Adam Optimizer | Smart weight updates with per-parameter learning rates | Typically converges much faster than plain gradient descent |
| Early Stopping | Stop when validation stops improving | Automatically finds the best epoch, prevents overfitting |
| Caching | Save computed embeddings to disk | Pay the cost once, reuse forever |
Let's zoom out and see the forest, not just the trees. Modern NLP works by dividing labor: a pretrained model (the sentence transformer) does the hard work of understanding language, and a small neural network (the MLP) does the easier work of classification.
This pattern — frozen pretrained encoder + small trainable head — is everywhere in modern AI. It's how Google, Meta, and virtually every company doing serious NLP work builds their systems. You've just learned the foundation.
What You've Learned
The beautiful thing is that once you understand these building blocks, you can understand much more complex systems. Transformers, BERT, GPT — they all use these same fundamental ideas, just arranged in more sophisticated ways. You've taken the first step into a much larger world.
Related Articles
Batch Normalization Explained: Why Your Neural Network Needs It
A complete beginner's guide to Batch Normalization - what it is, why it works, how to implement it, and the critical train vs eval mode difference that trips up everyone.
Dropout Explained: The Surprisingly Simple Trick That Prevents Overfitting
A complete beginner's guide to Dropout regularization - why randomly turning off neurons makes neural networks smarter, how it works, and how to use it correctly in PyTorch.
Adam Optimizer Explained: Why It's Better Than Plain Gradient Descent
A complete beginner's guide to the Adam optimizer - how it adapts learning rates per parameter, why it converges faster than SGD, and how to use it effectively in PyTorch.