Understanding Neural Networks: From Word Counting to Meaning Understanding

A beginner-friendly guide to pretrained sentence embeddings, multi-layer perceptrons, and the building blocks that make modern NLP work — explained with simple examples and zero jargon.

20 min read

Imagine teaching a computer to understand what you mean when you say something. Not just matching keywords, but actually understanding that 'book a flight' and 'reserve a plane ticket' mean the same thing, even though they share almost no words. This post explains how modern AI does exactly that, using two powerful ideas: sentence embeddings (a way to capture meaning) and multi-layer perceptrons (a simple but effective neural network). We'll break down every concept into plain English with real examples.

What You'll Learn

By the end of this post, you'll understand: why counting words isn't enough, what sentence embeddings are and why they're magical, how neural networks learn patterns, what hidden layers do, and why techniques like Dropout and BatchNorm matter. No math background required — just curiosity.

Let's start with the old way of teaching computers to understand text: counting words. This approach is called TF-IDF (Term Frequency-Inverse Document Frequency), and it's like giving each word a score based on how often it appears.

Here's how it works: if you have the sentence 'book a flight to Tokyo', the computer creates a list of all the words it knows (maybe 10,000 words) and scores the ones that appear in your sentence (roughly, how often a word appears here, discounted by how common it is across all documents). So 'book', 'flight', and 'Tokyo' get non-zero scores, and the other 9,997 words stay at 0.
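As a minimal sketch of the counting idea (plain counts, without the IDF weighting, over a tiny hypothetical vocabulary), here's how two sentences that mean the same thing turn into vectors that barely overlap:

```python
# Minimal bag-of-words sketch: each sentence becomes a vector of word
# counts over a shared vocabulary. No IDF weighting, no word order.

def bag_of_words(sentence, vocabulary):
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]

vocab = ['book', 'a', 'flight', 'to', 'tokyo', 'reserve', 'plane', 'ticket']

v1 = bag_of_words('Book a flight to Tokyo', vocab)
v2 = bag_of_words('Reserve a plane ticket to Tokyo', vocab)

print(v1)  # [1, 1, 1, 1, 1, 0, 0, 0]
print(v2)  # [0, 1, 0, 1, 1, 1, 1, 1]

# Overlap is tiny even though the sentences mean the same thing:
shared = sum(min(a, b) for a, b in zip(v1, v2))
print(shared)  # only 'a', 'to', 'tokyo' are shared -> 3
```

Only the filler words overlap; the words that carry the meaning ('book' vs 'reserve', 'flight' vs 'plane ticket') contribute nothing.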

The Fatal Flaws of Word Counting

This approach has three huge problems: (1) It treats 'book', 'reserve', and 'schedule' as completely different, even though they mean similar things. (2) It loses word order — 'dog bites man' and 'man bites dog' look identical. (3) If someone uses a word you've never seen before, it just disappears.

Think about it like this: imagine trying to understand a recipe by only counting how many times each ingredient appears, without knowing the order or how they relate to each other. You'd know there's flour and eggs, but not whether you're making a cake or scrambled eggs!

Now for the magic trick. Instead of counting words, what if we could capture the meaning of a sentence as a set of numbers? That's exactly what sentence embeddings do.

Think of it like GPS coordinates. Every location on Earth can be described by two numbers (latitude and longitude). Similarly, every sentence can be described by a list of numbers (typically 384 or 768 numbers) that capture its meaning. Sentences that mean similar things get similar numbers, like how nearby places have similar GPS coordinates.

The Map of Meaning

Imagine a giant map where every possible sentence has a location. 'Book a flight' and 'Reserve a plane ticket' would be right next to each other because they mean the same thing. 'What's the weather?' would be in a completely different neighborhood. A sentence embedding is just the coordinates on this map.

Here's the best part: someone else already built this map for us! Companies like Google and Hugging Face trained models on billions of sentences to learn these coordinates. We can download their work for free and use it. This is called transfer learning — borrowing intelligence from someone else's hard work.

embedding_example.py

```python
# Let's see sentence embeddings in action
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    'Book a flight to New York',
    'Reserve a plane ticket to NYC',
    'What is the weather today?'
]

# Convert each sentence to 384 numbers
embeddings = model.encode(sentences)

# Measure how similar sentences are with cosine similarity
similarities = util.cos_sim(embeddings, embeddings)
# 'Book a flight' vs 'Reserve a plane ticket': high (close to 1.0)
# 'Book a flight' vs 'What is the weather': low (close to 0.0)
print(similarities)
```

The model we use is called all-MiniLM-L6-v2. It's small (only 80MB), fast (works on regular computers), and produces 384-dimensional embeddings. Think of those 384 numbers as 384 different aspects of meaning — like 'is this about travel?', 'is this a question?', 'is this urgent?', and 381 other subtle aspects.

Here's an important concept: we freeze the embedding model, which means we never change it. We use it exactly as we downloaded it. Why?

Think of it like using a dictionary. The dictionary was created by experts who studied millions of words. When you look up a word, you don't rewrite the dictionary — you just use what's already there. Same idea here: the embedding model was trained on billions of sentences, and we only have thousands. If we tried to 'improve' it with our small dataset, we'd actually make it worse.

The Cooking Analogy

Freezing the embeddings is like using pre-made pasta sauce from a professional chef. You could try to improve it, but you'd probably mess it up. Instead, you use the sauce as-is and focus on cooking the pasta perfectly. That's what we do — use the professional embeddings and focus on training our classifier.
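In code, "freezing" just means telling the framework never to update the encoder's weights. Here's a sketch in PyTorch terms, with a plain linear layer standing in for the pretrained encoder (the real encoder is a full transformer):

```python
import torch.nn as nn

# Stand-in modules: a "pretrained encoder" and the small head we train.
encoder = nn.Linear(384, 384)      # pretend this is the pretrained encoder
classifier = nn.Linear(384, 151)   # the trainable classification head

# Freeze the encoder: exclude its weights from gradient updates.
for param in encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in classifier.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in encoder.parameters() if not p.requires_grad)
```

In this post's pipeline, freezing is taken even further: the embeddings are precomputed and cached, so the encoder isn't part of the training graph at all.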

Converting 15,000 sentences into embeddings takes a few minutes. If we had to do this every time we train our model, we'd waste hours. So we do something smart: caching.

Caching means: compute the embeddings once, save them to a file, and then just load that file every time you need them. It's like meal prepping — cook once on Sunday, eat all week.

caching_example.py

```python
# Simple caching pattern
import torch
from pathlib import Path

def load_or_compute_embeddings(model, texts, cache_file):
    # Check if we already computed these
    if Path(cache_file).exists():
        print("Loading from cache (instant!)")
        return torch.load(cache_file)

    # First time: compute and save
    print("Computing embeddings (takes a few minutes)...")
    embeddings = model.encode(texts, convert_to_tensor=True)
    torch.save(embeddings, cache_file)
    return embeddings
```

Now we get to the heart of it: neural networks. Let's build up the intuition step by step.

Imagine you're trying to separate apples from oranges on a table. If all the apples are on one side and all the oranges are on the other, you can draw a straight line between them. Easy!

But what if the apples are in the middle and the oranges are in a circle around them? No straight line can separate them. You need a curved boundary. That's exactly the problem with simple models — they can only draw straight lines (or flat surfaces in higher dimensions).

The XOR Problem

There's a famous example called XOR: imagine four points at the corners of a square. Two opposite corners are red, two are blue. No single straight line can separate red from blue. But if you draw TWO lines and combine them cleverly, suddenly it's possible. That's what hidden layers do — they let the network draw multiple lines and combine them.

A hidden layer is a transformation step between input and output. Think of it like this:

  1. Input layer: Your data comes in (the 384 embedding numbers)
  2. Hidden layer 1: Transform those 384 numbers into 256 new numbers that capture useful patterns
  3. Hidden layer 2: Transform those 256 numbers into 128 numbers that capture even more refined patterns
  4. Output layer: Transform those 128 numbers into 151 final scores (one for each possible intent)

Each hidden layer is like a filter that extracts increasingly sophisticated patterns. The first layer might detect simple things like 'contains travel words' or 'sounds like a question'. The second layer might detect combinations like 'travel question about weather' or 'urgent booking request'.

Here's a crucial detail: between each layer, we apply something called ReLU (Rectified Linear Unit). It's incredibly simple: if a number is positive, keep it; if it's negative, change it to zero.

relu_example.py

```python
# ReLU in action
def relu(x):
    return max(0, x)

# Examples:
relu(5)    # Returns 5 (positive, keep it)
relu(-3)   # Returns 0 (negative, zero it out)
relu(0)    # Returns 0
```

Why is this important? Without ReLU (or some other nonlinear function), stacking multiple layers would be pointless — they'd collapse into a single layer mathematically. ReLU breaks this equivalence and lets the network learn curves instead of just straight lines.
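The collapse is easy to verify by hand: applying two linear maps in a row is exactly the same as applying one combined linear map. A tiny sketch with 2x2 matrices in plain Python:

```python
# Without a nonlinearity, two "layers" collapse into one:
# y = W2 @ (W1 @ x) is identical to (W2 @ W1) @ x.
W1 = [[2, 0], [1, 1]]
W2 = [[1, 1], [0, 3]]

def matvec(W, x):
    # Multiply matrix W by vector x
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

x = [1.0, 2.0]
two_layers = matvec(W2, matvec(W1, x))

# Pre-multiply the matrices into a single combined map W2 @ W1
combined = [[sum(W2[i][k] * W1[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
one_layer = matvec(combined, x)

print(two_layers == one_layer)  # True: stacking bought us nothing
```

Insert a ReLU between the two maps and this equivalence breaks, which is exactly why depth starts to pay off.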

The Light Switch Analogy

Think of ReLU like a light switch. If the signal is strong enough (positive), let it through. If it's weak or negative, turn it off. This simple on/off behavior, repeated across thousands of neurons, creates incredibly complex patterns.

As data flows through multiple layers, the numbers can get out of control — some might be in the thousands, others near zero. This makes training unstable. Batch Normalization fixes this.

Here's the idea: after each layer, normalize the numbers so they have a consistent scale (roughly mean=0, standard deviation=1). It's like adjusting the volume on different audio tracks so they're all at the same level before mixing them together.
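The core arithmetic is just "subtract the mean, divide by the standard deviation". A sketch on one batch of raw numbers (BatchNorm additionally learns a scale and shift on top of this, which is omitted here):

```python
# Normalize a batch of numbers to roughly mean 0, std 1
values = [100.0, 200.0, 300.0, 400.0]

mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)
std = variance ** 0.5

normalized = [(v - mean) / std for v in values]
# Wildly different scales in, consistent scale out
print(normalized)
```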

The Temperature Analogy

Imagine you're a chef and ingredients come to you at wildly different temperatures — some boiling hot, some frozen. Batch normalization is like a temperature regulator that brings everything to room temperature before you start cooking. Now you can focus on the recipe instead of constantly adjusting for extreme temperatures.

One critical detail: Batch Normalization behaves differently during training versus testing. During training, it uses the current batch's statistics. During testing, it uses stable averages computed during training. This is why you must call model.eval() before testing — forgetting this is the #1 cause of mysteriously bad test results!
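A minimal PyTorch sketch of that train/eval difference (the input numbers here are arbitrary, chosen to be far from mean 0):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4) * 10 + 5   # a batch far from mean 0 / std 1

bn.train()
out_train = bn(x)                # normalized with this batch's mean and std

bn.eval()
out_eval = bn(x)                 # normalized with the running averages instead

# The two outputs differ: forgetting model.eval() means test-time
# predictions get normalized with the wrong statistics.
print(torch.allclose(out_train, out_eval))
```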

Here's a problem: if you train a model too long on the same data, it starts to memorize instead of learn. It's like a student who memorizes answers without understanding the concepts — they ace the practice test but fail the real exam.

Dropout is a clever solution: during training, randomly turn off 30% of the neurons in each layer. Which 30%? Different ones each time, chosen randomly.

The Sports Team Analogy

Imagine training a basketball team where, at every practice, 30% of players are randomly sent home. The remaining players can't rely on any specific teammate always being there, so each player must learn to be useful on their own. The team becomes more robust because no single player becomes a crutch. That's exactly what dropout does for neural networks.

During testing, all neurons are active (no dropout). The model has learned to work with any subset of neurons, so when all are present, it performs even better.
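Both behaviors can be seen directly with PyTorch's `nn.Dropout`. A sketch (PyTorch also rescales the surviving values by 1/(1-p) during training so the expected total signal stays the same):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.3)
x = torch.ones(1000)

drop.train()
y = drop(x)
fraction_zeroed = (y == 0).float().mean().item()
print(fraction_zeroed)   # roughly 0.3: about 30% of neurons sent home

drop.eval()
z = drop(x)              # identical to x: dropout is off at test time
```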

Training a neural network means adjusting millions of numbers (the weights) to minimize errors. The old way, called SGD (Stochastic Gradient Descent), moves every weight with one global learning rate. Adam is smarter.

Adam gives each weight its own personalized learning rate based on its history. If a weight's gradient has been consistently pointing in one direction, Adam lets it move faster. If a weight's gradient has been bouncing around randomly, Adam makes it move more cautiously.

The Driving Analogy

SGD is like driving at a fixed speed everywhere — 60 mph on the highway and 60 mph in a school zone. Adam is like adaptive cruise control that watches the road and adjusts: it speeds up on clear highways (consistent gradients) and slows down where the terrain is bumpy (noisy gradients). In practice, Adam often converges several times faster than plain SGD.

Adam is usually paired with weight decay (commonly via the AdamW variant), which is a fancy term for 'penalize weights that get too large'. This prevents the model from becoming too confident about any single pattern, which helps it generalize better to new data.
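In PyTorch this is a one-liner. A sketch with a stand-in linear layer for the classifier (lr=1e-3 is the starting point recommended later in this post; weight_decay=0.01 is an illustrative value):

```python
import torch
import torch.nn as nn

classifier = nn.Linear(384, 151)   # stand-in for the MLP head

# AdamW: per-parameter adaptive learning rates plus decoupled weight decay
optimizer = torch.optim.AdamW(classifier.parameters(),
                              lr=1e-3, weight_decay=0.01)
```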

How do you know when to stop training? If you stop too early, the model hasn't learned enough. If you train too long, it starts memorizing the training data and performs poorly on new data.

Early stopping solves this automatically. Here's how it works:

  1. After each training epoch, test the model on validation data (data it hasn't trained on)
  2. If the validation accuracy improves, save the model weights and reset a counter
  3. If the validation accuracy doesn't improve, increment the counter
  4. If the counter reaches a threshold (say, 5 epochs without improvement), stop training and restore the best weights
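The counter logic above can be sketched in plain Python over a list of per-epoch validation accuracies (the numbers below are hypothetical):

```python
def best_stop(val_accuracies, patience=5):
    """Apply the early-stopping rule; return (best_epoch, best_accuracy)."""
    best_epoch, best_acc = 0, val_accuracies[0]
    wait = 0
    for epoch, acc in enumerate(val_accuracies[1:], start=1):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc   # save weights, reset counter
            wait = 0
        else:
            wait += 1                           # no improvement: count up
            if wait >= patience:
                break                           # stop training early
    return best_epoch, best_acc

# Accuracy improves, plateaus, then degrades: stop and keep epoch 3's weights
history = [0.62, 0.71, 0.78, 0.81, 0.80, 0.79, 0.81, 0.80, 0.78, 0.77]
print(best_stop(history, patience=5))  # (3, 0.81)
```

In a real training loop, "save the best epoch" means checkpointing the model's weights at that point and restoring them after the loop ends.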

The Studying Analogy

Imagine studying for an exam. You take practice tests to check your progress. If your practice test scores keep improving, keep studying. But if you take 5 practice tests in a row and your score doesn't improve (or gets worse), you're probably overthinking it — stop studying and go with what you knew 5 tests ago. That's early stopping.

Let's see how all these pieces fit together in our MLP (Multi-Layer Perceptron) classifier:

Here's what happens step by step:

  1. Input: A sentence like 'Book a flight to Tokyo'
  2. Sentence Transformer: Converts it to 384 numbers capturing its meaning (frozen, never changes)
  3. First Hidden Layer: 384 → 256 numbers, normalized, ReLU applied, 30% randomly dropped
  4. Second Hidden Layer: 256 → 128 numbers, normalized, ReLU applied, 30% randomly dropped
  5. Output Layer: 128 → 151 final scores (one for each possible intent)
  6. Prediction: Pick the intent with the highest score
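Steps 3 through 6 above map directly onto a small PyTorch module. A sketch (layer sizes follow the text; the 384-dimensional embeddings are assumed to come precomputed from the frozen encoder):

```python
import torch
import torch.nn as nn

class IntentMLP(nn.Module):
    def __init__(self, embed_dim=384, num_intents=151, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            # First hidden layer: 384 -> 256, normalized, ReLU, 30% dropped
            nn.Linear(embed_dim, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
            # Second hidden layer: 256 -> 128, same treatment
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
            # Output layer: 128 -> 151 raw scores (logits), one per intent
            nn.Linear(128, num_intents),
        )

    def forward(self, x):
        return self.net(x)

model = IntentMLP()
scores = model(torch.randn(2, 384))   # a batch of 2 sentence embeddings
predictions = scores.argmax(dim=1)    # pick the highest-scoring intent
```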

Training is an iterative process. Here's the cycle that repeats thousands of times:

  1. Forward Pass: Feed a batch of examples through the network, get predictions
  2. Compute Loss: Measure how wrong the predictions are (using CrossEntropyLoss)
  3. Backward Pass: Calculate how to adjust each weight to reduce the error (using backpropagation)
  4. Update Weights: Adjust the weights using Adam optimizer
  5. Repeat: Do this for thousands of batches across many epochs
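One iteration of that cycle looks like this in PyTorch. A sketch on random stand-in data (real use would feed batches of cached embeddings with their intent labels):

```python
import torch
import torch.nn as nn

model = nn.Linear(384, 151)                 # stand-in for the MLP classifier
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

embeddings = torch.randn(32, 384)           # a batch of 32 sentence embeddings
labels = torch.randint(0, 151, (32,))       # their intent labels

logits = model(embeddings)                  # 1. forward pass
loss = loss_fn(logits, labels)              # 2. compute loss
optimizer.zero_grad()
loss.backward()                             # 3. backward pass (backpropagation)
optimizer.step()                            # 4. update weights
```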

The Archery Analogy

Training is like learning archery. You shoot an arrow (forward pass), see where it lands (compute loss), figure out how to adjust your aim (backward pass), make the adjustment (update weights), and shoot again. After thousands of shots, you get really good at hitting the bullseye.

When you combine all these techniques, something magical happens. On a dataset with 151 different intents (like 'book_flight', 'weather_query', 'play_music', etc.), this approach achieves over 90% accuracy. Compare that to the old word-counting approach which maxes out around 78%.

Why such a big jump? The sentence embeddings do most of the heavy lifting. They already understand that 'book', 'reserve', and 'schedule' are related. They already know that 'flight' and 'plane ticket' mean similar things. The MLP just needs to learn simple decision boundaries in this well-organized space.

| Approach | Accuracy | Why |
| --- | --- | --- |
| Word Counting (TF-IDF) | ~78% | Can't understand synonyms or word order |
| Sentence Embeddings + MLP | 90%+ | Understands meaning, learns complex patterns |
| Fine-tuned Transformer | ~95% | Even better but much more expensive |

Here are the most common mistakes beginners make, and how to avoid them:

  1. Forgetting model.eval(): Always call this before testing. If you don't, Dropout stays active and BatchNorm uses wrong statistics. Your test accuracy will be mysteriously bad.
  2. Not restoring best weights: Early stopping finds the best epoch, but you must load those weights back. Otherwise you're using the last epoch's weights, which are often overfit.
  3. Learning rate too high or too low: For Adam, 1e-3 (0.001) is a good starting point. Too high and training explodes, too low and nothing happens.
  4. Hidden layers too small: With 151 output classes, you need enough capacity. [256, 128] works well. [32] is too small.
  5. Not caching embeddings: Computing embeddings takes minutes. Cache them to disk so you only do it once.

| Concept | Simple Explanation | Why It Matters |
| --- | --- | --- |
| Sentence Embeddings | Converting text to numbers that capture meaning | Synonyms become similar numbers; no more word-counting limitations |
| Transfer Learning | Using someone else's pre-trained model | Start with intelligence instead of starting from scratch |
| Frozen Encoder | Never changing the embedding model | Prevents overfitting on small data; much faster training |
| Hidden Layers | Transformation steps between input and output | Lets the network learn curves instead of just straight lines |
| ReLU | Keep positive numbers, zero out negative ones | Enables learning complex patterns by adding nonlinearity |
| Batch Normalization | Keeping numbers at consistent scale | Makes training stable and faster |
| Dropout | Randomly turning off neurons during training | Prevents memorization, forces robust learning |
| Adam Optimizer | Smart weight updates with per-parameter learning rates | Typically converges much faster than basic gradient descent |
| Early Stopping | Stop when validation stops improving | Automatically finds the best epoch, prevents overfitting |
| Caching | Save computed embeddings to disk | Pay the cost once, reuse forever |

Let's zoom out and see the forest, not just the trees. Modern NLP works by dividing labor: a pretrained model (the sentence transformer) does the hard work of understanding language, and a small neural network (the MLP) does the easier work of classification.

This pattern — frozen pretrained encoder + small trainable head — is everywhere in modern AI. It's how Google, Meta, and virtually every company doing serious NLP work builds their systems. You've just learned the foundation.

What You've Learned

You now understand: why word counting isn't enough, how sentence embeddings capture meaning, what neural networks do and why they need hidden layers, how techniques like Dropout and BatchNorm make training work, why Adam is better than basic gradient descent, and how early stopping prevents overfitting. These concepts apply to almost every modern deep learning system.

The beautiful thing is that once you understand these building blocks, you can understand much more complex systems. Transformers, BERT, GPT — they all use these same fundamental ideas, just arranged in more sophisticated ways. You've taken the first step into a much larger world.
