
Semantic Caching & RAGAS Evaluation: Make Your RAG Pipeline Faster and Measurable
Learn how to add semantic caching to your RAG pipeline for lower latency and cost, then measure quality with RAGAS evaluation metrics.
You've built a RAG bot. It retrieves context, generates answers, and mostly works. But two questions keep nagging: *how do I make it faster?* and *how do I know if it's actually good?* This post tackles both. We'll wire up a semantic cache that intercepts repeated queries before they ever touch the LLM, then plug in RAGAS — a reference-free evaluation framework — to put hard numbers on retrieval and generation quality. If you're still figuring out how to break your documents into chunks, start with Chunking in RAG — Breaking Text the Right Way.
A traditional cache matches queries by their exact string. That's fine for database lookups, but terrible for LLM traffic. Users ask "What is Python?", "Tell me about Python", and "Explain the Python language" — three strings, one intent. An exact-match cache misses all of them after the first.
Semantic caching solves this by comparing meaning instead of characters. Every query gets embedded into a vector, and we search the cache using cosine similarity. If a stored query is close enough, we skip the LLM entirely and return the cached response.
The flow is straightforward: embed the incoming query, search a vector store for the nearest cached embedding, and compare the similarity score against a configurable threshold. Above the threshold means a hit — below means a miss, and the full retrieval-generation pipeline runs as normal. The new response is then stored for future queries.
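That flow fits in a few dozen lines. Here's a minimal, library-free sketch, where `embed` is a stand-in for your real embedding model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """In-memory semantic cache: a hit means the nearest stored
    query embedding clears the similarity threshold."""

    def __init__(self, embed, threshold=0.8):
        self.embed = embed          # callable: query text -> vector
        self.threshold = threshold
        self.entries = []           # list of (embedding, response)

    def lookup(self, query):
        q = self.embed(query)
        best_response, best_sim = None, -1.0
        for emb, response in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_response, best_sim = response, sim
        if best_sim >= self.threshold:
            return best_response    # cache hit
        return None                 # miss: run the full RAG pipeline

    def store(self, query, response):
        self.entries.append((self.embed(query), response))
```

A production version would use a vector index instead of a linear scan, but the hit/miss logic is exactly this.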
GPTCache is an open-source library from Zilliz with pluggable components for embedding, storage, similarity evaluation, and eviction. It wraps the OpenAI API so you can drop it into an existing project with minimal changes.
```python
from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# 1. Embedding model (ONNX is lightweight and fast)
onnx = Onnx()

# 2. Storage backends
cache_base = CacheBase("sqlite")    # stores responses
vector_base = VectorBase("faiss",   # stores embeddings
                         dimension=onnx.dimension)
data_manager = get_data_manager(cache_base, vector_base)

# 3. Initialize
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

# 4. Use it — same interface as openai
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is Python?"}],
)
```

Drop-in Replacement
GPTCache's `openai` adapter mirrors the real OpenAI API. Your first call is a cache miss and hits the LLM normally. A second call with a semantically similar query — like *"Tell me about Python"* — returns instantly from the cache.
The threshold is the single most important knob. Set it too low and you'll serve cached answers to the wrong questions. Set it too high and the cache barely fires. The right value depends on your use case.
| Threshold Range | Use Case | Trade-off |
|---|---|---|
| 0.80 – 0.85 | Factual Q&A, compliance, medical | High precision, fewer cache hits |
| 0.70 – 0.80 | General conversational queries | Balanced — good starting point |
| 0.60 – 0.70 | Creative, fuzzy, or exploratory apps | High recall, risk of wrong answers |
To find the sweet spot, build a small test harness: create pairs of queries that should match and pairs that should not, then sweep the threshold and observe where false hits start appearing.
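One way to sketch that harness in plain Python. The similarity scores below are hypothetical stand-ins for cosine similarities you'd compute with your real embedding model over hand-labeled query pairs:

```python
def sweep_thresholds(scored_pairs, thresholds):
    """scored_pairs: (similarity, should_match) tuples for hand-labeled
    query pairs; returns error counts at each candidate threshold."""
    report = []
    for t in thresholds:
        false_hits = sum(1 for sim, match in scored_pairs if sim >= t and not match)
        false_misses = sum(1 for sim, match in scored_pairs if sim < t and match)
        report.append({"threshold": t,
                       "false_hits": false_hits,
                       "false_misses": false_misses})
    return report

# Hypothetical labeled pairs: (similarity, should these queries match?)
scored_pairs = [
    (0.93, True),   # "What is Python?" vs "Tell me about Python"
    (0.88, True),   # another paraphrase
    (0.74, False),  # "Python language" vs "python snake care"
    (0.61, False),  # unrelated topics
]
for row in sweep_thresholds(scored_pairs, [0.6, 0.7, 0.8, 0.9]):
    print(row)
```

With this toy data, 0.8 is the first threshold with zero false hits and zero false misses, which is exactly the kind of boundary you're sweeping for.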
```python
from gptcache.similarity_evaluation import SimilarityEvaluation

class ThresholdEvaluation(SimilarityEvaluation):
    def __init__(self, threshold=0.8):
        self.threshold = threshold

    def evaluation(self, src_dict, cache_dict, **kwargs):
        distance = cache_dict.get("search_result", (1.0,))[0]
        similarity = 1 / (1 + distance)  # convert L2 → similarity
        return similarity if similarity >= self.threshold else 0.0

    def range(self):
        return 0.0, 1.0

# Plug it in
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=ThresholdEvaluation(threshold=0.8),
)
```

Rather than modifying your existing RAG pipeline, wrap it. The `CachedRAGBot` class below sits in front of your Day 1 bot, checks the cache first, and only falls through to full retrieval + generation on a miss.
```python
import time
import numpy as np
from sentence_transformers import SentenceTransformer

class CachedRAGBot:
    def __init__(self, rag_bot, threshold=0.8):
        self.rag_bot = rag_bot
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = threshold
        self.cache = {}  # query_text → (embedding, response, ts)
        self.cache_embeddings = []
        self.cache_keys = []

    def _find_similar(self, query_embedding):
        if not self.cache_embeddings:
            return None, 0.0
        sims = np.dot(self.cache_embeddings, query_embedding)
        best_idx = np.argmax(sims)
        if sims[best_idx] >= self.threshold:
            return self.cache_keys[best_idx], sims[best_idx]
        return None, sims[best_idx]

    def query(self, question):
        start = time.time()
        q_emb = self.encoder.encode(question, normalize_embeddings=True)
        cached_key, score = self._find_similar(q_emb)
        if cached_key is not None:
            return {
                "answer": self.cache[cached_key][1],
                "cached": True,
                "similarity": float(score),
                "latency_ms": (time.time() - start) * 1000,
            }
        # Cache miss — full pipeline
        response = self.rag_bot.query(question)
        self.cache[question] = (q_emb, response, time.time())
        self.cache_embeddings.append(q_emb)
        self.cache_keys.append(question)
        return {
            "answer": response,
            "cached": False,
            "similarity": float(score),
            "latency_ms": (time.time() - start) * 1000,
        }
```

Why Normalize Embeddings?
Setting `normalize_embeddings=True` makes every vector unit-length, so a simple dot product equals cosine similarity. This avoids importing a separate distance function and keeps the cache lookup fast.
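A quick numeric check of that identity, in plain Python with no model involved:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity of two raw (unnormalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

a, b = [3.0, 4.0], [4.0, 3.0]
dot_normalized = sum(x * y for x, y in zip(normalize(a), normalize(b)))
print(round(dot_normalized, 6))  # 0.96
print(round(cosine(a, b), 6))    # 0.96, the same value
```

Because the vectors are unit length, the dot product and cosine similarity coincide, so the cache's `np.dot` lookup is already computing cosine similarity.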
Speed without quality is useless. RAGAS (Retrieval-Augmented Generation Assessment) is a framework that evaluates your RAG pipeline across multiple dimensions without needing human-annotated ground truth. It uses an LLM as a judge to score each sample automatically.
| Metric | What It Measures | Needs Reference? |
|---|---|---|
| Faithfulness | Is every claim in the answer grounded in retrieved context? | No |
| Answer Relevancy | Does the answer actually address the question asked? | No |
| Context Precision | Are the most relevant chunks ranked at the top? | Yes |
| Context Recall | Does the context contain all the information needed? | Yes |
| Factual Correctness | Is the answer factually accurate vs. a reference? | Yes |
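For intuition, Faithfulness boils down to a fraction: the judge LLM decomposes the answer into claims, checks each against the retrieved context, and scores supported claims over total claims. A toy illustration, where the per-claim verdicts are hypothetical (in RAGAS the judge produces them):

```python
def faithfulness_score(claim_verdicts):
    """claim_verdicts: one boolean per claim in the answer,
    True if the claim is supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Example: 6 of 7 claims grounded in the context
print(round(faithfulness_score([True] * 6 + [False]), 3))  # 0.857
```

The other metrics follow the same judge-then-aggregate pattern, just with different questions put to the judge.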
Reference Answers Are Optional — But Valuable
Faithfulness and Answer Relevancy work without any ground truth. Context Precision, Context Recall, and Factual Correctness require reference answers. Even a small set of 20–25 curated Q&A pairs dramatically improves the signal from your evaluation.
```python
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import (
    Faithfulness,
    ResponseRelevancy,
    LLMContextPrecision,
    LLMContextRecall,
    FactualCorrectness,
)
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Build evaluation samples from your RAG bot
samples = []
for question, reference_answer in qa_pairs:
    result = rag_bot.query(question)
    samples.append(SingleTurnSample(
        user_input=question,
        response=result["answer"],
        retrieved_contexts=result["contexts"],
        reference=reference_answer,
    ))

# Run evaluation
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
scores = evaluate(
    dataset=EvaluationDataset(samples=samples),
    metrics=[
        Faithfulness(),
        ResponseRelevancy(),
        LLMContextPrecision(),
        LLMContextRecall(),
        FactualCorrectness(),
    ],
    llm=evaluator_llm,
)
print(scores)
# {'faithfulness': 0.857, 'response_relevancy': 0.923, ...}

# Export per-question breakdown
scores.to_pandas().to_csv("ragas_results.csv", index=False)
```

Your evaluation is only as good as your test data. Aim for 20+ diverse question-answer pairs that cover several categories of difficulty and intent.
- Simple factual — "What is X?" Tests basic single-chunk retrieval.
- Multi-hop — "How does X relate to Y?" Tests context aggregation across chunks.
- Paraphrased — Same question in 3 different wordings. Tests semantic cache hit rate.
- Adversarial — Questions with no answer in the documents. Tests faithfulness (the bot should say it doesn't know).
- Specific — "What was the revenue in Q3?" Tests precision and whether the right chunk surfaces.
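Concretely, the test set can be a flat list of question-reference pairs tagged by category. Every question below is a hypothetical placeholder to adapt to your own corpus:

```python
# (question, reference_answer, category) — placeholders to adapt
qa_pairs_tagged = [
    ("What is X?", "X is ...", "simple_factual"),
    ("How does X relate to Y?", "X feeds into Y by ...", "multi_hop"),
    ("Tell me about X", "X is ...", "paraphrased"),
    ("Explain X to me", "X is ...", "paraphrased"),
    ("What was the revenue in Q3?", "Revenue was ...", "specific"),
    ("Who won the 1970 World Cup?", "Not covered in the documents.", "adversarial"),
]

# Strip the tags for the evaluation loop, keep them for per-category analysis
qa_pairs = [(q, ref) for q, ref, _ in qa_pairs_tagged]
```

Keeping the category tag around lets you break RAGAS scores down per category later, which is where the interesting failures usually show up.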
Bootstrap with RAGAS TestsetGenerator
RAGAS can auto-generate test pairs from your documents using `TestsetGenerator`. Generate 25 pairs, then manually review and correct them. This is much faster than writing all of them by hand.
The final deliverable ties everything together: a table comparing latency (cached vs. uncached) and RAGAS scores across different configurations — for example, different chunk sizes.
```python
import pandas as pd
import time

def run_evaluation(bot, qa_pairs, config_name):
    results = []
    for q, ref in qa_pairs:
        start = time.time()
        result = bot.query(q)
        latency = (time.time() - start) * 1000
        results.append({
            "config": config_name,
            "question": q,
            "answer": result["answer"],
            "contexts": result.get("contexts", []),
            "reference": ref,
            "latency_ms": latency,
            "cached": result.get("cached", False),
        })
    return results

# Sweep configs
configs = {
    "chunk_512_nocache": rag_512,
    "chunk_512_cached": cached_rag_512,
    "chunk_1024_nocache": rag_1024,
    "chunk_1024_cached": cached_rag_1024,
}
all_results = []
for name, bot in configs.items():
    all_results.extend(run_evaluation(bot, qa_pairs, name))

df = pd.DataFrame(all_results)
comparison = df.groupby("config").agg(
    avg_latency_ms=("latency_ms", "mean"),
    p95_latency_ms=("latency_ms", lambda x: x.quantile(0.95)),
    cache_hit_rate=("cached", "mean"),
).reset_index()
print(comparison.to_markdown(index=False))
```

| Config | Avg Latency (ms) | P95 Latency (ms) | Cache Hit Rate |
|---|---|---|---|
| chunk_512_nocache | 1,847 | 2,340 | 0% |
| chunk_512_cached | 423 | 1,920 | 65% |
| chunk_1024_nocache | 2,103 | 2,890 | 0% |
| chunk_1024_cached | 512 | 2,100 | 60% |
The numbers above are illustrative, but the pattern is consistent: caching cuts average latency by 3–4× once the cache warms up, and the benefit compounds as query volume grows.
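The warm-cache speedup follows directly from expected latency: avg = hit_rate × cached_ms + (1 − hit_rate) × uncached_ms. A quick back-of-envelope with hypothetical round numbers:

```python
def expected_latency_ms(hit_rate, cached_ms, uncached_ms):
    """Expected per-query latency for a cache with the given hit rate."""
    return hit_rate * cached_ms + (1 - hit_rate) * uncached_ms

# Hypothetical: 65% hit rate, ~50 ms cache lookups, ~1850 ms LLM calls
warm = expected_latency_ms(0.65, 50, 1850)
cold = expected_latency_ms(0.0, 50, 1850)
print(round(warm, 1))  # 680.0
print(round(cold / warm, 1))
```

The exact speedup factor depends on your hit rate and how cheap a cache lookup is relative to a full retrieval-plus-generation round trip, which is why it grows as traffic (and therefore query repetition) grows.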
Organize your evaluation notebook so anyone can re-run it end to end. A clean structure also makes it easier to add new configurations or metrics later.
- Setup & Imports — Install dependencies, import your RAG bot from Day 1.
- Add Semantic Cache — Wrap the bot with `CachedRAGBot`, set the threshold.
- Define Q&A Pairs — 20+ diverse test cases across all categories.
- Run Queries — Execute cached and uncached runs, record latency and hit/miss.
- RAGAS Evaluation — Score both configurations on all five metrics.
- Comparison Table — Aggregate latency stats and RAGAS scores per config.
- Visualization — Box plots for latency distribution, bar charts for RAGAS scores.
- Analysis — Which chunking strategy scored highest on faithfulness? How much latency did caching save? Any false cache hits?
Semantic caching and RAGAS evaluation address two sides of the same coin: performance and quality. Caching makes your pipeline cheaper and faster without changing the underlying retrieval or generation logic. RAGAS gives you a quantitative signal on whether that logic is working well in the first place.
The Threshold Is Everything
Start your similarity threshold at **0.8** for factual workloads and tune down carefully. A single false cache hit — serving the wrong answer confidently — can erode user trust faster than any latency savings can build it.
- Semantic cache integrated with a configurable threshold
- Cache hit/miss logging with latency timestamps
- 20+ diverse Q&A test pairs created
- RAGAS metrics computed: faithfulness, answer relevancy, context precision, context recall, factual correctness
- Comparison table: cached vs. uncached latency
- Comparison table: RAGAS scores per chunking strategy
- Evaluation notebook runs end-to-end without errors
- README updated with Day 2 results
Next Steps
Once your evaluation pipeline is solid, experiment with eviction policies (LRU, TTL) to keep the cache fresh, try different embedding models to improve hit rates, and explore RAGAS's `TestsetGenerator` to scale your test suite automatically. For deeper evaluation insights, check out [**Inside a Production ML Evaluation Harness**](/blog/inside-an-ml-evaluation-harness) to learn about F1 scores, macro averaging, and latency percentiles.
Related Articles
The Impartial Judge: Inside a Production ML Evaluation Harness
A developer's walkthrough of a real ML eval harness — F1, macro averaging, OOS recall, warmup, and p50/p95/p99 latency — and the design decisions behind each.
From Words to Intelligence: Building an MLP Classifier on Pretrained Sentence Embeddings
A deep dive into pretrained sentence embeddings, MLP architecture, BatchNorm, Dropout, Adam, and early stopping — with full PyTorch implementation.
TF-IDF + Logistic Regression: The Classical ML Baseline You Should Try First
Before reaching for LLMs or neural networks for text classification, try the boring thing. Here's how TF-IDF + Logistic Regression works, why it's often embarrassingly competitive, and where it breaks.