
Semantic Caching & RAGAS Evaluation: Make Your RAG Pipeline Faster and Measurable
Learn how to add semantic caching to your RAG pipeline for lower latency and cost, then measure quality with RAGAS evaluation metrics.
You've built a RAG bot. It retrieves context, generates answers, and mostly works. But two questions keep nagging: *how do I make it faster?* and *how do I know if it's actually good?* This post tackles both. We'll wire up a semantic cache that intercepts repeated queries before they ever touch the LLM, then plug in RAGAS — a reference-free evaluation framework — to put hard numbers on retrieval and generation quality. If you're still figuring out how to break your documents into chunks, start with Chunking in RAG — Breaking Text the Right Way.
A traditional cache matches queries by their exact string. That's fine for database lookups, but terrible for LLM traffic. Users ask "What is Python?", "Tell me about Python", and "Explain the Python language" — three strings, one intent. An exact-match cache misses all of them after the first.
Semantic caching solves this by comparing meaning instead of characters. Every query gets embedded into a vector, and we search the cache using cosine similarity. If a stored query is close enough, we skip the LLM entirely and return the cached response.
The flow is straightforward: embed the incoming query, search a vector store for the nearest cached embedding, and compare the similarity score against a configurable threshold. Above the threshold means a hit — below means a miss, and the full retrieval-generation pipeline runs as normal. The new response is then stored for future queries.
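That flow fits in a few dozen lines. Here's a minimal, library-free sketch, where `embed` is a stand-in for your real embedding model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """In-memory semantic cache: a hit means the nearest stored
    query embedding clears the similarity threshold."""

    def __init__(self, embed, threshold=0.8):
        self.embed = embed          # callable: query text -> vector
        self.threshold = threshold
        self.entries = []           # list of (embedding, response)

    def lookup(self, query):
        q = self.embed(query)
        best_response, best_sim = None, -1.0
        for emb, response in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_response, best_sim = response, sim
        if best_sim >= self.threshold:
            return best_response    # cache hit
        return None                 # miss: run the full RAG pipeline

    def store(self, query, response):
        self.entries.append((self.embed(query), response))
```

A production version would use a vector index instead of a linear scan, but the hit/miss logic is exactly this.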
GPTCache is an open-source library from Zilliz with pluggable components for embedding, storage, similarity evaluation, and eviction. It wraps the OpenAI API so you can drop it into an existing project with minimal changes.
```python
from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# 1. Embedding model (ONNX is lightweight and fast)
onnx = Onnx()

# 2. Storage backends
cache_base = CacheBase("sqlite")    # stores responses
vector_base = VectorBase("faiss",   # stores embeddings
                         dimension=onnx.dimension)
data_manager = get_data_manager(cache_base, vector_base)

# 3. Initialize
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

# 4. Use it — same interface as openai
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is Python?"}],
)
```

Drop-in Replacement
GPTCache's `openai` adapter mirrors the real OpenAI API. Your first call is a cache miss and hits the LLM normally. A second call with a semantically similar query — like *"Tell me about Python"* — returns instantly from the cache.
The threshold is the single most important knob. Set it too low and you'll serve cached answers to the wrong questions. Set it too high and the cache barely fires. The right value depends on your use case.
| Threshold Range | Use Case | Trade-off |
|---|---|---|
| 0.80 – 0.85 | Factual Q&A, compliance, medical | High precision, fewer cache hits |
| 0.70 – 0.80 | General conversational queries | Balanced — good starting point |
| 0.60 – 0.70 | Creative, fuzzy, or exploratory apps | High recall, risk of wrong answers |
To find the sweet spot, build a small test harness: create pairs of queries that should match and pairs that should not, then sweep the threshold and observe where false hits start appearing.
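One way to sketch that harness in plain Python. The similarity scores below are hypothetical stand-ins for cosine similarities you'd compute with your real embedding model over hand-labeled query pairs:

```python
def sweep_thresholds(scored_pairs, thresholds):
    """scored_pairs: (similarity, should_match) tuples for hand-labeled
    query pairs; returns error counts at each candidate threshold."""
    report = []
    for t in thresholds:
        false_hits = sum(1 for sim, match in scored_pairs if sim >= t and not match)
        false_misses = sum(1 for sim, match in scored_pairs if sim < t and match)
        report.append({"threshold": t,
                       "false_hits": false_hits,
                       "false_misses": false_misses})
    return report

# Hypothetical labeled pairs: (similarity, should these queries match?)
scored_pairs = [
    (0.93, True),   # "What is Python?" vs "Tell me about Python"
    (0.88, True),   # another paraphrase
    (0.74, False),  # "Python language" vs "python snake care"
    (0.61, False),  # unrelated topics
]
for row in sweep_thresholds(scored_pairs, [0.6, 0.7, 0.8, 0.9]):
    print(row)
```

With this toy data, 0.8 is the first threshold with zero false hits and zero false misses, which is exactly the kind of boundary you're sweeping for.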
```python
from gptcache.similarity_evaluation import SimilarityEvaluation

class ThresholdEvaluation(SimilarityEvaluation):
    def __init__(self, threshold=0.8):
        self.threshold = threshold

    def evaluation(self, src_dict, cache_dict, **kwargs):
        distance = cache_dict.get("search_result", (1.0,))[0]
        similarity = 1 / (1 + distance)  # convert L2 → similarity
        return similarity if similarity >= self.threshold else 0.0

    def range(self):
        return 0.0, 1.0

# Plug it in
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=ThresholdEvaluation(threshold=0.8),
)
```

Rather than modifying your existing RAG pipeline, wrap it. The `CachedRAGBot` class below sits in front of your Day 1 bot, checks the cache first, and only falls through to full retrieval + generation on a miss.
```python
import time
import numpy as np
from sentence_transformers import SentenceTransformer

class CachedRAGBot:
    def __init__(self, rag_bot, threshold=0.8):
        self.rag_bot = rag_bot
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = threshold
        self.cache = {}  # query_text → (embedding, response, ts)
        self.cache_embeddings = []
        self.cache_keys = []

    def _find_similar(self, query_embedding):
        if not self.cache_embeddings:
            return None, 0.0
        sims = np.dot(self.cache_embeddings, query_embedding)
        best_idx = np.argmax(sims)
        if sims[best_idx] >= self.threshold:
            return self.cache_keys[best_idx], sims[best_idx]
        return None, sims[best_idx]

    def query(self, question):
        start = time.time()
        q_emb = self.encoder.encode(question, normalize_embeddings=True)
        cached_key, score = self._find_similar(q_emb)
        if cached_key is not None:
            return {
                "answer": self.cache[cached_key][1],
                "cached": True,
                "similarity": float(score),
                "latency_ms": (time.time() - start) * 1000,
            }
        # Cache miss — full pipeline
        response = self.rag_bot.query(question)
        self.cache[question] = (q_emb, response, time.time())
        self.cache_embeddings.append(q_emb)
        self.cache_keys.append(question)
        return {
            "answer": response,
            "cached": False,
            "similarity": float(score),
            "latency_ms": (time.time() - start) * 1000,
        }
```

Why Normalize Embeddings?
Setting `normalize_embeddings=True` makes every vector unit-length, so a simple dot product equals cosine similarity. This avoids importing a separate distance function and keeps the cache lookup fast.
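A quick numeric check of that identity, in plain Python with no model involved:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity of two raw (unnormalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

a, b = [3.0, 4.0], [4.0, 3.0]
dot_normalized = sum(x * y for x, y in zip(normalize(a), normalize(b)))
print(round(dot_normalized, 6))  # 0.96
print(round(cosine(a, b), 6))    # 0.96, the same value
```

Because the vectors are unit length, the dot product and cosine similarity coincide, so the cache's `np.dot` lookup is already computing cosine similarity.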
Speed without quality is useless. RAGAS (Retrieval-Augmented Generation Assessment) is a framework that evaluates your RAG pipeline across multiple dimensions without needing human-annotated ground truth. It uses an LLM as a judge to score each sample automatically.
| Metric | What It Measures | Needs Reference? |
|---|---|---|
| Faithfulness | Is every claim in the answer grounded in retrieved context? | No |
| Answer Relevancy | Does the answer actually address the question asked? | No |
| Context Precision | Are the most relevant chunks ranked at the top? | Yes |
| Context Recall | Does the context contain all the information needed? | Yes |
| Factual Correctness | Is the answer factually accurate vs. a reference? | Yes |
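For intuition, Faithfulness boils down to a fraction: the judge LLM decomposes the answer into claims, checks each against the retrieved context, and scores supported claims over total claims. A toy illustration, where the per-claim verdicts are hypothetical (in RAGAS the judge produces them):

```python
def faithfulness_score(claim_verdicts):
    """claim_verdicts: one boolean per claim in the answer,
    True if the claim is supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Example: 6 of 7 claims grounded in the context
print(round(faithfulness_score([True] * 6 + [False]), 3))  # 0.857
```

The other metrics follow the same judge-then-aggregate pattern, just with different questions put to the judge.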
Reference Answers Are Optional — But Valuable
Faithfulness and Answer Relevancy work without any ground truth. Context Precision, Context Recall, and Factual Correctness require reference answers. Even a small set of 20–25 curated Q&A pairs dramatically improves the signal from your evaluation.
```python
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import (
    Faithfulness,
    ResponseRelevancy,
    LLMContextPrecision,
    LLMContextRecall,
    FactualCorrectness,
)
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Build evaluation samples from your RAG bot
samples = []
for question, reference_answer in qa_pairs:
    result = rag_bot.query(question)
    samples.append(SingleTurnSample(
        user_input=question,
        response=result["answer"],
        retrieved_contexts=result["contexts"],
        reference=reference_answer,
    ))

# Run evaluation
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
scores = evaluate(
    dataset=EvaluationDataset(samples=samples),
    metrics=[
        Faithfulness(),
        ResponseRelevancy(),
        LLMContextPrecision(),
        LLMContextRecall(),
        FactualCorrectness(),
    ],
    llm=evaluator_llm,
)
print(scores)
# {'faithfulness': 0.857, 'response_relevancy': 0.923, ...}

# Export per-question breakdown
scores.to_pandas().to_csv("ragas_results.csv", index=False)
```

Your evaluation is only as good as your test data. Aim for 20+ diverse question-answer pairs that cover several categories of difficulty and intent.
- Simple factual — "What is X?" Tests basic single-chunk retrieval.
- Multi-hop — "How does X relate to Y?" Tests context aggregation across chunks.
- Paraphrased — Same question in 3 different wordings. Tests semantic cache hit rate.
- Adversarial — Questions with no answer in the documents. Tests faithfulness (the bot should say it doesn't know).
- Specific — "What was the revenue in Q3?" Tests precision and whether the right chunk surfaces.
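Concretely, the test set can be a flat list of question-reference pairs tagged by category. Every question below is a hypothetical placeholder to adapt to your own corpus:

```python
# (question, reference_answer, category) — placeholders to adapt
qa_pairs_tagged = [
    ("What is X?", "X is ...", "simple_factual"),
    ("How does X relate to Y?", "X feeds into Y by ...", "multi_hop"),
    ("Tell me about X", "X is ...", "paraphrased"),
    ("Explain X to me", "X is ...", "paraphrased"),
    ("What was the revenue in Q3?", "Revenue was ...", "specific"),
    ("Who won the 1970 World Cup?", "Not covered in the documents.", "adversarial"),
]

# Strip the tags for the evaluation loop, keep them for per-category analysis
qa_pairs = [(q, ref) for q, ref, _ in qa_pairs_tagged]
```

Keeping the category tag around lets you break RAGAS scores down per category later, which is where the interesting failures usually show up.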
Bootstrap with RAGAS TestsetGenerator
RAGAS can auto-generate test pairs from your documents using `TestsetGenerator`. Generate 25 pairs, then manually review and correct them. This is much faster than writing all of them by hand.
The final deliverable ties everything together: a table comparing latency (cached vs. uncached) and RAGAS scores across different configurations — for example, different chunk sizes.
```python
import pandas as pd
import time

def run_evaluation(bot, qa_pairs, config_name):
    results = []
    for q, ref in qa_pairs:
        start = time.time()
        result = bot.query(q)
        latency = (time.time() - start) * 1000
        results.append({
            "config": config_name,
            "question": q,
            "answer": result["answer"],
            "contexts": result.get("contexts", []),
            "reference": ref,
            "latency_ms": latency,
            "cached": result.get("cached", False),
        })
    return results

# Sweep configs
configs = {
    "chunk_512_nocache": rag_512,
    "chunk_512_cached": cached_rag_512,
    "chunk_1024_nocache": rag_1024,
    "chunk_1024_cached": cached_rag_1024,
}
all_results = []
for name, bot in configs.items():
    all_results.extend(run_evaluation(bot, qa_pairs, name))

df = pd.DataFrame(all_results)
comparison = df.groupby("config").agg(
    avg_latency_ms=("latency_ms", "mean"),
    p95_latency_ms=("latency_ms", lambda x: x.quantile(0.95)),
    cache_hit_rate=("cached", "mean"),
).reset_index()
print(comparison.to_markdown(index=False))
```

| Config | Avg Latency (ms) | P95 Latency (ms) | Cache Hit Rate |
|---|---|---|---|
| chunk_512_nocache | 1,847 | 2,340 | 0% |
| chunk_512_cached | 423 | 1,920 | 65% |
| chunk_1024_nocache | 2,103 | 2,890 | 0% |
| chunk_1024_cached | 512 | 2,100 | 60% |
The numbers above are illustrative, but the pattern is consistent: caching cuts average latency by 3–4× once the cache warms up, and the benefit compounds as query volume grows.
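The warm-cache speedup follows directly from expected latency: avg = hit_rate × cached_ms + (1 − hit_rate) × uncached_ms. A quick back-of-envelope with hypothetical round numbers:

```python
def expected_latency_ms(hit_rate, cached_ms, uncached_ms):
    """Expected per-query latency for a cache with the given hit rate."""
    return hit_rate * cached_ms + (1 - hit_rate) * uncached_ms

# Hypothetical: 65% hit rate, ~50 ms cache lookups, ~1850 ms LLM calls
warm = expected_latency_ms(0.65, 50, 1850)
cold = expected_latency_ms(0.0, 50, 1850)
print(round(warm, 1))  # 680.0
print(round(cold / warm, 1))
```

The exact speedup factor depends on your hit rate and how cheap a cache lookup is relative to a full retrieval-plus-generation round trip, which is why it grows as traffic (and therefore query repetition) grows.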
Organize your evaluation notebook so anyone can re-run it end to end. A clean structure also makes it easier to add new configurations or metrics later.
- Setup & Imports — Install dependencies, import your RAG bot from Day 1.
- Add Semantic Cache — Wrap the bot with `CachedRAGBot`, set the threshold.
- Define Q&A Pairs — 20+ diverse test cases across all categories.
- Run Queries — Execute cached and uncached runs, record latency and hit/miss.
- RAGAS Evaluation — Score both configurations on all five metrics.
- Comparison Table — Aggregate latency stats and RAGAS scores per config.
- Visualization — Box plots for latency distribution, bar charts for RAGAS scores.
- Analysis — Which chunking strategy scored highest on faithfulness? How much latency did caching save? Any false cache hits?
Semantic caching and RAGAS evaluation address two sides of the same coin: performance and quality. Caching makes your pipeline cheaper and faster without changing the underlying retrieval or generation logic. RAGAS gives you a quantitative signal on whether that logic is working well in the first place.
The Threshold Is Everything
Start your similarity threshold at **0.8** for factual workloads and tune down carefully. A single false cache hit — serving the wrong answer confidently — can erode user trust faster than any latency savings can build it.
- Semantic cache integrated with a configurable threshold
- Cache hit/miss logging with latency timestamps
- 20+ diverse Q&A test pairs created
- RAGAS metrics computed: faithfulness, answer relevancy, context precision, context recall, factual correctness
- Comparison table: cached vs. uncached latency
- Comparison table: RAGAS scores per chunking strategy
- Evaluation notebook runs end-to-end without errors
- README updated with Day 2 results
Next Steps
Once your evaluation pipeline is solid, experiment with eviction policies (LRU, TTL) to keep the cache fresh, try different embedding models to improve hit rates, and explore RAGAS's `TestsetGenerator` to scale your test suite automatically. For deeper evaluation insights, check out [**Inside a Production ML Evaluation Harness**](/blog/inside-an-ml-evaluation-harness) to learn about F1 scores, macro averaging, and latency percentiles.
Related Articles
The Impartial Judge: Inside a Production ML Evaluation Harness
A developer's walkthrough of a real ML eval harness — F1, macro averaging, OOS recall, warmup, and p50/p95/p99 latency — and the design decisions behind each.
From Words to Intelligence: Building an MLP Classifier on Pretrained Sentence Embeddings
A deep dive into pretrained sentence embeddings, MLP architecture, BatchNorm, Dropout, Adam, and early stopping — with full PyTorch implementation.
TF-IDF + Logistic Regression: The Classical ML Baseline You Should Try First
Before reaching for LLMs or neural networks for text classification, try the boring thing. Here's how TF-IDF + Logistic Regression works, why it's often embarrassingly competitive, and where it breaks.