Semantic Caching & RAGAS Evaluation: Make Your RAG Pipeline Faster and Measurable

Learn how to add semantic caching to your RAG pipeline for lower latency and cost, then measure quality with RAGAS evaluation metrics.

You've built a RAG bot. It retrieves context, generates answers, and mostly works. But two questions keep nagging: how do I make it faster, and how do I know if it's actually good? This post tackles both. We'll wire up a semantic cache that intercepts repeated queries before they ever touch the LLM, then plug in RAGAS — a reference-free evaluation framework — to put hard numbers on retrieval and generation quality. If you're still figuring out how to break your documents into chunks, start with Chunking in RAG — Breaking Text the Right Way.

A traditional cache matches queries by their exact string. That's fine for database lookups, but terrible for LLM traffic. Users ask "What is Python?", "Tell me about Python", and "Explain the Python language" — three strings, one intent. An exact-match cache misses all of them after the first.

Semantic caching solves this by comparing meaning instead of characters. Every query gets embedded into a vector, and we search the cache using cosine similarity. If a stored query is close enough, we skip the LLM entirely and return the cached response.

The flow is straightforward: embed the incoming query, search a vector store for the nearest cached embedding, and compare the similarity score against a configurable threshold. Above the threshold means a hit — below means a miss, and the full retrieval-generation pipeline runs as normal. The new response is then stored for future queries.
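A minimal sketch of that decision, using toy 3-dimensional vectors in place of real model embeddings and a hand-picked threshold:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_emb, cached, threshold=0.8):
    """Return (response, similarity) for the closest cached query,
    or (None, similarity) when nothing clears the threshold."""
    best_resp, best_sim = None, -1.0
    for emb, resp in cached:
        sim = cosine(query_emb, emb)
        if sim > best_sim:
            best_resp, best_sim = resp, sim
    return (best_resp, best_sim) if best_sim >= threshold else (None, best_sim)

# Toy 3-d vectors standing in for real embeddings
cached = [(np.array([1.0, 0.0, 0.0]), "Python is a programming language.")]
hit, sim = lookup(np.array([0.95, 0.10, 0.0]), cached)   # near-duplicate query
miss, _ = lookup(np.array([0.0, 1.0, 0.0]), cached)      # unrelated query
```

The near-duplicate vector scores well above the threshold and returns the cached answer; the unrelated one falls through to the full pipeline.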

GPTCache is an open-source library from Zilliz with pluggable components for embedding, storage, similarity evaluation, and eviction. It wraps the OpenAI API so you can drop it into an existing project with minimal changes.

gptcache_setup.py
python
from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# 1. Embedding model (ONNX is lightweight and fast)
onnx = Onnx()

# 2. Storage backends
cache_base = CacheBase("sqlite")          # stores responses
vector_base = VectorBase("faiss",         # stores embeddings
                          dimension=onnx.dimension)
data_manager = get_data_manager(cache_base, vector_base)

# 3. Initialize
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

# 4. Use it — same interface as openai
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is Python?"}],
)

Drop-in Replacement

GPTCache's `openai` adapter mirrors the real OpenAI API. Your first call is a cache miss and hits the LLM normally. A second call with a semantically similar query — like *"Tell me about Python"* — returns instantly from the cache.

The threshold is the single most important knob. Set it too low and you'll serve cached answers to the wrong questions. Set it too high and the cache barely fires. The right value depends on your use case.

| Threshold Range | Use Case | Trade-off |
| --- | --- | --- |
| 0.80 – 0.85 | Factual Q&A, compliance, medical | High precision, fewer cache hits |
| 0.70 – 0.80 | General conversational queries | Balanced — good starting point |
| 0.60 – 0.70 | Creative, fuzzy, or exploratory apps | High recall, risk of wrong answers |

To find the sweet spot, build a small test harness: create pairs of queries that should match and pairs that should not, then sweep the threshold and observe where false hits start appearing.
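Such a harness fits in a few lines. The similarity scores below are toy stand-ins for what your embedding model would produce on each pair; in practice you'd compute them with the same encoder your cache uses:

```python
import numpy as np

# Labeled pairs: (similarity score, should_match).
# The scores are illustrative stand-ins for real embedding similarities.
pairs = [
    (0.97, True),   # "What is Python?" vs "Tell me about Python"
    (0.88, True),   # close paraphrase
    (0.74, True),   # looser paraphrase
    (0.69, False),  # "What is Python?" vs "What is Cython?"
    (0.41, False),  # unrelated question
]

def sweep(pairs, thresholds):
    """For each threshold, count false hits (wrong answer served)
    and missed hits (should have matched but didn't)."""
    rows = []
    for t in thresholds:
        false_hits = sum(1 for s, match in pairs if s >= t and not match)
        missed = sum(1 for s, match in pairs if s < t and match)
        rows.append((t, false_hits, missed))
    return rows

for t, fh, missed in sweep(pairs, np.arange(0.60, 0.95, 0.05)):
    print(f"threshold={t:.2f}  false_hits={fh}  missed_hits={missed}")
```

The sweet spot is the highest threshold at which false hits stay at zero while missed hits remain tolerable.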

custom_threshold.py
python
from gptcache.similarity_evaluation import SimilarityEvaluation

class ThresholdEvaluation(SimilarityEvaluation):
    def __init__(self, threshold=0.8):
        self.threshold = threshold

    def evaluation(self, src_dict, cache_dict, **kwargs):
        distance = cache_dict.get("search_result", (1.0,))[0]
        similarity = 1 / (1 + distance)  # convert L2 → similarity
        return similarity if similarity >= self.threshold else 0.0

    def range(self):
        return 0.0, 1.0

# Plug it in
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=ThresholdEvaluation(threshold=0.8),
)

Rather than modifying your existing RAG pipeline, wrap it. The CachedRAGBot class below sits in front of your Day 1 bot, checks the cache first, and only falls through to full retrieval + generation on a miss.

cached_rag_bot.py
python
import time
import numpy as np
from sentence_transformers import SentenceTransformer

class CachedRAGBot:
    def __init__(self, rag_bot, threshold=0.8):
        self.rag_bot = rag_bot
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = threshold
        self.cache = {}           # query_text → (embedding, response, ts)
        self.cache_embeddings = []
        self.cache_keys = []

    def _find_similar(self, query_embedding):
        if not self.cache_embeddings:
            return None, 0.0
        sims = np.dot(self.cache_embeddings, query_embedding)
        best_idx = np.argmax(sims)
        return (self.cache_keys[best_idx], sims[best_idx]) \
            if sims[best_idx] >= self.threshold else (None, sims[best_idx])

    def query(self, question):
        start = time.time()
        q_emb = self.encoder.encode(question, normalize_embeddings=True)
        cached_key, score = self._find_similar(q_emb)

        if cached_key:
            return {
                "answer": self.cache[cached_key][1],
                "cached": True,
                "similarity": float(score),
                "latency_ms": (time.time() - start) * 1000,
            }

        # Cache miss — full pipeline
        response = self.rag_bot.query(question)
        self.cache[question] = (q_emb, response, time.time())
        self.cache_embeddings.append(q_emb)
        self.cache_keys.append(question)

        return {
            "answer": response,
            "cached": False,
            "similarity": float(score),
            "latency_ms": (time.time() - start) * 1000,
        }

Why Normalize Embeddings?

Setting `normalize_embeddings=True` makes every vector unit-length, so a simple dot product equals cosine similarity. This avoids importing a separate distance function and keeps the cache lookup fast.
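The equivalence is easy to check with random vectors standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)  # pretend 384-d embeddings

# Full cosine similarity: dot product divided by both norms
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize once at encode time; a plain dot product then suffices
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
dot = np.dot(a_n, b_n)

assert np.isclose(cos, dot)  # identical up to floating-point error
```

Paying the normalization cost once per encode, rather than two norm computations per comparison, is what keeps the linear scan in `_find_similar` cheap.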

Speed without quality is useless. RAGAS (Retrieval-Augmented Generation Assessment) is a framework that evaluates your RAG pipeline across multiple dimensions without needing human-annotated ground truth. It uses an LLM as a judge to score each sample automatically.

| Metric | What It Measures | Needs Reference? |
| --- | --- | --- |
| Faithfulness | Is every claim in the answer grounded in retrieved context? | No |
| Answer Relevancy | Does the answer actually address the question asked? | No |
| Context Precision | Are the most relevant chunks ranked at the top? | Yes |
| Context Recall | Does the context contain all the information needed? | Yes |
| Factual Correctness | Is the answer factually accurate vs. a reference? | Yes |

Reference Answers Are Optional — But Valuable

Faithfulness and Answer Relevancy work without any ground truth. Context Precision, Context Recall, and Factual Correctness require reference answers. Even a small set of 20–25 curated Q&A pairs dramatically improves the signal from your evaluation.
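The evaluation script that follows assumes a `qa_pairs` list of (question, reference answer) tuples. A minimal hand-curated sketch, with placeholder content you'd replace with pairs drawn from your own documents:

```python
# Hand-curated (question, reference_answer) pairs. Even a short list like
# this sharpens the reference-based metrics; aim for 20-25 in practice.
qa_pairs = [
    ("What is Python?",
     "Python is a high-level, general-purpose programming language."),
    ("Who created Python?",
     "Python was created by Guido van Rossum and first released in 1991."),
    ("What does RAG stand for?",
     "RAG stands for retrieval-augmented generation."),
]
```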

ragas_eval.py
python
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import (
    Faithfulness,
    ResponseRelevancy,
    LLMContextPrecisionWithReference,
    LLMContextRecall,
    FactualCorrectness,
)
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Build evaluation samples from your RAG bot
samples = []
for question, reference_answer in qa_pairs:
    result = rag_bot.query(question)
    samples.append(SingleTurnSample(
        user_input=question,
        response=result["answer"],
        retrieved_contexts=result["contexts"],
        reference=reference_answer,
    ))

# Run evaluation
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
scores = evaluate(
    dataset=EvaluationDataset(samples=samples),
    metrics=[
        Faithfulness(),
        ResponseRelevancy(),
        LLMContextPrecisionWithReference(),
        LLMContextRecall(),
        FactualCorrectness(),
    ],
    llm=evaluator_llm,
)

print(scores)
# {'faithfulness': 0.857, 'response_relevancy': 0.923, ...}

# Export per-question breakdown
scores.to_pandas().to_csv("ragas_results.csv", index=False)

Your evaluation is only as good as your test data. Aim for 20+ diverse question-answer pairs that cover several categories of difficulty and intent.

  • Simple factual — "What is X?" Tests basic single-chunk retrieval.
  • Multi-hop — "How does X relate to Y?" Tests context aggregation across chunks.
  • Paraphrased — Same question in 3 different wordings. Tests semantic cache hit rate.
  • Adversarial — Questions with no answer in the documents. Tests faithfulness (the bot should say it doesn't know).
  • Specific — "What was the revenue in Q3?" Tests precision and whether the right chunk surfaces.

Bootstrap with RAGAS TestsetGenerator

RAGAS can auto-generate test pairs from your documents using `TestsetGenerator`. Generate 25 pairs, then manually review and correct them. This is much faster than writing all of them by hand.

The final deliverable ties everything together: a table comparing latency (cached vs. uncached) and RAGAS scores across different configurations — for example, different chunk sizes.

comparison.py
python
import pandas as pd
import time

def run_evaluation(bot, qa_pairs, config_name):
    results = []
    for q, ref in qa_pairs:
        start = time.time()
        result = bot.query(q)
        latency = (time.time() - start) * 1000
        results.append({
            "config": config_name,
            "question": q,
            "answer": result["answer"],
            "contexts": result.get("contexts", []),
            "reference": ref,
            "latency_ms": latency,
            "cached": result.get("cached", False),
        })
    return results

# Sweep configs
configs = {
    "chunk_512_nocache":  rag_512,
    "chunk_512_cached":   cached_rag_512,
    "chunk_1024_nocache": rag_1024,
    "chunk_1024_cached":  cached_rag_1024,
}

all_results = []
for name, bot in configs.items():
    all_results.extend(run_evaluation(bot, qa_pairs, name))

df = pd.DataFrame(all_results)

comparison = df.groupby("config").agg(
    avg_latency_ms=("latency_ms", "mean"),
    p95_latency_ms=("latency_ms", lambda x: x.quantile(0.95)),
    cache_hit_rate=("cached", "mean"),
).reset_index()

print(comparison.to_markdown(index=False))

| Config | Avg Latency (ms) | P95 Latency (ms) | Cache Hit Rate |
| --- | --- | --- | --- |
| chunk_512_nocache | 1,847 | 2,340 | 0% |
| chunk_512_cached | 423 | 1,920 | 65% |
| chunk_1024_nocache | 2,103 | 2,890 | 0% |
| chunk_1024_cached | 512 | 2,100 | 60% |

The numbers above are illustrative, but the pattern is consistent: caching cuts average latency by 3–4× once the cache warms up, and the benefit compounds as query volume grows.
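The steady-state average follows directly from the hit rate. The component numbers here are assumptions for illustration — roughly 20 ms for an embed-plus-lookup hit and an 1,800 ms full pipeline on a miss:

```python
def expected_latency(hit_rate, hit_ms, miss_ms):
    """Steady-state average latency of a cache-fronted pipeline."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

# Assumed: ~20 ms on a hit, ~1800 ms on a miss, 70% hit rate once warm
avg = expected_latency(hit_rate=0.7, hit_ms=20, miss_ms=1800)
speedup = 1800 / avg
print(f"avg={avg:.0f} ms, speedup={speedup:.1f}x")
```

With those assumptions the blended average lands around 554 ms, a bit over 3× faster than the uncached pipeline — and every point of hit rate you gain pushes it further.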

Organize your evaluation notebook so anyone can re-run it end to end. A clean structure also makes it easier to add new configurations or metrics later.

  1. Setup & Imports — Install dependencies, import your RAG bot from Day 1.
  2. Add Semantic Cache — Wrap the bot with CachedRAGBot, set threshold.
  3. Define Q&A Pairs — 20+ diverse test cases across all categories.
  4. Run Queries — Execute cached and uncached runs, record latency and hit/miss.
  5. RAGAS Evaluation — Score both configurations on all five metrics.
  6. Comparison Table — Aggregate latency stats and RAGAS scores per config.
  7. Visualization — Box plots for latency distribution, bar charts for RAGAS scores.
  8. Analysis — Which chunking strategy scored highest on faithfulness? How much latency did caching save? Any false cache hits?
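For step 7, a headless matplotlib sketch covers both plots; the tiny DataFrame here stands in for the `df` built in comparison.py:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; render without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy results standing in for the per-question DataFrame from comparison.py
df = pd.DataFrame({
    "config": ["chunk_512_nocache"] * 3 + ["chunk_512_cached"] * 3,
    "latency_ms": [1700, 1850, 2000, 30, 40, 1900],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Box plot: latency distribution per config
df.boxplot(column="latency_ms", by="config", ax=ax1)
ax1.set_ylabel("latency (ms)")

# Bar chart: illustrative RAGAS faithfulness scores per config
scores = {"chunk_512_nocache": 0.86, "chunk_512_cached": 0.86}
ax2.bar(scores.keys(), scores.values())
ax2.set_ylabel("faithfulness")

fig.suptitle("Latency and quality per configuration")
fig.savefig("comparison.png")
```

Matching faithfulness bars with very different latency boxes is the result you want: the cache should change speed, not quality.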

Semantic caching and RAGAS evaluation address two sides of the same coin: performance and quality. Caching makes your pipeline cheaper and faster without changing the underlying retrieval or generation logic. RAGAS gives you a quantitative signal on whether that logic is working well in the first place.

The Threshold Is Everything

Start your similarity threshold at **0.8** for factual workloads and tune down carefully. A single false cache hit — serving the wrong answer confidently — can erode user trust faster than any latency savings can build it.

  • Semantic cache integrated with a configurable threshold
  • Cache hit/miss logging with latency timestamps
  • 20+ diverse Q&A test pairs created
  • RAGAS metrics computed: faithfulness, answer relevancy, context precision, context recall, factual correctness
  • Comparison table: cached vs. uncached latency
  • Comparison table: RAGAS scores per chunking strategy
  • Evaluation notebook runs end-to-end without errors
  • README updated with Day 2 results

Next Steps

Once your evaluation pipeline is solid, experiment with eviction policies (LRU, TTL) to keep the cache fresh, try different embedding models to improve hit rates, and explore RAGAS's `TestsetGenerator` to scale your test suite automatically. For deeper evaluation insights, check out [**Inside a Production ML Evaluation Harness**](/blog/inside-an-ml-evaluation-harness) to learn about F1 scores, macro averaging, and latency percentiles.
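To make the eviction idea concrete before reaching for a library, here is one way to sketch a combined LRU + TTL policy. The class name and knobs are hypothetical, not part of GPTCache:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Toy LRU + TTL cache: entries expire ttl_s seconds after insertion,
    and the least recently used entry is evicted once max_size is exceeded."""

    def __init__(self, max_size=1000, ttl_s=3600):
        self.max_size, self.ttl_s = max_size, ttl_s
        self._store = OrderedDict()  # key -> (value, inserted_at)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        item = self._store.get(key)
        if item is None:
            return None
        value, inserted_at = item
        if now - inserted_at > self.ttl_s:
            del self._store[key]       # expired: treat as a miss
            return None
        self._store.move_to_end(key)   # mark as recently used
        return value

    def put(self, key, value, now=None):
        now = time.time() if now is None else now
        self._store[key] = (value, now)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
```

The `now` parameter is injected purely to make the policy testable; in production you'd let it default to the wall clock. Note that a TTL keyed to insertion time (as here) means hot entries still expire — swap `inserted_at` for a last-access timestamp if you want hits to refresh freshness.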
