
Prompt Engineering Patterns & Techniques: The Complete Production Toolkit
Production-ready prompt engineering patterns with runnable Python code: chain-of-thought, few-shot learning, self-consistency, prompt chaining, structured output, system prompt design, and advanced techniques including A/B testing and regression frameworks.
Prompt engineering is applied interface design for language models. Every pattern here solves a specific failure mode: inconsistent reasoning, unstructured output, fragile single-call architectures, untested prompts leaking into production. The code is runnable, the patterns are battle-tested, and every section ends with something you can ship.
Prerequisites
Chain-of-thought (CoT) prompting asks the model to externalize its reasoning before producing a final answer. There are two variants: zero-shot CoT (append a trigger phrase) and manual CoT (spell out the reasoning structure). Zero-shot is fast to implement; manual CoT gives you control over the reasoning path.
from openai import OpenAI
client = OpenAI()
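# The 'refunded' branch below is the planted bug: it adds the amount instead
# of subtracting it. The bug is not annotated inside the string, because that
# comment would be sent to the model and give the answer away.
buggy_code = """
def process_orders(orders: list[dict]) -> float:
    total = 0
    for order in orders:
        if order['status'] == 'completed':
            total += order['amount']
            if order.get('discount'):
                total -= order['discount']
        elif order['status'] == 'refunded':
            total += order['amount']
    return total
"""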
# WITHOUT CoT: direct request
response_direct = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": "You are a senior code reviewer."},
        {"role": "user", "content": f"Review this code for bugs:\n```python\n{buggy_code}\n```"}
    ]
)
print("Direct:", response_direct.choices[0].message.content)

# WITH zero-shot CoT: one added instruction to reason step by step
response_zs_cot = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": "You are a senior code reviewer."},
        {"role": "user", "content": (
            "Review this code for bugs. Think through each branch "
            "step by step, tracing the logic for each order status "
            "before stating your findings.\n"
            f"```python\n{buggy_code}\n```"
        )}
    ]
)
print("Zero-shot CoT:", response_zs_cot.choices[0].message.content)

# WITH manual CoT: explicit reasoning structure
response_manual_cot = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": "You are a senior code reviewer."},
        {"role": "user", "content": (
            "Review this function for bugs. Follow these steps exactly:\n\n"
            "Step 1: State the function's intended purpose based on its name and signature.\n"
            "Step 2: Trace through each conditional branch. For each branch, state what SHOULD happen vs what DOES happen.\n"
            "Step 3: Check edge cases: empty list, missing keys, type mismatches.\n"
            "Step 4: List each bug with line number, severity (critical/warning/info), and fix.\n\n"
            f"```python\n{buggy_code}\n```"
        )}
    ]
)
print("Manual CoT:", response_manual_cot.choices[0].message.content)
In practice, the direct version often says "looks fine" or catches only the most obvious bug. Zero-shot CoT is more likely to catch the refund sign error. Manual CoT tends to surface the refund bug, the unguarded key lookups (order['status'], order['amount']), and the edge case where a discount of 0 and a missing discount are both falsy.
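For reference, one plausible corrected version of the function is sketched below. It assumes that refunded orders should reduce the total and that a missing `amount` or `discount` counts as zero; the original snippet states neither, and `process_orders_fixed` is a hypothetical name used only here.

```python
def process_orders_fixed(orders: list[dict]) -> float:
    """Sum completed orders net of discounts, minus refunded amounts.

    Assumptions (not from the original snippet): refunds subtract from the
    total, and missing 'amount' / 'discount' keys are treated as 0.
    """
    total = 0.0
    for order in orders:
        status = order.get("status")
        amount = order.get("amount", 0)
        if status == "completed":
            # `or 0` makes None and a missing key both behave like "no discount".
            total += amount - (order.get("discount") or 0)
        elif status == "refunded":
            total -= amount  # the buggy version added the amount here
    return total


orders = [
    {"status": "completed", "amount": 100, "discount": 10},
    {"status": "refunded", "amount": 30},
    {"status": "pending", "amount": 999},  # ignored: neither completed nor refunded
]
print(process_orders_fixed(orders))  # 60.0
```

A reference fix like this also gives you something concrete to compare each model review against.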
from openai import OpenAI
from dataclasses import dataclass

client = OpenAI()

@dataclass
class CoTResult:
    reasoning: str
    answer: str
    raw_response: str

def call_with_cot(
    task: str,
    reasoning_steps: list[str] | None = None,
    model: str = "gpt-4o",
    system: str = "You are a precise analytical assistant."
) -> CoTResult:
    """Wrap any task with chain-of-thought reasoning."""
    if reasoning_steps:
        steps_text = "\n".join(f"Step {i+1}: {s}" for i, s in enumerate(reasoning_steps))
        prompt = f"{task}\n\nReason through this following these steps:\n{steps_text}\n\nAfter your reasoning, provide your final answer on a new line starting with 'ANSWER:'"
    else:
        prompt = f"{task}\n\nThink through this step by step. After your reasoning, provide your final answer on a new line starting with 'ANSWER:'"
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ]
    )
    text = response.choices[0].message.content
    if "ANSWER:" in text:
        parts = text.split("ANSWER:", 1)
        return CoTResult(reasoning=parts[0].strip(), answer=parts[1].strip(), raw_response=text)
    # No marker found: return the whole response as both reasoning and answer
    return CoTResult(reasoning=text, answer=text, raw_response=text)

# Usage
result = call_with_cot(
    "Is this query construction safe from SQL injection?\n"
    "query = \"SELECT * FROM users WHERE id = '\" + user_input + \"'\"",
    reasoning_steps=[
        "Identify where user input enters the query",
        "Check if parameterized queries are used",
        "Assess the injection risk",
        "Suggest the fix"
    ]
)
print(f"Reasoning: {result.reasoning}")
print(f"Answer: {result.answer}")
When to Skip CoT
Few-shot learning teaches format and behavior through examples, not instructions. The model pattern-matches against your examples rather than interpreting your description of the task. This is almost always more reliable than zero-shot for structured output tasks.
from openai import OpenAI
import json

client = OpenAI()

# Few-shot examples as user/assistant pairs in the messages array
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": "Extract entities from text. Return JSON with keys: persons, organizations, locations, dates. Each value is a list of strings."
        },
        # Example 1
        {"role": "user", "content": "Apple CEO Tim Cook announced the new iPhone at their Cupertino headquarters on September 12th."},
        {"role": "assistant", "content": json.dumps({"persons": ["Tim Cook"], "organizations": ["Apple"], "locations": ["Cupertino"], "dates": ["September 12th"]})},
        # Example 2
        {"role": "user", "content": "No relevant entities here, just a general statement about the weather."},
        {"role": "assistant", "content": json.dumps({"persons": [], "organizations": [], "locations": [], "dates": []})},
        # Example 3: shows how to handle ambiguity
        {"role": "user", "content": "Jordan visited the Amazon office in Jordan last March."},
        {"role": "assistant", "content": json.dumps({"persons": ["Jordan"], "organizations": ["Amazon"], "locations": ["Jordan"], "dates": ["last March"]})},
        # Actual request
        {"role": "user", "content": "Microsoft's Satya Nadella met with EU regulators in Brussels on January 15, 2026 to discuss the OpenAI partnership."}
    ]
)
result = json.loads(response.choices[0].message.content)
print(json.dumps(result, indent=2))
# Typical output:
# {"persons": ["Satya Nadella"], "organizations": ["Microsoft", "EU", "OpenAI"],
# "locations": ["Brussels"], "dates": ["January 15, 2026"]}
from anthropic import Anthropic
import json

client = Anthropic()

# Anthropic uses the same user/assistant pattern but with a separate system param
examples = [
    ("Apple CEO Tim Cook announced the new iPhone at Cupertino on September 12th.",
     {"persons": ["Tim Cook"], "organizations": ["Apple"], "locations": ["Cupertino"], "dates": ["September 12th"]}),
    ("No relevant entities here, just a general statement about the weather.",
     {"persons": [], "organizations": [], "locations": [], "dates": []}),
    ("Jordan visited the Amazon office in Jordan last March.",
     {"persons": ["Jordan"], "organizations": ["Amazon"], "locations": ["Jordan"], "dates": ["last March"]}),
]
messages = []
for text, output in examples:
    messages.append({"role": "user", "content": text})
    messages.append({"role": "assistant", "content": json.dumps(output)})
# Add the real request
messages.append({"role": "user", "content": "Microsoft's Satya Nadella met with EU regulators in Brussels on January 15, 2026."})
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    system="Extract entities from text. Return JSON with keys: persons, organizations, locations, dates. Each value is a list of strings. Return only the JSON, no explanation.",
    messages=messages
)
result = json.loads(response.content[0].text)
print(json.dumps(result, indent=2))
from openai import OpenAI
import json
import random

client = OpenAI()

class FewShotBuilder:
    """Build few-shot prompts dynamically from an example dataset."""

    def __init__(self, system_prompt: str, examples: list[tuple[str, str]]):
        self.system_prompt = system_prompt
        self.examples = examples  # list of (input, output) tuples

    def build_messages(
        self,
        query: str,
        n_examples: int = 3,
        strategy: str = "random"
    ) -> list[dict]:
        """Build the messages array with selected examples."""
        if strategy == "random":
            selected = random.sample(self.examples, min(n_examples, len(self.examples)))
        elif strategy == "first":
            selected = self.examples[:n_examples]
        elif strategy == "last":
            selected = self.examples[-n_examples:]
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        messages = [{"role": "system", "content": self.system_prompt}]
        for inp, out in selected:
            messages.append({"role": "user", "content": inp})
            messages.append({"role": "assistant", "content": out})
        messages.append({"role": "user", "content": query})
        return messages

    def call(
        self,
        query: str,
        n_examples: int = 3,
        strategy: str = "random",
        model: str = "gpt-4o",
        temperature: float = 0,
        **kwargs
    ) -> str:
        # strategy is a named parameter so it reaches build_messages instead of
        # being forwarded to the API through **kwargs
        messages = self.build_messages(query, n_examples, strategy)
        response = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=messages,
            **kwargs
        )
        return response.choices[0].message.content

# Usage: sentiment classifier
sentiment_examples = [
    ("This product exceeded my expectations!", "positive"),
    ("Terrible quality, broke after one day.", "negative"),
    ("It works as described, nothing special.", "neutral"),
    ("Absolutely love it, buying another one!", "positive"),
    ("Worst purchase I've ever made.", "negative"),
    ("Decent for the price point.", "neutral"),
    ("The packaging was damaged but product is fine.", "neutral"),
    ("Customer service was rude and unhelpful.", "negative"),
]
classifier = FewShotBuilder(
    system_prompt="Classify the sentiment of product reviews. Respond with exactly one word: positive, negative, or neutral.",
    examples=sentiment_examples
)
# Test example ordering impact
results = {}
for strategy in ["first", "last", "random"]:
    label = classifier.call(
        "The item arrived late but works perfectly.",
        n_examples=3,
        strategy=strategy
    )
    results[strategy] = label
    print(f"Strategy '{strategy}': {label}")
# Selection can shift results: 'first' picks one example per label,
# while 'last' picks two neutral examples and one negative, with no positive
Example Ordering Matters
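One way to measure how much ordering matters for a given prompt is to classify the same query under every permutation of the examples and tally the labels. The sketch below is an assumption-laden illustration, not part of the toolkit above: `ordering_sensitivity` is a hypothetical helper, and `recency_biased_stub` is a toy stand-in for a model call that simply copies the label of the last example.

```python
from collections import Counter
from itertools import permutations
from typing import Callable

Example = tuple[str, str]  # (input text, label)

def ordering_sensitivity(
    classify: Callable[[list[Example], str], str],
    examples: list[Example],
    query: str,
) -> Counter:
    """Classify `query` once per ordering of `examples` and tally the labels."""
    return Counter(classify(list(order), query) for order in permutations(examples))

def recency_biased_stub(examples: list[Example], query: str) -> str:
    # Toy stand-in for a model call: copies the label of the last example.
    return examples[-1][1]

examples = [
    ("This product exceeded my expectations!", "positive"),
    ("Terrible quality, broke after one day.", "negative"),
    ("It works as described, nothing special.", "neutral"),
]
tally = ordering_sensitivity(
    recency_biased_stub, examples, "The item arrived late but works perfectly."
)
print(tally)  # a single dominant label suggests the prompt is order-robust
```

The number of permutations grows factorially, so with a real model this is only practical for three or four examples, or with a random subset of orderings.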
Self-consistency samples multiple reasoning paths at temperature > 0 and takes the majority answer. When individual samples are right more often than wrong and their errors are not strongly correlated, majority voting can lift accuracy well above that of a single call, at the cost of one model call per sample. The implementation is straightforward with async parallel calls.
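The size of the gain can be estimated with a short binomial calculation. The sketch below assumes binary answers, an odd number of samples, and fully independent samples; samples drawn from the same model are correlated in practice, so treat the result as an upper bound. `majority_accuracy` is a hypothetical helper name.

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct,
    when each sample is correct with probability p."""
    need = n // 2 + 1  # smallest number of correct samples that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

for n in (1, 5, 7):
    print(n, round(majority_accuracy(0.7, n), 3))
```

With a per-sample accuracy of 0.7, five samples give roughly 0.84 and seven roughly 0.87 under these assumptions, so each extra sample buys less than the one before.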
import asyncio
from openai import AsyncOpenAI
from collections import Counter
from dataclasses import dataclass
from typing import Callable

client = AsyncOpenAI()

@dataclass
class ConsistencyResult:
    answer: str
    confidence: float
    vote_counts: dict[str, int]
    all_responses: list[str]
    is_confident: bool

async def single_call(messages: list[dict], model: str) -> str:
    """Make one completion call with temperature > 0."""
    response = await client.chat.completions.create(
        model=model,
        temperature=0.7,  # Must be > 0 for diversity
        max_tokens=1024,
        messages=messages
    )
    return response.choices[0].message.content

def extract_answer(response: str) -> str:
    """Extract the final answer from a CoT response.
    Looks for 'ANSWER:' marker, falls back to last line."""
    for line in response.strip().split("\n"):
        if line.strip().upper().startswith("ANSWER:"):
            return line.split(":", 1)[1].strip().lower()
    # Fallback: last non-empty line
    lines = [l.strip() for l in response.strip().split("\n") if l.strip()]
    return lines[-1].lower() if lines else response.strip().lower()

async def self_consistency(
    messages: list[dict],
    n_samples: int = 5,
    model: str = "gpt-4o",
    confidence_threshold: float = 0.6,
    answer_extractor: Callable[[str], str] = extract_answer
) -> ConsistencyResult:
    """Run self-consistency with majority voting."""
    # Fire all calls in parallel
    tasks = [single_call(messages, model) for _ in range(n_samples)]
    raw_responses = await asyncio.gather(*tasks)
    # Extract and normalize answers
    answers = [answer_extractor(r) for r in raw_responses]
    vote_counts = dict(Counter(answers))
    # Majority vote
    winner = max(vote_counts, key=vote_counts.get)
    confidence = vote_counts[winner] / n_samples
    return ConsistencyResult(
        answer=winner,
        confidence=confidence,
        vote_counts=vote_counts,
        all_responses=raw_responses,
        is_confident=confidence >= confidence_threshold
    )

# --- Usage ---
async def main():
    messages = [
        {"role": "system", "content": "You are a code reviewer. Analyze the code and determine if it has a security vulnerability."},
        {"role": "user", "content": (
            "Does this code have a security issue?\n\n"
            "```python\n"
            "def get_user(request):\n"
            "    user_id = request.args.get('id')\n"
            "    query = f'SELECT * FROM users WHERE id = {user_id}'\n"
            "    return db.execute(query).fetchone()\n"
            "```\n\n"
            "Think step by step, then provide your answer as:\n"
            "ANSWER: yes or no"
        )}
    ]
    result = await self_consistency(messages, n_samples=7, confidence_threshold=0.7)
    print(f"Answer: {result.answer}")
    print(f"Confidence: {result.confidence:.0%}")
    print(f"Votes: {result.vote_counts}")
    if not result.is_confident:
        print("LOW CONFIDENCE: escalate to human review")
        # In production: log to monitoring, flag for review, or
        # fall back to a more capable model
    else:
        print(f"High confidence result: {result.answer}")

asyncio.run(main())
# Example output (sampled, so it can vary between runs):
# Answer: yes
# Confidence: 100%
# Votes: {'yes': 7}
# High confidence result: yes
Cost Control
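One way to keep the sample count down, sketched here as an assumption rather than as the article's own method, is to sample sequentially and stop as soon as the leading answer can no longer be overtaken. `sample_fn` is a hypothetical zero-argument callable standing in for one model call plus answer extraction.

```python
from collections import Counter
from typing import Callable

def vote_with_early_stop(
    sample_fn: Callable[[], str], max_samples: int = 7
) -> tuple[str, int]:
    """Return (winning answer, number of samples actually used)."""
    votes: Counter[str] = Counter()
    for used in range(1, max_samples + 1):
        votes[sample_fn()] += 1
        ranked = votes.most_common(2)
        lead = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        remaining = max_samples - used
        # Stop once the leader's count exceeds what the runner-up could reach
        # even if it received every remaining sample.
        if lead > runner_up + remaining:
            return ranked[0][0], used
    return votes.most_common(1)[0][0], max_samples

# Canned answers stand in for sampled model outputs.
canned = iter(["yes", "yes", "no", "yes", "yes", "yes", "yes"])
print(vote_with_early_stop(lambda: next(canned)))  # ('yes', 5)
```

Sequential sampling trades latency for cost: the parallel `asyncio.gather` approach always pays for every sample but returns sooner.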
Prompt chaining decomposes a complex task into a pipeline of focused steps. Each step gets a simple, well-defined job. The output of step N becomes the input of step N+1. Failures are isolated, intermediate results are inspectable, and individual steps can be swapped without rewriting the pipeline.
from openai import OpenAI
import json
import logging

client = OpenAI()
logger = logging.getLogger(__name__)

class ChainStep:
    def __init__(self, name: str, system: str, prompt_template: str, model: str = "gpt-4o"):
        self.name = name
        self.system = system
        self.prompt_template = prompt_template
        self.model = model

    def run(self, **kwargs) -> str:
        prompt = self.prompt_template.format(**kwargs)
        logger.info(f"[Chain] Running step: {self.name}")
        logger.debug(f"[Chain] Input: {prompt[:200]}...")
        response = client.chat.completions.create(
            model=self.model,
            temperature=0,
            messages=[
                {"role": "system", "content": self.system},
                {"role": "user", "content": prompt}
            ]
        )
        result = response.choices[0].message.content
        logger.info(f"[Chain] Step '{self.name}' complete. Output length: {len(result)}")
        return result

class PromptChain:
    def __init__(self, steps: list[ChainStep]):
        self.steps = steps
        self.intermediate_results: dict[str, str] = {}

    def run(self, initial_input: str) -> dict[str, str]:
        self.intermediate_results = {"input": initial_input}
        current = initial_input
        for step in self.steps:
            try:
                # Each step gets access to all previous results by name.
                # "input" and "original" are merged in last: passing them as
                # separate keyword arguments next to **intermediate_results
                # would raise TypeError (duplicate 'input' keyword).
                context = {
                    **self.intermediate_results,
                    "input": current,
                    "original": initial_input,
                }
                result = step.run(**context)
                self.intermediate_results[step.name] = result
                current = result
            except Exception as e:
                logger.error(f"[Chain] Step '{step.name}' failed: {e}")
                self.intermediate_results[f"{step.name}_error"] = str(e)
                # Return partial results so caller can decide what to do
                return self.intermediate_results
        return self.intermediate_results
# --- Build a customer feedback pipeline ---
extract_step = ChainStep(
    name="extract_issues",
    system="You extract customer issues from feedback. Return a JSON array of objects with keys: issue, quote, category.",
    prompt_template="Extract all distinct issues from this customer feedback:\n\n{input}"
)
classify_step = ChainStep(
    name="classify_severity",
    system="You classify issue severity. Return JSON: same array with an added 'severity' field (critical/high/medium/low) and 'reasoning' field.",
    prompt_template="Classify the severity of each issue. Consider business impact and customer emotion.\n\nIssues:\n{input}"
)
respond_step = ChainStep(
    name="draft_response",
    system="You draft professional customer support responses. Be empathetic but concise. Address every issue mentioned.",
    prompt_template="Draft a response to this customer. Address each issue with the appropriate urgency.\n\nOriginal feedback:\n{original}\n\nClassified issues:\n{input}"
)
chain = PromptChain([extract_step, classify_step, respond_step])
# Run the chain
feedback = """Your app crashed three times today during checkout. I lost my cart
each time. Also the new UI is confusing — I couldn't find the settings page.
I've been a customer for 5 years and I'm considering switching to a competitor."""
results = chain.run(feedback)
# Inspect intermediate results
for step_name, output in results.items():
    print(f"\n{'='*60}")
    print(f"STEP: {step_name}")
    print(f"{'='*60}")
    print(output[:500])
Why Chaining Beats Single Prompts
Unstructured model output is a common source of production bugs: JSON parsing failures, missing fields, wrong types. Structured output patterns remove most of these failures and turn the rest into errors you can catch and retry. Three approaches: JSON mode, Pydantic + instructor, and schema enforcement with retry.
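To make the distinction between "valid JSON" and "valid against a schema" concrete, here is a standard-library sketch of the kind of check the stricter approaches automate. `JOB_SCHEMA` and `schema_errors` are hypothetical names, and the schema covers only top-level keys and their Python types.

```python
import json

# Hypothetical minimal schema: key -> allowed Python types after json.loads.
JOB_SCHEMA = {
    "title": (str,),
    "company": (str,),
    "salary_min": (int, float, type(None)),
    "requirements": (list,),
}

def schema_errors(raw: str, schema: dict[str, tuple[type, ...]]) -> list[str]:
    """Return a list of problems; an empty list means the payload passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for key, types in schema.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], types):
            errors.append(f"wrong type for {key}: {type(data[key]).__name__}")
    return errors

# Syntactically valid JSON that still violates the schema.
raw = '{"title": "Engineer", "company": "Acme", "salary_min": "180k", "requirements": []}'
print(schema_errors(raw, JOB_SCHEMA))  # ['wrong type for salary_min: str']
```

A library such as Pydantic performs the same kind of validation with nested types, coercion, and better error messages, which is why the later examples use it.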
from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": (
                "Extract structured data from job postings. "
                "Return JSON with these exact keys:\n"
                "- title: string\n"
                "- company: string\n"
                "- location: string (or 'remote')\n"
                "- salary_min: number or null\n"
                "- salary_max: number or null\n"
                "- requirements: string[]\n"
                "- experience_years: number or null"
            )
        },
        {
            "role": "user",
            "content": (
                "We're hiring a Senior Backend Engineer at Acme Corp! "
                "Based in SF or remote. $180k-$240k. Need 5+ years with "
                "Python, PostgreSQL, and distributed systems. Bonus points "
                "for Kubernetes experience."
            )
        }
    ]
)
job = json.loads(response.choices[0].message.content)
print(json.dumps(job, indent=2))
# JSON mode yields syntactically valid JSON (unless the response is truncated),
# but field names and types are not enforced
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from enum import Enum

class ExperienceLevel(str, Enum):
    JUNIOR = "junior"
    MID = "mid"
    SENIOR = "senior"
    STAFF = "staff"
    PRINCIPAL = "principal"

class JobPosting(BaseModel):
    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Office location or 'remote'")
    salary_min: int | None = Field(description="Minimum salary in USD")
    salary_max: int | None = Field(description="Maximum salary in USD")
    requirements: list[str] = Field(description="Required skills and qualifications")
    experience_years: int | None = Field(description="Required years of experience")
    level: ExperienceLevel = Field(description="Inferred experience level")

# Wrap the client: instructor handles validation and retries
client = instructor.from_openai(OpenAI())
job = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    max_retries=2,  # Auto-retry on validation failure
    response_model=JobPosting,
    messages=[
        {
            "role": "user",
            "content": (
                "We're hiring a Senior Backend Engineer at Acme Corp! "
                "Based in SF or remote. $180k-$240k. Need 5+ years with "
                "Python, PostgreSQL, and distributed systems."
            )
        }
    ]
)
# job is a fully typed Pydantic model, so IDE autocomplete works
print(f"{job.title} at {job.company}")
# The salary fields are nullable; formatting None with ':,' would raise TypeError
if job.salary_min is not None and job.salary_max is not None:
    print(f"Salary: ${job.salary_min:,} - ${job.salary_max:,}")
print(f"Level: {job.level.value}")
print(f"Requirements: {', '.join(job.requirements)}")
from openai import OpenAI
import json
from pydantic import BaseModel, ValidationError

client = OpenAI()

def extract_with_retry(
    prompt: str,
    schema: type[BaseModel],
    model: str = "gpt-4o",
    max_retries: int = 3
) -> BaseModel:
    """Extract structured data with automatic retry on parse/validation failure."""
    schema_json = json.dumps(schema.model_json_schema(), indent=2)
    system = (
        f"Extract data matching this JSON schema. Return ONLY valid JSON, no markdown.\n\n"
        f"Schema:\n{schema_json}"
    )
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt}
    ]
    for attempt in range(max_retries):
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            response_format={"type": "json_object"},
            messages=messages
        )
        raw = response.choices[0].message.content
        try:
            data = json.loads(raw)
            return schema.model_validate(data)
        except (json.JSONDecodeError, ValidationError) as e:
            error_msg = str(e)
            # Add the error as context for the retry
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"That response had an error: {error_msg}\nPlease fix and return valid JSON matching the schema."
            })
    raise ValueError(f"Failed to extract valid data after {max_retries} attempts")

# Usage
class MeetingInfo(BaseModel):
    title: str
    participants: list[str]
    date: str
    action_items: list[str]
    next_meeting: str | None = None

meeting = extract_with_retry(
    "Meeting notes: Project sync with Alice, Bob, and Carol on Jan 15. "
    "Decided to launch beta by Feb 1. Alice will handle deployment, "
    "Bob owns the docs. Follow-up scheduled for Jan 22.",
    schema=MeetingInfo
)
print(meeting.model_dump_json(indent=2))
The system prompt is the behavioral contract between you and the model. A vague system prompt produces vague behavior. A precise one produces a reliable agent. Below: a bad system prompt, a good one, and the reasoning behind every change.
# BAD system prompt — vague, no structure, no constraints
BAD_SYSTEM_PROMPT = """You are a helpful customer support agent for TechCorp.
Be nice to customers and help them with their problems.
Try to resolve issues quickly."""
# GOOD system prompt: specific, structured, constrained
GOOD_SYSTEM_PROMPT = """You are a Tier 1 support agent for TechCorp's SaaS platform.
## Role
You handle billing questions, account access issues, and basic technical
troubleshooting. You do NOT handle: data deletion requests, security
incidents, or enterprise contract negotiations.
## Behavioral Rules
1. Greet the customer by name if available.
2. Acknowledge their issue before solving it.
3. If you cannot resolve in 2 exchanges, escalate to Tier 2.
4. Never promise refunds — only Tier 2+ can authorize those.
5. Never share internal system details, ticket IDs, or agent names.
## Response Format
- Keep responses under 150 words.
- Use numbered steps for instructions.
- End every message with a clear next action or question.
## Escalation Triggers (auto-escalate to Tier 2)
- Customer mentions "lawyer", "legal", or "lawsuit"
- Account has been compromised or hacked
- Billing discrepancy over $500
- Customer has asked the same question 3+ times
## Tone
- Professional but warm. Not robotic, not overly casual.
- Match the customer's energy — if they're frustrated, lead with empathy.
- Never use exclamation marks more than once per message.
## Knowledge Boundaries
- You have access to: account status, billing history, known issues list.
- You do NOT have access to: source code, infrastructure details, roadmap.
- If asked about something outside your knowledge, say so directly.
Do not guess or fabricate information."""
# The difference: three vague lines vs. a structured, multi-section contract.
# The structured version produces far more consistent, predictable behavior
# across thousands of conversations.
from openai import OpenAI
from anthropic import Anthropic
# Core prompt content — provider-agnostic
CORE_INSTRUCTIONS = {
"role": "Tier 1 support agent for TechCorp",
"capabilities": ["billing", "account access", "basic troubleshooting"],
"restrictions": ["no refunds", "no internal details", "no legal matters"],
"escalation_triggers": ["legal threats", "compromised accounts", "billing > $500"],
"tone": "professional, warm, empathetic",
"max_words": 150,
}
def build_openai_system_prompt(config: dict) -> str:
    """OpenAI models respond well to markdown formatting."""
    return f"""You are a {config['role']}.
## Capabilities
{chr(10).join(f'- {c}' for c in config['capabilities'])}
## Restrictions
{chr(10).join(f'- {r}' for r in config['restrictions'])}
## Escalation Triggers
{chr(10).join(f'- {t}' for t in config['escalation_triggers'])}
## Response Rules
- Tone: {config['tone']}
- Max length: {config['max_words']} words
- End every message with a clear next action."""
def build_anthropic_system_prompt(config: dict) -> str:
    """Claude responds well to XML tags for structure."""
    return f"""You are a {config['role']}.
<capabilities>
{chr(10).join(f'- {c}' for c in config['capabilities'])}
</capabilities>
<restrictions>
{chr(10).join(f'- {r}' for r in config['restrictions'])}
</restrictions>
<escalation_triggers>
{chr(10).join(f'- {t}' for t in config['escalation_triggers'])}
</escalation_triggers>
<response_rules>
- Tone: {config['tone']}
- Max length: {config['max_words']} words
- End every message with a clear next action.
</response_rules>"""
# OpenAI call
openai_client = OpenAI()
response_oai = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": build_openai_system_prompt(CORE_INSTRUCTIONS)},
        {"role": "user", "content": "Hi, I was charged twice for my subscription last month."}
    ]
)
# Anthropic call
anthropic_client = Anthropic()
response_claude = anthropic_client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    system=build_anthropic_system_prompt(CORE_INSTRUCTIONS),
    messages=[
        {"role": "user", "content": "Hi, I was charged twice for my subscription last month."}
    ]
)
Telling the model what NOT to do is sometimes more effective than describing what you want. It is especially useful for eliminating specific failure modes you've observed in production.
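Negative instructions are also easy to verify mechanically, which helps confirm that a failure mode is actually gone rather than merely rarer. The sketch below checks a summary against two example bans, timestamps and hedging phrases such as "it appears that"; `violations`, `BANNED_PHRASES`, and the timestamp pattern are hypothetical and would need adapting to your own rules.

```python
import re

# Hypothetical rule set: phrases and patterns the prompt tells the model to avoid.
BANNED_PHRASES = ("it appears that", "it seems like")
TIMESTAMP = re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\b")

def violations(summary: str) -> list[str]:
    """Return the negative-instruction rules that a summary breaks."""
    found = []
    lowered = summary.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            found.append(f"hedging phrase: {phrase!r}")
    if TIMESTAMP.search(summary):
        found.append("contains a timestamp")
    return found

bad = "It appears that the database went down at 2026-05-31 10:00:01."
good = "Root error: connection refused to postgres:5432 (2 occurrences)."
print(violations(bad))   # two violations
print(violations(good))  # []
```

Running a check like this over a batch of outputs gives a violation rate you can compare before and after adding the negative instructions.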
from openai import OpenAI
client = OpenAI()
# Without negative examples — model tends to over-explain
prompt_without = "Summarize this error log for a developer."
# With negative examples — eliminates specific bad behaviors
prompt_with = """Summarize this error log for a developer.
Do NOT:
- Include timestamps in your summary
- Suggest fixes (just describe what happened)
- Use phrases like "it appears that" or "it seems like" — state facts directly
- Repeat the same error if it occurs multiple times — just note the count
Do:
- Group related errors
- State the root error first, cascading failures second
- Include the exact error message for the root cause"""
error_log = """
2026-05-31 10:00:01 ERROR DatabaseConnection: Connection refused to postgres:5432
2026-05-31 10:00:01 ERROR DatabaseConnection: Connection refused to postgres:5432
2026-05-31 10:00:02 ERROR UserService: Failed to fetch user - database unavailable
2026-05-31 10:00:02 ERROR AuthService: Cannot validate token - UserService unreachable
2026-05-31 10:00:03 ERROR APIGateway: 503 Service Unavailable on /api/users/me
2026-05-31 10:00:03 ERROR HealthCheck: Service unhealthy - 3 downstream failures
"""
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": "You are a concise incident summarizer."},
        {"role": "user", "content": f"{prompt_with}\n\n{error_log}"}
    ]
)
print(response.choices[0].message.content)
from jinja2 import Template
from openai import OpenAI
client = OpenAI()
# Jinja2 template with conditionals and loops
REVIEW_TEMPLATE = Template("""
You are reviewing a {{ review_type }} for {{ project_name }}.
{% if context %}
<context>
{{ context }}
</context>
{% endif %}
Review the following {{ review_type }} and provide feedback:
{% for criterion in criteria %}
- {{ criterion }}
{% endfor %}
{% if strict_mode %}
You MUST flag any issue that violates the criteria above. Do not approve with unresolved issues.
{% else %}
Minor style issues can be noted but shouldn't block approval.
{% endif %}
Respond with:
- APPROVE, REQUEST_CHANGES, or COMMENT
- List of specific findings
""")
# Render with different configurations
prompt_strict = REVIEW_TEMPLATE.render(
    review_type="pull request",
    project_name="payments-service",
    context="This service handles PCI-compliant credit card processing.",
    criteria=[
        "No hardcoded secrets or credentials",
        "All database queries use parameterized statements",
        "Error messages don't leak internal details",
        "All new endpoints have authentication checks",
    ],
    strict_mode=True
)
prompt_relaxed = REVIEW_TEMPLATE.render(
    review_type="documentation update",
    project_name="internal-wiki",
    context=None,
    criteria=[
        "Technically accurate",
        "Clear and concise",
        "Includes code examples where relevant",
    ],
    strict_mode=False
)
# Use the rendered template
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "user", "content": f"{prompt_strict}\n\n```python\nAPI_KEY = 'sk-live-abc123'\n```"}
    ]
)
print(response.choices[0].message.content)
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class PromptVersion:
    version: str
    system_prompt: str
    user_template: str
    model: str
    temperature: float
    created_at: str = field(default_factory=lambda: datetime.now().isoformat())
    notes: str = ""
    deprecated: bool = False

class PromptRegistry:
    """Simple prompt version registry with rollback support."""

    def __init__(self):
        self._versions: dict[str, dict[str, PromptVersion]] = {}
        self._active: dict[str, str] = {}  # prompt_name -> active version

    def register(self, name: str, version: PromptVersion):
        if name not in self._versions:
            self._versions[name] = {}
        self._versions[name][version.version] = version
        # First version becomes active by default
        if name not in self._active:
            self._active[name] = version.version

    def activate(self, name: str, version: str):
        if version not in self._versions.get(name, {}):
            raise ValueError(f"Version {version} not found for prompt '{name}'")
        self._active[name] = version

    def get_active(self, name: str) -> PromptVersion:
        version_id = self._active[name]
        return self._versions[name][version_id]

    def rollback(self, name: str) -> str:
        """Roll back to the version registered immediately before the active one.

        This follows registration order, not activation history."""
        versions = list(self._versions[name].keys())
        current_idx = versions.index(self._active[name])
        if current_idx == 0:
            raise ValueError("Already at the earliest version")
        previous = versions[current_idx - 1]
        self._active[name] = previous
        return previous

    def list_versions(self, name: str) -> list[dict]:
        return [
            {"version": v.version, "active": v.version == self._active[name],
             "notes": v.notes, "deprecated": v.deprecated}
            for v in self._versions.get(name, {}).values()
        ]

# Usage
registry = PromptRegistry()
registry.register("classifier", PromptVersion(
    version="v1.0",
    system_prompt="Classify support tickets into: billing, technical, account, other.",
    user_template="Classify this ticket: {ticket_text}",
    model="gpt-4o-mini",
    temperature=0,
    notes="Initial version"
))
registry.register("classifier", PromptVersion(
    version="v1.1",
    system_prompt="Classify support tickets into: billing, technical, account, security, other. Return only the category name.",
    user_template="Classify this ticket: {ticket_text}",
    model="gpt-4o-mini",
    temperature=0,
    notes="Added security category, enforced single-word output"
))
registry.activate("classifier", "v1.1")
active = registry.get_active("classifier")
print(f"Active: {active.version} - {active.notes}")
# Something broke? Roll back.
previous = registry.rollback("classifier")
print(f"Rolled back to: {previous}")
print(registry.list_versions("classifier"))
import random
import json
import time
from dataclasses import dataclass, field, asdict
from openai import OpenAI
client = OpenAI()
@dataclass
class ABTestResult:
    variant: str
    prompt_version: str
    input_text: str
    output_text: str
    latency_ms: float
    timestamp: float = field(default_factory=time.time)
    feedback_score: float | None = None  # filled later by human eval


class PromptABTest:
    """Simple A/B test framework for prompt variants."""

    def __init__(self, variants: dict[str, dict], log_file: str = "ab_results.jsonl"):
        """
        variants: {"A": {"system": "...", "template": "..."}, "B": {...}}
        """
        self.variants = variants
        self.log_file = log_file
        self.results: list[ABTestResult] = []

    def run(self, input_text: str, model: str = "gpt-4o") -> ABTestResult:
        # Random assignment
        variant_name = random.choice(list(self.variants.keys()))
        variant = self.variants[variant_name]
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": variant["system"]},
                {"role": "user", "content": variant["template"].format(input=input_text)}
            ]
        )
        latency = (time.perf_counter() - start) * 1000
        result = ABTestResult(
            variant=variant_name,
            prompt_version=variant.get("version", "unknown"),
            input_text=input_text,
            output_text=response.choices[0].message.content,
            latency_ms=round(latency, 1)
        )
        self.results.append(result)
        self._log(result)
        return result

    def _log(self, result: ABTestResult):
        # Append one JSON object per line so the log can be streamed or tailed
        with open(self.log_file, "a") as f:
            f.write(json.dumps(asdict(result)) + "\n")

    def summary(self) -> dict:
        from collections import defaultdict
        stats = defaultdict(lambda: {"count": 0, "total_latency": 0.0, "scores": []})
        for r in self.results:
            s = stats[r.variant]
            s["count"] += 1
            s["total_latency"] += r.latency_ms
            if r.feedback_score is not None:
                s["scores"].append(r.feedback_score)
        return {
            variant: {
                "count": s["count"],
                "avg_latency_ms": round(s["total_latency"] / s["count"], 1),
                "avg_score": round(sum(s["scores"]) / len(s["scores"]), 2) if s["scores"] else None
            }
            for variant, s in stats.items()
        }
# Usage
test = PromptABTest({
    "A": {
        "version": "v1.0",
        "system": "Summarize support tickets in 1-2 sentences.",
        "template": "Summarize: {input}"
    },
    "B": {
        "version": "v1.1",
        "system": "You summarize support tickets. Output format: [CATEGORY] One-sentence summary.",
        "template": "Summarize this support ticket concisely: {input}"
    }
})
# Run across test inputs
test_inputs = [
    "My payment failed three times. Card ending 4242. Error says 'insufficient funds' but I have money.",
    "Can't log in since the update. Password reset email never arrives. Checked spam.",
    "Your API is returning 500 errors on the /users endpoint since 3pm EST."
]
for inp in test_inputs:
    result = test.run(inp)
    print(f"[{result.variant}] {result.output_text[:80]}... ({result.latency_ms}ms)")
print("\nSummary:", json.dumps(test.summary(), indent=2))

from openai import OpenAI
from dataclasses import dataclass
import json
client = OpenAI()
@dataclass
class TestCase:
    input_text: str
    expected_contains: list[str]  # output must contain ALL of these
    expected_not_contains: list[str] | None = None  # output must NOT contain any of these
    expected_exact: str | None = None  # for classification tasks


def run_regression_suite(
    system_prompt: str,
    test_cases: list[TestCase],
    model: str = "gpt-4o",
    verbose: bool = True
) -> dict:
    """Run a regression suite against a prompt. Returns pass/fail stats."""
    results = {"passed": 0, "failed": 0, "failures": []}
    for i, tc in enumerate(test_cases):
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": tc.input_text}
            ]
        )
        # content can be None (e.g. on a refusal); treat that as empty output
        output = response.choices[0].message.content or ""
        passed = True
        reasons = []
        # Check expected_contains (case-insensitive substring match)
        for phrase in tc.expected_contains:
            if phrase.lower() not in output.lower():
                passed = False
                reasons.append(f"Missing: '{phrase}'")
        # Check expected_not_contains
        if tc.expected_not_contains:
            for phrase in tc.expected_not_contains:
                if phrase.lower() in output.lower():
                    passed = False
                    reasons.append(f"Should not contain: '{phrase}'")
        # Check exact match (case-insensitive, ignoring surrounding whitespace)
        if tc.expected_exact and output.strip().lower() != tc.expected_exact.lower():
            passed = False
            reasons.append(f"Expected '{tc.expected_exact}', got '{output.strip()}'")
        if passed:
            results["passed"] += 1
            if verbose:
                print(f"  PASS test {i+1}")
        else:
            results["failed"] += 1
            results["failures"].append({"test": i+1, "input": tc.input_text, "output": output, "reasons": reasons})
            if verbose:
                print(f"  FAIL test {i+1}: {reasons}")
    return results
# Usage — test a classification prompt
classifier_prompt = "Classify support tickets into exactly one category: billing, technical, account, security. Return only the category name, lowercase."
tests = [
    TestCase("I was charged twice", expected_contains=["billing"], expected_exact="billing"),
    TestCase("Can't reset my password", expected_contains=["account"], expected_exact="account"),
    TestCase("API returns 500", expected_contains=["technical"], expected_exact="technical"),
    TestCase("Someone logged into my account from Russia", expected_contains=["security"], expected_exact="security"),
    TestCase("Your pricing page is confusing", expected_contains=["billing"], expected_not_contains=["technical"]),
]
print("Running regression suite...")
results = run_regression_suite(classifier_prompt, tests)
print(f"\nResults: {results['passed']}/{results['passed'] + results['failed']} passed")
if results["failures"]:
    print("Failures:")
    for f in results["failures"]:
        print(f"  Test {f['test']}: {f['reasons']}")

import tiktoken
from dataclasses import dataclass
# Pricing per 1M tokens (as of early 2026 — check for updates)
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4.1": {"input": 2.00, "output": 8.00},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "claude-3-5-haiku-20241022": {"input": 0.80, "output": 4.00},
}
@dataclass
class CostEstimate:
    input_tokens: int
    estimated_output_tokens: int
    input_cost: float
    output_cost: float
    total_cost: float
    model: str


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for a given text.

    Uses the model's own tiktoken encoding when tiktoken knows the model, and
    falls back to cl100k_base otherwise (for example for Claude models), so
    counts for non-OpenAI models are approximations.
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))


def count_message_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Count tokens for a full messages array including overhead (approximate)."""
    total = 0
    for msg in messages:
        total += 4  # per-message overhead (rough estimate)
        total += count_tokens(msg["content"], model)
        total += count_tokens(msg["role"], model)
    total += 2  # reply priming (rough estimate)
    return total


def estimate_cost(
    messages: list[dict],
    model: str = "gpt-4o",
    estimated_output_tokens: int = 500
) -> CostEstimate:
    """Estimate the cost of an API call before making it."""
    input_tokens = count_message_tokens(messages, model)
    # Models missing from MODEL_PRICING are priced at 0, so check the table first
    pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
    input_cost = (input_tokens / 1_000_000) * pricing["input"]
    output_cost = (estimated_output_tokens / 1_000_000) * pricing["output"]
    return CostEstimate(
        input_tokens=input_tokens,
        estimated_output_tokens=estimated_output_tokens,
        input_cost=round(input_cost, 6),
        output_cost=round(output_cost, 6),
        total_cost=round(input_cost + output_cost, 6),
        model=model
    )
# Usage
messages = [
    {"role": "system", "content": "You are a helpful assistant." * 50},  # simulate a long system prompt
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]
for model in ["gpt-4o", "gpt-4o-mini", "claude-sonnet-4-20250514"]:
    est = estimate_cost(messages, model=model, estimated_output_tokens=800)
    print(f"{model}: {est.input_tokens} in / ~{est.estimated_output_tokens} out = ${est.total_cost:.4f}")
# Gate expensive calls
est = estimate_cost(messages, model="gpt-4o", estimated_output_tokens=2000)
if est.total_cost > 0.05:
    print(f"WARNING: Estimated cost ${est.total_cost:.4f} exceeds $0.05 threshold")
    # In production: switch to a cheaper model, truncate input, or require approval

These are real production failures, not hypotheticals. Each one has burned engineering hours.
| Anti-Pattern | Before (Broken) | After (Fixed) | Why It Matters |
|---|---|---|---|
| String concatenation prompts | f"Classify: {user_input}" | Messages array with system/user roles | Concatenation is vulnerable to prompt injection. Messages format lets the model distinguish instructions from data. |
| No output format enforcement | "Return the data as JSON" | response_format={"type": "json_object"} or Pydantic + instructor | Models return markdown-wrapped JSON, extra text, or malformed JSON ~5% of the time without enforcement. |
| Hardcoded prompts in application code | prompt = "You are a helpful..." buried in route handler | Prompt registry with versioning, loaded from config | Prompt changes require code deploys. Registry lets you update prompts without redeploying. |
| No prompt testing | Manual testing in playground, ship it | Regression suite with TestCase assertions, run in CI | Model updates or prompt edits silently break behavior. Regression tests catch it before production. |
| Temperature 1.0 for deterministic tasks | temperature=1.0 (the default) | temperature=0 for classification/extraction, 0.7 for creative tasks | High temperature on deterministic tasks causes inconsistent output that breaks downstream parsers. |
| Ignoring token limits | Dump entire document into prompt | Count tokens first, chunk or summarize if over limit | Silent truncation produces wrong answers. The model processes a cut-off document without telling you. |
| Same prompt for all providers | One prompt for GPT-4 and Claude | Provider-specific formatting (XML for Claude, markdown for GPT) | Each model has formatting preferences. Ignoring them costs 10-20% quality. |
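
The "run in CI" fix from the table can be sketched as a small gate over the dictionary that `run_regression_suite` returns. This is a minimal sketch, not a definitive implementation: the function name `ci_gate` and the `min_pass_rate` parameter are assumptions for illustration, and the example feeds it a hand-written results dict so no API call is needed.

```python
import sys


def ci_gate(results: dict, min_pass_rate: float = 1.0) -> int:
    """Turn a regression run into a process exit code (0 = green, 1 = red).

    `results` is assumed to have the shape returned by run_regression_suite:
    {"passed": int, "failed": int, "failures": [{"test": ..., "reasons": ...}]}.
    `ci_gate` and `min_pass_rate` are hypothetical names for this sketch.
    """
    total = results["passed"] + results["failed"]
    if total == 0:
        # An empty suite should never count as a green build.
        print("No test cases ran", file=sys.stderr)
        return 1
    pass_rate = results["passed"] / total
    for failure in results["failures"]:
        print(f"FAIL test {failure['test']}: {failure['reasons']}", file=sys.stderr)
    print(f"Pass rate: {pass_rate:.0%} (required: {min_pass_rate:.0%})")
    return 0 if pass_rate >= min_pass_rate else 1


# Hand-written results: 4 of 5 tests passed, so a 90% bar fails the build.
example_results = {
    "passed": 4,
    "failed": 1,
    "failures": [{
        "test": 5,
        "input": "Your pricing page is confusing",
        "output": "technical",
        "reasons": ["Missing: 'billing'"],
    }],
}
exit_code = ci_gate(example_results, min_pass_rate=0.9)
```

In a CI job, the script would typically end with `sys.exit(ci_gate(results))` so a failing prompt change blocks the merge.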

| Pattern | OpenAI (GPT-4o) | Anthropic (Claude) | Open Source (Llama, Mistral) |
|---|---|---|---|
| Structured delimiters | Markdown headers (##), bullet lists, bold for emphasis | XML tags (<instructions>, <context>, <output>) | Simple markers like [INST] or ### — varies by model |
| System prompt | First message with role: 'system' | Separate 'system' parameter in API call | Often prepended to first user message or uses special tokens |
| JSON output | response_format: {type: 'json_object'} — native support | No native JSON mode — use strong prompting + XML wrapper | Rarely supported natively — rely on prompt engineering |
| Few-shot format | user/assistant message pairs work well | user/assistant pairs + prefill (start assistant's response) | Varies — some need [INST]/[/INST] wrapping per example |
| Chain-of-thought | "Let's think step by step" works reliably | Prefers explicit step structure with XML: <thinking>...</thinking> | Inconsistent — smaller models often ignore CoT instructions |
| Max context | 128K tokens (GPT-4o) | 200K tokens (Claude 3.5/4) | 8K-128K depending on model — check per model |
| Temperature behavior | 0 = mostly deterministic, 0.7 = good creative range | 0 = deterministic, tends to be more verbose at higher temps | Behavior varies significantly — test per model |
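
Two rows of the table, structured delimiters and system prompt placement, can be sketched as one request builder. This is an illustrative sketch under stated assumptions: `build_request` and its parameters are hypothetical names, the markdown and XML layouts simply follow the table's suggestions, and the function only builds keyword arguments without calling any API.

```python
def build_request(provider: str, system: str, instructions: str, context: str,
                  model: str, max_tokens: int = 1024) -> dict:
    """Build keyword arguments for one chat call, formatted per provider.

    Hypothetical helper for illustration; it does not send anything.
    """
    if provider == "openai":
        # Markdown headers as delimiters; system prompt is the first message.
        user_content = f"## Instructions\n{instructions}\n\n## Context\n{context}"
        return {
            "model": model,
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": user_content},
            ],
        }
    if provider == "anthropic":
        # XML tags as delimiters; system prompt is a separate top-level parameter.
        user_content = (
            f"<instructions>\n{instructions}\n</instructions>\n"
            f"<context>\n{context}\n</context>"
        )
        return {
            "model": model,
            "max_tokens": max_tokens,  # the Anthropic Messages API requires this
            "system": system,
            "messages": [{"role": "user", "content": user_content}],
        }
    raise ValueError(f"Unknown provider: {provider}")


openai_req = build_request(
    "openai", "You are a support classifier.",
    "Classify the ticket.", "I was charged twice", model="gpt-4o",
)
anthropic_req = build_request(
    "anthropic", "You are a support classifier.",
    "Classify the ticket.", "I was charged twice", model="claude-sonnet-4-20250514",
)
```

The resulting dicts are meant to be unpacked into the respective SDK calls, for example `client.chat.completions.create(**openai_req)` for OpenAI, or `messages.create(**anthropic_req)` on an Anthropic client.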
Claude Prefill Trick
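
Prefill means ending the messages list with a partial assistant turn, so the model continues from text you chose (for example an opening `{` to push it toward JSON). The sketch below only builds and reassembles messages; `with_prefill` and `join_prefill` are hypothetical helper names, the completion string is a stand-in rather than real model output, and prefill support may not be available for every model or mode, so check the provider docs before relying on it.

```python
import json


def with_prefill(messages: list[dict], prefill: str) -> list[dict]:
    """Return a new messages list that ends in a partial assistant turn."""
    if not messages or messages[-1]["role"] != "user":
        raise ValueError("Prefill must follow a user message")
    # Strip trailing whitespace: a prefill ending in whitespace is
    # commonly rejected by the API (assumption, verify for your model).
    return [*messages, {"role": "assistant", "content": prefill.rstrip()}]


def join_prefill(prefill: str, completion: str) -> str:
    """The response contains only the continuation, so re-attach the prefill."""
    return prefill.rstrip() + completion


messages = [{
    "role": "user",
    "content": "Classify this ticket as JSON with a 'category' key: I was charged twice",
}]
prefilled = with_prefill(messages, "{")

# Stand-in for the text the model would return after the prefill.
simulated_completion = '"category": "billing"}'
parsed = json.loads(join_prefill("{", simulated_completion))
```

The original `messages` list is left untouched, which keeps the prefill from leaking into later turns of the same conversation.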
- CoT is a reasoning amplifier. Use manual CoT for control, zero-shot for convenience. Skip it for simple classification.
- Few-shot examples beat instructions for format control. Build them dynamically from a dataset. Watch example ordering bias.
- Self-consistency turns 70% accuracy into 90%+ at 5x cost. Gate it behind confidence thresholds and use it only for high-stakes calls.
- Prompt chaining makes complex tasks debuggable. Each step is independently testable and swappable.
- Structured output eliminates parsing bugs. Use instructor/Pydantic in production, not raw JSON mode.
- System prompts need structure: role, capabilities, restrictions, format, tone, escalation triggers. 30 lines beats 3.
- Version your prompts like code. A/B test variants. Run regression suites in CI. Estimate costs before calling.
- Adapt per provider. XML for Claude, markdown for GPT, test everything for open-source models.
These patterns are the daily toolkit. They compose: CoT inside a chain step, few-shot inside a self-consistency loop, structured output at every stage. The next level is agent architecture — where prompts become tools that call other tools. See Phase 2: Agent Architecture Patterns for that.