Prompt Engineering Patterns & Techniques: The Complete Production Toolkit

Production-ready prompt engineering patterns with runnable Python code: chain-of-thought, few-shot learning, self-consistency, prompt chaining, structured output, system prompt design, and advanced techniques including A/B testing and regression frameworks.

AI Educator · May 31, 2026

Prompt engineering is applied interface design for language models. Every pattern here solves a specific failure mode: inconsistent reasoning, unstructured output, fragile single-call architectures, untested prompts leaking into production. The code is runnable, the patterns are battle-tested, and every section ends with something you can ship.

Prerequisites

You need: Python 3.10+, the openai and anthropic SDKs installed, and API keys for at least one provider. All code uses async where it matters and sync where clarity matters more. Install with: pip install openai anthropic pydantic instructor tiktoken jinja2

Pattern 1: Chain-of-Thought (CoT)

CoT forces the model to externalize its reasoning before producing a final answer. Two variants: zero-shot CoT (append a trigger phrase) and manual CoT (spell out the reasoning structure). Zero-shot is fast to implement; manual CoT gives you control over the reasoning path.

Zero-Shot vs Manual CoT: Code Review Example

cot_code_review.py

python

from openai import OpenAI

client = OpenAI()

buggy_code = """
def process_orders(orders: list[dict]) -> float:
    total = 0
    for order in orders:
        if order['status'] == 'completed':
            total += order['amount']
            if order.get('discount'):
                total -= order['discount']
        elif order['status'] == 'refunded':
            total += order['amount']  # BUG: should subtract
    return total
"""

# WITHOUT CoT — direct request
response_direct = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": "You are a senior code reviewer."},
        {"role": "user", "content": f"Review this code for bugs:\n```python\n{buggy_code}\n```"}
    ]
)
print("Direct:", response_direct.choices[0].message.content)

# WITH zero-shot CoT — one phrase changes everything
response_zs_cot = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": "You are a senior code reviewer."},
        {"role": "user", "content": (
            f"Review this code for bugs. Think through each branch "
            f"step by step, tracing the logic for each order status "
            f"before stating your findings.\n"
            f"```python\n{buggy_code}\n```"
        )}
    ]
)
print("Zero-shot CoT:", response_zs_cot.choices[0].message.content)

# WITH manual CoT — explicit reasoning structure
response_manual_cot = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": "You are a senior code reviewer."},
        {"role": "user", "content": (
            f"Review this function for bugs. Follow these steps exactly:\n\n"
            f"Step 1: State the function's intended purpose based on its name and signature.\n"
            f"Step 2: Trace through each conditional branch. For each branch, state what SHOULD happen vs what DOES happen.\n"
            f"Step 3: Check edge cases — empty list, missing keys, type mismatches.\n"
            f"Step 4: List each bug with line number, severity (critical/warning/info), and fix.\n\n"
            f"```python\n{buggy_code}\n```"
        )}
    ]
)
print("Manual CoT:", response_manual_cot.choices[0].message.content)

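The direct version typically reports no problems or catches only the most obvious bug. Zero-shot CoT usually catches the refund sign error. Manual CoT tends to go further: it flags the refund bug, the unguarded key lookups (order['status'] and order['amount'] raise KeyError on malformed orders), and the truthiness check on discount, which treats 0, None, and a missing key identically.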

Reusable CoT Wrapper

cot_wrapper.py

python

from openai import OpenAI
from dataclasses import dataclass

client = OpenAI()

@dataclass
class CoTResult:
    reasoning: str
    answer: str
    raw_response: str

def call_with_cot(
    task: str,
    reasoning_steps: list[str] | None = None,
    model: str = "gpt-4o",
    system: str = "You are a precise analytical assistant."
) -> CoTResult:
    """Wrap any task with chain-of-thought reasoning."""
    if reasoning_steps:
        steps_text = "\n".join(f"Step {i+1}: {s}" for i, s in enumerate(reasoning_steps))
        prompt = f"{task}\n\nReason through this following these steps:\n{steps_text}\n\nAfter your reasoning, provide your final answer on a new line starting with 'ANSWER:'"
    else:
        prompt = f"{task}\n\nThink through this step by step. After your reasoning, provide your final answer on a new line starting with 'ANSWER:'"

    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ]
    )
    text = response.choices[0].message.content
    if "ANSWER:" in text:
        parts = text.split("ANSWER:", 1)
        return CoTResult(reasoning=parts[0].strip(), answer=parts[1].strip(), raw_response=text)
    return CoTResult(reasoning=text, answer=text, raw_response=text)

# Usage
result = call_with_cot(
    # The snippet under review is passed as plain text, not executed
    "Is this SQL query safe from injection?\n"
    "query = \"SELECT * FROM users WHERE id = '\" + user_input + \"'\"",
    reasoning_steps=[
        "Identify where user input enters the query",
        "Check if parameterized queries are used",
        "Assess the injection risk",
        "Suggest the fix"
    ]
)
print(f"Reasoning: {result.reasoning}")
print(f"Answer: {result.answer}")

When to Skip CoT

CoT adds latency and tokens. Skip it for classification, simple extraction, and any task where the model already achieves >95% accuracy without it. Use it for debugging, multi-step reasoning, math, and any task where you've seen the model skip steps.
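
The rule of thumb above can be encoded as a small gate in front of the model call. This is a minimal sketch, not a prescribed policy: should_use_cot, build_prompt, and the task-type labels are hypothetical names introduced for illustration, and the 0.95 cutoff simply mirrors the ">95% accuracy" guidance in this section. It assumes you have measured no-CoT accuracy on your own evaluation set.

```python
from typing import Optional

# Assumed labels for the task types this section recommends CoT for.
USE_COT_TASKS = {"debugging", "multi_step_reasoning", "math"}
# Cutoff taken from the rule of thumb above: skip CoT above 95% accuracy.
COT_ACCURACY_CUTOFF = 0.95
COT_SUFFIX = "\n\nThink through this step by step."


def should_use_cot(task_type: str, baseline_accuracy: Optional[float] = None) -> bool:
    """Return True when the extra latency and tokens of CoT look justified."""
    if baseline_accuracy is not None:
        # A measured no-CoT accuracy overrides the task-type heuristic.
        return baseline_accuracy <= COT_ACCURACY_CUTOFF
    return task_type in USE_COT_TASKS


def build_prompt(task: str, task_type: str, baseline_accuracy: Optional[float] = None) -> str:
    """Append the zero-shot CoT trigger only when the gate says so."""
    if should_use_cot(task_type, baseline_accuracy):
        return task + COT_SUFFIX
    return task


# Classification goes out unchanged; debugging gets the CoT trigger appended.
print(build_prompt("Label this review as positive or negative.", "classification"))
print(build_prompt("Why does this loop never terminate?", "debugging"))
```

Letting a measured accuracy override the task type is a design choice: it keeps the gate honest when a nominally simple task turns out to need reasoning.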

Pattern 2: Few-Shot Learning

Few-shot learning teaches format and behavior through examples, not instructions. The model pattern-matches against your examples rather than interpreting your description of the task. This is almost always more reliable than zero-shot for structured output tasks.

OpenAI Messages Format: Entity Extraction

few_shot_openai.py

python

from openai import OpenAI
import json

client = OpenAI()

# Few-shot examples as user/assistant pairs in the messages array
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": "Extract entities from text. Return JSON with keys: persons, organizations, locations, dates. Each value is a list of strings."
        },
        # Example 1
        {"role": "user", "content": "Apple CEO Tim Cook announced the new iPhone at their Cupertino headquarters on September 12th."},
        {"role": "assistant", "content": json.dumps({"persons": ["Tim Cook"], "organizations": ["Apple"], "locations": ["Cupertino"], "dates": ["September 12th"]})},
        # Example 2
        {"role": "user", "content": "No relevant entities here, just a general statement about the weather."},
        {"role": "assistant", "content": json.dumps({"persons": [], "organizations": [], "locations": [], "dates": []})},
        # Example 3 — shows how to handle ambiguity
        {"role": "user", "content": "Jordan visited the Amazon office in Jordan last March."},
        {"role": "assistant", "content": json.dumps({"persons": ["Jordan"], "organizations": ["Amazon"], "locations": ["Jordan"], "dates": ["last March"]})},
        # Actual request
        {"role": "user", "content": "Microsoft's Satya Nadella met with EU regulators in Brussels on January 15, 2026 to discuss the OpenAI partnership."}
    ]
)

result = json.loads(response.choices[0].message.content)
print(json.dumps(result, indent=2))
# Typical output (exact entities can vary between runs and model versions):
# {"persons": ["Satya Nadella"], "organizations": ["Microsoft", "EU", "OpenAI"],
#  "locations": ["Brussels"], "dates": ["January 15, 2026"]}

Anthropic Messages Format: Same Task

few_shot_anthropic.py

python

from anthropic import Anthropic
import json

client = Anthropic()

# Anthropic uses the same user/assistant pattern but with a separate system param
examples = [
    ("Apple CEO Tim Cook announced the new iPhone at Cupertino on September 12th.",
     {"persons": ["Tim Cook"], "organizations": ["Apple"], "locations": ["Cupertino"], "dates": ["September 12th"]}),
    ("No relevant entities here, just a general statement about the weather.",
     {"persons": [], "organizations": [], "locations": [], "dates": []}),
    ("Jordan visited the Amazon office in Jordan last March.",
     {"persons": ["Jordan"], "organizations": ["Amazon"], "locations": ["Jordan"], "dates": ["last March"]}),
]

messages = []
for text, output in examples:
    messages.append({"role": "user", "content": text})
    messages.append({"role": "assistant", "content": json.dumps(output)})

# Add the real request
messages.append({"role": "user", "content": "Microsoft's Satya Nadella met with EU regulators in Brussels on January 15, 2026."})

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    system="Extract entities from text. Return JSON with keys: persons, organizations, locations, dates. Each value is a list of strings. Return only the JSON, no explanation.",
    messages=messages
)

result = json.loads(response.content[0].text)
print(json.dumps(result, indent=2))

Dynamic Few-Shot Builder

few_shot_builder.py

python

from openai import OpenAI
import json
import random

client = OpenAI()

class FewShotBuilder:
    """Build few-shot prompts dynamically from an example dataset."""

    def __init__(self, system_prompt: str, examples: list[tuple[str, str]]):
        self.system_prompt = system_prompt
        self.examples = examples  # list of (input, output) tuples

    def build_messages(
        self,
        query: str,
        n_examples: int = 3,
        strategy: str = "random"
    ) -> list[dict]:
        """Build the messages array with selected examples."""
        if strategy == "random":
            selected = random.sample(self.examples, min(n_examples, len(self.examples)))
        elif strategy == "first":
            selected = self.examples[:n_examples]
        elif strategy == "last":
            selected = self.examples[-n_examples:]
        else:
            raise ValueError(f"Unknown strategy: {strategy}")

        messages = [{"role": "system", "content": self.system_prompt}]
        for inp, out in selected:
            messages.append({"role": "user", "content": inp})
            messages.append({"role": "assistant", "content": out})
        messages.append({"role": "user", "content": query})
        return messages

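    def call(
        self,
        query: str,
        n_examples: int = 3,
        strategy: str = "random",
        model: str = "gpt-4o",
        temperature: float = 0,
        **kwargs
    ) -> str:
        # strategy is consumed here; leaving it in **kwargs would forward it
        # to the API call, which rejects unknown keyword arguments
        messages = self.build_messages(query, n_examples, strategy)
        response = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=messages,
            **kwargs
        )
        return response.choices[0].message.content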

# Usage: sentiment classifier
sentiment_examples = [
    ("This product exceeded my expectations!", "positive"),
    ("Terrible quality, broke after one day.", "negative"),
    ("It works as described, nothing special.", "neutral"),
    ("Absolutely love it, buying another one!", "positive"),
    ("Worst purchase I've ever made.", "negative"),
    ("Decent for the price point.", "neutral"),
    ("The packaging was damaged but product is fine.", "neutral"),
    ("Customer service was rude and unhelpful.", "negative"),
]

classifier = FewShotBuilder(
    system_prompt="Classify the sentiment of product reviews. Respond with exactly one word: positive, negative, or neutral.",
    examples=sentiment_examples
)

# Test example ordering impact
results = {}
for strategy in ["first", "last", "random"]:
    label = classifier.call(
        "The item arrived late but works perfectly.",
        n_examples=3,
        strategy=strategy
    )
    results[strategy] = label
    print(f"Strategy '{strategy}': {label}")

# Ordering can shift results: 'first' shows one example per label,
# while 'last' shows two neutral examples followed by a negative one

Example Ordering Matters

Models tend to be biased toward the pattern in the most recent example they see. If your last example is 'negative', the model can be slightly more likely to classify ambiguous inputs as negative. Shuffle examples or place the most representative example last.
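
Both mitigations fit in a few lines. The sketch below is illustrative only: order_examples is a hypothetical helper, not part of FewShotBuilder, and which example counts as "most representative" is something you have to decide from your own data.

```python
import random
from typing import Optional


def order_examples(
    examples: list,
    seed: Optional[int] = None,
    last_index: Optional[int] = None,
) -> list:
    """Return a shuffled copy of examples, optionally pinning one to the end.

    seed makes the shuffle reproducible, which helps when comparing prompts.
    last_index refers to a position in the ORIGINAL list.
    """
    ordered = list(examples)          # never mutate the caller's list
    random.Random(seed).shuffle(ordered)
    if last_index is not None:
        pinned = examples[last_index]
        ordered.remove(pinned)        # drop it from its shuffled position
        ordered.append(pinned)        # and place it last
    return ordered


sentiment_examples = [
    ("This product exceeded my expectations!", "positive"),
    ("Terrible quality, broke after one day.", "negative"),
    ("It works as described, nothing special.", "neutral"),
]

# Shuffle reproducibly, keeping the neutral example as the final one shown.
for text, label in order_examples(sentiment_examples, seed=7, last_index=2):
    print(label, "|", text)
```

Using a private random.Random instance instead of the module-level functions keeps the ordering independent of any other code that touches the global random state.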

Pattern 3: Self-Consistency

Self-consistency samples multiple reasoning paths at temperature > 0 and takes the majority answer. When a single sample is right only part of the time (say 70%), majority voting over several samples can push accuracy substantially higher, provided the errors do not all converge on the same wrong answer. The implementation is straightforward with async parallel calls.

self_consistency.py

python

import asyncio
from openai import AsyncOpenAI
from collections import Counter
from dataclasses import dataclass

client = AsyncOpenAI()

@dataclass
class ConsistencyResult:
    answer: str
    confidence: float
    vote_counts: dict[str, int]
    all_responses: list[str]
    is_confident: bool

async def single_call(messages: list[dict], model: str) -> str:
    """Make one completion call with temperature > 0."""
    response = await client.chat.completions.create(
        model=model,
        temperature=0.7,  # Must be > 0 for diversity
        max_tokens=1024,
        messages=messages
    )
    return response.choices[0].message.content

def extract_answer(response: str) -> str:
    """Extract the final answer from a CoT response.
    Looks for 'ANSWER:' marker, falls back to last line."""
    for line in response.strip().split("\n"):
        if line.strip().upper().startswith("ANSWER:"):
            return line.split(":", 1)[1].strip().lower()
    # Fallback: last non-empty line
    lines = [l.strip() for l in response.strip().split("\n") if l.strip()]
    return lines[-1].lower() if lines else response.strip().lower()

async def self_consistency(
    messages: list[dict],
    n_samples: int = 5,
    model: str = "gpt-4o",
    confidence_threshold: float = 0.6,
    answer_extractor: callable = extract_answer
) -> ConsistencyResult:
    """Run self-consistency with majority voting."""
    # Fire all calls in parallel
    tasks = [single_call(messages, model) for _ in range(n_samples)]
    raw_responses = await asyncio.gather(*tasks)

    # Extract and normalize answers
    answers = [answer_extractor(r) for r in raw_responses]
    vote_counts = dict(Counter(answers))

    # Majority vote
    winner = max(vote_counts, key=vote_counts.get)
    confidence = vote_counts[winner] / n_samples

    return ConsistencyResult(
        answer=winner,
        confidence=confidence,
        vote_counts=vote_counts,
        all_responses=raw_responses,
        is_confident=confidence >= confidence_threshold
    )

# --- Usage ---
async def main():
    messages = [
        {"role": "system", "content": "You are a code reviewer. Analyze the code and determine if it has a security vulnerability."},
        {"role": "user", "content": (
            "Does this code have a security issue?\n\n"
            "```python\n"
            "def get_user(request):\n"
            "    user_id = request.args.get('id')\n"
            "    query = f'SELECT * FROM users WHERE id = {user_id}'\n"
            "    return db.execute(query).fetchone()\n"
            "```\n\n"
            "Think step by step, then provide your answer as:\n"
            "ANSWER: yes or no"
        )}
    ]

    result = await self_consistency(messages, n_samples=7, confidence_threshold=0.7)

    print(f"Answer: {result.answer}")
    print(f"Confidence: {result.confidence:.0%}")
    print(f"Votes: {result.vote_counts}")

    if not result.is_confident:
        print("LOW CONFIDENCE — escalate to human review")
        # In production: log to monitoring, flag for review, or
        # fall back to a more capable model
    else:
        print(f"High confidence result: {result.answer}")

asyncio.run(main())
# Example output (vote counts can vary between runs):
# Answer: yes
# Confidence: 100%
# Votes: {'yes': 7}
# High confidence result: yes

Cost Control

Self-consistency with n=5 costs 5x a single call. Reserve it for high-stakes decisions. In production, start with n=3 and only increase if confidence is below your threshold. You can also use a cheaper model (gpt-4o-mini) for the parallel samples and a stronger model for tie-breaking.
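
The "start small, escalate only when needed" idea can be sketched as a loop that keeps sampling until the vote is decisive or a budget is reached. This is an assumption-laden illustration: adaptive_vote and its parameters are hypothetical, the sampler is injected as a plain function so the logic can be exercised without API calls, and the sketch samples sequentially rather than in parallel.

```python
from collections import Counter
from typing import Callable


def adaptive_vote(
    sample_fn: Callable[[], str],
    start_n: int = 3,
    max_n: int = 7,
    step: int = 2,
    threshold: float = 0.7,
):
    """Sample answers until the majority share reaches threshold or max_n is hit.

    Returns (winner, confidence, samples_used).
    """
    if start_n < 1 or max_n < start_n or step < 1:
        raise ValueError("need 1 <= start_n <= max_n and step >= 1")

    answers = [sample_fn() for _ in range(start_n)]
    while True:
        winner, count = Counter(answers).most_common(1)[0]
        confidence = count / len(answers)
        if confidence >= threshold or len(answers) >= max_n:
            return winner, confidence, len(answers)
        # Not decisive yet: buy a few more samples, never exceeding the budget.
        extra = min(step, max_n - len(answers))
        answers.extend(sample_fn() for _ in range(extra))


# Stand-in sampler for demonstration; in real use this would wrap one model
# call plus answer extraction.
canned = iter(["yes", "no", "yes", "yes", "yes"])
winner, confidence, used = adaptive_vote(lambda: next(canned))
print(winner, f"{confidence:.0%}", used)
```

The cost is bounded by max_n calls in the worst case, while clear-cut inputs stop after start_n.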

Pattern 4: Prompt Chaining

Prompt chaining decomposes a complex task into a pipeline of focused steps. Each step gets a simple, well-defined job. The output of step N becomes the input of step N+1. Failures are isolated, intermediate results are inspectable, and individual steps can be swapped without rewriting the pipeline.

prompt_chain.py

python

from openai import OpenAI
import json
import logging

client = OpenAI()
logger = logging.getLogger(__name__)

class ChainStep:
    def __init__(self, name: str, system: str, prompt_template: str, model: str = "gpt-4o"):
        self.name = name
        self.system = system
        self.prompt_template = prompt_template
        self.model = model

    def run(self, **kwargs) -> str:
        prompt = self.prompt_template.format(**kwargs)
        logger.info(f"[Chain] Running step: {self.name}")
        logger.debug(f"[Chain] Input: {prompt[:200]}...")

        response = client.chat.completions.create(
            model=self.model,
            temperature=0,
            messages=[
                {"role": "system", "content": self.system},
                {"role": "user", "content": prompt}
            ]
        )
        result = response.choices[0].message.content
        logger.info(f"[Chain] Step '{self.name}' complete. Output length: {len(result)}")
        return result

class PromptChain:
    def __init__(self, steps: list[ChainStep]):
        self.steps = steps
        self.intermediate_results: dict[str, str] = {}

    def run(self, initial_input: str) -> dict[str, str]:
        self.intermediate_results = {"input": initial_input}
        current = initial_input

        for step in self.steps:
            try:
                # Each step gets access to all previous results. Build a single
                # context dict so 'input' is not passed twice (explicitly and
                # via intermediate_results), which would raise a TypeError.
                context = {
                    **self.intermediate_results,
                    "input": current,
                    "original": initial_input,
                }
                result = step.run(**context)
                self.intermediate_results[step.name] = result
                current = result
            except Exception as e:
                logger.error(f"[Chain] Step '{step.name}' failed: {e}")
                self.intermediate_results[f"{step.name}_error"] = str(e)
                # Return partial results so caller can decide what to do
                return self.intermediate_results

        return self.intermediate_results

# --- Build a customer feedback pipeline ---

extract_step = ChainStep(
    name="extract_issues",
    system="You extract customer issues from feedback. Return a JSON array of objects with keys: issue, quote, category.",
    prompt_template="Extract all distinct issues from this customer feedback:\n\n{input}"
)

classify_step = ChainStep(
    name="classify_severity",
    system="You classify issue severity. Return JSON: same array with an added 'severity' field (critical/high/medium/low) and 'reasoning' field.",
    prompt_template="Classify the severity of each issue. Consider business impact and customer emotion.\n\nIssues:\n{input}"
)

respond_step = ChainStep(
    name="draft_response",
    system="You draft professional customer support responses. Be empathetic but concise. Address every issue mentioned.",
    prompt_template="Draft a response to this customer. Address each issue with the appropriate urgency.\n\nOriginal feedback:\n{original}\n\nClassified issues:\n{input}"
)

chain = PromptChain([extract_step, classify_step, respond_step])

# Run the chain
feedback = """Your app crashed three times today during checkout. I lost my cart 
each time. Also the new UI is confusing — I couldn't find the settings page. 
I've been a customer for 5 years and I'm considering switching to a competitor."""

results = chain.run(feedback)

# Inspect intermediate results
for step_name, output in results.items():
    print(f"\n{'='*60}")
    print(f"STEP: {step_name}")
    print(f"{'='*60}")
    print(output[:500])

Why Chaining Beats Single Prompts

Each step is testable in isolation. If severity classification is wrong, fix that step without touching extraction or response generation. Intermediate results double as an audit log. Steps can use different models — cheap models for extraction, expensive ones for response drafting.
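
To make "testable in isolation" concrete, here is a minimal sketch of a step that receives its model call as an argument, so a test can swap in a canned response. InjectableStep and fake_complete are hypothetical names for illustration; the ChainStep class above uses a module-level client instead, so treat this as one possible variation, not the article's implementation.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class InjectableStep:
    """Hypothetical ChainStep variant that takes its completion function as a field."""
    name: str
    system: str
    prompt_template: str
    complete: Callable[[str, str], str]  # (system, prompt) -> model text

    def run(self, **kwargs) -> str:
        prompt = self.prompt_template.format(**kwargs)
        return self.complete(self.system, prompt)


seen_prompts = []


def fake_complete(system: str, prompt: str) -> str:
    """Stand-in for a model call: records the prompt, returns a canned result."""
    seen_prompts.append(prompt)
    return json.dumps([{"issue": "checkout crash", "severity": "critical"}])


step = InjectableStep(
    name="classify_severity",
    system="You classify issue severity.",
    prompt_template="Classify the severity of each issue.\n\nIssues:\n{input}",
    complete=fake_complete,
)

# The test checks two things: the template was filled in correctly, and the
# downstream code can parse what the step returns.
issues = json.loads(step.run(input='[{"issue": "checkout crash"}]'))
assert "checkout crash" in seen_prompts[-1]
assert all(i["severity"] in {"critical", "high", "medium", "low"} for i in issues)
```

The same pattern lets each step use a different model: pass a different complete function per step.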

Pattern 5: Structured Output Prompting

Unstructured model output is a common source of production bugs: JSON parsing failures, missing fields, wrong types. Structured output patterns remove most of these failures. Three approaches: JSON mode, Pydantic + instructor, and schema enforcement with retry.

OpenAI JSON Mode

structured_json_mode.py

python

from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": (
                "Extract structured data from job postings. "
                "Return JSON with these exact keys:\n"
                "- title: string\n"
                "- company: string\n"
                "- location: string (or 'remote')\n"
                "- salary_min: number or null\n"
                "- salary_max: number or null\n"
                "- requirements: string[]\n"
                "- experience_years: number or null"
            )
        },
        {
            "role": "user",
            "content": (
                "We're hiring a Senior Backend Engineer at Acme Corp! "
                "Based in SF or remote. $180k-$240k. Need 5+ years with "
                "Python, PostgreSQL, and distributed systems. Bonus points "
                "for Kubernetes experience."
            )
        }
    ]
)

job = json.loads(response.choices[0].message.content)
print(json.dumps(job, indent=2))
# Valid JSON unless the response is truncated (finish_reason == "length");
# field names and types still aren't enforced
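
Because JSON mode does not enforce a schema, a thin guard between the API call and the rest of your code is a cheap safety net. The sketch below is an assumption-based illustration: parse_json_mode is a hypothetical helper, and REQUIRED_KEYS simply mirrors the keys listed in the system prompt above. It checks for truncation and missing keys only, not value types.

```python
import json

# Keys the system prompt above asks for; adjust to match your own prompt.
REQUIRED_KEYS = {
    "title", "company", "location", "salary_min",
    "salary_max", "requirements", "experience_years",
}


def parse_json_mode(content: str, finish_reason: str) -> dict:
    """Parse a JSON-mode response, rejecting truncated or incomplete output."""
    if finish_reason == "length":
        # The model hit the token limit, so the JSON may be cut off mid-object.
        raise ValueError("response truncated; raise max_tokens or shorten the input")
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


# With a real response you would call:
#   choice = response.choices[0]
#   job = parse_json_mode(choice.message.content, choice.finish_reason)
sample = json.dumps({key: None for key in REQUIRED_KEYS})
print(sorted(parse_json_mode(sample, "stop")))
```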

Pydantic + Instructor: Type-Safe Structured Output

structured_instructor.py

python

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from enum import Enum


class ExperienceLevel(str, Enum):
    JUNIOR = "junior"
    MID = "mid"
    SENIOR = "senior"
    STAFF = "staff"
    PRINCIPAL = "principal"


class JobPosting(BaseModel):
    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Office location or 'remote'")
    salary_min: int | None = Field(description="Minimum salary in USD")
    salary_max: int | None = Field(description="Maximum salary in USD")
    requirements: list[str] = Field(description="Required skills and qualifications")
    experience_years: int | None = Field(description="Required years of experience")
    level: ExperienceLevel = Field(description="Inferred experience level")


# Patch the client — instructor handles retries and validation
client = instructor.from_openai(OpenAI())

job = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    max_retries=2,  # Auto-retry on validation failure
    response_model=JobPosting,
    messages=[
        {
            "role": "user",
            "content": (
                "We're hiring a Senior Backend Engineer at Acme Corp! "
                "Based in SF or remote. $180k-$240k. Need 5+ years with "
                "Python, PostgreSQL, and distributed systems."
            )
        }
    ]
)

# job is a fully typed Pydantic model — IDE autocomplete works
print(f"{job.title} at {job.company}")
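# Salary fields are nullable, so guard before applying the ',' format spec
if job.salary_min is not None and job.salary_max is not None:
    print(f"Salary: ${job.salary_min:,} - ${job.salary_max:,}")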
print(f"Level: {job.level.value}")
print(f"Requirements: {', '.join(job.requirements)}")

Manual Schema Enforcement with Retry

structured_retry.py

python

from openai import OpenAI
import json
from pydantic import BaseModel, ValidationError

client = OpenAI()

def extract_with_retry(
    prompt: str,
    schema: type[BaseModel],
    model: str = "gpt-4o",
    max_retries: int = 3
) -> BaseModel:
    """Extract structured data with automatic retry on parse/validation failure."""
    schema_json = json.dumps(schema.model_json_schema(), indent=2)
    system = (
        f"Extract data matching this JSON schema. Return ONLY valid JSON, no markdown.\n\n"
        f"Schema:\n{schema_json}"
    )

    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt}
    ]

    for attempt in range(max_retries):
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            response_format={"type": "json_object"},
            messages=messages
        )
        raw = response.choices[0].message.content

        try:
            data = json.loads(raw)
            return schema.model_validate(data)
        except (json.JSONDecodeError, ValidationError) as e:
            error_msg = str(e)
            # Add the error as context for the retry
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"That response had an error: {error_msg}\nPlease fix and return valid JSON matching the schema."
            })

    raise ValueError(f"Failed to extract valid data after {max_retries} attempts")

# Usage
class MeetingInfo(BaseModel):
    title: str
    participants: list[str]
    date: str
    action_items: list[str]
    next_meeting: str | None = None

meeting = extract_with_retry(
    "Meeting notes: Project sync with Alice, Bob, and Carol on Jan 15. "
    "Decided to launch beta by Feb 1. Alice will handle deployment, "
    "Bob owns the docs. Follow-up scheduled for Jan 22.",
    schema=MeetingInfo
)
print(meeting.model_dump_json(indent=2))

Pattern 6: System Prompt Design

The system prompt is the behavioral contract between you and the model. A vague system prompt produces vague behavior. A precise one produces a reliable agent. Below: a bad system prompt, a good one, and the reasoning behind every change.

Bad vs Good: Customer Support Agent

system_prompt_bad.py

python

# BAD system prompt — vague, no structure, no constraints
BAD_SYSTEM_PROMPT = """You are a helpful customer support agent for TechCorp. 
Be nice to customers and help them with their problems. 
Try to resolve issues quickly."""

system_prompt_good.py

python

# GOOD system prompt — specific, structured, constrained
GOOD_SYSTEM_PROMPT = """You are a Tier 1 support agent for TechCorp's SaaS platform.

## Role
You handle billing questions, account access issues, and basic technical 
troubleshooting. You do NOT handle: data deletion requests, security 
incidents, or enterprise contract negotiations.

## Behavioral Rules
1. Greet the customer by name if available.
2. Acknowledge their issue before solving it.
3. If you cannot resolve in 2 exchanges, escalate to Tier 2.
4. Never promise refunds — only Tier 2+ can authorize those.
5. Never share internal system details, ticket IDs, or agent names.

## Response Format
- Keep responses under 150 words.
- Use numbered steps for instructions.
- End every message with a clear next action or question.

## Escalation Triggers (auto-escalate to Tier 2)
- Customer mentions "lawyer", "legal", or "lawsuit"
- Account has been compromised or hacked
- Billing discrepancy over $500
- Customer has asked the same question 3+ times

## Tone
- Professional but warm. Not robotic, not overly casual.
- Match the customer's energy — if they're frustrated, lead with empathy.
- Never use exclamation marks more than once per message.

## Knowledge Boundaries
- You have access to: account status, billing history, known issues list.
- You do NOT have access to: source code, infrastructure details, roadmap.
- If asked about something outside your knowledge, say so directly.
  Do not guess or fabricate information."""

# The difference: 3 lines vs about 35 lines. The longer version produces
# more consistent, predictable behavior across thousands of conversations.
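
Some of the rules in the good prompt are mechanical enough to verify in code as well, which gives you a regression check on the contract. The sketch below is a hedged illustration, not part of the original prompt design: needs_escalation and check_response are hypothetical helpers that cover only the keyword triggers, the word limit, and the exclamation-mark rule.

```python
import re

MAX_WORDS = 150  # "Keep responses under 150 words"
# Keyword triggers from the Escalation Triggers section of the prompt.
ESCALATION_PATTERN = re.compile(r"\b(lawyer|legal|lawsuit)\b", re.IGNORECASE)


def needs_escalation(customer_message: str) -> bool:
    """True when the customer message contains a legal-keyword trigger."""
    return ESCALATION_PATTERN.search(customer_message) is not None


def check_response(text: str) -> list[str]:
    """Return the mechanical rule violations found in a model response."""
    violations = []
    word_count = len(text.split())
    if word_count >= MAX_WORDS:
        violations.append(f"too long: {word_count} words")
    if text.count("!") > 1:
        violations.append("more than one exclamation mark")
    return violations


print(needs_escalation("I will be contacting my lawyer about this charge."))
print(check_response("Thanks for your patience! Your access is restored!"))
```

Checks like these complement the prompt rather than replace it: rules about tone or empathy still need human or model-based review.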

Adapting System Prompts for OpenAI vs Anthropic

system_prompt_provider_adaption.py

python

from openai import OpenAI
from anthropic import Anthropic

# Core prompt content — provider-agnostic
CORE_INSTRUCTIONS = {
    "role": "Tier 1 support agent for TechCorp",
    "capabilities": ["billing", "account access", "basic troubleshooting"],
    "restrictions": ["no refunds", "no internal details", "no legal matters"],
    "escalation_triggers": ["legal threats", "compromised accounts", "billing > $500"],
    "tone": "professional, warm, empathetic",
    "max_words": 150,
}

def build_openai_system_prompt(config: dict) -> str:
    """OpenAI models respond well to markdown formatting."""
    return f"""You are a {config['role']}.

## Capabilities
{chr(10).join(f'- {c}' for c in config['capabilities'])}

## Restrictions  
{chr(10).join(f'- {r}' for r in config['restrictions'])}

## Escalation Triggers
{chr(10).join(f'- {t}' for t in config['escalation_triggers'])}

## Response Rules
- Tone: {config['tone']}
- Max length: {config['max_words']} words
- End every message with a clear next action."""

def build_anthropic_system_prompt(config: dict) -> str:
    """Claude responds well to XML tags for structure."""
    return f"""You are a {config['role']}.

<capabilities>
{chr(10).join(f'- {c}' for c in config['capabilities'])}
</capabilities>

<restrictions>
{chr(10).join(f'- {r}' for r in config['restrictions'])}
</restrictions>

<escalation_triggers>
{chr(10).join(f'- {t}' for t in config['escalation_triggers'])}
</escalation_triggers>

<response_rules>
- Tone: {config['tone']}
- Max length: {config['max_words']} words
- End every message with a clear next action.
</response_rules>"""

# OpenAI call
openai_client = OpenAI()
response_oai = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": build_openai_system_prompt(CORE_INSTRUCTIONS)},
        {"role": "user", "content": "Hi, I was charged twice for my subscription last month."}
    ]
)

# Anthropic call
anthropic_client = Anthropic()
response_claude = anthropic_client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    system=build_anthropic_system_prompt(CORE_INSTRUCTIONS),
    messages=[
        {"role": "user", "content": "Hi, I was charged twice for my subscription last month."}
    ]
)

Pattern 7: Advanced Techniques

Negative Examples

Telling the model what NOT to do is sometimes more effective than describing what you want. Especially useful for eliminating specific failure modes you've observed in production.

negative_examples.py

python

from openai import OpenAI

client = OpenAI()

# Without negative examples — model tends to over-explain
prompt_without = "Summarize this error log for a developer."

# With negative examples — eliminates specific bad behaviors
prompt_with = """Summarize this error log for a developer.

Do NOT:
- Include timestamps in your summary
- Suggest fixes (just describe what happened)
- Use phrases like "it appears that" or "it seems like" — state facts directly
- Repeat the same error if it occurs multiple times — just note the count

Do:
- Group related errors
- State the root error first, cascading failures second
- Include the exact error message for the root cause"""

error_log = """
2026-05-31 10:00:01 ERROR DatabaseConnection: Connection refused to postgres:5432
2026-05-31 10:00:01 ERROR DatabaseConnection: Connection refused to postgres:5432
2026-05-31 10:00:02 ERROR UserService: Failed to fetch user - database unavailable
2026-05-31 10:00:02 ERROR AuthService: Cannot validate token - UserService unreachable
2026-05-31 10:00:03 ERROR APIGateway: 503 Service Unavailable on /api/users/me
2026-05-31 10:00:03 ERROR HealthCheck: Service unhealthy - 3 downstream failures
"""

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": "You are a concise incident summarizer."},
        {"role": "user", "content": f"{prompt_with}\n\n{error_log}"}
    ]
)
print(response.choices[0].message.content)
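
Negative instructions are only useful if you verify that the model followed them. The sketch below is one way to check a summary against two of the "Do NOT" rules from the prompt above. The names `TIMESTAMP_RE`, `HEDGE_PHRASES`, and `find_violations` are hypothetical helpers, and the timestamp pattern assumes the log format shown in `error_log`.

```python
import re

# Assumed timestamp shape, matching the sample log: YYYY-MM-DD HH:MM:SS
TIMESTAMP_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\b")
# Phrases the prompt explicitly bans
HEDGE_PHRASES = ("it appears that", "it seems like")


def find_violations(summary: str) -> list[str]:
    """Return the negative rules a summary breaks (empty list means clean)."""
    violations = []
    if TIMESTAMP_RE.search(summary):
        violations.append("contains a timestamp")
    lowered = summary.lower()
    for phrase in HEDGE_PHRASES:
        if phrase in lowered:
            violations.append(f"uses hedge phrase: '{phrase}'")
    return violations


good = "Root error: DatabaseConnection: Connection refused to postgres:5432 (2 occurrences)."
bad = "It appears that at 2026-05-31 10:00:01 the database refused connections."

print(find_violations(good))
print(find_violations(bad))
```

A check like this can feed a regression suite, so a prompt edit that brings the banned behavior back shows up as a failing test rather than as a production surprise.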

Prompt Templates with Jinja2

prompt_templates.py

python

from jinja2 import Template
from openai import OpenAI

client = OpenAI()

# Jinja2 template with conditionals and loops
REVIEW_TEMPLATE = Template("""
You are reviewing a {{ review_type }} for {{ project_name }}.

{% if context %}
<context>
{{ context }}
</context>
{% endif %}

Review the following {{ review_type }} and provide feedback:

{% for criterion in criteria %}
- {{ criterion }}
{% endfor %}

{% if strict_mode %}
You MUST flag any issue that violates the criteria above. Do not approve with unresolved issues.
{% else %}
Minor style issues can be noted but shouldn't block approval.
{% endif %}

Respond with:
- APPROVE, REQUEST_CHANGES, or COMMENT
- List of specific findings
""")

# Render with different configurations
prompt_strict = REVIEW_TEMPLATE.render(
    review_type="pull request",
    project_name="payments-service",
    context="This service handles PCI-compliant credit card processing.",
    criteria=[
        "No hardcoded secrets or credentials",
        "All database queries use parameterized statements",
        "Error messages don't leak internal details",
        "All new endpoints have authentication checks",
    ],
    strict_mode=True
)

prompt_relaxed = REVIEW_TEMPLATE.render(
    review_type="documentation update",
    project_name="internal-wiki",
    context=None,
    criteria=[
        "Technically accurate",
        "Clear and concise",
        "Includes code examples where relevant",
    ],
    strict_mode=False
)

# Use the rendered template
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "user", "content": f"{prompt_strict}\n\n```python\nAPI_KEY = 'sk-live-abc123'\n```"}
    ]
)
print(response.choices[0].message.content)

Prompt Version Registry

prompt_versioning.py

python

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class PromptVersion:
    version: str
    system_prompt: str
    user_template: str
    model: str
    temperature: float
    created_at: str = field(default_factory=lambda: datetime.now().isoformat())
    notes: str = ""
    deprecated: bool = False


class PromptRegistry:
    """Simple prompt version registry with rollback support."""

    def __init__(self):
        self._versions: dict[str, dict[str, PromptVersion]] = {}
        self._active: dict[str, str] = {}  # prompt_name -> active version

    def register(self, name: str, version: PromptVersion):
        if name not in self._versions:
            self._versions[name] = {}
        self._versions[name][version.version] = version
        # First version becomes active by default
        if name not in self._active:
            self._active[name] = version.version

    def activate(self, name: str, version: str):
        if version not in self._versions.get(name, {}):
            raise ValueError(f"Version {version} not found for prompt '{name}'")
        self._active[name] = version

    def get_active(self, name: str) -> PromptVersion:
        version_id = self._active[name]
        return self._versions[name][version_id]

    def rollback(self, name: str) -> str:
        """Roll back to the version registered just before the active one.

        Uses registration order, not activation history.
        """
        versions = list(self._versions[name].keys())
        current_idx = versions.index(self._active[name])
        if current_idx == 0:
            raise ValueError("Already at the earliest version")
        previous = versions[current_idx - 1]
        self._active[name] = previous
        return previous

    def list_versions(self, name: str) -> list[dict]:
        return [
            {"version": v.version, "active": v.version == self._active[name],
             "notes": v.notes, "deprecated": v.deprecated}
            for v in self._versions.get(name, {}).values()
        ]


# Usage
registry = PromptRegistry()

registry.register("classifier", PromptVersion(
    version="v1.0",
    system_prompt="Classify support tickets into: billing, technical, account, other.",
    user_template="Classify this ticket: {ticket_text}",
    model="gpt-4o-mini",
    temperature=0,
    notes="Initial version"
))

registry.register("classifier", PromptVersion(
    version="v1.1",
    system_prompt="Classify support tickets into: billing, technical, account, security, other. Return only the category name.",
    user_template="Classify this ticket: {ticket_text}",
    model="gpt-4o-mini",
    temperature=0,
    notes="Added security category, enforced single-word output"
))

registry.activate("classifier", "v1.1")
active = registry.get_active("classifier")
print(f"Active: {active.version} — {active.notes}")

# Something broke? Roll back.
previous = registry.rollback("classifier")
print(f"Rolled back to: {previous}")
print(registry.list_versions("classifier"))
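
The `rollback` method above steps back by registration order, so it may land on a version that was never live. If you would rather return to whatever was active before, one option is to keep a stack of activations. The sketch below is a minimal, standalone illustration. `ActivationHistory` is a hypothetical helper, not part of the registry above.

```python
class ActivationHistory:
    """Tracks, in order, which version has been active for one prompt name."""

    def __init__(self, initial: str):
        self._stack: list[str] = [initial]

    @property
    def active(self) -> str:
        return self._stack[-1]

    def activate(self, version: str) -> None:
        # Re-activating the current version is a no-op
        if version != self.active:
            self._stack.append(version)

    def rollback(self) -> str:
        """Return to the previously active version."""
        if len(self._stack) == 1:
            raise ValueError("No earlier activation to roll back to")
        self._stack.pop()
        return self.active


history = ActivationHistory("v1.0")
history.activate("v1.2")  # v1.1 was registered but never activated
print(history.rollback())
```

Here the rollback returns to v1.0 and skips v1.1, because v1.1 never served traffic.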

Production Patterns

Prompt A/B Testing

prompt_ab_testing.py

python

import random
import json
import time
from dataclasses import dataclass, field, asdict
from openai import OpenAI

client = OpenAI()

@dataclass
class ABTestResult:
    variant: str
    prompt_version: str
    input_text: str
    output_text: str
    latency_ms: float
    timestamp: float = field(default_factory=time.time)
    feedback_score: float | None = None  # filled later by human eval


class PromptABTest:
    """Simple A/B test framework for prompt variants."""

    def __init__(self, variants: dict[str, dict], log_file: str = "ab_results.jsonl"):
        """
        variants: {"A": {"system": "...", "template": "..."}, "B": {...}}
        """
        self.variants = variants
        self.log_file = log_file
        self.results: list[ABTestResult] = []

    def run(self, input_text: str, model: str = "gpt-4o") -> ABTestResult:
        # Random assignment
        variant_name = random.choice(list(self.variants.keys()))
        variant = self.variants[variant_name]

        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": variant["system"]},
                {"role": "user", "content": variant["template"].format(input=input_text)}
            ]
        )
        latency = (time.perf_counter() - start) * 1000

        result = ABTestResult(
            variant=variant_name,
            prompt_version=variant.get("version", "unknown"),
            input_text=input_text,
            output_text=response.choices[0].message.content,
            latency_ms=round(latency, 1)
        )
        self.results.append(result)
        self._log(result)
        return result

    def _log(self, result: ABTestResult):
        with open(self.log_file, "a") as f:
            f.write(json.dumps(asdict(result)) + "\n")

    def summary(self) -> dict:
        from collections import defaultdict
        stats = defaultdict(lambda: {"count": 0, "total_latency": 0.0, "scores": []})
        for r in self.results:
            s = stats[r.variant]
            s["count"] += 1
            s["total_latency"] += r.latency_ms
            if r.feedback_score is not None:
                s["scores"].append(r.feedback_score)
        return {
            variant: {
                "count": s["count"],
                "avg_latency_ms": round(s["total_latency"] / s["count"], 1),
                "avg_score": round(sum(s["scores"]) / len(s["scores"]), 2) if s["scores"] else None
            }
            for variant, s in stats.items()
        }

# Usage
test = PromptABTest({
    "A": {
        "version": "v1.0",
        "system": "Summarize support tickets in 1-2 sentences.",
        "template": "Summarize: {input}"
    },
    "B": {
        "version": "v1.1",
        "system": "You summarize support tickets. Output format: [CATEGORY] One-sentence summary.",
        "template": "Summarize this support ticket concisely: {input}"
    }
})

# Run across test inputs
test_inputs = [
    "My payment failed three times. Card ending 4242. Error says 'insufficient funds' but I have money.",
    "Can't log in since the update. Password reset email never arrives. Checked spam.",
    "Your API is returning 500 errors on the /users endpoint since 3pm EST."
]

for inp in test_inputs:
    result = test.run(inp)
    print(f"[{result.variant}] {result.output_text[:80]}... ({result.latency_ms}ms)")

print("\nSummary:", json.dumps(test.summary(), indent=2))
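
`random.choice` assigns a variant per call, so the same user can see variant A on one request and B on the next. If you want sticky assignment, a common approach is to hash a stable identifier. This is a sketch under assumptions: `assign_variant`, the `user_id` argument, and the experiment name are all hypothetical and not part of the class above.

```python
import hashlib


def assign_variant(user_id: str, variants: list[str], experiment: str = "summary-prompt") -> str:
    """Map a user to a variant deterministically, so repeat calls agree."""
    # Salting with the experiment name keeps buckets independent across experiments
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    # Sort so the result does not depend on the order the variants were passed in
    return sorted(variants)[bucket]


for uid in ["user-1", "user-2", "user-3"]:
    print(uid, assign_variant(uid, ["A", "B"]))
```

The hash spreads users roughly evenly across variants, and the same user always lands in the same bucket for a given experiment name.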

Prompt Regression Testing

prompt_regression.py

python

from openai import OpenAI
from dataclasses import dataclass
import json

client = OpenAI()

@dataclass
class TestCase:
    input_text: str
    expected_contains: list[str]  # output must contain ALL of these
    expected_not_contains: list[str] | None = None  # output must NOT contain any of these
    expected_exact: str | None = None  # for classification tasks


def run_regression_suite(
    system_prompt: str,
    test_cases: list[TestCase],
    model: str = "gpt-4o",
    verbose: bool = True
) -> dict:
    """Run a regression suite against a prompt. Returns pass/fail stats."""
    results = {"passed": 0, "failed": 0, "failures": []}

    for i, tc in enumerate(test_cases):
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": tc.input_text}
            ]
        )
        output = response.choices[0].message.content

        passed = True
        reasons = []

        # Check expected_contains
        for phrase in tc.expected_contains:
            if phrase.lower() not in output.lower():
                passed = False
                reasons.append(f"Missing: '{phrase}'")

        # Check expected_not_contains
        if tc.expected_not_contains:
            for phrase in tc.expected_not_contains:
                if phrase.lower() in output.lower():
                    passed = False
                    reasons.append(f"Should not contain: '{phrase}'")

        # Check exact match
        if tc.expected_exact and output.strip().lower() != tc.expected_exact.lower():
            passed = False
            reasons.append(f"Expected '{tc.expected_exact}', got '{output.strip()}'")

        if passed:
            results["passed"] += 1
            if verbose:
                print(f"  PASS test {i+1}")
        else:
            results["failed"] += 1
            results["failures"].append({"test": i+1, "input": tc.input_text, "output": output, "reasons": reasons})
            if verbose:
                print(f"  FAIL test {i+1}: {reasons}")

    return results

# Usage — test a classification prompt
classifier_prompt = "Classify support tickets into exactly one category: billing, technical, account, security. Return only the category name, lowercase."

tests = [
    TestCase("I was charged twice", expected_contains=["billing"], expected_exact="billing"),
    TestCase("Can't reset my password", expected_contains=["account"], expected_exact="account"),
    TestCase("API returns 500", expected_contains=["technical"], expected_exact="technical"),
    TestCase("Someone logged into my account from Russia", expected_contains=["security"], expected_exact="security"),
    TestCase("Your pricing page is confusing", expected_contains=["billing"], expected_not_contains=["technical"]),
]

print("Running regression suite...")
results = run_regression_suite(classifier_prompt, tests)
print(f"\nResults: {results['passed']}/{results['passed'] + results['failed']} passed")
if results["failures"]:
    print("Failures:")
    for f in results["failures"]:
        print(f"  Test {f['test']}: {f['reasons']}")
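
To run a suite like this in CI, the results need to become a process exit code. The sketch below assumes the `results` dictionary shape returned by `run_regression_suite` (keys `passed` and `failed`). `regression_exit_code` and the `min_pass_rate` threshold are hypothetical additions.

```python
def regression_exit_code(results: dict, min_pass_rate: float = 1.0) -> int:
    """Translate suite results into a CI exit code: 0 passes the build, 1 fails it."""
    total = results["passed"] + results["failed"]
    if total == 0:
        return 1  # an empty suite should not count as a pass
    pass_rate = results["passed"] / total
    return 0 if pass_rate >= min_pass_rate else 1


sample = {"passed": 4, "failed": 1, "failures": []}
print(regression_exit_code(sample))                     # strict: any failure fails the build
print(regression_exit_code(sample, min_pass_rate=0.8))  # tolerant: 80% is enough

# In a CI entry point you would typically finish with:
#   sys.exit(regression_exit_code(results))
```

A threshold below 1.0 can be reasonable for fuzzy tasks, but for classification prompts with exact expected labels a strict gate is usually the safer default.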

Token Counting and Cost Estimation

token_cost_estimation.py

python

import tiktoken
from dataclasses import dataclass

# Pricing per 1M tokens (as of early 2026 — check for updates)
MODEL_PRICING = {
    "gpt-4o":       {"input": 2.50, "output": 10.00},
    "gpt-4o-mini":  {"input": 0.15, "output": 0.60},
    "gpt-4.1":      {"input": 2.00, "output": 8.00},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "claude-3-5-haiku-20241022": {"input": 0.80, "output": 4.00},
}

@dataclass
class CostEstimate:
    input_tokens: int
    estimated_output_tokens: int
    input_cost: float
    output_cost: float
    total_cost: float
    model: str

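def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens with the model's tiktoken encoding (o200k_base for gpt-4o).

    Models tiktoken does not know, including Claude, fall back to cl100k_base,
    so their counts are rough estimates rather than billing-accurate numbers.
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model: approximate with cl100k_base
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))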

def count_message_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Approximate token count for a messages array, including framing overhead."""
    total = 0
    for msg in messages:
        total += 4  # approximate per-message framing overhead; exact value varies by model
        total += count_tokens(msg["content"], model)
        total += count_tokens(msg["role"], model)
    total += 2  # approximate tokens that prime the assistant reply
    return total

def estimate_cost(
    messages: list[dict],
    model: str = "gpt-4o",
    estimated_output_tokens: int = 500
) -> CostEstimate:
    """Estimate the cost of an API call before making it."""
    input_tokens = count_message_tokens(messages, model)
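    if model not in MODEL_PRICING:
        # Fail loudly: a silent $0 estimate would slip past any cost gate
        raise ValueError(f"No pricing configured for model '{model}'")
    pricing = MODEL_PRICING[model]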

    input_cost = (input_tokens / 1_000_000) * pricing["input"]
    output_cost = (estimated_output_tokens / 1_000_000) * pricing["output"]

    return CostEstimate(
        input_tokens=input_tokens,
        estimated_output_tokens=estimated_output_tokens,
        input_cost=round(input_cost, 6),
        output_cost=round(output_cost, 6),
        total_cost=round(input_cost + output_cost, 6),
        model=model
    )

# Usage
messages = [
    {"role": "system", "content": "You are a helpful assistant." * 50},  # simulate a long system prompt
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

for model in ["gpt-4o", "gpt-4o-mini", "claude-sonnet-4-20250514"]:
    est = estimate_cost(messages, model=model, estimated_output_tokens=800)
    print(f"{model}: {est.input_tokens} in / ~{est.estimated_output_tokens} out = ${est.total_cost:.4f}")

# Gate expensive calls
est = estimate_cost(messages, model="gpt-4o", estimated_output_tokens=2000)
if est.total_cost > 0.05:
    print(f"WARNING: Estimated cost ${est.total_cost:.4f} exceeds $0.05 threshold")
    # In production: switch to a cheaper model, truncate input, or require approval
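
The per-call numbers look small until you multiply them by traffic. The sketch below works through the same arithmetic as `estimate_cost` using the gpt-4o prices from the table above. The token counts and the volume of 10,000 calls per day are made-up assumptions, and `call_cost` is a hypothetical helper.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one call in dollars: tokens / 1M * price per 1M, summed over both directions."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


# 1,000 input tokens at $2.50/M = $0.0025; 800 output tokens at $10.00/M = $0.0080
per_call = call_cost(1_000, 800, 2.50, 10.00)

# Assumed traffic: 10,000 calls per day over a 30-day month
monthly = per_call * 10_000 * 30

print(f"per call: ${per_call:.4f}")
print(f"per month: ${monthly:,.2f}")
```

Under these assumptions a call that costs about one cent adds up to roughly $3,150 per month, which is why the gate above is worth wiring in before launch instead of after the first invoice.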

Anti-Patterns: Before and After

These are real production failures, not hypotheticals. Each one has burned engineering hours.

Anti-Pattern	Before (Broken)	After (Fixed)	Why It Matters
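String concatenation prompts	f"Classify: {user_input}"	Messages array with system/user roles	Concatenation mixes instructions and untrusted input in one string, which makes prompt injection easier. Separate roles help the model distinguish instructions from data, though they reduce the risk rather than eliminate it.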
No output format enforcement	"Return the data as JSON"	response_format={"type": "json_object"} or Pydantic + instructor	Models return markdown-wrapped JSON, extra text, or malformed JSON ~5% of the time without enforcement.
Hardcoded prompts in application code	prompt = "You are a helpful..." buried in route handler	Prompt registry with versioning, loaded from config	Prompt changes require code deploys. Registry lets you update prompts without redeploying.
No prompt testing	Manual testing in playground, ship it	Regression suite with TestCase assertions, run in CI	Model updates or prompt edits silently break behavior. Regression tests catch it before production.
Temperature 1.0 for deterministic tasks	temperature=1.0 (the default)	temperature=0 for classification/extraction, 0.7 for creative tasks	High temperature on deterministic tasks causes inconsistent output that breaks downstream parsers.
Ignoring token limits	Dump entire document into prompt	Count tokens first, chunk or summarize if over limit	Silent truncation produces wrong answers. The model processes a cut-off document without telling you.
Same prompt for all providers	One prompt for GPT-4 and Claude	Provider-specific formatting (XML for Claude, markdown for GPT)	Each model has formatting preferences. Ignoring them costs 10-20% quality.
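
For the "Ignoring token limits" row, the fix amounts to packing text into chunks that each fit a token budget. The sketch below is one simple way to do that at paragraph granularity. `chunk_by_budget` is a hypothetical helper, and its default counter is a whitespace word count used as a stand-in. In practice you would pass a real tokenizer-based counter instead.

```python
from typing import Callable


def chunk_by_budget(
    paragraphs: list[str],
    max_tokens: int,
    count: Callable[[str], int] = lambda s: len(s.split()),  # stand-in counter, not a tokenizer
) -> list[list[str]]:
    """Greedily pack paragraphs into chunks whose counted size stays within max_tokens."""
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for para in paragraphs:
        n = count(para)
        if n > max_tokens:
            # Refuse to truncate silently; the caller must split this paragraph further
            raise ValueError(f"Paragraph of {n} tokens exceeds budget of {max_tokens}")
        if current and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append(current)
    return chunks


doc = ["a b c", "d e", "f g h i"]
print(chunk_by_budget(doc, max_tokens=5))
```

Raising on an oversized paragraph is a deliberate choice here: it turns silent truncation into a visible error.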

Provider Differences That Matter

Pattern	OpenAI (GPT-4o)	Anthropic (Claude)	Open Source (Llama, Mistral)
Structured delimiters	Markdown headers (##), bullet lists, bold for emphasis	XML tags (<instructions>, <context>, <output>)	Simple markers like [INST] or ### — varies by model
System prompt	First message with role: 'system'	Separate 'system' parameter in API call	Often prepended to first user message or uses special tokens
JSON output	response_format: {type: 'json_object'} (native JSON mode)	No json_object-style flag in the calls shown here; use tool use with a JSON schema, or strong prompting + XML wrapper	Rarely supported natively; rely on prompt engineering
Few-shot format	user/assistant message pairs work well	user/assistant pairs + prefill (start assistant's response)	Varies — some need [INST]/[/INST] wrapping per example
Chain-of-thought	"Let's think step by step" works reliably	Prefers explicit step structure with XML: <thinking>...</thinking>	Inconsistent — smaller models often ignore CoT instructions
Max context	128K tokens (GPT-4o)	200K tokens (Claude 3.5/4)	8K-128K depending on model — check per model
Temperature behavior	0 = mostly deterministic, 0.7 = good creative range	0 = deterministic, tends to be more verbose at higher temps	Behavior varies significantly — test per model
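
The "System prompt" row is the difference that most often breaks code ported between providers. The sketch below builds request arguments for each provider from the same system and user text, mirroring the two calls shown earlier in this article. `to_openai_kwargs` and `to_anthropic_kwargs` are hypothetical helpers that only build dictionaries. They do not call either API.

```python
def to_openai_kwargs(system: str, user: str, model: str = "gpt-4o") -> dict:
    """OpenAI chat completions: the system prompt is the first message."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }


def to_anthropic_kwargs(system: str, user: str,
                        model: str = "claude-sonnet-4-20250514",
                        max_tokens: int = 512) -> dict:
    """Anthropic messages: the system prompt is a top-level parameter, and max_tokens is required."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system,
        "messages": [{"role": "user", "content": user}],
    }


system = "You are a concise incident summarizer."
user = "Summarize: database connection refused."

print(to_openai_kwargs(system, user)["messages"][0]["role"])
print("system" in to_anthropic_kwargs(system, user))
```

The results can be unpacked into the respective clients, for example `client.chat.completions.create(**to_openai_kwargs(...))`, which keeps provider-specific request shapes in one place.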

Claude Prefill Trick

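With Anthropic's API, you can start the assistant's response yourself by ending the messages list with a partial assistant message. For JSON output, add {"role": "assistant", "content": "{"} so the model continues from the opening brace. The response contains only the continuation, so prepend the "{" before parsing, and still validate the result: prefill makes JSON output far more likely but does not guarantee it is valid or complete. In practice this is more reliable than prompting alone.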
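
The prefill flow can be sketched without a network call. `build_prefilled_messages` and `parse_prefilled_json` are hypothetical helpers, and `fake_continuation` is a stand-in for the text a real response would carry. In a real call you would pass the messages to `anthropic_client.messages.create(...)` and read the continuation from the response.

```python
import json

PREFILL = "{"


def build_prefilled_messages(user_prompt: str) -> list[dict]:
    """End the conversation with a partial assistant turn so the reply continues from '{'."""
    return [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": PREFILL},
    ]


def parse_prefilled_json(continuation: str) -> dict:
    """The reply holds only the text after the prefill, so re-attach it before parsing."""
    # json.loads still raises if the model output was cut off or malformed
    return json.loads(PREFILL + continuation)


messages = build_prefilled_messages("Extract name and plan as JSON: 'Dana is on the Pro plan.'")

# Stand-in for the model's continuation (no API call is made in this sketch)
fake_continuation = '"name": "Dana", "plan": "Pro"}'
print(parse_prefilled_json(fake_continuation))
```

Wrapping the parse step in error handling is still worthwhile, since a reply truncated by `max_tokens` will not be valid JSON even with the prefill in place.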

Key Takeaways

CoT is a reasoning amplifier. Use manual CoT for control, zero-shot for convenience. Skip it for simple classification.
Few-shot examples beat instructions for format control. Build them dynamically from a dataset. Watch example ordering bias.
Self-consistency turns 70% accuracy into 90%+ at 5x cost. Gate it behind confidence thresholds and use it only for high-stakes calls.
Prompt chaining makes complex tasks debuggable. Each step is independently testable and swappable.
Structured output eliminates parsing bugs. Use instructor/Pydantic in production, not raw JSON mode.
System prompts need structure: role, capabilities, restrictions, format, tone, escalation triggers. 30 lines beats 3.
Version your prompts like code. A/B test variants. Run regression suites in CI. Estimate costs before calling.
Adapt per provider. XML for Claude, markdown for GPT, test everything for open-source models.

Next Steps

These patterns are the daily toolkit. They compose: CoT inside a chain step, few-shot inside a self-consistency loop, structured output at every stage. The next level is agent architecture — where prompts become tools that call other tools. See Phase 2: Agent Architecture Patterns for that.

#prompt-engineering #llm #ai-engineering #best-practices #production-ai
