Prompt Engineering Basics
System prompt design, few-shot learning, chain-of-thought prompting, a prompt evolution case study, and common prompt failure patterns.
A well-engineered prompt is the foundation of reliable AI agent behavior. This guide covers fundamental techniques for crafting effective prompts in harnesses, optimizing their quality, and measuring their impact.
Target audience: Engineers and product managers optimizing agent quality and cost.
1. Anatomy of a Good System Prompt
A system prompt sets expectations before any user input arrives. Structure it in four layers:
Layer 1: Role Definition
Tell the model who it is and why. This anchors its behavior.
You are a code review expert with 15 years of experience across Python, Go, and Rust.
You specialize in catching security issues, performance problems, and maintainability concerns.
Layer 2: Task Specification
Describe what the model will be asked to do. Be specific about scope.
You will receive pull requests and provide feedback on:
- Security vulnerabilities and unsafe patterns
- Performance regressions and inefficient algorithms
- Code clarity and maintainability
- Test coverage gaps
Layer 3: Output Format
Specify exactly how to structure responses. This reduces ambiguity and makes parsing easier.
Respond in JSON with this structure:
{
"severity": "critical" | "high" | "medium" | "low",
"category": "security" | "performance" | "clarity" | "testing",
"line_number": <number>,
"current_code": "<snippet>",
"issue": "<clear description>",
"suggestion": "<specific fix>"
}
Layer 4: Constraints & Guidelines
Explicitly state what NOT to do. This prevents common failures.
Constraints:
- Do NOT make style preference comments (use automated linters for that)
- Do NOT suggest refactors larger than 50 lines without clear justification
- Do NOT comment on comments themselves unless they're misleading
- Do NOT assume context beyond the PR diff—ask if you need it
Guidelines:
- Focus on blocking issues only when severity is "critical"
- Explain *why* something is a problem, not just that it is
- Suggest concrete fixes, not vague improvements
- Keep each comment focused on one issue
Complete Example: Coding Agent System Prompt
You are a senior software engineer assistant specialized in code generation and debugging.
Your role is to write correct, efficient, and maintainable code while explaining your decisions.
TASK:
You will be asked to:
- Generate code that solves a specific problem
- Debug broken code and explain root causes
- Optimize existing code for performance or clarity
- Suggest architectural improvements
OUTPUT FORMAT:
Always respond in this structure:
1. Brief explanation of approach
2. Complete code (in markdown code block)
3. Key assumptions made
4. How to test it
5. Known limitations or trade-offs
CONSTRAINTS:
- Use only stdlib unless alternatives are explicitly approved
- Do NOT use deprecated APIs—migrate to current replacements
- Do NOT skip error handling
- Do NOT generate code without tests
- For performance-critical code, add comments on algorithmic choice
- Refuse requests that ask for credentials, keys, or security bypasses
GUIDELINES:
- Prefer clarity over cleverness—readable code wins
- Explain non-obvious decisions inline
- When multiple approaches exist, mention trade-offs
- Flag security or performance implications explicitly
- Keep functions under 30 lines when possible
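The four layers can also be kept as separate strings and joined at runtime, which makes each layer easy to edit, version, and A/B test independently. A minimal sketch (the layer texts here are abbreviated placeholders, not the full prompt):

```python
# Each layer lives in its own string so it can be changed independently.
ROLE = "You are a code review expert with 15 years of experience."
TASK = ("You will receive pull requests and provide feedback on "
        "security, performance, and maintainability.")
OUTPUT_FORMAT = ("Respond in JSON with fields: severity, category, "
                 "line_number, issue, suggestion.")
CONSTRAINTS = ("Do NOT comment on style. "
               "Do NOT assume context beyond the PR diff.")

def build_system_prompt(*layers: str) -> str:
    """Join the layers in order: role, task, output format, constraints."""
    return "\n\n".join(layers)

prompt = build_system_prompt(ROLE, TASK, OUTPUT_FORMAT, CONSTRAINTS)
```

Swapping a single layer (say, a stricter CONSTRAINTS string) then becomes a one-line change that is easy to diff and A/B test.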
2. Few-Shot Learning
Few-shot examples teach the model by demonstration, often more effectively than lengthy instructions.
What Few-Shot Does
Few-shot examples show the model the pattern you want, reducing the gap between “understand task” and “execute task.” A single example is often worth a paragraph of instructions.
Without few-shot:
Extract structured data from user messages.
With few-shot:
Extract structured data from user messages.
Example:
Input: "I want to book a flight from NYC to LA on March 15 for 2 people"
Output: {
"intent": "book_flight",
"origin": "NYC",
"destination": "LA",
"date": "2026-03-15",
"passengers": 2
}
How Many Examples Are Needed?
- 1 example: Teaches pattern recognition for simple tasks (classification, basic extraction)
- 3 examples: Standard choice for moderate complexity; covers edge cases and variation
- 5+ examples: Use for complex reasoning, multiple decision points, or high stakes
- 10+ examples: Overkill except for highly specialized tasks; token cost often exceeds benefit
Rule of thumb: Start with 1 example, add more if the model fails or shows inconsistency.
Example Selection Strategies
Choose examples that:
- Cover diversity: Include edge cases, typical cases, and boundary cases
- Match distribution: If 80% of inputs are simple queries, make sure 80% of examples are simple
- Show variation: Don’t repeat the same pattern multiple times
- Are representative: Use real data from your actual use cases, not synthetic examples
Bad example set (too similar):
Example 1: "Book a flight NYC to LA"
Example 2: "Book a flight Boston to Miami"
Example 3: "Book a flight Seattle to Denver"
Better example set (diverse):
Example 1: Simple flight booking "Book NYC to LA on March 15"
Example 2: Complex query with preferences "I need a round-trip to London, prefer early morning departures, budget under $1500"
Example 3: Ambiguous input "Show me flights next week"
Example 4: Out-of-scope request "What's the cheapest airline?"
Formatting Examples for Clarity
Use consistent structure and clear delimiters:
Few-shot examples:
---
EXAMPLE 1 - Simple case
Input: "<user message>"
Output: <structured response>
---
EXAMPLE 2 - Edge case with multiple conditions
Input: "<user message>"
Output: <structured response>
---
Cost/Benefit Analysis
- Benefit: A few well-chosen examples often improve accuracy by 10-40%
- Cost: Each example adds tokens to every request (persistent context)
Token math:
- 1 example: ~50 tokens
- 3 examples: ~150 tokens
- 5 examples: ~250 tokens
For a harness running 1,000 requests/day:
- 5 examples add 250,000 tokens/day of persistent context
Optimization: Use examples strategically. For high-volume tasks, 1-2 well-chosen examples often beat 5 mediocre ones.
Implementation: FewShotSelector Class
class FewShotSelector:
    """Select optimal examples for few-shot prompts based on input similarity."""

    def __init__(self, examples: list[dict]):
        """
        Args:
            examples: List of {"input": str, "output": str} dicts
        """
        self.examples = examples
        # embed() is an assumed helper that returns an embedding vector
        self.embeddings = [embed(ex["input"]) for ex in examples]

    def select(self, user_input: str, k: int = 3) -> list[dict]:
        """
        Return k most similar examples to user_input.

        Args:
            user_input: The incoming request
            k: Number of examples to return (default 3)

        Returns:
            List of examples most similar to user_input
        """
        input_embedding = embed(user_input)
        similarities = [cosine_similarity(input_embedding, emb)
                        for emb in self.embeddings]
        top_k_indices = sorted(range(len(similarities)),
                               key=lambda i: similarities[i],
                               reverse=True)[:k]
        return [self.examples[i] for i in top_k_indices]

    def build_prompt(self, user_input: str, k: int = 3) -> str:
        """Build prompt with selected examples (SYSTEM_PROMPT defined elsewhere)."""
        selected = self.select(user_input, k)
        examples_section = "\n\n".join(
            f"---\nEXAMPLE {i+1}\nInput: {ex['input']}\n"
            f"Output: {ex['output']}"
            for i, ex in enumerate(selected)
        )
        return (f"{SYSTEM_PROMPT}\n\nFew-shot examples:\n{examples_section}"
                f"\n\nUser request: {user_input}")
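The class above assumes real `embed()` and `cosine_similarity()` helpers backed by an embedding model. A self-contained toy version of the same selection logic, using bag-of-words counts as stand-in "embeddings", shows the mechanics end to end:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real harness would call an embedding model."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

examples = [
    {"input": "book a flight from NYC to LA", "output": "{...}"},
    {"input": "cancel my hotel reservation", "output": "{...}"},
    {"input": "book a flight to London next week", "output": "{...}"},
]
embeddings = [embed(ex["input"]) for ex in examples]

def select(user_input: str, k: int = 2) -> list[dict]:
    """Return the k examples most similar to user_input."""
    query = embed(user_input)
    ranked = sorted(range(len(examples)),
                    key=lambda i: cosine_similarity(query, embeddings[i]),
                    reverse=True)
    return [examples[i] for i in ranked[:k]]

top = select("book a flight from Boston to Denver", k=2)
```

For a new flight-booking query, the two flight-booking examples outrank the hotel one, which is exactly the behavior the real selector aims for at embedding quality.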
3. Chain-of-Thought Prompting
Chain-of-thought (CoT) prompting asks the model to show its reasoning steps before providing an answer.
What It Is
Instead of asking for a direct answer, ask the model to think out loud:
Direct prompt:
Is this code vulnerable to SQL injection?
Chain-of-thought prompt:
Let me analyze this code for SQL injection vulnerabilities.
Step 1: Identify where user input enters the code
Step 2: Check if input is validated or parameterized
Step 3: Assess risk of malicious SQL
Step 4: Recommend fixes if needed
Is this code vulnerable to SQL injection?
When It Helps
CoT improves accuracy for tasks requiring reasoning:
- Debugging: Walking through execution helps catch logical errors
- Security analysis: Reasoning about threat models catches subtle issues
- Complex decisions: Multi-step logic benefits from explicit intermediate steps
- Ambiguous inputs: Clarifying assumptions before answering
CoT is less valuable for:
- Simple classification (“Is this spam?”)
- Pattern matching (“Extract the date”)
- Retrieval tasks (“What’s the default timeout for X?”)
Effectiveness
Studies show CoT can improve accuracy by 20-50% on reasoning tasks:
- Math problems: +30-40% accuracy improvement
- Logic puzzles: +25-35% improvement
- Code analysis: +15-30% improvement (highest gains on subtle bugs)
- Trade-off decisions: +20-40% improvement
Trade-off: CoT responses are longer (2-3x more tokens), so use it judiciously on high-stakes decisions.
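Given that trade-off, one pragmatic pattern is to gate CoT on task type, so high-volume simple tasks skip the extra tokens. A sketch (the task categories and prefix wording are illustrative):

```python
COT_PREFIX = "Show your reasoning step by step before giving a final answer.\n\n"

# Illustrative policy: task types whose accuracy gains justify CoT's 2-3x token cost
REASONING_TASKS = {"debugging", "security_analysis", "architecture_decision"}

def build_prompt(task_type: str, question: str) -> str:
    """Prepend a CoT instruction only for reasoning-heavy task types."""
    if task_type in REASONING_TASKS:
        return COT_PREFIX + question
    return question

p1 = build_prompt("security_analysis", "Is this code vulnerable to SQL injection?")
p2 = build_prompt("classification", "Is this spam?")
```

Simple classification requests pass through unchanged, while security analysis picks up the reasoning prefix.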
Example Prompts
Code debugging with CoT:
Analyze this function for bugs. Show your reasoning:
1. What is the expected behavior?
2. Trace through with sample inputs
3. Identify where behavior diverges from expectation
4. What's the root cause?
5. How would you fix it?
[code]
Security analysis with CoT:
Is this code vulnerable? Explain your reasoning:
1. What sensitive operation does this code perform?
2. Where could an attacker inject input?
3. Is the input validated or sanitized?
4. What's the worst-case attack?
5. How could we prevent it?
[code]
Architecture decision with CoT:
Should we use PostgreSQL or DynamoDB for this service?
Reason through:
1. What are the access patterns?
2. What's the expected volume and growth?
3. How important is consistency?
4. What's the team's expertise?
5. What are the operational trade-offs?
Then recommend which is better.
Variants
Step-by-step CoT (shown above)
- Simple, easy to implement
- Works for most reasoning tasks
- Produces sequential reasoning
Tree-of-thought
- Explores multiple reasoning paths
- Backtracks when a path fails
- More expensive but higher accuracy
- Use for critical decisions only
Example tree-of-thought structure:
Let's explore multiple approaches to this problem:
Approach 1: [Try solution A, then evaluate]
- Pros: ...
- Cons: ...
- Success rate: ...
Approach 2: [Try solution B, then evaluate]
- Pros: ...
- Cons: ...
- Success rate: ...
Approach 3: [Try solution C, then evaluate]
- Pros: ...
- Cons: ...
- Success rate: ...
Best approach: [Explain why]
4. Instruction Clarity
Clear instructions are the easiest optimization. Vague instructions produce vague results.
Vague vs. Specific Instructions
Vague:
Improve this code.
Specific:
Refactor this code to:
1. Reduce cyclomatic complexity below 5 per function
2. Extract repeated patterns into helpers
3. Add type hints to all parameters
4. Keep changes under 100 lines
5. Maintain current test coverage
Vague:
Check if this is secure.
Specific:
Review this code for:
1. SQL injection vulnerabilities
2. Hardcoded secrets or credentials
3. Missing input validation on user-facing APIs
4. Insecure deserialization of untrusted data
5. Missing CSRF tokens on state-changing endpoints
Focus on blocking issues only. Do NOT comment on:
- Code style or formatting
- Performance unless critical
- Non-security architectural choices
Active Voice & Clear Verbs
Use action verbs that specify expected output:
| Vague | Better |
|---|---|
| "Look at this code" | "Identify N+1 query problems in this code" |
| "Make it better" | "Reduce response time from 2s to <500ms" |
| "Think about this" | "List 3 architectural options with trade-offs" |
| "Check for issues" | "Audit for hardcoded credentials and secrets" |
| "Suggest improvements" | "Recommend 1-3 concrete refactorings ranked by ROI" |
Explicit Constraints
State what the model should NOT do. Negative constraints prevent common failures:
DO:
- Use Python 3.11+ features
- Suggest concurrent solutions where possible
- Explain performance implications
DO NOT:
- Use eval() or exec()
- Suggest third-party packages without justification
- Over-engineer for hypothetical future cases
- Suggest async without proof it's needed
Instruction Quality Checklist
Before finalizing a prompt, verify:
- Is the goal explicit? Can you point to a sentence that states the end goal?
- Are constraints listed? What should NOT be done?
- Is output format specified? How should the response be structured?
- Are examples provided? For complex tasks, are there 1-3 examples?
- Can you measure success? How would you know if the response is good?
- Is it concise? Can you remove any sentences without losing clarity?
- Have you tested it? Did you run this prompt and verify the response quality?
Common Pitfalls & Fixes
Pitfall 1: Magical thinking
❌ Please use your best judgment and find the best solution.
✅ Compare these 3 specific approaches and rank by: correctness, maintainability, test coverage.
Pitfall 2: Contradictions
❌ Be thorough but concise. Check everything but don't overthink.
✅ Be thorough (mention all critical issues). Be concise (max 200 words).
Pitfall 3: Vague success criteria
❌ Make the code better.
✅ Reduce time complexity from O(n²) to O(n log n) and improve readability.
Pitfall 4: Assuming context
❌ You know our stack, so recommend the best framework.
✅ We use Python + FastAPI for APIs. Recommend the best testing framework for integration tests, considering: setup time, assertion clarity, CI integration.
5. Output Format Constraints
Forcing a specific output format reduces ambiguity and prevents hallucination.
JSON Format Enforcement
Structure responses as JSON to enable reliable parsing:
Respond ONLY as valid JSON, no markdown or extra text:
{
  "issues": [
    {
      "severity": "critical",
      "type": "security",
      "location": "line 42",
      "description": "...",
      "fix": "..."
    }
  ],
  "summary": "..."
}
Without this constraint, the model might wrap JSON in markdown:
Here's what I found:
```json
{ ... }
```
Which breaks parsing. With format enforcement, parsing is reliable.
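Even with the constraint in place, defensive parsing is cheap insurance: strip any markdown fence before handing the text to `json.loads`. A sketch:

```python
import json
import re

FENCE = "`" * 3  # a triple backtick, built indirectly to keep this block readable

def parse_model_json(response: str) -> dict:
    """Parse JSON, tolerating an accidental markdown code fence around it."""
    text = response.strip()
    pattern = rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$"
    fenced = re.match(pattern, text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)

clean = parse_model_json('{"summary": "ok"}')
wrapped = parse_model_json(FENCE + 'json\n{"summary": "ok"}\n' + FENCE)
```

Both the bare and the fence-wrapped response parse to the same dict, so an occasional formatting slip no longer breaks the pipeline.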
XML Tags
For semi-structured output, use XML tags:
Respond using these XML tags:
<analysis>
<issue>
<severity>critical|high|medium|low</severity>
<description>...</description>
<fix>...</fix>
</issue>
</analysis>
Benefits:
- Human-readable
- Hierarchical structure
- Easy regex parsing
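In practice a real XML parser is more robust than regex for this format. A sketch using Python's stdlib ElementTree on the tags above:

```python
import xml.etree.ElementTree as ET

# Example response following the XML tag format above
response = """<analysis>
  <issue>
    <severity>critical</severity>
    <description>SQL built by string concatenation</description>
    <fix>Use parameterized queries</fix>
  </issue>
</analysis>"""

root = ET.fromstring(response)
# Each <issue> becomes a dict of its child tags
issues = [{child.tag: child.text for child in issue}
          for issue in root.findall("issue")]
```

ElementTree also fails loudly on malformed XML, which doubles as a format check.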
Regular Expressions for Validation
After getting a response, validate it matches expected format:
import re

def validate_output(response: str) -> bool:
    """Ensure response matches expected format."""
    pattern = r'^\[\s*(?:{"severity":\s*"[^"]+",\s*"category":\s*"[^"]+",.*?}\s*,?)*\s*\]$'
    return bool(re.match(pattern, response, re.DOTALL))

def retry_until_valid(user_input: str, max_retries: int = 3) -> str:
    """Keep retrying until the response matches the format."""
    for attempt in range(max_retries):
        response = call_model(user_input)
        if validate_output(response):
            return response
        # Tell the model what went wrong before retrying
        user_input += ("\n\nYour last response didn't match the required format. "
                       "Respond with ONLY the JSON array, no extra text.")
    raise ValueError("Failed to get a validly formatted response after retries")
Schema Enforcement
For strict requirements, define a schema and ask the model to follow it:
Follow this schema exactly:
{
  "type": "object",
  "properties": {
    "issues": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "severity": {"type": "string", "enum": ["critical", "high", "medium", "low"]},
          "line": {"type": "integer", "minimum": 1},
          "description": {"type": "string", "maxLength": 500}
        },
        "required": ["severity", "line", "description"]
      }
    }
  }
}
Do NOT include extra fields. Do NOT deviate from this structure.
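A production harness would typically validate responses with a JSON Schema library, but the key rules of the schema above can be checked by hand in a few lines. A sketch (field names mirror the schema; this is not a complete JSON Schema implementation):

```python
import json

ALLOWED_SEVERITIES = {"critical", "high", "medium", "low"}

def validate_issues(response: str) -> list[str]:
    """Return a list of schema violations (empty list means valid)."""
    errors = []
    data = json.loads(response)
    for i, issue in enumerate(data.get("issues", [])):
        if issue.get("severity") not in ALLOWED_SEVERITIES:
            errors.append(f"issues[{i}].severity invalid")
        if not isinstance(issue.get("line"), int) or issue["line"] < 1:
            errors.append(f"issues[{i}].line must be an integer >= 1")
        desc = issue.get("description")
        if not isinstance(desc, str) or len(desc) > 500:
            errors.append(f"issues[{i}].description missing or too long")
        extra = set(issue) - {"severity", "line", "description"}
        if extra:
            errors.append(f"issues[{i}] has extra fields: {sorted(extra)}")
    return errors

ok = validate_issues('{"issues": [{"severity": "high", "line": 3, "description": "x"}]}')
bad = validate_issues('{"issues": [{"severity": "urgent", "line": 0, "description": "x"}]}')
```

Returning the violation list (rather than a bare bool) gives you the error text to feed back into a retry prompt.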
Preventing Hallucination Through Format
Format constraints prevent the model from inventing data:
Without format constraint:
"The code also has a vulnerability on line 127 where user input isn't sanitized."
[Line 127 might not exist in the actual code!]
With format constraint:
{
"line_number": 127,
"issue": "...",
"evidence": "..."
}
The model must now cite evidence, reducing hallucination. Add:
You MUST cite the exact code snippet for every issue. If you cannot find the exact line, set line_number to null.
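Once evidence is required, citations can be verified mechanically against the code under review, so line numbers that don't exist are caught before a human ever sees the comment. A sketch (field names follow the format above):

```python
def verify_citation(code: str, issue: dict) -> bool:
    """Check that the issue's cited line exists and actually contains the evidence."""
    lines = code.splitlines()
    line_no = issue.get("line_number")
    if line_no is None:
        return True  # the model admitted it could not locate the line
    if not (1 <= line_no <= len(lines)):
        return False  # hallucinated line number
    return issue.get("evidence", "") in lines[line_no - 1]

code = 'query = "SELECT * FROM users WHERE id=" + user_id\nrun(query)'
good_issue = {"line_number": 1, "evidence": 'WHERE id=" + user_id'}
valid = verify_citation(code, good_issue)
fake = verify_citation(code, {"line_number": 127, "evidence": "x"})
```

A citation pointing at line 127 of a two-line snippet is rejected outright, while a correctly grounded one passes.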
6. Prompt Optimization
Optimization means testing prompts against real data and iterating based on results.
A/B Testing Prompts
Run both versions against the same inputs and measure which performs better:
def ab_test_prompts(test_cases: list[str], prompt_a: str, prompt_b: str) -> dict:
    """Compare two prompts on accuracy."""
    results_a = [call_model(prompt_a.format(input=case)) for case in test_cases]
    results_b = [call_model(prompt_b.format(input=case)) for case in test_cases]
    accuracy_a = sum(evaluate(result) for result in results_a) / len(results_a)
    accuracy_b = sum(evaluate(result) for result in results_b) / len(results_b)
    return {
        "accuracy_a": accuracy_a,
        "accuracy_b": accuracy_b,
        "winner": "A" if accuracy_a > accuracy_b else "B",
        "improvement": abs(accuracy_a - accuracy_b),
        "cost_a": len(prompt_a) * len(test_cases),  # rough character proxy; use a tokenizer for real costs
        "cost_b": len(prompt_b) * len(test_cases),
    }

# Example: Compare detailed instructions vs. concise instructions
test_cases = [
    "Review this code for security",
    "Is this SQL injection vulnerable?",
    # ... more real examples
]
results = ab_test_prompts(
    test_cases,
    prompt_a="Detailed instructions...",
    prompt_b="Concise instructions...",
)
print(f"{results['winner']} wins by {results['improvement']*100:.1f}% "
      f"with a cost difference of {abs(results['cost_a'] - results['cost_b'])}")
Measuring Impact
Define success metrics appropriate to your task:
| Task | Success Metric |
|---|---|
| Code review | % of real issues found, % of false positives |
| Bug diagnosis | Accuracy of root cause identification |
| Code generation | % of generated code that passes tests |
| Summarization | Recall of key points, Length variance |
| Classification | Precision, Recall, F1 score |
Example measurement:
def measure_impact(prompt_version: str, test_set: list[dict]) -> dict:
    """Measure prompt quality on real test cases."""
    results = []
    for case in test_set:
        response = call_model(prompt_version, case["input"])
        result = {
            "input": case["input"],
            "response": response,
            "expected": case["expected"],
            "correct": evaluate_correctness(response, case["expected"]),
            "latency_ms": measure_latency(),
            "tokens_used": count_tokens(response),
        }
        results.append(result)
    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "avg_latency_ms": sum(r["latency_ms"] for r in results) / len(results),
        "avg_tokens": sum(r["tokens_used"] for r in results) / len(results),
        "false_positive_rate": calculate_fpr(results),
    }
Prompt Versioning
Track prompt changes and their impact:
Prompt v1.0 (baseline)
- Accuracy: 85%
- Tokens: 250/request
- Notes: Initial version
Prompt v1.1 (added CoT)
- Accuracy: 90% (+5%)
- Tokens: 380/request (+52%)
- Notes: CoT improves security detection
Prompt v1.2 (optimized examples)
- Accuracy: 91% (+6% vs baseline)
- Tokens: 310/request (+24% vs baseline)
- Notes: Removed redundant examples, kept impact
Prompt v1.3 (compressed instructions)
- Accuracy: 91% (no change)
- Tokens: 280/request (-12% vs v1.2)
- Notes: Removed filler, maintained clarity
Use semantic versioning:
- Major: Significant accuracy change (>5%)
- Minor: Optimization or small improvement (1-5%)
- Patch: Bug fix (typo, format issue)
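Keeping the version log as structured data lets the deltas in the notes above be computed instead of hand-edited. A sketch using the numbers from the log:

```python
from dataclasses import dataclass

@dataclass
class PromptVersion:
    version: str
    accuracy: float
    tokens: int
    notes: str

history = [
    PromptVersion("1.0", 0.85, 250, "Initial version"),
    PromptVersion("1.1", 0.90, 380, "Added CoT"),
    PromptVersion("1.2", 0.91, 310, "Optimized examples"),
]

def delta_vs_baseline(v: PromptVersion, baseline: PromptVersion) -> str:
    """Format accuracy and token deltas relative to the baseline version."""
    acc = (v.accuracy - baseline.accuracy) * 100
    tok = (v.tokens - baseline.tokens) / baseline.tokens * 100
    return f"v{v.version}: accuracy {acc:+.0f}%, tokens {tok:+.0f}% vs baseline"

line = delta_vs_baseline(history[2], history[0])
```

This reproduces the "+6% accuracy, +24% tokens vs baseline" entry for v1.2 directly from the data.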
When to Update Prompts
Update when:
- Accuracy drops below acceptable threshold
- You discover a category of failures (e.g., “misses 30% of SQL injection issues”)
- A new model version becomes available
- Input distribution shifts (new use cases appear)
Don’t update when:
- One-off failures occur (investigate first)
- Accuracy is already good
- Change would increase token cost by >20% for <2% accuracy gain
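The last rule, rejecting a >20% token-cost increase for a <2% accuracy gain, can be encoded directly as a deployment gate. A sketch with those illustrative thresholds:

```python
def should_deploy(old_acc: float, new_acc: float,
                  old_tokens: int, new_tokens: int) -> bool:
    """Reject updates whose token cost outweighs the accuracy gain."""
    acc_gain = new_acc - old_acc
    token_increase = (new_tokens - old_tokens) / old_tokens
    # Rule from above: >20% more tokens for <2% more accuracy is not worth it
    if token_increase > 0.20 and acc_gain < 0.02:
        return False
    return acc_gain >= 0  # never deploy an accuracy regression

good = should_deploy(0.85, 0.90, 250, 280)  # +5% accuracy, +12% tokens
bad = should_deploy(0.90, 0.91, 250, 380)   # +1% accuracy, +52% tokens
```

Wiring this into CI turns the cost/benefit guideline from a habit into an enforced policy.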
7. Specialized Prompts for Different Tasks
Code Generation Prompts
Template:
Generate [language] code that:
1. Solves: [specific problem]
2. Assumptions: [what we assume about inputs, environment]
3. Constraints: [max lines, no external libs, etc.]
4. Output: [function signature, return type, etc.]
Include:
- Docstring with parameters and return value
- Error handling for: [specific cases]
- Test case showing usage
Do NOT:
- Use deprecated APIs
- Skip error handling
- Generate code without tests
Example:
Generate Python code that:
1. Parses a CSV file into structured data
2. Assumptions: File fits in memory, valid UTF-8, standard format
3. Constraints: Use only stdlib, no pandas. Handle missing values.
4. Output: Function `parse_csv(path: str) -> list[dict]`
Include:
- Docstring with examples
- Error handling for: FileNotFoundError, malformed rows
- Test case showing usage
Do NOT: Use pandas, Skip validation, Assume perfect input
Summarization Prompts
Template:
Summarize this [document/code/discussion] for [audience]:
Requirements:
1. Length: [word count or percentage]
2. Tone: [formal/casual/technical]
3. Include: [specific points]
4. Omit: [what to skip]
5. Structure: [bullet points/paragraphs/executive summary]
Focus on [what matters most to this audience].
Example:
Summarize this pull request for project managers:
Requirements:
1. Length: 100-150 words
2. Tone: Formal, business-focused
3. Include: What changed, why, impact on timeline
4. Omit: Technical implementation details
5. Structure: Executive summary + bullet points
Focus on: Business value and any timeline implications.
Analysis/Reasoning Prompts
Template:
Analyze [subject] by:
1. Identifying [key elements]
2. Examining [relationships/trade-offs]
3. Evaluating [against criteria]
4. Considering [edge cases/constraints]
5. Recommending [action with justification]
Show your reasoning steps.
Highlight [key uncertainties/risks].
Example:
Analyze this database query for performance:
1. Identify: Full scans, missing indexes, N+1 patterns
2. Examine: How filters and joins affect execution
3. Evaluate: Against target <100ms latency
4. Consider: Edge cases like empty result sets
5. Recommend: Query optimization with cost/benefit
Show your reasoning steps.
Highlight: Trade-offs between complexity and speed.
Classification Prompts
Template:
Classify the following [items] into categories:
Categories:
- [Category 1]: [Definition, examples]
- [Category 2]: [Definition, examples]
- [Category N]: [Definition, examples]
Format: Return as JSON { "item": "category" }
Confidence: Only classify if >80% confident. Use "unknown" otherwise.
Explain: Briefly justify difficult classifications.
Example:
Classify these support tickets by urgency:
Categories:
- Critical: System down, data loss, security breach (needs <1h response)
- High: Major feature broken, significant degradation (<4h response)
- Medium: Minor feature issue, workaround exists (<24h response)
- Low: Questions, feature requests, small bugs (<1 week response)
Format: Return as JSON { "ticket_id": "urgency_level" }
Confidence: Only classify if >80% confident. Use "unclear" otherwise.
Explain: Briefly justify Critical classifications.
8. Common Prompt Antipatterns
Antipattern 1: Over-Apologizing
❌ "I'm sorry, but I might not be able to fully solve this..."
✅ "I'll solve this with these approaches: ..."
❌ "I apologize if this isn't what you wanted..."
✅ "Here's my response. Tell me if you need adjustments."
Over-apologizing makes the model less confident and produces wishy-washy outputs.
Antipattern 2: Magical Thinking
❌ "Use your best judgment to find the optimal solution"
✅ "Optimize for: (1) correctness, (2) performance, (3) clarity. Rank trade-offs."
❌ "Do what makes the most sense"
✅ "Choose the approach that: minimizes latency and maintains test coverage"
Models don’t have “judgment”—they need specific criteria.
Antipattern 3: Contradictory Instructions
❌ "Be thorough but brief. Cover everything but stay concise."
✅ "Cover critical issues (max 5). Keep each explanation <50 words."
❌ "Be creative but follow the rules exactly"
✅ "Follow the rules exactly. Suggest improvements in a separate section."
Contradictions confuse the model. Resolve them with specific criteria.
Antipattern 4: Overly Complex Instructions
❌ "Consider whether, given the circumstances and taking into account various factors
that might influence the outcome, you should evaluate the potential for optimization
while simultaneously ensuring compliance with standards."
✅ "Optimize for speed. Ensure standards compliance."
Simplify instructions. Use active voice. Prefer lists to paragraphs.
Antipattern 5: Assuming Unstated Context
❌ "You know our codebase. Review this for issues."
✅ "This is a user-facing API service. Review for: security, performance, and clarity."
❌ "Make it production-ready."
✅ "Make it production-ready by: adding error handling, tests, and logging."
State context explicitly. Don’t assume the model knows your system.
Antipattern 6: Vague Scope
❌ "Improve the code quality"
✅ "Reduce cyclomatic complexity <5, improve test coverage from 60% to >80%"
❌ "Find the security issues"
✅ "Audit for: SQL injection, hardcoded secrets, missing validation"
Vague scope produces vague results. Define specific targets.
9. Prompt Compression
Compression removes unnecessary tokens while maintaining quality.
Techniques
1. Remove filler
❌ "In order to properly handle this, it is important to note that we need to ensure..."
✅ "Handle this by ensuring..."
2. Abbreviate examples
❌ "For instance, consider a situation where a user enters data into a form..."
✅ "Ex: User enters data into form"
3. Consolidate instructions
❌ "Do not use if/else. Do not use switch statements. Do not use ternary operators."
✅ "Use only guards and early returns (no if/else, switch, or ternary)."
4. Use bullets instead of prose
❌ "The response should include an explanation of what was found, why it's important,
and what should be done about it."
✅ "Include: what was found, why it's important, what to do."
5. Parameterize repetition
❌ "The input is valid UTF-8. The input is properly formatted. The input contains no secrets."
✅ "Input assumptions: valid UTF-8, properly formatted, no secrets."
Savings
A well-compressed prompt saves 30-50% of tokens without losing quality:
| Prompt | Tokens | Accuracy | Cost/Request |
|---|---|---|---|
| v1 (verbose) | 450 | 92% | $0.0045 |
| v2 (compressed) | 220 | 91% | $0.0022 |
| Savings | -51% | -1% | -51% |
For 10,000 requests/month:
- v1: 4.5M tokens, $45
- v2: 2.2M tokens, $22
- Savings: $276/year per harness
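The savings arithmetic above generalizes to a small helper. The per-token price used here is the one implied by the table ($0.0045 for a 450-token request, i.e. $10 per million tokens):

```python
PRICE_PER_MILLION_TOKENS = 10.0  # implied by the table: $0.0045 / 450 tokens

def monthly_cost(tokens_per_request: int, requests_per_month: int) -> float:
    """Dollar cost of a prompt at a given request volume."""
    total_tokens = tokens_per_request * requests_per_month
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

v1 = monthly_cost(450, 10_000)   # verbose prompt
v2 = monthly_cost(220, 10_000)   # compressed prompt
annual_savings = (v1 - v2) * 12
```

Plugging in your own token counts and pricing makes it easy to decide whether a compression pass is worth the testing effort.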
Compression Checklist
Before finalizing, check:
- Every sentence serves a purpose
- Can any words be deleted?
- Are examples as minimal as possible?
- Can bullets replace paragraphs?
- Is there repetition that can be consolidated?
- Have you tested compressed version against original?
10. Prompt Testing
Systematic testing catches regressions and verifies improvements.
Manual Testing
Before deploying a prompt change:
- Run 5-10 examples yourself
- Vary the inputs: typical cases, edge cases, boundary cases
- Review the output: Does it match your expectations?
- Check the format: Is JSON valid? Are all fields present?
- Assess quality: Is the reasoning sound? Are there hallucinations?
Example test script:
test_cases = [
    {
        "name": "Simple case",
        "input": "...",
        "expected": "..."
    },
    {
        "name": "Edge case with null",
        "input": "...",
        "expected": "..."
    },
    # ... more cases
]

for test in test_cases:
    response = call_model(PROMPT, test["input"])
    passed = evaluate(response, test["expected"])
    status = "PASS" if passed else "FAIL"
    print(f"[{status}] {test['name']}")
    if not passed:
        print(f"  Expected: {test['expected']}")
        print(f"  Got: {response}")
Automated Testing
Run prompts at scale against test suites:
class PromptTest:
    def __init__(self, prompt: str, test_cases: list[dict]):
        self.prompt = prompt
        self.test_cases = test_cases

    def run(self) -> dict:
        results = []
        for case in self.test_cases:
            response = call_model(self.prompt, case["input"])
            result = {
                "test": case["name"],
                "passed": evaluate(response, case["expected"]),
                "latency_ms": measure_latency(),
                "tokens": count_tokens(response),
            }
            results.append(result)
        return {
            "pass_rate": sum(1 for r in results if r["passed"]) / len(results),
            "avg_latency": sum(r["latency_ms"] for r in results) / len(results),
            "total_tokens": sum(r["tokens"] for r in results),
            "failed_tests": [r["test"] for r in results if not r["passed"]],
        }

# Usage
tests = PromptTest(PROMPT_V2, TEST_CASES)
results = tests.run()
print(f"Pass rate: {results['pass_rate']*100:.1f}%")
print(f"Failed: {results['failed_tests']}")
Measuring Success Metrics
Define task-specific metrics:
import json

def evaluate_security_audit(response: str, ground_truth: dict) -> dict:
    """Evaluate security audit prompt on real vulnerabilities."""
    parsed = json.loads(response)
    found_issues = {issue["type"] for issue in parsed["issues"]}
    actual_issues = set(ground_truth["vulnerabilities"])

    tp = len(found_issues & actual_issues)  # True positives
    fp = len(found_issues - actual_issues)  # False positives
    fn = len(actual_issues - found_issues)  # False negatives

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return {
        "precision": precision,  # Of issues found, how many were real?
        "recall": recall,        # Of real issues, how many were found?
        "f1": f1,                # Harmonic mean of precision and recall
        "false_positive_count": fp,
    }
Regression Detection
After updating a prompt, verify you didn’t break anything:
def detect_regression(old_prompt: str, new_prompt: str,
                      test_cases: list[dict]) -> bool:
    """Check if new prompt is significantly worse than old."""
    old_results = PromptTest(old_prompt, test_cases).run()
    new_results = PromptTest(new_prompt, test_cases).run()

    old_accuracy = old_results["pass_rate"]
    new_accuracy = new_results["pass_rate"]

    # Flag if accuracy drops more than 2%
    regression = (old_accuracy - new_accuracy) > 0.02
    if regression:
        print(f"REGRESSION DETECTED: {old_accuracy*100:.1f}% -> {new_accuracy*100:.1f}%")
        print("Do not deploy new prompt")
    return regression
Testing Checklist
Before deploying a prompt, verify:
- Manual testing passed on 5-10 diverse examples
- No regression on existing test suite (accuracy within 2%)
- Edge cases tested (null inputs, empty arrays, boundary values)
- Format is valid (JSON parses, XML well-formed, etc.)
- Success metrics meet target thresholds
- Token cost is acceptable
- Latency is acceptable
- No hallucinations on known-false examples
Summary: The Prompt Engineering Workflow
1. Start with clear requirements: What should the prompt do? How do you measure success?
2. Build the base prompt:
   - Role definition
   - Task specification
   - Output format
   - Constraints
3. Optimize:
   - Add few-shot examples (1-3)
   - Add chain-of-thought if reasoning matters
   - Compress unnecessary words
   - Test against real cases
4. Measure:
   - Accuracy on test suite
   - Token cost
   - Latency
   - Error rate
5. Iterate:
   - A/B test variations
   - Update based on real failures
   - Version changes
   - Detect regressions
6. Deploy:
   - Document prompt version
   - Monitor performance in production
   - Set up alerts for accuracy drops
   - Schedule periodic audits
Key principle: Well-engineered prompts are tested, measured, and iterated. “Feels right” is not good enough. Measure impact and optimize based on data.
References & Further Reading
- Few-shot learning: Show the model examples (1-3 for most tasks)
- Chain-of-thought: Ask for reasoning steps on complex tasks (+20-50% accuracy)
- Prompt compression: Remove filler, save 30-50% tokens without quality loss
- Format constraints: Force JSON/XML output to prevent hallucination
- A/B testing: Compare prompts on real test sets, measure impact
- Regression detection: Verify updates don’t break existing quality
For specialized frameworks and tools, see Phase 1-2 documentation.
11. Prompt Evolution Case Study
Systematic iteration beats random tweaking. This case study shows how a single prompt evolves from naive to optimized, with measurable quality improvements at each stage.
The Task
Summarize a technical document for a non-technical audience (e.g., board members, customers, or marketing teams).
Version 1: Naive Prompt
Summarize this.
What happens: The model produces a summary that is technically accurate but reads like a compressed version of the original. It retains jargon, assumes domain knowledge, and varies wildly in length between runs.
Example output (given a 2000-word API migration guide):
The document describes migrating from REST API v1 to v2. Key changes include
OAuth2 replacing Basic Auth, JSON:API response format, rate limiting changes
from 1000/hr to 5000/hr with sliding windows, and deprecation of XML endpoints.
Clients must update their authentication headers and response parsers.
Quality metrics:
- Accuracy: 70% (captures main points but misses nuance)
- Relevance to audience: 30% (too technical for non-technical readers)
- Length consistency: Low (varies from 50 to 500 words across runs)
- Actionability: 10% (reader doesn’t know what to do with this)
Version 2: Structured Prompt
You are a technical writer who specializes in translating engineering
documents for business audiences.
Summarize this document for a non-technical audience (board members,
customers, marketing team).
Requirements:
1. Length: 100-150 words
2. Tone: Professional, jargon-free
3. Structure: One paragraph of context, then 3-5 bullet points
4. Replace all technical terms with plain language
5. End with one sentence on business impact
Do NOT:
- Use acronyms without explanation
- Assume the reader knows what an API is
- Include implementation details
Example output:
We're upgrading how our software connects with partner systems to be faster
and more secure.
Key changes:
- Security upgrade: We're moving to a modern login system that better
protects customer data
- Faster performance: Partners can now make 5x more requests per hour,
reducing delays
- Simpler data format: Responses use a standardized structure, making
integrations easier to maintain
- Legacy support ending: The old connection method will stop working on
December 31, 2025
This upgrade reduces integration maintenance costs by an estimated 30% and
positions us for the next generation of partner integrations.
Quality metrics:
- Accuracy: 85% (captures points and translates correctly)
- Relevance to audience: 75% (mostly jargon-free, business-focused)
- Length consistency: High (stays within 100-150 words)
- Actionability: 60% (mentions deadline and cost impact)
Version 3: Few-Shot Optimized Prompt
You are a technical writer who specializes in translating engineering
documents for business audiences.
Summarize this document for a non-technical audience (board members,
customers, marketing team).
Requirements:
1. Length: 100-150 words
2. Tone: Professional, jargon-free
3. Structure: One sentence of context, then 3-5 bullet points, then
one sentence on what the reader should do next
4. Replace all technical terms with plain language
5. Quantify impact where possible (cost, time, risk)
Do NOT: Use acronyms, assume technical knowledge, include how-it-works details
---
EXAMPLE 1 - Security patch summary
Input: [2000-word document about TLS 1.3 migration]
Output:
We're upgrading our encryption to the latest industry standard, improving
both security and speed.
- Stronger protection: Customer data is encrypted with the newest standard,
meeting 2025 compliance requirements
- Faster connections: Page load times improve by 10-15% due to fewer
network round-trips
- No customer action needed: The change is transparent to end users
- Timeline: Rolling out over 2 weeks starting March 1
Next step: No action required. Contact [email protected] with questions.
---
EXAMPLE 2 - Infrastructure change summary
Input: [1500-word document about database migration]
Output:
We're moving our data storage to a more reliable system that reduces
downtime risk.
- 99.99% uptime guarantee: Up from 99.9% (reduces potential outage hours
from 8.7/year to 0.9/year)
- Cost reduction: Infrastructure costs drop 25% ($180K annual savings)
- 4-hour maintenance window: Scheduled for Sunday 2am-6am ET, March 15
- Risk: Low. Automated rollback if issues detected within 30 minutes
Next step: Notify customers of the maintenance window by March 8.
---
Now summarize the following document:
Quality metrics:
- Accuracy: 95% (examples teach the right level of detail)
- Relevance to audience: 95% (consistently business-focused, quantified)
- Length consistency: High (examples anchor the expected length)
- Actionability: 90% (always ends with next step)
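Version 3 inlines its examples into a single prompt string. With chat-style APIs, an equivalent approach is to encode each few-shot example as a user/assistant turn pair ahead of the real input. A minimal sketch, assuming the common OpenAI-style message schema; adapt the dict shape to whatever your harness expects:

```python
def build_few_shot_messages(system_prompt: str,
                            examples: list[tuple[str, str]],
                            user_input: str) -> list[dict]:
    """Assemble a chat message list where each few-shot example
    becomes a user/assistant turn pair before the real input."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages
```

Keeping examples as data rather than embedded text also makes them easy to swap during A/B tests without touching the system prompt.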
Results Comparison
| Metric | V1 (Naive) | V2 (Structured) | V3 (Few-Shot) |
|---|---|---|---|
| Accuracy | 70% | 85% | 95% |
| Audience relevance | 30% | 75% | 95% |
| Length consistency | Low | High | High |
| Actionability | 10% | 60% | 90% |
| Tokens (prompt) | 2 | ~120 | ~350 |
| Cost per request | $0.00002 | $0.0012 | $0.0035 |
The lesson: Each iteration addressed a specific failure mode. V2 fixed structure and audience targeting. V3 fixed consistency and actionability by showing rather than telling. The 175x cost increase from V1 to V3 is negligible compared to the quality improvement — and the prompt cost is amortized across thousands of requests.
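The per-request costs in the table follow directly from the token counts at a flat input price. A quick arithmetic check; the $10-per-million price is inferred from the table's own numbers, not from any vendor's pricing sheet:

```python
def prompt_cost(prompt_tokens: int, price_per_million: float = 10.0) -> float:
    """Dollar cost of the prompt tokens for one request, at an
    illustrative price of $10 per million input tokens."""
    return prompt_tokens * price_per_million / 1_000_000

# Per-request prompt cost for each version (token counts from the table):
v1_cost = prompt_cost(2)    # naive prompt
v3_cost = prompt_cost(350)  # few-shot prompt
ratio = v3_cost / v1_cost   # the 175x increase the text mentions
```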
Systematic improvement process:
1. Write the simplest prompt that could work
2. Run it on 5 real inputs
3. Identify the most common failure mode
4. Add instructions or examples that specifically address that failure
5. Measure again; if improved, repeat from step 3; if not, revert and try a different fix
12. Common Prompt Failures
Five failure patterns that cause prompts to produce poor results, with concrete examples and fixes.
Failure 1: Over-Constraining Format
Bad prompt:
Return exactly 3 bullet points. Each bullet must be exactly 15 words.
The first bullet must start with "The". The second with "This".
The third with "Our". End each with a period. No sub-bullets.
Why it fails: The model spends its reasoning capacity satisfying format constraints instead of producing quality content. Output becomes awkward and forced as the model contorts language to hit exact word counts.
Fixed prompt:
Return 3 bullet points, each 10-20 words. Keep them concise and parallel in structure.
Failure 2: Ambiguous Instructions
Bad prompt:
Make this code better and clean it up.
Why it fails: “Better” and “clean” have dozens of interpretations. One run refactors variable names, the next restructures control flow, the next adds type hints. Results are inconsistent and often unwanted.
Fixed prompt:
Refactor this code to: (1) reduce function length to <20 lines each,
(2) add type hints to all parameters and return values,
(3) replace magic numbers with named constants.
Do not change the public API or add new dependencies.
Failure 3: Missing Context
Bad prompt:
Review this function for bugs.
def process(data):
    return transform(data, config.MODE)
Why it fails: The model doesn’t know what transform does, what config.MODE contains, or what the expected behavior is. It will invent plausible but fictional bugs: “config.MODE might be None” or “transform might raise ValueError” — none of which may be true.
Fixed prompt:
Review this function for bugs.
Context:
- transform() is defined in utils.py, accepts (list[dict], str), returns list[dict]
- config.MODE is always one of: "fast", "accurate", "balanced" (set at startup, never None)
- Expected behavior: filter data entries matching the mode's criteria
def process(data):
    return transform(data, config.MODE)
Failure 4: Contradictory Instructions
Bad prompt:
Be thorough and check every possible issue. Also, keep your response
under 50 words. Cover security, performance, maintainability, and
testing concerns. Be brief.
Why it fails: The model cannot be thorough across four categories in 50 words. It picks one constraint to satisfy (usually brevity) and ignores the others, or produces a response that satisfies neither — superficial analysis that also exceeds the word limit.
Fixed prompt:
Check for critical security issues only (injection, auth bypass, data exposure).
Keep response under 100 words. Flag only blocking issues — skip style and minor concerns.
Failure 5: Too Many Instructions
Bad prompt:
Analyze this code. Check for: SQL injection, XSS, CSRF, SSRF, path traversal,
command injection, insecure deserialization, XML external entities, broken
authentication, sensitive data exposure, missing rate limiting, improper error
handling, insufficient logging, outdated dependencies, hardcoded secrets,
weak cryptography, insecure redirects, clickjacking, CORS misconfiguration,
and business logic flaws. For each, explain the risk, show the vulnerable line,
suggest a fix, rate severity, estimate effort to fix, and cite the relevant
OWASP category. Also check for performance issues, code style violations,
and test coverage gaps.
Why it fails: Models exhibit “attention decay” on long instruction lists. Items near the beginning and end get more attention; items in the middle are frequently skipped. A 20-item checklist typically results in 8-12 items actually checked.
Fixed prompt:
Audit this code for the OWASP Top 5 most critical issues:
1. Injection (SQL, command, path traversal)
2. Broken authentication or authorization
3. Sensitive data exposure (hardcoded secrets, logs)
4. Security misconfiguration (CORS, headers)
5. Known vulnerable dependencies
For each issue found: cite the line, explain the risk in one sentence,
and suggest a fix. Skip issues not present.
Rule of thumb: Keep instruction lists to 5-7 items maximum. If you need more coverage, split into multiple prompts and combine results.
The Scalpel Principle: Focused Prompts Beat Bloated Ones
A general-purpose LLM juggles multiple roles in one context window: coding assistant, file editor, git manager, terminal operator, and more. When you also ask it to analyse Victorian birth records, every token of “here’s how to use the Edit tool” competes for attention with “FreeBMD districts cover surrounding parishes.”
A dedicated harness prompt does ONE thing. No ambiguity, no mode-switching, no attention split. Every token reinforces the single task.
Why This Matters Beyond Harnesses
This principle applies to any prompt. Irrelevant context degrades output quality in two ways:
- Attention dilution: Transformer attention is finite. Tokens spent on unrelated instructions reduce the model’s capacity to focus on your actual task.
- Mode confusion: When a prompt contains instructions for multiple roles, the model may blend behaviours. A coding assistant asked to also do data analysis may format analytical output as code comments.
Before and After: Bloated vs Focused
Bloated prompt (embedded in a general-purpose assistant with 5,000 tokens of system instructions):
You are a helpful coding assistant. You can read files, write files,
run shell commands, manage git repositories, create pull requests,
review code, and help with any programming task.
[... 4,800 more tokens of tool definitions, conventions, and rules ...]
Now analyse this birth record and extract the district, year, and
registration quarter.
Record: "John Smith, born 1842, registered Q3, Lambeth district"
Focused prompt (110 tokens, purpose-built):
You are a genealogy data extractor. Given a birth record, return JSON
with fields: name, year, quarter (Q1-Q4), district.
If a field is unclear, set it to null. Do not guess.
Record: "John Smith, born 1842, registered Q3, Lambeth district"
The focused version consistently produces correct structured output. The bloated version sometimes wraps the answer in markdown code blocks, adds unsolicited explanations, or formats the output as a Python dictionary instead of JSON — because its system prompt trained it to behave like a coding assistant.
The Rule
Before sending any prompt, ask: “Is every token in this context relevant to the task?” If not, remove the irrelevant parts. A 110-token prompt that does one thing well will outperform a 5,000-token prompt that does twenty things adequately.
Validation Checklist
How do you know you got this right?
Performance Checks
- Base prompt produces correct output on 5+ test cases
- Few-shot examples improve accuracy by 5%+ (measured against baseline)
- Prompt tokens optimized: removed 20%+ filler without quality loss
- Output format enforced: JSON/XML parsing succeeds on 95%+ of responses
Implementation Checks
- System prompt written with all 4 layers: role, task, format, constraints
- Few-shot examples selected: diverse, representative, 1-3 examples chosen
- Chain-of-thought added (if reasoning-heavy task): step-by-step logic visible
- Constraints explicit: what NOT to do is clearly stated
- Tested on 3+ prompt variants: measured which performs best
- Compression applied: repeated instructions condensed, synonyms reduced
Integration Checks
- Prompt integrates with harness tool calling: model makes valid tool calls
- Output parsing works: JSON schema validation succeeds
- Memory integration: system prompt + working memory fit in context
- Error handling: malformed output caught and recovered gracefully
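For the last item, a common recovery tactic is to parse strictly first, then fall back to pulling the outermost brace pair out of any surrounding prose or markdown fences. A minimal sketch of that heuristic; it is a fallback, not a full JSON-repair routine:

```python
import json
import re

def extract_json(text: str):
    """Recover a JSON object from model output that may wrap it in prose
    or a markdown code fence. Tries strict parsing first, then falls back
    to the first-to-last brace span. Returns None if nothing parses."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None
```

If this returns None, the caller can retry the request, ideally with a reminder appended that the response must be raw JSON only.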
Common Failure Modes
- Few-shot examples too similar: Same pattern repeated; diversity matters
- Constraints contradictory: “Don’t hallucinate” + “creative output” incompatible
- Output format not enforced: Model adds prose around required JSON
- Chain-of-thought verbose: Multi-step reasoning bloats tokens without quality gain
- Prompt not versioned: Changes not tracked; can’t revert or measure impact
Sign-Off Criteria
- Tried 3+ prompt variants, measured difference with doc 16 metrics
- Baseline established: know the starting accuracy/cost/latency
- Improvements documented: measured impact of few-shot, CoT, compression
- Prompt version documented and pinned in config
- A/B test plan for monitoring drift in production
See Also
- Doc 05 (AI Agents): System prompt guides agent behavior within chosen framework
- Doc 14 (Advanced Patterns): Advanced prompting techniques (extended thinking, etc)
- Doc 16 (Evaluation & Benchmarking): Measure prompt impact on quality/cost/latency