Prompt Engineering Basics
System prompt design, few-shot learning, chain-of-thought prompting, a prompt evolution case study, and common prompt failure patterns.
A well-engineered prompt is the foundation of reliable AI agent behavior. This guide covers fundamental techniques for crafting effective prompts in harnesses, optimizing their quality, and measuring their impact.
Target audience: Engineers and product managers optimizing agent quality and cost.
1. Anatomy of a Good System Prompt
A system prompt sets expectations before any user input arrives. Structure it in four layers:
Layer 1: Role Definition
Tell the model who it is and why. This anchors its behavior.
You are a code review expert with 15 years of experience across Python, Go, and Rust.
You specialize in catching security issues, performance problems, and maintainability concerns.
Layer 2: Task Specification
Describe what the model will be asked to do. Be specific about scope.
You will receive pull requests and provide feedback on:
- Security vulnerabilities and unsafe patterns
- Performance regressions and inefficient algorithms
- Code clarity and maintainability
- Test coverage gaps
Layer 3: Output Format
Specify exactly how to structure responses. This reduces ambiguity and makes parsing easier.
Respond in JSON with this structure:
{
"severity": "critical" | "high" | "medium" | "low",
"category": "security" | "performance" | "clarity" | "testing",
"line_number": <number>,
"current_code": "<snippet>",
"issue": "<clear description>",
"suggestion": "<specific fix>"
}
Layer 4: Constraints & Guidelines
Explicitly state what NOT to do. This prevents common failures.
Constraints:
- Do NOT make style preference comments (use automated linters for that)
- Do NOT suggest refactors larger than 50 lines without clear justification
- Do NOT comment on comments themselves unless they're misleading
- Do NOT assume context beyond the PR diff—ask if you need it
Guidelines:
- Focus on blocking issues only when severity is "critical"
- Explain *why* something is a problem, not just that it is
- Suggest concrete fixes, not vague improvements
- Keep each comment focused on one issue
Complete Example: Coding Agent System Prompt
You are a senior software engineer assistant specialized in code generation and debugging.
Your role is to write correct, efficient, and maintainable code while explaining your decisions.
TASK:
You will be asked to:
- Generate code that solves a specific problem
- Debug broken code and explain root causes
- Optimize existing code for performance or clarity
- Suggest architectural improvements
OUTPUT FORMAT:
Always respond in this structure:
1. Brief explanation of approach
2. Complete code (in markdown code block)
3. Key assumptions made
4. How to test it
5. Known limitations or trade-offs
CONSTRAINTS:
- Use only stdlib unless alternatives are explicitly approved
- Do NOT use deprecated APIs—migrate to current replacements
- Do NOT skip error handling
- Do NOT generate code without tests
- For performance-critical code, add comments on algorithmic choice
- Refuse requests that ask for credentials, keys, or security bypasses
GUIDELINES:
- Prefer clarity over cleverness—readable code wins
- Explain non-obvious decisions inline
- When multiple approaches exist, mention trade-offs
- Flag security or performance implications explicitly
- Keep functions under 30 lines when possible
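The four layers can also be kept as separate strings and joined at runtime, which makes each layer easy to edit, version, and A/B test independently. A minimal sketch (the layer texts here are abbreviated placeholders, not the full prompt):

```python
# Each layer lives in its own string so it can be changed independently.
ROLE = "You are a code review expert with 15 years of experience."
TASK = ("You will receive pull requests and provide feedback on "
        "security, performance, and maintainability.")
OUTPUT_FORMAT = ("Respond in JSON with fields: severity, category, "
                 "line_number, issue, suggestion.")
CONSTRAINTS = ("Do NOT comment on style. "
               "Do NOT assume context beyond the PR diff.")

def build_system_prompt(*layers: str) -> str:
    """Join the layers in order: role, task, output format, constraints."""
    return "\n\n".join(layers)

prompt = build_system_prompt(ROLE, TASK, OUTPUT_FORMAT, CONSTRAINTS)
```

Swapping a single layer (say, a stricter CONSTRAINTS string) then becomes a one-line change that is easy to diff and A/B test.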
2. Few-Shot Learning
Few-shot examples teach the model by demonstration, often more effectively than lengthy instructions.
What Few-Shot Does
Few-shot examples show the model the pattern you want, reducing the gap between “understand task” and “execute task.” A single example is often worth a paragraph of instructions.
Without few-shot:
Extract structured data from user messages.
With few-shot:
Extract structured data from user messages.
Example:
Input: "I want to book a flight from NYC to LA on March 15 for 2 people"
Output: {
"intent": "book_flight",
"origin": "NYC",
"destination": "LA",
"date": "2026-03-15",
"passengers": 2
}
How Many Examples Are Needed?
- 1 example: Teaches pattern recognition for simple tasks (classification, basic extraction)
- 3 examples: Standard choice for moderate complexity; covers edge cases and variation
- 5+ examples: Use for complex reasoning, multiple decision points, or high stakes
- 10+ examples: Overkill except for highly specialized tasks; token cost often exceeds benefit
Rule of thumb: Start with 1 example, add more if the model fails or shows inconsistency.
Example Selection Strategies
Choose examples that:
- Cover diversity: Include edge cases, typical cases, and boundary cases
- Match distribution: If 80% of inputs are simple queries, make sure 80% of examples are simple
- Show variation: Don’t repeat the same pattern multiple times
- Are representative: Use real data from your actual use cases, not synthetic examples
Bad example set (too similar):
Example 1: "Book a flight NYC to LA"
Example 2: "Book a flight Boston to Miami"
Example 3: "Book a flight Seattle to Denver"
Better example set (diverse):
Example 1: Simple flight booking "Book NYC to LA on March 15"
Example 2: Complex query with preferences "I need a round-trip to London, prefer early morning departures, budget under $1500"
Example 3: Ambiguous input "Show me flights next week"
Example 4: Out-of-scope request "What's the cheapest airline?"
Formatting Examples for Clarity
Use consistent structure and clear delimiters:
Few-shot examples:
---
EXAMPLE 1 - Simple case
Input: "<user message>"
Output: <structured response>
---
EXAMPLE 2 - Edge case with multiple conditions
Input: "<user message>"
Output: <structured response>
---
Cost/Benefit Analysis
- Benefit: A few well-chosen examples often improve accuracy by 10-40%
- Cost: Each example adds tokens to every request (persistent context)
Token math:
- 1 example: ~50 tokens
- 3 examples: ~150 tokens
- 5 examples: ~250 tokens
For a harness running 1,000 requests/day:
- 5 examples add 250,000 tokens/day of persistent context
Optimization: Use examples strategically. For high-volume tasks, 1-2 well-chosen examples often beat 5 mediocre ones.
Implementation: FewShotSelector Class
class FewShotSelector:
    """Select optimal examples for few-shot prompts based on input similarity."""

    def __init__(self, examples: list[dict]):
        """
        Args:
            examples: List of {"input": str, "output": str} dicts
        """
        self.examples = examples
        # embed() is an assumed helper that returns an embedding vector
        self.embeddings = [embed(ex["input"]) for ex in examples]

    def select(self, user_input: str, k: int = 3) -> list[dict]:
        """
        Return k most similar examples to user_input.

        Args:
            user_input: The incoming request
            k: Number of examples to return (default 3)

        Returns:
            List of examples most similar to user_input
        """
        input_embedding = embed(user_input)
        similarities = [cosine_similarity(input_embedding, emb)
                        for emb in self.embeddings]
        top_k_indices = sorted(range(len(similarities)),
                               key=lambda i: similarities[i],
                               reverse=True)[:k]
        return [self.examples[i] for i in top_k_indices]

    def build_prompt(self, user_input: str, k: int = 3) -> str:
        """Build prompt with selected examples (SYSTEM_PROMPT defined elsewhere)."""
        selected = self.select(user_input, k)
        examples_section = "\n\n".join(
            f"---\nEXAMPLE {i+1}\nInput: {ex['input']}\n"
            f"Output: {ex['output']}"
            for i, ex in enumerate(selected)
        )
        return (f"{SYSTEM_PROMPT}\n\nFew-shot examples:\n{examples_section}"
                f"\n\nUser request: {user_input}")
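The class above assumes real `embed()` and `cosine_similarity()` helpers backed by an embedding model. A self-contained toy version of the same selection logic, using bag-of-words counts as stand-in "embeddings", shows the mechanics end to end:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real harness would call an embedding model."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

examples = [
    {"input": "book a flight from NYC to LA", "output": "{...}"},
    {"input": "cancel my hotel reservation", "output": "{...}"},
    {"input": "book a flight to London next week", "output": "{...}"},
]
embeddings = [embed(ex["input"]) for ex in examples]

def select(user_input: str, k: int = 2) -> list[dict]:
    """Return the k examples most similar to user_input."""
    query = embed(user_input)
    ranked = sorted(range(len(examples)),
                    key=lambda i: cosine_similarity(query, embeddings[i]),
                    reverse=True)
    return [examples[i] for i in ranked[:k]]

top = select("book a flight from Boston to Denver", k=2)
```

For a new flight-booking query, the two flight-booking examples outrank the hotel one, which is exactly the behavior the real selector aims for at embedding quality.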
3. Chain-of-Thought Prompting
Chain-of-thought (CoT) prompting asks the model to show its reasoning steps before providing an answer.
What It Is
Instead of asking for a direct answer, ask the model to think out loud:
Direct prompt:
Is this code vulnerable to SQL injection?
Chain-of-thought prompt:
Let me analyze this code for SQL injection vulnerabilities.
Step 1: Identify where user input enters the code
Step 2: Check if input is validated or parameterized
Step 3: Assess risk of malicious SQL
Step 4: Recommend fixes if needed
Is this code vulnerable to SQL injection?
When It Helps
CoT improves accuracy for tasks requiring reasoning:
- Debugging: Walking through execution helps catch logical errors
- Security analysis: Reasoning about threat models catches subtle issues
- Complex decisions: Multi-step logic benefits from explicit intermediate steps
- Ambiguous inputs: Clarifying assumptions before answering
CoT is less valuable for:
- Simple classification (“Is this spam?”)
- Pattern matching (“Extract the date”)
- Retrieval tasks (“What’s the default timeout for X?”)
Effectiveness
Studies show CoT can improve accuracy by 20-50% on reasoning tasks:
- Math problems: +30-40% accuracy improvement
- Logic puzzles: +25-35% improvement
- Code analysis: +15-30% improvement (highest gains on subtle bugs)
- Trade-off decisions: +20-40% improvement
Trade-off: CoT responses are longer (2-3x more tokens), so use it judiciously on high-stakes decisions.
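Given that trade-off, one pragmatic pattern is to gate CoT on task type, so high-volume simple tasks skip the extra tokens. A sketch (the task categories and prefix wording are illustrative):

```python
COT_PREFIX = "Show your reasoning step by step before giving a final answer.\n\n"

# Illustrative policy: task types whose accuracy gains justify CoT's 2-3x token cost
REASONING_TASKS = {"debugging", "security_analysis", "architecture_decision"}

def build_prompt(task_type: str, question: str) -> str:
    """Prepend a CoT instruction only for reasoning-heavy task types."""
    if task_type in REASONING_TASKS:
        return COT_PREFIX + question
    return question

p1 = build_prompt("security_analysis", "Is this code vulnerable to SQL injection?")
p2 = build_prompt("classification", "Is this spam?")
```

Simple classification requests pass through unchanged, while security analysis picks up the reasoning prefix.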
Example Prompts
Code debugging with CoT:
Analyze this function for bugs. Show your reasoning:
1. What is the expected behavior?
2. Trace through with sample inputs
3. Identify where behavior diverges from expectation
4. What's the root cause?
5. How would you fix it?
[code]
Security analysis with CoT:
Is this code vulnerable? Explain your reasoning:
1. What sensitive operation does this code perform?
2. Where could an attacker inject input?
3. Is the input validated or sanitized?
4. What's the worst-case attack?
5. How could we prevent it?
[code]
Architecture decision with CoT:
Should we use PostgreSQL or DynamoDB for this service?
Reason through:
1. What are the access patterns?
2. What's the expected volume and growth?
3. How important is consistency?
4. What's the team's expertise?
5. What are the operational trade-offs?
Then recommend which is better.
Variants
Step-by-step CoT (shown above)
- Simple, easy to implement
- Works for most reasoning tasks
- Produces sequential reasoning
Tree-of-thought
- Explores multiple reasoning paths
- Backtracks when a path fails
- More expensive but higher accuracy
- Use for critical decisions only
Example tree-of-thought structure:
Let's explore multiple approaches to this problem:
Approach 1: [Try solution A, then evaluate]
- Pros: ...
- Cons: ...
- Success rate: ...
Approach 2: [Try solution B, then evaluate]
- Pros: ...
- Cons: ...
- Success rate: ...
Approach 3: [Try solution C, then evaluate]
- Pros: ...
- Cons: ...
- Success rate: ...
Best approach: [Explain why]
4. Instruction Clarity
Clear instructions are the easiest optimization. Vague instructions produce vague results.
Vague vs. Specific Instructions
Vague:
Improve this code.
Specific:
Refactor this code to:
1. Reduce cyclomatic complexity below 5 per function
2. Extract repeated patterns into helpers
3. Add type hints to all parameters
4. Keep changes under 100 lines
5. Maintain current test coverage
Vague:
Check if this is secure.
Specific:
Review this code for:
1. SQL injection vulnerabilities
2. Hardcoded secrets or credentials
3. Missing input validation on user-facing APIs
4. Insecure deserialization of untrusted data
5. Missing CSRF tokens on state-changing endpoints
Focus on blocking issues only. Do NOT comment on:
- Code style or formatting
- Performance unless critical
- Non-security architectural choices
Active Voice & Clear Verbs
Use action verbs that specify expected output:
| Vague | Better |
|---|---|
| "Look at this code" | "Identify N+1 query problems in this code" |
| "Make it better" | "Reduce response time from 2s to <500ms" |
| "Think about this" | "List 3 architectural options with trade-offs" |
| "Check for issues" | "Audit for hardcoded credentials and secrets" |
| "Suggest improvements" | "Recommend 1-3 concrete refactorings ranked by ROI" |
Explicit Constraints
State what the model should NOT do. Negative constraints prevent common failures:
DO:
- Use Python 3.11+ features
- Suggest concurrent solutions where possible
- Explain performance implications
DO NOT:
- Use eval() or exec()
- Suggest third-party packages without justification
- Over-engineer for hypothetical future cases
- Suggest async without proof it's needed
Instruction Quality Checklist
Before finalizing a prompt, verify:
- Is the goal explicit? Can you point to a sentence that states the end goal?
- Are constraints listed? What should NOT be done?
- Is output format specified? How should the response be structured?
- Are examples provided? For complex tasks, are there 1-3 examples?
- Can you measure success? How would you know if the response is good?
- Is it concise? Can you remove any sentences without losing clarity?
- Have you tested it? Did you run this prompt and verify the response quality?
Common Pitfalls & Fixes
Pitfall 1: Magical thinking
❌ Please use your best judgment and find the best solution.
✅ Compare these 3 specific approaches and rank by: correctness, maintainability, test coverage.
Pitfall 2: Contradictions
❌ Be thorough but concise. Check everything but don't overthink.
✅ Be thorough (mention all critical issues). Be concise (max 200 words).
Pitfall 3: Vague success criteria
❌ Make the code better.
✅ Reduce time complexity from O(n²) to O(n log n) and improve readability.
Pitfall 4: Assuming context
❌ You know our stack, so recommend the best framework.
✅ We use Python + FastAPI for APIs. Recommend the best testing framework for integration tests, considering: setup time, assertion clarity, CI integration.
5. Output Format Constraints
Forcing a specific output format reduces ambiguity and prevents hallucination.
JSON Format Enforcement
Structure responses as JSON to enable reliable parsing:
Respond ONLY as valid JSON, no markdown or extra text:
{
  "issues": [
    {
      "severity": "critical",
      "type": "security",
      "location": "line 42",
      "description": "...",
      "fix": "..."
    }
  ],
  "summary": "..."
}
Without this constraint, the model might wrap JSON in markdown:
Here's what I found:
```json
{ ... }
```
Which breaks parsing. With format enforcement, parsing is reliable.
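Even with the constraint in place, defensive parsing is cheap insurance: strip any markdown fence before handing the text to `json.loads`. A sketch:

```python
import json
import re

FENCE = "`" * 3  # a triple backtick, built indirectly to keep this block readable

def parse_model_json(response: str) -> dict:
    """Parse JSON, tolerating an accidental markdown code fence around it."""
    text = response.strip()
    pattern = rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$"
    fenced = re.match(pattern, text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)

clean = parse_model_json('{"summary": "ok"}')
wrapped = parse_model_json(FENCE + 'json\n{"summary": "ok"}\n' + FENCE)
```

Both the bare and the fence-wrapped response parse to the same dict, so an occasional formatting slip no longer breaks the pipeline.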
XML Tags
For semi-structured output, use XML tags:
Respond using these XML tags:
<analysis>
<issue>
<severity>critical|high|medium|low</severity>
<description>...</description>
<fix>...</fix>
</issue>
</analysis>
Benefits:
- Human-readable
- Hierarchical structure
- Easy regex parsing
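In practice a real XML parser is more robust than regex for this format. A sketch using Python's stdlib ElementTree on the tags above:

```python
import xml.etree.ElementTree as ET

# Example response following the XML tag format above
response = """<analysis>
  <issue>
    <severity>critical</severity>
    <description>SQL built by string concatenation</description>
    <fix>Use parameterized queries</fix>
  </issue>
</analysis>"""

root = ET.fromstring(response)
# Each <issue> becomes a dict of its child tags
issues = [{child.tag: child.text for child in issue}
          for issue in root.findall("issue")]
```

ElementTree also fails loudly on malformed XML, which doubles as a format check.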
Regular Expressions for Validation
After getting a response, validate it matches expected format:
import re

def validate_output(response: str) -> bool:
    """Ensure response matches expected format."""
    pattern = r'^\[\s*(?:{"severity":\s*"[^"]+",\s*"category":\s*"[^"]+",.*?}\s*,?)*\s*\]$'
    return bool(re.match(pattern, response, re.DOTALL))

def retry_until_valid(user_input: str, max_retries: int = 3) -> str:
    """Keep retrying until the response matches the format."""
    for attempt in range(max_retries):
        response = call_model(user_input)
        if validate_output(response):
            return response
        # Tell the model what went wrong before retrying
        user_input += ("\n\nYour last response didn't match the required format. "
                       "Respond with ONLY the JSON array, no extra text.")
    raise ValueError("Failed to get a validly formatted response after retries")
Schema Enforcement
For strict requirements, define a schema and ask the model to follow it:
Follow this schema exactly:
{
  "type": "object",
  "properties": {
    "issues": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "severity": {"type": "string", "enum": ["critical", "high", "medium", "low"]},
          "line": {"type": "integer", "minimum": 1},
          "description": {"type": "string", "maxLength": 500}
        },
        "required": ["severity", "line", "description"]
      }
    }
  }
}
Do NOT include extra fields. Do NOT deviate from this structure.
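A production harness would typically validate responses with a JSON Schema library, but the key rules of the schema above can be checked by hand in a few lines. A sketch (field names mirror the schema; this is not a complete JSON Schema implementation):

```python
import json

ALLOWED_SEVERITIES = {"critical", "high", "medium", "low"}

def validate_issues(response: str) -> list[str]:
    """Return a list of schema violations (empty list means valid)."""
    errors = []
    data = json.loads(response)
    for i, issue in enumerate(data.get("issues", [])):
        if issue.get("severity") not in ALLOWED_SEVERITIES:
            errors.append(f"issues[{i}].severity invalid")
        if not isinstance(issue.get("line"), int) or issue["line"] < 1:
            errors.append(f"issues[{i}].line must be an integer >= 1")
        desc = issue.get("description")
        if not isinstance(desc, str) or len(desc) > 500:
            errors.append(f"issues[{i}].description missing or too long")
        extra = set(issue) - {"severity", "line", "description"}
        if extra:
            errors.append(f"issues[{i}] has extra fields: {sorted(extra)}")
    return errors

ok = validate_issues('{"issues": [{"severity": "high", "line": 3, "description": "x"}]}')
bad = validate_issues('{"issues": [{"severity": "urgent", "line": 0, "description": "x"}]}')
```

Returning the violation list (rather than a bare bool) gives you the error text to feed back into a retry prompt.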
Preventing Hallucination Through Format
Format constraints prevent the model from inventing data:
Without format constraint:
"The code also has a vulnerability on line 127 where user input isn't sanitized."
[Line 127 might not exist in the actual code!]
With format constraint:
{
"line_number": 127,
"issue": "...",
"evidence": "..."
}
The model must now cite evidence, reducing hallucination. Add:
You MUST cite the exact code snippet for every issue. If you cannot find the exact line, set line_number to null.
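Once evidence is required, citations can be verified mechanically against the code under review, so line numbers that don't exist are caught before a human ever sees the comment. A sketch (field names follow the format above):

```python
def verify_citation(code: str, issue: dict) -> bool:
    """Check that the issue's cited line exists and actually contains the evidence."""
    lines = code.splitlines()
    line_no = issue.get("line_number")
    if line_no is None:
        return True  # the model admitted it could not locate the line
    if not (1 <= line_no <= len(lines)):
        return False  # hallucinated line number
    return issue.get("evidence", "") in lines[line_no - 1]

code = 'query = "SELECT * FROM users WHERE id=" + user_id\nrun(query)'
good_issue = {"line_number": 1, "evidence": 'WHERE id=" + user_id'}
valid = verify_citation(code, good_issue)
fake = verify_citation(code, {"line_number": 127, "evidence": "x"})
```

A citation pointing at line 127 of a two-line snippet is rejected outright, while a correctly grounded one passes.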
6. Prompt Optimization
Optimization means testing prompts against real data and iterating based on results.
A/B Testing Prompts
Run both versions against the same inputs and measure which performs better:
def ab_test_prompts(test_cases: list[str], prompt_a: str, prompt_b: str) -> dict:
    """Compare two prompts on accuracy."""
    results_a = [call_model(prompt_a.format(input=case)) for case in test_cases]
    results_b = [call_model(prompt_b.format(input=case)) for case in test_cases]
    accuracy_a = sum(evaluate(result) for result in results_a) / len(results_a)
    accuracy_b = sum(evaluate(result) for result in results_b) / len(results_b)
    return {
        "accuracy_a": accuracy_a,
        "accuracy_b": accuracy_b,
        "winner": "A" if accuracy_a > accuracy_b else "B",
        "improvement": abs(accuracy_a - accuracy_b),
        "cost_a": len(prompt_a) * len(test_cases),  # rough character proxy; use a tokenizer for real costs
        "cost_b": len(prompt_b) * len(test_cases),
    }

# Example: Compare detailed instructions vs. concise instructions
test_cases = [
    "Review this code for security",
    "Is this SQL injection vulnerable?",
    # ... more real examples
]
results = ab_test_prompts(
    test_cases,
    prompt_a="Detailed instructions...",
    prompt_b="Concise instructions...",
)
print(f"{results['winner']} wins by {results['improvement']*100:.1f}% "
      f"with a cost difference of {abs(results['cost_a'] - results['cost_b'])}")
Measuring Impact
Define success metrics appropriate to your task:
| Task | Success Metric |
|---|---|
| Code review | % of real issues found, % of false positives |
| Bug diagnosis | Accuracy of root cause identification |
| Code generation | % of generated code that passes tests |
| Summarization | Recall of key points, Length variance |
| Classification | Precision, Recall, F1 score |
Example measurement:
def measure_impact(prompt_version: str, test_set: list[dict]) -> dict:
    """Measure prompt quality on real test cases."""
    results = []
    for case in test_set:
        response = call_model(prompt_version, case["input"])
        result = {
            "input": case["input"],
            "response": response,
            "expected": case["expected"],
            "correct": evaluate_correctness(response, case["expected"]),
            "latency_ms": measure_latency(),
            "tokens_used": count_tokens(response),
        }
        results.append(result)
    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "avg_latency_ms": sum(r["latency_ms"] for r in results) / len(results),
        "avg_tokens": sum(r["tokens_used"] for r in results) / len(results),
        "false_positive_rate": calculate_fpr(results),
    }
Prompt Versioning
Track prompt changes and their impact:
Prompt v1.0 (baseline)
- Accuracy: 85%
- Tokens: 250/request
- Notes: Initial version
Prompt v1.1 (added CoT)
- Accuracy: 90% (+5%)
- Tokens: 380/request (+52%)
- Notes: CoT improves security detection
Prompt v1.2 (optimized examples)
- Accuracy: 91% (+6% vs baseline)
- Tokens: 310/request (+24% vs baseline)
- Notes: Removed redundant examples, kept impact
Prompt v1.3 (compressed instructions)
- Accuracy: 91% (no change)
- Tokens: 280/request (-12% vs v1.2)
- Notes: Removed filler, maintained clarity
Use semantic versioning:
- Major: Significant accuracy change (>5%)
- Minor: Optimization or small improvement (1-5%)
- Patch: Bug fix (typo, format issue)
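Keeping the version log as structured data lets the deltas in the notes above be computed instead of hand-edited. A sketch using the numbers from the log:

```python
from dataclasses import dataclass

@dataclass
class PromptVersion:
    version: str
    accuracy: float
    tokens: int
    notes: str

history = [
    PromptVersion("1.0", 0.85, 250, "Initial version"),
    PromptVersion("1.1", 0.90, 380, "Added CoT"),
    PromptVersion("1.2", 0.91, 310, "Optimized examples"),
]

def delta_vs_baseline(v: PromptVersion, baseline: PromptVersion) -> str:
    """Format accuracy and token deltas relative to the baseline version."""
    acc = (v.accuracy - baseline.accuracy) * 100
    tok = (v.tokens - baseline.tokens) / baseline.tokens * 100
    return f"v{v.version}: accuracy {acc:+.0f}%, tokens {tok:+.0f}% vs baseline"

line = delta_vs_baseline(history[2], history[0])
```

This reproduces the "+6% accuracy, +24% tokens vs baseline" entry for v1.2 directly from the data.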
When to Update Prompts
Update when:
- Accuracy drops below acceptable threshold
- You discover a category of failures (e.g., “misses 30% of SQL injection issues”)
- A new model version becomes available
- Input distribution shifts (new use cases appear)
Don’t update when:
- One-off failures occur (investigate first)
- Accuracy is already good
- Change would increase token cost by >20% for <2% accuracy gain
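The last rule, rejecting a >20% token-cost increase for a <2% accuracy gain, can be encoded directly as a deployment gate. A sketch with those illustrative thresholds:

```python
def should_deploy(old_acc: float, new_acc: float,
                  old_tokens: int, new_tokens: int) -> bool:
    """Reject updates whose token cost outweighs the accuracy gain."""
    acc_gain = new_acc - old_acc
    token_increase = (new_tokens - old_tokens) / old_tokens
    # Rule from above: >20% more tokens for <2% more accuracy is not worth it
    if token_increase > 0.20 and acc_gain < 0.02:
        return False
    return acc_gain >= 0  # never deploy an accuracy regression

good = should_deploy(0.85, 0.90, 250, 280)  # +5% accuracy, +12% tokens
bad = should_deploy(0.90, 0.91, 250, 380)   # +1% accuracy, +52% tokens
```

Wiring this into CI turns the cost/benefit guideline from a habit into an enforced policy.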
7. Specialized Prompts for Different Tasks
Code Generation Prompts
Template:
Generate [language] code that:
1. Solves: [specific problem]
2. Assumptions: [what we assume about inputs, environment]
3. Constraints: [max lines, no external libs, etc.]
4. Output: [function signature, return type, etc.]
Include:
- Docstring with parameters and return value
- Error handling for: [specific cases]
- Test case showing usage
Do NOT:
- Use deprecated APIs
- Skip error handling
- Generate code without tests
Example:
Generate Python code that:
1. Parses a CSV file into structured data
2. Assumptions: File fits in memory, valid UTF-8, standard format
3. Constraints: Use only stdlib, no pandas. Handle missing values.
4. Output: Function `parse_csv(path: str) -> list[dict]`
Include:
- Docstring with examples
- Error handling for: FileNotFoundError, malformed rows
- Test case showing usage
Do NOT: Use pandas, Skip validation, Assume perfect input
Summarization Prompts
Template:
Summarize this [document/code/discussion] for [audience]:
Requirements:
1. Length: [word count or percentage]
2. Tone: [formal/casual/technical]
3. Include: [specific points]
4. Omit: [what to skip]
5. Structure: [bullet points/paragraphs/executive summary]
Focus on [what matters most to this audience].
Example:
Summarize this pull request for project managers:
Requirements:
1. Length: 100-150 words
2. Tone: Formal, business-focused
3. Include: What changed, why, impact on timeline
4. Omit: Technical implementation details
5. Structure: Executive summary + bullet points
Focus on: Business value and any timeline implications.
Analysis/Reasoning Prompts
Template:
Analyze [subject] by:
1. Identifying [key elements]
2. Examining [relationships/trade-offs]
3. Evaluating [against criteria]
4. Considering [edge cases/constraints]
5. Recommending [action with justification]
Show your reasoning steps.
Highlight [key uncertainties/risks].
Example:
Analyze this database query for performance:
1. Identify: Full scans, missing indexes, N+1 patterns
2. Examine: How filters and joins affect execution
3. Evaluate: Against target <100ms latency
4. Consider: Edge cases like empty result sets
5. Recommend: Query optimization with cost/benefit
Show your reasoning steps.
Highlight: Trade-offs between complexity and speed.
Classification Prompts
Template:
Classify the following [items] into categories:
Categories:
- [Category 1]: [Definition, examples]
- [Category 2]: [Definition, examples]
- [Category N]: [Definition, examples]
Format: Return as JSON { "item": "category" }
Confidence: Only classify if >80% confident. Use "unknown" otherwise.
Explain: Briefly justify difficult classifications.
Example:
Classify these support tickets by urgency:
Categories:
- Critical: System down, data loss, security breach (needs <1h response)
- High: Major feature broken, significant degradation (<4h response)
- Medium: Minor feature issue, workaround exists (<24h response)
- Low: Questions, feature requests, small bugs (<1 week response)
Format: Return as JSON { "ticket_id": "urgency_level" }
Confidence: Only classify if >80% confident. Use "unclear" otherwise.
Explain: Briefly justify Critical classifications.
8. Common Prompt Antipatterns
Antipattern 1: Over-Apologizing
❌ "I'm sorry, but I might not be able to fully solve this..."
✅ "I'll solve this with these approaches: ..."
❌ "I apologize if this isn't what you wanted..."
✅ "Here's my response. Tell me if you need adjustments."
Over-apologizing makes the model less confident and produces wishy-washy outputs.
Antipattern 2: Magical Thinking
❌ "Use your best judgment to find the optimal solution"
✅ "Optimize for: (1) correctness, (2) performance, (3) clarity. Rank trade-offs."
❌ "Do what makes the most sense"
✅ "Choose the approach that: minimizes latency and maintains test coverage"
Models don’t have “judgment”—they need specific criteria.
Antipattern 3: Contradictory Instructions
❌ "Be thorough but brief. Cover everything but stay concise."
✅ "Cover critical issues (max 5). Keep each explanation <50 words."
❌ "Be creative but follow the rules exactly"
✅ "Follow the rules exactly. Suggest improvements in a separate section."
Contradictions confuse the model. Resolve them with specific criteria.
Antipattern 4: Overly Complex Instructions
❌ "Consider whether, given the circumstances and taking into account various factors
that might influence the outcome, you should evaluate the potential for optimization
while simultaneously ensuring compliance with standards."
✅ "Optimize for speed. Ensure standards compliance."
Simplify instructions. Use active voice. Prefer lists to paragraphs.
Antipattern 5: Assuming Unstated Context
❌ "You know our codebase. Review this for issues."
✅ "This is a user-facing API service. Review for: security, performance, and clarity."
❌ "Make it production-ready."
✅ "Make it production-ready by: adding error handling, tests, and logging."
State context explicitly. Don’t assume the model knows your system.
Antipattern 6: Vague Scope
❌ "Improve the code quality"
✅ "Reduce cyclomatic complexity <5, improve test coverage from 60% to >80%"
❌ "Find the security issues"
✅ "Audit for: SQL injection, hardcoded secrets, missing validation"
Vague scope produces vague results. Define specific targets.
9. Prompt Compression
Compression removes unnecessary tokens while maintaining quality.
Techniques
1. Remove filler
❌ "In order to properly handle this, it is important to note that we need to ensure..."
✅ "Handle this by ensuring..."
2. Abbreviate examples
❌ "For instance, consider a situation where a user enters data into a form..."
✅ "Ex: User enters data into form"
3. Consolidate instructions
❌ "Do not use if/else. Do not use switch statements. Do not use ternary operators."
✅ "Use only guards and early returns (no if/else, switch, or ternary)."
4. Use bullets instead of prose
❌ "The response should include an explanation of what was found, why it's important,
and what should be done about it."
✅ "Include: what was found, why it's important, what to do."
5. Parameterize repetition
❌ "The input is valid UTF-8. The input is properly formatted. The input contains no secrets."
✅ "Input assumptions: valid UTF-8, properly formatted, no secrets."
Savings
A well-compressed prompt saves 30-50% of tokens without losing quality:
| Prompt | Tokens | Accuracy | Cost/Request |
|---|---|---|---|
| v1 (verbose) | 450 | 92% | $0.0045 |
| v2 (compressed) | 220 | 91% | $0.0022 |
| Savings | -51% | -1% | -51% |
For 10,000 requests/month:
- v1: 4.5M tokens, $45
- v2: 2.2M tokens, $22
- Savings: $276/year per harness
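The savings arithmetic above generalizes to a small helper. The per-token price used here is the one implied by the table ($0.0045 for a 450-token request, i.e. $10 per million tokens):

```python
PRICE_PER_MILLION_TOKENS = 10.0  # implied by the table: $0.0045 / 450 tokens

def monthly_cost(tokens_per_request: int, requests_per_month: int) -> float:
    """Dollar cost of a prompt at a given request volume."""
    total_tokens = tokens_per_request * requests_per_month
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

v1 = monthly_cost(450, 10_000)   # verbose prompt
v2 = monthly_cost(220, 10_000)   # compressed prompt
annual_savings = (v1 - v2) * 12
```

Plugging in your own token counts and pricing makes it easy to decide whether a compression pass is worth the testing effort.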
Compression Checklist
Before finalizing, check:
- Every sentence serves a purpose
- Can any words be deleted?
- Are examples as minimal as possible?
- Can bullets replace paragraphs?
- Is there repetition that can be consolidated?
- Have you tested compressed version against original?
10. Prompt Testing
Systematic testing catches regressions and verifies improvements.
Manual Testing
Before deploying a prompt change:
- Run 5-10 examples yourself
- Vary the inputs: typical cases, edge cases, boundary cases
- Review the output: Does it match your expectations?
- Check the format: Is JSON valid? Are all fields present?
- Assess quality: Is the reasoning sound? Are there hallucinations?
Example test script:
test_cases = [
    {
        "name": "Simple case",
        "input": "...",
        "expected": "..."
    },
    {
        "name": "Edge case with null",
        "input": "...",
        "expected": "..."
    },
    # ... more cases
]

for test in test_cases:
    response = call_model(PROMPT, test["input"])
    passed = evaluate(response, test["expected"])
    status = "PASS" if passed else "FAIL"
    print(f"[{status}] {test['name']}")
    if not passed:
        print(f"  Expected: {test['expected']}")
        print(f"  Got: {response}")
Automated Testing
Run prompts at scale against test suites:
class PromptTest:
    def __init__(self, prompt: str, test_cases: list[dict]):
        self.prompt = prompt
        self.test_cases = test_cases

    def run(self) -> dict:
        results = []
        for case in self.test_cases:
            response = call_model(self.prompt, case["input"])
            result = {
                "test": case["name"],
                "passed": evaluate(response, case["expected"]),
                "latency_ms": measure_latency(),
                "tokens": count_tokens(response),
            }
            results.append(result)
        return {
            "pass_rate": sum(1 for r in results if r["passed"]) / len(results),
            "avg_latency": sum(r["latency_ms"] for r in results) / len(results),
            "total_tokens": sum(r["tokens"] for r in results),
            "failed_tests": [r["test"] for r in results if not r["passed"]],
        }

# Usage
tests = PromptTest(PROMPT_V2, TEST_CASES)
results = tests.run()
print(f"Pass rate: {results['pass_rate']*100:.1f}%")
print(f"Failed: {results['failed_tests']}")
Measuring Success Metrics
Define task-specific metrics:
import json

def evaluate_security_audit(response: str, ground_truth: dict) -> dict:
    """Evaluate security audit prompt on real vulnerabilities."""
    parsed = json.loads(response)
    found_issues = {issue["type"] for issue in parsed["issues"]}
    actual_issues = set(ground_truth["vulnerabilities"])

    tp = len(found_issues & actual_issues)  # True positives
    fp = len(found_issues - actual_issues)  # False positives
    fn = len(actual_issues - found_issues)  # False negatives

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return {
        "precision": precision,  # Of issues found, how many were real?
        "recall": recall,        # Of real issues, how many were found?
        "f1": f1,                # Harmonic mean of precision and recall
        "false_positive_count": fp,
    }
Regression Detection
After updating a prompt, verify you didn’t break anything:
def detect_regression(old_prompt: str, new_prompt: str,
                      test_cases: list[dict]) -> bool:
    """Check if new prompt is significantly worse than old."""
    old_results = PromptTest(old_prompt, test_cases).run()
    new_results = PromptTest(new_prompt, test_cases).run()

    old_accuracy = old_results["pass_rate"]
    new_accuracy = new_results["pass_rate"]

    # Flag if accuracy drops more than 2%
    regression = (old_accuracy - new_accuracy) > 0.02
    if regression:
        print(f"REGRESSION DETECTED: {old_accuracy*100:.1f}% -> {new_accuracy*100:.1f}%")
        print("Do not deploy new prompt")
    return regression
Testing Checklist
Before deploying a prompt, verify:
- Manual testing passed on 5-10 diverse examples
- No regression on existing test suite (accuracy within 2%)
- Edge cases tested (null inputs, empty arrays, boundary values)
- Format is valid (JSON parses, XML well-formed, etc.)
- Success metrics meet target thresholds
- Token cost is acceptable
- Latency is acceptable
- No hallucinations on known-false examples
Summary: The Prompt Engineering Workflow
1. Start with clear requirements: What should the prompt do? How do you measure success?
2. Build the base prompt:
   - Role definition
   - Task specification
   - Output format
   - Constraints
3. Optimize:
   - Add few-shot examples (1-3)
   - Add chain-of-thought if reasoning matters
   - Compress unnecessary words
   - Test against real cases
4. Measure:
   - Accuracy on test suite
   - Token cost
   - Latency
   - Error rate
5. Iterate:
   - A/B test variations
   - Update based on real failures
   - Version changes
   - Detect regressions
6. Deploy:
   - Document prompt version
   - Monitor performance in production
   - Set up alerts for accuracy drops
   - Schedule periodic audits
Key principle: Well-engineered prompts are tested, measured, and iterated. “Feels right” is not good enough. Measure impact and optimize based on data.
References & Further Reading
- Few-shot learning: Show the model examples (1-3 for most tasks)
- Chain-of-thought: Ask for reasoning steps on complex tasks (+20-50% accuracy)
- Prompt compression: Remove filler, save 30-50% tokens without quality loss
- Format constraints: Force JSON/XML output to prevent hallucination
- A/B testing: Compare prompts on real test sets, measure impact
- Regression detection: Verify updates don’t break existing quality
For specialized frameworks and tools, see Phase 1-2 documentation.
11. Prompt Evolution Case Study
Systematic iteration beats random tweaking. This case study shows how a single prompt evolves from naive to optimized, with measurable quality improvements at each stage.
The Task
Summarize a technical document for a non-technical audience (e.g., board members, customers, or marketing teams).
Version 1: Naive Prompt
Summarize this.
What happens: The model produces a summary that is technically accurate but reads like a compressed version of the original. It retains jargon, assumes domain knowledge, and varies wildly in length between runs.
Example output (given a 2000-word API migration guide):
The document describes migrating from REST API v1 to v2. Key changes include
OAuth2 replacing Basic Auth, JSON:API response format, rate limiting changes
from 1000/hr to 5000/hr with sliding windows, and deprecation of XML endpoints.
Clients must update their authentication headers and response parsers.
Quality metrics:
- Accuracy: 70% (captures main points but misses nuance)
- Relevance to audience: 30% (too technical for non-technical readers)
- Length consistency: Low (varies from 50 to 500 words across runs)
- Actionability: 10% (reader doesn’t know what to do with this)
Version 2: Structured Prompt
You are a technical writer who specializes in translating engineering
documents for business audiences.
Summarize this document for a non-technical audience (board members,
customers, marketing team).
Requirements:
1. Length: 100-150 words
2. Tone: Professional, jargon-free
3. Structure: One paragraph of context, then 3-5 bullet points
4. Replace all technical terms with plain language
5. End with one sentence on business impact
Do NOT:
- Use acronyms without explanation
- Assume the reader knows what an API is
- Include implementation details
Example output:
We're upgrading how our software connects with partner systems to be faster
and more secure.
Key changes:
- Security upgrade: We're moving to a modern login system that better
protects customer data
- Faster performance: Partners can now make 5x more requests per hour,
reducing delays
- Simpler data format: Responses use a standardized structure, making
integrations easier to maintain
- Legacy support ending: The old connection method will stop working on
December 31, 2025
This upgrade reduces integration maintenance costs by an estimated 30% and
positions us for the next generation of partner integrations.
Quality metrics:
- Accuracy: 85% (captures points and translates correctly)
- Relevance to audience: 75% (mostly jargon-free, business-focused)
- Length consistency: High (stays within 100-150 words)
- Actionability: 60% (mentions deadline and cost impact)
Version 3: Few-Shot Optimized Prompt
You are a technical writer who specializes in translating engineering
documents for business audiences.
Summarize this document for a non-technical audience (board members,
customers, marketing team).
Requirements:
1. Length: 100-150 words
2. Tone: Professional, jargon-free
3. Structure: One sentence of context, then 3-5 bullet points, then
one sentence on what the reader should do next
4. Replace all technical terms with plain language
5. Quantify impact where possible (cost, time, risk)
Do NOT: Use acronyms, assume technical knowledge, include how-it-works details
---
EXAMPLE 1 - Security patch summary
Input: [2000-word document about TLS 1.3 migration]
Output:
We're upgrading our encryption to the latest industry standard, improving
both security and speed.
- Stronger protection: Customer data is encrypted with the newest standard,
meeting 2025 compliance requirements
- Faster connections: Page load times improve by 10-15% due to fewer
network round-trips
- No customer action needed: The change is transparent to end users
- Timeline: Rolling out over 2 weeks starting March 1
Next step: No action required. Contact [email protected] with questions.
---
EXAMPLE 2 - Infrastructure change summary
Input: [1500-word document about database migration]
Output:
We're moving our data storage to a more reliable system that reduces
downtime risk.
- 99.99% uptime guarantee: Up from 99.9% (reduces potential outage hours
from 8.7/year to 0.9/year)
- Cost reduction: Infrastructure costs drop 25% ($180K annual savings)
- 4-hour maintenance window: Scheduled for Sunday 2am-6am ET, March 15
- Risk: Low. Automated rollback if issues detected within 30 minutes
Next step: Notify customers of the maintenance window by March 8.
---
Now summarize the following document:
Quality metrics:
- Accuracy: 95% (examples teach the right level of detail)
- Relevance to audience: 95% (consistently business-focused, quantified)
- Length consistency: High (examples anchor the expected length)
- Actionability: 90% (always ends with next step)
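Version 3 inlines its examples into a single prompt string. With chat-style APIs, an equivalent approach is to encode each few-shot example as a user/assistant turn pair ahead of the real input. A minimal sketch, assuming the common OpenAI-style message schema; adapt the dict shape to whatever your harness expects:

```python
def build_few_shot_messages(system_prompt: str,
                            examples: list[tuple[str, str]],
                            user_input: str) -> list[dict]:
    """Assemble a chat message list where each few-shot example
    becomes a user/assistant turn pair before the real input."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages
```

Keeping examples as data rather than embedded text also makes them easy to swap during A/B tests without touching the system prompt.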
Results Comparison
| Metric | V1 (Naive) | V2 (Structured) | V3 (Few-Shot) |
|---|---|---|---|
| Accuracy | 70% | 85% | 95% |
| Audience relevance | 30% | 75% | 95% |
| Length consistency | Low | High | High |
| Actionability | 10% | 60% | 90% |
| Tokens (prompt) | 2 | ~120 | ~350 |
| Cost per request | $0.00002 | $0.0012 | $0.0035 |
The lesson: Each iteration addressed a specific failure mode. V2 fixed structure and audience targeting. V3 fixed consistency and actionability by showing rather than telling. The 175x cost increase from V1 to V3 is negligible compared to the quality improvement — and the prompt cost is amortized across thousands of requests.
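The per-request costs in the table follow directly from the token counts at a flat input price. A quick arithmetic check; the $10-per-million price is inferred from the table's own numbers, not from any vendor's pricing sheet:

```python
def prompt_cost(prompt_tokens: int, price_per_million: float = 10.0) -> float:
    """Dollar cost of the prompt tokens for one request, at an
    illustrative price of $10 per million input tokens."""
    return prompt_tokens * price_per_million / 1_000_000

# Per-request prompt cost for each version (token counts from the table):
v1_cost = prompt_cost(2)    # naive prompt
v3_cost = prompt_cost(350)  # few-shot prompt
ratio = v3_cost / v1_cost   # the 175x increase the text mentions
```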
Systematic improvement process:
1. Write the simplest prompt that could work
2. Run it on 5 real inputs
3. Identify the most common failure mode
4. Add instructions or examples that specifically address that failure
5. Measure again; if improved, repeat from step 3; if not, revert and try a different fix
12. Common Prompt Failures
Five failure patterns that cause prompts to produce poor results, with concrete examples and fixes.
Failure 1: Over-Constraining Format
Bad prompt:
Return exactly 3 bullet points. Each bullet must be exactly 15 words.
The first bullet must start with "The". The second with "This".
The third with "Our". End each with a period. No sub-bullets.
Why it fails: The model spends its reasoning capacity satisfying format constraints instead of producing quality content. Output becomes awkward and forced as the model contorts language to hit exact word counts.
Fixed prompt:
Return 3 bullet points, each 10-20 words. Keep them concise and parallel in structure.
Failure 2: Ambiguous Instructions
Bad prompt:
Make this code better and clean it up.
Why it fails: “Better” and “clean” have dozens of interpretations. One run refactors variable names, the next restructures control flow, the next adds type hints. Results are inconsistent and often unwanted.
Fixed prompt:
Refactor this code to: (1) reduce function length to <20 lines each,
(2) add type hints to all parameters and return values,
(3) replace magic numbers with named constants.
Do not change the public API or add new dependencies.
Failure 3: Missing Context
Bad prompt:
Review this function for bugs.
def process(data):
    return transform(data, config.MODE)
Why it fails: The model doesn’t know what transform does, what config.MODE contains, or what the expected behavior is. It will invent plausible but fictional bugs: “config.MODE might be None” or “transform might raise ValueError” — none of which may be true.
Fixed prompt:
Review this function for bugs.
Context:
- transform() is defined in utils.py, accepts (list[dict], str), returns list[dict]
- config.MODE is always one of: "fast", "accurate", "balanced" (set at startup, never None)
- Expected behavior: filter data entries matching the mode's criteria
def process(data):
    return transform(data, config.MODE)
Failure 4: Contradictory Instructions
Bad prompt:
Be thorough and check every possible issue. Also, keep your response
under 50 words. Cover security, performance, maintainability, and
testing concerns. Be brief.
Why it fails: The model cannot be thorough across four categories in 50 words. It picks one constraint to satisfy (usually brevity) and ignores the others, or produces a response that satisfies neither — superficial analysis that also exceeds the word limit.
Fixed prompt:
Check for critical security issues only (injection, auth bypass, data exposure).
Keep response under 100 words. Flag only blocking issues — skip style and minor concerns.
Failure 5: Too Many Instructions
Bad prompt:
Analyze this code. Check for: SQL injection, XSS, CSRF, SSRF, path traversal,
command injection, insecure deserialization, XML external entities, broken
authentication, sensitive data exposure, missing rate limiting, improper error
handling, insufficient logging, outdated dependencies, hardcoded secrets,
weak cryptography, insecure redirects, clickjacking, CORS misconfiguration,
and business logic flaws. For each, explain the risk, show the vulnerable line,
suggest a fix, rate severity, estimate effort to fix, and cite the relevant
OWASP category. Also check for performance issues, code style violations,
and test coverage gaps.
Why it fails: Models exhibit “attention decay” on long instruction lists. Items near the beginning and end get more attention; items in the middle are frequently skipped. A 20-item checklist typically results in 8-12 items actually checked.
Fixed prompt:
Audit this code for the OWASP Top 5 most critical issues:
1. Injection (SQL, command, path traversal)
2. Broken authentication or authorization
3. Sensitive data exposure (hardcoded secrets, logs)
4. Security misconfiguration (CORS, headers)
5. Known vulnerable dependencies
For each issue found: cite the line, explain the risk in one sentence,
and suggest a fix. Skip issues not present.
Rule of thumb: Keep instruction lists to 5-7 items maximum. If you need more coverage, split into multiple prompts and combine results.
The Scalpel Principle: Focused Prompts Beat Bloated Ones
A general-purpose LLM juggles multiple roles in one context window: coding assistant, file editor, git manager, terminal operator, and more. When you also ask it to analyse Victorian birth records, every token of “here’s how to use the Edit tool” competes for attention with “FreeBMD districts cover surrounding parishes.”
A dedicated harness prompt does ONE thing. No ambiguity, no mode-switching, no attention split. Every token reinforces the single task.
Why This Matters Beyond Harnesses
This principle applies to any prompt. Irrelevant context degrades output quality in two ways:
- Attention dilution: Transformer attention is finite. Tokens spent on unrelated instructions reduce the model’s capacity to focus on your actual task.
- Mode confusion: When a prompt contains instructions for multiple roles, the model may blend behaviours. A coding assistant asked to also do data analysis may format analytical output as code comments.
Before and After: Bloated vs Focused
Bloated prompt (embedded in a general-purpose assistant with 5,000 tokens of system instructions):
You are a helpful coding assistant. You can read files, write files,
run shell commands, manage git repositories, create pull requests,
review code, and help with any programming task.
[... 4,800 more tokens of tool definitions, conventions, and rules ...]
Now analyse this birth record and extract the district, year, and
registration quarter.
Record: "John Smith, born 1842, registered Q3, Lambeth district"
Focused prompt (110 tokens, purpose-built):
You are a genealogy data extractor. Given a birth record, return JSON
with fields: name, year, quarter (Q1-Q4), district.
If a field is unclear, set it to null. Do not guess.
Record: "John Smith, born 1842, registered Q3, Lambeth district"
The focused version consistently produces correct structured output. The bloated version sometimes wraps the answer in markdown code blocks, adds unsolicited explanations, or formats the output as a Python dictionary instead of JSON — because its system prompt trained it to behave like a coding assistant.
The Rule
Before sending any prompt, ask: “Is every token in this context relevant to the task?” If not, remove the irrelevant parts. A 110-token prompt that does one thing well will outperform a 5,000-token prompt that does twenty things adequately.
Validation Checklist
How do you know you got this right?
Performance Checks
- Base prompt produces correct output on 5+ test cases
- Few-shot examples improve accuracy by 5%+ (measured against baseline)
- Prompt tokens optimized: removed 20%+ filler without quality loss
- Output format enforced: JSON/XML parsing succeeds on 95%+ of responses
Implementation Checks
- System prompt written with all 4 layers: role, task, format, constraints
- Few-shot examples selected: diverse, representative, 1-3 examples chosen
- Chain-of-thought added (if reasoning-heavy task): step-by-step logic visible
- Constraints explicit: what NOT to do is clearly stated
- Tested on 3+ prompt variants: measured which performs best
- Compression applied: repeated instructions condensed, synonyms reduced
Integration Checks
- Prompt integrates with harness tool calling: model makes valid tool calls
- Output parsing works: JSON schema validation succeeds
- Memory integration: system prompt + working memory fit in context
- Error handling: malformed output caught and recovered gracefully
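For the last item, a common recovery tactic is to parse strictly first, then fall back to pulling the outermost brace pair out of any surrounding prose or markdown fences. A minimal sketch of that heuristic; it is a fallback, not a full JSON-repair routine:

```python
import json
import re

def extract_json(text: str):
    """Recover a JSON object from model output that may wrap it in prose
    or a markdown code fence. Tries strict parsing first, then falls back
    to the first-to-last brace span. Returns None if nothing parses."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None
```

If this returns None, the caller can retry the request, ideally with a reminder appended that the response must be raw JSON only.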
Common Failure Modes
- Few-shot examples too similar: Same pattern repeated; diversity matters
- Constraints contradictory: “Don’t hallucinate” + “creative output” incompatible
- Output format not enforced: Model adds prose around required JSON
- Chain-of-thought verbose: Multi-step reasoning bloats tokens without quality gain
- Prompt not versioned: Changes not tracked; can’t revert or measure impact
Sign-Off Criteria
- Tried 3+ prompt variants, measured difference with doc 16 metrics
- Baseline established: know the starting accuracy/cost/latency
- Improvements documented: measured impact of few-shot, CoT, compression
- Prompt version documented and pinned in config
- A/B test plan for monitoring drift in production
See Also
- Doc 05 (AI Agents): System prompt guides agent behavior within chosen framework
- Doc 14 (Advanced Patterns): Advanced prompting techniques (extended thinking, etc)
- Doc 16 (Evaluation & Benchmarking): Measure prompt impact on quality/cost/latency