Testing & Quality Assurance
Testing non-deterministic AI systems — success rate measurement, regression detection, acceptance criteria templates, and statistical significance.
Introduction: Why Testing Harnesses Is Different
Traditional software testing assumes determinism: run the same input, get the same output every time. AI harnesses violate this assumption.
The fundamental challenge: An LLM producing different outputs each run means your tests need to measure statistical success, not binary pass/fail.
Example:
- Traditional test: “Given X, output is exactly Y” ✅ or ❌
- Harness test: “Given X, output meets criteria C 94% of time” — and that 94% is the metric
This document covers testing strategies for non-deterministic systems, quality metrics that matter, regression detection patterns, and a complete pre-deployment checklist.
Part 1: Testing Non-Deterministic Systems
The Core Challenge: Stochasticity
LLMs use temperature/sampling to generate variety. This is desirable for creative tasks but complicates testing:
# Traditional software test (deterministic)
def test_add():
    assert add(2, 3) == 5  # Always true

# LLM test (non-deterministic) - WRONG approach
def test_agent_finds_bug():
    result = agent.run("Find the bug in this code")
    assert "bug found" in result  # Fails ~10% of the time (non-deterministic)
Solution: Collect statistics over multiple runs.
# Correct: Test the success rate, not individual runs
def test_agent_success_rate():
    runs = 50
    successes = 0
    for _ in range(runs):
        result = agent.run("Find the bug in this code")
        if "bug found" in result:
            successes += 1
    success_rate = successes / runs
    assert success_rate >= 0.90, f"Expected ≥90%, got {success_rate:.1%}"

# Result: Agent succeeds 48/50 = 96% ✅ (within budget)
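A 96% rate measured over 50 runs still carries sampling noise, and the document's subtitle promises statistical significance. A quick way to see how much to trust the number is a Wilson score confidence interval; the helper below is an illustrative stdlib-only sketch (the function name is ours, not a library API):

```python
import math

def wilson_interval(successes, runs, z=1.96):
    """Approximate 95% Wilson score interval for a binomial success rate."""
    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return center - half, center + half

lo, hi = wilson_interval(48, 50)
# 48/50 looks like 96%, but the 95% interval spans roughly 86% to 99%
```

In other words, with 50 runs you can distinguish "about 90%" from "about 60%", but not 96% from 92%; raise `runs` when thresholds are tight.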
Multiple-Run Testing Strategy
Instead of one pass/fail per test, run N times and measure:
| Metric | Interpretation | Threshold |
|---|---|---|
| Success rate | Percentage of runs that achieve goal | ≥90% for production |
| Mean latency | Average time per run | <10s for interactive agents |
| Latency p95 | 95th percentile (worst-case performance) | <30s for batch agents |
| Cost per success | $ spent divided by successful runs | Budget-dependent |
| Failure modes | Why did runs fail? (categories) | <5% unexplained |
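The table's latency rows can be computed directly with the stdlib rather than the rough `sorted(latencies)[int(len * 0.95)]` nearest-rank index used in the examples below; `statistics.quantiles` interpolates between samples. A small helper (the name `latency_summary` is illustrative):

```python
import statistics

def latency_summary(latencies):
    """Mean, p50, and p95 from raw per-run latencies (needs >= 2 samples)."""
    cuts = statistics.quantiles(latencies, n=100)  # 99 interpolated cut points
    return {
        "mean": statistics.fmean(latencies),
        "p50": statistics.median(latencies),
        "p95": cuts[94],  # the 95th percentile cut point
    }

stats = latency_summary([1.0] * 95 + [30.0] * 5)
```

With 95 fast runs and 5 slow ones, the mean stays low while p95 jumps toward the slow tail, which is exactly why the table tracks both.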
Setting Up Multiple-Run Tests
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunResult:
    success: bool
    latency_seconds: float
    cost: float
    error_message: Optional[str] = None

class AgentTestSuite:
    def __init__(self, agent, num_runs=50):
        self.agent = agent
        self.num_runs = num_runs

    def run_test(self, task_description, validation_fn):
        """
        Run task N times, validate each with validation_fn.

        Args:
            task_description: Prompt to give agent
            validation_fn: Function that returns True if output is correct

        Returns:
            List of RunResult objects
        """
        results = []
        for _ in range(self.num_runs):
            start = time.time()
            try:
                output = self.agent.run(task_description)
                success = validation_fn(output)
                latency = time.time() - start
                cost = self.agent.last_token_cost()
                results.append(RunResult(
                    success=success,
                    latency_seconds=latency,
                    cost=cost,
                    error_message=None if success else "Validation failed"
                ))
            except Exception as e:
                latency = time.time() - start
                results.append(RunResult(
                    success=False,
                    latency_seconds=latency,
                    cost=self.agent.last_token_cost(),
                    error_message=str(e)
                ))
        return results

    def analyze_results(self, results, test_name):
        """Print summary statistics"""
        successes = sum(1 for r in results if r.success)
        success_rate = successes / len(results)
        latencies = [r.latency_seconds for r in results]
        costs = [r.cost for r in results]

        print(f"\n=== {test_name} ===")
        print(f"Success rate: {successes}/{len(results)} = {success_rate:.1%}")
        print(f"Mean latency: {sum(latencies) / len(latencies):.2f}s")
        print(f"P95 latency: {sorted(latencies)[int(len(latencies) * 0.95)]:.2f}s")
        print(f"Total cost: ${sum(costs):.2f}")
        if successes:
            print(f"Cost per success: ${sum(costs) / successes:.4f}")

        # Failure analysis
        failures_by_reason = {}
        for r in results:
            if not r.success:
                reason = r.error_message or "Unknown"
                failures_by_reason[reason] = failures_by_reason.get(reason, 0) + 1
        if failures_by_reason:
            print("Failures by reason:")
            for reason, count in failures_by_reason.items():
                print(f"  {reason}: {count}")

        return {
            'success_rate': success_rate,
            'mean_latency': sum(latencies) / len(latencies),
            'p95_latency': sorted(latencies)[int(len(latencies) * 0.95)],
            'total_cost': sum(costs),
            'cost_per_success': sum(costs) / successes if successes > 0 else float('inf')
        }
# Usage example
def test_code_review_agent():
    suite = AgentTestSuite(agent=my_code_reviewer, num_runs=50)

    # Test: Agent finds bugs in intentionally broken code
    test_code = """
def calculate_discount(price, discount):
    return price + (price * discount)  # Bug: adds the discount instead of subtracting it
"""
    results = suite.run_test(
        task_description=f"Find bugs in this code:\n{test_code}",
        validation_fn=lambda output: "discount" in output.lower() and "wrong" in output.lower()
    )
    metrics = suite.analyze_results(results, "Code Review - Bug Detection")

    # Assert success rate meets threshold
    assert metrics['success_rate'] >= 0.85, f"Success rate {metrics['success_rate']:.1%} below 85% threshold"
    assert metrics['mean_latency'] < 10.0, f"Mean latency {metrics['mean_latency']:.1f}s exceeds 10s"
Temperature Management: When to Fix Randomness
By default, LLM sampling is random. Sometimes you want reproducibility:
| Scenario | Temperature Setting | Why |
|---|---|---|
| Production agent | 0.7 (default) | Good balance: some variety, mostly coherent |
| Debugging failure case | 0.0 (deterministic) | Reproduce exact failure reliably |
| Stress test | 0.3–0.5 | Controlled variety, less variance |
| Creative tasks | 0.9+ | Maximum diversity (brainstorm) |
# Reproduce a specific failure
def test_agent_specific_failure():
    """When an agent fails, set temp=0.0 to reproduce it (near-)deterministically"""
    agent.set_temperature(0.0)  # Note: most APIs still don't guarantee bit-identical outputs at 0.0
    result = agent.run("Reproduce the failure case")
    agent.set_temperature(0.7)  # Reset to normal
    # Now inspect what went wrong
    assert validate(result), f"Failure reproduced: {result}"

# Stress test with controlled randomness
def test_agent_stability():
    """Test with reduced variance to detect instability"""
    agent.set_temperature(0.4)
    results = []
    for _ in range(20):
        output = agent.run("task")
        results.append(output)
    agent.set_temperature(0.7)  # Reset

    # Check: Are outputs mostly consistent?
    unique_outputs = len(set(results))
    print(f"Unique outputs from 20 runs: {unique_outputs}")
Part 2: Unit Testing Agents & Tools
Testing Individual Tools in Isolation
Tools are the most testable component. Unit test them independently of the agent:
import pytest
from unittest.mock import patch, MagicMock

class TestWebSearchTool:
    """Test web search tool without involving the agent"""

    def test_search_success(self):
        """Normal case: search returns results"""
        from harness.tools import web_search
        results = web_search("Python list comprehension")
        assert len(results) > 0
        assert all('title' in r and 'url' in r for r in results)

    def test_search_empty_results(self):
        """Edge case: unusual query returns few results"""
        from harness.tools import web_search
        results = web_search("xyzabc1234randomquery")
        assert isinstance(results, list)
        # Don't assert len > 0; empty results are valid

    def test_search_timeout(self):
        """Failure case: network timeout"""
        from harness.tools import web_search
        with patch('requests.get', side_effect=TimeoutError("Connection timeout")):
            with pytest.raises(Exception) as exc_info:
                web_search("query")
            assert "timeout" in str(exc_info.value).lower()

    def test_search_input_validation(self):
        """Security: reject dangerous inputs"""
        from harness.tools import web_search
        # Should sanitize/reject HTML injection attempts
        dangerous_query = "<script>alert('xss')</script>"
        result = web_search(dangerous_query)
        # Either raises or sanitizes
        assert isinstance(result, list)

class TestCodeExecutionTool:
    """Test code execution safely in sandbox"""

    def test_execute_valid_python(self):
        """Normal case: execute safe Python"""
        from harness.tools import execute_code
        output = execute_code("print(2 + 2)")
        assert "4" in output

    def test_execute_timeout(self):
        """Failure case: infinite loop (timeout)"""
        from harness.tools import execute_code
        with pytest.raises(TimeoutError):
            execute_code("while True: pass", timeout=2)

    def test_execute_blocked_operations(self):
        """Security: prevent file deletion"""
        from harness.tools import execute_code
        with pytest.raises(PermissionError):
            execute_code("import os; os.remove('/etc/passwd')")

    def test_execute_with_syntax_error(self):
        """Error handling: Python syntax error"""
        from harness.tools import execute_code
        with pytest.raises(SyntaxError):
            execute_code("print(this is broken")

# Run all tool tests
if __name__ == "__main__":
    pytest.main([__file__, "-v"])
Testing Tool Integration
Now test how tools work together:
class TestToolIntegration:
    """Test tools in combination without the full agent"""

    def test_search_then_parse(self):
        """Pipeline: search → parse results → validate"""
        from harness.tools import web_search, parse_html
        # Search for Python docs
        results = web_search("Python documentation")
        assert len(results) > 0
        # Parse first result
        first_url = results[0]['url']
        content = parse_html(first_url)
        assert len(content) > 0
        assert "python" in content.lower()

    def test_read_file_then_execute_code(self):
        """Pipeline: read Python file → execute it"""
        import os
        import tempfile
        from harness.tools import read_file, execute_code

        # Create temp Python file
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
            f.write("result = 'success'\nprint(result)")
            temp_path = f.name
        try:
            # Read and execute
            code = read_file(temp_path)
            output = execute_code(code)
            assert "success" in output
        finally:
            os.unlink(temp_path)
Mocking the LLM for Tool Testing
When you want to test tool handling without paying for API calls:
from unittest.mock import MagicMock

class TestToolHandling:
    """Test how agent handles tool calls, without real LLM"""

    def test_agent_calls_search_when_asked(self):
        """Agent should call web search for information retrieval"""
        # Mock the LLM to always request web search
        mock_llm = MagicMock()
        mock_llm.generate.return_value = {
            'tool_calls': [
                {
                    'tool': 'web_search',
                    'args': {'query': 'Python async await'},
                    'id': 'call_1'
                }
            ]
        }
        agent = MyAgent(llm=mock_llm)
        result = agent.run("How does async/await work?")

        # Verify agent called the tool
        assert mock_llm.generate.called
        tool_call = mock_llm.generate.return_value['tool_calls'][0]
        assert tool_call['tool'] == 'web_search'
        assert 'async' in tool_call['args']['query'].lower()

    def test_agent_retries_failed_tool_call(self):
        """Agent should retry when tool fails"""
        mock_tool = MagicMock(side_effect=[
            Exception("Network error"),  # First call fails
            "Success"                    # Second call succeeds
        ])
        agent = MyAgent(tools={'search': mock_tool})
        result = agent.run_with_retry("test", max_retries=2)
        assert mock_tool.call_count == 2
        assert "Success" in result
Part 3: Integration Testing
End-to-End Agent Testing
Test the complete agent workflow:
class TestAgentEndToEnd:
    """Complete agent workflow: setup → run → verify"""

    def test_agent_solves_coding_task(self):
        """E2E: Agent writes code to solve a specific problem"""
        agent = MyCodeWritingAgent()
        agent.setup()  # Initialize memory, tools, etc.

        # Give agent a task
        result = agent.run(
            "Write a Python function that finds all prime numbers up to 100"
        )

        # Verify result is valid Python
        assert "def " in result
        assert "prime" in result.lower()

        # Try executing the code (in sandbox)
        from harness.tools import execute_code
        output = execute_code(result + "\nprint(find_primes(20))")

        # Parse output and verify correctness (literal_eval is safer than eval on model output)
        import ast
        primes = ast.literal_eval(output.strip())
        expected = [2, 3, 5, 7, 11, 13, 17, 19]
        assert primes == expected

    def test_agent_multi_tool_workflow(self):
        """E2E: Agent uses multiple tools in sequence"""
        agent = MyResearchAgent()

        # Task requiring search → parsing → code execution
        result = agent.run(
            "Research how to implement quicksort, write code, test it"
        )

        # Verify workflow:
        # 1. Agent searched (check memory/logs)
        # 2. Agent wrote code
        # 3. Agent ran tests
        assert len(agent.memory['searches']) > 0
        assert len(agent.memory['code_written']) > 0
        assert agent.memory['test_results']['success']
Memory Persistence Testing
Verify agent remembers across sessions:
class TestMemoryPersistence:
    """Test that agent state survives restart"""

    def test_session_continuity(self):
        """Agent remembers facts across sessions"""
        import shutil
        import tempfile
        temp_dir = tempfile.mkdtemp()
        try:
            # Session 1: Tell agent something
            agent1 = MyAgent(workspace=temp_dir)
            agent1.setup()
            agent1.run("Remember that my favorite color is blue")
            agent1.save()

            # Session 2: Agent should recall
            agent2 = MyAgent(workspace=temp_dir)
            agent2.load()
            response = agent2.run("What is my favorite color?")
            assert "blue" in response.lower()
        finally:
            shutil.rmtree(temp_dir)

    def test_memory_corruption_recovery(self):
        """Corrupted memory doesn't crash agent"""
        import tempfile
        temp_dir = tempfile.mkdtemp()
        memory_file = f"{temp_dir}/memory.json"

        # Write corrupt JSON
        with open(memory_file, 'w') as f:
            f.write("{ invalid json")

        agent = MyAgent(workspace=temp_dir)
        # Should handle gracefully
        agent.load()   # Should not crash
        agent.setup()  # Should initialize with defaults

        # Should still work
        result = agent.run("test")
        assert result is not None
Failure Scenario Testing
Test how agent handles problems:
class TestFailureScenarios:
    """Test agent resilience to failures"""

    def test_tool_timeout(self):
        """Tool times out; agent should handle gracefully"""
        with patch('requests.get', side_effect=TimeoutError("timeout")):
            agent = MyAgent()
            # Should handle timeout without crashing
            result = agent.run("Search for something")
            # Either falls back or retries
            assert result is not None or agent.error_log[-1]['type'] == 'timeout'

    def test_insufficient_context_window(self):
        """Context window filled; agent should summarize"""
        agent = MyAgent(context_limit=2000)  # Very small limit

        # Add lots of context
        large_text = "x" * 10000
        agent.add_to_context(large_text)

        # Agent should trim/summarize, not crash
        result = agent.run("What is the task?")
        assert result is not None

    def test_budget_exceeded(self):
        """Agent exceeds cost budget; should stop gracefully"""
        agent = MyAgent(token_budget=1000)

        # Mock expensive LLM calls
        expensive_task = "Solve this very complex problem " * 100
        with pytest.raises(BudgetExceededError):
            agent.run(expensive_task)

        # Agent should stop, not try to hide the error
        assert agent.total_cost > agent.token_budget

    def test_agent_stuck_in_loop(self):
        """Agent keeps doing same action; iteration limit kicks in"""
        # Create scenario where agent gets stuck
        mock_tool = MagicMock(return_value="unchanged")
        agent = MyAgent(tools={'test_tool': mock_tool}, max_iterations=5)
        result = agent.run("Do something")

        # Should stop after 5 iterations even if not solved
        assert mock_tool.call_count <= 5
        assert "iteration limit" in agent.status.lower() or not result
Part 4: Regression Detection
Establishing Baselines
Before you can detect regressions, you need a baseline:
import json
from datetime import datetime

class BaselineManager:
    """Manage performance baselines for regression detection"""

    def __init__(self, baseline_file="baselines.json"):
        self.baseline_file = baseline_file
        self.baselines = self.load_baselines()

    def load_baselines(self):
        """Load existing baselines or create empty"""
        try:
            with open(self.baseline_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def save_baselines(self):
        """Persist baselines to disk"""
        with open(self.baseline_file, 'w') as f:
            json.dump(self.baselines, f, indent=2)

    def establish_baseline(self, test_name, results):
        """
        Establish a new baseline from multiple test runs.
        Call once when you set up a test.

        Args:
            test_name: Name of the test
            results: List of RunResult objects from run_test()
        """
        successes = sum(1 for r in results if r.success)
        latencies = [r.latency_seconds for r in results]
        baseline = {
            'test_name': test_name,
            'established': datetime.now().isoformat(),
            'num_runs': len(results),
            'success_rate': successes / len(results),
            'mean_latency': sum(latencies) / len(latencies),
            'p95_latency': sorted(latencies)[int(len(latencies) * 0.95)],
            'model_version': 'claude-3-sonnet',  # Or whatever model you use
            'notes': 'Initial baseline'
        }
        self.baselines[test_name] = baseline
        self.save_baselines()
        print(f"Baseline established for '{test_name}':")
        print(f"  Success rate: {baseline['success_rate']:.1%}")
        print(f"  Mean latency: {baseline['mean_latency']:.2f}s")
# Usage: Establish baselines when creating a new test
def setup_test_baselines():
    """Run once to establish baselines for all tests"""
    manager = BaselineManager()
    suite = AgentTestSuite(agent=my_agent, num_runs=100)

    # Test 1: Code review
    results = suite.run_test(
        "Review this code for bugs: ...",
        validation_fn=lambda out: "bug" in out.lower()
    )
    manager.establish_baseline("code_review", results)

    # Test 2: Document analysis
    results = suite.run_test(
        "Summarize this document: ...",
        validation_fn=lambda out: len(out) > 50
    )
    manager.establish_baseline("summarization", results)
Regression Detection: Continuous Testing
Run tests regularly and compare to baseline:
class RegressionDetector:
    """Detect quality degradation vs baseline"""

    def __init__(self, baseline_manager):
        self.baseline_manager = baseline_manager
        self.alert_threshold = 0.05  # Alert on a 5-point success-rate drop

    def test_for_regression(self, test_name, results):
        """
        Compare fresh results to the stored baseline.
        Returns False (with alerts) if performance degraded.
        """
        baseline = self.baseline_manager.baselines.get(test_name)
        if not baseline:
            print(f"No baseline for {test_name}; skipping regression check")
            return True

        # Calculate current metrics
        successes = sum(1 for r in results if r.success)
        current_success_rate = successes / len(results)
        latencies = [r.latency_seconds for r in results]
        current_mean_latency = sum(latencies) / len(latencies)

        # Compare to baseline
        baseline_success = baseline['success_rate']
        baseline_latency = baseline['mean_latency']
        success_delta = current_success_rate - baseline_success
        latency_delta = current_mean_latency - baseline_latency

        print(f"\n=== Regression Check: {test_name} ===")
        print(f"Success rate: {baseline_success:.1%} → {current_success_rate:.1%} (Δ {success_delta:+.1%})")
        print(f"Mean latency: {baseline_latency:.2f}s → {current_mean_latency:.2f}s (Δ {latency_delta:+.2f}s)")

        # Alerts
        alerts = []
        if success_delta < -self.alert_threshold:
            alerts.append(f"ALERT: Success rate dropped {abs(success_delta):.1%}")
        if latency_delta > (baseline_latency * 0.2):  # 20% latency increase
            alerts.append(f"ALERT: Latency increased {latency_delta:.2f}s")

        if alerts:
            for alert in alerts:
                print(alert)
            return False  # Regression detected

        print("✓ No regression detected")
        return True  # Passed
# Usage: Run regression tests after model update
def test_after_model_update():
    """
    After updating the model, verify quality didn't degrade.
    Run frequently (daily, per-commit, weekly).
    """
    detector = RegressionDetector(baseline_manager)
    suite = AgentTestSuite(agent=my_agent, num_runs=50)  # Fewer runs for frequent testing

    # Run each registered test
    for test_name in ['code_review', 'summarization', 'bug_finding']:
        results = suite.run_test(
            task_description=test_prompts[test_name],
            validation_fn=validators[test_name]
        )
        passed = detector.test_for_regression(test_name, results)
        if not passed:
            # Optional: Rollback, alert, etc.
            print(f"Regression in {test_name}; consider rolling back")
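Before rolling back on an alert, it is worth asking whether the observed drop could be sampling noise: a 94% baseline and a 90% current run over 50 samples often overlap. A one-sided two-proportion z-test answers this with the stdlib alone; the sketch below uses an illustrative function name and a ~95% confidence cutoff:

```python
import math

def regression_is_significant(base_succ, base_n, cur_succ, cur_n, z_crit=1.645):
    """One-sided two-proportion z-test: did the success rate really drop?"""
    p_base, p_cur = base_succ / base_n, cur_succ / cur_n
    pooled = (base_succ + cur_succ) / (base_n + cur_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cur_n))
    z = (p_base - p_cur) / se
    return z > z_crit  # True: drop unlikely to be noise at ~95% confidence

significant = regression_is_significant(94, 100, 45, 50)
# False here: a 94% -> 90% drop is within noise at these sample sizes
```

When the test says "not significant", the right response is usually to rerun with a larger N before alerting, not to ignore the drop.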
Tracking What Triggers Regression Tests
Document when regressions happen:
class RegressionLog:
    """Track regression incidents and causes"""

    def __init__(self, log_file="regression_log.json"):
        self.log_file = log_file
        self.log = self.load_log()

    def load_log(self):
        try:
            with open(self.log_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return []

    def save_log(self):
        with open(self.log_file, 'w') as f:
            json.dump(self.log, f, indent=2)

    def log_regression(self, test_name, cause, severity):
        """
        Record a regression incident.

        Args:
            test_name: Which test regressed
            cause: What caused it (e.g., "model_update", "config_change")
            severity: "minor" (~5%), "moderate" (~10%), "severe" (20%+)
        """
        entry = {
            'timestamp': datetime.now().isoformat(),
            'test_name': test_name,
            'cause': cause,
            'severity': severity,
            'git_commit': get_current_git_commit(),
            'model_version': get_model_version()
        }
        self.log.append(entry)
        self.save_log()
        print(f"Logged regression: {cause} in {test_name}")

    def analyze_causes(self):
        """What changes most often cause regressions?"""
        from collections import Counter
        causes = [entry['cause'] for entry in self.log]
        cause_counts = Counter(causes)
        print("\nMost common regression causes:")
        for cause, count in cause_counts.most_common():
            print(f"  {cause}: {count} incidents")

# Usage: Log regression when detected
if not regression_detector.test_for_regression(test_name, results):
    logger = RegressionLog()
    logger.log_regression(
        test_name=test_name,
        cause="model_update",  # or "config_change", "tool_modification"
        severity="moderate"
    )
Part 5: Performance Testing
Load Testing Agents
How many concurrent requests can your harness handle?
import concurrent.futures
import time
from statistics import mean, stdev

class LoadTestAgent:
    """Simulate multiple concurrent agent requests"""

    def __init__(self, agent, num_workers=5):
        self.agent = agent
        self.num_workers = num_workers

    def run_load_test(self, task, num_requests=100):
        """
        Simulate load: num_requests tasks, num_workers concurrent.
        Measure throughput and latency.
        """
        def run_task():
            start = time.time()
            try:
                output = self.agent.run(task)
                return {
                    'success': True,
                    'latency': time.time() - start,
                    'cost': self.agent.last_token_cost(),
                    'output_length': len(output)
                }
            except Exception as e:
                return {
                    'success': False,
                    'latency': time.time() - start,
                    'error': str(e)
                }

        # Run requests concurrently; throughput must use wall-clock time,
        # not the sum of per-request latencies (which overlap under concurrency)
        wall_start = time.time()
        with concurrent.futures.ThreadPoolExecutor(max_workers=self.num_workers) as executor:
            futures = [executor.submit(run_task) for _ in range(num_requests)]
            results = [f.result() for f in concurrent.futures.as_completed(futures)]
        wall_time = time.time() - wall_start

        # Analyze
        successes = sum(1 for r in results if r['success'])
        latencies = sorted(r['latency'] for r in results if r['success'])
        throughput = successes / wall_time

        print(f"\n=== Load Test Results ===")
        print(f"Requests: {num_requests}, Workers: {self.num_workers}")
        print(f"Success rate: {successes}/{num_requests} = {successes/len(results):.1%}")
        print(f"Throughput: {throughput:.1f} req/sec")
        print(f"Mean latency: {mean(latencies):.2f}s")
        if len(latencies) > 1:
            print(f"Latency stdev: {stdev(latencies):.2f}s")
        print(f"P95 latency: {latencies[int(len(latencies) * 0.95)]:.2f}s")
        print(f"P99 latency: {latencies[int(len(latencies) * 0.99)]:.2f}s")

        return {
            'success_rate': successes / len(results),
            'throughput': throughput,
            'mean_latency': mean(latencies),
            'p95_latency': latencies[int(len(latencies) * 0.95)]
        }
# Usage
def test_harness_capacity():
    """Can harness handle realistic load?"""
    load_tester = LoadTestAgent(my_agent, num_workers=5)
    results = load_tester.run_load_test(
        task="Summarize this article",
        num_requests=50
    )

    # Assertions
    assert results['success_rate'] >= 0.95, "Sub-95% success under load"
    assert results['p95_latency'] < 15.0, "P95 latency exceeds 15s"
Latency Benchmarking
Measure how fast your agent responds:
class LatencyBenchmark:
    """Detailed latency analysis"""

    def __init__(self, agent):
        self.agent = agent

    def benchmark_by_complexity(self, test_cases):
        """
        Test latency across different task complexities.
        test_cases: List of (complexity_name, task_prompt) pairs
        """
        results = {}
        for complexity_name, task_prompt in test_cases:
            latencies = []
            for _ in range(20):
                start = time.time()
                self.agent.run(task_prompt)
                latencies.append(time.time() - start)
            results[complexity_name] = {
                'mean': mean(latencies),
                'min': min(latencies),
                'max': max(latencies),
                'stdev': stdev(latencies) if len(latencies) > 1 else 0
            }

        print("\n=== Latency by Task Complexity ===")
        for complexity, stats in results.items():
            print(f"\n{complexity}:")
            print(f"  Mean: {stats['mean']:.2f}s")
            print(f"  Range: {stats['min']:.2f}s – {stats['max']:.2f}s")
            print(f"  Stdev: {stats['stdev']:.2f}s")
        return results

# Usage
benchmark = LatencyBenchmark(my_agent)
results = benchmark.benchmark_by_complexity([
    ("simple", "Is the sky blue?"),
    ("moderate", "Explain the theory of relativity"),
    ("complex", "Design a distributed database system")
])
Cost Per Task Tracking
Monitor spending per inference:
class CostAnalyzer:
    """Track and analyze cost per task"""

    def __init__(self):
        self.cost_history = []

    def analyze_task_cost(self, task_description, results):
        """
        Analyze costs for a completed task.

        Args:
            task_description: What the task was
            results: List of RunResult objects
        """
        costs = [r.cost for r in results if r.success]
        if not costs:
            print("No successful runs; can't analyze cost")
            return

        total_cost = sum(costs)
        mean_cost = total_cost / len(costs)
        min_cost = min(costs)
        max_cost = max(costs)

        entry = {
            'task': task_description,
            'num_runs': len(costs),
            'total_cost': total_cost,
            'mean_cost': mean_cost,
            'range': (min_cost, max_cost),
            'timestamp': datetime.now().isoformat()
        }
        self.cost_history.append(entry)

        print(f"\n=== Cost Analysis ===")
        print(f"Task: {task_description}")
        print(f"Successful runs: {len(costs)}")
        print(f"Mean cost per run: ${mean_cost:.4f}")
        print(f"Range: ${min_cost:.4f} – ${max_cost:.4f}")
        print(f"Total cost: ${total_cost:.2f}")
        return entry

    def monthly_cost_forecast(self, requests_per_day=100):
        """Project monthly cost based on recent history"""
        if not self.cost_history:
            print("No cost history; can't forecast")
            return

        recent_costs = [e['mean_cost'] for e in self.cost_history[-10:]]
        avg_cost = mean(recent_costs)
        monthly_cost = avg_cost * requests_per_day * 30

        print(f"\n=== Monthly Cost Forecast ===")
        print(f"Based on recent average: ${avg_cost:.4f} per request")
        print(f"At {requests_per_day} requests/day:")
        print(f"  Daily: ${avg_cost * requests_per_day:.2f}")
        print(f"  Monthly: ${monthly_cost:.2f}")
Part 6: Quality Metrics Framework
Core Metrics
Every harness should track these:
| Metric | Definition | Target | Tool |
|---|---|---|---|
| Task completion rate | % of requests that achieve goal | ≥90% | Multiple-run tests |
| Latency p50 | Median response time | <10s for interactive | Benchmarks |
| Latency p95 | 95th percentile (worst case) | <30s | Benchmarks |
| Cost per success | $ spent / successes | Budget-dependent | Cost analyzer |
| Hallucination rate | % of outputs containing false claims | <2% | Manual review + automated checks |
| Error rate | % of requests that error out | <5% | Error tracking |
| Tool success rate | % of tool calls that succeed | ≥95% | Tool logs |
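The hallucination-rate row pairs manual review with automated checks. One cheap automated check for summarization-style tasks is to flag numbers in the output that never appear in the source; the helper below is an illustrative sketch (the function name is ours), crude but fully automatable:

```python
import re

def ungrounded_numbers(source_text, output_text):
    """Flag numbers in the output that never appear in the source:
    a crude but automatable hallucination signal for summarization."""
    source_nums = set(re.findall(r'\d+(?:\.\d+)?', source_text))
    output_nums = re.findall(r'\d+(?:\.\d+)?', output_text)
    return [n for n in output_nums if n not in source_nums]

flags = ungrounded_numbers(
    "Revenue rose 12% to $3.4M in 2023.",
    "Revenue grew 12% to $3.4M; headcount doubled to 80.",
)
# flags == ["80"]: the headcount figure has no support in the source
```

A check like this only catches numeric fabrications and misses paraphrased ones, so treat its rate as a lower bound and keep the manual-review sample alongside it.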
Sample Metrics Dashboard
class MetricsDashboard:
    """Collect and display all key metrics"""

    def __init__(self):
        self.metrics = {
            'task_completion': [],
            'latencies': [],
            'costs': [],
            'errors': [],
            'hallucinations': []
        }

    def record_task(self, success, latency, cost, error=None, hallucination=False):
        """Record one task execution"""
        self.metrics['task_completion'].append(success)
        if latency:
            self.metrics['latencies'].append(latency)
        if cost:
            self.metrics['costs'].append(cost)
        if error:
            self.metrics['errors'].append(error)
        if hallucination:
            self.metrics['hallucinations'].append(1)

    def display_dashboard(self):
        """Print current metrics"""
        completions = self.metrics['task_completion']
        latencies = sorted(self.metrics['latencies'])
        costs = self.metrics['costs']

        print(f"\n{'='*50}")
        print(f"{'HARNESS METRICS DASHBOARD':^50}")
        print(f"{'='*50}")

        if completions:
            comp_rate = sum(completions) / len(completions)
            print(f"\nTask Completion: {comp_rate:.1%} ({sum(completions)}/{len(completions)})")

        if latencies:
            print(f"\nLatency:")
            print(f"  Median (p50): {latencies[len(latencies)//2]:.2f}s")
            print(f"  P95: {latencies[int(len(latencies)*0.95)]:.2f}s")
            print(f"  P99: {latencies[int(len(latencies)*0.99)]:.2f}s")

        if costs:
            total = sum(costs)
            print(f"\nCosts:")
            print(f"  Total: ${total:.2f}")
            print(f"  Mean/task: ${total/len(costs):.4f}")
            print(f"  Monthly (100 req/day): ${(total/len(costs)) * 100 * 30:.2f}")

        if self.metrics['errors'] and completions:
            error_rate = len(self.metrics['errors']) / len(completions)
            print(f"\nErrors: {error_rate:.1%}")

        print(f"\n{'='*50}\n")
Part 7: Pre-Deployment Checklist
Before shipping an agent to production:
Smoke Tests (Basic Functionality)
class SmokeTests:
    """Quick validation that agent works at all"""

    def run_all(self, agent):
        """Run smoke tests"""
        checks = [
            ("Agent initializes", self.test_initialization),
            ("Tools available", self.test_tools_available),
            ("Can process input", self.test_basic_task),
            ("Can handle error", self.test_error_handling),
            ("Memory persists", self.test_memory),
            ("Cost tracking works", self.test_cost_tracking)
        ]
        print("\n=== SMOKE TESTS ===\n")
        passed = 0
        for name, test_fn in checks:
            try:
                test_fn(agent)
                print(f"✓ {name}")
                passed += 1
            except Exception as e:
                print(f"✗ {name}: {e}")
        print(f"\nResult: {passed}/{len(checks)} passed")
        return passed == len(checks)

    def test_initialization(self, agent):
        assert agent is not None
        assert agent.model is not None

    def test_tools_available(self, agent):
        tools = agent.available_tools()
        assert len(tools) > 0

    def test_basic_task(self, agent):
        result = agent.run("Say hello")
        assert result is not None
        assert len(result) > 0

    def test_error_handling(self, agent):
        # Intentional error; should be handled gracefully
        try:
            agent.run("undefined_command()")
        except Exception as e:
            # Expected: Error should be handled
            assert "error" in str(e).lower() or agent.last_error is not None

    def test_memory(self, agent):
        agent.run("Remember: test_value = 123")
        result = agent.run("What did I tell you to remember?")
        assert "123" in result or "test_value" in result.lower()

    def test_cost_tracking(self, agent):
        agent.run("test")
        cost = agent.total_cost()
        assert isinstance(cost, (int, float))
        assert cost >= 0
Performance Baseline Validation
class PreDeploymentValidation:
"""Comprehensive pre-deployment checks"""
def validate_all(self, agent, baseline_manager):
"""Run complete pre-deployment validation"""
checks = []
# Smoke tests
print("\n1. Running smoke tests...")
smoke = SmokeTests()
if not smoke.run_all(agent):
print("✗ Smoke tests failed; do not deploy")
return False
# Quality metrics
print("\n2. Validating quality metrics...")
suite = AgentTestSuite(agent, num_runs=50)
results = suite.run_test(
"Main task",
validation_fn=lambda x: True # Custom validation
)
metrics = suite.analyze_results(results, "Pre-deployment")
if metrics['success_rate'] < 0.85:
print(f"✗ Success rate {metrics['success_rate']:.1%} below 85% threshold")
return False
# Regression check
print("\n3. Checking for regressions...")
detector = RegressionDetector(baseline_manager)
if not detector.test_for_regression("main_test", results):
print("✗ Regression detected; investigate before deploying")
return False
# Cost validation
print("\n4. Validating cost is within budget...")
daily_budget = 100.0 # dollars
estimated_daily_cost = metrics['cost_per_success'] * 1000 # Assume 1000 daily requests
if estimated_daily_cost > daily_budget:
print(f"✗ Estimated daily cost ${estimated_daily_cost:.2f} exceeds budget ${daily_budget}")
return False
print(f"✓ Estimated daily cost: ${estimated_daily_cost:.2f} (within budget)")
# Security review
print("\n5. Security review...")
security_passed = self.validate_security(agent)
if not security_passed:
print("✗ Security issues found")
return False
# Logging configured
print("\n6. Checking logging...")
if not agent.logger:
print("✗ Logging not configured")
return False
print("✓ All checks passed; ready to deploy")
return True
def validate_security(self, agent):
"""Check security constraints"""
# These are examples; customize for your harness
security_checks = [
("API keys not in code", lambda a: not self.has_hardcoded_secrets(a)),
("Input validation enabled", lambda a: a.validate_input),
("Output sanitization enabled", lambda a: a.sanitize_output),
("Rate limiting configured", lambda a: a.rate_limiter is not None)
]
all_passed = True
for check_name, check_fn in security_checks:
try:
if check_fn(agent):
print(f" ✓ {check_name}")
else:
print(f" ✗ {check_name}")
all_passed = False
except Exception as e:
print(f" ? {check_name}: {e}")
return all_passed
def has_hardcoded_secrets(self, agent):
"""Check for hardcoded API keys, tokens, etc."""
import inspect
source = inspect.getsource(agent)
secret_patterns = [
r'sk_live_', # Stripe
r'api[_-]?key\s*=\s*["\']', # Generic API key
r'password\s*=\s*["\']', # Password
]
import re
for pattern in secret_patterns:
if re.search(pattern, source):
return True # Found secret
return False # No secrets found
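The `inspect`-based check above only sees the agent class's own source. A broader sketch that applies the same patterns to a whole repository tree (`scan_repo_for_secrets` is a hypothetical helper; dedicated scanners such as gitleaks or trufflehog are far more thorough and should be preferred in CI):

```python
import re
from pathlib import Path

SECRET_PATTERNS = [
    re.compile(r"sk_live_"),                 # Stripe live key
    re.compile(r"api[_-]?key\s*=\s*['\"]"),  # generic hardcoded API key
    re.compile(r"password\s*=\s*['\"]"),     # hardcoded password
]

def scan_repo_for_secrets(root: str = ".") -> list:
    """Return (path, pattern) pairs for suspicious matches in .py files."""
    findings = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for pattern in SECRET_PATTERNS:
            if pattern.search(text):
                findings.append((str(path), pattern.pattern))
    return findings
```

Wire this into the security checks above, or run it as a standalone pre-commit step.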
Part 8: Continuous Testing Strategy
Testing Frequency
| Test Type | Frequency | Purpose |
|---|---|---|
| Unit tests (tools) | Per commit | Catch regressions early |
| Smoke tests | Daily | Basic functionality |
| Quality metrics | Weekly | Track trends |
| Regression tests | After model update, per release | Detect degradation |
| Load tests | Monthly | Capacity planning |
| Full validation | Before deployment | Go/no-go decision |
class ContinuousTestingSchedule:
"""Automated testing schedule"""
def run_daily_tests(self):
"""Run every day at 8 AM"""
print("=== Daily Tests ===")
# Smoke tests
smoke = SmokeTests()
smoke.run_all(agent)
# Quick quality check (smaller sample)
suite = AgentTestSuite(agent, num_runs=20)
results = suite.run_test("task", validation_fn)
metrics = suite.analyze_results(results, "Daily quality check")
def run_weekly_tests(self):
"""Run every Monday"""
print("=== Weekly Tests ===")
# Full quality metrics (larger sample)
suite = AgentTestSuite(agent, num_runs=100)
results = suite.run_test("task", validation_fn)
metrics = suite.analyze_results(results, "Weekly quality")
# Alert if degraded
if metrics['success_rate'] < 0.85:
send_alert(f"Quality degradation: {metrics['success_rate']:.1%}")
def run_pre_release_tests(self):
"""Before deploying new version"""
print("=== Pre-Release Validation ===")
validator = PreDeploymentValidation()
go_nogo = validator.validate_all(agent, baseline_manager)
if go_nogo:
print("✓ APPROVED FOR RELEASE")
return True
else:
print("✗ BLOCKED: Fix issues before release")
return False
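In practice these schedules are driven by cron or CI triggers. As a minimal illustration, a pure function can map the frequency table to the jobs due at a given time (`due_tests` and the job names are hypothetical):

```python
import datetime
from typing import Optional

def due_tests(now: Optional[datetime.datetime] = None) -> list:
    """Return the test jobs due at `now`, per the frequency table above.
    Minimal sketch; real deployments use cron/CI triggers instead."""
    now = now or datetime.datetime.now()
    jobs = ["smoke"]              # daily smoke tests
    if now.weekday() == 0:        # Monday: full weekly quality run
        jobs.append("weekly_quality")
    if now.day == 1:              # first of the month: load test
        jobs.append("load_test")
    return jobs
```

A scheduler loop (or a single daily CI job) can then dispatch each returned name to the matching `run_*_tests` method.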
Part 9: Failure Case Testing
Common Failure Modes
class FailureModeTests:
"""Test how agent handles predictable failures"""
def test_tool_failure_recovery(self):
"""Tool fails; agent should retry or fallback"""
from unittest.mock import patch
with patch('harness.tools.web_search', side_effect=Exception("Network error")):
agent = MyAgent()
result = agent.run("Search for something")
# Agent should:
# 1. Catch the error
# 2. Log it
# 3. Retry or fallback
assert agent.has_logged_error
assert agent.fallback_used or agent.retry_count > 0
def test_model_hallucination(self):
"""Detect when model makes up facts"""
from harness.validation import check_hallucination
# Run agent on task where ground truth is known
ground_truth = {"capital_of_france": "Paris"}
result = agent.run("What is the capital of France?")
# Validate factual claims
hallucinated = check_hallucination(result, ground_truth)
if hallucinated:
print(f"Agent hallucinated: {hallucinated}")
# Log incident
agent.log_hallucination(result, hallucinated)
def test_context_overflow(self):
"""Context window full; agent should handle"""
# Fill context with large text
large_prompt = "X" * 50000
agent = MyAgent(context_limit=4096)
result = agent.run(large_prompt)
# Should either:
# 1. Summarize to fit context
# 2. Reject gracefully
# 3. Chunk and process in parts
assert result is not None or agent.error_type == "context_overflow"
def test_budget_overrun_protection(self):
"""Cost exceeds budget; should stop"""
agent = MyAgent(budget=10.0) # $10 budget
expensive_task = "Solve a very complex problem " * 1000
with pytest.raises(BudgetExceededError):
agent.run(expensive_task)
# Verify agent stopped cleanly
assert agent.total_cost <= agent.budget * 1.05 # Allow 5% overage for last request
def test_iteration_limit(self):
"""Agent stuck in loop; iteration limit stops it"""
# Create scenario where agent repeats same action
agent = MyAgent(max_iterations=5)
result = agent.run("Impossible task")
# Should stop after 5 iterations
assert agent.iteration_count <= 5
assert "iteration limit" in agent.status.lower() or result is None
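The iteration-limit test assumes the harness enforces `max_iterations` somewhere. One minimal sketch of such a guard, which also flags the repeated-action signature of a stuck agent (`LoopGuard` is a hypothetical helper, not part of any framework shown here):

```python
class IterationLimitExceeded(Exception):
    """Raised when the agent loop runs past its iteration budget."""

class LoopGuard:
    """Stop after max_iterations; flag repeated identical actions."""

    def __init__(self, max_iterations: int = 5, repeat_window: int = 3):
        self.max_iterations = max_iterations
        self.repeat_window = repeat_window
        self.history = []

    def check(self, action) -> bool:
        """Record an action; return True if the agent looks stuck."""
        self.history.append(action)
        if len(self.history) > self.max_iterations:
            raise IterationLimitExceeded(
                f"exceeded {self.max_iterations} iterations")
        recent = self.history[-self.repeat_window:]
        # Stuck if the last repeat_window actions are all identical
        return len(recent) == self.repeat_window and len(set(recent)) == 1
```

An agent loop would call `guard.check(action)` each step and break or re-plan when it returns `True`.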
Part 10: Test Fixtures & Sample Test Suite
Creating Representative Test Cases
class TestDatasets:
"""Curated test cases for validation"""
@staticmethod
def get_code_review_tests():
"""Test cases for code review agent"""
return [
{
"name": "Obvious bug",
"code": "def divide(a, b):\n return a / b # Bug: no zero check",
"expected_findings": ["zero", "divide", "error"]
},
{
"name": "Logic error",
"code": "if x > 0:\n print('negative')",
"expected_findings": ["logic", "contradiction"]
},
{
"name": "Memory leak",
"code": "class Handler:\n def __init__(self):\n self.listeners = [] # Never cleared",
"expected_findings": ["leak", "memory", "clear"]
}
]
@staticmethod
def get_writing_tests():
"""Test cases for writing/summary agent"""
return [
{
"name": "Technical summary",
"content": "Quantum computing uses quantum bits...",
"validation": lambda out: len(out) > 50 and "quantum" in out.lower()
},
{
"name": "Simplified explanation",
"content": "Einstein's theory of relativity states...",
"validation": lambda out: len(out) > 50 and "simple" not in out.lower() # Should explain plainly, not announce "in simple terms"
}
]
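A small driver can turn these curated cases into a keyword-hit score. This sketch assumes the code-review case shape above; `run_dataset` and the stub `fake_review_agent` are hypothetical names:

```python
def run_dataset(agent_fn, cases, min_hit_ratio=0.5):
    """Score agent_fn against curated cases: a case passes when at
    least min_hit_ratio of its expected keywords appear in the output."""
    per_case = []
    for case in cases:
        output = agent_fn(case["code"]).lower()
        hits = sum(1 for kw in case["expected_findings"] if kw in output)
        per_case.append({"name": case["name"],
                         "hit_ratio": hits / len(case["expected_findings"])})
    pass_rate = sum(r["hit_ratio"] >= min_hit_ratio for r in per_case) / len(per_case)
    return pass_rate, per_case

# Stub agent for demonstration; replace with your real agent call
def fake_review_agent(code: str) -> str:
    return "Possible divide-by-zero error: add a zero check before dividing."

cases = [{
    "name": "Obvious bug",
    "code": "def divide(a, b):\n    return a / b",
    "expected_findings": ["zero", "divide", "error"],
}]
pass_rate, details = run_dataset(fake_review_agent, cases)
```

Keyword matching is a crude proxy; treat it as a smoke-level signal and keep the stricter per-run validation functions for quality gates.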
# Complete sample test suite
class SampleHarnessTestSuite:
"""Ready-to-use test suite for your harness"""
def __init__(self, agent):
self.agent = agent
self.baseline_manager = BaselineManager()
def run_all(self):
"""Run complete test suite"""
print("\n" + "="*60)
print("COMPREHENSIVE HARNESS TEST SUITE")
print("="*60)
# 1. Unit tests
print("\n[1/5] Unit Tests (Tools)")
self.run_unit_tests()
# 2. Smoke tests
print("\n[2/5] Smoke Tests")
smoke = SmokeTests()
smoke_passed = smoke.run_all(self.agent)
if not smoke_passed:
print("Smoke tests failed; aborting further tests")
return False
# 3. Quality metrics
print("\n[3/5] Quality Metrics (50 runs)")
suite = AgentTestSuite(self.agent, num_runs=50)
results = suite.run_test(
"Main task",
validation_fn=self.main_validation
)
metrics = suite.analyze_results(results, "Main task")
# 4. Regression detection
print("\n[4/5] Regression Detection")
detector = RegressionDetector(self.baseline_manager)
regression_passed = detector.test_for_regression("main", results)
# 5. Pre-deployment validation
print("\n[5/5] Pre-Deployment Validation")
validator = PreDeploymentValidation()
deployment_ready = validator.validate_all(self.agent, self.baseline_manager)
# Summary
print("\n" + "="*60)
print("TEST SUITE SUMMARY")
print("="*60)
print(f"Smoke tests: {'PASS' if smoke_passed else 'FAIL'}")
print(f"Quality metrics: {metrics['success_rate']:.1%}")
print(f"Regression check: {'PASS' if regression_passed else 'FAIL'}")
print(f"Deployment ready: {'YES' if deployment_ready else 'NO'}")
print("="*60 + "\n")
return deployment_ready
def run_unit_tests(self):
"""Run tool unit tests"""
test_tools = TestWebSearchTool()
test_tools.test_search_success()
test_tools.test_search_input_validation()
print("✓ Web search tool tests passed")
test_code = TestCodeExecutionTool()
test_code.test_execute_valid_python()
test_code.test_execute_blocked_operations()
print("✓ Code execution tool tests passed")
def main_validation(self, output):
"""Customize this for your agent"""
return output is not None and len(output) > 10
# Usage: Run complete test suite
suite = SampleHarnessTestSuite(my_agent)
ready_to_deploy = suite.run_all()
if ready_to_deploy:
print("✓ Agent approved for production deployment")
else:
print("✗ Agent requires fixes before deployment")
Key Takeaways
For Testing Non-Deterministic Systems
- Don’t test individual runs: Collect statistics over N runs (typically 50–100)
- Measure success rate, not pass/fail: “Agent succeeds 94% of time” is the metric
- Track failure modes: Understand why runs fail, not just that they do
- Pin determinism only for debugging: set temperature=0.0 (and a fixed seed, where the provider supports one) to reproduce specific failures, never to measure quality
For Regression Detection
- Establish baselines first: Run test N times when creating it, save metrics
- Test frequently: Daily smoke tests, weekly full tests, always before releases
- Alert on degradation: Success rate drops 5%+, latency increases 20%+
- Track causes: Log what changes triggered regressions (model update, config, tools)
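The two alert thresholds above (a 5-point success-rate drop, a 20% latency increase) can be encoded directly; `regression_alert` is a hypothetical helper:

```python
def regression_alert(baseline: dict, current: dict) -> list:
    """Flag regressions per the thresholds above: a success-rate drop
    of 5+ points or a p95 latency increase of 20%+."""
    alerts = []
    if baseline["success_rate"] - current["success_rate"] >= 0.05:
        alerts.append("success_rate_drop")
    if current["latency_p95"] > baseline["latency_p95"] * 1.2:
        alerts.append("latency_increase")
    return alerts
```

Run this after each scheduled test pass, feeding it the stored baseline and the fresh metrics.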
For Pre-Deployment
- Run smoke tests: Verify basic functionality
- Check quality metrics: Success rate ≥90%, cost within budget
- Run regression tests: Compare to baseline, ensure no degradation
- Security review: No hardcoded secrets, input validation, rate limiting
- Load test: Verify handles expected concurrency
Tools & Patterns
| Purpose | Pattern | Example |
|---|---|---|
| Unit test tools | Mock LLM, test tool in isolation | TestWebSearchTool class |
| Integration test | End-to-end workflow, verify output quality | TestAgentEndToEnd class |
| Load testing | concurrent.futures, measure throughput | LoadTestAgent class |
| Regression detection | Baseline + statistical comparison | RegressionDetector class |
| Metrics dashboard | Collect and display key stats | MetricsDashboard class |
| Pre-deployment | Comprehensive checklist | PreDeploymentValidation class |
Related Documentation
- 06_harness_architecture.md — Component failure modes, recovery strategies
- 05_ai_agents.md — Iteration limits, handling stuck agents
- 08_claw_code_python.md — Instrumentation patterns, logging
- 04_memory_systems.md — Memory validation, persistence testing
References & Further Reading
- Paper: “Is RLHF Required for Human Preference Alignment?” (Zyg et al., 2024) — On measuring alignment quality statistically
- Tool: `lm-evaluation-harness` (EleutherAI) — standard benchmarking framework
- Tool: `pytest` — Python testing framework used in examples
- Pattern: "Chaos engineering for LLMs" — intentionally failing components to test resilience
- Resource: GitHub's "Software Performance Testing" guide — load testing strategies
Document Metadata
Created: April 2026
Scope: Testing and QA strategies for AI harnesses
Audience: Engineers building and deploying harnesses
Key concepts: Non-deterministic testing, regression detection, quality metrics, pre-deployment validation
Code examples: 25+ runnable Python patterns
Estimated reading time: 2–3 hours
Estimated implementation time: 1–2 weeks (build test infrastructure)
Next step: Pick one testing pattern that matters most to your harness (unit tests, regression detection, or pre-deployment checklist) and implement it this week. The infrastructure you build compounds; every test reduces deployment risk.
Part 11: Pre-Production Acceptance Criteria Template
Go/No-Go Decision Template
Copy this checklist into your release process. Every item must be checked before deploying to production. Fill in the measured values and compare against thresholds.
# Pre-Production Acceptance Criteria
# Date: ____________________
# Version: _________________
# Reviewer: ________________
## Quality Gates
| Gate | Threshold | Measured | Pass? |
|-----------------------------|-----------------|------------|-------|
| Task success rate | >= 90% | ____% | [ ] |
| Hallucination rate | <= 2% | ____% | [ ] |
| Tool call success rate | >= 95% | ____% | [ ] |
| Error rate (unhandled) | <= 5% | ____% | [ ] |
| Regression vs baseline | No degradation | +/- ____% | [ ] |
## Latency Thresholds
| Metric | Threshold | Measured | Pass? |
|-----------------------------|-----------------|------------|-------|
| Step latency p50 | < 5s | ____s | [ ] |
| Step latency p95 | < 15s | ____s | [ ] |
| End-to-end session p95 | < 60s | ____s | [ ] |
| Tool call latency p95 | < 10s | ____s | [ ] |
## Cost Ceilings
| Metric | Threshold | Measured | Pass? |
|-----------------------------|-----------------|------------|-------|
| Cost per successful task | < $0.50 | $____ | [ ] |
| Projected daily cost | < $____budget | $____ | [ ] |
| Projected monthly cost | < $____budget | $____ | [ ] |
## Security & Operations
| Check | Pass? |
|-----------------------------------------------|-------|
| No hardcoded secrets in codebase | [ ] |
| Input validation enabled | [ ] |
| Rate limiting configured | [ ] |
| Structured logging active | [ ] |
| Budget enforcement (hard limit) active | [ ] |
| Health check endpoints responding | [ ] |
| Alerting configured (see Doc 09) | [ ] |
## Decision
[ ] GO — All gates passed, approved for production
[ ] NO-GO — Failures listed below must be resolved
Blocking issues:
1. _______________________________________________
2. _______________________________________________
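The quality-gate table is easy to mirror in code so the go/no-go decision is reproducible. `Gate` and `evaluate_gates` below are hypothetical names, and the measured values are illustrative:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Gate:
    name: str
    threshold: float
    measured: float
    higher_is_better: bool = True  # False for rates you want low (errors, hallucinations)

    def passed(self) -> bool:
        if self.higher_is_better:
            return self.measured >= self.threshold
        return self.measured <= self.threshold

def evaluate_gates(gates: List[Gate]) -> dict:
    """Return a go/no-go verdict plus the names of blocking gates."""
    blocking = [g.name for g in gates if not g.passed()]
    return {"go": not blocking, "blocking": blocking}

# Illustrative measured values; fill in from your own test runs
gates = [
    Gate("task_success_rate", threshold=0.90, measured=0.94),
    Gate("hallucination_rate", threshold=0.02, measured=0.01, higher_is_better=False),
    Gate("tool_call_success_rate", threshold=0.95, measured=0.97),
    Gate("unhandled_error_rate", threshold=0.05, measured=0.03, higher_is_better=False),
]
print(evaluate_gates(gates))
```

The returned `blocking` list maps directly onto the "Blocking issues" section of the checklist.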
How Test Results Feed Into Monitoring (Doc 09)
The quality gates above establish baseline thresholds that production monitoring should enforce continuously:
# Map test acceptance criteria to production alerts
ACCEPTANCE_TO_ALERTS = {
# Test gate -> Production alert rule
"success_rate >= 90%": "alert if harness_success_rate < 0.90 for 5m",
"p95_latency < 15s": "alert if harness_latency_step_p95_ms > 15000 for 5m",
"error_rate <= 5%": "alert if harness_errors_rate > 0.05 for 5m",
"cost_per_task < $0.50": "alert if harness_cost_per_session_usd > 0.50",
}
# After running acceptance tests, generate a Prometheus alerting config
def generate_alert_rules(acceptance_results: dict) -> str:
"""Convert acceptance criteria into Prometheus alert rules."""
rules = []
rules.append("groups:")
rules.append("- name: harness_acceptance")
rules.append(" rules:")
if acceptance_results["success_rate"] >= 0.90:
rules.append(" - alert: SuccessRateBelowBaseline")
rules.append(f" expr: harness_success_rate < {acceptance_results['success_rate'] - 0.05}")
rules.append(" for: 5m")
rules.append(" labels: { severity: critical }")
if acceptance_results["p95_latency"]:
threshold_ms = int(acceptance_results["p95_latency"] * 1000 * 1.2) # 20% headroom
rules.append(" - alert: LatencyAboveBaseline")
rules.append(f" expr: harness_latency_step_p95_ms > {threshold_ms}")
rules.append(" for: 5m")
rules.append(" labels: { severity: warning }")
return "\n".join(rules)
How to Calculate Statistical Significance of Test Results
With non-deterministic systems, you need to know whether a measured success rate of 88% is genuinely below your 90% threshold or just noise. Use a binomial proportion confidence interval:
import math
def is_statistically_significant(
successes: int,
total_runs: int,
threshold: float,
confidence: float = 0.95
) -> dict:
"""
Determine if measured success rate is significantly below threshold.
Args:
successes: Number of successful runs
total_runs: Total number of runs
threshold: Required success rate (e.g., 0.90)
confidence: Confidence level (default 95%)
Returns:
Dict with measured rate, confidence interval, and verdict
"""
p_hat = successes / total_runs
# Z-score for confidence level (1.96 for 95%, 2.58 for 99%)
z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.58}.get(confidence, 1.96)
# Wilson score interval (better than normal approximation for small N)
denominator = 1 + z**2 / total_runs
centre = (p_hat + z**2 / (2 * total_runs)) / denominator
margin = z * math.sqrt((p_hat * (1 - p_hat) + z**2 / (4 * total_runs)) / total_runs) / denominator
lower_bound = centre - margin
upper_bound = centre + margin
# Verdict: fail only if upper bound of confidence interval is below threshold
significant_failure = upper_bound < threshold
return {
"measured_rate": round(p_hat, 4),
"confidence_interval": (round(lower_bound, 4), round(upper_bound, 4)),
"threshold": threshold,
"significant_failure": significant_failure,
"verdict": "FAIL (significant)" if significant_failure else "PASS (within noise)",
"recommendation": f"Run {max(100, total_runs * 2)} tests for tighter bounds" if not significant_failure and p_hat < threshold else None
}
# Usage examples
print(is_statistically_significant(successes=44, total_runs=50, threshold=0.90))
# measured_rate=0.88, CI=(0.762, 0.9438), verdict="PASS (within noise)"
# 88% is below 90%, but the CI includes 90% — could be noise with only 50 runs
print(is_statistically_significant(successes=82, total_runs=100, threshold=0.90))
# measured_rate=0.82, CI=(0.7333, 0.883), verdict="FAIL (significant)"
# 82% with 100 runs — the CI upper bound 88.3% is below 90%, a genuine regression
Rule of thumb for sample sizes:
- 50 runs: Can detect 10%+ regressions with 95% confidence
- 100 runs: Can detect 5%+ regressions with 95% confidence
- 200 runs: Can detect 3%+ regressions with 95% confidence
If you need to distinguish 89% from 91%, you need at least 200 runs. For most go/no-go decisions, 50-100 runs is sufficient.
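The rule of thumb follows from the normal-approximation sample-size formula n = (z / drop)^2 * p * (1 - p), where "drop" is the smallest regression you want to detect. A sketch (`required_runs` is a hypothetical helper; the table above uses more conservative round numbers):

```python
import math

def required_runs(baseline_rate: float, detectable_drop: float,
                  z: float = 1.96) -> int:
    """Normal-approximation sample size: n = (z / drop)^2 * p * (1 - p).
    z = 1.96 for two-sided 95% confidence, 1.645 for one-sided 95%."""
    p = baseline_rate
    return math.ceil((z / detectable_drop) ** 2 * p * (1 - p))

print(required_runs(0.90, 0.10))  # 35
print(required_runs(0.90, 0.05))  # 139
print(required_runs(0.90, 0.03))  # 385
```

Use the one-sided z when you only care about drops below the threshold; it roughly halves the required runs for the same detectable gap.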
See Also
- Doc 09 (Operations & Observability) — Monitoring is continuous testing; quality metrics in production feed back into test baselines
- Doc 06 (Harness Architecture) — Understand the components you’re testing; testing strategy differs per component
- Doc 05 (AI Agents) — Understand agentic behavior for non-deterministic testing frameworks
- Doc 16 (Evaluation & Benchmarking) — Formal evaluation methods for complex reasoning; extends testing to production metrics