Reference

Knowledge Management at Scale

Scaling beyond markdown wikis — hybrid search systems, knowledge graphs, real-world scaling case study from 50 to 800 articles.

Managing knowledge bases for large harnesses requires strategic planning and architectural choices. This guide addresses the critical gaps that emerge as knowledge bases grow beyond simple markdown wikis, covering everything from markdown scaling limits to hybrid vector systems and multi-agent knowledge sharing.

Audience: Knowledge engineers, architects, systems engineers managing large knowledge bases.


1. The Markdown Wiki Pattern (Revisited)

The markdown wiki is the natural starting point for knowledge management in harnesses. It’s simple, version-controlled, and works beautifully—until it doesn’t.

Performance by Scale

| Knowledge Base Size | Characteristics | Typical Use Cases |
| --- | --- | --- |
| <400K words (~100 articles) | Instant search, full context, Git-native | Single-agent harnesses, domain-specific bots |
| 400K-1M words (~250-600 articles) | Noticeable search latency (2-5s), context windows strained | Multi-team coordination, medium enterprises |
| >1M words (~600+ articles) | Unusable performance (10-30s searches), full context impossible | Enterprise-scale systems, multi-domain knowledge |

Why Markdown Works Until It Doesn’t

Strengths:

  • Native Git support (history, diffs, blame)
  • Human-readable, easy to edit
  • Simple to embed in prompts
  • No external infrastructure
  • Works offline

Failure Modes:

  1. Search scalability: Full-text search degrades as the wiki grows; past the 1M-word mark, searches can take tens of seconds
  2. Context bloat: Jamming 1M words into a prompt leaves no room for actual work
  3. Relevance: Keyword search finds 50 loosely related documents, none perfect
  4. Staleness: Large wikis accumulate outdated information faster than they’re fixed
  5. Maintenance overhead: Duplicate information, broken links, inconsistent terminology

When to Transition

Stay with markdown if:

  • Total size <400K words
  • Search response times under 2s are acceptable
  • Knowledge is stable (not changing hourly)
  • Scope is domain-specific (narrow vocab)

Transition if:

  • Approaching 1M words total
  • Search latency above 5s is unacceptable
  • Knowledge changes frequently
  • Multi-domain coverage (wide, shallow vocab)
  • Agents need semantic understanding, not just keywords
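A quick way to check these thresholds is to count words across the wiki. The sketch below is a minimal, hypothetical helper; the function names are illustrative, and the thresholds mirror the criteria above:

```python
import os
from glob import glob

def wiki_word_count(doc_dir):
    """Total word count across all markdown files in a wiki."""
    total = 0
    for path in glob(os.path.join(doc_dir, "**", "*.md"), recursive=True):
        with open(path, encoding="utf-8") as f:
            total += len(f.read().split())
    return total

def transition_advice(words):
    """Map the size thresholds above onto a recommendation."""
    if words < 400_000:
        return "stay with markdown"
    if words < 1_000_000:
        return "plan a transition"
    return "transition now"
```

Running this in CI makes the transition decision visible before latency becomes a complaint.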

2. Knowledge Base Scaling Strategies

Three proven patterns exist for scaling beyond pure markdown. Each addresses different constraints.

Strategy A: Multi-Tier Markdown (Hierarchical Summaries)

Architecture: Organize knowledge in three tiers—raw data, summaries, and indexes.

docs/
├── raw/              # Full original documents
│   ├── api-v1.md    # Complete reference
│   └── api-v2.md
├── summaries/       # Distilled versions
│   ├── api-quick-start.md
│   ├── api-common-patterns.md
│   └── api-faq.md
└── indexes/         # Navigation layer
    ├── README.md    # Overview
    ├── by-topic.md  # Hierarchical toc
    └── glossary.md  # Terms

How it works:

  1. Maintain raw docs (Git-controlled, exhaustive)
  2. Create summaries (80% knowledge, 20% bulk)
  3. Serve summaries to agents (faster search, smaller context)
  4. Link back to raw for deep dives

When to use:

  • Knowledge size 400K-2M words
  • High-accuracy requirements
  • Mixed stability (some docs change often, others rarely)
  • Teams maintaining knowledge manually

Implementation:

# Agent prompt pattern
- Load: /docs/indexes/README.md          (entry point)
- Search: /docs/summaries/*              (quick answer)
- Deep-dive: /docs/raw/* (link provided) (complete info)
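The index → summaries → raw flow can be sketched as a small loader. This is a hypothetical illustration assuming the docs/ layout shown earlier; `quick_answer` does a naive substring match over the summaries tier only:

```python
from pathlib import Path

class TieredKnowledge:
    """Serve the summaries tier first; fall back to raw docs on request."""

    def __init__(self, root):
        self.root = Path(root)

    def entry_point(self):
        """Load the navigation layer (the agent's starting context)."""
        return (self.root / "indexes" / "README.md").read_text()

    def quick_answer(self, topic):
        """Search the small summaries tier only (fast, cheap on tokens)."""
        hits = []
        for path in (self.root / "summaries").glob("*.md"):
            text = path.read_text()
            if topic.lower() in text.lower():
                hits.append({"source": str(path), "text": text})
        return hits

    def deep_dive(self, name):
        """Follow a link into the raw tier for the complete document."""
        return (self.root / "raw" / name).read_text()
```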

Pros:

  • Still Git-native
  • Controlled token usage
  • Explicit relevance (human curation)
  • Fast to implement

Cons:

  • Manual maintenance burden grows
  • Summarization is lossy
  • Doesn’t scale past 2-3M words
  • No semantic understanding

Strategy B: Hybrid Markdown + Vector Index

Architecture: Keep markdown, add vector embeddings for semantic search.

docs/
├── *.md              # Unmodified markdown
└── .vectorindex/
    ├── embeddings.db # Vector store (SQLite + vectors)
    ├── chunks.json   # Chunk→source mapping
    └── index.json    # Metadata

How it works:

  1. Chunk markdown into ~300-500 token segments
  2. Generate embeddings for each chunk (using OpenAI, Anthropic, or local model)
  3. On search: convert query to embedding, find nearest neighbors
  4. Return top-k relevant chunks + links to full documents

When to use:

  • Knowledge size 1-10M words
  • Semantic search critical (not just keywords)
  • Knowledge doesn’t change hourly
  • Budget for embedding model calls

Implementation (Python example):

from sentence_transformers import SentenceTransformer
import numpy as np

# Chunk documents
chunks = chunk_markdown(docs, chunk_size=400, overlap=50)

# Generate embeddings (one-time)
model = SentenceTransformer('all-MiniLM-L6-v2')  # Fast, local
embeddings = [model.encode(chunk['text']) for chunk in chunks]

# Store with faiss or sqlite-vec
index = VectorIndex(embeddings, chunks)
index.save('.vectorindex/')

# Query time
query = "how do I configure rate limiting?"
query_vec = model.encode(query)
top_k = index.search(query_vec, k=5)
results = [chunks[i] for i in top_k]

Pros:

  • Markdown still Git-native
  • Semantic search (understands intent)
  • Scales to 10M+ words
  • Works offline (if using local embeddings)
  • No external database needed (SQLite-based)

Cons:

  • Embedding generation cost (one-time, or on updates)
  • Chunk boundaries can split concepts
  • Requires embedding model (third-party or local)
  • Index must be kept in sync

Strategy C: Full Vector Database

Architecture: Replace markdown search with dedicated vector DB. Markdown becomes source-of-truth; vector DB is derived index.

Markdown (Git)  →  ETL Pipeline  →  Vector DB  →  Agent Queries
api.md          Chunk            Pinecone       Semantic search
faq.md          Embed            Weaviate       + Metadata filter
examples.md     Index            Milvus

When to use:

  • Knowledge size >10M words
  • Semantic + metadata search essential
  • Real-time updates (new docs hourly)
  • Multi-agent access (shared infrastructure)
  • Can afford managed service or self-hosted DB

Implementation pattern:

# ETL: Markdown → Vector DB
def sync_knowledge_base(markdown_dir, vector_db):
    docs = load_markdown(markdown_dir, track_version=True)
    
    for doc in docs:
        chunks = chunk(doc['text'], size=400)
        for i, chunk in enumerate(chunks):
            vector_db.upsert({
                'id': f"{doc['id']}_chunk_{i}",
                'text': chunk,
                'embedding': embed(chunk),
                'metadata': {
                    'source': doc['path'],
                    'category': doc['category'],
                    'version': doc['version'],
                    'updated_at': doc['mtime'],
                }
            })

# Query: Use metadata filters + semantic search
results = vector_db.search(
    query_embedding=embed(user_query),
    filter={'version': '>=2.0', 'category': 'api'},
    top_k=10
)

Pros:

  • Scales to 100M+ words
  • Real-time updates
  • Metadata filtering (narrow search space)
  • Cloud-native (easy multi-agent sharing)
  • Advanced search (hybrid keyword + semantic)

Cons:

  • Requires external service (cost, latency, dependencies)
  • Complexity (ETL pipeline, index management)
  • Data residency concerns (if SaaS)
  • More moving parts to maintain

Decision Matrix

| Criteria | Strategy A | Strategy B | Strategy C |
| --- | --- | --- | --- |
| Size | <2M words | 1-10M | >10M |
| Search latency | <5s | <2s | <1s |
| Semantic search? | No | Yes | Yes |
| Infrastructure | Git only | SQLite | Pinecone/Weaviate |
| Update frequency | Manual | Batch | Real-time |
| Cost | Free | ~$50-200/mo | $200-2000+/mo |
| Implementation time | 1 week | 2-3 weeks | 4-6 weeks |

3. Incremental Knowledge Updates

Knowledge isn’t static. Teams discover bugs, refine explanations, and add new features constantly. Updates must happen without breaking running agents.

Zero-Downtime Updates

Pattern: Version the knowledge base, switch at query time

docs/
├── v1/
│   ├── api.md
│   └── guides.md
├── v2/
│   ├── api.md
│   └── guides.md
└── CURRENT_VERSION  # Points to v2

Implementation:

# Agent loads version dynamically
class KnowledgeBase:
    def load(self):
        current = read('docs/CURRENT_VERSION').strip()
        self.path = f'docs/{current}/'
        self.docs = load_markdown(self.path)
    
    def search(self, query):
        return semantic_search(query, self.docs)

# Update process
# 1. Create new version directory
copy_dir('docs/v2', 'docs/v3')
# 2. Edit docs/v3/* as needed
# 3. Write new version pointer
write('docs/CURRENT_VERSION', 'v3')
# 4. Commit (shell):
#    git commit -m "docs: release v3 of knowledge base"

Adding Knowledge Without Recompilation

Key principle: Knowledge should be loaded at query time, not at agent initialization.

Anti-pattern (compiles knowledge into weights):

# ❌ Don't do this
agent = HarnessAgent(knowledge=hardcoded_docs)
# Now you need to retrain the agent to add new knowledge

Pattern (external lookup):

# ✓ Do this
agent = HarnessAgent()

def answer(query):
    context = knowledge_base.search(query)  # Loaded at query time
    return agent.answer(query, context=context)

Versioning Knowledge

Semantic versioning for knowledge:

  • v1.0: Initial knowledge base
  • v1.1: Bug fixes, clarifications (backward compatible)
  • v2.0: Major changes, removed docs (breaking)

Track breaking changes:

# CHANGELOG.md
## v2.0 (2025-06-15)
### Breaking Changes
- Removed: Old authentication flow (replaced by OAuth2)
- Changed: API endpoint /v1/users → /v2/accounts
- Deprecated: legacy-config.yaml format

### Migration Guide
See upgrade-v1-to-v2.md

Backward Compatibility

Pattern: Support multiple versions in parallel

class MultiVersionKnowledge:
    def __init__(self):
        self.v1 = load_markdown('docs/v1/')
        self.v2 = load_markdown('docs/v2/')
    
    def answer(self, query, version='v2'):
        if version == 'v2':
            return self.v2.search(query)
        elif version == 'v1':
            return self.v1.search(query)
        else:
            raise ValueError(f"Unknown version: {version}")

# Agent can specify version
response = knowledge.answer(query, version=request.version)

Use cases:

  • Legacy systems still on v1 API
  • Gradual migration (v1 → v2)
  • A/B testing different knowledge bases
  • Backward compatibility windows

4. Conflicting Information Resolution

Large knowledge bases inevitably contain contradictions. One document says “do X”, another says “don’t do X”. Which is correct?

Sources of Conflict

  1. Temporal: Old docs say one thing, new docs say another
  2. Authoritative: Engineering docs contradict product docs
  3. Scope: “Works for API v2” vs “Works for all versions”
  4. Interpretation: Ambiguous requirements lead to different conclusions

Manual Conflict Resolution

Process: Explicit review, authoritative decision, tracking

## Conflict: Authentication Method
### Issue: #347

**Claim 1** (api.md, line 45):
"Use Basic Auth with username:password"

**Claim 2** (guides/modern-auth.md, line 12):
"Basic Auth is deprecated. Use OAuth2 instead."

**Investigation**:
- Basic Auth still works, but not recommended for new integrations
- OAuth2 is preferred for security

**Resolution**:
- **Authoritative**: Engineering team decision
- **Winner**: OAuth2 (Claim 2 is correct)
- **Action**: Deprecate Basic Auth docs, add migration guide

**Tracking**:
```yaml
version: v2.5
resolution_id: conflict_auth_001
resolved_by: Security Team Lead
resolved_at: 2025-06-15
decision: oauth2_preferred
migration_deadline: 2025-12-31
```

Automated Conflict Detection

Pattern: Detect contradictions programmatically

import re

# Extract claims from documents
def extract_claims(doc):
    """Find factual statements (simplified example)"""
    patterns = [
        r'(?:use|must|should|can)\s+(\w+)',  # "use OAuth2"
        r'(?:deprecated|legacy|old)\s+(\w+)',  # "deprecated Basic Auth"
    ]
    claims = []
    for pattern in patterns:
        for match in re.finditer(pattern, doc):
            claims.append({
                'text': match.group(0),
                'entity': match.group(1),
                'confidence': 0.8,
            })
    return claims

# Detect contradictions
def detect_conflicts(claims):
    """Find opposing claims about same entity"""
    by_entity = {}
    for claim in claims:
        entity = claim['entity']
        by_entity.setdefault(entity, []).append(claim)
    
    conflicts = []
    for entity, entity_claims in by_entity.items():
        if len(entity_claims) > 1:
            # Check if claims contradict
            texts = [c['text'] for c in entity_claims]
            if contradicts(texts):  # placeholder predicate, e.g. one claim says "use X", another "X deprecated"
                conflicts.append({
                    'entity': entity,
                    'claims': entity_claims,
                    'severity': 'high',
                })
    return conflicts

# Alert humans to review
for conflict in detect_conflicts(all_claims):
    print(f"⚠️ Conflict detected: {conflict['entity']}")
    print(f"   Claims: {[c['text'] for c in conflict['claims']]}")

Resolution Tracking

Maintain a decisions log:

# docs/DECISIONS.yaml
decisions:
  - id: auth_method_001
    question: "Should new integrations use Basic Auth or OAuth2?"
    claims:
      - text: "Use Basic Auth"
        source: api.md:45
        version: v1.0
      - text: "Use OAuth2"
        source: guides/modern-auth.md:12
        version: v2.0
    resolution: "OAuth2 for new integrations, Basic Auth deprecated"
    decided_by: Security Team
    decided_at: 2025-06-15
    expires_at: 2025-12-31
    
  - id: rate_limit_format_001
    question: "What HTTP header for rate limit remaining?"
    claims:
      - text: "X-RateLimit-Remaining"
        source: v1/api.md
      - text: "RateLimit-Remaining"
        source: v2/api.md
    resolution: "RateLimit-* headers (follows RFC 6585)"
    decided_by: API Team
    decided_at: 2025-03-01
    migration_path: "v1 still supports X-RateLimit-*, v2+ requires RateLimit-*"

5. Knowledge Base Maintenance

Knowledge decays. Docs become outdated, links break, examples fail. Systematic maintenance prevents slow degradation.

Stale Content Detection

Pattern: Track document age, flag old content

import os
from glob import glob
from datetime import datetime

def find_stale_docs(doc_dir, max_age_days=180):
    """Flag docs not updated in 6+ months"""
    stale = []
    for filepath in glob(f'{doc_dir}/**/*.md', recursive=True):
        mtime = os.path.getmtime(filepath)
        age_days = (datetime.now() - datetime.fromtimestamp(mtime)).days
        
        if age_days > max_age_days:
            stale.append({
                'file': filepath,
                'age_days': age_days,
                'last_modified': datetime.fromtimestamp(mtime),
            })
    
    return sorted(stale, key=lambda x: x['age_days'], reverse=True)

# Flag in CI
stale = find_stale_docs('docs/', max_age_days=180)
for doc in stale:
    print(f"⚠️ Stale: {doc['file']} ({doc['age_days']} days old)")
    print(f"   Last updated: {doc['last_modified'].date()}")

Pruning Obsolete Information

Example: Removing deprecated API docs

# Before
docs/
├── api/
│   ├── v1.md  (deprecated, 2023)
│   ├── v2.md  (current)
│   └── v3.md  (beta)
└── CHANGELOG.md

# After (if v1 truly unused)
docs/
├── api/
│   ├── v2.md  (current)
│   └── v3.md  (beta)
├── archive/
│   └── v1-2023-12-15.md.bak  (kept for history)
└── CHANGELOG.md
  (entry: "Removed: v1 API docs, archived to archive/")

Decision criteria:

  • Is anyone still using this version? (Check logs, support tickets)
  • Can this be archived instead of deleted?
  • Does regulatory/compliance require keeping history?
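Archiving rather than deleting can be automated. A minimal sketch, assuming the archive/ layout and the date-stamped .md.bak naming convention shown above:

```python
import shutil
from datetime import date
from pathlib import Path

def archive_doc(doc_path, archive_dir="docs/archive"):
    """Move a deprecated doc into the archive with a date-stamped
    name (e.g. v1.md -> archive/v1-2023-12-15.md.bak)."""
    src = Path(doc_path)
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamped = f"{src.stem}-{date.today().isoformat()}{src.suffix}.bak"
    dest = dest_dir / stamped
    shutil.move(str(src), str(dest))
    return dest
```

Pair this with a CHANGELOG entry so the removal is discoverable later.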

Refreshing Out-of-Date Content

Process: Explicit review and update cycles

# docs/REVIEW_SCHEDULE.md

## Q2 2025 Reviews (April-June)
- [ ] Authentication docs (updated Q4 2024, OAuth2 changes)
- [ ] Rate limiting (updated Q1 2024, service limits changed)
- [ ] Deployment guide (updated Q1 2025, should be stable)

## Q3 2025 Reviews (July-September)
- [ ] SDKs (check for new versions)
- [ ] Examples (run all code samples)
- [ ] FAQ (check support tickets for new patterns)

Template for updates:

---
last_reviewed: 2025-04-15
review_cycle: quarterly
status: current  # current | outdated | deprecated
version_applies: v2.0+
---

# Rate Limiting Guide

Last updated: 2025-04-15 (reviewed, current)
Updated by: API Team
Changes: Added RateLimit-Reset header description

[content...]

Archival Strategy

Keep full history, organize by relevance:

docs/
├── current/           # What's live now
│   ├── api.md
│   └── guides/
├── deprecated/        # Still documented, but don't use
│   ├── legacy-auth.md
│   └── old-format.md
└── archive/          # Historical, for reference only
    └── v1-2023/
        ├── api.md
        └── changelog-2023.md

Example metadata:

# In each deprecated doc
---
status: deprecated
deprecated_since: 2024-06-01
removal_date: 2025-06-01  # Planned
replacement: oauth2.md
---

6. Hybrid Search Systems

Different search approaches solve different problems. Most large systems need both.

Keyword Search (What’s familiar)

How it works: Index words, find documents containing query words

Pros:

  • Fast (simple hash lookups)
  • Predictable (easy to debug)
  • Works for exact matches (“OAuth2”, “rate_limit_exceeded”)
  • Low latency

Cons:

  • Misses synonyms (“token-based auth” vs “OAuth”)
  • Sensitive to phrasing (plural forms, tense)
  • Noisy results (all documents with keyword)

When to use:

  • Exact technical terms (“v2 API”, “RateLimit-Remaining”)
  • User knows what they’re looking for
  • Latency critical (<100ms)
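The "index words, find documents containing query words" mechanic can be shown with a minimal inverted index. This is an illustrative sketch, not a production search engine (no stemming, ranking, or phrase support):

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def keyword_search(index, query):
    """Return ids of documents containing ALL query words (AND semantics)."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = index.get(words[0], set()).copy()
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

The lookups are simple set intersections, which is why keyword search stays fast and predictable.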

Semantic Search (What makes sense)

How it works: Convert query and documents to embeddings (dense vectors), find similar vectors

Pros:

  • Understands intent (“how do I limit requests?” matches “rate limiting guide”)
  • Handles synonyms and rephrasing
  • Relevance ranking (nearest vectors = most relevant)
  • Works with fuzzy/partial knowledge

Cons:

  • Slower (embedding generation + vector search)
  • Requires embedding model (third-party or local)
  • Less obvious why a result appeared
  • Sensitive to out-of-domain queries

When to use:

  • User doesn’t know exact terminology
  • Intent-based search (“how do I add auth?”)
  • Learning mode (exploring new domain)

Hybrid Approach

Pattern: Keyword + semantic search combined

def hybrid_search(query, knowledge_base, top_k=5):
    """
    Combine keyword and semantic results.
    1. Get top keyword matches (fast, exact)
    2. Get top semantic matches (slow, relevant)
    3. Merge and rank by score
    """
    
    # Keyword search (fast, exact matches)
    keyword_results = knowledge_base.keyword_search(query, top_k=10)
    keyword_scores = {r['id']: r['score'] for r in keyword_results}
    
    # Semantic search (thorough, intent-based)
    semantic_results = knowledge_base.semantic_search(query, top_k=10)
    semantic_scores = {r['id']: r['score'] for r in semantic_results}
    
    # Merge: boost documents that appear in both
    merged = {}
    for doc_id in set(keyword_scores.keys()) | set(semantic_scores.keys()):
        keyword_score = keyword_scores.get(doc_id, 0) * 0.4
        semantic_score = semantic_scores.get(doc_id, 0) * 0.6
        merged[doc_id] = keyword_score + semantic_score
    
    # Return top results
    top = sorted(merged.items(), key=lambda x: x[1], reverse=True)[:top_k]
    return [knowledge_base.get(doc_id) for doc_id, _ in top]

Choosing an embedding model:

| Model | Speed | Quality | Size | Cost |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 (local) | Fast | Good | 22 MB | Free |
| all-mpnet-base-v2 (local) | Medium | Very good | 438 MB | Free |
| OpenAI text-embedding-3-small | Slow | Excellent | Cloud | $0.02/1M tokens |
| OpenAI text-embedding-3-large | Slow | Excellent | Cloud | $0.13/1M tokens |

(Anthropic does not offer a first-party embedding model; its documentation points Claude-based systems to third-party embedding providers.)

Chunking strategies:

def chunk_markdown(doc_text, chunk_size=400, overlap=50):
    """
    Split long documents into overlapping chunks.

    Too small (100 tokens): loses context
    Too large (800 tokens): mixes unrelated concepts
    Overlap prevents losing info at chunk boundaries
    """

    sentences = split_sentences(doc_text)
    chunks = []
    current = []
    current_tokens = 0

    for sentence in sentences:
        sent_tokens = count_tokens(sentence)

        if current_tokens + sent_tokens > chunk_size:
            # Save chunk
            chunks.append(' '.join(current))

            # Start new chunk, carrying over trailing sentences
            # until roughly `overlap` tokens are retained
            kept, kept_tokens = [], 0
            for s in reversed(current):
                t = count_tokens(s)
                if kept_tokens + t > overlap:
                    break
                kept.insert(0, s)
                kept_tokens += t
            current = kept
            current_tokens = kept_tokens

        current.append(sentence)
        current_tokens += sent_tokens

    if current:
        chunks.append(' '.join(current))

    return chunks

# Example
doc = load_markdown('api.md')
chunks = chunk_markdown(doc, chunk_size=400, overlap=50)
embeddings = [embed(chunk) for chunk in chunks]

7. Knowledge Graph Patterns

When documents alone aren’t enough, model relationships explicitly.

When to Move Beyond Flat Documents

Red flags for flat markdown:

  • Lots of “see also” links (suggests implicit structure)
  • Questions like “what APIs use this data model?”
  • Relationships: Entity A (e.g., User) relates to B (e.g., Account)
  • Traversal: Want to follow chains (User → Account → API Key)

Example: E-commerce knowledge base

Problem: Find all operations that require authentication
Markdown approach: Search for "authentication" in all docs (gets noise)
Graph approach: Query: AuthenticationRequired -[:relatesTo]-> Operation

Entity-Relationship Patterns

Represent domain concepts as entities with relationships:

entities:
  # Concept entities
  APIEndpoint:
    name: API endpoint
    examples: ["/users", "/accounts/{id}"]
  
  DataModel:
    name: Data structure
    examples: ["User", "Account", "AuthToken"]
  
  AuthenticationMethod:
    name: Auth approach
    examples: ["OAuth2", "BasicAuth"]

relationships:
  - type: "endpoint_uses_model"
    from: APIEndpoint
    to: DataModel
    example: "POST /users receives User model"
  
  - type: "endpoint_requires_auth"
    from: APIEndpoint
    to: AuthenticationMethod
    example: "GET /users requires OAuth2"
  
  - type: "model_contains_field"
    from: DataModel
    to: Field
    example: "User.id is required string"

Graph Traversal

Navigate relationships to answer complex questions:

Query: What endpoints can an unauthenticated user call?

Traversal:
1. Find all APIEndpoints
2. Filter where NOT (endpoint_requires_auth -> *)
3. Return: [GET /status, POST /login, GET /docs]

---

Query: If we remove OAuthToken data model, what breaks?

Traversal:
1. Find DataModel("OAuthToken")
2. Find all APIEndpoints that endpoint_uses_model -> OAuthToken
3. Find all AuthenticationMethods that auth_produces -> OAuthToken
4. Return: [breaking endpoints, auth methods that fail]
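The two traversals above can be expressed against a tiny in-memory graph. A minimal sketch; the entity and relation names are hypothetical, taken from the examples in this section:

```python
class KnowledgeGraph:
    """Tiny in-memory graph: typed edges between named entities."""

    def __init__(self):
        self.edges = []  # (source, relation, target) triples

    def add(self, source, relation, target):
        self.edges.append((source, relation, target))

    def targets(self, source, relation):
        """All targets reachable from `source` via `relation`."""
        return {t for s, r, t in self.edges if s == source and r == relation}

    def sources(self, relation, target):
        """All sources pointing at `target` via `relation`."""
        return {s for s, r, t in self.edges if r == relation and t == target}

def unauthenticated_endpoints(graph, endpoints):
    """Query 1: endpoints with no endpoint_requires_auth edge."""
    return {e for e in endpoints if not graph.targets(e, "endpoint_requires_auth")}

def removal_impact(graph, model):
    """Query 2: endpoints that break if `model` is removed."""
    return graph.sources("endpoint_uses_model", model)
```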

Knowledge Graph Databases

When to use a graph database:

  • >100 entity types or >1,000 relationships
  • Complex queries (multi-hop traversals)
  • Real-time insights needed
  • Multiple agents querying same knowledge

Popular options:

  • Neo4j: Most mature, Cypher query language
  • Amazon Neptune: AWS managed
  • TigerGraph: Performance-optimized, supports real-time analytics
  • ArangoDB: Multi-model (documents + graphs)

Example Neo4j setup:

// Define nodes
CREATE (user:DataModel {name: "User"})
CREATE (oauth:AuthMethod {name: "OAuth2"})
CREATE (endpoint:APIEndpoint {name: "GET /users"})

// Define relationships
CREATE (endpoint)-[:requires_auth]->(oauth)
CREATE (endpoint)-[:returns_model]->(user)

// Query: Find all auth methods used by any endpoint
MATCH (auth:AuthMethod)<-[:requires_auth]-(endpoint:APIEndpoint)
RETURN DISTINCT auth.name

// Query: Find endpoints that return a specific data model
MATCH (endpoint:APIEndpoint)-[:returns_model]->(m:DataModel {name: "User"})
RETURN endpoint.name

8. Curation & Quality Control

Knowledge quality directly impacts agent quality. Garbage in, garbage out.

Who Maintains Knowledge?

Models:

  1. Central team (dedicated knowledge managers)

    • Pro: Consistent, high quality
    • Con: Slow updates, bottleneck
    • Best for: Large organizations, critical knowledge
  2. Domain experts (subject matter experts)

    • Pro: Accurate, fast updates
    • Con: Inconsistent style, variable quality
    • Best for: Technical knowledge, multiple domains
  3. Hybrid (domain experts + QA reviewers)

    • Pro: Fast + accurate + consistent
    • Con: Coordination overhead
    • Best for: Growing organizations

Example policy:

# CONTRIBUTION_POLICY.md

ownership:
  api-documentation:
    primary: API Team
    secondary: Engineering Team Lead
    review_required: true
  
  getting-started:
    primary: Product Team
    secondary: API Team
    review_required: true
  
  internal-runbooks:
    primary: Ops Team
    secondary: None
    review_required: false

process:
  new_knowledge:
    - Author writes/edits
    - Assigned reviewer checks (48h deadline)
    - Author addresses feedback
    - Merged to main branch
  
  quality_review:
    - Quarterly: All docs reviewed by primary owner
    - Bi-annual: Cross-team review for consistency

Quality Standards

Checklist for knowledge acceptance:

# Knowledge Quality Checklist

## Accuracy
- [ ] Claims are current and correct
- [ ] Examples have been tested (code runs, URLs work)
- [ ] No contradictions with existing docs

## Completeness
- [ ] Covers happy path + common errors
- [ ] Includes version info (what systems/versions?)
- [ ] Links to related knowledge

## Clarity
- [ ] No jargon without explanation
- [ ] Active voice preferred
- [ ] Short paragraphs (3 sentences max)

## Maintenance
- [ ] Author identified (who maintains this?)
- [ ] Review cycle defined (how often updated?)
- [ ] Stakeholders identified (who should know if this changes?)

## Structure
- [ ] Follows template for doc type
- [ ] Heading hierarchy is logical
- [ ] Code examples use syntax highlighting

Peer Review Process

Pattern: Two-tier review (technical + editorial)

# Workflow: Pull Request to knowledge base
# 1. Author submits new/edited docs
# 2. Technical reviewer (domain expert) approves accuracy
# 3. Editorial reviewer (writing expert) approves clarity
# 4. Both approvals required to merge

class KnowledgeReview:
    def __init__(self, pr):
        self.pr = pr
        self.technical_approval = False
        self.editorial_approval = False
    
    def is_approved(self):
        return self.technical_approval and self.editorial_approval
    
    def request_technical_review(self, reviewer):
        """Domain expert verifies correctness"""
        pass
    
    def request_editorial_review(self, reviewer):
        """Writing expert verifies clarity and style"""
        pass

Automated Quality Checks

Lint knowledge base in CI:

#!/bin/bash
# scripts/validate-knowledge.sh
shopt -s globstar  # needed for the docs/**/*.md loop below

echo "Validating knowledge base..."

# Check 1: No broken links
echo "Checking for broken internal links..."
rg '\[.*\]\((docs/.*?\.md)\)' docs/ | while read match; do
    file=$(echo $match | grep -oE 'docs/[^)]+\.md')
    [ ! -f "$file" ] && echo "❌ Broken link: $file"
done

# Check 2: Required metadata
echo "Checking for metadata..."
for md in docs/**/*.md; do
    grep -q "last_reviewed:" "$md" || echo "⚠️ Missing metadata: $md"
done

# Check 3: Code examples are valid
echo "Validating code examples..."
# Extract ```bash blocks and run them
rg '```bash' -A 100 docs/ | ./check-bash-examples.py

# Check 4: No stale docs
echo "Finding stale documentation..."
find docs -name "*.md" -mtime +180 | while read f; do
    echo "⚠️ Stale (>6mo): $f"
done

# Check 5: Consistent terminology
echo "Checking for terminology consistency..."
if grep -rq "API key" docs/ && grep -rq "API-key" docs/; then
    echo "⚠️ Inconsistent: 'API key' vs 'API-key'"
fi

9. Integration Patterns

How agents discover and use knowledge.

Explicit Loading (Pull Model)

Agent loads knowledge at startup:

class Harness:
    def __init__(self, knowledge_paths):
        self.knowledge = {}
        for path in knowledge_paths:
            self.knowledge[path] = load_markdown(path)
    
    def answer(self, query):
        context = self.knowledge.get('api.md', '')
        return self.agent.answer(query, context=context)

Pros:

  • Simple, predictable
  • Full context loaded upfront
  • Good for small, stable knowledge

Cons:

  • Token-heavy (loads everything, uses little)
  • Stale if knowledge updated
  • Doesn’t scale (can’t load 10M words)

Dynamic Discovery (Push/Pull Hybrid)

Agent requests knowledge when needed:

class DynamicHarness:
    def __init__(self):
        self.kb = VectorIndex('docs/')
    
    def answer(self, query):
        # Fetch relevant knowledge at query time
        relevant_chunks = self.kb.search(query, top_k=5)
        context = '\n---\n'.join([c['text'] for c in relevant_chunks])
        return self.agent.answer(query, context=context)

Pros:

  • Only loads relevant knowledge
  • Automatically updated with docs
  • Scales to large bases
  • Accurate context (not everything)

Cons:

  • Extra latency (search time)
  • Search quality matters
  • Requires vector index

Knowledge as a Tool

Pattern: Agent calls knowledge lookup as a function

from langchain.tools import Tool

knowledge_search = Tool(
    name="search_knowledge_base",
    description="Search the knowledge base for relevant information",
    func=lambda query: knowledge_base.search(query, top_k=3)
)

agent = HarnessAgent(tools=[knowledge_search, code_executor, ...])

# Agent uses tool autonomously
response = agent.answer(
    "How do I configure rate limiting?",
    tools=[knowledge_search, code_executor]
)
# Agent might call: search_knowledge_base("rate limiting configuration")

Pros:

  • Agent decides when knowledge is needed
  • Natural integration with other tools
  • Supports multi-step reasoning
  • Works with frameworks (LangChain, LlamaIndex)

Cons:

  • Extra LLM calls (search decisions)
  • Latency increases
  • More complex debugging

Knowledge Orchestration

Coordinating knowledge across tools:

class KnowledgeOrchestrator:
    """
    Manage which knowledge is available to which agents/tools.
    """
    
    def __init__(self):
        self.global_kb = VectorIndex('docs/')  # Available everywhere
        self.api_team_kb = VectorIndex('docs/api/')  # API team only
        self.internal_kb = VectorIndex('docs/internal/')  # Employees only
    
    def get_kb_for_agent(self, agent_name, access_level):
        """Return appropriate knowledge for agent"""
        kbs = [self.global_kb]  # Everyone gets this
        
        if access_level == 'api_team':
            kbs.append(self.api_team_kb)
        
        if access_level == 'internal':
            kbs.append(self.internal_kb)
        
        return CombinedIndex(kbs)
    
    def search(self, query, agent_name, access_level):
        kb = self.get_kb_for_agent(agent_name, access_level)
        return kb.search(query)

10. Performance Optimization

Keep knowledge retrieval fast, even at scale.

Indexing Strategies

Multi-level indexing:

Raw documents (1000 files, 10M words)
    ↓ (expensive, one-time)
Inverted index (keywords → documents)
    +
Vector index (chunks → embeddings)
    ↓
Query time: use the indexes, not raw docs

Implementation:

# Build indices once, reuse many times
class OptimizedKnowledgeBase:
    def __init__(self, doc_dir):
        # Load from cache if exists
        self.keyword_index = load_or_build_keyword_index(doc_dir)
        self.vector_index = load_or_build_vector_index(doc_dir)
    
    def search(self, query, method='hybrid', top_k=5):
        """Search using pre-built indices"""
        if method == 'keyword':
            return self.keyword_index.search(query, top_k)
        elif method == 'semantic':
            return self.vector_index.search(query, top_k)
        else:
            # Hybrid: combine both indices
            k_results = self.keyword_index.search(query, top_k=10)
            v_results = self.vector_index.search(query, top_k=10)
            return merge_results(k_results, v_results, top_k)
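The merge_results helper above is left undefined; a minimal sketch, assuming each index returns (doc_id, score) pairs with scores already normalized to the 0-1 range, is a weighted sum:

```python
def merge_results(keyword_results, vector_results, top_k=5, kw_weight=0.3):
    """Blend two (doc_id, score) result lists into one ranked list."""
    combined = {}
    for doc_id, score in keyword_results:
        combined[doc_id] = combined.get(doc_id, 0.0) + kw_weight * score
    for doc_id, score in vector_results:
        # Semantic matches get the remaining weight
        combined[doc_id] = combined.get(doc_id, 0.0) + (1 - kw_weight) * score
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

Documents found by both indices accumulate score from each, so agreement between keyword and semantic search naturally rises to the top.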

Caching Frequently Accessed Knowledge

Pattern: LRU cache for common queries

from collections import OrderedDict

class CachedKnowledgeBase:
    def __init__(self, kb, max_size=1000):
        self.kb = kb
        self.cache = OrderedDict()  # insertion order doubles as recency order
        self.max_size = max_size
        self.cache_hits = 0
        self.cache_misses = 0
    
    def search(self, query, top_k=5):
        """Search with LRU caching"""
        # Plain-string key (not a hash) so invalidate() can pattern-match it
        cache_key = f"{query}:{top_k}"
        
        if cache_key in self.cache:
            self.cache_hits += 1
            self.cache.move_to_end(cache_key)  # mark as most recently used
            return self.cache[cache_key]
        
        self.cache_misses += 1
        results = self.kb.search(query, top_k)
        self.cache[cache_key] = results
        
        # Evict the least recently used entry when over capacity
        if len(self.cache) > self.max_size:
            self.cache.popitem(last=False)
        
        return results
    
    def invalidate(self, pattern=None):
        """Clear cache when knowledge updates"""
        if pattern is None:
            self.cache.clear()
        else:
            self.cache = OrderedDict(
                (k, v) for k, v in self.cache.items() if pattern not in k
            )

When to cache:

  • Frequently asked questions (FAQ section)
  • Common patterns (e.g., “how to setup”, “authentication”)
  • Time-sensitive: cache expiry after 1-24 hours
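The time-sensitive bullet can be sketched as a wrapper that timestamps each entry and drops it after a fixed time-to-live (class and parameter names here are illustrative):

```python
import time

class TTLCache:
    """Drop cached search results after a fixed time-to-live."""

    def __init__(self, ttl_seconds=3600):  # default: 1 hour
        self.ttl = ttl_seconds
        self.store = {}  # key -> (inserted_at, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        inserted_at, value = entry
        if time.time() - inserted_at > self.ttl:
            del self.store[key]  # expired; force a fresh search
            return None
        return value

    def put(self, key, value):
        self.store[key] = (time.time(), value)
```

Expiry keeps cached answers from outliving the knowledge they were derived from, which matters more than raw hit rate for fast-changing docs.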

Approximate Nearest Neighbor Search

For very large vector indices, exact search becomes slow. Approximate nearest neighbor (ANN) methods trade a little recall for large speedups:

| Method | Speed | Accuracy | Best For |
|---|---|---|---|
| Exact search | Slow (O(n)) | 100% | <1M vectors |
| FAISS | Fast (O(log n)) | 99%+ | 1-100M vectors |
| HNSW | Very fast | 95%+ | Streaming/real-time |
| IVF | Fast | 90%+ | Partitioned search |

FAISS example:

import faiss
import numpy as np

# Build the index once (IndexFlatL2 is exact; swap in IndexHNSWFlat or
# IndexIVFFlat once the corpus outgrows brute-force search)
vectors = np.array([embed(chunk) for chunk in chunks], dtype='float32')
index = faiss.IndexFlatL2(384)  # 384 = embedding dimension
index.add(vectors)

# Save for reuse
faiss.write_index(index, 'knowledge.index')

# Query time: fast
query_vec = np.array([embed(query)], dtype='float32')
distances, indices = index.search(query_vec, 5)  # top-5 nearest chunks
results = [chunks[i] for i in indices[0]]

Lazy Loading

Don’t load everything at startup:

class LazyKnowledgeBase:
    def __init__(self, doc_dir):
        self.doc_dir = doc_dir
        self.chunks = None  # Load on first use
        self.index = None
    
    def _ensure_loaded(self):
        if self.chunks is None:
            self.chunks = self._load_chunks()
            self.index = self._build_index(self.chunks)
    
    def search(self, query):
        self._ensure_loaded()
        return self.index.search(query)

11. Multi-Agent Knowledge Sharing

When multiple agents or teams need the same knowledge.

Centralized Knowledge Base

Single source of truth, shared by all agents:

# Knowledge base serving 10 agents

Knowledge Base (Git + Vector Index)
    ├─ API Team Agent (reads api.md, integrations.md)
    ├─ Support Agent (reads faq.md, troubleshooting.md)
    ├─ Analytics Agent (reads data-models.md, queries.md)
    ├─ DevOps Agent (reads deployment.md, runbooks.md)
    └─ ...

Benefits:

  • Single update syncs to all agents
  • Consistent information
  • Easy to audit (all in Git)

Challenges:

  • Knowledge is generic (covers many use cases)
  • Agents load knowledge they don’t use
  • No specialization

Agent-Specific Knowledge

Each agent has custom knowledge subset:

Base Knowledge
    ├─ api.md (for all agents)
    └─ faq.md (for all agents)

Specializations
    ├─ api-team/
    │   ├─ sdk-internals.md
    │   └─ performance-tuning.md
    ├─ support-team/
    │   ├─ troubleshooting.md
    │   └─ workarounds.md
    └─ devops/
        ├─ deployment-matrix.md
        └─ runbooks/

Implementation:

class SpecializedAgent:
    def __init__(self, agent_type):
        self.base_kb = VectorIndex('docs/base/')
        self.specialized_kb = VectorIndex(f'docs/{agent_type}/')
    
    def search(self, query):
        # Search specialized first, fall back to base
        specialized = self.specialized_kb.search(query, top_k=3)
        if specialized:
            return specialized
        return self.base_kb.search(query, top_k=3)

Knowledge Inheritance Hierarchies

Organize knowledge by scope and specificity:

Level 1: Industry Standards
  └─ "What is OAuth2?" (applies to all companies)

Level 2: Company Policies
  └─ "We use OAuth2 with 15-min token lifetime" (applies to this company)

Level 3: Product-Specific
  └─ "Our API endpoints require OAuth2 with X-API-Key header"

Level 4: Team-Specific
  └─ "API Team: we document endpoints in OpenAPI 3.1" (how to maintain level 3)

In practice:

docs/
├── L1-standards/
│   ├── oauth2.md
│   └── rest-best-practices.md
├── L2-company/
│   ├── company-security-policy.md
│   └── authentication-standard.md
├── L3-product/
│   ├── api/
│   └── integrations/
└── L4-team/
    ├── api-team/
    └── support-team/
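Lookup order can mirror the hierarchy: search the most specific level first and fall back toward industry standards. A runnable sketch, with a trivial keyword index standing in for a real vector index and hypothetical article names:

```python
class StubIndex:
    """Keyword stand-in for the vector indexes used elsewhere in this guide."""

    def __init__(self, docs):
        self.docs = docs  # {article_name: text}

    def search(self, query, top_k=3):
        terms = query.lower().split()
        hits = [name for name, text in self.docs.items()
                if any(term in text.lower() for term in terms)]
        return hits[:top_k]


class HierarchicalKnowledgeBase:
    """Search levels most-specific-first; the first level with a hit wins."""

    def __init__(self, levels):
        # levels ordered from L4 (team) down to L1 (industry standards)
        self.indexes = [StubIndex(docs) for docs in levels]

    def search(self, query, top_k=3):
        for index in self.indexes:
            results = index.search(query, top_k=top_k)
            if results:
                return results  # team-specific guidance shadows generic docs
        return []


kb = HierarchicalKnowledgeBase([
    {"team-oauth": "API Team: we document OAuth2 endpoints in OpenAPI 3.1"},
    {"oauth-standard": "What is OAuth2? An authorization framework"},
])
print(kb.search("oauth2 endpoints"))
```

The fallback ordering is a policy choice: specificity-first works when team docs are authoritative; blending levels works better when generic context should accompany specific answers.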

12. Real-World Examples

Example 1: Transitioning from Markdown to Hybrid (Small→Medium)

Starting state: 300 markdown files, 600K words. Keyword search takes 10s. Search results noisy.

Goal: Reduce search time to <2s, improve relevance.

Approach: Multi-Tier Markdown + Vector Index

Step 1: Assess current state

# Count words in markdown
find docs -name "*.md" -exec wc -w {} + | tail -1
# Output: 612000 total

# Check search performance
time knowledge_base.search("how to setup oauth")
# Output: real 0m9.742s (too slow)

Step 2: Create summaries

docs/
├── raw/
│   ├── authentication-complete.md (4000 words)
│   └── rate-limiting-full.md (3500 words)
└── summaries/
    ├── authentication-quick.md (500 words)
    └── rate-limiting-quick.md (400 words)

Summarization process:

  • Manual review: SME reads full doc, writes 80/20 version
  • Markup: Add [full docs](../raw/authentication-complete.md) links
  • Review: Another SME checks summary is accurate

Step 3: Build vector index (on summaries)

# This is fast since we're indexing 50K words, not 600K
chunks = chunk_markdown('docs/summaries/', chunk_size=400)
embeddings = embed_batch(chunks)  # ~10 min with OpenAI API
save_vector_index(embeddings, 'docs/.index/')

Step 4: Update agent to search summaries first

# Before
context = '\n'.join(load_all_markdown('docs/'))  # 600K tokens, slow

# After
relevant = vector_search('docs/.index/', query, top_k=3)
context = '\n---\n'.join(relevant)  # 1.2K tokens, fast

Results:

  • Search time: 10s → 0.5s
  • Context size: 600K tokens → 1.2K tokens
  • Accuracy: Improved (semantic search vs keyword)
  • Maintenance: +20% (keep summaries updated)

Lessons learned:

  • Summarization is lossy, but acceptable for common queries
  • Keep raw docs for deep dives
  • Vector index on summaries is maintenance sweet spot

Example 2: Knowledge Graph for Domain Relationships

Scenario: Fintech company with complex API (users, accounts, transactions, cards).

Problem: Markdown says “Card requires an Account” but doesn’t show what else depends on Account. When Account data model changes, what breaks?

Solution: Knowledge graph

Entities:

datamodels:
  - User: root entity
  - Account: requires User
  - Card: requires Account
  - Transaction: requires Card or Account
  - Webhook: triggers on Transaction

endpoints:
  - POST /accounts: creates Account (requires User)
  - POST /cards: creates Card (requires Account)
  - POST /transactions: posts Transaction (requires Card)

Relationships:

User
  ├─ creates → Account
  └─ has_many → Account

Account
  ├─ created_by → User
  ├─ creates → Card
  └─ has_many → Card

Card
  ├─ belongs_to → Account
  ├─ enables → Transaction
  └─ has_many → Transaction

Webhook
  └─ triggers_on → Transaction

Queries enabled:

# “What breaks if we remove the Card model?”
MATCH (card:DataModel {name: 'Card'})<-[:uses_model]-(endpoint:APIEndpoint)
RETURN endpoint.name

# “What does a User need before they can post a transaction?”
MATCH (u:DataModel {name: 'User'})-[:creates]->(a:DataModel {name: 'Account'})
      -[:creates]->(c:DataModel {name: 'Card'})
      -[:enables]->(t:DataModel {name: 'Transaction'})
RETURN [u.name, a.name, c.name, t.name]

# “What endpoints touch the Account data model?”
MATCH (endpoint:APIEndpoint)-[:creates|updates|returns]->(a:DataModel {name: 'Account'})
RETURN endpoint.name

Markdown can’t answer these. Graph can.
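Even without a graph database, the dependency question can be answered from a plain adjacency map; a minimal sketch, with the relationship listing above condensed to one edge map:

```python
# Entities and the dependencies between them, condensed to one edge map
edges = {
    "User": ["Account"],         # User creates Account
    "Account": ["Card"],         # Account creates Card
    "Card": ["Transaction"],     # Card enables Transaction
    "Transaction": ["Webhook"],  # Webhook triggers on Transaction
}

def downstream(node, graph):
    """Everything that transitively depends on `node` (what breaks if it goes)."""
    seen, stack = set(), list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen

# Everything downstream of Card: Transaction and Webhook
print(downstream("Card", edges))
```

A dedicated graph database earns its keep once relationships are numerous, typed, and queried in both directions; for a handful of entities, a dict like this is enough.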

Example 3: Multi-Agent Knowledge Sharing

Scenario: Support organization with 3 teams, 1 shared knowledge base.

Setup:

  • Tier 1 support: Answer common questions (FAQ only)
  • Tier 2 support: Troubleshoot (FAQ + troubleshooting)
  • Tier 3 support: Escalations (all knowledge)
  • Billing team: Handle refunds (billing knowledge only)

Knowledge structure:

docs/
├── shared/
│   ├── faq.md
│   ├── product-overview.md
│   └── glossary.md
├── tier2/
│   ├── troubleshooting.md
│   └── common-errors.md
├── tier3/
│   ├── system-architecture.md
│   └── internal-runbooks.md
└── billing/
    ├── refund-policy.md
    └── pricing.md

Agent setup:

class SupportAgent:
    def __init__(self, tier):
        self.tier = tier
        self.shared_kb = VectorIndex('docs/shared/')
        
        # Higher tiers inherit everything below them
        tier_dirs = {
            'tier1': [],
            'tier2': ['docs/tier2/'],
            'tier3': ['docs/tier2/', 'docs/tier3/'],
        }
        self.specialized_kbs = [VectorIndex(d) for d in tier_dirs[tier]]
    
    def answer(self, customer_query):
        # Shared knowledge first, then tier-appropriate extras
        context = self.shared_kb.search(customer_query, top_k=5)
        for kb in self.specialized_kbs:
            context += kb.search(customer_query, top_k=5)
        return self.agent.answer(customer_query, context=context)

# Usage
tier1_agent = SupportAgent('tier1')  # Shared knowledge only
tier2_agent = SupportAgent('tier2')  # Shared + troubleshooting
tier3_agent = SupportAgent('tier3')  # Full knowledge

Workflow:

Customer asks: "Why was I charged twice?"

Tier 1: Searches FAQ, finds generic refund article
        → Suggests contacting support
        → Creates ticket

Tier 2: Searches troubleshooting + FAQ
        → Looks up transaction logs
        → Can explain double-charge scenarios
        → May resolve or escalate

Tier 3: Full system access + advanced knowledge
        → Digs into billing code
        → Finds root cause
        → Implements fix + refund

Benefits:

  • Tier 1 stays focused on common issues
  • Tier 2 can self-service for common problems
  • Knowledge is progressively revealed
  • Easy to promote from Tier 1 → 2 (just point to broader KB)

Decision Framework

Choosing a knowledge management strategy:

Start here:
├─ Is your knowledge base < 400K words?
│  └─ YES: Use pure markdown (simple, Git-native)
│
├─ Is it 400K-2M words?
│  ├─ Semantic search important?
│  │  ├─ YES: Hybrid markdown + vector index
│  │  └─ NO: Multi-tier markdown
│
├─ Is it > 2M words?
│  ├─ Relationships matter?
│  │  ├─ YES: Add knowledge graph (Neo4j)
│  │  └─ NO: Vector database (Pinecone/Weaviate)
│
├─ Are there >100 data models with complex relationships?
│  └─ YES: Knowledge graph (mandatory)
│
└─ Do multiple agents need different knowledge?
   └─ YES: Implement access control + specialization
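The tree above can be encoded directly so the decision is reproducible as the knowledge base grows; a small sketch using this guide's thresholds (function and parameter names are illustrative):

```python
def choose_strategy(total_words, needs_semantic=False,
                    needs_relationships=False, complex_models=False):
    """Walk the decision tree top to bottom; first matching branch wins."""
    if complex_models:  # >100 interrelated data models
        return "knowledge graph"
    if total_words < 400_000:
        return "pure markdown"
    if total_words <= 2_000_000:
        return ("hybrid markdown + vector index" if needs_semantic
                else "multi-tier markdown")
    return "knowledge graph" if needs_relationships else "vector database"
```

Re-running this check against the measured word count each quarter turns "when should we migrate?" from a debate into a threshold crossing.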

References & Tools

Vector Databases:

  • Pinecone (managed, expensive)
  • Weaviate (self-hosted or managed)
  • Milvus (open-source, self-hosted)
  • SQLite-vec (embedded, free)

Embedding Models:

  • Local: all-MiniLM-L6-v2, all-mpnet-base-v2
  • API: OpenAI Embeddings, Cohere Embed, Voyage AI

Graph Databases:

  • Neo4j (most popular, great Cypher docs)
  • TigerGraph (performance-optimized)
  • ArangoDB (multi-model)

Search Tools:

  • Elasticsearch (full-featured, complex)
  • Solr (enterprise search)
  • Meilisearch (simple, fast)

Chunking & Indexing:

  • LangChain: Document loaders, splitters
  • LlamaIndex: Document indexing specialized for LLMs
  • Unstructured: PDF/document parsing

Summary

Knowledge management at scale requires deliberate architectural choices:

  1. Start simple: Markdown wikis work for <400K words
  2. Scale strategically: Choose multi-tier, hybrid, or graph based on size and constraints
  3. Maintain actively: Old knowledge is worse than no knowledge
  4. Search smartly: Combine keyword and semantic approaches
  5. Share wisely: Multi-agent systems need structured access
  6. Graph when needed: Relationships require explicit modeling

The jump from flat documents to vector indexes to knowledge graphs is not arbitrary—each layer solves real problems at specific scales. Begin where you are, transition when you feel pain, and measure improvements.


13. Real-World Scaling Case Study

This case study traces a knowledge base from 50 articles to 800, documenting the breaking points, migration decisions, and implementation code at each stage.

The Scenario

A developer tools company maintains internal knowledge for its coding assistant harness. The knowledge covers API documentation, integration guides, troubleshooting runbooks, and architecture decisions. Over 18 months, the knowledge base grew from a small wiki to a sprawling corpus that degraded search quality.

The symptom: “Our agent used to find the right answer immediately. Now it returns vaguely related articles or hallucinates details from outdated docs.”

Stage 1: 0-100 Articles (Months 1-6)

Architecture: Pure markdown wiki following the Karpathy pattern (see doc 04 for memory layer context).

knowledge/
├── raw/           # Source documents, meeting notes, specs
│   ├── api-v2-spec.md
│   ├── onboarding-notes.md
│   └── ... (85 files)
└── wiki/          # LLM-compiled, structured markdown
    ├── authentication.md
    ├── rate-limiting.md
    ├── error-codes.md
    └── ... (50 files)

How it worked:

  • Authors dropped raw sources into raw/
  • An LLM compiled them into clean wiki articles in wiki/
  • The agent loaded all of wiki/ into context at startup (~80K tokens)
  • Full-text search with simple keyword matching

Metrics:

  • Total size: ~120K words (well under 400K limit)
  • Search latency: <500ms (in-memory grep)
  • Search relevance: 92% (small corpus, most queries hit the right doc)
  • Context usage: 80K tokens out of 200K available — comfortable

What worked: Everything. The Karpathy pattern is excellent at this scale. Human-readable, Git-versioned, no infrastructure beyond the filesystem.

Stage 2: 100-400 Articles (Months 6-12)

What changed: The company added three new product lines, each with its own API, guides, and troubleshooting docs. The wiki grew from 50 to 250 compiled articles.

First signs of trouble:

Total wiki size: ~340K words
Context usage: 340K tokens — exceeds most model context windows
Search latency: 2.1s (still acceptable)
Search relevance: 74% (dropped 18 points)

The breaking point: The agent could no longer load all wiki articles into context. It had to selectively load, but keyword search returned 15-20 partially relevant articles for common queries like “how do I authenticate?”

Fix: Selective loading with topic indexes

# Added a lightweight topic index for selective loading
# Instead of loading all 250 articles, load only relevant ones

import json
from pathlib import Path

class TopicIndex:
    """Map topics to relevant wiki articles for selective loading."""

    def __init__(self, wiki_dir: str):
        self.wiki_dir = Path(wiki_dir)
        self.index = self._build_index()

    def _build_index(self) -> dict[str, list[str]]:
        """Build topic -> [article_paths] mapping from frontmatter."""
        index = {}
        for md_file in self.wiki_dir.glob("*.md"):
            topics = extract_frontmatter_topics(md_file)
            for topic in topics:
                index.setdefault(topic, []).append(str(md_file))
        return index

    def get_articles(self, query: str, max_articles: int = 10) -> list[str]:
        """Return article paths relevant to query, ranked by topic overlap."""
        query_terms = query.lower().split()
        scored = {}
        for topic, paths in self.index.items():
            for term in query_terms:
                if term in topic.lower():
                    for path in paths:
                        scored[path] = scored.get(path, 0) + 1

        ranked = sorted(scored.items(), key=lambda x: x[1], reverse=True)
        return [path for path, _ in ranked[:max_articles]]
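The extract_frontmatter_topics helper is assumed above; one minimal version, parsing only an inline `topics: [a, b]` list so no YAML dependency is needed:

```python
from pathlib import Path

def extract_frontmatter_topics(md_file: Path) -> list[str]:
    """Parse `topics: [a, b]` from a YAML frontmatter block, if present."""
    text = md_file.read_text()
    if not text.startswith("---"):
        return []  # no frontmatter block
    frontmatter = text.split("---", 2)[1]
    for line in frontmatter.splitlines():
        if line.strip().startswith("topics:"):
            raw = line.split(":", 1)[1].strip().strip("[]")
            return [t.strip() for t in raw.split(",") if t.strip()]
    return []
```

A real implementation would use a YAML parser and handle multi-line lists; the inline form keeps the tagging convention simple enough that authors actually follow it.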

Metrics after fix:

  • Context usage: ~40K tokens per query (loading 8-12 relevant articles)
  • Search relevance: 81% (improved from 74%, still below Stage 1)
  • Search latency: 1.8s
  • Maintenance cost: Added frontmatter tagging to all articles (~2 days of work)

Stage 3: 400-800 Articles (Months 12-18)

What changed: The company acquired a competitor and merged their documentation. The wiki ballooned to 650+ articles. Topic-based selective loading was no longer sufficient — too many articles shared the same topics, and keyword matching couldn’t distinguish “authentication for Product A” from “authentication for Product B.”

The numbers:

Total wiki size: ~780K words
Topic index entries: 45 topics, avg 14 articles per topic
Search relevance: 58% (unacceptable — agent hallucinating to fill gaps)
Context usage: 60K tokens (loading too many loosely related articles)
False positive rate: 35% (over a third of retrieved articles were wrong)

Decision: Transition to hybrid markdown + vector retrieval. Keep the wiki as source of truth, add embeddings for semantic search.

The Migration: Markdown to Hybrid

Step 1: Generate embeddings for all wiki articles

from sentence_transformers import SentenceTransformer
import sqlite3
import hashlib
from pathlib import Path

def migrate_wiki_to_hybrid(wiki_dir: str, db_path: str):
    """
    One-time migration: chunk wiki articles and store embeddings.
    Preserves original markdown files untouched.
    """
    model = SentenceTransformer("all-MiniLM-L6-v2")  # 22MB, runs locally
    wiki = Path(wiki_dir)

    db = sqlite3.connect(db_path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id TEXT PRIMARY KEY,
            source_file TEXT,
            chunk_index INTEGER,
            text TEXT,
            embedding BLOB,
            word_count INTEGER
        )
    """)

    for md_file in wiki.glob("*.md"):
        content = md_file.read_text()
        chunks = chunk_by_heading(content, max_tokens=400)

        for i, chunk in enumerate(chunks):
            chunk_id = hashlib.sha256(
                f"{md_file.name}:{i}".encode()
            ).hexdigest()[:16]
            embedding = model.encode(chunk).tobytes()

            db.execute(
                "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?, ?, ?)",
                (chunk_id, md_file.name, i, chunk, embedding, len(chunk.split()))
            )

    db.commit()
    db.close()


def chunk_by_heading(content: str, max_tokens: int = 400) -> list[str]:
    """Split markdown by headings, merge small sections, split large ones."""
    sections = []
    current = []
    current_len = 0

    for line in content.split("\n"):
        if line.startswith("#") and current_len > 50:
            sections.append("\n".join(current))
            current = [line]
            current_len = len(line.split())
        else:
            current.append(line)
            current_len += len(line.split())

            if current_len > max_tokens:
                sections.append("\n".join(current))
                current = []
                current_len = 0

    if current:
        sections.append("\n".join(current))

    return sections

Step 2: Build the hybrid search system

import hashlib
import sqlite3
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

class HybridKnowledgeBase:
    """
    Combines keyword search (fast, exact) with semantic search (slow, relevant).
    Markdown files remain the source of truth; vector index is derived.
    """

    def __init__(self, wiki_dir: str, db_path: str):
        self.wiki_dir = Path(wiki_dir)
        self.db_path = db_path
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self._load_index()

    def _load_index(self):
        """Load all chunks and embeddings from SQLite."""
        db = sqlite3.connect(self.db_path)
        rows = db.execute(
            "SELECT id, source_file, text, embedding FROM chunks"
        ).fetchall()
        db.close()

        self.chunks = []
        self.embeddings = []
        for chunk_id, source, text, emb_bytes in rows:
            self.chunks.append({
                "id": chunk_id,
                "source": source,
                "text": text,
            })
            self.embeddings.append(
                np.frombuffer(emb_bytes, dtype=np.float32)
            )
        self.embeddings = np.array(self.embeddings)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Hybrid search: keyword (weight 0.3) + semantic (weight 0.7)."""
        keyword_scores = self._keyword_search(query)
        semantic_scores = self._semantic_search(query)

        combined = {}
        for i, chunk in enumerate(self.chunks):
            cid = chunk["id"]
            kw = keyword_scores.get(cid, 0.0) * 0.3
            sem = semantic_scores.get(cid, 0.0) * 0.7
            combined[cid] = kw + sem

        ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)
        results = []
        for cid, score in ranked[:top_k]:
            chunk = next(c for c in self.chunks if c["id"] == cid)
            results.append({**chunk, "score": score})
        return results

    def _keyword_search(self, query: str) -> dict[str, float]:
        """Simple term-frequency scoring."""
        terms = query.lower().split()
        scores = {}
        for chunk in self.chunks:
            text_lower = chunk["text"].lower()
            hits = sum(1 for t in terms if t in text_lower)
            if hits > 0:
                scores[chunk["id"]] = hits / len(terms)
        return scores

    def _semantic_search(self, query: str) -> dict[str, float]:
        """Cosine similarity against all chunk embeddings."""
        query_vec = self.model.encode(query)
        similarities = np.dot(self.embeddings, query_vec) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_vec)
        )
        return {
            self.chunks[i]["id"]: float(sim)
            for i, sim in enumerate(similarities)
        }

    def refresh(self):
        """Re-embed new or changed files only (incremental update)."""
        db = sqlite3.connect(self.db_path)
        db.execute(
            "CREATE TABLE IF NOT EXISTS file_hashes "
            "(source_file TEXT PRIMARY KEY, hash TEXT)"
        )
        for md_file in self.wiki_dir.glob("*.md"):
            file_hash = hashlib.sha256(md_file.read_bytes()).hexdigest()[:16]
            row = db.execute(
                "SELECT hash FROM file_hashes WHERE source_file = ?",
                (md_file.name,)
            ).fetchone()
            if row and row[0] == file_hash:
                continue  # unchanged; skip re-embedding

            # New or changed file: drop stale chunks, re-chunk, re-embed
            db.execute("DELETE FROM chunks WHERE source_file = ?", (md_file.name,))
            chunks = chunk_by_heading(md_file.read_text())
            for i, chunk in enumerate(chunks):
                chunk_id = hashlib.sha256(
                    f"{md_file.name}:{i}".encode()
                ).hexdigest()[:16]
                embedding = self.model.encode(chunk).tobytes()
                db.execute(
                    "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?, ?, ?)",
                    (chunk_id, md_file.name, i, chunk,
                     embedding, len(chunk.split()))
                )
            db.execute(
                "INSERT OR REPLACE INTO file_hashes VALUES (?, ?)",
                (md_file.name, file_hash)
            )
        db.commit()
        db.close()
        self._load_index()

Metrics at Each Stage

| Metric | Stage 1 (0-100) | Stage 2 (100-400) | Stage 3 (400-800) | Stage 3 + Hybrid |
|---|---|---|---|---|
| Articles | 50 | 250 | 650 | 650 |
| Total words | 120K | 340K | 780K | 780K |
| Search latency | <500ms | 2.1s | 4.8s | 800ms |
| Search relevance | 92% | 74% → 81% | 58% | 89% |
| False positive rate | 5% | 18% | 35% | 8% |
| Context tokens/query | 80K (all) | 40K (selective) | 60K (noisy) | 8K (precise) |
| Infrastructure | Filesystem | Filesystem + index | Filesystem + index | SQLite + embeddings |
| Migration effort | N/A | 2 days | N/A | 1 week |

Key Takeaways

  1. The Karpathy pattern works brilliantly until ~100 articles. Don’t over-engineer at this stage — markdown wiki is the right answer.

  2. 100-400 articles is the danger zone. You feel the pain but it’s not bad enough to force a migration. Topic indexes buy time, but semantic search is coming whether you plan for it or not.

  3. The hybrid approach preserves your investment in markdown. You don’t throw away the wiki — you add a vector layer on top. Git history, human readability, and editability are preserved.

  4. Incremental embedding is essential. Re-embedding 800 articles on every change is wasteful. Track file hashes, embed only what changed.

  5. Weight semantic search higher than keywords (0.7 vs 0.3). At scale, users search by intent (“how do I limit API calls?”) not by exact terms (“rate_limit_exceeded”). Semantic search handles this naturally.

  6. Context tokens per query dropped 10x (80K to 8K) while relevance only dropped 3 points (92% to 89%). Precise retrieval beats brute-force context stuffing.

For memory layer integration patterns (working memory, episodic memory, semantic memory), see doc 04 (Memory Systems). The hybrid search system described here slots into the semantic memory layer.


Validation Checklist

How do you know you got this right?

Performance Checks

  • Knowledge search latency: <2 sec for markdown, <5 sec for vector, <1 sec for graph
  • Indexing time reasonable: one-time embedding <1 hour for 1M words
  • Memory usage: vector index <5GB for 1M words (embedded models)
  • Staleness acceptable: knowledge updates reflected within 24 hours

Implementation Checks

  • Current strategy chosen: markdown/multi-tier/hybrid/graph decided and documented
  • Knowledge base size measured: growth tracked month-over-month
  • Search tested on 10+ representative queries: recall >80%
  • Chunking strategy verified: relevant documents returned for edge cases
  • Embedding quality checked: similar docs ranked together
  • Multi-agent access control working: agents see only intended knowledge
  • Deduplication implemented: no duplicate information taking space

Integration Checks

  • Harness agent can search knowledge base: integration with perception layer working
  • Results flow into context: agent can reason over retrieved documents
  • Update mechanism working: new knowledge added without full reindex
  • Fallback graceful: search failure doesn’t crash agent
  • Cost tracking: know per-query cost (embeddings, graph traversal, etc)

Common Failure Modes

  • Knowledge bloat: No cleanup; size grows unbounded, search becomes slow
  • Stale information: Old docs contradicting new ones; no version control
  • Poor relevance: Search returns noise; chunking strategy wrong
  • Expensive embeddings: Querying too often; implement caching
  • Graph inconsistency: Relationships contradictory or out-of-sync with documents
  • Access control broken: Agent sees knowledge it shouldn’t have

Sign-Off Criteria

  • Knowledge base size tracked and decision point identified (when to scale)
  • Search latency acceptable for use case (interactive <2sec, batch <5sec)
  • Quality validation: spot-check 20+ search results, >80% relevant
  • Tested at scale: if expecting 1M words, tested with 500K+
  • Maintenance plan clear: who updates, how often, cleanup schedule

See Also

  • Doc 04 (Memory Systems): Knowledge management complements agent memory layers
  • Doc 14 (Advanced Patterns): Knowledge graphs for complex reasoning systems
  • Doc 20 (Integration Patterns): Exposing knowledge search via APIs