Reference

Knowledge Management at Scale

Scaling beyond markdown wikis — hybrid search systems, knowledge graphs, real-world scaling case study from 50 to 800 articles.

Managing knowledge bases for large harnesses requires strategic planning and architectural choices. This guide addresses the critical gaps that emerge as knowledge bases grow beyond simple markdown wikis, covering everything from markdown scaling limits to hybrid vector systems and multi-agent knowledge sharing.

Audience: Knowledge engineers, architects, systems engineers managing large knowledge bases.


1. The Markdown Wiki Pattern (Revisited)

The markdown wiki is the natural starting point for knowledge management in harnesses. It’s simple, version-controlled, and works beautifully—until it doesn’t.

Performance by Scale

| Knowledge Base Size | Characteristics | Typical Use Cases |
| --- | --- | --- |
| <400K words (~100 articles) | Instant search, full context, Git-native | Single-agent harnesses, domain-specific bots |
| 400K-1M words (~250-600 articles) | Noticeable search latency (2-5s), context windows strained | Multi-team coordination, medium enterprises |
| >1M words (~600+ articles) | Unusable performance (10-30s searches), full context impossible | Enterprise-scale systems, multi-domain knowledge |

Why Markdown Works Until It Doesn’t

Strengths:

  • Native Git support (history, diffs, blame)
  • Human-readable, easy to edit
  • Simple to embed in prompts
  • No external infrastructure
  • Works offline

Failure Modes:

  1. Search scalability: Full-text search degrades as the wiki grows; past the 1M-word mark, searches can take tens of seconds
  2. Context bloat: Jamming 1M words into a prompt leaves no room for actual work
  3. Relevance: Keyword search finds 50 loosely related documents, none perfect
  4. Staleness: Large wikis accumulate outdated information faster than they’re fixed
  5. Maintenance overhead: Duplicate information, broken links, inconsistent terminology

When to Transition

Stay with markdown if:

  • Total size <400K words
  • Search response times under 2s are acceptable
  • Knowledge is stable (not changing hourly)
  • Scope is domain-specific (narrow vocab)

Transition if:

  • Approaching 1M words total
  • Search latency above 5s is unacceptable
  • Knowledge changes frequently
  • Multi-domain coverage (wide, shallow vocab)
  • Agents need semantic understanding, not just keywords
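A quick way to check these thresholds is to count words across the wiki. The sketch below is a minimal, hypothetical helper; the function names are illustrative, and the thresholds mirror the criteria above:

```python
import os
from glob import glob

def wiki_word_count(doc_dir):
    """Total word count across all markdown files in a wiki."""
    total = 0
    for path in glob(os.path.join(doc_dir, "**", "*.md"), recursive=True):
        with open(path, encoding="utf-8") as f:
            total += len(f.read().split())
    return total

def transition_advice(words):
    """Map the size thresholds above onto a recommendation."""
    if words < 400_000:
        return "stay with markdown"
    if words < 1_000_000:
        return "plan a transition"
    return "transition now"
```

Running this in CI makes the transition decision visible before latency becomes a complaint.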

2. Knowledge Base Scaling Strategies

Three proven patterns exist for scaling beyond pure markdown. Each addresses different constraints.

Strategy A: Multi-Tier Markdown (Hierarchical Summaries)

Architecture: Organize knowledge in three tiers—raw data, summaries, and indexes.

docs/
├── raw/              # Full original documents
│   ├── api-v1.md    # Complete reference
│   └── api-v2.md
├── summaries/       # Distilled versions
│   ├── api-quick-start.md
│   ├── api-common-patterns.md
│   └── api-faq.md
└── indexes/         # Navigation layer
    ├── README.md    # Overview
    ├── by-topic.md  # Hierarchical toc
    └── glossary.md  # Terms

How it works:

  1. Maintain raw docs (Git-controlled, exhaustive)
  2. Create summaries (80% knowledge, 20% bulk)
  3. Serve summaries to agents (faster search, smaller context)
  4. Link back to raw for deep dives

When to use:

  • Knowledge size 400K-2M words
  • High-accuracy requirements
  • Mixed stability (some docs change often, others rarely)
  • Teams maintaining knowledge manually

Implementation:

# Agent prompt pattern
- Load: /docs/indexes/README.md          (entry point)
- Search: /docs/summaries/*              (quick answer)
- Deep-dive: /docs/raw/* (link provided) (complete info)
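The index → summaries → raw flow can be sketched as a small loader. This is a hypothetical illustration assuming the docs/ layout shown earlier; `quick_answer` does a naive substring match over the summaries tier only:

```python
from pathlib import Path

class TieredKnowledge:
    """Serve the summaries tier first; fall back to raw docs on request."""

    def __init__(self, root):
        self.root = Path(root)

    def entry_point(self):
        """Load the navigation layer (the agent's starting context)."""
        return (self.root / "indexes" / "README.md").read_text()

    def quick_answer(self, topic):
        """Search the small summaries tier only (fast, cheap on tokens)."""
        hits = []
        for path in (self.root / "summaries").glob("*.md"):
            text = path.read_text()
            if topic.lower() in text.lower():
                hits.append({"source": str(path), "text": text})
        return hits

    def deep_dive(self, name):
        """Follow a link into the raw tier for the complete document."""
        return (self.root / "raw" / name).read_text()
```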

Pros:

  • Still Git-native
  • Controlled token usage
  • Explicit relevance (human curation)
  • Fast to implement

Cons:

  • Manual maintenance burden grows
  • Summarization is lossy
  • Doesn’t scale past 2-3M words
  • No semantic understanding

Strategy B: Hybrid Markdown + Vector Index

Architecture: Keep markdown, add vector embeddings for semantic search.

docs/
├── *.md              # Unmodified markdown
└── .vectorindex/
    ├── embeddings.db # Vector store (SQLite + vectors)
    ├── chunks.json   # Chunk→source mapping
    └── index.json    # Metadata

How it works:

  1. Chunk markdown into ~300-500 token segments
  2. Generate embeddings for each chunk (using OpenAI, Anthropic, or local model)
  3. On search: convert query to embedding, find nearest neighbors
  4. Return top-k relevant chunks + links to full documents

When to use:

  • Knowledge size 1-10M words
  • Semantic search critical (not just keywords)
  • Knowledge doesn’t change hourly
  • Budget for embedding model calls

Implementation (Python example):

from sentence_transformers import SentenceTransformer
import numpy as np

# Chunk documents
chunks = chunk_markdown(docs, chunk_size=400, overlap=50)

# Generate embeddings (one-time)
model = SentenceTransformer('all-MiniLM-L6-v2')  # Fast, local
embeddings = [model.encode(chunk['text']) for chunk in chunks]

# Store with faiss or sqlite-vec
index = VectorIndex(embeddings, chunks)
index.save('.vectorindex/')

# Query time
query = "how do I configure rate limiting?"
query_vec = model.encode(query)
top_k = index.search(query_vec, k=5)
results = [chunks[i] for i in top_k]

Pros:

  • Markdown still Git-native
  • Semantic search (understands intent)
  • Scales to 10M+ words
  • Works offline (if using local embeddings)
  • No external database needed (SQLite-based)

Cons:

  • Embedding generation cost (one-time, or on updates)
  • Chunk boundaries can split concepts
  • Requires embedding model (third-party or local)
  • Index must be kept in sync

Strategy C: Full Vector Database

Architecture: Replace markdown search with dedicated vector DB. Markdown becomes source-of-truth; vector DB is derived index.

Markdown (Git)  →  ETL Pipeline  →  Vector DB  →  Agent Queries
api.md          Chunk            Pinecone       Semantic search
faq.md          Embed            Weaviate       + Metadata filter
examples.md     Index            Milvus

When to use:

  • Knowledge size >10M words
  • Semantic + metadata search essential
  • Real-time updates (new docs hourly)
  • Multi-agent access (shared infrastructure)
  • Can afford managed service or self-hosted DB

Implementation pattern:

# ETL: Markdown → Vector DB
def sync_knowledge_base(markdown_dir, vector_db):
    docs = load_markdown(markdown_dir, track_version=True)
    
    for doc in docs:
        chunks = chunk(doc['text'], size=400)
        for i, chunk in enumerate(chunks):
            vector_db.upsert({
                'id': f"{doc['id']}_chunk_{i}",
                'text': chunk,
                'embedding': embed(chunk),
                'metadata': {
                    'source': doc['path'],
                    'category': doc['category'],
                    'version': doc['version'],
                    'updated_at': doc['mtime'],
                }
            })

# Query: Use metadata filters + semantic search
results = vector_db.search(
    query_embedding=embed(user_query),
    filter={'version': '>=2.0', 'category': 'api'},
    top_k=10
)

Pros:

  • Scales to 100M+ words
  • Real-time updates
  • Metadata filtering (narrow search space)
  • Cloud-native (easy multi-agent sharing)
  • Advanced search (hybrid keyword + semantic)

Cons:

  • Requires external service (cost, latency, dependencies)
  • Complexity (ETL pipeline, index management)
  • Data residency concerns (if SaaS)
  • More moving parts to maintain

Decision Matrix

| Criteria | Strategy A | Strategy B | Strategy C |
| --- | --- | --- | --- |
| Size | <2M words | 1-10M | >10M |
| Search latency | <5s | <2s | <1s |
| Semantic search? | No | Yes | Yes |
| Infrastructure | Git only | SQLite | Pinecone/Weaviate |
| Update frequency | Manual | Batch | Real-time |
| Cost | Free | ~$50-200/mo | $200-2000+/mo |
| Implementation time | 1 week | 2-3 weeks | 4-6 weeks |

3. Incremental Knowledge Updates

Knowledge isn’t static. Teams discover bugs, refine explanations, and add new features constantly. Updates must happen without breaking running agents.

Zero-Downtime Updates

Pattern: Version the knowledge base, switch at query time

docs/
├── v1/
│   ├── api.md
│   └── guides.md
├── v2/
│   ├── api.md
│   └── guides.md
└── CURRENT_VERSION  # Points to v2

Implementation:

# Agent loads version dynamically
class KnowledgeBase:
    def load(self):
        current = read('docs/CURRENT_VERSION').strip()
        self.path = f'docs/{current}/'
        self.docs = load_markdown(self.path)
    
    def search(self, query):
        return semantic_search(query, self.docs)

# Update process
# 1. Create new version directory
copy_dir('docs/v2', 'docs/v3')
# 2. Edit docs/v3/* as needed
# 3. Write new version pointer
write('docs/CURRENT_VERSION', 'v3')
# 4. Commit (shell):
#    git commit -m "docs: release v3 of knowledge base"

Adding Knowledge Without Recompilation

Key principle: Knowledge should be loaded at query time, not at agent initialization.

Anti-pattern (compiles knowledge into weights):

# ❌ Don't do this
agent = HarnessAgent(knowledge=hardcoded_docs)
# Now you need to retrain the agent to add new knowledge

Pattern (external lookup):

# ✓ Do this
agent = HarnessAgent()

def answer(query):
    context = knowledge_base.search(query)  # Loaded at query time
    return agent.answer(query, context=context)

Versioning Knowledge

Semantic versioning for knowledge:

  • v1.0: Initial knowledge base
  • v1.1: Bug fixes, clarifications (backward compatible)
  • v2.0: Major changes, removed docs (breaking)

Track breaking changes:

# CHANGELOG.md
## v2.0 (2025-06-15)
### Breaking Changes
- Removed: Old authentication flow (replaced by OAuth2)
- Changed: API endpoint /v1/users → /v2/accounts
- Deprecated: legacy-config.yaml format

### Migration Guide
See upgrade-v1-to-v2.md

Backward Compatibility

Pattern: Support multiple versions in parallel

class MultiVersionKnowledge:
    def __init__(self):
        self.v1 = load_markdown('docs/v1/')
        self.v2 = load_markdown('docs/v2/')
    
    def answer(self, query, version='v2'):
        if version == 'v2':
            return self.v2.search(query)
        elif version == 'v1':
            return self.v1.search(query)
        else:
            raise ValueError(f"Unknown version: {version}")

# Agent can specify version
response = knowledge.answer(query, version=request.version)

Use cases:

  • Legacy systems still on v1 API
  • Gradual migration (v1 → v2)
  • A/B testing different knowledge bases
  • Backward compatibility windows

4. Conflicting Information Resolution

Large knowledge bases inevitably contain contradictions. One document says “do X”, another says “don’t do X”. Which is correct?

Sources of Conflict

  1. Temporal: Old docs say one thing, new docs say another
  2. Authoritative: Engineering docs contradict product docs
  3. Scope: “Works for API v2” vs “Works for all versions”
  4. Interpretation: Ambiguous requirements lead to different conclusions

Manual Conflict Resolution

Process: Explicit review, authoritative decision, tracking

## Conflict: Authentication Method
### Issue: #347

**Claim 1** (api.md, line 45):
"Use Basic Auth with username:password"

**Claim 2** (guides/modern-auth.md, line 12):
"Basic Auth is deprecated. Use OAuth2 instead."

**Investigation**:
- Basic Auth still works, but not recommended for new integrations
- OAuth2 is preferred for security

**Resolution**:
- **Authoritative**: Engineering team decision
- **Winner**: OAuth2 (Claim 2 is correct)
- **Action**: Deprecate Basic Auth docs, add migration guide

**Tracking**:
```yaml
version: v2.5
resolution_id: conflict_auth_001
resolved_by: Security Team Lead
resolved_at: 2025-06-15
decision: oauth2_preferred
migration_deadline: 2025-12-31
```

Automated Conflict Detection

Pattern: Detect contradictions programmatically

import re

# Extract claims from documents
def extract_claims(doc):
    """Find factual statements (simplified example)"""
    patterns = [
        r'(?:use|must|should|can)\s+(\w+)',  # "use OAuth2"
        r'(?:deprecated|legacy|old)\s+(\w+)',  # "deprecated Basic Auth"
    ]
    claims = []
    for pattern in patterns:
        for match in re.finditer(pattern, doc):
            claims.append({
                'text': match.group(0),
                'entity': match.group(1),
                'confidence': 0.8,
            })
    return claims

# Detect contradictions
def detect_conflicts(claims):
    """Find opposing claims about same entity"""
    by_entity = {}
    for claim in claims:
        entity = claim['entity']
        by_entity.setdefault(entity, []).append(claim)
    
    conflicts = []
    for entity, entity_claims in by_entity.items():
        if len(entity_claims) > 1:
            # Check if claims contradict
            texts = [c['text'] for c in entity_claims]
            if contradicts(texts):  # placeholder predicate, e.g. one claim says "use X", another "X deprecated"
                conflicts.append({
                    'entity': entity,
                    'claims': entity_claims,
                    'severity': 'high',
                })
    return conflicts

# Alert humans to review
for conflict in detect_conflicts(all_claims):
    print(f"⚠️ Conflict detected: {conflict['entity']}")
    print(f"   Claims: {[c['text'] for c in conflict['claims']]}")

Resolution Tracking

Maintain a decisions log:

# docs/DECISIONS.yaml
decisions:
  - id: auth_method_001
    question: "Should new integrations use Basic Auth or OAuth2?"
    claims:
      - text: "Use Basic Auth"
        source: api.md:45
        version: v1.0
      - text: "Use OAuth2"
        source: guides/modern-auth.md:12
        version: v2.0
    resolution: "OAuth2 for new integrations, Basic Auth deprecated"
    decided_by: Security Team
    decided_at: 2025-06-15
    expires_at: 2025-12-31
    
  - id: rate_limit_format_001
    question: "What HTTP header for rate limit remaining?"
    claims:
      - text: "X-RateLimit-Remaining"
        source: v1/api.md
      - text: "RateLimit-Remaining"
        source: v2/api.md
    resolution: "RateLimit-* headers (follows RFC 6585)"
    decided_by: API Team
    decided_at: 2025-03-01
    migration_path: "v1 still supports X-RateLimit-*, v2+ requires RateLimit-*"

5. Knowledge Base Maintenance

Knowledge decays. Docs become outdated, links break, examples fail. Systematic maintenance prevents slow degradation.

Stale Content Detection

Pattern: Track document age, flag old content

import os
from glob import glob
from datetime import datetime

def find_stale_docs(doc_dir, max_age_days=180):
    """Flag docs not updated in 6+ months"""
    stale = []
    for filepath in glob(f'{doc_dir}/**/*.md', recursive=True):
        mtime = os.path.getmtime(filepath)
        age_days = (datetime.now() - datetime.fromtimestamp(mtime)).days
        
        if age_days > max_age_days:
            stale.append({
                'file': filepath,
                'age_days': age_days,
                'last_modified': datetime.fromtimestamp(mtime),
            })
    
    return sorted(stale, key=lambda x: x['age_days'], reverse=True)

# Flag in CI
stale = find_stale_docs('docs/', max_age_days=180)
for doc in stale:
    print(f"⚠️ Stale: {doc['file']} ({doc['age_days']} days old)")
    print(f"   Last updated: {doc['last_modified'].date()}")

Pruning Obsolete Information

Example: Removing deprecated API docs

# Before
docs/
├── api/
│   ├── v1.md  (deprecated, 2023)
│   ├── v2.md  (current)
│   └── v3.md  (beta)
└── CHANGELOG.md

# After (if v1 truly unused)
docs/
├── api/
│   ├── v2.md  (current)
│   └── v3.md  (beta)
├── archive/
│   └── v1-2023-12-15.md.bak  (kept for history)
└── CHANGELOG.md
  (entry: "Removed: v1 API docs, archived to archive/")

Decision criteria:

  • Is anyone still using this version? (Check logs, support tickets)
  • Can this be archived instead of deleted?
  • Does regulatory/compliance require keeping history?
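Archiving rather than deleting can be automated. A minimal sketch, assuming the archive/ layout and the date-stamped .md.bak naming convention shown above:

```python
import shutil
from datetime import date
from pathlib import Path

def archive_doc(doc_path, archive_dir="docs/archive"):
    """Move a deprecated doc into the archive with a date-stamped
    name (e.g. v1.md -> archive/v1-2023-12-15.md.bak)."""
    src = Path(doc_path)
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamped = f"{src.stem}-{date.today().isoformat()}{src.suffix}.bak"
    dest = dest_dir / stamped
    shutil.move(str(src), str(dest))
    return dest
```

Pair this with a CHANGELOG entry so the removal is discoverable later.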

Refreshing Out-of-Date Content

Process: Explicit review and update cycles

# docs/REVIEW_SCHEDULE.md

## Q2 2025 Reviews (April-June)
- [ ] Authentication docs (updated Q4 2024, OAuth2 changes)
- [ ] Rate limiting (updated Q1 2024, service limits changed)
- [ ] Deployment guide (updated Q1 2025, should be stable)

## Q3 2025 Reviews (July-September)
- [ ] SDKs (check for new versions)
- [ ] Examples (run all code samples)
- [ ] FAQ (check support tickets for new patterns)

Template for updates:

---
last_reviewed: 2025-04-15
review_cycle: quarterly
status: current  # current | outdated | deprecated
version_applies: v2.0+
---

# Rate Limiting Guide

Last updated: 2025-04-15 (reviewed, current)
Updated by: API Team
Changes: Added RateLimit-Reset header description

[content...]

Archival Strategy

Keep full history, organize by relevance:

docs/
├── current/           # What's live now
│   ├── api.md
│   └── guides/
├── deprecated/        # Still documented, but don't use
│   ├── legacy-auth.md
│   └── old-format.md
└── archive/          # Historical, for reference only
    └── v1-2023/
        ├── api.md
        └── changelog-2023.md

Example metadata:

# In each deprecated doc
---
status: deprecated
deprecated_since: 2024-06-01
removal_date: 2025-06-01  # Planned
replacement: oauth2.md
---

6. Hybrid Search Systems

Different search approaches solve different problems. Most large systems need both.

Keyword Search (What’s familiar)

How it works: Index words, find documents containing query words

Pros:

  • Fast (simple hash lookups)
  • Predictable (easy to debug)
  • Works for exact matches (“OAuth2”, “rate_limit_exceeded”)
  • Low latency

Cons:

  • Misses synonyms (“token-based auth” vs “OAuth”)
  • Sensitive to phrasing (plural forms, tense)
  • Noisy results (all documents with keyword)

When to use:

  • Exact technical terms (“v2 API”, “RateLimit-Remaining”)
  • User knows what they’re looking for
  • Latency critical (<100ms)
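The "index words, find documents containing query words" mechanic can be shown with a minimal inverted index. This is an illustrative sketch, not a production search engine (no stemming, ranking, or phrase support):

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def keyword_search(index, query):
    """Return ids of documents containing ALL query words (AND semantics)."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = index.get(words[0], set()).copy()
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

The lookups are simple set intersections, which is why keyword search stays fast and predictable.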

Semantic Search (What makes sense)

How it works: Convert query and documents to embeddings (dense vectors), find similar vectors

Pros:

  • Understands intent (“how do I limit requests?” matches “rate limiting guide”)
  • Handles synonyms and rephrasing
  • Relevance ranking (nearest vectors = most relevant)
  • Works with fuzzy/partial knowledge

Cons:

  • Slower (embedding generation + vector search)
  • Requires embedding model (third-party or local)
  • Less obvious why a result appeared
  • Sensitive to out-of-domain queries

When to use:

  • User doesn’t know exact terminology
  • Intent-based search (“how do I add auth?”)
  • Learning mode (exploring new domain)

Hybrid Approach

Pattern: Keyword + semantic search combined

def hybrid_search(query, knowledge_base, top_k=5):
    """
    Combine keyword and semantic results.
    1. Get top keyword matches (fast, exact)
    2. Get top semantic matches (slow, relevant)
    3. Merge and rank by score
    """
    
    # Keyword search (fast, exact matches)
    keyword_results = knowledge_base.keyword_search(query, top_k=10)
    keyword_scores = {r['id']: r['score'] for r in keyword_results}
    
    # Semantic search (thorough, intent-based)
    semantic_results = knowledge_base.semantic_search(query, top_k=10)
    semantic_scores = {r['id']: r['score'] for r in semantic_results}
    
    # Merge: boost documents that appear in both
    merged = {}
    for doc_id in set(keyword_scores.keys()) | set(semantic_scores.keys()):
        keyword_score = keyword_scores.get(doc_id, 0) * 0.4
        semantic_score = semantic_scores.get(doc_id, 0) * 0.6
        merged[doc_id] = keyword_score + semantic_score
    
    # Return top results
    top = sorted(merged.items(), key=lambda x: x[1], reverse=True)[:top_k]
    return [knowledge_base.get(doc_id) for doc_id, _ in top]

Choosing an embedding model:

| Model | Speed | Quality | Size | Cost |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 (local) | Fast | Good | 22 MB | Free |
| all-mpnet-base-v2 (local) | Medium | Very good | 438 MB | Free |
| OpenAI text-embedding-3-small | Slow | Excellent | Cloud | $0.02/1M tokens |
| OpenAI text-embedding-3-large | Slow | Excellent | Cloud | $0.13/1M tokens |

(Anthropic does not offer a first-party embedding model; its documentation points Claude-based systems to third-party embedding providers.)

Chunking strategies:

def chunk_markdown(doc_text, chunk_size=400, overlap=50):
    """
    Split long documents into overlapping chunks.

    Too small (100 tokens): loses context
    Too large (800 tokens): mixes unrelated concepts
    Overlap prevents losing info at chunk boundaries
    """

    sentences = split_sentences(doc_text)
    chunks = []
    current = []
    current_tokens = 0

    for sentence in sentences:
        sent_tokens = count_tokens(sentence)

        if current_tokens + sent_tokens > chunk_size:
            # Save chunk
            chunks.append(' '.join(current))

            # Start new chunk, carrying over trailing sentences
            # until roughly `overlap` tokens are retained
            kept, kept_tokens = [], 0
            for s in reversed(current):
                t = count_tokens(s)
                if kept_tokens + t > overlap:
                    break
                kept.insert(0, s)
                kept_tokens += t
            current = kept
            current_tokens = kept_tokens

        current.append(sentence)
        current_tokens += sent_tokens

    if current:
        chunks.append(' '.join(current))

    return chunks

# Example
doc = load_markdown('api.md')
chunks = chunk_markdown(doc, chunk_size=400, overlap=50)
embeddings = [embed(chunk) for chunk in chunks]

7. Knowledge Graph Patterns

When documents alone aren’t enough, model relationships explicitly.

When to Move Beyond Flat Documents

Red flags for flat markdown:

  • Lots of “see also” links (suggests implicit structure)
  • Questions like “what APIs use this data model?”
  • Relationships: Entity A (e.g., User) relates to B (e.g., Account)
  • Traversal: Want to follow chains (User → Account → API Key)

Example: E-commerce knowledge base

Problem: Find all operations that require authentication
Markdown approach: Search for "authentication" in all docs (gets noise)
Graph approach: Query: AuthenticationRequired -[:relatesTo]-> Operation

Entity-Relationship Patterns

Represent domain concepts as entities with relationships:

entities:
  # Concept entities
  APIEndpoint:
    name: API endpoint
    examples: ["/users", "/accounts/{id}"]
  
  DataModel:
    name: Data structure
    examples: ["User", "Account", "AuthToken"]
  
  AuthenticationMethod:
    name: Auth approach
    examples: ["OAuth2", "BasicAuth"]

relationships:
  - type: "endpoint_uses_model"
    from: APIEndpoint
    to: DataModel
    example: "POST /users receives User model"
  
  - type: "endpoint_requires_auth"
    from: APIEndpoint
    to: AuthenticationMethod
    example: "GET /users requires OAuth2"
  
  - type: "model_contains_field"
    from: DataModel
    to: Field
    example: "User.id is required string"

Graph Traversal

Navigate relationships to answer complex questions:

Query: What endpoints can an unauthenticated user call?

Traversal:
1. Find all APIEndpoints
2. Filter where NOT (endpoint_requires_auth -> *)
3. Return: [GET /status, POST /login, GET /docs]

---

Query: If we remove OAuthToken data model, what breaks?

Traversal:
1. Find DataModel("OAuthToken")
2. Find all APIEndpoints that endpoint_uses_model -> OAuthToken
3. Find all AuthenticationMethods that auth_produces -> OAuthToken
4. Return: [breaking endpoints, auth methods that fail]
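The two traversals above can be expressed against a tiny in-memory graph. A minimal sketch; the entity and relation names are hypothetical, taken from the examples in this section:

```python
class KnowledgeGraph:
    """Tiny in-memory graph: typed edges between named entities."""

    def __init__(self):
        self.edges = []  # (source, relation, target) triples

    def add(self, source, relation, target):
        self.edges.append((source, relation, target))

    def targets(self, source, relation):
        """All targets reachable from `source` via `relation`."""
        return {t for s, r, t in self.edges if s == source and r == relation}

    def sources(self, relation, target):
        """All sources pointing at `target` via `relation`."""
        return {s for s, r, t in self.edges if r == relation and t == target}

def unauthenticated_endpoints(graph, endpoints):
    """Query 1: endpoints with no endpoint_requires_auth edge."""
    return {e for e in endpoints if not graph.targets(e, "endpoint_requires_auth")}

def removal_impact(graph, model):
    """Query 2: endpoints that break if `model` is removed."""
    return graph.sources("endpoint_uses_model", model)
```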

Knowledge Graph Databases

When to use a graph database:

  • >100 entity types or >1,000 relationships
  • Complex queries (multi-hop traversals)
  • Real-time insights needed
  • Multiple agents querying same knowledge

Popular options:

  • Neo4j: Most mature, Cypher query language
  • Amazon Neptune: AWS managed
  • TigerGraph: Performance-optimized, supports real-time analytics
  • ArangoDB: Multi-model (documents + graphs)

Example Neo4j setup:

// Define nodes
CREATE (user:DataModel {name: "User"})
CREATE (oauth:AuthMethod {name: "OAuth2"})
CREATE (endpoint:APIEndpoint {name: "GET /users"})

// Define relationships
CREATE (endpoint)-[:requires_auth]->(oauth)
CREATE (endpoint)-[:returns_model]->(user)

// Query: Find all auth methods used by any endpoint
MATCH (auth:AuthMethod)<-[:requires_auth]-(endpoint:APIEndpoint)
RETURN DISTINCT auth.name

// Query: Find endpoints that return a specific data model
MATCH (endpoint:APIEndpoint)-[:returns_model]->(m:DataModel {name: "User"})
RETURN endpoint.name

8. Curation & Quality Control

Knowledge quality directly impacts agent quality. Garbage in, garbage out.

Who Maintains Knowledge?

Models:

  1. Central team (dedicated knowledge managers)

    • Pro: Consistent, high quality
    • Con: Slow updates, bottleneck
    • Best for: Large organizations, critical knowledge
  2. Domain experts (subject matter experts)

    • Pro: Accurate, fast updates
    • Con: Inconsistent style, variable quality
    • Best for: Technical knowledge, multiple domains
  3. Hybrid (domain experts + QA reviewers)

    • Pro: Fast + accurate + consistent
    • Con: Coordination overhead
    • Best for: Growing organizations

Example policy:

# CONTRIBUTION_POLICY.md

ownership:
  api-documentation:
    primary: API Team
    secondary: Engineering Team Lead
    review_required: true
  
  getting-started:
    primary: Product Team
    secondary: API Team
    review_required: true
  
  internal-runbooks:
    primary: Ops Team
    secondary: None
    review_required: false

process:
  new_knowledge:
    - Author writes/edits
    - Assigned reviewer checks (48h deadline)
    - Author addresses feedback
    - Merged to main branch
  
  quality_review:
    - Quarterly: All docs reviewed by primary owner
    - Bi-annual: Cross-team review for consistency

Quality Standards

Checklist for knowledge acceptance:

# Knowledge Quality Checklist

## Accuracy
- [ ] Claims are current and correct
- [ ] Examples have been tested (code runs, URLs work)
- [ ] No contradictions with existing docs

## Completeness
- [ ] Covers happy path + common errors
- [ ] Includes version info (what systems/versions?)
- [ ] Links to related knowledge

## Clarity
- [ ] No jargon without explanation
- [ ] Active voice preferred
- [ ] Short paragraphs (3 sentences max)

## Maintenance
- [ ] Author identified (who maintains this?)
- [ ] Review cycle defined (how often updated?)
- [ ] Stakeholders identified (who should know if this changes?)

## Structure
- [ ] Follows template for doc type
- [ ] Heading hierarchy is logical
- [ ] Code examples use syntax highlighting

Peer Review Process

Pattern: Two-tier review (technical + editorial)

# Workflow: Pull Request to knowledge base
# 1. Author submits new/edited docs
# 2. Technical reviewer (domain expert) approves accuracy
# 3. Editorial reviewer (writing expert) approves clarity
# 4. Both approvals required to merge

class KnowledgeReview:
    def __init__(self, pr):
        self.pr = pr
        self.technical_approval = False
        self.editorial_approval = False
    
    def is_approved(self):
        return self.technical_approval and self.editorial_approval
    
    def request_technical_review(self, reviewer):
        """Domain expert verifies correctness"""
        pass
    
    def request_editorial_review(self, reviewer):
        """Writing expert verifies clarity and style"""
        pass

Automated Quality Checks

Lint knowledge base in CI:

#!/bin/bash
# scripts/validate-knowledge.sh
shopt -s globstar  # needed for the docs/**/*.md loop below

echo "Validating knowledge base..."

# Check 1: No broken links
echo "Checking for broken internal links..."
rg '\[.*\]\((docs/.*?\.md)\)' docs/ | while read match; do
    file=$(echo $match | grep -oE 'docs/[^)]+\.md')
    [ ! -f "$file" ] && echo "❌ Broken link: $file"
done

# Check 2: Required metadata
echo "Checking for metadata..."
for md in docs/**/*.md; do
    grep -q "last_reviewed:" "$md" || echo "⚠️ Missing metadata: $md"
done

# Check 3: Code examples are valid
echo "Validating code examples..."
# Extract ```bash blocks and run them
rg '```bash' -A 100 docs/ | ./check-bash-examples.py

# Check 4: No stale docs
echo "Finding stale documentation..."
find docs -name "*.md" -mtime +180 | while read f; do
    echo "⚠️ Stale (>6mo): $f"
done

# Check 5: Consistent terminology
echo "Checking for terminology consistency..."
if grep -rq "API key" docs/ && grep -rq "API-key" docs/; then
    echo "⚠️ Inconsistent: 'API key' vs 'API-key'"
fi

9. Integration Patterns

How agents discover and use knowledge.

Explicit Loading (Pull Model)

Agent loads knowledge at startup:

class Harness:
    def __init__(self, knowledge_paths):
        self.knowledge = {}
        for path in knowledge_paths:
            self.knowledge[path] = load_markdown(path)
    
    def answer(self, query):
        context = self.knowledge.get('api.md', '')
        return self.agent.answer(query, context=context)

Pros:

  • Simple, predictable
  • Full context loaded upfront
  • Good for small, stable knowledge

Cons:

  • Token-heavy (loads everything, uses little)
  • Stale if knowledge updated
  • Doesn’t scale (can’t load 10M words)

Dynamic Discovery (Push/Pull Hybrid)

Agent requests knowledge when needed:

class DynamicHarness:
    def __init__(self):
        self.kb = VectorIndex('docs/')
    
    def answer(self, query):
        # Fetch relevant knowledge at query time
        relevant_chunks = self.kb.search(query, top_k=5)
        context = '\n---\n'.join([c['text'] for c in relevant_chunks])
        return self.agent.answer(query, context=context)

Pros:

  • Only loads relevant knowledge
  • Automatically updated with docs
  • Scales to large bases
  • Accurate context (not everything)

Cons:

  • Extra latency (search time)
  • Search quality matters
  • Requires vector index

Knowledge as a Tool

Pattern: Agent calls knowledge lookup as a function

from langchain.tools import Tool

knowledge_search = Tool(
    name="search_knowledge_base",
    description="Search the knowledge base for relevant information",
    func=lambda query: knowledge_base.search(query, top_k=3)
)

agent = HarnessAgent(tools=[knowledge_search, code_executor, ...])

# Agent uses tool autonomously
response = agent.answer(
    "How do I configure rate limiting?",
    tools=[knowledge_search, code_executor]
)
# Agent might call: search_knowledge_base("rate limiting configuration")

Pros:

  • Agent decides when knowledge is needed
  • Natural integration with other tools
  • Supports multi-step reasoning
  • Works with frameworks (LangChain, LlamaIndex)

Cons:

  • Extra LLM calls (search decisions)
  • Latency increases
  • More complex debugging

Knowledge Orchestration

Coordinating knowledge across tools:

class KnowledgeOrchestrator:
    """
    Manage which knowledge is available to which agents/tools.
    """
    
    def __init__(self):
        self.global_kb = VectorIndex('docs/')  # Available everywhere
        self.api_team_kb = VectorIndex('docs/api/')  # API team only
        self.internal_kb = VectorIndex('docs/internal/')  # Employees only
    
    def get_kb_for_agent(self, agent_name, access_level):
        """Return appropriate knowledge for agent"""
        kbs = [self.global_kb]  # Everyone gets this
        
        if access_level == 'api_team':
            kbs.append(self.api_team_kb)
        
        if access_level == 'internal':
            kbs.append(self.internal_kb)
        
        return CombinedIndex(kbs)
    
    def search(self, query, agent_name, access_level):
        kb = self.get_kb_for_agent(agent_name, access_level)
        return kb.search(query)

10. Performance Optimization

Keep knowledge retrieval fast, even at scale.

Indexing Strategies

Multi-level indexing:

Raw documents (1000 files, 10M words)
    ↓ (expensive, one-time)
Inverted index (keywords → documents)
    +
Vector index (chunks → embeddings)
    ↓
Query time: use the indexes, not raw docs

Implementation:

# Build indices once, reuse many times
class OptimizedKnowledgeBase:
    def __init__(self, doc_dir):
        # Load from cache if exists
        self.keyword_index = load_or_build_keyword_index(doc_dir)
        self.vector_index = load_or_build_vector_index(doc_dir)
    
    def search(self, query, method='hybrid', top_k=5):
        """Search using pre-built indices"""
        if method == 'keyword':
            return self.keyword_index.search(query, top_k)
        elif method == 'semantic':
            return self.vector_index.search(query, top_k)
        else:
            # Hybrid: combine both indices
            k_results = self.keyword_index.search(query, top_k=10)
            v_results = self.vector_index.search(query, top_k=10)
            return merge_results(k_results, v_results, top_k)
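The merge_results helper above is left undefined; a minimal sketch, assuming each index returns (doc_id, score) pairs with scores already normalized to the 0-1 range, is a weighted sum:

```python
def merge_results(keyword_results, vector_results, top_k=5, kw_weight=0.3):
    """Blend two (doc_id, score) result lists into one ranked list."""
    combined = {}
    for doc_id, score in keyword_results:
        combined[doc_id] = combined.get(doc_id, 0.0) + kw_weight * score
    for doc_id, score in vector_results:
        # Semantic matches get the remaining weight
        combined[doc_id] = combined.get(doc_id, 0.0) + (1 - kw_weight) * score
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

Documents found by both indices accumulate score from each, so agreement between keyword and semantic search naturally rises to the top.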

Caching Frequently Accessed Knowledge

Pattern: LRU cache for common queries

from collections import OrderedDict

class CachedKnowledgeBase:
    def __init__(self, kb, max_size=1000):
        self.kb = kb
        self.cache = OrderedDict()  # insertion order doubles as recency order
        self.max_size = max_size
        self.cache_hits = 0
        self.cache_misses = 0
    
    def search(self, query, top_k=5):
        """Search with LRU caching"""
        # Plain-string key (not a hash) so invalidate() can pattern-match it
        cache_key = f"{query}:{top_k}"
        
        if cache_key in self.cache:
            self.cache_hits += 1
            self.cache.move_to_end(cache_key)  # mark as most recently used
            return self.cache[cache_key]
        
        self.cache_misses += 1
        results = self.kb.search(query, top_k)
        self.cache[cache_key] = results
        
        # Evict the least recently used entry when over capacity
        if len(self.cache) > self.max_size:
            self.cache.popitem(last=False)
        
        return results
    
    def invalidate(self, pattern=None):
        """Clear cache when knowledge updates"""
        if pattern is None:
            self.cache.clear()
        else:
            self.cache = OrderedDict(
                (k, v) for k, v in self.cache.items() if pattern not in k
            )

When to cache:

  • Frequently asked questions (FAQ section)
  • Common patterns (e.g., “how to setup”, “authentication”)
  • Time-sensitive: cache expiry after 1-24 hours
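The time-sensitive bullet can be sketched as a wrapper that timestamps each entry and drops it after a fixed time-to-live (class and parameter names here are illustrative):

```python
import time

class TTLCache:
    """Drop cached search results after a fixed time-to-live."""

    def __init__(self, ttl_seconds=3600):  # default: 1 hour
        self.ttl = ttl_seconds
        self.store = {}  # key -> (inserted_at, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        inserted_at, value = entry
        if time.time() - inserted_at > self.ttl:
            del self.store[key]  # expired; force a fresh search
            return None
        return value

    def put(self, key, value):
        self.store[key] = (time.time(), value)
```

Expiry keeps cached answers from outliving the knowledge they were derived from, which matters more than raw hit rate for fast-changing docs.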

Approximate Nearest Neighbor Search

For very large vector indices, exact search becomes slow. Approximate nearest neighbor (ANN) methods trade a little recall for large speedups:

| Method | Speed | Accuracy | Best For |
|---|---|---|---|
| Exact search | Slow (O(n)) | 100% | <1M vectors |
| FAISS | Fast (O(log n)) | 99%+ | 1-100M vectors |
| HNSW | Very fast | 95%+ | Streaming/real-time |
| IVF | Fast | 90%+ | Partitioned search |

FAISS example:

import faiss
import numpy as np

# Build the index once (IndexFlatL2 is exact; swap in IndexHNSWFlat or
# IndexIVFFlat once the corpus outgrows brute-force search)
vectors = np.array([embed(chunk) for chunk in chunks], dtype='float32')
index = faiss.IndexFlatL2(384)  # 384 = embedding dimension
index.add(vectors)

# Save for reuse
faiss.write_index(index, 'knowledge.index')

# Query time: fast
query_vec = np.array([embed(query)], dtype='float32')
distances, indices = index.search(query_vec, 5)  # top-5 nearest chunks
results = [chunks[i] for i in indices[0]]

Lazy Loading

Don’t load everything at startup:

class LazyKnowledgeBase:
    def __init__(self, doc_dir):
        self.doc_dir = doc_dir
        self.chunks = None  # Load on first use
        self.index = None
    
    def _ensure_loaded(self):
        if self.chunks is None:
            self.chunks = self._load_chunks()
            self.index = self._build_index(self.chunks)
    
    def search(self, query):
        self._ensure_loaded()
        return self.index.search(query)

11. Multi-Agent Knowledge Sharing

When multiple agents or teams need the same knowledge.

Centralized Knowledge Base

Single source of truth, shared by all agents:

# Knowledge base serving 10 agents

Knowledge Base (Git + Vector Index)
    ├─ API Team Agent (reads api.md, integrations.md)
    ├─ Support Agent (reads faq.md, troubleshooting.md)
    ├─ Analytics Agent (reads data-models.md, queries.md)
    ├─ DevOps Agent (reads deployment.md, runbooks.md)
    └─ ...

Benefits:

  • Single update syncs to all agents
  • Consistent information
  • Easy to audit (all in Git)

Challenges:

  • Knowledge is generic (covers many use cases)
  • Agents load knowledge they don’t use
  • No specialization

Agent-Specific Knowledge

Each agent has custom knowledge subset:

Base Knowledge
    ├─ api.md (for all agents)
    └─ faq.md (for all agents)

Specializations
    ├─ api-team/
    │   ├─ sdk-internals.md
    │   └─ performance-tuning.md
    ├─ support-team/
    │   ├─ troubleshooting.md
    │   └─ workarounds.md
    └─ devops/
        ├─ deployment-matrix.md
        └─ runbooks/

Implementation:

class SpecializedAgent:
    def __init__(self, agent_type):
        self.base_kb = VectorIndex('docs/base/')
        self.specialized_kb = VectorIndex(f'docs/{agent_type}/')
    
    def search(self, query):
        # Search specialized first, fall back to base
        specialized = self.specialized_kb.search(query, top_k=3)
        if specialized:
            return specialized
        return self.base_kb.search(query, top_k=3)

Knowledge Inheritance Hierarchies

Organize knowledge by scope and specificity:

Level 1: Industry Standards
  └─ "What is OAuth2?" (applies to all companies)

Level 2: Company Policies
  └─ "We use OAuth2 with 15-min token lifetime" (applies to this company)

Level 3: Product-Specific
  └─ "Our API endpoints require OAuth2 with X-API-Key header"

Level 4: Team-Specific
  └─ "API Team: we document endpoints in OpenAPI 3.1" (how to maintain level 3)

In practice:

docs/
├── L1-standards/
│   ├── oauth2.md
│   └── rest-best-practices.md
├── L2-company/
│   ├── company-security-policy.md
│   └── authentication-standard.md
├── L3-product/
│   ├── api/
│   └── integrations/
└── L4-team/
    ├── api-team/
    └── support-team/
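Lookup order can mirror the hierarchy: search the most specific level first and fall back toward industry standards. A runnable sketch, with a trivial keyword index standing in for a real vector index and hypothetical article names:

```python
class StubIndex:
    """Keyword stand-in for the vector indexes used elsewhere in this guide."""

    def __init__(self, docs):
        self.docs = docs  # {article_name: text}

    def search(self, query, top_k=3):
        terms = query.lower().split()
        hits = [name for name, text in self.docs.items()
                if any(term in text.lower() for term in terms)]
        return hits[:top_k]


class HierarchicalKnowledgeBase:
    """Search levels most-specific-first; the first level with a hit wins."""

    def __init__(self, levels):
        # levels ordered from L4 (team) down to L1 (industry standards)
        self.indexes = [StubIndex(docs) for docs in levels]

    def search(self, query, top_k=3):
        for index in self.indexes:
            results = index.search(query, top_k=top_k)
            if results:
                return results  # team-specific guidance shadows generic docs
        return []


kb = HierarchicalKnowledgeBase([
    {"team-oauth": "API Team: we document OAuth2 endpoints in OpenAPI 3.1"},
    {"oauth-standard": "What is OAuth2? An authorization framework"},
])
print(kb.search("oauth2 endpoints"))
```

The fallback ordering is a policy choice: specificity-first works when team docs are authoritative; blending levels works better when generic context should accompany specific answers.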

12. Real-World Examples

Example 1: Transitioning from Markdown to Hybrid (Small→Medium)

Starting state: 300 markdown files, 600K words. Keyword search takes 10s. Search results noisy.

Goal: Reduce search time to <2s, improve relevance.

Approach: Multi-Tier Markdown + Vector Index

Step 1: Assess current state

# Count words in markdown
find docs -name "*.md" -exec wc -w {} + | tail -1
# Output: 612000 total

# Check search performance
time knowledge_base.search("how to setup oauth")
# Output: real 0m9.742s (too slow)

Step 2: Create summaries

docs/
├── raw/
│   ├── authentication-complete.md (4000 words)
│   └── rate-limiting-full.md (3500 words)
└── summaries/
    ├── authentication-quick.md (500 words)
    └── rate-limiting-quick.md (400 words)

Summarization process:

  • Manual review: SME reads full doc, writes 80/20 version
  • Markup: Add [full docs](../raw/authentication-complete.md) links
  • Review: Another SME checks summary is accurate

Step 3: Build vector index (on summaries)

# This is fast since we're indexing 50K words, not 600K
chunks = chunk_markdown('docs/summaries/', chunk_size=400)
embeddings = embed_batch(chunks)  # ~10 min with OpenAI API
save_vector_index(embeddings, 'docs/.index/')

Step 4: Update agent to search summaries first

# Before
context = '\n'.join(load_all_markdown('docs/'))  # 600K tokens, slow

# After
relevant = vector_search('docs/.index/', query, top_k=3)
context = '\n---\n'.join(relevant)  # 1.2K tokens, fast

Results:

  • Search time: 10s → 0.5s
  • Context size: 600K tokens → 1.2K tokens
  • Accuracy: Improved (semantic search vs keyword)
  • Maintenance: +20% (keep summaries updated)

Lessons learned:

  • Summarization is lossy, but acceptable for common queries
  • Keep raw docs for deep dives
  • Vector index on summaries is maintenance sweet spot

Example 2: Knowledge Graph for Domain Relationships

Scenario: Fintech company with complex API (users, accounts, transactions, cards).

Problem: Markdown says “Card requires an Account” but doesn’t show what else depends on Account. When Account data model changes, what breaks?

Solution: Knowledge graph

Entities:

datamodels:
  - User: root entity
  - Account: requires User
  - Card: requires Account
  - Transaction: requires Card or Account
  - Webhook: triggers on Transaction

endpoints:
  - POST /accounts: creates Account (requires User)
  - POST /cards: creates Card (requires Account)
  - POST /transactions: posts Transaction (requires Card)

Relationships:

User
  ├─ creates → Account
  └─ has_many → Account

Account
  ├─ created_by → User
  ├─ creates → Card
  └─ has_many → Card

Card
  ├─ belongs_to → Account
  ├─ enables → Transaction
  └─ has_many → Transaction

Webhook
  └─ triggers_on → Transaction

Queries enabled:

# “What breaks if we remove the Card model?”
MATCH (card:DataModel {name: 'Card'})<-[:uses_model]-(endpoint:APIEndpoint)
RETURN endpoint.name

# “What does a User need before they can post a transaction?”
MATCH (u:DataModel {name: 'User'})-[:creates]->(a:DataModel {name: 'Account'})
      -[:creates]->(c:DataModel {name: 'Card'})
      -[:enables]->(t:DataModel {name: 'Transaction'})
RETURN [u.name, a.name, c.name, t.name]

# “What endpoints touch the Account data model?”
MATCH (endpoint:APIEndpoint)-[:creates|updates|returns]->(a:DataModel {name: 'Account'})
RETURN endpoint.name

Markdown can’t answer these. Graph can.
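Even without a graph database, the dependency question can be answered from a plain adjacency map; a minimal sketch, with the relationship listing above condensed to one edge map:

```python
# Entities and the dependencies between them, condensed to one edge map
edges = {
    "User": ["Account"],         # User creates Account
    "Account": ["Card"],         # Account creates Card
    "Card": ["Transaction"],     # Card enables Transaction
    "Transaction": ["Webhook"],  # Webhook triggers on Transaction
}

def downstream(node, graph):
    """Everything that transitively depends on `node` (what breaks if it goes)."""
    seen, stack = set(), list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen

# Everything downstream of Card: Transaction and Webhook
print(downstream("Card", edges))
```

A dedicated graph database earns its keep once relationships are numerous, typed, and queried in both directions; for a handful of entities, a dict like this is enough.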

Example 3: Multi-Agent Knowledge Sharing

Scenario: Support organization with 3 teams, 1 shared knowledge base.

Setup:

  • Tier 1 support: Answer common questions (FAQ only)
  • Tier 2 support: Troubleshoot (FAQ + troubleshooting)
  • Tier 3 support: Escalations (all knowledge)
  • Billing team: Handle refunds (billing knowledge only)

Knowledge structure:

docs/
├── shared/
│   ├── faq.md
│   ├── product-overview.md
│   └── glossary.md
├── tier2/
│   ├── troubleshooting.md
│   └── common-errors.md
├── tier3/
│   ├── system-architecture.md
│   └── internal-runbooks.md
└── billing/
    ├── refund-policy.md
    └── pricing.md

Agent setup:

class SupportAgent:
    def __init__(self, tier):
        self.tier = tier
        self.shared_kb = VectorIndex('docs/shared/')
        
        # Higher tiers inherit everything below them
        tier_dirs = {
            'tier1': [],
            'tier2': ['docs/tier2/'],
            'tier3': ['docs/tier2/', 'docs/tier3/'],
        }
        self.specialized_kbs = [VectorIndex(d) for d in tier_dirs[tier]]
    
    def answer(self, customer_query):
        # Shared knowledge first, then tier-appropriate extras
        context = self.shared_kb.search(customer_query, top_k=5)
        for kb in self.specialized_kbs:
            context += kb.search(customer_query, top_k=5)
        return self.agent.answer(customer_query, context=context)

# Usage
tier1_agent = SupportAgent('tier1')  # Shared knowledge only
tier2_agent = SupportAgent('tier2')  # Shared + troubleshooting
tier3_agent = SupportAgent('tier3')  # Full knowledge

Workflow:

Customer asks: "Why was I charged twice?"

Tier 1: Searches FAQ, finds generic refund article
        → Suggests contacting support
        → Creates ticket

Tier 2: Searches troubleshooting + FAQ
        → Looks up transaction logs
        → Can explain double-charge scenarios
        → May resolve or escalate

Tier 3: Full system access + advanced knowledge
        → Digs into billing code
        → Finds root cause
        → Implements fix + refund

Benefits:

  • Tier 1 stays focused on common issues
  • Tier 2 can self-service for common problems
  • Knowledge is progressively revealed
  • Easy to promote from Tier 1 → 2 (just point to broader KB)

Decision Framework

Choosing a knowledge management strategy:

Start here:
├─ Is your knowledge base < 400K words?
│  └─ YES: Use pure markdown (simple, Git-native)
│
├─ Is it 400K-2M words?
│  ├─ Semantic search important?
│  │  ├─ YES: Hybrid markdown + vector index
│  │  └─ NO: Multi-tier markdown
│
├─ Is it > 2M words?
│  ├─ Relationships matter?
│  │  ├─ YES: Add knowledge graph (Neo4j)
│  │  └─ NO: Vector database (Pinecone/Weaviate)
│
├─ Are there >100 data models with complex relationships?
│  └─ YES: Knowledge graph (mandatory)
│
└─ Do multiple agents need different knowledge?
   └─ YES: Implement access control + specialization
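The tree above can be encoded directly so the decision is reproducible as the knowledge base grows; a small sketch using this guide's thresholds (function and parameter names are illustrative):

```python
def choose_strategy(total_words, needs_semantic=False,
                    needs_relationships=False, complex_models=False):
    """Walk the decision tree top to bottom; first matching branch wins."""
    if complex_models:  # >100 interrelated data models
        return "knowledge graph"
    if total_words < 400_000:
        return "pure markdown"
    if total_words <= 2_000_000:
        return ("hybrid markdown + vector index" if needs_semantic
                else "multi-tier markdown")
    return "knowledge graph" if needs_relationships else "vector database"
```

Re-running this check against the measured word count each quarter turns "when should we migrate?" from a debate into a threshold crossing.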

References & Tools

Vector Databases:

  • Pinecone (managed, expensive)
  • Weaviate (self-hosted or managed)
  • Milvus (open-source, self-hosted)
  • SQLite-vec (embedded, free)

Embedding Models:

  • Local: all-MiniLM-L6-v2, all-mpnet-base-v2
  • API: OpenAI Embeddings, Cohere Embed, Voyage AI

Graph Databases:

  • Neo4j (most popular, great Cypher docs)
  • TigerGraph (performance-optimized)
  • ArangoDB (multi-model)

Search Tools:

  • Elasticsearch (full-featured, complex)
  • Solr (enterprise search)
  • Meilisearch (simple, fast)

Chunking & Indexing:

  • LangChain: Document loaders, splitters
  • LlamaIndex: Document indexing specialized for LLMs
  • Unstructured: PDF/document parsing

Summary

Knowledge management at scale requires deliberate architectural choices:

  1. Start simple: Markdown wikis work for <400K words
  2. Scale strategically: Choose multi-tier, hybrid, or graph based on size and constraints
  3. Maintain actively: Old knowledge is worse than no knowledge
  4. Search smartly: Combine keyword and semantic approaches
  5. Share wisely: Multi-agent systems need structured access
  6. Graph when needed: Relationships require explicit modeling

The jump from flat documents to vector indexes to knowledge graphs is not arbitrary—each layer solves real problems at specific scales. Begin where you are, transition when you feel pain, and measure improvements.


13. Real-World Scaling Case Study

This case study traces a knowledge base from 50 articles to 800, documenting the breaking points, migration decisions, and implementation code at each stage.

The Scenario

A developer tools company maintains internal knowledge for its coding assistant harness. The knowledge covers API documentation, integration guides, troubleshooting runbooks, and architecture decisions. Over 18 months, the knowledge base grew from a small wiki to a sprawling corpus that degraded search quality.

The symptom: “Our agent used to find the right answer immediately. Now it returns vaguely related articles or hallucinates details from outdated docs.”

Stage 1: 0-100 Articles (Months 1-6)

Architecture: Pure markdown wiki following the Karpathy pattern (see doc 04 for memory layer context).

knowledge/
├── raw/           # Source documents, meeting notes, specs
│   ├── api-v2-spec.md
│   ├── onboarding-notes.md
│   └── ... (85 files)
└── wiki/          # LLM-compiled, structured markdown
    ├── authentication.md
    ├── rate-limiting.md
    ├── error-codes.md
    └── ... (50 files)

How it worked:

  • Authors dropped raw sources into raw/
  • An LLM compiled them into clean wiki articles in wiki/
  • The agent loaded all of wiki/ into context at startup (~80K tokens)
  • Full-text search with simple keyword matching

Metrics:

  • Total size: ~120K words (well under 400K limit)
  • Search latency: <500ms (in-memory grep)
  • Search relevance: 92% (small corpus, most queries hit the right doc)
  • Context usage: 80K tokens out of 200K available — comfortable

What worked: Everything. The Karpathy pattern is excellent at this scale. Human-readable, Git-versioned, no infrastructure beyond the filesystem.

Stage 2: 100-400 Articles (Months 6-12)

What changed: The company added three new product lines, each with its own API, guides, and troubleshooting docs. The wiki grew from 50 to 250 compiled articles.

First signs of trouble:

Total wiki size: ~340K words
Context usage: 340K tokens — exceeds most model context windows
Search latency: 2.1s (still acceptable)
Search relevance: 74% (dropped 18 points)

The breaking point: The agent could no longer load all wiki articles into context. It had to selectively load, but keyword search returned 15-20 partially relevant articles for common queries like “how do I authenticate?”

Fix: Selective loading with topic indexes

# Added a lightweight topic index for selective loading
# Instead of loading all 250 articles, load only relevant ones

import json
from pathlib import Path

class TopicIndex:
    """Map topics to relevant wiki articles for selective loading."""

    def __init__(self, wiki_dir: str):
        self.wiki_dir = Path(wiki_dir)
        self.index = self._build_index()

    def _build_index(self) -> dict[str, list[str]]:
        """Build topic -> [article_paths] mapping from frontmatter."""
        index = {}
        for md_file in self.wiki_dir.glob("*.md"):
            topics = extract_frontmatter_topics(md_file)
            for topic in topics:
                index.setdefault(topic, []).append(str(md_file))
        return index

    def get_articles(self, query: str, max_articles: int = 10) -> list[str]:
        """Return article paths relevant to query, ranked by topic overlap."""
        query_terms = query.lower().split()
        scored = {}
        for topic, paths in self.index.items():
            for term in query_terms:
                if term in topic.lower():
                    for path in paths:
                        scored[path] = scored.get(path, 0) + 1

        ranked = sorted(scored.items(), key=lambda x: x[1], reverse=True)
        return [path for path, _ in ranked[:max_articles]]
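The extract_frontmatter_topics helper is assumed above; one minimal version, parsing only an inline `topics: [a, b]` list so no YAML dependency is needed:

```python
from pathlib import Path

def extract_frontmatter_topics(md_file: Path) -> list[str]:
    """Parse `topics: [a, b]` from a YAML frontmatter block, if present."""
    text = md_file.read_text()
    if not text.startswith("---"):
        return []  # no frontmatter block
    frontmatter = text.split("---", 2)[1]
    for line in frontmatter.splitlines():
        if line.strip().startswith("topics:"):
            raw = line.split(":", 1)[1].strip().strip("[]")
            return [t.strip() for t in raw.split(",") if t.strip()]
    return []
```

A real implementation would use a YAML parser and handle multi-line lists; the inline form keeps the tagging convention simple enough that authors actually follow it.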

Metrics after fix:

  • Context usage: ~40K tokens per query (loading 8-12 relevant articles)
  • Search relevance: 81% (improved from 74%, still below Stage 1)
  • Search latency: 1.8s
  • Maintenance cost: Added frontmatter tagging to all articles (~2 days of work)

Stage 3: 400-800 Articles (Months 12-18)

What changed: The company acquired a competitor and merged their documentation. The wiki ballooned to 650+ articles. Topic-based selective loading was no longer sufficient — too many articles shared the same topics, and keyword matching couldn’t distinguish “authentication for Product A” from “authentication for Product B.”

The numbers:

Total wiki size: ~780K words
Topic index entries: 45 topics, avg 14 articles per topic
Search relevance: 58% (unacceptable — agent hallucinating to fill gaps)
Context usage: 60K tokens (loading too many loosely related articles)
False positive rate: 35% (over a third of retrieved articles were wrong)

Decision: Transition to hybrid markdown + vector retrieval. Keep the wiki as source of truth, add embeddings for semantic search.

The Migration: Markdown to Hybrid

Step 1: Generate embeddings for all wiki articles

from sentence_transformers import SentenceTransformer
import sqlite3
import hashlib
from pathlib import Path

def migrate_wiki_to_hybrid(wiki_dir: str, db_path: str):
    """
    One-time migration: chunk wiki articles and store embeddings.
    Preserves original markdown files untouched.
    """
    model = SentenceTransformer("all-MiniLM-L6-v2")  # 22MB, runs locally
    wiki = Path(wiki_dir)

    db = sqlite3.connect(db_path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id TEXT PRIMARY KEY,
            source_file TEXT,
            chunk_index INTEGER,
            text TEXT,
            embedding BLOB,
            word_count INTEGER
        )
    """)

    for md_file in wiki.glob("*.md"):
        content = md_file.read_text()
        chunks = chunk_by_heading(content, max_tokens=400)

        for i, chunk in enumerate(chunks):
            chunk_id = hashlib.sha256(
                f"{md_file.name}:{i}".encode()
            ).hexdigest()[:16]
            embedding = model.encode(chunk).tobytes()

            db.execute(
                "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?, ?, ?)",
                (chunk_id, md_file.name, i, chunk, embedding, len(chunk.split()))
            )

    db.commit()
    db.close()


def chunk_by_heading(content: str, max_tokens: int = 400) -> list[str]:
    """Split markdown by headings, merge small sections, split large ones."""
    sections = []
    current = []
    current_len = 0

    for line in content.split("\n"):
        if line.startswith("#") and current_len > 50:
            sections.append("\n".join(current))
            current = [line]
            current_len = len(line.split())
        else:
            current.append(line)
            current_len += len(line.split())

            if current_len > max_tokens:
                sections.append("\n".join(current))
                current = []
                current_len = 0

    if current:
        sections.append("\n".join(current))

    return sections

Step 2: Build the hybrid search system

import hashlib
import sqlite3
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

class HybridKnowledgeBase:
    """
    Combines keyword search (fast, exact) with semantic search (slow, relevant).
    Markdown files remain the source of truth; vector index is derived.
    """

    def __init__(self, wiki_dir: str, db_path: str):
        self.wiki_dir = Path(wiki_dir)
        self.db_path = db_path
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self._load_index()

    def _load_index(self):
        """Load all chunks and embeddings from SQLite."""
        db = sqlite3.connect(self.db_path)
        rows = db.execute(
            "SELECT id, source_file, text, embedding FROM chunks"
        ).fetchall()
        db.close()

        self.chunks = []
        self.embeddings = []
        for chunk_id, source, text, emb_bytes in rows:
            self.chunks.append({
                "id": chunk_id,
                "source": source,
                "text": text,
            })
            self.embeddings.append(
                np.frombuffer(emb_bytes, dtype=np.float32)
            )
        self.embeddings = np.array(self.embeddings)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Hybrid search: keyword (weight 0.3) + semantic (weight 0.7)."""
        keyword_scores = self._keyword_search(query)
        semantic_scores = self._semantic_search(query)

        combined = {}
        for i, chunk in enumerate(self.chunks):
            cid = chunk["id"]
            kw = keyword_scores.get(cid, 0.0) * 0.3
            sem = semantic_scores.get(cid, 0.0) * 0.7
            combined[cid] = kw + sem

        ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)
        results = []
        for cid, score in ranked[:top_k]:
            chunk = next(c for c in self.chunks if c["id"] == cid)
            results.append({**chunk, "score": score})
        return results

    def _keyword_search(self, query: str) -> dict[str, float]:
        """Simple term-frequency scoring."""
        terms = query.lower().split()
        scores = {}
        for chunk in self.chunks:
            text_lower = chunk["text"].lower()
            hits = sum(1 for t in terms if t in text_lower)
            if hits > 0:
                scores[chunk["id"]] = hits / len(terms)
        return scores

    def _semantic_search(self, query: str) -> dict[str, float]:
        """Cosine similarity against all chunk embeddings."""
        query_vec = self.model.encode(query)
        similarities = np.dot(self.embeddings, query_vec) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_vec)
        )
        return {
            self.chunks[i]["id"]: float(sim)
            for i, sim in enumerate(similarities)
        }

    def refresh(self):
        """Re-embed new or changed files only (incremental update)."""
        db = sqlite3.connect(self.db_path)
        db.execute(
            "CREATE TABLE IF NOT EXISTS file_hashes "
            "(source_file TEXT PRIMARY KEY, hash TEXT)"
        )
        for md_file in self.wiki_dir.glob("*.md"):
            file_hash = hashlib.sha256(md_file.read_bytes()).hexdigest()[:16]
            row = db.execute(
                "SELECT hash FROM file_hashes WHERE source_file = ?",
                (md_file.name,)
            ).fetchone()
            if row and row[0] == file_hash:
                continue  # unchanged; skip re-embedding

            # New or changed file: drop stale chunks, re-chunk, re-embed
            db.execute("DELETE FROM chunks WHERE source_file = ?", (md_file.name,))
            chunks = chunk_by_heading(md_file.read_text())
            for i, chunk in enumerate(chunks):
                chunk_id = hashlib.sha256(
                    f"{md_file.name}:{i}".encode()
                ).hexdigest()[:16]
                embedding = self.model.encode(chunk).tobytes()
                db.execute(
                    "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?, ?, ?)",
                    (chunk_id, md_file.name, i, chunk,
                     embedding, len(chunk.split()))
                )
            db.execute(
                "INSERT OR REPLACE INTO file_hashes VALUES (?, ?)",
                (md_file.name, file_hash)
            )
        db.commit()
        db.close()
        self._load_index()

Metrics at Each Stage

| Metric | Stage 1 (0-100) | Stage 2 (100-400) | Stage 3 (400-800) | Stage 3 + Hybrid |
|---|---|---|---|---|
| Articles | 50 | 250 | 650 | 650 |
| Total words | 120K | 340K | 780K | 780K |
| Search latency | <500ms | 2.1s | 4.8s | 800ms |
| Search relevance | 92% | 74% → 81% | 58% | 89% |
| False positive rate | 5% | 18% | 35% | 8% |
| Context tokens/query | 80K (all) | 40K (selective) | 60K (noisy) | 8K (precise) |
| Infrastructure | Filesystem | Filesystem + index | Filesystem + index | SQLite + embeddings |
| Migration effort | N/A | 2 days | N/A | 1 week |

Key Takeaways

  1. The Karpathy pattern works brilliantly until ~100 articles. Don’t over-engineer at this stage — markdown wiki is the right answer.

  2. 100-400 articles is the danger zone. You feel the pain but it’s not bad enough to force a migration. Topic indexes buy time, but semantic search is coming whether you plan for it or not.

  3. The hybrid approach preserves your investment in markdown. You don’t throw away the wiki — you add a vector layer on top. Git history, human readability, and editability are preserved.

  4. Incremental embedding is essential. Re-embedding 800 articles on every change is wasteful. Track file hashes, embed only what changed.

  5. Weight semantic search higher than keywords (0.7 vs 0.3). At scale, users search by intent (“how do I limit API calls?”) not by exact terms (“rate_limit_exceeded”). Semantic search handles this naturally.

  6. Context tokens per query dropped 10x (80K to 8K) while relevance only dropped 3 points (92% to 89%). Precise retrieval beats brute-force context stuffing.

For memory layer integration patterns (working memory, episodic memory, semantic memory), see doc 04 (Memory Systems). The hybrid search system described here slots into the semantic memory layer.


Validation Checklist

How do you know you got this right?

Performance Checks

  • Knowledge search latency: <2 sec for markdown, <5 sec for vector, <1 sec for graph
  • Indexing time reasonable: one-time embedding <1 hour for 1M words
  • Memory usage: vector index <5GB for 1M words (embedded models)
  • Staleness acceptable: knowledge updates reflected within 24 hours

Implementation Checks

  • Current strategy chosen: markdown/multi-tier/hybrid/graph decided and documented
  • Knowledge base size measured: growth tracked month-over-month
  • Search tested on 10+ representative queries: recall >80%
  • Chunking strategy verified: relevant documents returned for edge cases
  • Embedding quality checked: similar docs ranked together
  • Multi-agent access control working: agents see only intended knowledge
  • Deduplication implemented: no duplicate information taking space

Integration Checks

  • Harness agent can search knowledge base: integration with perception layer working
  • Results flow into context: agent can reason over retrieved documents
  • Update mechanism working: new knowledge added without full reindex
  • Fallback graceful: search failure doesn’t crash agent
  • Cost tracking: know per-query cost (embeddings, graph traversal, etc)

Common Failure Modes

  • Knowledge bloat: No cleanup; size grows unbounded, search becomes slow
  • Stale information: Old docs contradicting new ones; no version control
  • Poor relevance: Search returns noise; chunking strategy wrong
  • Expensive embeddings: Querying too often; implement caching
  • Graph inconsistency: Relationships contradictory or out-of-sync with documents
  • Access control broken: Agent sees knowledge it shouldn’t have

Sign-Off Criteria

  • Knowledge base size tracked and decision point identified (when to scale)
  • Search latency acceptable for use case (interactive <2sec, batch <5sec)
  • Quality validation: spot-check 20+ search results, >80% relevant
  • Tested at scale: if expecting 1M words, tested with 500K+
  • Maintenance plan clear: who updates, how often, cleanup schedule

See Also

  • Doc 04 (Memory Systems): Knowledge management complements agent memory layers
  • Doc 14 (Advanced Patterns): Knowledge graphs for complex reasoning systems
  • Doc 20 (Integration Patterns): Exposing knowledge search via APIs