
Knowledge Transfer Methods

Distillation, fine-tuning, LoRA, RAG compared — complete LoRA implementation, decision trees, cost comparison, and real-world examples.

Critical Question: How do I adapt models to my needs without training from scratch?

This guide compares the three primary methods for transferring and adapting LLM knowledge: distillation, fine-tuning, and retrieval-augmented generation (RAG). Each solves different problems and offers different tradeoffs in cost, speed, and quality.


1. Knowledge Distillation

What It Is

Knowledge distillation teaches a smaller, faster model to replicate the behavior of a larger, more capable “teacher” model. Instead of starting from scratch, the student model learns to mimic the teacher’s decision patterns.

Why It Works

Large models (like GPT-4) contain learned patterns spread across billions of parameters. A smaller model can’t replicate all this knowledge directly, but it can learn to approximate the teacher’s outputs by studying:

  • The probability distributions the teacher produces (not just the top answer)
  • Intermediate representations (hidden states, attention patterns) that reflect the teacher’s reasoning
  • The “soft targets” (probabilistic outputs) rather than hard labels

The Process

Teacher Model (e.g., GPT-4, 70B model)

  Generate outputs & probabilities

        ↓ Distillation training

Student Model (e.g., 7B model)

    Learns similar behavior

  Achieves 90-95% of teacher quality

Key concept: Temperature controls knowledge softness.

  • Temperature (τ): A scaling parameter in the softmax function that controls how “soft” probability distributions are
    • Higher temperature (τ > 1): Probability distributions are softer, revealing more about the teacher’s reasoning
    • τ = 1: Standard softmax (default)
    • τ = 3-5: Common for distillation (softer targets expose more of the teacher’s learned structure)
    • Loss function: L = α × CE(student, true_label) + (1-α) × KL(student, teacher@τ)
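The loss above translates almost line-for-line into PyTorch (a minimal sketch; the logits and batch here are synthetic, and the τ² scaling on the KL term follows Hinton et al.'s standard formulation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, tau=4.0, alpha=0.5):
    """L = alpha * CE(student, labels) + (1 - alpha) * KL(student || teacher) at temperature tau."""
    ce = F.cross_entropy(student_logits, true_labels)
    # Softened distributions; the tau**2 factor keeps the gradient scale comparable across tau
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return alpha * ce + (1 - alpha) * kl

# Synthetic batch: 2 examples over a 5-token vocabulary
student_logits = torch.randn(2, 5, requires_grad=True)
teacher_logits = torch.randn(2, 5)
labels = torch.tensor([1, 3])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()  # gradients flow only into the student's logits
```

Raising `tau` flattens both distributions, so the student is penalized for mismatching the teacher's low-probability "dark knowledge", not just its top answer.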

Performance & Cost

| Metric | Value |
|---|---|
| Training cost | 10-20% of original training cost |
| Quality retention | 90-95% of teacher model quality |
| Training time | 2-4 weeks (on 1-2 GPUs) |
| Model size reduction | 10x-100x smaller common (70B → 7B, 70B → 13B) |
| Inference cost | 10-100x cheaper than teacher |
| Inference latency | 10-100x faster than teacher |

Example in Practice

Scenario: You want GPT-4’s coding ability in a 7B model to run on-device.

  1. Generate 100K coding prompts and solutions with GPT-4 (teacher)
  2. For each prompt, collect the final answer and, where the API exposes them, token probabilities (logprobs)
  3. Train a 7B model (student) to match the teacher’s probability distributions
  4. Result: A 7B model that handles coding ~90% as well as GPT-4, but runs on consumer hardware

Real example: DistilBERT was distilled from BERT, retaining about 97% of its language-understanding performance in a model 40% smaller and 60% faster.


2. Fine-Tuning

What It Is

Fine-tuning is continued training of an existing model on task-specific or domain-specific data. Rather than training from scratch, you start with a pre-trained model’s weights and update them to specialize in your domain.

Three Approaches

Full Fine-Tuning

  • Update all model weights
  • Most expensive of the three
  • Highest quality improvement possible
  • Risk: catastrophic forgetting (losing original knowledge)

Parameter-Efficient Fine-Tuning (PEFT)

  • Update only a small subset of weights
  • 1% of the cost of full fine-tuning
  • Preserve original knowledge better
  • Most practical for production

LoRA (Low-Rank Adaptation)

  • Freeze original weights, add small trainable “adapter” matrices
  • Mathematical insight: Update matrices decomposed into low-rank approximation
    • Instead of updating all weights in a layer, add: W_new = W_original + α × A × B
    • A and B are much smaller matrices (e.g., rank 8 or 16, versus the full dimension of 2048)
  • Reduces trainable parameters from millions to thousands
  • Allows serving multiple LoRA adapters on the same base model

Performance & Cost

| Metric | Full Fine-Tune | LoRA | RAG |
|---|---|---|---|
| Training cost | 2-5% of from-scratch | 0.5-1% of from-scratch | ~$0 |
| Training time | 3-7 days | 4-24 hours | 0 |
| Data requirements | 100-10K examples | 100-1K examples | 0 (uses existing docs) |
| Quality improvement | +5-15% on domain tasks | +3-10% on domain tasks | Retrieval-dependent |
| Model size overhead | 0% (updates weights in place) | +1-5% (LoRA matrices) | 0 (no model change) |
| Best for | Domain specialization | Lightweight customization | Real-time knowledge |

When Fine-Tuning Helps

Domain-specific language (strong signal):

  • Legal documents with domain-specific terminology and reasoning patterns
  • Medical literature with specialized knowledge
  • Code in a particular framework or company’s proprietary patterns
  • Customer support with domain-specific responses

Task-specific behavior (measurable):

  • Sentiment analysis (financial news vs. social media)
  • Classification in specialized domains
  • Instruction-following format (e.g., “always respond in JSON”)

When Fine-Tuning Does NOT Help

  • Fundamental capability gaps: A 7B model can’t learn to “do math better” via fine-tuning if math wasn’t in its training
  • Knowledge that requires reasoning: Fine-tuning teaches patterns, not deep reasoning
  • Contradicting original training: New data that conflicts with pretraining is hard to instill; the weights resist overriding strongly learned priors
  • Rare or highly specific knowledge: Need 10+ examples per pattern for reliable learning

Practical Example

Scenario: Adapt Llama 3 8B for medical diagnosis support.

Base Llama 3 8B model

Fine-tune on 1,000 medical Q&A pairs

Hyperparameters:
  - Learning rate: 2e-4
  - Epochs: 3
  - Batch size: 8

Result: Medical domain knowledge encoded in weights

Quality: +10-15% accuracy on medical tasks
Cost: ~$1,000-$3,000 in compute

3. RAG (Retrieval-Augmented Generation)

What It Is

RAG provides knowledge at query time rather than training time. When a user asks a question, the system:

  1. Retrieves relevant documents from a knowledge base
  2. Passes retrieved documents + user question to the model
  3. Model generates answer grounded in retrieved context

The Process

User Question

Vector Database
  (embeddings of documents)

Retrieve top-K relevant docs

Construct prompt:
  "Context: [retrieved docs]
   Question: [user query]
   Answer:"

Run through LLM

Grounded Answer
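The pipeline above can be sketched end to end in a few dozen lines (a toy illustration: `embed` here is a bag-of-words stand-in for a real embedding model, and the in-memory `index` list stands in for a vector database):

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (a real system would use a neural embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "To reset your password, open Settings and choose Reset Password.",
    "Our refund policy allows returns within 30 days of purchase.",
    "Two-factor authentication can be enabled under Security settings.",
]
index = [(doc, embed(doc)) for doc in documents]  # stand-in for a vector DB

def retrieve(query, k=2):
    """Return the top-k documents ranked by cosine similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query, k=2):
    """Construct the augmented prompt that would be sent to the LLM."""
    context = "\n".join(retrieve(query, k))
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

prompt = build_prompt("How do I reset my password?")
```

The resulting `prompt` contains the password-reset article as context; in production, the only changes are swapping `embed` for a real embedding model and `index` for a vector database.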

Performance & Cost

| Metric | Value |
|---|---|
| Training cost | ~$0 |
| Setup cost | $100-$1,000 (vector DB, indexing) |
| Ongoing cost | $10-$100/month (vector search) |
| Quality | Depends on retrieval quality (60-95%) |
| Data freshness | Real-time (updates immediately) |
| Latency | Slower (+100-500ms for retrieval) |
| Knowledge capacity | Millions of documents in the index; per-query context limited by the window (4K-128K tokens) |

When RAG Excels

  • Real-time, changing knowledge: News, stock prices, weather, current events
  • Proprietary documents: Internal wikis, policies, customer data
  • Large knowledge bases: Can store millions of documents
  • Frequent updates: Add new documents without retraining
  • Multi-source knowledge: Combine documents from different systems

Critical Factor: Retrieval Quality

RAG quality is entirely dependent on retrieval. If you retrieve the wrong documents, the model can’t compensate.

Retrieval quality metrics:

  • Precision@K: Of top-K retrieved docs, how many are relevant? (target: >70%)
  • Recall: Of all relevant docs, what percentage did we retrieve? (target: >80%)
  • MRR (Mean Reciprocal Rank): How high is the first relevant result? (target: >0.7)
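All three metrics take only a few lines to compute from a labeled evaluation set (a sketch assuming binary relevance judgments over document IDs):

```python
def precision_at_k(retrieved, relevant, k):
    """Of the top-k retrieved doc IDs, what fraction are relevant?"""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall(retrieved, relevant):
    """Of all relevant doc IDs, what fraction appear anywhere in the retrieved list?"""
    return sum(1 for d in relevant if d in retrieved) / len(relevant)

def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant result, over (retrieved, relevant) pairs."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# One query: retrieved docs [3, 1, 7]; docs {1, 2} are the relevant ones
p = precision_at_k([3, 1, 7], {1, 2}, k=3)        # 1 relevant in top-3
r = recall([3, 1, 7], {1, 2})                     # found 1 of 2 relevant docs
m = mean_reciprocal_rank([([3, 1, 7], {1, 2})])   # first hit at rank 2
```

Run these over a held-out query set before shipping: if Precision@K is below the ~70% target, tuning retrieval (chunking, embeddings, reranking) pays off more than changing the LLM.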

Common retrieval failures:

  • Vector embeddings don’t match query intent (semantic mismatch)
  • Document chunks too large or too small
  • No relevant document in the knowledge base
  • Outdated or duplicate documents

Practical Example

Scenario: Build a customer support bot that uses company documentation.

Knowledge Base:
  - 500 support articles
  - 2,000 FAQ entries
  - 100 product docs

User: "How do I reset my password?"

Retrieve top-3 relevant articles

Prompt LLM with:
  "Context: [Article: Password Reset Steps]
   Question: How do I reset my password?
   Answer:"

LLM generates grounded answer

Cost: ~$0 training, $0.01 per query

4. Direct Comparison Matrix

Decision Matrix

| Method | Cost | Time | Quality | Inference Speed | Best For |
|---|---|---|---|---|---|
| Training from Scratch | $$$$$ | Months | ⭐⭐⭐⭐⭐ | Slow | New capability, unlimited budget |
| Distillation | $ | Weeks | ⭐⭐⭐⭐ | Fast | Model compression, edge deployment |
| Full Fine-Tuning | $$ | Days | ⭐⭐⭐⭐ | Medium | Domain specialization |
| LoRA Fine-Tuning | $ | Hours | ⭐⭐⭐ | Medium | Lightweight customization, cost-sensitive |
| RAG | ~$0 | Minutes | ⭐⭐⭐ | Slow | Real-time knowledge, frequent updates |

Quick Decision Tree

Do you need real-time or frequently-updated knowledge?
├─ YES → Use RAG
└─ NO  → Continue...

Does your domain have specialized language/patterns?
├─ YES, and you have 100+ examples → Use Fine-Tuning or LoRA
├─ YES, but no training data → Use RAG
└─ NO  → Continue...

Do you need to run on edge/resource-constrained hardware?
├─ YES → Use Distillation
└─ NO  → Continue...

Do you need new fundamental capability?
├─ YES → Train from Scratch (expensive!)
└─ NO  → Use base model as-is or add RAG
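The tree above can be encoded as a small function for use in planning discussions (the thresholds mirror the text; the function name and signature are ours):

```python
def choose_method(realtime_knowledge: bool,
                  labeled_examples: int,
                  edge_deployment: bool,
                  new_capability: bool) -> str:
    """Walk the quick decision tree in order; thresholds follow the text above."""
    if realtime_knowledge:
        return "RAG"
    if labeled_examples >= 100:
        return "Fine-Tuning or LoRA"
    if edge_deployment:
        return "Distillation"
    if new_capability:
        return "Train from Scratch"
    return "Base model as-is (optionally + RAG)"

choice = choose_method(realtime_knowledge=False, labeled_examples=500,
                       edge_deployment=False, new_capability=False)
```

For the example inputs (static knowledge, 500 labeled examples), the function lands on fine-tuning, matching the second branch of the tree.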

5. LoRA (Low-Rank Adaptation) Implementation Guide

LoRA is one of the most practical fine-tuning techniques for production use. Instead of updating all weights in a model, LoRA adds small “adapter” matrices that are far cheaper to train and store.

The Mathematics

For a weight matrix W in a layer, LoRA introduces two low-rank matrices:

W_new = W_original + α * (A @ B)

Where:
- W_original: original weight matrix (e.g., shape 2048 × 2048)
- A: "down-projection" matrix (2048 × 8, if rank r=8)
- B: "up-projection" matrix (8 × 2048)
- α: scaling factor (usually 1.0 or 2.0)

Key insight: Instead of learning 4.2M parameters (2048×2048), you learn only 32K (2048×8 + 8×2048).
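The parameter arithmetic can be checked directly with a few lines of PyTorch (a standalone sketch of the update rule, independent of any library; the zero-initialized factor mirrors standard LoRA initialization, so the adapter starts as a no-op):

```python
import torch

d, r, alpha = 2048, 8, 1.0
W = torch.randn(d, d)          # frozen original weights: 2048 × 2048
A = torch.randn(d, r) * 0.01   # down-projection (trainable)
B = torch.zeros(r, d)          # up-projection (trainable, initialized to zero)

# The LoRA update: W_new equals W exactly until A @ B moves away from zero
W_new = W + alpha * (A @ B)

full_params = d * d            # 4,194,304 parameters in the full matrix
lora_params = d * r + r * d    # 32,768 parameters in the adapter (~0.8% of the matrix)
```

Because one factor starts at zero, training begins from the base model's exact behavior, which is part of why LoRA preserves original knowledge so well.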

LoRA Implementation with Transformers

Here’s a complete working example using HuggingFace and peft:

# Install: pip install peft transformers torch bitsandbytes

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import Dataset

# Step 1: Load base model (Llama 2 7B; gated repo, requires accepting Meta's license)
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Step 2: Configure LoRA
lora_config = LoraConfig(
    r=8,                           # Rank of the adaptation matrices
    lora_alpha=16,                 # Scaling factor (alpha/r = 2.0)
    target_modules=["q_proj", "v_proj"],  # Apply to attention projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Step 3: Create PEFT model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,738,415,616 (0.06%)

# Step 4: Prepare data
# Example: Medical Q&A dataset
training_data = [
    {"text": "Q: What are symptoms of diabetes?\nA: Increased thirst, urination, fatigue..."},
    {"text": "Q: How is hypertension treated?\nA: With antihypertensive medications..."},
    # ... more examples
]

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

dataset = Dataset.from_list(training_data)  # each item already has a "text" field
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Step 5: Train
training_args = TrainingArguments(
    output_dir="./lora-llama-medical",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    save_steps=100,
    save_total_limit=3,
    logging_steps=10,
    warmup_ratio=0.1,
)

from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    # Causal-LM collator pads batches and copies input_ids into labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

# Step 6: Save LoRA weights (only 10-30MB!)
model.save_pretrained("./lora-medical-adapter")

# Step 7: Load and use
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("./lora-medical-adapter")

# Inference
prompt = "Q: What are the risk factors for stroke?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

LoRA Practical Metrics

Training efficiency:

  • Llama 7B LoRA on 1,000 medical Q&A pairs
  • Hardware: Single RTX 4070
  • Time: 4-6 hours
  • Cost: ~$1-2
  • Quality gain: +10-15% on medical domain accuracy

Adapter size breakdown:

  • Base model: 14GB (FP16)
  • LoRA adapter: 10-30MB
  • Total deployment: 14GB base + 10MB adapter
  • Can serve 100+ different LoRA adapters on same base model

Comparison: LoRA vs Full Fine-Tuning

# Full Fine-Tuning (all parameters updated)
def full_finetune(model):
    for param in model.parameters():
        param.requires_grad = True
    # Training updates 6.7B parameters
    # Storage: 14GB (replaces original weights)
    # Training time: 1-2 days
    # Cost: $50-200

# LoRA Fine-Tuning (only adapters)
def lora_finetune(model):
    model = get_peft_model(model, lora_config)  # lora_config as configured in Step 2
    # Training updates 4.2M parameters (0.06%)
    # Storage: 10MB (additive to base)
    # Training time: 4-6 hours
    # Cost: $1-2

5b. When to Fine-Tune vs Distill vs RAG: Decision Tree

Here’s a comprehensive decision flowchart with real examples:

START: You need to adapt a model

├─ Question 1: Is your knowledge frequently updated or real-time?
│  ├─ YES → Go to RAG (Section 3)
│  │    Examples: News chatbot, stock prices, weather forecasts
│  │    Why: Retraining is slow; retrieval is instant
│  │
│  └─ NO → Continue to Question 2

├─ Question 2: Do you have labeled domain data (100+ examples)?
│  ├─ YES → Continue to Question 3
│  │    Examples: Medical Q&A, customer support logs, code samples
│  │
│  ├─ NO, but have unlabeled text → Use Distillation (Section 1)
│  │    Example: Have 1M medical papers but can't label them
│  │    Approach: Distill from teacher, then fine-tune on small labeled set
│  │
│  └─ NO data at all → Use base model as-is or add RAG

├─ Question 3: Do you need to run on edge/constrained hardware?
│  ├─ YES → Distillation first (compress), then optionally fine-tune
│  │    Examples: Mobile app, embedded device, Raspberry Pi
│  │    Approach: Distill 70B → 7B, then fine-tune on 100 domain examples
│  │    Timeline: 3 weeks total
│  │    Cost: $5K-10K
│  │
│  └─ NO → Continue to Question 4

├─ Question 4: How much labeled data do you have?
│  ├─ <100 examples → RAG is safer (fine-tuning may overfit)
│  │
│  ├─ 100-1,000 examples → LoRA fine-tuning (best ROI)
│  │    Cost: $100-500
│  │    Time: 4-12 hours
│  │    Quality: +5-10% on domain tasks
│  │
│  ├─ 1,000-10,000 examples → Full fine-tuning (best quality)
│  │    Cost: $1K-5K
│  │    Time: 1-3 days
│  │    Quality: +10-20% on domain tasks
│  │
│  └─ 10,000+ examples → Consider training from scratch
│      Only if: New task entirely outside pre-trained knowledge

├─ Question 5: What's your cost tolerance?
│  ├─ <$500 → RAG + LoRA
│  │
│  ├─ $500-5K → LoRA or full fine-tuning
│  │
│  └─ $5K+ → Distillation + fine-tuning + RAG

└─ DECISION: Output below

Decision Matrix by Scenario

| Scenario | Approach | Cost | Time | Quality | Example |
|---|---|---|---|---|---|
| Chatbot for internal docs | RAG + Claude API | $0 training | 1 day | ⭐⭐⭐⭐ | Zendesk docs → vector DB → Claude |
| Medical diagnosis on laptop | Distill + LoRA | $8K | 3 weeks | ⭐⭐⭐⭐ | Distill 70B medical model → 7B → fine-tune on hospital data |
| Sentiment analysis for tweets | LoRA on DistilBERT | $200 | 4 hours | ⭐⭐⭐⭐⭐ | 500 labeled tweets; far cheaper and faster than a 7B model |
| Real-time stock analysis | RAG on news feeds | $50/mo | 2 days | ⭐⭐⭐ | Ingest Yahoo Finance, Bloomberg → retrieve → analyze |
| Edge deployment (2GB device) | Distillation + int4 | $3K | 2 weeks | ⭐⭐⭐ | Compress + quantize → fits on phone |
| Fine-tune Claude on customer data | Provider fine-tuning (e.g., via Amazon Bedrock) | $1-5K | 1 day | ⭐⭐⭐⭐⭐ | 1K examples → Claude becomes your domain expert |

5c. Cost Comparison Table

Full Cost Analysis (All-In Costs)

| Method | Compute Cost | Labor Cost | Data Labeling | Total 1-Time | Annual Maintenance | Break-Even |
|---|---|---|---|---|---|---|
| Train from Scratch | $50K-500K | $10K (engineer-months) | Included | $60K-510K | $5K-20K | N/A (one-time) |
| Distillation | $3K-10K | $2K | ~$500 (data cleanup) | $5.5K-12.5K | $0 | 0 months |
| Full Fine-Tuning | $1K-5K | $1K | $500-2K | $2.5K-8K | $0 | 0 months |
| LoRA Fine-Tuning | $200-1K | $500-1K | $300-1K | $1K-3K | $0 | 0 months |
| RAG Setup | $100-500 (vector DB) | $1K (integration) | $0 | $1.1K-1.5K | $100-500/mo | 3 months |
| Claude API Fine-Tuning | $1K-5K | $200 (via API) | $0 | $1.2K-5.2K | $0 | 0 months |

Cost Per 1% Quality Improvement

Distillation:    $50-100 per 1% improvement
  - Compresses knowledge, 5-10% quality loss acceptable
  - Cost amortized if serving 1000s of requests

Full Fine-Tuning: $20-60 per 1% improvement
  - Direct domain specialization
  - If you have 100+ examples, fine-tuning beats all others

LoRA:            $10-30 per 1% improvement
  - Cheapest per improvement
  - Best for cost-conscious teams

RAG:             $0-10 per 1% improvement (if documents high quality)
  - Free to implement (only retrieval cost)
  - Quality depends entirely on document quality
  - If documents poor, no ROI

Claude API:      pay-per-token (roughly $3-$15 per million tokens, depending on model)
  - Expensive long-term at high volume
  - Cheap for occasional queries

Real-World Pricing Example: Building a Customer Support Bot

Scenario: 100K customer queries/month, 20,000 unique questions

Option A: Fine-tuned 7B model (local)
├─ Data collection & labeling: 1,000 Q&A pairs = $5K
├─ LoRA fine-tuning: 8 hours compute = $100
├─ Server cost (small): RTX 4070 = $600 (one-time) + $100/year power
├─ Total year 1: $5.8K
└─ Cost per query: ~$0.005 (all-in, year 1; marginal cost near zero once hardware is owned)

Option B: RAG + Claude API
├─ Vector DB setup: Pinecone = $100 setup
├─ Indexing 20K Q&As: $100
├─ Claude API: 100K queries × 500 tokens avg × $0.003/1K tokens = $150/month
├─ Total year 1: $200 (setup) + $1.8K (annual API) = $2K
└─ Cost per query: ~$0.0015

Option C: RAG + Open-source model (vLLM)
├─ Setup: Same as Option B = $200
├─ Hosting (cloud): one mid-range GPU instance ≈ $600/month
├─ Total year 1: $200 + $7.2K = $7.4K
└─ Cost per query: ~$0.006

Option D: Full enterprise (fine-tuned + RAG hybrid)
├─ Fine-tune: $5K
├─ Server: $600
├─ RAG/vector DB: $500/year
├─ Total year 1: $6.1K
└─ Cost per query: ~$0.005 (all-in, year 1; as in Option A, but higher quality)
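The arithmetic behind the API-backed options follows one pattern: setup cost plus twelve months of per-token spend. A small helper makes it explicit (a sketch using the illustrative prices from Option B above, not quoted vendor rates):

```python
def api_option_cost(queries_per_month, avg_tokens_per_query, price_per_1k_tokens, setup_cost):
    """Year-one and per-query cost for an API-backed option (setup + 12 months of tokens)."""
    monthly_api = queries_per_month * avg_tokens_per_query / 1000 * price_per_1k_tokens
    return {
        "monthly_api": monthly_api,
        "year_one": setup_cost + 12 * monthly_api,
        "per_query": monthly_api / queries_per_month,
    }

# Option B figures: 100K queries/month, 500 tokens average, $0.003 per 1K tokens, $200 setup
b = api_option_cost(100_000, 500, 0.003, setup_cost=200)
```

Plugging in your own volume and prices shows where the crossover between a local fine-tuned model (high fixed, near-zero marginal cost) and an API (low fixed, linear marginal cost) sits.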

6. Hybrid Approaches

The most powerful setups combine multiple methods:

Distillation + Fine-Tuning

Pattern: Compress knowledge, then specialize.

GPT-4 (teacher)

    ↓ Distillation

7B Student Model (general knowledge)

    ↓ Fine-tuning on domain data

7B Domain Specialist (specialized knowledge in small package)

When to use: You want a small model with domain expertise. Example: Compress GPT-4’s knowledge into 7B, then fine-tune on medical literature. Cost: $5K-$15K

Fine-Tuning + RAG

Pattern: Specialize model, augment with live knowledge.

Base Model

    ↓ Fine-tuning on domain data

Domain-Specialized Model
    ↓ + RAG pipeline

Real-time domain answers

When to use: You need both specialized behavior AND current knowledge. Example: Fine-tune on your company’s writing style, augment with latest docs and policies. Cost: $2K-$5K

Distillation + Fine-Tuning + RAG

Pattern: Compress, specialize, augment with knowledge.

Large Teacher Model

    ↓ Distillation

Small Student (edge-deployable)

    ↓ Fine-tuning

Domain Specialist
    ↓ + RAG

Real-time domain specialist on edge

When to use: Maximum optimization across cost, quality, and deployment. Example: Compress GPT-4 → fine-tune for your domain → deploy on device with local RAG. Cost: $10K-$25K


6b. Practical Examples

Example 1: Compress GPT-4 Knowledge into 7B Model

Goal: Run GPT-4-level coding ability on a single GPU.

Approach: Distillation

Steps:

  1. Generate 100K diverse code problems with GPT-4 solutions
  2. Collect token probabilities (logprobs) from GPT-4 where the API exposes them; full softmax distributions are not available, so sequence-level distillation on sampled outputs is the common fallback
  3. Train Mistral 7B or Llama 7B to match GPT-4’s probability distributions
  4. Use temperature τ=4 to preserve detailed knowledge

Timeline: 3 weeks Cost: $2,000-$5,000 Result: 7B model at 90-95% of GPT-4’s code quality, 100x cheaper inference


Example 2: Fine-Tune 7B Model on Company Documentation

Goal: Help engineers find answers in your codebase and docs.

Approach: LoRA fine-tuning

Steps:

  1. Collect 500-1,000 Q&A pairs from your docs and code
  2. Train LoRA adapter on Llama 7B for 12 hours
  3. Serve base model + LoRA adapter (1% size overhead)

Timeline: 1 day Cost: $100-$500 Result: Model answers questions specific to your company


Example 3: Real-Time News Knowledge in Chatbot

Goal: Chatbot answers questions about current events without retraining.

Approach: RAG

Steps:

  1. Ingest news feeds into vector database (Pinecone, Weaviate, etc.)
  2. For each user query, retrieve top-5 relevant articles
  3. Augment prompt with retrieved articles
  4. LLM generates answer grounded in current news

Timeline: 1 day to set up Cost: $20-$50/month Result: Always-current knowledge at no training cost


Example 4: Combined Approach for Optimal Cost/Quality/Speed

Goal: Edge-deployable medical diagnosis assistant that knows your hospital’s protocols.

Approach: Distillation + Fine-tuning + RAG

Steps:

  1. Distillation (Week 1): Compress medical knowledge from larger model into Mistral 7B
  2. Fine-tuning (Week 2): LoRA-adapt 7B on your hospital’s diagnostic protocols
  3. RAG setup (Week 2): Index latest clinical guidelines, patient histories

Architecture:

Edge Device
├─ Distilled 7B model (2GB, edge-ready)
├─ LoRA adapter (10MB)
└─ Local vector DB of guidelines

Timeline: 2-3 weeks Cost: $8K-$12K Result: Specialized medical assistant deployable on-device with real-time protocol updates


7. Measuring Success

For Distilled Models

  • BLEU/ROUGE scores: Compare student outputs to teacher outputs (target: 0.8+ alignment)
  • Task accuracy: Measure on held-out test set (target: 90%+ of teacher performance)
  • Latency improvement: Track inference speed gains (expect: 10-50x faster)
  • Size reduction: Verify model compression ratio (expect: 10-100x smaller)

A/B test: Direct comparison on same prompts

prompt = "Explain quantum computing"
teacher_output = teacher.generate(prompt)   # e.g., GPT-4 via its API
student_output = student.generate(prompt)   # distilled 7B model
similarity = cosine_similarity(embed(teacher_output), embed(student_output))
# Target: similarity > 0.85

For Fine-Tuned Models

  • Task-specific metrics: Accuracy, F1, precision/recall on your domain
  • Domain-specific benchmarks: Compare before/after fine-tuning
  • Degradation on general tasks: Ensure you didn’t hurt base model capabilities

Example for medical domain:

Before fine-tuning: 65% accuracy on medical Q&A
After fine-tuning:  78% accuracy on medical Q&A
Cost:               +13% improvement for $2,000

For RAG Systems

  • Retrieval quality:

    • Precision@5: Are top-5 docs relevant? (target: >70%)
    • Recall: What % of relevant docs do we find? (target: >80%)
    • MRR (Mean Reciprocal Rank): How high is first match? (target: >0.7)
  • End-to-end quality:

    • Answerability: Can LLM answer question given retrieved docs? (target: >85%)
    • Grounding: Does answer cite retrieved sources correctly? (target: >95%)
    • Hallucination rate: Does model invent facts? (target: <5%)

Measurement:

100 test questions
├─ Retrieve docs for each
├─ Measure retrieval precision/recall
├─ LLM generates answers
├─ Human evaluates answer quality
└─ Calculate final end-to-end quality score

8. When NOT to Use Each Method

Don’t Use Distillation If:

  • Model is already small (7B or smaller) — no benefit to compression
  • Quality loss is unacceptable — distillation loses 5-10% quality
  • You need real-time updates — distillation creates static model
  • Access to teacher model is restricted — can’t generate training data

Don’t Use Fine-Tuning If:

  • You need new fundamental capability — fine-tuning refines, doesn’t create
  • You have <10 examples per pattern — too little data to learn effectively
  • Knowledge is constantly changing — fine-tuning is static, use RAG
  • Your data contains contradictions — model training will oscillate

Don’t Use RAG If:

  • Documents must stay private and local — sending retrieved content to a third-party API is a security risk (self-hosted RAG avoids this)
  • Latency must be <100ms — retrieval adds latency
  • You need very large context (100K+ tokens) — context windows are limited
  • Knowledge is sparse and scattered — poor retrieval results

9. Economics Comparison

All-In Costs (compute + labor)

| Method | Compute | Labor | Total | Amortized/Year |
|---|---|---|---|---|
| Train from Scratch | $50K-$500K | $20K | $70K-$520K | $70K-$520K (one-time) |
| Distillation | $3K-$10K | $2K | $5K-$12K | $5K-$12K (one-time) |
| Full Fine-Tuning | $1K-$5K | $1K | $2K-$6K | $2K-$6K (one-time) |
| LoRA Fine-Tuning | $200-$1K | $500 | $700-$1.5K | $700-$1.5K (one-time) |
| RAG Setup | $100-$500 | $500 | $600-$1K | $600-$1K (one-time) |
| RAG Ongoing | $100-$1K/month | — | — | $1.2K-$12K |

Cost-Per-Quality Comparison

Assuming you’re measuring quality improvement from baseline:

  • Distillation: $50-$100 per 1% quality improvement
  • Fine-tuning: $20-$60 per 1% quality improvement
  • LoRA: $10-$30 per 1% quality improvement
  • RAG: $0-$10 per 1% quality improvement (depends on data quality)

10. Decision Framework

Flowchart: Which Method to Choose?

START

├─ Is knowledge frequently updated or real-time?
│  ├─ YES → Use RAG
│  └─ NO → Continue

├─ Do you have 100+ domain-specific training examples?
│  ├─ YES → Use Fine-Tuning or LoRA
│  └─ NO → Continue

├─ Do you need the model to run on edge/device?
│  ├─ YES → Use Distillation (then Fine-tune if specialized)
│  └─ NO → Continue

├─ Do you need a fundamental new capability?
│  ├─ YES → Train from Scratch (warning: expensive!)
│  └─ NO → Use base model + RAG, or fine-tune if you have data

└─ END: Choose combined approach if multiple factors apply

Decision Factors Checklist

Use Distillation when:

  • You have a large teacher model
  • You need to reduce model size significantly
  • Inference cost/speed is critical
  • You have 50K+ examples to train on

Use Fine-Tuning when:

  • You have 100-10K domain-specific examples
  • You want to specialize model behavior
  • One-time setup cost is acceptable
  • Base model can be improved within domain

Use LoRA when:

  • You want fine-tuning but need to minimize cost
  • You need to serve multiple domain variants
  • Model size/memory is constrained
  • Quick iteration is important

Use RAG when:

  • Knowledge changes frequently
  • You can’t label training examples
  • You want to preserve base model knowledge
  • Latency tolerance is >100ms
  • Knowledge is in unstructured documents

Combine methods when:

  • You need compression + specialization → Distill + Fine-tune
  • You need specialization + live updates → Fine-tune + RAG
  • You need all three → Distill + Fine-tune + RAG

Summary: Quick Reference

| Question | Answer |
|---|---|
| Fastest to deploy? | RAG (minutes) |
| Cheapest? | RAG (~$0 training) |
| Best quality for domain tasks? | Fine-tuning (if you have data) |
| Best for edge/on-device? | Distillation |
| Most flexible? | RAG (updates instantly) |
| Best ROI on quality? | LoRA fine-tuning |
| Best for real-time knowledge? | RAG |
| Best for specialized behavior? | Fine-tuning + LoRA |
| Highest quality possible? | Distillation + Fine-tuning + RAG |

Golden Rule: Start with RAG if you have documents. Add fine-tuning if you have labeled examples. Add distillation if you need to run on constrained hardware. Train from scratch only as a last resort.


11. Real-World Example 1: Fine-Tuning Claude on Medical Data

Scenario

A healthcare startup wants to fine-tune Claude on their medical literature and case studies to make it better at answering domain-specific questions. Note: Anthropic's first-party API does not offer general self-serve fine-tuning; Claude fine-tuning has been available for select models through platforms such as Amazon Bedrock. The workflow below is therefore an illustrative sketch modeled on common fine-tuning APIs.

Dataset Preparation

import json
from anthropic import Anthropic

# Prepare training data: question-answer pairs
training_data = [
    {
        "messages": [
            {"role": "user", "content": "What are the symptoms of Type 2 diabetes?"},
            {"role": "assistant", "content": "Type 2 diabetes symptoms include increased thirst, frequent urination, fatigue, blurred vision, and slow wound healing..."}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "How is hypertension diagnosed?"},
            {"role": "assistant", "content": "Hypertension is diagnosed when blood pressure readings are consistently 130/80 mmHg or higher..."}
        ]
    },
    # ... 998 more examples ...
]

# Save as JSONL (one JSON object per line)
with open("medical_training.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

Fine-Tuning Process

from anthropic import Anthropic

client = Anthropic()

# Step 1: Upload training file
with open("medical_training.jsonl", "rb") as f:
    training_file = client.beta.files.upload(
        file=("medical_training.jsonl", f, "application/x-ndjson"),
    )

print(f"File uploaded: {training_file.id}")

# Step 2: Create a fine-tuning job (illustrative endpoint; not part of Anthropic's public SDK)
fine_tune_job = client.beta.fine_tuning.jobs.create(
    model="claude-sonnet-4",
    training_file=training_file.id,
    hyperparameters={
        "epochs": 3,
        "learning_rate_multiplier": 1.0,
        "batch_size": 32,
    }
)

print(f"Fine-tuning job created: {fine_tune_job.id}")

# Step 3: Monitor job status
import time

while True:
    job_status = client.beta.fine_tuning.jobs.retrieve(fine_tune_job.id)
    
    if job_status.status == "succeeded":
        print(f"Fine-tuning complete! Model ID: {job_status.fine_tuned_model}")
        break
    elif job_status.status == "failed":
        print(f"Fine-tuning failed: {job_status.error}")
        break
    
    print(f"Status: {job_status.status} - Progress: {job_status.training_steps_completed}/{job_status.total_training_steps}")
    time.sleep(30)

# Step 4: Use fine-tuned model
fine_tuned_model = job_status.fine_tuned_model

response = client.messages.create(
    model=fine_tuned_model,
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What is the recommended treatment for gestational diabetes?"}
    ]
)

print(response.content[0].text)

Cost Analysis

Training data: 1,000 examples
Training cost: $6 (input tokens) + $18 (output tokens) = $24
Training time: ~1-2 hours
Fine-tuned model name: ft-YOUR-ID

Monthly usage (assuming 100 queries/day, at ~$3/M input and ~$15/M output tokens):
- Input: 100 tokens per query × 100 queries = 10,000 tokens/day ≈ $0.03
- Output: 500 tokens per query × 100 queries = 50,000 tokens/day ≈ $0.75
- Daily cost: ~$0.78
- Monthly cost: ~$23
- Annual cost: ~$280

ROI: $24 training investment + ~$280 annual ≈ $305 total
Baseline (without fine-tuning): $12/day = $4,380/year
Savings: ~$4,075/year (~14x ROI)

Evaluation Metrics

After fine-tuning, test on held-out medical questions:

# Test set (not used in training)
test_questions = [
    "How is COPD managed?",
    "What are risk factors for myocardial infarction?",
    "Explain the pathophysiology of cirrhosis"
]

# Baseline (no fine-tuning)
baseline_response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=500,
    messages=[{"role": "user", "content": "How is COPD managed?"}]
)

# Fine-tuned model
finetuned_response = client.messages.create(
    model=fine_tuned_model,
    max_tokens=500,
    messages=[{"role": "user", "content": "How is COPD managed?"}]
)

# Evaluate quality (manually or with another model)
# Metrics: Domain accuracy, terminology precision, specificity to medical context

12. Real-World Example 2: Distilling Claude into Phi for On-Device Deployment

Scenario

A mobile health app wants Claude-style reasoning on 7-inch tablets with only 2GB RAM. Solution: Distill Claude's knowledge into Phi-2 (2.7B parameters: ~5.5GB in FP16, ~2.75GB in int8, and ~1.4GB in int4, the variant that fits the 2GB budget).

Distillation Process

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import anthropic

client = anthropic.Anthropic()

# Step 1: Generate diverse medical Q&A with Claude (teacher)
medical_topics = [
    "diabetes management",
    "hypertension treatment",
    "asthma control",
    "chronic kidney disease",
    "heart failure"
]

def generate_training_data(topic, num_examples=100):
    """Use Claude to generate diverse training examples"""
    
    prompt = f"""Generate {num_examples} diverse medical Q&A pairs about {topic}.
    Format: Q: [question]\nA: [answer]\n\n
    Make questions and answers realistic, varied in difficulty, and accurate."""
    
    response = client.messages.create(
        model="claude-sonnet-4",
        max_tokens=8000,
        messages=[{"role": "user", "content": prompt}]
    )
    
    # Parse response into Q&A pairs
    qa_pairs = []
    lines = response.content[0].text.split("\n")
    
    current_q = None
    current_a = None
    
    for line in lines:
        if line.startswith("Q:"):
            if current_q and current_a:
                qa_pairs.append({"q": current_q, "a": current_a})
            current_q = line[2:].strip()
            current_a = None
        elif line.startswith("A:"):
            current_a = line[2:].strip()
    
    if current_q and current_a:
        qa_pairs.append({"q": current_q, "a": current_a})
    
    return qa_pairs

# Generate 500 Q&A pairs (5 topics × 100 each)
all_training_data = []
for topic in medical_topics:
    print(f"Generating examples for {topic}...")
    examples = generate_training_data(topic, num_examples=100)
    all_training_data.extend(examples)

print(f"Generated {len(all_training_data)} training examples")

# Step 2: Also collect soft targets (probabilities) from Claude
def collect_soft_targets(qa_pairs, batch_size=10):
    """Get logits/probabilities from Claude for knowledge distillation"""
    
    soft_targets = []
    
    for i in range(0, len(qa_pairs), batch_size):
        batch = qa_pairs[i:i+batch_size]
        
        for item in batch:
            # For each Q&A, get Claude's confidence scores
            # (This is simplified; real distillation would use logits)
            response = client.messages.create(
                model="claude-sonnet-4",
                max_tokens=100,
                messages=[
                    {"role": "user", "content": f"Q: {item['q']}\n\nA: {item['a']}\n\nHow confident are you in this answer? 0-100"}
                ]
            )
            
            # Free-text reply: parse the trailing number defensively
            try:
                confidence = float(response.content[0].text.split()[-1])
            except (ValueError, IndexError):
                confidence = 90.0  # fall back to a high default
            soft_targets.append({
                "question": item['q'],
                "answer": item['a'],
                "confidence": confidence / 100.0  # Normalize to 0-1
            })
    
    return soft_targets

soft_targets = collect_soft_targets(all_training_data[:50])  # Example: first 50

# Step 3: Fine-tune Phi-2 with knowledge distillation
def distill_to_phi(training_data, soft_targets, num_epochs=3):
    """
    Distill Claude's knowledge into Phi-2.

    True distillation softens the teacher's logits with temperature > 1;
    Claude's API exposes no logits, so confidence scores stand in here.
    """
    
    # Load Phi-2
    model_name = "microsoft/phi-2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # Phi-2 ships without a pad token
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    
    # Prepare data
    from torch.utils.data import DataLoader, Dataset
    
    class DistillationDataset(Dataset):
        def __init__(self, qa_pairs, soft_targets, tokenizer):
            self.qa_pairs = qa_pairs
            self.soft_targets = {st['question']: st['confidence'] for st in soft_targets}
            self.tokenizer = tokenizer
        
        def __len__(self):
            return len(self.qa_pairs)
        
        def __getitem__(self, idx):
            qa = self.qa_pairs[idx]
            text = f"Q: {qa['q']}\nA: {qa['a']}"
            
            # Get soft target confidence if available
            confidence = self.soft_targets.get(qa['q'], 0.9)
            
            # Pad to a fixed length so the default collate_fn can batch examples
            tokens = self.tokenizer(
                text,
                truncation=True,
                max_length=256,
                padding="max_length",
                return_tensors="pt"
            )
            
            return {
                'input_ids': tokens['input_ids'].squeeze(),
                'attention_mask': tokens['attention_mask'].squeeze(),
                'soft_target': torch.tensor(confidence, dtype=torch.float)
            }
    
    dataset = DistillationDataset(training_data[:100], soft_targets, tokenizer)
    dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
    
    # Training loop with distillation loss
    from torch import nn
    
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    criterion_ce = nn.CrossEntropyLoss()
    # A full distillation loss would add KLDivLoss against the teacher's
    # temperature-softened logits; without access to Claude's logits, this
    # sketch weights the CE loss by teacher confidence instead
    
    for epoch in range(num_epochs):
        total_loss = 0
        
        for batch in dataloader:
            input_ids = batch['input_ids'].to("cuda")
            attention_mask = batch['attention_mask'].to("cuda")
            soft_targets = batch['soft_target'].to("cuda")
            
            # Forward pass
            outputs = model(input_ids, attention_mask=attention_mask, return_dict=True)
            logits = outputs.logits
            
            # Distillation loss: match teacher's soft targets
            # Use cross-entropy loss as proxy
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = input_ids[..., 1:].contiguous()
            
            loss = criterion_ce(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1)
            )
            
            # Weight by teacher confidence (soft targets)
            weighted_loss = loss * soft_targets.mean()
            
            optimizer.zero_grad()
            weighted_loss.backward()
            optimizer.step()
            
            total_loss += weighted_loss.item()
        
        print(f"Epoch {epoch+1}, Loss: {total_loss / len(dataloader):.4f}")
    
    return model

# Distill Phi-2
distilled_phi = distill_to_phi(all_training_data, soft_targets)

# Step 4: Quantize for mobile (int8)
# Dynamic quantization runs on CPU over float32 weights
from torch.quantization import quantize_dynamic
import torch.nn as nn

distilled_phi = distilled_phi.to("cpu").float()
quantized_phi = quantize_dynamic(
    distilled_phi,
    {nn.Linear},
    dtype=torch.qint8
)

# Save the quantized weights (torch.jit.script rarely works on Hugging Face
# models; a plain state_dict checkpoint is the reliable route)
torch.save(quantized_phi.state_dict(), "phi_medical_int8.pt")

print("Distilled and quantized Phi ready for mobile!")
print("Model size: ~2.7GB (int8)")
print("On-device inference: ~30-50 tokens/s on a modern tablet")

Results Comparison

Teacher Model (Claude via API):
- Quality: ⭐⭐⭐⭐⭐ (state-of-the-art)
- Speed: Slow (cloud API latency)
- Cost: $0.003 per 1K input tokens
- Deployment: Cloud only

Student Model (Distilled Phi-2):
- Quality: ⭐⭐⭐⭐ (90-95% of Claude quality)
- Speed: ⭐⭐⭐ (30-50 tokens/s on tablet)
- Cost: $0 (one-time distillation cost)
- Deployment: Offline on a 4GB tablet

Distillation ROI:
- Training cost: $200 (Claude API calls)
- Ongoing savings: ~$3 per 1,000 queries vs. the API
- Per app installation: unlimited offline queries

13. When Methods Fail: Troubleshooting Guide

Distillation Not Working

Problem: Student model quality doesn’t improve with teacher data

Causes and fixes:

  1. Temperature too low (τ = 1): Probability distributions too sharp
    • Fix: Increase to τ = 3-5
  2. Student model too small relative to task: Can’t fit knowledge
    • Fix: Use larger student (7B instead of 3.5B) or simpler task
  3. Insufficient training data: Need 10K+ examples for good distillation
    • Fix: Generate more synthetic data from teacher
  4. Dataset mismatch: Training data doesn’t match deployment domain
    • Fix: Collect data representative of real use
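The temperature effect behind fix 1 is easy to see numerically. A minimal sketch with illustrative logit values:

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax: higher tau flattens the distribution."""
    exps = [math.exp(x / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]  # illustrative teacher logits

sharp = softmax(logits, tau=1.0)  # ~[0.93, 0.05, 0.03]: near one-hot
soft = softmax(logits, tau=4.0)   # ~[0.53, 0.25, 0.22]: runner-ups visible

print(sharp, soft)
```

At τ = 1 the student only sees the top answer; at τ = 4 the relative plausibility of the alternatives carries usable signal.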

Fine-Tuning Not Working

Problem: Accuracy doesn’t improve, or model “forgets” base knowledge

Causes and fixes:

  1. Catastrophic forgetting: Model overwrites original knowledge
    • Fix: Lower learning rate (2e-5 instead of 2e-4), use LoRA
  2. Data quality: Mislabeled or inconsistent examples
    • Fix: Clean data, verify labels, use quality heuristics
  3. Insufficient data: <100 examples is risky
    • Fix: Collect more examples or use LoRA (more data-efficient)
  4. Wrong task: Fine-tuning teaches patterns, not new capabilities
    • Fix: If model fundamentally can’t do task, use RAG or distill better teacher
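The LoRA suggestion in fixes 1 and 3 works because the adapter trains a tiny fraction of the weights. A back-of-the-envelope sketch, using Phi-2's hidden size and an illustrative rank of 8:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r LoRA adapter factors the weight update into
    A (d_in x r) and B (r x d_out), trained in place of the full matrix."""
    return d_in * rank + rank * d_out

d = 2560                           # Phi-2 hidden size
full = d * d                       # one square projection: 6,553,600 weights
lora = lora_params(d, d, rank=8)   # 40,960 weights

print(f"LoRA trains {lora:,} weights instead of {full:,} (~0.6%)")
```

Fewer trainable weights means less capacity to overwrite base knowledge, which is why LoRA is the first remedy for catastrophic forgetting.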

RAG Not Working

Problem: Retrieval misses relevant documents

Causes and fixes:

  1. Poor embeddings: Document embeddings don’t match queries
    • Fix: Use better embedding model (OpenAI, Jina, or specialized domain)
  2. Bad chunking: Documents too large or too small
    • Fix: Chunk at 256-512 token boundaries (semantic chunks)
  3. Outdated documents: Knowledge base stale
    • Fix: Refresh index regularly, remove obsolete docs
  4. Query intent mismatch: User question phrased differently than documents
    • Fix: Expand queries (multi-query, HyDE), rewrite with LLM
  5. No relevant document exists: Knowledge base doesn’t contain answer
    • Fix: Add missing documents, fall back to base model
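The chunking advice in fix 2 can be sketched with a simple overlapping word-window chunker. A real pipeline would count tokens with the embedding model's tokenizer; word count is a rough proxy here:

```python
def chunk_text(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    """Split text into overlapping word-window chunks."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

doc = "word " * 700  # stand-in for a long document
chunks = chunk_text(doc.strip())
print(len(chunks))  # 3 overlapping chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.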

Diagnostic query:

def diagnose_rag(query, retriever, model):
    """Check where RAG breaks down"""
    
    # Step 1: Check retrieval quality
    docs = retriever.retrieve(query, top_k=5)
    
    if not docs:
        print("❌ Retrieval failed: No documents returned")
        return "retrieval_broken"
    
    if docs[0]['score'] < 0.5:
        print("⚠️  Low retrieval confidence:", docs[0]['score'])
        return "poor_retrieval"
    
    # Step 2: Check if documents plausibly contain the answer
    doc_text = " ".join([d['text'] for d in docs])
    # Crude heuristic: do the query's content words appear in the docs?
    query_terms = [w for w in query.lower().split() if len(w) > 3]
    answer_present = any(term in doc_text.lower() for term in query_terms)
    
    if not answer_present:
        print("❌ Retrieved docs don't contain answer")
        return "docs_insufficient"
    
    # Step 3: Check if LLM can use documents
    prompt = f"Query: {query}\nDocuments: {doc_text}\nAnswer:"
    response = model.generate(prompt)
    
    if "don't know" in response.lower():
        print("⚠️  LLM can't answer despite documents")
        return "llm_failure"
    
    print("✓ RAG working correctly")
    return "ok"

Validation Checklist

How do you know you got this right?

Performance Checks

  • Fine-tuned model outperforms base model on 5+ domain-specific tasks (measure accuracy delta)
  • RAG retrieval precision@5 exceeds 70% on representative test queries
  • Distilled student model retains 90%+ of teacher model quality on held-out evaluation set
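The precision@5 check assumes a way to score retrieval. A minimal sketch, given per-result relevance judgments (1 = relevant, 0 = not; the judgments here are illustrative):

```python
def precision_at_k(relevance: list[int], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents judged relevant."""
    return sum(relevance[:k]) / k

# Judged relevance of the top 5 results for one test query
print(precision_at_k([1, 1, 0, 1, 0], k=5))  # 0.6
```

Averaging this over a representative query set gives the >70% target above.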

Implementation Checks

  • LoRA adapter loads without errors and produces coherent output on first inference
  • Training data is clean: verified labels, no duplicates, no data leakage between train/test splits
  • Quantization level chosen matches deployment hardware (int4 for edge, int8 for server, FP16 for quality-critical)
  • RAG vector database indexed and returning results within latency budget (<500ms retrieval)
  • Cost-per-1% quality improvement calculated for your chosen method (LoRA vs full fine-tune vs RAG)
  • Hybrid approach evaluated: considered combining methods (e.g., fine-tune + RAG) before settling on single method
  • Evaluation pipeline built: automated comparison of base vs adapted model on domain test set
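The quantization check reduces to arithmetic on parameter count and bits per weight. A quick sketch, with Phi-2's 2.7B parameters as the example:

```python
def model_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight-only memory footprint in GB."""
    return num_params * bits_per_weight / 8 / 1e9

params = 2.7e9  # Phi-2
for name, bits in [("FP16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {model_size_gb(params, bits):.1f} GB")
# Weights only; activations and KV cache add to the real footprint
```

If the weight footprint alone approaches device RAM, drop a quantization level before benchmarking quality.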

Integration Checks

  • Adapted model integrates with harness inference loop (loads, generates, streams tokens)
  • LoRA adapters stored separately from base model weights for multi-domain serving
  • RAG pipeline connected to harness tool system (retrieval as a tool the agent can invoke)

Common Failure Modes

  • Catastrophic forgetting: Fine-tuned model loses general knowledge. Fix: use LoRA instead of full fine-tuning, or lower learning rate to 2e-5.
  • RAG retrieval miss: Wrong documents retrieved, model hallucinates. Fix: improve chunking strategy (256-512 tokens), use better embedding model, add reranking step.
  • Distillation plateau: Student quality stops improving well below teacher. Fix: increase temperature (tau=3-5), generate more diverse training data, use larger student model.
  • Overfitting on small datasets: Model memorizes training examples, fails on new inputs. Fix: need 100+ examples minimum for LoRA, 1000+ for full fine-tune; add dropout and regularization.

Sign-Off Criteria

  • Chosen transfer method matches decision tree criteria (real-time knowledge -> RAG, domain specialization -> fine-tune, edge deployment -> distillation)
  • Before/after metrics documented: baseline accuracy vs adapted model accuracy on domain tasks
  • Cost analysis complete: total spend (compute + labor + data) justified by quality improvement
  • Fallback strategy defined: what happens if adapted model underperforms (revert to base model, switch methods)
  • Monitoring plan in place: track model accuracy in production to detect drift and schedule retraining
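The monitoring criterion can start as a simple rolling comparison against the launch baseline. A minimal sketch; the window size, margin, and baseline are illustrative, and production grading would come from your evaluation pipeline:

```python
from collections import deque

class DriftMonitor:
    """Flag retraining when rolling accuracy drops below baseline - margin."""

    def __init__(self, baseline_acc: float, window: int = 100,
                 margin: float = 0.05):
        self.baseline = baseline_acc
        self.margin = margin
        self.results = deque(maxlen=window)

    def record(self, correct: bool) -> bool:
        """Record one graded production answer; return True on drift."""
        self.results.append(1 if correct else 0)
        if len(self.results) < self.results.maxlen:
            return False  # window not yet full
        rolling = sum(self.results) / len(self.results)
        return rolling < self.baseline - self.margin

# Launch baseline of 90% on the domain test set
monitor = DriftMonitor(baseline_acc=0.90, window=10, margin=0.05)
for _ in range(10):
    monitor.record(True)                   # healthy traffic: no drift
needs_retrain = False
for _ in range(5):
    needs_retrain = monitor.record(False)  # accuracy degrades
print(needs_retrain)  # True
```

When the flag fires, the fallback strategy above (revert to base model or switch methods) takes over while retraining is scheduled.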

See Also

  • Doc 01 (Foundation Models) — Model selection affects which transfer methods are practical; hybrid approach enables knowledge transfer
  • Doc 04 (Memory Systems) — Knowledge transfer enables multi-layer memory systems (persistent and episodic layers)
  • Doc 03 (Hugging Face Ecosystem) — Find pre-trained models on Hugging Face to serve as starting points for distillation and fine-tuning
  • Doc 19 (Knowledge Management at Scale) — RAG at scale requires management patterns; knowledge transfer provides the underlying mechanisms