
Knowledge Transfer Methods

Distillation, fine-tuning, LoRA, RAG compared — complete LoRA implementation, decision trees, cost comparison, and real-world examples.

Critical Question: How do I adapt models to my needs without training from scratch?

This guide compares the three primary methods for transferring and adapting LLM knowledge: distillation, fine-tuning, and retrieval-augmented generation (RAG). Each solves different problems and offers different tradeoffs in cost, speed, and quality.


1. Knowledge Distillation

What It Is

Knowledge distillation teaches a smaller, faster model to replicate the behavior of a larger, more capable “teacher” model. Instead of starting from scratch, the student model learns to mimic the teacher’s decision patterns.

Why It Works

Large models (like GPT-4) contain learned patterns spread across billions of parameters. A smaller model can’t replicate all this knowledge directly, but it can learn to approximate the teacher’s outputs by studying:

  • The probability distributions the teacher produces (not just the top answer)
  • Intermediate representations (hidden states, attention patterns) that reflect the teacher’s reasoning
  • The “soft targets” (probabilistic outputs) rather than hard labels

The Process

Teacher Model (e.g., GPT-4, 70B model)

  Generate outputs & probabilities

        ↓ Distillation training

Student Model (e.g., 7B model)

    Learns similar behavior

  Achieves 90-95% of teacher quality

Key concept: Temperature controls knowledge softness.

  • Temperature (τ): A scaling parameter in the softmax function that controls how “soft” probability distributions are
    • Higher temperature (τ > 1): Probability distributions are softer, revealing more about the teacher’s reasoning
    • τ = 1: Standard softmax (default)
    • τ = 3-5: Common for distillation (softer targets expose more of the teacher’s learned structure)
    • Loss function: L = α × CE(student, true_label) + (1-α) × KL(student, teacher@τ)
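The loss above translates almost line-for-line into PyTorch (a minimal sketch; the logits and batch here are synthetic, and the τ² scaling on the KL term follows Hinton et al.'s standard formulation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, tau=4.0, alpha=0.5):
    """L = alpha * CE(student, labels) + (1 - alpha) * KL(student || teacher) at temperature tau."""
    ce = F.cross_entropy(student_logits, true_labels)
    # Softened distributions; the tau**2 factor keeps the gradient scale comparable across tau
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return alpha * ce + (1 - alpha) * kl

# Synthetic batch: 2 examples over a 5-token vocabulary
student_logits = torch.randn(2, 5, requires_grad=True)
teacher_logits = torch.randn(2, 5)
labels = torch.tensor([1, 3])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()  # gradients flow only into the student's logits
```

Raising `tau` flattens both distributions, so the student is penalized for mismatching the teacher's low-probability "dark knowledge", not just its top answer.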

Performance & Cost

| Metric | Value |
|---|---|
| Training cost | 10-20% of original training cost |
| Quality retention | 90-95% of teacher model quality |
| Training time | 2-4 weeks (on 1-2 GPUs) |
| Model size reduction | 10x-100x smaller common (70B → 7B, 70B → 13B) |
| Inference cost | 10-100x cheaper than teacher |
| Inference latency | 10-100x faster than teacher |

Example in Practice

Scenario: You want GPT-4’s coding ability in a 7B model to run on-device.

  1. Generate 100K coding prompts and solutions with GPT-4 (teacher)
  2. For each prompt, collect the final answer and, where the API exposes them, token probabilities (logprobs)
  3. Train a 7B model (student) to match the teacher’s probability distributions
  4. Result: A 7B model that handles coding ~90% as well as GPT-4, but runs on consumer hardware

Real example: DistilBERT was distilled from BERT, retaining about 97% of its language-understanding performance in a model 40% smaller and 60% faster.


2. Fine-Tuning

What It Is

Fine-tuning is continued training of an existing model on task-specific or domain-specific data. Rather than training from scratch, you start with a pre-trained model’s weights and update them to specialize in your domain.

Three Approaches

Full Fine-Tuning

  • Update all model weights
  • Most expensive of the three
  • Highest quality improvement possible
  • Risk: catastrophic forgetting (losing original knowledge)

Parameter-Efficient Fine-Tuning (PEFT)

  • Update only a small subset of weights
  • 1% of the cost of full fine-tuning
  • Preserve original knowledge better
  • Most practical for production

LoRA (Low-Rank Adaptation)

  • Freeze original weights, add small trainable “adapter” matrices
  • Mathematical insight: Update matrices decomposed into low-rank approximation
    • Instead of updating all weights in a layer, add: W_new = W_original + α × A × B
    • A and B are much smaller matrices (e.g., rank 8 or 16, versus the full dimension of 2048)
  • Reduces trainable parameters from millions to thousands
  • Allows serving multiple LoRA adapters on the same base model

Performance & Cost

| Metric | Full Fine-Tune | LoRA | RAG |
|---|---|---|---|
| Training cost | 2-5% of from-scratch | 0.5-1% of from-scratch | ~$0 |
| Training time | 3-7 days | 4-24 hours | 0 |
| Data requirements | 100-10K examples | 100-1K examples | 0 (uses existing docs) |
| Quality improvement | +5-15% on domain tasks | +3-10% on domain tasks | Retrieval-dependent |
| Model size overhead | 0% (updates weights in place) | +1-5% (LoRA matrices) | 0 (no model change) |
| Best for | Domain specialization | Lightweight customization | Real-time knowledge |

When Fine-Tuning Helps

Domain-specific language (strong signal):

  • Legal documents with domain-specific terminology and reasoning patterns
  • Medical literature with specialized knowledge
  • Code in a particular framework or company’s proprietary patterns
  • Customer support with domain-specific responses

Task-specific behavior (measurable):

  • Sentiment analysis (financial news vs. social media)
  • Classification in specialized domains
  • Instruction-following format (e.g., “always respond in JSON”)

When Fine-Tuning Does NOT Help

  • Fundamental capability gaps: A 7B model can’t learn to “do math better” via fine-tuning if math wasn’t in its training
  • Knowledge that requires reasoning: Fine-tuning teaches patterns, not deep reasoning
  • Contradicting original training: New data that conflicts with pretraining is hard to instill; the weights resist overriding strongly learned priors
  • Rare or highly specific knowledge: Need 10+ examples per pattern for reliable learning

Practical Example

Scenario: Adapt Llama 3 8B for medical diagnosis support.

Base Llama 3 8B model

Fine-tune on 1,000 medical Q&A pairs

Hyperparameters:
  - Learning rate: 2e-4
  - Epochs: 3
  - Batch size: 8

Result: Medical domain knowledge encoded in weights

Quality: +10-15% accuracy on medical tasks
Cost: ~$1,000-$3,000 in compute

3. RAG (Retrieval-Augmented Generation)

What It Is

RAG provides knowledge at query time rather than training time. When a user asks a question, the system:

  1. Retrieves relevant documents from a knowledge base
  2. Passes retrieved documents + user question to the model
  3. Model generates answer grounded in retrieved context

The Process

User Question

Vector Database
  (embeddings of documents)

Retrieve top-K relevant docs

Construct prompt:
  "Context: [retrieved docs]
   Question: [user query]
   Answer:"

Run through LLM

Grounded Answer
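The pipeline above can be sketched end to end in a few dozen lines (a toy illustration: `embed` here is a bag-of-words stand-in for a real embedding model, and the in-memory `index` list stands in for a vector database):

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (a real system would use a neural embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "To reset your password, open Settings and choose Reset Password.",
    "Our refund policy allows returns within 30 days of purchase.",
    "Two-factor authentication can be enabled under Security settings.",
]
index = [(doc, embed(doc)) for doc in documents]  # stand-in for a vector DB

def retrieve(query, k=2):
    """Return the top-k documents ranked by cosine similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query, k=2):
    """Construct the augmented prompt that would be sent to the LLM."""
    context = "\n".join(retrieve(query, k))
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

prompt = build_prompt("How do I reset my password?")
```

The resulting `prompt` contains the password-reset article as context; in production, the only changes are swapping `embed` for a real embedding model and `index` for a vector database.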

Performance & Cost

| Metric | Value |
|---|---|
| Training cost | ~$0 |
| Setup cost | $100-$1,000 (vector DB, indexing) |
| Ongoing cost | $10-$100/month (vector search) |
| Quality | Depends on retrieval quality (60-95%) |
| Data freshness | Real-time (updates immediately) |
| Latency | Slower (+100-500ms for retrieval) |
| Knowledge capacity | Millions of documents in the index; per-query context limited by the window (4K-128K tokens) |

When RAG Excels

  • Real-time, changing knowledge: News, stock prices, weather, current events
  • Proprietary documents: Internal wikis, policies, customer data
  • Large knowledge bases: Can store millions of documents
  • Frequent updates: Add new documents without retraining
  • Multi-source knowledge: Combine documents from different systems

Critical Factor: Retrieval Quality

RAG quality is entirely dependent on retrieval. If you retrieve the wrong documents, the model can’t compensate.

Retrieval quality metrics:

  • Precision@K: Of top-K retrieved docs, how many are relevant? (target: >70%)
  • Recall: Of all relevant docs, what percentage did we retrieve? (target: >80%)
  • MRR (Mean Reciprocal Rank): How high is the first relevant result? (target: >0.7)
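All three metrics take only a few lines to compute from a labeled evaluation set (a sketch assuming binary relevance judgments over document IDs):

```python
def precision_at_k(retrieved, relevant, k):
    """Of the top-k retrieved doc IDs, what fraction are relevant?"""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall(retrieved, relevant):
    """Of all relevant doc IDs, what fraction appear anywhere in the retrieved list?"""
    return sum(1 for d in relevant if d in retrieved) / len(relevant)

def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant result, over (retrieved, relevant) pairs."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# One query: retrieved docs [3, 1, 7]; docs {1, 2} are the relevant ones
p = precision_at_k([3, 1, 7], {1, 2}, k=3)        # 1 relevant in top-3
r = recall([3, 1, 7], {1, 2})                     # found 1 of 2 relevant docs
m = mean_reciprocal_rank([([3, 1, 7], {1, 2})])   # first hit at rank 2
```

Run these over a held-out query set before shipping: if Precision@K is below the ~70% target, tuning retrieval (chunking, embeddings, reranking) pays off more than changing the LLM.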

Common retrieval failures:

  • Vector embeddings don’t match query intent (semantic mismatch)
  • Document chunks too large or too small
  • No relevant document in the knowledge base
  • Outdated or duplicate documents

Practical Example

Scenario: Build a customer support bot that uses company documentation.

Knowledge Base:
  - 500 support articles
  - 2,000 FAQ entries
  - 100 product docs

User: "How do I reset my password?"

Retrieve top-3 relevant articles

Prompt LLM with:
  "Context: [Article: Password Reset Steps]
   Question: How do I reset my password?
   Answer:"

LLM generates grounded answer

Cost: ~$0 training, $0.01 per query

4. Direct Comparison Matrix

Decision Matrix

| Method | Cost | Time | Quality | Inference Speed | Best For |
|---|---|---|---|---|---|
| Training from Scratch | $$$$$ | Months | ⭐⭐⭐⭐⭐ | Slow | New capability, unlimited budget |
| Distillation | $ | Weeks | ⭐⭐⭐⭐ | Fast | Model compression, edge deployment |
| Full Fine-Tuning | $$ | Days | ⭐⭐⭐⭐ | Medium | Domain specialization |
| LoRA Fine-Tuning | $ | Hours | ⭐⭐⭐ | Medium | Lightweight customization, cost-sensitive |
| RAG | ~$0 | Minutes | ⭐⭐⭐ | Slow | Real-time knowledge, frequent updates |

Quick Decision Tree

Do you need real-time or frequently-updated knowledge?
├─ YES → Use RAG
└─ NO  → Continue...

Does your domain have specialized language/patterns?
├─ YES, and you have 100+ examples → Use Fine-Tuning or LoRA
├─ YES, but no training data → Use RAG
└─ NO  → Continue...

Do you need to run on edge/resource-constrained hardware?
├─ YES → Use Distillation
└─ NO  → Continue...

Do you need new fundamental capability?
├─ YES → Train from Scratch (expensive!)
└─ NO  → Use base model as-is or add RAG
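The tree above can be encoded as a small function for use in planning discussions (the thresholds mirror the text; the function name and signature are ours):

```python
def choose_method(realtime_knowledge: bool,
                  labeled_examples: int,
                  edge_deployment: bool,
                  new_capability: bool) -> str:
    """Walk the quick decision tree in order; thresholds follow the text above."""
    if realtime_knowledge:
        return "RAG"
    if labeled_examples >= 100:
        return "Fine-Tuning or LoRA"
    if edge_deployment:
        return "Distillation"
    if new_capability:
        return "Train from Scratch"
    return "Base model as-is (optionally + RAG)"

choice = choose_method(realtime_knowledge=False, labeled_examples=500,
                       edge_deployment=False, new_capability=False)
```

For the example inputs (static knowledge, 500 labeled examples), the function lands on fine-tuning, matching the second branch of the tree.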

5. LoRA (Low-Rank Adaptation) Implementation Guide

LoRA is one of the most practical fine-tuning techniques for production use. Instead of updating all weights in a model, LoRA adds small “adapter” matrices that are far cheaper to train and store.

The Mathematics

For a weight matrix W in a layer, LoRA introduces two low-rank matrices:

W_new = W_original + α * (A @ B)

Where:
- W_original: original weight matrix (e.g., shape 2048 × 2048)
- A: "down-projection" matrix (2048 × 8, if rank r=8)
- B: "up-projection" matrix (8 × 2048)
- α: scaling factor (usually 1.0 or 2.0)

Key insight: Instead of learning 4.2M parameters (2048×2048), you learn only 32K (2048×8 + 8×2048).
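The parameter arithmetic can be checked directly with a few lines of PyTorch (a standalone sketch of the update rule, independent of any library; the zero-initialized factor mirrors standard LoRA initialization, so the adapter starts as a no-op):

```python
import torch

d, r, alpha = 2048, 8, 1.0
W = torch.randn(d, d)          # frozen original weights: 2048 × 2048
A = torch.randn(d, r) * 0.01   # down-projection (trainable)
B = torch.zeros(r, d)          # up-projection (trainable, initialized to zero)

# The LoRA update: W_new equals W exactly until A @ B moves away from zero
W_new = W + alpha * (A @ B)

full_params = d * d            # 4,194,304 parameters in the full matrix
lora_params = d * r + r * d    # 32,768 parameters in the adapter (~0.8% of the matrix)
```

Because one factor starts at zero, training begins from the base model's exact behavior, which is part of why LoRA preserves original knowledge so well.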

LoRA Implementation with Transformers

Here’s a complete working example using HuggingFace and peft:

# Install: pip install peft transformers torch bitsandbytes

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import Dataset

# Step 1: Load base model (Llama 2 7B; gated repo, requires accepting Meta's license)
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Step 2: Configure LoRA
lora_config = LoraConfig(
    r=8,                           # Rank of the adaptation matrices
    lora_alpha=16,                 # Scaling factor (alpha/r = 2.0)
    target_modules=["q_proj", "v_proj"],  # Apply to attention projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Step 3: Create PEFT model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,738,415,616 (0.06%)

# Step 4: Prepare data
# Example: Medical Q&A dataset
training_data = [
    {"text": "Q: What are symptoms of diabetes?\nA: Increased thirst, urination, fatigue..."},
    {"text": "Q: How is hypertension treated?\nA: With antihypertensive medications..."},
    # ... more examples
]

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

dataset = Dataset.from_list(training_data)  # each item already has a "text" field
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Step 5: Train
training_args = TrainingArguments(
    output_dir="./lora-llama-medical",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    save_steps=100,
    save_total_limit=3,
    logging_steps=10,
    warmup_ratio=0.1,
)

from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    # Causal-LM collator pads batches and copies input_ids into labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

# Step 6: Save LoRA weights (only 10-30MB!)
model.save_pretrained("./lora-medical-adapter")

# Step 7: Load and use
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("./lora-medical-adapter")

# Inference
prompt = "Q: What are the risk factors for stroke?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

LoRA Practical Metrics

Training efficiency:

  • Llama 7B LoRA on 1,000 medical Q&A pairs
  • Hardware: Single RTX 4070
  • Time: 4-6 hours
  • Cost: ~$1-2
  • Quality gain: +10-15% on medical domain accuracy

Adapter size breakdown:

  • Base model: 14GB (FP16)
  • LoRA adapter: 10-30MB
  • Total deployment: 14GB base + 10MB adapter
  • Can serve 100+ different LoRA adapters on same base model

Comparison: LoRA vs Full Fine-Tuning

# Full Fine-Tuning (all parameters updated)
def full_finetune(model):
    for param in model.parameters():
        param.requires_grad = True
    # Training updates 6.7B parameters
    # Storage: 14GB (replaces original weights)
    # Training time: 1-2 days
    # Cost: $50-200

# LoRA Fine-Tuning (only adapters)
def lora_finetune(model):
    model = get_peft_model(model, lora_config)  # lora_config as configured in Step 2
    # Training updates 4.2M parameters (0.06%)
    # Storage: 10MB (additive to base)
    # Training time: 4-6 hours
    # Cost: $1-2

5b. When to Fine-Tune vs Distill vs RAG: Decision Tree

Here’s a comprehensive decision flowchart with real examples:

START: You need to adapt a model

├─ Question 1: Is your knowledge frequently updated or real-time?
│  ├─ YES → Go to RAG (Section 3)
│  │    Examples: News chatbot, stock prices, weather forecasts
│  │    Why: Retraining is slow; retrieval is instant
│  │
│  └─ NO → Continue to Question 2

├─ Question 2: Do you have labeled domain data (100+ examples)?
│  ├─ YES → Continue to Question 3
│  │    Examples: Medical Q&A, customer support logs, code samples
│  │
│  ├─ NO, but have unlabeled text → Use Distillation (Section 1)
│  │    Example: Have 1M medical papers but can't label them
│  │    Approach: Distill from teacher, then fine-tune on small labeled set
│  │
│  └─ NO data at all → Use base model as-is or add RAG

├─ Question 3: Do you need to run on edge/constrained hardware?
│  ├─ YES → Distillation first (compress), then optionally fine-tune
│  │    Examples: Mobile app, embedded device, Raspberry Pi
│  │    Approach: Distill 70B → 7B, then fine-tune on 100 domain examples
│  │    Timeline: 3 weeks total
│  │    Cost: $5K-10K
│  │
│  └─ NO → Continue to Question 4

├─ Question 4: How much labeled data do you have?
│  ├─ <100 examples → RAG is safer (fine-tuning may overfit)
│  │
│  ├─ 100-1,000 examples → LoRA fine-tuning (best ROI)
│  │    Cost: $100-500
│  │    Time: 4-12 hours
│  │    Quality: +5-10% on domain tasks
│  │
│  ├─ 1,000-10,000 examples → Full fine-tuning (best quality)
│  │    Cost: $1K-5K
│  │    Time: 1-3 days
│  │    Quality: +10-20% on domain tasks
│  │
│  └─ 10,000+ examples → Consider training from scratch
│      Only if: New task entirely outside pre-trained knowledge

├─ Question 5: What's your cost tolerance?
│  ├─ <$500 → RAG + LoRA
│  │
│  ├─ $500-5K → LoRA or full fine-tuning
│  │
│  └─ $5K+ → Distillation + fine-tuning + RAG

└─ DECISION: Output below

Decision Matrix by Scenario

| Scenario | Approach | Cost | Time | Quality | Example |
|---|---|---|---|---|---|
| Chatbot for internal docs | RAG + Claude API | $0 training | 1 day | ⭐⭐⭐⭐ | Zendesk docs → vector DB → Claude |
| Medical diagnosis on laptop | Distill + LoRA | $8K | 3 weeks | ⭐⭐⭐⭐ | Distill 70B medical model → 7B → fine-tune on hospital data |
| Sentiment analysis for tweets | LoRA on DistilBERT | $200 | 4 hours | ⭐⭐⭐⭐⭐ | 500 labeled tweets; far cheaper and faster than a 7B model |
| Real-time stock analysis | RAG on news feeds | $50/mo | 2 days | ⭐⭐⭐ | Ingest Yahoo Finance, Bloomberg → retrieve → analyze |
| Edge deployment (2GB device) | Distillation + int4 | $3K | 2 weeks | ⭐⭐⭐ | Compress + quantize → fits on phone |
| Fine-tune Claude on customer data | Provider fine-tuning (e.g., via Amazon Bedrock) | $1-5K | 1 day | ⭐⭐⭐⭐⭐ | 1K examples → Claude becomes your domain expert |

5c. Cost Comparison Table

Full Cost Analysis (All-In Costs)

| Method | Compute Cost | Labor Cost | Data Labeling | Total 1-Time | Annual Maintenance | Break-Even |
|---|---|---|---|---|---|---|
| Train from Scratch | $50K-500K | $10K (engineer-months) | Included | $60K-510K | $5K-20K | N/A (one-time) |
| Distillation | $3K-10K | $2K | ~$500 (data cleanup) | $5.5K-12.5K | $0 | 0 months |
| Full Fine-Tuning | $1K-5K | $1K | $500-2K | $2.5K-8K | $0 | 0 months |
| LoRA Fine-Tuning | $200-1K | $500-1K | $300-1K | $1K-3K | $0 | 0 months |
| RAG Setup | $100-500 (vector DB) | $1K (integration) | $0 | $1.1K-1.5K | $100-500/mo | 3 months |
| Claude API Fine-Tuning | $1K-5K | $200 (via API) | $0 | $1.2K-5.2K | $0 | 0 months |

Cost Per 1% Quality Improvement

Distillation:    $50-100 per 1% improvement
  - Compresses knowledge, 5-10% quality loss acceptable
  - Cost amortized if serving 1000s of requests

Full Fine-Tuning: $20-60 per 1% improvement
  - Direct domain specialization
  - If you have 100+ examples, fine-tuning beats all others

LoRA:            $10-30 per 1% improvement
  - Cheapest per improvement
  - Best for cost-conscious teams

RAG:             $0-10 per 1% improvement (if documents high quality)
  - Free to implement (only retrieval cost)
  - Quality depends entirely on document quality
  - If documents poor, no ROI

Claude API:      pay-per-token (roughly $3-$15 per million tokens, depending on model)
  - Expensive long-term at high volume
  - Cheap for occasional queries

Real-World Pricing Example: Building a Customer Support Bot

Scenario: 100K customer queries/month, 20,000 unique questions

Option A: Fine-tuned 7B model (local)
├─ Data collection & labeling: 1,000 Q&A pairs = $5K
├─ LoRA fine-tuning: 8 hours compute = $100
├─ Server cost (small): RTX 4070 = $600 (one-time) + $100/year power
├─ Total year 1: $5.8K
└─ Cost per query: ~$0.005 (all-in, year 1; marginal cost near zero once hardware is owned)

Option B: RAG + Claude API
├─ Vector DB setup: Pinecone = $100 setup
├─ Indexing 20K Q&As: $100
├─ Claude API: 100K queries × 500 tokens avg × $0.003/1K tokens = $150/month
├─ Total year 1: $200 (setup) + $1.8K (annual API) = $2K
└─ Cost per query: ~$0.0015

Option C: RAG + Open-source model (vLLM)
├─ Setup: Same as Option B = $200
├─ Hosting (cloud): one mid-range GPU instance ≈ $600/month
├─ Total year 1: $200 + $7.2K = $7.4K
└─ Cost per query: ~$0.006

Option D: Full enterprise (fine-tuned + RAG hybrid)
├─ Fine-tune: $5K
├─ Server: $600
├─ RAG/vector DB: $500/year
├─ Total year 1: $6.1K
└─ Cost per query: ~$0.005 (all-in, year 1; as in Option A, but higher quality)
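The arithmetic behind the API-backed options follows one pattern: setup cost plus twelve months of per-token spend. A small helper makes it explicit (a sketch using the illustrative prices from Option B above, not quoted vendor rates):

```python
def api_option_cost(queries_per_month, avg_tokens_per_query, price_per_1k_tokens, setup_cost):
    """Year-one and per-query cost for an API-backed option (setup + 12 months of tokens)."""
    monthly_api = queries_per_month * avg_tokens_per_query / 1000 * price_per_1k_tokens
    return {
        "monthly_api": monthly_api,
        "year_one": setup_cost + 12 * monthly_api,
        "per_query": monthly_api / queries_per_month,
    }

# Option B figures: 100K queries/month, 500 tokens average, $0.003 per 1K tokens, $200 setup
b = api_option_cost(100_000, 500, 0.003, setup_cost=200)
```

Plugging in your own volume and prices shows where the crossover between a local fine-tuned model (high fixed, near-zero marginal cost) and an API (low fixed, linear marginal cost) sits.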

6. Hybrid Approaches

The most powerful setups combine multiple methods:

Distillation + Fine-Tuning

Pattern: Compress knowledge, then specialize.

GPT-4 (teacher)

    ↓ Distillation

7B Student Model (general knowledge)

    ↓ Fine-tuning on domain data

7B Domain Specialist (specialized knowledge in small package)

When to use: You want a small model with domain expertise. Example: Compress GPT-4’s knowledge into 7B, then fine-tune on medical literature. Cost: $5K-$15K

Fine-Tuning + RAG

Pattern: Specialize model, augment with live knowledge.

Base Model

    ↓ Fine-tuning on domain data

Domain-Specialized Model
    ↓ + RAG pipeline

Real-time domain answers

When to use: You need both specialized behavior AND current knowledge. Example: Fine-tune on your company’s writing style, augment with latest docs and policies. Cost: $2K-$5K

Distillation + Fine-Tuning + RAG

Pattern: Compress, specialize, augment with knowledge.

Large Teacher Model

    ↓ Distillation

Small Student (edge-deployable)

    ↓ Fine-tuning

Domain Specialist
    ↓ + RAG

Real-time domain specialist on edge

When to use: Maximum optimization across cost, quality, and deployment. Example: Compress GPT-4 → fine-tune for your domain → deploy on device with local RAG. Cost: $10K-$25K


6b. Practical Examples

Example 1: Compress GPT-4 Knowledge into 7B Model

Goal: Run GPT-4-level coding ability on a single GPU.

Approach: Distillation

Steps:

  1. Generate 100K diverse code problems with GPT-4 solutions
  2. Collect token probabilities (logprobs) from GPT-4 where the API exposes them; full softmax distributions are not available, so sequence-level distillation on sampled outputs is the common fallback
  3. Train Mistral 7B or Llama 7B to match GPT-4’s probability distributions
  4. Use temperature τ=4 to preserve detailed knowledge

Timeline: 3 weeks Cost: $2,000-$5,000 Result: 7B model at 90-95% of GPT-4’s code quality, 100x cheaper inference


Example 2: Fine-Tune 7B Model on Company Documentation

Goal: Help engineers find answers in your codebase and docs.

Approach: LoRA fine-tuning

Steps:

  1. Collect 500-1,000 Q&A pairs from your docs and code
  2. Train LoRA adapter on Llama 7B for 12 hours
  3. Serve base model + LoRA adapter (1% size overhead)

Timeline: 1 day Cost: $100-$500 Result: Model answers questions specific to your company


Example 3: Real-Time News Knowledge in Chatbot

Goal: Chatbot answers questions about current events without retraining.

Approach: RAG

Steps:

  1. Ingest news feeds into vector database (Pinecone, Weaviate, etc.)
  2. For each user query, retrieve top-5 relevant articles
  3. Augment prompt with retrieved articles
  4. LLM generates answer grounded in current news

Timeline: 1 day to set up Cost: $20-$50/month Result: Always-current knowledge at no training cost


Example 4: Combined Approach for Optimal Cost/Quality/Speed

Goal: Edge-deployable medical diagnosis assistant that knows your hospital’s protocols.

Approach: Distillation + Fine-tuning + RAG

Steps:

  1. Distillation (Week 1): Compress medical knowledge from larger model into Mistral 7B
  2. Fine-tuning (Week 2): LoRA-adapt 7B on your hospital’s diagnostic protocols
  3. RAG setup (Week 2): Index latest clinical guidelines, patient histories

Architecture:

Edge Device
├─ Distilled 7B model (2GB, edge-ready)
├─ LoRA adapter (10MB)
└─ Local vector DB of guidelines

Timeline: 2-3 weeks Cost: $8K-$12K Result: Specialized medical assistant deployable on-device with real-time protocol updates


7. Measuring Success

For Distilled Models

  • BLEU/ROUGE scores: Compare student outputs to teacher outputs (target: 0.8+ alignment)
  • Task accuracy: Measure on held-out test set (target: 90%+ of teacher performance)
  • Latency improvement: Track inference speed gains (expect: 10-50x faster)
  • Size reduction: Verify model compression ratio (expect: 10-100x smaller)

A/B test: Direct comparison on same prompts

prompt = "Explain quantum computing"
teacher_output = teacher.generate(prompt)   # e.g., GPT-4 via its API
student_output = student.generate(prompt)   # distilled 7B model
similarity = cosine_similarity(embed(teacher_output), embed(student_output))
# Target: similarity > 0.85

For Fine-Tuned Models

  • Task-specific metrics: Accuracy, F1, precision/recall on your domain
  • Domain-specific benchmarks: Compare before/after fine-tuning
  • Degradation on general tasks: Ensure you didn’t hurt base model capabilities

Example for medical domain:

Before fine-tuning: 65% accuracy on medical Q&A
After fine-tuning:  78% accuracy on medical Q&A
Cost:               +13% improvement for $2,000

For RAG Systems

  • Retrieval quality:

    • Precision@5: Are top-5 docs relevant? (target: >70%)
    • Recall: What % of relevant docs do we find? (target: >80%)
    • MRR (Mean Reciprocal Rank): How high is first match? (target: >0.7)
  • End-to-end quality:

    • Answerability: Can LLM answer question given retrieved docs? (target: >85%)
    • Grounding: Does answer cite retrieved sources correctly? (target: >95%)
    • Hallucination rate: Does model invent facts? (target: <5%)

Measurement:

100 test questions
├─ Retrieve docs for each
├─ Measure retrieval precision/recall
├─ LLM generates answers
├─ Human evaluates answer quality
└─ Calculate final end-to-end quality score

8. When NOT to Use Each Method

Don’t Use Distillation If:

  • Model is already small (7B or smaller) — no benefit to compression
  • Quality loss is unacceptable — distillation loses 5-10% quality
  • You need real-time updates — distillation creates static model
  • Access to teacher model is restricted — can’t generate training data

Don’t Use Fine-Tuning If:

  • You need new fundamental capability — fine-tuning refines, doesn’t create
  • You have <10 examples per pattern — too little data to learn effectively
  • Knowledge is constantly changing — fine-tuning is static, use RAG
  • Your data contains contradictions — model training will oscillate

Don’t Use RAG If:

  • Documents must stay private and local — sending retrieved content to a third-party API is a security risk (self-hosted RAG avoids this)
  • Latency must be <100ms — retrieval adds latency
  • You need very large context (100K+ tokens) — context windows are limited
  • Knowledge is sparse and scattered — poor retrieval results

9. Economics Comparison

All-In Costs (compute + labor)

| Method | Compute | Labor | Total | Amortized/Year |
|---|---|---|---|---|
| Train from Scratch | $50K-$500K | $20K | $70K-$520K | $70K-$520K (one-time) |
| Distillation | $3K-$10K | $2K | $5K-$12K | $5K-$12K (one-time) |
| Full Fine-Tuning | $1K-$5K | $1K | $2K-$6K | $2K-$6K (one-time) |
| LoRA Fine-Tuning | $200-$1K | $500 | $700-$1.5K | $700-$1.5K (one-time) |
| RAG Setup | $100-$500 | $500 | $600-$1K | $600-$1K (one-time) |
| RAG Ongoing | $100-$1K/month | — | — | $1.2K-$12K |

Cost-Per-Quality Comparison

Assuming you’re measuring quality improvement from baseline:

  • Distillation: $50-$100 per 1% quality improvement
  • Fine-tuning: $20-$60 per 1% quality improvement
  • LoRA: $10-$30 per 1% quality improvement
  • RAG: $0-$10 per 1% quality improvement (depends on data quality)

10. Decision Framework

Flowchart: Which Method to Choose?

START

├─ Is knowledge frequently updated or real-time?
│  ├─ YES → Use RAG
│  └─ NO → Continue

├─ Do you have 100+ domain-specific training examples?
│  ├─ YES → Use Fine-Tuning or LoRA
│  └─ NO → Continue

├─ Do you need the model to run on edge/device?
│  ├─ YES → Use Distillation (then Fine-tune if specialized)
│  └─ NO → Continue

├─ Do you need a fundamental new capability?
│  ├─ YES → Train from Scratch (warning: expensive!)
│  └─ NO → Use base model + RAG, or fine-tune if you have data

└─ END: Choose combined approach if multiple factors apply

Decision Factors Checklist

Use Distillation when:

  • You have a large teacher model
  • You need to reduce model size significantly
  • Inference cost/speed is critical
  • You have 50K+ examples to train on

Use Fine-Tuning when:

  • You have 100-10K domain-specific examples
  • You want to specialize model behavior
  • One-time setup cost is acceptable
  • Base model can be improved within domain

Use LoRA when:

  • You want fine-tuning but need to minimize cost
  • You need to serve multiple domain variants
  • Model size/memory is constrained
  • Quick iteration is important

Use RAG when:

  • Knowledge changes frequently
  • You can’t label training examples
  • You want to preserve base model knowledge
  • Latency tolerance is >100ms
  • Knowledge is in unstructured documents

Combine methods when:

  • You need compression + specialization → Distill + Fine-tune
  • You need specialization + live updates → Fine-tune + RAG
  • You need all three → Distill + Fine-tune + RAG

Summary: Quick Reference

| Question | Answer |
|---|---|
| Fastest to deploy? | RAG (minutes) |
| Cheapest? | RAG (~$0 training) |
| Best quality for domain tasks? | Fine-tuning (if you have data) |
| Best for edge/on-device? | Distillation |
| Most flexible? | RAG (updates instantly) |
| Best ROI on quality? | LoRA fine-tuning |
| Best for real-time knowledge? | RAG |
| Best for specialized behavior? | Fine-tuning + LoRA |
| Highest quality possible? | Distillation + Fine-tuning + RAG |

Golden Rule: Start with RAG if you have documents. Add fine-tuning if you have labeled examples. Add distillation if you need to run on constrained hardware. Train from scratch only as a last resort.


11. Real-World Example 1: Fine-Tuning Claude on Medical Data

Scenario

A healthcare startup wants to fine-tune Claude on their medical literature and case studies to make it better at answering domain-specific questions. Note: Anthropic's first-party API does not offer general self-serve fine-tuning; Claude fine-tuning has been available for select models through platforms such as Amazon Bedrock. The workflow below is therefore an illustrative sketch modeled on common fine-tuning APIs.

Dataset Preparation

import json
from anthropic import Anthropic

# Prepare training data: question-answer pairs
training_data = [
    {
        "messages": [
            {"role": "user", "content": "What are the symptoms of Type 2 diabetes?"},
            {"role": "assistant", "content": "Type 2 diabetes symptoms include increased thirst, frequent urination, fatigue, blurred vision, and slow wound healing..."}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "How is hypertension diagnosed?"},
            {"role": "assistant", "content": "Hypertension is diagnosed when blood pressure readings are consistently 130/80 mmHg or higher..."}
        ]
    },
    # ... 998 more examples ...
]

# Save as JSONL (one JSON object per line)
with open("medical_training.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

Fine-Tuning Process

from anthropic import Anthropic

client = Anthropic()

# Step 1: Upload training file
with open("medical_training.jsonl", "rb") as f:
    training_file = client.beta.files.upload(
        file=("medical_training.jsonl", f, "application/x-ndjson"),
    )

print(f"File uploaded: {training_file.id}")

# Step 2: Create a fine-tuning job (illustrative endpoint; not part of Anthropic's public SDK)
fine_tune_job = client.beta.fine_tuning.jobs.create(
    model="claude-sonnet-4",
    training_file=training_file.id,
    hyperparameters={
        "epochs": 3,
        "learning_rate_multiplier": 1.0,
        "batch_size": 32,
    }
)

print(f"Fine-tuning job created: {fine_tune_job.id}")

# Step 3: Monitor job status
import time

while True:
    job_status = client.beta.fine_tuning.jobs.retrieve(fine_tune_job.id)
    
    if job_status.status == "succeeded":
        print(f"Fine-tuning complete! Model ID: {job_status.fine_tuned_model}")
        break
    elif job_status.status == "failed":
        print(f"Fine-tuning failed: {job_status.error}")
        break
    
    print(f"Status: {job_status.status} - Progress: {job_status.training_steps_completed}/{job_status.total_training_steps}")
    time.sleep(30)

# Step 4: Use fine-tuned model
fine_tuned_model = job_status.fine_tuned_model

response = client.messages.create(
    model=fine_tuned_model,
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What is the recommended treatment for gestational diabetes?"}
    ]
)

print(response.content[0].text)

Cost Analysis

Training data: 1,000 examples
Training cost: $6 (input tokens) + $18 (output tokens) = $24
Training time: ~1-2 hours
Fine-tuned model name: ft-YOUR-ID

Monthly usage (assuming 100 queries/day, at ~$3/M input and ~$15/M output tokens):
- Input: 100 tokens per query × 100 queries = 10,000 tokens/day ≈ $0.03
- Output: 500 tokens per query × 100 queries = 50,000 tokens/day ≈ $0.75
- Daily cost: ~$0.78
- Monthly cost: ~$23
- Annual cost: ~$280

ROI: $24 training investment + ~$280 annual ≈ $305 total
Baseline (without fine-tuning): $12/day = $4,380/year
Savings: ~$4,075/year (~14x ROI)

Evaluation Metrics

After fine-tuning, test on held-out medical questions:

# Test set (not used in training)
test_questions = [
    "How is COPD managed?",
    "What are risk factors for myocardial infarction?",
    "Explain the pathophysiology of cirrhosis"
]

# Baseline (no fine-tuning)
baseline_response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=500,
    messages=[{"role": "user", "content": "How is COPD managed?"}]
)

# Fine-tuned model
finetuned_response = client.messages.create(
    model=fine_tuned_model,
    max_tokens=500,
    messages=[{"role": "user", "content": "How is COPD managed?"}]
)

# Evaluate quality (manually or with another model)
# Metrics: Domain accuracy, terminology precision, specificity to medical context

12. Real-World Example 2: Distilling Claude into Phi for On-Device Deployment

Scenario

A mobile health app wants Claude-style reasoning on 7-inch tablets with only 2GB RAM. Solution: Distill Claude's knowledge into Phi-2 (2.7B parameters: ~5.5GB in FP16, ~2.75GB in int8, and ~1.4GB in int4, the variant that fits the 2GB budget).

Distillation Process

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import anthropic

client = anthropic.Anthropic()

# Step 1: Generate diverse medical Q&A with Claude (teacher)
medical_topics = [
    "diabetes management",
    "hypertension treatment",
    "asthma control",
    "chronic kidney disease",
    "heart failure"
]

def generate_training_data(topic, num_examples=100):
    """Use Claude to generate diverse training examples"""
    
    prompt = f"""Generate {num_examples} diverse medical Q&A pairs about {topic}.
    Format: Q: [question]\nA: [answer]\n\n
    Make questions and answers realistic, varied in difficulty, and accurate."""
    
    response = client.messages.create(
        model="claude-sonnet-4",
        max_tokens=8000,
        messages=[{"role": "user", "content": prompt}]
    )
    
    # Parse response into Q&A pairs
    qa_pairs = []
    lines = response.content[0].text.split("\n")
    
    current_q = None
    current_a = None
    
    for line in lines:
        if line.startswith("Q:"):
            if current_q and current_a:
                qa_pairs.append({"q": current_q, "a": current_a})
            current_q = line[2:].strip()
            current_a = None
        elif line.startswith("A:"):
            current_a = line[2:].strip()
    
    if current_q and current_a:
        qa_pairs.append({"q": current_q, "a": current_a})
    
    return qa_pairs

# Generate 500 Q&A pairs (5 topics × 100 each)
all_training_data = []
for topic in medical_topics:
    print(f"Generating examples for {topic}...")
    examples = generate_training_data(topic, num_examples=100)
    all_training_data.extend(examples)

print(f"Generated {len(all_training_data)} training examples")

# Step 2: Also collect soft targets (probabilities) from Claude
def collect_soft_targets(qa_pairs, batch_size=10):
    """Get logits/probabilities from Claude for knowledge distillation"""
    
    soft_targets = []
    
    for i in range(0, len(qa_pairs), batch_size):
        batch = qa_pairs[i:i+batch_size]
        
        for item in batch:
            # For each Q&A, get Claude's confidence scores
            # (This is simplified; real distillation would use logits)
            response = client.messages.create(
                model="claude-sonnet-4",
                max_tokens=100,
                messages=[
                    {"role": "user", "content": f"Q: {item['q']}\n\nA: {item['a']}\n\nHow confident are you in this answer? 0-100"}
                ]
            )
            
            # Free-text reply: parse the trailing number defensively
            try:
                confidence = float(response.content[0].text.split()[-1])
            except (ValueError, IndexError):
                confidence = 90.0  # fall back to a high default
            soft_targets.append({
                "question": item['q'],
                "answer": item['a'],
                "confidence": confidence / 100.0  # Normalize to 0-1
            })
    
    return soft_targets

soft_targets = collect_soft_targets(all_training_data[:50])  # Example: first 50

# Step 3: Fine-tune Phi-2 with knowledge distillation
def distill_to_phi(training_data, soft_targets, num_epochs=3):
    """
    Distill Claude's knowledge into Phi-2.

    True distillation softens the teacher's logits with temperature > 1;
    Claude's API exposes no logits, so confidence scores stand in here.
    """
    
    # Load Phi-2
    model_name = "microsoft/phi-2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # Phi-2 ships without a pad token
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    
    # Prepare data
    from torch.utils.data import DataLoader, Dataset
    
    class DistillationDataset(Dataset):
        def __init__(self, qa_pairs, soft_targets, tokenizer):
            self.qa_pairs = qa_pairs
            self.soft_targets = {st['question']: st['confidence'] for st in soft_targets}
            self.tokenizer = tokenizer
        
        def __len__(self):
            return len(self.qa_pairs)
        
        def __getitem__(self, idx):
            qa = self.qa_pairs[idx]
            text = f"Q: {qa['q']}\nA: {qa['a']}"
            
            # Get soft target confidence if available
            confidence = self.soft_targets.get(qa['q'], 0.9)
            
            # Pad to a fixed length so the default collate_fn can batch examples
            tokens = self.tokenizer(
                text,
                truncation=True,
                max_length=256,
                padding="max_length",
                return_tensors="pt"
            )
            
            return {
                'input_ids': tokens['input_ids'].squeeze(),
                'attention_mask': tokens['attention_mask'].squeeze(),
                'soft_target': torch.tensor(confidence, dtype=torch.float)
            }
    
    dataset = DistillationDataset(training_data[:100], soft_targets, tokenizer)
    dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
    
    # Training loop with distillation loss
    from torch import nn
    
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    criterion_ce = nn.CrossEntropyLoss()
    # A full distillation loss would add KLDivLoss against the teacher's
    # temperature-softened logits; without access to Claude's logits, this
    # sketch weights the CE loss by teacher confidence instead
    
    for epoch in range(num_epochs):
        total_loss = 0
        
        for batch in dataloader:
            input_ids = batch['input_ids'].to("cuda")
            attention_mask = batch['attention_mask'].to("cuda")
            soft_targets = batch['soft_target'].to("cuda")
            
            # Forward pass
            outputs = model(input_ids, attention_mask=attention_mask, return_dict=True)
            logits = outputs.logits
            
            # Distillation loss: match teacher's soft targets
            # Use cross-entropy loss as proxy
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = input_ids[..., 1:].contiguous()
            
            loss = criterion_ce(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1)
            )
            
            # Weight by teacher confidence (soft targets)
            weighted_loss = loss * soft_targets.mean()
            
            optimizer.zero_grad()
            weighted_loss.backward()
            optimizer.step()
            
            total_loss += weighted_loss.item()
        
        print(f"Epoch {epoch+1}, Loss: {total_loss / len(dataloader):.4f}")
    
    return model

# Distill Phi-2
distilled_phi = distill_to_phi(all_training_data, soft_targets)

# Step 4: Quantize for mobile (int8)
# Dynamic quantization runs on CPU over float32 weights
from torch.quantization import quantize_dynamic
import torch.nn as nn

distilled_phi = distilled_phi.to("cpu").float()
quantized_phi = quantize_dynamic(
    distilled_phi,
    {nn.Linear},
    dtype=torch.qint8
)

# Save the quantized weights (torch.jit.script rarely works on Hugging Face
# models; a plain state_dict checkpoint is the reliable route)
torch.save(quantized_phi.state_dict(), "phi_medical_int8.pt")

print("Distilled and quantized Phi ready for mobile!")
print("Model size: ~2.7GB (int8)")
print("On-device inference: ~30-50 tokens/s on a modern tablet")

Results Comparison

Teacher Model (Claude via API):
- Quality: ⭐⭐⭐⭐⭐ (state-of-the-art)
- Speed: Slow (cloud API latency)
- Cost: $0.003 per 1K input tokens
- Deployment: Cloud only

Student Model (Distilled Phi-2):
- Quality: ⭐⭐⭐⭐ (90-95% of Claude quality)
- Speed: ⭐⭐⭐ (30-50 tokens/s on tablet)
- Cost: $0 (one-time distillation cost)
- Deployment: Offline on a 4GB tablet

Distillation ROI:
- Training cost: $200 (Claude API calls)
- Ongoing savings: ~$3 per 1,000 queries vs. the API
- Per app installation: unlimited offline queries

13. When Methods Fail: Troubleshooting Guide

Distillation Not Working

Problem: Student model quality doesn’t improve with teacher data

Causes and fixes:

  1. Temperature too low (τ = 1): Probability distributions too sharp
    • Fix: Increase to τ = 3-5
  2. Student model too small relative to task: Can’t fit knowledge
    • Fix: Use larger student (7B instead of 3.5B) or simpler task
  3. Insufficient training data: Need 10K+ examples for good distillation
    • Fix: Generate more synthetic data from teacher
  4. Dataset mismatch: Training data doesn’t match deployment domain
    • Fix: Collect data representative of real use
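The temperature effect behind fix 1 is easy to see numerically. A minimal sketch with illustrative logit values:

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax: higher tau flattens the distribution."""
    exps = [math.exp(x / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]  # illustrative teacher logits

sharp = softmax(logits, tau=1.0)  # ~[0.93, 0.05, 0.03]: near one-hot
soft = softmax(logits, tau=4.0)   # ~[0.53, 0.25, 0.22]: runner-ups visible

print(sharp, soft)
```

At τ = 1 the student only sees the top answer; at τ = 4 the relative plausibility of the alternatives carries usable signal.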

Fine-Tuning Not Working

Problem: Accuracy doesn’t improve, or model “forgets” base knowledge

Causes and fixes:

  1. Catastrophic forgetting: Model overwrites original knowledge
    • Fix: Lower learning rate (2e-5 instead of 2e-4), use LoRA
  2. Data quality: Mislabeled or inconsistent examples
    • Fix: Clean data, verify labels, use quality heuristics
  3. Insufficient data: <100 examples is risky
    • Fix: Collect more examples or use LoRA (more data-efficient)
  4. Wrong task: Fine-tuning teaches patterns, not new capabilities
    • Fix: If model fundamentally can’t do task, use RAG or distill better teacher
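The LoRA suggestion in fixes 1 and 3 works because the adapter trains a tiny fraction of the weights. A back-of-the-envelope sketch, using Phi-2's hidden size and an illustrative rank of 8:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r LoRA adapter factors the weight update into
    A (d_in x r) and B (r x d_out), trained in place of the full matrix."""
    return d_in * rank + rank * d_out

d = 2560                           # Phi-2 hidden size
full = d * d                       # one square projection: 6,553,600 weights
lora = lora_params(d, d, rank=8)   # 40,960 weights

print(f"LoRA trains {lora:,} weights instead of {full:,} (~0.6%)")
```

Fewer trainable weights means less capacity to overwrite base knowledge, which is why LoRA is the first remedy for catastrophic forgetting.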

RAG Not Working

Problem: Retrieval misses relevant documents

Causes and fixes:

  1. Poor embeddings: Document embeddings don’t match queries
    • Fix: Use better embedding model (OpenAI, Jina, or specialized domain)
  2. Bad chunking: Documents too large or too small
    • Fix: Chunk at 256-512 token boundaries (semantic chunks)
  3. Outdated documents: Knowledge base stale
    • Fix: Refresh index regularly, remove obsolete docs
  4. Query intent mismatch: User question phrased differently than documents
    • Fix: Expand queries (multi-query, HyDE), rewrite with LLM
  5. No relevant document exists: Knowledge base doesn’t contain answer
    • Fix: Add missing documents, fall back to base model
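The chunking advice in fix 2 can be sketched with a simple overlapping word-window chunker. A real pipeline would count tokens with the embedding model's tokenizer; word count is a rough proxy here:

```python
def chunk_text(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    """Split text into overlapping word-window chunks."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

doc = "word " * 700  # stand-in for a long document
chunks = chunk_text(doc.strip())
print(len(chunks))  # 3 overlapping chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.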

Diagnostic query:

def diagnose_rag(query, retriever, model):
    """Check where RAG breaks down"""
    
    # Step 1: Check retrieval quality
    docs = retriever.retrieve(query, top_k=5)
    
    if not docs:
        print("❌ Retrieval failed: No documents returned")
        return "retrieval_broken"
    
    if docs[0]['score'] < 0.5:
        print("⚠️  Low retrieval confidence:", docs[0]['score'])
        return "poor_retrieval"
    
    # Step 2: Check if documents plausibly contain the answer
    doc_text = " ".join([d['text'] for d in docs])
    # Crude heuristic: do the query's content words appear in the docs?
    query_terms = [w for w in query.lower().split() if len(w) > 3]
    answer_present = any(term in doc_text.lower() for term in query_terms)
    
    if not answer_present:
        print("❌ Retrieved docs don't contain answer")
        return "docs_insufficient"
    
    # Step 3: Check if LLM can use documents
    prompt = f"Query: {query}\nDocuments: {doc_text}\nAnswer:"
    response = model.generate(prompt)
    
    if "don't know" in response.lower():
        print("⚠️  LLM can't answer despite documents")
        return "llm_failure"
    
    print("✓ RAG working correctly")
    return "ok"

Validation Checklist

How do you know you got this right?

Performance Checks

  • Fine-tuned model outperforms base model on 5+ domain-specific tasks (measure accuracy delta)
  • RAG retrieval precision@5 exceeds 70% on representative test queries
  • Distilled student model retains 90%+ of teacher model quality on held-out evaluation set
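The precision@5 check assumes a way to score retrieval. A minimal sketch, given per-result relevance judgments (1 = relevant, 0 = not; the judgments here are illustrative):

```python
def precision_at_k(relevance: list[int], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents judged relevant."""
    return sum(relevance[:k]) / k

# Judged relevance of the top 5 results for one test query
print(precision_at_k([1, 1, 0, 1, 0], k=5))  # 0.6
```

Averaging this over a representative query set gives the >70% target above.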

Implementation Checks

  • LoRA adapter loads without errors and produces coherent output on first inference
  • Training data is clean: verified labels, no duplicates, no data leakage between train/test splits
  • Quantization level chosen matches deployment hardware (int4 for edge, int8 for server, FP16 for quality-critical)
  • RAG vector database indexed and returning results within latency budget (<500ms retrieval)
  • Cost-per-1% quality improvement calculated for your chosen method (LoRA vs full fine-tune vs RAG)
  • Hybrid approach evaluated: considered combining methods (e.g., fine-tune + RAG) before settling on single method
  • Evaluation pipeline built: automated comparison of base vs adapted model on domain test set
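The quantization check reduces to arithmetic on parameter count and bits per weight. A quick sketch, with Phi-2's 2.7B parameters as the example:

```python
def model_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight-only memory footprint in GB."""
    return num_params * bits_per_weight / 8 / 1e9

params = 2.7e9  # Phi-2
for name, bits in [("FP16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {model_size_gb(params, bits):.1f} GB")
# Weights only; activations and KV cache add to the real footprint
```

If the weight footprint alone approaches device RAM, drop a quantization level before benchmarking quality.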

Integration Checks

  • Adapted model integrates with harness inference loop (loads, generates, streams tokens)
  • LoRA adapters stored separately from base model weights for multi-domain serving
  • RAG pipeline connected to harness tool system (retrieval as a tool the agent can invoke)

Common Failure Modes

  • Catastrophic forgetting: Fine-tuned model loses general knowledge. Fix: use LoRA instead of full fine-tuning, or lower learning rate to 2e-5.
  • RAG retrieval miss: Wrong documents retrieved, model hallucinates. Fix: improve chunking strategy (256-512 tokens), use better embedding model, add reranking step.
  • Distillation plateau: Student quality stops improving well below teacher. Fix: increase temperature (tau=3-5), generate more diverse training data, use larger student model.
  • Overfitting on small datasets: Model memorizes training examples, fails on new inputs. Fix: need 100+ examples minimum for LoRA, 1000+ for full fine-tune; add dropout and regularization.

Sign-Off Criteria

  • Chosen transfer method matches decision tree criteria (real-time knowledge -> RAG, domain specialization -> fine-tune, edge deployment -> distillation)
  • Before/after metrics documented: baseline accuracy vs adapted model accuracy on domain tasks
  • Cost analysis complete: total spend (compute + labor + data) justified by quality improvement
  • Fallback strategy defined: what happens if adapted model underperforms (revert to base model, switch methods)
  • Monitoring plan in place: track model accuracy in production to detect drift and schedule retraining
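The monitoring criterion can start as a simple rolling comparison against the launch baseline. A minimal sketch; the window size, margin, and baseline are illustrative, and production grading would come from your evaluation pipeline:

```python
from collections import deque

class DriftMonitor:
    """Flag retraining when rolling accuracy drops below baseline - margin."""

    def __init__(self, baseline_acc: float, window: int = 100,
                 margin: float = 0.05):
        self.baseline = baseline_acc
        self.margin = margin
        self.results = deque(maxlen=window)

    def record(self, correct: bool) -> bool:
        """Record one graded production answer; return True on drift."""
        self.results.append(1 if correct else 0)
        if len(self.results) < self.results.maxlen:
            return False  # window not yet full
        rolling = sum(self.results) / len(self.results)
        return rolling < self.baseline - self.margin

# Launch baseline of 90% on the domain test set
monitor = DriftMonitor(baseline_acc=0.90, window=10, margin=0.05)
for _ in range(10):
    monitor.record(True)                   # healthy traffic: no drift
needs_retrain = False
for _ in range(5):
    needs_retrain = monitor.record(False)  # accuracy degrades
print(needs_retrain)  # True
```

When the flag fires, the fallback strategy above (revert to base model or switch methods) takes over while retraining is scheduled.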

See Also

  • Doc 01 (Foundation Models) — Model selection affects which transfer methods are practical; hybrid approach enables knowledge transfer
  • Doc 04 (Memory Systems) — Knowledge transfer enables multi-layer memory systems (persistent and episodic layers)
  • Doc 03 (Hugging Face Ecosystem) — Find pre-trained models on Hugging Face to serve as starting points for distillation and fine-tuning
  • Doc 19 (Knowledge Management at Scale) — RAG at scale requires management patterns; knowledge transfer provides the underlying mechanisms