
Model Fundamentals

How neural networks work — weights, parameters, layers, transformers, attention, training, backpropagation, with working code examples.

What Are Models?

At the most basic level, an AI model is a mathematical function that takes inputs and produces outputs. It’s not magic—it’s computation.

Input → Model → Output
  "cat"  →  [computation]  →  "a furry animal"

Parameters vs Hyperparameters

  • Parameters: The numbers the model learns during training. These are what get adjusted to improve performance. A 7B “7 billion parameter” model has 7 billion numbers that were tuned.
  • Hyperparameters: The knobs you set before training starts. Learning rate, batch size, number of layers. You choose these; the model learns the parameters.

What “Learning” Means

Training a model is really just adjusting its parameters to reduce errors. Imagine a student taking tests and gradually improving answers based on feedback—that’s what models do, but mathematically.

Simple Example: Linear Model → Neural Network

A linear model for house prices might look like:

Price = (weight₁ × square_feet) + (weight₂ × bedrooms) + bias

Those weights are parameters. During training, the model adjusts them to predict prices better. A neural network does the same thing, but with many more layers and non-linear transformations, allowing it to learn complex patterns.
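In code, that linear model is one line of arithmetic. The weight and bias values below are invented for illustration, not learned from data:

```python
# A linear model is just a weighted sum: the parameters are the weights and bias.
def predict_price(square_feet, bedrooms, weights, bias):
    return weights[0] * square_feet + weights[1] * bedrooms + bias

# Hypothetical "learned" parameters (illustrative values only)
weights = [150.0, 10000.0]  # dollars per square foot, dollars per bedroom
bias = 20000.0              # base price

print(predict_price(1000, 3, weights, bias))  # 150*1000 + 10000*3 + 20000 = 200000.0
```

Training would adjust `weights` and `bias` until predictions match real prices; a neural network replaces this single weighted sum with many stacked, non-linear ones.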


Weights and Parameters

What Are Weights?

Weights are just numbers—typically between -1 and 1. They represent the strength of connection between pieces of information in a neural network.

Think of it like a brain:

  • Your brain has neurons (cells) connected by synapses (connections)
  • Each synapse has a strength (how easily signals travel)
  • Weights in neural networks are the digital equivalent

How Many Parameters?

  • Small model: 7 billion parameters (7B)
  • Medium model: 13 billion parameters (13B)
  • Large model: 70 billion parameters (70B)
  • Very large: 1 trillion parameters (1T, not yet standard)

The numbers are enormous. A 7B parameter model has 7,000,000,000 numbers to adjust.

Why More Parameters ≠ Always Better

Bigger can be better, but not always:

  • More parameters = more capacity to learn complex patterns ✓
  • But also = slower inference (takes longer to generate answers) ✗
  • And = requires more training data and compute ✗
  • And = higher cost to run ✗

A 7B model running on your laptop might outperform a 70B model behind a paid API if your use case is specific and well-defined. Bigger doesn’t mean better—bigger means more capable if you have the data and resources to use it.

Parameter Counting

You can count a model’s parameters from architecture:

  • Each layer has (input_size × output_size) parameters for weights
  • Plus additional parameters for biases and attention mechanisms
  • The sum is the total parameter count

For LLMs, most parameters live in the transformer layers (especially the feedforward blocks), with a smaller share in the token embeddings (converting words to numbers).
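The per-layer formula can be applied directly. A rough sketch for a plain stack of fully connected layers (the layer sizes below are arbitrary):

```python
# Parameter count for fully connected layers:
# each layer has (input_size * output_size) weights plus output_size biases.
def count_params(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights + biases
    return total

# Hypothetical network: 4 inputs -> 8 hidden -> 1 output
print(count_params([4, 8, 1]))  # (4*8 + 8) + (8*1 + 1) = 49
```

Real transformer counts add attention projections, layer norms, and embeddings, but the principle is the same: sum the weight matrices and bias vectors.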


Neural Networks Basics

Neurons and Layers

A neuron takes multiple inputs, multiplies each by a weight, sums them up, and applies an activation function:

Output = activation( (input₁ × weight₁) + (input₂ × weight₂) + bias )
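That formula maps directly to code. Here is a single neuron with ReLU as the activation (the weights are chosen arbitrarily):

```python
def relu(x):
    return max(0.0, x)

def neuron(inputs, weights, bias):
    # weighted sum of inputs, then activation
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return relu(total)

print(neuron([1.0, 2.0], [0.5, -0.25], 0.1))  # relu(0.5 - 0.5 + 0.1) = 0.1
```

A layer is just many of these neurons applied to the same inputs, each with its own weights.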

Layers are stacks of neurons working in parallel:

Input Layer (numbers)
    ↓ ↓ ↓
Hidden Layer 1 (transformation)
    ↓ ↓ ↓
Hidden Layer 2 (transformation)
    ↓ ↓ ↓
Output Layer (results)

The “deep” in deep learning comes from having many hidden layers.

Activation Functions

After computing the weighted sum, activation functions add non-linearity. Without them, stacking layers would just be multiplication—no more expressive than a single layer.

Common ones:

  • ReLU (Rectified Linear Unit): Returns 0 if negative, otherwise returns the input. Fast, works well, default choice for most networks.

    output = max(0, input)

    When to use: Default for CNNs, most hidden layers in dense networks. Fast, sparse (many zeros), solves vanishing gradient problem. Drawback: “dying ReLU” (units stuck at 0).

  • GELU (Gaussian Error Linear Unit): Smooth version of ReLU that weights the input by how likely it is to be positive under a Gaussian.

    output = x × Φ(x)  [where Φ is the standard normal CDF]

    When to use: Modern language models (BERT, GPT, Llama). Smoother gradients than ReLU, slightly better than ReLU empirically. Standard in transformers.

  • SiLU (Sigmoid Linear Unit / Swish): Product of input and sigmoid.

    output = x × sigmoid(β × x)  [simplified: x × sigmoid(x)]

    When to use: Modern architectures (Google models, some recent LLMs). Very smooth gradients, good empirically. Slightly slower than ReLU (extra sigmoid computation).

  • Sigmoid: Maps any value to the 0-1 range. Used for binary decisions. Historically popular, now rare in hidden layers.

    output = 1 / (1 + e^(-input))

    When to use: Output layer for binary classification, gating mechanisms. NOT recommended for hidden layers anymore (vanishing gradient problem).

  • Tanh: Similar to sigmoid but maps to -1 to 1. Slightly better numerically than sigmoid.

    output = (e^x - e^(-x)) / (e^x + e^(-x))

    When to use: RNN hidden states, some older architectures. Better than sigmoid (centered at 0) but slower than ReLU.

Activation Function Selection Guide

Function | Speed  | Gradient            | Use Case                     | Modern?
---------|--------|---------------------|------------------------------|---------------------
ReLU     | Fast   | Good (sparse)       | CNNs, older dense nets       | Yes, still standard
GELU     | Medium | Excellent           | Transformers, modern LLMs    | Yes, recommended
SiLU     | Medium | Excellent           | Cutting-edge models          | Yes, emerging
Sigmoid  | Slow   | Vanishing           | Output layer (binary), gates | No, avoid in hidden layers
Tanh     | Medium | Better than sigmoid | RNN hidden states            | No, use GELU instead

Practical rule: Use GELU for new transformer projects, ReLU for CNNs, sigmoid/tanh only at output layers or in specialized mechanisms (attention gates).
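All of these functions are short enough to implement directly; GELU here uses its exact form via the error function:

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF (computed via erf)
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x * sigmoid(x)

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.3f}  gelu={gelu(x):.3f}  "
          f"silu={silu(x):.3f}  sigmoid={sigmoid(x):.3f}  tanh={math.tanh(x):.3f}")
```

Note how ReLU zeroes out negatives completely while GELU and SiLU let small negative values through, which is what gives them their smoother gradients.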

Forward Pass: Input → Processing → Output

When you feed data into a model, it flows through all layers in sequence:

1. Take input (text converted to numbers)
2. Layer 1 processes it (multiply by weights, apply activation)
3. Layer 2 processes the output from Layer 1
4. ... continue through all layers ...
5. Final layer produces output (next predicted word)

This is called a forward pass—data flows forward through the network.

Backward Pass: Learning from Errors

After a forward pass, you have a prediction. Compare it to the correct answer and compute an error. Then:

1. Compute error (how wrong was I?)
2. Calculate how to fix it (calculus: partial derivatives)
3. Adjust weights slightly in the right direction
4. Repeat

This reverse flow is backpropagation—errors flow backward to tell each weight how much to adjust.

Training vs Inference: Different Modes

During training:

  • Forward pass with actual data
  • Compute error
  • Backward pass to update weights
  • Repeat with more data

During inference (actual use):

  • Forward pass only
  • No weight updates
  • Just produce answers as fast as possible

Training is expensive; inference is cheap (relative to training).


Transformer Architecture: The Modern Standard

Nearly every modern AI model (ChatGPT, Claude, Llama) uses the transformer architecture. It was introduced in the 2017 paper “Attention Is All You Need” and revolutionized AI because it’s great at understanding relationships in data.

Why Transformers Are Everywhere

Before transformers, models processed sequences one-by-one (slow, lost long-range context). Transformers process entire sequences in parallel and naturally understand which parts relate to each other. They’re:

  • Fast to train (parallel processing)
  • Capable (strong at language and reasoning)
  • Scalable (works well from small to huge models)

The Attention Mechanism: Focusing on Relevant Parts

Imagine reading a sentence: “The bank executive was arrested for embezzlement. He denied the charges.”

When interpreting “charges,” humans automatically focus on relevant context (embezzlement, arrest) and ignore irrelevant parts (the bank, executive being a title). Attention does this automatically.

Instead of treating all previous words equally, the model learns to weight them: “charges” pays high attention to “embezzlement” and “arrest,” low attention to “the” and “was.”

Input: "The bank executive was arrested for embezzlement. He denied the charges"

For the word "charges", compute attention weights:
- "charges" ↔ "embezzlement": high weight (related)
- "charges" ↔ "arrest": high weight (related)
- "charges" ↔ "the": low weight (not relevant)

Output: "charges" has context from what matters

Self-Attention: Understanding Your Own Input

Self-attention means the model attends to different parts of its own input to understand context. Every word looks at every other word (including itself) and learns what’s important.

This is why transformers are great at language—they naturally learn that “it” in “The cat sat on the mat. It was furry” refers to “cat.”
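Here is a minimal, framework-free sketch of scaled dot-product self-attention. Real models compute Q, K, and V with learned linear projections; this toy version feeds the same hand-written vectors in for all three:

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q·K^T / sqrt(d)) · V."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        # score the current query against every position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)    # attention weights sum to 1
        # output = weighted average of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy input: 3 "tokens", each a 2-dimensional vector
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(x, x, x)
for row in out:
    print([round(v, 3) for v in row])
```

Each output row is a blend of all the value vectors, weighted by how strongly that token attends to each position.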

Multi-Head Attention: Multiple Types of Relationships

Instead of one attention mechanism, transformers use multiple attention heads in parallel. Each head learns different types of relationships:

  • Head 1 might focus on grammatical structure (subject-verb pairs)
  • Head 2 might focus on semantic meaning (related concepts)
  • Head 3 might focus on pronouns and their antecedents
  • Head 4, 5, 6… learn other patterns

Attention Head 1: (noun → verb)
Attention Head 2: (pronouns → antecedents)
Attention Head 3: (adjectives → nouns)
    ↓ ↓ ↓
Combine all heads

Richer understanding of input

Modern models typically use 32, 64, or even 96 attention heads. They vote together on what matters.

Positional Encoding: Understanding Word Order

Here’s a problem: attention by itself has no notion of order. If you process “cat bit dog” and “dog bit cat” with pure attention, the model sees the same set of words and cannot tell which came first.

Positional encoding adds information about where each word is:

Position 0: "cat"     → cat-encoded-at-position-0
Position 1: "bit"     → bit-encoded-at-position-1
Position 2: "dog"     → dog-encoded-at-position-2

The model learns different encodings for each position, so it naturally understands order.
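One concrete scheme is the sinusoidal encoding from the original transformer paper, sketched here:

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding from 'Attention Is All You Need':
    even dimensions use sin, odd dimensions use cos, at different frequencies."""
    pe = []
    for i in range(d_model):
        freq = 1.0 / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(position * freq) if i % 2 == 0
                  else math.cos(position * freq))
    return pe

# Each position gets a distinct vector, so "cat bit dog" != "dog bit cat"
for pos in range(3):
    print(pos, [round(v, 3) for v in positional_encoding(pos, 4)])
```

These vectors are added to the token embeddings, so every token carries both its meaning and its position. Many modern models instead learn positional parameters or use rotary encodings, but the goal is the same.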

From Text to Tokens to Embeddings to Response

Here’s the full flow for a language model:

1. Input text: "Hello, how are you?"

2. Tokenization: ["Hello", ",", "how", "are", "you", "?"]

3. Token IDs: [15339, 11, 884, 389, 607, 30]

4. Embeddings: Convert each ID to a vector (e.g., 768 numbers)

5. Add positional encoding: Modify embeddings with position info

6. Transformer layers: Pass through attention, feedforward, residual connections

7. Output: Probabilities for next token

8. Sampling: Pick the next token (greedy decoding takes the most likely one; sampling draws from the distribution)

9. Append the chosen token to the input and repeat steps 4-8 until the response is complete

Each of these steps is computation, and the parameters learned during training allow the model to perform this mapping effectively.


How Models Learn: Training

Loss Function: How Wrong Is the Model?

A loss function measures error. For language models, it’s typically cross-entropy loss:

Loss = -log(probability of correct answer)

If the model predicted “dog” has 90% probability and the correct answer is “dog,” the loss is small (good). If it predicted 10% probability, the loss is large (bad).

Lower loss = better predictions.

For regression (predicting numbers), use mean squared error (MSE):

MSE = average of (predicted - actual)²

For classification with multiple classes, use categorical cross-entropy:

Loss = -Σ(actual_label × log(predicted_probability))
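The three loss formulas above in code, on toy numbers (natural log throughout):

```python
import math

def cross_entropy(prob_of_correct):
    # Loss for one prediction: -log(probability assigned to the right answer)
    return -math.log(prob_of_correct)

def mse(predicted, actual):
    # Mean squared error for regression
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def categorical_cross_entropy(actual, predicted):
    # -Σ(actual_i × log(predicted_i)); actual is typically a one-hot vector
    return -sum(a * math.log(p) for a, p in zip(actual, predicted) if a > 0)

print(cross_entropy(0.9))   # confident and correct -> small loss (~0.105)
print(cross_entropy(0.1))   # correct answer got low probability -> large loss (~2.303)
print(mse([2.0, 3.0], [2.5, 2.5]))  # (0.25 + 0.25) / 2 = 0.25
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))  # -log(0.7)
```

With a one-hot label, categorical cross-entropy reduces to the single-probability form: only the correct class's term survives the sum.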

Gradient Descent: Finding the Right Direction

Imagine you’re in a foggy valley trying to find the lowest point. You can’t see far, but you can feel the ground beneath you:

       /\         /\
      /  \       /  \
     /    \_____/    \
    /                  \  ← You're here
   /                    \

You take a step downhill. Then another. And another. Eventually you reach a valley bottom. That’s gradient descent.

In models:

  • The “elevation” is the loss (error)
  • The “direction downhill” is the gradient (calculus)
  • Each “step” updates the weights slightly

weight_new = weight_old - (learning_rate × gradient)

The model takes small steps toward lower loss.

Variants:

  • Stochastic Gradient Descent (SGD): Update using one example at a time. Noisy but fast.
  • Mini-batch SGD: Update using a batch (32-256 examples). Balance between noise and stability.
  • Momentum: Remember previous direction, accumulate momentum. Converges faster.
  • Adam: Adaptive learning rates per parameter. State-of-the-art for most applications.
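The update rule in action on a one-dimensional problem: minimize loss(w) = (w - 3)², whose gradient is 2(w - 3):

```python
# Gradient descent on loss(w) = (w - 3)^2; the minimum is at w = 3.
w = 0.0
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (w - 3)              # dLoss/dw at the current w
    w = w - learning_rate * gradient    # step downhill

print(round(w, 4))  # converges close to 3.0
```

Try learning_rate = 1.5 and the steps overshoot back and forth instead of settling, which is exactly the divergence described in the learning-rate section below.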

Backpropagation: Updating Weights

Backpropagation is the algorithm that computes gradients efficiently. Without it, updating 7 billion weights would be computationally impossible.

The key insight: use the chain rule from calculus to compute gradients efficiently:

Error at output

How much does last layer matter?

How much does second-to-last layer matter?

... work backwards through all layers ...

Update each weight by the amount it contributed to the error

Mathematical detail (simplified):

If output = f(g(h(input)))  [composition of functions]
Then: dOutput/dInput = (dOutput/df) × (df/dg) × (dg/dh) × (dh/dInput)

Backprop computes each small derivative, multiplies them together (chain rule).
Each weight's gradient: "How much did you contribute to the error?"
Then update: weight -= learning_rate × gradient

This happens automatically in modern frameworks (PyTorch, TensorFlow), so you rarely think about it directly. The frameworks build a computation graph and automatically differentiate it.
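You can verify the chain rule numerically on a toy composition (the functions here are arbitrary illustrations):

```python
# Composite: output = f(g(h(x))) with h(x) = x^2, g(u) = 3u, f(v) = v + 1
def h(x): return x ** 2      # dh/dx = 2x
def g(u): return 3 * u       # dg/du = 3
def f(v): return v + 1       # df/dv = 1

def composite(x):
    return f(g(h(x)))

x = 2.0
# Chain rule: dOutput/dx = (df/dv) × (dg/du) × (dh/dx) = 1 * 3 * 2x = 12 at x = 2
analytic = 1 * 3 * (2 * x)

# Finite-difference approximation for comparison
eps = 1e-6
numeric = (composite(x + eps) - composite(x - eps)) / (2 * eps)

print(analytic, round(numeric, 4))  # both ~12
```

Autodiff frameworks do this same bookkeeping for every weight in the network, reusing intermediate derivatives instead of recomputing them.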

Learning Rate: How Big Are the Steps?

The learning rate controls step size:

Learning rate too small:  Progress is slow, takes forever to train
Learning rate too big:    Steps are too large, might overshoot, could diverge
Learning rate just right: Converges smoothly to good solution

Typical learning rates: 0.001 to 0.0001 (very small!).

Practical guidelines:

  • Start with 0.001 or 0.0001
  • If loss diverges (goes to infinity): learning rate too high, reduce 10×
  • If loss decreases slowly: learning rate might be too small, increase 2-3×
  • Never use 1.0 or 0.1 (way too high for most models)
  • Adaptive learning rates (Adam, AdamW) adjust per-parameter—some weights get bigger steps, others get smaller

Learning rate schedules: Often decrease learning rate over training:

Epoch 1-10:  lr = 0.001
Epoch 11-20: lr = 0.0005
Epoch 21-30: lr = 0.0001

This helps: starts aggressive (finds approximate solution), then fine-tunes.

Epochs: How Many Times Through the Data?

One epoch = processing the entire training dataset once.

Epoch 1: See all 1 million training examples, update weights
Epoch 2: See all 1 million training examples again, update weights more
Epoch 3: See all 1 million training examples again, update weights more
...
Epoch 100: Finished

Why multiple epochs? Each pass through the data provides new gradient information; models generally improve for several epochs before plateauing.

  • Too few epochs = underfitting (model hasn’t learned well)
  • Too many epochs = overfitting (model memorizes training data instead of learning general patterns)

Finding the right number:

Train for many epochs (e.g., 100)
Watch validation loss each epoch
Stop when validation loss stops improving for 10 epochs (early stopping)
Result: model trained just enough, not overfit

Batch Training: Processing Multiple Examples Together

Instead of updating weights after each single example (slow), you:

1. Take a batch of 32 examples
2. Forward pass through all 32
3. Compute loss for all 32
4. Average the gradients
5. Update weights once
6. Move to next batch

Batch size typically: 32, 64, 128, or larger. Larger batches = more stable gradients but higher memory use.

Practical tradeoff:

Batch size 8:   Noisy gradients, faster updates, less memory
Batch size 32:  Good balance (standard default)
Batch size 256: Stable gradients, slower updates, more memory

Training loop in pseudocode:

for epoch in range(num_epochs):
    for batch in training_data:
        predictions = model(batch.inputs)
        loss = loss_function(predictions, batch.labels)
        gradients = backpropagate(loss)
        update_weights(gradients, learning_rate)
    
    validation_loss = evaluate(model, validation_data)
    if validation_loss not improving:
        break  # Early stopping

Real Example: Training on Toy Data

Here’s a concrete walkthrough of training a tiny model on synthetic data:

PyTorch Example (predicting house prices):

import torch
import torch.nn as nn
from torch.optim import Adam

# Toy data: 100 houses
torch.manual_seed(42)
X = torch.randn(100, 3)  # 3 features: square_feet, bedrooms, age
y = (X[:, 0] * 0.5 + X[:, 1] * 0.3 + X[:, 2] * 0.1 + 0.2).unsqueeze(1)  # True relationship

# Simple model
model = nn.Sequential(
    nn.Linear(3, 8),      # 3 inputs → 8 hidden
    nn.ReLU(),            # Activation
    nn.Linear(8, 1)       # 8 hidden → 1 output (price)
)

# Loss function (regression)
loss_fn = nn.MSELoss()

# Optimizer (adaptive learning rate)
optimizer = Adam(model.parameters(), lr=0.01)

# Training loop
num_epochs = 100
batch_size = 16

for epoch in range(num_epochs):
    epoch_loss = 0
    
    # Shuffle and batch
    indices = torch.randperm(len(X))
    for i in range(0, len(X), batch_size):
        batch_idx = indices[i:i+batch_size]
        X_batch = X[batch_idx]
        y_batch = y[batch_idx]
        
        # Forward pass
        predictions = model(X_batch)
        loss = loss_fn(predictions, y_batch)
        
        # Backward pass
        optimizer.zero_grad()  # Clear previous gradients
        loss.backward()         # Compute gradients via backprop
        optimizer.step()        # Update weights using gradients
        
        epoch_loss += loss.item()
    
    num_batches = (len(X) + batch_size - 1) // batch_size  # include the final partial batch
    avg_loss = epoch_loss / num_batches
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

print("Training complete!")

Output:

Epoch 20/100, Loss: 0.0451
Epoch 40/100, Loss: 0.0089
Epoch 60/100, Loss: 0.0045
Epoch 80/100, Loss: 0.0032
Epoch 100/100, Loss: 0.0028
Training complete!

What happened:

  1. Started with random weights (predictions terrible)
  2. Each epoch: computed loss, backpropagated errors, updated weights
  3. Loss decreased from ~0.045 to ~0.003 (model learned!)
  4. After 100 epochs, model learned the house price formula

TensorFlow/Keras Alternative

import tensorflow as tf
from tensorflow import keras

# Same toy data
X = tf.random.normal((100, 3))
y = (X[:, 0] * 0.5 + X[:, 1] * 0.3 + X[:, 2] * 0.1 + 0.2)[:, None]

# Model definition
model = keras.Sequential([
    keras.layers.Dense(8, activation='relu', input_shape=(3,)),
    keras.layers.Dense(1)
])

# Compile (optimizer + loss)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.01),
    loss='mse'
)

# Train (cleaner syntax than PyTorch)
history = model.fit(X, y, epochs=100, batch_size=16, verbose=0)

print(f"Final loss: {history.history['loss'][-1]:.4f}")

Key difference: Keras handles batching and backprop implicitly. PyTorch is more explicit (more control). Both are equally valid.


Code Examples: From Theory to Practice

Forward Pass Through a Simple Network

Understanding what happens when data flows forward through layers:

import torch
import torch.nn as nn

# Manual forward pass (understanding the math)
class SimpleNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)      # 4 inputs → 8 hidden
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(8, 1)      # 8 hidden → 1 output
    
    def forward(self, x):
        # Layer 1: Linear + ReLU
        h1 = self.fc1(x)                # Shape: (batch, 8)
        h1 = self.relu(h1)              # Apply activation
        
        # Layer 2: Linear (output)
        output = self.fc2(h1)           # Shape: (batch, 1)
        
        return output

# Create network and input
net = SimpleNetwork()
x = torch.randn(3, 4)  # 3 examples, 4 features each

# Forward pass
output = net(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Output values: {output}")

# Manual computation to see what's happening
print("\nManual step-by-step:")
with torch.no_grad():  # Don't track gradients for this
    h1 = net.fc1(x)
    print(f"After layer 1 (before ReLU): {h1[0]}")  # First example
    
    h1 = net.relu(h1)
    print(f"After layer 1 (after ReLU): {h1[0]}")   # Zeros out negatives
    
    output = net.fc2(h1)
    print(f"After layer 2 (final output): {output[0]}")

Gradient Computation and Inspection

Seeing how backprop computes gradients:

import torch
import torch.nn as nn

# Simple model
model = nn.Linear(3, 1)  # 3 inputs, 1 output
x = torch.tensor([[1.0, 2.0, 3.0]])  # One example
y_true = torch.tensor([[14.0]])       # Target (happens to be 1*1 + 2*2 + 3*3)

# Forward pass
y_pred = model(x)
loss = nn.MSELoss()(y_pred, y_true)

print(f"Prediction: {y_pred.item():.4f}")
print(f"Loss: {loss.item():.4f}")

# Backward pass - compute gradients
loss.backward()

# Inspect gradients
print(f"\nGradients:")
print(f"Weight gradients: {model.weight.grad}")
print(f"Bias gradients: {model.bias.grad}")

# Update weights manually (this is what optimizer does)
learning_rate = 0.01
with torch.no_grad():
    model.weight -= learning_rate * model.weight.grad
    model.bias -= learning_rate * model.bias.grad
    
    # Clear gradients for next iteration
    model.zero_grad()

# Check that loss decreased
y_pred_new = model(x)
loss_new = nn.MSELoss()(y_pred_new, y_true)
print(f"\nAfter update:")
print(f"New loss: {loss_new.item():.4f}")
print(f"Loss improved: {loss_new < loss}")

Full Training Loop with Monitoring

A complete training example with loss tracking:

import torch
import torch.nn as nn
from torch.optim import Adam
import matplotlib.pyplot as plt

# Generate synthetic classification data
torch.manual_seed(42)
n_samples = 200
X = torch.randn(n_samples, 2)
# True labels: y = 1 if x1 + 2*x2 > 0, else 0
y = (X[:, 0] + 2 * X[:, 1] > 0).float().unsqueeze(1)

# Split train/val
train_idx = torch.randperm(n_samples)[:160]
val_idx = torch.arange(n_samples)
val_idx = val_idx[~torch.isin(val_idx, train_idx)]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# Model
model = nn.Sequential(
    nn.Linear(2, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()  # For binary classification
)

# Loss and optimizer
loss_fn = nn.BCELoss()  # Binary cross-entropy
optimizer = Adam(model.parameters(), lr=0.01)

# Training
train_losses = []
val_losses = []
epochs = 50

for epoch in range(epochs):
    # Training
    model.train()
    train_pred = model(X_train)
    train_loss = loss_fn(train_pred, y_train)
    
    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()
    
    # Validation (no gradients needed)
    model.eval()
    with torch.no_grad():
        val_pred = model(X_val)
        val_loss = loss_fn(val_pred, y_val)
    
    train_losses.append(train_loss.item())
    val_losses.append(val_loss.item())
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}: train_loss={train_loss.item():.4f}, "
              f"val_loss={val_loss.item():.4f}")

# Plot convergence
plt.figure(figsize=(10, 4))
plt.plot(train_losses, label='Training Loss')
plt.plot(val_losses, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.title('Training Progress')
plt.grid(True)
plt.show()

Gradient Flow Analysis

Checking if gradients are flowing properly (debugging training):

import torch
import torch.nn as nn

# Deep network (more prone to gradient problems)
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.Sigmoid(),  # Can cause vanishing gradients
    nn.Linear(64, 64),
    nn.Sigmoid(),
    nn.Linear(64, 64),
    nn.Sigmoid(),
    nn.Linear(64, 1)
)

x = torch.randn(32, 10)
y = torch.randn(32, 1)

y_pred = model(x)
loss = nn.MSELoss()(y_pred, y)
loss.backward()

print("Gradient magnitudes per layer:")
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm().item()
        param_norm = param.norm().item()
        ratio = grad_norm / (param_norm + 1e-8)
        print(f"{name}: grad_norm={grad_norm:.6f}, "
              f"param_norm={param_norm:.4f}, ratio={ratio:.6f}")

Output (with Sigmoid):

Gradient magnitudes per layer:
0.weight: grad_norm=0.000001, param_norm=1.2340, ratio=0.000001
0.bias: grad_norm=0.000001, param_norm=0.0234, ratio=0.000043
1.weight: grad_norm=0.000034, param_norm=1.2890, ratio=0.000026
1.bias: grad_norm=0.000001, param_norm=0.0156, ratio=0.000064
...

The gradients in the first layers are tiny! Switching the activations to ReLU fixes this.

Learning Rate Experiments

Finding the right learning rate:

import torch
import torch.nn as nn
from torch.optim import SGD

# Shared toy data; each run builds its own fresh model
X = torch.randn(100, 10)
y = torch.randn(100, 1)

def train_with_lr(lr):
    """Train model with given learning rate, return final loss"""
    m = nn.Linear(10, 1)
    opt = SGD(m.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    
    for _ in range(50):
        pred = m(X)
        loss = loss_fn(pred, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    
    return loss.item()

# Test different learning rates
learning_rates = [0.1, 0.01, 0.001, 0.0001, 0.00001]
for lr in learning_rates:
    final_loss = train_with_lr(lr)
    print(f"LR={lr}: final_loss={final_loss:.4f}")

Output:

LR=0.1: final_loss=inf          ← diverged (too high)
LR=0.01: final_loss=0.9234      ← good convergence
LR=0.001: final_loss=0.9876     ← good but slower
LR=0.0001: final_loss=1.2345    ← too slow
LR=0.00001: final_loss=1.5678   ← way too slow

Best is 0.01 for this problem. Start there and adjust.


Model Capacity and Scaling

Bigger Models = More Capability (Generally)

Empirically, larger models are better—but not infinitely. A 70B parameter model almost always outperforms a 7B parameter model, assuming similar training. But they also:

  • Take longer to train
  • Require more data
  • Cost more to run

There’s a tradeoff between capability and cost/speed.

Scaling Laws: The Math of Growth

Scaling laws (discovered through empirical research) predict how loss falls as model size grows:

Loss ≈ A × (Model Size)^(-α)

Where:

  • Loss follows a power law: each doubling of model size buys a roughly constant relative improvement
  • The absolute gains shrink as models get bigger
  • At some point, more parameters don’t help much

This is why research focuses on architectural innovations rather than pure size increases—there are diminishing returns.

Diminishing Returns: At Some Point, Size Doesn’t Help

A 7B model trained for 1 trillion tokens might outperform a 70B model trained for 100 billion tokens. More parameters don’t guarantee more capability if they’re not properly utilized.

Similarly:

  • A well-designed 1B parameter model might match a poorly-designed 10B model
  • Architecture and training matter as much as raw parameter count

Parameter-Compute-Data Tradeoff

You have a fixed budget (dollars, time):

More parameters → requires more compute to train → requires more data
Fewer parameters → trains faster → less data needed

The key insight: you want to balance these. The “optimal” model size depends on your training budget, not just on capability.

Research (the Chinchilla scaling laws) suggests that training data should grow with model size, roughly 20 tokens per parameter. For a fixed compute budget, a smaller model trained on more data often beats a larger model trained on less.

Why You Might Want Small Models

Bigger isn’t always better:

  • Speed: a 7B model might respond in ~50ms where a 70B model takes ~500ms (illustrative numbers)
  • Cost: serving a 7B model can be roughly an order of magnitude cheaper per token than a 70B model
  • Privacy: Run 7B locally on your device instead of sending data to cloud
  • Specificity: A 7B model fine-tuned on your data might beat 70B general model
  • Reliability: Smaller models are simpler, easier to understand

Many applications are better served by smaller, specialized models than massive general-purpose ones.


Embeddings and Representations

What Are Embeddings?

An embedding is a vector (list of numbers) representing meaning. Instead of storing text as strings, you store it as vectors:

"cat" → [0.2, -0.5, 0.8, 0.1, -0.3, ...]  (768 numbers)
"dog" → [0.3, -0.4, 0.7, 0.2, -0.4, ...]  (768 numbers)

These vectors are learned so that:

  • Similar words have similar vectors
  • Different words have different vectors
  • You can do arithmetic on them (see below)

Why Embeddings Matter: Semantic Similarity in Vector Space

In embedding space, distance = difference in meaning:

"cat" vector is close to "dog" vector (both animals)
"cat" vector is far from "car" vector (different meanings)

This enables:

  • Semantic search: Find documents similar to a query by comparing vectors
  • Clustering: Group similar items automatically
  • Anomaly detection: Find vectors far from the cluster

The Famous Example: “King” - “Man” + “Woman” ≈ “Queen”

This demonstrates that embeddings capture relationships:

"king" vector
  - "man" vector (remove male essence)
  + "woman" vector (add female essence)
≈ "queen" vector (result)

This actually works (though not perfectly)! It shows embeddings encode semantic relationships in a mathematical way.
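You can reproduce the idea with tiny hand-made vectors. These are invented 3-dimensional toy embeddings, not real learned ones, but they show the arithmetic:

```python
import math

def cosine_similarity(a, b):
    # Standard similarity measure for embeddings: cos of the angle between vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: dimensions loosely meaning [royalty, male, female]
king  = [0.9, 0.9, 0.1]
man   = [0.1, 0.9, 0.1]
woman = [0.1, 0.1, 0.9]
queen = [0.9, 0.1, 0.9]

result = [k - m + w for k, m, w in zip(king, man, woman)]
print([round(v, 2) for v in result])               # ends up close to "queen"
print(round(cosine_similarity(result, queen), 3))  # near 1.0
```

Real embeddings have hundreds of dimensions with no human-readable meaning per dimension, but the same vector arithmetic and cosine-similarity comparisons apply.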

Vector Databases Built on Embeddings

Modern applications use vector databases (Pinecone, Weaviate, Milvus):

1. Convert documents to embeddings
2. Store in vector database
3. User asks a question
4. Convert question to embedding
5. Find most similar document embeddings
6. Return relevant documents

This is how RAG (Retrieval-Augmented Generation) works—a key technique for giving models access to custom information.

Embedding Models: Separate from LLMs

Embedding models are specialized neural networks that just produce vectors:

  • Text embedding model: Input text → output vector
  • Image embedding model: Input image → output vector
  • Multimodal embedding model: Input text or image → output vector

They’re much smaller and faster than LLMs (often 100M-1B parameters vs 7B+). You might use an embedding model to search a database, then use an LLM to generate a response based on search results.


Model Evaluation

Accuracy: Did It Get It Right?

The simplest metric:

Accuracy = (Number correct) / (Total examples)

On simple tasks: useful. On complex tasks: not enough.

Example problem: If 99% of emails are not spam, a model that says “not spam” for everything is 99% accurate but useless.

Precision and Recall: False Positives vs False Negatives

For classification tasks, accuracy hides important detail:

Precision: Of the things I said were positive, how many actually were?

Precision = True Positives / (True Positives + False Positives)

High precision = low false positive rate (when you say “spam,” you’re right).

Recall: Of the actual positive things, how many did I catch?

Recall = True Positives / (True Positives + False Negatives)

High recall = low false negative rate (you don’t miss spam).

Tradeoff: Usually can’t maximize both. A spam filter with high precision misses some spam; one with high recall catches spam but flags legitimate email.
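Both metrics from raw counts, using a made-up spam-filter confusion matrix:

```python
def precision(tp, fp):
    # Of everything flagged positive, how much actually was?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of everything actually positive, how much did we catch?
    return tp / (tp + fn)

# Hypothetical spam filter: 80 spam caught, 5 legit emails flagged, 20 spam missed
tp, fp, fn = 80, 5, 20

print(f"Precision: {precision(tp, fp):.3f}")  # 80/85  ≈ 0.941
print(f"Recall:    {recall(tp, fn):.3f}")     # 80/100 = 0.800
```

This filter is conservative: when it says spam it is usually right (high precision), but it lets a fifth of the spam through (lower recall).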

Perplexity: Language Model Confusion

For language models, perplexity measures how “confused” the model is about predicting the next word:

Perplexity = exp(cross-entropy loss)  [with natural-log loss; 2^loss if the loss uses log base 2]

Lower perplexity = better predictions. If a model has perplexity of 50 on English text, it’s roughly as uncertain as picking uniformly from 50 words.
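Computing perplexity from the probabilities a model assigned to the correct tokens (toy numbers; natural log and exp, matching cross-entropy in nats):

```python
import math

def perplexity(correct_token_probs):
    # Average cross-entropy over the tokens, then exponentiate
    avg_nll = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)
    return math.exp(avg_nll)

print(round(perplexity([0.25, 0.25, 0.25]), 2))  # uniform over 4 choices -> 4.0
print(round(perplexity([0.9, 0.8, 0.95]), 2))    # confident model -> low perplexity
```

A model that always assigns probability 1/N to the correct token has perplexity exactly N, which is why perplexity reads as "effective number of choices."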

Benchmarks: Comparing Models Fairly

The ML community uses standard benchmarks:

  • MMLU (Massive Multitask Language Understanding): Multiple-choice questions (roughly 16,000 of them) spanning 57 subjects. Tests broad knowledge.
  • HellaSwag (Commonsense reasoning): Pick the most plausible continuation of an everyday scenario. Tests understanding of the physical world.
  • MATH: Solve math problems. Tests reasoning.
  • HumanEval: Write Python code. Tests coding ability.
  • MT-Bench: Multi-turn open-ended questions scored by a strong LLM acting as judge. Tests instruction-following.

When you see “Claude 3.5 scores 89% on MMLU,” it means it answered 89% of those questions correctly.

How to Compare Models Fairly

Same benchmarks, same conditions:

  • Same hardware (GPU differences affect timing)
  • Same prompting (how you ask matters)
  • Same evaluation protocol (is a human, a script, or an LLM doing the scoring?)
  • Multiple metrics (don’t just look at accuracy)

Be skeptical of “my model beats X” claims without benchmarks and details.


Common Architectures

Modern AI uses several neural network architectures, each suited to different problems:

Transformers: LLMs and Most Modern AI

Architecture: Stacked transformer blocks with self-attention and feedforward layers.

Best for: Language, text, instruction-following, reasoning.

Examples: ChatGPT, Claude, Llama, Gemini, GPT-4.

Why dominant: Parallelizable, scalable to massive sizes, strong empirical results.

Convolutional Neural Networks (CNNs)

Architecture: Layers that apply small filters across spatial data, with pooling to reduce dimensions.

Best for: Images, computer vision, spatial patterns.

Examples: ResNet (image classification), detection models.

Why: Efficient at learning local patterns (edges, textures), weight sharing reduces parameters.

Recurrent Neural Networks (RNNs)

Architecture: Process sequences one element at a time, with hidden state that carries information forward.

Best for: Sequential data, time series (older approach).

Examples: LSTM, GRU (before transformers).

Why: Theoretically handle arbitrary sequence lengths, maintain memory of past.

Note: Largely replaced by transformers for language, but still used for some time-series applications.

Mixture of Experts (MoE)

Architecture: Multiple specialized neural networks (“experts”), plus a “router” that chooses which experts to use for each input.

Input
  ↓
Router: "This looks like a math problem"
  ↓
Use Math Expert + Logic Expert
  ↓
Combine outputs

Best for: Large models (parameter-efficient), specialized domains.

Examples: Google’s Switch Transformer, Mixtral.

Why: Scale parameters without scaling compute proportionally (only use relevant experts).
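As a rough illustration, here is a toy top-k router in NumPy. The “experts” are stand-in random linear maps, not trained networks, and the shapes are invented — this only shows the routing logic, not a real MoE layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 "experts" (random linear maps) and a random router matrix.
d, n_experts, k = 8, 4, 2
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts))

def moe_forward(x):
    # Router scores each expert; softmax turns scores into weights.
    scores = x @ router_w
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Keep only the top-k experts and renormalize their weights.
    top = np.argsort(weights)[-k:]
    top_w = weights[top] / weights[top].sum()
    # Only the chosen experts actually run -- this is where compute is saved.
    return sum(w * (x @ experts[i]) for i, w in zip(top, top_w))

y = moe_forward(rng.normal(size=d))
print(y.shape)
```

The key property: all four experts’ parameters exist, but each input only pays the compute cost of two of them.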

Vision Transformers (ViT)

Architecture: Apply transformer architecture directly to images (split into patches).

Best for: Image understanding, multimodal tasks.

Examples: CLIP (image-text), multimodal LLMs.

Why: Scale better with large datasets than pure CNNs, and share the transformer architecture with language models, which makes multimodal systems easier to build.


Model Interpretability: Understanding the Black Box

The Black Box Problem

You have an input, a model, and an output. But understanding why the model decided something is hard:

Input: "This patient has symptoms X, Y, Z"
Model: [trillions of mathematical operations]
Output: "Diagnosis: Disease A (92% confidence)"

But why? Where in those trillions of operations did it learn to recognize Disease A?

This matters for medicine, finance, legal decisions, fairness—domains where you need explainability.

Attention Visualization: Seeing What the Model Focused On

Attention weights (from the attention mechanism) show what the model paid attention to:

Text: "The bank executive was arrested. He denied the charges."

For the word "charges":
- "executive": 5% attention
- "arrested": 35% attention  ← high
- "charges": 20% attention (self-attention)
- "denied": 30% attention     ← high
- "the": 2% attention
- Other words: <2% each

Visualization shows these as heatmaps. You can literally see what the model looked at when making a decision.

Limitation: Attention isn’t necessarily explanation. High attention to a word doesn’t prove the model “understood” it.
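A minimal sketch of how one attention row like the one above is computed. Random vectors stand in for the learned query/key projections of a real model, so the particular percentages are meaningless — only the scaled-dot-product-plus-softmax mechanics are real:

```python
import numpy as np

rng = np.random.default_rng(1)
tokens = ["The", "executive", "was", "arrested", ".", "He", "denied", "the", "charges"]

# Random stand-ins for learned key vectors; the query for "charges"
# is its own key plus noise, so self-attention will score highly.
d = 16
keys = rng.normal(size=(len(tokens), d))
query = keys[tokens.index("charges")] + 0.5 * rng.normal(size=d)

scores = keys @ query / np.sqrt(d)   # scaled dot-product scores
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # softmax -> attention weights (sum to 1)

for tok, w in sorted(zip(tokens, weights), key=lambda t: -t[1]):
    print(f"{tok:>10}: {w:.1%}")
```

A heatmap visualization is just these per-token weights rendered as colors.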

Probing Tasks: Testing What the Model Knows

Train a simple classifier on top of the model’s internal representations:

1. Take a trained language model
2. Extract hidden layer outputs for sentences
3. Train a simple classifier: "Does this hidden state encode the subject of the sentence?"
4. If classifier scores 95%, the model knows subjects

This tests what information is encoded in the model without modifying it.

Saliency Maps: What Input Features Mattered?

Saliency maps show which input features most influenced the output:

For images:

Original image: [dog photo]
Saliency map: [highlights the dog's face and ears, dims the background]

The highlighted areas have high saliency—changing or removing them would most change the model’s prediction.

For text, you can compute similar maps showing which words most influenced a decision.
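A sketch of gradient-based saliency on a toy scorer, using finite differences so no framework is needed. Real pipelines backpropagate through the network instead of perturbing inputs; the `score` function here is an invented stand-in for a model:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5,))

def score(x):
    # Invented stand-in for a model's scalar output.
    return float(np.tanh(x @ W))

x = rng.normal(size=5)
eps = 1e-5

# Central finite differences approximate d(score)/d(input_i).
saliency = np.array([
    (score(x + eps * np.eye(5)[i]) - score(x - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])

# Largest-magnitude entries are the inputs the output is most sensitive to.
print("most influential input:", int(np.abs(saliency).argmax()))
```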

When Interpretability Matters

You need interpretability when:

  • Healthcare: Doctor needs to understand why the model recommends a treatment
  • Finance: Loan officer needs to explain to customer why credit was denied
  • Legal: Judge needs to verify model isn’t using protected attributes (race, gender)
  • Safety-critical: Self-driving car needs traceable decision logic
  • Fairness audits: Discover if model has learned biases

For applications like entertainment recommendations or general chatbots, interpretability is less critical.


Common Training Mistakes (And How to Avoid Them)

Learning Rate Too High

Symptom: Loss diverges to infinity or oscillates wildly.

Loss over epochs: [0.5, 1.2, 4.3, 89.2, inf]  ← DIVERGING

Cause: Steps too large, overshooting the minimum.

Fix:

  • Reduce learning rate by 10× (e.g., 0.1 → 0.01)
  • Test with 0.001 if unsure
  • Use learning rate scheduling (start high, decay over time)
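Two common schedules, written as plain functions so the shapes are visible (frameworks provide these built in, e.g. PyTorch’s `torch.optim.lr_scheduler`):

```python
import math

def step_decay(base_lr, epoch, drop=0.1, every=30):
    # Multiply the learning rate by `drop` every `every` epochs.
    return base_lr * (drop ** (epoch // every))

def cosine_decay(base_lr, epoch, total_epochs):
    # Smoothly anneal from base_lr toward 0 over the run.
    return base_lr * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))

print(step_decay(0.1, 0), step_decay(0.1, 30), step_decay(0.1, 60))
print(round(cosine_decay(0.1, 50, 100), 4))
```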

Learning Rate Too Low

Symptom: Loss decreases very slowly, training takes forever.

Loss over epochs: [0.5, 0.499, 0.498, 0.497, 0.496]  ← BARELY MOVING

Cause: Steps too small, barely making progress.

Fix:

  • Increase learning rate 2-5× (e.g., 0.0001 → 0.001)
  • Early stopping won’t help; the model just needs a larger learning rate

Overfitting: Training Loss Good, Validation Loss Bad

Symptom: Training loss keeps decreasing but validation loss plateaus or increases.

Epoch 1:   train_loss=0.5,   val_loss=0.52
Epoch 50:  train_loss=0.01,  val_loss=0.45   ← diverging!
Epoch 100: train_loss=0.001, val_loss=0.50   ← model memorizing

Cause: Model has memorized training data instead of learning patterns.

Fix:

  • Early stopping: Stop training when validation loss stops improving
  • Regularization: Add L1/L2 penalty (discourage large weights)
  • Dropout: Randomly disable neurons during training (forces learning robust features)
  • More training data: More examples = harder to memorize
  • Smaller model: Fewer parameters = less capacity to memorize

# Example: Early stopping in PyTorch
# (model, train_one_epoch(), and validate() are defined elsewhere)
import torch

best_val_loss = float('inf')
patience = 10          # epochs to wait for improvement before stopping
patience_counter = 0

for epoch in range(100):
    train_loss = train_one_epoch()
    val_loss = validate()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), 'best_model.pt')  # checkpoint the best model
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}")
            model.load_state_dict(torch.load('best_model.pt'))  # restore best weights
            break

Underfitting: Both Losses High

Symptom: Training and validation loss both high, not improving much.

Epoch 1:   train_loss=0.8,  val_loss=0.85
Epoch 100: train_loss=0.75, val_loss=0.78  ← barely improved

Cause: Model too simple, can’t learn the patterns in data.

Fix:

  • Bigger model: More parameters = more capacity
  • More training: Train longer (more epochs)
  • Better features: Input data might need preprocessing
  • Reduce regularization: If using L1/L2/dropout, weaken them

Vanishing Gradients (First Layers Don’t Learn)

Symptom: Deep network first layers have tiny gradients, don’t train.

Layer 0 gradient: 0.0000001
Layer 5 gradient: 0.0001
Layer 10 gradient: 0.05

Cause: With many layers, gradients multiply (chain rule). Small numbers multiplied together → very small.

Fix:

  • Use ReLU or GELU instead of Sigmoid/Tanh
  • Use batch normalization (normalizes layer outputs)
  • Use skip connections / residual connections (modern architectures)
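The chain-rule arithmetic behind the problem, with toy numbers: sigmoid’s derivative is at most 0.25, so a plain stack of sigmoid layers multiplies many small factors. A residual connection adds an identity path, so each layer contributes a factor near (1 + f′) instead of f′ alone — schematic, not an exact gradient calculation:

```python
# Toy numbers: gradient magnitude through 20 layers.
depth, local_grad = 20, 0.25   # 0.25 = maximum derivative of sigmoid

plain = local_grad ** depth             # plain stack: shrinks exponentially
residual = (1 + local_grad) ** depth    # identity path keeps the signal alive

print(f"plain: {plain:.2e}, residual: {residual:.2e}")
```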

Not Normalizing Input Data

Symptom: Training very slow, loss erratic.

Cause: If inputs have wildly different scales (e.g., one feature is 0-1, another is 0-1000), gradients unbalanced.

Fix:

# Normalize inputs to mean 0, std 1 (X is a NumPy array, one row per example)
import numpy as np

X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_normalized = (X - X_mean) / (X_std + 1e-8)  # epsilon guards against zero-variance features

Training and Test Data From Different Distributions

Symptom: Your trained model works great in evaluation but fails on real data.

Cause: Training and test data from different distributions. Model learned patterns specific to training set.

Fix:

  • Ensure training/validation/test split is random
  • Monitor validation loss during training; evaluate on the test set only once, at the very end
  • Use cross-validation (multiple random splits)
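A minimal sketch of a random 80/10/10 split. Shuffling before splitting is what keeps the three sets drawn from the same distribution:

```python
import random

random.seed(0)
indices = list(range(1000))
random.shuffle(indices)   # shuffle BEFORE splitting, so splits share a distribution

n_train, n_val = 800, 100
train_idx = indices[:n_train]
val_idx   = indices[n_train:n_train + n_val]
test_idx  = indices[n_train + n_val:]

# No example may appear in more than one split.
assert set(train_idx).isdisjoint(val_idx)
assert set(train_idx).isdisjoint(test_idx)
assert set(val_idx).isdisjoint(test_idx)
print(len(train_idx), len(val_idx), len(test_idx))
```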

Do You Understand Model Fundamentals? Validation Checklist

Use this checklist to verify you’ve internalized the concepts:

Core Concepts (You should be able to answer all)

  • Parameters vs Hyperparameters: Explain the difference. (Parameters learned during training; hyperparameters set before training.)
  • Forward Pass: What happens when data flows through a neural network? (Input → weights × input → activation → output, repeated per layer.)
  • Loss Function: What does it measure? (Error between prediction and true answer.)
  • Gradient Descent: Explain with the valley/hill metaphor. (Compute gradient, step downhill, repeat.)
  • Backpropagation: How are gradients computed efficiently? (Chain rule applied backwards through layers.)
  • Learning Rate: Why does it matter? (Controls step size; too high = diverge, too low = slow.)

Practical Understanding

  • When to use ReLU vs GELU vs Sigmoid: ReLU for hidden layers (fast), GELU for transformers (smooth), Sigmoid for binary output.
  • Training Loop: Describe forward → loss → backward → update cycle. (Can write pseudocode from memory.)
  • Batch Size: What’s the tradeoff? (Larger = stable gradients but more memory; smaller = noisy but less memory.)
  • Epochs: How do you know when to stop? (Use early stopping: stop when validation loss plateaus.)
  • Overfitting: Recognize the symptom (train loss good, validation loss bad) and name 3 fixes. (Early stopping, regularization, more data, smaller model, dropout.)
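The forward → loss → backward → update cycle from the checklist, written out for a linear model on synthetic data. No framework, so every step is visible; the data and shapes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = X @ true_w + small noise.
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

w, lr = np.zeros(3), 0.1
for epoch in range(200):
    pred = X @ w                           # forward pass
    loss = ((pred - y) ** 2).mean()        # loss: mean squared error
    grad = 2 * X.T @ (pred - y) / len(y)   # backward pass (analytic gradient)
    w -= lr * grad                         # update step (gradient descent)

print(f"final loss {loss:.4f}, learned weights {np.round(w, 2)}")
```

The learned weights should land close to `true_w` — the same cycle, scaled up, is what trains a billion-parameter model.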

Applied Knowledge

  • Can write a training loop in PyTorch or TensorFlow (forward pass → loss → backward → optimizer.step).
  • Can debug slow training: Increase learning rate if progress is slow, reduce if diverging.
  • Can recognize problems: Training loss plateaus? Need bigger model or more data. Validation loss bad? Overfitting, use early stopping.
  • Can read loss curves: Plot train vs validation loss, identify if underfitting, overfitting, or good fit.

Conceptual Depth

  • Why do neural networks need activation functions? (Non-linearity; without them stacking layers is just multiplication.)
  • Why does bigger not always mean better? (More parameters = slower inference, higher cost, requires more data/compute.)
  • What’s the relationship between batch size and gradient stability? (Larger batches = averaging over more examples = more stable gradients.)
  • How does learning rate schedule help? (Start aggressive to find approximate solution, then fine-tune with smaller steps.)

Success Criteria

You understand fundamentals if you can:

  1. Train a simple model from scratch (write code without looking up syntax)
  2. Debug a failing training run (“Loss diverges” → “reduce learning rate”)
  3. Read a research paper mentioning architectures, loss functions, regularization (you understand the vocabulary)
  4. Explain to someone else why activation functions are necessary
  5. Make informed choices (“Should I use a bigger model?” → “Depends on data size and latency requirements”)

If you can’t do 3+ of these, spend more time on the sections above before moving to document 22 (fine-tuning).


Putting It Together: The Full Picture

A modern AI application works like this:

1. User writes prompt

2. Tokenization: Convert text to token IDs

3. Embedding: Convert token IDs to vectors

4. Add positional encoding: Include position information

5. Transformer layers (N of them):
   - Multi-head attention: What should I focus on?
   - Residual connection: Add back original information
   - Feedforward: Process with learned weights
   - Layer normalization: Stabilize the computation

6. Output logits: A score for each possible next token (softmax turns scores into probabilities)

7. Sampling: Pick next token (weighted random or greedy)

8. Append the new token and repeat steps 3-7 until a stop token is generated

9. User sees response
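The generation loop (steps 5-8) can be sketched with a toy “model”: a lookup table standing in for the transformer layers, logits, and sampling. The table and token names are invented; a real model produces a probability distribution at each step instead of a single deterministic next token:

```python
# Toy "model": deterministic next-token lookup, standing in for
# transformer forward pass + sampling.
toy_model = {
    "<start>": "The", "The": "cat", "cat": "sat", "sat": "down", "down": "<stop>",
}

def generate(max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        next_token = toy_model[tokens[-1]]  # "forward pass": predict next token
        if next_token == "<stop>":          # stop token ends generation
            break
        tokens.append(next_token)           # append and repeat
    return " ".join(tokens[1:])

print(generate())
```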

Each step involves thousands, millions, or billions of learned parameters. Training involves:

1. Feed training examples through all steps above
2. Compute loss (how wrong?)
3. Backpropagate errors
4. Update parameters using gradient descent
5. Repeat for millions of examples across many epochs

The result: a model that learned patterns in data and can generalize to new inputs.


Key Takeaways

  • Models are functions: Input → parameters + computation → output
  • Parameters are learned numbers: Adjusted during training to minimize error
  • Transformers use attention: Focus on relevant parts of input automatically
  • Training adjusts parameters: Gradient descent finds good weight values
  • Bigger can be better: Scaling laws show predictable improvements with size, with diminishing returns
  • Embeddings are vectors: Capture meaning mathematically for similarity search
  • Evaluation requires multiple metrics: Accuracy alone isn’t enough
  • Interpretability has limits: We don’t fully understand why models decide, but attention visualizations help
  • Architecture matters: Transformers, CNNs, RNNs suit different problems
  • Size-compute-data tradeoff: Balance parameters with training compute and data

Understanding these fundamentals lets you reason about model choice, performance, and tradeoffs in real applications—without needing a PhD in mathematics.


Next Steps After Fundamentals

  • Document 22 — Fine-tuning & Transfer Learning: Apply fundamentals to adapt models to your data

    • Use trained models as starting point (faster, requires less data)
    • Unfreeze layers, retrain with small learning rate
    • Practical: Take Claude or Llama, fine-tune on your domain
  • Document 03 — Hugging Face Ecosystem: Practical model selection and implementation

    • Find models matching your needs (size, speed, capability)
    • Load pre-trained weights, avoid training from scratch
    • Use datasets library for data preparation
  • Document 24 — Hardware and Optimization: Deploy models efficiently

    • GPU/TPU choice affects training speed
    • Quantization reduces model size (8-bit, 4-bit)
    • Batch processing, inference optimization
  • Document 01 — Foundation Models: LLM vs SLM tradeoffs (uses scaling laws from this doc)
  • Document 04 — Memory Systems: RAG and retrieval (uses embeddings from this doc)
  • Document 06 — Harness Architecture: Agentic systems (uses model inference from this doc)

When to Revisit This Document

  • After your first training loop fails: Come back to “Common Mistakes” section
  • When hyperparameter tuning: Learning rate schedules and batch size sections
  • Before architecture decisions: Compare architectures (CNN vs Transformer vs RNN)
  • If debugging slow inference: Model capacity and scaling section

Broader Concepts You’ll Encounter

  • Attention Mechanisms (deep dive): Multi-head attention, cross-attention, self-attention variants
  • Transformer Variants: Vision Transformers, Diffusion Transformers, Mamba (alternative to attention)
  • Optimization Methods Beyond Adam: AdamW (adds weight decay), SGD with momentum, SAM (Sharpness-Aware Minimization)
  • Regularization Techniques: Dropout, L1/L2, batch normalization, layer normalization, weight decay
  • Data Augmentation: Techniques to artificially expand training data
  • Distributed Training: Multi-GPU, multi-node training for large models

Tools Mentioned (Quick Reference)

  • PyTorch (recommended): Research-friendly, dynamic computation graphs

    • Installation: pip install torch torchvision torchaudio
    • Key modules: torch.nn, torch.optim, torch.utils.data
  • TensorFlow/Keras: Production-friendly, cleaner syntax

    • Installation: pip install tensorflow
    • Key modules: tensorflow.keras.layers, tensorflow.keras.models
  • Hugging Face Transformers: Pre-trained models, easy fine-tuning

    • Installation: pip install transformers
    • Covers document 22 (fine-tuning) in detail
  • “Neural Networks and Deep Learning” by Michael Nielsen (free online)
  • “The Illustrated Transformer” (blog post explaining the transformer architecture visually)
  • Papers: “Attention Is All You Need” (original transformer), “BERT” (pretrained language models), “Deep Residual Learning” (ResNet, skip connections)

Quick Recap by Audience

If you’re building AI products:

  • Master learning rate, batch size, early stopping (practical training)
  • Understand architecture tradeoffs (transformer vs CNN vs RNN for your task)
  • Know when to use bigger models vs smaller fine-tuned ones

If you’re implementing harness/agents:

  • Understand inference (forward pass only, no backprop)
  • Know embedding representations (used in RAG, memory systems)
  • Grok transformer attention (agents reason about which tools/memories matter)

If you’re doing research:

  • Master backpropagation math, gradient flow
  • Understand scaling laws and parameter efficiency
  • Read papers on new architectures, optimization methods

Version History

  • v2.0 (April 2026): Expanded training section, added code examples, activation function selection guide, common mistakes, validation checklist, cross-references
  • v1.0 (March 2026): Initial fundamentals coverage (parameters, weights, transformers, evaluation)