
Apple Intelligence & CoreML

On-device AI on Apple hardware — CoreML framework, Neural Engine optimization, model conversion, and performance benchmarks per device.

Running machine learning models directly on Apple hardware unlocks privacy-first applications, offline functionality, and ultra-low latency inference. This guide covers Apple Intelligence, CoreML, and practical strategies for deploying AI on iPhones, iPads, and Macs.

1. Apple Intelligence (As of April 2026)

What It Is

Apple Intelligence is Apple’s on-device AI suite integrated into iOS, iPadOS, and macOS. It comprises privacy-first generative capabilities built directly into the operating system, from text processing to image understanding and writing assistance.

Key characteristics:

  • Runs entirely on-device (neural engine + GPU)
  • Data never leaves the device unless explicitly sent to Apple’s private cloud compute
  • Models are proprietary, optimized for Apple Silicon
  • Integrated into system frameworks and native apps
  • Seamless across Apple devices with iCloud+ subscription

Privacy-First Approach

Apple Intelligence enforces a “data minimization” architecture:

  1. On-device processing — Text understanding, image analysis, and basic generation run locally
  2. Private Cloud Compute — Complex queries can optionally use Apple’s encrypted cloud infrastructure
    • Data is processed in an isolated environment
    • Apple cannot see user data (cryptographic guarantees)
    • Can be disabled per-request or system-wide
  3. User control — Explicit opt-in for cloud features; defaults to on-device

Models and Capabilities

As of April 2026, Apple Intelligence provides:

| Capability | Location | Use Case |
|---|---|---|
| Text understanding | On-device | Classify intent, extract entities, summarization |
| Writing tools | On-device + cloud | Proofreading, rewriting, tone adjustment |
| Image understanding | On-device | Describe images, identify objects, OCR |
| Generative images | Cloud | Create images from text (DALL-E style) |
| Smart reply | On-device | Email/message suggestions |
| Siri enhancements | Hybrid | Contextual understanding, personal requests |

Limits to know:

  • Text generation is constrained (summaries, rewrites, not open-ended chat)
  • Image generation not available on-device (always cloud)
  • Fine-tuning models per-user is not exposed to developers

When On-Device vs Cloud

Use on-device when:

  • Privacy/compliance required (healthcare, finance, government)
  • Offline operation needed
  • Latency must be sub-100ms
  • User data sensitive (PII, proprietary documents)

Use cloud (Private Cloud Compute) when:

  • More complex reasoning needed
  • Generation quality is critical
  • Computational cost of local inference is high
  • User explicitly opts in

Performance Characteristics

On Apple Silicon devices:

  • M3/M4 MacBook Pro: Single token ~20-30ms, throughput ~30-50 tokens/sec
  • M2 Mac Mini: Single token ~30-40ms, throughput ~25-35 tokens/sec
  • iPhone 15 Pro (A17 Pro): Single token ~80-150ms, throughput ~7-12 tokens/sec
  • iPad Pro 11” (M4): Single token ~25-35ms, throughput ~28-40 tokens/sec

These assume quantized models (int8 or lower). Performance degrades gracefully on older devices.
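As a sanity check, the throughput figures above are simply the reciprocal of steady-state per-token latency. A minimal sketch, using representative latency values from the ranges quoted above:

```python
def tokens_per_sec(per_token_ms: float) -> float:
    """Steady-state throughput implied by per-token latency."""
    return 1000.0 / per_token_ms

# Representative per-token latencies from the ranges above
for device, latency_ms in [
    ("M3/M4 MacBook Pro", 25),
    ("M2 Mac Mini", 35),
    ("iPhone 15 Pro", 100),
]:
    print(f"{device}: ~{tokens_per_sec(latency_ms):.0f} tokens/sec")
```

First-token latency is higher than this steady-state figure because the entire prompt must be processed before decoding begins.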

Future Roadmap (2026 and Beyond)

Expected improvements:

  • Larger on-device model support (current limit ~3-7B parameters)
  • Faster inference via hardware optimizations
  • Developer API for custom on-device models
  • Cross-device model caching (via iCloud+)
  • Federated learning (training on private data)

2. CoreML Framework

What CoreML Is

CoreML is Apple’s native machine learning framework for iOS, iPadOS, macOS, watchOS, and tvOS. It enables on-device inference for models trained in popular frameworks (TensorFlow, PyTorch, scikit-learn, etc.).

Key features:

  • Automatic hardware optimization (CPU, GPU, Neural Engine)
  • Low memory footprint
  • Battery-efficient inference
  • Seamless integration with Vision and NLP frameworks
  • Precompiled models (.mlmodel format)

Supported Model Formats

CoreML accepts models in multiple formats:

| Source Framework | Supported? | Notes |
|---|---|---|
| TensorFlow 2.x | ✅ | Via ONNX converter or coremltools |
| PyTorch | ✅ | Via ONNX or coremltools (direct conversion works) |
| ONNX | ✅ | Universal format; recommended for interop |
| scikit-learn | ✅ | Linear/tree-based models fully supported |
| XGBoost | ✅ | Gradient boosting models |
| LibSVM | ✅ | SVM models |
| Apple CreateML | ✅ | Xcode-native training (images, sounds, tabular) |
| Hugging Face | ⚠️ | Via ONNX conversion pipeline |

Format: CoreML models are packaged as .mlmodel (a single-file protobuf model) or .mlpackage (a directory structure for larger models and associated assets).

Performance Optimizations with Metal Performance Shaders

CoreML leverages Metal (Apple’s graphics API) for GPU inference:

  1. Metal Performance Shaders (MPS) — Specialized kernels for neural network operations

    • Matrix multiplication, convolution, pooling highly optimized
    • Automatic batching and fusion of operations
    • 2-10x faster than scalar GPU code
  2. Neural Engine access — On devices with neural engines (A14+, M1+):

    • Dedicated fixed-function AI accelerator
    • Lowest latency for specific model architectures
    • Used automatically by CoreML when beneficial
  3. Quantization support — Reduce model precision:

    • float16 (half-precision) — ~2x size reduction, minimal accuracy loss
    • int8 (8-bit integer) — ~4x size reduction, slight accuracy loss
    • int4 (4-bit) — ~8x size reduction, more accuracy loss (requires careful calibration)

Neural Engine (Dedicated AI Chip)

Modern Apple devices include a Neural Engine — a fixed-function coprocessor for ML:

| Device | Neural Engine Cores | Performance |
|---|---|---|
| iPhone 15 Pro (A17 Pro) | 16 cores | ~13 TOPS (int8) |
| iPad Pro 11” M4 | 16 cores | ~12 TOPS |
| MacBook Pro 16” M3 Max | 16 cores | ~14 TOPS |
| Mac Studio M2 Ultra | 32 cores | ~28 TOPS |

When used: CoreML automatically routes operations (convolutions, matrix multiplies in specific shapes) to the Neural Engine when beneficial. You don’t need to explicitly manage this.

Important: The Neural Engine only accelerates specific model types (CNNs, RNNs, Transformers with certain patterns). Large sparse or dynamic models may run faster on GPU.

Versions and Compatibility

| CoreML Version | Minimum OS | Features |
|---|---|---|
| CoreML 1.0 (2017) | iOS 11 | Basic inference, image classification |
| CoreML 2.0 (2018) | iOS 12 | Updatable models, custom operations |
| CoreML 3.0 (2019) | iOS 13 | Control flow, flexible shapes |
| CoreML 4.0 (2020) | iOS 14 | Vision integration, NLP models |
| CoreML 5.x (2021–2023) | iOS 15+ | Larger model support, better quantization |
| CoreML 6.x (2024+) | iOS 18, macOS 15+ | MLX integration, dynamic shapes, improved GPU |

Recommendation: For new projects, target CoreML 6.x (iOS 18+) to access latest optimizations.


3. Converting Hugging Face Models to CoreML

Model Compatibility

Not all Hugging Face models convert cleanly to CoreML. Best candidates:

Readily convertible:

  • Vision transformers (ViT, DINO, CLIP visual encoder)
  • Small text encoders (sentence-transformers, distilBERT)
  • Stable Diffusion (via specialized ml-stable-diffusion project)
  • LLaMA-based models up to 7B (with quantization)

Difficult to convert:

  • Models with dynamic control flow (if statements in forward pass)
  • Models with custom operations not in ONNX standard
  • Large models >30B parameters (memory constraints)
  • Audio models with complex graph structures

Conversion Process

Strategy 1: Hugging Face → ONNX → CoreML

# 1. Export the Hugging Face model to ONNX via Optimum
from optimum.exporters.onnx import main_export

model_id = "sentence-transformers/all-MiniLM-L6-v2"
main_export(model_id, output="./onnx_model")

# 2. Convert ONNX → CoreML
# Note: recent coremltools releases removed the built-in ONNX frontend,
# so this step relies on the legacy onnx-coreml converter API
import coremltools as ct

coreml_model = ct.converters.onnx.convert(
    model="./onnx_model/model.onnx",
    minimum_ios_deployment_target="13",
)
coreml_model.save("./model.mlmodel")

Advantages:

  • ONNX is a well-supported intermediate format
  • Better error messages if conversion fails
  • Standardized transformations (pruning, quantization) work on ONNX

Strategy 2: PyTorch → CoreML (Direct)

# 1. Trace the PyTorch model to TorchScript
import numpy as np
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model.eval()

# Transformer forward passes contain data-dependent branches, so tracing
# with a representative input is more robust than torch.jit.script
example_input = torch.randint(0, 30000, (1, 512))  # [batch, seq_len] token IDs
traced_model = torch.jit.trace(model, example_input, strict=False)
torch.jit.save(traced_model, "model.pt")

# 2. Convert TorchScript → CoreML
import coremltools as ct

traced_model = torch.jit.load("model.pt")

coreml_model = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=(1, 512), dtype=np.int32)],
    minimum_deployment_target=ct.target.iOS16,
)
coreml_model.save("./model.mlmodel")

Advantages:

  • Single-step conversion
  • Fewer potential loss points
  • Better preserves model semantics

Strategy 3: Using Specialized Tools

For large models, use framework-specific converters:

Stable Diffusion:

git clone https://github.com/apple/ml-stable-diffusion
cd ml-stable-diffusion

python -m python_coreml_stable_diffusion.torch2coreml \
  --model-version stabilityai/stable-diffusion-2-base \
  --convert-unet --convert-text-encoder --convert-vae-decoder \
  --compute-unit ALL \
  -o ./models

LLaMA / Mistral (via MLX):

# Convert the Hugging Face checkpoint to MLX format (with quantization)
python -m mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.1 -q --mlx-path ./mistral_mlx

# Then convert MLX → CoreML (tooling in development as of April 2026)

Example: Convert Mistral 7B to CoreML

Practical walkthrough:

# 1. Install dependencies
pip install transformers optimum[onnxruntime] coremltools

# 2. Download and export Mistral to ONNX
python << 'EOF'
from pathlib import Path

from optimum.exporters.onnx import main_export

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
output_dir = Path("./mistral_onnx")

main_export(
    model_id,
    output=output_dir,
    task="text-generation",
)
EOF

# 3. Quantize ONNX model (int8) to reduce size from ~14GB (fp16) to ~7GB
python << 'EOF'
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

quantizer = ORTQuantizer.from_pretrained("./mistral_onnx")
config = AutoQuantizationConfig.avx512_vnni()
quantizer.quantize(save_dir="./mistral_onnx_quantized", quantization_config=config)
EOF

# 4. Convert ONNX → CoreML with quantization
python << 'EOF'
from pathlib import Path

import coremltools as ct

# Legacy ONNX frontend (removed from recent coremltools releases;
# newer workflows convert directly from PyTorch instead)
mlmodel = ct.converters.onnx.convert(
    model="./mistral_onnx_quantized/model.onnx",
    minimum_ios_deployment_target="13",
)

# 4-bit weight quantization for aggressive size reduction
mlmodel = ct.models.neural_network.quantization_utils.quantize_weights(
    mlmodel,
    nbits=4,
    quantization_mode="linear",
)

mlmodel.save("./Mistral-7B.mlmodel")
print(f"Model size: {Path('./Mistral-7B.mlmodel').stat().st_size / 1e9:.2f}GB")
EOF

Result: with 4-bit weight quantization, Mistral 7B converts to a ~3.5GB CoreML model, runnable on a Mac Mini or MacBook Pro.

Quantization During Conversion

Quantization reduces model size and inference latency at the cost of accuracy:

| Bit Width | Size Reduction | Accuracy Loss | Best For |
|---|---|---|---|
| float32 (baseline) | 1x | 0% | Reference, high accuracy |
| float16 | 2x | <1% | GPU inference, any device |
| int8 | 4x | 1-5% | CPUs, Neural Engine, iOS |
| int4 | 8x | 5-10% | Tight memory budgets |
| int3 | 10x | 15-20% | Extreme constraints (watch) |
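The size-reduction column maps directly to bytes per parameter. A quick sketch for a 7B-parameter model:

```python
def model_size_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight footprint (weights only, excluding KV cache)."""
    return params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: {model_size_gb(7e9, bits):.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```

This is why the ~3.5GB Mistral 7B figures elsewhere in this guide correspond to 4-bit weights.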

Quantization-aware training (QAT):

When accuracy is critical, true QAT inserts fake-quantization ops during training so the model learns to compensate for reduced precision. A simpler alternative, shown below, is post-training dynamic quantization:

import torch
from torch.ao.quantization import quantize_dynamic

# Post-training dynamic quantization of the linear layers
# (`load_pretrained_model` is a placeholder for your own loader)
model = load_pretrained_model()
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)

# Then convert to CoreML as shown earlier

4. Performance on Apple Hardware

Unified Memory Advantage

Apple Silicon (M1/M2/M3/M4) uses a unified memory architecture where GPU and CPU share the same physical memory pool. This eliminates copies between systems.

Traditional GPU (discrete):

RAM → PCIe → GPU VRAM → GPU → GPU VRAM → PCIe → RAM
                    ↓ Bottleneck: PCIe bandwidth (16GB/s)

Apple Silicon (unified):

Shared Memory ← GPU, CPU, Neural Engine all access directly
             ↓ No copies; direct access ~100GB/s

Impact: A 7B-parameter model (~14GB in float16) can be loaded once and referenced by all compute units. No duplication.

Latency Benchmarks by Device

Measured end-to-end latency for Mistral 7B (quantized int4, ~3.5GB):

| Device | First Token | Per Token | Tokens/Sec | Context Window |
|---|---|---|---|---|
| MacBook Pro 16” M3 Max | 800ms | 20-25ms | 40-50 | 32K tokens (~25MB) |
| Mac Studio M2 Ultra | 600ms | 15-18ms | 55-65 | 64K tokens (~50MB) |
| Mac Mini M2 | 1200ms | 35-40ms | 25-28 | 8K tokens (~6MB) |
| MacBook Air M2 | 1500ms | 45-55ms | 18-22 | 4K tokens (~3MB) |
| iPad Pro 11” M4 | 2000ms | 80-100ms | 10-12 | 2K tokens (~1.5MB) |
| iPhone 15 Pro | 5000ms+ | 150-200ms | 5-7 | 512 tokens (~400KB) |

Observation: M-series Macs with 16GB+ unified memory run 7B models smoothly. iPhones/iPads have practical limits around 2-3B models.

Neural Engine Usage by Architecture

The Neural Engine accelerates specific operations:

Models that benefit (faster on NE):

  • ResNet, Vision Transformers (CNNs + attention)
  • Stable Diffusion (diffusion models)
  • BERT-sized encoders (<12 layers)

Models that don’t benefit (CPU/GPU better):

  • Large LLMs with dynamic shapes (Mistral, Llama 30B+)
  • Models with custom CUDA kernels (not expressible in ONNX)
  • Models with sparse operations

For LLMs, CoreML typically routes matmul to CPU/GPU, not Neural Engine.


5. Memory and Compute Constraints

Unified Memory Implications

With unified memory, the constraint is total device RAM, not GPU VRAM:

Max Model Size ≈ 60-70% of Device RAM (the rest is reserved for the OS and app overhead)

| Device | Total RAM | Usable for Model | Largest Model (fp16) |
|---|---|---|---|
| MacBook Pro 16” M3 Max | 36GB | ~24GB | 12B parameters |
| Mac Studio M2 Ultra | 192GB | ~130GB | 65B parameters |
| Mac Mini M2 | 16GB | ~10GB | 5B parameters |
| MacBook Air M2 | 8GB | ~5GB | 2.5B parameters |
| iPad Pro 11” M4 | 16GB | ~10GB | 5B parameters |
| iPhone 15 Pro | 8GB | ~5GB | 2.5B parameters |
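The "largest model" column follows directly from the usable-RAM rule of thumb. A sketch, assuming 65% of RAM is usable and 2 bytes per fp16 parameter:

```python
def largest_fp16_model_b(total_ram_gb: float, usable_fraction: float = 0.65) -> float:
    """Largest fp16 model (billions of parameters) that fits in usable RAM."""
    usable_bytes = total_ram_gb * usable_fraction * 1e9
    return usable_bytes / 2 / 1e9  # 2 bytes per fp16 parameter

print(f"{largest_fp16_model_b(36):.1f}B")   # 36GB machine → 11.7B
print(f"{largest_fp16_model_b(192):.1f}B")  # 192GB machine → 62.4B
```

Quantization shifts these limits accordingly: int8 doubles the fitting parameter count, int4 quadruples it.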

Context Window Limitations

Larger context windows consume more memory:

Memory ≈ Model Size + KV Cache
KV Cache ≈ 2 × Layers × Context Length × KV Dimension × Bytes per Value × Batch Size

For Mistral 7B (int4, ~3.5GB model):

  • Context 8K: +1.2GB → 4.7GB total
  • Context 16K: +2.4GB → 5.9GB total
  • Context 32K: +4.8GB → 8.3GB total
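These increments can be reproduced with the formula above. The layer count and KV dimension below are assumptions about Mistral 7B's grouped-query attention layout, with the cache held in fp16:

```python
def kv_cache_gb(context_len: int, n_layers: int = 32, kv_dim: int = 1024,
                bytes_per_value: int = 2, batch_size: int = 1) -> float:
    """KV cache size: 2 tensors (keys + values) per layer, per cached token."""
    return 2 * n_layers * context_len * kv_dim * bytes_per_value * batch_size / 1e9

for ctx in (8_192, 16_384, 32_768):
    print(f"{ctx} tokens: ~{kv_cache_gb(ctx):.1f} GB")
# ~1.1 GB / ~2.1 GB / ~4.3 GB, in line with the increments quoted above
```

For models with full multi-head attention (no grouped-query sharing), the KV dimension equals the hidden dimension and the cache grows roughly 4x larger.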

Mac Mini M2 (16GB): Context fits ~10K tokens comfortably.

iPhone 15 Pro (8GB): Context fits ~1-2K tokens.

Real Example: M1 MacBook Pro (16GB)

Running Mistral 7B on a 16GB M1 MacBook Pro:

Model (int4): 3.5GB
OS Reserve: 2-3GB
App Overhead: 0.5GB
Available for Context: ~10GB

KV cache (fp16) costs roughly 0.1-0.5MB per token for a 7B model, so a ~5K-token context fits comfortably
Performance: ~30 tokens/sec

Practical: Good for local AI assistant with ~2K-5K token contexts. Acceptable latency for interactive apps.
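The budget above, as arithmetic (all figures are the estimates from this section):

```python
total_ram_gb = 16.0     # M1 MacBook Pro
model_gb = 3.5          # quantized Mistral 7B
os_reserve_gb = 2.5     # midpoint of the 2-3GB estimate
app_overhead_gb = 0.5

available_gb = total_ram_gb - model_gb - os_reserve_gb - app_overhead_gb
print(f"~{available_gb:.1f} GB left for KV cache and activations")  # ~9.5 GB
```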


6. On-Device vs Cloud Trade-Offs

On-Device Inference

Advantages:

  • Privacy: No data leaves device (except user-initiated exports)
  • Offline: Works without internet
  • Latency: ~20-200ms total, no network delay
  • Cost: No per-request API fees
  • Control: Model behavior is predictable, reproducible

Disadvantages:

  • Limited capability: Model size/complexity constrained by device RAM
  • Stale knowledge: Models don’t have access to real-time information
  • Development iteration: Updating model requires app update (not instant)
  • Device strain: High battery/thermal cost on phones during inference
  • Maintenance: You manage model updates, security patches

Best for:

  • Personal/sensitive data (PII, health, finance)
  • Always-offline apps (flights, remote locations)
  • Sub-50ms latency requirements
  • Compliance-driven apps (HIPAA, GDPR, government)
  • Text classification, intent detection, embedding generation

Cloud Inference (Apple Private Cloud Compute or Third-Party APIs)

Advantages:

  • Capability: Access to largest, most capable models (GPT-4 scale)
  • Real-time knowledge: Current information from web
  • Zero device overhead: Inference costs are server-side, not battery
  • Instant updates: Roll out new models without app update
  • Scalability: Handle traffic spikes transparently

Disadvantages:

  • Privacy: Data sent over internet (even if encrypted)
  • Latency: 500ms-2000ms due to network + server processing
  • Cost: Per-request fees ($0.01-0.10+ per request)
  • Availability: Requires internet connection
  • Dependency: Reliant on third-party service uptime

Best for:

  • Non-sensitive user input (search queries, creative writing)
  • Complex reasoning requiring large models
  • Knowledge-grounded tasks (Q&A with current info)
  • Generative tasks (images, long text)
  • Spiky traffic (don’t want to provision for peaks)

Pattern: Local for fast, common operations; cloud for complex reasoning.

User Input

Local Intent Detection (0-50ms)

    ├─ Fast Path (95% of queries): Local response (embedding lookup, simple rewrite)
    │                               Return to user in 100-200ms

    └─ Complex Path (5% of queries): Send to cloud (with user consent)
                                      Return in 1000-2000ms

Example: Writing assistant app

  • On-device: Grammar check, tone analysis, style transfer (fast models)
  • Cloud: Generate complete rewrites, creative suggestions (large models)
  • User impact: Most corrections instant; complex suggestions wait 1-2 sec
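The fast-path/complex-path split can be sketched as a small router. The keyword classifier and thresholds here are illustrative placeholders, not a real API:

```python
def classify_complexity(query: str) -> str:
    """Toy stand-in for a local intent model: route open-ended or very
    long requests to the cloud; everything else stays local."""
    open_ended = any(w in query.lower() for w in ("write", "rewrite", "imagine"))
    return "cloud" if open_ended or len(query.split()) > 50 else "local"

def route(query: str, cloud_consent: bool) -> str:
    """Cloud is only used with explicit user consent; otherwise fall back."""
    destination = classify_complexity(query)
    if destination == "cloud" and not cloud_consent:
        destination = "local"  # never leave the device without consent
    return destination

print(route("fix the grammar in this sentence", cloud_consent=False))  # local
print(route("rewrite my essay in a formal tone", cloud_consent=True))  # cloud
```

In production, the classifier itself would be a small on-device model, keeping the routing decision private as well.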

7. Building iOS Apps with CoreML

SwiftUI Integration

import SwiftUI
import CoreML

struct ContentView: View {
    @State private var inputText: String = ""
    @State private var resultText: String = ""
    @State private var isProcessing = false
    @State private var error: String?
    
    @Environment(\.dismiss) var dismiss
    
    var body: some View {
        VStack(spacing: 16) {
            TextEditor(text: $inputText)
                .border(Color.gray)
                .frame(height: 100)
            
            Button(action: performInference) {
                if isProcessing {
                    ProgressView()
                        .progressViewStyle(.circular)
                } else {
                    Text("Analyze")
                }
            }
            .disabled(isProcessing)
            
            if !resultText.isEmpty {
                Text(resultText)
                    .font(.body)
                    .padding()
                    .background(Color.gray.opacity(0.1))
            }
            
            if let error {
                Text("Error: \(error)")
                    .foregroundColor(.red)
                    .font(.caption)
            }
            
            Spacer()
        }
        .padding()
        .navigationTitle("Text Analysis")
    }
    
    private func performInference() {
        isProcessing = true
        error = nil
        
        Task {
            do {
                resultText = try await analyzeText(inputText)
            } catch {
                self.error = error.localizedDescription
            }
            isProcessing = false
        }
    }
}

// MARK: - CoreML Integration

actor TextAnalyzer {
    private let model: TextClassifier  // Auto-generated from .mlmodel
    
    private init() {
        self.model = try! TextClassifier(configuration: .init())
    }
    
    static let shared = TextAnalyzer()
    
    func analyzeText(_ input: String) async throws -> String {
        guard !input.isEmpty else { return "No input provided" }
        
        // Tokenize input (framework-dependent; this is pseudo-code)
        let tokens = tokenize(input)
        let embedding = try model.prediction(input: tokens)
        
        return interpretResult(embedding)
    }
    
    private func tokenize(_ text: String) -> MLMultiArray {
        // Convert text to token IDs matching the model's vocabulary
        // (tokenizer-dependent; the IDs below are placeholders)
        let tokenIds: [Int32] = [1, 2, 3, 4]
        let array = try! MLMultiArray(shape: [1, NSNumber(value: tokenIds.count)], dataType: .int32)
        for (i, id) in tokenIds.enumerated() {
            array[i] = NSNumber(value: id)
        }
        return array
    }
    
    private func interpretResult(_ embedding: TextClassifierOutput) -> String {
        // Map model output to user-friendly text
        return "Analyzed: \(embedding.label ?? "Unknown")"
    }
}

// Usage in async context
func analyzeText(_ text: String) async throws -> String {
    return try await TextAnalyzer.shared.analyzeText(text)
}

Handling Model Inference in App

Best practices:

  1. Load model once, reuse — Models are expensive to load; create a singleton
  2. Run on background thread — Inference can block; use Task/async
  3. Batch operations — If analyzing multiple texts, batch them for efficiency
  4. Stream results — For long outputs, yield tokens incrementally
  5. Handle OOM gracefully — Have fallback behavior if model can’t load
actor InferenceEngine {
    private let model: MyModel
    
    static let shared = InferenceEngine()
    
    private init() {
        do {
            self.model = try MyModel(configuration: MLModelConfiguration())
        } catch {
            os_log("Failed to load model: %{public}@", log: .default, type: .error, error.localizedDescription)
            fatalError("Model unavailable")
        }
    }
    
    // The actor already serializes calls and runs them off the main thread,
    // so no extra dispatch queue or continuation plumbing is needed
    func infer(input: InputData) throws -> OutputData {
        return try model.prediction(input: input)
    }
}

Battery and Thermal Considerations

Inference is power-hungry. Plan accordingly:

  1. Check battery and Low Power Mode — Don't run heavy inference when power is constrained

    let isLowPowerMode = ProcessInfo.processInfo.isLowPowerModeEnabled
    
    if isLowPowerMode {
        // Use smaller model, reduce batch size, or defer to cloud
    }
  2. Throttle frequency — Don’t call inference on every keystroke

    @State private var lastInferenceTime: Date = .distantPast
    
    func shouldRunInference() -> Bool {
        return Date().timeIntervalSince(lastInferenceTime) > 1.0  // Max once per second
    }
  3. Limit parallel inferences — Queue requests, don’t spawn threads unbounded

  4. Monitor thermal state — Slow down or defer work if device is hot

    let thermalState = ProcessInfo.processInfo.thermalState
    if thermalState == .critical {
        // Halt inference until device cools
    }

User Experience: Handling Latency

Don’t freeze the UI. Provide feedback during inference:

VStack(spacing: 16) {
    if isProcessing {
        ProgressView(value: progress)
            .tint(.blue)
        Text("Analyzing... ~\(estimatedRemainingSeconds)s")
            .font(.caption)
            .foregroundColor(.gray)
    } else {
        Text("Tap to analyze")
    }
}

For streaming outputs (like LLM tokens):

func streamInference(input: String) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        Task {
            do {
                let tokens = try await inferenceEngine.inferStream(input: input)
                for try await token in tokens {
                    continuation.yield(token)
                }
                continuation.finish()
            } catch {
                continuation.finish(throwing: error)
            }
        }
    }
}

// Usage: consume the stream from a task tied to the view's lifetime
Text(generatedText)
    .task {
        do {
            for try await token in streamInference(input: inputText) {
                generatedText += token
            }
        } catch {
            // Surface streaming errors to the user as appropriate
        }
    }

8. Vision Models on iOS

Vision models process images in real-time. The Vision framework integrates with CoreML seamlessly.

Vision Framework + CoreML

import Vision
import CoreML

class ObjectDetector {
    private let visionModel: VNCoreMLModel
    
    init() throws {
        let model = try ObjectDetectionModel(configuration: .init())
        // Vision requires the CoreML model to be wrapped in VNCoreMLModel
        self.visionModel = try VNCoreMLModel(for: model.model)
    }
    
    func detect(in image: UIImage) async throws -> [DetectedObject] {
        guard let ciImage = CIImage(image: image) else {
            throw NSError(domain: "Vision", code: -1, userInfo: nil)
        }
        
        return try await withCheckedThrowingContinuation { continuation in
            // The completion handler must be supplied when the request is created
            let request = VNCoreMLRequest(model: visionModel) { request, error in
                if let error {
                    continuation.resume(throwing: error)
                    return
                }
                
                let results = request.results as? [VNRecognizedObjectObservation] ?? []
                let detected = results.map { observation in
                    DetectedObject(
                        label: observation.labels.first?.identifier ?? "Unknown",
                        confidence: Float(observation.confidence),
                        boundingBox: observation.boundingBox
                    )
                }
                
                continuation.resume(returning: detected)
            }
            
            let handler = VNImageRequestHandler(ciImage: ciImage)
            do {
                try handler.perform([request])
            } catch {
                continuation.resume(throwing: error)
            }
        }
    }
}

Real-Time Processing (Camera Feed)

import AVFoundation
import Vision

class CameraProcessor: NSObject, ObservableObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    private let detector = try! ObjectDetector()
    @Published var detections: [DetectedObject] = []
    
    private let session = AVCaptureSession()
    private let queue = DispatchQueue(label: "vision.processing")
    
    func startSession() throws {
        guard let camera = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .back) else {
            throw NSError(domain: "Camera", code: -1)
        }
        
        let input = try AVCaptureDeviceInput(device: camera)
        session.addInput(input)
        
        let output = AVCaptureVideoDataOutput()
        output.setSampleBufferDelegate(self, queue: queue)
        session.addOutput(output)
        
        session.startRunning()
    }
    
    func captureOutput(
        _ output: AVCaptureOutput,
        didOutput sampleBuffer: CMSampleBuffer,
        from connection: AVCaptureConnection
    ) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        
        let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
        let uiImage = UIImage(ciImage: ciImage)
        
        Task {
            do {
                let detections = try await self.detector.detect(in: uiImage)
                DispatchQueue.main.async {
                    self.detections = detections
                }
            } catch {
                os_log("Detection failed: %{public}@", log: .default, type: .error, error.localizedDescription)
            }
        }
    }
}

Example: Document Scanner

import Vision
import CoreImage.CIFilterBuiltins

class DocumentScanner {
    private let visionRequest = VNDetectDocumentSegmentationRequest()
    
    func scanDocument(from image: UIImage) async throws -> UIImage {
        guard let ciImage = CIImage(image: image) else {
            throw NSError(domain: "Scanner", code: -1)
        }
        
        let handler = VNImageRequestHandler(ciImage: ciImage)
        try handler.perform([visionRequest])
        
        // Document segmentation produces rectangle observations
        guard let observation = visionRequest.results?.first else {
            throw NSError(domain: "Scanner", code: -2, userInfo: [NSLocalizedDescriptionKey: "No document detected"])
        }
        
        // Correct perspective based on the detected document boundaries
        let correctedImage = perspectiveCorrection(ciImage, to: observation)
        return UIImage(ciImage: correctedImage)
    }
    
    private func perspectiveCorrection(_ image: CIImage, to doc: VNRectangleObservation) -> CIImage {
        // Vision coordinates are normalized [0, 1]; scale to image pixels
        let size = image.extent.size
        func scaled(_ point: CGPoint) -> CGPoint {
            CGPoint(x: point.x * size.width, y: point.y * size.height)
        }
        
        // Apply the perspective-correction filter
        let filter = CIFilter.perspectiveCorrection()
        filter.inputImage = image
        filter.topLeft = scaled(doc.topLeft)
        filter.topRight = scaled(doc.topRight)
        filter.bottomLeft = scaled(doc.bottomLeft)
        filter.bottomRight = scaled(doc.bottomRight)
        
        return filter.outputImage ?? image
    }
}

9. Text Models on iOS

Running LLMs on iOS is constrained but feasible for smaller models (2-3B parameters).

Running Small LLMs on iOS

Model selection:

  • TinyLlama 1.1B — Fits on iPhone, decent quality
  • Mistral 7B quantized int4 — Requires iPad/Pro device
  • Phi 2.7B — Good quality/size tradeoff
  • Gemma 2B — Purpose-built for mobile

Example with TinyLlama:

actor LLMInference {
    private let model: TinyLlamaModel
    private let tokenizer: Tokenizer
    
    init() throws {
        self.model = try TinyLlamaModel(configuration: .init())
        self.tokenizer = try Tokenizer.load()
    }
    
    func generate(prompt: String, maxTokens: Int = 256) async throws -> String {
        let inputIds = tokenizer.encode(prompt)
        var generatedIds = inputIds
        var result = prompt
        
        for _ in 0..<maxTokens {
            let inputArray = try MLMultiArray(shape: [1, NSNumber(value: generatedIds.count)], dataType: .int32)
            for (i, id) in generatedIds.enumerated() {
                inputArray[i] = NSNumber(value: id)
            }
            
            // Greedy decoding: pick the highest-logit token
            // (argmax() is assumed to be a small helper extension on MLMultiArray)
            let output = try model.prediction(inputIds: inputArray)
            let nextTokenId = output.logits.argmax()
            
            if nextTokenId == tokenizer.eosTokenId { break }
            
            generatedIds.append(nextTokenId)
            result += tokenizer.decode([nextTokenId])
        }
        
        return result
    }
}

Text Encoding/Decoding

Tokenization must exactly match the model's training-time tokenizer, or outputs will be meaningless:

struct Tokenizer {
    private let vocabulary: [String: Int]
    private let reverseVocab: [Int: String]
    
    static func load() throws -> Self {
        // Load from bundled vocab.json
        let vocabData = try Data(contentsOf: Bundle.main.url(forResource: "vocab", withExtension: "json")!)
        let vocab = try JSONDecoder().decode([String: Int].self, from: vocabData)
        
        var reverseVocab: [Int: String] = [:]
        for (token, id) in vocab {
            reverseVocab[id] = token
        }
        
        return Tokenizer(vocabulary: vocab, reverseVocab: reverseVocab)
    }
    
    func encode(_ text: String) -> [Int] {
        // Naive whitespace tokenization for illustration only; real models
        // require their matching BPE/SentencePiece tokenizer
        let tokens = text.split(separator: " ").map(String.init)
        return tokens.compactMap { vocabulary[$0] }
    }
    
    func decode(_ ids: [Int]) -> String {
        let tokens = ids.compactMap { reverseVocab[$0] }
        return tokens.joined(separator: " ")
    }
    
    var eosTokenId: Int { vocabulary["</s>"] ?? -1 }
}

Token Streaming

For better UX, stream tokens as they’re generated:

func generateStreaming(prompt: String) -> AsyncThrowingStream<String, Error> {
    // AsyncThrowingStream (not AsyncStream) is required: only its
    // continuation supports finish(throwing:)
    AsyncThrowingStream { continuation in
        Task {
            do {
                let inputIds = tokenizer.encode(prompt)
                var generatedIds = inputIds
                
                // Emit initial prompt
                continuation.yield(prompt)
                
                for _ in 0..<256 {
                    let output = try predict(inputIds: generatedIds)
                    let nextId = output.logits.argmax()
                    
                    if nextId == tokenizer.eosTokenId { break }
                    
                    let token = tokenizer.decode([nextId])
                    continuation.yield(token)
                    generatedIds.append(nextId)
                    
                    // Brief pause so the UI can render each yielded token
                    try await Task.sleep(nanoseconds: 1_000_000)  // 1ms
                }
                
                continuation.finish()
            } catch {
                continuation.finish(throwing: error)
            }
        }
    }
}

// Usage in SwiftUI (the helper method must live at struct level,
// not inside body)
struct GeneratorView: View {
    @State private var generatedText = ""
    @State private var isGenerating = false
    
    var body: some View {
        VStack {
            TextEditor(text: $generatedText)
            
            Button(isGenerating ? "Generating..." : "Generate") {
                generateText()
            }
            .disabled(isGenerating)
        }
    }
    
    private func generateText() {
        isGenerating = true
        generatedText = ""
        
        Task { @MainActor in
            do {
                for try await token in generateStreaming(prompt: "Write a story:") {
                    generatedText += token
                }
            } catch {
                generatedText = "Generation failed: \(error.localizedDescription)"
            }
            isGenerating = false
        }
    }
}

Memory Management for Models

LLMs consume significant memory. Manage carefully:

import os

class ModelManager {
    private(set) var model: LLMModel?
    
    func loadModel() async throws {
        // Check available memory before loading
        let availableMemory = os_proc_available_memory()
        guard availableMemory >= 3 * 1024 * 1024 * 1024 else {  // 3GB threshold
            throw NSError(domain: "Memory", code: -1,
                          userInfo: [NSLocalizedDescriptionKey: "Insufficient memory"])
        }
        
        self.model = try LLMModel(configuration: .init())
    }
    
    func unloadModel() {
        self.model = nil
    }
    
    // Monitor memory during inference
    func monitorMemory() {
        Timer.scheduledTimer(withTimeInterval: 0.5, repeats: true) { _ in
            let available = os_proc_available_memory()
            if available < 500 * 1024 * 1024 {  // <500MB
                os_log("Low memory: %lld MB", log: .default, Int64(available) / 1024 / 1024)
                // Reduce batch size or stop inference
            }
        }
    }
}

10. Optimization Techniques

Quantization (Size + Speed)

Reduce model precision to shrink size and accelerate inference:

import coremltools as ct
from coremltools.models.neural_network import quantization_utils

# Load full-precision model
mlmodel = ct.models.MLModel("model_fp32.mlmodel")

# Quantize to int8 (4x size reduction)
quantized = quantization_utils.quantize_weights(
    mlmodel,
    nbits=8,
    quantization_mode="linear",
)
quantized.save("model_int8.mlmodel")

# For aggressive size reduction, use 4-bit lookup-table quantization
quantized_4bit = quantization_utils.quantize_weights(
    mlmodel,
    nbits=4,
    quantization_mode="kmeans_lut",  # Better quality than linear at 4-bit
)
quantized_4bit.save("model_int4.mlmodel")

# Note: quantization_utils targets the older NeuralNetwork format;
# for ML Program (.mlpackage) models, use the ct.optimize.coreml APIs.

Trade-off:

  • int8: ~1-2% accuracy loss, ~4x smaller
  • int4: ~5-10% accuracy loss, ~8x smaller
  • int2: ~15%+ accuracy loss, rarely used
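
To make the int8 numbers concrete, here is a minimal symmetric linear quantization of one weight tensor in NumPy. This is a sketch of the underlying idea, not the coremltools implementation:

```python
import numpy as np

# Symmetric linear int8 quantization of a single weight tensor,
# showing the size/accuracy trade-off directly.
rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

scale = np.abs(w).max() / 127.0            # one scale for the whole tensor
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale       # dequantized approximation

print(q.nbytes / w.nbytes)                      # 0.25, i.e. 4x smaller
print(float(np.abs(w - w_hat).max()) < scale)   # True: error bounded by one step
```

Per-channel scales and lookup tables (as in `kmeans_lut`) reduce the reconstruction error further, which is why they are preferred at 4 bits.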

Pruning (Remove Unnecessary Connections)

Pruning zeroes out low-value weights. Unstructured pruning (shown below) removes individual weights by magnitude; structured pruning removes entire filters/channels and compresses better on real hardware:

import torch
from torch.nn.utils import prune

model = load_model()

# Prune 50% of weights across network
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.5)

# Make pruning permanent
for name, module in model.named_modules():
    if hasattr(module, 'weight_orig'):
        prune.remove(module, 'weight')

# Export pruned model
torch.onnx.export(model, ...)

Result: 30-50% model size reduction with ~2-5% accuracy loss.
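
The magnitude-pruning idea can be illustrated with plain NumPy, independent of PyTorch:

```python
import numpy as np

# Unstructured magnitude (L1) pruning on a plain array:
# zero out the 50% of weights with the smallest absolute value.
rng = np.random.default_rng(1)
w = rng.standard_normal(1000)

k = len(w) // 2
threshold = np.sort(np.abs(w))[k]   # magnitude of the k-th smallest weight
mask = np.abs(w) >= threshold       # keep only the larger half
pruned = w * mask

print(np.count_nonzero(pruned) / len(w))   # 0.5
```

The zeros only save space if stored in a sparse format or if the hardware skips them, which is why structured pruning tends to give real speedups where unstructured pruning mainly gives compressibility.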

Knowledge Distillation (Compress from Larger Model)

Train a smaller “student” model to mimic a larger “teacher”:

import torch
import torch.nn as nn

teacher_model = load_large_model()  # e.g., a 7-13B teacher such as Mistral 7B
student_model = load_small_model()  # e.g., TinyLlama 1.1B
student_model.train()

criterion = nn.KLDivLoss(reduction='batchmean')
optimizer = torch.optim.Adam(student_model.parameters())

for batch in train_loader:
    input_ids, labels = batch
    
    # Get predictions from both models
    with torch.no_grad():
        teacher_logits = teacher_model(input_ids)
    
    student_logits = student_model(input_ids)
    
    # KL divergence between temperature-softened distributions;
    # scaling by T^2 keeps gradient magnitudes comparable across temperatures
    T = 3.0
    loss = criterion(
        torch.log_softmax(student_logits / T, dim=-1),
        torch.softmax(teacher_logits / T, dim=-1),
    ) * (T * T)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Result: Smaller model with ~95% of larger model’s capability.
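
The role of the temperature can be seen directly: dividing logits by T before the softmax flattens the distribution, which is what exposes the teacher's "dark knowledge" about non-target classes to the student. A small NumPy illustration:

```python
import numpy as np

# Temperature-scaled softmax over a toy 3-class logit vector.
def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

teacher_logits = [4.0, 1.0, 0.5]

p1 = softmax(teacher_logits, T=1.0)   # sharp: top class dominates
p3 = softmax(teacher_logits, T=3.0)   # soft: tail classes gain mass

print(p1[0] > p3[0])   # True: top-class probability shrinks at higher T
print(p3[2] > p1[2])   # True: low-ranked classes become visible targets
```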

Dynamic Model Loading

Load models on-demand to save memory:

actor DynamicModelManager {
    // Typed cache avoids the force-cast from [String: Any]
    private var cachedModels: [String: MLModel] = [:]
    
    func getModel(_ name: String) async throws -> MLModel {
        if let cached = cachedModels[name] {
            return cached
        }
        
        let model = try load(name)
        cachedModels[name] = model
        return model
    }
    
    func unloadModel(_ name: String) {
        cachedModels.removeValue(forKey: name)
    }
    
    func clearCache() {
        cachedModels.removeAll()
    }
}

Result Caching

Avoid re-computing identical inputs:

struct CachedInference {
    private let cache = NSCache<NSString, NSData>()
    private let model: MLModel
    
    func infer(input: String) async throws -> String {
        let key = NSString(string: input)
        
        if let cached = cache.object(forKey: key),
           let result = String(data: cached as Data, encoding: .utf8) {
            return result
        }
        
        let output = try await runModel(input: input)
        cache.setObject(Data(output.utf8) as NSData, forKey: key)  // no force-unwrap
        
        return output
    }
}
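
The same pattern in Python is a single decorator; `functools.lru_cache` bounds the cache much like `NSCache` does:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=128)
def infer(prompt: str) -> str:
    """Stand-in for an expensive model call."""
    global calls
    calls += 1
    return prompt.upper()

infer("hello")
infer("hello")   # served from cache, no second model call
infer("world")

print(calls)                    # 2
print(infer.cache_info().hits)  # 1
```

Caching only helps when inputs repeat exactly; for near-duplicate prompts, an embedding-similarity cache is the usual next step.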

11. Practical Examples

Example 1: Local AI Assistant on Mac Mini

Setup: Mistral 7B running on Mac Mini M2, accessed via Xcode Playground or SwiftUI app.

Architecture:

User Input (SwiftUI)
        ↓
FastAPI Server (Python on Mac)
        ↓
MLX / llama.cpp (inference)
        ↓
Stream tokens back to app

Python server (pseudo-code):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from mistral import Mistral
import asyncio

app = FastAPI()
model = Mistral.from_pretrained("mistral-7b", device="mps")
tokenizer = Mistral.get_tokenizer()

@app.post("/generate")
async def generate(prompt: str):
    def stream_tokens():
        tokens = model.generate(
            tokenizer.encode(prompt),
            max_tokens=256,
            temperature=0.7,
        )
        for token in tokens:
            yield tokenizer.decode([token]).encode()
    
    return StreamingResponse(stream_tokens(), media_type="text/plain")

iOS client:

struct AIAssistant {
    private let serverURL = URL(string: "http://localhost:8000")!
    
    func generateResponse(to prompt: String) async throws -> String {
        var request = URLRequest(url: serverURL.appendingPathComponent("/generate"))
        request.httpMethod = "POST"
        request.httpBody = prompt.data(using: .utf8)
        
        let (asyncBytes, _) = try await URLSession.shared.bytes(for: request)
        
        var buffer = Data()
        var result = ""
        for try await byte in asyncBytes {
            buffer.append(byte)
            // Decode the whole buffer each pass: decoding byte-by-byte
            // would split multi-byte UTF-8 characters
            if let text = String(data: buffer, encoding: .utf8) {
                result = text
                // Update UI with streaming text
            }
        }
        
        return result
    }
}

Example 2: Convert Hugging Face Sentence Embeddings to iOS

Goal: Embed user queries locally, compare to stored embeddings.

# 1. Export model
python << 'EOF'
from sentence_transformers import SentenceTransformer
import torch

# Note: real sentence-transformer inputs are token IDs; a float tensor
# is used here only to keep the tracing example minimal.
model = SentenceTransformer("all-MiniLM-L6-v2")
example_input = torch.randn(1, 512)

traced = torch.jit.trace(model, example_input)
torch.jit.save(traced, "sentence_model.pt")
EOF

# 2. Convert to CoreML
python << 'EOF'
import coremltools as ct
import numpy as np
import torch

traced = torch.jit.load("sentence_model.pt")
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 512), dtype=np.float32)],
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("SentenceEmbedder.mlmodel")
EOF

iOS usage:

class SemanticSearch {
    private let embedder = try! SentenceEmbedder(configuration: .init())
    private var documentEmbeddings: [String: [Float]] = [:]
    
    func indexDocuments(_ docs: [String]) async throws {
        for doc in docs {
            let embedding = try await embed(doc)
            documentEmbeddings[doc] = embedding
        }
    }
    
    func search(_ query: String, topK: Int = 5) async throws -> [String] {
        let queryEmbedding = try await embed(query)
        
        let scores = documentEmbeddings.mapValues { embedding in
            cosineSimilarity(queryEmbedding, embedding)
        }
        
        return scores.sorted { $0.value > $1.value }
            .prefix(topK)
            .map { $0.key }
    }
    
    private func embed(_ text: String) async throws -> [Float] {
        // Tokenize and pass to model
        let output = try embedder.prediction(input: tokenize(text))
        return output.embeddings
    }
    
    private func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
        let dotProduct = zip(a, b).map(*).reduce(0, +)
        let normA = sqrt(a.map { $0 * $0 }.reduce(0, +))
        let normB = sqrt(b.map { $0 * $0 }.reduce(0, +))
        return dotProduct / (normA * normB)
    }
}
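
The ranking logic ports directly to Python for quick cross-checking against the Swift version; the two-dimensional vectors below are toy values, not real embeddings (MiniLM produces 384-dimensional vectors):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-d "embeddings" standing in for sentence vectors
docs = {"cats": [1.0, 0.0], "dogs": [0.9, 0.1], "cars": [0.0, 1.0]}
query = [1.0, 0.05]

ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]),
                reverse=True)
print(ranked)   # ['cats', 'dogs', 'cars']
```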

Example 3: Hybrid On-Device + Cloud System

Pattern: Local for speed, cloud for quality.

enum AnalysisResult {
    case local(String, timeMs: Int)
    case cloud(String, timeMs: Int)
}

class HybridAnalyzer {
    private let localModel = try! LocalModel()
    private let cloudEndpoint = URL(string: "https://api.example.com/analyze")!
    
    func analyze(_ text: String) async -> AnalysisResult {
        let startLocal = Date()
        let localResult = try? await localModel.analyze(text)
        let localTime = Int(Date().timeIntervalSince(startLocal) * 1000)
        
        if let localResult, isHighConfidence(localResult) {
            return .local(localResult, timeMs: localTime)
        }
        
        // Fall back to cloud
        let startCloud = Date()
        let cloudResult = try? await callCloud(text: text)
        let cloudTime = Int(Date().timeIntervalSince(startCloud) * 1000)
        
        return .cloud(cloudResult ?? "Error", timeMs: cloudTime)
    }
    
    private func isHighConfidence(_ result: String) -> Bool {
        // Heuristic: if local response is >80 chars, likely high quality
        return result.count > 80
    }
    
    private func callCloud(text: String) async throws -> String {
        var request = URLRequest(url: cloudEndpoint)
        request.httpMethod = "POST"
        request.httpBody = try JSONEncoder().encode(["text": text])
        
        let (data, _) = try await URLSession.shared.data(for: request)
        let response = try JSONDecoder().decode(
            [String: String].self,
            from: data
        )
        return response["result"] ?? "No response"
    }
}

Example 4: iOS Writing Assistant

Features:

  • Grammar/spell check (on-device)
  • Tone analysis (on-device)
  • Rewrite suggestions (cloud)
struct WritingAssistant: View {
    @State private var text = ""
    @State private var suggestions: [Suggestion] = []
    @State private var isAnalyzing = false
    
    private let analyzer = TextAnalyzer()
    
    var body: some View {
        VStack(spacing: 16) {
            TextEditor(text: $text)
                .border(Color.gray)
            
            if isAnalyzing {
                ProgressView()
            } else {
                Button("Analyze") {
                    analyzeText()
                }
            }
            
            List(suggestions) { suggestion in
                VStack(alignment: .leading) {
                    Text(suggestion.type)
                        .font(.caption)
                        .foregroundColor(.blue)
                    Text(suggestion.message)
                        .font(.body)
                    if let replacement = suggestion.replacement {
                        Button(action: { replaceText(suggestion, with: replacement) }) {
                            Label("Use suggestion", systemImage: "checkmark.circle")
                        }
                    }
                }
            }
        }
        .padding()
    }
    
    private func analyzeText() {
        isAnalyzing = true
        suggestions = []
        
        Task { @MainActor in
            // On-device: grammar and tone
            if let grammar = try? await analyzer.checkGrammar(text),
               let tone = try? await analyzer.analyzeTone(text) {
                suggestions = grammar + tone
            }
            
            // Cloud: rewrites (may fail offline; errors are ignored)
            if let rewrites = try? await analyzer.suggestRewrites(text) {
                suggestions.append(contentsOf: rewrites)
            }
            
            // The task runs on the main actor, so state updates are safe
            // without DispatchQueue hops, and isAnalyzing always resets
            isAnalyzing = false
        }
    }
    
    private func replaceText(_ suggestion: Suggestion, with replacement: String) {
        text = text.replacingOccurrences(of: suggestion.text, with: replacement)
    }
}

struct Suggestion: Identifiable {
    let id = UUID()
    let type: String  // "Grammar", "Tone", "Rewrite"
    let text: String  // Original text
    let message: String
    let replacement: String?
}

class TextAnalyzer {
    private let grammarModel = try! GrammarChecker()
    private let toneModel = try! ToneAnalyzer()
    
    func checkGrammar(_ text: String) async throws -> [Suggestion] {
        let issues = try grammarModel.check(text)
        return issues.map { issue in
            Suggestion(
                type: "Grammar",
                text: issue.text,
                message: issue.message,
                replacement: issue.correction
            )
        }
    }
    
    func analyzeTone(_ text: String) async throws -> [Suggestion] {
        let analysis = try toneModel.analyze(text)
        return analysis.suggestions.map { sugg in
            Suggestion(
                type: "Tone",
                text: sugg.text,
                message: sugg.message,
                replacement: sugg.alternative
            )
        }
    }
    
    func suggestRewrites(_ text: String) async throws -> [Suggestion] {
        let cloudURL = URL(string: "https://api.example.com/rewrite")!
        var request = URLRequest(url: cloudURL)
        request.httpMethod = "POST"
        request.httpBody = try JSONEncoder().encode(["text": text])
        
        let (data, _) = try await URLSession.shared.data(for: request)
        let response = try JSONDecoder().decode(
            CloudRewriteResponse.self,
            from: data
        )
        
        return response.rewrites.map { rewrite in
            Suggestion(
                type: "Rewrite",
                text: text,
                message: "Consider: \"\(rewrite.text)\"",
                replacement: rewrite.text
            )
        }
    }
}

struct CloudRewriteResponse: Codable {
    struct Rewrite: Codable {
        let text: String
        let rationale: String
    }
    let rewrites: [Rewrite]
}

12. Ecosystem and Tools

Xcode Integration

Creating .mlmodel files for Xcode projects:

  1. Train with Create ML (images, text, tabular)

    • Xcode → Open Developer Tool → Create ML
    • Choose task type (image classification, sound classification, etc.)
    • Train in the Create ML app, no code required
    • Output: .mlmodel file
  2. Importing an existing .mlmodel

    • Drag .mlmodel into Xcode project
    • Xcode auto-generates Swift code
    • Access via generated classes
// Auto-generated from imported model
import CoreML

let model = MyModel()
let prediction = try model.prediction(input: MyModelInput(...))

ml-stable-diffusion (Successful Conversion Example)

Apple’s ml-stable-diffusion project demonstrates converting large generative models to CoreML:

git clone https://github.com/apple/ml-stable-diffusion
cd ml-stable-diffusion

# Convert Stable Diffusion to CoreML (see the repo README for the full flag list)
python -m python_coreml_stable_diffusion.torch2coreml \
  --model-version stabilityai/stable-diffusion-2-base \
  --convert-unet --convert-text-encoder --convert-vae-decoder \
  -o ./output

Results:

  • 5GB model → ~2GB (int8 quantized)
  • Generate 512x512 image: ~30-60 sec on M1 Mac
  • Seed and guidance-scale control via inputs

MLX (Apple’s ML Framework, 2024+)

MLX is Apple’s newer framework for large-scale ML on Apple Silicon:

from mlx_lm import load, generate

# Load Mistral directly (weights fetched from Hugging Face on first use)
model, tokenizer = load("mistralai/Mistral-7B-v0.1")

prompt = "Once upon a time"
response = generate(model, tokenizer, prompt=prompt, max_tokens=100, verbose=True)

Advantages over PyTorch on Apple Silicon:

  • Unified memory optimized
  • 2-4x faster on Apple Silicon
  • Native support for dynamic shapes

As of April 2026, MLX↔CoreML bridge is under development (not yet direct export).

Community Libraries

LoRA fine-tuning via mlx-lm: fine-tune models on-device

pip install mlx-lm
python -m mlx_lm.lora --model mistralai/Mistral-7B-v0.1 --train --data ./data

llama.cpp: Inference in C++ (can port to iOS)

git clone https://github.com/ggerganov/llama.cpp
# build per the repo README, then:
./llama-cli -m model.gguf -p "Once upon a time"

swift-transformers (Hugging Face): Swift port of tokenizers and model utilities for on-device inference (experimental)


Best Practices Summary

When to Use On-Device Models

Do use on-device when:

  • Data is sensitive (health, finance, personal)
  • Must work offline
  • Latency <100ms required
  • Compliance mandates (GDPR, HIPAA, government)
  • Cost per request matters (no API fees)
  • Model is <5B parameters

Don’t use on-device when:

  • Need latest knowledge (current events, web data)
  • Inference must be <50ms on mobile
  • Model >10B parameters
  • Heavy inference frequency (battery drain)
  • Perfect accuracy critical (cloud models better)
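
These rules of thumb can be collapsed into a small routing helper. The function name and thresholds below are illustrative assumptions drawn from the two lists above, not Apple guidance:

```python
def prefer_on_device(params_b: float, needs_offline: bool,
                     sensitive_data: bool, latency_budget_ms: int) -> bool:
    """Illustrative heuristic encoding the do/don't lists above."""
    if needs_offline or sensitive_data:
        return params_b <= 5           # privacy/offline forces local, if it fits
    if params_b > 10:
        return False                   # too large for on-device inference
    if latency_budget_ms < 50:
        return False                   # sub-50ms on mobile is rarely achievable
    return params_b < 5

print(prefer_on_device(3, needs_offline=False, sensitive_data=True,
                       latency_budget_ms=200))    # True
print(prefer_on_device(13, needs_offline=False, sensitive_data=False,
                       latency_budget_ms=200))    # False
```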

Development Checklist

  • Model converts without errors (test ONNX intermediate)
  • Quantization target verified (int8 for most cases)
  • Bundle size < device storage limit (check distribution)
  • Latency acceptable for use case (profile on target device)
  • Battery impact tested (real device, not simulator)
  • Memory pressure tested (low memory device tested)
  • Privacy model documented (where data goes)
  • Fallback strategy if inference fails (cloud, cached results)
  • Error handling for OOM, thermal throttling
  • Model versioning strategy (how to update in app)

Performance Targets

| Device      | Target Model   | Latency          | Memory          |
|-------------|----------------|------------------|-----------------|
| iPhone      | <2B (2.5B max) | <500ms/inference | <2GB model      |
| iPad Pro    | <7B            | <100ms/inference | <5GB model      |
| MacBook Air | <7B            | <50ms/inference  | <5GB model      |
| MacBook Pro | 7-13B          | <50ms/inference  | <10GB model     |
| Mac Studio  | 30-65B         | <100ms/inference | Depends on VRAM |

References and Further Reading

Tools:

  • MLX — Apple’s ML framework
  • llama.cpp — Portable LLM inference
  • ONNX — Model interchange format


Last Updated: April 2026

This guide is a living document. As Apple Intelligence and CoreML evolve, recommendations and best practices will be updated. Check for April 2026+ changes in Apple’s official documentation.


Validation Checklist

How do you know you got this right?

Performance Checks

  • CoreML model inference latency measured on target device (iPhone, iPad, or Mac) and within acceptable budget (<500ms for mobile, <100ms for Mac)
  • Model size after conversion fits within app bundle constraints (check against App Store limits and device storage)
  • Battery impact tested on real device during sustained inference (not just simulator)

Implementation Checks

  • Model converts from source framework (PyTorch/TensorFlow) to CoreML without errors (test ONNX intermediate step)
  • Quantization target verified: int8 for most cases, int4 only if size-critical and accuracy loss acceptable
  • Memory pressure tested on lowest-spec target device (e.g., iPhone with 6GB RAM, not just MacBook Pro)
  • Thermal throttling handled: app degrades gracefully when device is hot (check ProcessInfo.processInfo.thermalState)
  • OOM fallback strategy implemented: app doesn’t crash if model can’t load (fall back to cloud or cached results)
  • Token streaming implemented for LLM outputs (don’t freeze UI waiting for full generation)
  • Model loaded once as singleton, not re-created per inference call

Integration Checks

  • On-device vs cloud routing logic implemented (local for fast/private tasks, cloud for complex reasoning)
  • Privacy model documented: confirmed no user data leaves device unless explicitly permitted
  • Model versioning strategy defined: how to update the CoreML model without requiring a full app update

Common Failure Modes

  • Conversion failure at ONNX step: Model uses unsupported dynamic operations. Fix: use torch.jit.trace instead of torch.jit.script, or simplify forward pass to remove conditionals.
  • Neural Engine not used: Large LLMs with dynamic shapes route to CPU/GPU instead. Fix: expected behavior for LLMs; Neural Engine benefits CNNs and fixed-shape transformers. Profile with Instruments to confirm.
  • Silent accuracy degradation after quantization: int4 model produces nonsensical output. Fix: validate output quality on 50+ test inputs before shipping; use int8 if int4 quality is insufficient.
  • App rejected for bundle size: CoreML model exceeds App Store limits. Fix: host model on-demand via Background Assets framework or use more aggressive quantization.

Sign-Off Criteria

  • Tested on real device (not just simulator) across the full user workflow
  • Latency, memory, and battery benchmarks recorded and within targets from the Performance Targets table
  • Hybrid on-device/cloud fallback tested: confirmed app works offline and recovers when connectivity returns
  • Error handling covers OOM, thermal throttling, and model load failure without crashing
  • Model accuracy validated against source framework output on 100+ representative inputs

See Also

  • Doc 03 (Hugging Face Ecosystem) — Find and evaluate models to convert to CoreML; quantization affects on-device performance
  • Doc 04 (Memory Systems) — Unified memory (Apple Silicon) affects memory architecture; on-device memory patterns differ from cloud
  • Doc 24 (Hardware Landscape) — Understand Apple Silicon architecture (unified memory, Neural Engine) for CoreML optimization
  • Doc 25 (Edge & Physical AI) — On-device inference is core to edge deployment; CoreML is the implementation path