
Apple Intelligence & CoreML

On-device AI on Apple hardware — CoreML framework, Neural Engine optimization, model conversion, and performance benchmarks per device.

Running machine learning models directly on Apple hardware unlocks privacy-first applications, offline functionality, and ultra-low latency inference. This guide covers Apple Intelligence, CoreML, and practical strategies for deploying AI on iPhones, iPads, and Macs.

1. Apple Intelligence (As of April 2026)

What It Is

Apple Intelligence is Apple’s on-device AI suite integrated into iOS, iPadOS, and macOS. It comprises privacy-first generative capabilities built directly into the operating system, from text processing to image understanding and writing assistance.

Key characteristics:

  • Runs entirely on-device (neural engine + GPU)
  • Data never leaves the device unless explicitly sent to Apple’s private cloud compute
  • Models are proprietary, optimized for Apple Silicon
  • Integrated into system frameworks and native apps
  • Seamless across Apple devices with iCloud+ subscription

Privacy-First Approach

Apple Intelligence enforces a “data minimization” architecture:

  1. On-device processing — Text understanding, image analysis, and basic generation run locally
  2. Private Cloud Compute — Complex queries can optionally use Apple’s encrypted cloud infrastructure
    • Data is processed in an isolated environment
    • Apple cannot see user data (cryptographic guarantees)
    • Can be disabled per-request or system-wide
  3. User control — Explicit opt-in for cloud features; defaults to on-device

Models and Capabilities

As of April 2026, Apple Intelligence provides:

| Capability | Location | Use Case |
|---|---|---|
| Text understanding | On-device | Classify intent, extract entities, summarization |
| Writing tools | On-device + cloud | Proofreading, rewriting, tone adjustment |
| Image understanding | On-device | Describe images, identify objects, OCR |
| Generative images | Cloud | Create images from text (DALL-E style) |
| Smart reply | On-device | Email/message suggestions |
| Siri enhancements | Hybrid | Contextual understanding, personal requests |

Limits to know:

  • Text generation is constrained (summaries, rewrites, not open-ended chat)
  • Image generation not available on-device (always cloud)
  • Fine-tuning models per-user is not exposed to developers

When On-Device vs Cloud

Use on-device when:

  • Privacy/compliance required (healthcare, finance, government)
  • Offline operation needed
  • Latency must be sub-100ms
  • User data sensitive (PII, proprietary documents)

Use cloud (Private Cloud Compute) when:

  • More complex reasoning needed
  • Generation quality is critical
  • Computational cost of local inference is high
  • User explicitly opts in

Performance Characteristics

On Apple Silicon devices:

  • M3/M4 MacBook Pro: Single token ~20-30ms, throughput ~30-50 tokens/sec
  • M2 Mac Mini: Single token ~30-40ms, throughput ~25-35 tokens/sec
  • iPhone 15 Pro (A17 Pro): Single token ~80-150ms, throughput ~7-12 tokens/sec
  • iPad Pro 11” (M4): Single token ~25-35ms, throughput ~28-40 tokens/sec

These assume quantized models (int8 or lower). Performance degrades gracefully on older devices.
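As a sanity check, the throughput figures above are simply the reciprocal of steady-state per-token latency. A minimal sketch, using representative latency values from the ranges quoted above:

```python
def tokens_per_sec(per_token_ms: float) -> float:
    """Steady-state throughput implied by per-token latency."""
    return 1000.0 / per_token_ms

# Representative per-token latencies from the ranges above
for device, latency_ms in [
    ("M3/M4 MacBook Pro", 25),
    ("M2 Mac Mini", 35),
    ("iPhone 15 Pro", 100),
]:
    print(f"{device}: ~{tokens_per_sec(latency_ms):.0f} tokens/sec")
```

First-token latency is higher than this steady-state figure because the entire prompt must be processed before decoding begins.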

Future Roadmap (2026 and Beyond)

Expected improvements:

  • Larger on-device model support (current limit ~3-7B parameters)
  • Faster inference via hardware optimizations
  • Developer API for custom on-device models
  • Cross-device model caching (via iCloud+)
  • Federated learning (training on private data)

2. CoreML Framework

What CoreML Is

CoreML is Apple’s native machine learning framework for iOS, iPadOS, macOS, watchOS, and tvOS. It enables on-device inference for models trained in popular frameworks (TensorFlow, PyTorch, scikit-learn, etc.).

Key features:

  • Automatic hardware optimization (CPU, GPU, Neural Engine)
  • Low memory footprint
  • Battery-efficient inference
  • Seamless integration with Vision and NLP frameworks
  • Precompiled models (.mlmodel format)

Supported Model Formats

CoreML accepts models in multiple formats:

| Source Framework | Supported? | Notes |
|---|---|---|
| TensorFlow 2.x | ✅ | Via ONNX converter or coremltools |
| PyTorch | ✅ | Via ONNX or coremltools (direct conversion works) |
| ONNX | ✅ | Universal format; recommended for interop |
| scikit-learn | ✅ | Linear/tree-based models fully supported |
| XGBoost | ✅ | Gradient boosting models |
| LibSVM | ✅ | SVM models |
| Apple CreateML | ✅ | Xcode-native training (images, sounds, tabular) |
| Hugging Face | ⚠️ | Via ONNX conversion pipeline |

Format: CoreML models are packaged as .mlmodel (a single-file protobuf model) or .mlpackage (a directory structure for larger models and associated assets).

Performance Optimizations with Metal Performance Shaders

CoreML leverages Metal (Apple’s graphics API) for GPU inference:

  1. Metal Performance Shaders (MPS) — Specialized kernels for neural network operations

    • Matrix multiplication, convolution, pooling highly optimized
    • Automatic batching and fusion of operations
    • 2-10x faster than scalar GPU code
  2. Neural Engine access — On devices with neural engines (A14+, M1+):

    • Dedicated fixed-function AI accelerator
    • Lowest latency for specific model architectures
    • Used automatically by CoreML when beneficial
  3. Quantization support — Reduce model precision:

    • float16 (half-precision) — ~2x size reduction, minimal accuracy loss
    • int8 (8-bit integer) — ~4x size reduction, slight accuracy loss
    • int4 (4-bit) — ~8x size reduction, more accuracy loss (requires careful calibration)

Neural Engine (Dedicated AI Chip)

Modern Apple devices include a Neural Engine — a fixed-function coprocessor for ML:

| Device | Neural Engine Cores | Performance |
|---|---|---|
| iPhone 15 Pro (A17 Pro) | 16 cores | ~13 TOPS (int8) |
| iPad Pro 11” M4 | 16 cores | ~12 TOPS |
| MacBook Pro 16” M3 Max | 16 cores | ~14 TOPS |
| Mac Studio M2 Ultra | 32 cores | ~28 TOPS |

When used: CoreML automatically routes operations (convolutions, matrix multiplies in specific shapes) to the Neural Engine when beneficial. You don’t need to explicitly manage this.

Important: The Neural Engine only accelerates specific model types (CNNs, RNNs, Transformers with certain patterns). Large sparse or dynamic models may run faster on GPU.

Versions and Compatibility

| CoreML Version | Minimum OS | Features |
|---|---|---|
| CoreML 1.0 (2017) | iOS 11 | Basic inference, image classification |
| CoreML 2.0 (2018) | iOS 12 | Updatable models, custom operations |
| CoreML 3.0 (2019) | iOS 13 | Control flow, flexible shapes |
| CoreML 4.0 (2020) | iOS 14 | Vision integration, NLP models |
| CoreML 5.x (2021–2023) | iOS 15+ | Larger model support, better quantization |
| CoreML 6.x (2024+) | iOS 18, macOS 15+ | MLX integration, dynamic shapes, improved GPU |

Recommendation: For new projects, target CoreML 6.x (iOS 18+) to access latest optimizations.


3. Converting Hugging Face Models to CoreML

Model Compatibility

Not all Hugging Face models convert cleanly to CoreML. Best candidates:

Readily convertible:

  • Vision transformers (ViT, DINO, CLIP visual encoder)
  • Small text encoders (sentence-transformers, distilBERT)
  • Stable Diffusion (via specialized ml-stable-diffusion project)
  • LLaMA-based models up to 7B (with quantization)

Difficult to convert:

  • Models with dynamic control flow (if statements in forward pass)
  • Models with custom operations not in ONNX standard
  • Large models >30B parameters (memory constraints)
  • Audio models with complex graph structures

Conversion Process

Strategy 1: Hugging Face → ONNX → CoreML

# 1. Export the Hugging Face model to ONNX via Optimum
from optimum.exporters.onnx import main_export

model_id = "sentence-transformers/all-MiniLM-L6-v2"
main_export(model_id, output="./onnx_model")

# 2. Convert ONNX → CoreML
# Note: recent coremltools releases removed the built-in ONNX frontend,
# so this step relies on the legacy onnx-coreml converter API
import coremltools as ct

coreml_model = ct.converters.onnx.convert(
    model="./onnx_model/model.onnx",
    minimum_ios_deployment_target="13",
)
coreml_model.save("./model.mlmodel")

Advantages:

  • ONNX is a well-supported intermediate format
  • Better error messages if conversion fails
  • Standardized transformations (pruning, quantization) work on ONNX

Strategy 2: PyTorch → CoreML (Direct)

# 1. Trace the PyTorch model to TorchScript
import numpy as np
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model.eval()

# Transformer forward passes contain data-dependent branches, so tracing
# with a representative input is more robust than torch.jit.script
example_input = torch.randint(0, 30000, (1, 512))  # [batch, seq_len] token IDs
traced_model = torch.jit.trace(model, example_input, strict=False)
torch.jit.save(traced_model, "model.pt")

# 2. Convert TorchScript → CoreML
import coremltools as ct

traced_model = torch.jit.load("model.pt")

coreml_model = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=(1, 512), dtype=np.int32)],
    minimum_deployment_target=ct.target.iOS16,
)
coreml_model.save("./model.mlmodel")

Advantages:

  • Single-step conversion
  • Fewer potential loss points
  • Better preserves model semantics

Strategy 3: Using Specialized Tools

For large models, use framework-specific converters:

Stable Diffusion:

git clone https://github.com/apple/ml-stable-diffusion
cd ml-stable-diffusion

python -m python_coreml_stable_diffusion.torch2coreml \
  --model-version stabilityai/stable-diffusion-2-base \
  --convert-unet --convert-text-encoder --convert-vae-decoder \
  --compute-unit ALL \
  -o ./models

LLaMA / Mistral (via MLX):

# Convert the Hugging Face checkpoint to MLX format (with quantization)
python -m mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.1 -q --mlx-path ./mistral_mlx

# Then convert MLX → CoreML (tooling in development as of April 2026)

Example: Convert Mistral 7B to CoreML

Practical walkthrough:

# 1. Install dependencies
pip install transformers optimum[onnxruntime] coremltools

# 2. Download and export Mistral to ONNX
python << 'EOF'
from pathlib import Path

from optimum.exporters.onnx import main_export

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
output_dir = Path("./mistral_onnx")

main_export(
    model_id,
    output=output_dir,
    task="text-generation",
)
EOF

# 3. Quantize ONNX model (int8) to reduce size from ~14GB (fp16) to ~7GB
python << 'EOF'
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

quantizer = ORTQuantizer.from_pretrained("./mistral_onnx")
config = AutoQuantizationConfig.avx512_vnni()
quantizer.quantize(save_dir="./mistral_onnx_quantized", quantization_config=config)
EOF

# 4. Convert ONNX → CoreML with quantization
python << 'EOF'
from pathlib import Path

import coremltools as ct

# Legacy ONNX frontend (removed from recent coremltools releases;
# newer workflows convert directly from PyTorch instead)
mlmodel = ct.converters.onnx.convert(
    model="./mistral_onnx_quantized/model.onnx",
    minimum_ios_deployment_target="13",
)

# 4-bit weight quantization for aggressive size reduction
mlmodel = ct.models.neural_network.quantization_utils.quantize_weights(
    mlmodel,
    nbits=4,
    quantization_mode="linear",
)

mlmodel.save("./Mistral-7B.mlmodel")
print(f"Model size: {Path('./Mistral-7B.mlmodel').stat().st_size / 1e9:.2f}GB")
EOF

Result: with 4-bit weight quantization, Mistral 7B converts to a ~3.5GB CoreML model, runnable on a Mac Mini or MacBook Pro.

Quantization During Conversion

Quantization reduces model size and inference latency at the cost of accuracy:

| Bit Width | Size Reduction | Accuracy Loss | Best For |
|---|---|---|---|
| float32 (baseline) | 1x | 0% | Reference, high accuracy |
| float16 | 2x | <1% | GPU inference, any device |
| int8 | 4x | 1-5% | CPUs, Neural Engine, iOS |
| int4 | 8x | 5-10% | Tight memory budgets |
| int3 | 10x | 15-20% | Extreme constraints (watch) |
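The size-reduction column maps directly to bytes per parameter. A quick sketch for a 7B-parameter model:

```python
def model_size_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight footprint (weights only, excluding KV cache)."""
    return params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: {model_size_gb(7e9, bits):.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```

This is why the ~3.5GB Mistral 7B figures elsewhere in this guide correspond to 4-bit weights.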

Quantization-aware training (QAT):

When accuracy is critical, true QAT inserts fake-quantization ops during training so the model learns to compensate for reduced precision. A simpler alternative, shown below, is post-training dynamic quantization:

import torch
from torch.ao.quantization import quantize_dynamic

# Post-training dynamic quantization of the linear layers
# (`load_pretrained_model` is a placeholder for your own loader)
model = load_pretrained_model()
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)

# Then convert to CoreML as shown earlier

4. Performance on Apple Hardware

Unified Memory Advantage

Apple Silicon (M1/M2/M3/M4) uses a unified memory architecture where GPU and CPU share the same physical memory pool. This eliminates copies between systems.

Traditional GPU (discrete):

RAM → PCIe → GPU VRAM → GPU → GPU VRAM → PCIe → RAM
                    ↓ Bottleneck: PCIe bandwidth (16GB/s)

Apple Silicon (unified):

Shared Memory ← GPU, CPU, Neural Engine all access directly
             ↓ No copies; direct access ~100GB/s

Impact: A 7B-parameter model (~14GB in float16) can be loaded once and referenced by all compute units. No duplication.

Latency Benchmarks by Device

Measured end-to-end latency for Mistral 7B (quantized int4, ~3.5GB):

| Device | First Token | Per Token | Tokens/Sec | Context Window |
|---|---|---|---|---|
| MacBook Pro 16” M3 Max | 800ms | 20-25ms | 40-50 | 32K tokens (~25MB) |
| Mac Studio M2 Ultra | 600ms | 15-18ms | 55-65 | 64K tokens (~50MB) |
| Mac Mini M2 | 1200ms | 35-40ms | 25-28 | 8K tokens (~6MB) |
| MacBook Air M2 | 1500ms | 45-55ms | 18-22 | 4K tokens (~3MB) |
| iPad Pro 11” M4 | 2000ms | 80-100ms | 10-12 | 2K tokens (~1.5MB) |
| iPhone 15 Pro | 5000ms+ | 150-200ms | 5-7 | 512 tokens (~400KB) |

Observation: M-series Macs with 16GB+ unified memory run 7B models smoothly. iPhones/iPads have practical limits around 2-3B models.

Neural Engine Usage by Architecture

The Neural Engine accelerates specific operations:

Models that benefit (faster on NE):

  • ResNet, Vision Transformers (CNNs + attention)
  • Stable Diffusion (diffusion models)
  • BERT-sized encoders (<12 layers)

Models that don’t benefit (CPU/GPU better):

  • Large LLMs with dynamic shapes (Mistral, Llama 30B+)
  • Models with custom CUDA kernels (not expressible in ONNX)
  • Models with sparse operations

For LLMs, CoreML typically routes matmul to CPU/GPU, not Neural Engine.


5. Memory and Compute Constraints

Unified Memory Implications

With unified memory, the constraint is total device RAM, not GPU VRAM:

Max Model Size ≈ 60-70% of Device RAM (the rest is reserved for the OS and app overhead)

| Device | Total RAM | Usable for Model | Largest Model (fp16) |
|---|---|---|---|
| MacBook Pro 16” M3 Max | 36GB | ~24GB | 12B parameters |
| Mac Studio M2 Ultra | 192GB | ~130GB | 65B parameters |
| Mac Mini M2 | 16GB | ~10GB | 5B parameters |
| MacBook Air M2 | 8GB | ~5GB | 2.5B parameters |
| iPad Pro 11” M4 | 16GB | ~10GB | 5B parameters |
| iPhone 15 Pro | 8GB | ~5GB | 2.5B parameters |
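The "largest model" column follows directly from the usable-RAM rule of thumb. A sketch, assuming 65% of RAM is usable and 2 bytes per fp16 parameter:

```python
def largest_fp16_model_b(total_ram_gb: float, usable_fraction: float = 0.65) -> float:
    """Largest fp16 model (billions of parameters) that fits in usable RAM."""
    usable_bytes = total_ram_gb * usable_fraction * 1e9
    return usable_bytes / 2 / 1e9  # 2 bytes per fp16 parameter

print(f"{largest_fp16_model_b(36):.1f}B")   # 36GB machine → 11.7B
print(f"{largest_fp16_model_b(192):.1f}B")  # 192GB machine → 62.4B
```

Quantization shifts these limits accordingly: int8 doubles the fitting parameter count, int4 quadruples it.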

Context Window Limitations

Larger context windows consume more memory:

Memory ≈ Model Size + KV Cache
KV Cache ≈ 2 × Layers × Context Length × KV Dimension × Bytes per Value × Batch Size

For Mistral 7B (int4, ~3.5GB model):

  • Context 8K: +1.2GB → 4.7GB total
  • Context 16K: +2.4GB → 5.9GB total
  • Context 32K: +4.8GB → 8.3GB total
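These increments can be reproduced with the formula above. The layer count and KV dimension below are assumptions about Mistral 7B's grouped-query attention layout, with the cache held in fp16:

```python
def kv_cache_gb(context_len: int, n_layers: int = 32, kv_dim: int = 1024,
                bytes_per_value: int = 2, batch_size: int = 1) -> float:
    """KV cache size: 2 tensors (keys + values) per layer, per cached token."""
    return 2 * n_layers * context_len * kv_dim * bytes_per_value * batch_size / 1e9

for ctx in (8_192, 16_384, 32_768):
    print(f"{ctx} tokens: ~{kv_cache_gb(ctx):.1f} GB")
# ~1.1 GB / ~2.1 GB / ~4.3 GB, in line with the increments quoted above
```

For models with full multi-head attention (no grouped-query sharing), the KV dimension equals the hidden dimension and the cache grows roughly 4x larger.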

Mac Mini M2 (16GB): Context fits ~10K tokens comfortably.

iPhone 15 Pro (8GB): Context fits ~1-2K tokens.

Real Example: M1 MacBook Pro (16GB)

Running Mistral 7B on a 16GB M1 MacBook Pro:

Model (int4): 3.5GB
OS Reserve: 2-3GB
App Overhead: 0.5GB
Available for Context: ~10GB

KV cache (fp16) costs roughly 0.1-0.5MB per token for a 7B model, so a ~5K-token context fits comfortably
Performance: ~30 tokens/sec

Practical: Good for local AI assistant with ~2K-5K token contexts. Acceptable latency for interactive apps.
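The budget above, as arithmetic (all figures are the estimates from this section):

```python
total_ram_gb = 16.0     # M1 MacBook Pro
model_gb = 3.5          # quantized Mistral 7B
os_reserve_gb = 2.5     # midpoint of the 2-3GB estimate
app_overhead_gb = 0.5

available_gb = total_ram_gb - model_gb - os_reserve_gb - app_overhead_gb
print(f"~{available_gb:.1f} GB left for KV cache and activations")  # ~9.5 GB
```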


6. On-Device vs Cloud Trade-Offs

On-Device Inference

Advantages:

  • Privacy: No data leaves device (except user-initiated exports)
  • Offline: Works without internet
  • Latency: ~20-200ms total, no network delay
  • Cost: No per-request API fees
  • Control: Model behavior is predictable, reproducible

Disadvantages:

  • Limited capability: Model size/complexity constrained by device RAM
  • Stale knowledge: Models don’t have access to real-time information
  • Development iteration: Updating model requires app update (not instant)
  • Device strain: High battery/thermal cost on phones during inference
  • Maintenance: You manage model updates, security patches

Best for:

  • Personal/sensitive data (PII, health, finance)
  • Always-offline apps (flights, remote locations)
  • Sub-50ms latency requirements
  • Compliance-driven apps (HIPAA, GDPR, government)
  • Text classification, intent detection, embedding generation

Cloud Inference (Apple Private Cloud Compute or Third-Party APIs)

Advantages:

  • Capability: Access to largest, most capable models (GPT-4 scale)
  • Real-time knowledge: Current information from web
  • Zero device overhead: Inference costs are server-side, not battery
  • Instant updates: Roll out new models without app update
  • Scalability: Handle traffic spikes transparently

Disadvantages:

  • Privacy: Data sent over internet (even if encrypted)
  • Latency: 500ms-2000ms due to network + server processing
  • Cost: Per-request fees ($0.01-0.10+ per request)
  • Availability: Requires internet connection
  • Dependency: Reliant on third-party service uptime

Best for:

  • Non-sensitive user input (search queries, creative writing)
  • Complex reasoning requiring large models
  • Knowledge-grounded tasks (Q&A with current info)
  • Generative tasks (images, long text)
  • Spiky traffic (don’t want to provision for peaks)

Pattern: Local for fast, common operations; cloud for complex reasoning.

User Input

Local Intent Detection (0-50ms)

    ├─ Fast Path (95% of queries): Local response (embedding lookup, simple rewrite)
    │                               Return to user in 100-200ms

    └─ Complex Path (5% of queries): Send to cloud (with user consent)
                                      Return in 1000-2000ms

Example: Writing assistant app

  • On-device: Grammar check, tone analysis, style transfer (fast models)
  • Cloud: Generate complete rewrites, creative suggestions (large models)
  • User impact: Most corrections instant; complex suggestions wait 1-2 sec
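The fast-path/complex-path split can be sketched as a small router. The keyword classifier and thresholds here are illustrative placeholders, not a real API:

```python
def classify_complexity(query: str) -> str:
    """Toy stand-in for a local intent model: route open-ended or very
    long requests to the cloud; everything else stays local."""
    open_ended = any(w in query.lower() for w in ("write", "rewrite", "imagine"))
    return "cloud" if open_ended or len(query.split()) > 50 else "local"

def route(query: str, cloud_consent: bool) -> str:
    """Cloud is only used with explicit user consent; otherwise fall back."""
    destination = classify_complexity(query)
    if destination == "cloud" and not cloud_consent:
        destination = "local"  # never leave the device without consent
    return destination

print(route("fix the grammar in this sentence", cloud_consent=False))  # local
print(route("rewrite my essay in a formal tone", cloud_consent=True))  # cloud
```

In production, the classifier itself would be a small on-device model, keeping the routing decision private as well.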

7. Building iOS Apps with CoreML

SwiftUI Integration

import SwiftUI
import CoreML

struct ContentView: View {
    @State private var inputText: String = ""
    @State private var resultText: String = ""
    @State private var isProcessing = false
    @State private var error: String?
    
    @Environment(\.dismiss) var dismiss
    
    var body: some View {
        VStack(spacing: 16) {
            TextEditor(text: $inputText)
                .border(Color.gray)
                .frame(height: 100)
            
            Button(action: performInference) {
                if isProcessing {
                    ProgressView()
                        .progressViewStyle(.circular)
                } else {
                    Text("Analyze")
                }
            }
            .disabled(isProcessing)
            
            if !resultText.isEmpty {
                Text(resultText)
                    .font(.body)
                    .padding()
                    .background(Color.gray.opacity(0.1))
            }
            
            if let error {
                Text("Error: \(error)")
                    .foregroundColor(.red)
                    .font(.caption)
            }
            
            Spacer()
        }
        .padding()
        .navigationTitle("Text Analysis")
    }
    
    private func performInference() {
        isProcessing = true
        error = nil
        
        Task {
            do {
                resultText = try await analyzeText(inputText)
            } catch {
                self.error = error.localizedDescription
            }
            isProcessing = false
        }
    }
}

// MARK: - CoreML Integration

actor TextAnalyzer {
    private let model: TextClassifier  // Auto-generated from .mlmodel
    
    private init() {
        self.model = try! TextClassifier(configuration: .init())
    }
    
    static let shared = TextAnalyzer()
    
    func analyzeText(_ input: String) async throws -> String {
        guard !input.isEmpty else { return "No input provided" }
        
        // Tokenize input (framework-dependent; this is pseudo-code)
        let tokens = tokenize(input)
        let embedding = try model.prediction(input: tokens)
        
        return interpretResult(embedding)
    }
    
    private func tokenize(_ text: String) -> MLMultiArray {
        // Convert text to token IDs matching the model's vocabulary
        // (tokenizer-dependent; the IDs below are placeholders)
        let tokenIds: [Int32] = [1, 2, 3, 4]
        let array = try! MLMultiArray(shape: [1, NSNumber(value: tokenIds.count)], dataType: .int32)
        for (i, id) in tokenIds.enumerated() {
            array[i] = NSNumber(value: id)
        }
        return array
    }
    
    private func interpretResult(_ embedding: TextClassifierOutput) -> String {
        // Map model output to user-friendly text
        return "Analyzed: \(embedding.label ?? "Unknown")"
    }
}

// Usage in async context
func analyzeText(_ text: String) async throws -> String {
    return try await TextAnalyzer.shared.analyzeText(text)
}

Handling Model Inference in App

Best practices:

  1. Load model once, reuse — Models are expensive to load; create a singleton
  2. Run on background thread — Inference can block; use Task/async
  3. Batch operations — If analyzing multiple texts, batch them for efficiency
  4. Stream results — For long outputs, yield tokens incrementally
  5. Handle OOM gracefully — Have fallback behavior if model can’t load
actor InferenceEngine {
    private let model: MyModel
    
    static let shared = InferenceEngine()
    
    private init() {
        do {
            self.model = try MyModel(configuration: MLModelConfiguration())
        } catch {
            os_log("Failed to load model: %{public}@", log: .default, type: .error, error.localizedDescription)
            fatalError("Model unavailable")
        }
    }
    
    // The actor already serializes calls and runs them off the main thread,
    // so no extra dispatch queue or continuation plumbing is needed
    func infer(input: InputData) throws -> OutputData {
        return try model.prediction(input: input)
    }
}

Battery and Thermal Considerations

Inference is power-hungry. Plan accordingly:

  1. Check battery and Low Power Mode — Don't run heavy inference when power is constrained

    let isLowPowerMode = ProcessInfo.processInfo.isLowPowerModeEnabled
    
    if isLowPowerMode {
        // Use smaller model, reduce batch size, or defer to cloud
    }
  2. Throttle frequency — Don’t call inference on every keystroke

    @State private var lastInferenceTime: Date = .distantPast
    
    func shouldRunInference() -> Bool {
        return Date().timeIntervalSince(lastInferenceTime) > 1.0  // Max once per second
    }
  3. Limit parallel inferences — Queue requests, don’t spawn threads unbounded

  4. Monitor thermal state — Slow down or defer work if device is hot

    let thermalState = ProcessInfo.processInfo.thermalState
    if thermalState == .critical {
        // Halt inference until device cools
    }

User Experience: Handling Latency

Don’t freeze the UI. Provide feedback during inference:

VStack(spacing: 16) {
    if isProcessing {
        ProgressView(value: progress)
            .tint(.blue)
        Text("Analyzing... ~\(estimatedRemainingSeconds)s")
            .font(.caption)
            .foregroundColor(.gray)
    } else {
        Text("Tap to analyze")
    }
}

For streaming outputs (like LLM tokens):

func streamInference(input: String) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        Task {
            do {
                let tokens = try await inferenceEngine.inferStream(input: input)
                for try await token in tokens {
                    continuation.yield(token)
                }
                continuation.finish()
            } catch {
                continuation.finish(throwing: error)
            }
        }
    }
}

// Usage: consume the stream from a task tied to the view's lifetime
Text(generatedText)
    .task {
        do {
            for try await token in streamInference(input: inputText) {
                generatedText += token
            }
        } catch {
            // Surface streaming errors to the user as appropriate
        }
    }

8. Vision Models on iOS

Vision models process images in real-time. The Vision framework integrates with CoreML seamlessly.

Vision Framework + CoreML

import Vision
import CoreML

class ObjectDetector {
    private let visionModel: VNCoreMLModel
    
    init() throws {
        let model = try ObjectDetectionModel(configuration: .init())
        // Vision requires the CoreML model to be wrapped in VNCoreMLModel
        self.visionModel = try VNCoreMLModel(for: model.model)
    }
    
    func detect(in image: UIImage) async throws -> [DetectedObject] {
        guard let ciImage = CIImage(image: image) else {
            throw NSError(domain: "Vision", code: -1, userInfo: nil)
        }
        
        return try await withCheckedThrowingContinuation { continuation in
            // The completion handler must be supplied when the request is created
            let request = VNCoreMLRequest(model: visionModel) { request, error in
                if let error {
                    continuation.resume(throwing: error)
                    return
                }
                
                let results = request.results as? [VNRecognizedObjectObservation] ?? []
                let detected = results.map { observation in
                    DetectedObject(
                        label: observation.labels.first?.identifier ?? "Unknown",
                        confidence: Float(observation.confidence),
                        boundingBox: observation.boundingBox
                    )
                }
                
                continuation.resume(returning: detected)
            }
            
            let handler = VNImageRequestHandler(ciImage: ciImage)
            do {
                try handler.perform([request])
            } catch {
                continuation.resume(throwing: error)
            }
        }
    }
}

Real-Time Processing (Camera Feed)

import AVFoundation
import Vision

class CameraProcessor: NSObject, ObservableObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    private let detector = try! ObjectDetector()
    @Published var detections: [DetectedObject] = []
    
    private let session = AVCaptureSession()
    private let queue = DispatchQueue(label: "vision.processing")
    
    func startSession() throws {
        guard let camera = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .back) else {
            throw NSError(domain: "Camera", code: -1)
        }
        
        let input = try AVCaptureDeviceInput(device: camera)
        session.addInput(input)
        
        let output = AVCaptureVideoDataOutput()
        output.setSampleBufferDelegate(self, queue: queue)
        session.addOutput(output)
        
        session.startRunning()
    }
    
    func captureOutput(
        _ output: AVCaptureOutput,
        didOutput sampleBuffer: CMSampleBuffer,
        from connection: AVCaptureConnection
    ) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        
        let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
        let uiImage = UIImage(ciImage: ciImage)
        
        Task {
            do {
                let detections = try await self.detector.detect(in: uiImage)
                DispatchQueue.main.async {
                    self.detections = detections
                }
            } catch {
                os_log("Detection failed: %{public}@", log: .default, type: .error, error.localizedDescription)
            }
        }
    }
}

Example: Document Scanner

import Vision
import CoreImage.CIFilterBuiltins

class DocumentScanner {
    private let visionRequest = VNDetectDocumentSegmentationRequest()
    
    func scanDocument(from image: UIImage) async throws -> UIImage {
        guard let ciImage = CIImage(image: image) else {
            throw NSError(domain: "Scanner", code: -1)
        }
        
        let handler = VNImageRequestHandler(ciImage: ciImage)
        try handler.perform([visionRequest])
        
        // Document segmentation produces rectangle observations
        guard let observation = visionRequest.results?.first else {
            throw NSError(domain: "Scanner", code: -2, userInfo: [NSLocalizedDescriptionKey: "No document detected"])
        }
        
        // Correct perspective based on the detected document boundaries
        let correctedImage = perspectiveCorrection(ciImage, to: observation)
        return UIImage(ciImage: correctedImage)
    }
    
    private func perspectiveCorrection(_ image: CIImage, to doc: VNRectangleObservation) -> CIImage {
        // Vision coordinates are normalized [0, 1]; scale to image pixels
        let size = image.extent.size
        func scaled(_ point: CGPoint) -> CGPoint {
            CGPoint(x: point.x * size.width, y: point.y * size.height)
        }
        
        // Apply the perspective-correction filter
        let filter = CIFilter.perspectiveCorrection()
        filter.inputImage = image
        filter.topLeft = scaled(doc.topLeft)
        filter.topRight = scaled(doc.topRight)
        filter.bottomLeft = scaled(doc.bottomLeft)
        filter.bottomRight = scaled(doc.bottomRight)
        
        return filter.outputImage ?? image
    }
}

9. Text Models on iOS

Running LLMs on iOS is constrained but feasible for smaller models (2-3B parameters).

Running Small LLMs on iOS

Model selection:

  • TinyLlama 1.1B — Fits on iPhone, decent quality
  • Mistral 7B quantized int4 — Requires iPad/Pro device
  • Phi 2.7B — Good quality/size tradeoff
  • Gemma 2B — Purpose-built for mobile

Example with TinyLlama:

actor LLMInference {
    private let model: TinyLlamaModel
    private let tokenizer: Tokenizer
    
    init() throws {
        self.model = try TinyLlamaModel(configuration: .init())
        self.tokenizer = try Tokenizer.load()
    }
    
    func generate(prompt: String, maxTokens: Int = 256) async throws -> String {
        let inputIds = tokenizer.encode(prompt)
        var generatedIds = inputIds
        var result = prompt
        
        for _ in 0..<maxTokens {
            let inputArray = try MLMultiArray(shape: [1, NSNumber(value: generatedIds.count)], dataType: .int32)
            for (i, id) in generatedIds.enumerated() {
                inputArray[i] = NSNumber(value: id)
            }
            
            // Greedy decoding: pick the highest-logit token
            // (argmax() is assumed to be a small helper extension on MLMultiArray)
            let output = try model.prediction(inputIds: inputArray)
            let nextTokenId = output.logits.argmax()
            
            if nextTokenId == tokenizer.eosTokenId { break }
            
            generatedIds.append(nextTokenId)
            result += tokenizer.decode([nextTokenId])
        }
        
        return result
    }
}

Text Encoding/Decoding

Tokenization must exactly match the model's training-time tokenizer, or outputs will be meaningless:

struct Tokenizer {
    private let vocabulary: [String: Int]
    private let reverseVocab: [Int: String]
    
    static func load() throws -> Self {
        // Load from bundled vocab.json
        let vocabData = try Data(contentsOf: Bundle.main.url(forResource: "vocab", withExtension: "json")!)
        let vocab = try JSONDecoder().decode([String: Int].self, from: vocabData)
        
        var reverseVocab: [Int: String] = [:]
        for (token, id) in vocab {
            reverseVocab[id] = token
        }
        
        return Tokenizer(vocabulary: vocab, reverseVocab: reverseVocab)
    }
    
    func encode(_ text: String) -> [Int] {
        // Naive whitespace tokenization for illustration only; real models
        // require their matching BPE/SentencePiece tokenizer
        let tokens = text.split(separator: " ").map(String.init)
        return tokens.compactMap { vocabulary[$0] }
    }
    
    func decode(_ ids: [Int]) -> String {
        let tokens = ids.compactMap { reverseVocab[$0] }
        return tokens.joined(separator: " ")
    }
    
    var eosTokenId: Int { vocabulary["</s>"] ?? -1 }
}

Token Streaming

For better UX, stream tokens as they’re generated:

func generateStreaming(prompt: String) -> AsyncThrowingStream<String, Error> {
    // AsyncThrowingStream (not AsyncStream) is required: only its
    // continuation supports finish(throwing:)
    AsyncThrowingStream { continuation in
        Task {
            do {
                let inputIds = tokenizer.encode(prompt)
                var generatedIds = inputIds
                
                // Emit initial prompt
                continuation.yield(prompt)
                
                for _ in 0..<256 {
                    let output = try predict(inputIds: generatedIds)
                    let nextId = output.logits.argmax()
                    
                    if nextId == tokenizer.eosTokenId { break }
                    
                    let token = tokenizer.decode([nextId])
                    continuation.yield(token)
                    generatedIds.append(nextId)
                    
                    // Brief pause so the UI can render each yielded token
                    try await Task.sleep(nanoseconds: 1_000_000)  // 1ms
                }
                
                continuation.finish()
            } catch {
                continuation.finish(throwing: error)
            }
        }
    }
}

// Usage in SwiftUI (the helper method must live at struct level,
// not inside body)
struct GeneratorView: View {
    @State private var generatedText = ""
    @State private var isGenerating = false
    
    var body: some View {
        VStack {
            TextEditor(text: $generatedText)
            
            Button(isGenerating ? "Generating..." : "Generate") {
                generateText()
            }
            .disabled(isGenerating)
        }
    }
    
    private func generateText() {
        isGenerating = true
        generatedText = ""
        
        Task { @MainActor in
            do {
                for try await token in generateStreaming(prompt: "Write a story:") {
                    generatedText += token
                }
            } catch {
                generatedText = "Generation failed: \(error.localizedDescription)"
            }
            isGenerating = false
        }
    }
}

Memory Management for Models

LLMs consume significant memory. Manage carefully:

import os

class ModelManager {
    private(set) var model: LLMModel?
    
    func loadModel() async throws {
        // Check available memory before loading
        let availableMemory = os_proc_available_memory()
        guard availableMemory >= 3 * 1024 * 1024 * 1024 else {  // 3GB threshold
            throw NSError(domain: "Memory", code: -1,
                          userInfo: [NSLocalizedDescriptionKey: "Insufficient memory"])
        }
        
        self.model = try LLMModel(configuration: .init())
    }
    
    func unloadModel() {
        self.model = nil
    }
    
    // Monitor memory during inference
    func monitorMemory() {
        Timer.scheduledTimer(withTimeInterval: 0.5, repeats: true) { _ in
            let available = os_proc_available_memory()
            if available < 500 * 1024 * 1024 {  // <500MB
                os_log("Low memory: %lld MB", log: .default, Int64(available) / 1024 / 1024)
                // Reduce batch size or stop inference
            }
        }
    }
}

10. Optimization Techniques

Quantization (Size + Speed)

Reduce model precision to shrink size and accelerate inference:

import coremltools as ct
from coremltools.models.neural_network import quantization_utils

# Load full-precision model
mlmodel = ct.models.MLModel("model_fp32.mlmodel")

# Quantize to int8 (4x size reduction)
quantized = quantization_utils.quantize_weights(
    mlmodel,
    nbits=8,
    quantization_mode="linear",
)
quantized.save("model_int8.mlmodel")

# For aggressive size reduction, use 4-bit lookup-table quantization
quantized_4bit = quantization_utils.quantize_weights(
    mlmodel,
    nbits=4,
    quantization_mode="kmeans_lut",  # Better quality than linear at 4-bit
)
quantized_4bit.save("model_int4.mlmodel")

# Note: quantization_utils targets the older NeuralNetwork format;
# for ML Program (.mlpackage) models, use the ct.optimize.coreml APIs.

Trade-off:

  • int8: ~1-2% accuracy loss, ~4x smaller
  • int4: ~5-10% accuracy loss, ~8x smaller
  • int2: ~15%+ accuracy loss, rarely used
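
To make the int8 numbers concrete, here is a minimal symmetric linear quantization of one weight tensor in NumPy. This is a sketch of the underlying idea, not the coremltools implementation:

```python
import numpy as np

# Symmetric linear int8 quantization of a single weight tensor,
# showing the size/accuracy trade-off directly.
rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

scale = np.abs(w).max() / 127.0            # one scale for the whole tensor
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale       # dequantized approximation

print(q.nbytes / w.nbytes)                      # 0.25, i.e. 4x smaller
print(float(np.abs(w - w_hat).max()) < scale)   # True: error bounded by one step
```

Per-channel scales and lookup tables (as in `kmeans_lut`) reduce the reconstruction error further, which is why they are preferred at 4 bits.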

Pruning (Remove Unnecessary Connections)

Pruning zeroes out low-value weights. Unstructured pruning (shown below) removes individual weights by magnitude; structured pruning removes entire filters/channels and compresses better on real hardware:

import torch
from torch.nn.utils import prune

model = load_model()

# Prune 50% of weights across network
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.5)

# Make pruning permanent
for name, module in model.named_modules():
    if hasattr(module, 'weight_orig'):
        prune.remove(module, 'weight')

# Export pruned model
torch.onnx.export(model, ...)

Result: 30-50% model size reduction with ~2-5% accuracy loss.
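
The magnitude-pruning idea can be illustrated with plain NumPy, independent of PyTorch:

```python
import numpy as np

# Unstructured magnitude (L1) pruning on a plain array:
# zero out the 50% of weights with the smallest absolute value.
rng = np.random.default_rng(1)
w = rng.standard_normal(1000)

k = len(w) // 2
threshold = np.sort(np.abs(w))[k]   # magnitude of the k-th smallest weight
mask = np.abs(w) >= threshold       # keep only the larger half
pruned = w * mask

print(np.count_nonzero(pruned) / len(w))   # 0.5
```

The zeros only save space if stored in a sparse format or if the hardware skips them, which is why structured pruning tends to give real speedups where unstructured pruning mainly gives compressibility.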

Knowledge Distillation (Compress from Larger Model)

Train a smaller “student” model to mimic a larger “teacher”:

import torch
import torch.nn as nn

teacher_model = load_large_model()  # e.g., a 7-13B teacher such as Mistral 7B
student_model = load_small_model()  # e.g., TinyLlama 1.1B
student_model.train()

criterion = nn.KLDivLoss(reduction='batchmean')
optimizer = torch.optim.Adam(student_model.parameters())

for batch in train_loader:
    input_ids, labels = batch
    
    # Get predictions from both models
    with torch.no_grad():
        teacher_logits = teacher_model(input_ids)
    
    student_logits = student_model(input_ids)
    
    # KL divergence between temperature-softened distributions;
    # scaling by T^2 keeps gradient magnitudes comparable across temperatures
    T = 3.0
    loss = criterion(
        torch.log_softmax(student_logits / T, dim=-1),
        torch.softmax(teacher_logits / T, dim=-1),
    ) * (T * T)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Result: Smaller model with ~95% of larger model’s capability.
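
The role of the temperature can be seen directly: dividing logits by T before the softmax flattens the distribution, which is what exposes the teacher's "dark knowledge" about non-target classes to the student. A small NumPy illustration:

```python
import numpy as np

# Temperature-scaled softmax over a toy 3-class logit vector.
def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

teacher_logits = [4.0, 1.0, 0.5]

p1 = softmax(teacher_logits, T=1.0)   # sharp: top class dominates
p3 = softmax(teacher_logits, T=3.0)   # soft: tail classes gain mass

print(p1[0] > p3[0])   # True: top-class probability shrinks at higher T
print(p3[2] > p1[2])   # True: low-ranked classes become visible targets
```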

Dynamic Model Loading

Load models on-demand to save memory:

actor DynamicModelManager {
    // Typed cache avoids the force-cast from [String: Any]
    private var cachedModels: [String: MLModel] = [:]
    
    func getModel(_ name: String) async throws -> MLModel {
        if let cached = cachedModels[name] {
            return cached
        }
        
        let model = try load(name)
        cachedModels[name] = model
        return model
    }
    
    func unloadModel(_ name: String) {
        cachedModels.removeValue(forKey: name)
    }
    
    func clearCache() {
        cachedModels.removeAll()
    }
}

Result Caching

Avoid re-computing identical inputs:

struct CachedInference {
    private let cache = NSCache<NSString, NSData>()
    private let model: MLModel
    
    func infer(input: String) async throws -> String {
        let key = NSString(string: input)
        
        if let cached = cache.object(forKey: key),
           let result = String(data: cached as Data, encoding: .utf8) {
            return result
        }
        
        let output = try await runModel(input: input)
        cache.setObject(Data(output.utf8) as NSData, forKey: key)  // no force-unwrap
        
        return output
    }
}
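
The same pattern in Python is a single decorator; `functools.lru_cache` bounds the cache much like `NSCache` does:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=128)
def infer(prompt: str) -> str:
    """Stand-in for an expensive model call."""
    global calls
    calls += 1
    return prompt.upper()

infer("hello")
infer("hello")   # served from cache, no second model call
infer("world")

print(calls)                    # 2
print(infer.cache_info().hits)  # 1
```

Caching only helps when inputs repeat exactly; for near-duplicate prompts, an embedding-similarity cache is the usual next step.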

11. Practical Examples

Example 1: Local AI Assistant on Mac Mini

Setup: Mistral 7B running on Mac Mini M2, accessed via Xcode Playground or SwiftUI app.

Architecture:

User Input (SwiftUI)
        ↓
FastAPI Server (Python on Mac)
        ↓
MLX / llama.cpp (inference)
        ↓
Stream tokens back to app

Python server (pseudo-code):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from mistral import Mistral
import asyncio

app = FastAPI()
model = Mistral.from_pretrained("mistral-7b", device="mps")
tokenizer = Mistral.get_tokenizer()

@app.post("/generate")
async def generate(prompt: str):
    def stream_tokens():
        tokens = model.generate(
            tokenizer.encode(prompt),
            max_tokens=256,
            temperature=0.7,
        )
        for token in tokens:
            yield tokenizer.decode([token]).encode()
    
    return StreamingResponse(stream_tokens(), media_type="text/plain")

iOS client:

struct AIAssistant {
    private let serverURL = URL(string: "http://localhost:8000")!
    
    func generateResponse(to prompt: String) async throws -> String {
        var request = URLRequest(url: serverURL.appendingPathComponent("/generate"))
        request.httpMethod = "POST"
        request.httpBody = prompt.data(using: .utf8)
        
        let (asyncBytes, _) = try await URLSession.shared.bytes(for: request)
        
        var buffer = Data()
        var result = ""
        for try await byte in asyncBytes {
            buffer.append(byte)
            // Decode the whole buffer each pass: decoding byte-by-byte
            // would split multi-byte UTF-8 characters
            if let text = String(data: buffer, encoding: .utf8) {
                result = text
                // Update UI with streaming text
            }
        }
        
        return result
    }
}

Example 2: Convert Hugging Face Sentence Embeddings to iOS

Goal: Embed user queries locally, compare to stored embeddings.

# 1. Export model
python << 'EOF'
from sentence_transformers import SentenceTransformer
import torch

# Note: real sentence-transformer inputs are token IDs; a float tensor
# is used here only to keep the tracing example minimal.
model = SentenceTransformer("all-MiniLM-L6-v2")
example_input = torch.randn(1, 512)

traced = torch.jit.trace(model, example_input)
torch.jit.save(traced, "sentence_model.pt")
EOF

# 2. Convert to CoreML
python << 'EOF'
import coremltools as ct
import numpy as np
import torch

traced = torch.jit.load("sentence_model.pt")
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 512), dtype=np.float32)],
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("SentenceEmbedder.mlmodel")
EOF

iOS usage:

class SemanticSearch {
    private let embedder = try! SentenceEmbedder(configuration: .init())
    private var documentEmbeddings: [String: [Float]] = [:]
    
    func indexDocuments(_ docs: [String]) async throws {
        for doc in docs {
            let embedding = try await embed(doc)
            documentEmbeddings[doc] = embedding
        }
    }
    
    func search(_ query: String, topK: Int = 5) async throws -> [String] {
        let queryEmbedding = try await embed(query)
        
        let scores = documentEmbeddings.mapValues { embedding in
            cosineSimilarity(queryEmbedding, embedding)
        }
        
        return scores.sorted { $0.value > $1.value }
            .prefix(topK)
            .map { $0.key }
    }
    
    private func embed(_ text: String) async throws -> [Float] {
        // Tokenize and pass to model
        let output = try embedder.prediction(input: tokenize(text))
        return output.embeddings
    }
    
    private func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
        let dotProduct = zip(a, b).map(*).reduce(0, +)
        let normA = sqrt(a.map { $0 * $0 }.reduce(0, +))
        let normB = sqrt(b.map { $0 * $0 }.reduce(0, +))
        return dotProduct / (normA * normB)
    }
}
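
The ranking logic ports directly to Python for quick cross-checking against the Swift version; the two-dimensional vectors below are toy values, not real embeddings (MiniLM produces 384-dimensional vectors):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-d "embeddings" standing in for sentence vectors
docs = {"cats": [1.0, 0.0], "dogs": [0.9, 0.1], "cars": [0.0, 1.0]}
query = [1.0, 0.05]

ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]),
                reverse=True)
print(ranked)   # ['cats', 'dogs', 'cars']
```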

Example 3: Hybrid On-Device + Cloud System

Pattern: Local for speed, cloud for quality.

enum AnalysisResult {
    case local(String, timeMs: Int)
    case cloud(String, timeMs: Int)
}

class HybridAnalyzer {
    private let localModel = try! LocalModel()
    private let cloudEndpoint = URL(string: "https://api.example.com/analyze")!
    
    func analyze(_ text: String) async -> AnalysisResult {
        let startLocal = Date()
        let localResult = try? await localModel.analyze(text)
        let localTime = Int(Date().timeIntervalSince(startLocal) * 1000)
        
        if let localResult, isHighConfidence(localResult) {
            return .local(localResult, timeMs: localTime)
        }
        
        // Fall back to cloud
        let startCloud = Date()
        let cloudResult = try? await callCloud(text: text)
        let cloudTime = Int(Date().timeIntervalSince(startCloud) * 1000)
        
        return .cloud(cloudResult ?? "Error", timeMs: cloudTime)
    }
    
    private func isHighConfidence(_ result: String) -> Bool {
        // Heuristic: if local response is >80 chars, likely high quality
        return result.count > 80
    }
    
    private func callCloud(text: String) async throws -> String {
        var request = URLRequest(url: cloudEndpoint)
        request.httpMethod = "POST"
        request.httpBody = try JSONEncoder().encode(["text": text])
        
        let (data, _) = try await URLSession.shared.data(for: request)
        let response = try JSONDecoder().decode(
            [String: String].self,
            from: data
        )
        return response["result"] ?? "No response"
    }
}

Example 4: iOS Writing Assistant

Features:

  • Grammar/spell check (on-device)
  • Tone analysis (on-device)
  • Rewrite suggestions (cloud)
struct WritingAssistant: View {
    @State private var text = ""
    @State private var suggestions: [Suggestion] = []
    @State private var isAnalyzing = false
    
    private let analyzer = TextAnalyzer()
    
    var body: some View {
        VStack(spacing: 16) {
            TextEditor(text: $text)
                .border(Color.gray)
            
            if isAnalyzing {
                ProgressView()
            } else {
                Button("Analyze") {
                    analyzeText()
                }
            }
            
            List(suggestions) { suggestion in
                VStack(alignment: .leading) {
                    Text(suggestion.type)
                        .font(.caption)
                        .foregroundColor(.blue)
                    Text(suggestion.message)
                        .font(.body)
                    if let replacement = suggestion.replacement {
                        Button(action: { replaceText(suggestion, with: replacement) }) {
                            Label("Use suggestion", systemImage: "checkmark.circle")
                        }
                    }
                }
            }
        }
        .padding()
    }
    
    private func analyzeText() {
        isAnalyzing = true
        suggestions = []
        
        Task { @MainActor in
            // On-device: grammar and tone
            if let grammar = try? await analyzer.checkGrammar(text),
               let tone = try? await analyzer.analyzeTone(text) {
                suggestions = grammar + tone
            }
            
            // Cloud: rewrites (may fail offline; errors are ignored)
            if let rewrites = try? await analyzer.suggestRewrites(text) {
                suggestions.append(contentsOf: rewrites)
            }
            
            // The task runs on the main actor, so state updates are safe
            // without DispatchQueue hops, and isAnalyzing always resets
            isAnalyzing = false
        }
    }
    
    private func replaceText(_ suggestion: Suggestion, with replacement: String) {
        text = text.replacingOccurrences(of: suggestion.text, with: replacement)
    }
}

struct Suggestion: Identifiable {
    let id = UUID()
    let type: String  // "Grammar", "Tone", "Rewrite"
    let text: String  // Original text
    let message: String
    let replacement: String?
}

class TextAnalyzer {
    private let grammarModel = try! GrammarChecker()
    private let toneModel = try! ToneAnalyzer()
    
    func checkGrammar(_ text: String) async throws -> [Suggestion] {
        let issues = try grammarModel.check(text)
        return issues.map { issue in
            Suggestion(
                type: "Grammar",
                text: issue.text,
                message: issue.message,
                replacement: issue.correction
            )
        }
    }
    
    func analyzeTone(_ text: String) async throws -> [Suggestion] {
        let analysis = try toneModel.analyze(text)
        return analysis.suggestions.map { sugg in
            Suggestion(
                type: "Tone",
                text: sugg.text,
                message: sugg.message,
                replacement: sugg.alternative
            )
        }
    }
    
    func suggestRewrites(_ text: String) async throws -> [Suggestion] {
        let cloudURL = URL(string: "https://api.example.com/rewrite")!
        var request = URLRequest(url: cloudURL)
        request.httpMethod = "POST"
        request.httpBody = try JSONEncoder().encode(["text": text])
        
        let (data, _) = try await URLSession.shared.data(for: request)
        let response = try JSONDecoder().decode(
            CloudRewriteResponse.self,
            from: data
        )
        
        return response.rewrites.map { rewrite in
            Suggestion(
                type: "Rewrite",
                text: text,
                message: "Consider: \"\(rewrite.text)\"",
                replacement: rewrite.text
            )
        }
    }
}

struct CloudRewriteResponse: Codable {
    struct Rewrite: Codable {
        let text: String
        let rationale: String
    }
    let rewrites: [Rewrite]
}

12. Ecosystem and Tools

Xcode Integration

Creating .mlmodel files for Xcode projects:

  1. Train with Create ML (images, text, tabular)

    • Xcode → Open Developer Tool → Create ML
    • Choose task type (image classification, sound classification, etc.)
    • Train in the Create ML app, no code required
    • Output: .mlmodel file
  2. Importing an existing .mlmodel

    • Drag .mlmodel into Xcode project
    • Xcode auto-generates Swift code
    • Access via generated classes
// Auto-generated from imported model
import CoreML

let model = MyModel()
let prediction = try model.prediction(input: MyModelInput(...))

ml-stable-diffusion (Successful Conversion Example)

Apple’s ml-stable-diffusion project demonstrates converting large generative models to CoreML:

git clone https://github.com/apple/ml-stable-diffusion
cd ml-stable-diffusion

# Convert Stable Diffusion to CoreML (see the repo README for the full flag list)
python -m python_coreml_stable_diffusion.torch2coreml \
  --model-version stabilityai/stable-diffusion-2-base \
  --convert-unet --convert-text-encoder --convert-vae-decoder \
  -o ./output

Results:

  • 5GB model → ~2GB (int8 quantized)
  • Generate 512x512 image: ~30-60 sec on M1 Mac
  • Seed and guidance-scale control via inputs

MLX (Apple’s ML Framework, 2024+)

MLX is Apple’s newer framework for large-scale ML on Apple Silicon:

from mlx_lm import load, generate

# Load Mistral directly (weights fetched from Hugging Face on first use)
model, tokenizer = load("mistralai/Mistral-7B-v0.1")

prompt = "Once upon a time"
response = generate(model, tokenizer, prompt=prompt, max_tokens=100, verbose=True)

Advantages over PyTorch on Apple Silicon:

  • Unified memory optimized
  • 2-4x faster on Apple Silicon
  • Native support for dynamic shapes

As of April 2026, MLX↔CoreML bridge is under development (not yet direct export).

Community Libraries

LoRA fine-tuning via mlx-lm: fine-tune models on-device

pip install mlx-lm
python -m mlx_lm.lora --model mistralai/Mistral-7B-v0.1 --train --data ./data

llama.cpp: Inference in C++ (can port to iOS)

git clone https://github.com/ggerganov/llama.cpp
# build per the repo README, then:
./llama-cli -m model.gguf -p "Once upon a time"

swift-transformers (Hugging Face): Swift port of tokenizers and model utilities for on-device inference (experimental)


Best Practices Summary

When to Use On-Device Models

Do use on-device when:

  • Data is sensitive (health, finance, personal)
  • Must work offline
  • Latency <100ms required
  • Compliance mandates (GDPR, HIPAA, government)
  • Cost per request matters (no API fees)
  • Model is <5B parameters

Don’t use on-device when:

  • Need latest knowledge (current events, web data)
  • Inference must be <50ms on mobile
  • Model >10B parameters
  • Heavy inference frequency (battery drain)
  • Perfect accuracy critical (cloud models better)
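
These rules of thumb can be collapsed into a small routing helper. The function name and thresholds below are illustrative assumptions drawn from the two lists above, not Apple guidance:

```python
def prefer_on_device(params_b: float, needs_offline: bool,
                     sensitive_data: bool, latency_budget_ms: int) -> bool:
    """Illustrative heuristic encoding the do/don't lists above."""
    if needs_offline or sensitive_data:
        return params_b <= 5           # privacy/offline forces local, if it fits
    if params_b > 10:
        return False                   # too large for on-device inference
    if latency_budget_ms < 50:
        return False                   # sub-50ms on mobile is rarely achievable
    return params_b < 5

print(prefer_on_device(3, needs_offline=False, sensitive_data=True,
                       latency_budget_ms=200))    # True
print(prefer_on_device(13, needs_offline=False, sensitive_data=False,
                       latency_budget_ms=200))    # False
```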

Development Checklist

  • Model converts without errors (test ONNX intermediate)
  • Quantization target verified (int8 for most cases)
  • Bundle size < device storage limit (check distribution)
  • Latency acceptable for use case (profile on target device)
  • Battery impact tested (real device, not simulator)
  • Memory pressure tested (low memory device tested)
  • Privacy model documented (where data goes)
  • Fallback strategy if inference fails (cloud, cached results)
  • Error handling for OOM, thermal throttling
  • Model versioning strategy (how to update in app)

Performance Targets

| Device      | Target Model   | Latency          | Memory          |
|-------------|----------------|------------------|-----------------|
| iPhone      | <2B (2.5B max) | <500ms/inference | <2GB model      |
| iPad Pro    | <7B            | <100ms/inference | <5GB model      |
| MacBook Air | <7B            | <50ms/inference  | <5GB model      |
| MacBook Pro | 7-13B          | <50ms/inference  | <10GB model     |
| Mac Studio  | 30-65B         | <100ms/inference | Depends on VRAM |

References and Further Reading

Tools:

  • MLX — Apple’s ML framework
  • llama.cpp — Portable LLM inference
  • ONNX — Model interchange format


Last Updated: April 2026

This guide is a living document. As Apple Intelligence and CoreML evolve, recommendations and best practices will be updated. Check for April 2026+ changes in Apple’s official documentation.


Validation Checklist

How do you know you got this right?

Performance Checks

  • CoreML model inference latency measured on target device (iPhone, iPad, or Mac) and within acceptable budget (<500ms for mobile, <100ms for Mac)
  • Model size after conversion fits within app bundle constraints (check against App Store limits and device storage)
  • Battery impact tested on real device during sustained inference (not just simulator)

Implementation Checks

  • Model converts from source framework (PyTorch/TensorFlow) to CoreML without errors (test ONNX intermediate step)
  • Quantization target verified: int8 for most cases, int4 only if size-critical and accuracy loss acceptable
  • Memory pressure tested on lowest-spec target device (e.g., iPhone with 6GB RAM, not just MacBook Pro)
  • Thermal throttling handled: app degrades gracefully when device is hot (check ProcessInfo.processInfo.thermalState)
  • OOM fallback strategy implemented: app doesn’t crash if model can’t load (fall back to cloud or cached results)
  • Token streaming implemented for LLM outputs (don’t freeze UI waiting for full generation)
  • Model loaded once as singleton, not re-created per inference call

Integration Checks

  • On-device vs cloud routing logic implemented (local for fast/private tasks, cloud for complex reasoning)
  • Privacy model documented: confirmed no user data leaves device unless explicitly permitted
  • Model versioning strategy defined: how to update the CoreML model without requiring a full app update

Common Failure Modes

  • Conversion failure at ONNX step: Model uses unsupported dynamic operations. Fix: use torch.jit.trace instead of torch.jit.script, or simplify forward pass to remove conditionals.
  • Neural Engine not used: Large LLMs with dynamic shapes route to CPU/GPU instead. Fix: expected behavior for LLMs; Neural Engine benefits CNNs and fixed-shape transformers. Profile with Instruments to confirm.
  • Silent accuracy degradation after quantization: int4 model produces nonsensical output. Fix: validate output quality on 50+ test inputs before shipping; use int8 if int4 quality is insufficient.
  • App rejected for bundle size: CoreML model exceeds App Store limits. Fix: host model on-demand via Background Assets framework or use more aggressive quantization.

Sign-Off Criteria

  • Tested on real device (not just simulator) across the full user workflow
  • Latency, memory, and battery benchmarks recorded and within targets from the Performance Targets table
  • Hybrid on-device/cloud fallback tested: confirmed app works offline and recovers when connectivity returns
  • Error handling covers OOM, thermal throttling, and model load failure without crashing
  • Model accuracy validated against source framework output on 100+ representative inputs

See Also

  • Doc 03 (Hugging Face Ecosystem) — Find and evaluate models to convert to CoreML; quantization affects on-device performance
  • Doc 04 (Memory Systems) — Unified memory (Apple Silicon) affects memory architecture; on-device memory patterns differ from cloud
  • Doc 24 (Hardware Landscape) — Understand Apple Silicon architecture (unified memory, Neural Engine) for CoreML optimization
  • Doc 25 (Edge & Physical AI) — On-device inference is core to edge deployment; CoreML is the implementation path