Real-World AI Applications
Autonomous vehicles, robotics, industrial IoT, healthcare, recommendation systems — how AI agents are deployed in production across industries.
AI in the lab is clean, controlled, and reproducible. AI in production is messy, complex, and must work when the stakes are high. This chapter explores how AI systems actually get deployed across industries, the patterns that emerge, and the real engineering challenges that matter.
1. Autonomous Vehicles: A Complete System
The self-driving car is the most complex AI application ever deployed at scale. It’s not one model—it’s a tightly integrated system where every component has hard real-time constraints.
System Architecture
Perception — Making sense of the world.
The car sees through multiple sensors:
- Cameras (8-10 per vehicle): RGB data, 1920×1080, 30 fps. Neural networks detect cars, pedestrians, cyclists, lane markings, traffic lights.
- LIDAR (2-5 units): 360° point clouds at 10-20 Hz, up to 200m range. Detection networks output 3D bounding boxes.
- Radar (4-6 units): Works in rain/fog when cameras fail. Good for velocity estimation.
Perception runs multiple networks in parallel:
- Object detection (YOLO, EfficientDet variants): cars, pedestrians, cyclists, signs
- Lane detection (custom CNNs): lane markings, curbs, drivable area
- Traffic light state (classification network): red/yellow/green
- Semantic segmentation: road surface, obstacles, traversable space
These run on dedicated hardware (NVIDIA DRIVE AGX, Tesla's custom FSD chip) under hard real-time constraints. Latency budget: 50-100ms for the full pipeline. Miss the deadline and the car reacts to a pedestrian too late.
Sensor Fusion — Combining multiple modalities.
Raw detections from cameras, LIDAR, radar are independent. Fusion networks learn to combine them:
- LIDAR gives 3D position, camera gives appearance
- Radar gives velocity, confirms moving vs static
- Output: unified object list with position, velocity, classification, confidence
Localization — Where am I?
GPS alone is 5-10 meters of error. Not precise enough.
- HD Maps: Pre-built maps with centimeter-level detail (lane boundaries, curb positions, traffic rules per lane)
- Map Matching: GPS position snapped to nearest road, then refined
- IMU + Odometry: Integrate accelerometer and wheel odometry between GPS fixes
- Particle Filter: Fuse GPS, map, and motion to get 10cm accuracy
Localization must be continuous. Even 1 second of uncertainty is dangerous.
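The fuse step can be sketched as a particle filter in one dimension; the noise scales, particle count, and constant-velocity motion below are illustrative assumptions, not production parameters.

```python
import math
import random

def particle_filter_step(particles, motion, gps, gps_sigma=5.0, motion_sigma=0.1):
    """One predict/update/resample cycle of a 1-D particle filter."""
    # Predict: move every particle by the odometry reading plus motion noise.
    particles = [p + motion + random.gauss(0, motion_sigma) for p in particles]
    # Update: weight each particle by how well it agrees with the GPS fix.
    weights = [math.exp(-0.5 * ((p - gps) / gps_sigma) ** 2) for p in particles]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Resample: draw a new particle set proportional to the weights.
    return random.choices(particles, weights=weights, k=len(particles))

random.seed(0)
particles = [random.gauss(0, 10) for _ in range(1000)]  # initial uncertainty
for t in range(50):                                     # drive 1 m per step
    particles = particle_filter_step(particles, motion=1.0, gps=t + 1.0)
estimate = sum(particles) / len(particles)              # position estimate
```

The posterior concentrates far below the raw 5m GPS error because every step fuses odometry with the fix; production stacks do the same in 3-D with HD-map constraints added as extra measurement terms.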
Prediction — What will they do?
Other agents (vehicles, pedestrians, cyclists) don’t stand still. Prediction networks forecast 3-8 seconds ahead:
- Trajectory Models (LSTM, transformer variants): given past positions and velocities, predict future
- Interaction Models: pedestrian behavior depends on nearby cars
- Uncertainty Quantification: not a single prediction, but a distribution (multimodal—multiple likely futures)
Critical insight: at intersections, vehicles have many possible futures (turn left, right, straight). Prediction networks must capture all of them.
Planning — Safe path given predictions.
Given my position, other agents’ predictions, and traffic rules, compute a safe trajectory:
- Search (hybrid A*, RRT): generate candidate paths
- Cost Function: collision risk, comfort (acceleration), efficiency, traffic rules
- Constraint Satisfaction: can’t accelerate infinitely, must obey lane boundaries
- Output: desired position and velocity for next 100ms
Planning runs at 10 Hz. The car re-plans continuously as new information arrives.
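The generate-then-score loop above can be sketched in a few lines; the cost weights, Manhattan-distance clearance metric, and straight-line candidates are simplifying assumptions.

```python
def plan(candidates, obstacles, w_clear=1.0, w_dev=0.5):
    """Pick the candidate path with the lowest weighted cost."""
    def cost(path):
        # Collision term: grows sharply as the path nears any obstacle.
        clearance = min(abs(px - ox) + abs(py - oy)
                        for px, py in path for ox, oy in obstacles)
        collision = w_clear / (clearance + 0.1)
        # Comfort term: penalize lateral deviation from the lane center (y = 0).
        deviation = w_dev * sum(abs(py) for _, py in path) / len(path)
        return collision + deviation
    return min(candidates, key=cost)

# Three candidates: hold the lane, nudge left, nudge right.
straight = [(float(x), 0.0) for x in range(10)]
left = [(float(x), -1.5) for x in range(10)]
right = [(float(x), 1.5) for x in range(10)]
best = plan([straight, left, right], obstacles=[(5.0, 0.0)])  # obstacle ahead
```

With an obstacle in the lane, a swerve candidate wins despite its comfort penalty; re-run at 10 Hz with fresh predictions, this is the skeleton of receding-horizon planning.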
Control — Execute the plan.
A planning output of “go to (10.5, 20.3) at 5 m/s” must translate to steering angle and throttle:
- Lateral Control (steering): PID controller tracks desired path
- Longitudinal Control (throttle/brake): PID controller tracks desired speed
- Actuator Limits: steering has max angle, acceleration has limits
- Latency Compensation: control signal takes ~50ms to affect wheels, model predicts ahead
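Both control loops are classical PID with output clamping for the actuator limits. A minimal longitudinal controller, with made-up gains and a toy first-order throttle-to-acceleration plant:

```python
class PID:
    """PID controller with output clamping to respect actuator limits."""
    def __init__(self, kp, ki, kd, out_min=-1.0, out_max=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measured, dt):
        error = setpoint - measured
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * deriv
        return max(self.out_min, min(self.out_max, out))  # actuator limit

# Track a 5 m/s speed target against an assumed toy vehicle model.
pid, speed = PID(kp=0.8, ki=0.2, kd=0.05), 0.0
for _ in range(200):                        # 20 s of control at 10 Hz
    throttle = pid.step(setpoint=5.0, measured=speed, dt=0.1)
    speed += throttle * 0.5                 # assumed plant dynamics
```

The integral term removes steady-state error; the derivative term damps overshoot, which is one crude form of compensating for actuation lag.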
Real-World Complexity
The lab has clean scenarios. The road doesn’t.
Weather: Rain reduces camera quality and LIDAR range. Snow covers lane markings. Fog blocks everything. Models trained on sunny California don’t work in Seattle.
Shadows and Reflections: A shadow looks like a pothole. Rain on the camera lens creates artifacts. Specular reflections off wet roads confuse perception.
Occlusion: A parked car hides a pedestrian 20 meters ahead. Prediction must infer unseen agents.
Novel Scenarios: A construction worker in a pink tutu is not in the training set. A child chasing a ball into traffic is a tail case. The long tail of edge cases is infinite.
Adversarial Inputs: Can a sticker on a stop sign fool the perception system? (Yes: adversarial examples were demonstrated in the lab by Szegedy et al. 2014, and physical stop-sign attacks by Eykholt et al. 2018. Harder in the real world, but possible.)
Map Errors: HD maps are months old. Construction changed the road. A lane is missing. Planning must handle map uncertainty.
Current State of the Art
Tesla (Full Self-Driving, Autopilot):
- Vision-only: 8 cameras, no LIDAR or radar
- Approach: if you have enough cameras and neural networks, LIDAR is redundant
- Reality: works well on highways, struggles with complex urban scenarios
- Scale: millions of cars collecting data daily, continuous model improvement
- Advantage: lower cost, simpler hardware
Waymo (Waymo Driver):
- Full stack: cameras, LIDAR, radar, HD maps
- Approach: multimodal fusion, sophisticated prediction and planning
- Operations: fully autonomous (no human safety driver) in Phoenix, San Francisco, Los Angeles
- Scale: hundreds of thousands of rides, good operational data
- Advantage: more sensors → more redundancy, higher confidence
Others (Cruise, Aurora, Motional):
- Most use hybrid approaches (cameras + LIDAR)
- Regional focus: Phoenix for Waymo, San Francisco for Cruise
- Challenge: getting to >99.99% reliability for regulatory approval
Challenges
Edge Cases: Training data contains 99.9% highway driving. The 0.1% of interesting edge cases dominate actual failures. Collecting enough examples of rare scenarios takes years.
Long Tail: Once you solve the 80% case, the remaining 20% takes 80% of effort. Moving from 95% to 99% to 99.9% reliability requires exponential data collection and engineering.
Regulatory Approval: Proving safety to regulators requires showing the system is safer than humans. How do you measure the safety of a system deployed on only a few thousand cars? Waymo's answer: constrain the operational design domain (geofenced areas, mapped routes, favorable weather) and accumulate millions of real and simulated miles.
Integration Complexity: Hundreds of neural networks, classical algorithms, and real-time constraints all interacting. One bug in the planning module causes a collision. Testing is hard; simulation helps but doesn’t catch everything.
Hardware Constraints: All this must run on a car’s compute platform, in real time, with <100ms latency, within a tight power budget. An NVIDIA DRIVE AGX platform delivers on the order of a few hundred TOPS, and a single perception model can consume a large share of it. No room for inefficiency.
2. Robotics Ecosystems: From SLAM to Grasping
Robots perceive, plan, and act. Unlike self-driving cars (which move in one plane), robots are 3D agents manipulating complex environments.
Navigation: SLAM
Before a robot can plan, it must know where it is and what the world looks like.
SLAM (Simultaneous Localization and Mapping):
- Robot has no prior map
- Uses cameras and/or LIDAR to observe the environment
- Builds a map incrementally while estimating its own position
- Classical approach: visual odometry (track features between frames) + loop closure detection (recognize when you’ve seen a place before)
- Modern approach: neural networks learn visual features that are robust to viewpoint and lighting changes
Warehouse robots (Amazon Robotics, Fetch, Mobile Industrial Robots):
- Environment: known warehouse with fixed layout
- Approach: use pre-built floor plans, no SLAM needed
- Precision required: 5cm accuracy to dock with shelves
- Method: wheel odometry + occasional global localization (magnetic markers or barcode landmarks)
Legged robots (Boston Dynamics Spot):
- Challenge: rough terrain, uneven ground, dynamic balance
- Approach: IMU + foot force sensors + visual odometry
- Real-time constraint: balance control runs at 1kHz, must react in milliseconds to terrain
- Navigation: SLAM on visual features + terrain classification (avoid obstacles, assess traversability)
Manipulation: Object Detection to Grasping
A robot arm is useless without knowing what to grasp and where.
Object Detection:
- Camera mounted on gripper or body observes the scene
- CNN detects objects: boxes, bottles, books (depends on task)
- 3D localization: use LIDAR or stereo to get 3D position
Grasp Planning:
- Given object position and 3D model, compute grasping poses
- Classical: test 1000 candidate grasps, choose best by physics simulation
- Learning: CNN predicts grasp quality directly from image (faster, but less generalizable)
- Key challenge: friction varies (wet surface vs dry), objects deform (soft materials), gripper shape matters
Placement:
- Once grasped, robot must place object somewhere
- Target: shelf, bin, conveyor belt (depends on task)
- Constraints: don’t hit obstacles, don’t drop on fragile items
- Use path planning (RRT, probabilistic roadmaps) in 6D space (3D position + 3D orientation)
Real-World Challenges:
- Objects are soft or rigid, heavy or light—physics varies
- Suction cups fail on porous materials, parallel grippers fail on cylinders
- Small changes in friction cause grasp failures
- No perfect 3D models in real warehouses; objects are jumbled together
Learning from Demonstration
Robots can learn from human examples (imitation learning):
- Human demonstrates grasping a bottle
- Camera records RGB + depth during demo
- Train neural network: image → robot joint commands
- Deploy: robot executes learned behavior
Challenge: distribution shift. Human moves smoothly, robot is jerky. Human understands physics intuitively, robot doesn’t. Network memorizes the demo but fails on new scenarios.
Reinforcement Learning for Robotics
Trial and error in the real world is expensive (destroy the robot). Use simulation:
- Train in simulation (PyBullet, Mujoco) with randomized physics
- Transfer to real robot via domain randomization (make sim sufficiently random that real world is just another variant)
Example: Google’s large-scale grasping experiments (Levine et al. 2016). More than a dozen robotic arms collect grasping attempts in parallel, each learning from its own experience. Data is aggregated across robots, the model retrained, and redeployed. The cycle runs continuously. Within months, the system could grasp most objects it had never seen.
Challenge: simulation doesn’t match reality perfectly. Systematic differences (friction coefficient) cause sim-to-real gap. Domain randomization helps but isn’t perfect.
Future: General-Purpose Robots
Current robots are task-specific:
- Amazon Robotics arms: grasp boxes only
- Dishwasher robots: load dishes only
- Welding robots: weld only
Future vision: Foundation models for robotics. Similar to how GPT-3 is “general” (can do many NLP tasks), train a single model on data from thousands of robots doing diverse tasks. Adapt via prompt or fine-tuning to new tasks.
Current state: OpenVLA (an open-source vision-language-action model) and RT-1 (Robotics Transformer) are early attempts. Still far from “one model for all robotics.”
3. Industrial IoT & Predictive Maintenance: Sensing Failure Before It Happens
In a manufacturing plant, an unexpected machine failure costs $100k+ per hour (lost production, labor, parts). Predictive maintenance uses ML to schedule maintenance before failure occurs.
Data Pipeline
Machines have sensors: vibration accelerometers, temperature, acoustic emission, current draw.
- Vibration sensor on a bearing: samples at 10kHz, records time series
- Temperature sensor: thermocouples, 1 Hz sample rate
- Current sensor: measures motor current draw, indicates load and efficiency
Data flows:
- Sensors → edge device (industrial computer on factory floor)
- Edge device → local buffer (SQL database)
- Buffer → cloud analytics (batch processing every night)
- Analytics → alerting (schedule maintenance for day X)
Models: LSTM and GRU
Predictive maintenance is a time series forecasting problem:
- Input: 30 days of historical sensor data
- Output: time until failure (0-365 days)
RNN architectures:
- LSTM (Long Short-Term Memory): learns long-range dependencies, good for slow degradation (bearing wear over months)
- GRU (Gated Recurrent Unit): simpler than LSTM, similar performance, less compute
- 1D Convolutions: simpler and faster, competitive with RNN for some tasks
Example model (Keras; assumes windows of 30 daily aggregates from 10 sensors):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential([
    LSTM(128, input_shape=(30, 10)),  # 30 days, 10 sensors
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dense(1)                          # output: days to failure
])
model.compile(optimizer='adam', loss='mse')
Alerting Strategy
A model outputs: “bearing will fail in 48 hours.”
Question: do you schedule maintenance now?
Cost function:
- Schedule too early: wasteful (spend money replacing part that still works)
- Schedule too late: catastrophic (machine fails, expensive downtime)
- Don’t schedule at all: disaster
Most plants use a decision rule:
- If predicted time to failure < 7 days: alert maintenance crew
- Crew schedules replacement within 7 days
- Hedge: run bearing 7 days with increased monitoring
Data Collection and Baseline
Collecting sufficient training data is the hard part.
A healthy bearing runs for 2-3 years. A failed bearing is the last day. To build a model:
- Collect data from 50 machines for 6 months
- Hopefully observe 5-10 failures
- Lots of healthy data, few failure examples
Imbalanced dataset: 99.9% healthy, 0.1% failure. Standard techniques fail (model just predicts “healthy” always).
Solution: oversampling failures, weighted loss functions, or anomaly detection (model learns normal, flags deviations).
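Both fixes are a few lines each. A sketch of inverse-frequency class weights and minority oversampling; the 999:1 split below mirrors the 99.9%/0.1% imbalance described above.

```python
import random
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights, usable as per-class loss multipliers."""
    counts = Counter(labels)
    n = len(labels)
    return {y: n / (len(counts) * c) for y, c in counts.items()}

def oversample(samples, labels):
    """Resample every class (with replacement) up to the majority-class size."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    return [(random.choice(xs), y) for y, xs in by_class.items()
            for _ in range(target)]

labels = ['healthy'] * 999 + ['failure'] * 1
w = class_weights(labels)     # failure errors weighted ~1000x heavier
balanced = oversample(list(range(1000)), labels)
```

The weight dict plugs directly into most frameworks' weighted-loss hooks; oversampling is the cruder alternative and risks overfitting the few failure examples.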
Real-World Deployment
Sensors fail: a vibration sensor disconnects, temperature readout drifts, current measurement becomes noisy. Model must handle missing data, outliers, and sensor drift.
Fallback Logic:

if model_confidence < threshold:
    alert = rule_based_alerting()
else:
    alert = model_prediction()
Rules: if vibration suddenly spikes 10x normal, alert immediately (don’t wait for model).
Economics
A company with 100 industrial machines:
- Current approach: reactive maintenance (fix when broken) costs $50k/month in downtime
- Predictive maintenance: costs $20k/month (scheduled, no surprises)
- Cost of ML system: $200k initial, $20k/year operations
- Payback period: 4 months
- Once the payback is proven, there is strong motivation to deploy everywhere
4. Smart Power Grids: Real-Time Optimization Under Uncertainty
A power grid must balance supply and demand in real time. Too much demand and the grid frequency drops (blackout risk). Too little and generators are overprovisioned (waste).
AI adds flexibility, cost optimization, and resilience.
Load Forecasting
Predict electricity demand hours/days ahead so generators can ramp up in time.
Factors:
- Time of day: peak at breakfast and evening
- Day of week: weekday vs weekend
- Season: summer (air conditioning) vs winter (heating)
- Weather: temperature, cloud cover, humidity
- Events: sports game, concert, holiday
Model:
- Inputs: past 2 weeks of demand, weather forecast, calendar features
- Output: demand (MW) for next 1, 6, 24 hours ahead
- Architecture: GRU or Transformer, multiple output heads for different horizons
For a large utility (50M customers), a 1% error in load forecast translates to millions of dollars of wasted generation or shortage risk.
Real-world challenge: Renewable penetration introduces variability. Solar output depends on cloud cover (hard to predict). Wind is intermittent. Battery storage adds flexibility but must be charged/discharged strategically.
Anomaly Detection
Grid attacks (physical or cyber) cause unusual patterns: a major line failure drops demand 20% instantly.
Approach:
- Baseline: normal grid behavior patterns (learned from months of data)
- Monitor: real-time power flows, voltage, frequency
- Alert: when measurements deviate from baseline (Z-score > 3σ)
Example: a distribution line is attacked (cut). Current suddenly drops. Voltage becomes unstable. Within milliseconds, anomaly detection flags it. Engineers can reroute power via alternate lines to avoid cascading failure.
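A rolling-baseline detector of this kind fits in a short class. The window size, warm-up length, and 3σ threshold below are illustrative, and the `statistics` module stands in for a streaming analytics engine.

```python
import statistics
from collections import deque

class ZScoreDetector:
    """Flag readings more than k standard deviations from a rolling baseline."""
    def __init__(self, window=100, warmup=10, k=3.0):
        self.history = deque(maxlen=window)
        self.warmup, self.k = warmup, k

    def update(self, value):
        anomalous = False
        if len(self.history) >= self.warmup:
            mean = statistics.mean(self.history)
            std = statistics.pstdev(self.history)
            anomalous = std > 0 and abs(value - mean) > self.k * std
        self.history.append(value)
        return anomalous

det = ZScoreDetector()
flags = [det.update(50.0 + 0.5 * (i % 3)) for i in range(100)]  # normal flow
tripped = det.update(10.0)  # sudden 80% drop on the line
```

In practice each sensor stream gets its own baseline, and seasonal patterns (daily demand cycles) are modeled rather than lumped into the rolling window.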
Fault Localization
When a line fails, operators must identify which line to repair.
Traditional approach: manual investigation, walk the line to find the break (hours of time).
ML approach:
- Measure voltages and currents at multiple points in the grid
- Train model: electrical measurements → which line failed
- Use graph neural networks (grid is naturally a graph) or physics-informed neural networks
Given measurements from 100 sensor points, classify which of 1000 possible faults occurred.
Real-time constraint: decision within seconds, before cascading failures occur.
Optimization: Microgrids
A microgrid is a small grid (neighborhood, university campus) with distributed generation (solar, wind, batteries) that can operate independently or connected to the main grid.
Optimization problem:
- Many solar panels (variable output)
- Many batteries (can store or release)
- Many consumers (variable demand)
- Goal: minimize cost (buy from grid when cheap, sell back when expensive), or maximize renewables (use local solar first)
This is a unit commitment problem (which generators should be on?) and economic dispatch (how much power from each generator?).
Real-time approach:
- Every 5 minutes, forecast next 1 hour of solar and demand
- Optimize: battery charge/discharge schedule, load shedding if needed
- Execute: send setpoints to inverters and controllers
- Repeat: reoptimize as new forecasts arrive
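Real microgrid controllers solve a constrained optimization, but a greedy sketch conveys the shape of the loop. The capacities, forecasts, and store-surplus/cover-deficit rule below are illustrative assumptions.

```python
def dispatch(solar_forecast, demand_forecast, capacity=50.0, charge=25.0):
    """Greedy per-interval dispatch: charge on surplus, discharge on deficit.

    Returns grid imports per interval (power bought from the main grid).
    """
    imports = []
    for solar, demand in zip(solar_forecast, demand_forecast):
        net = demand - solar                        # positive means deficit
        if net < 0:                                 # solar surplus
            charge += min(-net, capacity - charge)  # absorb into the battery
            imports.append(0.0)
        else:                                       # deficit
            released = min(net, charge)             # battery covers what it can
            charge -= released
            imports.append(net - released)          # remainder bought from grid
    return imports

# Noon surplus charges the battery; the evening peak drains it, then imports.
grid = dispatch(solar_forecast=[30, 40, 5, 0], demand_forecast=[20, 20, 25, 60])
```

A real controller would optimize over the whole forecast horizon at once (so the battery saves capacity for the evening peak instead of reacting interval by interval), typically as a linear or mixed-integer program.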
With high renewable penetration, this optimization becomes critical. Too much solar at noon (demand is moderate) means batteries must absorb the surplus. Too little at 6pm (demand peaks, sun sets) means grid must supply power (expensive). Planning ahead avoids this mismatch.
Real-World Complexity
Grid modernization is gradual. Utilities have decades-old infrastructure:
- Measurement infrastructure is sparse (few sensors in the grid)
- Communication is slow (SCADA systems, not real-time APIs)
- Integration with legacy systems requires months of engineering
- Regulatory approval: grid reliability is critical, must pass extensive testing
A utility can’t deploy a new algorithm without months of validation. Failures have real consequences (blackouts).
Current state: some utilities deploy load forecasting (well-proven, low risk). Optimization and anomaly detection are pilot programs.
5. Healthcare Applications: When AI Decisions Impact Lives
AI in healthcare has regulatory, ethical, and practical complexities beyond most domains.
Diagnostics: Radiology
Chest X-ray analysis:
- Radiologist reviews image, looks for pneumonia, tuberculosis, pneumothorax, nodules
- Highly variable: image quality, patient body composition, prior images matter
- Inter-rater variability: two radiologists disagree 10-20% of the time
AI approach:
- Train CNN (ResNet, DenseNet) on 100k+ labeled chest X-rays
- Model learns: nodule appearance, Kerley B lines (a sign of pulmonary edema), pleural effusion
- Deployment: radiologist uploads image, model outputs predictions + attention map (highlights abnormal regions)
- Workflow: model assists, doesn’t replace (radiologist makes final call)
FDA Approval (FDA clearance pathway):
- Phase 1: validate on internal dataset (does model work?)
- Phase 2: validate on external dataset from different hospitals (does it generalize?)
- Phase 3: prospective clinical trial (does it improve patient outcomes?)
- Submission to FDA: provide validation data, intended use, failure modes
- Review: takes 6-12 months for FDA clearance
Current state: Multiple radiology AI systems have FDA clearance (products from Siemens, GE Healthcare, and startups such as Aidoc and Lunit; Stanford’s CheXpert is a widely used benchmark dataset rather than a cleared product). Most are assistive (help the radiologist decide), not autonomous.
Challenges:
- Labeling: need thousands of images labeled by expert radiologists (expensive, takes months)
- Shifts: model trained on modern equipment, deployed on older machines → performance drops
- Rare diseases: model trained on common cases (pneumonia, TB) fails on rare conditions (silicosis)
- Legal liability: if model-assisted diagnosis is wrong and patient harmed, who’s liable?
Wearables and Continuous Monitoring
Heart rate variability → Arrhythmia:
- Smartwatch measures heart rate continuously (100 Hz sampling)
- LSTM detects irregular patterns: atrial fibrillation (Afib) is an irregular heart rhythm
- User gets alert: “irregular rhythm detected, consult doctor”
Challenge: false positives. Watch detects motion artifact (user moving, not heart issue). Too many false alarms and user ignores alerts (alert fatigue).
Approach: increase model specificity, add confirmation rules (must see pattern for >60 seconds), use multimodal data (combine HR with motion data to rule out artifact).
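The duration rule is simple debouncing. A sketch, assuming one rhythm classification every 5 seconds and the 60-second persistence rule described above:

```python
def confirmed_alerts(flags, interval_s=5, min_duration_s=60):
    """Alert only when the irregular-rhythm flag persists for min_duration_s."""
    needed = min_duration_s // interval_s   # consecutive positives required
    run, alerts = 0, []
    for flag in flags:
        run = run + 1 if flag else 0        # reset the streak on any negative
        alerts.append(run >= needed)
    return alerts

blip = confirmed_alerts([True] * 6 + [False] * 10)  # 30 s artifact: suppressed
episode = confirmed_alerts([True] * 14)             # 70 s episode: alerts
```

This trades sensitivity for specificity by design: a genuine but very short episode is also suppressed, which is acceptable when the cost of alert fatigue is high.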
Apple Watch received FDA clearance for atrial fibrillation detection in 2018.
Drug Discovery
Developing new drugs:
- Identify disease target (protein that causes disease)
- Compound screening: test millions of molecules, find ones that bind to target (inhibit the disease protein)
- Validation: test that the compound works in cells, then animals, then humans
- Trials: prove to FDA that drug works and is safe (costs $2B, takes 10+ years)
AI’s role: accelerate step 2 (compound screening).
- Traditionally: test 1000s of molecules in lab manually (months)
- With AI: train generative model on known active compounds, generate new candidates predicted to work, prioritize by ML ranking model, test top 10 candidates (weeks)
Models:
- Graph neural networks: molecules are graphs (atoms as nodes, bonds as edges)
- Variational autoencoders: learn latent space of molecules, generate new ones by sampling latent space
- Transformer models: treat molecule SMILES strings as sequences, use language models to generate candidate molecules
Success story: DeepMind’s AlphaFold (protein structure prediction). Knowing protein 3D structure is crucial for drug design. AlphaFold predicts structure from amino acid sequence. Previously took years of X-ray crystallography; AlphaFold does it in seconds. Deployed via public database, free for researchers. Impact: accelerated drug discovery, structural biology.
Challenges Specific to Healthcare
Privacy and Compliance (HIPAA):
- Patient data is sensitive
- Models trained on patient data must respect privacy
- Real approach: train locally on hospital servers, don’t send data to cloud
- Alternative: federated learning (train model across hospitals without sharing raw data)
Regulatory Approval:
- FDA requires extensive validation before use
- Every new version of the model must be revalidated
- Slows innovation (AI companies are used to weekly model updates; healthcare allows monthly or quarterly at best)
Fairness:
- Training data is often biased (fewer examples of rare diseases, certain demographics underrepresented)
- Model learns: “this demographic has low risk of disease” (statistical artifact, not reality)
- Real harm: disease missed in underrepresented group due to model bias
- Solution: audit for fairness, balance training data, monitor performance per demographic in production
Explainability:
- Doctor needs to understand why model flagged a case
- “Neural network says abnormal” doesn’t help if radiologist can’t see what the network saw
- Attention mechanisms, saliency maps help: “network focused on upper left lobe” → doctor looks there
- Challenge: saliency maps can be misleading or gamed
6. Recommendation Systems: The Economics of Predicting What You’ll Like
Netflix, Amazon, and Spotify are fundamentally ML companies. Their business model is: predict what users will like, recommend it, users engage more, more ad revenue or subscriptions.
Collaborative Filtering
Insight: Users with similar tastes should like similar content.
Matrix factorization:
- Matrix: M[user, item] = rating (1-5 stars)
- Problem: matrix is sparse (each user rates <1% of items)
- Solution: factorize into two low-rank matrices: M ≈ U × V
- U: user latent factors (user vectors in 100D space)
- V: item latent factors (item vectors in 100D space)
- Prediction: M[u, i] ≈ U[u] · V[i] (dot product of user and item vectors)
Training: gradient descent to minimize ||M - U×V||² over observed entries.
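That training loop, in miniature. The toy ratings, rank k=8, and hyperparameters below are illustrative; production systems use ALS or sampled SGD over billions of entries.

```python
import random

def factorize(ratings, n_users, n_items, k=8, lr=0.02, reg=0.05, epochs=300):
    """SGD on squared error over the observed entries of the ratings matrix."""
    random.seed(0)
    U = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(U[u][f] * V[i][f] for f in range(k))
            for f in range(k):  # gradient step with L2 regularization
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

# (user, item, rating) triples: users 0 and 1 like item 0, dislike item 1.
ratings = [(0, 0, 5), (0, 1, 1), (1, 0, 4), (1, 1, 1), (2, 1, 5)]
U, V = factorize(ratings, n_users=3, n_items=2)
predict = lambda u, i: sum(U[u][f] * V[i][f] for f in range(len(V[i])))
```

After training, a dot product of learned vectors reconstructs observed ratings and, more importantly, fills in the unobserved ones.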
Challenge: cold start. New users with no history. New items with no ratings.
- Solution: ask new user for ratings (bootstrap), or infer from demographics, or use content-based approach
Content-Based Filtering
Insight: Items similar in content should appeal to the same users.
- Netflix has metadata for each movie: genre, actors, director, year
- Compute similarity between movies (cosine similarity in feature space)
- If user liked “The Matrix,” recommend similar movies (sci-fi, special effects, Neo-like protagonist)
Advantage: no cold start for new items (you have metadata). Disadvantage: limited serendipity (always recommend similar movies, user never discovers new genres).
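The similarity computation is a one-liner over metadata vectors. A toy version with hand-made genre features (real systems embed metadata with learned models):

```python
import math

def cosine(a, b):
    """Cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Illustrative feature vectors: [sci-fi, action, romance, comedy].
catalog = {
    "The Matrix":   [1.0, 0.9, 0.1, 0.0],
    "John Wick":    [0.0, 1.0, 0.1, 0.1],
    "Notting Hill": [0.0, 0.0, 1.0, 0.8],
}
liked = catalog["The Matrix"]
ranked = sorted(catalog, key=lambda t: cosine(catalog[t], liked), reverse=True)
# ranked[1] is the nearest title the user hasn't already seen
```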
Serendipity vs. Accuracy
Pure accuracy: recommend movies user will definitely watch → engagement up, revenue up.
But: user gets bored (always same genre). Recommendation engines are too predictable.
Serendipity: recommend something unexpected but relevant → user discovers new genres, long-term engagement stays high.
Trade-off:
- Explore-exploit: 80% exploitation (recommend what user will like), 20% exploration (recommend novel items)
- Bandit algorithms: formalize the trade-off mathematically
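The simplest bandit, epsilon-greedy, hard-codes the 80/20 split above. A sketch with hypothetical click-through rates and a running-average value estimate:

```python
import random

def epsilon_greedy(estimates, epsilon=0.2):
    """Explore a random item 20% of the time, else exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])

random.seed(1)
true_ctr = [0.2, 0.5, 0.8]              # unknown to the algorithm
estimates, counts = [0.0] * 3, [0] * 3
for _ in range(5000):
    arm = epsilon_greedy(estimates)
    reward = 1.0 if random.random() < true_ctr[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
best = max(range(3), key=lambda i: estimates[i])
```

After a few thousand interactions the estimates converge and the best arm absorbs most of the traffic; UCB and Thompson sampling refine how the exploration budget is spent.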
A/B Testing at Scale
Change recommendation algorithm, measure impact:
- Control group: old algorithm
- Variant group: new algorithm
- Metric: did users engage more? (watch time for Netflix, click-through rate for Amazon)
With billions of users, even 0.1% improvement is significant.
But: 0.1% improvement is hard to detect statistically. Need large sample sizes.
Real process:
- Test new model offline (does it predict held-out ratings accurately?)
- Shadow mode: run new model but don’t use predictions (measure how it would perform if deployed)
- Canary: deploy to 1% of users, measure impact
- Ramp: gradually increase to 10%, 50%, 100% if impact is positive
Scale and Latency
Netflix has 250M users, 10k+ titles. A dense 250M × 10k ratings matrix would take terabytes of memory. Instead:
- Pre-compute recommendations offline: for each user, compute top 100 items (batch process, run nightly)
- Store in cache (Redis): lookup is instant
- Personalization: real-time factors (what user watched today) refine the cached recommendations
Latency budget: <100ms from user requesting page to recommendations appearing.
7. Natural Language Applications: Language Models in Production
From customer service chatbots to semantic search, NLP powers many applications.
Chatbots
Customer Service: user asks “how do I return this item?” → chatbot either answers directly or routes to human.
Approach:
- Intent classification: does message mean “return policy question”?
- Slot filling: extract parameters (what item? when purchased?)
- Response generation: template-based (“Here’s our return policy…”) or neural (generate response)
Real-world deployment:
- Template + rules: cheap, predictable, limited flexibility
- Fine-tuned LLM: GPT-3 with few-shot examples → flexible, handles novel questions
- Hybrid: use LLM for generation, but validate response before sending (check facts)
Challenge: out-of-scope questions. User asks “where does your CEO live?” (reasonable question, but not relevant to customer service). Model should say “I don’t know, please contact support.”
Summarization
Contract review: read 50-page legal document, extract key terms.
- Manual: lawyer spends 2 hours, costs $400
- AI: summarization model generates abstract in seconds
- Reality: abstract is missing nuances, lawyer still reviews, but faster
Approach:
- Extractive: copy sentences from document that are most important
- Abstractive: generate new sentences that capture the document
- Most successful: hybrid (extract the key sentences first, then feed them to an abstractive summarizer)
Translation
“Break the language barrier.”
Modern translation (Google Translate, DeepL):
- Neural machine translation: sequence-to-sequence model
- Encoder: reads source language sentence, builds representation
- Decoder: generates target language sentence from representation
- Real-time constraint: <1 second from uploading document to translated output
Real-world challenges:
- Domain-specific terms: “malware” in security vs “malware” in medicine (different translations)
- Cultural nuance: idioms don’t translate literally
- Ambiguity: “I saw the man with the telescope” → who has the telescope?
Current: neural MT is very good for high-resource language pairs (top WMT systems approach human parity on news translation). Rare language pairs still struggle.
Semantic Search
User searches “how to fix leaky faucet” → system must find relevant documents (YouTube videos, forum posts) even if documents don’t use exact keywords.
Traditional: keyword matching (fast, brittle to synonyms)
Modern: semantic search
- Index all documents: pass each through an embedding model (BERT-style encoders, OpenAI’s text-embedding-3 models), get a fixed-size vector (typically 384-3072 dimensions)
- User searches: embed query in same space
- Find closest document vectors (cosine similarity or vector DB)
- Return top K documents
Advantage: “fix water leak” matches “repair broken tap” even with no shared keywords.
Real deployment: vector database (Weaviate, Pinecone) stores millions of vectors, retrieval is <100ms.
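Retrieval itself is nearest-neighbor search over stored vectors. A brute-force sketch with stand-in 4-D embeddings; a real system uses a trained embedding model and an approximate index instead of this linear scan.

```python
import math

def top_k(query_vec, index, k=2):
    """Rank documents by cosine similarity to the query embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    ranked = sorted(index, key=lambda doc: cos(query_vec, index[doc]),
                    reverse=True)
    return ranked[:k]

# Hypothetical document embeddings; a production index holds millions.
index = {
    "repair-broken-tap":  [0.9, 0.8, 0.1, 0.0],
    "garden-hose-setup":  [0.2, 0.1, 0.9, 0.1],
    "kitchen-remodeling": [0.1, 0.0, 0.2, 0.9],
}
results = top_k([0.8, 0.9, 0.0, 0.1], index)  # query: "fix water leak"
```

The query shares no keywords with the top document; the match comes entirely from proximity in embedding space.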
Sentiment Analysis
Review: “The product is amazing, highly recommend.” → Label: positive.
Review: “Broke after one week, terrible build quality.” → Label: negative.
Simple RNN or fine-tuned BERT does well. Real-world:
- Sarcasm: “Oh great, another bug” (negative despite positive words)
- Mixed: “Good product, terrible shipping” (mixed sentiment)
- Domain shift: model trained on product reviews fails on social media sentiment
Information Extraction
Document: “John Smith, age 30, employed by Acme Corp.” → Extract (person: “John Smith”, age: 30, company: “Acme Corp”)
Approaches:
- Rule-based: regex patterns (“age (\d+)”)
- Sequence tagging: BIO tagging (Begin, Inside, Outside tags for each entity type)
- Generative: prompt LLM (“extract person name, age, company from this text”)
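The rule-based approach for the example above takes a few regexes. The name and company patterns are naive illustrations; real systems use sequence taggers precisely because such rules break on messy text.

```python
import re

def extract(text):
    """Regex-based extraction of (person, age, company) from free text."""
    person = re.match(r"([A-Z][a-z]+ [A-Z][a-z]+)", text)
    age = re.search(r"age (\d+)", text)
    company = re.search(r"employed by ([A-Z]\w*(?: [A-Z]\w*)*)", text)
    return {
        "person": person.group(1) if person else None,
        "age": int(age.group(1)) if age else None,
        "company": company.group(1) if company else None,
    }

record = extract("John Smith, age 30, employed by Acme Corp.")
```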
Deployment: APIs vs. Local
Cloud API (OpenAI, Anthropic):
- Pros: latest models, no hardware cost, easy to scale
- Cons: privacy (data sent to cloud), latency (network round-trip), cost per request
Local (Ollama, Hugging Face Transformers):
- Pros: privacy, offline capability, no per-request cost
- Cons: need GPU hardware, older models, scaling is manual
Real-world: enterprises with sensitive data run local. Startups use APIs.
8. Computer Vision Applications: Seeing the World
Beyond autonomous vehicles, computer vision powers many systems.
Object Detection
Example: retail shelf monitoring. Camera observes shelves, detects if products are out of stock.
Model: YOLO (You Only Look Once), Faster R-CNN
- Input: image
- Output: bounding boxes + class labels + confidence scores
Deployment:
- Edge device (camera on shelf) runs inference locally
- Sends alerts to store manager: “milk is out of stock, aisle 3”
- Latency: <100ms per image
Challenge: 10,000 different products. Model can’t be trained on all. Solution:
- Train on broad categories (milk, bread, water bottles)
- Use instance segmentation to distinguish individual products via appearance
- Augment with barcode scanning (ground truth for out-of-stock)
Face Recognition
Use cases: security (access control), photo organization, law enforcement
Technical: face detection + feature extraction + comparison
- Face detection: MTCNN, RetinaFace (find face in image)
- Feature extraction: ResNet, VGGFace (convert face to 128D vector)
- Comparison: compute distance between vectors (same person = small distance)
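Verification then reduces to thresholding a similarity score. A sketch with 4-D stand-ins for the 128-D face embeddings; the 0.8 threshold is an illustrative choice that a real system would tune on a validation set.

```python
import math

def same_person(vec_a, vec_b, threshold=0.8):
    """Decide identity match by cosine similarity of two face embeddings."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) >= threshold

enrolled = [0.9, 0.1, 0.3, 0.2]         # stored template
probe_same = [0.85, 0.15, 0.25, 0.2]    # same face, different pose
probe_other = [0.1, 0.9, 0.1, 0.7]      # different person
match = same_person(enrolled, probe_same)
mismatch = same_person(enrolled, probe_other)
```

The threshold sets the false-accept vs false-reject trade-off: raise it for access control, lower it for photo clustering.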
Real-world:
- Accuracy: ~99% on benchmark datasets, lower in the wild (lighting, angle, expression)
- Privacy concerns: mass surveillance, bias against minorities
- Regulatory: EU restricts real-time facial recognition by law enforcement
Segmentation: Pixel-Level Understanding
Semantic segmentation: label each pixel (road, building, tree, person)
Use case: autonomous vehicles. Knowing “that’s a road” is more useful than “there’s a car-shaped object.”
Instance segmentation: separate individuals (three people → three masks)
Models: U-Net (dense prediction), Mask R-CNN (combines detection + segmentation)
Pose Estimation
Detect human keypoints: head, shoulders, elbows, wrists, hips, knees, ankles.
Applications:
- Sports analysis: swing biomechanics, running form
- Fitness: is user doing squat correctly?
- Healthcare: physical therapy (did patient do prescribed exercises?)
Model: OpenPose, MediaPipe, PoseNet
- Input: video
- Output: 17 keypoints per person, 25 frames/sec
Document Understanding
OCR (Optical Character Recognition): extract text from images of documents.
Modern approach: end-to-end model (detect text regions + recognize characters in one pass)
Layout understanding: document has structure (title, paragraphs, tables). Models must preserve structure.
Use cases: digitize paper documents, extract information from invoices (date, amount, vendor).
Real-world: combination of OCR + NLP. OCR extracts text, NLP extracts structured information (date field, amount field).
Video Analysis
Frame-by-frame analysis is too slow (30 fps × video length = huge computation) and ignores temporal context.
Approach: 3D convolutions or recurrent models
- 3D CNN: convolve over time + space (learn temporal patterns)
- I3D (Inflated 3D): treat video as 3D data, scale up 2D models to 3D
Use cases:
- Activity recognition: is person playing soccer or swimming?
- Anomaly detection: unusual activity in surveillance footage
- Video understanding: summarize what happens in video
9. Financial Applications: AI Making High-Stakes Decisions
Finance involves regulatory oversight and adversarial actors. ML systems here are especially fraught.
Fraud Detection
Credit card transactions: millions per day, of which roughly 0.1% are fraudulent, so rare positives are buried in overwhelmingly legitimate traffic.
Approach:
- Baseline: user’s normal behavior (location, time of day, merchant type)
- Real-time model: is this transaction anomalous?
- If suspicious: extra verification (call user, require 2FA)
Models: random forest (interpretable, good for fraud), neural networks (higher accuracy, black box).
Real-world:
- False positive rate: flag too many legitimate transactions and users get frustrated
- False negative rate: miss fraud and customer is liable or bank loses money
- Trade-off: set threshold based on cost of error (cost of false positive vs. cost of false negative)
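Setting the threshold by cost of error can be done by brute force over candidate thresholds. The costs below are illustrative assumptions: a false positive annoys a customer, a false negative is a missed fraud and costs far more:

```python
def best_threshold(scores, labels, cost_fp=5.0, cost_fn=500.0):
    """scores: model fraud probabilities; labels: 1 = fraud, 0 = legitimate.
    Returns the threshold minimizing expected cost and that cost."""
    candidates = sorted(set(scores)) + [1.01]   # 1.01 = flag nothing
    best_t, best_cost = None, float("inf")
    for t in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```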
Challenges:
- Concept drift: fraud patterns change (criminals adapt to detection system)
- Continuous retraining: retrain model weekly to catch new patterns
- Explainability: if transaction is blocked, user wants to know why
Credit Scoring
Loan application: predict if applicant will repay.
Data: income, employment history, credit history, loan amount, purpose
Model: logistic regression (simple, interpretable) or gradient boosting (higher accuracy)
Regulatory: fair lending laws (e.g., the Equal Credit Opportunity Act) prohibit discrimination based on protected characteristics: race, gender, religion, national origin.
Problem: protected characteristics are correlated with wealth/income (historical injustice). Model trained on data learns the correlation.
Solution:
- Remove protected features from input (but correlated features might still encode protected info)
- Audit model for disparate impact (does model treat minorities differently?)
- Adjust decision thresholds to equalize approval rates across demographics
- Explicit fairness constraints in optimization
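A disparate-impact audit can start with approval rates per group. The four-fifths ratio below (flagging when the lowest group rate falls under 80% of the highest) is a common heuristic, not a legal determination:

```python
from collections import defaultdict

def approval_rates(decisions):
    """decisions: list of (group, approved_bool) pairs."""
    totals, approved = defaultdict(int), defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {g: approved[g] / totals[g] for g in totals}

def disparate_impact_ratio(decisions):
    """Ratio of lowest to highest group approval rate (four-fifths rule)."""
    rates = approval_rates(decisions)
    return min(rates.values()) / max(rates.values())
```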
Algorithmic Trading
High-frequency trading: make 1000s of trades per second, profit from small price differences.
Real-time constraint: decisions in about a microsecond, roughly five orders of magnitude faster than a human blink (~100 ms).
Approach:
- Model observes market microstructure (order book, recent trades, news)
- Predicts next price movement
- Places trades if profit expected exceeds cost
Challenges:
- Latency: colocate servers at exchange to reduce network latency
- Model risk: if the model’s prediction is wrong, losses are instant and compound across thousands of trades
- Regulatory: SEC oversees automated trading, limits certain strategies
- Adversarial: other traders are also sophisticated, model must outcompete them
Risk Management
Portfolio optimization: given 1,000 assets with estimated returns and risks, allocate capital to maximize return at an acceptable risk level.
Classical: Mean-variance optimization (Markowitz). Choose weights that minimize variance for target return.
Modern: use ML to predict asset correlations and returns, then optimize.
Challenge: predictions are uncertain. Model predicts Apple stock will return 10%, but uncertainty is ±20%. Optimization is sensitive to predictions; small change in prediction → large change in allocation.
Real approach: robust optimization (optimize for worst case within uncertainty bounds).
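As a toy instance of the mean-variance idea, the two-asset minimum-variance portfolio has a closed form. Real systems optimize over a full covariance matrix numerically, and robust variants optimize against the worst case in an uncertainty set:

```python
# Two-asset minimum-variance weights (closed form). sigma1/sigma2 are asset
# volatilities, rho is their correlation; inputs are illustrative.
def min_variance_weights(sigma1, sigma2, rho):
    """Return (w1, w2) minimizing portfolio variance for two assets."""
    cov = rho * sigma1 * sigma2
    w1 = (sigma2 ** 2 - cov) / (sigma1 ** 2 + sigma2 ** 2 - 2 * cov)
    return w1, 1.0 - w1
```

With uncorrelated assets the less volatile one gets the larger weight, which matches the intuition behind the full optimization.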
Compliance
AML (Anti-Money Laundering): detect suspicious patterns.
Models: graph neural networks to detect money laundering rings (suspicious transfers between accounts), or anomaly detection (unusual transfer amounts/frequencies).
Regulatory: banks must report suspicious activity. Model flags suspicious transactions, analysts review, escalate if confirmed.
10. Smart Home & IoT: Ambient Intelligence
Homes with 10-100 connected devices learning user preferences and automating comfort/security.
Occupancy Detection
Problem: is the house empty? If yes, turn off HVAC to save energy.
Sensors:
- Motion detectors (PIR sensors)
- Door/window sensors (entry points)
- Cameras (computer vision)
- WiFi network (devices connected?)
Model: ensemble combining multiple signals.
- If motion detected in last 10 min → occupied
- If door was unlocked and weather is cold → likely occupied
- If no devices on network → unoccupied
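The ensemble rules above as a sketch. The signal priorities and cutoffs are assumptions; a deployed system would learn them from labeled occupancy data:

```python
def is_occupied(minutes_since_motion, door_unlocked, outdoor_temp_c,
                devices_on_wifi):
    """Combine occupancy signals in priority order (illustrative rules)."""
    if devices_on_wifi == 0:
        return False                  # strongest "empty" signal
    if minutes_since_motion <= 10:
        return True                   # recent motion
    if door_unlocked and outdoor_temp_c < 5:
        return True                   # unlikely to leave door unlocked in cold
    return False
```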
Real-world: false negatives (model thinks empty when occupied) cause comfort loss. False positives (thinks occupied when empty) waste energy. Tune threshold based on preference.
Security
Intrusion detection:
- Window broken? (acoustic signature)
- Door forced? (motion + door sensor + time of day; an entry at 3 a.m. is suspicious)
- Unusual entry pattern? (timing, door order)
Models: anomaly detection (learn normal patterns, alert on deviations).
Real-world: false alarms desensitize users to alerts, yet a real intrusion must still be caught.
Advanced: face recognition at doorbell (is this person authorized to enter?).
Energy Optimization
Prediction: forecast next 24 hours of electricity demand and solar generation.
Optimization: schedule large appliances (water heater, EV charging) when solar production is high (peak midday).
Real-world:
- Weather forecast error → solar prediction error
- User changes patterns (travel unexpectedly) → demand prediction fails
- Optimization must be conservative (avoid running out of battery power)
Voice Control
Smart speakers (Alexa, Google Home, Siri) understand spoken commands.
Pipeline:
- Audio → speech-to-text (transcribe)
- Text → intent classification (turn on lights, adjust temperature)
- Intent → action (send command to light controller)
Privacy concern: device records audio, sends to cloud for processing. Some users uncomfortable.
Solution: on-device processing. Recent models are small enough to run on speaker hardware. Google Recorder does on-device speech recognition, and Alexa is moving the same way.
Privacy: On-Device Preferred
IoT devices contain intimate home data: when you’re home, what rooms you’re in, temperature preferences, security.
Approach: process locally, don’t send to cloud.
Trade-off: cloud models are more powerful (larger, better trained). Local models are smaller, less accurate, but private.
Real deployment: hybrid.
- On-device: basic functionality (voice commands, simple automation)
- Cloud: when needed (complex queries, learning patterns over time)
11. Real-World Deployment Patterns
How do ML systems actually get deployed?
Batch Processing
Pattern: process data once per day/hour, store results, serve from cache.
Example: predictive maintenance.
- Every night: run LSTM on 100 machines’ sensor data from past 30 days
- Output: maintenance schedule for next week
- Store in database
- Maintenance team queries database in morning
Advantages:
- Can use expensive models (nightly batch has time budget)
- Can recompute if needed (no real-time constraint)
- Cost is predictable (fixed compute, fixed time)
Disadvantages:
- Latency: if event happens at 11pm, notification comes 12 hours later
- No adaptation: schedule doesn’t change until next batch
Real-Time Streaming
Pattern: process events as they arrive, make decisions instantly.
Example: fraud detection.
- User swipes card
- In <100ms: run fraud model on transaction
- If suspicious: block or challenge
- Send response to card terminal
Advantages:
- Instant feedback (user knows immediately)
- Adapts to latest data
Disadvantages:
- Latency budget is tight (<100ms)
- Can’t afford expensive models (must be fast)
- Models must be simple + optimized
Technology: Kafka (event streaming), real-time ML frameworks (Seldon, BentoML), feature stores (Tecton, Feast).
Hybrid: Batch Retraining, Streaming Inference
Most large-scale systems use this:
- Batch: nightly retraining on accumulated data, produce new model
- Streaming: serve latest model to inference engine
- Real-time: send predictions to user instantly
Example: Netflix recommendation.
- Batch: nightly, retrain collaborative filtering on 24 hours of new watch data (too much data to update in real-time)
- Streaming: user logs in, lookup precomputed recommendations (fast, <1ms)
- Real-time: if new user, fallback to content-based recommendations
Fallback: When ML Fails
Pattern: always have a backup plan.
If ML model crashes, is slow, or gives nonsensical output → fallback to simpler method.
Example: autonomous vehicle.
- Primary: neural network planner (sophisticated)
- Fallback 1: reactive planner (stop if obstacles detected)
- Fallback 2: manual control (human takes over)
Example: recommendation engine.
- Primary: collaborative filtering
- Fallback 1: content-based recommendations
- Fallback 2: trending items (what’s popular today)
- Fallback 3: editorial picks (human curated)
Real-world deployment requires fallback chains. Distributed systems are fragile; assume something will fail.
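A fallback chain is a few lines of control flow. The sketch below mirrors the recommendation example, with a hypothetical `trending` function standing in for the last automated tier:

```python
def recommend(user_id, strategies):
    """Try each strategy in order; fall through on failure or empty output."""
    for strategy in strategies:
        try:
            result = strategy(user_id)
            if result:                # treat empty output as a soft failure
                return result
        except Exception:
            continue                  # log in production, then fall through
    return []                         # last resort: nothing to show

def trending(_user_id):
    """Static popular picks: always available, never personalized."""
    return ["item-42", "item-7"]
```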
Gradual Rollout: De-Risking Deployment
New model: unknown behavior on real data. Rollout strategy:
Shadow Mode:
- Run new model but don’t use predictions
- Log predictions for analysis
- Compare to old model: does new model agree?
- If not → investigate before rollout
Canary:
- Deploy new model to 1% of users
- Monitor metrics (latency, accuracy, crashes)
- If metrics good → increase to 10%
- Gradually ramp to 100%
Metrics to watch:
- Latency: did inference slow down?
- Accuracy: are predictions still good? (A/B test against old model)
- Crashes: does new model have bugs?
- Business metrics: did engagement/revenue change?
Example: Facebook tests feed ranking algorithm on 1% of users, measures engagement. If engagement up, ramp to more users. If down, investigate or revert.
In practice, rollouts take days to weeks. A bad model reaching all users could harm the business significantly.
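Canary assignment is typically a deterministic hash of the user ID, so each user lands in a stable bucket across sessions. A minimal sketch; the salt and bucket count are arbitrary choices:

```python
import hashlib

def in_canary(user_id: str, percent: float, experiment: str = "model-v2") -> bool:
    """Stable per-user bucketing: same user, same answer every time.
    Salting by experiment name keeps different rollouts independent."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000   # 0..9999, stable per user
    return bucket < percent * 100          # percent=1.0 -> buckets 0..99
```

Ramping from 1% to 10% to 100% is then just raising `percent`; users already in the canary stay in it.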
12. Common Challenges Across All Applications
Data Quality
Garbage in, garbage out. Most ML failures are data problems, not algorithm problems.
Issues:
- Labeling errors: human labels are wrong (inter-rater disagreement)
- Missing data: sensors fail, data gets lost
- Outliers: one extreme example skews training (one person spends $1M, model thinks everyone does)
- Data imbalance: 99.9% negative examples, 0.1% positive (rare events)
Detection:
- Visualize label distribution (is it realistic?)
- Check for duplicate examples (data leakage)
- Compare train/test label distribution (should be similar)
- Get multiple annotators (measure agreement, find inconsistencies)
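The first two checks can be automated cheaply. A sketch; the 1% imbalance threshold is illustrative:

```python
from collections import Counter

def quality_report(examples, labels):
    """Report label balance and exact-duplicate count for a dataset."""
    label_counts = Counter(labels)
    total = len(labels)
    rarest = min(label_counts.values()) / total
    duplicates = len(examples) - len(set(examples))
    return {
        "label_distribution": dict(label_counts),
        "imbalanced": rarest < 0.01,       # <1% minority class
        "duplicate_examples": duplicates,  # candidates for train/test leakage
    }
```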
Distribution Shift
Problem: model trained on data type X, deployed on data type Y. Performance drops.
Examples:
- Chest X-ray model trained on modern equipment, deployed on older machines → accuracy drops
- Fraud model trained on 2020 fraud patterns, deployed in 2024 when fraud techniques changed → model misses new patterns
- Autonomous vehicle model trained on California weather, deployed in Seattle → fails in rain
Detection:
- Monitor predictions over time: if distribution changes, flag alert
- Compare model accuracy on recent data vs. old data
- Use test-time adaptation: update model slightly on new data
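One common drift monitor is the Population Stability Index (PSI) over model scores: compare the histogram of recent predictions against a reference window. A sketch, assuming scores in [0, 1]; the 0.2 alert level is a rule of thumb, not a standard:

```python
import math

def psi(reference, recent, bins=10):
    """PSI between two score samples; 0 means identical distributions."""
    def histogram(values):
        counts = [0] * bins
        for v in values:
            counts[min(int(v * bins), bins - 1)] += 1
        n = len(values)
        return [max(c / n, 1e-6) for c in counts]   # avoid log(0)
    ref, cur = histogram(reference), histogram(recent)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

def drifted(reference, recent, threshold=0.2):
    return psi(reference, recent) > threshold
```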
Solution:
- Continuous retraining: retrain on new data monthly
- Domain adaptation: transfer learning from new domain
- Robustness: train on diverse data to generalize
Regulatory Compliance
Different domains have different rules:
- Healthcare: FDA approval required, extensive validation, documentation
- Finance: Fair lending laws, explainability required, risk management
- EU: GDPR (right to explanation), AI Act (classification by risk level)
- Autonomous vehicles: state-level approval, safety validation
Real impact: can’t deploy in some regions due to regulation. Can’t use certain models (black-box neural networks) where explainability is required.
Privacy
Regulations:
- GDPR: EU resident data is protected, users have right to access/delete
- CCPA: California similar to GDPR
- HIPAA: medical data
Practical:
- Don’t send sensitive data to cloud (keep medical data in-hospital)
- Use differential privacy: add noise to training data so individual contributions are hidden
- Federated learning: train model across hospitals without centralizing data
Real-world: data is often the moat (more data → better model). Privacy regulations limit data access, slowing model improvement.
Explainability
Why does it matter?
- User trust: “why was I denied a loan?” (lender must explain)
- Debugging: “why did model fail here?” (engineer must understand)
- Safety: “why did car brake?” (reassure passengers)
Techniques:
- Attention mechanisms: show which parts of input model focused on
- Feature importance: which features drove prediction?
- LIME: local interpretable model-agnostic explanations
- Saliency maps: visualize gradients (which pixels matter?)
Limitations: explanations can be misleading (models can rely on spurious correlations). A saliency map indicating that blue pixels matter doesn’t mean blue is causal; it may simply be correlated with the actual cause.
Fairness and Bias
Problem: model discriminates against group (race, gender, age).
Sources:
- Biased training data (historical injustice encoded in data)
- Proxy features (zip code correlates with race, not causal but predictive)
- Model amplifies: learns subtle correlations that humans wouldn’t consciously use
Detection:
- Disaggregate metrics by demographic: is model equally accurate for all groups?
- Compare decision rates: does model accept/reject equitable proportions across groups?
Solutions:
- Balance training data: oversample underrepresented groups
- Fairness constraints: explicitly minimize group disparities during training
- Threshold adjustment: use different decision thresholds for different groups (controversial)
Real-world: perfect fairness is impossible (trade-offs between accuracy and fairness). Find acceptable trade-off, document it, monitor for drift.
Cost of Inference
Cloud APIs: $0.001 per request (ChatGPT API), millions of requests → significant cost.
On-device: one-time cost of hardware, no per-request cost. But limited accuracy (small models).
Trade-off: accuracy vs. cost.
- ChatGPT (highest cost, highest accuracy)
- Open-source Llama (free to run, lower accuracy)
- Distilled model (cheap, decent accuracy, smaller)
Real deployment: use cheap model first. If accuracy insufficient, use expensive model.
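The cloud-vs-local decision reduces to a break-even calculation. All numbers below are illustrative assumptions, not current vendor pricing:

```python
def breakeven_requests(cost_per_request, hardware_cost, monthly_power_cost,
                       months):
    """Request volume over `months` at which local hardware beats the cloud API."""
    local_total = hardware_cost + monthly_power_cost * months
    return local_total / cost_per_request

# e.g. $0.001/request vs a $2,000 GPU box + $50/month power over 12 months:
# break-even at about 2.6 million requests; above that, local is cheaper.
```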
Latency: Speed of Response
Real-time applications (fraud, autonomous vehicles): <100ms requirement.
- Models must be small and optimized
- Can’t afford big transformers, use distilled models or classical algorithms
Batch applications (predictive maintenance, recommendations): seconds to minutes.
- Can use larger models
- More processing time available
Trade-off: size vs. accuracy.
- Large transformer: highest accuracy, slow
- Distilled model: slightly lower accuracy, 10x faster
- Choose based on requirements
13. How These Connect to Harnesses
A harness is the decision-making layer orchestrating complex systems. Each real-world application follows a pattern:
Perception → Decision → Action
Harness is the Decision step.
Autonomous Vehicle Harness
Perception (camera/LIDAR → object detections)
↓
Harness: Planning & Control
- Prediction: what will others do?
- Planning: safe path given predictions
- Control: steering commands
↓
Action (actuators: steering wheel, brakes, accelerator)
Robotics Harness
Perception (camera → object detection, localization → SLAM)
↓
Harness: Manipulation & Navigation
- Object detection → where to grasp?
- Planning: how to reach object?
- Grasping: what grasp strategy?
↓
Action (robot arm: move to position, close gripper)
Smart Grid Harness
Perception (sensors → power flows, demand, generation)
↓
Harness: Optimization
- Load forecasting: predict demand
- Optimization: allocate generation
- Fault localization: where is problem?
↓
Action (controllers: reroute power, start generator, trip circuit)
In each case, the harness must:
- Integrate perception (uncertain sensor data, multiple modalities)
- Reason about the world (predictions, planning, optimization)
- Make decisions (choose action that achieves goals)
- Execute under constraints (real-time, reliability, safety)
The perception systems are deep learning (neural networks). The harness could be classical algorithms (optimization, planning), learned models (reinforcement learning), or hybrid.
Conclusion
AI in production is not the lab version. It’s:
- Integrated: perception, decision, action working together
- Constrained: latency, cost, power, reliability requirements
- Uncertain: sensors fail, data shifts, edge cases happen
- Real-time: decisions in milliseconds or predictions 24 hours ahead
- High-stakes: failures have real consequences (crashes, blackouts, financial loss)
- Regulated: privacy, fairness, explainability requirements vary by domain
The common pattern: perception systems extract meaning from raw data, harnesses orchestrate decisions, controllers execute actions. The harness is where real-world AI becomes practical.
Understanding these applications shows why harnesses matter: they translate ML predictions into reliable systems. Without a proper harness, AI is just math—good predictions don’t help if the system can’t integrate them into decisions that actually work.
Validation Checklist
How do you know you got this right?
Performance Checks
- Identified which application pattern fits your domain (autonomous vehicle, robotics, predictive maintenance, smart grid, healthcare, recommendations, NLP, computer vision, finance, smart home)
- Have a latency budget for your use case: real-time (<100ms), near-real-time (<1s), or batch (minutes to hours)
- Measured end-to-end system latency from sensor/input to action/output on representative data
Implementation Checks
- Perception-Decision-Action pipeline defined: each stage has clear inputs, outputs, and latency allocation
- Fallback chain implemented: primary ML model → simpler model → rule-based logic → human escalation
- Gradual rollout strategy planned: shadow mode → canary (1% of traffic) → ramp to 100%
- Data quality checks in place: label distribution validated, duplicates removed, train/test split verified
- Distribution shift monitoring configured: track model accuracy on recent data, alert if performance degrades
- Domain-specific regulatory requirements identified (FDA for healthcare, Fair Lending for finance, GDPR for EU data)
- Cost of inference calculated: per-request cost for cloud API vs amortized hardware cost for your expected volume
Integration Checks
- Harness orchestration layer connects perception outputs to decision logic to action execution
- A/B testing infrastructure ready: can compare old model vs new model on live traffic with statistical significance
- Monitoring dashboards configured: latency, accuracy, crash rate, and business metrics tracked per deployment
Common Failure Modes
- Distribution shift in production: Model trained on historical data fails on current patterns (fraud evolves, weather changes, equipment ages). Fix: retrain monthly on new data, monitor accuracy per time window, set up automated drift detection.
- False positive fatigue: Too many alerts desensitize users (security alarms, predictive maintenance warnings, medical alerts). Fix: increase model specificity, add confirmation rules (pattern must persist >60 seconds), tune threshold based on cost of false positive vs false negative.
- Edge case domination: 99.9% of data is easy; the 0.1% of edge cases causes all real failures. Fix: active learning to prioritize hard examples for labeling, simulation for synthetic edge cases, gradual deployment with human oversight.
- Explainability gap: Stakeholders (doctors, regulators, customers) don’t trust black-box predictions. Fix: add attention maps, LIME explanations, or feature importance scores; document model limitations in user-facing documentation.
Sign-Off Criteria
- End-to-end system tested on real data (not just held-out test set) with realistic traffic patterns and failure scenarios
- Fallback behavior verified: system degrades gracefully when model fails, is slow, or gives low-confidence output
- Regulatory compliance confirmed for your domain (HIPAA audit for healthcare, fairness audit for lending, safety validation for autonomous systems)
- Business metrics tracked: does the ML system improve the metric you care about (revenue, safety, efficiency, user satisfaction)?
- Runbook documented: what to do when the model fails in production (who to alert, how to revert, when to retrain)
See Also
- Doc 06 (Harness Architecture) — Seven components apply to all real-world systems shown here; understand the abstract pattern
- Doc 05 (AI Agents) — Agentic reasoning (ReAct, planning) powers decision-making in these applications
- Doc 25 (Edge & Physical AI) — Edge deployment and physical systems are the targets for real-world applications
- Doc 09 (Operations & Observability) — Production systems require monitoring and observability; each application has unique metrics