Monitoring AI apps
Introduction
Your AI model is deployed to production. Users are using it. Everything looks fine... but is it? 🤔
Scary truth: AI models degrade silently. A traditional app crashes and throws an error. But an AI model just keeps serving wrong predictions — no error, no crash. Users get a bad experience and you never find out!
Real example: Zillow's AI home-pricing model silently drifted — reportedly costing the company around $500 million! 😱
Monitoring = your AI app's eyes and ears. In this article we'll go hands-on with AI-specific monitoring, a Prometheus + Grafana setup, and model drift detection! 📊
Three Pillars of Observability
Observability = understanding a system's internal state from its external outputs.
1. Metrics 📊 — Numbers over time
- CPU usage: 75%
- Request latency: 120ms
- Model accuracy: 94.2%
- Predictions per second: 500
2. Logs 📝 — Event records
3. Traces 🔍 — Request journey tracking
| Pillar | Question It Answers | Tool |
|---|---|---|
| Metrics | "How much? How fast?" | Prometheus |
| Logs | "What happened?" | ELK Stack, Loki |
| Traces | "Where did time go?" | Jaeger, Zipkin |
You need all three! Metrics trigger the alert, logs reveal the root cause, and traces pinpoint the bottleneck. 🎯
AI-Specific Metrics — What to Track
Normal infra metrics PLUS these AI-specific ones:
🎯 Model Performance Metrics:
| Metric | What | Alert When |
|---|---|---|
| Accuracy | Prediction correctness | Drops >5% |
| Latency (p50/p95/p99) | Inference time | p99 > 2s |
| Throughput | Predictions/second | Drops >20% |
| Error rate | Failed predictions | > 1% |
| Confidence scores | Model certainty | Avg drops below 0.7 |
📊 Data Quality Metrics:
| Metric | What | Alert When |
|---|---|---|
| Feature drift | Input distribution change | Significant shift |
| Missing values | Null/NaN in inputs | > 5% |
| Data volume | Requests per hour | Unusual spike/drop |
| Schema violations | Unexpected input format | Any occurrence |
🔄 Model Drift Metrics:
| Metric | What | Alert When |
|---|---|---|
| PSI (Population Stability Index) | Distribution shift | PSI > 0.2 |
| KL Divergence | Statistical distance | Significant increase |
| Prediction distribution | Output pattern change | Unexpected shift |
Pro tip: show real-time accuracy on your dashboard — it's the most important metric for AI apps! 📈
Prometheus + Grafana Setup
Step 1: Expose metrics from your Python app
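A minimal sketch using the official `prometheus_client` library — the metric names (`model_predictions_total`, `model_inference_seconds`, and so on) are illustrative choices, not prescribed anywhere:

```python
import random

from prometheus_client import Counter, Gauge, Histogram, generate_latest

# Illustrative metric names -- adapt them to your app
PREDICTIONS = Counter("model_predictions_total", "Total predictions served",
                      ["model_version"])
LATENCY = Histogram("model_inference_seconds", "Inference latency in seconds")
CONFIDENCE = Gauge("model_confidence_last", "Confidence of the last prediction")

@LATENCY.time()  # records how long each call takes
def predict(text: str) -> float:
    confidence = random.uniform(0.5, 1.0)  # stand-in for a real model call
    PREDICTIONS.labels(model_version="v1").inc()
    CONFIDENCE.set(confidence)
    return confidence

predict("sample input")

# To serve these at http://localhost:8001/metrics, add:
#   from prometheus_client import start_http_server
#   start_http_server(8001)
exposition = generate_latest().decode()
print("model_predictions_total" in exposition)  # → True
```

`generate_latest()` renders every registered metric in the Prometheus text exposition format, which is exactly what the scraper reads.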
Step 2: Add a Prometheus scrape config, then create a Grafana dashboard! 📊
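A matching `prometheus.yml` scrape config might look like this (the job name and port are assumptions — point the target at wherever your app serves `/metrics`):

```yaml
scrape_configs:
  - job_name: "ai-api"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8001"]   # where the app exposes /metrics
```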
Model Drift — The Silent Killer
Model drift = the #1 enemy of AI apps! A model that hit 95% accuracy at training time can drop to 70% within 3 months — no errors, no crashes, just bad predictions. 😰
Types of drift:
1. Data Drift (Input distribution changes)
- Training: Mostly English text
- Production: suddenly mixed Tamil + Hindi text starts arriving
- The model gets confused!
2. Concept Drift (Relationship changes)
- Training: "Work from home" = negative sentiment (pre-COVID)
- Production: "Work from home" = positive sentiment (post-COVID)
- Same input, different meaning!
3. Prediction Drift (Output distribution changes)
- Training: 50% positive, 50% negative predictions
- Production: 90% positive — something wrong!
Detection code:
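Here's a minimal PSI computation sketch using numpy (Evidently AI wraps this kind of check for you; the 10-bin split is a common convention, not a rule):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index: how far `actual` drifted from `expected`."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Clamp production values into the baseline's range so no mass is lost
    actual = np.clip(actual, edges[0], edges[-1])
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty buckets to avoid log(0)
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
baseline = rng.normal(0.0, 1.0, 10_000)   # training-time feature values
drifted = rng.normal(0.8, 1.0, 10_000)    # production values, mean shifted

print(f"no drift: PSI = {psi(baseline, baseline):.3f}")  # ~0.000
print(f"drifted:  PSI = {psi(baseline, drifted):.3f}")   # > 0.2 -> alert!
```

Run this weekly (cron or a scheduled job) against a stored baseline sample, and fire an alert whenever PSI crosses 0.2.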
Rule: a weekly drift check is mandatory for production AI! 🔄
AI Monitoring Dashboard Design
Effective AI monitoring dashboard:
Row 1 — Health Overview 🟢
- Model version (current)
- Uptime percentage
- Total predictions today
- Current error rate
Row 2 — Performance Metrics 📈
- Accuracy trend (last 7 days)
- Latency p50/p95/p99 chart
- Throughput (requests/sec)
- Confidence score distribution
Row 3 — Data Quality 📊
- Feature drift indicators
- Missing value percentage
- Input volume trend
- Schema violation count
Row 4 — Infrastructure 🖥️
- CPU/Memory usage
- GPU utilization (for inference)
- Disk space
- Network I/O
Grafana dashboard JSON:
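For reference, a heavily trimmed, illustrative dashboard fragment — a real Grafana export carries many more fields, and the PromQL expressions assume metric names like `model_predictions_total`:

```json
{
  "title": "AI Model Monitoring",
  "panels": [
    {
      "title": "Prediction Rate",
      "type": "timeseries",
      "targets": [
        { "expr": "rate(model_predictions_total[5m])" }
      ]
    },
    {
      "title": "Inference Latency p99",
      "type": "timeseries",
      "targets": [
        { "expr": "histogram_quantile(0.99, rate(model_inference_seconds_bucket[5m]))" }
      ]
    }
  ]
}
```

In practice it's easier to build panels in the Grafana UI and export the JSON than to hand-write it.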
Pro tip: follow the RED method — Rate, Errors, Duration. Track these for every service! 🎯
Smart Alerting Strategy
Alert fatigue is the biggest monitoring mistake! Fire 1,000 alerts and no one cares about any of them. Set up smart alerting instead:
🔴 Critical (PagerDuty — Wake someone up):
- Model accuracy drops >10% in 1 hour
- Error rate >5%
- All instances down
- Inference latency >5s sustained
🟡 Warning (Slack notification):
- Accuracy drops >5% in 24 hours
- Drift detected (PSI > 0.2)
- Latency p99 >2s
- Disk usage >80%
- Confidence avg drops below 0.6
🔵 Info (Dashboard only):
- New model deployed
- Retrain job completed
- Traffic pattern change
Prometheus alerting rules:
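A sketch of what these could look like in an `alert_rules.yml` (metric names, thresholds, and the runbook URL are placeholders):

```yaml
groups:
  - name: ai-model-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(model_errors_total[5m]) / rate(model_predictions_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prediction error rate above 5%"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
      - alert: SlowInference
        expr: histogram_quantile(0.99, rate(model_inference_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 inference latency above 2s"
```

The `for:` clause is what keeps transient blips from paging anyone — the condition must hold for the whole window before the alert fires.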
Golden rule: every alert needs a runbook — step-by-step instructions for what to do when the alert fires! 📋
Structured Logging for AI
Structured logging is a MUST for AI apps:
What to log for AI:
✅ Request ID (trace across services)
✅ Model version (which model served)
✅ Input metadata (size, type — NOT actual data!)
✅ Latency breakdown (preprocess, inference, postprocess)
✅ Confidence score
✅ Prediction result
❌ NEVER log PII (names, emails)
❌ NEVER log full input data (privacy + storage cost)
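The checklist above can be sketched with just the stdlib `logging` module (a library like structlog gives you the same result with less boilerplate; the field names here are illustrative):

```python
import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("ai-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Log prediction metadata only -- never the raw input or PII
logger.info("prediction_served", extra={"fields": {
    "request_id": str(uuid.uuid4()),
    "model_version": "v1",
    "input_tokens": 128,
    "inference_ms": 42.7,
    "confidence": 0.91,
    "prediction": "positive",
}})
```

One JSON object per line is exactly the shape log aggregators parse best.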
Log aggregation: Loki + Grafana (free) or ELK Stack. Query example:
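An illustrative LogQL query for Loki, assuming JSON log lines that carry a `confidence` field (adapt the label and field names to your setup):

```
{app="ai-api"} | json | confidence < 0.6
```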
↑ Filter for low-confidence predictions to spot potential issues early! 🔍
AI Monitoring Tools Comparison
ML-specific monitoring tools:
| Tool | Type | Cost | Best For |
|---|---|---|---|
| **Prometheus + Grafana** | General metrics | Free | Infrastructure + custom |
| **Evidently AI** | ML monitoring | Free/Open | Drift detection |
| **WhyLabs** | ML observability | Freemium | Full ML monitoring |
| **MLflow** | Experiment tracking | Free | Model versioning |
| **Arize AI** | ML observability | Paid | Enterprise |
| **Neptune.ai** | Experiment tracking | Freemium | Research teams |
| **Datadog** | Full stack | Paid | Enterprise |
| **New Relic** | APM | Freemium | Application perf |
Recommended stack (Free):
- 📊 Prometheus + Grafana — Metrics & dashboards
- 📝 Loki — Log aggregation
- 🔍 Jaeger — Distributed tracing
- 🤖 Evidently AI — ML drift detection
- 📦 MLflow — Model tracking
Total cost: $0! Full enterprise-grade monitoring for free! 🎉
AI Monitoring Architecture
```
            AI APP MONITORING ARCHITECTURE

📱 Users
   │ requests
   ▼
┌──────────────┐                   ┌────────────┐
│   AI API     │──── Metrics ─────▶│ Prometheus │
│  (FastAPI)   │                   └─────┬──────┘
│  /predict    │                         ▼
└──────┬───────┘                   ┌────────────┐
       │                           │  Grafana   │
       ▼                           │ Dashboards │
┌──────────────┐                   └─────┬──────┘
│ Model Server │                         ▼
└──────┬───────┘                   ┌────────────┐
       │                           │Alertmanager│
       ▼                           └──┬─────┬───┘
┌──────────────┐              Slack ◀─┘     └─▶ PagerDuty
│  Prediction  │
│  Store (DB)  │
└──────┬───────┘
       ▼
┌───────────────┐     ┌─────────────┐
│ Evidently AI  │────▶│ Drift Alert │
│ (Drift Check) │     │ + Retrain   │
│  Cron: Daily  │     │   Trigger   │
└───────────────┘     └─────────────┘

┌─────────────┐       ┌────────────┐
│    Loki     │◀──────│    Logs    │
│ (Log Aggr.) │       │ (Structlog)│
└──────┬──────┘       └────────────┘
       └─────────▶ Grafana Explore
```
Quick Setup — Docker Compose Stack
Spin up the full monitoring stack with one command!
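A minimal sketch of such a stack (image tags and ports are illustrative; the `prometheus.yml` mount assumes a config file sitting next to the compose file):

```yaml
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
```

Start it with `docker compose up -d`, then open Grafana at http://localhost:3000.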
A full monitoring stack ready in 5 minutes! Import the pre-built Grafana dashboards and you're done! 🚀
Prompt: Design Monitoring System
Summary
Key takeaways:
✅ Three pillars = Metrics + Logs + Traces — you need all three
✅ AI-specific metrics = Accuracy, drift, confidence — beyond CPU/memory
✅ Model drift = Silent killer — detect with Evidently AI
✅ Prometheus + Grafana = Free, powerful monitoring stack
✅ Smart alerting = Severity levels, runbooks, no alert fatigue
✅ Structured logging = JSON logs, request IDs, model versions
Action item: add the Prometheus client to your AI project and expose 3 custom metrics (latency, prediction count, confidence). Then build a dashboard for them in Grafana! 📊
Next article: Scalable AI Architecture — designing for millions of users! 🏗️
🏁 🎮 Mini Challenge
Challenge: Setup Monitoring Dashboard (Prometheus + Grafana)
A real monitoring setup — track your AI app's performance! 📊
Step 1: Python Flask App with Metrics 🐍
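One possible starting point for Step 1, assuming Flask and `prometheus_client` are installed (the route and metric names are mine, not prescribed by the challenge):

```python
import random
import time

from flask import Flask, jsonify
from prometheus_client import (Counter, Histogram, generate_latest,
                               CONTENT_TYPE_LATEST)

app = Flask(__name__)

REQUESTS = Counter("app_requests_total", "Total /predict requests", ["status"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

@app.route("/predict")
def predict():
    start = time.perf_counter()
    confidence = random.uniform(0.5, 1.0)  # stand-in for a real model call
    REQUESTS.labels(status="ok").inc()
    LATENCY.observe(time.perf_counter() - start)
    return jsonify({"prediction": "positive", "confidence": round(confidence, 3)})

@app.route("/metrics")
def metrics():
    # Prometheus scrapes this endpoint
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}

# Run with: flask --app app run --port 5000
```

Hit `/predict` a few times, then check `/metrics` — you should see your counters climbing.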
Step 2: Docker Container 🐳
Step 3: Prometheus Configuration 🔍
Step 4: Start Prometheus 🚀
Step 5: Grafana Dashboard 📈
Step 6: Alerts Setup 🚨
Step 7: Load Test & Monitor 📊
Completion Time: 2-3 hours
Tools: Prometheus, Grafana, Flask
Production-ready monitoring ⭐
💼 Interview Questions
Q1: RED vs USE metrics — what's the difference, and which matters for AI apps?
A: RED = Request rate, Error rate, Duration. USE = Utilization, Saturation, Errors. RED is for API/service endpoints (perfect for inference); USE is for infrastructure (CPU, memory, disk). For AI apps, both matter: RED tracks inference quality, USE tracks resource health.
Q2: Alert fatigue — how do you handle too many false alerts?
A: Tune thresholds carefully (through testing). Use composite alerts (multiple conditions must hold), time-based thresholds (different limits during peak hours), severity levels (critical vs warning), and an on-call rotation. Attach a runbook to every alert (when it fires, how do I fix it?). Best practice: alert on business metrics (customer impact) and avoid low-level infrastructure alert spam.
Q3: Why is distributed tracing needed in large systems?
A: In microservices, a request crosses multiple services, so a single log line isn't enough — you need to trace the entire journey. Tools: Jaeger, Zipkin. Each span records a service name, latency, and status. Traces let you identify bottlenecks (which service is slow) and debug errors (which service failed). In AI systems: a model-inference span, a database span, a cache span — so you can pinpoint the slow component.
Q4: How do you monitor model performance and detect model drift?
A: Establish a baseline accuracy for the production model. Monitor prediction confidence and actual vs. predicted outcomes (when labels become available later). An accuracy drop signals model drift (the data distribution has changed). Solution: retrain the model, A/B test the new version, and roll it out as a canary. Critical for long-running AI systems.
Q5: What is the performance impact of your logging strategy?
A: Synchronous logging is slow; asynchronous logging (a background thread) is better. Sampling: don't log 100% of requests — a 10% random sample balances cost and visibility. Structured logging: JSON (easy to parse). Log levels: DEBUG (dev only), INFO (production), ERROR (critical). Too much logging means storage cost and noise; the right balance gives visibility without overhead.
Frequently Asked Questions
What is model drift?