Monitoring AI apps
Introduction
Your AI model has been deployed to production. Users are using it. Everything looks fine... but is it?
Scary truth: AI models degrade silently. A traditional app throws an error when it crashes. An AI model just returns wrong predictions: no error, no crash. Users get a bad experience and you never find out!
Real example: Zillow's AI home-pricing model silently drifted, and the company reportedly lost about $500 million!
Monitoring is your AI app's eyes and ears. In this article we cover AI-specific monitoring, a Prometheus + Grafana setup, and model drift detection, all hands-on!
Three Pillars of Observability
Observability = understanding a system's internal state from its external outputs.
1. Metrics: numbers over time
- CPU usage: 75%
- Request latency: 120ms
- Model accuracy: 94.2%
- Predictions per second: 500
2. Logs: event records
3. Traces: request journey tracking
| Pillar | Question It Answers | Tool |
|---|---|---|
| Metrics | "How much? How fast?" | Prometheus |
| Logs | "What happened?" | ELK Stack, Loki |
| Traces | "Where did time go?" | Jaeger, Zipkin |
You need all three! Metrics raise the alert, logs show the root cause, and traces identify the bottleneck.
AI-Specific Metrics: What to Track
Normal infra metrics PLUS these AI-specific ones:
Model Performance Metrics:
| Metric | What | Alert When |
|---|---|---|
| Accuracy | Prediction correctness | Drops >5% |
| Latency (p50/p95/p99) | Inference time | p99 > 2s |
| Throughput | Predictions/second | Drops >20% |
| Error rate | Failed predictions | > 1% |
| Confidence scores | Model certainty | Avg drops below 0.7 |
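As a rough illustration of the latency row, the sketch below computes p50/p95/p99 from a list of recorded inference latencies using only the standard library. The function names and the sample data are hypothetical; in a real deployment Prometheus histograms would usually do this for you.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50, p95 and p99 for a list of inference latencies (ms)."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def breaches_p99(latencies_ms, threshold_ms=2000.0):
    """True when p99 latency exceeds the alert threshold (2 s in the table)."""
    return latency_percentiles(latencies_ms)["p99"] > threshold_ms


# Illustrative data: 98 fast requests plus 2 slow outliers.
sample = [100.0] * 98 + [3000.0] * 2
print(latency_percentiles(sample))
print("p99 alert:", breaches_p99(sample))
```

The median stays low here while p99 crosses the threshold, which is why the table tracks tail percentiles and not just the average.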
Data Quality Metrics:
| Metric | What | Alert When |
|---|---|---|
| Feature drift | Input distribution change | Significant shift |
| Missing values | Null/NaN in inputs | > 5% |
| Data volume | Requests per hour | Unusual spike/drop |
| Schema violations | Unexpected input format | Any occurrence |
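A minimal sketch of how the missing-value and schema checks from this table could be computed per batch of requests. The schema, the feature names, and the `check_batch` helper are assumptions made up for illustration.

```python
import math

# Hypothetical schema: feature name -> accepted Python types.
EXPECTED_SCHEMA = {"age": (int, float), "income": (int, float), "city": (str,)}


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_batch(rows, schema=EXPECTED_SCHEMA):
    """Return the missing-value ratio and the number of schema violations."""
    total_cells = missing = violations = 0
    for row in rows:
        # A row with unknown or absent keys counts as one violation.
        if set(row) != set(schema):
            violations += 1
            continue
        for name, types in schema.items():
            total_cells += 1
            value = row[name]
            if is_missing(value):
                missing += 1
            elif not isinstance(value, types):
                violations += 1
    ratio = missing / total_cells if total_cells else 0.0
    return {"missing_ratio": ratio, "schema_violations": violations}


batch = [
    {"age": 30, "income": 50000.0, "city": "Chennai"},
    {"age": None, "income": float("nan"), "city": "Madurai"},
    {"age": "thirty", "income": 1.0, "city": "Salem"},  # wrong type
    {"age": 1, "income": 2},                            # missing key
]
report = check_batch(batch)
print(report)
print("missing-value alert:", report["missing_ratio"] > 0.05)
```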
Model Drift Metrics:
| Metric | What | Alert When |
|---|---|---|
| PSI (Population Stability Index) | Distribution shift | PSI > 0.2 |
| KL Divergence | Statistical distance | Significant increase |
| Prediction distribution | Output pattern change | Unexpected shift |
Pro tip: Show real-time accuracy on your dashboard. It is the most important metric for AI apps!
Prometheus + Grafana Setup
Step 1: Expose metrics from your Python app
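A minimal sketch of this step using the `prometheus_client` library. The metric names, the `v1` version tag, and the `fake_model` stand-in are assumptions for illustration, not the article's original code.

```python
import random
import time

from prometheus_client import (
    CollectorRegistry, Counter, Gauge, Histogram,
    generate_latest, start_http_server,
)

# A dedicated registry keeps the example isolated; the default one also works.
REGISTRY = CollectorRegistry()
MODEL_VERSION = "v1"  # hypothetical version tag

PREDICTIONS = Counter("model_predictions_total", "Predictions served",
                      ["model_version"], registry=REGISTRY)
ERRORS = Counter("model_prediction_errors_total", "Failed predictions",
                 ["model_version"], registry=REGISTRY)
LATENCY = Histogram("model_inference_latency_seconds", "Inference latency",
                    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0),
                    registry=REGISTRY)
CONFIDENCE = Gauge("model_last_confidence", "Confidence of latest prediction",
                   registry=REGISTRY)


def fake_model(features):
    """Stand-in for a real model: returns (label, confidence)."""
    return "positive", random.uniform(0.5, 1.0)


def predict(features):
    """Run one prediction and record latency, count, errors and confidence."""
    start = time.perf_counter()
    try:
        label, confidence = fake_model(features)
    except Exception:
        ERRORS.labels(model_version=MODEL_VERSION).inc()
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)
    PREDICTIONS.labels(model_version=MODEL_VERSION).inc()
    CONFIDENCE.set(confidence)
    return label, confidence


def render_metrics():
    """Return the text that Prometheus would scrape from /metrics."""
    return generate_latest(REGISTRY).decode("utf-8")


def serve_metrics(port=8000):
    """Expose /metrics on the given port from a background thread."""
    start_http_server(port, registry=REGISTRY)
```

Calling `serve_metrics()` once at startup makes the metrics available at `http://localhost:8000/metrics`; port 8000 is just an example value.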
Step 2: Add a Prometheus scrape config and create a Grafana dashboard!
Model Drift: The Silent Killer
Model drift is the #1 enemy of AI apps! A model with 95% accuracy at training time can drop to 70% within 3 months: no errors, no crashes, just bad predictions.
Types of drift:
1. Data Drift (Input distribution changes)
- Training: Mostly English text
- Production: Suddenly a mix of Tamil + Hindi text arrives
- The model gets confused!
2. Concept Drift (Relationship changes)
- Training: "Work from home" = negative sentiment (pre-COVID)
- Production: "Work from home" = positive sentiment (post-COVID)
- Same input, different meaning!
3. Prediction Drift (Output distribution changes)
- Training: 50% positive, 50% negative predictions
- Production: 90% positive, so something is wrong!
Detection code:
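A minimal sketch of drift detection with the Population Stability Index, written in plain Python. It uses the common PSI > 0.2 threshold from the metrics table; the helper names and the example distributions (taken from the prediction drift case above) are illustrative assumptions, and a library such as Evidently AI would normally replace this hand-rolled version.

```python
import math


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    `expected` and `actual` are lists of bin proportions (each summing to 1),
    e.g. the training-time and production-time share of each class or bucket.
    """
    if len(expected) != len(actual):
        raise ValueError("distributions must have the same number of bins")
    total = 0.0
    for e, a in zip(expected, actual):
        # Clamp to eps so empty bins do not cause log(0) or division by zero.
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def bin_proportions(values, edges):
    """Share of `values` in each bin defined by sorted `edges`."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v >= edge for edge in edges)] += 1
    return [c / len(values) for c in counts]


training = [0.5, 0.5]    # 50% positive, 50% negative
production = [0.9, 0.1]  # 90% positive
score = psi(training, production)
print(f"PSI = {score:.3f}, drift detected = {score > 0.2}")
```

For numeric features, `bin_proportions` turns raw values into the bin shares that `psi` expects; the bin edges should be fixed from the training data so both distributions use the same buckets.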
Rule: A weekly drift check is mandatory for production AI!
AI Monitoring Dashboard Design
Effective AI monitoring dashboard:
Row 1: Health Overview
- Model version (current)
- Uptime percentage
- Total predictions today
- Current error rate
Row 2: Performance Metrics
- Accuracy trend (last 7 days)
- Latency p50/p95/p99 chart
- Throughput (requests/sec)
- Confidence score distribution
Row 3: Data Quality
- Feature drift indicators
- Missing value percentage
- Input volume trend
- Schema violation count
Row 4: Infrastructure
- CPU/Memory usage
- GPU utilization (for inference)
- Disk space
- Network I/O
Grafana dashboard JSON:
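One way to produce such a dashboard definition is to build it in Python and dump it as JSON for import into Grafana. The sketch below covers only a few of the panels listed above; the metric names in the PromQL expressions are hypothetical, and only a small subset of Grafana's panel fields is set.

```python
import json

# Hypothetical metric names; adjust to whatever your app exposes.
ERROR_RATE = ("sum(rate(model_prediction_errors_total[5m])) / "
              "sum(rate(model_predictions_total[5m]))")
THROUGHPUT = "sum(rate(model_predictions_total[1m]))"
P99_LATENCY = ("histogram_quantile(0.99, "
               "sum(rate(model_inference_latency_seconds_bucket[5m])) by (le))")
AVG_CONFIDENCE = "avg_over_time(model_last_confidence[1h])"


def panel(panel_id, title, expr, x, y, panel_type="timeseries", w=12, h=8):
    """Build one Grafana panel backed by a single PromQL query."""
    return {
        "id": panel_id,
        "type": panel_type,
        "title": title,
        "gridPos": {"x": x, "y": y, "w": w, "h": h},  # grid is 24 units wide
        "targets": [{"expr": expr, "refId": "A"}],
    }


def build_dashboard():
    """Assemble a small dashboard covering health and performance rows."""
    return {
        "title": "AI Model Monitoring",
        "refresh": "30s",
        "panels": [
            panel(1, "Current error rate", ERROR_RATE, 0, 0, "stat"),
            panel(2, "Throughput (predictions/sec)", THROUGHPUT, 12, 0),
            panel(3, "Latency p99", P99_LATENCY, 0, 8),
            panel(4, "Average confidence (1h)", AVG_CONFIDENCE, 12, 8),
        ],
    }


dashboard = build_dashboard()
print(json.dumps(dashboard, indent=2))
```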
Pro tip: Follow the RED method: Rate, Errors, Duration. Track these for every service!
Smart Alerting Strategy
Alert fatigue is the biggest monitoring mistake! With 1000 alerts firing, no one cares. Set up smart alerting:
Critical (PagerDuty: wake someone up):
- Model accuracy drops >10% in 1 hour
- Error rate >5%
- All instances down
- Inference latency >5s sustained
Warning (Slack notification):
- Accuracy drops >5% in 24 hours
- Drift detected (PSI > 0.2)
- Latency p99 >2s
- Disk usage >80%
- Confidence avg drops below 0.6
Info (Dashboard only):
- New model deployed
- Retrain job completed
- Traffic pattern change
Prometheus alerting rules:
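A Prometheus rules file is written in YAML. The sketch below only mirrors its layout (groups, `alert`, `expr`, `for`, `labels`, `annotations`) as Python dictionaries, so the thresholds from the severity tiers above can be reviewed and unit tested. The metric names, the `for` durations, and the choice of p99 for the sustained-latency rule are assumptions.

```python
import json

P99 = ("histogram_quantile(0.99, "
       "sum(rate(model_inference_latency_seconds_bucket[5m])) by (le))")
ERROR_RATIO = ("sum(rate(model_prediction_errors_total[5m])) / "
               "sum(rate(model_predictions_total[5m]))")


def rule(name, expr, duration, severity, summary):
    """Build one alerting rule in the Prometheus rule-file layout."""
    return {
        "alert": name,
        "expr": expr,
        "for": duration,
        "labels": {"severity": severity},  # used for routing in Alertmanager
        "annotations": {"summary": summary},
    }


RULES = {
    "groups": [{
        "name": "ai_model_alerts",
        "rules": [
            rule("HighErrorRate", f"{ERROR_RATIO} > 0.05", "5m",
                 "critical", "Prediction error rate above 5%"),
            rule("SustainedHighLatency", f"{P99} > 5", "10m",
                 "critical", "Inference latency above 5s"),
            rule("HighP99Latency", f"{P99} > 2", "5m",
                 "warning", "Latency p99 above 2s"),
            rule("LowAverageConfidence",
                 "avg_over_time(model_last_confidence[1h]) < 0.6", "15m",
                 "warning", "Average confidence below 0.6"),
        ],
    }]
}


def rules_by_severity(rules, severity):
    """Return the alert names that carry the given severity label."""
    return [r["alert"] for group in rules["groups"] for r in group["rules"]
            if r["labels"]["severity"] == severity]


print(json.dumps(RULES, indent=2))
```

In Alertmanager, the `severity` label is what would route critical alerts to PagerDuty and warnings to Slack.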
Golden rule: Every alert must have a runbook that says, step by step, what to do when it fires!
Structured Logging for AI
Structured logging is a MUST for AI apps:
What to log for AI:
- Request ID (trace across services)
- Model version (which model served)
- Input metadata (size, type; NOT actual data!)
- Latency breakdown (preprocess, inference, postprocess)
- Confidence score
- Prediction result
- NEVER log PII (names, emails)
- NEVER log full input data (privacy + storage cost)
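The checklist above can be sketched with the standard library's `logging` module and a small JSON formatter. This is a stand-in for a dedicated library such as structlog, and the field names and the `log_prediction` helper are assumptions for illustration.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields are passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr if None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("ai_api")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_prediction(logger, model_version, text, label, confidence, timings_ms):
    """Log metadata about a prediction, never the raw input itself."""
    logger.info("prediction", extra={"fields": {
        "request_id": str(uuid.uuid4()),
        "model_version": model_version,
        "input_chars": len(text),            # size only, not the text
        "input_type": type(text).__name__,
        "latency_ms": timings_ms,            # preprocess/inference/postprocess
        "confidence": round(confidence, 3),
        "prediction": label,
    }})


buffer = io.StringIO()
log = build_logger(buffer)
log_prediction(log, "v1", "secret user text", "positive", 0.91234,
               {"preprocess": 2, "inference": 40, "postprocess": 1})
print(buffer.getvalue())
```

Only the input's length and type are logged, so the raw user text never reaches the log pipeline.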
Log aggregation: Loki + Grafana (free) or ELK Stack. Query example:
Filter for low-confidence predictions to spot potential issues!
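The same kind of filter can be sketched offline in Python over JSON log lines, for example on a file exported from the log store. This is an illustration and not a Loki or Elasticsearch query; the `confidence` and `model_version` field names and the 0.6 threshold are assumptions.

```python
import json


def low_confidence(log_lines, threshold=0.6, model_version=None):
    """Yield parsed JSON log entries whose confidence is below `threshold`."""
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        confidence = entry.get("confidence")
        if not isinstance(confidence, (int, float)):
            continue
        if model_version and entry.get("model_version") != model_version:
            continue
        if confidence < threshold:
            yield entry


lines = [
    '{"request_id": "a1", "model_version": "v1", "confidence": 0.93}',
    '{"request_id": "b2", "model_version": "v1", "confidence": 0.41}',
    '{"request_id": "c3", "model_version": "v2", "confidence": 0.30}',
    'plain text line that is not JSON',
]
suspicious = list(low_confidence(lines, model_version="v1"))
print([entry["request_id"] for entry in suspicious])
```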
AI Monitoring Tools Comparison
ML-specific monitoring tools:
| Tool | Type | Cost | Best For |
|---|---|---|---|
| **Prometheus + Grafana** | General metrics | Free | Infrastructure + custom |
| **Evidently AI** | ML monitoring | Free/Open | Drift detection |
| **WhyLabs** | ML observability | Freemium | Full ML monitoring |
| **MLflow** | Experiment tracking | Free | Model versioning |
| **Arize AI** | ML observability | Paid | Enterprise |
| **Neptune.ai** | Experiment tracking | Freemium | Research teams |
| **Datadog** | Full stack | Paid | Enterprise |
| **New Relic** | APM | Freemium | Application perf |
Recommended stack (Free):
- Prometheus + Grafana: Metrics & dashboards
- Loki: Log aggregation
- Jaeger: Distributed tracing
- Evidently AI: ML drift detection
- MLflow: Model tracking
Total license cost: $0! A full monitoring stack built entirely from free tools!
AI Monitoring Architecture
AI APP MONITORING ARCHITECTURE
- Users -> requests -> AI API (FastAPI, /predict)
- AI API -> Metrics -> Prometheus -> Grafana Dashboards -> Alertmanager -> Slack / PagerDuty
- AI API -> Model Server -> Prediction Store (DB) -> Evidently AI (Drift Check, Cron: Daily) -> Drift Alert + Retrain Trigger
- AI API -> Logs (Structlog) -> Loki (Log Aggreg.) -> Grafana Explore
Quick Setup โ Docker Compose Stack
Full monitoring stack with one command!
The full monitoring stack is ready in about 5 minutes! Import pre-built Grafana dashboards and you're done!
Prompt: Design Monitoring System
Summary
Key takeaways:
- Three pillars = Metrics + Logs + Traces: you need all three
- AI-specific metrics = Accuracy, drift, confidence: go beyond CPU/memory
- Model drift = Silent killer: detect it with Evidently AI
- Prometheus + Grafana = Free, powerful monitoring stack
- Smart alerting = Severity levels, runbooks, no alert fatigue
- Structured logging = JSON logs, request IDs, model versions
Action item: Add the Prometheus client to your AI project and expose 3 custom metrics (latency, prediction count, confidence). Then create a dashboard in Grafana!
Next article: Scalable AI Architecture, designing for millions of users!
Mini Challenge
Challenge: Setup Monitoring Dashboard (Prometheus + Grafana)
Real monitoring setup: track your AI app's performance!
Step 1: Python Flask App with Metrics
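A possible starting point for this step, assuming Flask and `prometheus_client` are installed. The route names, metric names, and the `fake_model` placeholder are illustrative assumptions; swap in your own model call.

```python
import random
import time

from flask import Flask, jsonify, request
from prometheus_client import (
    CONTENT_TYPE_LATEST, CollectorRegistry, Counter, Histogram, generate_latest,
)

app = Flask(__name__)

# Dedicated registry so the example does not clash with other metrics.
registry = CollectorRegistry()
PREDICTIONS = Counter("challenge_predictions_total", "Predictions served",
                      ["outcome"], registry=registry)
LATENCY = Histogram("challenge_inference_seconds", "Inference latency",
                    registry=registry)


def fake_model(text):
    """Placeholder for a real model: returns a label and a confidence."""
    label = "positive" if len(text) % 2 == 0 else "negative"
    return label, random.uniform(0.5, 1.0)


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True) or {}
    start = time.perf_counter()
    label, confidence = fake_model(payload.get("text", ""))
    LATENCY.observe(time.perf_counter() - start)
    PREDICTIONS.labels(outcome=label).inc()
    return jsonify({"label": label, "confidence": round(confidence, 3)})


@app.route("/metrics")
def metrics():
    # Prometheus scrapes this endpoint in its text exposition format.
    return generate_latest(registry), 200, {"Content-Type": CONTENT_TYPE_LATEST}
```

If the file is saved as `app.py` (a hypothetical name), `flask --app app run` starts the server and `/metrics` becomes the scrape target for Step 3.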
Step 2: Docker Container
Step 3: Prometheus Configuration
Step 4: Start Prometheus
Step 5: Grafana Dashboard
Step 6: Alerts Setup
Step 7: Load Test & Monitor
Completion Time: 2-3 hours
Tools: Prometheus, Grafana, Flask
Production-ready monitoring
Interview Questions
Q1: RED vs USE metrics: what is the difference, and which matters for AI apps?
A: RED = Request rate, Error rate, Duration. USE = Utilization, Saturation, Errors. Use RED for API/service endpoints (a good fit for inference) and USE for infrastructure (CPU, memory, disk). AI apps need both: RED tracks the health of the inference endpoint, USE tracks resource health.
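To make the RED side concrete, here is a small sketch that derives the three numbers from raw request records. The `RequestRecord` type and the sample values are hypothetical; in practice Prometheus `rate()` queries over counters and histograms produce the same figures.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    timestamp: float   # seconds since the start of the window
    duration_s: float  # time spent serving the request
    ok: bool           # False when the prediction failed


def red_summary(records, window_s):
    """Compute Rate, Errors and Duration over a window of `window_s` seconds."""
    if not records:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "avg_duration_s": 0.0}
    n = len(records)
    errors = sum(1 for r in records if not r.ok)
    return {
        "rate_per_s": n / window_s,
        "error_ratio": errors / n,
        "avg_duration_s": sum(r.duration_s for r in records) / n,
    }


window = [
    RequestRecord(0.0, 0.1, True),
    RequestRecord(0.5, 0.3, False),
    RequestRecord(1.0, 0.2, True),
    RequestRecord(1.5, 0.2, True),
]
summary = red_summary(window, window_s=2.0)
print(summary)
```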
Q2: Alert fatigue: how do you handle too many false alerts?
A: Tune thresholds carefully by testing them. Use composite alerts (multiple conditions), time-based thresholds (different values for peak hours), severity levels (critical vs warning), and an on-call rotation. Attach a runbook to every alert so the responder knows how to fix it. Best practice: alert on business metrics (customer impact) and avoid low-level infrastructure alert spam.
Q3: Distributed tracing: why is it needed in large systems?
A: In a microservice architecture a request crosses multiple services, so a single log line is not enough; you need to trace the entire journey. Tools: Jaeger, Zipkin. Each span records the service name, latency, and status. Traces identify bottlenecks (which service is slow) and help debug errors (which service failed). In AI systems, separate spans for model inference, database, and cache show which component is slow.
Q4: Model performance monitoring: how do you detect model drift?
A: Establish a baseline accuracy for the production model. Monitor prediction confidence and, when labels become available later, actual vs predicted outcomes. A drop in accuracy signals model drift (the data distribution has changed). Solution: retrain the model, A/B test the new version, and roll it out as a canary. This is critical for long-running AI systems.
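A minimal sketch of the baseline comparison described in this answer, assuming ground-truth labels arrive some time after the prediction. The `AccuracyMonitor` class, the window size, and the reading of "drop" as 5 percentage points are assumptions for illustration.

```python
from collections import deque


class AccuracyMonitor:
    """Track rolling accuracy once ground-truth labels arrive."""

    def __init__(self, baseline, window=500, max_drop=0.05):
        self.baseline = baseline              # accuracy measured at deploy time
        self.max_drop = max_drop              # allowed drop, in absolute terms
        self.pending = {}                     # request_id -> predicted label
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def record_prediction(self, request_id, predicted):
        self.pending[request_id] = predicted

    def record_label(self, request_id, actual):
        """Join a late-arriving label with its stored prediction."""
        predicted = self.pending.pop(request_id, None)
        if predicted is not None:
            self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self):
        acc = self.accuracy()
        return acc is not None and (self.baseline - acc) > self.max_drop


monitor = AccuracyMonitor(baseline=0.95, window=10)
for request_id in range(10):
    monitor.record_prediction(request_id, "positive")
for request_id in range(10):
    # Labels arrive later; 3 of the 10 predictions turn out to be wrong.
    monitor.record_label(request_id, "positive" if request_id < 7 else "negative")
print(monitor.accuracy(), monitor.drifted())
```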
Q5: Logging strategy: what is the performance impact?
A: Synchronous logging is slow; asynchronous logging (on a background thread) is better. Sampling: instead of logging 100% of requests, log a random 10% sample as a balance. Structured logging: use JSON, which is easy to parse. Log levels: DEBUG (dev only), INFO (production), ERROR (critical). Too much logging means storage cost and noise. The right balance is visibility without overhead.
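The async and sampling ideas from this answer can be sketched with the standard library's `QueueHandler` and `QueueListener`. The `SamplingFilter` class and the logger name are assumptions; the demo uses a sample rate of 0 so that its output is deterministic.

```python
import io
import logging
import logging.handlers
import queue
import random


class SamplingFilter(logging.Filter):
    """Keep every WARNING+ record but only a fraction of lower-level ones."""

    def __init__(self, sample_rate=0.1, rng=None):
        super().__init__()
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        return self.rng.random() < self.sample_rate


def build_async_logger(target_handler, sample_rate=0.1, rng=None):
    """Return (logger, listener); records are written by a background thread."""
    log_queue = queue.Queue(-1)  # unbounded queue between app and writer
    queue_handler = logging.handlers.QueueHandler(log_queue)
    queue_handler.addFilter(SamplingFilter(sample_rate, rng))
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    logger = logging.getLogger("ai_api.async")
    logger.handlers = [queue_handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    listener.start()
    return logger, listener


stream = io.StringIO()
logger, listener = build_async_logger(logging.StreamHandler(stream),
                                      sample_rate=0.0)
logger.info("sampled out")   # dropped: INFO records are sampled at 0%
logger.error("always kept")  # errors bypass sampling
listener.stop()              # flushes the queue and joins the worker thread
print(stream.getvalue())
```

The request thread only pays for putting the record on a queue; the slow write happens on the listener's thread.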
Frequently Asked Questions
What is model drift?