
Monitoring AI apps

Intermediate · ⏱ 15 min read · 📅 Updated: 2026-02-17

Introduction

Your AI model is deployed to production. Users are using it. Everything looks fine... but is it? 🤔


Scary truth: AI models degrade silently. A traditional app that breaks will crash or throw an error. An AI model just starts giving wrong predictions: no error, no crash. Users get a bad experience, and you never find out!


Real example: Zillow's AI home-pricing model silently drifted, and the company lost roughly $500 million! 😱


Monitoring is your AI app's eyes and ears. In this article we'll go hands-on through AI-specific monitoring, a Prometheus + Grafana setup, and model drift detection! 📊

Three Pillars of Observability

Observability = understanding a system's internal state from its external outputs.


1. Metrics 📊: numbers over time

  • CPU usage: 75%
  • Request latency: 120ms
  • Model accuracy: 94.2%
  • Predictions per second: 500

2. Logs 📝: event records

code
[2026-02-17 10:30:15] INFO: Prediction request - input_size=512, model=v2.3
[2026-02-17 10:30:15] INFO: Inference completed - latency=45ms, confidence=0.92
[2026-02-17 10:30:16] WARN: Low confidence prediction - confidence=0.34

3. Traces 🔍: request journey tracking

code
Request → API Gateway (5ms) → Pre-process (20ms) →
Model Inference (45ms) → Post-process (10ms) → Response (80ms total)

Pillar  | Question It Answers   | Tool
Metrics | "How much? How fast?" | Prometheus
Logs    | "What happened?"      | ELK Stack, Loki
Traces  | "Where did time go?"  | Jaeger, Zipkin

You need all three! Metrics fire the alert, logs show the root cause, and traces identify the bottleneck. 🎯

AI-Specific Metrics: What to Track

Normal infra metrics PLUS these AI-specific ones:


🎯 Model Performance Metrics:

Metric                | What                   | Alert When
Accuracy              | Prediction correctness | Drops >5%
Latency (p50/p95/p99) | Inference time         | p99 > 2s
Throughput            | Predictions/second     | Drops >20%
Error rate            | Failed predictions     | > 1%
Confidence scores     | Model certainty        | Avg drops below 0.7

📊 Data Quality Metrics:

Metric            | What                      | Alert When
Feature drift     | Input distribution change | Significant shift
Missing values    | Null/NaN in inputs        | > 5%
Data volume       | Requests per hour         | Unusual spike/drop
Schema violations | Unexpected input format   | Any occurrence

🔄 Model Drift Metrics:

Metric                           | What                  | Alert When
PSI (Population Stability Index) | Distribution shift    | PSI > 0.2
KL Divergence                    | Statistical distance  | Significant increase
Prediction distribution          | Output pattern change | Unexpected shift

Pro tip: show real-time accuracy on your dashboard; it is the single most important metric for an AI app! 📈

Prometheus + Grafana Setup

✅ Example

Step 1: Expose metrics from your Python app

python
from fastapi import FastAPI, Response
from prometheus_client import (
    Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
)
import time

app = FastAPI()

# Define metrics
PREDICTIONS = Counter(
    'model_predictions_total',
    'Total predictions',
    ['model_version', 'status']
)

LATENCY = Histogram(
    'model_inference_seconds',
    'Inference latency',
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0]
)

ACCURACY = Gauge(
    'model_accuracy',
    'Current model accuracy'
)

CONFIDENCE = Histogram(
    'prediction_confidence',
    'Prediction confidence scores',
    buckets=[0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95]
)

# Use in your prediction endpoint
# (`model` and `PredictRequest` come from your own application code)
@app.post("/predict")
async def predict(request: PredictRequest):
    start = time.time()
    result = model.predict(request.data)
    duration = time.time() - start

    LATENCY.observe(duration)
    PREDICTIONS.labels(
        model_version="v2.3", status="success"
    ).inc()
    CONFIDENCE.observe(result.confidence)

    return result

# Expose the metrics for Prometheus to scrape
@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

Step 2: Add a Prometheus scrape config and build a Grafana dashboard! 📊
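For reference, a minimal scrape job for the app above might look like this (a sketch; the port 8000 and the job name are assumptions about your deployment):

```yaml
# prometheus.yml
global:
  scrape_interval: 15s            # how often Prometheus pulls /metrics

scrape_configs:
  - job_name: 'ai-api'
    static_configs:
      - targets: ['localhost:8000']   # the FastAPI app exposing /metrics
```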

Model Drift: The Silent Killer

Model drift is the #1 enemy of AI apps! 95% accuracy at training time can drop to 70% within 3 months: no errors, no crashes, just bad predictions. 😰


Types of drift:


1. Data Drift (Input distribution changes)

  • Training: Mostly English text
  • Production: suddenly a mix of Tamil and Hindi text starts arriving
  • The model gets confused!

2. Concept Drift (Relationship changes)

  • Training: "Work from home" = negative sentiment (pre-COVID)
  • Production: "Work from home" = positive sentiment (post-COVID)
  • Same input, different meaning!

3. Prediction Drift (Output distribution changes)

  • Training: 50% positive, 50% negative predictions
  • Production: 90% positive, something is wrong!

Detection code:

python
# Evidently's preset-based API (module paths vary slightly across
# Evidently versions; this matches the 0.4.x layout)
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

# Compare training vs production data
report = Report(metrics=[DataDriftPreset()])
report.run(
    reference_data=training_df,
    current_data=production_df
)

# Get the drift flag (nested under the first metric's result)
drift_results = report.as_dict()
if drift_results["metrics"][0]["result"]["dataset_drift"]:
    alert("⚠️ Data drift detected! Retrain needed!")  # alert() = your own notifier

Rule: a weekly drift check is mandatory for production AI! 🔄
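If pulling in a full drift-detection library feels heavy, a per-feature two-sample Kolmogorov-Smirnov test is a lightweight first pass. A sketch (the `feature_drifted` helper and the 0.05 significance threshold are my own choices, not a standard):

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(reference, production, alpha=0.05):
    """Two-sample KS test: a low p-value means the distributions likely differ."""
    statistic, p_value = ks_2samp(reference, production)
    return p_value < alpha

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5_000)   # stored training sample
prod_same = rng.normal(0.0, 1.0, 5_000)       # production, same distribution
prod_shifted = rng.normal(0.5, 1.0, 5_000)    # production, shifted by 0.5 sigma

print(feature_drifted(train_feature, prod_same))     # usually False (5% false-positive rate by design)
print(feature_drifted(train_feature, prod_shifted))  # True
```

Run it per feature in a scheduled job; with thousands of samples even modest shifts produce vanishingly small p-values.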

AI Monitoring Dashboard Design

An effective AI monitoring dashboard:


Row 1: Health Overview 🟢

  • Model version (current)
  • Uptime percentage
  • Total predictions today
  • Current error rate

Row 2: Performance Metrics 📈

  • Accuracy trend (last 7 days)
  • Latency p50/p95/p99 chart
  • Throughput (requests/sec)
  • Confidence score distribution

Row 3: Data Quality 📊

  • Feature drift indicators
  • Missing value percentage
  • Input volume trend
  • Schema violation count

Row 4: Infrastructure 🖥️

  • CPU/Memory usage
  • GPU utilization (for inference)
  • Disk space
  • Network I/O

Grafana dashboard JSON:

json
{
  "panels": [
    {
      "title": "Model Accuracy (7d)",
      "type": "timeseries",
      "targets": [{
        "expr": "model_accuracy"
      }]
    },
    {
      "title": "Inference Latency",
      "type": "histogram",
      "targets": [{
        "expr": "histogram_quantile(0.95, model_inference_seconds_bucket)"
      }]
    }
  ]
}

Pro tip: follow the RED method: Rate, Errors, Duration. Track these for every service! 🎯
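As a concrete example, the RED method over the metrics defined earlier can be captured as Prometheus recording rules (a sketch; the rule names are my own convention, and the `status="error"` label value assumes you increment the counter with that status on failures):

```yaml
groups:
  - name: red_method
    rules:
      # Rate: predictions per second
      - record: 'job:prediction_rate:5m'
        expr: 'sum(rate(model_predictions_total[5m]))'
      # Errors: fraction of failed predictions
      - record: 'job:error_ratio:5m'
        expr: 'sum(rate(model_predictions_total{status="error"}[5m])) / sum(rate(model_predictions_total[5m]))'
      # Duration: p95 inference latency
      - record: 'job:latency_p95:5m'
        expr: 'histogram_quantile(0.95, sum by (le) (rate(model_inference_seconds_bucket[5m])))'
```

Precomputing these keeps dashboard panels and alert rules fast and consistent, since they all read the same series.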

Smart Alerting Strategy

Alert fatigue is the biggest monitoring mistake! With 1000 alerts, nobody cares. Set up smart alerting instead:


🔴 Critical (PagerDuty: wake someone up):

  • Model accuracy drops >10% in 1 hour
  • Error rate >5%
  • All instances down
  • Inference latency >5s sustained

🟡 Warning (Slack notification):

  • Accuracy drops >5% in 24 hours
  • Drift detected (PSI > 0.2)
  • Latency p99 >2s
  • Disk usage >80%
  • Confidence avg drops below 0.6

🔵 Info (Dashboard only):

  • New model deployed
  • Retrain job completed
  • Traffic pattern change
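This three-tier scheme maps directly onto Alertmanager routing. A sketch (receiver names, the Slack webhook, and the PagerDuty key are placeholders):

```yaml
# alertmanager.yml
route:
  receiver: slack-warnings            # default: everything goes to Slack
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall      # critical alerts wake someone up

receivers:
  - name: slack-warnings
    slack_configs:
      - channel: '#ai-alerts'
        api_url: 'https://hooks.slack.com/services/XXX'   # placeholder webhook
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: 'YOUR-PAGERDUTY-KEY'                 # placeholder key
```

Info-level events need no Alertmanager route at all; they stay on the dashboard.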

Prometheus alerting rules:

yaml
groups:
  - name: ai_model_alerts
    rules:
      - alert: ModelAccuracyDrop
        expr: model_accuracy < 0.85
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy dropped below 85%!"

      - alert: HighInferenceLatency
        expr: histogram_quantile(0.99, model_inference_seconds_bucket) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency exceeds 2 seconds"

Golden rule: every alert needs a runbook: when the alert fires, what to do, step by step! 📋

Structured Logging for AI

💡 Tip

Structured logging is a MUST for AI apps:

python
import structlog

# Render logs as JSON so they are easy to query in Loki/ELK
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

logger = structlog.get_logger()

# Log every prediction
logger.info("prediction_complete",
    request_id="req-abc123",
    model_version="v2.3",
    input_tokens=512,
    latency_ms=45,
    confidence=0.92,
    prediction="positive",
    feature_hash="sha256:abc..."  # Input fingerprint
)

What to log for AI:

✅ Request ID (trace across services)

✅ Model version (which model served)

✅ Input metadata (size, type; NOT the actual data!)

✅ Latency breakdown (preprocess, inference, postprocess)

✅ Confidence score

✅ Prediction result

❌ NEVER log PII (names, emails)

❌ NEVER log full input data (privacy + storage cost)

Log aggregation: Loki + Grafana (free) or ELK Stack. Query example:

code
{app="ai-api"} | json | confidence < 0.5

↑ Filter for low-confidence predictions to spot potential issues! 🔍
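One detail worth planning for is log volume: at thousands of predictions per hour, logging every request gets expensive. A common mitigation is sampling routine INFO logs while always keeping warnings and errors. A stdlib sketch (the class name and 10% rate are my own choices):

```python
import logging
import random

class InfoSamplingFilter(logging.Filter):
    """Let all WARNING+ records through, but only a random sample of INFO."""

    def __init__(self, sample_rate=0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True                       # never drop warnings/errors
        return random.random() < self.sample_rate

logger = logging.getLogger("ai-api")
logger.addFilter(InfoSamplingFilter(sample_rate=0.1))
```

A 10% sample is usually enough to see latency and confidence trends while cutting storage cost by ~90%.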

AI Monitoring Tools Comparison

ML-specific monitoring tools:


Tool                 | Type                | Cost      | Best For
Prometheus + Grafana | General metrics     | Free      | Infrastructure + custom
Evidently AI         | ML monitoring       | Free/Open | Drift detection
WhyLabs              | ML observability    | Freemium  | Full ML monitoring
MLflow               | Experiment tracking | Free      | Model versioning
Arize AI             | ML observability    | Paid      | Enterprise
Neptune.ai           | Experiment tracking | Freemium  | Research teams
Datadog              | Full stack          | Paid      | Enterprise
New Relic            | APM                 | Freemium  | Application perf

Recommended stack (Free):

  • 📊 Prometheus + Grafana: metrics & dashboards
  • 📝 Loki: log aggregation
  • 🔍 Jaeger: distributed tracing
  • 🤖 Evidently AI: ML drift detection
  • 📦 MLflow: model tracking

Total cost: $0! Full enterprise-grade monitoring for free! 🎉

AI Monitoring Architecture

๐Ÿ—๏ธ Architecture Diagram
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚           AI APP MONITORING ARCHITECTURE               โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                         โ”‚
โ”‚  ๐Ÿ“ฑ Users                                               โ”‚
โ”‚    โ”‚ requests                                           โ”‚
โ”‚    โ–ผ                                                    โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                                      โ”‚
โ”‚  โ”‚   AI API     โ”‚โ”€โ”€โ”€โ”€ Metrics โ”€โ”€โ”€โ”€โ–ถ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚
โ”‚  โ”‚  (FastAPI)   โ”‚                    โ”‚ Prometheus โ”‚    โ”‚
โ”‚  โ”‚              โ”‚โ”€โ”€โ”€โ”€ Logs โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ  โ”‚            โ”‚    โ”‚
โ”‚  โ”‚  /predict    โ”‚                    โ””โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                    โ”Œโ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚
โ”‚         โ”‚                            โ”‚  Grafana   โ”‚    โ”‚
โ”‚    โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”                       โ”‚ Dashboards โ”‚    โ”‚
โ”‚    โ”‚  Model  โ”‚                       โ””โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ”‚
โ”‚    โ”‚ Server  โ”‚                             โ”‚           โ”‚
โ”‚    โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜                       โ”Œโ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚
โ”‚         โ”‚                            โ”‚Alertmanagerโ”‚    โ”‚
โ”‚    โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                 โ””โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜    โ”‚
โ”‚    โ”‚ Prediction    โ”‚                    โ”‚    โ”‚         โ”‚
โ”‚    โ”‚ Store (DB)    โ”‚              Slack โ—€โ”˜    โ””โ–ถPagerDuty โ”‚
โ”‚    โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                                   โ”‚
โ”‚         โ”‚                                              โ”‚
โ”‚    โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”               โ”‚
โ”‚    โ”‚ Evidently AI  โ”‚โ”€โ”€โ”€โ–ถโ”‚ Drift Alert โ”‚               โ”‚
โ”‚    โ”‚ (Drift Check) โ”‚    โ”‚ + Retrain   โ”‚               โ”‚
โ”‚    โ”‚  Cron: Daily  โ”‚    โ”‚   Trigger   โ”‚               โ”‚
โ”‚    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜               โ”‚
โ”‚                                                         โ”‚
โ”‚    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                     โ”‚
โ”‚    โ”‚    Loki     โ”‚โ—€โ”€โ”€โ”€โ”‚   Logs   โ”‚                     โ”‚
โ”‚    โ”‚(Log Aggreg.)โ”‚    โ”‚(Structlog)โ”‚                    โ”‚
โ”‚    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                     โ”‚
โ”‚           โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ Grafana Explore                  โ”‚
โ”‚                                                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Quick Setup: Docker Compose Stack

✅ Example

The full monitoring stack in one command!

yaml
# docker-compose.monitoring.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]

  alertmanager:
    image: prom/alertmanager:latest
    ports: ["9093:9093"]

  ai-api:
    build: ./ai-app
    ports: ["8000:8000"]
    environment:
      - PROMETHEUS_ENABLED=true

bash
docker-compose -f docker-compose.monitoring.yml up -d
# Grafana: http://localhost:3000
# Prometheus: http://localhost:9090

A full monitoring stack ready in 5 minutes! Import pre-built Grafana dashboards and you're done! 🚀

Prompt: Design Monitoring System

📋 Copy-Paste Prompt
You are an MLOps Engineer specializing in AI observability.

Design a comprehensive monitoring system for:
- Sentiment analysis API (FastAPI + HuggingFace model)
- 10,000 predictions per hour
- Deployed on Kubernetes (3 replicas)
- Must detect model drift within 24 hours

Provide:
1. Complete Prometheus metrics to expose (Python code)
2. Grafana dashboard JSON with 8 panels
3. Alerting rules (critical + warning)
4. Drift detection pipeline (Evidently AI)
5. Runbook for "Model Accuracy Drop" alert
6. Cost estimation for monitoring infrastructure

Summary

Key takeaways:


✅ Three pillars = Metrics + Logs + Traces; you need all three

✅ AI-specific metrics = accuracy, drift, confidence; go beyond CPU/memory

✅ Model drift = the silent killer; detect it with Evidently AI

✅ Prometheus + Grafana = free, powerful monitoring stack

✅ Smart alerting = severity levels, runbooks, no alert fatigue

✅ Structured logging = JSON logs, request IDs, model versions


Action item: add the Prometheus client to your AI project, expose 3 custom metrics (latency, prediction count, confidence), and build a dashboard in Grafana! 📊


Next article: Scalable AI Architecture: designing for millions of users! 🏗️

🎮 Mini Challenge

Challenge: Set Up a Monitoring Dashboard (Prometheus + Grafana)


A real monitoring setup: track your AI app's performance! 📊


Step 1: Python Flask App with Metrics 🐍

python
from flask import Flask, Response, request
from prometheus_client import (
    Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
)

app = Flask(__name__)

# Define metrics
request_count = Counter(
    'ai_predictions_total',
    'Total predictions',
    ['model_name']
)

latency = Histogram(
    'ai_prediction_latency_seconds',
    'Prediction latency',
    ['model_name']
)

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    with latency.labels(model_name="bert").time():
        # Model inference (`model` is your loaded model object)
        result = model.predict(data)
        request_count.labels(model_name="bert").inc()
        return {"result": result}

@app.route("/metrics")
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

Step 2: Docker Container 🐳

bash
docker build -t ai-monitoring:latest .
docker run -p 5000:5000 ai-monitoring:latest

Step 3: Prometheus Configuration 🔍

yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'ai-app'
    static_configs:
      - targets: ['localhost:5000']

Step 4: Start Prometheus 🚀

bash
docker run -p 9090:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
# Visit: localhost:9090

Step 5: Grafana Dashboard 📈

bash
docker run -p 3000:3000 grafana/grafana

# Browser: localhost:3000
# Login: admin/admin
# Add Prometheus as data source
# Dashboard create:
#   - Panel 1: Request rate
#   - Panel 2: Latency percentiles
#   - Panel 3: Error rate

Step 6: Alerts Setup 🚨

yaml
# alert.rules.yml
groups:
  - name: ai_app_alerts
    rules:
      - alert: HighLatency
        # Histograms expose *_bucket series; derive the p99 from them
        expr: histogram_quantile(0.99, sum by (le) (rate(ai_prediction_latency_seconds_bucket[5m]))) > 2
        for: 5m
        annotations:
          summary: "High prediction latency"

      - alert: HighErrorRate
        # Assumes an ai_errors_total counter incremented on failures
        expr: rate(ai_errors_total[5m]) > 0.05
        for: 2m
        annotations:
          summary: "Error rate > 5%"

Step 7: Load Test & Monitor 📊

bash
# Generate traffic (ab needs -p/-T to send POST requests;
# payload.json is a file with a sample request body)
ab -n 1000 -c 10 -p payload.json -T application/json http://localhost:5000/predict

# Watch the metrics spike in Grafana and verify performance

Completion Time: 2-3 hours

Tools: Prometheus, Grafana, Flask

Production-ready monitoring โญ

💼 Interview Questions

Q1: RED vs USE metrics: what's the difference, and which matters for AI apps?

A: RED = Request rate, Error rate, Duration. USE = Utilization, Saturation, Errors. RED suits API/service endpoints (perfect for inference); USE suits infrastructure (CPU, memory, disk). For AI apps both matter: RED tracks inference quality, USE tracks resource health.


Q2: Alert fatigue: how do you deal with too many false alerts?

A: Tune thresholds carefully (validate them in testing). Use composite alerts (multiple conditions), time-based thresholds (different limits during peak hours), severity levels (critical vs warning), and an on-call rotation. Attach a runbook to every alert (when it fires, how do you fix it?). Best practice: alert on business metrics (customer impact) and avoid low-level infrastructure alert spam.


Q3: Distributed tracing: why is it needed in large systems?

A: In microservices a request crosses multiple services, so no single log line tells the story; you trace the entire journey. Tools: Jaeger, Zipkin. Each span records service name, latency, and status, so you can identify bottlenecks (which service is slow) and debug errors (which service failed). In AI systems the model-inference span, database span, and cache span pinpoint the slow component.


Q4: Model performance monitoring: how do you detect model drift?

A: Establish a baseline accuracy for the production model. Monitor prediction confidence and actual vs predicted outcomes (when labels become available later). An accuracy drop signals model drift (a data-distribution change). Solutions: retrain the model, A/B test the new version, roll out via canary. Critical for long-running AI systems.


Q5: Logging strategy: what about the performance impact?

A: Synchronous logging is slow; async logging (a background thread) is better. Use sampling: instead of logging 100% of requests, a 10% random sample balances cost and visibility. Use structured JSON logs (easy to parse) and appropriate log levels: DEBUG (dev only), INFO (production), ERROR (critical). Too much logging means storage cost and noise; the right balance gives visibility without overhead.

Frequently Asked Questions

โ“ AI app monitoring normal app monitoring la irundhu eppadi different?
Normal apps โ€” CPU, memory, response time monitor pannunga. AI apps ku EXTRA ah model accuracy, prediction drift, data quality, feature distributions, inference latency โ€” ivangalayum track pannanum. Model silent ah degrade aagum โ€” errors throw pannaadhu but wrong predictions kudukum.
โ“ Model drift na enna?
Model drift = Production data training data la irundhu maradhu. Example: COVID time la shopping behavior complete ah change aachu โ€” old models fail aachu. Data drift (input changes) + Concept drift (relationship changes) โ€” rendu type irukku.
โ“ Best monitoring stack evadhu AI apps ku?
Prometheus + Grafana = metrics & dashboards. MLflow = model tracking. Evidently AI = data/model drift detection. WhyLabs = production ML monitoring. Start with Prometheus + Grafana โ€” free and powerful.
โ“ Monitoring setup panna yevlo time aagum?
Basic metrics (Prometheus + Grafana) โ€” 1-2 days. AI-specific metrics (drift, accuracy) โ€” 1 week. Full observability stack โ€” 2-3 weeks. Start basic, gradually add AI-specific monitoring.
โ“ Alerting eppadi setup pannanum?
Prometheus Alertmanager use pannunga. Critical alerts: model accuracy drop >5%, latency >2s, error rate >1%. Warning alerts: drift detected, disk >80%. Slack/PagerDuty ku route pannunga. Alert fatigue avoid pannunga โ€” only actionable alerts!