
Multi-cloud AI systems

Advanced · 15 min read · 📅 Updated: 2026-02-22

Introduction

Your AI app is running on a single server. Suddenly a blog post goes viral and traffic jumps 100x! 🚀 The server struggles, requests time out, users leave.


Load Balancing = distributing traffic across multiple servers

Auto Scaling = adding/removing servers based on demand


Together, your app can handle any amount of traffic, at an optimized cost! Netflix, Uber, ChatGPT: they all use exactly these techniques.


In this article we'll cover load balancing algorithms, auto scaling strategies, and AI-specific configurations, all production-ready! ⚖️

Load Balancing — How It Works

Load balancer = Traffic distributor between servers.


Without load balancer 😰:

code
Users ──▶ Single Server ──▶ CRASH! (overloaded)

With load balancer 😊:

code
          ┌──▶ Server 1 (handling 33%)
Users ──▶ LB ──▶ Server 2 (handling 33%)
          └──▶ Server 3 (handling 33%)

Load balancer types:


| Type | Layer | Routes By | Example |
|---|---|---|---|
| L4 (Transport) | TCP/UDP | IP + Port | AWS NLB |
| L7 (Application) | HTTP | URL, headers, cookies | AWS ALB, Nginx |
| DNS | DNS | Geographic location | Route 53, Cloudflare |

For AI apps, an L7 (Application) load balancer is best: it supports URL-path-based routing, health checks, and sticky sessions! 🎯


Key features:

  • ❤️ Health checks — no traffic is sent to unhealthy servers
  • 🔄 Session persistence — the same user is routed to the same server
  • 🔒 SSL termination — handles HTTPS
  • 📊 Monitoring — exposes traffic metrics

Load Balancing Algorithms

Different algorithms, different use cases:


1. Round Robin 🔄

code
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A  (back to start)
  • Simple, fair distribution
  • Problem: assumes all servers are equal

2. Weighted Round Robin ⚖️

code
Server A (weight: 3) → Gets 3 requests
Server B (weight: 2) → Gets 2 requests
Server C (weight: 1) → Gets 1 request
  • More traffic goes to the powerful servers
  • AI use: give GPU servers a higher weight
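The weighting logic above can be sketched in a few lines of Python (a naive illustration; real balancers like Nginx interleave the picks more smoothly):

```python
import itertools

def weighted_round_robin(servers):
    """Yield server names in proportion to their weights.

    `servers` is a list of (name, weight) pairs. This naive version
    emits each server `weight` times per cycle.
    """
    expanded = [name for name, weight in servers for _ in range(weight)]
    return itertools.cycle(expanded)

picks = weighted_round_robin([("A", 3), ("B", 2), ("C", 1)])
first_cycle = [next(picks) for _ in range(6)]
print(first_cycle)  # ['A', 'A', 'A', 'B', 'B', 'C']
```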

3. Least Connections 📉

code
Server A: 5 active connections → ❌ Skip
Server B: 2 active connections → ✅ Route here!
Server C: 8 active connections → ❌ Skip
  • Best for varying request durations
  • BEST for AI apps, since inference time varies!
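A minimal sketch of the least-connections decision (the server names and counts are hypothetical):

```python
def pick_least_connections(active_connections):
    """Pick the server with the fewest active connections.

    `active_connections` maps server name -> current connection count.
    """
    return min(active_connections, key=active_connections.get)

active = {"server-a": 5, "server-b": 2, "server-c": 8}
print(pick_least_connections(active))  # server-b
```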

4. IP Hash 🔗

code
User IP → Hash → Consistent server mapping
Same user always → Same server
  • Session persistence without cookies
  • Good for stateful AI conversations
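IP hash can be sketched with a stable digest (Python's built-in `hash()` is randomized per process, so it's avoided here). One caveat: with plain modulo, adding or removing a server reshuffles most mappings; consistent hashing avoids that.

```python
import hashlib

def pick_by_ip_hash(client_ip, servers):
    """Map a client IP deterministically to one server."""
    digest = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

servers = ["gpu-node-1", "gpu-node-2", "gpu-node-3"]
# Same IP always lands on the same server:
print(pick_by_ip_hash("203.0.113.7", servers) ==
      pick_by_ip_hash("203.0.113.7", servers))  # True
```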

5. Least Response Time

  • Route to fastest responding server
  • Considers both connections AND latency
  • Premium option — best performance

AI recommendation: Least Connections for inference APIs. One request takes 50ms, another takes 2s; least connections handles that! 🎯

Nginx Load Balancer Setup

Example

Production Nginx config for AI API:

nginx
# /etc/nginx/conf.d/ai-api.conf

upstream ai_inference {
    least_conn;  # Best for AI — variable inference times

    server gpu-node-1:8080 weight=3;  # A100 GPU
    server gpu-node-2:8080 weight=3;  # A100 GPU
    server gpu-node-3:8080 weight=1;  # T4 GPU (less powerful)

    # Reuse upstream connections (note: active health checks need NGINX Plus;
    # open-source Nginx marks servers down passively via max_fails/fail_timeout)
    keepalive 32;
}

upstream ai_api {
    least_conn;
    server api-node-1:8000;
    server api-node-2:8000;
    server api-node-3:8000;
}

server {
    listen 443 ssl;
    server_name api.myaiapp.com;

    # SSL
    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;

    # API routes → CPU servers
    location /api/ {
        proxy_pass http://ai_api;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
    }

    # Inference routes → GPU servers
    location /predict {
        proxy_pass http://ai_inference;
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;  # AI inference takes longer!
    }

    # Streaming (SSE for token generation)
    location /stream {
        proxy_pass http://ai_inference;
        proxy_buffering off;       # Disable buffering for SSE
        proxy_read_timeout 300s;   # Long timeout for streaming
    }
}

Key: API routes go to CPU servers, inference routes go to GPU servers — separate routing! ⚡

AWS ALB for AI Applications

AWS Application Load Balancer — managed, no maintenance:


Terraform config:

hcl
# Application Load Balancer
resource "aws_lb" "ai_alb" {
  name               = "ai-app-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnets
  security_groups    = [aws_security_group.alb.id]
}

# Target Group — API Servers
resource "aws_lb_target_group" "api" {
  name     = "ai-api-tg"
  port     = 8000
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    interval            = 15
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
  }

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 3600  # 1 hour session
  }
}

# Target Group — GPU Inference
resource "aws_lb_target_group" "inference" {
  name     = "ai-inference-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path     = "/health"
    interval = 30
    timeout  = 10  # GPU warm-up takes time
  }

  # Slow start — new GPU instance warm-up period
  slow_start = 120  # 2 min warm-up
}

# Path-based routing
resource "aws_lb_listener_rule" "inference" {
  listener_arn = aws_lb_listener.https.arn
  condition {
    path_pattern { values = ["/predict*", "/stream*"] }
  }
  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.inference.arn
  }
}

Slow start = when a new GPU instance is added, traffic ramps up gradually, giving the model time to load! 🐢➡️🚀

Auto Scaling — Core Concepts

Auto scaling = adjusting capacity based on demand.


Types:


1. Reactive Scaling 📈

  • Scale when a metric crosses a threshold
  • Example: CPU > 70% → add server
  • Lag: 2-5 minutes delay

2. Predictive Scaling 🔮

  • Analyze historical patterns and scale in advance
  • Example: traffic spikes every Monday 9 AM, so pre-scale
  • AWS supports this natively!

3. Scheduled Scaling

  • Fixed schedule based
  • Example: 10 servers during business hours, 2 at night
  • Cheapest option for predictable traffic

Auto Scaling Group (ASG) config:

code
Minimum: 2 instances (always running)
Desired: 4 instances (normal load)
Maximum: 20 instances (peak limit)
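Target tracking behaves roughly like "desired = current × (metric / target)", clamped to the min/max limits above. A simplified sketch of that decision (the real AWS algorithm adds cooldowns and scale-in protection):

```python
import math

def desired_capacity(current, metric, target, min_size=2, max_size=20):
    """Approximate target-tracking: scale capacity with the metric/target ratio.

    e.g. 4 instances at 90% CPU with a 60% target -> ceil(4 * 90/60) = 6.
    """
    desired = math.ceil(current * metric / target)
    return max(min_size, min(max_size, desired))

print(desired_capacity(4, metric=90, target=60))  # 6  (scale out)
print(desired_capacity(4, metric=30, target=60))  # 2  (scale in, floored at min)
```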

Scaling policies comparison:

| Policy | How It Works | Best For |
|---|---|---|
| Target Tracking | Maintain metric at target | Simple, effective |
| Step Scaling | Different actions at thresholds | Complex rules |
| Simple Scaling | One action per alarm | Basic |
| Predictive | ML-based prediction | Predictable patterns |

For AI apps: target tracking on a custom metric (inference queue length) is BEST! 🎯

AWS Auto Scaling for AI

Example

Complete auto scaling setup:

hcl
# Launch Template — GPU Instance
resource "aws_launch_template" "gpu" {
  name          = "ai-gpu-template"
  image_id      = "ami-deep-learning"
  instance_type = "g4dn.xlarge"

  user_data = base64encode(<<-EOF
    #!/bin/bash
    docker pull myapp/inference:latest
    docker run -d --gpus all -p 8080:8080 myapp/inference:latest
  EOF
  )
}

# Auto Scaling Group
resource "aws_autoscaling_group" "gpu_asg" {
  name                = "ai-gpu-asg"
  min_size            = 2
  max_size            = 15
  desired_capacity    = 3
  vpc_zone_identifier = var.private_subnets

  launch_template {
    id      = aws_launch_template.gpu.id
    version = "$Latest"
  }

  # Warm pool — pre-initialized instances
  warm_pool {
    pool_state                  = "Stopped"
    min_size                    = 2
    max_group_prepared_capacity = 5
  }

  instance_refresh {
    strategy = "Rolling"
    preferences { min_healthy_percentage = 80 }
  }
}

# Target Tracking — GPU Utilization
resource "aws_autoscaling_policy" "gpu_target" {
  name                   = "gpu-utilization-target"
  autoscaling_group_name = aws_autoscaling_group.gpu_asg.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    customized_metric_specification {
      metric_name = "GPUUtilization"
      namespace   = "Custom/AI"
      statistic   = "Average"
    }
    target_value = 65.0  # Scale when GPU > 65%
  }
}

Warm pool = pre-warmed instances kept ready, avoiding the ~2 min model-loading wait! GPU cold starts are painful; a warm pool solves that! 🔥

Kubernetes HPA + KEDA

Auto scaling in Kubernetes: HPA (Horizontal Pod Autoscaler) + KEDA:


Standard HPA:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 2
  maxReplicas: 30
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60  # Max 4 pods per minute
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scale down
      policies:
      - type: Pods
        value: 2
        periodSeconds: 120
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "5"

KEDA — Event-Driven Scaling (better for AI):

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-worker-scaler
spec:
  scaleTargetRef:
    name: ai-gpu-worker
  minReplicaCount: 0    # Scale to ZERO! 💰
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: pending_inference_requests
      threshold: "10"
      query: sum(inference_queue_pending)

KEDA advantage: scale to zero. No requests means 0 pods and zero cost! When a request arrives, it auto-scales back up. Perfect for AI batch workers! 🎯
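The queue-driven decision KEDA makes can be approximated as "replicas = pending / threshold", with zero pods when the queue is empty. A hedged sketch (KEDA/HPA add stabilization windows on top of this):

```python
import math

def keda_style_replicas(pending, threshold, max_replicas=20):
    """Approximate queue-based scaling with scale-to-zero."""
    if pending == 0:
        return 0                      # no work, no pods, no cost
    desired = math.ceil(pending / threshold)
    return min(max_replicas, desired)

print(keda_style_replicas(0, threshold=10))    # 0
print(keda_style_replicas(35, threshold=10))   # 4
```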

Load Balancing + Auto Scaling Architecture

🏗️ Architecture Diagram
┌──────────────────────────────────────────────────────┐
│     LOAD BALANCING + AUTO SCALING ARCHITECTURE       │
├──────────────────────────────────────────────────────┤
│                                                      │
│  📱 Users                                            │
│    │                                                 │
│    ▼                                                 │
│  ┌─────────┐     ┌──────────────┐                    │
│  │  DNS    │────▶│  Cloudflare  │ (DDoS protection)  │
│  │(Route53)│     │     CDN      │                    │
│  └─────────┘     └──────┬───────┘                    │
│                         ▼                            │
│              ┌──────────────────┐                    │
│              │   AWS ALB (L7)   │                    │
│              │  /api → API TG   │                    │
│              │  /predict → GPU  │                    │
│              └────┬─────────┬───┘                    │
│                   │         │                        │
│         ┌─────────▼──┐  ┌───▼────────┐               │
│         │  API ASG   │  │  GPU ASG   │               │
│         │ (CPU, t3)  │  │  (g4dn)    │               │
│         │ min: 2     │  │ min: 2     │               │
│         │ max: 20    │  │ max: 15    │               │
│         │ ┌──┐ ┌──┐  │  │ ┌──┐ ┌──┐  │               │
│         │ │A1│ │A2│  │  │ │G1│ │G2│  │               │
│         │ └──┘ └──┘  │  │ └──┘ └──┘  │               │
│         │ ┌──┐       │  │ ┌──┐       │               │
│         │ │A3│ ...   │  │ │G3│ ...   │               │
│         │ └──┘       │  │ └──┘       │               │
│         └─────┬──────┘  └─────┬──────┘               │
│               │               │                      │
│         ┌─────▼─────┐   ┌─────▼─────┐                │
│         │CloudWatch │   │CloudWatch │                │
│         │CPU: 60%   │   │GPU: 65%   │                │
│         │target     │   │queue: <10 │                │
│         └───────────┘   └───────────┘                │
│                                                      │
│  🔥 Warm Pool: 3 pre-initialized GPU instances       │
│  📊 Predictive Scaling: ML-based traffic prediction  │
│  ⏰ Scheduled: business-hours boost                  │
│                                                      │
└──────────────────────────────────────────────────────┘

Health Checks — Deep vs Shallow

Health checks = how the load balancer identifies healthy servers.


Shallow Health Check (Liveness) 🟢:

python
@app.get("/health")
def health():
    return {"status": "ok"}
# Just checks: "Is the process running?"

Deep Health Check (Readiness) 🔍:

python
from fastapi.responses import JSONResponse

@app.get("/health/ready")
async def readiness():
    checks = {
        "model_loaded": model is not None,
        "gpu_available": torch.cuda.is_available(),
        "db_connected": await db.ping(),
        "memory_ok": get_memory_usage() < 90,    # % of RAM in use
        "gpu_memory_ok": get_gpu_memory() < 85,  # % of VRAM in use
    }
    all_ok = all(checks.values())
    status = 200 if all_ok else 503
    return JSONResponse(checks, status_code=status)

AI-specific health checks:

| Check | Why | Failure Action |
|---|---|---|
| Model loaded? | GPU memory issues | Restart pod |
| GPU available? | Driver crash | Replace instance |
| Inference test? | Model corrupt | Reload model |
| Memory < 90%? | OOM risk | Scale up |
| Queue < 100? | Overloaded | Scale up |

Configuration:

code
Health check interval: 15 seconds
Timeout: 10 seconds (GPU warm-up)
Healthy threshold: 2 consecutive passes
Unhealthy threshold: 3 consecutive failures

Rule: always use deep health checks for AI servers; checking model loaded + GPU OK is mandatory! 🛡️

AI-Specific Scaling Patterns

💡 Tip

AI workloads need special scaling patterns:

🤖 1. Inference vs Training Separation

- Inference: Auto-scale with request volume

- Training: Scheduled, fixed capacity (spot instances)

- NEVER mix on same servers!

🤖 2. Model Warm-up Strategy

code
New instance → Load model (60s) → Warm-up inference (30s) → Ready!

- Maintain a warm pool

- Health checks pass only after warm-up completes
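That flow can be sketched as a startup hook which flips a readiness flag only after warm-up; `load_model`, `run_inference`, and `warmup_input` are stand-ins for your own model code:

```python
READY = False  # the readiness probe should return 503 until this flips

def warm_up(load_model, run_inference, warmup_input, n_runs=3):
    """Load the model, run a few dummy inferences, then mark the server ready."""
    global READY
    model = load_model()             # slow step (~60s for a large model)
    for _ in range(n_runs):          # warms GPU kernels / caches (~30s)
        run_inference(model, warmup_input)
    READY = True
    return model

# Example with stand-ins:
warm_up(lambda: "fake-model", lambda m, x: None, warmup_input="ping")
print(READY)  # True
```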

🤖 3. Batch vs Real-time Split

- Real-time: Always-on, 2+ instances minimum

- Batch: KEDA scale-to-zero, event-triggered

- Different ASGs for different workload types!

🤖 4. GPU Memory-Based Scaling

- GPU memory > 80% → Scale up (before OOM!)

- Publish a custom CloudWatch metric
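Publishing that metric could look like the sketch below; the namespace `Custom/AI` and metric name `GPUMemoryUtilization` are example choices, and the actual `put_metric_data` call (shown in a comment) needs `boto3` plus AWS credentials:

```python
def gpu_memory_percent(used_bytes, total_bytes):
    """Percent of GPU memory in use (feed from torch.cuda or pynvml)."""
    return 100.0 * used_bytes / total_bytes

def build_metric(value, name="GPUMemoryUtilization", namespace="Custom/AI"):
    """Build a CloudWatch put_metric_data payload for one datapoint."""
    return {
        "Namespace": namespace,
        "MetricData": [{"MetricName": name, "Value": value, "Unit": "Percent"}],
    }

payload = build_metric(gpu_memory_percent(13e9, 16e9))
print(payload["MetricData"][0]["Value"])  # 81.25
# With credentials configured, publish it:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**payload)
```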

🤖 5. Graceful Shutdown

python
# Don't kill mid-inference!
@app.on_event("shutdown")
async def shutdown():
    global model
    # Drain: let queued requests finish first
    await inference_queue.join()
    # Unload the model and free GPU memory cleanly
    model = None
    torch.cuda.empty_cache()

Without graceful shutdown, inference gets cut off mid-way and the user gets a bad response! ⚠️

Cost-Effective Scaling

Optimizing auto scaling costs:


Mixed Instance Strategy:

hcl
mixed_instances_policy {
  instances_distribution {
    on_demand_base_capacity                  = 2   # 2 on-demand (reliable)
    on_demand_percentage_above_base_capacity = 20  # 20% on-demand above base
    spot_allocation_strategy                 = "lowest-price"
  }
  launch_template {
    launch_template_specification {
      launch_template_id = aws_launch_template.gpu.id
    }
    override { instance_type = "g4dn.xlarge" }   # Primary
    override { instance_type = "g4dn.2xlarge" }  # Fallback
    override { instance_type = "g5.xlarge" }     # Alternative
  }
}

Cost breakdown:

| Strategy | Monthly Cost | Reliability |
|---|---|---|
| All on-demand | $10,000 | 99.99% |
| Mixed (80% spot) | $4,000 | 99.9% |
| All spot | $3,000 | 99% |
| Scheduled + spot | $3,500 | 99.9% |

Best combo:

  • Base: 2 on-demand instances (always on)
  • Scale: Spot instances for traffic spikes
  • Schedule: Reduce min capacity off-hours
  • Savings: ~60% compared to all on-demand! 💰

Spot interruption handling: when an instance is about to terminate, drain gracefully — complete the current requests, then shut down! 🛡️
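EC2 announces a spot interruption about two minutes in advance via the instance metadata endpoint (`/latest/meta-data/spot/instance-action`, which returns 404 until a notice exists). A sketch of a drain loop; the fetch function is injectable so the logic can be exercised off-EC2:

```python
import time
import urllib.error
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def fetch_notice():
    """Return the interruption notice body, or None if there is none."""
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=1) as resp:
            return resp.read().decode()
    except urllib.error.URLError:
        return None  # 404 / unreachable: no interruption pending

def watch_for_interruption(drain, fetch=fetch_notice,
                           poll_seconds=5, max_polls=None):
    """Poll for a spot notice; call drain() once, then stop."""
    polls = 0
    while max_polls is None or polls < max_polls:
        if fetch() is not None:
            drain()          # finish in-flight inferences, deregister from LB
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False
```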

Prompt: Design Load Balancing System

📋 Copy-Paste Prompt
You are a Cloud Architect specializing in AI infrastructure.

Design a load balancing and auto-scaling system for:
- AI image generation API (Stable Diffusion model)
- 5,000 concurrent users at peak
- Each generation takes 10-30 seconds
- Must handle 100x traffic spikes (viral events)
- Budget: $8,000/month baseline

Provide:
1. Load balancer configuration (algorithm choice + reasoning)
2. Auto scaling policy (min/max/scaling triggers)
3. Warm pool / pre-warming strategy
4. Queue-based architecture for long-running generations
5. Cost optimization with spot instances
6. Graceful handling of spot interruptions
7. Monitoring metrics and alerts

Summary

Key takeaways:


Load balancing = Least connections best for AI (variable inference times)

L7 routing = API → CPU servers, Inference → GPU servers

Auto scaling = Target tracking with custom metrics (queue depth, GPU%)

Warm pool = Pre-initialized GPU instances for fast scaling

KEDA = Scale-to-zero for batch AI workers

Health checks = Deep checks — model loaded + GPU available

Cost = Mixed instances (on-demand base + spot for scaling) = 60% savings


Action item: set up an ALB + ASG on AWS (free tier). Put 2 instances behind the ALB and add a CPU target tracking policy. Simulate traffic and watch auto scaling in action! ⚖️


Next article: Multi-Cloud AI Systems — avoiding vendor lock-in! ☁️☁️☁️

🏁 🎮 Mini Challenge

Challenge: Deploy Same App to Multiple Clouds


Avoid vendor lock-in: deploy to AWS, GCP, and Azure! ☁️☁️☁️


Step 1: Create a Cloud-Agnostic App 🐍

python
# app.py — no cloud-specific APIs
from fastapi import FastAPI
import os

app = FastAPI()

# Environment variables (cloud-agnostic)
MODEL_PATH = os.getenv("MODEL_PATH", "./models/default")
STORAGE_BUCKET = os.getenv("STORAGE_BUCKET", "local")
DB_URL = os.getenv("DB_URL", "sqlite:///local.db")

@app.post("/predict")
def predict(text: str):
    # `model` is assumed loaded from MODEL_PATH at startup
    result = model.predict(text)
    return {"result": result}

Step 2: Docker Container (Cloud Agnostic) 🐳

dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
EXPOSE 8000
# Serve the FastAPI app with uvicorn (add uvicorn to requirements.txt)
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
# No cloud-specific commands!

Step 3: AWS Deploy ☁️

bash
# Push to ECR
aws ecr get-login-password | docker login --username AWS --password-stdin <account>.dkr.ecr.us-east-1.amazonaws.com
docker tag app:latest <account>.dkr.ecr.us-east-1.amazonaws.com/app:latest
docker push <account>.dkr.ecr.us-east-1.amazonaws.com/app:latest

# Deploy via Elastic Container Service
aws ecs create-service --cluster ai-cluster --service-name app --task-definition app-task --desired-count 2

Step 4: GCP Deploy 🟡

bash
# Push to Artifact Registry
gcloud builds submit --tag gcr.io/<project>/app:latest

# Deploy to Cloud Run
gcloud run deploy app --image gcr.io/<project>/app:latest --platform managed --region us-central1

Step 5: Azure Deploy 🔵

bash
# Push to Container Registry
az acr build --registry <registry> --image app:latest .

# Deploy to Container Instances
az container create --resource-group rg --name app --image <registry>.azurecr.io/app:latest

Step 6: Configuration Per Cloud ⚙️

bash
# .env.aws
MODEL_PATH=s3://bucket/models/default
STORAGE_BUCKET=s3://bucket
DB_URL=mysql://rds-endpoint/db

# .env.gcp
MODEL_PATH=gs://bucket/models/default
STORAGE_BUCKET=gs://bucket
DB_URL=postgres://cloudsql-endpoint/db

# .env.azure
MODEL_PATH=https://blob.core.windows.net/models
STORAGE_BUCKET=https://blob.core.windows.net
DB_URL=postgresql://azure-db/db

Step 7: Abstraction Layer (SDK) 🔧

python
# cloud_adapter.py
import os

class StorageAdapter:
    @staticmethod
    def upload(file, bucket):
        provider = os.getenv("CLOUD_PROVIDER")
        if provider == "aws":
            import boto3
            s3 = boto3.client('s3')
            s3.upload_file(file, bucket, file)
        elif provider == "gcp":
            from google.cloud import storage
            client = storage.Client()
            client.bucket(bucket).blob(file).upload_from_filename(file)
        # the abstraction gives callers one cloud-agnostic interface

Step 8: Compare & Monitor 📊

bash
# Cost comparison:
# - AWS: $500/month
# - GCP: $480/month
# - Azure: $520/month
# Performance comparison: latency, throughput
# All 3 running simultaneously = multi-cloud safety!

Completion Time: 3-4 hours

Tools: AWS, GCP, Azure CLI, Docker

Multi-cloud expertise ⭐⭐⭐

💼 Interview Questions

Q1: How do you prevent vendor lock-in? What are the best practices?

A: Use open standards (Docker, Kubernetes). Avoid cloud-specific APIs; prefer SDKs with multi-cloud support. Keep config separate from code (environment variables). Go multi-cloud from the start as proof of portability. Store data in standard formats (Parquet, JSON). In code review, reject cloud-specific API usage.


Q2: What's the best approach to cost optimization across clouds?

A: Benchmark each cloud by testing the same workload. Pick the cheapest provider per service (e.g. AWS for compute, GCP for ML, Azure for databases). Hybrid setups enable cost arbitrage since prices vary, but monitoring overhead and complexity increase. Recommendation: one primary cloud plus one secondary for disaster recovery.


Q3: Are data transfer costs between clouds expensive?

A: Yes! Cross-region and cross-cloud egress typically runs around $0.08-0.12 per GB (roughly $80-120 per TB), while in-region transfer is usually free (AWS, GCP). Strategy: minimize transfers and process data where it lives. Data gravity: put compute near the data. Cache and replicate locally. Transfer costs can exceed compute costs.


Q4: How complex does CI/CD get for multi-cloud testing?

A: Pipeline: code → build image → test on AWS → test on GCP → test on Azure → deploy to the chosen cloud. Run the cloud tests in parallel to limit the time increase. Tools: Terraform (provision each cloud), Ansible (cloud-agnostic configuration), GitHub Actions matrix builds (parallel multi-cloud jobs).


Q5: Disaster recovery in multi-cloud: active-active or active-passive?

A: Active-passive: a primary cloud with a secondary on standby, switching over on failure. Active-active: both clouds serving traffic (more complex, higher cost). A common multi-cloud pattern: write to one, replicate, read from many. Replication lag of a few seconds is possible. Recovery: automated failover (DNS switch, load balancer update).

Frequently Asked Questions

What is a load balancer, in simple terms?
Like a traffic cop: it distributes incoming requests across multiple servers so no single server gets overloaded. If a server goes down, traffic automatically routes to the other servers, and users always get a fast response.

What is auto scaling?
Servers automatically increase or decrease based on traffic. Morning, 10 users: 1 server. Afternoon, 10,000 users: 10 servers auto-added. Night, 5 users: back to 1 server. You pay only for what you use!

Do AI apps need special load balancing?
Yes! AI inference is GPU-bound, so plain round robin doesn't work well. You need least-connections or GPU-utilization-based routing, adjusted timeouts for long-running inference requests, and WebSocket/SSE support for streaming responses.

Can I set auto scaling aggressively?
Careful! Too aggressive means cost spikes; too conservative means latency spikes. Start with target tracking (CPU 60-70%). For AI apps, use custom metrics: scaling on inference queue length or GPU utilization works better than CPU-based scaling.