โ† Back|CLOUD-DEVOPSโ€บSection 1/17
0 of 17 completed

Load Balancing & Auto Scaling for AI Systems

Advancedโฑ 15 min read๐Ÿ“… Updated: 2026-02-22

Introduction

Your AI app runs on a single server. Suddenly a blog post goes viral — traffic jumps 100x! 🚀 The server struggles, requests time out, and users leave.


Load Balancing = distributing traffic across multiple servers

Auto Scaling = adding/removing servers based on demand


Together, your app can handle any level of traffic at an optimized cost. Netflix, Uber, ChatGPT — they all rely on these techniques.


In this article we'll cover load balancing algorithms, auto scaling strategies, and AI-specific configurations — all production-ready! ⚖️

Load Balancing — How It Works

Load balancer = Traffic distributor between servers.


Without load balancer 😰:

code
Users ──▶ Single Server ──▶ CRASH! (overloaded)

With load balancer 😊:

code
          ┌──▶ Server 1 (handling 33%)
Users ──▶ LB ──▶ Server 2 (handling 33%)
          └──▶ Server 3 (handling 33%)

Load balancer types:


Type             | Layer   | Routes By             | Example
L4 (Transport)   | TCP/UDP | IP + port             | AWS NLB
L7 (Application) | HTTP    | URL, headers, cookies | AWS ALB, Nginx
DNS              | DNS     | Geographic location   | Route 53, Cloudflare

For AI apps, an L7 (Application) load balancer is best — it supports URL path-based routing, health checks, and sticky sessions! 🎯


Key features:

  • โค๏ธ Health checks โ€” unhealthy server ku traffic send pannaadhu
  • ๐Ÿ”„ Session persistence โ€” same user same server ku route
  • ๐Ÿ”’ SSL termination โ€” HTTPS handle pannum
  • ๐Ÿ“Š Monitoring โ€” traffic metrics expose pannum

Load Balancing Algorithms

Different algorithms, different use cases:


1. Round Robin 🔄

code
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A  (back to start)
  • Simple, fair distribution
  • Problem: assumes all servers are equally powerful

2. Weighted Round Robin ⚖️

code
Server A (weight: 3) → Gets 3 requests
Server B (weight: 2) → Gets 2 requests
Server C (weight: 1) → Gets 1 request
  • More traffic goes to the more powerful servers
  • AI use: give GPU servers a higher weight

3. Least Connections 📉

code
Server A: 5 active connections → ❌ Skip
Server B: 2 active connections → ✅ Route here!
Server C: 8 active connections → ❌ Skip
  • Best for varying request durations
  • BEST for AI apps — inference time varies!

4. IP Hash 🔗

code
User IP → Hash → Consistent server mapping
Same user always → Same server
  • Session persistence without cookies
  • Good for stateful AI conversations

5. Least Response Time ⚡

  • Route to fastest responding server
  • Considers both connections AND latency
  • Premium option — best performance

AI recommendation: Least Connections for inference APIs — one request takes 50 ms, another takes 2 s, and least connections handles that imbalance! 🎯
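As a minimal sketch (hypothetical server names and connection counts — a real balancer tracks these counters internally), the Least Connections decision is just a minimum over active connections:

```python
# Hypothetical snapshot of active connections per server
# (a real load balancer maintains these counters itself).
def pick_least_connections(servers: dict[str, int]) -> str:
    """Return the server with the fewest active connections."""
    return min(servers, key=servers.get)

active = {"server-a": 5, "server-b": 2, "server-c": 8}
print(pick_least_connections(active))  # server-b
```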

Nginx Load Balancer Setup

✅ Example

Production Nginx config for AI API:

nginx
# /etc/nginx/conf.d/ai-api.conf

upstream ai_inference {
    least_conn;  # Best for AI — variable inference times

    server gpu-node-1:8080 weight=3;  # A100 GPU
    server gpu-node-2:8080 weight=3;  # A100 GPU
    server gpu-node-3:8080 weight=1;  # T4 GPU (less powerful)

    # Keep idle connections to upstream servers open (connection pooling)
    keepalive 32;
}

upstream ai_api {
    least_conn;
    server api-node-1:8000;
    server api-node-2:8000;
    server api-node-3:8000;
}

server {
    listen 443 ssl;
    server_name api.myaiapp.com;

    # SSL
    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;

    # API routes → CPU servers
    location /api/ {
        proxy_pass http://ai_api;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
    }

    # Inference routes → GPU servers
    location /predict {
        proxy_pass http://ai_inference;
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;  # AI inference takes longer!
    }

    # Streaming (SSE for token generation)
    location /stream {
        proxy_pass http://ai_inference;
        proxy_buffering off;       # Disable buffering for SSE
        proxy_read_timeout 300s;   # Long timeout for streaming
    }
}

Key point: API routes go to CPU servers, inference routes go to GPU servers — separate routing! ⚡

AWS ALB for AI Applications

AWS Application Load Balancer โ€” managed, no maintenance:


Terraform config:

hcl
# Application Load Balancer
resource "aws_lb" "ai_alb" {
  name               = "ai-app-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnets
  security_groups    = [aws_security_group.alb.id]
}

# Target Group — API Servers
resource "aws_lb_target_group" "api" {
  name     = "ai-api-tg"
  port     = 8000
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    interval            = 15
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
  }

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 3600  # 1 hour session
  }
}

# Target Group — GPU Inference
resource "aws_lb_target_group" "inference" {
  name     = "ai-inference-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path     = "/health"
    interval = 30
    timeout  = 10  # GPU warm-up takes time
  }

  # Slow start — new GPU instance warm-up period
  slow_start = 120  # 2 min warm-up
}

# Path-based routing
resource "aws_lb_listener_rule" "inference" {
  listener_arn = aws_lb_listener.https.arn
  condition {
    path_pattern { values = ["/predict*", "/stream*"] }
  }
  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.inference.arn
  }
}

Slow start = when a new GPU instance is added, traffic ramps up gradually — giving the model time to load! 🐢➡️🚀

Auto Scaling โ€” Core Concepts

Auto scaling = adjusting capacity based on demand.


Types:


1. Reactive Scaling ๐Ÿ“ˆ

  • Scales when a metric crosses a threshold
  • Example: CPU > 70% → add a server
  • Lag: 2-5 minutes of delay

2. Predictive Scaling ๐Ÿ”ฎ

  • Analyzes historical patterns and scales in advance
  • Example: traffic spikes every Monday at 9 AM — pre-scale for it
  • AWS supports this natively!

3. Scheduled Scaling โฐ

  • Runs on a fixed schedule
  • Example: 10 servers during business hours, 2 at night
  • Cheapest option for predictable traffic

Auto Scaling Group (ASG) config:

code
Minimum: 2 instances (always running)
Desired: 4 instances (normal load)
Maximum: 20 instances (peak limit)

Scaling policies comparison:

Policy          | How It Works                    | Best For
Target Tracking | Maintain a metric at a target   | Simple, effective
Step Scaling    | Different actions at thresholds | Complex rules
Simple Scaling  | One action per alarm            | Basic setups
Predictive      | ML-based prediction             | Predictable patterns

For AI apps: target tracking on a custom metric (inference queue length) works BEST! 🎯
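Conceptually, target tracking adjusts capacity proportionally so the per-instance metric returns to its target. A simplified model of that math (an illustration only — the real AWS policy also involves CloudWatch alarms, cooldowns, and instance warm-up):

```python
import math

def desired_capacity(current: int, metric_per_instance: float,
                     target: float, min_size: int, max_size: int) -> int:
    """Scale capacity proportionally so the average per-instance
    metric (e.g. inference queue length) returns to its target."""
    desired = math.ceil(current * metric_per_instance / target)
    return max(min_size, min(max_size, desired))

# Queue depth averages 15 per instance, target is 5 → triple capacity
print(desired_capacity(current=4, metric_per_instance=15, target=5,
                       min_size=2, max_size=20))  # 12
```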

AWS Auto Scaling for AI

✅ Example

Complete auto scaling setup:

hcl
# Launch Template — GPU Instance
resource "aws_launch_template" "gpu" {
  name          = "ai-gpu-template"
  image_id      = "ami-deep-learning"
  instance_type = "g4dn.xlarge"

  user_data = base64encode(<<-EOF
    #!/bin/bash
    docker pull myapp/inference:latest
    docker run -d --gpus all -p 8080:8080 myapp/inference:latest
  EOF
  )
}

# Auto Scaling Group
resource "aws_autoscaling_group" "gpu_asg" {
  name                = "ai-gpu-asg"
  min_size            = 2
  max_size            = 15
  desired_capacity    = 3
  vpc_zone_identifier = var.private_subnets

  launch_template {
    id      = aws_launch_template.gpu.id
    version = "$Latest"
  }

  # Warm pool — pre-initialized instances
  warm_pool {
    pool_state                  = "Stopped"
    min_size                    = 2
    max_group_prepared_capacity = 5
  }

  instance_refresh {
    strategy = "Rolling"
    preferences { min_healthy_percentage = 80 }
  }
}

# Target Tracking — GPU Utilization
resource "aws_autoscaling_policy" "gpu_target" {
  name                   = "gpu-utilization-target"
  autoscaling_group_name = aws_autoscaling_group.gpu_asg.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    customized_metric_specification {
      metric_name = "GPUUtilization"
      namespace   = "Custom/AI"
      statistic   = "Average"
    }
    target_value = 65.0  # Scale when GPU > 65%
  }
}

Warm pool = pre-warmed instances kept ready — avoiding the ~2-minute wait for model loading. GPU cold starts are painful, and a warm pool solves exactly that! 🔥

Kubernetes HPA + KEDA

Auto scaling in Kubernetes — HPA (Horizontal Pod Autoscaler) + KEDA:


Standard HPA:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 2
  maxReplicas: 30
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60  # Max 4 pods per minute
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scale down
      policies:
      - type: Pods
        value: 2
        periodSeconds: 120
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "5"

KEDA — Event-Driven Scaling (better for AI):

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-worker-scaler
spec:
  scaleTargetRef:
    name: ai-gpu-worker
  minReplicaCount: 0    # Scale to ZERO! 💰
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: pending_inference_requests
      threshold: "10"
      query: sum(inference_queue_pending)

KEDA advantage: scale to zero — no requests means 0 pods and zero compute cost! When a request arrives, it automatically scales back up. Perfect for AI batch workers! 🎯

Load Balancing + Auto Scaling — Architecture

๐Ÿ—๏ธ Architecture Diagram
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚      LOAD BALANCING + AUTO SCALING ARCHITECTURE           โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                            โ”‚
โ”‚  ๐Ÿ“ฑ Users                                                  โ”‚
โ”‚    โ”‚                                                       โ”‚
โ”‚    โ–ผ                                                       โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                         โ”‚
โ”‚  โ”‚  DNS    โ”‚โ”€โ”€โ”€โ”€โ–ถโ”‚  CloudFlare  โ”‚ (DDoS protection)       โ”‚
โ”‚  โ”‚(Route53)โ”‚     โ”‚     CDN      โ”‚                         โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                         โ”‚
โ”‚                          โ–ผ                                 โ”‚
โ”‚               โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                        โ”‚
โ”‚               โ”‚   AWS ALB (L7)   โ”‚                        โ”‚
โ”‚               โ”‚  /api โ†’ API TG   โ”‚                        โ”‚
โ”‚               โ”‚  /predict โ†’ GPU  โ”‚                        โ”‚
โ”‚               โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜                        โ”‚
โ”‚                    โ”‚         โ”‚                             โ”‚
โ”‚          โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”  โ”Œโ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                  โ”‚
โ”‚          โ”‚  API ASG   โ”‚  โ”‚  GPU ASG    โ”‚                  โ”‚
โ”‚          โ”‚ (CPU, t3)  โ”‚  โ”‚ (g4dn)     โ”‚                  โ”‚
โ”‚          โ”‚            โ”‚  โ”‚            โ”‚                  โ”‚
โ”‚          โ”‚ min: 2     โ”‚  โ”‚ min: 2     โ”‚                  โ”‚
โ”‚          โ”‚ max: 20    โ”‚  โ”‚ max: 15    โ”‚                  โ”‚
โ”‚          โ”‚            โ”‚  โ”‚            โ”‚                  โ”‚
โ”‚          โ”‚ โ”Œโ”€โ”€โ” โ”Œโ”€โ”€โ” โ”‚  โ”‚ โ”Œโ”€โ”€โ” โ”Œโ”€โ”€โ” โ”‚                  โ”‚
โ”‚          โ”‚ โ”‚A1โ”‚ โ”‚A2โ”‚ โ”‚  โ”‚ โ”‚G1โ”‚ โ”‚G2โ”‚ โ”‚                  โ”‚
โ”‚          โ”‚ โ””โ”€โ”€โ”˜ โ””โ”€โ”€โ”˜ โ”‚  โ”‚ โ””โ”€โ”€โ”˜ โ””โ”€โ”€โ”˜ โ”‚                  โ”‚
โ”‚          โ”‚  โ”Œโ”€โ”€โ”     โ”‚  โ”‚  โ”Œโ”€โ”€โ”     โ”‚                  โ”‚
โ”‚          โ”‚  โ”‚A3โ”‚ ... โ”‚  โ”‚  โ”‚G3โ”‚ ... โ”‚                  โ”‚
โ”‚          โ”‚  โ””โ”€โ”€โ”˜     โ”‚  โ”‚  โ””โ”€โ”€โ”˜     โ”‚                  โ”‚
โ”‚          โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                  โ”‚
โ”‚               โ”‚              โ”‚                            โ”‚
โ”‚          โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”                     โ”‚
โ”‚          โ”‚CloudWatchโ”‚   โ”‚CloudWatch โ”‚                     โ”‚
โ”‚          โ”‚CPU: 60% โ”‚   โ”‚GPU: 65%   โ”‚                     โ”‚
โ”‚          โ”‚target   โ”‚   โ”‚queue: <10 โ”‚                     โ”‚
โ”‚          โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                     โ”‚
โ”‚                                                            โ”‚
โ”‚  ๐Ÿ”ฅ Warm Pool: 3 pre-initialized GPU instances             โ”‚
โ”‚  ๐Ÿ“Š Predictive Scaling: ML-based traffic prediction        โ”‚
โ”‚  โฐ Scheduled: Business hours boost                        โ”‚
โ”‚                                                            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Health Checks — Deep vs Shallow

Health checks = how the load balancer identifies healthy servers.


Shallow Health Check (Liveness) 🟢:

python
@app.get("/health")
def health():
    return {"status": "ok"}
# Just checks: "Is the process running?"

Deep Health Check (Readiness) 🔍:

python
from fastapi.responses import JSONResponse

@app.get("/health/ready")
async def readiness():
    # model, db, and the helper functions are defined elsewhere in the app
    checks = {
        "model_loaded": model is not None,
        "gpu_available": torch.cuda.is_available(),
        "db_connected": await db.ping(),
        "memory_ok": get_memory_usage() < 90,     # percent of system RAM
        "gpu_memory_ok": get_gpu_memory() < 85,   # percent of GPU VRAM
    }
    all_ok = all(checks.values())
    return JSONResponse(checks, status_code=200 if all_ok else 503)

AI-specific health checks:

Check           | Why               | Failure Action
Model loaded?   | GPU memory issues | Restart pod
GPU available?  | Driver crash      | Replace instance
Inference test? | Model corrupted   | Reload model
Memory < 90%?   | OOM risk          | Scale up
Queue < 100?    | Overloaded        | Scale up

Configuration:

code
Health check interval: 15 seconds
Timeout: 10 seconds (GPU warm-up)
Healthy threshold: 2 consecutive passes
Unhealthy threshold: 3 consecutive failures
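The consecutive-threshold behavior above can be sketched as a tiny state machine (a simplified model of the per-target bookkeeping a load balancer does; thresholds taken from the config):

```python
class HealthTracker:
    """Sketch of consecutive-threshold health logic: 2 consecutive
    passes to become healthy, 3 consecutive failures to be marked
    unhealthy (matching the configuration in this article)."""
    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0     # length of current same-result streak
        self._last = None    # result of the previous check

    def record(self, passed: bool) -> bool:
        if passed == self._last:
            self._streak += 1
        else:
            self._streak = 1
            self._last = passed
        if passed and self._streak >= self.healthy_threshold:
            self.healthy = True
        elif not passed and self._streak >= self.unhealthy_threshold:
            self.healthy = False
        return self.healthy

t = HealthTracker()
for ok in [False, False, False]:   # 3 consecutive failures
    t.record(ok)
print(t.healthy)  # False
t.record(True); t.record(True)     # 2 consecutive passes
print(t.healthy)  # True
```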

Rule: always use deep health checks for AI servers — verifying that the model is loaded and the GPU is healthy is mandatory! 🛡️

AI-Specific Scaling Patterns

💡 Tip

AI workloads need special scaling patterns:

🤖 1. Inference vs Training Separation

- Inference: Auto-scale with request volume

- Training: Scheduled, fixed capacity (spot instances)

- NEVER mix on same servers!

🤖 2. Model Warm-up Strategy

code
New instance → Load model (60s) → Warm-up inference (30s) → Ready!

- Maintain a warm pool

- Health checks pass only after warm-up is complete

🤖 3. Batch vs Real-time Split

- Real-time: Always-on, 2+ instances minimum

- Batch: KEDA scale-to-zero, event-triggered

- Different ASGs for different workload types!

🤖 4. GPU Memory-Based Scaling

- GPU memory > 80% → Scale up (before OOM!)

- Publish a custom CloudWatch metric for this
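A sketch of computing that GPU-memory percentage from `nvidia-smi` output (the value would then be published as a custom metric, e.g. to the `Custom/AI` CloudWatch namespace used earlier; the `raw` parameter lets you inject sample output instead of running the CLI):

```python
import subprocess

def gpu_memory_percent(raw=None) -> float:
    """Percentage of GPU memory in use, parsed from nvidia-smi CSV output."""
    if raw is None:
        raw = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"], text=True)
    used, total = (float(x) for x in raw.split(","))
    return 100 * used / total

print(gpu_memory_percent("13000, 16000"))  # 81.25 → above 80%, scale up!
```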

🤖 5. Graceful Shutdown

python
# Don't kill mid-inference!
@app.on_event("shutdown")
async def shutdown():
    # Finish current requests (drain)
    await inference_queue.join()
    # Unload model cleanly
    del model
    torch.cuda.empty_cache()

Without graceful shutdown, inference gets cut off mid-way and users get a broken response! ⚠️

Cost-Effective Scaling

Optimizing auto scaling costs:


Mixed Instance Strategy:

hcl
mixed_instances_policy {
  instances_distribution {
    on_demand_base_capacity = 2        # 2 on-demand (reliable)
    on_demand_percentage_above_base = 20  # 20% on-demand
    spot_allocation_strategy = "lowest-price"
  }
  launch_template {
    override {
      instance_type = "g4dn.xlarge"    # Primary
    }
    override {
      instance_type = "g4dn.2xlarge"   # Fallback
    }
    override {
      instance_type = "g5.xlarge"      # Alternative
    }
  }
}

Cost breakdown:

Strategy         | Monthly Cost | Reliability
All on-demand    | $10,000      | 99.99%
Mixed (80% spot) | $4,000       | 99.9%
All spot         | $3,000       | 99%
Scheduled + spot | $3,500       | 99.9%

Best combo:

  • Base: 2 on-demand instances (always on)
  • Scale: Spot instances for traffic spikes
  • Schedule: Reduce min capacity off-hours
  • Savings: ~60% compared to all on-demand! 💰

Spot interruption handling: when an instance is about to be terminated, drain gracefully — complete the current requests, then shut down! 🛡️
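AWS announces a spot reclaim roughly two minutes ahead via the instance metadata endpoint; a sketch of polling it to trigger the drain (IMDSv2 would additionally require a session token):

```python
import json
import urllib.request

SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def fetch_spot_action():
    """Return the interruption notice dict, or None when there is none.
    The endpoint 404s until ~2 minutes before the instance is reclaimed."""
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=1) as resp:
            return json.loads(resp.read())
    except Exception:
        return None

def should_drain(notice) -> bool:
    """Drain when a stop/terminate action has been scheduled."""
    return notice is not None and notice.get("action") in ("stop", "terminate")

# Worker loop sketch: if should_drain(fetch_spot_action()):
#     stop accepting new requests, finish in-flight inferences, then exit.
```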

Prompt: Design Load Balancing System

📋 Copy-Paste Prompt
You are a Cloud Architect specializing in AI infrastructure.

Design a load balancing and auto-scaling system for:
- AI image generation API (Stable Diffusion model)
- 5,000 concurrent users at peak
- Each generation takes 10-30 seconds
- Must handle 100x traffic spikes (viral events)
- Budget: $8,000/month baseline

Provide:
1. Load balancer configuration (algorithm choice + reasoning)
2. Auto scaling policy (min/max/scaling triggers)
3. Warm pool / pre-warming strategy
4. Queue-based architecture for long-running generations
5. Cost optimization with spot instances
6. Graceful handling of spot interruptions
7. Monitoring metrics and alerts

Summary

Key takeaways:


✅ Load balancing = Least Connections works best for AI (variable inference times)

✅ L7 routing = API → CPU servers, inference → GPU servers

✅ Auto scaling = target tracking with custom metrics (queue depth, GPU %)

✅ Warm pool = pre-initialized GPU instances for fast scaling

✅ KEDA = scale-to-zero for batch AI workers

✅ Health checks = deep checks — model loaded + GPU available

✅ Cost = mixed instances (on-demand base + spot for scaling) = ~60% savings


Action item: set up an ALB + ASG on AWS (staying within the free tier where possible). Put 2 instances behind the ALB and add a CPU target tracking policy. Then simulate traffic and watch auto scaling in action! ⚖️


Next article: Multi-Cloud AI Systems — avoiding vendor lock-in! ☁️☁️☁️

🎮 Mini Challenge

Challenge: Deploy Same App to Multiple Clouds


Avoid vendor lock-in — deploy to AWS, GCP, and Azure! ☁️☁️☁️


Step 1: Create a Cloud-Agnostic App 🐍

python
# app.py — no cloud-specific APIs
from fastapi import FastAPI
import os

app = FastAPI()

# Environment variables (cloud-agnostic)
MODEL_PATH = os.getenv("MODEL_PATH", "./models/default")
STORAGE_BUCKET = os.getenv("STORAGE_BUCKET", "local")
DB_URL = os.getenv("DB_URL", "sqlite:///local.db")

@app.post("/predict")
def predict(text: str):
    # Model load from MODEL_PATH
    result = model.predict(text)
    return {"result": result}

Step 2: Docker Container (Cloud-Agnostic) 🐳

dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
# No cloud-specific commands! (uvicorn must be in requirements.txt)

Step 3: AWS Deploy ☁️

bash
# Push to ECR
aws ecr get-login-password | docker login --username AWS --password-stdin <account>.dkr.ecr.us-east-1.amazonaws.com
docker tag app:latest <account>.dkr.ecr.us-east-1.amazonaws.com/app:latest
docker push <account>.dkr.ecr.us-east-1.amazonaws.com/app:latest

# Deploy via Elastic Container Service
aws ecs create-service --cluster ai-cluster --service-name app --task-definition app-task --desired-count 2

Step 4: GCP Deploy 🟡

bash
# Push to Artifact Registry
gcloud builds submit --tag gcr.io/<project>/app:latest

# Deploy to Cloud Run
gcloud run deploy app --image gcr.io/<project>/app:latest --platform managed --region us-central1

Step 5: Azure Deploy 🔵

bash
# Push to Container Registry
az acr build --registry <registry> --image app:latest .

# Deploy to Container Instances
az container create --resource-group rg --name app --image <registry>.azurecr.io/app:latest

Step 6: Configuration Per Cloud ⚙️

bash
# .env.aws
MODEL_PATH=s3://bucket/models/default
STORAGE_BUCKET=s3://bucket
DB_URL=mysql://rds-endpoint/db

# .env.gcp
MODEL_PATH=gs://bucket/models/default
STORAGE_BUCKET=gs://bucket
DB_URL=postgres://cloudsql-endpoint/db

# .env.azure
MODEL_PATH=https://blob.core.windows.net/models
STORAGE_BUCKET=https://blob.core.windows.net
DB_URL=postgresql://azure-db/db

Step 7: Abstraction Layer (SDK) 🔧

python
# cloud_adapter.py
import os

class StorageAdapter:
    @staticmethod
    def upload(file, bucket):
        provider = os.getenv("CLOUD_PROVIDER")
        if provider == "aws":
            import boto3
            s3 = boto3.client('s3')
            s3.upload_file(file, bucket, file)
        elif provider == "gcp":
            from google.cloud import storage
            blob = storage.Client().bucket(bucket).blob(file)
            blob.upload_from_filename(file)
        # same pattern extends to Azure — one cloud-agnostic interface

Step 8: Compare & Monitor 📊

bash
# Cost comparison:
# - AWS: $500/month
# - GCP: $480/month
# - Azure: $520/month
# Performance comparison: latency, throughput
# All 3 running simultaneously = multi-cloud safety!

Completion Time: 3-4 hours

Tools: AWS, GCP, Azure CLI, Docker

Multi-cloud expertise ⭐⭐⭐

💼 Interview Questions

Q1: How do you prevent vendor lock-in? Best practices?

A: Use open standards (Docker, Kubernetes). Avoid cloud-specific APIs where possible — or wrap them behind SDKs/adapters with multi-cloud support. Keep configuration separate from code (environment variables). Deploy to more than one cloud from the start as proof of portability. Store data in standard formats (Parquet, JSON). In code review, reject direct use of cloud-specific APIs outside the adapter layer.


Q2: What is the best approach to cost optimization across clouds?

A: Benchmark each cloud by running the same workload. Pick the cheapest provider per service (e.g. AWS for compute, GCP for ML, Azure for databases). A hybrid setup enables cost arbitrage since prices vary — but monitoring overhead and operational complexity increase. Recommendation: one primary cloud, plus one secondary for disaster recovery.


Q3: Are data transfer costs between clouds expensive?

A: Yes! Cross-cloud and cross-region egress typically runs on the order of $0.08-0.12 per GB (roughly $80-120 per TB), while ingress and most in-region traffic is free or cheap. Strategy: minimize transfer — process data where it lives ("data gravity": put compute near the data). Use caching and local replication. Transfer costs can end up exceeding compute costs.


Q4: How complex does CI/CD get with multi-cloud testing?

A: Pipeline: code → build image → test on AWS → test on GCP → test on Azure → deploy to the chosen cloud. Run the cloud-specific test stages in parallel to keep total time down. Tools: Terraform to provision across clouds, Ansible for cloud-agnostic configuration, and GitHub Actions matrix builds for parallel multi-cloud jobs.


Q5: Disaster recovery across clouds — active-active or active-passive?

A: Active-passive: one primary cloud with a secondary on standby that takes over on failure. Active-active: both clouds serve traffic simultaneously — more complex and more expensive. A common multi-cloud pattern is write to one, read from many, with replication; replication lag of a few seconds is possible. Recovery should be automated failover (DNS switch, load balancer update).

Frequently Asked Questions

โ“ Load balancer na enna simple ah?
Traffic cop maari โ€” incoming requests multiple servers ku distribute pannum. One server overload aagaadhu. Server down aana, traffic automatically other servers ku route aagum. Users ku always fast response kedaikum.
โ“ Auto scaling na enna?
Traffic based ah servers automatically increase/decrease aagum. Morning 10 users โ€” 1 server. Afternoon 10,000 users โ€” 10 servers auto add. Night 5 users โ€” back to 1 server. Pay only for what you use!
โ“ AI apps ku special load balancing venum ah?
Yes! AI inference GPU-bound โ€” normal round-robin work aagaadhu. Least-connections or GPU-utilization based routing venum. Long-running inference requests ku timeout adjust pannanum. Streaming responses ku WebSocket/SSE support venum.
โ“ Auto scaling aggressive ah set pannalaam ah?
Careful! Too aggressive = cost spike. Too conservative = latency spike. Start with target tracking (CPU 60-70%). AI apps ku custom metrics use pannunga โ€” inference queue length or GPU utilization based scaling better than CPU-based.