
Load balancing + auto scaling

Advanced · 15 min read · 📅 Updated: 2026-02-17

Introduction

Your AI app is running on a single server. Suddenly a blog post goes viral — traffic jumps 100x! 🚀 The server struggles, requests time out, and users leave.


Load Balancing = distributing traffic across multiple servers

Auto Scaling = adding or removing servers based on demand


Together, your app can handle any amount of traffic at an optimized cost! Netflix, Uber, ChatGPT — they all rely on these.


In this article we'll cover load balancing algorithms, auto scaling strategies, and AI-specific configurations — all production-ready! ⚖️

Load Balancing — How It Works

Load balancer = a traffic distributor that sits in front of your servers.


Without a load balancer 😰:

code
Users ──▶ Single Server ──▶ CRASH! (overloaded)

With a load balancer 😊:

code
          ┌──▶ Server 1 (handling 33%)
Users ──▶ LB ──▶ Server 2 (handling 33%)
          └──▶ Server 3 (handling 33%)

Load balancer types:


| Type | Layer | Routes By | Example |
|------|-------|-----------|---------|
| L4 (Transport) | TCP/UDP | IP + Port | AWS NLB |
| L7 (Application) | HTTP | URL, headers, cookies | AWS ALB, Nginx |
| DNS | DNS | Geographic location | Route 53, Cloudflare |

For AI apps: an L7 (Application) load balancer is best — it supports URL-path-based routing, health checks, and sticky sessions! 🎯


Key features:

  • ❤️ Health checks — no traffic is sent to unhealthy servers
  • 🔄 Session persistence — the same user is routed to the same server
  • 🔒 SSL termination — the load balancer handles HTTPS
  • 📊 Monitoring — exposes traffic metrics

Load Balancing Algorithms

Different algorithms, different use cases:


1. Round Robin 🔄

code
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A  (back to start)
  • Simple, fair distribution
  • Problem: assumes all servers are equal

2. Weighted Round Robin ⚖️

code
Server A (weight: 3) → Gets 3 requests
Server B (weight: 2) → Gets 2 requests
Server C (weight: 1) → Gets 1 request
  • More powerful servers receive more traffic
  • AI use: give GPU servers a higher weight

3. Least Connections 📉

code
Server A: 5 active connections → ❌ Skip
Server B: 2 active connections → ✅ Route here!
Server C: 8 active connections → ❌ Skip
  • Best for varying request durations
  • BEST for AI apps — inference times vary!

4. IP Hash 🔗

code
User IP → Hash → Consistent server mapping
Same user always → Same server
  • Session persistence without cookies
  • Good for stateful AI conversations
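A minimal Python sketch of the idea (server names are hypothetical; real load balancers use more robust hash functions):

```python
import hashlib

SERVERS = ["server-a", "server-b", "server-c"]  # hypothetical pool

def route_by_ip(client_ip: str, servers=SERVERS) -> str:
    # Hash the client IP, then map the digest onto the server list.
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# The same IP always maps to the same server:
assert route_by_ip("203.0.113.7") == route_by_ip("203.0.113.7")
```

One catch: with plain modulo hashing, adding or removing a server remaps most users; consistent hashing is the standard fix for that.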

5. Least Response Time

  • Route to fastest responding server
  • Considers both connections AND latency
  • Premium option — best performance

AI recommendation: Least Connections for inference APIs — one request takes 50 ms, another takes 2 s, and least connections absorbs that imbalance! 🎯
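The selection rule itself is tiny — a sketch with made-up connection counts:

```python
def pick_least_connections(active: dict) -> str:
    """Return the server with the fewest active connections."""
    return min(active, key=active.get)

# Matches the example above: server B has the fewest connections.
assert pick_least_connections({"A": 5, "B": 2, "C": 8}) == "B"
```

A real balancer updates these counts as requests start and finish; the win for AI is that a server stuck on a 2 s inference naturally stops receiving new work.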

Nginx Load Balancer Setup

Example

Production Nginx config for AI API:

nginx
# /etc/nginx/conf.d/ai-api.conf

upstream ai_inference {
    least_conn;  # Best for AI — variable inference times

    server gpu-node-1:8080 weight=3;  # A100 GPU
    server gpu-node-2:8080 weight=3;  # A100 GPU
    server gpu-node-3:8080 weight=1;  # T4 GPU (less powerful)

    keepalive 32;  # reuse idle upstream connections
}

upstream ai_api {
    least_conn;
    server api-node-1:8000;
    server api-node-2:8000;
    server api-node-3:8000;
}

server {
    listen 443 ssl;
    server_name api.myaiapp.com;

    # SSL
    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;

    # API routes → CPU servers
    location /api/ {
        proxy_pass http://ai_api;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
    }

    # Inference routes → GPU servers
    location /predict {
        proxy_pass http://ai_inference;
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;  # AI inference takes longer!
    }

    # Streaming (SSE for token generation)
    location /stream {
        proxy_pass http://ai_inference;
        proxy_buffering off;       # Disable buffering for SSE
        proxy_read_timeout 300s;   # Long timeout for streaming
    }
}

Key: API routes go to CPU servers, inference routes go to GPU servers — separate routing! ⚡

AWS ALB for AI Applications

AWS Application Load Balancer — managed, no maintenance:


Terraform config:

hcl
# Application Load Balancer
resource "aws_lb" "ai_alb" {
  name               = "ai-app-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnets
  security_groups    = [aws_security_group.alb.id]
}

# Target Group — API Servers
resource "aws_lb_target_group" "api" {
  name     = "ai-api-tg"
  port     = 8000
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    interval            = 15
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
  }

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 3600  # 1 hour session
  }
}

# Target Group — GPU Inference
resource "aws_lb_target_group" "inference" {
  name     = "ai-inference-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path     = "/health"
    interval = 30
    timeout  = 10  # GPU warm-up takes time
  }

  # Slow start — new GPU instance warm-up period
  slow_start = 120  # 2 min warm-up
}

# Path-based routing
resource "aws_lb_listener_rule" "inference" {
  listener_arn = aws_lb_listener.https.arn
  condition {
    path_pattern { values = ["/predict*", "/stream*"] }
  }
  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.inference.arn
  }
}

Slow start = when a new GPU instance joins, traffic ramps up gradually — giving the model time to load! 🐢➡️🚀

Auto Scaling — Core Concepts

Auto scaling = adjusting capacity based on demand.


Types:


1. Reactive Scaling 📈

  • Scales when a metric crosses a threshold
  • Example: CPU > 70% → add a server
  • Lag: 2-5 minute delay

2. Predictive Scaling 🔮

  • Analyzes historical patterns and scales in advance
  • Example: traffic spikes every Monday at 9 AM — pre-scale
  • AWS supports this natively!

3. Scheduled Scaling

  • Based on a fixed schedule
  • Example: 10 servers during business hours, 2 at night
  • Cheapest option for predictable traffic

Auto Scaling Group (ASG) config:

code
Minimum: 2 instances (always running)
Desired: 4 instances (normal load)
Maximum: 20 instances (peak limit)

Scaling policies comparison:

| Policy | How It Works | Best For |
|--------|--------------|----------|
| Target Tracking | Maintain a metric at a target value | Simple, effective |
| Step Scaling | Different actions at different thresholds | Complex rules |
| Simple Scaling | One action per alarm | Basic setups |
| Predictive | ML-based prediction | Predictable patterns |

For AI apps: target tracking on a custom metric (inference queue length) is BEST! 🎯
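Target tracking roughly follows a proportional rule — scale capacity so the metric returns to its target. A simplified sketch of the arithmetic (the actual AWS algorithm also applies cooldowns and alarm evaluation):

```python
import math

def desired_capacity(current: int, metric_value: float, target: float,
                     min_size: int = 2, max_size: int = 20) -> int:
    # new capacity ≈ current * (observed metric / target), clamped to ASG bounds
    desired = math.ceil(current * metric_value / target)
    return max(min_size, min(max_size, desired))

# 4 instances with average queue depth 10 against a target of 5 → scale to 8:
assert desired_capacity(4, 10, 5) == 8
```

The same rule scales in: if the queue drops to 1, the suggested capacity falls back toward the minimum.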

AWS Auto Scaling for AI

Example

Complete auto scaling setup:

hcl
# Launch Template — GPU Instance
resource "aws_launch_template" "gpu" {
  name          = "ai-gpu-template"
  image_id      = "ami-deep-learning"
  instance_type = "g4dn.xlarge"

  user_data = base64encode(<<-EOF
    #!/bin/bash
    docker pull myapp/inference:latest
    docker run -d --gpus all -p 8080:8080 myapp/inference:latest
  EOF
  )
}

# Auto Scaling Group
resource "aws_autoscaling_group" "gpu_asg" {
  name                = "ai-gpu-asg"
  min_size            = 2
  max_size            = 15
  desired_capacity    = 3
  vpc_zone_identifier = var.private_subnets

  launch_template {
    id      = aws_launch_template.gpu.id
    version = "$Latest"
  }

  # Warm pool — pre-initialized instances
  warm_pool {
    pool_state                  = "Stopped"
    min_size                    = 2
    max_group_prepared_capacity = 5
  }

  instance_refresh {
    strategy = "Rolling"
    preferences { min_healthy_percentage = 80 }
  }
}

# Target Tracking — GPU Utilization
resource "aws_autoscaling_policy" "gpu_target" {
  name                   = "gpu-utilization-target"
  autoscaling_group_name = aws_autoscaling_group.gpu_asg.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    customized_metric_specification {
      metric_name = "GPUUtilization"
      namespace   = "Custom/AI"
      statistic   = "Average"
    }
    target_value = 65.0  # Scale when GPU > 65%
  }
}

Warm pool = pre-initialized instances standing by — no 2-minute wait for model loading! GPU cold starts are painful, and a warm pool solves exactly that! 🔥

Kubernetes HPA + KEDA

Auto scaling in Kubernetes — HPA (Horizontal Pod Autoscaler) + KEDA:


Standard HPA:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 2
  maxReplicas: 30
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60  # Max 4 pods per minute
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scale down
      policies:
      - type: Pods
        value: 2
        periodSeconds: 120
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "5"

KEDA — Event-Driven Scaling (better for AI):

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-worker-scaler
spec:
  scaleTargetRef:
    name: ai-gpu-worker
  minReplicaCount: 0    # Scale to ZERO! 💰
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: pending_inference_requests
      threshold: "10"
      query: sum(inference_queue_pending)

KEDA advantage: scale to zero — no requests means 0 pods and zero cost! When a request arrives, it scales back up automatically. Perfect for AI batch workers! 🎯

Load Balancing + Auto Scaling Architecture

🏗️ Architecture Diagram
┌──────────────────────────────────────────────────────────┐
│      LOAD BALANCING + AUTO SCALING ARCHITECTURE           │
├──────────────────────────────────────────────────────────┤
│                                                            │
│  📱 Users                                                  │
│    │                                                       │
│    ▼                                                       │
│  ┌─────────┐     ┌──────────────┐                         │
│  │  DNS    │────▶│  CloudFlare  │ (DDoS protection)       │
│  │(Route53)│     │     CDN      │                         │
│  └─────────┘     └──────┬───────┘                         │
│                          ▼                                 │
│               ┌──────────────────┐                        │
│               │   AWS ALB (L7)   │                        │
│               │  /api → API TG   │                        │
│               │  /predict → GPU  │                        │
│               └────┬─────────┬───┘                        │
│                    │         │                             │
│          ┌─────────▼──┐  ┌──▼──────────┐                  │
│          │  API ASG   │  │  GPU ASG    │                  │
│          │ (CPU, t3)  │  │ (g4dn)     │                  │
│          │            │  │            │                  │
│          │ min: 2     │  │ min: 2     │                  │
│          │ max: 20    │  │ max: 15    │                  │
│          │            │  │            │                  │
│          │ ┌──┐ ┌──┐ │  │ ┌──┐ ┌──┐ │                  │
│          │ │A1│ │A2│ │  │ │G1│ │G2│ │                  │
│          │ └──┘ └──┘ │  │ └──┘ └──┘ │                  │
│          │  ┌──┐     │  │  ┌──┐     │                  │
│          │  │A3│ ... │  │  │G3│ ... │                  │
│          │  └──┘     │  │  └──┘     │                  │
│          └───────────┘  └───────────┘                  │
│               │              │                            │
│          ┌────▼────┐   ┌─────▼─────┐                     │
│          │CloudWatch│   │CloudWatch │                     │
│          │CPU: 60% │   │GPU: 65%   │                     │
│          │target   │   │queue: <10 │                     │
│          └─────────┘   └───────────┘                     │
│                                                            │
│  🔥 Warm Pool: 3 pre-initialized GPU instances             │
│  📊 Predictive Scaling: ML-based traffic prediction        │
│  ⏰ Scheduled: Business hours boost                        │
│                                                            │
└──────────────────────────────────────────────────────────┘

Health Checks — Deep vs Shallow

Health checks = how the load balancer identifies healthy servers.


Shallow Health Check (Liveness) 🟢:

python
@app.get("/health")
def health():
    return {"status": "ok"}
# Just checks: "Is the process running?"

Deep Health Check (Readiness) 🔍:

python
from fastapi.responses import JSONResponse
import torch

@app.get("/health/ready")
async def readiness():
    checks = {
        "model_loaded": model is not None,
        "gpu_available": torch.cuda.is_available(),
        "db_connected": await db.ping(),
        "memory_ok": get_memory_usage() < 90,     # % of RAM used
        "gpu_memory_ok": get_gpu_memory() < 85,   # % of VRAM used
    }
    all_ok = all(checks.values())
    return JSONResponse(checks, status_code=200 if all_ok else 503)

AI-specific health checks:

| Check | Why | Failure Action |
|-------|-----|----------------|
| Model loaded? | GPU memory issues | Restart pod |
| GPU available? | Driver crash | Replace instance |
| Inference test? | Model corrupt | Reload model |
| Memory < 90%? | OOM risk | Scale up |
| Queue < 100? | Overloaded | Scale up |

Configuration:

code
Health check interval: 15 seconds
Timeout: 10 seconds (GPU warm-up)
Healthy threshold: 2 consecutive passes
Unhealthy threshold: 3 consecutive failures

Rule: always use deep health checks for AI servers — verifying the model is loaded and the GPU is available is mandatory! 🛡️

AI-Specific Scaling Patterns

💡 Tip

AI workloads call for special scaling patterns:

🤖 1. Inference vs Training Separation

- Inference: Auto-scale with request volume

- Training: Scheduled, fixed capacity (spot instances)

- NEVER mix them on the same servers!

🤖 2. Model Warm-up Strategy

code
New instance → Load model (60s) → Warm-up inference (30s) → Ready!

- Maintain a warm pool

- Health checks should pass only after warm-up completes
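One way to enforce that ordering is a readiness flag that flips only after warm-up — a framework-free sketch (the `load_model` / `warm_infer` callables are placeholders you supply):

```python
class WarmupGate:
    """Report unready (503) until model load + warm-up inference finish."""
    def __init__(self):
        self.ready = False

    def warm_up(self, load_model, warm_infer):
        model = load_model()   # ~60s in production
        warm_infer(model)      # one dummy inference to fill caches
        self.ready = True      # only now may the LB send traffic

    def health_status(self) -> int:
        return 200 if self.ready else 503

gate = WarmupGate()
assert gate.health_status() == 503               # still warming up
gate.warm_up(lambda: object(), lambda m: None)   # instant stand-ins
assert gate.health_status() == 200               # safe to receive traffic
```

Wire `health_status()` into your readiness endpoint and the load balancer will simply not see the instance until warm-up is done.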

🤖 3. Batch vs Real-time Split

- Real-time: Always-on, 2+ instances minimum

- Batch: KEDA scale-to-zero, event-triggered

- Different ASGs for different workload types!

🤖 4. GPU Memory-Based Scaling

- GPU memory > 80% → scale up (before OOM!)

- Publish it as a custom CloudWatch metric
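A hedged boto3 sketch of publishing that metric (the `Custom/AI` namespace matches the Terraform policy earlier; the GPU-memory reading itself is a placeholder you would implement, e.g. via NVML):

```python
def build_metric(value: float, instance_id: str) -> dict:
    # One CloudWatch datapoint: GPU memory utilization for this instance.
    return {
        "MetricName": "GPUMemoryUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Unit": "Percent",
        "Value": value,
    }

def publish_gpu_memory(value: float, instance_id: str):
    import boto3  # imported here so build_metric stays dependency-free
    cw = boto3.client("cloudwatch")
    cw.put_metric_data(Namespace="Custom/AI",
                       MetricData=[build_metric(value, instance_id)])

# Run this once a minute (cron or a sidecar container);
# the target-tracking policy reacts when the value crosses 80%.
```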

🤖 5. Graceful Shutdown

python
# Don't kill mid-inference!
@app.on_event("shutdown")
async def shutdown():
    # Finish current requests (drain)
    await inference_queue.join()
    # Unload model cleanly
    del model
    torch.cuda.empty_cache()

Without graceful shutdown, inferences get cut mid-way and users receive bad responses! ⚠️

Cost-Effective Scaling

Optimizing auto scaling costs:


Mixed Instance Strategy:

hcl
mixed_instances_policy {
  instances_distribution {
    on_demand_base_capacity = 2        # 2 on-demand (reliable)
    on_demand_percentage_above_base = 20  # 20% on-demand
    spot_allocation_strategy = "lowest-price"
  }
  launch_template {
    launch_template_specification {
      launch_template_id = aws_launch_template.gpu.id
    }
    override {
      instance_type = "g4dn.xlarge"    # Primary
    }
    override {
      instance_type = "g4dn.2xlarge"   # Fallback
    }
    override {
      instance_type = "g5.xlarge"      # Alternative
    }
  }
}

Cost breakdown:

| Strategy | Monthly Cost | Reliability |
|----------|--------------|-------------|
| All on-demand | $10,000 | 99.99% |
| Mixed (80% spot) | $4,000 | 99.9% |
| All spot | $3,000 | 99% |
| Scheduled + spot | $3,500 | 99.9% |

Best combo:

  • Base: 2 on-demand instances (always on)
  • Scale: Spot instances for traffic spikes
  • Schedule: Reduce min capacity off-hours
  • Savings: ~60% compared to all on-demand! 💰

Spot interruption handling: when an instance is about to be terminated, drain gracefully — finish the in-flight requests, then shut down! 🛡️
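On EC2, the interruption warning arrives via the instance metadata service roughly two minutes ahead. A sketch of a drain watcher (IMDSv1-style for brevity; production code should use IMDSv2 tokens, and the `drain` callable is a placeholder you supply):

```python
import json
import time
import urllib.error
import urllib.request

NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_scheduled(status_code: int, body: str) -> bool:
    # 404 = no notice yet; 200 with {"action": "terminate", ...} = drain now.
    return status_code == 200 and json.loads(body).get("action") == "terminate"

def watch(drain, poll_seconds: int = 5):
    while True:
        try:
            with urllib.request.urlopen(NOTICE_URL, timeout=1) as resp:
                if interruption_scheduled(resp.status, resp.read().decode()):
                    drain()  # stop taking work, finish in-flight requests
                    return
        except urllib.error.URLError:
            pass             # 404 / unreachable → no interruption notice
        time.sleep(poll_seconds)
```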

Prompt: Design Load Balancing System

📋 Copy-Paste Prompt
You are a Cloud Architect specializing in AI infrastructure.

Design a load balancing and auto-scaling system for:
- AI image generation API (Stable Diffusion model)
- 5,000 concurrent users at peak
- Each generation takes 10-30 seconds
- Must handle 100x traffic spikes (viral events)
- Budget: $8,000/month baseline

Provide:
1. Load balancer configuration (algorithm choice + reasoning)
2. Auto scaling policy (min/max/scaling triggers)
3. Warm pool / pre-warming strategy
4. Queue-based architecture for long-running generations
5. Cost optimization with spot instances
6. Graceful handling of spot interruptions
7. Monitoring metrics and alerts

Summary

Key takeaways:


Load balancing = Least connections best for AI (variable inference times)

L7 routing = API → CPU servers, Inference → GPU servers

Auto scaling = Target tracking with custom metrics (queue depth, GPU%)

Warm pool = Pre-initialized GPU instances for fast scaling

KEDA = Scale-to-zero for batch AI workers

Health checks = Deep checks — model loaded + GPU available

Cost = Mixed instances (on-demand base + spot for scaling) = 60% savings


Action item: set up an ALB + ASG on AWS (free tier). Put 2 instances behind the ALB and add a CPU target-tracking policy. Simulate traffic and watch the auto scaling kick in! ⚖️


Next article: Multi-Cloud AI Systems — avoiding vendor lock-in! ☁️☁️☁️

🏁 🎮 Mini Challenge

Challenge: Set Up a Load Balancer + Auto Scaling (AWS)


Simulate high traffic and watch it auto-scale! 🚀⚖️


Step 1: Create a Launch Template 📝

bash
# AWS Console → EC2 → Launch Templates
# AMI: Ubuntu 20.04
# Instance type: t3.micro (free tier)
# Security group: port 80, 443, 22
# User data:
#!/bin/bash
apt-get update
apt-get install -y python3-pip
pip3 install fastapi uvicorn
# app.py with simple prediction endpoint

# Save template version

Step 2: Create an Auto Scaling Group 🔄

bash
# AWS Console → Auto Scaling → Auto Scaling Groups → Create
# Launch template: select template
# Min size: 2
# Max size: 8
# Desired capacity: 2
# VPC: select
# Subnets: multi-AZ

Step 3: Create a Target Group 🎯

bash
# EC2 → Target Groups → Create
# Name: ai-targets
# Protocol: HTTP
# Port: 80
# Health check: /health (every 30 sec)
# Healthy threshold: 2
# Unhealthy threshold: 3

Step 4: Application Load Balancer ⚖️

bash
# EC2 → Load Balancers → Create ALB
# Name: ai-alb
# Availability zones: multi-AZ
# Listeners: HTTP:80 → target group: ai-targets
# Deploy
# Copy DNS name

Step 5: Scaling Policies 📊

bash
# ASG → Scaling Policies → Target Tracking
# Metric: CPU Utilization
# Target: 60%
# Scale up threshold: 60% breach
# Scale down threshold: 30% below
# Cooldown: 5 minutes

# Alternative: Step Scaling (more granular)

Step 6: Load Test 🔥

bash
# Install Apache Bench
# Generate traffic
ab -n 10000 -c 100 http://alb-dns/predict

# Monitor in AWS Console:
# - CloudWatch: CPU utilization
# - ASG: desired capacity increase (auto!)
# - Target Group: healthy target count

Step 7: Monitor & Observe 📈

bash
# CloudWatch Dashboards:
# - ALB request count
# - Target CPU
# - ASG desired capacity trend
# - See auto-scaling in action!

Step 8: Cost Check & Cleanup 💰

bash
# CloudWatch → Billing
# Check estimated hourly cost
# Then delete ALB, ASG, launch template
# Stay within the free tier!

Completion Time: 2 hours

Tools: AWS EC2, ALB, ASG, CloudWatch

Hands-on scaling experience

💼 Interview Questions

Q1: Sticky sessions — need? Load balancer la how configure?

A: Sticky sessions = same client, same server. Needed when session data lives on a specific server (no distributed cache). Configure: ALB → target group → stickiness, duration e.g. 1 day. Drawback: if that server goes down, the session is lost. Better: a Redis session store (all servers can read it, no stickiness needed).


Q2: Connection draining — graceful shutdown?

A: Draining instances stop receiving new requests but are allowed to complete existing ones. Default: 300 seconds. Critical for long-running requests (e.g. model training) — they may hit the timeout. Configure: Target group → Deregistration delay. Monitoring: consistently slow draining may indicate slow requests.


Q3: Geographic load balancing — multi-region?

A: Route 53 (AWS) offers geolocation routing (user location → nearest region). Benefits: lower latency, data-residency compliance. Complexity: data sync across regions, higher cost. Best for: global users where low latency is critical. Watch out: inter-region data transfer is expensive.


Q4: Session affinity vs stateless — architecture?

A: Stateless is better (scales horizontally, resilient, simple). If sessions are needed: Redis/Memcached (shared store). Avoid local session state (hard to scale). AI apps: stateless inference is perfect. Training jobs: stateful (data and checkpoints must persist).


Q5: Health checks failing — slow responses problem?

A: Increase the timeout and unhealthy threshold, but find the real issue: is the server overloaded, buggy, deadlocked? Investigate logs and metrics (CPU, memory). Temporary fix: raise the ASG's desired capacity. Permanent fix: repair the code and optimize resources. Monitoring: alert on health check failures.

Frequently Asked Questions

What is a load balancer, in simple terms?
Like a traffic cop — it distributes incoming requests across multiple servers so no single server gets overloaded. If a server goes down, traffic automatically routes to the others, and users always get a fast response.

What is auto scaling?
Servers automatically increase or decrease with traffic. Morning, 10 users — 1 server. Afternoon, 10,000 users — 10 servers added automatically. Night, 5 users — back to 1 server. Pay only for what you use!

Do AI apps need special load balancing?
Yes! AI inference is GPU-bound — plain round-robin doesn't cut it. You need least-connections or GPU-utilization-based routing, longer timeouts for long-running inference requests, and WebSocket/SSE support for streaming responses.

Can I set auto scaling aggressively?
Careful! Too aggressive = cost spike; too conservative = latency spike. Start with target tracking (CPU 60-70%). For AI apps, use custom metrics — scaling on inference queue length or GPU utilization beats CPU-based scaling.