โ† Back|CLOUD-DEVOPSโ€บSection 1/17
0 of 17 completed

Load balancing + auto scaling

Advancedโฑ 15 min read๐Ÿ“… Updated: 2026-02-17

Introduction

Your AI app runs on a single server. Suddenly a blog post goes viral — traffic jumps 100x! 🚀 The server struggles, requests time out, and users leave.


Load Balancing = distributing traffic across multiple servers

Auto Scaling = adding or removing servers based on demand


Together they let your app handle any amount of traffic at an optimized cost! Netflix, Uber, ChatGPT — they all rely on these techniques.


In this article we'll cover load balancing algorithms, auto scaling strategies, and AI-specific configurations — all production-ready! ⚖️

Load Balancing โ€” How It Works

Load balancer = Traffic distributor between servers.


Without a load balancer 😰:

code
Users โ”€โ”€โ–ถ Single Server โ”€โ”€โ–ถ CRASH! (overloaded)

With a load balancer 😊:

code
          โ”Œโ”€โ”€โ–ถ Server 1 (handling 33%)
Users โ”€โ”€โ–ถ LB โ”€โ”€โ–ถ Server 2 (handling 33%)
          โ””โ”€โ”€โ–ถ Server 3 (handling 33%)

Load balancer types:


| Type | Layer | Routes By | Example |
| --- | --- | --- | --- |
| **L4 (Transport)** | TCP/UDP | IP + Port | AWS NLB |
| **L7 (Application)** | HTTP | URL, headers, cookies | AWS ALB, Nginx |
| **DNS** | DNS | Geographic location | Route 53, Cloudflare |

For AI apps: an L7 (Application) load balancer is best — URL path based routing, health checks, and sticky session support! 🎯


Key features:

  • โค๏ธ Health checks โ€” unhealthy server ku traffic send pannaadhu
  • ๐Ÿ”„ Session persistence โ€” same user same server ku route
  • ๐Ÿ”’ SSL termination โ€” HTTPS handle pannum
  • ๐Ÿ“Š Monitoring โ€” traffic metrics expose pannum

Load Balancing Algorithms

Different algorithms, different use cases:


1. Round Robin ๐Ÿ”„

code
Request 1 โ†’ Server A
Request 2 โ†’ Server B
Request 3 โ†’ Server C
Request 4 โ†’ Server A  (back to start)
  • Simple, fair distribution
  • Problem: All servers equal assume pannum

2. Weighted Round Robin โš–๏ธ

code
Server A (weight: 3) โ†’ Gets 3 requests
Server B (weight: 2) โ†’ Gets 2 requests
Server C (weight: 1) โ†’ Gets 1 request
  • Powerful servers ku more traffic
  • AI use: GPU servers ku higher weight

3. Least Connections ๐Ÿ“‰

code
Server A: 5 active connections โ†’ โŒ Skip
Server B: 2 active connections โ†’ โœ… Route here!
Server C: 8 active connections โ†’ โŒ Skip
  • Best for varying request durations
  • AI apps ku BEST โ€” inference time varies!

4. IP Hash ๐Ÿ”—

code
User IP โ†’ Hash โ†’ Consistent server mapping
Same user always โ†’ Same server
  • Session persistence without cookies
  • Good for stateful AI conversations

5. Least Response Time โšก

  • Route to fastest responding server
  • Considers both connections AND latency
  • Premium option โ€” best performance

AI recommendation: Least Connections for inference APIs — one request takes 50 ms, another 2 s, and least connections handles that imbalance! 🎯

Nginx Load Balancer Setup

โœ… Example

Production Nginx config for AI API:

nginx
# /etc/nginx/conf.d/ai-api.conf

upstream ai_inference {
    least_conn;  # Best for AI โ€” variable inference times

    server gpu-node-1:8080 weight=3;  # A100 GPU
    server gpu-node-2:8080 weight=3;  # A100 GPU
    server gpu-node-3:8080 weight=1;  # T4 GPU (less powerful)

    # Reuse idle connections to the upstream servers
    keepalive 32;
}

upstream ai_api {
    least_conn;
    server api-node-1:8000;
    server api-node-2:8000;
    server api-node-3:8000;
}

server {
    listen 443 ssl;
    server_name api.myaiapp.com;

    # SSL
    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;

    # API routes โ†’ CPU servers
    location /api/ {
        proxy_pass http://ai_api;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
    }

    # Inference routes โ†’ GPU servers
    location /predict {
        proxy_pass http://ai_inference;
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;  # AI inference takes longer!
    }

    # Streaming (SSE for token generation)
    location /stream {
        proxy_pass http://ai_inference;
        proxy_buffering off;       # Disable buffering for SSE
        proxy_read_timeout 300s;   # Long timeout for streaming
    }
}

Key: API routes go to CPU servers and inference routes go to GPU servers — separate routing! ⚡

AWS ALB for AI Applications

AWS Application Load Balancer โ€” managed, no maintenance:


Terraform config:

hcl
# Application Load Balancer
resource "aws_lb" "ai_alb" {
  name               = "ai-app-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnets
  security_groups    = [aws_security_group.alb.id]
}

# Target Group โ€” API Servers
resource "aws_lb_target_group" "api" {
  name     = "ai-api-tg"
  port     = 8000
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    interval            = 15
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
  }

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 3600  # 1 hour session
  }
}

# Target Group โ€” GPU Inference
resource "aws_lb_target_group" "inference" {
  name     = "ai-inference-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path     = "/health"
    interval = 30
    timeout  = 10  # GPU warm-up takes time
  }

  # Slow start โ€” new GPU instance warm-up period
  slow_start = 120  # 2 min warm-up
}

# Path-based routing
resource "aws_lb_listener_rule" "inference" {
  listener_arn = aws_lb_listener.https.arn
  condition {
    path_pattern { values = ["/predict*", "/stream*"] }
  }
  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.inference.arn
  }
}

Slow start = when a new GPU instance is added, traffic ramps up gradually, giving the model time to load! 🐢➡️🚀

Auto Scaling โ€” Core Concepts

Auto scaling = adjusting capacity based on demand.


Types:


1. Reactive Scaling ๐Ÿ“ˆ

  • Metric threshold cross pannina scale
  • Example: CPU > 70% โ†’ add server
  • Lag: 2-5 minutes delay

2. Predictive Scaling ๐Ÿ”ฎ

  • Historical patterns analyze panni advance la scale
  • Example: Every Monday 9 AM traffic spike โ€” pre-scale
  • AWS supports this natively!

3. Scheduled Scaling โฐ

  • Fixed schedule based
  • Example: Business hours la 10 servers, nights 2 servers
  • Cheapest option for predictable traffic

Auto Scaling Group (ASG) config:

code
Minimum: 2 instances (always running)
Desired: 4 instances (normal load)
Maximum: 20 instances (peak limit)
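Target tracking is roughly proportional: desired ≈ current × metric / target, clamped to the group's min/max. A sketch of that math (the clamp bounds mirror the ASG config above):

```python
import math

def desired_capacity(current, metric, target, min_size=2, max_size=20):
    """Approximate target-tracking math: scale so the metric returns to target."""
    wanted = math.ceil(current * metric / target)
    return max(min_size, min(max_size, wanted))

print(desired_capacity(4, 90, 60))  # 6 - CPU at 90% against a 60% target
print(desired_capacity(4, 10, 60))  # 2 - quiet period, clamped at min_size
```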

Scaling policies comparison:

| Policy | How It Works | Best For |
| --- | --- | --- |
| Target Tracking | Maintain a metric at a target value | Simple, effective |
| Step Scaling | Different actions at different thresholds | Complex rules |
| Simple Scaling | One action per alarm | Basic needs |
| Predictive | ML-based prediction | Predictable patterns |

For AI apps: target tracking on a custom metric (inference queue length) = BEST! 🎯

AWS Auto Scaling for AI

โœ… Example

Complete auto scaling setup:

hcl
# Launch Template โ€” GPU Instance
resource "aws_launch_template" "gpu" {
  name          = "ai-gpu-template"
  image_id      = "ami-deep-learning"
  instance_type = "g4dn.xlarge"

  user_data = base64encode(<<-EOF
    #!/bin/bash
    docker pull myapp/inference:latest
    docker run -d --gpus all -p 8080:8080 myapp/inference:latest
  EOF
  )
}

# Auto Scaling Group
resource "aws_autoscaling_group" "gpu_asg" {
  name                = "ai-gpu-asg"
  min_size            = 2
  max_size            = 15
  desired_capacity    = 3
  vpc_zone_identifier = var.private_subnets

  launch_template {
    id      = aws_launch_template.gpu.id
    version = "$Latest"
  }

  # Warm pool โ€” pre-initialized instances
  warm_pool {
    pool_state                  = "Stopped"
    min_size                    = 2
    max_group_prepared_capacity = 5
  }

  instance_refresh {
    strategy = "Rolling"
    preferences { min_healthy_percentage = 80 }
  }
}

# Target Tracking โ€” GPU Utilization
resource "aws_autoscaling_policy" "gpu_target" {
  name                   = "gpu-utilization-target"
  autoscaling_group_name = aws_autoscaling_group.gpu_asg.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    customized_metric_specification {
      metric_name = "GPUUtilization"
      namespace   = "Custom/AI"
      statistic   = "Average"
    }
    target_value = 65.0  # Keep average GPU utilization around 65%
  }
}

Warm pool = pre-initialized instances kept ready, avoiding the ~2 min wait for model loading! GPU cold starts are a real problem, and the warm pool solves it! 🔥

Kubernetes HPA + KEDA

Kubernetes la auto scaling โ€” HPA (Horizontal Pod Autoscaler) + KEDA:


Standard HPA:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 2
  maxReplicas: 30
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60  # Max 4 pods per minute
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scale down
      policies:
      - type: Pods
        value: 2
        periodSeconds: 120
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "5"

KEDA โ€” Event-Driven Scaling (better for AI):

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-worker-scaler
spec:
  scaleTargetRef:
    name: ai-gpu-worker
  minReplicaCount: 0    # Scale to ZERO! ๐Ÿ’ฐ
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: pending_inference_requests
      threshold: "10"
      query: sum(inference_queue_pending)

KEDA advantage: scale to zero — no requests means 0 pods and zero cost! When a request arrives, it automatically scales back up. Perfect for AI batch workers! 🎯

Load Balancing + Auto Scaling Architecture

๐Ÿ—๏ธ Architecture Diagram
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚      LOAD BALANCING + AUTO SCALING ARCHITECTURE           โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                            โ”‚
โ”‚  ๐Ÿ“ฑ Users                                                  โ”‚
โ”‚    โ”‚                                                       โ”‚
โ”‚    โ–ผ                                                       โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                         โ”‚
โ”‚  โ”‚  DNS    โ”‚โ”€โ”€โ”€โ”€โ–ถโ”‚  CloudFlare  โ”‚ (DDoS protection)       โ”‚
โ”‚  โ”‚(Route53)โ”‚     โ”‚     CDN      โ”‚                         โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                         โ”‚
โ”‚                          โ–ผ                                 โ”‚
โ”‚               โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                        โ”‚
โ”‚               โ”‚   AWS ALB (L7)   โ”‚                        โ”‚
โ”‚               โ”‚  /api โ†’ API TG   โ”‚                        โ”‚
โ”‚               โ”‚  /predict โ†’ GPU  โ”‚                        โ”‚
โ”‚               โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜                        โ”‚
โ”‚                    โ”‚         โ”‚                             โ”‚
โ”‚          โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”  โ”Œโ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                  โ”‚
โ”‚          โ”‚  API ASG   โ”‚  โ”‚  GPU ASG    โ”‚                  โ”‚
โ”‚          โ”‚ (CPU, t3)  โ”‚  โ”‚ (g4dn)     โ”‚                  โ”‚
โ”‚          โ”‚            โ”‚  โ”‚            โ”‚                  โ”‚
โ”‚          โ”‚ min: 2     โ”‚  โ”‚ min: 2     โ”‚                  โ”‚
โ”‚          โ”‚ max: 20    โ”‚  โ”‚ max: 15    โ”‚                  โ”‚
โ”‚          โ”‚            โ”‚  โ”‚            โ”‚                  โ”‚
โ”‚          โ”‚ โ”Œโ”€โ”€โ” โ”Œโ”€โ”€โ” โ”‚  โ”‚ โ”Œโ”€โ”€โ” โ”Œโ”€โ”€โ” โ”‚                  โ”‚
โ”‚          โ”‚ โ”‚A1โ”‚ โ”‚A2โ”‚ โ”‚  โ”‚ โ”‚G1โ”‚ โ”‚G2โ”‚ โ”‚                  โ”‚
โ”‚          โ”‚ โ””โ”€โ”€โ”˜ โ””โ”€โ”€โ”˜ โ”‚  โ”‚ โ””โ”€โ”€โ”˜ โ””โ”€โ”€โ”˜ โ”‚                  โ”‚
โ”‚          โ”‚  โ”Œโ”€โ”€โ”     โ”‚  โ”‚  โ”Œโ”€โ”€โ”     โ”‚                  โ”‚
โ”‚          โ”‚  โ”‚A3โ”‚ ... โ”‚  โ”‚  โ”‚G3โ”‚ ... โ”‚                  โ”‚
โ”‚          โ”‚  โ””โ”€โ”€โ”˜     โ”‚  โ”‚  โ””โ”€โ”€โ”˜     โ”‚                  โ”‚
โ”‚          โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                  โ”‚
โ”‚               โ”‚              โ”‚                            โ”‚
โ”‚          โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”                     โ”‚
โ”‚          โ”‚CloudWatchโ”‚   โ”‚CloudWatch โ”‚                     โ”‚
โ”‚          โ”‚CPU: 60% โ”‚   โ”‚GPU: 65%   โ”‚                     โ”‚
โ”‚          โ”‚target   โ”‚   โ”‚queue: <10 โ”‚                     โ”‚
โ”‚          โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                     โ”‚
โ”‚                                                            โ”‚
โ”‚  ๐Ÿ”ฅ Warm Pool: 3 pre-initialized GPU instances             โ”‚
โ”‚  ๐Ÿ“Š Predictive Scaling: ML-based traffic prediction        โ”‚
โ”‚  โฐ Scheduled: Business hours boost                        โ”‚
โ”‚                                                            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Health Checks โ€” Deep vs Shallow

Health checks = how the load balancer identifies which servers are healthy.


Shallow Health Check (Liveness) ๐ŸŸข:

python
@app.get("/health")
def health():
    return {"status": "ok"}
# Just checks: "Is the process running?"

Deep Health Check (Readiness) ๐Ÿ”:

python
# Deep readiness check for a FastAPI app. The model, torch, db, and the
# get_memory_usage/get_gpu_memory helpers are app-specific, defined elsewhere.
from fastapi.responses import JSONResponse

@app.get("/health/ready")
async def readiness():
    checks = {
        "model_loaded": model is not None,
        "gpu_available": torch.cuda.is_available(),
        "db_connected": await db.ping(),
        "memory_ok": get_memory_usage() < 90,
        "gpu_memory_ok": get_gpu_memory() < 85,
    }
    all_ok = all(checks.values())
    status = 200 if all_ok else 503
    return JSONResponse(checks, status_code=status)

AI-specific health checks:

| Check | Why | Failure Action |
| --- | --- | --- |
| Model loaded? | GPU memory issues | Restart pod |
| GPU available? | Driver crash | Replace instance |
| Inference test? | Model corrupt | Reload model |
| Memory < 90%? | OOM risk | Scale up |
| Queue < 100? | Overloaded | Scale up |

Configuration:

code
Health check interval: 15 seconds
Timeout: 10 seconds (GPU warm-up)
Healthy threshold: 2 consecutive passes
Unhealthy threshold: 3 consecutive failures
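The consecutive-threshold logic in that configuration can be sketched as a tiny state machine:

```python
class HealthTracker:
    """Mark a target unhealthy after N consecutive failures, healthy after M passes."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self.passes = 0
        self.failures = 0

    def record(self, passed):
        if passed:
            self.failures = 0          # a pass resets the failure streak
            self.passes += 1
            if self.passes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.passes = 0            # a failure resets the pass streak
            self.failures += 1
            if self.failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy

tracker = HealthTracker()
tracker.record(False); tracker.record(False)
print(tracker.healthy)   # True: two failures, but the threshold is three
tracker.record(False)
print(tracker.healthy)   # False: third consecutive failure
```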

Rule: always use deep health checks for AI servers — verifying the model is loaded and the GPU is available is mandatory! 🛡️

AI-Specific Scaling Patterns

๐Ÿ’ก Tip

AI workloads call for their own scaling patterns:

๐Ÿค– 1. Inference vs Training Separation

- Inference: Auto-scale with request volume

- Training: Scheduled, fixed capacity (spot instances)

- NEVER mix on same servers!

๐Ÿค– 2. Model Warm-up Strategy

code
New instance โ†’ Load model (60s) โ†’ Warm-up inference (30s) โ†’ Ready!

- Maintain a warm pool

- Health checks pass only after warm-up completes
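That warm-up gate can be sketched as two explicit phases (the class and method names here are ours); the readiness probe passes only after both complete:

```python
class InferenceServer:
    """Readiness turns true only after model load AND warm-up inference."""

    def __init__(self):
        self.model_loaded = False
        self.warmed_up = False

    def load_model(self):
        self.model_loaded = True    # stands in for the ~60s model load

    def warm_up(self):
        if not self.model_loaded:
            raise RuntimeError("load the model before warming up")
        self.warmed_up = True       # stands in for a few dummy inferences

    def ready(self):
        """What /health/ready should report to the load balancer."""
        return self.model_loaded and self.warmed_up

server = InferenceServer()
server.load_model()
print(server.ready())   # False: loaded, but not yet warmed up
server.warm_up()
print(server.ready())   # True: safe to receive traffic
```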

๐Ÿค– 3. Batch vs Real-time Split

- Real-time: Always-on, 2+ instances minimum

- Batch: KEDA scale-to-zero, event-triggered

- Different ASGs for different workload types!

๐Ÿค– 4. GPU Memory-Based Scaling

- GPU memory > 80% → scale up (before OOM!)

- Publish a custom CloudWatch metric from each GPU node
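A sketch of the publishing side: the function builds a CloudWatch `PutMetricData` payload. The metric name and namespace are our assumptions, and the actual publish (commented out) would go through boto3:

```python
def gpu_memory_metric(used_percent, namespace="Custom/AI"):
    """Build a CloudWatch PutMetricData payload for GPU memory utilization."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": "GPUMemoryUtilization",  # assumed metric name
                "Unit": "Percent",
                "Value": used_percent,
            }
        ],
    }

payload = gpu_memory_metric(82.5)
# In production, publish on a timer from each GPU node:
# boto3.client("cloudwatch").put_metric_data(**payload)
```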

๐Ÿค– 5. Graceful Shutdown

python
# Don't kill mid-inference!
@app.on_event("shutdown")
async def shutdown():
    # Finish current requests (drain)
    await inference_queue.join()
    # Unload model cleanly
    del model
    torch.cuda.empty_cache()

Without graceful shutdown, inference gets cut off mid-way and the user gets a broken response! ⚠️

Cost-Effective Scaling

Optimizing the cost of auto scaling:


Mixed Instance Strategy:

hcl
mixed_instances_policy {
  instances_distribution {
    on_demand_base_capacity = 2        # 2 on-demand (reliable)
    on_demand_percentage_above_base = 20  # 20% on-demand
    spot_allocation_strategy = "lowest-price"
  }
  launch_template {
    override {
      instance_type = "g4dn.xlarge"    # Primary
    }
    override {
      instance_type = "g4dn.2xlarge"   # Fallback
    }
    override {
      instance_type = "g5.xlarge"      # Alternative
    }
  }
}

Cost breakdown:

| Strategy | Monthly Cost | Reliability |
| --- | --- | --- |
| All on-demand | $10,000 | 99.99% |
| Mixed (80% spot) | $4,000 | 99.9% |
| All spot | $3,000 | 99% |
| Scheduled + spot | $3,500 | 99.9% |

Best combo:

  • Base: 2 on-demand instances (always on)
  • Scale: Spot instances for traffic spikes
  • Schedule: Reduce min capacity off-hours
  • Savings: ~60% compared to all on-demand! ๐Ÿ’ฐ

Spot interruption handling: when an instance is about to be terminated, drain gracefully — complete the in-flight requests, then shut down! 🛡️

Prompt: Design Load Balancing System

๐Ÿ“‹ Copy-Paste Prompt
You are a Cloud Architect specializing in AI infrastructure.

Design a load balancing and auto-scaling system for:
- AI image generation API (Stable Diffusion model)
- 5,000 concurrent users at peak
- Each generation takes 10-30 seconds
- Must handle 100x traffic spikes (viral events)
- Budget: $8,000/month baseline

Provide:
1. Load balancer configuration (algorithm choice + reasoning)
2. Auto scaling policy (min/max/scaling triggers)
3. Warm pool / pre-warming strategy
4. Queue-based architecture for long-running generations
5. Cost optimization with spot instances
6. Graceful handling of spot interruptions
7. Monitoring metrics and alerts

Summary

Key takeaways:


โœ… Load balancing = Least connections best for AI (variable inference times)

โœ… L7 routing = API โ†’ CPU servers, Inference โ†’ GPU servers

โœ… Auto scaling = Target tracking with custom metrics (queue depth, GPU%)

โœ… Warm pool = Pre-initialized GPU instances for fast scaling

โœ… KEDA = Scale-to-zero for batch AI workers

โœ… Health checks = Deep checks โ€” model loaded + GPU available

โœ… Cost = Mixed instances (on-demand base + spot for scaling) = 60% savings


Action item: set up an ALB + ASG in AWS (free tier). Put 2 instances behind the ALB and add a CPU target tracking policy. Simulate traffic and watch auto scaling in action! ⚖️


Next article: Multi-Cloud AI Systems — avoiding vendor lock-in! ☁️☁️☁️

๐Ÿ ๐ŸŽฎ Mini Challenge

Challenge: Set Up a Load Balancer + Auto Scaling (AWS)


Simulate high traffic → watch it auto-scale! 🚀⚖️


Step 1: Create a Launch Template 📝

bash
# AWS Console โ†’ EC2 โ†’ Launch Templates
# AMI: Ubuntu 20.04
# Instance type: t3.micro (free tier)
# Security group: port 80, 443, 22
# User data:
#!/bin/bash
apt-get update
apt-get install -y python3-pip
pip install fastapi uvicorn
# app.py with simple prediction endpoint

# Save template version

Step 2: Create an Auto Scaling Group 🔄

bash
# AWS Console โ†’ Auto Scaling โ†’ Auto Scaling Groups โ†’ Create
# Launch template: select template
# Min size: 2
# Max size: 8
# Desired capacity: 2
# VPC: select
# Subnets: multi-AZ

Step 3: Create a Target Group 🎯

bash
# EC2 โ†’ Target Groups โ†’ Create
# Name: ai-targets
# Protocol: HTTP
# Port: 80
# Health check: /health (every 30 sec)
# Healthy threshold: 2
# Unhealthy threshold: 3

Step 4: Create an Application Load Balancer ⚖️

bash
# EC2 โ†’ Load Balancers โ†’ Create ALB
# Name: ai-alb
# Availability zones: multi-AZ
# Listeners: HTTP:80 โ†’ target group: ai-targets
# Deploy
# Copy DNS name

Step 5: Scaling Policies ๐Ÿ“Š

bash
# ASG โ†’ Scaling Policies โ†’ Target Tracking
# Metric: CPU Utilization
# Target: 60%
# Scale up threshold: 60% breach
# Scale down threshold: 30% below
# Cooldown: 5 minutes

# Alternative: Step Scaling (more granular)

Step 6: Load Test ๐Ÿ”ฅ

bash
# Install Apache Bench
# Generate traffic
ab -n 10000 -c 100 http://alb-dns/predict

# Monitor in AWS Console:
# - CloudWatch: CPU utilization
# - ASG: desired capacity increase (auto!)
# - Target Group: healthy target count

Step 7: Monitor & Observe ๐Ÿ“ˆ

bash
# CloudWatch Dashboards:
# - ALB request count
# - Target CPU
# - ASG desired capacity trend
# - See auto-scaling in action!

Step 8: Cost Check & Cleanup ๐Ÿ’ฐ

bash
# CloudWatch โ†’ Billing
# Check estimated hourly cost
# Then delete ALB, ASG, launch template
# Keep everything within the free tier!

Completion Time: 2 hours

Tools: AWS EC2, ALB, ASG, CloudWatch

Hands-on scaling experience โญ

๐Ÿ’ผ Interview Questions

Q1: Sticky sessions โ€” need? Load balancer la how configure?

A: Sticky sessions = same client, same server. Needed when session data lives on an individual server (no distributed cache). Configure: ALB → target group → stickiness; duration, e.g. 1 day. Drawback: if that server goes down, the session is lost. Better: a Redis session store (every server can read it, so no stickiness is needed).


Q2: Connection draining โ€” graceful shutdown?

A: Draining instances stop receiving new requests but are allowed to finish in-flight ones. Time: 300 seconds by default. Critical: long-running requests (e.g. model training) may still hit the timeout. Configure: Target group → Deregistration delay. Monitoring: consistently slow draining might indicate slow queries.


Q3: Geographic load balancing โ€” multi-region?

A: Route53 (AWS): geolocation routing (user location โ†’ nearest region). Benefits: latency reduce, data residency compliance. Complexity: data sync across regions, cost increase. Best for: global users, low latency critical. Cost: data transfer between regions = expensive.


Q4: Session affinity vs stateless โ€” architecture?

A: Stateless better (horizontally scale, resilient, simple). If session needed: Redis/Memcached (shared). Avoid local session state (hard to scale). AI apps: stateless inference perfect. Training jobs: stateful (data, checkpoints persistent needed).


Q5: Health checks failing โ€” slow responses problem?

A: Increase timeout, unhealthy threshold. Real issue: server overloaded, bad code, deadlock? Investigate: logs, metrics (CPU, memory). Temporary: increase desired capacity ASG. Permanent: code fix, resource optimize. Monitoring: health check failures = alert.

Frequently Asked Questions

โ“ Load balancer na enna simple ah?
Traffic cop maari โ€” incoming requests multiple servers ku distribute pannum. One server overload aagaadhu. Server down aana, traffic automatically other servers ku route aagum. Users ku always fast response kedaikum.
โ“ Auto scaling na enna?
Traffic based ah servers automatically increase/decrease aagum. Morning 10 users โ€” 1 server. Afternoon 10,000 users โ€” 10 servers auto add. Night 5 users โ€” back to 1 server. Pay only for what you use!
โ“ AI apps ku special load balancing venum ah?
Yes! AI inference GPU-bound โ€” normal round-robin work aagaadhu. Least-connections or GPU-utilization based routing venum. Long-running inference requests ku timeout adjust pannanum. Streaming responses ku WebSocket/SSE support venum.
โ“ Auto scaling aggressive ah set pannalaam ah?
Careful! Too aggressive = cost spike. Too conservative = latency spike. Start with target tracking (CPU 60-70%). AI apps ku custom metrics use pannunga โ€” inference queue length or GPU utilization based scaling better than CPU-based.