Cloud cost optimization
Introduction
You've migrated to the cloud and everything works great! Then the month-end bill arrives: $3,000, against an expected budget of just $800! 😱
About 30-35% of cloud spend is wasted, according to the Flexera 2025 report. Unused resources, oversized instances, and the wrong pricing model all drain money!
Cloud cost optimization = eliminating unnecessary spending while maintaining the same performance.
In this article:
- Understanding cloud billing models
- Rightsizing: selecting the correct instance size
- Reserved Instances & Savings Plans
- Spot Instances for up to 90% savings
- Storage & network cost optimization
- FinOps practices
- Real-world cost reduction strategies
You can cut your cloud bill by 40-60%! 💪
Cloud Billing Models — How You Pay 💳
Pay-as-you-go = default pricing. Most expensive but most flexible.
AWS Pricing Models:
| Model | Discount | Commitment | Best For |
|---|---|---|---|
| **On-Demand** | 0% | None | Unpredictable workloads |
| **Reserved (1yr)** | 30-40% | 1 year | Steady-state apps |
| **Reserved (3yr)** | 50-60% | 3 years | Long-term production |
| **Savings Plans** | 30-50% | 1-3 years | Flexible commitment |
| **Spot Instances** | 60-90% | None (can be interrupted) | Batch, CI/CD, testing |
GCP Pricing Models:
| Model | Discount | How it works |
|---|---|---|
| **On-Demand** | 0% | Pay per second |
| **Committed Use (CUD)** | 28-55% | 1 or 3 year commitment |
| **Sustained Use** | Up to 30% | **Automatic!** Run >25% of month |
| **Preemptible/Spot** | 60-91% | Can be stopped anytime |
Azure Pricing Models:
| Model | Discount | Commitment |
|---|---|---|
| **Pay-as-you-go** | 0% | None |
| **Reserved (1yr)** | 30-40% | 1 year |
| **Reserved (3yr)** | 50-60% | 3 years |
| **Spot VMs** | Up to 90% | Can be evicted |
| **Azure Hybrid Benefit** | Up to 85% | Existing Windows/SQL licenses |
💡 GCP advantage: Sustained Use Discounts are automatic — no commitment needed! Run an instance for a full month and you get ~30% off automatically.
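To make the discount columns concrete, here is a minimal sketch that turns an On-Demand hourly rate into a monthly estimate per pricing model. The discount values are illustrative midpoints picked from the AWS table (an assumption, not quoted prices), and `monthly_cost` is a hypothetical helper.

```python
# Illustrative midpoints of the discount ranges in the AWS table above.
# Real discounts vary by instance type, region, and payment option.
DISCOUNTS = {
    "on_demand": 0.00,
    "reserved_1yr": 0.35,
    "reserved_3yr": 0.55,
    "spot": 0.75,
}

HOURS_PER_MONTH = 730  # common billing approximation (8,760 hours / 12)


def monthly_cost(on_demand_hourly: float, model: str, instances: int = 1) -> float:
    """Estimated monthly cost for always-on instances under a pricing model."""
    rate = on_demand_hourly * (1 - DISCOUNTS[model])
    return round(rate * HOURS_PER_MONTH * instances, 2)


if __name__ == "__main__":
    # Ten always-on instances at $0.096/hr (the m5.large rate used later on).
    for model in DISCOUNTS:
        print(f"{model:>12}: ${monthly_cost(0.096, model, instances=10):,.2f}/month")
```

Swap in your own rates and fleet size; the point is only to see how far the same workload moves across models.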
Rightsizing — Stop Overpaying for Resources 📏
#1 waste source: oversized instances! Average CPU utilization in the cloud is 15-20%. You pay for 100% of the instance and use only about 20% of it! 😰
Rightsizing Process:
AWS Rightsizing Example:
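A minimal sketch of the rightsizing decision, assuming you already have peak CPU numbers from your monitoring. The size ladder, the 40%/80% thresholds, and `recommend_size` are all hypothetical choices for illustration, not output from AWS Compute Optimizer.

```python
# Hypothetical size ladder within one instance family (smallest to largest).
SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge"]


def recommend_size(current: str, peak_cpu_pct: float,
                   downsize_below: float = 40.0,
                   upsize_above: float = 80.0) -> str:
    """Suggest one step down when peak CPU stays low, one step up when it runs hot.

    Each step on the ladder halves or doubles vCPUs, so a peak below 40%
    should still fit under 80% after a single downsize.
    """
    i = SIZE_LADDER.index(current)
    if peak_cpu_pct < downsize_below and i > 0:
        return SIZE_LADDER[i - 1]
    if peak_cpu_pct > upsize_above and i < len(SIZE_LADDER) - 1:
        return SIZE_LADDER[i + 1]
    return current


if __name__ == "__main__":
    print(recommend_size("xlarge", peak_cpu_pct=18))  # low peak: step down
```

Moving one step at a time and re-measuring is the cautious option; memory and network usage deserve the same check before you resize.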
Tools for Rightsizing:
| Tool | Provider | Free |
|---|---|---|
| **AWS Compute Optimizer** | AWS | ✅ |
| **GCP Recommender** | GCP | ✅ |
| **Azure Advisor** | Azure | ✅ |
| **Datadog** | Multi-cloud | Trial |
| **CloudHealth** | Multi-cloud | Paid |
| **Spot.io (NetApp)** | Multi-cloud | Paid |
Rightsizing Pro Tip
Start with non-production environments! 🧪
Dev/staging instances are typically oversized by 3-4x because devs copy production config.
Quick wins:
- Dev environments: Use t3.small instead of m5.large
- Staging: Use 50% of production sizing
- Schedule dev/staging to shut down nights & weekends (save 65%!)
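Where does the "save 65%" figure come from? A quick sketch, assuming dev/staging runs 12 hours a day on weekdays only (the schedule is an assumption; `schedule_savings` is a hypothetical helper):

```python
HOURS_PER_WEEK = 24 * 7  # 168


def schedule_savings(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on cost saved by running only in the given window."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    # 12 h x 5 days = 60 of 168 hours running
    print(f"{schedule_savings(12, 5):.0%}")
```

That works out to roughly 64%, in line with the figure above. A tighter 10-hour window would save even more.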
Reserved Instances & Savings Plans 📊
Commit to your predictable workloads and you get big discounts!
When to use Reserved Instances:
- ✅ Production databases (always running)
- ✅ Core application servers
- ✅ Baseline capacity (minimum instances needed)
- ❌ Variable/seasonal workloads
- ❌ Short-term projects (<1 year)
Savings Plans vs Reserved Instances:
| Feature | Reserved Instances | Savings Plans |
|---|---|---|
| **Flexibility** | Locked to instance type | Any instance family |
| **Region** | Specific region | Any region (Compute SP) |
| **Discount** | Slightly higher | Slightly lower |
| **Ease** | Complex to manage | Simple commitment |
| **Recommendation** | Large, stable fleets | Most teams ✅ |
Optimal Strategy — Layered Approach:
AWS Savings Plan Example:
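One way to picture a layered setup, as an assumption rather than a fixed recipe: cover the always-on baseline with a Savings Plan, then split burst capacity between Spot and On-Demand. The 40% and 70% discounts and the 50/50 burst split are illustrative picks from the ranges quoted earlier, and `blended_hourly_cost` is a hypothetical helper.

```python
def blended_hourly_cost(on_demand_rate: float, baseline: int, burst: int,
                        spot_share: float = 0.5,
                        sp_discount: float = 0.40,
                        spot_discount: float = 0.70) -> float:
    """Hourly cost with a committed baseline and burst split across Spot/On-Demand."""
    committed = baseline * on_demand_rate * (1 - sp_discount)
    spot = burst * spot_share * on_demand_rate * (1 - spot_discount)
    on_demand = burst * (1 - spot_share) * on_demand_rate
    return committed + spot + on_demand


if __name__ == "__main__":
    rate = 0.10  # hypothetical On-Demand $/hr
    layered = blended_hourly_cost(rate, baseline=8, burst=4)
    all_on_demand = (8 + 4) * rate
    print(f"layered ${layered:.2f}/hr vs all On-Demand ${all_on_demand:.2f}/hr")
```

Commit only to the capacity you are sure you will use for the whole term; everything above that line stays flexible.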
Spot Instances — 90% Savings! ⚡
Spot Instances = the cloud provider's unused capacity. 60-90% cheaper than On-Demand! But they can be terminated with only a 2-minute warning.
Perfect for:
- ✅ CI/CD pipelines (build servers)
- ✅ Batch processing / data pipelines
- ✅ Machine learning training
- ✅ Testing environments
- ✅ Stateless web servers (behind load balancer)
NOT suitable for:
- ❌ Databases
- ❌ Single-instance production
- ❌ Stateful applications
- ❌ Anything that can't handle interruption
Spot Strategy — Diversify:
Spot Instance pricing (example):
| Instance | On-Demand | Spot Price | Savings |
|---|---|---|---|
| m5.large | $0.096/hr | $0.029/hr | **70%** |
| c5.xlarge | $0.170/hr | $0.034/hr | **80%** |
| r5.large | $0.126/hr | $0.025/hr | **80%** |
| g4dn.xlarge (GPU) | $0.526/hr | $0.158/hr | **70%** |
Use Spot for ML training and the savings are massive! 🚀
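The Savings column in the table is just the ratio of the two prices. A small sketch that recomputes it from the example rates above (Spot prices move constantly, so treat these as a snapshot; `savings_pct` is a hypothetical helper):

```python
# (on_demand, spot) in USD/hour, copied from the example table above.
PRICES = {
    "m5.large": (0.096, 0.029),
    "c5.xlarge": (0.170, 0.034),
    "r5.large": (0.126, 0.025),
    "g4dn.xlarge": (0.526, 0.158),
}


def savings_pct(on_demand: float, spot: float) -> int:
    """Spot discount relative to On-Demand, rounded to a whole percent."""
    return round((1 - spot / on_demand) * 100)


if __name__ == "__main__":
    for name, (on_demand, spot) in PRICES.items():
        print(f"{name:<12} {savings_pct(on_demand, spot)}%")
```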
Storage Cost Optimization 💾
Storage costs slowly build up — often second largest cloud expense!
S3 Storage Classes (AWS):
| Class | Cost (GB/month) | Access | Best For |
|---|---|---|---|
| **S3 Standard** | $0.023 | Frequent | Active data |
| **S3 IA** | $0.0125 | Infrequent | Backups (30+ days) |
| **S3 Glacier Instant** | $0.004 | Rare | Archives (instant access) |
| **S3 Glacier Deep** | $0.00099 | Very rare | Compliance archives |
10 TB storage cost comparison:
| Class | Monthly Cost |
|---|---|
| S3 Standard | **$230** |
| S3 IA | **$125** |
| S3 Glacier Instant | **$40** |
| S3 Glacier Deep | **$10** |
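The comparison is plain multiplication: size in GB times the per-GB price. A sketch that reproduces it, assuming 1 TB = 1,000 GB and ignoring request, retrieval, and minimum-duration fees (`monthly_storage_cost` is a hypothetical helper):

```python
# USD per GB-month, copied from the storage class table above.
PRICE_PER_GB = {
    "S3 Standard": 0.023,
    "S3 IA": 0.0125,
    "S3 Glacier Instant": 0.004,
    "S3 Glacier Deep": 0.00099,
}


def monthly_storage_cost(tb: float, storage_class: str, gb_per_tb: int = 1000) -> float:
    """Storage-only monthly cost; retrieval and request fees are not included."""
    return round(tb * gb_per_tb * PRICE_PER_GB[storage_class], 2)


if __name__ == "__main__":
    for storage_class in PRICE_PER_GB:
        print(f"{storage_class:<20} ${monthly_storage_cost(10, storage_class):,.2f}")
```

The colder classes charge for retrieval, so the cheapest row only wins for data you rarely read.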
S3 Lifecycle Policy — Automate tiering:
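A minimal sketch of what such a policy could look like, written as the dictionary that boto3's `put_bucket_lifecycle_configuration` accepts. The rule ID, the 30/90/365-day thresholds, and the bucket name are assumptions for illustration; pick thresholds that match how your data is actually read.

```python
import json

# Hypothetical rule: the day counts are illustrative, not a recommendation.
LIFECYCLE_CONFIG = {
    "Rules": [
        {
            "ID": "tier-down-aging-objects",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix = applies to the whole bucket
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-example-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=LIFECYCLE_CONFIG,
#   )

if __name__ == "__main__":
    print(json.dumps(LIFECYCLE_CONFIG, indent=2))
```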
EBS Volume Optimization:
Quick wins:
- 🗑️ Delete unattached EBS volumes
- 📦 Enable S3 Intelligent-Tiering (auto-moves data)
- 🗜️ Compress data before storing
- 🔄 Set lifecycle policies on ALL buckets
- 📸 Delete old snapshots (>90 days)
Network & Data Transfer Costs 🌐
Hidden cost killer = data transfer! Ingress free, but egress is expensive.
AWS Data Transfer Pricing:
| Transfer Type | Cost |
|---|---|
| **Internet → AWS** | FREE |
| **AWS → Internet** | $0.09/GB |
| **Cross-region** | $0.02/GB |
| **Cross-AZ** | $0.01/GB |
| **Same AZ** | FREE |
| **NAT Gateway processing** | $0.045/GB |
NAT Gateway is expensive! 🚨
Optimization strategies:
| Strategy | Savings |
|---|---|
| **CloudFront CDN** | 40-60% on data transfer |
| **VPC Endpoints** | Eliminate NAT for AWS services |
| **Same-AZ placement** | Eliminate cross-AZ costs |
| **Compress responses** | Reduce transfer volume |
| **Cache at edge** | Fewer origin requests |
Pro tip: S3 traffic routed through a NAT Gateway pays the $0.045/GB processing charge on every gigabyte. Gateway VPC Endpoints for S3 and DynamoDB are free, so routing that traffic through an endpoint removes the charge! 💡
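To see how quickly per-GB charges add up, here is a rough estimator built on the rates in the pricing table above. It assumes flat per-GB rates and leaves out the NAT Gateway's hourly charge and tiered egress pricing; `transfer_cost` is a hypothetical helper.

```python
# USD per GB, copied from the AWS data transfer table above.
RATES_PER_GB = {
    "internet_egress": 0.09,
    "cross_region": 0.02,
    "cross_az": 0.01,
    "same_az": 0.0,
    "nat_processing": 0.045,
}


def transfer_cost(gb: float, path: str) -> float:
    """Flat-rate monthly estimate for moving `gb` gigabytes over a given path."""
    return round(gb * RATES_PER_GB[path], 2)


if __name__ == "__main__":
    print("500 GB via NAT:", transfer_cost(500, "nat_processing"))
    print("1 TB to internet:", transfer_cost(1000, "internet_egress"))
```

At these rates, 500 GB through a NAT Gateway is $22.50 a month in processing alone, before the gateway's hourly fee.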
FinOps — Cloud Financial Management 📈
FinOps = building a culture of accountability and optimization around cloud cost.
FinOps Lifecycle:
Cost Tagging Strategy (Critical!):
Without tags = impossible to know which team/project is spending how much!
Enforce tagging with AWS SCP:
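A sketch of a tag-enforcing Service Control Policy, generated in Python so the tag list stays in one place. It denies `ec2:RunInstances` when a required tag is missing from the request, using the `Null` condition on `aws:RequestTag/<key>`. The tag keys and the `build_tag_enforcement_scp` helper are assumptions; it covers only EC2 instance launches, and the resulting JSON still has to be attached through AWS Organizations.

```python
import json

REQUIRED_TAGS = ["team", "project", "cost-center"]  # assumed tag keys


def build_tag_enforcement_scp(required_tags: list) -> dict:
    """Build an SCP that denies EC2 launches missing any required tag."""
    statements = []
    for tag in required_tags:
        # Sid must be alphanumeric, so "cost-center" becomes "CostCenter".
        suffix = "".join(part.capitalize() for part in tag.split("-"))
        statements.append({
            "Sid": f"RequireTag{suffix}",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {"Null": {f"aws:RequestTag/{tag}": "true"}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    print(json.dumps(build_tag_enforcement_scp(REQUIRED_TAGS), indent=2))
```

Try it in a sandbox account first: a Deny in an SCP also blocks automation that launches untagged instances.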
Budget Alerts:
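The logic behind a budget alert is a simple threshold check. A minimal sketch using the 80%, 100%, and 120% thresholds that appear in the monitoring architecture later in this article; `crossed_thresholds` is a hypothetical helper, and in practice a service such as AWS Budgets does the notifying for you.

```python
THRESHOLDS = (0.80, 1.00, 1.20)  # fractions of the monthly budget


def crossed_thresholds(spend: float, budget: float) -> list:
    """Return every alert threshold the current spend has reached."""
    return [t for t in THRESHOLDS if spend >= budget * t]


if __name__ == "__main__":
    # The intro's scenario: $800 budget, $3,000 bill.
    for t in crossed_thresholds(spend=3000, budget=800):
        print(f"ALERT: spend passed {t:.0%} of budget")
```

With alerts in place, the $3,000 surprise from the introduction would have shown up as a warning at $640.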
Kubernetes Cost Optimization ⎈
Top reasons for wasted cost in K8s:
1. Over-provisioned resource requests:
2. Cluster Autoscaler not configured:
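For the first item, over-provisioned requests, a rough sketch of how you might size a CPU request from observed usage. The p95-plus-20%-headroom rule and both helper names are assumptions for illustration; tools like Goldilocks produce recommendations from real metrics.

```python
def recommended_request_m(p95_usage_m: int, headroom_pct: int = 20) -> int:
    """CPU request in millicores: p95 usage plus headroom, rounded up to 10m."""
    scaled = p95_usage_m * (100 + headroom_pct)  # integer math, units of 0.01m
    return ((scaled + 999) // 1000) * 10


def wasted_fraction(requested_m: int, p95_usage_m: int) -> float:
    """Share of the current request that the recommendation would free up."""
    recommended = recommended_request_m(p95_usage_m)
    return max(0.0, 1 - recommended / requested_m)


if __name__ == "__main__":
    # A pod requesting a full core (1000m) while its p95 usage is 150m.
    print(recommended_request_m(150), "m recommended")
    print(f"{wasted_fraction(1000, 150):.0%} of the request is unused")
```

Requests drive scheduling, so every millicore you over-request is node capacity you pay for and nobody else can use.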
K8s Cost Tools:
| Tool | What it does | Free |
|---|---|---|
| **Kubecost** | Cost allocation per namespace/pod | ✅ (basic) |
| **OpenCost** | CNCF cost monitoring | ✅ |
| **Goldilocks** | Resource request recommendations | ✅ |
| **Karpenter** | Smart node provisioning | ✅ |
Karpenter vs Cluster Autoscaler:
- Cluster Autoscaler: Node group based, slower scaling
- Karpenter: Pod-aware, picks optimal instance type, much faster ⚡
Cost Monitoring Architecture
**Cloud Cost Monitoring & Optimization Architecture:**
```
Cloud Providers (AWS / GCP / Azure)
│
▼
Cost Data Collection
├── AWS Cost Explorer API
├── GCP Billing Export (BigQuery)
├── Azure Cost Management API
│
▼
Central Cost Platform
├── Data aggregation & normalization
├── Tag-based allocation
├── Anomaly detection 🚨
├── Forecast & budgeting
│
▼
Dashboards & Alerts
├── Team-level cost dashboards
├── Budget threshold alerts (80%, 100%, 120%)
├── Daily cost anomaly notifications
├── Monthly cost review reports
│
▼
Optimization Engine
├── Rightsizing recommendations
├── Reserved/Savings Plan analysis
├── Unused resource detection
├── Spot Instance opportunities
│
▼
Governance & Automation
├── Auto-stop idle resources
├── Tag compliance enforcement
├── Auto-scaling policies
└── Cost approval workflows
```
**Tools for this architecture:**
| Component | Options |
|-----------|---------|
| **Collection** | AWS CUR, GCP Billing Export, Azure Exports |
| **Platform** | CloudHealth, Spot.io, Apptio, Kubecost |
| **Dashboard** | Grafana, custom (Metabase + BigQuery) |
| **Alerts** | PagerDuty, Slack, OpsGenie |
| **Automation** | Lambda functions, Cloud Functions |
Real-World Cost Reduction Case Study
Scenario: AI SaaS Startup — $8,000/month → $3,200/month 🎯
Initial State ($8,000/month):
- 20 EC2 instances (all m5.xlarge On-Demand)
- 5 TB S3 Standard storage
- RDS db.r5.2xlarge (Multi-AZ)
- NAT Gateway processing 500 GB/month
- No tags, no monitoring, no optimization
Optimization Steps:
| Action | Monthly Savings |
|--------|----------------|
| Rightsizing 20 instances → 8 m5.large + 5 t3.medium | $1,800 |
| Savings Plan (1-year) for 8 baseline instances | $840 |
| Spot Instances for CI/CD and batch jobs | $600 |
| S3 Lifecycle — moved 4TB to IA/Glacier | $380 |
| RDS rightsizing — r5.2xlarge → r5.large | $520 |
| VPC Endpoints — eliminated NAT for S3/DynamoDB | $180 |
| Dev/staging shutdown nights & weekends | $480 |
| TOTAL SAVINGS | $4,800/month |
Result: 60% cost reduction! Same performance, happier CFO 😄💰
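A quick check of the case study arithmetic, using only the figures from the table above:

```python
INITIAL_BILL = 8000  # USD per month

# Monthly savings per action, copied from the optimization table.
SAVINGS = {
    "Rightsizing EC2": 1800,
    "Savings Plan (1-year)": 840,
    "Spot for CI/CD and batch": 600,
    "S3 lifecycle": 380,
    "RDS rightsizing": 520,
    "VPC Endpoints": 180,
    "Dev/staging shutdown": 480,
}

total_savings = sum(SAVINGS.values())
new_bill = INITIAL_BILL - total_savings
reduction = total_savings / INITIAL_BILL

print(f"Savings ${total_savings:,}/month, new bill ${new_bill:,}/month, "
      f"{reduction:.0%} lower")
```

Rightsizing alone accounts for more than a third of the total, which is why it comes first in the list.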
Common Cost Traps to Avoid
Watch out for these hidden costs! 🚨
1. NAT Gateway: Silently charges $0.045/GB. Use VPC Endpoints!
2. Elastic IPs: Since February 2024, AWS bills every public IPv4 address at $0.005/hr (about $3.65/month each), attached or not. Release the ones you don't need!
3. EBS Snapshots: Old snapshots accumulate. Set retention policies!
4. CloudWatch Logs: Ingestion $0.50/GB + storage $0.03/GB/month. Set log retention!
5. Idle Load Balancers: $16/month minimum even with zero traffic
6. Cross-region replication: Data transfer charges both ways
7. Lambda provisioned concurrency: Pay even when not invoked!
Monthly cleanup checklist:
- ☐ Delete unattached EBS volumes
- ☐ Release unused Elastic IPs
- ☐ Remove old EBS snapshots (>90 days)
- ☐ Check for idle RDS instances
- ☐ Review Lambda functions with provisioned concurrency
- ☐ Audit NAT Gateway data processing
Summary
What we learned about Cloud Cost Optimization:
✅ Billing Models: On-Demand, Reserved, Savings Plans, Spot. Use the right mix
✅ Rightsizing: Average 15-20% CPU utilization, so downsize and save 50%+
✅ Reserved/Savings Plans: 30-60% savings for predictable workloads
✅ Spot Instances: 60-90% savings for fault-tolerant workloads
✅ Storage: Lifecycle policies, tiering — S3 Standard → IA → Glacier
✅ Network: VPC Endpoints, CDN, same-AZ placement
✅ Kubernetes: Optimize resource requests, Karpenter, Kubecost
✅ FinOps: Tags, budgets, accountability, continuous optimization
✅ Monitoring: Cost anomaly detection, budget alerts, dashboards
Key takeaway: Cloud cost optimization is not a one-time activity; it's a continuous practice. Review monthly, automate what you can, and make every team accountable for their spend. 40-60% savings is realistic! 💰🚀
With this, you have completed the Cloud & DevOps series! From infrastructure to optimization, everything is covered! 🎓🎉
🏁 🎮 Mini Challenge
Challenge: Analyze & Optimize Cloud Billing
Real cost analysis: find and eliminate the hidden charges! 💰
Step 1: Billing Dashboard Access 📊
Step 2: Cost Attribution 🏷️
Step 3: Breakdown Analysis 📈
Step 4: Right-Sizing Analysis 🔍
Step 5: Reservation Purchase 📋
```
# Current on-demand: ₹50000/month
# 1-year reservation: 30% discount = ₹35000/month
# Savings/year: ₹180000!
# AWS Reserved Instance calculator
# Check that production workloads are stable (good fit for RI)

# Non-critical batch jobs: on-demand → spot
# Cost: ₹100/day → ₹30/day (70% saved!)
# Risk: interruption (needs checkpointing)
# Example: data processing batch
# Interruption: restart from checkpoint (ok)
# Save: ₹2100/month

# Find & remove:
# - Unattached EBS volumes
# - Unattached Elastic IPs
# - Old snapshots (beyond retention)
# - Stopped (non-running) instances
# AWS CLI command to list unattached volumes:
aws ec2 describe-volumes --filters "Name=status,Values=available" --query 'Volumes[*].VolumeId'
# Cost: ₹500/month wasted (easy win!)

# Create a spreadsheet:
# Optimization | Current Cost | Optimized | Monthly Save
# Right-size EC2 | 5000 | 2000 | 3000
# Reserve instances | 50000 | 35000 | 15000
# Spot for batch | 3000 | 900 | 2100
# Remove unused | 500 | 0 | 500
# Total potential savings: ₹20,600/month (about 35% of the ₹58,500 listed)
# Prioritize: apply the easy recommendations first
```
💼 Interview Questions
Q1: Cloud bill unexpectedly high: what are the troubleshooting steps?
A: (1) Check the service breakdown (top 5 services). (2) Analyze the timeline (sudden spike? when?). (3) Review recent changes (new resources deployed?). (4) Find unattached resources (volumes, IPs). (5) Check data transfer (expensive!). (6) Check all regions (resources everywhere?). (7) Check Reserved Instance expiry (did anything fall back to on-demand?).
Q2: Data transfer costs are high: how do you minimize them?
A: Data gravity principle: process data where it lives. S3 → EC2 in the same region is free; S3 → internet is expensive. Solutions: (1) CloudFront CDN (cache at the edge). (2) VPC endpoints (avoid NAT Gateway processing charges). (3) Keep resources in the same region. (4) Batch downloads (consolidate calls). Monitoring: track data transfer as a separate line item.
Q3: Reserved Instances vs Savings Plans: what is the difference?
A: Reserved Instances (RI): specific instance type and region, 1 or 3 years. Savings Plans: compute flexibility (EC2, Fargate, Lambda) and region flexibility. RI: deeper discount (up to 72%). Savings Plans: more flexible (easier for variable workloads). Choose: predictable workloads = RI, variable = Savings Plans.
Q4: Auto-scaling cost implications: how do you avoid an unexpected bill?
A: An ASG with a high max capacity can cause runaway costs. Solution: (1) Set a realistic max limit (not unlimited). (2) Use an appropriate scale-down cooldown (avoid flapping). (3) Set billing alerts (daily budget threshold). (4) Use scheduled scaling (reduce off-peak capacity). (5) Monitor scale events (debug unnecessary scaling).
Q5: Team-wise cost accountability: which sharing model?
A: Tag all resources (team, project, cost-center). Generate reports per tag. Enforce team budgets (quota limits). Chargeback model: usage-based billing (the team pays). Showback: visibility without charge (educates teams). Monthly reviews: trends, anomalies, optimization opportunities. Personal accountability = cost consciousness.
Frequently Asked Questions
You need to identify the biggest source of waste in your company's AWS bill. Which metric should you look at FIRST?