AI system design basics
Introduction
"Design a recommendation system for 10M users" — interview la idha ketta, freeze aaiduvaanga! 😰
AI system design is different from traditional system design. In a normal system, you store and retrieve data. In an AI system, you have to process data, train models, and serve predictions — all at scale!
In this article we cover the fundamentals of AI system design — architecture patterns, components, and real-world design decisions! 🏗️✨
AI vs Traditional System Design
Key differences:
| Aspect | Traditional System | AI System |
|---|---|---|
| **Data** | Store & retrieve | Transform & learn |
| **Logic** | Defined in code | Learned by model |
| **Output** | Deterministic | Probabilistic |
| **Testing** | Unit tests | Model metrics + tests |
| **Updates** | Code deploy | Model retrain + deploy |
| **Scaling** | More servers | More GPUs + servers |
| **Monitoring** | Errors & latency | Model drift + errors |
| **Storage** | Databases | Feature stores + model registry |
Key Insight: In an AI system, data is the new code. Better data = better system. Even if the code is clean, if the data is messy — the system is garbage! 🗑️
AI System High-Level Architecture
Complete AI system architecture:
```
┌─────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ [Web App] [Mobile App] [API Consumers] [IoT] │
└──────────────────────┬──────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────┐
│ API GATEWAY │
│ [Load Balancer] [Auth] [Rate Limit] [Routing] │
└──────────────────────┬──────────────────────────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────┐ ┌──────────────┐
│ App Server │ │ ML Server│ │ Data Server │
│ (CRUD ops) │ │(Inference)│ │ (Pipeline) │
└──────┬───────┘ └────┬─────┘ └──────┬───────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌──────────┐
│ DB │ │ Model Store │ │Feature │
│(Postgres)│ │ (S3/Registry)│ │Store │
└──────────┘ └──────────────┘ └──────────┘
│
┌────────▼────────┐
│ Training Pipeline│
│ [Data → Train → │
│ Evaluate → Deploy│
└─────────────────┘
```
**3 Main Pillars:**
1. **Serving Layer** — serves real-time predictions
2. **Training Pipeline** — trains and updates models
3. **Data Pipeline** — collects, cleans, and transforms data
Each pillar can scale independently! 📊
Data Pipeline Design
In an AI system, the data pipeline is the backbone:
Components:
| Component | Tool Options | Purpose |
|---|---|---|
| **Ingestion** | Kafka, Kinesis, Pub/Sub | Real-time event streaming |
| **Batch Processing** | Spark, Airflow | Large dataset processing |
| **Feature Store** | Feast, Tecton, Redis | Feature serving |
| **Storage** | S3, GCS, Delta Lake | Raw data storage |
Rule: If the data pipeline fails — the entire AI system fails! Monitoring and alerting are a must! 🚨
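To make the batch-processing row concrete, here is a minimal pure-Python sketch of what such a job computes; the event records and field names are made up for illustration (a real pipeline would run this logic in Spark over data landed from Kafka):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw click events, as they might land from a Kafka topic.
events = [
    {"user_id": "u1", "item_id": "p9", "ts": "2024-01-01T10:00:00"},
    {"user_id": "u1", "item_id": "p3", "ts": "2024-01-01T11:30:00"},
    {"user_id": "u2", "item_id": "p9", "ts": "2024-01-01T12:00:00"},
]

def build_user_features(events):
    """Batch job: aggregate raw events into per-user features."""
    feats = defaultdict(lambda: {"click_count": 0, "last_seen": None})
    for e in events:
        f = feats[e["user_id"]]
        f["click_count"] += 1
        ts = datetime.fromisoformat(e["ts"])
        if f["last_seen"] is None or ts > f["last_seen"]:
            f["last_seen"] = ts
    return dict(feats)

features = build_user_features(events)
print(features["u1"]["click_count"])  # 2
```

The output of a job like this would be written to the feature store so the serving layer never has to touch raw events.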
Model Serving Architecture
Serving a trained model in production is the most critical part:
Option 1: REST API Serving
Option 2: gRPC Serving (High Performance)
Option 3: Batch Prediction
Serving Strategy Comparison:
| Strategy | Latency | Throughput | Use Case |
|---|---|---|---|
| **REST API** | ~100ms | Medium | Web apps, simple APIs |
| **gRPC** | ~10ms | High | Microservices, mobile |
| **Batch** | Minutes | Very High | Reports, emails |
| **Streaming** | ~50ms | High | Real-time feeds |
| **Edge** | ~5ms | Low | IoT, mobile offline |
Choose the correct strategy for your use case! 🎯
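A framework-agnostic sketch of what the REST API strategy does for each request (parse → featurize → infer → serialize); the model loader and its scores are placeholder stand-ins, not any real framework's API:

```python
import json

def load_model():
    # Stand-in for loading a trained model from a registry or S3.
    # Returns a toy scoring function for illustration.
    return lambda feats: 0.42 if feats["click_count"] > 1 else 0.1

MODEL = load_model()  # load once at startup, not per request

def handle_predict(request_body: str) -> str:
    """What a REST /predict endpoint does: parse the JSON payload,
    build the feature dict, run inference, serialize the response."""
    payload = json.loads(request_body)
    feats = {"click_count": payload.get("click_count", 0)}
    score = MODEL(feats)
    return json.dumps({"score": score})

print(handle_predict('{"click_count": 3}'))  # {"score": 0.42}
```

In production this handler would sit behind FastAPI, TensorFlow Serving, or similar; the shape of the flow stays the same.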
Feature Store — a Database for AI
Feature Store = a special database built for AI systems:
Why do you need a Feature Store?
Training-Serving Skew = the #1 AI system killer! A Feature Store solves this! 💀→✅
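The core idea can be sketched in a few lines: one feature definition feeds both the offline materialization job and the online lookup, so training and serving can never disagree. The dict-based `online_store` is a stand-in for Redis, and the function names are illustrative (Feast and Tecton formalize this pattern):

```python
# One feature definition used by BOTH the training and serving paths —
# this shared definition is what kills training-serving skew.

def days_since_signup(signup_day: int, today: int) -> int:
    return today - signup_day

online_store = {}  # stand-in for Redis: latest feature values per user

def materialize(user_id, signup_day, today):
    """Offline job: precompute features and push them to the online store."""
    online_store[user_id] = {
        "days_since_signup": days_since_signup(signup_day, today)
    }

def get_online_features(user_id):
    """Serving path: read the exact same values the model trained on."""
    return online_store[user_id]

materialize("u1", signup_day=100, today=130)
print(get_online_features("u1"))  # {'days_since_signup': 30}
```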
Model Registry & Versioning
Use a Model Registry to track your models:
Model Lifecycle:
| Stage | Description | Action |
|---|---|---|
| **Development** | New model experiment | Train & evaluate |
| **Staging** | A/B testing | Shadow deployment |
| **Production** | Live traffic serve | Monitor closely |
| **Archived** | Old version | Keep for rollback |
Track every model version — rollback will save your life! 🔄
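A minimal in-memory sketch of the lifecycle table above; real registries (MLflow, SageMaker Model Registry) add artifact storage and metadata, but the stage transitions look like this:

```python
class ModelRegistry:
    """Toy model registry: tracks versions and lifecycle stages."""
    STAGES = {"Development", "Staging", "Production", "Archived"}

    def __init__(self):
        self.versions = {}  # version number -> {"model": ..., "stage": ...}
        self.latest = 0

    def register(self, model):
        self.latest += 1
        self.versions[self.latest] = {"model": model, "stage": "Development"}
        return self.latest

    def promote(self, version, stage):
        assert stage in self.STAGES
        if stage == "Production":
            # Archive the current production model — kept for rollback.
            for v in self.versions.values():
                if v["stage"] == "Production":
                    v["stage"] = "Archived"
        self.versions[version]["stage"] = stage

    def production_model(self):
        for ver in sorted(self.versions, reverse=True):
            if self.versions[ver]["stage"] == "Production":
                return ver
        return None

reg = ModelRegistry()
v1 = reg.register("model-a")
reg.promote(v1, "Production")
v2 = reg.register("model-b")
reg.promote(v2, "Staging")     # A/B test / shadow phase
reg.promote(v2, "Production")  # v1 is auto-archived, rollback still possible
print(reg.production_model())  # 2
```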
Caching for AI Predictions
Caching AI predictions drastically reduces latency:
Cache Strategy by Use Case:
| Use Case | TTL | Cache Hit Rate |
|----------|-----|----------------|
| Product recommendations | 30 min | ~70% |
| Search ranking | 5 min | ~50% |
| Fraud detection | NO CACHE! | 0% |
| Content moderation | 1 hour | ~80% |
⚠️ Never cache real-time safety-critical predictions (fraud, security)!
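A sketch of the caching policy above in pure Python: TTL-based reuse for cacheable models, and an explicit never-cache list for safety-critical ones. The class and model names are made up; a production system would back this with Redis or Memcached:

```python
import time

class PredictionCache:
    """TTL cache for model predictions; safety-critical models
    (fraud, security) are never cached."""
    NO_CACHE = {"fraud_detection", "content_security"}

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, cached_value)

    def get_or_predict(self, model_name, key, predict_fn):
        if model_name in self.NO_CACHE:
            return predict_fn(key)          # always compute fresh
        hit = self.store.get(key)
        now = time.monotonic()
        if hit and hit[0] > now:
            return hit[1]                   # cache hit — no inference cost
        value = predict_fn(key)
        self.store[key] = (now + self.ttl, value)
        return value

calls = []
def expensive_model(user):
    calls.append(user)                      # track how often we infer
    return {"recs": ["p1", "p2"]}

cache = PredictionCache(ttl_seconds=1800)   # 30 min, like the recs row
cache.get_or_predict("recommendations", "u1", expensive_model)
cache.get_or_predict("recommendations", "u1", expensive_model)
print(len(calls))  # 1 — the second call was served from cache
```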
AI System Monitoring
You need traditional monitoring plus AI-specific monitoring:
Model Drift Detection:
| Alert Level | Condition | Action |
|---|---|---|
| 🟢 Normal | All metrics in range | Continue |
| 🟡 Warning | Accuracy drops ≥5% | Investigate |
| 🔴 Critical | Accuracy drops ≥15% | Rollback model |
| 🚨 Emergency | System down | Fallback to rules |
If you don't catch model drift — the system silently serves wrong predictions! 😱
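The alert table can be expressed as a simple threshold function; the 5% and 15% cutoffs come from the table above, while the baseline and current accuracy values below are illustrative:

```python
def drift_alert(baseline_accuracy: float, current_accuracy: float) -> str:
    """Map a relative accuracy drop to the alert levels in the table."""
    drop = (baseline_accuracy - current_accuracy) / baseline_accuracy
    if drop >= 0.15:
        return "CRITICAL: rollback model"
    if drop >= 0.05:
        return "WARNING: investigate"
    return "NORMAL: continue"

print(drift_alert(0.90, 0.88))  # NORMAL: continue (≈2% drop)
print(drift_alert(0.90, 0.84))  # WARNING: investigate (≈7% drop)
print(drift_alert(0.90, 0.72))  # CRITICAL: rollback model (20% drop)
```

In practice this check runs on a schedule against labeled or proxy metrics, and CRITICAL alerts trigger the rollback path from the model registry.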
A/B Testing for AI Models
When you deploy a new model, A/B test it:
Rollout Strategy:
- Shadow mode (0% traffic) — log predictions, don't serve them
- Canary (1-5%) — small traffic divert
- A/B test (10-50%) — metrics compare
- Full rollout (100%) — if metrics better
Never deploy 0→100! A gradual rollout will save you! 🎯
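One common way to implement the canary step is deterministic hash-based bucketing, so each user always lands on the same variant across requests; this is a sketch with made-up user IDs, not any specific experimentation platform's API:

```python
import hashlib

def assign_variant(user_id: str, canary_pct: float) -> str:
    """Deterministically route a stable slice of users to the new model.
    Hashing keeps each user on the same variant on every request."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "new_model" if bucket < canary_pct else "old_model"

# Gradual rollout: bump canary_pct 1 → 5 → 25 → 100 as metrics hold up.
sample = [assign_variant(f"user{i}", canary_pct=5) for i in range(1000)]
print(sample.count("new_model"))  # roughly 50 of 1000 (~5%)
```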
Common AI Design Patterns
Design patterns useful in interviews:
1. Ensemble Pattern 🎭
Combine multiple models for better predictions. Netflix recommendations use this approach.
2. Fallback Pattern 🔄
If the model fails — fall back to rules. If the rules fail — fall back to defaults.
3. Gateway Pattern 🚪
Put multiple LLMs behind a single gateway — you can optimize cost, latency, and quality.
4. Human-in-the-Loop Pattern 👤
Route low-confidence predictions to a human — quality stays high! 📊
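The fallback pattern (model, then rules, then default) can be sketched like this; the feature names and return labels are invented for illustration:

```python
def predict_with_fallback(features, model=None):
    """Fallback pattern: try the model, then hand-written rules,
    then a safe default. Each tier catches the one above it."""
    try:
        if model is None:
            raise RuntimeError("model unavailable")
        return model(features)
    except Exception:
        pass  # model tier failed, drop to rules
    try:
        # Rules tier: a simple hand-coded heuristic.
        if features.get("clicks", 0) > 10:
            return "show_personalized"
        return "show_popular"
    except Exception:
        return "show_popular"  # default tier: always safe to serve

print(predict_with_fallback({"clicks": 3}))   # show_popular (rules tier)
print(predict_with_fallback({"clicks": 3}, model=lambda f: "model_answer"))
```

The same layering works for the human-in-the-loop pattern: replace the default tier with a queue for human review.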
Cost Optimization Strategies
⚠️ AI systems get expensive — optimize!
| Strategy | Savings | Implementation |
|----------|---------|----------------|
| Model distillation | 60-80% GPU cost | Large model → small model |
| Quantization | 75% memory | FP32 → INT8 |
| Caching predictions | 40-70% compute | Redis/Memcached |
| Batch inference | 30-50% cost | Off-peak processing |
| Spot instances | 60-90% training | AWS Spot/GCP Preemptible |
| Auto-scaling | Variable | Scale to zero when idle |
Rule of thumb: Start with managed services (SageMaker, Vertex AI) — cheaper than building from scratch for small teams! 💰
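The savings from quantization come from simple bytes-per-parameter arithmetic: FP32 uses 4 bytes per weight, INT8 uses 1. The 100M-parameter model below is hypothetical:

```python
def model_memory_mb(num_params: int, bytes_per_param: int) -> float:
    """Raw weight memory for a model, in MiB."""
    return num_params * bytes_per_param / 1024**2

params = 100_000_000                 # hypothetical 100M-parameter model
fp32 = model_memory_mb(params, 4)    # FP32 = 4 bytes per parameter
int8 = model_memory_mb(params, 1)    # INT8 = 1 byte per parameter
print(f"{fp32:.0f} MB -> {int8:.0f} MB ({1 - int8/fp32:.0%} saved)")
# 381 MB -> 95 MB (75% saved)
```

Actual serving memory also includes activations and runtime overhead, so treat this as a lower bound on the footprint.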
Real World Example: Recommendation System
Design: E-commerce Product Recommendation System (10M users)
Requirements:
- Real-time recommendations (< 200ms)
- Personalized for each user
- Handle cold start (new users)
- 10M daily active users
Architecture:
Tech Stack:
- Candidate Generation: FAISS/Milvus (vector search)
- Feature Store: Feast + Redis
- Model Serving: TensorFlow Serving
- Cache: Redis (30 min TTL)
- Pipeline: Airflow + Spark
- Monitoring: Prometheus + Grafana
Estimated Cost: ~$5K-15K/month for 10M users 💰
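Candidate generation in this stack would use FAISS or Milvus; as a dependency-free illustration, here is brute-force cosine similarity over toy 2-D item embeddings (real systems use approximate nearest-neighbor indexes over high-dimensional vectors for millions of items):

```python
import math

# Stand-in for a FAISS/Milvus index: toy 2-D item embeddings.
item_vectors = {
    "p1": [0.9, 0.1],
    "p2": [0.8, 0.3],
    "p3": [0.1, 0.95],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def candidates(user_vector, k=2):
    """Candidate generation: top-k items nearest to the user embedding."""
    ranked = sorted(item_vectors,
                    key=lambda i: cosine(user_vector, item_vectors[i]),
                    reverse=True)
    return ranked[:k]

print(candidates([1.0, 0.0]))  # ['p1', 'p2'] — items closest to this user
```

In the full pipeline, these candidates would then pass through the ranking model and the 30-minute Redis cache before reaching the user.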
✅ Key Takeaways
✅ AI systems differ from normal systems — data pipelines, feature stores, model serving, and retraining cycles are additional components you need
✅ Three pillars are essential — serving layer (real-time predictions), training pipeline (model updates), data pipeline (data quality) — all equally important
✅ Real-time vs batch is a trade-off — real-time when latency is critical, batch when cost matters — choose based on the use case
✅ A feature store is critical — centralize features, keep training and serving consistent, enable reuse, and reduce engineering complexity
✅ Model versioning matters — multiple versions coexist in production; you need A/B testing, gradual rollout, and rollback capability
✅ Monitoring is different for ML systems — track model accuracy, data quality, prediction distribution, retraining frequency, and business metrics
✅ Cache predictions smartly — inference is expensive; serve repeat queries from cached results to cut latency and cost
✅ Scalability planning is essential — consider upfront how QPS, model size, and feature computation scale with user growth
🏁 Mini Challenge
Challenge: Design AI-Powered System from Scratch
Design a production system (50-60 mins):
- Choose Domain: E-commerce, social media, content recommendation, or logistics
- Gather Requirements: Define users, scale, latency, and accuracy targets
- Architecture: Design the components, data flow, and technology stack
- Bottleneck Analysis: Where will the system struggle? Identify it and plan mitigations
- Scale Calculation: Estimate infrastructure cost and latency at scale
- Monitoring Plan: Define metrics, dashboards, and alerts
- Trade-offs: Document accuracy vs latency, cost vs performance
Tools: Draw.io for diagrams, Google Docs for design doc
Deliverable: Complete design doc with architecture diagram, tech choices, cost estimates 📊
Interview Questions
Q1: What is the most important skill in an AI system design interview?
A: Requirement clarification, constraint understanding, scalability thinking, and trade-off analysis. The ability to justify design choices, not just suggest random tech. Ask good questions and clarify ambiguity.
Q2: Real-time vs batch predictions – when do you choose each?
A: Real-time: interactive use cases (recommendations, search) that need responses under 1 second. Batch: non-urgent work (reports, offline analysis), large-volume processing, cost optimization. Real-time is more complex and more expensive.
Q3: Which considerations matter for model serving infrastructure?
A: Model size, inference latency, QPS (queries per second), hardware requirements, update frequency, and A/B testing capability. TensorFlow Serving, KServe, and Seldon are popular choices.
Q4: What is the cold start problem in recommendation system design?
A: The challenge of recommending for a new user or item with little history data. Solutions: content-based recommendations, a popularity-based fallback, an onboarding survey, then a gradual transition to collaborative filtering.
Q5: What is a feature store, and why is it important in AI systems?
A: Centralized feature management – consistent features across training and serving, fast feature retrieval, and feature reuse across models. It reduces engineering complexity, improves model quality, and enables faster experimentation.
Frequently Asked Questions
AI system la "Training-Serving Skew" enna?