
AI system design basics

Advanced · ⏱ 16 min read · 📅 Updated: 2026-02-17

Introduction

"Design a recommendation system for 10M users" - hear this in an interview and most candidates freeze! 😰


AI system design is different from traditional system design. In a normal system, you store and retrieve data. In an AI system, you have to process data, train models, and serve predictions - all at scale!


In this article we cover the fundamentals of AI system design: architecture patterns, components, and real-world design decisions! 🏗️✨

AI vs Traditional System Design

Key differences:


| Aspect | Traditional System | AI System |
|--------|--------------------|-----------|
| **Data** | Store & retrieve | Transform & learn |
| **Logic** | Defined in code | Learned by the model |
| **Output** | Deterministic | Probabilistic |
| **Testing** | Unit tests | Model metrics + tests |
| **Updates** | Code deploy | Model retrain + deploy |
| **Scaling** | More servers | More GPUs + servers |
| **Monitoring** | Errors & latency | Model drift + errors |
| **Storage** | Databases | Feature stores + model registry |

Key Insight: In AI systems, data is the new code. Better data = better system. Your code can be spotless, but if the data is messy, the system is garbage! 🗑️

AI System High-Level Architecture

πŸ—οΈ Architecture Diagram
Complete AI system architecture:

```
┌──────────────────────────────────────────────────────┐
│                    CLIENT LAYER                      │
│    [Web App] [Mobile App] [API Consumers] [IoT]      │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────┐
│                     API GATEWAY                      │
│    [Load Balancer] [Auth] [Rate Limit] [Routing]     │
└──────────────────────────┬───────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│   App Server   │ │   ML Server    │ │  Data Server   │
│   (CRUD ops)   │ │  (Inference)   │ │   (Pipeline)   │
└───────┬────────┘ └───────┬────────┘ └───────┬────────┘
        │                  │                  │
        ▼                  ▼                  ▼
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│       DB       │ │  Model Store   │ │    Feature     │
│   (Postgres)   │ │ (S3/Registry)  │ │     Store      │
└────────────────┘ └───────┬────────┘ └────────────────┘
                           │
                 ┌─────────▼──────────┐
                 │ Training Pipeline  │
                 │ [Data → Train →    │
                 │  Evaluate → Deploy]│
                 └────────────────────┘
```

**3 Main Pillars:**
1. **Serving Layer** - serves real-time predictions
2. **Training Pipeline** - trains and updates models
3. **Data Pipeline** - collects, cleans, and transforms data

Each pillar can scale independently! 📊

Data Pipeline Design

In an AI system, the data pipeline is the backbone:


```
Raw Data → Ingestion → Cleaning → Transform → Feature Store → Training
                                                    │
                                                    ▼
                                             Serving (Real-time)
```

Components:


```python
# 1. Data Ingestion (Kafka/Kinesis)
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'user-events',
    bootstrap_servers=['kafka:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
)

# 2. Data Cleaning
def clean_data(raw_event):
    # Remove nulls, fix types, validate schema
    if not raw_event.get('user_id'):
        return None
    return {
        'user_id': str(raw_event['user_id']),
        'action': raw_event['action'].lower(),
        'timestamp': parse_timestamp(raw_event['ts']),
    }

# 3. Feature Engineering
def compute_features(events):
    return {
        'click_rate_7d': calculate_ctr(events, days=7),
        'avg_session_time': mean([e['duration'] for e in events]),
        'purchase_count': len([e for e in events if e['action'] == 'purchase']),
    }

# 4. Feature Store (Redis/DynamoDB)
feature_store.set(user_id, features, ttl=3600)
```

| Component | Tool Options | Purpose |
|-----------|--------------|---------|
| **Ingestion** | Kafka, Kinesis, Pub/Sub | Real-time event streaming |
| **Batch Processing** | Spark, Airflow | Large dataset processing |
| **Feature Store** | Feast, Tecton, Redis | Feature serving |
| **Storage** | S3, GCS, Delta Lake | Raw data storage |

Rule: If the data pipeline fails, the entire AI system fails! Monitoring and alerting are a must! 🚨
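A minimal sketch of such a pipeline health check in Python - the thresholds and the specific checks here are illustrative assumptions, not from any particular tool:

```python
# Minimal data-pipeline health check (illustrative thresholds).
def pipeline_health(events, max_null_rate=0.01, min_events_per_min=100):
    """Return a list of alert strings; an empty list means healthy."""
    if not events:
        return ['CRITICAL: no events received']
    alerts = []
    # Data-quality check: how many events are missing the key field?
    null_rate = sum(1 for e in events if not e.get('user_id')) / len(events)
    if null_rate > max_null_rate:
        alerts.append(f'WARNING: null user_id rate {null_rate:.1%}')
    # Throughput check: a sudden drop usually means an upstream outage.
    if len(events) < min_events_per_min:
        alerts.append(f'WARNING: low throughput ({len(events)} events/min)')
    return alerts
```

Wire the returned alerts into whatever paging system you already use; the point is that the pipeline checks itself on every batch.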

Model Serving Architecture

Serving a trained model in production is the most critical part:


Option 1: REST API Serving

```python
# FastAPI + Model
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load('model.pkl')

@app.post('/predict')
async def predict(features: dict):
    prediction = model.predict([features['input']])
    # Hardcoded confidence is a placeholder; use predict_proba in practice
    return {'prediction': prediction[0], 'confidence': 0.95}
```

Option 2: gRPC Serving (High Performance)

```protobuf
service PredictionService {
  rpc Predict(PredictRequest) returns (PredictResponse);
}
```

Option 3: Batch Prediction

```python
# For non-real-time use cases
predictions = model.predict(batch_data)
save_to_database(predictions)
```
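For big batches it helps to run inference in chunks so memory stays bounded and progress can be checkpointed. A minimal sketch, where `model` is any object with a `predict` method:

```python
# Chunked batch inference: bound memory, one model call per chunk.
def batch_predict(model, rows, chunk_size=1024):
    predictions = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        predictions.extend(model.predict(chunk))
    return predictions
```

In a real pipeline you would also persist each chunk's results as you go, so a crash halfway through doesn't redo the whole batch.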

Serving Strategy Comparison:


| Strategy | Latency | Throughput | Use Case |
|----------|---------|------------|----------|
| **REST API** | ~100ms | Medium | Web apps, simple APIs |
| **gRPC** | ~10ms | High | Microservices, mobile |
| **Batch** | Minutes | Very High | Reports, emails |
| **Streaming** | ~50ms | High | Real-time feeds |
| **Edge** | ~5ms | Low | IoT, mobile offline |

Choose the right strategy for your use case! 🎯

Feature Store - the Database for AI

A Feature Store = a special-purpose database for AI systems:


```
┌────────────────────────────────┐
│         FEATURE STORE          │
│                                │
│  ┌───────────┐  ┌───────────┐  │
│  │  Offline  │  │  Online   │  │
│  │  Store    │  │  Store    │  │
│  │ (Training)│  │ (Serving) │  │
│  │  S3/HDFS  │  │   Redis   │  │
│  └─────┬─────┘  └─────┬─────┘  │
│        │              │        │
│  Batch Features   Real-time    │
│  (historical)     Features     │
└────────────────────────────────┘
```

Why do you need a Feature Store?


```python
# Without a Feature Store - MESSY! 😰
# Training computes features one way...
train_features = compute_heavy_features(raw_data)

# ...serving computes them another way - INCONSISTENCY!
serve_features = quick_compute(request_data)
# Training-serving skew! Model performance drops!

# With a Feature Store - CLEAN! ✅
from feast import FeatureStore

store = FeatureStore(repo_path='.')

# Training - same features
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=['user:click_rate', 'user:purchase_count']
)

# Serving - the SAME features!
online_features = store.get_online_features(
    entity_rows=[{'user_id': '123'}],
    features=['user:click_rate', 'user:purchase_count']
)
```

Training-Serving Skew = the #1 AI system killer! A Feature Store solves it! 💀→✅

Model Registry & Versioning

Use a Model Registry to track your models:


```python
import mlflow

# Train and register the model
with mlflow.start_run():
    model = train_model(data)

    # Log metrics
    mlflow.log_metric('accuracy', 0.94)
    mlflow.log_metric('f1_score', 0.91)
    mlflow.log_metric('latency_p99', 45)  # ms

    # Register model
    mlflow.sklearn.log_model(model, 'recommendation-model',
        registered_model_name='prod-recommender')
```

Model Lifecycle:


| Stage | Description | Action |
|-------|-------------|--------|
| **Development** | New model experiment | Train & evaluate |
| **Staging** | A/B testing | Shadow deployment |
| **Production** | Serves live traffic | Monitor closely |
| **Archived** | Old version | Keep for rollback |

```bash
# Manage model versions
mlflow models serve -m "models:/prod-recommender/Production" -p 5001

# Rollback to previous version
mlflow models serve -m "models:/prod-recommender/3" -p 5001
```

Track every model version - a rollback will save your life! 🔄

Caching for AI Predictions

💡 Tip

Caching AI predictions drastically reduces latency:

```python
import redis
import hashlib
import json

cache = redis.Redis(host='localhost', port=6379)

def cached_predict(features):
    # Create cache key from features
    key = hashlib.md5(json.dumps(features, sort_keys=True).encode()).hexdigest()

    # Check cache
    cached = cache.get(f"pred:{key}")
    if cached:
        return json.loads(cached)  # Cache hit! ⚡

    # Cache miss - run model
    prediction = model.predict(features)
    cache.setex(f"pred:{key}", 300, json.dumps(prediction))  # 5 min TTL
    return prediction
```

Cache Strategy by Use Case:

| Use Case | TTL | Cache Hit Rate |
|----------|-----|----------------|
| Product recommendations | 30 min | ~70% |
| Search ranking | 5 min | ~50% |
| Fraud detection | NO CACHE! | 0% |
| Content moderation | 1 hour | ~80% |

⚠️ Never cache real-time safety-critical predictions (fraud, security)!

AI System Monitoring

You need traditional monitoring plus AI-specific monitoring:


```python
# AI System Metrics to Monitor
monitoring_metrics = {
    # Traditional Metrics
    'latency_p50': '< 100ms',
    'latency_p99': '< 500ms',
    'error_rate': '< 0.1%',
    'throughput': '> 1000 rps',

    # AI-Specific Metrics ← IMPORTANT!
    'model_accuracy': '> 0.90',
    'prediction_distribution': 'check for drift',
    'feature_missing_rate': '< 1%',
    'model_staleness': '< 7 days',
    'data_quality_score': '> 0.95',
}
```

Model Drift Detection:

```python
# Weekly drift check (evidently's import paths vary across versions;
# these match the 0.4.x Report API)
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=training_data, current_data=production_data)

if report.as_dict()['metrics'][0]['result']['dataset_drift']:
    alert("🚨 Model drift detected! Retrain needed!")
```

| Alert Level | Condition | Action |
|-------------|-----------|--------|
| 🟢 Normal | All metrics in range | Continue |
| 🟡 Warning | Accuracy drops 5% | Investigate |
| 🔴 Critical | Accuracy drops 15% | Rollback model |
| 🚨 Emergency | System down | Fallback to rules |
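As a sketch, the alert levels above can be computed from the relative accuracy drop - thresholds taken straight from the table:

```python
# Map an accuracy drop (vs. baseline) to the alert levels above.
def alert_level(baseline_acc, current_acc, system_up=True):
    if not system_up:
        return 'emergency'  # system down: fall back to rules
    drop = (baseline_acc - current_acc) / baseline_acc
    if drop >= 0.15:
        return 'critical'   # rollback model
    if drop >= 0.05:
        return 'warning'    # investigate
    return 'normal'         # continue
```

Run this on every evaluation window and page on `critical` or worse.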

If you don't catch model drift, the system silently serves wrong predictions! 😱

A/B Testing for AI Models

When deploying a new model, run an A/B test:


```python
# Simple A/B test implementation
import hashlib

class ModelABTest:
    def __init__(self, model_a, model_b, traffic_split=0.1):
        self.model_a = model_a  # Current production
        self.model_b = model_b  # New challenger
        self.split = traffic_split

    def predict(self, features, user_id):
        # Consistent assignment per user (hashlib is stable across
        # processes, unlike the salted built-in hash())
        digest = hashlib.md5(str(user_id).encode()).hexdigest()
        bucket = int(digest, 16) % 100

        if bucket < self.split * 100:
            result = self.model_b.predict(features)
            log_experiment('model_b', user_id, result)
        else:
            result = self.model_a.predict(features)
            log_experiment('model_a', user_id, result)

        return result

# Deploy with 10% traffic to new model
ab_test = ModelABTest(current_model, new_model, traffic_split=0.1)
```

Rollout Strategy:

  1. Shadow mode (0% traffic) - log predictions, don't serve them
  2. Canary (1-5%) - divert a small slice of traffic
  3. A/B test (10-50%) - compare metrics
  4. Full rollout (100%) - if the metrics are better

Never deploy 0→100! A gradual rollout will save you! 🎯
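Step 1 (shadow mode) can be sketched like this - `log_shadow` is a hypothetical logging hook you'd wire to your metrics store:

```python
# Shadow mode: serve the production model, silently run the challenger.
def shadow_predict(prod_model, shadow_model, features, log_shadow):
    result = prod_model.predict(features)  # what the user actually sees
    try:
        shadow_result = shadow_model.predict(features)
        log_shadow(features, result, shadow_result)  # compare offline later
    except Exception:
        pass  # a shadow failure must never affect the live response
    return result
```

The key property: the challenger can crash, time out, or return garbage, and users are never affected - you only lose a log line.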

Common AI Design Patterns

Design patterns that come up often in interviews:


1. Ensemble Pattern 🎭

```
Request → [Model A] ──┐
          [Model B] ───→ Aggregator → Response
          [Model C] ──┘
```

Combine multiple models for better predictions. Netflix recommendations use this.
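A minimal aggregator sketch, assuming each model exposes a `predict` method that returns a score:

```python
# Ensemble: aggregate the scores of several models (simple mean).
def ensemble_predict(models, features):
    scores = [m.predict(features) for m in models]
    return sum(scores) / len(scores)  # could also vote or weight by accuracy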


2. Fallback Pattern 🔄

```
Request → [ML Model] → Success? → Response
                │
                ▼ (Fail)
          [Rules Engine] → Response
                │
                ▼ (Fail)
          [Default/Popular] → Response
```

If the model fails, fall back to rules. If the rules fail, fall back to defaults.
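A sketch of that fallback chain - the predictors here are hypothetical callables standing in for the model and the rules engine:

```python
# Fallback chain: try each predictor in order, return the first success.
def predict_with_fallback(features, ml_model, rules_engine, default):
    for predictor in (ml_model, rules_engine):
        try:
            return predictor(features)
        except Exception:
            continue  # this tier failed; try the next, simpler tier
    return default    # last resort, e.g. the most popular items
```

Note the ordering: most accurate tier first, most reliable tier last.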


3. Gateway Pattern 🚪

```
Request → [AI Gateway] → Route to best model
              │
    ┌─────────┼──────────┐
    ▼         ▼          ▼
[GPT-4]  [Claude]  [Local LLM]
```

Put multiple LLMs behind a single gateway to optimize for cost, latency, and quality.
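A toy routing sketch - the model names and thresholds are illustrative assumptions, not any real gateway's API:

```python
# AI gateway: pick a model tier from simple request attributes.
def route_request(prompt, latency_budget_ms, quality='standard'):
    if latency_budget_ms < 50:
        return 'local-llm'  # fastest and cheapest, lowest quality
    if quality == 'premium':
        return 'gpt-4'      # best quality, highest cost
    return 'claude'         # balanced default
```

Real gateways add retries, per-model rate limits, and cost accounting on top of a routing rule like this.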


4. Human-in-the-Loop Pattern 👀

```
Request → [AI Model] → Confidence > 95%? → Auto-respond
                              │
                              ▼ (Low confidence)
                        [Human Review Queue]
```

Route low-confidence predictions to a human - quality stays high! 📊
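A minimal sketch of that routing decision - the threshold and the `review_queue` interface are illustrative:

```python
# Human-in-the-loop: auto-respond only above a confidence threshold.
def triage(prediction, confidence, review_queue, threshold=0.95):
    if confidence > threshold:
        return prediction  # auto-respond
    review_queue.append((prediction, confidence))  # send to human reviewers
    return None            # caller shows a "pending review" state
```

Tune the threshold by measuring human-agreement rates at each confidence band.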

Cost Optimization Strategies

⚠️ Warning

⚠️ AI systems get expensive - optimize early!

| Strategy | Savings | Implementation |
|----------|---------|----------------|
| Model distillation | 60-80% GPU cost | Large model → small model |
| Quantization | 50% memory | FP32 → INT8 |
| Caching predictions | 40-70% compute | Redis/Memcached |
| Batch inference | 30-50% cost | Off-peak processing |
| Spot instances | 60-90% training | AWS Spot/GCP Preemptible |
| Auto-scaling | Variable | Scale to zero when idle |

```yaml
# Auto-scale ML serving (Kubernetes HPA config, abridged).
# Note: minReplicas: 0 needs the HPAScaleToZero feature gate
# or a tool like KEDA/Knative - plain HPA stops at 1 replica.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 0  # Scale to zero! 💰
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Rule of thumb: Start with managed services (SageMaker, Vertex AI) - for small teams they're cheaper than building from scratch! 💰

Real World Example: Recommendation System

✅ Example

Design: E-commerce Product Recommendation System (10M users)

Requirements:

- Real-time recommendations (< 200ms)

- Personalized for each user

- Handle cold start (new users)

- 10M daily active users

Architecture:

```
User Request → API Gateway → Recommendation Service
                                     │
                     ┌───────────────┼────────────────┐
                     ▼               ▼                ▼
               [Candidate Gen]  [Feature Store]  [Cache Check]
               (ANN Search)     (User features)  (Redis)
                     │               │
                     ▼               ▼
               [Ranking Model] ← [Features]
                     │
                     ▼
               [Re-ranking] (diversity, freshness)
                     │
                     ▼
               [Response] → Cache → User
```

Tech Stack:

- Candidate Generation: FAISS/Milvus (vector search)

- Feature Store: Feast + Redis

- Model Serving: TensorFlow Serving

- Cache: Redis (30 min TTL)

- Pipeline: Airflow + Spark

- Monitoring: Prometheus + Grafana

Estimated Cost: ~$5K-15K/month for 10M users 💰

✅ Key Takeaways

✅ AI systems differ from normal systems - they need additional components: data pipelines, feature stores, model serving, and retraining cycles

✅ Three pillars are essential - the serving layer (real-time predictions), training pipeline (model updates), and data pipeline (data quality) are all equally important

✅ Real-time vs batch is a trade-off - real-time when latency is critical, batch when cost matters; choose based on the use case

✅ The feature store is critical - centralize features, keep training and serving consistent, enable reuse, and reduce engineering complexity

✅ Model versioning is important - multiple versions coexist in production, so you need A/B testing, gradual rollout, and rollback capability

✅ Monitoring ML systems is different - track model accuracy, data quality, prediction distribution, retraining frequency, and business metrics

✅ Cache predictions smartly - inference is expensive; serving cached results for repeat queries optimizes latency and cost

✅ Scalability planning is essential - consider upfront how QPS, model size, and feature computation scale with user growth

🏁 Mini Challenge

Challenge: Design an AI-Powered System from Scratch


Design a production system (50-60 mins):


  1. Choose Domain: E-commerce, social media, content recommendation, or logistics
  2. Gather Requirements: Define users, scale, latency, and accuracy targets
  3. Architecture: Design the components, data flow, and technology stack
  4. Bottleneck Analysis: Where will the system struggle? Identify it and plan mitigation
  5. Scale Calculation: Estimate infrastructure cost and latency at scale
  6. Monitoring Plan: Define metrics, dashboards, and alerts
  7. Trade-offs: Document accuracy vs latency, cost vs performance

Tools: Draw.io for diagrams, Google Docs for the design doc


Deliverable: A complete design doc with architecture diagram, tech choices, and cost estimates 📊

Interview Questions

Q1: What is the most important skill in an AI system design interview?

A: Requirement clarification, constraint understanding, scalability thinking, and trade-off analysis. The ability to justify design choices, not just suggest random tech. Ask good questions and clarify ambiguity.


Q2: Real-time vs batch predictions - when do you choose each?

A: Real-time: interactive use cases (recommendations, search) that need a sub-second response. Batch: non-urgent work (reports, offline analysis), large-volume processing, cost optimization. Real-time is more complex and more expensive.


Q3: What considerations matter for model serving infrastructure?

A: Model size, inference latency, QPS (queries per second), hardware requirements, update frequency, and A/B testing capability. TensorFlow Serving, KServe, and Seldon are popular choices.


Q4: What is the cold start problem in recommendation system design?

A: The challenge of recommending for a new user/item with little history data. Solutions: content-based and popularity-based fallbacks, an onboarding survey, then a gradual transition to collaborative filtering.


Q5: What is a feature store, and why is it important in AI systems?

A: Centralized feature management - consistent features across training and serving, fast feature retrieval, and feature reuse across models. It reduces engineering complexity, improves model quality, and enables faster experimentation.

Frequently Asked Questions

❓ How is AI system design different from normal system design?
AI systems additionally need model training, data pipelines, feature stores, model versioning, and inference optimization. You manage traditional CRUD operations plus the ML lifecycle.
❓ What should I expect in an AI system design interview?
Be ready to discuss ML pipeline design, model serving architecture, data flow, scaling strategies, monitoring, and trade-offs. End-to-end thinking is important.
❓ Can a small team build an AI system?
Yes! Use managed services - AWS SageMaker, GCP Vertex AI, etc. You don't have to manage the infrastructure. Start simple, scale gradually.
❓ What is the minimum infra for an AI system?
Minimum: an API server, model storage, a data pipeline, and monitoring. With cloud services you can start on a single VM. A GPU is optional - CPU inference works for many models.
🧠 Knowledge Check

What is "Training-Serving Skew" in an AI system?