Real-time data systems
Introduction
Imagine you order food on Swiggy 🍕. The order status updates in real time:
- "Order confirmed" → within 2 seconds
- "Restaurant accepted" → instant
- "Delivery partner assigned" → real-time
- Live location tracking → every 3 seconds
All of this works because of real-time data systems! Batch processing can't do this: wait an hour and the food goes cold 😅
Real-time systems are everywhere in modern tech: stock trading, fraud detection, gaming, social media feeds, IoT sensors. Let's dive deep! 🚀
Batch vs Streaming — Key Differences
Batch Processing 📦
- Collect data → wait → process everything at once
- Like washing clothes — collect dirty clothes, wash together
- Latency: minutes to hours
- Example: Daily sales report
Stream Processing 🌊
- Process each piece of data as it arrives, no waiting
- Like assembly line — each item immediately handled
- Latency: milliseconds to seconds
- Example: Fraud detection
| Feature | Batch | Streaming |
|---|---|---|
| Latency | Minutes-Hours | Milliseconds-Seconds |
| Data | Bounded (fixed set) | Unbounded (infinite) |
| Processing | MapReduce, Spark | Kafka, Flink, Spark Streaming |
| Complexity | 🟢 Lower | 🔴 Higher |
| Cost | 💰 Lower | 💰💰💰 Higher |
| Accuracy | ✅ Complete data | ⚠️ Approximate (sometimes) |
| State | Stateless (mostly) | Stateful (tricky) |
| Recovery | Re-run the batch | Checkpoint + replay |
Key insight: "Do I really need real-time?" is the first question to ask. For 90% of use cases, near real-time (a few minutes of delay) is enough! 🤔
Stream Processing Architecture
```
         REAL-TIME DATA SYSTEM ARCHITECTURE

 PRODUCERS         MESSAGE BUS          CONSUMERS
 ┌──────────┐   ┌───────────────┐   ┌───────────┐
 │ Web App  │──▶│               │──▶│ Analytics │
 │ Mobile   │──▶│ Apache Kafka  │──▶│ ML Model  │
 │ IoT      │──▶│ (or Pulsar)   │──▶│ Alerts    │
 │ DB CDC   │──▶│               │──▶│ Search    │
 └──────────┘   └───────┬───────┘   └───────────┘
                        │
             ┌──────────▼──────────┐
             │  STREAM PROCESSOR   │
             │  Apache Flink       │
             │  Spark Streaming    │
             │  Kafka Streams      │
             │                     │
             │  • Filter           │
             │  • Transform        │
             │  • Aggregate        │
             │  • Window           │
             │  • Join             │
             └──────────┬──────────┘
                        │
             ┌──────────▼──────────┐
             │       SINKS         │
             │  DB / Dashboard /   │
             │  Alert / Data Lake  │
             └─────────────────────┘
```
Event-Driven Architecture
Real-time systems are mostly event-driven:
Event = Something happened at a point in time
Event-Driven Patterns:
1. Event Notification 📢
- "Something happened!": lightweight, just a notification
- The consumer decides what to do
- Example: "New order placed" → both the email service and the inventory service react
2. Event-Carried State Transfer 📦
- The event carries the full data
- The consumer needs no additional API call
- Example: the order event contains items, prices, address, everything
3. Event Sourcing 📝
- Instead of storing state, you store events
- Current state = replay of all events
- Example: bank account, where the list of deposits and withdrawals = current balance
4. CQRS (Command Query Responsibility Segregation) 🔀
- Separate models for writes (commands) and reads (queries)
- Writes optimized for consistency, reads optimized for speed
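The event sourcing pattern above (pattern 3) can be sketched in a few lines. The `Deposit`/`Withdrawal` event types here are hypothetical, purely for illustration:

```python
from dataclasses import dataclass

# Hypothetical event types for a bank account (illustration only)
@dataclass
class Deposit:
    amount: int

@dataclass
class Withdrawal:
    amount: int

def replay(events):
    """Current state = fold over all stored events, in order."""
    balance = 0
    for e in events:
        if isinstance(e, Deposit):
            balance += e.amount
        elif isinstance(e, Withdrawal):
            balance -= e.amount
    return balance

# We never store "balance = 1200" anywhere; we store what happened.
events = [Deposit(1000), Withdrawal(300), Deposit(500)]
print(replay(events))  # 1200
```

The key design point: the events are the source of truth, and any derived state (the balance) can always be rebuilt from them.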
Windowing — Time-Based Aggregation
Say you need to count "total orders" on streaming data, but the data never stops! 🤔 Windowing solves this:
1. Tumbling Window ⬜⬜⬜
- Fixed size, no overlap
- Every 5 minutes, count orders
2. Sliding Window 📏
- Fixed size, overlapping
- Every minute, count orders in last 5 minutes
3. Session Window 👤
- Dynamic size, based on activity gaps
- User inactive for 30 min = window closes
4. Global Window 🌍
- All data in one window
- Useful with triggers (every 1000 events, fire)
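A tumbling-window count can be sketched in plain Python (Flink's window operator does the equivalent at scale; the timestamps, event shape, and 5-minute window size here are illustrative):

```python
from collections import defaultdict

def tumbling_counts(events, window_size_sec=300):
    """Count events per fixed, non-overlapping time window.

    `events` is a list of (timestamp_sec, payload) pairs; window
    boundaries are aligned to multiples of `window_size_sec`.
    """
    counts = defaultdict(int)
    for ts, _payload in events:
        # Each timestamp maps to exactly one window: no overlap.
        window_start = (ts // window_size_sec) * window_size_sec
        counts[window_start] += 1
    return dict(counts)

orders = [(10, "o1"), (200, "o2"), (310, "o3"), (620, "o4")]
print(tumbling_counts(orders))  # {0: 2, 300: 1, 600: 1}
```

A sliding window would instead assign each event to every window it overlaps, which is why sliding windows cost more state.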
The windowing concept is fundamental to real-time analytics! 📊
Processing Guarantees — Critical Concept
In real-time systems, message processing guarantees are super important:
💡 At-most-once: a message may be lost, but never duplicated
- Fire and forget
- Use case: metrics, logs (okay to miss a few)
💡 At-least-once: a message is definitely processed, but may be duplicated
- Retry on failure
- Use case: most applications (with idempotent consumers)
💡 Exactly-once: each message is processed exactly one time
- Hardest to achieve! Very tricky in a distributed system
- Use Kafka transactions + idempotent producers
Practical tip: at-least-once + idempotent consumers = effectively exactly-once. This is the approach most production systems use!
Idempotent consumer example:
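A minimal sketch of the idea: track which event IDs have already been handled, so a redelivered message has no effect. (In production the seen-ID set would live in a durable store such as Redis or a DB; the event shape here is made up.)

```python
processed_ids = set()  # in production: a durable store, not in-memory
results = []

def handle(event):
    """Process each event at most once, even if it is delivered twice."""
    if event["id"] in processed_ids:
        return  # duplicate delivery: safely ignored
    processed_ids.add(event["id"])
    results.append(event["value"])

# At-least-once delivery may repeat event 1:
for e in [{"id": 1, "value": 100}, {"id": 2, "value": 50}, {"id": 1, "value": 100}]:
    handle(e)

print(results)  # [100, 50]: the duplicate had no effect
```

This is why "at-least-once + idempotent consumer" behaves like exactly-once from the outside.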
Change Data Capture (CDC)
CDC means capturing database changes in real time:
Problem: when an order is inserted into MySQL, it must immediately become searchable in Elasticsearch. How?
How CDC works:
- The database internally maintains a transaction log (binlog, WAL)
- A CDC tool (Debezium) reads this log
- It converts the changes into events
- It publishes them to Kafka
- Consumers react to them
Debezium Event:
op: c=create, u=update, d=delete, r=read(snapshot)
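A simplified Debezium-style change event looks roughly like this (the field values are illustrative; real events also carry a schema and richer source metadata):

```python
import json

# Simplified Debezium-style change event for an INSERT into `orders`.
# Values are made up for illustration.
raw = """
{
  "op": "c",
  "before": null,
  "after": {"order_id": 101, "status": "confirmed", "amount": 499},
  "source": {"connector": "mysql", "table": "orders"},
  "ts_ms": 1700000000000
}
"""
event = json.loads(raw)
assert event["op"] == "c"       # c = create (insert)
assert event["before"] is None  # an insert has no prior row image
print(event["after"]["status"])  # confirmed
```

For an update, `before` would hold the old row and `after` the new one, which is what makes CDC events useful for cache invalidation and index updates.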
CDC use cases:
- 🔍 Real-time search index update
- 📊 Real-time analytics dashboard
- 🔄 Database replication
- 📦 Cache invalidation
- 🤖 ML feature store update
Backpressure — Handle or Die!
⚠️ Backpressure = the producer is faster than the consumer. Data piles up!
Imagine: 10,000 events/second are produced, but the consumer can only handle 5,000/second. 5,000 events/second accumulate! 💥
What happens without handling:
- Memory overflow → crash
- Increasing latency → minutes behind
- Data loss → worst case
Solutions:
🔵 Buffer: store in a queue (Kafka does this well with its disk-based buffer)
🔵 Drop: skip less important messages (okay for metrics, NOT okay for transactions)
🔵 Sample: process every nth message
🔵 Scale: add more consumers (Kafka consumer groups)
🔵 Rate limit: slow down the producer
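The buffer-with-drop combination can be sketched with a bounded in-memory buffer standing in for a real queue (capacity of 3 chosen only to keep the example small):

```python
from collections import deque

# Bounded buffer sketch: when the consumer falls behind, the oldest
# messages are evicted instead of memory growing without limit.
buffer = deque(maxlen=3)  # a full deque drops from the opposite end on append

for event in range(1, 7):  # the producer emits 6 events, consumer reads none
    buffer.append(event)   # events 1..3 get evicted as 4..6 arrive

print(list(buffer))  # [4, 5, 6]: memory stays bounded, newest data is kept
```

Whether dropping the oldest or the newest data is acceptable depends entirely on the use case, which is exactly the metrics-vs-transactions distinction above.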
Kafka advantage: disk-based storage means data stays safe even when consumers are slow. It can handle days' worth of backlog, unlike in-memory queues that crash.
Monitor: always watch the consumer lag metric! 📊
Stateful Stream Processing
The most interesting stream processing is stateful: it has to remember previous events.
Stateless (simple): each event is handled on its own, e.g. filtering.
Stateful (complex): the output depends on earlier events, e.g. running counts or joins.
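The contrast can be sketched like this (the event shapes and the 1000-rupee threshold are illustrative):

```python
# Stateless: each event is judged on its own, e.g. filter big orders.
def is_big_order(event):
    return event["amount"] > 1000

# Stateful: the output depends on everything seen so far,
# e.g. a running count of events per user.
class RunningCount:
    def __init__(self):
        self.counts = {}  # state that must survive crashes in production

    def process(self, event):
        user = event["user"]
        self.counts[user] = self.counts.get(user, 0) + 1
        return self.counts[user]

rc = RunningCount()
events = [
    {"user": "a", "amount": 1500},
    {"user": "b", "amount": 200},
    {"user": "a", "amount": 700},
]
print([is_big_order(e) for e in events])  # [True, False, False]
print([rc.process(e) for e in events])    # [1, 1, 2]
```

The stateless function can be scaled by simply running more copies; the stateful one cannot, because `counts` has to be partitioned and persisted somewhere. That is the whole difficulty.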
State Examples:
- 📊 Running count/average
- 🔍 Deduplication (seen this event before?)
- 🔗 Stream-stream joins (match orders with payments)
- 🕐 Session tracking (user activity sessions)
State Storage Options:
| Option | Speed | Durability | Use Case |
|---|---|---|---|
| In-memory | ⚡ Fastest | ❌ Lost on crash | Ephemeral counters |
| RocksDB (embedded) | 🚀 Fast | ✅ Persistent | Flink default |
| External DB (Redis) | 🟡 Network latency | ✅ Shared | Multi-app state |
Challenge: if a node crashes, the state must not be lost! Checkpointing: periodically save a snapshot of the state. After a crash, resume from the last checkpoint.
Flink checkpoints automatically, one reason it's the best stream processor! 🏆
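The checkpoint-and-restore cycle can be sketched like this (a checkpoint every 2 events is an arbitrary interval chosen for the example; real systems checkpoint to durable storage, not a variable):

```python
import copy

# Checkpointing sketch: periodically snapshot the operator state so a
# crash can resume from the last checkpoint instead of starting over.
state = {"count": 0}
checkpoint = None

def process(event):
    state["count"] += event

for i, event in enumerate([1, 1, 1, 1, 1], start=1):
    process(event)
    if i % 2 == 0:  # checkpoint every 2 events
        checkpoint = copy.deepcopy(state)

# Simulated crash after event 5: restore the last checkpoint (taken
# after event 4), losing only the work done since then.
state = copy.deepcopy(checkpoint)
print(state)  # {'count': 4}
```

The events between the last checkpoint and the crash must then be replayed from the message bus, which is why Kafka's replayable log and Flink's checkpoints fit together so well.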
Real-Time Processing Tools
Choosing the right tool is important:
| Tool | Type | Latency | Strengths | Weakness |
|---|---|---|---|---|
| **Apache Kafka** | Message Bus | ms | Durability, scale | Not a processor |
| **Apache Flink** | Stream Processor | ms | Stateful, exactly-once | Complex setup |
| **Spark Streaming** | Micro-batch | seconds | Easy if you know Spark | Not true streaming |
| **Kafka Streams** | Stream Processor | ms | Simple, no cluster | JVM only |
| **Apache Pulsar** | Message Bus | ms | Multi-tenancy | Newer, less ecosystem |
| **AWS Kinesis** | Managed Stream | ms | No ops, AWS native | Vendor lock-in |
| **Google Dataflow** | Managed Processor | ms | Auto-scale, GCP | Vendor lock-in |
Recommendation by use case:
- Simple event processing → Kafka + Kafka Streams
- Complex stateful → Kafka + Flink
- Already using Spark → Spark Structured Streaming
- AWS shop → Kinesis + Lambda
- GCP shop → Pub/Sub + Dataflow
Real-World: Uber Real-Time System
Let's look at Uber's real-time data system 🚗:
Scale: 1 trillion+ messages/day through Kafka!
Use Cases:
1. 🗺️ Real-time ETA: Driver location + traffic + demand → ETA calculation (every 4 seconds update)
2. 💰 Dynamic Pricing: Supply/demand ratio real-time calculate → surge pricing
3. 🛡️ Fraud Detection: Trip patterns analyze → suspicious activity flag (< 100ms)
4. 📊 Driver Matching: Rider request → nearest available driver (< 1 second)
Tech Stack:
- Kafka — Central event bus
- Apache Flink — Stream processing (complex event processing)
- Apache Pinot — Real-time OLAP (dashboard queries)
- Custom ML — Demand forecasting, ETA prediction
Architecture flow: rider/driver apps → Kafka (event bus) → Flink (stream processing) → Pinot (dashboards) and ML models → pricing, matching, and alerting services
Key learning: Uber uses both batch and streaming. Not everything needs real-time; they choose wisely! 🧠
Testing Real-Time Systems
Testing real-time systems is much harder than testing batch:
1. Unit Tests 🧪
- Test individual processing functions
- Does the window logic work correctly?
- Any state management bugs?
2. Integration Tests 🔗
- End-to-end tests using embedded Kafka
- Testcontainers library: spin up Kafka and Flink in Docker
3. Load Tests 🏋️
- Simulate production traffic
- Test backpressure handling
- Scale test: what happens at 10x the normal load?
4. Chaos Tests 💥
- Kill a node: does recovery work?
- Network partition: any split brain?
- Clock skew: are out-of-order events handled?
5. Replay Tests ▶️
- Capture production events and replay them
- Does the new logic give correct results on old data?
Pro tip: first run in shadow mode in production: process real data but discard the results. Go live only after the issues are caught! 🎯
Prompt: Design Real-Time System
Real-Time Systems Best Practices
For production real-time systems, follow these:
✅ Question the need: do you really need real-time? Is near real-time enough?
✅ Design for failure: everything will fail; plan accordingly.
✅ Idempotent consumers: duplicate processing must be safe
✅ Monitor consumer lag: the most important metric in streaming
✅ Schema evolution: use Avro/Protobuf; avoid JSON at scale
✅ Dead letter queue: handle unprocessable messages separately
✅ Graceful degradation: degrade under overload, never crash
✅ Capacity planning: design for 3x the expected peak traffic
✅ Observability: metrics, logs, traces; all three pillars are needed
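The dead-letter-queue practice above can be sketched in a few lines (the message shapes and the validation rule are made up for illustration):

```python
dead_letters = []
processed = []

def process(msg):
    """Toy processing step: rejects malformed messages."""
    if not isinstance(msg.get("amount"), int):
        raise ValueError("bad amount")
    processed.append(msg["amount"])

# Messages that fail processing go to a dead letter queue for later
# inspection, instead of blocking or crashing the whole pipeline.
for msg in [{"amount": 10}, {"amount": "oops"}, {"amount": 20}]:
    try:
        process(msg)
    except ValueError:
        dead_letters.append(msg)

print(processed)     # [10, 20]
print(dead_letters)  # [{'amount': 'oops'}]
```

In a real pipeline the dead letters would be written to a separate topic with the error attached, so they can be fixed and replayed.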
Real-time systems are complex, but master them and you're in the top 10% of data engineers! 🏆
Next up: Kafka basics, the backbone of real-time! 🚀
✅ Key Takeaways
✅ Batch vs Streaming: batch = fixed data sets with minutes-to-hours latency; streaming = unbounded data with millisecond latency. Carefully assess the real-time requirement (for 90% of cases, batch is enough)
✅ Event-Driven Architecture: events (something happened at a timestamp). Stream processors consume them and react. Event sourcing: reconstruct state by replaying events
✅ Processing Guarantees: at-most-once (may lose), at-least-once (may duplicate), exactly-once (hard). Practical: at-least-once + idempotent consumers = effectively exactly-once
✅ Windowing Concepts: tumbling (fixed, non-overlapping), sliding (fixed, overlapping), session (activity gaps). Real-time aggregations are window-based
✅ Backpressure Handling: producer faster than consumer → data accumulates. Options: buffer, drop, sample, scale, rate-limit. Kafka's disk-based buffer is a big advantage
✅ State Management: stateful operations (counting, joining, sessions) need to maintain state. Checkpointing = periodic state saving. Node crash → recover from the last checkpoint
✅ Change Data Capture (CDC): capture database changes → stream. Debezium is the popular tool. Real-time search indexes, cache invalidation, analytics
✅ Tools Ecosystem: Kafka (message bus), Flink (stream processor), Spark Streaming (micro-batch), Kafka Streams (simple). Select by use-case fit
🏁 🎮 Mini Challenge
Challenge: Real-Time Stream Processing Hands-On
Practice Kafka + Flink-style processing:
1. Set up Kafka locally (5 min)
2. Producer: simulate events (10 min)
3. Consumer: process the stream (10 min)
4. Monitor (5 min):
- Check consumer lag
- Measure throughput
- Check processing latency
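If you don't have Kafka handy, the producer/consumer steps can be rehearsed in-process; here `queue.Queue` stands in for a single-partition topic, and the order-event shape is made up:

```python
import queue
import time

topic = queue.Queue()  # stand-in for a Kafka topic (assumes one partition)

def produce(n):
    """Simulate events: one order event per message."""
    for i in range(n):
        topic.put({"order_id": i, "amount": 100 + i, "ts": time.time()})

def consume():
    """Process the stream: count orders and sum revenue."""
    count, revenue = 0, 0
    while not topic.empty():
        event = topic.get()
        count += 1
        revenue += event["amount"]
    return count, revenue

produce(5)
count, revenue = consume()
print(count, revenue)  # 5 510
```

Once this logic works, swapping the `Queue` for a real Kafka producer/consumer client changes the plumbing but not the processing code, which is the point of the exercise.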
Learning: real-time itself is simple (just process events), but making it reliable in production is complex! 🚀
💼 Interview Questions
Q1: Real-time system design: what's the first decision?
A: "Do I REALLY need real-time?" For 90% of cases, near real-time (minutes) is enough. True real-time is expensive, complex, and hard. Understand the batch vs real-time trade-off. Consider cost implications, complexity, and team skills!
Q2: Stream processing: what's the state management challenge?
A: Stateful operations are hard! Example: "count transactions in the last hour" means remembering an hour of events. A crash loses that state! Solutions: checkpointing (save state regularly), RocksDB (embedded store), Redis (external store). Recovery: a replay mechanism (resume from the last checkpoint).
Q3: Is exactly-once processing practical?
A: Theoretically possible but expensive! Idempotent consumer + deduplication = effectively exactly-once. Most systems accept "at-least-once" + idempotency. Financial systems: exactly-once is critical (transactions must not be duplicated). Metrics/logs: at-most-once is okay.
Q4: Windowing: practical use cases?
A: "Count orders every 5 minutes" → tumbling. "Moving average over the last hour" → sliding. "Group events per user session" → session window. Real-time dashboards, anomaly detection, fraud detection all use windowing. Choose the window size carefully: too big (slow), too small (overhead)!
Q5: What's the strategy to handle 1M events/sec in a real-time system?
A: Partitioning (distribute load), parallelism (multiple consumers), batching (small micro-batches), compression (reduce network), monitoring (track consumer lag). Infrastructure: Kubernetes auto-scaling. Design: stateless preferred (easy to scale); stateful is hard (state sync is complex). Uber achieved this scale, but with 10+ years of experience! 💪
Frequently Asked Questions
What is backpressure?