Real-time data systems
Introduction
Imagine you order food on Swiggy 🍕. The order status updates in real time:
- "Order confirmed" → within 2 seconds
- "Restaurant accepted" → instant
- "Delivery partner assigned" → real-time
- Live location tracking → every 3 seconds
All of this works because of real-time data systems! Batch processing can't do this: wait an hour and the food goes cold 😅
Real-time systems are everywhere in modern tech: stock trading, fraud detection, gaming, social media feeds, IoT sensors. Let's dive deep! 🚀
Batch vs Streaming — Key Differences
Batch Processing 📦
- Collect data → wait → process everything at once
- Like washing clothes — collect dirty clothes, wash together
- Latency: minutes to hours
- Example: Daily sales report
Stream Processing 🌊
- Process each piece of data as it arrives, no waiting
- Like assembly line — each item immediately handled
- Latency: milliseconds to seconds
- Example: Fraud detection
| Feature | Batch | Streaming |
|---|---|---|
| Latency | Minutes-Hours | Milliseconds-Seconds |
| Data | Bounded (fixed set) | Unbounded (infinite) |
| Processing | MapReduce, Spark | Kafka, Flink, Spark Streaming |
| Complexity | 🟢 Lower | 🔴 Higher |
| Cost | 💰 Lower | 💰💰💰 Higher |
| Accuracy | ✅ Complete data | ⚠️ Approximate (sometimes) |
| State | Stateless (mostly) | Stateful (tricky) |
| Recovery | Re-run the batch | Checkpoint + replay |
Key insight: "Do I really need real-time?" is the first question to ask. For 90% of use cases, near real-time (a few minutes of delay) is enough! 🤔
Stream Processing Architecture
```
         REAL-TIME DATA SYSTEM ARCHITECTURE

 PRODUCERS         MESSAGE BUS          CONSUMERS
 ┌──────────┐   ┌───────────────┐   ┌───────────┐
 │ Web App  │──▶│               │──▶│ Analytics │
 │ Mobile   │──▶│ Apache Kafka  │──▶│ ML Model  │
 │ IoT      │──▶│ (or Pulsar)   │──▶│ Alerts    │
 │ DB CDC   │──▶│               │──▶│ Search    │
 └──────────┘   └───────┬───────┘   └───────────┘
                        │
             ┌──────────▼──────────┐
             │  STREAM PROCESSOR   │
             │  Apache Flink       │
             │  Spark Streaming    │
             │  Kafka Streams      │
             │                     │
             │  • Filter           │
             │  • Transform        │
             │  • Aggregate        │
             │  • Window           │
             │  • Join             │
             └──────────┬──────────┘
                        │
             ┌──────────▼──────────┐
             │       SINKS         │
             │  DB / Dashboard /   │
             │  Alert / Data Lake  │
             └─────────────────────┘
```
Event-Driven Architecture
Real-time systems are mostly event-driven:
Event = Something happened at a point in time
Event-Driven Patterns:
1. Event Notification 📢
- "Something happened!": lightweight, just a notification
- The consumer decides what to do
- Example: "New order placed" → both the email service and the inventory service react
2. Event-Carried State Transfer 📦
- The event carries the full data
- The consumer needs no additional API call
- Example: the order event contains items, prices, address, everything
3. Event Sourcing 📝
- Instead of storing state, you store events
- Current state = replay of all events
- Example: bank account, where the list of deposits and withdrawals = current balance
4. CQRS (Command Query Responsibility Segregation) 🔀
- Separate models for writes (commands) and reads (queries)
- Writes optimized for consistency, reads optimized for speed
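The event sourcing pattern above (pattern 3) can be sketched in a few lines. The `Deposit`/`Withdrawal` event types here are hypothetical, purely for illustration:

```python
from dataclasses import dataclass

# Hypothetical event types for a bank account (illustration only)
@dataclass
class Deposit:
    amount: int

@dataclass
class Withdrawal:
    amount: int

def replay(events):
    """Current state = fold over all stored events, in order."""
    balance = 0
    for e in events:
        if isinstance(e, Deposit):
            balance += e.amount
        elif isinstance(e, Withdrawal):
            balance -= e.amount
    return balance

# We never store "balance = 1200" anywhere; we store what happened.
events = [Deposit(1000), Withdrawal(300), Deposit(500)]
print(replay(events))  # 1200
```

The key design point: the events are the source of truth, and any derived state (the balance) can always be rebuilt from them.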
Windowing — Time-Based Aggregation
Say you need to count "total orders" on streaming data, but the data never stops! 🤔 Windowing solves this:
1. Tumbling Window ⬜⬜⬜
- Fixed size, no overlap
- Every 5 minutes, count orders
2. Sliding Window 📏
- Fixed size, overlapping
- Every minute, count orders in last 5 minutes
3. Session Window 👤
- Dynamic size, based on activity gaps
- User inactive for 30 min = window closes
4. Global Window 🌍
- All data in one window
- Useful with triggers (every 1000 events, fire)
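A tumbling-window count can be sketched in plain Python (Flink's window operator does the equivalent at scale; the timestamps, event shape, and 5-minute window size here are illustrative):

```python
from collections import defaultdict

def tumbling_counts(events, window_size_sec=300):
    """Count events per fixed, non-overlapping time window.

    `events` is a list of (timestamp_sec, payload) pairs; window
    boundaries are aligned to multiples of `window_size_sec`.
    """
    counts = defaultdict(int)
    for ts, _payload in events:
        # Each timestamp maps to exactly one window: no overlap.
        window_start = (ts // window_size_sec) * window_size_sec
        counts[window_start] += 1
    return dict(counts)

orders = [(10, "o1"), (200, "o2"), (310, "o3"), (620, "o4")]
print(tumbling_counts(orders))  # {0: 2, 300: 1, 600: 1}
```

A sliding window would instead assign each event to every window it overlaps, which is why sliding windows cost more state.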
The windowing concept is fundamental to real-time analytics! 📊
Processing Guarantees — Critical Concept
In real-time systems, message processing guarantees are super important:
💡 At-most-once: a message may be lost, but never duplicated
- Fire and forget
- Use case: metrics, logs (okay to miss a few)
💡 At-least-once: a message is definitely processed, but may be duplicated
- Retry on failure
- Use case: most applications (with idempotent consumers)
💡 Exactly-once: each message is processed exactly one time
- Hardest to achieve! Very tricky in a distributed system
- Use Kafka transactions + idempotent producers
Practical tip: at-least-once + idempotent consumers = effectively exactly-once. This is the approach most production systems use!
Idempotent consumer example:
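A minimal sketch of the idea: track which event IDs have already been handled, so a redelivered message has no effect. (In production the seen-ID set would live in a durable store such as Redis or a DB; the event shape here is made up.)

```python
processed_ids = set()  # in production: a durable store, not in-memory
results = []

def handle(event):
    """Process each event at most once, even if it is delivered twice."""
    if event["id"] in processed_ids:
        return  # duplicate delivery: safely ignored
    processed_ids.add(event["id"])
    results.append(event["value"])

# At-least-once delivery may repeat event 1:
for e in [{"id": 1, "value": 100}, {"id": 2, "value": 50}, {"id": 1, "value": 100}]:
    handle(e)

print(results)  # [100, 50]: the duplicate had no effect
```

This is why "at-least-once + idempotent consumer" behaves like exactly-once from the outside.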
Change Data Capture (CDC)
CDC means capturing database changes in real time:
Problem: when an order is inserted into MySQL, it must immediately become searchable in Elasticsearch. How?
How CDC works:
- The database internally maintains a transaction log (binlog, WAL)
- A CDC tool (Debezium) reads this log
- It converts the changes into events
- It publishes them to Kafka
- Consumers react to them
Debezium Event:
op: c=create, u=update, d=delete, r=read(snapshot)
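A simplified Debezium-style change event looks roughly like this (the field values are illustrative; real events also carry a schema and richer source metadata):

```python
import json

# Simplified Debezium-style change event for an INSERT into `orders`.
# Values are made up for illustration.
raw = """
{
  "op": "c",
  "before": null,
  "after": {"order_id": 101, "status": "confirmed", "amount": 499},
  "source": {"connector": "mysql", "table": "orders"},
  "ts_ms": 1700000000000
}
"""
event = json.loads(raw)
assert event["op"] == "c"       # c = create (insert)
assert event["before"] is None  # an insert has no prior row image
print(event["after"]["status"])  # confirmed
```

For an update, `before` would hold the old row and `after` the new one, which is what makes CDC events useful for cache invalidation and index updates.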
CDC use cases:
- 🔍 Real-time search index update
- 📊 Real-time analytics dashboard
- 🔄 Database replication
- 📦 Cache invalidation
- 🤖 ML feature store update
Backpressure — Handle or Die!
⚠️ Backpressure = the producer is faster than the consumer. Data piles up!
Imagine: 10,000 events/second are produced, but the consumer can only handle 5,000/second. 5,000 events/second accumulate! 💥
What happens without handling:
- Memory overflow → crash
- Increasing latency → minutes behind
- Data loss → worst case
Solutions:
🔵 Buffer: store in a queue (Kafka does this well with its disk-based buffer)
🔵 Drop: skip less important messages (okay for metrics, NOT okay for transactions)
🔵 Sample: process every nth message
🔵 Scale: add more consumers (Kafka consumer groups)
🔵 Rate limit: slow down the producer
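The buffer-with-drop combination can be sketched with a bounded in-memory buffer standing in for a real queue (capacity of 3 chosen only to keep the example small):

```python
from collections import deque

# Bounded buffer sketch: when the consumer falls behind, the oldest
# messages are evicted instead of memory growing without limit.
buffer = deque(maxlen=3)  # a full deque drops from the opposite end on append

for event in range(1, 7):  # the producer emits 6 events, consumer reads none
    buffer.append(event)   # events 1..3 get evicted as 4..6 arrive

print(list(buffer))  # [4, 5, 6]: memory stays bounded, newest data is kept
```

Whether dropping the oldest or the newest data is acceptable depends entirely on the use case, which is exactly the metrics-vs-transactions distinction above.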
Kafka advantage: disk-based storage means data stays safe even when consumers are slow. It can handle days' worth of backlog, unlike in-memory queues that crash.
Monitor: always watch the consumer lag metric! 📊
Stateful Stream Processing
The most interesting stream processing is stateful: it has to remember previous events.
Stateless (simple): each event is handled on its own, e.g. filtering.
Stateful (complex): the output depends on earlier events, e.g. running counts or joins.
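The contrast can be sketched like this (the event shapes and the 1000-rupee threshold are illustrative):

```python
# Stateless: each event is judged on its own, e.g. filter big orders.
def is_big_order(event):
    return event["amount"] > 1000

# Stateful: the output depends on everything seen so far,
# e.g. a running count of events per user.
class RunningCount:
    def __init__(self):
        self.counts = {}  # state that must survive crashes in production

    def process(self, event):
        user = event["user"]
        self.counts[user] = self.counts.get(user, 0) + 1
        return self.counts[user]

rc = RunningCount()
events = [
    {"user": "a", "amount": 1500},
    {"user": "b", "amount": 200},
    {"user": "a", "amount": 700},
]
print([is_big_order(e) for e in events])  # [True, False, False]
print([rc.process(e) for e in events])    # [1, 1, 2]
```

The stateless function can be scaled by simply running more copies; the stateful one cannot, because `counts` has to be partitioned and persisted somewhere. That is the whole difficulty.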
State Examples:
- 📊 Running count/average
- 🔍 Deduplication (seen this event before?)
- 🔗 Stream-stream joins (match orders with payments)
- 🕐 Session tracking (user activity sessions)
State Storage Options:
| Option | Speed | Durability | Use Case |
|---|---|---|---|
| In-memory | ⚡ Fastest | ❌ Lost on crash | Ephemeral counters |
| RocksDB (embedded) | 🚀 Fast | ✅ Persistent | Flink default |
| External DB (Redis) | 🟡 Network latency | ✅ Shared | Multi-app state |
Challenge: if a node crashes, the state must not be lost! Checkpointing: periodically save a snapshot of the state. After a crash, resume from the last checkpoint.
Flink checkpoints automatically, one reason it's the best stream processor! 🏆
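The checkpoint-and-restore cycle can be sketched like this (a checkpoint every 2 events is an arbitrary interval chosen for the example; real systems checkpoint to durable storage, not a variable):

```python
import copy

# Checkpointing sketch: periodically snapshot the operator state so a
# crash can resume from the last checkpoint instead of starting over.
state = {"count": 0}
checkpoint = None

def process(event):
    state["count"] += event

for i, event in enumerate([1, 1, 1, 1, 1], start=1):
    process(event)
    if i % 2 == 0:  # checkpoint every 2 events
        checkpoint = copy.deepcopy(state)

# Simulated crash after event 5: restore the last checkpoint (taken
# after event 4), losing only the work done since then.
state = copy.deepcopy(checkpoint)
print(state)  # {'count': 4}
```

The events between the last checkpoint and the crash must then be replayed from the message bus, which is why Kafka's replayable log and Flink's checkpoints fit together so well.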
Real-Time Processing Tools
Choosing the right tool is important:
| Tool | Type | Latency | Strengths | Weakness |
|---|---|---|---|---|
| **Apache Kafka** | Message Bus | ms | Durability, scale | Not a processor |
| **Apache Flink** | Stream Processor | ms | Stateful, exactly-once | Complex setup |
| **Spark Streaming** | Micro-batch | seconds | Easy if you know Spark | Not true streaming |
| **Kafka Streams** | Stream Processor | ms | Simple, no cluster | JVM only |
| **Apache Pulsar** | Message Bus | ms | Multi-tenancy | Newer, less ecosystem |
| **AWS Kinesis** | Managed Stream | ms | No ops, AWS native | Vendor lock-in |
| **Google Dataflow** | Managed Processor | ms | Auto-scale, GCP | Vendor lock-in |
Recommendation by use case:
- Simple event processing → Kafka + Kafka Streams
- Complex stateful → Kafka + Flink
- Already using Spark → Spark Structured Streaming
- AWS shop → Kinesis + Lambda
- GCP shop → Pub/Sub + Dataflow
Real-World: Uber Real-Time System
Let's look at Uber's real-time data system 🚗:
Scale: 1 trillion+ messages/day through Kafka!
Use Cases:
1. 🗺️ Real-time ETA: Driver location + traffic + demand → ETA calculation (every 4 seconds update)
2. 💰 Dynamic Pricing: Supply/demand ratio real-time calculate → surge pricing
3. 🛡️ Fraud Detection: Trip patterns analyze → suspicious activity flag (< 100ms)
4. 📊 Driver Matching: Rider request → nearest available driver (< 1 second)
Tech Stack:
- Kafka — Central event bus
- Apache Flink — Stream processing (complex event processing)
- Apache Pinot — Real-time OLAP (dashboard queries)
- Custom ML — Demand forecasting, ETA prediction
Architecture flow: rider/driver apps → Kafka (event bus) → Flink (stream processing) → Pinot (dashboards) and ML models → pricing, matching, and alerting services
Key learning: Uber uses both batch and streaming. Not everything needs real-time; they choose wisely! 🧠
Testing Real-Time Systems
Testing real-time systems is much harder than testing batch:
1. Unit Tests 🧪
- Test individual processing functions
- Does the window logic work correctly?
- Any state management bugs?
2. Integration Tests 🔗
- End-to-end tests using embedded Kafka
- Testcontainers library: spin up Kafka and Flink in Docker
3. Load Tests 🏋️
- Simulate production traffic
- Test backpressure handling
- Scale test: what happens at 10x the normal load?
4. Chaos Tests 💥
- Kill a node: does recovery work?
- Network partition: any split brain?
- Clock skew: are out-of-order events handled?
5. Replay Tests ▶️
- Capture production events and replay them
- Does the new logic give correct results on old data?
Pro tip: first run in shadow mode in production: process real data but discard the results. Go live only after the issues are caught! 🎯
Prompt: Design Real-Time System
Real-Time Systems Best Practices
For production real-time systems, follow these:
✅ Question the need: do you really need real-time? Is near real-time enough?
✅ Design for failure: everything will fail; plan accordingly.
✅ Idempotent consumers: duplicate processing must be safe
✅ Monitor consumer lag: the most important metric in streaming
✅ Schema evolution: use Avro/Protobuf; avoid JSON at scale
✅ Dead letter queue: handle unprocessable messages separately
✅ Graceful degradation: degrade under overload, never crash
✅ Capacity planning: design for 3x the expected peak traffic
✅ Observability: metrics, logs, traces; all three pillars are needed
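The dead-letter-queue practice above can be sketched in a few lines (the message shapes and the validation rule are made up for illustration):

```python
dead_letters = []
processed = []

def process(msg):
    """Toy processing step: rejects malformed messages."""
    if not isinstance(msg.get("amount"), int):
        raise ValueError("bad amount")
    processed.append(msg["amount"])

# Messages that fail processing go to a dead letter queue for later
# inspection, instead of blocking or crashing the whole pipeline.
for msg in [{"amount": 10}, {"amount": "oops"}, {"amount": 20}]:
    try:
        process(msg)
    except ValueError:
        dead_letters.append(msg)

print(processed)     # [10, 20]
print(dead_letters)  # [{'amount': 'oops'}]
```

In a real pipeline the dead letters would be written to a separate topic with the error attached, so they can be fixed and replayed.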
Real-time systems are complex, but master them and you're in the top 10% of data engineers! 🏆
Next up: Kafka basics, the backbone of real-time! 🚀
✅ Key Takeaways
✅ Batch vs Streaming: batch = fixed data sets with minutes-to-hours latency; streaming = unbounded data with millisecond latency. Carefully assess the real-time requirement (for 90% of cases, batch is enough)
✅ Event-Driven Architecture: events (something happened at a timestamp). Stream processors consume them and react. Event sourcing: reconstruct state by replaying events
✅ Processing Guarantees: at-most-once (may lose), at-least-once (may duplicate), exactly-once (hard). Practical: at-least-once + idempotent consumers = effectively exactly-once
✅ Windowing Concepts: tumbling (fixed, non-overlapping), sliding (fixed, overlapping), session (activity gaps). Real-time aggregations are window-based
✅ Backpressure Handling: producer faster than consumer → data accumulates. Options: buffer, drop, sample, scale, rate-limit. Kafka's disk-based buffer is a big advantage
✅ State Management: stateful operations (counting, joining, sessions) need to maintain state. Checkpointing = periodic state saving. Node crash → recover from the last checkpoint
✅ Change Data Capture (CDC): capture database changes → stream. Debezium is the popular tool. Real-time search indexes, cache invalidation, analytics
✅ Tools Ecosystem: Kafka (message bus), Flink (stream processor), Spark Streaming (micro-batch), Kafka Streams (simple). Select by use-case fit
🏁 🎮 Mini Challenge
Challenge: Real-Time Stream Processing Hands-On
Practice Kafka + Flink-style processing:
1. Set up Kafka locally (5 min)
2. Producer: simulate events (10 min)
3. Consumer: process the stream (10 min)
4. Monitor (5 min):
- Check consumer lag
- Measure throughput
- Check processing latency
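If you don't have Kafka handy, the producer/consumer steps can be rehearsed in-process; here `queue.Queue` stands in for a single-partition topic, and the order-event shape is made up:

```python
import queue
import time

topic = queue.Queue()  # stand-in for a Kafka topic (assumes one partition)

def produce(n):
    """Simulate events: one order event per message."""
    for i in range(n):
        topic.put({"order_id": i, "amount": 100 + i, "ts": time.time()})

def consume():
    """Process the stream: count orders and sum revenue."""
    count, revenue = 0, 0
    while not topic.empty():
        event = topic.get()
        count += 1
        revenue += event["amount"]
    return count, revenue

produce(5)
count, revenue = consume()
print(count, revenue)  # 5 510
```

Once this logic works, swapping the `Queue` for a real Kafka producer/consumer client changes the plumbing but not the processing code, which is the point of the exercise.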
Learning: real-time itself is simple (just process events), but making it reliable in production is complex! 🚀
💼 Interview Questions
Q1: Real-time system design: what's the first decision?
A: "Do I REALLY need real-time?" For 90% of cases, near real-time (minutes) is enough. True real-time is expensive, complex, and hard. Understand the batch vs real-time trade-off. Consider cost implications, complexity, and team skills!
Q2: Stream processing: what's the state management challenge?
A: Stateful operations are hard! Example: "count transactions in the last hour" means remembering an hour of events. A crash loses that state! Solutions: checkpointing (save state regularly), RocksDB (embedded store), Redis (external store). Recovery: a replay mechanism (resume from the last checkpoint).
Q3: Is exactly-once processing practical?
A: Theoretically possible but expensive! Idempotent consumer + deduplication = effectively exactly-once. Most systems accept "at-least-once" + idempotency. Financial systems: exactly-once is critical (transactions must not be duplicated). Metrics/logs: at-most-once is okay.
Q4: Windowing: practical use cases?
A: "Count orders every 5 minutes" → tumbling. "Moving average over the last hour" → sliding. "Group events per user session" → session window. Real-time dashboards, anomaly detection, fraud detection all use windowing. Choose the window size carefully: too big (slow), too small (overhead)!
Q5: What's the strategy to handle 1M events/sec in a real-time system?
A: Partitioning (distribute load), parallelism (multiple consumers), batching (small micro-batches), compression (reduce network), monitoring (track consumer lag). Infrastructure: Kubernetes auto-scaling. Design: stateless preferred (easy to scale); stateful is hard (state sync is complex). Uber achieved this scale, but with 10+ years of experience! 💪
Frequently Asked Questions
What is backpressure?