Real-time data systems
Introduction
Nee Swiggy la food order pannureenga ๐. Order status real-time la update aagudhu:
- "Order confirmed" โ 2 seconds la
- "Restaurant accepted" โ instant
- "Delivery partner assigned" โ real-time
- Live location tracking โ every 3 seconds
Idhu ellam real-time data systems work aagura padi nadakkudhu! Batch processing la idhu possible illa โ 1 hour wait pannaa food cold aaidum ๐
Real-time systems modern tech la everywhere โ stock trading, fraud detection, gaming, social media feeds, IoT sensors. Let's deep dive pannalaam! ๐
Batch vs Streaming โ Key Differences
Batch Processing ๐ฆ
- Data collect pannunga โ wait โ process all at once
- Like washing clothes โ collect dirty clothes, wash together
- Latency: minutes to hours
- Example: Daily sales report
Stream Processing ๐
- Data varum bodhe process pannunga โ no waiting
- Like assembly line โ each item immediately handled
- Latency: milliseconds to seconds
- Example: Fraud detection
| Feature | Batch | Streaming |
|---|---|---|
| Latency | Minutes-Hours | Milliseconds-Seconds |
| Data | Bounded (fixed set) | Unbounded (infinite) |
| Processing | MapReduce, Spark | Kafka, Flink, Spark Streaming |
| Complexity | ๐ข Lower | ๐ด Higher |
| Cost | ๐ฐ Lower | ๐ฐ๐ฐ๐ฐ Higher |
| Accuracy | โ Complete data | โ ๏ธ Approximate (sometimes) |
| State | Stateless (mostly) | Stateful (tricky) |
| Recovery | Re-run the batch | Checkpoint + replay |
Key insight: "Do I really need real-time?" โ Idhu first question. 90% of use cases ku near real-time (few minutes delay) podhum! ๐ค
Stream Processing Architecture
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ REAL-TIME DATA SYSTEM ARCHITECTURE โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค โ โ โ PRODUCERS MESSAGE BUS CONSUMERS โ โ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโ โ โ Web App โโโโถโ โโโโถโ Analyticsโโ โ โ Mobile โโโโถโ Apache Kafka โโโโถโ ML Model โโ โ โ IoT โโโโถโ (or Pulsar) โโโโถโ Alerts โโ โ โ DB CDC โโโโถโ โโโโถโ Search โโ โ โโโโโโโโโโโโ โโโโโโโโโฌโโโโโโโโ โโโโโโโโโโโโโ โ โ โ โ โโโโโโโโโโโโผโโโโโโโโโโโ โ โ โ STREAM PROCESSOR โ โ โ โ โ โ โ โ Apache Flink โ โ โ โ Spark Streaming โ โ โ โ Kafka Streams โ โ โ โ โ โ โ โ โข Filter โ โ โ โ โข Transform โ โ โ โ โข Aggregate โ โ โ โ โข Window โ โ โ โ โข Join โ โ โ โโโโโโโโโโโโฌโโโโโโโโโโโ โ โ โ โ โ โโโโโโโโโโโโผโโโโโโโโโโโ โ โ โ SINKS โ โ โ โ DB / Dashboard / โ โ โ โ Alert / Data Lake โ โ โ โโโโโโโโโโโโโโโโโโโโโโโ โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Event-Driven Architecture
Real-time systems mostly event-driven:
Event = Something happened at a point in time
Event-Driven Patterns:
1. Event Notification ๐ข
- "Something happened!" โ lightweight, just notify
- Consumer decide pannum what to do
- Example: "New order placed" โ email service, inventory service both react
2. Event-Carried State Transfer ๐ฆ
- Event la full data irukku
- Consumer ku additional API call vendaam
- Example: Order event la items, prices, address โ ellam irukku
3. Event Sourcing ๐
- State ah store panna maateenga, events ah store pannuveeenga
- Current state = replay all events
- Example: Bank account โ deposits and withdrawals list = current balance
4. CQRS (Command Query Responsibility Segregation) ๐
- Write (commands) and Read (queries) separate models
- Write optimized for consistency, Read optimized for speed
Windowing โ Time-Based Aggregation
Streaming data la "total orders" count pannanum โ but data never stops! ๐ค Windowing solves this:
1. Tumbling Window โฌโฌโฌ
- Fixed size, no overlap
- Every 5 minutes, count orders
2. Sliding Window ๐
- Fixed size, overlapping
- Every minute, count orders in last 5 minutes
3. Session Window ๐ค
- Dynamic size, based on activity gaps
- User inactive for 30 min = window close
4. Global Window ๐
- All data in one window
- Useful with triggers (every 1000 events, fire)
Flink Example:
Windowing concept real-time analytics ku fundamental! ๐
Processing Guarantees โ Critical Concept
Real-time systems la message processing guarantees super important:
๐ก At-most-once โ Message lose aagalaam, but duplicate varaadhu
- Fire and forget
- Use case: Metrics, logs (miss pannaalum okay)
๐ก At-least-once โ Message definitely process aagum, but duplicate varalaam
- Retry on failure
- Use case: Most applications (with idempotent consumers)
๐ก Exactly-once โ Message oru time dhaan process aagum
- Hardest to achieve! Distributed system la very tricky
- Kafka Transactions + idempotent producers use pannunga
Practical tip: At-least-once + idempotent consumers = effectively exactly-once. Idhu most production systems use panra approach!
Idempotent consumer example:
Change Data Capture (CDC)
Database changes ah real-time la capture panradhu โ CDC:
Problem: MySQL la order insert aana, immediately Elasticsearch la searchable aaganum. How?
CDC Solution:
How CDC works:
- Database internally transaction log maintain pannum (binlog, WAL)
- CDC tool (Debezium) log ah read pannum
- Changes ah events ah convert pannum
- Kafka ku publish pannum
- Consumers react pannuvaanga
Debezium Event:
op: c=create, u=update, d=delete, r=read(snapshot)
CDC use cases:
- ๐ Real-time search index update
- ๐ Real-time analytics dashboard
- ๐ Database replication
- ๐ฆ Cache invalidation
- ๐ค ML feature store update
Backpressure โ Handle or Die!
โ ๏ธ Backpressure = Producer is faster than consumer. Data pile up aagudhu!
Imagine: 10,000 events/second produce aagudhu, but consumer 5,000/second dhaan handle pannum. 5,000 events/second accumulate aagum! ๐ฅ
What happens without handling:
- Memory overflow โ crash
- Increasing latency โ minutes behind
- Data loss โ worst case
Solutions:
๐ต Buffer โ Queue la store pannunga (Kafka does this well โ disk-based buffer)
๐ต Drop โ Less important messages skip pannunga (metrics ok, transactions NOT ok)
๐ต Sample โ Every nth message process pannunga
๐ต Scale โ More consumers add pannunga (Kafka consumer groups)
๐ต Rate limit โ Producer slow down pannunga
Kafka advantage: Disk-based storage โ consumers slow aanalum data safe. Days worth of backlog handle pannum! Unlike in-memory queues that crash.
Monitor: Consumer lag metric always watch pannunga! ๐
Stateful Stream Processing
Most interesting stream processing stateful โ previous events remember pannanum:
Stateless (simple):
Stateful (complex):
State Examples:
- ๐ Running count/average
- ๐ Deduplication (seen this event before?)
- ๐ Stream-stream joins (match orders with payments)
- ๐ Session tracking (user activity sessions)
State Storage Options:
| Option | Speed | Durability | Use Case |
|---|---|---|---|
| In-memory | โก Fastest | โ Lost on crash | Ephemeral counters |
| RocksDB (embedded) | ๐ Fast | โ Persistent | Flink default |
| External DB (Redis) | ๐ก Network latency | โ Shared | Multi-app state |
Challenge: Node crash aana state lose aagakoodadhu! Checkpointing โ periodically state snapshot save pannunga. Crash aana last checkpoint la irundhu resume pannunga.
Flink automatically checkpoint pannum โ oru reason it's the best stream processor! ๐
Real-Time Processing Tools
Right tool choose panradhu important:
| Tool | Type | Latency | Strengths | Weakness |
|---|---|---|---|---|
| **Apache Kafka** | Message Bus | ms | Durability, scale | Not a processor |
| **Apache Flink** | Stream Processor | ms | Stateful, exactly-once | Complex setup |
| **Spark Streaming** | Micro-batch | seconds | Easy if you know Spark | Not true streaming |
| **Kafka Streams** | Stream Processor | ms | Simple, no cluster | JVM only |
| **Apache Pulsar** | Message Bus | ms | Multi-tenancy | Newer, less ecosystem |
| **AWS Kinesis** | Managed Stream | ms | No ops, AWS native | Vendor lock-in |
| **Google Dataflow** | Managed Processor | ms | Auto-scale, GCP | Vendor lock-in |
Recommendation by use case:
- Simple event processing โ Kafka + Kafka Streams
- Complex stateful โ Kafka + Flink
- Already using Spark โ Spark Structured Streaming
- AWS shop โ Kinesis + Lambda
- GCP shop โ Pub/Sub + Dataflow
Real-World: Uber Real-Time System
Uber oda real-time data system paapom ๐:
Scale: 1 trillion+ messages/day through Kafka!
Use Cases:
1. ๐บ๏ธ Real-time ETA: Driver location + traffic + demand โ ETA calculation (every 4 seconds update)
2. ๐ฐ Dynamic Pricing: Supply/demand ratio real-time calculate โ surge pricing
3. ๐ก๏ธ Fraud Detection: Trip patterns analyze โ suspicious activity flag (< 100ms)
4. ๐ Driver Matching: Rider request โ nearest available driver (< 1 second)
Tech Stack:
- Kafka โ Central event bus
- Apache Flink โ Stream processing (complex event processing)
- Apache Pinot โ Real-time OLAP (dashboard queries)
- Custom ML โ Demand forecasting, ETA prediction
Architecture Flow:
Key Learning: Uber batch + streaming both use pannum. Not everything needs real-time โ they choose wisely! ๐ง
Testing Real-Time Systems
Real-time systems test panradhu batch compare la much harder:
1. Unit Tests ๐งช
- Individual processing functions test pannunga
- Window logic correct ah work aagudha?
- State management bugs irukka?
2. Integration Tests ๐
- Embedded Kafka use panni end-to-end test
- Testcontainers library โ Docker la Kafka, Flink spin up pannunga
3. Load Tests ๐๏ธ
- Production traffic simulate pannunga
- Backpressure handling test pannunga
- Scale test โ 10x normal load la enna nadakkum?
4. Chaos Tests ๐ฅ
- Node kill pannunga โ recovery work aagudha?
- Network partition โ split brain irukka?
- Clock skew โ out-of-order events handle aagudha?
5. Replay Tests โถ๏ธ
- Production events capture panni replay pannunga
- New logic old data la correct result kudukkudha?
Pro tip: Production la shadow mode run pannunga first โ real data process pannunga but results discard pannunga. Issues catch aana apram dhaan live pannunga! ๐ฏ
Prompt: Design Real-Time System
Real-Time Systems Best Practices
Production real-time systems ku follow pannunga:
โ Question the need โ Really real-time venum ah? Near real-time podhum ah?
โ Design for failure โ Everything will fail. Plan accordingly.
โ Idempotent consumers โ Duplicate processing safe ah irukanum
โ Monitor consumer lag โ Most important metric in streaming
โ Schema evolution โ Avro/Protobuf use pannunga, JSON avoid pannunga at scale
โ Dead letter queue โ Process panna mudiyaadha messages separate ah handle pannunga
โ Graceful degradation โ Overload la degrade pannunga, crash aagakoodadhu
โ Capacity planning โ 3x expected peak traffic ku design pannunga
โ Observability โ Metrics, logs, traces โ three pillars irukanum
Real-time systems complex dhaan, but master pannaa you're in the top 10% of data engineers! ๐
Next up: Kafka basics โ real-time oda backbone! ๐
โ Key Takeaways
โ Batch vs Streaming โ Batch: fixed data sets, minutes-hours latency. Streaming: unbounded data, milliseconds latency. Real-time requirement carefully assess pannunga (90% case batch enough)
โ Event-Driven Architecture โ Events (something happened at a timestamp). Stream processing consume panni react pannunga. Event sourcing: state replay events irundhu reconstruct
โ Processing Guarantees โ At-most-once (may lose), At-least-once (may duplicate), Exactly-once (hard). Practical: at-least-once + idempotent consumers = effectively exactly-once
โ Windowing Concepts โ Tumbling (fixed, non-overlapping), Sliding (fixed, overlapping), Session (activity gaps). Real-time aggregations window-based
โ Backpressure Handling โ Producer faster than consumer โ data accumulate. Buffer, drop, sample, scale, rate-limit options. Kafka disk-based buffer advantage
โ State Management โ Stateful operations (counting, joining, sessions) maintain state venum. Checkpointing periodic saving. Node crash โ recover last checkpoint
โ Change Data Capture (CDC) โ Database changes capture โ stream. Debezium popular tool. Real-time search index, cache invalidation, analytics
โ Tools Ecosystem โ Kafka (message bus), Flink (stream processor), Spark Streaming (micro-batch), Kafka Streams (simple). Use case match select pannunga
๐ ๐ฎ Mini Challenge
Challenge: Real-Time Stream Processing Hands-On
Kafka + Flink-like processing practice pannu:
Setup Kafka Locally - 5 min:
Producer (Simulate Events) - 10 min:
Consumer (Process Stream) - 10 min:
Monitor - 5 min:
- Consumer lag check
- Throughput measure
- Processing latency check
Learning: Real-time na simple (event process pannu), but reliable production na complex! ๐
๐ผ Interview Questions
Q1: Real-time system design โ first decision enna?
A: "Do I REALLY need real-time?" 90% cases near real-time (minutes) podhum. True real-time expensive, complex, hard. Batch vs real-time trade-off understand pannu. Cost implications, complexity, team skills โ ellam consider pannu!
Q2: Stream processing โ state management challenge?
A: Stateful operations hard! Example: "Count transactions last hour" โ need to remember 1 hour events. Crash aana state lose aaidum! Solution: Checkpointing (save state regularly), RocksDB (embedded store), Redis (external store). Prevention: Replay mechanism (recover from last checkpoint).
Q3: Exactly-once processing โ practical?
A: Theoretically possible but expensive! Idempotent consumer + deduplication = effectively exactly-once. Most systems "at-least-once" + idempotency accept pannum. Financial systems: Exactly-once critical (transactions duplicate aagakoodadhu). Metrics/logs: At-most-once ok.
Q4: Windowing โ practical use case?
A: "Count orders every 5 minutes" โ Tumbling. "Moving average last hour" โ Sliding. "Group events per user session" โ Session window. Real-time dashboards, anomaly detection, fraud detection โ ellaam windowing use pannum. Window size choose pannu carefully โ too big (slow), too small (overhead)!
Q5: Real-time system scale โ 1M events/sec handle panna strategy?
A: Partitioning (distribute load), parallelism (multiple consumers), batching (small micro-batches), compression (reduce network), monitoring (consumer lag track). Infrastructure: Kubernetes auto-scaling. Design: Stateless preferred (scale easy), stateful hard (state sync complex). Uber scale ha achieved, but 10+ year experience! ๐ช
Frequently Asked Questions
Backpressure na enna?