Kafka basics
Introduction
Post office yosichchu paaru 📮 — oru area la letters collect panni, sort panni, different areas ku deliver pannuvaanga. Apache Kafka same concept — but for data! Digital post office 🏤
LinkedIn la create panni 2011 la open source pannanga — inniku ivanga per day 7 trillion messages process pannuvaanga! Now Uber, Netflix, Airbnb, 80%+ Fortune 500 companies Kafka use pannuvaanga.
Why Kafka is everywhere:
- ⚡ Handle millions of messages/second
- 💾 Messages disk la persist (days/weeks)
- 📈 Horizontally scalable — more brokers add pannunga
- 🔄 Multiple consumers same data read pannalaam
- 🛡️ Fault tolerant — broker crash aanalum data safe
Kafka is not just a tool — it's the central nervous system of modern data platforms! Let's understand how 🧠
Core Concepts
Kafka oda key building blocks:
1. Topic 📋
- Category or feed name — like a folder for messages
- Example: "orders", "payments", "user-events"
- Producers write to topics, consumers read from topics
2. Partition 🗂️
- Topic subdivide aagum into partitions
- Parallelism enable pannum — more partitions = more throughput
- Messages within partition ordered (guaranteed!)
- Across partitions? No ordering guarantee
3. Broker 🖥️
- Kafka server instance
- Cluster = multiple brokers (typically 3+)
- Each broker stores some partitions
4. Producer ✍️
- Application that writes messages to topics
- Decides which partition ku send pannanum
5. Consumer 👀
- Application that reads messages from topics
- Tracks its position (offset) in each partition
6. Consumer Group 👥
- Group of consumers that work together
- Each partition assigned to ONE consumer in the group
- Enables parallel processing!
| Concept | Analogy | Purpose |
|---|---|---|
| Topic | TV Channel | Categorize messages |
| Partition | Lanes in highway | Parallelism |
| Broker | Post office branch | Store & serve |
| Producer | News reporter | Send messages |
| Consumer | TV viewer | Read messages |
| Consumer Group | Family watching together | Load balance |
Kafka Architecture
```
                    KAFKA CLUSTER

 PRODUCERS        BROKERS               CONSUMERS

 ┌───────┐   ┌─────────────────┐    ┌─────────────┐
 │ App 1 │──▶│    Broker 1     │    │   Group A   │
 │ App 2 │──▶│  Topic: orders  │───▶│ ┌──┐  ┌──┐  │
 │ App 3 │──▶│    P0 │ P1      │    │ │C1│  │C2│  │
 └───────┘   ├─────────────────┤    │ └──┘  └──┘  │
             │    Broker 2     │    └─────────────┘
             │  Topic: orders  │    ┌─────────────┐
             │    P2 │ P3      │───▶│   Group B   │
             ├─────────────────┤    │ ┌──┐        │
             │    Broker 3     │    │ │C3│        │
             │   (replicas)    │    │ └──┘        │
             └─────────────────┘    └─────────────┘

          ┌────────────────────────────────┐
          │       ZooKeeper / KRaft        │
          │ (cluster metadata management)  │
          └────────────────────────────────┘
```
Partitions — The Secret to Scale
Partitions Kafka oda superpower! 💪
How partitioning works:
Partition Key decides which partition:
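Idea ah oru minimal Python sketch la paakkalaam — Kafka oda default partitioner actually murmur2 hash use pannum; inga `crc32` oru deterministic stand-in dhaan:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition from the message key.

    Kafka's default partitioner hashes the key (murmur2) and takes it
    modulo the partition count; crc32 here is a simplified stand-in.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Same key always lands on the same partition → per-key ordering.
p1 = partition_for("U123", 4)
p2 = partition_for("U123", 4)
print(p1 == p2)  # True
```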
Why partition key matters:
- Same key = same partition = ordering guaranteed for that key
- User U123 oda orders always order la varum ✅
- Different users parallel ah process aagum ⚡
How many partitions?
- Rule of thumb: target throughput / single consumer throughput
- 100 MB/s venum, consumer 10 MB/s handle pannum → 10 partitions minimum
- More partitions = more parallelism, but more overhead too
- Start with 6-12, increase as needed
Warning: Partition count increase pannalaam, but decrease panna mudiyaadhu! Plan carefully 🎯
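Rule of thumb ah oru chinna calculation la paakkalaam — `growth_buffer` nu oru illustrative parameter add pannirukken (interview section la solra 2–3x buffer idea dhaan):

```python
import math

def min_partitions(target_mb_s: float, per_consumer_mb_s: float,
                   growth_buffer: float = 2.0) -> int:
    """Rule of thumb: target throughput / single-consumer throughput,
    then add headroom for growth (partitions can't be decreased later!)."""
    base = math.ceil(target_mb_s / per_consumer_mb_s)
    return math.ceil(base * growth_buffer)

print(min_partitions(100, 10, growth_buffer=1.0))  # 10 — bare minimum
print(min_partitions(100, 10))                     # 20 — with 2x growth buffer
```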
Producer Configuration — Get It Right!
Producer configuration production la critical:
💡 acks setting — Most important config!
- acks=0 — Don't wait for broker acknowledgment (fastest, risky)
- acks=1 — Leader acknowledges (good balance)
- acks=all — All replicas acknowledge (safest, slower)
💡 Batching — Don't send one message at a time! `linger.ms` konjam wait panni messages ah batch ah anuppum
💡 Compression — `compression.type` (lz4, snappy, zstd, gzip) network and disk usage reduce pannum
💡 Idempotent Producer — `enable.idempotence=true` retry duplicates prevent pannum
💡 Retries — `retries` config network blips ku automatic ah re-send pannum
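Ellaa settings um serthu oru hedged config sketch — property names confluent-kafka (librdkafka) keys dhaan, but values illustrative starting points, workload ku thagundha maari tune pannunga:

```python
# A "safe" producer config sketch for confluent-kafka (librdkafka
# property names); values are illustrative, not one-size-fits-all.
safe_producer_config = {
    "bootstrap.servers": "localhost:9092",
    "acks": "all",                 # all in-sync replicas must confirm
    "enable.idempotence": True,    # no duplicates on retry
    "retries": 5,                  # survive transient network blips
    "linger.ms": 20,               # wait up to 20 ms to fill a batch
    "batch.size": 64 * 1024,       # batch up to 64 KB per partition
    "compression.type": "lz4",     # cut network and disk usage
}

# For metrics/logs where a little loss is acceptable, relax for speed:
fast_producer_config = {
    **safe_producer_config,
    "acks": "1",
    "enable.idempotence": False,
}
```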
Golden Rule: Financial data? acks=all + enable.idempotence=true. Metrics/logs? acks=1 + compression podhum! 🏦
Consumer Groups — Parallel Processing
Consumer groups Kafka la load balancing enable pannudhu:
Scenario: "orders" topic with 4 partitions
1 Consumer in Group: C1 ellaa 4 partitions um (P0–P3) read pannum
2 Consumers in Group: thalaa 2 partitions (e.g., C1 → P0,P1; C2 → P2,P3)
4 Consumers in Group: thalaa 1 partition — max parallelism! ⚡
5 Consumers in Group: 4 consumers active, 5th consumer idle — partition illa!
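Indha scenarios ah oru toy simulation la paakkalaam — idhu Kafka oda actual range/round-robin assignors illa, just a simplified stand-in for the idea:

```python
def assign_round_robin(partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Simulate how a group spreads partitions: each partition goes to
    exactly one consumer; consumers beyond the partition count sit idle.
    (Simplified stand-in for Kafka's built-in partition assignors.)"""
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

print(assign_round_robin(4, ["C1"]))                # {'C1': [0, 1, 2, 3]}
print(assign_round_robin(4, ["C1", "C2"]))          # {'C1': [0, 2], 'C2': [1, 3]}
print(assign_round_robin(4, ["C1", "C2", "C3", "C4", "C5"]))  # C5 gets [] — idle!
```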
Key Rules:
- Max useful consumers = partition count
- Multiple consumer groups = multiple independent readers
- Group A reads for analytics, Group B reads for ML — both get ALL messages
- Consumer crash aana? Rebalance — remaining consumers take over its partitions
Offset Management:
- Each consumer tracks offset (position in partition)
- Committed offset = "I've processed up to here"
- Crash aana, restart from last committed offset — no data loss! 🛡️
Kafka with Python — Hands On
Python la Kafka use pannalaam — confluent-kafka library:
Producer:
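Oru minimal producer sketch — "orders" topic um `KAFKA_BOOTSTRAP` environment-variable guard um illustrative choices; library install aagaama koodaa file load aaganum nu import ah deferred ah vechirukken:

```python
import json
import os

def delivery_report(err, msg):
    """Called once per message with the broker's verdict."""
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")

def run_producer(bootstrap: str) -> None:
    # Imported lazily so this sketch loads without the library installed.
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": bootstrap, "acks": "all"})
    order = {"order_id": "O-1001", "user_id": "U123", "amount": 499.0}
    producer.produce(
        topic="orders",
        key=order["user_id"],       # same user → same partition → ordered
        value=json.dumps(order).encode("utf-8"),
        on_delivery=delivery_report,
    )
    producer.flush()  # block until all queued messages are delivered

if os.environ.get("KAFKA_BOOTSTRAP"):  # run only against a real broker
    run_producer(os.environ["KAFKA_BOOTSTRAP"])
```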
Consumer:
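Matching consumer sketch — group id, topic name ellaame illustrative; manual commit use pannirukken, processing success aana apram dhaan "processed up to here" nu Kafka kitta sollum:

```python
import json
import os

def run_consumer(bootstrap: str) -> None:
    # Imported lazily so this sketch loads without the library installed.
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": bootstrap,
        "group.id": "order-processor",    # consumers sharing this id split partitions
        "auto.offset.reset": "earliest",  # new groups start from the beginning
        "enable.auto.commit": False,      # commit only after processing succeeds
    })
    consumer.subscribe(["orders"])
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None:
                continue
            if msg.error():
                print(f"Consumer error: {msg.error()}")
                continue
            order = json.loads(msg.value())
            print(f"Processing {order} from partition {msg.partition()}")
            consumer.commit(msg)          # "I've processed up to here"
    finally:
        consumer.close()

if os.environ.get("KAFKA_BOOTSTRAP"):  # run only against a real broker
    run_consumer(os.environ["KAFKA_BOOTSTRAP"])
```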
Simple ah start pannalaam! Docker la Kafka run panni practice pannunga 🐳
Replication — Data Safety
Broker crash aana data lose aagakoodadhu — replication saves us:
Replication Factor = 3 (standard production config)
How it works:
- Producer → Leader ku write
- Leader → Followers ku replicate
- Followers "in-sync" aana → ISR (In-Sync Replica) set la irukku
- Leader crash aana → ISR la irundhu new leader elect aagum ⚡
ISR (In-Sync Replicas):
- Follower leader la irundhu data catch up panniduchaa = "in sync"
- `min.insync.replicas=2` — at least 2 replicas in sync irukanum
- `acks=all` + `min.insync.replicas=2` = data loss almost impossible!
Scenario: Broker 1 crashes
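Indha failover ah oru toy simulation la paakkalaam — idhu real Kafka internals illa, just the decision rule (crashed broker ISR la irundhu drop, leader aana next in-sync replica promote):

```python
def elect_leader(leader: str, isr: list[str], crashed: str) -> tuple[str, list[str]]:
    """Toy model of ISR failover: drop the crashed broker from the ISR and,
    if it was the leader, promote the first remaining in-sync replica."""
    isr = [b for b in isr if b != crashed]
    if not isr:
        raise RuntimeError("No in-sync replica left — partition unavailable!")
    new_leader = leader if leader != crashed else isr[0]
    return new_leader, isr

# Partition "orders-0": replicas on brokers 1, 2, 3; broker 1 is leader.
leader, isr = elect_leader("broker-1", ["broker-1", "broker-2", "broker-3"],
                           crashed="broker-1")
print(leader, isr)  # broker-2 ['broker-2', 'broker-3']
```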
Production config recommendation:
| Setting | Value | Why |
|---|---|---|
| replication.factor | 3 | Standard safety |
| min.insync.replicas | 2 | Tolerate 1 broker down |
| acks | all | No data loss |
ZooKeeper → KRaft Migration
⚠️ ZooKeeper is being deprecated! Kafka 3.5+ la KRaft mode use pannunga.
ZooKeeper problems:
- Separate system maintain pannanum — operational overhead
- Scaling limitations with large clusters
- Metadata operations slow
KRaft (Kafka Raft) benefits:
- ✅ No separate ZooKeeper cluster needed
- ✅ Faster controller failover (seconds vs minutes)
- ✅ Simplified operations
- ✅ Better scalability (millions of partitions)
Migration path:
1. Kafka 3.5+ install pannunga
2. KRaft mode la new clusters start pannunga
3. Existing clusters gradually migrate pannunga
New projects ku: Always KRaft mode use pannunga! ZooKeeper dependency remove pannunga 🎯
Real-World Kafka Use Cases
Companies Kafka eppadi use pannuvaanga 🌍:
LinkedIn (Kafka creators):
- 7 trillion messages/day
- Activity tracking, metrics, logs
- 100+ Kafka clusters
Netflix:
- 1 trillion+ events/day
- Viewing activity → recommendation engine
- Real-time A/B testing results
Uber:
- Trip events, driver location, pricing
- Kafka → Flink → real-time features
- 1000+ microservices Kafka through communicate pannum
Spotify:
- User listening events → discover weekly
- 500 billion events/day
- Real-time music recommendations
Common patterns across companies:
1. 📊 Event sourcing — All state changes as events
2. 🔗 Microservice communication — Async message passing
3. 📈 Real-time analytics — Live dashboards
4. 🤖 ML feature pipelines — Fresh features for models
5. 📝 Audit logs — Compliance and debugging
Kafka Connect — Easy Integrations
Custom code ezhudhaama data move panna — Kafka Connect!
What is it?
- Pre-built connectors for databases, file systems, cloud services
- No coding needed — just configuration!
Source Connectors (Data → Kafka): MySQL/Postgres CDC (Debezium), JDBC databases, file systems
Sink Connectors (Kafka → Data): S3, Elasticsearch, BigQuery, JDBC databases
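For example, oru hedged Debezium MySQL source connector config — field names Debezium oda MySQL connector follow pannudhu, values placeholders dhaan (real setup ku schema history settings maari innum konjam properties venum):

```json
{
  "name": "mysql-orders-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "********",
    "topic.prefix": "shop",
    "table.include.list": "shop.orders"
  }
}
```

Indha JSON ah Kafka Connect oda REST API ku POST pannaa, connector MySQL changes ah Kafka topics ku stream panna start pannum.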
Popular Connectors:
| Connector | Direction | Use Case |
|---|---|---|
| Debezium MySQL | Source | CDC from MySQL |
| JDBC | Source/Sink | Any SQL database |
| S3 | Sink | Archive to S3 |
| Elasticsearch | Sink | Search index |
| BigQuery | Sink | Analytics warehouse |
Kafka Connect = No-code data integration! 🔌
Kafka Monitoring Essentials
Production Kafka cluster monitor pannanum — key metrics:
Broker Metrics:
- Under-replicated partitions — 0 irukanum, > 0 = problem! 🔴
- Active controller count — Exactly 1 irukanum
- Request latency — Produce/fetch request time
- Disk usage — retention policy based
Producer Metrics:
- Record send rate — Messages per second
- Record error rate — Failed sends
- Batch size avg — Batching efficiency
Consumer Metrics:
- Consumer lag — 🔴 MOST IMPORTANT! Messages behind = lag
- Fetch rate — Consumption speed
- Commit rate — Offset commit frequency
Consumer Lag Formula: lag = latest offset (log-end offset) − last committed consumer offset
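Lag formula ah per-partition ah compute panradhu oru chinna sketch — offsets inga illustrative numbers:

```python
def consumer_lag(log_end_offsets: dict[int, int],
                 committed: dict[int, int]) -> dict[int, int]:
    """Lag per partition = latest offset on the broker − last committed offset."""
    return {p: log_end_offsets[p] - committed.get(p, 0) for p in log_end_offsets}

latest = {0: 1500, 1: 1520, 2: 1490}  # newest offsets per partition (broker side)
done   = {0: 1500, 1: 1100, 2: 1490}  # what the consumer group has committed
print(consumer_lag(latest, done))     # {0: 0, 1: 420, 2: 0} — partition 1 is behind!
```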
Lag increasing = consumer slow or stuck! Immediate attention venum ⚠️
Tools:
- Kafka UI (open source) — cluster visualization
- Burrow — consumer lag monitoring
- Grafana + Prometheus — dashboards
- Confluent Control Center — commercial
Prompt: Local Kafka Setup
Kafka Best Practices
Production Kafka ku essential practices:
✅ Replication factor 3 — Minimum for production
✅ Use Avro/Protobuf — JSON is slow at scale, schema registry use pannunga
✅ Partition key wisely — Hot partitions avoid pannunga (don't use country as key!)
✅ Monitor consumer lag — Alerting set pannunga, lag > threshold = page
✅ Retention policy — Business need ku match pannunga (7 days common)
✅ Compaction — State topics ku log compaction enable pannunga
✅ Security — SASL/SSL enable pannunga, ACLs configure pannunga
✅ Capacity plan — Disk space = throughput × retention × replication factor
✅ Test failover — Monthly broker restart panni recovery test pannunga
Kafka master pannaa real-time data engineering la unstoppable aaiduveenga! 🚀
Next: Airflow orchestration — pipelines schedule and manage pannalaam! 📅
✅ Key Takeaways
✅ Kafka Fundamentals — Distributed event streaming platform. Topics (categories), Partitions (parallelism), Brokers (servers), Producers (write), Consumers (read)
✅ Topics & Partitions — Topics messages ah categorize pannum. Partitions parallel ah process aagum. Same partition messages ordered; across partitions ordering guarantee illa. Partition count throughput ah decide pannum
✅ Consumer Groups — Multiple consumers work together. Each partition max one consumer per group. Multiple groups independent read. Rebalancing automatic aagum
✅ Durability & Replication — Replication factor 3 standard production. Leader + followers in-sync. Leader crash → follower elected. Data loss protection
✅ Producer Configuration — acks=all (safest, slow), acks=1 (good balance), acks=0 (fastest, risky). Batch, compression, idempotence setup. Financial data acks=all
✅ Offset Management — Consumer position track. Commit offset "processed up to here". Crash → resume from last offset. No manual implementation, Kafka handles
✅ Kafka Streams — Lightweight processing library. Topology building, stateless + stateful operations. Small-scale real-time processing perfect
✅ Monitoring Essential — Consumer lag (critical metric), under-replicated partitions, controller status. Burrow tool consumer lag dedicated monitoring. Lag increasing → escalation
🏁 🎮 Mini Challenge
Challenge: Kafka Topic + Producer + Consumer
Complete Kafka workflow hands-on:
Step 1: Create Topic - 5 min:
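Step 1 ah Python la confluent-kafka oda AdminClient vechu pannalaam — `KAFKA_BOOTSTRAP` env var la local broker address irukkum nu assume panra sketch:

```python
import os

def create_orders_topic(bootstrap: str) -> None:
    # Imported lazily so this sketch loads without the library installed.
    from confluent_kafka.admin import AdminClient, NewTopic

    admin = AdminClient({"bootstrap.servers": bootstrap})
    # replication_factor=1 is fine for a local single-broker setup;
    # production would use 3 (see the replication section).
    topic = NewTopic("orders", num_partitions=3, replication_factor=1)
    futures = admin.create_topics([topic])
    for name, future in futures.items():
        future.result()  # raises if creation failed
        print(f"Created topic {name}")

if os.environ.get("KAFKA_BOOTSTRAP"):  # run only against a real broker
    create_orders_topic(os.environ["KAFKA_BOOTSTRAP"])
```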
Step 2: Producer - 10 min: sample order events "orders" topic ku produce pannunga (key = user_id)
Step 3: Consumer Group 1 - 5 min: oru group.id (e.g., 'order-processor-1') oda consume pannunga
Step 4: Consumer Group 2 - 5 min:
- Same code, group.id='order-processor-2' (different group)
- Both groups receive SAME messages independently!
Observe:
- Consumer lag (Kafka lag monitor)
- Same messages different consumers
- Partition assignment (3 partitions, load balanced)
Learning: Kafka flexibility – multiple consumers, persistence, replay capability! 📨
💼 Interview Questions
Q1: Kafka oda magic – traditional queue la irundhu difference?
A: Traditional queue: Message consume aana delete. Kafka: Message disk la persist (days/weeks). Multiple consumers same message read. Offset track pannu. Replay possible! Scalability, durability, flexibility – Kafka advantages! But complexity increase aagum.
Q2: Partitions – eppadi choose pannanum?
A: Throughput calculation first: Need 1M messages/sec, single partition 100K/sec – minimum 10 partitions. Then: Add 2-3x buffer for growth. Partition distribution balanced? Check leadership. Too many (overhead), too few (throughput bottleneck). Monitor and adjust!
Q3: Consumer lag – production alert setup?
A: CRITICAL metric! Lag increasing → consumers slow. Alert: lag > threshold (e.g., 1 hour). Burrow (monitoring tool) use pannu. Investigate: consumer slow, network issue, partition imbalanced? Fix → restart consumer, scale up, optimize code.
Q4: Producer reliability – acks configuration?
A: acks=0 (fastest, risky), acks=1 (leader confirms), acks=all (all replicas confirm = safest, slow). Financial data? acks=all mandatory! Metrics/logs? acks=1 ok. Trade-off: Speed vs safety. Batch size, compression, retries – all producer config tune pannu based on use case!
Q5: ZooKeeper → KRaft migration – business impact?
A: KRaft better (simpler operations). But migration complex – downtime risk. Strategy: New clusters KRaft mode, existing gradual migrate. Kafka 3.5+ already KRaft ready. Performance improvement, operational simplicity gain. Timeline: 2026-27 la industry mostly KRaft adopt pannuvaanga! 📊
Frequently Asked Questions
Consumer group la 3 consumers irukku, topic la 2 partitions irukku. Enna nadakkum?