โ† Back|DATA-ENGINEERINGโ€บSection 1/17
0 of 17 completed

Data cleaning basics

Intermediateโฑ 14 min read๐Ÿ“… Updated: 2026-02-17

Introduction

"Garbage in, garbage out" — the golden rule of the data world! 🗑️


You can build the world's best ML model. But feed it dirty data? The output is still garbage! Data cleaning is the most important and most underrated skill in data engineering.


What real-world data actually looks like:

  • Names: "john", "JOHN", "John ", " john " — same person, 4 different formats 😅
  • Dates: "01/02/2025" — January 2nd or February 1st? 🤷
  • Phone: "+91-9876543210", "09876543210", "9876543210" — same number!
  • Age: 25, -5, 999, NULL — what's going on? 🤔

Fixing all of that is data cleaning! Let's learn it step by step 🚀
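A tiny pandas sketch of fixing the name and phone messes above (the column names and the keep-the-last-10-digits rule are illustrative assumptions):

```python
import pandas as pd

# Hypothetical messy records mirroring the examples above
df = pd.DataFrame({
    "name": ["john", "JOHN", "John ", " john "],
    "phone": ["+91-9876543210", "09876543210", "9876543210", "9876543210"],
})

# Names: trim whitespace, normalize case
df["name"] = df["name"].str.strip().str.title()

# Phones: keep digits only, then take the last 10 digits (assumes Indian mobiles)
df["phone"] = df["phone"].str.replace(r"[^0-9]", "", regex=True).str[-10:]

print(df["name"].nunique(), df["phone"].nunique())  # both collapse to 1
```

Four differently formatted rows collapse to one person and one number — exactly the effect standardization is after.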

Types of Dirty Data

Dirty data comes in many forms:


| Problem | Example | Impact |
|---|---|---|
| **Missing Values** | NULL, empty, "N/A" | Model bias, errors |
| **Duplicates** | Same record 2 times | Inflated metrics |
| **Inconsistent Format** | "Chennai" vs "CHENNAI" | Wrong grouping |
| **Invalid Values** | Age = -5, Email = "abc" | Wrong predictions |
| **Outliers** | Salary = 99999999 | Skewed statistics |
| **Wrong Data Types** | "123" as string, not number | Calculation errors |
| **Encoding Issues** | "cafÃ©" instead of "café" | Display problems |
| **Stale Data** | Address from 10 years ago | Wrong decisions |

80% of dirty data falls into 3 categories: missing values, duplicates, and inconsistent formats. Fix these three and your data quality jumps dramatically! 📈

Handling Missing Values

Missing values — the most common data problem! There are several strategies:


1. Detection 🔍

python
import pandas as pd

# Check missing values
df.isnull().sum()
df.isnull().mean() * 100  # Percentage

# Sneaky missing values
df.replace(['N/A', 'NA', '', 'null', 'None', '-'], pd.NA, inplace=True)

2. Deletion ❌

python
# Drop rows — when a critical column is missing (< 5% of rows)
df = df.dropna(subset=['critical_column'])

# Drop columns with > 50% missing
threshold = int(len(df) * 0.5)  # thresh = minimum non-null values a column needs to survive
df = df.dropna(axis=1, thresh=threshold)

3. Imputation 🔧

python
# Numeric — mean/median
df['age'] = df['age'].fillna(df['age'].median())

# Categorical — mode
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Forward fill — time series (fillna(method='ffill') is deprecated)
df['price'] = df['price'].ffill()

4. Smart Imputation 🧠

python
from sklearn.impute import KNNImputer

# Works on numeric columns only; fit_transform returns a NumPy array
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Delete rows | < 5% missing | Simple | Data loss |
| Mean/Median | Numeric, random missing | Fast | Reduces variance |
| Mode | Categorical | Easy | May add bias |
| KNN Imputer | Complex patterns | Accurate | Slow on big data |
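The strategy table can be folded into a rough rule of thumb. A hedged sketch — the thresholds are the guidelines from this section, not hard rules:

```python
def suggest_missing_strategy(missing_pct: float, is_numeric: bool) -> str:
    """Rough guideline for handling a column's missing values."""
    if missing_pct < 5:
        return "drop rows"
    if missing_pct > 50:
        return "drop column (or investigate why)"
    # In between: impute based on the column's type
    return "impute with median" if is_numeric else "impute with mode"

print(suggest_missing_strategy(2, True))    # drop rows
print(suggest_missing_strategy(20, True))   # impute with median
print(suggest_missing_strategy(70, False))  # drop column (or investigate why)
```

Treat this as a starting point; domain knowledge always overrides the thresholds.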

Removing Duplicates

Duplicate records inflate your metrics — dangerous!


Exact Duplicates:

python
# Find duplicates
df.duplicated().sum()

# Remove duplicates
df.drop_duplicates(inplace=True)

# Keep the last occurrence per email (assign the result back!)
df = df.drop_duplicates(subset=['email'], keep='last')

Fuzzy Duplicates — the tricky ones! 🕵️

python
# "John Smith" vs "john smith" vs "JOHN SMITH"
from fuzzywuzzy import fuzz  # newer releases are published as "thefuzz"

score = fuzz.ratio("John Smith", "Jon Smith")  # high score (0-100) — likely the same person
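If you'd rather avoid an extra dependency, the same idea works with the standard library's difflib. A sketch of threshold-based fuzzy dedup — the 0.9 cutoff is an assumption you'd tune for your data:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = ["John Smith", "john smith", "JOHN SMITH", "Jane Doe"]
seen, unique = [], []
for n in names:
    # Keep a name only if it isn't near-identical to one already kept
    if not any(similarity(n, s) > 0.9 for s in seen):
        seen.append(n)
        unique.append(n)

print(unique)  # ['John Smith', 'Jane Doe']
```

Pairwise comparison is O(n²), so for large tables you'd block on a cheap key (e.g. first letter) before fuzzy-matching within blocks.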

Duplicate handling in SQL:

sql
-- Find duplicates
SELECT email, COUNT(*) as cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Keep only the latest record (PostgreSQL DELETE ... USING syntax)
DELETE FROM users a
USING users b
WHERE a.id < b.id
AND a.email = b.email;

Best Practice: Remove duplicates BEFORE any analysis. Otherwise one customer gets counted 3 times — and your revenue reports come out wrong! 💰

Format Standardization

Same data, different formats — time to clean it up:


Text Cleaning:

python
# Case normalization
df['name'] = df['name'].str.strip().str.title()
# "  john DOE  " → "John Doe"

# Phone number standardization
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
# "+91-9876-543210" → "919876543210"

# Email cleanup
df['email'] = df['email'].str.strip().str.lower()

Date Standardization:

python
# Multiple formats → single format (format='mixed' requires pandas >= 2.0)
df['date'] = pd.to_datetime(df['date'], format='mixed')
# "01/02/2025", "2025-01-02", "Jan 2, 2025" → uniform datetime

Address Standardization:

python
# Common abbreviations
replacements = {
    'St.': 'Street', 'St ': 'Street ',
    'Ave.': 'Avenue', 'Rd.': 'Road',
    'Blvd.': 'Boulevard'
}
for old, new in replacements.items():
    # regex=False: treat 'St.' literally (the '.' is not a wildcard)
    df['address'] = df['address'].str.replace(old, new, regex=False)

After standardization, "Chennai", "chennai", and "CHENNAI" all fall into one group — your analytics come out right! ✅

Outlier Detection — A Smart Approach

💡 Tip

When handling outliers, think before you delete! 🤔

💡 IQR Method — the most reliable:

python
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]

💡 Z-Score — for normally distributed data:

python
from scipy import stats
z_scores = stats.zscore(df['salary'])
outliers = df[abs(z_scores) > 3]

💡 Domain Knowledge — most important! Age = 150 is an error. But a salary of ₹50 crore might be real (a CEO's salary).

💡 Don't auto-delete — flag as an outlier, investigate, then decide. Sometimes outliers ARE the interesting data!

💡 Cap instead of remove — Winsorization: set extreme values to a threshold instead of dropping them.
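Winsorization is one line with pandas' clip. A sketch — the 5th/95th-percentile caps are illustrative choices, not a standard:

```python
import pandas as pd

salaries = pd.Series([30_000, 45_000, 50_000, 55_000, 60_000, 99_999_999])

# Cap at the 5th/95th percentiles instead of deleting the extreme row
lower, upper = salaries.quantile(0.05), salaries.quantile(0.95)
capped = salaries.clip(lower=lower, upper=upper)

print(capped.max() < 99_999_999)     # the extreme value is capped, not lost
print(len(capped) == len(salaries))  # no rows deleted
```

The row count is preserved, so downstream joins and counts stay intact while the statistics stop being dominated by one value.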

Data Validation Rules

Define validation rules to keep data clean:


Common Validations:

python
def validate_data(df):
    rules = {
        'age': lambda x: (x >= 0) & (x <= 120),
        'email': lambda x: x.str.contains('@', na=False),  # na=False: missing emails count as invalid
        'phone': lambda x: x.str.len() == 10,
        'amount': lambda x: x > 0,
        'date': lambda x: x <= pd.Timestamp.now(),
    }

    for col, rule in rules.items():
        invalid = ~rule(df[col])
        print(f"{col}: {invalid.sum()} invalid records")
        df.loc[invalid, col] = pd.NA  # Mark as missing

    return df

Great Expectations Library:

python
import great_expectations as gx

# Assumes `validator` was obtained from a gx data context (setup omitted here)
validator.expect_column_values_to_not_be_null('user_id')
validator.expect_column_values_to_be_between('age', 0, 120)
validator.expect_column_values_to_be_unique('email')
validator.expect_column_values_to_match_regex('phone', r'^\d{10}$')

Run validation rules automatically — on every pipeline run! Manual checking doesn't scale 🚫

Data Cleaning Pipeline

๐Ÿ—๏ธ Architecture Diagram
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚          DATA CLEANING PIPELINE                   โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                   โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚ RAW DATA โ”‚โ”€โ”€โ–ถโ”‚  PROFILE  โ”‚โ”€โ”€โ–ถโ”‚  CLEAN     โ”‚  โ”‚
โ”‚  โ”‚          โ”‚   โ”‚           โ”‚   โ”‚            โ”‚  โ”‚
โ”‚  โ”‚ CSV/DB   โ”‚   โ”‚ Stats     โ”‚   โ”‚ Missing    โ”‚  โ”‚
โ”‚  โ”‚ API      โ”‚   โ”‚ Types     โ”‚   โ”‚ Duplicates โ”‚  โ”‚
โ”‚  โ”‚ Files    โ”‚   โ”‚ Nulls %   โ”‚   โ”‚ Formats    โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚                                       โ”‚          โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚  OUTPUT  โ”‚โ—€โ”€โ”€โ”‚ VALIDATE  โ”‚โ—€โ”€โ”€โ”‚ TRANSFORM  โ”‚  โ”‚
โ”‚  โ”‚          โ”‚   โ”‚           โ”‚   โ”‚            โ”‚  โ”‚
โ”‚  โ”‚ Clean DB โ”‚   โ”‚ Rules     โ”‚   โ”‚ Outliers   โ”‚  โ”‚
โ”‚  โ”‚ Parquet  โ”‚   โ”‚ Tests     โ”‚   โ”‚ Types      โ”‚  โ”‚
โ”‚  โ”‚ Report   โ”‚   โ”‚ Quality % โ”‚   โ”‚ Encode     โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
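The pipeline stages above can be sketched as one function. Everything concrete here (the `id`/`name` columns, the uniqueness rule) is an illustrative assumption:

```python
import pandas as pd

def clean_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """PROFILE -> CLEAN -> TRANSFORM -> VALIDATE in one pass."""
    report = {"rows_in": len(df)}                    # PROFILE: record input size
    df = df.drop_duplicates()                        # CLEAN: exact duplicates
    df = df.dropna(subset=["id"])                    # CLEAN: rows missing the key
    df["name"] = df["name"].str.strip().str.title()  # TRANSFORM: standardize format
    assert df["id"].is_unique                        # VALIDATE: a basic rule
    report["rows_out"] = len(df)
    print(report)
    return df

raw = pd.DataFrame({
    "id": [1, 2, 2, None],
    "name": ["  john ", "MARY", "MARY", "bob"],
})
clean = clean_pipeline(raw)
print(clean["name"].tolist())  # ['John', 'Mary']
```

A real pipeline would log the report instead of printing it and fail loudly (or quarantine rows) when validation breaks, but the stage order is the point.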

Data Profiling — First Step Always!

Before you start cleaning, profile the data — understand it first:


python
# Quick profiling
print(df.shape)           # Rows, columns
print(df.dtypes)          # Data types
print(df.describe())      # Stats
print(df.isnull().sum())  # Missing counts
print(df.nunique())       # Unique values per column

# Detailed profiling
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Data Quality Report")
profile.to_file("report.html")

Profile Report shows:

  • Distribution of each column 📊
  • Correlation between columns 🔗
  • Missing value patterns 🕳️
  • Duplicate rows count 🔄
  • Data type warnings ⚠️

5 minutes of profiling can save 5 hours of debugging! Always profile first, then clean 🎯

Encoding & Type Issues

โš ๏ธ Warning

โš ๏ธ Silent killers โ€” encoding and type issues!

🔴 Character Encoding:

- "cafÃ©" instead of "café" — a UTF-8 vs Latin-1 mismatch

- Fix: df = pd.read_csv('file.csv', encoding='utf-8')

- Still broken? Try: encoding='latin-1' or encoding='cp1252'
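The try-this-encoding-then-that advice can be automated with the standard library alone. A sketch — the encoding order is a guess you should adjust for your data's origin:

```python
import os
import tempfile

def read_text(path: str) -> str:
    """Try likely encodings until one decodes cleanly."""
    for enc in ("utf-8", "cp1252", "latin-1"):
        try:
            with open(path, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")

# Demo: a file saved as cp1252 fails a utf-8 decode, so the fallback kicks in
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write("café".encode("cp1252"))
print(read_text(path))  # café
```

Caveat: latin-1 never raises, so it silently "succeeds" on anything — put it last, and spot-check the decoded text for mojibake.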

🔴 Numeric as String:

- "1,234.56" — the comma makes pandas treat it as a string

- Fix: df['amount'] = df['amount'].str.replace(',', '').astype(float)

🔴 Boolean Confusion:

- "Yes", "Y", "1", "TRUE", "true" — all mean True

- Fix: Map them to a standard boolean
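A minimal sketch of that mapping — the truthy/falsy sets are assumptions you'd extend for your data:

```python
import pandas as pd

TRUTHY = {"yes", "y", "1", "true", "t"}
FALSY = {"no", "n", "0", "false", "f"}

def to_bool(value):
    v = str(value).strip().lower()
    if v in TRUTHY:
        return True
    if v in FALSY:
        return False
    return pd.NA  # unknown value -> missing, don't guess

raw = pd.Series(["Yes", "Y", "1", "TRUE", "true", "No", "n", "0", "FALSE"])
flags = raw.map(to_bool)
print(flags.tolist())  # [True, True, True, True, True, False, False, False, False]
```

Mapping unknowns to missing (instead of guessing) keeps silent data corruption out of your boolean column.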

🔴 Date as String:

- If "2025-01-15" is stored as a string, date operations won't work

- Fix: use pd.to_datetime()

🔴 Mixed Types in Column:

- Some rows integer, some string — pandas silently converts the column to object

- Fix: pd.to_numeric(df['col'], errors='coerce') — invalid values become NaN
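A quick demonstration of the coerce fix on a mixed-type column:

```python
import pandas as pd

# Mixed ints, numeric strings, junk, and missing values -> dtype is object
mixed = pd.Series([10, "20", "abc", None, "30.5"])
nums = pd.to_numeric(mixed, errors="coerce")

print(nums.dtype)         # float64
print(nums.isna().sum())  # 2  ("abc" and None become NaN)
```

After coercion the column is a real numeric dtype, so aggregations and comparisons behave; the NaNs then feed straight into your missing-value strategy.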

Real-World: Hospital Data Cleaning

✅ Example

An example of cleaning hospital patient data 🏥:

Raw Data Problems:

- Patient name: "KUMAR, R" vs "R. Kumar" vs "Kumar R"

- Blood group: "O+", "O positive", "O Positive", "O+"

- Date of birth: "15/03/1990", "1990-03-15", "March 15, 1990"

- Phone: Some with +91, some without, some with spaces

- Diagnosis: Free text with typos — "Diabeties", "Diabtes", "Diabetes"

Cleaning Steps:

1. 🔤 Names: Parse into first_name, last_name, standardize case

2. 🩸 Blood group: Map all variations to standard codes (O+, A-, etc.)

3. 📅 DOB: pd.to_datetime() with format='mixed'

4. 📞 Phone: Regex — keep only digits, ensure 10-digit format

5. 🏥 Diagnosis: Fuzzy matching to standard ICD codes

Result: 50,000 records cleaned, 3,000 duplicates removed, 2% invalid records flagged for manual review.

Impact: Correct patient matching improved from 78% to 97%! 🎯
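The blood-group step (step 2) could be sketched as a normalizer like this — the variant spellings are hypothetical examples, not the hospital's actual data:

```python
import pandas as pd

# Hypothetical variants seen in a raw hospital export
blood = pd.Series(["O+", "O positive", "O Positive", "A-", "a negative"])

def standardize_blood_group(value: str) -> str:
    v = value.strip().lower()
    v = v.replace(" positive", "+").replace(" negative", "-")
    return v.upper()

print(blood.map(standardize_blood_group).unique().tolist())  # ['O+', 'A-']
```

Five raw spellings collapse to two standard codes; in practice you'd also validate the result against the eight legal ABO/Rh codes and flag anything else for review.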

Prompt: Automate Data Cleaning

📋 Copy-Paste Prompt
You are a data engineer. Write a Python data cleaning pipeline for this CSV:

Columns: name, email, phone, age, salary, city, join_date

Known issues:
1. Names have inconsistent capitalization and extra spaces
2. Some emails are invalid (no @ symbol)
3. Phone numbers in different formats (+91, with dashes, with spaces)
4. Age has negative values and values > 150
5. Salary has string values like "50K" and "1,20,000"
6. City names have typos ("Chenai" for "Chennai")
7. Dates in multiple formats

Write a complete cleaning function with logging. Show before/after statistics. Use pandas. Explain each step in Tanglish.

Data Cleaning Best Practices

Want to clean data like a pro? Follow these:


✅ Always profile first — don't start cleaning before you understand the data

✅ Document everything — what you cleaned, why, and how, for future reference

✅ Keep original data — don't modify the raw data; work on a copy

✅ Automate repeatable tasks — if the same cleaning runs weekly, write a script

✅ Validate after cleaning — verify the cleaning actually did what you intended

✅ Handle edge cases — empty strings, whitespace-only values, special characters

✅ Test with samples — test on a sample before running on the full dataset

✅ Version your cleaning code — maintain it in Git


Remember: Clean data = trustworthy insights = better decisions = happy stakeholders! 🏆


Next up: Real-time data systems — how to handle live data! ⚡

✅ Key Takeaways

✅ Garbage In, Garbage Out — data quality determines model quality. Even a perfect model produces garbage output from dirty data


✅ Dirty Data Types — missing values, duplicates, format inconsistencies, invalid values, outliers, encoding issues, stale data. 80% of issues come from just three: missing values, duplicates, and formats


✅ Missing Values — under 5%: deleting rows is fine. 5-30%: impute (mean, median, mode, KNN). Over 50%: drop the column or investigate. Domain expertise matters


✅ Duplicates Are Dangerous — exact duplicates are easy to remove. Fuzzy duplicates are tricky (fuzzywuzzy library). Duplicate rows inflate metrics and skew your analysis


✅ Standardization Is Critical — make the formats of names, dates, phone numbers, and addresses consistent. "Chennai", "CHENNAI", "Chenai" must group correctly


✅ Understand Outliers — don't remove them blindly; understand them first. Keep real outliers (a billionaire's income). Fix errors (age 999)


✅ Data Profiling First — profile before you start cleaning. Understand describe, dtypes, nulls, nunique, and distributions


✅ Automation Recommended — manual cleaning is tedious and error-prone. Automate tests with the Great Expectations library. Production pipelines need built-in validation rules

🎮 Mini Challenge

Challenge: Clean a Real-World Messy Dataset


A hands-on cleanup of realistic dirty data:


Create Messy Dataset - 5 min:

python
import pandas as pd

# Intentionally create dirty data
df = pd.DataFrame({
    'name': ['john', '  JOHN  ', 'John Doe', 'john', None],
    'email': ['john@gmail.com', 'john@', 'john', 'JOHN@GMAIL.COM', 'invalid'],
    'phone': ['9876543210', '98-7654-3210', '+91 9876543210', '987654321', ''],
    'age': [25, -5, 150, 25, 30],
    'salary': ['50000', '5,00,000', '₹5 lakhs', '500K', None]
})

Profile Data - 5 min:

python
df.info()                 # prints dtypes and non-null counts directly
print(df.describe())
print(df.isnull().sum())
# Visual: Any red flags?

Clean Data - 20 min:

python
# Names: standardize (chained inplace fillna is deprecated — assign instead)
df['name'] = df['name'].str.strip().str.title()
df['name'] = df['name'].fillna('Unknown')

# Email: validate (na=False treats missing as not matching)
df['email'] = df['email'].str.lower()
df.loc[~df['email'].str.contains('@', na=False), 'email'] = None

# Phone: extract digits only
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
df = df[df['phone'].str.len() == 10]

# Age: fix invalid
df = df[(df['age'] > 0) & (df['age'] < 120)]

# Salary: strip separators (note: "50K" and "lakhs" need unit-aware parsing)
df['salary'] = df['salary'].str.replace(r'[^0-9]', '', regex=True).astype(float)

Validate - 5 min:

python
print(f"Records remaining after cleaning: {len(df)}")
print(df.isnull().sum())  # confirm no unexpected nulls remain

Learning: Real data is really messy! You need patience and a systematic approach! 🧹
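One of the challenge's traps: the digits-only salary cleanup turns "₹5 lakhs" into 5. A unit-aware parser is safer — the suffix multipliers here are assumptions about this dataset:

```python
import re

# Assumed suffix conventions for this dataset
MULTIPLIERS = {"k": 1_000, "lakh": 100_000, "lakhs": 100_000}

def parse_salary(raw):
    if raw is None:
        return None
    # Drop currency symbols and Indian-style comma grouping
    text = str(raw).lower().replace(",", "").replace("₹", "").strip()
    match = re.match(r"([\d.]+)\s*([a-z]*)", text)
    if not match:
        return None
    number = float(match.group(1))
    return number * MULTIPLIERS.get(match.group(2), 1)

print(parse_salary("50000"))     # 50000.0
print(parse_salary("5,00,000"))  # 500000.0
print(parse_salary("₹5 lakhs"))  # 500000.0
print(parse_salary("500K"))      # 500000.0
```

All four source formats now land on the same number, instead of "₹5 lakhs" silently becoming 5.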

💼 Interview Questions

Q1: Data cleaning – manual vs automated?

A: Manual initially (understand the data, identify patterns). Automated after that (pipelines, validation rules, Great Expectations). Production: always automated – manual doesn't scale. But defining the rules takes manual work!


Q2: Cleaning strategy – delete rows vs imputation?

A: Context dependent! Under 5% missing, deleting is ok. Over 30% missing, imputation gets risky. Medical data: consult a domain expert. Financial: validate strictly. General rule: understand WHY the data is missing – random vs systematic. Random: imputation is ok. Systematic: investigate!


Q3: Outliers – always remove?

A: NO! First understand – is it a real outlier (billionaire income) or an error (age 999)? Domain knowledge is crucial. ML models: remove or cap. Statistical analysis: may include them (they represent real variation). Business: keep them (valuable insight). It's a case-by-case decision!


Q4: Is your data cleaning reproducible?

A: It MUST be! Document every step:

  • Which rows deleted, why
  • Which values imputed, method
  • Validation rules applied
  • Before-after metrics

Version control (Git): version the code and the cleaned data together. Reproducibility is essential in production! Same input → same output, always.


Q5: What's the impact of dirty data on a production model?

A: Model accuracy drops, bias gets introduced, wrong decisions follow. Examples: negative age → model fails. Invalid email → message undeliverable. Wrong financial data → inaccurate reports. Regulatory fines (GDPR). Reputation damage. Data cleaning is an INVESTMENT, not a cost! 💰

Frequently Asked Questions

โ“ Data cleaning ku evlo time aagum?
Data science projects la 60-80% time data cleaning ku dhaan pogum. Real data romba messy ah irukkum โ€” patience venum!
โ“ Missing values ah delete pannalaamaa?
Depends! Missing values 5% ku kammi na delete pannalaam. More than that na imputation (mean, median, mode fill) better. Context paathu decide pannunga.
โ“ Outliers ah remove pannanum ah?
Always remove panna vendaam! First understand โ€” real outlier ah (billionaire in income data) or error ah (age = 999). Real outliers ah keep pannunga, errors ah fix pannunga.
โ“ Best data cleaning tool enna?
Python pandas most popular. SQL also powerful for large datasets. OpenRefine for non-coders. Excel for small datasets. Tool illa, concept important!
โ“ Clean data eppadi verify pannanum?
Data profiling pannunga โ€” column-wise stats check pannunga (null %, unique counts, min/max). Great Expectations library use panni automated tests ezhudhunga.