โ† Back|DATA-ENGINEERINGโ€บSection 1/17
0 of 17 completed

Data cleaning basics

Intermediateโฑ 14 min read๐Ÿ“… Updated: 2026-02-17

Introduction

"Garbage in, garbage out" — the golden rule of the data world! 🗑️


You can build the world's best ML model. But feed it dirty data? The output is still garbage! Data cleaning is the most important and most underrated skill in data engineering.


What real-world data actually looks like:

  • Names: "john", "JOHN", "John ", " john " — same person, 4 different formats 😅
  • Dates: "01/02/2025" — January 2nd or February 1st? 🤷
  • Phone: "+91-9876543210", "09876543210", "9876543210" — same number!
  • Age: 25, -5, 999, NULL — what's going on? 🤔

Fixing all of that is data cleaning! Let's learn it step by step 🚀
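A tiny pandas sketch of fixing the name and phone messes above (the column names and the keep-the-last-10-digits rule are illustrative assumptions):

```python
import pandas as pd

# Hypothetical messy records mirroring the examples above
df = pd.DataFrame({
    "name": ["john", "JOHN", "John ", " john "],
    "phone": ["+91-9876543210", "09876543210", "9876543210", "9876543210"],
})

# Names: trim whitespace, normalize case
df["name"] = df["name"].str.strip().str.title()

# Phones: keep digits only, then take the last 10 digits (assumes Indian mobiles)
df["phone"] = df["phone"].str.replace(r"[^0-9]", "", regex=True).str[-10:]

print(df["name"].nunique(), df["phone"].nunique())  # both collapse to 1
```

Four differently formatted rows collapse to one person and one number — exactly the effect standardization is after.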

Types of Dirty Data

Dirty data comes in many forms:


| Problem | Example | Impact |
|---|---|---|
| **Missing Values** | NULL, empty, "N/A" | Model bias, errors |
| **Duplicates** | Same record 2 times | Inflated metrics |
| **Inconsistent Format** | "Chennai" vs "CHENNAI" | Wrong grouping |
| **Invalid Values** | Age = -5, Email = "abc" | Wrong predictions |
| **Outliers** | Salary = 99999999 | Skewed statistics |
| **Wrong Data Types** | "123" as string, not number | Calculation errors |
| **Encoding Issues** | "cafÃ©" instead of "café" | Display problems |
| **Stale Data** | Address from 10 years ago | Wrong decisions |

80% of dirty data falls into 3 categories: missing values, duplicates, and inconsistent formats. Fix these three and your data quality jumps dramatically! 📈

Handling Missing Values

Missing values — the most common data problem! There are several strategies:


1. Detection 🔍

python
import pandas as pd

# Check missing values
df.isnull().sum()
df.isnull().mean() * 100  # Percentage

# Sneaky missing values
df.replace(['N/A', 'NA', '', 'null', 'None', '-'], pd.NA, inplace=True)

2. Deletion ❌

python
# Drop rows — when a critical column is missing (< 5% of rows)
df = df.dropna(subset=['critical_column'])

# Drop columns with > 50% missing
threshold = int(len(df) * 0.5)  # thresh = minimum non-null values a column needs to survive
df = df.dropna(axis=1, thresh=threshold)

3. Imputation 🔧

python
# Numeric — mean/median
df['age'] = df['age'].fillna(df['age'].median())

# Categorical — mode
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Forward fill — time series (fillna(method='ffill') is deprecated)
df['price'] = df['price'].ffill()

4. Smart Imputation 🧠

python
from sklearn.impute import KNNImputer

# Works on numeric columns only; fit_transform returns a NumPy array
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Delete rows | < 5% missing | Simple | Data loss |
| Mean/Median | Numeric, random missing | Fast | Reduces variance |
| Mode | Categorical | Easy | May add bias |
| KNN Imputer | Complex patterns | Accurate | Slow on big data |
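The strategy table can be folded into a rough rule of thumb. A hedged sketch — the thresholds are the guidelines from this section, not hard rules:

```python
def suggest_missing_strategy(missing_pct: float, is_numeric: bool) -> str:
    """Rough guideline for handling a column's missing values."""
    if missing_pct < 5:
        return "drop rows"
    if missing_pct > 50:
        return "drop column (or investigate why)"
    # In between: impute based on the column's type
    return "impute with median" if is_numeric else "impute with mode"

print(suggest_missing_strategy(2, True))    # drop rows
print(suggest_missing_strategy(20, True))   # impute with median
print(suggest_missing_strategy(70, False))  # drop column (or investigate why)
```

Treat this as a starting point; domain knowledge always overrides the thresholds.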

Removing Duplicates

Duplicate records inflate your metrics — dangerous!


Exact Duplicates:

python
# Find duplicates
df.duplicated().sum()

# Remove duplicates
df.drop_duplicates(inplace=True)

# Keep the last occurrence per email (assign the result back!)
df = df.drop_duplicates(subset=['email'], keep='last')

Fuzzy Duplicates — the tricky ones! 🕵️

python
# "John Smith" vs "john smith" vs "JOHN SMITH"
from fuzzywuzzy import fuzz  # newer releases are published as "thefuzz"

score = fuzz.ratio("John Smith", "Jon Smith")  # high score (0-100) — likely the same person
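If you'd rather avoid an extra dependency, the same idea works with the standard library's difflib. A sketch of threshold-based fuzzy dedup — the 0.9 cutoff is an assumption you'd tune for your data:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = ["John Smith", "john smith", "JOHN SMITH", "Jane Doe"]
seen, unique = [], []
for n in names:
    # Keep a name only if it isn't near-identical to one already kept
    if not any(similarity(n, s) > 0.9 for s in seen):
        seen.append(n)
        unique.append(n)

print(unique)  # ['John Smith', 'Jane Doe']
```

Pairwise comparison is O(n²), so for large tables you'd block on a cheap key (e.g. first letter) before fuzzy-matching within blocks.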

Duplicate handling in SQL:

sql
-- Find duplicates
SELECT email, COUNT(*) as cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Keep only the latest record (PostgreSQL DELETE ... USING syntax)
DELETE FROM users a
USING users b
WHERE a.id < b.id
AND a.email = b.email;

Best Practice: Remove duplicates BEFORE any analysis. Otherwise one customer gets counted 3 times — and your revenue reports come out wrong! 💰

Format Standardization

Same data, different formats — time to clean it up:


Text Cleaning:

python
# Case normalization
df['name'] = df['name'].str.strip().str.title()
# "  john DOE  " → "John Doe"

# Phone number standardization
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
# "+91-9876-543210" → "919876543210"

# Email cleanup
df['email'] = df['email'].str.strip().str.lower()

Date Standardization:

python
# Multiple formats → single format (format='mixed' requires pandas >= 2.0)
df['date'] = pd.to_datetime(df['date'], format='mixed')
# "01/02/2025", "2025-01-02", "Jan 2, 2025" → uniform datetime

Address Standardization:

python
# Common abbreviations
replacements = {
    'St.': 'Street', 'St ': 'Street ',
    'Ave.': 'Avenue', 'Rd.': 'Road',
    'Blvd.': 'Boulevard'
}
for old, new in replacements.items():
    # regex=False: treat 'St.' literally (the '.' is not a wildcard)
    df['address'] = df['address'].str.replace(old, new, regex=False)

After standardization, "Chennai", "chennai", and "CHENNAI" all fall into one group — your analytics come out right! ✅

Outlier Detection — A Smart Approach

💡 Tip

When handling outliers, think before you delete! 🤔

💡 IQR Method — the most reliable:

python
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]

💡 Z-Score — for normally distributed data:

python
from scipy import stats
z_scores = stats.zscore(df['salary'])
outliers = df[abs(z_scores) > 3]

💡 Domain Knowledge — most important! Age = 150 is an error. But a salary of ₹50 crore might be real (a CEO's salary).

💡 Don't auto-delete — flag as an outlier, investigate, then decide. Sometimes outliers ARE the interesting data!

💡 Cap instead of remove — Winsorization: set extreme values to a threshold instead of dropping them.
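Winsorization is one line with pandas' clip. A sketch — the 5th/95th-percentile caps are illustrative choices, not a standard:

```python
import pandas as pd

salaries = pd.Series([30_000, 45_000, 50_000, 55_000, 60_000, 99_999_999])

# Cap at the 5th/95th percentiles instead of deleting the extreme row
lower, upper = salaries.quantile(0.05), salaries.quantile(0.95)
capped = salaries.clip(lower=lower, upper=upper)

print(capped.max() < 99_999_999)     # the extreme value is capped, not lost
print(len(capped) == len(salaries))  # no rows deleted
```

The row count is preserved, so downstream joins and counts stay intact while the statistics stop being dominated by one value.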

Data Validation Rules

Define validation rules to keep data clean:


Common Validations:

python
def validate_data(df):
    rules = {
        'age': lambda x: (x >= 0) & (x <= 120),
        'email': lambda x: x.str.contains('@', na=False),  # na=False: missing emails count as invalid
        'phone': lambda x: x.str.len() == 10,
        'amount': lambda x: x > 0,
        'date': lambda x: x <= pd.Timestamp.now(),
    }

    for col, rule in rules.items():
        invalid = ~rule(df[col])
        print(f"{col}: {invalid.sum()} invalid records")
        df.loc[invalid, col] = pd.NA  # Mark as missing

    return df

Great Expectations Library:

python
import great_expectations as gx

# Assumes `validator` was obtained from a gx data context (setup omitted here)
validator.expect_column_values_to_not_be_null('user_id')
validator.expect_column_values_to_be_between('age', 0, 120)
validator.expect_column_values_to_be_unique('email')
validator.expect_column_values_to_match_regex('phone', r'^\d{10}$')

Run validation rules automatically — on every pipeline run! Manual checking doesn't scale 🚫

Data Cleaning Pipeline

๐Ÿ—๏ธ Architecture Diagram
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚          DATA CLEANING PIPELINE                   โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                   โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚ RAW DATA โ”‚โ”€โ”€โ–ถโ”‚  PROFILE  โ”‚โ”€โ”€โ–ถโ”‚  CLEAN     โ”‚  โ”‚
โ”‚  โ”‚          โ”‚   โ”‚           โ”‚   โ”‚            โ”‚  โ”‚
โ”‚  โ”‚ CSV/DB   โ”‚   โ”‚ Stats     โ”‚   โ”‚ Missing    โ”‚  โ”‚
โ”‚  โ”‚ API      โ”‚   โ”‚ Types     โ”‚   โ”‚ Duplicates โ”‚  โ”‚
โ”‚  โ”‚ Files    โ”‚   โ”‚ Nulls %   โ”‚   โ”‚ Formats    โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚                                       โ”‚          โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚  OUTPUT  โ”‚โ—€โ”€โ”€โ”‚ VALIDATE  โ”‚โ—€โ”€โ”€โ”‚ TRANSFORM  โ”‚  โ”‚
โ”‚  โ”‚          โ”‚   โ”‚           โ”‚   โ”‚            โ”‚  โ”‚
โ”‚  โ”‚ Clean DB โ”‚   โ”‚ Rules     โ”‚   โ”‚ Outliers   โ”‚  โ”‚
โ”‚  โ”‚ Parquet  โ”‚   โ”‚ Tests     โ”‚   โ”‚ Types      โ”‚  โ”‚
โ”‚  โ”‚ Report   โ”‚   โ”‚ Quality % โ”‚   โ”‚ Encode     โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
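The pipeline stages above can be sketched as one function. Everything concrete here (the `id`/`name` columns, the uniqueness rule) is an illustrative assumption:

```python
import pandas as pd

def clean_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """PROFILE -> CLEAN -> TRANSFORM -> VALIDATE in one pass."""
    report = {"rows_in": len(df)}                    # PROFILE: record input size
    df = df.drop_duplicates()                        # CLEAN: exact duplicates
    df = df.dropna(subset=["id"])                    # CLEAN: rows missing the key
    df["name"] = df["name"].str.strip().str.title()  # TRANSFORM: standardize format
    assert df["id"].is_unique                        # VALIDATE: a basic rule
    report["rows_out"] = len(df)
    print(report)
    return df

raw = pd.DataFrame({
    "id": [1, 2, 2, None],
    "name": ["  john ", "MARY", "MARY", "bob"],
})
clean = clean_pipeline(raw)
print(clean["name"].tolist())  # ['John', 'Mary']
```

A real pipeline would log the report instead of printing it and fail loudly (or quarantine rows) when validation breaks, but the stage order is the point.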

Data Profiling — First Step Always!

Before you start cleaning, profile the data — understand it first:


python
# Quick profiling
print(df.shape)           # Rows, columns
print(df.dtypes)          # Data types
print(df.describe())      # Stats
print(df.isnull().sum())  # Missing counts
print(df.nunique())       # Unique values per column

# Detailed profiling
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Data Quality Report")
profile.to_file("report.html")

Profile Report shows:

  • Distribution of each column 📊
  • Correlation between columns 🔗
  • Missing value patterns 🕳️
  • Duplicate rows count 🔄
  • Data type warnings ⚠️

5 minutes of profiling can save 5 hours of debugging! Always profile first, then clean 🎯

Encoding & Type Issues

โš ๏ธ Warning

โš ๏ธ Silent killers โ€” encoding and type issues!

🔴 Character Encoding:

- "cafÃ©" instead of "café" — a UTF-8 vs Latin-1 mismatch

- Fix: df = pd.read_csv('file.csv', encoding='utf-8')

- Still broken? Try: encoding='latin-1' or encoding='cp1252'
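The try-this-encoding-then-that advice can be automated with the standard library alone. A sketch — the encoding order is a guess you should adjust for your data's origin:

```python
import os
import tempfile

def read_text(path: str) -> str:
    """Try likely encodings until one decodes cleanly."""
    for enc in ("utf-8", "cp1252", "latin-1"):
        try:
            with open(path, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")

# Demo: a file saved as cp1252 fails a utf-8 decode, so the fallback kicks in
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write("café".encode("cp1252"))
print(read_text(path))  # café
```

Caveat: latin-1 never raises, so it silently "succeeds" on anything — put it last, and spot-check the decoded text for mojibake.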

🔴 Numeric as String:

- "1,234.56" — the comma makes pandas treat it as a string

- Fix: df['amount'] = df['amount'].str.replace(',', '').astype(float)

🔴 Boolean Confusion:

- "Yes", "Y", "1", "TRUE", "true" — all mean True

- Fix: Map them to a standard boolean
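A minimal sketch of that mapping — the truthy/falsy sets are assumptions you'd extend for your data:

```python
import pandas as pd

TRUTHY = {"yes", "y", "1", "true", "t"}
FALSY = {"no", "n", "0", "false", "f"}

def to_bool(value):
    v = str(value).strip().lower()
    if v in TRUTHY:
        return True
    if v in FALSY:
        return False
    return pd.NA  # unknown value -> missing, don't guess

raw = pd.Series(["Yes", "Y", "1", "TRUE", "true", "No", "n", "0", "FALSE"])
flags = raw.map(to_bool)
print(flags.tolist())  # [True, True, True, True, True, False, False, False, False]
```

Mapping unknowns to missing (instead of guessing) keeps silent data corruption out of your boolean column.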

🔴 Date as String:

- If "2025-01-15" is stored as a string, date operations won't work

- Fix: use pd.to_datetime()

🔴 Mixed Types in Column:

- Some rows integer, some string — pandas silently converts the column to object

- Fix: pd.to_numeric(df['col'], errors='coerce') — invalid values become NaN
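A quick demonstration of the coerce fix on a mixed-type column:

```python
import pandas as pd

# Mixed ints, numeric strings, junk, and missing values -> dtype is object
mixed = pd.Series([10, "20", "abc", None, "30.5"])
nums = pd.to_numeric(mixed, errors="coerce")

print(nums.dtype)         # float64
print(nums.isna().sum())  # 2  ("abc" and None become NaN)
```

After coercion the column is a real numeric dtype, so aggregations and comparisons behave; the NaNs then feed straight into your missing-value strategy.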

Real-World: Hospital Data Cleaning

✅ Example

An example of cleaning hospital patient data 🏥:

Raw Data Problems:

- Patient name: "KUMAR, R" vs "R. Kumar" vs "Kumar R"

- Blood group: "O+", "O positive", "O Positive", "O+"

- Date of birth: "15/03/1990", "1990-03-15", "March 15, 1990"

- Phone: Some with +91, some without, some with spaces

- Diagnosis: Free text with typos — "Diabeties", "Diabtes", "Diabetes"

Cleaning Steps:

1. 🔤 Names: Parse into first_name, last_name, standardize case

2. 🩸 Blood group: Map all variations to standard codes (O+, A-, etc.)

3. 📅 DOB: pd.to_datetime() with format='mixed'

4. 📞 Phone: Regex — keep only digits, ensure 10-digit format

5. 🏥 Diagnosis: Fuzzy matching to standard ICD codes

Result: 50,000 records cleaned, 3,000 duplicates removed, 2% invalid records flagged for manual review.

Impact: Correct patient matching improved from 78% to 97%! 🎯
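The blood-group step (step 2) could be sketched as a normalizer like this — the variant spellings are hypothetical examples, not the hospital's actual data:

```python
import pandas as pd

# Hypothetical variants seen in a raw hospital export
blood = pd.Series(["O+", "O positive", "O Positive", "A-", "a negative"])

def standardize_blood_group(value: str) -> str:
    v = value.strip().lower()
    v = v.replace(" positive", "+").replace(" negative", "-")
    return v.upper()

print(blood.map(standardize_blood_group).unique().tolist())  # ['O+', 'A-']
```

Five raw spellings collapse to two standard codes; in practice you'd also validate the result against the eight legal ABO/Rh codes and flag anything else for review.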

Prompt: Automate Data Cleaning

📋 Copy-Paste Prompt
You are a data engineer. Write a Python data cleaning pipeline for this CSV:

Columns: name, email, phone, age, salary, city, join_date

Known issues:
1. Names have inconsistent capitalization and extra spaces
2. Some emails are invalid (no @ symbol)
3. Phone numbers in different formats (+91, with dashes, with spaces)
4. Age has negative values and values > 150
5. Salary has string values like "50K" and "1,20,000"
6. City names have typos ("Chenai" for "Chennai")
7. Dates in multiple formats

Write a complete cleaning function with logging. Show before/after statistics. Use pandas. Explain each step in Tanglish.

Data Cleaning Best Practices

Want to clean data like a pro? Follow these:


✅ Always profile first — don't start cleaning before you understand the data

✅ Document everything — what you cleaned, why, and how, for future reference

✅ Keep original data — don't modify the raw data; work on a copy

✅ Automate repeatable tasks — if the same cleaning runs weekly, write a script

✅ Validate after cleaning — verify the cleaning actually did what you intended

✅ Handle edge cases — empty strings, whitespace-only values, special characters

✅ Test with samples — test on a sample before running on the full dataset

✅ Version your cleaning code — maintain it in Git


Remember: Clean data = trustworthy insights = better decisions = happy stakeholders! 🏆


Next up: Real-time data systems — how to handle live data! ⚡

✅ Key Takeaways

✅ Garbage In, Garbage Out — data quality determines model quality. Even a perfect model produces garbage output from dirty data


✅ Dirty Data Types — missing values, duplicates, format inconsistencies, invalid values, outliers, encoding issues, stale data. 80% of issues come from just three: missing values, duplicates, and formats


✅ Missing Values — under 5%: deleting rows is fine. 5-30%: impute (mean, median, mode, KNN). Over 50%: drop the column or investigate. Domain expertise matters


✅ Duplicates Are Dangerous — exact duplicates are easy to remove. Fuzzy duplicates are tricky (fuzzywuzzy library). Duplicate rows inflate metrics and skew your analysis


✅ Standardization Is Critical — make the formats of names, dates, phone numbers, and addresses consistent. "Chennai", "CHENNAI", "Chenai" must group correctly


✅ Understand Outliers — don't remove them blindly; understand them first. Keep real outliers (a billionaire's income). Fix errors (age 999)


✅ Data Profiling First — profile before you start cleaning. Understand describe, dtypes, nulls, nunique, and distributions


✅ Automation Recommended — manual cleaning is tedious and error-prone. Automate tests with the Great Expectations library. Production pipelines need built-in validation rules

🎮 Mini Challenge

Challenge: Clean a Real-World Messy Dataset


A hands-on cleanup of realistic dirty data:


Create Messy Dataset - 5 min:

python
import pandas as pd

# Intentionally create dirty data
df = pd.DataFrame({
    'name': ['john', '  JOHN  ', 'John Doe', 'john', None],
    'email': ['john@gmail.com', 'john@', 'john', 'JOHN@GMAIL.COM', 'invalid'],
    'phone': ['9876543210', '98-7654-3210', '+91 9876543210', '987654321', ''],
    'age': [25, -5, 150, 25, 30],
    'salary': ['50000', '5,00,000', '₹5 lakhs', '500K', None]
})

Profile Data - 5 min:

python
df.info()                 # prints dtypes and non-null counts directly
print(df.describe())
print(df.isnull().sum())
# Visual: Any red flags?

Clean Data - 20 min:

python
# Names: standardize (chained inplace fillna is deprecated — assign instead)
df['name'] = df['name'].str.strip().str.title()
df['name'] = df['name'].fillna('Unknown')

# Email: validate (na=False treats missing as not matching)
df['email'] = df['email'].str.lower()
df.loc[~df['email'].str.contains('@', na=False), 'email'] = None

# Phone: extract digits only
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
df = df[df['phone'].str.len() == 10]

# Age: fix invalid
df = df[(df['age'] > 0) & (df['age'] < 120)]

# Salary: strip separators (note: "50K" and "lakhs" need unit-aware parsing)
df['salary'] = df['salary'].str.replace(r'[^0-9]', '', regex=True).astype(float)

Validate - 5 min:

python
print(f"Records remaining after cleaning: {len(df)}")
print(df.isnull().sum())  # confirm no unexpected nulls remain

Learning: Real data is really messy! You need patience and a systematic approach! 🧹
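One of the challenge's traps: the digits-only salary cleanup turns "₹5 lakhs" into 5. A unit-aware parser is safer — the suffix multipliers here are assumptions about this dataset:

```python
import re

# Assumed suffix conventions for this dataset
MULTIPLIERS = {"k": 1_000, "lakh": 100_000, "lakhs": 100_000}

def parse_salary(raw):
    if raw is None:
        return None
    # Drop currency symbols and Indian-style comma grouping
    text = str(raw).lower().replace(",", "").replace("₹", "").strip()
    match = re.match(r"([\d.]+)\s*([a-z]*)", text)
    if not match:
        return None
    number = float(match.group(1))
    return number * MULTIPLIERS.get(match.group(2), 1)

print(parse_salary("50000"))     # 50000.0
print(parse_salary("5,00,000"))  # 500000.0
print(parse_salary("₹5 lakhs"))  # 500000.0
print(parse_salary("500K"))      # 500000.0
```

All four source formats now land on the same number, instead of "₹5 lakhs" silently becoming 5.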

💼 Interview Questions

Q1: Data cleaning – manual vs automated?

A: Manual initially (understand the data, identify patterns). Automated after that (pipelines, validation rules, Great Expectations). Production: always automated – manual doesn't scale. But defining the rules takes manual work!


Q2: Cleaning strategy – delete rows vs imputation?

A: Context dependent! Under 5% missing, deleting is ok. Over 30% missing, imputation gets risky. Medical data: consult a domain expert. Financial: validate strictly. General rule: understand WHY the data is missing – random vs systematic. Random: imputation is ok. Systematic: investigate!


Q3: Outliers – always remove?

A: NO! First understand – is it a real outlier (billionaire income) or an error (age 999)? Domain knowledge is crucial. ML models: remove or cap. Statistical analysis: may include them (they represent real variation). Business: keep them (valuable insight). It's a case-by-case decision!


Q4: Is your data cleaning reproducible?

A: It MUST be! Document every step:

  • Which rows deleted, why
  • Which values imputed, method
  • Validation rules applied
  • Before-after metrics

Version control (Git): version the code and the cleaned data together. Reproducibility is essential in production! Same input → same output, always.


Q5: What's the impact of dirty data on a production model?

A: Model accuracy drops, bias gets introduced, wrong decisions follow. Examples: negative age → model fails. Invalid email → message undeliverable. Wrong financial data → inaccurate reports. Regulatory fines (GDPR). Reputation damage. Data cleaning is an INVESTMENT, not a cost! 💰

Frequently Asked Questions

โ“ Data cleaning ku evlo time aagum?
Data science projects la 60-80% time data cleaning ku dhaan pogum. Real data romba messy ah irukkum โ€” patience venum!
โ“ Missing values ah delete pannalaamaa?
Depends! Missing values 5% ku kammi na delete pannalaam. More than that na imputation (mean, median, mode fill) better. Context paathu decide pannunga.
โ“ Outliers ah remove pannanum ah?
Always remove panna vendaam! First understand โ€” real outlier ah (billionaire in income data) or error ah (age = 999). Real outliers ah keep pannunga, errors ah fix pannunga.
โ“ Best data cleaning tool enna?
Python pandas most popular. SQL also powerful for large datasets. OpenRefine for non-coders. Excel for small datasets. Tool illa, concept important!
โ“ Clean data eppadi verify pannanum?
Data profiling pannunga โ€” column-wise stats check pannunga (null %, unique counts, min/max). Great Expectations library use panni automated tests ezhudhunga.