Data Cleaning Basics
Introduction
"Garbage in, garbage out" — the golden rule of the data world! 🗑️
You can build the world's best ML model. But feed it dirty data? The output is garbage too! Data cleaning is the most important and most underrated skill in data engineering.
What real-world data actually looks like:
- Names: "john", "JOHN", "John ", " john " — same person, 4 different formats 😅
- Dates: "01/02/2025" — January 2nd or February 1st? 🤷
- Phone: "+91-9876543210", "09876543210", "9876543210" — same number!
- Age: 25, -5, 999, NULL — what's going on here? 🤔
Fixing all of this is what data cleaning means! Let's learn step by step 🚀
Types of Dirty Data
Dirty data comes in different forms:
| Problem | Example | Impact |
|---|---|---|
| **Missing Values** | NULL, empty, "N/A" | Model bias, errors |
| **Duplicates** | Same record 2 times | Inflated metrics |
| **Inconsistent Format** | "Chennai" vs "CHENNAI" | Wrong grouping |
| **Invalid Values** | Age = -5, Email = "abc" | Wrong predictions |
| **Outliers** | Salary = 99999999 | Skewed statistics |
| **Wrong Data Types** | "123" as string, not number | Calculation errors |
| **Encoding Issues** | "cafÃ©" instead of "café" | Display problems |
| **Stale Data** | Address from 10 years ago | Wrong decisions |
80% of dirty data falls into 3 categories: missing values, duplicates, inconsistent formats. Fix these three, and your data quality jumps dramatically! 📈
Handling Missing Values
Missing values — the most common data problem! There are different strategies:
1. Detection 🔍
2. Deletion ❌
3. Imputation 🔧
4. Smart Imputation 🧠
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Delete rows | < 5% missing | Simple | Data loss |
| Mean/Median | Numeric, random missing | Fast | Reduces variance |
| Mode | Categorical | Easy | May add bias |
| KNN Imputer | Complex patterns | Accurate | Slow on big data |
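A minimal pandas sketch of all four strategies — the DataFrame, column names, and thresholds here are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":  [25, np.nan, 34, 29, np.nan],
    "city": ["Chennai", "Mumbai", None, "Chennai", "Delhi"],
})

# 1. Detection — how much is missing, per column?
print(df.isnull().sum())
print(df.isnull().mean() * 100)  # missing % per column

# 2. Deletion — only when few rows are affected (< 5%)
dropped = df.dropna(subset=["age"])

# 3. Imputation — mode for categorical, median for numeric
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 4. Smart imputation — KNN estimates values from similar rows
num_cols = df.select_dtypes("number").columns
df[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])
print(df)
```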
Removing Duplicates
Duplicate records inflate your metrics — dangerous!
Exact Duplicates:
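In pandas, exact duplicates are one call away (the data and business keys below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "name":       ["Asha", "Asha", "Asha", "Ravi"],
    "phone":      ["9876543210"] * 3 + ["9123456780"],
    "updated_at": ["2025-01-01", "2025-01-01", "2025-02-01", "2025-01-15"],
})

print(df.duplicated().sum())  # count fully identical rows
df = df.drop_duplicates()     # removes exact copies only

# Dedupe on business keys, keeping the most recent record
df = df.sort_values("updated_at").drop_duplicates(
    subset=["name", "phone"], keep="last"
)
print(df)
```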
Fuzzy Duplicates — tricky ones! 🕵️
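"R. Kumar" vs "Kumar R" will never match exactly. A sketch using the fuzzywuzzy library mentioned in the takeaways — the names are made up, and a threshold like 90 needs tuning per dataset:

```python
from fuzzywuzzy import fuzz  # pip install fuzzywuzzy[speedup]

names = ["R. Kumar", "Kumar R", "Priya S", "S. Priya", "Arun V"]

# Compare every pair; token_sort_ratio ignores word order
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = fuzz.token_sort_ratio(names[i], names[j])
        if score >= 90:  # tune this threshold for your data
            print(f"Possible duplicate: {names[i]!r} ~ {names[j]!r} ({score})")
```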
Duplicate handling in SQL:
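A common pattern: rank rows within each business key and delete everything past the first. Table and column names here are hypothetical, and the syntax is PostgreSQL-style:

```sql
-- Keep the newest row per (name, phone); delete the rest
WITH ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY name, phone
               ORDER BY updated_at DESC
           ) AS rn
    FROM customers
)
DELETE FROM customers
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);
```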
Best Practice: Remove duplicates BEFORE any analysis. Otherwise one customer gets counted 3 times — and your revenue reports go wrong! 💰
Format Standardization
Same data, different formats — it all needs cleaning:
Text Cleaning:
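A sketch for the "john" / "JOHN " problem from the intro (the column name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"name": ["john", "JOHN", "John ", " john "]})
df["name"] = (
    df["name"]
    .str.strip()                           # drop leading/trailing spaces
    .str.replace(r"\s+", " ", regex=True)  # collapse inner whitespace
    .str.title()                           # "john" / "JOHN" -> "John"
)
print(df["name"].unique())  # one format left
```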
Date Standardization:
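For the "01/02/2025" ambiguity, be explicit about day order instead of letting pandas guess. In this sketch, dayfirst=True assumes DD/MM (the common Indian convention), and format="mixed" needs pandas >= 2.0:

```python
import pandas as pd

raw = pd.Series(["01/02/2025", "15/03/1990", "2025-01-15"])
# format="mixed" parses heterogeneous formats element by element;
# dayfirst=True resolves "01/02/2025" as 1 February, not 2 January
dates = pd.to_datetime(raw, format="mixed", dayfirst=True)
print(dates.dt.strftime("%Y-%m-%d").tolist())  # one ISO format out
```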
Address Standardization:
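Pick one canonical value per city and map every variant onto it. The mapping below is a made-up example; real projects build it from profiling the actual values:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Chennai", "CHENNAI", "chennai", "Madras", "Chenai"]})

city_map = {
    "chennai": "Chennai",
    "madras":  "Chennai",  # old name
    "chenai":  "Chennai",  # common typo
}
# Normalize, map known variants, keep the original if unmapped
df["city"] = df["city"].str.strip().str.lower().map(city_map).fillna(df["city"])
print(df["city"].unique())
```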
With standardization, "Chennai", "chennai", and "CHENNAI" all land in one group — and your analytics come out correct! ✅
Outlier Detection — Smart Approach
When handling outliers, think before you delete! 🤔
💡 IQR Method — Most reliable (sketched in the code after this list)
💡 Z-Score — For normally distributed data (also in the sketch below)
💡 Domain Knowledge — Most important! Age = 150 is an error. But salary = ₹50 crore might be real (a CEO's salary).
💡 Don't auto-delete — Flag it as an outlier, investigate, then decide. Sometimes outliers ARE the interesting data!
💡 Cap instead of remove — Winsorization: clip extreme values to a threshold instead of deleting them (see the sketch below).
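Here is a sketch of the three numeric techniques on a hypothetical salary column:

```python
import pandas as pd

df = pd.DataFrame({"salary": [30_000, 45_000, 52_000, 48_000, 99_999_999]})

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["iqr_outlier"] = ~df["salary"].between(low, high)

# Z-score: only meaningful if the data is roughly normal
z = (df["salary"] - df["salary"].mean()) / df["salary"].std()
df["z_outlier"] = z.abs() > 3

# Winsorization: cap at the 1st/99th percentile instead of deleting
p01, p99 = df["salary"].quantile([0.01, 0.99])
df["salary_capped"] = df["salary"].clip(p01, p99)

print(df)
```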
Data Validation Rules
For clean data, you need to define validation rules:
Common Validations:
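Plain pandas checks work fine as a starting point — the rules and column names below are hypothetical examples:

```python
import pandas as pd

df = pd.DataFrame({
    "age":   [25, -5, 999, 34],
    "email": ["a@x.com", "abc", "b@y.in", "c@z.org"],
})

checks = {
    "age_in_range": df["age"].between(0, 120),
    "email_valid":  df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}
for name, passed in checks.items():
    print(f"{name}: {(~passed).sum()} failing rows")
```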
Great Expectations Library:
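Great Expectations' API has changed a lot between versions; this sketch uses the older pandas-dataset style, so treat it as illustrative rather than current:

```python
import pandas as pd
import great_expectations as ge

raw = pd.DataFrame({"patient_id": [1, 2, None], "age": [34, 150, 29]})

# Wrap the DataFrame so expectation methods become available
df = ge.from_pandas(raw)
df.expect_column_values_to_not_be_null("patient_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

print(df.validate())  # summary of which expectations passed/failed
```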
Run validation rules automatically — on every pipeline run! Manual checking doesn't scale 🚫
Data Cleaning Pipeline
```
┌─────────────────────────────────────────────────┐
│             DATA CLEANING PIPELINE              │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌──────────┐   ┌───────────┐   ┌────────────┐  │
│  │ RAW DATA │──▶│  PROFILE  │──▶│   CLEAN    │  │
│  │          │   │           │   │            │  │
│  │ CSV/DB   │   │ Stats     │   │ Missing    │  │
│  │ API      │   │ Types     │   │ Duplicates │  │
│  │ Files    │   │ Nulls %   │   │ Formats    │  │
│  └──────────┘   └───────────┘   └─────┬──────┘  │
│                                       │         │
│  ┌──────────┐   ┌───────────┐   ┌─────▼──────┐  │
│  │  OUTPUT  │◀──│ VALIDATE  │◀──│ TRANSFORM  │  │
│  │          │   │           │   │            │  │
│  │ Clean DB │   │ Rules     │   │ Outliers   │  │
│  │ Parquet  │   │ Tests     │   │ Types      │  │
│  │ Report   │   │ Quality % │   │ Encode     │  │
│  └──────────┘   └───────────┘   └────────────┘  │
└─────────────────────────────────────────────────┘
```
Data Profiling — First Step Always!
Before you start cleaning, profile the data — understand it first:
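Quick profiling in plain pandas, plus an HTML report via ydata-profiling (formerly pandas-profiling). The input file name is hypothetical — point it at your own data:

```python
import pandas as pd
from ydata_profiling import ProfileReport  # pip install ydata-profiling

df = pd.read_csv("patients.csv")  # hypothetical input file

# Fast, zero-dependency profiling
print(df.shape)
print(df.dtypes)
print(df.describe(include="all"))
print(df.isnull().mean().sort_values(ascending=False))  # missing % per column
print(df.duplicated().sum())

# Full HTML report: distributions, correlations, missing-value patterns
ProfileReport(df, title="Patients — data profile").to_file("profile.html")
```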
Profile Report shows:
- Distribution of each column 📊
- Correlation between columns 🔗
- Missing value patterns 🕳️
- Duplicate rows count 🔄
- Data type warnings ⚠️
5 minutes of profiling can save 5 hours of debugging! Always profile first, then clean 🎯
Encoding & Type Issues
⚠️ Silent killers — encoding and type issues!
🔴 Character Encoding:
- "cafÃ©" instead of "café" — UTF-8 vs Latin-1 mismatch
- Fix: df = pd.read_csv('file.csv', encoding='utf-8')
- Still broken? Try: encoding='latin-1' or encoding='cp1252'
🔴 Numeric as String:
- "1,234.56" — has a comma, so it gets treated as a string
- Fix: df['amount'] = df['amount'].str.replace(',', '').astype(float)
🔴 Boolean Confusion:
- "Yes", "Y", "1", "TRUE", "true" — all mean True
- Fix: Map to standard boolean
🔴 Date as String:
- If "2025-01-15" stays a string, date operations won't work
- Fix: use pd.to_datetime()
🔴 Mixed Types in Column:
- Some rows integer, some string — pandas silently converts to object
- Fix: pd.to_numeric(df['col'], errors='coerce') — invalid values NaN aagum
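A small sketch pulling these type fixes together — the DataFrame and column names are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "amount": ["1,234.56", "99", "2,000"],
    "active": ["Yes", "TRUE", "0"],
    "joined": ["2025-01-15", "15/01/2025", "Jan 15, 2025"],
    "mixed":  [1, "2", "oops"],
})

# Numeric stored as string, with thousands separators
df["amount"] = df["amount"].str.replace(",", "").astype(float)

# Booleans written many different ways -> real booleans
bool_map = {"yes": True, "y": True, "1": True, "true": True,
            "no": False, "n": False, "0": False, "false": False}
df["active"] = df["active"].str.strip().str.lower().map(bool_map)

# Dates stored as strings -> datetime (format="mixed" needs pandas >= 2.0)
df["joined"] = pd.to_datetime(df["joined"], format="mixed", dayfirst=True)

# Mixed-type column -> numeric, invalid entries become NaN
df["mixed"] = pd.to_numeric(df["mixed"], errors="coerce")
print(df.dtypes)
```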
Real-World: Hospital Data Cleaning
An example of cleaning hospital patient data 🏥:
Raw Data Problems:
- Patient name: "KUMAR, R" vs "R. Kumar" vs "Kumar R"
- Blood group: "O+", "O positive", "O Positive", "o+"
- Date of birth: "15/03/1990", "1990-03-15", "March 15, 1990"
- Phone: Some with +91, some without, some with spaces
- Diagnosis: Free text with typos — "Diabeties", "Diabtes", "Diabetes"
Cleaning Steps:
1. 🔤 Names: Parse into first_name, last_name, standardize case
2. 🩸 Blood group: Map all variations to standard codes (O+, A-, etc.)
3. 📅 DOB: pd.to_datetime() with format='mixed'
4. 📞 Phone: Regex — keep only digits, ensure 10-digit format
5. 🏥 Diagnosis: Fuzzy matching to standard ICD codes
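A hypothetical sketch of steps 2-4 (the real pipeline would add name parsing and fuzzy matching to ICD codes):

```python
import pandas as pd

df = pd.DataFrame({
    "blood_group": ["O+", "O positive", "o Positive"],
    "dob":         ["15/03/1990", "1990-03-15", "March 15, 1990"],
    "phone":       ["+91-9876543210", "09876543210", "98765 43210"],
})

# Step 2: map free-text blood groups onto standard codes
bg_map = {"o positive": "O+", "o+": "O+"}  # extend for all 8 groups
df["blood_group"] = df["blood_group"].str.strip().str.lower().map(bg_map)

# Step 3: heterogeneous DOB formats -> one datetime column
df["dob"] = pd.to_datetime(df["dob"], format="mixed", dayfirst=True)

# Step 4: strip everything but digits, keep the last 10 (drops +91 / leading 0)
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True).str[-10:]
print(df)
```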
Result: 50,000 records cleaned, 3,000 duplicates removed, 2% invalid records flagged for manual review.
Impact: Correct patient matching improved from 78% to 97%! 🎯
Data Cleaning Best Practices
Want to clean data like a pro? Follow these:
✅ Always profile first — Don't start cleaning without understanding the data
✅ Document everything — What you cleaned, why, and how — for future reference
✅ Keep original data — Never modify the raw data; work on a copy
✅ Automate repeatable tasks — If the same cleaning runs every week, write a script
✅ Validate after cleaning — Verify the cleaning actually worked as intended
✅ Handle edge cases — Empty strings, whitespace-only, special characters
✅ Test with samples — Test on a sample before running on the full dataset
✅ Version your cleaning code — Maintain it in Git
Remember: Clean data = trustworthy insights = better decisions = happy stakeholders! 🏆
Next up: Real-time data systems — how to handle live data! ⚡
✅ Key Takeaways
✅ Garbage In, Garbage Out — Data quality determines model quality. Even a perfect model produces garbage output from dirty data
✅ Dirty Data Types — Missing values, duplicates, format inconsistencies, invalid values, outliers, encoding issues, stale data. 80% of issues come from three: missing values, duplicates, and formats
✅ Missing Values — Under 5%: deleting is fine. 5-30%: impute (mean, median, mode, KNN). Over 50%: drop the column or investigate further. Domain expertise matters
✅ Duplicates Are Dangerous — Exact duplicates are easy to remove. Fuzzy duplicates are tricky (fuzzywuzzy library). Duplicate rows inflate metrics and skew your analysis
✅ Standardization Is Critical — Keep name, date, phone number, and address formats consistent. "Chennai", "CHENNAI", "Chenai" must end up in the same group
✅ Understand Outliers — Don't remove them blindly; understand first. Keep real outliers (billionaire income). Fix errors (age 999)
✅ Data Profiling First — Profile before you start cleaning. Understand describe, dtypes, nulls, nunique, and distributions
✅ Automation Recommended — Manual cleaning is tedious and error-prone. Automate tests with the Great Expectations library. Production pipelines need built-in validation rules
🏁 🎮 Mini Challenge
Challenge: Clean a Real-World Messy Dataset
Hands-on cleanup of realistic dirty data:
Create Messy Dataset - 5 min:
Profile Data - 5 min:
Clean Data - 20 min:
Validate - 5 min:
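A starter sketch for step 1 — a deliberately messy dataset (all values are invented):

```python
import numpy as np
import pandas as pd

messy = pd.DataFrame({
    "name":  ["john", "JOHN ", " John", "Priya", "Priya"],
    "age":   [25, -5, 999, np.nan, 31],
    "city":  ["Chennai", "CHENNAI", "chenai", "Madras", "Chennai"],
    "phone": ["+91-9876543210", "09876543210", "9876543210", None, "98765 43210"],
})
messy.to_csv("messy.csv", index=False)

# Step 2: profile it — messy.info(), messy.isnull().sum(), messy.duplicated().sum()
# Step 3: clean it with the techniques from this section
# Step 4: validate — e.g. assert messy["age"].between(0, 120).all()
```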
Learning: Realistic data is really messy! You need patience and a systematic approach! 🧹
💼 Interview Questions
Q1: Data cleaning – manual vs automated?
A: Manual at first (understand the data, identify patterns). Automated after that (pipelines, validation rules, Great Expectations). Production: always automated – manual doesn't scale. But defining those rules takes manual work!
Q2: Cleaning strategy – delete rows vs imputation?
A: Context-dependent! Under 5% missing, deleting is OK. Over 30% missing → imputation becomes risky. Medical data → consult a domain expert. Financial → validate strictly. General rule: understand WHY it's missing – random vs systematic. Random → imputation OK; systematic → investigate!
Q3: Outliers – always remove?
A: NO! First understand – is it a real outlier (billionaire income) or an error (age 999)? Domain knowledge is crucial. ML models: remove or cap. Statistical analysis: may include them (they represent real variation). Business: keep them (valuable insight). Case-by-case decision!
Q4: Should data cleaning be reproducible?
A: MUST! Document every step:
- Which rows deleted, why
- Which values imputed, method
- Validation rules applied
- Before-after metrics
Version control (Git): version the code and the cleaned data together. Reproducibility is essential in production! Same input → same output, always.
Q5: Dirty data in a production model – what's the impact?
A: Model accuracy drops, bias creeps in, wrong decisions follow. Example: negative age → model fails. Invalid email → message undeliverable. Wrong financial data → inaccurate reports. Regulatory fines (GDPR). Reputation damage. Data cleaning is an INVESTMENT, not a cost! 💰
Frequently Asked Questions
What should you do when a column has 70% missing values?