Data cleaning basics
Introduction
"Garbage in, garbage out" โ data world oda golden rule! ๐๏ธ
Nee world's best ML model build pannalaam. But dirty data feed pannaa? Output um garbage dhaan! Data cleaning is the most important and most underrated skill in data engineering.
Real-world data eppadi irukkum:
- Names: "john", "JOHN", "John ", " john " โ same person, 4 different formats ๐
- Dates: "01/02/2025" โ January 2nd or February 1st? ๐คท
- Phone: "+91-9876543210", "09876543210", "9876543210" โ same number!
- Age: 25, -5, 999, NULL โ enna nadakku? ๐ค
Idha ellam fix panradhu dhaan data cleaning! Let's learn step by step ๐
Types of Dirty Data
Dirty data comes in different forms:
| Problem | Example | Impact |
|---|---|---|
| **Missing Values** | NULL, empty, "N/A" | Model bias, errors |
| **Duplicates** | Same record 2 times | Inflated metrics |
| **Inconsistent Format** | "Chennai" vs "CHENNAI" | Wrong grouping |
| **Invalid Values** | Age = -5, Email = "abc" | Wrong predictions |
| **Outliers** | Salary = 99999999 | Skewed statistics |
| **Wrong Data Types** | "123" as string, not number | Calculation errors |
| **Encoding Issues** | "cafÃ©" instead of "café" | Display problems |
| **Stale Data** | Address from 10 years ago | Wrong decisions |
80% of dirty data falls into three categories: missing values, duplicates, and inconsistent formats. Fix these three, and your data quality jumps dramatically!
Handling Missing Values
Missing values are the most common data problem. There are several strategies:
1. Detection
2. Deletion
3. Imputation
4. Smart Imputation
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Delete rows | < 5% missing | Simple | Data loss |
| Mean/Median | Numeric, random missing | Fast | Reduces variance |
| Mode | Categorical | Easy | May add bias |
| KNN Imputer | Complex patterns | Accurate | Slow on big data |
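The strategies in the table above can be sketched in pandas; the dataset and column names below are hypothetical (KNN imputation would come from scikit-learn's `KNNImputer` and is omitted here):

```python
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age":    [25, 30, None, 41, None],
    "city":   ["Chennai", None, "Chennai", "Madurai", "Chennai"],
    "salary": [50000, 60000, 55000, None, 52000],
})

# 1. Detection: how much is missing per column?
missing_pct = df.isna().mean() * 100

# 2. Deletion: drop rows (only sensible when very little is missing)
df_dropped = df.dropna()

# 3. Imputation: median for numeric, mode for categorical
df_imputed = df.copy()
df_imputed["age"] = df_imputed["age"].fillna(df_imputed["age"].median())
df_imputed["salary"] = df_imputed["salary"].fillna(df_imputed["salary"].median())
df_imputed["city"] = df_imputed["city"].fillna(df_imputed["city"].mode()[0])
```

Note how deletion keeps only one complete row out of five here — exactly the "data loss" cost the table warns about.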
Removing Duplicates
Duplicate records inflate your metrics — dangerous!
Exact Duplicates:
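A minimal pandas sketch of exact-duplicate removal (the columns are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Kumar", "Kumar", "Priya", "Arun"],
    "email": ["k@x.com", "k@x.com", "p@x.com", "a@x.com"],
})

# Count exact duplicate rows before removing them
dup_count = df.duplicated().sum()

# Keep the first occurrence of each row, drop the rest
df_unique = df.drop_duplicates(keep="first")
```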
Fuzzy Duplicates — the tricky ones!
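Libraries like fuzzywuzzy score string similarity for this; the same idea can be sketched with the standard library's difflib so it runs without extra dependencies (the names and the 0.7 threshold are illustrative):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

names = ["R. Kumar", "Kumar R", "Priya S", "r kumar"]

# Flag pairs whose similarity crosses a (tunable) threshold
THRESHOLD = 0.7
candidates = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) >= THRESHOLD
]
```

Flagged pairs should go to review, not automatic deletion — fuzzy matching produces false positives.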
Handling duplicates in SQL:
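One common pattern is to rank rows per key with `ROW_NUMBER()` and keep only the first; sketched here with sqlite3 for portability (the table and columns are hypothetical — in a real warehouse you would wrap the inner query in a `DELETE` or `CREATE TABLE AS`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, email TEXT, created_at TEXT);
    INSERT INTO customers VALUES
        (1, 'k@x.com', '2025-01-01'),
        (2, 'k@x.com', '2025-01-05'),
        (3, 'p@x.com', '2025-01-02');
""")

# Keep only the earliest row per email using a window function
rows = conn.execute("""
    SELECT id, email FROM (
        SELECT id, email,
               ROW_NUMBER() OVER (
                   PARTITION BY email ORDER BY created_at
               ) AS rn
        FROM customers
    )
    WHERE rn = 1
    ORDER BY email
""").fetchall()
```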
Best Practice: Remove duplicates BEFORE any analysis. Otherwise one customer gets counted three times and your revenue reports come out wrong!
Format Standardization
Same data, different formats — they need to be standardized:
Text Cleaning:
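A possible sketch with pandas string methods, using the name variants from the introduction:

```python
import pandas as pd

names = pd.Series(["john ", " JOHN", "John", "jOhN"])

# Strip whitespace, collapse case, then title-case for display
clean = names.str.strip().str.lower().str.title()
```

All four variants now collapse to a single value.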
Date Standardization:
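A hedged sketch using `pd.to_datetime`; note that `format='mixed'` requires pandas 2.0+, and `dayfirst=True` resolves the "01/02/2025"-style ambiguity mentioned in the introduction:

```python
import pandas as pd

# Mixed date formats for the same date (hypothetical input)
dates = pd.Series(["2025-01-15", "15/01/2025", "January 15, 2025"])

# Parse each element individually, preferring day-first for ambiguous strings
parsed = pd.to_datetime(dates, format="mixed", dayfirst=True)

# Store everything in one canonical ISO format
iso = parsed.dt.strftime("%Y-%m-%d")
```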
Address Standardization:
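One way to sketch it in pandas: normalize case and whitespace, then map known misspellings to canonical names (the typo map below is hypothetical):

```python
import pandas as pd

cities = pd.Series(["Chennai", "CHENNAI", "chennai ", "Chenai", "Madurai"])

# Normalize, then map known misspellings to the canonical spelling
canonical = {"chenai": "chennai"}  # hypothetical typo map
clean = (
    cities.str.strip().str.lower()
          .replace(canonical)
          .str.title()
)
```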
With standardization, "Chennai", "chennai", and "CHENNAI" all land in one group — and your analytics come out correct!
Outlier Detection — a Smart Approach
When handling outliers, think before you delete!
💡 IQR Method — the most reliable:
💡 Z-Score — for normally distributed data:
💡 Domain Knowledge — the most important! Age = 150 is an error. But salary = ₹50 crore might be real (CEO salary).
💡 Don't auto-delete — flag as an outlier, investigate, then decide. Sometimes outliers ARE the interesting data!
💡 Cap instead of remove — Winsorization: clip extreme values to a threshold.
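The tips above can be sketched in pandas (the salary values are made up):

```python
import pandas as pd

salaries = pd.Series([30_000, 35_000, 40_000, 45_000, 50_000, 99_999_999])

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
is_outlier = (salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)

# Z-score: distance from the mean in standard deviations
# (only meaningful when the data is roughly normal)
z = (salaries - salaries.mean()) / salaries.std()

# Winsorization: cap extreme values instead of deleting rows
capped = salaries.clip(upper=salaries.quantile(0.95))
```

Note the code only *flags* outliers — whether ₹99,999,999 is a data-entry error or a real CEO salary is a domain-knowledge call.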
Data Validation Rules
To keep data clean, you need to define validation rules:
Common Validations:
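A minimal sketch of such rules in plain pandas (the columns are illustrative, and the email regex is deliberately simplistic):

```python
import pandas as pd

df = pd.DataFrame({
    "age":   [25, -5, 150, 34],
    "email": ["a@x.com", "abc", "b@y.com", "c@z.com"],
})

# Rule 1: age must fall in a plausible human range
bad_age = ~df["age"].between(0, 120)

# Rule 2: email must at least look like an address
bad_email = ~df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)

# Rows violating any rule go to review instead of silently passing through
violations = df[bad_age | bad_email]
```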
Great Expectations Library:
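Great Expectations' API has changed considerably across major versions, so rather than pin one, here is a dependency-free stand-in that borrows its naming convention and shows the core idea — declarative, named expectations that return pass/fail results:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 34, 41], "id": [1, 2, 3]})

def expect_column_values_to_be_between(df, col, low, high):
    """Stand-in mimicking Great Expectations naming; not the real API."""
    ok = df[col].between(low, high).all()
    return {"expectation": f"{col} in [{low}, {high}]", "success": bool(ok)}

def expect_column_values_to_not_be_null(df, col):
    ok = df[col].notna().all()
    return {"expectation": f"{col} not null", "success": bool(ok)}

# A suite of expectations evaluated against the DataFrame
results = [
    expect_column_values_to_be_between(df, "age", 0, 120),
    expect_column_values_to_not_be_null(df, "id"),
]
all_passed = all(r["success"] for r in results)
```

In the real library, a failing expectation can halt the pipeline or file a report — the point is that the rules live in code, not in someone's head.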
Run validation rules automatically — on every pipeline run! Manual checking doesn't scale.
Data Cleaning Pipeline
```
┌─────────────────────────────────────────────────┐
│             DATA CLEANING PIPELINE              │
├─────────────────────────────────────────────────┤
│  ┌───────────┐   ┌───────────┐   ┌───────────┐  │
│  │ RAW DATA  │──▶│  PROFILE  │──▶│   CLEAN   │  │
│  │ CSV/DB    │   │ Stats     │   │ Missing   │  │
│  │ API       │   │ Types     │   │ Duplicates│  │
│  │ Files     │   │ Nulls %   │   │ Formats   │  │
│  └───────────┘   └───────────┘   └─────┬─────┘  │
│                                        │        │
│  ┌───────────┐   ┌───────────┐   ┌─────▼─────┐  │
│  │  OUTPUT   │◀──│ VALIDATE  │◀──│ TRANSFORM │  │
│  │ Clean DB  │   │ Rules     │   │ Outliers  │  │
│  │ Parquet   │   │ Tests     │   │ Types     │  │
│  │ Report    │   │ Quality % │   │ Encode    │  │
│  └───────────┘   └───────────┘   └───────────┘  │
└─────────────────────────────────────────────────┘
```
Data Profiling — Always the First Step!
Before you start cleaning, profile the data — understand what you are working with:
Profile Report shows:
- Distribution of each column
- Correlation between columns
- Missing value patterns
- Duplicate row count
- Data type warnings
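Tools like ydata-profiling generate a full HTML report from a DataFrame; a minimal manual profile along the same lines can be sketched in plain pandas (the dataset is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [25, 30, None, 41],
    "city": ["Chennai", "Chennai", None, "Madurai"],
})

# Quick profile: types, null percentage, cardinality, basic stats
profile = {
    "dtypes":   df.dtypes.astype(str).to_dict(),
    "null_pct": (df.isna().mean() * 100).to_dict(),
    "nunique":  df.nunique().to_dict(),
}
stats = df.describe()  # distributions of numeric columns
```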
Five minutes of profiling can save five hours of debugging! Always profile first, then clean.
Encoding & Type Issues
⚠️ Silent killers — encoding and type issues!
🔴 Character Encoding:
- "cafÃ©" instead of "café" — a UTF-8 vs Latin-1 mismatch
- Fix: df = pd.read_csv('file.csv', encoding='utf-8')
- Still broken? Try: encoding='latin-1' or encoding='cp1252'
🔴 Numeric as String:
- "1,234.56" — the comma makes it parse as a string
- Fix: df['amount'] = df['amount'].str.replace(',', '').astype(float)
🔴 Boolean Confusion:
- "Yes", "Y", "1", "TRUE", "true" — all mean True
- Fix: map them to a standard boolean
🔴 Date as String:
- If "2025-01-15" is stored as a string, date operations won't work
- Fix: use pd.to_datetime()
🔴 Mixed Types in a Column:
- Some rows integer, some string — pandas silently falls back to the object dtype
- Fix: pd.to_numeric(df['col'], errors='coerce') — invalid values become NaN
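The fixes above, combined into one hedged pandas sketch (the columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "amount": ["1,234.56", "2,000", "abc"],
    "active": ["Yes", "TRUE", "0"],
    "date":   ["2025-01-15", "2025-02-01", "2025-03-10"],
})

# Numeric stored as string: strip commas, coerce bad values to NaN
df["amount"] = pd.to_numeric(df["amount"].str.replace(",", ""), errors="coerce")

# Boolean confusion: map all spellings to real booleans
truthy = {"yes": True, "y": True, "1": True, "true": True,
          "no": False, "n": False, "0": False, "false": False}
df["active"] = df["active"].str.strip().str.lower().map(truthy)

# Date stored as string: parse into real datetimes
df["date"] = pd.to_datetime(df["date"])
```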
Real-World: Hospital Data Cleaning
An example of cleaning hospital patient data 🏥:
Raw Data Problems:
- Patient name: "KUMAR, R" vs "R. Kumar" vs "Kumar R"
- Blood group: "O+", "O positive", "O Positive", "O+"
- Date of birth: "15/03/1990", "1990-03-15", "March 15, 1990"
- Phone: Some with +91, some without, some with spaces
- Diagnosis: free text with typos — "Diabeties", "Diabtes", "Diabetes"
Cleaning Steps:
1. Names: parse into first_name and last_name, standardize case
2. Blood group: map all variations to standard codes (O+, A-, etc.)
3. DOB: pd.to_datetime() with format='mixed'
4. Phone: regex — keep only digits, ensure 10-digit format
5. Diagnosis: fuzzy matching to standard ICD codes
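A partial sketch of steps 2 and 4 (the mapping table and sample records are made up; a real mapping would cover every blood-group spelling seen in the data):

```python
import re
import pandas as pd

patients = pd.DataFrame({
    "blood_group": ["O+", "O positive", "o Positive", "A-"],
    "phone":       ["+91-9876543210", "09876543210", "98765 43210", "9876543210"],
})

# Step 2: normalize, then map blood-group variations to standard codes
bg_map = {"o positive": "O+", "o+": "O+", "a-": "A-"}  # hypothetical, incomplete
patients["blood_group"] = (
    patients["blood_group"].str.strip().str.lower().map(bg_map)
)

# Step 4: keep only digits, then take the last 10 (drops +91 / leading 0)
patients["phone"] = (
    patients["phone"].map(lambda p: re.sub(r"\D", "", p)[-10:])
)
```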
Result: 50,000 records cleaned, 3,000 duplicates removed, 2% invalid records flagged for manual review.
Impact: correct patient matching improved from 78% to 97%!
Prompt: Automate Data Cleaning
Data Cleaning Best Practices
To clean data like a pro, follow these:
✅ Always profile first — don't start cleaning without understanding the data
✅ Document everything — what you cleaned, why, and how, for future reference
✅ Keep the original data — never modify the raw data; work on a copy
✅ Automate repeatable tasks — if the same cleaning runs weekly, write a script
✅ Validate after cleaning — verify the cleaning actually did what you intended
✅ Handle edge cases — empty strings, whitespace-only values, special characters
✅ Test with samples — test on a sample before running on the full dataset
✅ Version your cleaning code — maintain it in Git
Remember: clean data = trustworthy insights = better decisions = happy stakeholders!
Next up: Real-time data systems — how to handle live data!
Key Takeaways
✅ Garbage In, Garbage Out — data quality determines model quality. Even a perfect model produces garbage output from dirty data
✅ Dirty Data Types — missing values, duplicates, format inconsistencies, invalid values, outliers, encoding issues, stale data. 80% of issues come from three: missing values, duplicates, and formats
✅ Missing Values — under 5% missing, deleting rows is fine. 5-30%, impute (mean, median, mode, KNN). Over 50%, drop the column or investigate. Domain expertise matters
✅ Duplicates Are Dangerous — exact duplicates are easy to remove; fuzzy duplicates are tricky (fuzzywuzzy library). Duplicate rows inflate metrics and skew analysis
✅ Standardization Is Critical — keep names, dates, phone numbers, and addresses in consistent formats so that "Chennai", "CHENNAI", "Chenai" group correctly
✅ Understand Outliers — don't remove them blindly; understand them first. Keep real outliers (billionaire income), fix errors (age 999)
✅ Data Profiling First — profile before you start cleaning: describe, dtypes, nulls, nunique, distributions
✅ Automation Recommended — manual cleaning is tedious and error-prone. Automate tests with the Great Expectations library; production pipelines need built-in validation rules
🎮 Mini Challenge
Challenge: Clean a Real-World Messy Dataset
A hands-on cleanup of realistic dirty data:
1. Create a messy dataset — 5 min
2. Profile the data — 5 min
3. Clean the data — 20 min
4. Validate — 5 min
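The four steps can be sketched end-to-end in a few lines of pandas (the dataset is a toy stand-in):

```python
import pandas as pd

# 1. Create a messy dataset (hypothetical)
df = pd.DataFrame({
    "name": ["kumar ", "KUMAR", "Priya", "Priya", None],
    "age":  [25, 25, -5, 30, 30],
})

# 2. Profile: how much is missing before we touch anything?
null_pct_before = df.isna().mean() * 100

# 3. Clean: standardize text, null out impossible ages, drop duplicates
df["name"] = df["name"].str.strip().str.title()
df["age"] = df["age"].where(df["age"].between(0, 120))  # invalid -> NaN
df = df.drop_duplicates()

# 4. Validate: every remaining age must be plausible
assert df["age"].dropna().between(0, 120).all()
```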
Learning: realistic data is really messy! You need patience and a systematic approach.
💼 Interview Questions
Q1: Data cleaning — manual or automated?
A: Manual at first (understand the data, identify patterns), then automated (pipelines, validation rules, Great Expectations). In production: always automated — manual doesn't scale. But defining the rules still takes manual work!
Q2: Cleaning strategy — delete rows or impute?
A: Context dependent! Under 5% missing, deletion is fine. Over 30% missing, imputation gets risky. Medical data — consult a domain expert. Financial data — validate strictly. General rule: understand WHY the data is missing — random vs systematic. Random — imputation is fine; systematic — investigate!
Q3: Should outliers always be removed?
A: NO! First understand whether it's a real outlier (billionaire income) or an error (age 999). Domain knowledge is crucial. ML models: remove or cap. Statistical analysis: may include them (they represent real variation). Business: keep them (valuable insight). It's a case-by-case decision!
Q4: How do you make data cleaning reproducible?
A: You MUST! Document every step:
- Which rows were deleted, and why
- Which values were imputed, and by what method
- Which validation rules were applied
- Before-and-after metrics
Version control (Git): version the code and the cleaned data together. Reproducibility is essential in production: same input → same output, always.
Q5: What is the impact of dirty data on a production model?
A: Model accuracy drops, bias creeps in, wrong decisions get made. Examples: a negative age → the model fails; an invalid email → the message is undeliverable; wrong financial data → inaccurate reports. Add regulatory fines (GDPR) and reputation damage. Data cleaning is an INVESTMENT, not a cost!
Frequently Asked Questions
What should you do when 70% of a column's values are missing?