Preparing data for AI
Introduction
"Give me six hours to chop down a tree, and I will spend the first four sharpening the axe." โ Abraham Lincoln ๐ช
AI la idhu 100% true. Best model + bad data = bad results. But average model + great data = amazing results! ๐ฏ
Data Scientists oda 80% time data preparation la dhaan pogudhu. Boring ah theriyum, but idhu dhaan difference between oru working AI model and oru failed project.
Indha article la data preparation complete process paapom โ collection to model-ready data varai! ๐
Data Preparation Pipeline
```
+---------------------------------------------------------+
|             AI DATA PREPARATION PIPELINE                |
+---------------------------------------------------------+
| 1 COLLECT   -->  2 EXPLORE    -->  3 CLEAN              |
|   Sources          EDA/Stats         Handle missing     |
|   APIs             Visualize         Remove duplicates  |
|   Scrape           Understand        Fix errors         |
|                                                         |
| 4 TRANSFORM -->  5 FEATURE ENG -->  6 SPLIT             |
|   Normalize        Create new         Train/Val/Test    |
|   Encode           Select best        Stratify          |
|   Scale            Reduce dims        Time-based        |
|                                                         |
| 7 VALIDATE  -->  8 VERSION     -->  MODEL READY!        |
|   Quality          DVC/Git                              |
|   Bias check       Reproducible                         |
+---------------------------------------------------------+
```
Step 1: Data Collection
First step - where does the data come from?
Internal Sources:
- Company databases (MySQL, PostgreSQL)
- Log files (server logs, app events)
- CRM data (customer interactions)
External Sources:
- Public datasets (Kaggle, UCI, government data)
- APIs (Twitter, weather, stock market)
- Web scraping (ethically!)
- Third-party data vendors
Key considerations:
| Factor | Question |
|---|---|
| Volume | How much data do we need? |
| Velocity | How fresh should data be? |
| Variety | What formats? |
| Veracity | Is the data trustworthy? |
| Legal | Do we have permission? GDPR/privacy? |
Rule of thumb: Start with whatever data you have, validate its usefulness, then collect more if needed. Don't wait for "perfect" data - it doesn't exist!
Step 2: Exploratory Data Analysis (EDA)
Once you have collected the data, the next step is to understand it! Don't build a model blindly.
EDA Checklist:
Basic Stats
Distributions
- Numerical columns: histogram, box plot
- Categorical columns: value counts, bar chart
- Target variable: balanced or imbalanced?
Relationships
- Correlation matrix - how are the features related?
- Scatter plots - any visible patterns?
- Cross-tabulation - compare categories
Red Flags to Watch:
- 90%+ missing values in a column
- A single value dominates (99% same value)
- Impossible values (age = -5, price = 0)
- Data leakage - future data in the training set!
EDA tools: pandas, matplotlib, seaborn, ydata-profiling (generates an automatic EDA report!)
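The EDA checklist above can be sketched in a few lines of pandas, using a tiny made-up dataset (the column names here are hypothetical):

```python
import pandas as pd

# Toy dataset standing in for your real data (hypothetical columns)
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 25],
    "city": ["Chennai", "Madurai", "Chennai", "Salem", "Chennai", "Chennai"],
    "churned": [0, 0, 1, 0, 1, 0],
})

# Basic stats: shape, dtypes, summary statistics
print(df.shape)
stats = df.describe()

# Distributions: value counts for categoricals, target balance
city_counts = df["city"].value_counts()
target_balance = df["churned"].value_counts(normalize=True)

# Relationships: correlation matrix over numeric columns only
corr = df.corr(numeric_only=True)
```

For plots, `df["age"].hist()` and `seaborn.heatmap(corr)` cover most of the visual checklist.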
Step 3: Data Cleaning
Clean up the dirty data - the most time-consuming step!
Missing Values:
| Strategy | When to Use | Code |
|---|---|---|
| Drop rows | <5% missing, random | `df.dropna()` |
| Mean/Median fill | Numerical, normal dist | `df.fillna(df.mean(numeric_only=True))` |
| Mode fill | Categorical | `df.fillna(df.mode().iloc[0])` |
| Forward fill | Time series | `df.ffill()` |
| ML imputation | Complex patterns | KNNImputer |
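A minimal sketch of two of these fill strategies on made-up data (median fill and KNN imputation via scikit-learn):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0],
    "salary": [30000.0, 45000.0, np.nan, 80000.0],
})

# Median fill: robust to outliers, good default for numeric columns
median_filled = df.fillna(df.median(numeric_only=True))

# KNN imputation: estimates each missing value from the k nearest rows
imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```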
Duplicates:
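A quick pandas sketch for detecting and removing duplicates (made-up data):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "value": ["a", "b", "b", "c"]})

n_dupes = df.duplicated().sum()    # count exact duplicate rows
deduped = df.drop_duplicates()     # drop them, keeping the first occurrence
# Deduplicate on a key column only:
by_id = df.drop_duplicates(subset="id")
```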
Outliers:
- IQR method: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers
- Z-score: |z| > 3 = outlier
- Domain knowledge: age > 150 is obviously wrong!
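Both statistical checks sketched on a toy series (note: on very small samples the z-score test can miss even an extreme outlier, because the outlier inflates the standard deviation):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is the obvious outlier

# IQR method
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = s[(s < lower) | (s > upper)]

# Z-score method (assumes roughly normal data; weak on tiny samples,
# since 95 itself pulls the mean and std upward)
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]
```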
Data Type Fixes:
- String "123" โ Integer 123
- "2026-02-17" โ datetime object
- "Male"/"M"/"male" โ standardize to "M"
Remember: Document every cleaning step! Reproducibility matters.
Step 4: Feature Engineering - The Art!
Feature engineering is what separates a good ML engineer from a great one!
What is Feature Engineering?
Creating meaningful signals for the model out of raw data.
Common Techniques:
1. Date Features
2. Text Features
- Word count, character count
- Sentiment score
- TF-IDF vectors
- Embeddings (BERT, Word2Vec)
3. Aggregation Features
4. Interaction Features
5. Binning
Age → Age groups (0-18, 19-35, 36-60, 60+)
Income → Low/Medium/High
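A few of these techniques sketched in pandas (made-up columns; `pd.cut` handles the binning):

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2026-01-05", "2026-02-14"]),
    "price": [100.0, 250.0],
    "qty": [2, 1],
    "age": [17, 45],
})

# 1. Date features: pull signals out of a timestamp
df["month"] = df["order_date"].dt.month
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5

# 4. Interaction feature: combine two columns
df["total"] = df["price"] * df["qty"]

# 5. Binning: continuous age -> age groups
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                         labels=["0-18", "19-35", "36-60", "60+"])
```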
Encoding Categorical Data
ML models only understand numbers! Categorical data must be converted:
Label Encoding - assign a number to each category
Problem: the model thinks Green(2) > Red(0) - but colors have no order!
One-Hot Encoding - a separate column for each category
No ordering problem. But too many categories = too many columns!
Target Encoding - replace each category with the target mean
Data leakage risk - be careful!
Frequency Encoding - replace each category with its frequency
Rule: Few categories (<10) → One-Hot. Many categories → Target/Frequency encoding.
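Three of these encodings sketched in pandas on toy data (in a real project, compute target encodings on the training split only, exactly because of the leakage risk noted above):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Green", "Red", "Blue"],
                   "target": [1, 0, 1, 0]})

# One-hot encoding: one 0/1 column per category
onehot = pd.get_dummies(df["color"], prefix="color")

# Frequency encoding: replace each category with how often it appears
freq = df["color"].map(df["color"].value_counts())

# Target encoding: replace each category with the mean target value
target_enc = df["color"].map(df.groupby("color")["target"].mean())
```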
Feature Scaling
Features come in different scales - normalize them!
Why? Age (0-100) vs Salary (10,000-1,000,000) - the salary column will dominate the model's calculations.
| Method | Formula | Range | When to Use |
|---|---|---|---|
| **Min-Max** | (x-min)/(max-min) | 0 to 1 | Neural networks |
| **Standard** | (x-mean)/std | ~-3 to +3 | Linear models, SVM |
| **Robust** | (x-median)/IQR | Varies | When outliers are present |
| **Log Transform** | log(x) | Varies | Skewed data |
Important: call fit_transform only on the training data. For test data, only transform - never fit! This avoids data leakage.
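A minimal scikit-learn sketch of the fit-on-train-only rule:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[25.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train ONLY
X_test_scaled = scaler.transform(X_test)        # transform test - no fit!
```

The test point is scaled using the training mean and standard deviation, so the model never sees any statistic computed from test data.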
Train/Validation/Test Split
Split the data into 3 parts for the model:
Standard Split:
- Training (70-80%) - the model learns from this
- Validation (10-15%) - hyperparameter tuning
- Test (10-15%) - final evaluation (don't touch it until the end!)
Stratified Split - important for imbalanced data!
Time-based Split - for time series data:
- Train: Jan-Oct data
- Validate: Nov data
- Test: Dec data
- NEVER shuffle time series! Future data will leak!
Cross-validation - best for small datasets:
- K-Fold (k=5 or 10)
- Each fold gets to be test set once
- More reliable evaluation
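A stratified split sketched with scikit-learn (made-up imbalanced labels; `stratify=y` keeps the 90/10 class ratio in both halves):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)   # imbalanced: 90% negative, 10% positive

# Stratified split preserves the class ratio in train AND test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```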
Data Labeling for Supervised Learning
Supervised ML needs labeled data - and labeling is expensive and time-consuming!
Labeling Methods:
1. Manual Labeling
- Humans tag each data point
- Most accurate but slowest
- Cost: $0.01-$1 per label
- Tools: Label Studio, Labelbox, Amazon MTurk
2. Semi-Automated
- Model predicts, human verifies
- Active Learning: the model sends the samples it is uncertain about to humans
- 5x faster than fully manual
3. Weak Supervision
- Uses rules and heuristics
- Example: Email with "lottery" = spam (rule-based)
- Tool: Snorkel
- Noisy but fast
4. Self-Supervised
- Labels are created from the data itself
- GPT training: next word prediction - no manual labels!
- Most modern LLMs use this
Labeling Quality Tips:
- Multiple annotators per sample (majority vote)
- Document clear labeling guidelines
- Measure inter-annotator agreement
- Run regular quality audits
Data Augmentation: More Data, Less Collection!
Data augmentation - create MORE training data from the data you already have!
Image Augmentation
- Rotate: 90°, 180°, 270°
- Flip: horizontal, vertical
- Crop: random sections
- Color: brightness, contrast change
- Noise: add slight blur or noise
1 image → 10-50 variations!
Text Augmentation
- Synonym replacement: "happy" → "joyful"
- Back translation: English → Tamil → English
- Random insertion/deletion
- Paraphrasing with LLMs
Tabular Augmentation
- SMOTE: create synthetic minority-class samples (for imbalanced data)
- Noise injection: add slight random noise
- Mixup: interpolate between two samples
Audio Augmentation
- Speed change: faster/slower
- Pitch shift: higher/lower
- Add background noise
- Time stretch
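Two of the tabular techniques can be sketched with NumPy alone (SMOTE lives in the separate imbalanced-learn package, so it is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 2.0], [3.0, 4.0]])

# Noise injection: add small Gaussian noise to create near-duplicates
X_noisy = X + rng.normal(0, 0.01, size=X.shape)

# Mixup: a new synthetic sample as a weighted blend of two real ones
lam = 0.7
x_mix = lam * X[0] + (1 - lam) * X[1]
```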
Using data augmentation can improve model accuracy by 5-20%! It is a game-changer especially for small datasets.
Data Leakage - The Silent Killer!
Data Leakage = test or future information mixing into training
99% accuracy in training, 50% in production - that's a sign of leakage!
Common Leakage Sources:
Target Leakage - using a feature directly related to the target
- Predicting "will patient recover?" while using "discharge_date" as a feature!
Temporal Leakage - using future data in training
- Predicting stock price while using the next day's data for today's prediction
Train-Test Contamination - test data information leaking into training
- Scaling: fitting on the full dataset and then splitting (WRONG!)
- Correct: split first, then fit on train only
Duplicate Leakage - the same record appearing in BOTH train and test
Prevention:
1. Split FIRST, process AFTER
2. Time-based splits for temporal data
3. Check feature importance - investigate anything suspiciously high
4. Cross-validate properly
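A sketch of "split FIRST, process AFTER" using a scikit-learn Pipeline, which guarantees the scaler is fitted on training data only (synthetic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.default_rng(1).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)   # linearly separable toy target

# Split FIRST...
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ...then let the Pipeline fit the scaler on the training data only;
# at predict time it reuses the training statistics, never the test set's
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
```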
Data Versioning - Track Your Data!
We use Git for code - but what about data?
Why version data?
- The model trained 3 months ago was better - but where is that data now?
- Added new data and the model got worse - need to roll back
- Regulatory compliance - which data was used for which model?
Tools:
| Tool | Type | Best For |
|---|---|---|
| **DVC** | Git-like for data | ML projects |
| **LakeFS** | Git for data lakes | Data lake versioning |
| **Delta Lake** | Time travel | Lakehouse |
| **MLflow** | Experiment tracking | Full ML lifecycle |
DVC Example:
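A minimal sketch of the DVC workflow (assumes DVC is installed, the project is already a Git repo, a DVC remote is configured, and `data.csv` is a placeholder filename):

```shell
# One-time: initialize DVC alongside Git
dvc init

# Track the data file with DVC; Git tracks only the small .dvc pointer
dvc add data.csv
git add data.csv.dvc .gitignore
git commit -m "Track data.csv v1 with DVC"

# Push the actual data to remote storage
dvc push

# Later: restore the data exactly as it was at an old commit
git checkout <old-commit> -- data.csv.dvc
dvc checkout
```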
Now you can always go back to any data version! Reproducible ML = professional ML.
Prompt: Data Preparation Plan
✅ Key Takeaways
Summary:
✅ 80% of AI project time = data preparation
✅ EDA first - understand before transforming
✅ Cleaning - handle missing values, duplicates, outliers
✅ Feature Engineering - extract useful signals from raw data
✅ Encoding - convert categorical data to numerical
✅ Scaling - bring features to the same scale
✅ Split - train/val/test properly, stratify if needed
✅ Data Leakage - the silent killer, always watch out!
✅ Version data - use DVC or a similar tool
"In God we trust; all others must bring data." โ And that data better be well-prepared! ๐
Next article: SQL for AI Apps - how to use SQL for AI data work!
Mini Challenge
Challenge: Prepare a dataset and train an ML model
Practice the complete data preparation cycle:
Step 1 (Load & Explore - 10 min):
Step 2 (Clean - 10 min):
Step 3 (Features - 10 min):
Step 4 (Split & Scale - 5 min):
Step 5 (Train - 5 min):
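The five steps above can be compressed into one end-to-end sketch (toy Titanic-style data with hypothetical columns; scikit-learn assumed):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Step 1: Load & explore (tiny made-up dataset)
df = pd.DataFrame({
    "age": [22, 38, np.nan, 35, 54, 2, 27, 14],
    "fare": [7.25, 71.3, 7.9, 53.1, 51.9, 21.1, 11.1, 30.1],
    "sex": ["M", "F", "F", "F", "M", "M", "F", "M"],
    "survived": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Step 2: Clean - fill missing age with the median
df["age"] = df["age"].fillna(df["age"].median())

# Step 3: Features - encode the categorical column
df["is_female"] = (df["sex"] == "F").astype(int)
X = df[["age", "fare", "is_female"]]
y = df["survived"]

# Step 4: Split & scale (fit the scaler on train only!)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Step 5: Train & evaluate
model = LogisticRegression().fit(X_train_s, y_train)
acc = model.score(X_test_s, y_test)
```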
Learning: 80% of the time on data, 20% on the model - it's ALWAYS about data quality!
Interview Questions
Q1: Data preparation takes 60-80% of project time - why?
A: Real-world data is messy! Missing values, duplicates, inconsistencies, outliers, multiple formats. Cleaning, validating, and understanding it all takes time. AI model quality = data quality. Garbage data → garbage model. A one-month project can stretch to four months on data work alone!
Q2: Feature engineering - art or science?
A: A bit of both! Science: statistical methods (correlation, dimensionality reduction). Art: domain knowledge (knowing which features matter). "Age" is a raw feature, but "age_group" is often a better ML feature. Experienced engineers apply domain expertise - their features are consistently better!
Q3: Training-serving skew - a practical example?
A: Training: compute avg_salary over the last 30 days. Serving: avg_salary over the last 1 hour? Different features → different predictions! Solution: a Feature Store - the same features in both places. Or a standardized pipeline - deterministic, reproducible feature computation!
Q4: Imbalanced data (95% negative, 5% positive) - how to handle it?
A: Techniques: undersampling (downsample the majority), oversampling (SMOTE - synthetic samples), class weights, stratified splits. Business context matters - fraud may be 0.1% of the data but very costly, so handle it specially! Class balance != realistic - the real world is often imbalanced.
Q5: Data leakage - what are the subtle ways it happens?
A: The subtle cases are tricky! Including future information at training time (a feature directly related to the target). Scaling the full dataset before splitting (information leak). Exploring test data alongside training data. Shuffling time series (mixing past and future). Prevention: understand your data, think critically, split FIRST, THEN process!
Frequently Asked Questions
Where should I call fit_transform for feature scaling (StandardScaler)?