
Preparing data for AI

Intermediate · 15 min read · 📅 Updated: 2026-02-17

Introduction

"Give me six hours to chop down a tree, and I will spend the first four sharpening the axe." — Abraham Lincoln 🪓


In AI, this is 100% true. Best model + bad data = bad results. But average model + great data = amazing results! 🎯


Data scientists spend about 80% of their time on data preparation. It sounds boring, but this is the difference between a working AI model and a failed project.


In this article we will walk through the complete data preparation process — from collection to model-ready data! 🚀

Data Preparation Pipeline

🏗️ Architecture Diagram
┌─────────────────────────────────────────────────┐
│          AI DATA PREPARATION PIPELINE            │
├─────────────────────────────────────────────────┤
│                                                   │
│  ① COLLECT ──▶ ② EXPLORE ──▶ ③ CLEAN            │
│      │              │              │              │
│   Sources        EDA/Stats     Handle missing    │
│   APIs           Visualize     Remove duplicates │
│   Scrape         Understand    Fix errors        │
│                                                   │
│  ④ TRANSFORM ──▶ ⑤ FEATURE ENG ──▶ ⑥ SPLIT      │
│      │                │                │          │
│   Normalize       Create new       Train/Val/Test│
│   Encode          Select best      Stratify      │
│   Scale           Reduce dims      Time-based    │
│                                                   │
│  ⑦ VALIDATE ──▶ ⑧ VERSION ──▶ 🤖 MODEL READY!   │
│      │              │                             │
│   Quality         DVC/Git                         │
│   Bias check      Reproducible                    │
└─────────────────────────────────────────────────┘

Step 1: Data Collection

First step — where will the data come from?


Internal Sources:

  • 📊 Company databases (MySQL, PostgreSQL)
  • 📁 Log files (server logs, app events)
  • 📧 CRM data (customer interactions)

External Sources:

  • 🌐 Public datasets (Kaggle, UCI, government data)
  • 🔌 APIs (Twitter, weather, stock market)
  • 🕷️ Web scraping (ethically!)
  • 🛒 Third-party data vendors

Key considerations:

| Factor | Question |
| --- | --- |
| Volume | How much data is needed? |
| Velocity | How fresh should the data be? |
| Variety | What formats? |
| Veracity | Is the data trustworthy? |
| Legal | Do we have permission? GDPR/privacy? |

Rule of thumb: Start with whatever data you have, validate its usefulness, then collect more if needed. Don't wait for "perfect" data — it doesn't exist! 😅

Step 2: Exploratory Data Analysis (EDA)

Once the data is collected — next step: understand your data! Don't blindly build a model.


EDA Checklist:


📊 Basic Stats

python
df.describe()  # mean, std, min, max
df.info()      # data types, null counts
df.shape       # rows, columns

📈 Distributions

  • Numerical columns: histogram, box plot
  • Categorical columns: value counts, bar chart
  • Target variable: balanced or imbalanced?

🔗 Relationships

  • Correlation matrix — how are the features related?
  • Scatter plots — are any patterns visible?
  • Cross-tabulation — compare categories

⚠️ Red Flags to Watch:

  • 🔴 90%+ missing values in a column
  • 🔴 Single value dominates (99% same value)
  • 🔴 Impossible values (age = -5, price = 0)
  • 🔴 Data leakage — future data in training set!

EDA tools: pandas, matplotlib, seaborn, ydata-profiling (generates an automatic EDA report!) 📊

Step 3: Data Cleaning

Clean the dirty data — the most time-consuming step!


Missing Values:

| Strategy | When to Use | Code |
| --- | --- | --- |
| Drop rows | <5% missing, random | `df.dropna()` |
| Mean/Median fill | Numerical, normal dist | `df.fillna(df.mean())` |
| Mode fill | Categorical | `df.fillna(df.mode().iloc[0])` |
| Forward fill | Time series | `df.ffill()` |
| ML imputation | Complex patterns | `KNNImputer` |
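The ML-imputation row can be sketched with scikit-learn's KNNImputer, which fills each gap from the most similar rows. The numbers below are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric table with gaps
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40],
    "salary": [30000, 35000, 38000, np.nan],
})

# Each missing value is filled from the k most similar rows (here k=2)
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)  # no NaNs remain
```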

Duplicates:

python
df.duplicated().sum()  # count
df.drop_duplicates()   # remove

Outliers:

  • IQR method: values outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR = outliers
  • Z-score: |z| > 3 = outlier
  • Domain knowledge: age > 150 is obviously wrong!

Data Type Fixes:

  • String "123" → Integer 123
  • "2026-02-17" → datetime object
  • "Male"/"M"/"male" → standardize to "M"
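The three type fixes above, sketched with pandas (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical raw columns with the three problems above
df = pd.DataFrame({
    "qty":  ["123", "45", "7"],
    "date": ["2026-02-17", "2026-02-18", "2026-02-19"],
    "sex":  ["Male", "M", "male"],
})

df["qty"] = pd.to_numeric(df["qty"])        # "123" -> 123
df["date"] = pd.to_datetime(df["date"])     # string -> datetime object
df["sex"] = df["sex"].str.upper().str[0]    # "Male"/"M"/"male" -> "M"
print(df.dtypes)
```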

Remember: Document every cleaning step! Reproducibility important. 📝

Step 4: Feature Engineering — The Art! 🎨

Feature engineering — this is what separates a good ML engineer from a great one!


What is Feature Engineering?

Creating meaningful signals for the model from raw data.


Common Techniques:


1. Date Features 📅

python
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5,6])
df['month'] = df['date'].dt.month
df['hour'] = df['timestamp'].dt.hour

2. Text Features 📝

  • Word count, character count
  • Sentiment score
  • TF-IDF vectors
  • Embeddings (BERT, Word2Vec)

3. Aggregation Features 📊

python
df['avg_purchase_last_30d'] = ...  # Customer average
df['total_orders'] = ...            # Lifetime count
df['days_since_last_login'] = ...   # Recency

4. Interaction Features ✖️

python
df['price_per_sqft'] = df['price'] / df['area']
df['bmi'] = df['weight'] / (df['height']**2)

5. Binning 📦

Age → Age groups (0-18, 19-35, 36-60, 60+)

Income → Low/Medium/High

Encoding Categorical Data

💡 Tip

ML models only understand numbers! Categorical data must be converted:

💡 Label Encoding — Assign a number to each category

code
Red=0, Blue=1, Green=2

⚠️ Problem: The model thinks Green(2) > Red(0) — but colors have no order!

💡 One-Hot Encoding — A separate column for each category

code
Red:   [1, 0, 0]
Blue:  [0, 1, 0]
Green: [0, 0, 1]

✅ No ordering problem. But too many categories = too many columns!

💡 Target Encoding — Replace each category with the target mean

code
City → Average house price of that city

⚠️ Data leakage risk — careful!

💡 Frequency Encoding — Replace each category with its frequency

code
Mumbai → 0.35 (35% of data)
Delhi → 0.25 (25% of data)

Rule: Few categories (<10) → One-Hot. Many categories → Target/Frequency encoding.
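A minimal sketch of that rule with pandas (hypothetical data): `get_dummies` for the low-cardinality column, a frequency map for the high-cardinality one:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],         # few categories
    "city":  ["Mumbai", "Delhi", "Mumbai", "Mumbai"],  # imagine many categories
})

# Few categories -> one-hot
onehot = pd.get_dummies(df["color"], prefix="color")
print(onehot)

# Many categories -> frequency encoding
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)
print(df[["city", "city_freq"]])  # Mumbai -> 0.75, Delhi -> 0.25
```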

Feature Scaling

Features come in different scales — they must be normalized!


Why? Age (0-100) vs Salary (10000-1000000) — the model will think salary dominates just because its numbers are bigger.


| Method | Formula | Range | When to Use |
| --- | --- | --- | --- |
| **Min-Max** | (x-min)/(max-min) | 0 to 1 | Neural networks |
| **Standard** | (x-mean)/std | ~-3 to +3 | Linear models, SVM |
| **Robust** | (x-median)/IQR | Varies | When outliers are present |
| **Log Transform** | log(x) | Varies | Skewed data |

python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()
df[['age', 'salary']] = scaler.fit_transform(df[['age', 'salary']])

Important: Apply fit_transform only on the training data. On test data, use transform only — never fit! This avoids data leakage. ⚠️
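The leak-free pattern can be sketched like this (random data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random data for illustration: 100 rows, 2 features
X = np.random.default_rng(42).normal(50, 10, size=(100, 2))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```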

Train/Validation/Test Split

Split the data into 3 parts for the model:


Standard Split:

  • 🟢 Training (70-80%) — the model learns from this
  • 🟡 Validation (10-15%) — hyperparameter tuning
  • 🔴 Test (10-15%) — final evaluation (don't touch it until the end!)

python
from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)

Stratified Split — important for imbalanced data!

python
# 95% non-fraud, 5% fraud — the ratio must be maintained
train_test_split(X, y, stratify=y)

Time-based Split — for time series data:

  • Train: Jan-Oct data
  • Validate: Nov data
  • Test: Dec data
  • NEVER shuffle time series! Future data will leak in! ⏰
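A time-based split needs no library helper, just date filtering. A sketch with pandas on hypothetical daily data:

```python
import pandas as pd

# Hypothetical daily data for one year
df = pd.DataFrame({"date": pd.date_range("2025-01-01", "2025-12-31", freq="D")})
df["value"] = range(len(df))

df = df.sort_values("date")  # order by time, never shuffle
train = df[df["date"] < "2025-11-01"]                                   # Jan-Oct
val   = df[(df["date"] >= "2025-11-01") & (df["date"] < "2025-12-01")]  # Nov
test  = df[df["date"] >= "2025-12-01"]                                  # Dec
print(len(train), len(val), len(test))  # 304 30 31
```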

Cross-validation — best for small datasets:

  • K-Fold (k=5 or 10)
  • Each fold gets to be test set once
  • More reliable evaluation
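K-Fold is a one-liner with scikit-learn's `cross_val_score`. A sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic dataset; each of the 5 folds serves as the test set once
X, y = make_classification(n_samples=200, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```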

Data Labeling for Supervised Learning

Supervised ML needs labeled data — and labeling is expensive and time-consuming!


Labeling Methods:


1. Manual Labeling 👷

  • Humans tag each data point
  • Most accurate but slowest
  • Cost: $0.01-$1 per label
  • Tools: Label Studio, Labelbox, Amazon MTurk

2. Semi-Automated 🤖+👷

  • Model predicts, human verifies
  • Active Learning: the model sends its most uncertain samples to a human
  • 5x faster than fully manual

3. Weak Supervision 📏

  • Uses rules and heuristics
  • Example: Email with "lottery" = spam (rule-based)
  • Tool: Snorkel
  • Noisy but fast
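The "lottery = spam" rule can be sketched as labeling functions in the spirit of Snorkel (plain Python here, not the Snorkel API; the emails are made up):

```python
# Labeling functions in the spirit of Snorkel (plain Python, not the Snorkel API)
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_lottery(email: str) -> int:
    # Rule: "lottery" anywhere in the email -> spam
    return SPAM if "lottery" in email.lower() else ABSTAIN

def lf_greeting(email: str) -> int:
    # Rule: a personal greeting -> probably not spam
    return NOT_SPAM if email.lower().startswith("hi ") else ABSTAIN

emails = ["You won the LOTTERY!!!", "Hi team, meeting at 3pm"]
labels = [[lf(e) for lf in (lf_lottery, lf_greeting)] for e in emails]
print(labels)  # [[1, -1], [-1, 0]]; noisy votes a label model would combine
```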

4. Self-Supervised 🔄

  • Labels are created from the data itself
  • GPT training: next word prediction — no manual labels!
  • Most modern LLMs use this

Labeling Quality Tips:

  • Multiple annotators per sample (majority vote)
  • Document clear labeling guidelines
  • Measure inter-annotator agreement
  • Run regular quality audits
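Inter-annotator agreement is commonly measured with Cohen's kappa, available in scikit-learn. A sketch with two hypothetical annotators:

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two hypothetical annotators on the same 8 samples
annotator_a = ["spam", "ham", "spam", "ham", "spam", "ham", "ham", "spam"]
annotator_b = ["spam", "ham", "spam", "spam", "spam", "ham", "ham", "spam"]

# 1.0 = perfect agreement, 0 = chance-level agreement
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.75 for these labels
```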

Data Augmentation: More Data, Less Collection! 🎯

Example

Data augmentation — create MORE training data from your existing data!

Image Augmentation 🖼️

- Rotate: 90°, 180°, 270°

- Flip: horizontal, vertical

- Crop: random sections

- Color: brightness, contrast change

- Noise: add slight blur or noise

1 image → 10-50 variations!

Text Augmentation 📝

- Synonym replacement: "happy" → "joyful"

- Back translation: English → Tamil → English

- Random insertion/deletion

- Paraphrasing with LLMs

Tabular Augmentation 📊

- SMOTE: create synthetic minority samples (for imbalanced data)

- Noise injection: add slight random noise

- Mixup: interpolate between two samples
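Noise injection is the simplest of these to sketch. With NumPy, on hypothetical minority-class samples (SMOTE itself lives in the separate imbalanced-learn package):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # hypothetical minority-class samples

# Jitter each feature by a small fraction of its standard deviation
noise = rng.normal(scale=0.05 * X.std(axis=0), size=X.shape)
X_augmented = np.vstack([X, X + noise])
print(X_augmented.shape)  # (100, 4): dataset doubled
```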

Audio Augmentation 🎵

- Speed change: faster/slower

- Pitch shift: higher/lower

- Add background noise

- Time stretch

Data augmentation can improve model accuracy by 5-20%! Especially for small datasets, it's a game-changer. 🚀

Data Leakage — The Silent Killer!

⚠️ Warning

⚠️ Data Leakage = test/future information mixing into the training data

99% accuracy in training, 50% in production — that's a sign of leakage!

Common Leakage Sources:

🔴 Target Leakage — using a feature that directly encodes the target

- Predicting "will patient recover?" — using "discharge_date" as feature!

🔴 Temporal Leakage — using future data in training

- Predicting stock price — using next day's data for today's prediction

🔴 Train-Test Contamination — test-set information leaking into training

- Scaling: fitting on the full dataset and then splitting (WRONG!)

- Correct: split first, then fit on train only

🔴 Duplicate Leakage — the same record in both train AND test

Prevention:

1. Split FIRST, process AFTER

2. Time-based splits for temporal data

3. Check feature importance — investigate anything suspiciously high

4. Cross-validate properly

Data Versioning — Track Your Data!

We use Git for code — what about data?


Why version data?

  • The model trained 3 months ago was better — but where is that data?
  • Added new data and the model got worse — need to roll back
  • Regulatory compliance — which data used for which model?

Tools:

| Tool | Type | Best For |
| --- | --- | --- |
| **DVC** | Git-like for data | ML projects |
| **LakeFS** | Git for data lakes | Data lake versioning |
| **Delta Lake** | Time travel | Lakehouse |
| **MLflow** | Experiment tracking | Full ML lifecycle |

DVC Example:

bash
dvc init
dvc add data/training.csv
git add data/training.csv.dvc
git commit -m "v1 training data"
dvc push  # upload to remote storage

Now you can always go back to any data version! Reproducible ML = professional ML. 🏆

Prompt: Data Preparation Plan

📋 Copy-Paste Prompt
You are a senior ML engineer. A startup has raw customer data and wants to build a churn prediction model.

Data available:
- Customer demographics (name, age, city, plan_type)
- Usage logs (login_count, session_duration, features_used)
- Support tickets (date, category, resolution_time)
- Payment history (amount, date, failed_payments)

Create a complete data preparation plan:
1. EDA steps and expected findings
2. Cleaning strategy per data source
3. Feature engineering ideas (at least 10 features)
4. Encoding and scaling strategy
5. Train/test split approach
6. Potential data leakage risks

Explain in Tanglish.

Key Takeaways

Summary:


80% of AI project time = data preparation

EDA first — understand before transforming

Cleaning — handle missing values, duplicates, and outliers

Feature Engineering — extract useful signals from raw data

Encoding — convert categorical data to numbers

Scaling — bring features to the same scale

Split — train/val/test properly, stratify if needed

Data Leakage — the silent killer, always watch out!

Version data — use DVC or a similar tool


"In God we trust; all others must bring data." — And that data better be well-prepared! 😄


Next article: SQL for AI Apps — how to use SQL for AI data work! 🎯

🏁 🎮 Mini Challenge

Challenge: Prepare a dataset and train an ML model


Practice the complete data preparation cycle:


Step 1 (Load & Explore - 10 min):

python
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv('dataset.csv')
report = ProfileReport(df)
report.to_file('report.html')  # Open in browser – issues visible!

Step 2 (Clean - 10 min):

python
# Missing values
df = df.fillna(df.median(numeric_only=True))
# Duplicates
df = df.drop_duplicates()
# Outliers (IQR method)
Q1, Q3 = df['salary'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df = df[(df['salary'] > Q1 - 1.5*IQR) & (df['salary'] < Q3 + 1.5*IQR)]

Step 3 (Features - 10 min):

python
# Create meaningful features
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['young', 'middle', 'senior'])
df['years_employed'] = 2026 - df['join_year']

Step 4 (Split & Scale - 5 min):

python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = df.drop(columns=['target']), df['target']  # assuming the label column is 'target'
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)  # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Step 5 (Train - 5 min):

python
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier().fit(X_train_scaled, y_train)
print(f"Accuracy: {model.score(X_test_scaled, y_test):.2%}")

Learning: 80% of the time on data, 20% on the model – it's ALWAYS about data quality! 📊

💼 Interview Questions

Q1: Data preparation takes 60-80% of project time – why?

A: Real-world data is messy! Missing values, duplicates, inconsistencies, outliers, multiple formats. Cleaning, validating, and understanding it all takes time. AI model quality = data quality. Garbage data → garbage model. In a 6-month project, 4 months can go to data alone!


Q2: Feature engineering – art vs science?

A: A bit of both! Science: statistical methods (correlation, dimensionality reduction). Art: domain knowledge (knowing which features matter). "Age" is a raw feature, but "age_group" is often a better ML feature. Experienced engineers apply domain expertise – their features are consistently better!


Q3: Training-serving skew – practical example?

A: Training: compute avg_salary over the last 30 days. Serving: avg_salary over the last 1 hour? Different features → different predictions! Solution: a Feature Store – the same features in both places. Or a standardized pipeline – deterministic, reproducible feature computation!


Q4: Imbalanced data (95% negative, 5% positive) – handling?

A: Techniques: undersampling (downsample the majority), oversampling (SMOTE – synthetic samples), class weights, stratified splits. Business context matters – fraud may be 0.1% of the data but very costly, so handle it specially! Class balance != realism – the real world is often imbalanced.


Q5: Data leakage – subtle ways?

A: The subtle cases are tricky! Including future information at training time (a feature directly derived from the target). Scaling on the full dataset first (information leak). Exploring test-set information alongside training data. Shuffling time series (past-future mix). Prevention: understand your data, think critically, split FIRST, THEN process!

Frequently Asked Questions

How much time does data preparation take?
Typically 60-80% of total AI project time goes into data preparation. In a 6-month project, 4 months can be data work alone!
What is feature engineering?
Creating new, useful columns (features) for the ML model from raw data. Example: calculating age from date of birth.
How much training data is needed?
It depends on problem complexity. For simple classification, 1000+ samples can be enough. Complex deep learning needs millions. Quality > Quantity, always.
How should data be labeled?
Manual labeling (humans tag data), semi-automated (model suggests, human verifies), or automated (rule-based). Tools: Label Studio, Amazon SageMaker Ground Truth.
What is data augmentation?
Creating new training samples from existing data. Rotate/flip images or paraphrase text to generate more training data without collecting any.