
Preparing data for AI

Intermediate · 15 min read · 📅 Updated: 2026-02-17

Introduction

"Give me six hours to chop down a tree, and I will spend the first four sharpening the axe." — Abraham Lincoln 🪓


In AI, this is 100% true. Best model + bad data = bad results. But average model + great data = amazing results! 🎯


Data scientists spend about 80% of their time on data preparation. It sounds boring, but this is the difference between a working AI model and a failed project.


In this article we will walk through the complete data preparation process — from collection to model-ready data! 🚀

Data Preparation Pipeline

🏗️ Architecture Diagram
┌─────────────────────────────────────────────────┐
│          AI DATA PREPARATION PIPELINE            │
├─────────────────────────────────────────────────┤
│                                                   │
│  ① COLLECT ──▶ ② EXPLORE ──▶ ③ CLEAN            │
│      │              │              │              │
│   Sources        EDA/Stats     Handle missing    │
│   APIs           Visualize     Remove duplicates │
│   Scrape         Understand    Fix errors        │
│                                                   │
│  ④ TRANSFORM ──▶ ⑤ FEATURE ENG ──▶ ⑥ SPLIT      │
│      │                │                │          │
│   Normalize       Create new       Train/Val/Test│
│   Encode          Select best      Stratify      │
│   Scale           Reduce dims      Time-based    │
│                                                   │
│  ⑦ VALIDATE ──▶ ⑧ VERSION ──▶ 🤖 MODEL READY!   │
│      │              │                             │
│   Quality         DVC/Git                         │
│   Bias check      Reproducible                    │
└─────────────────────────────────────────────────┘

Step 1: Data Collection

First step — where will the data come from?


Internal Sources:

  • 📊 Company databases (MySQL, PostgreSQL)
  • 📁 Log files (server logs, app events)
  • 📧 CRM data (customer interactions)

External Sources:

  • 🌐 Public datasets (Kaggle, UCI, government data)
  • 🔌 APIs (Twitter, weather, stock market)
  • 🕷️ Web scraping (ethically!)
  • 🛒 Third-party data vendors

Key considerations:

| Factor | Question |
| --- | --- |
| Volume | How much data is needed? |
| Velocity | How fresh should the data be? |
| Variety | What formats? |
| Veracity | Is the data trustworthy? |
| Legal | Do we have permission? GDPR/privacy? |

Rule of thumb: Start with whatever data you have, validate its usefulness, then collect more if needed. Don't wait for "perfect" data — it doesn't exist! 😅

Step 2: Exploratory Data Analysis (EDA)

Once the data is collected — next step: understand your data! Don't blindly build a model.


EDA Checklist:


📊 Basic Stats

python
df.describe()  # mean, std, min, max
df.info()      # data types, null counts
df.shape       # rows, columns

📈 Distributions

  • Numerical columns: histogram, box plot
  • Categorical columns: value counts, bar chart
  • Target variable: balanced or imbalanced?

🔗 Relationships

  • Correlation matrix — how are the features related?
  • Scatter plots — are any patterns visible?
  • Cross-tabulation — compare categories

⚠️ Red Flags to Watch:

  • 🔴 90%+ missing values in a column
  • 🔴 Single value dominates (99% same value)
  • 🔴 Impossible values (age = -5, price = 0)
  • 🔴 Data leakage — future data in training set!

EDA tools: pandas, matplotlib, seaborn, ydata-profiling (generates an automatic EDA report!) 📊

Step 3: Data Cleaning

Clean the dirty data — the most time-consuming step!


Missing Values:

| Strategy | When to Use | Code |
| --- | --- | --- |
| Drop rows | <5% missing, random | `df.dropna()` |
| Mean/Median fill | Numerical, normal dist | `df.fillna(df.mean())` |
| Mode fill | Categorical | `df.fillna(df.mode().iloc[0])` |
| Forward fill | Time series | `df.ffill()` |
| ML imputation | Complex patterns | `KNNImputer` |
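The ML-imputation row can be sketched with scikit-learn's KNNImputer, which fills each gap from the most similar rows. The numbers below are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric table with gaps
df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40],
    "salary": [30000, 35000, 38000, np.nan],
})

# Each missing value is filled from the k most similar rows (here k=2)
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)  # no NaNs remain
```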

Duplicates:

python
df.duplicated().sum()  # count
df.drop_duplicates()   # remove

Outliers:

  • IQR method: values outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR = outliers
  • Z-score: |z| > 3 = outlier
  • Domain knowledge: age > 150 is obviously wrong!

Data Type Fixes:

  • String "123" → Integer 123
  • "2026-02-17" → datetime object
  • "Male"/"M"/"male" → standardize to "M"
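The three type fixes above, sketched with pandas (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical raw columns with the three problems above
df = pd.DataFrame({
    "qty":  ["123", "45", "7"],
    "date": ["2026-02-17", "2026-02-18", "2026-02-19"],
    "sex":  ["Male", "M", "male"],
})

df["qty"] = pd.to_numeric(df["qty"])        # "123" -> 123
df["date"] = pd.to_datetime(df["date"])     # string -> datetime object
df["sex"] = df["sex"].str.upper().str[0]    # "Male"/"M"/"male" -> "M"
print(df.dtypes)
```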

Remember: Document every cleaning step! Reproducibility important. 📝

Step 4: Feature Engineering — The Art! 🎨

Feature engineering — this is what separates a good ML engineer from a great one!


What is Feature Engineering?

Creating meaningful signals for the model from raw data.


Common Techniques:


1. Date Features 📅

python
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5,6])
df['month'] = df['date'].dt.month
df['hour'] = df['timestamp'].dt.hour

2. Text Features 📝

  • Word count, character count
  • Sentiment score
  • TF-IDF vectors
  • Embeddings (BERT, Word2Vec)

3. Aggregation Features 📊

python
df['avg_purchase_last_30d'] = ...  # Customer average
df['total_orders'] = ...            # Lifetime count
df['days_since_last_login'] = ...   # Recency

4. Interaction Features ✖️

python
df['price_per_sqft'] = df['price'] / df['area']
df['bmi'] = df['weight'] / (df['height']**2)

5. Binning 📦

Age → Age groups (0-18, 19-35, 36-60, 60+)

Income → Low/Medium/High

Encoding Categorical Data

💡 Tip

ML models only understand numbers! Categorical data must be converted:

💡 Label Encoding — Assign a number to each category

code
Red=0, Blue=1, Green=2

⚠️ Problem: The model thinks Green(2) > Red(0) — but colors have no order!

💡 One-Hot Encoding — A separate column for each category

code
Red:   [1, 0, 0]
Blue:  [0, 1, 0]
Green: [0, 0, 1]

✅ No ordering problem. But too many categories = too many columns!

💡 Target Encoding — Replace each category with the target mean

code
City → Average house price of that city

⚠️ Data leakage risk — careful!

💡 Frequency Encoding — Replace each category with its frequency

code
Mumbai → 0.35 (35% of data)
Delhi → 0.25 (25% of data)

Rule: Few categories (<10) → One-Hot. Many categories → Target/Frequency encoding.
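A minimal sketch of that rule with pandas (hypothetical data): `get_dummies` for the low-cardinality column, a frequency map for the high-cardinality one:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],         # few categories
    "city":  ["Mumbai", "Delhi", "Mumbai", "Mumbai"],  # imagine many categories
})

# Few categories -> one-hot
onehot = pd.get_dummies(df["color"], prefix="color")
print(onehot)

# Many categories -> frequency encoding
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)
print(df[["city", "city_freq"]])  # Mumbai -> 0.75, Delhi -> 0.25
```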

Feature Scaling

Features come in different scales — they must be normalized!


Why? Age (0-100) vs Salary (10000-1000000) — the model will think salary dominates just because its numbers are bigger.


| Method | Formula | Range | When to Use |
| --- | --- | --- | --- |
| **Min-Max** | (x-min)/(max-min) | 0 to 1 | Neural networks |
| **Standard** | (x-mean)/std | ~-3 to +3 | Linear models, SVM |
| **Robust** | (x-median)/IQR | Varies | When outliers are present |
| **Log Transform** | log(x) | Varies | Skewed data |

python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()
df[['age', 'salary']] = scaler.fit_transform(df[['age', 'salary']])

Important: Apply fit_transform only on the training data. On test data, use transform only — never fit! This avoids data leakage. ⚠️
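The leak-free pattern can be sketched like this (random data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random data for illustration: 100 rows, 2 features
X = np.random.default_rng(42).normal(50, 10, size=(100, 2))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```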

Train/Validation/Test Split

Split the data into 3 parts for the model:


Standard Split:

  • 🟢 Training (70-80%) — the model learns from this
  • 🟡 Validation (10-15%) — hyperparameter tuning
  • 🔴 Test (10-15%) — final evaluation (don't touch it until the end!)

python
from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)

Stratified Split — important for imbalanced data!

python
# 95% non-fraud, 5% fraud — the ratio must be maintained
train_test_split(X, y, stratify=y)

Time-based Split — for time series data:

  • Train: Jan-Oct data
  • Validate: Nov data
  • Test: Dec data
  • NEVER shuffle time series! Future data will leak in! ⏰
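A time-based split needs no library helper, just date filtering. A sketch with pandas on hypothetical daily data:

```python
import pandas as pd

# Hypothetical daily data for one year
df = pd.DataFrame({"date": pd.date_range("2025-01-01", "2025-12-31", freq="D")})
df["value"] = range(len(df))

df = df.sort_values("date")  # order by time, never shuffle
train = df[df["date"] < "2025-11-01"]                                   # Jan-Oct
val   = df[(df["date"] >= "2025-11-01") & (df["date"] < "2025-12-01")]  # Nov
test  = df[df["date"] >= "2025-12-01"]                                  # Dec
print(len(train), len(val), len(test))  # 304 30 31
```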

Cross-validation — best for small datasets:

  • K-Fold (k=5 or 10)
  • Each fold gets to be test set once
  • More reliable evaluation
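K-Fold is a one-liner with scikit-learn's `cross_val_score`. A sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic dataset; each of the 5 folds serves as the test set once
X, y = make_classification(n_samples=200, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```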

Data Labeling for Supervised Learning

Supervised ML needs labeled data — and labeling is expensive and time-consuming!


Labeling Methods:


1. Manual Labeling 👷

  • Humans tag each data point
  • Most accurate but slowest
  • Cost: $0.01-$1 per label
  • Tools: Label Studio, Labelbox, Amazon MTurk

2. Semi-Automated 🤖+👷

  • Model predicts, human verifies
  • Active Learning: the model sends its most uncertain samples to a human
  • 5x faster than fully manual

3. Weak Supervision 📏

  • Uses rules and heuristics
  • Example: Email with "lottery" = spam (rule-based)
  • Tool: Snorkel
  • Noisy but fast
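The "lottery = spam" rule can be sketched as labeling functions in the spirit of Snorkel (plain Python here, not the Snorkel API; the emails are made up):

```python
# Labeling functions in the spirit of Snorkel (plain Python, not the Snorkel API)
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_lottery(email: str) -> int:
    # Rule: "lottery" anywhere in the email -> spam
    return SPAM if "lottery" in email.lower() else ABSTAIN

def lf_greeting(email: str) -> int:
    # Rule: a personal greeting -> probably not spam
    return NOT_SPAM if email.lower().startswith("hi ") else ABSTAIN

emails = ["You won the LOTTERY!!!", "Hi team, meeting at 3pm"]
labels = [[lf(e) for lf in (lf_lottery, lf_greeting)] for e in emails]
print(labels)  # [[1, -1], [-1, 0]]; noisy votes a label model would combine
```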

4. Self-Supervised 🔄

  • Labels are created from the data itself
  • GPT training: next word prediction — no manual labels!
  • Most modern LLMs use this

Labeling Quality Tips:

  • Multiple annotators per sample (majority vote)
  • Document clear labeling guidelines
  • Measure inter-annotator agreement
  • Run regular quality audits
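Inter-annotator agreement is commonly measured with Cohen's kappa, available in scikit-learn. A sketch with two hypothetical annotators:

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two hypothetical annotators on the same 8 samples
annotator_a = ["spam", "ham", "spam", "ham", "spam", "ham", "ham", "spam"]
annotator_b = ["spam", "ham", "spam", "spam", "spam", "ham", "ham", "spam"]

# 1.0 = perfect agreement, 0 = chance-level agreement
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.75 for these labels
```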

Data Augmentation: More Data, Less Collection! 🎯

Example

Data augmentation — create MORE training data from your existing data!

Image Augmentation 🖼️

- Rotate: 90°, 180°, 270°

- Flip: horizontal, vertical

- Crop: random sections

- Color: brightness, contrast change

- Noise: add slight blur or noise

1 image → 10-50 variations!

Text Augmentation 📝

- Synonym replacement: "happy" → "joyful"

- Back translation: English → Tamil → English

- Random insertion/deletion

- Paraphrasing with LLMs

Tabular Augmentation 📊

- SMOTE: create synthetic minority samples (for imbalanced data)

- Noise injection: add slight random noise

- Mixup: interpolate between two samples
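Noise injection is the simplest of these to sketch. With NumPy, on hypothetical minority-class samples (SMOTE itself lives in the separate imbalanced-learn package):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # hypothetical minority-class samples

# Jitter each feature by a small fraction of its standard deviation
noise = rng.normal(scale=0.05 * X.std(axis=0), size=X.shape)
X_augmented = np.vstack([X, X + noise])
print(X_augmented.shape)  # (100, 4): dataset doubled
```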

Audio Augmentation 🎵

- Speed change: faster/slower

- Pitch shift: higher/lower

- Add background noise

- Time stretch

Data augmentation can improve model accuracy by 5-20%! Especially for small datasets, it's a game-changer. 🚀

Data Leakage — The Silent Killer!

⚠️ Warning

⚠️ Data Leakage = test/future information mixing into the training data

99% accuracy in training, 50% in production — that's a sign of leakage!

Common Leakage Sources:

🔴 Target Leakage — using a feature that directly encodes the target

- Predicting "will patient recover?" — using "discharge_date" as feature!

🔴 Temporal Leakage — using future data in training

- Predicting stock price — using next day's data for today's prediction

🔴 Train-Test Contamination — test-set information leaking into training

- Scaling: fitting on the full dataset and then splitting (WRONG!)

- Correct: split first, then fit on train only

🔴 Duplicate Leakage — the same record in both train AND test

Prevention:

1. Split FIRST, process AFTER

2. Time-based splits for temporal data

3. Check feature importance — investigate anything suspiciously high

4. Cross-validate properly

Data Versioning — Track Your Data!

We use Git for code — what about data?


Why version data?

  • The model trained 3 months ago was better — but where is that data?
  • Added new data and the model got worse — need to roll back
  • Regulatory compliance — which data used for which model?

Tools:

| Tool | Type | Best For |
| --- | --- | --- |
| **DVC** | Git-like for data | ML projects |
| **LakeFS** | Git for data lakes | Data lake versioning |
| **Delta Lake** | Time travel | Lakehouse |
| **MLflow** | Experiment tracking | Full ML lifecycle |

DVC Example:

bash
dvc init
dvc add data/training.csv
git add data/training.csv.dvc
git commit -m "v1 training data"
dvc push  # upload to remote storage

Now you can always go back to any data version! Reproducible ML = professional ML. 🏆

Prompt: Data Preparation Plan

📋 Copy-Paste Prompt
You are a senior ML engineer. A startup has raw customer data and wants to build a churn prediction model.

Data available:
- Customer demographics (name, age, city, plan_type)
- Usage logs (login_count, session_duration, features_used)
- Support tickets (date, category, resolution_time)
- Payment history (amount, date, failed_payments)

Create a complete data preparation plan:
1. EDA steps and expected findings
2. Cleaning strategy per data source
3. Feature engineering ideas (at least 10 features)
4. Encoding and scaling strategy
5. Train/test split approach
6. Potential data leakage risks

Explain in Tanglish.

Key Takeaways

Summary:


80% of AI project time = data preparation

EDA first — understand before transforming

Cleaning — handle missing values, duplicates, and outliers

Feature Engineering — extract useful signals from raw data

Encoding — convert categorical data to numbers

Scaling — bring features to the same scale

Split — train/val/test properly, stratify if needed

Data Leakage — the silent killer, always watch out!

Version data — use DVC or a similar tool


"In God we trust; all others must bring data." — And that data better be well-prepared! 😄


Next article: SQL for AI Apps — how to use SQL for AI data work! 🎯

🏁 🎮 Mini Challenge

Challenge: Prepare a dataset and train an ML model


Practice the complete data preparation cycle:


Step 1 (Load & Explore - 10 min):

python
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv('dataset.csv')
report = ProfileReport(df)
report.to_file('report.html')  # Open in browser – issues visible!

Step 2 (Clean - 10 min):

python
# Missing values
df = df.fillna(df.median(numeric_only=True))
# Duplicates
df = df.drop_duplicates()
# Outliers (IQR method)
Q1, Q3 = df['salary'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df = df[(df['salary'] > Q1 - 1.5*IQR) & (df['salary'] < Q3 + 1.5*IQR)]

Step 3 (Features - 10 min):

python
# Create meaningful features
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['young', 'middle', 'senior'])
df['years_employed'] = 2026 - df['join_year']

Step 4 (Split & Scale - 5 min):

python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = df.drop(columns=['target']), df['target']  # assuming the label column is 'target'
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)  # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Step 5 (Train - 5 min):

python
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier().fit(X_train_scaled, y_train)
print(f"Accuracy: {model.score(X_test_scaled, y_test):.2%}")

Learning: 80% of the time on data, 20% on the model – it's ALWAYS about data quality! 📊

💼 Interview Questions

Q1: Data preparation takes 60-80% of project time – why?

A: Real-world data is messy! Missing values, duplicates, inconsistencies, outliers, multiple formats. Cleaning, validating, and understanding it all takes time. AI model quality = data quality. Garbage data → garbage model. In a 6-month project, 4 months can go to data alone!


Q2: Feature engineering – art vs science?

A: A bit of both! Science: statistical methods (correlation, dimensionality reduction). Art: domain knowledge (knowing which features matter). "Age" is a raw feature, but "age_group" is often a better ML feature. Experienced engineers apply domain expertise – their features are consistently better!


Q3: Training-serving skew – practical example?

A: Training: compute avg_salary over the last 30 days. Serving: avg_salary over the last 1 hour? Different features → different predictions! Solution: a Feature Store – the same features in both places. Or a standardized pipeline – deterministic, reproducible feature computation!


Q4: Imbalanced data (95% negative, 5% positive) – handling?

A: Techniques: undersampling (downsample the majority), oversampling (SMOTE – synthetic samples), class weights, stratified splits. Business context matters – fraud may be 0.1% of the data but very costly, so handle it specially! Class balance != realism – the real world is often imbalanced.


Q5: Data leakage – subtle ways?

A: The subtle cases are tricky! Including future information at training time (a feature directly derived from the target). Scaling on the full dataset first (information leak). Exploring test-set information alongside training data. Shuffling time series (past-future mix). Prevention: understand your data, think critically, split FIRST, THEN process!

Frequently Asked Questions

How much time does data preparation take?
Typically 60-80% of total AI project time goes into data preparation. In a 6-month project, 4 months can be data work alone!
What is feature engineering?
Creating new, useful columns (features) for the ML model from raw data. Example: calculating age from date of birth.
How much training data is needed?
It depends on problem complexity. For simple classification, 1000+ samples can be enough. Complex deep learning needs millions. Quality > Quantity, always.
How should data be labeled?
Manual labeling (humans tag data), semi-automated (model suggests, human verifies), or automated (rule-based). Tools: Label Studio, Amazon SageMaker Ground Truth.
What is data augmentation?
Creating new training samples from existing data. Rotate/flip images or paraphrase text to generate more training data without collecting any.