โ† Back|DATA-ENGINEERINGโ€บSection 1/18
0 of 18 completed

Preparing data for AI

Intermediateโฑ 15 min read๐Ÿ“… Updated: 2026-02-17

Introduction

"Give me six hours to chop down a tree, and I will spend the first four sharpening the axe." - Abraham Lincoln 🪓


In AI, this is 100% true. Best model + bad data = bad results. But average model + great data = amazing results! 🎯


Data scientists spend roughly 80% of their time on data preparation. It sounds boring, but this is the difference between a working AI model and a failed project.


In this article we'll walk through the complete data preparation process - from collection to model-ready data! 🚀

Data Preparation Pipeline

๐Ÿ—๏ธ Architecture Diagram
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚          AI DATA PREPARATION PIPELINE            โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                   โ”‚
โ”‚  โ‘  COLLECT โ”€โ”€โ–ถ โ‘ก EXPLORE โ”€โ”€โ–ถ โ‘ข CLEAN            โ”‚
โ”‚      โ”‚              โ”‚              โ”‚              โ”‚
โ”‚   Sources        EDA/Stats     Handle missing    โ”‚
โ”‚   APIs           Visualize     Remove duplicates โ”‚
โ”‚   Scrape         Understand    Fix errors        โ”‚
โ”‚                                                   โ”‚
โ”‚  โ‘ฃ TRANSFORM โ”€โ”€โ–ถ โ‘ค FEATURE ENG โ”€โ”€โ–ถ โ‘ฅ SPLIT      โ”‚
โ”‚      โ”‚                โ”‚                โ”‚          โ”‚
โ”‚   Normalize       Create new       Train/Val/Testโ”‚
โ”‚   Encode          Select best      Stratify      โ”‚
โ”‚   Scale           Reduce dims      Time-based    โ”‚
โ”‚                                                   โ”‚
โ”‚  โ‘ฆ VALIDATE โ”€โ”€โ–ถ โ‘ง VERSION โ”€โ”€โ–ถ ๐Ÿค– MODEL READY!   โ”‚
โ”‚      โ”‚              โ”‚                             โ”‚
โ”‚   Quality         DVC/Git                         โ”‚
โ”‚   Bias check      Reproducible                    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Step 1: Data Collection

First step - where does the data come from?


Internal Sources:

  • 📊 Company databases (MySQL, PostgreSQL)
  • 📁 Log files (server logs, app events)
  • 📧 CRM data (customer interactions)

External Sources:

  • 🌐 Public datasets (Kaggle, UCI, government data)
  • 🔌 APIs (Twitter, weather, stock market)
  • 🕷️ Web scraping (ethically!)
  • 🛒 Third-party data vendors

Key considerations:

| Factor | Question |
|---|---|
| Volume | How much data do you need? |
| Velocity | How fresh should the data be? |
| Variety | What formats? |
| Veracity | Is the data trustworthy? |
| Legal | Do you have permission? GDPR/privacy? |

Rule of thumb: Start with whatever data you have, validate its usefulness, then collect more if needed. Don't wait for "perfect" data - it doesn't exist! 😅

Step 2: Exploratory Data Analysis (EDA)

Once you've collected the data, the next step is to understand it! Don't build a model blindly.


EDA Checklist:


📊 Basic Stats

python
df.describe()  # mean, std, min, max
df.info()      # data types, null counts
df.shape       # rows, columns

📈 Distributions

  • Numerical columns: histogram, box plot
  • Categorical columns: value counts, bar chart
  • Target variable: balanced or imbalanced?

🔗 Relationships

  • Correlation matrix - how are features related?
  • Scatter plots - any visible patterns?
  • Cross-tabulation - compare categories

โš ๏ธ Red Flags to Watch:

  • ๐Ÿ”ด 90%+ missing values in a column
  • ๐Ÿ”ด Single value dominates (99% same value)
  • ๐Ÿ”ด Impossible values (age = -5, price = 0)
  • ๐Ÿ”ด Data leakage โ€” future data in training set!

EDA tools: pandas, matplotlib, seaborn, ydata-profiling (auto-generates an EDA report!) 📊

Step 3: Data Cleaning

Clean the dirty data - the most time-consuming step!


Missing Values:

| Strategy | When to Use | Code |
|---|---|---|
| Drop rows | <5% missing, random | `df.dropna()` |
| Mean/Median fill | Numerical, normal dist | `df.fillna(df.mean())` |
| Mode fill | Categorical | `df.fillna(df.mode().iloc[0])` |
| Forward fill | Time series | `df.ffill()` |
| ML imputation | Complex patterns | `KNNImputer` |
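The fill strategies above can be sketched on a toy DataFrame (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],                   # numerical
    "city": ["Mumbai", "Delhi", None, "Mumbai"],   # categorical
    "sales": [100.0, 110.0, np.nan, 130.0],        # time-series style
})

# Numerical: median fill (robust to outliers)
df["age"] = df["age"].fillna(df["age"].median())

# Categorical: mode fill (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Time series: forward fill carries the last known value ahead
df["sales"] = df["sales"].ffill()

print(df.isna().sum().sum())  # 0 -- no missing values left
```

For complex patterns, `sklearn.impute.KNNImputer` fills a value based on the most similar rows instead of a single global statistic.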

Duplicates:

python
df.duplicated().sum()  # count
df.drop_duplicates()   # remove

Outliers:

  • IQR method: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers
  • Z-score: |z| > 3 = outlier
  • Domain knowledge: age > 150 is obviously wrong!
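Both rules can be written in a few lines of pandas (the salary values below are invented):

```python
import pandas as pd

s = pd.Series([30, 32, 35, 31, 33, 500])  # 500 is the obvious outlier

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score method: |z| > 3 (needs larger samples to trigger than IQR)
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

print(iqr_outliers.tolist())  # [500]
```

Note how the z-score rule stays silent here: with only six points, one extreme value inflates the standard deviation so much that its own z-score stays below 3. That is why the IQR rule is usually the safer default on small samples.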

Data Type Fixes:

  • String "123" → integer 123
  • "2026-02-17" → datetime object
  • "Male"/"M"/"male" → standardize to "M"
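A minimal sketch of these type fixes with pandas (the example values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "amount": ["123", "456", "oops"],
    "date": ["2026-02-17", "2026-02-18", "2026-02-19"],
    "gender": ["Male", "M", "male"],
})

# String numbers -> numeric; bad values become NaN instead of crashing
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# String dates -> datetime objects (enables .dt accessors later)
df["date"] = pd.to_datetime(df["date"])

# Standardize inconsistent category spellings to a single form
df["gender"] = df["gender"].str.strip().str.upper().str[0]

print(df["gender"].tolist())  # ['M', 'M', 'M']
```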

Remember: Document every cleaning step! Reproducibility matters. 📝

Step 4: Feature Engineering - The Art! 🎨

Feature engineering - this is what separates a good ML engineer from a great one!


What is Feature Engineering?

Creating meaningful signals for the model out of raw data.


Common Techniques:


1. Date Features 📅

python
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5,6])
df['month'] = df['date'].dt.month
df['hour'] = df['timestamp'].dt.hour

2. Text Features 📝

  • Word count, character count
  • Sentiment score
  • TF-IDF vectors
  • Embeddings (BERT, Word2Vec)

3. Aggregation Features 📊

python
df['avg_purchase_last_30d'] = ...  # Customer average
df['total_orders'] = ...            # Lifetime count
df['days_since_last_login'] = ...   # Recency
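One common way to build such aggregation features is groupby + merge; the tiny orders table below is hypothetical:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [100.0, 200.0, 50.0, 60.0, 70.0],
})

# Per-customer aggregates...
agg = orders.groupby("customer_id")["amount"].agg(
    avg_purchase="mean",
    total_orders="count",
).reset_index()

# ...merged back so every order row carries its customer's features
orders = orders.merge(agg, on="customer_id")
print(agg)
```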

4. Interaction Features ✖️

python
df['price_per_sqft'] = df['price'] / df['area']
df['bmi'] = df['weight'] / (df['height']**2)

5. Binning 📦

Age → Age groups (0-18, 19-35, 36-60, 60+)

Income → Low/Medium/High
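Binning like this is one line with `pd.cut` (toy ages below):

```python
import pandas as pd

ages = pd.Series([5, 22, 45, 70])

# Bin edges produce the intervals (0,18], (18,35], (35,60], (60,120]
age_groups = pd.cut(
    ages,
    bins=[0, 18, 35, 60, 120],
    labels=["0-18", "19-35", "36-60", "60+"],
)
print(age_groups.tolist())  # ['0-18', '19-35', '36-60', '60+']
```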

Encoding Categorical Data

💡 Tip

ML models only understand numbers! Categorical data has to be converted:

💡 Label Encoding - assign a number to each category

code
Red=0, Blue=1, Green=2

⚠️ Problem: the model thinks Green(2) > Red(0) - but colors have no order!

💡 One-Hot Encoding - a separate column per category

code
Red:   [1, 0, 0]
Blue:  [0, 1, 0]
Green: [0, 0, 1]

✅ No ordering problem. But too many categories = too many columns!

💡 Target Encoding - replace the category with the target mean

code
City → average house price of that city

⚠️ Data leakage risk - be careful!

💡 Frequency Encoding - replace the category with its frequency

code
Mumbai → 0.35 (35% of data)
Delhi → 0.25 (25% of data)

Rule: few categories (<10) → One-Hot. Many categories → Target/Frequency encoding.
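A quick sketch of three of these encodings in pandas (the color column is a toy example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# One-hot: one 0/1 column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: integer codes (only safe when a real order exists!)
df["color_label"] = df["color"].astype("category").cat.codes

# Frequency encoding: replace each category with its share of the data
freq = df["color"].value_counts(normalize=True)
df["color_freq"] = df["color"].map(freq)

print(df["color_freq"].tolist())  # [0.25, 0.5, 0.25, 0.5]
```

Target encoding needs extra care (compute the mean on the training split only, ideally with smoothing) precisely because of the leakage risk mentioned above.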

Feature Scaling

Features come on different scales - normalize them!


Why? Age (0-100) vs Salary (10,000-1,000,000) - without scaling, the model will let salary dominate.


| Method | Formula | Range | When to Use |
|---|---|---|---|
| **Min-Max** | (x-min)/(max-min) | 0 to 1 | Neural networks |
| **Standard** | (x-mean)/std | ~-3 to +3 | Linear models, SVM |
| **Robust** | (x-median)/IQR | Varies | When outliers are present |
| **Log Transform** | log(x) | Varies | Skewed data |

python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()
df[['age', 'salary']] = scaler.fit_transform(df[['age', 'salary']])

Important: run fit_transform on the training data only. On test data, use transform only - never fit! This avoids data leakage. ⚠️

Train/Validation/Test Split

Split the data into 3 parts for the model:


Standard Split:

  • 🟢 Training (70-80%) - the model learns from this
  • 🟡 Validation (10-15%) - hyperparameter tuning
  • 🔴 Test (10-15%) - final evaluation (don't touch it until the end!)

python
from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

Stratified Split - important for imbalanced data!

python
# 95% non-fraud, 5% fraud - the class ratio must be preserved
train_test_split(X, y, stratify=y)

Time-based Split - for time series data:

  • Train: Jan-Oct data
  • Validate: Nov data
  • Test: Dec data
  • NEVER shuffle time series! Future data will leak! ⏰

Cross-validation - best for small datasets:

  • K-Fold (k=5 or 10)
  • Each fold gets to be test set once
  • More reliable evaluation
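The K-Fold idea can be tried with scikit-learn's `cross_val_score` on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data, just for demonstration
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# 5-fold CV: the model is trained 5 times; each fold is the test set once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(round(scores.mean(), 3))  # mean accuracy across the 5 folds
```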

Data Labeling for Supervised Learning

Supervised ML needs labeled data - and labeling is expensive and time-consuming!


Labeling Methods:


1. Manual Labeling 👷

  • Humans tag each data point
  • Most accurate but slowest
  • Cost: $0.01-$1 per label
  • Tools: Label Studio, Labelbox, Amazon MTurk

2. Semi-Automated 🤖+👷

  • The model predicts, a human verifies
  • Active Learning: the model sends its most uncertain samples to a human
  • 5x faster than fully manual

3. Weak Supervision 📝

  • Uses rules and heuristics
  • Example: Email with "lottery" = spam (rule-based)
  • Tool: Snorkel
  • Noisy but fast
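The rule-based idea can be sketched in plain Python, Snorkel-style (the labeling functions below are invented examples, not Snorkel's API):

```python
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_lottery(email: str) -> int:
    """Rule: mentioning 'lottery' is a spam signal."""
    return SPAM if "lottery" in email.lower() else ABSTAIN

def lf_greeting(email: str) -> int:
    """Rule: mails starting with a personal greeting look legitimate."""
    return NOT_SPAM if email.lower().startswith("hi ") else ABSTAIN

def weak_label(email: str) -> int:
    """Combine labeling functions by majority vote; abstain if none fire."""
    votes = [v for v in (lf_lottery(email), lf_greeting(email)) if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(weak_label("You won the LOTTERY! Claim now"))  # 1 (spam)
```

Snorkel does essentially this at scale, plus a learned model that weighs noisy rules by their estimated accuracy instead of a flat majority vote.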

4. Self-Supervised 🔄

  • Labels are created from the data itself
  • GPT training: next word prediction - no manual labels!
  • Most modern LLMs use this

Labeling Quality Tips:

  • Multiple annotators per sample (majority vote)
  • Write down clear labeling guidelines
  • Measure inter-annotator agreement
  • Run regular quality audits

Data Augmentation: More Data, Less Collection! 🎯

✅ Example

Data augmentation - create MORE training data from the data you already have!

Image Augmentation 🖼️

- Rotate: 90°, 180°, 270°

- Flip: horizontal, vertical

- Crop: random sections

- Color: change brightness, contrast

- Noise: add slight blur or noise

1 image → 10-50 variations!

Text Augmentation 📝

- Synonym replacement: "happy" → "joyful"

- Back translation: English → Tamil → English

- Random insertion/deletion

- Paraphrasing with LLMs

Tabular Augmentation 📊

- SMOTE: create synthetic minority samples (for imbalanced data)

- Noise injection: add slight random noise

- Mixup: interpolate between two samples

Audio Augmentation 🎵

- Speed change: faster/slower

- Pitch shift: higher/lower

- Add background noise

- Time stretch

Data augmentation can improve model accuracy by 5-20%! It's a game-changer especially for small datasets. 🚀
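Basic image augmentations need nothing more than NumPy; here's a sketch on a fake image array:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # fake 32x32 RGB image, values in [0, 1]

augmented = [
    np.rot90(image),                                          # rotate 90 degrees
    np.fliplr(image),                                         # horizontal flip
    np.flipud(image),                                         # vertical flip
    np.clip(image * 1.2, 0.0, 1.0),                           # brightness up
    np.clip(image + rng.normal(0, 0.05, image.shape), 0, 1),  # gaussian noise
]
print(len(augmented))  # 5 extra samples from 1 image
```

For real projects, libraries such as torchvision.transforms or albumentations handle this (plus random crops, color jitter, and more) for you.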

Data Leakage - The Silent Killer!

⚠️ Warning

⚠️ Data Leakage = test/future information mixing into training

99% accuracy in training, 50% in production - that's the classic sign of leakage!

Common Leakage Sources:

🔴 Target Leakage - using a feature directly derived from the target

- Predicting "will the patient recover?" - using "discharge_date" as a feature!

🔴 Temporal Leakage - using future data in training

- Predicting stock price - using next day's data for today's prediction

🔴 Train-Test Contamination - test set information leaking into training

- Scaling: fitting on the full dataset and then splitting (WRONG!)

- Correct: split first, then fit on train only

🔴 Duplicate Leakage - the same record in BOTH train and test

Prevention:

1. Split FIRST, process AFTER

2. Time-based splits for temporal data

3. Check feature importance - investigate anything suspiciously high

4. Cross-validate properly
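Prevention point 1 is exactly what a scikit-learn Pipeline gives you: because the scaler sits inside the pipeline, every cross-validation fold refits it on that fold's training portion only (synthetic data below, just for demonstration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# WRONG would be: scale all of X first, then cross-validate.
# RIGHT: inside the pipeline, the scaler never sees the held-out fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```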

Data Versioning - Track Your Data!

We use Git for code - what about data?


Why version data?

  • The model trained 3 months ago was better - but where is that data now?
  • Added new data and the model got worse - you need to roll back
  • Regulatory compliance - which data was used for which model?

Tools:

| Tool | Type | Best For |
|---|---|---|
| **DVC** | Git-like for data | ML projects |
| **LakeFS** | Git for data lakes | Data lake versioning |
| **Delta Lake** | Time travel | Lakehouse |
| **MLflow** | Experiment tracking | Full ML lifecycle |

DVC Example:

bash
dvc init
dvc add data/training.csv
git add data/training.csv.dvc
git commit -m "v1 training data"
dvc push  # upload to remote storage

Now you can always go back to any data version! Reproducible ML = professional ML. 🏆

Prompt: Data Preparation Plan

📋 Copy-Paste Prompt
You are a senior ML engineer. A startup has raw customer data and wants to build a churn prediction model.

Data available:
- Customer demographics (name, age, city, plan_type)
- Usage logs (login_count, session_duration, features_used)
- Support tickets (date, category, resolution_time)
- Payment history (amount, date, failed_payments)

Create a complete data preparation plan:
1. EDA steps and expected findings
2. Cleaning strategy per data source
3. Feature engineering ideas (at least 10 features)
4. Encoding and scaling strategy
5. Train/test split approach
6. Potential data leakage risks

Explain in Tanglish.

✅ Key Takeaways

Summary:


✅ 80% of AI project time = data preparation

✅ EDA first - understand before transforming

✅ Cleaning - handle missing values, duplicates, outliers

✅ Feature Engineering - extract useful signals from raw data

✅ Encoding - convert categorical data to numbers

✅ Scaling - bring features to the same scale

✅ Split - train/val/test properly, stratify if needed

✅ Data Leakage - the silent killer, always watch out!

✅ Version your data - use DVC or a similar tool


"In God we trust; all others must bring data." - And that data better be well-prepared! 😄


Next article: SQL for AI Apps - how to use SQL for AI data work! 🎯

🎮 Mini Challenge

Challenge: Prepare a dataset and train an ML model


Practice the complete data preparation cycle:


Step 1 (Load & Explore - 10 min):

python
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv('dataset.csv')
report = ProfileReport(df)
report.to_file('report.html')  # open in a browser - issues become visible!

Step 2 (Clean - 10 min):

python
# Missing values (numeric columns) -- note: assign the result back!
df = df.fillna(df.median(numeric_only=True))
# Duplicates
df = df.drop_duplicates()
# Outliers (IQR method)
Q1, Q3 = df['salary'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df = df[(df['salary'] > Q1 - 1.5*IQR) & (df['salary'] < Q3 + 1.5*IQR)]

Step 3 (Features - 10 min):

python
# Create meaningful features
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['young', 'middle', 'senior'])
df['years_employed'] = 2026 - df['join_year']

Step 4 (Split & Scale - 5 min):

python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=['target'])  # assuming a 'target' label column exists
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)      # fit on train only!
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # transform only - never fit on test

Step 5 (Train - 5 min):

python
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier().fit(X_train_scaled, y_train)
print(f"Accuracy: {model.score(X_test_scaled, y_test):.2%}")

Learning: 80% of the time on data, 20% on the model - it's ALWAYS about data quality! 📊

💼 Interview Questions

Q1: Data preparation takes 60-80% of project time - why?

A: Real-world data is messy! Missing values, duplicates, inconsistencies, outliers, multiple formats. Cleaning, validating, and understanding it all takes time. AI model quality = data quality. Garbage data → garbage model. In a 6-month project, data work alone can take 4 months!


Q2: Feature engineering - art or science?

A: A bit of both! Science: statistical methods (correlation, dimensionality reduction). Art: domain knowledge (knowing which features matter). "Age" is a raw feature, but "age_group" is often the better ML feature. Experienced engineers lean on domain expertise - their features are consistently better!


Q3: Training-serving skew - a practical example?

A: Training: compute avg_salary over the last 30 days. Serving: avg_salary over the last 1 hour? Different features → different predictions! Solution: a Feature Store - both places read the same features. Or a standardized pipeline - deterministic, reproducible feature computation!


Q4: Imbalanced data (95% negative, 5% positive) - how do you handle it?

A: Techniques: undersampling (downsample the majority), oversampling (SMOTE - synthetic samples), class weights, stratified splits. Business context matters - fraud may be 0.1% of the data but very costly, so handle it specially! Class balance != realistic - the real world is often imbalanced.


Q5: Data leakage - what are the subtle ways it happens?

A: The subtle cases are tricky! Including future information at training time (a feature directly tied to the target). Scaling the full dataset before splitting (information leaks). Exploring test data together with training data. Shuffling time series (past-future mix). Prevention: understand your data, think critically, split FIRST, THEN process!

Frequently Asked Questions

❓ How long does data preparation take?
Typically 60-80% of total AI project time goes to data preparation. In a 6-month project, that's 4 months of data work!
❓ What is feature engineering?
Creating new columns (features) from raw data that are useful to the ML model. Example: calculating age from date of birth.
❓ How much training data do I need?
It depends on problem complexity. Simple classification may need only 1000+ samples. Complex deep learning needs millions. Quality > quantity, always.
❓ How is data labeling done?
Manual labeling (humans tag data), semi-automated (model suggests, human verifies), or automated (rule-based). Tools: Label Studio, Amazon SageMaker Ground Truth.
❓ What is data augmentation?
Creating new training samples from existing data. Rotate/flip images, paraphrase text - you can generate more training data without collecting any.
🧠 Knowledge Check

Where should you run fit_transform for feature scaling (StandardScaler)?