← Back|DATA-ENGINEERING›Section 1/16

Data types (structured/unstructured)

Beginner ⏱ 10 min read 📅 Updated: 2026-02-17

Introduction

Every day you take photos on your phone, send messages on WhatsApp, and enter data in Excel. These are all different types of data! 📱


In the AI world, understanding data types is very important, because different data types need different processing methods. You cannot process an Excel table and an Instagram photo the same way!


In this article we will look at all three (structured, unstructured, and semi-structured data) and get a clear picture of the role each one plays in AI! 🎯

Three Types of Data

Data can be broadly divided into three types:


1. Structured Data 📊

  • Organized in rows and columns (like Excel)
  • Fixed schema: every record has the same format
  • Examples: database tables, CSV files, spreadsheets
  • Easy to search, filter, and analyze

2. Unstructured Data 📷

  • No predefined format
  • Free-form, messy, varied
  • Examples: images, videos, emails, PDFs, audio
  • Hard to search; usually needs AI to process

3. Semi-structured Data 📋

  • Has some structure, but no rigid schema
  • Uses tags or markers
  • Examples: JSON, XML, HTML, email headers
  • Flexible but still parseable

Fun fact: By commonly cited industry estimates, 80-90% of the data generated worldwide is unstructured, and only 10-20% is structured. The AI revolution rests on the ability to process unstructured data! 🤯
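
To make the three categories concrete, here is a minimal Python sketch that holds one kind of information in each form. The orders, field names, and review text are invented for illustration; only the standard library is used.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (CSV).
csv_text = "order_id,customer,amount\n101,Priya,499.0\n102,Arun,1250.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: keys act as markers, but records may differ in shape (JSON).
json_text = '{"order_id": 101, "customer": {"name": "Priya"}, "tags": ["gift"]}'
order = json.loads(json_text)
customer_name = order["customer"]["name"]

# Unstructured: free text with no fields; we can scan it, but not query it by column.
review = "Loved the packaging, but delivery took almost a week."
mentions_delivery = "delivery" in review.lower()

print(total)
print(customer_name)
print(mentions_delivery)
```

The structured rows can be summed directly, the JSON needs key-by-key navigation, and the free text only supports a crude keyword check until a model is applied to it.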

Analogy: Library vs Attic vs Filing Cabinet

✅ Example

Let's compare the data types with things from real life:

📚 Structured Data = Library

- Every book has a catalog number

- Organized by genre, author, year

- Easy to find: just search the catalog!

📦 Unstructured Data = Attic

- Old photos, letters, random items dumped

- No organization, no labels

- Finding anything takes time; you have to sort manually

🗂️ Semi-structured Data = Filing Cabinet with Sticky Notes

- There are files, and some have labels

- But the format inside varies: some typed, some handwritten

- Better than the attic, but not as neat as the library

The power of AI is that it can automatically find the valuable items in the attic! That's what deep learning does with unstructured data. 🧠

Data Types Processing Architecture

🏗️ Architecture Diagram
┌───────────────────────────────────────────────────┐
│             DATA TYPES IN AI PIPELINE             │
├───────────────────────────────────────────────────┤
│                                                   │
│   STRUCTURED    SEMI-STRUCTURED    UNSTRUCTURED   │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐   │
│  │ Database │     │   JSON   │     │  Images  │   │
│  │  Tables  │     │   XML    │     │  Videos  │   │
│  │   CSV    │     │   Logs   │     │  Audio   │   │
│  └────┬─────┘     └────┬─────┘     └────┬─────┘   │
│       │                │                │         │
│       ▼                ▼                ▼         │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐   │
│  │   SQL    │     │  Parser  │     │  AI/ML   │   │
│  │  Engine  │     │  Engine  │     │  Models  │   │
│  └────┬─────┘     └────┬─────┘     └────┬─────┘   │
│       │                │                │         │
│       └────────────────┼────────────────┘         │
│                        ▼                          │
│               ┌────────────────┐                  │
│               │  UNIFIED DATA  │                  │
│               │    STORAGE     │                  │
│               │  (Data Lake)   │                  │
│               └────────────────┘                  │
└───────────────────────────────────────────────────┘

Detailed Examples with AI Context

Let's look at real AI use cases for each data type:


Structured Data + AI 📊

  • Customer purchase history → Recommendation engine
  • Stock prices table → Price prediction model
  • Sensor readings → Anomaly detection
  • Student marks → Performance prediction

Unstructured Data + AI 📷

  • X-ray images → Disease detection (Computer Vision)
  • Customer emails → Sentiment analysis (NLP)
  • Call recordings → Speech-to-text, keyword extraction
  • Social media posts → Trend analysis, brand monitoring

Semi-structured Data + AI 📋

  • JSON API responses → Data extraction pipelines
  • Server logs → Error pattern detection
  • HTML web pages → Web scraping for training data
  • IoT sensor JSON → Real-time monitoring dashboards

Key insight: Modern AI's superpower is extracting structured insights from unstructured data! 🔥
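
As one small illustration of the "sensor readings → anomaly detection" case, the sketch below flags readings that sit far from the mean using a z-score. This is a simple statistical stand-in rather than a trained model, and the readings, the function name `find_anomalies`, and the threshold of 2.0 are assumptions chosen for the example.

```python
from statistics import mean, stdev

def find_anomalies(readings, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        # All readings are identical, so nothing can stand out.
        return []
    return [
        (i, x) for i, x in enumerate(readings)
        if abs(x - mu) / sigma > threshold
    ]

# Hypothetical temperature readings (°C) with one obvious spike.
temperatures = [21.0, 21.4, 20.9, 21.2, 21.1, 35.0, 21.3, 20.8]
print(find_anomalies(temperatures))
```

With only a handful of readings the largest possible z-score is small, so the threshold has to be tuned to the sample size; production systems typically use rolling windows or learned models instead.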

Side-by-Side Comparison

Let's compare all three types in one table:


| Feature | Structured | Semi-structured | Unstructured |
|---|---|---|---|
| Format | Rows & columns | Tags/markers | No fixed format |
| Schema | Fixed | Flexible | None |
| Storage | RDBMS | NoSQL/Document DB | Data lake/Blob storage |
| Search | Easy (SQL) | Medium (queries) | Hard (needs AI) |
| Example | Excel, MySQL | JSON, XML | Photos, videos |
| Share of data (approx.) | ~10-20% | ~5-10% | ~80% |
| AI processing | Traditional ML | Parsers + ML | Deep learning |
| Query method | SQL | JSONPath, XPath | Vector search |

Takeaway: Structured data is easy to handle but makes up a small share of all data. Unstructured data is abundant but hard to process. AI bridges this gap! 🌉
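
The "Search" and "Query" rows can be sketched in a few lines of Python. The example below runs a SQL query on a table, walks the keys of a JSON document, and ranks free text by cosine similarity. The bag-of-words vectors are a toy stand-in for learned embeddings, and all table names, documents, and values are invented for illustration.

```python
import json
import math
import sqlite3
from collections import Counter

# Structured: a SQL query over a table with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("south", 120.0), ("north", 80.0), ("south", 30.0)],
)
south_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = ?", ("south",)
).fetchone()[0]
conn.close()

# Semi-structured: walk the keys of a parsed JSON document.
doc = json.loads('{"city": "Chennai", "main": {"temp": 31.5, "humidity": 70}}')
temperature = doc["main"]["temp"]

# Unstructured: turn text into vectors and rank by cosine similarity.
def to_vector(text):
    """Bag-of-words counts; a toy stand-in for a learned embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

documents = [
    "refund not received for my order",
    "great camera quality and battery life",
    "delivery was late and the box was damaged",
]
query = to_vector("late delivery")
best_match = max(documents, key=lambda d: cosine(query, to_vector(d)))

print(south_total, temperature, best_match)
```

The first two lookups are exact, while the text search is only a ranking by similarity, which is why unstructured data leans on models and vector search rather than a query language.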

Try It: Classify Data Types

📋 Copy-Paste Prompt
You are a data engineering tutor teaching in Tanglish.

Given these data items, classify each as Structured, Semi-structured, or Unstructured:
1. A MySQL database of employee records
2. A folder of customer complaint emails
3. A JSON file from a weather API
4. YouTube video comments
5. An Excel spreadsheet of sales data
6. Server log files
7. MRI brain scan images
8. An XML configuration file

For each item, explain WHY it belongs to that category and what AI use case it could serve.

Industry Use Cases

How different industries use these data types:


🏥 Healthcare

  • Structured: Patient records, lab results, billing
  • Unstructured: X-rays, MRI scans, doctor notes
  • AI combines both to support more accurate diagnosis

🛒 Retail

  • Structured: Sales data, inventory counts, pricing
  • Unstructured: Product reviews, customer photos, chat logs
  • AI analyzes reviews to improve products

🏦 Finance

  • Structured: Transaction records, account balances
  • Semi-structured: API feeds, regulatory filings (XML)
  • Unstructured: News articles, earnings call audio
  • AI detects fraud by combining all three! 🔍

📱 Social Media

  • Mostly unstructured: Posts, images, videos, stories
  • Semi-structured: User profiles (JSON), hashtags
  • AI powers the entire feed: recommendations, ads, moderation

Challenges with Each Type

⚠️ Warning

Each data type has its own challenges:

⚠️ Structured Data Challenges

- Schema changes break pipelines

- Scaling relational DBs is expensive

- Rigid: the real world is messy, but tables are neat

⚠️ Unstructured Data Challenges

- Storage costs: videos and images use massive space

- Processing often requires expensive GPUs

- Quality varies wildly: blurry photos, noisy audio

- Labeling for AI training is time-consuming and costly

⚠️ Semi-structured Data Challenges

- Nested data (JSON inside JSON) is hard to flatten

- No standard schema: every source is different

- Parsing errors are common when the format changes

💡 Pro tip: Real-world AI projects almost always deal with ALL THREE types. Multi-modal data handling is a critical data engineering skill!
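
One common way to soften the schema and parsing problems listed above is to validate records as they arrive and set aside the bad ones instead of crashing the pipeline. The sketch below does this for JSON lines; the required fields, the function name `parse_records`, and the sample device data are hypothetical.

```python
import json

REQUIRED_FIELDS = {"device_id", "temp"}  # hypothetical schema for this sketch

def parse_records(lines):
    """Split raw lines into valid records and rejected (line_no, reason) pairs."""
    valid, rejected = [], []
    for line_no, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append((line_no, "invalid JSON"))
            continue
        if not isinstance(record, dict):
            rejected.append((line_no, "not a JSON object"))
            continue
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            rejected.append((line_no, "missing: " + ", ".join(sorted(missing))))
            continue
        valid.append(record)
    return valid, rejected

raw_lines = [
    '{"device_id": "a1", "temp": 22.5}',
    '{"device_id": "a2", "temperature": 23.1}',  # field renamed upstream
    '{"device_id": "a3", "temp": 21.9',           # truncated line
]
valid, rejected = parse_records(raw_lines)
print(valid)
print(rejected)
```

Keeping the rejected lines with a reason makes a format change visible quickly, rather than letting it silently drop or corrupt data downstream.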

Tools for Each Data Type

Suitable tools for each data type:


| Data Type | Storage | Processing | AI Tool |
|---|---|---|---|
| Structured | PostgreSQL, MySQL | SQL, pandas | scikit-learn |
| Structured | BigQuery, Redshift | Spark SQL | AutoML |
| Semi-structured | MongoDB, DynamoDB | jq, Python | Custom parsers |
| Semi-structured | Elasticsearch | Logstash | Anomaly detection |
| Unstructured | S3, GCS, HDFS | Spark, Ray | PyTorch, TensorFlow |
| Unstructured | Vector DB (Pinecone) | Embeddings | LLMs, CNNs |

Beginner tip: Start with structured (SQL + pandas), then semi-structured (JSON + Python), and finally unstructured (images + PyTorch). Step by step! 🪜

Hands-On: Work with All Three Types

Steps to practice:


Exercise 1: Structured Data 📊

  1. Download any CSV dataset from Kaggle
  2. Load it into pandas: pd.read_csv('data.csv')
  3. Run a basic analysis: mean, count, groupby
  4. Load it into SQLite and query it with SQL
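
The last step can be sketched with Python's built-in `sqlite3` module. To keep the example runnable without any installs it reads the CSV with the standard `csv` module instead of pandas, and it uses a small in-memory stand-in for the downloaded file; the student names, columns, and marks are invented.

```python
import csv
import io
import sqlite3

# Stand-in for the downloaded file; replace with open("data.csv", newline="").
csv_file = io.StringIO(
    "student,subject,marks\n"
    "Anu,maths,82\n"
    "Anu,science,91\n"
    "Ravi,maths,67\n"
    "Ravi,science,74\n"
)
rows = [
    (r["student"], r["subject"], int(r["marks"]))
    for r in csv.DictReader(csv_file)
]

# Load the rows into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (student TEXT, subject TEXT, marks INTEGER)")
conn.executemany("INSERT INTO marks VALUES (?, ?, ?)", rows)

# The SQL counterpart of a groupby followed by a mean.
averages = conn.execute(
    "SELECT student, AVG(marks) FROM marks GROUP BY student ORDER BY student"
).fetchall()
conn.close()
print(averages)
```

Passing a file path instead of ":memory:" to `sqlite3.connect` would persist the table to disk.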

Exercise 2: Semi-structured Data 📋

  1. Fetch JSON from a free API (e.g., a weather API)
  2. Parse it with Python: json.loads(response_text), where response_text is the response body as a string
  3. Flatten the nested JSON into a tabular format
  4. Store it in MongoDB (free Atlas tier)
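
Steps 2 and 3 might look like the sketch below, which parses a JSON string and flattens nested objects into dotted column names. The response text is a made-up weather payload, not the output of any specific API, and `flatten` is a helper written for this example.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            # Lists and scalar values are kept as they are.
            flat[name] = value
    return flat

# Hypothetical response body from a weather API.
response_text = (
    '{"name": "Chennai", "main": {"temp": 31.5, "humidity": 70}, '
    '"wind": {"speed": 4.1}}'
)
record = flatten(json.loads(response_text))
print(record)
```

Each flattened record is a single-level dict, so a list of them maps directly onto table rows; pandas offers `json_normalize` for the same job on larger data.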

Exercise 3: Unstructured Data 📷

  1. Collect 100 images (cats vs dogs from Kaggle)
  2. Use Pillow (PIL) to resize and normalize them
  3. Convert them to numerical arrays (NumPy)
  4. Try a simple classification with a pre-trained model
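
Before reaching for Pillow and NumPy, it can help to see what "convert to numerical arrays" and "normalize" mean on a tiny scale. The pure-Python sketch below uses a hypothetical 2×2 RGB image written out by hand; real images would be loaded with Pillow and converted with NumPy instead.

```python
# A tiny hypothetical 2x2 RGB image: rows -> pixels -> (R, G, B) in 0..255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

def normalize(img):
    """Scale every channel value from 0..255 to 0.0..1.0."""
    return [
        [tuple(channel / 255 for channel in pixel) for pixel in row]
        for row in img
    ]

def shape(img):
    """Return (height, width, channels), like the shape of an image array."""
    return (len(img), len(img[0]), len(img[0][0]))

normalized = normalize(image)
print(shape(image))
print(normalized[0][0])
```

An image is just a height × width × channels grid of numbers, which is the form a classification model consumes.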

Bonus Project: Build a pipeline that combines all three types! Customer data (structured) + reviews (unstructured) + API data (semi-structured) → unified analysis. 🏆

✅ Key Takeaways

Recap time! 📝


✅ Structured = Rows & columns, fixed schema (Excel, SQL tables)

✅ Unstructured = No predefined format (images, videos, audio, text)

✅ Semi-structured = Some structure, flexible (JSON, XML, logs)

✅ An estimated 80-90% of the world's data is unstructured

✅ AI's superpower = processing unstructured data at scale

✅ Modern data engineering handles ALL three types

✅ Different storage & tools for each type


Next article: "Data Flow in AI Apps", where we will see how data flows through an AI application! 🎯

Prompt: Data Conversion Challenge

📋 Copy-Paste Prompt
You are a Python data engineering tutor.

Show me how to:
1. Read a JSON file (semi-structured) with nested objects
2. Flatten it into a structured pandas DataFrame
3. Save as both CSV (structured) and back to JSON (semi-structured)
4. Extract text fields for NLP processing (moving toward unstructured)

Use a realistic example, such as an e-commerce product catalog with nested categories, reviews, and images.

Include complete Python code with comments.

🏁 🎮 Mini Challenge

Challenge: Handle All Three Data Types


Practice with the different data types you meet in the real world:


Step 1 (Structured Data - 10 min):

  1. Download a CSV from Kaggle (for example, student grades)
  2. Load it into pandas and run a basic analysis: mean, count, distribution
  3. Save it as a SQLite table

Step 2 (Semi-structured Data - 10 min):

  1. Call a weather API (free API: openweathermap.org)
  2. Parse the JSON response
  3. Flatten the nested data and convert it to a pandas DataFrame
  4. Save it as CSV
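
Saving to CSV has one catch with semi-structured sources: records do not always share the same keys. The sketch below assumes the nested response has already been flattened into dotted keys and writes it with the standard `csv` module; the cities and values are invented.

```python
import csv
import io

# Hypothetical flattened API records; the second one lacks "wind.speed".
records = [
    {"name": "Chennai", "main.temp": 31.5, "wind.speed": 4.1},
    {"name": "Madurai", "main.temp": 33.0},
]

# Collect every key seen, preserving first-seen order, to build the header.
fieldnames = list(dict.fromkeys(key for record in records for key in record))

# Swap the buffer for open("weather.csv", "w", newline="") to write a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

Missing values become empty cells through `restval`, so every row ends up with the same columns, which is exactly the move from semi-structured to structured.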

Step 3 (Unstructured Data - 10 min):

  1. Download 10 images (any category, from Google Images)
  2. Check the image sizes with PIL/Pillow
  3. Convert them to NumPy arrays (a numerical representation)
  4. Print the array dimensions (this is the unstructured → numerical transformation)

Bonus: Compare the three outputs: structured (table rows), semi-structured (hierarchy), unstructured (pixel arrays)!


Learning: Different formats need different processing tools. That's the real world! 🎯

💼 Interview Questions

Q1: People say AI's power lies in unstructured data. Can you explain?

A: Unstructured data (images, text, audio) contains hidden patterns, and deep learning models extract them. Examples: detecting disease from a patient's X-ray, or classifying an email as spam or not spam. Patterns that a human can spot by eye, a neural network can learn to detect with high accuracy!


Q2: What are the considerations when choosing a data type?

A: Analyze the project requirements. Is there only structured data? A SQL database is enough. Mixed types? A data lake is better. Weigh cost against performance: structured queries are fast but the storage is expensive, while unstructured storage is cheap but processing is slow. Start with what you have, and optimize if needed.


Q3: Is it possible to convert structured → unstructured?

A: Possible! Example: convert customer CSV data into embeddings and search them in a vector store (as in RAG applications). The reverse is harder: turning unstructured data into structured data usually needs labeling, which often means humans annotating manually.


Q4: How do storage costs compare for structured vs unstructured data?

A: Storing unstructured data is much cheaper! Object storage such as S3/GCS costs roughly $0.02/GB/month. But processing is expensive: ML needs GPU/TPU servers. Databases for structured data are more expensive (a rough estimate is $5-50/GB/month) but fast to query. There is always a trade-off.
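
To get a feel for the gap, the sketch below applies the rough per-GB figures quoted in the answer above to about 1 TB of data. The prices are illustrative only; actual cloud pricing varies by provider, region, and service tier.

```python
def monthly_cost(size_gb, price_per_gb):
    """Monthly storage cost in dollars for a given size and per-GB price."""
    return size_gb * price_per_gb

size_gb = 1000  # about 1 TB of data

object_storage = monthly_cost(size_gb, 0.02)  # rough object-storage figure
database_low = monthly_cost(size_gb, 5)       # low end of the rough database range
database_high = monthly_cost(size_gb, 50)     # high end of the rough database range

print(f"Object storage: ${object_storage:,.2f}/month")
print(f"Database:       ${database_low:,.2f} to ${database_high:,.2f}/month")
```

Storage is only one side of the bill: the GPU time needed to process unstructured data is not included here.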


Q5: What does the distribution of data types look like in a real-world company?

A: Roughly 80-90% is unstructured: images, videos, social posts, emails. But the 10-20% that is structured is critical for business analytics. Modern companies have to handle both: a data lake (unstructured) plus a warehouse (structured). A lakehouse, which combines the two, is often the best fit!

Frequently Asked Questions

❓ What is structured data?
Structured data is organized in rows and columns like Excel sheets or database tables. Examples: customer names, phone numbers, transaction amounts.
❓ What is unstructured data?
Unstructured data has no predefined format. Examples: images, videos, emails, social media posts, audio files.
❓ Which data type is more common?
Unstructured data makes up about 80-90% of all data generated worldwide. Images, videos, and text documents dominate.
❓ Can AI process unstructured data?
Yes! Modern AI (especially deep learning) excels at processing unstructured data: image recognition, NLP for text, speech recognition for audio.