Data types (structured/unstructured)
Introduction
Every day you take photos on your phone, send messages on WhatsApp, and enter data in Excel. These are all different types of data!
Understanding data types is very important in the AI world, because different data types need different processing methods. An Excel table and an Instagram photo cannot be processed the same way!
This article covers structured, unstructured, and semi-structured data, and makes clear what role each one plays in AI.
Three Types of Data
Data can be divided into 3 main types:
1. Structured Data
- Rows and columns la organized (like Excel)
- Fixed schema: every record has the same format
- Examples: Database tables, CSV files, spreadsheets
- Easy to search, filter, analyze
2. Unstructured Data
- No predefined format
- Free-form, messy, varied
- Examples: Images, videos, emails, PDFs, audio
- Hard to search; usually needs AI to process
3. Semi-structured Data
- Has some structure, but no rigid schema
- Uses tags or markers
- Examples: JSON, XML, HTML, email headers
- Flexible but still parseable
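To make the three categories concrete, here is a minimal sketch that holds similar order information in each shape. The order data, field names, and review text are made up for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
csv_text = "order_id,customer,amount\n101,Asha,250\n102,Ravi,120\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys act as markers, but records may differ in shape.
json_text = '[{"order_id": 101, "customer": {"name": "Asha"}, "tags": ["gift"]}, {"order_id": 102}]'
records = json.loads(json_text)

# Unstructured: free text with no fields at all.
review = "Asha said the parcel arrived late but the product was great."

print(rows[0]["amount"])               # access a field by column name
print(records[0]["customer"]["name"])  # walk the nested keys
print("late" in review.lower())        # without a model, only crude text matching is possible
```

Note how the second JSON record simply omits the `customer` key. A CSV row could not do that without breaking the schema.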
Fun fact: By commonly cited estimates, 80-90% of the data generated worldwide is unstructured, and only 10-20% is structured. The AI revolution rests on the ability to process unstructured data!
Analogy: Library vs Attic vs Filing Cabinet
Let's compare the data types to things from real life:
Structured Data = Library
- Every book has a catalog number
- Organized by genre, author, year
- Easy to find things: just search the catalog!
Unstructured Data = Attic
- Old photos, letters, random items dumped
- No organization, no labels
- Finding anything takes time; you have to sort manually
Semi-structured Data = Filing Cabinet with Sticky Notes
- There are files, and some have labels
- But the format inside varies: some typed, some handwritten
- Better than the attic, but not as neat as the library
AI's power is finding the valuable items in the attic automatically! That is what deep learning does with unstructured data.
Data Types Processing Architecture
┌────────────────────────────────────────────────┐
│           DATA TYPES IN AI PIPELINE            │
├────────────────────────────────────────────────┤
│                                                │
│    STRUCTURED   SEMI-STRUCTURED UNSTRUCTURED   │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐   │
│   │ Database │   │ JSON     │   │ Images   │   │
│   │ Tables   │   │ XML      │   │ Videos   │   │
│   │ CSV      │   │ Logs     │   │ Audio    │   │
│   └────┬─────┘   └────┬─────┘   └────┬─────┘   │
│        │              │              │         │
│        ▼              ▼              ▼         │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐   │
│   │ SQL      │   │ Parser   │   │ AI/ML    │   │
│   │ Engine   │   │ Engine   │   │ Models   │   │
│   └────┬─────┘   └────┬─────┘   └────┬─────┘   │
│        │              │              │         │
│        └──────────────┼──────────────┘         │
│                       ▼                        │
│               ┌────────────────┐               │
│               │  UNIFIED DATA  │               │
│               │    STORAGE     │               │
│               │  (Data Lake)   │               │
│               └────────────────┘               │
└────────────────────────────────────────────────┘
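The first decision in the pipeline above is which branch a piece of data should take. A minimal sketch of that routing step follows, assuming the file extension is a good enough hint. The `ROUTES` table and the `route` function are hypothetical names, not part of any library.

```python
from pathlib import Path

# Hypothetical mapping from file extension to (data type, first processing stage).
ROUTES = {
    ".csv": ("structured", "SQL engine"),
    ".json": ("semi-structured", "parser"),
    ".xml": ("semi-structured", "parser"),
    ".log": ("semi-structured", "parser"),
    ".jpg": ("unstructured", "AI/ML model"),
    ".mp4": ("unstructured", "AI/ML model"),
    ".wav": ("unstructured", "AI/ML model"),
}

def route(filename: str) -> tuple:
    """Return (data type, processing stage) for a file, based only on its extension."""
    suffix = Path(filename).suffix.lower()
    # Unknown extensions fall back to unstructured, the catch-all category.
    return ROUTES.get(suffix, ("unstructured", "AI/ML model"))

for name in ["sales.csv", "weather.json", "xray.JPG", "notes.txt"]:
    data_type, stage = route(name)
    print(f"{name}: {data_type} -> {stage}")
```

A real pipeline would usually inspect the content as well, since an extension can be wrong or missing.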
Detailed Examples with AI Context
Let's look at real AI use cases for each data type:
Structured Data + AI
- Customer purchase history → Recommendation engine
- Stock prices table → Price prediction model
- Sensor readings → Anomaly detection
- Student marks → Performance prediction
Unstructured Data + AI
- X-ray images → Disease detection (Computer Vision)
- Customer emails → Sentiment analysis (NLP)
- Call recordings → Speech-to-text, keyword extraction
- Social media posts → Trend analysis, brand monitoring
Semi-structured Data + AI
- JSON API responses → Data extraction pipelines
- Server logs → Error pattern detection
- HTML web pages → Web scraping for training data
- IoT sensor JSON → Real-time monitoring dashboards
Key insight: Modern AI's superpower is extracting structured insights from unstructured data!
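As a rough illustration of turning unstructured text into structured rows, the sketch below scores customer emails with a tiny keyword list. This is a deliberately crude stand-in for a trained NLP model. The word lists, the `to_record` function, and the sample emails are all assumptions made for the example.

```python
import re

# Tiny hand-written word lists: an assumption for illustration, not a real sentiment lexicon.
POSITIVE = {"great", "good", "love", "fast"}
NEGATIVE = {"late", "broken", "bad", "slow"}

def to_record(email_text: str) -> dict:
    """Turn free text into a fixed-shape record (one row of a table)."""
    words = re.findall(r"[a-z']+", email_text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"word_count": len(words), "score": score, "sentiment": label}

emails = [
    "The delivery was late and the box was broken.",
    "Great product, I love it!",
]
table = [to_record(e) for e in emails]
for row in table:
    print(row)
```

Every email, whatever its length or wording, comes out with the same three fields, so the result can be stored and queried like any other structured table.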
Side-by-Side Comparison
Let's compare all three types in one table:
| Feature | Structured | Semi-structured | Unstructured |
|---|---|---|---|
| Format | Rows & Columns | Tags/Markers | No format |
| Schema | Fixed | Flexible | None |
| Storage | RDBMS | NoSQL/Document DB | Data Lake/Blob |
| Search | Easy (SQL) | Medium (queries) | Hard (needs AI) |
| Example | Excel, MySQL | JSON, XML | Photos, Videos |
| Share of data (rough estimates) | ~10-20% | ~5-10% | ~80% |
| AI processing | Traditional ML | Parsers + ML | Deep Learning |
| Query language | SQL | JSONPath, XPath | Vector search |
Takeaway: Structured data is easy to handle but relatively rare. Unstructured data is abundant but hard to process. AI bridges this gap!
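The "Search" and "Query language" rows can be seen side by side in a small sketch. It answers the same question once with SQL over a table and once by walking a JSON document. Plain dictionary traversal is used here in place of JSONPath, and the sales figures are made up.

```python
import json
import sqlite3

# Structured: a fixed-schema table queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("pen", 10), ("book", 50), ("pen", 15)])
pen_total = conn.execute("SELECT SUM(amount) FROM sales WHERE product = 'pen'").fetchone()[0]
conn.close()

# Semi-structured: the same question answered by walking keys, tolerating missing fields.
doc = json.loads(
    '{"sales": [{"product": "pen", "amount": 10}, {"product": "book"}, {"product": "pen", "amount": 15}]}'
)
pen_total_json = sum(s.get("amount", 0) for s in doc["sales"] if s.get("product") == "pen")

print(pen_total, pen_total_json)
```

The SQL version relies on the schema guaranteeing an `amount` column, while the JSON version has to decide what a missing `amount` means.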
Industry Use Cases
How different industries use these data types:
Healthcare
- Structured: Patient records, lab results, billing
- Unstructured: X-rays, MRI scans, doctor notes
- AI combines both for more accurate diagnosis!
Retail
- Structured: Sales data, inventory counts, pricing
- Unstructured: Product reviews, customer photos, chat logs
- AI analyzes reviews to improve products
Finance
- Structured: Transaction records, account balances
- Semi-structured: API feeds, regulatory filings (XML)
- Unstructured: News articles, earnings call audio
- AI detects fraud by combining all three!
Social Media
- Mostly unstructured: Posts, images, videos, stories
- Semi-structured: User profiles (JSON), hashtags
- AI powers the entire feed: recommendations, ads, moderation
Challenges with Each Type
Each data type has its own challenges:
Structured Data Challenges
- Schema changes break pipelines
- Scaling relational DBs is expensive
- Rigid: the real world is messy, but tables are neat
Unstructured Data Challenges
- Storage costs: videos and images take up massive space
- Processing often requires expensive GPUs
- Quality varies wildly: blurry photos, noisy audio
- Labeling for AI training is time-consuming and costly
Semi-structured Data Challenges
- Nested data (JSON inside JSON) is hard to flatten
- No standard schema: every source is different
- Parsing errors are common when the format changes
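One way to soften the semi-structured challenges is to parse defensively. The sketch below assumes JSON log lines with hypothetical `level`, `user`, and `msg` fields. It returns `None` for lines it cannot use and falls back to defaults when a field is missing or renamed.

```python
import json
from typing import Optional

def parse_event(line: str) -> Optional[dict]:
    """Parse one JSON log line defensively; return None if it cannot be used."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None  # malformed line, e.g. truncated output
    if not isinstance(event, dict):
        return None  # valid JSON, but not the object shape we expect
    # Nested and optional fields are read with defaults instead of direct indexing.
    user = event.get("user")
    if not isinstance(user, dict):
        user = {}
    return {
        "level": event.get("level", "UNKNOWN"),
        "user_id": user.get("id"),
        "message": event.get("msg") or event.get("message", ""),
    }

lines = [
    '{"level": "ERROR", "user": {"id": 7}, "msg": "timeout"}',
    '{"level": "INFO", "message": "renamed field"}',  # source renamed "msg" to "message"
    '{"level": "WARN", "user": ',                     # truncated line
]
parsed = [parse_event(line) for line in lines]
print(parsed)
```

Returning `None` instead of raising keeps one bad line from stopping the whole batch. A production pipeline would probably also count or log the rejected lines.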
Pro tip: Real-world AI projects almost always deal with ALL THREE types. Handling multi-modal data is a critical data engineering skill!
Tools for Each Data Type
Suitable tools for each data type:
| Data Type | Storage | Processing | AI Tool |
|---|---|---|---|
| Structured | PostgreSQL, MySQL | SQL, pandas | scikit-learn |
| Structured | BigQuery, Redshift | Spark SQL | AutoML |
| Semi-structured | MongoDB, DynamoDB | jq, Python | Custom parsers |
| Semi-structured | Elasticsearch | Logstash | Anomaly detection |
| Unstructured | S3, GCS, HDFS | Spark, Ray | PyTorch, TensorFlow |
| Unstructured | Vector DB (Pinecone) | Embeddings | LLMs, CNNs |
Beginner tip: Start with structured (SQL + pandas), then semi-structured (JSON + Python), finally unstructured (images + PyTorch). Step by step!
Hands-On: Work with All Three Types
Steps to practice:
Exercise 1: Structured Data
- Download any CSV dataset from Kaggle
- Load into pandas: `pd.read_csv('data.csv')`
- Run basic analysis: mean, count, groupby
- Load into SQLite and query with SQL
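Exercise 1 can be sketched end to end as follows. To stay runnable without extra installs, this version uses Python's standard `csv`, `statistics`, and `sqlite3` modules instead of pandas. With pandas, `pd.read_csv` and `DataFrame.groupby` would cover the loading and analysis steps. The inline CSV is a made-up stand-in for a downloaded Kaggle file.

```python
import csv
import io
import sqlite3
import statistics

# Stand-in for a downloaded Kaggle file; the columns and values are made up.
csv_text = """name,subject,marks
Asha,math,82
Ravi,math,74
Asha,science,90
Ravi,science,70
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Basic analysis: count and mean.
marks = [int(r["marks"]) for r in rows]
print("count:", len(marks), "mean:", statistics.mean(marks))

# Load into SQLite and run a GROUP BY query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (name TEXT, subject TEXT, marks INTEGER)")
conn.executemany("INSERT INTO grades VALUES (:name, :subject, :marks)", rows)
by_subject = conn.execute(
    "SELECT subject, AVG(marks) FROM grades GROUP BY subject ORDER BY subject"
).fetchall()
conn.close()
print(by_subject)
```

Using `:memory:` keeps the database in RAM. Passing a file path such as `grades.db` instead would persist the table to disk.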
Exercise 2: Semi-structured Data
- Fetch JSON from a free API (e.g., weather API)
- Parse with Python: `json.loads(response)`, where `response` is the JSON text
- Flatten nested JSON to tabular format
- Store in MongoDB (free Atlas tier)
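The parse-and-flatten steps of Exercise 2 might look like the sketch below. The payload is a made-up string shaped loosely like a weather response and does not follow the schema of any real API. The `flatten` helper is a hypothetical name.

```python
import json

def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value  # lists and scalars are kept as-is
    return flat

# Made-up payload; a real API call would supply this text instead.
response = '{"city": "Chennai", "main": {"temp": 31.5, "humidity": 70}, "wind": {"speed": 4.1}}'
row = flatten(json.loads(response))
print(row)
```

Each flattened dictionary is one table row, so a list of them can be written to CSV or loaded into a DataFrame. In pandas, `pd.json_normalize` performs a similar flattening.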
Exercise 3: Unstructured Data
- Collect 100 images (cats vs dogs from Kaggle)
- Use Python PIL to resize and normalize
- Convert to numerical arrays (numpy)
- Try a simple classification with pre-trained model
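The "convert to numerical arrays" step is the heart of Exercise 3. In practice Pillow would load the image file and NumPy would hold the array. The sketch below imitates the idea with plain nested lists so it runs without any installs. The 2x3 "image" is made up.

```python
# A made-up 2x3 grayscale "image": each value is a pixel intensity from 0 to 255.
image = [
    [0, 128, 255],
    [64, 192, 32],
]

height = len(image)
width = len(image[0])

# Normalize to the 0.0-1.0 range, a common preprocessing step before feeding a model.
normalized = [[pixel / 255 for pixel in row] for row in image]

# Flatten into one feature vector, the shape many simple classifiers expect.
features = [value for row in normalized for value in row]

print((height, width), len(features))
print(min(features), max(features))
```

A real colour photo adds a third dimension for the red, green, and blue channels, but the principle is the same: once the pixels are numbers, a model can work with them.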
Bonus Project: Build a pipeline that combines all three types! Customer data (structured) + reviews (unstructured) + API data (semi-structured) → unified analysis.
Key Takeaways
Recap time!
- Structured = Rows & columns, fixed schema (Excel, SQL tables)
- Unstructured = No format (images, videos, audio, text)
- Semi-structured = Some structure, flexible (JSON, XML, logs)
- An estimated 80-90% of the world's data is unstructured
- AI's superpower = processing unstructured data at scale
- Modern data engineering handles ALL three types
- Different storage & tools for each type
Next article: "Data Flow in AI Apps", where we look at how data flows through an AI application!
Mini Challenge
Challenge: Handle Three Data Types
Practice with the different data types you meet in the real world:
Step 1 (Structured Data, 10 min):
- Download a CSV from Kaggle (for example, student grades)
- Load it into pandas and run basic analysis: mean, count, distribution
- Save it as a SQLite table
Step 2 (Semi-structured Data, 10 min):
- Call a weather API (free API: openweathermap.org)
- Parse the JSON response
- Flatten the nested data and convert it to a pandas DataFrame
- Save it as CSV
Step 3 (Unstructured Data, 10 min):
- Download 10 images (from Google Images, any category)
- Use PIL/Pillow to check the image sizes
- Convert them to NumPy arrays (numerical representation)
- Print the array dimensions (this is the unstructured → numerical transformation)
Bonus: Compare the three outputs: structured (table rows), semi-structured (hierarchy), unstructured (pixel arrays)!
Learning: Data comes in different formats, and each format needs different processing tools. That is the real world!
Interview Questions
Q1: People say AI's power lies in unstructured data. How would you explain that?
A: Unstructured data (images, text, audio) contains hidden patterns, and deep learning models extract them. Examples: detecting disease from a patient X-ray, or classifying email as spam or not spam. Where a pattern is visible to the human eye, the "machine eye" (neural networks) can often learn to detect it with very high accuracy!
Q2: What should you consider when choosing how to handle a data type?
A: Analyze the project requirements. Only structured data? A SQL database is enough. Mixed types? A data lake is a better fit. Weigh cost against performance: structured queries are fast but expensive, while unstructured storage is cheap but slow to process. Start with what you have, and optimize if needed.
Q3: Is it possible to convert structured → unstructured?
A: Possible! Example: customer CSV data can be converted into embeddings and searched in a vector store (RAG applications). The reverse is harder: going from unstructured → structured typically needs labeled data, which often means humans annotating it manually.
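The embedding idea in the answer to Q3 can be sketched in a few lines. Word counts stand in here for a trained embedding model, and a plain list stands in for a vector store. Both are simplifications made for illustration, and the customer rows are made up.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would use a trained embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Structured rows (made-up customer data) rendered as short text before embedding.
customers = [
    {"name": "Asha", "city": "Chennai", "plan": "premium"},
    {"name": "Ravi", "city": "Madurai", "plan": "basic"},
]
texts = [f"{c['name']} lives in {c['city']} on the {c['plan']} plan" for c in customers]
vectors = [embed(t) for t in texts]

# Search: embed the query the same way and pick the most similar row.
query = embed("premium plan customers in Chennai")
scores = [cosine(query, v) for v in vectors]
best = customers[scores.index(max(scores))]
print(best["name"])
```

Word counts only match exact words. Trained embeddings are used in practice because they also place related words and phrasings close together.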
Q4: What about storage cost, structured vs unstructured?
A: Unstructured data is much cheaper to store! S3/GCS cost roughly $0.02/GB/month. But processing is expensive, because ML needs GPU/TPU servers. Databases for structured data are expensive (a rough estimate of $5-50/GB/month) but fast to query. There is always a trade-off.
Q5: What does the distribution of data types look like in a real-world company?
A: 80-90% unstructured! Images, videos, social posts, emails. But the 10-20% that is structured is critical for business analytics. Modern companies have to handle both: a data lake (unstructured) plus a warehouse (structured). A lakehouse, which combines the two, is often the best solution!
Frequently Asked Questions
Which of the following is an example of UNSTRUCTURED data?