Data types (structured/unstructured)
Introduction
Every day you take photos on your phone, send messages on WhatsApp, and enter data in Excel. These are all different types of data! 📱
Understanding data types is essential in the AI world, because different data types need different processing methods. You cannot process an Excel table and an Instagram photo the same way.
In this article we look at all three (structured, unstructured, and semi-structured data) and make clear what role each plays in AI. 🎯
Three Types of Data
Data can be divided into three main types:
1. Structured Data
- Organized in rows and columns (like Excel)
- Fixed schema: every record has the same format
- Examples: database tables, CSV files, spreadsheets
- Easy to search, filter, and analyze
2. Unstructured Data 📷
- No predefined format
- Free-form, messy, varied
- Examples: images, videos, email bodies, PDFs, audio
- Hard to search; usually needs AI to process
3. Semi-structured Data
- Has some structure, but no rigid schema
- Uses tags or markers
- Examples: JSON, XML, HTML, email headers
- Flexible but still parseable
Fun fact: by commonly cited estimates, 80-90% of the data generated worldwide is unstructured and only 10-20% is structured. The AI revolution rests on the ability to process unstructured data! 🤯
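To make the distinction concrete, here is a minimal Python sketch that holds the same order in all three forms. The order data, field names, and customer note are made up for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
csv_text = "order_id,customer,amount\n101,Priya,450.0\n102,Arun,120.5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys act as markers, but records may differ in shape.
json_text = '{"order_id": 101, "customer": {"name": "Priya"}, "tags": ["gift"]}'
record = json.loads(json_text)

# Unstructured: free text; the order id is in there, but nothing marks it.
note = "Hi, my order 101 arrived late but the gift wrap was lovely."

print(rows[0]["amount"])           # direct column access
print(record["customer"]["name"])  # navigate by key
print("101" in note)               # only a crude substring check is possible
```

The CSV and JSON values can be addressed by name, while the free-text note needs some form of extraction before it can be queried.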
Analogy: Library vs Attic vs Filing Cabinet
Let's compare the data types to real life:
Structured Data = Library
- Every book has a catalog number
- Organized by genre, author, year
- Easy to find things: just search the catalog!
📦 Unstructured Data = Attic
- Old photos, letters, random items dumped together
- No organization, no labels
- Finding anything takes time, and you have to sort manually
🗂️ Semi-structured Data = Filing Cabinet with Sticky Notes
- There are files, and some of them have labels
- But the format inside varies: some typed, some handwritten
- Better than the attic, but not as neat as the library
AI's power is that it can find the valuable items in the attic automatically. That's what deep learning does with unstructured data. 🧠
Data Types Processing Architecture
┌─────────────────────────────────────────────────┐
│            DATA TYPES IN AI PIPELINE            │
├─────────────────────────────────────────────────┤
│                                                 │
│   STRUCTURED  SEMI-STRUCTURED   UNSTRUCTURED    │
│  ┌──────────┐   ┌──────────┐    ┌──────────┐    │
│  │ Database │   │   JSON   │    │  Images  │    │
│  │ Tables   │   │   XML    │    │  Videos  │    │
│  │ CSV      │   │   Logs   │    │  Audio   │    │
│  └────┬─────┘   └────┬─────┘    └────┬─────┘    │
│       │              │               │          │
│       ▼              ▼               ▼          │
│  ┌──────────┐   ┌──────────┐    ┌──────────┐    │
│  │   SQL    │   │  Parser  │    │  AI/ML   │    │
│  │  Engine  │   │  Engine  │    │  Models  │    │
│  └────┬─────┘   └────┬─────┘    └────┬─────┘    │
│       │              │               │          │
│       └──────────────┼───────────────┘          │
│                      ▼                          │
│              ┌────────────────┐                 │
│              │  UNIFIED DATA  │                 │
│              │    STORAGE     │                 │
│              │  (Data Lake)   │                 │
│              └────────────────┘                 │
└─────────────────────────────────────────────────┘
Detailed Examples with AI Context
Let's look at real AI use cases for each data type:
Structured Data + AI
- Customer purchase history → Recommendation engine
- Stock prices table → Price prediction model
- Sensor readings → Anomaly detection
- Student marks → Performance prediction
Unstructured Data + AI 📷
- X-ray images → Disease detection (Computer Vision)
- Customer emails → Sentiment analysis (NLP)
- Call recordings → Speech-to-text, keyword extraction
- Social media posts → Trend analysis, brand monitoring
Semi-structured Data + AI
- JSON API responses → Data extraction pipelines
- Server logs → Error pattern detection
- HTML web pages → Web scraping for training data
- IoT sensor JSON → Real-time monitoring dashboards
Key insight: modern AI's superpower is extracting structured insights from unstructured data! 🔥
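As a rough illustration of "unstructured in, structured out", the sketch below turns free-text emails into fixed-schema records. The emails, the keyword lists, and the `to_record` helper are all hypothetical; a production system would use a trained NLP model rather than keyword matching.

```python
import re

# Hypothetical customer emails (unstructured free text).
emails = [
    "Order 4521 arrived broken. Very disappointed, I want a refund.",
    "Thanks! Order 7788 was great, fast delivery and lovely packaging.",
]

# Toy keyword lists; a real system would use a trained sentiment model.
NEGATIVE = {"broken", "disappointed", "refund", "late"}
POSITIVE = {"great", "thanks", "lovely", "fast"}

def to_record(text):
    """Turn one free-text email into a fixed-schema dict."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    match = re.search(r"order\s+(\d+)", text, flags=re.IGNORECASE)
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return {
        "order_id": int(match.group(1)) if match else None,
        "sentiment": sentiment,
    }

records = [to_record(e) for e in emails]
for r in records:
    print(r)
```

Once every email is reduced to the same two fields, the result can be stored and queried like any other table.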
Side-by-Side Comparison
Let's compare all three types in one table:
| Feature | Structured | Semi-structured | Unstructured |
|---|---|---|---|
| Format | Rows & Columns | Tags/Markers | No format |
| Schema | Fixed | Flexible | None |
| Storage | RDBMS | NoSQL/Document DB | Data Lake/Blob |
| Search | Easy (SQL) | Medium (queries) | Hard (needs AI) |
| Example | Excel, MySQL | JSON, XML | Photos, Videos |
| % of data | ~10-20% | ~5-10% | ~80% |
| AI processing | Traditional ML | Parsers + ML | Deep Learning |
| Query language | SQL | JSONPath, XPath | Vector search |
Takeaway: structured data is easy to handle but relatively scarce. Unstructured data is abundant but hard to process. AI bridges this gap!
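The "Vector search" entry in the table may be the least familiar, so here is a toy sketch of the idea: each item is represented by an embedding vector, and a query returns the items whose vectors point in the most similar direction. The file names and the 3-dimensional vectors are invented; real embeddings come from a model and are much larger.

```python
import math

# Hypothetical 3-dimensional embeddings; real embeddings come from a model
# and usually have hundreds of dimensions.
documents = {
    "cat_photo.jpg": [0.9, 0.1, 0.0],
    "dog_photo.jpg": [0.8, 0.3, 0.1],
    "invoice.pdf": [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vector, top_k=2):
    """Return the top_k document names ranked by cosine similarity."""
    ranked = sorted(
        documents,
        key=lambda name: cosine_similarity(query_vector, documents[name]),
        reverse=True,
    )
    return ranked[:top_k]

# A query vector that is close to the "pet photo" region of the space.
results = search([1.0, 0.2, 0.0])
print(results)
```

Unlike a SQL `WHERE` clause, nothing here matches on exact values: the ranking depends only on how close the vectors are.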
Industry Use Cases
Here is how different industries use these data types:
🏥 Healthcare
- Structured: Patient records, lab results, billing
- Unstructured: X-rays, MRI scans, doctor notes
- AI combines both to support more accurate diagnosis
Retail
- Structured: Sales data, inventory counts, pricing
- Unstructured: Product reviews, customer photos, chat logs
- AI analyzes reviews to improve products
🏦 Finance
- Structured: Transaction records, account balances
- Semi-structured: API feeds, regulatory filings (XML)
- Unstructured: News articles, earnings call audio
- AI detects fraud by combining all three!
📱 Social Media
- Mostly unstructured: Posts, images, videos, stories
- Semi-structured: User profiles (JSON), hashtags
- AI powers the entire feed: recommendations, ads, moderation
Challenges with Each Type
Each data type comes with its own challenges:
⚠️ Structured Data Challenges
- Schema changes break pipelines
- Scaling relational DBs is expensive
- Rigid: the real world is messy, while tables are neat
⚠️ Unstructured Data Challenges
- Storage costs: videos and images take up massive space
- Processing requires expensive GPUs
- Quality varies wildly: blurry photos, noisy audio
- Labeling for AI training is time-consuming and costly
⚠️ Semi-structured Data Challenges
- Nested data (JSON inside JSON) is hard to flatten
- No standard schema: every source is different
- Parsing errors are common when the format changes
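One common way to cope with the semi-structured challenges above is to parse defensively and map every source onto one fixed schema. The sketch below assumes two hypothetical sources that name their fields differently, plus one truncated record; the field names and the `normalize` helper are illustrative only.

```python
import json

# Two hypothetical sources describing the same kind of event differently.
raw_events = [
    '{"user": {"id": 7, "name": "Meena"}, "temp_c": 31.5}',
    '{"user_id": 9, "temperature": "29.0"}',
    '{"user": {"id": 3}, "temp_c": 27',  # truncated, invalid JSON
]

def normalize(raw):
    """Map one raw JSON string onto a fixed schema, or return None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # in a real pipeline, log it or route it elsewhere
    user = data.get("user") or {}
    user_id = user.get("id", data.get("user_id"))
    temp = data.get("temp_c", data.get("temperature"))
    return {
        "user_id": user_id,
        "temp_c": float(temp) if temp is not None else None,
    }

clean = [normalize(r) for r in raw_events]
valid = [c for c in clean if c is not None]
print(valid)
```

Returning `None` for unparseable input keeps one bad record from stopping the whole batch, at the cost of having to track what was dropped.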
💡 Pro tip: Real-world AI projects almost always deal with all three types. Handling multi-modal data is a critical data engineering (DE) skill!
Tools for Each Data Type
Suitable tools for each data type:
| Data Type | Storage | Processing | AI Tool |
|---|---|---|---|
| Structured | PostgreSQL, MySQL | SQL, pandas | scikit-learn |
| Structured | BigQuery, Redshift | Spark SQL | AutoML |
| Semi-structured | MongoDB, DynamoDB | jq, Python | Custom parsers |
| Semi-structured | Elasticsearch | Logstash | Anomaly detection |
| Unstructured | S3, GCS, HDFS | Spark, Ray | PyTorch, TensorFlow |
| Unstructured | Vector DB (Pinecone) | Embeddings | LLMs, CNNs |
Beginner tip: Start with structured (SQL + pandas), then semi-structured (JSON + Python), finally unstructured (images + PyTorch). Step by step! 💪
Hands-On: Work with All Three Types
Steps to practice:
Exercise 1: Structured Data
- Download any CSV dataset from Kaggle
- Load it into pandas: `pd.read_csv('data.csv')`
- Run basic analysis: mean, count, groupby
- Load into SQLite and query with SQL
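The steps in Exercise 1 can be sketched with the standard library alone. This version uses an inline, made-up grades table instead of a Kaggle download, and `csv` plus `statistics` in place of pandas, so treat it as a stand-in for the pandas workflow rather than a replacement.

```python
import csv
import io
import sqlite3
import statistics

# Stand-in for a downloaded file; with pandas this would be pd.read_csv(path).
csv_text = """student,subject,marks
Anu,Maths,82
Anu,Science,74
Bala,Maths,65
Bala,Science,91
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Basic analysis: count and mean over one column.
marks = [int(r["marks"]) for r in rows]
print(len(marks), statistics.mean(marks))

# Load the same rows into an in-memory SQLite table and group with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (student TEXT, subject TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO grades VALUES (:student, :subject, :marks)", rows
)
by_subject = conn.execute(
    "SELECT subject, AVG(marks) FROM grades GROUP BY subject ORDER BY subject"
).fetchall()
print(by_subject)
conn.close()
```

Because every row has the same columns, the data loads into a SQL table with no transformation step.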
Exercise 2: Semi-structured Data
- Fetch JSON from a free API (e.g., a weather API)
- Parse with Python: `json.loads(response)`, where `response` is the response body as a string
- Flatten nested JSON to tabular format
- Store in MongoDB (free Atlas tier)
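Here is a minimal sketch of the parse-and-flatten part of Exercise 2. The response text is a made-up payload, not the format of any particular weather API, and the `flatten` helper is one simple way to do it. The MongoDB step is left out.

```python
import csv
import io
import json

# Hypothetical API response; real weather APIs use their own field names.
response_text = """
{"city": "Chennai",
 "readings": [
   {"time": "09:00", "main": {"temp": 30.1, "humidity": 70}},
   {"time": "12:00", "main": {"temp": 33.4, "humidity": 62}}
 ]}
"""

def flatten(obj, prefix=""):
    """Flatten nested dicts into one level with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

data = json.loads(response_text)
# One output row per reading, with the city repeated on each row.
table = [{"city": data["city"], **flatten(r)} for r in data["readings"]]

# Write the flattened rows as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(table[0]))
writer.writeheader()
writer.writerows(table)
print(buffer.getvalue())
```

If you are already using pandas, `pandas.json_normalize` covers a similar flattening step.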
Exercise 3: Unstructured Data 📷
- Collect 100 images (cats vs dogs from Kaggle)
- Use Pillow (PIL) to resize the images, then normalize the pixel values
- Convert to numerical arrays (numpy)
- Try a simple classification with a pre-trained model
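To show what "resize and normalize" does to the pixels, here is a pure-Python sketch on a tiny invented grayscale image. In the actual exercise, Pillow would do the resizing and NumPy would hold the arrays; the nested lists and both helper functions here are simplifications for illustration.

```python
# A hypothetical 4x4 grayscale "image": each value is a pixel in 0..255.
image = [
    [0, 64, 128, 255],
    [0, 64, 128, 255],
    [32, 96, 160, 224],
    [32, 96, 160, 224],
]

def resize_nearest(pixels, new_h, new_w):
    """Resize with nearest-neighbour sampling."""
    old_h, old_w = len(pixels), len(pixels[0])
    return [
        [pixels[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]

def normalize(pixels):
    """Scale 0..255 integers to 0.0..1.0 floats."""
    return [[value / 255 for value in row] for row in pixels]

small = resize_nearest(image, 2, 2)
scaled = normalize(small)
print(small)
print(len(scaled), len(scaled[0]))  # array dimensions: rows, columns
```

After these two steps, every image has the same shape and value range, which is what a model expects as input.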
Bonus Project: Build a pipeline that combines all three types! Customer data (structured) + reviews (unstructured) + API data (semi-structured) → unified analysis.
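One possible shape for the bonus pipeline is sketched below, assuming each source can be keyed by a shared customer id. All the data, the shipping payload, and the keyword rule in `complained` are hypothetical placeholders for real tables, APIs, and models.

```python
import json
import re

# Structured: customer rows (hypothetical data).
customers = [
    {"customer_id": 1, "name": "Kavya", "total_spend": 5400},
    {"customer_id": 2, "name": "Ravi", "total_spend": 1200},
]

# Unstructured: free-text reviews keyed by customer id.
reviews = {
    1: "Loved the quality, will buy again!",
    2: "Delivery was late and the box was damaged.",
}

# Semi-structured: a JSON payload from a hypothetical shipping API.
api_payload = json.loads(
    '{"shipments": [{"customer_id": 1, "status": "delivered"},'
    ' {"customer_id": 2, "status": "delayed"}]}'
)
status_by_id = {s["customer_id"]: s["status"] for s in api_payload["shipments"]}

def complained(text):
    """Toy rule: flag reviews containing complaint keywords."""
    return bool(re.search(r"\b(late|damaged|broken|refund)\b", text.lower()))

# Join all three sources on customer_id into one flat record per customer.
unified = [
    {
        **c,
        "shipping_status": status_by_id.get(c["customer_id"]),
        "complaint": complained(reviews.get(c["customer_id"], "")),
    }
    for c in customers
]
for row in unified:
    print(row)
```

The design choice is to reduce each source to structured fields first and only then join them, so the final analysis runs on a single table.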
✅ Key Takeaways
Recap time!
✅ Structured = rows and columns, fixed schema (Excel, SQL tables)
✅ Unstructured = no predefined format (images, videos, audio, free text)
✅ Semi-structured = some structure, flexible (JSON, XML, logs)
✅ An estimated 80-90% of the world's data is unstructured
✅ AI's superpower = processing unstructured data at scale
✅ Modern data engineering handles ALL three types
✅ Different storage and tools for each type
Next article: "Data Flow in AI Apps", where we look at how data flows through an AI application! 🎯
🎮 Mini Challenge
Challenge: Handle Three Data Types
Practice with the different data types you meet in the real world:
Step 1 (Structured Data - 10 min):
- Download a CSV from Kaggle (for example, student grades)
- Load it with pandas and run basic analysis: mean, count, distribution
- Save it as a SQLite table
Step 2 (Semi-structured Data - 10 min):
- Call a weather API (free API: openweathermap.org)
- Parse the JSON response
- Flatten the nested data and convert it to a pandas DataFrame
- Save it as CSV
Step 3 (Unstructured Data - 10 min):
- Download 10 images (any category, from Google Images)
- Check the image sizes with PIL/Pillow
- Convert them to NumPy arrays (numerical representation)
- Print the array dimensions (this is the unstructured → numerical transformation)
Bonus: Compare the three outputs: structured (table rows), semi-structured (hierarchy), unstructured (pixel arrays)!
Learning: Data comes in different formats, and each format needs different processing tools. That's the real world! 🎯
💼 Interview Questions
Q1: You said AI's power lies in unstructured data. Can you explain?
A: Unstructured data (images, text, audio) contains hidden patterns, and deep learning models can extract them. Examples: detecting disease from a patient X-ray, or classifying email as spam or not spam. Patterns that a human eye can see can often be learned by neural networks (the "machine eye") with high accuracy.
Q2: What should you consider when choosing how to store and handle your data types?
A: Analyze the project requirements. Only structured data? A SQL database is enough. Mixed types? A data lake is a better fit. Weigh cost against performance: structured queries are fast but expensive, while unstructured storage is cheap but slow to process. Start with what you have, and optimize if needed.
Q3: Is it possible to convert structured → unstructured?
A: Possible! Example: convert customer CSV data into embeddings and search them in a vector store (as in RAG applications). The reverse is harder: going from unstructured → structured typically needs labeled data, which often means humans annotating it manually.
Q4: How do storage costs compare for structured vs unstructured data?
A: Storing unstructured data is usually much cheaper: object storage such as S3/GCS costs roughly $0.02/GB/month. Processing is the expensive part, since ML needs GPU/TPU servers. Databases for structured data cost more (roughly $5-50/GB/month, varying widely by provider and setup) but are fast to query. There is always a trade-off.
Q5: What does the distribution of data types look like in a real-world company?
A: Roughly 80-90% is unstructured: images, videos, social posts, emails. But the 10-20% that is structured is critical for business analytics. Modern companies have to handle both: a data lake (unstructured) plus a warehouse (structured). A lakehouse, which combines the two, is often the best fit.
Frequently Asked Questions
Which of the following is an example of UNSTRUCTURED data?