Data Engineering Roadmap
Your guide from foundations to advanced data systems
🏗️ Foundations
Start here if you're new to data engineering
SQL
The language of data. Master SELECT, JOIN, GROUP BY, and window functions.
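A minimal sketch of GROUP BY and a window function, run through Python's built-in sqlite3 module (the table and data are invented for illustration; window functions require SQLite 3.25+):

```python
import sqlite3

# In-memory SQLite database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, amount INTEGER);
INSERT INTO orders VALUES
  ('alice', 100), ('alice', 50), ('bob', 80);
""")

# GROUP BY collapses rows: one total per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)  # [('alice', 150), ('bob', 80)]

# A window function aggregates WITHOUT collapsing rows:
# a running total per customer, row by row.
running = conn.execute("""
    SELECT customer, amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY rowid) AS running_total
    FROM orders ORDER BY rowid
""").fetchall()
print(running)  # [('alice', 100, 100), ('alice', 50, 150), ('bob', 80, 80)]
```

The key distinction to internalize: GROUP BY reduces the row count, while OVER (...) keeps every row and adds an aggregate alongside it.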
Python
Essential for data manipulation, automation, and modern data tools.
Apache Spark (PySpark)
Big data processing framework for distributed computing and large-scale data analysis.
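Spark's core model is map/reduce over partitioned data. A plain-Python sketch of the classic word count (no cluster, no pyspark; in PySpark this would be roughly `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`):

```python
from collections import defaultdict

# Invented input lines standing in for a distributed dataset.
lines = ["big data", "big pipelines", "data pipelines"]

# flatMap: split every line into words.
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1.
pairs = [(w, 1) for w in words]

# reduceByKey: sum counts per word (Spark does this per partition,
# then shuffles partial results across the cluster).
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 2, 'pipelines': 2}
```

Spark's value is running exactly this shape of computation across many machines, with the shuffle between map and reduce handled for you.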
Linux
Command-line skills and shell scripting for data pipeline operations.
⚙️ Core Data Engineering
Essential concepts and tools for data professionals
Data Modeling
Designing schemas, relationships, and data architectures.
ETL / ELT
Extract-Transform-Load and Extract-Load-Transform patterns for modern data pipelines.
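The E-T-L steps in miniature, using only the standard library (the CSV payload and schema are made up for illustration):

```python
import csv
import io
import sqlite3

# Extract: read raw records from a source (here, a CSV string).
raw = "id,amount\n1,10\n2,-3\n3,25\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cast types and drop invalid (negative) amounts.
clean = [(int(r["id"]), int(r["amount"])) for r in rows if int(r["amount"]) >= 0]

# Load: write the cleaned rows into a warehouse table (in-memory SQLite).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (id INTEGER, amount INTEGER)")
db.executemany("INSERT INTO payments VALUES (?, ?)", clean)

total = db.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(total)  # 35
```

ELT reorders the same steps: load the raw rows first, then transform inside the warehouse (typically in SQL, e.g. with dbt).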
🤖 AI/ML Data Engineering
Optional: For engineers interested in ML infrastructure and pipelines
Vector Databases
Pinecone, Weaviate, Chroma - Storage for AI embeddings and semantic search.
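The core operation of a vector database is nearest-neighbor search over embeddings by similarity. A toy sketch with invented 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and real systems use approximate indexes like HNSW rather than a linear scan):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical document embeddings.
index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.5, 0.5, 0.0],
    "doc_tax":  [0.0, 0.1, 0.9],
}

query = [0.85, 0.15, 0.05]  # pretend embedding of the query "pets"
best = max(index, key=lambda k: cosine(query, index[k]))
print(best)  # doc_cats
```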
Feature Stores
Feast, Tecton - Centralized feature management for ML models.
ML Pipelines
MLflow, Kubeflow, SageMaker - Orchestrate ML training and deployment.
Model Serving
BentoML, TorchServe, KServe - Deploy and serve ML models at scale.
⚡ Real-time Streaming
Modern stream processing and real-time data systems
Stream Processing
Kafka, Flink, Pulsar - Real-time data processing and event streaming.
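A basic stream-processing operation is the tumbling-window aggregate: bucket events by time and aggregate per bucket. A sketch with invented events and a 5-second window (engines like Flink add watermarks, state backends, and exactly-once guarantees on top of this idea):

```python
from collections import defaultdict

# Hypothetical event stream with timestamps in seconds.
events = [
    {"ts": 0, "user": "a"},
    {"ts": 3, "user": "b"},
    {"ts": 7, "user": "a"},
    {"ts": 12, "user": "c"},
]
WINDOW = 5  # tumbling window size in seconds

# Assign each event to its window and count events per window.
windows = defaultdict(int)
for e in events:
    windows[e["ts"] // WINDOW] += 1

print(dict(windows))  # {0: 2, 1: 1, 2: 1}
```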
Change Data Capture
Debezium, Fivetran CDC - Capture database changes in real-time.
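CDC turns table state into a stream of change events. A snapshot-diff sketch with invented rows shows what those events look like; production tools like Debezium instead tail the database's transaction log, which is cheaper and catches every intermediate change:

```python
# Two snapshots of a hypothetical users table, keyed by primary key.
before = {1: "alice", 2: "bob", 3: "carol"}
after  = {1: "alice", 2: "bobby", 4: "dave"}

changes = []
for key in after.keys() - before.keys():
    changes.append(("insert", key, after[key]))
for key in before.keys() - after.keys():
    changes.append(("delete", key, before[key]))
for key in before.keys() & after.keys():
    if before[key] != after[key]:
        changes.append(("update", key, after[key]))

print(sorted(changes))
# [('delete', 3, 'carol'), ('insert', 4, 'dave'), ('update', 2, 'bobby')]
```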
Real-time Analytics
Druid, Pinot, ClickHouse - Low-latency analytics on streaming data.
🔒 Data Quality & Governance
Essential practices for reliable data systems
Data Observability
Monte Carlo, Great Expectations - Monitor and ensure data quality.
Data Contracts
Define and enforce data schemas and quality standards.
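A data contract is, at minimum, a schema check at the pipeline boundary. A tiny sketch with an illustrative contract (real contract tooling adds versioning, semantic rules, and producer/consumer negotiation):

```python
# Hypothetical agreed schema: field name -> required Python type.
CONTRACT = {"user_id": int, "email": str, "age": int}

def violations(record):
    """Return a list of contract violations for one record."""
    errs = []
    for field, expected in CONTRACT.items():
        if field not in record:
            errs.append(f"missing {field}")
        elif not isinstance(record[field], expected):
            errs.append(f"{field}: expected {expected.__name__}")
    return errs

good = {"user_id": 1, "email": "a@b.co", "age": 30}
bad  = {"user_id": "1", "email": "a@b.co"}
print(violations(good))  # []
print(violations(bad))   # ['user_id: expected int', 'missing age']
```

Records failing the check get rejected or quarantined before they can corrupt downstream tables.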
Privacy Engineering
PII handling, GDPR compliance, and data privacy practices.
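Two everyday PII patterns, sketched with the standard library: masking for display, and salted-hash pseudonymization so records stay joinable without exposing the raw value. The salt here is illustrative; in practice it lives in a secrets manager and rotates:

```python
import hashlib

SALT = b"rotate-me"  # illustrative only; never hardcode a real salt

def pseudonymize(email: str) -> str:
    # Deterministic token: the same email always maps to the same token,
    # so joins across datasets still work.
    return hashlib.sha256(SALT + email.lower().encode()).hexdigest()[:12]

def mask(email: str) -> str:
    # Keep just enough for a human to recognize their own address.
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

print(mask("jane.doe@example.com"))  # j***@example.com
# Case-insensitive: both spellings map to the same token.
assert pseudonymize("Jane.Doe@example.com") == pseudonymize("jane.doe@example.com")
```

Note that deterministic pseudonymization is reversible by dictionary attack if the salt leaks, which is why GDPR still treats pseudonymized data as personal data.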
💰 Cloud Economics
Cost optimization and financial operations
Cost Optimization
Spot instances, autoscaling, and resource-efficiency strategies.
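The spot-instance decision is back-of-envelope arithmetic: discounted hourly price plus the cost of redoing interrupted work, versus the on-demand price. The numbers below are invented for illustration, not real cloud pricing:

```python
# Illustrative prices and interruption overhead (not real pricing).
on_demand_hourly = 1.00
spot_hourly = 0.30
interruption_overhead = 0.10  # fraction of work redone after spot reclaims

hours = 1000
on_demand_cost = on_demand_hourly * hours
spot_cost = spot_hourly * hours * (1 + interruption_overhead)

savings = 1 - spot_cost / on_demand_cost
print(f"{savings:.0%}")  # 67%
```

The same framing applies to autoscaling: the saving is idle capacity avoided, minus the overhead of scale-up latency and churn.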
Serverless Trade-offs
When serverless makes sense versus provisioned or self-managed services, and the trade-offs in cost, latency, and control.
FinOps Basics
Cost monitoring, budgeting, and financial accountability.
🏢 Data Engineering Layers
Understanding the complete data lifecycle
Data Sources
Where data originates and how to access it.
Data Ingestion
Moving data from sources to storage systems.
Data Processing
Transforming raw data into valuable insights.
Data Storage
Storing processed data for analysis and consumption.
Data Consumption
Making data accessible to users and applications.
🛠️ Explore Tools
Find the right tools for each layer
🎯 Learning Paths
Common journeys and recommendations
Beginner Path
SQL → Python → Batch Processing (PySpark) → Basic ETL → Simple BI → Data Quality Basics → Streaming Fundamentals
Analytics Engineer
Advanced SQL → Data Modeling → Advanced Batch Processing (dbt) → Real-time Analytics → BI Tools → ML Data Prep
Data Platform Engineer
Python → Batch Architecture (Spark) → Streaming Architecture → Orchestration → Cloud Architecture → DevOps → ML Ops → Cost Optimization
ML Data Engineer
SQL → Python → Batch Processing → Vector Databases → Feature Stores → ML Pipelines → Model Serving