Data Engineering Roadmap
Your guide from foundations to advanced data systems
🏗️ Foundations
Start here if you're new to data engineering
SQL
The language of data. Master SELECT, JOIN, GROUP BY, and window functions.
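A minimal sketch of GROUP BY and a window function, run through Python's built-in sqlite3 module (the table and data are invented for illustration; window functions require SQLite 3.25+):

```python
import sqlite3

# In-memory SQLite database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, amount INTEGER);
INSERT INTO orders VALUES
  ('alice', 100), ('alice', 50), ('bob', 80);
""")

# GROUP BY collapses rows: one total per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)  # [('alice', 150), ('bob', 80)]

# A window function aggregates WITHOUT collapsing rows:
# a running total per customer, row by row.
running = conn.execute("""
    SELECT customer, amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY rowid) AS running_total
    FROM orders ORDER BY rowid
""").fetchall()
print(running)  # [('alice', 100, 100), ('alice', 50, 150), ('bob', 80, 80)]
```

The key distinction to internalize: GROUP BY reduces the row count, while OVER (...) keeps every row and adds an aggregate alongside it.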
Python
Essential for data manipulation, automation, and modern data tools.
Apache Spark (PySpark)
Big data processing framework for distributed computing and large-scale data analysis.
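Spark's core model is map/reduce over partitioned data. A plain-Python sketch of the classic word count (no cluster, no pyspark; in PySpark this would be roughly `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`):

```python
from collections import defaultdict

# Invented input lines standing in for a distributed dataset.
lines = ["big data", "big pipelines", "data pipelines"]

# flatMap: split every line into words.
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1.
pairs = [(w, 1) for w in words]

# reduceByKey: sum counts per word (Spark does this per partition,
# then shuffles partial results across the cluster).
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 2, 'pipelines': 2}
```

Spark's value is running exactly this shape of computation across many machines, with the shuffle between map and reduce handled for you.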
Linux
Command-line skills and shell scripting for data pipeline operations.
⚙️ Core Data Engineering
Essential concepts and tools for data professionals
Data Modeling
Designing schemas, relationships, and data architectures.
ETL / ELT
Extract-Transform-Load and Extract-Load-Transform patterns for modern data pipelines.
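The E-T-L steps in miniature, using only the standard library (the CSV payload and schema are made up for illustration):

```python
import csv
import io
import sqlite3

# Extract: read raw records from a source (here, a CSV string).
raw = "id,amount\n1,10\n2,-3\n3,25\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cast types and drop invalid (negative) amounts.
clean = [(int(r["id"]), int(r["amount"])) for r in rows if int(r["amount"]) >= 0]

# Load: write the cleaned rows into a warehouse table (in-memory SQLite).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (id INTEGER, amount INTEGER)")
db.executemany("INSERT INTO payments VALUES (?, ?)", clean)

total = db.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(total)  # 35
```

ELT reorders the same steps: load the raw rows first, then transform inside the warehouse (typically in SQL, e.g. with dbt).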
🤖 AI/ML Data Engineering
Optional: For engineers interested in ML infrastructure and pipelines
Vector Databases
Pinecone, Weaviate, Chroma - Storage for AI embeddings and semantic search.
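The core operation of a vector database is nearest-neighbor search over embeddings by similarity. A toy sketch with invented 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and real systems use approximate indexes like HNSW rather than a linear scan):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical document embeddings.
index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.5, 0.5, 0.0],
    "doc_tax":  [0.0, 0.1, 0.9],
}

query = [0.85, 0.15, 0.05]  # pretend embedding of the query "pets"
best = max(index, key=lambda k: cosine(query, index[k]))
print(best)  # doc_cats
```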
Feature Stores
Feast, Tecton - Centralized feature management for ML models.
ML Pipelines
MLflow, Kubeflow, SageMaker - Orchestrate ML training and deployment.
Model Serving
BentoML, TorchServe, KServe - Deploy and serve ML models at scale.
⚡ Real-time Streaming
Modern stream processing and real-time data systems
Stream Processing
Kafka, Flink, Pulsar - Real-time data processing and event streaming.
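A basic stream-processing operation is the tumbling-window aggregate: bucket events by time and aggregate per bucket. A sketch with invented events and a 5-second window (engines like Flink add watermarks, state backends, and exactly-once guarantees on top of this idea):

```python
from collections import defaultdict

# Hypothetical event stream with timestamps in seconds.
events = [
    {"ts": 0, "user": "a"},
    {"ts": 3, "user": "b"},
    {"ts": 7, "user": "a"},
    {"ts": 12, "user": "c"},
]
WINDOW = 5  # tumbling window size in seconds

# Assign each event to its window and count events per window.
windows = defaultdict(int)
for e in events:
    windows[e["ts"] // WINDOW] += 1

print(dict(windows))  # {0: 2, 1: 1, 2: 1}
```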
Change Data Capture
Debezium, Fivetran CDC - Capture database changes in real-time.
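CDC turns table state into a stream of change events. A snapshot-diff sketch with invented rows shows what those events look like; production tools like Debezium instead tail the database's transaction log, which is cheaper and catches every intermediate change:

```python
# Two snapshots of a hypothetical users table, keyed by primary key.
before = {1: "alice", 2: "bob", 3: "carol"}
after  = {1: "alice", 2: "bobby", 4: "dave"}

changes = []
for key in after.keys() - before.keys():
    changes.append(("insert", key, after[key]))
for key in before.keys() - after.keys():
    changes.append(("delete", key, before[key]))
for key in before.keys() & after.keys():
    if before[key] != after[key]:
        changes.append(("update", key, after[key]))

print(sorted(changes))
# [('delete', 3, 'carol'), ('insert', 4, 'dave'), ('update', 2, 'bobby')]
```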
Real-time Analytics
Druid, Pinot, ClickHouse - Low-latency analytics on streaming data.
🔒 Data Quality & Governance
Essential practices for reliable data systems
Data Observability
Monte Carlo, Great Expectations - Monitor and ensure data quality.
Data Contracts
Define and enforce data schemas and quality standards.
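A data contract is, at minimum, a schema check at the pipeline boundary. A tiny sketch with an illustrative contract (real contract tooling adds versioning, semantic rules, and producer/consumer negotiation):

```python
# Hypothetical agreed schema: field name -> required Python type.
CONTRACT = {"user_id": int, "email": str, "age": int}

def violations(record):
    """Return a list of contract violations for one record."""
    errs = []
    for field, expected in CONTRACT.items():
        if field not in record:
            errs.append(f"missing {field}")
        elif not isinstance(record[field], expected):
            errs.append(f"{field}: expected {expected.__name__}")
    return errs

good = {"user_id": 1, "email": "a@b.co", "age": 30}
bad  = {"user_id": "1", "email": "a@b.co"}
print(violations(good))  # []
print(violations(bad))   # ['user_id: expected int', 'missing age']
```

Records failing the check get rejected or quarantined before they can corrupt downstream tables.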
Privacy Engineering
PII handling, GDPR compliance, and data privacy practices.
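Two everyday PII patterns, sketched with the standard library: masking for display, and salted-hash pseudonymization so records stay joinable without exposing the raw value. The salt here is illustrative; in practice it lives in a secrets manager and rotates:

```python
import hashlib

SALT = b"rotate-me"  # illustrative only; never hardcode a real salt

def pseudonymize(email: str) -> str:
    # Deterministic token: the same email always maps to the same token,
    # so joins across datasets still work.
    return hashlib.sha256(SALT + email.lower().encode()).hexdigest()[:12]

def mask(email: str) -> str:
    # Keep just enough for a human to recognize their own address.
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

print(mask("jane.doe@example.com"))  # j***@example.com
# Case-insensitive: both spellings map to the same token.
assert pseudonymize("Jane.Doe@example.com") == pseudonymize("jane.doe@example.com")
```

Note that deterministic pseudonymization is reversible by dictionary attack if the salt leaks, which is why GDPR still treats pseudonymized data as personal data.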
💰 Cloud Economics
Cost optimization and financial operations
Cost Optimization
Spot instances, autoscaling, and resource-efficiency strategies.
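The spot-instance decision is back-of-envelope arithmetic: discounted hourly price plus the cost of redoing interrupted work, versus the on-demand price. The numbers below are invented for illustration, not real cloud pricing:

```python
# Illustrative prices and interruption overhead (not real pricing).
on_demand_hourly = 1.00
spot_hourly = 0.30
interruption_overhead = 0.10  # fraction of work redone after spot reclaims

hours = 1000
on_demand_cost = on_demand_hourly * hours
spot_cost = spot_hourly * hours * (1 + interruption_overhead)

savings = 1 - spot_cost / on_demand_cost
print(f"{savings:.0%}")  # 67%
```

The same framing applies to autoscaling: the saving is idle capacity avoided, minus the overhead of scale-up latency and churn.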
Serverless Trade-offs
When serverless makes sense versus provisioned or self-managed services, and the trade-offs in cost, latency, and control.
FinOps Basics
Cost monitoring, budgeting, and financial accountability.
🏢 Data Engineering Layers
Understanding the complete data lifecycle
Data Sources
Where data originates and how to access it.
Data Ingestion
Moving data from sources to storage systems.
Data Processing
Transforming raw data into valuable insights.
Data Storage
Storing processed data for analysis and consumption.
Data Consumption
Making data accessible to users and applications.
🛠️ Explore Tools
Find the right tools for each layer
🎯 Learning Paths
Common journeys and recommendations
Beginner Path
SQL → Python → Batch Processing (PySpark) → Basic ETL → Simple BI → Data Quality Basics → Streaming Fundamentals
Analytics Engineer
Advanced SQL → Data Modeling → Advanced Batch Processing (dbt) → Real-time Analytics → BI Tools → ML Data Prep
Data Platform Engineer
Python → Batch Architecture (Spark) → Streaming Architecture → Orchestration → Cloud Architecture → DevOps → ML Ops → Cost Optimization
ML Data Engineer
SQL → Python → Batch Processing → Vector Databases → Feature Stores → ML Pipelines → Model Serving