
Tools by Cloud Provider

Amazon Web Services (AWS)

5 tools

| Tool | Type | Layer | Use case |
| --- | --- | --- | --- |
| Amazon S3 | Data Lake | Data Storage | Object storage, data lakes |
| Amazon Redshift | Petabyte-scale | Data Storage | Complex analytical queries |
| Amazon Kinesis | Streaming | Data Ingestion | AWS real-time data |
| AWS Glue | Serverless | Data Processing | Serverless ETL, data catalog |
| Amazon QuickSight | BI Tool | Data Consumption | AWS-native BI, dashboards |

Google Cloud Platform (GCP)

5 tools

| Tool | Type | Layer | Use case |
| --- | --- | --- | --- |
| Google Cloud Storage | Data Lake | Data Storage | Object storage, analytics |
| Google BigQuery | Serverless | Data Storage | Large-scale analytics |
| Google Pub/Sub | Streaming | Data Ingestion | Event streaming, messaging |
| Google Dataflow | Serverless | Data Processing | Stream and batch processing |
| Google Looker | BI Tool | Data Consumption | Embedded analytics, modeling |

Microsoft Azure

5 tools

| Tool | Type | Layer | Use case |
| --- | --- | --- | --- |
| Azure Data Lake Storage | Data Lake | Data Storage | Enterprise data lakes |
| Azure Synapse | Analytics | Data Storage | Integrated analytics, data warehousing |
| Azure Event Hubs | Streaming | Data Ingestion | Real-time data ingestion |
| Azure Databricks | Analytics | Data Processing | Big data analytics, ML |
| Power BI | BI Tool | Data Consumption | Enterprise reporting, Office integration |

Multi-Cloud & Cloud Agnostic

6 tools

| Tool | Type | Layer | Use case |
| --- | --- | --- | --- |
| Pinecone | Cloud Service | Data Processing | AI embeddings, semantic search |
| Monte Carlo | Cloud Service | Data Processing | Data quality monitoring |
| Snowflake | Cloud Warehouse | Data Storage | Enterprise analytics |
| Databricks | Analytics | Data Processing | Big data, ML workflows |
| Fivetran | Managed | Data Ingestion | SaaS data integration |
| Tableau | BI Tool | Data Consumption | Interactive dashboards |

Open Source & Self-Hosted

22 tools

| Tool | Type | Layer | Use case |
| --- | --- | --- | --- |
| Apache Kafka | Distributed | Data Ingestion | Event streaming, messaging |
| Apache Flink | Distributed | Data Processing | Real-time analytics, low-latency |
| Apache Druid | Distributed | Data Storage | Fast analytics on streaming data |
| Apache Airflow | Batch | Data Ingestion | ETL pipeline orchestration |
| Prefect | Modern | Data Ingestion | Python-native workflows |
| Dagster | Data-aware | Data Ingestion | Asset governance, lineage |
| Apache Spark | Batch + Streaming | Data Processing | Big data transformations |
| Apache Flink | Streaming | Data Processing | Real-time analytics, low-latency |
| dbt | SQL-first | Data Processing | Data warehouse transforms |
| Pandas | In-memory | Data Processing | Data analysis, cleaning |
| Great Expectations | Validation | Data Processing | Data quality testing |
| Delta Lake | Lakehouse | Data Storage | ACID transactions, reliability |
| Apache Iceberg | Lakehouse | Data Storage | Schema evolution, analytics |
| Metabase | Open Source | Data Consumption | Self-service analytics |
| Superset | Open Source | Data Consumption | Interactive dashboards |
| Jupyter | Analysis | Data Consumption | Data exploration |
| PostgreSQL | Relational | Data Sources | Application databases, ACID compliance |
| MySQL | Relational | Data Sources | Web applications, high performance |
| MongoDB | NoSQL | Data Sources | Flexible schemas, unstructured data |
| Docker | Containerization | Data Engineering Ecosystem | Application deployment |
| Kubernetes | Orchestration | Data Engineering Ecosystem | Container management |
| Terraform | Infrastructure as Code | Data Engineering Ecosystem | Cloud resource management |

Tool Comparisons

Workflow Orchestration

| Tool | Best For | Key Feature | Learning Curve |
| --- | --- | --- | --- |
| Airflow (Mature) | Batch ETL, established teams | Large community, extensive integrations | Medium |
| Prefect (Modern) | Python-native workflows | Dynamic scaling, modern UI | Low-Medium |
| Dagster (Data-aware) | Asset governance, lineage | Data assets, software-defined assets | Medium-High |

Quick Recommendation:

  • Choose Airflow for established ETL workflows and large teams
  • Choose Prefect for Python-first development and modern workflows
  • Choose Dagster for complex data pipelines needing governance
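
All three orchestrators share one core idea: a pipeline is a set of tasks with dependencies, and tasks run in dependency order. The sketch below illustrates that idea with the Python standard library only. It does not use the Airflow, Prefect, or Dagster APIs, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}


def run_pipeline(dag):
    """Visit tasks in dependency order and return the order they ran in."""
    order = []
    for task in TopologicalSorter(dag).static_order():
        # A real orchestrator would schedule, retry, and log each task here.
        order.append(task)
    return order


execution_order = run_pipeline(pipeline)
print(execution_order)
```

What the tools add on top of this ordering (scheduling, retries, UI, lineage) is where the comparison above matters.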

Data Warehouses

| Tool | Best For | Key Feature | Cost Model |
| --- | --- | --- | --- |
| Snowflake (Multi-cloud) | Enterprise analytics, multi-cloud | Automatic scaling, data sharing | Pay-per-use (credits) |
| BigQuery (Serverless) | Large-scale analytics, ML | Serverless, ML integration | Pay-per-query + storage |
| Redshift (AWS Native) | AWS ecosystem, petabyte-scale | AWS integration, concurrency scaling | Node-based + on-demand |

Quick Recommendation:

  • Choose Snowflake for multi-cloud strategy and enterprise features
  • Choose BigQuery for serverless operations and Google ecosystem
  • Choose Redshift for AWS-native workloads and cost control
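
The cost models in the table trade off differently as query volume grows. As a rough sketch, the snippet below compares a pay-per-query model with a node-based one. Every price and workload figure is a made-up placeholder chosen for easy arithmetic, not a vendor quote; substitute current list prices before drawing conclusions.

```python
def pay_per_query_cost(tb_scanned, price_per_tb):
    """Cost when billing follows the volume of data scanned."""
    return tb_scanned * price_per_tb


def node_based_cost(nodes, hours, price_per_node_hour):
    """Cost when billing follows provisioned nodes, used or idle."""
    return nodes * hours * price_per_node_hour


# Placeholder prices (assumptions, NOT real vendor pricing).
PRICE_PER_TB = 5.0
PRICE_PER_NODE_HOUR = 1.0

# Hypothetical month: 40 TB scanned versus 2 nodes running all 730 hours.
on_demand = pay_per_query_cost(tb_scanned=40, price_per_tb=PRICE_PER_TB)
provisioned = node_based_cost(nodes=2, hours=730,
                              price_per_node_hour=PRICE_PER_NODE_HOUR)

# Scan volume at which both models cost the same under these assumptions.
break_even_tb = provisioned / PRICE_PER_TB

print(on_demand, provisioned, break_even_tb)
```

Under these placeholder numbers the pay-per-query model is cheaper until monthly scans approach the break-even volume; storage charges are left out for brevity.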

Streaming Platforms

| Tool | Best For | Key Feature | Complexity |
| --- | --- | --- | --- |
| Kafka (Industry Standard) | High-throughput event streaming | Distributed log, durability | High |
| Kinesis (AWS Managed) | AWS ecosystem, managed service | Fully managed, AWS integration | Medium |
| Pulsar (Flexible) | Multi-tenancy, geo-replication | Tiered storage, unified messaging | High |

Quick Recommendation:

  • Choose Kafka for maximum control and ecosystem compatibility
  • Choose Kinesis for AWS-native managed streaming
  • Choose Pulsar for advanced features and multi-tenancy
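
The "distributed log" listed as Kafka's key feature can be pictured as an append-only sequence that each consumer reads at its own offset. The following is a single-process toy model of that idea, assuming hypothetical class names (`AppendOnlyLog`, `Consumer`); it is not the Kafka, Kinesis, or Pulsar client API and leaves out partitions, replication, and persistence.

```python
class AppendOnlyLog:
    """Records are only ever appended; each gets a sequential offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]


class Consumer:
    """Tracks its own position, so consumers never interfere with each other."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = AppendOnlyLog()
for event in ["signup", "click", "purchase"]:
    log.append(event)

analytics = Consumer(log)
billing = Consumer(log)

first_batch = analytics.poll(max_records=2)  # analytics is now at offset 2
billing_batch = billing.poll()               # billing still sees every record
print(first_batch, billing_batch)
```

Because reading does not remove records, a new consumer can replay the stream from offset 0, which is what separates a log from a traditional queue.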

Business Intelligence Tools

| Tool | Best For | Key Feature | Cost |
| --- | --- | --- | --- |
| Tableau (Enterprise) | Visual analytics, enterprise | Advanced visualizations | High |
| Power BI (Microsoft) | Microsoft ecosystem, enterprise | Office 365 integration | Medium-High |
| Looker (Embedded) | Embedded analytics, modeling | LookML modeling layer | High |
| Metabase (Open Source) | Self-service, simplicity | Easy setup, SQL interface | Low-Medium |

Quick Recommendation:

  • Choose Tableau for advanced visual analytics and enterprise needs
  • Choose Power BI for Microsoft ecosystem integration
  • Choose Looker for embedded analytics and data modeling
  • Choose Metabase for cost-effective self-service analytics

Lakehouse Formats

| Format | Best For | Key Feature | Ecosystem |
| --- | --- | --- | --- |
| Delta Lake (Databricks) | ACID transactions, reliability | Time travel, optimized writes | Databricks, Spark |
| Iceberg (Open) | Schema evolution, multi-engine | Engine-agnostic, partitioning | Spark, Flink, Trino |

Quick Recommendation:

  • Choose Delta Lake for Databricks ecosystem and ACID guarantees
  • Choose Iceberg for multi-engine support and schema evolution
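
"Time travel" in the table means reading a table as it existed at an earlier committed version. The toy model below shows the concept with an in-memory list of snapshots. `VersionedTable` is a hypothetical name, and this is not the Delta Lake or Iceberg API; those formats track versions through metadata that points at data files rather than by copying rows.

```python
class VersionedTable:
    """Every commit produces a new immutable snapshot of the table."""

    def __init__(self):
        self._snapshots = [[]]  # version 0 is the empty table

    def commit(self, new_rows):
        # Build the next snapshot without mutating earlier ones.
        snapshot = self._snapshots[-1] + list(new_rows)
        self._snapshots.append(snapshot)
        return len(self._snapshots) - 1  # version number of this commit

    def read(self, version=None):
        if version is None:
            version = len(self._snapshots) - 1  # latest version
        return list(self._snapshots[version])


table = VersionedTable()
v1 = table.commit([{"id": 1}])
v2 = table.commit([{"id": 2}])

print(table.read(version=v1))  # the table as of the first commit
print(table.read())            # the latest state
```

Readers that pin a version keep seeing a consistent snapshot while writers commit new ones, which is the intuition behind the ACID guarantees listed above.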

Cloud-to-Cloud Data Engineering Cheat Sheet

Quick reference for equivalent services across AWS, GCP, and Azure, intended for multi-cloud migrations and architecture planning.

1. Distributed Compute (Spark / Hadoop)

| Use Case | GCP | AWS | Azure |
| --- | --- | --- | --- |
| Managed Spark/Hadoop clusters | Dataproc | EMR | Azure Databricks |
| Serverless Spark | Dataproc Serverless | EMR Serverless | Synapse Serverless Spark |
| Notebook development | Vertex AI Workbench / Dataproc Hub | EMR Notebooks / SageMaker | Databricks Notebooks |

When to use: scalable ETL, ML feature engineering, batch pipelines, heavy Spark jobs.
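
The services in this category all run the same basic pattern: split the data into partitions, process each partition independently, then merge the partial results. Here is a minimal single-machine sketch of that pattern using only the standard library. It is not Spark code, and the sample partitions are invented for illustration.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Count words within one partition; needs no data from other partitions."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge(left, right):
    """Combine two partial results into one."""
    return left + right


# Hypothetical input, already split into two partitions.
partitions = [
    ["spark runs on clusters", "spark scales out"],
    ["clusters run spark"],
]

# On a cluster each partition would be processed on a different worker.
partial_counts = [map_partition(p) for p in partitions]
totals = reduce(merge, partial_counts, Counter())

print(totals["spark"], totals["clusters"])
```

Because `map_partition` is independent per partition, adding workers scales the map step; only the merge step needs results to move between machines.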

2. Batch ETL / ELT

| Use Case | GCP | AWS | Azure |
| --- | --- | --- | --- |
| Serverless ETL | Dataflow (Batch) | AWS Glue ETL | ADF Mapping Data Flows |
| Pipeline orchestration | Cloud Composer (Airflow) | Step Functions / MWAA | Azure Data Factory Pipelines |
| SQL ELT | BigQuery SQL | Redshift SQL | Synapse SQL |

When to use: scheduled transformations, ELT workflows, Airflow-based orchestration.
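
In the "SQL ELT" row, raw data is loaded first and then transformed inside the warehouse with SQL. The sketch below shows that load-then-transform order using an in-memory SQLite database as a stand-in for a warehouse. The table and column names are hypothetical, and BigQuery, Redshift, and Synapse SQL dialects differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land the raw records unchanged.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 900, "refunded"), (3, 4300, "paid")],
)

# Transform: derive a cleaned table with SQL, inside the database.
conn.execute(
    """
    CREATE TABLE paid_orders AS
    SELECT order_id, amount_cents / 100.0 AS amount
    FROM raw_orders
    WHERE status = 'paid'
    """
)

rows = conn.execute(
    "SELECT order_id, amount FROM paid_orders ORDER BY order_id"
).fetchall()
print(rows)
conn.close()
```

Keeping `raw_orders` untouched means the transform can be rerun or revised later without re-extracting from the source.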

3. Real-Time Streaming

| Use Case | GCP | AWS | Azure |
| --- | --- | --- | --- |
| Stream ingestion | Pub/Sub | Kinesis Data Streams | Event Hubs |
| Stream processing | Dataflow (Streaming) | Managed Service for Apache Flink (formerly Kinesis Data Analytics) | Azure Stream Analytics |
| CDC ingestion | Datastream | DMS + MSK Connect | ADF CDC / Event Grid |

When to use: event-driven pipelines, real-time analytics, CDC replication.
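
CDC (change data capture) replication works by replaying a source database's insert, update, and delete events against a target. The following sketch applies a stream of change events to an in-memory replica. The event shape (`op`, `key`, `row`) is an assumption made for illustration and is not the format emitted by Datastream, DMS, or ADF.

```python
def apply_change(replica, event):
    """Apply one change event to the replica, keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]
    elif op == "delete":
        replica.pop(key, None)  # tolerate deletes for rows never seen
    else:
        raise ValueError(f"unknown operation: {op}")


# Hypothetical change stream, in commit order.
change_stream = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "Lin", "plan": "pro"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "plan": "pro"}},
    {"op": "delete", "key": 2},
]

replica = {}
for event in change_stream:
    apply_change(replica, event)

print(replica)
```

Order matters: replaying the same events out of order could leave the replica in a different state, which is why CDC pipelines usually preserve per-key ordering.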

4. Storage (Lake, Warehouse, Lakehouse)

| Use Case | GCP | AWS | Azure |
| --- | --- | --- | --- |
| Object storage (data lake) | Cloud Storage | S3 | ADLS Gen2 |
| Data warehouse | BigQuery | Redshift | Synapse Dedicated SQL Pool |
| Lakehouse | BigQuery + Dataplex | S3 + Athena + Glue Catalog | Synapse + OneLake + Fabric |

When to use: central data lake, analytics warehouse, unified lakehouse architecture.

5. Metadata, Governance, Catalog

| Use Case | GCP | AWS | Azure |
| --- | --- | --- | --- |
| Data catalog | Data Catalog | Glue Data Catalog | Purview Data Catalog |
| Governance & lineage | Dataplex | Lake Formation | Microsoft Purview |

When to use: schema management, lineage, access control, governance.

6. Data Quality & Observability

| Use Case | GCP | AWS | Azure |
| --- | --- | --- | --- |
| Data quality rules | Dataplex DQ | Glue Data Quality | Purview Data Quality |
| Pipeline monitoring | Cloud Monitoring | CloudWatch | Azure Monitor |

When to use: validating datasets, monitoring pipelines, enforcing SLAs.
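
Data quality rules are, at their core, named checks evaluated against every row, with failures counted per rule. Below is a small sketch of that idea in plain Python. The rule helpers (`not_null`, `in_range`) and the sample rows are hypothetical, and this is not the rule syntax of Dataplex, Glue Data Quality, or Purview.

```python
def not_null(column):
    """Rule: the column must be present and not None."""
    return lambda row: row.get(column) is not None


def in_range(column, low, high):
    """Rule: the column must be present and within [low, high]."""
    return lambda row: row.get(column) is not None and low <= row[column] <= high


def validate(rows, rules):
    """Return the number of failing rows for each named rule."""
    failures = {name: 0 for name in rules}
    for row in rows:
        for name, check in rules.items():
            if not check(row):
                failures[name] += 1
    return failures


rows = [
    {"user_id": 1, "age": 34},
    {"user_id": None, "age": 29},   # fails the not-null rule
    {"user_id": 3, "age": 212},     # fails the range rule
]
rules = {
    "user_id_not_null": not_null("user_id"),
    "age_in_range": in_range("age", 0, 120),
}

report = validate(rows, rules)
print(report)
```

A pipeline can compare these counts against a threshold to decide whether to fail the run or only raise an alert.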

7. BI & Analytics

| Use Case | GCP | AWS | Azure |
| --- | --- | --- | --- |
| BI dashboards | Looker / Looker Studio | QuickSight | Power BI |
| SQL on data lake | BigQuery Omni / BigLake | Athena | Synapse Serverless SQL |

When to use: dashboards, ad-hoc analytics, federated SQL.

8. ML / Feature Engineering (Adjacent to DE)

| Use Case | GCP | AWS | Azure |
| --- | --- | --- | --- |
| ML platform | Vertex AI | SageMaker | Azure ML |
| Feature store | Vertex AI Feature Store | SageMaker Feature Store | Azure Feature Store (Fabric) |

One-Page Summary

If you want Spark →

GCP: Dataproc
AWS: EMR
Azure: Databricks

If you want Serverless ETL →

GCP: Dataflow
AWS: Glue
Azure: ADF Data Flows

If you want Streaming →

GCP: Pub/Sub + Dataflow
AWS: Kinesis + Kinesis Data Analytics (KDA)
Azure: Event Hubs + Stream Analytics

If you want a Lakehouse →

GCP: BigQuery + Dataplex
AWS: S3 + Athena + Glue Catalog
Azure: Synapse + OneLake + Fabric
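
For migration planning, the summary above can be encoded as a lookup table and queried programmatically. The sketch below copies the four mappings verbatim from this summary; the names `EQUIVALENTS` and `translate` are hypothetical helpers, not part of any cloud SDK.

```python
# Service equivalents, copied from the one-page summary.
EQUIVALENTS = {
    "spark": {"GCP": "Dataproc", "AWS": "EMR", "Azure": "Databricks"},
    "serverless_etl": {"GCP": "Dataflow", "AWS": "Glue", "Azure": "ADF Data Flows"},
    "streaming": {
        "GCP": "Pub/Sub + Dataflow",
        "AWS": "Kinesis + KDA",
        "Azure": "Event Hubs + Stream Analytics",
    },
    "lakehouse": {
        "GCP": "BigQuery + Dataplex",
        "AWS": "S3 + Athena + Glue Catalog",
        "Azure": "Synapse + OneLake + Fabric",
    },
}


def translate(service, source_cloud, target_cloud):
    """Return the target cloud's equivalent of a service on the source cloud."""
    for by_cloud in EQUIVALENTS.values():
        if by_cloud[source_cloud] == service:
            return by_cloud[target_cloud]
    raise KeyError(f"no mapping for {service!r} on {source_cloud}")


print(translate("EMR", "AWS", "GCP"))
print(translate("Dataflow", "GCP", "Azure"))
```

The mapping is a starting point only: equivalent services rarely match feature for feature, so each pairing still needs a closer look during migration.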

Quick Decision Guide

Startups/Small Teams

  • Orchestration: Prefect or Airflow
  • Storage: BigQuery or Redshift
  • BI: Metabase or Power BI
  • Streaming: Kinesis (if AWS)

Enterprise Teams

  • Orchestration: Airflow or Dagster
  • Storage: Snowflake
  • BI: Tableau or Looker
  • Streaming: Kafka or Pulsar

Cost-Conscious

  • Orchestration: Open-source Airflow
  • Storage: BigQuery (pay-per-query)
  • BI: Metabase (open-source)
  • Streaming: Managed Kinesis

Tool Comparisons

ETL Tools

Airflow: Complex workflows, battle-tested
Prefect: Python-native, modern
Dagster: Asset governance, lineage

Streaming

Kafka: High throughput, distributed
Pub/Sub: Google Cloud native
Kinesis: AWS integrated, scalable

BI Tools

Tableau: Enterprise, visual analytics
Power BI: Microsoft ecosystem
Looker: Embedded analytics, modeling
Metabase: Open-source, simple

Storage Formats

S3: Object storage standard
Delta Lake: ACID transactions, reliability
Iceberg: Schema evolution, multi-engine

Tools by Layer

Data Ingestion

Moving data from sources to systems


Data Processing

Transforming data into insights


Data Storage

Storing and organizing data


Data Consumption

Making data accessible and useful
