Comprehensive Data Engineering Tools
Find the right tools for your data engineering needs: cloud providers, comparisons, and decision guides
Tools by Cloud Provider
Amazon Web Services (AWS)
5 tools
Amazon S3
Use case: Object storage, data lakes
Amazon Redshift
Use case: Complex analytical queries
Amazon Kinesis
Use case: AWS real-time data
AWS Glue
Use case: Serverless ETL, data catalog
Amazon QuickSight
Use case: AWS-native BI, dashboards
Google Cloud Platform (GCP)
5 tools
Google Cloud Storage
Use case: Object storage, analytics
Google BigQuery
Use case: Large-scale analytics
Google Pub/Sub
Use case: Event streaming, messaging
Google Dataflow
Use case: Stream and batch processing
Google Looker
Use case: Embedded analytics, modeling
Microsoft Azure
5 tools
Azure Data Lake Storage
Use case: Enterprise data lakes
Azure Synapse
Use case: Integrated analytics, data warehousing
Azure Event Hubs
Use case: Real-time data ingestion
Azure Databricks
Use case: Big data analytics, ML
Power BI
Use case: Enterprise reporting, Office integration
Multi-Cloud & Cloud Agnostic
6 tools
Pinecone
Use case: AI embeddings, semantic search
Monte Carlo
Use case: Data quality monitoring
Snowflake
Use case: Enterprise analytics
Databricks
Use case: Big data, ML workflows
Fivetran
Use case: SaaS data integration
Tableau
Use case: Interactive dashboards
Open Source & Self-Hosted
21 tools
Apache Kafka
Use case: Event streaming, messaging
Apache Flink
Use case: Real-time analytics, low-latency
Apache Druid
Use case: Fast analytics on streaming data
Apache Airflow
Use case: ETL pipeline orchestration
Prefect
Use case: Python-native workflows
Dagster
Use case: Asset governance, lineage
Apache Spark
Use case: Big data transformations
dbt
Use case: Data warehouse transforms
Pandas
Use case: Data analysis, cleaning
Great Expectations
Use case: Data quality testing
Delta Lake
Use case: ACID transactions, reliability
Apache Iceberg
Use case: Schema evolution, analytics
Metabase
Use case: Self-service analytics
Superset
Use case: Interactive dashboards
Jupyter
Use case: Data exploration
PostgreSQL
Use case: Application databases, ACID compliance
MySQL
Use case: Web applications, high performance
MongoDB
Use case: Flexible schemas, unstructured data
Docker
Use case: Application deployment
Kubernetes
Use case: Container management
Terraform
Use case: Cloud resource management
Tool Comparisons
Workflow Orchestration
Quick Recommendation:
- Choose Airflow for established ETL workflows and large teams
- Choose Prefect for Python-first development and modern workflows
- Choose Dagster for complex data pipelines needing governance
Data Warehouses
Quick Recommendation:
- Choose Snowflake for multi-cloud strategy and enterprise features
- Choose BigQuery for serverless operations and Google ecosystem
- Choose Redshift for AWS-native workloads and cost control
Streaming Platforms
Quick Recommendation:
- Choose Kafka for maximum control and ecosystem compatibility
- Choose Kinesis for AWS-native managed streaming
- Choose Pulsar for advanced features and multi-tenancy
Business Intelligence Tools
Quick Recommendation:
- Choose Tableau for advanced visual analytics and enterprise needs
- Choose Power BI for Microsoft ecosystem integration
- Choose Looker for embedded analytics and data modeling
- Choose Metabase for cost-effective self-service analytics
Lakehouse Formats
Quick Recommendation:
- Choose Delta Lake for Databricks ecosystem and ACID guarantees
- Choose Iceberg for multi-engine support and schema evolution
Cloud-to-Cloud Data Engineering Cheat Sheet
Quick reference for equivalent services across AWS, GCP, and Azure, intended for multi-cloud migrations and architecture planning.
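As a rough sketch, cross-cloud equivalents can be kept in a small lookup table. The pairings below are assembled from the per-provider tool lists on this page, matched by their listed use cases. They are an assumption for illustration, not an official equivalence chart, and the names `EQUIVALENTS` and `equivalent_service` are hypothetical.

```python
# Rough cross-cloud lookup built from the per-provider lists on this page.
# Assumption: services are paired by their listed use case only; a real
# migration needs a feature-by-feature comparison.
EQUIVALENTS = {
    "object storage": {
        "aws": "Amazon S3",
        "gcp": "Google Cloud Storage",
        "azure": "Azure Data Lake Storage",
    },
    "data warehouse": {
        "aws": "Amazon Redshift",
        "gcp": "Google BigQuery",
        "azure": "Azure Synapse",
    },
    "streaming ingestion": {
        "aws": "Amazon Kinesis",
        "gcp": "Google Pub/Sub",
        "azure": "Azure Event Hubs",
    },
    "bi": {
        "aws": "Amazon QuickSight",
        "gcp": "Google Looker",
        "azure": "Power BI",
    },
}


def equivalent_service(service, target_cloud):
    """Return the service on target_cloud in the same category, or None."""
    for providers in EQUIVALENTS.values():
        if service in providers.values():
            return providers.get(target_cloud)
    return None


if __name__ == "__main__":
    print(equivalent_service("Amazon Redshift", "gcp"))
```

A flat dictionary keeps the table easy to extend with further categories (for example orchestration or catalog services) as they are confirmed.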
1. Distributed Compute (Spark / Hadoop)
2. Batch ETL / ELT
3. Real-Time Streaming
4. Storage (Lake, Warehouse, Lakehouse)
5. Metadata, Governance, Catalog
6. Data Quality & Observability
7. BI & Analytics
8. ML / Feature Engineering (Adjacent to DE)
One-Page Summary
If you want Spark →
If you want Serverless ETL →
If you want Streaming →
If you want a Lakehouse →
Quick Decision Guide
Startups/Small Teams
- Orchestration: Prefect or Airflow
- Storage: BigQuery or Redshift
- BI: Metabase or Power BI
- Streaming: Kinesis (if on AWS)
Enterprise Teams
- Orchestration: Airflow or Dagster
- Storage: Snowflake
- BI: Tableau or Looker
- Streaming: Kafka or Pulsar
Cost-Conscious
- Orchestration: Open-source Airflow
- Storage: BigQuery (pay-per-query)
- BI: Metabase (open-source)
- Streaming: Managed Kinesis
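The decision guide above can be sketched as a nested lookup, which may be handy when wiring these defaults into a script or project template. This is a minimal illustration only: the profile keys, the `DECISION_GUIDE` name, and the `recommend` helper are hypothetical, and the values simply restate the bullets above.

```python
# The quick decision guide restated as data.
# Assumption: profile and category keys are invented labels for this sketch.
DECISION_GUIDE = {
    "startup": {
        "orchestration": ["Prefect", "Airflow"],
        "storage": ["BigQuery", "Redshift"],
        "bi": ["Metabase", "Power BI"],
        "streaming": ["Kinesis"],  # the guide qualifies this with "if AWS"
    },
    "enterprise": {
        "orchestration": ["Airflow", "Dagster"],
        "storage": ["Snowflake"],
        "bi": ["Tableau", "Looker"],
        "streaming": ["Kafka", "Pulsar"],
    },
    "cost-conscious": {
        "orchestration": ["Airflow"],  # open-source deployment
        "storage": ["BigQuery"],       # pay-per-query pricing
        "bi": ["Metabase"],            # open-source edition
        "streaming": ["Kinesis"],      # managed service
    },
}


def recommend(profile, category):
    """Return the guide's suggested tools for a team profile and category."""
    try:
        return DECISION_GUIDE[profile][category]
    except KeyError:
        raise ValueError(
            f"no recommendation for {profile!r} / {category!r}"
        ) from None


if __name__ == "__main__":
    for category in ("orchestration", "storage", "bi", "streaming"):
        print(category, recommend("enterprise", category))
```

Unknown profiles or categories raise a `ValueError` rather than returning an empty list, so a typo in a key is not mistaken for "no tool recommended".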