Airflow Works Best When It Does Less - Guru, Eerla | Data Engineering

When an Airflow deployment struggles in production, the symptoms are consistent:

Workers pinned at high CPU
Retry storms under load
DAGs that pass locally but fail in production
Business logic buried inside orchestration

This isn’t a scaling issue. It’s a boundary violation.

Airflow Is a Control Plane

Airflow exists to:

schedule work
enforce dependencies
manage retries

It does not exist to:

process data
hold state
execute transformations

When orchestration and compute share the same layer, they compete for resources.

That competition is where systems degrade.

DAGs Should Describe Flow — Nothing Else

A DAG answers:

What runs, and in what order?

Not:

How does the data get processed?

Once you embed logic inside DAGs:

orchestration becomes coupled to implementation
pipelines become hard to test without running Airflow
changes become risky

Clean systems separate:

DAG → control flow
Compute → execution layer
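
As a minimal sketch of what "flow and nothing else" means, the snippet below declares a pipeline purely as dependencies and derives a run order from it. It uses Python's standard-library graphlib rather than Airflow, and the step names are hypothetical. Note that nothing in it says how any step processes data.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a step, each value is the set of
# steps that must finish first. No processing logic lives here.
FLOW = {
    "extract_orders": set(),
    "transform_orders": {"extract_orders"},
    "load_orders": {"transform_orders"},
}


def execution_order(flow: dict[str, set[str]]) -> list[str]:
    """Return one valid run order for the declared dependencies."""
    return list(TopologicalSorter(flow).static_order())


print(execution_order(FLOW))
```

A DAG file that stays this declarative can change its compute implementation without the flow definition changing at all.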

The Patterns That Cause Most Failures

In-process compute: Large joins, pandas jobs, heavy transforms inside tasks
XCom as a data layer: Passing payloads instead of metadata
Business logic in DAGs: No independent versioning, no reuse, no testing outside Airflow
Shared resources: Orchestration and compute competing for CPU/memory

These are not edge cases. This is how most Airflow systems fail.
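
The "payloads instead of metadata" pattern can be illustrated without Airflow. In this sketch, a hypothetical task function writes its rows to storage and returns only a small reference. A local temporary directory stands in for a warehouse or object store, and the function and field names are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def export_orders(rows: list[dict], out_dir: str) -> dict:
    """Write rows to storage and return only a small reference to them."""
    out_path = Path(out_dir) / "orders.json"
    out_path.write_text(json.dumps(rows))
    # This small dict is the kind of value that is reasonable to pass
    # between tasks: a location and a count, not the rows themselves.
    return {"uri": out_path.as_uri(), "row_count": len(rows)}


rows = [{"order_id": i, "amount": i * 1.5} for i in range(10_000)]
with tempfile.TemporaryDirectory() as tmp:
    ref = export_orders(rows, tmp)
    payload_bytes = len(json.dumps(rows))
    reference_bytes = len(json.dumps(ref))

print(f"payload: {payload_bytes} bytes, reference: {reference_bytes} bytes")
```

The reference stays roughly the same size no matter how many rows are exported, which is what keeps the orchestrator's metadata store small.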

Failure Modes (They Compound Fast)

Scheduler starvation: Heavy compute on shared hosts starves the scheduler, so new tasks are queued late
Retry amplification: Failures increase load → more failures
State inconsistencies: No clear ownership of data or transformations
Debugging collapse: Logs tied to orchestration, not execution

These failures rarely originate in the SQL or transformation code itself. They emerge at the boundaries between systems.
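
Retry amplification can be made concrete with a little arithmetic. The sketch below uses a deliberately simplified model, assuming every attempt fails independently with the same probability, and computes how many attempts each task is expected to consume. Real failures under load are correlated, so treat the numbers as a lower-bound illustration rather than a prediction.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected attempts per task when each attempt fails independently.

    Attempt k+1 only happens if the first k attempts all failed, so the
    expectation is 1 + p + p**2 + ... + p**max_retries.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


for p in (0.1, 0.5, 0.9):
    attempts = expected_attempts(p, 3)
    print(f"failure rate {p:.0%}: {attempts:.2f} attempts per task")
```

As the failure rate climbs, the same set of tasks generates several times the work, and that extra work lands on the resources that were already overloaded.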

The Correct Model

Airflow should coordinate work, not perform it.

Trigger Spark / dbt / containerized jobs
Wait for completion
Pass references (IDs, URIs), not data

Airflow becomes thin, predictable, and stable.
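
The trigger, wait, and pass-references steps can be sketched as follows. FakeJobClient is a hypothetical in-memory stand-in for whatever API a Spark, dbt, or container platform exposes; its methods, the status strings, and the example URIs are all assumptions for illustration, not a real library.

```python
import time


class FakeJobClient:
    """Hypothetical stand-in for a Spark, dbt, or container job API."""

    def __init__(self, polls_until_done: int = 2):
        self.polls_until_done = polls_until_done
        self._polls: dict[str, int] = {}

    def submit(self, job_name: str, input_uri: str, output_uri: str) -> str:
        job_id = f"{job_name}-{len(self._polls) + 1}"
        self._polls[job_id] = 0
        return job_id

    def status(self, job_id: str) -> str:
        self._polls[job_id] += 1
        done = self._polls[job_id] >= self.polls_until_done
        return "SUCCEEDED" if done else "RUNNING"


def run_external_job(client, job_name, input_uri, output_uri,
                     poll_seconds=0.0, max_polls=10) -> dict:
    """Trigger a job, wait for it, and return references only."""
    job_id = client.submit(job_name, input_uri, output_uri)
    for _ in range(max_polls):
        state = client.status(job_id)
        if state == "SUCCEEDED":
            # Only identifiers leave this function, never the data.
            return {"job_id": job_id, "output_uri": output_uri}
        if state == "FAILED":
            raise RuntimeError(f"job {job_id} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish in {max_polls} polls")


result = run_external_job(
    FakeJobClient(),
    job_name="daily_orders",
    input_uri="s3://example-bucket/raw/orders/",
    output_uri="s3://example-bucket/curated/orders/",
)
print(result)
```

In a real deployment this role is usually played by an operator or sensor for the target system. The point of the sketch is only that the orchestrating process does nothing heavier than submit, poll, and hand on a reference.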

What Improves in Production

Scheduler remains responsive under load
Failures are isolated to compute systems
Pipelines become testable outside Airflow
Recovery becomes predictable: rerun the failed job, not the whole pipeline

Not because of better tooling. Because responsibilities are separated correctly.
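
Testability outside Airflow follows directly from keeping logic in plain functions. The sketch below shows a hypothetical transformation that lives in the compute layer and is checked with an ordinary assertion, with no scheduler, DAG, or task context involved. The function name and row fields are assumptions for illustration.

```python
def latest_per_customer(rows: list[dict]) -> list[dict]:
    """Keep the most recent row per customer_id, ordered by customer_id."""
    latest: dict[str, dict] = {}
    for row in rows:
        current = latest.get(row["customer_id"])
        # ISO-8601 date strings compare correctly as plain strings.
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["customer_id"]] = row
    return [latest[key] for key in sorted(latest)]


def test_latest_per_customer():
    rows = [
        {"customer_id": "a", "updated_at": "2024-01-01", "plan": "free"},
        {"customer_id": "a", "updated_at": "2024-03-01", "plan": "pro"},
        {"customer_id": "b", "updated_at": "2024-02-01", "plan": "free"},
    ]
    result = latest_per_customer(rows)
    assert [r["plan"] for r in result] == ["pro", "free"]


test_latest_per_customer()
```

A test like this runs in milliseconds on a laptop or in CI, which is much harder to achieve when the same logic is embedded in a DAG file.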

The System Model

Think in layers:

Control → Airflow
Compute → Spark / dbt / containers
Storage → warehouse / lake

If these blur, the system becomes fragile.

Final Take

Most Airflow issues are self-inflicted.

Not because Airflow is limited, but because it’s forced to do work it was never designed for.

If your DAGs are executing real computation, you don’t have a pipeline problem. You have a system design problem.

One Rule

If a task:

runs long CPU workloads
or processes large in-memory data

that work does not belong in Airflow.