Guru, Eerla | Data Engineering

I used to think the job was “make nightly ETLs run.”

Now it’s: ship APIs, run containers, own latency, and get alerted when a feature lookup crosses 200ms. That’s not scope creep. That’s the job finally matching the system.

The Shift Isn’t Semantic — It’s Architectural

What we used to call “data engineering” was batch orchestration:

Airflow DAG → SQL → table
Consumers read tables directly

No contracts, no ownership boundaries, no SLAs.

That model breaks the moment data becomes part of a product.

Today’s requirement surface looks like backend systems:

Low-latency access (not “tomorrow morning”)
Multi-tenant isolation
Explicit contracts (schemas, APIs)
Versioning and backward compatibility
Observability + on-call ownership

Bluntly: if your data is consumed in real time, you are running a service.
If you’re running a service, you’re doing backend engineering.

Core Primitive #1: Pipelines → Services

The mental model changed.

Old:

Airflow runs a job → Writes to a table → Consumers figure it out

New:

Continuous processing (streaming or micro-batch)
Expose via API and/or event stream
Explicit ownership + SLA

What actually matters:

Data is no longer “stored and discovered” — it’s served
Consumers shouldn’t reverse-engineer tables
Contracts replace tribal knowledge

Trade-off: you gain discoverability and low-latency consumption, but you now own the API lifecycle and backward compatibility.
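
A minimal sketch of “served, not discovered”, assuming an in-memory dict stands in for whatever low-latency store you actually run. FeatureResponseV1, FEATURE_STORE, and get_features_v1 are hypothetical names, and the 200 ms budget is borrowed from the alert threshold mentioned at the top. The point is only that the consumer gets a versioned payload and the owner gets a latency number, not a table to reverse-engineer.

```python
import time
from dataclasses import asdict, dataclass

# Assumption: mirrors the 200 ms alert threshold from the intro.
LATENCY_BUDGET_MS = 200.0


@dataclass(frozen=True)
class FeatureResponseV1:
    """Explicit contract for the hypothetical v1 feature payload."""
    schema_version: str
    entity_id: str
    features: dict
    computed_at: float  # epoch seconds when the features were last computed


# Hypothetical stand-in for a real low-latency store.
FEATURE_STORE = {"user-1": ({"orders_7d": 3}, 1_700_000_000.0)}


def get_features_v1(entity_id, store=FEATURE_STORE):
    """Return (payload, elapsed_ms) for one entity, or raise KeyError."""
    started = time.perf_counter()
    if entity_id not in store:
        raise KeyError(f"unknown entity_id: {entity_id}")
    features, computed_at = store[entity_id]
    response = FeatureResponseV1("v1", entity_id, dict(features), computed_at)
    elapsed_ms = (time.perf_counter() - started) * 1000.0
    return asdict(response), elapsed_ms


payload, elapsed_ms = get_features_v1("user-1")
over_budget = elapsed_ms > LATENCY_BUDGET_MS  # this is what you would alert on
```

In a real service the handler would sit behind an HTTP framework and emit elapsed_ms as a metric. Both are left out here to keep the sketch self-contained.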

Core Primitive #2: Event-Driven State, Not Static Tables

Batch assumes the world is static. It isn’t. Kubernetes, serverless functions, managed streaming (Kafka, Pub/Sub), and object stores are standard. That means data engineers must understand containers, infra-as-code, and CI/CD the same way backend teams do.

Real systems operate on event streams with evolving state: late data arrives, events reorder, and state must be recomputed or corrected.

That introduces backend problems:

Partitioning
Backpressure
Idempotency
State consistency

Example pattern that actually survives production:

Idempotent upserts (not “exactly once” illusions)
Versioned writes
Externalized state (DB or state store)

“Exactly once” is marketing. Idempotency + replayability is what works.
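
A rough sketch of the idempotent, versioned upsert pattern, under two assumptions that are mine and not a given: the producer attaches a per-key version that only increases, and a plain dict stands in for the externalized state store. Event and VersionedStore are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    key: str
    version: int  # assumption: increases monotonically per key at the producer
    value: float


class VersionedStore:
    """Dict-backed stand-in for an external DB or state store."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def upsert(self, event):
        """Apply the event only if it is newer than what is stored.

        Returns True if state changed, False for duplicates and stale events.
        """
        current = self._rows.get(event.key)
        if current is not None and current[0] >= event.version:
            return False
        self._rows[event.key] = (event.version, event.value)
        return True

    def get(self, key):
        return self._rows[key][1]


store = VersionedStore()
events = [Event("user-1", 1, 10.0), Event("user-1", 2, 12.5)]

# Replaying the same stream twice leaves the state unchanged.
for event in events + events:
    store.upsert(event)
```

Because a replay is a no-op, at-least-once delivery is enough. The same condition also drops events that arrive out of order behind a newer version.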

Where things break:

Duplicate events → corrupt aggregates
Late arrivals → wrong features
Hot partitions → latency spikes

If you haven’t debugged one of these at 2 AM, you’re still in the old model.
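
To make the first two failure modes concrete, here is a small sketch of an event-time aggregate that tolerates duplicates and late arrivals. It assumes every event carries a unique event_id and its own event_time in epoch seconds. WindowedSum and the 60-second window are illustrative choices, not a recommendation.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # illustrative window size


class WindowedSum:
    """Sums amounts per event-time window, ignoring duplicate event IDs."""

    def __init__(self):
        # Simplification: this set grows without bound. A real system would
        # expire IDs once a window can no longer receive late data.
        self._seen = set()
        self._sums = defaultdict(float)

    def add(self, event_id, event_time, amount):
        if event_id in self._seen:
            return  # duplicate delivery: do not double count
        self._seen.add(event_id)
        # Bucket by when the event happened, not when it arrived, so a late
        # event corrects the window it belongs to.
        window_start = int(event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        self._sums[window_start] += amount

    def total(self, window_start):
        return self._sums.get(window_start, 0.0)


agg = WindowedSum()
agg.add("e1", 10, 5.0)
agg.add("e2", 70, 2.0)
agg.add("e3", 20, 1.0)  # late: arrives after e2 but belongs to window 0
agg.add("e1", 10, 5.0)  # duplicate delivery of e1
```

Anything downstream that already read window 0 before e3 arrived still holds a stale value. Correcting that is the “state must be recomputed” part, and this sketch does not cover it.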

Core Primitive #3: Infrastructure Is No Longer Optional

Once you deploy on Kubernetes, managed streaming (Kafka, Pub/Sub), and object stores, you inherit backend responsibilities whether you like it or not.

You now deal with:

Resource limits (CPU/memory pressure)
Autoscaling behavior
Deployment rollouts
Failure domains

What actually matters:

Your system fails at the infra boundary, not in SQL
Capacity planning is part of your job
“It runs locally” is meaningless

Where things break:

Memory pressure → consumer restarts → reprocessing storms
Bad rollout → partial schema mismatch → cascading failures
Under-provisioned consumers → lag → SLA violations
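
The last failure mode is mostly arithmetic, so a back-of-the-envelope sketch helps. It assumes you can read per-partition end offsets and committed offsets from your broker's tooling. The function names and the numbers are made up for illustration.

```python
def total_lag(end_offsets, committed_offsets):
    """Sum of (latest offset - committed offset) across partitions."""
    return sum(
        max(end_offsets[p] - committed_offsets.get(p, 0), 0)
        for p in end_offsets
    )


def seconds_to_drain(lag, consume_rate, produce_rate):
    """Time to clear the backlog at steady rates (events per second).

    Returns infinity when consumers are not faster than producers:
    that backlog never drains, it only grows or holds.
    """
    if lag == 0:
        return 0.0
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return float("inf")
    return lag / net_rate


lag = total_lag({0: 1000, 1: 500}, {0: 400, 1: 500})        # 600 events behind
eta = seconds_to_drain(lag, consume_rate=300, produce_rate=200)  # 6.0 seconds
```

If the drain time exceeds your freshness SLA, you are already in violation. More consumers only help if the hot partition is not the bottleneck.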

Core Primitive #4: Data Contracts Are APIs

Reading raw tables is not a contract. It’s a liability.

Modern systems expose:

/features/v1/{entity_id}
Event streams with schema guarantees
Versioned payloads

That forces backend discipline:

Schema evolution strategy
Version negotiation
Deprecation policies

What actually matters:

A schema change = breaking API change
Backward compatibility is not optional
Consumers should not need coordination for every change

“We use dbt, so we’re structured” is a common misconception. dbt structures transformations. It does not solve consumer contracts at runtime.
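
A sketch of the kind of check that treats a schema change as an API change. It assumes a schema can be reduced to a flat field-name to type-name mapping, which is a simplification of mine. Real formats such as Avro or Protobuf have their own, richer compatibility rules. breaking_changes is a hypothetical helper.

```python
def breaking_changes(old_schema, new_schema):
    """List changes that would break a consumer built against old_schema.

    Rule of thumb used here: removing a field or changing its type breaks
    consumers, while adding a new field does not.
    """
    problems = []
    for name, old_type in old_schema.items():
        if name not in new_schema:
            problems.append(f"removed field: {name}")
        elif new_schema[name] != old_type:
            problems.append(
                f"type changed: {name} {old_type} -> {new_schema[name]}"
            )
    return problems


v1 = {"entity_id": "string", "orders_7d": "int"}
v1_additive = {"entity_id": "string", "orders_7d": "int", "orders_30d": "int"}
v1_broken = {"entity_id": "string", "orders_7d": "string"}

safe = breaking_changes(v1, v1_additive)   # no problems: additive change
unsafe = breaking_changes(v1, v1_broken)   # one problem: type changed
```

Run in CI, a non-empty result means either a new version (v2) or a deprecation window, not a silent deploy.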

Production Reality: You Own Reliability

Once data feeds product features, it inherits product expectations.

That means:

SLOs (not “best effort”), e.g. 99.9% of events processed in under 500 ms
Alerting
Runbooks
Postmortems
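
For a sense of what the example SLO means in practice, here is a small sketch that evaluates “99.9% under 500 ms” over a window of observed latencies. slo_report and its return fields are hypothetical. In a real setup the input would come from your metrics system rather than a Python list.

```python
def slo_report(latencies_ms, threshold_ms=500.0, target=0.999):
    """Evaluate a latency SLO over one window of per-event latencies."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("no events in the window")
    bad = sum(1 for latency in latencies_ms if latency >= threshold_ms)
    compliance = (total - bad) / total
    allowed_bad = total * (1.0 - target)  # the error budget, in events
    return {
        "compliance": compliance,
        "slo_met": compliance >= target,
        "error_budget_used": bad / allowed_bad,  # above 1.0 means overspent
    }


# 10,000 events with 5 slow ones: within the budget of roughly 10 bad events.
healthy = slo_report([100.0] * 9995 + [800.0] * 5)
# 20 slow ones: the budget is spent twice over.
breached = slo_report([100.0] * 9980 + [800.0] * 20)
```

The error-budget fraction is the useful number for prioritization: it tells you how much room is left before the SLO is breached, not just whether it was.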

What actually matters:

Latency is a user-facing metric now
Freshness is correctness
Silent failures are worse than crashes

Where things break:

Lag accumulates silently → stale features → bad decisions
Partial pipeline failures → inconsistent state
No observability → debugging becomes guesswork
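
Silent lag is the one you can catch cheaply. A minimal freshness check, assuming each served dataset records the event time of its newest record. check_freshness and the 300-second limit are illustrative.

```python
def check_freshness(last_event_time, now, max_age_seconds):
    """Flag data whose newest record is older than the allowed age.

    Times are epoch seconds. `now` is passed in so the check is testable.
    """
    age_seconds = now - last_event_time
    return {"age_seconds": age_seconds, "stale": age_seconds > max_age_seconds}


fresh = check_freshness(last_event_time=1_000.0, now=1_120.0, max_age_seconds=300.0)
stale = check_freshness(last_event_time=1_000.0, now=1_900.0, max_age_seconds=300.0)
```

Alerting on this turns “lag accumulates silently” into a page, which is the whole point: a crash announces itself and stale data does not.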

Cost and Security: The Hidden Backend Layer

At scale, data systems behave like distributed backends with a worse cost profile.

You deal with:

Storage explosion
Cross-region egress
Sensitive data access

So, you end up implementing:

RBAC
Quotas
Network isolation
Encryption policies

What actually matters:

Data systems leak money faster than backend systems
Security failures here are more damaging
Multi-tenant isolation is non-trivial
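
As one small piece of that, here is a sketch of per-tenant quotas using a token bucket, so one noisy tenant cannot starve the others. TenantQuota is a hypothetical class, the clock is passed in explicitly, and nothing here handles concurrency or persistence.

```python
class TenantQuota:
    """Token bucket per tenant: `capacity` burst, `refill_per_second` sustained."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self._buckets = {}  # tenant -> (tokens, last_seen_time)

    def allow(self, tenant, now, cost=1.0):
        """Return True and spend tokens if the tenant is within quota."""
        tokens, last = self._buckets.get(tenant, (self.capacity, now))
        # Refill for the time elapsed since this tenant's last request.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_second)
        if tokens < cost:
            self._buckets[tenant] = (tokens, now)
            return False
        self._buckets[tenant] = (tokens - cost, now)
        return True


quota = TenantQuota(capacity=2, refill_per_second=1.0)
results = [
    quota.allow("tenant-a", now=0.0),  # within burst
    quota.allow("tenant-a", now=0.0),  # within burst
    quota.allow("tenant-a", now=0.0),  # rejected: bucket empty
    quota.allow("tenant-b", now=0.0),  # unaffected by tenant-a
    quota.allow("tenant-a", now=1.0),  # one token refilled after a second
]
```

Quotas only cover the “leaks money” half. RBAC, network isolation, and encryption are separate controls that this sketch does not touch.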

Tooling Lie: Abstractions Only Remove the Easy Problems

Modern tools (dbt, managed pipelines, feature stores) are useful.

They remove:

Boilerplate
Simple transformations

They do not remove:

Stateful processing
Event-time correctness
System reliability

Reality: When things get hard, you drop down to:

Custom services
Streaming processors
Backend patterns

That’s the convergence point.

What Good Teams Do Differently

They stop pretending pipelines are scripts. They treat them like services.

Non-negotiables:

CI/CD for data + APIs
Contract testing (schema compatibility)
Observability (metrics, traces, logs)
SLO-driven prioritization
Versioned interfaces

Mental model shift:

Tables are storage
APIs are products

Before vs After (Operationally)

Before:

Nightly batch refresh
Consumers query raw tables
Logic duplicated everywhere
No ownership, no SLA

After:

Streaming or near-real-time pipeline
Exposed via API or event stream
Centralized logic
Owned, monitored, versioned

Outcome:

Less duplication
Faster iteration
More upfront cost
Far less long-term chaos

Engineering Checklist (If You Care About Scale)

Treat every pipeline as a service
Version everything (schemas, APIs, outputs)
Design for replay and idempotency
Instrument before optimizing
Define SLOs early (or you’ll invent them under pressure)
Prefer boring, reliable systems over clever ones

Hiring Reality

The bar shifted.

What matters now:

System design
Production ownership
Debugging distributed systems
Strong programming fundamentals

What matters less:

Isolated Spark/SQL expertise without system context

Titles are catching up:

Data Platform Engineer
Data Infra Engineer
Backend Engineer (Data)

The Takeaway

Data engineering didn’t expand — it matured.

The industry stopped tolerating:

brittle pipelines
undefined ownership
silent failures

and replaced them with:

services
contracts
reliability expectations

If your system delivers data to something that makes decisions in real time, you are not “moving data.” You are operating a backend system. Start treating it that way — or you’ll keep debugging it like it’s 2015.