<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://eerla.github.io/data-engineering-blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://eerla.github.io/data-engineering-blog/" rel="alternate" type="text/html" /><updated>2026-04-22T10:38:42+00:00</updated><id>https://eerla.github.io/data-engineering-blog/feed.xml</id><title type="html">Guru, Eerla | Data Engineering</title><subtitle>Lead Engineer building real-world data systems. Strong opinions on data engineering, system design, and practical solutions.</subtitle><author><name>Guru, Eerla</name></author><entry><title type="html"></title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/04/22/2026-03-22-why-data-engineer-is-quietly-becoming-backend-engineer/" rel="alternate" type="text/html" title="" /><published>2026-04-22T10:38:42+00:00</published><updated>2026-04-22T10:38:42+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/04/22/2026-03-22-why-data-engineer-is-quietly-becoming-backend-engineer</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/04/22/2026-03-22-why-data-engineer-is-quietly-becoming-backend-engineer/"><![CDATA[<p>I used to think the job was “make nightly ETLs run.”</p>

<p>Now it’s: ship APIs, run containers, own latency, and get alerted when a feature lookup crosses 200ms. That’s not scope creep. That’s the job finally matching the system.</p>

<h2 id="the-shift-isnt-semantic--its-architectural">The Shift Isn’t Semantic — It’s Architectural</h2>

<p>What we used to call “data engineering” was batch orchestration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Airflow DAG → SQL → table
Consumers read tables directly
</code></pre></div></div>

<p>No contracts, no ownership boundaries, no SLAs.</p>

<p>That model breaks the moment data becomes part of a product.</p>

<p>Today’s requirement surface looks like backend systems:</p>

<ul>
  <li>Low-latency access (not “tomorrow morning”)</li>
  <li>Multi-tenant isolation</li>
  <li>Explicit contracts (schemas, APIs)</li>
  <li>Versioning and backward compatibility</li>
  <li>Observability + on-call ownership</li>
</ul>

<p>Bluntly: if your data is consumed in real time, you are running a service.<br />
If you’re running a service, you’re doing backend engineering.</p>

<h2 id="core-primitive-1-pipelines--services">Core Primitive #1: Pipelines → Services</h2>

<p>The mental model changed.</p>

<p><strong>Old:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Airflow runs a job → Writes to a table → Consumers figure it out
</code></pre></div></div>

<p><strong>New:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Continuous processing (streaming or micro-batch)
Expose via API and/or event stream
Explicit ownership + SLA
</code></pre></div></div>

<p>What actually matters:</p>

<ul>
  <li>Data is no longer “stored and discovered” — it’s served</li>
  <li>Consumers shouldn’t reverse-engineer tables</li>
  <li>Contracts replace tribal knowledge</li>
</ul>

<p>Trade-off: you gain discoverability and low-latency consumption but now own API lifecycle and backward compatibility.</p>
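<p>The “served, not discovered” idea can be sketched in a few lines. This is a minimal illustration, not a real framework: <code class="language-plaintext highlighter-rouge">FeatureService</code>, its schema set, and the dict-backed table are all hypothetical names.</p>

```python
class FeatureService:
    """Owns the data, the schema, and the version.

    Consumers call get_features() against a published contract;
    they never read the underlying table directly.
    """

    SCHEMA_V1 = {"entity_id", "clicks_7d", "last_seen"}

    def __init__(self, table):
        self._table = table  # internal storage, free to change shape

    def get_features(self, entity_id, version="v1"):
        if version != "v1":
            raise ValueError(f"unsupported contract version: {version}")
        row = self._table.get(entity_id)
        if row is None:
            return None
        # Project onto the published contract: internal columns stay hidden.
        return {k: row[k] for k in self.SCHEMA_V1 if k in row}


svc = FeatureService({"u1": {"entity_id": "u1", "clicks_7d": 4,
                             "last_seen": "2026-04-21", "_raw_debug": "x"}})
print(svc.get_features("u1"))
```

<p>The projection step is the point: internal columns can change freely, because only the published contract is load-bearing.</p>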

<h2 id="core-primitive-2-event-driven-state-not-static-tables">Core Primitive #2: Event-Driven State, Not Static Tables</h2>

<p>Batch assumes the world is static. It isn’t.</p>

<p>Real systems operate on event streams with evolving state:</p>

<ul>
  <li>Late data arrives</li>
  <li>Events reorder</li>
  <li>State must be recomputed or corrected</li>
</ul>

<p>That introduces backend problems:</p>

<ul>
  <li>Partitioning</li>
  <li>Backpressure</li>
  <li>Idempotency</li>
  <li>State consistency</li>
</ul>

<p><strong>Example pattern that actually survives production:</strong></p>

<ul>
  <li>Idempotent upserts (not “exactly once” illusions)</li>
  <li>Versioned writes</li>
  <li>Externalized state (DB or state store)</li>
</ul>

<p>“Exactly once” is marketing. Idempotency + replayability is what works.</p>
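<p>A minimal sketch of that pattern (idempotent upserts plus versioned writes), assuming a dict stands in for the external state store; field names are illustrative:</p>

```python
state = {}  # entity_id -> {"value": ..., "version": ...}


def apply_event(event):
    """Safe to call any number of times with the same event (idempotent)
    and safe against out-of-order replays (versioned writes)."""
    current = state.get(event["entity_id"])
    # Duplicates and stale replays lose: only strictly newer versions win.
    if current is not None and event["version"] <= current["version"]:
        return False
    state[event["entity_id"]] = {"value": event["value"],
                                 "version": event["version"]}
    return True


apply_event({"entity_id": "u1", "value": 10, "version": 1})
apply_event({"entity_id": "u1", "value": 10, "version": 1})  # duplicate: no-op
apply_event({"entity_id": "u1", "value": 7, "version": 3})
apply_event({"entity_id": "u1", "value": 9, "version": 2})   # late arrival: ignored
print(state["u1"])  # {'value': 7, 'version': 3}
```

<p>Replaying the whole stream through this function converges to the same state, which is what “exactly once” was pretending to give you.</p>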

<p><strong>Where things break:</strong></p>

<ul>
  <li>Duplicate events → corrupt aggregates</li>
  <li>Late arrivals → wrong features</li>
  <li>Hot partitions → latency spikes</li>
</ul>

<p>If you haven’t debugged one of these at 2 AM, you’re still in the old model.</p>

<h2 id="core-primitive-3-infrastructure-is-no-longer-optional">Core Primitive #3: Infrastructure Is No Longer Optional</h2>

<p>Once you deploy on Kubernetes, managed streaming (Kafka, Pub/Sub), and object stores, you inherit backend responsibilities whether you like it or not.</p>

<p>You now deal with:</p>

<ul>
  <li>Resource limits (CPU/memory pressure)</li>
  <li>Autoscaling behavior</li>
  <li>Deployment rollouts</li>
  <li>Failure domains</li>
</ul>

<p><strong>What actually matters:</strong></p>

<ul>
  <li>Your system fails at the infra boundary, not in SQL</li>
  <li>Capacity planning is part of your job</li>
  <li>“It runs locally” is meaningless</li>
</ul>

<p><strong>Where things break:</strong></p>

<ul>
  <li>Memory pressure → consumer restarts → reprocessing storms</li>
  <li>Bad rollout → partial schema mismatch → cascading failures</li>
  <li>Under-provisioned consumers → lag → SLA violations</li>
</ul>

<h2 id="core-primitive-4-data-contracts-are-apis">Core Primitive #4: Data Contracts Are APIs</h2>

<p>Reading raw tables is not a contract. It’s a liability.</p>

<p>Modern systems expose:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">/features/v1/{entity_id}</code></li>
  <li>Event streams with schema guarantees</li>
  <li>Versioned payloads</li>
</ul>

<p>That forces backend discipline:</p>

<ul>
  <li>Schema evolution strategy</li>
  <li>Version negotiation</li>
  <li>Deprecation policies</li>
</ul>

<p><strong>What actually matters:</strong></p>

<ul>
  <li>A schema change = breaking API change</li>
  <li>Backward compatibility is not optional</li>
  <li>Consumers should not need coordination for every change</li>
</ul>

<p>“We use dbt, so we’re structured” is a common misconception. dbt structures transformations. It does not solve consumer contracts at runtime.</p>
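<p>A backward-compatibility check is small enough to sketch. This toy version encodes one rule: adding optional fields is fine, removing or retyping existing fields is breaking. The schema format is an assumption for illustration:</p>

```python
def is_backward_compatible(old, new):
    """old/new: dicts of field name -> type name."""
    for field, ftype in old.items():
        if field not in new:
            return False, f"removed field: {field}"
        if new[field] != ftype:
            return False, f"retyped field: {field}"
    return True, "ok"


v1 = {"entity_id": "string", "score": "double"}
v2 = {"entity_id": "string", "score": "double", "segment": "string"}  # additive
v3 = {"entity_id": "string", "score": "string"}  # retyped: breaking

print(is_backward_compatible(v1, v2))  # (True, 'ok')
print(is_backward_compatible(v1, v3))  # (False, 'retyped field: score')
```

<p>Run a check like this in CI and a breaking schema change fails the build instead of a consumer’s dashboard.</p>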

<h2 id="production-reality-you-own-reliability">Production Reality: You Own Reliability</h2>

<p>Once data feeds product features, it inherits product expectations.</p>

<p>That means:</p>

<ul>
  <li>SLOs (not “best effort”), e.g. 99.9% of events processed in &lt; 500ms</li>
  <li>Alerting</li>
  <li>Runbooks</li>
  <li>Postmortems</li>
</ul>

<p><strong>What actually matters:</strong></p>

<ul>
  <li>Latency is a user-facing metric now</li>
  <li>Freshness is correctness</li>
  <li>Silent failures are worse than crashes</li>
</ul>

<p><strong>Where things break:</strong></p>

<ul>
  <li>Lag accumulates silently → stale features → bad decisions</li>
  <li>Partial pipeline failures → inconsistent state</li>
  <li>No observability → debugging becomes guesswork</li>
</ul>
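<p>Checking an SLO like “99.9% of events processed in under 500ms” against a sample of per-event latencies is small enough to automate. A hedged sketch, with illustrative thresholds:</p>

```python
def slo_met(latencies_ms, threshold_ms=500.0, target=0.999):
    """True if at least `target` fraction of events beat the threshold."""
    if not latencies_ms:
        return True  # vacuously met; a real system should alert on no data
    within = sum(1 for l in latencies_ms if l < threshold_ms)
    return within / len(latencies_ms) >= target


# 1000 fast events plus 2 slow ones -> 1000/1002 ~ 0.998 < 0.999: violated
sample = [120.0] * 1000 + [900.0, 1500.0]
print(slo_met(sample))  # False
```

<p>Note how few slow events it takes to blow a three-nines target; that is why lag that “accumulates silently” is listed above as a failure mode.</p>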

<h2 id="cost-and-security-the-hidden-backend-layer">Cost and Security: The Hidden Backend Layer</h2>

<p>At scale, data systems behave like distributed backends with worse costs.</p>

<p>You deal with:</p>

<ul>
  <li>Storage explosion</li>
  <li>Cross-region egress</li>
  <li>Sensitive data access</li>
</ul>

<p>So, you end up implementing:</p>

<ul>
  <li>RBAC</li>
  <li>Quotas</li>
  <li>Network isolation</li>
  <li>Encryption policies</li>
</ul>

<p><strong>What actually matters:</strong></p>

<ul>
  <li>Data systems leak money faster than backend systems</li>
  <li>Security failures here are more damaging</li>
  <li>Multi-tenant isolation is non-trivial</li>
</ul>

<h2 id="tooling-lie-abstractions-remove-easy-problems">Tooling Lie: Abstractions Remove Easy Problems</h2>

<p>Modern tools (dbt, managed pipelines, feature stores) are useful.</p>

<p>They remove:</p>
<ul>
  <li>Boilerplate</li>
  <li>Simple transformations</li>
</ul>

<p>They do not remove:</p>
<ul>
  <li>Stateful processing</li>
  <li>Event-time correctness</li>
  <li>System reliability</li>
</ul>

<p>Reality: When things get hard, you drop down to:</p>
<ul>
  <li>Custom services</li>
  <li>Streaming processors</li>
  <li>Backend patterns</li>
</ul>

<p>That’s the convergence point.</p>

<h2 id="what-good-teams-do-differently">What Good Teams Do Differently</h2>

<p>They stop pretending pipelines are scripts. They treat them like services.</p>

<p><strong>Non-negotiables:</strong></p>

<ul>
  <li>CI/CD for data + APIs</li>
  <li>Contract testing (schema compatibility)</li>
  <li>Observability (metrics, traces, logs)</li>
  <li>SLO-driven prioritization</li>
  <li>Versioned interfaces</li>
</ul>

<p><strong>Mental model shift:</strong></p>

<ul>
  <li>Tables are storage</li>
  <li>APIs are products</li>
</ul>

<h3 id="before-vs-after-operationally">Before vs After (Operationally)</h3>

<p><strong>Before:</strong></p>
<ul>
  <li>Nightly batch refresh</li>
  <li>Consumers query raw tables</li>
  <li>Logic duplicated everywhere</li>
  <li>No ownership, no SLA</li>
</ul>

<p><strong>After:</strong></p>
<ul>
  <li>Streaming or near-real-time pipeline</li>
  <li>Exposed via API or event stream</li>
  <li>Centralized logic</li>
  <li>Owned, monitored, versioned</li>
</ul>

<p><strong>Outcome:</strong></p>
<ul>
  <li>Less duplication</li>
  <li>Faster iteration</li>
  <li>More upfront cost</li>
  <li>Far less long-term chaos</li>
</ul>

<h2 id="engineering-checklist-if-you-care-about-scale">Engineering Checklist (If You Care About Scale)</h2>

<ul>
  <li>Treat every pipeline as a service</li>
  <li>Version everything (schemas, APIs, outputs)</li>
  <li>Design for replay and idempotency</li>
  <li>Instrument before optimizing</li>
  <li>Define SLOs early (or you’ll invent them under pressure)</li>
  <li>Prefer boring, reliable systems over clever ones</li>
</ul>

<h2 id="hiring-reality">Hiring Reality</h2>

<p>The bar shifted.</p>

<p><strong>What matters now:</strong></p>
<ul>
  <li>System design</li>
  <li>Production ownership</li>
  <li>Debugging distributed systems</li>
  <li>Strong programming fundamentals</li>
</ul>

<p><strong>What matters less:</strong></p>
<ul>
  <li>Isolated Spark/SQL expertise without system context</li>
</ul>

<p>Titles are catching up:</p>
<ul>
  <li>Data Platform Engineer</li>
  <li>Data Infra Engineer</li>
  <li>Backend Engineer (Data)</li>
</ul>

<h2 id="the-takeaway">The Takeaway</h2>

<p>Data engineering didn’t expand — it matured.</p>

<p>The industry stopped tolerating:</p>
<ul>
  <li>brittle pipelines</li>
  <li>undefined ownership</li>
  <li>silent failures</li>
</ul>

<p>and replaced them with:</p>
<ul>
  <li>services</li>
  <li>contracts</li>
  <li>reliability expectations</li>
</ul>

<p>If your system delivers data to something that makes decisions in real time, you are not “moving data.” You are operating a backend system. Start treating it that way — or you’ll keep debugging it like it’s 2015.</p>]]></content><author><name>Guru, Eerla</name></author></entry><entry><title type="html">If You’re Not Letting AI Write Code, You’re Already Behind - But Don’t Hand It the Keys</title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/04/22/if-youre-not-letting-ai-write-code-youre-already-behind/" rel="alternate" type="text/html" title="If You’re Not Letting AI Write Code, You’re Already Behind - But Don’t Hand It the Keys" /><published>2026-04-22T10:31:00+00:00</published><updated>2026-04-22T10:31:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/04/22/if-youre-not-letting-ai-write-code-youre-already-behind</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/04/22/if-youre-not-letting-ai-write-code-youre-already-behind/"><![CDATA[<p>Recently, I used an AI assistant to bootstrap a local environment - resolving dependencies, fixing configuration issues, and getting everything running in minutes.</p>

<p>Later, when a teammate asked for help reproducing the setup, I realized something uncomfortable: I didn’t have a clear, deterministic set of steps to give them. Only a prompt.</p>

<p>That moment highlights a deeper shift:</p>

<p>AI is accelerating development - but it’s also changing how knowledge is created, shared, and reproduced.</p>

<h2 id="the-quiet-shift-happening-in-engineering">The Quiet Shift Happening in Engineering</h2>

<p>Over the last year, something changed.</p>

<p>AI didn’t just become “useful.” It became embedded.</p>

<ul>
  <li>Inside IDEs</li>
  <li>Inside pull requests</li>
  <li>Inside CI pipelines</li>
</ul>

<p>That’s not hype. That’s where the baseline is moving.</p>

<h2 id="what-happens-if-you-ignore-it">What Happens If You Ignore It</h2>

<p>If your team treats AI as optional:</p>

<ul>
  <li>You spend hours writing boilerplate</li>
  <li>You manually generate tests</li>
  <li>You refactor code that a model could do in seconds</li>
</ul>

<p>Meanwhile, other teams are shipping faster - not because they’re smarter, but because they’ve automated the boring parts.</p>

<h2 id="what-happens-if-you-use-it-blindly">What Happens If You Use It Blindly</h2>

<p>This is where most teams fail. They adopt AI… without discipline.</p>

<p>And then they hit:</p>

<ul>
  <li>hallucinated APIs</li>
  <li>insecure code patterns</li>
  <li>hidden dependencies</li>
  <li>unpredictable costs</li>
</ul>

<p>AI doesn’t fail loudly. It fails convincingly.</p>

<h2 id="what-actually-changed">What Actually Changed</h2>

<p>AI didn’t replace engineers. It shifted the role.</p>

<p><strong>From:</strong></p>
<ul>
  <li>writing every line manually</li>
</ul>

<p><strong>To:</strong></p>
<ul>
  <li>defining intent</li>
  <li>reviewing outputs</li>
  <li>enforcing correctness</li>
</ul>

<p>Think of it like this: AI writes the first draft. Engineers decide what survives.</p>

<h2 id="where-ai-actually-delivers-value">Where AI Actually Delivers Value</h2>

<p>Not everywhere - but in very specific places.</p>

<h3 id="high-leverage-use-cases">High Leverage Use Cases</h3>

<ul>
  <li>Scaffolding endpoints and services</li>
  <li>Generating unit tests</li>
  <li>Writing documentation</li>
  <li>Small refactors and migrations</li>
  <li>Drafting pull requests</li>
</ul>

<p>These are:</p>
<ul>
  <li>repetitive</li>
  <li>time-consuming</li>
  <li>low cognitive value</li>
</ul>

<p>Perfect for automation.</p>

<h2 id="a-realistic-before-vs-after">A Realistic Before vs After</h2>

<p><strong>Before:</strong></p>
<ul>
  <li>Endpoint from spec: 3-5 hours</li>
  <li>Tests, boilerplate, docs: manual</li>
</ul>

<p><strong>After (AI-assisted):</strong></p>
<ul>
  <li>Draft in ~30-60 minutes</li>
  <li>Human review and validation still required</li>
  <li>~2x-3x speedup for routine work</li>
</ul>

<p>Across a sprint, that’s not small. That’s weeks of engineering time reclaimed per quarter.</p>

<h2 id="why-most-teams-still-dont-trust-it">Why Most Teams Still Don’t Trust It</h2>

<p>Because they shouldn’t - yet. Common failure modes:</p>

<h3 id="1-hallucinations">1. Hallucinations</h3>
<p>Code references:</p>
<ul>
  <li>non-existent APIs</li>
  <li>wrong schemas</li>
  <li>imaginary helpers</li>
</ul>

<h3 id="2-insecure-patterns">2. Insecure Patterns</h3>
<p>You’ll see:</p>
<ul>
  <li>hardcoded secrets</li>
  <li>outdated libraries</li>
  <li>unsafe defaults</li>
</ul>

<h3 id="3-hidden-dependencies">3. Hidden Dependencies</h3>
<p>Generated code quietly pulls in things your system doesn’t track. Now your SBOM is wrong.</p>

<h3 id="4-cost-surprises">4. Cost Surprises</h3>
<p>Everyone assumes: “AI = GPU cost”</p>

<p>Reality:</p>
<ul>
  <li>network egress</li>
  <li>NAT gateways</li>
  <li>load balancers</li>
  <li>storage</li>
</ul>

<p>Often cost more than inference itself.</p>

<h2 id="the-only-way-this-works-treat-ai-like-a-junior-engineer">The Only Way This Works: Treat AI Like a Junior Engineer</h2>

<p>Not a tool. Not an oracle. A junior teammate.</p>

<p>It can:</p>
<ul>
  <li>draft quickly</li>
  <li>make mistakes</li>
  <li>require supervision</li>
</ul>

<p>So, your system needs to enforce that.</p>

<h2 id="the-production-pattern-that-works">The Production Pattern That Works</h2>

<p>Here’s the model that actually scales:</p>

<ol>
  <li><strong>AI generates code</strong></li>
  <li><strong>Attach provenance metadata</strong> (model, prompt, timestamp)</li>
  <li><strong>Run:</strong>
    <ul>
      <li>linting</li>
      <li>security scans</li>
      <li>dependency checks</li>
    </ul>
  </li>
  <li><strong>Generate + run tests</strong></li>
  <li><strong>Run integration/contract checks</strong></li>
  <li><strong>Block merge if anything fails</strong></li>
  <li><strong>Require human review</strong></li>
</ol>
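<p>The steps above can be sketched as a single gate function. Check names and provenance fields here are assumptions, not a real CI API:</p>

```python
import datetime


def make_provenance(model, prompt):
    """Attach who/what/when metadata to a generated change."""
    return {"model": model, "prompt": prompt,
            "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat()}


def gate(change, checks):
    """Run every check; a single failure blocks the merge."""
    failures = [name for name, check in checks.items() if not check(change)]
    return {"merge_allowed": not failures, "failures": failures}


change = {"diff": "...", "provenance": make_provenance("some-model", "add endpoint")}
checks = {
    "lint": lambda c: True,
    "security_scan": lambda c: True,
    "tests": lambda c: False,  # simulate a failing test suite
}
print(gate(change, checks))  # merge blocked by "tests"
```

<p>The human review step then happens on top of a change that already carries its provenance and has already passed the machines.</p>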

<h2 id="why-provenance-matters-more-than-people-think">Why Provenance Matters More Than People Think</h2>

<p>If you don’t track:</p>
<ul>
  <li>which model generated the code</li>
  <li>what prompt was used</li>
  <li>when it was created</li>
</ul>

<p>Then when something breaks… You have no idea why.</p>

<p>Provenance turns AI from: “Black box output”</p>

<p>Into: auditable engineering artifact</p>

<h2 id="what-you-must-have-non-negotiable">What You Must Have (Non-Negotiable)</h2>

<p>If you’re using AI in production:</p>

<ul>
  <li>Unit + integration tests</li>
  <li>Contract tests</li>
  <li>SAST + dependency scanning</li>
  <li>CI gating (no green -&gt; no merge)</li>
  <li>Versioned artifacts</li>
  <li>Audit trail for generated code</li>
</ul>

<p>Without these: You’re not accelerating. You’re accumulating risk faster.</p>

<h2 id="the-cost-reality-most-teams-miss">The Cost Reality Most Teams Miss</h2>

<p>AI isn’t just a “model cost.”</p>

<p>Track:</p>
<ul>
  <li>tokens per request</li>
  <li>request frequency</li>
  <li>storage (artifacts, embeddings)</li>
  <li>network (egress, gateways)</li>
  <li>logging + retention</li>
</ul>

<p>Measure it early. Because costs don’t grow linearly - they compound with usage.</p>

<h2 id="the-real-shift-what-engineers-do-now">The Real Shift: What Engineers Do Now</h2>

<p>The role is moving up the stack.</p>

<p><strong>From:</strong></p>
<ul>
  <li>writing code</li>
</ul>

<p><strong>To:</strong></p>
<ul>
  <li>designing systems</li>
  <li>writing better specifications/prompts</li>
  <li>verifying outputs</li>
  <li>governing models</li>
</ul>

<p>The best engineers won’t write more code. They’ll decide what code survives, faster.</p>

<h2 id="how-to-adopt-this-without-breaking-things">How to Adopt This Without Breaking Things</h2>

<p>Don’t go all-in. Start small.</p>

<h3 id="30-days">30 Days</h3>
<ul>
  <li>Use AI for scaffolding in one repo</li>
  <li>Store generated code with provenance</li>
</ul>

<h3 id="60-days">60 Days</h3>
<ul>
  <li>Add CI checks (tests + security)</li>
  <li>Track usage and cost</li>
</ul>

<h3 id="90-days">90 Days</h3>
<ul>
  <li>Add gating policies</li>
  <li>Consider fine-tuning on your codebase</li>
  <li>Expand to more workflows</li>
</ul>

<h2 id="whats-coming-next">What’s Coming Next</h2>

<ul>
  <li>Domain-specific copilots (finance, healthcare, etc.)</li>
  <li>Deeper IDE + CI integration</li>
  <li>Policy-as-code for AI-generated changes</li>
  <li>Auditors asking for: provenance, SBOMs, model governance</li>
</ul>

<p>This isn’t optional infrastructure anymore. It’s becoming standard engineering practice.</p>

<p>If you’re not using AI to handle repetitive engineering work, you’re falling behind.</p>

<p>But if you use it without discipline, you’ll move faster - in the wrong direction.</p>

<p>AI can 2-3x your velocity - but only if you verify everything it writes.</p>]]></content><author><name>Think Data</name></author><category term="ai" /><category term="software-engineering" /><category term="development" /><category term="ai" /><category term="llm" /><category term="software-development" /><category term="software-engineering" /><summary type="html"><![CDATA[AI is accelerating development but it's also changing how knowledge is created, shared, and reproduced.]]></summary></entry><entry><title type="html">Not all Data Pipelines Fail — They Succeed with Wrong Data</title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/04/11/not-all-data-pipelines-fail-they-succeed-with-wrong-data/" rel="alternate" type="text/html" title="Not all Data Pipelines Fail — They Succeed with Wrong Data" /><published>2026-04-11T20:55:00+00:00</published><updated>2026-04-11T20:55:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/04/11/not-all-data-pipelines-fail-they-succeed-with-wrong-data</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/04/11/not-all-data-pipelines-fail-they-succeed-with-wrong-data/"><![CDATA[<p>Most data pipelines don’t fail loudly. They fail quietly — and keep running. That’s the real problem.</p>

<h2 id="the-week-that-changed-my-mind">The Week That Changed My Mind</h2>

<p>I used to think CI/CD for data was “nice to have.”</p>

<p>Then in one week:</p>

<ul>
  <li>An upstream schema drifted</li>
  <li>An ETL job added duplicate records</li>
  <li>Production jobs still reported success</li>
</ul>

<p>Nothing crashed. Pipelines stayed “green.” But data was wrong.</p>

<p>That’s when it clicked:</p>

<p>The line between a calm data team and a chaotic one<br />
isn’t tooling — it’s discipline.</p>

<p>And today, that discipline looks like CI/CD for data.</p>

<h2 id="why-this-matters-now">Why This Matters Now</h2>

<p>Data systems have changed.</p>

<p>We’re no longer dealing with:</p>
<ul>
  <li>small batch jobs</li>
  <li>stable schemas</li>
  <li>occasional updates</li>
</ul>

<p>We’re dealing with:</p>
<ul>
  <li>event-driven pipelines</li>
  <li>constantly evolving data</li>
  <li>real-time or near-real-time expectations</li>
</ul>

<p>At the same time, tools have matured:</p>
<ul>
  <li>dbt and modern orchestration</li>
  <li>table formats like Delta / Iceberg</li>
  <li>integrated DataOps platforms</li>
</ul>

<p>The direction is clear:</p>

<p>Data is no longer “just pipelines.” It’s software that needs to be tested, versioned, and deployed.</p>

<h2 id="how-data-systems-actually-fail">How Data Systems Actually Fail</h2>

<p>Without CI/CD, failures don’t look like exceptions.</p>

<p>They look like this:</p>

<h3 id="silent-data-corruption">Silent Data Corruption</h3>
<p>A bad join or schema change doesn’t crash anything.<br />
It just poisons downstream dashboards.</p>

<h3 id="non-reproducible-backfills">Non-Reproducible Backfills</h3>
<p>You rerun a pipeline and get a different answer.<br />
Now “what changed?” has no clear answer.</p>

<h3 id="partial-writes--broken-state">Partial Writes &amp; Broken State</h3>
<p>Long-running jobs fail halfway. Some data is updated. Some isn’t. Now you have multiple versions of truth.</p>

<h3 id="slow-painful-incident-response">Slow, Painful Incident Response</h3>
<p>No tests. No rollback. No clear lineage. Fixing one issue turns into days of investigation.</p>

<p>These aren’t edge cases. They’re everyday problems in systems that lack guardrails.</p>

<h2 id="why-app-style-ci-isnt-enough">Why App-Style CI Isn’t Enough</h2>

<p>It’s tempting to apply traditional CI/CD patterns directly.</p>

<p>But data systems behave differently:</p>

<ul>
  <li><strong>Stateful pipelines</strong> → you deal with checkpoints, offsets, time</li>
  <li><strong>Schema evolution</strong> → producers change constantly</li>
  <li><strong>Non-determinism</strong> → randomness, APIs, sampling</li>
  <li><strong>Heavy backfills</strong> → reprocessing large volumes</li>
</ul>

<p>This means you need more than just “run tests on PR.” You need data-aware patterns.</p>

<h2 id="what-works-in-practice">What Works in Practice</h2>

<p>You don’t need a perfect system. You need a few high-leverage patterns.</p>

<h3 id="1-treat-data-like-code">1. Treat Data Like Code</h3>
<p>Store everything in version control:</p>
<ul>
  <li>SQL models</li>
  <li>pipeline definitions</li>
  <li>schemas and contracts</li>
</ul>

<p>Every change goes through a PR.</p>

<p><strong>Why it matters:</strong> Small, reviewable changes are easier to trust — and easier to roll back.</p>

<h3 id="2-enforce-data-contracts">2. Enforce Data Contracts</h3>
<p>Don’t let schemas drift silently.</p>

<p>Validate changes before they hit production:</p>
<ul>
  <li>column types</li>
  <li>nullability</li>
  <li>required fields</li>
</ul>

<p>If the contract breaks, the deploy should fail.</p>
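<p>A deploy-time contract check does not need a framework. A minimal sketch, with a hypothetical contract format covering types and nullability:</p>

```python
CONTRACT = {
    "order_id": {"type": int, "nullable": False},
    "amount":   {"type": float, "nullable": False},
    "coupon":   {"type": str, "nullable": True},
}


def violations(rows):
    """Validate a sample batch against the contract; return all violations."""
    errs = []
    for i, row in enumerate(rows):
        for col, rule in CONTRACT.items():
            if col not in row or row[col] is None:
                if not rule["nullable"]:
                    errs.append(f"row {i}: {col} is required")
            elif not isinstance(row[col], rule["type"]):
                errs.append(f"row {i}: {col} has wrong type")
    return errs


batch = [
    {"order_id": 1, "amount": 9.99, "coupon": None},
    {"order_id": 2, "amount": "free", "coupon": "X10"},  # drifted type
]
print(violations(batch))  # ['row 1: amount has wrong type']
```

<p>Wire a check like this into the deploy step, and a non-empty result fails the deploy instead of poisoning production.</p>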

<h3 id="3-make-pipelines-idempotent">3. Make Pipelines Idempotent</h3>
<p>If rerunning a job changes results, you don’t have a pipeline —<br />
you have a risk.</p>

<p>Use patterns like:</p>
<ul>
  <li>upserts (merge)</li>
  <li>deterministic transformations</li>
</ul>

<p>Same input → same output. Every time.</p>

<h3 id="4-shift-testing-left">4. Shift Testing Left</h3>
<p>Don’t wait for production to validate data.</p>

<p>Add layers of testing:</p>
<ul>
  <li>unit tests for transformations</li>
  <li>integration tests on small datasets</li>
  <li>statistical checks (row counts, null rates, distributions)</li>
</ul>

<p>Bad data should fail fast — before it ships.</p>
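<p>The statistical layer is the cheapest one to add. A sketch with illustrative thresholds and a hypothetical column name:</p>

```python
def check_batch(rows, min_rows=1, max_null_rate=0.1, column="user_id"):
    """Cheap pre-ship assertions: row count and null rate."""
    issues = []
    if len(rows) < min_rows:
        issues.append("row count below minimum")
    if rows:
        null_rate = sum(1 for r in rows if r.get(column) is None) / len(rows)
        if null_rate > max_null_rate:
            issues.append(f"{column} null rate {null_rate:.0%} too high")
    return issues


good = [{"user_id": i} for i in range(100)]
bad = [{"user_id": None}] * 30 + [{"user_id": 1}] * 70

print(check_batch(good))  # []
print(check_batch(bad))   # ['user_id null rate 30% too high']
```

<p>Distribution checks follow the same shape: compute a statistic on the batch, compare it to a bound, fail loudly.</p>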

<h3 id="5-use-versioned-time-travel-tables">5. Use Versioned, Time-Travel Tables</h3>
<p>Table formats like Delta or Iceberg make a huge difference.</p>

<p>They give you:</p>
<ul>
  <li>atomic writes</li>
  <li>rollback capability</li>
  <li>reproducible snapshots</li>
</ul>

<p>If you can’t rewind data, you can’t debug it.</p>

<h3 id="6-canary-before-full-deployment">6. Canary Before Full Deployment</h3>
<p>Don’t deploy changes everywhere at once.</p>

<ul>
  <li>Run on a subset of data</li>
  <li>Compare key metrics</li>
  <li>Promote only if it passes</li>
</ul>

<p>Small blast radius → safer systems.</p>
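<p>The canary comparison can be as simple as bounding metric drift between the current output and the candidate’s. A sketch, with illustrative metrics and tolerance:</p>

```python
def canary_passes(current, candidate, tolerance=0.05):
    """current/candidate: dicts of metric name -> value.

    Promote only if every metric stays within `tolerance` relative drift.
    """
    for metric, base in current.items():
        if base == 0:
            continue  # no baseline to compare against
        drift = abs(candidate.get(metric, 0) - base) / abs(base)
        if drift > tolerance:
            return False
    return True


current = {"row_count": 10_000, "revenue_sum": 52_300.0}
candidate = {"row_count": 10_050, "revenue_sum": 31_000.0}  # revenue off by ~40%
print(canary_passes(current, candidate))  # False
```

<p>Which metrics to compare is the judgment call; the mechanism itself is trivial.</p>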

<h3 id="7-build-observability-into-pipeline">7. Build Observability into the Pipeline</h3>
<p>You shouldn’t rely on someone noticing a broken dashboard.</p>

<p>Track:</p>
<ul>
  <li>freshness</li>
  <li>completeness</li>
  <li>anomalies in key metrics</li>
</ul>

<p>Good systems detect issues before users do.</p>
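<p>A freshness check is the simplest observability win. A sketch, assuming a 30-minute staleness threshold:</p>

```python
import datetime


def is_stale(last_updated, now, max_age=datetime.timedelta(minutes=30)):
    """True if the dataset has not been updated within max_age."""
    return (now - last_updated) > max_age


now = datetime.datetime(2026, 4, 11, 12, 0)
print(is_stale(datetime.datetime(2026, 4, 11, 11, 45), now))  # False
print(is_stale(datetime.datetime(2026, 4, 11, 10, 0), now))   # True
```

<p>Run it on a schedule and page on <code class="language-plaintext highlighter-rouge">True</code>, and stale dashboards stop being discovered by users.</p>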

<h2 id="what-changes-after-you-do-this">What Changes After You Do This</h2>

<p>The shift is subtle — but powerful.</p>

<h3 id="before">Before</h3>
<ul>
  <li>Pipelines “usually work”</li>
  <li>Fixes are manual and reactive</li>
  <li>Data issues take days to debug</li>
</ul>

<h3 id="after">After</h3>
<ul>
  <li>Changes are tested before deployment</li>
  <li>Failures are isolated and reversible</li>
  <li>Data is reproducible and auditable</li>
</ul>

<h2 id="the-trade-offs-be-honest">The Trade-Offs (Be Honest)</h2>

<p>CI/CD for data isn’t free.</p>

<p>It costs:</p>
<ul>
  <li>engineering time</li>
  <li>compute for testing</li>
  <li>discipline to maintain</li>
</ul>

<p>But the alternative is worse:</p>
<ul>
  <li>unreliable dashboards</li>
  <li>broken trust</li>
  <li>expensive incidents</li>
</ul>

<p>Most teams don’t pay upfront. They pay later — with interest.</p>

<h2 id="where-this-is-going">Where This Is Going</h2>

<p>We’re already seeing the next layer:</p>
<ul>
  <li>automated anomaly detection</li>
  <li>smarter validation using ML</li>
  <li>systems that suggest root causes</li>
</ul>

<p>But all of that depends on one thing:</p>

<p>You can’t build intelligent systems on top of unreliable pipelines. CI/CD is the foundation.</p>

<h2 id="a-practical-starting-point">A Practical Starting Point</h2>

<p>You don’t need to do everything at once.</p>

<p>Start small:</p>
<ul>
  <li>Add schema checks to your PRs</li>
  <li>Run data tests (dbt or similar) on every change</li>
  <li>Version one critical dataset with time travel</li>
</ul>

<p>That alone will eliminate a surprising amount of chaos.</p>

<h2 id="final-take">Final Take</h2>

<p>Data pipelines don’t just move data. They produce decisions.</p>

<p>If those pipelines aren’t:</p>
<ul>
  <li>tested</li>
  <li>versioned</li>
  <li>reproducible</li>
</ul>

<p>Then decisions built on top of them aren’t reliable either.</p>

<p>CI/CD for data turns pipelines from “best effort” into systems you can trust.</p>]]></content><author><name>Think Data</name></author><category term="data-engineering" /><category term="data-pipelines" /><category term="cicd" /><category term="data-engineering" /><category term="data-pipelines" /><category term="data-quality" /><category term="cicd" /><summary type="html"><![CDATA[Most data pipelines don't fail loudly. They fail quietly — and keep running. That's the real problem.]]></summary></entry><entry><title type="html">If You Think You Know Python, These Will Prove You Wrong</title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/04/11/if-you-think-you-know-python-these-will-prove-you-wrong/" rel="alternate" type="text/html" title="If You Think You Know Python, These Will Prove You Wrong" /><published>2026-04-11T20:52:00+00:00</published><updated>2026-04-11T20:52:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/04/11/if-you-think-you-know-python-these-will-prove-you-wrong</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/04/11/if-you-think-you-know-python-these-will-prove-you-wrong/"><![CDATA[<p><img src="https://images.unsplash.com/photo-1555066931-4365d14bab8c?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" alt="Photo by Hitesh Choudhary on Unsplash" /></p>

<p>Most of us get comfortable because our code works, not because we fully understand why. And that illusion breaks the moment you hit edge cases that don’t behave the way you expect.</p>

<h2 id="1-default-mutable-arguments-but-the-real-gotcha">1. Default Mutable Arguments (but the real gotcha)</h2>

<p>You already know this is bad:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">add_item</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">lst</span><span class="o">=</span><span class="p">[]):</span>
    <span class="n">lst</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">lst</span>
</code></pre></div></div>

<p>But here’s what people miss: It’s not just a bug — it’s intentional state retention:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">counter</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="p">{}):</span>
    <span class="n">cache</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">cache</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">cache</span>
</code></pre></div></div>

<p>This acts like a hidden static variable.</p>

<p>💡 Used carefully → performance trick<br />
💀 Used accidentally → nightmare debugging</p>
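<p>When you want the safe behavior rather than the trick, the standard fix is a <code class="language-plaintext highlighter-rouge">None</code> sentinel:</p>

```python
def add_item(x, lst=None):
    if lst is None:
        lst = []  # fresh list on every call, not one shared across calls
    lst.append(x)
    return lst


print(add_item(1))  # [1]
print(add_item(2))  # [2], not [1, 2]
```

<p>Defaults are evaluated once at function definition; the sentinel moves list creation to call time.</p>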

<h2 id="2-is-vs--worse-than-you-think">2. <code class="language-plaintext highlighter-rouge">is</code> vs <code class="language-plaintext highlighter-rouge">==</code> (Worse Than You Think)</h2>

<p>Everyone says: use <code class="language-plaintext highlighter-rouge">==</code>, not <code class="language-plaintext highlighter-rouge">is</code></p>

<p>But here’s the twist:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="mi">256</span>
<span class="n">b</span> <span class="o">=</span> <span class="mi">256</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span> <span class="ow">is</span> <span class="n">b</span><span class="p">)</span>  <span class="c1"># True
</span>
<span class="n">a</span> <span class="o">=</span> <span class="mi">257</span>
<span class="n">b</span> <span class="o">=</span> <span class="mi">257</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span> <span class="ow">is</span> <span class="n">b</span><span class="p">)</span>  <span class="c1"># False
</span></code></pre></div></div>

<p>Python interns small integers (-5 to 256).</p>

<p>Even worse:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="s">"hello"</span>
<span class="n">b</span> <span class="o">=</span> <span class="s">"hello"</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span> <span class="ow">is</span> <span class="n">b</span><span class="p">)</span>  <span class="c1"># True (sometimes)
</span></code></pre></div></div>

<p>💡 String interning is inconsistent across contexts.</p>
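<p>If you genuinely need identity checks on strings (a common dict-key speed trick), <code class="language-plaintext highlighter-rouge">sys.intern</code> makes it explicit instead of relying on implementation details:</p>

```python
import sys

# These strings are built at runtime, so CPython does not auto-intern them:
a = "gotcha-" + str(42)
b = "gotcha-" + str(42)
print(a is b)  # False (two distinct objects)

# Explicit interning returns one canonical object per value:
x = sys.intern("gotcha-" + str(42))
y = sys.intern("gotcha-" + str(42))
print(x is y)  # True
```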

<h2 id="3-late-binding-in-closures-classic-trap">3. Late Binding in Closures (Classic Trap)</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">funcs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
    <span class="n">funcs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="k">lambda</span><span class="p">:</span> <span class="n">i</span><span class="p">)</span>

<span class="k">print</span><span class="p">([</span><span class="n">f</span><span class="p">()</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">funcs</span><span class="p">])</span>  <span class="c1"># [2, 2, 2]
</span></code></pre></div></div>

<p>👉 All three lambdas capture the same variable, not its value at append time.</p>

<p><strong>Fix:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">funcs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="k">lambda</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="p">:</span> <span class="n">i</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="4-dict-order-is-guaranteed-but-that-changes-design">4. Dict Order Is Guaranteed (But That Changes Design)</h2>

<p>Since Python 3.7:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d</span> <span class="o">=</span> <span class="p">{</span><span class="s">"a"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">"b"</span><span class="p">:</span> <span class="mi">2</span><span class="p">}</span>
</code></pre></div></div>

<p>👉 Order is preserved.</p>

<p><strong>Hidden impact:</strong></p>

<p>People now rely on dict order → implicit coupling between insertion and iteration logic<br />
Code written against pre-3.7 assumptions behaves differently when ported.</p>

<p>💡 Dicts are now often used like lightweight ordered structures.</p>

<h2 id="5-set-removes-duplicates-but-also-reorders">5. <code class="language-plaintext highlighter-rouge">set</code> Removes Duplicates… But Also Reorders</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">([</span><span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">]))</span>
<span class="c1"># [1, 2, 3]  (but not guaranteed order)
</span></code></pre></div></div>

<p>👉 Many devs accidentally introduce non-determinism.</p>
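<p>If you need deduplication <em>without</em> losing order, <code class="language-plaintext highlighter-rouge">dict.fromkeys</code> (order-preserving since 3.7) is the deterministic alternative:</p>

```python
items = [3, 1, 2, 1]

# Dicts preserve insertion order, so this dedupes deterministically:
deduped = list(dict.fromkeys(items))
print(deduped)  # [3, 1, 2]
```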

<h2 id="6-everything-is-a-reference-but-not-always-obvious">6. Everything Is a Reference (But Not Always Obvious)</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">a</span>
<span class="n">b</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>  <span class="c1"># [1, 2, 3]
</span></code></pre></div></div>

<p>But:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">a</span><span class="p">[:]</span>
<span class="n">b</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>  <span class="c1"># [1, 2]
</span></code></pre></div></div>

<p>👉 Copy vs reference bugs show up in:</p>
<ul>
  <li>caching</li>
  <li>multiprocessing</li>
  <li>data pipelines</li>
</ul>
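<p>And the slice trick above only goes one level deep. For nested structures, a shallow copy still shares the inner objects; <code class="language-plaintext highlighter-rouge">copy.deepcopy</code> is the safe option:</p>

```python
import copy

a = [[1, 2], [3, 4]]

shallow = a[:]           # new outer list, same inner lists
deep = copy.deepcopy(a)  # fully independent copy

a[0].append(99)
print(shallow[0])  # [1, 2, 99] (inner list is shared)
print(deep[0])     # [1, 2]     (unaffected)
```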

<h2 id="7-tuple-isnt-always-immutable">7. Tuple Isn’t Always Immutable</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">t</span> <span class="o">=</span> <span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="mi">99</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>  <span class="c1"># ([1, 2, 99], 3)
</span></code></pre></div></div>

<p>👉 Tuple is immutable, but its contents might not be.</p>

<h2 id="8--can-mutate-or-not">8. <code class="language-plaintext highlighter-rouge">+=</code> Can Mutate… or Not</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">a</span>
<span class="n">a</span> <span class="o">+=</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>

<span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>  <span class="c1"># [1, 2, 3]
</span></code></pre></div></div>

<p>But:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">a</span>
<span class="n">a</span> <span class="o">+=</span> <span class="p">(</span><span class="mi">3</span><span class="p">,)</span>

<span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>  <span class="c1"># (1, 2)
</span></code></pre></div></div>

<p>👉 List mutates in-place<br />
👉 Tuple creates new object.</p>

<p>Same operator. Different behavior.</p>
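<p>You can watch it happen with <code class="language-plaintext highlighter-rouge">id()</code>: <code class="language-plaintext highlighter-rouge">+=</code> uses in-place <code class="language-plaintext highlighter-rouge">__iadd__</code> on lists but falls back to <code class="language-plaintext highlighter-rouge">__add__</code> (a new object) on tuples:</p>

```python
a = [1, 2]
before = id(a)
a += [3]
print(id(a) == before)  # True: the list mutated in place

t = (1, 2)
before = id(t)
t += (3,)
print(id(t) == before)  # False: the name was rebound to a new tuple
```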

<h2 id="9-exception-handling-can-hide-bugs">9. Exception Handling Can Hide Bugs</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def f():
    try:
        return something()
    finally:
        return "oops"  # this wins: the try's return value is discarded
</code></pre></div></div>

<p>👉 A <code class="language-plaintext highlighter-rouge">return</code> inside <code class="language-plaintext highlighter-rouge">finally</code> overrides the <code class="language-plaintext highlighter-rouge">try</code> block’s return. It also silently swallows any in-flight exception.</p>

<h2 id="10-for-else-exists-and-almost-nobody-uses-it-right">10. <code class="language-plaintext highlighter-rouge">for-else</code> Exists (and Almost Nobody Uses It Right)</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">x</span> <span class="o">==</span> <span class="n">target</span><span class="p">:</span>
        <span class="k">break</span>
<span class="k">else</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Not found"</span><span class="p">)</span>
</code></pre></div></div>

<p>👉 <code class="language-plaintext highlighter-rouge">else</code> runs only if loop did NOT break.</p>

<h2 id="11-floating-point-lies">11. Floating Point Lies</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mf">0.1</span> <span class="o">+</span> <span class="mf">0.2</span> <span class="o">==</span> <span class="mf">0.3</span>  <span class="c1"># False
</span></code></pre></div></div>

<p>👉 You know this… but it still bites in:</p>
<ul>
  <li>finance</li>
  <li>aggregations</li>
  <li>data pipelines</li>
</ul>
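<p>Two standard escapes: <code class="language-plaintext highlighter-rouge">math.isclose</code> for comparisons, and <code class="language-plaintext highlighter-rouge">decimal.Decimal</code> when exact base-10 arithmetic matters (e.g., money):</p>

```python
import math
from decimal import Decimal

print(0.1 + 0.2 == 0.3)               # False
print(math.isclose(0.1 + 0.2, 0.3))   # True

# Decimal does exact base-10 arithmetic; construct from strings, not floats:
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```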

<h2 id="12-list-multiplication-shares-references">12. List Multiplication Shares References</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">grid</span> <span class="o">=</span> <span class="p">[[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="mi">3</span><span class="p">]</span><span class="o">*</span><span class="mi">3</span>
<span class="n">grid</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>

<span class="k">print</span><span class="p">(</span><span class="n">grid</span><span class="p">)</span>
<span class="c1"># [[1,0,0],[1,0,0],[1,0,0]]
</span></code></pre></div></div>

<p>👉 All three rows point to the same inner list.</p>
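<p><strong>Fix:</strong> build each row independently with a comprehension:</p>

```python
grid = [[0] * 3 for _ in range(3)]  # three distinct row lists
grid[0][0] = 1

print(grid)  # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
```

<p>(<code class="language-plaintext highlighter-rouge">[0] * 3</code> for the inner row is fine, because ints are immutable.)</p>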

<h2 id="13-bool-is-a-subclass-of-int">13. <code class="language-plaintext highlighter-rouge">bool</code> Is a Subclass of <code class="language-plaintext highlighter-rouge">int</code></h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="bp">True</span> <span class="o">+</span> <span class="bp">True</span> <span class="o">==</span> <span class="mi">2</span>  <span class="c1"># True
</span><span class="nb">isinstance</span><span class="p">(</span><span class="bp">True</span><span class="p">,</span> <span class="nb">int</span><span class="p">)</span>  <span class="c1"># True
</span></code></pre></div></div>

<p>👉 This leaks into:</p>
<ul>
  <li>pandas</li>
  <li>aggregations</li>
  <li>weird bugs</li>
</ul>

<h2 id="14-__del__-is-not-reliable">14. <code class="language-plaintext highlighter-rouge">__del__</code> Is Not Reliable</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">A</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__del__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"deleted"</span><span class="p">)</span>
</code></pre></div></div>

<p>👉 Garbage collection timing is unpredictable.</p>

<p>💀 Don’t rely on it for cleanup.</p>
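<p>For deterministic cleanup, reach for a context manager instead. A minimal sketch using <code class="language-plaintext highlighter-rouge">contextlib</code> (the <code class="language-plaintext highlighter-rouge">events</code> list just records ordering):</p>

```python
from contextlib import contextmanager

events = []

@contextmanager
def resource():
    events.append("acquired")
    try:
        yield "handle"
    finally:
        events.append("released")  # runs at block exit; GC timing is irrelevant

with resource() as h:
    events.append(f"using {h}")

print(events)  # ['acquired', 'using handle', 'released']
```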

<h2 id="15-iterators-get-exhausted-silently">15. Iterators Get Exhausted Silently</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">it</span> <span class="o">=</span> <span class="nb">iter</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">])</span>
<span class="nb">list</span><span class="p">(</span><span class="n">it</span><span class="p">)</span>  <span class="c1"># [1,2,3]
</span><span class="nb">list</span><span class="p">(</span><span class="n">it</span><span class="p">)</span>  <span class="c1"># []
</span></code></pre></div></div>

<p>👉 This causes subtle bugs in:</p>
<ul>
  <li>streaming pipelines</li>
  <li>generators</li>
  <li>testing</li>
</ul>
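<p>If you need multiple passes, materialize once, or split the iterator up front with <code class="language-plaintext highlighter-rouge">itertools.tee</code>:</p>

```python
import itertools

# Materialize once, reuse freely:
data = list(iter([1, 2, 3]))
print(data, data)  # [1, 2, 3] [1, 2, 3]

# Or split one iterator into independent iterators up front:
a, b = itertools.tee(iter([1, 2, 3]))
print(list(a), list(b))  # [1, 2, 3] [1, 2, 3]
```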

<h2 id="16-pattern-matching-310-has-sharp-edges">16. Pattern Matching (3.10+) Has Sharp Edges</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">match</span> <span class="n">x</span><span class="p">:</span>
    <span class="n">case</span> <span class="mi">1</span><span class="p">:</span>
        <span class="p">...</span>
    <span class="n">case</span> <span class="n">y</span><span class="p">:</span>
        <span class="p">...</span>
</code></pre></div></div>

<p>But:</p>

<p>👉 <code class="language-plaintext highlighter-rouge">case y</code> captures the value into <code class="language-plaintext highlighter-rouge">y</code>; it does not compare against an existing variable. A bare name in a pattern always matches.</p>

<p>💀 Many devs think it’s equality.</p>

<h2 id="17-shadowing-built-ins-breaks-everything">17. Shadowing Built-ins Breaks Everything</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">list</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span>
<span class="nb">list</span><span class="p">(</span><span class="s">"abc"</span><span class="p">)</span>  <span class="c1"># 💀
</span></code></pre></div></div>

<p>👉 Happens more in notebooks than you think.</p>
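<p>The recovery is simple: <code class="language-plaintext highlighter-rouge">del</code> removes the shadow and name lookup falls back to the built-in:</p>

```python
list = [1, 2, 3]    # shadows the built-in
del list            # removes the shadow...
print(list("abc"))  # ['a', 'b', 'c'] -- the built-in is back
```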

<h2 id="18-globals-and-locals-are-writable-sometimes">18. <code class="language-plaintext highlighter-rouge">globals()</code> and <code class="language-plaintext highlighter-rouge">locals()</code> Are Writable (Sometimes)</h2>

<p>You can do wild stuff like:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">globals</span><span class="p">()[</span><span class="s">'x'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">10</span>
</code></pre></div></div>

<p>👉 Useful for metaprogramming<br />
💀 Dangerous in large systems.</p>

<h2 id="final-take">Final Take</h2>

<p>Most Python bugs aren’t syntax issues.</p>

<p>They’re mental model mismatches.</p>

<p>Python looks simple — but it’s full of “gotchas by design.”</p>]]></content><author><name>Think Data</name></author><category term="python" /><category term="programming" /><category term="gotchas" /><category term="python" /><category term="advanced-programming" /><category term="common-mistakes" /><summary type="html"><![CDATA[Most of us get comfortable because our code works, not because we fully understand why. And that illusion breaks the moment you hit edge cases that don't behave the way you expect.]]></summary></entry><entry><title type="html">You’re Not Competing with AI - You’re Competing with Engineers Who Use It</title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/03/25/compete-with-engineers-who-use-ai/" rel="alternate" type="text/html" title="You’re Not Competing with AI - You’re Competing with Engineers Who Use It" /><published>2026-03-25T00:00:00+00:00</published><updated>2026-03-25T00:00:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/03/25/compete-with-engineers-who-use-ai</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/03/25/compete-with-engineers-who-use-ai/"><![CDATA[<p>I’m not saying this after a weekend of trying AI tools. I’m saying this after 2 years of using Cursor consistently - while working a demanding full-time job. And I’ll be direct: The way most engineers are still writing code today is already outdated.</p>

<p><img src="/data-engineering-blog/assets/images/useai.png" alt="AI Engineering Workflow" class="center-image" /></p>

<hr />

<h2 id="lets-say-the-quiet-part-out-loud">Let’s Say the Quiet Part Out Loud</h2>

<p>If you’re still:</p>
<ul>
  <li>Manually writing boilerplate</li>
  <li>Googling patterns you’ve implemented 100 times</li>
  <li>Stitching together repetitive logic</li>
</ul>

<p>You’re not demonstrating skill. You’re demonstrating resistance to leverage.</p>

<hr />

<h2 id="my-turning-point">My Turning Point</h2>

<p>When I first started using Cursor, I used it like autocomplete. That was a mistake. The real shift happened when I treated it like a collaborator.</p>

<p>I was building a data pipeline:</p>
<ul>
  <li>Ingestion</li>
  <li>Schema validation</li>
  <li>Transformations</li>
  <li>Feature logic</li>
</ul>

<p>Normally: a couple of days.</p>

<p>This time, I described the system in plain English:</p>
<ul>
  <li>Inputs</li>
  <li>Outputs</li>
  <li>Constraints</li>
  <li>Edge cases</li>
</ul>

<p>Cursor generated a working structure in minutes. Not perfect. But good enough to skip hours of setup. What used to take days took a few hours.</p>

<p>After repeating this for months, I realized this isn’t a trick. It’s the new baseline.</p>

<hr />

<h2 id="what-2-years-of-this-looks-like-with-a-full-time-job">What 2 Years of This Looks Like (With a Full-Time Job)</h2>

<p>Here’s the part that really changed my perspective: All of this was built outside my day job. Not by grinding nights endlessly. But by reducing the cost of building.</p>

<p>Over the past couple of years, I’ve built:</p>

<ul>
  <li><strong>An AI blog writing agent</strong> (research → structure → draft) - <a href="https://github.com/eerla/ai_blog_writing_agent">Check it out</a></li>
  <li><strong>An event management app</strong>: <a href="https://tribe-connect-two.vercel.app/">https://tribe-connect-two.vercel.app/</a></li>
  <li><strong>Pybenders</strong> - LLM-powered reels generator, multiple visual formats, 12+ content contexts, multi-platform output - <a href="https://github.com/eerla/pybenders">pybenders/README.md at main · eerla/pybenders</a></li>
  <li><strong>A full data engineering guide</strong>: <a href="https://eerla.github.io/data-engineering-blog/">https://eerla.github.io/data-engineering-blog/</a></li>
  <li><strong>An Interview Assist tool</strong>: resume scanning, auto-generated interview questions, structured evaluation - <a href="https://intervue-assist.streamlit.app/">https://intervue-assist.streamlit.app/</a></li>
  <li><strong>Thrive</strong>: mobile app where users receive daily customized motivational quotes powered by LLM - <a href="https://github.com/eerla/Thrive">eerla/Thrive: initial commit</a></li>
</ul>

<p>And several smaller tools and browser extensions that I use locally.</p>

<hr />

<h2 id="the-part-most-engineers-wont-like">The Part Most Engineers Won’t Like</h2>

<p>None of this required:</p>
<ul>
  <li>Months of effort per project</li>
  <li>Perfect architecture upfront</li>
  <li>Doing everything manually</li>
</ul>

<p>Because I wasn’t. AI handled:</p>
<ul>
  <li>Boilerplate</li>
  <li>Scaffolding</li>
  <li>Repetitive logic</li>
  <li>First drafts</li>
</ul>

<p>I focused on:</p>
<ul>
  <li>What to build</li>
  <li>How it should work</li>
  <li>What actually matters</li>
</ul>

<hr />

<h2 id="the-lie-engineers-tell-themselves">The Lie Engineers Tell Themselves</h2>

<p>“I want to understand everything deeply.”</p>

<p>After 2 years of working like this: Depth doesn’t come from writing everything yourself. It comes from:</p>
<ul>
  <li>Reviewing</li>
  <li>Questioning</li>
  <li>Refining</li>
  <li>Iterating faster</li>
</ul>

<p>AI doesn’t remove depth. It removes wasted effort disguised as depth.</p>

<hr />

<h2 id="the-real-threat-be-honest">The Real Threat (Be Honest)</h2>

<p>If AI can generate most of your code… Then most of your code was never your advantage.</p>

<p>Your advantage is:</p>
<ul>
  <li>Judgment</li>
  <li>System design</li>
  <li>Problem framing</li>
  <li>Speed of iteration</li>
</ul>

<p>If your identity is tied to typing code manually… This shift will feel uncomfortable.</p>

<hr />

<h2 id="a-simple-example">A Simple Example</h2>

<p>Messy module:</p>
<ul>
  <li>Duplicated logic</li>
  <li>Unclear structure</li>
</ul>

<p><strong>Before</strong>: Hours of refactoring</p>

<p><strong>Now</strong>: “Clean this up. Improve readability. Don’t change behavior.” Done in seconds.</p>

<p>My job?</p>
<ul>
  <li>Validate</li>
  <li>Refine</li>
  <li>Move forward</li>
</ul>

<hr />

<h2 id="this-is-not-a-productivity-hack">This Is Not a Productivity Hack</h2>

<p>This is where people underestimate it.</p>

<p>It’s not: “I save some time”</p>

<p>It’s: “I build at a completely different scale”</p>

<p>You:</p>
<ul>
  <li>Try more ideas</li>
  <li>Ship more projects</li>
  <li>Abandon bad paths faster</li>
  <li>Take bigger risks</li>
</ul>

<p>That’s not speed. That’s leverage.</p>

<hr />

<h2 id="the-gap-is-already-forming">The Gap Is Already Forming</h2>

<p>After 2 years, I can say this confidently: There are now two types of engineers:</p>

<ol>
  <li>Writes code</li>
  <li>Builds with AI</li>
</ol>

<p>Same intelligence. Completely different output.</p>

<hr />

<h2 id="i-dont-want-to-be-dependent">“I Don’t Want to Be Dependent”</h2>

<p>You already are. On:</p>
<ul>
  <li>Frameworks</li>
  <li>Libraries</li>
  <li>Open-source</li>
  <li>Google</li>
</ul>

<p>AI is just the next layer. Refusing it isn’t discipline. It’s denial.</p>

<hr />

<h2 id="the-uncomfortable-ending">The Uncomfortable Ending</h2>

<p>In a year, saying: “I don’t use AI to code” will sound like: “I don’t use the internet when I code.”</p>

<hr />

<h2 id="final-line">Final Line</h2>

<p>You’re not competing with AI. You’re competing with engineers who have been using it for 2 years - while working full-time - and shipping consistently. And they’re not slowing down.</p>

<p>This isn’t about Cursor. You can replace it with any AI tool. The real point is, engineers who learn to leverage AI will outpace those who don’t - regardless of which tool they use.</p>

<hr />

<p>If you’re building data platforms, exploring lakehouse architectures, or just curious about how modern data systems achieve reliability, connect with me on <a href="https://www.linkedin.com/in/guru-e/">LinkedIn</a>.</p>]]></content><author><name>Guru, Eerla</name></author><category term="ai-engineering" /><category term="ai" /><category term="cursor" /><category term="engineering" /><category term="productivity" /><category term="tools" /><summary type="html"><![CDATA[I’m not saying this after a weekend of trying AI tools. I’m saying this after 2 years of using Cursor consistently - while working a demanding full-time job. And I’ll be direct: The way most engineers are still writing code today is already outdated.]]></summary></entry><entry><title type="html">Airflow Works Best When It Does Less</title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/03/23/airflow-works-best-when-it-does-less/" rel="alternate" type="text/html" title="Airflow Works Best When It Does Less" /><published>2026-03-23T04:00:00+00:00</published><updated>2026-03-23T04:00:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/03/23/airflow-works-best-when-it-does-less</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/03/23/airflow-works-best-when-it-does-less/"><![CDATA[<p>The symptoms are consistent:</p>

<ul>
  <li>Workers pinned at high CPU</li>
  <li>Retry storms under load</li>
  <li>DAGs that pass locally but fail in production</li>
  <li>Business logic buried inside orchestration</li>
</ul>

<p>This isn’t a scaling issue. It’s a boundary violation.</p>

<h2 id="airflow-is-a-control-plane">Airflow Is a Control Plane</h2>

<p>Airflow exists to:</p>
<ul>
  <li>schedule work</li>
  <li>enforce dependencies</li>
  <li>manage retries</li>
</ul>

<p>It does not exist to:</p>
<ul>
  <li>process data</li>
  <li>hold state</li>
  <li>execute transformations</li>
</ul>

<p>When orchestration and compute share the same layer, they compete for resources.</p>

<p>That competition is where systems degrade.</p>

<h2 id="dags-should-describe-flow--nothing-else">DAGs Should Describe Flow — Nothing Else</h2>

<p>A DAG answers:</p>

<p><strong>What runs, and in what order?</strong></p>

<p>Not:</p>

<p><strong>How does the data get processed?</strong></p>

<p>Once you embed logic inside DAGs:</p>
<ul>
  <li>orchestration becomes coupled to implementation</li>
  <li>pipelines become untestable</li>
  <li>changes become risky</li>
</ul>

<p>Clean systems separate:</p>
<ul>
  <li><strong>DAG → control flow</strong></li>
  <li><strong>Compute → execution layer</strong></li>
</ul>

<h2 id="the-patterns-that-cause-most-failures">The Patterns That Cause Most Failures</h2>

<ul>
  <li><strong>In-process compute:</strong> Large joins, pandas jobs, heavy transforms inside tasks</li>
  <li><strong>XCom as a data layer:</strong> Passing payloads instead of metadata</li>
  <li><strong>Business logic in DAGs:</strong> No versioning, no reuse, no testability</li>
  <li><strong>Shared resources:</strong> Orchestration and compute competing for CPU/memory</li>
</ul>

<p>These are not edge cases. This is how most Airflow systems fail.</p>

<h2 id="failure-modes-they-compound-fast">Failure Modes (They Compound Fast)</h2>

<ul>
  <li><strong>Scheduler starvation:</strong> Workers doing compute can’t schedule new tasks</li>
  <li><strong>Retry amplification:</strong> Failures increase load → more failures</li>
  <li><strong>State inconsistencies:</strong> No clear ownership of data or transformations</li>
  <li><strong>Debugging collapse:</strong> Logs tied to orchestration, not execution</li>
</ul>

<p>Failures don’t originate in SQL. They emerge at system boundaries.</p>

<h2 id="the-correct-model">The Correct Model</h2>

<p>Airflow should coordinate work, not perform it.</p>

<ol>
  <li><strong>Trigger</strong> Spark / dbt / containerized jobs</li>
  <li><strong>Wait</strong> for completion</li>
  <li><strong>Pass references</strong> (IDs, URIs), not data</li>
</ol>

<p>Airflow becomes thin, predictable, and stable.</p>
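<p>The “pass references, not data” rule looks like this in plain Python (no Airflow dependency; function names and URIs are illustrative):</p>

```python
# Each task returns a small reference, never a payload.
def extract(run_date):
    # ...the job writes to object storage and returns only the URI.
    return f"s3://lake/raw/events/dt={run_date}/"

def transform(input_uri):
    # The compute engine (Spark/dbt) resolves the URI itself;
    # orchestration only hands pointers between steps.
    return input_uri.replace("/raw/", "/clean/")

ref = extract("2026-03-23")
print(transform(ref))  # s3://lake/clean/events/dt=2026-03-23/
```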

<h2 id="what-improves-in-production">What Improves in Production</h2>

<ul>
  <li><strong>Scheduler remains responsive</strong> under load</li>
  <li><strong>Failures are isolated</strong> to compute systems</li>
  <li><strong>Pipelines become testable</strong> outside Airflow</li>
  <li><strong>Recovery becomes deterministic</strong></li>
</ul>

<p>Not because of better tooling. Because responsibilities are separated correctly.</p>

<h2 id="the-system-model">The System Model</h2>

<p>Think in layers:</p>

<ul>
  <li><strong>Control → Airflow</strong></li>
  <li><strong>Compute → Spark / dbt / containers</strong></li>
  <li><strong>Storage → warehouse / lake</strong></li>
</ul>

<p>If these blur, the system becomes fragile.</p>

<h2 id="final-take">Final Take</h2>

<p>Most Airflow issues are self-inflicted.</p>

<p>Not because Airflow is limited, but because it’s forced to do work it was never designed for.</p>

<p>If your DAGs are executing real computation, you don’t have a pipeline problem. You have a system design problem.</p>

<h2 id="one-rule">One Rule</h2>

<p>If a task:</p>
<ul>
  <li>runs long CPU workloads</li>
  <li>or processes large in-memory data</li>
</ul>

<p>It does not belong in Airflow.</p>]]></content><author><name>Think Data</name></author><category term="airflow" /><category term="orchestration" /><category term="data-engineering" /><category term="airflow" /><category term="orchestration" /><category term="data-engineering" /><category term="best-practices" /><summary type="html"><![CDATA[If your Airflow tasks are doing real computation, your system is already mis designed.]]></summary></entry><entry><title type="html">I Dug Into Delta Lake’s Transaction Log - This Is How ACID Actually Works on S3</title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/03/22/delta-lake-transaction-log-acid-on-s3/" rel="alternate" type="text/html" title="I Dug Into Delta Lake’s Transaction Log - This Is How ACID Actually Works on S3" /><published>2026-03-22T00:00:00+00:00</published><updated>2026-03-22T00:00:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/03/22/delta-lake-transaction-log-acid-on-s3</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/03/22/delta-lake-transaction-log-acid-on-s3/"><![CDATA[<p>I used to treat object stores like what they are: cheap, durable, and completely unreliable for transactional work. Great for dumping data. Terrible for updates, deletes, or anything resembling correctness.</p>

<p>A few years ago, if someone told me they were doing MERGE, UPDATE, DELETE on S3, I’d assume one of two things:</p>
<ul>
  <li>They built a fragile abstraction</li>
  <li>Or they didn’t understand failure modes yet</li>
</ul>

<p>Then I started digging into Delta Lake.
What I found wasn’t magic. It was a very deliberate systems design trade-off.</p>

<p><img src="/data-engineering-blog/assets/images/blog/delta-lake-architecture.png" alt="Delta Lake Architecture" class="center-image" />
<em>Delta Lake adds a transaction log layer on top of immutable data files</em></p>

<h2 id="why-object-stores-break-transactional-systems">Why object stores break transactional systems</h2>
<p>Object stores like S3, ADLS, and GCS were never designed for databases.</p>

<p>They give you:</p>
<ul>
  <li>Immutable blobs</li>
  <li>High throughput reads/writes</li>
  <li>Cheap storage at scale</li>
</ul>

<p>But they lack:</p>
<ul>
  <li>Atomic updates</li>
  <li>Strong consistency on listing</li>
  <li>Transactions</li>
  <li>Native metadata layer</li>
</ul>

<p>Which means: You can store data reliably - but you can’t change it reliably.</p>

<hr />

<h2 id="the-core-idea-dont-fix-storage-add-a-layer">The core idea: don’t fix storage - add a layer</h2>

<p>Delta Lake doesn’t try to make S3 transactional.
Instead, it builds a thin transaction layer on top of it:</p>
<ul>
  <li>Data files → immutable (Parquet)</li>
  <li>Changes → tracked separately</li>
  <li>Truth → defined by a log</li>
</ul>

<p>This is the key shift: State is not in the files. It’s in the log.</p>

<p>Think of it like this:</p>
<ul>
  <li>Files = raw facts (never edited)</li>
  <li>Log = source of truth</li>
  <li>Snapshot = interpretation of log + files</li>
</ul>

<hr />

<h2 id="how-delta-lake-actually-gives-you-acid">How Delta Lake actually gives you ACID</h2>

<p>Three core building blocks:</p>

<h3 id="1-immutable-data-files">1) Immutable data files</h3>

<ul>
  <li>Data is written as Parquet</li>
  <li>Never updated in-place</li>
  <li>Updates = new files + old files marked as removed</li>
</ul>

<p>This avoids:</p>
<ul>
  <li>Partial writes</li>
  <li>Corruption</li>
  <li>Complex locking</li>
</ul>

<h3 id="2-the-transaction-log-_delta_log">2) The transaction log (_delta_log)</h3>

<p>Every change creates a new file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_delta_log/
  000000.json
  000001.json
  000002.json
</code></pre></div></div>

<p>Each commit contains:</p>
<ul>
  <li>Files added</li>
  <li>Files removed</li>
  <li>Metadata changes</li>
</ul>

<p>Periodically, Delta writes checkpoint files (Parquet) that capture the replayed state, so readers don’t have to replay every commit from day one.</p>
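<p>A minimal sketch of what a checkpoint buys a reader (toy schema again; real checkpoints are Parquet files, and the interval shown is an illustrative assumption, not Delta’s default):</p>

```python
# Checkpointing sketch: every N commits, persist the replayed state
# so readers start from the checkpoint and replay only the tail.
CHECKPOINT_INTERVAL = 10  # illustrative value

def load_snapshot(checkpoint, checkpoint_version, commits):
    """checkpoint: set of live files as of checkpoint_version;
    commits: list of action-lists indexed by version number."""
    live = set(checkpoint)
    for actions in commits[checkpoint_version + 1:]:
        for action in actions:
            if "add" in action:
                live.add(action["add"])
            if "remove" in action:
                live.discard(action["remove"])
    return live

# Reader replays 2 tail commits instead of 12:
checkpoint = {"part-a.parquet"}
tail = [[{"add": "part-b.parquet"}], [{"remove": "part-a.parquet"}]]
print(sorted(load_snapshot(checkpoint, 9, [[]] * 10 + tail)))
# ['part-b.parquet']
```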

<h3 id="3-optimistic-concurrency-control">3) Optimistic concurrency control</h3>

<p>Instead of locks:</p>
<ul>
  <li>Read latest snapshot</li>
  <li>Prepare changes</li>
  <li>Validate nothing changed</li>
  <li>Commit</li>
</ul>

<p>On conflict: re-read the latest snapshot and retry.</p>
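<p>The commit loop can be sketched with a local filesystem standing in for the store. Atomic file creation (O_CREAT | O_EXCL) plays the role of the put-if-absent primitive; on stores that lack one, such as S3 with multiple writers, Delta relies on an external log store for that guarantee:</p>

```python
import os
import tempfile

log_dir = tempfile.mkdtemp()  # stand-in for _delta_log/

def try_commit(version: int, payload: str) -> bool:
    """Commit succeeds only if this writer creates version N first."""
    path = os.path.join(log_dir, f"{version:06d}.json")
    try:
        # O_EXCL makes creation atomic: exactly one writer wins.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another writer won; caller re-reads and retries
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True

def commit_with_retry(payload: str, latest_version: int) -> int:
    version = latest_version + 1
    while not try_commit(version, payload):
        # Conflict: in real Delta you'd re-read the log and re-validate
        # your changes still apply before trying the next version.
        version += 1
    return version

assert try_commit(0, "{}") is True
assert try_commit(0, "{}") is False   # second writer loses version 0
print(commit_with_retry("{}", 0))     # 1
```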

<hr />

<h2 id="the-commit-protocol-this-is-the-real-trick">The commit protocol (this is the real trick)</h2>

<p>Object stores are unreliable for coordination.
Delta works around this using:</p>
<ul>
  <li>Atomic file creation → commit = new JSON file</li>
  <li>Validation before commit → detect conflicts</li>
  <li>Retries instead of locks → scale horizontally</li>
</ul>

<p>No central coordinator. No database. Just files + discipline.</p>

<hr />

<h2 id="what-acid-features-you-actually-get">What ACID features you actually get</h2>

<ul>
  <li><strong>Atomic commits</strong> → commit file exists or not</li>
  <li><strong>Snapshot isolation</strong> → consistent reads</li>
  <li><strong>Time travel</strong> → query past versions</li>
  <li><strong>MERGE / UPDATE / DELETE</strong> → copy-on-write file rewrites</li>
  <li><strong>CDC (Change Data Feed)</strong> → incremental pipelines</li>
</ul>
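<p>Time travel falls out of the same replay rule: stop at the requested version instead of the head of the log. A toy sketch (same simplified action schema, not Delta’s real JSON layout):</p>

```python
def snapshot_as_of(commits, version):
    """Time travel = replay the log only up to the requested version."""
    live = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"])
            if "remove" in action:
                live.discard(action["remove"])
    return live

commits = [
    [{"add": "part-a.parquet"}],                                # version 0
    [{"remove": "part-a.parquet"}, {"add": "part-b.parquet"}],  # version 1
]
print(sorted(snapshot_as_of(commits, 0)))  # ['part-a.parquet']
print(sorted(snapshot_as_of(commits, 1)))  # ['part-b.parquet']
```

<p>This is also why VACUUM can break time travel: the replay still names the old files, but the files are gone.</p>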

<hr />

<h2 id="minimal-examples">Minimal examples</h2>

<h3 id="write">Write</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">write</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="s">"delta"</span><span class="p">).</span><span class="n">mode</span><span class="p">(</span><span class="s">"append"</span><span class="p">).</span><span class="n">save</span><span class="p">(</span><span class="s">"/mnt/lake/table"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="time-travel">Time travel</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="s">"delta"</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">"versionAsOf"</span><span class="p">,</span> <span class="mi">42</span><span class="p">).</span><span class="n">load</span><span class="p">(</span><span class="s">"/mnt/lake/table"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="merge-upsert">MERGE (UPSERT)</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">delta.tables</span> <span class="kn">import</span> <span class="n">DeltaTable</span>

<span class="n">tgt</span> <span class="o">=</span> <span class="n">DeltaTable</span><span class="p">.</span><span class="n">forPath</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="s">"/mnt/lake/table"</span><span class="p">)</span>

<span class="p">(</span><span class="n">tgt</span><span class="p">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"t"</span><span class="p">)</span>
 <span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">source</span><span class="p">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"s"</span><span class="p">),</span> <span class="s">"t.id = s.id"</span><span class="p">)</span>
 <span class="p">.</span><span class="n">whenMatchedUpdateAll</span><span class="p">()</span>
 <span class="p">.</span><span class="n">whenNotMatchedInsertAll</span><span class="p">()</span>
 <span class="p">.</span><span class="n">execute</span><span class="p">())</span>
</code></pre></div></div>
<h2 id="where-things-start-breaking-at-scale">Where things start breaking at scale</h2>

<p>This is where most teams struggle:</p>

<h3 id="1-small-file-problem">1) Small file problem</h3>

<p>Too many small files → slow queries</p>

<p><strong>Fix:</strong></p>
<ul>
  <li>Compaction (OPTIMIZE)</li>
  <li>Target 100MB–1GB file sizes</li>
</ul>
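<p>In practice you run Delta’s OPTIMIZE and let it do the work. But the planning idea is simple enough to sketch: greedily bin-pack small files into rewrite groups that approach a target output size (the target below is an illustrative choice):</p>

```python
# Sketch of compaction planning: group small files into rewrite
# batches near a target output size. OPTIMIZE does this for real.
TARGET_BYTES = 128 * 1024 * 1024  # ~128 MB per output file (assumption)

def plan_compaction(file_sizes: dict[str, int]) -> list[list[str]]:
    groups, current, current_size = [], [], 0
    # Smallest files first: they benefit most from compaction.
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if current and current_size + size > TARGET_BYTES:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

small = {f"part-{i:04d}.parquet": 4 * 1024 * 1024 for i in range(64)}
print(len(plan_compaction(small)))  # 64 x 4MB files -> 2 rewrite groups
```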

<h3 id="2-_delta_log-growth">2) _delta_log growth</h3>

<p>Heavy writes → massive log</p>

<p><strong>Fix:</strong></p>
<ul>
  <li>Frequent checkpoints</li>
  <li>Monitor log size</li>
</ul>

<h3 id="3-high-write-concurrency">3) High write concurrency</h3>

<p>Too many writers → retries explode</p>

<p><strong>Fix:</strong></p>
<ul>
  <li>Partition-aware writes</li>
  <li>Queue + controlled writers</li>
  <li>Append + compact later</li>
</ul>

<h3 id="4-vacuum-risks">4) VACUUM risks</h3>

<p>VACUUM permanently deletes data files that the log no longer references.</p>

<p>If misused:</p>
<ul>
  <li>→ breaks time travel</li>
  <li>→ breaks downstream pipelines</li>
</ul>
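<p>The safety rule is worth internalizing: a file is only safe to delete if no live snapshot references it <em>and</em> it is older than the retention window (Delta’s default is 7 days). A pure-Python sketch of that check, with illustrative function and variable names:</p>

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600  # Delta's default retention window

def vacuum_candidates(all_files, referenced, removed_at, now=None):
    """removed_at: path -> epoch seconds when the log removed the file.
    Only unreferenced files past the retention window are deletable."""
    now = now or time.time()
    return {
        f for f in all_files
        if f not in referenced
        and now - removed_at.get(f, now) > RETENTION_SECONDS
    }

now = time.time()
files = {"part-a.parquet", "part-b.parquet", "part-c.parquet"}
referenced = {"part-b.parquet"}                        # live in snapshot
removed_at = {"part-a.parquet": now - 8 * 24 * 3600,   # removed 8 days ago
              "part-c.parquet": now - 3600}            # removed 1 hour ago
print(vacuum_candidates(files, referenced, removed_at, now))
# {'part-a.parquet'} - recent removals survive so time travel still works
```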

<hr />

<h2 id="trade-offs">Trade-offs</h2>

<h3 id="pros">Pros</h3>
<ul>
  <li>Works on any object store</li>
  <li>No central DB required</li>
  <li>Enables Lakehouse architecture</li>
  <li>Scales extremely well</li>
</ul>

<h3 id="cons">Cons</h3>
<ul>
  <li>Metadata overhead</li>
  <li>Operational complexity</li>
  <li>Retry-heavy under contention</li>
  <li>Requires discipline (not plug-and-play)</li>
</ul>

<hr />

<h2 id="what-i-actually-do-in-production">What I actually do in production</h2>

<ul>
  <li>Treat Delta as transaction layer, not storage</li>
  <li>Enforce file sizing + compaction</li>
  <li>Monitor _delta_log like a system metric</li>
  <li>Avoid high-concurrency small writes</li>
  <li>Be strict with schema evolution</li>
</ul>

<hr />

<h2 id="the-bigger-picture">The bigger picture</h2>

<p>Databricks didn’t make S3 transactional.
They accepted its limitations and built:</p>
<ul>
  <li>a log-based abstraction</li>
  <li>with immutable data</li>
  <li>and optimistic commits</li>
</ul>

<p>That’s it.</p>

<hr />

<h2 id="tl-dr">TL;DR</h2>

<p>ACID isn’t coming from S3.
It’s coming from _delta_log.</p>

<p>Files don’t define truth - the log does.</p>

<p>And once you understand that: You stop treating Delta like magic and start treating it like a system.</p>

<hr />

<p>If you’re building data platforms, exploring lakehouse architectures, or just curious about how modern data systems achieve reliability, connect with me on <a href="https://www.linkedin.com/in/guru-e/">LinkedIn</a>.</p>]]></content><author><name>Guru, Eerla</name></author><category term="delta-lake" /><category term="acid" /><category term="s3" /><category term="data-lake" /><category term="transaction-log" /><summary type="html"><![CDATA[I used to treat object stores like what they are: cheap, durable, and completely unreliable for transactional work. Great for dumping data. Terrible for updates, deletes, or anything resembling correctness.]]></summary></entry><entry><title type="html">The Real Cost of Data Observability</title><link href="https://eerla.github.io/data-engineering-blog/blog/2024/02/15/the-real-cost-of-data-observability/" rel="alternate" type="text/html" title="The Real Cost of Data Observability" /><published>2024-02-15T00:00:00+00:00</published><updated>2024-02-15T00:00:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2024/02/15/the-real-cost-of-data-observability</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2024/02/15/the-real-cost-of-data-observability/"><![CDATA[<!-- PASTE YOUR MEDIUM CONTENT HERE -->
<p>This is where your article from Medium will go. Just copy and paste the full content from https://medium.com/@think-data</p>

<p>The article should cover:</p>
<ul>
  <li>The data observability gold rush</li>
  <li>Hidden costs of observability tools</li>
  <li>What you actually need vs what vendors sell</li>
  <li>DIY approaches to data quality</li>
  <li>Real-world examples of cost-effective observability</li>
</ul>

<p>Copy your complete Medium article here, preserving all formatting, code examples, and insights.</p>]]></content><author><name>Guru, Eerla</name></author><category term="data-engineering" /><category term="observability" /><category term="data-quality" /><category term="monitoring" /><category term="cost" /><summary type="html"><![CDATA[This is where your article from Medium will go. Just copy and paste the full content from https://medium.com/@think-data]]></summary></entry><entry><title type="html">dbt Changed Data Engineering Forever</title><link href="https://eerla.github.io/data-engineering-blog/blog/2024/02/10/dbt-changed-data-engineering-forever/" rel="alternate" type="text/html" title="dbt Changed Data Engineering Forever" /><published>2024-02-10T00:00:00+00:00</published><updated>2024-02-10T00:00:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2024/02/10/dbt-changed-data-engineering-forever</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2024/02/10/dbt-changed-data-engineering-forever/"><![CDATA[<!-- PASTE YOUR MEDIUM CONTENT HERE -->
<p>This is where your article from Medium will go. Just copy and paste the full content from https://medium.com/@think-data</p>

<p>The article should cover:</p>
<ul>
  <li>The data transformation landscape before dbt</li>
  <li>How dbt revolutionized SQL-based transformations</li>
  <li>Key features that make dbt powerful</li>
  <li>Real-world examples of dbt implementations</li>
  <li>The future of data transformation with dbt</li>
</ul>

<p>Copy your complete Medium article here, preserving all formatting, code examples, and insights.</p>]]></content><author><name>Guru, Eerla</name></author><category term="data-engineering" /><category term="dbt" /><category term="transformation" /><category term="sql" /><category term="data-warehouse" /><summary type="html"><![CDATA[This is where your article from Medium will go. Just copy and paste the full content from https://medium.com/@think-data]]></summary></entry><entry><title type="html">You Don’t Need Kafka for Everything</title><link href="https://eerla.github.io/data-engineering-blog/blog/2024/02/05/you-dont-need-kafka-for-everything/" rel="alternate" type="text/html" title="You Don’t Need Kafka for Everything" /><published>2024-02-05T00:00:00+00:00</published><updated>2024-02-05T00:00:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2024/02/05/you-dont-need-kafka-for-everything</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2024/02/05/you-dont-need-kafka-for-everything/"><![CDATA[<!-- PASTE YOUR MEDIUM CONTENT HERE -->
<p>This is where your article from Medium will go. Just copy and paste the full content from https://medium.com/@think-data</p>

<p>The article should cover:</p>
<ul>
  <li>Why Kafka became the default choice</li>
  <li>The complexity and overhead of Kafka</li>
  <li>Simpler alternatives for common use cases</li>
  <li>When Kafka actually makes sense</li>
  <li>Real-world examples of over-engineered messaging systems</li>
</ul>

<p>Copy your complete Medium article here, preserving all formatting, code examples, and insights.</p>]]></content><author><name>Guru, Eerla</name></author><category term="data-engineering" /><category term="kafka" /><category term="messaging" /><category term="architecture" /><category term="system-design" /><summary type="html"><![CDATA[This is where your article from Medium will go. Just copy and paste the full content from https://medium.com/@think-data]]></summary></entry></feed>