I used to treat object stores like what they are: cheap, durable, and completely unreliable for transactional work. Great for dumping data. Terrible for updates, deletes, or anything resembling correctness.

A few years ago, if someone told me they were doing MERGE, UPDATE, DELETE on S3, I’d assume one of two things:

  • They built a fragile abstraction
  • Or they didn’t understand failure modes yet

Then I started digging into Delta Lake. What I found wasn’t magic. It was a very deliberate systems design trade-off.

Delta Lake Architecture Delta Lake adds a transaction log layer on top of immutable data files

Why object stores break transactional systems

Object stores like S3, ADLS, and GCS were never designed for databases.

They give you:

  • Immutable blobs
  • High throughput reads/writes
  • Cheap storage at scale

But they lack:

  • Atomic updates
  • Strong consistency on listing
  • Transactions
  • Native metadata layer

Which means: You can store data reliably - but you can’t change it reliably.


The core idea: don’t fix storage - add a layer

Delta Lake doesn’t try to make S3 transactional. Instead, it builds a thin transaction layer on top of it:

  • Data files → immutable (Parquet)
  • Changes → tracked separately
  • Truth → defined by a log

This is the key shift: State is not in the files. It’s in the log.

Think of it like this:

  • Files = raw facts (never edited)
  • Log = source of truth
  • Snapshot = interpretation of log + files

How Delta Lake actually gives you ACID

Three core building blocks:

1) Immutable data files

Data is written as Parquet Never updated in-place Updates = new files + mark old ones as removed

This avoids:

  • Partial writes
  • Corruption
  • Complex locking

2) The transaction log (_delta_log)

Every change creates a new file:

_delta_log/
  000000.json
  000001.json
  000002.json

Each commit contains:

  • Files added
  • Files removed
  • Metadata changes

Periodically: Delta writes checkpoints (Parquet) So, you don’t replay everything from day one

3) Optimistic concurrency control

Instead of locks:

  • Read latest snapshot
  • Prepare changes
  • Validate nothing changed
  • Commit

If conflict: retry with new snapshot


The commit protocol (this is the real trick)

Object stores are unreliable for coordination. Delta works around it using:

  • Atomic file creation → commit = new JSON file
  • Validation before commit → detect conflicts
  • Retries instead of locks → scale horizontally

No central coordinator. No database. Just files + discipline.


What ACID features you actually get

  • Atomic commits → commit file exists or not
  • Snapshot isolation → consistent reads
  • Time travel → query past versions
  • MERGE / UPDATE / DELETE → simulated via file rewrites
  • CDC (Change Data Feed) → incremental pipelines

Minimal examples

Write

df.write.format("delta").mode("append").save("/mnt/lake/table")

Time travel

spark.read.format("delta").option("versionAsOf", 42).load("/mnt/lake/table")

MERGE (UPSERT)

from delta.tables import DeltaTable

tgt = DeltaTable.forPath(spark, "/mnt/lake/table")

(tgt.alias("t")
 .merge(source.alias("s"), "t.id = s.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

Where things start breaking at scale

This is where most teams struggle:

1) Small file problem

Too many small files → slow queries

Fix:

  • Compaction (OPTIMIZE)
  • Target 100MB–1GB file sizes

2) _delta_log growth

Heavy writes → massive log

Fix:

  • Frequent checkpoints
  • Monitor log size

3) High write concurrency

Too many writers → retries explode

Fix:

  • Partition-aware writes
  • Queue + controlled writers
  • Append + compact later

4) VACUUM risks

Deletes old files permanently

If misused:

  • → breaks time travel
  • → breaks downstream pipelines

Trade-offs

Pros

  • Works on any object store
  • No central DB required
  • Enables Lakehouse architecture
  • Scales extremely well

Cons

  • Metadata overhead
  • Operational complexity
  • Retry-heavy under contention
  • Requires discipline (not plug-and-play)

What I actually do in production

  • Treat Delta as transaction layer, not storage
  • Enforce file sizing + compaction
  • Monitor _delta_log like a system metric
  • Avoid high-concurrency small writes
  • Be strict with schema evolution

The bigger picture

Databricks didn’t make S3 transactional. They accepted its limitations and built:

  • a log-based abstraction
  • with immutable data
  • and optimistic commits

That’s it.


TL; DR

ACID isn’t coming from S3. It’s coming from _delta_log.

Files don’t define truth - the log does.

And once you understand that: You stop treating Delta like magic and start treating it like a system.


If you’re building data platforms, exploring lakehouse architectures, or just curious about how modern data systems achieve reliability, connect with me on LinkedIn.