<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://eerla.github.io/data-engineering-blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://eerla.github.io/data-engineering-blog/" rel="alternate" type="text/html" /><updated>2026-04-22T10:38:42+00:00</updated><id>https://eerla.github.io/data-engineering-blog/feed.xml</id><title type="html">Guru, Eerla | Data Engineering</title><subtitle>Lead Engineer building real-world data systems. Strong opinions on data engineering, system design, and practical solutions.</subtitle><author><name>Guru, Eerla</name></author><entry><title type="html"></title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/04/22/2026-03-22-why-data-engineer-is-quietly-becoming-backend-engineer/" rel="alternate" type="text/html" title="" /><published>2026-04-22T10:38:42+00:00</published><updated>2026-04-22T10:38:42+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/04/22/2026-03-22-why-data-engineer-is-quietly-becoming-backend-engineer</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/04/22/2026-03-22-why-data-engineer-is-quietly-becoming-backend-engineer/"><![CDATA[<p>I used to think the job was “make nightly ETLs run.”</p>

<p>Now it’s: ship APIs, run containers, own latency, and get alerted when a feature lookup crosses 200ms. That’s not scope creep. That’s the job finally matching the system.</p>

<h2 id="the-shift-isnt-semantic--its-architectural">The Shift Isn’t Semantic — It’s Architectural</h2>

<p>What we used to call “data engineering” was batch orchestration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Airflow DAG → SQL → table
Consumers read tables directly
</code></pre></div></div>

<p>No contracts, no ownership boundaries, no SLAs.</p>

<p>That model breaks the moment data becomes part of a product.</p>

<p>Today’s requirement surface looks like backend systems:</p>

<ul>
  <li>Low-latency access (not “tomorrow morning”)</li>
  <li>Multi-tenant isolation</li>
  <li>Explicit contracts (schemas, APIs)</li>
  <li>Versioning and backward compatibility</li>
  <li>Observability + on-call ownership</li>
</ul>

<p>Bluntly: if your data is consumed in real time, you are running a service.<br />
If you’re running a service, you’re doing backend engineering.</p>

<h2 id="core-primitive-1-pipelines--services">Core Primitive #1: Pipelines → Services</h2>

<p>The mental model changed.</p>

<p><strong>Old:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Airflow runs a job → Writes to a table → Consumers figure it out
</code></pre></div></div>

<p><strong>New:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Continuous processing (streaming or micro-batch)
Expose via API and/or event stream
Explicit ownership + SLA
</code></pre></div></div>

<p>What actually matters:</p>

<ul>
  <li>Data is no longer “stored and discovered” — it’s served</li>
  <li>Consumers shouldn’t reverse-engineer tables</li>
  <li>Contracts replace tribal knowledge</li>
</ul>

<p>Trade-off: you gain discoverability and low-latency consumption but now own API lifecycle and backward compatibility.</p>
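<p>The “served, not discovered” idea can be sketched in a few lines. This is a minimal illustration, not a real framework: <code class="language-plaintext highlighter-rouge">FeatureService</code>, its schema set, and the dict-backed table are all hypothetical names.</p>

```python
class FeatureService:
    """Owns the data, the schema, and the version.

    Consumers call get_features() against a published contract;
    they never read the underlying table directly.
    """

    SCHEMA_V1 = {"entity_id", "clicks_7d", "last_seen"}

    def __init__(self, table):
        self._table = table  # internal storage, free to change shape

    def get_features(self, entity_id, version="v1"):
        if version != "v1":
            raise ValueError(f"unsupported contract version: {version}")
        row = self._table.get(entity_id)
        if row is None:
            return None
        # Project onto the published contract: internal columns stay hidden.
        return {k: row[k] for k in self.SCHEMA_V1 if k in row}


svc = FeatureService({"u1": {"entity_id": "u1", "clicks_7d": 4,
                             "last_seen": "2026-04-21", "_raw_debug": "x"}})
print(svc.get_features("u1"))
```

<p>The projection step is the point: internal columns can change freely, because only the published contract is load-bearing.</p>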

<h2 id="core-primitive-2-event-driven-state-not-static-tables">Core Primitive #2: Event-Driven State, Not Static Tables</h2>

<p>Batch assumes the world is static. It isn’t.</p>

<p>Real systems operate on event streams with evolving state:</p>

<ul>
  <li>Late data arrives</li>
  <li>Events reorder</li>
  <li>State must be recomputed or corrected</li>
</ul>

<p>That introduces backend problems:</p>

<ul>
  <li>Partitioning</li>
  <li>Backpressure</li>
  <li>Idempotency</li>
  <li>State consistency</li>
</ul>

<p><strong>Example pattern that actually survives production:</strong></p>

<ul>
  <li>Idempotent upserts (not “exactly once” illusions)</li>
  <li>Versioned writes</li>
  <li>Externalized state (DB or state store)</li>
</ul>

<p>“Exactly once” is marketing. Idempotency + replayability is what works.</p>
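<p>A minimal sketch of that pattern (idempotent upserts plus versioned writes), assuming a dict stands in for the external state store; field names are illustrative:</p>

```python
state = {}  # entity_id -> {"value": ..., "version": ...}


def apply_event(event):
    """Safe to call any number of times with the same event (idempotent)
    and safe against out-of-order replays (versioned writes)."""
    current = state.get(event["entity_id"])
    # Duplicates and stale replays lose: only strictly newer versions win.
    if current is not None and event["version"] <= current["version"]:
        return False
    state[event["entity_id"]] = {"value": event["value"],
                                 "version": event["version"]}
    return True


apply_event({"entity_id": "u1", "value": 10, "version": 1})
apply_event({"entity_id": "u1", "value": 10, "version": 1})  # duplicate: no-op
apply_event({"entity_id": "u1", "value": 7, "version": 3})
apply_event({"entity_id": "u1", "value": 9, "version": 2})   # late arrival: ignored
print(state["u1"])  # {'value': 7, 'version': 3}
```

<p>Replaying the whole stream through this function converges to the same state, which is what “exactly once” was pretending to give you.</p>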

<p><strong>Where things break:</strong></p>

<ul>
  <li>Duplicate events → corrupt aggregates</li>
  <li>Late arrivals → wrong features</li>
  <li>Hot partitions → latency spikes</li>
</ul>

<p>If you haven’t debugged one of these at 2 AM, you’re still in the old model.</p>

<h2 id="core-primitive-3-infrastructure-is-no-longer-optional">Core Primitive #3: Infrastructure Is No Longer Optional</h2>

<p>Once you deploy on Kubernetes, managed streaming (Kafka, Pub/Sub), and object stores, you inherit backend responsibilities whether you like it or not.</p>

<p>You now deal with:</p>

<ul>
  <li>Resource limits (CPU/memory pressure)</li>
  <li>Autoscaling behavior</li>
  <li>Deployment rollouts</li>
  <li>Failure domains</li>
</ul>

<p><strong>What actually matters:</strong></p>

<ul>
  <li>Your system fails at the infra boundary, not in SQL</li>
  <li>Capacity planning is part of your job</li>
  <li>“It runs locally” is meaningless</li>
</ul>

<p><strong>Where things break:</strong></p>

<ul>
  <li>Memory pressure → consumer restarts → reprocessing storms</li>
  <li>Bad rollout → partial schema mismatch → cascading failures</li>
  <li>Under-provisioned consumers → lag → SLA violations</li>
</ul>

<h2 id="core-primitive-4-data-contracts-are-apis">Core Primitive #4: Data Contracts Are APIs</h2>

<p>Reading raw tables is not a contract. It’s a liability.</p>

<p>Modern systems expose:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">/features/v1/{entity_id}</code></li>
  <li>Event streams with schema guarantees</li>
  <li>Versioned payloads</li>
</ul>

<p>That forces backend discipline:</p>

<ul>
  <li>Schema evolution strategy</li>
  <li>Version negotiation</li>
  <li>Deprecation policies</li>
</ul>

<p><strong>What actually matters:</strong></p>

<ul>
  <li>A schema change = breaking API change</li>
  <li>Backward compatibility is not optional</li>
  <li>Consumers should not need coordination for every change</li>
</ul>

<p>“We use dbt, so we’re structured” is a common misconception. dbt structures transformations. It does not solve consumer contracts at runtime.</p>
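<p>A backward-compatibility check is small enough to sketch. This toy version encodes one rule: adding optional fields is fine, removing or retyping existing fields is breaking. The schema format is an assumption for illustration:</p>

```python
def is_backward_compatible(old, new):
    """old/new: dicts of field name -> type name."""
    for field, ftype in old.items():
        if field not in new:
            return False, f"removed field: {field}"
        if new[field] != ftype:
            return False, f"retyped field: {field}"
    return True, "ok"


v1 = {"entity_id": "string", "score": "double"}
v2 = {"entity_id": "string", "score": "double", "segment": "string"}  # additive
v3 = {"entity_id": "string", "score": "string"}  # retyped: breaking

print(is_backward_compatible(v1, v2))  # (True, 'ok')
print(is_backward_compatible(v1, v3))  # (False, 'retyped field: score')
```

<p>Run a check like this in CI and a breaking schema change fails the build instead of a consumer’s dashboard.</p>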

<h2 id="production-reality-you-own-reliability">Production Reality: You Own Reliability</h2>

<p>Once data feeds product features, it inherits product expectations.</p>

<p>That means:</p>

<ul>
  <li>SLOs (not “best effort”), e.g. 99.9% of events processed in &lt; 500ms</li>
  <li>Alerting</li>
  <li>Runbooks</li>
  <li>Postmortems</li>
</ul>

<p><strong>What actually matters:</strong></p>

<ul>
  <li>Latency is a user-facing metric now</li>
  <li>Freshness is correctness</li>
  <li>Silent failures are worse than crashes</li>
</ul>

<p><strong>Where things break:</strong></p>

<ul>
  <li>Lag accumulates silently → stale features → bad decisions</li>
  <li>Partial pipeline failures → inconsistent state</li>
  <li>No observability → debugging becomes guesswork</li>
</ul>
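<p>Checking an SLO like “99.9% of events processed in under 500ms” against a sample of per-event latencies is small enough to automate. A hedged sketch, with illustrative thresholds:</p>

```python
def slo_met(latencies_ms, threshold_ms=500.0, target=0.999):
    """True if at least `target` fraction of events beat the threshold."""
    if not latencies_ms:
        return True  # vacuously met; a real system should alert on no data
    within = sum(1 for l in latencies_ms if l < threshold_ms)
    return within / len(latencies_ms) >= target


# 1000 fast events plus 2 slow ones -> 1000/1002 ~ 0.998 < 0.999: violated
sample = [120.0] * 1000 + [900.0, 1500.0]
print(slo_met(sample))  # False
```

<p>Note how few slow events it takes to blow a three-nines target; that is why lag that “accumulates silently” is listed above as a failure mode.</p>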

<h2 id="cost-and-security-the-hidden-backend-layer">Cost and Security: The Hidden Backend Layer</h2>

<p>At scale, data systems behave like distributed backends with worse costs.</p>

<p>You deal with:</p>

<ul>
  <li>Storage explosion</li>
  <li>Cross-region egress</li>
  <li>Sensitive data access</li>
</ul>

<p>So, you end up implementing:</p>

<ul>
  <li>RBAC</li>
  <li>Quotas</li>
  <li>Network isolation</li>
  <li>Encryption policies</li>
</ul>

<p><strong>What actually matters:</strong></p>

<ul>
  <li>Data systems leak money faster than backend systems</li>
  <li>Security failures here are more damaging</li>
  <li>Multi-tenant isolation is non-trivial</li>
</ul>

<h2 id="tooling-lie-abstractions-remove-easy-problems">Tooling Lie: Abstractions Remove Easy Problems</h2>

<p>Modern tools (dbt, managed pipelines, feature stores) are useful.</p>

<p>They remove:</p>
<ul>
  <li>Boilerplate</li>
  <li>Simple transformations</li>
</ul>

<p>They do not remove:</p>
<ul>
  <li>Stateful processing</li>
  <li>Event-time correctness</li>
  <li>System reliability</li>
</ul>

<p>Reality: When things get hard, you drop down to:</p>
<ul>
  <li>Custom services</li>
  <li>Streaming processors</li>
  <li>Backend patterns</li>
</ul>

<p>That’s the convergence point.</p>

<h2 id="what-good-teams-do-differently">What Good Teams Do Differently</h2>

<p>They stop pretending pipelines are scripts. They treat them like services.</p>

<p><strong>Non-negotiables:</strong></p>

<ul>
  <li>CI/CD for data + APIs</li>
  <li>Contract testing (schema compatibility)</li>
  <li>Observability (metrics, traces, logs)</li>
  <li>SLO-driven prioritization</li>
  <li>Versioned interfaces</li>
</ul>

<p><strong>Mental model shift:</strong></p>

<ul>
  <li>Tables are storage</li>
  <li>APIs are products</li>
</ul>

<h3 id="before-vs-after-operationally">Before vs After (Operationally)</h3>

<p><strong>Before:</strong></p>
<ul>
  <li>Nightly batch refresh</li>
  <li>Consumers query raw tables</li>
  <li>Logic duplicated everywhere</li>
  <li>No ownership, no SLA</li>
</ul>

<p><strong>After:</strong></p>
<ul>
  <li>Streaming or near-real-time pipeline</li>
  <li>Exposed via API or event stream</li>
  <li>Centralized logic</li>
  <li>Owned, monitored, versioned</li>
</ul>

<p><strong>Outcome:</strong></p>
<ul>
  <li>Less duplication</li>
  <li>Faster iteration</li>
  <li>More upfront cost</li>
  <li>Far less long-term chaos</li>
</ul>

<h2 id="engineering-checklist-if-you-care-about-scale">Engineering Checklist (If You Care About Scale)</h2>

<ul>
  <li>Treat every pipeline as a service</li>
  <li>Version everything (schemas, APIs, outputs)</li>
  <li>Design for replay and idempotency</li>
  <li>Instrument before optimizing</li>
  <li>Define SLOs early (or you’ll invent them under pressure)</li>
  <li>Prefer boring, reliable systems over clever ones</li>
</ul>

<h2 id="hiring-reality">Hiring Reality</h2>

<p>The bar shifted.</p>

<p><strong>What matters now:</strong></p>
<ul>
  <li>System design</li>
  <li>Production ownership</li>
  <li>Debugging distributed systems</li>
  <li>Strong programming fundamentals</li>
</ul>

<p><strong>What matters less:</strong></p>
<ul>
  <li>Isolated Spark/SQL expertise without system context</li>
</ul>

<p>Titles are catching up:</p>
<ul>
  <li>Data Platform Engineer</li>
  <li>Data Infra Engineer</li>
  <li>Backend Engineer (Data)</li>
</ul>

<h2 id="the-takeaway">The Takeaway</h2>

<p>Data engineering didn’t expand — it matured.</p>

<p>The industry stopped tolerating:</p>
<ul>
  <li>brittle pipelines</li>
  <li>undefined ownership</li>
  <li>silent failures</li>
</ul>

<p>and replaced them with:</p>
<ul>
  <li>services</li>
  <li>contracts</li>
  <li>reliability expectations</li>
</ul>

<p>If your system delivers data to something that makes decisions in real time, you are not “moving data.” You are operating a backend system. Start treating it that way — or you’ll keep debugging it like it’s 2015.</p>]]></content><author><name>Guru, Eerla</name></author></entry><entry><title type="html">If You’re Not Letting AI Write Code, You’re Already Behind - But Don’t Hand It the Keys</title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/04/22/if-youre-not-letting-ai-write-code-youre-already-behind/" rel="alternate" type="text/html" title="If You’re Not Letting AI Write Code, You’re Already Behind - But Don’t Hand It the Keys" /><published>2026-04-22T10:31:00+00:00</published><updated>2026-04-22T10:31:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/04/22/if-youre-not-letting-ai-write-code-youre-already-behind</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/04/22/if-youre-not-letting-ai-write-code-youre-already-behind/"><![CDATA[<p>Recently, I used an AI assistant to bootstrap a local environment - resolving dependencies, fixing configuration issues, and getting everything running in minutes.</p>

<p>Later, when a teammate asked for help reproducing the setup, I realized something uncomfortable: I didn’t have a clear, deterministic set of steps to give them. Only a prompt.</p>

<p>That moment highlights a deeper shift:</p>

<p>AI is accelerating development - but it’s also changing how knowledge is created, shared, and reproduced.</p>

<h2 id="the-quiet-shift-happening-in-engineering">The Quiet Shift Happening in Engineering</h2>

<p>Over the last year, something changed.</p>

<p>AI didn’t just become “useful.” It became embedded.</p>

<ul>
  <li>Inside IDEs</li>
  <li>Inside pull requests</li>
  <li>Inside CI pipelines</li>
</ul>

<p>That’s not hype. That’s where the baseline is moving.</p>

<h2 id="what-happens-if-you-ignore-it">What Happens If You Ignore It</h2>

<p>If your team treats AI as optional:</p>

<ul>
  <li>You spend hours writing boilerplate</li>
  <li>You manually generate tests</li>
  <li>You refactor code that a model could do in seconds</li>
</ul>

<p>Meanwhile, other teams are shipping faster - not because they’re smarter, but because they’ve automated the boring parts.</p>

<h2 id="what-happens-if-you-use-it-blindly">What Happens If You Use It Blindly</h2>

<p>This is where most teams fail. They adopt AI… without discipline.</p>

<p>And then they hit:</p>

<ul>
  <li>hallucinated APIs</li>
  <li>insecure code patterns</li>
  <li>hidden dependencies</li>
  <li>unpredictable costs</li>
</ul>

<p>AI doesn’t fail loudly. It fails convincingly.</p>

<h2 id="what-actually-changed">What Actually Changed</h2>

<p>AI didn’t replace engineers. It shifted the role.</p>

<p><strong>From:</strong></p>
<ul>
  <li>writing every line manually</li>
</ul>

<p><strong>To:</strong></p>
<ul>
  <li>defining intent</li>
  <li>reviewing outputs</li>
  <li>enforcing correctness</li>
</ul>

<p>Think of it like this: AI writes the first draft. Engineers decide what survives.</p>

<h2 id="where-ai-actually-delivers-value">Where AI Actually Delivers Value</h2>

<p>Not everywhere - but in very specific places.</p>

<h3 id="high-leverage-use-cases">High Leverage Use Cases</h3>

<ul>
  <li>Scaffolding endpoints and services</li>
  <li>Generating unit tests</li>
  <li>Writing documentation</li>
  <li>Small refactors and migrations</li>
  <li>Drafting pull requests</li>
</ul>

<p>These are:</p>
<ul>
  <li>repetitive</li>
  <li>time-consuming</li>
  <li>low cognitive value</li>
</ul>

<p>Perfect for automation.</p>

<h2 id="a-realistic-before-vs-after">A Realistic Before vs After</h2>

<p><strong>Before:</strong></p>
<ul>
  <li>Endpoint from spec: 3-5 hours</li>
  <li>Tests, boilerplate, docs: manual</li>
</ul>

<p><strong>After (AI-assisted):</strong></p>
<ul>
  <li>Draft in ~30-60 minutes</li>
  <li>Human review and validation still required</li>
  <li>~2x-3x speedup for routine work</li>
</ul>

<p>Across a sprint, that’s not small. That’s weeks of engineering time reclaimed per quarter.</p>

<h2 id="why-most-teams-still-dont-trust-it">Why Most Teams Still Don’t Trust It</h2>

<p>Because they shouldn’t - yet. Common failure modes:</p>

<h3 id="1-hallucinations">1. Hallucinations</h3>
<p>Code references:</p>
<ul>
  <li>non-existent APIs</li>
  <li>wrong schemas</li>
  <li>imaginary helpers</li>
</ul>

<h3 id="2-insecure-patterns">2. Insecure Patterns</h3>
<p>You’ll see:</p>
<ul>
  <li>hardcoded secrets</li>
  <li>outdated libraries</li>
  <li>unsafe defaults</li>
</ul>

<h3 id="3-hidden-dependencies">3. Hidden Dependencies</h3>
<p>Generated code quietly pulls in things your system doesn’t track. Now your SBOM is wrong.</p>

<h3 id="4-cost-surprises">4. Cost Surprises</h3>
<p>Everyone assumes: “AI = GPU cost”</p>

<p>Reality:</p>
<ul>
  <li>network egress</li>
  <li>NAT gateways</li>
  <li>load balancers</li>
  <li>storage</li>
</ul>

<p>Often cost more than inference itself.</p>

<h2 id="the-only-way-this-works-treat-ai-like-a-junior-engineer">The Only Way This Works: Treat AI Like a Junior Engineer</h2>

<p>Not a tool. Not an oracle. A junior teammate.</p>

<p>It can:</p>
<ul>
  <li>draft quickly</li>
  <li>make mistakes</li>
  <li>require supervision</li>
</ul>

<p>So, your system needs to enforce that.</p>

<h2 id="the-production-pattern-that-works">The Production Pattern That Works</h2>

<p>Here’s the model that actually scales:</p>

<ol>
  <li><strong>AI generates code</strong></li>
  <li><strong>Attach provenance metadata</strong> (model, prompt, timestamp)</li>
  <li><strong>Run:</strong>
    <ul>
      <li>linting</li>
      <li>security scans</li>
      <li>dependency checks</li>
    </ul>
  </li>
  <li><strong>Generate + run tests</strong></li>
  <li><strong>Run integration/contract checks</strong></li>
  <li><strong>Block merge if anything fails</strong></li>
  <li><strong>Require human review</strong></li>
</ol>
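<p>The steps above can be sketched as a single gate function. Check names and provenance fields here are assumptions, not a real CI API:</p>

```python
import datetime


def make_provenance(model, prompt):
    """Attach who/what/when metadata to a generated change."""
    return {"model": model, "prompt": prompt,
            "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat()}


def gate(change, checks):
    """Run every check; a single failure blocks the merge."""
    failures = [name for name, check in checks.items() if not check(change)]
    return {"merge_allowed": not failures, "failures": failures}


change = {"diff": "...", "provenance": make_provenance("some-model", "add endpoint")}
checks = {
    "lint": lambda c: True,
    "security_scan": lambda c: True,
    "tests": lambda c: False,  # simulate a failing test suite
}
print(gate(change, checks))  # merge blocked by "tests"
```

<p>The human review step then happens on top of a change that already carries its provenance and has already passed the machines.</p>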

<h2 id="why-provenance-matters-more-than-people-think">Why Provenance Matters More Than People Think</h2>

<p>If you don’t track:</p>
<ul>
  <li>which model generated the code</li>
  <li>what prompt was used</li>
  <li>when it was created</li>
</ul>

<p>Then when something breaks… You have no idea why.</p>

<p>Provenance turns AI from: “Black box output”</p>

<p>Into: auditable engineering artifact</p>

<h2 id="what-you-must-have-non-negotiable">What You Must Have (Non-Negotiable)</h2>

<p>If you’re using AI in production:</p>

<ul>
  <li>Unit + integration tests</li>
  <li>Contract tests</li>
  <li>SAST + dependency scanning</li>
  <li>CI gating (no green -&gt; no merge)</li>
  <li>Versioned artifacts</li>
  <li>Audit trail for generated code</li>
</ul>

<p>Without these: You’re not accelerating. You’re accumulating risk faster.</p>

<h2 id="the-cost-reality-most-teams-miss">The Cost Reality Most Teams Miss</h2>

<p>AI isn’t just a “model cost.”</p>

<p>Track:</p>
<ul>
  <li>tokens per request</li>
  <li>request frequency</li>
  <li>storage (artifacts, embeddings)</li>
  <li>network (egress, gateways)</li>
  <li>logging + retention</li>
</ul>

<p>Measure it early. Because costs don’t grow linearly - they compound with usage.</p>

<h2 id="the-real-shift-what-engineers-do-now">The Real Shift: What Engineers Do Now</h2>

<p>The role is moving up the stack.</p>

<p><strong>From:</strong></p>
<ul>
  <li>writing code</li>
</ul>

<p><strong>To:</strong></p>
<ul>
  <li>designing systems</li>
  <li>writing better specifications/prompts</li>
  <li>verifying outputs</li>
  <li>governing models</li>
</ul>

<p>The best engineers won’t write more code. They’ll decide what code survives, faster.</p>

<h2 id="how-to-adopt-this-without-breaking-things">How to Adopt This Without Breaking Things</h2>

<p>Don’t go all-in. Start small.</p>

<h3 id="30-days">30 Days</h3>
<ul>
  <li>Use AI for scaffolding in one repo</li>
  <li>Store generated code with provenance</li>
</ul>

<h3 id="60-days">60 Days</h3>
<ul>
  <li>Add CI checks (tests + security)</li>
  <li>Track usage and cost</li>
</ul>

<h3 id="90-days">90 Days</h3>
<ul>
  <li>Add gating policies</li>
  <li>Consider fine-tuning on your codebase</li>
  <li>Expand to more workflows</li>
</ul>

<h2 id="whats-coming-next">What’s Coming Next</h2>

<ul>
  <li>Domain-specific copilots (finance, healthcare, etc.)</li>
  <li>Deeper IDE + CI integration</li>
  <li>Policy-as-code for AI-generated changes</li>
  <li>Auditors asking for: provenance, SBOMs, model governance</li>
</ul>

<p>This isn’t optional infrastructure anymore. It’s becoming standard engineering practice.</p>

<p>If you’re not using AI to handle repetitive engineering work, you’re falling behind.</p>

<p>But if you use it without discipline, you’ll move faster - in the wrong direction.</p>

<p>AI can 2-3x your velocity - but only if you verify everything it writes.</p>]]></content><author><name>Think Data</name></author><category term="ai" /><category term="software-engineering" /><category term="development" /><category term="ai" /><category term="llm" /><category term="software-development" /><category term="software-engineering" /><summary type="html"><![CDATA[AI is accelerating development but it's also changing how knowledge is created, shared, and reproduced.]]></summary></entry><entry><title type="html">Not all Data Pipelines Fail — They Succeed with Wrong Data</title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/04/11/not-all-data-pipelines-fail-they-succeed-with-wrong-data/" rel="alternate" type="text/html" title="Not all Data Pipelines Fail — They Succeed with Wrong Data" /><published>2026-04-11T20:55:00+00:00</published><updated>2026-04-11T20:55:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/04/11/not-all-data-pipelines-fail-they-succeed-with-wrong-data</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/04/11/not-all-data-pipelines-fail-they-succeed-with-wrong-data/"><![CDATA[<p>Most data pipelines don’t fail loudly. They fail quietly — and keep running. That’s the real problem.</p>

<h2 id="the-week-that-changed-my-mind">The Week That Changed My Mind</h2>

<p>I used to think CI/CD for data was “nice to have.”</p>

<p>Then in one week:</p>

<ul>
  <li>An upstream schema drifted</li>
  <li>An ETL job added duplicate records</li>
  <li>Production jobs still reported success</li>
</ul>

<p>Nothing crashed. Pipelines stayed “green.” But data was wrong.</p>

<p>That’s when it clicked:</p>

<p>The line between a calm data team and a chaotic one<br />
isn’t tooling — it’s discipline.</p>

<p>And today, that discipline looks like CI/CD for data.</p>

<h2 id="why-this-matters-now">Why This Matters Now</h2>

<p>Data systems have changed.</p>

<p>We’re no longer dealing with:</p>
<ul>
  <li>small batch jobs</li>
  <li>stable schemas</li>
  <li>occasional updates</li>
</ul>

<p>We’re dealing with:</p>
<ul>
  <li>event-driven pipelines</li>
  <li>constantly evolving data</li>
  <li>real-time or near-real-time expectations</li>
</ul>

<p>At the same time, tools have matured:</p>
<ul>
  <li>dbt and modern orchestration</li>
  <li>table formats like Delta / Iceberg</li>
  <li>integrated DataOps platforms</li>
</ul>

<p>The direction is clear:</p>

<p>Data is no longer “just pipelines.” It’s software that needs to be tested, versioned, and deployed.</p>

<h2 id="how-data-systems-actually-fail">How Data Systems Actually Fail</h2>

<p>Without CI/CD, failures don’t look like exceptions.</p>

<p>They look like this:</p>

<h3 id="silent-data-corruption">Silent Data Corruption</h3>
<p>A bad join or schema change doesn’t crash anything.<br />
It just poisons downstream dashboards.</p>

<h3 id="non-reproducible-backfills">Non-Reproducible Backfills</h3>
<p>You rerun a pipeline and get a different answer.<br />
Now “what changed?” has no clear answer.</p>

<h3 id="partial-writes--broken-state">Partial Writes &amp; Broken State</h3>
<p>Long-running jobs fail halfway. Some data is updated. Some isn’t. Now you have multiple versions of truth.</p>

<h3 id="slow-painful-incident-response">Slow, Painful Incident Response</h3>
<p>No tests. No rollback. No clear lineage. Fixing one issue turns into days of investigation.</p>

<p>These aren’t edge cases. They’re everyday problems in systems that lack guardrails.</p>

<h2 id="why-app-style-ci-isnt-enough">Why App-Style CI Isn’t Enough</h2>

<p>It’s tempting to apply traditional CI/CD patterns directly.</p>

<p>But data systems behave differently:</p>

<ul>
  <li><strong>Stateful pipelines</strong> → you deal with checkpoints, offsets, time</li>
  <li><strong>Schema evolution</strong> → producers change constantly</li>
  <li><strong>Non-determinism</strong> → randomness, APIs, sampling</li>
  <li><strong>Heavy backfills</strong> → reprocessing large volumes</li>
</ul>

<p>This means you need more than just “run tests on PR.” You need data-aware patterns.</p>

<h2 id="what-works-in-practice">What Works in Practice</h2>

<p>You don’t need a perfect system. You need a few high-leverage patterns.</p>

<h3 id="1-treat-data-like-code">1. Treat Data Like Code</h3>
<p>Store everything in version control:</p>
<ul>
  <li>SQL models</li>
  <li>pipeline definitions</li>
  <li>schemas and contracts</li>
</ul>

<p>Every change goes through a PR.</p>

<p><strong>Why it matters:</strong> Small, reviewable changes are easier to trust — and easier to roll back.</p>

<h3 id="2-enforce-data-contracts">2. Enforce Data Contracts</h3>
<p>Don’t let schemas drift silently.</p>

<p>Validate changes before they hit production:</p>
<ul>
  <li>column types</li>
  <li>nullability</li>
  <li>required fields</li>
</ul>

<p>If the contract breaks, the deploy should fail.</p>
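<p>A deploy-time contract check does not need a framework. A minimal sketch, with a hypothetical contract format covering types and nullability:</p>

```python
CONTRACT = {
    "order_id": {"type": int, "nullable": False},
    "amount":   {"type": float, "nullable": False},
    "coupon":   {"type": str, "nullable": True},
}


def violations(rows):
    """Validate a sample batch against the contract; return all violations."""
    errs = []
    for i, row in enumerate(rows):
        for col, rule in CONTRACT.items():
            if col not in row or row[col] is None:
                if not rule["nullable"]:
                    errs.append(f"row {i}: {col} is required")
            elif not isinstance(row[col], rule["type"]):
                errs.append(f"row {i}: {col} has wrong type")
    return errs


batch = [
    {"order_id": 1, "amount": 9.99, "coupon": None},
    {"order_id": 2, "amount": "free", "coupon": "X10"},  # drifted type
]
print(violations(batch))  # ['row 1: amount has wrong type']
```

<p>Wire a check like this into the deploy step, and a non-empty result fails the deploy instead of poisoning production.</p>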

<h3 id="3-make-pipelines-idempotent">3. Make Pipelines Idempotent</h3>
<p>If rerunning a job changes results, you don’t have a pipeline —<br />
you have a risk.</p>

<p>Use patterns like:</p>
<ul>
  <li>upserts (merge)</li>
  <li>deterministic transformations</li>
</ul>

<p>Same input → same output. Every time.</p>

<h3 id="4-shift-testing-left">4. Shift Testing Left</h3>
<p>Don’t wait for production to validate data.</p>

<p>Add layers of testing:</p>
<ul>
  <li>unit tests for transformations</li>
  <li>integration tests on small datasets</li>
  <li>statistical checks (row counts, null rates, distributions)</li>
</ul>

<p>Bad data should fail fast — before it ships.</p>
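<p>The statistical layer is the cheapest one to add. A sketch with illustrative thresholds and a hypothetical column name:</p>

```python
def check_batch(rows, min_rows=1, max_null_rate=0.1, column="user_id"):
    """Cheap pre-ship assertions: row count and null rate."""
    issues = []
    if len(rows) < min_rows:
        issues.append("row count below minimum")
    if rows:
        null_rate = sum(1 for r in rows if r.get(column) is None) / len(rows)
        if null_rate > max_null_rate:
            issues.append(f"{column} null rate {null_rate:.0%} too high")
    return issues


good = [{"user_id": i} for i in range(100)]
bad = [{"user_id": None}] * 30 + [{"user_id": 1}] * 70

print(check_batch(good))  # []
print(check_batch(bad))   # ['user_id null rate 30% too high']
```

<p>Distribution checks follow the same shape: compute a statistic on the batch, compare it to a bound, fail loudly.</p>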

<h3 id="5-use-versioned-time-travel-tables">5. Use Versioned, Time-Travel Tables</h3>
<p>Table formats like Delta or Iceberg make a huge difference.</p>

<p>They give you:</p>
<ul>
  <li>atomic writes</li>
  <li>rollback capability</li>
  <li>reproducible snapshots</li>
</ul>

<p>If you can’t rewind data, you can’t debug it.</p>

<h3 id="6-canary-before-full-deployment">6. Canary Before Full Deployment</h3>
<p>Don’t deploy changes everywhere at once.</p>

<ul>
  <li>Run on a subset of data</li>
  <li>Compare key metrics</li>
  <li>Promote only if it passes</li>
</ul>

<p>Small blast radius → safer systems.</p>
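<p>The canary comparison can be as simple as bounding metric drift between the current output and the candidate’s. A sketch, with illustrative metrics and tolerance:</p>

```python
def canary_passes(current, candidate, tolerance=0.05):
    """current/candidate: dicts of metric name -> value.

    Promote only if every metric stays within `tolerance` relative drift.
    """
    for metric, base in current.items():
        if base == 0:
            continue  # no baseline to compare against
        drift = abs(candidate.get(metric, 0) - base) / abs(base)
        if drift > tolerance:
            return False
    return True


current = {"row_count": 10_000, "revenue_sum": 52_300.0}
candidate = {"row_count": 10_050, "revenue_sum": 31_000.0}  # revenue off by ~40%
print(canary_passes(current, candidate))  # False
```

<p>Which metrics to compare is the judgment call; the mechanism itself is trivial.</p>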

<h3 id="7-build-observability-into-pipeline">7. Build Observability into the Pipeline</h3>
<p>You shouldn’t rely on someone noticing a broken dashboard.</p>

<p>Track:</p>
<ul>
  <li>freshness</li>
  <li>completeness</li>
  <li>anomalies in key metrics</li>
</ul>

<p>Good systems detect issues before users do.</p>
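<p>A freshness check is the simplest observability win. A sketch, assuming a 30-minute staleness threshold:</p>

```python
import datetime


def is_stale(last_updated, now, max_age=datetime.timedelta(minutes=30)):
    """True if the dataset has not been updated within max_age."""
    return (now - last_updated) > max_age


now = datetime.datetime(2026, 4, 11, 12, 0)
print(is_stale(datetime.datetime(2026, 4, 11, 11, 45), now))  # False
print(is_stale(datetime.datetime(2026, 4, 11, 10, 0), now))   # True
```

<p>Run it on a schedule and page on <code class="language-plaintext highlighter-rouge">True</code>, and stale dashboards stop being discovered by users.</p>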

<h2 id="what-changes-after-you-do-this">What Changes After You Do This</h2>

<p>The shift is subtle — but powerful.</p>

<h3 id="before">Before</h3>
<ul>
  <li>Pipelines “usually work”</li>
  <li>Fixes are manual and reactive</li>
  <li>Data issues take days to debug</li>
</ul>

<h3 id="after">After</h3>
<ul>
  <li>Changes are tested before deployment</li>
  <li>Failures are isolated and reversible</li>
  <li>Data is reproducible and auditable</li>
</ul>

<h2 id="the-trade-offs-be-honest">The Trade-Offs (Be Honest)</h2>

<p>CI/CD for data isn’t free.</p>

<p>It costs:</p>
<ul>
  <li>engineering time</li>
  <li>compute for testing</li>
  <li>discipline to maintain</li>
</ul>

<p>But the alternative is worse:</p>
<ul>
  <li>unreliable dashboards</li>
  <li>broken trust</li>
  <li>expensive incidents</li>
</ul>

<p>Most teams don’t pay upfront. They pay later — with interest.</p>

<h2 id="where-this-is-going">Where This Is Going</h2>

<p>We’re already seeing the next layer:</p>
<ul>
  <li>automated anomaly detection</li>
  <li>smarter validation using ML</li>
  <li>systems that suggest root causes</li>
</ul>

<p>But all of that depends on one thing:</p>

<p>You can’t build intelligent systems on top of unreliable pipelines. CI/CD is the foundation.</p>

<h2 id="a-practical-starting-point">A Practical Starting Point</h2>

<p>You don’t need to do everything at once.</p>

<p>Start small:</p>
<ul>
  <li>Add schema checks to your PRs</li>
  <li>Run data tests (dbt or similar) on every change</li>
  <li>Version one critical dataset with time travel</li>
</ul>

<p>That alone will eliminate a surprising amount of chaos.</p>

<h2 id="final-take">Final Take</h2>

<p>Data pipelines don’t just move data. They produce decisions.</p>

<p>If those pipelines aren’t:</p>
<ul>
  <li>tested</li>
  <li>versioned</li>
  <li>reproducible</li>
</ul>

<p>Then decisions built on top of them aren’t reliable either.</p>

<p>CI/CD for data turns pipelines from “best effort” into systems you can trust.</p>]]></content><author><name>Think Data</name></author><category term="data-engineering" /><category term="data-pipelines" /><category term="cicd" /><category term="data-engineering" /><category term="data-pipelines" /><category term="data-quality" /><category term="cicd" /><summary type="html"><![CDATA[Most data pipelines don't fail loudly. They fail quietly — and keep running. That's the real problem.]]></summary></entry><entry><title type="html">If You Think You Know Python, These Will Prove You Wrong</title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/04/11/if-you-think-you-know-python-these-will-prove-you-wrong/" rel="alternate" type="text/html" title="If You Think You Know Python, These Will Prove You Wrong" /><published>2026-04-11T20:52:00+00:00</published><updated>2026-04-11T20:52:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/04/11/if-you-think-you-know-python-these-will-prove-you-wrong</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/04/11/if-you-think-you-know-python-these-will-prove-you-wrong/"><![CDATA[<p><img src="https://images.unsplash.com/photo-1555066931-4365d14bab8c?ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&amp;auto=format&amp;fit=crop&amp;w=1000&amp;q=80" alt="Photo by Hitesh Choudhary on Unsplash" /></p>

<p>Most of us get comfortable because our code works, not because we fully understand why. And that illusion breaks the moment you hit edge cases that don’t behave the way you expect.</p>

<h2 id="1-default-mutable-arguments-but-the-real-gotcha">1. Default Mutable Arguments (but the real gotcha)</h2>

<p>You already know this is bad:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">add_item</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">lst</span><span class="o">=</span><span class="p">[]):</span>
    <span class="n">lst</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">lst</span>
</code></pre></div></div>

<p>But here’s what people miss: It’s not just a bug — it’s intentional state retention:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">counter</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">cache</span><span class="o">=</span><span class="p">{}):</span>
    <span class="n">cache</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">cache</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">cache</span>
</code></pre></div></div>

<p>This acts like a hidden static variable.</p>

<p>💡 Used carefully → performance trick<br />
💀 Used accidentally → nightmare debugging</p>
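<p>When you want the safe behavior rather than the trick, the standard fix is a <code class="language-plaintext highlighter-rouge">None</code> sentinel:</p>

```python
def add_item(x, lst=None):
    if lst is None:
        lst = []  # fresh list on every call, not one shared across calls
    lst.append(x)
    return lst


print(add_item(1))  # [1]
print(add_item(2))  # [2], not [1, 2]
```

<p>Defaults are evaluated once at function definition; the sentinel moves list creation to call time.</p>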

<h2 id="2-is-vs--worse-than-you-think">2. <code class="language-plaintext highlighter-rouge">is</code> vs <code class="language-plaintext highlighter-rouge">==</code> (Worse Than You Think)</h2>

<p>Everyone says: use <code class="language-plaintext highlighter-rouge">==</code>, not <code class="language-plaintext highlighter-rouge">is</code></p>

<p>But here’s the twist:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="mi">256</span>
<span class="n">b</span> <span class="o">=</span> <span class="mi">256</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span> <span class="ow">is</span> <span class="n">b</span><span class="p">)</span>  <span class="c1"># True
</span>
<span class="n">a</span> <span class="o">=</span> <span class="mi">257</span>
<span class="n">b</span> <span class="o">=</span> <span class="mi">257</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span> <span class="ow">is</span> <span class="n">b</span><span class="p">)</span>  <span class="c1"># False
</span></code></pre></div></div>

<p>Python interns small integers (-5 to 256).</p>

<p>Even worse:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="s">"hello"</span>
<span class="n">b</span> <span class="o">=</span> <span class="s">"hello"</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span> <span class="ow">is</span> <span class="n">b</span><span class="p">)</span>  <span class="c1"># True (sometimes)
</span></code></pre></div></div>

<p>💡 String interning is inconsistent across contexts.</p>
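<p>If you genuinely need identity checks on strings (a common dict-key speed trick), <code class="language-plaintext highlighter-rouge">sys.intern</code> makes it explicit instead of relying on implementation details:</p>

```python
import sys

# These strings are built at runtime, so CPython does not auto-intern them:
a = "gotcha-" + str(42)
b = "gotcha-" + str(42)
print(a is b)  # False (two distinct objects)

# Explicit interning returns one canonical object per value:
x = sys.intern("gotcha-" + str(42))
y = sys.intern("gotcha-" + str(42))
print(x is y)  # True
```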

<h2 id="3-late-binding-in-closures-classic-trap">3. Late Binding in Closures (Classic Trap)</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">funcs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
    <span class="n">funcs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="k">lambda</span><span class="p">:</span> <span class="n">i</span><span class="p">)</span>

<span class="k">print</span><span class="p">([</span><span class="n">f</span><span class="p">()</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">funcs</span><span class="p">])</span>  <span class="c1"># [2, 2, 2]
</span></code></pre></div></div>

<p>👉 All three lambdas capture the same variable, not its value at append time.</p>

<p><strong>Fix:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">funcs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="k">lambda</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="p">:</span> <span class="n">i</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="4-dict-order-is-guaranteed-but-that-changes-design">4. Dict Order Is Guaranteed (But That Changes Design)</h2>

<p>Since Python 3.7:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d</span> <span class="o">=</span> <span class="p">{</span><span class="s">"a"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">"b"</span><span class="p">:</span> <span class="mi">2</span><span class="p">}</span>
</code></pre></div></div>

<p>👉 Order is preserved.</p>

<p><strong>Hidden impact:</strong></p>

<p>People now rely on dict order → implicit coupling between insertion and iteration logic<br />
Code written against pre-3.7 assumptions behaves differently when ported.</p>

<p>💡 Dicts are now often used like lightweight ordered structures.</p>

<h2 id="5-set-removes-duplicates-but-also-reorders">5. <code class="language-plaintext highlighter-rouge">set</code> Removes Duplicates… But Also Reorders</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">([</span><span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">]))</span>
<span class="c1"># [1, 2, 3]  (but not guaranteed order)
</span></code></pre></div></div>

<p>👉 Many devs accidentally introduce non-determinism.</p>
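<p>If you need deduplication <em>without</em> losing order, <code class="language-plaintext highlighter-rouge">dict.fromkeys</code> (order-preserving since 3.7) is the deterministic alternative:</p>

```python
items = [3, 1, 2, 1]

# Dicts preserve insertion order, so this dedupes deterministically:
deduped = list(dict.fromkeys(items))
print(deduped)  # [3, 1, 2]
```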

<h2 id="6-everything-is-a-reference-but-not-always-obvious">6. Everything Is a Reference (But Not Always Obvious)</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">a</span>
<span class="n">b</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>  <span class="c1"># [1, 2, 3]
</span></code></pre></div></div>

<p>But:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">a</span><span class="p">[:]</span>
<span class="n">b</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>  <span class="c1"># [1, 2]
</span></code></pre></div></div>

<p>👉 Copy vs reference bugs show up in:</p>
<ul>
  <li>caching</li>
  <li>multiprocessing</li>
  <li>data pipelines</li>
</ul>
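<p>And the slice trick above only goes one level deep. For nested structures, a shallow copy still shares the inner objects; <code class="language-plaintext highlighter-rouge">copy.deepcopy</code> is the safe option:</p>

```python
import copy

a = [[1, 2], [3, 4]]

shallow = a[:]           # new outer list, same inner lists
deep = copy.deepcopy(a)  # fully independent copy

a[0].append(99)
print(shallow[0])  # [1, 2, 99] (inner list is shared)
print(deep[0])     # [1, 2]     (unaffected)
```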

<h2 id="7-tuple-isnt-always-immutable">7. Tuple Isn’t Always Immutable</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">t</span> <span class="o">=</span> <span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="mi">99</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>  <span class="c1"># ([1, 2, 99], 3)
</span></code></pre></div></div>

<p>👉 Tuple is immutable, but its contents might not be.</p>

<h2 id="8--can-mutate-or-not">8. <code class="language-plaintext highlighter-rouge">+=</code> Can Mutate… or Not</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">a</span>
<span class="n">a</span> <span class="o">+=</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>

<span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>  <span class="c1"># [1, 2, 3]
</span></code></pre></div></div>

<p>But:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">a</span>
<span class="n">a</span> <span class="o">+=</span> <span class="p">(</span><span class="mi">3</span><span class="p">,)</span>

<span class="k">print</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>  <span class="c1"># (1, 2)
</span></code></pre></div></div>

<p>👉 List mutates in-place<br />
👉 Tuple creates new object.</p>

<p>Same operator. Different behavior.</p>
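<p>You can watch it happen with <code class="language-plaintext highlighter-rouge">id()</code>: <code class="language-plaintext highlighter-rouge">+=</code> uses in-place <code class="language-plaintext highlighter-rouge">__iadd__</code> on lists but falls back to <code class="language-plaintext highlighter-rouge">__add__</code> (a new object) on tuples:</p>

```python
a = [1, 2]
before = id(a)
a += [3]
print(id(a) == before)  # True: the list mutated in place

t = (1, 2)
before = id(t)
t += (3,)
print(id(t) == before)  # False: the name was rebound to a new tuple
```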

<h2 id="9-exception-handling-can-hide-bugs">9. Exception Handling Can Hide Bugs</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def f():
    try:
        return something()
    finally:
        return "oops"  # this wins: the try's return value is discarded
</code></pre></div></div>

<p>👉 A <code class="language-plaintext highlighter-rouge">return</code> inside <code class="language-plaintext highlighter-rouge">finally</code> overrides the <code class="language-plaintext highlighter-rouge">try</code> block’s return. It also silently swallows any in-flight exception.</p>

<h2 id="10-for-else-exists-and-almost-nobody-uses-it-right">10. <code class="language-plaintext highlighter-rouge">for-else</code> Exists (and Almost Nobody Uses It Right)</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">x</span> <span class="o">==</span> <span class="n">target</span><span class="p">:</span>
        <span class="k">break</span>
<span class="k">else</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Not found"</span><span class="p">)</span>
</code></pre></div></div>

<p>👉 <code class="language-plaintext highlighter-rouge">else</code> runs only if loop did NOT break.</p>

<h2 id="11-floating-point-lies">11. Floating Point Lies</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mf">0.1</span> <span class="o">+</span> <span class="mf">0.2</span> <span class="o">==</span> <span class="mf">0.3</span>  <span class="c1"># False
</span></code></pre></div></div>

<p>👉 You know this… but it still bites in:</p>
<ul>
  <li>finance</li>
  <li>aggregations</li>
  <li>data pipelines</li>
</ul>
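<p>Two standard escapes: <code class="language-plaintext highlighter-rouge">math.isclose</code> for comparisons, and <code class="language-plaintext highlighter-rouge">decimal.Decimal</code> when exact base-10 arithmetic matters (e.g., money):</p>

```python
import math
from decimal import Decimal

print(0.1 + 0.2 == 0.3)               # False
print(math.isclose(0.1 + 0.2, 0.3))   # True

# Decimal does exact base-10 arithmetic; construct from strings, not floats:
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```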

<h2 id="12-list-multiplication-shares-references">12. List Multiplication Shares References</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">grid</span> <span class="o">=</span> <span class="p">[[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="mi">3</span><span class="p">]</span><span class="o">*</span><span class="mi">3</span>
<span class="n">grid</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>

<span class="k">print</span><span class="p">(</span><span class="n">grid</span><span class="p">)</span>
<span class="c1"># [[1,0,0],[1,0,0],[1,0,0]]
</span></code></pre></div></div>

<p>👉 All three rows point to the same inner list.</p>
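<p><strong>Fix:</strong> build each row independently with a comprehension:</p>

```python
grid = [[0] * 3 for _ in range(3)]  # three distinct row lists
grid[0][0] = 1

print(grid)  # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
```

<p>(<code class="language-plaintext highlighter-rouge">[0] * 3</code> for the inner row is fine, because ints are immutable.)</p>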

<h2 id="13-bool-is-a-subclass-of-int">13. <code class="language-plaintext highlighter-rouge">bool</code> Is a Subclass of <code class="language-plaintext highlighter-rouge">int</code></h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="bp">True</span> <span class="o">+</span> <span class="bp">True</span> <span class="o">==</span> <span class="mi">2</span>  <span class="c1"># True
</span><span class="nb">isinstance</span><span class="p">(</span><span class="bp">True</span><span class="p">,</span> <span class="nb">int</span><span class="p">)</span>  <span class="c1"># True
</span></code></pre></div></div>

<p>👉 This leaks into:</p>
<ul>
  <li>pandas</li>
  <li>aggregations</li>
  <li>weird bugs</li>
</ul>

<h2 id="14-__del__-is-not-reliable">14. <code class="language-plaintext highlighter-rouge">__del__</code> Is Not Reliable</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">A</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__del__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"deleted"</span><span class="p">)</span>
</code></pre></div></div>

<p>👉 Garbage collection timing is unpredictable.</p>

<p>💀 Don’t rely on it for cleanup.</p>
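<p>For deterministic cleanup, reach for a context manager instead. A minimal sketch using <code class="language-plaintext highlighter-rouge">contextlib</code> (the <code class="language-plaintext highlighter-rouge">events</code> list just records ordering):</p>

```python
from contextlib import contextmanager

events = []

@contextmanager
def resource():
    events.append("acquired")
    try:
        yield "handle"
    finally:
        events.append("released")  # runs at block exit; GC timing is irrelevant

with resource() as h:
    events.append(f"using {h}")

print(events)  # ['acquired', 'using handle', 'released']
```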

<h2 id="15-iterators-get-exhausted-silently">15. Iterators Get Exhausted Silently</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">it</span> <span class="o">=</span> <span class="nb">iter</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">])</span>
<span class="nb">list</span><span class="p">(</span><span class="n">it</span><span class="p">)</span>  <span class="c1"># [1,2,3]
</span><span class="nb">list</span><span class="p">(</span><span class="n">it</span><span class="p">)</span>  <span class="c1"># []
</span></code></pre></div></div>

<p>👉 This causes subtle bugs in:</p>
<ul>
  <li>streaming pipelines</li>
  <li>generators</li>
  <li>testing</li>
</ul>
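<p>If you need multiple passes, materialize once, or split the iterator up front with <code class="language-plaintext highlighter-rouge">itertools.tee</code>:</p>

```python
import itertools

# Materialize once, reuse freely:
data = list(iter([1, 2, 3]))
print(data, data)  # [1, 2, 3] [1, 2, 3]

# Or split one iterator into independent iterators up front:
a, b = itertools.tee(iter([1, 2, 3]))
print(list(a), list(b))  # [1, 2, 3] [1, 2, 3]
```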

<h2 id="16-pattern-matching-310-has-sharp-edges">16. Pattern Matching (3.10+) Has Sharp Edges</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">match</span> <span class="n">x</span><span class="p">:</span>
    <span class="n">case</span> <span class="mi">1</span><span class="p">:</span>
        <span class="p">...</span>
    <span class="n">case</span> <span class="n">y</span><span class="p">:</span>
        <span class="p">...</span>
</code></pre></div></div>

<p>But:</p>

<p>👉 <code class="language-plaintext highlighter-rouge">case y</code> captures the value into <code class="language-plaintext highlighter-rouge">y</code>; it does not compare against an existing variable. A bare name in a pattern always matches.</p>

<p>💀 Many devs think it’s equality.</p>

<h2 id="17-shadowing-built-ins-breaks-everything">17. Shadowing Built-ins Breaks Everything</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">list</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span>
<span class="nb">list</span><span class="p">(</span><span class="s">"abc"</span><span class="p">)</span>  <span class="c1"># 💀
</span></code></pre></div></div>

<p>👉 Happens more in notebooks than you think.</p>
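<p>The recovery is simple: <code class="language-plaintext highlighter-rouge">del</code> removes the shadow and name lookup falls back to the built-in:</p>

```python
list = [1, 2, 3]    # shadows the built-in
del list            # removes the shadow...
print(list("abc"))  # ['a', 'b', 'c'] -- the built-in is back
```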

<h2 id="18-globals-and-locals-are-writable-sometimes">18. <code class="language-plaintext highlighter-rouge">globals()</code> and <code class="language-plaintext highlighter-rouge">locals()</code> Are Writable (Sometimes)</h2>

<p>You can do wild stuff like:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">globals</span><span class="p">()[</span><span class="s">'x'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">10</span>
</code></pre></div></div>

<p>👉 Useful for metaprogramming<br />
💀 Dangerous in large systems.</p>

<h2 id="final-take">Final Take</h2>

<p>Most Python bugs aren’t syntax issues.</p>

<p>They’re mental model mismatches.</p>

<p>Python looks simple — but it’s full of “gotchas by design.”</p>]]></content><author><name>Think Data</name></author><category term="python" /><category term="programming" /><category term="gotchas" /><category term="python" /><category term="advanced-programming" /><category term="common-mistakes" /><summary type="html"><![CDATA[Most of us get comfortable because our code works, not because we fully understand why. And that illusion breaks the moment you hit edge cases that don't behave the way you expect.]]></summary></entry><entry><title type="html">You’re Not Competing with AI - You’re Competing with Engineers Who Use It</title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/03/25/compete-with-engineers-who-use-ai/" rel="alternate" type="text/html" title="You’re Not Competing with AI - You’re Competing with Engineers Who Use It" /><published>2026-03-25T00:00:00+00:00</published><updated>2026-03-25T00:00:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/03/25/compete-with-engineers-who-use-ai</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/03/25/compete-with-engineers-who-use-ai/"><![CDATA[<p>I’m not saying this after a weekend of trying AI tools. I’m saying this after 2 years of using Cursor consistently - while working a demanding full-time job. And I’ll be direct: The way most engineers are still writing code today is already outdated.</p>

<p><img src="/data-engineering-blog/assets/images/useai.png" alt="AI Engineering Workflow" class="center-image" /></p>

<hr />

<h2 id="lets-say-the-quiet-part-out-loud">Let’s Say the Quiet Part Out Loud</h2>

<p>If you’re still:</p>
<ul>
  <li>Manually writing boilerplate</li>
  <li>Googling patterns you’ve implemented 100 times</li>
  <li>Stitching together repetitive logic</li>
</ul>

<p>You’re not demonstrating skill. You’re demonstrating resistance to leverage.</p>

<hr />

<h2 id="my-turning-point">My Turning Point</h2>

<p>When I first started using Cursor, I used it like autocomplete. That was a mistake. The real shift happened when I treated it like a collaborator.</p>

<p>I was building a data pipeline:</p>
<ul>
  <li>Ingestion</li>
  <li>Schema validation</li>
  <li>Transformations</li>
  <li>Feature logic</li>
</ul>

<p>Normally: a couple of days.</p>

<p>This time, I described the system in plain English:</p>
<ul>
  <li>Inputs</li>
  <li>Outputs</li>
  <li>Constraints</li>
  <li>Edge cases</li>
</ul>

<p>Cursor generated a working structure in minutes. Not perfect. But good enough to skip hours of setup. What used to take days took a few hours.</p>

<p>After repeating this for months, I realized this isn’t a trick. It’s the new baseline.</p>

<hr />

<h2 id="what-2-years-of-this-looks-like-with-a-full-time-job">What 2 Years of This Looks Like (With a Full-Time Job)</h2>

<p>Here’s the part that really changed my perspective: All of this was built outside my day job. Not by grinding nights endlessly. But by reducing the cost of building.</p>

<p>Over the past couple of years, I’ve built:</p>

<ul>
  <li><strong>An AI blog writing agent</strong> (research → structure → draft) - <a href="https://github.com/eerla/ai_blog_writing_agent">Check it out</a></li>
  <li><strong>An event management app</strong>: <a href="https://tribe-connect-two.vercel.app/">https://tribe-connect-two.vercel.app/</a></li>
  <li><strong>Pybenders</strong> - LLM-powered reels generator, multiple visual formats, 12+ content contexts, multi-platform output - <a href="https://github.com/eerla/pybenders">pybenders/README.md at main · eerla/pybenders</a></li>
  <li><strong>A full data engineering guide</strong>: <a href="https://eerla.github.io/data-engineering-blog/">https://eerla.github.io/data-engineering-blog/</a></li>
  <li><strong>An Interview Assist tool</strong>: resume scanning, auto-generated interview questions, structured evaluation - <a href="https://intervue-assist.streamlit.app/">https://intervue-assist.streamlit.app/</a></li>
  <li><strong>Thrive</strong>: mobile app where users receive daily customized motivational quotes powered by LLM - <a href="https://github.com/eerla/Thrive">eerla/Thrive: initial commit</a></li>
</ul>

<p>And several smaller tools and browser extensions that I use locally.</p>

<hr />

<h2 id="the-part-most-engineers-wont-like">The Part Most Engineers Won’t Like</h2>

<p>None of this required:</p>
<ul>
  <li>Months of effort per project</li>
  <li>Perfect architecture upfront</li>
  <li>Doing everything manually</li>
</ul>

<p>Because I wasn’t. AI handled:</p>
<ul>
  <li>Boilerplate</li>
  <li>Scaffolding</li>
  <li>Repetitive logic</li>
  <li>First drafts</li>
</ul>

<p>I focused on:</p>
<ul>
  <li>What to build</li>
  <li>How it should work</li>
  <li>What actually matters</li>
</ul>

<hr />

<h2 id="the-lie-engineers-tell-themselves">The Lie Engineers Tell Themselves</h2>

<p>“I want to understand everything deeply.”</p>

<p>After 2 years of working like this: Depth doesn’t come from writing everything yourself. It comes from:</p>
<ul>
  <li>Reviewing</li>
  <li>Questioning</li>
  <li>Refining</li>
  <li>Iterating faster</li>
</ul>

<p>AI doesn’t remove depth. It removes wasted effort disguised as depth.</p>

<hr />

<h2 id="the-real-threat-be-honest">The Real Threat (Be Honest)</h2>

<p>If AI can generate most of your code… Then most of your code was never your advantage.</p>

<p>Your advantage is:</p>
<ul>
  <li>Judgment</li>
  <li>System design</li>
  <li>Problem framing</li>
  <li>Speed of iteration</li>
</ul>

<p>If your identity is tied to typing code manually… This shift will feel uncomfortable.</p>

<hr />

<h2 id="a-simple-example">A Simple Example</h2>

<p>Messy module:</p>
<ul>
  <li>Duplicated logic</li>
  <li>Unclear structure</li>
</ul>

<p><strong>Before</strong>: Hours of refactoring</p>

<p><strong>Now</strong>: “Clean this up. Improve readability. Don’t change behavior.” Done in seconds.</p>

<p>My job?</p>
<ul>
  <li>Validate</li>
  <li>Refine</li>
  <li>Move forward</li>
</ul>

<hr />

<h2 id="this-is-not-a-productivity-hack">This Is Not a Productivity Hack</h2>

<p>This is where people underestimate it.</p>

<p>It’s not: “I save some time”</p>

<p>It’s: “I build at a completely different scale”</p>

<p>You:</p>
<ul>
  <li>Try more ideas</li>
  <li>Ship more projects</li>
  <li>Abandon bad paths faster</li>
  <li>Take bigger risks</li>
</ul>

<p>That’s not speed. That’s leverage.</p>

<hr />

<h2 id="the-gap-is-already-forming">The Gap Is Already Forming</h2>

<p>After 2 years, I can say this confidently: There are now two types of engineers:</p>

<ol>
  <li>Writes code</li>
  <li>Builds with AI</li>
</ol>

<p>Same intelligence. Completely different output.</p>

<hr />

<h2 id="i-dont-want-to-be-dependent">“I Don’t Want to Be Dependent”</h2>

<p>You already are. On:</p>
<ul>
  <li>Frameworks</li>
  <li>Libraries</li>
  <li>Open-source</li>
  <li>Google</li>
</ul>

<p>AI is just the next layer. Refusing it isn’t discipline. It’s denial.</p>

<hr />

<h2 id="the-uncomfortable-ending">The Uncomfortable Ending</h2>

<p>In a year, saying: “I don’t use AI to code” will sound like: “I don’t use the internet when I code.”</p>

<hr />

<h2 id="final-line">Final Line</h2>

<p>You’re not competing with AI. You’re competing with engineers who have been using it for 2 years - while working full-time - and shipping consistently. And they’re not slowing down.</p>

<p>This isn’t about Cursor. You can replace it with any AI tool. The real point is, engineers who learn to leverage AI will outpace those who don’t - regardless of which tool they use.</p>

<hr />

<p>If you’re building data platforms, exploring lakehouse architectures, or just curious about how modern data systems achieve reliability, connect with me on <a href="https://www.linkedin.com/in/guru-e/">LinkedIn</a>.</p>]]></content><author><name>Guru, Eerla</name></author><category term="ai-engineering" /><category term="ai" /><category term="cursor" /><category term="engineering" /><category term="productivity" /><category term="tools" /><summary type="html"><![CDATA[I’m not saying this after a weekend of trying AI tools. I’m saying this after 2 years of using Cursor consistently - while working a demanding full-time job. And I’ll be direct: The way most engineers are still writing code today is already outdated.]]></summary></entry><entry><title type="html">Airflow Works Best When It Does Less</title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/03/23/airflow-works-best-when-it-does-less/" rel="alternate" type="text/html" title="Airflow Works Best When It Does Less" /><published>2026-03-23T04:00:00+00:00</published><updated>2026-03-23T04:00:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/03/23/airflow-works-best-when-it-does-less</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/03/23/airflow-works-best-when-it-does-less/"><![CDATA[<p>The symptoms are consistent:</p>

<ul>
  <li>Workers pinned at high CPU</li>
  <li>Retry storms under load</li>
  <li>DAGs that pass locally but fail in production</li>
  <li>Business logic buried inside orchestration</li>
</ul>

<p>This isn’t a scaling issue. It’s a boundary violation.</p>

<h2 id="airflow-is-a-control-plane">Airflow Is a Control Plane</h2>

<p>Airflow exists to:</p>
<ul>
  <li>schedule work</li>
  <li>enforce dependencies</li>
  <li>manage retries</li>
</ul>

<p>It does not exist to:</p>
<ul>
  <li>process data</li>
  <li>hold state</li>
  <li>execute transformations</li>
</ul>

<p>When orchestration and compute share the same layer, they compete for resources.</p>

<p>That competition is where systems degrade.</p>

<h2 id="dags-should-describe-flow--nothing-else">DAGs Should Describe Flow — Nothing Else</h2>

<p>A DAG answers:</p>

<p><strong>What runs, and in what order?</strong></p>

<p>Not:</p>

<p><strong>How does the data get processed?</strong></p>

<p>Once you embed logic inside DAGs:</p>
<ul>
  <li>orchestration becomes coupled to implementation</li>
  <li>pipelines become untestable</li>
  <li>changes become risky</li>
</ul>

<p>Clean systems separate:</p>
<ul>
  <li><strong>DAG → control flow</strong></li>
  <li><strong>Compute → execution layer</strong></li>
</ul>

<h2 id="the-patterns-that-cause-most-failures">The Patterns That Cause Most Failures</h2>

<ul>
  <li><strong>In-process compute:</strong> Large joins, pandas jobs, heavy transforms inside tasks</li>
  <li><strong>XCom as a data layer:</strong> Passing payloads instead of metadata</li>
  <li><strong>Business logic in DAGs:</strong> No versioning, no reuse, no testability</li>
  <li><strong>Shared resources:</strong> Orchestration and compute competing for CPU/memory</li>
</ul>

<p>These are not edge cases. This is how most Airflow systems fail.</p>

<h2 id="failure-modes-they-compound-fast">Failure Modes (They Compound Fast)</h2>

<ul>
  <li><strong>Scheduler starvation:</strong> Workers doing compute can’t schedule new tasks</li>
  <li><strong>Retry amplification:</strong> Failures increase load → more failures</li>
  <li><strong>State inconsistencies:</strong> No clear ownership of data or transformations</li>
  <li><strong>Debugging collapse:</strong> Logs tied to orchestration, not execution</li>
</ul>

<p>Failures don’t originate in SQL. They emerge at system boundaries.</p>

<h2 id="the-correct-model">The Correct Model</h2>

<p>Airflow should coordinate work, not perform it.</p>

<ol>
  <li><strong>Trigger</strong> Spark / dbt / containerized jobs</li>
  <li><strong>Wait</strong> for completion</li>
  <li><strong>Pass references</strong> (IDs, URIs), not data</li>
</ol>

<p>Airflow becomes thin, predictable, and stable.</p>
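<p>The “pass references, not data” rule looks like this in plain Python (no Airflow dependency; function names and URIs are illustrative):</p>

```python
# Each task returns a small reference, never a payload.
def extract(run_date):
    # ...the job writes to object storage and returns only the URI.
    return f"s3://lake/raw/events/dt={run_date}/"

def transform(input_uri):
    # The compute engine (Spark/dbt) resolves the URI itself;
    # orchestration only hands pointers between steps.
    return input_uri.replace("/raw/", "/clean/")

ref = extract("2026-03-23")
print(transform(ref))  # s3://lake/clean/events/dt=2026-03-23/
```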

<h2 id="what-improves-in-production">What Improves in Production</h2>

<ul>
  <li><strong>Scheduler remains responsive</strong> under load</li>
  <li><strong>Failures are isolated</strong> to compute systems</li>
  <li><strong>Pipelines become testable</strong> outside Airflow</li>
  <li><strong>Recovery becomes deterministic</strong></li>
</ul>

<p>Not because of better tooling. Because responsibilities are separated correctly.</p>

<h2 id="the-system-model">The System Model</h2>

<p>Think in layers:</p>

<ul>
  <li><strong>Control → Airflow</strong></li>
  <li><strong>Compute → Spark / dbt / containers</strong></li>
  <li><strong>Storage → warehouse / lake</strong></li>
</ul>

<p>If these blur, the system becomes fragile.</p>

<h2 id="final-take">Final Take</h2>

<p>Most Airflow issues are self-inflicted.</p>

<p>Not because Airflow is limited, but because it’s forced to do work it was never designed for.</p>

<p>If your DAGs are executing real computation, you don’t have a pipeline problem. You have a system design problem.</p>

<h2 id="one-rule">One Rule</h2>

<p>If a task:</p>
<ul>
  <li>runs long CPU workloads</li>
  <li>or processes large in-memory data</li>
</ul>

<p>It does not belong in Airflow.</p>]]></content><author><name>Think Data</name></author><category term="airflow" /><category term="orchestration" /><category term="data-engineering" /><category term="airflow" /><category term="orchestration" /><category term="data-engineering" /><category term="best-practices" /><summary type="html"><![CDATA[If your Airflow tasks are doing real computation, your system is already mis designed.]]></summary></entry><entry><title type="html">I Dug Into Delta Lake’s Transaction Log - This Is How ACID Actually Works on S3</title><link href="https://eerla.github.io/data-engineering-blog/blog/2026/03/22/delta-lake-transaction-log-acid-on-s3/" rel="alternate" type="text/html" title="I Dug Into Delta Lake’s Transaction Log - This Is How ACID Actually Works on S3" /><published>2026-03-22T00:00:00+00:00</published><updated>2026-03-22T00:00:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2026/03/22/delta-lake-transaction-log-acid-on-s3</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2026/03/22/delta-lake-transaction-log-acid-on-s3/"><![CDATA[<p>I used to treat object stores like what they are: cheap, durable, and completely unreliable for transactional work. Great for dumping data. Terrible for updates, deletes, or anything resembling correctness.</p>

<p>A few years ago, if someone told me they were doing MERGE, UPDATE, DELETE on S3, I’d assume one of two things:</p>
<ul>
  <li>They built a fragile abstraction</li>
  <li>Or they didn’t understand failure modes yet</li>
</ul>

<p>Then I started digging into Delta Lake.
What I found wasn’t magic. It was a very deliberate systems design trade-off.</p>

<p><img src="/data-engineering-blog/assets/images/blog/delta-lake-architecture.png" alt="Delta Lake Architecture" class="center-image" />
<em>Delta Lake adds a transaction log layer on top of immutable data files</em></p>

<h2 id="why-object-stores-break-transactional-systems">Why object stores break transactional systems</h2>
<p>Object stores like S3, ADLS, and GCS were never designed for databases.</p>

<p>They give you:</p>
<ul>
  <li>Immutable blobs</li>
  <li>High throughput reads/writes</li>
  <li>Cheap storage at scale</li>
</ul>

<p>But they lack:</p>
<ul>
  <li>Atomic updates</li>
  <li>Strong consistency on listing</li>
  <li>Transactions</li>
  <li>Native metadata layer</li>
</ul>

<p>Which means: You can store data reliably - but you can’t change it reliably.</p>

<hr />

<h2 id="the-core-idea-dont-fix-storage-add-a-layer">The core idea: don’t fix storage - add a layer</h2>

<p>Delta Lake doesn’t try to make S3 transactional.
Instead, it builds a thin transaction layer on top of it:</p>
<ul>
  <li>Data files → immutable (Parquet)</li>
  <li>Changes → tracked separately</li>
  <li>Truth → defined by a log</li>
</ul>

<p>This is the key shift: State is not in the files. It’s in the log.</p>

<p>Think of it like this:</p>
<ul>
  <li>Files = raw facts (never edited)</li>
  <li>Log = source of truth</li>
  <li>Snapshot = interpretation of log + files</li>
</ul>

<hr />

<h2 id="how-delta-lake-actually-gives-you-acid">How Delta Lake actually gives you ACID</h2>

<p>Three core building blocks:</p>

<h3 id="1-immutable-data-files">1) Immutable data files</h3>

<ul>
  <li>Data is written as Parquet</li>
  <li>Never updated in-place</li>
  <li>Updates = new files + old files marked as removed</li>
</ul>

<p>This avoids:</p>
<ul>
  <li>Partial writes</li>
  <li>Corruption</li>
  <li>Complex locking</li>
</ul>

<h3 id="2-the-transaction-log-_delta_log">2) The transaction log (_delta_log)</h3>

<p>Every change creates a new file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_delta_log/
  000000.json
  000001.json
  000002.json
</code></pre></div></div>

<p>Each commit contains:</p>
<ul>
  <li>Files added</li>
  <li>Files removed</li>
  <li>Metadata changes</li>
</ul>

<p>Periodically, Delta writes checkpoint files (Parquet) that capture the replayed state, so readers don’t have to replay every commit from day one.</p>
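<p>A minimal sketch of what a checkpoint buys a reader (toy schema again; real checkpoints are Parquet files, and the interval shown is an illustrative assumption, not Delta’s default):</p>

```python
# Checkpointing sketch: every N commits, persist the replayed state
# so readers start from the checkpoint and replay only the tail.
CHECKPOINT_INTERVAL = 10  # illustrative value

def load_snapshot(checkpoint, checkpoint_version, commits):
    """checkpoint: set of live files as of checkpoint_version;
    commits: list of action-lists indexed by version number."""
    live = set(checkpoint)
    for actions in commits[checkpoint_version + 1:]:
        for action in actions:
            if "add" in action:
                live.add(action["add"])
            if "remove" in action:
                live.discard(action["remove"])
    return live

# Reader replays 2 tail commits instead of 12:
checkpoint = {"part-a.parquet"}
tail = [[{"add": "part-b.parquet"}], [{"remove": "part-a.parquet"}]]
print(sorted(load_snapshot(checkpoint, 9, [[]] * 10 + tail)))
# ['part-b.parquet']
```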

<h3 id="3-optimistic-concurrency-control">3) Optimistic concurrency control</h3>

<p>Instead of locks:</p>
<ul>
  <li>Read latest snapshot</li>
  <li>Prepare changes</li>
  <li>Validate nothing changed</li>
  <li>Commit</li>
</ul>

<p>On conflict: re-read the latest snapshot and retry.</p>
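<p>The commit loop can be sketched with a local filesystem standing in for the store. Atomic file creation (O_CREAT | O_EXCL) plays the role of the put-if-absent primitive; on stores that lack one, such as S3 with multiple writers, Delta relies on an external log store for that guarantee:</p>

```python
import os
import tempfile

log_dir = tempfile.mkdtemp()  # stand-in for _delta_log/

def try_commit(version: int, payload: str) -> bool:
    """Commit succeeds only if this writer creates version N first."""
    path = os.path.join(log_dir, f"{version:06d}.json")
    try:
        # O_EXCL makes creation atomic: exactly one writer wins.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another writer won; caller re-reads and retries
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True

def commit_with_retry(payload: str, latest_version: int) -> int:
    version = latest_version + 1
    while not try_commit(version, payload):
        # Conflict: in real Delta you'd re-read the log and re-validate
        # your changes still apply before trying the next version.
        version += 1
    return version

assert try_commit(0, "{}") is True
assert try_commit(0, "{}") is False   # second writer loses version 0
print(commit_with_retry("{}", 0))     # 1
```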

<hr />

<h2 id="the-commit-protocol-this-is-the-real-trick">The commit protocol (this is the real trick)</h2>

<p>Object stores are unreliable for coordination.
Delta works around this using:</p>
<ul>
  <li>Atomic file creation → commit = new JSON file</li>
  <li>Validation before commit → detect conflicts</li>
  <li>Retries instead of locks → scale horizontally</li>
</ul>

<p>No central coordinator. No database. Just files + discipline.</p>

<hr />

<h2 id="what-acid-features-you-actually-get">What ACID features you actually get</h2>

<ul>
  <li><strong>Atomic commits</strong> → commit file exists or not</li>
  <li><strong>Snapshot isolation</strong> → consistent reads</li>
  <li><strong>Time travel</strong> → query past versions</li>
  <li><strong>MERGE / UPDATE / DELETE</strong> → copy-on-write file rewrites</li>
  <li><strong>CDC (Change Data Feed)</strong> → incremental pipelines</li>
</ul>
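<p>Time travel falls out of the same replay rule: stop at the requested version instead of the head of the log. A toy sketch (same simplified action schema, not Delta’s real JSON layout):</p>

```python
def snapshot_as_of(commits, version):
    """Time travel = replay the log only up to the requested version."""
    live = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"])
            if "remove" in action:
                live.discard(action["remove"])
    return live

commits = [
    [{"add": "part-a.parquet"}],                                # version 0
    [{"remove": "part-a.parquet"}, {"add": "part-b.parquet"}],  # version 1
]
print(sorted(snapshot_as_of(commits, 0)))  # ['part-a.parquet']
print(sorted(snapshot_as_of(commits, 1)))  # ['part-b.parquet']
```

<p>This is also why VACUUM can break time travel: the replay still names the old files, but the files are gone.</p>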

<hr />

<h2 id="minimal-examples">Minimal examples</h2>

<h3 id="write">Write</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">write</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="s">"delta"</span><span class="p">).</span><span class="n">mode</span><span class="p">(</span><span class="s">"append"</span><span class="p">).</span><span class="n">save</span><span class="p">(</span><span class="s">"/mnt/lake/table"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="time-travel">Time travel</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="s">"delta"</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">"versionAsOf"</span><span class="p">,</span> <span class="mi">42</span><span class="p">).</span><span class="n">load</span><span class="p">(</span><span class="s">"/mnt/lake/table"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="merge-upsert">MERGE (UPSERT)</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">delta.tables</span> <span class="kn">import</span> <span class="n">DeltaTable</span>

<span class="n">tgt</span> <span class="o">=</span> <span class="n">DeltaTable</span><span class="p">.</span><span class="n">forPath</span><span class="p">(</span><span class="n">spark</span><span class="p">,</span> <span class="s">"/mnt/lake/table"</span><span class="p">)</span>

<span class="p">(</span><span class="n">tgt</span><span class="p">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"t"</span><span class="p">)</span>
 <span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">source</span><span class="p">.</span><span class="n">alias</span><span class="p">(</span><span class="s">"s"</span><span class="p">),</span> <span class="s">"t.id = s.id"</span><span class="p">)</span>
 <span class="p">.</span><span class="n">whenMatchedUpdateAll</span><span class="p">()</span>
 <span class="p">.</span><span class="n">whenNotMatchedInsertAll</span><span class="p">()</span>
 <span class="p">.</span><span class="n">execute</span><span class="p">())</span>
</code></pre></div></div>
<h2 id="where-things-start-breaking-at-scale">Where things start breaking at scale</h2>

<p>This is where most teams struggle:</p>

<h3 id="1-small-file-problem">1) Small file problem</h3>

<p>Too many small files → slow queries</p>

<p><strong>Fix:</strong></p>
<ul>
  <li>Compaction (OPTIMIZE)</li>
  <li>Target 100MB–1GB file sizes</li>
</ul>
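<p>In practice you run Delta’s OPTIMIZE and let it do the work. But the planning idea is simple enough to sketch: greedily bin-pack small files into rewrite groups that approach a target output size (the target below is an illustrative choice):</p>

```python
# Sketch of compaction planning: group small files into rewrite
# batches near a target output size. OPTIMIZE does this for real.
TARGET_BYTES = 128 * 1024 * 1024  # ~128 MB per output file (assumption)

def plan_compaction(file_sizes: dict[str, int]) -> list[list[str]]:
    groups, current, current_size = [], [], 0
    # Smallest files first: they benefit most from compaction.
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if current and current_size + size > TARGET_BYTES:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

small = {f"part-{i:04d}.parquet": 4 * 1024 * 1024 for i in range(64)}
print(len(plan_compaction(small)))  # 64 x 4MB files -> 2 rewrite groups
```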

<h3 id="2-_delta_log-growth">2) _delta_log growth</h3>

<p>Heavy writes → massive log</p>

<p><strong>Fix:</strong></p>
<ul>
  <li>Frequent checkpoints</li>
  <li>Monitor log size</li>
</ul>

<h3 id="3-high-write-concurrency">3) High write concurrency</h3>

<p>Too many writers → retries explode</p>

<p><strong>Fix:</strong></p>
<ul>
  <li>Partition-aware writes</li>
  <li>Queue + controlled writers</li>
  <li>Append + compact later</li>
</ul>

<h3 id="4-vacuum-risks">4) VACUUM risks</h3>

<p>VACUUM permanently deletes data files that the log no longer references.</p>

<p>If misused:</p>
<ul>
  <li>→ breaks time travel</li>
  <li>→ breaks downstream pipelines</li>
</ul>
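<p>The safety rule is worth internalizing: a file is only safe to delete if no live snapshot references it <em>and</em> it is older than the retention window (Delta’s default is 7 days). A pure-Python sketch of that check, with illustrative function and variable names:</p>

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600  # Delta's default retention window

def vacuum_candidates(all_files, referenced, removed_at, now=None):
    """removed_at: path -> epoch seconds when the log removed the file.
    Only unreferenced files past the retention window are deletable."""
    now = now or time.time()
    return {
        f for f in all_files
        if f not in referenced
        and now - removed_at.get(f, now) > RETENTION_SECONDS
    }

now = time.time()
files = {"part-a.parquet", "part-b.parquet", "part-c.parquet"}
referenced = {"part-b.parquet"}                        # live in snapshot
removed_at = {"part-a.parquet": now - 8 * 24 * 3600,   # removed 8 days ago
              "part-c.parquet": now - 3600}            # removed 1 hour ago
print(vacuum_candidates(files, referenced, removed_at, now))
# {'part-a.parquet'} - recent removals survive so time travel still works
```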

<hr />

<h2 id="trade-offs">Trade-offs</h2>

<h3 id="pros">Pros</h3>
<ul>
  <li>Works on any object store</li>
  <li>No central DB required</li>
  <li>Enables Lakehouse architecture</li>
  <li>Scales extremely well</li>
</ul>

<h3 id="cons">Cons</h3>
<ul>
  <li>Metadata overhead</li>
  <li>Operational complexity</li>
  <li>Retry-heavy under contention</li>
  <li>Requires discipline (not plug-and-play)</li>
</ul>

<hr />

<h2 id="what-i-actually-do-in-production">What I actually do in production</h2>

<ul>
  <li>Treat Delta as transaction layer, not storage</li>
  <li>Enforce file sizing + compaction</li>
  <li>Monitor _delta_log like a system metric</li>
  <li>Avoid high-concurrency small writes</li>
  <li>Be strict with schema evolution</li>
</ul>

<hr />

<h2 id="the-bigger-picture">The bigger picture</h2>

<p>Databricks didn’t make S3 transactional.
They accepted its limitations and built:</p>
<ul>
  <li>a log-based abstraction</li>
  <li>with immutable data</li>
  <li>and optimistic commits</li>
</ul>

<p>That’s it.</p>

<hr />

<h2 id="tl-dr">TL;DR</h2>

<p>ACID isn’t coming from S3.
It’s coming from _delta_log.</p>

<p>Files don’t define truth - the log does.</p>

<p>And once you understand that: You stop treating Delta like magic and start treating it like a system.</p>

<hr />

<p>If you’re building data platforms, exploring lakehouse architectures, or just curious about how modern data systems achieve reliability, connect with me on <a href="https://www.linkedin.com/in/guru-e/">LinkedIn</a>.</p>]]></content><author><name>Guru, Eerla</name></author><category term="delta-lake" /><category term="acid" /><category term="s3" /><category term="data-lake" /><category term="transaction-log" /><summary type="html"><![CDATA[I used to treat object stores like what they are: cheap, durable, and completely unreliable for transactional work. Great for dumping data. Terrible for updates, deletes, or anything resembling correctness.]]></summary></entry><entry><title type="html">The Real Cost of Data Observability</title><link href="https://eerla.github.io/data-engineering-blog/blog/2024/02/15/the-real-cost-of-data-observability/" rel="alternate" type="text/html" title="The Real Cost of Data Observability" /><published>2024-02-15T00:00:00+00:00</published><updated>2024-02-15T00:00:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2024/02/15/the-real-cost-of-data-observability</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2024/02/15/the-real-cost-of-data-observability/"><![CDATA[<!-- PASTE YOUR MEDIUM CONTENT HERE -->
<p>This is where your article from Medium will go. Just copy and paste the full content from https://medium.com/@think-data</p>

<p>The article should cover:</p>
<ul>
  <li>The data observability gold rush</li>
  <li>Hidden costs of observability tools</li>
  <li>What you actually need vs what vendors sell</li>
  <li>DIY approaches to data quality</li>
  <li>Real-world examples of cost-effective observability</li>
</ul>

<p>Copy your complete Medium article here, preserving all formatting, code examples, and insights.</p>]]></content><author><name>Guru, Eerla</name></author><category term="data-engineering" /><category term="observability" /><category term="data-quality" /><category term="monitoring" /><category term="cost" /><summary type="html"><![CDATA[This is where your article from Medium will go. Just copy and paste the full content from https://medium.com/@think-data]]></summary></entry><entry><title type="html">dbt Changed Data Engineering Forever</title><link href="https://eerla.github.io/data-engineering-blog/blog/2024/02/10/dbt-changed-data-engineering-forever/" rel="alternate" type="text/html" title="dbt Changed Data Engineering Forever" /><published>2024-02-10T00:00:00+00:00</published><updated>2024-02-10T00:00:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2024/02/10/dbt-changed-data-engineering-forever</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2024/02/10/dbt-changed-data-engineering-forever/"><![CDATA[<!-- PASTE YOUR MEDIUM CONTENT HERE -->
<p>This is where your article from Medium will go. Just copy and paste the full content from https://medium.com/@think-data</p>

<p>The article should cover:</p>
<ul>
  <li>The data transformation landscape before dbt</li>
  <li>How dbt revolutionized SQL-based transformations</li>
  <li>Key features that make dbt powerful</li>
  <li>Real-world examples of dbt implementations</li>
  <li>The future of data transformation with dbt</li>
</ul>

<p>Copy your complete Medium article here, preserving all formatting, code examples, and insights.</p>]]></content><author><name>Guru, Eerla</name></author><category term="data-engineering" /><category term="dbt" /><category term="transformation" /><category term="sql" /><category term="data-warehouse" /><summary type="html"><![CDATA[This is where your article from Medium will go. Just copy and paste the full content from https://medium.com/@think-data]]></summary></entry><entry><title type="html">You Don’t Need Kafka for Everything</title><link href="https://eerla.github.io/data-engineering-blog/blog/2024/02/05/you-dont-need-kafka-for-everything/" rel="alternate" type="text/html" title="You Don’t Need Kafka for Everything" /><published>2024-02-05T00:00:00+00:00</published><updated>2024-02-05T00:00:00+00:00</updated><id>https://eerla.github.io/data-engineering-blog/blog/2024/02/05/you-dont-need-kafka-for-everything</id><content type="html" xml:base="https://eerla.github.io/data-engineering-blog/blog/2024/02/05/you-dont-need-kafka-for-everything/"><![CDATA[<!-- PASTE YOUR MEDIUM CONTENT HERE -->
<p>This is where your article from Medium will go. Just copy and paste the full content from https://medium.com/@think-data</p>

<p>The article should cover:</p>
<ul>
  <li>Why Kafka became the default choice</li>
  <li>The complexity and overhead of Kafka</li>
  <li>Simpler alternatives for common use cases</li>
  <li>When Kafka actually makes sense</li>
  <li>Real-world examples of over-engineered messaging systems</li>
</ul>

<p>Copy your complete Medium article here, preserving all formatting, code examples, and insights.</p>]]></content><author><name>Guru, Eerla</name></author><category term="data-engineering" /><category term="kafka" /><category term="messaging" /><category term="architecture" /><category term="system-design" /><summary type="html"><![CDATA[This is where your article from Medium will go. Just copy and paste the full content from https://medium.com/@think-data]]></summary></entry></feed>