Blog
Real-world data engineering insights, no fluff.
If You're Not Letting AI Write Code, You're Already Behind - But Don't Hand It the Keys
AI is accelerating development, but it's also changing how knowledge is created, shared, and reproduced.
Not All Data Pipelines Fail - They Succeed with Wrong Data
Most data pipelines don't fail loudly. They fail quietly — and keep running. That's the real problem.
If You Think You Know Python, These Will Prove You Wrong
Most of us get comfortable because our code works, not because we fully understand why. And that illusion breaks the moment you hit edge cases that don't behave the way you expect.
You're Not Competing with AI - You're Competing with Engineers Who Use It
I’m not saying this after a weekend of trying AI tools. I’m saying this after 2 years of using Cursor consistently - while working a demanding full-time job. And I’ll be direct: The way most engineers are still writing code today is already outdated.
Airflow Works Best When It Does Less
If your Airflow tasks are doing real computation, your system is already mis-designed.
I Dug Into Delta Lake's Transaction Log - This Is How ACID Actually Works on S3
I used to treat object stores like what they are: cheap, durable, and completely unreliable for transactional work. Great for dumping data. Terrible for updates, deletes, or anything resembling correctness.
The Real Cost of Data Observability
dbt Changed Data Engineering Forever
You Don't Need Kafka for Everything
Airflow is Not a Data Pipeline Tool
If your Airflow tasks are doing real computation, your system is already mis-designed.
Batch > Real-Time (Most of the Time)
Your Data Lake is Probably a Swamp
Spark on GCP is Overkill - Use BigQuery Instead
If you’re running a persistent Spark footprint on GCP for analytics and ELT, you should at least ask whether you still need it. In my experience as a lead data/platform engineer, for the majority of analytics-heavy workloads BigQuery is faster to operate, cheaper to run (once tuned), and dramatically simpler to own than self-managed Spark clusters. Treat Spark as an occasional specialist tool - not the default.