In this episode of Tech on the Rocks, Nitay and Kostas sit down with Shubham Baldava, co-founder of DataZip and creator of OLake, to trace the evolution of the modern open lakehouse — from the early days of Apache Hudi to today's Iceberg-centric world.
Shubham shares stories from a decade of data engineering at scale, including building near real-time pipelines at Japanese fintech giant PayPay, scaling a TikTok-style social platform at ShareChat from 10M to 160M monthly active users, and the cost and complexity pressures that pushed teams to adopt lakehouse architectures in the first place.
From there, the conversation digs into the table format wars: why Hudi was the early pick for truly open, vendor-neutral lakehouses, how Iceberg has caught up and pulled ahead on integrations, where Delta fits in, and what the Tabular acquisition means for the community. Shubham explains why he believes all the major formats are converging (single-file commits, deletion vectors, variant and geospatial types, Z-ordering) and why integration breadth, not features alone, is now the deciding factor.
The discussion then turns practical: what the four real pillars of a lakehouse are (ingestion, optimization, query, governance), why Debezium is so hard to replace, what it takes to hit 10-minute CDC latency for fintech reconciliation, and how OLake is rethinking ingestion with Arrow-based writes, exactly-once semantics built on Iceberg metadata, multi-phase compaction, and watermark-based parallel backfills.
Finally, Shubham looks ahead to a future where Iceberg becomes the single substrate for structured, semi-structured, and unstructured data — powering multi-engine analytics and AI workloads on top of formats like Lance and Vortex, now that Iceberg has decoupled from Parquet.
Topics covered:
• Lessons from PayPay, ShareChat, and indie app entrepreneurship
• Hudi vs Iceberg vs Delta — history, trade-offs, and convergence
• Why fintech reconciliation needs sub-10-minute CDC
• The real cost of running BigQuery, Trino, and Spark side by side
• Debezium's staying power and why Go (not Rust) for next-gen CDC
• How OLake uses Arrow, equality and positional deletes, and multi-step compaction
• The decoupling of Iceberg from Parquet and what Lance/Vortex unlock for AI
• Where to build in-house vs adopt managed lakehouse tooling