Building the Open Lakehouse for the AI Era with Shubham Baldava from DataZip / OLake · Episode 28

· 58:14

Nitay (00:02.158)
Shubham, thanks for joining us. It's great to have you with us on the pod here with Kostas. So I'd love to start with you giving us some background about your experience and how that led you to what you're doing today.

Shubham Baldava (00:14.157)
Hey, Nitay, thanks for having me. Hello, Kostas, thanks again. So a quick background about me. I graduated in 2015. I have been a data engineer and a backend engineer for almost the last 10 years. I'm always interested in creating products, creating really good services which can help people out with their day-to-day tasks, or maybe something like data products.

So when I graduated, I launched a small train app in my college, which got around 5 million downloads and about half a million active users. I maintained it for almost five years and got decent ad revenue of 300 to 400K from it. I was just an indie developer; I did not have a team or anything. That was my first introduction to entrepreneurship. I was always surprised by how much you can learn from

creating something for people. After college, I joined a company called Works Applications. It was a Japanese company, primarily creating ERP solutions for Japanese giants like Panasonic and Mitsubishi. They had many Japanese customers. I worked in Singapore and Tokyo for almost three and a half years, primarily as a data engineer slash backend engineer. That was my first introduction to big data technologies like Spark and all of the others,

you know, Hive and other things. After that, I decided to move back to India because the startup space in India was booming. I joined an early-stage social media company out of India, kind of a competitor to TikTok, in 2018, 2019. I worked there for almost two years, primarily as a backend engineer. One of the main products they had was a short video app, which was like a clone of, or competitor to, TikTok in India.

So I saw the scale. When I joined, we had just 30 people. When I left, it was like 180 people. The monthly active users grew from like 10 million to almost 160 million. That's where I saw how important a role data can play in growing or scaling these massive applications. After that, I joined a company back in Tokyo called PayPay. PayPay is a fintech company in Japan that caters to almost half of the Japanese population. I got introduced to...

Shubham Baldava (02:36.717)
the first lakehouse technology I have ever used, called Apache Hudi. Hudi is still really popular. It's one of the three main lakehouses out there. That was back in 2020. I spent almost one year working on it. I knew at that point that it was definitely the technology that was going to change the landscape of how data is stored and how data is queried by different query engines. After that, because of COVID, I couldn't make a physical move to Japan. So I decided to stay back in India.

And I joined a gaming company in India, where fortunately I met my current co-founders, and we started DataZip. So DataZip is a company where we are creating a product called OLake. So yeah, that's a quick introduction about me. I'll go deep into what DataZip does.

Nitay (03:23.22)
Awesome. Before we go deep into DataZip, just taking us back to some of the experiences you mentioned with PayPay, ShareChat, et cetera, I'd love to hear a bit more about some of the data challenges you saw there and the limitations, but also the pros and cons of Hudi and some of the other solutions that you saw along the way, because it sounds like you've seen this wave multiple times.

Shubham Baldava (03:43.935)
Yeah, definitely. So I'll start with ShareChat. ShareChat had a data platform of their own. They were streaming billions of events every couple of hours directly into, first, the raw layer of the data, which was basic Parquet at that point. And then they were doing a bit of medallion. I will not say full-fledged medallion, but one or two layers of medallion, and

dumping the data into BigQuery. One of the main challenges that we saw as the scale grew was the cost. BigQuery as a tool is one of the most expensive ones out there. Even though we paid for fixed compute in BigQuery, it was still really, really expensive. And we were trying a lot of things to reduce the cost because of that. And I think one of the things that I miss,

today versus at that point, is that if there had been query engines like DuckDB or something easier, we could have just put DuckDB on top of the Parquet or any other layers and reduced the cost. So that was my first introduction. Although I did not work on the data platform much, we were the ones who were generating a lot of data, dumping it to the platform, and then using it for different purposes, like creator monetization or notifications and all of that.

We used to send, as I remember, close to 2 billion notifications a day. All of that was generated with data, based on the posts and different content we saw people creating. So that was my first really high-scale introduction to data, because the previous company was B2B, so there was no really high-scale data like that. Then when I joined PayPay,

I think the reason why I joined PayPay was that I've always wanted to work on data. I'm not a backend guy. I realized at that point that I don't want to create any more APIs. I want to work on creating data pipelines. And at that point I got introduced to Hudi. One of the most amazing things I saw was that Hudi was being used there and it was being queried by two main engines. One was Trino and one was Spark.

Shubham Baldava (05:58.573)
And then they were dumping the golden layer of data into BigQuery. So that was the high-level Hudi architecture at PayPay. I think one of the main things that the team I worked with did was to ingest the data into Hudi in near real time, with about 10 minutes of latency at that point. Back in 2020, it was still pretty challenging to get that down below an hour.

And we were able to achieve 10 minutes. We were so happy, and we presented it to the whole company, because there were a lot of issues at that point. It was not the most mature of lakehouses out there. And I personally worked on a reconciliation platform on top of Hudi. It's a fintech company. It's basically like WeChat, you know, the payments app in China; it's similar in Japan.

So they had massive reconciliation jobs, and we were having a difficult time because people from different teams know just basic SQL, and we wanted to give them a uniform interface so that they could reconcile between multiple payment gateways and their own data. So I worked on a simple YAML-based framework so that other people can come and write the YAML config when they want to run reconciliation jobs, and it would just start automatically.

So that was my primary project. When I left, people were running almost 70 big reconciliation jobs. And I think right now it's running around 400 to 500 jobs a day, which is a good scale for PayPay, because each job easily scans a couple of terabytes, given the massive scale they have right now. So yeah, that's the basics of what I did there before I jumped into doing something of my own.

Nitay (07:56.67)
And why, if I may ask, because I think this potentially ties to some of the stuff you will get into with DataZip: why is it so hard to get data into Hudi in the right form and do it fast? Why is ingestion in and of itself such a difficult problem?

Shubham Baldava (08:12.127)
I think it was primarily because of the scale and the middleware we were using. I think at that point we were using Debezium, Kafka, and Spark Streaming. At some point, I think we also tried Glue jobs. One of the main reasons was that we were trying to manage everything on our own, like the Kafka clusters and the Spark Streaming jobs.

And the interface of Spark Streaming with Hudi was still evolving at that point. So many times it would just break out of nowhere if we kind of reduced the frequency. We figured out a couple of issues with Hudi and raised them with the Hudi project. And then eventually that got a little more stable over time. I think the reason was primarily that Hudi was maturing, and the Spark Streaming interface was maturing at that point. It was not the best

at that point, but eventually it got more stable. And I think we also got better at managing Kafka at that scale, because we were actually growing almost 10 to 20% month on month at that point. The scale was increasing a lot, the reconciliation jobs were increasing a lot. I think eventually, in six to eight months, we made it more stable. I don't think it had to do with any missing part of the technology; things just take time to get more mature.

You'll be surprised: we were a team of almost eight to 10 data engineers, and three of them were working just on this ingestion piece to get this down to almost 10 minutes.

Kostas (09:51.374)
Why is it important to get the ingestion like this? The latency. Why do you need less than 10 minutes?

Shubham Baldava (10:00.013)
So it's a fintech platform. They process massive amounts of payments. I think one of the main reasons to get it below one hour was a couple of things, like identifying as fast as we can if there is any problem with payments going wrong. There are also a couple of other things, like if there is a slight increase in some fraud kind of activities,

we have to detect it even faster. And also, a lot of it is just plain basic data engineering and analytics, like dashboards. People wanted to see the status even faster. Initially we started with a couple of hours, then we went down to an hour, and then eventually to 10 minutes. So I think it's just the nature of the business. In finance, there are a couple of cases like reconciliation that you can still run at T minus one, but frauds,

Kostas (10:54.072)
Yeah.

Shubham Baldava (10:58.349)
errors, or something like a payment gateway going down are different, because it's not only us; we also have to monitor banks. So if some bank's payment gateway is going down, we need to detect it and divert transactions somewhere else, or maybe just cut off traffic from there.

Kostas (11:14.328)
Yeah, makes sense. And can you tell us a little bit more about the use case of reconciliation? What does that mean in the fintech context, right? And why is it so complex that it requires so much infrastructure to actually do it?

Shubham Baldava (11:36.653)
So as I mentioned, they were using three query engines or query layers. One was Spark, one was Trino, and one was BigQuery. For reconciliation, as I mentioned, it's okay to run it every 12 hours, 24 hours, or sometimes even six hours. The freshness of the data is not that important. What do I mean by reconciliation? Imagine you have banks reporting to you,

okay, this many transactions have happened from, you know, hundreds of thousands of users who are using that bank. Right. And then imagine I'm paying my gas bill with PayPay, or my restaurant bill with PayPay. That money basically goes from my bank account into my PayPay wallet, and then from the PayPay wallet to the next person's account. Right. So

I think reconciliation is really important because that money has to match, from my source bank account to my PayPay wallet, and from my PayPay wallet to the next person's account. All of this has to reconcile and net to zero. That's the main goal of reconciliation. And I think they had multiple services, multiple teams working on multiple products. With PayPay, I can simply pay other people, I can pay businesses,

I can even pay bills and all of that. So there were multiple products shaping up, and it was just growing. They were adding a lot of features. All of these products' data has to match. I think they now also do lending, so that is also one of the major things for reconciliation. And the main problem was not that we could not reconcile. The main problem was giving all of these smaller teams access to the data platform

for them to reconcile their own data. If I'm managing the restaurants team, I need to reconcile all the restaurant bills, right? Now the main problem is that these backend engineers, or developers, whoever they are, know at most basic plain SQL. They don't know how to write Python scripts to do all of that. It's just difficult for them to understand what a PySpark script would look like.
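To make the "has to net to zero" goal concrete, here is a minimal Python sketch of the check. It is an illustration only, not PayPay's implementation: the record layout, the `wallet:` and `bank:` account prefixes, and the function names are all assumptions made for this example.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical transfer legs: money leaves `src` and arrives at `dst`.
# Account names and amounts are invented for illustration.
transfers = [
    {"txn": "t1", "src": "bank:alice", "dst": "wallet:alice", "amount": Decimal("50.00")},
    {"txn": "t1", "src": "wallet:alice", "dst": "bank:restaurant", "amount": Decimal("50.00")},
    {"txn": "t2", "src": "bank:bob", "dst": "wallet:bob", "amount": Decimal("20.00")},
    {"txn": "t2", "src": "wallet:bob", "dst": "bank:gas_co", "amount": Decimal("19.00")},
]


def wallet_residuals(transfers):
    """Return, per transaction, the net amount left in wallet accounts."""
    net = defaultdict(Decimal)  # Decimal() is zero
    for leg in transfers:
        if leg["dst"].startswith("wallet:"):
            net[leg["txn"]] += leg["amount"]  # money entering a wallet
        if leg["src"].startswith("wallet:"):
            net[leg["txn"]] -= leg["amount"]  # money leaving a wallet
    return dict(net)


def unreconciled(transfers):
    """Return the transactions whose wallet legs do not net to zero."""
    return {txn: amt for txn, amt in wallet_residuals(transfers).items() if amt != 0}


print(unreconciled(transfers))
```

In this toy data, `t1` nets to zero and drops out, while `t2` is flagged because 20.00 entered the wallet and only 19.00 left it.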

Shubham Baldava (13:48.269)
So what we decided was, why don't we give them a simple YAML interface, because everybody knows YAML and most people know SQL. It would work like an Airflow configuration. Imagine the bank tells you that, okay, now the bill is available, all of the transaction PDFs are available on some SFTP servers at a certain time of day, right? They will...

they will put that time in the YAML. That is detected by Airflow, the reconciliation job will trigger, and it would match the balances with a simple basic SQL query. We did all of that in YAML, and that whole YAML thing is kind of a reconciliation platform that people are using. We kept on adding more and more features to it.
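The config-driven idea described here can be sketched roughly as follows. This is a hedged illustration, not the framework built at PayPay: the job definition is shown as the Python dict a YAML parser would produce, every key name in it is hypothetical, and SQLite stands in for the real query engine so that the example is self-contained.

```python
import sqlite3

# Hypothetical job definition, as a YAML parser would load it.
# All key names are assumptions made for this sketch.
job = {
    "name": "bank_a_daily",
    "schedule": "06:00",  # when the bank's statement is expected; unused here,
                          # a scheduler such as Airflow would act on it
    "source_table": "bank_statement",
    "target_table": "wallet_ledger",
    "key": "txn_id",
    "amount_column": "amount",
}


def build_query(job):
    """Build SQL listing source rows that are missing or differ in the target.

    Identifiers are interpolated directly, which is only acceptable because
    the config is assumed to be trusted, reviewed input.
    """
    return (
        "SELECT s.{key}, s.{amt}, t.{amt} "
        "FROM {src} s LEFT JOIN {tgt} t ON s.{key} = t.{key} "
        "WHERE t.{key} IS NULL OR s.{amt} <> t.{amt} "
        "ORDER BY s.{key}"
    ).format(
        key=job["key"],
        amt=job["amount_column"],
        src=job["source_table"],
        tgt=job["target_table"],
    )


def run_job(conn, job):
    """Run the generated query and return the mismatching rows."""
    return conn.execute(build_query(job)).fetchall()


# Toy data: a2 has a different amount, a3 is missing from the ledger.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bank_statement (txn_id TEXT, amount INTEGER);
    CREATE TABLE wallet_ledger  (txn_id TEXT, amount INTEGER);
    INSERT INTO bank_statement VALUES ('a1', 500), ('a2', 1200), ('a3', 300);
    INSERT INTO wallet_ledger  VALUES ('a1', 500), ('a2', 1000);
""")
mismatches = run_job(conn, job)
print(mismatches)
```

The point of the design is that a team only edits the job definition; the SQL generation and scheduling stay inside the platform. Note that this sketch checks one direction only (source rows against the target), so a real job would also need the reverse check.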

Nitay (14:38.762)
And so maybe related to the reconciliation topic, you mentioned you guys were using Spark and Trino and BigQuery. Why all three? And how did you decide what used what engine and how was data shared or collaborated between them? Like I imagine Spark and Trino could use the same underlying data set, but then BigQuery, like what was the sync process between them? Like I'm a new engineer. Why is there all three? What do I do?

Shubham Baldava (15:01.439)
Yeah, I thought the same when I joined: why so many things? I think Spark was primarily for the longer-running jobs. As I mentioned, reconciliation used to run primarily on Spark. Trino was like a cheap version of BigQuery because it was maintained internally. And BigQuery was for the really, really important dashboards and highly interactive queries, like for the end analysts who are

writing queries and waiting for results. So usually we used to divide things like, okay, this is really important, this dashboard needs to load, so we used BigQuery for that. For normal queries, even if they take some time, we used Trino, and for longer-running jobs, Spark. Now, Trino and Spark used to query Hudi, but for BigQuery we had to re-sync the data, the gold, the final layer of the medallion, into BigQuery, because,

you know, BigQuery was not supporting querying Hudi at that point.
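The division of work described above can be summarized as a small routing rule. The Python sketch below is only a restatement of that rule of thumb; the function name, its two boolean inputs, and the storage labels are assumptions for illustration, and nothing in the conversation suggests the routing was automated this way.

```python
def pick_engine(latency_critical: bool, long_running: bool) -> dict:
    """Pick a query engine following the rule of thumb from the discussion.

    - long-running batch jobs (for example reconciliation) -> Spark
    - important dashboards and interactive queries         -> BigQuery
    - everything else, where some waiting is acceptable    -> Trino
    """
    if long_running:
        # Spark read the Hudi tables directly.
        return {"engine": "spark", "reads_from": "hudi"}
    if latency_critical:
        # BigQuery could not query Hudi then, so it read a synced gold copy.
        return {"engine": "bigquery", "reads_from": "gold copy synced into BigQuery"}
    # Trino, maintained internally, also read the Hudi tables directly.
    return {"engine": "trino", "reads_from": "hudi"}


print(pick_engine(latency_critical=True, long_running=False))
```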

Nitay (16:07.948)
Got it. So that kind of two-way sync and all that, you had to build yourselves and manage yourselves. So as you were thinking about these problems, you said something interesting before, in terms of, well, it wasn't Hudi's fault, they just weren't necessarily prepared for this yet, but as you guys filed tickets and so on, the system overall got better. I'm curious, from your perspective, how did you think about

which things you should build internally and own, where you're going to differentiate and invest, versus which things, given enough time, the community will do for you? How did you prioritize, and which things did you end up tackling?

Shubham Baldava (16:51.405)
To be fairly honest, I was not much involved in that side of the process, but I think it was primarily driven by the motivation of reducing costs, and that was the sole driver. PayPay was operating at a really high scale of transactions, and the amount of data was pretty huge. And we did some...

calculations. I think before I joined, they had done some calculations showing that if they used BigQuery fully, it would cost them a ton of money. So they decided, why don't we build at least the raw and bronze and some part of the silver in Hudi, and then eventually get to the golden layer in BigQuery or Trino, right? So I think the primary driver was cost.

And I think that's the reason why we decided to adopt Hudi, because it was just out of Uber at that point. Uber had donated it to the Apache Foundation. I think the lead or the manager of the team was in contact with somebody from Hudi's team. They had worked together previously. And that's one of the reasons why we decided to give it a try, give it a shot. It worked for us right away at a couple of hours of latency. It worked amazingly. Then...

we ran into some problems while going down bit by bit to a couple of minutes, like 10 minutes. And I think that's where we decided, why don't we just contribute back to Hudi? So I think one or two people in the team were also contributing back to Hudi to solve our own specific issues.

Kostas (18:30.654)
A question that will help our audience understand a little bit better how the landscape was back then with these technologies. You mentioned Hudi and the reasons for going with Hudi, but there were other technologies out there. There was Delta, there was Iceberg.

Why?

Why Hudi compared to the rest? Outside of, let's say, as you say, having the connection with the team and knowing the people and having access to the team that's building this thing. What was the need at the end of the day for something like Hudi when two other table formats were already out there? Why did we need a third one?

Shubham Baldava (19:35.469)
I think if you follow the history of the formats, Hudi was the first one to go fully open source. As far as I remember, Delta was first to publish, I think, in terms of the timeline, but Hudi gave people access to a lot of good features. One of the good features was DeltaStreamer.

What DeltaStreamer used to do is take the CDC streams from something like Debezium and dump them directly into a Hudi table. Delta, I think, was really tightly coupled with Databricks, so they open sourced it just for the sake of making it open source. At least at that point, it seemed like that. To be fairly honest, I was not the decision maker, but I'm working on Iceberg right now, so I know the history a bit.

And I remember, I think we had some internal discussions at that point. Iceberg was not in the picture at that point. The first stable version of Iceberg, the V1 of Iceberg, came out in like 2021. I think, no, sorry, 2019 or 2020, somewhere around that. But Hudi launched before Iceberg, by the way. And I think one of the main reasons why we chose Hudi, I feel that it has to be this, is that it was the

only truly open source, no-vendor-lock-in kind of option back then. Iceberg was not at full maturity or fully in the picture. Delta was really tightly coupled with Databricks. So I think that's the main reason.

Kostas (21:15.448)
Makes sense. So the other thing about these systems is that, and I hear you talking about it, and it's kind of interesting because we are talking primarily for a table format. We can talk about what the table format is and why we need it. But when you talk about Hoody, for example, and your use case,

You don't talk that much about the tables as much as you talk about some services around them, right? You mentioned, for example, Kafka, you mentioned Debezium, you mentioned Spark Streaming, right? Now, if we want to be precise, a table format has nothing to do with all these things, right? The table format is how you organize. It's primarily around metadata, like how you find

and organize the data on your storage. So what do...

these services deliver, why do you need them, and why do they need to be together with the table format? Why can't we say we have any table format we want? Let's call it Nitay or Kostas or whatever. It doesn't really matter. And we can reason about these. We can decide why to use one or the other in terms of how we store and access our data on our storage,

and whatever is around it is fine, right? We can pick whatever we want. Maybe I don't want Spark Streaming. I prefer to have Flink, or, I don't know, I want to have my own thing that I built in-house. But to me at least, it doesn't feel like that. There's always the conversation where we start talking about the table formats, but we...

Kostas (23:22.2)
quite fast end up talking about other stuff around it, right? That is related with the use cases that we have. Does it even make sense? Is that how you also experience working with these technologies? Or there's something else out there?

Shubham Baldava (23:47.485)
So let me try to answer this. Please pause me and tell me if I'm going in the wrong direction. One of the really, really big reasons why we talk about all of this, like Spark, Flink, Kafka, Debezium, and all of that, is that table formats like Iceberg or Hudi are just a spec.

Kostas (23:53.934)
Yeah.

Shubham Baldava (24:11.817)
It is just a standard defined by a certain community like Apache or some organization. It's basically like a Type-C port. You can define the Type-C charger specification, and it can be used to charge an iPhone or a MacBook. And you still need to have the resources or servers around this to make sure that data goes into the spec

properly and data is queried through the spec properly. Now there are three parts to this, actually four parts. The first part is to get the data in, and to get that data in, you need to follow the spec. Because it's just a spec, there are multiple ways you can get the data in, right? The standard open source, OG way of doing things is Debezium and Kafka with Flink, or Debezium and Kafka with Spark Streaming or Spark,

whatever suits your use case, if you're primarily talking about CDC or delta kind of data. Another way people usually do this is that they have a lot of events data, like backend or frontend events, coming into Kafka. Then they use Flink to do in-flight analysis or streaming analytics and then dump the data into Iceberg. Iceberg is the data at rest.

I would call it that. You use it to dump the data and then eventually do near-real-time analytics and machine learning and all of that stuff. So that is the first part. The second part is that this needs constant optimization. Over time you would end up creating a lot of small files if you're dumping the data at much shorter intervals. I mean, one of the reasons why at that point,

Hudi is good at really high-frequency data today, but at that point it was not that good. They have optimized it a lot since Onehouse, the company behind Hudi, got created. So you need to constantly do the optimization, which means compaction: you need to combine the many small files, and you need to remove the duplicates from upserts; you need to de-duplicate those.

Shubham Baldava (26:37.355)
And you need to basically create those snapshots, because most of these formats support time travel. So that's the second part. The third part is query. Now you can query it using any tool in the ecosystem, because the whole purpose of a lakehouse is to give you one platform to query using any query engine. And today, almost every lakehouse out there is supported by most of the data warehouses or query engines. So these are the three main things.

The fourth thing, which I don't want to go deep into, is governance, where you need to make sure that the query engines and the people using them are not accessing tables they are not supposed to. So these are the four things. And that's the reason why we keep hearing about so many tools around this, because the formats don't have tools of their own.
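The second part above, compaction, can be illustrated with a toy Python sketch: many small batches of change records are merged into one, keeping only the newest version of each key and dropping deleted keys. This is a conceptual model only. Real Hudi or Iceberg compaction works on Parquet files and table metadata, and the record fields used here (`id`, `ts`, `op`) are assumptions chosen for the example.

```python
def compact(small_files):
    """Merge small batches into one, keeping the newest record per key.

    Each "file" is a list of change records. `op` is "c" (create),
    "u" (update) or "d" (delete); `ts` orders the changes.
    """
    latest = {}
    for records in small_files:
        for rec in records:
            current = latest.get(rec["id"])
            # Later timestamps win; ties go to the record seen last.
            if current is None or rec["ts"] >= current["ts"]:
                latest[rec["id"]] = rec
    # A key whose newest change is a delete disappears from the output.
    survivors = [r for r in latest.values() if r["op"] != "d"]
    return sorted(survivors, key=lambda r: r["id"])


# Three small files written by frequent ingestion commits (invented data).
small_files = [
    [{"id": 1, "ts": 1, "op": "c", "balance": 100},
     {"id": 2, "ts": 1, "op": "c", "balance": 40}],
    [{"id": 1, "ts": 2, "op": "u", "balance": 90}],
    [{"id": 2, "ts": 3, "op": "d", "balance": None},
     {"id": 3, "ts": 3, "op": "c", "balance": 10}],
]

compacted = compact(small_files)
print(compacted)
```

Five change records spread over three files collapse to two live rows: key 1 with its updated balance and key 3. Key 2 is gone because its latest change was a delete.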

Kostas (27:22.851)
Mm-hmm.

Yeah, makes sense. And so I'd like to start getting into what you are doing today. Today you are working with Iceberg, right? So how did you decide to go from Hudi to Iceberg?

Shubham Baldava (27:46.477)
To be fairly honest, let me give you a little bit of context on why we arrived at Iceberg. The first product that we created in the company was called DataZip OneStack. The word OneStack means that it's one product which has data ingestion, data transformations, data warehouse, and data governance, right?

We created this because in my previous job, when I was trying to hire good data engineers, there were not many available in the market. And trust me when I say this: I'm in India, we have tons of engineers here, but it was difficult to find people who knew core data engineering. And if we were facing this problem while working at a really big company, with 1,000-plus, maybe 5,000 employees,

we thought, what about SMBs? They must be dying to get their data sorted, right? So we created a simple product where people like data analysts can start with the data pipelines themselves. And we were using ClickHouse as the data warehouse with that, because ClickHouse was a simple, single-server warehouse. It's easy to set up, easy to use. And then we realized, after two and a half years of building that product,

that we had a graduation problem. When the data reached more than a couple of terabytes, we used to run a ClickHouse machine with a terabyte of RAM, because ClickHouse consumes a lot of RAM. With a couple of terabytes of data in the ClickHouse format, we were basically graduating out of the machine sizes, because you practically can't get a machine with more than a terabyte of RAM. And at that point, ClickHouse did not have storage-compute separation, right?

That was one of the main reasons why we thought, you know, we need to do something better. After that, we started talking to a lot of people. This works for SMBs, but now we have to grow beyond that revenue and we need to earn more. So we talked to almost 150 people, including my ex-colleagues at PayPay. This happened almost three to four years after I left PayPay, right?

Shubham Baldava (30:04.173)
And at that point, one thing we realized was that Hudi had a lot of buttons to push when it comes to optimizing for some use cases. Hudi is really good for really, really high-frequency data ingestion, really high-frequency streaming writes. But Hudi has a lot of optimization work to be done if you just want to do basic data engineering or data analytics on top of it.

That was one of the main pieces of feedback from the different people using Hudi. So we had to replace ClickHouse, and the only logical way at that point seemed to be Iceberg, because Delta is, again, strongly tied to the Databricks ecosystem. Even at that point, I remember specifically that Databricks had not open sourced some of the proprietary functions or features, like, I think, deletion vectors or a couple of other things.

So after talking to a lot of people, we decided that Iceberg seems to be that neutral ground which gives us the best of both worlds. You can optimize for ingestion, and you can also optimize for reads and writes. And that's easy. It's not a too-many-buttons-to-push kind of thing. So that's one of the main reasons why we chose Iceberg.

Kostas (31:23.374)
So today, in 2026, if I have a use case which is, let's say, heavy event-type data, like CDC or events, what do I do? Do I pick Hudi? Do I pick

Iceberg? How have things matured? Because I remember, back in the day, this streaming integration kind of use case was always the weak spot for something like Iceberg. And it doesn't have to do with the spec. Again, it has to do with the tooling around the spec, right? The implementation of the spec. And probably I would say

maybe Iceberg was at a bit of a disadvantage there compared to Delta, for example, because Delta had Spark and Spark had the streaming capabilities. Although someone can argue that it's not the best experience for a developer to work with, and all these things. But I'm sure things have changed. I mean, first of all, Tabular got acquired by

Databricks. So what do we do today when we want to work with this type of data and we want to deliver it on a data lake architecture?

Shubham Baldava (32:59.565)
You're definitely right. In my opinion, all the formats are converging, because all of them are adding the same features. Previously Delta had Z-indexes; now Iceberg has them, and Hudi has already added them. A couple of other features: Delta had deletion vectors, and now Iceberg has introduced them in V3.

Variant is one of the highlights, because it got introduced in V3, and geospatial types, geometry types, are also there in V3. I think all of the features that you usually need for solving the major data engineering use cases are there in all the formats, or eventually the formats are going to converge on having most of the same features, with just

a few nitty-gritty differences on the spec side. If I'm a data engineer sitting on the other side of the table and I'm going to choose a technology today, I would choose the technology which has the most integrations in place. And that technology today is Iceberg. I'm not saying this because I'm developing for Iceberg. I'm saying it because one of the reasons why we chose to develop for Iceberg is that every warehouse, every

query engine out there supports Iceberg today. Hudi does not have as many integrations in place as Iceberg. For example, if you have data in Iceberg, you can query it using Databricks, or you can also query it using Snowflake. Snowflake also added support for Delta just a quarter back, if I remember correctly. But I think the number of query engines supporting Iceberg is still even higher than for Delta and

Hudi. And one of the other reasons is, if it's a war between formats, I would like to be Switzerland. If it's a world war, I would like to be Switzerland. It's neutral ground for almost every country out there, right? And I think that's Iceberg today. Yeah.

Nitay (35:12.608)
And how should people be thinking, just tacking on top of that and tying it to some of the conversations we had before, given where the world is today and given what you're saying about optimizing for integrations and optimizing for flexibility: what aspects should they be doing themselves internally? What aspects should they be using some vendor like DataZip for? Now there's a whole world of vendors on top of open source formats that wasn't there before with Hudi, certainly not in the early days you were talking about.

So where are the lines today, and how do you see them going forward?

Shubham Baldava (35:47.081)
I would say a lot of these vendors are trying to converge multiple features into one offering. One of the big examples of that, if you have heard about it, is the recent acquisition, well, not acquisition but merger, of Fivetran and dbt. If you read the post by dbt's CEO or CTO, I don't remember the name, I think one of the main reasons they're doing this is that they want to create a Databricks kind of platform on top of Iceberg, with Fivetran ingesting into Iceberg and dbt as a transformation layer on top of that.

So what I would suggest, especially if you would like to go for Iceberg, is to first figure out what the end use case is. I would strongly suggest that you use some level of open source tools like DataZip. We are also going to add extra features: we are adding table optimizations in the next month, and we have been working on that feature for almost two months. So DataZip can be a really good layer for the CDC and streaming compaction side of things.

One of the highlights we are introducing in that compaction is this: if you ingest data and compact it in parallel with the plain vanilla compaction offered by Iceberg, one of the two might fail because of a conflict. We are developing it in a way that the conflict would not be there, and you can do both activities in parallel. And we are highly optimizing just for ingesting data from multiple databases, Kafka, or S3. So I think you can definitely use OLake as a stack for ingesting and compacting the data.

Then you need to figure out what your catalog would be, because the catalog is something that has to do with the cloud provider you're using. If you're using AWS, Glue is definitely one of the good ones. If you're using Snowflake, maybe you can go for Polaris, because Dremio and Snowflake are developing Polaris. Or you can use the Snowflake-managed version of Polaris. I think they call it Horizon or something like that. So now you have to figure out what your query engine could be, what could be your

Shubham Baldava (38:13.549)
data. So there are multiple factors to this. What is the end use case? How real-time do you want that data to be? How much data engineering bandwidth do you have? Would you be able to maintain your Trino clusters on your own, or would you go for something fully managed like Athena or Snowflake? Or if you're already using Databricks, why don't you ingest the data directly into Unity? So all of this matters a lot.

So I would strongly suggest that for the first side of the equation, where you don't want to go for Debezium, Kafka, and Spark Streaming, you go for OLake and OLake compaction, because we make it really easy. For the second part of the equation, you can go for anything, depending on the engineering bandwidth and the use case you have. If you're already using Snowflake and have paid for it, you can use that, or Databricks. So I think that would be my answer.

Nitay (39:07.79)
And you mentioned a few more choices there, right? We don't have to go deep into every single one, but there's Athena, there's a bunch of serverless stuff, et cetera, like you said. It's funny, because of what we said earlier: you guys had Spark and BigQuery and Trino, and you said, yeah, I came in and didn't understand why we had all three. Now an engineer comes in and there are like 2,000 options. It's not even three anymore, right? For each part and aspect of the data stack, there are many, many vendors.

So help the audience understand: what are the right times to use Iceberg and bring in DataZip, and more specifically, what are the times where it's not the fit and you should be looking at other options?

Shubham Baldava (39:48.813)
Yeah, so I'll start from the ingestion side. I think one of the reasons why you would want to use OLake, or DataZip's platform, is that we support all of these databases. We are one of the fastest in the industry. We are almost 20 times faster than vanilla Debezium.

You want to go for us if you want high throughput, if you want an easy UI, if you want faster and correct data. What I mean by correct data is that we support exactly-once ingestion into Iceberg, and we also support schema evolution. We support Arrow writes. So we are fully optimized for Iceberg.

You would want to go for OLake because, if you use Debezium or the other open source options, you have to do the sink part yourself. You would either use Kafka Connect, or Spark Streaming, or Flink, where you have to maintain all of this yourself, like schema evolution and some level of backfill. If you add a new table to the sync, you need to do the historical load yourself. And that is difficult with Debezium: out of the 100 to 150 calls we did, we had 40 or 50 people using Debezium, and most of them were complaining that it's pretty difficult to maintain as a tool. So that's there. And even Flink has some level of CDC inside, called Flink CDC, that also uses Debezium. So again, it's a bit difficult to maintain and manage. If you want to pay for it, you can go for paid options like Fivetran, or there are a couple of other tools like Estuary.

But I think one of the problems is that if you're operating at really high ingestion volume, that gets really expensive really quickly. And I think one of the main differentiators between us and them is that we are also going to launch a managed offering soon, and the cost would be three to five times cheaper than the other players or tools in the market, because we are going to charge by the number of GB synced, not the number of rows synced, which is the standard

Shubham Baldava (42:10.637)
costing model for the other tools. That's the first part, where I'm talking about data ingestion. To answer your second question, which was that there are too many tools to choose from, like 2,000 tools on the query engine side of things: to be honest, there are a lot of good options as well. It just depends on how much engineering effort you can put into managing that and what the use case would be.

So I'll tell you one of the really interesting use cases we are solving for a company based in, I think, Seattle, called Cordial. Cordial is a notification company. They send out billions of notifications every week. They have massive scale. We, as OLake, helped them ingest the data from their MongoDB into Iceberg. And now they run DuckDB on top of this and send personalized notifications to you.

Because DuckDB is pretty cheap to spin up, you can spin it up on very small machines, like 16 GB or 32 GB machines. Imagine you have purchased your last 10 products: they would query those 10 products quickly using DuckDB, and then send you a personalized notification using AI.

So I think it depends a lot on what your use case would be and how much money you can spend. You can also go for Snowflake if you have tons of money. If you have engineering bandwidth, you can go for Trino or something like that.

Nitay (43:45.312)
And you said a couple of interesting things there. You called out exactly-once semantics. You called out this theme of saving money, which is obviously a recurring theme, and doing things in a particular manner. I'm curious, as you guys roll out those sorts of features, how much does it require the application level to be built in particular ways? For example, in many other systems, famously, in order to get exactly-once semantics, the application has to be idempotent. It has to have these kinds of semantics, et cetera.

So either with that or with the cost savings you're doing, what is imposed on the user? How much does there have to be that marriage between your infrastructure and their application, versus how generic is it? Walk us through a little of those details.

Shubham Baldava (44:31.405)
So when we started OLake, the whole vision was to develop this just for Iceberg. The reason is that we strongly believe Iceberg is going to be the de facto storage format in the next two or three years. It has already kind of defeated the other formats, if you look at the adoption of Iceberg, right? So every feature that we create, for example the 2PC (two-phase commit) we are creating, we have really optimized for Iceberg.

The way 2PC works in OLake is that we write the second phase of the commit, the metadata of the commit, into the Iceberg metadata itself. So when we start the sync again, we cross-verify from the Iceberg metadata whether this part of the ingestion has already happened or not. So something like that.
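The idea of recording ingestion progress in the table's own commit metadata can be illustrated with a small sketch. This is not OLake's implementation: `FakeTable` is an in-memory stand-in for an Iceberg table, and the `sync.checkpoint` summary key is a hypothetical name chosen for illustration. The only Iceberg fact relied on is that each snapshot carries a string-to-string summary map that is committed atomically with the data.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    rows: list
    summary: dict  # mirrors the string-to-string summary map on an Iceberg snapshot


@dataclass
class FakeTable:
    """In-memory stand-in for an Iceberg table (not a real client)."""
    snapshots: list = field(default_factory=list)

    def commit(self, rows, summary):
        # One atomic step: data and checkpoint become visible together.
        self.snapshots.append(Snapshot(list(rows), dict(summary)))

    def last_checkpoint(self):
        # Walk back to the newest snapshot that recorded a checkpoint.
        for snap in reversed(self.snapshots):
            if "sync.checkpoint" in snap.summary:  # hypothetical key name
                return int(snap.summary["sync.checkpoint"])
        return -1

    def all_rows(self):
        return [row for snap in self.snapshots for row in snap.rows]


def sync_batch(table, batch_id, rows):
    """Apply a batch unless the table metadata says it already landed."""
    if batch_id <= table.last_checkpoint():
        return False  # replay after a crash or retry: skip it
    table.commit(rows, {"sync.checkpoint": str(batch_id)})
    return True
```

Because the checkpoint lives in the same commit as the data, a restarted sync can replay a batch safely: the second attempt sees the recorded checkpoint and does nothing, which is what gives the exactly-once effect in this toy model.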

And if I have to talk about a couple of other things, the next feature on the roadmap is a Debezium kind of watermarking strategy. Imagine you have 10 tables being synced in a CDC way, and you add one more table, which is a really huge table. You don't want to pause the CDC because of the historical snapshot that's going on.

So what we would do is run the historical snapshot while, in parallel, the CDC for this table specifically lands in a temporary table in Iceberg. We load the main data from the source table into the Iceberg table. Then, once the historical snapshot is done, we take this temporary Iceberg table and apply its CDC changes to the Iceberg table as an upsert.
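As a rough illustration of that flow, the sketch below loads a snapshot and then replays buffered change events as upserts. It is a toy model under stated assumptions: tables are plain dictionaries keyed by primary key, and the `(op, key, row)` event shape and function names are hypothetical, not OLake's API.

```python
def apply_upserts(target, changes):
    """Apply CDC changes in arrival order; each change is (op, key, row)."""
    for op, key, row in changes:
        if op == "delete":
            target.pop(key, None)  # deleting a missing key is a no-op
        else:  # "insert" and "update" both behave as an upsert
            target[key] = row
    return target


def backfill_with_buffered_cdc(snapshot_rows, buffered_changes):
    # Step 1: load the historical snapshot into the main table.
    main = {key: row for key, row in snapshot_rows}
    # Step 2: replay the changes buffered in the temporary table
    # while the snapshot was running.
    return apply_upserts(main, buffered_changes)
```

Since upserts are idempotent, a change that was already reflected in the snapshot can be replayed without corrupting the result, which is why the CDC stream does not need to pause while the large table is backfilled.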

What this prevents is CDC bloat: your database disk doesn't fill up because you paused ingestion for the historical load of a big table. The reason I'm mentioning all of this is that we are developing OLake as fully native to Iceberg. We are using Iceberg features to account for all of these things, where Debezium used to use an alternate Kafka topic for watermarking, or

Shubham Baldava (46:42.541)
you know, in Debezium you need to actually fight to get exactly-once right. You need to use checkpointing in Flink if you're using Debezium, Kafka, and Flink to get that exactly-once behavior. And in Spark Streaming, there are some difficult ways to achieve that. With us, you don't have to worry about all of that. OLake takes care of it, because we are highly optimized for just Iceberg. We are using Iceberg to solve all the problems that I just mentioned.

Nitay (47:14.446)
Actually, maybe you have some examples: what are the Iceberg-specific features that enable you to do this that some of the other formats don't have, or what are the things you have to build in to enable them?

Shubham Baldava (47:29.293)
I think, specifically, if I have to talk about a couple of other features that we have developed for Iceberg: we are using Arrow. We use Arrow to write into Iceberg tables. I would not say we can't do that for other table formats. We could still do that for other table formats, but we have developed it using Arrow for Iceberg.

What we are doing right now is taking all of the data coming in from CDC or Kafka or whatever source you want to sync, putting it into Arrow record batches, then exporting that as a Parquet file and registering that Parquet file in Iceberg. So all these kinds of things. Apart from that, we have developed a custom Iceberg compaction.

The native Iceberg compaction is there, but it's really heavy: it rewrites all of the files into the destination files. What we are doing is breaking the compaction process into two or three steps. The first step is converting the equality deletes produced by upserts into positional deletes. So what do I mean by equality and positional deletes?

We ingest data with equality deletes using OLake, which is the fastest way to get your data into Iceberg. Then we convert them to positional deletes. That's the first step of the compaction process. Positional deletes give you the best of both worlds: you get decent read performance, and it's not that difficult to convert from equality to positional, so you also get decent data freshness. This feature is also fully Iceberg-specific, because Iceberg has equality and positional deletes, and in V3 it has deletion vectors.

The second step of the compaction is converting really small files into medium-sized files, and the third step is converting medium-sized files into the target file size, which is 500 MB or something like that. We are developing this along with ingestion, so that your ingestion doesn't conflict with any of these three steps even if they are happening in parallel.
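To make the first compaction step concrete, here is a minimal sketch of turning equality deletes into positional deletes. It is an illustration only, not OLake's code: data files are modeled as in-memory lists of row dictionaries, and the function and argument names are hypothetical. The one rule taken from the Iceberg spec is that an equality delete applies only to data files whose data sequence number is strictly lower than the delete's.

```python
def equality_to_positional(data_files, equality_deletes):
    """
    data_files: {path: (sequence_number, [row dicts])}
    equality_deletes: list of (sequence_number, column, value)
    Returns a sorted list of (path, position) pairs.
    """
    positional = []
    for path, (file_seq, rows) in data_files.items():
        for pos, row in enumerate(rows):
            for del_seq, column, value in equality_deletes:
                # An equality delete only affects data written before it.
                if file_seq < del_seq and row.get(column) == value:
                    positional.append((path, pos))
                    break  # one positional delete per row is enough
    return sorted(positional)
```

An equality delete says "remove every row where id = 2", so a reader has to compare it against every candidate row at query time. A positional delete says "remove row 1 of file a.parquet", which a reader can apply without evaluating any predicate. That difference is the read-performance gain described above.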

Shubham Baldava (49:48.247)
So this is also something different we are doing just for Iceberg within OLake.
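The second and third compaction steps, small files to medium files and medium files to the target size, can be sketched as two passes of the same greedy merge. The thresholds below (16, 64, 128, and 512 MB) are assumptions picked for illustration; the conversation only mentions a target of roughly 500 MB, and `merge_pass` is a hypothetical helper rather than anything from OLake or Iceberg.

```python
def merge_pass(sizes_mb, below_mb, target_mb):
    """Greedily merge files smaller than below_mb into outputs of at most target_mb."""
    keep = [s for s in sizes_mb if s >= below_mb]       # already big enough
    small = sorted(s for s in sizes_mb if s < below_mb)  # candidates for merging
    merged, current = [], 0
    for size in small:
        # Close the current output file before it would exceed the target.
        if current and current + size > target_mb:
            merged.append(current)
            current = 0
        current += size
    if current:
        merged.append(current)
    return sorted(keep + merged)


# Thirty-two 4 MB files from streaming ingestion plus one large file.
files = [4] * 32 + [600]
medium = merge_pass(files, below_mb=16, target_mb=64)    # small -> medium
final = merge_pass(medium, below_mb=128, target_mb=512)  # medium -> larger
```

Splitting the work this way means each pass rewrites only a bounded set of files, so a pass is short and cheap compared with rewriting everything into target-sized files in one go.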

Nitay (49:58.19)
That's cool, the multi-phase aspect of the algorithm. That's very neat. How do you think of tying this back to something you said before: at the time, Hudi was kind of the winning format, then over time Iceberg seems to be taking over, and you guys are fully betting that Iceberg will win. But there are interesting things happening, related to what you said before about the open source community, and Tabular, which is now Databricks, and all these kinds of things. So, A, how do you think about the future of where this community will shake out?

B, for some of these things, for example this great compaction algorithm you just mentioned, how do you think about it for your own company: which aspects are you guys keeping to yourselves, so that they stay a DataZip thing, and which aspects do you open source, leveraging the community and getting it to be part of the format and so on? How do you think about these things?

Shubham Baldava (50:48.173)
Yeah, the really blunt and honest answer would be that there is a little bit of politics happening around a lot of these table formats, especially Iceberg. People are trying to push some level of their own agendas when it comes to launching features, because of the Tabular acquisition and all of that.

That's my personal feeling. I might be wrong, but that's the way I see it happening right now. There are two or three big players in the market, like Databricks, Snowflake, Dremio, and a couple of others, and they have a really strong influence on what features come out in every new version that Iceberg launches. And I think Databricks has said publicly that they are going to make Delta almost equivalent to Iceberg in terms of the features they offer, right? So the single-file commits that you've seen before, the deletion vectors that came in V3, the variant types, the geographical types, all of them bring Iceberg step by step closer to what Delta is, and also the Z-ordering and all of that.

I might be a hundred percent wrong, but this is what I feel is happening, because eventually maybe Databricks wants to fully have Delta and Iceberg and merge them into one format. And they want to be the best tool out there for that, for optimizing and querying and governing it. So yeah, that's my feeling. Sorry, what was the second question?

Nitay (52:37.614)
So one is where it's going in the future, which you touched on a bit. And two is how you as DataZip think about where you guys are going to be part of the community, and which things are more DataZip-private. Some of this compaction stuff you talked about could be a great example. How do you think about these things with the community going forward?

Shubham Baldava (52:57.549)
I think our vision with OLake is basically to help people adopt Iceberg. When we started talking to people, they wanted to get into Iceberg because they were tired of vendor lock-in and of paying really high amounts to these vendors. And one of the main reasons was that people are tired of maintaining multiple copies of the data, the way we were doing at PayPay, right? We had one in Hudi, one in BigQuery.

A lot of people are doing that because Databricks is really good at machine learning, so they want to dump one copy of the data into Databricks. Snowflake is good at really fast queries, second-level server spin-up times and all of that, so they also dump one copy of the data into that. And then there is a raw copy sitting in S3 as Parquet, right? Our vision is to help people adopt Iceberg so they can avoid all of these problems.

Our vision is basically to get data from multiple sources, even legacy systems like Hive, into Iceberg, and to do it in the best way. People should be able to get that data in the most performant way, so that they don't have to worry that their data is now in Iceberg but their queries still take a lot of time, cost a lot of money, or time out. So the first vision of OLake is to help people adopt Iceberg.

After that, what we see is that we can partner with a lot of good companies. Lakekeeper is there. There's Bauplan; we have recently had some conversations with Bauplan in order to get compatible with them. The way we see it, people want to have that flexibility, so we don't dictate the catalog and we don't dictate the governance. People would want to choose their own tool or set of tools. We would be the one to get that data into Iceberg.

And one of the really interesting things that happened just a week back is that Iceberg has decoupled the file formats from the spec. So now you can have Lance or Vortex in Iceberg as well, coming in the next quarter or so. And this is where it gets really interesting for us, because with Lance and Vortex,

Shubham Baldava (55:23.037)
you know, you had structured, you had semi-structured with variant, and now you can have unstructured as well, right? All of these texts, logs, audio transcriptions. All of that sits in Iceberg. So Iceberg becomes one single layer for AI and LLMs to query all sorts of data. And basically every other AI tool out there would sit on top of Iceberg to query it: maybe you want to do image search, maybe you want to do vector search, or maybe you want to do basic queries.

And we want to have a slice of that as well. We want to ingest data into those formats too. So yeah, I think that is the future we see. And our vision is basically to get data into Iceberg in the most optimized way.

Kostas (56:20.132)
You mentioned something very interesting: the evolution from only typed columnar data to variant types, where you have the semi-structured stuff and schema-on-read and all that. And now we are going into unstructured. Whenever you have this transition, there's also an impact on ingestion, right? It's very different to go from ingesting, let's say, just tuples to ingesting variant types. And I guess it's even more drastic when you have to ingest videos, or just text or images or who knows what else.

So from that perspective, what does that mean for you? It's a very different type of data that you have to work with. How do you even move that stuff around? Sure, I can get a JSON payload and put it into Kafka. Am I now going to take my, I don't know, 4K video and chunk it into I don't know what, and put it into Kafka, and then what? So tell us a little bit about that. How is this going to work?

Shubham Baldava (57:56.493)
To be fairly honest, this just happened last week. We had been following some tickets and issues in the Iceberg community for a while, and this just happened last week. So we are now doing our own research. I don't have all the answers today. Until now we were working on structured and semi-structured data. We had all the features in place to flatten or normalize JSON data into maybe level zero, level one, level two. All of that is either there or we are developing it. But I think this changes the game.

As for early research, over the weekends I study a lot of these things, and there's one company doing a really amazing job at this, called Daft. If you've heard about Daft, they have recently raised a really big round of funding. And I think they are in a really sweet spot to do this. They have multiple modules built around multiple types of data, and all of this is written in Rust. Let's say you have audio in a video: you can take out the transcripts, feed them into an LLM, then take the result back and put that text into Lance or something like that.

And I think Daft is doing a really good job because they run on Ray. So we also have some ideas about why don't we do something similar. Because we have already been doing it for structured and semi-structured data, why don't we also do that? It can open a gateway to endless possibilities, because if this happens, LLMs are going to change everything. They're already changing a lot of things. And without data, an LLM is just hallucinating, if you don't feed it the right data.

So I think we would want to have a slice of this cake as well. We would definitely venture into this next. After this fundraise is done, we are definitely going to go into this. Maybe something similar to what Daft is doing. Yeah.
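The flattening of JSON "into level zero, level one, level two" mentioned earlier in this answer can be sketched as follows. The interpretation is an assumption: here the level is read as the nesting depth to expand into columns, with anything deeper kept as a JSON string. The `flatten` function and the underscore naming scheme are hypothetical, not OLake's actual behavior.

```python
import json


def flatten(record, max_level, prefix="", level=0):
    """Expand nested dicts into columns up to max_level; keep deeper values as JSON text."""
    out = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            if level < max_level:
                # Recurse one level deeper, prefixing child columns with the parent name.
                out.update(flatten(value, max_level, name + "_", level + 1))
            else:
                # Past the requested depth: store the subtree as a JSON string.
                out[name] = json.dumps(value, sort_keys=True)
        else:
            out[name] = value
    return out


record = {"id": 1, "user": {"name": "a", "geo": {"city": "x"}}}
level_zero = flatten(record, 0)  # nested objects stay as JSON strings
level_one = flatten(record, 1)   # one level of nesting becomes columns
level_two = flatten(record, 2)   # two levels of nesting become columns
```

A lower level keeps the table schema stable when the source documents change shape, while a higher level gives query engines typed columns to filter on. Which trade-off is right depends on the downstream use case.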

Kostas (01:00:13.261)
Yeah.

That makes sense. And my last question, because we're also close to the end here. We've talked about Debezium; we mentioned it quite a few times. Debezium is one of these pieces of technology that people love to hate, in a way. It's obviously something that ended up being kind of a necessity, but it also has a lot of problems. At the same time, nothing came out to replace it, at least in the open source world, right?

And I was always trying to understand why no one has tried to do the typical thing, let's build the Rust version of Debezium. Why not? Or Zig, or whatever we like to use, to make it better. One of the things I learned, and I'd like to hear your take on it, is that one of the reasons Debezium was so hard to replace is the connectors it has with the upstream database systems.

Rebuilding that is a pretty hard problem, because you really need to go into very obscure, not very visible parts of the database systems. And then you have to account for all the edge cases around that, because many of these database systems didn't build their replication logs specifically for this kind of use case, right?

Kostas (01:02:12.868)
I don't know if this is the case anymore, but how important is this in the end, in your opinion, and how do you deal with that yourself?

Shubham Baldava (01:02:22.353)
So, I think Debezium as a piece of technology has been around for almost a decade and a half. I'm assuming it was developed at Red Hat and then open-sourced. I think one of the main reasons why Debezium is still alive and there is no Rust alternative for it is that Java is and was the dominant driver language, and all of these enterprise applications were developed in Java, Spring Boot and all of that, right? So for all the drivers, every bug fix and every feature request used to land first in Java.

I'll give you an example: the Oracle driver, or the Oracle connector of Debezium, does many things really well. If you want to do CDC with Oracle, there are two ways. One is GoldenGate, which is a managed CDC for Oracle. The second is reading from the binary files directly. There is an open source library which does that, and it is most mature in Java.

One of the reasons why we did not choose Rust and went for Go is that Go is still a bit more mature compared to Rust when it comes to these drivers and nitty-gritty details. Oracle has a Go driver supporting versions 19 and 21, but I'm assuming Debezium supports even older ones, like 17 or earlier. And many of the banks around the world are using, I don't know, my father's-time Oracle versions.

So I think one of the main reasons Debezium is still there is that it covers all of these bugs and feature requests, from really old versions of databases to the new versions today. And it's written in Java, where the most mature drivers are. Those are the two reasons I feel it's really difficult to replace even today. And even if somebody new comes up, they will still have to write it in Java.

Shubham Baldava (01:04:38.093)
Because these drivers are not being updated. But one thing we are changing with OLake is that we have taken the initiative to do that. We have been adding a lot of these improvements. Take Oracle: we are working with one of the banks in India, and we have optimized a lot for Oracle 19 and 21. We have also recently added MS SQL and DB2. These are also legacy databases.

We are adding support for them one by one. We are also changing the drivers to be a little faster. For example, we have ADBC on our roadmap. If you know, there is a company called Columnar; they have recently launched ADBC. So we are planning to use ADBC wherever we can. For Postgres and MySQL, that seems like an easy thing to achieve. MS SQL is also out there.

So in summary, what I'm trying to say is: Debezium is there, and it's really generic, for Kafka kinds of things. If you dump the messages into Kafka for downstream use cases, whether you want to dump them into Iceberg or into a warehouse, Debezium is your go-to tool. I've already explained the main reasons why it's not easily replaceable. I think we are changing that bit by bit, but right now we are primarily for Iceberg.

A couple of people and good companies have asked us to develop a Kafka sink, because they want to replace Debezium internally. They're tired of maintaining it, at least for the databases where it works. So right now we are thinking, on the product side, should we add Kafka as a sink or not? And how would that work with writing into Iceberg as well as Kafka? That's something we are figuring out right now.

Kostas (01:06:25.612)
Awesome. All right. We are at the end here. And before we close our episode, share with us something that really excites you about the future.

Shubham Baldava (01:06:43.381)
I think the thing that really excites me is a future with a multi-engine data platform running on Iceberg for AI. AI is going to play a significant role. One of the main reasons it excites me is that, because of AI, the number of queries we make is going to increase exponentially, because AI doesn't have to rest. It can keep going for days and nights.

With that increase in the number of queries, you can't pay millions of dollars to the vendors that are already there. You need to find more efficient, cheaper data formats, and one place to store everything: structured, semi-structured, unstructured. So that's what excites me, that's what keeps me up at night. We are running towards the future with Iceberg at the center, so yeah.

Kostas (01:07:36.046)
Awesome. Thank you so much, Shubham, and we are looking forward to having you again in the future.

Shubham Baldava (01:07:41.459)
Awesome. Thank you, Kostas. Thank you, Nitay. Thanks for having me. It was really nice having a conversation with you.
