From Data Mesh to Lake House: Revolutionizing Metadata with Lakekeeper
Episode 16 · 57:25

Kostas (00:04.657)
Viktor, welcome. So nice to have you here on the podcast together with Nitay. How about starting with a quick introduction about yourself? Tell us a few things about you, who you are, what you've done, and what you're up to today.

Viktor Kessler (00:20.812)
Yeah. Hi, Kostas. Hi, Nitay. Thank you for having me here on the podcast. And yeah, let's just start with an introduction. My name is Viktor, I'm based in Germany, and we're building a startup. But maybe just to give you a little background on my career and how I developed myself: I have a classical computer science degree and started out building,

in the early days, data warehousing systems and risk management systems; maybe some of you know Solvency II in the insurance context, or Basel III, all that European regulatory stuff. Then one day I made the decision to switch from the customer side to the vendor side. I went to MongoDB, was a solutions architect there for a couple of years, and then continued at the next US startup, Dremio.

I was a solutions architect at Dremio, the whole time based in Germany. And in 2023, my co-founder Christian and I started a new startup. First we thought about building something around data mesh, and pretty soon we understood that we need a lakehouse. And that's what our startup is developing right now. If you look at Vakamo, and especially our project Lakekeeper, it

is an open source REST catalog for Apache Iceberg. We'll talk about that in more detail in a second, but that's my introduction.

Kostas (01:55.591)
Yeah, that's great. So you mentioned that when you decided to start a company, you thought...

and then you realized that you need a data lake or a lakehouse. Tell us a little bit more about that. Why did this happen? Why did you decide to move away from the concept of the data mesh toward the lakehouse?

Viktor Kessler (02:22.838)
Yeah, absolutely. Maybe just as background, here's what I collected from my experience. As a solutions architect at a startup, as I was, you can imagine that you meet four or five customers every week. And if you accumulate that over four or five years, you will meet hundreds of different customers. And you have these conversations all the time,

very precise and bespoke: what's the problem on the customer side, what challenges do they have to solve for their business needs, what are the required capabilities to do so, and how did they achieve specific goals? And from that experience, what I saw as a technical solutions architect, having a very interesting

kind of topic like the data lake or data warehouse: when you have a conversation with companies, you understand that a lot of people inside the organization don't really understand what problem we're trying to solve. If you have a conversation with data engineers or data practitioners, you see that they totally understand: okay, we need to move from a data warehouse to a data lake, or for specific use cases we need AI and machine learning. But if you go to the business,

and you have a conversation with the business, you understand that they have a totally different perspective on the issue and they're not able to follow you. And that is actually a conflict, because if you're trying to position a solution from a technical perspective and someone doesn't understand how it's going to help them, that's a challenge.

And we came to the realization, Christian and I, that we need to approach this a little bit differently. At that time, data mesh had been very popular for a couple of years, and data mesh was not a technical implementation; it was a concept for the organization, for how to build processes in a distributed manner. So it's not a central place where you do everything, but every

Viktor Kessler (04:37.03)
data domain is independent and can do its own analytics and so on. And the good thing is, if you go to the business with a data mesh approach, you can really explain it to them: okay, you have this challenge, and to solve it you create a data product, and then maybe you can even hand that data product from, let's say, marketing to sales, so that your colleagues in the sales department can use it, and so on.

Using the data mesh principles, which are data domain, data product, self-service and computational governance, helps to position every technology around that, such as the lakehouse. So we started to develop around data mesh at the beginning. But when we started in 2023, and especially at the beginning of 2024, we understood that for self-service and collaboration,

the classical approach with, let's say, a data warehouse based on Postgres is kind of hard, not really possible, because the technology itself is more centralized, while the data mesh concept is decentralized. So we looked around for the decentralized approach on the technology side, and it's the lakehouse, because the lakehouse is, first of all, a decentralized platform, and it allows everyone to develop something in a self-service manner,

either loading data into the lakehouse or analyzing the data inside the lakehouse. And we said, okay, we need to provide a lakehouse for every company, so that every company is actually able to move from a data lake or data warehouse to a lakehouse. And if you look at, let's say, April 2024, you could choose either Hive Metastore,

or create a lakehouse based on Apache Iceberg, in our situation, which is maybe the dominant open table format nowadays. Or you could use Tabular, the company from the creators of Apache Iceberg, Ryan Blue and Dan Weeks. The problem on our side was that we couldn't use software as a service for our platform. And we made a decision: let's build a catalog.

Viktor Kessler (06:56.75)
In April, May, we started to build our catalog. And in June, July, you know what happened, with the acquisition of Tabular and partnerships between all the different companies. And that's where we've ended up at the moment. So we have a catalog, Lakekeeper, for the lakehouse. But still, the data mesh is not gone. It's, let's say, in the next steps of how we would like to approach the market.

Nitay (07:22.479)
That's really interesting. And I love, by the way, the solutions architect background, because folks like yourself really live and breathe the problem and the pain points every single day, and then have to translate exactly between the business user and the data engineers and so on. So perhaps before we dive deeper into the lakehouse and catalog stuff you were talking about, take us back to the core data mesh problems. What were the specific pain points, and what was the business user looking to solve specifically?

Viktor Kessler (07:50.894)
Thanks.

Nitay (07:52.079)
And where does the state-of-the-art data mesh do a great job, versus where is it not actually meeting the demands, across the different ways people are implementing it?

Viktor Kessler (08:03.438)
It's quite interesting, because if you look at data mesh over the last couple of years, the question is why we actually got data mesh. Something changed in the landscape of companies, and that change is decentralization. Companies started to understand that it's not possible to keep everything in a single place. It's not possible to keep data engineering

in an ivory tower where they can go and solve all problems inside the organization. What you saw was that data engineering, set up as a centralized approach, was a bottleneck. Businesses were struggling to get data to understand the situation, to make any decision. Because if you're the business and you need to make a decision, and everyone was about to become a data-driven company,

you kind of waited maybe a week, a month, a year. If you looked at the data engineering backlog, it was so huge that no one was able to get through it completely and provide a solution. And the second problem with that centralized approach, compared to the data mesh, shows up if you look at what is actually happening in data engineering:

the data engineers will go to, let's say, the marketing department, and they will copy everything out of the CRM into a centralized data warehouse. Then they're going to run that through the medallion architecture, so you have your raw data, prepared data, golden data, whatever. And then they will go back to the marketing department and tell them, okay, here's your data, would you like to become the data owner? And you will never find any

person in the marketing department who is going to say, yes, I'll be the data owner for that data you created. The problem is that we disconnected the marketing department from the data with that centralized approach. And in the meantime, what has actually happened in a department like marketing? They already have their data in Excel or whatever type of technology they use.

Viktor Kessler (10:24.558)
So they usually built what we call shadow analytics somewhere, and they have the whole solution there. And that is the state of the centralized versus decentralized approach. To solve the problem, I think Zhamak at ThoughtWorks was the first who actually put forward data mesh as a concept, that

we need to rethink the whole situation. If you have a decentralized world with all these entities around the same organization, why not give them a self-service way to actually extract data, prepare data, produce some results and make a decision? And the first principle of data mesh was the data domain. So I can define a data domain; it's not that easy to define one. You have a bounded context, like

marketing or campaign management, and then you can define that domain. Inside that domain you usually have a classical OLTP team with microservices, and they can provide all the services for the business. You just extend that team with a data engineer and maybe some analytical persona, and what they can then create in that data domain is a data product.

And we can dive into what a data product is, but in a nutshell, a data product is something that helps you take an action inside your organization. So if you're like, okay, I know exactly the number of customers, then I can provide some specific service about customers so you can take action. That's what a data product helps you with. And that's the second pillar of data mesh, after the data domain.

The third one is self-service. So now we have a domain with a team, and they have the expertise. They understand their data, and you're not losing that expertise on the way from, let's say, the CRM to the data warehouse. You keep it inside the team, and they can use all the tools you can use with a lakehouse. For instance, you can just take, let's say,

Viktor Kessler (12:45.44)
Python and extract from the CRM, load the data into the lakehouse, then use your Trino or your DuckDB to analyze the data, and on top of that maybe Tableau or Power BI to create your analytics and data products. The last pillar of data mesh is computational governance, because what you had before was a simple org structure where you have a team which

controls everything: who has access to the data, what's the purpose of accessing the data. But that's not in place anymore. Data mesh kind of decomposes that structure, so right now you have a lot of data domains. And in order to keep control from a governance perspective, you now need a computational way to achieve that. And it's not simple, because you have all the different tooling in place, and then

everyone tries to do some data preparation and then they would like to share the data. To achieve that, you need a technical foundation for authentication and authorization which is managed centrally, while the data domains stay independent and self-service. Maybe just to finalize the idea of data mesh:

in a way, how we treated data silos was, let's try to get rid of data silos. But if you look from a data mesh perspective, you can say, maybe a data silo equals a data domain. And if we have that mechanism of computational governance, we can actually control who is accessing

what type of data, for what purposes. So maybe the data silo concept is not that bad. Projecting that onto a world where you have all these small companies producing data products, which might be silos as well, you have your supply chains with all the data products, and you can actually mirror that to a data mesh. And then you will see that having data silos is not a bad thing,

Viktor Kessler (15:12.302)
but actually produces a lot of value for the company, and then you can become a really data-driven company using a data mesh.
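
To make that self-service flow concrete, here is a minimal sketch, assuming an Iceberg REST catalog (for example, a local Lakekeeper) at the URI below; the endpoint, table and column names are illustrative assumptions, not a real deployment:

```python
# A minimal sketch of the self-service loop: read an Iceberg table through a
# REST catalog and analyze it with DuckDB. URI and table name are assumptions.
import duckdb
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lake", uri="http://localhost:8181/catalog")  # hypothetical endpoint
events = catalog.load_table("marketing.crm_events").scan().to_arrow()  # hypothetical table

# DuckDB picks up the local Arrow table by its variable name
duckdb.sql("SELECT campaign, count(*) AS touches FROM events GROUP BY campaign").show()
```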

Nitay (15:22.484)
I would go even a bit further. This is maybe a bit of a personal hot take, but I think data silos are one of these, to use the nature analogy, evergreen problems, not deciduous problems. At best, maybe it's a deciduous problem, meaning like trees, right? At best, maybe it's a problem that comes and goes. In reality, in most cases in my experience, it's a problem that's always going to be there: it always was there and it will continue to be there.

And every time I see organizations saying, okay, we have this beautiful lakehouse now, it's all going to be in S3 with Iceberg, Parquet, et cetera, everything's going to be there and everything's going to centralize there. It's a nice vision. And I'm not saying you shouldn't work towards that, because I think many organizations would get a lot of value in working towards that, absolutely. But it's inevitable that you're going to have, like you said, some squad or pod of folks

Viktor Kessler (15:49.347)
Yeah.

Nitay (16:14.517)
that are going to want to build their own application, or have their own vendor that they use, or whatever it is, that have some tiny sliver of data about your company, whether first party, third party, whatever. And so I think you said some interesting things there, in terms of, it sounds like, coming at it from the get-go saying,

our data is going to be decentralized, so how do I provide the infrastructure that enables you to manage data and governance and rules and ACLs and all that kind of stuff, even in that world? So going from there, it makes a lot of sense. Where did it not meet the mark? What's lacking in the concept and in the execution, I guess, if you will, and where's the gap today that you see?

Viktor Kessler (17:06.762)
Yeah, well, the interesting part is, as I mentioned at the beginning, that we need a data product, and a data product is something that provides an action. And if you look at the history, you will see that for, I don't know, the last 20, 30 years, data

itself was in the spotlight on the main stage. Businesses acted on data. Just to give an example: a customer was on a website, let's send him an email; or a customer opened the email, let's send him a discount. Trying to interact based on data. Unfortunately, at the same time, if you had a

conversation with businesses or with data practitioners, everyone complained about data quality, data profiling, not knowing exactly who is using what type of data, data ownership. The data is probably not in the right shape, or the results are outdated, and so on. And solving this is not something you can do with data; it's doable with metadata.

Unfortunately, metadata is stuck in the past. How do we use and treat metadata? Metadata is a necessity for some type of documentation: let's go and write a wiki about this specific data table, noting that the owner is someone from marketing, but we don't know exactly who in marketing, and that the expectation is the data is available all the time or is going to be updated at some point,

and what's the profiling of the data. We don't know all of that, because it's usually a manual process, and there are startups out there trying to solve this from a metadata perspective. But the main thing that is not solved today is that metadata, exactly like data itself, should be actionable. Metadata should provide a way to take an action in an automatic way. And it's going to get even more interesting as

Viktor Kessler (19:16.312)
we're going towards AI agents and operators, which will require actionable metadata. Because let's face it: today I can call Kostas and ask him, can you provide me data tomorrow for a specific use case? And Kostas can tell me, tomorrow I don't have time because there's a party, so let's postpone it by a day, a week, whatever. So we can have a conversation as human beings.

And that's the type of relationship a data consumer and a data producer can build. Usually there is no system which tracks that relationship; it's all word of mouth, built on human relationships. And exactly that is a problem as well. But AI agents and operators cannot talk and call each other; they need an automated way, as computational units, where

the AI can align itself on the metadata. So we know exactly who is using what, for what purpose, how the data is going to change, and if the structure is changing, how that will affect the whole consumption pipeline, and so on. So just to get back to the point: what is missing is actionability on the metadata, or what I call actionable metadata as a category in itself.

And that's something we need to establish, because having that in place, we can go and create computational governance, we can go and create a lot of statistics to serve AI agents. Because if you listen to Satya Nadella, Mark Zuckerberg, everyone forecasts that we're going to have a lot of AI agents. So we need that kind of metadata to allow them

to be in place.

Kostas (21:08.103)
Okay. I have a couple of questions, Viktor, but I'll start with something related to the metadata you've been talking about and the need for a new category there. I think one of the things that is difficult for people to understand when it comes to catalogs, especially people who are, say, not technical themselves, is how catalogs are being used,

right? And what flavors of catalogs are out there? Because in my understanding, at least, we have at least two types of catalogs. When we're talking about something like Hive Metastore, yeah, sure, it is a system that at the end of the day the user also interacts with to get some metadata, for example what tables can I access and what types they have and all these things. But most importantly, the catalog was initially created

for the query engine itself, to track metadata relevant to the query engine, because the query engine needs to know: okay, someone is asking me to access this table, and this table is broken down into X different files that live at these locations with these statistics, and so on. Very boring stuff, very technical stuff that no human should ever touch, right?

That's, let's say, the stereotypical catalog out there, the first thing a data engineer has to interact with: a system that actually powers computation, with users interacting with it as a by-product. But then there's also this whole category of enterprise catalogs from the past, stuff like Collibra, where

things start from, let's say, the basic concept of dictionaries. We need to agree on some terminology here in the company: we have so many different teams, each team has a different concept of what revenue is, we can't do that anymore, we need one concept of revenue. And based on that, we also derive how we are going to compute revenue, right? And then you start getting into the catalogs that are, let's say,

Kostas (23:33.607)
more metadata-oriented, and more consumed by humans rather than machines, in the past at least. And then of course you have governance: access controls, who has access to what, why, and all these things, where you probably have a separate enterprise category too. You have Apache Ranger, for example; it's a well-known thing out there, and companies have built on top of it. Now,

as long as these were problems only for the larger enterprises, that was fine, because larger enterprises love complexity, right? They are complex organizations themselves. If you are, I don't know, General Electric or Siemens, and you have hundreds of thousands of people in teams that probably don't even know one another exist, you are already experienced in complexity, right? Because you need to be; you are a complex organization yourself.

But when we reach the point where startups out there are talking about the importance of metadata and cataloging, and fusing, in a way, these different concepts of cataloging into one, I think things are starting to change, right? But also getting increasingly confusing, because at the end of the day we need all of that stuff, right? It's not like we no longer need the technical cataloging the query engine needs. We still need that. But we also need the metadata you are talking about.

So.

My question to you is this: you are talking about a new category, but is what we are experiencing here actually a new category? Or are we seeing needs that, for the past two, three decades, were primarily the problems of very large enterprises now becoming everyone's problem, and that's why we have to rethink how to deliver software and build these systems again?

Kostas (25:32.155)
Do we have to solve the same problems but in a different way, or are we talking about completely new problems today when it comes to catalogs?
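
The engine-facing catalog described here can be made concrete with a small, hedged PyIceberg sketch: the catalog hands the engine the table's schema and the list of data files (locations, row counts) a scan would touch. The endpoint and table name below are assumptions:

```python
# A hedged sketch of what a query engine asks the technical catalog for:
# the table's current schema and the data files behind it.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("demo", uri="http://localhost:8181/catalog")  # assumed endpoint
table = catalog.load_table("sales.orders")  # assumed table

print(table.schema())  # column names and types, as the engine sees them
for task in table.scan().plan_files():  # the files a scan would have to read
    print(task.file.file_path, task.file.record_count)
```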

Viktor Kessler (25:40.906)
Yeah, that's a great question. Well, maybe let me start with

the catalog, which is a very misused word, right? You will find so many different kinds of catalog: metadata catalog, business catalog, technical catalog, and everyone has their own understanding of what a catalog is, which makes it very, very hard to distinguish and to explain what type of catalog and metadata we're talking about.

And the challenge with, let's say, the classical approach of cataloging, the business or metadata catalogs, is that they tried to create a catalog of all the business terms. And a quite interesting experience of mine: if you go and ask

three departments what a customer is, you will get five different opinions on it, right? So there's a real

difference in the way companies and departments define all of this. And that is, again, confirmation that we live in a decentralized world. There is no central catalog which will tell you a customer is the person who does this. It depends who you talk to; they all understand differently what a customer is. And even inside the organization, the IT

Viktor Kessler (27:16.682)
department will tell you that the customer is actually someone from the business, not even the customer who pays money to buy services. That is one reality. And the other reality is that this process was mostly manual. Usually, if you used such a catalog, you went to the system, you typed in this information about who a customer is, and upon pressing the save button,

that information was already obsolete, because someone in the organization, you said it, that large organization with hundreds, thousands of units and employees, had a different understanding of that specific information. So what we're now trying to solve is a little bit different. Maybe let me take you on a journey of metadata.

It is more the technical journey of metadata than the business journey. As you said at the beginning, maybe even starting with the data warehouse: we had a single monolithic system, the database. And in that single monolithic system, we had a technical catalog, which we know as the information schema. And that catalog, that information schema, was in charge of the lifecycle

of database objects like tables, views, stored procedures. That catalog allowed someone to access data, to access a table, to write data and so on. And that is part of what I'm referring to on the actionable metadata side at the moment. Then, going on that journey towards the data lake, we had Hive Metastore. And Hive Metastore was born to

create, or mimic, a facade of a table, so that I don't need to care about the folder and all those files I access. I actually had a spot where I could ask what the table looks like. Is it partitioned by something? And my engine could go and actually query all that stuff. And as you described, I had Ranger plugins which allowed me to solve some governance

Viktor Kessler (29:39.072)
issues as well. But somehow we never got to the point where Hive Metastore provided the same capabilities we know from a database. On the database side, we had that strict, rigid, bureaucratic way, but it was structured, transactional, secure. On the Hive Metastore side, we had all these small problems in that area, but it was flexible.

And therefore we now have the lakehouse, with the promise that we can get both: the structure we know from a database, with transactional guarantees, schema evolution and time travel, and we can stay flexible. That's the promise of an open table format like Apache Iceberg.

And the question now is: do we need to solve the business metadata first, or can we concentrate on the technical stuff, because of all the problems of breaking a consumption pipeline? Let me give you a simple example. I am a data producer; I have my table.

For instance, Kostas, you are a data consumer. You use that table to create your report, and with that report you can decide whether you're going to invest somewhere or hire more people onto your staff. In a small organization, it's all manageable. But if you go to a large organization and I change something inside my table, that will break your report. Because what we usually don't have is that relationship.

And that is, again, a different type of metadata: a relationship between data practitioners. But what we need to ensure is that the consumption pipeline is stable, or as Christian, my co-founder, says, unbreakable. We need to have that in place. And you can imagine that within the organization, these consumption pipelines look like a value chain: all the different departments use the data, trying to add more value to it,

Viktor Kessler (31:54.966)
and provide it to the next department, maybe, so that at the top, some C-level management can make a proper decision on something. And from the design partners we have right now, we learned that if the consumption pipeline is complex and breaks all the time, it has a huge impact on the business side. Which means the business loses money, because the business is not able to make a decision. So to summarize the whole thing, the first

idea on our side is that technical metadata, like schema management and table statistics, can provide great value that we had not really concentrated on. And the problem was maybe not that huge before, because the centralized way of doing data analytics didn't make it a huge problem.

And exactly now, because we're moving to a decentralized world, that problem grows exponentially, because everyone is trying to get the data or to offer the data, and without actionable metadata in place, we will end up in a kind of chaotic environment. That's maybe a very long explanation, but that's the way I see actionable metadata.
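
The information schema mentioned here is still queryable in any SQL engine; a minimal sketch of that technical catalog idea, using DuckDB:

```python
# A minimal sketch: the database's own technical catalog (information_schema)
# is queryable like any other table.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE orders (id INTEGER, amount DECIMAL(10, 2), ts TIMESTAMP)")

# "what tables exist, and what do their columns look like?"
print(con.execute("""
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'orders'
""").fetchall())
```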

Kostas (33:20.093)
So if I understand correctly, in order to enable the decentralized data organization, we need to centralize the metadata, right? Because the metadata needs to all live in one catalog. Do I have that right, or am I missing something?

Viktor Kessler (33:46.602)
Well, maybe a centralized catalog is not the correct description, but a common catalog, which will have that data in one place. And again, if you look at the metadata inside an Iceberg catalog, it's somehow decentralized, because you have all these S3 buckets where the metadata of an Iceberg table lives.

But at the same time, the catalog itself has a Postgres database, for instance, where you have the concentration of metadata. And just to give an example, our catalog Lakekeeper has specific interfaces to push all the changes to NATS or Kafka messaging systems and make it all reactive. So there is a mix between a centralized place and a not-centralized place. And then you can build your

type of computational governance, or data contract engines with objectives that you expect your data products to meet: stable schemas, regular data updates and so on. And therefore you have that mix between decentralized and centralized. But yeah.
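
A hedged sketch of consuming those change events over NATS; the subject name is an assumption for illustration, not Lakekeeper's documented configuration:

```python
# React to catalog change events pushed to NATS, e.g. to re-validate a data
# contract whenever a table's schema changes. Subject name is hypothetical.
import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")
    sub = await nc.subscribe("lakekeeper.events")  # hypothetical subject
    async for msg in sub.messages:
        print(msg.subject, msg.data.decode())

asyncio.run(main())
```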

Kostas (35:05.836)
Okay, I have more questions for you, Viktor, but I want to ask Nitay a question, because you mentioned something about customers. I think I have an expert here when it comes to the definition of a customer, which is Nitay, someone who spent more than 10 years building a customer data platform. So tell us, Nitay, do companies agree, inside

the organization, on what a customer is, and how big of a problem is it?

Nitay (35:36.555)
So it's interesting. One of the things we found oftentimes, and this ties to some of the stuff you were saying early on, Viktor, is, as the old saying goes, what's the anecdote? Nobody asks you for a car; they ask you for a faster horse, but really what you build them is a car, and then they fall in love with you. I found that there's a lot of that, especially in these kinds of cases

Nitay (36:05.14)
where you're dealing with highly technical details and capabilities and so on. Let me give a specific example. It was incredibly rare that we would talk to a customer and they would tell us, man, I really just don't have my data catalog story right. And to be clear, they would never use those terms; they had different terms, and they were aware of the concept. But even when they were aware of the concept and very cognizant of it, it didn't seem to them that that was the problem. The thing they would call it, for example, many of our customers called it,

was the business data dictionary. So they are aware there's this notion of a business data dictionary, and that's what defines what my campaign spend is, or what my ROAS or ROI is, or how I look at a sale or transaction, et cetera. It has all these different definitions, but it's kind of sitting there somewhere. It may be in a system. It may be anecdotal, word of mouth from one person to another. Who knows, right?

But most of the time they come to you trying to solve some higher-level problem. And, Viktor, to your point earlier on, I found that so many times it was our job to come in and say, okay, we have this customer data platform, it's going to enable all these use cases you're talking about, but as we're deploying, by the way, one of the things you guys are going to need is this business data dictionary.

And half the time we wouldn't even necessarily say it; it would just organically happen that they would realize, oh, you guys helped us solve that problem too. Things just work more easily now, everything just coincides: the workflows, people interacting, it just works. Yes, because we solved the underlying problem. And so I think one of the things that makes this world tricky, or interesting rather, is that I find it's very rare

that people point to it. And this is why, going back to the beginning of the conversation about solutions architects, I really believe that role is such a key role: being able to stitch together the business needs and these detailed technical capabilities that, honestly, most people just don't want to think about, and would rather say, okay, it's somebody else's problem, and they've covered it for me.

Nitay (38:16.958)
The other aspect of what's being talked about here, going back to the point about data silos and so forth, is, and you asked a very interesting question, Kostas, around consolidation and even a common nomenclature. I somewhat challenge that there's even going to be a single common data catalog or a single common data notion, because I find that especially as you go to these

large, multi-brand enterprises, global companies, et cetera, you have so many different definitions of things, and the definition itself may change according to the use case. Are you looking at things at a basic, simple level? Are you looking at things at a calendar level or a fiscal level? Fiscal year, I mean, right?

For many businesses, the fiscal year they work in is not the same as their calendar year, for various reasons. The simplest one being that you want your Q4 to be a big quarter, and November, December tend to be very quiet months because of holidays and so forth. So many, many enterprise businesses today, people don't realize this, actually run on, for example, a January-end fiscal year calendar. And so when you're calculating things, are you calculating them on calendar year or fiscal year, for example?

Just that, right there, creates a binary split in almost any finance-related metric you might come up with, right? Give me X and Y spend over the last quarter, last month, last whatever. And so many folks simplify these things by saying, okay, we're always going to look at the fiscal calendar and that's it, right? And that's often the right answer. But again, there are particular use cases where certain things may change.

One organization's way of looking at a metric like revenue is very different from another's. Certainly the way the finance department looks at revenue: they have 12 different terms for what revenue means, what income means, right? But for marketing it's just, did the person click on my thing, and did they make a sale? That's it, I won, right? I don't know whether they paid, whether they deployed, whether they whatever; I don't know all these things. I know they clicked the thing and they want to buy. And so

Nitay (40:33.496)
what I often found is that you have these cross-cutting, CIO-led initiatives that are trying to create the one golden set of, whether it be, data or even metadata across the company. But I do wonder, or I do think rather, that more and more you're going to see the need to enable decentralized metadata as well.

And perhaps there's a level of metadata lineage. I don't want to go turtles all the way down, as they say, right? I don't want to over-index on this thing, but I do think there's some aspect of that, where everything stems from some other thing and there are relations between them, but they're not all the same definition necessarily. Does that make sense?

Kostas (41:21.799)
Yeah, it does. It absolutely does. Although I have to say that I still kind of struggle with the same questions I had from the beginning. And that's not because of Nitay or Viktor; I think it's probably me, I'm having a hard time here. Well, I'll try with an example and ask both of you to tell me how you think this can work.

Viktor, I'll start with you. When we are talking about cataloging and the real-life systems, let's try to make things a little more concrete, how they manifest in real life out there, right? Say we are a SaaS business. We have our application, our application uses some kind of transactional database, and the domain,

right, of our application lives there. There is a schema there that defines our product and how we interact with our customers; it defines the entities, but it does so in a very specific way, a way that is optimized for delivering the software to our customer, right? Let's say it's MongoDB or Postgres, it doesn't really matter. From that world, in order to make this data also available internally for other reasons,

we need to somehow make it available to other teams that are going to use this data in an at least slightly different way. And I'm not talking about any processes, I'm not talking about ETL now, I'm not talking about any of that stuff, and I do that on purpose. But I might have an analyst at some point who needs some of this data,

or a marketer who needs some of these interactions so they can reconstruct what in their mind is called a user journey, so they can go and build, let's say, a campaign that's going to be more successful, right? But obviously this initial schema is not designed around that concept; there's no concept of a user journey anywhere in there. Somehow we have to infer it from the data, right? And that's where the transformation part comes in. And when I'm talking about transformation, I'm talking pretty much about

Kostas (43:45.925)
aligning different data models, models that capture different ways of perceiving the world out there, right?

Now, the way this traditionally happens is a pretty complicated process. The data needs to be extracted, and it needs to be extracted in a way that's not going to break the transactional database, because we still need to keep serving the people out there. So we might have, let's say, a CDC approach. And now the data gets transformed, for technical reasons, into a different serialization with different data types, with different schemas, and it lands on something called Kafka, for example,

right, on a topic. Now we don't have a table anymore, we have a topic, but we still have a schema registry there, right, that defines semantics about how this data should look. Some consumer now takes this data and maybe writes it to, I don't know, Teradata if they're old school, or Snowflake if they're hipsters, or, if they're hardcore engineers, Databricks, right? And

from there, we have the whole medallion thing that needs to happen, and so on. What I'm trying to say is that every little part we are touching here perceives the world in a very specific way, represents the world in a very specific way.

But at the end of the day, we are still talking about the same data. Nothing has changed from what was initially captured in the OLTP database; it's the same thing, right? But your marketer at the end of this journey definitely cannot do anything with the raw data as Postgres holds it, right? And what I see here, and that's where I struggle: the technical side, to me, is a little bit easier. I can see, let's say, something like Lakekeeper

Kostas (45:46.483)
that magically collects metadata from all these registries, from all the stop points, let's say the boundaries between the systems, makes sure it keeps track of the lineage, how the metadata is changing, and makes this metadata available in a very consistent manner, right? But we still somehow have to go from that to using the metadata to

figure out semantics about the data we have, which are in most cases, if not all cases, very context-sensitive, right? With the same data, if I'm going to talk to the marketing team, I need to use a different language, different terminology, understand the world in a different way; my semantics at the end are slightly different. If I go to the finance folks, different. If I go to the product folks, different. If I go to the engineers,

and the product engineers, different. And this is the gap I see there, which I don't yet have an answer for how it can be bridged. And this is where, let's say, the enterprise, with the Collibras of the world and the dictionaries, the business dictionaries and the semantic layers as we call them, in a way tries to solve these problems with bureaucracy:

someone somewhere is responsible for keeping track of and maintaining, let's say, the semantics and making sure these semantics are correct, right? But I don't know how scalable this is at the end of the day, or how we can make it available to the broader market out there. And still, I don't have a clear understanding of how this gap gets bridged. That's what I'd love to hear from both of you, because, Viktor, I think

you've seen it from both sides, the business side, but also the technical side with Lakekeeper, where you try to solve the problem technically. And Nitay, you obviously saw it from the business side, trying to create the customer data platform. So again, feel free to tell me, Kostas, you're just silly, you don't understand very simple things here, these things are solved and this is the solution, but...

Kostas (48:05.841)
I'd love to hear from you how, at the end, this can be done, and how big an opportunity it is, at the end of the day, for the market out there to build solutions on this.
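
For reference, the CDC hop described above is commonly wired up by registering a Debezium Postgres connector with Kafka Connect; a hedged sketch, where hostnames, credentials and table names are illustrative assumptions:

```python
# Register a Debezium Postgres connector with Kafka Connect over its REST API.
# All hosts, credentials and table names below are illustrative assumptions.
import requests

connector = {
    "name": "app-db-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "app-db",   # assumed host
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "appdb",
        "topic.prefix": "app",           # topics become app.public.<table>
        "table.include.list": "public.users,public.orders",
    },
}
resp = requests.post("http://kafka-connect:8083/connectors", json=connector)
resp.raise_for_status()
```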

Viktor Kessler (48:16.204)
Yeah, Kostas, where to start, and how much time do we have to talk about that? Because it's a very, very large topic. I'll try to put it in a nutshell and give you my understanding of the situation. Well, first of all, you mentioned at the beginning that we have our OLTP systems, Postgres, MongoDB. But

I hope the end customer is not going to that database; they rely on some consumption-ready product via an API or an app or whatever. They don't actually go to a normalized model, third normal form, on the Postgres, because there is no business there, right? It's very hard to actually understand the data model

if all you have is the entity-relationship model. So the customer goes to a ready product and consumes that product. You open your app, you can order your coffee, you can order your pizza, whatever you need. That's the way it's solved today on the business side, without analytics, on the transactional side. But now somehow we came to the idea: we don't go that way, we will actually

take the data out of Postgres, turn the third normal form into some star schema or data vault, and completely transform the state of the data so that no one understands it anymore, not even the persona who actually provides the service itself. And what I would like to do here, maybe it's a radical solution, is to rethink the whole way we solve the problem nowadays,

where we're just trying to copy, via ETL, CDC, whatever type of mechanism we use. So why not say that a team which provides a service to a customer as a transactional product, why shouldn't that same team provide an analytical product? So I'm in marketing, I can create a product, and that can be consumed by the sales team,

Viktor Kessler (50:34.062)
and it may be an API, a REST API, but maybe it's a SQL API that everyone can connect to, write a SQL script against, and then understand the thing. But they don't need to care about how to copy out of MongoDB, in that case, which is a nested document model, and then go through Kafka topics and then Flink and Spark and the whole stack. They will

have a responsibility to create a consumption-ready analytical product. And that's a totally different way from what we have today. And this team is able to explain to everyone who would like to use that product why it's constructed that way and what value it can provide. So you can go directly to marketing, consult with them, and they will explain to you how to use it.

Just to give you a real-world example, let's take automotive, and I'm from Germany, right? Automotive is still strong in Germany. If you look at any car manufacturer, Mercedes, Volkswagen, whatever, all of them, you will find a supply chain of, I don't know, a thousand suppliers before you get a final car. And what we're trying to solve,

from the classical analytical perspective, is as if Mercedes had to build that car from raw material, finding the aluminum somewhere, whatever, all the way to a final car delivered to the customer. But that's not the case. Mercedes said, okay, supplier A, Bosch, give me that sensor; supplier B, Continental, give me your tires; and so on and so forth.

They just said, okay, every supplier within my huge supply chain has the competency to provide me a finished product which I can use to build the next product. So I can build a value-added chain here. And that same mechanism is what we're now trying to deploy in the analytical world. We need to stop trying to create one fabric which provides

Viktor Kessler (52:55.886)
the whole complex product; we just need small data silos, data domains, data fabrics, whatever you call them, which each create one thing, but something that's good to consume directly. And then you can use that inside a data supply chain and provide that value to the next company, which will enrich it and provide it to the next company, and the next, down to the end customer who will use it.

And maybe we will even reverse the supply chain, and data collected from the end customer can flow back up the supply chain and help everyone inside it understand whether the data product is good or bad. And maybe, just to get to that one as well: let's imagine we have a marketing department and they have five data products, and no one in our organization is using those data products.

Me as a CFO, I can then go to that marketing department and ask them: okay, you've now spent a million on all that tooling, but no one is using that data product; maybe you need to rethink it and create a different product. It's a market-driven approach. If I produce a product, say a coffee, and no one wants to buy it because it doesn't taste good, maybe I just need to change the formula. Same for that marketing department: they can actually change the formula of the data product, the analytical data product,

offer it on an organizational marketplace, and then you can create value. And that is the way we can actually turn the analytical department from a cost center into a profit center, because marketing can produce a product which can be sold inside the organization, or even outside the organization, between partners or whoever

can use that data. And from a CFO perspective, the discussion is now different. It's not, okay, I need a million in budget for a Snowflake but I don't know exactly what value is created here. Now everyone will create a product, and we can go and sell that product. Not in the way where a central team copies all the data out of Postgres,

Viktor Kessler (55:17.678)
create a star schema and now you can use that product and no one really understands how that is actually created. So that's a paradigm shift. And again, we can dive in every single aspect of that paradigm shift, but I would say you need to just go to what you have already in place. You have your microservice oriented organization who provides services and you need just to learn from them.

and maybe adjust your existing teams so they produce analytical products and not just transactional products. I will pause here; Nitay probably has his opinion as well.
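
A minimal sketch of the SQL API consumption described here: a sales analyst querying a marketing-owned analytical product through Trino. The host, catalog, schema and table names are illustrative assumptions:

```python
# Query a marketing-owned data product over Trino's SQL interface.
# Host, catalog, schema and table names are illustrative assumptions.
import trino

conn = trino.dbapi.connect(
    host="trino.internal",   # assumed host
    port=8080,
    user="sales-analyst",
    catalog="iceberg",
    schema="marketing",
)
cur = conn.cursor()
cur.execute("SELECT campaign_id, touchpoints, conversions FROM campaign_journeys LIMIT 10")
for row in cur.fetchall():
    print(row)
```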

Nitay (55:57.19)
I think at a business level there are a few things to realize here. And I love this topic, because I've thought about it a lot and seen various variations of it. There are a couple of key dimensions here. First, part of the reason this topic is so complicated is that you're talking about something that inherently is a human process, and will always have a human-process part to it. And in particular, you're talking about something that cuts

across multiple different types of personas and users, oftentimes. For example, in my own experience building ActionIQ, most of my users were within the marketing org. So even for us, we solved the problem for one particular org. And the problem you're really talking about is one of control and flexibility. How do you bridge that gap, that spectrum, of having enough control where you can

do your consolidation, your common data model, whatever, all the different things you want, but also give people a level of flexibility? To give a few analogies: one product I'm sure all of our audience knows well and loves, think of GitHub with GitHub repos, right? When GitHub started, it was basically just: here's a place you can put your Git repos and share them. Very few bells and whistles on top of that. Incredibly valuable, successful, people went nuts over it, and the rest is history.

Today, if you go into the settings tab of a GitHub repo, it has dozens of settings for things you can do to it, right? And you can still go in and say, hey, this particular branch, you can do whatever you want on; this other branch, no, no, I'm going to protect it with a bunch of different rules, and so on and so forth. Now, why does this all work, in my opinion? I'm talking at a high-level, 10,000-foot view. It works for a couple of reasons. One, because the entire user set you're working with is developers.

So you can constrain all of that to developers. And indeed, we've seen that as soon as you go way beyond developers, the GitHub model starts to break down, right? You start to bring in data science and others, and they want notebooks, and they want sharing with dbt, and they want other tools and things. And GitHub doesn't really work well when you have giant blobs and unstructured data and files and so on; it really is meant for code and developers. Two, the other reason I think it works very well

Nitay (58:16.53)
is the analogy I give from one of my favorite experiences, back when I was working at Google. Google famously did this, I think, quite well: as people know, Google has lots of great infrastructure and foundations and platform, but they also enabled this 20% time, right? Historically they had this thing. Why did that work well? It worked well because they gave people the flexibility to do whatever they want, but after you went and did your exploration, you did whatever,

if you wanted to use all the greatness of the Google technologies, then you had to follow the happy path. And the happy path was: build your app like this, do it this way, et cetera, et cetera. If you don't follow the happy path, go ahead, you can do it, nobody's telling you not to, but you're on your own. To go back to the GitHub analogy: I could take a GitHub repo today, download the whole repo, copy it and create a new one, not a fork, not anything, and go share it with you and say, hey, let's collaborate on this repo,

but it doesn't make sense, right? It just makes everything so much more difficult; I might as well use GitHub and make a branch here and fork it there and so on. There's so much tooling built in that I'd be ostracizing myself from the happy path of here's how we work and here's the value I get from doing it that way. And so when you're going about these kinds of things at a business level, with a complex organization and so forth, coming at it from a pure

"I'm going to be a dictator and control everything" approach, that's not going to work, right? You're going to hamper all flexibility, destroy the whole exploratory, experimentation aspect, and your company is going to lose over time. At the same time, the other extreme also doesn't work: you can't just have everybody running around Wild West style in complete chaos. And the problem, I think, is that people try to bounce back and forth between these two extremes,

when really, I think, the way to do it right is to set up this happy path and motivate people to want to be on it, while also enabling them to do the other things they want, but then they're on their own, right? And more and more, you build up this happy path so that it adds so much value that people want to keep latching onto it. And I would venture that a lot of data platform

Nitay (01:00:41.032)
and broad, cross-cutting infrastructure initiatives that go wrong do so because they don't have that kind of incentive structure in place, in my opinion.

Kostas (01:00:53.299)
All right. That was some awesome feedback from both of you guys. I think we need much more time to talk about this stuff, and hopefully, Viktor, we can have you again in a future episode. We are at the end here, and before we close, Viktor, I'd like to ask you to share with our audience a few things about how they can learn more about Lakekeeper, the company,

and yourself, any resources you'd like to share so people can go and learn more.

Viktor Kessler (01:01:28.346)
Absolutely. Well, first of all, Lakekeeper is open source, and the best and easiest way to learn about Lakekeeper, or about the lakehouse and Iceberg, is to contribute to the community: either to Iceberg directly, or you can just run a test with Lakekeeper, and if you find a bug, file an issue, or request an extension you would like to have. That's number one: go to the GitHub repo,

create a branch or a fork, and just try to contribute. If you would just like to learn more about the lakehouse, I invite everyone on April 2nd to the first Iceberg meetup in Europe. It's in Amsterdam, a nice city, worth visiting anyway.

So that's the first thing. And the same at Iceberg Summit, which you'll find on April 8th and 9th in San Francisco. That's the best way to learn about Iceberg and about Lakekeeper. That's something I would love to give to everyone. On Lakekeeper itself, we have a Discord channel, so join the channel and just have a conversation with us, the team. We'll be happy to chat with you guys.

Kostas (01:02:51.987)
Amazing. Thank you so much, Viktor, and we are looking forward to having you again in the future.

Nitay (01:02:56.531)
Thank you, Viktor. It was great having you on.

Viktor Kessler (01:02:57.198)
Thank you guys. Bye bye.
