Semantic Layers: The Missing Link Between AI and Data with David Jayatillake from Cube (Episode 14)

Kostas (00:02.284)
David, welcome. It's so nice to have you here today together with Nitay. Please start with a quick introduction of yourself. Tell us a few things about your past and what you're doing today.

David Jayatillake (00:15.39)
Thanks for having me, Kostas. So yeah, I'm David. I'm VP of AI at Cube. Before I was at Cube, I was running a startup called Delphi Labs. And before that, I was working at a few different data startups. And prior to that, I worked in data and analytics engineering for the first 12 years of my career, before entering data startups.

That's what I've been doing in the last few years. I think the Delphi Labs part is particularly interesting. What we built at Delphi Labs was an AI interface to semantic layers. At the time, there were a bunch of companies doing what we'd call text-to-SQL. They were using AI along with the database schema to generate a query that would then be run, and that would answer the question a user had.

And we believed that that wouldn't ever really work: it wouldn't be safe enough or accurate enough or consistent enough to be used in business. At the time, that seemed like a contrarian view, but now it's been proved out. I think people are moving away from building those kinds of tools, and if they're staying in the space to solve the same problem, they're coming closer to the method that we proposed, and

if not, they're just pivoting to do something completely different. And the reason for that is that AI is great at writing SQL, because it's read a bunch of SQL from Stack Overflow and whatever else it's been pre-trained on. But it doesn't really know about your business or your business's data model. So when you ask it to write some SQL for you,

it's kind of guessing what your data means from column names and other metadata it might have, or from having seen prior patterns of how data is created in databases. Because of that, it gets things right some of the time, but often it gets things wrong. And as soon as you have a complex data model, and I don't mean particularly complex:

David Jayatillake (02:40.168)
We've seen this on a small data model of 13 entity tables. Even getting the joins right on a data model of that scale is very, very rare for text-to-SQL. And so we could see from early on that, even with some all-powerful LLM, it wouldn't really ever be good enough, just because it's not magic. It doesn't know what to do. And that's where the semantic layer comes in.

So with the semantic layer, and let me explain what a semantic layer is up front: a semantic layer is a kind of knowledge graph. Every semantic layer is a knowledge graph; not every knowledge graph is a semantic layer. The semantic layer is a knowledge graph that expresses and codifies how your data fits together, so how you would join tables together, and also what the data means.

How can I abstract this table to mean, this is the list of customers that we have, or this is the orders that we have? It provides that level of abstraction to move from data structures to entities, and also how those entities fit together, how they relate to each other. And then what attributes they have, in terms of qualitative attributes, like dimensions: for a person, maybe it's their height, or maybe it's their nationality, whatever it might be.

But then also metrics. This could be the average height of a population or something like that. You have both of these kinds of attributes codified in the semantic layer, along with how they're calculated. So regulated metrics, like revenue or something like that, are defined once, codified, and then accessible to be used over and over again from the semantic layer. So that's the core

of the semantic layer. Now, a universal semantic layer like Cube has a bunch of other features as well: how it integrates with many, many databases, how it can handle connecting to many databases at once, how it connects to downstream BI tools and consumption points, the APIs it has, security models, caching and pre-aggregation. So there's a whole bunch of other features that a semantic layer can have, but

David Jayatillake (05:06.164)
The core of what a semantic layer does is, as I described, around entities and abstraction. Yeah.
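To make the definitions above concrete, here is a minimal sketch of what such a data model can look like in Cube-style YAML. The table, column, and member names are illustrative, not taken from the conversation:

```yaml
cubes:
  - name: orders
    sql_table: public.orders

    joins:
      # Codifies how entities relate: each order belongs to a customer.
      - name: customers
        sql: "{CUBE}.customer_id = {customers}.id"
        relationship: many_to_one

    dimensions:
      # Qualitative attributes of the entity.
      - name: status
        sql: status
        type: string

    measures:
      # Metrics, defined once and reused everywhere.
      - name: count
        type: count
      - name: total_revenue
        sql: amount
        type: sum
```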

Kostas (05:13.294)
That's great. So David, you mentioned that the semantic layer works well with AI right now; it actually helps AI generate queries that work at the end of the day. And I'd love to get deeper into that, but before we do, I'd love to hear from you: how was

life with a semantic layer before AI? And then tell us a little bit how it is after AI. What has changed with the introduction of AI? Because the semantic layer is relatively new, but not as new as AI; people had the need for it before. So how were people using it before, and how did AI change the way people work with and use semantic layers?

David Jayatillake (05:49.161)
Yeah.

David Jayatillake (06:06.74)
Yeah. And first of all, it's probably worth mentioning that the idea of a semantic layer is not that new. If you think about it, it's probably as old as analytical querying. You see tools like SAP BusinessObjects, which had the rudiments of a semantic layer as early as the nineties. And then, through the end of the nineties and into the new millennium, you've got Microsoft SQL Server Analysis Services,

which had an OLAP cube, and other OLAP-cube-type technologies. These are kind of the forerunners of semantic layers. They were more limited, but the principle of abstracting what something means was there. As for what it was like using them before AI: most of the time they would be behind a BI tool, in all likelihood. So you have things like Looker, which was

a more modern equivalent; MicroStrategy had something similar as well, and Power BI also has its own kind of semantic layer with MDX and DAX. But essentially, you define these metrics, and how the data model all fits together, in the semantic layer, and therefore avoid building pre-aggregated data marts and one-big-table-type solutions, and therefore allow

a lot of flexibility in how someone would use your BI tool to query the data. That's usually how they were consumed. Cube is also the number one semantic layer used for embedded analytics; that's actually Cube's heritage, in companies building dashboard-looking sorts of things into custom web front ends. And so

actually, Cube is used by tens of thousands of companies around the world to do this today. We don't know exactly how many use it, because Cube is open source; if they're using the open source version and not our cloud version, we just don't know. We just know we have a bunch of GitHub stars. So those are the two ways people have seen or consumed from semantic layers like this before. The other way is that

David Jayatillake (08:31.312)
One of the things that Cube does is it can effectively pretend to be Postgres. What that allows something to do is just consume data from it as if it were a database, but with the definitions of the semantic layer making things a lot easier: this is a customer, this is an order. You don't need to make some formula to calculate revenue; you just say, sum revenue, or aggregate revenue, and

you can write a very simple, three-line SQL query to pull something that's actually very, very complicated, which Cube's compiler will then compile into real executable SQL on your data warehouse. And that compiler is actually core to why a semantic layer is different from a knowledge graph: it's a knowledge graph that's in production, being used for that compilation, is kind of the way I describe it.
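As a rough sketch of that Postgres-style interface, assuming the illustrative orders model above: Cube's SQL API accepts queries like the following and compiles them into the full join-and-aggregation SQL on the warehouse. MEASURE() is Cube's aggregate wrapper; the names are again illustrative.

```sql
-- Three lines against the semantic layer; Cube's compiler generates
-- the real executable SQL (joins, aggregation) on the warehouse.
SELECT status, MEASURE(total_revenue) AS revenue
FROM orders
GROUP BY 1;
```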

Nitay Joffe (09:28.645)
Can you give us maybe a bit more context for the audience? From what I've seen, not to oversimplify, but semantic layers and knowledge graphs tend to be, if done well, really good structured documentation, essentially. And the "if done well" part is a huge asterisk,

David Jayatillake (09:31.635)
No, go for it.

David Jayatillake (09:49.043)
Yeah.

David Jayatillake (09:52.798)
Yes.

Nitay Joffe (09:52.803)
because it's a massive curation project, essentially. And it's not just one-time, it's ongoing. And if there's one thing harder than writing plain text, it's writing structured documentation. So tell us a bit about why it is so hard, and why do so many people not do it right? What's the right way to do it? Give us some context on what it takes to do a semantic layer correctly.

David Jayatillake (10:03.902)
Yes.

David Jayatillake (10:14.356)
And this is why I am probably not very confident in the success of knowledge graphs in the future. Because, yes, you're right: they are documentation of a kind. Whereas with semantic layers, because they are in the production workflow of how you query data, what you've written in YAML or JavaScript or Python or whatever it might be to define your semantic layer is then compiled by the semantic layer

in order to fulfill analytical requests. So it's not just documentation; it's living, breathing documentation that defines how your data is accessed. And that's why I'm more confident in semantic layers being used than knowledge graphs or other documentation: because they are in the production workflow. If they don't work, you know about it, because people's dashboards or data products aren't working.

So they're more similar in some ways to an ORM, I think. They sit somewhere between a knowledge graph and an ORM.

Nitay Joffe (11:25.861)
And so how do you then set it up correctly? How do you go about it? To your point: is it just sticking it in production, and then, like you said, people will notice, and so over time it will become better? Or do you have to do this upfront thing where, okay, the entire data model has to be mapped out, and you have to understand what the entire graph structure will look like? How do you do it well?

David Jayatillake (11:31.731)
Yeah.

David Jayatillake (11:46.868)
So I definitely wouldn't say you should do your whole data model. And I think that's where some of these projects fail: people get too ambitious. You see it because they sometimes think about it as if it were a data catalog or a knowledge graph, where they want it to span everything that exists. But because semantic layers are used to provide access to data, you don't need to make them cover anything that you don't want to provide access to. Your API doesn't need to

provide access to data that you don't want it to. So with Cube, what we usually say is: choose your most important, most highly accessed data. That's the place to start. Make a small part of that, a microcosm of your whole data model, available via Cube, and then expand it as and when you see that you've got other things that are heavily accessed as well whose meaning you want to codify, and where you're seeing

inconsistency in the way you were accessing them before. I don't think people would cover much more than 20% of their whole data estate with a semantic layer, because, quite frankly, the other 80% isn't very well used, so you don't need to take the time and effort to build and maintain semantic layer modeling code for it.

Kostas (13:11.214)
David, I have a question. You mentioned that the semantic layer is used to access the data, right? So if I understand correctly, now instead of going and querying my database directly, I'm actually querying the semantic layer, and the semantic layer will take my queries and translate them, compile them into the SQL of the underlying system, and return some results back. But...

I'm trying to understand, for a user who doesn't know what the semantic layer is but is an analyst: usually, let's say, they see a schema, an information schema in a database, so they can see their tables, they can describe these tables, they see some statistics around them. That's what's familiar. They see the relations there, and from that they go and build queries, right? How is...

a semantic layer different? What is the API that the semantic layer exposes at the end, and how is it richer than what the relations of the database offer?

David Jayatillake (14:10.867)
Yeah.

David Jayatillake (14:19.24)
Yeah. I think the way to think about it is how you make a request of a semantic layer. So we have a REST API interface, right? If you look at the structure of that request, it's a really simple bit of JSON, which is more or less a shopping list of the things that you want: I want these measures from this cube or view, with these dimensions, filtered by this. That's it. The joins are not there.

The joins are handled by the compiler. Even how to aggregate something into a metric is handled by the compiler. So you're just asking for the things that you want. And this feeds nicely into AI as well, but I'll touch on that momentarily. The way that works for a user, then, is that when they're using their BI tool, they're not writing a SQL query to access the data.

They're just seeing: well, here are the measures and dimensions available for this entity or this little data model. So I'll just pick those and click run, and then I get the data I want. That's already a lot better than having to write SQL, or relying on someone else to build your data to feed into your BI tool. This is where the whole more modern concept of self-serve BI

comes from: it's rooted in the semantic layer, and that's what Looker put forward.
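That shopping-list request is easy to picture. Against the illustrative model sketched earlier, a Cube REST API query is a small JSON object along these lines (member names are hypothetical):

```json
{
  "measures": ["orders.total_revenue"],
  "dimensions": ["customers.country"],
  "filters": [
    {
      "member": "orders.status",
      "operator": "equals",
      "values": ["completed"]
    }
  ],
  "order": { "orders.total_revenue": "desc" },
  "limit": 100
}
```

Note what is absent: no joins, no aggregation expressions. Those live in the model and are supplied by the compiler.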

Kostas (15:51.086)
But this semantic model has to be built by someone, right? We start with raw data, and we need to add more and more structure until we get to the point where we can expose this type of model to the user. Who is responsible for that in the organization?

David Jayatillake (15:58.29)
Yes.

David Jayatillake (16:15.294)
So very typically it has been, and I say this not as a person but as a role, right? The role has typically been a data engineer or an analytics engineer. Now, that role may have been served in the past by an analyst who was just capable of doing it; I've been that analyst before. And the other thing I'd say is: there is always a semantic layer, right? Whether you bought one or

use an open source one or not, the semantic layer exists. It just might be in the head of that analyst who knows what your data means, knows how to join it together, knows how to aggregate it to answer questions. That semantic layer is always there, because otherwise you can't use your data. So the only thing that is different when you're using a semantic layer like Cube is that you're codifying that knowledge into whatever the modeling code might be. It might be YAML, it might be JavaScript, it might be Python.

And so someone is writing that, to codify that knowledge so that everyone can use it, and so that people can agree and have consensus on what data means. So yeah, typically that's been a data engineer or analytics engineer role, performed by someone who may or may not be a data or analytics engineer. And that's probably been one of the barriers to entry: this has to happen in order

to get access to a semantic layer and the benefits of it. In some ways it's a shame that GCP bought Looker, because Looker was gaining huge momentum, and people were moving towards this idea of using the semantic layer. I think Looker's momentum has kind of waned since they joined GCP. But there are still companies like Cube who are bearing that torch.

One thing that's really interesting: I wrote a blog series recently where I was learning how to use a tool called SQLMesh, which is a data transformation tool. And towards the end of the series I was feeling ambitious, because I was using this new AI IDE called Windsurf. So I thought, well, can I automatically build a semantic layer from the metadata available inside

David Jayatillake (18:42.652)
a transformation tool like SQLMesh? What that tool has is effectively knowledge and metadata about how tables join together and how data is aggregated. Especially if you don't have a semantic layer today, most likely you're building what people call gold layer tables; this is from the medallion architecture that companies like Databricks are pushing quite heavily at the moment. I would have called these gold tables mart tables before.

But they effectively are aggregated tables. And when you make an aggregated table like that, it's effectively a SQL query that joins your relational model together appropriately, aggregates columns from those tables to make metrics, and groups by other columns, which are then the dimensions for that gold layer table. That query is actually incredibly powerful, and it

is sufficient, in terms of metadata, to generate a semantic layer. That's what I found when I built this. I built a new CLI command for SQLMesh called cube generate, and I found that, using AI and that metadata, I was able to automatically generate a Cube semantic layer, particularly using that gold layer table creation SQL.

And so everyone who doesn't have a semantic layer today has that kind of gold layer SQL, because that's what they use either to feed into a BI tool or to answer data questions. Analysts have had tricks for dealing with this since the dawn of analytics: typically they'll save base queries that are almost like templates, and they replace bits of them. But the joins are the same and the aggregations are the same; they just remove and

swap and change as they need to. That's the metadata that you need for a semantic layer. And what I found was that, using a powerful LLM like GPT-4o or o1, I was able to automatically generate a semantic layer.
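To make that concrete, here is the kind of gold-layer statement being described, with illustrative names. Every piece of metadata a semantic layer needs is recoverable from it: the join condition becomes a join definition, the aggregates become measures, and the GROUP BY columns become dimensions.

```sql
-- A typical "gold layer" table: joins the relational model and
-- aggregates it. All the semantic-layer metadata is in here.
CREATE TABLE gold_revenue_by_country AS
SELECT
    c.country,                       -- becomes a dimension
    SUM(o.amount) AS total_revenue,  -- becomes a measure
    COUNT(*)      AS order_count     -- becomes a measure
FROM orders o
JOIN customers c
  ON o.customer_id = c.id            -- becomes a join definition
GROUP BY c.country;
```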

Nitay Joffe (20:56.389)
That's a really interesting approach. And that maybe shines a light on what the future holds, which is: will semantic layers need to be human-curated, or have any human in the loop? Or will they, or should they, if I understand you correctly, be fully generated from your query log, your SQL transformations, your dbt, whatever framework your analysts and data engineers are already using to make queries? Just shove that in, and we will

create and dynamically tune your semantic layer for you.

David Jayatillake (21:30.356)
So we are certainly working in that direction. That's one of the products that I'm looking to release early next year. Well, sorry, this year. But I still think that we will need a human in the loop, because the idea of AI, even AI not doing it on the fly but doing it once, saving it, and then reusing it without supervision, is not acceptable to human users.

Firstly, because, yes, I was able to build a semantic layer very, very well, and what that means is someone can take that and use it in Cube. But they might want to extend it, they may want to edit it, and it may have some slight errors in it. But because the bulk of the heavy lifting has been done in building the semantic layer, humans can easily then, in a short amount of time, do little edits to get it ready for production. That's the joy of it.

But there still needs to be, I think, a human who is responsible for what the data means to their organization. Because I think

it's just not going to be acceptable for the meaning of data to be a black box. I just don't think companies will accept that. And I believe that even if it were perfect. I wrote a blog post called "A Darker Truth", and what I posited was: imagine if you did have an LLM that could do perfect text-to-SQL every time, some uber-LLM. What I

realized was that even that wouldn't be good enough. Even if you knew it would be perfect every time, it wouldn't be good enough, because it's a black box. It's not transparent; it's not able to explain how it's derived meaning from your data. And actually, stakeholders won't accept that. They want to know how the sausage is made with data, and that's why you will always need a human in the loop.

David Jayatillake (23:36.264)
So yes, the human may no longer be writing a load of YAML to define their semantic layer, but they are supervising the PR that AI has made to create the semantic layer or to maintain the semantic layer. So it's just going to massively speed things up and make things more accessible and easy to use.

Maybe you won't need such large teams to maintain them, that's for sure, but you still need humans there.

Nitay Joffe (24:09.687)
How does this tie to what you mentioned before? I think it goes back to when you were saying that, at Delphi, everybody was trying to do text-to-SQL, but, to put it simply, running around like headless chickens, not really knowing what they were doing; and that you had an interesting approach, tied to the semantic layer, for how to do it right; and that people are now coming around to that approach. How does all that fit in, and what is the correct approach today?

David Jayatillake (24:34.622)
So if you think about what text-to-SQL is doing: text-to-SQL was trying to do that piece that I just built, which was generating the semantic layer from the data, and then also querying the data, in one step. But the meaning is kind of inside that LLM network somewhere, completely unfathomable to anyone using that method.

What I've proposed is that the semantic layer needs to be built first. Whether it's built by AI, or with AI and human supervision, that needs to happen. And those definitions need to be agreed, and almost, not quite set in stone, but they shouldn't be moving all the time. They shouldn't be moving from query to query. They should be moving purposefully, when

you need them to: because there's been a product change, or you've incorporated new data, or you've got new information and you want to change them. They shouldn't just move because of a new query. So it's like two legs: you've got the leg where you've defined the semantic layer, and then you've got the leg where you're querying the semantic layer. And what we proposed at Delphi, and we found this worked really, really well, was actually that final leg of querying:

someone asks a question, and then we translate that into a semantic layer API request. And because we're making that REST API request I described earlier, it's much closer to natural language than SQL, because you're literally just asking for a bunch of things you want, and separating them into different arrays according to the allowed structure of that JSON request. And that allowed structure doesn't really have any room to

go off into strange directions like SQL can. There's really one way to write that request correctly, and if it doesn't get written correctly, it doesn't run. The other great thing about it is that, because you're requesting specific, known things from the semantic layer, you can say to the user you're helping: I've chosen these things from the semantic layer. And these things in the semantic layer should be business terms that they know about, like

David Jayatillake (26:58.504)
net revenue or whatever it might be. If they're asking for that sort of thing, they should know about these terms, and often they do. So if you say to them, we're giving you net revenue less depreciation, they might say: that's not what I wanted, I wanted just net revenue. And they can have that conversation with the AI interface. Yes, it might have got something wrong, but it's transparent, so it's okay. Whereas with text-to-SQL,

that level of feedback and transparency never happens. And that's why I think it works. One of the great things is that there was a benchmark that we replicated. The benchmark was made by a team from data.world led by Juan Sequeda, whom you, or some of your listeners, might know. And the benchmark was very specifically designed to test an LLM system's ability

to answer data questions with natural language. It used an insurance data model of about 13 entities or tables. So what we did was replicate the benchmark. With text-to-SQL, the accuracy was something like 16 or 17%. And with text-to-SQL enhanced with a knowledge graph alone, so essentially

using retrieval-augmented generation, taking the question and feeding the parts of the knowledge graph that are relevant to the query into the prompt in order to write better SQL, it did give a lift: it made the queries about 50% accurate. But when we used the semantic layer, which not only provides that context and knowledge graph but constrains the query to be a request

from those elements in the semantic layer's knowledge graph, and then relies on the semantic layer's compiler to generate the executable SQL to pull those things, we achieved 100% on that benchmark. It just shows how much of a difference you get from using a semantic layer. At that point, we knew the method was the right one for sure.

Kostas (29:20.536)
So David, that's a huge difference in performance, right? What contributes to that? Is it that, at the end, we put enough of the right type of guardrails there, so we give less freedom to the model and it can't hallucinate, in a way?

David Jayatillake (29:46.685)
Exactly.

Kostas (29:48.408)
Tell us a little bit more about that, and also: what are the limitations, if there are limitations, in that approach?

David Jayatillake (29:53.086)
Yeah.

So it's exactly the constraints, and it's multiple levels of constraint. Firstly, there's the constraint that it's not allowed to make things up. It must choose things from the semantic layer: those are the things it knows about, those are the things it's allowed to request. Secondly, it's constrained because it's not writing code. It's not writing some code to calculate and query some database. It's saying: well, you've chosen these things, now make a JSON REST API request. So it's literally just

taking the things it's chosen and piping them into the right structure to make a request. And so that request format is another constraint, another guardrail stopping it from hallucinating. Combined, the context constraint and the request constraint are really powerful. It kind of makes it somewhat boring, because it just does the right thing.

Or, if you ask it for something that it doesn't know about, it just says: sorry, I don't know. And I love that, because as an analyst, when I was asked these questions, if someone asked for data that I didn't have or didn't know about, I said: I don't know. Maybe I'd have to go and ask someone else, or figure out if we needed to ingest new data, or whatever. But I didn't make up a query on some random data that I thought might be the right thing, right?

And that's good. So for me, that's a good thing. But that's also a limitation. It's not just going to be able to answer any question on your data. You have to have the semantic layer codifying the meaning of that data and how to use it. That has to be done up front before AI can query it. That's one limitation. The second limitation, which we're working on, is that semantic layer queries

David Jayatillake (31:52.735)
are usually somewhat simple, in the sense of: you want some metrics, by some dimensions, filtered by X and sorted by Y. That's kind of the structure of a query. They don't typically do very advanced things like regressions or forecasting or root cause analysis, all of these advanced things that you can do with analytics now. That's something we will work on in the future, for sure.

But those are some of the limitations of the method: you need to build more than just what I've described, in terms of pulling the query, to do everything that you might want to do in analytics.

Kostas (32:34.126)
One question I'd like to hear your experience on. When you talk about the mistakes that LLMs can make when you ask them to formulate a query to answer a question, I see two levels there of, let's say, hallucination. One is hallucinating on the syntax: does this thing write syntax that is correct?

That's probably a little easier to deal with, because at the end of the day most of these should fail in the query engine and return some error, so you can create some kind of loop there and try to fix it. The second level is semantic errors, right? Let's say, for example, and I'm saying this from personal experience:

you feed a schema of a table to an LLM, and you have multiple different IDs there. Although it may be obvious to a human which one is the user ID, that's not necessarily what the LLM will pick. And that's a very simple case; it's just the very straightforward semantics of a column. This can get

David Jayatillake (33:49.256)
Yes.

Kostas (33:58.734)
complicated to the point where even for a human trying to debug it, it will be hard, right? From your experience building these things, what have you seen in terms of where the LLMs mostly fail, and where the guardrails of the semantic layer bring the most value? Because from what I understand, it brings value to both, right? It removes the issues from both layers.

David Jayatillake (34:21.481)
Yeah.

David Jayatillake (34:25.042)
Yeah. So the value, I guess to start with the semantics: because the semantic layer exists, you know what exists in the data, right? Whereas with text-to-SQL, it's guessing what it thinks is in the data, based a bit on the question, a bit on the schema, and maybe some additional prompt context it may have.

But the semantic layer knows: these are the things that exist; I don't know about anything else, and I don't need to know anything else. Then there's the additional kind of semantic understanding of: how do I use this, right? With text-to-SQL, it needs to know how to use those things together. How do I join this? What's the primary key for the foreign key relationship? How do I write a window function or something to get the right number? That is all codified, again, in the semantic layer,

because someone has said: you join using these keys; you sum this column to get revenue. So again, it's not having to guess that. It's just pulling it, and enjoying the benefits of the semantic layer and the compiler and everything that has gone before. So I think those are the two main ways. And then also, of course, there's the query syntax. Cube has

spent many years building query rewriting capability for the compiler to work across tens of different database dialects, all the big ones: Snowflake, BigQuery, Redshift, Databricks, whatever it might be. And therefore it writes correct SQL deterministically. There's no probability of getting it wrong; it's always going to write the correct syntax. And if it's not doing that, that's just a bug in the semantic layer compiler,

which does happen, of course, but then it gets fixed by an engineer. So those are the three ways I'd say it helps, as you asked. The other thing I'd say is that when using the semantic layer with AI, it's not magic. You do need that semantic layer to be nicely described and structured.

David Jayatillake (36:44.914)
The way I usually describe it is: imagine someone from your industry, but from a competitor or some other company, came to your company tomorrow. Would they understand that semantic layer, or is there so much jargon and so many acronyms that they wouldn't know what things were? Things like: if it's a customer, call it a customer; don't use some strange abbreviation. Or if you do have a strange abbreviation for a metric name, and your dashboards depend upon it,

put something in the description that says: this is a customer. Then the LLM can pick up on that from the metadata piped into it in retrieval-augmented generation. So the way I describe how to prepare a semantic layer for AI is in three ways. Firstly, clarity of names, like I just described. Secondly, dissimilarity of names: if you think about how LLMs work in a vector context, you want things to be as

different as possible in naming. And then thirdly, unlike humans, LLMs don't do TLDR, right? They are happy to read a lot of information and metadata. So put plenty of information in the descriptions and other meta tags that you might be able to have, and the LLM can pick it up and perform better with it.

And those are the three ways, which I don't think are actually any different from what makes a semantic layer better for humans, to be honest.
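Applied to the earlier YAML sketch, those three points might look like this: a clear, distinct name plus a generous description for the LLM to retrieve. The wording and fields shown are illustrative:

```yaml
measures:
  - name: net_revenue
    title: Net Revenue
    description: >
      Revenue after refunds and discounts, before depreciation.
      Sometimes abbreviated as "NR" on older dashboards. Excludes
      affiliate payouts; see gross_revenue for the pre-discount figure.
    sql: amount - refund_amount - discount_amount
    type: sum
```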

Nitay Joffe (38:15.237)
So building off that: you made some really interesting points there, that naming matters deeply. And in my experience, to your point, even simple concepts can have huge complexity in the number of different ways to view that same concept. Take a notion like revenue, right, or income. It could be income to the business, it could be income post-discounts, it could be income that's...

David Jayatillake (38:33.128)
Yes.

Nitay Joffe (38:41.581)
taxable; it could be recognized revenue versus booked versus collected; all these different things, maybe something that's valid for an affiliate or not, right? And so the question that leads me to is this. I imagine in your world, part of the vision, part of the ideal here, is that the people who are curating

the semantic layer, the people who are setting this all up and being the humans in the loop, can enable the army of analysts and so forth who don't have that expertise to do what they want, and get it right, ideally, every time. And my question for you is about that user hat, that user workflow. Take the view of somebody who's coming in and, to your point, just trying to put some dimensions and measures together and make some chart: they hop into their product, their BI tool, whatever it is, and they put a bunch of things together.

The result is not what they expected. And it's not just that it's not an interesting insight; it's so far off that, okay, clearly it's wrong. What do they do? Because at some layer they've got to be guessing, or talking to somebody, to figure out: where did it go wrong? Meaning, building off the previous question, it's a semantic issue, but where is the semantic issue? Did they ask the wrong question? Did the LLM

go off the guardrails in some weird way? Or even if the LLM did behave nicely, maybe it just chose the wrong revenue measure to use, even though they're all allowed within the data catalog; or the LLM just generated the wrong query on top of metrics and dimensions. So how do you debug this whole system? How do they know whether it was the LLM, or the semantic layer, or them? What do they do?

David Jayatillake (40:29.298)
Yeah. So this is where the joy of the semantic layer's transparency comes in, because the LLM can say to you: I chose net revenue less depreciation, or, I chose EBITDA. And the person can ask: okay, what does EBITDA mean? It can either take the description and show it, or maybe what that thing is is already available to them. And then

they can possibly figure out that, actually, this is not the right one of the 17 flavors of revenue, that they wanted another one: give me that one. And because they've clarified, the LLM can hopefully do a better job of choosing the correct one. Or it can just say: here are the 17 flavors, which of these did you specifically want? That was difficult before LLMs, right? Because they'd have to dig around in some tool,

and these interfaces can be quite complicated, so they may not find the one they want. Whereas for an LLM, that's not a problem.

Nitay Joffe (41:37.909)
Looking forward a little bit, how does this whole world change as these LLMs now become agents?

David Jayatillake (41:46.75)
So this is a really interesting thing that I've been thinking about for Cube recently. And the reason we've been thinking about it is probably the reason you're asking the question: we have these new kinds of, universal agent interfaces is what I'm calling them. And when I say universal, I mean within a company. So let's say it's Microsoft Copilot or Amazon Q or whatever it might be, from the cloud providers.

They're all wanting their customers to start using these single point agents to answer pretty much any of their business questions. Now, these agents obviously are not capable of doing that out of the box. And so you can imagine with something like Microsoft Copilot, for example, that we could actually integrate with it. A lot of the time, these agents have

extension or application frameworks. And what we could do is use something like our AI API, which we have today as a product, which uses the method that we've been talking about throughout this conversation. And what could happen is if that agent is asked a question, the agent would have to do something in terms of routing, in terms of

Well, I know that this is a quantitative question, and therefore I'm going to route this not to my document search option, which is reading through your Notion, and not to looking up a specific record in some OLTP ERP system. No, this is actually an analytical question, so let's route it to the Cube AI API extension or application.

And so then that question gets routed to the Cube AI API. It can try to answer it using the semantic layer, as we've just been talking about, respond back to the agent, and the agent can pass that through to the user. And the user doesn't even have to know anything about the semantic layer, or how to use a BI tool, or even what data they have in the company. What data they can access is defined

David Jayatillake (44:07.102)
by the RBAC that governs their access to the agent overall.

David Jayatillake (44:15.764)
Does that make sense?

Kostas (44:15.854)
I have a question about the agents case. What I hear from you, David, when you describe this scenario, is that you have this piece of software, backed by the LLM obviously, that is kind of free to roam around, find services, ask questions, use the services, and then come back to the user with some results, right?

How does the agent discover all these different services that can be used? And do we need to build some kind of protocol or API for agents, at the end of the day, to communicate and consume this information? To me, and again, it might be because I'm coming...

maybe I'm too old, in a way, and I've seen this before: now we have services, and we need to make the services discoverable to other services, and we need a layer that allows the discovery of these services. And of course, back then the service was pretty dumb, so we had to be very explicit with our protocols and all these things. But my experience so far, at least with agents, is that it

doesn't work to just put something out there and expect it to figure things out on its own. So I still have the feeling that there's infrastructure missing there. Can you tell us a little bit about what you see, what you've seen working, and what's missing? And that's a question for you too, Nitay, by the way.

David Jayatillake (46:07.284)
So maybe I should go first. The way I think about it is, well, I'm not a software engineer by background, though I know architecture relatively well. But I've heard of things like Swagger before, which is self-describing APIs. I've also seen, just today, two new pieces of software which automatically probe an API and generate metadata about it. And I'm thinking:

the reason these have come out is probably that people are trying to get LLMs to automatically use APIs without having to be told how to use them. So these are the ways I see it happening. Maybe there's standardization: when you make an application for Microsoft Copilot, for example, it expects these possible parameters to the API

that the Copilot is going to use. And that might be very, very basic. It might just be: role is user, or role is agent, or role is application, and then the text that comes back, and that's all that's allowed. It could be as simple as that. And then Microsoft Copilot knows: okay, I'm going to use this thing, I'm going to say I'm the agent, this is the user's question, this is the text, and I'm going to receive text back from the application and I can send it on.

It could be as simple as that. I imagine there'll be more complexity over time, especially when you think about security. So that's one way: standardization. The other way, which maybe scales up a bit faster, is the OpenAPI standard, if I'm correct. Maybe any API that conforms to that standard can become accessible to an

agent like Microsoft Copilot or Amazon Q, because they build some standard way to use those APIs. Now, my understanding of the OpenAPI standard is that it's still quite deep; it's not necessarily easy to know how to use an API just because it meets that standard. But at the same time, it might make it easy enough for an LLM to use it automatically. That's the other thing. Maybe that is an application in itself:

David Jayatillake (48:29.992)
how does an LLM use any OpenAPI-conformant API? Maybe that's possible, but I haven't seen it yet.
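As a sketch of the kind of minimal standardized surface being speculated about here, an agent-facing endpoint described in OpenAPI could be as small as the following. The path and fields are hypothetical, echoing the role-plus-text shape mentioned above:

```yaml
openapi: 3.0.3
info:
  title: Analytics question endpoint (hypothetical)
  version: "1.0"
paths:
  /v1/ask:
    post:
      summary: Answer a natural-language analytical question
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [role, text]
              properties:
                role:
                  type: string
                  enum: [user, agent, application]
                text:
                  type: string
      responses:
        "200":
          description: Natural-language answer, plus the semantic-layer query used, for transparency
```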

Nitay Joffe (48:43.737)
Yeah, on my side, I think one of the interesting things, and I forget the name of those one or two tools that had this capability, is that there are essentially multiple different levels at which you can understand an API. The level that you were alluding to, David, the kind of Swagger type of thing, is essentially documentation, either structured or unstructured. The traditional way is structured documentation, akin to what you're saying, where you have some Swagger

spec that says: here's the API, here are the inputs it can take, here are the outputs it expects, here are the side effects, et cetera. Then there's what I think of as the more unstructured level, kind of like OpenAI's function descriptions, I forget what their exact thing is, but basically it's just a bunch of English that says: hey, this is a function, it should do this, feed it things like that and you'll get X. Feed it English words and you'll get capitalized sentences; that's what it does, right? And the third level is

the one where you ignore that altogether and just shove data at the function, see what comes back, and infer what it does. And I was trying to think of the name of a tool I ran across the other day that kind of blended between them. It came, in particular, from the world of testing, because you can imagine automated test generators where you give it some function and say: hey, this should capitalize words. And it goes and says, okay, I'll give it

"the dog was in the park", and yep, it capitalized "the". But what if I give it UTF-8 stuff? What if I throw XML at it? It just starts barraging it with all these corner cases and things, in order to make sure that your function really is robust and safe, kind of defensive coding, if you will, right? And as part of doing that, you can then do a next-level inference that says: what this function does is capitalize letters or sentences, et cetera.

And I think the interesting thing is where you can stitch between these different levels. There was a guy at Google long ago who said that the real spec of a function is all the possible side effects it can have; that the real public API is that any API, if it's out in the world in public for long enough,

Nitay Joffe (51:01.209)
will have its full extent, its full superset of all capabilities, exposed, because some user somewhere will try to reverse engineer it or bombard it, and will discover: hey, if I put in this UTF-8 thing, it doesn't just capitalize, it makes the letters pink; it could return something else. That's kind of cool; the function does that now. And eventually so many people start calling it with that thing that it becomes a thing you now have to support, because if you change that behavior, it breaks somebody else's app, right?

And so there's this interesting thing, Kostas, to your question of how you discover the full extent of what an API does. I think the initial take is: yes, you just use the Swagger spec or some English spec; you use that for, okay, here's what it should do, and then you assume it behaves well. Where things get to the next level, similar to the discussion we had around the semantic layer here, is when you then start building an agent in production around that system. It will eventually bombard it with everything, and so you will start discovering

perhaps unintended behaviors, and it will keep building off that, because it's just going by what the spec is. So then, to me, the interesting question becomes: does the agent have enough self-awareness, enough validation and checking on top of it, to keep checking that the function does what it said it was supposed to do? Because I think that's where the system gets interesting over time.

David Jayatillake (52:20.828)
Or whether it can heal, right? So let's say it depends on something it shouldn't, and that breaks over time. Can it just heal, and use the API differently, or do some post-processing to adapt? That's something I've kind of seen as possible. When I've been using Windsurf, I've built something that uses a function, and its use of the function didn't work. And I fed the error back into Cascade, which is the

Kostas (52:20.833)
Yeah.

David Jayatillake (52:50.478)
sort of conversation part, and it says: okay, I see this error, I need to change this. And then it fixes it, and a lot of the time it then starts working again. So I wonder if they can heal in that way.

Nitay Joffe (53:01.668)
Exactly.

Kostas (53:02.86)
Yeah, that's very interesting. I think the connection you make, Nitay, to software testing and the use of fuzzers, for example, because fuzzing is the technique you're talking about, has been used very successfully. I think it started with security, but it's been used a lot in general software lately too. I think it will be interesting to see how...

some new developments there, like the deterministic replaying of execution that systems like Antithesis do, for example, maybe can be used. But the question is, and I say that having been at a conference about databases a couple of months ago, where pretty much all the people were from companies like Meta and Google:

when they were talking about the progress they made migrating from one technology to another, a lot of the work was around: we had to build this fuzzer to do this, and run these tests, and all that stuff. So someone still needs to write that stuff, right? And it's far from trivial how to do it. So it goes back, I think, to something

similar to the question of who builds the semantic layer for the LLM to go and consume. It feels like there's a chicken-and-egg problem out there, and, I don't know, maybe whoever figures this out will make a lot of money out of it. But it kind of feels like we are at the stage where we kind of know what is needed, but

who builds it, and how we can build it efficiently, and how we can curate all of this and create all this tooling, is still something that the industry has to figure out. And, I don't know, this kept coming to my mind when both of you were talking about the agents: I feel like you can build all these sophisticated things and run the agent, and then just get a message after a few seconds that you've run out of tokens, because you're rate-limited by the API

David Jayatillake (55:21.502)
Yeah.

Kostas (55:24.337)
of the LLM service provider. Anyway, very exciting things. But I know we're getting a little bit closer to the end here, David, and I'd like to ask you about what's next for Cube, semantic layers, and AI. You are spearheading the function around how AI is used.

David Jayatillake (55:50.185)
Yeah.

Kostas (55:50.286)
So what excites you, and what will 2025 bring to the market?

David Jayatillake (55:56.594)
Yeah. Because I've been writing this white paper, which has kind of codified my thoughts on this topic, the way I think about it is split into development and production. So if you think about development for a semantic layer: we already have a copilot in our IDE. Cube Cloud, actually Cube in general, has an IDE where you can make Cube YAML, JavaScript, or Python changes, and then you have a playground to immediately

experience what those changes have done. That's why people actually like to use our IDE, as opposed to their own, when they're developing Cube. So we have a copilot there, but we're trying to think bigger. Yes, the copilot helps someone who already somewhat knows what they're doing when they're writing Cube YAML, but maybe they don't come to it that often, so it just speeds them up. Actually, for someone to start

from nothing to build a semantic layer, I think the blocker is bigger, more of a mental block than anything else. So what if we could have something which could maybe just connect to their data warehouse, or take a text file of their important queries, parse it using something like SQLGlot or something else to find the joins and the aggregations, and then automatically generate a semantic layer, even with AI? Even if it's not deterministically

perfect, it's good enough to start them off with a semantic layer that works, where they can then see: okay, I see how you build metrics, because you've already built things that I know about; I'm just going to add things. Because that's the thing, fundamentally: it's a lot easier to add to a semantic layer than to build one from the ground up. And I think that will really help people. Internally we're calling this the AI data engineer; I'm not sure what we'll actually call it as a product. But yeah, that's one of the things

I'm working on from a product point of view. The other thing I'm thinking about is the production side, which is actually consumption: how do we engage with these agent frameworks, whether that's Microsoft Copilot, whether it's Amazon Q, whether it's GCP Vertex, or anything else? I'm sure Databricks and Snowflake will have their own agents at some point. How do we integrate with those?

David Jayatillake (58:22.612)
And so that's where we've been thinking. I don't think we've seen exactly what they would want yet, in terms of integrations. Some of them have made methods of making applications or extensions for those agents, and that is a form of standardization in itself. It's a little bit heavy, in the sense that you actually have to build a piece of software to

integrate with; it's not just an API. But we need to investigate those more and learn how they work. And then, also, I'm excited about actually going a bit further than just pulling data. Even though that is complicated: just asking, what's my revenue by month for this marketing channel, sounds like a simple query, but

when you translate that from a simple Cube request into executable SQL, that could be a hundred-line SQL query, right? That's pretty complicated. But people actually want to do more. They want to go beyond descriptive analytics: they want to ask questions like, what should I do? How should I change my marketing budget allocation? They want to actually have assistance with decisions and recommendations.

And they want to forecast; they want to know, well, what would happen if I changed this? Those are the things that I'm kind of excited about. Okay, so let's say we've got to a point where people can get a safe dataset from AI. Can they then ask AI to extrapolate? Can they ask AI to dive deep into it and explain things from it? Those are the next things that I'd like to do.
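For the "parse their important queries" step described a little earlier, a rough sketch with SQLGlot might look like the following. The query and the naive extraction are illustrative; an AI data engineer along the lines David describes would hand fragments like these to an LLM to draft the actual model:

```python
# Sketch: extract join and aggregation metadata from an existing
# gold-layer query with SQLGlot, as raw material for generating
# a semantic layer. Illustrative only.
import sqlglot
from sqlglot import exp

sql = """
SELECT c.country, SUM(o.amount) AS total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.country
"""

tree = sqlglot.parse_one(sql)

# JOIN nodes carry the join conditions -> candidate join definitions.
for join in tree.find_all(exp.Join):
    print("join:", join.sql())

# Aggregate functions -> candidate measures.
for agg in tree.find_all(exp.AggFunc):
    print("measure:", agg.sql())

# GROUP BY expressions -> candidate dimensions.
for group in tree.find_all(exp.Group):
    print("dimensions:", group.sql())
```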

Nitay Joffe (01:00:10.017)
That's some really cool stuff. The last question I have for you, related to some of this: I hear a lot of the AI community talking about how end-to-end things will become, or not. For example, a lot of the stuff, even today, but certainly in the early days as the whole LLM wave was starting up: people think you throw a bunch of data in, you train, you get a model out.

In reality, there are a bunch of processes along the way that are actually distinct steps, right? There's tokenization; there's, as we talked about here a little bit, embeddings and RAG; there's the supervised training; there's the RLHF kind of stuff, where you're learning from feedback. So it's actually more and more discrete steps. And one of the things I've heard more and more folks espousing is the thesis that what

can or should be done is to make the model pipeline, the model step itself, bigger and bigger and bigger, so that, for example, the embeddings just become more parameters in the model itself, and during the training process you actually get embeddings as part of the output, and the end result tunes itself. And my question to you here is: could that, or should that, even happen with the semantic layer? Would the semantic layer ever be

just part of the model in and of itself? At the end of the day, you have this end-to-end flow where you're trying to create some BI report based on revenue, and then maybe time, whatever. And as part of that, the AI will actually go and realize: I need to change the parameters, and the parameters that I'm changing actually are the semantic layer itself; versus, I need to change the SQL generation functions; versus, I need to change the business understanding functions.

David Jayatillake (01:01:47.102)
Yeah.

Nitay Joffe (01:01:53.609)
Is it all just different parameters, or will they stay distinct objects, distinct systems, distinct steps, kind of forever, per se? Does that make sense, the question?

David Jayatillake (01:02:03.378)
Yeah. I wasn't familiar with this idea of passing in embeddings as almost like configuration into the LLM. That's really interesting. I wonder, especially given that we're hitting some diminishing returns with LLM performance already, whether

this will provide consistent enough performance. And if you think about how complex a semantic layer, for example, could be, possibly thousands or tens of thousands of lines today, and how complex it could be even compiled, because that's what you'd put into the LLM, some kind of embedding equivalent of the compiled semantic layer. I just have no idea

whether it's workable. But if it is, that could be very, very powerful. But I do know, and this feeds back into the agent frameworks, or the graph-type frameworks, that we're seeing, that if you can break those steps down and do each step very well separately,

we know that that's working today, because those tools are doing better than the ones that try to do everything in one shot and get it very, very wrong. But then there's the question of latency and cost and things going on, which we need to solve as well. But I think there have already been really big strides there, and the agent framework that we just discussed is almost a formalization of that.

Kostas (01:03:52.472)
That was great, David. It's time for us to wrap here. Thank you so much. We definitely need to have you back, I think in a couple of months, especially after you at Cube have released some of the very exciting features you talked about; we should have you back and see how the landscape looks. Things are happening so fast, and things are changing pretty much...

every month. So we'd love to have you back. Let's do that. And again, thank you. It was great to have you here, and it was a very exciting and interesting conversation.

Nitay Joffe (01:04:33.241)
Thank you, David. It's a pleasure.

David Jayatillake (01:04:33.576)
Yes. Great to be here. Thanks for having me.
