Episode 24
· 57:26
Nitay (00:02.454)
Hussain, it's great to have you with us today. Thank you for joining us. Why don't we start with a brief background of your work experience, career, and everything. Start from the beginning.
Hussain (00:12.974)
Thank you for having me, Nitay and Kostas. So I've been a data practitioner ever since I left college, building data systems, at first as an analyst, and I was given all these shitty tools. I had Excel, and VBA was kind of the best thing I could use to program or script something. And then I had things like SAS, the statistics software,
and I spent a bunch of time just glue coding between these two things. Then you add your SQL systems, and that becomes another challenge. So my career has pretty much been about: how do you smooth the rough edges between these systems and essentially make them more composable than they were supposed to be.
That's where I started. There's a ton of pain in there. One of the pain points was that nothing talked to each other, so you had to serialize and deserialize the data, and you had to learn the language or the scripting framework that each tool provided. About 10 years ago, if you were a tool builder, you kind of naturally gravitated towards Python. And that's where my tool building
started: following people like Wes, who I noticed was on the podcast before. Wes was building out pandas, and I was essentially implementing a ton of pandas in my own work. At some point, we all learned that pandas did not scale too well, so we started building out these systems. One of those systems is Dask, a distributed pandas library,
and we did a lot of work in just scaling pandas workloads out. That was basically my corporate, or enterprise, career. From there, I landed in consulting. In the 2017-2018 timeframe, Hadoop was the thing. There was a ton of money to be made consulting on Hadoop. I was doing a bunch of Python apps on Hadoop and
Hussain (02:36.174)
fun things like YARN, and building out Python and Java interop. At some point, I moved into the federal consulting world, and on the federal side, I saw large scale: these asbestos-filled rooms with terabytes of data. That was something that was really awesome to see. I was basically surprised that we had all this data in these government agencies
that was not being touched at all. Building out the tooling for it, with open source as the primary way of doing that, ended up paving the way for the work that I'm doing right now with xorq. So with xorq, the idea is that we want to remove the glue code between all of these
systems, and the way that we do that is by building out a manifest, or a lock file, that kind of resembles your query plans from the database world, and making it multi-engine, at a higher level of abstraction where you can express workloads that transcend multiple systems. Once you're able to express that in a declarative way, that's where the fun starts, because now you can do things like:
hey, I have a machine learning pipeline that I have created a lock file for. Now I can take a node from this machine learning pipeline, or an operation from this pipeline, and pass in new data or compose a new thing on top of it. And that unbinding is kind of the key concept. We were able to get there because
we spent a lot of time building this manifest, or lock file, for your machine learning pipelines. The lock files are interesting because the idea comes from the dependency management world. Your package managers would have a Cargo.lock or a uv.lock; a pyproject.toml would have a specification, and you might lock it. Conda has the same thing. But what a lock file
Hussain (04:55.727)
implies is that you have a specification, which is declarative, and you can then lock that specification with the exact operations, with the schemas that you might be referencing, including all of your in-memory tables and so on. So the goal is that we are able to build out
a system that can express these multi-engine pipelines, but then we can lock it, and that lock file becomes a manifest where we can look at lineages and debug things. From there, we can do things that we are familiar with from Unix land, where you might want to pipe a small program into another program. We can do that because
these manifests can be composed. One part of that composition is the lock file manifest, but the other side of it comes from Arrow. Arrow provides us a standard format, so as long as your lock files always resolve into Arrow data, you are able to compose on top and build new systems without
having this predetermined thought that this is exactly what I was going to build. So that's the composability unlock. And the third thing that we were able to get to is that we can make our pipelines portable. A lot of times, in my previous world, we would write a pipeline for one system. As soon as the system changes, say you go from a development world to a
production world, those are the types of things you can now do easily, because you can compile, or translate, this lock file into the specific system or SQL dialect that system might understand. So that's kind of my story of how I arrived at xorq. The goal here is to make machine learning reproducible, composable, and portable as well.
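As a concrete illustration of that portability idea, here is a minimal sketch using Ibis, which xorq builds on; the table and columns are invented, and this shows the general Ibis API rather than xorq's own build step:

```python
# One declarative expression, rendered as SQL for different engines.
import ibis

# Hypothetical schema, for illustration only.
t = ibis.table({"user_id": "int64", "amount": "float64"}, name="orders")
expr = t.group_by("user_id").aggregate(total=t.amount.sum())

# The same logical plan, translated per target dialect.
print(ibis.to_sql(expr, dialect="duckdb"))
print(ibis.to_sql(expr, dialect="snowflake"))
```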
Nitay (07:20.382)
Before we go super deep into xorq, it sounds like you're doing some really interesting stuff there, but maybe take us back a little bit. I'd love to hear more about why we are in this place, from a data ecosystem perspective. Because it sounds to me like you have some really great experience, all the way back from VBA and SAS and so on, and then your example with Dask. It feels to me like you have a few examples where you have some really powerful, cool tool that can achieve some cool purpose,
but then all the things around it become a pain, right? Scaling becomes a pain, reproducibility becomes a pain, deployment becomes a pain. And it seems like you've seen that in a few different ways or places. So talk us through some of those past experiences, and what you then saw coming, and why the need for xorq even arose.
Hussain (08:08.278)
Absolutely. So one common scenario that I saw over and over again is that the machine learning and data science world has this divide where a lot of the work happens in the research arena and never really makes it to production. I spent a lot of time thinking through why people do all this research, and even find insights, but never end up deploying them. And the answer was that
the path to deployment, at least in the past, and even today, is very thorny. Most likely it will require that you get a notebook from a data scientist, and you are a DevOps engineer or a data engineer who is going to take that notebook and do some archaeology to understand: what did the scientist write? Did this pipeline run the same way that I'm going to run it today?
And then they cannot tell you how it was actually created. What's inside this black box? That is a tremendously manual process. People would take things from Python land, where you would have an XGBoost model trained, and convert it into C++ or Java, because the production system only supports Java. A lot of this research
is happening on the Python side, which makes a lot of sense, but Python is not very well adopted by, or does not translate well into, the production system. And that's the gap that I saw, where people were spending up to six to nine months just taking these notebook pipelines and making them useful for the business. Even today that's still the case. I was actually surprised, around the time I was founding xorq about two years ago,
that we are still not able to do this. We are still not able to go from research to production smoothly, with some confidence, knowing that this is the exact same thing that got deployed. Doing that is incredibly manual and challenging and has all these translation steps. And that's where we are now. xorq's idea is that we don't want to do this translation.
Hussain (10:28.47)
When we describe this pipeline, if you do it well enough, if you do it declaratively, you have all the information that you need encoded in that declaration, and now you can just lock that pipeline for a particular system. For example, our design partners are on Snowflake a lot, but they inevitably have Databricks too, and they want to be able to go back and forth between those two systems. And that's something
that you cannot really do without having that forethought: hey, I have this need. And that's the need, or the pain, that we are trying to solve for with xorq. Another concrete example: we spent about a year building out a machine learning platform. And the platform did really well for the research
community; we were able to build out new models very, very fast. And then, when we were about to deploy the models, something just changes. Your schema changes, or somebody might have changed the regression, the way that you want to do a particular machine learning task, or a preprocessing step. And that is not very easy to decipher, even when the
system is in production. And that's where a lot of the work is: hey, once I have deployed, I still want to know the lineages. I still want to know how a thing was calculated, and be able to debug it well enough. So it's not just a research-to-production problem. Once you're in production, machine learning systems are hard to upkeep. That
basically comes in because you lost the connection to research, and you don't have all the things declared in that pipeline for you to automatically monitor it. A lot of the monitoring of these pipelines is manual and happens after the fact. What we're saying is that you don't need to write a monitoring pipeline after you have deployed; that monitoring pipeline should just fall out of the declaration that you provided. And I think that's a foundational shift
Hussain (12:48.622)
in thinking about machine learning systems.
Kostas (12:53.679)
Hussain, I have a question. So you're talking about lock files here. These are things that have been around for a while, at least in package management, in operating systems, and we see them adopted by programming languages. When you develop, let's say, a pipeline using Python, you might use some package management system for that, right?
Let's say you have uv, for example, for Python. Why is that not enough? Why do we need something more? Why is what the programming language and the framework offer not enough, so that we need to do more for the pipelines?
Hussain (13:43.426)
Yeah, so let's think about what lock files enable in the package management world. You have a specification, which could be a pyproject.toml, where you are just saying: here are the dependencies that I want. The lock file actually pins all of your dependencies and says: here is what this recipe, your declaration, actually resolves to.
And that resolution is important, because you have this problem of "this only runs on my machine," and that problem you're able to solve by providing this lock file. But there is another problem, which is: hey, I want to upgrade my dependencies. To upgrade, you need to have a specification that is at a higher level than the lock file, and you can change that specification and go to a lock file quickly.
So a lock file gives you both reproducibility and upgradability: hey, I can actually upgrade my pipelines. There's a similar thing that is missing on the machine learning side. On the machine learning side, the world is actually more complicated, because you have data, and you have schemas, and those schemas might change. You might include a data set that is ad hoc. All of those things need to be pinned, and you need to
make a hash, or name each thing in a way that is unique so you can identify it. By doing that, you can actually get reproducibility. What happens on the machine learning side is: hey, I ran this training run last week and I have this result, but I'm not able to reproduce this result. The reason might be that you have changed something in the pipeline, and that is not apparent in just the code
or in the spec, because you have to pin that code, and it needs to resolve to a particular schema, a particular UDF that you might be using, the machine learning user-defined functions. So that's where there's a gap: machine learning is very much in this imperative world. We don't really have a spec, or a way to generate the spec, because it's mostly imperative. So for us, the
Hussain (16:03.438)
first challenge is: how do we declare these pipelines in a way that we can hermetically seal, right? Where we essentially have all the information that we need. And that's where we are now. In xorq, we adopted Ibis. Ibis is an expression system, a relational sort of DSL
that you can extend. And we extended it in order to say: hey, you can actually represent, or declare, a machine learning pipeline as a relational algebra thing, something you can extend with UDFs and aggregate variations of them. So now you can express this machine learning pipeline, take something like a scikit-learn pipeline, and
Kostas (16:52.527)
Mm-hmm.
Hussain (16:58.558)
resolve it into a bunch of SQL. That SQL is your declarative way of running it, and that's what the lock file enables. More than that, because you took your whole graph of dependencies on the machine learning side, you have a graph with all these nodes and these entry points.
So once the pipeline is locked, you can actually say: what happens if I pass new data into my machine learning pipeline at this exact point? That could be test data; that could be model scoring, just doing large-scale predictions. That is where the lock file essentially lets us change the graph in a way
that might suit a new use case that you may not have thought about. Instead of saying, hey, I have a training pipeline, I have a trained model, and from this trained model I'm going to go to a prediction pipeline, which is live inference, and then I have a monitoring pipeline, all of these things become just a single lock file that you can change around declaratively, and pipe your new
data or new use cases through. The last thing it enables, which I talked about earlier, is the portability side. One thing that we did with xorq: not all engines, Snowflake does it well, but not all engines are going to be able to run your UDFs. So you need to be able to port these UDFs, and porting these UDFs means that you've got to embed an engine. We adopted the
Apache DataFusion project, which is like an unbundled database, is really good at processing record batch streams, and lets you add in more UDFs and customize it. Because it's an embedded engine, you can port it and take it to wherever you need to run it. And that's the portability aspect that you can also drive from a lock file.
Hussain (19:19.214)
The prerequisite of this lock file is that you have to be able to declaratively express the whole pipeline, and a lot of things in the machine learning world today are not declarative; they are imperative. So we are trying to make those imperative bits fit into this declaration. One example is that you may have a preprocessing step where you are encoding
a particular column in a particular way. With that encoding, you don't really know what the output will be before you run the pipeline. So you still need to be able to declare it in a way that is flexible, where you can have opaque parts of the pipeline. But once it's trained, you can make it more transparent.
That's one of the main challenges we were able to solve with xorq: you can provide a pipeline that has these opaque pieces, and then those opaque pieces can become transparent, and declarative, once you have the information.
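To make the "imperative bits inside a declaration" idea concrete, here is a minimal sketch using Ibis's scalar UDF decorator; the table, column, and logic are invented, and this shows the general pattern rather than xorq's exact API:

```python
import ibis


# Imperative Python logic, declared with an explicit output type so it can
# sit inside a relational plan whose schema is known before any data flows.
@ibis.udf.scalar.python
def bucketize(amount: float) -> str:
    return "high" if amount > 100 else "low"


t = ibis.table({"user_id": "int64", "amount": "float64"}, name="orders")
expr = t.mutate(bucket=bucketize(t.amount))
print(expr.schema())  # resolved declaratively: user_id, amount, bucket
```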
Kostas (20:35.914)
Yeah, so what I'm thinking about while you talk about the challenges here, and what I find a pretty big challenge with trying to bring the
functionality of a lock file into pipelines, is this: when you have a programming language, you have a very well-defined, almost sandboxed environment where things happen, right? When you have crates in Rust, for example, everything is very well-defined. It's Rust code; you know where it comes from; you know where it's going to be compiled.
The compiler controls pretty much everything; there's no side effect outside of the programming language itself, in the general case, right? Now, getting into the declarative world and running pipelines in data systems, you have platforms, right? And you probably also have systems that are different.
And not only that, but say you define your pipeline, and it might run on a platform like Snowflake. You mentioned the data is there. The data is part of it; it's an artifact of the pipeline, right? For example, if you have an incremental pipeline, the output of one run depends on the previous run, right? So the data itself is part of
the lock file in a way; it should be part of the lock file, right? But at the same time, no one is stopped, in theory, and in practice actually, unfortunately, from connecting to Snowflake and just changing a row somewhere, right? Or doing something to the data that is not expected, or something upstream somehow messes up the data.
Hussain (22:26.518)
It is.
Kostas (22:51.777)
So now the contract that you have there is not satisfied anymore. How are you defining a hermetically closed system for the lock file when you have all of this? You have code; you have systems that have configuration, right, and behavior that you don't necessarily control; and you have data. And you need to
lock all of them in a way, I guess, right? I might be wrong, please correct me. How do you deal with that? Because it sounds like a pretty tough problem. I'm sure every data engineer out there would be extremely happy if this could be solved, right? A lot of on-calls would stop waking them up if these things were solved. So tell us a little bit about that.
Nitay (23:45.451)
If I may add to that, because I was going to add something along that line. It seems to me like you need some level of support from the underlying system. What you made me think of is the difference between a lock file and something like Docker, where Docker also creates a whole environment for you that is essentially safe and reproducible, but it does it because the Linux kernel supports chroot and supports these, you know, copy-on-write file systems and a bunch of magic with networks and virtual devices and so on.
Hussain (23:57.422)
Hmm.
Nitay (24:14.345)
So it seems like you would need some of those kinds of capabilities from the underlying data system, to Kostas's point, or do you have to build that yourself? How does that end up working?
Hussain (24:25.612)
Yeah. So I think the fundamental contract that we have with the data system is a schema. Schema in and schema out, that's the contract, and everything is an Arrow record batch stream. If you go back to the Unix analogy, where everything is a file, a file descriptor, for us, everything is an expression.
And that expression will definitely resolve to an Arrow record batch stream. That is the contract. So how does a Unix pipe work? You have a Unix program; you can pipe something from standard in, which is a file descriptor, and you're reading the bytes out of it. We have a similar thing that we can enable, just like a Unix pipe,
where the contract is that we have a schema in, which is some kind of Arrow data, and then we have a schema out. Everything, this whole program, becomes a transform function. You are always transforming Arrow record batches one way or the other. And that is a foundational thing for us: as long as we can
architect our data systems around Arrow and these schema contracts, we can actually get that hermetically sealed solution, because that is what the schemas give us. Now, what do we need from the underlying systems? We basically need Arrow support. You should be able to read Arrow and get Arrow back, which is something that most
systems do, because Parquet and Arrow go hand in hand, so it's kind of easy to do. As long as you can give me record batches in, I can build this declarative specification based on these schema-in and schema-out concepts. And that is also composable: just as everything composes around file descriptors in Unix, you are composing around expressions.
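A minimal sketch of that schema-in/schema-out contract, using plain pyarrow; the column names and the transform itself are invented for illustration:

```python
import pyarrow as pa
import pyarrow.compute as pc

IN_SCHEMA = pa.schema([("user_id", pa.int64()), ("amount", pa.float64())])
OUT_SCHEMA = pa.schema([("user_id", pa.int64()), ("amount_x2", pa.float64())])


def transform(batch: pa.RecordBatch) -> pa.RecordBatch:
    # Enforce the inbound contract before touching any data.
    assert batch.schema.equals(IN_SCHEMA), "schema-in contract violated"
    doubled = pc.multiply(batch.column("amount"), 2.0)
    out = pa.RecordBatch.from_arrays(
        [batch.column("user_id"), doubled], schema=OUT_SCHEMA
    )
    # The outbound contract is what downstream steps compose against.
    assert out.schema.equals(OUT_SCHEMA), "schema-out contract violated"
    return out


batch = pa.RecordBatch.from_pydict(
    {"user_id": [1, 2], "amount": [10.0, 20.0]}, schema=IN_SCHEMA
)
print(transform(batch))
```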
Nitay (26:48.586)
And so it sounds like, correct me if I'm wrong, but it sounds like what you're doing is providing stable computations, but potentially mutable data. Is that right? Meaning, if I go and run the same computation, I have the xorq lock on my computation today, and I go and run the same computation tomorrow: the schemas are the same, the underlying logic and computation are guaranteed to be the same, but,
to Kostas's point, somebody may have gone and done a point update on one particular row. And so the input and the output may be different, and that's OK, because that's what I expect when I run it on different data. Is that right? Or are you trying to also solve the data consistency problem?
Hussain (27:27.862)
So we're not solving the data consistency problem. It's basically saying: hey, we have this data, it's cached somewhere, and here's the cache. We are going to say: this is what you asked for. So the locking is not locking the actual data. That said, we do have a way of saying: I have some ad hoc Parquet files that I want to lock. You can do that with xorq, and in that case, you will get the exact same data back.
Nitay (27:31.219)
Okay.
Hussain (27:57.582)
So that's a choice that you can make, but the cost there is that you have to hash the whole data frame, and that might not be tenable for you. So the types of guarantees that we provide are around schemas. The data can change, and that might mean you have a different result, but you will know exactly where that difference is coming from. It is going to be coming from "hey, my data changed," but none of your schemas or your computations
have changed since the last lock.
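A sketch of the trade-off he describes, assuming local Parquet files and an invented file name: pinning data means content-hashing it, which buys exact reproducibility at the cost of hashing everything.

```python
import hashlib


def content_hash(path: str) -> str:
    # Hash the file in chunks so large Parquet files don't blow up memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


# Recorded at lock time; re-checked at run time to detect changed data.
pinned = content_hash("orders.parquet")
```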
Kostas (28:31.094)
Hussain, can you tell us a little bit more about the schema? When you say guarantees about the schema, first of all, what's a schema in the xorq world? How do you define it? What are the elements that the user needs to declare in order to define a schema?
Hussain (28:50.158)
Yeah, so a schema is just like the database table schema that you are familiar with. We are using Arrow data types, including nested data types, and those are the types that we use to make the opaque operations happen. So you can say: hey, I don't know the number of columns that I'm going to get back, but I know here's the nested struct that I will get back. So you have a struct type, or a
type that enables that. The schema is just column names with types; that is basically the guarantee, and that's something that we lock. When you lock a pipeline, you get the exact schema as it was at that time. If the upstream schema changes, your pipeline will no longer run, and it will tell you: hey, your upstream schema changed.
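A small sketch of that guarantee with pyarrow types, including a nested struct standing in for an opaque step whose exact width isn't known upfront; the names are illustrative:

```python
import pyarrow as pa

locked = pa.schema([
    ("user_id", pa.int64()),
    # Opaque piece: we only promise "a struct of features", not its width.
    ("features", pa.struct([("raw", pa.list_(pa.float64()))])),
])


def check_upstream(upstream: pa.Schema) -> None:
    # The pipeline refuses to run when the upstream contract drifts.
    if not upstream.equals(locked):
        raise RuntimeError("upstream schema changed; pipeline will not run")
```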
Kostas (29:45.742)
So part of the challenge that I see with declarative systems is that there are a lot of implicit things happening, right? If we think about a database, when we write SQL, you pretty much have something almost like a compiler that takes the declaration of what data you want to get at the end,
and the database engine decides. It writes code, in a way, right? It creates the logical plan, and from the logical plan the physical plan, with all the passes to optimize, which actually rewrite these things. So we might not even write a join explicitly in our SQL, but see the plan and
see joins in there, for example, right? Part of that is also that typecasting is often implicit. In many cases, in order to decide the type the output table will have, you actually need to go and materialize the thing, even if you materialize with a limit zero, where no data is getting materialized, but
you see the output schema. One of the most interesting things that I've wasted time in my life doing was going through the DuckDB implicit casting engine, which is almost a cost-based optimizer within the optimizer that tries to figure out the implicit casts
that need to happen, based on rules that are also related to performance. Obviously, there aren't surprises with very common operations: when you cast to a date, you expect to see a date. But always, with declarative systems like SQL,
the pain lies in all the edge cases that can happen there, right? So what happens if I declare a schema, and for whatever reason the target system that I'm running on decides, through the implicit casting that happens there, to change a type?
How do I figure out that this is not a problem with my declaration or definition? How can I understand what the engine did there? How does xorq deal with this pretty complicated type inference problem, I don't know how to name it exactly, dealing with
all the type magic that database systems are doing?
Hussain (33:15.918)
Right. So we demand that the user know their output type, and we make it pretty easy for you to build it. In the expression system, these things are being calculated with your real engine. So when you are locking, it's going to go and do this limit-one or limit-zero query and make sure that you get the exact resolved schema. So we have that schema. Now, if that changes, because
the physical plan decided to do something interesting with it, it's going to give you an error: hey, this is not the same schema that I was expecting. You can go back to the declaration, rerun it, and now you will get the right answer again, and that is something that you can lock. But in general, for schema inference and those things, we're operating at the logical plan level;
we're not operating at the physical plan level. So we can kind of ignore it, as long as the user, while declaring it, has access to the same system that they want to run it on. Things like DuckDB are super portable, so we can make it pretty easy on the dev side. Then let's say you wrote something on DuckDB, locked it, and now you want to run it on Snowflake, but Snowflake has this one type that is off. That will
fail at compile time; it will not fail at runtime. In the data space, what we often see is that you have all these stages in an Airflow job or whatever that pass, and at the end it's going to fail. That's what we're trying to avoid. We make sure that you can fail fast and fail early, because your schemas do not match up, and that's something that you can do at the logical plan layer.
We depend on the logical plan to give us the right schemas back. If it does not, if there's some physical plan optimization, that would be something we'll have to deal with on a case-by-case basis.
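A minimal sketch of that fail-fast, limit-zero schema resolution using DuckDB and Arrow; the table and the locked schema are invented for illustration:

```python
import duckdb
import pyarrow as pa

con = duckdb.connect()
con.execute("CREATE TABLE orders (user_id BIGINT, amount DOUBLE)")

# LIMIT 0: no rows move, but the engine still resolves the output schema.
resolved = con.execute(
    "SELECT user_id, amount * 2 AS amount_x2 FROM orders LIMIT 0"
).arrow().schema

locked = pa.schema([("user_id", pa.int64()), ("amount_x2", pa.float64())])
if not resolved.equals(locked):
    # Fail at "compile time", before any long-running job kicks off.
    raise RuntimeError(f"schema drift: expected {locked}, got {resolved}")
```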
Nitay (35:24.038)
There was something you mentioned a little bit earlier that would affect the plans getting fed into your system, which is this whole imperative-to-declarative transformation. Tell us a bit more about that, because I imagine you run into a lot of cases with stuff that is by its nature not declarative, some random custom function or script I wrote, whatever, all this kind of stuff. So how are you dealing with all those cases?
Hussain (35:41.314)
Absolutely.
Hussain (35:46.84)
So one way is this opaque type, the nested type, where you don't need to know the schema; you just know it's a struct of something. That's a way to hide a lot of this imperative code behind the declarative interface. The other thing is that we spent a lot of time on providing people escape hatches. One of those escape hatches is called a flight exchanger.
It's based on Arrow Flight, where we just have this schema-in, schema-out exchange contract. And that contract can hide anything imperative that you might want to do in there. You don't even have to be part of the SQL system; this can run outside of SQL. But the contract is just this schema in, which can have opaque types.
And you will always be able to hide your imperative things, which in machine learning you are always going to have, because a preprocessing step might generate a type of encoding that you do not know, or cannot know, upfront. Being able to do that with these higher-level types is the answer for us. Now, once you have done that and you have a trained pipeline, at that point everything should be declared,
with the exact preprocessing, the exact imputation that you might want to do, or encodings. Your trained pipeline becomes a declarative spec, and you can take that declaration and lock it, and then you can drive anything else that you might want.
Having an escape hatch is actually the most important thing in machine learning, because, to your point, it's not something that you can declare upfront. That is by design, and that's OK. So the way we're doing it is with these opaque types, and by providing escape hatches around a schema-in and schema-out contract. You could argue that it is less transparent,
Hussain (38:02.254)
because you did not un-nest your struct and give the exact columns. But that's kind of the best you can do on the machine learning side. Once it is trained, then we can actually create representations that are fully declarative. As for imperative code, especially in pandas land, a lot of our UDFs are pandas UDFs. People are
going to write imperative stuff in them, and that's just a reality that we fully accept.
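A toy stand-in for that exchange contract (not the actual Arrow Flight plumbing, and with invented column names): arbitrary imperative pandas code hidden behind a fixed schema-in/schema-out boundary, with an opaque struct for the parts you can't know upfront.

```python
import pyarrow as pa

IN_SCHEMA = pa.schema([("user_id", pa.int64()), ("amount", pa.float64())])
OUT_SCHEMA = pa.schema([
    ("user_id", pa.int64()),
    ("encoded", pa.struct([("values", pa.list_(pa.float64()))])),
])


def exchange(batch: pa.RecordBatch) -> pa.RecordBatch:
    assert batch.schema.equals(IN_SCHEMA)
    df = batch.to_pandas()
    # Imperative, anything-goes section: the engine never looks inside;
    # only the struct type is promised to the rest of the plan.
    df["encoded"] = [{"values": [v / 10.0]} for v in df["amount"]]
    return pa.RecordBatch.from_pandas(
        df[["user_id", "encoded"]], schema=OUT_SCHEMA, preserve_index=False
    )
```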
Nitay (38:38.894)
It seems like over time you, or the community, could build deeper integrations with xorq, such that those essentially opaque, void-pointer types of things become more specific, a schema of more specific types and more specific computations, once you're deep into, I don't know, PyTorch and TensorFlow and whatever else, right? So this seems like kind of a catch-all for all those cases
Hussain (38:59.598)
Absolutely. Yep.
Nitay (39:06.404)
until you have that, which makes a lot of sense. And I'd say a lot of projects start that way. Shifting gears slightly, give us a sense for the developers and users out there. What does this look like soup to nuts, right? I know with a lock file, I have this .lock file, and I have a few commands, like cargo or uv and so on, that generate that file and keep it in sync and updated. What does the flow look like for a xorq
lock file, and how does it work?
Hussain (39:37.304)
Yeah, so the way it works is: you're a data scientist or a data engineer, and you start with a notebook. You write these Ibis expressions that xorq supports. You take these Ibis expressions, which are declarative, and you have all these escape hatches in there. From there, there's a step which is a new step in machine learning, which is xorq build. You build the pipeline, and this
building will actually create a folder with your YAML specs in there, and that's the lock file that I'm talking about. This folder will also have source information, and your uv.lock in there as well, so you know exactly the dependencies that your UDFs may need. Once you have this lock file, now you need a registry of sorts, and that's what we are calling a catalog. We have an expression catalog, or a compute catalog.
This catalog has all of your pipelines, or parts of those pipelines, available for anyone else to reproduce. The catalog can be local. We have this open source GitHub where you can go and try this out yourself, and it will give you a local catalog. You can then
take this catalog, host it like a server, and provide people a connection where they can just say: okay, what was the pipeline that I had last month? I want to get the exact revision of that pipeline. You can start from there, instead of starting from some notebook with opaque code where you have to figure out what exactly ran, in what order.
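A sketch of that flow with hypothetical names throughout; the exact command shape and file layout are illustrative, so check the xorq docs for the real interface:

```python
# pipeline.py -- a declarative expression that the build step will lock.
import ibis

t = ibis.table({"user_id": "int64", "amount": "float64"}, name="orders")
expr = t.group_by("user_id").aggregate(total=t.amount.sum())

# Then, from the shell (command shape is illustrative):
#   xorq build pipeline.py -e expr
# which, as described above, emits a folder of YAML specs plus a uv.lock
# capturing the dependencies your UDFs need.
```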
Kostas (41:25.398)
Hussain, I have a question that probably connects a little bit more to the product side of things. We're talking here about reproducibility in ML pipelines. There was a whole category of products called feature stores, right? At the end of the day, feature stores were pretty much about
how we can define features, sure, but features, at the end of the day, are the outputs of a pipeline that then get used in the online world as part of, let's say, a model or something. And in a way, they tried to solve similar problems to what you're talking about, right?
Most of them, if not all of them, would introduce some DSL, a declarative one, to define the features. And of course, to define a feature you have to define the computation that creates the feature, so practically we're talking about the pipeline there. And, of course, they made all these things reproducible, at least within the boundaries of their products
and systems. A big difference that I see here with xorq is that you don't attach xorq to a specific compute engine; it's something that can work with whatever engine the users would like to use. Now, a lot of things happened. Feature stores were, for a while,
the new hot thing out there. Not anymore, for many different reasons that we can talk about. But at the end of the day, what are the overlaps between the problems xorq is trying to solve and feature stores? And what can the xorq community learn from that, to make xorq successful, from the mistakes, or let's say
Kostas (43:51.086)
the wrong decisions, that the feature stores made?
Hussain (43:56.782)
Yeah, I spent quite a bit of time just thinking about feature stores and why they did not quite take off the way we thought they might, and I would love to dive into it a little bit more. In general, what is a feature store? If you boil down what it is, I think it's mostly metadata about your pipelines. The metadata in this case is: OK, here's my feature,
and that feature might have an entity. That entity is very similar to a semantic layer, where you might have dimensions and measures. So you have these entities, and they have a timestamp. The exact metadata about this feature is an entity and a timestamp. And once you have these two pieces of metadata, now you can do things like: hey, I want to get historical, or I want to get online. Because you know the entity and the timestamp,
you can do an as-of join for historical and get the as-of data for that particular event. And if it's online, you can do the same. That is the key thing that a feature store enables: hey, I can get my historical data without any feature leakage, I have guarantees that it got the right timestamps, and the temporal aspects of it are satisfied well.
And then, once it's time for me to actually take it to production, I don't need to do more work. All the information is already encapsulated, and we just need to rerun, or materialize, these features for that particular time. Then you have this problem of cache management, so you're going to have to do a TTL on how long a feature should stay fresh.
And that's its only complexity. So what we are saying is that this is just metadata on top of engines, and you're supporting multiple engines. Feature stores inherently are not about engines or a cache. It's not about a Redis cache that's materialized; it's about the specification and the metadata about it. So we have a feature store that you can build on top of xorq
Hussain (46:22.978)
with your own engine; it could be Snowflake, could be Flink, whatever else you might have. The way that we were able to enable it is by fixing this metadata. And we're actually building it on top of a semantic layer. We built a boring semantic layer on Ibis expressions, and that semantic layer provides the building blocks for me to say: here's my entity, which is a dimension, and here are my targets.
Now I can do my joins for historical and online well. We are also agnostic to any engine, so you can say: I want to run it in production on this engine, versus another engine for dev, or however that might work. So the feature store, in my mind, is just metadata operation, or metadata management.
And that actually fits in really well with a lock file approach, because all we have is metadata about things. We can build out these classes, these higher-level abstractions, so that people can say: okay, here's my feature store, here are all the features that I have, I want these 10 features for these entities, and I want the proper temporal
Hussain (47:51.584)
aspects of them retained. So, to answer your question, I think the reason feature stores did not take off in general is that they came up in this world where a lot of machine learning was only in research. Machine learning just never went to production to the extent that we might think. And when it did go to production,
people did have feature stores in there. The other part of this is that a lot of feature stores became about real time and computation power. They started inventing their own engines, which I think was the wrong lean: I need to keep my own Redis cache that is purpose-built for this use case. That, I believe, is the wrong abstraction. You don't really need to
build infra for managing metadata; you should be executing this on the engines that are already there, which do all of these things, like as-of joins or the entity joins, really well. So why do you need new infra for it? The reasons are interesting, and I would love to get your thoughts on it. But my understanding is that
a lot of this happened because a lot of machine learning was coming from this shadow IT, and there was a parallel stack that got built up for the purpose of machine learning. It turns out that we did not need this parallel stack, because we already have these things well-defined and well-done on the database side. So that's where the feature store story is in my head: OK, why did it not take off?
People still use them. It's just that, in my mind, you did not need all this infra, which really got created by this shadow IT because they may not have had access to the same tools as the production systems. It kind of got sold to this shadow IT in that way, and it became a stack that was bound to be rolled into something else, because you have better caching in the enterprise, or you have better
Hussain (50:19.412)
SQL systems that might do that. What are your thoughts on it, Kostas? What do you think, from the product standpoint, about why feature stores did not take off?
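To make the point-in-time, as-of join he describes concrete, here is a minimal sketch in pandas; the entity, timestamps, and feature values are invented:

```python
import pandas as pd

events = pd.DataFrame({
    "entity": ["u1", "u1"],
    "ts": pd.to_datetime(["2024-01-05", "2024-01-20"]),
})
features = pd.DataFrame({
    "entity": ["u1", "u1"],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-10"]),
    "spend_30d": [12.0, 40.0],
})

# For each event, take the latest feature value at or before the event time,
# never after: this is what prevents feature leakage in historical training.
joined = pd.merge_asof(
    events.sort_values("ts"), features.sort_values("ts"),
    on="ts", by="entity", direction="backward",
)
print(joined)
```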
Kostas (50:28.43)
I mean, from what I've seen, and okay, I'm sure there are people who lived the whole life cycle of feature stores from within and have a better understanding of what happened, but I think what happened with feature stores is that,
from a product perspective, the use cases where feature stores were really something you could sell were very, very specific. You had fraud detection; obviously the market segment for that is very specific, right? It's primarily big banks. And that has,
potentially, money in it, right? It is a space where people are willing to spend money, because if they don't, the problems they will get will potentially be much more expensive. Then you have another big category of problems, which is recommender systems. Now, recommender systems are interesting in that this is a market
that is very, very conscious of costs and margins. So if you want to sell to, let's say, a big e-commerce shop, you really have to be very, very optimal to make it work for them, instead of them building it on their own. And I think the third one is
kind of similar to fraud detection; it's more like insurance, and things like loans: someone applies for a loan, and can we verify quickly whether they are eligible for this loan or not. Now, all these cases have another important dimension to them, which is that
there is an offline and an online side of the problem, with extremely different latency requirements, right? When you are doing fraud detection, you have to be really, really fast. Okay, someone can wait for a second, but you can't wait minutes, right? So, from a systems perspective, and I'd
love to hear Nitay's opinion on that too, you now have to build two systems, in a way, right? You have to build the offline system that does the training and creates, materializes, the features and all that stuff. And then you have the online one, which has to be super low latency, and you have the caching issues there: are these the latest features that I have for my user out there, to do fraud detection?
In some other cases, for example when you're doing underwriting, all the insurance and loans and all that stuff, you might need to go and query external APIs, right? Because you need to go get the credit scores and all the information from there. These are things you cannot really cache; you have to go and make requests. So now you have an external dependency where you don't really control what
latencies you are going to have there, right? And at the same time, you need all the heavyweight Spark jobs out there to go and do all the training. So I think it's a combination of a very narrowly defined opportunity, at the end of the day, from a product perspective, and a super complicated and expensive
Hussain (54:15.394)
Mm-hmm.
Kostas (54:40.096)
system to build, which might make sense for Uber to build, right? But okay, how big is the opportunity out there? I don't know. What do you think, Nitay?
Nitay (54:53.275)
Yeah, I was thinking about that as you were talking. So, first of all, I agree with a lot of what you're saying. You know, feature stores to me never took off because it always felt like kind of a feature, not a product, kind of thing. And it was one of these things that everybody had some sort of implementation of, but there wasn't necessarily one generic pattern. And the one generic pattern that was coming about was not at
this kind of perfect intersection of big enough value, big enough pain point, difficult for me to do myself, all this kind of stuff. And part of it, I think, is exactly what you said, Kostas, about the different requirements on the runtime versus the training side. What that ends up with is very bespoke, ad hoc implementations, which makes it tough to have a general solution.
So yeah, I agree with all that. And taking that to something like xorq, I think we are obviously in a very different world than we were a few years ago, even than in the years of feature stores, which on the one hand is not that long ago, and on the other hand feels like an eternity ago relative to everything that's happened. And I think we are very much in this world of
a lot of disparate systems, where multi-platform queries are the norm, right? You're getting things from multiple different data sets, bringing them together in different ways, pulling in different libraries, all these different kinds of things. So the need to have one way to get a sense of all that, to have a lingua franca, if you will, for being able to share and collaborate and work with that,
does feel to me like something interesting. In a way it brings back the same questions, the lessons from the feature store: how do you make sure that this is a killer thing that becomes a standard product, not just a nice-to-have feature?
Hussain (57:01.294)
That makes a lot of sense. I 100% agree. I think the scope was too narrow, because features do not really live in isolation. They live in pipelines; they live with models. And if you're not taking care of the whole pipeline, which includes sources, not just preprocessing, but then also splitting and how you want to train the model... I think feature stores need to be inherently part of your training world.
And that was still another step for people to do, or a new DSL to learn, in order to take things from their training into a feature store. So I think there are a bunch of integration problems there as well. And I agree that the use cases are pretty narrow, but also, to my earlier point, I think machine learning just never went to production.
A lot of people now, with agentic AI, are seeing that we need to take the machine learning stuff to production. And I think that's one good thing that is coming out of it: okay, how do we do AI? Let's at least do machine learning. I'm seeing a lot more people taking models, classic models that they were not really using before, taking them to production and using them. And that's a new thing, because
the feature stores were part of this world where a lot of things were research.
Kostas (58:31.278)
One more question, Hussain, about the approach, and the different approaches to solving this reproducibility problem. So you take the approach of the lock file. We've also seen attempts that relate more to reproducibility from the data side of things:
the concept of Git-like operations on your pipelines, but not only on the code, also on the data itself, right? With stuff like Iceberg, you have the ability to branch your data, in a way, in general considering data more as an immutable thing instead of something mutable, right?
I mean, Snowflake has the time travel capabilities. I think Bauplan is probably one of the startups really trying to define this new category, this new way of doing things, based also on technology like Nessie, which was a catalog that would allow you to have these Git operations on top of
your pipelines. How is this different from what xorq is trying to do? Or is it? Maybe they are complementary, maybe they can work together, I don't know. It feels, to an outsider, that if we relax the definitions a little bit about whether it is ML or analytical, at the end of the day everything is a pipeline, everything is about data, and the same underlying systems are used
in many cases. So how do they differ, if they do? Tell us your thoughts on that.
Hussain (01:00:33.454)
Cool.
Hussain (01:00:38.682)
So I think they are analogous, but they're definitely different. For example, in the Iceberg case, or Git-for-data with something like Bauplan, Iceberg is basically built out of a metadata layer, and on top of that metadata layer you can do new things. You can easily do schema evolution, or easily create Git-style branches. And that is happening on the data side. We are similar, but we are
on the compute side. On the data side, you still need Git for data, you still need Iceberg, or you need a way for everything to be immutable, where the ACID transactions and those types of guarantees are at a metadata level. On the compute side, we have similar things that you can do. Instead of branching, what you can do is unbind a node. If you have a machine learning graph that has
a test split and a train split, and you were using the train split to train the model and the test split to evaluate the model, you can say: I have this test split node that I want to unbind and pass new data into. So now you keep the trained pipeline the same, but you have essentially converted your training and evaluation pipeline into an inference pipeline.
You are able to do that because you have this Iceberg-style metadata, where instead of branching, we are unbinding a node. That would be one way of thinking about it. Another one is the transaction side, or the schema evolution side. A lot of that stuff becomes trivial, because we just need to swap out a connection
and say: hey, now you have a new schema in there, and you relock it with this new schema. So I would say that it is analogous, and it is complementary. You still want to have your data layer as this kind of immutable thing, and the more immutable it is, the better off we are going to be, because we can trust that these schemas are going to survive. Now you can start combining these two things together, and you can start answering the question of:
Hussain (01:02:58.754)
hey, did my data change, or did my schema change? And I think those two questions need to go hand in hand. I would say that we are kind of a new stack that is emerging around Iceberg-style metadata and composition; we'll be the analogue on the compute side, while Iceberg is on the storage side.
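A self-contained toy of the unbind-a-node idea (none of these names are xorq's real API; they just mirror the description above): a locked pipeline graph where one input node is swapped without touching the rest.

```python
# Nodes are either bound data or (operation, upstream...) tuples.
graph = {
    "train_split": [[1.0, 0.0], [0.0, 1.0]],
    "test_split": [[1.0, 1.0]],
    "model": ("fit", "train_split"),
    "predictions": ("predict", "model", "test_split"),
}


def rebind(g: dict, node: str, new_value) -> dict:
    # Swapping one bound node re-purposes the locked graph: the trained
    # model stays fixed, while train+eval becomes an inference pipeline.
    out = dict(g)
    out[node] = new_value
    return out


inference_graph = rebind(graph, "test_split", [[0.5, 0.5], [2.0, 0.0]])
```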
Nitay (01:03:24.663)
That makes sense, very cool. I think it'll be cool to see how the future unfolds with these two coming together and working together, as you said. So we're coming up on time here. Perhaps one last question to close us out: tell us a bit about the future of xorq, and if you had one wish for the data ecosystem, what would it be?
Hussain (01:03:44.558)
For the future: we just closed our pre-seed round, and we are signing up our design partners right now. We have some really good people signed up. The goal is to get our cloud product in front of them, have the open source offering work seamlessly with the cloud offering, and have this full system that can sustain these machine learning use cases, from fraud to
recommendation models and so on. If I had one wish, what would be the thing that I want? I think I would lean towards asking for more things like DuckDB: more specialized engines that can do specialized things, for the things we are writing bespoke code for today.
Those should just be done in an engine. More engines, plus a way to compose things together, means you can still have the exact, precise capability that you want. We are in this multi-engine world already, but with these in-process engines like DataFusion, DuckDB, Polars, et cetera, you do have capabilities that might not be in the main systems. So now you can do things like offload
an operation to a particular engine. Being able to do that seamlessly, as part of a larger compute graph, I think that's the challenge that is unmet. So my wish would be that we have better engines, and more of them. I think that's what we need in order to solve this problem well.
Nitay (01:05:37.378)
Makes sense. It sounds like we're moving toward even more of a best-of-breed kind of architecture, where you have the absolute best solution for each particular problem, and then you have a system like xorq that is bringing together this heterogeneous multi-engine world into one thing that can make sense to a single person. Very cool. I look forward to that future. This has been absolutely great. Thank you for joining us, Hussain.
Hussain (01:06:01.998)
Thank you for having me, really appreciate it.