MLOps Evolution: Data, Experiments, and AI with Dean Pleban from DagsHub (Episode 4)

53:55

Nitay (00:04.128)
All right, Dean, it's great to have you here with us today. It's a pleasure having you with Kostas and me. We'd love to start off by having you just tell us a bit about yourself and your background.

Dean (00:14.936)
Thank you for having me. So I'm Dean. I am

co-founder and CEO of a company called DagsHub. I'm located in Tel Aviv, Israel. My background is a combination of physics and computer science. On the physics side, I did a bunch of research around quantum optics, not really related to the work I'm doing today. And on the computer science side, I worked a bit on image processing and computer vision, and that's still my comfort zone in the world of machine learning. But at this point, I've also had the pleasure of working

on some LLM projects and reinforcement learning and a bunch of other interesting things. And today I'm the founder and CEO of a company called DagsHub, which I founded with my co-founder and friend since kindergarten, Guy. And what we're doing is, basically, DagsHub is a platform for machine learning (now AI, if you're sticking to the current buzzwords) teams, to help them manage their machine learning projects. Sort

of like GitHub plus Hugging Face for enterprise teams, so you can manage datasets, experiments, and models in one place. And I'm also the host of the MLOps Podcast, where I talk to industry experts about how they get their models to production, what challenges they face, and how they overcome them. So that's me. Thank you for having me.

Nitay (01:41.629)
Great, great, Dean, thank you. So help us maybe understand a bit, beyond the buzzwords, as you said, the difference between ML and AI. And what is MLOps, really? Like, what does this stuff actually take to do in real life?

Dean (01:53.75)
Yeah, so if I go back to the early days of DagsHub, I think that the insight that we had that led us to build this company is that

When you're just working with code, you have a pretty good life and there's a lot of tools that help you get to production, work as a team, and sort of build a structured flow from an empty text file to a production application. But when you add data to that mix, things don't work as neatly and there's a lot of friction and things that are counterintuitive. There's not a lot of tools. A lot of the tools are very early stage, so it's not really clear how

to flow with them. And that means that getting to production is difficult. On top of that, working as a team is difficult. So in a lot of organizations, especially larger ones, there is this thing where you might have five teams working on the same problem and not knowing that each of them is working on it. And so they're not sharing intermediate results, whether it's models or datasets and things like that. And so the way we got to where we're at is

we asked ourselves, okay, what are the main differences between DevOps and what is now called MLOps? At the time, MLOps was sort of a nascent term. And the two main insights that we had about that were: one is that there's data, right? Like if you think about software, then code is the source code. And if the three of us have the same code snippet, then we can, theoretically at least, run it and get the same result. But with machine learning, if you have

the same code but different data, you're going to get very different results. So one thing that you need to address is how you manage data alongside your code, and maybe treat it with the same importance, or maybe as even more important than the code that you're using to build that model.

Dean (03:49.772)
And the second thing is the experimental nature of machine learning. So in software, the way I like to explain this when I'm talking to software developers is: there's no real situation in which you try to build a feature and it's literally impossible. You might give up on that feature because there is not enough time or it's not high enough priority or something like that. But in the end, it can get

built, right? But with machine learning, you might literally be working on something that is impossible. And so there's an experiment that you run and you try to get a certain result, but sometimes you don't get that result, or you get results that are not good enough to actually put this in production. And so you need to take into consideration that experimental nature, where, if you're thinking about this from a Git perspective, right, so where you have like a master branch, in machine learning you might have branches that are dead ends, which is not very common in software development. And so you need to be able to sort

of manage those dead ends, because they're still important. If you're working on long-term projects, you don't want to repeat the mistakes of the past, and so you need to know what you already tried and what failed. And so those sorts of insights are what led us to build the platform the way we do. And so I would say that the idea is you need to manage models, experiments, and datasets alongside your code.

They can't be in a separate place, because otherwise you lose your hands and your feet the moment the project grows complex. And you need to respect the experimental nature: being able to run things that fail, but then sort of investigate why they failed so that you know what to focus on next and how to actually succeed, if that makes sense.

Nitay (05:30.843)
Yeah, it does. So what you're highlighting here is that there's kind of a three-headed beast between code, data, and experiments that behave very differently and need different kinds of processes and products and tools around them. What have you seen, today or maybe historically, in terms of how people tried to shoehorn the problem, if you will? I imagine people, like you said, came from the code world and said, okay, well, data, I'll just throw my code tools at it; experiments, I can just do more of what I've done historically. Why wouldn't it work?

Dean (05:58.818)
Yeah. So I think there are a few sort of Stone Age tools that we've seen applied to machine learning in the hopes that they'll just work. We coined this term, well, one of our developer relations people coined this term, Stone Age versioning (shout out to you). But basically, a lot of times what teams would do is they would either,

I mean, there are a bunch of failure modes, but let's say the most common one is: well, it's data, but we can shove it into Git, it'll be okay. So that would be one failure mode. And that obviously works if your data is super small and doesn't change often. But the moment that is no longer true, then you kind of need dedicated tools to handle that. The other approach would just be to throw everything into an S3 bucket and hope for the best, and then you sort of manually manage the versions

in different folders and things like that, and that can also lead to a lot of pain. And the other thing is for experiments, right? The most common experiment tracking tool, I think still to this day, is an Excel sheet or a Google Sheet. And that's obviously a bad idea for serious projects, but it does get you somewhere when you're starting out. And in the spirit of MVP mentality, I don't think it's necessarily bad for day zero.

But given the tools that do exist today, you can basically get a much better result with very little effort, something that will get you further along without needing to rebuild your entire stack the moment you grow above two people in a team. So I think those are the two main failure modes that we've seen. And I guess, just to briefly explain the way we sort of

think about the solution: for data management, we actually think about three tiers that you might need for data versioning. So the first is your data never changes, in which case putting it in an S3 bucket and just calling it data, that's fine. I think that case is more relevant for academic use cases where you're doing research on a benchmark dataset and it never changes.

Dean (08:23.552)
But then the second tier is the most common one, which is: my data is append-only. So the simplest example is maybe I have, whatever, surveillance footage or something like that, or I'm collecting data from the wild in some way. Then I have my dataset and it might be

images in this example, again, I tend to go to computer vision because that's my comfort zone. So I might have a bucket of images, but more images or more videos are added every day. And so that requires some form of versioning, but you can sort of reduce that versioning to datetime versioning. So if you say, show me my dataset as it was at date X, then you get the canonical version, because you're not overwriting your data. And then the third level, which is the most complicated one, is if your actual underlying files are changing.

And in that case, you actually need to version the files very similarly to what you would do with Git, but you need to treat it slightly differently, because datasets tend to be much larger than your Git repository and things like that. With experiments, it's a bit

simpler, right? Because you can think about experiment tracking tools today as much more sophisticated Excel sheets. So it's a place where you can log all the information, the parameters, the metrics and everything. It automatically creates charts and visualizations that help you understand how your training run is going, and also, in hindsight, compare between different experiments. And it lets you log artifacts, which ties into the last thing: the model

is technically an artifact. So the model that you're building is an artifact of your machine learning building process, but you obviously need to log it in order to deploy it to production and use it in the end. So most experiment tracking tools also let you, to a certain extent, manage your models. And that's how we're thinking about the ideal solution here.
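To make the append-only tier concrete, here is a minimal sketch, assuming the raw files land in an S3 bucket and boto3 is available (the bucket name and prefix are hypothetical): reconstructing the dataset "as it was at date X" reduces to filtering objects by their upload timestamp.

```python
from datetime import datetime, timezone

import boto3  # assumption: the append-only data lives in an S3 bucket


def dataset_as_of(bucket: str, prefix: str, cutoff: datetime) -> list[str]:
    """Return the keys of every object added on or before `cutoff`.

    Because the data is append-only (files are never modified), this is a
    complete, canonical snapshot of the dataset at that point in time.
    """
    s3 = boto3.client("s3")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["LastModified"] <= cutoff:
                keys.append(obj["Key"])
    return keys


# "Show me my dataset as it was on June 1st" (hypothetical bucket and prefix).
snapshot = dataset_as_of(
    bucket="my-surveillance-footage",
    prefix="frames/",
    cutoff=datetime(2024, 6, 1, tzinfo=timezone.utc),
)
```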

Nitay (10:19.441)
Now, I'm interested in particular in... wait, go ahead, Kostas.

Kostas (10:19.66)
Dean, a question. Just one question about what you've been talking about so far.

I think what is really interesting here is how you have all these different artifacts and how they relate to each other. But there is one artifact there which I think deserves a little bit more time, primarily because someone who does not come from the ML space might think of it as something different. And I'm talking specifically about the experiments, right? Because there's

the concept, even in product management, of running experimentation, which is more like A/B testing, which is a very, very specific thing, right, in terms of what it means. But my feeling is that when we are talking about experimentation in the ML context, it's something different. And I'd love to learn more about that,

because I don't know, actually, and I think it will also be interesting for the people listening who don't come from an ML background to understand: what is this experimentation thing we're talking about, right? And what is this Excel sheet that you mentioned, and why is it important?

Dean (11:47.119)
Yeah, sure. So that's actually a great point. Whenever I have a user interview with someone who's doing machine learning at a company that's more B2C oriented, so they have a ton of users, when they say experiment, every time I need to make sure whether they're talking about the machine learning experiment or the product experiment. So that's a great question, and a point that's worth sort of

diving into. So yeah, experimentation as a concept, of course, there's no real difference between an experiment for machine learning and an experiment for a product. You're just experimenting with different things. So in the case of a product, you have something, maybe it's a feature or an entire product or something even more specific, the color of a button, right? And you want to see whether or not that impacts some bottom-line metric that you care about. And so you would go through a process where

you put out two or more versions of your product with the different variations, and you get users

to go to each one of these variations, and then you collect that metric and decide which one is better, right? So in the button example, just to give something concrete, you want to see whether a red versus a blue button gets more users to click on checkout. So you'd create two versions: half of the people that go to your website see the red button, half see the blue button, and then you measure what the conversion rate to checkout is and whether or not there's statistical significance. So that's the product

version of experimentation. With machine learning, there are similarities, but it's not exactly the same. So the idea is that when

Dean (13:31.213)
when you're building a machine learning model, there's the deterministic part, which is you're writing code. And that code is supposed to define a model: the structure and the loss function that you want and a bunch of other parameters that you care about. And then you have your dataset, which again, in most cases (there are exceptions to everything I'm saying right now), is a curated set of examples that has both the inputs that you're going to put into your model and the expected outputs. We call

those labels or annotations. So when the model runs on your inputs, what do you expect it to output? And then there's a training process in which the data is fed into the model. The outputs are calculated. They're compared to the labels, so what the expected output was. And then you calculate the difference, and that's called the loss. And that loss is used to calibrate the model.

so that it's a better representative of the dataset that you have. So to give a concrete example, if you want to build a model that detects whether an image is of a dog or a cat, then you will have a dataset of images. Each one is a dog or a cat, and each one has a label, which is dog or cat. And then you define your model. And what you basically do in the training process is you run each sample

of dog or cat through the model. The model outputs dog or cat. And then there are two options. Either the model is correct, in which case, again, we're simplifying here to a very simple case of yes or no, but basically, if the model is correct, then it has no error, so there's no loss being propagated. But if the model is mistaken, then basically that loss is backpropagated so that the model weights, or the sort of

parameters of that model change automatically to reduce the chance that it will make a mistake again. And so you run through this process iteratively. Now, why is this related to experiments? Basically, if you remember in the beginning, I was saying that when you define the model, there's a lot of parameters that you can play with. And some of those are like built into the model. For example, which loss function do we use? So most models support multiple different loss functions and each loss function would be better for different tasks.
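To ground the training loop Dean is walking through, here is a minimal PyTorch sketch of the dog/cat case, using random tensors as a stand-in for real images: forward pass, loss against the labels, and backpropagation nudging the weights.

```python
import torch
from torch import nn

# Stand-in "dataset": 64 fake feature vectors, each labeled 0 (cat) or 1 (dog).
inputs = torch.randn(64, 128)
labels = torch.randint(0, 2, (64,))

model = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()                           # one of several possible loss functions
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # learning rate is a tunable parameter

for epoch in range(5):
    optimizer.zero_grad()
    outputs = model(inputs)          # model predicts dog vs. cat for each sample
    loss = loss_fn(outputs, labels)  # how far the predictions are from the labels
    loss.backward()                  # backpropagate the error
    optimizer.step()                 # adjust the weights to reduce future mistakes
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```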

Dean (15:56.031)
There are also other parameters like learning rates and weight decay. I'm throwing out a lot of buzzwords; they don't really matter for understanding what it means to do an experiment, but you can select values for these parameters. And then, maybe in my mind, the most important parameter is which dataset you are using. So as we said earlier, if I have the same code but different data, I'm going to get very different results. So you can also play with the dataset that you're using to train your model.

And all of these are sort of parameters, so inputs to the experiment. And in the end, you're basically calculating certain metrics. So you save a bit of the dataset on the side, you call that a test set, and then after you're done training, you take your model and you run it on that test set. So you're basically quizzing your model on something that you held out from training. And then you give it a grade, right? So that's usually metrics such as accuracy or mean average precision

or recall and stuff like that. And then you give it a score.

And the experiment tracking is basically tracking those parameters, including the code version that you ran, including the dataset, including the parameters and the metrics that came out as a result, so that you can later compare between different experiments and say, this experiment was better than this one, what did we change there? Did we have more data? So maybe if we increase our dataset, we're going to get better results. Or maybe we chose this loss function and this one is better than the other one. So I might have an experiment where I'm trying different loss functions and seeing which one is more suitable

for my problem. So those are the different types of experiments, and of course they can live together. So if you're a data scientist at Facebook, you're doing machine learning experiments and then you're building a model that goes into the product, and then doing product experiments to see whether or not that model is affecting the bottom line for the company.
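As a sketch of what "a much more sophisticated Excel sheet" looks like in practice, this is how the parameters and metrics Dean lists might be logged with MLflow, one common experiment tracking tool; the values and file names are made up for illustration.

```python
import mlflow

mlflow.set_experiment("cat-dog-classifier")

with mlflow.start_run():
    # Inputs to the experiment: hyperparameters, plus which dataset version was used.
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("weight_decay", 1e-4)
    mlflow.log_param("loss_function", "cross_entropy")
    mlflow.log_param("dataset_version", "2024-06-01")  # hypothetical dataset snapshot tag

    # Outputs: metrics computed on the held-out test set.
    mlflow.log_metric("accuracy", 0.91)
    mlflow.log_metric("recall", 0.88)

    # The trained model is itself an artifact of the run.
    mlflow.log_artifact("model.pt")  # assumes this file was saved locally by the training code
```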

Kostas (17:49.691)
Yeah, all right. That was an amazing reply. Nitay, please go ahead, because I derailed you before. So I'll continue after your question.

Nitay (17:50.344)
How do you find people are...

Nitay (18:02.236)
Yeah, sorry, I think we have a delay here. We'll cut that out. Just going off of that, so it sounds like a big part of it, because you mentioned it a few times, is kind of the quality of the data. And as the saying goes, if you have bad quality data, it's garbage in, garbage out. And so you have the situation that you described: you have your training data, you're training the model, then you have your

test set and so on. That whole thing may, for lack of better terms, be somewhat lying to you, meaning you may be training a model that's overfitting and won't actually generalize, because maybe your test set and your training set are actually too specific and not generalizable. I'm curious, in the things that you've seen, how do people actually give themselves confidence that the data they're using is high quality, not just that the model result is high quality? How are they monitoring the data? What are they doing, or what should they be doing?

Dean (18:52.527)
Yeah, that's a good question. When we started out working on DagsHub, I think there were a lot of threads and forums where people were asking, I'm working on this task, whatever, cat/dog classification, how many data points do I need to train a good model? And people would always answer, or more or less always answer, it depends on what you're trying to do.

Like how complicated that task is, how specific you need the model to be, what your accuracy requirements are and things like that. But I think that's changed today, especially with fine-tuning becoming the standard process for a lot of real-world ML applications. And just to explain what that means: it means that instead of training a model from scratch, I'm actually taking a model that someone else, usually, statistically speaking, one of the large companies, trained for a long time on a lot of high-quality generic data.

And then instead of training it from scratch, I take that model and, through this special process called fine-tuning, I train it just for a bit on my specific data. And that way I have the

generalization gains from the big model being trained by someone with more money and more data, and the specificity gains from training it on my data with my specific company's jargon or specific documents or images or whatnot. So I think that that's the world that we're in. And in that world, there's still a lot of work that you need to do around data.
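To make the fine-tuning flow concrete, here is a minimal torchvision sketch (assuming torchvision 0.13 or later for the weights API; the two-class head and the dataloader are placeholders): take a backbone someone else pretrained on generic data, freeze it, and train only a small head on your own labeled examples.

```python
import torch
from torch import nn
from torchvision import models

# A model someone else trained for a long time on large, generic data (ImageNet here).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained weights to keep their generalization.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final layer with a head for our specific task (e.g. two classes).
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()


def fine_tune(dataloader, epochs: int = 3):
    """Train "just for a bit" on our own, much smaller labeled dataset.

    `dataloader` is a placeholder yielding (images, labels) batches.
    """
    backbone.train()
    for _ in range(epochs):
        for images, labels in dataloader:
            optimizer.zero_grad()
            loss = loss_fn(backbone(images), labels)
            loss.backward()
            optimizer.step()
```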

So first of all, there's the groundwork, right? You need to collect data and make sure that it's annotated.

Dean (20:33.069)
And once you have that early version, you're probably going to want to do a couple of things. One is you're going to want to validate that data. So you're going to want someone who understands the data and understands the business problem to look at the data and the annotations and make sure that they make sense. So that, theoretically, if you have a model training on them and the model is perfect, right, then it would get us

to where we want to be from the application perspective. The other side of that coin is you also want to make sure that there are data samples in your dataset, or multiple datasets (we'll maybe talk about that a bit later), that

help you check for edge cases. So the typical example that people give is, if you go to the world of autonomous driving, most of the data that you have is just regular driving in different places. Everything is pretty normal, the driver is in the middle of the lane, nothing special is happening. But if you only train your model on that, what happens when it finds itself in a situation where it's raining, or there is construction on the road,

or someone is just like running into the road, not on a crosswalk or things like that. That's obviously very dangerous. And so even in like non life threatening situations, you wanna make sure that you're creating datasets that will help you cover those edge cases and actually get the results that you need in order to make sure your model doesn't fail.

when it gets out of the standard distribution. So I think those are the main things that you want to do with respect to data curation. And the way you would usually do that is you would probably

Dean (22:23.245)
aggregate your data somewhere. So if it's tabular data, you would put it in some database. If it's unstructured data, you'd probably put it in some storage bucket. You would want to set up an annotation tool that will let your annotators or domain experts, depending on the task that you have,

add those annotations. And then you would want some tool to visualize the dataset, understand what your labels look like, understand which datasets you have, and hopefully get to a place where you feel comfortable taking that into your training loop and actually building a model.
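A rough sketch of the kind of basic validation pass described here, assuming the annotations are exported as simple records (the field names are made up): check that every sample actually has a label, and look at the label distribution to spot missing edge cases.

```python
from collections import Counter

# Hypothetical annotation records exported from an annotation tool.
annotations = [
    {"file": "frames/0001.jpg", "label": "normal_driving"},
    {"file": "frames/0002.jpg", "label": "construction"},
    {"file": "frames/0003.jpg", "label": None},  # missing annotation
]

missing = [a["file"] for a in annotations if not a["label"]]
distribution = Counter(a["label"] for a in annotations if a["label"])

print(f"{len(missing)} samples are missing labels: {missing}")
print("label distribution:", distribution)
# A heavily skewed distribution (say, 99% "normal_driving") is a hint that the
# dataset does not yet cover the edge cases the model will meet in production.
```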

Kostas (23:04.529)
All right, many questions here, but there is a question that I have had almost from the beginning, so I want to make sure that I ask you that thing. So what I hear from you is that you're dealing with a problem that, by definition, let's say, is what we would call multimodal, right? There's no one type of artifact that has to be versioned, for example. You have data, you have

code, you have experiments. And I'm sure if we had a big fan of Lisp here, they would say that everything is data anyway, so what's your problem? But it's not exactly like that, right? So while we were, okay, just writing software, we would have Git, an amazing tool, but

it's built with some very strong assumptions about what we are dealing with here, what we are versioning: it's primarily text. So how do you... The first question is, with each one of them being different, how do you deal with versioning all these different things? And by the way, even if we stay just with data, forget about the rest of the artifacts.

Just the fact that we might have tabular data, we might have text data, we might have images, right? What does it mean for an image to change? I don't know. Many times, if you take a video and you take two subsequent frames, they look the same to me. But if you look at the files, they're different, right? So how do you deal with that? And how much of a problem is it in the end, right? I don't know. Maybe we

just throw everything on GitHub and call it a day. I don't know, but I'd love to hear from you on that.

Dean (25:05.381)
Yeah. So the separation that we make today is between what we call structured and unstructured data. I say we, because sometimes we talk to people and they would consider, like, a JSON to be unstructured, and an image has no classification in either option. So when I say unstructured, I mean anything that is saved as a file, and structured is something that's like a table.

So those are the two categories of data that we think about, and DagsHub is primarily focused on the unstructured data side of things, so everything that is stored as a file. I'll talk about how we think about working with structured data in the context of machine learning, because I think it does tie into what we're doing. But for the unstructured side, basically what you would do is what I mentioned earlier.

Let's assume that your data is changing, so option number one, throw everything in an S3 bucket in a folder called data and be done with it, doesn't apply. Then it depends on whether you're in level two or level three versioning, so append-only versioning, or "files change under my feet" versioning. For the append-only versioning, our approach is that if you try to do Git-like hashing for files, that's going to be painful

from an infrastructure perspective in some cases, right? So you can build a system that's optimized for like a specific action, like adding a new version or pulling data or pushing data or something like that. But usually you won't be optimized for all cases and it's going to be a painful process. So what you can do instead is not manage the files directly, but manage pointers to those files. And those pointers can be a bunch of metadata.

It doesn't have to be just what's contained in the file. You can attach metadata like the annotations, so that you can have multiple annotation versions without needing to change the file, and things like that. And then you can version that similar to how you would treat tabular data versioning.
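One way to picture the "pointers plus metadata" idea, as a hedged sketch rather than how DagsHub implements it internally: each datapoint is a small record referencing the file in object storage, and annotation versions are attached to the record, so a new labeling pass never touches the file itself.

```python
from dataclasses import dataclass, field


@dataclass
class DatapointRecord:
    """A pointer to one file in object storage, plus versionable metadata."""
    uri: str                  # e.g. "s3://my-bucket/images/0001.jpg" (hypothetical)
    size_bytes: int
    added_at: str             # ISO date; enough for append-only, as-of-date snapshots
    annotations: dict = field(default_factory=dict)  # e.g. {"v1": "cat"}


record = DatapointRecord(
    uri="s3://my-bucket/images/0001.jpg",
    size_bytes=204_800,
    added_at="2024-06-01",
    annotations={"v1": "cat"},
)

# A second labeling pass adds a new annotation version without rewriting the image.
record.annotations["v2"] = "dog"

# The collection of records can then be versioned like any tabular dataset,
# independently of the much larger files it points to.
```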

Dean (27:18.425)
So that's for unstructured data in level two versioning. The advantage of this approach is that we don't care whether it's a text file, an image file, a video file; it's fine, right? And this assumes append-only, but you might ask, okay, what happens if my actual images change? So I think one answer is that that's very uncommon.

Right? In most industry use cases that we see, if you're working with truly unstructured data, so non-tabular (and text is the same, but a lot of times text is saved in tabular formats), assuming it's not in a table, then the chances of your video data actually changing are very low, with maybe one main exception, which is up and coming right now, which is generative AI. If you're

generating images, then you might be generating different datasets every time you change your model version, but if those are for the same tasks, you might be overwriting your data. So let's assume that you found a use case in which your data files are actually changing, whether it's GenAI or not. In that case, you would need something that is kind of like Git, but for data.

Now, the tool that DagsHub uses is an open source tool called DVC. It's probably the most widely adopted one for data versioning. There are other options, like LakeFS and Xet, for example, but the idea is pretty similar, right? There's different performance and maybe slightly different use cases, but the idea is similar. Instead of versioning the files directly like Git does, you would

create a pointer to that file. So this would be like a very simple text file that has the hash of the large data file that you're using. You would put that text file inside your Git repository. So you're connecting between the Git version and the specific versions of your data files. And then you would put the data file somewhere where it's supposed to be. So some S3 bucket in like a hash table that sort of stores it in a versioned way.

Dean (29:35.727)
So that has the advantage that you can never ruin your dataset unless you're doing it maliciously. And you get to tie the versioning into your Git version. So, if I spoke earlier about the source code for machine learning being data plus code, that means that when you're looking at a version, it's canonical for your entire project: the dataset version that you used plus the code version. And it's deduplicated, so if you're using the same remote,

you can basically use the same file over and over and over and it's not increasing the size of your remote. Git, by the way, and we didn't know this when we started the company, does duplicate. So a lot of people think that Git deduplicates at the file level, but that is not correct, apparently. It doesn't matter because most Git repos are pretty small, but when you're creating a new version of a file, it basically duplicates everything,

as far as I know, you know what, I'll be modest here. I might be mistaken, but this is as far as I understand how Git works. So yeah, that's the short explanation on unstructured data.
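The "Git, but for data" mechanism Dean sketches can be shown in a few lines. This is a simplified illustration of the idea behind tools like DVC, not their actual implementation: hash the large file, write a tiny pointer file that goes into Git, and store the blob under its hash so identical content is naturally deduplicated.

```python
import hashlib
import json
import shutil
from pathlib import Path


def track_large_file(data_path: Path, blob_store: Path) -> Path:
    """Hash a data file, store it content-addressed, and return a small pointer file.

    The pointer is tiny text, so it can be committed to Git; `blob_store` plays
    the role of the remote (an S3 bucket, for example) in real tools.
    """
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()

    # Store the blob under its hash: identical content is only ever stored once.
    blob_store.mkdir(parents=True, exist_ok=True)
    shutil.copy(data_path, blob_store / digest)

    # Write the pointer file next to the data, to be committed in its place.
    pointer = data_path.with_suffix(data_path.suffix + ".ptr")
    pointer.write_text(json.dumps({"sha256": digest, "size": data_path.stat().st_size}))
    return pointer


# Hypothetical usage: committing the .ptr file ties this data version to the
# code version, while the large file itself stays out of Git.
# pointer_file = track_large_file(Path("data/videos/cam01.mp4"), Path(".blob-store"))
```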

Kostas (30:45.863)
Okay, and how do you put everything together, right? Because, okay, we can say we are versioning, let's say, binary data, we are also versioning our code, we are also versioning our experiments using, like, Google Sheets, which, I don't know, keeps versions there, all that stuff. But in the end, all these are parts of the same thing, right? How

does that connection happen? How do you do that? How do you connect the data with the code and the experiments? And I don't know if there's any other artifact there, but how do you do that?

Dean (31:31.353)
Yeah. So this is sort of getting into a product question and the way we thought about this. First, we did not prepare these questions in advance, if anyone is listening, but this is a great question, because this is exactly what we're doing at DagsHub. The way we started the company was sort of to be the connective tissue between the different components. And so we asked ourselves, what are the jobs to be done that our users, the data scientists or ML engineers, have in their day-to-day work,

and how would it look to put them in one place? And as I sort of mentioned before, each one of these components needs different treatment. You're not looking at experiments as if they were Git-tracked files; they need a different treatment. You're not looking at the datasets as if they were just a part of your Git repository. So you couldn't really just, you know, copy the Git UI, add a bunch of other lines for stuff, and have it just work.

But what we did is, basically, if you go to a DagsHub project today, and this is sort of our flavor, I'm not saying it's the only way or the right way, but this is the way we think about this. So if you go to a DagsHub project right now, you would have a tab for code, a tab for datasets, a tab for experiments, a tab for models, and a tab for annotations. And the thing that we do is we put all of those in one place and we build links

from each one of these tabs into the other relevant tabs. So when you're looking at an experiment, you have a button that takes you to the specific version of code or the specific dataset that you used for that experiment, or the model that resulted from that experiment and that later on maybe got deployed to production, and things like that. So you're basically able to go through that process

backwards, in theory from production to the model that was deployed, to the experiment that created it, to the dataset and the code that were the source for that experiment. And each one of them requires different treatment. So that was a very hard part that we implemented: where do you expect to see each one of these components when you're looking at something else? So when I'm looking at my code, when and where and how do I expect to see the data

Dean (33:41.221)
that was used in a certain experiment with that code, for example. So I think that that's a hard problem. If you go to a project on DagsHub, like you can do this for free, then you can see how we approach it. But the idea is to really provide a connective tissue and put the things in the right place with respect to the other components of your project.

Kostas (34:04.086)
Okay, one last question before I give the microphone back to Nitay for his questions. And I think it's again a product question, and it's more of me trying to understand the user. So we have all these different components, right? We have the data, we have the code, we have the experiments, we have the models, we have the annotations.

There is always a starting point, right? Or let's say a common starting point, because, okay, obviously every practitioner has their own way, but there is some pattern there, right? So, for example, we say about data engineers that, okay, in the end they are building pipelines, right? So the concept of a pipeline is kind of the starting point of

the experience that you're trying to work with and build products upon. In this case, what's the starting point? From all these different artifacts, where does, let's say, a new project start from, and what's the core first thing that a data scientist or ML engineer thinks of when they start a new project?

Dean (35:02.545)
Sure.

Dean (35:26.169)
Yeah, that's a good question. I guess, right, as you said, for data engineers it's always the pipeline, right? The ETL, the processing steps. So for data scientists, I think there's maybe a bit more chaos, possibly because they want to work on the modeling part. I guess...

I won't say they; we want to work on the modeling part. Even when I do a project for a customer or something like that, that's where you're always drawn to. You want to actually train the model, see the results and then play around with it. But we acknowledge that data is the most important part; as Nitay said earlier, garbage in, garbage out. If you're not creating good datasets, you're not going to get good results. And so I think if you think about it

Kostas (36:10.844)
Mm-hmm.

Dean (36:21.105)
chronologically, then building your dataset is the first thing. So collecting data plus annotating it. But maybe the counterintuitive part here is, if I rephrase this, right: imagine you're data scientist number one in a startup, and the, whatever, CEO or CTO comes to you and tells you, listen, we need an ML feature in our product as soon as possible, right?

So what you're going to do is you're probably going to open a notebook and run through the entire pipeline end to end in a very, very hacky way, and then you're later going to worry about fixing stuff. And so the question is, what are the main things that you need on day one, right? Because that's where you're going to meet the data scientists in the end. And I think that this is our understanding of this; I'm now talking as if we always knew this, but our understanding of this has evolved over the years of working on DagsHub. I think that the three things that you need on day one are:

You need a way to build your dataset, again, collect data and annotate it, otherwise you don't have anything. You need a way to train your model. And you need a way to deploy it. And I think if you go to data scientists at startup companies that are starting today, those are the three things that they're going to worry about, and afterwards they're going to worry about the other stuff. So even experiment tracking, they will tell you, yeah, whatever, I'm just writing it down in an Excel sheet. The sophisticated data scientists would,

as part of their Python code, make sure that they write it to some file, like a text file or CSV file, locally, so they have it documented. But they would just race to the end with the hope that the POC is good enough and that they'll have more time to organize everything. So the day one challenges are those three, I would say.

Nitay (38:07.894)
I love that point, because one of the things I've found, also to your point, Dean, is that they write the Jupyter notebook, they get something working, the hack shows something interesting, and the very next logical step in their mind is: okay, here's the notebook, put it in production, let's go. Like, why can't you just do that? Right? The reality is, to your point, there's actually a lot of productionizing effort that needs to be done in order to take that great idea that's in that notebook and turn it into a full model serving application and so forth. And so one of the

Kostas (38:08.247)
Okay, that's awesome.

Nitay (38:37.721)
questions that this makes me think of is: how do you think about this from a UI/UX and product perspective within your product? Because clearly you've got multiple personas, as you said, using your product, potentially in different tabs, expecting kind of a different product look and feel, I imagine, right? Because one is thinking about data and data pipelines, another is thinking about experiment tracking, another is thinking about my code, and so on. So how do you think about providing all that within one look and feel, or within one product?

Dean (39:06.555)
So I think that you're not going to provide everything for everyone in one product, but you need to, like we needed to, choose a specific persona and then make sure that the product can interact or interface with the other personas in the place that makes sense for them. So if we think about it, I think nowadays, and this wasn't true when we started, but nowadays there are three main personas that you see in an ML context.

One is the data engineer. So if you ask the typical data scientist, they would say that those are the people that do everything that happens before they get a dataset, like a collection of data points, right, that they can use. Then you have the data scientists, who might or might not be in charge of the annotation process, but they're basically taking this dataset, curating it, making sure it's good for training, training the model, and then they get the model artifact. And then there's this third persona,

sometimes called ML engineer or MLOps engineer; titles are a mess in the world of tech in general, but I think data science specifically is worse. But this person's job is basically to take this model and put it in production. And in larger organizations, their job is to basically build, if data engineers are building the data pipelines, then the ML engineers are building the pipelines that let the data scientists automate that process from a model result to production.

Now, obviously, when you go to smaller companies, the data scientist might do everything, or the ML engineer might do everything: build the pipeline that goes from the data lake to a ready dataset for training, train the model, and then do the deployment themselves. Full-stackers like that are very, very rare and usually get paid a lot of money. But then,

the platform, the way we're thinking about this, is we start with the data collection process. So if, like, you have

Dean (41:07.441)
cameras in the field, right? I'm trying to think of a concrete example, a company that we published a case study with, so I can share it: we're working with a company that has microphones in farms, and they collect sounds from livestock to detect diseases in pigs and sheep and things like that. So they might have data pipelines that are run not by the data scientists to process the data, get the data from the field to some S3 bucket

or something like that; that might not be the responsibility of the data scientists. But then afterwards, that's where we fit in. So we would connect to that S3 bucket, and like, literally connect to that S3 bucket, you don't need to do anything else. So that's how we're thinking about interface number one in our unstructured data context. We would then help the data scientists go through the entire process: data annotation, they can do that on the platform,

experimentation, they can do that on the platform and then model management. And today, the interface that we have is there's like a very elaborate API for model management that will let the MLOps engineers basically get the latest model once it's done training and feed that into a deployment pipeline. But we're working on some stuff to take over more of that process to make that even easier. So I think that the end goal would be that the MLOps engineer

sets up the infrastructure for deployment, they connect it to DagsHub, and then from the data scientist perspective, it's just like click and it's deployed, right? Like that would be, I think, the ideal interface on that side. So yeah, that's how I think about the interplay of the different roles.

Kostas (42:49.599)
Yeah, and I want to add something here, because I think Dean actually said something super, super important from a product perspective, and I think anyone who's listening should keep a note of that. What he said, and I think this is a problem with many people, especially founders of companies in the data infrastructure space,

or infrastructure in general, because infrastructure by definition is something that's touched by many different people, it's very horizontal. But when you're building products, in the end, and I think that's super, super important, you always build a product for one persona. You can't build a product for multiple personas. What you can do is build for your persona, accept the fact that there are other personas that are going to be somehow interacting with it, and,

I think he put it in the perfect way, you build interfaces. And I think that's super, super important. And I've been there, trying to build something and seeing all these different people out there and being like, okay, let's try and add features for data scientists and data engineers and ML engineers and, I don't know, SREs and whatnot and all that stuff.

But if you do that, it's going to be a disaster. In the end, it's not going to be a product for anyone. And the right way, I think, to do it is to focus on one persona, and the rest should be interfaces to other systems that the other personas can manage. Anyway, I really wanted to point this out because, I know it's not necessarily about ML

and data and all that stuff, but I think it's like a super, super important point for anyone who's like building something for someone out there.

Dean (44:51.631)
Yeah, I think that it's interesting. First, I agree with you. I think, by the way, just to not make it sound overly smart, this is a hard-earned learning; we went through the process of experimenting with different takes and then realized that this is the way to go. It's easier said than done, I think, especially when you're starting a company and you... A lot of times users,

not because of bad intentions, but because they're excited, the request that they have when they're excited is: I want to bring in more people from my organization, so build features for them as well. And that's a hard thing to say no to. So it's an ongoing challenge; I think it never stops. But I guess I'm actually curious, Kostas, from your experience, because you're working on the data engineering part, so you could argue that you're at the beginning of the process.

But I'm curious whether, from your perspective, you see something that happens before people get into your product, right? Like, is there something that, from my perspective, I need to interface with at the beginning and at the end? I'm curious, from your perspective, is it just interfacing at the end, or do you also have something like that at the beginning?

Kostas (46:05.837)
I mean, there is at the beginning, and I think the things that you see there are all the upstream dependencies, where the data is coming from, right? And then usually what you have there is application or product engineers that care about completely different things. I think especially there, the difference between the personas involved is

very, very obvious, right? And that's why, I think, we still have problems that are not fixed, not necessarily because they are hard to fix technically, but primarily because of having a hard time figuring out how to create these kinds of interfaces, or even deciding who should have ownership over something or not, right? And we've talked about this on this show too, we've been talking with other people about

streaming systems like Kafka, for example: who owns this thing, right, that transports the data? Is it the data team? Is it the production engineering team? Who is it? I obviously have my opinions, and I'm sure you have yours and Nitay has his, but if we think in terms of the industry out there, it's not clear who does that. And a lot of the problems actually arise from that, right?

But what I would say is that there are also loops in a way, right? And I think a very good example is taking the data engineer and the data scientist, right? Because they have something in common, that's data, but the rest is completely different. They're like completely different people. They think in completely different ways. And I think that's why...

People don't pay attention to words as much as they should, in my opinion, but there is a lot of wisdom in choosing the word science versus engineering in these two roles. A data engineer cannot experiment. There's no experimentation in engineering; you can't experiment when you are building a bridge, there's no such thing. But when you're a scientist, you experiment. That's all you do. Now the question is, how do you take these completely opposite

Kostas (48:25.677)
ways of thinking and operating and get these people to work together? And I think that's where you start having almost memes out there of, hey, can you put my notebook into production? The data scientist goes and hands this to the data engineer, and the data engineer is like, okay, what am I supposed to do with this mess of code, annotations, diagrams, whatever, right? So

I think there's, from a product perspective, a lot of value to be delivered by creating the right interfaces, both in a forward way but also in a backward way, right? Like, what happens after the data scientist has figured out the experiment and now we have to take these things and put them back into production? And part of that is also the data engineers consistently creating these datasets, right? And I don't think it's a solved problem.

And I think there's a lot of opportunity there. And that's good news for both me and you as builders of companies like in this space.

Nitay (49:33.177)
I think the other interesting point here is, yeah, I agree with this, what you're saying, Kostas and Dean, it's a great question. One of the things I've seen is that like, I think if you look at each persona, whether it be within the data or engineering and ML or any field really, I almost find that like every persona has some curve of, and along that curve is some percentage of things that they're able to do by themselves. And then beyond some line, there's some percentage of things that they need some other persona, team, et cetera, to collaborate with.

And it's almost like every tool, as all of us are vendors, what every tool is doing is trying to move that line in some way. And as you move that line, the question is, are you enabling more of that person but screwing others? Meaning, are you enabling them to go from 50 to 70%, but that other 30% actually just got a lot harder because they're going to use your tool? Or are you enabling more self-starting, but then it actually launches you into a better collaborative version as well? So I think that ties into what you were saying about hyper-focusing

on a single persona but having really deep, rich integrations, especially in the enterprise space, and so kind of nailing both sides of those curves, but at the same time knowing where your line is and not thinking that your line is at zero or at 100, because it never will be, right? I think that's kind of the key, from what I've seen and how I think about it.

Dean (50:51.119)
Yeah. When I talk to users, I try to say that I don't believe in an all-in-one tool.

You're going to have to work with other tools. So, two points I wanted to make that I think are practical, maybe if someone is listening and building a tool for technical people, because I really think it's bigger than specifically ML or data. One is that if you're in a nascent market and, like, there's no 30-year-old company in your business that you're competing with, then

it's very likely that the interfaces have not yet been decided, and so there's a market education piece. So if you are in competition with a 30-year-old company, there are probably interfaces that are well decided and you should probably comply with them as a starter, and then maybe innovate on the interfaces. But if not, then one of the challenges that I think we had is that

different companies had different ideas for where those interfaces need to be. And so one option is you cut that into your ICP. So you say, I only work with companies that think about the interfaces the way I do; I just gave this example, like, this is the role of the data engineer, this is the data scientist, this is the ML engineer. Or you sort of make an effort and spend, I don't know, marketing dollars on educating the market that these are the interfaces, this is how they should be, and this is the reason why.

And the second point I wanted to make is, of course, as you were talking about defining flows and things like that, I think one of the challenges when you have...

Dean (52:23.173)
a product that covers interfaces between different people is that you don't go far enough when you're defining the user flows that you're expecting to solve. And so you sort of hand-wave over what happens after you cross that wall. And I think people do that in their own companies. Like, we were talking about deploying a model to production; there were, and still are, a lot of people where the data scientist is building something in a notebook and giving that notebook to someone. And I've spoken to companies where they're like, that's fine, because what we decided

is that the alternative of giving more autonomy to the data scientists is not a good fit for us right now. And that could be good, I'm not saying that it's always bad, but from a tool builder perspective, you have to ask yourself:

so from my perspective as DagsHub, I'm stopping right now at the model management part, but what is my user going to do right after they get there? And how do I make sure that they feel comfortable doing that? And I think that also relates to what you were saying, Nitay, about: I don't want to make the part after DagsHub much harder, because then no one will use it. So you have to pay attention to that. And that's super important, I think, when building features and also when positioning yourself in the market. But yeah.

Nitay (53:35.232)
That's very, very sage advice. So, shifting gears slightly and tying a little bit to what you said here, I'd love to hear: how does ML, or AI, actually factor into DagsHub itself internally? Meaning, in what ways are you either dogfooding your own product, drinking your own champagne, whatever the metaphor is? One thing that struck me is what you said early on in the podcast about Stone Age versioning. One of the things that made me think of is, a lot of folks I know with ML pipelines, for example, in the vision world:

If I upload an image, I want to not only search by the date that the image was uploaded, but I want to actually search as I'm doing my data curation: I need my dog images, I need my cat images, right? That requires some level of auto-tagging, some level of smart search, et cetera, et cetera. So in what ways does AI or ML make it into your own company?

Dean (54:24.025)
Yeah, that's a good question. I'll separate: obviously we're dogfooding our product in the sense of building projects, both to show customers and to make sure everything is working as planned, but that's not the interesting part. So let's talk about how we actually use ML. I think there are three examples that I...

like categories that I can think of. One is we actually do try to incorporate machine learning into the product in different areas. And the best, the coolest example in my mind of this is that

because we have an exploration component of the platform, so being able to discover what other people are working on and maybe use those data sets or models in your own project, we wanted to incentivize our users to document their projects. And as anyone who's worked with engineers knows, the average engineer does not like documentation. That's not the fun task. And that usually causes problems that grow as the organization becomes bigger.

In the data world, there are data catalogs, which are entire products whose only goal is to help organizations document and make their data

accessible to other teams within the organization. But when we were talking to a lot of companies, we noticed that they had a problem where they would sometimes pay a lot of money for those data catalogs, they would deploy them, and then no one would use them properly. There weren't a lot of documented datasets, and it's similar to what we spoke about earlier, garbage in, garbage out. If no one is documenting your data, then you just have a ton of

Dean (56:02.315)
unnamed datasets and no one will actually use them. So an idea that we had, well, one of our engineers had this idea, was: what if we made it fun? So when we see a user do something on DagsHub, like push code or run an experiment, and it's the first experiment for a specific project, we prompt them and say, listen, if you give us...

a short description of your project and choose three tags to make it easier to discover later, we will generate like an AI avatar for your project so that it looks nicer. And that surprisingly works really well.

The amount of projects that have descriptions and tags on them, which makes them easier to discover, has grown significantly since we released this feature. And it just makes the platform nicer, because you see projects with avatars as opposed to the generic Gravatar-style ones, right? So that's one example that I like. The other thing that we did with LLMs the moment they came out, which was both an experiment for us to see what the process would look like and what we could learn while doing it,

is we built a chatbot for our docs. We deployed that to our Discord server so that when people ask a question, the bot answers immediately, hoping that it would alleviate the support burden on our engineering team. And that worked decently well. You get answers, it sometimes hallucinates, so you still have to have someone monitoring it and correcting the model when it's mistaken. But that sort of

works nicely. From the user perspective, though, the main thing that we support with machine learning is you can actually do something called auto-labeling, or active learning, on the platform. So the idea is, if you have your dataset, and Nitay, you mentioned this earlier, let's say I have a dataset of images and I want to know which objects exist in those images so I can do something like say, give me all the images that have a dog in them or a car in them.

Dean (58:08.227)
There are multiple ways you can do this, but the more robust way is that you would want to run models that extract that information from your data and then make that extracted information available for querying. And so today with DagsHub, you can actually connect models that you built, or that are off the shelf, to your projects and then have your data

be auto-labeled by them. So you can extract all that information, save it back to the data set, and then actually query by that information. So you can do a lot of really nice things, like create different subsets by the categories of objects in an image, or by the semantics of the text if it's a text data set, for example.
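A bare-bones sketch of that auto-label-then-query loop might look like this. The `detect` callback stands in for whichever custom or off-the-shelf model you'd connect, and the in-memory dict stands in for the dataset's metadata store.

```python
"""A rough sketch of the auto-labeling / query-by-metadata workflow: run a
model over each data point, save the extracted labels back as metadata, then
query subsets by those labels."""


def auto_label(paths, detect):
    """Run the model over every file and return {path: set_of_labels}."""
    metadata = {}
    for path in paths:
        metadata[path] = set(detect(path))  # e.g. {"dog", "car"}
    return metadata


def query(metadata, label):
    """Return every data point whose saved metadata contains the label."""
    return [path for path, labels in metadata.items() if label in labels]


# Toy stand-in for an object-detection model, keyed on the filename.
def fake_detect(path):
    return ["dog"] if "dog" in path else ["car"]


if __name__ == "__main__":
    images = ["images/dog_001.jpg", "images/street_002.jpg", "images/dog_park_003.jpg"]
    meta = auto_label(images, fake_detect)
    print(query(meta, "dog"))  # -> the two dog images
```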

So those are, I guess, the three main ways that we incorporate ML into the product. Hopefully we'll find more ways to do that moving forward.

Nitay (59:07.746)
I love that avatar example. That's amazing. I mean, it gives a little human touch to the product, and I've seen exactly what you mean with many, many projects with those little grayscale boxes and images. So yeah, that's a great example, I love that. OK, shifting gears, because we're probably wrapping up here shortly, moving towards the future: I'd love to hear about where you and DagsHub are going, and, bigger picture, what you'd like to see from the ML community.

Dean (59:37.69)
Yeah. So I guess, from the DagsHub side...

What we'd like to do, and I touched upon this a bit with the interfaces part: moving forward, we want to complete the second interface, do an even better job of connecting this to what happens after you have your model ready and until you get it to production. And we'd like to deepen the first interface, deepen the capabilities that the platform provides for data management. You know, we're working with our users and our customers to make sure that they can actually use this in real-world use cases.

There are a lot of requests, so there's a lot to build. But I think the main veins that we see are, one, being able to support additional data sets and more real-world things around detecting why a model fails when it does. So for example, you trained your model, and you want to be able to say something like, show me a data set of all the images where the model was very wrong, and try to gain insights into why that happened, so that you can build a test set to make sure that it doesn't happen again in the future, and things like that.
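One simple way to approximate "show me where the model was very wrong" is to rank validation examples by how confidently the model got them wrong and keep the worst offenders as a regression test set. The records below are made-up illustrations, not anything from DagsHub.

```python
"""A small sketch of failure analysis: rank validation examples by how badly
the model missed, then keep the worst ones as a regression test set."""


def worst_examples(records, k=3):
    """records: list of (example_id, true_label, predicted_label, confidence).
    An example counts as 'very wrong' when the prediction is incorrect and the
    model was confident about it; we rank by that confidence."""
    mistakes = [r for r in records if r[1] != r[2]]
    mistakes.sort(key=lambda r: r[3], reverse=True)  # most confident mistakes first
    return mistakes[:k]


if __name__ == "__main__":
    validation = [
        ("img_01", "dog", "dog", 0.98),
        ("img_02", "dog", "cat", 0.91),    # confident and wrong
        ("img_03", "car", "truck", 0.55),
        ("img_04", "cat", "dog", 0.87),    # confident and wrong
    ]
    regression_set = [r[0] for r in worst_examples(validation, k=2)]
    print(regression_set)  # -> ['img_02', 'img_04']
```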

And on the deployment part, it's maybe a bit more straightforward, but equally important: how do I connect this to my infrastructure and then get a model that's actually an endpoint I can use, whether it's to show internal stakeholders that I'm doing a good job or to actually incorporate into the product. And of course, we already support a bunch of LLM use cases, like prompt engineering and human-in-the-loop evaluation and things like that. But also, I think as the specific subset of the industry that is working on LLM applications matures, they're going to need more specific capabilities. That being said, when I speak to the

Dean (01:01:38.607)
experts that I mentioned earlier, it seems to me that a lot of the tooling around LLMs specifically looks very similar to classical ML, with minor changes that you need to pay attention to. So that's on the tooling side. If we want to take a step back, or up, and talk about the world and the market, I think that, you know,

we started DagsHub partially because we believed that ML was going to be really big. I think that in the last few years, since ChatGPT, everyone agrees that ML is going to be big. Maybe now there are more skeptics. And there's a bigger question of what ML enables moving forward, which parts of the economy it takes over, and things like that. I'm optimistic, not everyone is, right? Both optimistic about the functionality that's going to let us

do a better job, but not kill us all, if you take that to the extreme. So I'm sort of bullish on where this market is going to go. Obviously I'm biased, but I think that we're going to see ML shift or change a lot of jobs that are currently done by humans in a very...

suboptimal way, make some of those jobs more optimal, and maybe replace humans in others, right? I think the reference I have in mind is, you know, the car was maybe bad if you were a horse-carriage driver, but for most of humanity, it shifted a lot of the workforce to different jobs and sort of moved the economy forward.

So I'm sort of hopeful that AI will cause that, and not in the too-far-distant future. Like, I'm optimistic that it will be within our lifetimes. So that's sort of my take on that. I don't know, I would be curious to hear your thoughts as well.

Nitay (01:03:38.591)
Well, yeah, I mean, there's lots to say. I guess going off that then, my final question on my side: what would be the one habit or the one shift that you would want to happen?

Dean (01:03:50.737)
You mean specifically caused by machine learning? So.

Like, I would want it to... So I think coding has already changed very significantly. I also use it for a lot of the other tasks that I do as a CEO, like marketing and sales and a bunch of those things. But one of the things that I'm waiting for is for agents to actually work. This might be a controversial take, a lot of people would be like, they already work, what are you talking about? But I think that letting AI

perform actions on our behalf is something that works in very, very limited settings. And I kind of can't wait until it works in broader settings. I'm kind of optimistic that with the current generation of LLMs, or the next generation of LLMs, being more integrated into the tools and things that we use, whether it's our smartphones and laptops or whatever, our fridge,

that would give us the ability to actually let them execute more and more tasks on our behalf. And I think that's going to free up a lot of time for people to work on like much more interesting things, right? Like be creative. So that's the main shift I'm waiting for. I think we're not yet there, but it's not too far in the future.

Nitay (01:05:13.941)
Yeah, that's super fascinating. I think you've set us up for the next podcast episode when you come back, for what we're going to talk about. Kostas, any last question from your side?

Dean (01:05:20.484)
Sure.

Kostas (01:05:24.275)
I have plenty, but I think it's material for another episode. I think there's a lot to talk about, especially in the evolution from ML to AI. I think Dean has very good visibility into what is happening out there. So we'd love to have this conversation, but I'm a little bit reluctant to start the conversation now, because then I'll be

disappointed, as we won't be able to continue it right now. So let's find some more time and have a topic around that stuff, because I think all three of us have opinions on how the future might look with AI, and I think it would be a very fun conversation to have. So let's go and do that.

Nitay (01:06:20.735)
Let's do it. Dean, thank you very much for coming. I think there were a few great nuggets here within the show for folks, both advice and great explanations of the ML world and the future. And we very much look forward to having you on again.

Dean (01:06:23.089)
Let's do it.

Dean (01:06:35.973)
Thank you for having me, looking forward to the next time.
