How Cloudflare Reinvents Serverless at Global Scale with Josh Howard (Episode 19)

52:19

Kostas (00:03.042)
Hello everyone. I'm very happy today. I have an old friend here, Josh. I had the luck to be working with him when I was at Starburst. He was managing one of the teams that built some amazing stuff back then. Unfortunately, some of that stuff was a little bit too deep inside the systems, and that tends to make it a little bit harder for people to recognize

the amazing work that some people have to do, especially in database systems, when it comes to performance and other things. He can talk more about that; he's the expert here. But today he's on to some new things, which I'm super excited about too, and I can't wait to hear about them. Josh, welcome, and please tell us a few things about you.

Josh (00:56.217)
Yeah, so as you said, we know each other from my time at Starburst. I was the engineering manager of the query engine performance team, and then I spent a lot of time actually getting the SaaS platform off the ground, doing things like container management and query routing, being able to park and resume queries, fun stuff like that. But

more recently, I've switched to Cloudflare. I've been at Cloudflare for about a year and a half, and I'm a senior engineering manager there. I'm responsible for the Durable Objects and D1 products. Those are essentially the storage primitives on the Cloudflare developer platform. So I've been focused there; I probably spend about 80% of my time on Durable Objects and the rest on D1. But yeah, happy to be here.

Kostas (01:46.135)
Amazing. Okay, quick question. So back at Starburst, you were working, let's say, more on the query layer, right? Especially with systems like Trino, anything that is trying to separate the storage from the execution part. Storage is kind of taken care of by other systems, like S3, for example, with the data lake. You started there.

You decided to go even deeper: you went to storage, from what I hear. Can you tell us a little bit about this journey and what made you do that? What excited you about going even deeper into the guts of the storage?

Josh (02:34.559)
Yeah, so I would say the interesting thing is, as I mentioned before, I was doing AI/ML stuff when I was out of school, and then at some point I started getting curious about systems and switched over to building systems at Starburst. It's definitely what I plan on doing for the foreseeable future. So it's really been a progression towards lower levels of the stack. But I think

the thing that's most interesting is that, after spending a lot of time building databases, they're pretty formulaic. They have your standard pieces: you've got your parser, you build an abstract syntax tree, you run an optimizer on that to get an execution plan, and you farm that out to some execution engine, and that's what gives you your results.

It's a very mature industry, where you can go read papers on what the latest optimization rules are, or the latest way to build a columnar query execution engine for an OLAP database. All of those things are available to you, right? And I think the reason I found what Cloudflare is building super compelling is that it's just a very unique product. There are people building open-source durable objects now, and imitation is the best form of flattery.

But Cloudflare was building this first, and it's a very unique take on a low-level system primitive. That's the thing I thought was super compelling about Cloudflare. And it's not just storage, right? It's an execution environment that's tied to storage. So you can route requests to a single place, it has access to local storage, it's really fast, and we have some tricks to make sure that it's durable. In some ways, some people would probably call it a database.

Kostas (04:30.998)
Okay, tell us more. First of all, how does this new system design differ from something like S3, which is probably what most people have in their mind about storage, right?

Josh (04:44.365)
Yeah, what we provide is strong consistency guarantees for your application. You can imagine storage systems that are optimized for large blobs; for that, Cloudflare has R2, which is S3 wire-compatible. It's a great product. But actually, if you go a level deeper than that,

R2 is built using durable objects. The way they actually use durable objects today is to say: if you upload some data, you want to have a consistent view of that metadata, regardless of the actual blobs that are stored. So for that, they use durable objects to track metadata operations. In the same way, if you look at the way Google Cloud Storage is architected, they have Colossus, which is the underlying file system that backs Google Cloud Storage.

And then they have Spanner, which is what they use for metadata management for Google Cloud Storage. In that picture, we're actually the Spanner-type component that's providing serializable, strong consistency guarantees for that data. So, yeah, that's kind of the way to contextualize it. It's not an object store, but, confusingly, durable objects are used in our object store and are also just a

consistency primitive on the developer platform broadly.

Kostas (06:16.042)
All right. Okay. There are many, many questions here, but I would like to make a quick pause before we get deeper into that, just to set some context. Can you tell us a little bit about Cloudflare, right? How did Cloudflare end up building what you're describing right now? Cloudflare has been around for a while. My feeling is that for most people, when they hear the word Cloudflare, the first image in their minds is those

pages you see when you're going to a website and you're getting some kind of warning or something like that from Cloudflare, because of the CDN. Tell us a little bit about their story: how, from wherever they started, they got to building this whole developer platform, from what I understand.

Josh (07:04.035)
Yeah, yeah, absolutely. It's funny: my wife tells me that the only way she knows what Cloudflare is, is that if she can't access a website, it's usually Cloudflare blocking it. Anyway, the backstory on Cloudflare. It was long before my time, but it was originally a CDN, right? It was announced at TechCrunch Disrupt, where you could basically just point your DNS record at Cloudflare, and that would basically

act as a content distribution network for your existing application. So Cloudflare did that for several years. Kind of the interesting thing is that it was built using anycast networking, which is totally outside of my technology skill set, but I hear there were apparently some big open questions about making that work in the 2013 timeframe. So, you know,

fast forward a few years, and Cloudflare is probably the most widely known CDN. It's kind of expected nowadays that you want website response times to be around 50 milliseconds, and Cloudflare has been engineering the Cloudflare infrastructure and our network to be within 50 milliseconds of 95% of the world's population since the very beginning. And then the question becomes, what does that mean? We've actually scaled up to around 335 data centers.

At that scale, some of them go away, so the actual number fluctuates, but it's around 335 right now. And those are all over the planet. So if you're in harder-to-reach parts of the world, like Laos or Cambodia (that's, I guess, the latest place I learned we have a data center), we're actually still fairly local for you.

And so you can be anywhere in the world and have pretty low latencies for accessing your content. But that content is static content for a CDN, right? We've got these several hundred data centers around the planet; it'd be really cool if you could program those, right? And that's kind of what Cloudflare built in around the 2017 timeframe. They acquired a company called Sandstorm.io, and with it an engineer named Kenton Varda.

Josh (09:25.807)
And he came to Cloudflare with the idea to build Workers. At least, this is the story I'm told; again, before my time. But Workers operated on the idea that we need to be able to execute customer program code anywhere in the world and do that efficiently. The reason why that's problematic, as you might imagine, is if you're having to deal with containers, right?

You've got to go clone that container and run it in several hundred places. The amount of resource overhead for running those containers is quite substantial. So Cloudflare opted for a different model, which I think is probably closer to what people think serverless is. It's not just managing containers for people. It's actually dealing with a server that can run a large multi-tenant application, which we call our runtime. And that actually hosts customer

code in what we call isolates. I can get a little bit more into isolates and what they are; they're actually not unique to Cloudflare. But the way that durable objects comes into play is, once you have all of this customer code running everywhere in the world, it becomes very hard to coordinate between these individual services that are running everywhere. Imagine you only want to do something in one part of the world, or only do something once;

then you can't do that with just Workers. You need some other primitive. And so you can go to a database, right? That's been kind of the traditional solution, but Workers can actually overwhelm your database in that situation. Your database isn't going to be able to handle hundreds or thousands of concurrent TCP connections. So that's an inherent limitation.

We can actually do a little bit better, though, and that's kind of what Durable Objects is. It's a first-class way to coordinate between Workers. To use a durable object, you just write some code; it's a JavaScript class. Then you can refer to that class and methods on that class, and invoke them remotely. And we guarantee that those methods are only run in one place in the world at any given point in time. So if you need to coordinate between Workers, you can use durable objects. And durable objects also provide a bunch of cool APIs

Josh (11:47.578)
for interacting with other services, which only make sense if you're only doing something in one place in the world: interacting with the file system, interacting with local storage, interacting with a container, a GPU, whatever. So I'll pause there, sorry.
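
What Josh describes here, a class whose methods callers invoke through a name while the runtime guarantees a single live instance per name, can be sketched in plain JavaScript. This is an illustrative toy, not the Workers API; the `getByName` registry below stands in for the global routing that the real runtime's `idFromName()` and object stubs provide.

```javascript
// Toy model of a durable object: a class with methods you invoke "remotely".
class Counter {
  constructor() {
    this.value = 0;
  }
  increment() {
    this.value += 1;
    return this.value;
  }
}

// Stand-in for the runtime's "one instance per name" guarantee. In a real
// Worker, env.COUNTER.idFromName("room-1") plus .get(id) do this routing
// globally; here a local Map keyed by name plays that role.
const instances = new Map();

function getByName(name) {
  if (!instances.has(name)) instances.set(name, new Counter());
  return instances.get(name);
}

// Every caller asking for the same name reaches the same single instance,
// which is what makes it a safe coordination point.
const a = getByName("room-1").increment(); // 1
const b = getByName("room-1").increment(); // 2, same object as above
const c = getByName("room-2").increment(); // 1, a separate object
```

The point of the pattern is that all state for "room-1" lives behind one object, so there is nothing to race against.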

Nitay (12:02.676)
That was super helpful. Help us understand, maybe for our listeners, how you see this fitting into the stack of modern app development, right? People have probably heard of, you know, Vercel, and when you say the word durable, most people think of durable execution, like Temporal and things like that.

And I imagine you guys see some other kinds of serverless frameworks, and obviously a lot of AWS and other things. So help us understand where this fits in: what it replaces, how it differentiates.

Josh (12:33.881)
Yeah, yeah, exactly. So you mentioned Temporal; I'll focus there for a second. Durable execution is kind of a new paradigm, and Temporal is probably the market leader in that area. We actually have another product, Workflows, which is essentially a rebranded durable object. They do some other stuff, and APIs are really important, and they make it easier to use durable objects for this kind of stateful execution model.

But that's just an example application. What we are trying to provide is a coordination primitive and then lower-level system APIs. So you can think of Workers and Durable Objects collectively replacing Node running on an EC2 instance, right? There are going to be things that you scale out across your instances for

reliability or scalability reasons; that's what you would use Workers for. There are things where you need to ensure they only happen once, and that's what you would use durable objects for. So you're essentially writing your application, all JavaScript-based, in kind of two parts, and that replaces the singular application server. And what does that buy you? Well, you don't have a secondary database. You don't have another server to manage.

You don't have any servers to manage. And you also have these small instances of things located at Cloudflare's edge network. So if everything works right, then you're 50 to 100 milliseconds away from any type of storage operation, anywhere in the world. That's kind of the reason why you'd want to architect using this.
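
The durable execution model Josh mentions (Workflows, Temporal) boils down to persisting each step's result so that a re-run after a crash replays history instead of repeating side effects. Here is a toy sketch of that idea; it is not the Workflows API, and all names below are made up.

```javascript
// Toy durable-execution engine: each named step's result is persisted
// (a plain object stands in for durable storage), so re-running the
// workflow with that history replays completed steps instead of redoing them.
function makeWorkflow(history = {}) {
  let executions = 0; // counts steps that actually ran (not replayed)
  function step(name, fn) {
    if (name in history) return history[name]; // replay: skip the side effect
    const result = fn();
    history[name] = result; // persist before moving on
    executions += 1;
    return result;
  }
  return { step, history, ran: () => executions };
}

// A two-step workflow written against the toy engine.
function run(wf) {
  const order = wf.step("create-order", () => ({ id: 42 }));
  return wf.step("charge-card", () => order.id * 2);
}

const first = makeWorkflow();
const resultA = run(first); // both steps execute
const resumed = makeWorkflow(first.history);
const resultB = run(resumed); // entirely replayed from history, same result
```

A "crash" between the two steps would just mean resuming with a history that contains only `create-order`; the second step then runs exactly once.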

Nitay (14:19.1)
And there was an interesting stat you mentioned there, by the way, because I believe you said Cloudflare has 300-plus data centers. I believe, last I checked, AWS itself only has, I think, maybe a hundred.

Josh (14:29.749)
Maybe, yeah. Fly.io has tens as well; they're a smaller provider. But yeah, it's definitely the largest network of data centers in the world, right? I don't think there's really a close competitor in that. But it's not without caveats, right? I mean, us-east-1 is going to be...

much larger than anything that Cloudflare operates, right? The hyperscalers have larger data centers, but what we don't have in sheer size, we're trying to make up for in distribution.

Nitay (15:06.822)
And that brings up an interesting point, which is: does that lend itself to going after particular kinds of use cases? Is it applications that are more sensitive to lower latency, or global replication, or so on, that AWS or others couldn't handle? How do you find yourself splitting up the market? Is it, like, if I gave you three or four different types of apps, you'd say, these are the ones that are perfect for Cloudflare

durable objects, whereas for these ones, you know what, stick with Postgres, go EC2, whatever, do that kind of stuff?

Josh (15:39.472)
Yeah, I mean, if a three-tier web application would just work for you, you can build it using Cloudflare's developer platform. It's really nice because it's all serverless, right? It's a very great developer experience to build on Cloudflare. However, the things that you couldn't build before, that you now can, are way more interesting, right? So I think where Cloudflare's

developer platform really shines is applications like multiplayer games, real-time streaming, collaborative editing; if you're doing document editing or something like that, that's a classic use case for us. Chat is another one. Really, any type of collaborative application where you're needing to interact with somebody in real time. Those are excellent applications for Cloudflare's network, where the

engineering effort to go build something like that on AWS would probably just be too high for most small companies. Durable Objects actually has an example chat application; you can go see the source code on GitHub. It's an out-of-the-box chat application in like 500 lines of code. So it's a great way to get started.

Kostas (16:59.598)
Yeah, that's amazing. Talking about interactive and real-time and all these things: have you noticed people using it also to build the cool thing these days, which is agents, and all these systems with agents interacting with each other and all that stuff?

Josh (17:19.497)
Oh yes. I mean, that's the big marketing push, right? And it's not just a marketing push; it actually works. But that's the thing you'll see most people from Cloudflare talking about right now. The cool thing is that a durable object can do basically anything a server can do. Another way to think of it is just as a very lightweight server. And that can interact with LLMs and go solve real tasks,

like people say an agent will do: it can send emails, interact with a chat backend, and stuff like that. We have a lot of people building agents on durable objects. We have a separate agents SDK, which makes it a little bit easier to build; it kind of shows you where to put your OpenAI keys and stuff like that.

It's really easy to get started with the agents SDK. So I'd say if that's the thing you're interested in, you should absolutely use the agents SDK. But we joke internally that the agents SDK is just "import durable objects as agents" from Cloudflare. It's a very lightweight wrapper that's more purpose-built. And I think those types of things are super useful, just because

durable objects are kind of abstract, and it's really nice to have something where you know exactly what vertical you're going after, right? So it'd be really cool, I think, to have another thing that was, say, a multiplayer game class: a special class of durable objects just for that.

Kostas (18:56.782)
Okay, that's great. Let's go back to what we were talking about a little bit earlier: you mentioned this developer platform, right? And I guess durable objects is just one part of it. Can you give us a little bit more color on that? What is this platform? And when you're talking about developers, which developers are we talking about? Because there are many different flavors out there, right? And I'm sure, obviously, it can be used for

any use case and any stack out there if you want to, but there's definitely some focus there. So tell us a little bit more about that: who should be excited about it, and what's the suite of tooling that you have available there for developers?

Josh (19:43.684)
Yeah. So maybe the most important limitation here is the stack, right? I'll probably start there, and then we can get into what the other tools are that you can use. As I mentioned, the serverless approach is different. We're not just managing containers on behalf of people. We're actually using, well, we call them isolates, but they're actually from V8. For those that don't know, V8 is the JavaScript engine inside of Chrome.

So every single Chrome tab is going to be another V8 isolate, and V8 isolates are the technology Google is using to make sure that malicious code in one tab cannot interact with another. There's obviously a lot of other layers of sandboxing there, but at the JavaScript runtime layer, that's V8. And Cloudflare is actually the first to use V8 server side, that I'm aware of.

We've got a lot of other mitigations on our side for isolating customer workloads that are kind of technically interesting, but the gist is that everything that runs on Cloudflare's developer platform has to run in a V8 isolate, which effectively limits the usage to JavaScript or TypeScript, because you can transpile TypeScript into JavaScript, or Wasm. So if you want to write some C++ and run that on the developer platform

and compile it to Wasm, I'm sure you can. I don't know anybody that's doing that, because it'd be a pain, just because it's C++, but I'm sure you could. We have a Rust SDK, which you can compile down to Wasm and run on the developer platform. There are a couple of other languages we support, but I would say JavaScript is the obvious choice for using Workers. So if you're a JavaScript developer, I would say you should definitely check out the developer platform first,

because it is the singular easiest way to deploy a JavaScript application. So that's kind of the limitation there. But with that, you get a lot of stuff, right? It's completely serverless; everything is running in these isolates; you don't have to manage any infrastructure; everything is pay-per-request. And we have a very generous free tier, so we have a lot of customers that just never leave free, which is great.

Josh (22:09.569)
We have a lot of other higher-level products for doing useful things. To go through a couple: there's Workers, for just running your application. There's Workers KV, which is eventually consistent storage that's highly scalable. You've got D1, which is our relational SQLite database, accessible to your Worker. Workflows, which I mentioned before,

is that durable execution model, which you can use. So there are a lot of different products here. Some of them are more AI-focused, some are more database- and storage-focused, but it's a full product suite, similar to what you would have on AWS with, what is it, EC2, S3, all of those. I don't know how many products we have; easily 10-plus.
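
For context, these products typically attach to a Worker as bindings declared in the project's wrangler.toml. A rough sketch of what that wiring might look like; all names and IDs below are placeholders, and a real project would need a few more details (such as a migration entry for the Durable Object class).

```toml
name = "my-worker"
main = "src/index.js"
compatibility_date = "2024-09-01"

# Workers KV: eventually consistent, highly scalable key-value storage
kv_namespaces = [
  { binding = "CACHE", id = "<namespace-id>" }
]

# D1: the SQLite-backed relational database
[[d1_databases]]
binding = "DB"
database_name = "my-app-db"
database_id = "<database-id>"

# Durable Objects: strongly consistent, single-instance coordination
[[durable_objects.bindings]]
name = "ROOM"
class_name = "ChatRoom"
```

Inside the Worker, each binding then shows up as a property on `env` (for example `env.CACHE`, `env.DB`, `env.ROOM`).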

Kostas (23:04.609)
Makes sense. And at what point will a new developer who starts interacting with the developer platform have to start reaching for durable objects?

Josh (23:20.263)
It depends on what you're wanting to build, right? You can build applications on the developer platform without ever using durable objects. And particularly for larger customers, we have some customers that will only put some portion of their app on the developer platform, just because maybe they've got this thing that's been running fine for years and they don't need to touch it, right? So you can effectively, from your application, call out to the developer platform and run a Workflow for durable execution,

rather than going and getting a Temporal license or using their separate SaaS product. So you've got a lot of options in terms of how you can integrate. But the point when you'd realize you need to use durable objects is: do you want to store state, or have a stateful application that runs at the edge? That state can be actual durable state, stored on a disk somewhere.

Or it could be something like needing to establish a persistent WebSocket connection to a singular place in the world, so you can have that as a coordination point for your game, or your document that you're going to have multiple people editing, or a chat room, or whatever. So there are a lot of different cases where you would need some form of state, but that's what durable objects is fundamentally providing.
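
The coordination-point pattern Josh keeps returning to, one object owning all the connections for a room, can be sketched without any of the WebSocket machinery. This is a toy model: the fake sockets below just record what they receive, where a real durable object would hold actual WebSockets.

```javascript
// Toy chat room: a single object holds every connection, so message
// ordering and broadcast are trivial. In a real Durable Object the runtime
// guarantees one instance per room name.
class ChatRoom {
  constructor() {
    this.sockets = [];
    this.log = []; // every member sees the same message order
  }
  connect(socket) {
    this.sockets.push(socket);
  }
  broadcast(from, text) {
    const msg = `${from}: ${text}`;
    this.log.push(msg);
    for (const s of this.sockets) s.send(msg);
  }
}

// Fake sockets standing in for WebSockets
const makeSocket = () => ({ inbox: [], send(m) { this.inbox.push(m); } });

const room = new ChatRoom();
const alice = makeSocket();
const bob = makeSocket();
room.connect(alice);
room.connect(bob);
room.broadcast("alice", "hi");
room.broadcast("bob", "hey");
// both inboxes now hold the same two messages in the same order
```

Because everything for one room flows through one instance, there is no cross-region fan-out problem to solve: whoever owns the room owns the ordering.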

Nitay (24:44.588)
I'm curious about one thing, touching on an earlier part of the conversation, when we were talking about AI agents and so on. As the application itself gets more complex, I imagine you start to see a need for various kinds of data access patterns, right? For example, in the obvious AI agent world, the vector stores and graph DBs and all these kinds of different things. How does that interplay in the Cloudflare world? Or do you need kind of more sophisticated mechanisms?

Josh (25:14.041)
Yeah. So it's interesting. From a database perspective, we have a bunch of different offerings, right? And the way to branch first in that decision tree is what your consistency model should be. So, as I mentioned, KV is mostly infinitely scalable. There's probably some limit there, but I don't know what it is; it's high enough that

I've never had to concern myself with it. And we actually do build on the developer platform for all things internal to Cloudflare, or at least most things internal to Cloudflare. My teams are using KV at various points, and I've never thought to bother to look up the limits. Maybe that's a bad thing, but, you know, whatever. So that's

one kind of extreme. I think durable objects is on the other extreme, where it is a single instance and it's strongly consistent. So it kind of just depends on what you need, but you can have different APIs exposed on either of those. For durable objects, we have a KV API where you can just do gets and puts on individual keys from within that durable object, and all of those KV operations are going to be, you know,

strongly consistent. And then we also expose the SQL API, so you can access data through a SQLite interface, right? That corresponds to a SQLite database on the local disk, and we do some replication underneath to make sure it's durable. So those are kind of the two models we support out of the box. There are also libraries where you can interact with either the KV product or the

KV storage within a durable object through, like, a Mongo-like API. As far as other types of data access patterns, I've kind of historically viewed those as APIs on top of some type of storage format. Different people would probably argue that's inefficient or could be further optimized, but I would say that's a pretty strong starting point: you've got one place to store this, one way to store this.

Josh (27:29.047)
And then you're just layering on different types of access patterns. I think Cosmos DB was probably the first large-scale system I've heard of that has that model: one format, one set of data, just different APIs layered on top.
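
In a real durable object, both of the APIs Josh mentions hang off the storage handle: key-value `get`/`put` calls, and a `sql.exec(...)` method for the SQLite interface. Below is a rough sketch of the key-value half, with a Map standing in for the durable, replicated disk; the class and helper names are illustrative, not Cloudflare's.

```javascript
// Stand-in for the KV half of a durable object's storage API. In a real
// object you'd call this.ctx.storage.get / .put, and the SQL half would
// look roughly like:
//   this.ctx.storage.sql.exec("SELECT count(*) AS c FROM visits");
// Both views sit over the same local storage.
class StorageSketch {
  #data = new Map();
  async get(key) {
    return this.#data.get(key);
  }
  async put(key, value) {
    this.#data.set(key, value);
  }
}

// A read-modify-write like this is safe inside a durable object because a
// single instance processes requests, so no other writer can interleave.
async function recordVisit(storage, user) {
  const count = (await storage.get(user)) ?? 0;
  await storage.put(user, count + 1);
  return count + 1;
}
```

Calling `recordVisit` twice for the same user returns 1 and then 2; on a shared database you would need a transaction or an atomic increment to get the same guarantee.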

Nitay (27:47.932)
That's very interesting. What have you guys done on the developer experience side? I'm always curious about that. If I'm a new developer with an app I want to build, on the one hand there's an incredible level of power here, right? Being able to instantly, basically, make it global and consistent and all that. And on the other hand, part of the exact beauty of it is that I don't want to have to think about any of that, right? I'm just writing a new app. Leave me alone, let me just put in the logic, make it work. And when I get to, you know,

ChatGPT scale, then probably this stuff will really, really matter. So how do you bridge that spectrum, from the single person starting a new app all the way to "I truly have a globally distributed app" and so forth?

Josh (28:28.835)
Yeah. I mean, the best part about that is that the complexity of making it globally distributed and stuff like that is all handled by the runtime. So from a user's perspective, it's the same thing as deploying a Node app in a container somewhere, like a DigitalOcean droplet or whatever. It's just that easy for our customers to go deploy an app.

So there's not a huge amount of cognitive overhead for getting a lot of the value that's provided. With that said, there are certain things you have to think about, like the fact that your code is running in multiple places from day one. If you're just fresh out of a developer bootcamp, that's not something you're going to be thinking about, right? Or at least, in the developer bootcamps I've seen; maybe there are some good ones that get into fun distributed systems problems, but

that's not really the problem they're trying to solve. So that's one piece. I think the other piece is that it's JavaScript, and we build on top of web standards. If you want to use popular libraries (Prisma is a super popular ORM for interacting with databases of all types), you can run Prisma against durable objects, you can run it against your D1 database. You can

use whatever JavaScript library you want. Most of them just work out of the box. Some of them have a couple of interesting dependencies, on your local file system or stuff like that, that require extra consideration, but in most cases it's certainly not a blocker to just bring the libraries that people know and love.

Nitay (30:14.63)
Got it, very cool. Do you find people typically shifting over existing workloads and applications and migrating them to it? Or is it new use cases? Where's the sweet spot of entry point?

Josh (30:28.495)
It's a good question. I think it's mostly new use cases. I'm trying to think of ones we can talk about. There are a lot of vibe-coding frameworks where providers will want to create a place to deploy the application to that's fairly low-cost,

has pretty good controls around it, and where you can have a standard set of services which are attached and available to build on. We've talked to certain customers about that type of thing. So I think that's a really popular new use case that has emerged in the past few months: if you're trying to open up a framework for

non-developers to go build an application, and you need some place to deploy it on their behalf, then Cloudflare is a... I think we've done a good job of capturing that specific use case, which has been pretty massive, again, in the past few months. There are also things like, as I mentioned, chat; that's a huge one. That interface has kind of exploded with ChatGPT.

It's kind of funny, because I've always thought about that as just an interface: it's a chat bot, a chat interface for a standard app. And the AI thing kind of followed after, with true agentic stuff, where it feels like you're potentially talking to another person there. But yeah, that use case is very easily deployed on Cloudflare's developer platform as well. I think it's mostly

new use cases. As I said, if you want to build a three-tier web app and don't want to manage infrastructure, then it's a great solution as well. But the things we spend most of our time on are solving things that you couldn't do with other infrastructure.

Kostas (32:41.486)
So I want to ask you something about the workflows that you mentioned, these workflow systems like Temporal. And I want to make a connection there with your past life. Back in the OLAP world, the workflow is kind of at the core of what is happening there, right? There is Airflow; there are many systems whose whole reason for existence is to coordinate and make sure that tasks are fault-tolerant, at least assuming the data engineer is doing their work right. But that's one world. There's also the world of applications, and we see this concept of a workflow there too, right?

Josh (33:14.489)
Yeah.

Kostas (33:38.99)
From someone who has seen both: there are similarities, obviously, but there are also differences. The biggest difference, obviously, is the audience, right? The people are completely different, they have different goals, they use different stacks. But fundamentally, the concepts are not that different. As a person who has built in both worlds, what do you see as

the things that are common and the things that are different? And why is there this split, let's say, in the concept of coordinating complex tasks that need to be executed?

Josh (34:29.155)
Yeah, I mean, it's a good question. To give you a bit of a philosophical answer, I think they should be very similar, right? There isn't a whole lot of fundamental technical reason why they can't be the same product. But one of the things that, as you mentioned, has emerged is that there are different tool chains; people approach those two problems with radically different backgrounds. Honestly, I think tech has a problem of reinventing the wheel in a lot of situations. You could probably use Airflow and Temporal interchangeably for a lot of workloads if you're willing to put in the effort, right? Now, where would the effort go? Probably into a different set of

connectors: what systems are you integrating with? There's a whole lot of work that would have to be put in there. The reason Airflow is really useful is, what are you using Airflow with? Well, it's probably Snowflake, Spark, Trino, and there are pre-built connectors and patterns that you can just use. So you don't really have to put a whole lot of first-principles thought into it.

For workflows on the application side, you're probably integrating with Stripe, with your actual database state, or with some type of external email notification system. So they're just different. I think it's more the ecosystem that's built up around these two very similar engines that differentiates these products.

So with unlimited time and resources, they could converge, but I don't know, they'll probably keep being separate.
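The durable-execution idea behind engines like Temporal and Cloudflare's Workflows product can be boiled down to one mechanism: each step's result is checkpointed, so a retried run replays completed steps from storage instead of re-executing them. Here's a minimal sketch of that idea in plain JavaScript; all names are hypothetical, and this is not any real Workflows or Temporal API:

```javascript
// Minimal durable-execution sketch: each step's result is checkpointed,
// so a retried run replays finished steps instead of re-running them.
// All names here are illustrative, not a real API.
class WorkflowRun {
  constructor(store) {
    this.store = store; // stands in for durable storage (a DB row, DO state)
  }
  step(name, fn) {
    if (this.store.has(name)) return this.store.get(name); // replay: skip work
    const result = fn();
    this.store.set(name, result); // checkpoint before moving on
    return result;
  }
}

const store = new Map(); // survives "crashes" across attempts
const effects = [];      // records which steps actually executed

// Attempt 1: pretend the process dies right after the charge step.
const run1 = new WorkflowRun(store);
run1.step("charge", () => { effects.push("charge"); return "ch_123"; });

// Attempt 2 (the retry): charge is replayed from the checkpoint,
// so only the email step does new work.
const run2 = new WorkflowRun(store);
const chargeId = run2.step("charge", () => { effects.push("charge"); return "ch_123"; });
run2.step("email", () => { effects.push("email"); return "receipt for " + chargeId; });

console.log(effects); // the charge side effect happened exactly once
```

The checkpointing core is the same whether the steps call Snowflake or Stripe; as Josh says, it's the connector ecosystem around this engine, not the engine itself, that splits the two worlds.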

Kostas (36:29.038)
Yeah. So going back to Cloudflare, because the approach you're taking is very opinionated in terms of the stack and the user, right? Is there a plan, or do you see a world, where more offerings will be out there for other people? Let's say I'm a data scientist or a data engineer, or whatever we want to call them today. I don't know, AI engineer, ML engineer, anyone that's more on the

data side, let's say, with different stacks and toolchains. Do you see a world where people from that part of the industry will also be able to work on the Cloudflare developer platform?

Josh (37:19.609)
I mean, do I expect that to eventually happen? Yeah. If you go back to 2006 AWS, were they thinking about SageMaker then? No, but if you look at the pattern of history, eventually these platforms just keep adding new products. So I think at some point we will definitely get there. The thing

we've been talking about internally is: what are the things we can do that are actually differentiated products? So one of the things is, DuckDB is really cool. We've got a lot of people who think it's a very interesting product idea. It's nice because the storage layer is separable, so you can imagine a very simple product architecture of having DuckDB sitting in front of R2,

with everything hosted at Cloudflare's edge. We can manage security and things like that for you, without having to deal with allowing your local laptop to connect, and without consuming your laptop battery to run expensive queries. Is that a differentiated enough product for us to go build? I don't know. That's something we haven't pulled the trigger on.

There are a lot of things we could go build where we're kind of trying to see: is that a direction we want to go in? But ultimately, will there probably be something for data scientists? Yeah, I'd imagine so. I think we're still trying to figure out what's the thing we can uniquely do better than other people. It's for the same reason that we haven't built something like a CockroachDB type of vertically, infinitely scalable system:

Kostas (39:04.29)
Hmm.

Josh (39:14.12)
that's a very hard problem to solve, and we're not convinced we could do a better job than them, or the Yugabyte folks. So.

Kostas (39:22.53)
Yeah. So how has your experience been going from one world to the other, right? From building systems that are more on the data processing side of the industry, to actually serving a different user, the application developer. How have you experienced that, and what are the pros and the cons?

Josh (39:48.858)
Yeah. Well, it's very funny, having focused on query engine performance for several years and dug into that. It's funny that everything there is CPU bound. Fundamentally, you're crunching as many numbers as fast as possible, so you're always going to be CPU bound. And so you're always trying to generate better optimizer plans, which reduce the

number of operations you have to do, and then just do those operations faster on the execution side. So I was operating in a world where the only thing you really had to worry about was CPU efficiency, and if you did that, you were good. On the application side, it's network latency that's the real killer. There have been very few times where I've even had to think about taking a flame graph, right?

But we do have to trace all network calls, and worry about what happens if we have a cache miss and have to incur extra hops. Our protocol for routing to durable objects is gossipy, essentially to avoid having to make multiple round trips. So there are a lot of different ways that you engineer things to avoid extraneous network calls, where

that just wasn't the important thing when building a query engine. I mean, Trino is a distributed query engine, so whenever you have a pipeline-blocking operation, like an aggregation or a join, you do have to have this all-to-all communication between all of the workers in your cluster. But there's no way to avoid that, so you basically just want to make sure all workers can progress as quickly as possible. Again, we've done some CPU optimizations within durable objects, but it's really just making things more multi-threaded that were originally not intended to be multi-threaded. We're just never blocked on CPU. So that's the fundamental systems-level change. And then the surface area, the API surface area, is actually interesting as well, because

Josh (42:14.989)
Trino is based on SQL, right? It's ANSI SQL; there's a standards body around it. Web APIs have standards bodies as well, but it turns out that JavaScript libraries have a whole lot of surface area, and you can make arbitrarily good or bad decisions there. So it's been fun being able to solve different problems, but I think the application space seems a little more unconstrained,

I would say. And that's not necessarily a better thing; constraint makes for great innovation. It's just kind of wide open in terms of what you can spend your time doing.

Kostas (42:57.39)
Yeah, it makes a lot of sense. It's interesting what you say about the trade-offs there and the constraints, and where you're focusing: the CPU-bound workloads in the OLAP world, and the latency-bound, network-based or IO-bound ones, whatever you want to call them, for the application layer. As a system designer and builder, do you see the introduction of another variable,

which is the GPU? There are many layers of APIs on top of that, with the LLMs, and they're a completely different type of workload, even in terms of what we're measuring, right? In online inference we start seeing things like, for example, time to first token,

time to last token, and throughput, though throughput might not be that important. And usually when we start introducing new metrics, it means something has changed, right? Do you see any impact on how you think about building systems at the level that you do? Because now you have systems that also have to interact with these

LLMs, which have a completely different profile in terms of how they interact, and the latencies, and all that.

Josh (44:30.955)
Yeah, that's a great question. I wish I had a better answer, but we've found that there are a lot of lower-level things that are fun to talk about, while fundamentally the issue we get the most complaints about is the fact that you can have requests which die while streaming back a multi-minute response from an LLM.

It's really interesting that the thing that seems to break is that there's no mechanism for pulling or retrieving chunks of data from these LLMs. It's pretty basic stuff that just seems like it hasn't been solved yet in the AI world. I mean, they're going really fast and solving a bunch of other interesting things. But it's kind of funny: for Trino, we have an HTTP API where you can pull back

large result sets from the coordinator node on a cluster. You basically have a pagination API: pull the next result set using a pre-signed token or something like that. You can't do that with an LLM. You just have to hold open a connection and hope it doesn't fail. And there are a bunch of reliability problems that we've seen where those are the things we

get complaints for, not that the way we're tracking latency is insufficient. So where we're at today, it seems like there's just some basic API design work that needs to be done, and then hopefully the more interesting systems stuff will follow. It feels like there's a lot of basic blocking and tackling yet to be done.
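The contrast Josh draws between Trino's pagination API and hold-open LLM streams can be sketched in a few lines: if generated chunks stay addressable behind a cursor, a client whose connection died simply resumes from its last acknowledged position instead of losing the whole response. This is a toy illustration with hypothetical names, not a real Trino or LLM API:

```javascript
// Sketch of cursor-based chunk retrieval: the server keeps generated
// chunks addressable, so a client can re-request from its last cursor
// after a dropped connection. Purely illustrative.
const chunks = ["Once", " upon", " a", " time"]; // pretend model output

// "Server" side: return one chunk plus the cursor for the next request.
function fetchChunk(cursor) {
  if (cursor >= chunks.length) return { done: true };
  return { data: chunks[cursor], next: cursor + 1, done: false };
}

// "Client" side: `crashAt` simulates the connection dying mid-stream;
// the client just re-enters the loop later with its saved cursor.
function consume(startCursor, crashAt) {
  let cursor = startCursor;
  let text = "";
  while (true) {
    if (cursor === crashAt) return { crashed: true, cursor, text };
    const res = fetchChunk(cursor);
    if (res.done) return { crashed: false, cursor, text };
    text += res.data;
    cursor = res.next; // persist this before asking for more
  }
}

const first = consume(0, 2);          // connection dies after two chunks
const second = consume(first.cursor); // resume; nothing lost or repeated
console.log(first.text + second.text);
```

With a hold-open stream there is no equivalent of `first.cursor`, which is exactly the reliability gap Josh describes.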

Nitay (46:20.956)
Do you find there are any limitations on the compute and the network side? You mentioned some interesting things there around avoiding the n-squared algorithm, the exchange or shuffle; as soon as you do that in a global network, it all falls apart, so you do this interesting gossip stuff. I know a lot of folks who have tried serverless. Typically they try something like an AWS Lambda, and Lambdas are easy to get going and good for most basic stuff, but at the same time,

once you really try to push their limits, you start to see these CPU limits, memory limits, all these different kinds of issues, right? Do you guys find there's a certain sweet spot in the types of compute and the types of workloads people should be running on it, or anything about how to think about it?

Josh (47:10.261)
Yeah, I mean, that's a great question. So there are obviously limits with a serverless platform, right? It's inherent. You can always go buy a bigger box, but with a serverless platform, we have to effectively decide that we need to scale your workload across more servers for you. And so there are certain

situations where you can run into memory limits, or get your CPU throttled. I think observability is really the key piece there. It's the same question as when you're in charge of the server: how do you know that you need to go buy a larger one? Or rent one, I guess. So

you can connect developer tools to your worker to take heap profiles and see memory usage over time, right? There are a lot of ways that are already built for you to go inspect the isolates themselves. So there's a pretty nice toolchain for doing some deeper web performance work on server-side JavaScript, and we're compatible with the parts of it that I'm aware of.

So that's one piece. But fundamentally, there's always going to be some limit in terms of scale. Workers is processing around 50 million requests per second, which is pretty high. The interesting thing is that Cloudflare also has DDoS protection, so you're really not going to be able to knock over Cloudflare's infrastructure from that side. Workers should be viewed as effectively infinitely scalable.

Durable objects: an individual durable object is not infinitely scalable. This is an interesting limit that a lot of people run into. It's a single thread, right? You can have something that blocks that thread (maybe you're mining some Bitcoin), and it won't receive more requests until it's done with the current one. So you have to deal with the same don't-block-the-event-loop problems that you do in the browser. The good thing is that

Josh (49:27.289)
there's some symmetry in that problem: people who are used to solving those problems client-side can then go solve them server-side. So they're not new problems. You still have to think about the same things as a JavaScript programmer.
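The don't-block-the-event-loop constraint on a single-threaded durable object can be pictured with a toy cooperative loop: a long CPU job either hogs the thread until it finishes, or yields between slices so queued requests interleave. This is purely illustrative scheduling logic, not Workers runtime code:

```javascript
// Toy cooperative "event loop": tasks run until they yield; a yielding
// task returns a continuation that is re-queued. Illustrative only.
function runLoop(tasks) {
  const order = [];
  while (tasks.length) {
    const task = tasks.shift();
    const next = task(order);   // run one slice of the task
    if (next) tasks.push(next); // re-queue the continuation
  }
  return order;
}

// A CPU-heavy job written as resumable slices.
function heavyJob(slicesLeft) {
  return (order) => {
    order.push("heavy-slice");
    return slicesLeft > 1 ? heavyJob(slicesLeft - 1) : null;
  };
}
const quickRequest = (order) => { order.push("request"); return null; };

// Blocking version: one closure does all the work before returning,
// so the queued request waits behind every slice.
const blocking = runLoop([
  (order) => { for (let i = 0; i < 3; i++) order.push("heavy-slice"); return null; },
  quickRequest,
]);

// Cooperative version: the heavy job yields between slices, so the
// request is served after the first slice instead of the last.
const cooperative = runLoop([heavyJob(3), quickRequest]);

console.log(blocking.join(" "));
console.log(cooperative.join(" "));
```

In a real durable object the same effect comes from awaiting I/O or chunking CPU work, exactly as browser JavaScript programmers already do.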

Nitay (49:41.54)
I'm curious, you've mentioned this notion of server-side JavaScript a few times, like you guys taking the V8 platform and making it work server-side. What was the key innovation or insight or challenge there to make that actually work?

Josh (49:57.296)
Honestly, it was before my time, so I don't know what the insight was that made people want to do that. I mean, it works in hindsight, but why then, and what initial work had to go in? I'm not sure. One of the things we do have to continually deal with is that, because we're using isolates, that's thread-level isolation.

So by default, these workloads are susceptible to Spectre attacks, and we have to do a lot of work to mitigate them. We have the ability to dynamically sandbox individual threads. If we suspect your code is actually doing something nefarious, we can dynamically say: actually, no, you get your own process while we figure this out, because we don't want you potentially inspecting other threads within the same process.

So there's some security work that has to be done. We employ security researchers who are world class in Spectre mitigations, and we take that very seriously. That's one thing I know we had to do. The other thing is really just scheduling. If you're talking about running multiple Chrome tabs, the scheduling problem is a little simpler: you can only make progress on the active one, potentially. That's

not the best scheduling algorithm, but it's a scheduling algorithm. For us, we have to be able to make concurrent progress across a lot of different isolates and have reliable performance, and so we have a multi-tenant scheduler that had to be implemented. So there's some stuff like that I know of, but those are the parts of the system I've encountered from the DO vantage point.
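The multi-tenant scheduling problem Josh mentions, making concurrent progress across many isolates with reliable performance, is at its simplest a fair round-robin over per-tenant queues, so one busy tenant cannot starve the rest. Here is a toy sketch with hypothetical names; the real Workers scheduler is surely far more sophisticated:

```javascript
// Toy fair scheduler over per-tenant work queues: each pass gives every
// tenant at most one unit of work, so a tenant with a deep backlog
// cannot starve the others. Hypothetical, illustrative only.
function fairSchedule(queues) {
  const executionOrder = [];
  let pending = true;
  while (pending) {
    pending = false;
    for (const [tenant, queue] of queues) {
      if (queue.length === 0) continue;
      executionOrder.push(tenant + ":" + queue.shift()); // one slice per pass
      if (queue.length > 0) pending = true;              // more work remains
    }
  }
  return executionOrder;
}

const queues = new Map([
  ["noisy", ["t1", "t2", "t3"]], // tenant with a deep backlog
  ["quiet", ["t1"]],             // tenant with a single task
]);

// "quiet" is served on the first pass instead of waiting behind "noisy".
const order = fairSchedule(queues);
console.log(order.join(" "));
```

Compare this with the browser's "only the active tab progresses" policy Josh contrasts it with: here every tenant makes progress on every pass.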

Nitay (51:45.498)
That's fascinating. And the point you made on the security side is very interesting. So there's a performance-to-security trade-off, right? Because on one extreme, you could run everything in a separate process, but then you lose, I imagine, all the performance and memory savings and so on.

Josh (51:57.38)
Well, the performance is really just the overhead of operating these things; the impact is really cost, right? That's cost to Cloudflare, which we're pretty generous about giving back to customers in terms of a free tier, and also running their code everywhere. So it's cost versus security, and being able to dynamically do process isolation whenever it's warranted is a good trade-off.

Most people are good citizens, and they can just run in their own thread and they're fine.

Nitay (52:34.8)
I'm curious, randomly, since you mentioned it: there was the famous story of, I think it was Heroku, who shut down their free tier because people were abusing it and using it for various unwanted things. Have you guys run into anything like that? Is there anything down the line in the design that...

Josh (52:43.812)
Yeah.

Josh (52:51.499)
Yeah. I mean, we have a ton of abuse mitigations, right? They happen very frequently. We had a public incident a few weeks ago now where a person thought they were mitigating abuse, and what they did to mitigate that incident was shut down the worker which powers Cloudflare R2.

So it was a global R2 outage because of a well-intended attempt at mitigation. I think an R2 user was hosting some malicious content on R2, and in the step to mitigate that, they shut down the worker which was serving the content, which was the same worker that serves all R2 requests. I'd have to go back and read the public incident report again for the details, but

there are bad offenders, we have mechanisms for tracking that, and we have a pretty solid team for mitigating those issues. But in a hurry to mitigate some of them, you can cause more damage. So it is an ongoing concern, but we have no plans of shutting down the free tier. Honestly, it's just a great business for us; the free tier is not that expensive

for us. I mentioned 50 million requests per second for workers generally; durable objects serves about 500,000. Our free tier usage is probably less than a percentage point of that. It's tiny by comparison.

Nitay (54:38.62)
Yeah, and it must be a way for people to get in and get started too. Makes sense. So shifting gears, maybe one more question from me before Kostas takes over: I'm curious to hear what current challenges you're going through, and what's in the future for you guys.

Josh (54:41.463)
Yeah, the marketing benefit just far outstrips the cost. Yeah. So.

Josh (54:59.801)
Yeah. So the key thing, right, I mentioned that we have a storage engine. What we have for durable objects is what we call the storage relay service. It's a complete rewrite of the storage backend for durable objects. Originally we actually used CockroachDB. We've not been super public about that, but we've run into issues with it.

It's not a secret by any means; it's outlived its purpose for us. There are some technical challenges with running CockroachDB on Cloudflare's infrastructure, and then there were some licensing changes which are unfriendly, so we've had to move. But we've rewritten our storage backend into this thing called the storage relay service. It's really cool, because that's the thing that gives you access to SQLite from within durable objects. We announced that in October.

The new storage backend went GA this last Developer Week, in early April. So that's the big thing. But I think the coolest part is still yet to come, because there are a lot of places in the world where durable objects does not run, right? The latency benefit, the coordination benefit you get from low latency, those don't exist in some parts of the world, particularly the Global South. And the reason is that we can't run durable objects there, because the infrastructure is too heavy.

The storage relay service, in addition to providing a lot of new storage APIs, is actually much lighter weight. It can deal with network disconnects and flaky network infrastructure, so we can move much more quickly to new parts of the world, even new parts within the same continent or region. We're very confident this is an architecture we can deploy in those places. The next piece is actually getting enough infrastructure there. We've been seriously talking about

India; we've spec'd out what an India region would look like for us, so we're planning on deploying there relatively soon. And then South America and Sub-Saharan Africa will follow. The plan is that we want to be running durable objects everywhere, in all 335 data centers that Cloudflare currently runs, not just the, I don't know, maybe 150; I think we probably run in half of them now, I'm not exactly sure. So that's the big piece. And beyond that, I think the placement story

Josh (57:27.075)
for durable objects is not where we want it to be. Right now we have a fairly simple heuristic of placing a durable object where it was first requested. That works surprisingly well, but it's not particularly dynamic: if objects need to move, you can write code to move them, but it's not handled for you in a seamless way. That's something I really think we want to go solve, and the first place we're solving it is actually through read replicas.

So durable objects do have a capability, currently private, to have read replicas. We use that to power read replication for D1. We're holding that functionality back behind some flags because we want to make sure we get the API right for durable objects, so that it's not complicated to use. But the functionality we're releasing through D1 is stable; we're really just trying to finalize the API. Once we get that, we're planning on doing smart placement for replicas first, and that's something we're actively working on right now. So there's still a ton of stuff following this massive storage rewrite that we've unlocked; there's a lot more to come with that new system. Those are some of the things I'm most excited about.
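The first-request placement heuristic Josh describes is simple to picture: the first region to ask for a given object id pins the object there, and every later request routes to that pin. A hedged sketch, where the region names and functions are entirely hypothetical:

```javascript
// Sketch of first-request placement: a durable object id is pinned to
// the region that first asks for it, and later requests route to that
// pin. Names are illustrative, not Cloudflare internals.
const placements = new Map(); // objectId -> region (the "directory")

function route(objectId, requestRegion) {
  if (!placements.has(objectId)) {
    placements.set(objectId, requestRegion); // first request wins
  }
  return placements.get(objectId); // region that serves the request
}

console.log(route("doc-42", "fra")); // first touch from Frankfurt: pins there
console.log(route("doc-42", "sin")); // Singapore request still routes to fra
console.log(route("doc-99", "sin")); // a fresh object pins to Singapore
```

The "not particularly dynamic" limitation is visible here: once `doc-42` is pinned, a shift in its traffic toward Singapore changes nothing, which is what smarter placement and replicas would address.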

Kostas (58:51.767)
All right, that's some amazing stuff. One last question, because we're close to the end here, Josh. I'll go back to the beginning of our conversation, when we were talking about the transition from the OLAP world to these storage systems. You mentioned that in the OLAP world, and in optimization, there is

a lot of information out there. You go read the papers, sit down, try to implement these things, measure your CPU, and if things look better, you move on to the next one. But in what you're working on right now, things are not like that, right? Maybe it's a little harder, let's say, to get into the details of how you design these systems. Tell us a little bit about that, and how someone

who is interested can learn more about these things.

Josh (59:54.224)
Well, Cloudflare has a blogging culture, and in all of our blog announcements we actually include, at the bottom, some more details on how we added the feature we're blogging about. We're pretty consistent about doing that, so you can actually see a good amount of what we've architected. You do have to scroll past the headlines. The headlines are cool, you should read them too, but if you want just the implementation details, they're at the bottom.

In terms of other things, I would say we're trying to do more public appearances about what we're building. I'd love to get more people on the engineering team to contribute to that as well. But you can always also just reach out; that's something I'd be very glad to talk to people about. We can go more into that if you want more specific details.

Kostas (01:00:52.238)
I think we'll need another episode for that, which I'd be happy to do, by the way, with that as the topic, going into the details. So thank you so much for spending the time with us today. I think this was an amazing introductory episode, and maybe we'll need at least one more to go deeper into the systems engineering and design. Tell us

how people can reach out, and what other information you'd like to share with them as a follow-up.

Josh (01:01:32.495)
Yeah, absolutely. So feel free to reach out by email or on Twitter. My email is jhoward at cloudflare.com. On Twitter, you can find me at aJoshHoward; there are several Josh Howards, so I'm the one that starts with an A. As for other information, I would say we're hiring, right? So if you're interested in building durable objects or D1, please reach out. I'd be happy to have a conversation with people who

are listening to this podcast and find this stuff interesting. Otherwise, I obviously represent a large team of people contributing to the Cloudflare developer platform, and to durable objects and D1 specifically, so definitely a shout-out to the team. And try it! We recently launched a free tier, so it's free, and it's really great. You've got

a durable object you can just leave running for an entire day, and that resets at midnight every night. So you can build a lot with that, surprisingly, because it's just pay-per-request. You can build real applications that are free using durable objects. So try it out.

Kostas (01:02:47.928)
Amazing. Thank you so much, Josh. We're looking forward to having you again on the show.

Nitay (01:02:51.77)
Thank you. Thank you, Josh. It was fantastic.

Listen to Tech on the Rocks using one of many popular podcasting apps or directories.