Kostas (00:02.326)
Hello and welcome. It's so nice to have you here on the show together with Nitay. Let's start by hearing about your background: a quick introduction, what you're up to, and what you're working on today.
Philippe Noël (00:17.93)
Yeah, thanks for having me. I am the founder, one of the founders and CEO of a company called ParadeDB. We are building an Elasticsearch alternative on Postgres. What that means concretely is we're building fast full-text search and fast on-disk analytics on Postgres. The goal is to offer companies who run Postgres as their relational database a zero-ETL solution for doing search and analytics. So we work with a lot of companies in the
fintech, e-commerce, sales automation, and legal tech industries, when they want to do user-facing dashboards, user-facing search experiences, user-facing filtering, or table-type workloads, things like that.
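For concreteness, here is a minimal sketch of what that looks like from the application side. The SQL is modeled on ParadeDB's published examples (a bm25 index and the @@@ search operator); exact syntax can vary by version, and the table, columns, and connection settings are made up for illustration.

```python
import psycopg2

# Hypothetical connection to a Postgres instance with the ParadeDB extension installed.
conn = psycopg2.connect("host=localhost dbname=app user=app password=secret")
conn.autocommit = True
cur = conn.cursor()

# One-time setup: a BM25 full-text index over an existing table
# (syntax modeled on ParadeDB's docs; treat as illustrative).
cur.execute("""
    CREATE INDEX IF NOT EXISTS items_search_idx ON mock_items
    USING bm25 (id, description, category)
    WITH (key_field = 'id');
""")

# Query time: ordinary SQL, with full-text matching pushed into the index.
cur.execute("""
    SELECT id, description
    FROM mock_items
    WHERE description @@@ %s
    LIMIT 10;
""", ("running shoes",))
print(cur.fetchall())
```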
Kostas (01:00.268)
That's awesome. How did you get into that? Elasticsearch has been around for a while. I'm old enough to remember Lucene first of all. And then Elasticsearch coming in and like changing the way that like people worked with it. I remember also the first era of Postgres where search was not a thing.
It did add capabilities for that at some point. So it is a technology that has been really important for a very long time. I think it also has changed focus in a way. Like, if you look at the company around it, Elasticsearch itself started more with, as you said, let's say, doing search on a blog, right? Then search with free text to
Kostas (01:59.328)
accommodate more use cases around log analytics and that kind of stuff, which is kind of different, right? And a lot of tooling built around that. So tell us a little bit, from your perspective as someone who is a builder in this space, the story of inverted indexes, search, text, Elasticsearch, up until today, working on building ParadeDB.
Philippe Noël (02:29.121)
Yeah, I mean, your first question, how we got into it, a bit by accident, to be honest. I previously was building a browser and then we, yeah, we moved on from that project. And as we were looking for the next thing to do, we, me and my co-founder, did a lot of consulting and we were working with both Postgres and vector databases or Postgres and Elastic and had a lot of pain points around it.
And eventually we talked to more people and realized that was a pretty generalized pain point. So we kind of just started going into it, but I was rather new to the data systems world, I would say. You know, browsers are systems, but they're a different kind of system for sure. As far as the second part of your question, what we're doing: yeah, Elastic is great. They also started with user-facing search. They moved into observability. I think
you need to build a really strong foundational data store before you can start to do observability-type workloads on it. So I think that company trajectory makes a lot of sense. We're still at the stage where we're focusing more on the foundational data store. And so we're doing user-facing workloads, but I think one day, you know, if all goes well, we might move into observability too. Who knows.
Kostas (03:40.042)
Yeah, that makes sense. Okay, observability is a very interesting topic for me, primarily because so far, at least, right, there's no one data store that can serve observability completely. Like, people say there are these three pillars in observability: we have traces, we have logs, and we have metrics, right? And Elasticsearch, I would say,
focuses more on the logs side. Of course, as a company, they probably added more stuff in there. But naturally, because of the technology itself, logs are the first thing there. But as someone who is starting this journey to build the technology, I'd love to hear your thoughts both as a founder, so from a business perspective, but also as an engineer, from the engineering perspective.
Philippe Noël (04:13.141)
Yeah.
Kostas (04:40.79)
How do you plan ahead on a journey that will take you to such a different place at the end, right? Like, starting from text and logs, and then if you go into observability, you need to start introducing all these different things that are kind of different. So how do you think about that? And do you think it is even possible, at the end of the day, to do it?
Philippe Noël (05:06.305)
I think it is possible. I mean, it's definitely possible. I like to think we could be that data store that is able to do logs and metrics and traces and so on, but we'll see. I don't think about it that much, though. I'm a pretty big believer that you just listen to customers and you keep critical thinking on top of what customers say. So we have a rough idea of where we want to go. And I listen to customers and
I take what they say from it. And as long as it's reasonably in the same direction as what I'm thinking, I feel like everything is good. And the details along the way are, you know, small problems that get resolved as you go. And if it turns out that what customers are asking for is very different from where we're thinking of going, then probably where I'm thinking of going is wrong, right? And I should understand better what the customers are saying. So far, for example, in our case, we mostly work with customers that have user-facing workloads.
But every now and then, we have some customers that are starting to ask about observability. We're not an observability platform, and we're very far from being one. But it gives me some confidence that there might be something there down the line. And if we continue to chug along on the more actionable feedback and actionable requests, those requests for observability-related features might increase as we go. And I think that's enough for now.
Kostas (06:30.912)
Mm-hmm.
Yeah, yeah, for sure. Okay, let's focus on today though. Observability is the future, and hopefully we'll have another episode in the future to talk about that, because ParadeDB will solve the problem there too. But let's talk about the customer-facing search and the problem there. Tell us a little bit more about that. How do you define it, first of all? What does it mean? What's the use case there, and why do people need something
Philippe Noël (06:44.897)
Yeah.
Kostas (07:02.924)
like ParadeDB or Elasticsearch?
Philippe Noël (07:06.953)
Sure, sure. So in more technical terms, what we do is fast search and analytics over Postgres data. The reason I say it's user-facing is because typically Postgres is used as a popular relational database to build applications on top of, right? So usually the data that's stored in your Postgres database is going to be exposed in some way in the actual application, unless it's just user logins and things like that, but for the most part. So we...
The canonical architecture will be a customer that has, let's say, Postgres as their relational database. They really like it. Eventually they need to have user-facing search and analytics over that data, whether it's an e-commerce catalog, or a list of data within a CRM, or a list of leads that could be messaged via a sales automation platform or something like that, or transactions that happen on a payment platform like PayPal or whatnot.
They want the users to be able to visualize and search through that information for whatever reason. Postgres is not really good at doing this, so they need to bring in an external data store, let's say Elasticsearch (or there are many others, right?), and then do ETL over to that data store. At that point, the company needs a new data store, they need to know how to operate it, they need to know how to build with it, they need an ETL pipeline to work with it, that ETL pipeline can break, and they also need to do data conversion and data transformation along the way,
which leads to incompatibilities. Most of these tools, specifically Elasticsearch, are known for not being very reliable. Elasticsearch is not even a SQL-based tool. So you need to denormalize your data, it doesn't support joins, so on and so forth. So all these problems sort of creep in for people at companies that were previously pretty happy with Elastic, excuse me, with Postgres. And so what we do is we make all of this go away. So ParadeDB is fully based on Postgres, it's not a fork. So we support joins.
We support connecting as a replica to your existing Postgres so you don't need ETL. We support full Postgres transactions, ACID properties. So the reliability and data integrity are much higher, which is very important for our fintech customers, for example. You don't need to denormalize your data. That means in the full-text search world, the re-indexing is much lighter, because you don't need to re-index the full index. You just re-index the specific rows that get modified, and so on and so forth. Those are kind of the reasons people adopt us.
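Because ParadeDB stays inside Postgres, the full-text filter can sit in the same query as ordinary relational features like joins, which is the part Elasticsearch forces you to denormalize away. A hedged sketch, again borrowing the @@@ operator from ParadeDB's examples; the orders/products schema and columns are hypothetical.

```python
import psycopg2

conn = psycopg2.connect("host=localhost dbname=app user=app password=secret")
cur = conn.cursor()

# Full-text filter on one table joined against another, no denormalization:
# the catalog stays normalized and the search index only covers products.
cur.execute("""
    SELECT o.id, o.ordered_at, p.name
    FROM orders AS o
    JOIN products AS p ON p.id = o.product_id
    WHERE p.description @@@ %s
      AND o.customer_id = %s
    ORDER BY o.ordered_at DESC
    LIMIT 20;
""", ("wireless headphones", 42))
for row in cur.fetchall():
    print(row)
```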
Kostas (09:33.196)
All right. I have a few different things here that I would like to clarify. So we have search as it is, let's say, defined by Elasticsearch. I put a keyword in there and I expect something similar to it to come back. And there's a lot of,
let's say, specific things that need to happen there in order to do that, which traditional databases wouldn't do. That's why Lucene was built and all that stuff. Which, by the way, now that I think about it, kind of makes sense given that you come from the browser world, because Lucene initially was created to index the web, right, and search over that. So there is a connection there at the end of the day. But that's one thing, right?
That's more of the functionality that you get when you go to a blog and you want to search for a post that talks about Nitay, for example, and it has to return all the documents in which Nitay is one way or another mentioned. But then you mentioned also analytics, and that's a little bit different. In analytics, you have aggregations, you have, let's say, more, I would say, OLAP kinds of
operations there, which again, traditionally, people would take the data from Postgres, put it on a different system, and do their analytics there. And then at some point, the use case of doing customer-facing analytics on top of this data emerges. And that's why you have, at scale at least, systems like Pinot or Druid that do that kind of stuff. But these are different, right?
Philippe Noël (10:58.771)
Mm-hmm.
Kostas (11:28.096)
Elasticsearch, I mean, maybe you can do, let's say, both with Elasticsearch, but obviously you can't replace what Pinot can do with Elasticsearch and vice versa. So with ParadeDB, is the user able to do both, or is the focus more on one of the two?
Philippe Noël (11:48.001)
We're very focused on being an Elasticsearch replacement. So when I say analytics, I really mean faceted search, so analytics over full-text search results. A very common one might be bucketing your search results into different categories, right? You're an e-commerce website, you want to bucket them based on three-star, four-star, five-star ratings of the product, let's say, right? So that's one common one. Another common one might be counting the number of results. That's also very common, right?
You want to tell the user there are 200 million results reported, even though you're showing them 100, right, or 50 or something like that. So those are very simple cases of analytics and columnar aggregations that we support. As we go, we keep getting more and more requests for sort of general OLAP, like online analytical queries, that we may end up supporting. We've found that most of the big spenders on Elastic
really needed the interaction of both faceted search and traditional full-text search. So we were kind of forced to build those features to sell to those customers. But we're not trying to be an offline analytics engine, and we're not really trying to be this sort of distributed, Druid-style analytics system for really large-scale data stored on S3. We're very much looking to do what Elastic does today.
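The two facet examples Philippe mentions (bucketing by rating and a total hit count) reduce to ordinary SQL aggregations once the full-text filter runs inside Postgres. A hedged sketch with a hypothetical products table, again using the @@@ operator from ParadeDB's examples.

```python
import psycopg2

conn = psycopg2.connect("host=localhost dbname=app user=app password=secret")
cur = conn.cursor()

# Facet: bucket matching products by star rating.
cur.execute("""
    SELECT rating, COUNT(*) AS hits
    FROM products
    WHERE description @@@ %s
    GROUP BY rating
    ORDER BY rating DESC;
""", ("running shoes",))
facets = dict(cur.fetchall())  # e.g. {5: 1200, 4: 3400, 3: 900}

# Facet: total number of matches, even if only a page of 50 is displayed.
cur.execute("SELECT COUNT(*) FROM products WHERE description @@@ %s;", ("running shoes",))
total_hits = cur.fetchone()[0]
print(facets, total_hits)
```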
Kostas (13:05.821)
All right, one last question from me and then I'll give it to Nitay, going back to something I touched on a little bit in the introduction. So Postgres at the beginning didn't have support for this type of search, right? At some point they introduced that. So you can create an index right now where you can do, let's say, this type of search.
Why is this not enough? Before we get into ParadeDB, right, why do people still have to go to Elasticsearch? Then we can talk about ParadeDB. But Elasticsearch is still there, even though Postgres has this type of index. So why is this happening?
Philippe Noël (13:52.181)
Yeah. Yeah, good question. I spoke at CMU about our work about a month or two ago, and I talked about that in detail if you're ever curious. There's kind of four things. First of all, the relevancy ranking algorithm that Postgres implements is not the state of the art. They implement something called TF-IDF, and the state of the art is called BM25, which is a generalization of TF-IDF. Long story short,
the relevancy results in Postgres are not as good as something like Elasticsearch. This is due to how they've built Postgres: they don't keep document statistics across the entire database, and that is required to have the highest level of relevancy ranking. So that's one. The second one is that Postgres full-text search is, you know, it's good, but it's not that optimized. So at a certain level of scale, excuse me, you're gonna start to have really poor performance. We typically see that around
400 gigabytes of data within a Postgres database, query times start to balloon a lot and customers are no longer happy with the basic full-text search. The third reason is Postgres is not a search engine, right? It's a relational database. And so the search capabilities are kind of there to supplement it, but it's far from having the level of expressivity that a real search engine like Elastic has. So if you want a bit of fuzzy searching within your app, you're fine with Postgres full-text search. It's quite good,
and you can get far with it. But if you think you're building Spotify, right, you want to be able to boost certain search results, because you want more weight to be given to the number of listens of a song rather than to how closely it matches, right? So if someone mistypes a song, but that song has 2 billion streams, it's probably more relevant than the properly typed song that has fewer, right? And things like that. People can get really deep with the logic for it. And Postgres doesn't have any of that.
So as soon as you need really, really high-quality search, it's a big deal. And Postgres doesn't do the faceted search I mentioned, these kinds of analytics, which also come hand in hand.
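To make the relevancy point concrete, these are the standard textbook forms of the two scoring functions (generic notation, not Postgres's or ParadeDB's internal implementation). The corpus-wide quantities N, df(t), and avgdl are exactly the cross-table document statistics Philippe says Postgres does not maintain.

```latex
% TF-IDF: term frequency weighted by inverse document frequency
\mathrm{score}_{\mathrm{TFIDF}}(D, Q) = \sum_{t \in Q} \mathrm{tf}(t, D) \cdot \log\frac{N}{\mathrm{df}(t)}

% BM25: the same idea with saturation (k_1) and document-length normalization (b)
\mathrm{score}_{\mathrm{BM25}}(D, Q) = \sum_{t \in Q}
  \mathrm{IDF}(t) \cdot
  \frac{\mathrm{tf}(t, D)\,(k_1 + 1)}
       {\mathrm{tf}(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(\frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5} + 1\right)

% N = number of documents, df(t) = documents containing term t,
% tf(t,D) = occurrences of t in D, |D| = document length, avgdl = average document length,
% with typical defaults k_1 near 1.2 and b near 0.75.
```

Roughly speaking, with b = 0 and a large k_1 the BM25 term weight collapses back to a plain tf times idf weighting, which is the sense in which BM25 generalizes TF-IDF.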
Nitay (16:01.592)
Do you see yourself sort of pushing out the boundary of if and when, and perhaps it is more of an if,
people need a separate search engine at all? Meaning, like you said, you're replacing Elasticsearch. And so does this mean that with ParadeDB, the way you see it, you should never need another search engine product, period, because we will just make Postgres so good at it that you will never need a specialized one? As opposed to, like we said in the analytics case, where you might reach a point where, okay, at this point a Druid or a Pinot or et cetera makes sense. Is that how you see it, or?
Philippe Noël (16:37.652)
I wouldn't think of it that way, because we do deploy separately from your Postgres.
So ParadeDB, the way our architecture deploys, is we get deployed as a replica to your Postgres cluster. So most customers will have a hosted Postgres, let's say on Amazon Web Services, right? And what ParadeDB will do is it will be deployed as an extra instance in that Postgres cluster, and it will be connected to the other instances via Postgres replication. So in some sense you can still think of it as a separate search engine.
But because it's built on Postgres, it can speak the Postgres protocol and the Postgres data model perfectly. And you don't need any ETL or anything like that. But it's still a separate search engine, I would say. So most of our customers still follow the same trajectory, where they start with the native Postgres full-text search. Once they reach a certain scale, they realize that it's no longer working well. And they want the easiest possible solution to go and provide that high-quality search. And ParadeDB, usually, is that answer.
We still are a separate search engine, I would say, a separate analytics engine.
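Mechanically, this "replica without ETL" setup can be pictured as plain Postgres logical replication: a publication on the primary and a subscription on the search instance. The sketch below uses standard Postgres commands; hostnames, credentials, and table names are hypothetical, and ParadeDB's managed deployment would normally wire this up for you.

```python
import psycopg2

# Primary (e.g. RDS) publishes the tables to be searched.
primary = psycopg2.connect("host=primary.example.com dbname=app user=admin password=secret")
primary.autocommit = True
primary.cursor().execute(
    "CREATE PUBLICATION search_pub FOR TABLE products, orders;"
)

# The ParadeDB instance subscribes and stays in sync in near real time.
# CREATE SUBSCRIPTION cannot run inside a transaction block, hence autocommit.
replica = psycopg2.connect("host=paradedb.example.com dbname=app user=admin password=secret")
replica.autocommit = True
replica.cursor().execute("""
    CREATE SUBSCRIPTION search_sub
    CONNECTION 'host=primary.example.com dbname=app user=replicator password=secret'
    PUBLICATION search_pub;
""")
```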
Nitay (17:44.92)
So it's more than just a plugin on top of your existing Postgres; it's more a separate instance with ParadeDB. Do you find, to that point, have you found any inherent
architectural limitations or ramifications of building on top of Postgres? Is there a limit in the future where you're like, actually, we're going to be held back because it's Postgres, or is it enough of a general system? People have famously, I think there's a saying somewhere, that something is a thing in the data world if there's a Postgres extension for it, right? Because these days there's a Postgres extension for literally everything.
Nitay (18:19.468)
And so do you find that it's flexible enough that you don't see any point at which you'll have to re-architect or build more systems on top of?
Philippe Noël (18:30.835)
Yeah, a lot of people ask me that. There was a study that was done regarding vector databases. I forget by which university, but they were looking at whether it's possible for Postgres as a vector database to be equally as performant as a dedicated one. And they found that there was no inherent limitation; eventually it would be possible for Postgres to be equally capable as a dedicated system. I think the same is true for us, to be honest, with a caveat. Postgres has really strong consistency with
MVCC, which allows you to have really strong, multiple concurrent access to the database. That adds some overhead, right? Those ACID guarantees, checking visibility of transactions. And that's something that Postgres customers are really, really happy with and one of our advantages, but it is a bit of a limitation. It's a very, very small amount of overhead, so you still have really strong performance. But I do think we will never become a Snowflake, basically, right? If you live in a world where
concurrent transactional access doesn't matter, it wouldn't make sense. But for us, performance-wise, ParadeDB is five times faster than Elastic on a single node. So performance has definitely not been limited by Postgres.
Nitay (19:44.058)
And one more question on that thread, which is: in the limit, you know, if we zoom forward 10, 20 years, assume Postgres gets BM25 built in natively, like you said, and it gets a better inverted index and so forth. Do you see Postgres just more and more taking over and becoming essentially the multimodal DB? I ask because historically, although Postgres extensions get built left and right,
from a big picture, like who wins, if we measure by essentially companies building off of it and so on, it hasn't been that successful necessarily. In fact, most people have tried to do multimodal DBs in different factors or different forms, like ArangoDB and others, right? There have been very few that are like, we're gonna make Postgres a multimodal DB, we're gonna put in this GIS thing and this vector thing and this search thing and all these other plugins.
Philippe Noël (20:26.093)
Yeah.
Nitay (20:36.778)
Why has that not taken off? Is it a technical limitation? Do you think in the future that changes, or how do you see the data ecosystem playing out on top of Postgres, or not?
Philippe Noël (20:45.522)
Yeah. Yeah, there's a few aspects to this. One is, I do think there have been some technical limitations that have been lifted over the last few years. And I think they continue to be lifted. Like, Postgres is increasingly more extensible.
And I think that shows with how many people are building on it now compared to in the past. But I'm also not a big believer in a tightly coupled multimodal database. And maybe this is a hot take, but what I'm a big believer in is something similar to what we're doing, where you have multiple Postgres instances that are dedicated for specific workloads and communicate together with Postgres logical replication. The benefit you get is because the data model is Postgres underneath the hood.
You don't need any of those transformations. You don't need that complex serialization that you need to send data between different systems. But each database is dedicated to its own service. So you have independence of scaling, you have different query patterns, you might pick hardware differently, you might tune it differently, things like that. So that's more what I see the multimodal world becoming, where you would have, let's say, your RDS for your transactions. You might have a ParadeDB instance attached to it for
search and analytics, you may have a different one for materialized views, right? Or whatever it may be. You might have a different one for ML workloads, or, I don't know, something like that.
Kostas (22:05.996)
Philippe, tell us a little bit more about what it means to build on top of Postgres. There is obviously a very rich ecosystem of plugins out there. I think most people have probably heard about this whole thing of embeddings inside Postgres.
What are the different ways that someone can build and use Postgres as, let's say, the foundation for doing that? Because I have a feeling there might be more than one way to do that. So we'd love to hear from an expert on how you do it.
Philippe Noël (22:50.079)
Yeah, there are multiple things you can do. You can define custom indexes in Postgres, which is what we do. You can define a custom storage layout for Postgres tables as well, which is less common, but is also something that we've done and something a few other companies do. You can define custom operators. You can even define custom scanning functions in Postgres. So that's how we're built, for example: you get to define how Postgres scans tables,
and in our case, we basically feed the data from those scans into our own inverted index that we bundle in. So you can really extend it in crazy ways. Those, I would say, are the four main ways to do really hardcore plumbing into Postgres and really augment it for specific workloads. But people do even lighter modifications. Amazon has a plugin
framework where, if you create a purely in-memory plugin, it can be hosted on any Amazon instance without any approval from them, for example, but it has to be purely in-memory. So there's a company called Baffle that you may or may not have heard of. They do encryption at rest within databases, and they do it purely in memory via that plugin. And they have a whole company built on top of this. I think they're the only
successful business that I know of that's working purely within the confines of the in-memory plugin world, but that's even something that you can do.
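Of the extension points Philippe lists, the index access methods, table access methods, and custom scan providers generally require compiled extension code (C, or Rust via pgrx), but the "custom operators" one can be sketched in plain SQL. A minimal, hypothetical example (a case-insensitive "contains" operator), just to show the shape of the mechanism; it is unrelated to how ParadeDB's own operators are implemented internally.

```python
import psycopg2

conn = psycopg2.connect("host=localhost dbname=app user=app password=secret")
conn.autocommit = True
cur = conn.cursor()

# A helper function the operator will dispatch to.
cur.execute("""
    CREATE OR REPLACE FUNCTION text_icontains(haystack text, needle text)
    RETURNS boolean
    LANGUAGE sql IMMUTABLE
    AS $$ SELECT $1 ILIKE '%' || $2 || '%' $$;
""")

# Register a new operator symbol that Postgres resolves to that function
# (run once; re-running would fail because the operator already exists).
cur.execute("""
    CREATE OPERATOR ~~~* (
        LEFTARG  = text,
        RIGHTARG = text,
        FUNCTION = text_icontains
    );
""")

# The custom operator is now usable in ordinary queries.
cur.execute("SELECT 'Running Shoes' ~~~* 'shoe';")
print(cur.fetchone()[0])  # True
```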
Kostas (24:17.9)
All right. So going back to choosing Postgres to do that, right? Why would you do it like this instead of following the Elasticsearch approach? And the reason I'm asking is because these kinds of decisions have pros and cons, right? Obviously, there are the pros that you're talking about: if you successfully do it,
the user is going to have a better user experience. No ETL, you don't have to worry about state management between different systems, which is pretty hard, and all these things. But I'm sure there are also constraints, right? You have to operate within whatever limitations Postgres itself enforces on you, right? So tell us a little bit about the pros and the cons, and let's focus
actually a little bit more on the cons, because there are definitely those there too, and we'd love to hear about that. What makes your life harder, because at the end of the day you chose to go and build within Postgres?
Philippe Noël (25:32.314)
Yeah, so there's really only one main con, I would say, one really serious con, which is it's only really good if you're a Postgres company, right? So
the whole no-ETL thing that I described assumes you ingest data from Postgres. If you're ingesting data from MySQL or MongoDB, of course there is ETL. And the data model is different. For example, if you're a NoSQL shop, then there are a lot of arguments for why you might want a NoSQL search engine as well: you're used to dealing with JSON, everything is built around that. Postgres has JSON support.
Pretty much every single one of our large customers also stores JSON in ParadeDB and also does search over JSON, but usually it's not a primary workload for them. So I would say we're kind of boxing ourselves in to the Postgres market, and we're doing this pretty intentionally. The reason Elastic has the limitations I describe is because, while it works well for everything, it also doesn't work phenomenally for anything, because it's designed to be very agnostic. And the way we are trying
to improve or to innovate in the space is to say, hey, you know, we're not necessarily smarter or harder working or whatever, we're going to be more focused, right? And we're going to say, okay, Postgres is the database that we really love. We think it's the second most popular in the world today, it's the fastest growing, and it's probably going to pass MySQL soon. And so we feel confident restricting ourselves to that market. So if you're a Postgres company, then we are hoping ParadeDB is going to be the single best product for you in that case.
If you're not a Postgres company, we're probably not the best product, right? Or at least we'll be on a level playing field with some other products.
Kostas (27:12.921)
Yeah, that makes sense. And so, in the world of streaming data, like Kafka, for example, we see a lot of conversation about supporting the Kafka API. The way that people approach that is, because you have an existing, let's say, market, actually, I wouldn't even call it a market, I would say it's more about people having
invested a lot, both in their own skills, but also in the systems and infrastructure built around Kafka. So at the end of the day, if you want to improve on it, you can't just go out there and say, rip this thing out and use something completely different. Actually, it's better to have some kind of compatibility. In their case, it's more about the API itself; the rest is probably completely different, there's nothing common with Kafka there. I can see
some kind of similarity with Postgres too, but I have a feeling that things are a little bit more complicated there, because a database is a much more complex system in a way. So when we say we are, let's say, built on top of Postgres, or we are Postgres compatible, where are, let's say, the boundaries of the compatibility there? Like how much
Kostas (28:39.148)
can you push, let's say, these boundaries and become more of a different system, while still being something that looks like Postgres from the outside, right?
Philippe Noël (28:55.423)
Yeah, yeah. So that's a very important question. Most people that build on Postgres, or say they build on Postgres, do Postgres wire-protocol compatibility. So people will build entirely different data models, but they'll make sure that the network protocol speaks Postgres so that they can ingest data from Postgres. My take on this, actually, is that this is not sufficient. You have a lot of these tools, like YugabyteDB and Cockroach and so on, that are databases that are Postgres compatible, but they're not Postgres:
the types are not implemented the same, and then there's issues there, and then blah, blah, blah, and the extensions are not supported. And before you know it, it doesn't work. I think the thing with Postgres is it's a really, really, really good database. And people that are happy with it, they're happy with it because of everything that it has. Postgres is really minimal in a lot of senses, but the things that it does, it does them really well. And so for us, before, for example, we tried to cut corners. We tried to put
analytics libraries like DuckDB and DataFusion inside Postgres. And we quickly found out that we lost some of that core Postgres that customers love, and had to backtrack and just bite the bullet and build the same optimizations natively within the Postgres query engine. So that's the way we're thinking of it here. But from that perspective, with the APIs that get exposed for extending it, you can join the two. You can build something that is 100% true Postgres,
but that has those optimizations built in. You just can't cut corners when you're doing it. And I think that's quite important. As far as API compatibility with Elastic and tools like this, that's something that would be great to have. We don't have it. We also support JSON (we've added that recently so that customers could more easily migrate), but the APIs are still different. Today, we work with companies that have really deep pain points, so they're actually willing to rewrite their queries entirely. And usually, the canonical architecture is
moving data from Postgres to Elastic. So that whole part they don't need to redo, right? Because we're already in Postgres. The only part of the migration they need to redo is the query rewriting. But I do think in the future it will be possible to have sort of a query conversion tool, at least to get you close to having a full set of queries working; maybe not everything, but most of them. But for now, it's too early for us to invest time in this.
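To illustrate what "rewriting the queries" means in practice, here is the kind of translation a migrating team would do. The Elasticsearch request body is standard query DSL; the SQL version is a hedged sketch using the @@@ operator and a paradedb.score() helper modeled on ParadeDB's documentation, so treat the exact ParadeDB syntax as illustrative.

```python
# Before: an Elasticsearch query (sent to POST /products/_search).
es_query = {
    "query": {"match": {"description": "running shoes"}},
    "size": 10,
}

# After: the same intent expressed as SQL against the ParadeDB replica.
# The data is already in Postgres, so only the query layer changes.
sql_query = """
    SELECT id, description, paradedb.score(id) AS score
    FROM products
    WHERE description @@@ %(q)s
    ORDER BY score DESC
    LIMIT 10;
"""
params = {"q": "running shoes"}
```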
Nitay (31:15.022)
Can you give an example, maybe, of where, because you mentioned before that you had done some explorations previously that might have drifted you away from Postgres, and you decided, okay, that's the wrong path. Where and why specifically does that matter, given what you said here: if you're wire compatible, and if the customer is already doing some query translations anyway, then at what layer do they care that you're a hundred percent Postgres, given that you're deployed on the side?
Where is the interface that matters to the user, where they're going to come and be like, wait a minute, no, this is not what I thought I was getting?
Philippe Noël (31:49.739)
Yeah, so there's a few things. One is, all of the tools that they use that speak Postgres, they expect them to work. And as soon as you have different data types, right, you need to do all the conversion internally, and there's edge cases that show up, right? Like, you've got to map the data. So for example,
when we used DataFusion, which is a Rust analytical library: DataFusion implements the Arrow types, and Postgres implements the Postgres types. Multiple Postgres types can map to the same Arrow type. So Postgres has text and varchar, and both of them would be UTF-8 strings in Arrow. So you lose precision, and that starts to cause problems in people's applications, for example, which is a really big problem for them. Another one is in what makes Postgres so good, which is data integrity.
So a lot of our initial customers are fintechs. Fintechs get really attracted to the idea of an MVCC-compliant search engine, which has never existed before. ParadeDB is building something that has really strong data integrity. If you bundle another query engine inside of it, now the whole way Postgres does the ACID guarantees and the transaction visibility checking is not communicated to that query engine. And if you want to pass that in, it's a whole can of worms
of edge cases and things like that. Postgres is so good because it's been built for 30 years, right? They've spent a lot of time making it right. And then you start to have issues around this. Like, we didn't support crash recovery, which was a really big problem for people. And then, by the way, they want to integrate it within their high-availability system, so you need to write data to the write-ahead log properly. All of this comes built in if you just build with Postgres. Otherwise you just end up re-implementing all of it,
and that's just really, really difficult, and it's sort of a never-ending chase.
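The text/varchar example can be made concrete with pyarrow. This is an illustrative mapping of how an Arrow-based engine would typically see a handful of Postgres column types; the exact behavior depends on the connector, and the rows here are assumptions, not a spec of DataFusion or ParadeDB.

```python
import pyarrow as pa

# Several distinct Postgres types collapse onto a single Arrow logical type,
# so the original declaration cannot be recovered on the way back.
postgres_to_arrow = {
    "text":          pa.string(),
    "varchar(255)":  pa.string(),            # length constraint is lost
    "char(2)":       pa.string(),            # fixed width / padding semantics are lost
    "numeric(12,2)": pa.decimal128(12, 2),   # unbounded numeric would not fit at all
    "timestamptz":   pa.timestamp("us", tz="UTC"),
}

for pg_type, arrow_type in postgres_to_arrow.items():
    print(f"{pg_type:>14} -> {arrow_type}")
```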
Nitay (33:38.798)
And tell us a bit about the deployment model. It sounds like a lot of what you're seeing, correct me if I'm wrong, is the kind of BYOC model of, like, I have my cluster, you come and deploy ParadeDB in my cluster, but you have the control plane. What are you seeing out there, and what's the paradigm?
Philippe Noël (33:54.616)
Yeah, exactly. So we deploy as a replica to your existing Postgres cluster, wherever it may be. We're supported on Amazon and GCP now, so Google Cloud SQL, and RDS and Aurora on the AWS side. We basically deploy ParadeDB with a new VPC, customers connect it to their RDS via replication, and then it stays in sync in real time. For some of our customers, we have
metrics reported to us so that we can manage the cluster remotely. Or, for the ones that want to do it themselves, we have runbooks that basically come with the cluster and tell them what to do based on what alerts they get. And yeah, that's been a very successful model, to be honest. I'm a really big believer in that model. We're still figuring things out, and it's a lot of work to build this. We first took inspiration from Alex and Redpanda.
They have built a really popular Bring Your Own Cloud offering. It's their biggest revenue driver. And yeah, we're trying to follow a similar model.
Nitay (34:53.39)
Yeah, it does seem like kind of a modern trend a lot of folks are doing, and what I'm seeing across the board is this
kind of next wave between on-prem and private cloud and so on: I own the data, I own all the underlying networking and so on, so that I know it doesn't leave my space and I have all the full governance and control, but don't give me the headache of managing the actual system. And so it's this kind of win-win.
Philippe Noël (35:21.829)
Yeah.
Yeah, I mean, ironically, I think it's easier to build in some ways. It's still very hard to build, but when you're a multi-tenant application, there are so many concerns that you have, right, isolating data and things like that, versus when we work with customers in a bring-your-own-cloud model, things are very well scoped. Obviously, the management of it is a bit harder. It's a bit tricky. So that part is harder to build, for sure. But yeah, I think, at least for us, customers that want bring-your-own-cloud pay bigger amounts of money and their requirements are more scoped.
So it's been pretty obvious to focus on them. And also from the financial standpoint, it's easier for customers, right? All these big companies have negotiated spend with Amazon, with discounts. I would need that; you know, I don't have that yet, maybe one day I will. And I would also need to charge a markup on the hardware. They don't need to do that, because they pay Amazon directly and they pay me directly. So the financials also make a lot more sense. And I'm a big believer in this becoming a very popular model over the next 10 years.
Nitay (36:22.914)
Yeah, there's clever stuff you can do. I've seen a lot of data companies do this: once you distribute your product via the AWS Marketplace and such, you can actually get the customers to use their credits even when using your product, and pass that off.
Philippe Noël (36:33.215)
Yeah.
Nitay (36:34.198)
I'm interested, since you mentioned financials: do you see it hitting your bottom line, as you think about it? Specifically, what I'm talking about is, right, the pro of doing multi-tenancy is oftentimes you can get a lot of savings, right? You reduce a lot of wastage, improve utilization, and you can get better margins. All these kinds of things are why historically people do multi-tenancy, right? And so, given that you're doing this kind of bring-your-own-cloud, essentially it's single tenant for every customer. Do you see that affecting the bottom line, like pricing and everything that you have to do
to make up for that?
Philippe Noël (37:05.937)
Well, for us, we just charge a license fee, basically, for the management and the software. The customer pays the bill for the hardware to Amazon directly. So it's up to the customer to size the instance properly for their own workload. That does mean we don't support small customers; they might use the open source version or something like that, but we don't do any of this. I think multi-tenant really makes sense when you have slightly smaller customers that
may not be justifying full instances, or full whatever your unit of compute or unit of storage is. So for now, I don't think so, to be honest. I really don't think so. Even quite the contrary: I think we get to be able to charge more in some ways, because we have all these guarantees that we provide customers. We don't need to double dip into the Amazon margins. So you're just trying to build a better product, and I think
the better product should be able to charge more in the future.
Nitay (38:04.334)
And you're essentially just charging for the control plane. So for you, it's pretty high margin, I imagine, because you don't have to manage any resources on your end.
Philippe Noël (38:10.677)
Yeah, it's like 100% margin. I mean, it's the time of the engineers, which is expensive. But yeah, my control plane is, like, a joke. It's a small amount of hardware compared to the database. Yeah.
Nitay (38:26.382)
Right. Shifting gears slightly: how much customization do you find the customers end up requesting? In particular, I'm thinking, I've done some work in the past with search engines. One of the things with search engines, famously, is there are so many different places along the pipeline where some garbage can come in and affect the result, right? Some bad documents, some bad processing, et cetera.
But the good news, good and bad, is that a good ranking function can fix everything. And so, so much of a search engine ends up coming down to the ranking function that's used. I imagine at some level, customers want to do a lot of optimization or customization, to bring their own ranking function, not just use BM25, as good as it is, or whatever. Where is that? What are you seeing there?
Philippe Noël (38:50.265)
Yeah.
Yeah, this is a very topical question, because we have three customers right now that are all asking to bring their own scoring functions. One call I was on, like two hours before this, was asking for the same. Today we do not make that possible. We have our own query builder where you can essentially create your own ranking function by boosting certain fields, or choosing how tie-breaks are handled, and things like that. So we do have that available at the API level,
but we don't give customers the ability to go into the internals of the search engine and provide a specific ranking function. The reason we haven't done that yet is because we're worried users will just shoot themselves in the foot, basically, by messing with the internals too much. And we haven't yet gotten a request where we thought it wasn't possible to handle with the existing APIs that we expose. But I don't know. I'm starting to doubt whether that's true as we get more people asking about this. So,
to answer your first question, we have customers making a lot of customizations. They do it all at the API level. Maybe we'll expose it to them on the internals side as well.
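A sketch of what "customizing ranking at the API level" can look like, using the earlier Spotify example of blending text relevance with popularity. The @@@ operator and paradedb.score() are borrowed from ParadeDB's examples, while the blending formula and the songs/play_count schema are purely hypothetical; ParadeDB's actual query-builder API may differ.

```python
import psycopg2

conn = psycopg2.connect("host=localhost dbname=app user=app password=secret")
cur = conn.cursor()

# Blend BM25-style relevance with a popularity signal so that a very popular,
# slightly mistyped song can still outrank an exact but obscure match.
cur.execute("""
    SELECT title,
           paradedb.score(id) * ln(1 + play_count) AS adjusted_score
    FROM songs
    WHERE title @@@ %s
    ORDER BY adjusted_score DESC
    LIMIT 10;
""", ("shape of yuo",))
for title, score in cur.fetchall():
    print(f"{score:8.3f}  {title}")
```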
Kostas (40:22.22)
I'd like to ask a little bit of a different question and go back to the Postgres ecosystem. We've seen a lot happening there lately, with many different plugins appearing. You mentioned DuckDB, for example. We've heard a lot lately about creating, let's say, an OLAP system
behind Postgres using DuckDB for that. But you said you tried it and you don't think that is the right way to do it. Tell us a little bit more about that. I mean, certainly the plugin system of Postgres is very powerful. I think what the community is building out there is a testament to that.
But usually with a lot of power comes a lot of responsibility too. It's very easy to just create garbage out there. By the way, I'm not trying to say that that stuff is garbage, but you've tried it and you have opinions and we'd love to hear these opinions.
Philippe Noël (41:33.633)
I mean, I will say, a large number of Postgres extensions are very limited, are done a bit hastily, and are not products, to be honest. So there are hundreds of Postgres extensions, but really good ones that people use in production at mature companies, there are maybe like 10, if not fewer. So I do think you're right that
with great power you also have the responsibility that comes with it. And a lot of people kind of cut corners: they put some proofs of concept together, they think it's fun, but they don't ever make it into an actual product. Now, because you can do so much in the plugins and extensions world, you get to do crazy things, right? Like, we put DuckDB inside Postgres in like three weeks, right? And obviously then we ran some queries, and it's a hundred times faster than Postgres.
So suddenly we're like, whoa, you know, great, so useful to people, who wouldn't want this? But once you start talking to really big companies, they tell you, does it support, you know, point-in-time recovery? Well, no, it doesn't, right? And by the way, does it do this and this and this? And suddenly you realize there's a lot missing, right? So that's why I stand by what I said before. I think you can build some exciting stuff with DuckDB and Postgres. In fact, we still have a product that has DuckDB in Postgres, and we use it to read
from object storage into ParadeDB. So customers might have data in S3 that they want to bundle into their search, and DuckDB in Postgres is sort of a no-ETL way to read data from object storage. But for actual database operations, it's just a different paradigm. It's a different data model, a different set of trade-offs. DuckDB is not meant to have the data integrity level of Postgres. And so we've kind of learned you just have to sit down and do the work the hard way. But few people do it,
because it's a lot of work, and also there are very few people that know Postgres well enough to do this. Even us, to be honest, we didn't until someone with a lot of experience joined our team. Before that, we tried our best, but with Postgres, you need to really know how it works to do it the right way.
Kostas (43:45.322)
Yeah, 100%. You mentioned there are a bunch of really good extensions out there, and outside of ParadeDB, which obviously is one of them, what other extensions would you recommend someone take a look at if they were interested in seeing good-quality work around building an extension?
Philippe Noël (44:06.441)
I think the best Postgres extension ever implemented is called Citus. If you've heard of Citus, it's distributed Postgres. The team behind it is some of the most competent people in Postgres. They've built it really, really, really well. It ended up being bought by Microsoft for something like $200 million. So that's one; I think it's the best one. Another very popular one is called Timescale.
Kostas (44:12.652)
Yeah.
Philippe Noël (44:33.377)
And then there's us; obviously we like to think we're doing good work on it. You can take a look. And there are a few others, but those two, I would say, are the two biggest ones that exist today.
Kostas (44:44.268)
What's your take on this new breed of, I would say, cloud infrastructure companies that offer Postgres as a serverless service? I think the latest one is probably Prisma, which, by the way, I found very interesting. I always wondered, how do you monetize an ORM? Now I know:
sell database as a service at the end of the day, and use the ORM for distribution and awareness. But what's your take on that? We have seen a lot of them. There's Neon, obviously, there's Supabase. And my take, and again, I'd love to hear also from Nitay on that, is that
all these people, at the end of the day, what they're trying to do is go and take a piece of RDS, because that's where, let's say, people spend a lot of money on Postgres databases out there. And they invest a lot in their relationships with these extensions, because obviously these extensions help boost awareness even more and create more hobby projects, more people using them and all that stuff, and that creates more of
the go-to-market side of things. So what's your take on that? How do you see these companies, and also how do you see them relating to the success of ParadeDB?
Philippe Noël (46:23.957)
Yeah, so I know the folks from Prisma very well. They're adding support for ParadeDB, which is exciting. I mean, the more people run Postgres, the better it is for us, right? So I'm very supportive of all of these folks and the work that they're doing. You know, at the end of the day, it increases the market size for ParadeDB. And we do integrate with a few of them so that ParadeDB is natively available within their database.
I was mentioning our deployment model before: typically we deploy as a replica. This is what we recommend for any sizeable deployment, but for sort of smaller, hobbyist projects, it's obviously much easier if it's directly bundled in. So for those folks, we're working to integrate directly within them. That's another benefit of building as a plugin: that is possible. Both deployment models are possible. I do think it's hard to take a piece of RDS.
It's very, very hard to get people to move their OLTP workloads, and RDS is very simple. It's vanilla, the most vanilla Postgres you'll find, but it's also incredibly reliable, right? So, you know, it gets the job done. So I think companies like Supabase, Neon, and so on all try to tackle newer workloads, from people that will just get established with them because they have benefits over RDS. And, you know, if you're starting somewhere, you might as well start there. I know
Supabase and Neon, especially Supabase, are doing really, really well at this, and I expect them to continue to do well. I don't know how much more space there is left in the market for more people to come and do that, though.
Kostas (48:00.798)
Yeah, so again, a question to both of you guys: why would someone today go and use something like Supabase, right, instead of using RDS to run their project?
Nitay (48:17.962)
I can take a first stab at it, which is that I think, Philippe, you pointed out a few key points here. So one is, I think historically you can almost bucket companies into ones that manage something for you and ones that don't. And even the ones that don't still add value, right? Think tools companies, right? Processes, all these different things. It's not like there's no value there, of course there is, but I would venture a guess that
if you pick anything from bucket B, of managing versus not, those companies are orders of magnitude more successful, or more valuable, or have driven more revenue, and so on. Because ultimately, managing something for a customer, especially if you're managing their data, is just a higher-value pitch, a higher-value thing that you can be doing.
And so I'm not surprised to see a company like Prisma go from an ORM, which is a great tool and a great thing that developers love to use (but good luck charging big dollars for it), to, hey, we'll manage your database for you. Not surprised at all. To your latter question, Kostas, I think Philippe makes a great point there, which is that it is very, very hard, in my opinion and experience, to make a new data company, a database company specifically,
that just goes and says, I'm going to take down X, and I'm just going to go head to head; there's this incumbent in the space, they're good, but we're going to do even better, and we're just going to win on that. It's possible, but very hard. I think most database companies that are breakaway successes are such because either A, they latch onto some new use case or some evolving thing, or B, there's some big infrastructure change, or C, there's some sort of even secular change, if you will.
And in each of those there is some sort of tidal shift in the market. So, like we mentioned a tiny bit here, vector databases. Whether you like it or don't like it, and what happens in the future we can discuss regarding vector databases, but undoubtedly, vector databases were not a thing 20 years ago. They're a thing now because everybody has these RAG et cetera use cases. And so the other thing that's a big shift here, in things like the serverless databases, is that people just
Nitay (50:29.378)
don't want to manage anything, period. Serverless is kind of a lie; there's nothing that's truly serverless, it's just not your server. It's somebody else's server. All it really means is they're managing the server and you're not having to think about compute and CPU hours and memory and all this stuff. You're taking away all those knobs from the user. And so that is a...
Infrastructure change, just like going from managing data centers to the cloud has been an infrastructure shift. Serverless is kind of the latest in that shift of just having the customer be further and further away from having to think about any of this stuff at all. And I do think that the general use case does point to those things because it's just easier to get up and running. Don't even make me think about these things. Right? Like Snowflake famously from a data warehouse perspective.
was kind of the first one to do this, where they hid away all of the underlying knobs from you. And even Databricks is moving in this direction now. I'm sorry, actually, I said Snowflake; I meant BigQuery. BigQuery, even more so, was the first one that just said, give us your query, we'll make it run. It doesn't matter what happens underneath the hood, if it takes one minute, if it takes 10 hours, if it takes 10,000 machines, we'll charge you for whatever was processed, don't worry about it. And then you saw Snowflake kind of move in this direction. Yes, you still choose the warehouse size, but that's it.
And Databricks is moving more in this direction. So I think for companies that are coming up now, like new startups, and even new products or explorations within existing companies, I think a lot of them will trend towards: just give me the serverless thing, don't make me worry about it, and so on. And then the question to me becomes, how far do those go? Are those really able to grow and scale with me, or is there a point at which I have to switch that serverless thing for something else?
The argument that they would make is that, well, we can grow with you, we can scale indefinitely, and it just becomes a cost thing. And at that point, you're so sunk in that it's not worth the migration, right? Similar to how there have been many papers and research showing that if you get off the cloud and manage your own resources, there are huge savings to be had. But good luck convincing people, right? Like, how many companies can you name that actually did that? Yes, there's a couple, but it's a very small percentage. What do you think, Philippe?
Philippe Noël (52:39.969)
No, I mean, I agree. I agree with everything you said.
Kostas (52:47.968)
Yeah, makes sense. I think there's another dimension to that, especially for data systems. And I think that relates a little bit also to the bring-your-own-cloud that Philippe was talking about. And Prisma is a very good example of that. So for me, serverless is, okay, it is the user experience, right? As you said, Nitay, people
don't want to care about anything outside of the API they use to interact with the service, right? Anything that has to do with the infrastructure, please take care of it. I don't want to reason in terms of CPUs, clusters, memory, whatever, right? Which is great. It's the SaaS experience at the end of the day, right? You don't go and use Salesforce and have to think about how to size the thing for your team, right? No sales team would ever
be able to do that, right? But I think that when it comes to this type of infrastructure, traditionally, the problem was isolation. In order to deliver this, you need to build infrastructure that has to be multi-tenant at the end of the day. It's really hard to make the economics of this work without the multi-tenancy there.
And the reason I'm mentioning Prisma is because Prisma has built their infrastructure there based on unikernels. And unikernels, which are a newer technology, again, tied to what you were saying about seeing shifts and new things coming in, provide this kind of very strong isolation of your processes, pushed as far down to the hardware itself as possible, while also enabling multi-tenancy at the end of the day, right?
And that changes the game completely, because the financial parameters change a lot. And in my opinion, you can't have serverless without also a serverless pricing model; otherwise that's not serverless, it's something else. So that's what kind of excites me. What I always try to understand when someone says, hey, we have a serverless offering here, is: is this true serverless at the end of the day?
Kostas (55:14.858)
If we think about Lambda functions on AWS, we wouldn't have Lambda if we didn't have Firecracker first. And Firecracker is what brought this isolation into place, so that people can now go and trust this thing and run on it. And I think unikernels are the next step to go and do this. So it's very exciting technology. I don't know how much of this we are going to see in databases, and how fast, but
if you think about databases traditionally, it's all about how to bypass the operating system, right? So it kind of feels like unikernels are the right fit there. We'll see. But I think a lot also has to do with that, and that's what I'm very excited about and trying to see how it's going to evolve in the future. And I'd love to hear from you, Philippe, based on
this and your experience with bring your own cloud and the customers that you see over there, right: how do you see these two paradigms either competing or converging in the end? And I'm asking you as a founder, someone who's building a business at the end of the day, right? Because it is important for you; you need to build your margins while having a vector there to keep growing inside the account, right?
Philippe Noël (56:40.693)
Yeah, yeah. mean, right now, so far, bring your own cloud. We charge based on data volume, like on the size of the cluster. So I do think we still have ways to grow within the account, even if we don't manage the underlying hardware directly. It's unclear, to be honest. Like, we're still figuring things out. I'm a big believer that you should just give the customers exactly what they want. And if they're very happy, they'll come back to you to purchase more things as you go.
And that's why I think all the really large companies, become these kind of conglomerates selling a bit of everything related together because people are just very happy buying X from them that they also want to buy Y. So we have a lot of ideas for things we want to do in Postgres. was hinting at observability in the beginning. There's a lot of other things we could move on into as we build a product. I see us growing into the bring your own cloud type accounts more in that way rather than in just increasing.
And inherently increasing the compute behind the scenes by just servicing more workloads. That being said, we do have other partnerships in place with managed service providers like Prisma and so on to host and share revenue on PrairieDB with us. In that way, it's to be sort of like differentiated offering. So I do think we'll have some sort of managed service, but I think in our case, it makes more sense for that to be through existing providers rather than something entirely from scratch and just divide the market even further.
So that's kind of what we're thinking of doing there. I'm very excited for Prisma's offering to come out. I think theirs is quite unique, and I think forking Postgres is a big endeavor and it's better if you can avoid forking it. So I'm excited to see what they do without forking it.
Kostas (58:22.068)
Amazing. All right, we are getting closer to the end here. So one last question from me before I hand the microphone to Nitay for his last question. What do you see as the future of search? A lot of the stuff we talked about today was also there 15 years ago when Elasticsearch started, right? Sure, the algorithms have changed.
We have better precision and all these things. But where do you see search itself, as a function, going? And what excites you, what can't you wait to see as part of ParadeDB?
Philippe Noël (59:04.501)
Yeah, that's a good question. I mean, I think it's just becoming a lot more prevalent, right? More and more applications are much more data heavy than they used to be. Search is the main way people interact with data heavy applications, usually. It's such a big interface to every piece of software, honestly. Like, people search for apps on macOS to find which app to launch, or they search on e-commerce websites and Google and whatever. So I do think it's going to continue to increase.
And I think it's exciting to see the whole wave happening with AI now that's bringing more people into search. Ironically, I think this is bringing more people into traditional keyword search than they realize. A lot of customers come to us and say, hey, I'm doing vector search, can I buy your product to make my search better? Whereas traditionally people would think you'd be doing keyword search first and then layer vector search on top of it. But it's sort of opening up a new entry point for people doing search. So I'm very excited for that to continue to grow.
As far as what I'm looking forward to bringing to ParadeDB, that's a good question. I think our search is turning out to be pretty decent. I'm very excited for the world where we go distributed. I think that'd be very exciting. Fortunately, because of the way we're designed, we don't need to do it for a long time. We can serve over 100 terabyte clusters without needing a distributed infrastructure. So we have no plans to do it anytime soon. But when we do, it's going to be a cool piece of tech.
Nitay (01:00:31.308)
I like the point you made there, because there are so many people that have this hot take of, like, OpenAI and ChatGPT are going to kill Google search, we're not going to need it anymore. And the reality is the exact opposite. I think people don't realize that Google searches, ever since ChatGPT and so forth came out, have actually gone through the roof. I was talking with a friend of mine who was in search at Google.
Philippe Noël (01:00:42.858)
Yeah.
Nitay (01:00:52.59)
And he's like, we've never seen so much load, for two reasons. One, people are searching more. But two, in particular, people are searching more and writing more words in the search, right? Because Google search essentially trained people to think in keyword-ese, right? And to just speak in little tidbits of a sentence.
Philippe Noël (01:01:04.385)
Okay.
Nitay (01:01:09.568)
Now with ChatGPT, ChatGPT is training people to actually talk like humans and just write out an English sentence as a prompt. And people are typing that prompt into Google. And guess what? It does a better job. Regardless, I'm not even talking about Gemini. I'm talking about just the core search doing a better job when you give it more information. And so there's this magical thing where, no, actually, people are searching a lot more. So I couldn't agree more with your point there. Last question from me.
Philippe Noël (01:01:21.824)
Yeah.
Nitay (01:01:34.242)
What about on the community side? If you could have any one wish or ask or desire from the larger Postgres community and so forth, what would you want the community to be doing more of? Where should that community be investing?
Philippe Noël (01:01:50.987)
That's a good question. That's a good question. I mean, I think there are still so many things that need to be better in Postgres. I wish Postgres had better support around incremental view materialization, for example. But I don't know if that's a community project. Incremental view maintenance, excuse me.
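[Editor's note: a minimal sketch of the pain point Philippe is referring to, under stated assumptions. Stock Postgres has no incremental view maintenance, so a materialized view can only be refreshed wholesale, even when a single underlying row changed. The connection string, table, and view names below are hypothetical and for illustration only.]

```python
# Hedged sketch: without incremental view maintenance, Postgres recomputes a
# materialized view in full on every refresh. DSN and names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=appdb user=app")  # hypothetical connection
conn.autocommit = True  # REFRESH ... CONCURRENTLY cannot run in a transaction block
cur = conn.cursor()

# One-time setup: an aggregate kept as a materialized view, plus the unique
# index that REFRESH MATERIALIZED VIEW CONCURRENTLY requires.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS order_totals AS
    SELECT customer_id, sum(amount) AS total
    FROM orders
    GROUP BY customer_id
""")
cur.execute("""
    CREATE UNIQUE INDEX IF NOT EXISTS order_totals_customer_idx
    ON order_totals (customer_id)
""")

# After new orders arrive, the whole view is recomputed. CONCURRENTLY keeps
# readers unblocked but still pays the full recomputation cost; true
# incremental maintenance would update only the affected customer rows.
cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals")
```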
I think the main thing the Postgres community needs to do, as a community, is squash the bias people have that Postgres isn't as scalable as other databases. This is a very common question that I get: but it's on Postgres, is it going to scale to my workload? And then their workload is not even that big. People are like, I have two terabytes, is it going to work? Yes, definitely, you don't even need to worry about it. Even people who have 10 terabytes, it works very well. People have hundreds of terabytes, and it works very well.
Right. And so, with Postgres like 10 years ago, when the Citus founders were starting out, they said their biggest problem was that people thought of Postgres as a toy database. I don't think that's the case anymore, but I still don't think people think of Postgres the way they think of Oracle or MongoDB, right? Perhaps rightfully so, perhaps not, but I think establishing Postgres as sort of the best database there is, also from a scalability perspective, is one of the better things the community can do.
Nitay (01:03:07.896)
Do you think people are, and this is perhaps a biased question, but do you think people are just scared of sharding? Like, as soon as they have to shard, they're like, no, it's not gonna scale, it's not gonna work.
Philippe Noël (01:03:18.303)
Yeah, I think people are,
Nitay (01:03:20.974)
That's kind of been my guess, and what I've found as well, is that most people just think the database falls apart as soon as you shard. Whereas it turns out databases like Postgres are actually very good at it. As long as you know the keys and you know what you're doing, it actually works very well.
Philippe Noël (01:03:35.135)
Yeah, it's very, very good. It's very, very good. And it's very well documented, and, you know, a lot of people have done it. Yeah.
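[Editor's note: a minimal sketch of what "knowing the keys" can look like in practice, under stated assumptions. Declarative hash partitioning spreads rows across partitions by a chosen key; multi-node sharding with tools like Citus or postgres_fdw builds on the same idea. The DSN and table names are hypothetical.]

```python
# Hedged sketch of keying data in Postgres: declarative hash partitioning on
# customer_id. DSN and table names are hypothetical; this is single-node
# partitioning, not a full sharding setup.
import psycopg2

conn = psycopg2.connect("dbname=appdb user=app")  # hypothetical connection
conn.autocommit = True
cur = conn.cursor()

# Parent table, partitioned by hash of the key queries are expected to filter on.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        customer_id bigint      NOT NULL,
        created_at  timestamptz NOT NULL,
        payload     jsonb
    ) PARTITION BY HASH (customer_id)
""")

# Four hash partitions; a query filtering on customer_id is pruned to one of them.
for i in range(4):
    cur.execute(f"""
        CREATE TABLE IF NOT EXISTS events_p{i}
        PARTITION OF events
        FOR VALUES WITH (MODULUS 4, REMAINDER {i})
    """)
```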
Nitay (01:03:43.214)
Right. Cool. Well, I think that wraps us up for now. Thank you very much, Philippe, for joining us. Great conversation about ParadeDB, about Postgres, search, a little bit of analytics, and everything in between. And we look forward to having you on again and continuing the conversation.
Philippe Noël (01:04:00.725)
Yeah, thanks for having me guys.