Episode 23 · From pandas to Arrow: Wes McKinney on the Future of Data Infrastructure

01:22:05

Kostas

Hello everyone, here we are at another episode of Tech on the Rocks, and we have a special guest today, Wes McKinney. Wes, welcome. How are you today? And please give us a quick intro of yourself.

Wes

Sure, thanks for having me. Most people know me as the creator of the Python pandas project, but I've also been involved with a number of other open source projects, like the Apache Arrow project, where I'm a co-creator, and the Ibis project for Python, and I've done a lot of work on the Apache Parquet file format. Generally I've been pretty involved in helping build and support the growth of the Python data science ecosystem. I have a book called Python for Data Analysis that's in its third edition, and we're starting to talk about a fourth edition, which covers pandas and other Python projects. I'm also an entrepreneur and investor: I've started a couple of companies, DataPad way back when, and more recently Voltron Data, doing Arrow and GPU-accelerated data processing. Currently I'm a software architect at Posit, the data science platform company, where I've been involved with the Positron data science IDE. I've worked mostly on the data explorer within it, basically building the data viewer and data explorer I always wanted inside a VS Code fork, Positron. I also run a small venture fund called Compose Ventures, so I've been pretty active in the last five or six years as an angel investor, investing in seed and pre-seed rounds. Essentially I recognized that I have limited impact if I just build things myself, so as the community grew, especially the ecosystem around the Arrow project, investing gave me a way to advise and be useful to other entrepreneurs and open source developers building companies. It's another way to expand my influence, I guess, in the open source ecosystem and beyond, in the world of enterprise data infrastructure. So I'm involved in a lot of stuff. Somehow I manage to keep it all together. My inbox is often a horror show, but I do the best I can, and it's been a pretty interesting ride so far. I'm excited for what the future holds.

Kostas

Wow, I have to say that's impressive. I don't know if the term makes sense, but what came to my mind while you were describing all the things you've been working on is the equivalent of what we call a full stack engineer, but not at the application level, at the systems level, which is a very rare thing. Usually people in systems tend to be very focused on one thing, and for good reason: these things are very complicated and require a lot of attention. But it seems like you go from file formats, stuff like Arrow, all the way up to the IDE that we use to leverage all these things.

Wes

Yeah. I mean, I started out when I started building pandas, which was about 17 years ago, 2008 was the year, so it's been a long time. I started out being really focused on, yes, the implementation and making pandas work, but mostly the user API and the user experience were what really motivated me: wanting to refine and create an intuitive programming workflow that was really accessible and powerful, so that people could think about their data more easily and write the code to manipulate it. I was building tools for myself because I was frustrated doing stuff in Excel, or in programming languages like Java or C, or even R at the time, and R has come a long way in the last seventeen years. So really I was thinking about that human productivity problem. Then as time went on, I naturally had to learn more about the systems side: how to make things fast, scalable, memory efficient. That got into the file formats, because eventually your data tools become bottlenecked on their ability to access data. So not only file formats but also database connectivity and connecting to foreign data sources, and that's led more recently to some of the work happening in ADBC to accelerate database connectivity with Arrow. But it's all been incremental over a long period of time. We work on solving one problem, get to a place where we're satisfied with it, make sure I'm not the only person in the loop who knows how the software works and can maintain it and continue to develop and improve it, and then move on and identify the next problem that can make this whole stack of technology work better together, be easier to use, and be faster, and all those things. Along the way there's a lot of pain and suffering as an open source developer and maintainer. It's still hard. Certainly now, with our new friends Claude Code and Codex and coding agents, my life as an open source developer and maintainer has gotten a lot better, because I can offload a lot of the annoying stuff I used to struggle with and have to ask other people for help on. CI/CD was always a struggle for me, anything DevOps related, sysadmin related. Being able to ask a coding agent to help with all the stuff I don't enjoy doing has been really nice. But open source is still difficult, and there's this sustainability and maintainership problem: yes, we have AI and AI is wonderful, but open source software still has to be developed and maintained, and there's still the question of where the money comes from, how people get paid to do open source. Some of my entrepreneurship and work on the business side has been motivated by trying to create the right kind of synergistic relationship between the open source software and the business side.
And part of the reason I'm at Posit, and why I'm such a big fan of Posit, formerly known as RStudio, as a company, is because they're one of the rare companies that's been able to build a sustainable and successful business while also putting half of their R&amp;D into building open source software for open science, which is really cool. There aren't a lot of companies that are able to do that sustainably over a long period of time.

Kostas

Hundred percent. And I think that's probably a topic of its own that we could spend multiple episodes talking about, and I'd love to hear it. I've felt it too: before the company I'm at now, I've been at two companies that had a go-to-market motion involving open source, Starburst Data with Trino, a very well-known project, and at RudderStack we also have an open source product. I've seen many different ways that companies struggle with keeping things balanced between what we do with the open source and what we do on the business side, and how we can jointly create value. I'd love to hear more about that. But before we get to that, I want to ask you something and go back to your past. You said you started with pandas, and then, inspired by trying to solve the productivity problem, you started going deeper and deeper into the systems, until you reached the file formats. So from pandas, what came next? What was the next thing in this journey where you decided, hey, to keep improving things I need to move into that and look at how things work there?

Wes

Yeah. Well, what came directly after pandas: Python for Data Analysis, the first edition, came out in 2012, and then Chang She and I started a company called DataPad. It was a visual analytics company, something at the intersection of Jupyter notebooks and business intelligence tools, maybe a little bit like what Hex is today, but a much more primitive, earlier version of that, when the web stack and the technology were a lot less mature. One of the things we needed to do was provide fast, interactive, exploratory analytics running in a software-as-a-service environment. All of DataPad's backend was built in Python. Initially it used pandas, but we quickly ran into the problem that pandas could not deliver the kind of speed and interactivity we wanted to create for our users. Of course this was 2013, and DuckDB didn't exist at the time; today, if you were building this product, you would just use DuckDB, problem solved. But this led me to recontemplate the whole design and internals of pandas, because it wasn't designed like a database, essentially. So I gave a talk, at the end of, I think, November 2013 at PyData New York, called "10 Things I Hate About pandas." Essentially the summary was that pandas isn't designed like a database engine and wasn't designed for working with large data sets or parallel processing, and it had a bunch of rough edges that were the result of its relationship with NumPy, which was designed for numerical processing and scientific computing, not analytics. At the time, NumPy didn't have very good support for string data, non-numeric data, or structured data like lists, structs, unions, non-scalar types, basically. So while we were working on DataPad, I built a C library that implemented a miniature query engine. You could think of it as a very primitive version of DuckDB, built to run in a cloud environment with the data stored in S3. There was a file format quite similar to Arrow, call it proto-Arrow, stored in S3, and we would pull those serialized data files into the in-memory query engine, and that was what was powering the DataPad application. I think if you go on YouTube and look at talks from that era, there are still a couple of live demos where you can see the product in action. But this led me to start thinking about the problem: okay, if we want to build really fast data frame libraries, why isn't there an out-of-the-box, standardized, efficient in-memory format for representing tabular data, one that's efficient to process and could be used portably across programming languages? That was the seed of what got me thinking about Arrow back in 2013. As for DataPad: if you were doing anything related to business intelligence in 2013 or 2014 and you weren't Looker, it was really hard. Looker was the winner that took all in that market, so we found it really difficult to continue as a startup, raise money, and carry on through 2014.
And we found an opportune exit to Cloudera at the end of 2014. It was while I was at Cloudera that I started interacting with the Impala team, which was building a distributed data warehouse at Cloudera, and the Kudu team, which was building a distributed columnar storage engine supporting fast analytics and transactions. Kudu is still a really cool project that I think is underappreciated, created by Todd Lipcon, who's now at Google working on Spanner and, I'm sure, doing great things there as well. Being in that environment, and coming out of the DataPad experience of wanting to redesign the internals of pandas to be a lot more efficient, faster, lower latency, and able to work with much larger data sets, it was almost the perfect environment to kick off a new project to build what ultimately turned into Arrow. So at the end of 2014 I was already starting to jot down a design document for what would basically be proto-Arrow: the thing that would work well for me, but also something that potentially Impala or Kudu could use for data transfer and data interoperability. At the same time, I also had this desire to decouple the API of the data frame library from the compute engine. So right around then I started building what turned into the Ibis project, which has been a little bit under the radar in the Python ecosystem, but is now a 10-year-old project, has become fairly mature, has a surprising amount of use, and is starting to have a much broader impact in the Python ecosystem than it did in its earlier days. So I had these two ideas in my mind: the decoupling of the API from the compute engine and storage, and the creation of an interoperable memory format for columnar data. Ultimately the columnar data project ended up being the thing that resonated more with the broader open source ecosystem, and the thing that today has become super successful and ubiquitous. So I've spent a lot of the last 10 years really focused on making that happen, not only building the technology but finding people who want to work on it, and creating an environment where they can be productive and build the open source project and the ecosystem around it.
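
To make the API-versus-engine decoupling concrete, here is a minimal Ibis sketch; the file name and column names are made up for illustration. The same dataframe-style expression compiles down to whichever backend the connection points at (DuckDB here), which is the separation of concerns Wes describes.

```python
import ibis

# Connect to a backend; the expression below is backend-agnostic and
# could target Postgres, BigQuery, etc. with a different connect() call.
con = ibis.duckdb.connect()
events = con.read_parquet("events.parquet")  # hypothetical local file

# Build a lazy expression: nothing executes yet.
summary = (
    events.filter(events.status == "ok")          # hypothetical columns
    .group_by("country")
    .aggregate(n=events.count(), avg_latency=events.latency_ms.mean())
)

# Inspect the SQL Ibis generates for this backend, then execute it.
print(ibis.to_sql(summary))
print(summary.execute())
```
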

Kostas

Regarding Arrow, can you help us understand a little bit what Arrow is? And most importantly, how is it different from, let's say, Parquet or ORC? Those are file formats; Arrow is not strictly a file format, but you can still store it, as you mentioned, you can put it on S3 and all that. So what's the difference? And two more questions on the same topic: why is it needed, and why is this decoupling that it enables actually important?

Wes

Yeah. So Arrow is a bunch of things. It started out being really focused on being a specification for representing tabular data in a column-oriented format in memory, without concern for storage. So purely looking at data in memory, in RAM: if you allocate a chunk of memory, how do you physically arrange the bytes to represent a table? Even less so a table, more at the individual column level: what does it mean to have a column of float64 values, or a column of strings, or a column of lists of int32s? There's a specification that describes the exact way memory is allocated and the way the data is arranged in that memory to create the logical concept of a column of doubles, or a column of strings, or a column of boolean values. It handles things like missing data: missing data is handled with a bitmask, a separate chunk of memory that's overlaid on a column. Then you can have a group of columns with types and names, and together they define a schema, and that defines a chunk of a table. A table is constituted from collections of what are called record batches, in Arrow speak. A record batch is a collection of columns, each of which contains contiguous data representing a physical column within that batch. A batch might be really small, effectively a hundred rows of the table, or a thousand, or thirty-two thousand, or you could have really big batches of a million rows or more. The idea is to facilitate arranging data in chunked form so you can move it around and stream it really efficiently, from process to process, from programming language to programming language, or between different systems, and depending on the application you can size the batches however makes sense for that particular application. After the memory specification for how to arrange the data, we defined protocols for relocating that data between processes and between different systems. There are two modalities. One is for moving data from one process to another: that's the IPC, or interprocess communication, format. It basically arranges the columns end to end like dominoes, with a small metadata prefix that describes the structure of the chunk of data. When you receive that on the other side, in a different process, you look at the little metadata prefix, which tells you the byte offsets of each of the constituent memory buffers you need to reconstruct the record batch, the chunk of a table, without doing any further copying or conversion. That enables what we describe as zero-copy deserialization, in the sense that, let's say you're receiving data over a socket, a Unix socket, or over HTTP: you pop a chunk of Arrow data off of your interface, you look at the metadata prefix, and then you create an object in your desired programming language that has all of the pointer offsets to the locations within that chunk of memory, which lets you create a tabular view of the block of data you received over the wire.
Compared with traditional database drivers, this is much more efficient, because you aren't receiving the data and then immediately converting and moving every single value into a different data structure. Typically, if you're reading data over, say, the Postgres wire protocol, you receive a chunk of data from running a SQL query and you immediately relocate all of the bytes into a new data structure that is particular to your application. That's expensive: the amount of work you have to do is proportional to the size of the data set you transferred. In Arrow land, the deserialization cost is not proportional to the size of the data set. You receive the data in memory, and that has a certain cost of moving the data over the wire, or whatever your protocol is. But once you actually have the data in process memory, you can construct an Arrow data structure that references it at a fixed cost. You could have a million rows, a billion rows, or a hundred rows; the cost is proportional to the number of columns, essentially, and that cost is very small.
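
As a concrete sketch of the IPC round trip Wes describes, here is a small pyarrow example (the column names and values are made up). The key point is that reading the stream back only interprets the metadata prefix and builds views over the existing buffer rather than copying every value.

```python
import pyarrow as pa

# A table is a collection of columns; each column is contiguous Arrow
# memory, with nulls tracked in a separate validity bitmap.
table = pa.table(
    {
        "id": pa.array([1, 2, 3, None], type=pa.int32()),
        "name": pa.array(["a", "b", None, "d"]),
        "score": pa.array([1.5, 2.0, 3.25, 4.0], type=pa.float64()),
    }
)

# Write the Arrow IPC stream format: record batches laid end to end,
# each with a small FlatBuffers metadata prefix describing buffer offsets.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# "Deserialization" on the receiving side reads that metadata and builds
# zero-copy views over the buffer; no per-value conversion happens.
received = pa.ipc.open_stream(buf).read_all()
print(received.schema)
print(received.num_rows)
```
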

Nitay

Can you actually dive a little bit deeper into that? Because I think there's some really amazing, magical stuff you guys did there. I get super excited about that kind of performance work, especially being able to do it in a cross-language manner, the kinds of native buffer swizzling you had to do. I think it's really, really cool.

Wes

Yeah, it is cool, and it's really nitty-gritty. In each programming language implementation of Arrow, the mechanics of how we make that work are a little bit different. In Rust, for example, Rust abhors unsafe pointers and things like that, so there have been some workarounds. I think that's partly why there are two Rust implementations of Arrow: one doesn't forbid unsafe, and there's another implementation called arrow2 which isn't as actively maintained and developed; it's not the official implementation of Arrow, but it's an unsafe-free version. Basically, the metadata prefix is serialized using FlatBuffers. We chose FlatBuffers partly because we wanted to minimize the cost of even looking at the metadata packet that describes the structure of the payload, so if for some reason you send across a payload that has 10,000 columns, or a million columns, we aren't adding unnecessary overhead purely to look at the structural metadata of the IPC, the interprocess communication, payload. So it's pretty cool. I don't know all the exact details of how every implementation works; even in Java it's a little bit tricky, because Java's whole relationship with unsafe memory is complicated. Another thing we did later on, to facilitate in-process interoperability, especially thinking about languages loading DuckDB as an embedded engine, is we developed something called the C data interface. Basically it's a C struct that you can construct, because almost all programming languages can create C structs in memory through some version of C FFI, or C function calling, which lets you call C functions from that language. Using the same capabilities that let people call C functions from their programming language and build interfaces to C libraries, because everyone's got C libraries they need to call, you can create structs that represent an Arrow array object, a column object, and that allows us to relocate Arrow data structures across C function call sites. For example, from Rust, or Go, or Swift, or C#, or whatever programming language, we can relocate in-memory data into a foreign library like DuckDB, where there's no code shared at all; the only thing that's known is the C header that allows that library to be called from a foreign context. We can pass Arrow data over the C function call site without any copying in memory. That's enabled really interesting things, like being able to use embedded DuckDB to execute queries against in-memory data sets with completely zero copy. DuckDB has an internal kind of pointer-swizzling way of interacting with Arrow data directly, with minimal overhead, to run queries. DuckDB has an Arrow-like memory representation, but it isn't exactly Arrow, because they've made some optimizations to go faster for DuckDB's specific internal design.
But it was very much built to work well with Arrow and have minimal overhead in that embedded query context, if that makes sense.
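
A small sketch of what that looks like from Python, assuming the duckdb and pyarrow packages; the table contents are made up. DuckDB scans the registered Arrow table through the Arrow C data interface, so the column buffers are shared with the embedded engine rather than copied into it first.

```python
import duckdb
import pyarrow as pa

# An in-memory Arrow table, as it might arrive from another library or process.
orders = pa.table(
    {
        "customer": ["alice", "bob", "alice", "carol"],
        "amount": [10.0, 25.5, 7.25, 12.0],
    }
)

con = duckdb.connect()
# Expose the Arrow table to DuckDB under a view name; the data stays in
# Arrow buffers and is scanned in place by the embedded engine.
con.register("orders", orders)

result = con.execute(
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
).fetch_arrow_table()  # results come back as Arrow as well

print(result)
```
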

Kostas

Let me ask one question. Okay, we're talking about columnar data here. With Arrow, as you said, we have an efficient way to represent it in memory and a way to transfer it around at the minimum possible cost. There are also file formats like Parquet, which is again a format for how to efficiently store columnar data, in this case on disk. What's the difference between the two? Why is there a need to handle columnar data differently when you store it on your hard drive compared to having it in memory?

Wes

Yeah. Well, Arrow wasn't designed for storage. To give you an idea: probably a lot of people listening have used Parquet, or heard about it, or interacted with it in some way, but aren't familiar with how it's implemented. When you store data in Parquet, the data goes through a series of encoding passes. First there's dictionary encoding. If you have data with a small number of unique values, dictionary encoding is a type of compression that works really well: you determine the unique set of values that occur in the data, store those unique values once in the file as a dictionary, and replace the actual occurrences of the data with integers that refer to the dictionary. Those integers are called dictionary indices, and in a Parquet file they are further encoded with run-length encoding. The null values are also collapsed in the data set, so if you have a column in a Parquet file with, say, a million values where only one of them is not null, the data as actually stored would be really, really tiny, maybe tens of bytes, definitely very small. But when you go to decode the Parquet file, you have to expand and rehydrate the data to fit into the data structures being used in your application. So there are these multiple encoding passes that create the encoded columns stored in the Parquet file, and then there's usually an extra layer of general-purpose compression added on top, which has started to create problems that we can discuss in a bit. Usually it's Snappy, Zstandard, or LZ4; those are the three most popular general-purpose compressions layered over all of that. So when you go to read a Parquet file, you actually have to do a pretty non-trivial amount of decoding and decompression to rehydrate the data so you can process it in your query engine. And think about when Parquet was created: 2011, 2012 was the era when Parquet was designed, as a collaboration between Twitter and Cloudera. In that era, the whole architecture of data centers and storage was totally different. Data was stored on spinning-disk hard drives that could read two or three hundred megabytes a second; for networking in data centers, maybe you had 10 gigabit Ethernet, and that was the bleeding edge, nothing like the InfiniBand, terabit-class networking we have in modern data centers today. So there was a real focus on making the file on disk, in the data center, as compact as possible, with these multiple encoding passes. The benefit you got by making the file really, really small, so that transferring it out of NFS or cloud storage onto the host where it's decompressed was cheap, would greatly dwarf the cost of decoding the columns of interest. Arrow, by contrast, basically doesn't have any encoding.
I think we did add a run-length-encoded type, so you can have general-purpose run-length-encoded data, but it's not an encoded, compressed file format, if that makes sense. It's completely rehydrated. If there's a null value, say you have int32s and a bunch of nulls, there will be four bytes, probably all zero, sitting where the null is in the data set. That's what allows you to do predictable random access into the data. So Arrow is not designed to be really small for storage. We did add optional general-purpose compression, so if you're sending interprocess communication data over HTTP or a Unix socket and it makes sense to compress it, you can, but we added that as a nice-to-have, not as something trying to replace Parquet or trying to be a file format. Also, Arrow was designed for a different era of data systems, where disks have gotten 10 to 100 times faster and networking has gotten 10 to 100 or 1,000 times faster. The design constraints Parquet was created for, almost 15 years ago, describe a world that's totally different now. Essentially the scales have tipped: I/O bandwidth is not as big of a deal, and decoding performance is what has become the bottleneck in modern applications. Arrow, of course, was created in the mid-2010s, but it was designed for this forward-looking future where everything is super parallel: we've got GPUs, we've got server processors with hundreds of cores, networking is super fast, disks are super fast, and we need a way to represent data that can be moved around hyper-efficiently and then processed in these hyper-parallel server contexts with minimal deserialization. If we had extra serialization, extra decoding on the receiver side, that would basically kill the performance of the application. So it's definitely a different design approach. But we still need file formats: if we replaced Parquet with Arrow in the data center, it would be problematic, because Arrow is too big on disk, and there are still applications where I/O bandwidth is meaningful. That's happening a bit more in GPU clusters; it turns out that in GPU clusters you can decode Parquet files on a GPU, and that's also created new complications that the next generation of file formats is trying to work around, so that some of Parquet's design problems don't get carried forward into the next generation of file formats.
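
To see the trade-off Wes describes, here's a small experiment you could run with pyarrow; the data and exact byte counts are illustrative, and Feather v2 is used here because it is the Arrow IPC format written to a file. A low-cardinality column compresses dramatically in Parquet via dictionary and run-length encoding, while the uncompressed Arrow file stays fully rehydrated on disk in exchange for cheap, predictable decoding.

```python
import os

import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# A repetitive, low-cardinality string column: ideal for Parquet's
# dictionary + run-length encoding passes.
table = pa.table({"status": ["ok", "ok", "ok", "error"] * 250_000})

# Parquet: encoded columns plus a general-purpose compression layer.
pq.write_table(table, "status.parquet", compression="zstd")

# Arrow IPC on disk (Feather v2), left uncompressed: no encoding passes,
# so reading it back requires essentially no decoding work.
feather.write_feather(table, "status.arrow", compression="uncompressed")

print("parquet bytes:  ", os.path.getsize("status.parquet"))
print("arrow ipc bytes:", os.path.getsize("status.arrow"))
```
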

Nitay

That's a perfect segue, maybe, into what the future of file formats looks like. It seems like this is, in an exciting way, a never-ending problem. On the one hand you think, well, I've solved this, I have a great common format, it should work for all time. But in reality, as you just said, the underlying resource constraints shift and change, and what becomes the bottleneck shifts over time. Suddenly you have GPUs, you have I/O where random seeks are not a big problem. So what do you see as the future of some of these up-and-coming new file formats, and where do you see the industry going?

Wes

Yeah. Well, to focus on the motivation: we've been talking a little bit about Arrow versus Parquet, and I think that distinction is easier to articulate, in the sense that Arrow doesn't have these encoding passes and isn't trying to shrink down into a really small file that, when rehydrated, expands into a really big data set. But a lot of people have only discovered Parquet and started using it in the last few years. For example, people writing machine learning and AI papers are finding that using a binary columnar file format actually makes a lot of sense for many of these applications. For people who have just learned about Parquet in the last five years, suddenly hearing that we need to design new file formats might provoke a "what the heck?": I just learned about Parquet, just got comfortable using it, and now you're telling me we need a new file format. But Parquet has some challenges, which I'll try to work through incrementally. One is that it struggles with very large, wide schemas. The working rule of thumb is that Parquet is okay up to maybe a thousand or ten thousand columns, but even at that level you run into the overhead of just interacting with the Parquet metadata. There's a file footer in Parquet with a lot of structural information about the file, encoded with Apache Thrift, which is similar to Protocol Buffers, kind of Facebook's copy of Protocol Buffers, if that makes sense; Protobuf is a lot more popular now. Just deserializing the metadata, so you can know which data in the file needs to be read out of storage in order to decompress it, can be pretty expensive, especially with wide schemas. So that's definitely one problem. Another problem is that Parquet doesn't have very advanced encodings. Especially when you consider advances in SIMD, single instruction, multiple data processing in modern CPUs, as well as modern multi-core systems and GPUs, Parquet's built-in encodings, the encoding passes it does, are not very advanced. It also relies on general-purpose compression, which is not very efficient to decompress on modern CPUs or on GPUs. And there are some other design flaws. One is that there isn't enough information in a Parquet file to have predictability about memory allocation requirements. On CPUs this is a little less important, but now that so much processing is moving onto GPUs, the lack of memory pre-allocation hints in a Parquet file means that decompressing Parquet data directly on an NVIDIA GPU, for example, is a lot less efficient than it could be if you knew ahead of time that this string data will expand to exactly this many bytes, so you could reserve exactly that much GPU memory for your decoding pass. Essentially the decoder has to make multiple passes to determine how much memory to allocate and then, finally, to rehydrate the data into that memory.
What's happened in the 15 years since Parquet was created is that there have certainly been advances in hardware, but also research into lightweight encodings for data that are efficient to decode both on modern CPUs and on GPUs. A lot of this research has come out of groups like TU Munich and CWI; CWI is where DuckDB was created, so that's the intellectual lineage, columnar databases were essentially born in the Netherlands. What the research has shown is that you can achieve compression levels similar to Parquet but with vastly better random access, retrieval, looking up values in a file. Parquet is not very good at random access: you have to decode a whole big chunk of data just to pick one value out of the file, and that's expensive. The research shows you can get similar compression levels to Parquet but vastly better decoding performance by using lightweight encodings only, without general-purpose compression. So there's a whole batch of next-generation, parallel-friendly, GPU-friendly encodings that can be used to store the data. Then there's the metadata problem I was describing earlier: why not have a million columns in a file and be able to efficiently pull out ten of them? That's a big problem in Parquet, but ideally a file format should accommodate that type of need. When you consider machine learning data sets, feature engineering can generate data sets with millions or tens of millions of columns, and sometimes that data needs to be generated and stored. If your file format just isn't good at handling really wide data sets, that's problematic. So now there's a whole crop of new production and research file formats. And I haven't even talked about multimodal data, images and vectors and those things; that's a whole other dimension that Parquet is not good at. So, to recap: wide schemas and metadata deserialization overhead, encoding and decoding performance, reliance on general-purpose compression, multimodal data storage, and random access performance. If you're building a large data platform, you can start to measure how much it costs to do data retrieval and deserialization at scale. Imagine you're Meta or Google or Microsoft: these costs really, really add up, and continuing to use Parquet as the file format of record has a material cost at large scale in the data center. That's what's been motivating the creation of new file formats that don't have these limitations and are a lot more efficient to decode.
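
To see the footer and the column-pruning behavior in practice, here's a small pyarrow sketch; the file name and column names are hypothetical. Opening the file only parses the Thrift footer, which is where the per-row-group, per-column-chunk encodings, compression codecs, and statistics live, and reading a subset of columns only fetches and decodes those column chunks.

```python
import pyarrow.parquet as pq

# Opening the file lazily reads and deserializes only the footer metadata.
pf = pq.ParquetFile("events.parquet")  # hypothetical file
md = pf.metadata
print(md.num_rows, md.num_row_groups, md.num_columns)

# Per-column-chunk details recorded in the footer: encodings, codec, stats.
col = md.row_group(0).column(0)
print(col.path_in_schema, col.encodings, col.compression, col.statistics)

# Column pruning: only the byte ranges for these columns are read and
# decoded. The wider the schema, the more this matters.
subset = pf.read(columns=["user_id", "latency_ms"])  # hypothetical columns
print(subset.schema)
```
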

Kostas

So I think it's clear why there's a need to rethink how we store the data. And it's interesting what you said about how the hardware changes and how the workloads shift to being bound on different things, from being I/O bound on the network to now being CPU bound. But these things tend to be really sticky, right? Parquet is not easy to get rid of; it's very fundamental to what people have built out there. So how do we make these new file formats not just up to date today but also, let's say, future proof? How can we make sure that if something crazy happens five years from now, in terms of hardware and how things change, we won't have to wait another fifteen years to design and migrate to something new?

Wes

Yeah, that is a good question. I think one of the biggest things, if you want to leave things more open-ended when you build a new file format, is to avoid getting into the situation that Parquet is in. Parquet has had some other, more social, challenges. One is what I've described many times as implementation fragmentation. There are many implementations of Parquet across many different systems, and at the same time a handful of new features and new data encodings were added and bolted on, but they didn't get implemented consistently everywhere. That led to fragmentation, where Spark supported one set of Parquet features and not others. Then Spark got more popular, which created a lot of gravity around Parquet, because Parquet was the preferred input and output file format for Spark. So if a third party wanted to create and use Parquet files and make them interoperable with Spark, they would only implement the features they were confident Spark supported well out of the box, so that there would be good interoperability across the whole stack. That had the effect of disincentivizing people from implementing or getting on board with new bleeding-edge features in Parquet if they couldn't be confident the files they were writing could be read consistently in Spark and elsewhere. Take adding a new encoding, for example: you implement it, you start generating a bunch of files that use it, and then you go to read the data somewhere else and it can't be read because there's an unrecognized encoding. We don't want that. So one idea that came out of the Vortex and F3 projects, which are new file format projects, was to ship new encodings or new features implemented in WebAssembly, actually attached in the file. If an implementation encounters, say, a new encoding it doesn't yet know how to decode, it can fall back to a bundled WebAssembly implementation. It's maybe not the most efficient implementation that could exist, because WebAssembly isn't as fast as the native code you could write in your language, but at least, for compatibility, you have a bundled implementation you can use to read the file, so you're never in this no man's land of receiving data files you're unable to read, if that makes sense. So that's one idea. There are other affordances you can provide in the metadata to leave things open for change or addition, and also to give implementations what's called forward compatibility. The idea of forward compatibility is that if you're an older implementation of a standard and you receive a file that attests it comes from a newer version than what you support, you can recognize when it uses features you don't understand,
and either choose to skip them safely, so there might be a column you can't read because it uses features you don't know how to handle, and you skip it and say, I'm just not going to try to read that, because I might segfault or leak memory, or you can simply error and say, I'm not going to do that. We don't want a situation where you receive a file, especially untrusted input, someone uploads a data file, you feed it into your database or your DuckDB process or whatever, and it blows up. We don't want crashes just because people are doing their best to take advantage of new features added in the future. So it's tricky, but I think we've learned a lot from doing Parquet, and Parquet is still being made better. I think the existence of new file format projects has also rejuvenated Parquet development in a lot of ways. There have been great leaps in Parquet performance in DataFusion and Rust, for example, and that was partly motivated by the realization that, hey, Parquet and its implementations can be made a lot better, and I completely agree with that. There's some discussion of maybe having a Parquet 3. There's already Parquet 1.0 and Parquet 2.0, and there was some awkwardness around people not understanding what is meant by Parquet 2.0 and which features are 2.0 features versus 1.0 features, which created some fragmentation among the implementations. But there's a lot of healthy community dialogue, and especially now, with the big push around open data lakes, Iceberg and Delta Lake and so on, there's work being done to provide a path to integrate bleeding-edge file formats into Iceberg, so that if you're using Iceberg you aren't forced to use only Parquet. Otherwise that would, over time, hold back the ecosystem, especially if you're in a controlled environment where you control all the files being generated and know you're not using unsafe or forbidden features. If you're building a large data center or a large cloud platform with petabytes of data, why not have the liberty to choose the file format that makes the most sense for your application and your compute engines? And then there's the whole multimodal side; I assume you guys have had some exposure to multimodal data sets, especially now in the AI world. That's a whole other dimension: storing images, videos, and vectors is not especially ideal in Parquet, so file formats like Lance and Vortex are being developed specifically to provide better support for those types of data sets, along with other things like being able to modify data sets, add columns, and so on.
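
Going back to the forward-compatibility idea, here is a purely illustrative Python sketch; none of these names correspond to a real file format library. The point is only the policy Wes describes: an older reader that encounters an unknown encoding skips that column (or errors cleanly) instead of misreading bytes and crashing.

```python
# Hypothetical: the set of encodings this (older) reader understands.
KNOWN_ENCODINGS = {"plain", "dictionary", "rle"}

def readable_columns(footer, strict=False):
    """Return the columns this reader can safely decode.

    `footer` is a hypothetical, already-parsed metadata structure.
    """
    readable = []
    for col in footer["columns"]:
        if col["encoding"] in KNOWN_ENCODINGS:
            readable.append(col["name"])
        elif strict:
            raise ValueError(
                f"unsupported encoding {col['encoding']!r} in column {col['name']!r}"
            )
        else:
            # Skip safely rather than guessing at bytes we can't interpret.
            print(f"skipping {col['name']!r}: unknown encoding {col['encoding']!r}")
    return readable

footer = {
    "format_version": 3,  # newer than what this reader was built for
    "columns": [
        {"name": "id", "encoding": "plain"},
        {"name": "tags", "encoding": "fsst"},  # an encoding we don't recognize
    ],
}
print(readable_columns(footer))
```
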

Kostas

Yeah, a hundred percent. One last question from me, and then I want to give the mic to Nitay, because I've been monopolizing the conversation. It's nice that you mentioned the table formats, because you mentioned metadata, and the amount and complexity of the metadata you have to handle is also a factor when you're reasoning about the data you store. These table formats add a bunch of metadata of their own, right? My question is: at what point does managing the metadata across the whole stack start becoming a problem or a bottleneck? If it does; maybe it doesn't. But when something keeps coming up more and more often, when instead of talking about the data we talk more about the metadata, it sounds like, okay, this is something that becomes interesting. So I'd like to hear from you how these pieces interact. We used to have databases: let's say I get a binary from Postgres and I run it, and as a developer I don't really care how the table is encoded or how it's paged. At the end of the day I run my SQL, and whatever happens underneath, happens. With data lakes and all this disaggregation between the systems, we ask people to get really into the details of each part of the database world. So tell me a little bit about that; I'd love to understand where we're heading with metadata and metadata management.

Wes

Yeah, I'm just looking up a tweet of mine from July 2016. Are we still allowed to call them tweets? A post, an X post. My tweet from July 20th, 2016 was: "Forget big data. Let's start talking about big metadata." I think I was just making a joke, but actually this is a real thing. To go back in time: the Hive Metastore was one of the first open metastores whose goal was to facilitate multi-engine access to data sets stored in distributed storage, whether cloud storage or HDFS and so on. But then people found, especially really big companies with massive data sets like Netflix, that just interacting with the metadata associated with a really large, slow-moving table could add up to minutes of planning overhead when running a query. That's one problem that occurred. Another: if you were running a Spark job on a massive Parquet dataset, Spark would launch a Spark job just to look at the metadata of all the Parquet files referenced in the operation. That alone would be significant computation, reading all of the file footers of all the Parquet files and then reasoning about what's in them. Is there any schema evolution? Maybe you're only referencing a handful of columns in the files, so you need to plan out which byte ranges of which files to fetch from attached storage or cloud storage. Just reckoning about what data needs to be read out of storage, what needs to be deserialized, and how to reconstruct and scan those data sets could carry massive overhead. That led to the creation of projects like Iceberg, Delta Lake, and Hudi, which basically provide more scalable metadata storage, designed essentially to let you query your metadata more efficiently so you can plan and execute large-scale queries. Now, one of the funny things about the proliferation of Iceberg and friends, and I think Iceberg is probably the most successful, or on track to be the most successful, open data lake format, is that Iceberg may actually be overkill for a lot of people's data lakes. So the DuckDB team recently built kind of a neo-Hive-Metastore, a way to store metadata in a real database rather than in flat files and Parquet files in cloud storage, designed for people's more modest data lakes, something that's simpler to manage and easier to work with than Iceberg. It's called DuckLake. It's a bit of pushback against the Iceberg-ification of everything, maybe partly driven by the DuckDB team's frustration with implementing all of the features that have been developed in Iceberg. But yeah, this big metadata problem is a big problem.
I haven't done any development on Iceberg myself, but I'm familiar with it; I know Ryan Blue and the team, who are now at Databricks. The amount of investment coming from the tech ecosystem into this is pretty substantial. The major cloud hyperscalers, especially Amazon and Microsoft, are making massive investments in table formats like Iceberg, and there's an arms race around who will have the best tools and the best integration to support these growing data lakes, because essentially, if you're managing people's data, you're collecting a tax on every petabyte that's stored, and that's very profitable for the cloud hyperscalers.
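
To make the metadata-driven planning concrete, here's a small pyarrow.dataset sketch; the directory layout, partition column, and field names are hypothetical. Planning the scan uses only metadata: Hive-style partition values and Parquet row-group statistics decide which files and byte ranges to read before any column data is deserialized.

```python
import pyarrow.dataset as ds

# Hypothetical layout: events/country=US/part-0.parquet, events/country=DE/...
dataset = ds.dataset("events/", format="parquet", partitioning="hive")

# The filter is pushed into planning: non-matching partitions are never
# opened, and row groups can be pruned using footer statistics. Only the
# projected columns are read and decoded.
table = dataset.to_table(
    columns=["user_id", "latency_ms"],      # hypothetical columns
    filter=ds.field("country") == "US",     # hypothetical partition value
)
print(table.num_rows)
```
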

Nitay

The other thing I find interesting, since you called out a few things on the metadata side, and tying back to your point about multimodal, is how metadata catalogs, or schema views, whatever you want to call them, started out, like you said, as just a Hive Metastore, a basic thing pointing you to some files, and then evolved over time. These days they're trying to do a lot more than that: many of them are trying to take over and be your model catalog, and to host pointers to all the different embeddings and all the different multimodal data. What do you see as the future of what happens there? And, tying to your point, whenever I think of things like DuckDB and DuckLake I'm reminded of that research paper from a while back, the COST paper, that showed how most workloads are actually faster on a single machine, even a single CPU, something like that.

Wes

Yeah, "Scalability! But at what COST?", the famous paper. Exactly. It was written the same year we started doing Arrow, so this was definitely in the water at the time. Frank McSherry and a group of folks from Microsoft Research. The cost of outperforming a single thread. It turns out that a lot of these big data systems achieve scalability, but they introduce massive overheads. So yeah, that's it.

Nitay

And the thing I found particularly interesting about it, which I think is still mostly true, is that that line actually moves over time: a single CPU, a single thread, even setting aside clock speeds, ends up able to handle more and more, just because of all the overhead of the distributed systems machinery. The line is not fixed, and it tends to move in the opposite direction from what people expect. That was essentially one of the takeaways.

Wes

Yeah. Well, it's interesting now that it feels like the whole ecosystem is on this crusade to tear out the JVM, the no-JVM movement: rebuild everything in Rust, or at least in native code, and make everything easy to deploy, just a static binary you can copy onto the server and run, without all this bloat and complexity. I think a lot of what's happening now is a reaction to people being frustrated not only with the operational complexity of these systems but also with their inefficiency. In recent talks I've described a kind of computing hierarchy of needs: for a while, just being able to scale was what really mattered, and making things efficient and operationally easy to manage were secondary concerns; it was about achieving feasibility for these workloads at all. And what's even crazier, another thing I've pointed out, is that in the era of the original MapReduce paper in 2004, servers only had one processor core. So what was big data for Google then is just one machine in the cloud now. A lot of these distributed systems simply aren't needed. What counts as big data now, compared with 20 years ago, has probably changed by a thousand X. A terabyte is not big data anymore. Exactly.

Nitay

Yeah, it's fascinating. I want to slightly shift gears, but go to some of the things you pointed out along the way. You mentioned DataFusion and some other related projects; Arrow is often part of a broader ecosystem with DataFusion, Flight, ADBC, and so forth. So give us a picture of that broader ecosystem, what's happening there and where it's moving.

Wes

Yeah, so DataFusion is a Rust-based, configurable, customizable query engine that started out as Andy Grove's personal project and got folded into the Apache Arrow project, eventually growing its own DataFusion community. You can think of DataFusion as being a little bit like DuckDB, but designed more to be customized and adapted to suit specialized query processing systems, whereas DuckDB is a bit more of a batteries-included, full-stack system. You can think of DuckDB as a bit more like SQLite: you can drop it as one gigantic C file into your project and have an embedded query engine that's ready to go. DataFusion, by contrast, is designed to be customized and adapted to the query engine you're building. For example, it's being used to build InfluxData's next-generation query engine. There's a whole array of startups built on top of DataFusion: LanceDB, the creator of the Lance file format, and SpiralDB, who create Vortex, are all using DataFusion. There are now, I would say, dozens of companies building specialized query processing or data processing systems on DataFusion, which is really cool. Eventually the DataFusion community got big enough that they split off from Arrow to create their own top-level project in the ASF. So now it's Apache DataFusion, with a growing community, and there's a whole ecosystem of subprojects and add-on contrib projects, essentially add-ons and plugins for people doing things with DataFusion. What's cool is that if you're building a startup now and you want to program in Rust, there's almost no motivation to create your own query engine: DataFusion is there, meant to be picked up off the shelf and customized for your application. As an example, there's a company called Arroyo that built a streaming SQL engine; they got acquired by Cloudflare, and now they're powering Cloudflare's SQL data lake offering, and that's all using DataFusion, to my knowledge. That's exactly the intended use case for the project. So it's super cool, and not only is DataFusion Arrow-native and written in Rust, it's also an accelerant in the sense that it has raised the base platform people can build new data processing solutions on. Fifteen years ago you would have needed to build a whole query engine from scratch if you were doing a startup, whereas now using DataFusion as the base, especially if you want to work in Rust, is a no-brainer. And that's exactly what we wanted.
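
For a feel of DataFusion as an embeddable engine, here is a minimal sketch assuming the datafusion Python bindings are installed; the file name and columns are made up. The pattern is the same as embedding DuckDB: an in-process engine that reads Parquet and hands results back as Arrow record batches.

```python
from datafusion import SessionContext

# Create an in-process DataFusion session and register a Parquet file
# as a table the SQL layer can see.
ctx = SessionContext()
ctx.register_parquet("events", "events.parquet")  # hypothetical file

df = ctx.sql(
    "SELECT country, COUNT(*) AS n FROM events GROUP BY country ORDER BY n DESC"
)

# Results are returned as Arrow record batches, so they interoperate
# directly with pyarrow, pandas, DuckDB, and the rest of the ecosystem.
for batch in df.collect():
    print(batch.to_pydict())
```
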

Kostas

Yeah, and I want to add something here about DataFusion, because people who haven't worked with it should give it a try, even if they are not into database systems, because the beauty of DataFusion, in my opinion, is how modular it is. Usually databases were kind of a monolith; if you take Trino, for example, or even Spark, and go end-to-end through the system, it's a huge monolith where everything is super connected to everything else, and making a change somewhere is really, really brittle. But the real value of the architecture that DataFusion has is that you can pick the part that you want to innovate in, as in the sketch below. In Arroyo's case, which Wes was talking about, they were focused a lot on the streaming world, on how to do streaming processing efficiently, but the rest they could take out of the box and just use, and that accelerates development a lot. In a similar but different approach, Cube, the semantic layer, uses DataFusion too, to build the materialization layer and caching layer that they have. So again, they took the parts of the query engine that are not part of their IP, where they don't need to innovate, out of the box, and then went and built their own stuff where it matters for them. Whoever designed it did, I think, a really, really good job there, and it is adding a lot of value in the ecosystem. That's why you see DataFusion being used for everything from streaming down to semantic layers and building cubes. You couldn't do anything like that before, basically.
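As a small, hedged illustration of that "swap in only the piece you care about" point, here is a sketch of registering a custom scalar function with an otherwise stock DataFusion engine, again via the Python bindings. The udf()/register_udf() call shapes are assumptions based on the bindings' documented API and may differ across versions.

```python
# A sketch of DataFusion's extensibility: a user-defined scalar function
# plugged into an otherwise off-the-shelf engine. Call signatures here are
# assumed from the `datafusion` Python bindings and may vary by version.
import pyarrow as pa
import pyarrow.compute as pc
from datafusion import SessionContext, udf


def double(arr: pa.Array) -> pa.Array:
    # Runs on Arrow arrays in batches; 2.0 is broadcast as a scalar.
    return pc.multiply(arr, 2.0)


# Wrap the Python function as a SQL-callable scalar UDF.
double_udf = udf(double, [pa.float64()], pa.float64(), "immutable")

ctx = SessionContext()
ctx.register_udf(double_udf)

# Parser, planner, optimizer, and execution all come off the shelf;
# only the `double` function is custom.
ctx.sql("SELECT double(CAST(21 AS DOUBLE)) AS answer").show()
```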

Wes

I mean, the next-generation version of dbt, of dbt Core, is called dbt Fusion, and it uses DataFusion. So now, with the agglomeration of dbt, Fivetran, Census, and SQLMesh all being one company, there's DataFusion and Arrow right there. And George Fraser from Fivetran is a professed fan of Arrow. So the fact that this is at the center of all of this enterprise data transformation and processing is really super cool. Where it goes from here, I mean, the way Andrew Lamb, the de facto lead developer of DataFusion, described it is that he wants to create a world where there can be a thousand DataFusion companies. And I think we're probably on track to have that.

Nitay

I'm curious about one interesting point that you guys are making. I've always been fascinated by this, and personally I think it's one of the reasons I love this field: in the infrastructure space, especially data infrastructure, given a long enough timeline, open source always wins.

Wes

But yeah, not without a lot of, you know, blood, sweat, and tears.

Nitay

Huge, huge bumps along the way, absolutely. Huge roller coaster. There are unfortunate projects and unfortunate politics, and companies trying to take over and usurp things, and all the drama that you might imagine, like Hollywood-level stuff. But eventually.

Wes

Well, one of the things, especially in these modern times where AI is kind of sucking all the oxygen out of the room, is that I feel like these types of data infrastructure projects, like Arrow and DataFusion and the composable data stack, are in many ways some of the most AI-resistant technologies or projects out there. Yes, you could certainly use AI and agentic coding to work on these projects more productively. But people are talking about AGI being five years off, and I don't see AI coming out of the woodwork and being able to build projects like DataFusion on its own anytime soon. I think the development and progress on these projects will certainly become more efficient, and features will get developed more quickly, with AI-assisted refactoring and the like. I think about all the hand-written kernels for mundane functions that I've written over the years and how much of that could have been done with Claude Code, and all the thousands of unit tests that I've written over the years. But this is still a domain that really needs smart people focused on it, collaborating and working together to build solutions that can power us for, essentially be the foundation for, the next 20 or 30 years of data systems.

Nitay

Which, since you went there, actually, I'm curious: since you mentioned vibe coding and AI and Claude Code and so on, I'd love to hear your take as what I think of as an infra renaissance man, if you will, given the variety of stuff that you've worked on. All this vibe coding stuff: what is it actually good for, and not, in the kinds of projects that you work on? Where does it need to improve? Where do you see it going? I'd love to hear some of your personal experiences there.

Wes

Yeah. So I'm a Claude Code user. I haven't used Codex yet. I'm hearing good things, so maybe I should try it out, or pit them against each other. I've occasionally used a front end for Claude Code called Conductor, conductor.build, which some people really like. Apparently it supports both now, so you could have one agent writing code and the other agent reviewing the code the first agent is writing, pitting them against each other. Maybe that's a good way to work; I'm not sure. I've used Claude Code quite a bit to do work on Positron, which is a fork of VS Code. VS Code is an absolutely gigantic code base, and Positron is a very thick fork; it has many person-decades of work put into customizations, custom extensions, and UI layers on top of the base Code OSS. I've actually found that the agents are not great at front-end stuff, but the middle layer of business logic and plumbing, sorting through the layers of things, debugging and fixing things that are not working, they do well. Sometimes they struggle to get the feedback; I find myself copying and pasting error messages from the Chrome console to give it the information it needs to fix the bug. But I've done quite a lot of work with coding agents, specifically Claude Code, on Positron. I haven't used it to work on the Arrow codebase. Arrow has a lot of x86 SIMD in it, and one thing I was interested in was seeing if I could get any of the coding agents to port some of the SIMD intrinsic stuff to use Apple Silicon intrinsics. It seems like the kind of problem a coding agent would be good for. But I find that they perform best on narrowly scoped problems where you give them very clear and detailed instructions, and where you are in a position to evaluate and review the work that they're doing. I recently built a totally different project from any of this: a terminal UI for my personal finance data. I'm maybe 350 to 400 turns deep using Claude Code to build it. And I think that without my software engineering experience and knowledge of design patterns and software architecture, without reading the code and giving it detailed feedback like fix this, reorganize things this way, make this refactoring so that you can write unit tests for this thing, if I were fully vibe coding and not reading the code and giving feedback, it would just make a gigantic mess that would eventually become unmaintainable, and probably yield diminishing returns at a certain point. So for narrowly scoped, targeted work, especially writing test suites, or refining and refactoring test suites, cleaning things up, normalizing things, and also doing exploration in code bases that you don't know well, they're really good. Some of that exploratory planning work, just helping you understand what you're dealing with and helping create a plan where maybe you go and actually do the work by hand later on, has worked pretty well for me.
But yeah, it's been a bit of an adventure so far with agentic coding. I admit that I was pretty skeptical about AI-assisted coding up until earlier this year. I didn't use Cursor at all; I started using Windsurf a little bit, in part because it seemed like everyone was using Cursor, so I figured I'd use the Cursor competitor and see how it is. I thought it was useful, the autocomplete was good, but nothing really quite clicked for me until I started using an early version of Claude Code back in March. That's when I thought, oh, okay, this actually is the modality that works for me; this is working at the level where I feel like I can get a lot of stuff done. But there have definitely been lots of pitfalls, toes shot off, branches discarded and scrapped, because it became clear that the problem was too nuanced or too complex for the agent to reason about. For small Python code bases that have a lot of unit tests, though, it's actually really effective. I feel like the more test coverage you have, and the more automated feedback you can feed into the agent, the better. For example, I was working on a project where I upgraded DuckDB and suddenly some test cases started failing, and I didn't know what the error was. I looked at the errors and they weren't errors I recognized, so I just asked Claude Code, hey, this is erroring, what's going on? It took it maybe fifteen minutes, but it figured out what was wrong and fixed it without any intervention from me, which was pretty impressive. I just went and got some coffee, or went to work out or something, came back, and it was done. Being able to delegate investigations and bug diagnosis and things like that, especially in a code base that has a lot of tests and can be worked on in a fairly autonomous way, is definitely a game changer. You get a lot of the mundane stuff out of the way so you can spend your human cognition time on the stuff that has the highest value.

Kostas

One quick question about what you mentioned about Claude Code. What clicked for you when you used Claude Code compared to the IDEs out there? Why is it so different as an experience that now it's something where you feel, okay, this is what I can use?

Wes

I think because, especially using it in what Simon Willison calls YOLO mode, where you dangerously skip permissions, and obviously you should practice good security and run Claude Code in a low-permission environment, a sandbox or something like that, it has fairly unbridled access to everything in a terminal environment: Git, the gh CLI tool, and so on. I know you can hook these things up to Cursor and essentially achieve some of the same things you have with Claude Code through MCP servers, by exposing a lot of tools to things like Cursor, but I think working in the terminal just feels very streamlined. I like to watch it work, see what it's doing, what it's investigating. Sometimes I'll see it going down a dark path and I'll hit escape and say, no, no, no, that's not right, I can see what you're doing, that's not the right approach, here's the right approach. I find that it's most effective when it gets into a loop of investigating a bug or a unit test failure or something like that: investigate, search, add debugging statements, recompile, run the test suite, oh, it still fails, analyze. It might go through the same loop twenty or thirty times before it's actually able to successfully diagnose and fix the problem, but I find it's a very effective agentic loop. I used Windsurf prior to that, and I'm sure Windsurf has gotten a lot better in the meantime, but maybe a month into using Claude Code I canceled my Windsurf subscription because I just wasn't using it. I thought, this is meeting all of my needs, so I don't need anything else from an agentic coding tool.

Nitay

Yeah, I did very similarly; my current iteration is also Claude Code, and I'm heavily using it. I'm curious, since you mentioned YOLO mode: one thing I've seen more and more people start to talk about is full YOLO, meaning, here you go, sudo, do whatever you want, but then, because of that, they run it within a Codespace or Docker or something where you can blow away the entire machine and not care. And then, like you said, I'll go away for an hour: implement this whole thing, get it done, let me know when it's done.

Wes

Yeah, yeah. I recently read, well, Simon Willison gave a talk about sandboxing, basically running Claude Code inside a Docker container, and there's a sandbox flag where, yeah, it really is sudo and all that. I haven't done as much of that. I should spend a little time and develop a more sophisticated setup. Probably half the time when I'm using Claude Code, I'm sure I'm exposing myself to prompt injection risk and things like that, so I should probably practice better security. But I also haven't yet seen Claude Code do anything too alarming. On my little money-flow side project, one time it deleted my config directory for the tool without asking me, and I was a little bit annoyed. But aside from that, I've very rarely seen it do anything that makes me too uncomfortable or makes me feel like there are safety issues. Prompt injection is the biggest risk, because that's a thing that's always there. I'm sure the Claude Code team is doing their best with safeguards, trying to prevent Claude Code from being hijacked by prompt injection, but I'm sure a determined attacker could still get through and cause it to, you know, send the contents of your home directory, or all your dotfiles or something, and get your API keys and your SSH keys and stuff, the keys to the kingdom. And that would be very bad.

Kostas

Wes, I want to ask you a question about open source and AI. I've seen people talking recently about being a maintainer in the open source world now that, with AI, people can go and make PRs much more easily. So the cognitive load for the poor maintainers, who already had a lot on their plate, becomes even heavier. On one side there's the value of AI accelerating building, and hopefully that helps open source too, but there are probably some issues there as well. What have you seen so far, and what worries you, if anything?

Wes

Yeah. Well, for better or for worse, and I guess fortunately for me, I'm not actively maintaining any open source projects at the moment. But my understanding, from talking to people who are maintaining Arrow, for example, is that there is an increase in submissions, and we ask people to disclose whether a pull request is AI-assisted or fully AI-generated. That definitely is creating additional maintainer burden. I think maybe a bigger issue I see with AI and open source is the fact that the AI labs themselves are beneficiaries of the 15 or 20 years' worth of open source code that's been created on GitHub and Stack Overflow and all over the place. For example, one of the reasons that ChatGPT and Claude and all of the coding agents are so good at using pandas is that there's an incredible amount of training data for using pandas. So you would think there would be some incentive for, say, OpenAI to contribute some nominal sponsorship, or to sponsor maintainers to support and maintain pandas, but they're not gonna. I mean, do you see them doing that? I don't. So essentially the AI companies are a little bit like the new oil drillers, right? I just rewatched the movie There Will Be Blood, and there are eerie parallels with what's going on now in the AI industry, in the sense that the AI labs are strip-mining all of the intellectual property that's been created by the open source community, and humanity in general, as far as code is concerned. And maybe AI will become sophisticated enough to generate the next generation of open source libraries and training data, to be able to continue to make itself better. I don't know, right? I think eventually the well is going to, maybe not totally run dry, but run drier than it is now, or hit some limit, in the sense that almost all the code being generated on the internet is AI-generated. Essentially it becomes the snake eating its tail at a certain point, and what's that going to do to the overall quality of the work being produced? Another challenge I see is that maybe there's a certain AI-induced laziness, where people will stop trying to solve problems that their AI agents can't help them with, or stop asking questions that can't be answered with the current generation of AI tools. That may have the effect of slowing or stymieing progress on new ideas and things like that. That's already a lot of stuff; let's see what else. It's an interesting problem for sure. I'm happy that Claude Code and coding agents exist; they're definitely making me more productive. But they also make me fearful for the future, especially for junior developers, junior engineers. I've been told that computer science students in universities today have a lot of existential dread about whether or not there's going to be a job waiting for them when they graduate.
And there's also the problem that if people don't learn how to do software development and build things the old-fashioned way, will they ever acquire the experience and know-how to become what we currently consider a senior or principal-level engineer? When I'm interacting with Claude Code, I'm relying on my 17 or 20 years of software engineering experience to give it feedback, to be able to judge whether or not it's doing good work, and to tell it what needs to be fixed. But if you come out of school and 95% of the work you're doing is being done by a coding agent, will you still acquire the same level of experience and judgment that you would have in the past? Probably not. That also means there will need to be a new emphasis on code literacy, on spending a lot more time reading and analyzing code, to be able to develop that judgment. Because it used to be that the judgment would come the hard way, by building, and by senior developers showing you: no, you need to apply this design pattern, or this application is too tightly coupled, you need to think about model-view-controller. You often learned these lessons the hard way, by building software that's tangled and too complicated and realizing, okay, I need to do refactoring; the mess you made is kind of your punishment for not doing things the right way to begin with. But from what I've seen so far, completely unsupervised and without much feedback, these coding agents will make a gigantic mess, and will claim they've solved problems when they haven't, or will have generated code that's totally incorrect, or unit tests that work around problems they couldn't solve by introducing mocks that pretend to do the right thing when actually the logic is incorrect. So it's a total minefield out there, combined with the fact that LLMs by their nature hallucinate, and some fraction of the time will generate incorrect or mistaken responses, and yet will, with full confidence, claim to you that the software is bug-free and production-ready and ready to ship. I feel like Claude Code tells me that every day: all right, it's ready to ship, production grade. And I'm like, I don't know about that. I open it up and, not quite; let's fix some things. I see a few problems here.

Nitay

One thing that I find interesting, and I made this analogy the other day to somebody, is that it's kind of like the early days of big data, to the point of the conversation we were having. In order to make things scale, you suddenly had to forgo a bunch of things that people took as a given standard, meaning transactions and so on. So you had NoSQL and eventual consistency, no more ACID, no more all this kind of stuff. I feel like that's exactly what's happening in the engineering world now, where we are forgoing the idea that software must be consistent. And for a lot of people who are coming of age now in engineering and coding and building software, I think the whole notion of SLAs and high reliability and so on is eroding, because they're used to, well, the thing hallucinates, the system goes down once in a while, okay, that's just how things are. Whereas us OGs, if you will, are like, that's not how you build systems; that's not actually how it should work. And I feel like it's interesting because at least some of the more modern waves of technical breakthrough and innovation have this super disruptive effect, with one particular capability that's very, very promising and interesting, but they get there by forgoing a bunch of other stuff. And then over time, people find ways to actually fix the things they forwent while keeping that capability. So over time we had NewSQL, or whatever you want to call it, where actually you can have a distributed database and still have transactions, and actually you can get consistency. And I feel like the same thing is going to happen over time with AI, or at least that's my optimistic take: that we will retain a lot of the productivity and the magic of it, if you will, but actually get to some level of quality of

Wes

Over time, maybe, not a million, but there's something to bringing everyone's expectations a little more back down to planet Earth, you know, feet on the ground: just viewing LLMs and AI as another tool to enhance human productivity, as opposed to one great leap for mankind, that kind of thing. Yeah, I'm definitely optimistic about its impact on individuals' productivity, about being able to provide leverage for the work that you're doing, and being able to delegate and farm out the mundane work that used to bog you down in the past. A lot of chore work; anything that's a chore in GitHub is prime work. If you're writing commit messages and they start with "chore", that's probably going to be AI work in the future. But there's still a need for humans in the loop, probably fewer humans, but still. I'm a little less confident about the narratives around mass unemployment on account of coding agents.

Nitay

This has been a super, super fantastic conversation. I'm curious to hear from you, summarizing a lot of what we talked about regarding the data infrastructure space, the changes in resources, the new needs for new file formats and so on: if you had your top feature request getting worked on today, what's one thing you would want that ecosystem to be working on or to build?

Wes

I think there's definitely an underinvestment in, or people aren't thinking so much about, the bridge between AI, the current generation of LLMs, and data itself. I think there's a general recognition that LLMs are language models; they're not models for looking at data. Even to provide a data set, to take Arrow data and stuff it into a context window, you're not going to get good results out if you're asking retrieval questions or data analysis questions on that data. So LLMs are having to rely on tool calling to be able to do anything meaningful with data. That does seem like an area that's going to need some work, so we need more of a collective reckoning about how we show LLMs data, especially tabular data, and get meaningful results out. I know there are some startups working on that problem. But that's one thing. You know, I'm involved, as an advisor and investor, with this company Columnar, which is doing Arrow-native work, really accelerating the ADBC, Arrow database connectivity, ecosystem. And it's one of these AI-resistant technologies: everybody needs to be able to connect more efficiently to databases, go faster, make things simpler. So ADBC makes a ton of sense as the successor to JDBC and ODBC, but then there's the whole question of MCP, which is JSON-based and isn't exactly the best interface to hook up to your data. So maybe we need a data context protocol, or some other agent data context protocol, that's more efficient and also helps resolve some of the general problems associated with having LLMs look at data; there's a bunch of research showing that currently they just aren't very good at it. And even how to format the data, even the best

Nitay

Yeah, the relationships and the types of things happening between the structured data and the underlying formulas, and things that might be happening between the data sets; it clearly needs a different way of thinking about it than just a sequence-of-tokens, language perspective. So I think your idea of a data-oriented MCP, or some sort of standard, is a great one. I would love to see something like that. Cool. So with that, I think we'll wrap it here, but it's certainly not the end, because we'd love to have you on again and continue the conversation. Thank you for this great episode. I really enjoyed the conversation. Thank you for all your contributions; I personally use many of the projects you've worked on and I love them. So yeah, looking forward to keeping the conversation going. Great. Thanks again for having me on.
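To make that last exchange concrete, here is a minimal, hypothetical sketch of the pattern being described: rather than stuffing a whole table into the context window, hand the model the schema plus a small, capped sample rendered as text, and leave real computation to tool calls. The helper name, row cap, and formatting choice are all invented for illustration; this is not an API from Arrow, ADBC, or MCP.

```python
# A hypothetical sketch of one way to "show" an LLM tabular data without
# dumping the whole table into its context window: send the schema plus a
# small, capped sample rendered as CSV text, and leave real computation to
# tool calls. The helper name and row cap are illustrative, not a standard.
import pyarrow as pa
import pyarrow.csv as pacsv


def describe_for_llm(table: pa.Table, max_rows: int = 20) -> str:
    """Return a compact text description: schema plus a capped CSV sample."""
    schema_lines = [f"- {field.name}: {field.type}" for field in table.schema]
    sink = pa.BufferOutputStream()
    pacsv.write_csv(table.slice(0, max_rows), sink)
    sample_csv = sink.getvalue().to_pybytes().decode("utf-8")
    return (
        f"Table with {table.num_rows} rows.\n"
        "Columns:\n" + "\n".join(schema_lines) + "\n"
        f"First {min(max_rows, table.num_rows)} rows as CSV:\n" + sample_csv
    )


# Example with a tiny in-memory Arrow table.
table = pa.table({"user_id": [1, 1, 2], "amount": [10.0, 2.5, 7.0]})
print(describe_for_llm(table))
```

A fuller version would pair this with a tool that actually executes queries (over ADBC, DuckDB, DataFusion, or similar) and returns similarly capped, text-formatted results.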
