Engineers of Scale

Iceberg: The Open Table Format for Petabyte Scale Analytics - with Ryan Blue

Why Iceberg was created, the technical breakthroughs that made it possible, and how it is driving the explosive adoption of the modern data stack - from Ryan Blue, creator of Iceberg

In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Enterprise IT Software that have changed the course of the industry. We interview the engineering “heroes” who led those projects to tell us the insider’s story. For each project, we go back in time and do a deep-dive - historical context, technical breakthroughs, team, successes, and learnings - to help the next generation of engineers learn from those transformational projects.

We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episodes, we have hosted Doug Cutting and Mike Cafarella for a fascinating “look back” on Hadoop, and Reynold Xin, co-creator of Apache Spark and co-founder of Databricks for a technical deep-dive into Spark. In this episode, we host Ryan Blue, creator of Apache Iceberg, the most popular open table format that is driving much of the adoption of Data Lake Houses today. Ryan shares with us what led him to create Iceberg, the technical breakthroughs that have made it possible to handle petabytes of data safely and securely, and the critical role he sees Iceberg playing as more and more enterprises adopt the modern data stack.

Show Notes

Timestamps

  • [00:00:01] Introduction and background on Apache Iceberg

  • [00:02:50] The origin story of Apache Iceberg

  • [00:10:38] Transactional consistency in Iceberg

  • [00:12:00] Where Iceberg sits in the modern data stack

  • [00:14:38] Top features that drive Iceberg’s adoption

  • [00:20:00] The technical underpinnings of Iceberg

  • [00:21:33] How Iceberg makes "time travel" for data possible

  • [00:24:08] Storage system independence in Iceberg

  • [00:30:13] Query performance improvements with Iceberg

  • [00:35:08] Alternatives to Iceberg and pros/cons

  • [00:40:45] Future roadmap and planned features for Apache Iceberg

Transcript

Sudip: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host. In this podcast, we interview legendary engineers who have created infrastructure software that has completely transformed and shaped the industry. We celebrate those heroes and their work and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products.

Today, we have another episode in our data engineering series. We are going to dive deep into Apache Iceberg, an amazing project that is redefining the modern data stack. For those of you who don't already know Iceberg, it is an open table format designed for huge petabyte-scale tables. It provides database-like functionality on top of object stores such as Amazon S3. The function of a table format is to determine how you manage, organize, and track all of the files that make up a table. You can think of it as an abstraction layer between your physical data files, written in Parquet or other formats, and how they are structured to form a table. The goal of the Iceberg project is really to allow organizations to finally build true data lake houses in an open architecture, avoiding the vendor and technology lock-in trade-off. And just to give you a little bit of history, Iceberg development started in 2017. In November 2018, the project was open-sourced and donated to the Apache Foundation. And in May 2020, the Iceberg project graduated to become a top-level Apache project. Today, I have the pleasure of welcoming Ryan Blue, the creator of the Iceberg project, to our podcast. Welcome, Ryan. It is so great to have you on, and thanks for making the time.

Ryan: Thanks for having me on. I always enjoy doing these. It's pretty fun to talk about this stuff.

Sudip: Awesome. Maybe I'll start with having you tell us the origin story of Iceberg. Why did you create it in the first place? I imagine it probably goes back all the way to your days at Cloudera. Can you help us connect the dots between your time at Cloudera, then Netflix, and now building Iceberg?

Ryan: Yeah, you are absolutely correct. It stemmed from my Cloudera days, and without Netflix it wouldn't have happened. So basically what was going on at Cloudera was you had the two main products, Hive and Impala, in the query space. And they did just a terrible job of interacting with one another. I mean, even three years into the life of Iceberg as a project, you still had to run a command in Impala to invalidate Hive state and pull it over from the metastore. And even then, you just had really tough situations where you want to go update a certain part of a table, and you have to basically work around the limitations of the Hive table format. So if you're overwriting something or modifying data in place, which is a fairly common operation for data engineers, you would have to read the data, union it, and sort of deduplicate or do your merging logic with the new data, and then write it back out as an overwrite. And that's super dangerous, because you're reading all the data essentially into your Hadoop cluster at the time, and then you're deleting all that data from the source of truth. You're saying, okay, this partition is now empty. It doesn't exist anymore. And then you're adding back the data in its place. And if something goes wrong, can you recover? Who knows? And so it was just a mess, and people knew it was a mess all over the place. I talked to engineers at Databricks at the time and was like, hey, we should really collaborate on this, fix the Hive table format, and come out with something better. But it was just not to be at Cloudera, because they were distributors. They had so many different products, and it was very hard to get the will behind something like this, which is where Netflix certainly helped.
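To make the contrast concrete, here is a hedged PySpark sketch of the "dance" Ryan describes versus the row-level merge that Iceberg later enabled. The table and column names are hypothetical, and the second statement uses Iceberg's MERGE INTO support in Spark, which did not exist in the Hive-format era being described:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-vs-iceberg-merge").getOrCreate()

# The Hive-era "dance": read the affected partition, apply your own merge logic,
# then overwrite the whole partition and hope nothing fails mid-write.
existing = spark.table("legacy_db.events").where("ds = '2024-01-01'")
updates = spark.table("staging_db.event_updates")
merged = (
    existing.alias("e")
    # keep only existing rows that are NOT being replaced...
    .join(updates.alias("u"), on="event_id", how="left_anti")
    # ...then add the new and updated rows back in
    .unionByName(updates)
)
# Destructive step: the old partition contents are dropped before the rewrite lands.
merged.write.mode("overwrite").insertInto("legacy_db.events")

# With an Iceberg table, the same change is a single atomic, row-level operation.
spark.sql("""
    MERGE INTO demo.db.events t
    USING staging_db.event_updates u
    ON t.event_id = u.event_id
    WHEN MATCHED THEN UPDATE SET t.payload = u.payload
    WHEN NOT MATCHED THEN INSERT *
""")
```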

Sudip: What was the scale of data at Netflix when you got there? Just rough estimates.

Ryan: I think around that time we were saying we had about 50 to 100 petabytes in our data warehouse. It's hard to tell how accurate that is at any given time because of data duplication. Exactly what I was talking about a second ago means you're rewriting a lot. And because write amplification is at the partition level in Hive tables, you would do so much extra writing and keep that data around for a few days, so it was very hard to actually know what was live or active.

Sudip: So you landed at Netflix. They have this amazing scale of data. You came from Cloudera with some very specific frustrations, and I would imagine some very specific thoughts on how you were going to solve them at Netflix scale. Maybe walk us through what that first year or two at Netflix was like. How did you go about creating what is now Iceberg?

Ryan: So we didn't actually work on it for a year or two. First, we moved over to Spark, and that was fun. Netflix had a pre-existing solution to sort of make tables a little bit better. What we had was called the batch pattern. This was basically that we hacked our Hive metastore so that you could swap partitions. So instead of deleting all these data files and then overwriting new data files in the same place, we just used different directories with what we called a batch ID. And then we would swap from the old partition to the new partition. And that allowed us to make atomic changes to our tables at a partition level. That was really, really useful and got us a long way. In fact, we still see this in practice in other organizations where Netflix data engineers have gone off and gotten involved in infrastructure at other companies. And it worked well, as long as the only changes you needed to make were partition swaps. So again, the granularity was just off. But this affected Iceberg primarily because we really felt the usability challenges. We had to retrain basically all of our data engineers to think in terms of the underlying structure of partitions in a table and what they could do by swapping. So rather than having them think in a more modern row-oriented fashion, saying, okay, take all my new rows, remove anything where the row already exists, and then write it into the table, they had to do all of that manually and then think about swapping, because you can't just insert or append. So the usability was terrible, and it really got us thinking about how do we not only solve our challenge here and make the infrastructure better, but also make a huge difference in the lives of the customers and everyone on our platform, so that they don't have to think about these lower levels. Usability has always been important to me in my career, so I had already said we're just going to fix schema evolution, but really thinking about the operations that people needed to be able to run on these tables, I think, influenced the design. And because I'd ported our internal table format to Spark twice by the time we started the Iceberg project, I also knew what needed to change in Spark.

Sudip: I can imagine. How would you describe the way you were storing data at Netflix? Is that closest to what we all now call a Data Lake? Is that roughly what it was?

Ryan: The terms are very nebulous here, but I think of Data Lake as basically the post-Hadoop world where everyone basically moved to object stores. And I know that Hortonworks was calling their system a Data Lake at the time, but I think we've really coalesced around this picture of a Data Lake as a really Hadoop-style infrastructure with an object store as your data store. And then, of course, now we have Lake Houses, right, which again plays a pretty meaningful role in Iceberg's adoption too.

Sudip: Any particular thought on what's driving this movement towards Lake Houses?

Ryan: Well, I'm going to have to answer a question with a question, which is, what do you mean by Lake House? Because there's a broad spectrum here. I kind of see it as where you can actually do warehouse-style analytics on Data Lakes. The way I think about a Data Lake is basically: you dump whatever data you want in whatever format you want, and then you need some kind of map, obviously, to navigate that. And that's when I think the Data Lake graduates into becoming a Lake House, in my mind.

Sudip: Okay. I think that that corresponds with what I hear from most practitioners.

Ryan: And this is absolutely what we were targeting with Iceberg. We were saying, let's take the guarantees and semantics of a traditional SQL data warehouse, and let's bring those into the Hadoop world. Let's give full transactional ACID semantics to the tables. Let's have schema evolution that just always works, no matter what the underlying data format is. And let's also restore the table abstraction so that people using a table don't need to know what's underneath it. They don't need to be experts in building that structure. They don't need to be experts in how it is laid out in order to query it, and so on. You should just be able to use the logical table. And I say this a lot: we're going back to 1992. What I think is difficult about the term Lake House in general is you've got all these other definitions. Starburst thinks about a Lake House as an engine that can pull together highly structured Iceberg tables, semi-structured JSON, completely unstructured storage, and documents and be able to make sense of those, as well as talk to transactional databases or even Kafka topics and things like that. So they see it from this vision looking downward at a disparate collection of data sources. On the other hand, at Tabular, my company, we think of it from the single-source-of-truth side: what all can I connect to Iceberg tables? We've got this rich layer of storage, and so the closest definition for me is that a Lake House is this architecture where I can connect Flink and Snowflake and BigQuery and use them all. And use them all equally. And then you still have other vendors who actually claim to be Lake House vendors and use it in their marketing, but don't support schema evolution. Don't actually support transactions and ACID semantics. So I think that there's just a really big space out there. For our part, we think of independence, or the separation of compute and storage, as sort of defining what we do. Because you get that flexibility to use any engine with the same single source of truth. We layer in security and make sure that no matter if you're coming in through a Python process on someone's laptop, or a big warehouse component like Redshift, you get that excellent SQL behavior and security.

Sudip: That makes sense. I want to double-click on something you just started talking about, which is transactional consistency. It is, I would say, one of the most important things that makes a Data Lake House in my mind. And you have a really interesting way of how you went about solving that. Would you mind expanding on that a little bit? What did it take, from a technical breakthrough standpoint, to be able to do it?

Ryan: I'm actually pretty happy with that being one of the defining features. I would throw in the SQL behavior stuff as well, because I think that really is a critical part. But yeah, I think that the ability to use multiple engines safely is probably the defining characteristic of the shift in data architecture over the next 10 years. I didn't even realize that until now. It was a bit of a happy accident, because we were looking to make sure that our users at Netflix could use Spark and Pig at the same time. And they could also have Flink and these background jobs that pick data up in one region and move it to our processing region, and all of those things need to commit to the same table. And what we didn't realize was that that's not something that you can do in the data warehouse world. In the data warehouse world, everything comes through your query layer and you're sort of bottlenecked on the query layer. Whereas we knew that we wanted to be bottlenecked by the scalability of S3 itself. And we needed to come up with this strategy for coordinating across completely different frameworks and engines that knew nothing about one another. And that was the happy accident. We were doing it so that we had transactions. I was literally thinking at the time, I want someone to be able to run a merge command rather than doing this little dance of how do I structure my job so that I'm overwriting whole partitions. That was our problem: we wanted to make fine-grained changes to these tables. We wanted to put more of that logic into the engine with the merge and update commands and things like that. But the background was that we had very different writers in three, four different frameworks. And we needed to support all of them. And in designing a system that could support all of them concurrently, we actually unlocked that larger problem, which was how do I build data architecture from all these different engines? And that's what we're seeing today as this explosion. We're seeing everyone has data architecture with four, five different data products and engines. What they want is to share storage underneath all of those so that you don't have data silos. So that it's not like, oh, I put all this data in one system and all this other data in Snowflake, and in order to actually work across those boundaries I have to copy, and I have to maintain sync jobs, and I have to worry: is it up to date? Did today's sync job run? Am I securing that data in both places according to my governance policies? It's a nightmare. And so we're actually seeing this happy accident of, hey, now we have formats that can support both Redshift and Snowflake at the same time. And like I said, I think that this is going to drive the next 10 years of change in the analytic data space.

Sudip: That's awesome! Can you tell us a little more about how you went about actually making ACID transactions possible in Iceberg? I'm also alluding to the really cool things you do with snapshots and so on. Can you talk a little more about the technical underpinnings of that?

Ryan: So the idea is to take an already atomic operation and have some strategy to scale that atomic operation up to potentially petabyte scale or even larger. I don't think that there is a limit to the volume of data you can commit in a single transaction. The trade-off is how often you can do transactions rather than the actual size of the transaction itself. So what we do is start with a simple tree structure. The Hive format that we were using before had a tree structure as well. You had data files as the leaves, directories as nodes, and then a database of directories essentially. So you had this multi-level structure. And what you would do is select the directories you need through additional filters, then go list those directories to get the data files, and then scan all of those data files. We wanted something where you didn't have to list directories, because in an object store that is really, really bad. In fact, at the beginning S3 was eventually consistent, and so we had massive problems with correctness that we had to solve. In fact, S3Guard actually came out of some of the earlier work on our team before I was there. That's a fun little tidbit. What we wanted to do with Iceberg was mimic the same structure, but with files that track all of the leaves in this tree structure. We started with data files at the bottom of the tree as the leaves, then you have nodes called manifests that list those data files, and then you have a manifest list that corresponds to a version of the table and just says, hey, these four manifests make up my table. And then you can make changes to that tree structure over time efficiently. You're only rewriting a very small portion of that tree structure, and it all rolls up into this one central table-level metadata file that stores essentially 100% of the source-of-truth metadata about that table. You've got the metadata file, a manifest list for every known version, manifests that are shared across all of those versions, and then finally data files that are also shared across versions. The very basic commit mechanism here is just pointing from one tree root to another. You can make any change you want in this giant tree, write it all out, and then ask the catalog to swap: I'm going from V1 to V2, or V2 to V3. And if two people ask to replace V2 at the same time, one of them succeeds and one of them fails. That gives us a linear history and is essentially the basis for both isolation, keeping readers and writers from seeing intermediate states, and atomicity.
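Here is a minimal sketch of that compare-and-swap commit idea, in Python with made-up class and function names. It illustrates the concept Ryan describes, not Iceberg's actual implementation:

```python
import threading
from dataclasses import dataclass

@dataclass
class TableMetadata:
    version: int
    metadata_file: str   # e.g. "s3://bucket/table/metadata/v2.metadata.json"
    manifest_list: str   # points at the manifests that make up this snapshot

class Catalog:
    """Toy catalog: the only atomic primitive is a compare-and-swap of the pointer."""
    def __init__(self, current: TableMetadata):
        self._current = current
        self._lock = threading.Lock()

    def current(self) -> TableMetadata:
        return self._current

    def swap(self, expected: TableMetadata, new: TableMetadata) -> bool:
        # Succeeds only if nobody else committed since `expected` was read.
        with self._lock:
            if self._current.version != expected.version:
                return False
            self._current = new
            return True

def commit(catalog: Catalog, write_new_tree, max_retries: int = 3) -> TableMetadata:
    """Optimistic commit: build a new metadata tree, then try to swap the root pointer."""
    for _ in range(max_retries):
        base = catalog.current()
        # Write new manifests, a new manifest list, and a new metadata file as
        # brand-new objects; nothing in the old tree is modified in place.
        candidate = write_new_tree(base)
        if catalog.swap(base, candidate):
            return candidate
        # Another writer won the race; re-read the base and reapply the change.
    raise RuntimeError("too many concurrent commits; giving up")
```

Because losing writers simply retry against the new base, readers always see either the old tree or the new one, never a partially written state.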

Sudip: Got it. And I imagine this architecture is also what makes time travel possible for Iceberg. Just for listeners, the way I think about time travel is that it allows you to roll back to prior versions to correct issues in case of data processing errors, or even to help data scientists recreate historical versions of their analysis. Can you help us understand a bit more how time travel has been made possible in Iceberg?

Ryan: Yeah, absolutely. So any given Iceberg table tracks multiple versions of the table, which we call snapshots. And because you have multiple versions of the table at the same time, that's one of the ways we isolate things. So if I commit a new version, say, snapshot 4 in the table, and you're reading snapshot 3, I can't just go clean up snapshot 3 and all the files that are no longer referenced. I wouldn't want to, but it might also mess up someone who's reading that table. So we keep them around for some period of time. By default, it's five days. And our thinking there is that's just enough time for you to have a four-day weekend and a problem on Friday, and you can still fix it on Tuesday and know what happened. So we keep them around for five days, and that actually means that by having that isolation, we have the ability to go back in time and say, okay, well, what was the state of the table five days ago? And we've iterated on that quite a bit to bring in the use cases that we're talking about. So now you can tag, which is essentially to name one of those snapshots and say, hey, I'm going to keep this around as Q3 2023. And this is the version of the table I use for our, say, audited financials or to train a model. And you want to keep that for a lot longer. So you can set retention policies on tags, and it just keeps that version of the table around. And as long as the bulk of the table is the same between that version and the current version, it costs you very little to keep all that data around.
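For readers who want to see what this looks like in practice, here is a hedged sketch using Spark SQL against an Iceberg table. The table name and snapshot id are made up, and the exact time travel and tag DDL depends on the Spark and Iceberg versions in use:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Read the table as of a wall-clock time, e.g. the state before Friday's incident.
spark.sql("""
    SELECT count(*) FROM demo.db.events
    TIMESTAMP AS OF '2024-01-05 00:00:00'
""").show()

# Or pin to an exact snapshot id, found in the table's history metadata table.
spark.sql("SELECT snapshot_id, made_current_at FROM demo.db.events.history").show()
spark.sql("SELECT count(*) FROM demo.db.events VERSION AS OF 8744736658442914487").show()

# Name a snapshot so it outlives the default retention window, e.g. for an audit.
spark.sql("ALTER TABLE demo.db.events CREATE TAG `q3-2023` RETAIN 400 DAYS")
spark.sql("SELECT count(*) FROM demo.db.events VERSION AS OF 'q3-2023'").show()
```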

Sudip: I know there's no typical scenario, but in general, if you compare the metadata with the actual data, what is the typical overhead of the metadata you store for a snapshot? What is the ratio of data to metadata?

Ryan: That sort of ratio? It depends. Wider tables with more columns are going to stack up more metadata. I would say that there's generally a 1000x, or three or four orders of magnitude, difference between the amount of metadata and the amount of data in a table. Now, that differs based on how large your files are, because you can really push that. We collect column range metadata for each data file so that we can filter at a very fine granularity; this is another way that we improve on the Hive table standard, where you had to read everything in a directory. We can actually go and filter down based on column ranges. You can really push that. So you can have column ranges that are effective, but then within those data files you can have further structures that allow you to skip a lot of data. And so where your trade-off lies is a little fuzzy. You can have lots of data files in a table, or you can compact them and have just a few data files in a table. It kind of depends on the use case which way you would go.

Sudip: That makes sense. You have that knob to turn depending on how granular you want to go. I want to talk a little bit about one really striking feature of Iceberg, which is you don't really have any storage system dependency. You could go from S3 to Azure to something else. I'm curious a little bit how you guys have achieved that. It's like, you know, on the Snowflake side, you have separation of compute and storage. In your case, again, you kind of broke away from any dependency on the storage system. Can you tell us a little bit more about how you're able to achieve that?

Ryan: I think when we started this project, people had been trying to remove Hadoop from various libraries in the Hadoop ecosystem for a very, very long time. And we actually still have a dependency on Hadoop; you can use those libraries, but you don't have to. What we wanted to do was lean into the fact that everyone was moving towards object stores instead of file systems. Hadoop is a file system, and it has things like directory structures and all of that, which led us to listing directories for our table representation and some of these anti-patterns that we're trying to get rid of with the modern table formats. What we wanted was a simpler abstraction than a file system, and to just go with what the object store itself could do. No renames, no listing, no directory operations. We just wanted simple put and get operations. We designed Iceberg itself to have no dependencies on these expensive, silly operations. We did that based on our experience actually working with S3 as a Hadoop file system. We had our own implementation of the S3 file system, and we kept turning off guarantees to make it faster. For example, if you go to read a file in S3 through S3A, the Hadoop implementation that talks to S3, it will actually list the path as though it were a prefix. The reason why is that you want to make sure it's not a directory. It could be a directory if there are files that have that file name, a slash, and then some other name. We said, well, that's silly. We always want this to go and read it as a file. We don't want that extra round trip to make sure, hey, this isn't a directory, is it? We were able to really squeeze some extra performance out of our I/O abstraction by letting go of this idea that it should function and behave like a file system. Let's embrace S3 for what it is. It's an object store, and let's use it properly.
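As a conceptual sketch of what such an object-store-only I/O abstraction might look like (this is not Iceberg's actual API; the interface and class names are made up, and it assumes a boto3 S3 client), the key point is that only get, put, and delete are required, with no rename, listing, or directory semantics:

```python
from typing import Protocol

class FileIO(Protocol):
    def read(self, path: str) -> bytes: ...
    def write(self, path: str, data: bytes) -> None: ...
    def delete(self, path: str) -> None: ...

class S3FileIO:
    """Minimal S3-backed implementation (bucket/key parsing simplified)."""
    def __init__(self, client):
        self.client = client  # a boto3 S3 client

    def _split(self, path: str):
        # "s3://bucket/key" -> ("bucket", "key")
        bucket, _, key = path.removeprefix("s3://").partition("/")
        return bucket, key

    def read(self, path: str) -> bytes:
        bucket, key = self._split(path)
        return self.client.get_object(Bucket=bucket, Key=key)["Body"].read()

    def write(self, path: str, data: bytes) -> None:
        bucket, key = self._split(path)
        self.client.put_object(Bucket=bucket, Key=key, Body=data)

    def delete(self, path: str) -> None:
        bucket, key = self._split(path)
        self.client.delete_object(Bucket=bucket, Key=key)
```

Because the abstraction needs nothing beyond these primitives, the same table can live on S3, GCS, Azure Blob, or HDFS without changing the format.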

Sudip: That's super interesting. Thank you for sharing that. And if I may step back a little bit, when you think about your Apache Iceberg users who are really getting value out of it, what do you think are the top, maybe three features they are particularly excited about? What really makes them move from whatever they are using today to Iceberg?

Ryan: Time travel is one of the big ones. And the others, I don't know that I would really consider them features. Granted, from a project perspective, they're features. But I've always talked to people about Iceberg by opening the conversation with: I don't want you to care about Iceberg. Iceberg is a way of making files in an object store appear like a table. And you should care about tables. You should be able to add, drop, rename, reorder columns. And you shouldn't care about things beyond that. So the purpose of Iceberg is to restore this table abstraction that we had hopelessly broken in the Hadoop days. If you talk about core features that are super important for Iceberg, it's stuff like being able to make reliable modifications to your tables. Having that isolation between reads and writes, and knowing that as I'm changing a table, you're not going to get bogus results. It is being able to forget about what's underneath the table. In the Hadoop days, you had to know: am I using CSV? Or am I using JSON? Or am I using Parquet? Because all three have different capabilities when it comes to changing their schema. It's all ridiculous, right? CSV: no deleting columns, but you can rename them. Parquet: you can't rename, but you can drop columns. Those sorts of things, they're just paper cuts. One of the things I'm most proud of with Iceberg is that we got rid of all the paper cuts. It just works. It doesn't always give you exactly what you would want, because it's a never-ending process to make it better, but it just works all the time. That's what I hear from people who are moving over to it: hey, I didn't have to think about this. The defaults are, for the most part, really good. It's dialed in. It just works, and I don't have to think about these challenges anymore. Those challenges are actually very significant. Things like hidden partitioning. I talked about restoring the table abstraction to the SQL standard in 1992. Partitioning in the Hive-like table formats always requires additional filters, because you're taking some data property that you like, like a timestamp (when did this event happen?), and chunking the data up into hours or days along that dimension and storing the data so that you can find it more easily. But in Hive-like formats, that's a manual process, which is ridiculous. Especially if you want to switch the granularity: if you want to go from hours to days, that is, again, ridiculous. You can't do that at all. In order to switch the partitioning of a table in Hive or Hive-like tables, you have to completely rewrite the data and all the queries that touched that table. Because it's all manual, the queries are tied to the table structure itself. You have a column that corresponds to a directory, and you have to filter by that column or else you scan everything. It's incredibly error-prone. On the write side, you have to remember: what time zone am I using to derive the date? Because that's different. Hey, which date column did I use for this? What date format, potentially, if you're going to a string, did I use? And if you get any of those things wrong and you just mis-categorize data, you might not see that. The users might not see that. It's just wrong. We've got an example that we did with Starburst. It's a data engineering example where you connect Tabular and Starburst together and you run through this New York City Taxi dataset example.
I love this example because you grab a month's worth of New York City Taxi data, you put it into a directory, you read it, and it's like there's stuff from 2008 even though it's 2003 in the dataset. You're like, where did all this extra data come from? And that's exactly what you're doing with manual partitioning all the time. Someone said, hey, this is the data for this month, and you just trusted it and put it there, and now when you actually go look at it, you find stuff from like 10 years in the future. It's a massive problem. And on the read side, you don't know if they put the data in correctly. You don't know if you're querying it correctly. You might not even query it correctly. You can get different results if you just say, hey, I want timestamps between A and B, versus if you say, I want timestamps between A and B but make sure I'm only reading the correct days. You can actually get different results because of where data was misplaced. It's a huge problem. Again, in Iceberg, we just said, hey, let's handle that. Let's keep track of the relationship between timestamps and how you want the data laid out. Let's have a very, very clear definition of how to go between the two, and then be able to take filters and say, oh, you want this timestamp range? Well, I know the data files you're going to need for that. So we do a much better job of just keeping track of data and eliminating those data engineering errors.

Sudip: So you're essentially abstracting away everything that data engineers had to do to keep the data consistent and transactionally valid, and just exposing the table format to the different engines.

Ryan: We take inspiration from SQL. SQL is declarative: tell me what you want, not how to get there. And we've really compromised that abstraction in the Hadoop landscape. We've said, oh, in order to cluster the data effectively for reading, you're going to want to sort your data before you write it into the table. Which is kind of crazy, because we don't actually sort it because we want it sorted. We sort it because we want it clustered and written a certain way, and the engine happens to do that if we write it that way. It's a really backwards way of thinking. In Iceberg, we've gone and said, okay, this is the sort order I want on a table. This is the structure I want on a table. We've made all of these things declarative. And then, and this has been a process of years, we've gone back and made the engines respect those things. So Spark and Trino will take a look at the table configuration and say, oh, okay, I know how to produce data into this table for the downstream consumption after the fact. You shouldn't have to figure out how to fiddle with your SQL query in order to get Trino to do the right thing, and the same thing in Spark. Everything should just respect those settings. We're just taking it back to a declarative approach, which is the basis of databases in the first place.
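To make hidden partitioning and the declarative table configuration concrete, here is a hedged sketch using Iceberg's Spark DDL. The catalog, table, and column names are made up, and the ALTER statements rely on the Iceberg Spark SQL extensions, so exact syntax may vary by version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-hidden-partitioning").getOrCreate()

# Hidden partitioning: the table tracks how event_ts maps to the partition layout.
# Queries filter on event_ts directly; there is no separate date column to get wrong.
spark.sql("""
    CREATE TABLE demo.db.events (
        event_id  bigint,
        event_ts  timestamp,
        user_id   bigint,
        payload   string
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Declarative configuration: state the clustering you want on the table itself,
# instead of hand-sorting data in every job that writes to it.
spark.sql("ALTER TABLE demo.db.events WRITE ORDERED BY user_id, event_ts")

# Evolving partition granularity is a metadata change, not a table rewrite:
# new data uses hourly partitions while existing files keep their old layout.
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD hours(event_ts)")
spark.sql("ALTER TABLE demo.db.events DROP PARTITION FIELD days(event_ts)")
```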

Sudip: I want to touch on one other thing, which is what does all of this mean for query performance? I constantly hear that people who are using Iceberg are really happy about the performance they get. Can you talk a little bit about how that is possible, given that you are of course doing a lot of very complex work under the hood?

Ryan: Yeah. There are a number of dimensions to this. The first thing that we did was we just made job planning faster. Our early, early presentations on Iceberg were like, hey, we had a job that took us four hours to plan. And that was just listing directories. And sure, we took that down to one hour because we were able to list directories in S3 in parallel. And then we said, okay, well, what if we actually made this tree structure and used metadata files to track the data files for that query? And we got it down to something ridiculous. And then we said, okay, well, what if we then add more metadata? Because we're not relying on a directory listing, which only tells us that a file exists. What if we kept more metadata about that file? What if we knew when it was added to the table and what column ranges were in that file? And things like that. Well, we could then use those statistics to further prune. And so we got this thing that used to not even complete and took four hours just to plan, and we got that down to 45 seconds to plan and run the entire thing, because we were able to basically narrow down to just the data that we needed. So, you know, there are success stories like that. We also have a success story where we replaced an Elasticsearch cluster that was multiple millions of dollars per year with an Iceberg table, because we were just looking up based on the one key. Now, Elasticsearch does a lot more than Iceberg, and I don't want to misrepresent there, but rather than an online system that is constantly indexing by just one thing, you can just use an Iceberg table and its primary index, because Iceberg has a multi-level index of data and metadata. It's another reason for that tree that we were talking about earlier. So there's that aspect: can we find the data faster, and things like that. Iceberg also has rich metadata provided by this tree that is actually accurate, because you know exactly the files you're going to read, which is a far better input to cost-based optimization and that sort of thing. So we just keep accumulating these dimensions. The thing that we've been exploring the most lately is actually unrelated to better or more metadata about the data. Granted, we are going into, can we keep NDV (number of distinct values) statistics and things that are better for cost-based optimizers and those sorts of things. But what we've been finding out at my company, Tabular, is that automated optimization, which is background rewrites of data, is amazingly impactful. So let me go into a little of why that's unlocked by Iceberg. In the before times with Hive table formats, you couldn't maintain data. You had to write it the correct way the first time, because you couldn't make atomic changes to that data. Whenever I touch data, your query might give you the wrong answer. The solution across your organization is: never touch data. Write it once and forget about it. That is, unfortunately, the status quo in most organizations. Write it once, try to do a good job, but don't worry about it after that. And that "try to do a good job" means only the most important tables actually get someone's attention. If you've got a table where there are a thousand tiny files, well, a thousand tiny files isn't really enough to worry about. It's when we get to hundreds of thousands of tiny files that we start caring. So there's this huge, long tail of just performance problems. It's not awful, but it's not great. Automated optimization is allowing us to basically fix things after the fact.
The fact that we fixed atomicity, the fact that we can make changes to our table, enables us to combine that with the declarative approach I was talking about earlier and actually have processes that go through, look for anti-patterns and problems, and fix them. Automated compaction is one example: you might have a ton of overhead just because you have tiny files, and that means you have more metadata in the table, and everything is generally slower, and it's not great. If you just have an automated service going and looking for all those tables with one partition with a thousand files in it that are all 5k, it turns out you can really save a lot on your storage costs and your compute costs and all of those things, and it's cheap. It's automated, so it's not like you're dedicating an entire person to doing that. This is not a problem that data engineers have time to care about, but it has tremendous value to the organization in terms of just everything functioning more smoothly.
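For reference, this kind of compaction and metadata maintenance can be triggered with Iceberg's built-in Spark procedures; a hedged sketch follows (the catalog and table names are hypothetical, and available options vary by Iceberg version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Coalesce small files into larger ones. The rewrite commits as a new snapshot,
# so concurrent readers are never disrupted.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Merge small manifest files to keep planning metadata compact.
spark.sql("CALL demo.system.rewrite_manifests(table => 'db.events')")

# Expire snapshots that have aged out of the retention window to reclaim storage.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```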

Sudip: So you can basically go back and clean up all those data stores without putting any engineers behind it.

Ryan: Yeah, exactly. I mentioned combining this with the declarative approach. What Iceberg has done is move a lot of tuning settings to the table. So we can tune things like the compression codec, the compression level, row group size, page size, different Parquet data properties. We can go and figure out the ideal settings for just that dataset, and then you can have some automated process go and use those settings. So there's a ton of data out there sitting in Snappy compression that's probably four or five years old, something like that. What if you had a process that could go turn that into LZ4 or something safely and cut down your AWS S3 bill by half? That's the kind of impact that we're talking about, and while you're cutting down the size of your data by 50%, that also generally makes your queries 50% faster. That is pretty amazing.

Sudip: Anyone with historical data would sign up for that in a heartbeat, right? Because we are all storing data that we never go back and compress, and yet pay AWS for.

Ryan: Let's hope. That's kind of where Tabular is building a business. So yeah, I hope so.
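As an illustration of table-level tuning, here is a hedged sketch of setting Iceberg write properties. The table name is made up, zstd is used purely as an example codec choice, and property names should be checked against the Iceberg version in use:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-tuning").getOrCreate()

# Tuning lives on the table rather than in every writer's job configuration.
# Any engine writing to the table, and any background rewrite, picks these up.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.parquet.compression-codec'    = 'zstd',
        'write.parquet.compression-level'    = '3',
        'write.parquet.row-group-size-bytes' = '134217728',
        'write.target-file-size-bytes'       = '536870912'
    )
""")
```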

Sudip: That's fantastic! I want to shift gears a little bit and talk about the other alternatives to Iceberg. There are, of course, Delta Lake and Apache Hudi that come up. How do you think about those solutions versus Iceberg? What are the pros and cons you can think of?

Ryan: I think most people are standardizing on Iceberg these days. I have theories for that, but one of the biggest factors is the reach of Iceberg into proprietary engines. It's largely been in the past year, year and a half, that you've seen really big investments from companies like Snowflake and Google. Redshift announced support. Cloudera is moving basically all of their customers onto Iceberg. The broad commercial adoption of Iceberg in particular is, I think, one of its biggest selling points. If you're thinking about this space, either as a customer or as a vendor, customers have the extra concern of which vendors support the thing. But vendors and customers have two things in common. The first is: is there really a technical fit? Is the technical foundation there? Formats like Hudi and Hive that use Hive partitioning don't support schema evolution, and they basically don't support ACID transactions either. Hudi has gone a long way towards getting close to ACID transactions, but it's still not quite there. I fired it up on a laptop last year and was able to drop data in a streaming job, duplicate rows, and all sorts of things. So it's just not quite there. I think Iceberg has that technical foundation, that commit protocol that we were talking about in the beginning. It's the same approach that Delta uses, by the way. That is really solid and reliable. So I think that that's one thing that both vendors and early adopters are looking at. Another technical foundation piece is just portability. Can you actually build in another language and support all of the table format's properties? The classic example here is bucketing: Hive used Java native hash codes, and reimplementing Java native hashing on some other system is not going to happen. So you basically had this whole set of things that was not portable outside Java. Hudi has many similar issues there, and I don't think that it's going to be something that can be ported to C++ or Python or some of these other languages that you really want to fold into the group. So I think the distinguishing factor between Delta and Iceberg on one hand and Hudi on the other is that that technical foundation is just not there. On the other side, I think you have the openness. Assuming that the technical foundation is there, is this a project that people can invest in? Can you become a committer? Can someone from Snowflake become a committer? And that really influences vendors' decisions, because it's a very tough spot to support a data format that is wholly controlled by your competitor. And I think that that's why most vendors have chosen to recommend Iceberg: because Iceberg has that technical foundation, and it's also a very neutral project. It was designed to be neutral from the start. Even before we realized that independence and neutrality in this space were going to be absolutely critical. Again, we didn't think of being able to share data between Databricks and Snowflake. We thought, we want Snowflake to be able to query our data at Netflix. We didn't think about the dynamics of these two giants needing to share a dataset. And do you trust Databricks to expose data to Snowflake, or Snowflake to expose data to Databricks? Do you think that that is performant? Do you trust it? There are a lot of issues there. I'm not making any accusations with those two in particular. All I'm saying is you want to know that your format is neutral and puts everyone on a level playing field.
And especially if you're a vendor, you want to know that you're on a level playing field with other vendors.

Sudip: Absolutely, and I think that comment about openness applies in particular to Delta. That's where I think both vendors and customers have their concerns. Let me ask you this question. Do you imagine you would ever build an engine yourselves? Since you own the table format, you could probably build a pretty optimized engine. Is that ever a plan?

Ryan: Let me correct you there, because we don't own the table format. We contribute to the community, but we participate on the same terms as everyone else. And that's actually a really, really important distinction for us, because we don't want it to turn into a closed community. The adoption of Iceberg has been fueled by the neutrality of the community itself and the ability for all of these different vendors and large tech companies to invest and know that they have a say in what's going on. So I'm always very careful to say that Tabular is not the company behind Iceberg. We're contributors to Iceberg, and we know quite a bit about it. I'm a PMC member, but I never want to do anything to compromise that neutrality. But getting back to your question, are we going to build a compute engine? I don't think so, because I think that there is a real opportunity here in independent storage. The vision that I have for our platform at Tabular is to be that neutral vendor. To be the one looking at your query logs coming in from some engine and saying, hey, if we clustered your data this way, it would save you a whole bunch of money. I don't think that we've ever had the incentive structure to do that, because we've never had shared data. The database industry for the last 30 years has lived under the Oracle model. And Oracle, I think, is just the most prominent example here. But we sell you compute, we sell you storage, and those two things are inseparable. The fact that all of your data is in this silo sort of keeps you coming back for more compute, for upgrades, for the contracts, and it sort of locks you in. And I don't think that that is unnatural. It's just been the reality for databases for 30 years. Now, the shared storage model is overturning that. And so I think there is a really huge opportunity here for us to rebuild the database industry where you don't automatically get those compute dollars simply because someone put data into your database, or wrote it with your database. And so I think that it's actually structurally important for the market to try for this separation, for independent storage that tries to make your data work as well as possible with all of the different uses out there. We don't want anyone tipping the scales and saying, you know, this table is going to perform better if you use our streaming solution or our SQL solution. I think you want an independent vendor sitting there that is looking out for your interests as a customer and really helping you find those inefficiencies. Now, I know that some vendors do a great job of making things faster all the time and things like that. I just don't think that the incentives are quite aligned there. And in fact, throughout the majority of analytic database history, you have largely been responsible for finding your own efficiencies. Oh, you need to put an index on this column. You need to think about clustering your data this way. And you hear horror stories where, oh, we didn't have the data clustered right, and we were spending $5 million too much a year on this use case. I think it is the responsibility of independent storage vendors to find those things and fix them for you. And so I really think that that could be a valuable piece of the database industry in the future, and that's where we're headed as a company. Not towards compute, but actually being a pure storage vendor.

Sudip: And I'd say this probably has not been tried before at the layer that you are operating at.

Ryan: Yes, at the lowest storage layer maybe, but not at the layer that really sits in between the engines and the storage layer. Definitely have not seen that. It hasn't been possible to share storage. And it's going to, by the way, have a huge impact on other areas. The example I always give is governance and access controls. Where if you can now share storage between two vastly different databases, it no longer makes any sense to secure the query layer instead of the data layer. And so we have all of these query engine vendors that have security in their query layer. Well, that does you no good if you have three other query layers, or if you have a Python process that's going to come use DuckDB and read that data. I mean, it's a very messy world, and I think we're just starting down this path of what does the analytic database space look like if you separate compute and storage?

Sudip: Last question before I ask you a couple of lightning round questions. What is the future roadmap for Iceberg? And when I say future, I mean, let's say in the next 18 to 24 months. What are some of the big things on your mind that you want to tackle?

Ryan: There are a few things. As always, getting better, improving performance, and things like that. A couple of the larger ones are multi-table transactions. Iceberg is actually uniquely suited to be able to have multi-table transactions just like data warehouses. So we're going further and further along that path of data warehouse capabilities. We're also coming out with cross-engine views. So you'll at least be able to say, hey, here's a view for Trino, and here's the same view for Spark, and have a single object in your catalog that functions as a view in both places. Now that you're sharing a table space across vastly different engines, we also need to bring along the rest of the database world. Once you're thinking about having a common data layer, you need to expand that a little bit. You need to expand it to access controls across databases. You also need to expand it to views. Views are a really critical piece of how we model data transformation and different processes without necessarily materializing everything immediately. We're also working on the next revision of the format itself, which is going to include data and metadata encryption, which is a big one, as well as getting closer and closer to the SQL standard in terms of schema evolution. Schema evolution is actually really critical, and this has been coming up lately in CDC discussions. CDC is a process where you're capturing changes, hence change data capture, from a transactional database, and then using that change log to keep a mirror in your analytic space up to date. And if you can't keep up with the same changes from the transactional system, you can't really mirror that data. So schema evolution is extremely important here. I already talked about how, if you're just using Parquet or something, you can't rename columns. If someone renames a column in that upstream database, you need to be able to do the same thing. Iceberg is, I think, ahead of the game in that we handle most schema transitions, but we actually need more. We need more type promotions, we need default value handling, things like that to really get to feature parity with the upstream systems. Now these are mostly edge cases, but it makes a really big difference when you're running a system at scale and someone in a different org made a change to their production transactional database, and now you've got a week's worth of time to reload all that data. So we're working on filling out the spec there and adding some new types, stuff like timestamp with nanosecond precision, and then blob types, I think, as well.
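For context on the schema evolution that already works today, here is a hedged sketch using Iceberg's documented Spark DDL (the table and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-schema-evolution").getOrCreate()

# Columns are tracked by ID, so renames and drops are metadata-only changes:
# no data files are rewritten, and queries on other columns keep working.
spark.sql("ALTER TABLE demo.db.customers RENAME COLUMN zip TO postal_code")
spark.sql("ALTER TABLE demo.db.customers DROP COLUMN legacy_flag")

# New columns simply read as null for data written before the change.
spark.sql("ALTER TABLE demo.db.customers ADD COLUMNS (loyalty_tier string)")

# Safe type promotion, e.g. widening an int to a bigint.
spark.sql("ALTER TABLE demo.db.customers ALTER COLUMN account_id TYPE bigint")
```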

Sudip: Wonderful. A lot to look forward to then in the next 12 to 18 months. This has been phenomenal, Ryan, so thank you so much for sharing the whole Iceberg story with our listeners. I want to end with one quick lightning round. We do it at the end of all of our podcasts. Just, you know, real quick questions. So the first one is what has already happened in your space that you thought would take much longer?

Ryan: I have been fairly surprised at how rapidly large companies have not only adopted Iceberg, but really put money and effort behind it. AWS is building extensions into Lake Formation and Glue, which is a very significant investment. The work that Snowflake is doing is incredible. We're very happy to have seen that Databricks, even though they back Delta, has also added Iceberg support to their platform, basically acknowledging the fact that Iceberg is the open standard for data interchange. I think I've been pretty shocked there. I would definitely have thought it would take longer for companies to ramp up, but they seem to have really hit the ground running.

Sudip: Fantastic. Second question. What do you think is the most interesting unsolved question in your space?

Ryan: Independent versus tied storage.

Sudip: Tell me more. What do you mean?

Ryan: I have a blog post, The Case for Independent Storage, that summarizes my thoughts here, which we've talked about. Basically, the rise of open table formats has created a world in which you can share data underneath Redshift and Snowflake and Databricks and all these commercial database engines. Everyone is rushing, as in my answer to the last question, to support and own data in open storage formats. I think a big question is going to be whether people trust a single vendor that they also use for compute to manage the storage that is used by other vendors. I think that is probably the biggest question. I've gone all in. I've placed all my chips on the table. We think that it's going to be independent. That's a central strategic bet of Tabular as a company: that people are going to want an independent storage layer that connects to and treats all of these others the same, and tries to represent your interests as a customer and get you the best performance in whatever engine is right for the task. Is that going to matter in the marketplace? Or is it going to be that people are perfectly happy having an integrated solution where I buy storage and compute from one vendor and I sort of add on another, say, streaming vendor that uses that same storage? Is that the model, or is it going to be independent? I think that the transition is to independence, and that's because it destroys that vendor lock-in. Even if you think today that this vendor for compute is head and shoulders above the rest and is going to be amazing, what about in five years or in ten years? Are they going to get disrupted? Are you going to have to move that data? I'm very bullish on the case for independent storage, but whether it will matter to customers is not something I get to choose.

Sudip: Spark versus Hadoop is a classic analogy there; the compute engine is getting disrupted. Last question: what's one message you'd want everyone to remember today?

Ryan: I would just say go check out Iceberg. It can make your life a whole lot easier. That's what we wanted to do.

Sudip: Absolutely. This has been the Engineers of Scale podcast. I hope you all had as much fun as I did. Make sure you subscribe to stay up to date on all our upcoming episodes and content. I am Sudip Chakrabarti, and I look forward to our next conversation.
