From Spark to Databricks: Spark's Origins, Innovations, and What's Next

Engineers of Scale

From Spark to Databricks: Spark's Origins, Innovations, and What's Next - with Reynold Xin

0:00

-51:50

From Spark to Databricks: Spark's Origins, Innovations, and What's Next - with Reynold Xin

The insider's story of how Spark got created, what made Spark and Databricks successful, and the future of Spark - from Reynold Xin, co-creator of Spark and co-founder of Databricks.

Sudip Chakrabarti

Dec 13, 2023

Transcript

In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Infrastructure Software that have changed the course of the industry. We interview the engineering “heroes” who had led those projects to tell us the insider’s story. For each such project, we go back in time and do an in-depth analysis of the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects.

We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episode, we hosted Doug Cutting and Mike Cafarella for a fascinating discussion on Hadoop. In this episode, we are incredibly fortunate to have Reynold Xin, co-creator of Apache Spark and co-founder of Databricks, share with us the fascinating origin story of Spark, why Spark gained unprecedented adoption in a very short time, the technical innovations that made Spark truly special, and the core conviction that has made Databricks the most successful Data+AI company.

Timestamps

Introduction [00:00:00]
Origin story of Spark [00:03:12]
How Spark benefited from Hadoop [00:07:09]
How Spark leveraged RAM to monopolize large-scale data processing [00:09:27]
RDDs demystified [00:11:43]
Three reasons behind Spark’s amazing adoption [[00:21:47]
Technical breakthroughs that speeded up Spark 100x [00:27:05]
Streaming in Spark [00:31:13]
Balancing open source ethos with commercialization plans [00:37:45]
The core conviction behind Databricks [00:40:40]
Future of Spark in the Generative AI era [00:44:39]
Lightning round [00:49:39]

Transcript

Sudip: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host and partner at Decibel.VC, where we back technical founders building technical products. In this podcast, we interview legendary engineers who have created infrastructure software that have completely transformed and shaped the industry. We celebrate those heroes and their work, and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products. So today, I have the great pleasure of welcoming Reynold Xin, co-founder of Databricks and co-creator of Spark to our podcast. Hey, Reynold, welcome. [00:00:37]

Reynold: Hey, Sudip. [00:00:38]

Sudip: Thank you so much for being on our podcast. Really appreciate it! [00:00:41]

Reynold: Pleasure to be here. [00:00:42]

Sudip: All right, we are going to talk a lot about Spark, the project that you created and is behind the company that is now Databricks. You went to University of Toronto, is that right? [00:00:54]

Reynold: I did go to the University of Toronto, spent about five years in Canada, and then came to UC Berkeley for my PhD. So, been in the Bay Area for over, I think almost 15 years by now. [00:01:05]

Sudip: What brought you to Berkeley specifically? [00:01:07]

Reynold: It's an interesting point. So when I was considering where to pursue my PhD studies, I looked at all the usual suspects, the top schools, and one of the things that really attracted me to Berkeley - actually, it was two things. One is there's a very strong collaborative culture, in particular across disciplines, because in many PhD programs, the way it works in academic research is you have a PI, a principal investigator, a professor who leads a bunch of students, and they collaborate within that group. But one thing that's really unique about Berkeley is that they brought together all these different people from very different disciplines - machine learning, computer systems, databases - and have them all sit in one big open space and collaborate. So it led to a lot of research that was previously much more difficult to do because you really needed that cross discipline. The second part was Berkeley always had this DNA of building real-world systems. A lot of academic research kind of stop at publishing, but Berkeley sort of has had this tradition of going back to BSD, UNIX, Postgres, RAID, RISC, and all of that, actual systems that have a real-world industry impact. And that's really what attracted me. [00:02:20]

Sudip: And obviously, that's what you guys did with Spark, too. [00:02:23]

Reynold: Yeah, we tried to continue that tradition. So we didn't stop at just the papers. [00:02:27]

Sudip: Yes, absolutely. And I think that's a very common criticism of a lot of academic work, right? Great in quality, but doesn't go that last mile to get to production or get to actual users. [00:02:40]

Reynold: It's not necessarily a wrong thing either, just different approaches. You could argue, hey, let academia figure out the innovative ideas and validate them and have industry productize them. It's not necessarily the strength of academia to productize systems. To some extent, it's just different ways of doing things. [00:02:57]

Sudip: Let me go back to when you guys started Spark, and this is circa 2009. I'm guessing you had just joined the PhD program, and this was still the AMP Lab - Algorithms, Machine, People Lab - right? [00:03:11]

Reynold: Yeah. [00:03:12]

Sudip: Which, of course, is now Sky Lab and was RISE Lab in between. Can you give us a little bit of an idea of what the motivations were to start the research behind Spark? I mean, this was the time when Hadoop was still king, right? Like, why do Spark? [00:03:27]

Reynold: It was an interesting story. So Spark actually started technically the year before I showed up. By the time I showed up, there was this very early, early thing already. So Netflix back then in the 2000s, I think even a little bit before 2009, had this Netflix Prize, which is the competition they created in which they anonymized their movie rating datasets so anybody can participate in the competition to come up with better recommendation models for movies. And whoever can improve the baseline the most would get a million dollars. [00:03:58]

Sudip: Yeah, I remember that. [00:04:01]

Reynold: That was a big deal. Eventually, it was shut down for privacy reasons. I think maybe there were lawsuits that happened, but it was a big deal in computer science: and in the history of machine learning. And this particular PhD student, Lester, was really into this kind of competitions: and also a million dollars was a lot of money. [00:04:17]

Sudip: Sure. For a grad student in particular, right? [00:04:20]

Reynold: A grad student makes about $2,000 a month. So he tried to compete, and one thing that he noticed was that this dataset was much larger than the toy dataset he used to work with for academic research, and it wouldn't fit on his laptop anymore. So he needed something to scale out to be able to process all this data and implement machine learning algorithms. And one of the keys with machine learning is that you are not done when you come up with the first model. It is a continuously iterative process to improve it over time. The velocity of iteration is very important. And he tried Hadoop first because that was the unique thing - if you wanted to do distributed data processing, you used Hadoop back in 2009. And he realized it was horribly inefficient to run. Every single run takes minutes, and it was also horribly inefficient to write. So the productivity for iterating on the program itself was very difficult because the API was very complicated, it's very clunky. So he kind of walked down the aisle - the nice thing about having a giant open space with people from very different disciplines - and talked to Matei, who was also a PhD student back then, one of my fellow co-founders at Databricks. He said, hey, I have this challenge, and I think if you have those kind of primitives, I could really do my competition much faster. So Matei and him basically worked together over the weekend and came up with the very first version of Spark, which was only 600 lines of code. It was an extremely simple system that aimed to do two very simple things. One was a very, very simple, elegant API that exposed distributed data sets as if those were a single local data collection. And second, it could put or cache that data set in memory. So now you can repeatedly run computation on it, which is very important for machine learning because a lot of machine learning algorithms are iterative. So with those two primitives, they were able to make progress much faster for the Netflix Prize. And I think Lester's team even tied for the first place in terms of accuracy. [00:06:24]

Sudip: Did he get the money? [00:06:24]

Reynold: He did not get the money because their team were 20 minutes late in the submission. So they lost a million dollars for a 20-minute difference. So if Matei had worked a little bit harder and had Spark maybe 20 minutes earlier, Lester might have been a million dollar richer. [00:06:44]

Sudip: That's such an amazing story. Wow, I actually did not know that! [00:06:48]

Reynold: So when Spark started from research, it kind of started for a competition and really just the collaborative open space and the opportunity that all of those people just happened to be there at the right time led its very, very first version. Now, obviously Spark today looks very, very different from what the original 600 lines of code was. But that's how it got started. [00:07:09]

Sudip: One question I have since you touched on Hadoop, do you feel that Spark benefited from Hadoop being already there? Did you guys use some components of the Hadoop ecosystem? Like for example, if Hadoop hadn't existed, do you think Spark could have still been created? [00:07:23]

Reynold: Spark definitely benefited enormously from Hadoop early on. There are also baggages that we carry from Hadoop that up until today are still there. It's definitely benefited massively. The first example was that Hadoop solved the storage problem for Spark. And as a result, Spark never had to deal with storage. Spark more or less considered storage as a commodity. To a large extent, organization were able to store a large amount of data reliably and cheaply, which was key. [00:07:56]

Sudip: And this is HDFS in particular? [00:07:59]

Reynold: HDFS, yeah. And later on HDFS largely faded and got replaced by object stores. But Spark never had to worry about, hey, how do you store a large amount of data? And that was very, very important. And Spark piggybacked onto the Hadoop deployments. All the initial deployment of Spark were sort of onto Hadoop clusters themselves. So the existence of those clusters made it easier because if the hardware resources were not there, that would also have been very problematic for any user of large-scale data processing systems. And Spark leveraged a lot of the Hadoop code itself, especially the storage layer, retrieval as well as the data formats. So Spark definitely benefited enormously from it. At the same time, it's a lot of baggage, unfortunately. There's no free lunch. [00:08:45]

Sudip: Right, right. And we actually had Doug Cutting and Mike Cafarella, on the previous episode here, and Doug was talking about how he fully anticipated Hadoop eventually evolving and being replaced by a more advanced system. So it's sort of generations of advancement that happened. One of the key game-changers behind Spark at a very high level was Spark's use of RAM, the memory, keeping data in memory. Can you talk a little bit about how Spark uses memory and what are the benefits? Obviously, there are benefits in speed, iterative computation, but can you talk a little bit about that? And then one other question I'll ask at the same time is why was it so difficult for Hadoop to maybe adopt memory? [00:09:27]

Reynold: This is actually a very complicated topic, but let's try to maybe explain it in just a few minutes. There are two places where Spark uses memory in a fairly clever way that Hadoop didn't do and those really led to the dramatic improvement. One is the ability to simply keep the data in memory. And as we said earlier, that's very important for any kind of iterative computation in which you want to repeatedly scan the same data. So it was critical for machine learning workloads. But at the same time, it's also the same primitive that's very useful for any interactive data science because you're often looking at the same data sets over and over and over again when you're doing interactive data science. Those actually happened to be the first maybe two killer use cases of Spark. The second place, it's a little bit less about the direct use of memory, but because Spark exposes fundamentally much higher-level abstraction compared with Hadoop. Hadoop is very simple - it's Map, Shuffle, Reduce - MapReduce. It's a very simple paradigm. There's no concept of joins and no concept of filters. You basically create filters yourself and map it back to Map and Reduce. Spark exposes a higher-level abstraction that has the concept of filters, joins, code group, and all of this. And as a result, it can train a more complex computation by DAGs, direct cyclic graphs, of tasks. And as part of it, it knows, for example, hey, if you're running a Map right after a filter, you don't have to persist the output of filter onto disk or onto HDFS and then read it back in. As a matter of fact, it can just stream through them. So this particular optimization now removes the need for data to go to disk repeatedly in a larger computation. But it's a little bit less about just memory. It has to do with a combination of both, hey, let's just pass through data, stream through data and memory, as well as having the ability to express that more complex computation diagram. [00:11:23]

Sudip: So if you had like three stages in that DAG, Hadoop would’ve returned the results after every stage to the disk and read it back, whereas Spark can keep it in memory because it knows that DAG? [00:11:36]

Reynold: It knows it's the same thing. I mean, it has to do with having the completeness of the computation rather than only Map and Reduce. [00:11:43]

Sudip: Right. And then one of the primary concepts behind Spark is what is called RDD, Resilient Distributed Datasets. Can you talk a little bit about that for someone who might not be familiar? [00:11:54]

Reynold: I think maybe from two perspectives, one's from a system perspective, the other one's from the user's perspective. And I think the user's perspective probably should go first. The brilliant thing about RDD is that distributed computation used to be super difficult. And if you think about message passing, Hadoop made it slightly simpler to have the MapReduce concept. RDD took it much further and basically said, hey, if you have a large dataset, the way you program large datasets should just be like how you program a collection of data in memory on a single node. If you were to write a Java program, a Scala program, a Python program, everybody knows what a list is, an array is. They're all collections and there are ways to transform collections. Maybe, the way we should be programming large datasets should be identical to that. And that really means the API now for programming a distributed program against a large amount of data is as if you're just manipulating some data that's on a single node. So that dramatically decreased the complexity in the API surface. Now, the second big innovation in RDD is this. One of the big things with distributed computation is, hey, you have all those machines, you might have thousands of machines that could fail. How do you deal with failures? So RDD, while the user-facing API exposes just a bunch of collections, internally, it creates a lineage for every collection. So it doesn't literally materialize when you say, hey, this collection is just a filter on the previous one. Instead of materializing the collection, it is actually lazy. It just tracks, hey, this collection is simply formed through doing a filter on the previous collection. So it gives you that lineage of how you create the datasets. And again, when you really need the result, for example, you want to output the data, you want to get back how many rows there are, it would trigger this computation graph. And if there's a failure on any of the machines, it runs something very simple - it analyzes the computation graph and checks, hey, so what are the downstream, upstream dependencies? If that node fails, I just need to reproduce the data on that node. So it creates a minimum plan to reproduce the partial dataset on that node to handle failures. And it does that all gracefully, without the user having to know anything about it. So that's really the two brilliant parts of RDD. The first is, it creates a new programming paradigm for distributed data processing, which makes you program basically single-node collections. And the second is all the underlying system techniques to make fault recovery work really well. [00:14:29]

Sudip: Got it. And then in 2013, you guys introduced DataFrame, and then in 2015, you introduced Datasets. So what are those APIs and how do they connect or relate to RDDs? [00:14:43]

Reynold: So in 2014, we introduced DataFrames. I remember because in 2015, at Strata conference, which is a big conference, I gave a talk about DataFrame GA. After we started Databricks and we started working a lot more closely with the Spark users. I mean, we always work very closely with Spark users but after the company started, now we no longer had the academic research to worry about. [00:15:08]

Sudip: No paper to write. [00:15:10]

Reynold: No paper to write. No exams to take. No courses to take. And then we realized at some point that, even though there's a lot of unstructured data and semi-structured data out there, at some point people introduced structure onto their data. Structure could be, hey, here's an array of floating point numbers for my machine learning vectors. Structure could be, here's a column called email description, which is a pile of text, right? Those are all structures. Probably like 95% of the programs become some sort of loosely defined structured programs. And the collection of data, which while it's a very powerful abstraction, is still not high enough for structured programming. And that involves also a lot of user-defined functions. Like imagine if you're programming in Python, you want to traverse a list, you want to do something to it. You write a lot of code to say, for example, let's do a for loop across the list. And then for each of the element, you try to compare it to say the number one. If it's number one or it's greater than one, you keep it. If it's less than one, you ignore it. There's a lot of code you're writing, expressing in Python. The problem with that code is that the system cannot optimize it because it is Python code. It is Python code with very strong Python-specific semantics that we can't do anything about. But we do know that often people are just doing very basic comparisons. They're doing very basic expressions that exist in a more structured context. So the reason we created DataFrame was twofold. One is we want to raise the level of structure in the API even higher that makes structured programs easier to write. So users have to write less code. The second is we want to be able to capture more and more of the semantics of the computation. So instead of the user writing a lot of user code in Python, they will just express what they want to do in the DSL in Python still. But this time it's the DSL that tells us the semantics of all those computations. And then we can optimize that under the hood. As a matter of fact, we did. Before Spark 2.0, Python was probably 5 to 10 times slower than any JVM language on Scala and Java for Spark users. And even today, if you Google or ask ChatGPT, it's very likely ChatGPT will tell you that if you use Scala, you get better performance with Spark. And the reason for that is not because Spark itself behaves very differently. Simply because if you had used Python before, we had to run a lot of your code in Python, and the Python code would inherently be slower. But with DataFrame, we're able to capture the semantic information and actually generate an execution plan that, regardless of what language you use, will be the same execution plan and we'll optimize it and we'll make it run faster and faster over time. And that was incredibly powerful. And it basically got Python to have exactly the same performance as the JVM languages. So it's a big deal because these days probably 80% of the Spark users use Python. [00:18:27]

Sudip: Right. I mean, it's a language for data scientists. [00:18:30]

Reynold: Yeah, exactly. And then the Dataset API came, although I actually would not recommend people to use the Dataset API. In hindsight, I would not have created the Dataset API. [00:18:40]

Sudip: I see. [00:18:41]

Reynold: So Python is weakly typed, dynamically typed. Scala is statically typed. And a lot of Scala programmers really love typed information because that’s powerful. One of the things with DataFrame is that it becomes dynamically typed. So even when we declare DataFrame in Scala, it doesn't actually know what exactly is the type of that DataFrame, what are the columns in it. All of those are dynamically generated. So Dataset was our attempt and I think we did our best job out there to create a statically typed program that allows you to bind in runtime but give you that compile-time safety throughout the program so you would know, hey, this is actually a Dataset of, for example, a student. And student is a class with all these fields. And we give you the compile-time safety, but with the final validation that when you read this data in, the first time you create a Dataset of student, we actually run the validation to see all the data actually have all these fields you need and how does it map to the student class. The reason I said maybe in hindsight, I would not have created it is, as it turned out, it only served a very small percentage of users. And second, it's extremely complicated. The most complicated code in Spark are one, the scheduler - schedulers are always complicated. Second is all this type mapping and static sort of type, dynamic type binding stuff. Those codes are super hairy and very error-prone and very few people know how to change it. So a lot of investment for high technical complexity for very small percentage of users. It is a very cool idea though in abstract. [00:20:21]

Sudip: That's a fascinating story. Talking about programming language, one quick question I had was, the choice of programming language for Spark was Scala. Why was that? [00:20:31]

Reynold: Every month some new engineer joins Databricks and asks, why do we use Scala? It was actually an easy choice back then. In 2010-2009, because the Hadoop ecosystem was in Java, in order to be able to leverage most of the Hadoop ecosystem, Spark needed to be in the JVM. At the same time, we really wanted something that could be interactive. We felt that the interactive experience was extremely important. Java was not interactive. Even today, Java is not interactive. There's no way you just hand-type Java without an IDE. It's not concise. It is super verbose. So Scala was a language that was identified as basically much more concise than Java with a repo that we could actually hack. So now somebody could just interactively in command line start issuing a one-line Spark code and run across like a thousand machines in the cloud. And there was no other language that would fit the requirements of being in JVM and being interactive [00:21:33]

Sudip: That actually clears up a big question for me I've always had, why Scala? [00:21:39]

Reynold: I think Spark and Scala had a coexistent relationship and really a lot of Spark users came to Scala because of Spark. [00:21:47]

Sudip: Yeah, exactly. A lot of people learned Scala, including myself, just to use Spark. Now, shifting gears a little bit, Reynold, one of the things that obviously we all have witnessed over the last 10 years is the amazing adoption of Spark. Now that you are looking back and it's 2023 now, looking back at the last 10-12 years of Spark history, if you were to pick, let's say, three factors that you think were the most important in driving this adoption of Spark, what do you think those would be? [00:22:13]

Reynold: The first and foremost would be, I think, the focus on the end user. Everything I talked about so far I always started with a level of abstraction to make it much simpler to program. These days I don't need to evangelize Spark anymore because it's everywhere. But one thing I often say is that, many of you will come because of the performance improvement. Because you've heard that you can make your program 10 or 100 times faster than Hadoop, but you’ll really stay for the programmability. You'll never go back to a MapReduce program once you program Spark. We didn't end with just the RDD API. We have continued pushing the boundary, introduce streaming APIs, introduce DataFrames and all this. And they're all about how do we help the end users to be a lot more productive, make the APIs more expressive for the tasks they are supposed to do. And that focus on the end user programmability is key.

The second one would be a culture of innovation which is not surprising because a bunch of us came from academia and really wanted to apply bleeding-edge ideas to a real-world system and see how we could improve it. So Spark brought a lot of innovative ideas that were never ever done in other systems or only done in very niche systems. But, Spark brought those to the masses.

The third one I would pick is unlike Hadoop and many other systems in the past, Spark had a batteries-included approach, which means, hey, what are the most common things people want to do? Let's make it doable out of the box, instead of having another extension framework or project you have to go download, install and configure in the system. We have had, for the longest time, all the popular data types. You could just use the built-in APIs to read them, like popular data types for distributed computation, not necessarily for local stuff because the focus was on large-scale data sets. And we added the whole machine learning library to Spark if you want to run logistic regressions - just run it out of the box, you have it. [00:24:17]

Sudip: And you are talking about MLLib here. [00:24:19]

Reynold: Exactly. All of this made Spark much more powerful because one of the big pain points for a lot of users is having to configure a system with dependencies and app frameworks and extensions. Whereas in Spark, you install it, and now you have all of this power. [00:24:37]

Sudip: And going back to the usability piece, I mean, I like how you put it, like, come for performance and stay for usability, right? And going back to the usability piece, I believe one of those driving factors was when you guys added SQL support. You, in particular, I believe, were behind the original Shark project and then the Spark SQL project. I'm just curious, at a high level, was there any particular technical challenge that you had to resolve to bring SQL in a distributed data processing system? [00:25:09]

Reynold: Over the years we have done it, we've been redoing it. Actually, what Shark referred to is actually what I did during my PhD before Databricks even started. The way Shark was architected was that it basically took Hive, which is the SQL Hadoop system. We took the physical client generated by Hive and converted it into a Spark program. And then we were able to run SQL somewhere between 10 to 100 times faster than what Hive would be able to do back then. One of the main challenges there was really the Hive codebase, which was potentially the most spaghetti codebase I've ever seen. I'm hoping I'm not offending anybody here and it probably had nothing to do with the creator. I think it's just because it was created at Facebook for a specific set of use cases and then it suddenly got insanely popular. A lot of use cases and different requirements got piled onto it and people added a lot of code very quickly. It was so bad that when Databricks first started, in the first year one of our founding engineers actually quit after working on this codebase for a few months and that was the sole reason given. And up until this day when I talk to him he's like, yeah, honestly, that was the real reason. That codebase was so difficult to work with. So when Michael Armbrust came to Databricks in early 2014, initially he was actually hired to build a query optimizer for large-scale data. At some point I walked up to him and said, hey, you should kill this thing I created. This nine-headed monster is impossible to deal with and the code is too difficult to maintain. I think if we start from scratch and build it from scratch and have nothing to do with Hive, it would be substantially simpler. It would be much easier to actually evolve. And he did. And then that became Spark SQL which is honestly much easier to set up, much easier to maintain, much easier to iterate on. [00:26:55]

Sudip: So you basically ended up killing your own creation. [00:26:57]

Reynold: Yeah, a lot of people didn't expect that but I was jumping over it. Some people were like, are you happy that now it happened? I'm like, yeah, that's incredible that it happened. [00:27:05]

Sudip: That's so funny. And then going back a few years, talking about the other thing that kind of led to adoption of Spark, I remember back in, I think 2015 or somewhere around that, you guys made a number of key performance improvements to Spark Core. Can you talk a little bit about what those improvements were and what were some of the major architectural changes you guys had done? [00:27:31]

Reynold: So 2015-2016 was a big year in terms of improvement and that's when we released Spark 2.0. And the claim at the time was Spark 2.0 was an order of magnitude, 10 times faster than Spark 1. It might not be exactly 10 times for every workload, but it was dramatically faster. And a lot of it had to do, first of all, with the DataFrame API. We talked about DataFrame API that gives an even higher level abstraction. Now we convey semantic information so we can optimize under the hood. And we fully leveraged that in Spark 2.0. So Spark 1 released the DataFrame API, raised the level of abstraction, and enabled us to do the optimization in Spark 2.0. And so those optimizations basically boiled down to two big things. One is we hyper-optimized for Parquet and actually created the first vectorized Parquet reader. And that led to the use of Parquet itself as the columnar format, and Parquet MR at the time was basically implemented row-wise. It was not fully extracting the performance out of a columnar format. So we implemented vectorized Parquet and that got massive speed up just in terms of read performance. And then for the entire query execution engine, we put this idea that exists only in academic papers and academic systems called Hosted Code Generation and put that into Spark itself. The idea is we would take a query plan and generate the actual code that's needed to execute that query. As if you're building a query engine purpose-built just for executing that one query. And the reason that would make it much more efficient is because a generic system has a lot of overhead because it has to be generic. For example, it has a lot of function calls because your query engines in particular have the concept of an iterator model in which you chain a bunch of iterators and each sub-operation is an iterator and every iterator call, when you say next, it generates a virtual function call which is very slow. Then the compiler is not great at optimizing it because those are complex programs. But if you, for example, have a simple program that does the aggregation. If you were to purpose-build a program to do that, you would just write a for loop from the beginning of the data to the end of the data and then sum something up into a local variable and then return the local variable. It would be a three-line program. Compilers would be amazing at optimizing that. So we basically treated Spark itself as a compiler that compiles SQL queries into actual purposely written code for executing those SQL queries. And SQL here doesn't just apply to SQL. A big, cute idea in DataFrame is that DataFrame is not very different from SQL and SQL is not very different from DataFrame. They all generate similar query plans under the hood and once you generate those purpose-built code for that query, you compile it with the JVM. The JVM now is much better at optimizing that. So basically vectorization and CodeGen combined, we actually got to a dramatic speed-up. By the way, we didn't invent either of this. In academia, vectorization always existed but it was never built for the open-source SQL system. This kind of CodeGen was pioneered by Thomas Neumann at TU Munich and they had an academic system called HyPer which was eventually commercialized and got bought by Tableau. So we took that idea and really made it into adoption. Probably more people benefited from that because of that work and benefited from Thomas Neumann's idea than the HyPer system itself. [00:31:13]

Sudip: I want to talk a little bit about streaming. There was Spark Streaming and I believe now you guys have Structured Streaming, right? I'd love to understand a little bit what is the difference between those two and then one other related question is, originally the way I know streaming was implemented in Spark was through micro-batches, right? Is it still the case and how is Structured Streaming different from the original idea? [00:31:37]

Reynold: Incidentally, the relationship between Structured Streaming and Spark streaming is very similar to DataFrame and RDDs. It's all about raising the level of abstraction. Spark Streaming was a fairly innovative thing because it basically came with this insight which is that, if you just run a batch that is small enough and fast enough, at some point you have the approximate streaming. To the extreme, you run a batch for every row which is basically one row at-a-time processing. And that was pretty popular because it introduced a whole new class and workloads that previously people thought would be very, very difficult and require super specialized systems for. Just with a lot of IoT devices, sensor data, message buses becoming more popular as Spark was growing, streaming workloads started growing too. There was a very important problem with Spark Streaming which was actually not about micro-batches. I personally think the big micro-batch versus true streaming debate is overblown too much. The biggest problem with Spark Streaming was that the window of streaming is completely tied to the physical batch size. So the physical property of its execution is leaked into the programming abstraction. The most common operation in streaming is windowing. If you want to window by, for example, a bunch of records, in Spark Streaming, the way of design is that you would actually run that batch as a window. And this really limits a lot of optimization activities and also limits the programmability. It also limits another very important thing, which is, if you have late-arriving data, now you can't deal with it because you're already done with that batch. Your data shows up later and cannot be considered a part of that window anymore. So the time and everything has to be physically tied to the way it was executed. So after we did DataDrames, we started thinking about how to improve all those issues of streaming. One very obvious one was, many people want the concept of time that's logical instead of physical. Physical meaning whenever the event showed up to the system, that's the time it showed up. Logical means the event has some property called time, maybe a column or a field, and that is the actual time. So we thought that for streaming, let's completely decouple the execution from the API and just think about how you can program a streaming job. We looked at a lot of streaming jobs and realized the intent people express and the transformation people express, were not very different from a regular DataFrame. It's a program, except they want to run that in a continuous mode. And for any data that comes in, they want to be able to run that instead of just running once. So we came up, I think it was 2016 or 2017, I remember I was giving the announcement at Spark Summit and I said that, the easiest way to do streaming is that you don't have to think about streaming at all. So Structured Streaming basically introduced the concept of streaming DataFrames. There's no separate API for streaming. It is just a DataFrame. And the only difference is how you create a DataFrame. If you created a DataFrame using a streaming reader because it's coming from some message bus or even just a pile of files that might continuously arrive on object stores, you have a streaming DataFrame. And you can run the same operations just like your normal batch DataFrames. And everything else is the same. And that really made it much simpler to program streaming because now people don't have to learn a new paradigm. They don't have to think about, hey, what is windowing? Well, it turned out windowing is a group by some time. So that was very, very powerful. And that also led us to realize that a lot of people, when they do streaming, they're not even actually trying to do things in super real time. What they really, really want is they want the incremental-ness that happens in streaming as data flows in. Because virtually every data pipeline is continuous, they always have data coming in. They might come in once a day or once a week, but it's always new data coming in. It's very rare you have a data pipeline you run once and never worry about. Now, as data continues coming in, now you have to worry about, okay, so you don't want to reprocess all of your data all the time because that's highly inefficient. So people invented their own way of doing incremental processing. And it turned out the number one use case for Structured Streaming was people just using it for incremental processing because now they no longer have to worry about the state. The funniest thing is that they would build a streaming pipeline, they would run it, and it processed the data. The data's actually coming in, for example, once a week. So if right after half an hour, if they finish processing their data, they’d actually shut it down manually. And then a week later, when there's new data coming in they rerun their streaming pipeline. And because it's fault-tolerant and because it tracks all the incremental states, it just does this whole end-to-end incremental processing. Because of that, we introduced a concept called streaming once, which is literally that when you run the streaming pipeline it finishes processing all the data it currently sees, and then it shuts it down. And the next time you want to do it, you relaunch it again. And that itself, actually, it probably generates hundreds of millions of workloads just on Databricks today. People are running streaming in a batch mode. [00:37:07]

Sudip: Exactly. Yeah, that's what I was going to say. It makes it so much easier. But it's a unified programming interface. It's a unified understanding. And I'm sure it also made it easier on the engine side, right?

Reynold: Exactly, the unification. Yeah, because we don't have to build so many different engines. Another question is, so does it still run in micro-batch? There're actually different modes today. There are certain things that are running in micro-batch because for example, obviously, the streaming once mode, it is just a giant batch job, except it does all the incrementalism. There are also continuous modes in which it actually processes data all at a time, in which you can get from records coming in to records going out in some milliseconds. [00:37:45]

Sudip: I want to shift gears just a little bit and go to like, you know, circa 2013 when you guys started Databricks, right? And then you had this really fast growing open source community, which was Spark, and then you had this very early company, Databricks, right? And one of the challenges a lot of open source creators have, particularly when they start a company, is how do you balance between the open source community with the commercialization effort? How did you guys manage to do it? Were there certain guiding principles that you guys had that really helped you? [00:38:21]

Reynold: It's difficult. I think Ali, the CEO of Databricks often joked about that we should never start another company based on open source projects. The reason for that is you need two strikes, right? Normally to start a company and if you want the company to do well, you focus on the business problem of the company and with all the stars aligned and you're lucky and you work super hard and you have an amazing strategy, it works out. And that is a very difficult problem. If you add open source to it, now you have a two-step process. You have to have the open source project taking off and doing well, and that itself is not a trivial thing. It's probably a little bit easier than building proprietary software, but it's not a guarantee. Most open source projects don't work out. And then after that, you need to have an amazing business model and work towards a great strategy. And now the actual end-to-end success requires you have to multiply the two probabilities, which makes it very low. And probably the reason why there are very few companies that have heavy open source roots, and been successful. One thing we focus on a lot is we have different teams doing open source and they have different mandates. The open source teams are tracking so their KPIs are about adoption metrics of the open source projects. We always open source everything in API. We don't want to lock customers in because we have another kill API that makes it super difficult to migrate out if they ever need to. And we focus a lot on evangelism of the open source project. We try to walk a very careful line. For the longest time, the Spark Summit was called the Spark Summit and there was very little Databricks content in the first day keynote. It was all about open source. We saw this change as time went on and then as we also renamed the project. There was a lot of AI content so a lot of things have changed, but those are a lot of the things we did early on. There's creating a more cleaner delineation, both in terms of organizational structure, KPI tracking, events, and all that. But it's not an easy task. Actually, if we were to redo it again, I'm not sure if we could do it. [00:40:21]

Sudip: I'm laughing because that's coming from the founder of the most successful company in open source ecosystem, right? [00:40:29]

Reynold: I know a lot of people think of Databricks and say, hey, there's a great business to be made about open source. I'm not sure. I mean, it's not been doable, but it's not an easy job. [00:40:40]

Sudip: When you guys started Databricks, I remember 2014-15 time frame, I used to go to some of the board meetings. I remember there was a pretty heavy debate about should Databricks stay all cloud, or should Databricks go and support an on-prem version of Spark? And there was a lot of customer pressure because Spark adoption was increasing. [00:41:03]

Reynold: Everyone wanted to pay us - like ten million dollars for Spark. [00:41:05]

Sudip: And there was a lot of competitive pressure. Cloudera was really going at the time, Hortonworks was around. But you guys had a very deep conviction that it is either Databricks cloud or nothing. Can you talk a little bit about that, like where that conviction came from? [00:41:23]

Reynold: Yeah. I mean, in a kind of sense, we're really glad that we didn't go on-prem and become a support company. It ultimately comes down to the longer-term vision of where we see the puck is going and we try to go towards that rather than capturing what's right in front of us. And that's something easy to say because there are also scenarios in which people think too long term and they're dead before the long-term future and vision could even manifest. But I think one of the reasons that really got us going there was it actually had to do with Berkeley also. So at Berkeley in 2009-2010, there was this very famous paper called the Berkeley view of the cloud. [00:42:05]

Sudip: Yeah, above the cloud, right? [00:42:07]

Reynold: The view of the cloud. There are variants of the paper. The initial title and then later it's just a view of the cloud. But that paper is probably the most cited technical paper on cloud computing. It was so popular that many business schools incorporated that in various of their classes. Like my wife, for example, went to business school and read that paper. Not because I showed her the paper, but it was actually part of the class. And that paper predicted that the vast majority of the compute and computing infrastructure will move to the cloud and it will be finally true, this computing as a utility. Very few companies and houses have their own generators. They are just using electricity from the grid, right? It's something that's reliable enough, something that's viewed as virtually infinite and the economics simply makes sense. There are niche use cases, those will never go away, but the vast majority don't think about it as much. And that paper predicted that future by analyzing not just the technical foundations, but also what that type of allocations would be possible and the economics and the accounting and all that that have pretty profound influence on, I would say, the way we think about all this at Databricks. And I think some, maybe not everybody, but a couple of Databricks founders are also involved in that paper as well. So we always thought, hey, we really wanted the ability to be able to release software super quickly. We always wanted the ability to be able to provision and get a POC going in a matter of days instead of a matter of months, because now you have to go procure the hardware. We always thought a lot of the complexity of software, especially infrastructure software, is in the operations of it, not just in the building of it. And all of those are enormous values we can create, but we can only do that if it's in the cloud. And we're so early on at the time that we just view, and even today, we kind of view most software in our stack are pretty broken. We need to continuously improve them. Velocity is key. And having simplified environments, not having to worry about the 20 different variants of Linux, 50 variants of IP wireless thing and firewalls and all that would be enormously beneficial. So that's kind of what got us started. [00:44:27]

Sudip: And clearly, it paid out so well for you guys. [00:44:31]

Reynold: But, it is difficult Because every time we hire an exec, it's like, hey, I have a great idea to increase our revenue by 10x. [00:44:39]

Sudip: Yeah, go on-prem. Before I switch to lightning round, one final question about Spark. What's next for Spark? What's in the future? [00:44:49]

Reynold: I think the API is actually pretty good. We've been doing a lot of incremental refinement. For example, one of the biggest complaints of Spark is when you use Python, the error messages are simply nonsense because those include JVM stack traces and all that. And we actually spent a lot of time improving those to the point that you could probably still see it, but most of the time, you don't even feel there's a JVM that's running the Python program. So there's all this sharp edges we're trying to remove, which ultimately, they're not big fundamental ideas, but they're really the ones that create friction that gets in the way of everyday users. So a lot of work goes into that. With a lot of the GenAI use cases, it's 2023, everybody has to talk about GenAI, we have noticed a lot of the Spark programs that were generated by ChatGPT and all the other things. I bet there will be more Spark programs generated by machines than by humans in the next year. And most of them don't have good practices because many of them were generated and learned on a giant corpus on Stack Overflow and whatever is on Reddit, mostly based on malpractices from the past. Some were written before certain new things were introduced and were never updated. So we're working on this thing called the English SDK, which basically, if you think of it, it's really just how do we teach ChatGPT with the right prompting so it generates best practice Spark as opposed to generating malpractice Spark that now some other human experts come in and try to fix it. Things like this would make Spark's adoption go far wider and really benefit a lot of users that previously just felt Spark was too daunting. It can be somebody who is reasonably technical but they use Spark and they generate a Spark program in ChatGPT, they realize: oh it crashes for this data. But it turned out with a better chatbot, it will actually generate things that don't crash. We want to get to that point. I think another very big opportunity for Spark is that I fundamentally believe the biggest innovation of Spark is not performance, it's not just the use cases, but rather for something that's very old in data, which is data engineering. In the old school days, they used ETL tools like Informatica and all of that. And later, there's a little bit of Modern Datastack that got super popular the last couple of years and people write SQL queries again. There's something fundamentally wrong with SQL for data engineering. It's a very hyperbole statement, but what's fundamentally wrong with SQL? SQL was not designed with engineering practices in mind. You cannot easily test a SQL query. There's no abstractions in SQL. You could use recursive common subtable expressions, but again, it's just a pile of text. There's no variables, there's no for loops, there's no functions, there's no classes, there's no CI/CD framework, which means the key word is engineering. Engineering requires a sense of rigor, and rigor is backed by fundamental engineering principles. And what are the fundamental engineering principles? They are abstractions. I'm talking about software engineering here, right? They're abstractions. They are testing. They are CI/CD. They are how you roll out. And SQL is just terrible for all of those. I love SQL. I did my PhD in databases. I spent a decade optimizing, figuring out how to build systems to run SQL better. But we actually have a solution in front of us. It's real programming languages that can do everything SQL does, which is actually the Python Dataframe API. If you compose exactly the same program in SQL in a Python Dataframe API, it looks almost the same by readability. You can now actually test it using just vanilla Python code. You could decompose your program because it doesn't have to be a pile of text. You can have multiple files backing them. Each file is a Python file. You can have classes. You can have functions. All of this are great tools. By the way, it's just off the shelf and available because it's Python. People build far more complicated Python programs than the most complicated data engineering programs. So certainly you could use that. We haven't done a good enough job to explain to the world what's the value of Python. Python is now very popular in data science and machine learning. But to the SQL folks, many of them don’t even know Python. We haven't really done a good enough job in educating them. I think that would be one of the most important things that Spark can get right is to tell the world, hey, here's how you can do data engineering. And by the way, it's not that hard. It's just vanilla Python. And the Dataframe API is just the equivalent way of writing SQL. As a matter of fact, everything you're familiar with in SQL translates here, except now you have all the toolkits to do serious engineering. [00:49:39]

Sudip: That's fascinating. So, Reynold, we end every one of our episodes with three quick questions. We call it the lightning round, starting with the first one being acceleration. So in your view, what has already happened in Big Data that you thought would take much, much longer? [00:49:57]

Reynold: Yeah, the death of Hadoop. I would think it would take a lot longer for Hadoop to die. Enterprise software never really goes away, but Hadoop today is largely irrelevant. I want to see it happen faster, I thought. [00:50:09]

Sudip: You definitely had something to do with it, didn't you? [00:50:11]

Reynold: Yes, we had a part in that, but if you asked me 10 years ago, I would tell you by 2030, maybe you would see a rapid decline. [00:50:21]

Sudip: Then the second question around exploration, what do you think is the most interesting unsolved question in your space, you know, largely Big Data processing? [00:50:31]

Reynold: There's many. I'll just pick one. I think how do you combine unstructured data and structured data. Especially with GenAI now, there's great ways of processing unstructured data and analyzing unstructured data, but then how do you combine them is unclear. I think there's a lot of value that can be generated currently. [00:50:48]

Sudip: Final question, what's one message you want everyone listening today to remember? [00:50:55]

Reynold: I mean, to maybe the builders - the open source framework builders would be - put the user first, think from their shoes. Think about simplicity to the users, I would say. The last thing I said, Python, data engineering, you want to use a real programming language for data engineering to bring the engineering rigor into data. [00:51:15]

Sudip: Reynold, thank you so much for sharing your insights. This was a real pleasure and frankly a privilege to have you on. Thank you. [00:51:23]

Reynold: Thanks a lot for the invitation, Sudip. [00:51:25]

Sudip: This has been the Engineers of Scale podcast. I hope you all had as much fun as I did. Make sure you subscribe to stay up to date on all our upcoming episodes and content. I am Sudip Chakrabarti and I look forward to our next conversation. [00:51:41]

Engineers of Scale

From Spark to Databricks: Spark's Origins, Innovations, and What's Next - with Reynold Xin

Timestamps

Transcript

Discussion about this episode