Engineers of Scale
Flink: The Unified Stream and Batch Processing Engine - with Stephan Ewen

How Flink was built to unify batch and streaming data, and what's next for streaming—from its creator, Stephan Ewen

In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Enterprise IT Software that have changed the course of the industry. We interview the engineering “heroes” who led those projects to tell us the insider’s story. For each project, we go back in time and do a deep dive - historical context, technical breakthroughs, team, successes, and learnings - to help the next generation of engineers learn from those transformational projects.

We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In previous episodes, we have hosted Doug Cutting and Mike Cafarella for a fascinating look back at Hadoop; Reynold Xin, co-creator of Apache Spark and co-founder of Databricks, for a technical deep dive into Spark; and Ryan Blue, creator of Apache Iceberg, on the technical breakthroughs that made Iceberg possible. In this episode, we host Stephan Ewen, creator of Apache Flink, the most popular open-source framework for unified batch and streaming data processing. Stephan shares with us the history of Flink, the technical breakthroughs that made it possible to build a unified engine to process both streaming and batch data, and his take on the emerging streaming market and use cases.

Show Notes

Timestamps

  • [00:00] Introduction and background on Flink

  • [01:44] Relationship between the Stratosphere research project and Flink

  • [10:28] How Flink evolved to focus on stream processing over time

  • [13:08] Technical innovations in Flink that allowed it to support both high throughput and low latency

  • [18:56] How Flink handles missing data and machine failures

  • [21:47] Interesting use cases of Flink in production

  • [26:02] Factors that led to widespread adoption of Flink in the community

  • [29:18] Comparison of Flink with other stream processing engines like Storm and Samza

  • [37:07] Current state of stream processing and where the technology is headed

  • [39:47] Lightning round questions

Transcript

Sudip: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host. In this podcast, we interview legendary engineers who have created infrastructure software that has completely transformed and shaped the industry. We celebrate those heroes and their work, and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products.

Sudip: Today, I have the great pleasure of having Stephan Ewen, the creator of Flink, on our podcast. Hey Stephan, welcome and thanks for your time.

Stephan: Hello. Hey, thanks for having me. Nice to be here.

Sudip: All right! So you created Flink, which is one of the defining stream processing engines in Big Data. As I understand it, if I go back a few years, it all actually started when you were still doing a PhD in computer science at the Technical University of Berlin and working on a project called Stratosphere with Professor Markl. Can you talk a little bit about the relationship between Stratosphere and Flink, if any?

Stephan: When we started the work on Stratosphere in 2009, Hadoop was all the rage and the industry was starting to pick it up. But in academia too, everybody was looking at it. Hadoop had some cool properties, but it threw away a lot of wisdom from databases. And there were a bunch of projects trying to reconcile both worlds - how can we get some of the Hadoop goodness and the database goodness in the same project? Stratosphere was one of those. It was basically doing general-purpose batch Big Data processing on a mixed Hadoop-database engine. When we were done with that project, we liked the result too much to say, “okay, that's it, it goes into an academic drawer.” We actually made it open source and kept working on it. So the open source version of Stratosphere became Apache Flink. We had to pick a new name when we made it open source and donated it to Apache.

Sudip: What was the story behind the name?

Sudip: I kind of picked up on that reading some of your old interviews.

Sudip: Like, was there a conflict with another trademark?

Stephan: Yeah. I think in the end, all naming decisions are actually driven either by trademark conflicts or by search engine optimization. It was no different for Stratosphere. It was a horrible name to get good search ranking for, and it was already a registered trademark in so many industries that we felt there was no chance it would work for us. It was fashionable at that point in time to pick project names based on words from other languages, because they were usually not trademarked in the US, and we did the same thing. "Flink" just means quick and nimble in German. And because, you know, that's kind of where it started from, we picked that as the name. It was a good choice in hindsight, because it lent itself to a cute squirrel as the mascot - and we really liked that. To go back to the question of Stratosphere versus Flink: Stratosphere was the starting point; it was an experiment with mixed database engines. The first year of Apache Flink was pretty much that too - we were doing batch processing, graph processing, and ML algorithms on the same engine. Then we explored stream processing as a possible use case, because we knew there were some interesting properties in the engine. Stream processing wasn't a very big thing back then, but when we started exploring it and published the first APIs, we saw that folks really, really loved it. That's when we shifted the priority of the project to stream processing. So 2015 is when Flink went all in on stream processing.

Sudip: So what Flink is today actually has very little to do with what Stratosphere was - as happens with so many projects. Stratosphere was the research phase, the first prototype, and then you found your niche with Flink and, over time, almost rewrote the project. I want to go back a little to the motivations that started the research project. Like you said, Hadoop threw away some of the wisdom from the database world. Were there particular problems you guys had in mind that you wanted to solve with this new system that Hadoop could not?

Stephan: Databases have these two components - a declarative language and a query optimizer around it. They decouple how things get executed from how you specify the actual analytics programs. That decoupling is there for a good reason, because the number of programmers who can write very good low-level code is actually very, very small, and the number of data analysts who can do that is even smaller. Hadoop just threw that away and made everybody write MapReduce jobs. And of course, that's why one of the first things that came along was building SQL engines on top of Hadoop again. But then those threw away some of the generality of the MapReduce model. The nice thing about MapReduce is that you could implement a lot of things that you couldn't as easily express in SQL. So the question was: is there a nice place in between? Can we use DSLs or embeddings or compiler extensions in programming languages to capture some of the program semantics, optimize them in a database query-optimization style, and then map them to a suitable execution engine - getting some of the database query-optimization and code-generation magic, not just for SQL, but for a more general programming paradigm? That was one of the starting points. And then, of course, the Hadoop execution engine is sort of simplistic, right? Just map and reduce steps. Sure, it's a powerful primitive, but it's not the most efficient primitive for everything. If you look at the systems that dominate the space today, they all have way, way more general and powerful engines that support a lot of the operations which came from distributed databases. So we worked on what distributed execution primitives you need, and what's the minimum orthogonal set to build a very flexible engine that can run these programs that are going through a translation and query-optimization layer.

Sudip: If you were to describe Flink to an engineer who doesn't really know what it is, what would be a description that you would use?

Stephan: Yeah, so we called it on the website, and I think the website still does today, stateful computations over data streams. The idea of Flink is that the data stream is a very, very fundamental primitive. It's not just real-time streams - it's actually the shape in which most data gets produced. Unless you're running a particle accelerator experiment that literally dumps a petabyte on you in a few milliseconds, most data in real life gets created as a sequence of interactions from users with services and sensors. So almost all data actually gets created as a stream, and it's a very natural way to process it as a stream. You preserve causal relationships, and you take into account that there's a time dimension through which data moves, and so on. So Flink is really a system that's built to work with this idea of a data stream. It can be a real-time data stream - that's what it is probably most famous for and where its strongest capabilities lie, analyzing real-time data streams. But it doesn't stop there. It can also process archived data streams. If you think of a file as an archived data stream, or a snapshot of a database table as a stream of records, Flink works equally well with those. It even works when you mix them - a stream that keeps continuously moving versus a stream that is slow moving, or even a static snapshot of data. How do you combine those? So, Flink handles all sorts of stream processing - real-time stream processing and, if you wish, offline time-lagged stream processing - under a unified programming model and execution engine. Although the strongest differentiating parts, I would say, are still on the real-time side.
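
To make the "stateful computations over data streams" idea concrete, here is a minimal sketch of a Flink 1.x DataStream job that keeps a per-key counter in Flink-managed state. The bounded input is purely for illustration; swapping in a Kafka source would make the stream unbounded without changing the program, which is exactly the unification Stephan describes.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StatefulCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A tiny bounded stream for illustration; an unbounded source
        // (e.g. Kafka) would leave the rest of the program unchanged.
        env.fromElements("user-a", "user-b", "user-a")
           .keyBy(event -> event)
           .map(new RichMapFunction<String, String>() {
               private transient ValueState<Long> count;

               @Override
               public void open(Configuration parameters) {
                   // Per-key state, managed (and snapshotted) by Flink.
                   count = getRuntimeContext().getState(
                       new ValueStateDescriptor<>("count", Long.class));
               }

               @Override
               public String map(String event) throws Exception {
                   Long current = count.value();
                   long next = (current == null) ? 1 : current + 1;
                   count.update(next);
                   return event + " seen " + next + " times";
               }
           })
           .print();

        env.execute("stateful-count");
    }
}
```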

Sudip: If I go back and watch some of the presentations you guys did when you started Flink, I believe one underlying motivation was to unify batch processing with stream processing. It's almost like you built a batch processing system on top of stream processing, by making batch processing a special case of streams. I'm curious: what were some of the architectural innovations you had to make, and what were some of the challenges you faced in doing that?

Stephan: That's actually one of my favorite topics. I said before that conceptually, the stream is a very, very fundamental primitive. And it's not just at the conceptual level - it also works like that in the runtime layer. Conceptually, if you're very good at processing real-time streams, you can start thinking of a batch as really just a stream that starts and then stops. Maybe I'm not getting the data bit by bit, like records that I draw from Kafka, but as a firehose over a read from a file in an S3 bucket. Naively treating a batch as a special case of a bounded stream also works on the runtime. But then there are a ton of things you have to do differently to make it work well from a resource-efficiency and performance perspective. There are a few things about real-time streams and batch streams that are actually different, and it's important to understand how to take care of those different characteristics. On the real-time side, a stream always has a trade-off between latency and completeness. You have to understand how long you want to wait before you emit a result in order to reflect all relevant data, versus what latency you introduce by holding the result back. You have to make the engine very flexible in configuring that trade-off, because every use case has a slightly different requirement. That's what the watermarking mechanism in Flink ultimately boils down to: configuring that trade-off. The next thing is that in the streaming case, you're obviously interested in the time to first result - you want the result as soon as possible. On the batch side, because the data is a little older anyway, time to first result doesn't matter. You always assume that batch data is as good as it gets - I don't get more data because I wait longer. Those two things let you use completely different scheduling algorithms, data structures, and so on on the batch side, which ultimately is what you need to do to make it really competitive with batch systems. So what Flink did over the years was that it started as almost two engines in one system - a completely separate batch engine that exploited bounded streams, and a real-time streaming engine. Later, we started to unify the engines. And that had implications on every level. You need to build a network stack that's extremely good at decoupling execution stages for flexible scheduling. That affects the way you organize memory management in your network stack. It also means that the way the operators are implemented is very different. A batch join works fundamentally differently from a streaming join, where whenever an update comes from one side, you probe against what you remember from the opposite side. And then there are all sorts of optimizations you can do. But if you want all of that in the same engine, the same engine has to support both of those patterns very well. So one way to think about this is that what Flink evolved into over time was almost a very sophisticated, distributed flow-control network where different operators can activate, deactivate, and prioritize different inputs over others to exploit all these characteristics. It implements a sophisticated flow-control model over the network stack that extends to the entire streaming graph and couples the way all operators together decide how data flows. It's fascinating, and it also took a while to get there.
And it's still one of my favorite parts in the architectural stack. It was a fun learning and discovery and engineering experience over time.
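
The latency-versus-completeness knob Stephan describes surfaces in Flink's API as a watermark strategy. Here is a hedged sketch, assuming a hypothetical Event POJO that carries its own event-time timestamp:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

public class WatermarkSketch {
    // Hypothetical event type with an embedded event-time timestamp.
    public static class Event {
        public String userId;
        public long timestampMillis;
    }

    static DataStream<Event> withEventTime(DataStream<Event> events) {
        // Allow events to arrive up to 10 seconds out of order. This bound is
        // the trade-off knob: raise it to wait longer for stragglers (more
        // complete results), lower it to emit results sooner (lower latency).
        return events.assignTimestampsAndWatermarks(
            WatermarkStrategy
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                .withTimestampAssigner((event, previous) -> event.timestampMillis));
    }
}
```

Downstream, an event-time window only fires once the watermark has passed the window's end, so the ten-second bound directly trades added latency for more complete results.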

Sudip: How long did it take you guys to build this unified engine?

Stephan: I don't think it's actually fully done, to be honest. When did this unification effort start? 2016 or 2017, maybe. I would say good parts are done, but in certain areas it's still ongoing. Let me give you one example. One of the really interesting cases where you start to combine batch and streaming execution is when you have a stream that is a real-time stream but already contains a lot of data. Let's assume you're connecting to a Kafka topic that has half a year of data retention, but also has new data coming in. Say this models user interactions with your service, and you want to bootstrap a new recommender with that data. You'd really like to fast-process through the previous half year of buffered data, and then keep working in real time on the new data as it comes in. It's like bootstrapping a new service based on a stream of past data that keeps updating with real-time data, which is actually a really nice use case of this unified batch-streaming model. It's a really nice way to show how a data stream is a unifying fundamental piece, because it represents the old data and the new data at the same time. You don't even have to think about, okay, I first have to build a separate job that reads the old data and does something, and then I have to have a different system that updates it in real time. It's actually just one streaming computation, and the stream processor just crunches through the past and then updates the present. Conceptually, very, very easy, right? But practically not that easy to do, because you can process the entire stream in a streaming fashion, but it's going to be very inefficient. The batch data structures and algorithms tend to be much more efficient than the streaming ones. And if you only use the streaming ones - if you use the streaming way of executing over the past just because you want to continue with the streaming way of executing in the present - you end up wasting a lot of efficiency. It's a real thing, and people do make it happen, but they throw way more resources at it than they really want to. What you really want is a system that executes up to the present in a batch fashion and then switches over to executing in a streaming fashion. That work is still ongoing in Flink, to make that type of behavior really nice and smooth. The whole unified batch-streaming topic is gigantic, and I'm not sure it will ever be fully done, to be honest. It's almost the holy grail of data processing. If you build a unified engine, it's also really hard to make it competitive with the specialized engines on both sides. That's the added challenge. There's value in the unification, but ultimately you don't want to be much worse than any specialized system, because then you're not giving folks a good reason to use it. So that's always the challenge.
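
The backfill-then-live pattern maps naturally onto Flink's Kafka connector. A sketch under assumed names (broker address, topic, and group id are placeholders): starting from the earliest retained offset makes one job read the historical half year first, then keep running on live records.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackfillThenLive {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read from the earliest retained offset: the job first crunches
        // through the history, then continues on new records as they
        // arrive -- one program, no separate backfill job.
        KafkaSource<String> interactions = KafkaSource.<String>builder()
            .setBootstrapServers("kafka:9092")             // placeholder address
            .setTopics("user-interactions")                // hypothetical topic
            .setGroupId("recommender-bootstrap")
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        env.fromSource(interactions, WatermarkStrategy.noWatermarks(), "interactions")
           .print();

        env.execute("backfill-then-live");
    }
}
```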

Sudip: And since we're on this topic, I wanted to ask what you think about the micro-batch processing that something like Spark Streaming does. When you compare Flink's stream processing with what Spark has historically supported through micro-batches, what are the pluses and minuses?

Stephan: If you have a batch engine, then micro-batching is interesting - I don't know if I should say it is a hack - to do something that comes somewhat close to stream processing for certain use cases. The issue with micro-batching is not necessarily just the latency from the batching; as pipelines get more and more complex, if you try to execute those pipelines in a micro-batch fashion, you'll be going through so many stages of scheduling that it becomes very hard to make it well behaved. It just flows much more naturally in a real streaming engine that has the right operators, keeps the data moving, and then implements fault tolerance as an asynchronous background operation - which is the architecture of Flink. That being said, if you're coming from the batch processing side, you'd start with micro-batching first, because it's a very good match. On the streaming side, though, I always found that it hits a limit once you try to add all the things you need, and you get very close to implementing a streaming engine again.

Sudip: Right. You are basically forcing the batch processing system to behave like a stream and thereby introducing a lot of inefficiencies; so, might as well build a stream processing engine - got it. When you guys started, like you said earlier, Hadoop was really prevalent and widely used. Did you benefit from any of the Hadoop ecosystem?

Stephan: I would say so, in both a good way and a bad way. One of the things we benefited from was that, with the introduction of Hadoop, the idea of building data lakes liberated data out of central databases. That was a necessary precursor for everything else to happen. That goes for Flink as well, even though I would say Kafka was probably even more important for Flink: what Hadoop HDFS did for data at rest, Kafka did for data in motion. But still, in a way, Hadoop kicked off the whole thing. Plus, one thing Flink relies on for its fault tolerance model is a place to put asynchronous snapshots, and the first mass storage that most companies tended to have was HDFS. Today most folks use S3, but HDFS was the first checkpoint target for Flink. So we were usually working together with the Hadoop vendors. Plus, the big data community was almost synonymous with the Hadoop community at that point in time, so when we started, all the first Flink presentations were at the Hadoop summits. In spite of the things Hadoop did not do right, it was a super important step in the evolution. And they actually did a good job of building a community, which was really important to unlock projects like Flink and Spark.

Sudip: Can I switch to some of the interesting use cases that you guys were seeing with stream processing in particular? You had a company, Data Artisans, at that time, which was building on top of Flink. So I imagine a lot of your users and customers had stream processing use cases. What were some of the most interesting use cases that you saw where people were using Flink?

Stephan: It's very hard to pick, because there were so many interesting use cases. The first one that blew my mind, and one I was super proud of, was a telco company that used Flink to understand the health of their cell tower network and the utilization of all the cells to do load predictions. We once visited them and saw their central control room where they operated the entire network. You could actually see the dashboards and prediction graphs that were all powered by our technology. That was a really fun moment. For apps that I use day to day - whenever I know that Flink is running behind the scenes and I understand what is happening as I'm scrolling through Netflix or ordering an Uber - those things still fascinate me. In hindsight, the most mind-blowing use case was a smart city project in China and Southeast Asia, where they had an insane number of sensors to optimize traffic flow in real time, to reduce travel time for drivers and carbon emissions from exhaust. Also, in case there was an accident, they could clear certain roads for ambulances. How they changed the traffic lights, how they aggregated and predicted the traffic, running all of that through the streaming engine - it was pretty amazing. Maybe that's the most impressive one of all time, with real-world implications too. It was not something that just made people click ads. Those folks could actually see a huge reduction in exhaust fume emissions and a massive reduction in travel time for ambulances and so on. That's nice - literally saving lives.

Sudip: That's fascinating! Thank you for sharing that. Looking back over the last eight to nine years since Flink was created, I would say it's probably the most popular stream processing engine - one that has not only been really popular within the open source community, but has actually gotten into production in multiple real use cases. If you were to look back, what do you think were the most important two or three things that led to that community adoption, which in turn led to Flink being deployed in production?

Stephan: I think there were a few - some technical, some more community-related. On the technical side, it was the fact that Flink actually managed to combine streaming with low latency, good throughput, and stateful exactly-once semantics, while supporting large state - I mean really large state, in terabytes. Flink was able to handle all of that because it had a way of making snapshots incremental and using those for backup, rollback, etc. It was a stream processor that worked and scaled, and one you didn't have to worry about. The only dependency it really had was S3, which is probably the cheapest dependency you can have in data processing. Other than that, you didn't have to worry about adding more load to any other system. So it was fairly easy to just add new jobs with the needed S3 and compute. From the technical side, I think that was really good.
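
For readers who want to see what those technical properties look like in a job, here is a configuration sketch using Flink 1.x APIs (the bucket name is a placeholder):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take an asynchronous snapshot every 60s, with exactly-once semantics.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // RocksDB holds keyed state on local disk; 'true' turns on incremental
        // snapshots, so each checkpoint uploads only what changed since the last.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Durable checkpoint storage -- the cheap S3 dependency Stephan mentions.
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");
    }
}
```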

On the community side, the biggest thing I can actually pinpoint is this: when certain companies started adopting it - I think when Netflix and Uber and Alibaba started showing off how they were using Flink for very, very serious stuff - that pulled many other companies along in their tailwind. In hindsight, I think those were the years that made Flink. It still took a long time for Flink to hit mainstream adoption, but that was mostly because lots of companies just took a while to adopt stream processing - organizationally, or from an infrastructure perspective, it's much harder to adopt than batch processing. Mainstream takeoff did still take a while, but I would say adoption by those early, name-brand users helped other companies pick up stream processing.

Sudip: So those were like your lighthouse users, lighthouse customers, who had serious production workloads, and that drew the community and the rest of the market to use Flink.

Stephan: Yeah, in hindsight, that would be my perception. It's always hard with these open source projects to understand exactly what drives a user to pick them up. You can sometimes do surveys, but still, for 95% of the downloads, you never know where they actually came from. But my perception is those lighthouse users drew a lot of people, yeah.

Sudip: Talking about what was really interesting about Flink, you guys actually supported very high throughput and low event latency at the same time. Was there a particular architectural innovation that allowed you to do both together?

Stephan: There's one design principle in there that is exactly the opposite of batching: the idea that data has to keep moving through the system. Of course, there's a network boundary, and there's a certain amount of buffering, just because you want to use the network in an efficient way. But it's very short, and it's very much a locally isolated decision - there's no coordination with any other processes. Data keeps flowing, and all the fault tolerance work is an asynchronous background process. The way to make this work was based on the distributed snapshot algorithm. That sounds easy, but it's actually not that easy to make work in practice. Another thing that was a fortunate synergy was the relationship between Flink and RocksDB. The architecture we came up with in Flink aligned very well with RocksDB. It's not that we designed Flink for RocksDB, but when we found RocksDB, we felt it matched the Flink design principles pretty well: keep data flowing; make fault tolerance an asynchronous background process; if you do buffering, make it a local decision; and don't have synchronization barriers across parallel processes.
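
The local buffering decision Stephan describes is exposed as a single setting; continuing the configuration sketch above:

```java
// Stream records are collected into network buffers; a buffer is shipped when
// it fills up or when this timeout expires, whichever comes first. That keeps
// the flush decision purely local to each channel -- no cross-process
// coordination -- and gives a simple latency/throughput knob.
env.setBufferTimeout(5); // milliseconds; Flink's default is 100
```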

Sudip: That reminds me of the paper Mike Stonebraker wrote on the eight requirements of a real-time stream processing system. I think the first requirement was exactly that - keep the data moving, and don't add costly storage as a step in processing your messages. It sounds like you guys did it really, really well and hit it out of the park.

Stephan: Yeah, that work would definitely stand the test of time.

Sudip: Yeah, absolutely. How did Flink handle missing data or machine failures, given that you were avoiding persisting the data as much as possible to avoid storage latency?

Stephan: The core idea of the asynchronous snapshot method is that you rely on the fact that you can replay a certain amount of the data. But you don't want to replay too much, because if you replay too much, that's going to cost time. That's also where Flink worked really well with Kafka. The way it works is, on a failure you go back a couple of records in your Kafka topic, restore the snapshot state corresponding to the point when you were at a certain offset in Kafka, and then replay and rebuild from there. That means you don't have to track every tuple or every event. What you really need to do is create a new snapshot, a new fallback point, and tag that position and your progress with logical markers. Then you kick off an asynchronous background process that says, okay, let me persist everything I need now - but in the incremental variant, building on previous snapshots - to establish a new fallback point. On failure, it restores that fallback point, works out what offset in the input stream corresponded to it, and starts replaying from there. This is actually not unlike how many databases work. You have your transaction log where you persist stuff; once in a while you checkpoint, you flush all the materialized tables, so you can truncate your transaction log and don't need to replay too much from it. It's a similar philosophy, but generalized to a distributed dataflow graph.
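
The mechanism is easier to see in miniature. The following toy, single-process sketch pairs state snapshots with the input offset they correspond to and replays from the last snapshot after a simulated failure. It captures only the shape of the idea, not Flink's distributed, incremental implementation; all names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SnapshotReplaySketch {
    // A snapshot pairs the operator state with the input offset it reflects.
    record Snapshot(long inputOffset, Map<String, Long> counts) {}

    public static void main(String[] args) {
        List<String> log = List.of("a", "b", "a", "c", "b", "a"); // stands in for a Kafka topic
        Map<String, Long> counts = new HashMap<>();
        Snapshot latest = new Snapshot(0, Map.of());

        long offset = 0;
        for (String event : log) {
            counts.merge(event, 1L, Long::sum);
            offset++;
            if (offset % 3 == 0) { // "checkpoint" every 3 events
                latest = new Snapshot(offset, Map.copyOf(counts));
            }
        }

        // Simulated failure: discard live state, restore the last snapshot,
        // and replay the log from the offset it recorded.
        counts = new HashMap<>(latest.counts());
        for (long i = latest.inputOffset(); i < log.size(); i++) {
            counts.merge(log.get((int) i), 1L, Long::sum);
        }
        System.out.println(counts); // matches a failure-free run: {a=3, b=2, c=1}
    }
}
```

The printed counts match a failure-free run, which is the exactly-once effect of restoring state and input position together.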

Sudip: Which goes back to the original points you were making about bringing some of the database techniques into distributed data processing as the motivation behind your Stratosphere research project in the first place, right?

Stephan: Yeah, yeah. It's very interesting. I think the premise of that project was so true - there's a lot to be learned from old research in that area. So maybe here is an interesting anecdote: how did we actually come up with this whole algorithm in Flink? Yes, we were seeing stream processing as an interesting use case and saw that we could build interesting APIs. But how do you actually make it fault tolerant and efficient? There were lots of ideas floating around. There was micro-batching, which Spark had just written about, and there was Storm, which tracked things per tuple, but none of them really clicked with me. The one thing that did click with me was that I still had a paper reading list from my academic times. I had a paper on a system called Distributed GraphLab that showed an evaluation against a fault tolerance model based on a distributed snapshot method called Chandy-Lamport snapshots, from the early 1980s or so. And for some reason I thought, okay, that sounds interesting - I want to read more about that snapshot method from the early 80s. That turned out to be the inspiration in the end. The algorithm Flink uses is a modified version of that algorithm from the early 80s. The core idea of the Chandy-Lamport algorithm is this: assume you have lots of independent asynchronous processes running around, and at any point in time you want to know the state of the system as if a photographer had taken a picture. How do you even do that? That's what that algorithm was about, and that's exactly what we needed. We could actually simplify it a bit, because we had a very specific setup - a directed acyclic dataflow graph. But the core principle is the same: use logical markers to track and see where processes stand in relation to each other. There's a saying that everything was discovered by computer scientists in the 70s and 80s, and all we do is continuously rediscover the same things. I wouldn't go that far, but there's a lot of fundamental work that is still very, very relevant. Sometimes the biggest hurdle is that it's very hard to read those papers from back then, but it's absolutely worth not discarding that research.
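
For the curious, the logical-marker idea reduces to surprisingly little code. Below is a toy sketch of barrier alignment for an operator with two inputs, in the spirit of the Chandy-Lamport simplification Stephan describes. All names are illustrative, and Flink's real engine adds much more (including unaligned checkpoints); this only shows the core principle.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class BarrierAlignmentSketch {
    sealed interface Msg permits Rec, Barrier {}
    record Rec(String payload) implements Msg {}
    record Barrier(long checkpointId) implements Msg {}

    static final int NUM_INPUTS = 2;
    final Set<Integer> blockedInputs = new HashSet<>();
    final Queue<Rec> buffered = new ArrayDeque<>();

    void onMessage(int input, Msg msg) {
        if (msg instanceof Barrier barrier) {
            blockedInputs.add(input);
            if (blockedInputs.size() == NUM_INPUTS) {  // barriers aligned on all inputs
                snapshotState(barrier.checkpointId()); // a consistent cut of operator state
                forward(barrier);                      // pass the marker downstream
                blockedInputs.clear();
                while (!buffered.isEmpty()) process(buffered.poll()); // resume held-back data
            }
        } else if (msg instanceof Rec rec) {
            if (blockedInputs.contains(input)) buffered.add(rec); // belongs to the next epoch
            else process(rec);
        }
    }

    void snapshotState(long id) { System.out.println("snapshot for checkpoint " + id); }
    void forward(Msg m) { /* send to downstream operators */ }
    void process(Rec r) { /* the operator's normal logic */ }
}
```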

Sudip: That's a fascinating point you're making, particularly for up-and-coming engineers who are coming into the space. Going back to some of this old academic literature is really worth a shot.

Stephan: Yeah, although it's hard to find the relevant ones. What has worked really well for me is to keep looking at other systems that do interesting stuff. For us back then, it was Distributed GraphLab, which was a totally unrelated system in many ways, but still interesting in its implementation. It just fascinated me; I wanted to learn more about it. And then there's the related-work list. I really love related-work lists and citation lists and so on - dig through the ones that actually sound interesting, and use them as a bit of a guided search algorithm.

Sudip: Now, when you guys were building Flink, there were two others that were getting built around the same time, maybe even were already in the Apache Foundation at that time. One was Storm. The other one was Samza. I'm curious how you'd compare Flink to those two. Where did you guys really shine? What were some of the things that those engines did better?

Stephan: I would say Storm was almost the incumbent back then. It was just the first system that started stream processing, even though I don't think it even called itself a stream processor - it was probably more like event processing. Storm had this tricky notion of trying to track individual events and the flow of events in order to detect event failures and replay those, which for certain things was helpful. However, it was ultimately impossible to do that at scale - you cannot track fault-tolerance-related information on each individual event. It doesn't work, doesn't scale, not even for simple event pipelines, let alone more complex ones. Even for complex streaming SQL queries, this wouldn't work. Flink did not do that. The snapshot algorithm has the characteristic of tracking progress in a much more coarse-grained, much cheaper manner. Plus, we could actually do it exactly once, not at-least-once, right? That was something that made a big difference, I think. Plus, Flink was actually a stateful system. Storm later added some extensions to sort of manage state, but I think the integration never went as deep. So stateful, exactly-once, cheap fault tolerance was something that Flink had. In a way, even Spark Streaming had sort of a stateful nature. I think that's why folks were calling Flink and Spark the generation-two streaming engines back then, where Storm was generation one. And you can actually see that all newer engines are not built after the Storm pattern. If anything, they're closer to the Flink and Spark model, or the Kafka Streams model.

As for Samza, I think it was very, very interesting - in some ways the most interesting one at the time, and the one that, from a technical perspective, I was initially most worried about. It had this very interesting notion of how it used Kafka. But just by the limitations of Kafka back then, Samza could not give you exactly-once semantics on state, which Flink could do, and which was a big deal for users. The second part is that there's something good and bad about coupling stream processing to Kafka - and the same still holds true for Kafka Streams. It's good in the sense that if Kafka is your only dependency and you already have Kafka, it's a very easy way to get started. But at the same time, Kafka is a lot more expensive to operate than S3. Flink's minimal dependency is S3; Samza's and Kafka Streams' minimal dependency is Kafka. If you deploy a streaming pipeline that triples the load on your Kafka cluster, that's not the easiest thing for your infra team. Putting more data in the file system or the object store was always the easier thing, in my opinion.

I'm forgetting one thing: in hindsight, Flink also had a pretty good case in its APIs. The whole event time model - the combination of processing time, event time, and so on - was a big deal when it came out. It's something Flink didn't fully come up with by itself; a lot of it came out of the Google Dataflow project, which was later open sourced as the Beam project. So I think they were the pioneers of that. But even before Beam was created, Flink picked up those semantics from the research papers on Google Dataflow and was the first open source system to implement them. Those semantics were actually very, very powerful. I have an absolute love-hate relationship with that thing up to today, because it is crazy powerful, but it's also so hard to get right that I think it's one of the things that causes people the most operational friction. So if I ever rewrite Flink, I wonder whether there would be a way to just get rid of it.

Sudip: And that is a great segue into the next thing I wanted to ask you. If you were to build Flink again today, fundamentally, what would you do differently, if anything at all?

Stephan: I would do 100 things differently. Of course, hindsight is 20/20, right? To stick to the previous point: the whole event time model. I don't have a perfect answer on this. It's both an incredible piece and an annoying piece, because it allows you to do very, very powerful things. But how do you actually tune your watermark? How do you build your notion of completeness? Is that something you should even try to do, or is it something you can just never get right - and should we instead try to build APIs that don't assume we can? It's one of those questions I haven't answered for myself, but it's definitely something I would revisit. Things that are much clearer are on the runtime architecture side, and on the whole deployment side. When Flink started, it was pre-Kubernetes. The cloud was there, but not everything was there; a lot was still on-prem. I think Yarn from the Hadoop project was the most popular scheduler. The philosophy then was still that projects have to worry about their scheduling themselves - that's how the Yarn and Mesos integrations worked. So Flink actually built a lot of stuff that you just would not build into a system today; you would just try to be a nice player in Kubernetes, and that's it. Over time we evolved Flink through that, but if you look in certain areas, you can still see the heritage. It was built as a system to run on Yarn initially, not on Kubernetes in the cloud. So definitely, that would be completely different.

On the runtime side, cloud-native architecture would be the first big thing. That includes more than just the way you schedule and build your deployment. It also means the whole internal stateful engine - I would build that as a disaggregated engine. I would try to build something closer to a distributed file cache or file system caching layer on top of S3, and build the snapshotting into that layer rather than into the RocksDB layer.

I think there's some fascinating stuff to do there. It would be possible today because, first of all, systems like S3 - cloud storage and object stores generally - have gotten so much better. The network bandwidth is just phenomenally better than it was when Flink started out, and the networks in the cloud actually give you more stable latencies. That part is so much more mature that one should absolutely build on it today. But it wasn't the case when Flink started. So you can see quite a lot of those changes coming into the system bit by bit, but it's an evolution of an architecture that was built in a different era and then brought into this new one. You can probably take some shortcuts if you build directly for this era. So on that side: definitely more cloud native, taking cloud storage and cloud network architectures into account in a much better way.

Sudip: Cool! So my final question to you, Stephan, is this: you've built a system that is synonymous with stream processing today, and you've even built a startup in that space. Looking back over the last few years and where we are in the stream processing market, do you feel that stream processing as a space has realized the potential and promise that it had? Where do you think we are on that journey today?

Stephan: That's a very good question. I'm not sure I have a great answer, but I can give you some thoughts. In many ways, stream processing is still at the point where it's breaking through into the mainstream. I remember that when I stepped back from my active work on Flink about one and a half years ago, we started noticing a shift in the type of companies using it. It was not just tech companies anymore; non-tech companies started using it, asking for support, wanting to become customers. That's a fairly recent development, I would say. And it goes back to something I said earlier: compared to some other technologies, stream processing sits so much more centrally in your application architecture that it's not the easiest thing to adopt. You need so many other things in place for it to really deliver value, and I think that's just why it takes companies time. It took longer for the market to mature than I had expected, but I think it is actually here now; it is getting into the mainstream. So I would say it hasn't reached its full potential yet. At the same time, things like streaming SQL, which make stream processing more accessible, haven't been around for that long either, and they're really making a difference in the breadth of the audience that can use streaming. The streaming potential grows as it integrates more with other systems - databases that hold other data, and so on. There's a surge in popularity of CDC-style patterns, where you take change data capture data out of another database; that, I think, is one of the things really driving a new class of use cases in stream processing, and it also hasn't been around for too long. Looking beyond Flink, the integration of stream processing into other major databases by players like Snowflake, working on incrementally materialized views and so on, is actually good.
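
As a taste of why streaming SQL widens the audience, here is a hedged sketch of a continuous aggregation over a Kafka topic, written through Flink's Table API (topic, fields, and connector options are illustrative):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingSqlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // A table backed by a Kafka topic; options abbreviated for the sketch.
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  product STRING," +
            "  amount  DOUBLE" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'format' = 'json'," +
            "  'scan.startup.mode' = 'latest-offset'" +
            ")");

        // A continuous query: the result is an incrementally maintained
        // aggregate that updates as new orders stream in.
        tEnv.executeSql("SELECT product, SUM(amount) FROM orders GROUP BY product")
            .print();
    }
}
```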

The thing that I'm not 100% sure of is what actually will be the shape of stream processing 10 years from now. I think Flink will play a role. It might also look very different from anything we have today. So, I think it's still a fascinating space to watch. There's still a lot more to come and happen.

Sudip: Fantastic! That brings me to the final thing I wanted to ask you, which is our lightning round. So we end every episode with three very short questions. The first one is around acceleration. In your view, what has already happened in big data processing that you thought would take much, much longer?

Stephan: The fact that so many - in fact, virtually all - data workloads have moved to the cloud. When we started working on Flink, the assumption, working with a bank, was that data workloads would never run in the cloud, not in 20 years. And then, like five years later, they were all in the cloud. That was much faster than I had thought.

Sudip: The second question is, what do you think is the most interesting unsolved question in large scale data processing still?

Stephan: Yeah, that's an interesting one. I'm not sure it's the most interesting, but the most challenging one isn't actually a technical question. It's more of an organizational question, a question of integrating data. I feel that data processing technology is evolving, but it's well understood what needs to happen there. What's much harder is to actually make all those different systems, teams, sources, formats, semantics - everything - come together in a meaningful way. I feel that folks struggle a lot to make that happen, and I don't think there's a silver bullet for it.

Sudip: And my final question to you is, what is one message you would want everyone that is listening today to remember?

Stephan: It's a tough one. If I have to pick one, I would say the journey of Flink was a relatively long one, with many steps. If you take into account the research project that started it all, we began working on this in 2009, and maybe one and a half years ago it really started to get into the mainstream. That's 12 years after inception, right? That is a long time. So sometimes things take a while to mature, but it is worth sticking with them if you actually believe in the idea, if you actually see that there's potential. It's absolutely worth sticking it out - which maybe is a little ironic to say, because we actually sold our company relatively early. But at least we stuck with the project; we continued building and driving the vision. That seems to be something that is not very popular today. It feels like folks hop between opportunities relatively fast - okay, I'll do this for two years and then go do the next thing, and in six months, let me try another project. Sometimes it's worth staying on certain things. Specifically for a technology like Flink - I've said this a few times - work is still ongoing today on the simple idea of asynchronous snapshots, to make it more reliable, more stable, more scalable, more predictable, and able to handle ever larger scenarios. These things sometimes take years to mature, but when they mature, they become really powerful. So sometimes it's actually worth sticking it out.

Sudip: That's a really fascinating and insightful point! Stephan, it was really a pleasure and frankly a privilege hosting you today. Thank you so much for your time.

Stephan: Yeah, thank you so much. I had a lot of fun. Thank you for hosting me.
