In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Enterprise IT Software that have changed the course of the industry. We interview the engineering “heroes” who led those projects to tell us the insider’s story. For each such project, we go back in time and do a deep dive into the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects.
We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episodes, we have hosted Doug Cutting and Mike Cafarella for a fascinating look back on Hadoop; Reynold Xin, co-creator of Apache Spark and co-founder of Databricks, for a technical deep-dive into Spark; Ryan Blue, creator of Apache Iceberg, on the technical breakthroughs that made Iceberg possible; and Stephan Ewen, creator of Apache Flink. In this episode, we host Joseph Hellerstein, Professor at UC Berkeley and founder of Trifacta and RunLLM. Joe helps us step back and explore the evolution of Data Engineering over the past several decades while also discussing the future innovations on the horizon.
Show Notes
Timestamps
[00:00:01] Introduction and Joe’s background
[00:01:38] What got Joe interested in data engineering
[00:03:59] Defining data engineering and its key components
[00:05:16] Significant trends and changes fueling data engineering over the last 20 years
[00:06:30] Key components of data engineering and the role of each
[00:08:07] Contrasting modern data stack with traditional data stack
[00:12:10] Developers vs. data engineers/analysts in building data pipelines
[00:14:12] Role of AI and LLMs in data preparation and cleaning
[00:16:51] Journey from data warehouses to data lakes to data lakehouses
[00:21:14] Role of data catalogs in data engineering going forward
[00:32:57] Unified data platforms vs. best-of-breed tools for data engineering
[00:37:03] Possibility of one system serving both OLTP and OLAP use cases
[00:40:46] Impact of AI on the data stack and data engineering
[00:43:23] Interesting future research directions in data engineering
[00:46:02] Lightning round: Acceleration, unsolved questions, key message
Transcript
Sudip [00:00:01]: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host. In this podcast, we interview legendary engineers who have created infrastructure software that has completely transformed and shaped the industry. We celebrate those heroes and their work and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products. So today I have the great pleasure of hosting Joe Hellerstein, professor at UC Berkeley and founder of Trifacta and Aqueduct. Joe, welcome to the Engineers of Scale podcast.
Joe [00:00:36]: Thanks, it's fun to be here.
Sudip [00:00:37]: Thank you so much. So I'm not going to walk our listeners through an overview of your background because you truly need no introduction. When it comes to innovation in the field of data engineering, there are really very few people who even come close to what you have done. I will, however, mention a fun fact that I learned recently, even though I think I've known you for many years, and that is your interest in music. Not only are you a musician on the side, you actually even minored in music during your PhD at Wisconsin. So how do you balance all of your research and startup work with your interest in music?
Joe [00:01:12]: Well, I'm a believer that you should have a rich life and that people who spend 24-7 on computing are spending a little too much time maybe. So I enjoy computing and data engineering and all that good stuff, and I love to geek out about those things, but it's one of a bunch of things I value in life, family, hobbies, and so on. And I'm sure a lot of your listeners are the same. And anyone who tells you that you have to do something 24-7 to excel, I think is telling you a lie.
Sudip [00:01:38]: So then what got you interested in data engineering as an area of research in the first place?
Joe [00:01:44]: Yeah, well, my background, going back to my training after college, was in database systems. And my first job right out of college was at IBM Research, which was the founding lab out in San Jose that built the first relational databases. And a bunch of those people were still there. So I was really brought up by some of the founders of the field of database systems. After that, I went to Berkeley and then Wisconsin for my schooling, which was more of the community of the folks who really pioneered the database system space. So I'm an old hand, even though I'm not as old as most of those people by about a generation, I still feel like I'm an old database hand. I come from that lineage. And what's happened over my career since I got my PhD in the mid-90s is that the process of managing data and the computation that goes around it has become more and more central to all of computing and the way it projects on the real world. So your listeners know better than any, probably, that really, we shouldn't talk about computer science. We should talk about data science, data engineering, because without data, computing is kind of meaningless. And this is a truth that emerged, you know, in the last 20 years, really. But it was one that the database people were working on well before that. And I feel kind of blessed to have been born into that community because the relevance of data engineering to all things computation and therefore much of society is so apparent today.
Sudip [00:03:01]: I would say that both the schools that you have been involved with, Wisconsin and of course UC Berkeley now, have had tremendous work coming out in databases, data systems, and data engineering. There's such a big history there. It's an awesome tradition.
Joe [00:03:19]: I mean, when I was coming up, there were three places to do real work in data systems. It was IBM, Berkeley, and Wisconsin. And I had the fortune, and to some degree, I took measures to interact with all those people when I was very young, straight out of college. And they were really the center of all activity because a lot of academic computer science at that time didn't get it. MIT, when I interviewed in the mid-90s, hadn't had a person on the faculty doing databases for over a decade. And it was very clear when I got there that they did not think it was an intellectual activity. They thought it was something that businesses do. And that's all changed radically in the course of my academic career. Now data-driven computing is all of computing, really.
Sudip [00:03:59]: Taking a step back, if you were to describe to someone what data engineering is, and I know you teach a very popular course on that at Berkeley, too, how do you describe it?
Joe [00:04:12]: Yeah, it's been tricky, actually, because I think we're in a time of transition. And so you have to talk about things relative to where things are right now. So the way I talk about it with people is they understood that there was a shift from traditional computer science to what was being called data science over the last, say, decade, where clearly data had to be at the center of things, or at least some things. But what happened in the data science programs that evolved is they were largely developed as sort of statistics meets algorithms. And that left out all the engineering concerns of how do you build systems around these foundations, particularly systems that drive large amounts of data? Because the statisticians traditionally didn't use large amounts of data. It's incredible what's achievable with statistics on very small amounts of data, of course. And so that's what I talk about. It's like, well, how do we systematize and build engineered artifacts that reflect the importance of data and computation together?
Sudip [00:05:02]: And looking back in the last 20 years or so, since you started working at Berkeley and obviously started your two companies, are there certain significant trends or changes that have really fueled data engineering?
Joe [00:05:16]: Yeah, I mean, you know, there's a long enough scope that you have to include the existence of the World Wide Web as one of them. So, you know, you go back to the 90s and data was all about relational databases because that was the data that was entered into computers. And the web changed all that. Now there's all sorts of nonsense that you could harvest. And I remember joking in the early aughts, maybe late 1990s, my buddy was saying, my gas station just got on the internet. Goodness knows why I would ever want to run a query against my gas station, right? But nowadays we realize that all that sort of light recording of ambient stuff and people's thoughts and ideas and conversations is highly valuable. That was not at all clear when the web started out. You know, web search is like, well, I might want to find some stuff. Most of it's irrelevant to me, but I want to find a few things, right? That was web search. But what you see, you know, if LLMs are a compression of the web, what we're seeing today is having a compressed version of everything anybody's ever said is outrageously powerful, even if the technology is pretty simple. So the rise of kind of ambient human information, something I did not anticipate whatsoever.
Sudip [00:06:21]: Got it. As we know data engineering today, what would you say are its key components, and what's the role of each?
Joe [00:06:30]: You know, we often talk about pipelines, right? And I think it's not a bad way to think is to kind of start down the pipeline, look at what feeds what. Where does the data get generated? How does it get ingested? What data didn't I measure at all? Actually, we start there. The statisticians always start there. There's a universe. Things are happening. Some of it gets measured. That's called sampling. That's the start of any pipeline: there are phenomena out there that we could record, and we choose to record some of them. And then, of course, there's the pipelines we think of from ingest to processing to feedback loops that happen, right? When you're learning from outputs and how people react to them. So thinking about the long-term pipelines all the way from what did we measure to how did people react to it in apps? And then we measured the apps and we closed the loop. That's modern data engineering. Unfortunately, it's too big for any one organization to own, right? You go into any company and there isn't one org inside the company that owns that whole thing. And it's definitely too big for any one person's head. And so the other reality of data engineering, like a lot of real-world engineering, is teamwork. And it's cross-disciplinary. And it's a lot about people.
Sudip [00:07:36]: Absolutely. Yeah, people, culture, team, all of that plays into the efficiency of a data engineering team, 100%. I think, particularly over the last decade or so, we transitioned from what used to be called the traditional data stack, much more on-prem, built around older-generation technologies. And now we keep hearing about the modern data stack, much more cloud-native and so on. In your view, what is the modern data stack? What are its components? How do you contrast it with the traditional data stack?
Joe [00:08:07]: I have some opinions about the branding of all this. So modern data stack is a brand that was basically promoted by a couple of startups that were venture-backed. They tried to ascribe a particular meaning to that. And of course, the word modern is kind of hilarious because to me, it's kind of Mad Men 1950s modern furniture, right? But it is true that we do things differently today than we did in the 80s and 90s. So let me talk about it in those terms rather than trying to say it's two particular companies that tried to brand the modern data stack, okay? Because there's also by now enough blowback with that terminology that I just don't even want to walk down that path. Having said all that, when I was coming up, it was the beginning of the data warehouse movement. And that itself was a reaction to just having a database. So once upon a time, you had data, you put it in a database and that was it. And IBM would sell you a mainframe, right? You'd run the database. And then what happened was there were lots of databases and they were all over the place. And so there were many of these operational databases. The data warehousing movement came along and said, people want to see the big picture across all these databases. So we're going to have this ETL process, right? We're going to extract from all the operational stores, transform the data into a common format, and load the warehouse. And that was a story that made many consultants very rich. And it also opened opportunities for some software vendors, Informatica, Teradata, right, that were tuned for that workload. And that was the status quo for about a decade. And then what happened was the world became too complicated for that as well. So the fiction that there'd be a single data warehouse that would really cover the business was a fiction all along, but it became a painfully untrue fiction sort of post the web, really. We had lots of data that really didn't want to go in the relational database at all. It wanted to be text search or it wanted to be image files in a file system, right? And then we saw the cracks emerge there in the data warehousing relational database movement. We heard about NoSQL and there was a bunch of things like that. Where we are today, I think, is kind of the end of that road. And at the end of that road, we have a majority of data that is not rows and columns when it's born, that is highly valuable, that needs to be managed. So you can't just throw it in the file system because it's too important and people need to be able to version it, know where it came from, know its provenance, know how it's getting used and do governance on it, all the things that go with managing the data. So the things that used to be easy in the database are really hard on this messy data, all this stuff about governance and organization. It's very disparate and it's all over. So what's inevitable in this setting is that we're going to have to kind of stitch together more stuff even than before. And that's where we get into kind of the state of things today. And there's names people like to use for it, like Lakehouse and so on. That's a fine name actually, doesn't mean a whole lot, but neither did Warehouse, so that's okay. But the bottom line is there's going to be rich data of many facets and there's going to be many uses for that data. So: big fan-in of lots of data sources from lots of places, big fan-out of lots of use cases.
We'd like management in the middle of that hourglass, but we're not going to be able to assume that the data is structured for that management. The data is probably going to retain its original format, and then extracts of various kinds kind of go out, right? So that's kind of where we are. And I think, you know, if we called it a data filtration system or a data hourglass or something, that might be more helpful than a lakehouse, which to me is a cabin on the side of a lake. I don't know if that's a useful analogy. It's a cute neologism. It is this kind of problem that we're trying to manage.
Sudip [00:11:24]: So from data warehouses, which were primarily ETL when it came to data ingestion, we are now more in the ELT world, right? Which is data lakes. And then there is also a small movement around capital E, small t, then LT, EtLT, which is: you extract, do some light transform, load, and then transform again (a rough sketch of the three styles follows below). So I just wanted to talk a little bit about the users and the builders of these data pipelines. There are two primary kinds of personas: one is developers, and the other is data engineers. In your view, do you see developers, you know, kind of taking on more and more of this complex world of building data pipelines? Or do you see more like analysts and domain experts, you know, doing things on their own using local tools, LLMs and so on?
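To make that alphabet soup concrete, here is a minimal sketch in Python of the three loading styles, with hypothetical file, table, and column names; DuckDB stands in for the warehouse engine:

```python
# Illustrative only: the ETL / ELT / EtLT distinction discussed above.
# All file, table, and column names are hypothetical.
import duckdb
import pandas as pd

raw = pd.read_csv("orders_export.csv")  # Extract from an operational source

# ETL: transform in the pipeline, then load only the cleaned rows.
clean = (raw.dropna(subset=["order_id"])
            .assign(amount_usd=lambda d: d["amount_cents"] / 100.0))
con = duckdb.connect("warehouse.db")
con.register("clean_df", clean)
con.execute("CREATE OR REPLACE TABLE orders AS SELECT * FROM clean_df")

# ELT: load the raw data as-is, then transform later inside the engine.
con.register("raw_df", raw)
con.execute("CREATE OR REPLACE TABLE raw_orders AS SELECT * FROM raw_df")
con.execute("""
    CREATE OR REPLACE VIEW orders_clean AS
    SELECT order_id, amount_cents / 100.0 AS amount_usd
    FROM raw_orders
    WHERE order_id IS NOT NULL
""")

# EtLT just adds a light transform (say, dropping malformed rows) before the
# load, with the heavier modeling transforms still done in SQL afterward.
```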
Joe [00:12:10]: I'm going to take a slightly different slice on it than you started with. Not because I quibble with it, but I want to make sure I'm on the ground that I want to talk about. So I think we can split the world into people whose primary focus is computational: think of data engineers, data scientists, IT professionals, okay? That's not a cool name, so we try not to use that one anymore. And then people whose job is fundamentally not about data or computing. Their job is something else: people in the line of business, people in science, people in what have you, journalists who use data. And you can think of them as consumers, if you will, whereas you think of the first body of people as maybe maintainers or something like that. They're the people who plumb the data. I think both constituencies are really important. I think they're both big markets for venture and for startups, and for people who have technical skills to work in. But I do think Silicon Valley and academia, because academia in computing is computer scientists like me, are very much more comfortable with the technical folks, with the developers. That's home base for us. And it's really fun to write software that we can use ourselves to make ourselves more productive. And so we do a lot of this dogfooding. Open source is another big kind of cultural contribution: you want to build a project and get your friends to use it, and they like it. But, you know, we're kind of taking care of ourselves. There's this huge constituency of people who are going to get huge value out of data, and they're the ones who understand how it's getting used. So they're actually the people where, like, the cha-ching happens. These are the people who are going to monetize the data. In the sciences, it wouldn't be money; it would be scientific innovation. In journalism, it would be the big story.
Sudip [00:13:41]: They extract value from the data.
Joe [00:13:44]: Exactly. Yeah. And those people arguably are more important; in my humble opinion, I have to admit that. So I think it is kind of two things. And I think we can ask a useful question, which is, will one set of technologies suit both of those constituencies? Maybe. I think that's a good conversation to have next. But I do feel like setting up these two broad camps is useful, because two is easier to think about than 20. So let's not get too fine-grained. How should we think about building and architecting data systems? And when I say systems, I do mean in the aggregate, like organizational systems, to serve both those constituencies. And what kind of software do they need? I think it's a good, healthy frame for kind of the bigger picture.
Sudip [00:14:26]: And do you see any convergence in terms of the kind of systems both constituencies might use? And does everyone become a Python programmer in some ways?
Joe [00:14:36]: I'm so glad you brought that up. There's been a strong movement, and I see this in the data science program at Berkeley as well (we were a pioneering program in having a data science major), that says if you just learn some Python, then you can do data science. And it just feels very backward. And I don't think that we should be expecting that all of the people who extract value from data will be programmers. Thinking programmatically, so having exposure to the notion of step-by-step problem breakdowns, instructing a computer to do things, all important, right? Do you have to learn a traditional programming language like Python to do that? I think not. I've felt this way for a long time, but I feel like LLMs are a really wonderful way to surface this to the general populace. If you are sufficiently disciplined step-by-step in telling the LLM what you want, it will probably understand you and probably help you get your job done, even if you're not really a programmer. And there's going to be other technologies, obviously, over time that'll be better than just a chat box on the internet for doing this, both user interfaces and models and programs. So yeah, I do feel that this idea that, oh, everybody will do everything in Python is super dev-centric. People like me and probably you, I don't want to put too much on you, and your listeners, that's what we're good at. But we have to have some empathy for the people who are the value extractors. And when I say empathy, I don't mean treating them like children. I mean giving them tools that make them super powerful. Right?
Sudip [00:16:01]: Yeah, I think, particularly over the last five to ten years, the venture ecosystem, which I can speak for, has definitely over-rotated a bit on the developer-centric experience of data engineering, even at the cost of ignoring the actual constituency of users down the line, like you were saying. I want to shift gears a little to a core component of data engineering that you have done some incredible work on, which is data preparation and cleaning. To this day, it continues to consume major time and resources. And you not only did a lot of work on it, including your Wrangler paper and so on, but also went on to found one of the pioneering companies in this area, Trifacta. Looking back from where we are now, do you feel we have managed to solve that problem of data preparation and cleaning yet?
Joe [00:16:51]: Yeah, somebody pointed out that data transformation is kind of like cancer research. It's like a lifetime employment guarantee because you're going to help, right? You may do brilliant, amazing things that help people's lives, but you're probably not going to solve it, so you shouldn't look forward with any sense of arrogance. There's lots to be done still, but there have been some good things done that we can build on.
Sudip [00:17:18]: And do you see like AI and LLMs kind of changing the game a fair amount going forward? Like any fundamental shift you see there coming?
Joe [00:17:27]: This is great. So yes and no. And let me see if I can frame this up. When we started Trifacta, it was 2012. And the hypothesis in the research was that you want to build a feedback loop between the human and the computer. And the way it would work is this. The human would somehow guide the computer: I want my data to look like this. And then the computer would say, well, here's a few things I could do to make your data more like this. What do you think about each of them? Pick the one you like best. I call this the guide-decide loop. So there's a human in the middle that's guiding the computer, and then the computer is making suggestions and the human's deciding which ones to use. And this was with a user interface that was visual. So I worked very closely with data visualization leaders like Jeff Heer, who's one of the fathers of D3.js and was a co-founder at Trifacta, and a joint student, to make sure that that visualization loop and the interaction part, the human part of that, were really powerful. So you could really see anomalies in the data, you could see examples of the data, and then you could interact by doing things like pointing at a bar chart and saying, this bar looks funny, or pointing at a cell in a spreadsheet; those interactions are the features that you give to the inference engine, to the algorithm, to come up with suggestions on how to address those problems. So a lot of our energy went in there. The AI we had behind it was dead simple. It got a little better over the course of the company; over 10 years, our models got more sophisticated, but the user experience only got marginally better because the key issue was that interaction model. We built an interaction model around cleaning data, wrangling data. Now, we sold Trifacta some months before the launch of ChatGPT. And part of the deal was that I didn't go with the software to the acquirers.
Sudip [00:19:13]: And this was Alteryx, the public company.
Joe [00:19:15]: Alteryx, yes. So Alteryx acquired Trifacta. I will say that Google Cloud Dataprep is still the Trifacta product. It even says Trifacta on it. They haven't changed the branding. So both Google and Alteryx are using the tech directly. Better inference will make that product better, but it will not fundamentally change the hypothesis that started around this guide-decide loop and this idea that you have to give people the opportunity to decide if the outputs of their prompts are right.
You know, that's the whole thing with ChatGPT. You ask it a question, it gives you an answer. It's the same story. So if you're asking it for code, and here's something I'll say to the developers in the audience: you know, please write me code that will pivot this table and remove all the blanks. Okay, it'll spit out some Python code. How do you know it's the right Python code? Well, maybe you should run it on some sample data and see how it looks. Could you build a tool that would allow that iteration to go faster? You know, don't fill the blanks with zero. Fill the blanks by doing linear interpolation, right? Something like that. So you need this feedback loop, and users need to be able to see or evaluate whether the suggestions that they're getting from the AI are right. That piece of the puzzle, the turn of the crank through the user, is a big piece of it. So I guess in sum, user experience and a deep understanding of what you do when you're wrangling data, so that you surface it to people and so that they can say to the computer what they mean, those are independent of how good the AI is. Similarly, the AI being really good doesn't remove the need for those experiences. So I would love to be working on this problem right now, because you plug GPT-4 into this and it's going to be way better than the inference we got. I think the qualitative user experience will be better in some cases by a lot and better in other cases only by a little. I mean, that depends on the task. But it is fun times, no doubt, for this technology.
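To ground this, here is a minimal sketch of that guide-decide loop in Python: two candidate repairs for the blanks, previewed on sample data, with the human picking one. Column names are hypothetical, and the hard-coded candidates stand in for the suggestions an inference engine (or an LLM) would generate:

```python
# A minimal sketch of the guide-decide loop: the system proposes candidate
# transformations, previews each on sample data, and the human decides.
import pandas as pd

sample = pd.DataFrame({"t": [0, 1, 2, 3],
                       "reading": [10.0, None, None, 16.0]})

candidates = {
    "fill blanks with zero":
        lambda df: df.fillna({"reading": 0.0}),
    "fill blanks by linear interpolation":
        lambda df: df.assign(reading=df["reading"].interpolate()),
}

# Guide: show the effect of each suggestion on the sample data.
for name, transform in candidates.items():
    print(f"--- {name} ---")
    print(transform(sample))

# Decide: the human picks the suggestion that matches their intent.
chosen = candidates["fill blanks by linear interpolation"]
cleaned = chosen(sample)
```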
Sudip [00:20:58]: So for some of the technical stack, you could probably use some of the off-the-shelf models and so on. But what you're saying is the secret sauce is around user experience; the real knotty problem is there. And that hasn't gone away or gotten any simpler.
Joe [00:21:14]: Yeah, and I would say it's the same, and your audience probably has hands-on experience by now with things like Copilot. If you embed Copilot in the IDE in a nice way, as they've done with Visual Studio, it really helps the programmer quite a lot. But if you don't embed it well, for instance, if you ask it to write you an entire program instead of the next line, it's going to do a rotten job, because now you have to read that entire program, figure it out, etc. So this thoughtful combination of understand your domain, which in my case is data, understand what the technology can do pretty well, and then build the right feedback loop around that, that's going to be the game for a lot of products over the next few years around LLMs. And that is so true, because we talk a lot about what the moat is for companies that are using, obviously, LLMs and AI, and user experience comes up so frequently. It is, after all, a probabilistic way of thinking about the world, right? So if you do not get the experience right, it doesn't work.
Sudip [00:21:53]: Couldn't agree more. I want to touch on a different topic. As I was doing my research on some of your work, I found this paper way back from 2005 that you wrote with Michael Stonebraker. It was titled What Goes Around Comes Around. The short summary was that you discussed this fascinating cycle of data models over the previous four decades, and how data models have gone from complex to simple and then back again to complex. I'm curious, now that it has been several years since the publication, where do you think we are in that data model cycle today?
Joe [00:22:46]: Yeah, that's a fun topic for your listeners. If you don't know Mike Stonebraker, he was my master's advisor. He's one of the founders of relational databases. He won the Turing Award for that. He started his career in like 1970. He's 80 years old. He's still going. He's still like at every meeting running things. The guy's amazing. And he's founded a number of influential companies. And of course, the Postgres project that many of us use, that I was a grad student on. So Stonebraker's a legend. He's a super opinionated guy and he likes to kind of have his say. So that paper is written entirely by him. I don't disagree with what it says, and we were putting a book together, but that was his chapter. And boy, is it him. So it's very black and white. And I see the world in shades of gray a little more than Mike does. But what I would say is the high-level message, that pretty much tabular, simple data with a well-understood query language that's pretty clean is always going to win over time over any custom complicated thing, I agree with that. Data is too important an asset to put behind fancy stuff. You want to have it behind relatively clean stuff. Having said that, as Stonebraker actually did in Postgres, you can put a lot of interesting data into Postgres and query it with pretty simple SQL. And it's not really flat tables anymore. It's something more than that. And, you know, if you go off and read database theory papers from the last five, ten years and you hang out with the right folks, you'll realize that generalizing the relational algebra to richer mathematical structures can actually give you more flexibility in this space than I think that paper gives credit for. So I actually think over the next ten years we will see another generation of extensions to the relational model that will make it amenable to new data types in ways that we haven't seen before. So I can give you some concrete examples. Traditionally, if you had something like an image in your database, it was just a blob, right? It's a binary large object. It's just a raw sequence of bytes, right? A Unix-style stream of bytes. I think increasingly what we're going to see is, and Postgres has actually had the infrastructure for this since like 1989, but we're going to see this in the field: point at any blob, and you have a featurization function that pulls out a bunch of attributes for that blob. Those attributes, they're columns. So, you know, think about the image that a self-driving car takes at any given time. Bounding boxes around a bunch of regions in XY space, each of which may be tagged with a class or a list of possible classes. I think this is a pedestrian. I think this is a car. I think this is a stop sign. And then it's got a time ticker, and that's a row in a time series database. You know, if you want to start building queries over what happened at a busy intersection at a particular time, it's going to be a time series query over something that eventually looks like rows and columns, but it's actually video. And so we want to extend our tools and unify our tools, right? So the technologies like LLMs and image processing and so on are generating features that we can easily query. And we're going to need to build systems that are a little smarter for that than what we've got in Postgres and the like. But I'm optimistic that the road between where we are with stuff like Postgres and its children and where we need to get to, that's a bridgeable gap.
And so I think there's a nice opportunity here for a next generation of powerful analytic databases to be built that will be extensions of what we have today. And it's not really a circle, what goes around and comes around. It's a spiral, right? You're going forward as you're moving, right? And I think there is progress that's required and that's going to happen.
Sudip [00:26:12]: That's a fascinating example, because today if you have to extract features from a blob of image data, you have to build all these fancy pipelines and stitch together a number of different tools. If you could expose all of that through a simple SQL interface to someone who only knows SQL, you're empowering that person so much more.
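To make Joe's example concrete, here is a minimal sketch with the vision model stubbed out and all names hypothetical, showing how featurized video becomes rows that plain SQL can query:

```python
# A sketch of "featurize the blob, then query the features": a stub detector
# stands in for a vision model, turning each frame into rows of
# (timestamp, class, bounding box) that SQL can query like any time series.
import sqlite3

def featurize_frame(frame_ts):
    """Stand-in for a vision model: returns (label, x0, y0, x1, y1, conf)."""
    return [("pedestrian", 10, 20, 40, 90, 0.91),
            ("stop_sign", 200, 30, 240, 80, 0.88)]

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE detections (
    frame_ts INTEGER, label TEXT,
    x0 REAL, y0 REAL, x1 REAL, y1 REAL, conf REAL)""")

for ts in range(3):  # pretend: three frames from one intersection's camera
    for det in featurize_frame(ts):
        con.execute("INSERT INTO detections VALUES (?, ?, ?, ?, ?, ?, ?)",
                    (ts, *det))

# "What happened at this intersection?" is now an ordinary SQL query.
for row in con.execute("""
        SELECT frame_ts, COUNT(*) AS pedestrians
        FROM detections
        WHERE label = 'pedestrian' AND conf > 0.8
        GROUP BY frame_ts ORDER BY frame_ts"""):
    print(row)
```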
Joe [00:26:33]: And now, if you will, think about data governance in that world. That function you wrote that extracts all those bounding boxes and labels, that's a model. That model had training data. And there was this ML team that owns the training pipeline for that model and maintains it. Now we have governance questions that are not just what data did you look at, but in your query, which functions did you use? Were any of those functions model-based? What was the data that trained those models? And so if you're doing something like the right to be forgotten in GDPR, say I want any video of me captured on the street to be deleted from the database because I'm no longer using that insurance company. I can't just look at which queries touched it through SQL. I need to also look at which functions in your query came from a model trained on video that you were in. This is now across teams. It's across what we currently think of as totally different data pipelines. But this is the future of data governance and metadata management. So it's got pretty big implications for how organizations are run.
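A minimal sketch of what that cross-pipeline lineage might look like, with a hypothetical schema and IDs, using SQLite for illustration:

```python
# Governance now spans not just the tables a query read, but which functions
# in the query were backed by models, and which datasets trained those models.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE query_functions  (query_id TEXT, function_name TEXT);
    CREATE TABLE model_functions  (function_name TEXT, model_id TEXT);
    CREATE TABLE model_training   (model_id TEXT, dataset_id TEXT);
    CREATE TABLE dataset_subjects (dataset_id TEXT, person_id TEXT);

    INSERT INTO query_functions  VALUES ('q1', 'detect_pedestrians');
    INSERT INTO model_functions  VALUES ('detect_pedestrians', 'm7');
    INSERT INTO model_training   VALUES ('m7', 'street_video_2021');
    INSERT INTO dataset_subjects VALUES ('street_video_2021', 'p42');
""")

# Right-to-be-forgotten for person p42: which queries used a function whose
# model was trained on data that includes them?
for row in con.execute("""
        SELECT DISTINCT qf.query_id
        FROM query_functions qf
        JOIN model_functions  mf ON qf.function_name = mf.function_name
        JOIN model_training   mt ON mf.model_id = mt.model_id
        JOIN dataset_subjects ds ON mt.dataset_id = ds.dataset_id
        WHERE ds.person_id = 'p42'"""):
    print(row)  # ('q1',)
```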
Sudip [00:27:32]: That is actually a really good segue into one of the things I wanted to ask you, which is the role of data catalogs. I know you're not a fan of the term, but in that kind of modern data stack, where do you think we are with data catalogs? Some time ago, a lot of these tools came out of the web-scale companies. There are, of course, companies like Alation and Collibra, who are further ahead in terms of commercialization. Do you see data catalogs playing a role in data engineering going forward? What does that look like?
Joe [00:28:02]: I think, inevitably, they have to exist in some form. And I saw this when we were selling Trifacta, and I saw this in the research we did in this space. And I did some of that research in collaboration with LinkedIn back at the time, and they built one of the first data hubs. And I continue, actually, to advise Acryl, so I should just make a public statement about that; they're one of the data hub vendors. But, you know, if you have many data sources under different systems, some of which are proprietary, some of which are open source or different flavors of proprietary managed open source, they're not going to come with a common catalog. And every software vendor in the space has the incentive to say: we'll build the catalog, and if you're wrangling your data, then you'll catalog it with Trifacta; we'll own everyone's metadata; we'll be very powerful. Customers did not buy that. They knew that was a scam, a lock-in scam. So a neutral data catalog is a reality, I think, for any large organization, even today, honestly, and going forward all the more so as data gets richer and systems proliferate. And it's a hard problem that merits full-time tech focus. So again, at Trifacta, we wanted to build a data catalog, but by golly, that's a big lift. We were plenty busy building data wrangling tools. We were happy to partner with the likes of Collibra and Alation and so on because they were doing a good job. And it takes a whole team just to do that stuff.
Sudip [00:29:26]: 100%. On a different note, you have had a ringside seat, and not only that, have actually worked on several technologies in this whole movement that you were talking about a little earlier: we went from data warehouses to data lakes, and now we're in the early days of data lakehouses. Any thoughts on what is fueling this? What are the trends behind this journey from warehouses to lakes to lakehouses?
Joe [00:29:57]: Yeah. I mean, I think the easy answer is software logs were kind of the first high-volume source of data that folks like us and the people on your podcast had to manage that just didn't really make sense to put in a relational database. It was too expensive. You know, they sort of had rows and columns, but they kind of didn't either, because there's lots of text in those logs that you want to look at. I mean, they're not always structured the same way and so on. And so you saw the likes of Splunk and the competitors that followed carve out a very large niche in log processing that, as an academic, you're sort of like, yeah, it's kind of information retrieval, it's kind of databases. I don't see anything new here. La la la. Academic ivory tower stuff. But, you know, it was a really important business problem with really good tech out in the field. That's the tip of the spear. A new data type that is just a little bit different. It's got a different cost structure and value structure and different queries. You don't want to put it in Teradata, right? And you don't want to put it in Amazon Aurora either, right? You kind of want to just let it lie. Now, Splunk didn't work that way because it was early. So with Splunk, the thing that customers hated was it was so expensive to put your logs in Splunk. They were charging by the byte, essentially. I heard complaints about that all day long, every day.
Sudip [00:31:03]: And still do. I still do.
Joe [00:31:05]: Yeah. Yeah, which is why it's kind of great they got bought by Cisco. I feel like it's old-school pricing that'll last for a while. But realistically, that was the first use case. And what we're going to see now, because featurization with LLMs is so practical, is that you can really get structured data out of anything. You can get structured data out of your web chats. You can get structured data out of your images and your security cameras. Very low-budget sources of data are going to turn into columns. Is the customer happy or sad? Is this a complaint or praise? Which product are they complaining about? These are all things that you get out of a customer chat, right? Those all get to be columns now. And you're going to want to load them into something, some customer relationship management system, right? So there's going to be lots of modalities of data that we're going to extract structure out of, not just log files and traditional relational data. But I feel like log files were the first big use case and we're just going to see lots more. So an open question is, does a vendor like Databricks that wants to give you a soup-to-nuts data lakehouse manage to give you the full spectrum of that stuff in a nice package? Or is it more like Splunk, where you get someone who's tuned up to be really good at a particular kind of pipeline, a kind of data, and a kind of query, and then they can monetize that in a vertical application? I think those are really interesting questions for the space going forward.
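As an illustration of chats turning into columns, here is a small sketch in which a stub function stands in for a prompted LLM; the fields and matching rules are hypothetical:

```python
# "Low-budget sources turn into columns": a stub extractor stands in for an
# LLM prompted to map each raw support chat to a fixed schema of fields,
# ready to load into a CRM table.
def extract_fields(chat: str) -> dict:
    # In practice this would be one LLM call with a fixed output schema;
    # here it is keyword matching, purely for illustration.
    unhappy = "broken" in chat or "never" in chat
    return {
        "sentiment": "negative" if unhappy else "positive",
        "kind": "complaint" if unhappy else "praise",
        "product": "router" if "router" in chat else "unknown",
    }

chats = [
    "My router arrived broken and support never called back.",
    "Setup took five minutes, really happy with it!",
]

rows = [{"chat": c, **extract_fields(c)} for c in chats]
for r in rows:
    print(r)  # each chat is now a row of columns a CRM can ingest
```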
Sudip [00:32:23]: And that is actually a really interesting thread I want to pull on a little bit. I think historically data engineering has been mostly about using best-of-breed tools and stitching them together to implement your data pipelines, right? Now, of course, we have companies like Databricks, which you are intimately familiar with, and to a certain extent even Snowflake; they talk about their unified data platform. Do you feel like we are heading to a world where enterprises will standardize on a unified data platform? Or do we still have this, you know, kind of duct-taping of best-of-breed tools?
Joe [00:32:57]: You know, it's funny. We have this conversation often. And when I say we, I mean those of us in the community, not just you and me. You say Databricks and then you say Snowflake and you think about them. And then you remind yourself that AWS and Google and Microsoft have those things and 17 more, right? That are data solutions. So if we broaden the scope a little to be like, what tools in the AWS toolbox should I be using? And will I stay only at AWS? You know, we know the answer to that. The answer is no, right? For lots of reasons. Now, as to who's going to be good at what, as opposed to maybe I want to split my bets and I don't want to have one vendor relationship: it's going to be possible to be an 80% solution on a bigger scope, 80% of more stuff, right? The relational database was always winning because it was good for 80% of your data problems. Now, what counts as 80% of your data problems is much broader today. I do think there are sort of 60 to 80% solutions there that you could get from a single vendor. I think there will be solutions that under the hood will have a lot of pieces that today we think of as different pieces of software. You know, one of the things that happened with the big data era and open source, and I got a little salty about this with some vendors at a conference recently, I should say, is, you know, we went through 10 years of Hadoop, right? And it was awful. And the reason it was awful was partly because it wasn't the greatest software in the world. It was kind of open source and a lot of it was immature and never really matured. But a lot of it was that it was 14 or 16 pieces of software that weren't super mature, each of which had a logo and a fan base, a community around it. So there were five or six super fans followed by 25 fans. It was like identity politics. It's like, no, no, no, you have to integrate the queue into the database system because the queue is cool. It's got a name and a logo and a bunch of fans. Whereas if it had been run by a business, they might have consolidated business units over time. And so what I think we're dealing with right now is going to be consolidation of what we currently think of as pieces of the pipeline, just because it's going to make technical sense to consolidate them. They're close enough to each other that they should just not have two teams and two products, and AWS is rife with this, which is the most confusing. I do think we're going to enter into an era of consolidation around that. Open source will be the last to do it, though, because of all the cultural issues I just mentioned. The way you get open source to move forward is you build an inner team that really is super fans. And so that causes fragmentation of product. It's hard to build a big enough team of super fans to build a big enough product and to merge teams over time. So I got into a fight at this conference because somebody said that the future of data systems is the stuff emerging from Facebook and Voltron and other places right now. And I was like, I'll believe you when you show it to me, but the last 10 years suggest otherwise. What I see from Amazon, Google, Oracle, Databricks, Snowflake, all that stuff is way better than what's coming out of open source technically. And it's not like they're hiding the technology. Actually, they're publishing about it, especially Amazon and Microsoft. They write lots of papers about what they're building.
Those papers are more sophisticated on average than what we're seeing in open source. So much as I'm a huge advocate of open source and a Postgres guy from way, way, way back, what I'd say is that in terms of these big stack problems, we're going to see consolidation. It's going to happen first at the bigger companies, or at startups that are willing to take on a big risk, and some of this piecewise stuff is going to fade away.
Sudip [00:36:17]: It's a fascinating view. Yeah. Thank you. I have a similar question, but more around use cases. Traditionally we have always had systems that were transactional, doing OLTP, and then we had systems built for analytics, OLAP. I actually got into a debate five, six years ago about whether it is possible for one system to cater to both. And I think there was a terminology that came up around that time, which was HTAP, Hybrid Transactional/Analytical Processing, or something like that. Do you see a world where one system serves both use cases, or do you think those use cases will forever be served by separate systems?
Joe [00:37:03]: To me, this is totally a cost-benefit analysis conversation. So is it possible to build a system that does both? Well, I believe it's possible. In fact, one way to build it is to just sell both systems and put a little glue underneath and kind of try to hide it from the customer. But that's not a very good version of it. But you can build these things. And in fact, former PhD students of mine have done great work on this at IBM and other places. The possibility to do it well is there. It's hard to do because you're basically meeting multiple SLOs with a single piece of software. Some people want very low latency, high transaction rates, and transactional semantics. And other people's SLOs are: I want very large data volume; high latency is fine, but huge data volume, I want lots of throughput of bytes. And to meet both those SLOs in a single system without introducing 700 tuning knobs, one of which might be run in OLTP mode versus run in OLAP mode, but to really get down to what are all the knob settings that make it work well in one or the other, it's really hard to build that well and make it usable. And then you ask the question, well, if I built it and I invested in doing that, and let's say I have infinite budget, let's say I'm AWS, is there enough customer demand to fund that? And my guess is the answer is no. Most people can probably live off some kind of data loading in the background that's not too hard to manage, not too expensive in human time to do the ETL, ELT, call it what you will, and have two systems and two different organizations that manage them. And then there's the governance issues. So the governance for your operational databases is typically that very few people get access to them, right? And it's real easy to lock down who gets to see what, because very few people get to see any of it. But once you get it into the warehouse and you've torn apart all the little pieces that are accessible to different people, now you have a management problem only on the OLAP side, but it's hard enough over there. So getting governance working in HTAP is also a big challenge outside of the other technical challenges. The organizational governance challenges are hard. Again, I don't think most orgs really want all that noise in their production operational databases. My advice to the entrepreneur doing HTAP is: that's brave and risky. But if you did it and you won, man, that'd be awesome. You'd get two markets instead of just one. But I think it's high risk.
Sudip [00:39:18]: I will take that. By the way, one comment I just wanted to make: over the conversation we've had for the last half hour, 40 minutes, I've heard you bring up governance so many times, which is fascinating given you're an academic first. But I imagine a lot of that comes from your entrepreneurial experience.
Joe [00:39:33]: Yeah, absolutely. None of my colleagues know much about this, except for the ones who are beginning to get interested in fairness. So some of my colleagues who are working on the boundary, like data fairness, AI fairness, they get it actually very deeply, in ways that I often don't think about, because in industry we don't think about that so much. But most core computer scientists think about how fast things go and how useful the outputs are. They tend not to think about governance. It's absolutely right. And it's hard to teach, honestly; giving a lecture on that stuff is pretty dry. I don't know if you've ever gone through a class on access control, but it's not real inspiring. So I think it's one of those things where you've got to get out there and get your hands dirty, and then you realize it's like the biggest problem sometimes. It is not fun, because you are mostly talking about locking down as opposed to, you know, obviously empowering people. Yeah, it's a little bit like security in that sense, where you have to scare people enough and put a lot of drama around doing it wrong to get people fired up to want to do it right.
Sudip [00:40:25]: 100%. I want to ask you a little bit about how you see the whole data stack evolving with AI. I mean, everybody now wants to be an AI company. Do you see some fundamental changes in the data stack and in how we have done data engineering over the last two, three decades, now that we are layering all of this fascinating LLM technology on top?
Joe [00:40:46]: I guess what I'd say is a couple of things. I'm beginning to get the feeling, after beavering away at this at Aqueduct for a couple of years with my co-founders, who are brilliant Berkeley PhDs and professors, that picks-and-shovels, broad-purpose tooling for LLMs for the enterprise is not ready yet. It's too early for that. Most enterprises that are going to go in this direction will probably either use what they can get from Microsoft and the like, or they'll do some stuff in-house until they figure out what they really need. It just hasn't settled down enough to do a broad-based solution. I think that decentralized solutions are going to be earlier to adoption and earlier to generating real value. You know, the best example we see in the wild is the difference between ChatGPT and Copilot. I know exactly why I would pay for Copilot. I have less reason to pay for ChatGPT as an individual. I mean, it's kind of cool as long as it's cheap. But, you know, broad-based answer-any-question-I-can-think-of pretty well, I don't know what that's for, really. But if developers are going to be doing the tooling, because developers are going to give very crisp feedback as to what they want in this space, that is a good place to focus. I think Microsoft's been very smart focusing on Copilot, because the value of it to their constituency is very clear, and they can dogfood it in-house for a long time and figure out how to make it better at the tasks they already do anyway. Medical, by contrast, on the one hand sounds great, right? And on the other hand, yikes! How do we make it work well for this kind of thing when it hasn't been trained on it? I had a conversation with a major medical provider recently, and they said they have a stack-ranked list of stuff they want to try LLMs for, and it's got like 250 use cases. And I said, wow, I'd love to see that. That sounds fascinating. But, you know, the ones that we settled on talking to them about, that seemed actionable, were ones inside of IT. Because inside of IT, first of all, we wouldn't kill anybody. Second of all, we could figure out pretty well if it's working or not. And so doing pilots inside of technology settings, I think there's going to be a lot of easier pickings there for a while. And we'll get better at it by using it ourselves as the developer community more quickly. Of course, people will succeed applying it to specific verticals too. And that's also good. But doing broad-based right now, I think we're too early.
Sudip [00:43:06]: Too early to settle. I want to ask you one last question before we move on to the lightning round, and you probably have one of the best vantage points to answer this. What are some of the most interesting future directions and research projects in data engineering that you are excited about?
Joe [00:43:23]: So I always like to answer these questions in terms of what I'm not working on, because I have my biases, right? But actually on this one, since I'm not doing Trifacta anymore, I would say data transformation, the T in your E-T-L-T-L-T-L, whatever star, is an awesome petri dish for LLM technologies, for a bunch of reasons. First of all, it's algorithmically trivial. You don't have to invent new algorithms to do ETL. We're not doing things to the data that involve computing Fibonacci numbers, or even sorting, often. I mean, you don't have to implement quicksort. You don't have to invent any new algorithms. You just need to apply building blocks in the right way to the problem. So I think an LLM should be good at it, to a first approximation. Secondly, it's a very hard user interface problem. So it's one thing for me, and I don't mean to minimize things like Midjourney, because actually I think it's totally fascinating and awesome. But if I say, give me a picture of two people talking on a podcast, and it gives me this crazy awesome picture, it's really gratifying. But it doesn't matter if it's right or wrong. And I didn't define what's right and wrong.
Sudip [00:44:26]: How delighted were you?
Joe [00:44:27]: I'm like, I'm delighted. Try again. That's not engineering. That's authoring. And authoring has got different constraints than engineering. So I would say in the engineering world, the great thing about data transformation is it's hard to know if you got it right. It's a huge piece of data. So whether you got it right is highly contextual. For example, you've got log files, like Splunk. The marketing department is going to put it in the CRM to try to figure out targeted ads. They want to do one thing with that data. The IT department wants to look at downtimes of servers. They're going to do a very different thing with that data. And whether the data is cleaned adequately for one or the other is a completely different objective function for the optimization. So saying there's an LLM that does this, well, maybe with the right prompting. Yeah, maybe. But the prompting is going to have to be very interesting. And my point there is, because there are many different correct answers to clean up this data, the evaluation of the output becomes the hard part. And this gets all the way back to the beginning of the podcast. Did you build the right user experience so the user can guide the system and decide if the outputs are right? That guide-decide loop comes back. So I think it's a wonderful petri dish, and LLMs are great at some things in data cleaning right now and terrible at others. We could talk about that, and it's an area I know well. So that's also easy for me to suggest.
Sudip [00:45:42]: So we end each of our episodes with three quick questions. We call it the lightning round. The first one is around acceleration, and I'm going to ask it about your space, which is data engineering. What do you think has already happened in data engineering that you probably thought would take much longer?
Joe [00:46:02]: The one that surprised me with the speed it went is the disaggregation of the stack. I would have expected the tightly coupled, what they call shared-nothing, architecture, where you have compute, memory, and disk all in one box, that's your building block, and you knit those together and parallelize across these full machines, to last longer. That transitioned in the cloud very quickly to what amounts to shared disk. We have a storage tier, think of S3, and a query processing tier, and maybe a log ingestion tier, and the log tier hydrates the storage tier. I did not expect that to happen, or happen so quickly, and it makes sense and it forced a lot of design changes. I'm very impressed, actually, with the teams at Microsoft and Amazon and Google who've led on this, but that shift to the disaggregated stack went faster than I would have guessed.
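As an illustration of that disaggregated, shared-disk shape, here is a small sketch using DuckDB reading Parquet from an object store; the bucket path is hypothetical, and it assumes the httpfs extension and S3 credentials are available:

```python
# Disaggregation in miniature: storage is an object store that owns no
# compute, and the query tier is just a process that reads it on demand.
import duckdb

con = duckdb.connect()          # compute tier: stateless, no local data
con.execute("INSTALL httpfs")   # enable reading over HTTP/S3
con.execute("LOAD httpfs")

# Storage tier: Parquet files in an object store, shared by any query engine.
rows = con.execute("""
    SELECT event_date, COUNT(*) AS events
    FROM read_parquet('s3://example-bucket/events/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").fetchall()
print(rows)
```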
Sudip [00:46:47]: In large-scale data processing, generally speaking, what do you think is still the most interesting unsolved question? I mean, data cleaning is definitely one I can think of, given your background.
Joe [00:46:58]: I think the most exciting thing I'm not working on right now is querying unstructured data. I think we're going to see tons of progress on that in the next five years. I don't even think you need ten: in five years you'll have a SQL interface to everything, and you'll be able to ask questions of any kind of object and get some kind of answers. But you may not have time to train GPT-4 on all that data every time, so how are you going to plumb together all the pieces so I can ask SQL questions over all my data that's in the lake? I think that's happening in research, and I foresee it happening in industry over the next short window.
Sudip [00:47:28]: Fantastic. Last question. What's one message you would have for everyone listening?
Joe [00:47:35]: One message that this audience probably already knows, but: it's always about the data. There are going to be lots of innovations in computation, there are going to be lots of cool algorithms, there are going to be new kinds of models, but it's always all about the data. And that means: where did it come from, what data did you choose to acquire, and then, of course, how you bake the cake with it, right? Whether that's training a model or building a warehouse or whatever, that's important, but it's always about the data and where it came from. The systems you roll out and the algorithms you run are always in service of the data. Even the traditional field of computer science, really, is always all about the data.
Sudip [00:48:14]: That is actually a really fascinating answer; given we are talking about data engineering, and given your background, I could not agree more. So on that note, Joe, it was a real pleasure and privilege to host you today. Thank you so much for your time!
Joe [00:48:29]: My pleasure Sudip. Thanks for having me.
Sudip [00:48:31]: All right. This has been the Engineers of Scale podcast. I hope you all had as much fun as I did. Make sure you subscribe to stay up to date on all our upcoming episodes and content. I am Sudip Chakrabarti and I look forward to our next conversation.