Engineers of Scale

When Hadoop was King and Yahoo was Cool - with Doug Cutting and Mike Cafarella

The untold story of how Hadoop was created, how Google influenced the project, and the role Yahoo played in making it successful - from the creators of Hadoop, Doug Cutting and Mike Cafarella.

In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in infrastructure software that have changed the course of the industry. We interview the engineering “heroes” who led those projects to tell us the insider story. For each such project, we go back in time and do an in-depth analysis of the project - historical context, technical breakthroughs, team, successes and learnings - to educate the next generation of engineers who were not there when those transformational projects were created.

In our first “season,” we start with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. And what better way to kick off the Data Engineering season than with an episode on Hadoop, a project that is synonymous with Big Data. We were incredibly fortunate to host the creators of Hadoop, Doug Cutting and Mike Cafarella, who shared with us the untold history of Hadoop, how multiple technical breakthroughs and a little luck came together for them to create the project, and how Hadoop created a vibrant open source ecosystem that led to the next generation of technologies such as Spark.

Timestamps

  • Introduction [00:00:00]

  • Origin story of Hadoop [00:03:26]

  • How Google’s work influenced Hadoop [00:05:47]

  • Yahoo’s contribution to Hadoop [00:13:51]

  • Major milestones for Hadoop [00:20:06]

  • Core components of Hadoop - the whys and hows [00:22:44]

  • Rise of Spark and how the Hadoop ecosystem reacted to it [00:27:19]

  • Hadoop vendors and the tension between Cloudera and Hortonworks [00:31:51]

  • Proudest moments for the Hadoop creators [00:33:56]

  • Lightning round [00:36:04]

Transcript

Sudip: Welcome to the inaugural episode of the Engineers of Scale podcast. In our first season, we'll cover the projects that have transformed and shaped the data engineering industry. And what better way to start than with Hadoop, the project that is synonymous with Big Data. Today, I have the great pleasure of hosting Doug Cutting and Mike Cafarella, the creators of Hadoop. And just for the record, Hadoop is an open source software framework for storing and processing very large amounts of data in a distributed fashion. Think hundreds or thousands of petabytes of data on, again, hundreds or thousands of commodity hardware nodes. If you have anything to do with data, you certainly know of Hadoop and have either used it or benefited from it one way or another. In fact, I remember back in 2008, I was working on my second startup, and we were processing massive amounts of data from retailers, coming from their point of sale systems and inventory. And as we looked around, Hadoop was the only choice we really had. So today I'm incredibly excited to have the two creators of Hadoop, Mike Cafarella and Doug Cutting, with us. Mike and Doug, welcome to the podcast. It is great having you both. [00:01:02]

Doug: It's great to see you. Thank you. Thanks for having us. [00:01:10]

Sudip: If you guys don't mind, I think for our listeners, it'll be great to know what you guys are up to these days. Mike, maybe I'll start with you and then Doug. [00:01:19]

Mike: Sure. I'm a research scientist in the Data Systems group at MIT. [00:01:27]

Doug: I'm a retired guy. I stopped working 18 months ago. My wife ran for public office, and it was a good time for me to transition into being a homemaker, doing the shopping and cooking. But I also have a healthy hobby of mountain biking and trail advocacy and development, trying to build more trail systems around the area where I live. [00:01:44]

Sudip: Sounds like you're having real fun, Doug. One day we all aspire to get there, for sure. I'm really curious to know how you two met. I've seen some interviews where you talked about how, I think, Doug, you were working on Lucene at the time and then connected with Mike through a common friend. I'd love a little more detail on how you met and how you started working together. [00:02:06]

Doug: It kind of goes back to before Hadoop, really. Hadoop was preceded by this project, Nutch. Nutch was initiated when a company called Overture, which we'll probably hear more about, called me up out of the blue as a guy who had experience in both web search engines and open source software and said, hey, how would you like to write an open source web search engine? And I said, that'd be cool. And they said they had money to pay me at least part time, and maybe a couple of other people. And did I know anyone? I didn't know anybody offhand, but I had friends. I called up my freshman roommate, a guy named Sammy Shio, who is a founder of Marimba. And I said, Sammy, do you know anybody? And he said, you should talk to Mike Cafarella. I think that was the only name I got. And I called Mike, and he said, yeah, sure, let's do this. [00:02:49]

Mike: So at the time, this would be late summer, early fall of 02. I had worked in startups and in industry for a few years, but I was looking to go back to school. So I was putting together applications for grad school, and I was working with an old professor of mine to kind of spruce up my application a little bit, because I had been out of research and so on for a while. And that was a fun project, but it wasn't consuming all my time. And so Sammy, who was one of the founders of Marimba, which was my first job out of college, got in touch and said that his buddy Doug had an interesting project and I should make sure I go talk to him, which was great. I was looking for something to do, and it came at just the right moment. [00:03:26]

Sudip: That was quite a connection, Mike. And then going back to that timeframe, 2002-2003: Doug, you started touching on how you began working on Nutch, which eventually became Hadoop. Would you mind walking us through the origin story of Hadoop? I know Overture funded you to write the web crawler, but what was their interest in an open source web crawler in the first place? [00:03:49]

Doug: I think that's a good question, to get back to some of the business context. We want to mostly focus on tech here, but the business context matters, as is often the case. So I had worked on web search from 96 to 98 at a company called Excite. I'd been pretty much the sole engineer on the backend search and indexing system. I then transitioned away from that and had written this search engine on the side called Lucene, which I ended up open sourcing in 2000. Also, in 98, Google launched, and initially they were ad-free. All the other search engines, and there were a handful of them, were totally encrusted and covered with display ads. Think magazine ads: just random ads whose space they managed to sell to advertisers. Google started with no ads, and they also really focused and spent a lot of effort on search quality. All they were doing was search. Everybody else was trying all kinds of things to get more ads in front of people, and Google just focused on making search better. And by 2000, they'd succeeded; with the combination of this really clean, simple interface and better quality search results, they had taken most of the search market share already. But they needed a revenue plan. This company called Overture had, in the meantime, invented a way to make a lot of money from web search by auctioning off keywords to advertisers and matching them to the query. Google copied that and started minting money themselves. Overture was nervous because they had this market, and they were licensing the technology to Yahoo and Microsoft and others, but they were worried that all of their customers were going to get beaten by Google and go out of business. So on one hand, they sued Google. That's an interesting side story. But on the other hand, they decided, we should build our own search engine to compete with Google; we somehow need to do this. They bought AltaVista. They tried to build something internally. And they also thought, you know, open source is this big trend; let's fund an open source one to have something to compete with. So they called me, and I called Mike, and we worked with a small team of guys there at Overture, led by a guy named Dan Fain, and we started working on trying to build web search as open source. [00:05:47]

Sudip: That is such phenomenal historical context; I don't think very many people, including myself, knew that. Interestingly, Google also came out with their GFS paper in 2003 and their MapReduce paper in 2004, which obviously influenced a lot of the work you guys did down the line. I'm curious: what do you think might have caused Google to publish those papers in the first place? Any hypothesis on that? [00:06:14]

Mike: I think you're putting your finger on something interesting and important, which was that, at the time, it wasn't common practice to publish a research paper that told you a lot of technical details about an important piece of infrastructure. I don't think it was part of some genius long-term plan to profit down the road. It was part of a general culture at the place to emphasize the virtues of publishing and openness and science. Maybe it helped them with hiring or something like that, but if so, that was kind of an indirect benefit. And it was really trend-setting. I mean, they ended up publishing a ton of papers. I think Microsoft and Yahoo and other companies followed suit. There's a whole string of really interesting papers throughout the 2000s and early 2010s, on systems that we might never have learned about had they remained totally closed. But it's interesting to think about the impact of the GFS paper on our experience, Doug. We had worked on Nutch for, I guess, about a year. And after about a year's time, I recall that it was indexing on the order of tens of millions of pages, but you couldn't get more than a month's worth of freshness, because the disk head just couldn't move that fast in a month. It was a single machine, and we were limited by storage capacity and by disk throughput on the seek side. If we wanted the index size to grow substantially larger, we had to have some kind of distributed solution. I remember we spent something like six to nine months, Doug, working on a dedicated distributed indexer for Nutch. I don't remember all the technical details; maybe you can pitch in a little bit there. But I do remember finishing it, or at least thinking we were finished, and then about five minutes later reading the GFS paper and realizing that we should have done it that way. [00:07:51]

Doug: I remember running it. I remember operating that thing. And I think we actually got up to 200 million web pages. This was still well before the MapReduce paper, and we were basically doing MapReduce by hand. We'd quickly learned that we couldn't do all of this on a single processor. I more or less knew that from my days at Excite; we had hit the limit of what you could do with single processing, and even then we were already doing some things distributed. So we needed a way to do it distributed, which was to chop up the problem into pieces and run them in parallel. Overture had bought some hardware for us. A friend of ours named Ben Lutch, another guy we brought on, was running that hardware in a data center somewhere, and we could farm off and run processes on those machines. But it was a lot of work. We'd run four things doing crawling and get that data down on the disks of those machines. Then we'd parse out all the links, and then we had to combine, to do a shuffle effectively, in MapReduce terms, and a merge of all that data on the different machines, and decide which pages to grab next, and then do the indexing. We had all that plumbing working, probably a 10-step process, each step of which was distributed across five machines. But it took you running and monitoring all of these processes for 10 steps, and shuffling files around by hand. We were just using SCP to move things between nodes. It was laborious, and I don't think practical for more than five machines; we would have needed to start automating all of that. Somewhere in there is when the MapReduce paper came out and automated all of that, and added in a lot of other reliability considerations, as did the GFS paper. We didn't have to worry about drive failures and machines crashing. With five machines, that practically didn't happen. But if we wanted to move up to 100 or 1,000 machines, we knew it would, and that we'd need all that. So it was a pretty nice gift. I mean, back to motivations for Google: I think part of it was, as Mike indicated, that they came out of academia, they had this don't be evil motto, and they felt good about sharing this. I think there was also a little bit of an agenda, in that at some level they didn't believe a technological edge was sustainable in any case; what you really needed to build was company culture and operations. I actually talked to Larry Page about that once, and he claimed that their only sustainable edge was operations, which I thought was an interesting claim to make. But also, they believed having an open source implementation would help them in recruiting, that people would already be familiar with these concepts, with this model, and when they came into Google, they could adapt more readily and come up to speed. Which again says they weren't worried about competition at a technical level, which is interesting. I'm very grateful they had that sort of high-minded attitude, because we were able to benefit tremendously. That was a big project. It took them, you know, five years with a huge team to work through a lot of different alternatives and come up with the solution that they published. And Mike and I could just go, hey, let's go implement that. We've got a blueprint here. We were very happy to take all their hard work off the shelf. [00:10:42]
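
For readers who have never hand-rolled one of these pipelines: the "shuffle" Doug describes, whether done by hand over SCP or automated by MapReduce, comes down to a routing rule that sends every record to the machine responsible for its key, so all records with the same key can be merged in one place. Here is a minimal sketch in Java, assuming string keys; the method name is ours, but the sign-bit masking trick is the same one Hadoop later shipped in its default HashPartitioner:

```java
// Route a record to one of numMachines nodes by hashing its key.
// Every record with the same key lands on the same node, which can
// then merge-sort its partition locally.
static int partitionFor(String key, int numMachines) {
    // Mask off the sign bit so negative hash codes still yield a
    // valid, non-negative index.
    return (key.hashCode() & Integer.MAX_VALUE) % numMachines;
}
```

Each of the ten manual steps Doug mentions needed some version of this rule, plus scripts to copy each partition to its node; that bookkeeping is exactly what MapReduce automated.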

Sudip: How long did it take you to kind of incorporate the ideas from those two papers, the GFS and MapReduce, into what you guys were building? [00:10:53]

Mike: I remember running about a 40-node version of HDFS roughly six months after the paper came out. So I think by the summer of 2004, we had something limping along. I do remember a problem with that version, though, if one node had the misfortune of going down. It's a simple thing that the paper doesn't dwell a lot on, but that one has to implement: when a machine goes down, a portion of the file system becomes a little more endangered, because you have fewer copies of those bytes than you would like to have. So it's time to copy them, to duplicate them more. And if a machine went down, the other machines, scrupulous to a fault, would absolutely blast as many bytes as possible to escape this dangerous situation right away. It would paralyze the entire cluster until it had done so. So you had to limit that a little bit. But it was roughly in that time that the basic system was limping along. [00:11:42]
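
A toy sketch of the throttling fix Mike describes follows; this is not the actual HDFS code, and the class and method names are invented for illustration. The idea is to queue under-replicated blocks when a node dies and cap how many re-replication copies run at once, so recovery traffic cannot paralyze the cluster:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative only: a stand-in for the real HDFS NameNode logic.
class ReplicationThrottle {
    private final int maxConcurrentCopies; // the cap that tames recovery traffic
    private final Queue<String> underReplicated = new ArrayDeque<>();
    private int activeCopies = 0;

    ReplicationThrottle(int maxConcurrentCopies) {
        this.maxConcurrentCopies = maxConcurrentCopies;
    }

    // A datanode died: queue its blocks instead of re-copying them all at once.
    void onNodeFailure(Iterable<String> lostBlockIds) {
        for (String blockId : lostBlockIds) {
            underReplicated.add(blockId);
        }
    }

    // Called periodically: start only as many copies as the cap allows,
    // leaving bandwidth for regular reads and writes.
    void schedulePendingCopies() {
        while (activeCopies < maxConcurrentCopies && !underReplicated.isEmpty()) {
            startCopy(underReplicated.poll());
            activeCopies++;
        }
    }

    // A copy finished: free a slot so the next queued block can proceed.
    void onCopyComplete() {
        activeCopies--;
    }

    private void startCopy(String blockId) {
        // In a real system: pick a surviving replica as the source, pick a
        // target node, and stream the block's bytes asynchronously.
    }
}
```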

Doug: Mike tended to focus on the core GFS algorithms and the core MapReduce algorithms, and I tended to focus more on hooking that all into the rest of Nutch and then running it. Mike also did some crawls up at UW. [00:11:53]

Mike: I should say, you know, one thing that I've always thought of as an underappreciated element of the project's success, and that Doug really took ownership of, was the readme and the out-of-the-box experience. You could download the thing, and an hour later, it could be working on 100 nodes. And at the time, that was just unfathomable. If you wanted to get, say, an eight-node distributed database system working at the time, well, you'd better have a budget to go hire some consultants from IBM to help you set that up, right? But the Nutch (almost Hadoop, by this point) experience in a distributed setting was really smooth, and Doug focused on all that stuff. It was really, I think, a big ingredient in its success. [00:12:20]

Sudip: Speaking from my own experience, it really shortened our time to value, and we didn't have to raise a whole lot of money because cloud was coming up. This was circa 2009; AWS was just coming up, so we could get up and running quickly because we had this nice project you guys had created. So, a very belated thank you for that. [00:12:32]

Sudip: It sounds like you guys had most of the plumbing before the two Google papers came out. Have you ever thought about how Hadoop might have looked if those papers hadn't come out? [00:13:00]

Doug: I think we would have struggled. I think we would have come up with some scripts to try to automate some of this stuff, and long-term, we would have struggled with reliability issues and scaling issues. [00:13:08]

Mike: We were really interested in Nutch, the search engine, at the time, right? The goal of all this work was to improve the search engine's quality: to improve its coverage, ranking quality, speed to indexing, and so on. And so, in that alternate reality, if things had been a little bit different, I like to think that we would have encountered the same technical issues that the Google guys did, and we would have gone through a similar kind of technical discovery process, just a little bit later than they did. Of course, they're very sharp. Would we have actually had as good an outcome? I don't know. But one thing that was nice is that the focus on the search engine led us to see some of these problems later than the Google guys, but earlier than most other people. We were a pretty small team next to the resources that Google had. [00:13:51]

Sudip: Yeah, resources, and how important search was. Obviously, that was the entire company, the entire business model, right? Speaking of which, another search giant at that time, Yahoo, had an amazing, disproportionate role in making Hadoop successful. And Doug, you, of course, went to work there, I believe in 2006, if I'm not mistaken. Maybe you could walk us through the role Yahoo played in Hadoop in the early days and how they made it such a successful project. [00:14:17]

Doug: Yahoo bought Overture, which was interesting, and they continued to support the work on Nutch. I think that deal closed in probably 2004 or something like that, 2005. And Yahoo was trying to build its own search engine to go against Google. They had bought a company called Inktomi and gotten a big team of engineers who had been running a web search engine, and they were trying to figure out the next generation of that. They recognized the need for a new data backend to hold all the crawl data and do the processing of it. They looked around, they saw the MapReduce paper and the GFS paper, and they said, we need that. And they saw this open source implementation, and they thought that would be a great starting point. And to boot, they had already funded it. I think that was a coincidence as much as anything, but it meant we already knew people there. I went and gave a tech talk about it in 2005. And in 06, they said, let's adopt this platform that you guys have in Nutch as our backend; we really want to invest in it. I said, great, I'll come work for you. That's what I would love to see: some serious investment. It was just Mike and I working on it, and it was going to be a long tail of debugging to get anywhere near the solidity that it needed. So I joined. They were like, we don't care about all the web search specific stuff, because we already have that. What we need is this backend stuff, the equivalent of GFS and MapReduce. We need the HDFS. And so we split it in two. They were also concerned about intellectual property issues working in the search space; they wanted to work in as narrow a space as they could, to limit exposure to patent issues and so on. So we split Hadoop out. I had that name waiting on a back burner. My son had this stuffed yellow elephant that he had named Hadoop. He just created this name, and I thought that would be good. It comes with a mascot. [00:15:55]

Sudip: And I purposefully didn't ask you that question, because that's one answer everyone knows, for sure. [00:16:01]

Doug: And so that was, I think, February, maybe, of 06. We split them out and refactored. It was a pretty small job. I think Mike and I had already factored the code reasonably well, so it was mostly just renaming a lot of things and putting them in different namespaces. And we were off to the races. The day I started, I think Yahoo all of a sudden had a team of 10 working on it. And we had a hundred-machine cluster of really nice hardware, compared to anything Mike and I had ever had before. [00:16:27]

Mike: Yeah, it was incredible. I mean, up until that point, it was Doug and I working on it. I think Stefan Groschupf had been contributing on the open source side for Nutch; he was kind of a notable contributor. There were a few other people, but when Yahoo invested in it, it really was an epic change in the number of people paying attention. It was great. We really expanded the set of people who were participating. [00:16:47]

Doug: By summer, they had probably a hundred engineers and thousand-node clusters on this. And I don't know if it was the end of 06 or in 07 when they actually transitioned their production search engine to running on top of this. It was a big process. Owen O'Malley started really making a lot of improvements there. Who were some of the other guys there, Mike? [00:17:06]

Mike: Arun Murthy, and Eric14 was the lead engineer, or the engineering manager, at that time. Raymie Stata, who I think had come to Yahoo as part of an acquisition, was the manager and kind of our champion for the project inside Yahoo for a period of time. He was really instrumental. So one thread of the story is the contingent nature of the project. There were lots of things that had to go right for this to be a success. And at many points, there were individuals or companies who decided to listen to the better angels of their nature. Whether it was Overture funding the project initially, Yahoo deciding to fund it, or Google deciding to write those papers, a lot of things came out right and eventually yielded a good outcome. But it took lots of people working independently to kind of happen into it. [00:17:48]

Doug: As an open source maintainer, my goal has always been to get a project to the point where I'm not needed, where it has a life of its own, where it's built up enough of a user base and enough of a developer base. And where Mike and I were in 05 with Nutch wasn't there. It was too raw. You could use it, and the out-of-the-box experience was as good as it could be, but it was unproven. I was working as a freelance consultant. I was getting tired. I was looking for a full-time gig. And IBM talked to me and said, we really want to invest in Lucene. And it would have been a nice job, but Lucene was fine. Lucene was off and running on its own. I didn't need to be there day to day, because IBM was already using it. Whereas Nutch really needed a sponsor. And that's when Yahoo came along. So it was really a great thing that they did. They took it and made it real, got it to that point where it really proved itself. And then Facebook and Twitter and all the rest could start using it. [00:18:39]

Sudip: In hindsight, what was your take on Yahoo's relationship with an open source project? Was it smooth sailing internally, where people really believed in it? Or was there some push to keep it within the walls? [00:18:53]

Doug: It wasn't something they had done a lot of before, so it was new ground for them at a corporate level. Because I'd been doing consulting in open source for a while, I had learned that I needed a clause in my employment agreement saying I could contribute to open source. None of the Yahoo employees had that. And so although the engineering management was using this, investing in this, and committed to the vision of open source, the lawyers said, no, Yahoo employees, except for Doug, can't actually contribute to open source. So they had to submit all their additions, and then I could commit them. It was this one little step that I had to do for well over a year. It took a change in Yahoo's CEO before we could get somebody to override the legal department and say, it is actually okay for Yahoo employees to directly apply changes at Apache. So there were a lot of things like that. The other thing that was an issue is that Yahoo, as I said, had 100 people working on this. There were a handful of people in other companies starting to use Hadoop, but nobody had anything like that. And in open source, it's hard not to dominate when you've got that big of an imbalance. We really wanted to build an egalitarian community where everybody weighs in, and it was hard for Yahoo to not be the 200-pound gorilla. There were some growing pains around that over the years. [00:20:06]

Sudip: Before I shift to talking about the main components of the Hadoop ecosystem, I want to read through a couple of milestones that I found, and I'd love to know if any of them sound completely off. I found that in 2007, within less than a year after you joined Yahoo, Doug, they were using Hadoop on a thousand-node cluster. Then in April 2008, Hadoop apparently defeated supercomputers and became the fastest system to sort an entire terabyte of data. Then in April 2009, I think Hadoop was used to sort a terabyte of data in 62 seconds, beating Google's MapReduce implementation. And finally, also in 2009, it was used to sort through a petabyte of data and to index billions of web pages. These are heady, heady milestones. I'm curious, what was the feeling inside Yahoo at that time as you were hitting those milestones, which in some cases went way beyond what you had set out to do? [00:21:08]

Doug: That was mostly Arun and Owen doing those benchmarks and driving that forward, I think. [00:21:13]

Mike: We were pretty stoked. It was awesome. Doug's right; the point is to get it to the point where it won't die. And people that we knew, but were not working with every day, were taking it and doing amazing things with it. That's what you want to have happen. It was thrilling to see. It was really fun. Yeah, it also gets to motives, back to why Google published those papers. [00:21:34]

Doug: It gave their employees public visibility, and employees like that. You want employees to be happy. I think part of the motive for Yahoo adopting an open source solution was that people like to be visible in the outside world, to get more peer recognition, and being involved in open source gives you that. So it made Yahoo a more fun place to work, and more rewarding. And also being able to try to beat these kinds of records. Again, it's great for recruiting and retention and employee morale, so long as you believe you're not giving away the bank. When you do that, I think you build a much stronger company. Yahoo's profile in the consumer space isn't what it used to be. But in the 2000s, they were maybe not as financially successful as Google, but the technical depth of the company was really good. They had a ton of people working on Hadoop. They had a lot of people in different parts of the company; Yahoo Labs was a research lab that was fantastic at that time. The technical skills inside the company were great. Google was getting a lot of press. So I always felt that for some of these guys, it felt great to go make a splash and put up some numbers the way the people in Mountain View were. And they certainly had the brains to do it. So I thought it was great. That's sick. [00:22:44]

Sudip: I want to shift to discussing some more technical stuff. As I understand it, Hadoop has had four main components: HDFS, MapReduce, YARN, and Hadoop Common. I'm curious, if you wouldn't mind spending a little time on it: what was the motivation to create each of those components, and what do you think was the guiding principle in building that ecosystem of four main components? [00:23:09]

Mike: I'll try to address some of this, Doug; you were more closely involved with some of these components than I was. I should say, the Yahoo engagement in 05 to 06 and so on was thrilling in a lot of ways. It was also kind of the beginning of the end for me, because I was in grad school at the time. When you have 100 people working on the project, people would file bugs and fix them before I could come in to put in my 10 hours a week. So at some point, I had to decide whether I was going to actually get my PhD or keep trying, ineffectually, to contribute next to everyone else. So I had to pull out by 07 or so, and some of these later components I don't have a ton of insight into, but I can comment on some of them. HDFS was the original Google File System element. The original version, I think, reflected a set of design choices that informed a lot of Nutch, which was a focus on correctness rather than performance or other things. That's why Java was the programming language we chose for all this, not known for being super high performance at the time, or arguably now; I think it's not as bad as people think. But the thing that was good about it was, we were really worried that if the system is wrong, or if no one can contribute, it'll die. If it's slow, people can probably live with it a little bit. And again, I think the performance penalty is a little bit overstated. So HDFS, the initial version, was really designed to emulate GFS in many ways. In some ways, we made decisions that reflected our particular use case: there were certain mutation operators, or append operators, in GFS that were not supported in the original version of HDFS. There was no traditional failover in the original version; that would have to come later, with ZooKeeper and a few other services that offered high quality distributed system safety. So the original version of HDFS was an emphasis on correctness and just bulk storage. And that's turned out to be an enduring advantage of the whole project, right? That's still something that, a long time later, people really need. It's become technically much more sophisticated than it was originally. But that emphasis on reliable storage that is as cheap per byte as possible has still proven to be a good idea.

Doug: Yeah, I mean, I'll just add, you know, HDFS and MapReduce were kind of the original two functional components, each modeled after a paper from Google. The common portion is just the utilities that we needed to build this kind of system: RPC, storage formats, just that kind of library. YARN came along a little later. That was a project led by Arun to abstract the scheduling out of MapReduce and come up with a general purpose scheduler that other systems could use to schedule tasks across a large cluster, once you had one built up, for more than just MapReduce. [00:25:53]
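
To ground the HDFS part of that answer: the client API (org.apache.hadoop.fs) deliberately looks like ordinary file I/O, with write-once files and replication handled behind the scenes. A small, hedged example follows; the namenode URI and paths are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; in practice this comes from the
        // cluster's configuration files.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Write a file; its blocks are replicated across datanodes
        // behind the scenes.
        Path p = new Path("/data/example.txt");
        try (FSDataOutputStream out = fs.create(p)) {
            out.writeBytes("hello, hdfs\n");
        }

        // Read it back.
        try (BufferedReader in =
                 new BufferedReader(new InputStreamReader(fs.open(p)))) {
            System.out.println(in.readLine());
        }
    }
}
```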

Mike: One of the cool things at that time, you know, this is the late 2000s, was that people were learning from the MapReduce example, both from a systems perspective and kind of a science perspective. And they were building much more ambitious systems in the original MapReduce pattern. So there was Pig, which was like a SQL processor built on top of it. Hive would come out pretty soon. From Microsoft, maybe a year or two later, there was a system called Dryad. And these were kind of arbitrary computation graph systems. The original infrastructure of MapReduce inside Hadoop couldn't support that stuff, but YARN could; they could be much more ambitious about the computation graph. [00:26:27]

Sudip: And as I understand, Hive was the one that let people use SQL to write MapReduce jobs, right? That kind of opened up the user base quite a bit. [00:26:38]

Mike: Yeah, that's right. Pig was very similar, but it used a different syntax. It didn't use SQL syntax, but it tried to obtain something similar.

Doug: Mike and I weren't doing science. We were doing engineering on this project, and maybe some social work, trying to build communities and so on. To really evolve, you need to have people experiment and try to build new kinds of systems. And I think Hadoop really inspired that in the open source space. We saw a huge number of things follow on, many of which succeeded, many of which didn't. So we gave people this example, and then they could do some science based on our engineering example. But in the case of Hadoop, MapReduce and GFS were really the science part. [00:27:19]
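
To make the Hive point above concrete: an aggregation that would otherwise be a hand-written MapReduce job becomes one SQL statement, which Hive compiles into MapReduce jobs behind the scenes. A hedged sketch using Hive's JDBC interface (the HiveServer2 host, port, and table name are placeholders, and the Hive JDBC driver is assumed to be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2's JDBC endpoint; host, port, and database are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default");
             Statement stmt = conn.createStatement();
             // Hive plans this GROUP BY as MapReduce work under the hood:
             // a map phase emitting (word, 1) pairs and a reduce phase
             // summing them.
             ResultSet rs = stmt.executeQuery(
                 "SELECT word, COUNT(*) FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```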

Sudip: Hadoop really succeeded as, and was probably designed as, a batch analytics system in the first place, right? And then as new use cases were coming up, did you guys ever consider adding more of a streaming use case, or more of an interactive analytics use case? Was that ever a consideration for you? [00:27:34]

Doug: Not really. I mean, we saw other people coming along and addressing those kinds of use cases with Spark and other systems that really addressed them properly, and they were layered on top. Our systems were designed from the outset to be more batch oriented. A lot of these things were incremental, right? The HDFS and MapReduce processes were very batch oriented. If you wanted something more interactive or immediate, then maybe you'd use a traditional relational database. And I think what's become apparent since then is that there are a lot of different points in between, right? At the time, batch oriented execution was what I thought made it distinctive. Having a MapReduce experience that was a little more interactive, although not what you would call real time necessarily, turned out to be a pretty compelling point in the design space, and I don't think that was obvious to me in 2011 or 2012. The big advance was that before that, you couldn't process that kind of data. Even if you had the hardware, there wasn't really commercial software for it. That was, I think, what really excited me when I read those papers: it opened up this whole realm of managing terabytes and being able to do computations over them. My background originally was in computational linguistics, and doing processing over large corpora of text is critical to that, but you couldn't do it until we had this. And open source really enables that in a way that proprietary solutions, which aren't as accessible to everyone, don't. That was the thing that really excited me about it: the possibilities it opened up for folks doing creative tasks with large amounts of data. I mean, you could do that stuff back then. People had computers that you could attach to a network. Distributed programming was possible, but the distributed programming libraries were for researchers, not everyday programmers. And as for storage costs, if you wanted a lot more storage than the small number of disks hanging off your PC could supply, you could go buy a RAID device or an EMC-style device that was incredibly expensive and didn't give you the scalability. So for most people, it really opened up a lot of stuff they didn't have before. And the MapReduce API looked really clean and elegant, and it made it really simple to process huge amounts of things. Nowadays, people look at it and say, ooh, how clumsy, you really have to do all that work? But at the time, it was really groundbreaking in its simplicity. [00:29:54]
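
The simplicity Doug describes is easiest to see in the canonical example: word count, written against Hadoop's MapReduce API. You supply a map function and a reduce function; the framework handles partitioning, shuffling, sorting, and retries on failure. Below is a lightly condensed version of the classic example, with input and output paths taken from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in an input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework has already grouped and sorted by word;
    // we just sum the counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```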

Sudip: I remember watching an interview of you, Doug, back in 2016 or so. Somebody asked you what you would see as the success of the Hadoop ecosystem, and I think one comment you made was, I kind of expect some of the Hadoop pieces to shrink and be replaced by new tools, which obviously turned out to be a completely accurate prediction. So I'm curious, how did you think about it when you saw something like Spark come up, which essentially was the newer version of MapReduce, obviously using in-memory processing? What were your first thoughts? Did you consider adding maybe an in-memory extension to Hadoop, particularly the MapReduce portion, to do something similar? [00:30:32]

Doug: I don't think of it as a competition where Hadoop had to keep up with these other projects. Rather, for me, the bigger mission is trying to get more technology in open source. And so that's a success. When Spark comes along and makes some fundamental improvements and provides something that can replace, in many cases, maybe not in every case, Hadoop, that's a good thing. More power to it. The process is working. We're making progress because we're not selling anything. It's very different than a commercial marketplace, where if a competitor comes up with something, you need to try to match it. In open source, we're much happier to have different projects complement one another. [00:31:09]

Mike: I fully agree with everything Doug just said. The thing that made those sort benchmarks and the engagement with people inside Yahoo and elsewhere so thrilling was people cared enough to make it better. People who work on Spark cared enough to make it obsolete in some ways. It's not that different. They actually thought it was worthy of paying attention to, and they did a better job. [00:31:30]

Sudip: Great, awesome! Let me close with a couple of things. One is the three vendors that came out of the Hadoop ecosystem. There was obviously Cloudera, which started in 2008, as I understand it. MapR came out in 2009, and finally Hortonworks in 2011. Doug, of course, you were involved in Cloudera from pretty much day one. I'm guessing, Mike, you also probably advised some of them. [00:31:51]

Mike: I did a little bit of advising, but I had gone on to my professor career. I was mostly out of the business, but I did a little bit of consulting. [00:31:57]

Sudip: I'm curious, looking back, how do you see the three vendors in the Hadoop space? What did they do right? What did they get wrong? And if you were to do a company based on Hadoop today, knowing what you know now, is there anything you'd do differently? [00:32:11]

Doug: There were definitely some unfortunate things that happened in those years. VCs approached me in probably 2007, a group of folks from Accel and a couple of other firms, and said, we want to start companies in this space. And I was like, fine, I'm not interested in that right now, I'm happy at Yahoo. But if you do, please get together and start one. It's going to be complicated enough for the open source project to deal with a startup; dealing with multiple ones trying to stab each other is just going to poison the open source community. We really don't need that. And the folks I talked to started Cloudera, and Cloudera got off to a start, and I joined Cloudera. But unfortunately, we still had bad blood, because Yahoo perceived it as a threat. Yahoo had gotten all this acclaim for investing in open source and building this amazing system. And now some of that acclaim was being taken by someone else, by a startup, and moreover, money was being taken, profit was being taken. And Cloudera had stock options in something which might become big, while people at Yahoo were working at a big, already public company. And so there was resentment, which led to conflict in the open source community, which stymied the project for a while, and eventually led to the team from Yahoo creating Hortonworks as a competitor to Cloudera. And the two of us went at it pretty non-productively. It wasn't, I don't think, a really healthy competition where we egged each other on. Rather, we were selling very similar products in the same market and undercutting each other. Maybe it was good for consumers of the software, the customers, yeah. It wasn't great for either company, and it wasn't great for the open source projects that the bitterness spilled over into. So when Hortonworks and Cloudera finally merged, it sort of put that to rest finally. It wasn't ideal, but it was what it was. [00:33:56]

Sudip: Absolutely. Coming back to Hadoop, as you look back over, I'm thinking, almost 20 years now since you started, what would be the proudest moment or moments in your view? Maybe, Mike, I'll start with you and then Doug. [00:34:13]

Mike: I'll mention one funny one, which is actually from the Nutch era rather than Hadoop. Nutch was successful in that a lot of its code lived on in Hadoop and so on. But as an actual system, were there that many Nutch users in the world? Not that many. However, there was one that was very notable to me personally, which was the Mitchell, South Dakota Chamber of Commerce. I remember this very clearly. They ran a small intranet search site on it. Maybe that's not that notable, but if you've ever driven cross country past the Corn Palace: I believe the Chamber of Commerce actually owns and operates, or at least did at one time, the Corn Palace. So there's a piece of Americana that was tied to Nutch. Whenever people asked me about Nutch, I would brag about the Corn Palace, how they were secretly running my code. That was pretty great. On Hadoop itself, I think just the breadth of it. Those sorting benchmarks were very exciting. The fact that many universities based undergraduate classes on it, that felt great. Those were all really notable moments for me. [00:35:09]

Doug: For me, Hadoop exceeded my wildest expectations. When we started out trying to re-implement GFS and MapReduce, my goal was to provide open source implementations of things that researchers could use to manage large data collections. To some degree, it didn't hit me the degree to which corporations had massive amounts of data that they weren't harnessing. That was what had occurred to these VCs; they knew that from the companies they were talking to. But I was not involved in enterprise software at all prior to joining Cloudera. I didn't know anything about that whole market that was out there. And to see that explode, to see banks, insurance companies, governments, these kinds of institutions really take off using this stuff, was pretty amazing. I didn't see this becoming a staple of enterprises by any means. It was far beyond my imagination for where we could go with this. [00:36:04]

Sudip: That is a very nice segue into the last thing we like to do on every podcast, which is a lightning round. So maybe I'll start with you, Doug, since you kind of brought us here. Three quick questions. The first is around acceleration: what has happened in big data that you thought would take much longer but has already happened? [00:36:23]

Doug: I guess, I mean, I'm an optimist. I always think things are going to go quickly and go well. But that said, I'm not sure I expected the moves to open source and to the cloud to happen as rapidly as they have. I think we've really seen open source take root in enterprise data technologies and become an accepted way of doing things, which 20 years ago it was not at all; everything in enterprise was proprietary, pretty much. And also running things in the cloud. People really wanted to keep their data on their own servers, and the cloud was not widely trusted. We've seen a real 180 there. To me, it was always appealing. I don't want to run servers. I much prefer being able to rent a server and treat it as a service. So that's been a great thing to see. Mike? [00:37:10]

Mike: You know, if you're going to ask what was surprising, or what I thought would take longer, AI is a very true but in some ways boring answer to that question. Two things are interesting about the revolution in AI that we've seen, going back, I would say, to 2012, when some of the first vision models became really good, and it hasn't been stopping. The first is the extent to which big data has been an enabling technology for the modern AI stack that we see now. Even if you had had the idea that neural approaches should be turbocharged, you probably couldn't have done anything with that observation in 2002. And the other thing, which is especially interesting to me, is the way that neural models, these really large scale models that are produced at incredible expense, have migrated so quickly to open source. The open source AI stack is really good. And when you consider that they face a lot of the challenges that we did around Nutch, which is, you've got no hardware and you've got no data, but make it work anyway, it's really impressive to me what the open source AI community has been able to do. [00:38:10]

Sudip: I think what is interesting is that Google, back when you were building Hadoop, was kind of the protagonist. And now they are kind of playing defense in some ways, right? Because of AI and what is happening in open source, thanks to Facebook, Meta, and so on. So it's interesting to see the tables turn a little bit in that way. The second question is around exploration. What do you think is the most interesting unsolved question in your space? Maybe, Mike, I'll start with you. What do you think is still not solved that you'd love to see happen? [00:38:43]

Mike: I've got a ton of answers to this question, which I hope doesn't make me seem ungrateful for everything good that's been happening. One thing I would say is that a lot of what people store in these enormous HDFS clusters is documents, right? Like, we've got a huge store of company documents. But understanding of a document beyond just the text is pretty poor. Understanding images or plots, the kind of multimodal form of a document, is generally not that great. I'm hopeful that some of these AI approaches will make that better. Another thing, which is maybe at the very top of the big data stack, is that I'm getting kind of sick of dashboards. I've been seeing the same really complicated dashboards for 15-plus years. Like, here's my data center or my complicated system, I've got a big data system underneath, maybe Hadoop, maybe MapReduce, maybe Spark, that is collating a ton of data, and I boil it down into some neon acid green set of dashboards that is, honestly, pretty unpleasant to manage. We have these extremely large and high dimensional data sets; I'd like to move beyond the big pile of dashboards that most people use to investigate what's going on. I don't know exactly what the answer is, but we've been dealing with the dashboard metaphor for a long time, and I think we need some innovation there. [00:40:02]

Sudip: I think on that point, Peter Bailis at Sisu Data is trying to do something like that, where he's trying to move you away from dashboards and really focus on what is going wrong in your business. [00:40:13]

Mike: Yeah, I think that's one possibly really interesting direction. Doug, for you? [00:40:17]

Doug: One of the big challenges I think we haven't yet met, and I'm hoping we will, is really dealing with issues around privacy and consent and transparency. It's not strictly technical, but there are technical aspects. We want to get value from data, and much of the data which is most valuable is about people. Respecting those people's rights and getting value out of the data at the same time can be in conflict, and coming up with mechanisms to really handle that conflict and deal with it in a reasonable way is hard. I think we're only in the early days of that. [00:40:54]

Mike: I think we will see progress. [00:40:55]

Doug: I think as a society, we've seen that as we've adopted other technologies, it takes decades. If you look at food safety and automobile safety, those took a long time to evolve, and how they're managed and regulated is continuing to evolve and develop. Healthcare safety too. And with data safety, we've got some very crude things we're doing so far, and there's a lot of room to advance that. I wish the industry took it more seriously and led rather than followed, having to deal with laws crafted by non-technical folks rather than coming up with strong technical solutions that really respect people. Anyway, that's one that I'm concerned about. [00:41:34]

Sudip: Beautiful. Last question for you guys. What's one message you would like everyone to remember today? Maybe, Mike, if I can start with you. [00:41:42]

Mike: I think I'll mention that the success of the Hadoop project over the last, I guess, 20 years certainly took a lot of hard work by a lot of people. It also required a huge amount of luck. If I look at the amount of work that I put into this versus something else, and I think Doug would say the same thing, Hadoop has been dramatically more successful than a lot of projects, and I don't feel like the work on it was any better or worse than on some others. There's a heavy amount of luck in it. As I mentioned, there's also a heavy amount of contingency. At a few crucial points, some people decided to do something pretty good for the universe. Google didn't have to publish those papers. Yahoo didn't have to keep funding an open-source project. But they did something a little bit better than they had to, and it turned out to be great. So if you're listening to this, and you're in a position to do something a little bit better for the tech universe than you otherwise might have to, maybe you have the seeds of Hadoop on your hands. You should go for it. [00:42:30]

Sudip: That's a fantastic point. [00:42:31]

Doug: Yeah, I definitely want to echo Mike in that we were incredibly lucky, and at the right place at the right time with the right skills to move this along to the next step. That said, a strategy that I try to employ, and I assume Mike does as well, is that you do want to aim big. You do want to aim high. And at the same time, watch your feet so you don't trip. It's this constant challenge of how to maximize the outcome without compromising your goals, and I think that's the art of doing this: trying to find something which satisfies all these competing concerns. Obviously, we wanted this project to be successful, and we looked around and found the gifts that people had laid out there for us, and opened them and ran with them. It was luck, guided by, I think, some successful ability to compromise and find the right path. [00:43:19]

Mike: Doug, there's one question I was hoping you would answer during this conversation. Maybe I'll just ask it, because I don't want the conversation to end without knowing. Occasionally people ask me: if we were to do it all over again, would you still choose Java? And I give my answer, but I want to know yours. Would you still use Java to do all this stuff? [00:43:34]

Doug: Yeah, no question. I did my share of C and C++ programming, and it's painful for this kind of thing in particular. We wanted to focus on algorithms, on the overall architecture, and keep things as simple as possible. And to have done it in another language, I think would have been premature optimization. I think we were able to get solid performance by optimizing where we needed for the most part. So yeah, I don't have a regret there. [00:44:00]

Sudip: How about you, Mike? [00:44:01]

Mike: I agree with all of Doug's points; I would not have done it in a lower level language. The only question I have sometimes is whether it should have been even higher level. The claimed performance problems with Java, I never really observed them, or maybe at that point in the project they just weren't important. I wonder if we should have done it in Python or Perl. Like, how high up the stack could we have gone and still had a successful project? [00:44:21]

Doug: Yeah, I think I'm a little bit more of a language snob. [00:44:23]

Mike: Fair enough. [00:44:27]

Sudip: Fantastic. Thank you so much, guys. [00:44:29]
