Hadooponomics: Building Big Data, Better: Why Integration, Not Infrastructure, Is Key (Podcast Transcript)

Hadooponomics Ep. 14. Listen to the original podcast.

James Haight: All right, welcome back, everyone. This is James Haight, your host of the Hadooponomics podcast. Great that you could join us again today, and we’re really re-energized about this push for Season 3. A lot of great feedback, we’ve got a lot of momentum, couldn’t be happier about it. So please keep the comments coming, reach out to us, we’d love to hear your feedback and build off of that.

And so in order to keep on delivering this, we have today another fantastic guest. We have Yaron Haviv. He’s a CTO and co-founder of iguazio. Really exciting Big Data startup in the field right now. And the conversation with him lets us look at things in a little bit more of a technical angle than we usually can talk about here on the show. Really, we take advantage of the fact that he’s a technical co-founder, and we look into this conversation around open source versus open standards in the Big Data ecosystem in a way that we really haven’t before. For anyone who’s actually interested in this conversation, I would absolutely recommend going back to the archives to Episode 5 where we talked with New Stack, would be a great pairing to go with this episode, to give you another view of a different side of the same coin regarding this. And Yaron, he really gives a cool perspective on the friction in the market that this has created, as well as the opportunities that it’s created, and sort of how we should be viewing it and how we should prepare ourselves to invest in a Big Data future where so many things are changing all the time. How do we make an investment that is going to be worthwhile in two or three years? So a really interesting perspective there. He is a cool guy making a lot of moves in this space, and if you guys wanna follow up with him, bluehillresearch.com/hadooponomics. You’ll find the show notes there as well as the contact information with how to get in touch with him. I know he’d welcome that. And, of course, we’ll have things like the transcript and how to get in touch with us on the show. Find me at @james_haight on Twitter, got a lot of good stuff up there for you.

So this episode, like I said, lets us dive into things on a more technical aspect. So if that’s your cup of tea, sit back, relax, and enjoy. And if it’s not, then I suggest coming at this episode with some open ears and an open mind, and perhaps learning about things from a little bit of a more deep perspective than what we usually touch on. So in either case, I’m gonna step back, and let’s go straight into the interview.

All right everyone, I am here with Yaron Haviv. He is the CTO and founder of iguazio. He’s an interesting guy, he maintains a blog as a noted industry expert and thought leader in the Big Data space. Yaron, welcome to the show.

Yaron Haviv: Thanks, and nice being here.

James: So, Yaron, one of the things that we’re excited to have you on the show about is there’s a whole lot of things going on in the Big Data world and you’re plugged in in this really interesting position to sort of uniquely observe what’s actually happening. But what I wanna do is first have you just tell us a little bit about yourself. Who are you, what do you do, and we can sort of move into that angle from there.

Yaron: Great, so I love technologies, I think I started at the age of 12 by hacking computers. I’ve been CTO and VP in several companies, coming mainly from the infrastructure space but also dealing with a lot of the higher levels of the stack. I was the CTO in a company called Voltaire, which dealt with high performance applications in various fields. And then VP for Data Center Activities in a company called Mellanox, which is sort of one of the leading providers in the networking space, especially around cloud infrastructure, used by sort of most of the cloud providers, and database vendors, and storage vendors, etc. And my role at Mellanox was owning all the relations and technical aspects of open source communities like Hadoop, OpenStack, etc. Working with tier one cloud providers on creating better solutions together for them. And it gave me sort of an interesting perspective on, one, how various open source projects operate, like Linux and OpenStack, and the Apache ecosystem, as well as sort of insight into how cloud providers operate. What their architecture, internally, looks like, not what people say they think it looks like. So that’s an interesting perspective.

James: Absolutely, and I think we’re gonna dive a lot into that sort of open source conversation, I think you have a pretty unique vantage point into that. But the place I wanna start off in is really just taking a look at the Big Data market as a whole. Certainly we get a lot of functional experts on this show, we got a lot of people who are very good at what they do specifically, or from sort of the data visualization interpretation standpoint. But I think, since we have you on the show, I think there’s a really unique opportunity to dive into what’s actually happening in the industry right now. And I know you’re currently at one of the largest Big Data conferences out there, you’re in New York at the Strata-Hadoop event. So why don’t you just give us a quick overview of sort of what you’re seeing at the event, how’s it going, then we can sort of paint some broader strokes of how the industry’s actually moving.

Yaron: Yeah, so it’s actually only started, really, today, because I think there were tutorials the first couple of days. But it did start being crowded at the show we were at yesterday, all the swag events in the evening, as well. I think one of the things that sort of I’ve been talking to some of the analysts who were crawling in the booths in the evening events, and I think people are sort of confused over so many solutions, and how do you really differentiate between one another. And it’s interesting, we were joking with some of the guys I talked to from other companies and the analysts, it’s like, you go, you look at the slogans. Everyone is sort of a data lake, everyone is sort of a faster, bigger, better, everyone is streaming. And then when you dive in underneath sort of the marketing fluff, it’s sort of a different thing. And people are sort of trying to adapt their marketing messages to what their customers need, not necessarily to convey what their products do.

James: Yeah, and one of the things, I can relate to this first hand, my job is to cover [laugh] the Big Data market, right? My job is to stay as informed as possible and talk to all these companies, and try and understand what’s the difference between what people actually do and what they say they do, and it’s hard, right? And I spend my entire time and attention on trying to do that. So for people in our audience, we have a lot of Big Data enthusiasts, we have a lot of practitioners, I have to imagine they show up to an event like this and that is a constant struggle. So what are we looking for in terms of actual, meaningful indicators of how to differentiate solutions and how to match it up to what we’re trying to bring back to our company?

Yaron: Yes, I do think that a lot of the solutions, what they do, there are a bunch of different components and they’re all sort of trying to create better simplification by combining all those components, building some glue code around some user interface, some automation of various procedures. I think that’s sort of a lot of what you see in the show. I haven’t seen sort of real, fundamental, groundbreaking technologies announced yet. Maybe I’ll be surprised. It’s more around sort of the glue logic and building sort of more usable solutions.
I do think that next year, or in a couple of years, we’re not gonna see many of the players here, because sort of the barrier to entry of building such integration solutions is not that high. A lot of companies can basically take some Angular UIs and take sort of a Python glue logic and Java stuff and basically form those integrated platforms. Some will have better technology, some will have sort of better marketing and will gain mind share, but because there’s sort of so many of those it doesn’t make sense to keep all of them.

James: Sure, and so going off of that, what are the things that you think are actually gonna stick around? Certainly there are fundamental movements in technology every year, and certainly, at the event last year, everything was all about Spark. And you mention now everyone’s got a data lake. What are the important things that are actually sticking around? What are the big trends that you see that are more sweeping that are actually impacting change in the future of the Big Data industry?

Yaron: Yes, I think, first, let’s think of sort of the megatrends in the industry as a whole, not specifically around Big Data. So the way I look at it, there are three megatrends, okay? One is IoT, basically tons of sensor data. And I can tell you later what we did for an engine demo on the IoT. Second one is sort of the move towards Software as a Service, so basically companies that instead of buying software for on-prem, they’ll keep on using sort of the Salesforce model, and all the ERP, CRM, all those boring things will turn into sort of mobile apps. That’s, I think, sort of the second megatrend. And the third megatrend is [laughs] going to be around organizations that build those sort of, let’s call them data lakes, or next generation data warehousing. So instead of centralizing the organization around computation, building lots of servers, server farms, all that stuff, with silos of data scattered around the Oracles, and SAPs, and mainframes, I think there will be sort of a shift towards centralizing the data, creating data centers for data. That’s why they’re called data centers. And then the computation will be sort of the thing around it. And computation could be mobile workers with their laptops, could be mobile devices, could be Raspberry Pis. And I think that’s sort of where those three megatrends that I envision will happen.

James: Mm-hm, and so when you take these megatrends and then you sort of boil them down or distill them to specifically the Big Data landscape, and I’m just thinking of folks in our audience, of what they should be on the lookout for. A lot of people in our audience, they work for companies and their job is to keep their eye out for the latest and greatest Big Data technologies, or just analytics technologies, to bring it back. What happens when we [laughs] start living in this connected, mobile world with a whole lot more computation? What’s the end result?

Yaron: Yeah, so I think if I sort of try to draw the conclusions from those megatrends, what can we see sort of in common? I think, first, it’s a lot of data, it’s a hybrid, because some data will be on-prem, some data will be in the cloud, some data will be sort of hosted somewhere, or on your mobile devices. So you have to have sort of a hybrid approach that scales. You need to move to a paradigm which is more like Software as a Service or Platform as a Service versus maintaining infrastructure, if you want to be able to build your applications in the speed that organizations require. And this is, I think, where sort of the cloud providers have a better chance than many of the technologies we see today. If you look at the simplicity that Azure brings, or AWS brings to Big Data, it’s totally different than rolling out a Hadoop cluster.

James: Yeah, certainly, and one of the things I notice with a lot of the folks that I work with and talk to is a lot of people made big bets early on, and they’ve run into issues of they don’t have the skills on staff to actually figure out how to spin up these Hadoop clusters, or have the people to make it happen. Or the technology was just overly complex and it was just too technical, had it been abstracted or simplified away. Curious on two points. One, do you see that, and then two, what’s not working for people? What are the things that we should be avoiding, because now better technology exists where we can actually make our lives easier if we’re just a little bit smarter about what we get into?

Yaron: Yeah, so first thing, the advantage of some of the cloud providers is because they have such a mighty force and all the integration capabilities, is that they sort of integrate everything. So if you think about Hadoop, how they do security, that really depends if you’re gonna ask a Cloudera or Hortonworks, or someone else, and you will get the conclusion that there’s no security. And if you are going to work with, let’s say, Amazon, and then basically you have access control, and IAM, if you know the model across everything, doesn’t matter if it’s an S3 or a Redshift. Or same for server elasticity, you don’t need to deal with the tuning aspects. So I feel the cloud providers have an advantage of sort of vertically integrating the solution to address all the aspects of user interface, security, high availability in a cohesive manner. And because of this huge fragmentation in the Hadoop camp, and hundreds of Apache projects, it’s not really possible to do this level of integration. We’ll probably get later to talk about sort of the open standards versus open source. But I think that’s the big challenge. It’s pretty impossible to build a cohesive solution today. We have so many moving parts that sort of step on each other. They’re not integrated, because if I’m new and start building a streaming solution, I also don’t have the energy to go and integrate with a couple of hundred projects that do the other things in the ecosystem. So I think that’s sort of the advantage that cloud providers and sort of vertical integration have.

James: Sure, and you hint at where I wanna take the conversation next, and sort of, we’ve had a few discussions on this show, and I think our audience will remember back, about open source and the relative merits and the things to watch out for, and how we should think about it perhaps a little bit differently than it’s often portrayed. So I just wanna start with, first, asking you to just expand on this open source versus open standard argument, and we’ll sort of poke into it a little bit more and tease out a few details that I think our audience is gonna like.

Yaron: Good, so, I have many years of experience in open source. I was in different projects, some of them were actually funded by the government, some of them were sort of initiated by industry organizations. And I could sort of try and put the line of where things were more successful and where things sort of failed. And I think where things were more successful is that there was more work within the group to create sort of standardization of layers, of goals, and those kinds of things. So if you think about Linux, I think, is the most successful open source project. If you think about it, there is Linus, who is very strict, I don’t know if you’ve happened to send a patch to Linux and experience some of Linus’s comments. Very strict on how things need to be done, very strict about the layering model of Linux. If you happen to build a USB device, you fit into the USB device model, and you can try and change it. And there’s a process with a different area, sort of managers that will accept those things. In many cases, in order to push new technologies to Linux, they have to have sort of the standard. And this is an area where I worked in companies, we wanted to put a feature which was pretty unique, in the market, and we sort of got back pressure, and you can’t push this feature because it doesn’t have any sort of backing from a standard or something like that. So what it helps you is that you have very standard layers. If I’m writing software, and I’ll use POSIX semantics, or I’ll use BSD sockets, or if I’m building a network device then I need to fit into sort of the network device model. And I think that’s sort of what we see good in projects like Linux, or OpenStack is another interesting example. I don’t know if it’s very successful, but it is successful in sort of the, I think, integration of all those different vendors into one platform. And they also create layers. You have, for example, if you want to provide storage to OpenStack, then you have Cinder. 
It’s an integration between OpenStack and storage. And if you wanna provide networking solutions, whether it will be a load balancer or a software stack that does firewalling, etc, you have a layer called Neutron, and also very well defined interfaces. And what you will see is that there’s always a reference open implementation for some of those that will be good enough for many people. But in parallel you have vendors that plug into the same API and bring something which may be faster or more secure, or better enterprise functionality or quality. And that’ll foster the ecosystem because more vendors have an incentive to plug into this platform. And also, it allows more interoperability, because there is the strict layering, this definition of what is management layers, what kind of APIs you need to conform to. And I think the reason OpenStack is still not too successful is because it took on too much. And if they had sort of confined themselves to fixing the infrastructure as a service problem first, and then going after the higher levels of the stack, I think they would’ve been more successful.
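
The layering Yaron describes, a fixed interface with an open reference implementation and vendor plug-ins behind it, could be sketched roughly as follows in Python. This is only an illustration: `VolumeDriver`, `ReferenceDriver`, `VendorDriver`, and `provision` are hypothetical names and are not the actual Cinder or Neutron driver APIs.

```python
from abc import ABC, abstractmethod


class VolumeDriver(ABC):
    """Hypothetical storage-layer contract (not the real Cinder driver API)."""

    @abstractmethod
    def create_volume(self, name: str, size_gb: int) -> str:
        """Create a volume and return an identifier for it."""


class ReferenceDriver(VolumeDriver):
    """Stands in for the open reference implementation."""

    def create_volume(self, name: str, size_gb: int) -> str:
        return f"ref:{name}:{size_gb}"


class VendorDriver(VolumeDriver):
    """Stands in for a vendor's drop-in alternative behind the same interface."""

    def create_volume(self, name: str, size_gb: int) -> str:
        return f"vendor:{name}:{size_gb}"


def provision(driver: VolumeDriver, name: str, size_gb: int) -> str:
    # The platform only depends on the interface, so drivers are swappable.
    return driver.create_volume(name, size_gb)


print(provision(ReferenceDriver(), "db", 10))
print(provision(VendorDriver(), "db", 10))
```

Because `provision` never names a concrete driver, swapping the reference implementation for a vendor one needs no change in the calling code, which is the interoperability point made above.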

James: Mm-hm, so there’s a lot of interesting things in that, right? And certainly there’s pros and cons of being, more or less, locked down and having these standards as opposed to some of the other ways to do it. Can you contrast sort of what you described to the open source Big Data ecosystem as it exists today?

Yaron: Oh, yeah, that’s sort of a different story. Every four guys that know how to code go to Apache. So we have a new project, and sort of they build this project. Those projects tend to overlap. I don’t know how many streaming solutions we have, or how many sort of data storage solutions we have, or machine learning projects. And there’s no real definition how a machine learning project integrates with the rest of the ecosystem, or how a streaming solution integrates with the rest of the solution. And I think the only common thing, today, is maybe YARN and HDFS in this environment. No one uses MapReduce. And HDFS sucks, it’s not the right abstraction for data, because if I wanna consume data for analytics, files are pretty unstructured, a file doesn’t have any notion of metadata of what it contains. Doesn’t allow me to search efficiently, I have to do a full scan. And even scheduling, why do I need to stick to a scheduling platform that only deals with Hadoop? I’d rather use something like a Mesos or Kubernetes that are much more open and allow me to integrate my web server and FTP server, not just my Hadoop cluster, into the same resource pool. So I think that’s sort of the challenge, that the only two fundamental standard solutions, which are sort of YARN and HDFS, don’t fit the bill. They do not fit the actual solutions that people need to build.
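
To make the full-scan complaint concrete, here is a small Python sketch comparing a search over plain files with a search over files that carry minimal metadata. It is a toy model, not HDFS code: `DataFile`, `make_file`, `full_scan`, and `pruned_scan` are made-up names, and keeping a per-file min/max is just one assumed form of metadata.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    rows: list        # values stored in this file
    min_value: float  # metadata: smallest value in the file
    max_value: float  # metadata: largest value in the file


def make_file(rows):
    return DataFile(rows=list(rows), min_value=min(rows), max_value=max(rows))


def full_scan(files, threshold):
    """No metadata available: every row of every file has to be read."""
    hits, rows_read = [], 0
    for f in files:
        for value in f.rows:
            rows_read += 1
            if value > threshold:
                hits.append(value)
    return hits, rows_read


def pruned_scan(files, threshold):
    """With min/max metadata, files that cannot match are skipped unread."""
    hits, rows_read = [], 0
    for f in files:
        if f.max_value <= threshold:
            continue  # nothing in this file can exceed the threshold
        for value in f.rows:
            rows_read += 1
            if value > threshold:
                hits.append(value)
    return hits, rows_read


files = [
    make_file([18.0, 21.5, 19.0]),
    make_file([29.0, 33.5, 31.0]),
    make_file([12.0, 14.5, 11.0]),
]
print(full_scan(files, 30))    # reads all 9 rows
print(pruned_scan(files, 30))  # same hits, reads only 3 rows
```

Both functions return the same matches; the difference is how many rows have to be touched to find them.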

James: Mm-hm. There’s a two pronged sort of questioning that I wanna go from it. The first is, so what direction are we going, right? Obviously we could continue doing this forever, and just having the amount of solutions that don’t necessarily talk to each other, don’t necessarily build on each other and overlap. That can just keep on going and growing. I suspect it’s not gonna go in that pattern. So I’m just curious, where do you see sort of this evolution and innovation going? What direction will it ultimately lead us?

Yaron: Yeah, so, again, I think that what eventually is gonna happen is that sort of cloud-like technologies are gonna win. And a lot of the intermediate layers sort of that try to provide yet another small time advantage are going to fade away. So Amazon Stack, Azure Stack, and also what sort of iguazio does. I am a bit biased, but see, what we try to do is build sort of a complete platform for managing Big Data. We’re not dealing with the machine learning or computation aspects, we’re dealing with all forms of data, whether it’s streaming, or NoSQL, or file or object access, and all the aspects of lifecycle of the data, which includes security, and backups, and those annoying things. And then you’ll have the computation part on top of it. The infrastructure part of data is the most complicated one. That’s the one where you have sort of consensus, and consistency, and fault tolerance, and all those complicated problems. If you do that right then the upper layer of machine learning algorithms becomes simpler, because it’s sort of stateless. It can fade away, it can come back, and nothing happens. If your data crashes or is not secure, that’s sort of a bigger problem. So I think what we’ll see is people building those sort of, I won’t call it data lake, and I think data lake is sort of an abstract term, pool of data, a lot of Hadoop and HDFS, I think. If you’re looking at Azure Data Lake, that’s sort of a better model of sort of a data lake. Or the different services in Amazon, whether it’s Redshift, or S3, or Kinesis, those are sort of yet another example of a data lake. And on top of them you’ll have computation platforms. I think Spark is going to stick, and not necessarily because it’s better than Flink or the other solutions. I think just because you have a sort of center of gravity.
People see Spark as something that integrates well with the ecosystem, that sort of has a consensus around it, whether it’s IBM investing billions on it or others, and they’ll start sticking their solutions around it. So, for example, Spark has a very good data abstraction layer, unlike Hadoop. It’s called the Data Source API, and Spark can run analytics on anything from MySQL, Cassandra, Redis, services in Amazon, Azure, and iguazio, basically natively. So I think it will sort of become a de facto platform. And on top of it, I think what we need is also abstractions for machine learning libraries. We need to model some of those things and allow different vendors to bring different machine learning algorithms. Maybe doing better distributed algorithms that are sort of missing today, and other innovations.

James: Mm-hm, so people out there who are, maybe, the practitioners, they’re running the data ecosystem for their organization. You’re talking about what the future might become and where we need to go, and what eventually will happen to make sense of a lot of things out there. What’s your advice to those people? Is it, hey, sit back, relax, and in a couple of years we’re gonna have it all figured out? Or what questions do we ask, what path do we go down, assuming that [laughs] our bosses want us to take action sooner than later?

Yaron: Yeah, I think the biggest mistake an organization makes is they focus on the technology building blocks versus on their application. And that’s why you end up with a lot of IT shops building Hadoop clusters that don’t do anything. I think what you need to start with is sort of defining what your application looks like, what’s your business goals as an organization? Try and deal with sort of small steps, one step at a time. For example, maybe your organization, I don’t know, you’re an insurance company, you wanna run analytics on sort of different customers, or sort of the ROI models, or whatever. You need to know what kind of data you’re going to consume. What are the computation models you need to deal with? And then based on that, sort of try and go and create a solution, not sort of the other way around: let’s build a Hadoop cluster, let’s figure out all the components for it, and then we’ll sort of scratch our head and think about the applications. Because if you understand your business case, then some technologies may be a better fit than others.

James: Mm-hm, and a while back, we had a sort of a big discussion around this idea that Big Data’s fundamentally an application problem more so than maybe a technical problem. Which sounds counterintuitive, but we had a pretty good discussion around it. One of the things that we were taking away is really, at the end of the day, you’re fundamentally solving, in some ways, a people problem, right, in some ways an organization problem of how to build your organization in a way to answer the questions you want to answer. And when you peel back the layers, the things that make it successful tend to be the approach you take and the people you have working on it. The right people, the right approach is gonna get the right answer, even with the wrong technology. It might take a little longer. [laughs] And I think there’s a lot to be said to not forget that human element to it. I wonder if you also see that happening with you guys, and how you deploy things, and who you work with.

Yaron: Yeah, so first, I think the most important thing is to understand sort of your analytic pipeline and algorithms, and all that stuff, plus the infrastructure. You should assume the infrastructure is going to change. It may be Spark today, and in two years maybe a different project. Maybe you’re gonna use HDFS today and tomorrow a more enterprise platform like what we do. So what you really need to focus is how to build your application and sort of a bit decoupled from all the noise below. And then by the time you finish your application, you may have sort of established solutions in the market versus investing in sort of the infrastructure. And then, by the time you finish, maybe that infrastructure is sort of going to change.

What I spend most of my time with is not necessarily building our platform, we have very good engineers and architects for doing that, it’s basically spending a lot of time with data analysts and researchers, and understanding their application patterns. And then sort of bringing it back to engineering and say, look, if we go and do, for example, one of the things that we have is, we can do NoSQL and time series sort of in the same data model. And because sort of a lot of that time series is sort of handled as streaming data, what it means is that you need to run software that essentially goes and aggregates all that time series information into some more meaningful data. And that’s very time consuming. It means that sort of your insights are not going to be immediate. And rather than doing that, the right approach is to sort of combine streaming and tables, the sort of thing like building the aggregations immediately into the table. So when you run Spark queries you can run a query saying, give me all the devices that have temperature more than 30 degrees in the last one hour. So that’s basically combined streaming information and traditional SQL information into something very fast. Now, the reason we got into coming with such a solution was because we spoke to customers and that was a very common case. And currently, the solution they build for that is a very complicated one, and we could simplify it. So the calls I love are not necessarily the ones with sort of the infrastructure people, they are the ones with people that have a challenge with how to sort of analyze data and then sort of trying to change their paradigm in terms of let’s do it differently, and that gives an order of magnitude better performance. Or two orders of magnitude.
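
The idea of building the aggregations into the table as data streams in can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not iguazio's implementation: `DeviceTable`, the 60-second bucket size, and the sample device names are all hypothetical.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed aggregation granularity


class DeviceTable:
    """Hypothetical table that folds each streamed reading into
    per-device, per-bucket aggregates at write time."""

    def __init__(self):
        # device_id -> {bucket_start -> max temperature seen in that bucket}
        self._max_by_bucket = defaultdict(dict)

    def ingest(self, device_id, timestamp, temperature):
        bucket = timestamp - (timestamp % BUCKET_SECONDS)
        buckets = self._max_by_bucket[device_id]
        buckets[bucket] = max(temperature, buckets.get(bucket, temperature))

    def devices_hotter_than(self, threshold, now, window_seconds=3600):
        """Devices with an aggregated reading above threshold in the window.

        Only buckets starting at or after the cutoff are counted, so the
        window is accurate to one bucket rather than to the second.
        """
        cutoff = now - window_seconds
        return sorted(
            device
            for device, buckets in self._max_by_bucket.items()
            if any(start >= cutoff and peak > threshold
                   for start, peak in buckets.items())
        )


table = DeviceTable()
table.ingest("dev-a", 1000, 25.0)
table.ingest("dev-a", 4000, 31.5)
table.ingest("dev-b", 4100, 28.0)
table.ingest("dev-c", 100, 35.0)  # hot, but older than one hour at now=4200
print(table.devices_hotter_than(30, now=4200))
```

The query only inspects the small per-bucket summaries instead of replaying raw events, which is where the speed-up described above would come from.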

James: So I wanna transition a little bit. Amongst the other things that you do, and I’m sure our audience is getting a sense of it as well, you’re a pretty noted thought leader and industry expert in this world. I’m curious if you wanna just give us some parting shots as to what you think is coming up next. What you think the big trends are, what is it that our audience should be on the lookout for.

Yaron: If you think about it, many of the databases that exist out there today are sort of designed around hard drive mechanical limitations. Even things like HBase, Cassandra, DynamoDB, Google Bigtable, basically they all went and copied the same research paper talking about log-structured merge-trees and things like that, which were designed ten years ago and were all revolving around sort of the use of hard drives and the mechanical limitations of hard drives. And if you’re going to sort of start and look at the emerging technologies, like flash. I don’t know if you noticed that Intel announced new types of memory. They’re going to be available next year. Suddenly, you can create a paradigm shift. And most people are still sort of stuck with yesterday’s technologies. We see sort of this decoupling with application people, sort of, they master the application space, infrastructure people sort of master the lower level technologies, and they sort of don’t connect. That’s why you won’t see high performance databases so quickly, because the guys building databases and application stack don’t know how to utilize those sort of new infrastructure elements below. But I do think if you’re looking at a technology trend, this is something that’s going to come faster in the coming year. Especially when we’ve announced our products, other people looking at the same type of technologies, it’s going to give you new capabilities that you didn’t think could exist before.
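
For listeners unfamiliar with log-structured merge-trees, the core idea can be shown with a toy Python sketch: writes land in an in-memory table and are periodically flushed as sorted, immutable runs, which a real engine would write to disk sequentially. `TinyLSM` is a hypothetical name, and the sketch leaves out disk I/O, compaction, and deletes, so it is not how HBase or Cassandra are actually implemented.

```python
import bisect


class TinyLSM:
    """Toy log-structured merge store: no disk I/O, compaction, or deletes."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}  # recent writes, held in memory
        self.runs = []      # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # A real engine would write this sorted run to disk in one
        # sequential pass, avoiding the random seeks that hurt hard drives.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # the newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # memtable is full, so it is flushed as a sorted run
db.put("a", 3)  # the newer value shadows the flushed one
print(db.get("a"), db.get("b"), db.get("z"))
```

The design trades extra work on reads (checking several runs) for cheap sequential writes, a trade that made sense for spinning disks and is less obviously right for flash and newer memory types, which is the point being made above.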

James: Mm-hm, and if you’re gonna summarize that for an overarching sense, it sounds like on the cusp of what’s coming next, emerging technologies and some things that are out there already are allowing us to transcend what are now the traditional limitations of storage and memory, and that sort of thing. So it’s a pretty exciting time ahead.

So Yaron, there’s a lot of interesting stuff here. I suspect that our audience would probably wanna learn a little bit more about some of the trends that you’re talking about, or just keep an eye on you in general. In terms of where to go to find out about you, you mentioned you keep up a blog and that sort of thing. Where should our audience go if they wanna get a peek into what you’re up to?

Yaron: Yeah, so first you can look into my Twitter handle, it’s my full name, @yaronhaviv. Also, on the iguazio website, we also maintain our blog. I sort of merged my blog into the company blog, but it’s a technical blog, no marketing. It’s only sort of new technology and what we observe from the industry. And anyone can try and connect with me on LinkedIn and I’m pretty responsive on those things.

James: Yaron, loved having you on the show today, really appreciate you coming on.

Yaron: Great, thanks for having me.
