Hadooponomics: Hacking the Stack: Rethinking Coding, the Scientific Method, and Big Data (Podcast Transcript)

Listen to the original podcast.

James Haight: You’re listening to the Hadooponomics podcast, and, as always, this is your host, James Haight. Pleasure to have you back with us this week. We have a very fun and extremely actionable episode for you today. My guest is Jerry Overton, and our topic is high impact data science.

You’ll notice, of course, on this show, that we love to talk about the big picture, we love to talk about industry trends. We’re often very pie in the sky. But with today’s episode, we’re bridging this gap and bringing it down a level to give you actual, technical advice, actionable steps that you can use to take back with you to be better at your job tomorrow.

So really exciting for us, and any data practitioners out there, you’re gonna be sure you’re gonna want to listen to this episode. But for those of you who aren’t, Jerry’s a very well spoken individual, and I think his topics and his arguments have a lot of bearing in any discipline, as well. So I would say it’s highly interesting to whatever side of the spectrum you fall on, but especially if you’re a practitioner. I think there’s a lot of gems in here that you’re gonna want to take back with you.

So Jerry teaches people how to be effective at using data to influence change in an organization, so we’ve leveraged that. He’s a well known thought leader, and teacher, and author in this space. So we take advantage of him while we have him here, and we have a chance to talk about a lot of things. And we start on talking about, first, the difference between data science in theory versus data science in practice. And then we bridge that into the actual things that you can do right now to be better at your job. So a really exciting episode here. I’m thrilled to have him on the show. And we actually got a chance to introduce sort of a new segment onto the show, sort of a rapid fire round where we go through myth busting common theories that sound great but actually don’t work out that well in practice. I suspect a lot of you guys will know these firsthand, once you hear it as well. There’s a lot of fun tips in here to take back with you.

And you know the drill. We talk about a lot of things in this, we reference a lot of material. So if you get lost in that, you wanna find out more, you wanna connect with Jerry, bluehillresearch.com/hadooponomics. We’ll have all that stuff for you, as always. We’ll have the transcript for you as well, and, of course, you can use that to get in touch with myself and our team. Twitter, LinkedIn, whatever medium you prefer, we’re there. I love hearing your comments. It helps us make the show better. It shapes what we do going forward in the future. So please keep them coming, really enjoy hearing them.

Otherwise, that’s all for me, so sit back, relax, and enjoy the show.

Hey everyone, I’m here with Jerry Overton. He is the Data Scientist and Distinguished Engineer at CSC, Computer Sciences Corporation. Jerry, welcome to the show.

Jerry Overton: Awesome, hey, thanks for having me.

James: So, Jerry, we’re excited to have you on. There’s a lot of interesting things that you do. You have a big presence in a whole lot of areas. But rather than me trying to explain everything, I’m gonna turn it over to you. Who are you, what do you do, and let’s take it from there.

Jerry: Okay, well, let’s see. I have a number of roles. So, inside CSC, I spend a lot of time developing our client facing data science offerings. So making sure that what we’re putting out in the marketplace is the latest stuff, that it has the latest technology, and that it serves our clients’ needs. I’m also in charge of internal projects. So taking the things that we offer to our clients and using them internally to make sure that we have the best practices, and that we’re doing the kinds of things we ask our clients to do. And then last is I get to spend a fair amount of time doing thought leadership activities. So stuff like this, podcasts, and publishing, and teaching, just getting out there into the marketplace and having a voice, and trying to establish the kind of trust that we need in order to do the business that we want with our clients.

I spend a lot of my time focusing on just essential things. There’s a million different things that I do, but they all have that same theme. I see a lot of leadership in emerging technologies and data science, so a lot of new algorithms, cognitive, and deep learning. I also see a lot of leadership with the tools. So there’s a lot of really good platforms out there, IDEs, things like that. But where we’re missing the leadership is in the area of application. How do you take all of this new technology, these new techniques, and do something with it, something that actually matters? That matters to the stakeholders, and to the people who pay your salary? So the thing that I focus on, no matter what aspect of a job I’m doing, I’m always focused on applied data science. That’s really where I live.

James: Yeah, and one of the things that drew us to having you on the show is this whole idea, your whole theme, and everything you do seems to revolve around this idea of high impact data science. You wrote a book, Going Pro In Data Science, which has made the rounds and been pretty popular. Can you just kind of explain what you actually mean by that?

Jerry: Yeah, so, okay, Going Pro In Data Science. There’s a difference between the data science that’s written and the data science that’s practiced. And like you mentioned, I appreciate the plug, the Going Pro In Data Science. And it’s interesting because that started out as a series of letters to myself. So every time I came to a problem where the conventional wisdom just broke down, and I had to figure out a way of actually making this stuff work, actually making an impact, making inroads with our stakeholders, every time I came up with something like that, I wrote a blog post. I ended up collecting those blog posts into a collection of works and then publishing an e-book. And it’s really all about professional effectiveness. So when I talk about going pro, I’m talking about the practice of making an actual impact and making a difference. And now, just recently, I’ve been able to extend that. So Going Pro In Data Science is really about what you need to do to be personally effective. But then the next step, and the thing that I’ve been working on lately is what do you need to make an impact in the entire enterprise? So data science at enterprise scale, that kind of thing. It’s been very exciting for me.

James: Absolutely, and our listeners can, I’m sure, anticipate where we’re going with this. But the whole theme of this podcast, of the Hadooponomics podcast, is talking about getting you to the Big Data payoff. And we spend a lot of time talking about thought leadership and what’s coming up next. What does, what does not work. And the bridge to get into that payoff is, of course, applying it in your words, going pro, or perhaps doing data science with high impact. Not just doing it to do it, but doing it because it matters. And that’s what I wanna revolve this whole show around, which is why we’re so excited to have you on as a guest, because this is what you do. You help people every day actually make this happen. And so going off of that, I think you had a really interesting discussion around this idea of stack thinking. And I wanna poke into that a little bit, go a couple layers deep. But before we do that, can you just sort of give us an overview of what you mean by that, and why our audience should care?

Jerry: Yeah, yeah, I’m glad you asked about that. Okay, stack thinking. So let’s start with the technology stack. So picture a stack of boxes, right? Boxes stacked on top of each other from the floor to the ceiling. That’s your stack. And then each of the boxes, starting from the bottom, working your way up, is labeled with things like, I don’t know, the infrastructure layer, or the storage layer. All the way up to the applications in the application layer. And then inside each of these boxes is a bunch of toys that fits whatever layer they happen to be in. It’s tools and technologies. The technology stack is a diagram that shows all of that. And we, as data folk, man, we love the technology stack. It’s usually the first thing that we ask about when we get on a project. We use it to organize our thinking. Some of us even use it to organize our careers, right? We see ourselves as, hey, I’m the guy who works at the visualization layer of the stack, and that’s really what I focus on. And the stack has a number of virtues. First of all, if you’re going to build something, you’re gonna need to understand the stack. And it is also a way to organize what is a very confusing and really fast changing mass of technologies.

But there’s drawbacks to the stack. The problem with the stack is that it’s really focused on the technology itself. So you can pull these toys out of these boxes and link them together, and create something that works, but that’s not the same thing as creating something that someone wants to use. You see, when you focus on stack, you get into what I call stack thinking. So stack thinking is all about looking at an individual component and just worrying about how it fits with other layers of a stack. So you start to wonder things like, I don’t know, should I build a D3 visualization? What kind of visualization library should I use? Or what kind of database should I go with? Should I go with NoSQL or SQL, or a graph database? You get stuck down in the weeds at those layers, and you end up building something, like I said, that works, but doesn’t really have an impact, not something that your stakeholders are going to find value in. And that’s really the problem with stack thinking.

James: Mm-hm, and I can’t help but bring up, we’ve had a number of episodes that have either danced around this issue or directly talked about it. But we had a few gentlemen on from this blog and thought leadership website, as well as they provide some other services, too, but it’s called The New Stack, right? And they just talk about how the technology stack is changing and we can’t think about it in the same way that we used to. And then the second piece, and I’d like to get your reaction to it, is we’ve had, on this show, a conversation that Big Data is fundamentally an application problem. We should think about it in terms of what we’re trying to solve and the impacts we can have, rather than, can we do it, or what are we going to build. Start with that sort of right to left thinking, rather than the traditional left to right thinking. Start with your solution first and your endpoint first. Curious if that jibes with your thoughts on this, and then dive into it from there.

Jerry: It does, it does jibe. So this idea that this is an application problem, and that we should be thinking about the stakeholders, is key. And the way I tend to think about it is I think of it, instead of in terms of a stack, I think of it in terms of a utility. So think of an electrical utility, like the one that’s providing power to the rooms that we’re in right now. It’s really useful to think of applied data science in terms of a machine learning utility. So instead of generating and distributing electrical power, you can think of it as generating and distributing meaningful, data driven insights. So there are a number of advantages to thinking about what you’re building as an insight utility. The first is the kind of common sense that you inherit by doing that. So if you look at electrical engineers, or mechanical engineers, and the ones that are working on real, electrical utilities, you rarely, in fact, I’ve never heard of an engineer say things like, I wonder if I can just, at random, take a generator and connect it to a transformer. Taking things out of the stack and connecting them together. What you really focus on is the consumer. What kind of power is being consumed, and what are the specifications for it? Let’s build something according to those specs, something that delivers according to what our stakeholders need.

When you think about applied data science in terms of the machine learning utility, and at CSC we call it the industrial machine learning utility. When you think about it in terms of that, what you really get is a focus on the stakeholders. So instead of thinking about the various technical components, and how they weave together, you get questions and concerns like, well, what data will I need to power an experiment? Or what are the most important business questions that need to be answered by those experiments? Or when the experiments generate results, how often will the stakeholders need these insights? Or what areas of the business can we transform with these insights? Which is a very different way of thinking than what I call stack thinking.

James: We’ve been pretty pie in the sky here, and we’ve talked about this episode was gonna be our bridge [laugh], gonna be a nice bridge between the theoretical to the practical to becoming a pro. So let’s take it down a level.

So we agree, an insight generation utility, we agree we gotta reshape our stack thinking to understand new technologies, but what does that mean? Take it down a level for us.

Jerry: Yeah, and that’s a really good transition, because what you’re implicitly talking about is, we have a problem now. Because one of the reasons why you adopt the stack is because it makes it clear as to what you should do next. Even though what you’re doing may not get to stakeholder value, you at least have a hit list of tactical things that you need to do next. But when you’re thinking about it in terms of a utility, what’s your next step? What is the thing that guides you? So there’s a hole there when it comes to high impact data science.

Now a lot of us tend to fill that hole with what I call the tyranny of action. It’s basically looking around and doing what you see other people doing. So there are different reports that almost every Fortune 500 company has embarked on a Hadoop project, so let’s do that as well. Or companies who integrate Big Data, cloud, and mobility do 53% more business than their peers, so hey, let’s do that. And on, and on. My favorite one is Company X, they’re running Spark on a 400 node cluster, Company X is awesome, therefore we should be doing that as well. I hear that one a lot, that one’s particularly infuriating to me.

But that’s what I mean by the tyranny of action. And the thing that you need, in order to save yourself from that, is a strategy. So strategic thinking gets you out of that following the crowd just for the sake of doing so. It gets you to specific actions. But it gets you to those that are specific to a context. So it gets you to doing something better than architecting, developing and deploying Hadoop-based data lakes just because that’s what someone else did.

James: I love that, and to break down the strategy into sort of its core components, and what’s actually necessary, it seems rather than taking a tool based approach to outcomes, it’s saying, let’s focus on the outcomes and then choose the tools to get us there. And part of what I wanted to bring this into is you have a really interesting thought on what you call the art of the hack. And I know it sort of pops up in a lot of what you do. So I’d like to explore that a little bit, because it strikes me as sort of the next step after what happens when you have your data strategy defined.

Jerry: Yeah, yeah, so we’re going along the progression, right? So we started with a different way of thinking about data science, right? The whole utility way of thinking. And then we’ve kind of gotten in, further down into strategy, which gets us into tactics. But then after you have your tactics you need to look at ways of actually implementing. And so you’re gonna need to go out and fully staff it with the talent and skills that you need in order to implement. And one of the things that you inevitably come up against is writing and deploying algorithms.

So when you look at what is commonly put out there as a requisite for good data science, you get the big three, which is computer science skills, domain expertise, and mathematics and statistics. Now, in practice, though, what I’ve found is what is more useful for what I call a professional data scientist, which is somebody who is focused on delivering results, high impact data science, what is actually more useful is, instead of the domain expertise, it’s agile experimentation. So being able to interact in an agile way with the domain experts. Instead of mathematics and statistics, it’s all about being able to evaluate hypotheses. And I don’t mean that in the really formal, statistical sense, like-

James: [laughs] Find the p-value.

Jerry: Right, right, right, right, right, right. I mean something like, hey, I think I’m gonna lose ten pounds if I go on an all-Dorito diet, right? That’s a hypothesis. It’s what kind of data are you going to need to refute that? Hey, maybe I’m a genius, or maybe I’m crazy. I need to know which world I live in. Let’s start to generate some data and evaluate that hypothesis. That’s really what I mean.

But the thing I wanna focus on is the whole computer science aspect. Now, although I think it’s good to know as much as possible, the more knowledge you have the better, what I’ve found is that what’s actually practical for professional data science, high impact data science, isn’t all the computer science skill. It’s really professional data science programming. Now, I haven’t seen any name for it. I haven’t seen those practices all lumped together under some new moniker. So I just gave it a name, I call it the art of the hack, which is really about the professional development skills that you need as a data scientist.

And here’s what I mean by that. So here’s the big picture for the art of the hack. When you are going to build something, the first thing you do is you decide on what algorithm that you’re going to go with. Your next step is not like what you see on TV, where you get the superprogrammer sitting down at the blank screen, typing away furiously, and then, boom, some magic happens. As a professional data scientist, what really happens is you do a search. You do a Google search for somebody who’s already done this. Because whatever algorithm you’re going to create, whatever real project you’re working on, there’s someone out there who’s probably done at least 85% of what you’re looking for. Go out and find it. Do a web search, search your own archives for code you may have written before. Go and see your colleagues, find code from them. But don’t start with a blank sheet. Copy and paste, and then run the code. Your zero draft is a working version of someone else’s code. Then you take that, you make a small change, and you run it. And then you make another small change, and you run it. And you’re evolving the algorithm towards something that does what you want it to do. Something that’s useful for you. But you have something that works at every stage of the way. Now that’s the art of the hack.
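A minimal sketch of that loop, not taken from the podcast: start from a zero draft that already runs, apply one small change at a time, rerun after each change, and keep only the versions that still work. Every name here (`zero_draft`, `evolve`, `looks_sane`, and the example changes) is hypothetical and chosen only for illustration.

```python
def zero_draft(values):
    """Stand-in for borrowed code that already runs: returns the values unchanged."""
    return list(values)


def drop_negatives(fn):
    """Small change 1: filter out negative values from the previous version's output."""
    return lambda values: [v for v in fn(values) if v >= 0]


def scale_to_unit(fn):
    """Small change 2: divide every value by the largest one."""
    def wrapped(values):
        out = fn(values)
        top = max(out)
        return [v / top for v in out]
    return wrapped


def broken_change(fn):
    """A change that does not work: it always raises ZeroDivisionError when run."""
    return lambda values: fn(values)[0] / 0


def looks_sane(result):
    """Hypothetical check: the output is still a non-empty list."""
    return isinstance(result, list) and len(result) > 0


def evolve(draft, changes, sample, still_works):
    """Apply changes one at a time, rerunning on a sample after each one.

    A change is kept only if the new version runs and passes the check,
    so there is a working version at every stage.
    """
    current = draft
    for change in changes:
        candidate = change(current)
        try:
            ok = still_works(candidate(sample))
        except Exception:
            ok = False
        if ok:
            current = candidate
    return current


SAMPLE = [4.0, -2.0, 8.0]
final = evolve(zero_draft, [drop_negatives, broken_change, scale_to_unit], SAMPLE, looks_sane)
print(final(SAMPLE))  # prints [0.5, 1.0]
```

The broken change is discarded rather than debugged in place, which is one way to read "you have something that works at every stage of the way."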

Now I teach this in a class, and I ran this class just a little while ago internally to CSC, just so I could practice and get it refined. And one of my colleagues, Logan Wilt, she went in and did this exercise. So I have students go in and do an exercise in the art of the hack. She did such a fantastic job with it that I actually asked her to write up what she did, and to create it as a blog post. I use it as an example in the class, and I encourage you to go out and take a look at it: Google Logan Wilt and the art of the hack. It’ll bring up a CSC blog article. And I think that she actually does an even better job than I could ever do trying to explain what that’s like as a professional data science programmer. What the thinking is like, how you progress, what it’s actually like to do that kind of programming.

James: Mm-hm, and we’ll make sure to put that up in the show notes so any listeners who didn’t quite catch that, just go to the bluehillresearch.com website and we’ll have a link back to that. So that’ll be really interesting.

So I love what you’re saying with the art of the hack. Don’t reinvent the wheel, stand on the shoulders of giants. Whatever term you want to use. It seems like an incredibly, much more efficient way to actually get to where you want to go. Curious, do you get any pushback, though? Someone says, well, hey, look, if you take something that’s not designed specifically for your solution, and you’re just sort of adding a whole bunch of hacks on top of it, and just modifying, do you end up with this Frankenstein’s monster that, yeah, it does what you want it to do, but it’s so complicated that it’s just terribly inefficient? So I imagine you get pushback to that effect, and I’m sure you’ve thought through the answer.

Jerry: Yeah, and so you’re not lopping on top. You’re not just taking something and constantly piling on. What you’re doing is you’re molding it. So I think it was a quote attributed to Michelangelo, who, when asked about how’d you create the David, I think he said that, well, you just start with a block of marble, and you remove everything that’s not the David.

James: [laughs]

Jerry: And that’s kind of what you’re doing here with the art of the hack, right? You start with someone else’s code. You start taking away the pieces that aren’t doing what it is that you need. When you add on, you’re adding on functionality that’s only absolutely essential. And this is the important part, because you mentioned understanding. You start with understanding, and you have to maintain that throughout. That’s very important when you’re doing the art of the hack. So when you first run someone else’s code, it’s important to understand what it’s doing. And then, from there, you make changes that you also understand. And so every step of the way, as you’re evolving, you’re getting something that’s closer and closer to your masterpiece, but also something that runs all the time, and something that you understand the entire time. Does that make sense?

James: It does, and I think that’s an important [laughs] piece to get, even if you don’t understand how Michelangelo created the David. I think understanding that process is probably necessary.

So with that, let’s transition to something that I’m pretty excited about, and we don’t get a chance to do on this show that often, although I wish we did. But something that we were sort of going back and forth talking about, calling it the myth busting fire round, right? You spend a lot of time looking at what works, looking at what doesn’t work. And we touched on it earlier, there’s a lot of things that people do because other people do it, and it doesn’t necessarily work, right? Just because everyone’s doing it doesn’t mean you should do it. So I wanna just get your take on a few things in sort of a rapid fire round. What do people do all the time that actually doesn’t work, that they’d be better off doing something else?

Jerry: Okay, well let me start by prefacing, in that people that I meet are very intelligent. People are smart, but everybody gets tripped up with these things, and it’s because they sound right. These are things that are floating out there, they sound right, but for some odd, strange reason, when you start to put it into practice, it just breaks down. And that’s really what you need in order to go from the theory to the professional data scientist, is just knowing which of those ideas that yeah, it sounds right, but it doesn’t really, actually work.

So let me go through a few of these. First of all, the scientific method, as it’s presented. So I think the essence of a data scientist is following the scientific method. I think that data science is an actual science. But when you look at the way the scientific method is documented, it goes from hypothesis, to experiment, to conclusion, and then maybe you get that little nice arrow going back to your hypothesis, where you refine it all nice and neat. And that looks good, it sounds right, but the scientific method, the real scientific method, works nothing like that. It is a spaghetti mess where you start with a basic challenge, and maybe you accept that and you read something, and then you try something, then you go back. It’s a jumbled mess. And so one of the ideas out there is, what do you do with your discretionary time, as a data scientist? Well, let me ask you, if you were a data scientist, and you had to spend an hour or two a week to do whatever it is you want. What would you go out, what would you spend time doing?

James: Yeah, well, one of the things I like to do at the end of every week, I always like to think of what went well, and what went very poorly, and then identify that as something to improve upon. So I suppose I would figure out what my biggest challenges were from last week and then try and maybe spend an hour or two on solving those.

Jerry: Yeah, that’s a really good thing to do. For me, my thing would have been, and if this was a year ago, my thing would’ve been to go out and learn more about new algorithms. Because that’s the sexy thing to do, and that’s the thing we like doing the most. But the thing that actually works the best, the thing that, if you were going to use your discretionary time to get you the biggest bang for your buck, the thing that you should do is build pipelines. Build data pipelines. And this is the data that, maybe you’re not even sure if you have a specific use for it yet. Because the scientific method is so twisting and turning, the questions that you end up answering aren’t really the ones that are the most important to you. They really aren’t. They’re the ones that you have the resources in order to be able to answer. And in order to be able to have a big impact, you’re going to have to have those resources there before you need them. So one of the most important things that you can do with your discretionary time, as a data scientist, is just go out, grab data. Build professional grade data pipelines. Build access to this data, and it won’t be long before you find a very good use for it.
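As a rough illustration of what a small data pipeline can look like (an assumption about shape, not something described on the show), the sketch below extracts rows from a CSV source, drops rows that do not parse, and loads the rest into a queryable table. The inline `RAW_CSV` data, the `readings` table, and the function names are all invented for the example; a real pipeline would read from an actual source and add logging, scheduling, and tests.

```python
import csv
import io
import sqlite3

# Invented sample data standing in for a real source file or API response.
RAW_CSV = """sensor,reading
a,1.5
b,not_a_number
a,2.5
"""


def extract(text):
    """Yield each CSV row as a dict keyed by the header names."""
    yield from csv.DictReader(io.StringIO(text))


def clean(rows):
    """Keep only rows whose reading parses as a float."""
    for row in rows:
        try:
            yield row["sensor"], float(row["reading"])
        except ValueError:
            continue  # skip malformed readings instead of failing the whole run


def load(records, conn):
    """Write cleaned records to a table so the data is there before it is needed."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, reading REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
load(clean(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT sensor, reading FROM readings ORDER BY reading").fetchall()
print(rows)  # prints [('a', 1.5), ('a', 2.5)]
```

Once the table exists, a new question about the data becomes a query rather than a fresh collection effort.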

James: Wow, yeah, I mean, now that you say that, that seems sort of an eye opening thought, right? I’m just going back through my mind, and all of the conversations we’ve had on this podcast, all the folks that I’ve worked with, and again and again we always hear the same theme. There’s a lot of these questions you want to answer, and so few of them you actually can, because you need to have the data to do it. And that’s really where the challenge is. So to alleviate that from the outset, it opens up the door to a whole new world of questions that you can actually ask. So I’ve never really heard that formalized before, but now that you say that, it seems intuitively obvious.

Jerry: Yeah, yeah, and along those lines, here’s another big one. So an idea that sounds right is to write these algorithms, find patterns, and then tell a really good story, and use that story to impact your stakeholders. Now that sounds like, what could be wrong with that?

Well here’s how that usually ends up playing out. You start with the data. You learn about a cool algorithm. Hey, I learned this deep learning algorithm, I wanna be able to use it on this data. So you do. You run your algorithm, you find some cool patterns. Hey, these are correlations or patterns, or whatever, that no one’s ever seen before. Or these are great insights, I think this is awesome. So you get together with your user experience team, the business team, and you create these data stories based on the patterns that you found, and it’s all about telling the stakeholder this wonderful insight that you found. And then you take that, and you put it in front of the stakeholders, ready for your ticker tape parade, and the response is, eh. Or maybe even pushback [laughs]. But that happens all the time, right?

Because there’s a flaw in this whole idea of, well, we’re going to fix this with a story. If you look at the distance between when you first started writing your algorithm, when you first started engaging on the project, to when you first engaged your stakeholder, the gap is so large. And you’re trying to fill that gap by inflating the story. Maybe if I used better visualizations, or maybe if I used language closer to the business, or whatever. You’re trying to inflate this story enough to bridge the gap, and you’re just not going to get there.

A better way of doing that, a better idea is, essentially, to start with a hypothesis. A hypothesis is so important to developing algorithms, or just doing data science in general. Not just because it should be based on the scientific method, but the hypothesis is your contract.

You start with engaging the stakeholder, and you develop a hypothesis. Hey, here’s what I think the world should be like, what do you think is going on? You come to something that’s testable, something that, hey, if I got you information, and I either confirmed or refuted this idea, would that be of value to you? That is your contract that you use to show value. Then you get the data, then you write your algorithms to generate evidence, and you decide on the credibility, and win or lose, you’re still providing value. Because if your experiment confirms that hypothesis, then you get value there. If it refutes it, then you’ve learned something, and you get value there. And now it’s all just about keeping the stakeholders in that loop, rather than trying to inflate this story enough to bridge that gap.
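One way to picture the hypothesis as a contract is to write down, before any data arrives, the testable statement and what the stakeholder gains from either outcome. The sketch below is only an illustration under that assumption: the `Hypothesis` class, its fields, and the weekly weights are all invented, and the example reuses the all-Dorito diet from earlier in the conversation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Hypothesis:
    """A testable claim plus the value agreed with the stakeholder for each outcome."""
    statement: str
    is_supported: Callable[[Sequence[float]], bool]  # how evidence will be judged
    value_if_confirmed: str
    value_if_refuted: str


def evaluate(hypothesis, evidence):
    """Report the outcome together with the value that was agreed up front."""
    if hypothesis.is_supported(evidence):
        return f"Confirmed: {hypothesis.statement} -> {hypothesis.value_if_confirmed}"
    return f"Refuted: {hypothesis.statement} -> {hypothesis.value_if_refuted}"


diet = Hypothesis(
    statement="An all-Dorito diet takes off ten pounds",
    # Supported only if the last weigh-in is at least ten pounds below the first.
    is_supported=lambda weights: weights[0] - weights[-1] >= 10,
    value_if_confirmed="keep the diet",
    value_if_refuted="stop the diet and try something else",
)

weekly_weights = [180.0, 181.5, 183.0]  # invented measurements, in pounds
result = evaluate(diet, weekly_weights)
print(result)
```

With these invented numbers the weight goes up, so the hypothesis is refuted, and the stakeholder still gets the outcome that was agreed in advance.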

James: Mm-hm, and to me, we talk about data storytelling, and influencing stakeholders through data almost on a daily basis, and especially throughout this show, and through a lot of the research that we produce, and I think that’s an important point that you bring up. In my mind, the way we talk about it is it’s sort of where you have this feedback loop. You need to get the buy-in first to understand that what you’re doing is an important issue. If you build a house in the middle of nowhere, it might be a great house, and no one wants to show up, right?

Jerry: That’s right.

James: It’s sort of, hey, do you care about this? If I disprove this, would it matter to you? If I show that this is correct, could you turn that into more revenue? Those are questions you ask up front and then you go on your sort of journey to embark on your data storytelling, right? I think that’s a really important feedback loop to have at the beginning of your process.

Jerry: Exactly, exactly, and if you don’t have that, there’s no amount of great storytelling that’s going to close that loop. So it’s a critical start.

James: So, Jerry, there’s a lot of awesome stuff in here in terms of actionable takeaways, some myths that have been busted, and a couple references to your colleague’s work, to your own work. If our audience wants to follow up or learn more, where’s a good place for them to go to stay in tune to what you’re up to?

Jerry: Oh, okay, yeah, three things. So first of all, I’m hosting online training in early January, it’s through O’Reilly Media. The class is called Mastering Data Science at Enterprise Scale. And it covers everything that we’ve talked about here plus a whole lot more. We take our students on an epic journey to high impact data science.

Second mention I’d like to make is the e-book that we talked about, Going Pro in Data Science. That’s also from O’Reilly Media. It’s completely free, so when you get it, you can download it in a PDF format. It has a lot of the things that I’ve learned along the way in what it takes to be personally effective as a data scientist.

And then the last thing I’d like to mention is, of course, our CSC blogs. So at CSC I keep a blog called Doing Data Science. So if you Google CSC and Doing Data Science, you should see that. That’s where I post the latest happenings in my world.

James: Fantastic. So as I mentioned, we’ll have links back up to that stuff on the Blue Hill website as well. So if you get lost, if you’re trying to follow this, we’ll have the links up there, so no problem.

Jerry, it has been an absolute pleasure to have you on this show. Thanks so much for coming on.

Jerry: Hey, no problem, this has been fantastic, thanks for having me.
