This is the fourth in Blue Hill Research’s blog series “Questioning Authority with Toph Whitmore.”
Ashish Thusoo is the co-founder and CEO of Big-Data-as-a-Service provider Qubole. He and I recently talked DataOps, data disintermediation at Facebook, elastic-data pricing models, abstraction layers, and the future of Big Data infrastructure. (Hint: It’s in the cloud.)
TOPH WHITMORE: You were an engineering manager at Facebook, where you implemented a DataOps approach to infrastructure management. You left Facebook in 2011 to start up Qubole. What motivated you to move on?
ASHISH THUSOO: That topic [DataOps] is very pertinent, and is something that a lot of companies struggle with. A lot of the genesis around Qubole was based on that.
Creating these data lakes, operating these Big Data platforms and making them available, making them self-service—those are extremely difficult tasks for most companies. At Qubole, we said, you know what, the best way to do this is to use the cloud! Use the cloud to create Big Data infrastructure that is self-service and automated. Automation takes care of the operational needs around the self-service infrastructure, and the interfaces are self-service enough that a marketing analyst or business analyst or data analyst can go into that infrastructure and do some queries and such.
Qubole is heavily influenced by the experience that we had at Facebook. My co-founder and I joined Facebook in 2007. We had a data-warehousing system, and we had the data team. Analysts would talk to the data team, and the team would then go off to get vanilla data that was stored in silos, and create some summary datasets, then put those into a data warehouse, and then analysts would come in and query that data. The process was very, very slow. Essentially, the direct result of that slow process was that we pulled data, but we didn’t actually use it that much. And the analysts would just go forward with their intuition. Data delayed is basically data denied.
When we went in there, we said this is a broken model, especially for a company that is growing so quickly. We need to rethink this model, and essentially create a self-service platform, which everybody in the company can use, and make the data team support that platform instead of being between the users and the platform.
That model is essentially what we built inside Facebook. The hope and thesis was that if you’re a data analyst, data scientist, developer, or end user, you should be able to get to the data without having to call anyone for help. The infrastructure should make that access easy, and also support that access. So, if you’re writing a query, the operational model should scale enough that it will be able to give you the results in time.
TW: You built this for Facebook. How did you recreate the technology at Qubole?
AT: We were using open-source tools at Facebook. At Qubole, the vision is similar to what we achieved inside Facebook, but with Qubole, we want to achieve it for everyone out there, for every other company. Our mantra here is that if you aspire to be a data-driven company, you should use Qubole. It will help you do that, much in the same way that the internal platform helped Facebook.
The technology stack is completely different. Facebook was all on-prem. The enabler for Qubole was the cloud. We saw people trying to create data lakes on-prem. With Hadoop, the cost of storage had gone down dramatically. But infrastructure on-prem is still very, very limited. It’s static. You put up your clusters, you put up your systems, and then—even if you put it up so that other people can use the infrastructure—there’s always this risk for the administrator that “I can’t really open this up to everyone because it’s going to be a big problem.”
With the cloud, we turned that on its head. With the cloud, you can create a new system on the fly. It’s completely elastic. With the Qubole platform, you could create these self-service interfaces for data engineers, data scientists, and data analysts.
Our mantra is that, on the cloud platform, we create the infrastructure on the fly for any of the transformations coming into that interface. We orchestrate the computing infrastructure. For storage, the data lakes being created in the cloud live in object stores, not in HDFS: the object stores of the Amazon, Oracle, Microsoft Azure, or Google clouds.
In the cloud, the object store actually decouples compute and storage. You can keep creating the data lake, you can keep putting the data in the object store, and then with a platform like Qubole, you can have an infrastructure that adapts to your compute needs.
TW: How do you differentiate Qubole from the big Hadoop players?
AT: First, we position ourselves as the cloud platform for Big Data. The big difference from the other vendors: The distro mechanism works well if you are doing on-prem. But when you go to the cloud, you can actually see all of this as a Big Data service. You can do a SaaS platform, which will remove all the complexity of having to stand up infrastructure.
Qubole users come in, create a login, and they’re ready! The same infrastructure is ready. And through that SaaS service, we are processing some 575 petabytes of data every month.
Second, the open-source software distros were built in the era of datacenters, and go in the direction of a converged architecture: “Store data in HDFS, and the same machine will be used for computation.”
Cloud architecture has changed that. The diverged architecture, where the storage is in an object store and compute is ephemeral, gives customers a flexible pricing model and an elastic data model. And we position Qubole as a cloud-agnostic cloud data platform that offers Big Data as a service to our clients.
TW: You mentioned the pricing model. I hear concerns from enterprise data leaders about pricing penalties for data growth. Qubole’s message of agile scaling sounds great, but what do I do if I’m about to turn on a new IoT data-delivery system? Will my expenses go up as my data volumes explode?
AT: That is a common issue. And not just for IoT projects. It’s not just the data pricing—the compute can go completely haywire too. You can get a thousand machines in the cloud in a jiffy.
There are two answers. The first is auditability, the ability to give complete visibility to the administrator as to where the costs are going: Which teams are using it more? Are they using it for the right reasons? Are certain datasets being used? Are certain datasets not being used? Can those then be moved to a different archival store? Or maybe they should not be stored at all?
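To make the auditability questions concrete, here is a minimal Python sketch of the two checks Thusoo describes: compute usage per team, and datasets nobody has touched recently. It is an illustration only, not Qubole's tooling. The record layout, team names, dataset names, and dates are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical usage records: (team, dataset, compute_hours, run_date).
USAGE = [
    ("marketing", "clickstream", 120.0, date(2020, 5, 2)),
    ("marketing", "campaigns", 30.0, date(2020, 5, 9)),
    ("data-science", "clickstream", 400.0, date(2020, 5, 20)),
]

# Hypothetical catalog of every dataset kept in the object store.
CATALOG = {"clickstream", "campaigns", "legacy_logs"}


def compute_hours_by_team(usage):
    """Sum compute hours per team: which teams are using it more?"""
    totals = defaultdict(float)
    for team, _dataset, hours, _day in usage:
        totals[team] += hours
    return dict(totals)


def unused_datasets(catalog, usage, since):
    """List cataloged datasets with no recorded use on or after `since`."""
    touched = {dataset for _team, dataset, _hours, day in usage if day >= since}
    return sorted(catalog - touched)


if __name__ == "__main__":
    print(compute_hours_by_team(USAGE))
    # Datasets returned here are candidates for archival or deletion.
    print(unused_datasets(CATALOG, USAGE, since=date(2020, 5, 1)))
```

In this sketch, any dataset that `unused_datasets` returns is a candidate for a cheaper archival store, which is the decision the administrator is being asked to make.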
Second is the cloud pricing model we follow. Cloud adoption initially started off in the mid-market, typically with the startup, the millennial company.
TW: And software devs.
AT: Right. The pricing model was essentially compute hours. For entry, that model is great. And Qubole offers that. But as your computation scales, as your data scales, you get a discounted price for that.
Often, people use our elastic model in the POC stage, or in the early adoption cycle. When it’s clear the extent [to which] they need infrastructure, then they go into subscription pricing, where they buy a certain amount of compute for a certain price. And that scales. Their pricing is not going to go haywire.
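The trade-off between pay-per-compute-hour pricing and a subscription can be sketched as a simple break-even calculation. The Python below is a hypothetical illustration, not Qubole's actual price list: the rates, the subscription fee, the included hours, and the assumption that overage is billed at the on-demand rate are all invented for the example.

```python
def on_demand_cost(compute_hours, rate_per_hour):
    """Pay-as-you-go: every compute hour is billed at the same rate."""
    return compute_hours * rate_per_hour


def cheaper_plan(compute_hours, rate_per_hour, subscription_fee, included_hours):
    """Return (plan_name, monthly_cost) for the cheaper of the two plans.

    Assumption: the subscription covers `included_hours` for a flat fee,
    and any hours beyond that are billed at the on-demand rate.
    """
    pay_as_you_go = on_demand_cost(compute_hours, rate_per_hour)
    overage_hours = max(0.0, compute_hours - included_hours)
    subscription = subscription_fee + overage_hours * rate_per_hour
    if subscription < pay_as_you_go:
        return ("subscription", subscription)
    return ("on-demand", pay_as_you_go)


if __name__ == "__main__":
    # Hypothetical numbers: $0.50 per hour, or $200 for 1,000 included hours.
    for hours in (100, 1000, 1200):
        print(hours, cheaper_plan(hours, 0.5, 200.0, 1000))
```

With these made-up numbers, a small proof-of-concept workload is cheaper on demand, while a steady production workload is cheaper on subscription. That matches the adoption path described above.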
TW: You talked about data scientists, data engineers, data analysts using Qubole. What’s their “pain” right now? And how does Qubole alleviate that?
AT: For these personas, self-service is the big thing: “I don’t want to wait for my data, I want it now.”
In most enterprises, a data team empowers all three roles. This is the team that is the internal sponsor for the infrastructure and systems needed to power analysis. Qubole targets the data teams: Instead of being on the receiving end of the ire of the folks saying “Hey, where’s my data?” the data teams can actually say, “You know what, with the help of Qubole, we’ve created this service, this infrastructure, this Big Data platform for you.” It becomes a mechanism for driving a full-blown data transformation, much in the same way that we drove it at Facebook from 2007 to 2011.
All of the learnings are there, and, as a self-service option, Qubole provides these users with the right tools for what they want to do. For example, Apache Spark is very popular with data scientists, so we have a Spark offering. We support Presto, which is better suited to a data analyst. The same data platform can also be used by developers, who might be using Hadoop or Spark for writing applications, or by an engineer who might be using Hive for data cleansing.
For the data team, Qubole becomes very powerful as a single platform. The data team can serve each of these different personas, with complete control and full visibility into what is happening, and can drive that infrastructure on any cloud that they want.
TW: The enterprise market question: Have you seen accelerated adoption in particular verticals?
AT: Our strategy has been “follow the cloud.” Some industries, like media, retail, ecommerce, or even enterprise marketing departments, are adopting the cloud before others. But we also see growing interest from healthcare, even financial services.
From the industry perspective, we feel that industry should know how to drive this transformation. What do you need from the perspective of people, processes, and technology to achieve that?
There is a growing realization across different verticals that they have to adopt a culture of DataOps, where data is widely available. Qubole is a catalyst in driving that adoption of data across the enterprise.
TW: We share an interest in that topic! Where do you see cloud-based data services evolving in the next few years, and where does Qubole go from here?
AT: The future is bright! When we started the company in 2011, there was a question mark on whether cloud would actually be the disruptive technology it had the potential to be. That question is answered now.
Companies are moving to the cloud, partly because applications are being built there, and new data is being produced there. But also, those businesses are realizing that they need to become much more agile with respect to their IT service.
In the cloud market, AWS is by far the leader, but we are also seeing the emergence of Azure, the emergence of Oracle, of Google, and more. As that happens, it creates a great dynamic for the market, because it gives companies options. Once you start treating clouds as base-level compute and services, you need services which can be agnostic. Qubole has a very strong role to play in that.