As CEO of Trifacta, Adam Wilson is committed to developing the best in data-wrangling technology, and then of course, preaching its gospel. He and I spoke recently about Trifacta’s past, present, and future (“groups and loops”), partnerships with companies you might have heard of, and how the enterprise data landscape is evolving (for the better).
TOPH WHITMORE: Tell me about Trifacta’s backstory. Where did it all begin?
ADAM WILSON: Trifacta was born of a joint research project between the University of California, Berkeley and Stanford. There was a distributed-computing professor at Cal who had been doing work in this area [data wrangling] for almost a decade, looking at the intersection of people, data, and computation. He got together with a human-computer interaction professor from Stanford who was trying to solve the complex problem of transforming and preparing data for analysis.
And they were joined by a Stanford Ph.D. student who had worked as a data scientist at Citadel on trading platform algorithms. He found he spent the majority of his time pushing data together, cleansing it, and refining it, as opposed to actually working on algorithms. He returned to Stanford to work with these professors to figure out how to eliminate the 80% of the pain that exists in these analytics problems by automating the coding or tooling, and making it more self-service. The three of them worked together and created a prototype called the Stanford Data Wrangler. Within six months, 30,000 people were using it, and they realized they had more than an academic research project. So they created a commercial entity and started delivering to customers like Pepsi, Pfizer, GoPro, and RBS.
I joined two-and-a-half years ago to help with go-to-market. At the time, the question was: how do we help people take data from raw to refined, get productive with that information quickly, and do so in a self-service manner? We focused on customer acquisition, and I’m pleased to say we now have more than 7,000 companies using Trifacta technology. And customer use of Trifacta data-wrangling technology creates training data that improves our machine learning.
TW: How does machine learning show up in Trifacta? And what drove your investment in it?
AW: Historically, machine learning has been the exclusive purview of the highly technical. But machine learning and artificial intelligence have been part of Trifacta since the beginning. There are two fundamental observations. First, not every data set is a new data set. There are things we can infer from the data itself. Whether it’s inferring data types or inferring joins, we can provide automated structuring in a straightforward manner.
Second, we learn from user behavior. As users interact with data, we can make recommendations based on that behavior. Based on our own analysis, we can recognize they are dealing with a specific kind of data and interacting with it in a particular way, and we can make a suggestion. They can choose that suggestion and get immediate feedback as to what the data would look like if they apply those suggested rules. That cuts down on iteration. The end users can make a quick decision, see what it looks like, and if they don’t like it, make a different decision. Over time, they build up intelligence that encapsulates all the rules they are applying to the data. And that becomes something they can share, reuse, and recycle.
It’s not just about individual productivity in getting to refined data. It’s about how end users can collectively leverage that across teams or an enterprise to help curate data at scale.
TW: The business value of the machine learning you’re describing…does that take Trifacta into sales conversations with business stakeholders? Or do you evangelize primarily to an IT operations audience?
AW: The winners in this market are going to be those who recognize that collaboration between those two enterprise roles is absolutely essential. In the past, you’ve seen people build technical tools for IT organizations, lose track of who the end consumer is, and fail to provide self-service. Or, on the flip side, you’ve seen BI technologies that embed lightweight data tools, but in the end lose track of the fact that IT needs to be able to govern that information, curate that information, secure it, and ensure it’s leveraged across the organization.
From the beginning, Trifacta has been a strong advocate of a vendor-neutral data-wrangling layer that allows you to wrangle data from everything, and in many regards, allows people to change their minds. You may be using any storage or data-visualization technology, but you don’t want to feel locked into any one decision that you’re making. You always want to be able to transform your data so that it’s useful, regardless of where you might be storing or processing it, or how you might be visualizing it now. Wrangle once, use everywhere.
We have a large financial services customer that uses 136 different BI-reporting solutions. The idea that they can wrangle that data in 136 different ways with 136 different tools was surprising for them. We provide a linear way to wrangle that information, refine it, then publish it out through a number of different channels, all with a high degree of confidence that it’s correct, and with appropriate lineage and metadata tracking how the source data has changed.
TW: Trifacta has pursued a proactive alliance strategy. Tell me about the partnership with Alation. How do the two technologies complement each other?
AW: I’m excited about the partnership with Alation! We have joint customers, including Munich RE, Marketshare, BNSF, and a number of other companies looking to combine cataloging with wrangling. The idea is, when the data gets integrated into the large-scale data lakes, the first step is to inventory it, then to create an enterprise data dictionary that makes discovering and finding assets easier. The next step is to refine that data, enrich it, and transform it into something that will drive downstream analysis. It starts with getting that data-lake infrastructure in place, then bringing in the tooling to allow end users to make productive use of the data that’s in the data lake.
Our customers use many different BI and visualization tools like Qlik, MicroStrategy, or Tableau, and sometimes modeling or predictive analytics environments like DataRobot. The front-end technologies serve different types of data consumption, but the cataloging combined with the wrangling is complementary, and ensures you can operationalize your data lake and expose it to a broad set of users.
TW: You’ve also recently partnered with a little startup called Google. Tell me about that partnership, what it means to Trifacta, what it means to your customers?
AW: Our vision for the space has always been self-service. That approach helps alleviate infrastructure friction. Any time we can help people get wrangling faster and spend more time with the data as opposed to configuring infrastructure, that’s a win. About a year ago, Google took a look at this market and recognized that, as more data lands on the Google Cloud Platform, and in particular in cloud storage, Google needed a way to help those customers get that data into BigQuery, and to leverage it with technology like TensorFlow that would help those customers accelerate the process of seeing value from the data in those environments.
Google did an exhaustive search, and they selected Trifacta as the Google data-preparation solution. We worked with Google to ensure scalability, and that included integrating with Google Dataflow and with authentication and security infrastructure. Google will take us to market as “Google Cloud Dataprep,” under the Google brand, and sell it alongside and in combination with new Google cloud services. To my knowledge, it’s the first time that Google has OEM’ed a third-party technology as part of the Google Cloud Platform.
TW: I have to ask—since I’m speaking with the CEO—will Google buy Trifacta?
AW: A lot of the value in a solution like Trifacta is being the decoder ring for data. Our independence is an important part of where the value is in the company. The fact that Trifacta can gracefully interoperate with on-prem systems and cloud environments was important to Google in making the decision to standardize on Trifacta. There’s value in our independence, so for us, the exciting thing is not only having the Google seal of approval, but delivering a multitude of hybrid use cases. HSBC is a joint customer, and uses Google for risk and compliance management and financial reporting. Trifacta data-wrangling has become a critical capability for HSBC to leverage, particularly with regard to data governance. Regulations change, and keeping up with them is a huge burden, but Trifacta gives HSBC the flexibility to wrangle its data, on-prem or in the cloud, and create value in that evolving regulatory environment.
I sometimes get asked about what the Google partnership means for exclusivity: Will Trifacta still work with AWS, Microsoft Azure, and others? The answer is absolutely yes. We’ve had a leading cloud vendor really shape our cloud capabilities, and accelerate our cloud roadmap. But we’ve made sure that everything we’ve done can be leveraged elsewhere, in other cloud environments. It’s not just a hybrid world between cloud and on-prem, it’s a multi-cloud world. That was important to Google. Google has multi-cloud customers, and they need to be able to wrangle data in those environments as well.
TW: Very diplomatic answer! Where to next for Trifacta?
AW: Three things. The first two are “groups and loops.” We put effort into self-service, governance, and machine learning. Now we want to apply this to provide fundamentally better solutions for teams to work together, to collaborate more efficiently. We’ve only just scratched the surface, and in the next twelve months you’ll see innovation from Trifacta in what it means to collaboratively curate information, and then learn from collective intelligence. How do we crowd-source that curation? How do we share collective intelligence most efficiently? And how do we get organizational leverage across it?
As for “loops,” we’re looking at how we ensure that this collective intelligence can be reused and operationalized to scale with ever-increasing efficiency. We see a toolchain of data tools taking shape that will essentially become the workbench for how modern knowledge workers get productive and collaborate.
Third, Trifacta is looking at how we can embrace real-time data streaming, as more and more of the data is streamed into these environments.