
Data Wrangling, “Groups & Loops,” and Some Company Called Google: Questioning Authority with Trifacta CEO Adam Wilson

As CEO of Trifacta, Adam Wilson is committed to developing the best in data-wrangling technology, and then, of course, preaching its gospel. He and I spoke recently about Trifacta’s past, present, and future (“groups and loops”), partnerships with companies you might have heard of, and how the enterprise data landscape is evolving (for the better).

TOPH WHITMORE: Tell me about Trifacta’s backstory. Where did it all begin?

ADAM WILSON: Trifacta was born of a joint research project between the University of California, Berkeley, and Stanford. There was a distributed-computing professor at Cal who had been doing work in this area [data wrangling] for almost a decade, looking at the intersection of people, data, and computation. He got together with a human-computer interaction professor from Stanford who was trying to solve the complex problem of transforming and preparing data for analysis.

And they were joined by a Stanford Ph.D. student who had worked as a data scientist at Citadel on trading-platform algorithms. He found he spent the majority of his time pushing data together, cleansing it, and refining it, as opposed to actually working on algorithms. He returned to Stanford to work with these professors to figure out how to eliminate the 80% of the pain that exists in these analytics problems by automating the coding or tooling and making it more self-service. The three of them worked together and created a prototype called the Stanford Data Wrangler. Within six months, 30,000 people were using it, and they realized they had more than an academic research project. So they created a commercial entity and started delivering to customers like Pepsi, Pfizer, GoPro, and RBS.

I joined two-and-a-half years ago to help with go-to-market. At the time, the question was: how do we help people take data from raw to refined, get productive with that information quickly, and do so in a self-service manner? We focused on customer acquisition, and I’m pleased to say we now have more than 7,000 companies using Trifacta technology. And customer use of Trifacta data-wrangling technology creates training data that improves our machine learning.

TW: How does machine learning show up in Trifacta? And what drove your investment in it?

AW: Historically, machine learning has been the exclusive purview of the highly technical. But machine learning and artificial intelligence have been part of Trifacta since the beginning. There are two fundamental observations. First, not every data set is a new data set. There are things we can infer from the data itself. Whether it’s inferring data types or inferring joins, we can provide automated structuring in a straightforward manner.
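
To make that first observation concrete, here is a minimal, purely illustrative sketch of how automated structuring might infer a column’s data type by attempting progressively stricter parses over sample values. The heuristic, the 95% threshold, and the date formats are assumptions made for illustration only, not Trifacta’s actual implementation:

```python
from datetime import datetime

def infer_column_type(values):
    """Guess a column's type by trying progressively stricter parses.

    Purely illustrative: count how many non-empty sample values parse as
    integers, floats, or dates, then pick the most specific type that covers
    at least 95% of them. A real wrangling tool would combine many more
    signals (formats, locales, dictionaries, past user behavior).
    """
    counts = {"integer": 0, "float": 0, "date": 0}
    non_empty = [v.strip() for v in values if v and v.strip()]

    for v in non_empty:
        try:
            int(v)
            counts["integer"] += 1
        except ValueError:
            pass
        try:
            float(v)
            counts["float"] += 1
        except ValueError:
            pass
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                datetime.strptime(v, fmt)
                counts["date"] += 1
                break
            except ValueError:
                pass

    # Prefer the most specific type that explains nearly all values.
    for dtype in ("integer", "date", "float"):
        if non_empty and counts[dtype] >= 0.95 * len(non_empty):
            return dtype
    return "string"

print(infer_column_type(["12", "7", "104", ""]))         # -> integer
print(infer_column_type(["2017-03-14", "2017-04-02"]))   # -> date
print(infer_column_type(["GoPro", "Pfizer", "RBS"]))     # -> string
```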

Second, we learn from user behavior. As users interact with data, we can make recommendations based on that behavior. Based on our own analysis, we can recognize they are dealing with a specific kind of data and interacting with it in a particular way, and we can make a suggestion. They can choose that suggestion and get immediate feedback as to what the data would look like if they applied the suggested rules. That cuts down on iteration. End users can make a quick decision, see what it looks like, and if they don’t like it, make a different decision. Over time, they build up intelligence that encapsulates all the rules they are applying to the data. And that becomes something they can share, reuse, and recycle.
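
The second observation can be sketched in the same hypothetical spirit: rank candidate transformation rules by how often users previously accepted them on similarly profiled columns, and let the user preview the result before committing. The profiles, transform names, and history structure below are invented for illustration; this is not Trifacta’s recommendation engine:

```python
from collections import Counter

# Hypothetical history of transformations users have accepted, keyed by a
# coarse column profile. A real system would learn from far richer features.
ACCEPTED_HISTORY = {
    "string_with_whitespace": Counter({"trim_whitespace": 42, "uppercase": 3}),
    "string_mixed_case": Counter({"lowercase": 17, "trim_whitespace": 5}),
}

TRANSFORMS = {
    "trim_whitespace": lambda v: v.strip(),
    "uppercase": lambda v: v.upper(),
    "lowercase": lambda v: v.lower(),
}

def profile(values):
    """Derive a coarse profile of the column to look up relevant history."""
    if any(v != v.strip() for v in values):
        return "string_with_whitespace"
    if any(v != v.lower() for v in values):
        return "string_mixed_case"
    return "string_other"

def suggest(values, top_n=2):
    """Rank candidate transforms by how often they were accepted for this profile."""
    history = ACCEPTED_HISTORY.get(profile(values), Counter())
    return [name for name, _ in history.most_common(top_n)]

def preview(values, transform_name):
    """Show what the column would look like if the suggested rule were applied."""
    return [TRANSFORMS[transform_name](v) for v in values]

column = ["  Pepsi", "Pfizer  ", " GoPro "]
for name in suggest(column):
    print(name, "->", preview(column, name))
# The user accepts or rejects each preview; acceptances feed back into
# ACCEPTED_HISTORY, so suggestions improve as more people wrangle data.
```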

It’s not just about individual productivity in getting to refined data. It’s about how end users can collectively leverage that across teams or an enterprise to help curate data at scale.

TW: The business value of the machine learning you’re describing…does that take Trifacta into sales conversations with business stakeholders? Or do you evangelize primarily to an IT operations audience?

AW: The winners in this market are going to be those who recognize that collaboration between those two enterprise roles is absolutely essential. In the past, you’ve seen people build technical tools for IT organizations, lose track of who the end consumer is, and fail to provide self-service. Or, on the flip side, you’ve seen BI technologies that embed lightweight data tools but, in the end, lose track of the fact that IT needs to be able to govern that information, curate it, secure it, and ensure it’s leveraged across the organization.

From the beginning, Trifacta has been a strong advocate of a vendor-neutral data-wrangling layer that allows you to wrangle data from everything and, in many regards, allows people to change their minds. You may be using any storage or data-visualization technology, but you don’t want to feel locked into any one decision that you’re making. You always want to be able to transform your data so that it’s useful, regardless of where you might be storing or processing it, or how you might be visualizing it now. Wrangle once, use everywhere.

We have a large financial-services customer that uses 136 different BI-reporting solutions. The idea that they would wrangle that data in 136 different ways with 136 different tools was surprising to them. We provide a single, linear way to wrangle that information, refine it, and then publish it out through a number of different channels, all with a high degree of confidence that it’s correct, and with appropriate lineage and metadata tracking how the source data has changed.
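
As a loose illustration of that wrangle-once, publish-everywhere flow (the channel names and lineage fields here are hypothetical assumptions, not Trifacta’s design), a single refinement step can feed any number of downstream publishers while recording metadata about the source it came from:

```python
import hashlib
import json
from datetime import datetime, timezone

def wrangle(raw_rows):
    """The shared refinement step ("wrangle once"): trim fields, drop empty rows."""
    return [
        {k: v.strip() for k, v in row.items()}
        for row in raw_rows
        if any(v.strip() for v in row.values())
    ]

def publish_everywhere(raw_rows, publishers):
    """Refine the data a single time, then hand the same result to every channel,
    along with lineage metadata describing the source it was derived from."""
    refined = wrangle(raw_rows)
    lineage = {
        "wrangled_at": datetime.now(timezone.utc).isoformat(),
        "source_fingerprint": hashlib.sha256(
            json.dumps(raw_rows, sort_keys=True).encode()
        ).hexdigest(),
        "row_count": len(refined),
    }
    for publish in publishers.values():
        publish(refined, lineage)  # e.g. a BI extract, a warehouse load, a CSV drop

# Hypothetical channels standing in for the customer's many BI-reporting tools.
publishers = {
    "reporting": lambda rows, meta: print("reporting:", len(rows), "rows,", meta["source_fingerprint"][:8]),
    "dashboard": lambda rows, meta: print("dashboard:", len(rows), "rows,", meta["source_fingerprint"][:8]),
}

publish_everywhere([{"name": " Pepsi "}, {"name": ""}], publishers)
```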

TW: Trifacta has pursued a proactive alliance strategy. Tell me about the partnership with Alation. How do the two technologies complement each other?

AW: I’m excited about the partnership with Alation! We have joint customers, including Munich RE, Marketshare, and BNSF, and a number of other companies looking to combine cataloging with wrangling. The idea is, when the data gets integrated into large-scale data lakes, the first step is: let me inventory it, then let me create an enterprise data dictionary that makes discovering and finding assets easier. Then let me refine that data, enrich it, and transform it into something that will drive my downstream analysis. It starts with getting that data-lake infrastructure in place, then bringing in the tooling to allow end users to make productive use of the data that’s in the data lake.

Our customers use many different BI and visualization tools like Qlik, MicroStrategy, or Tableau, and sometimes modeling or predictive analytics environments like DataRobot. The front-end technologies serve different types of data consumption, but the cataloging combined with the wrangling is complementary, and ensures you can operationalize your data lake and expose it to a broad set of users.

TW: You’ve also recently partnered with a little startup called Google. Tell me about that partnership. What does it mean to Trifacta, and what does it mean to your customers?

AW: Our vision for the space has always been self-service. That approach helps alleviate infrastructure friction. Any time we can help people get wrangling faster and spend more time with the data as opposed to configuring infrastructure, that’s a win. About a year ago, Google took a look at this market and recognized that—as more data lands on the Google Cloud Platform, and in particular in cloud storage—Google needed a way to help those customers get that data into BigQuery, and to leverage it with technology like TensorFlow that would help them accelerate the process of seeing value from the data in those environments.

Google did an exhaustive search, and they selected Trifacta as the Google data-preparation solution. We worked with Google to ensure scalability, and that included integrating with Google Dataflow and with Google’s authentication and security infrastructure. Google will take us to market as “Google Cloud Dataprep,” under the Google brand, and sell it alongside and in combination with new Google cloud services. To my knowledge, it’s the first time that Google has OEM’ed a third-party technology as part of the Google Cloud Platform.

TW: I have to ask—since I’m speaking with the CEO—will Google buy Trifacta?

AW: A lot of the value in a solution like Trifacta is being the decoder ring for data. Our independence is an important part of where the value is in the company. The fact that Trifacta can gracefully interoperate with on-prem systems and cloud environments was important to Google in making the decision to standardize on Trifacta. There’s value in our independence, so for us, the exciting thing is not only having the Google seal of approval, but delivering a multitude of hybrid use cases. HSBC is a joint customer, and uses Google for risk and compliance management and financial reporting. Trifacta data wrangling has become a critical capability for HSBC to leverage, particularly with regard to data governance. Regulations change, and keeping up with them is a huge burden, but Trifacta gives HSBC the flexibility to wrangle its data—on-prem or in the cloud—and create value in that evolving regulatory environment.

I sometimes get asked what the Google partnership means for exclusivity—will Trifacta still work with AWS, Microsoft Azure, and others? The answer is absolutely yes. We’ve had a leading cloud vendor really shape our cloud capabilities and accelerate our cloud roadmap. But we’ve made sure that everything we’ve done can be leveraged elsewhere, in other cloud environments. It’s not just a hybrid world between cloud and on-prem; it’s a multi-cloud world. That was important to Google. Google has multi-cloud customers, and they need to be able to wrangle data in those environments as well.

TW: Very diplomatic answer! Where to next for Trifacta?

AW: Three things. The first two are “groups and loops.” We put effort into self-service, governance, and machine learning. Now we want to apply all of that to give teams fundamentally better ways to work together and collaborate more efficiently. We’ve only just scratched the surface, and in the next twelve months you’ll see innovation from Trifacta in what it means to collaboratively curate information and then learn from collective intelligence. How do we crowdsource that curation? How do we share collective intelligence most efficiently? And how do we get organizational leverage from it?

As for “loops,” we’re looking at how we ensure that this collective intelligence can be reused and operationalized to scale with ever-increasing efficiency. We see a tool chain of data tools being crafted that will essentially become the workbench for how modern knowledge workers get productive and collaborate.

Third, Trifacta is looking at how we can embrace real-time data streaming, as more and more data is streamed into these environments.


This Week in DataOps: The Tradeshow Edition

DataOps wasn’t the most deafening sound at Strata + Hadoop World San Jose this year, but as data-workflow orchestration models go, the DataOps music gets louder with each event. I’ve written before about Boston-based DataOps startup Composable Analytics. But several Strata startups are starting to get attention too.

Still-in-stealth-mode-but-let’s-get-a-Strata-booth-anyway San Francisco-based startup Nexla is pitching a combined DataOps + machine-learning message. The Nexla platform enables customers to connect, move, transform, secure, and (most significantly) monitor their data streams. Nexla’s mission is to get end users deriving value from data rather than spending time working to access it. (Check out Nexla’s new DataOps industry survey.)

DataKitchen is another DataOps four-year-overnight success. The startup out of Cambridge, Massachusetts, also exhibited at Strata. DataKitchen users can create, manage, replicate, and share defined data workflows under the guise of “self-service data orchestration.” The DataKitchen guys—“Head Chef” Christopher Bergh and co-founder Gil Benghiat—wore chef’s outfits and handed out logoed wooden mixing spoons. (Because your data workflow is a “recipe.” Get it?)

DataOps in the wild — The Nexla and DataKitchen exhibition booths at Strata + Hadoop World San Jose.

Another DataOps-y theme at Strata: “Continuous Analytics.” In most common parlance, the buzzphrase suggests “BI on BI,” enabling data-workflow monitoring/management to tweak and improve, with the implied notion of consumable, always-on, probably-streaming, real-time BI. Israeli startup Iguazio preaches the continuous analytics message (as well as plenty of performance benchmarking) as part of its “Unified Data Platform” offering.

I got the chance to talk DataOps with IBM honchos Madhu Kochar and Pandit Prasad of the IBM Almaden Research Center. Kochar and Prasad are tasked with the small challenge of reinventing how enterprises derive value from their data with analytics. IBM’s recently announced Watson AI partnership with Salesforce Einstein is only the latest salvo in IBM’s efforts to deliver, manage, and shape AI in the enterprise.

Meanwhile, over in the data-prep world, the data wranglers over at Trifacta are working to “fix the data supply chain” with self-service, democratized data access. CEO Adam Wilson preached a message of business value—Trifacta’s platform shift aims to resonate with line-of-business stakeholders, and is music to the ears of a DataOps wonk like me. (And it echoes CTO Joe Hellerstein’s LOB-focused technical story from last fall.)

Many vendors are supplementing evangelism efforts with training outreach programs. DataRobot, for example, has introduced its own DataRobot University. The education initiative is intended both for enterprise training and for grassroots marketing, with pilot academic programs already in place at a major American university you’ve heard of but that shall remain nameless, as well as at the National University of Singapore and several others.

Another common theme: The curse of well-intentioned technology. Informatica’s Murthy Mathiprakasam identifies two potential (and related) data transformation pitfalls: cheap solutions for data lakes that can turn them into high-maintenance, inaccessible data swamps, and self-service solutions that can reinforce data-access bad habits, foster data silos, and limit process repeatability. (In his words, “The fragmented approach is literally creating the data swamp problem.”) Informatica’s approach: unified metadata management and machine-learning capabilities powering an integrated data lake solution. (As with so many fundamentals of data governance, the first challenge is doing the metadata-unifying. The second will be evangelizing it.)

I got the opportunity to meet with Talend customer Beachbody. Beachbody may be best known for producing the “P90” and “Insanity” exercise programs, and continues to certify its broad network of exercise professionals. What’s cool from a DataOps perspective: Beachbody uses Talend to provide transparency, auditability, and control via a visible data workflow from partner to CEO. More importantly, data delivery—at every stage of the data supply chain—is now real time. To get to that, Beachbody moved its information stores to AWS and—working with Talend—built a data lake in the cloud offering self-service capabilities. After a speedy deployment, Beachbody now enjoys faster processing and better job execution using fewer resources.

More Strata quick hits:

  • Qubole is publishing a DataOps e-book with O’Reilly. The case-study-focused piece includes use-case examples from the likes of Walmart.
  • Pentaho is committed to getting its machine-learning technology into common use in the data-driven enterprise. What’s cool (to me): the ML orchestration capabilities and Pentaho’s emphasis on a “test-and-tune” deployment model.
  • Attunity offers three products named with two verbs and a noun. Its Replicate solution enables real-time data integration/migration, and Compose delivers a data-warehouse automation layer, but it is Attunity’s Visibility product that tells the most interesting DataOps story: it provides “BI-on-BI” operations monitoring (focused on data lakes).
  • Check out Striim’s BI-on-BI approach to streaming analytics. It couples data integration with a DataOps-ish operations-monitoring perspective on data consumption. It’s a great way to scale consumption with data volume growth. (The two i’s stand for “Integration” and “Intelligence.” Ah.)
  • Along those same lines, anomaly-detection technology innovator Anodot has grown substantially in the last six months, and promises a new way to monitor line-of-business data. Look for new product, package, and service announcements from Anodot in the next few months.

Last week I attended Domo’s annual customer funfest Domopalooza in Salt Lake City. More on Domo’s announcements coming soon, but a quick summary:

  • The focus was noticeably humble (the core product has improved dramatically from four years ago, when it wasn’t so great, admitted CEO Josh James in his first keynote) and business-value-driven. (James: “We don’t talk about optimizing queries. (Puke!) We talk about optimizing your business.”)
  • There was a definite scent of DataOps in the air. CSO Niall Browne presented on Domo data governance. The Domo data governance story emphasizes transparency with control, a message that will be welcomed in IT leadership circles.
  • Domo introduced a new OEMish model called “Domo Everywhere.” It allows partners to develop custom Domo solutions, with three tiers of licensing: white label, embed, and publish.
  • Some cool core enhancements include new alert capabilities, DataOps-oriented data-lineage tracking in Domo Analyzer, and Domo “Mr. Roboto” (yes, that’s what they’re calling it) AI functionality.
  • Domo also introduced its “Business-in-a-Box” package of pre-produced dashboard elements to accelerate enterprise deployment. (One cool dataviz UI element demoed at the show: Sample charts are pre-populated with applicable data, allowing end users to view data in the context of different chart designs.)

Finally, and not at all tradeshow-related, Australian BI leader Yellowfin has just announced its semi-annual upgrade to its namesake BI solution. Yellowfin version “7.3+” comes out in May. (The “+” might be Australian for “.1”.) The news is all about extensibility, with many, many new web connectors. But most interesting (to me at least) is its JSON connector capability that enables users to establish their own data workflows. (Next step, I hope: visual-mapping of that connectivity for top-down workflow orchestration.)
