In past blog posts, we talked about how data management is fundamentally changing. It’s no secret that a convergence of factors, including an explosion in data sources, innovation in analytics techniques, and the decentralization of analytics away from IT, creates obstacles for businesses trying to find the best way to get value from their data.
Individual business analysts face a growing challenge: the difficulty of preparing data for analysis is increasing almost as fast as the data itself. Data exchange formats such as JSON and XML are becoming more popular, yet they are difficult to parse into a useful form. Add the vast amounts of unstructured data held in Big Data environments such as Hadoop and the growing number of ‘non-traditional’ data sources like social streams or machine sensors, and getting data into a clean format can be a monumental task.
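To make the parsing problem concrete, here is a minimal sketch of turning one nested JSON record into a flat, spreadsheet-style row. The payload and field names are invented for illustration, and the `flatten` helper is our own assumption rather than any vendor's method. Real feeds also contain arrays and inconsistent schemas, which this sketch does not handle.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key as a prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical social-stream payload (illustrative only).
raw = '{"id": 17, "user": {"name": "ana", "geo": {"zip": "02139"}}, "text": "love it"}'
row = flatten(json.loads(raw))
```

Each nested field becomes its own column (for example `user.geo.zip`), which is the shape most analysis tools expect.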
Analyzing social media data and its impact on sales sounds great in theory, but logistically, it’s complicated. Combining data feeds from disparate sources is easier now than ever, but it doesn’t ensure that the data is ready for analysis. For instance, if time periods are measured differently in the two data sources, one set of data must be transformed so that an apples-to-apples comparison can be made. Other predicaments arise if the data set is incomplete. For example, sales data might be missing the zip code associated with a sale in 20% of the data set. This, too, takes time to clean and prepare.
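As a rough illustration of both problems, the sketch below rolls daily social mentions up to weekly totals so they line up with weekly sales figures, and then measures what share of sales records lack a zip code. All data, field names, and helper functions here are hypothetical, and we assume weeks start on Monday.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(day):
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def to_weekly(daily_counts):
    """Sum (date, count) pairs into totals keyed by week-start date."""
    weekly = defaultdict(int)
    for day, count in daily_counts:
        weekly[week_start(day)] += count
    return dict(weekly)


def missing_share(rows, field):
    """Return the fraction of rows where `field` is absent or blank."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if not (r.get(field) or "").strip())
    return missing / len(rows)


# Hypothetical inputs: mentions are daily, sales are already weekly.
daily_mentions = [
    (date(2015, 6, 1), 40),  # a Monday
    (date(2015, 6, 3), 25),
    (date(2015, 6, 8), 30),  # the following Monday
]
weekly_sales = {date(2015, 6, 1): 1200.0, date(2015, 6, 8): 950.0}

weekly_mentions = to_weekly(daily_mentions)
# Pair mentions with revenue for each sales week.
comparison = {
    week: (weekly_mentions.get(week, 0), revenue)
    for week, revenue in weekly_sales.items()
}

sales_rows = [
    {"amount": 20.0, "zip": "02139"},
    {"amount": 35.0, "zip": "10001"},
    {"amount": 12.5, "zip": ""},
    {"amount": 48.0, "zip": "94105"},
    {"amount": 19.0, "zip": "60601"},
]
share_missing = missing_share(sales_rows, "zip")
```

Even in this toy case, the analyst has to decide how weeks are defined and what counts as "missing" before any comparison is meaningful.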
This is a constant challenge, and one that is exacerbated at scale. Cleaning inconsistencies in a 500-row spreadsheet is one thing, but doing so across millions of rows of transaction logs is quite another.
A certain level of automation is required to augment the capabilities of the analyst when we are dealing with data at this scale. There is a need for software that can identify the breakpoints, easily parse complex inputs, and pick out missing or partial data (such as zip codes) and automatically fill it in with the right information. Ultimately, the market is screaming for solutions that let analysts spend less time preparing data and more time actually analyzing it.
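A very simple version of this kind of automatic fill-in can be sketched with a rule rather than machine learning: when a sale is missing its zip code, borrow the zip that appears most often on the same customer's other sales. The `customer_id` field, the `zip_inferred` flag, and the rule itself are our assumptions for illustration. Commercial tools are likely to use far richer signals.

```python
from collections import Counter, defaultdict


def fill_missing_zips(rows):
    """Fill blank zips with the customer's most common known zip.

    Rows with no usable history are left unchanged. Filled rows are
    flagged so an analyst can review the inferred values later.
    """
    known = defaultdict(Counter)
    for row in rows:
        if row.get("zip"):
            known[row["customer_id"]][row["zip"]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not mutated
        candidates = known.get(row["customer_id"])
        if not new_row.get("zip") and candidates:
            new_row["zip"] = candidates.most_common(1)[0][0]
            new_row["zip_inferred"] = True
        filled.append(new_row)
    return filled


# Hypothetical sales records (illustrative only).
sales = [
    {"customer_id": "c1", "zip": "02139"},
    {"customer_id": "c1", "zip": "02139"},
    {"customer_id": "c1", "zip": "02140"},
    {"customer_id": "c1", "zip": ""},
    {"customer_id": "c2", "zip": ""},
]
cleaned = fill_missing_zips(sales)
```

Flagging inferred values rather than silently overwriting them keeps the analyst in control of which guesses to trust.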
For all of these reasons, it is no surprise that a number of vendors have come to market offering a better way to prepare data for analysis. Established players like MicroStrategy and Qlik are introducing data preparation capabilities into their products to ease the pain and allow users to stay in one interface rather than toggle between tools. Others, like IBM Watson Analytics and Microsoft Power BI, are following a similar path.
In addition, a number of standalone products are ramping up their market presence. Each is deeply specialized and should give data analysts a much-needed helping hand. At Blue Hill, we have identified Alteryx, Informatica Rev, Paxata, Tamr, and Trifacta as the five key standalone solutions to evaluate. (For a deeper analysis of each solution and a further look at market forces in general, be on the lookout for our upcoming research report on the subject.) These products represent a new breed of solutions that emphasize code-free environments for visually building data blending workflows. Further, the majority of these solutions leverage machine learning, textual analysis, and pattern recognition to do the bulk of the dirty work automatically.
As a forward-looking indicator of the promise of the space, venture capital firms have placed their bets. Tamr announced $25.2 million in funding this week, and Alteryx landed $60 million in funding late last year. This validates what data analysts already know: the need for scalable, automated data blending and preparation capabilities is enormous.