Topics of Interest Archives: Data Management

Why Your Data Preparation and Blending Efforts Need a Helping Hand

In past blog posts, we talked about how data management is fundamentally changing. It’s no secret that a convergence of factors, including an explosion in data sources, innovation in analytics techniques, and the decentralization of analytics away from IT, creates obstacles as businesses try to find the best way to get value from their data.

Individual business analysts face a growing challenge: the difficulty of preparing data for analysis is expanding almost as fast as the data itself. Data exchange formats such as JSON and XML are becoming more popular, and they are difficult to parse and make useful. Combined with the vast amounts of unstructured data held in Big Data environments such as Hadoop and the growing number of ‘non-traditional’ data sources like social streams or machine sensors, getting data into a clean format can be a monumental task.

Analyzing social media data and its impact on sales sounds great in theory, but logistically, it’s complicated. Combining data feeds from disparate sources is easier now than ever, but it doesn’t ensure that the data is ready for analysis. For instance, if time periods are measured differently in the two data sources, one set of data must be transformed so that an apples-to-apples comparison can be made. Other predicaments arise if the data set is incomplete. For example, sales data might be missing the zip code associated with a sale in 20% of the data set. This, too, takes time to clean and prepare.
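To make that concrete, here is a minimal sketch of the two fixes described above, written in Python with pandas (a tooling choice assumed for illustration, not one named in this post); the tables and column names are invented, and a real pipeline would pull from live feeds rather than inline data.

```python
import pandas as pd

# Hypothetical inputs: daily social-mention counts and weekly sales records.
social = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=28, freq="D"),
    "mentions": range(28),
})
sales = pd.DataFrame({
    "week_start": pd.date_range("2015-01-05", periods=4, freq="W-MON"),
    "revenue": [1200.0, 950.0, 1430.0, 1100.0],
    "zip_code": ["02108", None, "02139", None],  # incomplete, as in the missing-zip example
})

# Fix 1 - align time periods: roll daily mentions up to the Monday-based weeks
# that the sales extract uses, so the comparison is apples-to-apples.
social["week_start"] = social["date"] - pd.to_timedelta(social["date"].dt.weekday, unit="D")
weekly_mentions = social.groupby("week_start", as_index=False)["mentions"].sum()

blended = sales.merge(weekly_mentions, on="week_start", how="left")

# Fix 2 - surface incomplete rows so the missing zip codes can be repaired,
# for example from a store-location lookup table.
missing_zip = blended[blended["zip_code"].isna()]
print(blended)
print(f"{len(missing_zip)} of {len(blended)} rows are missing a zip code")
```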

This is a constant challenge, and one that is exacerbated at scale. Cleaning inconsistencies in a 500-row spreadsheet is one thing, but doing so across millions of rows of transaction logs is quite another.

A certain level of automation is required to augment the capabilities of the analyst when we are dealing with data at this scale. There is a need for software that can identify the breakpoints, easily parse complex inputs, and pick out missing or partial data (such as zip codes) and automatically fill it in with the right information. Ultimately, the market is screaming for solutions that let analysts spend less time preparing data and more time actually analyzing it.

For all of these reasons, it is no surprise that a number of vendors have come to market offering a better way to prepare data for analysis. Established players like MicroStrategy and Qlik are introducing data preparation capabilities into their products to ease the pain and allow users to stay in one interface rather than toggle between tools. Others, like IBM Watson Analytics and Microsoft Power BI, are following a similar path.

In addition, a number of standalone products are ramping up their market presence. Each offers a deeply specialized solution, and should provide a much-needed helping hand to augment data analysts’ efforts. At Blue Hill, we have identified Alteryx, Informatica Rev, Paxata, Tamr, and Trifacta as our five key standalone solutions to evaluate. (For a deeper analysis of each solution and a further look at market forces in general, be on the lookout for our upcoming research report on the subject.) These products represent a new breed of solutions that emphasize code-free environments for visually building data blending workflows. Further, the majority of these solutions leverage machine learning, textual analysis, and pattern recognition to automatically handle the brunt of the dirty work.

As a forward-looking indicator of the promise of the space, venture capital firms have notably placed their bets. Tamr announced $25.2 million in funding this week, and Alteryx landed $60 million in funding late last year. This is a validation of what data analysts already know: the need for scalable and automated data blending and preparation capabilities is gigantic.


Fundamental Shifts in Information Management

As market observers, we at Blue Hill have seen some big fundamental changes in the use of technology, such as the emergence of Bring Your Own Device, the progression of cloud from suspect technology to enterprise standard, and the assumption of ubiquitous, non-stop social networking and interaction. All of these trends have changed our assumptions about how technology is used, and brought market shifts in which traditional players ceded ground to upstarts or new market entrants.

Based on market trends that are occurring simultaneously, Blue Hill believes that the tasks of data preparation, cleansing, augmentation, and governance are facing a similar shakeup, one in which the choices that enterprises make will fundamentally change. This shift is driven by five key trends:

- Formalization of Hadoop as an enterprise technology
- Proliferation of data exchange formats such as JSON and XML
- New users of data management and analytics technology
- Increased need for data quality
- Demand for best-in-breed technologies

First, Hadoop has started to make its way into enterprise data warehouses and production environments in meaningful ways. Although the hype of Big Data has existed for several years, the truth is that Hadoop was mainly limited to the largest of data stores back in 2012, and enterprise environments were spinning up Hadoop instances as proofs of concept. However, as organizations have seen the sheer volume of relevant data requested for business usage increase by an order of magnitude across customer data, partner data, and third-party sources, Hadoop has emerged as a key technology simply to keep pace with the intense demands of the “data-driven enterprise.” This growth in volume means that enterprise data strategies must include both the maintenance of existing relational databases and the growth of semi-structured and unstructured data that must be ingested, processed, and made relevant for the business user.

Second, with the rise of APIs, data formats such as JSON and XML have become key enterprise data structures for exchanging data of all shapes and sizes. As a result, Blue Hill has seen a noted increase in enterprise requests to cleanse and support JSON and other semi-structured data strings within analytic environments. Otherwise, this data remains merely descriptive information rather than analytic data that can provide holistic, enterprise-wide insights and guidance. To support the likes of JSON and XML without taking a purely manual development approach, enterprise data management requires investment in tools that can quickly contextualize, parse, and summarize these data strings into useful data.
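As a hedged illustration of what that parsing step looks like in practice, the sketch below flattens a nested JSON payload into analysis-ready rows using Python and pandas’ json_normalize; the record structure and field names are invented for the example and do not come from any particular API.

```python
import json
import pandas as pd

# A hypothetical semi-structured payload, as it might arrive from an API.
payload = json.loads("""
[
  {"order_id": 1001,
   "customer": {"id": "C-17", "region": "Northeast"},
   "lines": [{"sku": "A-1", "qty": 2, "price": 19.99},
             {"sku": "B-4", "qty": 1, "price": 5.50}]},
  {"order_id": 1002,
   "customer": {"id": "C-42", "region": "Midwest"},
   "lines": [{"sku": "A-1", "qty": 5, "price": 19.99}]}
]
""")

# Flatten the nesting: one row per order line, with the order- and
# customer-level fields repeated onto each row for downstream analysis.
lines = pd.json_normalize(
    payload,
    record_path="lines",
    meta=["order_id", ["customer", "id"], ["customer", "region"]],
)
print(lines)
```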

Third, it’s hard to ignore the dramatic success of self-service analysis products such as Tableau and the accompanying shift in users’ relationship with data. It’s also important to consider the nearly $10 billion recently spent to take two traditional market leaders, TIBCO and Informatica, private. Users of data management and analytics technology have spread beyond the realm of IT, and are now embedded in the core functions of roles across various business groups. Traditional technology vendors must adapt to these shifts in the market by focusing on ease of use, compelling future-facing roadmaps, and customer service. With the two largest independent data management players going private, the world of information management will most likely be opened up in unpredictable ways for upstarts that are natively built to support the next generation of data needs.

Fourth, companies are finally realizing that Big Data does not obviate the need for data quality. Several years ago, there was an odd idea that Big Data could stay dirty because the volume was so large that only the “directional” guidance of the data mattered and that the statistical magic of data scientists would fix everything. As Big Data has increasingly become Enterprise Data (or just plain old data), companies now find that this is not true, and, just as with every other computing asset, garbage in is garbage out. With this realization, companies now have to figure out how to refine Big Data of the past five years from coal to diamonds by providing enterprise-grade accuracy and cleanliness. This requires the ability to cleanse data at scale, and to use data cleansing tools that can be used not just by expert developers and data scientists, but by data analysts and standard developers as well.

Finally, the demand for best-in-breed technologies is only increasing with time. One of the most important results of the “there’s an app for that” approach to enterprise mobility is the end user’s increasing demand to instantly access the right tool at the right moment. For a majority of employees, it is not satisfactory to simply provide an enterprise suite or to take a Swiss Army knife approach to a group of technologies. Instead, employees expect to switch back and forth between technologies seamlessly, and they don’t care whether their favorite technologies are provided by a single vendor or by a dozen vendors. This expectation of seamless interoperability forces legacy vendors either to have a roadmap for making all of their capabilities best-in-breed, or to lose market share as companies shift to vendors that provide specific best-in-class capabilities and integrate with other top providers. This is a straightforward result of the increasingly competitive and results-driven world that we all live in, where employees want to be more efficient and to have a better user experience.

As these market pressures and business expectations all create greater demand for better data, Blue Hill expects that the data management industry will undergo massive disruption. In particular, data preparation represents an inordinate percentage of the time spent on overall analysis. Analysts spend the bulk of their workload cleaning, combining, parsing, and otherwise transforming data sets into a digestible input for downstream analysis and insights. As organizations deal with an increasingly complex data environment, the time spent getting data ready for analysis is expanding and threatening to overwhelm existing resources. In response to these market forces, a new class of solutions has emerged focused on the data “wrangling” or “transformation” process. These solutions leverage machine learning, self-service access, and visual interfaces to simplify and expedite analysts’ ability to work with data even at the largest of scales. Overall, there is an opportunity for IT orchestrators to bring this new category of tools into their arsenal of best-in-breed solutions.


Companies that are ready to support this oncoming tidal wave of data will be positioned for the future of data-driven analysis. Those that ignore this inexorable set of trends will end up drowning in their data, or losing control of it as growth outstrips governance.


The Pumpkin Spice School of Big Data

In our particular pocket of New England, the leaves are turning golden, and football is replacing baseball on the TVs. This means one thing to coffee drinkers: the re-emergence of the Pumpkin Spice Latte at Starbucks. Over the past ten years, this drink has gone from an odd cult favorite to a phenomenon so large that it has earned its own hashtag on Twitter: #PSL.

At the same time, one has to wonder, “What is Pumpkin Spice?” (Other than possibly the long-lost American cousin of the Spice Girls?) Pumpkin spice doesn’t actually have pumpkin in it. And it’s far from the spiciest flavor out there. However, the concept of “pumpkin spice” evokes something handmade, traditional, and uniquely American in a way that draws people into wanting to consume it. Despite its complete lack of pumpkin and relative lack of spice, the flavor itself is almost secondary to the cultish conceit that has been constructed around “Pumpkin Spice.”

Unfortunately, the hype, conceptualization, and ubiquity of Pumpkin Spice are matched in the enterprise world by the most overhyped phrase in tech: Big Data. Like Pumpkin Spice, everybody wants Big Data, everybody wants to invest in Big Data tools, and everybody thinks that we are currently in a season or era of Big Data. And in the past, we’ve explained why we reluctantly think the term “Big Data” is still necessary. But when you go behind the curtain and try to figure out what Big Data is, what do you actually find?

For one thing, “Big Data” often isn’t that big. Although we talk about petabytes of data, there are practitioners who describe “Big Data” problems that are only hundreds of megabytes. These are still sizable data sets, but they are manageable with traditional analytics tools.

And even when Big Data is “big,” this is still a very relative term. For instance, even when Big Data collects terabytes of data, text, and binaries, the data collected is rarely analyzed on a daily basis. In fact, we still lack the sentiment analysis, video analysis, and audio analysis needed to quickly analyze large amounts of data. And we know that data is about to grow by at least one order of magnitude, if not two, as the Internet of Things and the accompanying billions of sensors start to embed themselves into our planet.

Even outside of the Internet of Things, the entirety of the biological ecosystem represents yet another large source of data that we are just starting to tap. We are nowhere close to understanding what happens in each of our organs, much less in each cell of our bodies. To get to this level of detail for any lifeform represents additional orders of magnitude for data.

And then there’s even a higher level of truly Big Data when we track matter, molecules, and atomic behavior on a broad-based level to truly understand the nature of chemical reactions and mechanical physics. Compared to all of this, we are just starting to collect data on Planet Earth. And yet we call it Big Data.


So, our “Big Data” isn’t big in comparison to the amount of data that actually exists on Earth. And the types of data that we collect are still very limited in nature, since they almost always come from electronic sources, and often lack the level of detail that could legitimately recreate the environment and context of the transaction in question. And yet we are already calling it Big Data and setting ourselves up to start talking about “Bigger Data,” “Enormous Data,” and “Insanely Large Data.”

To get past the hype, we should start thinking about Big Data in terms of the scope that is actually being collected and supported. There is nothing wrong with talking about the scale of “log management data” or “sensor data” or “video data” or “DNA genome data.” For those of us who live in each of these worlds, who know that log management gets measured in terabytes per day or that the human genome has 3 billion base pairs and approximately 3 million SNP (single-nucleotide polymorphism) replacements, these are meaningful measurements of data again, rather than defaults to the overused Big Data term.
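As a rough back-of-envelope illustration of why these domain-specific measurements are more useful than the blanket “Big” label, consider how a human genome actually sizes out; the two-bits-per-base encoding and the per-variant byte cost below are simplifying assumptions, since real file formats carry quality scores and metadata that inflate these numbers considerably.

```python
# Back-of-envelope sizing: putting "big" biological data into concrete units.
base_pairs = 3_000_000_000      # roughly 3 billion base pairs in a human genome
bits_per_base = 2               # A/C/G/T fits in 2 bits (ignores ambiguity codes)

raw_megabytes = base_pairs * bits_per_base / 8 / 1e6
print(f"Compactly encoded genome: ~{raw_megabytes:.0f} MB")   # ~750 MB

snps = 3_000_000                # roughly 3 million single-nucleotide variants per person
bytes_per_variant = 30          # assumed per-record cost in a simple variant table
print(f"Variant-only view: ~{snps * bytes_per_variant / 1e6:.0f} MB")
```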

I will say that there is one big difference between Pumpkin Spice season and Big Data Season. Around the end of the year, I can count on the end of Pumpkin Spice season. However, the imprecise cult of Big Data seems far from over; the community of tech thought leaders continues to push more and more use cases into Big Data, rather than provide clarity on what actually is “Big,” what actually constitutes “Data,” and how to actually use these tools correctly in the Era of Big Data.

In this light, Blue Hill Research promises to keep the usage of the phrase “Big Data” to a minimum. We believe there are more valuable ways to talk about data, such as:

- Our primary research in log and machine data management
- Our scheduled research in self-service topics including data quality, business intelligence, predictive analytics, and enterprise performance management
- Tracking the $3 billion spent on analytics over the past five years
- Cognitive and neuroinspired computing

By focusing on the actual data topics that provide financial, operational, and line-of-business value, Blue Hill will do its best to minimize the extension of Big Data season.


Blue Hill's Q4 Self-Service Analytics Research

There is a fundamental issue in the world of enterprise analytics and data management that is vital to the future of business intelligence and analytics: are employees free and able to pursue the deep analytical insights needed to advance their business goals? The concept of the analytical business has become more popular in recent years as statistics and algorithms have become sexy concepts. One need only look at “Moneyball” to see that statistics are no longer relegated to the nerd squad. When Brad Pitt becomes the face of gaining analytic advantages in the workplace, analytics has arrived as a mainstream business topic.

But the popularity of analytics does not mean that it has been translated into a set of tools that are ubiquitous and easy to use. Although we have seen phenomenal strides in the tools made available to support business intelligence over the past five years, we are still largely in a world of haves and have-nots when it comes to analytic access.

Why is this? Part of the problem is that we as an industry are defining analytic freedom in different ways. A simple way to think of this is to consider the enterprise-wide view, the department-wide view, and the individual view.

Some of us look at this from a company-wide view, where analytic freedom means having agile data warehousing, robust ETL, a portfolio of analytic applications custom-made for each department, an army of number crunchers to handle each predictive request, and a fully-realized BI Center of Excellence.

Yet others look at the department-wide view, where the key is to provide each employee within a department with relevant data. For a marketing department, this might mean a 360-degree view of all campaigns, products, and customers. For a manufacturing department, this might mean full access to operational efficiencies, production, and Six Sigma efforts. These needs are often met in department-specific applications such as CRM and marketing automation management. But outside of the department’s purview, everybody else’s data problems are irrelevant. As a result, these department-specific solutions merely create a silo, where data-driven enlightenment is limited only to a specific few individuals and solely for certain tasks within a single department.

And finally, there is the individual’s need for data. There is the 1% of data analysts who are able to independently work with the vast majority of data sources, statistically analyze them, and find key connections that have previously escaped detection. We call them data scientists, and the only thing we truly know about this rare and prized species is that there is an enormous shortage of these individuals. But for the rest of us, vendors still need to catch up and provide a variety of tools that will give the typical knowledge worker the same access to data and analytics that the data analysts and data scientists have. This is no small task, as it requires transformative products to be developed in multiple areas: data cleansing, data management, business intelligence, predictive analytics, and performance management.

To make good decisions, individuals first have to find the correct data sources, and then make sure that the data is clean and reliable. This means going through everything: formal business data repositories, third-party data, collected survey and sensor data, informal spreadsheets and tallies, and more. In doing so, employees are often tasked with cleaning up the manual mistakes associated with data collection and collation. The subsequent task of data cleansing is estimated to take up three-quarters of a data analyst’s time. To reallocate this time to more valuable tasks, such as direct data analysis or business alignment of results with specific initiatives, companies need to take advantage of self-service and automated data management tools that solve basic problems in data management. This may include issues as mundane as changing “Y” to “Yes” or providing a default value for any null values in a column. Or this may include the automatic joining of fields in unrelated data sources that have never been linked before. As Blue Hill looks at data management, we plan to look at vendors ranging from market leaders such as IBM and Informatica to emerging startups such as Tamr, Trifacta, and Paxata to determine how each solution supports Blue Hill’s key stakeholders in technology, finance, and the line of business.
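A minimal sketch of those mundane fixes, again in Python with pandas and with invented column names, shows how small the basic cases are once someone (or something) has decided what to standardize:

```python
import pandas as pd

survey = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "opted_in": ["Y", "Yes", None, "N"],
    "region": ["Northeast", None, "Midwest", None],
})

# Standardize coded values: map "Y"/"N" onto the full words, leaving "Yes" untouched.
survey["opted_in"] = survey["opted_in"].replace({"Y": "Yes", "N": "No"})

# Provide defaults for null values in a column.
survey["opted_in"] = survey["opted_in"].fillna("Unknown")
survey["region"] = survey["region"].fillna("Unassigned")

# Join a previously unlinked source on a shared key so the two can be analyzed together.
orders = pd.DataFrame({"customer_id": [1, 2, 2, 4], "amount": [50.0, 20.0, 35.0, 80.0]})
enriched = orders.merge(survey, on="customer_id", how="left")
print(enriched)
```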

There has been a recent evolution focused on self-service management of business intelligence. The vendors that have caught Blue Hill’s attention to the greatest extent in this regard include Adaptive Insights, Birst, GoodData, IBM Watson Analytics, Microsoft PowerPivot, Qlik Sense, SAP Lumira, Tableau, and Yellowfin. One of the most interesting aspects of this evolution is that end users may initially assume that the startup vendors mentioned would be less scalable, whereas the established enterprise vendors would be more difficult to use. However, this is a false dichotomy; all of the leading vendors in this space, regardless of size, must scale and be easy to use. The key differentiators among these vendors tend to be more associated with the roles that they play within the enterprise and the extent to which they address the Blue Hill Corporate Hierarchy of Needs.


Predictive analytics has been a more difficult area to innovate in from a usability perspective. The biggest challenge has traditionally been the basic hurdle of statistical knowledge. For instance, Microsoft Excel has long had a statistical package that is sufficient to handle basic requests, but the vast majority of Excel users don’t know how to access or use it. Likewise, the statistical software giants, IBM SPSS and SAS, are easy enough to find in the academic world, where students cut their teeth on statistical analysis. But for knowledge workers who were not number crunchers in their college days, this availability is (appropriately enough) academic compared to their day-to-day requests for sales projections, production forecasts, and budget estimates. Because of this, the drag-and-drop workflows of Alteryx, the natural language inputs of Watson Analytics, and the modeling ease of RapidMiner and SAP InfiniteInsight are going to become increasingly important as companies seek to shift from being reactive monitors of data to being predictive and cognitive analyzers of data-driven patterns.

Finally, Enterprise Performance Management (EPM) represents an important subset of business intelligence focused on financial and operational planning, and it is a core capability for any business planning effort. Small companies typically use spreadsheets to handle this analysis. However, as companies take on multi-currency and multi-country operations, complex supply chains, diverse tax structures, and even treasury activities, they increasingly need a dedicated EPM solution that can be shared among multiple finance officers. At the same time, EPM needs to remain easy to use, or companies risk trading the assurance of compliance for delays of days or even weeks in financial closes and budgeting activities. In light of this core challenge, Blue Hill is looking both at the offerings of large software vendors (such as Oracle, IBM, SAP, and Infor) and at newer upstarts (such as Adaptive Insights, Host Analytics, Tidemark, and Tagetik) to see how they have worked to simplify the Enterprise Performance Management space.

These are the key research efforts that Blue Hill will pursue this quarter as we seek to understand the advancement of self-service in analytics, business intelligence, and data management. We are seeking the true differentiators that buyers can hang their hats on in 2014 and going into 2015, as they affect the three key stakeholders: financial, technological, and line-of-business managers.

