The Challenges of Big Data & Approaches to Data Quality

Using big data to examine and discover the value in data for accurate analytics

Introduction

Big Data Challenges

Big data has the potential to do many things, leading some to claim that it’s a replacement for today’s BI and data management environments. Replacement is unlikely — the big data stack is a multi-purpose processing platform, not a database. The technologies don’t support complex interactive queries well, if at all. While it enables activities like ad-hoc analysis and real-time processing, it is not designed to support BI workloads.

A data warehouse may have high data latency due to the way it is loaded, but it provides interactive response time to queries. Big data systems often do the opposite: store and maintain up-to-the-minute data, but deliver slow, batch-level response time to a query. Big data technologies are designed for scalable processing or simple storage and retrieval with very high throughput. To accomplish this they omit features that are a given in the BI environment: features like schemas so you can know how data is stored, or metadata so you know what type of data to expect in a given field. The metadata that does exist, including the schema, is optional, leaving enforcement of standards to the whims of developers. It can be easy to lose track of data, or for others to use it inappropriately.
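Because the platform itself will typically not reject a malformed record, one common workaround is to declare the expected fields in code and check records as they are read. The sketch below is a minimal illustration of that idea, not a feature of any particular big data product. The field names (`customer_id`, `channel`, `amount`) and the `EXPECTED_SCHEMA` mapping are assumptions made up for the example.

```python
# Hypothetical schema: field name -> expected Python type (assumed for illustration).
EXPECTED_SCHEMA = {"customer_id": str, "channel": str, "amount": float}


def check_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of problems found in one raw record (empty list = conforms)."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems


raw_records = [
    {"customer_id": "C1", "channel": "web", "amount": 19.99},
    {"customer_id": "C2", "channel": "store", "amount": "12.50"},  # amount stored as text
    {"customer_id": "C3", "amount": 5.0},                          # channel missing
]

for rec in raw_records:
    print(rec.get("customer_id"), check_record(rec))
```

A check like this only reports problems. Deciding whether to reject, repair or simply flag a record remains a governance decision.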

The processing layer in a big data environment is primarily code-driven. The situation is much like the early days of data warehousing when all data integration was hand-coded. Combine hand-coding, limited schema support and lack of metadata and the big data platform can become little more than a data dumping ground.

Big data can also complicate data architecture. Today, most data warehouse methodologies assume a single place for all data, and that the data is either source data, meaning it is of questionable quality and usability, or it is target data, meaning it’s been standardized, cleansed and is available for use in the warehouse. The only data available to users or analysts is the standardized common data in the warehouse layer.

A big data platform makes source data available, as well as processed data, and even data pulled from data warehouses and MDM repositories. It creates a new layer in the data architecture between data sources and the data warehouse. It’s possible for someone to mix the raw data with warehouse data, or bypass the warehouse and work directly on the raw data. Knowing the lineage of data and what state it’s in are important when that data can be stored in multiple locations at different levels of standardization, designed for different uses.

Many companies start by building a proof of concept. The early projects work well enough because they are constrained to one or a few large data sources and are not integrated with other information delivery systems. Analysts and developers have little trouble navigating and tracking data during these early stages.

The initial success seems to imply that it is safe to ignore data management principles like master data management and data governance. The opposite is true, unless the goal is to deliver a standalone system that doesn’t play a role in the organization's data infrastructure.

The problem created by ignoring some up-front planning is that data management was not designed into the big data architecture. As new data is added and existing data in the system is used for a range of purposes beyond the scope of the first project, the same difficulties faced in a BI environment appear: poor data quality, inconsistent attributes and definitions, incorrect results from analytic models, and difficulty integrating data.

The biggest long-term challenge to implementing and integrating big data in enterprise IT isn’t technology; it’s data governance and management.

In big data, as in BI, the data is the important element, not the technology.

A Look Inside the Evolving Information Needs of the Retail Sector

Effective consumer product development and retailing is about differentiation. The primary areas of differentiation today are customer experience, branding and service. Investment priorities should be in customer-facing areas since these drive retention and growth. This means building new analytic capabilities using a repository of business information beyond the core sales transactions that are the usual focus of analysis.

Most measurements in retail today are designed around the “average customer” and individual channels. The BI metrics in use are simple ones like average basket size, average items per order or profit per customer. Measurements like these ignore the individual behaviors that shape customer response to products, promotions and services.

Planning around a non-existent “average customer” ignores what we know about how customers drive profitability. Effective retail requires more sophisticated customer analysis strategies to understand behavior across a growing set of channels.

The new analytic requirements are turning top-down analysis on product sales into a bottom-up exercise of understanding individual behaviors. Planning product features and services should be driven by these details, requiring changes to planning and analysis. Both manufacturer and retailer marketing and merchandising optimization need to incorporate the behavioral differences of customers across channels.

Enabling detailed analysis requires the integration of data from each point of customer interaction, whether it is marketing, a transaction or a service. This interaction data must be married with all of the existing transaction data as well as data from loyalty programs, social media and other external sources.

New analysis capabilities require changes to the existing BI infrastructure. More data, increasingly complex analysis and the need for up-to-date information stress the current systems. Big data platforms offer opportunities to store and process this information, expanding visibility across channels and into customer behaviors.

Big Data Use In Retail and CPG

Multichannel Data and Analysis

The multi-channel customer experience has historically been characterized by a kind of dys-integration – a lack of consistency and interoperability across channels. The problem for most companies is that providing a seamless multi-channel interaction isn't just a desirable feature, it's an integral component of the customer experience.

“Customer experience” is a function of the totality of a customer's interactions with a company's sales, marketing, and service efforts across multiple touchpoints. These include visits to stores, phone calls, product catalogs and other direct mail, the web, and location-aware mobile software.

Take the example of a personalized promotion sent via email for an expensive new product. It lures a customer to visit the company web site. Registering for the promotion, she has difficulty finding a retail location that carries the product so she can see it before buying. Her visit to the web site eventually surfaces a click-to-call or click-to-chat prompt where she communicates with a customer service representative and is directed to an appropriate location. At the store she has difficulty finding the product and asks the floor staff, who can find it for her.

In this interaction with the company, the customer moved from email to the web site to a live online interaction with the service center to a retail store. From her perspective, she dealt with a single entity – the company. To the company this was a set of independent activities managed by channel, and likely not visible as a single set of interactions by anyone.

This is the multi-channel reality; it's true of retailers, of consumer packaged goods (CPG) manufacturers, and of companies in virtually every other industry. It’s a rare organization that doesn’t interact with customers using more than one channel.

Mobile and web strategies must work in conjunction with conventional marketing and store strategy, yet many retailers still organize the business so that each channel is managed separately. CRM systems have not helped as much as expected, in part because they are designed for a single channel or they only capture information that is common across all channels, losing the specifics of each. This limits CRM to a narrow view of customer interactions and makes it hard to understand them across multiple channels.

Operating each channel with separate BI and analysis systems was reasonable when customers could only visit a store or order from a catalog. Using online and mobile channels, a customer can read reviews and place an order with a competitor using their phone while standing in your store. Serving customers properly today requires treating channels as complementary components of a multi-channel system.

The state of channel data

Most companies today have visibility into the processes, activities and costs within each channel. The information is usually in analytic silos organized by function like marketing, call center or customer support.

Looking across channels is harder, but is necessary because channels can conflict with or reinforce each other. An online offer may cannibalize store sales, or vice versa. A catalog in the mail coupled with online ads can increase both online and store sales. Some channels may be better for customer acquisition than for retention. Use the wrong channel for your purpose and you can improve it at the expense of another.

The first step to managing a multi-channel environment is simply gaining visibility into what's happening within each channel. Business intelligence tools provide basic monitoring, but most IT shops are unprepared for the vast increase in data that results from getting the interaction data from each channel and storing it in one place, leading to performance and integration problems.

Transactions to interactions

Tracking all of the interactions with customers – as opposed to simply storing financial sales and service transactions – creates a new window on customer behavior. Pre-purchase shopping activities and post-purchase actions are visible. Connecting this customer behavior to the transaction data already available can show where and why you are winning sales or where you may be losing customers.

The use of much of this new data isn’t clearly repeatable, unlike most data in BI. The interaction data is more often explored and analyzed to identify patterns that can be acted on. Once a pattern is identified, the data may no longer be needed, so there is no need to build a complicated process to load it into a data warehouse.

Customer activity is one area where big data platforms are being used: to store all of the history for all interactions at each point of customer contact. From here they can be accessed, explored and analyzed in an ad-hoc fashion.

Other times the pattern detection can be operationalized: the analytic model is programmed into a process that sends results to other applications or to people who can act on the information. The big data platform becomes a component of the data flow into the warehouse or other systems. The BI environment can then be used to monitor the resulting business processes and results.

Customers, the source of interactions

To manage customers’ multi-channel experience properly means building a picture using all the details from each individual. For example, some people will open emails to read offers but never click online ads, while others will click ads but rarely open emails, and others visit the store when they receive a catalog.

People have different affinities for channels, so analytic models are used to determine which channel is preferred by each person for marketing, for selling and for service contacts. These affinities can't be determined at the aggregate segment level. They must be calculated on an individual basis, which means a lot of data processing.
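To make the per-individual calculation concrete, the following is a minimal sketch that computes a response rate per customer and channel and then picks the channel with the highest rate. The event layout `(customer_id, channel, responded)` and the sample values are assumptions for illustration. A production model would weigh recency, volume and contact purpose rather than a simple ratio.

```python
from collections import Counter, defaultdict

# Hypothetical interaction events: (customer_id, channel, responded)
events = [
    ("C1", "email", True), ("C1", "email", True), ("C1", "web_ad", False),
    ("C2", "web_ad", True), ("C2", "email", False), ("C2", "web_ad", True),
    ("C3", "catalog", True), ("C3", "email", False),
]


def channel_affinity(events):
    """Per customer, the response rate for each channel they were contacted on."""
    contacts = defaultdict(Counter)
    responses = defaultdict(Counter)
    for customer, channel, responded in events:
        contacts[customer][channel] += 1
        if responded:
            responses[customer][channel] += 1
    return {
        customer: {ch: responses[customer][ch] / n for ch, n in by_channel.items()}
        for customer, by_channel in contacts.items()
    }


def preferred_channel(affinity):
    """Channel with the highest response rate for each customer."""
    return {c: max(rates, key=rates.get) for c, rates in affinity.items()}


affinity = channel_affinity(events)
print(preferred_channel(affinity))
```

Every customer needs a separate pass over their own events, which is why the workload grows with the number of individuals rather than the number of segments.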

Customer segments can then be derived from the collection of individual affinities and behaviors. These segments are different from the static demographic and lifestyle segments used in the past. They group people by actual behavior rather than characteristics that imply similar behavior.

Behavior dictates the segments, so people may migrate between segments as their behaviors change, or as the company changes processes in different channels with the goal of changing customer behavior. The behavioral models are therefore run more frequently, or even updated continuously in the case of personalized systems.

Executing these types of analytic models daily (or more often) on large volumes of data is not the workload of a typical data warehouse platform. The increased workload is another reason for the use of big data platforms.

Integration and Data Management Challenges

The details for each channel make up an enormous and complex set of data. Most companies solved the problem of capturing transaction data and making it available via BI, yet few retain basket-level details for each customer for more than a year. This is mostly due to IT constraints of database cost and performance when storing large volumes of data, combined with the inability to make use of it.

The data itself is difficult to process and store in a standard BI environment. It’s composed of transactions, events and text from sources like the stream of clicks from a web site, location data from a mobile app, email marketing content, complaints emailed from customers, and the comments entered by staff in support of warranty applications.

The integration challenge is familiar to data warehouse professionals, but the systems and data are unfamiliar. Getting all the data requires collecting and managing massive amounts: any and every interaction that a customer has across a company's customer touchpoints. It requires meaningfully linking terms and concepts from emails and text with the attributes and identifiers in operational systems and the data warehouse. It requires uniquely identifying customers, products and locations across every system. It's at once a big data problem and a big data quality and master data management (MDM) problem.

Managing multi-channel data is multi-disciplinary. If your merchandising group and your web analytics team aren't using the same standardized product data, how usable is the intelligence you're producing? It's a question of standardizing in ways that are cross-referenceable and linkable between the channels and processes.

This involves data quality, in the sense of applying standardization at the point of data creation, along with notoriously thorny issues such as data stewardship across lines of business.

Workgroup and line of business managers must agree on key information, like how to maintain unique product and customer identifiers. If they don't, linking the data from systems in separate channels may be impossible.
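As a rough illustration of what that linking depends on, the sketch below resolves channel-specific identifiers to a shared master identifier through a cross-reference table and sets aside anything it cannot resolve. The table `ID_XREF`, the system names and the identifier formats are all hypothetical. In practice the cross-reference would be maintained by an MDM process rather than hard-coded.

```python
# Hypothetical cross-reference from (source system, local id) to a master customer id.
ID_XREF = {
    ("web", "u-1001"): "CUST-1",
    ("pos", "L-77"): "CUST-1",
    ("call_center", "555-0100"): "CUST-2",
}

interactions = [
    {"system": "web", "local_id": "u-1001", "event": "viewed product"},
    {"system": "pos", "local_id": "L-77", "event": "purchase"},
    {"system": "mobile", "local_id": "dev-9", "event": "store locator"},
]


def link_to_master(interactions, xref=ID_XREF):
    """Split interactions into those linked to a master id and those that cannot be."""
    linked, unlinked = [], []
    for row in interactions:
        master_id = xref.get((row["system"], row["local_id"]))
        if master_id is None:
            unlinked.append(row)
        else:
            linked.append({**row, "master_id": master_id})
    return linked, unlinked


linked, unlinked = link_to_master(interactions)
print(len(linked), "linked;", len(unlinked), "unlinked")
```

The unlinked list is the useful output for governance: it shows which systems are producing identifiers nobody has agreed how to map.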

Data should be governed at the point of creation because fixing it after the fact is difficult, time-consuming, and sometimes requires manual intervention. This is particularly relevant to system-generated event streams like web analytics and mobile application monitoring because a programmer creates the application that generates the data. The programming happens far in advance of deployment. By the time a problem is caught, there may be terabytes of data with incorrect attributes and weeks of delay before the application creating the data can be fixed. Big data is not exempt from the same data governance disciplines that affect BI today. The

and how the tools integrate with big data platforms. The data governance applications to help manage many of these problems already exist in the BI market.

Multi-channel retail requires organizational and process changes to succeed. These operational changes drive new data collection and analysis requirements. The challenge companies face is creation of a platform to pull all the required information together, process it, and deliver it when and where it is needed.

Understanding customers: the shift from product to customer focus

The retail and CPG industry has been centered on products and categories for too long. The financial metrics are derivative of customer behavior. Shifting the focus of analysis from products to customers shifts the focus from the symptoms of performance to the causes. The shift is one of perspective, from a focus on "what's selling?" to a focus on "who's buying?" The implications are broad because customers, segments and behavior become attributes of traditional product-based analysis as well as core analytical areas that impose new data storage and processing requirements.

Most retail analysis has been focused on the consumer in the aggregate. The industry is moving toward finer-grained customer analysis to drive all areas of the business. It's a subtle shift because customer data has always been a key element, but not at the individual level.

Companies are applying customer insights about who is buying and why at finer levels of detail, and they are using them for a broader set of activities than direct marketing. For example, analysis of buying behavior is used to merchandise products and adjust assortments in stores according to the local customer population. Some industry segments get more value than others from assortment and category planning using customer data; apparel is one example, because the product stock constantly rotates and local taste is important.

The industry is moving to a more detailed demand-driven supply chain. The shift from supply push to demand pull models is mainstream already. What is changing today is the collaborative effort between manufacturers and retailers to understand demand by customers and behavioral segments. The old demand and supply chain models involve item movement with little insight into who's buying, making it difficult to optimize supply at retail outlets.

Most uses of customer data today are still at a segment level, not at the individual customer level that enables practices like "next best offer" and situational marketing. These practices have been done in consumer banking for years. The retail industry is moving in this direction but it's difficult to achieve because of the amount of computing required, the need for deeper analytic capabilities, and, most difficult of all, changes to internal business practices.

Measuring the effectiveness of promotions provides a good example of why customer behavior should be tracked at an individual level. Imagine that your company sells a premium brand and discounts one of its primary products. Sales go up as expected, but explaining why is much more complex, linked to individual behaviors:

• People switch from their regular brand to yours because of the discount. They may switch back to their regular product when the price returns to normal, or they may like the product enough to become loyal customers.

• Others, who usually buy a discount product, now buy yours at the sale price, essentially moving up in quality due to the lower price. Most of them will switch back when the promotion is over.

• People who have never tried the product may be incented to try it at the discounted price, converting some of them into regular customers.

• Brand-loyal customers buy more, which will translate into a drop in sales from this group after the promotion is over.

• Brand-loyal customers who see price as an indication of quality believe the product’s quality has deteriorated, which may lead to lower loyalty or defections to other brands, with both reputation and financial costs.

The outcomes vary based on individual behaviors. If you know these outcomes for each individual, your promotions can be adjusted to address expected behavior. In essence, they can be tuned for customer acquisition or retention. Marketing can be more effectively targeted at brand switchers, or treated as a loyalty incentive for one group of customers.
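The behaviors listed above can be approximated from purchase history. The sketch below labels each customer from the units of the promoted product they bought before, during and after the promotion. The labels, the rules and the sample figures are illustrative assumptions, and the three periods are assumed to be of equal length. Telling a brand switcher from a customer trading up would also need data on what else each customer bought, which this sketch leaves out.

```python
def classify_response(before, during, after):
    """Label one customer's response from units of the promoted product bought
    before, during and after the promotion (labels are illustrative only)."""
    if before == 0 and during > 0:
        # New to the product: did they keep buying at the normal price?
        return "converted" if after > 0 else "temporary switcher"
    if before > 0 and during > before:
        # Existing buyer who bought more than usual during the discount.
        return "loyal stock-up"
    if before > 0 and after == 0:
        # Existing buyer who did not stock up and stopped buying afterwards.
        return "possible defection"
    return "no clear change"


# Hypothetical history: customer_id -> (before, during, after) units
history = {
    "C1": (0, 2, 1),
    "C2": (0, 1, 0),
    "C3": (2, 5, 0),
    "C4": (3, 3, 0),
    "C5": (1, 1, 1),
}

labels = {cid: classify_response(*units) for cid, units in history.items()}
print(labels)
```

Counting customers per label gives a first view of whether a promotion mainly acquired new buyers or shifted the timing of purchases by existing ones.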

The lifestyle or demographic segments that are usually used by marketing can’t help here. People may fit into a lifestyle segment based on a demographic profile, but their behaviors can place them into completely different segments. It’s usually more effective to use bottom-up segments defined by behaviors than it is to use top-down demographic segments.

Understanding behavior at this detailed level requires analysis of the customers to see which group they fit into. The analysis requires a history of marketing interactions and purchase patterns at the individual customer and product level. Given the amount of direct or individually trackable marketing today — emails, online ads, mailers with unique QR codes, mobile applications — there is a wealth of customer data available that can be used to assess pre-purchase triggers and behaviors.

Customer analysis extends well beyond marketing. Satisfaction, loyalty, and willingness to recommend your product or service are all important. These metrics are strongly influenced by product quality and features, by the buying experience, and by post-sale service and support. In order to see this larger picture of the customer, data from each customer’s interactions in each marketing, transaction and service channel is required. This is a big data problem in size, complexity and analytic processing. For example, product quality and satisfaction can be understood in a broad sense through online reviews, ratings and discussions on social media, as well as customer surveys. To use this data requires a combination of text analytics and integration technologies to map the information from the raw text to products, product features, and other aspects of the business.
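The mapping step can be pictured with a deliberately simple sketch: matching known product names in review text against a product master. The dictionary `PRODUCT_TERMS`, the product identifiers and the sample reviews are hypothetical, and plain substring matching stands in for real text analytics, which would also have to handle misspellings, synonyms and sentiment.

```python
import re

# Hypothetical product master: product id -> names customers might use.
PRODUCT_TERMS = {
    "P-100": ["trail runner", "trailrunner"],
    "P-200": ["rain jacket"],
}


def products_mentioned(text, product_terms=PRODUCT_TERMS):
    """Return the sorted product ids whose known names appear in the text."""
    # Lowercase and collapse runs of whitespace so spacing differences do not matter.
    normalized = re.sub(r"\s+", " ", text.lower())
    return sorted(
        pid
        for pid, terms in product_terms.items()
        if any(term in normalized for term in terms)
    )


reviews = [
    "Love my new Trail  Runner shoes, very comfortable.",
    "The rain jacket leaked and the trailrunner soles wore out.",
    "Fast shipping!",
]
for review in reviews:
    print(products_mentioned(review))
```

Even this toy version only works if the product master supplies a stable identifier and an agreed list of names, which is the governance point of the passage.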

One thing that should not be ignored is the impact of data quality and governance on this data. Unless care is taken to use shared, unique customer identifiers in customer-facing or tracking systems, it may not be possible to identify individual customers or link all interactions to them when the data arrives in the big data platform.

identifiers for the important things the business analyzes: products, locations, channels, and customers.

The retail and CPG industry has a long history of customer data management, particularly in businesses that sold via a direct channel. Big data provides missing elements: interaction data that wasn’t available in the past, indicators of behavior and responses, and the ability to perform sophisticated analysis on the entirety of the data.

The big data platform doesn’t replace the existing customer master. It must be linkable to the customer master. It stores the massive volumes of interaction data and results of analytic processing. The platform can be used for ad-hoc discovery, exploration and running analytics.

When to Apply Data Management Principles to Big Data?

Quality is contextual

The over-arching principle of data quality in a data warehouse is a flat earth model. There is one data model and one set of integrated data. Therefore there is one set of rules for the cleaning and standardizing of data before it is loaded.

These rules ensure that the data is clean enough for the most stringent use case. This is good for the transactions that data warehouses were designed to manage. You don't want financial data with material impacts to be incorrect because it's expensive or slow to clean and load. The problem with this approach is that it doesn't factor in other uses. Quality is a measure of the fitness for a particular use. The data may not be good enough to provide public financial forecasts, but it is good enough to estimate how much inventory to have on hand to avoid stores running out of stock.
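One way to express "fitness for a particular use" is to measure a quality attribute once and compare it against a different threshold per use. The sketch below does this for record completeness. The thresholds in `REQUIRED_COMPLETENESS`, the use names and the sample data are assumptions chosen to mirror the forecast and inventory examples above, not recommended values.

```python
# Hypothetical minimum share of complete records that each use requires.
REQUIRED_COMPLETENESS = {
    "public_financial_forecast": 0.999,
    "store_inventory_estimate": 0.95,
}


def completeness(records, required_fields):
    """Share of records in which every required field is present and not None."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(f) is not None for f in required_fields) for r in records
    )
    return complete / len(records)


def fit_for(use, records, required_fields):
    """True when the data meets the (assumed) threshold for the named use."""
    return completeness(records, required_fields) >= REQUIRED_COMPLETENESS[use]


# 97 complete records and 3 with a missing unit count.
sales = [{"sku": "A", "units": 3}] * 97 + [{"sku": "B", "units": None}] * 3

for use in REQUIRED_COMPLETENESS:
    print(use, fit_for(use, sales, ["sku", "units"]))
```

The same data set passes one test and fails the other, which is the sense in which quality is contextual.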

Transaction data, being core to the business, has the strictest requirements. The data warehouse has its origins managing this type of data, and the associated data requirements. Big data applications, in contrast, use interaction and transaction data, and use that data for different purposes.

These purposes are most often directional in nature, rather than providing an absolute answer to a question like "what was the profit on this product line last quarter?" At other times, they may be very specific to the action to be taken for a specific customer. The varied uses should dictate the level of standardization and quality required.

Quality is not free

Interaction data itself is different. It is largely derived from system actions and event logs rather than entered by people, which makes it less valuable for reporting use cases. The interaction data that is created by people is usually text rather than discrete values in a form.

Much of this data is noisy and contains gaps. However, analyzing the patterns over periods of time or in large collections of events provides useful information. This is a form of data extraction, involving analytic techniques rather than queries or the data processing techniques used for warehouse extraction and loading.

Analytics require different cleaning, standardizing and formatting than the data in data warehouses; you are preparing data for machine consumption rather than human-readable reports. The type of data cleansing required is heavily dependent on the analytic techniques to be applied. This makes it hard to anticipate in advance exactly how to clean, standardize and store the data.

Cleansing the data ahead of time means it will be optimal for some analytics, but not others, and it still will not be in a human-usable format for querying. Because of these restrictions, the event streams and interaction data are often better left in a raw or lightly processed form. To do otherwise would incur significant costs given the complexity and volume of data.

The output of analytic processing is delivered to people or into operational systems. In many cases it's more important to manage this information than the large collection of source data fed into the processes that created it. There is no quality concern in the usual sense since this data is the output of an analytic process, but the lineage and other aspects are still important. The data management nuances and tradeoffs are important to consider when working with big data. By storing the source data you are deferring the time and effort it will take to process the data into a usable form for consumption. In a data warehouse you expend this effort up front, the benefit being quick response times to user queries.

In a big data platform you often wait to do this until you're sure it's an ongoing requirement. This approach trades the time to access and use data for the speed with which data can be made available for use. The net effect is more rapid projects and lower development effort, and therefore cost. When processes become repeatable and reusable, the data can then be cleaned and transformed on load rather than on use.
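The difference between cleaning on use and cleaning on load can be shown with two small in-memory stores. This is a sketch of the tradeoff only. The class names, the event fields and the cleaning rules in `clean_event` are assumptions, and a real platform would apply the same idea to files or streams rather than Python lists.

```python
def clean_event(raw):
    """Standardize one raw event (the rules here are illustrative assumptions)."""
    return {
        "customer_id": raw["customer_id"].strip().upper(),
        "channel": raw["channel"].strip().lower(),
    }


class RawStore:
    """Keeps events exactly as received; cleaning is deferred to read time."""

    def __init__(self):
        self._events = []

    def write(self, raw):
        self._events.append(raw)  # cheap write, no cleaning cost up front

    def read_clean(self):
        # The cleaning cost is paid on every read instead of once on load.
        return [clean_event(e) for e in self._events]


class CleanStore:
    """Cleans each event once, on load, so that reads are cheap."""

    def __init__(self):
        self._events = []

    def write(self, raw):
        self._events.append(clean_event(raw))

    def read(self):
        return list(self._events)


incoming = [{"customer_id": " c1 ", "channel": "Web "}]

on_use, on_load = RawStore(), CleanStore()
for event in incoming:
    on_use.write(event)
    on_load.write(event)

print(on_use.read_clean() == on_load.read())
```

Both stores return the same cleaned result. What changes is when the work is done, and the raw store also keeps the original values available for analytics that need different preparation.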

A data quality plan is no good without process

The retail and consumer manufacturing sector is re-orienting processes around customers and multi-channel presence. It’s challenging to implement because of the amount of data processing required, the finer levels of detail and the need for new analytic capabilities. Big data can fill needs unmet by the BI environment. Big data platforms offer new capabilities to collect, store and analyze large volumes of data, and use data that was previously unusable due to processing limitations.

The use isn’t without boundaries. A big data platform isn’t a database and doesn’t perform the same set of tasks, so it’s not a replacement for a data warehouse. It’s better viewed as a data processing complement to a data warehouse, with some potential for use as an alternative for large volumes of complex data.

The primary use cases for big data in the retail sector are storage for large volumes of detailed data that would otherwise be archived from a conventional data warehouse, large scale processing of event streams, processing of textual or non-tabular data, and both batch and ad-hoc analytic processing.

This difference in use doesn’t exempt a big data platform from the data governance that has been implemented in the BI world, although how and where it is implemented may change. In particular, links between master data and the developer-driven processes that create event logs must be put in place. Without these, it can be impossible to tie interaction data back to the reference data that gives it meaning.

Data quality is vital when it comes to the management of unique identifiers for the core elements of the business: products, customers and channels. Lack of quality in this area means that data from different sources can’t be linked, whether it’s interactions in a big data platform or transactions in a data warehouse.

Data standardization has less emphasis because a big data platform isn’t used for daily operational reporting. When quality is important, the scalable processing capabilities can be used to clean the data at the time of use rather than cleaning it all before storage. In real-time cases, it may not be possible to clean the data as it’s being written due to volume and latency, so cleaning after the fact is the only option.

Big data offers many new capabilities that extend what can be done with information beyond BI and data warehousing, but only with proper attention to the data being produced and used.

About the Author

Mark Madsen is a research analyst focused on analytics, big data and information management. Mark is an award-winning architect and former CTO whose work has been featured in numerous industry publications. He is an international speaker and author. For more information, or to contact Mark, visit http://ThirdNature.net.

Third Nature is a research and consulting firm focused on new practices and emerging technology for business intelligence, analytics and information management. The goal of the company is to help organizations learn how to take advantage of new information-driven management practices and applications. We offer consulting, education and research services to support business and IT organizations and technology vendors.

About SAP

Master Big Data with SAP Information Management Solutions

Information management (IM) is the practice of managing data used by operational applications and analytic solutions to support day-to-day operation and decision-making processes. It elevates raw data into valuable information that can help drive operational excellence and competitive advantage. Your organization can maximize its return on big data analytics by using Information Management solutions from SAP to:

• Gain a complete view of information by accessing and integrating data at any scale from any data source with high velocity

• Get unprecedented insight from Big Data by extracting useful intelligence from unstructured data and combining it with structured data for new contextual insight

• Ensure trust in information by governing data quality, correcting issues during data movement, and defining policies to know when data is fit for use

For more information, visit www.sap.com/eim.

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. These materials are provided for information only and are subject to change without notice. SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and other countries. Please see http://www.sap.com/corporate-en/legal/copyright/index.epx#trademark for additional trademark information and notices. Inquiries regarding permission or use of material contained in this document should be addressed to:

Third Nature, Inc. PO Box 1166
