Big Data and the Case Study Motivation


BIG DATA AT SYSTEMS IN MOTION:

EMPHASIZING THE USE CASE-BASED APPROACH

This white paper will discuss how “Big Data” – data that is too large to be managed by traditional techniques – is changing the practice of Business Intelligence and Analytics (BI). It will look at what organizations need to ask when adapting their BI strategy to accommodate exponentially growing data volumes. It will touch on current techniques of storing, managing, and analyzing data, and the limitations of those techniques.

The paper will conclude that a use case-based approach is the most meaningful way ahead in a Big Data environment.


Table of Contents

1. The Challenge and Promise of Too Much Data
   1.1 Analyzing large datasets
   1.2 Volume, Velocity, Variety
   1.3 Challenges and Transformation
2. The Business Case for Big Data
   2.1 Use Cases and Benefits
   2.2 Why Big Data?
3. Current Techniques: Advantages and Disadvantages
   3.1 Hadoop
   3.2 Vertica
   3.3 Amazon Redshift
4. The Use Case Driven Approach to Big Data
   4.1 Descriptive, Predictive, and Prescriptive Analytics
   4.2 The Engagement Process at Systems in Motion
5. Solutions and Packages
   5.1 Case Studies
   5.2 SIM’s Information Management Solution Architecture
   5.3 Packages and services
References


1. The Challenge and Promise of Too Much Data

For organizations that use Business Intelligence and Analytics (BI) to derive actionable insights from the data available to them, traditional information-handling mechanisms fall short when it comes to massive, exponentially growing volumes of heterogeneous data. A typical large enterprise accesses public and private data from a variety of sources. The data can be structured, unstructured, or semi-structured; it typically includes information gathered from customers, information about customer behavior, and data from social media, mobile applications, and community support forums. The format and structure of the data vary widely – multi-language text, voice, images, video, semi-structured logs, and contextual information about the data.

1.1 Analyzing large datasets

According to an IDC study, less than 1% of the data in the world is analyzed; the digital universe will amount to 40 zettabytes (40 billion terabytes) in 2020, a 50-fold increase from 2010.[1]

But the volume and heterogeneity of the available data are not just challenges; they are opportunities. Real-world correlations exist between seemingly unrelated information, so the analysis of large, combined datasets can provide far deeper and more holistic insights than the analysis of individual, smaller datasets. Enterprises now have the ability to gather data from outside their firewalls (social, mobile, web), so they have access to rich data sets that can be mined for correlations to data generated within their enterprise systems (ERP, CRM, SCM, etc.). For example:

- Patterns in astronomical data can lead to discoveries
- Patterns in buyer behavior can lead to more targeted product recommendations and higher ad click-throughs
- Correlations between patient health and their ethnicity may lead to a new direction for research
- Social media usage patterns could lead to the development of new and more engaging applications

1.2 Volume, Velocity, Variety

The term “Big Data” itself needs to be clarified. Often quoted in this context is Matt Aslett of 451 Research, who wrote in October 2012[2]: “While the term Big Data might be almost universally unloved, it is also now almost universally understood to refer to the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies. Those limitations have come to be defined by a combination of words… volume, variety, and velocity…”

- Volume refers to the fact that there is too much data, driven primarily by digital channels, to manage with traditional tools.
- Velocity refers to the rate at which data is being produced, which is too high for the capability of traditional tools – whether it is in batches, near-real-time, or real-time. Machine logs on active networks and real-time user events from popular mobile apps are high-velocity data generators.
- Variety refers to the fact that data may be structured, semi-structured or unstructured, or of a format/structure that does not lend itself to traditional modeling and analysis.


1.3 Challenges and Transformation

Technical challenges associated with large datasets include capture, filtering, storage, retrieval, transfer, analysis, and visualization. Traditional tools and platforms are lacking in terms of speed and storage. Existing methodologies and systems struggle to keep pace with growing volumes, and also with the changing nature of the value that organizations want to derive from analysis.

The Big Data “problem” is compounded by the different vectors along which customers, the business, and IT departments are moving:

- Customers are more sophisticated and brand-aware. With their increasing use of digital media to rate and purchase products, there is more useful data that can be captured – and also more demand on organizations to do so.
- Businesses are increasingly making data-driven decisions, which means they need more sophisticated analytics. Given the decreasing shelf-life of data, they prefer self-serve methods to get at the results of the analysis very quickly.
- IT departments have stronger cost-cutting mandates, and must show value to the business faster and faster – while analyzing customer data that has (as above) a shorter and shorter shelf-life.

Organizations, therefore, are largely out of sync with a networked, digital world. They can align with it by changing how they handle data, what data they filter out, how they analyze it, and what they apply it to.

Big Data impacts information management strategies in different ways for different information-driven systems. In marketing, untargeted outreach evolves into targeted, personalized message delivery. In campaign systems, standalone CRM systems with slow market responsiveness give way to real-time information management, integrated across channels. Even pricing, which has been driven by usage data across the general population, can be personalized according to shopping and browsing patterns.

BI, in fact, is moving from static reporting to real-time predictions (and hence, recommendations) from smart systems.

2. The Business Case for Big Data

As early as May 2011, a McKinsey Global Institute report said Big Data would be the “next frontier for innovation, competition, and productivity.”[3] In 2013, Big Data will drive US$34 billion in IT spending.[4]

The business case for Big Data is, now as in 2011, indicated in the McKinsey Global Institute report. It listed five broad ways in which Big Data analyses can create value[3], including:

- Unlocking of value by making information more transparent and usable at a much higher frequency
- Organizational performance boost via collection of information on a variety of customer interactions all across the product lifecycle
- Development of precisely tailored products and services
- Improved decision-making via sophisticated analytics


2.1 Use Cases and Benefits

Some business use cases and business benefits are summarized below.

Domain | Use Cases | Business Benefits
Retail (web, mobile) | Targeted real-time offers; dynamic pricing | More click-throughs and conversions
Retail | Customer lifetime value prediction | Creation of targeted channel campaigns and improved campaign RoI
Digital media | Behavioral analytics | Enhanced customer engagement leading to higher monetization
Networks | Data and user security | Proactively identify and block disruptions and threats
Telecom | CDR analysis | Reduce churn and optimize pricing
Healthcare | Behavioral analytics for the pharmaceutical and insurance industries | Contextual pricing and proactive management of patient health

2.2 Why Big Data?

While it is true that every organization has – or can access – data that it has not tapped into, not every organization needs Big Data, and no two companies will benefit from Big Data insights to the same degree. Here are some questions every business needs to ask:

1. Do we need Big Data? Many organizations buy into the idea that they can benefit from Big Data analysis, and then prepare to throw resources at the problem.

The prevalent idea is that if there is a large volume of potentially useful information, and if the tools for analyzing it exist, then something of value will emerge when the two are put together.

Some datasets can be analyzed with traditional tools. Depending on the diversity of the data (Variety) and how fast it arrives and/or changes (Velocity), an organization might or might not need to restructure the way it manages its data.

In some cases, before looking at channeling new data streams, it might be more productive to analyze whether existing data can provide the desired business insights. On the other hand, one round of analysis might uncover the need for new data channels. Different tools and systems serve the purpose depending on the rate of inflow of information, as discussed in §1.

2. What is the anticipated RoI? Big Data projects are not cheap. An organization cannot assume an RoI based on the outcomes of similar projects. An RoI assessment requires competent data scientists, who can build a comprehensive understanding of the business environment, existing and potential data sources, and possible use cases.

3. What is the best use case? How best would your organization leverage Big Data, and subsequently, what data might you try to get access to? As an example, one organization in the retail industry might benefit most from targeted point-of-sale advertising; another might benefit from dynamic pricing.

What is needed is an evaluation of available tools, and an assessment of current and potential data sources – along with business trends within the industry in question.

What all of the above point to is the idea that data and suitable tools are not by themselves sufficient to make a Big Data business case. The project should ideally begin with an analysis of unmet strategic or tactical business needs that are not served well, or at all, by the current Information Management architecture, and proceed along a use case-based approach. Such an approach will ensure the necessary business buy-in towards a highly desired, business-oriented outcome rather than “incremental”, bits-and-bytes-oriented gains.


3. Current Techniques: Advantages and Disadvantages

Different platforms have different capabilities and limitations in terms of their ability to handle massive amounts of structured, semi-structured, and unstructured data.

3.1 Hadoop

Hadoop has become the de facto standard for storing, processing and analyzing large volumes of data up to the petabyte scale. The Hadoop Distributed File System (HDFS) and the set of tools and technologies used to process data from it are depicted below:

[Figure: the Hadoop ecosystem. Layer labels: top-level interfaces, top-level abstractions, distributed data processing, self-healing clustered storage system. Components shown: dashboarding and reporting, Hue, Impala, workflow (Azkaban, Oozie), analysis (Pig, Scalding, Mahout, Hive), MapReduce, HBase, HDFS, ZooKeeper, data pipeline (DistCp, Sqoop, Flume, Scribe).]

Hadoop is an open source framework that traces its origins to the Google File System. MapReduce is a framework for processing very large datasets on a distributed system, for certain kinds of analysis. The many advantages of Hadoop include:

- Cost-effectiveness, because it can work on inexpensive servers that store as well as process data
- Almost infinite scalability through its massively parallel processing capabilities
- Capability to store highly heterogeneous data from very disparate systems, and use schema-on-read methods to process data on demand

In the BI and Big Data contexts, there are downsides to using Hadoop alone:

- Building real-time applications and generating real-time responses to queries are difficult.
- There is a focus on staging and storing data before other operations. Datasets must always be processed using MapReduce to get insights and further actions.
- Hadoop works in batch mode, so processing jobs need to be run over the entire dataset when new data is added. This means time-to-analyze keeps increasing. That, in turn, makes Hadoop by itself unsuitable for use cases where new data comes in at regular intervals – and where business will benefit from real-time analysis of such data.

In sum, Hadoop is a powerful data analysis framework, but it is not the tool of choice for all use cases.
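The batch-mode behavior described for Hadoop can be illustrated with a minimal, single-process sketch of the MapReduce pattern in Python. This is not Hadoop's API; the record format ("user,event") and the function names map_phase, shuffle, reduce_phase, and run_job are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, 1) pairs; the key is the event type in a 'user,event' record."""
    event_type = record.split(",")[1]
    return [(event_type, 1)]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run one complete batch job over every record supplied."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample data.
batch_1 = ["u1,click", "u2,view", "u1,view"]
batch_2 = ["u3,click"]

counts = run_job(batch_1)
# When new data arrives, a plain batch job is rerun over the whole dataset,
# so the work grows with the total data size rather than with the new batch.
counts_after = run_job(batch_1 + batch_2)
```

The second call reprocesses the three old records in order to account for one new record, which is the reason time-to-analyze grows as data accumulates.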


3.2 Vertica

A typical columnar database architecture is depicted below. The Vertica Analytics Platform, as one example, uses a columnar database design.

[Figure: typical columnar database architecture. Labels: SQL; SQL-MapReduce; unified interface; high-volume, fast querying; WLM (Dynamic Workload Manager); massively parallel data stores; row store; column store.]

Cited advantages of Vertica, which are also the general advantages of columnar databases, include:

- Fast real-time queries. Access to immediate answers allows ad hoc analysis of, and insights from, time-sensitive data.
- The data compression system used in Vertica means lower cost of storage.
- An analytics library is built into the database, which makes it possible to perform a variety of operations on data without the intermediate step of extraction.

Vertica is not suitable in some cases, for the following reasons:

- Data updates are not supported. If previously loaded data needs to be updated, the entire dataset has to be loaded again – which has a huge operations impact on real time analytics systems.
- The system is optimized for read speed. When reads and writes happen in parallel, there is a performance slowdown.
- As datasets become larger and more complex, and as queries become more diverse, the data compression benefit diminishes.

3.3 Amazon Redshift

A cloud-based data warehouse service, Amazon Redshift can scale to the petabyte range. The system architecture is similar to that of Google BigQuery, another web service that allows analysis of very large datasets:

[Figure: Amazon Redshift architecture. Labels: client applications; JDBC/ODBC; data warehouse cluster; leader node; compute nodes 1 to n; node slices.]

Like Vertica, Redshift uses a columnar database, compresses data and is optimized for read speeds. Apart from dataset size, advantages of Amazon Redshift for Big Data analysis include:

- Fast real-time queries are supported, as in Vertica. Immediate answers to queries allow ad hoc analysis of time-sensitive data.
- Redshift, unlike Hadoop, supports SQL functionality.
- Data and querying can be managed over the cloud. Enterprises do not need a new infrastructure for analytics.

Drawbacks to Redshift as an analytics platform include the fact that it does not provide interoperability between SQL and other languages. Also, it has the same inherent drawbacks of a cloud service – network latency and security – to overcome which the BI applications would have to be on the same cloud set-up where the data resides.
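The columnar layout that Vertica and Redshift share can be sketched in a few lines of Python. This is a toy illustration, not either product's storage engine: the sample orders and the helper names to_columns and run_length_encode are hypothetical, and run-length encoding stands in here as one common example of columnar compression.

```python
# Hypothetical sample data in row-oriented form: one dict per record.
rows = [
    {"order_id": 1, "region": "west", "amount": 120.0},
    {"order_id": 2, "region": "east", "amount": 80.0},
    {"order_id": 3, "region": "west", "amount": 50.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


columns = to_columns(rows)

# An aggregate touches only the one column it needs, not whole rows.
total = sum(columns["amount"])

# Values within a column share a type and often repeat, so a sorted
# column compresses well.
compressed = run_length_encode(sorted(columns["region"]))
```

The same layout hints at the write-side cost noted above: changing one record means touching every column list, whereas a row store changes a single record in place.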


4. The Use Case Driven Approach to Big Data

As mentioned in §2, some companies do not know precisely what they want from Big Data analytics. On the other hand, companies that have a business goal want options. They may want real-time answers from a stream of information, or they might want one batch of data processed every few days. Similarly, they might or might not know the best use case, knowing only that there is valuable data to be tapped into.

This points to the idea of the use case-based approach; the use case for Big Data analytics determines the choice of the Information Management architecture to support these Big Data initiatives. Systems in Motion works with customers at various stages of maturity of their BI programs.

These stages span the spectrum from descriptive to predictive to prescriptive.

4.1 Descriptive, Predictive, and Prescriptive Analytics

In descriptive analytics, we perform real-time processing of “what happened”; this manifests as business reporting and data warehousing. In the predictive phase, we extrapolate and forecast, to answer the question of “what will happen” (data and text mining). For fully data-driven business decisions, prescriptive analytics combines insights from the descriptive and predictive phases to answer the questions of “what to do” and “why to do it.”

With descriptive analytics, we design and develop the data warehouse / operational data source, develop reports, and conduct data migration and integration. With prescriptive analytics, we conduct a role and outcome-specific analysis, design a predictive analytics framework, leverage external and unstructured data, and deliver Big Data implementations in the cloud. This analytics spectrum – and the spectrum of SIM’s offerings – is outlined in the table below:

Stage | Questions | Enablers | Outcomes
Descriptive | What happened? What is happening? | Business reporting, dashboards, scorecards, data warehousing | Well-defined business problems and opportunities
Predictive | What will happen? Why will it happen? | Data mining, text mining, web mining, media mining, forecasting | Accurate projections of future states and conditions
Prescriptive | What should I do? Why should I do it? | Optimization, simulation, decision modeling, expert systems | Best possible business decisions and transactions

In all cases, SIM helps customers capture, collect, store, and process all data that relates to the enterprise – whether outside-in or inside-out.
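The three stages can be made concrete with a toy Python sketch over a short sales series. Everything here is an assumption made for illustration: the weekly figures, the straight-line forecast, and the restocking rule are hypothetical and do not represent SIM's analytics engine.

```python
# Hypothetical weekly unit sales, oldest first.
weekly_sales = [100.0, 110.0, 120.0, 130.0]


def describe(series):
    """Descriptive: summarize what happened."""
    return {"total": sum(series), "mean": sum(series) / len(series)}


def predict_next(series):
    """Predictive: fit a least-squares line and extrapolate one step.

    Needs at least two points, otherwise the slope is undefined.
    """
    n = len(series)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * n


def prescribe(forecast, stock_on_hand):
    """Prescriptive: turn the forecast into an action (units to reorder)."""
    return max(0.0, forecast - stock_on_hand)


summary = describe(weekly_sales)
forecast = predict_next(weekly_sales)
reorder = prescribe(forecast, stock_on_hand=125.0)
```

Each stage consumes the output of the one before it, which is why the enablers in the table build on one another, from reporting to forecasting to decision modeling.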


4.2 The Engagement Process at Systems in Motion

SIM’s iterative engagement process for a use case-driven Big Data environment consists of four phases, as depicted below:

1. Analyze (Big Data Workshop): effective Big Data analytics environment; uses and sources of data; information value mapping; use case definition; analysis of current environment; data feeds analysis (variety, volume, velocity)
2. Plan (Planning): architecture (data architecture, analytics); data modeling (schema-on-read, events, etc.); application design (analytics, closed-loop APIs); iteration planning; storyboarding; iterative QA strategy
3. Develop (Iterative Development): Agile Scrum development; prototyping; cloud deployments (private, public cloud); multiple Scrum teams
4. Manage (Ongoing Management): ongoing support; additional functionality; new development; application management

5. Solutions and Packages

We discussed, in §2, some Big Data use cases that SIM has worked on. Here are a few solutions we have delivered.

5.1 Case Studies

Analytics apps for retail sales

A big box retailer wanted a connected business strategy (online, mobile and social). They needed to analyze cross-channel sales data; their high latency in identifying sales trends and patterns was a challenge. SIM helped them track real-time sales trends on mobile devices; our Big Data analytics engine helped perform drill-down trends analysis and incentive planning.

Audience analytics

A leader in enterprise gamification – the application of game design to non-game environments – realized that they needed data-driven insights to optimize audience engagement. They lacked visibility into their RoI metrics, and they needed to optimize campaign spend; which campaigns drove the most engagement was not clear. SIM’s analytics engine uncovered patterns that helped increase user adoption of the brands they worked for, along with brand loyalty. The customer was able to create promotions and targeted campaigns to incentivize an engaged audience.

Real-time campaign analytics

A mobile entertainment application provider wanted real-time analysis of digital campaigns being run across multiple referrers – to measure the RoI on user acquisition spend. SIM’s real-time Big Data cloud platform was used to instrument the app using client-side and server-side SDKs, and to provide marketing users with an interactive tool to analyze campaign effectiveness. The customer increased spend on their most effective channels in subsequent campaigns, and also used the information to identify the most revenue-bearing user segments and cohorts for targeted follow-on campaigns.

5.2 SIM’s Information Management Solution Architecture

A sample solution architecture is depicted below:

[Figure: sample solution architecture. Stages: data collection from a variety of sources (APIs, mobile devices, field sensors, satellite data, external databases; high volume, velocity, variety data traffic); use case driven Big Data environment in the Big Data Cloud (Hadoop DFS, MapReduce, SQL, JDBC, ODBC); purpose-built analytical applications (Products, Marketing, Sales, R&D).]

- High-velocity, diverse data can be collected from a variety of sources
- The Big Data storage architecture is determined by the use case
- The analytics engine performs high-volume, fast queries to uncover patterns
- Custom analytics applications deliver visualizations based on the use case

5.3 Packages and services

Systems in Motion offers different information management packages for organizations working with Big Data. The packages span identification of use cases, a showcasing of benefits of those use cases, platform modernization to address the Big Data challenge, Big Data analytics, and prediction derived from Big Data analytics. These are summarized in the table below:

Package Name | Description
Big Data Discovery | 1-3 day workshop to inform, educate, and identify early business use cases for Big Data
Big Data Pilots | 60-90 day pilots to showcase Big Data-driven benefits for identified use cases
EDW Modernization | Modernization of existing EDW, IM infrastructures to address real time analytics need; leveraging modern MPP platforms to reduce storage/infrastructure spend
Big Data Analytics | Architect and deploy cloud based Big Data Analytics platform, use case specific solution; from data feeds to end visualization layer
Big Data Prediction | Data mining and prediction using Big Data platform; batch data processing-focused

For more information on our Big Data packages and services, please visit www.systemsinmotion.com.

References

1. EMC press release, December 2012. New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World’s Data is Analyzed; Less Than 20% is Protected. http://www.emc.com/about/news/press/2012/20121211-01.htm

2. Matt Aslett, research director at 451 Research, October 2012. Research Director Reflects on New Big Data Book. http://www.ibmbigdatahub.com/blog/research-director-reflects-new-big-data-book#sthash.pz0ngU2i.dpuf

3. McKinsey Global Institute Report, May 2011. Big data: The next frontier for innovation, competition, and productivity. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

4. Gartner press release, October 2012. Gartner Says Big Data Will Drive $28 Billion of IT Spending in 2012. http://www.gartner.com/newsroom/id/2200815

About Systems In Motion

Systems in Motion was founded with a vision of challenging the existing notions and practices of IT consulting and outsourcing. Our agile, integrated and business focused approach allows us to deliver game changing ROI with deployment of cutting edge technology solutions using onshore delivery centers and global innovation hubs.

GLOBAL INNOVATION HUB
Systems In California, 7707 Gateway Plaza, Suite 100, Newark, CA 94560

LEAN SERVICE DELIVERY CENTER
Systems In Michigan, 1136 Oak Valley Drive, Ann Arbor, MI 48108

Telephone: (415) 992-7277 Email: [email protected]
