BIG DATA AT SYSTEMS IN MOTION:
EMPHASIZING THE USE CASE-BASED APPROACH
This white paper will discuss how “Big Data” (data that is too large to be managed by traditional techniques) is changing the practice of Business Intelligence and Analytics (BI). It will look at what organizations need to ask when adapting their BI strategy to accommodate exponentially growing data volumes. It will touch on current techniques of storing, managing, and analyzing data, and the limitations of those techniques. The paper will conclude that a use case-based approach is the most meaningful way ahead in a Big Data environment.
Table of Contents
1. The Challenge and Promise of Too Much Data
1.1 Analyzing large datasets
1.2 Volume, Velocity, Variety
1.3 Challenges and Transformation
2. The Business Case for Big Data
2.1 Use Cases and Benefits
2.2 Why Big Data?
3. Current Techniques: Advantages and Disadvantages
3.1 Hadoop
3.2 Vertica
3.3 Amazon Redshift
4. The Use Case Driven Approach to Big Data
4.1 Descriptive, Predictive, and Prescriptive Analytics
4.2 The Engagement Process at Systems in Motion
5. Solutions and Packages
5.1 Case Studies
5.2 SIM’s Information Management Solution Architecture
5.3 Packages and services
References
1. The Challenge and Promise of Too Much Data
For organizations that use Business Intelligence and Analytics (BI) to derive actionable insights from the data available to them, traditional information-handling mechanisms fall short when it comes to massive, exponentially growing volumes of heterogeneous data. A typical large enterprise accesses public and private data from a variety of sources. The data can be structured, unstructured, or semi-structured; it typically includes information gathered from customers, information about customer behavior, and data from social media, mobile applications, and community support forums. The format and structure of the data vary widely: multi-language text, voice, images, video, semi-structured logs, and contextual information about the data.
1.1 Analyzing large datasets
According to an IDC study, less than 1% of the data in the world is analyzed; the digital universe will amount to 40 zettabytes (40 billion terabytes) in 2020, a 50-fold increase from 2010[1].
But the volume and heterogeneity of the available data are not just challenges; they are opportunities. Real-world correlations exist between seemingly unrelated information, so the analysis of large, combined datasets can provide far deeper and more holistic insights compared to the analysis of individual, smaller datasets. Enterprises now have the ability to gather data from outside their firewalls (social, mobile, web), so they have access to rich data sets that can be mined for correlations to data generated within their enterprise systems (ERP, CRM, SCM, etc.). For example:
- Patterns in astronomical data can lead to discoveries
- Patterns in buyer behavior can lead to more targeted product recommendations and higher ad click-throughs
- Correlations between patient health and their ethnicity may lead to a new direction for research
- Social media usage patterns could lead to the development of new and more engaging applications
1.2 Volume, Velocity, Variety
The term “Big Data” itself needs to be clarified. Often quoted in this context is Matt Aslett of 451 Research, who wrote in October 2012[2]: “While the term Big Data might be almost universally unloved, it is also now almost universally understood to refer to the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies. Those limitations have come to be defined by a combination of words… volume, variety, and velocity…”
- Volume refers to the fact that there is too much data, driven primarily by digital channels, to manage with traditional tools.
- Velocity refers to the rate at which data is being produced, which is too high for the capability of traditional tools, whether the data arrives in batches, in near-real-time, or in real-time. Machine logs on active networks and real-time user events from popular mobile apps are high-velocity data generators.
- Variety refers to the fact that data may be structured, semi-structured or unstructured, or of a format/structure that does not lend itself to traditional modeling and analysis.
1.3 Challenges and Transformation
Technical challenges associated with large datasets include capture, filtering, storage, retrieval, transfer, analysis, and visualization. Traditional tools and platforms are lacking in terms of speed and storage. Existing methodologies and systems struggle to keep pace with growing volumes, and also with the changing nature of the value that organizations want to derive from analysis.
The Big Data “problem” is compounded by the different vectors along which customers, the business, and IT departments are moving:
- Customers are more sophisticated and brand-aware. With their increasing use of digital media to rate and purchase products, there is more useful data that can be captured, and also more demand on organizations to capture it.
- Businesses are increasingly making data-driven decisions, which means they need more sophisticated analytics. Given the decreasing shelf-life of data, they prefer self-serve methods to get at the results of the analysis very quickly.
- IT departments have stronger cost-cutting mandates, and must show value to the business faster and faster, while analyzing customer data that has (as above) a shorter and shorter shelf-life.
Organizations, therefore, are largely out of sync with a networked, digital world. They can align with it by changing how they handle data, what data they filter out, how they analyze it, and what they apply it to.
Big Data impacts information management strategies in different ways for different information-driven systems. In marketing, untargeted outreach evolves into targeted, personalized message delivery. In campaign systems, standalone CRM systems with slow market responsiveness give way to real-time information management, integrated across channels. Even pricing, which has been driven by usage data across the general population, can be personalized according to shopping and browsing patterns.
BI, in fact, is moving from static reporting to real-time predictions (and hence, recommendations) from smart systems.
2. The Business Case for Big Data
As early as May 2011, a McKinsey Global Institute report said Big Data would be the “next frontier for innovation, competition, and productivity.”[3] In 2013, Big Data will drive US$34 billion in IT spending.[4]
The business case for Big Data is, now as in 2011, indicated in the McKinsey Global Institute report. It listed five broad ways in which Big Data analyses can create value[3], including:
- Unlocking of value by making information more transparent and usable at a much higher frequency
- Organizational performance boost via collection of information on a variety of customer interactions all across the product lifecycle
- Development of precisely tailored products and services
- Improved decision-making via sophisticated analytics
2.1 Use Cases and Benefits
Some business use cases and business benefits are summarized below.
| Domain | Use Cases | Business Benefits |
| --- | --- | --- |
| Retail (web, mobile) | Targeted real-time offers; dynamic pricing | More click-throughs and conversions |
| Retail | Customer lifetime value prediction | Creation of targeted channel campaigns and improved campaign RoI |
| Digital media | Behavioral analytics | Enhanced customer engagement leading to higher monetization |
| Networks | Data and user security | Proactively identify and block disruptions and threats |
| Telecom | CDR analysis | Reduce churn and optimize pricing |
| Healthcare | Behavioral analytics for the pharmaceutical and insurance industries | Contextual pricing and proactive management of patient health |
2.2 Why Big Data?
While it is true that every organization has, or can access, data that it has not tapped into, not every organization needs Big Data, and no two companies will benefit from Big Data insights to the same degree. Here are some questions every business needs to ask:
1. Do we need Big Data? Many organizations buy into the idea that they can benefit from Big Data analysis, and then prepare to throw resources at the problem. The prevalent idea is that if there is a large volume of potentially useful information, and if the tools for analyzing it exist, then something of value will emerge when the two are put together.
Some datasets can be analyzed with traditional tools. Depending on the diversity of the data (Variety) and how fast it arrives and/or changes (Velocity), an organization might or might not need to restructure the way it manages its data.
In some cases, before looking at channeling new data streams, it might be more productive to analyze whether existing data can provide the desired business insights. On the other hand, one round of analysis might uncover the need for new data channels. Different tools and systems serve the purpose depending on the rate of inflow of information, as discussed in §1.
2. What is the anticipated RoI? Big Data projects are not cheap. An organization cannot assume an RoI based on the outcomes of similar projects. An RoI assessment requires competent data scientists, who can build a comprehensive understanding of the business environment, existing and potential data sources, and possible use cases.
3. What is the best use case? How best would your organization leverage Big Data, and subsequently, what data might you try to get access to? As an example, one organization in the retail industry might benefit most from targeted point-of-sale advertising; another might benefit from dynamic pricing.
What is needed is an evaluation of available tools, and an assessment of current and potential data sources, along with business trends within the industry in question.
What all of the above point to is the idea that data and suitable tools are not by themselves sufficient to make a Big Data business case. The project should ideally begin with an analysis of unmet strategic or tactical business needs that are not served well, or at all, by the current Information Management architecture, and proceed along a use case-based approach. Such an approach will ensure the necessary business buy-in towards a highly desired, business-oriented outcome rather than “incremental” bits-and-bytes gains.
3. Current Techniques: Advantages and Disadvantages
Different platforms have different capabilities and limitations in terms of their ability to handle massive amounts of structured, semi-structured, and unstructured data.
3.1 Hadoop
Hadoop has become the de facto standard for storing, processing and analyzing large volumes of data up to the petabyte scale. The Hadoop Distributed File System (HDFS) and the set of tools and technologies used to process data from it are depicted below:
[Figure: the Hadoop ecosystem. Layers: top-level interfaces, top-level abstractions, distributed data processing, self-healing clustered storage system. Components shown: dashboarding and reporting; Hue; Impala; workflow (Azkaban, Oozie); analysis (Pig, Scalding, Mahout, Hive); MapReduce; HBase; HDFS; ZooKeeper; data pipeline (DistCp, Sqoop, Flume, Scribe).]
Hadoop is an open source framework that traces its origins to the Google File System. MapReduce is a framework for processing very large datasets on a distributed system, for certain kinds of analysis. The many advantages of Hadoop include:
- Cost-effectiveness, because it can work on inexpensive servers that store as well as process data
- Almost infinite scalability through its massively parallel processing capabilities
- Capability to store highly heterogeneous data from very disparate systems, and use schema-on-read methods to process data on demand
In the BI and Big Data contexts, there are downsides to using Hadoop alone:
- Building real-time applications and generating real-time responses to queries are difficult.
- There is a focus on staging and storing data before other operations. Datasets must always be processed using MapReduce to get insights and further actions.
- Hadoop works in batch mode, so processing jobs need to be run over the entire dataset when new data is added. This means time-to-analyze keeps increasing. That, in turn, makes Hadoop by itself unsuitable for use cases where new data comes in at regular intervals and where the business will benefit from real-time analysis of such data.
In sum, Hadoop is a powerful data analysis framework, but it is not the tool of choice for all use cases.
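The MapReduce model described in this section can be illustrated without a cluster. The following is a minimal, single-process Python sketch of the map, shuffle, and reduce phases applied to a word count. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample log lines are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records: Iterable[str]) -> Dict[str, int]:
    """Run the whole batch job over the entire input."""
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical sample input standing in for log lines held in distributed storage.
logs = ["login ok", "login failed", "logout ok"]
counts = word_count(logs)
print(counts)
```

In this sketch, appending one line to `logs` means calling `word_count` again over the full list, which is a small-scale picture of the batch-mode limitation: the job has no notion of processing only the new record.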
3.2 Vertica
A typical columnar database architecture is depicted below. The Vertica Analytics Platform, as one example, uses a columnar database design.
[Figure: typical columnar database architecture. Labels: SQL and SQL-MapReduce unified interface; high-volume, fast querying; WLM (dynamic workload manager); massively parallel data stores (row store, column store).]
Cited advantages of Vertica, which are also the general advantages of columnar databases, include:
- Fast real-time queries. Access to immediate answers allows ad hoc analysis of, and insights from, time-sensitive data.
- The data compression system used in Vertica means lower cost of storage.
- An analytics library is built into the database, which makes it possible to perform a variety of operations on data without the intermediate step of extraction.
Vertica is not suitable in some cases, for the following reasons:
- Data updates are not supported. If previously loaded data needs to be updated, the entire dataset has to be loaded again, which has a huge operational impact on real-time analytics systems.
- The system is optimized for read speed. When reads and writes happen in parallel, there is a performance slowdown.
- As datasets become larger and more complex, and as queries become more diverse, the data compression benefit diminishes.
3.3 Amazon Redshift
A cloud-based data warehouse service, Amazon Redshift can scale to the petabyte range. The system architecture is similar to that of Google BigQuery, another web service that allows analysis of very large datasets:
[Figure: Amazon Redshift architecture. Client applications connect over JDBC/ODBC to a data warehouse cluster made up of a leader node and compute nodes 1 to n, each divided into node slices.]
Like Vertica, Redshift uses a columnar database, compresses data and is optimized for read speeds. Apart from dataset size, advantages of Amazon Redshift for Big Data analysis include:
- Fast real-time queries are supported, as in Vertica. Immediate answers to queries allow ad hoc analysis of time-sensitive data.
- Redshift, unlike Hadoop, supports SQL functionality.
- Data and querying can be managed over the cloud. Enterprises do not need a new infrastructure for analytics.
Drawbacks to Redshift as an analytics platform include the fact that it does not provide interoperability between SQL and other languages. Also, it has the inherent drawbacks of a cloud service, namely network latency and security; to overcome these, the BI applications would have to be on the same cloud set-up where the data resides.
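Vertica and Redshift are both described in this section as columnar, compressed, and optimized for reads. The following Python sketch gives a rough intuition for why that combination helps analytical queries. It is an illustration only, not the storage format or encoding of either product: the sample rows, the column names, and the helper functions (`to_columns`, `run_length_encode`, `run_length_decode`) are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Hypothetical sales records in a row-oriented layout (one tuple per row).
rows: List[Tuple[str, str, int]] = [
    ("east", "widget", 10),
    ("east", "widget", 12),
    ("east", "gadget", 7),
    ("west", "gadget", 5),
    ("west", "gadget", 9),
]


def to_columns(table: List[Tuple[str, str, int]]) -> Dict[str, list]:
    """Pivot rows into one list per column (the columnar layout)."""
    region, product, units = zip(*table)
    return {"region": list(region), "product": list(product), "units": list(units)}


def run_length_encode(values: list) -> List[Tuple[object, int]]:
    """Compress consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(runs: List[Tuple[object, int]]) -> list:
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, length in runs for _ in range(length)]


columns = to_columns(rows)

# An aggregate over one attribute reads a single column, not every row.
total_units = sum(columns["units"])

# A low-cardinality column with long runs compresses well.
encoded_region = run_length_encode(columns["region"])
print(total_units, encoded_region)
```

The sketch also hints at the trade-offs listed above: changing a single value in the middle of an encoded column means rewriting its runs, and a column with few repeated neighbours (such as `units` here) gains little from this kind of compression.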
4. The Use Case Driven Approach to Big Data
As mentioned in §2, some companies do not know precisely what they want from Big Data analytics. On the other hand, companies that have a business goal want options. They may want real-time answers from a stream of information, or they might want one batch of data processed every few days. Similarly, they might or might not know the best use case, knowing only that there is valuable data to be tapped into.
This points to the idea of the use case-based approach; the use case for Big Data analytics determines the choice of the Information Management architecture to support these Big Data initiatives. Systems in Motion works with customers at various stages of maturity of their BI programs.
These stages span the spectrum from descriptive to predictive to prescriptive.
4.1 Descriptive, Predictive, and Prescriptive Analytics
In descriptive analytics, we perform real-time processing of “what happened”; this manifests as business reporting and data warehousing. In the predictive phase, we extrapolate and forecast, to answer the question of “what will happen” (data and text mining). For fully data-driven business decisions, prescriptive analytics combines insights from the descriptive and predictive phases to answer the questions of “what to do” and “why to do it.” With descriptive analytics, we design and develop the data warehouse / operational data source, develop reports, and conduct data migration and integration.
With prescriptive analytics, we conduct a role and outcome-specific analysis, design a predictive analytics framework, leverage external and unstructured data, and deliver Big Data implementations in the cloud. This analytics spectrum, and the spectrum of SIM’s offerings, is outlined in the table below:
| | Questions | Enablers | Outcomes |
| --- | --- | --- | --- |
| Descriptive | What happened? What is happening? | Business reporting, dashboards, scorecards, data warehousing | Well-defined business problems and opportunities |
| Predictive | What will happen? Why will it happen? | Data mining, text mining, web mining, media mining, forecasting | Accurate projections of future states and conditions |
| Prescriptive | What should I do? Why should I do it? | Optimization, simulation, decision modeling, expert systems | Best possible business decisions and transactions |
In all cases, SIM helps customers capture, collect, store, and process all data that relates to the enterprise – whether outside-in or inside-out.
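The descriptive, predictive, and prescriptive stages described above can be made concrete with a toy example. The Python sketch below summarizes a short sales history (descriptive), fits a least-squares trend line to forecast the next period (predictive), and picks a stock level that minimizes a simple cost (prescriptive). It is an illustration under stated assumptions, not SIM's methodology: the sales figures, the candidate stock levels, the per-unit costs, and the function names are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence

# Hypothetical weekly unit sales.
weekly_sales: List[float] = [100.0, 110.0, 120.0, 130.0]


def describe(series: Sequence[float]) -> Dict[str, float]:
    """Descriptive: what happened?"""
    return {"total": sum(series), "average": mean(series)}


def forecast_next(series: Sequence[float]) -> float:
    """Predictive: extend a least-squares trend line by one period."""
    n = len(series)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(series)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


def prescribe(forecast: float, options: Sequence[float],
              overstock_cost: float, understock_cost: float) -> float:
    """Prescriptive: choose the stock level with the lowest assumed cost."""
    def cost(stock: float) -> float:
        if stock >= forecast:
            return (stock - forecast) * overstock_cost
        return (forecast - stock) * understock_cost
    return min(options, key=cost)


summary = describe(weekly_sales)
next_week = forecast_next(weekly_sales)
# Assumed costs: running out of stock is three times as costly as overstocking.
decision = prescribe(next_week, options=[130.0, 150.0],
                     overstock_cost=1.0, understock_cost=3.0)
print(summary, next_week, decision)
```

Each stage consumes the output of the one before it, which is the sense in which prescriptive analytics combines insights from the descriptive and predictive phases.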
4.2 The Engagement Process at Systems in Motion
SIM’s iterative engagement process for a use case-driven Big Data environment consists of four phases, as depicted below:
[Figure: the four-phase engagement process leading to an effective Big Data analytics environment]
1. Analyze (Big Data Workshop): uses and sources of data; information value mapping; use case definition; analysis of current environment; data feeds analysis (variety, volume, velocity)
2. Plan (Planning): architecture (data architecture, analytics); data modeling (schema-on-read, events, etc.); application design (analytics, closed loop APIs); iteration planning; storyboarding; iterative QA strategy
3. Develop (Iterative Development): Agile Scrum development; prototyping; cloud deployments (private, public cloud); multiple Scrum teams
4. Manage (Ongoing Management): ongoing support; additional functionality; new development; application management
5. Solutions and Packages
We discussed, in §2, some Big Data use cases that SIM has worked on. Here are a few solutions we have delivered.
5.1 Case Studies
Analytics apps for retail sales
A big box retailer wanted a connected business strategy (online, mobile and social). They needed to analyze cross-channel sales data; their high latency in identifying sales trends and patterns was a challenge. SIM helped them track real-time sales trends on mobile devices; our Big Data analytics engine helped perform drill-down trends analysis and incentive planning.
Audience analytics
A leader in enterprise gamification (the application of game design to non-game environments) realized that they needed data-driven insights to optimize audience engagement. They lacked visibility into their RoI metrics, and they needed to optimize campaign spend; which campaigns drove the most engagement was not clear. SIM’s analytics engine uncovered patterns that helped increase user adoption of the brands they worked for, along with brand loyalty. The customer was able to create promotions and targeted campaigns to incentivize an engaged audience.
Realtime Campaign Analytics
A mobile entertainment application provider wanted real-time analysis of digital campaigns being run across multiple referrers, to measure the ROI on user acquisition spend. SIM’s real-time Big Data cloud platform was used to instrument the app using client- and server-side SDKs and to provide an interactive tool for marketing users to analyze campaign effectiveness. The customer doubled down on spend on their most effective channels in subsequent campaigns and also used the information to identify the most revenue-bearing user segments and cohorts for targeted follow-on campaigns.
5.2 SIM’s Information Management Solution Architecture
A sample solution architecture is depicted below:
[Figure: sample solution architecture, a use case-driven Big Data environment. Data sources shown: products, marketing, sales, R&D, APIs, mobile devices, field sensors, satellite data, external databases. Platform labels: Big Data Cloud, Hadoop DFS, JDBC, ODBC, SQL, MapReduce, purpose-built analytical applications.]
- High volume, velocity, variety data traffic: high-velocity, diverse data can be collected from a variety of sources.
- Use case-driven Big Data storage architecture: the Big Data storage architecture is determined by the use case.
- Analytics engine for high-volume, fast querying: the analytics engine performs high-volume, fast queries to uncover patterns.
- Use case-specific custom applications: custom analytics applications deliver visualizations based on the use case.
5.3 Packages and services
Systems in Motion offers different information management packages for organizations working with Big Data. The packages span identification of use cases, a showcasing of benefits of those use cases, platform modernization to address the Big Data challenge, Big Data analytics, and prediction derived from Big Data analytics. These are summarized in the table below:
| Package Name | Description |
| --- | --- |
| Big Data Discovery | 1-3 day workshop to inform, educate, and identify early business use cases for Big Data |
| Big Data Pilots | 60-90 day pilots to showcase Big Data-driven benefits for identified use cases |
| EDW Modernization | Modernization of existing EDW, IM infrastructures to address real-time analytics needs; leveraging modern MPP platforms to reduce storage/infrastructure spend |
| Big Data Analytics | Architect and deploy a cloud-based Big Data Analytics platform and use case-specific solution, from data feeds to end visualization layer |
| Big Data Prediction | Data mining and prediction using the Big Data platform; batch data processing-focused |
For more information on our Big Data packages and services, please visit www.systemsinmotion.com.
References
1. EMC press release, December 2012. New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World’s Data is Analyzed; Less Than 20% is Protected. http://www.emc.com/about/news/press/2012/20121211-01.htm
2. Matt Aslett, research director at 451 Research, October 2012. Research Director Reflects on New Big Data Book. http://www.ibmbigdatahub.com/blog/research-director-reflects-new-big-data-book#sthash.pz0ngU2i.dpuf
3. McKinsey Global Institute Report, May 2011. Big data: The next frontier for innovation, competition, and productivity. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
4. Gartner press release, October 2012. Gartner Says Big Data Will Drive $28 Billion of IT Spending in 2012. http://www.gartner.com/newsroom/id/2200815
About Systems In Motion
Systems in Motion was founded with a vision of challenging the existing notions and practices of IT consulting and outsourcing. Our agile, integrated and business-focused approach allows us to deliver game-changing ROI through the deployment of cutting-edge technology solutions using onshore delivery centers and global innovation hubs.
GLOBAL INNOVATION HUB
Systems In California, 7707 Gateway Plaza, Suite 100, Newark, CA 94560
LEAN SERVICE DELIVERY CENTER
Systems In Michigan, 1136 Oak Valley Drive, Ann Arbor, MI 48108
Telephone: (415) 992-7277 Email: [email protected]