BIG DATA AND
INVESTIGATIVE ANALYTICS
The New Frontier
Table of Contents
Introduction
Chapter 1: What Is Investigative Analytics?
Chapter 2: Top Five Requirements for Investigative Analytics
Chapter 3: Case Studies – Investigative Analytics for Big Data
Summary
Introduction
Big Data and
Investigative Analytics
There’s no question that big data represents both a challenge and an opportunity. As big data volumes continue to explode, businesses will face challenges in quickly extracting rich insight from the mountain of machine-generated data streaming in from devices, sensors, smart meters, operational equipment and other sources.
Traditional analytic tools are often not up to the job of allowing users to interrogate highly diverse types of big data. As data connections and dependencies grow exponentially, it’s no longer possible to capture actionable information in a rigid set of KPIs and canned reports. To effectively manage big data, companies need to explore options for performing richer, real-time data analysis with far fewer resources.
One approach for doing that is investigative analytics, where users ask a series of quickly changing, iterative questions to figure out why something did or did not happen and how to optimize a particular outcome in the future. Compared to traditional analytics, which lack flexibility, investigative analytics yields insight into questions that haven’t even been dreamed up yet.
In this ebook, we will delve into the role of investigative analysis as it relates to big data, technology requirements for putting investigative analytics into
CHAPTER ONE
WHAT IS INVESTIGATIVE
ANALYTICS?
Emerging Data
Analytics Stack
Days of One-Size-Fits-All Are Gone
“Yesterday’s BI-ETL-EDW stack is wrong-sided for tomorrow’s needs, and quickly becoming irrelevant.”
- Gigamon
In today’s big data world, the one-size-fits-all approach no longer works. The data management stack has transformed into multiples, while the analytic stack has had to respond with individualized tools to get at the appropriate data and function, be it operational analytics, investigative analytics or predictive analytics. Big data has created pockets of specialization, where some databases are great for warehousing (e.g. Hadoop), while others excel at analytics.
Companies are also challenged by an evolving infrastructure and the proliferation of data centers, data warehouses and data marts. Not only is the infrastructure used to deliver information changing, the data coming in from a myriad of new devices is also changing dramatically – in terms of speed, type and volume of data.
With the overwhelming influx of machine-generated data begging to be analyzed, business users such as data scientists need real-time, interactive visualization of their data and flexible query creation. Today, with the right mix of solutions, businesses are able to analyze months’ worth of data with sub-second response time and realize extraordinary business value from performing deep analysis with queries created on the fly.
Big Data &
The Internet of Things
A jet airliner generates 20TB of diagnostic data per hour of flight. The average oil platform has 40,000 sensors, generating data 24/7. 80% of all households in Germany (32 million) will need to be equipped with smart meters by 2020, in accordance with the European Union market guidelines. These examples alone represent a staggering amount of data that must be captured, analyzed and acted upon.
Today’s Analytic Environment:
More “things” are now connected to the Internet than people, a phenomenon dubbed The Internet of Things. Fueled by machine-to-machine (M2M) data, the Internet of Things promises to make our lives easier and better, from more efficient energy delivery and consumption to mobile health innovations where doctors can monitor patients from afar. However, the resulting tidal wave of data streaming in from smart devices, sensors, monitors, meters, etc., is testing the capabilities of traditional database technologies. They simply can’t keep up or, when they are pushed to scale, become cost prohibitive.
Just ten years ago, the largest data warehouse in the world was 30TB; today, petabyte-sized data warehouses are common, and the volumes continue to grow. According to a 2012 Information Difference survey, most of the 209 customers surveyed said they were experiencing data growth of 20-50% annually.
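To put a 20-50% annual growth rate in perspective, the short Python sketch below compounds it over several years. The 30 TB starting size is borrowed from the warehouse figure above purely for illustration; the function name and the time horizons are assumptions, not survey data.

```python
def projected_volume_tb(start_tb: float, annual_growth: float, years: int) -> float:
    """Compound a starting data volume by a fixed annual growth rate."""
    return start_tb * (1.0 + annual_growth) ** years


if __name__ == "__main__":
    start_tb = 30.0  # illustrative starting point only
    for rate in (0.20, 0.50):  # the low and high ends of the surveyed range
        for years in (1, 5, 10):
            size = projected_volume_tb(start_tb, rate, years)
            print(f"{rate:.0%} growth over {years:>2} years: {size:,.1f} TB")
```

At 50% annual growth a data set more than doubles in two years (1.5 × 1.5 = 2.25); at 20% it takes roughly four years to double.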
Investigative
Analytics
Move from “What Happened?”...to “Why?”
Traditional analytic tools are often not up to the job of allowing users to interrogate the fast-moving, highly diverse types of high-volume big data. As data connections and dependencies grow exponentially, it’s no longer possible to capture actionable information in a rigid set of KPIs and canned reports. To effectively manage big data, companies should explore options for performing richer, real-time data analysis. One effective approach is investigative analytics.
In the recent TDWI ebook, Investigative Analytics: The New BI Frontier (June 2013), analyst Stephen Swoyer describes the bookends of the analytic continuum as traditional analytics and predictive analytics:
§ Traditional analytics puts questions into historical context, includes common BI activities (e.g. reports, dashboards, scorecards), and is mostly SQL-driven.
§ Predictive analytics, on the other hand, uses data mining or statistical algorithms to score data with models and forecasts.
Both of these approaches answer the question of “what” – What happened? What will happen?
With a more open-ended process, investigative analytics, in comparison, answers the “why”: Why did it happen?
Swoyer describes investigative analytics as “an open-ended activity that looks for patterns, anomalies, and clusters (i.e., for clues) that can be used to formulate questions or which can be correlated with events, conditions, or phenomena.” With investigative analytics, users can ask a series of quickly changing, iterative questions to figure out why something did or did not happen and how to optimize a particular outcome in the future, resulting in deeper and richer insight.
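The move from “what” to “why” can be sketched in a few lines of Python using the standard-library sqlite3 module. Everything here is hypothetical: the `calls` table, its columns and the sample rows are invented for illustration and do not come from any product or customer described in this ebook. The point is only that the second question is formed from the answer to the first.

```python
import sqlite3

# Hypothetical sample data: one row per call as (region, device_model, dropped).
CALLS = [
    ("north", "A1", 0), ("north", "A1", 0), ("north", "B2", 0), ("north", "B2", 1),
    ("south", "A1", 0), ("south", "A1", 0), ("south", "A1", 0),
    ("south", "B2", 1), ("south", "B2", 1), ("south", "B2", 1),
]

GROUPABLE = ("region", "device_model")


def load_calls(rows):
    """Create an in-memory database holding the sample call records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE calls (region TEXT, device_model TEXT, dropped INTEGER)")
    conn.executemany("INSERT INTO calls VALUES (?, ?, ?)", rows)
    return conn


def drop_rate_by(conn, column, region=None):
    """Return (value, drop_rate) pairs grouped by `column`, worst rate first."""
    if column not in GROUPABLE:
        raise ValueError(f"cannot group by {column!r}")
    sql = f"SELECT {column}, AVG(dropped) AS rate FROM calls"
    params = ()
    if region is not None:
        sql += " WHERE region = ?"
        params = (region,)
    sql += f" GROUP BY {column} ORDER BY rate DESC, {column}"
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = load_calls(CALLS)
    # Question 1 ("what happened?"): which region drops the most calls?
    by_region = drop_rate_by(conn, "region")
    print("Drop rate by region:", by_region)
    # Question 2 ("why?"): within the worst region, which device is responsible?
    worst_region = by_region[0][0]
    by_device = drop_rate_by(conn, "device_model", region=worst_region)
    print(f"Drop rate by device in {worst_region}:", by_device)
```

Neither query was planned ahead of time or backed by a prepared report; the drill-down is chosen only after the first result comes back, which is the iterative pattern described above.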
§ Operational Analytics: What happened? Automatic calculations during live transactions; alerts, KPIs, standard reports.
§ Predictive Analytics: What is going to happen?
§ Investigative Analytics: What has happened and why? Iterative, quickly changing queries (usually ad hoc).
CHAPTER TWO
TOP FIVE REQUIREMENTS FOR
INVESTIGATIVE ANALYTICS
Number 1: Low Touch
The extensive effort needed to fine-tune with indexing, partitioning and sharding can all get in the way of effective, efficient analytics. In a time of still-constrained budgets, data analysis needs to be affordable, as well as easy to use and implement, in order to justify the investment. This demands low-touch solutions that are optimized to deliver fast analysis of large volumes of data, with minimal hardware, administrative effort or customization needed to set up or change query and reporting parameters.
“The cool thing is that it can produce a new report –
which produces a new ad-hoc query – and I don’t
have to worry about performance because Infobright
takes care of all that for me.”
- Bob Hammond, CTO, Jumptap
Low-touch – minimal DBA requirements with a self-tuning system
Number 2: Ad-Hoc Performance
Frictionless Inquiry: Move from question to answer, quickly.
In fast-paced business and operational environments (smart grids are a great example), intelligence needs change quickly, so analytic tools can’t be constrained by data schemas that limit the number and type of queries that can be performed. Traditional data solutions like standard, row-based relational databases fall short here, as they were designed to handle single-record, structured data. Big data analysis requires a flexible solution that allows for unplanned, ad-hoc querying, and that doesn’t require a lot of tinkering or time-consuming manual configuration – such as indexing and managing data partitions – to create and change analytic queries.
Enter frictionless inquiry, where the path between question and answer is void of rigid structure: when users reach the “aha!” moment, they’ll have all the information needed to ask the next question or dig deeper into data, without
Number 3: Dynamic Scalability
Scalability: Inherently respond to increased load along any of these axes – query performance, number of users, number of records/size of data.
As demand for investigative analysis of big data increases, businesses need highly scalable solutions that can handle current and future data growth. At some point, traditional, hardware-based infrastructure will run out of headroom in terms of storage and processing capabilities. However, adding more data centers, servers and disk storage subsystems is expensive to buy and maintain, creating a situation where costs begin to outweigh the
Number 4: Load Speeds
Machine-generated data is loaded very, very quickly and often needs to be investigated within a short period of time – for example, a mobile carrier who wants to automate location-based smart phone offers based on incoming GPS data. If it takes too long to process and analyze this kind of data, the resulting intelligence will fail to be useful.
Businesses can’t afford for data to get stale. Solutions must be able to easily load, dynamically query, analyze and communicate information quickly enough to provide for whatever real-time query processing or alerting is required.
Within 60 seconds of data hitting Infobright customer
HasOffers’ tracking platform, customers are able to run
ad-hoc queries and get results that they can use to
make better business decisions in real-time.
Number 5: Compression
Economical storage of big data requires very efficient data compression within a network node, smart device or even a massive data center cluster.
Efficient compression lowers TCO, allowing for less storage capacity and minimized networking and hardware investments. In addition, efficient data compression increases the accuracy of query results by enabling tighter data sampling increments and longer historical data sets (e.g. accommodating situations like seasonality in retail). By capturing more data at finer granularity levels – e.g. one second vs. one hour – businesses will be able to identify patterns that exist at those levels (which may have previously been missed due to storage constraints).
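The storage trade-off between sampling interval, history length and compression can be worked through with simple arithmetic. The Python sketch below is illustrative only: the 16 bytes per reading and the 10:1 compression ratio are assumptions chosen for the example, not measured values, while the 40,000-sensor count echoes the oil platform figure cited earlier.

```python
SECONDS_PER_DAY = 86_400


def readings_per_day(interval_seconds: int) -> int:
    """Number of samples one sensor produces per day at a given interval."""
    return SECONDS_PER_DAY // interval_seconds


def stored_bytes(sensors: int, interval_seconds: int, days: int,
                 bytes_per_reading: int, compression_ratio: float) -> float:
    """Approximate on-disk size after compression."""
    raw = sensors * readings_per_day(interval_seconds) * days * bytes_per_reading
    return raw / compression_ratio


if __name__ == "__main__":
    sensors = 40_000        # sensor count cited earlier for an oil platform
    days = 90               # assumed history length
    bytes_per_reading = 16  # assumed record size
    ratio = 10.0            # assumed compression ratio

    for label, interval in (("one hour", 3_600), ("one second", 1)):
        raw = stored_bytes(sensors, interval, days, bytes_per_reading, 1.0)
        packed = stored_bytes(sensors, interval, days, bytes_per_reading, ratio)
        print(f"{label:>10}: raw {raw / 1e9:,.1f} GB, compressed {packed / 1e9:,.1f} GB")
```

Moving from hourly to per-second sampling multiplies the raw volume by 3,600, which is why the compression ratio largely determines how fine a granularity, and how long a history, is affordable to keep.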
CHAPTER THREE
BIG DATA, INVESTIGATIVE
ANALYTICS CASE STUDIES
Overview
Mavenir’s Converged Messaging Solution
Mavenir Systems provides innovative mobile
convergence solutions that enable mobile operators
to offer subscribers new and enhanced services and
applications.
Challenges
Mavenir
Mavenir’s goal was to drive more revenue by offering a solution to mobile operators that allows them to retrieve detailed SMS records for customer service and regulatory compliance. They needed an analytics solution to:
§ Quickly load and store large volumes of detailed data
   § Capacity in excess of 3 billion messages per day
   § Peak periods like Chinese New Year can generate over 70 million messages in an hour
§ Make that data available for analysis within minutes
§ Store 90 days’ worth of data with a small hardware footprint
§ Handle projected 70% growth rate in mobile messaging
§ Have low TCO including low storage and license costs
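The requirements above translate into per-second load rates and retention totals with a little arithmetic. The Python sketch below uses only the figures from the list; the constant and function names are illustrative, and treating the 70% growth figure as a one-step multiplier on the peak hour is an assumption made for the example.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400

DAILY_CAPACITY = 3_000_000_000   # messages per day (capacity requirement)
PEAK_HOUR_MESSAGES = 70_000_000  # messages in a peak hour
RETENTION_DAYS = 90
GROWTH = 0.70                    # projected growth, applied once (assumption)


def per_second(count: float, period_seconds: int) -> float:
    """Average rate implied by `count` events spread over `period_seconds`."""
    return count / period_seconds


if __name__ == "__main__":
    sustained = per_second(DAILY_CAPACITY, SECONDS_PER_DAY)
    peak = per_second(PEAK_HOUR_MESSAGES, SECONDS_PER_HOUR)
    peak_after_growth = peak * (1 + GROWTH)
    retained = DAILY_CAPACITY * RETENTION_DAYS

    print(f"Sustained rate at daily capacity: {sustained:,.0f} messages/s")
    print(f"Peak-hour rate:                   {peak:,.0f} messages/s")
    print(f"Peak-hour rate after growth:      {peak_after_growth:,.0f} messages/s")
    print(f"Rows retained at daily capacity:  {retained:,} over {RETENTION_DAYS} days")
```

Because the 3 billion figure is a capacity target rather than observed traffic, the sustained rate it implies is higher than the rate of the quoted peak hour; the two numbers describe headroom and load respectively.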
“Data storage is a big issue for mobile operators,
and it’s only going to get more challenging as the
use of messaging continues to explode.”
Solution: Infobright Enterprise Edition (IEE)
Mavenir
Data Compression & History
• Keep 90 days of data stored in a smaller hardware footprint due to drastic compression
Getting Data in and Out Quickly
• 20k records per second at peak capacity in initial release
• Current iteration is 100k records per peak
• Projected 70% growth plan
• Load from event/log files every 5 minutes, making data available in near-real time
Reducing Capex & Opex
• No indexes, data partitioning or manual tuning
• No need for DBA resources to manage the database on an ongoing basis
• Low licensing costs
• TCO only 20% of the cost of competitive solutions
Mavenir has won major wireless carriers such as
MetroPCS, Telstra and Viettel based on this solution.
Overview
LiveRail is the leading publisher monetization platform
for video delivering over three billion impressions –
25% of all online video ads – each month.
LiveRail
LiveRail is a multi-platform, real-time video advertising ecosystem providing:
§ Real-time bidding
§ Yield optimization
§ Ad serving analytics
Challenges
LiveRail
With a growing roster of customers – including PBS, MLB.com and CBS Interactive – LiveRail was faced with managing increasingly large data volumes and a need to provide clients with near real-time access to this information for reporting and ad-hoc analysis.
§ 10 billion monthly video ad opportunities
§ 2 billion data points each day
§ Dozens of engagement metrics including percentages
   § Viewed/completed
   § Pause/resume
   § Muting
Publishers needed the ability to drill down with near real-time access to determine optimal video length, as well as determine whether there is a correlation between completion rates and ad frequency.
“Infobright gives our customers the ability to do
fast, ad-hoc analysis against the extensive video
advertising data.”
Solution: Infobright IEE + Hadoop
LiveRail recognized with
Computerworld Data+ Award
LiveRail
Data Compression & History
• 25X space reduction, or
• 25X more history online
Analyzing Data Quickly
• 20,000 ad-hoc/real-time reports per day run by customers
• Reports that used to take two to three minutes now take seconds
Reducing Capex & Opex
• No indexing or tuning required
• Fewer servers or storage disks required
• Lower licensing costs than alternatives
• Low-touch, simple
In Summary
Big Data and
Investigative Analytics
Big data demands a big change in thinking. Companies that maintain their status quo of analytics technologies and processes will find themselves spending progressively more money on servers, storage and DBAs – an approach that’s difficult to sustain and still presents the risk of not getting the needed answers.
Gone are the days of simply seeking the “what” from an analytics solution. Today, companies can, and need to, know why. Investigative analytics are the key to revealing patterns of behavior or insights that companies can act on immediately, and either capitalize on or prevent in the future.
To extract rich, real-time insight from the onslaught of machine-generated data, companies require a technology foundation characterized by five requirements:
§ Low-touch administration
§ Flexible, ad-hoc querying
§ Dynamic scalability
§ Fast, reliable performance
§ Efficient compression
When there’s more and more data to mine, investigative analytics cut through the clutter with precision, ensuring accurate, immediate results, even as machine-generated data grows to the petabyte scale… and beyond. By maximizing insight into data, companies can make better decisions at the speed of business, thereby reducing costs, identifying new revenue streams, and
HAVE QUESTIONS?
See how
JDSU
and others
are using Infobright to meet their
investigative analytics needs and
drive business value.
Find us on the web: www.infobright.com
Contact us: 877-596-2483 / [email protected]