Analytics and the Context Multiplier

(1)

The

Deal

(2)

Analytics and the Context Multiplier

Raw Data, Feature extraction metadata, Domain linkages, Full contextual analytics

Location risk, Occupational risk, Dietary risk, Family history, Actuarial data, Government statistics, Epidemic data, Chemical exposure, Personal financial situation

(3)

(4)

IBM Watson

(5)

Watson Engagement Advisor

What it does

• Automates customer interaction to increase customer engagement in sales and service
• Transforms customer engagement by knowing, engaging and empowering clients
• Develops customer relationships through a transformative user experience

How it does it

• Provides answers, not links and webpages
• Answers with evidence, not guesses
• Not restricted to a predefined question-answer set
• Learns from every interaction

(6)

Watson Discovery Advisor

Answer previously unanswerable research problems


Watson can read these medical records in six seconds!

Gain Awareness

Harness all available scientific knowledge in the hunt for a breakthrough and identify better leads for any researcher to pursue.

Understand Relationships

Enable every scientist to identify new relationships and explore never-before-considered options that lead to truly differentiating scientific innovations.

Clarify Ideas

(7)

Big Data Definition

• Volume: Data at Scale
• Variety: Data in Many Forms
• Velocity: Data in Motion
• Veracity: Data Uncertainty

(8)

Big Data

MYTH:

Big Data is only about large datasets; we should just say larger than what you have

MYTH:

Big Data means Hadoop... that’s it

MYTH:

Big Data means ‘rip-and-replace’, death to the RDBMS and no governance

MYTH:

NoSQL means no SQL, never, utter hatred for SQL

MYTH:

Big Data means unstructured data and only for sentiment

without analytics

(9)

(10)

An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics…

1 BILLION lines of code

(11)

Applications for Big Data Analytics

Homeland Security

Finance

Smarter Healthcare

Multi-channel sales

Telecom

Manufacturing

Traffic Control

Trading Analytics

Fraud and Risk

Log Analysis

Search Quality

(12)

Automatic Temporal and Spatially Enriched Data

(13)

Use Cases: Law Enforcement and Security

• Video surveillance, wire taps, communications, call records, etc.
• Millions of messages per second with low density of critical data
• Identify patterns and relationships among vast information sources

“The US Government has been working with IBM Research since 2003 on a radical new approach to data analysis that enables high speed, scalable and complex analytics of heterogeneous data streams in motion. The project has been so successful that US Government will deploy additional installations to enable other agencies to achieve greater success in various future

(14)

Velocity – Creating Actionable Intelligence in Real Time

(15)

Volume - The Government Industry Data Challenge

IBM Multimedia Analysis & Retrieval

Automatic Semantic Classification of Images and Video

Content based feature extraction & Search

Gigapixel Panorama Photography

http://www.gigapixel.com/image/gigapan-canucks-g7.html

(16)

Predictive Analytics in a Neonatal ICU

• Real-time analytics and correlations on physiological data streams
– Blood pressure, Temperature, EKG, Blood oxygen saturation, etc.
• Early detection of the onset of potentially life-threatening conditions
– Up to 24 hours earlier than current medical practices
– Early intervention leads to lower patient morbidity and better long term outcomes
• Technology also enables

(17)

(18)

Warehouse Modernization Has Two Themes

Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis, test against selected data
• Diagram labels: Analyzed Information, Question, Data, Answer, Hypothesis

Big Data Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way
• Explore all data, identify correlations
• Diagram labels: Data, Correlation, All Information, Exploration, Actionable Insight

(19)

Analyze all

TRADITIONAL APPROACH BIG DATA APPROACH

(20)

Analyze as is

Carefully cleanse information before any analysis

(21)

Find correlation

Start with hypothesis and test against selected data

(22)

Analyze in motion

Traditional approach: Analyze data after it’s been processed and landed in a warehouse or mart

Big data approach: Analyze data in motion as it’s generated, in real-time

Diagram labels: Data, Repository, Analysis, Insight
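The contrast between analyzing landed data and analyzing data in motion can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class name MotionVsRest, the RunningAverage helper, and the sample readings are all hypothetical and are not part of any IBM product API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and sample data are illustrative only.
class MotionVsRest {

    // "At rest": every reading is landed in a repository first,
    // and the analysis runs afterwards over the stored batch.
    static double averageAtRest(List<Double> repository) {
        double sum = 0.0;
        for (double reading : repository) {
            sum += reading;
        }
        return repository.isEmpty() ? 0.0 : sum / repository.size();
    }

    // "In motion": only a running count and sum are kept, and the
    // result is updated as each reading arrives.
    static final class RunningAverage {
        private long count = 0;
        private double sum = 0.0;

        double accept(double reading) {
            count++;
            sum += reading;
            return sum / count; // current insight, available immediately
        }
    }

    public static void main(String[] args) {
        double[] readings = {10.0, 20.0, 30.0};
        List<Double> repository = new ArrayList<>();
        RunningAverage inMotion = new RunningAverage();

        for (double reading : readings) {
            repository.add(reading); // land the data for later analysis
            System.out.println("in motion: " + inMotion.accept(reading));
        }
        System.out.println("at rest: " + averageAtRest(repository));
    }
}
```

Both paths reach the same final average; the difference is when the answer becomes available and how much data has to be stored to get it.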

(23)

Complementary Analytics


Traditional Approach

Structured, analytical, logical

New Approach

Creative, holistic thought, intuition

(24)

• Different requirements require different tools
– Document stores
– Key/value stores
– BigTable implementations (columnar)
– Graph databases
• Values (there are exceptions)
– Huge data volumes, easy scale-out
– Developers code data integrity into the application if it’s needed
– Relaxed (eventual) consistency
– Semi-structured data
– Schema on read
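"Schema on read" can be made concrete with a small, hedged Java sketch: records are stored exactly as written, with whatever fields they happen to carry, and a structure is imposed only when a query reads them. The class SchemaOnReadSketch, its write and read methods, and the sample records are hypothetical and do not correspond to any particular NoSQL product.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of schema on read; not a real NoSQL client API.
class SchemaOnReadSketch {
    // Semi-structured records: each one may carry a different set of fields.
    private final List<Map<String, String>> records = new ArrayList<>();

    // Write path: no table definition is checked, the record is stored as-is.
    void write(String... keyValuePairs) {
        Map<String, String> record = new HashMap<>();
        for (int i = 0; i + 1 < keyValuePairs.length; i += 2) {
            record.put(keyValuePairs[i], keyValuePairs[i + 1]);
        }
        records.add(record);
    }

    // Read path: the caller decides which field it expects and what to
    // substitute when a record does not have it.
    List<String> read(String field, String defaultValue) {
        List<String> column = new ArrayList<>();
        for (Map<String, String> record : records) {
            column.add(record.getOrDefault(field, defaultValue));
        }
        return column;
    }

    public static void main(String[] args) {
        SchemaOnReadSketch store = new SchemaOnReadSketch();
        store.write("name", "Ann", "email", "ann@example.com");
        store.write("name", "Bob", "phone", "555-0100"); // no email field
        System.out.println(store.read("email", "unknown"));
    }
}
```

The trade-off matches the bullet about integrity: because nothing is validated on write, any consistency rules have to live in the code that reads or writes the data.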

(25)

Why NoSQL?

Pressures on Traditional Relational Stores

• Technical change / Different forms of data
• Regulatory pressures (SLAs, Archive, Governance)

(26)

Database Landscape Overview

SQL
Description:
• Relational SQL (RDBMS)
• Operational and Analytic
• E.g. DB2, Oracle, Microsoft, Teradata, etc.
Typical Infrastructure:
• Proprietary database storage
• Unix, Linux, Windows
• SMP, MPP, SAN, Integrated Systems, Appliances

noSQL database
Description:
• noSQL database
• Mainly operational
• E.g. Cloudant, MongoDB, Redis, Riak, Aerospike, Amazon DynamoDB, etc.
Typical Infrastructure:
• Proprietary database storage
• Linux
• Commodity clusters
• Local attach disks, NAS
• Cloud
• Mobile

Hadoop
Description:
• SQL on Hadoop (mainly analytic)
• HBase (evolving OLTP, ACID)
• E.g. BigInsights, Cloudera, Hortonworks, MapR, Pivotal
• HP Labs Trafodion
Typical Infrastructure:
• HDFS files
• Linux
• Commodity clusters
• Local attach disks

(27)

Different Categories of noSQL Databases

Document (63% revenue share*)
Use this when:
• Schema is not well defined
• Schema is very likely to change, need to maintain flexibility
• Commonly described with JSON
Application examples: Frequently changing product catalogs
Vendors: Cloudant**, MongoDB, Couchbase, MarkLogic

Key-Value (24% revenue share*)
Use this when:
• Your data is not highly related
• All you need is basic Create, Read, Update, Delete (CRUD)
• Rapid scaling for simple data collections
Application examples: User sessions, Shopping cart
Vendors: Redis, Aerospike, AWS (DynamoDB), Basho Technologies (Riak)

BigTable / Columnar (9% revenue share*)
Use this when:
• High volume, low latency write
• Big Data, sparse data
• Need compression or versioning
• Telco, heavy ingest, petabyte scale
Application examples: User activity logs, Sensor data
Vendors: HBase (Hadoop)**, BigTable, Cassandra

Graph DB (4% revenue share*)
Use this when:
• Your data looks like a graph
• Have highly interconnected data, need to trace relationships
Application examples: Website purchase recommendations, Social network processing
Vendors: Titan**, Neo Technology (Neo4J)

* Source: IBM study 2013, estimated by splitting total noSQL revenue ($288m) by ratio of top 10 vendors' reported 2013 revenue. Total 2013 noSQL database revenue estimated at $343m.

(28)

Hadoop

• Open-source software framework from Apache
• Inspired by
– Google MapReduce
– GFS (Google File System)



HDFS

(29)

Hadoop Explained

• Hadoop computation model
– Data stored in a distributed file system spanning many inexpensive computers
– Bring function to the data
– Distribute application to the compute resources where the data is stored
• Scalable to thousands of nodes and petabytes of data

MapReduce Application

1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)

Diagram labels: Distribute map tasks to cluster, Shuffle, Return a single result set, Result Set

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Map: emit (word, 1) for every token in the input line.
        public void map(Object key, Text val, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(val.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Reduce: add up the counts that the shuffle grouped under one word.
        public void reduce(Text key, Iterable<IntWritable> val, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : val) {
                sum += v.get();
                . . .
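The three phases can also be tried without a cluster. The following is a minimal single-process sketch in plain Java, assuming nothing beyond the standard library: the class LocalWordCount and its map, shuffle, reduce, and wordCount methods are hypothetical stand-ins that mimic the phases, not the Hadoop API, and nothing here is distributed.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process imitation of map, shuffle, and reduce.
class LocalWordCount {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle: group all interim values under their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: boil the values for one key down to a single count.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run the three phases in order over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> interim = new ArrayList<>();
        for (String line : lines) {
            interim.addAll(map(line)); // on a cluster, one map task per split
        }
        Map<String, Integer> resultSet = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(interim).entrySet()) {
            resultSet.put(group.getKey(), reduce(group.getValue()));
        }
        return resultSet;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("big data", "data in motion")));
    }
}
```

On a real cluster the map calls run in parallel next to the data blocks and the shuffle moves interim pairs across the network; the sketch keeps only the data flow.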

(30)

Big Data Enterprise platform

[Architecture diagram; recoverable labels]
Groups: Visualization & Discovery, Applications & Development, Advanced Analytic Engines, Administration, Integration, Workload Optimization, Runtime, Data Store, File System, Security, Audit & History
Components: BigSheets, Dashboard & Visualization, Apps, Text Analytics, Text Processing Engine & Extractor Library, Adaptive Algorithms, Pig, Jaql, Hive, MapReduce, Adaptive MapReduce, Flexible Scheduler, Oozie, Workflow, ZooKeeper, Lucene, Index, Splittable Text Compression, Enhanced Security, Integrated Installer, Admin Console, Monitoring, Management, HBase, HDFS
Integration: JDBC, Sqoop, Flume, Streams, Netezza, DB2, DataStage

(31)

Future: The SQL interface . . . .

• Rich SQL query capabilities
– SQL '92 and 2011 features
– Correlated subqueries
– Windowed aggregates
• SQL access to all data stored in InfoSphere BigInsights
• Robust JDBC/ODBC support
• Take advantage of key features of each data source
• Leverage MapReduce parallelism

[Diagram labels: Application; SQL Language; JDBC / ODBC Driver; JDBC / ODBC Server; SQL interface Engine; InfoSphere BigInsights; Data Sources: Hive tables, HBase tables, CSV files]

OR

(32)

Spreadsheet-style Analysis

• Web-based analysis and visualization
• Spreadsheet-like interface
– Define and manage long running data collection jobs

(33)

JAQL: IBM’s programming language in the Hadoop world

• Jaql is a complete solutions environment supporting all other BigInsights components
• Integration point for various analytics
– Text analytics
– Statistical analysis
– Machine learning
– Ad-hoc analysis
• Integration point for various data sources
– Local and distributed file systems
– NoSQL databases
– Content repositories
– Relational sources (warehouses, operational databases)

[Diagram labels: BigInsights: Text Analytics; Statistical Analysis (R module); Machine learning (SystemML); Ad-Hoc analysis (BigSheets); (Integration) DB2, Netezza, Streams, …; Jaql: Jaql I/O, Jaql Core Operators, Jaql Modules]

(34)

Data In Motion and At Rest: Complementary

[Chart; recoverable labels. Latency axis: ms, …, sec, min, hr, day, wk, mo, yr. Data size axis: B, KB, MB, GB, 10GB, 100GB, 1TB, 10TB, 100TB, 1PB. Scale: Low, Med, High.]

• In Motion: Streams
• At Rest: Warehouse/Hadoop, scalable processing of huge data stores

(35)

Streams Analyzes All Kinds of Data

Mining in Microseconds

(included with Streams)

Image & Video

(Open Source)

Simple & Advanced Text

(included with Streams)

(36)

How Streams Works

• Continuous ingestion
• Continuous analysis

(37)

Achieve scale:
• By partitioning applications into software components
• By distributing across stream-connected hardware hosts

Infrastructure provides services for:
• Scheduling analytics across hardware hosts
• Establishing streaming connectivity

Transform, Filter / Sample, Classify, Correlate, Annotate

Where appropriate:
• Elements can be fused together for lower communication latency
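The idea of small operators that can be fused can be sketched with ordinary Java function composition. This is an analogy only, under stated assumptions: OperatorPipelineSketch, its operators, and the temperature thresholds are hypothetical, and the code does not use the Streams runtime or its language. Here "fusion" simply means composing two operators into one in-process call so that no hand-off between components is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: operators as small components, fusion as composition.
class OperatorPipelineSketch {

    // Filter operator: drop readings outside a plausible range (made-up limits).
    static final Predicate<Double> FILTER = celsius -> celsius > -50.0 && celsius < 60.0;

    // Transform operator: convert Celsius to Fahrenheit.
    static final Function<Double, Double> TRANSFORM = celsius -> celsius * 9.0 / 5.0 + 32.0;

    // Classify operator: label each reading (made-up threshold).
    static final Function<Double, String> CLASSIFY =
            fahrenheit -> fahrenheit >= 100.0 ? "HOT" : "NORMAL";

    // "Fused" operator: transform and classify run as one in-process step.
    static final Function<Double, String> FUSED = TRANSFORM.andThen(CLASSIFY);

    // Process tuples one at a time, as they would arrive from a stream.
    static List<String> process(List<Double> tuples) {
        List<String> labels = new ArrayList<>();
        for (Double tuple : tuples) {
            if (FILTER.test(tuple)) {
                labels.add(FUSED.apply(tuple));
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // 999.0 is rejected by the filter; the other readings are labeled.
        System.out.println(process(Arrays.asList(20.0, 999.0, 40.0)));
    }
}
```

Keeping operators separate makes them easy to place on different hosts; composing them trades that flexibility for lower latency between steps.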


(38)

Streams Runtime Supports Placement Criteria

• Host pools can force operators to be on hosts with solidDB installed
• Operator placement constraints allow for co-location, ex-location, and isolation of operators

[Diagram labels: x86 hosts; operators: Meters, Company Filter, Usage Model, Usage Contract, Text Extract, Degree History, Compare History, Store History, Temp Action, Season Adjust, Daily Adjust]

(39)

Data Warehouse Augmentation: Value & Diagram

1. Pre-Processing Hub
2. Query-able Archive
3. Exploratory Analysis

[Diagram labels: Streams (real-time processing); BigInsights (landing zone for all data); Information Integration; Data Warehouse; BigInsights can combine with unstructured information]

(40)

Individual Silos can Answer Typical Questions, One-by-One

• Experts: “Who is best able to help this customer?”
• Social Media: “What is her view of our company?”
• Support Ticketing: “What issues has this customer had in the past?”
• External Sources: “Where else has she worked?”
• CRM: “Who is this customer?”
• Supply Chain: “What is available inventory?”
• Content Mgt.: “How is her company using our products?”
• DBMS: “What products has she purchased?”
• Other silos shown: Wiki, Fulfillment, Email

…BUT! An enhanced 360º view provides answers in one application

Fusion of data from multiple systems enables deeper insights, not just facts

“What should I know before calling her for renewal?”
“What marketing materials should I send?”
“What’s going on with this customer TODAY?”
“What products can I upsell this customer?”
“How can we increase engagement with her?”
“How can we get more customers like her?”
“What impact will

(41)

[Screenshot labels: Janet Robertson; Customer search; Transaction history; Customer’s Products; Customer info; Indexed 3rd party information related to customer; Unstructured internal information related to customer]

SAP Systems, Dynamics, Microsoft SharePoint

(42)

IBM Cloud Offering for Analysts: Watson Analytics

• Unified analytics experience
• 100% cloud based
• Mobile ready
• Visual storytelling
• Intelligent automation

(43)

The IBM Big Data Platform

Hadoop-based low latency analytics for variety and volume

Queryable Archive Structured Data

BI+Ad Hoc

Analytics on Structured Data

Operational Analytics on Structured Data

Time-structured analytics

Large volume structured data analytics

Low Latency Analytics for streaming data

MPP Data Warehouse

Stream Computing Information Integration

Hadoop

(44)

(45)

Data Reservoir Repositories (Zones)

Landing, Exploration, Archive Reporting, Interactive Analysis Deep Analytics, Modeling

Data Reservoir: Refinery Services

Trusted Data, Warehousing Operational Systems Document Storage Transactional DB

NoSQL Doc Store Hadoop Mixed Workload RDBMS

Analytic Appliance Data Mart Landed Raw Data Discovery Sandbox Staging Transformation

Information Governance Catalog

Metadata for Data Sets Stored in Reservoir Repositories

IBM DataWorks

Integration • Load • Trickle feed Security • Masking • Test data generation Data Quality • Cleansing • Standardization • Matching

• Reference data generation

Data Lifecycle

(46)

Actionable Insight Reporting, Analysis Data Types Landing, Exploration, Archive Reporting, Interactive Analysis Deep Analytics, Modeling Transaction and Application Data Machine and Sensor Data Enterprise Content Social Data Image and Video

Third-Party Data

Information Management Zones

Trusted Data, Warehousing Discovery, Exploration Decision Management Predictive Analytics, Modeling Operational Systems Document Storage

Real-Time Analytical Processing

Governance and Lifecycle Management Fabric

Mainframe, Power8, Intel, Cloud (Managed/Hosted), Bluemix Services

Transactional DB

NoSQL Doc Store Hadoop Mixed Workload RDBMS

(47)

Emerging Big Data Implementation Pattern

Ingest

Landing and Analytics Sandbox Zone

Indexes, facets Hive/HBase Col Stores Documents In Variety of Formats Analytics MapReduce Repository, Workbench

Ingestion and Real-time Analytic Zone

Data Sinks Filter, Transform Ingest Correlate, Classify Extract, Annotate Warehousing Zone Enterprise Warehouse Data Marts Query Engines Cubes Descriptive, Predictive Models Models Widgets Discovery, Visualizer Search Analytics and Reporting Zone

Metadata and Governance Zone

Co

nnec

to

(48)

IBM InfoSphere BigInsights Enterprise Edition

[Architecture diagram; recoverable labels]
Groups: Visualization & Discovery, Applications & Development, Advanced Analytic Engines, Administration, Integration, Workload Optimization, Runtime, Data Store, File System, Security, Audit & History
Components: BigSheets, Dashboard & Visualization, Apps, Text Analytics, Text Processing Engine & Extractor Library, Adaptive Algorithms, Pig, Jaql, Hive, MapReduce, Adaptive MapReduce, Flexible Scheduler, Oozie, Workflow, ZooKeeper, Lucene, Index, Splittable Text Compression, Enhanced Security, Integrated Installer, Admin Console, Monitoring, Management, HBase, HDFS
Integration: JDBC, Sqoop, Flume, Streams, Netezza, DB2, DataStage

(49)

Integration

Streams (Data in Motion)

Big Data (Data At Rest)

Real Time Event Detection, Pattern Detection, Deep Analytics, Integration

Datawarehouse, Customers Profiles

[Diagram labels]
Systems: Marketing Offers Creation and Management System, Matching System, Multichannel Notification System, Predictive Model
Data sources: CaixaBank operational system (structured), CaixaBank ‘at rest’ / ‘in motion’ (unstructured), CaixaBank Electronic Journal (structured), External Social Media (unstructured)
Flows: structured data, unstructured data, Text Analytics, Integration

(50)


Deep Analytics (Research, Existing, Third-party)

• Sentiment Analysis
• Behavior Analysis
• Intent Analysis
• Influence Analysis
• Concept Labeling & Classification
• Topic Detection
• Location Based Analysis
• Data linkage

(51)

(52)

Why are Developers Using Bluemix?

Go from zero to running code in a matter of minutes.

Automate the development and delivery of many applications.

• To rapidly bring products and services to market at lower cost
• To continuously deliver new functionality to their applications
• To extend existing investments in IT infrastructure

(53)

Infrastructure Services

Database

as-a-Service

Systems of Record

Cloudant: Database as a Service (Documents)

(54)

(55)

dashDB: Data Warehouse as a Service

[Diagram labels: Netezza Analytics, BLU Acceleration, dashDB, Cloud, 3rd Party DW]

Build More, Grow More, Know More

• Deploy in hours with rapid cloud provisioning
• No infrastructure investment for cloud agility
• In-Database analytics built in
• R Integration for predictive modeling
• Partner Ecosystem for analytics
• IBM Watson Analytics ready
• Load and Go with no tuning required
• Columnar optimized for analytic workloads

(56)

(57)
