The
Deal
Analytics and the Context Multiplier
Raw Data, Feature extraction metadata, Domain linkages, Full contextual analytics
• Location risk
• Occupational risk
• Dietary risk
• Family history
• Actuarial data
• Government statistics
• Epidemic data
• Chemical exposure
• Personal financial situation
IBM Watson
• Automates customer interaction to increase customer engagement in sales and service
• Transforms customer engagement by knowing, engaging and empowering clients
• Develops customer relationships through a transformative user experience
What it does
• Provides answers not links and webpages
• Answers with evidence not guesses
• Not restricted to a predefined question-answer set
• Learns from every interaction
How it does it
Watson Engagement Advisor
Watson Discovery Advisor
Answer previously unanswerable research problems
Watson can read these medical records in six seconds!
Gain Awareness
Harness all available scientific knowledge in the hunt for a breakthrough and identify better leads for any researcher to pursue.
Understand Relationships
Enable every scientist to identify new relationships and explore never-before-considered options that lead to real, differentiating scientific innovations.
Clarify Ideas
Volume: Data at Scale
Variety: Data in Many Forms
Velocity: Data in Motion
Veracity: Data Uncertainty
Big Data Definition
BigData
MYTH:
Big Data is only about large datasets; we should just say larger than what you have
MYTH:
Big Data means Hadoop... that’s it
MYTH:
Big Data means ‘rip-and-replace’, death to the RDBMS and no governance
MYTH:
NoSQL means no SQL, never, utter hatred for SQL
MYTH:
Big Data means unstructured data and only for sentiment
without analytics
An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics…
1 BILLION lines of code
Applications for Big Data Analytics
Homeland Security
Finance
Smarter Healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Automatic Temporal and Spatially Enriched Data
Use Cases: Law Enforcement and Security
Video surveillance, wire taps, communications, call records, etc.
Millions of messages per second with low density of critical data
Identify patterns and relationships among vast information sources
“The US Government has been working with IBM Research since 2003 on a
radical new approach to data analysis that enables high speed, scalable and
complex analytics of heterogeneous data streams in motion. The project has
been so successful that US Government will deploy additional installations
to enable other agencies to achieve greater success in various future
Velocity – Creating Actionable Intelligence in Real Time
Volume - The Government Industry Data Challenge
IBM Multimedia Analysis & Retrieval
Automatic Semantic Classification of Images and Video
Content based feature extraction & Search
Gigapixel Panorama Photography
http://www.gigapixel.com/image/gigapan-canucks-g7.html
Predictive Analytics in a Neonatal ICU
Real-time analytics and correlations
on physiological data streams
– Blood pressure, temperature, EKG, blood oxygen saturation, etc.
Early detection of the onset of
potentially life-threatening
conditions
– Up to 24 hours earlier than current
medical practices
– Early intervention leads to lower patient
morbidity and better long term
outcomes
Technology also enables
Big Data Analytics: Iterative & Exploratory. Data is the structure.
Traditional Analytics: Structured & Repeatable. Structure built to store data.
Warehouse Modernization Has Two Themes
Hypothesis, Question, Data, Answer, Analyzed Information
Start with hypothesis
Test against selected data
Data leads the way
Explore all data, identify correlations
All Information: Data, Exploration, Correlation, Actionable Insight
Analyze all
TRADITIONAL APPROACH BIG DATA APPROACH
Analyze as is
TRADITIONAL APPROACH BIG DATA APPROACH
Carefully cleanse information
before any analysis
Find correlation
TRADITIONAL APPROACH BIG DATA APPROACH
Start with hypothesis and
test against selected data
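The "find correlation" idea can be made concrete with a small calculation. The following is only a minimal sketch in plain Java and is not taken from the deck: the class name `CorrelationSketch`, the method `pearson`, and the sample figures are hypothetical. It computes the Pearson correlation coefficient between two series of equal length, which is one common way to check whether two measurements move together.

```java
/** Minimal sketch: Pearson correlation between two equally long series. */
final class CorrelationSketch {

    // Returns a value between -1 and 1; the result is NaN if either series is constant.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have equal length of at least 2");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;   // co-movement of the two series
            varX += dx * dx;  // spread of x
            varY += dy * dy;  // spread of y
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Hypothetical figures, for illustration only.
        double[] storeVisits = {120, 150, 170, 200};
        double[] onlineOrders = {30, 41, 45, 60};
        System.out.println("correlation = " + pearson(storeVisits, onlineOrders));
    }
}
```

In an exploratory setting, a calculation like this would be run across many pairs of fields to surface candidates worth a closer look, rather than to confirm a single hypothesis chosen in advance.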
Analyze in motion
TRADITIONAL APPROACH: Analyze data after it’s been processed and landed in a warehouse or mart
BIG DATA APPROACH: Analyze data in motion as it’s generated, in real time
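The difference between the two approaches can be sketched in a few lines of plain Java. This is an illustrative assumption, not code from the deck: the class `InMotionAverage` and its methods are hypothetical. The "in motion" path folds each reading into a running mean as it arrives and stores nothing, while the "at rest" path scans readings that have already landed in a repository.

```java
import java.util.List;

/** Minimal sketch: per-event analysis (in motion) versus batch analysis (at rest). */
final class InMotionAverage {
    private long count = 0;
    private double mean = 0.0;

    // In motion: update the running mean with each event as it is generated.
    void accept(double value) {
        count++;
        mean += (value - mean) / count;
    }

    double mean() {
        return mean;
    }

    // At rest: the data has already landed and is scanned afterwards.
    static double batchMean(List<Double> landed) {
        double sum = 0.0;
        for (double v : landed) {
            sum += v;
        }
        return landed.isEmpty() ? 0.0 : sum / landed.size();
    }

    public static void main(String[] args) {
        List<Double> readings = List.of(10.0, 20.0, 30.0, 40.0); // hypothetical readings
        InMotionAverage live = new InMotionAverage();
        for (double r : readings) {
            live.accept(r);
            System.out.println("after " + r + " running mean = " + live.mean());
        }
        System.out.println("batch mean = " + batchMean(readings));
    }
}
```

Both paths arrive at the same figure; the in-motion version simply has an answer available after every event instead of only after the load completes.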
Data, Repository, Analysis, Insight
Complementary Analytics
Traditional Approach
Structured, analytical, logical
New Approach
Creative, holistic thought, intuition
Different requirements require different tools
– Document stores
– Key/value stores
– BigTable implementations (columnar)
– Graph databases
Values (there are exceptions)
– Huge data volumes – easy scale-out
– Developers code integrity if it’s needed
– Relaxed (eventual) consistency
– Semi-structured data
– Schema on read
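"Schema on read" in the list above can be illustrated with a short sketch. It is an assumption made for illustration, not any product's API: the class `SchemaOnReadSketch`, the `key=value;...` record format, and the sample records are hypothetical. Records are kept exactly as they arrive, and structure is applied only when a reader asks for a field.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch: semi-structured records interpreted at read time. */
final class SchemaOnReadSketch {

    // Stored as-is; no table definition is enforced when the records are written.
    static final List<String> RAW_RECORDS = List.of(
            "id=1;name=Ana;city=Madrid",
            "id=2;name=Ben",                          // a field is missing
            "id=3;name=Chloe;city=Paris;tier=gold");  // an extra field is present

    // Structure is applied here, at read time, by splitting the raw text into fields.
    static Map<String, String> parse(String raw) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : raw.split(";")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) {
                fields.put(kv[0].trim(), kv[1].trim());
            }
        }
        return fields;
    }

    // The reader decides which field it needs and how to treat a missing one.
    static String read(String raw, String field, String fallback) {
        return parse(raw).getOrDefault(field, fallback);
    }

    public static void main(String[] args) {
        for (String raw : RAW_RECORDS) {
            System.out.println(read(raw, "name", "?") + " / " + read(raw, "city", "unknown"));
        }
    }
}
```

The trade-off matches the "Values" list: records with missing or extra fields are accepted without a schema change, and any integrity rules become the developer's responsibility in the reading code.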
Why NoSQL?
Pressures on Traditional Relational Stores
Technical change/
Different forms of data
(SLAs, Archive, Governance)
Regulatory pressures
Database Landscape Overview
SQL
  Description
  • Relational SQL (RDBMS)
  • Operational and Analytic
  • E.g. DB2, Oracle, Microsoft, Teradata, etc.
  Typical Infrastructure
  • Proprietary database storage
  • Unix, Linux, Windows
  • SMP, MPP, SAN, Integrated Systems, Appliances

noSQL database
  Description
  • noSQL database
  • Mainly operational
  • E.g. Cloudant, MongoDB, Redis, Riak, Aerospike, Amazon DynamoDB, etc.
  Typical Infrastructure
  • Proprietary database storage
  • Linux
  • Commodity clusters
  • Local attach disks, NAS
  • Cloud
  • Mobile

Hadoop
  Description
  • SQL on Hadoop (mainly analytic)
  • HBase (evolving OLTP, ACID)
  • E.g. BigInsights, Cloudera, Hortonworks, MapR, Pivotal
  • HP Labs Trafodion
  Typical Infrastructure
  • HDFS files
  • Linux
  • Commodity clusters
  • Local attach disks
Different Categories of noSQL Databases
Document (63% revenue share*)
  Use this when…
  • Schema is not well defined
  • Schema is very likely to change, need to maintain flexibility
  • Commonly described with JSON
  Application Examples
  • Frequently changing product catalogs
  Vendors
  • Cloudant**
  • MongoDB
  • Couchbase
  • MarkLogic

Key-Value (24% revenue share*)
  Use this when…
  • Your data is not highly related
  • All you need is basic Create, Read, Update, Delete (CRUD)
  • Rapid scaling for simple data collections
  Application Examples
  • User Sessions
  • Shopping Cart
  Vendors
  • Redis
  • Aerospike
  • AWS (DynamoDB)
  • Basho Technologies (Riak)

BigTable/Columnar (9% revenue share*)
  Use this when…
  • High volume, low latency write
  • Big Data, sparse data
  • Need compression or versioning
  • Telco, heavy ingest, petabyte scale
  Application Examples
  • User Activity logs
  • Sensor data
  Vendors
  • HBase (Hadoop)**
  • BigTable
  • Cassandra

Graph DB (4% revenue share*)
  Use this when…
  • Your data looks like a graph
  • Have highly interconnected data, need to trace relationships
  Application Examples
  • Website Purchase Recommendations
  • Social Network Processing
  Vendors
  • Titan**
  • Neo Technology (Neo4J)

* Source: IBM study 2013 estimated by splitting total noSQL revenue ($288m) by ratio of top 10 vendors reported 2013 revenue. Total 2013 noSQL database revenue estimated $343m
Hadoop
Open-source software framework from Apache
Inspired by
– Google MapReduce
– GFS (Google File System)
HDFS
Hadoop Explained
Hadoop computation model
– Data stored in a distributed file system spanning many inexpensive computers
– Bring function to the data
– Distribute application to the compute resources where the data is stored
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
Return a single result set
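import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every token in the input line.
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

// Reduce phase: sum the counts collected for each word.
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> val, Context context) {
    int sum = 0;
    for (IntWritable v : val) {
      sum += v.get();
      . . .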
Distribute map
tasks to cluster
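To make the three phases easier to follow without a cluster, here is a minimal in-memory sketch of the same word count in plain Java. It is an illustration only and does not use the Hadoop API: the class `LocalWordCount` and its method names are hypothetical, and everything runs in a single process rather than being distributed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Minimal sketch: map, shuffle and reduce phases run in memory (no Hadoop). */
final class LocalWordCount {

    // Map phase: split one input line into tokens and emit a (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle: group the interim pairs by key so that all counts for a word sit together.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: boil the grouped counts for one word down to a single sum.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Runs the three phases in order and returns a single result set.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> interim = new ArrayList<>();
        for (String line : lines) {
            interim.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(interim).forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data", "data in motion", "data at rest")));
    }
}
```

In a real Hadoop job the map calls run in parallel on the nodes that hold the data and the framework performs the shuffle; the sketch only mirrors the order of the steps.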
Big Data Enterprise platform
Layers: Visualization & Discovery; Applications & Development; Advanced Analytic Engines; Workload Optimization; Runtime; Integration; Administration; Management
Runtime: MapReduce. File System: HDFS. Data Store: HBase.
Integration: JDBC, Flume, Sqoop, Streams, Netezza, DB2, DataStage
Components: BigSheets, Dashboard & Visualization, Apps, Workflow, Monitoring, Text Analytics, MapReduce, Pig & Jaql, Hive, Text Processing Engine & Extractor Library, Adaptive Algorithms, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Jaql, Pig, ZooKeeper, Lucene, Oozie, Adaptive MapReduce, Integrated Installer, Admin Console, Security, Audit & History
Application: SQL Language, JDBC / ODBC Driver
InfoSphere BigInsights: JDBC / ODBC Server, SQL interface Engine
Data Sources: Hive tables, HBase tables, CSV files
Future: The SQL interface . . . .
Rich SQL query capabilities
– SQL '92 and 2011 features
– Correlated subqueries
– Windowed aggregates
SQL access to all data stored in
InfoSphere BigInsights
Robust JDBC/ODBC support
Take advantage of key features
of each data source
Leverage MapReduce
parallelism
OR
Spreadsheet-style Analysis
Web-based analysis and
visualization
Spreadsheet-like
interface
– Define and manage long
running data collection
jobs
Jaql: IBM’s programming language in the Hadoop world
Jaql is a complete solutions environment supporting all other
BigInsights components
Integration point for various analytics
– Text analytics
– Statistical analysis
– Machine learning
– Ad-hoc analysis
Integration point for various data sources
– Local and distributed file systems
– NoSQL databases
– Content repositories
– Relational sources (warehouses, operational databases)
BigInsights: Text Analytics; Statistical Analysis (R module); Machine learning (System ML); Ad-Hoc analysis (BigSheets); (Integration) DB2, Netezza, Streams, …
Jaql: Jaql I/O, Jaql Core Operators, Jaql Modules
Data In Motion and At Rest: Complementary
[Chart. Latency axis: ms, sec, min, hr, day, wk, mo, yr. Data size axis: B, KB, MB, GB, 10GB, 100GB, 1TB, 10TB, 100TB, 1PB. Both scales are marked Low, Med, High. Regions: At Rest (Warehouse/Hadoop); In Motion (Streams).]
– Scalable processing of huge data stores
Streams Analyzes All Kinds of Data
Mining in Microseconds (included with Streams)
Image & Video (Open Source)
Simple & Advanced Text (included with Streams)
Continuous ingestion
Continuous analysis
How Streams Works
Achieve scale:
– By partitioning applications into software components
– By distributing across stream-connected hardware hosts
Infrastructure provides services for
– Scheduling analytics across hardware hosts
– Establishing streaming connectivity
Transform
Filter / Sample
Classify
Correlate
Annotate
Where appropriate: elements can be fused together for lower communication latency
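The operator chain and the idea of fusing elements can be sketched in plain Java. This is not the Streams programming model or its API, only an illustration under stated assumptions: the interface `StreamOperator`, its `fuse` method, the three sample operators, and the temperature figures are all hypothetical. Each operator takes one tuple and emits zero or one tuple; fusing two operators composes them into a single in-process call, which is the intuition behind avoiding a communication hop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Minimal sketch: filter, transform and classify operators fused into one pipeline. */
final class OperatorPipeline {

    // An operator consumes one tuple and emits zero or one tuple (empty means dropped).
    interface StreamOperator<I, O> {
        Optional<O> process(I tuple);

        // Fusion: run this operator and the next one back to back in the same process.
        default <R> StreamOperator<I, R> fuse(StreamOperator<O, R> next) {
            return tuple -> process(tuple).flatMap(next::process);
        }
    }

    // Filter: drop readings below a hypothetical sensor-error sentinel.
    static final StreamOperator<Double, Double> FILTER =
            c -> c > -50.0 ? Optional.of(c) : Optional.empty();

    // Transform: convert Celsius to Fahrenheit.
    static final StreamOperator<Double, Double> TRANSFORM =
            c -> Optional.of(c * 9.0 / 5.0 + 32.0);

    // Classify: label each reading against a hypothetical threshold.
    static final StreamOperator<Double, String> CLASSIFY =
            f -> Optional.of(f >= 86.0 ? "HOT:" + f : "NORMAL:" + f);

    // Pushes every source tuple through the operator and collects what is emitted.
    static <I, O> List<O> run(List<I> source, StreamOperator<I, O> op) {
        List<O> out = new ArrayList<>();
        for (I tuple : source) {
            op.process(tuple).ifPresent(out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        StreamOperator<Double, String> fused = FILTER.fuse(TRANSFORM).fuse(CLASSIFY);
        System.out.println(run(List.of(20.0, -999.0, 35.0), fused));
    }
}
```

Keeping each step as its own operator is what allows a runtime to place the steps on different hosts; fusing them trades that flexibility for lower latency between steps.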
Streams Runtime Supports Placement Criteria
[Diagram labels: x86 hosts; Meters; Company Filter; Usage Model; Usage Contract; Text Extract; Degree History; Compare History; Store History; Temp Action; Season Adjust; Daily Adjust]
Host pools can force operators to be on hosts with solidDB installed
Operator placement constraints allow for co-location, ex-location, and isolation of operators
Data Warehouse Augmentation: Value & Diagram
1. Pre-Processing Hub
2. Query-able Archive
3. Exploratory Analysis
[Diagram labels: Information Integration; Data Warehouse; Streams (real-time processing); BigInsights (landing zone for all data); Can combine with unstructured information]
Individual Silos can Answer Typical Questions, One-by-One
Wiki
“Who is best able to help this customer?”
Experts
“What is her view of our company?”
Social Media
Fulfillment
“What issues has this customer had in the past?”
Support Ticketing
“Where else has she worked?”
External Sources
“Who is this customer?”
CRM
“What is available inventory?”
Supply Chain
“How is her company using our products?”
Content Mgt.
“What products has she purchased?”
DBMS
…BUT! An enhanced 360º view provides answers in one application
Fusion of data from multiple systems enables deeper insights, not just facts
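The fusion of silo data into a single view can be sketched as a simple lookup and merge. This is a hedged illustration, not the product's implementation: the class `Customer360Sketch`, the customer id `c-001`, and the sample facts are hypothetical. Each silo is modelled as a map from customer id to one fact, and the view gathers every fact the silos hold for a given customer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal sketch: combine per-silo facts about one customer into a single view. */
final class Customer360Sketch {

    // silos maps a silo name (e.g. "CRM") to that silo's facts, keyed by customer id.
    static Map<String, String> view(String customerId, Map<String, Map<String, String>> silos) {
        Map<String, String> fused = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> silo : silos.entrySet()) {
            String fact = silo.getValue().get(customerId);
            if (fact != null) {
                fused.put(silo.getKey(), fact); // keep only silos that know this customer
            }
        }
        return fused;
    }

    public static void main(String[] args) {
        // Hypothetical silo contents, for illustration only.
        Map<String, Map<String, String>> silos = new LinkedHashMap<>();
        silos.put("CRM", Map.of("c-001", "Janet Robertson"));
        silos.put("DBMS", Map.of("c-001", "purchased product A"));
        silos.put("Support Ticketing", Map.of("c-002", "open ticket"));
        System.out.println(view("c-001", silos));
    }
}
```

The sketch assumes every silo shares the same customer id; in practice, matching records across systems that use different identifiers is the harder part of building such a view.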
“What should I know before calling her for renewal?”
“What marketing materials should I send?”
“What’s going on with this customer TODAY?”
“What products can I upsell this customer?”
“How can we increase engagement with her?”
“How can we get more customers like her?”
“What impact will
[Screenshot labels: Janet Robertson; Customer search; Transaction history; Customer’s Products; Customer info; Indexed 3rd party information related to customer; Unstructured internal information related to customer]
SAP Systems, Dynamics, Microsoft SharePoint
IBM Cloud Offering for Analysts: Watson Analytics
Unified analytics experience
100% cloud based
Mobile ready
Visual storytelling
Intelligent automation
The IBM Big Data Platform
Hadoop-based low latency analytics for variety and volume
Queryable Archive Structured Data
BI + Ad Hoc Analytics on Structured Data
Operational Analytics on Structured Data
Time-structured analytics
Large volume structured data analytics
Low Latency Analytics for streaming data
MPP Data Warehouse
Stream Computing Information Integration
Hadoop
Data Reservoir Repositories (Zones)
Landing, Exploration, Archive; Reporting, Interactive Analysis; Deep Analytics, Modeling
Data Reservoir: Refinery Services
Trusted Data, Warehousing; Operational Systems; Document Storage; Transactional DB
NoSQL Doc Store; Hadoop; Mixed Workload RDBMS
Analytic Appliance; Data Mart; Landed Raw Data; Discovery Sandbox; Staging; Transformation
Information Governance Catalog
Metadata for Data Sets Stored in Reservoir Repositories
IBM DataWorks
• Integration: Load, Trickle feed
• Security: Masking, Test data generation
• Data Quality: Cleansing, Standardization, Matching, Reference data generation
Data Lifecycle
Actionable Insight: Reporting, Analysis
Landing, Exploration, Archive; Reporting, Interactive Analysis; Deep Analytics, Modeling
Data Types: Transaction and Application Data; Machine and Sensor Data; Enterprise Content; Social Data; Image and Video; Third-Party Data
Information Management Zones
Trusted Data, Warehousing; Discovery, Exploration; Decision Management; Predictive Analytics, Modeling; Operational Systems; Document Storage
Real-Time Analytical Processing
Governance and Lifecycle Management Fabric
Integration | Matching | Masking | Lineage | Security | Privacy | Glossary
Mainframe, Power8, Intel, Cloud (Managed/Hosted), Bluemix Services
Transactional DB
NoSQL Doc Store Hadoop Mixed Workload RDBMS
Emerging Big Data Implementation Pattern
Ingest
Landing and Analytics Sandbox Zone: Indexes, facets; Hive/HBase Col Stores; Documents in Variety of Formats; Analytics MapReduce; Repository, Workbench
Ingestion and Real-time Analytic Zone: Data Sinks; Filter, Transform; Ingest; Correlate, Classify; Extract, Annotate
Warehousing Zone: Enterprise Warehouse; Data Marts; Query Engines; Cubes; Descriptive, Predictive Models
Analytics and Reporting Zone: Models; Widgets; Discovery, Visualizer; Search
Metadata and Governance Zone
Connectors
IBM InfoSphere BigInsights Enterprise Edition
Integration
Streams (Data in Motion)
Big Data (Data At Rest)
Real Time Event Detection; Pattern Detection; Deep Analytics; Integration
Data warehouse; Customers Profiles
Systems: Marketing Offers Creation and Management System; Matching System; Multichannel Notification System; Predictive Model
Data sources: CaixaBank operational system (structured); CaixaBank ‘at rest’ / ‘in motion’ (unstructured); CaixaBank Electronic Journal (structured); External Social Media (unstructured)
Text Analytics on unstructured data; structured data
Deep Analytics (Research, Existing, Third-party)
Sentiment Analysis
Behavior Analysis
Intent Analysis
Influence Analysis
Concept Labeling & Classification
Topic Detection
Location Based Analysis
Data linkage
Why are Developers Using Bluemix?
Go from zero to running code in a matter of minutes.
Automate the development and delivery of many applications.
To rapidly bring products and services to market at lower cost
To continuously deliver new functionality to their applications
To extend existing investments in IT infrastructure
Infrastructure Services
Database as-a-Service
Systems of Record
Cloudant: Database as a Service (Documents)
dashDB: Data Warehouse as a Service
Netezza Analytics; BLU Acceleration; dashDB; Cloud; 3rd Party DW
Build More
Grow More
Know More
Deploy in hours with rapid cloud provisioning
No infrastructure investment for cloud agility
In-Database analytics built in
R Integration for predictive modeling
Partner Ecosystem for analytics
IBM Watson Analytics ready
Load and Go with no tuning required
Columnar optimized for analytic workloads