How To Get More Data From Your Computer

(1)

Industry Perspective:

Big Data and Big Data Analytics

David Barnes, Program Director

Emerging Internet Technologies, IBM Software Group

(2)

What is Big Data?

(3)

The Adjacent Possible

(4)

Inexpensive disk

+ Increased processing power

+ Data Warehouse

+ The Web

+ X

= Big Data

X = Sensors used to gather climate information, posts to social

media sites, digital pictures and videos, transaction records, cell

phone GPS signals, and more.

(5)

161 exabytes of data were created in 2006 –

3 million times the amount of information contained

in all the books ever written.

In 2010 the number reached 988 exabytes.

IDC estimates that 1.8 zettabytes were created and

replicated in 2011.

(6)

Every day, people create the equivalent of 2.5

quintillion bytes of data from sensors, mobile devices,

online transactions, and social networks.

Every month people send one billion Tweets and post

30 billion messages on Facebook.

90% (or more) of the world’s data is unstructured.

(7)

The true nature of information

(8)

Is noisy

Is often dirty

Is often full of valuable information

Unstructured Data

(9)

Big Data has swept into every industry

and business function.

Businesses need to put the power of Big

Data analytics in the hands of their

business employees. The title “Data Scientist” is

somewhat misleading.

“Leaders in every sector will have to

grapple with the implications of big

data, not just a few data-oriented

managers.”

– McKinsey Global Institute

The Big Data Imperative

Big Data Business

Patterns

Computational Journalism

Chief Legal Officer

Retail Business Planner

IT Systems Management

Pharma - Clinical Trials

Business Fraud Detection

Evidence Based Medicine

Web Archiving

. . .

(10)

Today’s Problem

Data is growing at a compound annual rate of 60%

Storage capacity continues to increase dramatically

Storage access speeds have not kept up

At a transfer speed of 500 MB/sec, 1 terabyte of data

requires ~30 minutes to read from a single drive
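
The single-drive read time and the 60% growth rate both come down to simple arithmetic. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the function names are illustrative and not from the presentation:

```python
TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte


def read_time_seconds(data_bytes: float, bytes_per_second: float) -> float:
    """Time to stream data_bytes sequentially from a single drive."""
    return data_bytes / bytes_per_second


def growth_factor(years: int, annual_rate: float = 0.60) -> float:
    """How many times larger the data is after `years` of compound growth."""
    return (1 + annual_rate) ** years


single_drive = read_time_seconds(1 * TB, 500 * MB)
print(f"{single_drive:.0f} s = {single_drive / 60:.1f} min")  # 2000 s = 33.3 min
print(f"{growth_factor(5):.1f}x after five years")            # 10.5x after five years
```

Under these assumptions the read takes about 33 minutes, consistent with the slide's rounded figure, and 60% compound growth means roughly a tenfold increase every five years.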

Enter Map/Reduce

• Automates the mechanisms of large-scale distributed computation (i.e., work

distribution, load balancing, replication, failure/recovery)

• Divide & Conquer: 1 terabyte split among 100 drives requires ~20 seconds

to read

• The M/R parallel processing model provides a cost-effective framework for a new generation

of analytic applications on unstructured or semi-structured data
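
The map, shuffle and reduce steps can be sketched with a word count. This is a single-process illustration of the programming model only: it is not Hadoop's API, it does no real distribution or failure recovery, and every function name here is an assumption made for the example.

```python
from collections import defaultdict
from typing import Iterable


def map_words(split: str) -> list[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce step: combine the values for each key into one total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits: list[str]) -> dict[str, int]:
    # On a real cluster each split would be mapped on a different node in
    # parallel; here the splits are processed one after another for clarity.
    emitted = [pair for split in splits for pair in map_words(split)]
    return reduce_counts(shuffle(emitted))


splits = ["big data big insight", "data in motion", "big data at rest"]
counts = word_count(splits)
print(counts["big"], counts["data"])  # 3 3

# Divide and conquer applied to the read itself (decimal units assumed):
# each of 100 drives reads 1/100 of a terabyte at 500 MB/sec.
parallel_read_seconds = (10**12 / 100) / (500 * 10**6)
print(parallel_read_seconds)  # 20.0
```

Because each map call sees only its own split and each reduce call sees only one key's values, the framework is free to spread both steps across many machines.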

(11)

Requirement: A New Class of Big Data Applications

Big Data analytics must be

brought to the line-of-business

user.

• Leverage easy-to-use

manipulation metaphors

• Use natural language

technologies for analytics

• Provide rich visualizations to

quickly identify insights

(12)

Demo

Buyer Sentiment Analysis

(13)

Social Media: Chilean Earthquake 2010

The 2010 Chilean earthquake was the fifth-largest

earthquake in recorded history

The affected areas suffered major

devastation - buildings, airports,

hospitals, prisons, bridges, and roads

were severely damaged

Land-based communications systems

suffered major outages

The wireless 3G infrastructure remained

intact and operational

(14)

Social Media: Chilean Earthquake 2010

Social networking on wireless

networks became the major form of

communication

Extreme Blue students collected 226

million Tweets, then analyzed and categorized

them by incident type and location

Tweets included questions (Can I get food? Can

I get gas? Are the bridges down?) and

images

The results were visualized

Completed in ~12 weeks
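
Categorizing Tweets by incident type and location can be illustrated with a simple keyword tally. The presentation does not describe the students' actual method, so this is only a sketch: the categories, keywords, locations and sample messages below are all hypothetical.

```python
from collections import Counter

# Hypothetical keyword lists; the categories actually used are not given.
CATEGORIES = {
    "food": ("food", "water"),
    "fuel": ("gas", "fuel"),
    "infrastructure": ("bridge", "road"),
}


def categorize(text: str) -> str:
    """Return the first category whose keyword appears in the tweet text."""
    lowered = text.lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"


def tally(tweets: list[tuple[str, str]]) -> Counter:
    """Count tweets per (location, category) pair."""
    return Counter((location, categorize(text)) for location, text in tweets)


# Hypothetical (location, text) pairs.
sample = [
    ("Talca", "Can I get food?"),
    ("Talca", "Can I get gas?"),
    ("Santiago", "Are the bridges down?"),
    ("Santiago", "Power is back"),
]
for (location, category), count in sorted(tally(sample).items()):
    print(location, category, count)
```

Plain substring matching is crude (for example, "gas" also matches "gasp"); a production system at the scale described would need proper language processing.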

(15)

Big Data = Volume, Variety and Velocity

• Volume - Scale from terabytes to zettabytes

• Variety - Relational and non-relational data types from an ever-

expanding variety of sources

• Velocity - Streaming data and large volume data movement

(16)

(17)

(18)

(19)

The supercomputer is based on over 1,200 high-

powered IBM System x servers and can perform

150 trillion calculations per second -- equivalent

to 30 million calculations per Danish citizen per

second.
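
The per-citizen comparison can be checked by dividing the two figures on the slide; the variable names below are illustrative.

```python
calcs_per_second = 150e12       # 150 trillion, from the slide
per_citizen_per_second = 30e6   # 30 million, from the slide

# Population implied by the comparison.
implied_population = calcs_per_second / per_citizen_per_second
print(f"{implied_population:,.0f}")  # 5,000,000
```

The comparison therefore assumes a round figure of about 5 million Danish citizens.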

Vestas expects its data sets will grow to 20-plus

petabytes over the next four years.

(20)

(21)

Seton Healthcare Family

Reducing CHF readmission to improve care

Business Challenge

Seton Healthcare strives to reduce the occurrence of high cost Congestive Heart Failure (CHF) readmissions by

proactively identifying patients likely to be readmitted on an emergent basis.

What’s Smart?

The IBM Content and Predictive Analytics for Healthcare solution will help to better target and understand high-risk CHF patients for care management programs by:

• Utilizing natural language processing to extract key elements from unstructured History and Physical, Discharge Summaries, Echocardiogram Reports, and Consult Notes

• Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data

• Providing an interface through which providers can intuitively navigate, interpret and take action

Smarter Business Outcomes

• Seton will be able to proactively target care management and reduce readmission of CHF patients.

• Teaming unstructured content with predictive analytics, Seton will be able to identify patients likely for readmission and introduce early interventions to reduce cost, mortality

IBM solution

• IBM Content and Predictive Analytics for Healthcare

• IBM Cognos Business Intelligence

• IBM BAO solution services
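
The slide describes the predictive models as having high positive predictive value. As a reminder of what that metric measures, here is a minimal sketch using the standard definition, true positives divided by all positive predictions. The counts are hypothetical and are not Seton's results.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Share of patients flagged as high risk who really were readmitted."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no patients were flagged")
    return true_positives / flagged


# Hypothetical counts for illustration only.
ppv = positive_predictive_value(true_positives=45, false_positives=5)
print(ppv)  # 0.9
```

A high value means few patients are flagged unnecessarily, which matters when care management capacity is limited.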

“IBM Content and Predictive Analytics for Healthcare uses the same type of natural language processing as IBM Watson, enabling us to leverage information in new ways not possible before. We can access an integrated view of relevant

clinical and operational information to drive more informed decision making and optimize patient and operational outcomes.”

(22)

IBM Content and Predictive Analytics for Healthcare

The Seton CHF Readmission Solution

Unstructured Data

(Cerner Clinical Documentation: History and Physical, Discharge Summary, Echocardiogram)

Structured Data

(Avega Cost Data, DSS Admission History, DSS Procedure History, Cerner Clinical Events)

Raw Information

Search and Visually Explore (Mine)

Monitor, Dashboard and Report (Cognos BI)

Question and Answer*

Custom Solutions

Dynamic Multimode Interaction

IBM Content and Predictive Analytics

Content Analytics

• Natural Language Processing

• Medical Fact and Relationship Extraction (Annotation)

• Trend, Pattern, Anomaly, Deviation Analysis

Predictive Analytics

• Predictive Scoring and Probability Analysis

Analyzed and Visualized Information

Health Integration Framework

Data Warehouse and Model, Master Data Management, Advanced Case Management, Business Analytics

Partners (HLI) Specialized Research

IBM Watson for Healthcare

Confirm hypotheses or seek alternative ideas with confidence-based responses from learned knowledge*

Utilizing natural language processing to extract key elements from unstructured History and Physical and Discharge Summary

Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data

Providing an interface through which providers can intuitively navigate, interpret and take action

(23)

The Data We Thought Would Be Useful … Wasn’t

• 113 candidate predictors from structured and unstructured data sources

• Structured data was less reliable than unstructured data, which increased the reliance on unstructured data

New Unexpected Indicators Emerged … Highly Predictive Model

• 18 accurate indicators or predictors (see next slide)

Predictor Analysis          % Encounters, Structured Data    % Encounters, Unstructured Data
Ejection Fraction (LVEF)    2%                               74%
Smoking Indicator           35% (65% Accurate)               81% (95% Accurate)
Living Arrangements         <1%                              73% (100% Accurate)
Drug and Alcohol Abuse      16%                              81%
Assisted Living             0%                               13%

What Really Causes Readmissions at Seton

Key Findings

97% at 80th percentile

49% at 20th percentile

(24)

The Cognos dashboard reporting system can help in monitoring the key clinical,

operational and financial metrics and, more importantly, in tracking down

the top-priority cases for case management.

Visualizing the Results: Readmissions Dashboard

1. Clinical Statistics: admission count, readmission count and readmission rate

2. Operational Statistic: counts of different length-of-stay periods

3. Financial Statistic: total direct cost by total admission and by readmission

4. Mortality: mortality rate

5. Average length of stay

6. Average direct cost by total admission and by readmission only

7. PA Model Score: distribution of propensity of readmission
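
Several of the dashboard figures are simple aggregates over encounter records. A minimal sketch of how such statistics could be computed; the record layout, field names and sample values are hypothetical, and this is not how the Cognos dashboard itself is implemented.

```python
from dataclasses import dataclass


@dataclass
class Encounter:
    """One hospital stay; all fields hold hypothetical illustration data."""
    length_of_stay_days: int
    direct_cost: float
    is_readmission: bool


def dashboard_stats(encounters: list[Encounter]) -> dict[str, float]:
    """Compute admission counts, readmission rate and simple averages."""
    admissions = len(encounters)
    if admissions == 0:
        raise ValueError("no encounters to summarize")
    readmissions = sum(1 for e in encounters if e.is_readmission)
    return {
        "admission_count": admissions,
        "readmission_count": readmissions,
        "readmission_rate": readmissions / admissions,
        "average_length_of_stay": sum(e.length_of_stay_days for e in encounters) / admissions,
        "average_direct_cost": sum(e.direct_cost for e in encounters) / admissions,
    }


sample = [
    Encounter(4, 8000.0, False),
    Encounter(6, 12000.0, True),
    Encounter(3, 7000.0, False),
    Encounter(7, 13000.0, True),
]
stats = dashboard_stats(sample)
print(stats["readmission_rate"], stats["average_length_of_stay"])  # 0.5 5.0
```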

(25)

(26)

USC Annenberg School of Communications

(27)

InfoSphere Streams

(28)

Big Data Platform Vision

Big Data Enterprise Engines

Big Data Solutions

Internet Scale Analytics

Streaming Analytics

Developers End Users Administrators

Big Data User Environments

Bringing Big Data to the Enterprise

Client and Partner Solutions

Open Source Foundational Components

Hadoop, MapReduce, HDFS, HBase, Pig, Lucene, Jaql

AGENTS INTEGRATION

Marketing: Unica

Warehouse Appliances: Netezza

Data Warehouse: InfoSphere Warehouse

Database: DB2

Analytics: SPSS

Business Intelligence: Cognos

Master Data Mgmt: InfoSphere MDM