Industry Perspective:
Big Data and Big Data Analytics
David Barnes, Program Director
Emerging Internet Technologies, IBM Software Group
What is Big Data?
The Adjacent Possible
Inexpensive disk
+ Increased processing power
+ Data Warehouse
+ The Web
+ X
= Big Data
X = sensors used to gather climate information, posts to social
media sites, digital pictures and videos, transaction records, cell
phone GPS signals, and more.
© 2010 IBM Corporation
161 exabytes of data were created in 2006 –
3 million times the amount of information contained
in all the books ever written.
In 2010 the number hit 988 exabytes.
IDC estimates that 1.8 zettabytes were created and
replicated in 2011.
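To put these figures on a common scale, the sketch below converts them to exabytes and computes the growth between years. It assumes decimal (SI) prefixes, which the slide does not state, and all names are illustrative.

```python
# Decimal (SI) prefixes are an assumption; the slide does not state a convention.
EXABYTE = 10**18
ZETTABYTE = 10**21

created_eb = {
    2006: 161.0,                             # exabytes, from the slide
    2010: 988.0,                             # exabytes, from the slide
    2011: 1.8 * (ZETTABYTE // EXABYTE),      # 1.8 zettabytes expressed in exabytes
}

def growth_factor(start_year, end_year):
    """Ratio of data created in end_year to data created in start_year."""
    return created_eb[end_year] / created_eb[start_year]

for year, eb in sorted(created_eb.items()):
    print(f"{year}: {eb:,.0f} EB")
print(f"2006 -> 2010: x{growth_factor(2006, 2010):.1f}")
print(f"2006 -> 2011: x{growth_factor(2006, 2011):.1f}")
```

Note that the 2011 figure counts data "created and replicated", so it may not be directly comparable with the earlier two.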
Every day, people create the equivalent of 2.5
quintillion bytes of data from sensors, mobile devices,
online transactions, and social networks.
Every month people send one billion Tweets and post
30 billion messages on Facebook.
90% (or more) of the world’s data is unstructured.
The true nature of information
Is noisy
Is oftentimes dirty
Is often full of valuable information
Unstructured Data
Big Data has swept into every industry
and business function.
Businesses need to put the power of Big
Data analytics in the hands of their
business employees; the term "Data Scientist" is
somewhat misleading.
“Leaders in every sector will have to
grapple with the implications of big
data, not just a few data-oriented
managers.”
– McKinsey Global Institute
The Big Data Imperative
Big Data Business Patterns
Computational Journalism
Chief Legal Officer
Retail Business Planner
IT Systems Management
Pharma - Clinical Trials
Business Fraud Detection
Evidence Based Medicine
Web Archiving
. . .
Today’s Problem
Data is growing at a compound annual growth rate of 60%
Storage capacity continues to increase dramatically
Storage access speeds have not kept up
At transfer speed of 500 MB/sec - 1 terabyte of data
will require ~30 mins to read from single drive
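The arithmetic behind these figures can be checked with a short calculation. This is a sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the function names are illustrative.

```python
import math

TERABYTE = 10**12   # bytes, decimal convention (an assumption)
MEGABYTE = 10**6

def read_time_seconds(size_bytes, rate_bytes_per_sec):
    """Time to read size_bytes sequentially from a single drive."""
    return size_bytes / rate_bytes_per_sec

def doubling_time_years(annual_growth):
    """Years for a quantity growing at annual_growth (0.60 = 60%) to double."""
    return math.log(2) / math.log(1 + annual_growth)

single_drive = read_time_seconds(1 * TERABYTE, 500 * MEGABYTE)
print(f"1 TB at 500 MB/s: {single_drive:.0f} s (~{single_drive / 60:.0f} min)")
print(f"Doubling time at 60%/year: {doubling_time_years(0.60):.2f} years")
```

Under these assumptions the single-drive read takes 2,000 seconds, in line with the roughly 30 minutes quoted, while the data volume doubles in about a year and a half.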
Enter Map/Reduce
• Automates the mechanisms of large-scale distributed computation (i.e., work
distribution, load balancing, replication, failure/recovery)
• Divide & Conquer: 1 terabyte split among 100 drives will require ~20 seconds
to read
• M/R parallel processing model provides cost effective framework for new generation
of analytic applications on unstructured or semi-structured data
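The Map/Reduce model described above can be illustrated with a minimal, single-process word count. This is a sketch only: it is not the Hadoop API, and it leaves out the work distribution, load balancing, replication, and failure recovery that a real framework automates. All function names are illustrative.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    # Each chunk could be mapped on a different machine; here it runs in-process.
    return reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))

def parallel_read_seconds(size_bytes, drives, rate_bytes_per_sec):
    """Read time when data is split evenly and all drives read at once."""
    return size_bytes / drives / rate_bytes_per_sec

chunks = ["big data big analytics", "data on many drives", "Big Data"]
print(word_count(chunks))
print(parallel_read_seconds(10**12, 100, 500 * 10**6))  # seconds, decimal units assumed
```

Because each chunk is mapped independently, the map step parallelizes across drives. The second function reproduces the ~20 second figure for 1 terabyte spread over 100 drives at 500 MB/sec each.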
Requirement: A New Class of Big Data Applications
Big Data analytics must be
brought to the line-of-business
user.
• Leverage easy-to-use
manipulation metaphors
• Use natural language
technologies for analytics
• Provide rich visualizations to
quickly identify insights
Demo
Buyer Sentiment Analysis
© 2010 IBM Corporation Sharenomics - Rise of Social Economy Slide
Social Media: Chilean Earthquake 2010
2010 Chilean earthquake fifth largest
earthquake in recorded history
The affected areas suffered major
devastation - buildings, airports,
hospitals, prisons, bridges, and roads
were severely damaged
Land-based communications systems
suffered major outages
The wireless 3G infrastructure remained
intact and operational
Social Media: Chilean Earthquake 2010
Social networking on wireless
networks became the major form of
communication
Extreme Blue students collected 226
million Tweets, then analyzed and categorized
them by incident type and location
Tweets included questions such as "Can I get food?",
"Can I get gas?", and "Are the bridges down?",
as well as images
The results were visualized
Completed in ~12 weeks
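The slides do not say how the Tweets were categorized. As one possible illustration, a keyword-based pass that tallies Tweets by incident type and location could look like the sketch below. The categories, keywords, and sample records are all hypothetical.

```python
from collections import Counter

# Hypothetical keyword lists; the real categories and method are not given in the slides.
CATEGORIES = {
    "food": ("food", "water"),
    "fuel": ("gas", "fuel"),
    "infrastructure": ("bridge", "road", "airport"),
}

def categorize(text):
    """Return the first category whose keyword appears in the tweet, else 'other'."""
    lowered = text.lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"

def tally(tweets):
    """Count tweets per (category, location) pair."""
    return Counter((categorize(t["text"]), t["location"]) for t in tweets)

# Made-up sample records for illustration only.
sample = [
    {"text": "Can I get food?", "location": "Concepcion"},
    {"text": "Can I get gas?", "location": "Santiago"},
    {"text": "Are the bridges down?", "location": "Concepcion"},
]
for (category, location), count in sorted(tally(sample).items()):
    print(category, location, count)
```

Plain substring matching is crude and will misfire on words that merely contain a keyword. At the scale of 226 million Tweets, a pass like this would more plausibly run as a map step in a Map/Reduce job.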
Big Data = Volume, Variety and Velocity
• Volume - Scale from terabytes to zettabytes
• Variety - Relational and non-relational data types from an ever-
expanding variety of sources
• Velocity - Streaming data and large volume data movement
The supercomputer is based on over 1,200 high-powered
IBM System x servers and can perform
150 trillion calculations per second -- equivalent
to 30 million calculations per Danish citizen per
second.
Vestas expects its data sets will grow to 20-plus
petabytes over the next four years.
© 2011 IBM Corporation
Seton Healthcare Family
Reducing CHF readmission to improve care
Business Challenge
Seton Healthcare strives to reduce the occurrence of high-cost Congestive Heart Failure (CHF) readmissions by proactively identifying patients likely to be readmitted on an emergent basis.
What’s Smart?
The IBM Content and Predictive Analytics for Healthcare solution will help to better target and understand high-risk CHF patients for care management programs by:
• Utilizing natural language processing to extract key elements from unstructured History and Physical, Discharge Summaries, Echocardiogram Reports, and Consult Notes
• Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data
• Providing an interface through which providers can intuitively navigate, interpret and take action
Smarter Business Outcomes
• Seton will be able to proactively target care management and reduce readmission of CHF patients.
• By teaming unstructured content with predictive analytics, Seton will be able to identify patients likely to be readmitted and introduce early interventions to reduce cost and mortality
IBM solution
• IBM Content and Predictive Analytics for Healthcare
• IBM Cognos Business Intelligence
• IBM BAO solution services
“IBM Content and Predictive Analytics for Healthcare uses the same type of natural language processing as IBM Watson, enabling us to leverage information in new ways not possible before. We can access an integrated view of relevant clinical and operational information to drive more informed decision making and optimize patient and operational outcomes.”
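The solution description refers to models with high positive predictive value (PPV): of the patients a model flags as likely readmissions, the fraction who are in fact readmitted. The sketch below shows how PPV is computed from predictions and outcomes. The labels are made up for illustration and are not Seton results.

```python
def positive_predictive_value(true_positives, false_positives):
    """Fraction of flagged patients who were actually readmitted."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no patients were flagged")
    return true_positives / flagged

def confusion_counts(predicted, actual):
    """Count true and false positives from parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    return tp, fp

# Made-up labels for illustration; these are not Seton results.
predicted = [True, True, True, False, True, False]
actual = [True, True, False, False, True, True]
tp, fp = confusion_counts(predicted, actual)
print(positive_predictive_value(tp, fp))
```

A high PPV matters for care management because each flagged patient consumes intervention resources, so false positives carry a direct cost.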
IBM Content and Predictive Analytics for Healthcare
The Seton CHF Readmission Solution
Unstructured Data
(Cerner Clinical Documentation:
History and Physical, Discharge Summary, Echocardiogram.)
Structured Data
(Avega Cost Data, DSS Admission History, DSS Procedure History, Cerner Clinical Events)
Raw Information
Search and Visually Explore (Mine)
Monitor, Dashboard and Report (Cognos BI)
Question and Answer*
Custom Solutions
Dynamic Multimode Interaction
IBM Content and Predictive Analytics
Content Analytics
• Natural Language Processing
• Medical Fact and Relationship Extraction (Annotation)
• Trend, Pattern, Anomaly, Deviation Analysis
Predictive Analytics
• Predictive Scoring and Probability Analysis
Analyzed and Visualized Information
Health Integration Framework
Data Warehouse and Model, Master Data Management, Advanced Case Management, Business Analytics
Partners (HLI) Specialized Research
IBM Watson for Healthcare
Confirm hypotheses or seek alternative ideas with confidence-based responses from learned knowledge*
Utilizing natural language processing to extract key elements from unstructured History and Physical and Discharge Summary
Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data
Providing an interface through which providers can intuitively navigate, interpret and take action
The Data We Thought Would Be Useful … Wasn’t
• 113 candidate predictors from structured and unstructured data sources
• Structured data was less reliable than unstructured data, which increased the reliance on unstructured data
New Unexpected Indicators Emerged … Highly Predictive Model
• 18 accurate indicators or predictors (see next slide)

Predictor Analysis          % Encounters           % Encounters
                            Structured Data        Unstructured Data
Ejection Fraction (LVEF)    2%                     74%
Smoking Indicator           35% (65% Accurate)     81% (95% Accurate)
Living Arrangements         <1%                    73% (100% Accurate)
Drug and Alcohol Abuse      16%                    81%
Assisted Living             0%                     13%
What Really Causes Readmissions at Seton
Key Findings
97% at 80th percentile
49% at 20th percentile
The Cognos dashboard reporting system can help monitor key clinical,
operational and financial metrics. More importantly, it helps track down
the top-priority cases for case management.
Visualizing the Results: Readmissions Dashboard
1. Clinical Statistics: admission count, readmission count and readmission rate
2. Operational Statistic: counts of different length of stay periods
3. Financial Statistic: total direct cost by total admission and by readmission
4. Mortality: mortality rate
5. Average length of stay
6. Average direct cost by total admission and by readmission only
7. PA Model Score: distribution of propensity of readmission
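Several of the dashboard metrics listed above are simple aggregates over encounter records. The sketch below computes a few of them. The record layout and field names are hypothetical, and the data is made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Encounter:
    # Hypothetical record layout; field names are not from the slides.
    length_of_stay_days: float
    direct_cost: float
    is_readmission: bool
    died: bool

def dashboard_metrics(encounters):
    """Compute summary metrics similar to those listed for the dashboard."""
    if not encounters:
        raise ValueError("no encounters supplied")
    total = len(encounters)
    readmits = [e for e in encounters if e.is_readmission]
    return {
        "admission_count": total,
        "readmission_count": len(readmits),
        "readmission_rate": len(readmits) / total,
        "mortality_rate": sum(e.died for e in encounters) / total,
        "avg_length_of_stay": sum(e.length_of_stay_days for e in encounters) / total,
        "avg_direct_cost": sum(e.direct_cost for e in encounters) / total,
        "avg_direct_cost_readmission": (
            sum(e.direct_cost for e in readmits) / len(readmits) if readmits else None
        ),
    }

# Made-up data for illustration only.
sample = [
    Encounter(4, 8000.0, False, False),
    Encounter(6, 12000.0, True, False),
    Encounter(2, 4000.0, False, False),
    Encounter(8, 16000.0, True, True),
]
print(dashboard_metrics(sample))
```

The PA Model Score panel would additionally need the per-patient readmission propensity produced by the predictive model, which is not sketched here.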
USC Annenberg School of Communications
InfoSphere Streams
Big Data Platform Vision
Big Data Enterprise Engines
Big Data Solutions
Internet Scale Analytics
Streaming Analytics
Developers End Users Administrators
Big Data User Environments
Bringing Big Data to the Enterprise
Client and Partner Solutions
Open Source Foundational Components
Hadoop MapReduce HDFS HBase Pig Lucene Jaql
AGENTS INTEGRATION
Marketing: Unica
Warehouse Appliances: Netezza
Data Warehouse: InfoSphere Warehouse
Database: DB2
Analytics: SPSS
Business Intelligence: Cognos
Master Data Mgmt: InfoSphere MDM