Big
Data
Analytics
Opportunities
and
Challenges
Anup Kumar,
Professor
and
Director
of
MINDS
Lab
Computer
Engineering
and
Computer
Science
Department
University
of
Louisville
Road
Map
•
Introduction
•
Hadoop Echo
System
•
Data
Analytics
Approaches
•
Processing
of
Structured
data
•
Processing
of
Unstructured
data
•
Example
Healthcare
Applications
•
Concluding
Observations
3
Big
Data
Applications
• Advertising and marketing
– Customer shopping patterns
– Response to promotional campaign • Manufacturing
– Maintenance of machine health • Social Media
– Browsing and sentiment analysis – Impact on buying patterns
• Government data
Big
Data
Applications
(cont’d)
•
Stock
Market
–
Stock
performance
prediction
•
Healthcare Management
–
Patient health monitoring
–
Impact of preventive care
•
Financial
Institutions
–
Fraud
detection
and
mitigation
•
Weather
5
Big
Data
Analytics:
Benefits
•
Saving
money
–
Fraud
detection
in
medical
industry
–
Risk
Management
–
Lower
cost
and
better
outcomes
•
Real
‐
time
decision
making
–
Sensor
data
analysis
–
Influence
of
social
sentiment
on
patient
health
•
Getting
to
know
your
customer
better
–
Targeted
advertisement
Components
of
Big
Data
• Volume
– Data generated by 2020 will be in Zettabytes (1021)
– Soon after that the measure will be Yottabyte (1024) and Brontabyte (1027)
• Variety
– Structured (transactional data)
– Unstructured (image, video and text data) • Velocity
– Rate of data generation
– Increase in the ability to process data • Variability/Veracity
7
• Understanding Advanced Analytics – Association analysis
– Clustering, Classification, Regression – Recommendation framework
– Time series analysis
• Handling volume and velocity of data – Map/Reduce framework
– Cloud computing framework – Custom frameworks
• Dealing with Variety of Data
– Structured Data Analytics (Advanced Analytics) – Unstructured Data Analytics (Text Processing)
Key
Components
of
Big
Data
• Descriptive analytics :
– Specifies the data characteristics • How to describe the system?
• What happened in the system and when? • What are the parameters in the system?
• What is the impact of a parameter on the system? • Is there any correlation between the parameters?
• Predictive analytics: uses data mining and predictive modeling – It can answer the following questions
• What are the future trends?
• What is the decision based on past history?
9
• Hadoop Based Implementation of Analytics • In‐Database Analytics
• Allows analytics to be carried out on any type of data store • Provides a standard framework for computation
– On cloud environment – On in‐premises network • Advantages
– Vendor‐independent
– Allows any type of analytics to be carried out • Disadvantages
– Complex implementation
11
• Allows analytic computation to be carried out in the database – Uses:
• SQL
• SQL extensions • Advantages
– Higher analytics efficiency, easy usability, better database manageability – Analytic centralization may allow easy security, data, and version
management • Disadvantages
– Vendor dependent – Cost
• Structured Data Analytics – Association analysis – Clustering – Classification – Regression – Recommendation framework – Time series analysis
• Unstructured Data Analytics – Association analysis
– Clustering – Classification
13
Road
Map
•
Introduction
•
Hadoop Echo
System
•
Data
Analytics
Approaches
•
Processing
of
Structured
data
•
Processing
of
Unstructured
data
•
Example
Healthcare
Applications
•
Concluding
Observations
• Open source from Apache
– Free download from http://hadoop.apache.org
• Runs on commodity hardware
– Lowers procurement, licensing, and operational costs • Scalable
– Incrementally add hardware nodes as needed
• Can accommodate growth in data and processing requirements • Fault tolerant
– Automatic data replication and task reallocation
15
Core
Components
of
Hadoop
Big
Data
Architecture
Data Storage (HDFS) Data Processing (MapReduce) Security Operations Integration
•
Designed
for
large
‐
scale
data
processing
–
Allows
storage
of
large
data
files
•
Runs
on
top
of
native
file
system
on
each
node
•
Data
is
stored
in
units
known
as
blocks
–
64MB
(default)
although
normally
larger
•
Large
files
are
stored
in
many
blocks
over
many
nodes
•
Blocks
are
replicated
to
multiple
nodes
Hadoop
Distributed
File
System
17
HDFS
Blocks
and
Replication
File Block 3 Block 2 Block 1 Node 1 Block 1 Block 2 Node 2 Block 1 Node 3 Block 2 Block 3 Block 3
•
Nodes
in
Hadoop
have
different
roles
to
play
•
HDFS
nodes
–
NameNodes
–
DataNodes
•
MapReduce
nodes
–
JobTracker
–
TaskTracker
19
Hadoop
Node
Interaction
DataNode DataNode DataNode DataNode NameNode TaskTracker TaskTracker JobTracker Hadoop Cluster MapReduce HDFS Start
•
Receives
jobs
from
clients
and
manages
jobs
–
Master
in
master
‐
slave
architecture
with
TaskTrackers
•
Determines
job
execution
plan
–
Assigns
nodes
to
different
processing
tasks
–
Monitors
tasks
as
they
execute
•
Detects
failures
and
restarts/re
‐
runs
tasks
21
•
Manages
individual
tasks
that
the
JobTracker
issues
–
Can
be
map
or
reduce
jobs
•
Single
TaskTracker
can
run
many
map
or
reduce
tasks
in
parallel
•
Keeps
JobTracker
updated
with
task
progress
using
heartbeat
signal
–
Failure
to
send
heartbeat
results
in
JobTracker
resubmitting
task
•
To
another
TaskTracker
node
•
HDFS
is
also
a
master
‐
slave
architecture
–
NameNode
is
the
HDFS
master
directing
DataNode
slave
nodes
•
DataNodes
perform
low
‐
level
input/output
–
This
is
where
data
is
stored
•
Keeps
track
of
files
–
How
they
are
broken
into
blocks
–
Which
nodes
blocks
are
stored
on
•
Only
one
per
Hadoop
cluster!
–
Single
point
of
failure,
so
should
not
use
commodity
hardware
23
•
Performs
low
‐
level
work
reading
and
writing
HDFS
blocks
to
local
file
system
•
Communicates
with:
–
NameNode
•
Provides
information
about
which
blocks
they
are
storing
–
DataNodes
to
replicate
data
blocks
•
Many
per
Hadoop
cluster
–
Normally
run
on
inexpensive
commodity
hardware
• Hadoop can be configured to run in one of three modes – Local (standalone) mode
– Pseudo‐distributed mode – Fully distributed mode
• Four key configuration files define which mode to run in
– core-site.xml
– hdfs-site.xml
– mapred-site.xml
25
•
This
is
the
default
mode
for
Hadoop
•
No
assumptions
are
made
about
hardware
–
Does
not
use
HDFS
•
Used
for
developing
and
debugging
application
logic
of
Hadoop
programs
•
Hadoop
runs
in
clustered
mode
–
Cluster
size
is
one
•
Useful
for
developing
and
debugging
Hadoop
applications
–
Enables
examination
of:
•
Memory
and
HDFS
usage
•
Configuration
files
used
to
set
port
numbers
used
for
NameNode
and
JobTracker
27
•
This
is
the
mode
for
running
Hadoop
applications
in
production
•
Configuration
requires
setup
of
–
Master
nodes
•
Host
of
NameNode
and
JobTracker
components
–
Slave
nodes
•
Machines
running
DataNode
and
TaskTracker
components
•
NameNode
hosts
a
report
on
HDFS
status
on
port
number
50070
–
Enables
checking
of
each
DataNode
in
cluster
•
JobTracker
provides
status
report
on
MapReduce
jobs
on
port
50030
–
Reports
about
ongoing
jobs
•
Hadoop is
powerful
for
processing
large
datasets
–
Requires
significant
programming
skills
to
develop
applications
•
Pig
is
a
Hadoop extension
that
simplifies
Hadoop programming
•
It
is
Easy
to
use
•
It
is
Higly scalable
•
There
are
two
major
components
of
Pig
–
A
data
processing
language
called
Pig
Latin
–
A
compiler
to
generate
Pig
Latin
to
Hadoop programs
29
•
A
data
warehousing
package
built
on
top
of
Hadoop
–
Originally
developed
at
•
Target
audience
for
Hive
is
data
analysts
comfortable
with
SQL
–
Provides
access
to
ad
hoc
queries
and
data
analysis
•
On
large
‐
scale
datasets
•
Focuses
on
structured
data
–
Enables
optimizations
to
be
performed
•
Provides
SQL
‐
like
language
HiveQL
–
Based
on
tables,
rows,
columns
–
No
need
to
know
about
Hadoop
programming
•
Hive
requires
a
metastore
service
–
Provides
schemas
to
structure
the
data
•
Helps
with
efficient
querying
and
processing
–
Implemented
using
tables
in
a
relational
database
•
By
default,
Hive
uses
Java’s
built
‐
in
Derby
database
•
Hive
can
be
accessed
from
–
Command
‐
line
interface
–
Web
interface
known
as
Hive
web
Interface
31
• Secure authorization
– Ability to control and enforce access to data
– Grant privileges on data for authenticated users. • Fine‐grained access control
– Control at the database, table, and view level – As an example of fine grained control
• One group can see all data
• Another group can see only non‐sensitive data filtered by a view • Sentry provides unified security platform:
– Existing Hadoop Kerberos security for authentication
– Sentry access policy applied to all applications e.g. Pig, Hive etc
Introduction
to
Apache
Sentry
Introduction
to
Oozie
•
Oozie is
a
Hadoop workflow
engine
–
Manages
data
processing
activities
–
Available
from
http://oozie.apache.org
•
Simplifies
the
management
of
–
Recurring
and
data
driven
jobs
–
Re
‐
execution
of
failed
jobs
•
Comprised
of
–
A
workflow
engine
•
Can
execute
different
types
of
jobs
33
Oozie Workflows
• Oozie workflows are directed acyclic graphs (DAGs) • Graph nodes are either
– Action nodes
• Perform a workflow task – Control flow nodes
• Determine the logic between tasks
• There is always exactly one start node and at least one terminal node
start action
action
terminal
Pig,
Hive,
and
Impala
in
Big
Data
Architecture
HDFS Map Reduce
Security Operations Integration Pig Hive Mahout Impala
Flume
35
Road
Map
•
Introduction
•
Hadoop Echo
System
•
Data Analytics Approaches
•
Processing
of
Structured
data
•
Processing
of
Unstructured
data
•
Example
Healthcare
Applications
•
Concluding
Observations
Hadoop Based
Analytics
•
Allows
analytics
to
be
carried
out
on
any
type
of
data
store
•
Provides
standard
framework
for
computation
–
On
cloud
environment
–
On
in
‐
premises
network
•
Advantages
–
Vendor
independent
–
Allows
any
type
of
analytics
to
be
carried
out
–
Provides
flexible
and
adaptable
architecture
for
implementation
37
•
Allows
use
of
large
computational
resources
–
Inter
cluster
communication
is
managed
by
MapReduce
•
MapReduce architecture
supports
–
Data
and
task
distribution
–
Fault
monitoring
–
Task
and
data
replication
–
Simple
programming
model
•
Limitations
of
MapReduce
–
Cannot
solve
all
the
problems
•
It
can
solve
many
Big
Data
problems
–
Data
filtering
–
Statistics
and
aggregation
–
Graph
analytics
–
Decision
Tree
and
classification
–
Clustering
and
recommendations
•
Practical
Distributed
API
–
Easier
to
understand
and
use
•
Higher
level
APIs
exist
–
To
reduce
the
complexity
of
programming
39
Phases
in
MapReduce Processing
•
Data
Processing
with
MapReduce goes
through
three
phases
–
Map
Phase
•
Processes
the
data
and
generate
<key,
Value>
pairs
–
Shuffle
Phase
•
Moves
the
data
<key,
value>
pairs
to
appropriate
processing
node
for
reduction
–
Reduce
Phase
•
Processes
the
data
<key,
value>
pairs
to
generate
final
output
• In order to compute support
– The number of times each product and its combinations occur in the data
has to be calculated
• The original transaction file format – 1, milk (M), bread (B)
– 2, bread (B), butter (T)
– 3, milk (M), bread (B), butter(T) – 4, milk(M)
• The input data file would be: – M B MB
– B T BT
– M B T MB MT BT MBT
41
Association
Rule
MapReduce
M B MB B T BT M B T MB MT BT MBT M M B T BT M B MB M,1 B ,1 T, 1 BT,1 M,1 B,1 MB, 1 M,1 BMT,1 B,3 BT,2 BMT,1 M,3 B,3 T,2 MB,2 BT,2 MT,2 BMT,1 Input Splitting Mapping Shuffling Reducing
M,1 M,1 MB,1 M B T MB MT BT MBT M, 1 B ,1 T, 1 MB,1 BT,1 MT,1 MBT,1 B,1 B,1 B,1 T,1 T,1 MB,1 BT,1 BT,1 MT,1 MT,1 M,3 T,2 MB,2 MT,2
Road
Map
•
Introduction
•
Hadoop Echo
System
•
Data
Analytics
Approaches
•
Processing
of
Structured
data
•
Processing
of
Unstructured
data
•
Example
Healthcare
Applications
•
Concluding
Observations
43
Steps
in
Structured
Data
Analytics
•
Step
1:
Business
domain
analysis
•
Step
2:
Data
exploration
and
investigation
•
Step
3:
Data
preparation
and
cleaning
•
Step
4:
Model
design
and
development
•
Step
5:
Model
verification
and
testing
•
Descriptive
analytics
provides
knowledge
discovered
from
the
data
set
–
Examples
include
•
Clustering
•
Correlation
analysis
•
Pattern
discovery
•
Association
rules
•
Predictive
analytics
builds
a
model
to
predict
future
behavior
–
Examples
include
•
Regression
models
45
• Open source machine learning library
– Leveraging the power of
• MapReduce
• Hadoop
• Currently available algorithms are
– Various clustering algorithms
– Singular value decomposition
– Parallel Frequent Pattern mining
– Naive Bayes classifier
– Random forest decision tree‐based classifier
– Collaborative filtering
– User‐ and item‐based recommenders – Others
Mahout
Architecture
Storage Storage Storage Compute Compute Storage Storage Storage Compute Compute Hadoop/MapReduceMahout: Machine Learning Library
Clustering, Classification, Recommendation, Filtering, etc.
47
•
Command
line
interaction
–
Allows
execution
of
machine
learning
algorithms
using
Mahout
commands
–
Pros:
Easy
use
of
algorithms
–
Cons:
Have
to
specify
large
number
of
options;
fixed
number
of
options
are
available
•
Application
Program
Interface–based
interaction
–
Allows
execution
of
machine
learning
algorithms
in
the
source
code
–
Pros:
More
flexibility
is
available
in
using
the
algorithms
–
Cons:
Have
to
know
programming
Clustering:
Introduction
•
Clustering
is
the
grouping
of
available
data
based
on
similarities
on
some
pre
‐
specified
criteria
–
Uses
unsupervised
learning
techniques
–
Uses
statistical
methods
•
A
way
of
looking
for
patterns
or
structure
in
the
data
that
are
of
interest
•
Allows
partitioning
of
data
in
different
groups
–
Identifies
a
set
of
patterns
and
structures
•
Can
be
used
as
a
standalone
technique
to
gain
insight
into
49
•
Two
core
components
of
clustering
are
–
An
algorithm
to
group
items
in
clusters
•
Kmeans
clustering
•
Hierarchical
clustering
•
Many
others
–
Concept
of
similarity
and
dissimilarity
•
This
specifies
how
close
various
data
locations
are
Business
Applications
for
Clustering
• Market research – Segmentation – Targeted advertisements – Customer categorization • Government management
– Crime hot spot analysis
– Census data for economic and social analysis • Inventory management
51
•
Select
features
from
the
data
sets
•
Choose
clustering
approach
•
Select
parameters
for
clustering
/bin/mahout kmeans
-i input-file
-o output
-c clusters
-dm
org.apache.mahout.common.distance.CosineDistanceMeasure-x 5
-ow
-cd 1
-k 25
53
• Association rule (AR) mining involves finding one of the following from among
a set of data
– Frequent patterns
– Associations
– Correlation
– Causal structures
• AR is used to find interesting relationships between variables in transaction‐
oriented data sets
• Association rules are of the form
– {X} ‐> {Y } with [support, confidence, lift]
• Cross‐selling and up‐selling
– Market basket analysis • Catalog design
– Keeping the related items together • Planning and monitoring
– Network design based on web logs
– Planning web design using web usage log mining
• In general, any decision based on knowledge of previous transactions by sets
of users
Business
Applications
of
55
• Classification separates data into various categories
– Comprises assigning a class label to a set of unclassified data sets
• The classification of an unknown observation to a class is based on the model
– Thus, classification is categorized as supervised learning
• Classification is a two‐step process – Model training and creation
• Based on training data set where the observations have already been
classified – Model testing
• Apply the model to test data to classify them in appropriate categories
• Training examples
– Complete set of inputs for model development and testing • Training data
– Subset of training examples used for building the model • Model
– Set of rules generated after training the classifier • Test data
– Remaining examples from training examples not used in training data to
study how effective the model was • Predictor variables
– Set of variables that are used to decide the output for target variable • Target variable
57
•
The
general
structure
for
input
for
training
is
–
TV,
PV
1,
PV
2,
PV
3,
………..,PV
n•
The
general
structure
for
test
input
is
–
??,
PV
1,
PV
2,
PV
3,
………..,PV
n–
??
refers
to
the
target
variable
that
needs
to
be
evaluated
based
on
predictor
variable
The
Structure
of
Input
for
Mahout
Classifier
PV = predictor variable TV = target variable TV, PV1, PV2, PV3, ………..,PVn TV, PV1, PV2, PV3, ………..,PVn TV, PV1, PV2, PV3, ………..,PVn Classifier model generation Classifier generated model ??, PV1, PV2, PV3, ………..,PVn ??, PV1, PV2, PV3, ………..,PVn ??, PV1, PV2, PV3, ………..,PVn TV TV TV
•
Loan
application
processing
•
Categorization
of
customer
types
•
Identification
of
product
classes
•
Classification
of
a
privacy
policy
as
safe
or
unsafe
•
Medical
diagnosis
•
Targeted
marketing
Business
Applications
of
59
•
Decision
tree
•
Random
forest
•
Many
others
• This provides tree structure with
– Internal nodes representing a test on a certain attribute
• The result of the test is represented by the branch at internal node – A class is represented by a leaf node
• Starts with the root node and splits into multiple branches
– May further split into new nodes
– Ends in leaf nodes that contain the “decisions”
• Model can be in the form of a set of rules or a decision tree – Rules specify the decision‐making process of the model • Uses recursive partitioning approach (divide and conquer)
– The process stops when partitioning does not improve the outcome
61
• Stochastic Gradient Descent (SGD)
– Can run in sequential, online, and incremental mode
– Data set of less than tens of millions is suitable for processing • Naive Bayes
– Can run in parallel mode
– Data set of millions to hundreds of millions is suitable for processing • Random forest
– Can run in parallel mode
– Data set of less than tens of millions is suitable for processing
Algorithms
Supported
by
Mahout
for
•
Training:
bin/mahout trainlogistic [options]
Command
for
Training
the
SGD
Classifier
Major Options Explanation
--input <file> Input file --output <file> Output file --target <variable> Target variable --categories <n> Possibilities in TV --predictors <pv1>….<pvn> Predictor variable --types <tpv1> …,<tpvn> Types of PVs
--passes Number of times input should be
63
•
Evaluating:
bin/mahout runlogistic
[options]
Command
for
Evaluating
the
SGD
Classifier
Options Explanation
--auc Show AUC score
--scores Show TV and scores for each input --confusion Display confusion matrix
--input <file> Input file
--model <model> Read model from specified file --quiet Generate less details of execution
•
Medicine
– Disease recommendation – Drug recommendation – Case based search
•
Marketing
– Cell phone companies for identifying users that may switch – Recommending books at Amazon
– Recommending products on the web sites
•
Education
– Universities guiding students what courses to take – Conference organizers assigning papers to reviewers
Application
for
Recommendation
65
Road
Map
•
Introduction
•
Hadoop Echo
System
•
Data
Analytics
Approaches
•
Processing
of
Structured
data
•
Processing
of
Unstructured
data
•
Example
Healthcare
Applications
•
Concluding
Observations
Scope
of
Text
Mining
Text
Mining
Data Mining Information Retrieval Computational Linguistics Web Mining Natural Language Processing Statistics67
•
Each
document
text
may
contain
large
amounts
of
text
–
High
dimensionality
•
Ambiguity
of
content
due
to
language
features
•
Sematic
issues
–
Words
and
phrases
may
not
be
semantically
independent
•
Complexity
of
natural
language
processing
•
Feature
selection
–
Determine
nGrams necessary
•
Text
Preprocessing
–
Removal
of
numbers
–
Removal
of
punctuation
marks
–
Text
case
conversion
as
needed
–
Stop
word
removal
(can
use
pre
‐
specified
list
or
generic
list)
–
Stemming
(identify
word
by
its
root)
69
•
Most
common
words
in
English
that
do
not
contribute
to
classification,
clustering
or
association
are:
–
Articles
– a,
an,
the
–
Conjunctions
– and,
or…
–
Prepositions
– as,
by,
of
…
–
Pronouns
– you,
she,
he,
it
…
•
Text
documents
are
high
‐
dimension
data
–
Removal
of
stop
words
acts
as
technique
for
dimensionality
reduction
•
Other
non
‐
context
related
words
can
also
be
removed
•
The
process
for
reducing
inflected
(or
sometimes
derived)
words
to
their
stem,
base
or
root
form
–
Typically
achieved
by
removing
– ing,
‐
s,
‐
er
‐
ed etc.
–
For
example:
“mining”,
“miner”,
“mines”,
“
mined”
–
Stemmed
word
“mine
”
•
Common
Algorithms
are
–
Porter’s
Algorithm
–
KSTEM
Algorithm
71
•
Loading
the
data
•
Text
preprocessing
(as
needed)
–
Cleaning
–
Punctuation
removal
–
Number
removal
–
Stop
word
removal
–
Stemming
•
Building
term
document
matrix
•
Finding
frequent
term
association
73
Building
a
Term
Document
Matrix
OUTPUT:
A term-document matrix (16 terms, 20 documents) Non-/sparse entries: 11/309
Sparsity : 97% Maximal term length: 21
Weighting : term frequency (tf) Docs Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 data 1 1 0 0 2 0 0 0 0 0 1 2 1 1 1 0 1 0 0 0 databases 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 dataframe 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 datasets 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 datasetshttptconxrbuh 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 datastream 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 datatable 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 details 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 detection 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 development 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
75
Road
Map
•
Introduction
•
Hadoop Echo
System
•
Data
Analytics
Approaches
•
Processing
of
Structured
data
•
Processing
of
Unstructured
data
•
Example Healthcare Applications
77
Healthcare
Expenses
Hersh, W., Jacko, J. A., Greenes, R., Tan, J., Janies, D., Embi, P. J., & Payne, P. R. (2011). Health-care hit or miss? Nature, 470(7334), 327.
Healthcare
Data
Types
EHR
Public
Health
Social
Behavioral
Public
Health
79
Healthcare
Data
•
Billing
Data
–
International
Classification
of
Diseases
(
ICD)
–
Current
Procedural
Terminology
(CPT)
•
Lab
results
–
Logical
Observation
Identifiers
Names
and
Codes
(LOINC)
•
Medication
–
National
Drug
Code
(NDC)
by
Food
and
Drug
Healthcare
Data
(Cont’d)
•
Clinical
notes
–
Unstructured
text
data
•
Image
Data
–
Unstructured
data
•
Social
Interaction
data
81
Analytic
Platform
Structured EHR Unstructured EHR Patients / Context Feature Selection Clustering Classification RecommendationApplications
of
Patient
Similarity
83
•
Heart
Failure
Prediction
•
Likelihood
of
Blood
Pressure
onset
•
Disease
recommendation
Information
Analysis
Options
•
Image
‐
based
Retrieval
Image
Query
•
Image
‐
based
Retrieval
–
Given
a
query
image
and
find
the
most
similar
images
•
Case
‐
based
Retrieval
–
Given
a
case
description,
details
of
the
symptoms,
tests
including
images
–
Find
similar
cases
including
images
with
case
descriptions
Road
Map
•
Introduction
•
Hadoop Echo
System
•
Data
Analytics
Approaches
•
Processing
of
Structured
data
•
Processing
of
Unstructured
data
•
Example
Healthcare
Applications
87
Benefits
of
Health
Care
Analytics
•
Better
diagnosis
•
Better
health
care
delivery
•
Better
value
for
patient,
provider
and
payer
•
Better
innovation
Big
Data
Analysis
Challenge
Distributed Programming Skills Domain Expertise Machine Learning Background89