Big Data Analytics Opportunities and Challenges

(1)

Anup Kumar, Professor and Director of MINDS Lab
Computer Engineering and Computer Science Department
University of Louisville

(2)

Road Map
• Introduction
• Hadoop Ecosystem
• Data Analytics Approaches
• Processing of Structured Data
• Processing of Unstructured Data
• Example Healthcare Applications
• Concluding Observations

(3)

Big Data Applications
• Advertising and marketing
  – Customer shopping patterns
  – Response to promotional campaign
• Manufacturing
  – Maintenance of machine health
• Social Media
  – Browsing and sentiment analysis
  – Impact on buying patterns
• Government data

(4)

Big Data Applications (cont’d)
• Stock Market
  – Stock performance prediction
• Healthcare Management
  – Patient health monitoring
  – Impact of preventive care
• Financial Institutions
  – Fraud detection and mitigation
• Weather

(5)

Big Data Analytics: Benefits
• Saving money
  – Fraud detection in medical industry
  – Risk management
  – Lower cost and better outcomes
• Real-time decision making
  – Sensor data analysis
  – Influence of social sentiment on patient health
• Getting to know your customer better
  – Targeted

(6)

Components of Big Data
• Volume
  – Data generated by 2020 will be in zettabytes (10^21 bytes)
  – Soon after that the measure will be yottabytes (10^24 bytes) and brontobytes (10^27 bytes)
• Variety
  – Structured (transactional data)
  – Unstructured (image, video and text data)
• Velocity
  – Rate of data generation
  – Increase in the ability to process data
• Variability/Veracity

(7)

Key Components of Big Data
• Understanding Advanced Analytics
  – Association analysis
  – Clustering, Classification, Regression
  – Recommendation framework
  – Time series analysis
• Handling volume and velocity of data
  – Map/Reduce framework
  – Cloud computing framework
  – Custom frameworks
• Dealing with Variety of Data
  – Structured Data Analytics (Advanced Analytics)
  – Unstructured Data Analytics (Text Processing)

(8)

• Descriptive analytics:
  – Specifies the data characteristics
    • How to describe the system?
    • What happened in the system and when?
    • What are the parameters in the system?
    • What is the impact of a parameter on the system?
    • Is there any correlation between the parameters?
• Predictive analytics: uses data mining and predictive modeling
  – It can answer the following questions
    • What are the future trends?
    • What is the decision based on past history?

(9)

• Hadoop-based implementation of analytics
• In-database analytics

(10)

• Allows analytics to be carried out on any type of data store
• Provides a standard framework for computation
  – On cloud environment
  – On on-premises network
• Advantages
  – Vendor-independent
  – Allows any type of analytics to be carried out
• Disadvantages
  – Complex implementation

(11)

• Allows analytic computation to be carried out in the database
  – Uses:
    • SQL
    • SQL extensions
• Advantages
  – Higher analytics efficiency, easy usability, better database manageability
  – Analytic centralization may allow easy security, data, and version management
• Disadvantages
  – Vendor dependent
  – Cost
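As a rough illustration of the in-database approach, the sketch below uses Python's built-in sqlite3 module as a stand-in for an analytic database. The `sales` table, its columns, and the sample rows are hypothetical. The point it tries to show is that grouping and aggregation run inside the SQL engine, so only the small summary leaves the database.

```python
import sqlite3

# Hypothetical transactional data; table and column names are illustrative only.
ROWS = [
    ("north", 120.0),
    ("north", 80.0),
    ("south", 200.0),
    ("south", 100.0),
    ("south", 60.0),
]

def regional_summary(rows):
    """Compute per-region count, total and mean inside the database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        # The analytic work (grouping and aggregation) runs in SQL,
        # so only the small result set is returned to the client.
        cursor = conn.execute(
            "SELECT region, COUNT(*), SUM(amount), AVG(amount) "
            "FROM sales GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()

summary = regional_summary(ROWS)
for region, n, total, mean in summary:
    print(f"{region}: n={n} total={total:.1f} mean={mean:.1f}")
```

A production in-database deployment would rely on the vendor's own SQL extensions, which is where the vendor dependence noted above comes from.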

(12)

• Structured Data Analytics
  – Association analysis
  – Clustering
  – Classification
  – Regression
  – Recommendation framework
  – Time series analysis
• Unstructured Data Analytics
  – Association analysis
  – Clustering
  – Classification

(13)


(14)

• Open source from Apache
  – Free download from http://hadoop.apache.org
• Runs on commodity hardware
  – Lowers procurement, licensing, and operational costs
• Scalable
  – Incrementally add hardware nodes as needed
    • Can accommodate growth in data and processing requirements
• Fault tolerant
  – Automatic data replication and task reallocation

(15)

Core Components of Hadoop Big Data Architecture
[Figure: architecture components: Data Storage (HDFS), Data Processing (MapReduce), Security, Operations, Integration]

(16)

Hadoop Distributed File System
• Designed for large-scale data processing
  – Allows storage of large data files
• Runs on top of native file system on each node
• Data is stored in units known as blocks
  – 64 MB (default), although normally larger
• Large files are stored in many blocks over many nodes
• Blocks are replicated to multiple nodes

(17)

HDFS Blocks and Replication
[Figure: a file split into Block 1, Block 2 and Block 3, with copies of the blocks distributed across Node 1, Node 2 and Node 3]
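The block and replication idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HDFS code: the 150 MB file, the node names, the replication factor of 2, and the round-robin placement are all hypothetical choices made for illustration, and real HDFS uses its own placement policy.

```python
import math

def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the size in MB of each block a file would occupy."""
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    sizes = [block_size_mb] * n_blocks
    # The last block only holds the remainder of the file.
    remainder = file_size_mb % block_size_mb
    if remainder:
        sizes[-1] = remainder
    return sizes

def place_replicas(n_blocks, nodes, replication=2):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an illustrative assumption, not HDFS's real policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(n_blocks):
        placement[block + 1] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement

sizes = split_into_blocks(150)  # a hypothetical 150 MB file
placement = place_replicas(len(sizes), ["node1", "node2", "node3"])
print(sizes)
print(placement)
```

Because every block lives on more than one node, losing a single node in this model never loses a block, which is the property the replication slide is describing.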

(18)

• Nodes in Hadoop have different roles to play
• HDFS nodes
  – NameNodes
  – DataNodes
• MapReduce nodes
  – JobTracker
  – TaskTracker

(19)

Hadoop Node Interaction
[Figure: Hadoop cluster. MapReduce layer: JobTracker and TaskTrackers. HDFS layer: NameNode and DataNodes]

(20)

• Receives jobs from clients and manages jobs
  – Master in master-slave architecture with TaskTrackers
• Determines job execution plan
  – Assigns nodes to different processing tasks
  – Monitors tasks as they execute
• Detects failures and restarts/re-runs tasks

(21)

• Manages individual tasks that the JobTracker issues
  – Can be map or reduce tasks
• Single TaskTracker can run many map or reduce tasks in parallel
• Keeps JobTracker updated with task progress using heartbeat signal
  – Failure to send heartbeat results in JobTracker resubmitting task
    • To another TaskTracker node

(22)

• HDFS is also a master-slave architecture
  – NameNode is the HDFS master directing DataNode slave nodes
• DataNodes perform low-level input/output
  – This is where data is stored
• Keeps track of files
  – How they are broken into blocks
  – Which nodes blocks are stored on
• Only one per Hadoop cluster!
  – Single point of failure, so should not use commodity hardware

(23)

• Performs low-level work reading and writing HDFS blocks to local file system
• Communicates with:
  – NameNode
    • Provides information about which blocks they are storing
  – DataNodes to replicate data blocks
• Many per Hadoop cluster
  – Normally run on inexpensive commodity hardware

(24)

• Hadoop can be configured to run in one of three modes
  – Local (standalone) mode
  – Pseudo-distributed mode
  – Fully distributed mode
• Four key configuration files define which mode to run in
  – core-site.xml
  – hdfs-site.xml
  – mapred-site.xml

(25)

• This is the default mode for Hadoop
• No assumptions are made about hardware
  – Does not use HDFS
• Used for developing and debugging application logic of Hadoop programs

(26)

• Hadoop runs in clustered mode
  – Cluster size is one
• Useful for developing and debugging Hadoop applications
  – Enables examination of:
    • Memory and HDFS usage
    • Configuration files used to set port numbers used for NameNode and JobTracker

(27)

• This is the mode for running Hadoop applications in production
• Configuration requires setup of
  – Master nodes
    • Host of NameNode and JobTracker components
  – Slave nodes
    • Machines running DataNode and TaskTracker components
• NameNode hosts a report on HDFS status on port number 50070
  – Enables checking of each DataNode in cluster
• JobTracker provides status report on MapReduce jobs on port 50030
  – Reports about ongoing jobs

(28)

• Hadoop is powerful for processing large datasets
  – Requires significant programming skills to develop applications
• Pig is a Hadoop extension that simplifies Hadoop programming
• It is easy to use
• It is highly scalable
• There are two major components of Pig
  – A data processing language called Pig Latin
  – A compiler that translates Pig Latin into Hadoop programs

(29)

• A data warehousing package built on top of Hadoop
  – Originally developed at Facebook
• Target audience for Hive is data analysts comfortable with SQL
  – Provides access to ad hoc queries and data analysis
    • On large-scale datasets
• Focuses on structured data
  – Enables optimizations to be performed
• Provides SQL-like language HiveQL
  – Based on tables, rows, columns
  – No need to know about Hadoop programming

(30)

• Hive requires a metastore service
  – Provides schemas to structure the data
    • Helps with efficient querying and processing
  – Implemented using tables in a relational database
    • By default, Hive uses an embedded Apache Derby database
• Hive can be accessed from
  – Command-line interface
  – Web interface known as Hive Web Interface

(31)

Introduction to Apache Sentry
• Secure authorization
  – Ability to control and enforce access to data
  – Grant privileges on data for authenticated users
• Fine-grained access control
  – Control at the database, table, and view level
  – As an example of fine-grained control
    • One group can see all data
    • Another group can see only non-sensitive data filtered by a view
• Sentry provides unified security platform:
  – Existing Hadoop Kerberos security for authentication
  – Sentry access policy applied to all applications, e.g. Pig, Hive, etc.

(32)

Introduction to Oozie
• Oozie is a Hadoop workflow engine
  – Manages data processing activities
  – Available from http://oozie.apache.org
• Simplifies the management of
  – Recurring and data-driven jobs
  – Re-execution of failed jobs
• Comprised of
  – A workflow engine
    • Can execute different types of jobs

(33)

Oozie Workflows
• Oozie workflows are directed acyclic graphs (DAGs)
• Graph nodes are either
  – Action nodes
    • Perform a workflow task
  – Control flow nodes
    • Determine the logic between tasks
• There is always exactly one start node and at least one terminal node
[Figure: start → action → action → terminal]
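The DAG structure of a workflow can be sketched with Python's standard graphlib module (Python 3.9 or later). This is only a model of the graph idea: Oozie itself defines workflows in XML, and the node names and helper functions below are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each node maps to the set of nodes it depends on.
WORKFLOW = {
    "start": set(),
    "ingest": {"start"},
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"clean"},
    "end": {"aggregate", "report"},
}

def execution_order(workflow):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(workflow).static_order())

def start_nodes(workflow):
    """Nodes with no dependencies (a workflow should have exactly one)."""
    return [node for node, deps in workflow.items() if not deps]

def terminal_nodes(workflow):
    """Nodes that no other node depends on (at least one is expected)."""
    depended_on = set().union(*workflow.values())
    return [node for node in workflow if node not in depended_on]

order = execution_order(WORKFLOW)
print(order)
print(start_nodes(WORKFLOW), terminal_nodes(WORKFLOW))
```

Because the graph is acyclic, a valid execution order always exists; a cycle would make `execution_order` raise an error instead of returning a plan.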

(34)

Pig, Hive, and Impala in Big Data Architecture
[Figure: architecture diagram with components HDFS, MapReduce, Pig, Hive, Mahout, Impala, Flume, Security, Operations, Integration]

(35)


(36)

Hadoop-Based Analytics
• Allows analytics to be carried out on any type of data store
• Provides standard framework for computation
  – On cloud environment
  – On on-premises network
• Advantages
  – Vendor independent
  – Allows any type of analytics to be carried out
  – Provides flexible and adaptable architecture for implementation

(37)

• Allows use of large computational resources
  – Inter-cluster communication is managed by MapReduce
• MapReduce architecture supports
  – Data and task distribution
  – Fault monitoring
  – Task and data replication
  – Simple programming model
• Limitations of MapReduce
  – Cannot solve all the problems

(38)

• It can solve many Big Data problems
  – Data filtering
  – Statistics and aggregation
  – Graph analytics
  – Decision tree and classification
  – Clustering and recommendations
• Practical distributed API
  – Easier to understand and use
• Higher-level APIs exist
  – To reduce the complexity of programming

(39)

Phases in MapReduce Processing
• Data processing with MapReduce goes through three phases
  – Map phase
    • Processes the data and generates <key, value> pairs
  – Shuffle phase
    • Moves the <key, value> pairs to the appropriate processing node for reduction
  – Reduce phase
    • Processes the <key, value> pairs to generate the final output
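The three phases can be imitated on a single machine. The sketch below is a plain-Python word count, not Hadoop code; the function names and sample records are hypothetical, and on a real cluster the shuffle moves pairs between nodes rather than into a local dictionary.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a <key, value> pair of (word, 1) for every word."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    """Shuffle: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into the final output."""
    return {key: sum(values) for key, values in grouped.items()}

records = ["big data analytics", "big data challenges", "data"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)
```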

(40)

• In order to compute support
  – The number of times each product and its combinations occur in the data has to be calculated
• The original transaction file format
  – 1, milk (M), bread (B)
  – 2, bread (B), butter (T)
  – 3, milk (M), bread (B), butter (T)
  – 4, milk (M)
• The input data file would be:
  – M B MB
  – B T BT
  – M B T MB MT BT MBT
  – M
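The expansion from transactions to itemset lines, and the counting that support depends on, can be sketched as follows. This is a single-machine illustration using the four transactions listed above; the function names are hypothetical, and it assumes the items in each transaction are distinct and listed in a consistent order.

```python
from collections import Counter
from itertools import combinations

# Transactions from the slide, using the single-letter item codes.
TRANSACTIONS = [["M", "B"], ["B", "T"], ["M", "B", "T"], ["M"]]

def itemsets(transaction):
    """Return every non-empty combination of items, smallest first."""
    result = []
    for size in range(1, len(transaction) + 1):
        for combo in combinations(transaction, size):
            result.append("".join(combo))
    return result

def support_counts(transactions):
    """Count in how many transactions each itemset occurs."""
    counts = Counter()
    for transaction in transactions:
        counts.update(itemsets(transaction))
    return counts

for transaction in TRANSACTIONS:
    print(" ".join(itemsets(transaction)))
counts = support_counts(TRANSACTIONS)
print(dict(counts))
```

In a MapReduce setting each expanded line would be mapped to <itemset, 1> pairs and the per-itemset sums would be produced by the reducers.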

(41)

Association Rule MapReduce
[Figure: MapReduce pipeline for association rules. Stages: Input, Splitting, Mapping, Shuffling, Reducing. Itemset lines are split, mapped to <itemset, 1> pairs, shuffled by itemset, and reduced to per-itemset counts]

(42)


(43)

Steps in Structured Data Analytics
• Step 1: Business domain analysis
• Step 2: Data exploration and investigation
• Step 3: Data preparation and cleaning
• Step 4: Model design and development
• Step 5: Model verification and testing

(44)

• Descriptive analytics provides knowledge discovered from the data set
  – Examples include
    • Clustering
    • Correlation analysis
    • Pattern discovery
    • Association rules
• Predictive analytics builds a model to predict future behavior
  – Examples include
    • Regression models

(45)

• Open source machine learning library
  – Leveraging the power of
    • MapReduce
    • Hadoop
• Currently available algorithms are
  – Various clustering algorithms
  – Singular value decomposition
  – Parallel Frequent Pattern mining
  – Naive Bayes classifier
  – Random forest decision tree-based classifier
  – Collaborative filtering
  – User- and item-based recommenders
  – Others

(46)

Mahout Architecture
[Figure: layered architecture. Mahout machine learning library (clustering, classification, recommendation, filtering, etc.) on top of Hadoop/MapReduce, which runs over storage and compute nodes]

(47)

• Command line interaction
  – Allows execution of machine learning algorithms using Mahout commands
  – Pros: Easy use of algorithms
  – Cons: Have to specify large number of options; fixed number of options are available
• Application Program Interface-based interaction
  – Allows execution of machine learning algorithms in the source code
  – Pros: More flexibility is available in using the algorithms
  – Cons: Have to know programming

(48)

Clustering: Introduction
• Clustering is the grouping of available data based on similarities on some pre-specified criteria
  – Uses unsupervised learning techniques
  – Uses statistical methods
• A way of looking for patterns or structure in the data that are of interest
• Allows partitioning of data in different groups
  – Identifies a set of patterns and structures
• Can be used as a standalone technique to gain insight into

(49)

• Two core components of clustering are
  – An algorithm to group items in clusters
    • K-means clustering
    • Hierarchical clustering
    • Many others
  – Concept of similarity and dissimilarity
    • This specifies how close various data locations are
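Both components, the grouping algorithm and the dissimilarity measure, show up in a minimal k-means sketch. This is a plain-Python illustration, not Mahout's implementation; the sample points, the starting centroids, and the choice of Euclidean distance are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Dissimilarity measure: straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid
            continue
        dims = len(old)
        new_centroids.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(dims))
        )
    return new_centroids

def kmeans(points, centroids, max_iter=5):
    """Alternate assignment and update until stable or max_iter is hit."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels, centroids)
```

Swapping `euclidean` for another measure, such as cosine distance, changes what "close" means without touching the rest of the algorithm.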

(50)

Business Applications for Clustering
• Market research
  – Segmentation
  – Targeted advertisements
  – Customer categorization
• Government management
  – Crime hot spot analysis
  – Census data for economic and social analysis
• Inventory management

(51)

• Select features from the data sets
• Choose clustering approach
• Select parameters for clustering

(52)

/bin/mahout kmeans \
  -i input-file \
  -o output \
  -c clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 5 \
  -ow \
  -cd 1 \
  -k 25

(53)

• Association rule (AR) mining involves finding one of the following from among a set of data
  – Frequent patterns
  – Associations
  – Correlation
  – Causal structures
• AR is used to find interesting relationships between variables in transaction-oriented data sets
• Association rules are of the form
  – {X} -> {Y} with [support, confidence, lift]
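The three rule measures can be computed directly from their standard definitions: support is the fraction of transactions containing both sides, confidence is that support divided by the support of {X}, and lift is confidence divided by the support of {Y}. The sketch below reuses the milk/bread/butter transactions that appear earlier in the deck; the function names are hypothetical, and the code assumes {X} occurs at least once so the divisions are defined.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def rule_metrics(transactions, x, y):
    """Return (support, confidence, lift) for the rule {x} -> {y}."""
    s_xy = support(transactions, set(x) | set(y))
    s_x = support(transactions, x)
    s_y = support(transactions, y)
    confidence = s_xy / s_x
    lift = confidence / s_y
    return s_xy, confidence, lift

# Market-basket transactions taken from the earlier example.
TRANSACTIONS = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]

rule_support, rule_confidence, rule_lift = rule_metrics(
    TRANSACTIONS, {"milk"}, {"bread"}
)
print(f"support={rule_support:.2f} "
      f"confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

A lift below 1, as for {milk} -> {bread} here, suggests the two items co-occur slightly less often than they would if they were independent.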

(54)

Business Applications of
• Cross-selling and up-selling
  – Market basket analysis
• Catalog design
  – Keeping the related items together
• Planning and monitoring
  – Network design based on web logs
  – Planning web design using web usage log mining
• In general, any decision based on knowledge of previous transactions by sets of users

(55)

• Classification separates data into various categories
  – Comprises assigning a class label to a set of unclassified data sets
• The classification of an unknown observation to a class is based on the model
  – Thus, classification is categorized as supervised learning
• Classification is a two-step process
  – Model training and creation
    • Based on training data set where the observations have already been classified
  – Model testing
    • Apply the model to test data to classify them in appropriate categories

(56)

• Training examples
  – Complete set of inputs for model development and testing
• Training data
  – Subset of training examples used for building the model
• Model
  – Set of rules generated after training the classifier
• Test data
  – Remaining examples from training examples not used in training data to study how effective the model was
• Predictor variables
  – Set of variables that are used to decide the output for target variable
• Target variable

(57)

The Structure of Input for Mahout Classifier
• The general structure for input for training is
  – TV, PV₁, PV₂, PV₃, ..., PVₙ
• The general structure for test input is
  – ??, PV₁, PV₂, PV₃, ..., PVₙ
  – ?? refers to the target variable that needs to be evaluated based on the predictor variables
PV = predictor variable, TV = target variable
[Figure: training rows (TV, PV₁, ..., PVₙ) feed classifier model generation; the generated classifier model maps test rows (??, PV₁, ..., PVₙ) to predicted TV values]

(58)

Business Applications of
• Loan application processing
• Categorization of customer types
• Identification of product classes
• Classification of a privacy policy as safe or unsafe
• Medical diagnosis
• Targeted marketing

(59)

• Decision tree
• Random forest
• Many others

(60)

• This provides tree structure with
  – Internal nodes representing a test on a certain attribute
    • The result of the test is represented by the branch at internal node
  – A class is represented by a leaf node
• Starts with the root node and splits into multiple branches
  – May further split into new nodes
  – Ends in leaf nodes that contain the “decisions”
• Model can be in the form of a set of rules or a decision tree
  – Rules specify the decision-making process of the model
• Uses recursive partitioning approach (divide and conquer)
  – The process stops when partitioning does not improve the outcome

(61)

Algorithms Supported by Mahout for
• Stochastic Gradient Descent (SGD)
  – Can run in sequential, online, and incremental mode
  – Suitable for data sets of less than tens of millions of examples
• Naive Bayes
  – Can run in parallel mode
  – Suitable for data sets of millions to hundreds of millions of examples
• Random forest
  – Can run in parallel mode
  – Suitable for data sets of less than tens of millions of examples
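To make the SGD entry more concrete, here is a minimal sketch of logistic regression trained by stochastic gradient descent, using rows in the TV, PV₁, ..., PVₙ layout. It is a plain-Python illustration and not Mahout's implementation; the training rows, learning rate, number of passes, and function names are all hypothetical.

```python
import math
import random

def sigmoid(z):
    """Logistic function, clamped to avoid overflow in math.exp."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, predictors):
    """Probability that the target is 1; weights[0] is the bias term."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], predictors))
    return sigmoid(z)

def train_sgd(rows, passes=500, rate=0.5, seed=0):
    """Fit logistic regression by updating weights one example at a time.

    Each row is (TV, PV1, ..., PVn) with a 0/1 target variable.
    """
    rng = random.Random(seed)
    rows = list(rows)
    weights = [0.0] * len(rows[0])  # bias plus one weight per predictor
    for _ in range(passes):
        rng.shuffle(rows)
        for target, *predictors in rows:
            error = target - predict(weights, predictors)
            weights[0] += rate * error
            for i, x in enumerate(predictors, start=1):
                weights[i] += rate * error * x
    return weights

# Hypothetical training rows: TV, PV1, PV2
TRAIN = [(0, 0.1, 0.1), (0, 0.2, 0.1), (0, 0.1, 0.2),
         (1, 0.8, 0.9), (1, 0.9, 0.8), (1, 0.9, 0.9)]
weights = train_sgd(TRAIN)
print(predict(weights, (0.15, 0.15)), predict(weights, (0.85, 0.85)))
```

Because the weights are updated after every single example, the same loop can keep learning as new rows arrive, which is what the sequential, online, and incremental modes refer to.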

(62)

Command for Training the SGD Classifier
• Training: bin/mahout trainlogistic [options]

Major Options                   Explanation
--input <file>                  Input file
--output <file>                 Output file
--target <variable>             Target variable
--categories <n>                Possibilities in TV
--predictors <pv1> ... <pvn>    Predictor variables
--types <tpv1> ... <tpvn>       Types of PVs
--passes                        Number of times input should be

(63)

Command for Evaluating the SGD Classifier
• Evaluating: bin/mahout runlogistic [options]

Options            Explanation
--auc              Show AUC score
--scores           Show TV and scores for each input
--confusion        Display confusion matrix
--input <file>     Input file
--model <model>    Read model from specified file
--quiet            Generate less details of execution

(64)

Application for Recommendation
• Medicine
  – Disease recommendation
  – Drug recommendation
  – Case-based search
• Marketing
  – Cell phone companies for identifying users that may switch
  – Recommending books at Amazon
  – Recommending products on the web sites
• Education
  – Universities guiding students what courses to take
  – Conference organizers assigning papers to reviewers

(65)


(66)

Scope of Text Mining
[Figure: text mining shown in relation to data mining, information retrieval, computational linguistics, web mining, natural language processing, and statistics]

(67)

• Each document may contain large amounts of text
  – High dimensionality
• Ambiguity of content due to language features
• Semantic issues
  – Words and phrases may not be semantically independent
• Complexity of natural language processing

(68)

• Feature selection
  – Determine n-grams necessary
• Text preprocessing
  – Removal of numbers
  – Removal of punctuation marks
  – Text case conversion as needed
  – Stop word removal (can use pre-specified list or generic list)
  – Stemming (identify word by its root)

(69)

• Most common words in English that do not contribute to classification, clustering or association are:
  – Articles: a, an, the
  – Conjunctions: and, or, ...
  – Prepositions: as, by, of, ...
  – Pronouns: you, she, he, it, ...
• Text documents are high-dimension data
  – Removal of stop words acts as technique for dimensionality reduction
• Other non-context words can also be removed

(70)

• The process for reducing inflected (or sometimes derived) words to their stem, base or root form
  – Typically achieved by removing -ing, -s, -er, -ed, etc.
  – For example: “mining”, “miner”, “mines”, “mined”
  – Stemmed word: “mine”
• Common algorithms are
  – Porter’s algorithm
  – KSTEM algorithm
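The preprocessing steps (case conversion, number and punctuation removal, stop word removal, stemming) can be chained as in the sketch below. The stop-word list is a tiny sample, and the stemmer is a deliberately crude suffix stripper written for this example, not Porter's or the KSTEM algorithm, so its stems can differ from theirs (it produces "min" where the slide shows "mine").

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "as", "by", "of",
              "you", "she", "he", "it"}

# Crude suffix list, checked in order; NOT Porter's algorithm.
SUFFIXES = ("ing", "ed", "er", "es", "s")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop numbers and punctuation, remove stop words, stem."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

tokens = preprocess("The miner mined 3 mines, and he kept mining data!")
print(tokens)
```

The four inflected forms collapse to a single token, which is the dimensionality reduction that stemming is meant to provide.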

(71)

• Loading the data
• Text preprocessing (as needed)
  – Cleaning
  – Punctuation removal
  – Number removal
  – Stop word removal
  – Stemming
• Building term document matrix
• Finding frequent term association

(72)

(73)

Building a Term Document Matrix
OUTPUT:
A term-document matrix (16 terms, 20 documents)
Non-/sparse entries: 11/309
Sparsity           : 97%
Maximal term length: 21
Weighting          : term frequency (tf)

                      Docs
Terms                   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
data                    1  1  0  0  2  0  0  0  0  0  1  2  1  1  1  0  1  0  0  0
databases               0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
dataframe               0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
datasets                0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
datasetshttptconxrbuh   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
datastream              0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
datatable               0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
details                 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
detection               0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
development             0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
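A term-document matrix with term-frequency weighting can be built in a few lines. The sketch below is a plain-Python illustration, not the tool that produced the output above; the three sample documents and the function names are hypothetical, and the documents are assumed to be preprocessed already.

```python
def term_document_matrix(docs):
    """Build a term-frequency matrix: one row per term, one column per doc."""
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({token for tokens in tokenized for token in tokens})
    matrix = {
        term: [tokens.count(term) for tokens in tokenized] for term in terms
    }
    return terms, matrix

def sparsity(matrix):
    """Fraction of zero entries in the matrix."""
    cells = [count for row in matrix.values() for count in row]
    return cells.count(0) / len(cells)

# Hypothetical, already-preprocessed documents.
DOCS = ["data mining data", "text mining", "big data"]
terms, matrix = term_document_matrix(DOCS)
for term in terms:
    print(f"{term:<8}", *matrix[term])
print(f"sparsity: {sparsity(matrix):.0%}")
```

Even in this tiny case half of the cells are zero; with realistic vocabularies the matrix is far sparser, as the 97% figure in the output above shows.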

(74)

(75)


(76)


(77)

Healthcare Expenses

Hersh, W., Jacko, J. A., Greenes, R., Tan, J., Janies, D., Embi, P. J., & Payne, P. R. (2011). Health-care hit or miss? Nature, 470(7334), 327.

(78)

Healthcare Data Types
[Figure: healthcare data types: EHR, Public Health, Social, Behavioral]

(79)

Healthcare Data
• Billing data
  – International Classification of Diseases (ICD)
  – Current Procedural Terminology (CPT)
• Lab results
  – Logical Observation Identifiers Names and Codes (LOINC)
• Medication
  – National Drug Code (NDC) by Food and Drug

(80)

Healthcare Data (Cont’d)
• Clinical notes
  – Unstructured text data
• Image data
  – Unstructured data
• Social interaction data

(81)

Analytic Platform
[Figure: analytic platform. Inputs: structured EHR, unstructured EHR, patients/context. Stages: feature selection, clustering, classification, recommendation]

(82)

(83)

Applications of Patient Similarity
• Heart failure prediction
• Likelihood of blood pressure onset
• Disease recommendation

(84)

Information Analysis Options
• Image-based retrieval

(85)

Image Query
• Image-based retrieval
  – Given a query image, find the most similar images
• Case-based retrieval
  – Given a case description, details of the symptoms, tests including images
  – Find similar cases including images with case descriptions

(86)


(87)

Benefits of Health Care Analytics
• Better diagnosis
• Better health care delivery
• Better value for patient, provider and payer
• Better innovation

(88)

Big Data Analysis Challenge
[Figure: three required skill areas: distributed programming skills, domain expertise, machine learning background]