• No results found

Big Data Analytics Opportunities and Challenges

N/A
N/A
Protected

Academic year: 2021

Share "Big Data Analytics Opportunities and Challenges"

Copied!
90
0
0

Loading.... (view fulltext now)

Full text

(1)

Big

 

Data

 

Analytics

 

Opportunities

 

and

 

Challenges

Anup Kumar,

 

Professor

 

and

 

Director

 

of

 

MINDS

 

Lab

Computer

 

Engineering

 

and

 

Computer

 

Science

 

Department

University

 

of

 

Louisville

(2)

Road

 

Map

Introduction

Hadoop Echo

 

System

Data

 

Analytics

 

Approaches

Processing

 

of

 

Structured

 

data

Processing

 

of

 

Unstructured

 

data

Example

 

Healthcare

 

Applications

Concluding

 

Observations

(3)

3

Big

 

Data

 

Applications

• Advertising and marketing

– Customer shopping patterns

– Response to promotional campaign • Manufacturing

– Maintenance of machine health • Social Media 

– Browsing and sentiment analysis – Impact on buying patterns

• Government data

(4)

Big

 

Data

 

Applications

 

(cont’d)

Stock

 

Market

 

Stock

 

performance

 

prediction

Healthcare Management

Patient health monitoring

Impact of preventive care

Financial

 

Institutions

Fraud

 

detection

 

and

 

mitigation

Weather

(5)

5

Big

 

Data

 

Analytics:

 

Benefits

Saving

 

money

Fraud

 

detection

 

in

 

medical

 

industry

Risk

 

Management

Lower

 

cost

 

and

 

better

 

outcomes

Real

time

 

decision

 

making

Sensor

 

data

 

analysis

Influence

 

of

 

social

 

sentiment

 

on

 

patient

 

health

Getting

 

to

 

know

 

your

 

customer

 

better

Targeted

 

advertisement

(6)

Components

 

of

 

Big

 

Data

• Volume

– Data generated by 2020 will be in Zettabytes (1021)

– Soon after that the measure will be Yottabyte (1024) and Brontabyte (1027)

• Variety

– Structured (transactional data)

– Unstructured (image, video and text data) • Velocity

– Rate of data generation

– Increase in the ability to process data • Variability/Veracity

(7)

7

• Understanding Advanced Analytics – Association analysis

– Clustering, Classification, Regression – Recommendation framework

– Time series analysis

• Handling volume and velocity of data – Map/Reduce framework

– Cloud computing framework – Custom frameworks

• Dealing with Variety of Data

– Structured Data Analytics (Advanced Analytics) – Unstructured Data Analytics (Text Processing)

Key

 

Components

 

of

 

Big

 

Data

 

(8)

• Descriptive analytics : 

– Specifies the data characteristics • How to describe the system?

• What happened in the system and when? • What are the parameters in the system?

• What is the impact of a parameter on the system? • Is there any correlation between the parameters?

• Predictive analytics: uses data mining and predictive modeling – It can answer the following questions

• What are the future trends?

• What is the decision based on past history?

(9)

9

• Hadoop Based Implementation of Analytics • In‐Database Analytics

(10)

• Allows analytics to be carried out on any type of data store • Provides a standard framework for computation

– On cloud environment – On in‐premises network • Advantages

– Vendor‐independent

– Allows any type of analytics to be carried out • Disadvantages

– Complex implementation

(11)

11

• Allows analytic computation to be carried out in the database – Uses:

• SQL

• SQL extensions • Advantages

– Higher analytics efficiency, easy usability, better database manageability – Analytic centralization may allow easy security, data, and version 

management • Disadvantages

– Vendor dependent – Cost

(12)

• Structured Data Analytics – Association analysis – Clustering – Classification – Regression – Recommendation framework – Time series analysis

• Unstructured Data Analytics – Association analysis

– Clustering – Classification

(13)

13

Road

 

Map

Introduction

Hadoop Echo

 

System

Data

 

Analytics

 

Approaches

Processing

 

of

 

Structured

 

data

Processing

 

of

 

Unstructured

 

data

Example

 

Healthcare

 

Applications

Concluding

 

Observations

(14)

• Open source from Apache

– Free download from http://hadoop.apache.org

• Runs on commodity hardware

– Lowers procurement, licensing, and operational costs • Scalable

– Incrementally add hardware nodes as needed

• Can accommodate growth in data and processing requirements • Fault tolerant

– Automatic data replication and task reallocation

(15)

15

Core

 

Components

 

of

 

Hadoop

Big

 

Data

 

Architecture

Data Storage (HDFS) Data Processing (MapReduce) Security Operations Integration

(16)

Designed

 

for

 

large

scale

 

data

 

processing

Allows

 

storage

 

of

 

large

 

data

 

files

Runs

 

on

 

top

 

of

 

native

 

file

 

system

 

on

 

each

 

node

Data

 

is

 

stored

 

in

 

units

 

known

 

as

 

blocks

64MB

 

(default)

 

although

 

normally

 

larger

Large

 

files

 

are

 

stored

 

in

 

many

 

blocks

 

over

 

many

 

nodes

Blocks

 

are

 

replicated

 

to

 

multiple

 

nodes

Hadoop

 

Distributed

 

File

 

System

 

(17)

17

HDFS

 

Blocks

 

and

 

Replication

File Block 3 Block 2 Block 1 Node 1 Block 1 Block 2 Node 2 Block 1 Node 3 Block 2 Block 3 Block 3

(18)

Nodes

 

in

 

Hadoop

 

have

 

different

 

roles

 

to

 

play

HDFS

 

nodes

NameNodes

DataNodes

MapReduce

 

nodes

JobTracker

TaskTracker

(19)

19

Hadoop

 

Node

 

Interaction

DataNode DataNode DataNode DataNode NameNode TaskTracker TaskTracker JobTracker Hadoop Cluster MapReduce HDFS Start

(20)

Receives

 

jobs

 

from

 

clients

 

and

 

manages

 

jobs

Master

 

in

 

master

slave

 

architecture

 

with

 

TaskTrackers

Determines

 

job

 

execution

 

plan

Assigns

 

nodes

 

to

 

different

 

processing

 

tasks

Monitors

 

tasks

 

as

 

they

 

execute

Detects

 

failures

 

and

 

restarts/re

runs

 

tasks

(21)

21

Manages

 

individual

 

tasks

 

that

 

the

 

JobTracker

 

issues

Can

 

be

 

map

 

or

 

reduce

 

jobs

Single

 

TaskTracker

 

can

 

run

 

many

 

map

 

or

 

reduce

 

tasks

 

in

 

parallel

Keeps

 

JobTracker

 

updated

 

with

 

task

 

progress

 

using

 

heartbeat

 

signal

Failure

 

to

 

send

 

heartbeat

 

results

 

in

 

JobTracker

 

resubmitting

 

task

To

 

another

 

TaskTracker

 

node

(22)

HDFS

 

is

 

also

 

a

 

master

slave

 

architecture

 

NameNode

 

is

 

the

 

HDFS

 

master

 

directing

 

DataNode

 

slave

 

nodes

DataNodes

 

perform

 

low

level

 

input/output

This

 

is

 

where

 

data

 

is

 

stored

Keeps

 

track

 

of

 

files

How

 

they

 

are

 

broken

 

into

 

blocks

Which

 

nodes

 

blocks

 

are

 

stored

 

on

 

Only

 

one

 

per

 

Hadoop

 

cluster!

Single

 

point

 

of

 

failure,

 

so

 

should

 

not

 

use

 

commodity

 

hardware

(23)

23

Performs

 

low

level

 

work

 

reading

 

and

 

writing

 

HDFS

 

blocks

 

to

 

local

 

file

 

system

Communicates

 

with:

NameNode

Provides

 

information

 

about

 

which

 

blocks

 

they

 

are

 

storing

 

DataNodes

 

to

 

replicate

 

data

 

blocks

Many

 

per

 

Hadoop

 

cluster

Normally

 

run

 

on

 

inexpensive

 

commodity

 

hardware

(24)

• Hadoop can be configured to run in one of three modes – Local (standalone) mode

– Pseudo‐distributed mode – Fully distributed mode

• Four key configuration files define which mode to run in 

– core-site.xml

– hdfs-site.xml

– mapred-site.xml

(25)

25

This

 

is

 

the

 

default

 

mode

 

for

 

Hadoop

No

 

assumptions

 

are

 

made

 

about

 

hardware

Does

 

not

 

use

 

HDFS

Used

 

for

 

developing

 

and

 

debugging

 

application

 

logic

 

of

 

Hadoop

 

programs

(26)

Hadoop

 

runs

 

in

 

clustered

 

mode

Cluster

 

size

 

is

 

one

Useful

 

for

 

developing

 

and

 

debugging

 

Hadoop

 

applications

Enables

 

examination

 

of:

Memory

 

and

 

HDFS

 

usage

Configuration

 

files

 

used

 

to

 

set

 

port

 

numbers

 

used

 

for

 

NameNode

 

and

 

JobTracker

(27)

27

This

 

is

 

the

 

mode

 

for

 

running

 

Hadoop

 

applications

 

in

 

production

Configuration

 

requires

 

setup

 

of

Master

 

nodes

Host

 

of

 

NameNode

 

and

 

JobTracker

 

components

Slave

 

nodes

Machines

 

running

 

DataNode

 

and

 

TaskTracker

 

components

NameNode

 

hosts

 

a

 

report

 

on

 

HDFS

 

status

 

on

 

port

 

number

 

50070

Enables

 

checking

 

of

 

each

 

DataNode

 

in

 

cluster

JobTracker

 

provides

 

status

 

report

 

on

 

MapReduce

 

jobs

 

on

 

port

 

50030

Reports

 

about

 

ongoing

 

jobs

(28)

Hadoop is

 

powerful

 

for

 

processing

 

large

 

datasets

Requires

 

significant

 

programming

 

skills

 

to

 

develop

 

applications

Pig

 

is

 

a

 

Hadoop extension

 

that

 

simplifies

 

Hadoop programming

It

 

is

 

Easy

 

to

 

use

It

 

is

 

Higly scalable

There

 

are

 

two

 

major

 

components

 

of

 

Pig

A

 

data

 

processing

 

language

 

called

 

Pig

 

Latin

A

 

compiler

 

to

 

generate

 

Pig

 

Latin

 

to

 

Hadoop programs

 

(29)

29

A

 

data

 

warehousing

 

package

 

built

 

on

 

top

 

of

 

Hadoop

Originally

 

developed

 

at

 

Facebook

Target

 

audience

 

for

 

Hive

 

is

 

data

 

analysts

 

comfortable

 

with

 

SQL

Provides

 

access

 

to

 

ad

 

hoc

 

queries

 

and

 

data

 

analysis

On

 

large

scale

 

datasets

Focuses

 

on

 

structured

 

data

Enables

 

optimizations

 

to

 

be

 

performed

 

Provides

 

SQL

like

 

language

 

HiveQL

Based

 

on

 

tables,

 

rows,

 

columns

No

 

need

 

to

 

know

 

about

 

Hadoop

 

programming

 

(30)

Hive

 

requires

 

a

 

metastore

 

service

Provides

 

schemas

 

to

 

structure

 

the

 

data

Helps

 

with

 

efficient

 

querying

 

and

 

processing

Implemented

 

using

 

tables

 

in

 

a

 

relational

 

database

By

 

default,

 

Hive

 

uses

 

Java’s

 

built

in

 

Derby

 

database

Hive

 

can

 

be

 

accessed

 

from

 

Command

line

 

interface

Web

 

interface

 

known

 

as

 

Hive

 

web

 

Interface

(31)

31

• Secure authorization

– Ability to control and enforce access to data 

– Grant privileges on data for authenticated users. • Fine‐grained access control

– Control at the database, table, and view level – As an example of fine grained control

• One group can see all data

• Another group can see only non‐sensitive data filtered by a view • Sentry provides unified security platform:

– Existing Hadoop Kerberos security for authentication

– Sentry access policy applied to all applications e.g. Pig, Hive etc

Introduction

 

to

 

Apache

 

Sentry

 

(32)

Introduction

 

to

 

Oozie

Oozie is

 

a

 

Hadoop workflow

 

engine

Manages

 

data

 

processing

 

activities

Available

 

from

 

http://oozie.apache.org

Simplifies

 

the

 

management

 

of

 

Recurring

 

and

 

data

 

driven

 

jobs

 

Re

execution

 

of

 

failed

 

jobs

Comprised

 

of

A

 

workflow

 

engine

Can

 

execute

 

different

 

types

 

of

 

jobs

(33)

33

Oozie Workflows

• Oozie workflows are directed acyclic graphs (DAGs) • Graph nodes are either

– Action nodes

• Perform a workflow task – Control flow nodes

• Determine the logic between tasks

• There is always exactly one start node and at least one terminal node

start action

action

terminal

(34)

Pig,

 

Hive,

 

and

 

Impala

 

in

 

Big

 

Data

 

Architecture

HDFS Map Reduce

Security Operations Integration Pig Hive Mahout Impala

Flume

(35)

35

Road

 

Map

Introduction

Hadoop Echo

 

System

Data Analytics Approaches

Processing

 

of

 

Structured

 

data

Processing

 

of

 

Unstructured

 

data

Example

 

Healthcare

 

Applications

Concluding

 

Observations

(36)

Hadoop Based

 

Analytics

Allows

 

analytics

 

to

 

be

 

carried

 

out

 

on

 

any

 

type

 

of

 

data

 

store

Provides

 

standard

 

framework

 

for

 

computation

On

 

cloud

 

environment

On

 

in

premises

 

network

Advantages

Vendor

 

independent

Allows

 

any

 

type

 

of

 

analytics

 

to

 

be

 

carried

 

out

Provides

 

flexible

 

and

 

adaptable

 

architecture

 

for

 

implementation

(37)

37

Allows

 

use

 

of

 

large

 

computational

 

resources

Inter

 

cluster

 

communication

 

is

 

managed

 

by

 

MapReduce

MapReduce architecture

 

supports

Data

 

and

 

task

 

distribution

Fault

 

monitoring

Task

 

and

 

data

 

replication

Simple

 

programming

 

model

Limitations

 

of

 

MapReduce

Cannot

 

solve

 

all

 

the

 

problems

(38)

It

 

can

 

solve

 

many

 

Big

 

Data

 

problems

Data

 

filtering

Statistics

 

and

 

aggregation

Graph

 

analytics

Decision

 

Tree

 

and

 

classification

Clustering

 

and

 

recommendations

Practical

 

Distributed

 

API

Easier

 

to

 

understand

 

and

 

use

Higher

 

level

 

APIs

 

exist

To

 

reduce

 

the

 

complexity

 

of

 

programming

(39)

39

Phases

 

in

 

MapReduce Processing

Data

 

Processing

 

with

 

MapReduce goes

 

through

 

three

 

phases

Map

 

Phase

Processes

 

the

 

data

 

and

 

generate

 

<key,

 

Value>

 

pairs

Shuffle

 

Phase

Moves

 

the

 

data

 

<key,

 

value>

 

pairs

 

to

 

appropriate

 

processing

 

node

 

for

 

reduction

Reduce

 

Phase

Processes

 

the

 

data

 

<key,

 

value>

 

pairs

 

to

 

generate

 

final

 

output

(40)

• In order to compute support

– The number of times each product and its combinations occur in the data 

has to be calculated

• The original transaction file format – 1, milk (M), bread (B)

– 2, bread (B), butter (T)

– 3, milk (M), bread (B), butter(T) – 4, milk(M)

• The input data file would be: – M B MB

– B T BT

– M B T MB MT BT MBT

(41)

41

Association

 

Rule

 

MapReduce

M B MB B T BT M B T MB MT BT MBT M M B T BT M B MB M,1 B ,1 T, 1 BT,1 M,1 B,1 MB, 1 M,1 BMT,1 B,3 BT,2 BMT,1 M,3 B,3 T,2 MB,2 BT,2 MT,2 BMT,1 Input Splitting Mapping Shuffling Reducing

M,1 M,1 MB,1 M B T MB MT BT MBT M, 1 B ,1 T, 1 MB,1 BT,1 MT,1 MBT,1 B,1 B,1 B,1 T,1 T,1 MB,1 BT,1 BT,1 MT,1 MT,1 M,3 T,2 MB,2 MT,2

(42)

Road

 

Map

Introduction

Hadoop Echo

 

System

Data

 

Analytics

 

Approaches

Processing

 

of

 

Structured

 

data

Processing

 

of

 

Unstructured

 

data

Example

 

Healthcare

 

Applications

Concluding

 

Observations

(43)

43

Steps

 

in

 

Structured

 

Data

 

Analytics

Step

 

1:

 

Business

 

domain

 

analysis

Step

 

2:

 

Data

 

exploration

 

and

 

investigation

Step

 

3:

 

Data

 

preparation

 

and

 

cleaning

Step

 

4:

 

Model

 

design

 

and

 

development

Step

 

5:

 

Model

 

verification

 

and

 

testing

(44)

Descriptive

 

analytics

 

provides

 

knowledge

 

discovered

 

from

 

the

 

data

 

set

Examples

 

include

Clustering

 

Correlation

 

analysis

 

Pattern

 

discovery

 

Association

 

rules

Predictive

 

analytics

 

builds

 

a

 

model

 

to

 

predict

 

future

 

behavior

Examples

 

include

Regression

 

models

(45)

45

• Open source machine learning library

– Leveraging the power of 

• MapReduce

• Hadoop

• Currently available algorithms are

– Various clustering algorithms

– Singular value decomposition

– Parallel Frequent Pattern mining

– Naive Bayes classifier

– Random forest decision tree‐based classifier

– Collaborative filtering

User and itembased recommenders – Others

(46)

Mahout

 

Architecture

Storage Storage Storage Compute Compute Storage Storage Storage Compute Compute Hadoop/MapReduce

Mahout: Machine Learning Library

Clustering, Classification, Recommendation, Filtering, etc.

(47)

47

Command

 

line

 

interaction

Allows

 

execution

 

of

 

machine

 

learning

 

algorithms

 

using

 

Mahout

 

commands

Pros:

 

Easy

 

use

 

of

 

algorithms

Cons:

 

Have

 

to

 

specify

 

large

 

number

 

of

 

options;

 

fixed

 

number

 

of

 

options

 

are

 

available

Application

 

Program

 

Interface–based

 

interaction

Allows

 

execution

 

of

 

machine

 

learning

 

algorithms

 

in

 

the

 

source

 

code

Pros:

 

More

 

flexibility

 

is

 

available

 

in

 

using

 

the

 

algorithms

Cons:

 

Have

 

to

 

know

 

programming

(48)

Clustering:

 

Introduction

Clustering

 

is

 

the

 

grouping

 

of

 

available

 

data

 

based

 

on

 

similarities

 

on

 

some

 

pre

specified

 

criteria

Uses

 

unsupervised

 

learning

 

techniques

Uses

 

statistical

 

methods

 

A

 

way

 

of

 

looking

 

for

 

patterns

 

or

 

structure

 

in

 

the

 

data

 

that

 

are

 

of

 

interest

Allows

 

partitioning

 

of

 

data

 

in

 

different

 

groups

Identifies

 

a

 

set

 

of

 

patterns

 

and

 

structures

Can

 

be

 

used

 

as

 

a

 

standalone

 

technique

 

to

 

gain

 

insight

 

into

 

(49)

49

Two

 

core

 

components

 

of

 

clustering

 

are

An

 

algorithm

 

to

 

group

 

items

 

in

 

clusters

Kmeans

 

clustering

Hierarchical

 

clustering

Many

 

others

Concept

 

of

 

similarity

 

and

 

dissimilarity

This

 

specifies

 

how

 

close

 

various

 

data

 

locations

 

are

(50)

Business

 

Applications

 

for

 

Clustering

 

• Market research – Segmentation – Targeted advertisements – Customer categorization • Government management

– Crime hot spot analysis

– Census data for economic and social analysis • Inventory management

(51)

51

Select

 

features

 

from

 

the

 

data

 

sets

Choose

 

clustering

 

approach

Select

 

parameters

 

for

 

clustering

(52)

/bin/mahout kmeans

-i input-file

-o output

-c clusters

-dm

org.apache.mahout.common.distance.CosineDistanceMeasure

-x 5

-ow

-cd 1

-k 25

(53)

53

• Association rule (AR) mining involves finding one of the following from among 

a set of data

– Frequent patterns 

– Associations 

– Correlation 

– Causal structures

• AR is used to find interesting relationships between variables in transaction‐

oriented data sets

• Association rules are of the form

– {X}  ‐>  {Y } with [support, confidence, lift]

(54)

• Cross‐selling and up‐selling 

– Market basket analysis • Catalog design

– Keeping the related items together • Planning and monitoring

– Network design based on web logs

– Planning web design using web usage log mining

• In general, any decision based on knowledge of previous transactions by sets 

of users

Business

 

Applications

 

of

 

(55)

55

• Classification separates data into various categories

– Comprises assigning a class label to a set of unclassified data sets

• The classification of an unknown observation to a class is based on the model 

– Thus, classification is categorized as supervised learning 

• Classification is a two‐step process – Model training and creation

• Based on training data set where the observations have already been 

classified – Model testing

• Apply the model to test data to classify them in appropriate categories

(56)

• Training examples

– Complete set of inputs for model development and testing • Training data

– Subset of training examples used for building the model • Model

– Set of rules generated after training the classifier • Test data

– Remaining examples from training examples not used in training data to 

study how effective the model was • Predictor variables

– Set of variables that are used to decide the output for target variable • Target variable

(57)

57

The

 

general

 

structure

 

for

 

input

 

for

 

training

 

is

TV,

 

PV

1

,

 

PV

2

,

 

PV

3

,

 

………..,PV

n

The

 

general

 

structure

 

for

 

test

 

input

 

is

??,

 

PV

1

,

 

PV

2

,

 

PV

3

,

 

………..,PV

n

??

 

refers

 

to

 

the

 

target

 

variable

 

that

 

needs

 

to

 

be

 

evaluated

 

based

 

on

 

predictor

 

variable

The

 

Structure

 

of

 

Input

 

for

 

Mahout

 

Classifier

PV = predictor variable TV = target variable TV, PV1, PV2, PV3, ………..,PVn TV, PV1, PV2, PV3, ………..,PVn TV, PV1, PV2, PV3, ………..,PVn Classifier model generation Classifier generated model ??, PV1, PV2, PV3, ………..,PVn ??, PV1, PV2, PV3, ………..,PVn ??, PV1, PV2, PV3, ………..,PVn TV TV TV

(58)

Loan

 

application

 

processing

Categorization

 

of

 

customer

 

types

Identification

 

of

 

product

 

classes

Classification

 

of

 

a

 

privacy

 

policy

 

as

 

safe

 

or

 

unsafe

Medical

 

diagnosis

Targeted

 

marketing

Business

 

Applications

 

of

 

(59)

59

Decision

 

tree

Random

 

forest

Many

 

others

(60)

• This provides tree structure with

– Internal nodes representing a test on a certain attribute

• The result of the test is represented by the branch at internal node – A class is represented by a leaf node

• Starts with the root node and splits into multiple branches 

– May further split into new nodes 

– Ends in leaf nodes that contain the “decisions” 

• Model can be in the form of a set of rules or a decision tree – Rules specify the decision‐making process of the model • Uses recursive partitioning approach (divide and conquer) 

– The process stops when partitioning does not improve the outcome

(61)

61

• Stochastic Gradient Descent (SGD)

– Can run in sequential, online, and incremental mode

– Data set of less than tens of millions is suitable for processing • Naive Bayes

– Can run in parallel mode

– Data set of millions to hundreds of millions is suitable for processing • Random forest

– Can run in parallel mode

– Data set of less than tens of millions is suitable for processing

Algorithms

 

Supported

 

by

 

Mahout

 

for

 

(62)

Training:

bin/mahout trainlogistic [options]

Command

 

for

 

Training

 

the

 

SGD

 

Classifier

Major Options Explanation

--input <file> Input file  --output <file> Output file --target <variable> Target variable --categories <n> Possibilities in TV --predictors <pv1>….<pvn> Predictor variable --types <tpv1> …,<tpvn> Types of PVs

--passes Number of times input should be 

(63)

63

Evaluating:

 

bin/mahout runlogistic

[options]

Command

 

for

 

Evaluating

 

the

 

SGD

 

Classifier

Options Explanation

--auc Show AUC score

--scores Show TV and scores for each input --confusion Display confusion matrix

--input <file> Input file

--model <model> Read model from specified file --quiet Generate less details of execution

(64)

Medicine

– Disease recommendation – Drug recommendation – Case based search

Marketing

– Cell phone companies for identifying users that may switch – Recommending books at Amazon

– Recommending products on the web sites

Education

– Universities guiding students what courses to take – Conference organizers assigning papers to reviewers

Application

 

for

 

Recommendation

 

(65)

65

Road

 

Map

Introduction

Hadoop Echo

 

System

Data

 

Analytics

 

Approaches

Processing

 

of

 

Structured

 

data

Processing

 

of

 

Unstructured

 

data

Example

 

Healthcare

 

Applications

Concluding

 

Observations

(66)

Scope

 

of

 

Text

 

Mining

Text

 

Mining

Data  Mining Information  Retrieval Computational  Linguistics Web  Mining Natural Language Processing Statistics

(67)

67

Each

 

document

 

text

 

may

 

contain

 

large

 

amounts

 

of

 

text

 

High

 

dimensionality

Ambiguity

 

of

 

content

 

due

 

to

 

language

 

features

Sematic

 

issues

Words

 

and

 

phrases

 

may

 

not

 

be

 

semantically

 

independent

Complexity

 

of

 

natural

 

language

 

processing

(68)

Feature

 

selection

Determine

 

nGrams necessary

Text

 

Preprocessing

Removal

 

of

 

numbers

 

Removal

 

of

 

punctuation

 

marks

Text

 

case

 

conversion

 

as

 

needed

Stop

 

word

 

removal

 

(can

 

use

 

pre

specified

 

list

 

or

 

generic

 

list)

Stemming

 

(identify

 

word

 

by

 

its

 

root)

(69)

69

Most

 

common

 

words

 

in

 

English

 

that

 

do

 

not

 

contribute

 

to

 

classification,

 

clustering

 

or

 

association

 

are:

Articles

 

– a,

 

an,

 

the

Conjunctions

 

– and,

 

or…

Prepositions

 

– as,

 

by,

 

of

 

Pronouns

 

– you,

 

she,

 

he,

 

it

 

Text

 

documents

 

are

 

high

dimension

 

data

Removal

 

of

 

stop

 

words

 

acts

 

as

 

technique

 

for

 

dimensionality

 

reduction

Other

 

non

context

 

related

 

words

 

can

 

also

 

be

 

removed

(70)

The

 

process

 

for

 

reducing

 

inflected

 

(or

 

sometimes

 

derived)

 

words

 

to

 

their

 

stem,

 

base

 

or

 

root

 

form

Typically

 

achieved

 

by

 

removing

 

– ing,

 ‐

s,

 ‐

er

ed etc.

For

 

example:

 

“mining”,

 

“miner”,

 

“mines”,

 

 

mined”

Stemmed

 

word

 

“mine

Common

 

Algorithms

 

are

Porter’s

 

Algorithm

KSTEM

 

Algorithm

(71)

71

Loading

 

the

 

data

Text

 

preprocessing

 

(as

 

needed)

Cleaning

Punctuation

 

removal

Number

 

removal

Stop

 

word

 

removal

Stemming

Building

 

term

 

document

  

matrix

Finding

 

frequent

 

term

 

association

(72)
(73)

73

Building

 

a

 

Term

 

Document

 

Matrix

OUTPUT:

A term-document matrix (16 terms, 20 documents) Non-/sparse entries: 11/309

Sparsity : 97% Maximal term length: 21

Weighting : term frequency (tf) Docs Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 data 1 1 0 0 2 0 0 0 0 0 1 2 1 1 1 0 1 0 0 0 databases 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 dataframe 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 datasets 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 datasetshttptconxrbuh 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 datastream 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 datatable 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 details 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 detection 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 development 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

(74)
(75)

75

(76)

Road

 

Map

Introduction

Hadoop Echo

 

System

Data

 

Analytics

 

Approaches

Processing

 

of

 

Structured

 

data

Processing

 

of

 

Unstructured

 

data

Example Healthcare Applications

(77)

77

Healthcare

 

Expenses

Hersh, W., Jacko, J. A., Greenes, R., Tan, J., Janies, D., Embi, P. J., & Payne, P. R. (2011). Health-care hit or miss? Nature, 470(7334), 327.

(78)

Healthcare

 

Data

 

Types

EHR

Public

Health

Social

 

Behavioral

Public

 

Health

(79)

79

Healthcare

 

Data

Billing

 

Data

International

 

Classification

 

of

 

Diseases

 

(

 

ICD)

Current

 

Procedural

 

Terminology

 

(CPT)

Lab

 

results

Logical

 

Observation

 

Identifiers

 

Names

 

and

 

Codes

 

(LOINC)

Medication

National

 

Drug

 

Code

 

(NDC)

 

by

 

Food

 

and

 

Drug

 

(80)

Healthcare

 

Data

 

(Cont’d)

Clinical

 

notes

Unstructured

 

text

 

data

Image

 

Data

Unstructured

 

data

Social

 

Interaction

 

data

(81)

81

Analytic

 

Platform

Structured EHR Unstructured EHR Patients / Context Feature Selection Clustering Classification Recommendation

(82)
(83)

Applications

 

of

 

Patient

 

Similarity

83

Heart

 

Failure

 

Prediction

Likelihood

 

of

 

Blood

 

Pressure

 

onset

Disease

 

recommendation

(84)

Information

 

Analysis

 

Options

Image

based

 

Retrieval

(85)

Image

 

Query

Image

based

 

Retrieval

Given

 

a

 

query

 

image

 

and

 

find

 

the

 

most

 

similar

 

images

Case

based

 

Retrieval

Given

 

a

 

case

 

description,

 

details

 

of

 

the

 

symptoms,

 

tests

 

including

 

images

Find

 

similar

 

cases

 

including

 

images

 

with

 

case

 

descriptions

(86)

Road

 

Map

Introduction

Hadoop Echo

 

System

Data

 

Analytics

 

Approaches

Processing

 

of

 

Structured

 

data

Processing

 

of

 

Unstructured

 

data

Example

 

Healthcare

 

Applications

(87)

87

Benefits

 

of

 

Health

 

Care

 

Analytics

Better

 

diagnosis

Better

 

health

 

care

 

delivery

Better

 

value

 

for

 

patient,

 

provider

 

and

 

payer

Better

 

innovation

(88)

Big

 

Data

 

Analysis

 

Challenge

Distributed Programming Skills Domain Expertise Machine Learning Background

(89)

89

Big

 

Data

 

Analytics:

 

Barriers

Cost

 

of

 

analytics

Lack

 

of

 

skilled

 

talent

Difficult

 

to

 

architect

 

a

 

Big

 

Data

 

Solutions

Big

 

data

 

scalability

 

issues

Limited

 

capability

 

of

 

existing

 

database

 

analytics

Tangible

 

business

 

justification

 

(90)

Thank

 

you!

Questions?

References

Related documents

The main wall of the living room has been designated as a &#34;Model Wall&#34; of Delta Gamma girls -- ELLE smiles at us from a Hawaiian Tropic ad and a Miss June USC

Overnight documents to notaries are typically sent Priority Overnight – No signature required (depending on timing requirements). a) Client is to include a return overnight label

Specifically, we test whether, conditional on the economic and political fundamentals of rated countries, credit rating agencies assign better ratings to their

From 1990 through 1999 almost 3.2 billion guilders from the Netherlands’ budget for development assistance were spent on relief of the external debt of developing countries. A

The objective of the investigation should be to identify and locate, both horizontally and vertically, significant soil and rock types and ground water conditions present within a

It will ensure record of all rented machineries and equipment as per project with help of proper daily based data entry.. Resources includes mainly three

On the single objective problem, the sequential metamodeling method with domain reduction of LS-OPT showed better performance than any other method evaluated. The development of

TITLE I—ASBESTOS CLAIMS RESOLUTION Subtitle A—Office of Asbestos Disease Compensation Sec.. Establishment of Office of Asbestos