
UNIVERSITY OF CALIFORNIA, SAN DIEGO SAN DIEGO SUPERCOMPUTER CENTER

Introduction to Predictive Analytics Tools: Weka, R

Predictive Analytics Center of Excellence
San Diego Supercomputer Center
University of California, San Diego

Available Data Mining Tools

• COTS (commercial off-the-shelf):
  • IBM Intelligent Miner
  • SAS Enterprise Miner
  • Oracle ODM
  • Microstrategy
  • Microsoft DBMiner
  • Pentaho
  • Matlab
  • Teradata
• Open Source:
  • WEKA
  • KNIME
  • Orange
  • RapidMiner
  • NLTK
  • R
  • Rattle

Agenda

• WEKA
  • Intro and background
  • Data Preparation
  • Creating Models / Applying Algorithms
  • Evaluating Results
• R
  • R Background
  • R Basics
  • Outline
  • R-Studio Overview
  • Hands On (homework)

Download and Install WEKA

• Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html

What is WEKA?

• Waikato Environment for Knowledge Analysis
  • WEKA is a data mining/machine learning application developed by the Department of Computer Science, University of Waikato, New Zealand
  • WEKA is open source software written in Java
  • WEKA is a collection of machine learning algorithms and tools for data mining tasks
    • data pre-processing, classification, regression, clustering, association, and visualization
  • WEKA is well-suited for developing new machine learning schemes
• The weka is also a bird found only in New Zealand.

Advantages of Weka

• Free availability
  • under the GNU General Public License
• Portability
  • fully implemented in the Java programming language and thus runs on almost any modern computing platform
  • Windows, Mac OS X and Linux
• Comprehensive collection of data preprocessing and modeling techniques
  • Supports standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection
• Easy to use GUI
• Provides access to SQL databases
  • using Java Database Connectivity and can process the result returned

Disadvantages

• Sequence modeling is not covered by the algorithms included in the Weka distribution

WEKA Walk Through: Main GUI

• Three graphical user interfaces
  • “The Explorer” (exploratory data analysis)
    • pre-process data
    • build “classifiers”
    • cluster data
    • find associations
    • attribute selection
    • data visualization
  • “The Experimenter” (experimental environment)
    • used to compare performance of different learning schemes
  • “The KnowledgeFlow” (new process-model-inspired interface)
    • Java-Beans-based interface for setting up and running machine learning experiments
• Command line interface (“Simple CLI”)

7/1/14

WEKA: Explorer: Preprocess

• Importing data
  • Data format
    • Uses flat text files to describe the data
  • Data can be imported from a file in various formats:
    • ARFF, CSV, C4.5, binary
  • Data can also be read from a URL or from an SQL database (using JDBC)

WEKA: ARFF file format

@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

• A more thorough description is available here
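Since the second half of this deck covers R, the ARFF layout above can be illustrated with a small base-R sketch. The helper `read_simple_arff` is hypothetical (it is not part of Weka or R) and assumes a dense ARFF file with unquoted attribute names, where `?` marks a missing value, as in the example above.

```r
# The example ARFF content from the slide, one string per line.
arff_text <- c(
  "@relation heart-disease-simplified",
  "@attribute age numeric",
  "@attribute sex { female, male}",
  "@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}",
  "@attribute cholesterol numeric",
  "@attribute exercise_induced_angina { no, yes}",
  "@attribute class { present, not_present}",
  "@data",
  "63,male,typ_angina,233,no,not_present",
  "67,male,asympt,286,yes,present",
  "67,male,asympt,229,yes,present",
  "38,female,non_anginal,?,no,not_present"
)

# Hypothetical helper: handles only simple, dense ARFF files.
read_simple_arff <- function(lines) {
  lines <- trimws(lines)
  attr_lines <- grep("^@attribute", lines, ignore.case = TRUE, value = TRUE)
  # The attribute name is the second whitespace-separated token.
  attr_names <- vapply(strsplit(attr_lines, "[[:space:]]+"), `[`, character(1), 2)
  data_start <- grep("^@data", lines, ignore.case = TRUE)[1]
  data_lines <- lines[(data_start + 1):length(lines)]
  # The data section is plain CSV; "?" is read as a missing value (NA).
  read.csv(text = paste(data_lines, collapse = "\n"), header = FALSE,
           col.names = attr_names, na.strings = "?", stringsAsFactors = TRUE)
}

heart <- read_simple_arff(arff_text)
str(heart)
```

The header lines supply the column names and the `@data` section is ordinary comma-separated text, which is why a CSV reader is enough for this simple case.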

WEKA: Explorer: Preprocess

• Preprocessing data
  • Visualization
  • Filtering algorithms
    • filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria.
  • Removing Noisy Data
  • Adding Additional Attributes
  • Remove Attributes

WEKA: Explorer: Preprocess

• Used to define filters to transform data.
• WEKA contains filters for:
  • Discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.
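Two of the filter types listed above, discretization and normalization, can be sketched in base R to show what such a transformation does to a numeric attribute. This is not Weka's implementation; the values are made up and the three bin labels are an arbitrary choice for illustration.

```r
# Toy numeric attribute (values invented for illustration).
cholesterol <- c(233, 286, 229, 250, 204, 354)

# Discretization: equal-width binning into three labeled intervals.
chol_bins <- cut(cholesterol, breaks = 3, labels = c("low", "medium", "high"))

# Normalization: min-max rescaling to the [0, 1] range.
chol_norm <- (cholesterol - min(cholesterol)) /
  (max(cholesterol) - min(cholesterol))

data.frame(cholesterol, chol_bins, chol_norm)
```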


Explorer: Visualize

• Visualization is very useful in practice
  • helps determine the difficulty of the learning problem
• WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
• Color-coded class values
• “Jitter” option to deal with nominal attributes (and to detect “hidden” data points)
• “Zoom-in” function

Explorer: Attribute Selection

• Panel that can be used to investigate which (subsets of) attributes are the most predictive ones
• Attribute selection methods contain two parts:
  • A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
  • An evaluation method: correlation-based, wrapper, information gain, chi-squared, …
• Very flexible: WEKA allows (almost) arbitrary combinations of these two
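To make one of the evaluation methods named above concrete, here is a minimal base-R sketch of information gain combined with a simple ranking. The functions `entropy` and `info_gain` and the toy data are assumptions for illustration, not Weka code.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a nominal attribute with respect to the class:
# class entropy minus the weighted entropy within each attribute value.
info_gain <- function(attribute, labels) {
  values <- unique(attribute)
  weights <- sapply(values, function(v) mean(attribute == v))
  cond <- sapply(values, function(v) entropy(labels[attribute == v]))
  entropy(labels) - sum(weights * cond)
}

# Toy data, invented for illustration.
toy <- data.frame(
  exercise_induced_angina = c("no", "yes", "yes", "no", "no", "yes"),
  sex   = c("male", "male", "female", "female", "male", "female"),
  class = c("not_present", "present", "present",
            "not_present", "not_present", "present")
)

# Evaluate each attribute, then rank (the "search" part is a plain sort here).
gains <- sapply(toy[c("exercise_induced_angina", "sex")],
                info_gain, labels = toy$class)
sort(gains, decreasing = TRUE)
```

In this toy data the first attribute separates the classes perfectly, so it ranks above `sex`.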

WEKA: Explorer: building “classifiers”

• Classifiers in WEKA are models for predicting nominal or numeric quantities
• Implemented learning schemes include:
  • Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:
  • Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …

WEKA: Explorer: building “clusterers”

• WEKA contains “clusterers” for finding groups of similar instances in a dataset
• Implemented schemes are:
  • k-Means, EM, Cobweb, X-means, FarthestFirst
• Clusters can be visualized and compared to “true” clusters (if given)
• Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution
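As a rough illustration of comparing found clusters with "true" clusters, the sketch below runs k-means from R's built-in `stats` package on synthetic data. It is not Weka's SimpleKMeans; the data, group labels and seed are assumptions.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Two well-separated synthetic groups in 2-d.
group_a <- cbind(x = rnorm(20, mean = 0),  y = rnorm(20, mean = 0))
group_b <- cbind(x = rnorm(20, mean = 10), y = rnorm(20, mean = 10))
pts <- rbind(group_a, group_b)
true_cluster <- rep(c("a", "b"), each = 20)

fit <- kmeans(pts, centers = 2, nstart = 10)

# Cross-tabulate the found clusters against the known groups.
tab <- table(found = fit$cluster, truth = true_cluster)
tab
```

Cluster numbers are arbitrary, so the comparison looks at whether each found cluster lines up with a single true group rather than at the labels themselves.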

Explorer: Finding associations

• WEKA contains an implementation of the Apriori algorithm for learning association rules
  • Works only with discrete data
• Can identify statistical dependencies between groups of attributes:
  • milk, butter => bread, eggs (with confidence 0.9 and support 2000)
• Apriori can compute all rules that have a given minimum support and exceed a given confidence
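The two measures mentioned above, support and confidence, can be computed by hand for a single rule. The base-R sketch below does only that; it does not implement the Apriori search itself, and the basket data are invented.

```r
# Toy market-basket data: one row per transaction, TRUE if the item was bought.
baskets <- data.frame(
  milk   = c(TRUE, TRUE, TRUE,  FALSE, TRUE),
  butter = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  bread  = c(TRUE, TRUE, FALSE, TRUE,  FALSE)
)

# Rule under test: {milk, butter} => {bread}
lhs <- baskets$milk & baskets$butter
rhs <- baskets$bread

# Support (as a count): transactions containing every item in the rule.
support_count <- sum(lhs & rhs)

# Confidence: share of left-hand-side transactions that also contain the right-hand side.
confidence <- sum(lhs & rhs) / sum(lhs)

c(support = support_count, confidence = confidence)
```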

References and Resources

• References:
  • WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
  • WEKA Tutorial:
    • Machine Learning with WEKA: A presentation demonstrating all graphical user interfaces (GUI) in Weka.
    • A presentation which explains how to use Weka for exploratory data mining.
  • WEKA Data Mining Book:
    • Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
  • WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page

Downloading R / R Studio

• http://www.r-project.org/

What is R?

• An Environment
  • R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
    • Effective data handling and storage
    • Suite of operators for calculations on arrays
    • Large, coherent, integrated collection of intermediate tools for data analysis
    • Programming language, run time environment
  • Based on the S language, which was developed at Bell Labs
• GNU open source software
  • Under the terms of the Free Software Foundation's GNU General Public License
  • Open source implementation of the S language (S-PLUS is the commercial implementation)
• Well-developed, simple and effective programming language

R Features

• Software package designed for data analysis and graphical representation
• Interactive, but may also be used programmatically
• Platform independence
  • Compiles and runs on a wide variety of platforms: Unix-based systems, Windows and MacOS
• Free, open source code
• Engaged community
  • over 4,200 user-contributed packages
• Extendable
  • User defined functions
    • > 4000 packages available in the CRAN package repository
  • Supports extensions / add-ons (e.g., rApache)
  • Compatible with other languages (e.g., SQL, perl, C)
• Data Import
  • Pre-processing data from different sources
• Scalability

R packages for DM

• Clustering
• Classification
• Association Rules
• Sequential patterns
• Time Series
• Statistics
• Graphics
• Data manipulation

Data Mining

• linear models (lm)
• generalized linear models (glm)
• generalized additive models (gam)
• linear mixed effects models (lme)
• quantile regression (rq)
• vector generalized additive models (vgam)
• lasso, ridge, and elastic net models (glmnet)
• non-linear models (nlm)
• non-linear mixed effects models (nlmer)
• linear discriminant analysis (lda)
• quadratic discriminant analysis (qda)
• trees (tree)
• random forests (randomForest)
• support vector machines (svm)
• neural networks (nnet)
• k-nearest neighbors (knn)

Big Data Options

• lapply-based parallelism
  • multicore library
  • snow library
• foreach-based parallelism
  • doMC backend
  • doSNOW backend
  • doMPI backend
• Map/Reduce- (Hadoop-) based parallelism
  • Hadoop streaming with R mappers/reducers
  • Rhadoop (rmr, rhdfs, rhbase)
  • RHIPE
• Poor-man's Parallelism
  • lots of Rs running
  • lots of input files
• Hands-off Parallelism
  • OpenMP support compiled into R build
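The lapply-based option can be sketched with the `parallel` package that ships with R and builds on the multicore and snow libraries. The bootstrap task and the choice of two local workers are assumptions for illustration; the point is only that the parallel call has the same shape as `lapply`.

```r
library(parallel)  # ships with R

# A deliberately simple task: the mean of one bootstrap resample.
boot_mean <- function(seed, x) {
  set.seed(seed)   # seed per task, so serial and parallel runs agree
  mean(sample(x, replace = TRUE))
}

x <- c(23, 14, 59, 32)

# Serial baseline with lapply.
serial <- lapply(1:8, boot_mean, x = x)

# Same call shape, spread over two local worker processes.
cl <- makeCluster(2)
parallel_res <- parLapply(cl, 1:8, boot_mean, x = x)
stopCluster(cl)

identical(serial, parallel_res)
```

Always stopping the cluster matters in practice, since the worker processes otherwise keep running after the computation is done.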


R Considerations/Limitations

• Command Line Interface
• Performance
• Memory Limits
  • memory limits depend on the build (32-bit vs. 64-bit)
  • the memory limit of a 32-bit build of R on Windows depends on the underlying OS version
• Syntax “curiosities”

R-Studio Overview

http://www.rstudio.com/ide/download/

R-Studio is an integrated development environment to support R code.

R-Studio runs in two ways:
• Desktop version for Linux, Mac, Windows: single user, perfect for a laptop or desktop machine
• Server version for Linux: allows a number of remote users to run R-Studio within a web browser, and facilitates sharing of code and data among team members

[Figure: R-Studio window with callouts. Editor window. Console: run R commands. Project window: currently loaded workspace and history. Multi-tab display (“pop-up”): shows graphics, current directory and loaded packages.]

The Fundamentals

• Launch R
• Quit R
  • q()
• Getting Help
  • help(topic), ?topic or help.start()
  • example(topic)
  • ??keyword

The Basics

• R environmental commands
  • list objects
    • ls()
    • objects()
  • list files in current directory
    • list.files()
  • show current working directory
    • getwd()
  • set working directory
    • setwd()
  • remove objects
    • rm()
• Workspace versus console
  • Clear workspace
    • rm(list=ls())
  • Clear console

The Basics (Naming Variables)

• Requirements
  • Case sensitive; names must start with a letter or '.'
  • Only letters, numbers, underscores and '.'s
• Special keywords
  • break, else, FALSE, for, function, if, Inf, NA, NaN, next, repeat, return, TRUE, while
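One way to check these naming rules is base R's `make.names()`, which rewrites a string into a syntactically valid name. The candidate names below are made up for illustration; a name that comes back unchanged was already valid.

```r
candidates <- c("my.age", "my_age2", ".hidden", "2nd.value", "my age", "if")

# make.names() returns a syntactically valid version of each name.
fixed <- make.names(candidates)

data.frame(candidate = candidates,
           valid     = candidates == fixed,
           suggested = fixed)
```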


The Basics

• All entities in R are called “objects”
  • arrays, vectors, matrices, functions, lists, data frames, factors
• Expressions vs. assignments
  • 10+10
  • my.age <- 23
  • my.age < - 23 (note the added space: this compares my.age with -23 instead of assigning)
  • age <- c(my.age, 14, 59, 32)
  • my.age == 40
• Data Types
  • Numeric, Integer, Complex, Logical, Character
• Function call
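The difference between an assignment and an expression is easiest to see by running the slide's lines. In this short sketch the variable names `spaced` and `is_forty` are additions for illustration, used only to capture the results.

```r
my.age <- 23                  # assignment: binds the value 23 to my.age
spaced <- my.age < - 23       # comparison: is my.age less than -23?

age <- c(my.age, 14, 59, 32)  # c() combines values into a numeric vector
is_forty <- my.age == 40      # == tests equality, it does not assign

print(spaced)
print(is_forty)
```

The stray space turns `<-` into the two operators `<` and `-`, so `my.age` keeps its value and the line yields a logical result instead.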

Summary of Data Structures

                 Linear     Rectangular
Homogeneous      Vectors    Matrices
Heterogeneous    Lists      Data Frames

• Vectors and matrices must contain the same data type
  • Character type will trump numeric: Values will be
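A small sketch of the homogeneous versus heterogeneous distinction, using made-up values: combining numbers and text in a vector or matrix yields character data, while a list keeps each element's own type.

```r
mixed <- c(1, 2, "three")                  # numeric and character combined
class(mixed)

m <- matrix(c(1, "a", 2, "b"), nrow = 2)   # the same rule applies to matrices
typeof(m)

# Lists (and data frames) can hold different types side by side.
record <- list(age = 23, name = "Ann")
sapply(record, class)
```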

The Basics (Functions)

• Basic functions
  • mean(age)
  • sd(age)
  • sqrt(var(age))
  • TIP: to list all functions in the search path
    sapply(search(), ls, all.names = TRUE)
• User Defined functions
  • Score <- age * 10
• Using the correct functions for the given data type
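As a minimal sketch of a user-defined function, the slide's `age * 10` calculation can be wrapped in a function. The name `score` and its `multiplier` argument are hypothetical, and the `stopifnot` guard is one simple way to insist on the right data type.

```r
age <- c(23, 14, 59, 32)

# Hypothetical user-defined function: scale a numeric vector.
score <- function(x, multiplier = 10) {
  stopifnot(is.numeric(x))   # guard against non-numeric input
  x * multiplier
}

score(age)

# sd() and sqrt(var()) from the slide give the same value.
all.equal(sd(age), sqrt(var(age)))
```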

Function Components

Signature: writeLines(text = "text", con = stdout(), sep = "\n", useBytes = FALSE)

Example call: writeLines("146.6", "popRate.txt", sep = "\n")

• function name: writeLines
• parentheses: ( ) enclose the arguments
• commas: separate the arguments
• first argument: "146.6"
• second argument: "popRate.txt"
• named argument: sep = "\n"
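The pieces of the call can be tried out directly. This sketch writes to a temporary file instead of `popRate.txt` (an assumption, so that no real file is overwritten) and shows that positional and named arguments produce the same result.

```r
out_file <- tempfile(fileext = ".txt")  # temporary path for the example

# Positional arguments fill `text` and `con` in order; `sep` is passed by name.
writeLines("146.6", out_file, sep = "\n")

# The same call with every argument named, in a different order.
writeLines(sep = "\n", con = out_file, text = "146.6")

readLines(out_file)
```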


Importing Data/Exporting Data

• Flat Files
  • Import:
    > AHW <- read.csv("AHW_1.csv", header=TRUE)
    > weatherdata <- read.table(file="C:/work/DM1/weather.csv", header=TRUE, sep=",")
  • Import via a file chooser:
    > USTemps <- read.table(file=file.choose(), header=TRUE)
• Databases
  • Import
    > connection <- dbConnect(driver, user, password, host, dbname)
    > AHW <- dbGetQuery(connection, "SELECT * FROM AHW")
  • Export
    > connection <- dbConnect(driver, user, password, host, dbname)
    > dbWriteTable(connection, "AHW", AHW)
• R objects
  • Import: > load('AHW.Rdata')
  • Export: > save(AHW, file="New_AHW.Rdata")
• Web
  > connection <- url('http://pace.sdsc.edu/sites/default/bootcamp/images/AHW_1.csv')
  > AHW <- read.csv(connection, header=TRUE)
• Plots
  > png(filename="C:/R/figure.png", height=295, width=300, bg="white")
  > pdf(file="C:/R/figure.pdf", height=3.5, width=5)
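A self-contained round trip may help tie the import and export calls together. In this sketch the data frame contents and column names are invented, and temporary files stand in for the paths on the slide.

```r
# Toy data frame standing in for the AHW data (values invented).
AHW <- data.frame(age = c(23, 14, 59), height = c(170, 155, 181))

# Flat file: export with write.csv(), import again with read.csv().
csv_path <- tempfile(fileext = ".csv")
write.csv(AHW, csv_path, row.names = FALSE)
AHW_csv <- read.csv(csv_path, header = TRUE)

# R object: save() writes it, load() restores it under its original name.
rdata_path <- tempfile(fileext = ".Rdata")
save(AHW, file = rdata_path)
rm(AHW)
load(rdata_path)

# Plot: open a PDF device, draw, then close the device to finish the file.
pdf_path <- tempfile(fileext = ".pdf")
pdf(file = pdf_path, height = 3.5, width = 5)
plot(AHW$age, AHW$height)
invisible(dev.off())
```

Calling `dev.off()` is easy to forget; until the device is closed, the plot file is incomplete.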


[Figure: data import dialog with callouts. Name of the data frame to be created with the imported data. Options for parsing the text data into fields and values. How the data frame will look once the data are imported.]

Extending R

http://cran.r-project.org/web/packages/

• Install a package
  • from command line
    > install.packages('name_of_package')
  • from GUI
    • Packages & Data > Package Installer
• Load Library (to use installed package)
  > library(name_of_package)
  • Example
    > library(markdown)
• Use Library Function
  > function_name(parameters)
  • Example
    > markdownToHTML("example.md")

More Information

• The R Manuals
  • http://www.stat.berkeley.edu/~spector/R.pdf
• An Introduction to R
  • http://cran.r-project.org/doc/manuals/R-intro.html
  • http://tryr.codeschool.com/
• Books


Other Resources


References
