UNIVERSITY OF CALIFORNIA, SAN DIEGO SAN DIEGO SUPERCOMPUTER CENTER
Introduction to Predictive Analytics
Tools: Weka, R

Predictive Analytics Center of Excellence
San Diego Supercomputer Center
University of California, San Diego

Available Data Mining Tools
• COTS:
  • IBM Intelligent Miner
  • SAS Enterprise Miner
  • Oracle ODM
  • Microstrategy
  • Microsoft DBMiner
  • Pentaho
  • Matlab
  • Teradata
• Open Source:
  • WEKA
  • KNIME
  • Orange
  • RapidMiner
  • NLTK
  • R
  • Rattle

Agenda
• WEKA
  • Intro and background
  • Data Preparation
  • Creating Models / Applying Algorithms
  • Evaluating Results
• R
  • R Background
  • R Basics
  • Outline
  • R-Studio Overview
  • Hands On (homework)

Download and Install WEKA
• Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html

What is WEKA?
• Waikato Environment for Knowledge Analysis
  • WEKA is a data mining/machine learning application developed by the Department of Computer Science, University of Waikato, New Zealand
  • WEKA is open source software written in Java
  • WEKA is a collection of machine learning algorithms and tools for data mining tasks
    – data pre-processing, classification, regression, clustering, association, and visualization
  • WEKA is well-suited for developing new machine learning schemes
• The weka is also a flightless bird found only in New Zealand

Advantages of Weka
• Free availability
  • under the GNU General Public License
• Portability
  • fully implemented in the Java programming language and thus runs on almost any modern computing platform
  • Windows, Mac OS X and Linux
• Comprehensive collection of data preprocessing and modeling techniques
  • Supports standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection
• Easy-to-use GUI
• Provides access to SQL databases
  • using Java Database Connectivity, and can process the result returned by a database query

Disadvantages
• Sequence modeling is not covered by the algorithms included in the Weka distribution

WEKA Walk Through: Main GUI
• Three graphical user interfaces
  • “The Explorer” (exploratory data analysis)
    – pre-process data
    – build “classifiers”
    – cluster data
    – find associations
    – attribute selection
    – data visualization
  • “The Experimenter” (experimental environment)
    – used to compare performance of different learning schemes
  • “The KnowledgeFlow” (new process-model-inspired interface)
    – Java-Beans-based interface for setting up and running machine learning experiments
• Command line interface (“Simple CLI”)

WEKA: Explorer: Preprocess
• Importing data
  • Data format
    – Uses flat text files to describe the data
  • Data can be imported from a file in various formats:
    – ARFF, CSV, C4.5, binary
  • Data can also be read from a URL or from an SQL database (using JDBC)

WEKA: ARFF file format
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
A more thorough description is available here
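
As a rough illustration of the ARFF layout shown above, the following R sketch writes a small data frame out as ARFF text using base R only. The helper name `to_arff` is hypothetical (it is not part of WEKA or R), and the sketch assumes only numeric and nominal columns; missing values are written as `?`, as in the listing.

```r
# Hypothetical helper: build ARFF text from a data frame (base R only).
# Numeric columns become "numeric"; all other columns become nominal {a,b,...}.
to_arff <- function(df, relation) {
  header <- vapply(names(df), function(name) {
    column <- df[[name]]
    type <- if (is.numeric(column)) {
      "numeric"
    } else {
      paste0("{", paste(levels(factor(column)), collapse = ","), "}")
    }
    paste("@attribute", name, type)
  }, character(1))

  # Convert each column to text and write missing values as "?"
  columns <- lapply(df, function(column) {
    text <- as.character(column)
    text[is.na(text)] <- "?"
    text
  })
  rows <- do.call(paste, c(unname(columns), sep = ","))

  c(paste("@relation", relation), unname(header), "@data", rows)
}

# The four instances from the listing above
heart <- data.frame(
  age = c(63, 67, 67, 38),
  sex = factor(c("male", "male", "male", "female"),
               levels = c("female", "male")),
  chest_pain_type = factor(
    c("typ_angina", "asympt", "asympt", "non_anginal"),
    levels = c("typ_angina", "asympt", "non_anginal", "atyp_angina")),
  cholesterol = c(233, 286, 229, NA),
  exercise_induced_angina = factor(c("no", "yes", "yes", "no"),
                                   levels = c("no", "yes")),
  class = factor(c("not_present", "present", "present", "not_present"),
                 levels = c("present", "not_present"))
)

arff_lines <- to_arff(heart, "heart-disease-simplified")
writeLines(arff_lines)
```

The output can be redirected to a file with `writeLines(arff_lines, "some_file.arff")`, where the file name is a placeholder.
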

WEKA: Explorer: Preprocess
• Preprocessing data
  • Visualization
  • Filtering algorithms
    – filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria
  • Removing noisy data
  • Adding additional attributes
  • Removing attributes

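The filtering ideas on the Preprocess slide (discretizing a numeric attribute, deleting instances and attributes) can be sketched outside WEKA as well. The R snippet below is only an analogy in base R, not WEKA's filter implementation; the bin boundaries and labels are illustrative assumptions.

```r
# Toy data: ages and cholesterol values taken from the ARFF example
patients <- data.frame(
  age = c(63, 67, 67, 38),
  cholesterol = c(233, 286, 229, NA)
)

# Discretization: turn the numeric attribute into a nominal one.
# Breaks are assumed for illustration; intervals are (0,40], (40,60], (60,Inf].
patients$age_group <- cut(patients$age,
                          breaks = c(0, 40, 60, Inf),
                          labels = c("young", "middle", "older"))

# Delete instances according to a criterion: drop rows with missing cholesterol
complete_patients <- patients[!is.na(patients$cholesterol), ]

# Remove an attribute: drop the original numeric age column
reduced <- complete_patients[, setdiff(names(complete_patients), "age")]

print(patients)
print(reduced)
```
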
WEKA: Explorer: Preprocess
• The Preprocess panel is used to define filters that transform the data
• WEKA contains filters for:
  • Discretization, normalization, resampling, attribute selection, transforming, combining attributes, etc.

Explorer: Visualize
• Visualization is very useful in practice
  • helps determine the difficulty of the learning problem
• WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
• Color-coded class values
• “Jitter” option to deal with nominal attributes (and to detect “hidden” data points)
• “Zoom-in” function

Explorer: Attribute Selection
• Panel that can be used to investigate which (subsets of) attributes are the most predictive ones
• Attribute selection methods contain two parts:
  • A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
  • An evaluation method: correlation-based, wrapper, information gain, chi-squared, …
• Very flexible: WEKA allows (almost) arbitrary combinations of these two

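To make one of the evaluation methods named on the Attribute Selection slide concrete, here is a minimal base-R sketch of information gain for nominal attributes, combined with the simplest search method (ranking). The function names and the toy data are assumptions made for illustration; this is not WEKA's code.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain = class entropy minus the weighted entropy
# of the class within each value of the attribute
info_gain <- function(attribute, class) {
  groups <- split(class, attribute)
  weights <- vapply(groups, length, integer(1)) / length(class)
  conditional <- sum(weights * vapply(groups, entropy, numeric(1)))
  entropy(class) - conditional
}

# Made-up toy data: "outlook" predicts the class perfectly, "windy" barely at all
play    <- c("yes", "yes", "no", "no", "yes", "no")
outlook <- c("sunny", "sunny", "rainy", "rainy", "sunny", "rainy")
windy   <- c("a", "b", "a", "b", "a", "b")

gains <- c(outlook = info_gain(outlook, play),
           windy   = info_gain(windy, play))

# "Ranking" search: order the attributes by their score
print(sort(gains, decreasing = TRUE))
```
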
WEKA: Explorer: Building “classifiers”
• Classifiers in WEKA are models for predicting nominal or numeric quantities
• Implemented learning schemes include:
  • Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:
  • Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …

WEKA: Explorer: Building “clusterers”
• WEKA contains “clusterers” for finding groups of similar instances in a dataset
• Implemented schemes are:
  • k-Means, EM, Cobweb, X-means, FarthestFirst
• Clusters can be visualized and compared to “true” clusters (if given)
• Evaluation is based on log-likelihood if the clustering scheme produces a probability distribution

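For readers who want to try the "compare to true clusters" idea from the clustering slide without the WEKA GUI, the sketch below runs k-means in R on the built-in iris data and cross-tabulates the clusters against the known species. It uses R's `kmeans` from the stats package, not WEKA's implementation; the seed and `nstart` value are arbitrary choices.

```r
set.seed(42)                      # k-means starts from random centers

features <- iris[, 1:4]           # the four numeric measurements
km <- kmeans(features, centers = 3, nstart = 10)

# Compare the discovered clusters with the "true" classes
comparison <- table(cluster = km$cluster, species = iris$Species)
print(comparison)
print(km$centers)
```

Cluster numbers are arbitrary labels, so the table is read by looking for one dominant species per row rather than for matching indices.
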
Explorer: Finding associations
• WEKA contains an implementation of the Apriori algorithm for learning association rules
  • Works only with discrete data
• Can identify statistical dependencies between groups of attributes:
  • milk, butter ⇒ bread, eggs (with confidence 0.9 and support 2000)
• Apriori can compute all rules that have a given minimum support and exceed a given confidence

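The two numbers attached to the rule on the associations slide can be computed by hand. The R sketch below counts support and confidence for "milk, butter ⇒ bread, eggs" on five made-up baskets; it illustrates the definitions only and does not implement the Apriori search itself. Support is reported as an absolute count, matching the slide's "support 2000".

```r
# Made-up transactions for illustration
transactions <- list(
  c("milk", "butter", "bread", "eggs"),
  c("milk", "butter", "bread", "eggs"),
  c("milk", "butter", "bread"),
  c("milk", "eggs"),
  c("bread", "eggs")
)

# Number of transactions that contain every item in `items`
count_support <- function(items) {
  sum(vapply(transactions,
             function(basket) all(items %in% basket),
             logical(1)))
}

lhs <- c("milk", "butter")
rhs <- c("bread", "eggs")

support_rule <- count_support(c(lhs, rhs))  # baskets containing both sides
support_lhs  <- count_support(lhs)          # baskets containing the left side
confidence   <- support_rule / support_lhs  # share of lhs baskets that also hold rhs

cat("support:", support_rule, " confidence:", round(confidence, 3), "\n")
```
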
References and Resources
• References:
  • WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
  • WEKA Tutorial:
    – Machine Learning with WEKA: A presentation demonstrating all graphical user interfaces (GUI) in Weka.
    – A presentation which explains how to use Weka for exploratory data mining.
  • WEKA Data Mining Book:
    – Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
  • WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page

Downloading R / R-Studio
• http://www.r-project.org/

What is R?
• An environment
  • R is an integrated suite of software facilities for data manipulation, calculation, and graphical display
  • Effective data handling and storage
  • Suite of operators for calculations on arrays
  • Large, coherent, integrated collection of intermediate tools for data analysis
  • Programming language, run time environment
• Based on the S language, which was developed at Bell Labs
• GNU open source software
  • Under the terms of the Free Software Foundation's GNU General Public License
• Open source implementation of the S language (S-Plus is the commercial implementation)
  • Well-developed, simple and effective programming language

R Features
• Software package designed for data analysis and graphical representation
• Interactive, but may also be used programmatically
• Platform independence
  • Compiles and runs on a wide variety of platforms: Unix-based systems, Windows, and Mac OS
• Free, open source code
• Engaged community
  • over 4,200 user-contributed packages
• Extendable
  • User-defined functions
  • > 4,000 packages available in the CRAN package repository
  • Supports extensions / add-ons (e.g., rApache)
  • Compatible with other languages (e.g., SQL, Perl, C)
  • Data import
    – Pre-processing data from different sources
• Scalability

R packages for DM
• Clustering
• Classification
• Association Rules
• Sequential patterns
• Time Series
• Statistics
• Graphics
• Data manipulation

Data Mining
• linear models (lm)
• generalized linear models (glm)
• generalized additive models (gam)
• linear mixed effects models (lme)
• quantile regression (rq)
• vector generalized additive models (vgam)
• lasso, ridge, and elastic net models (glmnet)
• non-linear models (nlm)
• non-linear mixed effects models (nlmer)
• linear discriminant analysis (lda)
• quadratic discriminant analysis (qda)
• trees (tree)
• random forests (randomForest)
• support vector machines (svm)
• neural networks (nnet)
• k-nearest neighbors (knn)

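Most of the modelling functions listed on the Data Mining slide share the same formula interface. As a small sketch, the first two entries (`lm` and `glm`, both in base R) can be tried on the built-in `mtcars` data; the choice of variables is only an illustration.

```r
# Linear model: fuel economy as a function of car weight
fit_lm <- lm(mpg ~ wt, data = mtcars)
print(coef(fit_lm))

# Generalized linear model (logistic regression):
# probability of a manual transmission (am = 1) as a function of weight
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability for a hypothetical car with wt = 3 (3,000 lbs)
p_manual <- predict(fit_glm, newdata = data.frame(wt = 3), type = "response")
print(p_manual)
```

The other entries follow the same `function(formula, data = ...)` pattern, but several live in add-on packages that have to be installed and loaded first.
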
Big Data Options
• lapply-based parallelism
  • multicore library
  • snow library
• foreach-based parallelism
  • doMC backend
  • doSNOW backend
  • doMPI backend
• Map/Reduce- (Hadoop-) based parallelism
  • Hadoop streaming with R mappers/reducers
  • RHadoop (rmr, rhdfs, rhbase)
  • RHIPE
• Poor-man's parallelism
  • lots of Rs running
  • lots of input files
• Hands-off parallelism
  • OpenMP support compiled into R build

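A minimal sketch of the lapply-based option from the Big Data Options slide: the `parallel` package that ships with R provides `mclapply`, a drop-in parallel variant of `lapply` that originated in the multicore library. The worker function and the core count below are assumptions for illustration; forking is not available on Windows, so the sketch falls back to one core there.

```r
library(parallel)

# Forked workers are unavailable on Windows, so use a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Stand-in for an expensive per-item computation
square <- function(x) x^2

serial_result   <- lapply(1:8, square)
parallel_result <- mclapply(1:8, square, mc.cores = cores)

# Same results, in the same order, as the serial version
print(identical(serial_result, parallel_result))
```
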
R Considerations/Limitations
• Command line interface
• Performance
• Memory limits
  • memory limits depend on the build (32-bit vs. 64-bit)
  • for a 32-bit build of R on Windows, the limit also depends on the underlying OS version
• Syntax “curiosities”

R-Studio Overview
• http://www.rstudio.com/ide/download/
• R-Studio is an integrated development environment for R.
• R-Studio runs in two ways:
  • Desktop version for Linux, Mac, Windows: single user, perfect for a laptop or desktop machine
  • Server version for Linux: allows a number of remote users to run R-Studio within a web browser, and facilitates sharing of code and data among team members

[Screenshot: R-Studio window layout]
• Editor window
• Console: run R commands
• Project window: currently loaded workspace and history
• “Pop-up” multi-tab display: shows graphics, current directory, and loaded packages

The Fundamentals
• Launch R
• Quit R
  • q()
• Getting help
  • help(topic) or ?topic or help.start()
  • example(topic)
  • ??keyword

The Basics
• R environment commands
  • list objects
    – ls()
    – objects()
  • list files in current directory
    – list.files()
  • show current working directory
    – getwd()
  • set working directory
    – setwd()
  • remove objects
    – rm()
• Workspace versus console
  • Clear workspace
    – rm(list=ls())
  • Clear console

The Basics (Naming Variables)
• Requirements
  • Case sensitive; names must start with a letter or '.'
  • Only letters, numbers, underscores, and '.'s
• Special keywords
  • break, else, FALSE, for, function, if, Inf, NA, NaN, next, repeat, return, TRUE, while

The Basics
• All entities in R are called “objects”
  • arrays, vectors, matrices, functions, lists, data frames, factors
• Expressions vs. assignments
  • 10+10
  • my.age <- 23
  • my.age < - 23 (note the added space: this is a comparison, not an assignment)
  • age <- c(my.age, 14, 59, 32)
  • my.age == 40
• Data types
  • Numeric, Integer, Complex, Logical, Character
• Function call

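The expressions on the slide above are worth running, because the stray space in `< -` silently changes an assignment into a comparison. A short R session sketch using the slide's own values:

```r
10 + 10                        # an expression: evaluated and printed, not stored

my.age <- 23                   # assignment: stores 23 in my.age

is_less <- my.age < - 23       # with the space this compares my.age with -23
print(is_less)                 # FALSE, and my.age is unchanged

age <- c(my.age, 14, 59, 32)   # combine values into a numeric vector
is_forty <- my.age == 40       # equality test, not assignment
print(is_forty)

# The basic data types named on the slide
print(class(23))       # numeric
print(class(23L))      # integer
print(class(2 + 3i))   # complex
print(class(TRUE))     # logical
print(class("text"))   # character
```
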
Summary of Data Structures

               Linear    Rectangular
Homogeneous    Vectors   Matrices
Heterogeneous  Lists     Data Frames

• Vectors and matrices must contain a single data type
• Character type trumps numeric: when the two are mixed, the numeric values are converted to character

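The homogeneous/heterogeneous distinction in the data-structures summary is easy to check at the console. In this sketch (all values are made up), mixing a number and a string in a vector or matrix yields character data, while a list or data frame keeps each element's own type.

```r
# Homogeneous: one type only, so numbers are converted to character
mixed_vector <- c(1, "two", 3)
print(class(mixed_vector))             # "character"

mixed_matrix <- matrix(c(1, "a", 2, "b"), nrow = 2)
print(class(mixed_matrix[1, 1]))       # "character"

# Heterogeneous: each element or column keeps its own type
record <- list(id = 1, name = "two")
print(class(record$id))                # "numeric"
print(class(record$name))              # "character"

people <- data.frame(id = c(1, 2), name = c("a", "b"))
print(class(people$id))                # "numeric"
```
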
The Basics (Functions)
• Basic functions
  • mean(age)
  • sd(age)
  • sqrt(var(age))
  • TIP: to list all functions in the search path
    – sapply(search(), ls, all.names = TRUE)
• User-defined functions
  • score <- function(age) age * 10
• Using the correct functions for the given data type

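A short sketch tying the Functions slide together, using the `age` vector from the earlier slide. The `score` function and its `multiplier` argument are illustrative names, not something defined by the course material.

```r
age <- c(23, 14, 59, 32)

# Built-in functions: sd() is the square root of var()
print(mean(age))
print(sd(age))
print(sqrt(var(age)))

# A user-defined function with a default argument
score <- function(x, multiplier = 10) {
  x * multiplier
}
scores <- score(age)
print(scores)

# Using the right function for the data type:
# mean() needs numbers, so text has to be converted first
age_text <- c("23", "14", "59", "32")
wrong <- suppressWarnings(mean(age_text))   # NA: the input is not numeric
right <- mean(as.numeric(age_text))
print(c(wrong = wrong, right = right))
```
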
Function Components
writeLines(text = "text", con = stdout(), sep = "\n", useBytes = FALSE)
• function name: writeLines in writeLines("146.6", "popRate.txt", sep = "\n")
• parentheses: ( and ) enclose the argument list
• commas: separate the arguments
• first argument: "146.6" (matched to text)
• second argument: "popRate.txt" (matched to con)
• named argument: sep = "\n"

Importing Data/Exporting Data
• Flat files
  • Import: > AHW <- read.csv("AHW_1.csv", header=TRUE)
            > weatherdata <- read.table(file="C:/work/DM1/weather.csv", header=TRUE, sep=",")
            > USTemps <- read.table(file=file.choose(), header=TRUE)
  • Export: > write.csv(AHW, file="AHW_out.csv", row.names=FALSE) (AHW_out.csv is a placeholder file name)
• Databases
  • Import
    – > connection <- dbConnect(driver, user, password, host, dbname)
      > AHW <- dbGetQuery(connection, "SELECT * FROM AHW")
  • Export
    – > connection <- dbConnect(driver, user, password, host, dbname)
      > dbWriteTable(connection, "AHW", AHW)
• R objects
  • Import: > load("AHW.Rdata")
  • Export: > save(AHW, file="New_AHW.Rdata")
• Web
  • > connection <- url("http://pace.sdsc.edu/sites/default/bootcamp/images/AHW_1.csv")
  • > AHW <- read.csv(connection, header=TRUE)
• Plots (call dev.off() after plotting to write the file)
  • > png(filename="C:/R/figure.png", height=295, width=300, bg="white")
  • > pdf(file="C:/R/figure.pdf", height=3.5, width=5)

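The import and export calls on the slide above can be exercised end to end without any course files. The sketch below is a round trip through temporary files; the data values and column names are made up, and temporary paths stand in for the slide's file names.

```r
# Made-up stand-in for the AHW data set
AHW <- data.frame(age = c(23, 14, 59), height = c(170, 150, 165))

# Flat file: export to CSV, then import it again
csv_path <- tempfile(fileext = ".csv")
write.csv(AHW, file = csv_path, row.names = FALSE)
AHW_csv <- read.csv(csv_path, header = TRUE)

# R objects: save() stores the object under its name, load() restores it
rdata_path <- tempfile(fileext = ".RData")
save(AHW, file = rdata_path)
rm(AHW)
load(rdata_path)          # AHW exists again

# Plots: open a device, draw, then close it so the file is written
pdf_path <- tempfile(fileext = ".pdf")
pdf(file = pdf_path, height = 3.5, width = 5)
plot(AHW$age, AHW$height, xlab = "age", ylab = "height")
invisible(dev.off())

print(AHW_csv)
```
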
[Screenshot: data import dialog]
• Name of data frame to be created with imported data
• Options for parsing the text data into fields and values
• How data frame will look once the data are imported

Extending R
• http://cran.r-project.org/web/packages/
• Install a package
  • from command line
    – > install.packages("name_of_package")
  • from GUI
    – Packages & Data > Package Installer
• Load library (to use installed package)
  • > library(name_of_package)
  • Example
    – > library(markdown)
• Use library function
  • > function_name(parameters)
  • Example
    – > markdownToHTML("example.md")

More Information
• The R Manuals
  • http://www.stat.berkeley.edu/~spector/R.pdf
• An Introduction to R
  • http://cran.r-project.org/doc/manuals/R-intro.html
  • http://tryr.codeschool.com/
• Books

Other Resources