# Machine Learning Algorithms - Summary + R Code

## Full text


### Supervised Learning Algorithms

- 1 Supervised Learning by Empirical Risk Minimization (ERM)
  - 1.1 Empirical Risk Minimization and Inductive Bias
  - 1.2 Ordinary Least Squares (OLS)
  - 1.3 Ridge Regression
  - 1.4 LASSO
  - 1.5 Logistic Regression
  - 1.6 Regression Classifier
  - 1.7 Linear Support Vector Machines (SVM)
  - 1.8 Generalized Additive Models (GAMs)
  - 1.9 Projection Pursuit Regression (PPR)
  - 1.10 Neural Networks (NNETs)
  - 1.11 Classification and Regression Trees (CARTs)
  - 1.12 Random Forests
  - 1.13 Rotation Forest
  - 1.14 Smoothing Splines
- 2 Non ERM Supervised Learning
  - 2.1 k-Nearest Neighbour (KNN)
  - 2.2 Kernel Regression
  - 2.3 Local Likelihood and Local ERM
  - 2.4 Boosting
  - 2.5 Learning Vector Quantizations (LVQ)
- 3 Dimensionality Reduction In Supervised Learning
  - 3.1 Variable Selection
  - 3.2 LASSO
  - 3.3 Principal Component Regression (PCAR)
  - 3.4 Partial Least Squares (PLS)
  - 3.5 Canonical Correlation Analysis (CCA)
  - 3.6 Reduced Rank Regression (RRR)
- 4 Generative Models In Supervised Learning
  - 4.1 Fisher's Linear Discriminant Analysis (LDA)
  - 4.2 Fisher's Quadratic Discriminant Analysis (QDA)
  - 4.3 Naive Bayes
- 5 Ensembles
  - 5.1 Committee Methods
  - 5.2 Bayesian Model Averaging
  - 5.3 Stacking
  - 5.4 Bootstrap Averaging (Bagging)
  - 5.5 Boosting

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Classification:

• Kernel Density Classification.

• Naive Bayes Classifier - has the form of a generalized additive model. The models are fit in quite different ways though.

• Mixture Models for Density Estimation and Classification - can be viewed as a kind of kernel method.


1.9 Projection Pursuit Regression (PPR)

Another way to generalize the hypothesis class F, which generalizes the GAM model, is to allow f to be a sum of simple functions of linear combinations of the predictors, of the form

$$f(x) = \sum_{m=1}^{M} g_m(w_m^\top x), \tag{1.9}$$

where both $g_m$ and $w_m$ are learned from the data. The regularization is now performed by choosing $M$ and the class of $\{g_m\}_{m=1}^{M}$.

Note: PPR is not a pure ERM. Just like in the GAM problem, in the PPR problem $\{g_m\}_{m=1}^{M}$ are learned by Kernel Regression. Solving the PPR problem is thus a hybrid of ERM and Kernel Regression algorithms.

Note: If $M$ is taken arbitrarily large, then for an appropriate choice of $g_m$ the PPR model can approximate any continuous function in $\mathbb{R}^p$ arbitrarily well. Such a class of models is called a universal approximator. However, this generality comes at a price. Interpretation of the fitted model is usually difficult, because each input enters into the model in a complex and multifaceted way. As a result, the PPR model is most useful for prediction, and not very useful for producing an understandable model for the data.

Notice also that the neural network model with one hidden layer has exactly the same form as the projection pursuit model described above. The difference is that the PPR model uses nonparametric functions $g_m(v)$, while the neural network uses a far simpler function based on $\mathrm{sigmoid}(v)$.
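As a rough illustration of the PPR model, the sketch below fits a one-term projection pursuit regression with `stats::ppr()` on simulated data. The toy data, the object names and the choice `nterms = 1` are assumptions made for this example only. Note also that `ppr()` smooths the ridge functions with Friedman's super smoother by default, not with the kernel regression described above.

```r
set.seed(1)
n  <- 200
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)

# Assumed toy truth: a smooth function of a single linear combination of the predictors
y   <- sin(2 * (x1 + x2)) + rnorm(n, sd = 0.1)
toy <- data.frame(y = y, x1 = x1, x2 = x2)

# nterms plays the role of M in the PPR model
ppr.toy <- ppr(y ~ x1 + x2, data = toy, nterms = 1)

# Estimated projection direction (defined up to sign)
ppr.toy$alpha

# Train MSE of the fitted model
ppr.mse <- mean((predict(ppr.toy, newdata = toy) - toy$y)^2)
ppr.mse
```

Increasing `nterms` adds more ridge functions, which makes the class richer and the fit harder to interpret.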

1.10 Neural Networks (NNETs) - Single Hidden Layer

We introduce the NNET model via the PPR model, and not through its historically original construction. In the language of Eq. (1.9), a single-layer feed-forward neural network is a model where $\{g_m\}_{m=1}^{M}$ are not learned from the data, but rather assumed a-priori:

$$g_m(x) := \beta_m \, \sigma(\alpha_m^\top x),$$

where only $\{\alpha_m, \beta_m\}_{m=1}^{M}$ are learned from the data. A typical activation function is the standard logistic CDF:

$$\sigma(t) = \frac{1}{1 + e^{-t}}.$$

As can be seen, the NNET is merely a non-linear regression model, the parameters of which are often called weights.

Loss Functions: Like any other ERM problem, we are free to choose the appropriate loss function.

Universal Approximator: Like the PPR, even when $\{g_m\}_{m=1}^{M}$ are fixed beforehand, the class is still a universal approximator.

Regularization: regularization of the model is done via the selection of $\sigma$, the number of nodes/variables in the network, and the number of layers.
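To make the single-hidden-layer form concrete, here is a minimal base-R sketch of the forward pass: each hidden unit applies the logistic function to a linear combination of the inputs, and the output is the weighted sum of the hidden units. The helper names `sigmoid` and `nnet_forward`, and the random weights, are hypothetical and only illustrate the formula; no fitting is done here.

```r
# Standard logistic CDF used as the activation function
sigmoid <- function(t) 1 / (1 + exp(-t))

# Hypothetical helper: f(x) = sum_m beta_m * sigmoid(alpha_m' x)
# X: n x p data matrix, alpha: p x M weight matrix, beta: length-M vector
nnet_forward <- function(X, alpha, beta) {
  hidden <- sigmoid(X %*% alpha)  # n x M matrix of hidden-unit activations
  drop(hidden %*% beta)           # length-n vector of predictions
}

set.seed(1)
p <- 3; M <- 4; n <- 5
X     <- matrix(rnorm(n * p), n, p)
alpha <- matrix(rnorm(p * M), p, M)  # one column of weights per hidden unit
beta  <- rnorm(M)

nnet_forward(X, alpha, beta)
```

In practice the weights are learned by minimizing the empirical risk, as done by `nnet()` in the R code section.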


1.11 Classification and Regression Trees (CARTs)

CARTs are a type of ERM where the hypothesis class includes very non-smooth functions that can be interpreted as "if-then" rules, also known as decision trees.

The hypothesis class of CARTs includes functions of the form

$$f(x) = \sum_{m=1}^{M} c_m \, I\{x \in R_m\}.$$

The parameters of the model are the different conditions $\{R_m\}_{m=1}^{M}$ and the function's value at each condition $\{c_m\}_{m=1}^{M}$.

Regularization: done by the choice of $M$, which is called the tree depth.

Loss Functions: As usual, a squared loss can be used for continuous outcomes y. For categorical outcomes, the loss function is called the impurity measure.

Impurity Measure: One can use either the misclassification error, the multinomial likelihood (known as the deviance, or cross-entropy), or a first order approximation of the latter known as the Gini Index.
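The three impurity measures can be written in a few lines of base R. This is a minimal sketch; the function names and the two example nodes are made up for illustration, and `p` is assumed to be the vector of class proportions within a node.

```r
# p: class proportions in a node (non-negative, summing to 1)
misclass_error <- function(p) 1 - max(p)
gini_index     <- function(p) sum(p * (1 - p))
cross_entropy  <- function(p) {
  p <- p[p > 0]        # treat 0 * log(0) as 0
  -sum(p * log(p))
}

pure  <- c(1, 0)       # a node containing a single class
mixed <- c(0.5, 0.5)   # a maximally impure two-class node

sapply(list(pure = pure, mixed = mixed), function(p)
  c(misclass = misclass_error(p),
    gini     = gini_index(p),
    entropy  = cross_entropy(p)))
```

All three measures are zero for the pure node and maximal for the evenly mixed node.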

Universal Approximator: CART is a universal approximator.

### Random Forests

Trees are very flexible hypothesis classes. They thus have small bias but large variance. Bagging trees reduces this variance by averaging trees fitted to different bootstrap samples. Alas, the variance (and thus the MSE) of bagged trees is bounded from below because the trees use the same variables, and are thus correlated. To remedy this, [Breiman, 2001] proposed to fit trees to bootstrapped samples, using only a random subset of variables. This decorrelates the trees, thus allowing a reduction in the variance of their average (and thus in its MSE).
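The idea can be sketched with `rpart` trees on the built-in `mtcars` data: each tree sees a bootstrap sample and a random subset of the predictors, and the forest prediction is the average over trees. This is only an illustration under stated assumptions: the data set, `B`, `n.vars` and all object names are choices made here, and the variable subset is drawn once per tree, whereas Breiman's random forest (as implemented in the `randomForest` package) draws a fresh subset at every split.

```r
library(rpart)
set.seed(1)

B          <- 25                               # number of trees
predictors <- setdiff(names(mtcars), "mpg")
n.vars     <- 3                                # assumed size of the random variable subset

trees <- lapply(seq_len(B), function(b) {
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]  # bootstrap sample
  vars <- sample(predictors, n.vars)                      # random subset of variables
  rpart(reformulate(vars, response = "mpg"), data = boot)
})

# Forest prediction = average of the individual tree predictions
tree.preds  <- sapply(trees, predict, newdata = mtcars)
forest.pred <- rowMeans(tree.preds)

# In-sample MSE of the averaged prediction
mean((forest.pred - mtcars$mpg)^2)
```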


### Unsupervised Learning

- 1 Introduction to Unsupervised Learning
- 2 Density Estimation
  - 2.1 Parametric Density Estimation
  - 2.2 Kernel Density Estimation
  - 2.3 Graphical Models
- 3 High Density Regions
  - 3.1 Association Rules
- 4 Linear-Space Embeddings
  - 4.1 Principal Components Analysis (PCA)
  - 4.2 Random Projections
  - 4.3 Sparse Principal Component Analysis (sPCA)
  - 4.4 Multidimensional Scaling (MDS)
  - 4.5 Local MDS
  - 4.6 Isometric Feature Mapping (Isomap)
- 5 Non-Linear-Space Embeddings
  - 5.1 Kernel Principal Component Analysis (kPCA)
  - 5.2 Self Organizing Maps (SOM)
  - 5.3 Principal Curves and Surfaces
  - 5.4 Local Linear Embedding (LLE)
  - 5.5 Auto Encoders
  - 5.6 Matrix Factorization
  - 5.7 Information Bottleneck
- 6 Latent Space Generative Models
  - 6.1 Factor Analysis (FA)
  - 6.2 Independent Component Analysis (ICA)
  - 6.3 Exploratory Projection Pursuit
  - 6.4 Compressed Sensing
  - 6.5 Generative Topographic Map (GTM)
  - 6.6 Finite Mixtures
  - 6.7 Hidden Markov Models (HMM)
  - 6.8 Latent Space Graphical Models
  - 6.9 Latent Dirichlet Allocation (LDA)
  - 6.10 Probabilistic Latent Semantic Indexing (PLSI)
  - 6.11 Prediction by Partial Matching (PPM)
  - 6.12 Dynamic Markov Compression (DMC)
- 7 Random Graph Models
  - 7.1 Erdős-Rényi
  - 7.2 Exchangeable Graph Model
  - 7.3 p1 Graph Model
  - 7.4 p2 Graph Model
  - 7.5 Stochastic Block Graph Model
  - 7.6 Latent Space Graph Model
  - 7.7 Exponential Random Graphs (ERGMs)
- 8 Cluster Analysis
  - 8.1 K-Means Clustering
  - 8.2 K-Medoids Clustering (PAM)
  - 8.3 Quality Threshold Clustering (QT)
  - 8.4 Hierarchical Clustering
  - 8.5 Fuzzy Clustering
  - 8.6 Self Organizing Maps (SOM)
  - 8.7 Spectral Clustering
  - 8.8 Bi Clustering

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### 3.1 Association Rules (Market Basket Analysis; Apriori algorithm)

Association rules, or market basket analysis, or affinity analysis, can be seen as approximating the joint distribution with a region-wise constant function.

Apriori Algorithm

Terminology

The algorithm:

(use dummy variables for 0/1 response = "in basket"/"Not in basket").

The first pass over the data computes the support (relative frequency) of all single-item sets. Those whose support is less than the threshold are discarded. The second pass computes the support of all item sets of size two that can be formed from pairs of the single items surviving the first pass. In other words, to generate all frequent itemsets with |K| = m, we need to consider only candidates such that all of their m ancestral item sets of size m − 1 are frequent. Those size-two item sets with support less than the threshold are discarded. Each successive pass over the data considers only those item sets that can be formed by combining those that survived the previous pass with those retained from the first pass. Passes over the data continue until all candidate rules from the previous pass have support less than the specified threshold.

> Example: suppose the item set K = {peanut butter, jelly, bread} and consider the rule {peanut butter, jelly} => {bread}. A support value of 0.03 for this rule means that peanut butter, jelly, and bread appeared together in 3% of the market baskets. A confidence of 0.82 for this rule implies that when peanut butter and jelly were purchased, 82% of the time bread was also purchased. If bread appeared in 43% of all market baskets then the rule {peanut butter, jelly} => {bread} would have a lift of about 1.91 (= 0.82 / 0.43).
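Support, confidence and lift for a single rule can be computed directly from a 0/1 basket matrix. The sketch below uses ten made-up baskets (the data and object names are assumptions for illustration, not taken from the example above) and evaluates the rule {peanut butter, jelly} => {bread}.

```r
# Ten hypothetical baskets, coded with 0/1 dummy variables
baskets <- data.frame(
  peanut_butter = c(1, 1, 1, 0, 0, 1, 0, 0, 1, 0),
  jelly         = c(1, 1, 0, 0, 0, 1, 0, 1, 1, 0),
  bread         = c(1, 1, 1, 1, 0, 0, 1, 0, 1, 0)
)

antecedent <- baskets$peanut_butter == 1 & baskets$jelly == 1
consequent <- baskets$bread == 1

# support: share of baskets containing all items of the rule
support    <- mean(antecedent & consequent)   # 0.3
# confidence: support of the rule divided by support of the antecedent
confidence <- support / mean(antecedent)      # 0.75
# lift: confidence divided by the support of the consequent
lift       <- confidence / mean(consequent)   # 1.25

c(support = support, confidence = confidence, lift = lift)
```

A lift above 1 indicates that the consequent is more frequent among baskets containing the antecedent than in the data as a whole.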

The goal of this analysis is to produce association rules (A => B) with both high support and high confidence.

Examples of Association Rules:

### 4 Linear Space Embedding Methods

Linear space embeddings are a class of dimensionality reduction techniques that map the data $X$ into a lower dimensional linear space $\mathcal{M}$. The mapping itself, $f: X \to \mathcal{M}$, can be linear or nonlinear. We denote the low dimensional representation of the data by $\hat{X} := f(X) \in \mathcal{M}$.

The idea of ERM and Inductive Bias also applies to unsupervised learning. We seek some $f$ that does not incur too much loss, on average. I.e., we seek to minimize $R(f)$.

Remark: Two interpretations of "linear" can be found in the literature. It may refer to the nature of the low dimensional space approximating the data, or to the nature of the embedding operation.

4.1 PCA

Maximizing under a constraint, using Lagrange-Multipliers:
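The constrained maximization for the first principal component can be sketched as follows. The notation is an assumption made here: $S$ denotes the empirical covariance matrix of the mean-centered data, $w$ the weight vector of the linear combination, and $\lambda$ the Lagrange multiplier.

```latex
\begin{aligned}
&\max_{w} \; w^\top S w \quad \text{subject to} \quad w^\top w = 1, \\
&L(w, \lambda) = w^\top S w - \lambda \,(w^\top w - 1), \\
&\frac{\partial L}{\partial w} = 2 S w - 2 \lambda w = 0
  \;\Longrightarrow\; S w = \lambda w, \\
&w^\top S w = \lambda \, w^\top w = \lambda .
\end{aligned}
```

So any stationary point $w$ is an eigenvector of $S$, and the variance it captures equals its eigenvalue; the first principal component is therefore the eigenvector with the largest eigenvalue.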


PCA is such a basic technique that it has been rediscovered and renamed independently in many fields. It can be found under the names of discrete Karhunen-Loeve Transform; Hotelling Transform; Proper Orthogonal Decomposition (POD); Eckart-Young Theorem; Schmidt-Mirsky Theorem; Empirical Orthogonal Functions; Empirical Eigenfunction Decomposition; Empirical Component Analysis; Quasi-Harmonic Modes; Spectral Decomposition; Empirical Modal Analysis; and possibly more.

Example:

Consider human height and weight data. While the data are clearly two dimensional, you don't really need both dimensions to understand how "big" the people in the data are. This is because height and weight vary mostly along a single dimension, which can be interpreted as the "bigness" of an individual. This is why physicians use the Body Mass Index (BMI) as an indicator of size, instead of a two-dimensional measurement.

Assume now that you wish to give each individual a size score that is a linear combination of height and weight. PCA does just that: it returns the linear combination that has the most variability, i.e., the combination which best distinguishes between individuals.

Notice we have currently offered two motivations for PCA:

(i) Find linear combinations that best distinguish between observations, i.e., maximize variance.

(ii) Find the linear subspace that best approximates the data.

The reason these two problems are equivalent is the use of the squared error. Informally speaking, the data has some total variance. This variance can be decomposed into the part captured in $\mathcal{M}$, and the part not captured.

Note: Usually for simplicity of exposition, we will assume that the data X has been mean centered.

Terminology:

Principal Components: The linear combinations of the features which best separate between observations. In our example, the "bigness" index of each individual. The first component captures the most variance, the second component the second most variance, etc. In terms of $\mathcal{M}$, the principal components are an orthogonal basis for $\mathcal{M}$.

Scores: Synonymous to Principal Components.

Loadings: The weights of each feature in each principal component. In our example, the importance of the height and weight in constructing the "bigness" score.
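The height and weight example can be reproduced with `prcomp()` on simulated data. The data-generating numbers and object names below are assumptions chosen only to give two positively correlated variables; the variables are standardized because they are measured in different units.

```r
set.seed(1)
n      <- 100
height <- rnorm(n, mean = 170, sd = 10)                  # cm, simulated
weight <- 70 + 0.9 * (height - 170) + rnorm(n, sd = 5)   # kg, simulated
people <- data.frame(height, weight)

# Center and scale, then extract the principal components
pca <- prcomp(people, center = TRUE, scale. = TRUE)

# Loadings: weight of each feature in the first component ("bigness")
pca$rotation[, 1]

# Scores: the "bigness" of the first few individuals
head(pca$x[, 1])

# Share of the total variance captured by each component
var.share <- summary(pca)$importance["Proportion of Variance", ]
var.share
```

With two standardized, positively correlated variables, the first component weighs height and weight equally (up to sign), which matches the interpretation of a single "bigness" dimension.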

PCA as a Graph Method

Starting from the maximal variance motivation, it is perhaps not surprising that PCA depends only on the similarities between features, as measured by their empirical covariance. The linearity of the target manifold was there by assumption.

The building blocks of all these graph-based dimensionality reduction methods are:

1. Compute some similarity graph G (or dissimilarity graph D) from the raw features.
2. Call upon graph embedding theory to map the data points into the target manifold $\mathcal{M}$.

To summarize:

Task = dim. reduction Type = optimization Input = graph (G) Output = embedding function

Sparse Principal Component Analysis (sPCA)

When analyzing the PCA results, we often wish to understand which features contribute to which component. This is much easier when the loadings (A) are sparse, i.e., include many zeroes. sPCA performs this in LASSO style, by means of l1 regularization.

4.4 Multidimensional Scaling (MDS)

Both self-organizing maps and principal curves and surfaces map data points in $\mathbb{R}^p$ to a lower dimensional manifold. Multidimensional scaling (MDS) has a similar goal, but approaches the problem in a somewhat different way.

MDS represents high-dimensional data in a low-dimensional coordinate system. MDS requires only the dissimilarities $d_{ij}$, in contrast to the SOM and principal curves and surfaces, which need the data points $x_i$.

MDS aims at representing a network (= a weighted graph) of distances (or similarities) between observations, by embedding the observations in a q dimensional linear subspace, while preserving the original distances.
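Classical MDS is available in base R as `cmdscale()`, which takes only a dissimilarity object. The sketch below, on simulated data (an assumption for illustration), embeds 20 points from five dimensions into two, and checks numerically that with Euclidean distances the embedding agrees with the PCA scores up to the sign of each axis.

```r
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20)   # 20 simulated points in R^5
D <- dist(X)                            # Euclidean dissimilarities d_ij

# Classical MDS: only D is needed, not the original coordinates
mds <- cmdscale(D, k = 2)               # 20 x 2 embedding

# PCA scores computed from the raw data, for comparison
pca.scores <- prcomp(X)$x[, 1:2]

# Largest discrepancy after ignoring the arbitrary sign of each axis
max.gap <- max(abs(abs(mds) - abs(pca.scores)))
max.gap
```

For other dissimilarities (non-Euclidean, or only known through a graph) the two methods no longer coincide, and only MDS remains applicable.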

### 5 Non-Linear Space Embedding Methods

The fact that the linear-space embedding of the data depends only on some similarity graph has laid a bridge between feature embedding methods, such as PCA, and graph embedding methods, such as MDS. Moreover, it has opened the door for replacing the covariance similarity with many other similarity measures.

Classic MDS is simply PCA when starting from G, thus viewed as a graph embedding problem. kPCA plugs kernel similarities instead of covariance similarities. LocalMDS and LLE follow a similar motivation using local measures of similarity.

PCA solution can be cast in terms of the covariance between individuals (G = X'X) or the Euclidean distances (D).

In particular, we show that all the information on the location (mean) of X, needed for the PCA reconstruction, is actually encoded in G (or D).

Kernel Principal Component Analysis (kPCA)

The optimization problem is:

$$\arg\max_{g} \left\{ \mathrm{Cov}[g(X)] \right\},$$

where $g(X)$ is the best separating score (function). We thus have two matters to attend to:

(i) We need to constrain $g(x)$ so that it does not overfit.

(ii) We need the problem to be computable.

This is precisely the goal of kPCA. We have already encountered a similar problem with Smoothing Splines. It is thus not [...] of the optimization problem takes a very simple form. The classes of such g's are known as Reproducing Kernel Hilbert Spaces (RKHS).

Nonlinear Dimension Reduction and Local Multidimensional Scaling - These methods can be thought of as “flattening” the manifold, and hence reducing the data to a set of low-dimensional coordinates that represent their relative positions in the manifold. They are useful for problems where signal-to-noise ratio is very high (e.g., physical systems), and are probably not as useful for observational data with lower signal-to-noise ratios.

Three Methods of Nonlinear MDS:

ISOMAP = Isometric feature mapping (Tenenbaum et al., 2000) - constructs a graph to approximate the geodesic distance between points along the manifold. Specifically, for each data point we find its neighbors: points within some small Euclidean distance of that point. We construct a graph with an edge between any two neighboring points. The geodesic distance between any two points is then approximated by the shortest path between the points on the graph. Finally, classical scaling is applied to the graph distances, to produce a low-dimensional mapping.

LLE = Local linear embedding (Roweis and Saul, 2000) - takes a very different approach, trying to preserve the local affine structure of the high-dimensional data. Each data point is approximated by a linear combination of neighboring points. Then a lower dimensional representation is constructed that best preserves these local approximations.

LLE aims at finding linear subspaces that are good approximations of small neighborhoods of the whole data X. It is similar in spirit to Isomap and LocalMDS (§5.4.5). It differs, however, in the way similarities are computed, and in the way embeddings are performed. In particular, as the name may suggest, LLE performs local embeddings to linear subspaces.

To summarize:

Task = dim. reduction Type = algorithm Input = graph (G) Output = data embedding Concept = local distance

Local MDS (Chen and Buja, 2008) - takes the simplest and arguably the most direct approach. We define N to be the symmetric set of nearby pairs of points; specifically a pair (i, i') is in N if point i is among the K-nearest neighbors of i', or vice-versa.

Self Organizing Maps (SOM)

SOMs are a non-linear-subspace dimensionality reduction method, aimed at good clustering. The method is non-linear because the algorithm (which cannot be cast as an ERM problem, i.e., an optimization problem) returns an embedding into a non-linear manifold.

To summarize:

Task = dim. reduction Type = algorithm Input = X (data) Output = parametric curve or surface Concept = self consistency, i.e., a curve with a path that is the average of all its closest data points.

Self Consistency: Roughly speaking, one can think of this curve as a parameterized function, connecting all the k-means cluster centers in the smoothest way possible.


### 8 Cluster Analysis

Gaussian Mixtures as Soft K-means Clustering.

• K-means Clustering - the algorithm is appropriate when the dissimilarity measure is taken to be squared Euclidean distance. This requires all of the variables to be of the quantitative type. In addition, using squared Euclidean distance places the highest influence on the largest distances. This causes the procedure to lack robustness against outliers that produce very large distance.

• K-medoids Clustering - For a given cluster assignment (C), find the observation in the cluster minimizing total distance to other points in that cluster. This algorithm assumes attribute data, but the approach can also be applied to data described only by proximity matrices. There is no need to explicitly compute cluster centers.
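The difference between a cluster center and a medoid can be illustrated in base R. The sketch below runs `kmeans()` on two simulated, well-separated groups, and then, for the resulting assignment, finds the medoid of each cluster using only the distance matrix. The data and object names are assumptions for illustration; this shows the medoid step for a fixed assignment, not the full K-medoids (PAM) algorithm.

```r
set.seed(1)
# Two simulated groups of 20 points each in the plane
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

# K-means: centers are averages, so squared Euclidean distance is implied
km <- kmeans(X, centers = 2, nstart = 10)
km$centers

# Medoid step: within each cluster, pick the observation with the
# smallest total distance to the other members (needs only D)
D <- as.matrix(dist(X))
medoids <- sapply(split(seq_len(nrow(X)), km$cluster), function(idx) {
  idx[which.min(rowSums(D[idx, idx, drop = FALSE]))]
})

X[medoids, ]   # medoids are actual observations, unlike the k-means centers
```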


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Recommender Systems Algorithms

1. Content Filtering
2. Collaborative Filtering
3. Hybrid Filtering
4. Recommender Systems

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The two main approaches to recommender systems include content filtering and collaborative filtering.

1. Content Filtering

In content filtering, the system is assumed to have some background information on the user (say, because he logged in), and uses this information to give him recommendations. The recommendation in this case, is approached as a supervised learning problem: the system learns to predict a product's rating based on the user's features.

2. Collaborative Filtering

Unlike content filtering, in collaborative filtering, there is no external information on the user or the products, besides the ratings of other users.

Collaborative filtering can be approached as a supervised learning problem, or as an unsupervised learning problem. This is because it is neither. It is essentially a missing data problem.

The two main approaches to collaborative filtering include neighborhood methods, and latent factor models.

a. The neighborhood methods to collaborative filtering rest on the assumption that similar individuals have similar tastes. If someone similar to individual i has seen movie j, then i should have a similar opinion.

b. The latent factor models approach to collaborative filtering rests on the assumption that the rankings are a function of some latent user attributes and latent movie attributes. This idea is not a new one, as we have seen it in the context of unsupervised learning in factor analysis (FA) and independent component analysis (ICA). This is why this approach is more commonly known as the Matrix Factorization approach to collaborative filtering.
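One common way to write the latent factor idea as a regularized ERM problem is sketched below. The notation is an assumption made for this illustration: $r_{ij}$ is the rating of movie $j$ by user $i$, $\Omega$ is the set of observed (user, movie) pairs, $u_i$ and $v_j$ are the latent user and movie attribute vectors, and $\lambda$ is a regularization level.

```latex
\min_{\{u_i\}, \{v_j\}} \;
  \sum_{(i,j) \in \Omega} \left( r_{ij} - u_i^\top v_j \right)^2
  + \lambda \left( \sum_i \|u_i\|_2^2 + \sum_j \|v_j\|_2^2 \right)
```

The missing ratings are then imputed by $\hat{r}_{ij} = u_i^\top v_j$ for the pairs $(i,j) \notin \Omega$, which is what makes this a missing data problem.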

We can present several matrix factorization problems in the ERM framework.

3. Hybrid Filtering

After introducing the ideas of content filtering and collaborative filtering, why not marry the two? Hybrid filtering is the idea of imputing the missing data, thus making recommendations, using both a viewer's attributes, and other viewers' preferences. It can be presented as an ERM problem.

4. Recommender Systems Terminology

Content Based Filtering: A supervised learning approach to recommendations.

Collaborative Filtering: A missing data imputation approach to recommendations.

Memory Based Filtering: A non-parametric (neighborhood) approach to collaborative filtering.


Model Based Filtering: A latent space generative model approach to collaborative filtering.

Misc notes: ========

The Relation Between Supervised and Unsupervised Learning

It may be surprising that collaborative filtering can be seen as both an unsupervised and a supervised learning problem. But these are not mutually exclusive problems.

In unsupervised learning we try to learn the joint distribution of x. Since this means learning the relationship between any variable in x and the rest, we may see it as several supervised learning problems. In each, a different variable in x plays the role of y.

The Kernel Trick

Applies to: SVM, PCA, canonical correlation analysis, ridge regression, spectral clustering, Gaussian processes, and more (k-nearest neighbor (kNN) is also a kernel method).

Think of smoothing splines: it was quite magical that, without constraining the hypothesis class F, the ERM problem has a finite dimensional closed form solution. The property of an infinite dimensional problem having a solution in a finite dimensional space is known as the kernel property.

The problem is then: what type of penalties J(f) will return simple solutions to problem (1)?

The answer is: functions that belong to a Reproducing Kernel Hilbert Space (RKHS), a function space.

The Bayesian View of RKHS

Just as the ridge regression has a Bayesian interpretation, so does the kernel trick. Informally, the functions solving Eq.(1) can be seen as the posterior mode if our prior beliefs postulate that the function we are trying to recover is a Gaussian zero-mean process with covariance given by K.

Generative Models

By generative model we mean that we specify the whole data distribution.

This is particularly relevant to supervised learning where many methods only assume the distribution of P(y|x) without stating the distribution of P(x).

LDA, QDA, and Naive Bayes follow this exact same rationale.

Dimensionality Reduction

- It is thus intimately related to lossy compression in information theory.

- Dimensionality reduction is often performed before supervised learning to keep computational complexity low.


### R code

Supervised Learning Code

```{r}
library(magrittr) # for piping
library(dplyr) # for handling data frames

# Some utility functions:
l2 <- function(x) x^2 %>% sum %>% sqrt
l1 <- function(x) abs(x) %>% sum
MSE <- function(x) x^2 %>% mean
missclassification <- function(tab) sum(tab[c(2,3)])/sum(tab)
```

We also initialize the random number generator so that we all get the same results (at least upon a first run).

```{r set seed}
set.seed(2015)
```

# OLS

## OLS Regression

Starting with OLS regression, and a split train-test data set:

```{r OLS Regression}
View(prostate)
# now verify that your data looks as you would expect....

ols.1 <- lm(lcavol~. ,data = prostate.train)
# Train error:
MSE( predict(ols.1)- prostate.train$lcavol)
# Test error:
MSE( predict(ols.1, newdata = prostate.test)- prostate.test$lcavol)
```

Now using cross validation to estimate the prediction error:

```{r Cross Validation}
folds <- 10
# Assign each observation to one of the folds
fold.assignment <- sample(1:folds, nrow(prostate), replace = TRUE)
errors <- NULL
for (k in 1:folds){
  prostate.cross.train <- prostate[fold.assignment!=k,]
  prostate.cross.test <- prostate[fold.assignment==k,]
  .ols <- lm(lcavol~. ,data = prostate.cross.train)
  .predictions <- predict(.ols, newdata=prostate.cross.test)
  .errors <- .predictions - prostate.cross.test$lcavol
  errors <- c(errors, .errors)
}
# Cross validated prediction error:
MSE(errors)
```

Also trying a bootstrap prediction error:

```{r Bootstrap}
B <- 20
n <- nrow(prostate)
errors <- NULL
prostate.boot.test <- prostate
for (b in 1:B){
  prostate.boot.train <- prostate[sample(1:n, replace = TRUE),]
  .ols <- lm(lcavol~. ,data = prostate.boot.train)
  .predictions <- predict(.ols, newdata=prostate.boot.test)
  .errors <- .predictions - prostate.boot.test$lcavol
  errors <- c(errors, .errors)
}
# Bootstrapped prediction error:
MSE(errors)
```



### OLS Regression Model Selection

Best subset selection: find the best model of each size:

```{r best subset}
# install.packages('leaps')
library(leaps)

regfit.full <- prostate.train %>%
  regsubsets(lcavol~.,data = ., method = 'exhaustive')
summary(regfit.full)
plot(regfit.full, scale = "Cp")
```

Train-Validate-Test Model Selection. Example taken from [here](https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/ch6.html)

```{r OLS TVT model selection}
model.n <- regfit.full %>% summary %>% length
X.train.named <- prostate.train %>% model.matrix(lcavol ~ ., data = .)
X.test.named <- prostate.test %>% model.matrix(lcavol ~ ., data = .)
View(X.test.named)

val.errors <- rep(NA, model.n)
train.errors <- rep(NA, model.n)
for (i in 1:model.n) {
  coefi <- coef(regfit.full, id = i)
  # Train error of the best model of size i:
  pred <- X.train.named[, names(coefi)] %*% coefi
  train.errors[i] <- MSE(y.train - pred)
  # Validation error of the same model:
  pred <- X.test.named[, names(coefi)] %*% coefi
  val.errors[i] <- MSE(y.test - pred)
}

plot(train.errors, ylab = "MSE", pch = 19, type = "b")
points(val.errors, pch = 19, type = "b", col="blue")
legend("topright",
       legend = c("Training", "Validation"),
       col = c("black", "blue"),
       pch = 19)
```

AIC model selection:

```{r OLS AIC}
# Forward search:
ols.0 <- lm(lcavol~1 ,data = prostate.train)
model.scope <- list(upper=ols.1, lower=ols.0)
step(ols.0, scope=model.scope, direction='forward', trace = TRUE)
# Backward search:
step(ols.1, scope=model.scope, direction='backward', trace = TRUE)
```

Cross Validated Model Selection.

```{r OLS CV}
# [TODO]
```

Bootstrap model selection:

```{r OLS bootstrap}
# [TODO]
```



Partial least squares and principal components:

```{r PLS}
pls::plsr()
pls::pcr()
```

Canonical correlation analysis:

```{r CCA}
cancor()
# Kernel based robust version
kernlab::kcca()
```



## OLS Classification

```{r OLS Classification}
# Making train and test sets:
ols.2 <- lm(spam~., data = spam.train.dummy)
# Train confusion matrix:
.predictions.train <- predict(ols.2) > 0.5
(confusion.train <- table(prediction=.predictions.train, truth=spam.train.dummy$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(ols.2, newdata = spam.test.dummy) > 0.5
(confusion.test <- table(prediction=.predictions.test, truth=spam.test.dummy$spam))
missclassification(confusion.test)
```

# Ridge Regression

```{r Ridge I}
# install.packages('ridge')
library(ridge)
ridge.1 <- linearRidge(lcavol~. ,data = prostate.train)
# Note that if not specified, lambda is chosen automatically by linearRidge.
# Train error:
MSE( predict(ridge.1)- prostate.train$lcavol)
# Test error:
MSE( predict(ridge.1, newdata = prostate.test)- prostate.test$lcavol)
```

Another implementation, which fits the model along a whole sequence of values of the tuning parameter $\lambda$:

```{r Ridge II}
# install.packages('glmnet')
library(glmnet)
ridge.2 <- glmnet(x=X.train, y=y.train, alpha = 0)
# predict() returns one column per candidate lambda,
# so the MSEs below average over the whole regularization path.
# Train error:
MSE( predict(ridge.2, newx = X.train)- y.train)
# Test error:
MSE( predict(ridge.2, newx = X.test)- y.test)
```

__Note__: glmnet is slightly picky. I could not have created y.train using select() because I need a vector and not a data.frame. Also, as.matrix is there as glmnet expects a matrix class x argument. These objects are created in the make_samples.R script, which we sourced in the beginning.

# LASSO Regression

```{r LASSO}
# install.packages('glmnet')
library(glmnet)
lasso.1 <- glmnet(x=X.train, y=y.train, alpha = 1)
# Train error:
MSE( predict(lasso.1, newx =X.train)- y.train)
# Test error:
MSE( predict(lasso.1, newx = X.test)- y.test)
```

# Logistic Regression For Classification

```{r Logistic Regression}
logistic.1 <- glm(spam~., data = spam.train, family = binomial)
# numerical error. Probably due to too many predictors.
# Maybe regularizing the logistic regression with Ridge or LASSO will make things better?
```

In the next chunk, we do $l_2$ and $l_1$ regularized logistic regression. Some technical remarks are in order:

- glmnet is picky with its inputs. This has already been discussed in the context of the LASSO regression above.

- The predict function for glmnet objects returns a prediction (see below) for many candidate regularization levels $\lambda$. We thus use cv.glmnet, which does an automatic cross validated selection of the best regularization level.

```{r Regularized Logistic Regression}
library(glmnet)
# Ridge Regularization with CV selection of regularization:
logistic.2 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 0)
# LASSO Regularization with CV selection of regularization:
logistic.3 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 1)

# Train confusion matrix:
.predictions.train <- predict(logistic.2, newx = X.train.spam, type = 'class')
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

.predictions.train <- predict(logistic.3, newx = X.train.spam, type = 'class')
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

# Test confusion matrix:
.predictions.test <- predict(logistic.2, newx = X.test.spam, type='class')
(confusion.test <- table(prediction=.predictions.test, truth=y.test.spam))
missclassification(confusion.test)

.predictions.test <- predict(logistic.3, newx = X.test.spam, type='class')
(confusion.test <- table(prediction=.predictions.test, truth=y.test.spam))
missclassification(confusion.test)
```

# SVM

## Classification

```{r SVM classification}
library(e1071)
svm.1 <- svm(spam~., data = spam.train)
# Train confusion matrix:
.predictions.train <- predict(svm.1)
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(svm.1, newdata = spam.test)
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```



## Regression

```{r SVM regression}
svm.2 <- svm(lcavol~., data = prostate.train)
# Train error:
MSE( predict(svm.2)- prostate.train$lcavol)
# Test error:
MSE( predict(svm.2, newdata = prostate.test)- prostate.test$lcavol)
```

# GAM Regression

```{r GAM}
# install.packages('mgcv')
library(mgcv)
form.1 <- lcavol~ s(lweight)+ s(age)+s(lbph)+s(svi)+s(lcp)+s(gleason)+s(pgg45)+s(lpsa)
gam.1 <- gam(form.1, data = prostate.train) # the model is too rich. let's select a variable subset

# select the most promising coefficients (a very arbitrary practice)
ridge.1 %>% coef %>% abs %>% sort(decreasing = TRUE)

# keep only promising coefficients in model
form.2 <- lcavol~ s(lweight)+ s(age)+s(lbph)+s(lcp)+s(pgg45)+s(lpsa)
gam.2 <- gam(form.2, data = prostate.train)
# Train error:
MSE( predict(gam.2)- prostate.train$lcavol)
# Test error:
MSE( predict(gam.2, newdata = prostate.test)- prostate.test$lcavol)
```

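# Neural Net
## Regression

```{r NNET regression}
library(nnet)
# linout = TRUE requests a linear output unit, as needed for regression
# (the nnet default is a logistic output unit, restricted to [0, 1])
nnet.1 <- nnet(lcavol~., size=20, data=prostate.train, rang = 0.1, decay = 5e-4, maxit = 1000, linout = TRUE)

# Train error:
MSE( predict(nnet.1)- prostate.train$lcavol)

# Test error:
MSE( predict(nnet.1, newdata = prostate.test)- prostate.test$lcavol)
```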

Let's automate the network size selection:

```{r NNET validate}
validate.nnet <- function(size){
  # linout = TRUE: linear output unit for regression (the nnet default is logistic)
  .nnet <- nnet(lcavol~., size=size, data=prostate.train, rang = 0.1, decay = 5e-4, maxit = 200, linout = TRUE)
  .train <- MSE( predict(.nnet)- prostate.train$lcavol)
  .test <- MSE( predict(.nnet, newdata = prostate.test)- prostate.test$lcavol)
  return(list(train=.train, test=.test))
}

validate.nnet(3)
validate.nnet(4)
validate.nnet(20)
validate.nnet(50)

sizes <- seq(2, 30)
validate.sizes <- rep(NA, length(sizes))
for (i in seq_along(sizes)){
  validate.sizes[i] <- validate.nnet(sizes[i])$test
}
plot(validate.sizes~sizes, type='l')
```

What can I say... This plot is not what I would expect. Could be due to the random nature of the fitting algorithm.

## Classification
```{r NNET Classification}
nnet.2 <- nnet(spam~., size=5, data=spam.train, rang = 0.1, decay = 5e-4, maxit = 1000)

# Train confusion matrix:
.predictions.train <- predict(nnet.2, type='class')
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

# Test confusion matrix:
.predictions.test <- predict(nnet.2, newdata = spam.test, type='class')
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```

# CART
## Regression
```{r Tree regression}
library(rpart)
tree.1 <- rpart(lcavol~., data=prostate.train)

# Train error:
MSE( predict(tree.1)- prostate.train$lcavol)

# Test error:
MSE( predict(tree.1, newdata = prostate.test)- prostate.test$lcavol)
```

At this stage we should prune the tree using prune()...

## Classification
```{r Tree classification}
tree.2 <- rpart(spam~., data=spam.train)

# Train confusion matrix:
.predictions.train <- predict(tree.2, type='class')
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

# Test confusion matrix:
.predictions.test <- predict(tree.2, newdata = spam.test, type='class')
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```

# Random Forest
TODO

# Rotation Forest
TODO

# Smoothing Splines
I will demonstrate the method with a single predictor, so that we can visualize the smoothing that has been performed:

```{r Smoothing Splines}
spline.1 <- smooth.spline(x=X.train, y=y.train)

# Visualize the non linear hypothesis we have learned:
plot(y.train~X.train, col='red', type='h')
points(spline.1, type='l')
```

I am not extracting train and test errors as the output of smooth.spline will require some tweaking for that.

# KNN
## Classification
```{r knn classification}
library(class)
# NOTE: the call that creates knn.1 is missing from the source (lost at a page break).
# The line below is an assumed reconstruction, with k left at its default of 1.
knn.1 <- knn(train = X.train.spam, test = X.test.spam, cl = y.train.spam)

# Test confusion matrix:
.predictions.test <- knn.1
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```



And now we would try to optimize k by trying different values.
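A minimal sketch of such a search, under stated assumptions: it uses the built-in `iris` data as a stand-in for the spam data, a single random train/validation split, and an arbitrary grid of odd values of k. The names `X.tr`, `X.va`, `ks`, `val.error` and `best.k` are illustrative only.

```r
library(class)  # knn()

set.seed(1)
# Hypothetical stand-in data: iris instead of the spam data
idx  <- sample(nrow(iris), 100)
X.tr <- iris[idx, 1:4];  y.tr <- iris$Species[idx]
X.va <- iris[-idx, 1:4]; y.va <- iris$Species[-idx]

# Candidate neighbourhood sizes (odd values reduce voting ties)
ks <- seq(1, 25, by = 2)

# Validation misclassification rate for each candidate k
val.error <- sapply(ks, function(k) {
  pred <- knn(train = X.tr, test = X.va, cl = y.tr, k = k)
  mean(pred != y.va)
})

best.k <- ks[which.min(val.error)]
plot(val.error ~ ks, type = 'b', xlab = 'k', ylab = 'validation error')
```

With a single split the chosen k is itself noisy; cross-validation over several splits would give a more stable choice.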

# Kernel Regression

Kernel regression includes many particular algorithms.

```{r kernel}
# install.packages('np')
library(np)
ksmooth.1 <- npreg(txdat =X.train, tydat = y.train)

# Train error:
MSE( predict(ksmooth.1)- prostate.train$lcavol)
```

There is currently no method to make prediction on test data with this function.

# Stacking
As seen in the class notes, there are many ensemble methods. Stacking, in my view, is by far the most useful and coolest. It is thus the only one I present here.

The following example is adapted from [James E. Yonamine](http://jayyonamine.com/?p=456).

```{r Stacking}
##### step 1: train models ####
# logits
logistic.2 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 0)
logistic.3 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 1)

# Learning Vector Quantization (LVQ)
my.codebook<-lvqinit(x=X.train.spam, cl=y.train.spam, size=10, prior=c(0.5,0.5),k = 2)
my.codebook<-lvq1(x=X.train.spam, cl=y.train.spam, codebk=my.codebook, niter = 100 * nrow(my.codebook$x), alpha = 0.03)

# SVM
library('e1071')
svm.fit <- svm(y=y.train.spam, x=X.train.spam, probability=TRUE)

##### step 2a: build predictions for data.train ####
train.predict<- cbind(
  predict(logistic.2, newx=X.train.spam, type="response"),
  predict(logistic.3, newx=X.train.spam, type="response"),
  knn1(train=my.codebook$x, test=X.train.spam, cl=my.codebook$cl),
  predict(svm.fit, X.train.spam, probability=TRUE)
)

##### step 2b: build predictions for data.test ####
# The columns must be in the same order as in train.predict
test.predict <- cbind(
  predict(logistic.2, newx=X.test.spam, type="response"),
  predict(logistic.3, newx=X.test.spam, type="response"),
  knn1(train=my.codebook$x, test=X.test.spam, cl=my.codebook$cl),
  predict(svm.fit, newdata = X.test.spam, probability = TRUE)
)

##### step 3: train SVM on train.predict ####
final <- svm(y=y.train.spam, x=train.predict, probability=TRUE)

##### step 4: use trained SVM to make predictions with test.predict ####
final.predict <- predict(final, test.predict, probability=TRUE)
results<-as.matrix(final.predict)
table(results, y.test.spam)
```

# Fisher's LDA
```{r LDA}
library(MASS)
lda.1 <- lda(spam~., spam.train)

# Train confusion matrix:
.predictions.train <- predict(lda.1)$class
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

# Test confusion matrix:
.predictions.test <- predict(lda.1, newdata = spam.test)$class
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```



__Caution__:

Both MASS and dplyr have a function called `select`. I will thus try to avoid having the two packages loaded at once, or call the function by its full name: `MASS::select` or `dplyr::select`.

# Naive Bayes

```{r Naive Bayes}
library(e1071)
nb.1 <- naiveBayes(spam~., data = spam.train)

# Train confusion matrix:
.predictions.train <- predict(nb.1, newdata = spam.train)
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)

# Test confusion matrix:
.predictions.test <- predict(nb.1, newdata = spam.test)
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```


Unsupervised Learning R code

Some utility functions:

```{r utility}
library(magrittr) # provides the %>% pipe used below

# Vector norms:
l2 <- function(x) x^2 %>% sum %>% sqrt
l1 <- function(x) abs(x) %>% sum
MSE <- function(x) x^2 %>% mean

# Matrix norms:
frobenius <- function(A) norm(A, type="F")
spectral <- function(A) norm(A, type="2")
```
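A quick sanity check of these helpers on inputs whose norms are easy to compute by hand. The definitions are repeated here without the pipe, purely so that the snippet runs on its own; the inputs are arbitrary illustrations.

```r
# Same helpers as above, written without %>% to keep the snippet self-contained
l2  <- function(x) sqrt(sum(x^2))
l1  <- function(x) sum(abs(x))
MSE <- function(x) mean(x^2)
frobenius <- function(A) norm(A, type = "F")
spectral  <- function(A) norm(A, type = "2")

l2(c(3, 4))        # 5: sqrt(9 + 16)
l1(c(3, -4))       # 7: |3| + |-4|
MSE(c(1, -1, 2))   # 2: (1 + 1 + 4) / 3

A <- diag(c(3, 4))
frobenius(A)       # 5: square root of the sum of squared entries
spectral(A)        # 4: the largest singular value
```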

__Note__: `foo::bar` means that function `bar` is part of the `foo` package. With this syntax, there is no need to load (`library`) the package.

If a line does not run, you may need to install the package: `install.packages('foo')`. Sadly, RStudio currently does not autocomplete function arguments when using the `::` syntax.
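A small illustration of the `::` syntax, assuming only packages that ship with standard R distributions (stats and MASS). The check on `mvtnorm` is one possible way to test for an installed package before calling into it; the variable names are illustrative.

```r
# Call functions by their full name, without library()
x <- stats::rnorm(5)
m <- MASS::mvrnorm(n = 3, mu = c(0, 0), Sigma = diag(2))

# Check that a package is installed before using it with ::
pkg <- 'mvtnorm'
if (requireNamespace(pkg, quietly = TRUE)) {
  X <- mvtnorm::rmvnorm(n = 3, sigma = diag(2))
} else {
  message("Package '", pkg, "' is not installed; try install.packages('", pkg, "')")
}
```

Neither call attaches MASS to the search path, so a name clash such as `MASS::select` versus `dplyr::select` does not arise.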

# Learning Distributions
## Gaussian Density Estimation
```{r}
# Sample from a multivariate Gaussian:
## Generate a covariance matrix
p <- 10
Sigma <- bayesm::rwishart(nu = 100, V = diag(p))$W
lattice::levelplot(Sigma)

# Sample from a multivariate Gaussian:
n <- 1e3
means <- 1:p
X1 <- mvtnorm::rmvnorm(n = n, sigma = Sigma, mean = means)
dim(X1)

# Estimate parameters and compare to truth:
estim.means <- colMeans(X1) # recall truth is (1,...,10)
plot(estim.means~means); abline(0,1, lty=2)

estim.cov <- cov(X1)
estim.cov.errors <- Sigma - estim.cov
lattice::levelplot(estim.cov.errors)
plot(estim.cov~Sigma); abline(0,1, lty=2)
frobenius(estim.cov.errors)

# Now try the same while playing with n and p.
```

Other covariance estimators (robust, fast,...)
```{r covariances}
# Robust covariance
estim.cov.1 <- MASS::cov.rob(X1)$cov
estim.cov.errors.1 <- Sigma - estim.cov.1
lattice::levelplot(estim.cov.errors.1)
frobenius(estim.cov.errors.1)

# Nearest neighbour cleaning of outliers
estim.cov.2 <- covRobust::cov.nnve(X1)$cov
estim.cov.errors.2 <- Sigma - estim.cov.2
lattice::levelplot(estim.cov.errors.2)
frobenius(estim.cov.errors.2)

# Minimum Covariance Determinant (MCD), a robust estimator
estim.cov.3 <- robustbase::covMcd(X1)$cov
estim.cov.errors.3 <- Sigma - estim.cov.3
lattice::levelplot(estim.cov.errors.3)
frobenius(estim.cov.errors.3)

# Another robust covariance estimator
estim.cov.4 <- robustbase::covComed(X1)$cov
estim.cov.errors.4 <- Sigma - estim.cov.4
lattice::levelplot(estim.cov.errors.4)
frobenius(estim.cov.errors.4)
```

## Non parametric density estimation
There is nothing that will even try dimensions higher than 6. See [here](http://vita.had.co.nz/papers/density-estimation.pdf) for a review.

## Association rules
Note: Visualization examples are taken from the arulesViz [vignette](http://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf)

```{r association rules}
library(arules)
data("Groceries")
inspect(Groceries[1:2])
summary(Groceries)

# NOTE: the call that creates `rules` is missing from the source (lost at a page break).
# Assumed reconstruction, with the thresholds used in the arulesViz vignette:
rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.5))
summary(rules)
rules %>% sort(by='lift') %>% head %>% inspect

# Select a subset of rules
rule.subset <- subset(rules, subset = rhs %pin% "yogurt")
inspect(rule.subset)

# Visualize rules:
library(arulesViz)
plot(rules)
subrules <- rules[quality(rules)$confidence > 0.8]
plot(subrules, method="matrix", measure="lift", control=list(reorder=TRUE))
plot(subrules, method="matrix", measure=c("lift", "confidence"),
     control=list(reorder=TRUE))
plot(subrules, method="grouped")
plot(rules, method="grouped", control=list(k=50))

subrules2 <- head(sort(rules, by="lift"), 10)
plot(subrules2, method="graph", control=list(type="items"))
plot(subrules2, method="graph")

# Export rules graph to use with other software:
# saveAsGraph(head(sort(rules, by="lift"),1000), file="rules.graphml")

# A doubledecker plot shows a single rule. The subsetting index appears to have
# been lost in the source, so taking the first rule is an assumption.
rule.1 <- rules[1]
inspect(rule.1)
plot(rule.1, method="doubledecker", data = Groceries)
```

See also the prim.box function in the prim package for more algorithms to learn association rules
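The quality measures that the association-rules code sorts and filters by (support, confidence, lift) can be computed by hand. Below is a minimal sketch on a hypothetical five-basket data set, for the rule {milk} => {bread}; the data and the helper `support()` are illustrative and not part of arules.

```r
# Toy transactions (hypothetical): rows are baskets, columns are items
baskets <- matrix(c(1, 1, 0,
                    1, 0, 1,
                    1, 1, 1,
                    0, 1, 0,
                    1, 1, 0),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("milk", "bread", "yogurt"))) == 1

# Support: fraction of baskets that contain all the given items
support <- function(items) {
  mean(apply(baskets[, items, drop = FALSE], 1, all))
}

supp.rule  <- support(c("milk", "bread"))    # 3 of 5 baskets
confidence <- supp.rule / support("milk")    # P(bread | milk)
lift       <- confidence / support("bread")  # confidence relative to P(bread)

c(support = supp.rule, confidence = confidence, lift = lift)
# support 0.6, confidence 0.75, lift 0.9375
```

A lift below 1, as here, means that bread is slightly less frequent among milk baskets than among all baskets.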

# Dimensionality Reduction
## PCA

Note: example is a blend from [Gaston Sanchez](http://gastonsanchez.com/blog/how-to/2012/06/17/PCA-in-R.html) and [the University of Oregon's Geography dept.](http://geog.uoregon.edu/GeogR/topics/pca.html).

Get some data
```{r PCA data}
?USArrests

corrplot::corrplot(cor(USArrests), method = "ellipse") # slightly fancier

# As a correlation graph
cor.1 <- cor(USArrests)
qgraph::qgraph(cor.1)
qgraph::qgraph(cor.1, layout = "spring", posCol = "darkgreen", negCol = "darkmagenta")
```

```{r PCA}
USArrests.1 <- USArrests[,-3] %>% scale
pca1 <- prcomp(USArrests.1, scale. = TRUE)
(pca1$rotation) # loadings

# Now score the states:
pca1$x %>% extract(,1) %>% sort %>% head
```

Interpretation:

- PC1 seems to capture overall crime rate.

- PC2 seems to distinguish between sexual and non-sexual crimes.
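A minimal sketch, using base R only, of where these loadings come from: the PCs of scaled data are the eigenvectors of the correlation matrix, and the PC variances are its eigenvalues. The object names `X`, `pca` and `eig` are illustrative.

```r
# Same variables as above: Murder, Assault, Rape (UrbanPop dropped), scaled
X   <- scale(USArrests[, -3])
pca <- prcomp(X)
eig <- eigen(cov(X))   # the covariance of scaled data is the correlation matrix

# PC variances equal the eigenvalues
all.equal(pca$sdev^2, eig$values)

# Loadings equal the eigenvectors, up to the sign of each column
max(abs(abs(pca$rotation) - abs(eig$vectors)))

# Proportion of variance explained by each PC
round(pca$sdev^2 / sum(pca$sdev^2), 3)

# The PC1 loadings share a single sign, which is what supports reading
# PC1 as an overall crime rate
pca$rotation[, 1]
```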

Projecting on first two PCs:
```{r visualizing PCA}
library(ggplot2) # for graphing
pcs <- as.data.frame(pca1$x)
ggplot(data = pcs, aes(x = PC1, y = PC2, label = rownames(pcs))) +
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  geom_text(colour = "tomato", alpha = 0.8, size = 4) +
  ggtitle("PCA plot of USA States - Crime Rates")
```

The bi-Plot
```{r biplot}
biplot(pca1) #ugly!
# library(devtools)
# install_github("vqv/ggbiplot")
ggbiplot::ggbiplot(pca1, labels = rownames(USArrests.1)) # better!
```

The scree-plot
```{r screeplot}
ggbiplot::ggscreeplot(pca1)
```

So clearly the main differentiation

Visualize the scoring as a projection of the states' attributes onto the factors.
```{r}
# get parameters of component lines (after Everitt & Rabe-Hesketh)
# NOTE: the definition of slope and the [1]/[2] indices are missing in the source;
# they are reconstructed here from the way slope and intcpt are used below.
load <- pca1$rotation
slope <- load[2, ] / load[1, ]
mn <- apply(USArrests.1, 2, mean)
intcpt <- mn[2] - (slope * mn[1])

# scatter plot with the two new axes added
par(pty = "s") # square plotting frame
USArrests.2 <- USArrests[,1:2] %>% scale
xlim <- range(USArrests.2) # overall min, max
plot(USArrests.2, xlim = xlim, ylim = xlim, pch = 16, col = "purple") # both axes same length
abline(intcpt[1], slope[1], lwd = 2) # first component solid line
abline(intcpt[2], slope[2], lwd = 2, lty = 2) # second component dashed
legend("right", legend = c("PC 1", "PC 2"), lty = c(1, 2), lwd = 2, cex = 1)

# projections of points onto PCA 1
y1 <- intcpt[1] + slope[1] * USArrests.2[, 1]
x1 <- (USArrests.1[, 2] - intcpt[1])/slope[1]
y2 <- (y1 + USArrests.1[, 2])/2
x2 <- (x1 + USArrests.1[, 1])/2
segments(USArrests.1[, 1], USArrests.1[, 2], x2, y2, lwd = 2, col = "purple")
```

Visualize the loadings:
```{r}
# install.packages('GPArotation')
pca.qgraph <- qgraph::qgraph.pca(USArrests.1, factors = 2, rotation = "varimax")
plot(pca.qgraph)
qgraph::qgraph(pca.qgraph, posCol = "darkgreen", layout = "spring", negCol = "darkmagenta",
               edge.width = 2, arrows = FALSE)
```

More implementations of PCA:

```{r}
# FAST solutions:
gmodels::fast.prcomp()

# More detail in output:
FactoMineR::PCA()

# For flexibility in algorithms and visualization:
ade4::dudi.pca()

# Another one...
# install.packages('amap')
amap::acp()
```



Principal tensor analysis:
```{r PTA}
PTAk::PTAk()
```

## sPCA
```{r sPCA}
```

## kPCA
```{r kPCA}
kernlab::kpca()
```

## Random Projections
```{r Random Projections}
```

## MDS
```{r MDS}
stats::cmdscale()
MASS::sammon()
MASS::isoMDS()
```

## Isomap
```{r Isomap}
```

## LLE
```{r LLE}
```

## LocalMDS
```{r Local MDS}
```

## Principal Curves & Surfaces
```{r Principal curves}
```

# Latent Space Generative Models
## FA
```{r factor analysis}
psych::principal()
```

## ICA
```{r ICA}
fastICA::fastICA() # Also performs projection pursuit
```

## Exploratory Projection Pursuit
```{r exploratory projection pursuit}
# install.packages('REPPlab')
library(REPPlab) # will require the rJava package
```

## Generative Topographic Map
[TODO]

## Finite Mixture
```{r mixtures}
# install.packages('mixtools')
library(mixtools)
```

Read [this](http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch20.pdf) for more information.

## HMM
```{r}
# install.packages('HiddenMarkov')
library(HiddenMarkov)
```

# Clustering
Generate clusters:
```{r generate clusters}
X <- clusterGeneration::genRandomClust(numClust=2)
clusterGeneration::viewClusters(X, cl=2)
```

## K-means
```{r kmeans}
stats::kmeans()
```

## Kmeans++
```{r kmeansPP}
# distmat() is not part of base R; it is assumed here to be pracma::distmat
kmpp <- function(X, k) {
  n <- nrow(X)
  C <- numeric(k)
  C[1] <- sample(1:n, 1) # first centre: chosen uniformly at random
  for (i in 2:k) {
    dm <- pracma::distmat(X, X[C, ])
    pr <- apply(dm, 1, min); pr[C] <- 0 # distance to the nearest chosen centre
    C[i] <- sample(1:n, 1, prob = pr)
  }
  kmeans(X, X[C, ])
}
```

## K-medoids
```{r kmedoids}
cluster::pam()

# Many other similarity measures:
proxy::dist()
```

## Hierarchical
```{r}
hclust()
# install.packages('cluster')
library(cluster)
agnes()
```
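The clustering functions above are listed without arguments. As a minimal sketch of how two of them are called, assuming simulated data with two well-separated Gaussian clusters (the names `X.sim`, `truth`, `km`, `hc` and `hc.labels` are illustrative):

```r
set.seed(1)
# Hypothetical data: two Gaussian clusters in the plane, centred at (0,0) and (5,5)
X.sim <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 5), ncol = 2))
truth <- rep(1:2, each = 50)

# K-means with several random restarts
km <- stats::kmeans(X.sim, centers = 2, nstart = 10)

# Agglomerative hierarchical clustering, cut into two groups
hc <- stats::hclust(dist(X.sim), method = 'average')
hc.labels <- cutree(hc, k = 2)

# Cross-tabulate the recovered clusters against the simulated truth
table(kmeans = km$cluster, truth = truth)
table(hclust = hc.labels, truth = truth)
```

Cluster labels are arbitrary, so agreement with the truth has to be read off the tables up to a relabelling.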

## Self Organizing Maps

You may note the similar function names. This is why the :: syntax is very useful.
```{r SOM}
# install.packages('som')
library(som)
som::som()
kohonen::som()
class::SOM()
```

## Spectral Clustering
```{r}
# install.packages('kernlab')
library(kernlab)
specc()
```
