Super Learner In Prediction

(1)

University of California, Berkeley

U.C. Berkeley Division of Biostatistics Working Paper Series

Year  Paper

Super Learner In Prediction

Eric C. Polley

∗

Mark J. van der Laan

†

∗_{Division of Biostatistics, University of California, Berkeley, [email protected]} †_{University of California - Berkeley, [email protected]}

This working paper is hosted by The Berkeley Electronic Press (bepress) and may not be commer-cially reproduced without the permission of the copyright holder.

http://biostats.bepress.com/ucbbiostat/paper266 Copyright c2010 by the authors.

(2)

Super Learner In Prediction

Eric C. Polley and Mark J. van der Laan

Abstract

Super learning is a general loss based learning method that has been proposed and analyzed theoretically in van der Laan et al. (2007). In this article we con-sider super learning for prediction. The super learner is a prediction method de-signed to find the optimal combination of a collection of prediction algorithms. The super learner algorithm finds the combination of algorithms minimizing the validated risk. The super learner framework is built on the theory of cross-validation and allows for a general class of prediction algorithms to be considered for the ensemble. Due to the previously established oracle results for the cross-validation selector, the super learner has been proven to represent an asymptot-ically optimal system for learning. In this article we demonstrate the practical implementation and finite sample performance of super learning in prediction.

(3)

1 Introduction

A common task in statistical data analysis is estimator selection for prediction. An outcome _Y_i is measured along with a set of covariates_W_iand interest is in the regression function E(_Y|_W). For a given regression problem, it is possible to create a set of algorithms all estimating the same function but with some variety. An algorithm is an estimator that maps a data set of_nobservations (_W_i_{, Y}_i),

i= 1_{, . . . , n}, into a prediction function that can be used to map an input_W into a predicted value for_Y. The algorithms may diﬀer in the subset of the covariates used, the basis functions, the loss functions, the searching algorithm, and the range of tuning parameters, among others. We prefer the more general notion estimator selection instead of model selection, since the formal meaning of model in the ﬁeld of statistics is the set of possible probability distributions, while most algorithms are not indexed by a model choice.

Estimator selection is not limited to selecting only a single estimator. Recent work has demon-strated that an ensemble of the algorithms in the collection can outperform a single algorithm.

van der Laan et al. [2007] introduced the super learner for estimator selection and proved the optimality of such an method. The super learner is related to the stacking algorithm introduced in neural networks context by Wolpert [1992] and adapted to the regression context by Breiman

[1996b]. The stacking algorithm is examined inLeBlanc and Tibshirani[1996] and the relationship to the model-mix algorithm of Stone [1974] and the predictive sample-reuse method of Geisser

[1975] is discussed.

In this article, we demonstrate how the super learner algorithm, as outlined by van der Laan et al. [2007] as a general minimum loss based learning method, in the context of prediction can be implemented, and empirically demonstrate the advantage of such an ensemble method. After a review of the super learner algorithm, a sequence of simulations is presented to highlight properties of the super learner, in accordance with the theoretical optimality results of the super learner. Following the simulations, the super learner is demonstrated on a series of real data sets.

2 Super Learner

In this section the general super learner algorithm for prediction is described along with some of the speciﬁc recommendations for implementation. The details follow van der Laan et al. [2007] closely, although some of the notation has changed to be consistent with the rest of the article.

Observe the learning data set_X_i= (_Y_i_{, W}_i)_{, i}= 1_{, . . . , n}where_Y is the outcome of interest and

W is a p-dimensional set of covariates. The objective is to estimate the function_ψ₀(_W) = E(_Y|_W). The function can be expressed as the minimizer of the expected loss:

ψ0(W) = arg min

ψ E [L(X, ψ(W))] (1)

where the loss function is often the squared error loss, _L₂ : (_Y −_ψ(_W))2. For a given problem, a library of prediction algorithms can be proposed. A library is simply a collection of algorithms. The algorithms in the library should come from contextual knowledge and a large set of default algorithms. We use algorithm in the general sense as any mapping from the data into a predictor. The algorithms may range from a simple linear regression model to a multi-step algorithm involving

(4)

a predicted value we consider it a prediction algorithm. For example, the library may include least squares regression estimators for a large class of regression working models indexed by subsets of the covariates, algorithms indexed by set values of the ﬁne-tuning parameters for a collection of values, algorithms using internal cross-validation to set ﬁne-tuning parameters, algorithms coupled with screening procedures to reduce the dimension of the covariate vector, and so on. Denote the libraryL and the cardinality of Las_K(_n).

1. Fit each algorithm inLon the entire data setX ={_X_i :_i= 1_{, . . . , n}}to estimate ˆΨ_k(_W)_{, k}= 1_{, . . . , K}(_n).

2. Split the data set X into a training and validation sample, according to a V-fold cross-validation scheme: splits the ordered _n observations into _V-equal size groups, let the _ν-th group be the validation sample, and the remaining group the training sample, _ν = 1_{, . . . , V}. Deﬁne _T(_ν) to be the _νth training data split and _V(_ν) to be the corresponding validation data split. _T(_ν) =X \_V(_ν)_{, ν} = 1_{, . . . , V}.

3. For the_νthfold, ﬁt each algorithm inLon_T(_ν) and save the predictions on the corresponding validation data, ˆΨ_k,T₍_ν₎(_W_i),_X_i ∈_V(_ν) for_ν= 1_{, . . . , V}.

4. Stack the predictions from each algorithm together to create a _nby _K matrix,

Z =

ˆ

Ψ_k,T₍_ν₎(_W_V₍_ν₎)_{, ν}= 1_{, . . . , V} &_k= 1_{, . . . , K}

, where we used the notation _W_V₍_ν₎ = (_W_i:_X_i ∈_V(_ν)) for the covariate-vectors of the _V(_ν)-validation sample.

5. Propose a family of weighted combinations of the candidate estimators indexed by weight-vector_α: m(_z|_α) = K k=1 αkΨˆk,T(ν)(WV(ν)), αk≥0 ∀k, K k=1 αk= 1.

6. Determine the_αthat minimizes the cross-validated risk of the candidate estimatorK_k₌₁_α_kΨˆ_k over all allowed _α-combinations:

ˆ α= arg min α n i=1 (_Y_i−_m(_zi|_α))2_.

7. Combine ˆ_α with ˆΨ_k(_W)_{, k}= 1_{, . . . , K} according to the family _m(_z|_α) of weighted combina-tions to create the ﬁnal super learner ﬁt:

ˆ Ψ_SL(_W) = K k=1 ˆ αkΨˆk(W)

The super learner theory does not place any restrictions on the family of weighted combinations used for ensembling the algorithms in the library. The restriction of the parameter space for_αto be the convex combination of the algorithms in the library provides greater stability of the ﬁnal super learner prediction. The convex combination is not only empirically motivated, but also supported by the theory. The oracle results for the super learner require a bounded loss function. Restricting to the convex combination implies that if each algorithm in the library is bounded, the convex combination will also be bounded.

(5)

Also contained within the family of the convex combinations is the possibility of selecting only one algorithm. The nodes of the convex hull:

Anode ={(1,0, . . . ,0),(0,1, . . . ,0), . . . ,(0,0, . . . ,1)},

correspond with selecting the single algorithm from the library that minimizes the V-fold cross-validated risk. This is the usual cross-validation selector we refer to as the discrete super learner since it is searching over a ﬁnite set for the algorithm weights (_α). Although possible for the super learner to select only a single algorithm from the library, it typically selects a weighted average of the algorithms together in an ensemble.

Super learner performs asymptotically as well as best possible weighted combination

For convenience, we provide here a short summary of the oracle results established in previous articles. For detail, we refer to these articles as referred to invan der Laan et al.[2007]. The oracle result for the cross-validation selector among a family of candidate estimators was established in van der Laan and Dudoit [2003] for general bounded loss functions: see also van der Vaart et al. [2006] for unbounded loss functions with tails that are controlled by exponential bounds and inﬁnite families of candidate estimators, and van der Laan et al. [2007] for its application to the super learner. This result proves (see above references for the precise statement of these implications) that, if the number of candidate estimators,_K(_n), is polynomial in sample size, then the cross-validation selector is either asymptotically equivalent with the oracle selector (based on sample of size of training samples), or it achieves the parametric rate log_n/nfor convergence w.r.t.

d(_{ψ, ψ}₀) ≡ _E₀{_L(_ψ) −_L(_ψ₀)}. So in most realistic scenarios, in which none of the candidate estimators achieve the rate of convergence one would have with an a priori correctly speciﬁed parametric model, the cross-validated selected estimator selector performs asymptotically exactly as well (up till a constant) as the oracle selected estimator. As a consequence, the super learner will perform asymptotically exactly as well (w.r.t. the loss based dissimilarity) as the best possible choice for the given data set among the family of weighted combinations of the estimators. In particular, this proves that, by including all competitors in the library of candidate estimators, the super learner will asymptotically outperform any of its competitors, even if the set of competitors is allowed to grow polynomial in sample size. This motivated our naming “super learner” since it provides a system of combining many estimators into an improved estimator.

3 Examples

To examine the performance of the super learner we start with a series of simulations and then demonstrate a super learner on a collection of real data sets. The simulations are intended to illustrate the application of the super learner algorithm and highlight some of the properties.

(6)

3.1 Simulations

Four diﬀerent simulations are presented in this section. All four simulations involve a univariate_X drawn from a uniform distribution in [−4_,+4]. The outcome follows the function described below:

Sim 1: _Y =−2×I(_{X <}−3) + 2_.55×I(_{X >}−2)−2×I(_{X >}0) + 4×I(_{X >}2)−1×I(_{X >}3) +_ε Sim 2: _Y = 6 + 0_.4_X−0_.36_X2+ 0_.005_X3+_ε Sim 3: _Y = 2_.83×sin _π 2 ×X +_ε Sim 4: _Y = 4×sin (3_π×_X)×I(_{X >}0) +_ε

where I(·) is the usual indicator function and_εis drawn from an independent standard normal dis-tribution in all simulations. A sample of size 100 will be drawn for each scenario. Figure 1 contains a scatterplot with a sample from each of the four simulations. The true curve for each simulation is represented by the solid line. These four simulations were chosen because they represent a diverse set of true models but all four have the same optimal _R2 = 0_.80. The _R2 is computed as:

R2= 1− Yi− ˆ Yi ₂ (_Y_i−_ave(_Y_i))2 (2)

The optimal_R2 is the value attained when the true regression function (i.e., true conditional mean) is used. Knowing the true regression function implies _Y_i−_Yˆ_i2 = Var(_ε)_n= 1_n. Hence the optimal _R2 in all four simulations is _R2_opt = 1−1_/Var(_Y) and the variance of _Y is set such that

Ropt2 = 0.80 in each simulation.

In all four simulations, we start with the same library of prediction algorithms. Table 1 contains a list of the algorithms in the library. The library of algorithms should ideally be a diverse set. One common aspect of many prediction algorithms is the need to specify values for tuning parameters. For example the generalized additive models requires a degree of freedom value for the spline functions or the neural network requires a size value. The tuning parameters could be selected using cross-validation or bootstrapping, but the different values of the tuning parameters could also be considered different prediction algorithms. A library could contain three generalized additive models with degrees of freedom equal to 2, 3, and 4. The selection of tuning parameters does not need to be done inside an algorithm, but each fixed value of a tuning parameter can be considered a different algorithm and the ensembling method can decide how to weight the different algorithms with unique values of the tuning parameters. When one considers different values of tuning parameters as unique prediction algorithms in the library it is easy to see how the number of algorithms in the library can become large. The library for the simulations contains 21 algorithms when considering different values of tuning parameters. A linear model and a linear model with a quadratic term are considered. The default random forest algorithm along with a collection of bagging regression trees with values of the complexity parameter equal to 0.10, 0.01, and 0.00 and a bagging algorithm adjusting the minimum split parameter to be 5 (with the default complexity parameter of 0.01). The generalized additive model with degrees of freedom equal to 2, 3, and 4 is added along with the default gradient boosting model. Neural networks with sizes 2 through 5, the polymars algorithm and the Bayesian additive regression trees is added. Finally, we consider the loess curve with spans equal to 0.75, 0.50, 0.25, and 0.10.

(7)

Simulation 1 X Y −2 0 2 4 6 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 0 2 Simulation 2 X Y −2 0 2 4 6 8 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 0 2 Simulation 3 X Y −4 −2 0 2 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 0 2 Simulation 4 X Y −4 −2 0 2 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 0 2

Figure 1: Scatterplots of the four simulations. The solid line is the true relationship. The points represent one of the simulation samples of size _n= 100.

(8)

Table 1: Library of prediction algorithms for the simulations and citation for the correspondingR package.

Algorithm Description Author

glm linear model R Development Core Team[2010]

interaction polynomial linear model R Development Core Team[2010]

randomForest random Forest Liaw and Wiener[2002]

Breiman[2001]

bagging bootstrap aggregation of trees Peters and Hothorn [2009]

Breiman[1996a] gam generalized additive models Hastie[1992]

Hastie and Tibshirani [1990]

gbm gradient boosting Ridgeway[2007]

Friedman[2001]

nnet neural network Venables and Ripley [2002]

polymars polynomial spline regression Kooperberg [2009]

Friedman[1991]

bart Bayesian additive regression trees Chipman and McCulloch[2009]

Chipman et al. [2010]

loess local polynomial regression Cleveland et al.[1992]

Figure 2 contains the super learner fit on a simulated data set for each scenario. With the given library the super learner is able to adapt to the underlying structure of the data generating function. For each algorithm we evaluated the _R2 on a test set of size 10,000. To evaluate the performance of the super learner in comparison to each algorithm in the library, we simulated 100 samples of size 100 and computed the _R2 for each fit of the true regression function. The results are presented in Table 2. In the first simulation, the regression tree based methods perform best. Bagging complete regression trees (cp = 0) has the largest _R2. On the second simulation, the best algorithm is the quadratic linear regression model (SL.interaction). In both these cases the super learner is able to adapt to the underlying structure and is able to have an average _R2 near the best. The same trend is exhibited in simulations 3 and 4, the super learner method of combining algorithms does nearly as well as the individual best algorithm. Since the individual best algorithm is not known, if a researcher was to select a single algorithm they might do well on some cases, but the overall performance will be worse than the super learner. For example, an individual who always uses bagging complete trees (SL.bagging(cp = 0.0)) will do well on the first 3 simulations, but will perform poorly on the 4th simulation compared to the average performance of the super learner.

The optimal _R2 is the value attained by knowing the true regression function. The optimal value gives an upper bound on the possible _R2 for each algorithm. In the ﬁrst three simulations the super learner is able to come close to the optimal value because some of the algorithms in the library are able to well approximate the truth. But in the fourth simulation the library is not rich enough to contain a combination of algorithms able to approach the optimal value. The super learner is able to do as well as the best algorithms in the library but does not attain the optimal

(9)

Simulation 1 X Y −2 0 2 4 6 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 0 2 Simulation 2 X Y −2 0 2 4 6 8 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 0 2 Simulation 3 X Y −4 −2 0 2 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 0 2 Simulation 4 X Y −4 −2 0 2 4 ● ● ● ● ● ● ● ● ● ● ● ● ● _● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 0 2

Figure 2: Scatterplots of the four simulations. The solid line is the true relationship. The points represent one of the simulated datasets of size_n= 100. The dashed line is the super learner ﬁt for the shown dataset.

(10)

Table 2: Results for all 4 simulations. Average R-squared based on 100 simulations and the corresponding standard errors.

Algorithm Sim 1 Sim 2 Sim 3 Sim 4

R2 se(_R2) _R2 se(_R2) _R2 se(_R2) _R2 se(_R2) SuperLearner 0_.741 0_.032 0_.754 0_.025 0_.760 0_.025 0_.496 0_.122 discrete SL 0_.729 0_.079 0_.758 0_.029 0_.757 0_.055 0_.509 0_.132 SL.glm 0_.422 0_.012 0_.189 0_.016 0_.107 0_.016 −0_.018 0_.021 SL.interaction 0_.428 0_.016 0_.769 0_.011 0_.100 0_.020 −0_.018 0_.029 SL.randomForest 0_.715 0_.021 0_.702 0_.027 0_.724 0_.018 0_.460 0_.109 SL.bagging(0.01) 0_.751 0_.022 0_.722 0_.036 0_.723 0_.018 0_.091 0_.054 SL.bagging(0.1) 0_.635 0_.120 0_.455 0_.195 0_.661 0_.029 0_.020 0_.025 SL.bagging(0.0) 0_.752 0_.021 0_.722 0_.034 0_.727 0_.017 0_.102 0_.060 SL.bagging(ms5) 0_.747 0_.020 0_.727 0_.030 0_.741 0_.016 0_.369 0_.104 SL.gam(2) 0_.489 0_.013 0_.649 0_.026 0_.213 0_.029 −0_.014 0_.023 SL.gam(3) 0_.535 0_.033 0_.748 0_.024 0_.412 0_.037 −0_.017 0_.029 SL.gam(4) 0_.586 0_.027 0_.759 0_.020 0_.555 0_.022 −0_.020 0_.034 SL.gbm 0_.717 0_.035 0_.694 0_.038 0_.679 0_.022 0_.063 0_.040 SL.nnet(2) 0_.476 0_.235 0_.591 0_.245 0_.283 0_.285 −0_.008 0_.030 SL.nnet(3) 0_.700 0_.096 0_.700 0_.136 0_.652 0_.218 0_.009 0_.035 SL.nnet(4) 0_.719 0_.077 0_.730 0_.062 0_.738 0_.102 0_.032 0_.052 SL.nnet(5) 0_.705 0_.079 0_.716 0_.070 0_.731 0_.077 0_.042 0_.060 SL.polymars 0_.704 0_.033 0_.733 0_.032 0_.745 0_.034 0_.003 0_.040 SL.bart 0_.740 0_.015 0_.737 0_.027 0_.764 0_.014 0_.077 0_.034 SL.loess(0.75) 0_.599 0_.023 0_.761 0_.019 0_.487 0_.028 −0_.023 0_.033 SL.loess(0.50) 0_.695 0_.018 0_.754 0_.022 0_.744 0_.029 −0_.033 0_.038 SL.loess(0.25) 0_.729 0_.016 0_.738 0_.025 0_.772 0_.015 −0_.076 0_.068 SL.loess(0.1) 0_.690 0_.044 0_.680 0_.064 0_.699 0_.039 0_.544 0_.118

(11)

For the fourth simulation example, if the researcher thought that the relationship between _X and _Y followed a linear model up to a knot point, and then was a sine curve with an unknown frequency and amplitude, they could augment the library with proposed prediction algorithms:

sinKnot(_X;_{k, ω}) ={_β₁+_β₂_X} ×I(_{X < k}) (3) +{_β₃+_β₄sin(_ωX)} ×I(_X≥_k)

for ﬁxed values of the knot _k and the frequency _ω. For the fourth simulation, we augmented the library with the regression model above selecting values for the knots in {−2_,−1_,0_,+1_,+2} and values for the frequency in {_π,2_π,3_π,4_π} and all pairwise combinations of the two tuning parameters. This added 20 algorithms to the library (for a total of 41 algorithms). The results are presented in table 3. The 20 knot plus sine curve functions are labeled sinKnot(_{k, ω}) in the table. The algorithm containing the true relationship between _X and _Y is sinKnot(0, 3_π). The super learner now achieves a performance very close to the optimal _R2.

3.2 Data Analysis

To study the super learner in real data examples, we collected a set of publicly available data sets. Table 4 contains a descriptions of the data sets used for the study. The sample sizes ranged from 200 to 654 observations and the number of covariates ranged from 3 to 18. All 13 data sets have a continuous outcome and no missing values. The data sets can be found either in public repositories like the UCI data repository or in textbooks, with the corresponding citation listed in the table.

For the library of prediction algorithms, the applicable algorithms from the univariate simula-tions along with the algorithms listed in table 5. These algorithms represent a diverse set of basis functions and should allow the super learner to work well in most real settings. For the comparison across all data sets, we kept the library of algorithms the same. The super learner may beneﬁt from including algorithms based on contextual knowledge of the data problem as demonstrated in the augmented library in simulation 4.

Each data set has a diﬀerent scale for the outcome. In order to compare the performance of the prediction algorithms across diverse data sets we used the relative mean squared error where the denominator is the mean squared error of a linear model:

relM SE(_k) = M SE(k)

M SE(_lm), k= 1, . . . , K (4)

The results for the super learner, the discrete super learner, and each individual algorithm can be found in figure 3. Each point represents the 10-fold cross-validated relative mean squared error for a data set and the plus sign is the geometric mean of the algorithm across all 13 data sets. The super learner slightly outperforms the discrete super learner but both outperform any individual algorithm. With the real data it is unlikely that one single algorithm contains the true relationship and the benefit of the combination of the algorithms versus the selection of a single algorithm is demonstrated. The additional estimation of the combination parameters (_α) does not appear to cause an over-fit in terms of the risk assessment. Among the individual library algorithms the

(12)

Table 3: Simulation results for example 2. Average mean squared error and R-squared based on 100 simulations and the corresponding standard errors.

Algorithm MSE se(MSE) _R2 se(_R2) SuperLearner 1_.193 0_.200 0_.759 0_.040 discrete SL 1_.036 0_.035 0_.791 0_.007 SL.glm 5_.040 0_.106 −0_.017 0_.021 SL.interaction 5_.057 0_.126 −0_.021 0_.026 SL.randomForest 2_.645 0_.523 0_.466 0_.106 SL.bagging(0.01) 4_.414 0_.351 0_.109 0_.071 SL.bagging(0.1) 4_.734 0_.200 0_.044 0_.040 SL.bagging(0.0) 4_.416 0_.343 0_.109 0_.069 SL.bagging(ms5) 2_.650 0_.543 0_.465 0_.110 SL.gam(2) 5_.033 0_.113 −0_.016 0_.023 SL.gam(3) 5_.061 0_.131 −0_.022 0_.027 SL.gam(4) 5_.089 0_.160 −0_.027 0_.032 SL.gbm 4_.580 0_.282 0_.075 0_.057 SL.nnet(2) 5_.067 0_.447 −0_.023 0_.090 SL.nnet(3) 4_.922 0_.627 0_.006 0_.127 SL.nnet(4) 4_.769 0_.528 0_.037 0_.107 SL.nnet(5) 4_.816 0_.928 0_.028 0_.187 SL.polymars 4_.996 0_.309 −0_.008 0_.062 SL.bart 4_.531 0_.198 0_.085 0_.040 SL.loess(0.75) 5_.124 0_.186 −0_.034 0_.037 SL.loess(0.50) 5_.206 0_.238 −0_.051 0_.048 SL.loess(0.25) 5_.661 0_.562 −0_.143 0_.114 SL.loess(0.1) 2_.413 0_.823 0_.513 0_.166 SL.sinKnot(-2, 3.14) 5_.097 0_.122 −0_.029 0_.025 SL.sinKnot(-1, 3.14) 5_.119 0_.135 −0_.033 0_.027 SL.sinKnot(0, 3.14) 5_.151 0_.151 −0_.040 0_.031 SL.sinKnot(1, 3.14) 5_.109 0_.193 −0_.031 0_.039 SL.sinKnot(2, 3.14) 5_.221 0_.217 −0_.054 0_.044 SL.sinKnot(-2, 6.28) 5_.146 0_.179 −0_.039 0_.036 SL.sinKnot(-1, 6.28) 5_.169 0_.195 −0_.043 0_.039 SL.sinKnot(0, 6.28) 5_.217 0_.246 −0_.053 0_.050 SL.sinKnot(1, 6.28) 5_.161 0_.228 −0_.042 0_.046 SL.sinKnot(2, 6.28) 5_.268 0_.271 −0_.063 0_.055 SL.sinKnot(-2, 9.42) 2_.405 0_.064 0_.515 0_.013 SL.sinKnot(-1, 9.42) 1_.854 0_.058 0_.626 0_.012 SL.sinKnot(0, 9.42) 1_.036 0_.035 0_.791 0_.007 SL.sinKnot(1, 9.42) 2_.054 0_.090 0_.585 0_.018 SL.sinKnot(2, 9.42) 3_.163 0_.163 0_.362 0_.033 SL.sinKnot(-2, 12.57) 5_.110 0_.129 −0_.031 0_.026 SL.sinKnot(-1, 12.57) 5_.136 0_.145 −0_.037 0_.029 SL.sinKnot(0, 12.57) 5_.173 0_.160 −0_.044 0_.032

(13)

Table 4: Description of data sets. _n is the sample size and _p is the number of covariates. All examples have a continuous outcome.

Name _n _p Source

ais 202 10 Cook and Weisberg[1994]

diamond 308 17 Chu[2001]

cps78 550 18 Berndt[1991] cps85 534 17 Berndt[1991] cpu 209 6 Kibler et al. [1989]

FEV 654 4 Rosner [1999]

Pima 392 7 Newman et al. [1998] laheart 200 10 Aﬁﬁ and Azen[1979]

mussels 201 3 Cook [1998]

enroll 258 6 Liu and Stengos[1999] fat 252 14 Penrose et al. [1985] diabetes 366 15 Harrell[2001] house 506 13 Newman et al. [1998]

super learner which demonstrates the advantage of combining algorithms over selecting a single algorithm.

Table 5: Additional prediction algorithms in the library for the real data examples to be combined with the algorithms from Table 1.

Algorithm Description Author

bayesglm Bayesian linear model Gelman et al. [2010]

Gelman et al. [2009]

glmnet Elastic net Friedman et al. [2010b]

Friedman et al. [2010a]

DSA DSA algorithm Neugebauer and Bullard[2009]

Sinisi and van der Laan[2004] step Stepwise regression Venables and Ripley [2002]

ridge Ridge regression Venables and Ripley [2002]

svm Support vector machine Dimitriadou et al.[2009]

Chang and Lin [2001]

3.3 Array Data

(14)

Figure 3: 10-fold cross-validated relative mean squared error compared to glm across 13 real datasets. Sorted by the geometric mean, denoted with the plus (+) sign.

Relative MSE

Method

svm gbm(1) randomForest gbm(2) ridge step DSA glmnet(.25) glmnet(.50) glmnet(1) glmnet(.75) bayesglm glm step.interaction polymars gam(5) gam gam(4) gam(3) bart discreteSL SuperLearner ● ● ● ● ●●● ●● ● ●● ● ● ● ● ● ●●● ●●●● ●● ● ● ●● ● ●● ●● ●● ● ● ● ● ● ● ● ● ●●●●●● ● ● ● ● ● ● ● ● ●●● ● ●● ●● ●● ●●● ●●●●●● ●● ●●● ●● ● ●●●●● ●● ● ● ●● ●● ●●● ●● ●● ● ● ●● ●●●●● ●● ●● ● ● ●● ●●●●● ●● ●●●●●● ●●●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ●●●● ● ● ● ● ●●● ●●●●●● ● ● ● ● ●●● ● ●●●●● ● ●●●● ● ●● ●●●● ●● ● ●● ●●●● ● ●●●● ● ● ● ●● ●●● ● ●●● ●● ● ● ● ● ●● ●● ●●●●● ● ● ●● ●●● ●●●●●● ● ● ● ● ● ●●●●● ●●● 0.5 1 2 4 6 8 10 12 14

(15)

chosen the same.

The super learner in a microarray example is demonstrated using two publicly available data sets. The breast cancer data fromvan’t Veer et al. [2002] and the prostate cancer array data from

Singh et al. [2002]. The breast cancer study was conducted to develop a gene expression based predictor for 5 year distant metastases. The outcome in this case is a binary indicator that patient had a distant metastases within 5 years after initial therapy. In addition to the expression data, six clinical variables were attained. The clinical information is age, tumor grade, tumor size, estrogen receptor status, progesterone receptor status and angioinvasion. The array data contains 4348 genes after the unsupervised screening steps outlined in the original article. We used the entire sample of 97 individuals (combining the training and validation samples from the original article) to ﬁt the super learner.

The prostate cancer study was conducted to develop a gene expression based predictor of a cancerous tumor. 102 tissue samples were collected with 52 from cancer tissue and 50 from non-cancer tissue. The data set contains 6033 genes after the pre-processing steps, but no clinical variables were available in the data set.

One aspect of high dimensional data analysis is that it is often beneficial to screen the variables before running the prediction algorithms. With the super learner framework, screening of variables can be either supervised or unsupervised since the screening step will be part of the cross-validation step. Screening algorithms can be coupled with prediction algorithms to create new algorithms in the library. For example, we may consider k-nearest neighbors using all features and on the subset of only clinical variables. These two algorithms can be considered unique algorithms in the super learner library. Another screening algorithm we consider is to test the pairwise correlations of each variable with the outcome and rank the variables by the corresponding p-value. With the ranked list of variables, we consider the screening cutoffs as: variables with a p-value less than 0.1, variables with a p-value less than 0.01, variables in the bottom 20, and variables in the bottom 50. An additional screening algorithm is to run the glmnet algorithm and select the variables with non-zero coefficients.

The results for the breast cancer data can be found in table 6. The algorithms in the library are k-nearest neighbors with k = {10, 20, 30, 40}, elastic net with _α ={1_.0_,0_.75_,0_.50_,0_.25}, random forests, bagging, bart, and an algorithm that uses the mean value of the outcome as the predicted probability. We coupled these algorithms with the screening algorithms to produce the full list of 38 algorithms. Within the library of algorithms, the best algorithm in terms of minimum risk estimate is the random forest algorithm using only the clinical variables (MSE = 0.198). As we observed in the previous examples, the super learner is able to attain a risk comparable to the best algorithm (MSE = 0.194).

The results from the prostate cancer data can be found in table 7. The library of algorithms is similar to that used in the breast cancer example minus the clinical variable screening. In this example, the elastic net algorithm performs best among the library algorithms. The mean squared error for the elastic net fit with fine-tuning parameter alpha equal to 0.50, and fit on the entire data set, was 0.071. The super learner outperforms even the best algorithm in the library here with a mean squared error of 0.067.

Standard errors of the cross-validated risk are based on the results of Theorem 3 inDudoit and van der Laan [2005]. The equation for the estimator of the variance of the cross-validated risk of

(16)

Table 6: 20-fold cross-validated mean squared error for each algorithm and the standard error for the breast cancer study.

Algorithm Subset Risk SE

Super Learner – 0_.194 0.0168 Discrete SL – 0_.238 0.0239 SL.knn(10) All 0_.249 0_.0196 SL.knn(10) Clinical 0_.239 0_.0188 SL.knn(10) cor(_{p <} 0.1) 0_.262 0_.0232 SL.knn(10) cor(_{p <}0_.01) 0_.224 0_.0205 SL.knn(10) glmnet 0_.219 0_.0277 SL.knn(20) All 0_.242 0_.0129 SL.knn(20) Clinical 0_.236 0_.0123 SL.knn(20) cor(_{p <}0_.1) 0_.233 0_.0168 SL.knn(20) cor(_{p <}0_.01) 0_.206 0_.0176 SL.knn(20) glmnet 0_.217 0_.0257 SL.knn(30) All 0_.239 0_.0128 SL.knn(30) Clinical 0_.236 0_.0119 SL.knn(30) cor(_{p <}0_.1) 0_.232 0_.0139 SL.knn(30) cor(_{p <}0_.01) 0_.215 0_.0165 SL.knn(30) glmnet 0_.210 0_.0231 SL.knn(40) All 0_.240 0_.0111 SL.knn(40) Clinical 0_.238 0_.0105 SL.knn(40) cor(_{p <}0_.1) 0_.236 0_.0118 SL.knn(40) cor(_{p <}0_.01) 0_.219 0_.0151 SL.knn(40) glmnet 0_.211 0_.0208 SL.glmnet(1.0) cor(Rank = 50) 0_.229 0_.0285 SL.glmnet(1.0) cor(Rank = 20) 0_.208 0_.0260 SL.glmnet(0.75) cor(Rank = 50) 0_.221 0_.0269 SL.glmnet(0.75) cor(Rank = 20) 0_.209 0_.0258 SL.glmnet(0.50) cor(Rank = 50) 0_.226 0_.0269 SL.glmnet(0.50) cor(Rank = 20) 0_.211 0_.0256 SL.glmnet(0.25) cor(Rank = 50) 0_.230 0_.0266 SL.glmnet(0.25) cor(Rank = 20) 0_.216 0_.0252 SL.randomForest Clinical 0_.198 0_.0186 SL.randomForest cor(_{p <}0_.01) 0_.204 0_.0179 SL.randomForest glmnet 0_.220 0_.0245 SL.bagging Clinical 0_.207 0_.0160 SL.bagging cor(_{p <}0_.01) 0_.205 0_.0184 SL.bagging glmnet 0_.206 0_.0219 SL.bart Clinical 0_.202 0_.0183 SL.bart cor(_{p <}0_.01) 0_.210 0_.0207 SL.bart glmnet 0_.220 0_.0275 SL.mean All 0_.224 0_.1016

(17)

Table 7: 20-fold cross-validated mean squared error for each algorithm and the standard error for the prostate cancer study.

Algorithm subset Risk SE

SuperLearner – 0_.067 0_.018 Discrete SL – 0_.076 0_.019 SL.knn(10) All 0_.149 0_.017 SL.knn(10) cor (_{p <}0_.1) 0_.131 0_.018 SL.knn(10) cor (_{p <}0_.01) 0_.120 0_.020 SL.knn(10) glmnet 0_.066 0_.019 SL.knn(20) All 0_.167 0_.013 SL.knn(20) cor (_{p <}0_.1) 0_.159 0_.015 SL.knn(20) cor (_{p <}0_.01) 0_.134 0_.017 SL.knn(20) glmnet 0_.073 0_.018 SL.knn(30) All 0_.185 0_.010 SL.knn(30) cor (_{p <}0_.1) 0_.180 0_.011 SL.knn(30) cor (_{p <}0_.01) 0_.153 0_.014 SL.knn(30) glmnet 0_.076 0_.016 SL.knn(40) All 0_.201 0_.008 SL.knn(40) cor (_{p <}0_.1) 0_.199 0_.009 SL.knn(40) cor (_{p <}0_.01) 0_.175 0_.011 SL.knn(40) glmnet 0_.085 0_.014 SL.glmnet(_α = 1_.0) All 0_.080 0_.019 SL.glmnet(_α = 1_.0) cor (_{p <}0_.1) 0_.076 0_.018 SL.glmnet(_α = 1_.0) cor (Rank = 50) 0_.091 0_.021 SL.glmnet(_α = 0_.75) All 0_.074 0_.018 SL.glmnet(_α = 0_.75) cor (_{p <}0_.1) 0_.072 0_.017 SL.glmnet(_α = 0_.75) cor (Rank = 50) 0_.086 0_.021 SL.glmnet(_α = 0_.50) All 0_.071 0_.017 SL.glmnet(_α = 0_.50) cor (_{p <}0_.1) 0_.069 0_.017 SL.glmnet(_α = 0_.50) cor (Rank = 50) 0_.084 0_.019 SL.randomForest cor (_{p <}0_.01) 0_.101 0_.014 SL.randomForest cor (Rank = 50) 0_.082 0_.016 SL.randomForest glmnet 0_.086 0_.016 SL.bart cor (_{p <}0_.01) 0_.117 0_.016

SL.bart glmnet 0_.084 0_.017

SL.bart cor (Rank = 50) 0_.085 0_.018 SL.polymars cor (Rank = 50) 0_.081 0_.022 SL.polymars cor (Rank = 100) 0_.093 0_.024

(18)

an estimator ˆΨ as an estimator of the true risk is: σ2n= 1 n Yi−ΨˆT(i)(Wi) ₂ −_θ¯ n 2 (5) where ¯_θ_n is the cross-validated risk estimate of the mean squared error. In words, the sample standard deviation of the cross-validation values of the squared error are used to calculate the standard error of the V-fold cross-validated mean squared error estimates.

4 Discussion

Beyond the asymptotic oracle performance of the super learner, our evaluation of the practical per-formance of the super learner shows that the super learner is also an adaptive and robust estimator selection procedure for small samples. Combining estimators with the weights (i.e. positive and summing up till 1) based on minimizing cross-validated risk appears to control for over-fitting of the final ensemble fit generated by the super learning algorithm, even when using a large collection of candidate estimators. The above examples demonstrate that the super learner framework allows a researcher to try many prediction algorithms, and many a priori guessed models about the true regression model for a given problem, knowing that the final combined super learner fit will either be the best fit or near the best fit.

Combining estimators with the convex combination algorithm proposed here appears to also improve on the usual cross-validation selector (i.e. discrete super learner). Selection of a single algorithm based on V-fold cross-validated risk minimization may be unstable with the small sample sizes of the data sets presented here, while the super learner can average a few of the best algorithms in the library to give a more stable estimator compared to the discrete super learner.

References

A. Aﬁﬁ and S. Azen. Statistical Analysis: A Computer Oriented Approach. Academic Press, New York, NY, 2nd edition, 1979.

E. R. Berndt. The Practice of Econometrics. Addison-Wesley, New York, NY, 1991. L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996a.

L. Breiman. Stacked regressions. Machine Learning, 24:49–64, 1996b. L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.

C.-C. Chang and C.-J. Lin.LIBSVM: a library for support vector machines, 2001. Software available

at_{http://www.csie.ntu.edu.tw/~cjlin/libsvm}.

H. Chipman and R. McCulloch. BayesTree: Bayesian Methods for Tree Based Models, 2009. URL

http://CRAN.R-project.org/package=BayesTree. R package version 0.3-1.

H. A. Chipman, E. I. George, and R. E. McCulloch. BART: Bayesian additive regression trees.

(19)

S. Chu. Pricing the C’s of diamond stones. Journal of Statistical Education, 9(2), 2001.

W. S. Cleveland, E. Groose, and W. M. Shyu. Local regression models. In J. M. Chambers and T. Hastie, editors, Statistical Models in S, chapter 8. Wadsworth & Brooks/Cole, 1992.

D. Cook. Regression Graphics: Ideas for Studying Regression Through Graphics. Wiley, New York, NY, 1998.

D. Cook and S. Weisberg. An Introduction to Regression Graphics. Wiley, New York, NY, 1994. E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel. e1071: Misc Functions of

the Department of Statistics (e1071), TU Wien, 2009. URL http://CRAN.R-project.org/

package=e1071. R package version 1.5-22.

S. Dudoit and M. J. van der Laan. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology, 2:131–154, 2005.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 2010a.

J. Friedman, T. Hastie, and R. Tibshirani. glmnet: Lasso and elastic-net regularized generalized

linear models, 2010b. URL http://CRAN.R-project.org/package=glmnet. R package version

1.1-5.

J. H. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–141, 1991. J. H. Friedman. Greedy function approximation: A gradient boosting machine.Annals of Statistics,

29:1189–1232, 2001.

S. Geisser. The predictive sample reuse method with applications. Journal of the American Sta-tistical Association, 70(350):320–328, 1975.

A. Gelman, A. Jakulin, M. G. Pittau, and Y.-S. Su. A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics, 2(3):1360–1383, 2009. A. Gelman, Y.-S. Su, M. Yajima, J. Hill, M. G. Pittau, J. Kerman, and T. Zheng.arm: Data

Anal-ysis Using Regression and Multilevel/Hierarchical Models, 2010. URLhttp://CRAN.R-project.

org/package=arm. R package version 1.3-02.

F. E. Harrell, Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer-Verlag, New York, NY, 2001.

T. Hastie. Generalized additive models. In J. M. Chambers and T. Hastie, editors, Statistical Models in S, chapter 7. Wadsworth & Brooks/Cole, 1992.

T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990.

D. Kibler, D. aha, and M. K. Albert. Instance-based prediction of real-valued attributes. Compu-tational Intelligence, 5:51, 1989.

(20)

M. LeBlanc and R. Tibshirani. Combining estimates in regression and classiﬁcation. Journal of the American Statistical Association, 91:1641–1650, 1996.

A. Liaw and M. Wiener. Classiﬁcation and regression by randomforest. R News, 2(3):18–22, 2002.

URLhttp://CRAN.R-project.org/package=randomForest.

Z. Liu and T. Stengos. Non-linearities in cross country growth regressions: A semiparametric approach. Journal of Applied Econometrics, 14:527–538, 1999.

R. Neugebauer and J. Bullard. DSA: Data-Adaptive Estimation with Cross-Validation and the

D/S/A Algorithm, 2009. URL_{http://www.stat.berkeley.edu/~laan/Software/}. R package

version 3.1.3.

D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998. URL http://archive.ics.uci.edu/ml/.

K. Penrose, A. Nelson, and A. Fisher. Generalized body composition prediction equation for men using simple measurement techniques. Medicine and Science in Sports and Exercise, 17:189, 1985.

A. Peters and T. Hothorn. ipred: Improved Predictors, 2009. URLhttp://CRAN.R-project.org/

package=ipred. R package version 0.8-8.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foun-dation for Statistical Computing, Vienna, Austria, 2010. URL http://www.R-project.org. G. Ridgeway. gbm: Generalized Boosted Regression Models, 2007. R package version 1.6-3. B. Rosner. Fundamentals of Biostatistics. Duxbury, Paciﬁc Grove, CA, 5th edition, 1999.

D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D’Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kantoﬀ, T. R. Golub, and W. R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, 2002.

S. E. Sinisi and M. J. van der Laan. Deletion/Substitution/Addition algorithm in learning with applications in genomics. Statistical Applications in Genetics and Molecular Biology, 3(1):Article 18, 2004.

M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36(2):111–147, 1974.

M. J. van der Laan and S. Dudoit. Uniﬁed cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle in-equalities and examples. Technical Report 130, Division of Biostatistics, University of California, Berkeley, 2003. URL http://www.bepress.com/ucbbiostat/paper130/.

M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(25):Article 25, 2007.

(21)

A.W. van der Vaart, S. Dudoit, and M.J. van der Laan. Oracle inequalities for multi-fold cross vaidation. Statistics and Decisions, 24(3):351–371, 2006.

L. J. van’t Veer, H. Dal, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend. Gene expression proﬁling predicts clinical outcome of breast cancer. Nature, 415:530–536, 2002.

W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, New York, fourth edition, 2002.