
4. DIRECT GRADIENT ANALYSIS AND MONTE-CARLO PERMUTATION

4.9. Stepwise selection of the model

At the end of Section 4.1, the forward selection of explanatory variables for a regression model was described in some detail. The forward selection available in the CANOCO program has the same purpose and methodology. It uses a partial Monte Carlo permutation test to assess each candidate predictor for inclusion in the subset of explanatory variables used in the constrained ordination model.

If we select an interactive ("manual") forward selection procedure in the Canoco for Windows program, CANOCO presents the following window during the analysis (Figure 4-6).


Figure 4-6 Dialog box for the Forward selection of environmental variables. The question marks for the variable BrLeaf correspond to the variable which was not tested by the permutation test during the forward selection.

Figure 4-6 illustrates the state of the forward selection procedure in which the three best explanatory variables (Forest, BrLeaf, E2) have already been selected (they are displayed in the lower part of the window). The values at the top of the window show that the three selected variables account for approximately 72% of the total variability explained by all the environmental variables (i.e. 0.320 of 0.447).

The list of variables in the upper part of the window shows the remaining "candidate predictors" ordered by the decreasing contribution that the variable would provide when added to the set of already selected variables. We can see that the variable "ForDens" is a hot candidate. It would increase the amount of explained variability from 0.320 to 0.352 (0.320 + 0.032).
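The arithmetic behind these figures can be checked in a few lines. This is only an illustrative sketch: the three numbers are the ones quoted in the text, while the variable names are made up for the example.

```python
# Values quoted in the text for the forward-selection window (Figure 4-6).
explained_selected = 0.320  # Forest + BrLeaf + E2 already in the model
explained_all = 0.447       # all environmental variables together
gain_fordens = 0.032        # extra variability ForDens would add

# Share of the explainable variability captured by the selected subset.
share_selected = explained_selected / explained_all

# The same quantities after ForDens is added as a fourth variable.
explained_with_fordens = explained_selected + gain_fordens
share_with_fordens = explained_with_fordens / explained_all

print(f"selected subset: {share_selected:.1%} of the explainable variability")
print(f"with ForDens:    {explained_with_fordens:.3f} ({share_with_fordens:.1%})")
```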

To judge whether such an increase is larger than a random contribution, we can use a partial Monte Carlo permutation test. In this test, we use the candidate variable as the only explanatory variable (so we get an ordination model with just one canonical axis) and the already selected environmental variables (Forest, BrLeaf, and E2 in this case) as the covariables, together with any a priori selected covariables. If we reject the null hypothesis for that partial test, we can include that variable in the subset.
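The logic of such a partial test can be sketched in plain Python for a linear (RDA-like) model. This is a minimal illustration and not CANOCO's own algorithm: the pseudo-F statistic, the choice to permute residuals left after fitting the covariables, the function names, and the toy data are all assumptions made for this example.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def remove_projection(col, basis):
    """Residual of col after regressing it on an orthonormal basis."""
    v = list(col)
    for q in basis:
        c = dot(v, q)
        v = [vi - c * qi for vi, qi in zip(v, q)]
    return v

def orthonormal_basis(columns):
    """Gram-Schmidt basis for the space spanned by the given columns."""
    basis = []
    for col in columns:
        v = remove_projection(col, basis)
        norm = dot(v, v) ** 0.5
        if norm > 1e-12:  # skip columns that add nothing new
            basis.append([vi / norm for vi in v])
    return basis

def pseudo_f(species_res, x_res, df_resid):
    """Pseudo-F for one constraining variable (one canonical axis)."""
    xx = dot(x_res, x_res)
    total = sum(dot(y, y) for y in species_res)
    fitted = sum(dot(y, x_res) ** 2 / xx for y in species_res)
    return fitted / ((total - fitted) / df_resid)

def partial_permutation_test(species, candidate, covariables, n_perm=199, seed=1):
    """species and covariables are lists of columns; candidate is one column."""
    n = len(candidate)
    # Intercept plus covariables (the already selected variables).
    basis = orthonormal_basis([[1.0] * n] + covariables)
    x_res = remove_projection(candidate, basis)
    species_res = [remove_projection(y, basis) for y in species]
    df_resid = n - len(basis) - 1
    f_obs = pseudo_f(species_res, x_res, df_resid)
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # the same sample permutation for all species
        permuted = [remove_projection([y[i] for i in order], basis)
                    for y in species_res]
        if pseudo_f(permuted, x_res, df_resid) >= f_obs:
            hits += 1
    # The observed arrangement counts as one of the permutations.
    return f_obs, (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): one already selected variable, one candidate.
selected = [[float(i) for i in range(12)]]
candidate = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0, 8.0]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.05, 0.1, -0.15, 0.05, -0.2, 0.1]
species = [
    [2.0 * x + 0.5 * z + e for x, z, e in zip(candidate, selected[0], noise)],
    [-1.0 * x + e for x, e in zip(candidate, reversed(noise))],
]
f_obs, p_value = partial_permutation_test(species, candidate, selected)
print(f"pseudo-F = {f_obs:.1f}, P = {p_value:.3f}")
```

With 199 permutations the smallest attainable P is 1/200 = 0.005, so the number of permutations limits how strict a significance threshold can be applied. The sketch assumes the candidate is not an exact linear combination of the covariables.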

The effect of the variable tested in such a context is called its conditional (partial) effect, and its value is strictly dependent on the exact selection sequence. But at the start of the forward selection process, when no environmental variable has yet entered the selected subset, we can test each variable separately to estimate its independent, marginal effect. This is the amount of variability in the species data that would be explained by a constrained ordination model using that variable alone as an explanatory variable. The discrepancy between the order of variables sorted by their marginal effects and the order corresponding to a "blind" forward selection (when always picking the best candidate) is caused by the correlations between the explanatory variables. If these variables were completely linearly independent, the two orders would be identical.
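The difference between marginal and conditional effects can be illustrated with a small numeric sketch. The data and names below are invented for the example, and the fit is a simple linear one rather than a full constrained ordination: the second predictor is strongly correlated with the first, so it explains a lot on its own but very little once the first predictor is already in the model.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def residual_on(v, x):
    """Part of centered v not linearly predictable from centered x."""
    slope = dot(v, x) / dot(x, x)
    return [a - slope * b for a, b in zip(v, x)]

def explained_ss(species, x):
    """Sum of squares in the species columns fitted by a single predictor x."""
    return sum(dot(y, x) ** 2 / dot(x, x) for y in species)

# Hypothetical predictors: x2 is strongly correlated with x1.
x1 = center([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = center([1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5, 7.5])
# Hypothetical species columns, both driven mainly by x1.
species = [
    center([1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]),
    center([8.0, 7.1, 5.9, 5.2, 3.9, 3.1, 1.8, 1.2]),
]
total_ss = sum(dot(y, y) for y in species)

# Marginal effect: x2 used alone as the explanatory variable.
marginal_x2 = explained_ss(species, x2) / total_ss

# Conditional effect: x2 tested after x1 has been selected (x1 as covariable).
species_given_x1 = [residual_on(y, x1) for y in species]
x2_given_x1 = residual_on(x2, x1)
conditional_x2 = explained_ss(species_given_x1, x2_given_x1) / total_ss

print(f"marginal effect of x2:    {marginal_x2:.3f}")
print(f"conditional effect of x2: {conditional_x2:.3f}")
```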

If the primary purpose of the forward selection is to find a sufficient subset of the explanatory variables that represents the relation between the collected species and environmental data, then we have a problem with the "global" significance level related to the whole selected subset, if treated as a single entity. If we proceed by selecting the environmental variables as long as the best candidate has a Type I error estimate (P) lower than some preselected significance level α, then the "collective" Type I error probability is in fact higher than this level. We do not know how large this Type I error probability is, but we know that its upper limit is Nc × α, where Nc is the maximum number of tests (comparisons) made during the selection process. The appropriate adjustment of the significance threshold on each partial test (selecting only variables with an achieved Type I error probability estimate less than α / Nc) is called the Bonferroni correction. Here the value of Nc represents the maximum possible number of steps during the forward selection (i.e. the number of independent environmental variables). Use of the Bonferroni correction is a controversial issue.
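The stopping rule with and without the correction can be sketched as follows. The function name, the variable labels A to D, and the P values are hypothetical and serve only to show how the threshold α / Nc changes which variables are retained.

```python
def forward_selection_stop(steps, alpha, n_candidates, bonferroni=True):
    """Return the variables accepted before the first non-significant step.

    steps        -- (name, P) pairs in the order the best candidates were tested
    alpha        -- preselected significance level
    n_candidates -- Nc, the maximum possible number of selection steps
    """
    threshold = alpha / n_candidates if bonferroni else alpha
    selected = []
    for name, p in steps:
        if p >= threshold:  # stop as soon as the best candidate fails
            break
        selected.append(name)
    return selected, threshold

# Hypothetical P values from successive partial permutation tests.
steps = [("A", 0.002), ("B", 0.004), ("C", 0.020), ("D", 0.300)]
alpha, n_candidates = 0.05, 10

plain, plain_threshold = forward_selection_stop(steps, alpha, n_candidates,
                                                bonferroni=False)
corrected, corrected_threshold = forward_selection_stop(steps, alpha, n_candidates)

# Upper limit of the collective Type I error without correction: Nc * alpha.
upper_limit = n_candidates * alpha

print("uncorrected:", plain, "threshold", plain_threshold)
print("Bonferroni: ", corrected, "threshold", corrected_threshold)
print("upper limit of collective Type I error:", upper_limit)
```

In this made-up example the uncorrected rule keeps three variables, while the corrected rule keeps only two. This reflects the conservativeness of the adjustment.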

Another difficulty we might encounter during the forward selection of environmental variables occurs if we have one or more factors coded as dummy variables and used as environmental variables. The forward selection procedure treats each dummy variable as an independent predictor, so we cannot evaluate the contribution of the whole factor at once. This is primarily because the whole factor contributes more than one degree of freedom to the constrained ordination model (similarly, a factor with K levels contributes K - 1 degrees of freedom in a regression model). In the constrained ordination, K - 1 canonical axes are needed to represent the contribution of such a factor. On the other hand, the independent treatment of the factor levels provides an opportunity to evaluate the extent of differences between the individual classes of samples defined by such a factor. This is partly analogous to the multiple comparisons procedure in the analysis of variance.
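The K - 1 degrees of freedom can be seen directly from the dummy coding. In the sketch below (the factor "soil" and its levels are hypothetical), every sample has exactly one dummy variable equal to 1, so any one of the K columns is determined by the other K - 1.

```python
def dummy_code(factor):
    """Code a factor as one 0/1 column per level."""
    levels = sorted(set(factor))
    return {level: [1 if value == level else 0 for value in factor]
            for level in levels}

# Hypothetical factor with K = 3 levels recorded for six samples.
soil = ["sand", "clay", "loam", "clay", "sand", "loam"]
dummies = dummy_code(soil)

# Each sample belongs to exactly one level, so the K columns sum to 1 per row.
row_sums = [sum(col[i] for col in dummies.values()) for i in range(len(soil))]

k = len(dummies)
independent_columns = k - 1  # the last column is 1 minus the sum of the others

for level, column in dummies.items():
    print(level, column)
print("row sums:", row_sums, "independent columns:", independent_columns)
```

During forward selection each of these columns would be offered as a separate candidate, which is why the factor as a whole cannot be tested in a single step.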

In document Multivariate Analysis of Ecological Data. Jan Lepš & Petr Šmilauer (Page 53-55)