Variable Centring - Introduction to Epilepsy

3.3 Variable Centring

Centring involves shifting the scale of a variable by subtracting a single value from all of the data points. It is called centring because people often use the mean as the value they subtract, so the new mean is now at zero, but it does not have to be the mean. In fact, there are many situations when a value other than the mean is most meaningful.
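As a minimal sketch of this definition, the following Python fragment centres a variable either on its mean or on another chosen value. The function name `centre`, its `at` parameter and the example ages are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def centre(values, at=None):
    """Shift a variable by subtracting one constant from every data point."""
    values = np.asarray(values, dtype=float)
    # Default to the sample mean; `at` may be any other meaningful value.
    reference = values.mean() if at is None else at
    return values - reference

# Illustrative data only (hypothetical ages in years).
ages = np.array([20.0, 30.0, 40.0, 50.0])

mean_centred = centre(ages)            # the new mean is zero
centred_at_18 = centre(ages, at=18.0)  # zero now corresponds to age 18
```

Centring on a value such as 18 rather than the mean may be preferable when that value is the natural reference point for interpretation.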

There are mixed opinions on the value of centring. Cronbach [152] suggests, ‘in regression analysis, always centre’, stating reasons such as increased relevance of the estimated regression coefficients and diminished multicollinearity. If centring is done unnecessarily, the cost is minor. Aiken and West [153] and Cohen et al [154] have described centring and the consequences of non-centred data, while Glantz and Slinker [155] and Kromrey and Foster-Johnson [156] take the stand that centring does not usually change the statistical results, is necessary only in certain circumstances and can thus easily be avoided.

All variables analysed in the thesis are centred. This is of particular relevance to the work in Chapters 4, 5, 6 and 10.

3.3.1 Multicollinearity

There are two sources of correlation between a predictor and an even power of the predictor, say between $X$ and $X^2$ [154]. The first is non-essential multicollinearity that exists merely due to the scaling, that is the non-zero mean, of $X$. The second is essential multicollinearity, correlation that exists because of any non-symmetry in the distribution of the original $X$ variable. Marquardt [157] refers to the problems of multicollinearity produced by non-centred variables as non-essential ill-conditioning, whereas those that exist because of actual relationships between variables in the population are referred to as essential ill-conditioning.

Problems with multicollinearity in least squares regression are well documented, particularly with multiple regression models containing both main effects and interaction terms [154]. In general, for two factors $A$ and $B$, if the effect of variable $A$ on the outcome varies according to the level of variable $B$, there is said to be an interaction between $A$ and $B$.

Although the least squares estimates of the regression coefficients remain unbiased, as multicollinearity increases, the determinants of the covariance and correlation matrices of the independent variables approach zero and the standard errors of the coefficients increase. The resulting ill-conditioning yields coefficients, and an associated variance-covariance matrix, that are unstable. Small changes due to measurement or rounding error may be magnified, resulting in large changes in the coefficients and the associated variance-covariance matrix. In addition, when multicollinearity is present, slight sampling fluctuations in the estimates of the covariances can result in great variability in the values and signs of the least squares estimates of the coefficients. Finally, because the expected distance between the vector of least squares coefficients and the vector of true regression coefficients increases, estimates with excessively large values or unreasonable signs may result when extreme collinearity is present [156].
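The behaviour of the determinant can be illustrated numerically. The sketch below uses simulated data (the seed, sample size and mixing weights are arbitrary assumptions) to build a second predictor that is increasingly close to a copy of the first, and records the determinant of the 2 x 2 correlation matrix at each step.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed, for reproducibility only
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)

def correlation_determinant(weight):
    """Determinant of the correlation matrix of x1 and a partial copy of x1."""
    # weight = 0 gives an unrelated predictor; weight near 1 gives a near copy.
    x2 = weight * x1 + (1.0 - weight) * noise
    corr_matrix = np.corrcoef(x1, x2)
    return np.linalg.det(corr_matrix)

weights = (0.0, 0.5, 0.9, 0.99)
dets = [correlation_determinant(w) for w in weights]
```

For two predictors with correlation r, this determinant equals 1 - r^2, and its reciprocal is the factor by which the sampling variance of each slope estimate is inflated, so a determinant near zero corresponds directly to large standard errors.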

The problems of collinearity may be overcome in several ways. In some situations the collinearity will have arisen purely as a computational problem and may be solved by alternative definitions of some of the variables. For example, if both $x$ and $x^2$ are included as explanatory variables and all the values of $x$ are positive, then $x$ and $x^2$ are likely to be highly correlated. This can be overcome by redefining the quadratic term as $(x - \bar{x})^2$, which will reduce the correlation whilst leading to an equivalent regression [28]. If the multicollinearity is structural, it can often be dealt with by centring the measured independent variables on their mean values before computing the power (e.g. squared) and interaction (cross-product) terms specified by the regression equation.
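A small simulation may make the effect concrete. The sketch below assumes an all-positive, roughly symmetric predictor (uniform on 10 to 20, an arbitrary choice) and a made-up quadratic outcome; it compares the correlation between the linear and quadratic terms before and after redefining the quadratic term around the mean, and checks that the two parameterisations give the same fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
x = rng.uniform(10.0, 20.0, size=500)          # all values positive
y = 1.0 + 0.5 * x + 0.1 * x**2 + rng.normal(0.0, 1.0, size=x.size)

xc = x - x.mean()                              # mean-centred predictor

# Correlation between the linear term and each version of the quadratic term.
raw_corr = np.corrcoef(x, x**2)[0, 1]
centred_corr = np.corrcoef(x, xc**2)[0, 1]

def fitted_values(design):
    """Ordinary least squares fitted values for a given design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

ones = np.ones_like(x)
fit_raw = fitted_values(np.column_stack([ones, x, x**2]))
fit_centred = fitted_values(np.column_stack([ones, x, xc**2]))
same_fit = np.allclose(fit_raw, fit_centred)   # equivalent regressions
```

Because the centred quadratic term is a linear combination of the constant, the linear term and the raw quadratic term, both designs span the same space, which is why the fitted values agree even though the correlation between the terms differs sharply.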

3.3.2 Interpretation

Lower order coefficients in higher order regression equations, that is, regression equations containing terms of order higher than unity, only have a meaningful interpretation if the variable has a meaningful zero. For example, if some behaviour $Y$ were predicted from a measure of motivation, $X$, and a seven point attitude scale, $Z$, ranging from one to seven, the regression coefficient for $Y$ on $X$ would be the slope of $Y$ on $X$ at the value $Z = 0$, a value not even defined on the scale. Similarly, if strength of athletes were predicted from their height and weight, the regression coefficient predicting strength from height would represent the regression of strength on height for athletes weighing 0 pounds.

There is a simple solution to making the value zero meaningful on any quantitative scale: centre the linear predictor. Thus the regression of $Y$ on $X$ at $Z = 0$ becomes meaningful; it is the linear regression of $Y$ on $X$ at the mean of the variable $Z$. To gain the benefits of interpretation of lower order terms, it is unnecessary to centre the criterion $Y$. This can be left in raw score form so that predicted scores will be in the metric of the observed criterion [154].
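The argument can be written out as a short derivation. It assumes, for illustration, a regression of a criterion $Y$ on two predictors $X$ and $Z$ with a single product term; the coefficient symbols $b_0, \dots, b_3$ and the centred variable $Z_c$ are notation introduced here rather than taken from the source.

```latex
% Assumed model with one interaction term
\hat{Y} = b_0 + b_1 X + b_2 Z + b_3 X Z
% The slope of Y on X depends on Z, so b_1 is the slope only at Z = 0
\frac{\partial \hat{Y}}{\partial X} = b_1 + b_3 Z
% Centre Z on its mean: Z_c = Z - \bar{Z}, i.e. Z = Z_c + \bar{Z}
\hat{Y} = (b_0 + b_2 \bar{Z}) + (b_1 + b_3 \bar{Z})\, X + b_2 Z_c + b_3 X Z_c
% The new coefficient of X is the slope of Y on X at Z = \bar{Z} (Z_c = 0)
b_1' = b_1 + b_3 \bar{Z}
```

The coefficient $b_3$ of the highest order term is unchanged by the substitution, and $Y$ itself is never transformed, so predictions remain in the original metric.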

3.3.3 Discussion

Cohen et al [154] strongly recommend the use and reporting of centred polynomial equations. They suggest that doing so has no effect on the estimate of the highest order interaction in the regression equation and also yields two straightforward, meaningful interpretations of each first-order regression coefficient of predictors entered into the regression equation: firstly, the effect of the individual predictor at the mean of the sample and, secondly, the average effect of each individual predictor across the range of the other variables. Aiken and West [153] also recommend centring, this time for computational reasons, because the centred overall regression analysis provides regression coefficients for primary terms that may be informative.

The main disadvantage of centring, however, is that the variables are no longer the natural variables of the problem. If a predictor has a meaningful zero point, then one may wish to keep the predictor in non-centred form. Centring also produces a puzzling effect. When predictors are centred and entered into regression equations containing interactions, the regression coefficients for the first order effects differ numerically from those obtained by performing a regression analysis on the same data in raw score, or non-centred, form. The regression coefficients do not change when predictors are centred in regression equations containing no interactions [154]. Differences between the non-centred equation and the centred one are absorbed into the intercept [155]; therefore, according to Glantz and Slinker [155], centring will only be beneficial if an intercept term is included in the model.
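This effect can be checked on simulated data. The sketch below is illustrative only: the data-generating values, the seed and the helper `ols` are assumptions made for the example. It fits the same outcome with and without a product term, on raw and on mean-centred predictors.

```python
import numpy as np

rng = np.random.default_rng(42)        # arbitrary seed
n = 300
x = rng.normal(5.0, 1.0, n)            # predictors with non-zero means
z = rng.normal(3.0, 1.0, n)
y = 2.0 + 1.5 * x - 0.8 * z + 0.6 * x * z + rng.normal(0.0, 1.0, n)

def ols(*columns):
    """Least squares coefficients, intercept first, for the given predictors."""
    design = np.column_stack([np.ones(n), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

xc, zc = x - x.mean(), z - z.mean()

# Models containing the interaction (product) term.
raw_int = ols(x, z, x * z)
cen_int = ols(xc, zc, xc * zc)

# Models with no interaction term.
raw_add = ols(x, z)
cen_add = ols(xc, zc)

# First order coefficient of x after centring, predicted from the raw fit.
predicted_x_slope = raw_int[1] + raw_int[3] * z.mean()
```

In this setting the product-term coefficient is identical in `raw_int` and `cen_int`, the first order coefficients differ, and the centred coefficient of `x` equals `predicted_x_slope`; in the additive models only the intercept changes.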

Studies performed by Kromrey and Foster-Johnson [156] showed that regression equations obtained with centred and raw data were equivalent, that results of hypothesis testing with either type of data were exactly the same, and that neither approach provided a viable vehicle for the interpretation of main effects in regression. They therefore suggest ‘one might just as well not bother.’ There is, though, very little cost to unnecessary centring, whereas the costs of not centring when it is necessary can be major [158], as using non-centred data in regression analysis often leads to inconsistent and misleading results. Thus it would always be better to centre in regression analyses.

It can be argued that not centring represents a de facto decision that all ordinal variables be centred at zero, that all binary and categorical independent variables be coded, somewhat arbitrarily, 1 and 0, and that one category, also often arbitrarily chosen, be used as the reference category. This can lead to serious errors of statistical inference [158].

Kraemer and Blasey [158] recommend the following default approach to protect against most errors in statistical inference. Each binary independent variable should be coded $+\tfrac{1}{2}$ and $-\tfrac{1}{2}$, while each ordinal independent variable should be centred at the median response. Categorical independent variables should be ‘dummy coded’ as usual, but instead of coding each response as 1 and 0, the values $1 - 1/m$ and $-1/m$ should be used, where $m$ is the number of categories. As in the usual situation, one categorical ‘dummy’ is omitted, but with the proposed centring it does not matter which one. For example, for a three level variable, the traditional dummy coding may be as in the left hand side of Table 4 while the recommended coding with centring is as per the right hand side.

Table 4: Alternatives for dummy coding of a three level categorical variable

            Traditional Coding      Coding with Centring
            Dummy 1   Dummy 2       Dummy 1   Dummy 2
Baseline       0         0           -1/3      -1/3
Level 1        1         0            2/3      -1/3
Level 2        0         1           -1/3       2/3
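A centred dummy coding of this kind can be sketched in a few lines of Python. The sketch assumes that the centred scheme replaces the usual 1 and 0 with 1 - 1/m and -1/m, where m is the number of categories; the function name `centred_dummies` and its arguments are hypothetical and introduced only for illustration.

```python
import numpy as np

def centred_dummies(labels, categories, omit):
    """Dummy-code `labels`, using 1 - 1/m and -1/m in place of 1 and 0."""
    m = len(categories)
    kept = [c for c in categories if c != omit]   # one dummy is omitted
    labels = np.asarray(labels)
    # Traditional 1/0 indicator column for each retained category.
    indicators = np.column_stack(
        [(labels == c).astype(float) for c in kept]
    )
    # Shifting by 1/m maps 1 to 1 - 1/m and 0 to -1/m.
    return indicators - 1.0 / m

levels = ["Baseline", "Level 1", "Level 2"]
coded = centred_dummies(levels, categories=levels, omit="Baseline")
# Rows follow `levels`; columns are the dummies for "Level 1" and "Level 2".
```

With three categories each dummy takes the values 2/3 and -1/3, so a category that would be all zeros under traditional coding is instead coded -1/3 on every dummy.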

Requiring that centring always be done merely asks that what is done implicitly anyway be done explicitly and thoughtfully, which promotes better application and understanding of the results of regression analysis.

In document Prognostic factors for epilepsy (Page 75-79)