25.4.4 Regression Coefficients
Multiple regression analysis is used extensively to disentangle and measure the effects of different X variables on a single Y variable. Nevertheless, the procedure has several important limitations, especially in observational studies. In any study, there will be X variables related to Y that have not been included. These may be variables thought to be unimportant, too difficult to measure, or unknown to the investigator. Hence, the regression coefficient of a variable, for example, b1, is not an unbiased estimate of β1 but of β1 in combination with possible effects not measured. It is therefore advisable, at least initially, to include in the study all X variables that are likely to affect Y or to study a population in which variables not of direct interest can be controlled. Introducing more variables into an analysis, however, adds to the data collection effort, may contribute only noise to the prediction, and may reduce the sensitivity ("power") of the analysis. Hence, introducing new variables costs power unless each variable individually can explain important amounts of the variance. Deciding which variables to include in a study is usually a compromise between achieving good predictive power and excluding irrelevant variables.
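The effect of an unmeasured variable on b1 can be illustrated with a minimal sketch. The data below are hypothetical and noise-free, and the names x1, x2, and slope are assumptions made for illustration only. Y is built from two correlated X variables, but only x1 is entered into the regression, so the fitted slope absorbs part of the effect of x2.

```python
# Hypothetical, noise-free data: y depends on x1 (beta1 = 2) and x2 (beta2 = 3).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]          # correlated with x1
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

def slope(x, y):
    """Least-squares slope of y on a single x variable."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

b1_alone = slope(x1, y)      # x2 omitted from the regression
delta = slope(x1, x2)        # slope of the omitted x2 on x1

# With x2 omitted, b1 estimates beta1 + beta2 * delta rather than beta1 = 2.
print(round(b1_alone, 3), round(2.0 + 3.0 * delta, 3))
```

In this constructed example the slope on x1 alone is more than twice the true coefficient of 2, because x2 rises with x1 and its effect is credited to x1.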
25.4.5 Interpretation
The multiple regression analysis of the lichen growth data in relation to the eight climatic variables was carried out using STATISTICA software. The value of R² for the present data set was 0.85, that is, the regression accounts for about 85% of the variance, which indicates that it is worthwhile to proceed with the analysis. The ANOVA of the multiple regression is shown in Table 25.2. The value of F (F = 5.69) is significant at the 5% level of probability (P < 0.05) and, therefore, at least some of the regression coefficients are not zero.
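As a rough check, the F ratio of a multiple regression can be recovered from R² alone. The sketch below assumes n = 17 observations (the 17 three-month periods of the lichen study) and k = 8 climatic variables, giving 8 and 17 - 8 - 1 = 8 degrees of freedom. The function name f_from_r2 is hypothetical.

```python
def f_from_r2(r2, n, k):
    """F ratio of a multiple regression computed from R^2.

    n is the number of observations and k the number of X variables, so the
    regression has k degrees of freedom and the residual has n - k - 1.
    """
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Assumed inputs: rounded R^2 = 0.85, n = 17 periods, k = 8 climatic variables.
f_value = f_from_r2(0.85, 17, 8)
print(round(f_value, 2))
```

The rounded R² of 0.85 gives F of about 5.67, close to the reported 5.69. The small difference is presumably due to rounding of R².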
Estimates of the various regression coefficients are shown in Table 25.3. This analysis suggests that the regression coefficients for mean maximum temperature (t = 2.43, P < 0.05) and total sunshine hours (t = −2.56, P < 0.05) are the only climatic variables out of the eight to have had a significant effect on growth. Hence, lichen growth may be positively related to maximum temperature but negatively related to total sunshine hours.
The likely explanation is that although warm temperatures may promote growth processes in R. geographicum, prolonged periods of hot dry weather actually inhibit growth, presumably because of the drying out of the thalli.
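The two reported t values can be compared with the tabulated critical value as a minimal sketch. It assumes that each coefficient is tested on n - k - 1 = 17 - 8 - 1 = 8 residual degrees of freedom, for which the two-tailed 5% critical value of t in standard tables is 2.306. The variable names are hypothetical.

```python
# Two-tailed 5% critical value of t for 8 degrees of freedom (standard tables).
# The choice of 8 df assumes 17 observations and 8 X variables plus an intercept.
T_CRIT_5PCT_DF8 = 2.306

# t values as reported in the text for the two significant coefficients.
reported_t = {
    "mean maximum temperature": 2.43,
    "total sunshine hours": -2.56,
}

for name, t in reported_t.items():
    significant = abs(t) > T_CRIT_5PCT_DF8
    direction = "positive" if t > 0 else "negative"
    print(f"{name}: t = {t}, {direction} relationship, P < 0.05: {significant}")
```

The sign of t gives the direction of the relationship and its absolute size is what is compared with the critical value.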
25.5 CONCLUSION
Multiple linear regression determines the linear relationship between one dependent variable (Y) and multiple independent variables (X1, X2, X3, etc.) and has many potential uses. An investigator should always have a clear hypothesis in mind before carrying out such a procedure and knowledge of the limitations of the analysis. In addition, multiple regression is probably best used in an exploratory context, identifying variables that might profitably be examined in more detailed studies. Where there are many variables
TABLE 25.2 Analysis of Variance (ANOVA) of Multiple Regression Data in Table 25.1
potentially influencing Y, they are likely to be intercorrelated and to account for relatively small amounts of the variance. Any analysis in which R² is less than 50% should be treated as suspect, since it probably does not indicate the presence of significant variables. A further problem relates to sample size. It is often stated that the number of samples taken must be at least 5 to 10 times the number of variables included in the study (Norman & Streiner, 1994) (see Appendix 4). This advice should be taken only as a rough guide, but it does indicate that the variables included should be selected with great care, as inclusion of an obviously unimportant variable may have a significant impact on the sample size required.
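The sample size guideline amounts to simple arithmetic, sketched below. The function name recommended_sample_range is hypothetical, and the multipliers 5 and 10 are the bounds of the rough guide quoted above.

```python
def recommended_sample_range(n_variables, low=5, high=10):
    """Rough sample-size range implied by the '5 to 10 times' rule of thumb."""
    return n_variables * low, n_variables * high

# Eight X variables, as in the lichen growth example.
minimum, maximum = recommended_sample_range(8)
print(minimum, maximum)

# Each additional variable raises the suggested range by 5 to 10 samples.
extra_min, extra_max = recommended_sample_range(9)
print(extra_min - minimum, extra_max - maximum)
```

For eight variables the guide suggests 40 to 80 samples, considerably more than the 17 three-month periods available in the lichen example, which is one reason to treat that analysis as exploratory.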
Statnote 26
STEPWISE MULTIPLE REGRESSION
Uses of stepwise multiple regression.
Selection of X variables for prediction.
The step-up (forward) method.
The step-down (backward) method.
26.1 INTRODUCTION
In Statnote 25, multiple linear regression was introduced as a method of studying the relationship between a dependent variable (Y) and two or more independent (X) variables.
A major objective of such an analysis is often to identify the most important X variables influencing Y and to rank them in order of significance. There is usually no unique or satisfactory solution to this problem. One method would be to use the magnitude of the standard partial regression coefficients, usually calculated routinely in multiple regression, as a measure of the relative importance. Any ranking of the X variables, however, may be affected by correlations between the variables themselves. Multiple regression analysis assumes that the X variables are relatively independent of each other, a situation rare in practice. In addition, the contribution of a specific X variable to the total variation in Y is frequently greater when that variable is considered alone than when it is included with other variables in a multiple regression equation. For example, three different length measurements (X1, X2, X3) are likely to be strongly intercorrelated and each may correlate significantly with weight (Y). If one of the X variables is entered into a regression
Statistical Analysis in Microbiology: Statnotes, Edited by Richard A. Armstrong and Anthony C. Hilton Copyright © 2010 John Wiley & Sons, Inc.
equation, however, addition of the other two is not likely to improve the predictive power of the regression very much.
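The length and weight example can be made concrete with the standard formula for R² when two X variables are fitted, expressed in terms of the three pairwise correlations. The correlation values below are hypothetical, chosen only to represent two strongly intercorrelated length measurements, and the function name is an assumption.

```python
def r2_two_predictors(r_y1, r_y2, r_12):
    """R^2 for Y regressed on X1 and X2, from the pairwise correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12 ** 2)

# Hypothetical correlations: each length correlates strongly with weight (Y),
# and the two lengths correlate even more strongly with each other.
r_y1, r_y2, r_12 = 0.90, 0.88, 0.95

r2_first_only = r_y1 ** 2                        # X1 entered alone
r2_both = r2_two_predictors(r_y1, r_y2, r_12)    # X1 and X2 together
gain = r2_both - r2_first_only

print(round(r2_first_only, 3), round(r2_both, 3), round(gain, 3))
```

With these assumed values, X2 alone would explain about 77% of the variance in Y, yet adding it to a regression that already contains X1 raises R² by less than 0.01.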
Another problem is that if a large number of X variables are included in the regression, the regression coefficients will change with each grouping of the variables. In addition, if the squared multiple correlation coefficient (R²) (see Statnote 25) is small, most of the variation in Y will remain unexplained and may be attributable to random error or to variables not included in the study. Inclusion of additional variables will also change the relationships between the existing X variables and their regression coefficients. An investigator may wish to select a small subset of the X variables that gives the best prediction of the Y variable. In this case, the question is how many variables the regression equation should include. One method would be to calculate the regression of Y on every subset of the X variables and choose the subset that gives the smallest mean square deviation from the regression. Most investigators, however, prefer to use a stepwise multiple regression procedure. There are two forms of this analysis, called the step-up (or forward) method and the step-down (or backward) method. This statnote illustrates the use of stepwise multiple regression with reference to the scenario introduced in Statnote 25, namely the influence of climatic variables on the growth of the crustose lichen Rhizocarpon geographicum (L.) DC.
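The all-subsets approach mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used by any particular statistics package: the data are hypothetical, the helper residual_mean_square is an assumed name, and the regression is fitted by solving the normal equations directly. With k variables there are 2^k - 1 non-empty subsets, that is, 255 for eight variables, which is one reason stepwise procedures are usually preferred.

```python
from itertools import combinations

def residual_mean_square(columns, y):
    """Residual mean square of y regressed on the given columns plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])  # number of fitted parameters, intercept included
    # Augmented normal equations (X'X | X'y).
    a = [[sum(row[j] * row[k] for row in rows) for k in range(p)]
         + [sum(row[j] * yi for row, yi in zip(rows, y))]
         for j in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [v - factor * w for v, w in zip(a[r], a[c])]
    coefs = [a[j][p] / a[j][j] for j in range(p)]
    rss = sum((yi - sum(b * x for b, x in zip(coefs, row))) ** 2
              for row, yi in zip(rows, y))
    return rss / (n - p)

# Hypothetical data: y is driven almost entirely by the first X variable.
xs = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    [5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0],
    [2.0, 7.0, 1.0, 8.0, 3.0, 9.0, 4.0, 6.0],
]
y = [5.1, 7.9, 11.1, 13.9, 16.9, 20.1, 22.9, 26.1]

# Fit every non-empty subset and keep its residual mean square.
results = {}
for size in range(1, len(xs) + 1):
    for subset in combinations(range(len(xs)), size):
        results[subset] = residual_mean_square([xs[i] for i in subset], y)

best_subset = min(results, key=results.get)
print(len(results), best_subset)
```

Because the residual mean square divides by the residual degrees of freedom, adding a variable that explains little can increase it, so the criterion does not automatically favour the largest subset.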
26.2 SCENARIO
We return to the scenario described in Statnote 25. The radial growth rate (RGR) of thalli of the crustose lichen R. geographicum was measured in 17 successive 3-month periods over 51 months in North Wales. The radial growth of R. geographicum (Armstrong & Smith, 1987) was measured at between 8 and 10 randomly chosen locations around each thallus at 3-month intervals from April 1993 to June 1997 using the method described by Armstrong (1973). Essentially, the advance of the hypothallus is measured with a micrometer scale in relation to fixed markers on the substratum. Radial growth in each period was averaged for each thallus and then over the 20 thalli to examine the pattern of seasonal growth. Climatic data included records of: (1) total rainfall over each 3-month period, (2) the total number of rain days, (3) maximum (Tmax) and minimum (Tmin) temperature recorded on each day and averaged for each 3-month period, (4) the total number of air and ground frosts, (5) the total number of sunshine hours, and (6) average daily wind speed.
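The two-stage averaging described above (within each thallus, then across thalli) can be sketched as follows. The measurements are hypothetical and far fewer than in the study, and the function name period_growth is an assumption.

```python
from statistics import mean

def period_growth(measurements_by_thallus):
    """Mean radial growth for one period: average within each thallus first,
    then average the thallus means."""
    thallus_means = [mean(values) for values in measurements_by_thallus]
    return mean(thallus_means)

# Hypothetical measurements (mm) for two thalli in a single 3-month period.
example = [
    [0.10, 0.20, 0.30],   # thallus 1, three measured locations
    [0.40, 0.60],         # thallus 2, two measured locations
]

two_stage = period_growth(example)
pooled = mean(v for values in example for v in values)
print(round(two_stage, 3), round(pooled, 3))
```

Averaging within each thallus first gives every thallus equal weight, so the result can differ from a pooled mean when the number of measured locations varies between thalli, as it does here.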
26.3 DATA
The data comprise, for each 3-month period, a single dependent (Y) variable, namely radial growth of the lichen, and eight possible defining climatic (X) variables, and are presented in Table 25.1 of Statnote 25.
26.4 ANALYSIS BY THE STEP-UP METHOD