CHAPTER CONCEPTS
Explanation versus prediction
Standardized partial regression coefficients
Coefficient of determination
Squared multiple correlation coefficient
Full versus restricted models
Measurement error
Additive versus relational model
Multiple regression models predict observed dependent variable values from knowledge of multiple independent observed variable values. Multiple regression, a general linear modeling approach to the analysis of data, has become increasingly popular since 1967 (Bashaw & Findley, 1968). In fact, it has become recognized as an approach that bridges the gap between correlation and analysis of variance in answering research hypotheses (McNeil, Kelly, & McNeil, 1975). Many statistical textbooks have elaborated the relationship between multiple regression and analysis of variance (Draper & Smith, 1966; Edwards, 1979; Hinkle, Wiersma, & Jurs, 2003). Graduate students who take an advanced statistics course are typically provided with the multiple linear regression framework for data analysis. Given knowledge of multiple regression techniques, an understanding can be extended to various multivariable statistical techniques (Newman, 1988).
This chapter shows how beta weights (standardized partial regression coefficients) are computed in multiple regression equations using structural equation modeling software. More specifically, we illustrate how the structural equation modeling approach can be used to compute parameter estimates in multiple regression and what types of output are reported. We begin with a brief overview of multiple regression concepts.
Regression Models
OVERVIEW
Multiple regression techniques require a basic understanding of sample statistics (sample size, mean, and variance), standardized variables, correlation, and partial correlation (Cohen & Cohen, 1983; Houston & Bolding, 1974; Pedhazur, 1982). In standard-score form (z scores), the simple linear regression equation for predicting the dependent variable Y from a single independent variable X is
$$\hat{z}_y = \beta z_x,$$
where β is the standardized regression coefficient. The basic rationale for using the standard-score formula is that variables are converted to the same scale of measurement, the z scale. Conversion back to the raw-score scale is easily accomplished by using the sample mean and the standard deviation.
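The standard-score regression and the conversion back to raw scores can be sketched in a few lines. This is a minimal illustration only: the five pairs of scores are hypothetical values invented for the example, and the variable names are not taken from the chapter.

```python
from statistics import mean, stdev

# Hypothetical raw scores, invented purely for illustration.
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.0, 3.0, 4.0, 4.0, 8.0]

def z_scores(values):
    """Convert raw scores to z scores using the sample mean and sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx, zy = z_scores(x), z_scores(y)
n = len(x)

# Standardized slope: least-squares regression of zy on zx.
# No intercept is needed because both z variables have a mean of 0.
beta = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

# Pearson correlation computed from the z scores.
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Raw-score slope and intercept recovered from beta, the SDs, and the means.
b = beta * stdev(y) / stdev(x)
a0 = mean(y) - b * mean(x)

# Predict Y for X = 6 in z-score form, then convert back to the raw scale.
z_pred = beta * (6.0 - mean(x)) / stdev(x)
y_pred = mean(y) + z_pred * stdev(y)

print(f"beta = {beta:.4f}, r = {r:.4f}, b = {b:.4f}, predicted Y = {y_pred:.4f}")
```

With a single predictor the standardized slope and the Pearson correlation coincide, and the prediction obtained through z scores matches the one from the raw-score equation.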
The relationship connecting the Pearson product-moment correlation coefficient $r_{xy}$, the unstandardized regression coefficient b, and the standardized regression coefficient β is

$$\beta = b\,\frac{s_x}{s_y} = r_{xy},$$

where $s_x$ and $s_y$ are the sample standard deviations for variables X and Y, respectively. For two independent variables, the multiple linear regression equation with standard scores is
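$$\hat{z}_y = \beta_1 z_1 + \beta_2 z_2,$$

and the standardized partial regression coefficients β1 and β2 are computed from

$$\beta_1 = \frac{r_{y1} - r_{y2}\,r_{12}}{1 - r_{12}^2} \quad \text{and} \quad \beta_2 = \frac{r_{y2} - r_{y1}\,r_{12}}{1 - r_{12}^2}.$$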
The correlation between the dependent observed variable Y and the predicted scores Ŷ is given the special name multiple correlation coefficient. It is written as
$$R_{y\hat{y}} = R_{y.12},$$

where the latter subscripts indicate that the dependent variable Y is being predicted by two independent variables, X1 and X2. The squared multiple correlation coefficient is computed as

$$R^2_{y\hat{y}} = R^2_{y.12} = \beta_1 r_{y1} + \beta_2 r_{y2}.$$
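As a rough numerical check of the two-predictor case, the sketch below applies the standard closed-form expressions for the standardized partial regression coefficients to correlations from Table 4.1 of this chapter, treating VAR1 and VAR3 as the two predictors of VAR4. The function name is hypothetical, and the choice of these two predictors is an assumption made only for illustration.

```python
# Correlations taken from Table 4.1 of this chapter (n = 24).
r_y1 = -0.7649   # VAR4 with VAR1 (first predictor)
r_y2 = -0.4530   # VAR4 with VAR3 (second predictor in this sketch)
r_12 = 0.1935    # VAR1 with VAR3

def two_predictor_betas(r_y1, r_y2, r_12):
    """Standardized partial regression coefficients for two predictors."""
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2

beta1, beta2 = two_predictor_betas(r_y1, r_y2, r_12)

# Squared multiple correlation: sum of beta times the matching correlation.
r_squared = beta1 * r_y1 + beta2 * r_y2

print(f"beta1 = {beta1:.3f}, beta2 = {beta2:.3f}, R2 = {r_squared:.3f}")
```

The resulting R² is close to the value that the chapter's SIMPLIS output reports for the two-predictor model; small differences in the third decimal are expected because the inputs are rounded correlations.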
A Beginner’s Guide to Structural Equation Modeling
The squared multiple correlation coefficient indicates the amount of variance explained, predicted, or accounted for in the dependent variable by the set of independent predictor variables. The R² value is also interpreted as an effect size or model-fit criterion in multiple regression analysis.
Kerlinger and Pedhazur (1973) indicated that multiple regression analysis can play an important role in prediction and explanation. Prediction and explanation reflect different research questions, study designs, inferential approaches, analysis strategies, and reported information. In prediction, the main emphasis is on practical application such that independent variables are chosen by their effectiveness in enhancing prediction of the dependent variable values. In explanation, the main emphasis is on the variability in the dependent variable explained by a theoretically meaningful set of independent variables. Huberty (2003) established a clear distinction between prediction and explanation when referring to multiple correlation analysis (MCA) and multiple regression analysis (MRA). In MCA, a parameter of interest is the correlation between the dependent variable Y and a composite of the independent variables Xp. The adjusted formula using sample size n and the number of independent predictors, p, is

$$R^2_{Adj} = 1 - \frac{n - 1}{n - p - 1}\,(1 - R^2).$$
In MRA, regression weights are also estimated to achieve a composite for the independent variables Xp, but the index of fit R2 is computed differently as
$$R^2_{Adj^*} = R^2 - \frac{2p}{n - p}\,(1 - R^2).$$
When comparing these two formulas, we see that R²Adj* has a larger adjustment. For example, given R² = .50, p = 10 predictor variables, and n = 100 subjects, these two different fit indices are R²Adj = .45 and R²Adj* = .39.
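The two adjusted indices can be compared numerically. The sketch below assumes two specific forms: the familiar Ezekiel-type adjustment for the MCA index and R² − [2p/(n − p)](1 − R²) for the MRA index. Both function names and both formula choices are assumptions made for illustration, not a definitive statement of Huberty's (2003) formulas.

```python
def adjusted_r2_mca(r2, n, p):
    """Assumed MCA adjustment: 1 - [(n - 1) / (n - p - 1)] * (1 - R2)."""
    return 1.0 - ((n - 1) / (n - p - 1)) * (1.0 - r2)

def adjusted_r2_mra(r2, n, p):
    """Assumed MRA adjustment: R2 - [2p / (n - p)] * (1 - R2)."""
    return r2 - (2.0 * p / (n - p)) * (1.0 - r2)

# Heuristic example from the text: R2 = .50, p = 10 predictors, n = 100 subjects.
r2, n, p = 0.50, 100, 10
mca = adjusted_r2_mca(r2, n, p)
mra = adjusted_r2_mra(r2, n, p)

print(f"MCA adjusted R2 = {mca:.3f}")
print(f"MRA adjusted R2 = {mra:.3f}")
```

Under these assumed forms the MRA index shrinks R² more than the MCA index does, and both results land near the .45 and .39 used in the chapter's effect-size example (hand calculations with rounded intermediate values can differ in the second decimal).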
Hypothesis testing would involve using the expected value or chance value of R² for testing the null hypothesis, which is p/(n − 1), not 0 as typically indicated. In our example, the expected or chance value for R² is 10/99 = .10, so the null hypothesis is H0: ρ² = .10. An F test used to test the statistical significance of the R² value is

$$F = \frac{R^2/p}{(1 - R^2)/(n - p - 1)} = \frac{.50/10}{.50/89} = 8.9,$$

which is statistically significant when compared to the tabled F = 1.93, df = 10; 89, p < .05 (Table A.5). In addition to the statistical significance test, a researcher should calculate effect sizes and confidence intervals to aid understanding and interpretation (Soper, 2010).
The effect size (ES) is computed as ES = R2 – [p/(n − 1)]. In our example, ES R2Adj = .45 − .10 = .35 and ES R2Adj* = .39 − .10 = .29. This indicates a moderate to large effect size according to Cohen (1988), who gave a general reference for effect sizes (small = .1, medium = .25, and large = .4).
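The F test, the chance value of R², and the effect size for the heuristic example can be reproduced with a short script. The function names are hypothetical; the F statistic uses the standard form with p and n − p − 1 degrees of freedom, which agrees with the df = 10; 89 quoted in the text.

```python
def f_for_r2(r2, n, p):
    """F statistic for testing R2, with df = (p, n - p - 1)."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

def chance_r2(n, p):
    """Expected (chance) value of R2 under the null hypothesis: p / (n - 1)."""
    return p / (n - 1)

def effect_size(r2_adjusted, n, p):
    """Effect size as defined in the text: ES = R2 - p / (n - 1)."""
    return r2_adjusted - chance_r2(n, p)

n, p = 100, 10
f_value = f_for_r2(0.50, n, p)      # df = (10, 89)
es_mca = effect_size(0.45, n, p)    # uses the chapter's adjusted value .45
es_mra = effect_size(0.39, n, p)    # uses the chapter's adjusted value .39

print(f"F = {f_value:.2f}, chance R2 = {chance_r2(n, p):.2f}")
print(f"ES (MCA) = {es_mca:.2f}, ES (MRA) = {es_mra:.2f}")
```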
Confidence intervals (CIs) around the R² value can also help our interpretation of multiple regression analysis. Steiger and Fouladi (1992) reported an R2 CI DOS program that computes confidence intervals, power, and sample size. Steiger and Fouladi (1997) and Cumming and Finch (2001) both discussed the importance of converting the central F value to an estimate of the non-central F before computing a confidence interval around R². Smithson (2001) wrote an R2 SPSS program to compute confidence intervals. Confidence intervals around R² values, however, have not been adopted by researchers and thus are not reported in published research.
After assessing our initial regression model fit, we might want to determine whether adding or deleting an independent variable would improve the index of fit R², but we avoid using stepwise regression methods (Huberty, 1989). We run a second multiple regression equation where a single independent variable is added or deleted to obtain a second R² value. We then compute a different F test to determine the statistical significance of the difference between the two regression models as follows:

$$F = \frac{(R^2_F - R^2_R)/(p_1 - p_2)}{(1 - R^2_F)/(n - p_1 - 1)},$$

where R²F is from the multiple regression equation with the full original set of independent variables p1 and R²R is from the multiple regression equation with the reduced set of independent variables p2. In our heuristic example, we drop a single independent variable and obtain R²R = .49 with p2 = 9 predictor variables. The F test is computed as

$$F = \frac{(.50 - .49)/(10 - 9)}{(1 - .50)/(100 - 10 - 1)} = \frac{.01}{.0056} = 1.78.$$
The F value is not significant at the .05 level, so the variable we dropped does not statistically add to the prediction of Y, which supports our dropping the single predictor variable; that is, a 1% decrease in R2 is not statistically significant. The nine-variable regression model therefore provides a more parsimonious model.
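The full-versus-restricted comparison can be written as a small helper. This is a sketch under the usual assumption that the test has p1 − p2 and n − p1 − 1 degrees of freedom; the function name `f_change` is hypothetical.

```python
def f_change(r2_full, r2_reduced, p_full, p_reduced, n):
    """F test for the difference in R2 between a full and a restricted model.

    Returns the F statistic together with its numerator and denominator df.
    """
    df_num = p_full - p_reduced
    df_den = n - p_full - 1
    f_value = ((r2_full - r2_reduced) / df_num) / ((1.0 - r2_full) / df_den)
    return f_value, df_num, df_den

# Heuristic example from the text: dropping one of ten predictors (n = 100)
# lowers R2 from .50 to .49.
f_value, df_num, df_den = f_change(0.50, 0.49, 10, 9, 100)
print(f"F = {f_value:.2f} with df = {df_num}; {df_den}")
```

The statistic falls well below the tabled .05 critical value for df = 1; 89, which is a little under 4, so the conclusion in the text follows.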
It is important to understand the basic concepts of multiple regression and correlation because they provide a better understanding of hypothesis testing, prediction, and explanation of a dependent variable. A review of multiple regression techniques also helps us to better understand path analysis and structural equation modeling in general. An SEM example is presented next to further clarify these basic multiple regression computations.
MULTIPLE REGRESSION EXAMPLE
The multiple linear regression analysis is conducted using data from Chatterjee and Yilmaz (1992). The data file contains scores from 24 patients on four variables (Var1 = patient’s age in years, Var2 = severity of illness, Var3 = level of anxiety, and Var4 = satisfaction level). Given raw data, two different approaches are possible: (a) read in the raw data file, or (b) compute a correlation or variance–covariance matrix for input into the software. We choose to compute and input a variance–covariance matrix into the software program. SEM software does not output all of the same information and related diagnostic results for multiple regression that you may be accustomed to viewing in SAS, SPSS, STATA, etc.
A regression equation is a theoretical model specified by the researcher. Therefore, the model specification involves finding relevant theory and prior research to formulate a theoretical regression model. The researcher is interested in specifying a regression model that should be confirmed with sample variance–covariance data, thus yielding a high R² value and statistically significant F value. Model specification directly involves deciding which variables to include or not to include in the theoretical regression model.
If the researcher does not select the right variables, then the regression model could be misspecified and lack validity (Tracz, Brown, & Kopriva, 1991). The problem is that a misspecified model may result in biased parameter estimates or estimates that are systematically different from what they are in the true population model. This bias is known as specification error.
The researcher’s goal is to determine whether the theoretical regression model fits the sample variance–covariance structure in the data, that is, whether the sample variance–covariance matrix implies some underlying theoretical regression model. The multiple regression model of theoretical interest in our example is to predict the satisfaction level of patients based on patient’s age, severity of illness, and level of anxiety (independent variables). This would be characteristic of an MCA model because a particular set of variables was selected based on theory. The dependent variable var4 is therefore predicted by the three independent variables (var1, var2, and var3). The diagram of the implied regression model is shown in Figure 4.1. The curved arrows indicate correlations between the observed independent variables. The lines pointing toward var4 indicate the direct paths for regression weights. The oval with error indicates the 1 – R² or unexplained variance not accounted for by the three independent predictor variables.
Once a theoretical regression model is specified, the next concern is model identification. Model identification refers to deciding whether a set of unique parameter estimates can be computed for the regression equation. Algebraically, every free parameter in the multiple regression equation can be estimated from the sample variance–covariance matrix (a free parameter is an unknown parameter that you want to estimate). The number of distinct values in the sample variance–covariance matrix equals the number of parameters to be estimated; thus, multiple regression models are always considered just-identified models because all parameters are estimated.
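The counting argument behind just-identification can be checked directly. The sketch below compares the number of distinct elements in a variance–covariance matrix with the number of free parameters in a regression model; the function names are hypothetical.

```python
def distinct_elements(n_observed):
    """Distinct variances and covariances among n observed variables."""
    return n_observed * (n_observed + 1) // 2

def regression_free_parameters(n_predictors):
    """Free parameters in a regression model with one dependent variable."""
    predictor_variances = n_predictors
    predictor_covariances = n_predictors * (n_predictors - 1) // 2
    regression_weights = n_predictors
    error_variance = 1
    return (predictor_variances + predictor_covariances
            + regression_weights + error_variance)

# Patient satisfaction example: three predictors plus one dependent variable.
n_predictors = 3
available = distinct_elements(n_predictors + 1)
estimated = regression_free_parameters(n_predictors)
degrees_of_freedom = available - estimated

print(f"distinct values = {available}, free parameters = {estimated}, "
      f"df = {degrees_of_freedom}")
```

The two counts agree for any number of predictors, which is why a regression model fitted this way always has zero degrees of freedom.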
Figure 4.1: Satisfaction regression model (var1, var2, and var3 predicting var4, with an error term)

SEM computer output will always indicate that regression analyses are saturated models; that is, χ² = 0 and degrees of freedom = 0. The regression model includes three independent variable variances, three covariance terms, three regression weights for the independent variables, and one error term, so all parameters in the regression equation are being estimated. Traditional software (SAS, SPSS, etc.) reports the R² and F values, whereas SEM software reports a chi-square value, because SEM software is testing the difference between the original sample variance–covariance matrix and the model-implied variance–covariance matrix given the regression equation.
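Why the chi-square is zero can be seen by rebuilding the model-implied matrix. The sketch below uses the sample variance–covariance matrix and the unstandardized estimates that appear later in this chapter's SIMPLIS program and output; because those printed estimates are rounded to three decimals, the implied matrix matches the sample matrix only up to rounding. The variable names are hypothetical.

```python
# Sample variance-covariance matrix for VAR1, VAR2, VAR3, VAR4 (n = 24),
# as given in this chapter's SIMPLIS program.
sample = [
    [91.384, 30.641, 0.584, -122.616],
    [30.641, 27.288, 0.641, -52.576],
    [0.584, 0.641, 0.100, -2.399],
    [-122.616, -52.576, -2.399, 281.210],
]

# Unstandardized estimates reported in the chapter's output (3 decimals).
weights = [-1.153, -0.267, -15.546]
error_variance = 88.515

k = len(weights)
s_xx = [row[:k] for row in sample[:k]]

# Implied covariances between each predictor and VAR4: S_xx times b.
implied_xy = [sum(s_xx[i][j] * weights[j] for j in range(k)) for i in range(k)]

# Implied variance of VAR4: b' S_xx b plus the error variance.
implied_yy = sum(w * c for w, c in zip(weights, implied_xy)) + error_variance

# The predictor block is reproduced exactly, because the predictor variances
# and covariances are themselves free parameters of the model.
implied = [s_xx[i] + [implied_xy[i]] for i in range(k)]
implied.append(implied_xy + [implied_yy])

max_gap = max(abs(implied[i][j] - sample[i][j])
              for i in range(k + 1) for j in range(k + 1))
print(f"largest |implied - sample| element: {max_gap:.3f}")
```

The largest discrepancy is tiny relative to the entries of the matrix, which is the numerical counterpart of a saturated model with perfect fit.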
The estimation of the regression weights is called model estimation, that is, computing the sample regression weights for the independent predictor variables. The term model estimation is used because there are several different estimation methods. The most common estimation method is ordinary least squares estimation (unweighted least squares; ULS), which selects the regression weights based on minimizing the sum of squared errors. However, there are other estimation methods used in statistics, especially SEM software, for example: maximum likelihood (ML) estimation, two-stage least squares (2SLS), weighted least squares (WLS), generalized least squares (GLS), diagonally weighted least squares (DWLS), and robust versions in Mplus (MLM, MLMV, WLSM, WLSMV). The different estimation methods were derived to be used under various data analysis situations, including non-normality, small sample sizes, outliers, etc., to afford a more robust estimation of parameter estimates. We will be discussing these other estimation methods when we present the different SEM modeling approaches later in the book.
The squared multiple correlation with three predictor variables predicting the dependent variable Y is
$$R^2_{y.123} = \beta_1 r_{y1} + \beta_2 r_{y2} + \beta_3 r_{y3}.$$
The correlation coefficients are multiplied by their respective standardized partial regression weights and summed to yield the squared multiple correlation coefficient R²y.123.
In LISREL9 we can write a SIMPLIS program to compute the regression weights in the regression model. The SIMPLIS program includes a title command, an observed variable command to specify variable names, sample size command, and covariance matrix command. The equation command specifies the regression equation with the dependent variable on the left-hand side of the equation. The number of decimals and path diagram commands are optional. The end of problem command ends the program.
The SIMPLIS program commands can be saved in a file (regression.spl) and run in the free student version of the software (www.ssicentral.com). The basic program setup is:
Regression Analysis Example (no intercept term)
Observed variables: VAR1 VAR2 VAR3 VAR4
Sample size: 24
Covariance matrix:
91.384
30.641 27.288
0.584 0.641 0.100
-122.616 -52.576 -2.399 281.210
Equation: VAR4 = VAR1 VAR2 VAR3
Number of decimals = 3
Path Diagram
End of Problem
The regression output without an intercept term in the regression equation is:
VAR4 = - 1.153*VAR1 - 0.267*VAR2 - 15.546*VAR3, Errorvar.= 88.515
Standerr   (0.273)   (0.533)   (7.080)   (27.402)
Z-values   -4.218    -0.501    -2.196     3.230
P-values    0.000     0.616     0.028     0.001

Goodness-of-Fit Statistics
Degrees of Freedom for (C1) - (C2) = 0
Maximum Likelihood Ratio Chi-Square (C1) = 0.0 (P = 1.000)
Browne’s (1984) ADF Chi-Square (C2_N2) = 0.00 (P = 1.000)
The model is saturated, the fit is perfect!
The regression weights are listed in front of each independent variable (VAR1, VAR2, VAR3). Below each regression weight is the standard error in parentheses; for example, the VAR1 regression weight has a standard error of .273, with the Z value indicated below that, and a p value listed below the Z value. The Z value is computed as the parameter estimate divided by the standard error (Z = −1.153/.273 ≈ −4.22; the output reports −4.218, most likely because it is computed from unrounded estimates). If testing each regression weight at the critical z = 1.96, α = .05 level of significance, then VAR1 and VAR3 are statistically significant, but VAR2 is not (Z = −.501). The R² = .685, so 69% of the variability in Y scores (VAR4) is predicted by knowledge of VAR1, VAR2, and VAR3. This example is further explained in Jöreskog and Sörbom (1993, pp. 1–6).
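The point estimates in this output can be approximated directly from the covariance matrix by solving the normal equations. The sketch below is not how LISREL computes its results and does not reproduce the standard errors; it is a plain least-squares calculation with a small hand-written linear solver (the `solve` helper is hypothetical), used only to show where the weights, the error variance, R², and the standardized weights come from.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Covariance matrix entries from the SIMPLIS program (VAR1-VAR3 predict VAR4).
s_xx = [[91.384, 30.641, 0.584],
        [30.641, 27.288, 0.641],
        [0.584, 0.641, 0.100]]
s_xy = [-122.616, -52.576, -2.399]
s_yy = 281.210

# Unstandardized weights: solve S_xx b = s_xy.
b = solve(s_xx, s_xy)

# Explained variance, error variance, and R2.
explained = sum(bj * sj for bj, sj in zip(b, s_xy))
error_var = s_yy - explained
r_squared = explained / s_yy

# Standardized weights: each b times sd(x) / sd(y).
betas = [bj * (s_xx[j][j] / s_yy) ** 0.5 for j, bj in enumerate(b)]

print("b     =", [round(v, 3) for v in b])
print("betas =", [round(v, 3) for v in betas])
print(f"error variance = {error_var:.3f}, R2 = {r_squared:.3f}")
```

The values agree with the printed output and with the standardized weights used in the chapter's hand calculation to within rounding of the input matrix.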
Model testing involves determining the fit of the theoretical regression model. We can calculate by hand the R² value using the correlation matrix and standardized beta weights as shown in Table 4.1:

$$R^2_{y.123} = \beta_1 r_{y1} + \beta_2 r_{y2} + \beta_3 r_{y3} = -.657(-.7649) + -.083(-.6002) + -.294(-.4530) = .685.$$

The adjusted R² value for the MCA theoretical regression model approach is

$$R^2_{Adj} = 1 - \frac{n - 1}{n - p - 1}\,(1 - R^2) = 1 - \frac{23}{20}\,(.315) = .638.$$

The F test for the significance of the R² value is

$$F = \frac{R^2/p}{(1 - R^2)/(n - p - 1)} = \frac{.685/3}{.315/20} = 14.50.$$

The effect size is ES = R² – [p/(n − 1)] = .685 − .130 = .555. This would be considered a large effect size. Notice that the SEM software does not provide these values.
The results indicated that a patient’s age, severity of illness, and level of anxiety make up a statistically significant set of predictors of a patient’s satisfaction level. There is a large effect size so one might expect similar results when conducting a regression analysis on another sample of data. The negative standardized regression coefficients indicate that as patient age, severity of illness, and anxiety increase, a patient’s satisfaction decreases. SEM software can output both standardized and unstandardized parameter estimates. Both should be reported, so we showed how they are computed. Also, we have not dropped the non-significant second variable, which is discussed next.
Table 4.1: Correlation Matrix (n = 24)

        VAR1      VAR2      VAR3      VAR4
VAR1    1.0000
VAR2    0.6136    1.0000
VAR3    0.1935    0.3888    1.0000
VAR4   −0.7649   −0.6002   −0.4530    1.0000

The theoretical regression model included a set of three independent explanatory variables, which resulted in a statistically significant R² = .685. This implied that 69% of the patient satisfaction level score variance was explained by knowledge of a patient’s age, severity of illness, and level of anxiety. The regression analysis, however, indicated that the regression weight for VAR2 was not statistically different from zero (z = −0.501, p = .616). Thus, one might consider model modification where the theoretical regression model is modified to produce a better fitting model. In multiple regression, the two different regression equations would yield different R² values; thus an F test of the difference between the two R² values would be computed.
We would run the SIMPLIS program again, but this time drop VAR2 from the equation command. The output would now only show two independent predictor variables, and a different R2 value.
VAR4 = - 1.235*VAR1 - 16.780*VAR3, Errorvar.= 89.581, R² = 0.681
Standerr   (0.216)   (6.517)   (27.063)
Z-values   -5.727    -2.575     3.310
P-values    0.000     0.010     0.001
The F test for a difference between the two models is

$$F = \frac{(R^2_F - R^2_R)/(p_1 - p_2)}{(1 - R^2_F)/(n - p_1 - 1)} = \frac{(.685 - .681)/(3 - 2)}{(1 - .685)/(24 - 3 - 1)} = \frac{.004}{.01575} = .25.$$

The F test for the difference in the two R² values was non-significant, indicating that dropping VAR2 does not affect the explanation of a patient’s satisfaction level (R² = .685 vs R² = .681). We therefore use the more parsimonious two-variable regression model (68% of the variance in a patient’s satisfaction level is explained by knowledge of a patient’s age and level of anxiety, that is, 68% of 281.210 = 191.22). The F test is also not provided in the SEM software.
Because the R² value is not 1.0 (perfect explanation or prediction), additional variables could be added if additional research indicated that another variable was relevant to a patient’s satisfaction level, for example, the number of psychological assessment visits. The unexplained error variance (89.581) was statistically significant, that is, 1 – R² = 1 – .68 = .32 (32%), so additional significant predictor variables would be helpful in accounting for the unexplained variance. Obviously, more variables can be added in the model modification process, but a theoretical basis should be established by the researcher for the additional variables.
SUMMARY
A basic regression analysis was conducted in SEM. We discovered that the model-fit statistics and computer output information in SEM are not the same as in traditional statistics packages that run multiple regression. The parameter estimates can be computed using different estimation methods. The regression models are considered saturated just-identified models, because all parameters are estimated. We also showed that the selection of independent variables in the regression model (model specification) and the subsequent regression model modification are key issues not easily resolved without a sound theoretical justification.
The selection of a set of independent variables and the subsequent regression model modification are important issues in multiple regression. How does a researcher determine the best set of independent variables for explanation or prediction? It is highly recommended that a regression model be based on some theoretical framework that can be used to guide the decision of what variables to include. Model specification consists of determining what variables to include in the model and which variables are independent or dependent. A systematic determination of the most important set of variables can then be accomplished by setting the partial regression weight of a single variable to zero, thus testing full and restricted models for a difference in the R² values (F test). This approach and other alternative methods were presented by Darlington (1968).
In multiple regression, the selection of a wrong set of variables can yield erroneous and inflated R² values. The process of determining which set of variables yields the best prediction, given time, cost, and staffing, is often problematic because several methods and criteria are available to choose from. Recent methodological reviews have indicated that stepwise methods are not preferred, and that an all-possible-subset approach is recommended (Huberty, 1989; Thompson, Smith, Miller, & Thomson, 1991). In addition, the Mallows Cp statistic is advocated by some rather than R² for selecting the best set of predictors (Mallows, 1966; Schumacker, 1994; Zuccaro, 1992). Overall, which variables are included in a regression equation will determine the validity of the model and the theoretical rationale of the researcher. For example, if the intercept term is omitted, the predictor variables are compared on the same scale with the intercept value of 0. However, if an intercept term is included, then the intercept indicates a starting point or baseline measure (see chapter footnote).
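An all-possible-subset comparison can be illustrated with the patient satisfaction example, since R² for any subset of predictors can be computed from the correlations in Table 4.1. The sketch below is a minimal illustration that ranks subsets by R² only; it does not compute the Mallows Cp statistic, and the `solve` helper and other names are hypothetical.

```python
from itertools import combinations

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Table 4.1 correlations (n = 24): predictors VAR1-VAR3, criterion VAR4.
names = ["VAR1", "VAR2", "VAR3"]
r_xx = [[1.0, 0.6136, 0.1935],
        [0.6136, 1.0, 0.3888],
        [0.1935, 0.3888, 1.0]]
r_xy = [-0.7649, -0.6002, -0.4530]

def r_squared(subset):
    """R2 for a subset of predictors: sum of beta times correlation with Y."""
    sub_xx = [[r_xx[i][j] for j in subset] for i in subset]
    sub_xy = [r_xy[i] for i in subset]
    betas = solve(sub_xx, sub_xy)
    return sum(bj * rj for bj, rj in zip(betas, sub_xy))

results = {}
for size in range(1, len(names) + 1):
    for subset in combinations(range(len(names)), size):
        label = " + ".join(names[i] for i in subset)
        results[label] = r_squared(subset)

# List every subset from the largest R2 to the smallest.
for label, value in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{label:<20} R2 = {value:.3f}")
```

Because R² can never decrease when a predictor is added, a ranking by R² alone will always favor the largest model; the comparison is useful mainly for seeing how little is gained beyond the VAR1 and VAR3 pair.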
Because multiple regression techniques have been shown to be robust to violations of assumptions (Bohrnstedt & Carter, 1971) and applicable to contrast coding, dichotomous coding, ordinal coding (Lyons, 1971), and criterion scaling (Schumacker, 1993), they have been used in a variety of research designs. In fact, multiple regression equations can be used to address several different types of research questions. The model specification issue, however, is paramount in achieving a valid multiple regression model. Replication, cross-validation, and