CHAPTER CONCEPTS
Data set formats
Data editing issues
Measurement scale
Restriction of range
Missing data
Outliers
Linearity
Non-normality
Sample size and degrees of freedom
Start values
Non-positive definite matrices
Heywood case (negative variance)
DATA SET FORMATS
An important first step in using an SEM software program is to be able to enter raw data and/or import data. Many of the software programs can read text files (ASCII), comma-separated files (CSV), and data from other programs such as SPSS, SAS, or Excel; accept correlation or covariance matrices entered directly into the program; or read in raw data and save it as a system file.
A system file is a file type unique to the software program; for example, PRELIS files are system files in LISREL that appear in spreadsheet format. The spreadsheet format is similar to that found in SPSS, and permits analysis of statistics from the pull-down menu, and the saving of files that contain the variance–covariance matrix, the correlation matrix, means, and standard deviations of variables. The PRELIS system file activates a pull-down menu that permits data editing features, data transformations, statistical analysis of data, graphical display of data, multi-level
Data Entry anD EDit issuEs
modeling, and many other related features. Statistical packages that provide SEM software permit the researcher to conduct much needed data screening and editing (SPSS with AMOS, SAS with Proc Calis, Statistica with SEPATH, STATA with SEM).
Nothing prevents a researcher from using a statistical package to edit data prior to using it in a stand-alone SEM software program (EQS, Mplus, Mx, R).
A researcher should also be aware of important data editing features needed prior to SEM analysis, whether available in a statistics package or the SEM software. This involves checking for missing data, conducting homogeneity of variance tests and normality tests, and reviewing the data output options. An important data option is to save raw data in an ASCII tab-delimited file or comma-separated file. Many of the SEM software programs can read in the raw data in these file types. The data output options might also permit saving different types of variance–covariance matrices, descriptive statistics, or scores for future use.
Your knowledge of data input and output features is very important, and tutorials or workshops are offered for many of the statistical packages and SEM software packages.
The statistical analysis of data in SEM requires you to become familiar with the software input and output data features. The most common approaches have been inputting raw data from ASCII data files (txt; dat), comma separated data files (csv), variance–covariance matrices, or correlation matrices with means and standard deviations. SEM software will convert raw data or the correlation matrix with means and standard deviations to a variance–covariance matrix in modeling applications. SEM software output varies, with some providing graphs, and others having options to output scores, different types of correlation matrices, or saving an asymptotic variance–covariance matrix.
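The conversion from a correlation matrix with standard deviations to a variance–covariance matrix can be sketched in a few lines. This is only an illustration of the arithmetic (each covariance equals the correlation multiplied by the two standard deviations), not the routine any particular SEM program uses. The function name `corr_to_cov` is hypothetical; the input values are the correlations and standard deviations reported in Table 2.3 of this chapter.

```python
# Convert a correlation matrix plus standard deviations into a
# variance-covariance matrix: cov[i][j] = r[i][j] * sd[i] * sd[j].
# `corr_to_cov` is a hypothetical helper name used for illustration.

def corr_to_cov(corr, sds):
    """Return the covariance matrix implied by `corr` and `sds`."""
    n = len(sds)
    return [[corr[i][j] * sds[i] * sds[j] for j in range(n)] for i in range(n)]

# Correlations and standard deviations from Table 2.3 (VAR1, VAR2, VAR3).
corr = [
    [1.000, 0.689, 0.393],
    [0.689, 1.000, 0.712],
    [0.393, 0.712, 1.000],
]
sds = [47.710, 46.967, 43.184]

cov = corr_to_cov(corr, sds)
for row in cov:
    print("  ".join(f"{value:10.3f}" for value in row))
```

The diagonal of the result holds the variances (the squared standard deviations), and the off-diagonal entries hold the covariances.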
DATA EDITING ISSUES
There are several key issues in the field of statistics that impact our analyses once data have been imported into a software program. These data issues are commonly referred to as the measurement scale of variables, restriction in the range of data, missing data values, outliers, linearity, and non-normality. Each of these data issues will be discussed because they not only affect traditional statistics, but present additional problems and concerns in structural equation modeling.
Measurement Scale
How variables are measured or scaled influences the type of statistical analyses we perform (Anderson, 1961; Stevens, 1946). Properties of scale also guide our
a BEginnEr’s guiDE to structural Equation MoDEling
understanding of permissible mathematical operations. A nominal variable, for example, implies mutually exclusive groups: biological gender has two mutually exclusive groups, male and female. An individual can only be in one of the groups that define the levels of the variable. In addition, it would not be meaningful to calculate a mean and a standard deviation on the variable gender. Consequently, the number or percentage of individuals at each level of the gender variable is the only mathematical property of scale that makes sense.
An ordinal variable, for example, attitude toward school, that is scaled strongly agree, agree, neutral, disagree, and strongly disagree, implies mutually exclusive categories that are ordered or ranked. When levels of a variable have properties of scale that involve mutually exclusive groups that are ordered, only certain mathematical operations are permitted. A comparison of ranks between groups is a permitted mathematical operation, while calculating a mean and standard deviation is not.
Final exam scores, an example of an interval variable, possess the property of scale that implies equal intervals between the data points, but no true zero point. This property of scale permits the mathematical operation of computing a mean and a standard deviation. Similarly, a ratio variable, for example, weight, has the property of scale that implies equal intervals and a true zero point (weightlessness). Therefore, ratio variables also permit mathematical operations of computing a mean and a standard deviation. The difference between interval and ratio data is whether a true zero point exists. For interval data, it is wise to have an arbitrary zero point whenever possible. This aids our interpretation in many statistical applications.
Our use of different variables requires us to be aware of their properties of scale and what mathematical operations are possible and meaningful, especially in SEM, where variance–covariance matrices are used. Different types of correlation coefficients will produce different types of variance–covariance matrices (Pearson, polychoric, polyserial) among variables depending upon the level of measurement, and this creates a unique problem in SEM. It is very important to understand how continuous variables, ordinal variables, and group or categorical variables are used in SEM. We will address the use of different variable types and correlation matrices in subsequent chapters of the book.
Restriction of Range
Data values at the interval or ratio level of measurement can be further defined as being discrete or continuous. For example, final exam scores could be reported in whole numbers (discrete). Similarly, the number of children in a family would be considered a discrete level of measurement, for example, 5 children. In contrast, a continuous variable is reported using decimal values; for example, a student’s grade point average would be reported as 3.75 on a 5-point scale.
Karl Jöreskog and Dag Sörbom (1996) provided a criterion based on their research that defines whether a variable is ordinal or interval. If a variable has fewer than 15 scale points, it should be considered an ordinal variable, whereas a variable with 15 or more scale points is considered a continuous variable. This 15 scale point criterion permits the Pearson correlation coefficient values to vary between +/– 1.0. Variables with fewer distinct scale points restrict the value of the Pearson correlation coefficient such that it may only vary between +/– 0.5. SEM software requires you to declare the type of variables being used in the model. Other factors that affect the Pearson correlation coefficient are presented in this chapter.
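The effect of the number of scale points on the Pearson correlation can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it draws two standard normal variables with a population correlation of 0.8 (an arbitrary choice), cuts each into equal-width categories over the range -3 to 3, and compares the resulting correlations. The helper names `pearson` and `categorize`, the seed, and the sample size are all hypothetical, and the simulation does not attempt to reproduce the specific bounds quoted above.

```python
import random

def pearson(x, y):
    """Pearson correlation computed from deviation scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def categorize(values, n_categories, low=-3.0, high=3.0):
    """Cut values into equal-width categories 0..n_categories-1 over [low, high]."""
    width = (high - low) / n_categories
    codes = []
    for v in values:
        k = int((v - low) // width)
        codes.append(min(max(k, 0), n_categories - 1))  # clamp the tails
    return codes

rng = random.Random(2024)   # fixed seed so the run is repeatable
rho = 0.8                   # assumed population correlation
x, y = [], []
for _ in range(20000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    x.append(z1)
    y.append(rho * z1 + (1 - rho ** 2) ** 0.5 * z2)

r_continuous = pearson(x, y)
r_15 = pearson(categorize(x, 15), categorize(y, 15))
r_3 = pearson(categorize(x, 3), categorize(y, 3))

print(f"continuous scores : r = {r_continuous:.3f}")
print(f"15 scale points   : r = {r_15:.3f}")
print(f" 3 scale points   : r = {r_3:.3f}")
```

With many scale points the correlation stays close to the value for the continuous scores, whereas with only three scale points it is noticeably attenuated.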
Missing Data
The statistical analysis of data is affected by missing data values in variables. That is, not every subject has an actual value for every variable in the data set, as some values are missing. It is common practice in statistical packages to have default values for handling missing values. Caution is needed because SEM software varies on what missing data values are permitted. Also, an ASCII tab-delimited data file with blanks for missing data can cause data input errors. The researcher has the options of deleting subjects who have missing values, replacing the missing data values, or using robust statistical procedures that accommodate the presence of missing data.
The various SEM software programs handle missing data differently and have different options for replacing missing data values. Table 2.1 lists many of the various options for dealing with missing data. These options can dramatically affect the number of subjects available for analysis, the magnitude and direction of the correlation coefficient, or create problems if means, standard deviations, and correlations are computed based on different sample sizes.
Listwise deletion and pairwise deletion of cases are not always recommended options due to the possibility of losing a large number of subjects, thus dramatically reducing the sample size, and affecting parameter estimates and standard errors. Mean substitution works best when only a small number of missing values are present in the data, whereas regression imputation provides a useful approach with a moderate amount of missing data.
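The first three options can be sketched on a small data set to show how they change the number of subjects available. The data below are hypothetical scores for six subjects on three variables, with `None` marking a missing value; the function names are likewise illustrative and do not correspond to any SEM program.

```python
# Hypothetical scores for six subjects; None marks a missing value.
data = {
    "X": [10.0, 12.0, None, 15.0, 18.0, 20.0],
    "Y": [3.0, 4.0, 5.0, None, 8.0, 9.0],
    "Z": [7.0, 6.0, 9.0, 10.0, 12.0, None],
}

def listwise(data):
    """Keep only subjects with complete data on every variable."""
    names = list(data)
    n = len(data[names[0]])
    keep = [i for i in range(n) if all(data[v][i] is not None for v in names)]
    return {v: [data[v][i] for i in keep] for v in names}

def pairwise(data, a, b):
    """Keep subjects with data on both variables of a single pair."""
    pairs = [(p, q) for p, q in zip(data[a], data[b])
             if p is not None and q is not None]
    return [p for p, _ in pairs], [q for _, q in pairs]

def mean_substitute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

complete = listwise(data)
print("listwise n =", len(complete["X"]))

for a, b in [("X", "Y"), ("X", "Z"), ("Y", "Z")]:
    xa, xb = pairwise(data, a, b)
    print(f"pairwise n for {a}-{b} =", len(xa))

filled = {name: mean_substitute(values) for name, values in data.items()}
print("mean-substituted X:", filled["X"])
```

In this toy example listwise deletion leaves three subjects, each pairwise correlation would be based on four subjects (a different four for each pair), and mean substitution keeps all six.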
Schafer (1997) pointed out three types of non-response for missing values: MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random). Under MCAR, the missingness does not depend on any variable in the data. Under MAR, the missingness is related to other observed variables, but not to the missing values themselves. Under MNAR, the missingness depends on the unobserved values themselves, a form of response bias; for example, males and females may respond differently. The expectation maximization (EM), Markov chain Monte Carlo (MCMC), and matching response pattern approaches have been recommended when larger amounts of data are missing at random. The EM algorithm generates a single complete data matrix, whereas the MCMC method generates several complete data matrices and uses the average. The MCMC method is generally regarded as more reliable than the EM algorithm. In both cases, the complete data matrix can be used to estimate the mean vector and the covariance matrix of the observed variables, which can be used in SEM to estimate parameters in the model.
More information about how to handle missing data using MCMC with Mplus programs is available in Enders (2013), while others address general missing data issues: McKnight, McKnight, Sidani and Aurelio (2007), and Peng, Harwell, Liou, and Ehman (2007). Davey and Savla (2009) have recently published an excellent book with SAS, SPSS, STATA, and Mplus source programs to handle missing data in SEM, especially in the context of power analysis.
LISREL Example
The imputation of missing values in LISREL9 uses the matching response pattern approach with raw data in a system file. This approach replaces a case's missing value on a variable with a value taken from other cases that have a similar response pattern on a set of variables with complete data. Multiple imputation is possible when missing values occur on more than one variable by selecting mean substitution, the expectation maximization (EM) algorithm, or the Markov chain Monte Carlo (MCMC) option, which generates random draws from probability distributions via Markov chains. Once raw data are saved as a system file (*.lsf), a main menu appears with pull-down menus. The Statistics menu offers Impute Missing Values (matching response pattern approach) and Multiple Imputation (mean substitution, EM, MCMC).
Table 2.1: Options for Dealing with Missing Data
Option                          Description
Listwise                        Delete subjects with missing data on any variable
Pairwise                        Delete subjects with missing data on each pair of variables used
Mean substitution               Substitute the mean for missing values of a variable
Regression imputation           Substitute a predicted value for the missing value of a variable
Expectation maximization (EM)   Find expected value based on expectation maximization algorithm
Matching response pattern       Match cases with incomplete data to cases with complete data to determine a missing value
We present an example using data on the cholesterol levels of 28 patients treated for heart attacks. We assume the data to be missing at random (MAR) with an underlying multivariate normal distribution. Cholesterol levels were measured after 2 days (VAR1), after 4 days (VAR2), and after 14 days (VAR3), but only 19 of the 28 patients had complete data. The missing value was entered as –9.000.
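Before any imputation, it helps to confirm how many values carry the missing data code. The sketch below is only an illustration of that check outside LISREL: it treats -9.000 as the missing code, as in this example, and counts missing values per variable and complete cases. The five rows are hypothetical and are not the actual values in chollev.dat; the function name `parse_rows` is also an assumption.

```python
MISSING_CODE = -9.0  # the example codes missing values as -9.000

# Hypothetical rows laid out as VAR1, VAR2, VAR3.
# These are NOT the actual values in chollev.dat.
raw_text = """\
250 230 220
240 225 -9.000
300 280 260
180 170 -9.000
210 205 215
"""

def parse_rows(text, missing_code=MISSING_CODE):
    """Parse whitespace-delimited rows, turning the missing code into None."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append([None if float(token) == missing_code else float(token)
                     for token in line.split()])
    return rows

names = ["VAR1", "VAR2", "VAR3"]
rows = parse_rows(raw_text)

missing_per_variable = {
    name: sum(row[j] is None for row in rows) for j, name in enumerate(names)
}
complete_cases = [row for row in rows if None not in row]

print("missing per variable:", missing_per_variable)
print("complete cases:", len(complete_cases), "of", len(rows))
```

Applied to the real file, the same counts should agree with what the chapter reports: missing values only on VAR3 and 19 complete cases out of 28.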
The raw data are from the file, chollev.dat, located in the Tutorial subfolder on the computer directory, LISREL9 Student Examples. We used the import data feature, which prompted us to save the raw data file as a LISREL system file type (chollev.LSF). The LISREL main menu should look like the image in Figure 2.1.
We next click on Statistics on the tool bar menu and select Impute Missing Values from the pull-down menu. This provides a dialog box with the variable that has missing values (VAR3), and the variables with complete cases to be used as matching variables (VAR1 and VAR2). The dialog box is shown in Figure 2.2.
We next selected output options and saved the transformed data in a new system file, cholnew.lsf, and output the new correlation matrix (chol.cor), means (chol.mn), and standard deviations (chol.sd) into saved files. These files were saved on the book website. The Output dialog box is shown in Figure 2.3.
After clicking on Run, a computer output file appears with the results. Do not be concerned about the warning messages; they indicate that the variables are treated as continuous because each has more than 15 categories. Also, notice that the output indicates no missing values after imputation, as shown in Table 2.2.
We should examine our data both before (Table 2.3) and after (Table 2.4) imputation of missing values. Here, we used the matching response pattern method. This comparison provides us with valuable information about the nature of the missing data.
Figure 2.1: LISREL MAIN MENU SCREEN
Figure 2.2: IMPUTE MISSING VALUES DIALOG BOX
Figure 2.3: OUTPUT DIALOG BOX
Table 2.2: LISREL Imputation Output
W_A_R_N_I_N_G: VAR1 has more than 15 categories and will be treated as continuous. ERROR CODE 201.
W_A_R_N_I_N_G: VAR2 has more than 15 categories and will be treated as continuous. ERROR CODE 201.
W_A_R_N_I_N_G: VAR3 has more than 15 categories and will be treated as continuous. ERROR CODE 201.
Number of Missing Values per Variable After Imputation
VAR1   VAR2   VAR3
   0      0      0
The correlation, mean, and standard deviation of VAR3 are mildly affected by the missing data. The imputation of missing values for VAR3 indicated a small reduction in mean value (221.47 vs 220.71), a small reduction in the standard deviation (43.18 vs 42.77), and a slight change in the bivariate correlation coefficients. The concern is whether to proceed with the imputed values or drop subjects with missing values, thus reducing the sample size to n = 19. We do not want correlations based on different sample sizes, so an important decision must be made. Because the imputed values only mildly changed the descriptive statistics, the parameter estimates from a statistical analysis would probably not be substantially different. However, it would be prudent to make a comparison of results before and after imputation to determine how much the parameter estimates and standard errors are affected by missing data.
Table 2.3: Summary Statistics with Missing Data
Total Effective Sample Size = 19

Number of Missing Values per Variable
VAR1   VAR2   VAR3
   0      0      9

Univariate Summary Statistics for Continuous Variables
Variable     Mean  St. Dev.  Skewness  Kurtosis  Minimum  Freq.  Maximum  Freq.
VAR1      253.929    47.710    -0.351     0.473  142.000      1  360.000      1
VAR2      230.643    46.967     0.019     1.344  116.000      1  352.000      1
VAR3      221.474    43.184    -0.166    -0.890  142.000      1  294.000      1

Correlation Matrix
        VAR1    VAR2    VAR3
VAR1   1.000
VAR2   0.689   1.000
VAR3   0.393   0.712   1.000

Table 2.4: Summary Statistics with Imputation
Total Sample Size (N) = 28

Number of Missing Values per Variable After Imputation
VAR1   VAR2   VAR3
   0      0      0

Imputations for VAR3
Case 2 imputed with value 204 (Variance Ratio = 0.000), NM = 1

Univariate Summary Statistics for Continuous Variables
Variable     Mean  St. Dev.  Skewness  Kurtosis  Minimum  Freq.  Maximum  Freq.
VAR1      253.929    47.710    -0.351     0.473  142.000      1  360.000      1

Another option is to use the new feature in LISREL9, namely, full information maximum likelihood estimation, which is automatically activated when using a raw data file with missing data. The term full information maximum likelihood (FIML) is used with data that have missing values, while the term maximum likelihood (ML) is used with complete data. In LISREL9, however, you can still perform the steps outlined above to gain knowledge about the nature of the missingness in the data.
We highly recommend comparing any analyses before and after the replacement of missing data values to fully understand the impact missing data values have on the parameter estimates and standard errors. Given the high-speed computers and software of today, we also recommend checking how the replacement of missing values using the EM, MCMC, and FIML approaches affects your statistical analysis. It would be easy to compare the different ways missing data affect parameter estimates and standard errors in a model, that is, analysis with missing data, analysis with imputed data, and analysis with automatic FIML. A comparison of these methods is also warranted in multiple variable imputations to determine the effect of using a different algorithm on the replacement of missing values.
Outliers
Outliers or influential data points can be defined as data values that are extreme or atypical on either the independent (X) or dependent (Y) variables or both (Ho & Naugher, 2000). Outliers can occur as a result of observation errors, data entry errors, instrument errors based on layout or instructions, or actual extreme values from self-report data. Because outliers affect the mean, the standard deviation, and correlation coefficient values, they must be explained, deleted, or accommodated by using robust statistics. Sometimes, additional data will need to be collected to fill in the gap along either the Y or X axes. Statistical packages today have outlier detection methods available that include the following: box plot display, scatterplot, histogram, and frequency distributions.
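A box plot flags outliers using fences placed beyond the quartiles. The sketch below applies that idea numerically, assuming the common convention of 1.5 times the interquartile range (packages may use a different multiplier or quartile definition). The scores and the function name `iqr_outliers` are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying beyond k * IQR from the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical exam scores with one atypical value.
scores = [52, 55, 57, 58, 60, 61, 63, 64, 66, 120]

flagged = iqr_outliers(scores)
kept = [v for v in scores if v not in flagged]

print("flagged values:", flagged)
print("mean with all values:   ", statistics.mean(scores))
print("mean without flagged:   ", statistics.mean(kept))
print("st. dev. with all values:", round(statistics.stdev(scores), 2))
print("st. dev. without flagged:", round(statistics.stdev(kept), 2))
```

Comparing the mean and standard deviation with and without the flagged value shows how strongly a single extreme score can shift both statistics.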
Linearity
Some statistical calculations, such as Pearson correlation, assume that the variables are linearly related to one another. Thus, a standard practice is to visualize the coordinate pairs of data points of two continuous variables by plotting the data in a scatterplot. These bivariate plots depict whether the data are linearly increasing or decreasing. The presence of curvilinear data reduces the magnitude of the Pearson correlation coefficient, even resulting in the presence of a zero correlation. Recall that the Pearson correlation value indicates the magnitude and direction of the linear relationships between two variables. Figure 2.4 shows the importance of visually displaying the bivariate data scatterplot.
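A small numerical example shows how a perfectly predictable but curvilinear relationship can yield a zero Pearson correlation. The values below are hypothetical and chosen so that the curve is symmetric about the mean of X; the function name `pearson` is illustrative.

```python
def pearson(x, y):
    """Pearson correlation computed from deviation scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]   # straight-line relationship
y_curved = [v ** 2 for v in x]      # U-shaped relationship

r_linear = pearson(x, y_linear)
r_curved = pearson(x, y_curved)

print(f"linear relationship:      r = {r_linear:.3f}")
print(f"curvilinear relationship: r = {r_curved:.3f}")
```

Y is completely determined by X in both cases, yet only the linear case produces a correlation of 1; the U-shaped case produces a correlation of zero, which is why the scatterplot should be inspected before interpreting the coefficient.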
Non-normality
In statistics, several transformations are given to handle issues with non-normal data. Some of these common transformations are described in Table 2.5.
Many inferential statistics assume that data are normally distributed. Skewed data (data lacking symmetry) occur more frequently along one part of the measurement scale and affect the variance–covariance among variables. In addition, kurtosis (peakedness) in data will impact statistics. Leptokurtic data values are more peaked than the normal distribution, whereas platykurtic data values are flatter and more dispersed along the X axis, with a consistently low frequency on the Y axis; that is, the frequency distribution of the data appears more rectangular in shape. One or more of the data transformation methods in Table 2.5 may correct for skewness or leptokurtic values in the data; however, platykurtic data usually require recoding into categories or obtaining better data.
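Skewness and kurtosis can be computed directly from the moments of a variable, which makes it easy to check whether a transformation helped. The sketch below uses the simple moment-based formulas (statistical packages often apply small-sample corrections, so their values may differ slightly) and a natural log transformation purely as a familiar illustration; it is an assumption here, not a recommendation over the methods in Table 2.5. The data values and function names are hypothetical.

```python
import math

def central_moment(values, order):
    """Average of deviations from the mean raised to `order`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3: 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical positively skewed scores (one long right tail).
raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 40]
logged = [math.log(v) for v in raw]  # assumed transformation for illustration

print(f"raw:    skewness = {skewness(raw):.2f}, "
      f"excess kurtosis = {excess_kurtosis(raw):.2f}")
print(f"logged: skewness = {skewness(logged):.2f}, "
      f"excess kurtosis = {excess_kurtosis(logged):.2f}")
```

Positive skewness indicates a long right tail, and positive excess kurtosis indicates a leptokurtic distribution; comparing the two printed lines shows how much the transformation pulled in the tail.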
Non-normal data can occur because of the scaling of variables (ordinal rather than interval) or the limited sampling of subjects. Possible solutions for skewness are to sample more participants or to perform a data transformation as indicated in Table 2.5. Our experience is that a probit data transformation works best in correcting skewness. Kurtosis in data is more difficult to resolve; some