Chapter 5: Research Method
5.9 Data Screening
5.9.1 Initial Data Screening
After the coding process, the data need to be edited to assure the completeness of the data and to make sure there were no errors at the stage of keying data. This process has been done using descriptive statistics in SPSS. Each variable was screened to check if the score was out of range by checking the frequencies, minimum, maximum, mean and standard deviation. When finding errors, it is necessary to go back to the questionnaires to confirm those data before correcting them. Only then are the data ready to be analysed. The descriptive statistics for the initial data screening can be found in Appendix I – Part C.
5.9.2 Missing Data
In multi-variate analysis, it is common to find missing data where valid values of one of more variables are not available for analysis. There are two causes leading to missing data values (Hair et al, 2006). First, missing data can be caused from the researcher-side, such as data entry errors or data collection problems; and second, any action on the part of the respondents such as refusal to answer. The missing data problem can affect the generalisability of the results. Therefore, it is important for the researcher to address the issue. There are two actions that can be taken regarding the missing data: delete the cases with a consequence of reducing sample size, or by applying a remedy. However, before doing so, the researcher should identify the patterns and relationships of the missing data in
order to maintain as close as possible the original distribution of values (Hair et al., 2006).
There are four steps in identifying missing data and applying remedies (Hair et al, 2006). The four steps are as follows.
1. Determine the type of missing data.
There are two types of missing data, ignorable and not ignorable. If missing data are expected because it is inherent in the technique used, then it requires no remedy (Little and Rubin, 2002; Schafer, 1997). However, Analysis of Moment Structures (AMOS) needs a complete data set; therefore, the missing data cannot be ignored. Hence, step 2 in handling missing data has to be taken.
2. Determine the extent of missing data.
In assessing the missing data, Hair et al. (2006) suggests tabulating: (1) the percentage of variables with missing data for each case; and (2) the number of cases with missing data for each variable. This can be done using SPSS missing data analysis. From the univariate statistic (see Appendix I – Part D), there are only two cases with missing data (0.6%). Since it is less than 10%, it can be ignored. However, as mentioned above, AMOS requires a complete data set, therefore, it is necessary to go to step 3.
3. Diagnosing randomness of missing data.
In this step, Expectation Maximisation (EM) missing data analysis is employed (see Appendix I - D). The EM method is an iterative process to predict the values of the missing variables using all other variables relevant to the construct of interest (Cunningham, 2008). In this analysis, Little’s MCAR (Missing Completely At Random) test shows Chi-Square = 83.086, DF = 96, and a significance level of 0.823. This result is not significant at an alpha level of 0.001, thus the missing data may be assumed to be missing at random. Consequently, the widest range of remedies can be used.
4. Select the imputation method.
Due to the requirement of AMOS, although with respect to the low level of missing data (below 10%) this could generally be ignored, it is necessary to complete the data. In this case, the regression method of imputation is selected to calculate the replacement values based on the rules that the missing data are less than 10% and classified as MCAR (missing completely at random). After handling the missing data using a regression imputation method with SPSS, the variables that will be used in SEM with AMOS data analysis are completed and free of missing data and, therefore, the data are ready for further analysis.
5.9.3 Multi-variate Outliers
After examining the missing data, the next step is to examine the data before further analysis is the detection of multi-variate outliers. An outlier is an observation that is substantially different from the other observations (i.e., has an extreme value) on one or more characteristics (variables) (Hair et al., 2006, p. 40). Furthermore, they state that an outlier cannot be categorically characterised as either beneficial or problematic, rather it must be viewed within the context of the analysis and should be evaluated by the types of information provided. Beneficial outliers may be an indication of a population characteristic that would not be discovered in the normal course of analysis. On the other hand, problematic outliers are not representative of the population. They are counter to the objectives of the analysis and can seriously impact statistical tests (Hair et al., 2006).
Multi-variate outliers are sometimes not easy to detect since they may involve extreme scores on two or more variables, or the pattern of scores is atypical (Kline, 2005). To examine the multi-variate outliers, AMOS provides the Mahalanobis d-squared statistic to indicate the observations farthest from the centroid (Mahalanobis distance). The Mahalanobis d-squared table is presented in Appendix I – D. Small numbers in the p1 column are to be expected. However, small numbers in the p2 column indicate observations that are implausibly far from the centroid under the hypothesis of normality (Arbuckle,
2006b). From the Mahalanobis d-squared table, the p1 column shows relatively small numbers, while the p2 column also exhibits some small numbers. This may be an indication of multi-variate outliers in the data.
There are two actions that can be taken in handling outliers, which are: retention; or deletion of the outliers. Following Hair et al. (2006), the outliers should be retained unless demonstrable proof indicates that they are really deviant and not representative of any observations in the population. In addition, by deleting the outliers, the researchers are improving the multi-variate analysis but at the cost of generalisability of the data. Therefore, the possible outlier’s data were retained in the current study.
5.9.4 Multi-variate Normality
The earlier steps of handling missing data and multi-variate outliers were undertaken to clean the data to a format most suitable for multi-variate analysis. The other step in dealing with data is testing the compliance of the data with the statistical assumptions underlying the multi-variate technique and then deal with the basic way in which the technique makes statistical inferences and results. Some robust techniques are less affected when the underlying assumptions are violated; however, in all cases, complying with some of the assumptions critically determines a successful analysis (Hair et al., 2006).
In multi-variate analysis the most fundamental assumption is normality. Normality is referring to the distribution of sample data that corresponds to a normal distribution. It is an assumption or requirement for statistical methods in some parametric tests (Hair et al., 2006). A normal distribution of data describes a symmetrical, bell-shaped curve which has the greatest frequency of scores in the middle with smaller frequencies towards the extremes (Gravetter and Wallnau, 2000).
It is important to assess the impact of violating the normality assumption since statistical tests that depend on the normality assumption may be invalid. Consequently, the conclusions drawn from the sample observations and their
statistics will be in question (Kerlinger and Lee, 2000). There are two dimensions to assess the severity of non-normality, which are: 1) the shape of the offending distribution; and 2) the sample size (Hair et al., 2006). It can be said that the extent of the non-normality distribution should be considered with the sample size, as the larger the sample size the smaller the effect of the non-normality distribution.
The data distribution, when it is different from the normal distribution, can be described by two measures, kurtosis and skewness (Hair et al., 2006). Accordingly, the assessment of the degree of normality can be examined from the value of the kurtosis and skewness. These values provide information about the shape of the distribution. Values for skewness and kurtosis are zero if the observed distribution is exactly normal (Coakes et al., 2008). Skewness measures the symmetry of a distribution (Hair, et al., 2006, p. 40). If most of the observation scores are piled up to the left, the distribution is said to be positively skewed; conversely if observation scores are piled up to the right, they are negatively skewed (Cunningham, 2008). Kurtosis refers to the peakedness of a distribution that measures the extent to which scores are clustered together (i.e., leptokurtic distribution) or widely dispersed (i.e., platykurtic distribution) (Cunningham, 2008). Newsom (2005) suggests that the absolute value of skewness less than or equal to 2 (|skew| ≤ 2) and the absolute value of kurtosis less than or equal to 3 (|kurtosis| ≤ 3) are acceptable limits for the condition of normality to be satisfied. While West, Finch and Curran (1995) recommend those absolute values of skewness and kurtosis greater than 2 and 7, respectively, were indicative of a moderately non-normal distribution.
In the present research, most of the uni-variate distributions are normal since the absolute values of kurtosis and skewness are below 2 and 3, respectively. However, the joint distributions of the variables may depart substantially from multi-variate normality. The existence of multi-variate normality can be tested by examining Mardia’s coefficient for multi-variate kurtosis (Mardia, 1970). Analysis of Moment Structures (AMOS) software can generate this coefficient. The Mardia’s coefficient is zero if the data are multi-variate normally distributed.
There is no absolute cut-off value of this coefficient, however, a value of 3 or more tends to be of concern (Wothke, 1993). In the present study, the Mardia’s multi-variate coefficient is relatively high; therefore the data may not be normally distributed. The violation of the multi-variate normality assumption can have a large effect on the standard errors and tests of significance when maximum likelihood (ML) estimation is used in confirmatory factor analysis (CFA) (Browne, 1982). In the present study, due to the high multi-variate normality value, the bootstrap method was employed. This method is discussed in sub-section 5.11.2.
5.9.5 Multi-collinearity
Multi-collinearity is the extent to which a construct can be explained by the other constructs in the analysis (Hair et al., 2006, p. 709). The existence of multi- collinearity occurs when variables that appear separate actually measure the same thing. It can be detected with the value of correlations. Even though there is no concession about how high the correlations have to be to exhibit multi- collinearity, Pallant (2005) points out that a correlation of up to 0.8 or 0.9 is reasonable. While Hayduk (1987) suggests concerns for values greater than 0.7 or 0.8.
In the present research, the test of reliability illustrates that some of the variables are highly correlated, which suggests the existence of multi-collinearity. There are two ways to deal with multi-collinearity, one is to eliminate variables, while the other is to combine the redundant variables into a composite variable (Kline, 2005). In the current study, the first method, which is the removal of variable (s) from data analysis is taken in dealing with multi-collinearity. This can be performed when conducting construct reliability and discriminant validity (see chapter 7).