5 Chapter Five: Data Preparation and Data Analysis Procedures

5.1 Preliminary analysis

5.1.1 Data preparation

The preliminary stages of data preparation are essential to ensure that the raw data is of sufficient quality to produce valid and accurate results. At the same time it enables the researcher to gain a basic understanding of the data and potential relationships between the variables (Hair, Black, Babin, Anderson and Tatham, 2006; Churchill, 1999). Careful data preparation avoids the results being compromised through biased findings and incorrect interpretation (Malhotra and Birks, 2003). The researcher must “check that there are no coding errors, that variables have been recoded appropriately if necessary, and that missing values have been dealt with properly” (Baumgartner and Homburg, 1996, p. 148). The purpose of the preliminary stages of data preparation is therefore the detection and elimination of errors in the data (Hair, Bush and Ortinau, 2006).

In preparation for analysis, the data were entered manually into a MS Excel spreadsheet. As all the questionnaire items to be subjected to analysis were closed questions with pre-specified answer categories coded on the questionnaire, values were transferred directly into the spreadsheet (Churchill, 1999). Negatively worded items were reverse scored so that, for all items, lowest scores were entered as 1 and highest scores were entered as 7.
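The reverse scoring described above can be illustrated with a minimal sketch. It assumes a seven-point scale; the function name and the example responses are hypothetical and are not taken from the study's spreadsheet.

```python
SCALE_MIN, SCALE_MAX = 1, 7  # seven-point scale, as described in the text


def reverse_score(value, scale_min=SCALE_MIN, scale_max=SCALE_MAX):
    """Mirror a response around the scale midpoint (1 <-> 7, 2 <-> 6, ...)."""
    if not scale_min <= value <= scale_max:
        raise ValueError(f"{value} is outside {scale_min}-{scale_max}")
    return scale_min + scale_max - value


# Hypothetical responses to one negatively worded item
raw = [1, 2, 4, 6, 7]
recoded = [reverse_score(v) for v in raw]
print(recoded)  # [7, 6, 4, 2, 1]
```

The midpoint (4) is unchanged by the recoding, so only the direction of the item is altered.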

Initial data editing was performed to identify and correct inaccurate or inconsistent entries resulting from errors in questionnaire completion or data entry (Churchill, 1999). To ensure the accuracy of data entry, a sample of the raw data was compared with the original questionnaires, with no errors identified at this stage. A check was carried out to verify that all variables were within the expected limits of between 1 and 7 on the Likert-type scale (Tabachnick and Fidell, 2007). The dataset was then transferred into SPSS v. 18, and examined for missing values and assumptions of normality which enable statistical analysis to be performed correctly. The profile of the survey respondents was also prepared.
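A range check of the kind described above might be sketched as follows. The rows and item names are invented for illustration, and `None` is assumed to mark a missing answer, which is left for the missing-value stage.

```python
def out_of_range(rows, low=1, high=7):
    """Return (row_index, item_name, value) for every entry outside low..high.

    None marks a missing answer and is skipped here.
    """
    problems = []
    for i, row in enumerate(rows):
        for name, value in row.items():
            if value is not None and not low <= value <= high:
                problems.append((i, name, value))
    return problems


# Hypothetical rows; the item names are illustrative only
rows = [
    {"item_a": 5, "item_b": 7},
    {"item_a": 8, "item_b": None},  # 8 would be a keying error
    {"item_a": 1, "item_b": 0},     # 0 would be a keying error
]
flagged = out_of_range(rows)
print(flagged)  # [(1, 'item_a', 8), (2, 'item_b', 0)]
```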

5.1.2 Missing values

The statistical analysis method selected for this research, structural equation modelling, requires that no missing values occur in the dataset (Bentler and Chou, 1987; Hair, Black, Babin, Anderson and Tatham, 2006). Missing values arise either from the omission of answers by respondents, or from errors in data collection or data entry (Hair, Black, Babin, Anderson and Tatham, 2006). Since data for this research were collected by trained Market Research Company interviewers, the quality of the raw data was very good but not perfect; it is rare for even such commissioned surveys to be returned without some incomplete responses (Hair, Black, Babin, Anderson and Tatham, 2006; Churchill, 1999). Since respondents had (optionally) given a phone number, those who had given incomplete data were recontacted where possible and asked to revisit their responses. Nevertheless, a small number of questionnaires containing missing data remained to be dealt with.

Statistical analysis packages provide several methods of dealing with missing data. Cases with missing values may be removed from the dataset altogether (listwise deletion), may be excluded only from the specific calculations that involve the variables with missing data (pairwise deletion), or may have their missing data replaced by a suitable value (imputation). None of these methods is without problems. With listwise deletion, removing entire cases from the dataset can result in a loss of valuable information which may have been costly to obtain and, if large numbers of cases have missing values, the dataset may be severely reduced, thus biasing the remaining sample (Cohen, Cohen, West and Aiken, 2003; Olinsky, Chen and Harlow, 2003). Pairwise deletion, by excluding cases only if missing data exist in the variables being analysed, can result in loss of comparability, as sample size will vary between procedures (Malhotra and Birks, 2006). Replacing missing values by imputation enables the researcher to avoid the problems associated with removing cases and to retain the maximum sample size.
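The difference between listwise and pairwise deletion can be seen in a small sketch. The four cases and the variable names are hypothetical, and `None` is assumed to mark a missing answer.

```python
# Hypothetical cases; None marks a missing answer
cases = [
    {"a": 5, "b": 6, "c": 4},
    {"a": 3, "b": None, "c": 2},
    {"a": 7, "b": 5, "c": None},
    {"a": 2, "b": 3, "c": 1},
]


def listwise(cases):
    """Keep only cases that are complete on every variable."""
    return [c for c in cases if all(v is not None for v in c.values())]


def pairwise(cases, x, y):
    """Keep cases that are complete on the two variables being analysed."""
    return [c for c in cases if c[x] is not None and c[y] is not None]


n_listwise = len(listwise(cases))
n_ab = len(pairwise(cases, "a", "b"))
n_bc = len(pairwise(cases, "b", "c"))
print(n_listwise, n_ab, n_bc)  # 2 3 2
```

Listwise deletion halves this small dataset, while pairwise deletion keeps more cases but leaves each pair of variables with a different sample size, which is the loss of comparability noted above.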

Replacing missing data by imputation is recommended if the extent of missing data is acceptably low (under 3%), if the number of respondents with unsatisfactory responses is small, if the proportion of unsatisfactory responses for these respondents is low, and if the pattern of missing data is random (Cohen, Cohen, West and Aiken, 2003; Hair, Black, Babin, Anderson and Tatham, 2006; Malhotra and Birks, 2003).

Although empirical tests can be performed to assess the randomness of missing data, when the number of missing items is small a simple visual test is sufficient (Hair, Black, Babin, Anderson and Tatham, 2006). In this study, there were 27 questionnaires containing missing data out of a total of 819, with 53 missing items in total. Although this was only around 3.3% of the total sample, a combination of the above methods of dealing with missing data was used. Firstly, three cases which had five or more missing values were deleted listwise, which had the added benefit of eliminating 19 of the missing values. This resulted in a dataset of 816 cases. Of the remaining 24 cases with missing values, the majority (19) had only one missing value. The maximum number of missing values on any one variable was four (0.49%). The cases with missing values were evenly spread among the three survey locations. Hence the remaining missing values were considered sufficiently random and suitable for imputation.
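The proportions quoted above can be reproduced from the reported counts. The sketch below uses only figures stated in the text; the variable names are illustrative.

```python
total_cases = 819
cases_with_missing = 27
missing_items = 53
deleted_cases = 3          # cases with five or more missing values
items_removed = 19         # missing values eliminated by the deletion
max_missing_on_one_variable = 4

share_incomplete = cases_with_missing / total_cases
remaining_cases = total_cases - deleted_cases
remaining_incomplete = cases_with_missing - deleted_cases
remaining_items = missing_items - items_removed
worst_variable = max_missing_on_one_variable / remaining_cases

print(f"{share_incomplete:.1%}")   # 3.3%
print(remaining_cases, remaining_incomplete, remaining_items)  # 816 24 34
print(f"{worst_variable:.2%}")     # 0.49%
```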

Missing values can be imputed using a number of methods. Missing data can be replaced with a suitable value, for example calculated from the mean of its variable. However, replacing with the mean can distort the results by underestimating variance and is not recommended (Enders and Bandalos, 2001). Other methods include the use of regression, Expectation Maximisation (EM) and other multiple imputation techniques to calculate a replacement value. These methods infer information from all available data, but whereas regression uses information about relationships between the variables in the dataset, the other methods use iterative processes of repeated calculations to reach the best possible replacement value (Enders and Bandalos, 2001; Schafer and Graham, 2002).
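The way mean substitution understates variance can be shown with a short sketch. The observed values and the number of missing answers are hypothetical.

```python
from statistics import mean, variance

observed = [2.0, 4.0, 5.0, 6.0, 7.0]   # hypothetical answered values
n_missing = 3                           # hypothetical unanswered cases

# Mean substitution: every missing answer receives the observed mean
filled = observed + [mean(observed)] * n_missing

var_observed = variance(observed)
var_filled = variance(filled)
print(var_observed, var_filled)
```

The imputed values sit exactly on the mean, so they add nothing to the sum of squared deviations while increasing the number of cases, and the sample variance falls.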

The method chosen in this study was EM due in part to its availability in SPSS but also because, as a multiple imputation method, its advantages over other techniques have been demonstrated (Peters and Enders, 2002; Schafer and Graham, 2002). The EM algorithm uses a two-step iterative procedure whereby missing observations are initially estimated and replaced from the observed data in the covariance matrix; then the mean vector and covariance matrix are estimated as though there were no missing data, from the statistics calculated from the previous step. The two steps are repeated until the difference between the observed and estimated covariance matrices falls below a pre-specified acceptable level (Enders, 2001; Peters and Enders, 2002). Following imputation using EM, the remaining town centre image dataset contained complete data for 816 cases.
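The two-step procedure can be illustrated with a deliberately simplified sketch for two variables, where x is fully observed and some values of y are missing. This is an assumption-laden toy version and not the SPSS routine used in the study; the function name, data and tolerance are hypothetical, and convergence is judged here by the change in the estimates between iterations.

```python
def em_bivariate(x, y, tol=1e-8, max_iter=1000):
    """EM estimates of the mean and (co)variances of (x, y); None marks missing y.

    Assumes x is fully observed and the data are bivariate normal.
    """
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((a - mu_x) ** 2 for a in x) / n  # ML estimate (divide by n)

    # Starting values taken from the complete cases only
    complete = [(a, b) for a, b in zip(x, y) if b is not None]
    m = len(complete)
    mu_y = sum(b for _, b in complete) / m
    s_yy = sum((b - mu_y) ** 2 for _, b in complete) / m
    s_xy = sum((a - mu_x) * (b - mu_y) for a, b in complete) / m

    def fill(mu_y, s_xy):
        """Replace each missing y by its conditional expectation given x."""
        beta = s_xy / s_xx
        return [b if b is not None else mu_y + beta * (a - mu_x)
                for a, b in zip(x, y)]

    for _ in range(max_iter):
        # E-step: expected y and y^2 for the missing cases
        resid_var = s_yy - s_xy ** 2 / s_xx
        ey = fill(mu_y, s_xy)
        ey2 = [e * e + (resid_var if b is None else 0.0) for e, b in zip(ey, y)]
        # M-step: re-estimate the parameters as though the data were complete
        new_mu_y = sum(ey) / n
        new_s_yy = sum(ey2) / n - new_mu_y ** 2
        new_s_xy = sum(a * e for a, e in zip(x, ey)) / n - mu_x * new_mu_y
        change = max(abs(new_mu_y - mu_y), abs(new_s_yy - s_yy),
                     abs(new_s_xy - s_xy))
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break

    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_xy": s_xy,
            "s_yy": s_yy, "y_filled": fill(mu_y, s_xy)}


# Hypothetical scores on two items; None marks a missing answer
x = [1, 2, 3, 4, 5, 6, 7, 3]
y = [2, 3, 3, 5, 6, None, 7, None]
result = em_bivariate(x, y)
print(result["y_filled"])
```

In the E-step the residual variance is added to the expected squared value of each missing score, which is what keeps the estimated variance from shrinking in the way that mean substitution does.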

5.1.3 Assumptions of normality

To perform structural equation modelling, all variables are required to meet the assumptions of multivariate normality (Bentler and Chou, 1987). Lack of multivariate normality affects the power of statistical analysis to distinguish between good and bad models, adversely influences goodness-of-fit indices and standard errors, and thus calls into question the validity of results (Baumgartner and Homburg, 1996). Normality of variables can be assessed using either statistical or graphical methods (Baumgartner and Homburg, 1996; Tabachnick and Fidell, 2007).

The first condition for multivariate normality is univariate normality, where “each variable and all linear combinations of the variables are normally distributed” (Tabachnick and Fidell, 2007, p. 78). Univariate normality is a necessary but not sufficient condition of multivariate normality (Hair, Black, Babin, Anderson and Tatham, 2006). A common indicator of normality compares skewness, which represents the symmetry of the distribution, and kurtosis, which describes its peakedness. In a normal distribution both skewness and kurtosis are zero (Tabachnick and Fidell, 2007). While formal statistical tests can be used, if the sample is large a visual assessment of the shape of the distribution is sufficient (Tabachnick and Fidell, 2007). Accordingly, descriptive statistics in the form of histograms with normal curves superimposed were created for each variable, and a visual check was performed (Hair, Black, Babin, Anderson and Tatham, 2006; Tabachnick and Fidell, 2007). This served the purpose of checking for any scores outside the expected range, as well as assessing whether the variables were evenly spread around the mean and hence normally distributed. Mean scores, standard deviations, and skewness and kurtosis measures were obtained, which revealed that the majority of the variables had means that were higher than the midpoint of the scale and were (negatively) skewed. However, in many research situations, variables have scores which are skewed either positively or negatively, reflecting the underlying reality of the construct rather than problems with measures (Pallant, 2007). None of the variables were outside of the expected range (Tabachnick and Fidell, 2007).
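Skewness and kurtosis can be computed from the central moments of a variable, as in the sketch below. It uses the simple moment-based formulas, which is an assumption: statistical packages such as SPSS typically report bias-corrected versions, so values will differ slightly in small samples. Kurtosis is expressed as excess kurtosis, so that a normal distribution scores zero. The example responses are hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both 0 for a normal curve)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - m) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical responses bunched towards the top of a 1-7 scale
responses = [7, 7, 6, 6, 6, 5, 5, 4, 3, 1]
skew, kurt = skewness_and_excess_kurtosis(responses)
print(skew, kurt)

# A perfectly symmetric set of scores has zero skewness
sym_skew, _ = skewness_and_excess_kurtosis([1, 2, 3, 4, 5, 6, 7])
```

A mean above the scale midpoint with a tail of low scores, as in the first example, yields a negative skewness value.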

Although univariate normality does not guarantee multivariate normality, in practice achieving univariate normality is regarded as sufficient, particularly with large samples (over 200 cases) (Hair, Black, Babin, Anderson and Tatham, 2006). Due to the large sample size in the town centre image dataset, problems as a result of skewness and kurtosis were not expected to impact on the results and no modifications to the data were considered necessary (Tabachnick and Fidell, 2007). Large sample sizes can accommodate divergence from multivariate normality since they increase stability and decrease variability (Hair, Black, Babin, Anderson and Tatham, 2006; Tabachnick and Fidell, 2007).

Outliers may also affect the normality of the variables (Baumgartner and Homburg, 1996). An outlier “is a case with such an extreme value on one variable ... or such a strange combination of scores on two or more variables ... that it distorts statistics” (Tabachnick and Fidell, 2007, p. 72). Outliers are important to identify as they may result from incorrect data entry, may be cases which do not belong to the sample population, or may represent real values which nevertheless are not representative of the sample as a whole (Tabachnick and Fidell, 2007). Outliers may be problematic if they are not representative of the sample and may influence the results in a way that cannot be generalised to other samples (Hair, Black, Babin, Anderson and Tatham, 2006). Outliers can be diagnosed by using box-plots (which identify cases that fall far from the median), or Mahalanobis distance tests (which measure the distance of cases from the intersection of the means of all the variables) (Tabachnick and Fidell, 2007). Tests were carried out for outliers in SPSS v. 18 using box-plots: no action was considered necessary at this stage.
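Both diagnostics can be sketched briefly. The box-plot rule below uses the common convention of fences set 1.5 interquartile ranges beyond the quartiles, which is an assumption: SPSS computes its hinges somewhat differently, so borderline cases may be classified differently. The Mahalanobis function is restricted to two variables so that the covariance matrix can be inverted by hand. All data and names are hypothetical.

```python
from statistics import mean, quantiles


def boxplot_outliers(values, k=1.5):
    """Values lying more than k interquartile ranges beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


def mahalanobis_sq(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs) / (n - 1)
    syy = sum((b - my) ** 2 for b in ys) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    return [(syy * (a - mx) ** 2 - 2 * sxy * (a - mx) * (b - my)
             + sxx * (b - my) ** 2) / det
            for a, b in zip(xs, ys)]


# Univariate example: one very low score on a 1-7 item
scores = [5, 6, 6, 7, 5, 6, 7, 6, 5, 1]
univariate_outliers = boxplot_outliers(scores)
print(univariate_outliers)  # [1]

# Bivariate example: the last case pairs a high x with a low y
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 2, 3, 4, 5, 1]
d2 = mahalanobis_sq(xs, ys)
most_unusual = d2.index(max(d2))
print(most_unusual)  # 5
```

The last bivariate case has unremarkable scores on each variable taken separately; it stands out only because of the combination, which is the situation the Mahalanobis distance is designed to detect.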

In document The image of a town centre: a retail perspective (Page 153-158)