Data Preparation and Preliminary Analysis

CHAPTER 5 RESEARCH METHODOLOGY AND DESIGN

5.5 ANALYTICAL METHODOLOGY

5.5.1 Data Preparation and Preliminary Analysis

(a) Data Screening

Before using data collected from a survey, it is essential to ensure that the data accurately reflects the responses made, that the data has been correctly coded and entered, that patterns in missing data points are ascertained, that unusual or extreme responses are identified, and that the data meets the statistical assumptions underlying the methods used to analyse it (Meyers et al., 2006).

The data from the web-based surveys (NZICA and NZLS) was received electronically from respondents by the Information Technology Department of the University of Canterbury, transferred into a spreadsheet, and forwarded to the author. As such, it was envisaged that the data would be free of coding errors. Data from the mail survey was entered into a spreadsheet, and each entry was manually checked against the survey instrument to minimise any coding errors made during the transfer.

The data-sets were also examined to determine whether there were any non-random patterns in the missing data points, such as a concentration of missing data points in a specific set of questions. Any non-random pattern in the missing data points requires closer inspection (Hair et al., 2006). A visual check suggested that the missing data points were randomly distributed. A more robust analysis, Little's Missing Completely At Random (MCAR) test, which is available in the Statistical Package for the Social Sciences (SPSS), was also applied to the data-sets. This test determines whether the missing data points are missing completely at random, or whether there are any non-random patterns in the missing data points. A non-significant MCAR test result therefore supports treating the observed values of Y as a truly random sample of all Y values, with no underlying process that lends bias to the observed data (Little & Rubin, 1987; Allison, 2001; Little & Rubin, 2002; and Hair et al., 2006).
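For reference, the statistic behind Little's MCAR test can be sketched as follows. This is the commonly cited form of the test rather than a formula given in this chapter, and all symbols are notation introduced here for illustration: j indexes the J distinct missing-data patterns, n_j is the number of cases showing pattern j, k_j is the number of variables observed in pattern j, and k is the total number of variables.

```latex
d^2 \;=\; \sum_{j=1}^{J} n_j \,
      \left(\bar{y}_j - \hat{\mu}_j\right)^{\top}
      \hat{\Sigma}_j^{-1}
      \left(\bar{y}_j - \hat{\mu}_j\right),
\qquad
d^2 \;\sim\; \chi^2_{\;\sum_{j=1}^{J} k_j \,-\, k}
\quad \text{(asymptotically, under MCAR)}
```

Here \bar{y}_j is the mean vector of the variables observed in pattern j, computed over the cases with that pattern, while \hat{\mu}_j and \hat{\Sigma}_j are the maximum likelihood estimates of the overall mean vector and covariance matrix, restricted to the same variables. A significant d^2 is evidence against MCAR; a non-significant value is consistent with the missing data points being missing completely at random.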

Next, the percentage of variables with missing data points was tabulated for each case, followed by the number of cases with missing data points for each variable (Hair et al., 2006). This process identifies not only the overall extent of missing data points, but also any exceptionally high levels of missing data points for individual cases or variables. Cases with more than 10 percent of missing data points, and variables with more than 10 percent of missing data points, were eliminated (Hair et al., 2006). However, missing data points from optional questions were not considered to be missing. In addition, a few respondents had indicated (via separate email) that they may have unintentionally skipped a page, and these cases were retained in the study.[97] A final review of the remaining missing data points indicated that they were either not significant or below the threshold that would warrant further diagnosis (Hair et al., 2006).
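The tabulation and the 10 percent rule described above can be sketched in a few lines of Python. This is an illustration only, not the procedure actually run in SPSS: the function names, the toy data, and the choice to screen variables before cases are all assumptions made here, and the `variables` list is assumed to contain compulsory questions only, since optional questions were not treated as missing.

```python
def missing_share_by_variable(rows, variables):
    """Share of cases in which each variable is missing (None)."""
    return {v: sum(row.get(v) is None for row in rows) / len(rows)
            for v in variables}


def missing_share_by_case(rows, variables):
    """Share of the listed variables that are missing (None) in each case."""
    return [sum(row.get(v) is None for v in variables) / len(variables)
            for row in rows]


def screen_missing(rows, variables, threshold=0.10):
    """Drop variables, then cases, with more than `threshold` missing.

    Screening variables first is an assumption; the source does not
    state the order in which the two rules were applied.
    """
    var_share = missing_share_by_variable(rows, variables)
    kept_vars = [v for v in variables if var_share[v] <= threshold]
    case_share = missing_share_by_case(rows, kept_vars)
    kept_rows = [row for row, share in zip(rows, case_share)
                 if share <= threshold]
    return kept_rows, kept_vars


# Hypothetical toy data: 20 cases answering 12 compulsory questions.
variables = [f"q{i}" for i in range(1, 13)]
rows = [{v: 3 for v in variables} for _ in range(20)]
rows[0]["q4"] = None                  # one skipped item: case is kept
for v in ("q1", "q2", "q3"):
    rows[1][v] = None                 # three skipped items: case is dropped
for row in rows[2:7]:
    row["q12"] = None                 # q12 missing in 25% of cases: dropped

kept_rows, kept_vars = screen_missing(rows, variables)
print(len(kept_rows), len(kept_vars))  # prints: 19 11
```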

(b) Response Bias Analysis

(i) Nonresponse Bias (Levene’s t-Test)

In order to make valid statistical generalisations, it is necessary to consider whether 'nonresponse' bias is present in the survey data (Brehm, 1993; Dillman, 2000; and Ziegler, 2006). Nonresponse bias arises when respondents and non-respondents differ systematically in some key areas. Testing for nonresponse bias establishes whether the outcomes of the survey would have been significantly different had non-respondents responded. While there are a variety of methods to test for nonresponse bias, the two most common approaches are to test the difference in outcome between early and late responses, and to compare the characteristics of the sample to the population (Leong, 1980; and Sheik & Marringly, 1981). Other researchers have suggested contacting a sample of non-respondents and comparing their results with those of respondents (Johnson, 1959; Miller & Smith, 1983; and Collier & Bienstock, 2007).

[97] Unfortunately, the web-based survey did not have any features preventing respondents who did not complete all questions from moving on to the next page. Although some electronic packages (for example, Survey Monkey and Qualtrics) offer a feature that guarantees complete survey responses in all compulsory fields, this was not adopted for a number of reasons.


Arguably, the last approach appears more empirically sound than the first two approaches; however, this option is not always feasible (Collier & Bienstock, 2007). Due to the nature of the current survey, it would be extremely difficult, if not impossible, to contact non-respondents. A recent study which analysed 535 articles over a five-year period claims that the second most empirically sound method for assessing nonresponse bias is to compare early and late respondents (the extrapolation method) on both the variables of the study and demographic variables (Collier & Bienstock, 2007). In the current study, comparing early and late respondents (the latter used as proxy non-respondents) on both study variables and demographic variables provides some assurance that respondents and non-respondents do not differ in sample characteristics, or in the opinions and attitudes that are the specific inquiry of this study. Consistent with Armstrong and Overton (1977), the current study compared the first 25 percent of respondents to the last 25 percent, with the last 25 percent representing non-respondents.

The independent-samples t-test, which is available in SPSS, was employed to test whether the means of the two independent groups (early and late respondents) differ (Gaur & Gaur, 2006; and Pallant, 2011). The null hypothesis for the t-test is that there is no difference between the responses of the early respondents and the late respondents (Hinton et al., 2004). SPSS reports two versions of the t-test result, one assuming equal variances and one not, and the significance (or non-significance) of the F statistic from Levene's test for equality of variances determines which version to use.[98] If the selected t-test result has a p-value of 0.05 or less, the null hypothesis should be rejected. However, if the p-value is greater than 0.05, the null hypothesis cannot be rejected, indicating that there is no statistically significant difference between the responses of the two groups.

(ii) Representativeness of Observed Samples

Samples are measured in order to make generalisations about the target population (Tabachnick & Fidell, 2007). In order for the results to have generalisability, it is important that the sample reflects the true population. Due to the nature of surveys, it is extremely difficult to generate a sample that is representative of the population. In this study, a selected number of the demographic and economic characteristics of the sample were compared to data available for the true population, in order to determine whether the sample reflects the population distribution. The selected attributes are: gender, age, income level, income source, and educational attainment.

[98] A significant F-test statistic indicates that the 'equal variances not assumed' value should be used. Conversely, if the F-test statistic is not significant, the 'equal variances assumed' value should be used.


Selected attributes of the Taxpayer sample were compared to those of the New Zealand population obtained from Statistics New Zealand's website.[99]

The Tax Agent sample's attributes were compared to information available from the NZICA's 2006 Annual Report, while information on remuneration was sourced from published results of the remuneration survey undertaken in the 2007 year.[100] The Tax Lawyer or NZLS sample was removed from this study due to the low number of responses received (37 responses), and no further analysis was undertaken for this sample group.

(c) Estimation Technique for Missing Data

Most statistical packages, including PLS, require complete data-sets, so data-sets with missing data must be remedied before they can be used. Arguably, the traditional approaches may cause problems, and Hair et al. (2006) suggest a model-based approach in which missing data is imputed based on all available data for a given respondent. One highly recommended method is the Expectation Maximisation (EM) approach available in SPSS, which estimates the means and covariances as if there were no missing data. The EM method uses a maximum likelihood approach for estimating missing values (Little & Rubin, 2002). The EM algorithm is a two-step iterative process. The first step uses regression analysis to estimate the missing values. The second step applies maximum likelihood procedures to estimate parameters (for example, correlations) using the missing data replacements (Meyers et al., 2006). The two steps are repeated until the estimates stabilise.

Advantages of the EM method include fewer problems with convergence and less bias under conditions of random missing data. The only disadvantage noted is that the effective sample size is uncertain for EM. Arguably, when the sample size exceeds 250 and the total amount of missing data is below 10 percent, it is acceptable to use the pair-wise approach. However, when the sample size is small and the amount of missing data is large, the model-based approaches (EM or maximum likelihood) become superior (Hair et al., 2006). In the present study, due to the smaller sample size, the EM imputation approach was used to address the remaining missing data.
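To make the two-step iteration concrete, the following is a minimal sketch of EM for the simplest possible case: two variables assumed to be bivariate normal, where x is complete and y has missing values. It is an illustration under those stated assumptions and is not the SPSS routine used in this study; the function name `em_bivariate` and the data are hypothetical.

```python
def em_bivariate(x, y, tol=1e-10, max_iter=1000):
    """EM estimates for bivariate normal data with missing y (None)."""
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n
    # Starting values taken from the cases where y is observed.
    y_obs = [yi for yi in y if yi is not None]
    mu_y = sum(y_obs) / len(y_obs)
    s_yy = sum((yi - mu_y) ** 2 for yi in y_obs) / len(y_obs)
    s_xy = 0.0
    for _ in range(max_iter):
        beta = s_xy / s_xx                 # regression slope of y on x
        resid_var = s_yy - beta * s_xy     # residual variance of y given x
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                # E-step: regression prediction replaces the missing value.
                ey = mu_y + beta * (xi - mu_x)
                eyy = ey ** 2 + resid_var
            else:
                ey, eyy = yi, yi ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M-step: maximum likelihood updates from the completed sums.
        new_mu_y = sum_y / n
        new_s_yy = sum_yy / n - new_mu_y ** 2
        new_s_xy = sum_xy / n - mu_x * new_mu_y
        change = max(abs(new_mu_y - mu_y), abs(new_s_yy - s_yy),
                     abs(new_s_xy - s_xy))
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break
    beta = s_xy / s_xx
    imputed = [yi if yi is not None else mu_y + beta * (xi - mu_x)
               for xi, yi in zip(x, y)]
    return {"mu_y": mu_y, "s_yy": s_yy, "s_xy": s_xy, "imputed": imputed}


# Hypothetical data: y is missing for the second and fifth cases.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, None, 6.2, 8.1, None, 11.8, 14.2, 15.9]
result = em_bivariate(x, y)
print(result["imputed"])
```

With more than two variables and arbitrary missing-data patterns the same idea applies, but the regression in the first step is run on whichever variables are observed for each case.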

(d) Descriptive Statistics

In addition to inferential statistical techniques, descriptive statistics are also an important feature in most empirical studies, as they provide a simple summary of the survey data and form the basis of quantitative analysis of data. Tabachnick and Fidell (2007) argue that describing and making inferences about a data set are equally important for empirical research. Therefore, prior to undertaking any further analysis, demographic data collected from the survey will be used to develop a profile of the sample population, and the descriptive statistics computed for selected indicators and constructs in the research model will provide a preliminary view of the raw data and explain the underlying information. This involves computing the means, standard deviations and frequencies for each selected variable.

[99] Data downloaded from www.stats.govt.nz/census/Census2006HomePage.aspx.

[100] NZICA's 2006 Annual Report was retrieved from http://www.nzica.com. The results of the remuneration survey conducted for the 2007 year were obtained from NZICA's website at http://www.institutesurvey.co.nz/2008/2007results.asp.
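As a minimal sketch of the descriptive statistics named in this subsection (means, standard deviations and frequencies), the following uses only the Python standard library. The function name `describe` and the responses are hypothetical, and the sample standard deviation (n - 1 divisor) is assumed.

```python
from collections import Counter
from statistics import mean, stdev


def describe(values):
    """Mean, sample standard deviation and frequency table for one variable.

    Missing responses (None) are ignored.
    """
    observed = [v for v in values if v is not None]
    return {
        "n": len(observed),
        "mean": mean(observed),
        "sd": stdev(observed),
        "freq": dict(sorted(Counter(observed).items())),
    }


# Hypothetical responses to one 5-point item, with one missing answer.
responses = [1, 2, 2, 3, 4, None]
summary = describe(responses)
print(summary["n"], summary["mean"], summary["freq"])
# prints: 5 2.4 {1: 1, 2: 2, 3: 1, 4: 1}
```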

In document The application of the theory of planned behaviour and structural equation modelling in tax compliance behaviour: a New Zealand study (Page 141-145)