Statistical analysis, models and their assumptions

The data analysis started by examining the data and it was observed that there were some missing data. Missing values for item scores were dealt with by replacing the missing value with the mean score for that item, as recommended by Han and Kamber (2006). A total of 45 of the responses contained one or more missing values.

The methods used to analyse data were the following:

Exploratory data analysis

An exploratory data analysis was performed between the variables and the items that formed each variable. Exploratory data analysis (Berry and Linoff, 1997; Parr 2001; Dunham, 2003) is a set of statistical techniques for analysing data. It includes graphic and quantitative techniques. In this research the techniques used were: Cronbach‟s alpha coefficients analysis, summary statistics analysis, distribution analysis, demographic distribution analysis and correlation analysis.

 Cronbach‟s alpha: The alpha coefficients are an internal consistency reliability indicator (See Section 4.5). The coefficient alpha is: “the average of the coefficients from all possible combinations of split halves.” (Hair et al., 2003) The split-half refers to the division of the items in half and correlates the two sets of items. The search for each variable and the items was performed and the values compared to the threshold of 0.7 minimum level of acceptance.

 Summary statistics analysis: This group of measures is represented by the search of statistical attributes of the sample by each variable: Anova test for means difference, mean, variance, kurtosis, and skewness.

Analysis of variance (Anova):

The general model for one factor is:

Where Yij represents the j-th observation in the

i

group, µ represents the mean of the whole sample,

τ

i is the factor per group

i

and εij the error of the j-th observation in the

i-th

group

Anova tests the overall model to determine if there was a difference in means between the members of groups, time of experience, time in the position, risk management process and risk type. The Anova has some assumptions: subjects are randomly assigned to one of 3 or more groups and that the data within each group are normally distributed with equal variances across groups. Sample sizes between groups do not have to be equal, but large differences in sample sizes for the groups may affect the outcome of some multiple comparisons tests.

The test statistic reported is an F-test with k‑1 and N‑k degrees of freedom, where N is the number of subjects and k the number of groups (N can be different in each group (Spiegel, 1997)). A low p‑value for the F-test is evidence to reject the null hypothesis. In other words, there is evidence that at least one pair of means is not equal.

The hypotheses for the comparison of independent groups are: (k is the number of groups)

Ho: µ1 = µ2 ... = µk (means of all groups are equal)

Ha: µi <> µj (means of two or more groups are not equal)

The test used for the means is Tukey‟s test and for the variance Levene‟s test. Tukey‟s test compares the means among groups and compares the differences in means to the standard error. This is a test used when the sample sizes of the group are different and values greater than the alpha value (5%) indicates no significant difference in the means. Levene‟s test is used to test for equal variances between groups. The advantage of this test is that is used for groups of different sizes. Values greater than the alpha value (5%) indicate no significant difference in the variances.

 Distribution analysis: This analysis consists of observing the histograms.

Q-Q and probability plots, as well as the formal tests of normality, in order to identify whether the variables are described by the normal distribution or

not. The demographic distribution analysis refers to the distribution analysis using the groups. The tests used for normality assessment was the Kolmogorov-Smirnov at the 5% level. The use of the Kolmogorov-Smirnov test is suggested as follows: ”An alternative strategy for evaluating the normality assumption is to evaluate the values computed for standardized residuals with respect to goodness-of-fit for a normal distribution. The latter can be accomplished through use of one of the goodness-of-fit test described in the book, such as the Kolmogorov-Smirnov goodness-of-fit test for a single sample …” and “ the Kolmogorov-Smirnov is designed to be employed with continuous variables .”(Sheskin, 2007), Similarly, “ The Kolmogorov-Smirnov (K-S) test is used to decide if a sample comes from a population with a specific density. It is used because the distribution of the K-S test statistic does not depend on the underlying cumulative distribution function being tested and because it is an exact test.” (Cohen and Cohen ,2008).

 Correlation Analysis: The correlation analysis used for a general view under the assumption of normality of variables was Pearson‟s Coefficient and for the non-normal and ordinal variables Spearman‟s coefficient.

Multivariate Analysis:

A multivariate exploratory data analysis was performed. The method selection took into consideration the following options:

 The use of SEM (Structural Equation Model)

 Multiple regression, with and without interactions

 Stepwise regression

 Review of the regression results using power analysis

The use of SEM

The use of Structural Equation Models (SEM) was considered. Hair et al.(2006) explain, “SEM estimates a series of separate, but interdependent, multiple regression terms…Thus, some dependent variables become independent variables in subsequent relationships…The proposed relationships are then translated into a series of structural equations (similar to regression equations) for each independent variable.”

The number of combinations of models can be enormous if there is not a pre-identified set of causal (path) models, Hair et al. (2006) said that the fit is for the selected model and it depends on the selected structural equations. The scope of this research is exploratory: the identification of relationships among KM and RM variables but without a previous path model. The lack of a previous path model would have meant considering many variable combinations for relationship identification.

Even if identified, a SEM model could have a good fit but it does not mean that the selected model is the only one, it is just one of the acceptable ones (Hair et al.

2006)..Kale et al. (2000) pointed out that without previous work identifying some kind of relationships, path analysis, and causality might produce a search of relationships without boundaries. Thus the use of SEM was not feasible at this exploratory stage in the research.

The use of Multiple Regression

The multiple regression provides means for the validation of the hypotheses. Each model is built for evaluation of the relationships to the two dependent variables.

The regression models allow the analysis of the variables using first the whole sample and the demographic groups as well. During the data analysis some other models were tried. In particular multiple regression with interactions,(see Appendix 9.4 for examples of two variable interaction). The results did not deliver any new insights in terms of model quality to the research and the interpretation of the

coefficients became complex because some of the significant interactions were negative.

Thus, the regression models formulated have the following structure:

where i =1...121 and p=1...8 That means

and where

u

is the

error term, Xi are the KM independent variables and Y is the dependent variable, qrc and perm in each model.

Hair et al. (1998) indicates the assumptions to be examined for the multiple regression model as follows:

 Linearity of the relationship between independent and dependent variables

 Homoscedasticity: Equality of variance of the errors

 No autocorrelation of the errors and independence of observations

 Normality of the residuals or errors

 No multicollinearity

 No outlier distortion

Linearity: this is the degree of association between the change of the dependent variable and independent variables. According to Hair et al. (1998) linearity is examined using the residual plots. With these plots, the identification of curvilinear patterns can lead to the conclusion of non-linearity.

Homoscedasticity is evaluated using the White test (SAS Reference 9.1.3 and Gujarati, 2003). The White test assumes that under the null hypothesis of no heteroscedasticity the n times (n number of the sample points) of the R² of the auxiliary regression follows a Chi-square distribution with the degrees of freedom equal to the number of independent variables (regressors of the auxiliary regression) used. The auxiliary regression model uses the error of the original model as dependent variable and is described through the original independent variables plus the squares of the independent variables plus the combination, two

by two, of all the variables. This combination of terms in the regression models indicates the number of degrees of freedom for a Chi-squared test.

The autocorrelation of the errors was tested by applying the Durbin-Watson test.

Values close to 2 of these statistics indicate data is independent (Christensen, 1997). The rule says that the value of the statistics should be between 1.5 and 2.5 to indicate independence of observations.

For the Normality analysis of the residuals Hair et al. (2003) suggest the review of three plots: the histogram vs. the normal distribution curve, the normal probability plot of regression standardized residuals and the standardized predicted values of the dependent variable against the standardized residuals from the regression equation. The attributes to observe are:

 Histogram vs. normal distribution: Establish if the observation of the histogram of the residuals suggests a good fit to the normal distribution, and if the points of the bars are touched by the normal curve and the difference in the tails does not appear big.

 Probability plot of residuals: Establish if the residuals are close to the diagonal, overlap the diagonal and if only some points appear to be outliers.

 Standardized predicted values vs. standardized residuals observing if there are values out of ranges as Hair et al. (2003) pointed out: “The plot of standardized residuals provides information on the assumption that errors are normally distributed. To assess this, you look at standardized residual plot and determine whether 95% of the standardized residuals are between -2 and +2.

If they appear to be, and this is a judgment call, then we conclude that errors are normally distributed.” Johnson and Wichern (1998) indicate possible issues in the patterns to use as reference in the plots analysis.

A multicollinearity analysis was performed: first, using what Hair et al.(2003) pointed out that if there is a correlation between two independent variables that is above 0,7, the variables should be considered to be removed. Second, using two

additional indicators for the multiple regression models: Variance Inflation Factor and Condition Index.

The variance inflation factor, an indicator of multicollinearity is a metric that identifies “how much the variance of the regression coefficients is inflated by multicollinearity.” (Hair et al.,2003).

The condition index is the square root of the ratio between the maximum eigenvalue and the minimum eigenvalue of the variance-covariance matrix. In the SAS 9.1.3 Reference guide is indicated: “Belsley, Kuh, and Welsch (1980) construct the condition indices as the square roots of the ratio of the largest eigenvalue to each individual eigenvalue... The condition number ...is defined as the largest condition index.... When this number is large, the data are said to be ill conditioned. A condition index of 30 to 100 indicates moderate to strong collinearity... A collinearity problem occurs when a component associated with a high condition index contributes strongly to the variance of two or more variables.

Thus, for a high condition index (>30), the corresponding row should be examined to see which variables have high values. Those would indicate near-linear dependence.”

No outlier distortion is the search for the observations that have a substantial difference between the predicted and actual value in the dependent variable or a big difference in observations relative to the others, in the independent variables.

The use of stepwise regression

This is the method of variable selection based on the variable contribution to the explanatory power of the model. The variable selection starts with the best predictor of the dependent variable. Variables are added if they increase the prediction of dependent variable or dropped if they reduce that prediction power.

Hair et al. (1998) pointed out that the independent variables are selected when their partial correlation coefficients are significant, and: “Independent variables are

dropped if their predictive power drops to a non-significant level when another independent variable is added to the model.”

The use of Power Analysis

This research included demographic variables as well as the variables related to KM and RM. In order to assess the quality of the regression models this research took into consideration the number of sample points used, the number of variables and the size of r-square. The way to validate if the number of sample points. number of variables and the r-square size, in the regression model, were appropriate was Power Analysis.

In particular the analysis by demographic groups needed the use of Power Analysis (Cohen et al., 2003). The statistical precision and power analysis refer to the standard error, confidence interval and probability of rejecting the null hypothesis when it is false and this depends on three elements: the sample size, the significant criterion and the size of the effect in the population. Power is the probability of rejecting Ho when a particular alternative value of the parameter is assumed or, to put it in another way, power is one minus the probability of a type II error. (Moore and McCabe 1999; Neter et al. 1990; Kleinbaum et al. 2008).

The tables used are in Cohen et al. (2003) and the threshold considered as sufficient power is 0.80 (Murphy and Myors, 1998; Kraemer and Thieman 1987) The Power Analysis is based on the following concepts associated with R-squared of the regression models:

2 2 2

1 R f R

  and L f²(nk1) then  ₂ k1 f

n L

Where

f²= Population effect size for R-squared n= Number of observations

L=Value that is obtained from the L tables for a given power value and number of independent variables

k= Number of independent variables

Non – response bias test

The non-response bias is “[t]he difference between the answers of nonrepondents and respondents.” (Lambert and Harrington 1990). These authors indicate that the comparison of the “characteristics that are relevant to the study.” leads to the identification of the bias.

Pervan (1998) pointed out, “The danger, however, is in non-response bias, i.e.

that those not responding have substantially different views from those who have responded. A recommended strategy for overcoming this is by resampling the non-respondents (Hartman et al., 1985). Such a follow-up survey was conducted and a further 19 responses received…. Tests of the difference in the mean critical score ratings between the 33 respondents in the first round and the 19 respondents in the second round were carried out and no significant differences were found, even at the 0.1% level of significance."

In this research the comparison of the answers between the first group of respondents and the second group who answered after a second invitation to participate in the survey was sent. This second group was taken as sample of those that are non-respondents, The comparison was performed and the results indicate (See Table 4-2) that two of 53 items had a significant mean difference,

“[t]he absence of nonresponse bias is inferred”(Lambert and Harrington, 1990).

As these represent one item out of 6 in the variable quality of network capacity for connecting people (nccp) and one out of 9 for the variable perceived value of ERM implementation (perm), it was thought reasonable to assume there was no overall non-response bias. .

Item Mean

isi1 3.05128 3.00000 0.89204 1.02857 0.33298 0.82125 nccp1 2.79487 2.55556 1.06208 0.88254 0.29135 0.29805 isi2 2.94872 2.77778 1.04993 1.03492 0.48460 0.47121 nccp2 2.61538 2.41667 1.03239 0.70714 0.13061 0.36144 isi3 3.05128 2.72222 1.41835 0.94921 0.11661 0.19658 nccp3 2.79487 2.13889 1.27260 0.75159 0.05942 0.00642 isi4 2.71795 2.66667 1.04993 0.91429 0.34106 0.82371 nccp4 3.07692 2.88889 0.96761 0.95873 0.49079 0.40987 isi5 2.74359 2.72222 1.45884 1.12063 0.21667 0.93551 nccp5 3.12821 3.16667 1.21997 0.82857 0.12536 0.87035 isi6 2.66667 2.83333 1.01754 0.71429 0.14665 0.44251 pqc1 3.41026 3.25000 0.77463 1.16429 0.11024 0.48173

cwc1 3.51282 3.63889 0.67746 0.75159 0.37602 0.52032 pqc2 3.82051 3.83333 0.57220 1.05714 0.03302 0.95086 0.9514749 cwc2 3.61538 3.72222 0.71660 0.66349 0.41056 0.57989 pqc3 3.33333 3.30556 0.85965 0.96111 0.36725 0.89999

cwc3 2.64103 2.47222 1.18354 0.94206 0.24899 0.48194 pqc4 3.02564 3.05556 0.97301 0.85397 0.34970 0.89279 cwc4 3.38462 3.27778 0.87449 0.89206 0.47441 0.62423 pqc5 3.17949 3.16667 0.83536 0.94286 0.35658 0.95319 cwc5 2.84615 2.69444 0.71255 1.01825 0.14131 0.48109 iis 3.20513 3.13889 0.85155 0.69444 0.27245 0.74589 cwc6 2.87179 2.69444 0.95682 0.78968 0.28443 0.41515 wcf1 2.82051 2.66667 0.83536 0.97143 0.32377 0.48527 qrks1 3.64103 3.55556 0.55196 1.05397 0.02646 0.67910 0.6831023 wcf2 2.56410 2.47222 0.77868 0.99921 0.22592 0.67375

qrks2 3.30769 3.00000 1.06073 0.80000 0.20100 0.17295 wcf3 2.94872 2.72222 0.52362 0.94921 0.03737 0.25439 0.260433 qrks3 3.46154 3.41667 0.62348 0.93571 0.11108 0.82587 wcf4 2.89744 2.77778 0.67341 0.92063 0.17309 0.56252

qrks4 3.17949 3.11111 0.88799 0.95873 0.40733 0.75887 wcf5 3.00000 2.75000 0.68421 0.76429 0.36825 0.20725 qrks5 3.07692 3.00000 0.75709 0.80000 0.43255 0.70696 wcf6 3.05128 2.75000 0.68151 1.10714 0.07258 0.17021 qrc1 3.28205 3.16667 0.73414 0.65714 0.37174 0.55177 perm1 3.76923 3.83333 0.55061 0.48571 0.35536 0.70150 qrc2 3.53846 3.58333 0.57085 0.87857 0.09751 0.81946 perm2 3.69231 3.66667 0.53441 0.62857 0.31161 0.88454 qrc3 3.17949 3.13889 0.57220 0.69444 0.27918 0.82558 perm3 3.58974 3.63889 0.93252 0.58016 0.07971 0.80843 qrc4 3.25641 3.47222 0.61673 0.77063 0.25057 0.26483 perm4 3.64103 3.69444 0.92038 0.84683 0.40325 0.80663

qrc5 3.17949 3.16667 0.57220 0.94286 0.06693 0.94910 perm5 3.53846 3.86111 0.83401 0.40873 0.01780 0.08283 0.0790151 misf1 3.33333 3.11111 0.91228 0.78730 0.33125 0.30111 perm6 3.56410 3.83333 0.83131 0.42857 0.02528 0.14910 0.1441873 misf2 3.20513 3.08333 0.79892 0.76429 0.44898 0.55316 perm7 3.56410 3.72222 0.83131 0.60635 0.17428 0.42382

misf3 3.20513 3.11111 0.79892 0.84444 0.43222 0.65476 perm8 3.51282 3.77778 0.94062 0.57778 0.07430 0.19455 misf4 3.10256 3.16667 0.72605 0.60000 0.28574 0.73487 perm9 3.12821 3.61111 0.90418 0.70159 0.22555 0.02281 misf5 3.12821 3.08333 0.69366 0.76429 0.38386 0.82058

Alpha = 0.05 ; N early = 39 and N late = 36 Alpha = 0.05 ; N early = 39 and N late = 36

Summary items nccp3 and perm 9 show significant mean difference

Table 4-2 Summary of the non-response bias test

4.7. Summary

This chapter presented the method, statistical techniques, the population and the variables used in this research. One of the points to note is that RM employees have a set of skills that include literacy in information system tools. Additionally, because of the kind of work that they perform, in general, there are some other skills related to managing figures, performing quantitative analysis and producing reports that involve different risks and different risk levels of decisions. The decision to survey this population is appropriate because this type of respondent gives value to structured tools of gathering data and the questionnaire worked properly for them.

This chapter has covered the research methodology. The first part identified the sample and the variables used in particular indicating the transformation required for a better measure of the variable through the scores of the items. It described the use of descriptive statistics and regression models to identify the relationships

of the variables in order to test the hypotheses. Equally, it indicated the importance of tests for the reliability and validity of the variables and for supporting the size and the criteria to assess models through power analysis.

In the next chapter there is a presentation of the results of the research. The sections have been divided into the results coming from exploratory data analysis and those from multivariate analysis.

In document Knowledge management applied to enterprise risk management (Page 162-173)