Descriptive statistics - The data analysis strategy

4.3 The data analysis strategy

4.3.1 Descriptive statistics

Descriptive statistics are the first step in the data analysis process and are performed with Microsoft Excel and Statistica. These statistics merely describe the data and do not aim to predict new data points. The variable classes are compared, with a focus on readmission throughout the various analyses. The variables can be described and compared with each other in numerous ways, but because this is not the focus of the project, the full set of comparisons is omitted here for brevity. It is, however, presented to the clinical SMEs. The results of the descriptive analyses are presented in Section 5.1.

One of the initial analysis steps in Statistica is to draw a histogram of each variable. The histograms display the frequency of entries per group in each variable and are valuable for identifying outliers or text in numerical data. Statistica allocates numerical values to text data starting from the number 9999, which is far larger than any age or LOS, so text such as ‘Still admitted’, which may occur in the LOS variable, is easily identifiable on the histogram. Classes with few observations compared to the other classes in the variable, such as Area 5 in Figure 4.7, can also be identified from the histograms. This is a common phenomenon in medical data. When it occurs, merging the groups has to be considered and discussed with both the statistical and clinical SMEs. The risk pertaining to the smaller classes is that a small number of observations is used to draw conclusions that are inferred back to the population. In addition, the rules that predictive models derive from few data points may not be able to classify new data correctly (Kidd, 2016a; Izenman, 2008). An example of a histogram for continuous data is displayed in Figure 4.8, which shows the age of the patients on their first admission; the ‘1’ in the variable names Age1 and Area1 indicates the first admission.

Figure 4.7: Histogram displaying the number of patients admitted from Area 1 - 5.

Figure 4.8: Histogram displaying the age of patients.
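The screening that the histograms support can also be sketched in code. The following Python fragment is only an illustration and not part of the Statistica workflow: the data values, the function names and the 5% cut-off for a "sparse" class are all assumptions made for the example.

```python
from collections import Counter


def non_numeric_entries(values):
    """Return the entries that cannot be parsed as numbers (e.g. 'Still admitted')."""
    bad = []
    for value in values:
        try:
            float(value)
        except (TypeError, ValueError):
            bad.append(value)
    return bad


def sparse_classes(labels, min_share=0.05):
    """Return the classes whose share of all observations is below min_share.

    The 5% default is an arbitrary illustrative threshold, not a rule from the text.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return sorted(c for c, n in counts.items() if n / total < min_share)


# Hypothetical data, invented for this example only.
los = ["3", "12", "Still admitted", "7", "21"]
area = (["Area 1"] * 40 + ["Area 2"] * 30 + ["Area 3"] * 20
        + ["Area 4"] * 8 + ["Area 5"] * 2)

print(non_numeric_entries(los))   # text hidden in a numerical variable
print(sparse_classes(area))       # candidate classes for merging
```

Whether a flagged class is actually merged remains a decision for the statistical and clinical SMEs; the sketch only identifies candidates.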

Next, Statistica is used to conduct chi-square tests and analysis of variance (ANOVA). The chi-square test is nonparametric, which implies that no assumptions about the distribution of the data have to be verified. The test investigates the categorical variables with regard to the dependent variable, which has two classes, namely readmitted (1) or not readmitted (0). Figure 4.9 and Figure 4.10 display the results of a chi-square test for the area from which a patient originates versus readmission. Figure 4.9 displays a categorised histogram of the distribution of each class (Area 1-5) with regard to readmission. Figure 4.10 displays the chi-square statistic, from which it can be determined whether there is a significant association between readmission (0/1) and the area a patient is admitted from.

In this project a p-value of less than 0.05 is regarded as significant, as this is the most commonly applied threshold. This means that, if there were in fact no difference, a result at least as extreme as the one observed would be expected less than five percent of the time. In some cases p-values of less than 0.1 are also mentioned, which corresponds to a 10% significance level (90% confidence).

In this example, the Pearson statistic shows that the difference between readmitted and non-readmitted patients with regard to their area is close to, but does not reach, significance. These graphs and tests already convey valuable information about the variables and readmission, but it should be noted that the combined effect of other variables is not taken into account. A table of the descriptive statistics, which contains the same information as displayed in the categorised histograms, is also generated in the output.

Figure 4.9: Categorised histograms generated by the chi-square analysis displaying the areas from which patients are admitted and if they are readmitted or not (at the first admission).

Figure 4.10: Pearson chi-square statistic for area versus readmission.
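The Pearson chi-square statistic reported by Statistica can be reproduced by hand from a table of counts. The sketch below is a minimal Python illustration under stated assumptions: the counts are invented, the function names are hypothetical, and the p-value helper uses a closed form that is only valid for an even number of degrees of freedom (which holds for five areas by two readmission classes, giving df = 4). In practice a statistics package would compute the p-value for any df.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table of counts.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi_square_p_value_even_df(stat, df):
    """Upper-tail p-value of the chi-square distribution (closed form, even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = stat / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Hypothetical counts: one row per area, columns = (not readmitted, readmitted).
counts = [[60, 20], [45, 15], [30, 10], [20, 12], [5, 3]]

stat, df = chi_square_independence(counts)
p_value = chi_square_p_value_even_df(stat, df)
print(f"chi-square = {stat:.3f}, df = {df}, p = {p_value:.3f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

Note that the last row of the hypothetical table has small counts; this is the same sparse-class concern raised for Area 5, since small expected counts make the chi-square approximation less reliable.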

ANOVA is a parametric test used for comparing the mean of a continuous variable, such as age, between the readmitted and non-readmitted groups. If there is a difference, it implies that age might be an indicator for readmission, which is the same reasoning used with the chi-square test. Two assumptions have to be verified when conducting ANOVA tests: (i) the data is approximately normally distributed, and (ii) the two groups have equal variances. ANOVA is quite robust with respect to normality, which means that deviation from this assumption does not have a large effect on the probability of a Type I error (incorrectly rejecting a true null hypothesis). This is however only applicable if neither group is more than 1.5 times the size of the other group. If the normality assumption is violated, the data can be transformed or the Mann-Whitney U test can be conducted (Laerd Statistics, 2013b). The ANOVA test is run to generate the following output:

Descriptive statistics table which displays the mean, standard deviation, sample sizes and confidence intervals for the parameters;

Levene’s test used to evaluate the assumption of homogeneity of variances;

Normal probability plot used to graphically determine if the data is normally distributed; and

Least squares means plot displaying the mean and CI for the continuous variable in both the readmission classes (0/1) along with the p-value of the ANOVA test indicating if there is a significant difference between the two groups with regard to the independent variable.
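To make the relationship between two of these outputs concrete, the sketch below computes the one-way ANOVA F statistic and Levene's statistic in plain Python. It is an illustration only: the ages are invented, the function names are hypothetical, and it is an assumption that the mean-centred form of Levene's test shown here matches what Statistica reports. The p-values are deliberately not computed, since they require the F distribution, which a statistics package would supply.

```python
from statistics import mean


def one_way_f(groups):
    """F statistic of a one-way ANOVA plus its (between, within) degrees of freedom."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    group_means = [mean(g) for g in groups]
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_stat, k - 1, n - k


def levene_w(groups):
    """Levene's statistic: a one-way ANOVA on absolute deviations from each group mean."""
    abs_dev = [[abs(x - mean(g)) for x in g] for g in groups]
    return one_way_f(abs_dev)


# Hypothetical ages at first admission, invented for this example.
age_not_readmitted = [23, 35, 41, 29, 52, 38, 47, 31]
age_readmitted = [27, 33, 45, 36, 50, 30]

f_stat, df1, df2 = one_way_f([age_not_readmitted, age_readmitted])
w_stat, _, _ = levene_w([age_not_readmitted, age_readmitted])
print(f"ANOVA F({df1}, {df2}) = {f_stat:.3f}")
print(f"Levene W = {w_stat:.3f}")
```

A large Levene statistic would signal unequal variances, in which case the ANOVA F statistic should be interpreted with more caution.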

Normality can be determined numerically or graphically. The numerical test has the advantage of objectivity, but is sometimes overly sensitive with large samples and not sensitive enough with smaller samples. The graphical method is therefore used in this project, with the statistical SME assisting in interpreting the graphs correctly (Kidd, 2016a; Laerd Statistics, 2013c).

The larger the sample size, the less effect a deviation from the assumptions has on the test. As the sample size increases (> 40), violation of the normality assumption causes less concern owing to the central limit theorem² (Elliott & Woodward, 2007). The dataset for the first admission (both readmitted and not readmitted) can be considered large, as it contains considerably more than 40 observations. Additionally, a significant p-value is generally much more likely with larger samples, which may occur with the dataset (Kidd, 2016b; Field, 2009). Thus, in addition to the Mann-Whitney U and ANOVA tests, a macro is included to calculate Cohen’s effect size, which compares the means of two groups by dividing the difference in means by the average of the standard deviations. A value of 0.2 is regarded as small, 0.5 as medium and 0.8 or more as large (Sullivan & Feinn, 2012).
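The effect size calculation described above can be sketched as follows. This is not the macro used in the project; it is a minimal Python illustration with invented data and hypothetical function names. By default it divides by the average of the two standard deviations, as described in the text; the optional `pooled` flag uses the pooled standard deviation instead, which is another common formulation. Treating 0.2, 0.5 and 0.8 as lower cut-offs, and labelling anything below 0.2 as negligible, is an assumption made for the example.

```python
import math
from statistics import mean, stdev


def cohens_d(group_a, group_b, pooled=False):
    """Difference in means divided by a standard deviation shared by the two groups."""
    sd_a, sd_b = stdev(group_a), stdev(group_b)
    if pooled:
        n_a, n_b = len(group_a), len(group_b)
        denominator = math.sqrt(((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2)
                                / (n_a + n_b - 2))
    else:
        denominator = (sd_a + sd_b) / 2  # average of the standard deviations
    return (mean(group_a) - mean(group_b)) / denominator


def effect_label(d):
    """Map |d| onto the small / medium / large guideline values (cut-offs assumed)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


# Hypothetical ages, invented for this example.
age_not_readmitted = [23, 35, 41, 29, 52, 38, 47, 31]
age_readmitted = [27, 33, 45, 36, 50, 30]

d = cohens_d(age_not_readmitted, age_readmitted)
print(f"Cohen's d = {d:.3f} ({effect_label(d)})")
```

Unlike the p-value, the effect size does not grow with the sample size, which is why it is a useful companion to the significance tests on a large dataset.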

An example of the least squares means plot and the normal plot for age on first admission versus readmission is displayed in Figure 4.11 and Figure 4.12 respectively. From the normal plot in Figure 4.12, it can be seen that the ages at first admission are not normally distributed. In the top screenshot (iii) of Figure 4.13, Levene’s p-value indicates that the variances do not differ significantly, which satisfies the assumption of equal variances. Owing to the sample size being large (> 40), violation of the normality assumption is not alarming and the ANOVA results will still be taken into account in conjunction with the Mann-Whitney U test and Cohen’s effect size. The results of the Mann-Whitney test and the descriptive statistics with Cohen’s effect size included are displayed in Figure 4.13. From Figure 4.11, the p-value of more than 0.05 indicates that the null hypothesis, which states that the means are equal, is not rejected. This indicates that the means of the two groups do not differ significantly. The Mann-Whitney test supports this with a p-value larger than 0.05 and, similarly, the effect size indicates a negligible difference.

² The central limit theorem states that sample means are approximately normally distributed for moderately large samples (> 40) even though the population might not be normally distributed (Elliott & Woodward, 2007).


Figure 4.11: Least squares means plot of the age variable versus readmission.

Figure 4.12: Normal plot for the age variable.

Figure 4.13: Screenshots of the (i) descriptive statistics with Cohen’s effect size; (ii) the Mann-Whitney U test; and (iii) Levene’s test with regard to age and readmission.

The Mann-Whitney U test is nonparametric and does not require the assumption of normality. The p-value can be compared to ANOVA’s p-value and is interpreted in the same way. The test has a few basic assumptions which this project’s dataset satisfies. The first assumption is that the dependent variable should be either ordinal or continuous; in this project it is continuous (years or days). Secondly, the grouping variable (readmitted or not) should be categorical, which it is, and the groups should be independent of each other, which they are. Thirdly, the observations should be independent of each other, thus there should not be a relationship between the observations in each group (a patient only occurs once) (Laerd Statistics, 2015).
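For illustration, the Mann-Whitney U test can be sketched in a few lines of Python. This is not the Statistica procedure: the data are invented, the function names are hypothetical, and the two-sided p-value uses the large-sample normal approximation without a tie or continuity correction, so it may differ slightly from the output of a statistics package.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U, two-sided p-value) using the normal approximation."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    u = min(u_a, n_a * n_b - u_a)          # smaller of the two U values
    mu = n_a * n_b / 2
    sigma = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mu) / sigma                   # z <= 0 because u is the smaller U
    return u, min(1.0, 2 * NormalDist().cdf(z))


# Hypothetical ages, invented for this example.
age_not_readmitted = [23, 35, 41, 29, 52, 38, 47, 31]
age_readmitted = [27, 33, 45, 36, 50, 30]

u_stat, p_value = mann_whitney_u(age_not_readmitted, age_readmitted)
print(f"U = {u_stat}, p = {p_value:.3f}")
```

Because the test works on ranks rather than on the raw values, it does not depend on the ages being normally distributed, which is why it serves as a cross-check on the ANOVA result.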

In document Investigating the feasibility of crisis-discharge decision-support to reduce readmission rates at a psychiatric ward. (Page 113-116)