Chapter 4 - Data Analysis Techniques and Procedures
4.4 Analysing Data
One of the initial steps in analysing a data set is to examine its frequency distribution, which is commonly displayed as a histogram. The data is plotted on a graph, with the observed values on the X axis and the Y axis showing how many times each value occurred within the data (Field, 2013). Frequency distributions can take many forms, but ideally the data follows a normal distribution. A distribution can deviate from normal in two main ways. The first is skewness, which is a lack of symmetry within the data set. In a skewed distribution the scores are clustered at one end of the scale, either positively or negatively, rather than symmetrically around the centre (Pallant, 2013). In a negatively skewed distribution the scores are clustered at the right-hand side of the graph, representing high values, whereas in a positively skewed distribution low values are bunched at the left of the graph (Pallant, 2013).
Secondly, the distribution could deviate through ‘peakedness’, which is known as kurtosis. This relates to the degree to which scores cluster at the ends of the distribution, which are known as the tails (Field, 2013). In a normal distribution the values of skewness and kurtosis are 0, although this is rather uncommon in the social sciences (Pallant, 2013). The further these values are from 0, the more likely it is that the data is not normally distributed. Pallant (2013) and Tabachnick and Fidell (2013) suggest that, alongside these values, it is also necessary to inspect the histograms, as they provide the researcher with a graphical representation of the distribution. Based on the skewness and kurtosis values and the shape of the histograms, the current data is not normally distributed. Nevertheless, Pallant (2013) notes that many measurements or scales within the social sciences are positively or negatively skewed. This does not necessarily indicate a problem with the scale, but rather reflects the underlying nature of the variable itself. For example, the scale item ‘I was pleased that we were hosting the London 2012 Games’ was negatively skewed due to the nature of the question, as most participants would be expected to agree that they were pleased to be hosting this SME.
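As an illustration of how skewness and kurtosis values are obtained, the following is a minimal Python sketch using simple moment-based formulas. The function names and the response values are hypothetical and are not taken from the study. Statistical packages typically apply a small-sample adjustment, so their output may differ slightly from these figures.

```python
from statistics import fmean

def central_moment(scores, order):
    """Average of (score - mean) ** order across all scores."""
    centre = fmean(scores)
    return sum((x - centre) ** order for x in scores) / len(scores)

def skewness(scores):
    """Third central moment scaled by the cubed standard deviation."""
    return central_moment(scores, 3) / central_moment(scores, 2) ** 1.5

def excess_kurtosis(scores):
    """Fourth central moment over the squared variance, minus 3,
    so that a normal distribution scores 0."""
    return central_moment(scores, 4) / central_moment(scores, 2) ** 2 - 3.0

# Hypothetical responses to a 1-7 agreement item (not the study's data).
pleased_item = [7, 7, 7, 6, 7, 6, 5, 7, 6, 7, 4, 7, 6, 2, 7]

print(skewness(pleased_item))         # negative: scores pile up at the high end
print(excess_kurtosis(pleased_item))  # distance from the normal value of 0
```

A negative skewness value here corresponds to the pattern described above, where most respondents select the highest points of the scale and a small number of low scores form a tail to the left.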
Alongside skewness and kurtosis, normality can be tested through the Kolmogorov-Smirnov statistic, which assesses the normality of the distribution of scores (Pallant, 2013). For a normally distributed sample the test is non-significant (p > .05), meaning that normality is assumed when the significance value is above .05 (Field, 2013). On the other hand, when the result is significant (p < .05), the distribution differs significantly from a normal distribution and normality cannot be assumed (Field, 2013). For the current data set, all variables produced a significant Kolmogorov-Smirnov result (p < .05), suggesting that the data is not normally distributed. Yet Pallant (2013) states that this is common in certain types of social science research due to the nature of certain questions, as highlighted above, which can lead to a positively or negatively skewed result.
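The logic behind the Kolmogorov-Smirnov statistic can be sketched in Python as the largest gap between the cumulative distribution of the observed scores and that of a normal curve fitted to the same mean and standard deviation. This is only an illustrative sketch with hypothetical data and a hypothetical function name. It stops at the test statistic (D), since the associated significance value depends on the sample size and on the correction applied by the software.

```python
from statistics import NormalDist, fmean, stdev

def ks_statistic_normal(scores):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    ordered = sorted(scores)
    n = len(ordered)
    fitted = NormalDist(fmean(ordered), stdev(ordered))
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF steps from (i - 1) / n up to i / n at x,
        # so the gap is checked on both sides of the step.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Hypothetical 1-7 item with most responses at the top of the scale.
skewed_item = [7] * 12 + [6] * 4 + [5, 4, 2, 1]

# Hypothetical comparison sample placed at evenly spaced normal quantiles.
near_normal = [NormalDist(4, 1).inv_cdf((i - 0.5) / 20) for i in range(1, 21)]

print(ks_statistic_normal(skewed_item))  # large gap from the fitted normal curve
print(ks_statistic_normal(near_normal))  # small gap
```

The larger D is, the less plausible it is that the scores come from a normal distribution, which is what a significant result (p < .05) reflects.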
The median or mode can be used to describe the centre of a frequency distribution (central tendency). The median is relatively unaffected by skewed distributions and by extreme scores at either end of the distribution, and it can be used with ordinal, interval or ratio data (Field, 2013). The mode is the score that occurs most frequently in the data, yet it can be difficult to interpret when two or more values occur equally often (Field, 2013). The mean, on the other hand, is a parametric measure, which can be distorted if the data set is heavily skewed (Pallant, 2013). The mean is the most popular measure of central tendency and is the average score of all observations made for that variable (Gratton and Jones, 2004). It provides a hypothetical estimate of a typical score, yet it can be influenced by extreme scores or skewed data and can only be used with ratio or interval data (Field, 2013).
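The difference between these measures can be illustrated with a short Python sketch using the standard library. The function name and the scores are hypothetical and serve only to show how a single extreme score affects the mean more than the median.

```python
from statistics import mean, median, multimode

def central_tendency(scores):
    """Return the mean, the median and every most-frequent score."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "modes": multimode(scores),  # a list, because ties are possible
    }

# Hypothetical 1-7 item: mostly high agreement with one extreme low score.
item_scores = [7, 7, 6, 7, 5, 7, 6, 1, 7, 6]
summary = central_tendency(item_scores)
print(summary["mean"])    # pulled down by the single score of 1
print(summary["median"])  # middle of the ordered scores
print(summary["modes"])

# When two values occur equally often, both are reported as modes.
print(central_tendency([2, 2, 5, 5, 6])["modes"])
```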
Measures of dispersion specify the spread around the point of central tendency, whether that be the mean or the median, for example (Gratton and Jones, 2004). Field (2013) states that the easiest way to calculate dispersion is through the range (subtracting the lowest score from the highest score), yet this can be affected by extreme scores. One way around this is to exclude the extremes of the distribution and then calculate the range, a non-parametric measure known as the interquartile range (IQR) (Field, 2013). This process eliminates the top and bottom 25% of scores and calculates the range of the remaining 50%. The most common measure of dispersion is the standard deviation (SD), which is the square root of the variance (Field, 2013). It is a parametric measure and is used alongside the mean. It provides an indication of the spread of the data and measures how well the mean represents the data set. Field (2013) states that a standard deviation of 0 indicates that all scores are the same. The larger the SD is relative to the mean, the further the data points are dispersed from the mean, whereas the smaller the SD, the closer the data is to the mean value (Field, 2013).
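The three measures of dispersion described above can be sketched in Python as follows. The function name and scores are hypothetical. Quartile conventions also differ between statistical packages, so the IQR produced here may not match other software exactly.

```python
from statistics import quantiles, stdev, variance

def dispersion(scores):
    """Return the range, interquartile range and standard deviation."""
    q1, _, q3 = quantiles(scores, n=4)  # cut points at 25%, 50% and 75%
    return {
        "range": max(scores) - min(scores),  # highest minus lowest score
        "iqr": q3 - q1,                      # spread of the middle 50%
        "sd": stdev(scores),                 # square root of the variance
    }

# Hypothetical 1-7 item scores, including one extreme low value.
item_scores = [1, 5, 6, 6, 6, 7, 7, 7, 7]
spread = dispersion(item_scores)
print(spread["range"])  # stretched by the single score of 1
print(spread["iqr"])    # ignores the top and bottom quarters
print(spread["sd"])

# Identical scores give a standard deviation of 0.
print(dispersion([4, 4, 4, 4])["sd"])
```

The range reacts strongly to the single extreme score, whereas the IQR only reflects the middle half of the responses, which is why it is preferred for skewed data.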
It is essential to ensure that the appropriate level of measurement is used for Likert item and Likert scale data, as the appropriate descriptive and inferential statistics differ for interval and ordinal variables. Treating a variable at the wrong level of measurement may lead the researcher to use an inappropriate statistical method and possibly reach the wrong conclusion (Jamieson, 2004). Given the level of measurement, along with the skewed nature of the current data set, non-parametric descriptives (median and IQR) were the most suitable measures for individual Likert items. The mean and SD will also be presented alongside the median and IQR, due to the use of EFA and MANOVA in the later stages of the data analysis.
4.5 Outliers
Outliers are scores within the data set whose values lie well above or below the majority of other scores. There are many techniques by which these can be investigated, and they are a necessary consideration due to the sensitivity of many statistical tests to extreme values (Pallant, 2013). Tabachnick and Fidell (2013) state that there are multiple causes of outliers. The first is incorrect data entry, which was checked using the minimum and maximum values displayed for each scale item. This research utilised a scale from 1 to 7, so any value outside this range would have been evident as an outlier. None were identified during data screening, because the pre-set scale selection buttons on the online survey ensured that participants were unable to select a value outside the predetermined range. Secondly, there could be an error within the missing-value codes in the computer's syntax, which would result in the computer analysing a missing value as a genuine score (Tabachnick and Fidell, 2013). This was avoided because any missing values identified in the raw data set, typically within the demographic questions, were coded in the missing value section of IBM SPSS as ‘99’, to ensure that the software did not misinterpret the code as a genuine value (Pallant, 2013).
Tabachnick and Fidell (2013) also state that the third way outliers occur is when a participant who is not a member of the intended population is included in the sample. This issue was addressed through the selection criteria for the NGB staff and by the contact method used, which relied on individuals' personalised emails and meant that an unintended response was not possible. Results were only received from intended participants, but the distribution across the sample itself included some extreme values alongside the predominantly normal distribution. This could be because individuals from the selected population had a range of experiences, alongside other organisational or personal factors that may have influenced their scores. This resulted in some individuals holding opinions or viewpoints that fell outside the normal distribution. For example, some NGBs were not part of the London 2012 Games, meaning their attitudes and opinions differed from the typical responses seen within the data set.
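The first two screening checks described in this section (the minimum and maximum check and the missing-value code) can be sketched in Python as follows. This is an illustration only. The constant names, the function name and the raw scores are hypothetical, and the sketch does not reproduce the SPSS procedure used in the study.

```python
from statistics import mean

MISSING_CODE = 99              # code assumed to mark a missing response
SCALE_MIN, SCALE_MAX = 1, 7    # valid range of the scale

def screen_item(raw_scores):
    """Separate valid scores from missing codes and out-of-range entries."""
    recorded = [s for s in raw_scores if s != MISSING_CODE]
    missing_count = len(raw_scores) - len(recorded)
    out_of_range = [s for s in recorded if not SCALE_MIN <= s <= SCALE_MAX]
    valid = [s for s in recorded if SCALE_MIN <= s <= SCALE_MAX]
    return valid, missing_count, out_of_range

# Hypothetical raw scores: one missing response (99) and one entry error (8).
raw = [7, 6, 99, 5, 8, 1]
valid, missing_count, out_of_range = screen_item(raw)

print(valid, missing_count, out_of_range)
print(mean(raw))    # distorted if 99 is analysed as a genuine score
print(mean(valid))  # based on valid responses only
```

Declaring the missing-value code before analysis prevents it from being treated as a genuine score, which would otherwise inflate the mean and appear as an extreme outlier.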