
3. Methodology of the study

3.3 Data collection procedures

3.3.1 Research data

3.3.2.7 Data analysis

This section is the concluding part of the evaluation process and examines the quality of the extracted data. Put differently, it is the practical means of evaluating the viability of the data grouped under each variable. The evaluation is carried out through a descriptive analysis of the data and through graphical illustration of the analysed data; the graphical illustration makes it easier to determine the spread of the scores [data values]. Data analysis can be described as ‘a method of managing and transforming the feasibility of the data acquired, assessed and organised into a reliable result’ (Volunteer Estuary Monitoring 2006; Batini et al. 2009; Ehnes & Niu 2012). The major part of this section has already been explained in subsection 3.3.2.5 above.

To support the data analysis, a platform was developed to substantiate the tasks performed by data acquisition and data assessment in the evaluation process. This platform, referred to as the ‘data analysis platform’, is the medium for carrying out the descriptive analysis of the data in order to explore the relevance of the data grouped into variables. The platform is also designed to accommodate the data coding process. In this research, data analysis is used to carry out tasks such as:

• Differentiating the practical data from the impractical data,

• Obtaining analysis results that display the significance or practicality of the data fields or variables, and

• Understanding the correlations and disparities between the related data fields.

The above-mentioned tasks were used as a yardstick in developing the approach to uncover the reasons behind the large loss of data, or of quality in the data gathered by the reporting officer, in the process of completing the ARF.

3.3.2.7.1 Data analysis platform

This platform was developed in Excel to explain the importance and reliability of the assembled data. It was set up to give a clear view of the realism and distribution of the data points. In this section, descriptive analyses were performed to ascertain the distribution of scores across each variable, using measures of central tendency, measures of dispersion, and quartile ranges [percentiles], among others. The platform was also developed to support graphical representation of the distributions of the variables. In addition, it provides a basis for correlating variables in order to determine the dependence between different variables.
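As an illustration of the dependence check between two variables, the following is a minimal sketch in Python rather than Excel. It assumes the Pearson correlation coefficient, which the text does not name, and the function name `pearson_r` and the sample counts are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumption: the source does not state which correlation measure was used;
    Pearson's r is shown here only as one common choice.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]  # deviations from the mean of xs
    dy = [y - mean_y for y in ys]  # deviations from the mean of ys
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly non-captured counts for two data fields.
field_a = [12, 15, 9, 20, 18, 11]
field_b = [10, 14, 8, 19, 17, 12]
r = pearson_r(field_a, field_b)
print(f"r = {r:.3f}")
```

A value of r close to +1 or -1 would suggest strong linear dependence between the two fields, while a value near 0 would suggest little linear dependence.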

Ultimately, the platform provides a concise overview of the results of the analysed data for better understanding. The extraction processes discussed in subsection 3.3.2.5 above were executed simultaneously with the analysis of the data. In this process, all variables were analysed according to their data type. In some cases, class intervals had to be established in order to understand the meaning of the assembled data clearly, while other analyses were based on the direct application of measures of dispersion and measures of central tendency to show the weight of the data distributions. Where class intervals were used, the frequency of the data was obtained, and the histogram and quartile ranges of the data were analysed to improve the clarity of the distributions.
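The descriptive measures mentioned above (central tendency, dispersion and quartile ranges) can be sketched as follows. This is a Python illustration using the standard `statistics` module, not the Excel platform itself; the `describe` helper and the sample scores are hypothetical, and the quartile method shown may differ from the one used in the platform.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for one variable."""
    # Quartile cut points; the default 'exclusive' method is an assumption.
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),      # central tendency
        "median": statistics.median(values),  # central tendency
        "range": max(values) - min(values),   # dispersion
        "stdev": statistics.stdev(values),    # sample standard deviation
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,                       # interquartile range
    }


# Hypothetical scores for a single variable.
scores = [4, 7, 7, 9, 12, 15, 21, 30]
summary = describe(scores)
for name, value in summary.items():
    print(f"{name}: {value}")
```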

The analysis results obtained demonstrate the distribution of the variables as illustrated in the analysis platform. These results showed the purposefulness of the data distributed across each variable and also clarified the difference between the captured data and the non-captured data [uncaptured data] with regard to the findings on the quality of data collected by the reporting officers. The practical outcomes of the analysis are discussed in the following chapters in both tabular and graphical layouts.

3.3.2.7.2 Mathematical approach for the average estimates of the non-captured data

The non-captured data grouped per variable were further evaluated by determining the average estimates of the non-captured data as grouped into three related factors. Statistically, the average estimate of the non-captured data obtained in each field is simply the ‘ratio of the sum of non-captured data in each field per month, to the amount of non-captured data in each field’ (Schmuller 2009), as given in Equation 1 below.

$$\bar{X} = \frac{\sum X}{N} \qquad \text{(Equation 1)}$$

(Montgomery & Runger 2007; Schmuller 2009; Lane et al. 2013)

where

$\bar{X}$ = mean/average of the non-captured data per field,

$\sum X$ = sum of non-captured data per month,

$N$ = number of non-captured data summed up.
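As a small worked illustration of Equation 1, the sketch below computes the average of the non-captured data for one field. The function name `average_non_captured` and the monthly counts are hypothetical and are not taken from the study's data.

```python
def average_non_captured(monthly_counts):
    """Equation 1: sum of the monthly non-captured counts divided by N."""
    n = len(monthly_counts)  # N, the number of values summed up
    if n == 0:
        raise ValueError("at least one monthly count is required")
    return sum(monthly_counts) / n


# Hypothetical non-captured counts for one data field over six months.
monthly_counts = [12, 15, 9, 20, 18, 11]
mean_value = average_non_captured(monthly_counts)  # 85 / 6
print(f"average non-captured per month: {mean_value:.2f}")
```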

3.3.2.7.3 Numerical approach for the histogram of the non-captured data

As part of the evaluation process performed on the extracted non-captured data, a histogram is regarded as ‘the graphical approach suitable for illustration of the frequency distributions of ranges of scores’ (Montgomery & Runger 2007; Lane et al. 2013; Lane 2015). The frequency distributions of the non-captured data estimates acquired in all the data fields in each related factor are displayed in two separate tables in all the analysis sections discussed in this study.

Separating the eight variables/data fields into two tables makes it easier to understand the frequency distributions of the non-captured data across all the data fields in each related factor. In addition, in most cases the range between the maximum and minimum scores is very wide owing to high variation in the rate at which data are missing or mishandled in each field; hence, it is difficult to arrive at a sensible class interval/bin for computing the frequency distributions (Montgomery & Runger 2007).
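To make the idea of a frequency distribution over class intervals concrete, the following is a minimal Python sketch that counts values in equal-width bins. The helper `frequency_distribution`, the choice of equal-width bins and the sample values are assumptions made for illustration only.

```python
def frequency_distribution(values, num_bins):
    """Count how many values fall in each of `num_bins` equal-width intervals.

    Returns a list of ((lower, upper), count) pairs.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # class width
    if width == 0:
        raise ValueError("all values are identical; no class width can be formed")
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last interval.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    bins = [(lo + i * width, lo + (i + 1) * width) for i in range(num_bins)]
    return list(zip(bins, counts))


# Hypothetical non-captured estimates for one data field.
values = [1, 2, 2, 3, 5, 8, 9, 10]
table = frequency_distribution(values, num_bins=3)
for (lower, upper), count in table:
    print(f"{lower:5.1f} to {upper:5.1f}: {count}")
```

When the range is very wide and most values cluster at one end, a table like this concentrates nearly all counts in a single interval, which is the difficulty described above.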

To obtain an appropriate class width for forming the class intervals/bins, two different methods that give similar results were considered. The first is Sturges’ rule, which sets the number of class intervals as close as possible to the value given by Equation 2 below.

$$Q = 1 + \log_2 N \qquad \text{(Equation 2)}$$

Here the result is rounded to the nearest integer (Lane et al. 2013; Lane 2015), and $\log_2 N$ is the base-2 logarithm of the number of observations (Lane 2015). The formula can also be rewritten in an alternative form, as shown in Equation 3 below.

$$Q = 1 + 3.3\log_{10} N \qquad \text{(Equation 3)}$$

where $\log_{10} N$ is the base-10 logarithm of the number of observations and $Q$ is the number of class intervals/bins in the histogram (Cimbala 2014; Lane 2015).
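The two forms of Sturges’ rule in Equations 2 and 3 can be compared with a short Python sketch. The function names are hypothetical. The coefficient 3.3 is an approximation of 1/log10(2) ≈ 3.32, so the two forms usually, though not necessarily always, round to the same integer.

```python
import math


def sturges_bins(n):
    """Equation 2: Q = 1 + log2(N), rounded to the nearest integer."""
    if n < 1:
        raise ValueError("the number of observations must be at least 1")
    return round(1 + math.log2(n))


def sturges_bins_log10(n):
    """Equation 3: Q = 1 + 3.3 * log10(N), rounded to the nearest integer."""
    if n < 1:
        raise ValueError("the number of observations must be at least 1")
    return round(1 + 3.3 * math.log10(n))


# Hypothetical numbers of observations.
for n in (36, 64, 100):
    print(n, sturges_bins(n), sturges_bins_log10(n))
```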

The second is the Rice rule, which sets the number of intervals to twice the cube root of the number of observations (Cimbala 2014). This gives the equation below.

$$Q = 2N^{1/3} \qquad \text{(Equation 4)}$$

Q was defined above under Sturges’ rule and N is the number of observations (Cimbala 2014; Lane 2015). These two methods make it feasible to choose a suitable class width and thus determine the number of class intervals/bins for an absolute frequency distribution (Lane 2015).
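The Rice rule and the resulting class width can be sketched in Python as follows. Rounding the Rice result to the nearest integer and taking the class width as the range divided by the number of bins are assumptions made here for illustration; the function names and sample values are hypothetical.

```python
import math


def rice_bins(n):
    """Equation 4: Q = 2 * N^(1/3); rounding to the nearest integer is assumed."""
    if n < 1:
        raise ValueError("the number of observations must be at least 1")
    return round(2 * n ** (1 / 3))


def class_width(values, num_bins):
    """Class width taken as (maximum - minimum) / number of class intervals."""
    return (max(values) - min(values)) / num_bins


# Compare the Rice rule with Sturges' rule (Equation 2) for hypothetical N.
for n in (27, 64, 100):
    sturges = round(1 + math.log2(n))
    print(f"N={n}: Sturges Q={sturges}, Rice Q={rice_bins(n)}")

# Hypothetical data: the class width for three bins over the range 1 to 10.
width = class_width([1, 4, 7, 10], num_bins=3)
print(f"class width = {width}")
```

For the sample sizes shown, the Rice rule gives a slightly larger number of bins than Sturges’ rule, and therefore a slightly narrower class width for the same range.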

In document Investigating quality of data and the need for the restructuring of accident report form in South Africa (Page 117-121)