
3. Methodology of the study

3.3 Data collection procedures

3.3.1 Research data

3.3.2.5 Data codification

Data codification can be described as ‘the process implemented in transforming information into data points, which are considered appropriate for statistical analysis purpose’ (Schoenbach 2014). This step is significant for the analysis because it prepares the data for grouping [assemblage] and graphical illustration. The procedure was introduced after both the unstructured and semi-structured data had been captured into the dataset developed in Excel [the data analysis tool], with all the data gathered in monthly sequence. In essence, the data codification procedure extracts data points from the dataset into data groups within the same data analysis tool. The dataset contains the data points of both the unstructured and semi-structured data, measured using the dimensional metric discussed in subsection 3.3.2.2 above.

Moreover, data points were extracted into variable groups classified under each data field presented in the ARF, through the application of statistical formulas. The variables processed in this research are categorical variables, that is, variables whose observations are counted and grouped into categories for analysis (Decoster 2006). Both captured and non-captured data were extracted through the data codification process in the data analysis tool. The purpose of the extraction is to organise the data into groups, to interpret the data by understanding its practical meaning, and to analyse the data in order to derive practical solutions to real-world problems.

3.3.2.5.1 Extraction of captured data

The captured data were mined into groups of categorical variables based on the appropriate assemblage of the variables in each field. The numerical formulas applied were formulated according to the type of data points prepared for analysis. The captured data consist of all available data points classified as fit and usable for analysis. Some observable variables were obtained by counting the number of occurrences of each categorical variable, while others were obtained by formulating class intervals, frequencies and percentiles, depending on the type of data.

In practice, the numerical formulas applied depend on the nature and interpretation of the data categorised in each variable. Excel functions such as CountA, Countblank, Countif, Sumif and Frequency were applied to these variables. CountA and Countblank were combined to obtain the complete number of reported accidents per month, with a full count of the dates and days on which these accidents were reported. Countif and Sumif, on the other hand, were applied to obtain the number of observations of some important variables. For instance, Countif is applied in a case like:

total accidents occurring on weekdays = number of occurrences of each weekday on which a road accident occurred.

A similar data extraction approach is applied to variables or data fields such as Built-up area, Speed limit on road, Drivers’/cyclists’ countries, Gender, Severity of injury, Road type, Vehicle type and many other important variables.
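The counting logic described above can be sketched outside Excel. The following Python fragment is only an illustration under stated assumptions: the `day_of_week` column and its values are hypothetical, `None` stands for an empty cell, and the helper names `count_a`, `count_blank` and `count_if` are invented here to mirror the spreadsheet functions; they are not taken from the study's workbook.

```python
from collections import Counter

# Hypothetical 'Day of week' column for one month; None marks an empty cell.
day_of_week = ["Monday", "Friday", None, "Monday",
               "Saturday", "Friday", "Monday", None]


def count_a(cells):
    """Count non-empty cells, in the manner of CountA."""
    return sum(1 for c in cells if c is not None)


def count_blank(cells):
    """Count empty cells, in the manner of Countblank."""
    return sum(1 for c in cells if c is None)


def count_if(cells, value):
    """Count cells equal to `value`, in the manner of Countif."""
    return sum(1 for c in cells if c == value)


# Filled plus empty cells give the full number of reported accidents.
total_reported = count_a(day_of_week) + count_blank(day_of_week)

# Occurrences of each weekday among the captured (non-empty) cells.
per_weekday = Counter(c for c in day_of_week if c is not None)
```

Here `count_if(day_of_week, "Monday")` plays the role of a single Countif cell, while `per_weekday` collects the same count for every weekday at once.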

In applying Sumif, two different variables were integrated to generate one outcome. The approach compares two ways of generating the same result, and its usefulness depends on the reliability of the data points acquired. For instance, Sumif is applied in cases like ‘total number of vehicles in accidents in weekdays’ and ‘total number of vehicles involved in accidents in a month’. The relational connectivity considered for extracting the right data points in the two cases is the ‘number of vehicles involved in accidents’, with day of week used as the reference base in order to verify the integrity of the data assembled. This further assists in disclosing the extent of variation in the data points representing the ‘number of vehicles involved’. The outcome of this approach was benchmarked against a direct sum of the data points pertaining to single-vehicle and multiple-vehicle accidents, which yielded similar results.
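A minimal Python sketch of the Sumif comparison follows. The records, the `sum_if` helper and the single row with an empty day are all hypothetical and serve only to show how summing through the reference base and summing directly can be set side by side; the study itself reports that the two routes gave similar results.

```python
# Hypothetical records: (day of week, number of vehicles involved).
# None in the first position marks an accident with no day recorded.
records = [
    ("Monday", 1),
    ("Monday", 2),
    ("Friday", 3),
    ("Saturday", 1),
    ("Friday", 2),
    (None, 2),
]

WEEKDAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday"]


def sum_if(rows, day):
    """Sum the vehicle counts of rows whose day equals `day`, like Sumif."""
    return sum(vehicles for d, vehicles in rows if d == day)


# Route 1: sum vehicles per weekday (the reference base), then add them up.
vehicles_per_weekday = {day: sum_if(records, day) for day in WEEKDAYS}
total_via_weekdays = sum(vehicles_per_weekday.values())

# Route 2: direct sum of the 'number of vehicles involved' column.
direct_total = sum(vehicles for _, vehicles in records)

# Any gap between the two routes comes from rows without a usable weekday.
discrepancy = direct_total - total_via_weekdays
```

With a fully captured day column the discrepancy is zero; in this made-up sample it equals the vehicles attached to the one row with no day.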

The last numerical formula used in extracting data points is Frequency. This formula was used to classify data fields that require class intervals. The class intervals determine the complete distribution of the data belonging to a particular variable or field. For instance, for Time of accident, all the registered times of road accident occurrence were sorted to give a clearer view of the data acquired. Accidents are reported across all 24 hours of the day, so the times of occurrence vary widely. A class interval of one hour was therefore established over the periods in which accidents occurred, explicitly within the interval 12:59:00 am to 11:59:00 pm. This approach gave a better way of generating data points for the periods at which accidents occurred in the Stellenbosch area. A similar approach was used to determine the age distribution of accident victims: the classes range from 20 to 100 with a class interval of 20, which determines how the ages of the accident victims are grouped.
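The class-interval counting can be illustrated with a short Python sketch. It assumes that the stated values (12:59 am to 11:59 pm in one-hour steps, and 20 to 100 in steps of 20) act as upper class limits, which is how Excel's Frequency function treats its bins; the sample times and ages are hypothetical, and times are reduced to whole minutes for simplicity.

```python
from bisect import bisect_left
from datetime import time


def frequency(values, upper_limits):
    """Count values per class in the manner of Excel's Frequency.

    Class i holds values above the previous limit and up to and including
    upper_limits[i]; one extra class at the end holds values above the
    highest limit. `upper_limits` must be sorted in ascending order.
    """
    counts = [0] * (len(upper_limits) + 1)
    for v in values:
        counts[bisect_left(upper_limits, v)] += 1
    return counts


# Hypothetical accident times, converted to minutes after midnight.
accident_times = [time(0, 30), time(7, 15), time(7, 59),
                  time(8, 0), time(23, 59)]
minutes = [t.hour * 60 + t.minute for t in accident_times]

# Upper limits 00:59, 01:59, ..., 23:59 (assumed reading of the interval).
hour_limits = [h * 60 + 59 for h in range(24)]
time_counts = frequency(minutes, hour_limits)

# Hypothetical victim ages with assumed upper limits 20, 40, ..., 100.
ages = [18, 20, 25, 40, 41, 67, 83, 100]
age_limits = [20, 40, 60, 80, 100]
age_counts = frequency(ages, age_limits)
```

In this sample, `time_counts[7]` counts the two accidents between 07:00 and 07:59, and `age_counts[0]` counts the victims aged 20 or younger.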

3.3.2.5.2 Extraction of non-captured data

The mining of non-captured data was based on criteria defined in terms of errors found in the ARF: item non-response errors, that is, omission of relevant data; and response errors, that is, incorrect completion of relevant data. The criteria assess the non-captured data based on void or inappropriate completion of relevant data in the ARF. The degree of sufficiency of the road accident data depends exclusively on error reduction, and mining the non-captured data paves the way for simplifying the problems associated with errors found in the ARF. The mining process applies numerical formulas as in the previous section, namely Countif, Sumif and Countifs. Countif is applied in the same way as in the previous section: it operates on a single variable and requires no reference base. For instance, mining the errors in the Day of week: total non-captured data in weekdays = count of Sundays, “void”, Mondays, “void” …. Saturdays, “void”. A similar approach was applied to compute other variables like Accident date, Time of accident, etc.
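As a rough illustration of single-variable error counting, the Python sketch below separates the two error types defined above. The sample column is hypothetical, `None` stands for a void cell, and treating any value outside a fixed list of weekday names as a response error is an assumption made here; the source does not state how incorrect entries were identified.

```python
# Hypothetical 'Day of week' column; None marks a void (empty) cell.
day_of_week = ["Monday", None, "Firday", "Sunday", None, "Tuesday"]

# Assumed validity rule: only these spellings count as correctly completed.
VALID_DAYS = {"Sunday", "Monday", "Tuesday", "Wednesday",
              "Thursday", "Friday", "Saturday"}

# Item non-response errors: relevant data omitted (void cells).
item_non_response = sum(1 for c in day_of_week if c is None)

# Response errors: a value is present but incorrectly completed.
response_errors = sum(
    1 for c in day_of_week if c is not None and c not in VALID_DAYS
)

# Total non-captured data for this single variable.
non_captured = item_non_response + response_errors
```

No reference base is involved: each count looks at the one column only, as with Countif.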

The other numerical formulas applied are Sumif and Countifs. These two formulas require a reference base in order to incorporate all essential features into the mining of item non-response errors and response errors, which means that void cells are counted or summed up for all non-captured data. The reference base was divided into five mining metrics, ‘one vehicle’, ‘two vehicles’, ‘three vehicles’, ‘four vehicles and more’ and ‘no data’, which serve as the relational connectivity linking variables together for a complete and accurate mining of non-captured data in each variable. In this approach, a ‘one-vehicle accident’ corresponding to void data [empty cells] in a variable is extracted, and a similar approach is applied to the other mining metrics, except ‘no data’, where only void data or empty cells in both the reference base and the required variable or field are extracted. In essence, ‘no data’ refers to the empty cells in the reference base corresponding to empty cells found in the variables or fields subject to measurement.
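The two-condition counting can be sketched in Python as follows. The records, the field being checked (a gender column is used purely as an example) and the helper `mining_metric` are hypothetical; the sketch assumes the reference base is the ‘number of vehicles involved’ and that an empty reference cell is represented by `None`.

```python
from collections import Counter

# Hypothetical records: (number of vehicles involved, value of the field
# under assessment). None marks a void cell in either column.
records = [
    (1, None), (1, "Male"), (2, None), (4, None),
    (5, None), (None, None), (None, "Female"), (3, "Male"),
]


def mining_metric(vehicles):
    """Map a reference-base cell to one of the five mining metrics."""
    if vehicles is None:
        return "no data"
    if vehicles >= 4:
        return "four vehicles and more"
    return {1: "one vehicle", 2: "two vehicles", 3: "three vehicles"}[vehicles]


# Two conditions at once, as with Countifs: the field is void AND the
# reference base falls under a given metric.
void_by_metric = Counter(
    mining_metric(vehicles) for vehicles, field in records if field is None
)

total_void = sum(void_by_metric.values())
```

The ‘no data’ entry counts only rows where both the reference base and the field are empty, matching the description above; metrics with no void cells simply read as zero.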

In document Investigating quality of data and the need for the restructuring of accident report form in South Africa (Page 113-115)