Feature Relevance Analysis and Classification of Kaduna State Road Traffic Accident Data using Machine Learning Techniques


Academic year: 2021


1Hassan Ibrahim Hayatu, 2Abdullahi Mohammed, 2Ahmad Barroon Isma’eel, 2Yusuf Sahabi Ali, 3Umar Salisu Mohammed

1Institute for Agricultural Research, Ahmadu Bello University, Zaria, Nigeria. ibrogo@gmail.com
2Department of Computer Science, Ahmadu Bello University, Zaria, Nigeria. abdullahilwafu@abu.edu.ng, biahmad@abu.edu.ng, sahabiali.yusuf@gmail.com
3Department of Computer Science, Nigerian Defence Academy, Kaduna. umarsalisumohammed@gmail.com

Abstract

Road accidents are recognized as one of the leading causes of death and of lost socio-economic progress in society. Nigeria is not immune to this scourge, as road accidents are among the major causes of mortality among its citizens. In this study, we conducted a comparative analysis of three attribute selection methods, viz. Information Gain (IG), Principal Component Analysis (PCA), and Correlation-based feature selection (CF), using four classification algorithms, viz. Random Forest (RF), Decision Tree (J48), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN), to predict road crash severity in Nigeria and select the factors that give good prediction results. A dataset from the Federal Road Safety Corps Nigeria (FRSCN), Kaduna state command, is used in this work.

The experiments were conducted using the Waikato Environment for Knowledge Analysis (WEKA). They showed that the IG+RF model with 22 attributes outperformed the other models, with an accuracy of 85.39%. In terms of predicting the severity classes, the results showed that the SVM model with all RTC attributes and the PCA+SVM model with 22 attributes are good at predicting Fatal and Serious cases, with recalls of 84.6% and 97.2% respectively. The experiments also showed that most of the models are good at predicting Minor cases.

Furthermore, the results reveal that by reducing the dimensionality of the RTC dataset, we can predict the future severity of road crashes more effectively. The study will be helpful to the relevant authorities in formulating rules and regulations to reduce the severity of traffic crashes.

Keywords: Accident Severity; Road Traffic Crashes; Classification; Feature Selection

1. Introduction

Movement is one of the universal characteristics of human activity, as well as a very important part of society's socio-economic growth in general. The rapid development of new technology has created numerous categories of smart vehicles that contribute to the high rate of Road Traffic Crashes (RTC) worldwide (WHO, 2018). A road traffic crash occurs when road users (vehicles, motorcycles, or pedestrians) collide with each other due to technical, human, or environmental deficiencies. According to a report by the World Health Organization (WHO), about 1.3 million individuals worldwide are killed and up to 50 million are injured every year as a result of RTC; it is therefore important to find and remove the causes of RTC severity.

To come up with rules and regulations on how to avoid RTC, the factors leading to these crashes must be made available to the road safety agencies. Various techniques have been widely adopted to address the global problem of RTC, mainly through the analysis of injury severity using reported RTC data.

It can be deduced from previous studies that the factors contributing to RTC fall into three categories: environment, vehicle, and human. Predicting the major contributory factors for crash severity has become a central objective of researchers because it helps governments and policy makers to look for better ways of eliminating or reducing crash severity in the community. The risk of severe road traffic crashes differs significantly among the nations of the world. According to WHO (2018), Africa has the highest road traffic death rate, at 26.6 deaths per 100,000 population, so the mortality and morbidity in the region are huge, as is the socio-economic loss. As in several other developing countries, Nigeria has to contend with this recurring problem, as it accounts for one of the largest numbers of road traffic deaths as well as high road traffic accident rates. According to ITF (2018), a road traffic accident is 47 times more likely to kill someone in Nigeria than in Britain. Additionally, the ratio of crashes to deaths is also high: one death in every 2.65 crashes, whereas France has one death in 175 crashes, the Czech Republic one death in 175 crashes, and South Africa one death in 47 crashes. Therefore, the major factors of RTC severity (vehicle, environment, and human) need to be thoroughly highlighted to identify which combinations of factors can be used to predict whether a crash's severity is fatal, serious, or minor, so that the severity of future road traffic crashes can be classified or predicted from these factors using data mining techniques (Radzi et al., 2019).

Data mining can be seen as a way to derive valuable information from large volumes of data, that is, to isolate information in a massive dataset and turn it into knowledge. It is a subfield of computer science and statistics with the ultimate aim of extracting information from data and translating it into an understandable format for further use (Leszek et al., 2020).

Several approaches have been employed to address road traffic crashes using data mining techniques. To the best of our knowledge, only two of them (Birnin Gwari et al., 2017; Radzi et al., 2019) use RTC data obtained from the Federal Road Safety Corps Nigeria, Kaduna state command. Birnin Gwari et al. (2017) proposed an RTC model using SVM to classify crash severity into Fatal, Serious, and Minor. Radzi et al. (2019) proposed an RTC model using SVM and PCA to improve the classification of RTC severity; in their work, they used a dataset collected from the Federal Road Safety Corps from 2013-2015.

In this research study, we use the IG, PCA, and CF-based methods to identify the most significant factors and employ four classification algorithms (Random Forest (RF), Decision Tree (J48), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN)) to classify RTC severity. Both sets of methods were chosen based on a literature review of the various approaches proposed for road traffic crash severity classification using data mining, in order to obtain an effective model. The dataset was collected from the Federal Road Safety Corps and covers 2013-2018.

The remaining part of this paper is structured as follows: Section 2 presents some of the recent studies on RTC severity using data mining. Section 3 presents the methodology employed in this study. Section 4 presents the result analysis and discussion of findings. Section 5 concludes the paper with a summary of our findings and future research directions.

2. Literature Review

Many research studies on RTC severity classification using data mining techniques have been identified in the literature. Bülbül et al. (2016) developed a classification model using the CART algorithm to investigate the circumstances of road accident occurrence; this work achieved an accuracy of 81.5%. Satu et al. (2017) used RTC severity data from Bangladesh to study National Highway accidents using 892 cases; several decision tree algorithms were used to figure out the traffic accident patterns. An SVM model for the classification of RTC severity (Fatal, Severe, and Minor) was developed by Birnin Gwari et al. (2017) using 641 RTC cases in Kaduna state. Gopinath et al. (2017) conducted a traffic accident analysis concerning road users using SVM, with an accuracy of 75.58%; they also proposed a classification model to figure out the possible causes of road crashes using a decision tree. Al-Radaideh & Daoud (2018) proposed a road traffic crash severity model using SVM; data from the Transport Department, United Kingdom was used for their study, and it achieved an accuracy of 54.8%, which indicates a very low detection accuracy. Al-Radaideh & Daoud (2018) also proposed a model to find the most influential factors in RTC severity and classify the severity using the J48, Random Forest, CART, and Random Tree algorithms. In the work of Radzi et al. (2019), a combination of SVM and PCA was proposed to determine the best attributes for predicting RTC severity; this work is an extension of Birnin Gwari et al. (2017) and resulted in a 2% accuracy improvement. Labib et al. (2019) proposed a model to classify Bangladesh RTC into Grievous, Fatal, Motor Collision, and Simple Injury using decision tree and KNN algorithms.

3. Methodology

In this section, we describe the materials and methods used in the study, as shown in Figure 1. Figure 1 presents the architecture, which starts from data collection, after which the data is cleaned in a process called data pre-processing (handling missing values, handling class imbalance, and applying the attribute selection methods PCA, IG, and CF). After the data has been cleaned, the classification algorithms (SVM, KNN, J48, and RF) are applied to classify RTC severity. The various stages in Figure 1 are explained in detail in the following sections.

Figure 1: Architecture of the proposed study

3.1. Collection of Data

A dataset containing 1650 crash cases from 2013 to 2018 was obtained from the Federal Road Safety Corps, Kaduna command. Initially, the data contained 35 different attributes that may cause accidents. These attributes/factors relate to human, environmental, or vehicle factors, or a combination of two or all of them. From the entire dataset, 67 cases and 12 attributes were removed because of missing fields, wrongly spelled fields, largely absent information, or an unidentifiable class label. Finally, 1583 crash cases and 23 attributes were retained for the final analysis. Three target classes are present in the RTC dataset: Fatal, Serious, and Minor. 576 fatal cases, 948 serious cases, and 58 minor cases constitute the RTC severity dataset. Table 1 describes the attributes (Name, Factor, Description, and Data type), while Table 2 gives the distribution of each of the target classes.



Table 1: RTC Dataset Attribute Description

Attribute Name | Factor | Description | Data Type
ReportTime | Human | The time the crash was reported | Numeric
ArrivalTime | Human | Rescue team arrival time | Numeric
CrashDate | Environmental | Date of crash occurrence | Date
CrashTime | Environmental | Time the crash happened | Numeric
BadRoad | Environmental | Crash due to road conditions such as potholes, sharp bends, black spots, etc. | Numeric
RoadObstruction | Environmental | Crash due to an obstruction on the road | Numeric
PoorWeather | Environmental | Crash caused by weather such as rain or storm | Numeric
VehicleType | Vehicle | Type of vehicle involved in the crash (Bus, Lorry, Car, etc.) | Categorical
VehicleCategory | Vehicle | Category of vehicle involved in the crash (Private or Commercial) | Categorical
VehicleMaker | Vehicle | Manufacturer of the vehicle (e.g. Honda, Toyota) | Categorical
BrakeFailure | Vehicle | Crash caused by brake failure | Numeric
MechanicalDeficiency | Vehicle | Crash because the vehicle is mechanically deficient | Numeric
SignLightViolation | Vehicle | Crash caused by a vehicle without working signal lights | Numeric
TyreBurst | Vehicle | Crash resulting from a burst tyre | Numeric
ResponseTime | Human | The time taken by the rescue team to attend to the victims | Numeric
OverSpeeding | Human | Crash due to over-speeding | Numeric
DangerousDrive | Human | Crash due to dangerous driving | Numeric
LostControl | Human | Crash due to the driver losing control | Numeric
Overloading | Human | Crash due to load or passenger overloading | Numeric
RouteViolation | Human | Route violation on the highway by the driver | Numeric
DangerousOvertake | Human | Overtaking on a corner or sharp bend without a clear view ahead | Numeric
SleepingOnSteering | Human | Crash caused by the driver sleeping at the steering wheel | Numeric
UseOfPhone | Human | Crash caused by the driver using a phone while driving | Numeric


Table 2: RTC Severity Distribution

Severity Number of cases

Fatal 576

Serious 948

Minor 58

3.2 Pre-processing

Before applying any machine learning algorithm to a dataset, it is recommended to carry out data pre-processing. Pre-processing is the process of cleaning the data before further analysis is carried out. It involves several steps, including handling missing values, normalization, attribute selection or extraction, transformation, and/or handling categorical attributes. To handle missing values, we used the mean imputation technique; we converted the categorical attributes into numeric form using an encoding technique; and we addressed class imbalance using WEKA's resampling filter. Attribute selection was also performed to select the most important attributes for RTC severity classification.

3.2.1 Missing values handling

To handle the missing values in our dataset, the WEKA filter function "ReplaceMissingValues" was used. Missing numerical attribute values are replaced with the mean value of that attribute, while missing nominal attribute values are replaced with the mode of the attribute.
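The behaviour of this filter can be sketched in plain Python (a hypothetical stand-in for illustration, not WEKA's own code; the `replace_missing` helper and the toy crash records are invented for the example):

```python
from statistics import mean, mode

def replace_missing(rows, numeric_cols):
    """Mimic WEKA's ReplaceMissingValues: mean for numeric columns,
    mode for nominal columns. Missing values are represented as None."""
    cols = list(zip(*rows))
    filled = []
    for i, col in enumerate(cols):
        present = [v for v in col if v is not None]
        fill = mean(present) if i in numeric_cols else mode(present)
        filled.append([v if v is not None else fill for v in col])
    return [list(r) for r in zip(*filled)]

# Toy crash records: (CrashTime, VehicleCategory)
rows = [[8.0, "Private"], [None, "Commercial"], [10.0, None], [12.0, "Private"]]
out = replace_missing(rows, numeric_cols={0})
# The missing CrashTime becomes mean(8, 10, 12) = 10.0;
# the missing VehicleCategory becomes the mode, "Private".
print(out)
```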

3.2.2 Handling Class imbalance

In a classification or prediction problem, class imbalance is a serious issue of concern. It mostly results from a skewed distribution of the dataset among the target classes. With an imbalanced dataset, most classification algorithms will focus mainly on classifying the majority class, thereby ignoring the minority class, which in turn reduces the performance of the classification model (Al-Radaideh & Daoud, 2018). To handle this type of problem, sampling techniques are used, which involve resampling the original imbalanced dataset. This can be achieved in different ways: by oversampling the minority class, by under-sampling the majority class, or by a hybrid of the two (Amalia et al., 2019). In this research study, the WEKA filter "weka.filters.supervised.instance.Resample" was used to over-sample the Minor cases and under-sample the Serious and Fatal cases. After applying the sampling technique, the dataset contains 527 Fatal, 527 Serious, and 527 Minor cases.
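This balancing step can be sketched in plain Python (illustrative code, not the WEKA Resample filter itself; the class sizes mirror those of our dataset, and `resample_balanced` is an invented helper):

```python
import random

def resample_balanced(cases, per_class, seed=42):
    """Sketch of resampling to a uniform class distribution: over-sample
    (with replacement) classes smaller than per_class and under-sample
    (without replacement) classes larger than per_class."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in cases:
        by_class.setdefault(label, []).append((features, label))
    balanced = []
    for label, group in by_class.items():
        if len(group) >= per_class:
            balanced += rng.sample(group, per_class)      # under-sample
        else:
            balanced += rng.choices(group, k=per_class)   # over-sample
    return balanced

data = [([i], "Fatal") for i in range(576)] \
     + [([i], "Serious") for i in range(948)] \
     + [([i], "Minor") for i in range(58)]
sampled = resample_balanced(data, per_class=527)
print(len(sampled))  # 1581, i.e. 527 cases per class
```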

3.2.3 Attribute selection/extraction

Attribute extraction or selection is an essential pre-processing phase that helps to improve a model's predictive performance in a machine learning problem. Attribute selection methods can be divided into attribute ranking and attribute subset selection. In the first technique, attributes are selected based on their predictive ranking; the subset selection method selects the subset of attributes with the best predictive performance (Syed et al., 2020). In this study, IG, PCA, and CF were used to select the best attributes for RTC severity classification.

A. Principal component analysis

PCA computes the covariance matrix of the dataset and determines its eigenvectors and eigenvalues. The eigenvectors corresponding to the largest eigenvalues are chosen as the principal components, and the extracted attributes are obtained by projecting the data onto these components (Rahul & Vinod, 2020). In this way, PCA selects the most appropriate features from all the different features in the dataset.

B. Information gain

The information gain feature selection method is centered on the notion of entropy. The mutual information of an independent attribute and the target class gives the expected information gain value: it is the decrease in the entropy of the target attribute realized by learning the value of the independent attribute (Ahmad et al., 2019). One of the key downsides of the information gain method is that it favours attributes with large numbers of distinct values over attributes with fewer values, even if the latter contain more information.

To compute information gain, consider a predictive attribute X and a target attribute Y. The information gain of X about Y is the decrease in the uncertainty about the value of Y once the value of X is known. The uncertainty in Y is measured by its entropy H(Y), and the uncertainty in Y given the value of X is the conditional entropy H(Y|X) (Ahmad et al., 2019):

IG(Y, X) = H(Y) - H(Y|X)   (1)

If X and Y are discrete attributes that take values in {x1, ..., xk} and {y1, ..., yl} respectively, then the entropy of Y is calculated using Eqn. 2 as:

H(Y) = - Σ y∈Y p(y) log2 p(y)   (2)

The conditional entropy of Y with respect to X is computed using Eqn. 3:

H(Y|X) = Σ x∈X p(x) H(Y|X = x)   (3)

If the predictive attribute X is continuous, the information gain of X with respect to the target attribute Y is calculated by considering all the binary attributes Xθ that arise from X by choosing a threshold θ on X (Ahmad et al., 2019). The threshold θ ranges over all the values of X, and the information gain is given in Eqn. 4 as:

IG(Y, X) = max θ IG(Y, Xθ)   (4)
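For discrete attributes, this computation can be sketched in plain Python (illustrative code on invented toy data, not the actual experiment):

```python
from math import log2
from collections import Counter

def entropy(values):
    """H(Y) = -sum over y of p(y) * log2 p(y)."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def info_gain(xs, ys):
    """IG(Y, X) = H(Y) - H(Y|X) for a discrete attribute X."""
    n = len(ys)
    cond = 0.0
    for x in set(xs):
        subset = [y for xi, y in zip(xs, ys) if xi == x]
        cond += (len(subset) / n) * entropy(subset)  # p(x) * H(Y|X=x)
    return entropy(ys) - cond

# Toy example: the attribute perfectly predicts severity here, so its
# information gain equals the full class entropy H(Y) = 1 bit.
speeding = [1, 1, 0, 0]
severity = ["Fatal", "Fatal", "Minor", "Minor"]
print(info_gain(speeding, severity))  # 1.0
```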

C. Correlation-based Attribute Extraction (CF)

One of the major downsides of univariate feature selection techniques like information gain is that the correlation between attributes is not taken into consideration. This problem can be addressed by using multivariate feature selection or extraction techniques such as the correlation-based method. CF measures the correlation between each of the attributes and the target class; attributes that are highly correlated with the class are chosen as the best attributes to be used in the classification algorithms. Pearson's correlation, which measures the linear relationship between an attribute and the target class, is used (Gnanambal et al., 2018).
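The ranking idea can be sketched in plain Python (illustrative code; the attribute names and the numeric coding of the severity classes are hypothetical):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's r between an attribute and the (numerically coded) class."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_attributes(table, labels):
    """Rank attributes by |r| with the class, most correlated first."""
    scores = {name: abs(pearson(col, labels)) for name, col in table.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: severity coded 0 = Minor, 2 = Fatal (hypothetical coding).
table = {"OverSpeeding": [1, 1, 0, 0], "TyreBurst": [0, 1, 0, 1]}
labels = [2, 2, 0, 0]
print(rank_attributes(table, labels))  # ['OverSpeeding', 'TyreBurst']
```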

3.3 Classification Algorithms

Four different machine learning algorithms are applied to the RTC dataset to find the best among them with respect to marginal accuracy and recall for the classification of RTC severity, using the already pre-processed data. Machine learning tasks can be categorized into clustering, association rule mining, and classification.

Classification methods assign cases to a predefined target class. In machine learning, classification involves a training phase and a testing phase (Al-Radaideh & Daoud, 2018). In the training phase, the training data is used to train the machine learning algorithm to generate the classification or prediction rules, while in the testing phase, the test data is used with the generated classification rules to assess the performance of the algorithm on unseen data. If the performance is acceptable, the model can be used in the future for classifying new cases. Here, we used four classification algorithms (SVM, KNN, J48, and RF) to determine which of them should be used for the classification of RTC severity in the study area, using accuracy, precision, and recall measures.

3.3.1 Support vector machine (SVM)

SVM is a linear and nonlinear data classification and prediction algorithm originally developed by Vapnik. The notion of SVMs revolves around a "gap" (margin) on either side of a hyperplane that separates two classes of data. It has been shown that maximizing this gap, thereby establishing the maximum possible distance between the separating hyperplane and the instances on either side, decreases the upper limit of the expected generalization error (Hector et al., 2020).

3.3.2 K-nearest neighbor (KNN)

KNN is a non-parametric prediction or classification algorithm that predicts the target class of a new instance using the majority class of its nearest neighbors (Wang et al., 2018). The number of nearest neighbors used is denoted k. KNN is regarded as a lazy learning algorithm because it simply stores all the cases and classifies or predicts new cases using the majority class of their k neighbors. In the literature, distance measures such as the Cosine, Manhattan, Euclidean, and Hamming distances have been proposed for determining the k nearest neighbors. Since the KNN algorithm classifies a new case by identifying the cases nearest to it, the scale of each attribute is crucial: any attribute on a large scale will have a much more substantial influence on the distance between cases than attributes on a small scale, so when using the KNN algorithm it is important to convert all the attributes to the same scale (Agarwal & Sagar, 2019; Sethi et al., 2019; Manzoor & Singla, 2019; Cao et al., 2019).
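The scaling-plus-majority-vote idea can be sketched in plain Python (illustrative code with invented toy values, not WEKA's KNN implementation):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def min_max_scale(rows):
    """Rescale every attribute to [0, 1] so no attribute dominates distances."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def knn_predict(train_x, train_y, query, k=3):
    """Majority class among the k nearest (Euclidean) neighbours."""
    neighbours = sorted(zip(train_x, train_y), key=lambda p: dist(p[0], query))
    return Counter(label for _, label in neighbours[:k]).most_common(1)[0][0]

# Toy data: [OverSpeeding flag, ResponseTime in minutes] -> severity.
raw = [[1, 40], [1, 35], [0, 10], [0, 12]]
labels = ["Fatal", "Fatal", "Minor", "Minor"]
scaled = min_max_scale(raw)
query = min_max_scale(raw + [[1, 38]])[-1]  # scale the query with the data
print(knn_predict(scaled, labels, query, k=3))  # Fatal
```

Without scaling, ResponseTime (tens of minutes) would swamp the 0/1 OverSpeeding flag in the distance computation.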

3.3.3 Decision tree (DT)

Decision trees are trees that categorize examples by organizing them according to attribute values. Every node in the tree denotes an attribute of the example to be classified, and each branch denotes a value that the node can take. Examples are sorted starting from the root node and are classified according to their attribute values. In machine learning and data mining, the decision tree is used as a classification model that maps the observations about an instance to a conclusion about its target value (Agarwal & Sagar, 2019). Two different decision tree implementations in WEKA were used in this study: J48 and Random Forest.

3.4 Performance Evaluation

This section describes the performance evaluation metrics and the method used for the validation of our model.

3.4.1 Validation of results

In this study, we used 10-fold cross-validation for splitting the data into training and testing sets. This approach splits the dataset into 10 equal-size subsets; one is used for testing and the other nine are used for training the classification model (Kathiravan et al., 2019). The experiment is repeated ten (10) times, and the mean error rate over the testing folds is returned as the performance of the model.
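The fold construction can be sketched in plain Python (an illustrative stand-in for WEKA's cross-validation; 1581 stands for the balanced dataset of 527 cases per class):

```python
import random

def k_fold_indices(n_cases, k=10, seed=1):
    """Shuffle case indices and split them into k near-equal folds; each
    fold serves once as the test set while the rest form the training set."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# 1581 balanced RTC cases split into 10 folds:
splits = list(k_fold_indices(1581, k=10))
print(len(splits), len(splits[0][1]))  # 10 159
```

Every case appears in exactly one test fold, so each case is used for testing once and for training nine times.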

3.4.2 Performance evaluation metrics

To measure the performance of our study, a confusion matrix was used. The confusion matrix is an n × n matrix, where n is the number of target classes, which assists in determining the performance of a prediction or classification model (Ivo & Gunther, 2020). Each column of the matrix denotes the actual target cases and each row denotes the predicted cases (Raihan-Al-Masud et al., 2020). The confusion matrix is shown in Table 3; from it, the accuracy, precision, and recall are computed.

Table 3: Confusion Matrix

(rows: predicted target classes; columns: actual target classes)

              Actual +   Actual -
Predicted +   TP         FP
Predicted -   FN         TN


a. Accuracy (Acc): the number of correctly predicted or classified cases divided by the total number of correctly and incorrectly predicted cases (Raihan-Al-Masud et al., 2020). This can be expressed arithmetically as in Eqn. 3:

Acc = (TP + TN) / (TP + TN + FP + FN)   (3)

b. Recall (R): the number of correctly classified positive (+) cases divided by the number of positive (+) cases present in the dataset, or the number of correctly classified negative (-) cases divided by the number of negative (-) cases present in the dataset (Raihan-Al-Masud et al., 2020). This can be expressed arithmetically as in Eqn. 4:

R = TP / (TP + FN) for the (+) class, or R = TN / (TN + FP) for the (-) class   (4)

c. Precision (P): the number of cases the model classified into a class that are actually in that class, divided by the total number of cases assigned to that class (Raihan-Al-Masud et al., 2020). This can be expressed arithmetically as in Eqn. 5:

P = TP / (TP + FP)   (5)
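These metrics can be sketched in plain Python (illustrative code; the confusion-matrix counts below are invented, not results from our experiments):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall, and precision from a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)          # for the positive class
    precision = tp / (tp + fp)
    return accuracy, recall, precision

# Hypothetical counts, treating the Fatal class as the positive class:
acc, rec, prec = metrics(tp=80, fp=10, fn=20, tn=90)
print(round(acc, 3), round(rec, 3), round(prec, 3))  # 0.85 0.8 0.889
```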

4. Discussions and Analysis of Results

The analysis was performed in WEKA using 10-fold cross-validation to split the RTC dataset into training and testing sets and measure the performance of the work. Tables 4-7 present the performance of the algorithms obtained from our analysis. Tables 8-9 summarize the highest marginal accuracy and the recalls for each of the algorithms, which are depicted graphically in Figures 2 and 3.

Table 4: Support Vector Machine model performance with and without Attribute Selection Methods.

Models | Number of Selected Attributes | Accuracy (%) | Recall (%): Fatal, Serious, Minor | Precision (%): Fatal, Serious, Minor

SVM All 83.49 84.60 70.05 98.91 62.40 88.40 99.61

CF+SVM 22 84.12 56.00 96.40 100 94.20 68.70 99.60

12 79.19 73.60 65.10 98.90 70.00 73.00 93.50

6 50.19 46.10 51.80 52.60 42.80 45.00 68.20

IG+SVM 22 83.12 72.50 77.00 99.20 80.80 71.40 98.50

12 80.52 70.00 72.90 98.70 74.40 71.20 95.20

6 73.88 64.90 62.80 93.90 69.10 64.50 86.40

PCA+SVM 22 84.25 55.60 97.20 100 95.10 68.60 100

12 83.87 63.60 88.0 100 84.80 70.90 99.10

6 83.30 63.80 86.90 99.20 83.20 71.00 98.30

The results in Tables 4-7 indicate the comparative performance of the classification models developed using the three attribute selection methods and the four classification algorithms, showing the effect of each attribute selection method on each classifier. In Table 8, the results reveal that IG+RF using 22 RTC attributes has the maximum marginal accuracy of 85.39%, followed by PCA+SVM using the same 22 RTC attributes with an accuracy of 84.25%. IG+KNN with 12 selected attributes comes third with an accuracy of 83.42%, and then IG+J48 using the same 12 attributes with an accuracy of 79.82%. In Table 9, the experiment shows that the Fatal cases are best classified using SVM with all the RTC attributes, with a recall of 84.6%, while PCA+SVM using 22 attributes is best at classifying Serious cases, with a recall of 97.2%. The Minor cases can be predicted well by most of the models.


Table 5: J48 Model Performance with and without Attribute Selection Methods.

Models | Number of Selected Attributes | Accuracy (%) | Recall (%): Fatal, Serious, Minor | Precision (%): Fatal, Serious, Minor

J48 All 77.67 72.50 72.10 86.30 66.00 67.60 99.40

CF+J48 22 78.24 68.30 66.80 99.60 72.00 71.10 89.60

12 79.51 70.40 68.50 99.60 73.20 71.20 92.60

6 54.27 43.30 44.80 74.80 48.60 49.0 62.5

IG+J48 22 79.32 68.70 70.00 99.20 73.90 74.40 87.90

12 79.82 67.60 72.30 99.60 76.60 74.60 86.80

6 75.71 61.70 65.70 99.80 68.90 66.70 89.20

PCA+J48 22 78.81 69.60 67.00 99.80 72.20 72.50 89.80

12 79.51 69.60 69.30 99.60 73.10 73.30 90.40

6 77.55 61.30 71.30 100 72.30 68.40 90.20

Table 6: Random Forest Model Performance with and without Attribute Selection Methods.

Models | Number of Selected Attributes | Accuracy (%) | Recall (%): Fatal, Serious, Minor | Precision (%): Fatal, Serious, Minor

RF All 84.38 78.20 77.00 97.40 75.70 77.40 100

CF+RF 22 84.19 75.00 77.60 100 78.40 76.90 96.70

12 81.28 72.70 71.50 99.60 74.40 73.90 94.40

6 57.50 46.30 47.60 78.60 52.80 55.50 78.60

IG+RF 22 85.39 77.60 78.60 100 79.90 78.90 96.90

12 84.00 75.50 76.50 100 79.00 76.80 95.50

6 84.25 76.50 76.30 100 79.20 77.80 95.0

PCA+RF 22 84.25 78.00 74.80 100 77.70 78.20 96.20

12 84.12 76.30 76.10 100 78.10 77.90 95.60

6 83.05 75.00 74.20 100 76.80 76.80 94.40

Table 7: K-Nearest Neighbor Model Performance with and without Attribute Selection Methods.

Models | Number of Selected Attributes | Accuracy (%) | Recall (%): Fatal, Serious, Minor | Precision (%): Fatal, Serious, Minor

KNN All 82.1 77.50 75.70 91.80 73.10 73.20 100

CF+KNN 22 81.59 73.40 71.30 100 74.3 75.50 93.80

12 79.51 70.40 68.50 99.60 73.20 71.20 92.60

6 57.18 47.1 47.4 77.00 52.30 54.70 62.5

IG+KNN 22 82.29 73.40 73.40 100 77.20 75.70 92.60

12 83.42 74.00 76.30 100 79.30 76.30 93.10

6 82.47 74.80 72.70 100 76.80 76.40 92.90

PCA+KNN 22 81.72 73.10 72.10 100 75.50 75.10 93.30

12 82.99 74.00 75.00 100 77.20 76.80 93.80

6 82.48 73.80 73.60 100 76.00 76.70 93.60


Table 8: Highest Marginal Accuracy for each of the models

Models Selected Number of Attributes Accuracy (%)

PCA+SVM 22 84.25

IG+J48 12 79.82

IG+RF 22 85.39

IG+KNN 12 83.42

Table 9: Recalls summary for the three RTC severity.

Severity Models Number of features Recall (%)

Fatal SVM All 84.60

J48 All 72.10

RF All 78.20

KNN All 77.50

Serious PCA+SVM 22 97.20

J48 All 72.10

IG+RF 22 78.60

IG+KNN 12 76.30

Minor PCA+SVM 12 & 22 100

PCA+J48 06 100

PCA+RF 6, 12 & 22 100

PCA+KNN 6, 12 & 22 100

Figure 2: Highest Marginal Accuracy for each of the models.



Figure 3: Recalls summary for the three RTC severity.

5. Conclusion and Future Work

In this work, a comparative analysis was conducted of three attribute selection methods (Information Gain (IG), Principal Component Analysis (PCA), and Correlation-based (CF)) using four classification algorithms (Random Forest (RF), Decision Tree (J48), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN)) to predict road crash severity in Nigeria and select the crash factors that provide the best prediction results. A dataset from the Federal Road Safety Corps Nigeria (FRSCN), Kaduna state command was used, and the experiments were conducted using the Waikato Environment for Knowledge Analysis (WEKA). The experiments showed that the IG+RF model with 22 attributes outperformed the other models, with an accuracy of 85.39%. In terms of predicting the severity classes, the results showed that the SVM model with all RTC attributes and PCA+SVM using 22 attributes are good at predicting Fatal and Serious cases, with recalls of 84.6% and 97.2% respectively.

The experiments also showed that most of the models are good at predicting Minor cases, and the results revealed that by reducing the dimensionality of the RTC dataset we can predict the future severity of road crashes more effectively. The study will be helpful to the relevant authorities in formulating rules and regulations to reduce the frequency and severity of road traffic crashes. As future work, the authors intend to add more data from other FRSCN commands in the country and to develop a hybrid RTC severity model for better prediction.


References

Agarwal, R., & Sagar, P. (2019). A Comparative Study of Supervised Machine Learning Algorithms for Fruit Prediction. Journal of Web Development and Web Designing, 14-18.

Ahmad , F. O., Tasmi, Sri , D. S., Mira , A., & Herri , S. (2019). Attribute Selection Using Information Gain and Naïve Bayes for Traffic Classification. Journal of Physics: Conf. Series, 1-6.

Al-Radaideh, Q. A., & Daoud, E. J. (2018). Data Mining Methods for Traffic Accident Severity Prediction.

International Journal of Neural Networks and Advanced Applications, 1-12.

Amalia, L., Alejandro , C., Alejandro, M. M.-G., & Ana de las, H. V. (2019). The Impact of Class imbalance in Classification performance metrics based on the Binary Confusion Matrix . Parttern Recognition, 216- 231.

Asha, G. K., Manjunath, A. S., & Jayaran, M. A. (2012). Comparative Study of Attributes Selection Using Gain Ratio and Correlation Based Feature Selection. International Journal of Information Technology and Knowledge Management, 271 – 277.

Birnin Gwari, I. S., Radzi, N. M., & Haszlinna, N. (2017). Road Traffic Crash Severity Classification Using Support Vector Machine. International Journal of Innovative Computing, 15-18.

Bülbül, , H. I., Kaya, T., & Tulgar, Y. (2016). Analysis for Status of the Road Accident Occurrence and Determination of the Risk of Accident by machine learning in Instanbul. International Conference on Machine Learning and Applications (ICMLA) (pp. 426-430). Anaheim: CA.

Cao, Y., Fang, X., Ottosson, J., Näslund, E., & Stenberg, E. (2019). Study of Machine Learning Algorithms in Predicting Severe Complications after Bariatric Surgery. Journal of Clinical Medicine, 668.

Castro, Y., & Kim, Y. J. (2016). Data mining on road safety: factor assessment on vehicle accidents using classification models. International Journal of Crashworthiness, 104-111. doi:10.1080/13588265.2015.1122278

Gnanambal, S., Thangaraj, M., Meenatchi, V. T., & Gayathri, V. (2018). Classification Algorithms with Attribute Selection: an evaluation study using WEKA. Int. J. Advanced Networking and Applications, 0975-0290.

Gopinath, V., Prakash, P. K., Yallamandha, C., Veni, K. G., & Krishna, S. (2017). Traffic Accidents Analysis with respect to Road Users using Data Mining Techniques. International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), 15-19.

Hector, D. H., Ciro, R., & Doris, E. (2020). Comparative analysis of supervised machine learning algorithms for heart disease detection. 3C Tecnología. Glosas de innovación aplicadas a la pyme, 233-247.

ITF. (2018). ROAD SAFETY REPORT 2018 | NIGERIA. International Transport Forum. Retrieved from https://www.itf-oecd.org/sites/default/files/nigeria-road-safety.pdf

Ivo, D., & Gunther, G. (2020). Indices for rough set approximation and the application to confusion matrices. International Journal of Approximate Reasoning, 155-172.

Kathiravan, S., Aswani, K. C., Durai, R. V., Ashish, G., & Bor-Yann, C. (2019). An Efficient Implementation of Artificial Neural Networks with K-fold Cross-Validation for Process Optimization. Journal of Internet Technology, 1213-1225.

Labib, M. F., Rifat, A. S., Hossain, M. M., Das, A. K., & Nawrine, F. (2019). Road Accident Analysis and Prediction of Accident Severity by Using Machine Learning in Bangladesh. 2019 7th International Conference on Smart Computing & Communications (ICSCC). Bangladesh: IEEE.

Leszek, R., Maciej, J., & Piotr, D. (2020). Stream Data Mining: Algorithms and Their Probabilistic Properties (Studies in Big Data). Springer.

Manzoor, S. I., & Singla, J. (2019). A Comparative Analysis of Machine Learning Techniques for Spam Detection. International Journal of Advanced Trends in Computer Science and Engineering, 810-814.

Radzi, N. H., Gwari, I., Mustaffa, N., & Sallehuddin, R. (2019). Support Vector Machine with Principle Component Analysis for Road Traffic Crash Severity Classification. Joint Conference on Green Engineering Technology & Applied Computing 2019 (pp. 1-5). IOP Publishing. doi:10.1088/1757-899X/551/1/012068

Rahul, A., & Vinod, P. (2020). Feature selection using principal component analysis and genetic algorithm. Journal of Discrete Mathematical Sciences and Cryptography, 595-602.

Raihan-Al-Masud, M., Rubaiyat, M., & Mondal, H. (2020). Data-driven diagnosis of spinal abnormalities using feature selection and machine learning algorithms. PLOS Journal, 2-15.

Satu, M. S., Ahamed, S., Hossain, F., Akter, T., & Farid, D. M. (2017). Mining traffic accident data of N5 national highway in Bangladesh employing Decision Tree. IEEE Region 10 Humanitarian Technology Conference (R10-HTC) (pp. 722-725). Dhaka.

Sethi, K., Gupta, A., Gupta, G., & Jaiswal, V. (2019). Comparative Analysis of Machine Learning Algorithms on Different Datasets. In Circulation in Computer Science International Conference on Innovations in Computing (ICIC 2017), (pp. 87-91).

Syed, A., Ali, S., Hafiz, M. S., Muhammad, W., & Saif, U. (2020). A Comparative Study of Feature Selection Approaches: 2016-2020. International Journal of Scientific and Engineering Research, 469.

Torsen, E., & Atule, A. A. (2017). A Nonparametric Approach to Data Analysis on Road Traffic Accident. International Journal of Mathematics and Statistics Invention (IJMSI), 01-08.

Wang, W., Li, Y., Wang, X., Liu, J., & Zhang, X. (2018). Detecting Android malicious apps and categorizing benign apps with ensemble of classifiers. Future Generation Computer Systems, 987-994.

WHO. (2018). Global status report on road safety 2018. France: World health organization. Retrieved from https://www.who.int/violence_injury_prevention/road_safety_status/2018/en/
