CITY HEALTH PREDICTION MODEL USING RANDOM FOREST CLASSIFICATION METHOD


XXX-X-XXXX-XXXX-X/XX/$XX.00 ©20XX IEEE


1st Prihandoko Prihandoko
Faculty of Computer Science and Information Technology, Gunadarma University
Depok, Jawa Barat, Indonesia
[email protected]

2nd Bertalya Bertalya
Faculty of Computer Science and Information Technology, Gunadarma University

3rd Lilis Setyowati
Faculty of Computer Science and Information Technology, Gunadarma University

Abstract—City Health Offices in Indonesia produce a health report every year that describes the condition of public health in the city. The report is used as the source for determining the city health index. Constructing a city health development index is important in order to produce an objective formula. In this study, the Random Forest classification method is used to develop a model for predicting and analysing the health index of a city. The goal of this work is to find a prediction model that makes more accurate predictions and reduces errors in determining the city health index. The performance of the model is evaluated using three parameters: Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). The research shows that the Random Forest model with 15 percent test data and 200 decision trees gives the best results, with MAE = 0.108, MSE = 0.035, RMSE = 0.187, and accuracy = 94.6 percent.

Keywords— Data Mining, Classification, Prediction, City Health Index

I. INTRODUCTION

In order to provide an overview of the health situation in a city, an official City Health Profile document is prepared and published annually. The City Health Profile describes the state of public health status (mortality, nutritional status, morbidity), health efforts (health services, access and quality of health services, community life behavior, environmental conditions), and health resources (health facilities, health workers, health financing) in a city [1].

The City Health Profile is compiled by the City Health Office from data collected by the Public Health Centers (Puskesmas) located in the city. In addition to describing the health condition of the city, the City Health Profile is also used to evaluate the progress of health development in the previous year and to prepare the program and budget plan to be proposed in the Regional Revenue and Expenditure Budget.

Every five years, the Indonesian Ministry of Health publishes the Public Health Development Index (PHDI) for each province and city in Indonesia. This index measures the level of health of a province or a city as high, moderate or low. Every city is then expected to use the PHDI to evaluate the progress of city health development and to plan improvements to public health.

The problem faced by every city is that it has no tool to produce a health index each year, although it needs one annually, not every five years, to plan improvements to the quality of health in the city.

The aim of this study is to build a prediction model of city health. The model is developed from the data produced each year by the city health department. The city will use the model to make predictions and to plan public health programs for the coming year.

The prediction model is formulated from the attributes of the City Health Profile using data mining techniques. Data mining is the process of collecting, pre-processing, processing, and analysing data in order to gain insight from it. In the real world, problem areas, implementations and data requirements vary.

In the collection phase, the data are captured and stored in a database for processing. When first collected, the data are not in a format suitable for processing. The attribute extraction stage is often conducted in parallel with data pre-processing, in which missing and incorrect parts of the data are corrected. Once data cleaning is finished, the data are ready to be processed and analyzed with a data mining method to gain insight from them. In this paper, the data mining method used to process public health data is Random Forest [2], which has already proven effective for building prediction models in many domains.

This paper implements the Random Forest (RF) algorithm [3] for attribute selection. The purpose of this algorithm is not only to reduce the number of attributes but also to confirm that the attributes chosen are the most significant.

The strength of Random Forest lies in combining numerous decision trees, each built on a bootstrap sample drawn from the learning sample X [4]. Random Forest works very well when the number of attributes is larger than the number of samples [5].

Random Forest has been particularly beneficial in various applications in the biomedical area. Díaz-Uriarte et al. [6] worked on microarray data classification and demonstrated that Random Forest performs better than other algorithms such as K-Nearest Neighbor (KNN), Diagonal Linear Discriminant Analysis (DLDA), and Support Vector Machine (SVM). Yao et al. [7] used variable importance scores obtained from Random Forest to rank attributes, and their results were assessed with the mean accuracy of the SVM. Yang et al. [8] proposed a feature selection method that combines Random Forest with the SVM classifier.

In that method, the Random Forest variable importance score is used to eliminate attributes, and the remaining attributes are then assessed using the SVM. In all these techniques, Random Forest is used as a classifier to assess attributes. Some of them have used variable importance scores to select attributes [7, 9].

The paper is structured as follows. Section 1 describes the background of the research and the problem statement. Section 2 reviews the literature on similar research and highlights some prediction models from other areas. Section 3 explains the method used in this research, i.e., Random Forest and its stages in the classification process. Section 4 presents and discusses the results, and Section 5 concludes the research.

II. LITERATURE REVIEW

Random Forest, one of the means of determining the priority of independent attributes affecting a dependent attribute, has been implemented in many areas. It is a data mining and machine learning method that reduces prediction error by exploiting randomness in ensembles of decision trees. Specifically, it has been shown that Random Forest has high predictive power for multidimensional data with an enormous number of attributes [10].

Ordonez [11] predicted heart disease using a number of attributes obtained from patients. The proposed system takes the characteristics of a person from 13 fields, such as sex, cholesterol, and blood pressure, to predict the probability that a patient is affected by heart disease. Classification algorithms such as Naive Bayes, Decision Tree, and Neural Network were applied to make the prediction.

Le Duff et al. [12] conducted a study of 533 patients who had suffered cardiac arrest and who were included in an examination of heart disease possibilities. They implemented the data mining analysis using Bayesian networks.

Palaniappan et al. [13] developed a model known as the Intelligent Heart Disease Prediction System (IHDPS) using data mining methods such as Naïve Bayes, Decision Trees, and Neural Network.

III. RESEARCH METHODS

Figure 1 shows the data mining process used in the study to build the city health index prediction model. It starts with the data obtained from the City Health Profiles published in 6 provinces (Jakarta, Central Sulawesi, Yogyakarta, Bali, East Kalimantan, West Java) as the samples. These data consist of 710 attributes from 81 tables, which describe the state of public health status, health efforts, and health resources.

Fig. 1. Data Mining Process

The second process is data cleaning, which eliminates attributes that are empty. This process also eliminates detailed data that differentiate between male and female. After data cleaning, 57 attributes remain.
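The cleaning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not give the column names or the tooling used, so the `_male`/`_female` suffixes, the function name `clean_attributes`, and the toy records are all hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]

# Hypothetical suffixes marking sex-disaggregated detail columns.
SEX_SUFFIXES = ("_male", "_female")


def clean_attributes(rows: List[Row]) -> List[Row]:
    """Drop attributes that are empty in every row and sex-specific detail columns."""
    if not rows:
        return []
    keep = []
    for col in rows[0].keys():
        if col.endswith(SEX_SUFFIXES):
            continue  # sex-disaggregated detail (assumed to be covered by a total column)
        if all(row.get(col) is None for row in rows):
            continue  # attribute has no value for any record
        keep.append(col)
    return [{col: row.get(col) for col in keep} for row in rows]


# Toy records; the column names and values are invented for illustration only.
profiles = [
    {"infant_deaths": 12.0, "infant_deaths_male": 7.0, "infant_deaths_female": 5.0, "unused": None},
    {"infant_deaths": 9.0, "infant_deaths_male": 4.0, "infant_deaths_female": 5.0, "unused": None},
]
cleaned = clean_attributes(profiles)
print(sorted(cleaned[0].keys()))
```

In this toy case only the aggregate column survives, mirroring how the study reduces 710 raw attributes to 57.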

The third process is classification. Classification is the process of determining to which of a set of classes a new instance belongs, on the basis of a training set of instances whose class membership is known. The data classification process, shown in Figure 2, normally consists of two stages: the training stage and the testing (prediction) stage. In the training stage, the data are transformed into a set of attributes based on an attribute generation model such as the vector space model.

In the prediction stage, the data are represented by the attribute set obtained in the training stage, and the classifier learned in the training stage is applied to this attribute representation to predict the class. The attribute set used in the training stage is the same as that used in the prediction stage.

Fig. 2. Data Classification Process

The learning algorithm used to classify the data is Random Forest. Random Forest [3, 14] is an algorithm that uses sets of decision trees whose splits are based on randomly generated vectors or on random subsets of the training data, and then computes the output as a function of the different partitions.

Normally, the random vectors are drawn from a fixed probability distribution. Hence, random forests can be built by random split selection or random input selection.

Fig. 3. Random Forest Concept

(Source: https://dsc-spidal.github.io/harp/docs/examples/rf/)

The Random Forest algorithm builds multiple decision trees following the process shown above and takes the majority vote of the predictions of these trees. The Random Forest approach differs from the conventional decision tree algorithm in two aspects when constructing each tree: (1) it grows the tree from a bootstrap sample of the training data; (2) when selecting the best attribute at each node, it considers only a random subset of the feature space. These two modifications introduce randomness into the tree learning process and thus increase the diversity of the base classifiers. In other words, Random Forest adds randomness while growing the trees: it searches for the best attribute among a random subset of attributes.
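The two sources of randomness and the majority vote can be made concrete with a small pure-Python sketch. It is not the implementation used in the study, which the paper does not specify. The function names, the Gini impurity criterion, the depth limit, and the toy data are assumptions chosen to keep the example short, and class labels are assumed to be strings.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, feature_ids):
    """Return the (feature, threshold) pair with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(X, y, n_sub, rng, depth=0, max_depth=5):
    """Grow a tree; every node examines only a random subset of n_sub features."""
    if depth >= max_depth or len(set(y)) == 1:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    feature_ids = rng.sample(range(len(X[0])), n_sub)
    split = best_split(X, y, feature_ids)
    if split is None:
        return Counter(y).most_common(1)[0][0]
    f, t = split
    left_idx = [i for i, row in enumerate(X) if row[f] <= t]
    right_idx = [i for i, row in enumerate(X) if row[f] > t]
    children = [
        build_tree([X[i] for i in idx], [y[i] for i in idx], n_sub, rng, depth + 1, max_depth)
        for idx in (left_idx, right_idx)
    ]
    return (f, t, children[0], children[1])


def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(X, y, n_trees=25, n_sub=1, seed=0):
    """Train n_trees trees, each on its own bootstrap sample of the training data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], n_sub, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote over the predictions of all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy data (invented): both features separate the two classes.
X = [[1, 2], [2, 4], [3, 1], [4, 3], [6, 8], [7, 6], [8, 9], [9, 7]]
y = ["low", "low", "low", "low", "high", "high", "high", "high"]
forest = fit_forest(X, y, n_trees=25, n_sub=1, seed=0)
print(predict_forest(forest, [0.5, 0.5]), predict_forest(forest, [9.5, 9.5]))
```

Here `n_trees` plays the role of the number of estimators varied in the experiments, and `n_sub` is the size of the random attribute subset examined at each node. The default of 25 trees is only to keep the toy run fast.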

To evaluate the performance of the model, we measure it by using three parameters: Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).

MAE measures the average magnitude of the errors in a set of predictions. It is the average of the absolute differences between the predicted and actual values, where all individual differences have equal weight. The MAE formula is as follows.

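$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad (1)$$

where $y_i$ is the actual expected output, $\hat{y}_i$ is the model's prediction, and $n$ is the number of samples.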

MSE measures the average squared error of the predictions. It computes the squared difference between each prediction and its target, and then averages those values. A higher value means a worse model. MSE is never negative, since the individual prediction errors are squared before they are summed, and a perfect model would have a value of zero.

When a single prediction is very bad, squaring magnifies that error, which can skew the metric toward overestimating how bad the model is. This is particularly problematic with noisy data: even a "perfect" model may have a high MSE, which makes it difficult to judge how well the model performs. Conversely, when all errors are small, the opposite effect occurs and we can underestimate how bad the model is.

MSE represents the difference between the actual and predicted values, obtained by averaging the squared differences over the data set. The MSE formula is as follows.

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \qquad (2)$$

RMSE is a quadratic scoring rule that measures the average magnitude of the error. It is the square root of the average of the squared differences between predictions and actual observations.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \qquad (3)$$

MAE and RMSE both express the average model prediction error in the units of the variable of interest. Both metrics range from 0 to ∞ and are indifferent to the direction of the error. They are negatively oriented scores, which means lower values are better. Taking the square root of the mean squared error has some interesting implications for RMSE. Because the errors are squared before they are averaged, RMSE gives relatively high weight to large errors. This means RMSE is more useful when large errors are particularly undesirable.
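The three metrics are simple enough to compute directly. The sketch below uses plain Python. The numeric encoding of the classes (0 = low, 1 = moderate, 2 = high) and the two sets of predictions are assumptions made for illustration, since the paper does not state how class labels were encoded.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average of |y_i - y_hat_i|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average of (y_i - y_hat_i)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical encoded classes: 0 = low, 1 = moderate, 2 = high.
actual = [0, 1, 2, 2, 1, 0, 1, 2]
two_small_errors = [0, 1, 2, 1, 1, 0, 2, 2]  # two predictions off by one class
one_large_error = [0, 1, 2, 0, 1, 0, 1, 2]   # one prediction off by two classes

for name, predicted in [("two small errors", two_small_errors),
                        ("one large error", one_large_error)]:
    print(name, mae(actual, predicted), mse(actual, predicted),
          round(rmse(actual, predicted), 3))
```

Both prediction sets have the same MAE (0.25), but the single large error doubles the MSE (0.5 against 0.25) and raises the RMSE, which illustrates the heavier weight that the squared metrics give to large errors. The same relation links the reported results: the square root of MSE = 0.035 is approximately 0.187, the reported RMSE.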

IV. RESULT AND DISCUSSION

In this section, the results of processing the dataset using Random Forest are discussed and compared. The classification process begins by determining the proportions of the training and testing datasets.

The data are divided into two datasets: a training dataset and a testing dataset. The simulation is carried out under two conditions: the first uses 80% of the data for training and 20% for testing, and the second uses 85% for training and 15% for testing.
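The two splitting conditions can be sketched as follows. The function `train_test_split` here is a hypothetical helper written for this illustration, not a library call. The paper does not state the number of records or whether the split was shuffled or stratified, so the 100 placeholder records and the seeded shuffle are assumptions.

```python
import random


def train_test_split(records, test_fraction, seed=0):
    """Shuffle a copy of the records and hold out test_fraction of them for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(100))  # placeholder for labelled records (hypothetical size)
for test_fraction in (0.20, 0.15):  # the two conditions used in the study
    train, test = train_test_split(records, test_fraction)
    print(test_fraction, len(train), len(test))
```

Fixing the seed keeps the split reproducible, so that runs with different numbers of trees are compared on the same held-out records.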

Fig. 4. MAE

Fig. 5. MSE

Fig. 6. RMSE

Figures 4-7 show the results of the Random Forest measurements for 20% test data and 80% training data. The experiment is conducted 9 times, with the number of decision trees (estimators) set to 100, 200, 500, 1000, 1500, 2000, 3000, 4000, and 5000. Figure 4 shows that MAE goes down as the number of estimators increases from 100 to 200, but then rises to a peak when the number of estimators is 2000. Beyond 3000 estimators the curve flattens.

Fig. 7. Accuracy

Figures 5 and 6 show that the values of MSE and RMSE increase as the number of estimators goes from 100 to 2000, then decrease slightly and remain flat afterwards. RMSE thus reaches its highest value when the number of estimators equals 2000. Figure 7 presents the accuracy of the model. The highest accuracy (77.3%) is achieved when the number of estimators is 500, and the lowest when the number of estimators is 2000.

Fig. 8. MAE

Figures 8-11 show the results of the Random Forest measurements for 15% test data and 85% training data. The experiment is again conducted 9 times, with the number of decision trees (estimators) set to 100, 200, 500, 1000, 1500, 2000, 3000, 4000, and 5000. Figure 8 shows that MAE goes down as the number of estimators increases from 100 to 200, but then rises to a peak when the number of estimators is 1000. Beyond 2000 estimators the curve flattens.

Figures 9 and 10 show the values of MSE and RMSE, whose curves differ markedly. MSE decreases from 100 to 200 estimators, increases from 200 through 500 to 1000, decreases at 1500, and then increases again from 2000 to 5000. Figure 10 shows the RMSE curve, where the value rises sharply from 100 to 200 estimators and then stays flat from 200 to 5000.

Fig. 9. MSE

Fig. 10. RMSE

Figure 11 presents the accuracy of the model. The accuracy rises as the number of estimators moves from 100 to 200. The curve then drops significantly at 500 and 1000 estimators, and rises slightly again up to 5000 estimators. The model achieves its highest accuracy (94.6%) when the number of estimators is 200.

Fig. 11. Accuracy

V. CONCLUSIONS

This research has built a prediction model for measuring the city health development index in Indonesia. The prediction model is needed by city governments to develop a plan for the coming year based on the performance of previous years.

The prediction model is developed using the Random Forest classification method. The method was chosen for its robust performance in prediction tasks. The model was built and tested on city health data obtained from six provinces in Indonesia.

The results show that the Random Forest model with 15% test data and 200 decision trees gives the best results, with MAE = 0.108, MSE = 0.035, RMSE = 0.187, and accuracy = 94.6%.

ACKNOWLEDGEMENTS

The authors would like to acknowledge the financial support from Directorate for Research and Community Services, Ministry of Research, Technology, and Higher Education, Republic of Indonesia.

REFERENCES

[1] Tim Dinkes Kota Samarinda, “Samarinda City Health Profile”, 2013.

[2] G. Biau, "Analysis of a Random Forests Model", Journal of Machine Learning Research 13, 1063-1095, 2012.

[3] L. Breiman, “Random forests”, Machine Learning, 45(1), 5-32, 2001.

[4] R. Genuer, V. Michel, E. Eger, and B. Thirion, “Random forests based feature selection for decoding fMRI data”. Proceedings Compstat, 267, 1-8, 2010.

[5] G. Biau and E. Scornet, “A random forest guided tour”. Test, 25(2), 197-227, 2016.

[6] R. Díaz-Uriarte and S.A. de Andres, “Gene selection and classification of microarray data using random forest”. BMC Bioinformatics, 7, 13 pages, 2006.

[7] D. Yao, J. Yang, X. Zhan, X. Zhan, and Z. Xie, “A novel random forests-based feature selection method for microarray expression data analysis”. International Journal of Data Mining and Bioinformatics, 13(1), 84-101, 2015.

[8] J. Yang, D. Yao, X. Zhan, and X. Zhan, “Predicting disease risks using feature selection based on random forest and support vector machine”. Proceedings of the 10th International Symposium on Bioinformatics Research and Applications, Zhangjiajie, China, 1-11, 2014.

[9] R. Genuer, J-M Poggi, and C. Tuleau-Malot, “Variable selection using random forests”. Pattern Recognition Letters, 31(14), 2225-2236, 2010.

[10] C. C. Aggarwal, “Data Mining: The Textbook”, Springer International Publishing Switzerland, 2015.

[11] C. Ordonez, “Improving Heart Disease Prediction using Constrained Association Rules”, Technical Seminar Presentation, University of Tokyo, 2004.

[12] F. Le Duff, C. Muntean, M. Cuggia and P. Mabo, “Predicting Survival Causes After Out of Hospital Cardiac Arrest using Data Mining Method”, Studies in Health Technology and Informatics, Vol. 107, No. 2, pp. 1256-1259, 2004.

[13] S. Palaniappan and R. Awang, “Intelligent Heart Disease Prediction System using Data Mining Techniques”, International Journal of Computer Science and Network Security, Vol. 8, No. 8, pp. 1-6, 2008.

[14] T. K. Ho. “Random decision forests”. In Proceedings of the International Conference on Document Analysis and Recognition, pages 278–282, 199