Improved C4.5 Algorithm Using The L’Hospital Rule And Pruning On The Recommendation System


Meilany Nonsi Tentua, Agus Sihabuddin

Abstract: The C4.5 algorithm is a widely used classification method in recommendation systems because it has several advantages. However, it also has disadvantages: 1) it produces nodes with zero or near-zero values that do not contribute to rule generation, and 2) it forms too many nodes, so the generated tree is too large. These weaknesses need to be addressed so that the algorithm runs well on the cases at hand. This article proposes an improvement to the C4.5 algorithm that uses the L’Hospital Rule and pruning (the C4.5 LHP algorithm). In experiments on eight datasets, the C4.5 LHP algorithm achieves higher accuracy (about 1.08% on average) than the C4.5 algorithm and the C4.5 LH algorithm. In terms of execution time, the C4.5 LHP algorithm is also faster than the C4.5 algorithm.

Index Terms: Recommendation system, C4.5 Algorithm, L’Hospital Rule


1 INTRODUCTION

The recommendation system is an information filtering system that handles the problem of information overload [1] by dynamically filtering fragments of information from a large amount of data. Information is generated according to user preferences, interests, or observed behaviors [2]. Widely used classification methods include C4.5, Bayesian networks, the k-nearest neighbor algorithm, neural networks, and support vector machines [3]. The C4.5 algorithm is the most widely used classification method in recommendation systems because it has several advantages: 1) a decision-making scope that was previously complex and very global can be made simple and specific; 2) unnecessary calculations are eliminated, because with the decision tree method examples are tested only against certain criteria or classes; 3) it is flexible in choosing features at different internal nodes, and the selected features distinguish one criterion from the other criteria in the same node [4].

In the training phase, the C4.5 algorithm constructs a decision tree from the training data. In the classification phase, the decision tree is used to predict the class of an unknown case [5]. Tree construction starts with selecting the attribute that yields the smallest decision tree [6] [7], that is, the attribute that best separates objects according to their class. Heuristically, the attribute chosen is the one that produces the "purest" nodes. Purity is measured by the level of impurity, which can be calculated using the concept of entropy. Entropy expresses the impurity of a collection of objects [8]:

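$$\mathrm{Entropy}(S) = E(S) = \sum_{j=1}^{k} -p_j \log_2 p_j \tag{1}$$

where:

S is the dataset,

k is the number of partitions of S,

p_j is the proportion of cases in S that fall in partition j, that is, the count for that class (for example Sum(Yes)) divided by the total number of cases.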

$$\mathrm{Gain}(A) = \mathrm{Entropy}(S) - \sum_{i} \frac{|S_i|}{|S|} \times \mathrm{Entropy}(S_i)$$

where:

S is the dataset,

A is the attribute,

|S_i| is the number of cases in partition i of attribute A,

|S| is the total number of cases,

Entropy(S_i) is the entropy of the cases in partition i.

In the C4.5 algorithm, the attribute selection is done by using the Gain Ratio with the formula of [9]:

$$\text{Gain Ratio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)} \tag{2}$$

where:

A is the attribute,

Gain(A) is the information gain of attribute A,

SplitInfo(A) is the split information of attribute A.

The attribute with the highest Gain Ratio is selected as the test attribute for a node. Here Gain(A) is the information gain, and the gain ratio normalizes it by the split information. SplitInfo expresses the entropy, or potential information, of the split itself [10]:

$$\mathrm{SplitInfo}(S, A) = I(S, A) = -\sum_{i} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \tag{3}$$

where:

S is the dataset,

A is the attribute,

|S_i| is the number of cases in partition i of attribute A.
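The quantities in equations (1) to (3) can be sketched in a few lines of Python. This is only an illustrative sketch for a single categorical attribute, not the authors' implementation; the function names `entropy` and `gain_ratio` and the toy data are assumptions made for the example, and base-2 logarithms are assumed.

```python
import math
from collections import Counter

def entropy(labels):
    """Equation (1): sum over classes of -p_j * log2(p_j)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Equation (2): Gain(A) / SplitInfo(A) for one categorical attribute A."""
    total = len(labels)
    # Partition the class labels by the attribute value of each case.
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # Gain(A) = Entropy(S) - sum_i |S_i|/|S| * Entropy(S_i)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    # Equation (3): SplitInfo = -sum_i |S_i|/|S| * log2(|S_i|/|S|)
    split_info = -sum(len(p) / total * math.log2(len(p) / total)
                      for p in partitions.values())
    # An attribute with a single value has zero split information.
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: the attribute separates the two classes perfectly.
outlook = ["sunny", "sunny", "rain", "rain"]
play = ["yes", "yes", "no", "no"]
print(gain_ratio(outlook, play))
```

An attribute that carries no information about the class gets a gain ratio of zero, so it would never be preferred over an informative attribute when choosing the test for a node.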

Despite its advantages, the algorithm also has several disadvantages: 1) it produces nodes with zero or near-zero values that do not contribute to rule generation, and 2) it forms too many nodes, so the generated tree is too large [11]. These weaknesses need to be addressed so that the algorithm runs well on the cases at hand. Several improvements to the C4.5 algorithm have already been proposed, including the use of bagging [10], Taylor’s series [12], rule-based classification [13], and pruning [14]. This article proposes improving the C4.5 algorithm with the L’Hospital Rule and pruning to overcome the existing weaknesses. The improved algorithm will be referred to as the C4.5 LHP algorithm.

Meilany Nonsi Tentua is a lecturer in Teknik Informatika, Fakultas Teknik, Universitas PGRI Yogyakarta, Indonesia, and is currently pursuing a doctoral degree in the Department of Computer Science and Electronics, FMIPA UGM, Yogyakarta, Indonesia.

2 METHODS

2.1 Improved C4.5 Algorithm

The C4.5 algorithm is first improved by simplifying the gain ratio formula. Assume a dataset with an attribute A that takes two different values. The information of each candidate attribute is calculated, and the one with the largest gain value is selected as the root.

The entropy of the entire dataset S can be expressed as [15]:

$$E(S) = I(p, n) = -\frac{p}{N}\log_2\frac{p}{N} - \frac{n}{N}\log_2\frac{n}{N} \tag{4}$$

where p is the number of positive samples, n is the number of negative samples, and N = p + n.

$$E(A) = \frac{A_1}{N} I(A_{11}, A_{12}) + \frac{A_2}{N} I(A_{21}, A_{22}) \tag{5}$$

where:

A_1: number of samples for which A is positive

A_2: number of samples for which A is negative

A_11: number of samples for which A is positive and the class value is positive

A_12: number of samples for which A is positive and the class value is negative

A_21: number of samples for which A is negative and the class value is positive

A_22: number of samples for which A is negative and the class value is negative

The gain ratio of attribute A is stated as:

$$\text{Gain Ratio}(A) = \frac{\mathrm{Gain}(A)}{I(A)} = \frac{I(p, n) - E(A)}{I(A)} \tag{6}$$

where I(A) = SplitInfo(A) = I(A_1, A_2). Substituting (5) into (6) gives:

$$\text{Gain Ratio}(A) = \frac{I(p, n) - \frac{A_1}{N} I(A_{11}, A_{12}) - \frac{A_2}{N} I(A_{21}, A_{22})}{I(A_1, A_2)} \tag{7}$$

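Expanding every I(·,·) term in (7) with (4) gives:

$$\text{Gain Ratio}(A) = \frac{-\frac{p}{N}\log_2\frac{p}{N} - \frac{n}{N}\log_2\frac{n}{N} + \frac{A_1}{N}\left(\frac{A_{11}}{A_1}\log_2\frac{A_{11}}{A_1} + \frac{A_{12}}{A_1}\log_2\frac{A_{12}}{A_1}\right) + \frac{A_2}{N}\left(\frac{A_{21}}{A_2}\log_2\frac{A_{21}}{A_2} + \frac{A_{22}}{A_2}\log_2\frac{A_{22}}{A_2}\right)}{-\frac{A_1}{N}\log_2\frac{A_1}{N} - \frac{A_2}{N}\log_2\frac{A_2}{N}} \tag{8}$$

Since log_2 x = ln x / ln 2, every term carries the same factor 1/ln 2, which cancels between the numerator and the denominator:

$$\text{Gain Ratio}(A) = \frac{-\frac{p}{N}\ln\frac{p}{N} - \frac{n}{N}\ln\frac{n}{N} + \frac{A_1}{N}\left(\frac{A_{11}}{A_1}\ln\frac{A_{11}}{A_1} + \frac{A_{12}}{A_1}\ln\frac{A_{12}}{A_1}\right) + \frac{A_2}{N}\left(\frac{A_{21}}{A_2}\ln\frac{A_{21}}{A_2} + \frac{A_{22}}{A_2}\ln\frac{A_{22}}{A_2}\right)}{-\frac{A_1}{N}\ln\frac{A_1}{N} - \frac{A_2}{N}\ln\frac{A_2}{N}} \tag{9}$$

If N = p + n, then p/N = 1 − n/N and n/N = 1 − p/N, and (9) becomes:

$$\text{Gain Ratio}(A) = \frac{-\frac{p}{N}\ln\left(1 - \frac{n}{N}\right) - \frac{n}{N}\ln\left(1 - \frac{p}{N}\right) + \frac{A_1}{N}\left(\frac{A_{11}}{A_1}\ln\frac{A_{11}}{A_1} + \frac{A_{12}}{A_1}\ln\frac{A_{12}}{A_1}\right) + \frac{A_2}{N}\left(\frac{A_{21}}{A_2}\ln\frac{A_{21}}{A_2} + \frac{A_{22}}{A_2}\ln\frac{A_{22}}{A_2}\right)}{-\frac{A_1}{N}\ln\frac{A_1}{N} - \frac{A_2}{N}\ln\frac{A_2}{N}} \tag{10}$$

If A_1 = A_11 + A_12, then A_11/A_1 = 1 − A_12/A_1 and A_12/A_1 = 1 − A_11/A_1. If A_2 = A_21 + A_22, then A_21/A_2 = 1 − A_22/A_2 and A_22/A_2 = 1 − A_21/A_2. Likewise, since N = A_1 + A_2, A_1/N = 1 − A_2/N and A_2/N = 1 − A_1/N. Then (10) becomes:

$$\text{Gain Ratio}(A) = \frac{-\frac{p}{N}\ln\left(1 - \frac{n}{N}\right) - \frac{n}{N}\ln\left(1 - \frac{p}{N}\right) + \frac{A_1}{N}\left(\frac{A_{11}}{A_1}\ln\left(1 - \frac{A_{12}}{A_1}\right) + \frac{A_{12}}{A_1}\ln\left(1 - \frac{A_{11}}{A_1}\right)\right) + \frac{A_2}{N}\left(\frac{A_{21}}{A_2}\ln\left(1 - \frac{A_{22}}{A_2}\right) + \frac{A_{22}}{A_2}\ln\left(1 - \frac{A_{21}}{A_2}\right)\right)}{-\frac{A_1}{N}\ln\left(1 - \frac{A_2}{N}\right) - \frac{A_2}{N}\ln\left(1 - \frac{A_1}{N}\right)} \tag{11}$$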

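L’Hospital's Rule states that if

$$\lim_{x \to a} f(x) = \lim_{x \to a} g(x) = 0 \tag{12}$$

then

$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)} \tag{13}$$

Using the L’Hospital Rule:

$$\lim_{x \to 0} \frac{\ln(1 - x)}{x} = \lim_{x \to 0} \frac{-\dfrac{1}{1 - x}}{1} = -1 \tag{14}$$

so that ln(1 − x) ≈ −x for small x. Applying this to every logarithm in (11) gives:

$$\text{Gain Ratio}(A) \approx \frac{\frac{p}{N}\cdot\frac{n}{N} + \frac{n}{N}\cdot\frac{p}{N} - \frac{A_1}{N}\left(\frac{A_{11}}{A_1}\cdot\frac{A_{12}}{A_1} + \frac{A_{12}}{A_1}\cdot\frac{A_{11}}{A_1}\right) - \frac{A_2}{N}\left(\frac{A_{21}}{A_2}\cdot\frac{A_{22}}{A_2} + \frac{A_{22}}{A_2}\cdot\frac{A_{21}}{A_2}\right)}{\frac{A_1}{N}\cdot\frac{A_2}{N} + \frac{A_2}{N}\cdot\frac{A_1}{N}} \tag{15}$$

Collecting like terms, we get:

$$\text{Gain Ratio}(A) \approx \frac{\dfrac{2pn}{N^2} - \dfrac{2A_{11}A_{12}}{N A_1} - \dfrac{2A_{21}A_{22}}{N A_2}}{\dfrac{2A_1 A_2}{N^2}} \tag{16}$$

$$\text{Gain Ratio}(A) \approx \frac{\dfrac{pn}{N^2} - \dfrac{A_{11}A_{12}}{N A_1} - \dfrac{A_{21}A_{22}}{N A_2}}{\dfrac{A_1 A_2}{N^2}} \tag{17}$$

$$\text{Gain Ratio}(A) \approx \frac{\dfrac{pn}{N} - \dfrac{A_{11}A_{12}}{A_1} - \dfrac{A_{21}A_{22}}{A_2}}{\dfrac{A_1 A_2}{N}} \tag{18}$$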
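To make the effect of the simplification concrete, the sketch below compares the logarithm-based gain ratio with a logarithm-free form for a two-valued attribute. It assumes that replacing ln(1 − x) with −x reduces the gain ratio to (pn/N − A11·A12/A1 − A21·A22/A2) / (A1·A2/N); that expression, the function names, and the sample counts are assumptions made for illustration and are not taken from the authors' code.

```python
import math

def info(a, b):
    """I(a, b) from equation (4); terms with a zero count contribute nothing."""
    total = a + b
    return -sum((c / total) * math.log2(c / total) for c in (a, b) if c > 0)

def exact_gain_ratio(a11, a12, a21, a22):
    """Gain ratio with logarithms, for a binary attribute and a binary class."""
    a1, a2 = a11 + a12, a21 + a22          # samples per attribute value
    p, n = a11 + a21, a12 + a22            # positive / negative class counts
    total = a1 + a2
    expected = a1 / total * info(a11, a12) + a2 / total * info(a21, a22)
    return (info(p, n) - expected) / info(a1, a2)

def lh_gain_ratio(a11, a12, a21, a22):
    """Logarithm-free form assumed to follow from ln(1 - x) ~ -x."""
    a1, a2 = a11 + a12, a21 + a22
    p, n = a11 + a21, a12 + a22
    total = a1 + a2
    numerator = p * n / total - a11 * a12 / a1 - a21 * a22 / a2
    return numerator / (a1 * a2 / total)

# Hypothetical counts: (A11, A12, A21, A22); both attribute values must occur.
for counts in [(5, 0, 0, 5), (5, 5, 5, 5), (8, 2, 3, 7)]:
    print(counts, exact_gain_ratio(*counts), lh_gain_ratio(*counts))
```

The two functions agree on a perfect split and on an uninformative split, and generally differ in between; the simplified form only needs multiplications and divisions, which is where the saving in computation comes from.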

2.2 Pruning

The second step in improving the C4.5 algorithm is pruning. Pruning a tree means cutting off branches so as to improve accuracy and reduce overfitting. The pruning algorithm used is Reduced Error Pruning (REP), which is one of the post-pruning algorithms [16]. This algorithm divides the data into two parts, training data and test data. The training data are used to build the decision tree, while the test data are used to calculate the error rate of the tree after it has been trimmed. REP works by trimming internal nodes from the bottom of the tree upward. A node is pruned by replacing its attribute test with a leaf labeled with the dominant class. The test data are then classified with the pruned rules and the error rate is calculated. The test data are also classified with the initial rules, that is, the rules formed before the tree was trimmed, and the error rate is calculated again. If the error rate after pruning is smaller, the pruning is kept [17].

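$$e = \frac{r + \dfrac{z^2}{2N} + z\sqrt{\dfrac{r}{N} - \dfrac{r^2}{N} + \dfrac{z^2}{4N^2}}}{1 + \dfrac{z^2}{N}} \tag{19}$$

where:

r = error rate

N = total number of samples

z = Φ⁻¹(c), with c the confidence level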
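The pruning procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, the dictionary-based cases, and the function names are hypothetical, and `pessimistic_error` assumes the error estimate is the usual upper confidence bound on an observed error rate r over N samples with z = Φ⁻¹(c).

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Case = Tuple[Dict[str, str], str]  # (attribute values, true class)

@dataclass
class Node:
    label: str                          # dominant training class at this node
    attribute: Optional[str] = None     # None marks a leaf
    children: Dict[str, "Node"] = field(default_factory=dict)

def predict(node: Node, x: Dict[str, str]) -> str:
    """Follow attribute tests down to a leaf and return its class."""
    while node.attribute is not None:
        child = node.children.get(x.get(node.attribute))
        if child is None:               # unseen value: use the dominant class
            break
        node = child
    return node.label

def error_rate(root: Node, cases: List[Case]) -> float:
    """Fraction of (non-empty) test cases that the tree misclassifies."""
    return sum(predict(root, x) != y for x, y in cases) / len(cases)

def pessimistic_error(r: float, n: int, z: float) -> float:
    """Assumed form of the error estimate: upper confidence bound on r."""
    return ((r + z * z / (2 * n)
             + z * math.sqrt(r / n - r * r / n + z * z / (4 * n * n)))
            / (1 + z * z / n))

def reduced_error_prune(root: Node, node: Node, test_cases: List[Case]) -> None:
    """Bottom-up REP: replace a subtree by a leaf if the test error gets smaller."""
    if node.attribute is None:
        return
    for child in node.children.values():
        reduced_error_prune(root, child, test_cases)
    before = error_rate(root, test_cases)
    saved = (node.attribute, node.children)
    node.attribute, node.children = None, {}    # try the dominant-class leaf
    after = error_rate(root, test_cases)
    if after >= before:                         # not smaller: undo the pruning
        node.attribute, node.children = saved

# Hypothetical tree: the test on "b" under a = "x" is harmful on the test data.
inner = Node("yes", "b", {"p": Node("yes"), "q": Node("no")})
tree = Node("no", "a", {"x": inner, "y": Node("no")})
test_data = [({"a": "x", "b": "p"}, "yes"),
             ({"a": "x", "b": "q"}, "yes"),
             ({"a": "y", "b": "p"}, "no")]
reduced_error_prune(tree, tree, test_data)
print(error_rate(tree, test_data))
```

The sketch keeps a pruning step only when the error rate becomes strictly smaller, following the wording above; many descriptions of REP also prune when the error rate is unchanged, which yields smaller trees.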

3 RESULT AND DISCUSSION

The time needed for a calculation grows linearly with the complexity of the equation being evaluated. The improved C4.5 LH algorithm therefore yields a simpler calculation of the gain ratio (16) than the gain ratio without improvement (8). We combine C4.5 LH with pruning (C4.5 LHP) to obtain higher accuracy. The accuracy and speed of the C4.5 LHP algorithm are tested on datasets taken from the UCI Machine Learning Repository. The datasets differ in size: Nephritis (120 instances), Bladder Inflammation (120 instances), Iris (150 instances), SPECT Heart (267 instances), Vertebral Column 3 Class (310 instances), Vertebral Column 2 Class (310 instances), Abalone (4177 instances), and Spambase (4601 instances). Each dataset has between 4 and 57 attributes of numeric, categorical, or mixed type. Accuracy and the number of rules are compared across the C4.5 algorithm, the improved C4.5 LH algorithm, and the C4.5 LHP algorithm. Speed is compared by the number of repetitions that occur in the calculations of each algorithm.

Table 1. Comparison of the accuracy of the C4.5 algorithm, the C4.5 LH algorithm, and the C4.5 LHP algorithm

| Dataset | Number of Instances | Number of Attributes | Attribute Types | Accuracy (%) C4.5 | Accuracy (%) C4.5 LH | Accuracy (%) C4.5 LHP | Difference (%) |
|---|---|---|---|---|---|---|---|
| Nephritis | 120 | 8 | Numerical, categorical | 100 | 100 | 100 | 0 |
| Inflammation Urinary Bladder | 120 | 8 | Numerical, categorical | 100 | 100 | 100 | 0 |
| Iris | 150 | 4 | Numerical | 93.75 | 93.75 | 98 | 4.25 |
| SPECT Heart | 267 | 22 | Categorical | 83.75 | 83.75 | 83.75 | 0 |
| Vertebral Column 3 Class | 310 | 7 | Numerical | 80.17 | 80.17 | 80.17 | 0 |
| Vertebral Column 2 Class | 310 | 7 | Numerical | 79.31 | 79.31 | 81.47 | 2.16 |
| Abalone | 4177 | 9 | Numerical, categorical | 64.76 | 64.76 | 65.86 | 1.1 |
| Spambase | 4601 | 57 | Numerical | 91.49 | 91.49 | 92.63 | 1.14 |

Table 1 compares the accuracy of the C4.5 algorithm, the improved C4.5 LH algorithm, and the C4.5 LHP algorithm. Four of the data sets, namely Iris, Vertebral Column 2 Class, Abalone, and Spambase, have higher accuracy with the C4.5 LHP algorithm. The largest difference, 4.25%, occurs on the Iris data set.
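As a quick check of the figures in Table 1, the per-dataset differences and their mean can be recomputed directly. The snippet below only restates the accuracies reported in the table; the variable names are chosen for illustration.

```python
# Accuracy (%) from Table 1 as (C4.5, C4.5 LHP) pairs.
accuracy = {
    "Nephritis": (100.0, 100.0),
    "Inflammation Urinary Bladder": (100.0, 100.0),
    "Iris": (93.75, 98.0),
    "SPECT Heart": (83.75, 83.75),
    "Vertebral Column 3 Class": (80.17, 80.17),
    "Vertebral Column 2 Class": (79.31, 81.47),
    "Abalone": (64.76, 65.86),
    "Spambase": (91.49, 92.63),
}

# Difference in percentage points between C4.5 LHP and C4.5.
differences = {name: lhp - base for name, (base, lhp) in accuracy.items()}
mean_gain = sum(differences.values()) / len(differences)

print(max(differences, key=differences.get))  # dataset with the largest gain
print(f"{mean_gain:.2f}")                     # mean gain over the eight datasets
```

The mean of the eight differences is about 1.08 percentage points, which matches the average improvement quoted in the abstract.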

Across the eight datasets there is no difference in accuracy between the C4.5 algorithm and the C4.5 LH algorithm, because the two algorithms produce the same number of rules. To increase the accuracy of the C4.5 LH algorithm, pruning is therefore used to remove the useless rules.

Table 2. Comparison of the number of rules between the C4.5 algorithm, the C4.5 LH algorithm, and the C4.5 LHP algorithm

| Dataset | Number of Instances | Number of Attributes | Attribute Types | Number of rules C4.5 | Number of rules C4.5+LH | Number of rules C4.5+LHP | Reduced |
|---|---|---|---|---|---|---|---|
| Nephritis | 120 | 8 | Numerical, categorical | 6 | 6 | 4 | 2 |
| Inflammation Urinary Bladder | 120 | 8 | Numerical, categorical | 6 | 6 | 4 | 2 |
| Iris | 150 | 4 | Numerical | 5 | 5 | 5 | 0 |
| SPECT Heart | 267 | 22 | Categorical | 18 | 18 | 6 | 2 |
| Vertebral Column 3 Class | 310 | 7 | Numerical | 35 | 35 | 21 | 11 |
| Vertebral Column 2 Class | 310 | 7 | Numerical | 32 | 32 | 23 | 9 |
| Abalone | 4177 | 9 | Numerical, categorical | 865 | 865 | 508 | 357 |
| Spambase | 4601 | 57 | Numerical | 246 | 246 | 113 | 133 |

Table 2 shows the number of rules formed by the three algorithms. The C4.5 algorithm and the improved C4.5 algorithm using L’Hospital form the same number of rules, because the L’Hospital improvement only shortens the calculation of the gain ratio. A difference appears when the improved C4.5 algorithm using L’Hospital and pruning is applied: all data sets except Iris have fewer rules than with the C4.5 algorithm and the improved C4.5 algorithm using L’Hospital. With L’Hospital and pruning, the number of rules is reduced by between 25% and 59% relative to the improved C4.5 algorithm using L’Hospital. For Iris, the data set with four attributes, pruning does not reduce the number of rules. This suggests that when a data set has four or fewer attributes, pruning makes no difference.

Figure 1. Execution time between C4.5 algorithm and C4.5 LHP algorithm (x-axis: Data Set, 1 to 8; y-axis: Execution Time, 0 to 350000; legend: C4.5, C4.5 LH)

Figure 1 shows the speed test for C4.5, C4.5 LH, and C4.5 LHP. The C4.5 algorithm takes longer than the C4.5 LH algorithm. The C4.5 LHP algorithm takes longer than the C4.5 LH algorithm, because C4.5 LHP needs additional execution time for pruning. It can therefore be concluded that the C4.5 LHP algorithm reduces the execution time relative to C4.5, even though it takes longer than C4.5 LH. Figure 1 also shows that, for all three algorithms, datasets with a small amount of data show no visible difference in time. A striking time difference occurs when the data set is large.

4 CONCLUSION

The C4.5 LHP algorithm is an attempt to eliminate weaknesses in the C4.5 algorithm. The L’Hospital Rule is used in the formulation of the Gain Ratio so that the number of repetitions is reduced and less time is needed. Pruning is used to reduce the excessive number of nodes. The experiments conducted on eight datasets show that the C4.5 LHP algorithm has a higher level of accuracy and is faster than the C4.5 algorithm, most visibly on the larger datasets.

REFERENCES

[1] J. A. Konstan and J. Riedl, “Recommender systems: from algorithms to user experience,” User Model User-Adap Inter, vol. 22, no. 1, pp. 101–123, Apr. 2012.

[2] F. O. Isinkaye, Y. O. Folajimi, and B. A. Ojokoh, “Recommendation systems: Principles, methods and evaluation,” Egyptian Informatics Journal, vol. 16, no. 3, pp. 261–273, Nov. 2015.

[3] S. Neelamegam and D. E. Ramaraj, “Classification algorithm in Data mining: An Overview,” International Journal of P2P Network Trends and Technology (IJPTT), vol. 3, no. 5, p. 5, 2013.

[4] B. N. Lakshmi, T. S. Indumathi, and N. Ravi, “A comparative study of classification algorithms for predicting gestational risks in pregnant women,” in 2015 International Conference on Computers, Communications, and Systems (ICCCS), 2015, pp. 42–46.

[5] Jiawei Han, “Data Mining: Concepts and Techniques,” Morgan Kaufmann Publishers, USA, 2012.

[6] B. Hssina, A. Merbouha, H. Ezzikouri, and M. Erritali, “A comparative study of decision tree ID3 and C4.5,” International Journal of Advanced Computer Science and Applications, vol. 4, no. 2, 2014.

[7] R. Rahim, E. Buulolo, N. Silalahi, and Fadlina, “C4.5 Algorithm To Predict The Impact Of The Earthquake,” 2017.

[8] Y. Song and Y. Lu, “Decision tree methods: applications for classification and prediction,” Shanghai Arch Psychiatry, vol. 27, no. 2, pp. 130–135, Apr. 2015.

[9] Deepti Juneja, Sachin Sharma, Anupriya Jain, and Seema Sharma, “A Novel Approach to Construct Decision Tree Using Quick C4.5 Algorithm,” Oriental Journal of Computer Science and Technology, vol. 3, no. 2, pp. 305–310, 2010.

[10] S.-J. Lee, Z. Xu, T. Li, and Y. Yang, “A novel bagging C4.5 algorithm based on wrapper feature selection for supporting wise clinical decision making,” Journal of Biomedical Informatics, vol. 78, pp. 144–155, Feb. 2018.

[11] M. M. Mazid, “Improved C4.5 Algorithm for Rule Based Classification,” Recent Advances In Artificial Intelligence, Knowledge Engineering And Data Bases, p. 7, ISBN: 978-960-474-154-0, 2014.

[12] Sinam, Idriss, “An Improved C4.5 Model Classification Algorithm Based on Taylor’s Series,” Jordanian Journal of Computers and Information Technology, vol. 05, no. 01, Apr. 2019.

[13] Mazid, Mohammed M, “Improved C4.5 Algorithm for Rule Based Classification,” Recent Advances In Artificial Intelligence, Knowledge Engineering And Data Bases, ISSN 1790-5109, 05 June 2014.

[14] M. A. Muslim, A. J. Herowati, E. Sugiharti, and B. Prasetiyo, “Application of the pessimistic pruning to increase the accuracy of C4.5 algorithm in diagnosing chronic kidney disease,” Journal of Physics: Conference Series, vol. 983, p. 012062, Mar. 2018.

[15] B. Z. Yahaya, L. J. Muhammad, N. Abdulganiyyu, F. S. Ishaq, and Y. Atomsa, “An Improved C4.5 Algorithm Using L’ Hospital Rule for Large Dataset,” Indian Journal of Science and Technology, vol. 11, no. 47, Dec. 2018.

[16] N. D. Oye and E. R. Akanmode, “Prediction of Poultry Yield Using Data Mining Techniques,” International Journal of Innovation Engineering and Science Research, vol. 2, no. 4, Aug. 2018.

[17] G. M. Bressan, B. C. F. de Azevedo, and E. A. S. Lizzi, “A