• No results found

Cost sensitive Bayesian network learning using sampling

N/A
N/A
Protected

Academic year: 2020

Share "Cost sensitive Bayesian network learning using sampling"

Copied!
11
0
0

Loading.... (view fulltext now)

Full text

(1)

Cost­sensitive Bayesian network learning 

using sampling

Nashnush, E and Vadera, S

http://dx.doi.org/10.1007/978­3­319­07692­8_44

Title

Cost­sensitive Bayesian network learning using sampling

Authors

Nashnush, E and Vadera, S

Type

Book Section

URL

This version is available at: http://usir.salford.ac.uk/34599/

Published Date

2014

USIR is a digital collection of the research output of the University of Salford. Where copyright 

permits, full text material held in the repository is made freely available online and can be read, 

downloaded and copied for non­commercial private study or research purposes. Please check the 

manuscript for any further copyright restrictions.

For more information, including our policy and submission procedure, please

(2)

Cost-Sensitive Bayesian Network Learning using

Sampling

Eman Nashnush1 , Sunil Vadera1

1The school of Computing, Science and Engineering, Salford university , Manchester, UK.

[email protected], [email protected]

Abstract. A significant advance in recent years has been the development of

cost-sensitive decision tree learners, recognising that real world classification problems need to take account of costs of misclassification and not just focus on accuracy. The literature contains well over 50 cost-sensitive decision tree induction algorithms, each with varying performance profiles. Obtaining good Bayesian networks can be challenging and hence several algorithms have been proposed for learning their structure and parameters from data. However, most of these algorithms focus on learning Bayesian networks that aim to maximise the accuracy of classifications. Hence an obvious question that arises is whether it is possible to develop cost-sensitive Bayesian networks and whether they would perform better than cost-sensitive decision trees for minimising classification cost? This paper explores this question by developing a new Bayesian network learning algorithm based on changing the data distribution to reflect the costs of misclassification. The proposed method is explored by conducting experiments on over 20 data sets. The results show that this approach produces good results in comparison to more complex cost-sensitive decision tree algorithms.

Keywords: Cost-sensitive classification, Bayesian Learning, Decision Trees.

1

Introduction

Classification is one of the most important methods in data mining; playing an essential role in data analysis and pattern recognition, and requiring the construction of a classifier. The classifier can predict a class label for an unseen instance from a set of attributes. However, the induction of classifiers from the data sets of pre-classified instances is a central problem in machine learning [1]. Therefore, several methods and algorithms have been introduced, such as decision trees, decision graphs, Bayesian networks, neural networks, and decision rules, etc. Over the last decade, graphical models have become one of the most popular tools to structure uncertain knowledge. Furthermore, over the last few years, Bayesian networks have become very popular and have been successfully applied to create consistent probabilistic representations of uncertain knowledge in several fields [2].

(3)

categories: the amending approach (changing the classifier to a transparent box) and the sampling approach (using the classifier as a black box). Among all the available cost-sensitive learning algorithms, most of the work has focused on decision tree learning, with very few studies considering the use of Bayesian networks for cost-sensitive classification.

Therefore, this paper aims to explore the use of Bayesian networks (BNs) for cost-sensitive classification. During this paper, a new method known as the Cost-Sensitive Bayesian Network (CS-BN) algorithm, which uses a sampling approach to induce cost-sensitive Bayesian networks, is developed and compared with other, more common approaches such as cost-sensitive decision trees. This paper is organized as follows: in section 2 we will provide a number of definitions and background information on cost-sensitive learning algorithms. Section 3 will introduce some of the previous work on the sampling approach. In section 4, we will present our method for converting the existing BN algorithm into a CS-BN by changing the number of examples to reflect the costs. In section 5 we present the results obtained by carrying out an empirical evaluation on data from the UCI repository. Finally, section 6 will provide a conclusion, along with a summary of the main contribution of this paper.

2

Cost-sensitive learning perspective and overview

A good cost-sensitive classifier should be able to predict the class of an example that leads to the lowest expected cost, where the expectation is computed after applying the classifier by using the expected cost function, as shown in the following equation [6,21]:

( | ) ∑ ( ) ( | ) ( )

(4)

Fig.1. Cost-sensitive learning category (Shend and Ling [7])

As Zadrozny [6] points out, wrapper methods (Black Box), deal with a classifier as a closed box, without changing the classifier behaviour, and can work for any classifier. In contrast, direct methods (Transparent Box), require knowledge of the particular learning algorithm, and can also work on the classifier itself by changing its structure to include the costs.

3

Review of previous work on sampling approach

Most studies regarding cost-sensitive learning have used direct methods or sampling methods, and most have focused on decision tree learning. This section briefly reviews some of these methods. In addition, this section describes different methods of cost-sensitive learning by changing the data distribution to involve costs and solve an unbalanced data distribution problem, where, for example, the number of negative examples is significantly less than the number of positive examples. Several literature reviews show different methods, where some of them amend the number of negative examples (over-sampling); some of them change the number of positive examples (under-sampling); a few of them use the “SMOTE” (Synthetic Minority Over-Sampling Technique) algorithm that tackles the imbalanced problem by generating synthetic minority class examples [8]; and others use a “Folk Theorem” [5, 21] that amends the distribution according to the cost of misclassification.

Kubat and Matwin [12] used one side selection by under-sampling the majority class while keeping the original population of the minority class. As Elkan [21] pointed out in 2001, changing the balance of negative and positive training examples will affect classification algorithms. Ling and Li [13] combined over-sampling with under-sampling to measure the performance of a classifier. Domingos [14] introduced the MetaCost algorithm which is based on sampling with labeling and bagging. MetaCost uses the resampling with replacement to create a different sample

Cost-sensitive direct learning Cost-sensitive meta- learning (wrapper) Cost-sensitive learning algorithms

GA method ICET [Turney 1995]

Cost-sensitive Decision trees [Drummond and Holte, 2000; Ling et al, 2004]

Sampling Non-sampling

Costing [Zadrozny,B et al 2003]

Threshold Weighting

Relabeling

MetaCost [Domingos, 1999]

CostSensitiveClassifier (CSC) [Witten & Frank, 2005]

C4.5CS [Ting 1998] Theoretical threshold [Elkan, 2001]

(5)
[image:5.595.197.419.272.403.2]

size, then estimates each example in the same sample size by voting each example in different samples, where the number of instances in each resamples is smaller than the training size, and then applies an equation (1) to re-label each training example with the optimal class estimation. Finally, it reapplies the classifier again, on the new relabelled training data set [15]. Figure 2 summarises the MetaCost algorithm [15]. Domingos concluded that this algorithm provides goods results on large data sets. In addition, most researchers have dealt with this problem by changing the data distributions to reflect the costs, though most of them utilize a decision tree learner as a base learner, and the reader is referred Lomax and Vadera [4] for a comprehensive survey of cost sensitive decision tree algorithms for details.

Fig. 2. The MetaCost system [15]

4

New Cost-Sensitive Bayesian Network learning algorithm via

the distributed sampling approach

A survey of the literature shows that, to date, there are very few publications regarding cost sensitive Bayesian networks (CS-BNs), but plenty on cost-sensitive decision tree learning. This section presents a sampling approach used to develop CS-BNs and presents the use of distributed sampling to take account of misclassification costs and reduce the number of errors. Thus, the compelling question, given the different class distributions, is: what is the correct distribution for a learning algorithm?

(6)

4.1 Use of the Folk Theorem for CS-BNs

This method can be applied to any cost-insensitive classification algorithm to form a cost-sensitive classification algorithm. This method can be conducted by reweighing the instances from the training example and then using that weight on the classification algorithm. The Folk Theorem is used to change the data distribution to reflect the costs. Zadroznyet al., [5] stated that "if the new examples are drawn from the old distribution, then optimal error rate classifiers for the new distributions are optimal cost minimizes for data drawn from original distribution." This is shown in the following equation (2) [5]:

( )

[ ]

( ) ( )

Where, the new distribution D' = factor * Old distribution D; x is instance; y is the class label; and C is the cost according to misclassified instance x. Technically, the optimal error rate classifiers from D' are the optimal cost minimizers from the data, which have been drawn from D. This theorem creates new distribution from the old distribution by multiplying old distribution with a factor proportional to the relative cost of each example, and the new distribution will be adapted with that cost. Therefore, this method enables the classifier to obtain the expected cost minimization from the original distribution and, in the worst case scenario; this method can be guaranteed a classifier to provide a good approximate cost minimization for any new sample.

However, there are different types of BNs, as well as methods for learning them. Given their efficiency compared to full networks, we used a search algorithm to construct Tree Augmented Naive Bayes Networks (TANs), along with Minimum Description Length (MDL), which was introduced by Fayyad and Irani [19] to calculate the score of information between links in a tree.

In our experiments, we attempted to change the proportions of instances (samples) in each class label, according to its cost, by using the above Folk Theorem [5]. In the current experiment, we used a constant cost of 1:4, where we assigned the common majority class cost to 1 and other, minority class cost to 4. The following steps were conducted with the CS-BN by using a sampling approach:

Splitting: Data are split into a training set and testing set. The training set uses 75% of the original data, while the testing set uses 25% of the original data.

Cost proportion: According to cost proportions, the new data distribution should

be calculated as being equal to these proportions. For instance, if the cost of wrongly classifying a sick patient as healthy is £20 and the cost of misclassifying a healthy patient as sick is £2, then the cost proportion of the sick class will be 20/22=0.90. In our experiments we used cost proportion by assigning rare class cost to 4 and common class cost to 1. Thus, the cost proportion in our algorithm would be 0.8 and 0.2 respectively, based on equation (3):

( )

(7)

Changing proportion: This involves changing the training data distributions according to the cost ratio of each class. For example, when the costs are 1:4, the new proportions on the training set for each class will be 20% and 80% respectively.

There are different methods that can be used to achieve sampling. During our research, we used two methods, as discussed in section 3 of this paper. These methods were under-sampling and over-sampling. Obviously, where the new proportion was less than the original proportion, we used under-sampling (without replacement) to delete some of the examples in the frequent class. On the other hand, if the new proportion was greater than the original proportion, we used over-sampling (with replacement) by making a random generation of new instances which belonged to the rare class, and increasing the number of examples. As a result, the training data required further resampling according to their costs. Finally, we used the original BN classifier on the training data, followed by using the testing set with the original distribution (without changing any instances) to evaluate the training model.

However, Figure 3 presents the pseudo-code of our method (i.e. CS-BN with sampling approach):

Fig. 3. Cost-Sensitive Bayesian Network Algorithm by sampling

4.2 Experiment

Our experiment demonstrates how changing the distribution of data will affect the performance and cost of a Bayesian classifier. We experimented with 24 data sets from the UCI repository [20]. To evaluate the performance of our proposed method, we used the original testing distribution. An evaluation was carried out in order to compare CS-BN with existing algorithms implemented in WEKA: (i) Original Bayes Net (that implemented by TAN) (Friedman et al. [1], Version 8); (ii) Decision Tree Algorithm J48 (which is their implementation of C4.5, Version 8); and (iii) MetaCost with J48 as the base classifier [14](iv) Naive Bayes. Table 1 presents the results of the CS-BN algorithm via changing distributions (Black Box), and the original BN algorithm. It also shows the comparison between the original Bayes Net (TAN), existing algorithm (decision tree J48), MetaCost with j48 classifier, and NB. The proposed algorithm produced lower costs for cost matrix 1 and 4 on most of the data set. In our experiment, we noticed that number of False Negative (rare) with our Black Box method was less than number of False Negative (rare) of the existing BN algorithm; thus, the total cost will be around 6121 units, and we reduce FN in all datasets.

CS-BN via sampling approach:

1. Divide dataset into 75% of instances for training, and 25%

for testing. With the same class distributions.

2. Changing the data distribution according to the cost

proportions in each class,

3. Using Bayesian network algorithm (TAN to learn structure).

(8)
(9)
[image:9.595.90.530.51.418.2]

Fig. 4.Expected cost of CS-BN via changing the distributions and existing algorithms

.

Fig. 5. Accuracy of CS-BN via changing the distributions and existing algorithms.

0 10000

adult

0 200

horse-colic 0 200

austra

0 100

bank

0 50

breast

0 200

bupa_liver_diorder 0 200

crx

0 200

diabetes 0 500

german

0 1000

gymexamg

0 50

heart

0 50

hepati

0 100

horse 0 20

Hoslem

0 50

hypo

0 50

iono

0 10

labor

0 200

pima

0 100

sonar

0 500

spambase

0 2000

supermarket 0 500

tic-tac 0 50

unbalanced 0 20

vote 0

5

weather

0 20 40 60 80 100 120

ad

u

lt

h

o

rs

e

-c

o

lic

au

stra bank

b

re

as

t

b

u

p

a_

liv

er_

d

io

rd

e

r

crx

d

iab

etes

ge

rm

an

gy

m

exa

m

g

h

ear

t

h

ep

at

i

h

o

rs

e

H

o

sl

em

h

yp

o

ion

o

lab

o

r

p

im

a

so

n

ar

sp

am

b

ase

su

p

e

rm

ar

ke

t

tic-ta

c

u

n

b

ala

n

ce

d

vo

te

w

eat

h

(10)

5

Results and discussion

This experiment shows that the number of misclassifications of rare class (more expensive) are always less than the number of misclassifications for the rare class in the original TAN algorithm for most of the data. Thus, the results are always better in terms of cost, as we can see in Table 1. Furthermore, as shown in Figure 4, for most of the data sets, the changing proportion method (CS-BN via sampling) gives good results compared to the original TAN, MetaCost, Decision Tree (j48), and Naïve Bayes(NB). On the other hand, in Figure 5, it is shown that the accuracy, in most cases, is a little lower than the original TAN algorithm.

As consequence, changing the data distributions before applying TAN classifier yields good results in most data; especially if the data are not very highly skewed to one class. Therefore, the expected cost of using our experiments will provide a reduction of misclassification costs, compared to the original algorithm, which does not use this method. Therefore, we believe that the proposed CS-BN approach of changing the data distributions will produce good results in terms of cost and accuracy.

6

Conclusion

Although much work has been conducted on the development of cost-sensitive decision tree learning, little has been conducted on assessing whether other classifiers, such as Bayesian networks, can lead to better results. Therefore, taking into account work with the folk theorem [22,5], a new Black Box method, based on amending the distribution of examples to reflect the costs of misclassification, was applied in order to develop cost-sensitive Bayesian networks. A preliminary experiment, amending the distributions of TAN, has been carried out on several datasets previously studied by various researchers using different methods.

Our CS-BN with sampling approach has been evaluated and compared with MetaCost+J4.8, standard decision tree (J48), standard Bayesian networks approaches, and standard

Naïve Bayes(NB

. The results for over 25 data sets show that the use of sampling yields better results than the current leading approach; namely, the use of MetaCost+J4.8.

In conclusion, our new CS-BN algorithm has been developed and explored by using a Black Box approach with sampling that amends the data distribution to take account of costs shows promising results in comparison to existing cost-sensitive tree induction algorithms.

REFERENCES

[1] Friedman, Jerome H. "Data Mining and Statistics: What's the connection?." Computing Science

and Statistics 29, no. 1 (1998):pp. 3-9.

[2] Pearl, Judea. "Embracing Causality in Formal Reasoning." In AAAI, pp. 369-373. 1987.

(11)

79-86.

[4] Lomax, Susan, and Sunil Vadera. "A survey of cost-sensitive decision tree induction

algorithms." ACM Computing Surveys (CSUR) 45, no. 2 (2013): pp 16:1-16:35.

[5] Zadrozny, Bianca, John Langford, and Naoki Abe. "Cost-sensitive learning by

cost-proportionate example weighting." In Data Mining, 2003. ICDM 2003. Third IEEE

International Conference on, pp. 435-442. IEEE, 2003a.

[6] Zadrozny, Bianca, John Langford, and Naoki Abe. "A simple method for cost-sensitive

learning." IBM Technical Report RC22666, ( 2003b).

[7] Sheng, Victor S., and Charles X. Ling. "Roulette sampling for cost-sensitive learning."

In Machine Learning: ECML 2007, pp. 724-731. Springer Berlin Heidelberg, 2007.

[8] Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer.

"SMOTE: synthetic minority over-sampling technique." pp. 1106.1813 (2011).

[9] Ma, Guang-Zhi, Enmin Song, Chih-Cheng Hung, Li Su, and Dong-Shan Huang. "Multiple

costs based decision making with back-propagation neural networks." Decision Support

Systems 52, no. 3 (2012):pp. 657-663.

[10] Maloof, Marcus A. "Learning when data sets are imbalanced and when costs are unequal and

unknown." In ICML-2003 workshop on learning from imbalanced data sets II, vol. 2, pp. 2-1.

2003.

[11] Drummond, Chris, and Robert C. Holte. "C4. 5, class imbalance, and cost sensitivity: why

under-sampling beats over-sampling." In Workshop on Learning from Imbalanced Datasets II,

vol. pp.11. 2003.

[12] Kubat, Miroslav, and Stan Matwin. "Addressing the curse of imbalanced training sets: one-sided selection." In ICML, In Proceedings of the Fourteenth International Conference on Machine Learning, vol. 97, pp. 179-186. 1997.

[13] Ling, Charles X., and Chenghui Li. "Data Mining for Direct Marketing: Problems and

Solutions." In KDD, vol. 98, pp. 73-79. 1998.

[14] Domingos, Pedro. "Metacost: A general method for making classifiers cost-sensitive."

In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery

and data mining, pp. 155-164. ACM, 1999.

[15] Vadera, Sunil. "CSNL: A cost-sensitive non-linear decision tree algorithm." ACM Transactions

on Knowledge Discovery from Data (TKDD) 4, no. 2 (2010): 6.

[16] Pazzani, Michael J., Christopher J. Merz, Patrick M. Murphy, Kamal Ali, Timothy Hume, and

Clifford Brunk. "Reducing Misclassification Costs." In ICML, vol. 94, pp. 217-225. 1994.

[17] Batista, Gustavo EAPA, Ronaldo C. Prati, and Maria Carolina Monard. "A study of the

behavior of several methods for balancing machine learning training data." ACM Sigkdd

Explorations Newsletter 6, no. 1 (2004):pp. 20-29.

[18] Agarwal, Alekh. "Selective sampling algorithms for cost-sensitive multiclass prediction."

In Proceedings of the 30th International Conference on Machine Learning, pp. 1220-1228.

2013.

[19] Fayyad, Usama, and Keki Irani. “Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning.” , Proceedings of the International Joint Conference on Uncertainty in AI ,(1993): pp. 1022-1027.

[20] Asuncion, Arthur, and David Newman. "UCI machine learning repository." (2007).

http://archive.ics.uci.edu/ml/.

[21] Elkan, Charles. "The foundations of cost-sensitive learning." In International joint conference

on artificial intelligence, vol. 17, no. 1, pp. 973-978. LAWRENCE ERLBAUM ASSOCIATES

Figure

Fig. 2. The MetaCost system [15]
Fig. 4.Expected cost of CS-BN via changing the distributions and existing algorithms

References

Related documents

2.09 Number of sampling/survey occasions (in time) to classify site or area One occasion per sampling season. 2.10 Number of spatial replicates per sampling/survey occasion to

To explore the extent to which we could assist students in the process of maintaining a narrative structure for their work, we designed an interactive multimedia program in the light

Question: Would it constitute a claim event if the Engineer, within a reasonable time-frame, (3 months) fails to review and respond (approve / disapprove or comment) to a Variation

The present study aims to create a new model of port safety management in our country based on the analysis and assessment of the risk with the Formal Safety Assessment methodology

The communication component is an essential part of the emerging cyber-physical systems. The performance of the communication component has a profound effect on the overall

This brief uses data from the Health Reform Monitoring Survey (HRMS) to examine changes in uninsurance and health care access and affordability for women of childbearing age (ages

Streamwise distribution, at the upper wall of the blunt flat plate, of the mean shear stress predicted by the LES through the mixed-scale model ( ) and the dynamic vorticity model (

VISUALIZATION USING k-MEDOIDS ALGORITHM The article presents the possibilities of using clustering algorithms to group and visualize data from blood tests of various people in