
A Cost sensitive Decision Tree Optimized Algorithm Based on Adaptive Mechanism


Academic year: 2020


2017 3rd International Conference on Artificial Intelligence and Industrial Engineering (AIIE 2017)
ISBN: 978-1-60595-520-9

Qian WANG and Zhi LIU*

Dalian Maritime University, Liaoning, China

*Corresponding author

Keywords: Cost-sensitive learning, Decision tree, Adaptive mechanism, Large dataset.

Abstract. Cost is increasingly becoming a hotspot of academic research, so it is necessary to take cost into account in research on decision tree algorithms. In this paper, a new algorithm based on adaptive mechanisms is proposed. The heuristic function is improved to address the high cost of existing cost-sensitive decision trees and the bias towards multi-valued attributes. Three mechanisms (adaptive parameter determination, adaptive cut-point selection, and adaptive attribute removal) are applied to the tree-building process to improve model-building efficiency. Finally, the improved "probability refusing pruning" strategy is applied to the decision tree model. Experiments show that the efficiency of the new algorithm is clearly improved on large data sets, and that it can be used to build a decision tree model with low redundancy and low cost.

Introduction

Cost-sensitive decision tree algorithms [1] have been widely studied and have achieved good results. In 2007, A. Freitas et al. proposed the CS-C4.5 algorithm [2], which measures attributes by both the information gain ratio and the test cost, and balances the two through a parameter ω. The algorithm is efficient on small data sets, but its efficiency drops sharply on large data sets. In 2015, William Zhu et al. proposed an adaptive cost-sensitive decision tree algorithm [3]. It improves the efficiency of building the tree model while maintaining good robustness, but it does not propose a good pruning strategy to solve the overfitting problem. In 2010, Min Fan et al. proposed the "probability sticking pruning" mechanism [4]. It assumes that, when the pruning rule says a subtree should not be pruned, still pruning it with a certain probability may improve the classification effect of cost-sensitive decision trees and reduce the classification cost. The validity of this hypothesis has been verified.

Existing cost-sensitive decision tree algorithms perform well on small data sets, but their effect is greatly reduced on medium or large data sets, and they also suffer from overfitting and high cost. In this paper, we improve the heuristic function of the CS-C4.5 algorithm to address these shortcomings. The improved heuristic function solves the problem of bias towards multi-valued attributes and takes the test cost into account. When an attribute is tested again, it degrades to that of the C4.5 algorithm [5]. We also introduce three adaptive mechanisms into the process of building a decision tree, and propose the Adaptive Scheme of Third_Decision Tree (AST_DT) algorithm. The improved algorithm improves the efficiency of decision tree building and balances the costs of testing and misclassification. Finally, the improved "probability refusing pruning" strategy is applied to the decision tree model to prevent the tree from overfitting. It solves the overfitting problem of [...]

Design of the Cost-sensitive Decision Tree Algorithm Based on Adaptive Mechanism

Determination of Optimal Heuristic Function

In this paper, the heuristic function of CS-C4.5 is improved to address the high cost of cost-sensitive decision trees and the bias towards multi-valued attributes. The improved heuristic function is:

$$\mathrm{Quality}(a, p_a) = \mathrm{GainRatio}(a, p_a) \cdot \bigl(1 + tc(a) \cdot \Phi(a)\bigr)^{\lambda} \qquad (1)$$

Here tc(a) is the test cost of attribute a, and Φ(a) is a risk factor for specific types of experiments, also known as delayed tests. λ is a non-positive number that adjusts the impact of test costs on the decision tree.

Let P_a be the set of all candidate cut points of the numeric attribute a. The equation can then be rewritten as:

$$\mathrm{Quality}(a) = \max_{p_a \in P_a} \{\mathrm{Quality}(a, p_a)\} \qquad (2)$$

The advantage of the improved algorithm is that it uses the information gain ratio, rather than the information gain, to express an attribute's classification ability. This reduces the bias towards attributes that have more values. In addition, the improved heuristic degrades to the heuristic function of the C4.5 algorithm when attribute a is used again.
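Eq. (1) and Eq. (2) can be sketched in a few lines of Python. This is only a minimal illustration: the function names, the default values, and the idea of passing precomputed gain ratios in a dictionary keyed by cut point are assumptions made for the example, not part of the paper.

```python
def quality(gain_ratio, test_cost, risk_factor=1.0, lam=-0.8):
    """Eq. (1): GainRatio(a, p_a) * (1 + tc(a) * Phi(a)) ** lambda.

    lam must be non-positive, so a larger test cost lowers the score.
    """
    if lam > 0:
        raise ValueError("lambda must be non-positive")
    return gain_ratio * (1.0 + test_cost * risk_factor) ** lam


def best_quality(gain_ratio_by_cut, test_cost, risk_factor=1.0, lam=-0.8):
    """Eq. (2): maximum of Eq. (1) over the candidate cut points of one attribute.

    gain_ratio_by_cut maps each candidate cut point to its gain ratio
    (hypothetical input format, assumed precomputed elsewhere).
    """
    return max(
        quality(g, test_cost, risk_factor, lam)
        for g in gain_ratio_by_cut.values()
    )
```

With `lam=0`, or with a test cost of zero, the cost factor equals 1 and the score reduces to the plain gain ratio. That is one way to read the statement that the heuristic degrades to that of C4.5 when an attribute is used again.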

The Adaptive Scheme for Parameter Determination

In the heuristic function, λ is an undetermined parameter whose value lies in [-1, 0]. It controls the degree of bias towards the test cost: the smaller the value, the more the heuristic function is biased towards the test cost. Can it take values in [-∞, 0]? Obviously not, because the classification cost influences the classification accuracy of the model; if the classification cost is emphasized one-sidedly, the classification accuracy of the model may become poor. However, determining the parameter manually is time-consuming and inefficient. To solve this problem, the ADP algorithm is proposed.

We use the Wdbc data set and determine the optimal heuristic function by a search method. First, we set the step length to -0.1 and start at 0, so the selected values are 0, -0.1, -0.2, ..., -1. We then obtain a different heuristic function for each value. For the sake of statistics, we use a normal distribution to generate random test costs. The cost of misclassification is defined as a matrix, as follows.

$$mc = \begin{pmatrix} 0 & mc(0,1) \\ mc(1,0) & 0 \end{pmatrix}$$

mc(0,1) is the cost when the real class is 1 and the instance is misclassified as 0; mc(1,0) is the cost when the real class is 0 and the instance is assigned to class 1. Usually these two values are different, and here we set mc(0,1) = 60 and mc(1,0) = 80. The experimental results are shown in Figure 1.
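The search over λ described above can be sketched as a simple grid search. The callback `average_cost` is hypothetical: it stands for whatever procedure builds a tree with a given λ and returns its average classification cost, which the paper does not spell out.

```python
def candidate_lambdas(start=0.0, stop=-1.0, step=-0.1):
    """Return 0, -0.1, ..., -1 using an integer counter to avoid float drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n + 1)]


def search_lambda(average_cost, start=0.0, stop=-1.0, step=-0.1):
    """Pick the candidate lambda with the lowest average classification cost.

    average_cost is an assumed callback: lambda -> cost of the resulting tree.
    """
    return min(candidate_lambdas(start, stop, step), key=average_cost)
```

For instance, `search_lambda(lambda lam: (lam + 0.8) ** 2)` scans the eleven candidates and returns the one closest to the minimum of that toy cost curve.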

Figure 1. The process of selecting parameter value.

As can be seen from Figure 1, the test cost decreases as the value of λ decreases, and the cost of [...]

When λ = 0, the average classification cost is the largest. When λ = -0.8, the average classification cost is the smallest. The curve in the figure shows that the minimum of the average classification cost should lie in the neighborhood of -0.8. The curve fitted in Excel is:

$$y = 122.5\lambda^{2} + 202.4\lambda + 153.9 \qquad (3)$$

Regression analysis shows that the regression equation and the regression coefficients are highly significant at the 95% confidence level. Setting the derivative to zero to find the extreme value, the average classification cost is lowest at λ = -0.83. The value found by the search and the value obtained from the fitted curve are not identical, but both optima lie in the neighborhood of -0.8. The experimental results show that the improved algorithm is effective and can obtain an optimal parameter value with a lower average classification cost.
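The extreme-value step can be written out as a short worked derivation. It assumes the fitted curve is y = 122.5λ² + 202.4λ + 153.9, with the coefficients taken from Eq. (3) and the signs chosen so that the minimum falls at a negative λ, as the text reports.

$$
\begin{aligned}
\frac{dy}{d\lambda} &= 2 \cdot 122.5\,\lambda + 202.4 = 245\,\lambda + 202.4 = 0
\;\Longrightarrow\; \lambda^{*} = -\frac{202.4}{245} \approx -0.83, \\
\frac{d^{2}y}{d\lambda^{2}} &= 245 > 0 \quad \text{(so } \lambda^{*} \text{ is a minimum)}, \\
y(\lambda^{*}) &= 153.9 - \frac{202.4^{2}}{4 \cdot 122.5} \approx 153.9 - 83.6 = 70.3 .
\end{aligned}
$$

Under the same assumption, y(0) = 153.9 and y(-1) = 74.0, which agrees with the observation that the cost is largest at λ = 0 within [-1, 0].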

The Adaptive Scheme for Cut Point Selections

In the traditional cut-point selection mechanism, if an attribute has N instances and each instance has a different value, then N-1 candidate cut points must be evaluated, which greatly reduces the efficiency of the algorithm. This paper introduces the ASCP mechanism to address the low efficiency of existing algorithms on large data sets.

The idea of the algorithm is as follows: it selects the middle value of the attribute and sets a step; the search then starts from the middle value, and the values at the end points of each step are evaluated separately until the maximum value of the heuristic function is found. The step length can be determined according to the number of attributes. The shorter the step, the finer the search; the longer the step, the more efficient the algorithm.
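One possible reading of this scheme is a stepped hill climb that starts from the middle candidate. The sketch below follows that reading only; the function name, the `score` callback (standing in for the heuristic of Eq. (1)), and the stopping rule are assumptions, since the paper does not give the exact procedure.

```python
def ascp_cut_point(cut_points, score, step=1):
    """Search candidate cut points from the middle outwards in jumps of `step`.

    Returns (best_cut_point, best_score, number_of_evaluations).
    Assumes `score` is roughly unimodal over the sorted cut points;
    a coarse step may stop near, not exactly at, the true optimum.
    """
    if not cut_points:
        raise ValueError("no candidate cut points")
    pts = sorted(cut_points)
    i = len(pts) // 2              # start at the middle value
    best = score(pts[i])
    evaluations = 1
    while True:
        moved = False
        for j in (i - step, i + step):   # the two end points of this step
            if 0 <= j < len(pts):
                s = score(pts[j])
                evaluations += 1
                if s > best:             # move towards the better end point
                    best, i, moved = s, j, True
                    break
        if not moved:                    # neither neighbour improves: stop
            return pts[i], best, evaluations
```

With 101 candidates and `step=5`, a score that peaks at 70 is found after far fewer evaluations than the 100 a full scan would need, which is the efficiency trade-off the text describes.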

The Adaptive Scheme for Removal of Attribute

If an object has too many attributes, the decision tree may become too large, the generated rules become cumbersome and difficult to understand, and the efficiency of building the tree is greatly reduced. To solve these problems, the ARA mechanism is introduced: during tree building, it removes attributes that have little impact on the decision tree. The idea is to compute the Quality value of each attribute and remove an attribute from the attribute set if the value of its heuristic function is less than the ∂ of the maximum heuristic function value.
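A minimal sketch of this filtering step is shown below. It assumes that the threshold is the maximum Quality value divided by ∂; the paper only says "less than the ∂ of the maximum", so this interpretation, like the function name and input format, is an assumption.

```python
def remove_weak_attributes(quality_by_attr, delta=10.0):
    """Keep only attributes whose Quality is at least max(Quality) / delta.

    quality_by_attr maps attribute name -> Quality value (hypothetical format).
    The max / delta threshold is an assumed reading of the removal rule.
    """
    if not quality_by_attr:
        return {}
    threshold = max(quality_by_attr.values()) / delta
    return {a: q for a, q in quality_by_attr.items() if q >= threshold}
```

For example, with Quality values of 1.0, 0.5 and 0.05 and `delta=10`, the threshold is 0.1 and only the third attribute is dropped.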

The pseudo-code description of the AST_DT algorithm is shown in Table 1.

Optimization of the Pruning Strategy

Inspired by the idea of "probability sticking pruning", we put forward a new pruning strategy, "probability refusing pruning". The idea is that when a subtree should be pruned according to the pruning rule, the algorithm still refuses to prune it with a certain probability.
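The decision rule of "probability refusing pruning" can be illustrated with a small sketch. The function name, the `refuse_prob` parameter, and the injectable random generator are assumptions for the example; the paper does not state which probability it uses.

```python
import random


def should_prune(rule_says_prune, refuse_prob, rng=random):
    """Return True only if the subtree is actually pruned.

    When the pruning rule asks for pruning, the algorithm still refuses
    with probability refuse_prob. When the rule does not ask for pruning,
    nothing is pruned (this sketch does not model "sticking" pruning).
    """
    if not rule_says_prune:
        return False
    return rng.random() >= refuse_prob
```

Setting `refuse_prob=0.0` recovers ordinary rule-based pruning, while `refuse_prob=1.0` disables pruning altogether.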

Figure 2 shows whether the improved probability pruning mechanism is valid. Here No-prob represents "no pruning", prob-persist-pruning represents "probability sticking pruning", prob-refusal-pruning represents "probability refusing pruning", and combine-pruning represents a combination of the two probabilistic options. The experiment shows that the prob-refusal-pruning strategy obtains a smaller average cost than the other pruning strategies on most data sets. The combine-pruning strategy also reduces the average cost and obtains better results on most data sets. This demonstrates the effectiveness of the improved strategy.

Figure 2. The total average classification cost.

Experiment and Analysis

Experimental Platform and Data Set Introduction

The AST_DT algorithm is implemented on the Eclipse development platform and compared with the ADP algorithm, the ACSDT algorithm and the CS-C4.5 algorithm.

The training samples used in this paper come from 10 data sets provided by UCI [6]. The details are shown in Table 2, where |C| represents the number of attributes and |D| represents the total number of samples.

Table 2. Datasets information.

Experimental Results and Analysis

Experimental operating environment: CPU Core i3, 4 GB memory, Windows 7 operating system; the algorithms are written in Java. None of the data sets comes with test cost settings, so for the purpose of statistics we apply the normal distribution to randomly generate test costs. All data sets are randomly divided into two subsets: 60% for training and 40% for testing. We use the misclassification cost matrix defined in Chapter 2. We set ∂ = 10; λ ranges from -1 to 0 with a step of 0.1; ω ranges from 0 to 1 with a step size of 0.15. Because we assume that all tests are performed in parallel, we let Φ(a) = 1. We also preprocess the selected data sets to remove data that does not conform to the specification.
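The setup described above (normally distributed test costs and a random 60/40 split) could look roughly like the following sketch. The mean, standard deviation, lower clip of 1.0, and seeds are placeholder assumptions; the paper does not report the distribution parameters it used.

```python
import random


def random_test_costs(attributes, mean=50.0, std=15.0, seed=0):
    """Draw one test cost per attribute from a normal distribution.

    mean, std and the clip at 1.0 (to keep costs positive) are assumptions.
    """
    rng = random.Random(seed)
    return {a: max(1.0, rng.gauss(mean, std)) for a in attributes}


def split_60_40(samples, seed=0):
    """Randomly split samples into 60% training and 40% test subsets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 6 // 10   # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed makes a run repeatable; drawing many cost settings, as in Experiment 1, would amount to calling `random_test_costs` with different seeds.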

Experiment 1: This experiment measures the average classification cost over 300 different test-cost settings generated from the normal distribution, and compares the four algorithms under these different test costs.

Figure 3 shows the average misclassification cost (AMC), Figure 4 shows the average test cost (ATC), and Figure 5 shows the total average classification cost (AAC). The figures show that the total average cost of the AST_DT algorithm is lower than that of the other algorithms on most data sets, especially on large data sets such as Clean2, Magic, and Eeg. On the Wpbc data set, the average misclassification cost obtained by the new algorithm is larger than that of the other algorithms, yet its total average classification cost is still lower. This indicates that the new algorithm balances the test cost against the misclassification cost, which supports its validity.
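The three measures can be computed as in the sketch below, using the misclassification costs mc(0,1) = 60 and mc(1,0) = 80 given earlier. Treating the total average cost as the sum of the average misclassification cost and the average test cost is an assumption based on common practice in this literature; the record format is likewise hypothetical.

```python
# mc(predicted, actual), using the values stated in the paper
MC = {(0, 1): 60.0, (1, 0): 80.0}


def average_costs(records):
    """Return (AMC, ATC, AAC) for (predicted, actual, test_cost_paid) records.

    AAC = AMC + ATC is an assumed definition of the total average cost.
    """
    records = list(records)
    n = len(records)
    amc = sum(MC.get((p, a), 0.0) for p, a, _ in records) / n
    atc = sum(t for _, _, t in records) / n
    return amc, atc, amc + atc
```

A tree that pays more for tests can still win on total cost if it avoids enough misclassifications, and the reverse also holds, which is the trade-off visible on the Wpbc data set.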

Experiment 2: This experiment compares the improved AST_DT algorithm with the CS-C4.5, ADP and ACSDT algorithms on different data sets. Figure 6 shows that the AST_DT algorithm is more efficient than the other algorithms on most data sets, and that its advantage is more significant on large data sets. For example, the run time of the AST_DT algorithm on the Clean, Magic and Eeg data sets is significantly lower than that of the other algorithms. In addition, the running time of all the algorithms depends on the size of the data set: the larger the data set, the longer the running time.

Figure 3. The average mis-cost of different algorithms.

Figure 4. The average test cost of different algorithms.

Figure 5. The total average cost of different algorithms.

Figure 6. Comparison of run time of different algorithms.

Summary

This paper proposes a new algorithm called the AST_DT algorithm. The improved heuristic function addresses the problems of high cost and of bias towards multi-valued attributes. Three adaptive mechanisms (adaptive parameter determination, adaptive cut-point selection, and adaptive attribute removal) are introduced into the decision tree building process; they greatly improve the efficiency of node distribution and cut-point selection, and the effect is more obvious on big data sets. Finally, the improved "probability refusing pruning" strategy is applied to the decision tree model, which reduces the classification cost while maintaining a certain classification accuracy.

Some work remains to be done. On the one hand, the algorithm can only deal with binary classes and cannot handle multi-class problems. On the other hand, the efficiency of the improved algorithm on non-numeric attributes is not greatly improved, and the effect is not obvious. These are the problems to be solved in the next step.

Acknowledgement

This paper was supported by the Fundamental Research Funds for the Central Universities (No. 3132017119).

References

[1] Hong Zhao, Xiangju Li. A cost sensitive decision tree algorithm based on weighted class distribution with batch deleting attribute mechanism. Volume 378, 1 February 2017, pp. 303-316.

[2] A. Freitas, A. Costa-Pereira, P. Brazil. Cost-sensitive decision trees applied to medical data, in: Data Warehousing and Knowledge Discovery. Springer, 2007, pp. 303-312.

[3] William Zhu. A cost sensitive decision tree algorithm with two adaptive mechanisms. Knowledge-Based Systems 88 (2015), pp. 24-33.

[4] Fan Min, Zilong Xu. Research on Attribute Reduction and Decision Tree in Cost-sensitive Learning. June 2014.

[5] Quinlan J.R. C4.5: Programs for Machine Learning [J]. Morgan Kaufmann, 1993, pp. 23-30.

[6] A. Frank, A. Asuncion. UCI Machine Learning Repository. University of California, School ...
