2017 3rd International Conference on Artificial Intelligence and Industrial Engineering (AIIE 2017)
ISBN: 978-1-60595-520-9
A Cost-sensitive Decision Tree Optimized Algorithm Based on Adaptive Mechanism
Qian WANG and Zhi LIU*
Dalian Maritime University, Liaoning, China
*Corresponding author
Keywords: Cost-sensitive learning, Decision tree, Adaptive mechanism, Large dataset.
Abstract. The problem of cost is increasingly becoming a hotspot of academic research. It is therefore necessary to take cost into account in research on decision tree algorithms. In this paper, a new algorithm based on adaptive mechanisms is proposed. The heuristic function is improved to solve the high-cost problem of existing cost-sensitive decision trees and the multi-valued attribute bias problem. The adaptive parameter determination mechanism, the adaptive cut-point selection mechanism and the adaptive attribute removal mechanism are all applied to the tree building process to improve model building efficiency. Finally, the improved "probability refusing pruning" strategy is applied to the decision tree model. Experiments show that the efficiency of the new algorithm is clearly improved on large datasets, and that it can also be used to build a decision tree model with low redundancy and low cost.
Introduction
The cost-sensitive decision tree algorithm [1] has been widely studied and has achieved good results. In 2007, A. Freitas et al. proposed the CS-C4.5 algorithm [2], which uses both the information gain rate and the test cost to evaluate attributes, and adjusts the balance between the information gain rate and the test cost of the attributes through the parameter ω. The algorithm is efficient on small data sets, but its efficiency is greatly reduced on large data sets. In 2015, William Zhu et al. proposed an adaptive cost-sensitive decision tree algorithm [3]. That algorithm improves the building efficiency of the tree model while maintaining good robustness, but it does not propose a good pruning strategy to solve the overfitting problem. In 2010, Fan Min et al. proposed the "probability sticking pruning" mechanism [4]. It assumes that when the pruning rules say a subtree should not be pruned, still pruning it with a certain probability may improve the classification performance of cost-sensitive decision trees and reduce the cost of classification. The validity of this hypothesis was verified.
Existing cost-sensitive decision tree algorithms perform well on small datasets, but their performance drops sharply on medium or large datasets, and they suffer from overfitting and high cost. In this paper, we address these shortcomings by improving the heuristic function of the CS-C4.5 algorithm. The improved heuristic function solves the bias towards attributes with multiple values and also takes the test cost into account. When an attribute is tested again, the function degrades into that of the C4.5 algorithm [5]. We also introduce three adaptive mechanisms into the process of building the decision tree, and propose the Adaptive Scheme of Third_Decision Tree (AST_DT) algorithm. The improved algorithm raises the efficiency of decision tree building and balances the cost of testing against the cost of misclassification. Finally, the improved "probability refusing pruning" strategy is applied to the decision tree model to address its overfitting problem.
Design of the Cost-sensitive Decision Tree Algorithm Based on Adaptive Mechanism
Determination of Optimal Heuristic Function
In this paper, the heuristic function of CS-C4.5 is improved to address the high cost of cost-sensitive decision trees and the bias towards multi-valued attributes. The improved heuristic function is:

$$\mathit{Quality}(a, p_a) = \mathit{GainRatio}(a, p_a) \cdot \bigl(1 + tc(a) \cdot \Phi(a)\bigr)^{\lambda} \qquad (1)$$

Here tc(a) is the test cost of attribute a, Φ(a) is a risk factor for specific types of tests, also known as delayed tests, and λ is a non-positive number that adjusts the impact of test costs on the decision tree.
Let P_a be the set of all candidate cut points of the numeric attribute a. The heuristic function can then be rewritten as:

$$\mathit{Quality}(a) = \max_{p_a \in P_a} \{ \mathit{Quality}(a, p_a) \} \qquad (2)$$

The advantage of the improved algorithm is that it uses the information gain ratio, rather than the information gain, to express an attribute's classification ability, which reduces the bias towards attributes that have more values. In addition, the improved heuristic degrades to the heuristic function of the C4.5 algorithm when attribute a is used again.
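Equations (1) and (2) can be illustrated with a minimal Python sketch. It assumes a binary split at a numeric cut point and takes midpoints between distinct values as the candidate set P_a. The function names (`gain_ratio`, `quality`, `best_quality`) and the default parameter values are hypothetical and are not taken from the paper's Java implementation.

```python
import math

def gain_ratio(labels, left_mask):
    """C4.5 gain ratio of a binary split; left_mask[i] is True if sample i goes left."""
    def entropy(items):
        n = len(items)
        if n == 0:
            return 0.0
        counts = {}
        for x in items:
            counts[x] = counts.get(x, 0) + 1
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    left = [y for y, m in zip(labels, left_mask) if m]
    right = [y for y, m in zip(labels, left_mask) if not m]
    n = len(labels)
    gain = entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)
    split_info = entropy(list(left_mask))
    return gain / split_info if split_info > 0 else 0.0

def quality(values, labels, cut, test_cost, risk=1.0, lam=-0.8):
    """Equation (1): GainRatio(a, p_a) * (1 + tc(a) * Phi(a)) ** lambda."""
    mask = [v <= cut for v in values]
    return gain_ratio(labels, mask) * (1.0 + test_cost * risk) ** lam

def best_quality(values, labels, test_cost, risk=1.0, lam=-0.8):
    """Equation (2): maximise Quality(a, p_a) over the candidate cut points."""
    distinct = sorted(set(values))
    # Assumption: candidate cut points are midpoints between adjacent distinct values.
    cuts = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    scored = ((quality(values, labels, c, test_cost, risk, lam), c) for c in cuts)
    return max(scored, default=(0.0, None))
```

With `test_cost=0` the factor `(1 + tc * Phi) ** lam` equals 1, so the score reduces to the plain gain ratio. This is one way to model the degradation to C4.5 when an attribute has already been tested.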
The Adaptive Scheme for Parameter Determination
In the heuristic function, λ is an undetermined parameter whose value lies in [-1, 0]. It controls the degree of bias towards the test cost: the smaller the value, the more the heuristic function is biased towards the test cost. Can λ range over [-∞, 0]? Obviously not. The classification cost influences the classification accuracy of the model, and if the classification cost is emphasized one-sidedly, the accuracy of the model may become poor. However, determining the parameter by hand is time-consuming and inefficient. To solve this problem, the ADP algorithm is proposed.
We use the Wdbc dataset and determine the optimal heuristic function with a search method. First, we set the step length to -0.1 and start at 0, so the selected values are 0, -0.1, -0.2, ..., -1; each value yields a different heuristic function. For the sake of statistics, we use a normal distribution to generate random test costs. The cost of misclassification is defined as the following matrix:

$$mc = \begin{pmatrix} 0 & mc(0,1) \\ mc(1,0) & 0 \end{pmatrix}$$

mc(0,1) is the cost incurred when a sample whose real class is 1 is misclassified as 0, and mc(1,0) is the cost incurred when a sample whose real class is 0 is assigned to class 1. Usually these two values are different; here we set mc(0,1) = 60 and mc(1,0) = 80. The experimental results are shown in Figure 1.
Figure 1. The process of selecting parameter value.
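The parameter search described above can be sketched as a plain grid search in Python. The callback `average_cost` is a hypothetical stand-in for training a tree with a given λ and measuring its average classification cost; the text does not spell out that procedure, so it is left to the caller.

```python
def lambda_grid(step=0.1, low=-1.0, high=0.0):
    """Candidate values high, high - step, ..., low (rounded to avoid float drift)."""
    n = round((high - low) / step)
    return [round(high - i * step, 10) for i in range(n + 1)]

def search_lambda(average_cost, step=0.1):
    """Return the grid value of lambda with the lowest average classification cost."""
    return min(lambda_grid(step), key=average_cost)

# Misclassification costs from the text, stored as MC[i][j] = mc(i, j).
MC = [[0, 60],
      [80, 0]]
```

For instance, `search_lambda(lambda l: (l + 0.8) ** 2)` picks -0.8 from the grid; the quadratic here is only a toy cost curve, not the measured one.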
As can be seen from Figure 1, the cost of testing decreases as the value of λ decreases, and the cost of mi[...] algorithm be referenced). When λ = 0, the average classification cost is the largest. When λ = -0.8, the average classification cost is the smallest. The curve in the figure shows that the minimum of the average classification cost should be taken in the neighborhood of -0.8. The result of the fitting in Excel is:
$$y = 153.9 + 202.4\lambda + 122.5\lambda^{2} \qquad (3)$$
Regression analysis shows that the regression equation and the regression coefficients are highly significant at the 95% confidence level. Setting the derivative to zero (the extreme value method) shows that the average classification cost reaches its lowest value at λ = -0.83. This result is not identical to the one obtained in the experiment above, but the two optimal values agree within the neighborhood of -0.8. The experimental results show that the improved algorithm is effective and can efficiently obtain the optimal parameter value with a lower average classification cost.
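The extreme value step can be written out explicitly. The derivation below uses the three fitted coefficients (153.9, 202.4, 122.5) and assumes the linear term is positive in λ, so that the minimiser falls inside [-1, 0] as the text states.

```latex
\begin{aligned}
y(\lambda) &= 122.5\,\lambda^{2} + 202.4\,\lambda + 153.9,\\
\frac{dy}{d\lambda} &= 245\,\lambda + 202.4 = 0
  \;\Longrightarrow\; \lambda^{*} = -\frac{202.4}{245} \approx -0.83,\\
\frac{d^{2}y}{d\lambda^{2}} &= 245 > 0
  \quad\text{(so } \lambda^{*} \text{ is a minimum)},\\
y(\lambda^{*}) &= 153.9 - \frac{202.4^{2}}{4 \cdot 122.5} \approx 70.3 .
\end{aligned}
```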
The Adaptive Scheme for Cut Point Selections
In the traditional cut-point selection mechanism, if an attribute has N instances and each instance corresponds to a different value, then N-1 candidate cut points must be evaluated, which greatly reduces the efficiency of the algorithm. This paper introduces the ASCP mechanism to address the low efficiency of the existing algorithm on large datasets.
The idea of the algorithm is as follows: it selects the middle value of the attribute and sets a step; the search then starts from the middle value, and the values at the end points of each step are evaluated separately until the maximum value of the heuristic function is found. The step length can be determined according to the number of attributes. The shorter the step, the finer the look-up; the longer the step, the more efficient the algorithm.
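One possible reading of the stepped search is sketched below in Python: start at the middle of the sorted attribute values and evaluate only every `step`-th candidate on either side, keeping the best score seen. The function name `ascp_cut_point` and the `score` callback (standing for the heuristic function at a cut point) are hypothetical, and the paper's exact stopping rule may differ.

```python
def ascp_cut_point(sorted_values, score, step):
    """Evaluate every `step`-th candidate, moving outward from the middle value."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    n = len(sorted_values)
    if n == 0:
        raise ValueError("no candidate values")
    mid = n // 2
    best_cut, best_score = sorted_values[mid], score(sorted_values[mid])
    offset = step
    # Keep widening until both end points have left the value range.
    while mid - offset >= 0 or mid + offset < n:
        for idx in (mid - offset, mid + offset):
            if 0 <= idx < n:
                s = score(sorted_values[idx])
                if s > best_score:
                    best_cut, best_score = sorted_values[idx], s
        offset += step
    return best_cut, best_score
```

With N values this evaluates roughly N / step candidates instead of N - 1, which is the efficiency trade-off the paragraph describes: a longer step is cheaper but may miss the exact optimum.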
The Adaptive Scheme for Removal of Attribute
If the object has too many attributes, the decision tree may become too large, the generated rules become cumbersome and difficult to understand, and the efficiency of building the tree is greatly reduced. To solve these problems, the ARA mechanism is introduced: during tree building, attributes that have little impact on the decision tree are removed. The idea is to compute the Quality value of each attribute and remove an attribute from the attribute set if the value of its heuristic function is less than ∂ of the maximum heuristic function value.
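A minimal sketch of the removal rule in Python follows. It assumes that ∂ is a percentage of the best Quality value; the text does not state the unit of ∂, so this interpretation, like the function name `remove_weak_attributes`, is an assumption.

```python
def remove_weak_attributes(quality_by_attr, delta_percent=10.0):
    """Keep attributes whose Quality is at least delta_percent % of the best one."""
    if not quality_by_attr:
        return {}
    # Assumption: the threshold is a percentage of the maximum heuristic value.
    threshold = max(quality_by_attr.values()) * delta_percent / 100.0
    return {a: q for a, q in quality_by_attr.items() if q >= threshold}
```

For example, with qualities `{"a": 1.0, "b": 0.05, "c": 0.2}` and the default threshold, attribute `b` is dropped while `a` and `c` remain candidates for splitting.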
The pseudo-code description of the AST_DT algorithm is shown in Table 1.
Optimization of the Pruning Strategy
Inspired by the idea of "probability sticking pruning", we put forward a new pruning strategy, "probability refusing pruning". The idea of the strategy is that when a node should be pruned according to the pruning rule, the algorithm still refuses to prune it with a certain probability.
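The decision rule of "probability refusing pruning" can be sketched in a few lines of Python. The function name `should_prune` and the parameter `refuse_prob` are hypothetical; the text does not give the probability value used in the experiments.

```python
import random

def should_prune(rule_says_prune, refuse_prob, rng=random):
    """Prune only if the rule says so and the random draw does not refuse."""
    if not rule_says_prune:
        return False
    # rng.random() is uniform on [0, 1), so pruning is refused with probability refuse_prob.
    return rng.random() >= refuse_prob
```

Setting `refuse_prob=0` recovers ordinary rule-based pruning, while `refuse_prob=1` never prunes; values in between keep some subtrees that the rule alone would have removed.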
Figure 2 shows whether the improved probability pruning mechanism is valid. Here No-prob represents "no pruning", prob-persist-pruning represents "probability sticking pruning", prob-refusal-pruning represents "probability refusing pruning", and combine-pruning represents a combination of the two pruning options. The experiment shows that the prob-refusal-pruning strategy obtains a smaller average cost than the other pruning strategies on most datasets. The combine-pruning strategy can also reduce the average cost and obtain better results on most datasets. This demonstrates the effectiveness of the improved strategy.
Figure 2. The total average classification cost.
Experiment and Analysis
Experimental Platform and Data Set Introduction
The AST_DT algorithm is implemented on the Eclipse development platform and compared with the ADP algorithm, the ACSDT algorithm and the CS-C4.5 algorithm.
The training sample datasets used in this paper are the 10 datasets provided by UCI [6]. The data details are shown in Table 2, where |C| represents the number of attributes and |D| represents the total number of samples.
Table 2. Datasets information.
Experimental Results and Analysis
Experimental operating environment: the CPU is a Core i3, the memory is 4 GB, the operating system is Windows 7, and the programs are written in Java. None of the datasets comes with test cost settings, so for the purpose of statistics we apply the normal distribution to randomly generate test costs. All datasets are randomly divided into two subsets: 60% for training and 40% for testing. We use the misclassification cost matrix defined in Chapter 2. We set ∂ = 10; λ ranges from -1 to 0 with a step of 0.1, and ω ranges from 0 to 1 with a step size of 0.15. Because we assume that all tests run in parallel, we let Φ(a) = 1. We preprocess the selected datasets to remove data that does not conform to the specification.
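The data preparation described above (random test costs and a 60/40 split) might look like the following Python sketch. The mean and standard deviation of the normal distribution are not given in the text, so the defaults below are placeholders, and the lower bound of 1.0 is an added assumption to keep every cost positive.

```python
import random

def random_test_costs(attributes, mean=50.0, std=15.0, rng=None):
    """Draw one test cost per attribute from a normal distribution."""
    rng = rng or random.Random()
    # Assumption: costs are clipped at 1.0 so that no test is free or negative.
    return {a: max(1.0, rng.gauss(mean, std)) for a in attributes}

def split_60_40(samples, rng=None):
    """Shuffle the samples and return (training 60%, test 40%)."""
    rng = rng or random.Random()
    shuffled = list(samples)
    rng.shuffle(shuffled)
    k = round(0.6 * len(shuffled))
    return shuffled[:k], shuffled[k:]
```

Passing a seeded `random.Random` instance to both helpers makes a run repeatable, which helps when the same random test costs must be shared by all compared algorithms.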
Experiment 1: This experiment shows the average classification cost (300 different test costs generated from a normal distribution). The experiment compares the four algorithms at different test costs.
Figure 3 shows the average misclassification cost (AMC), Figure 4 shows the average test cost (ATC), and Figure 5 shows the total average classification cost (AAC). The figures show that the total average cost of the AST_DT algorithm is lower than that of the other algorithms on most datasets, especially on large datasets such as Clean2, Magic, and Eeg. On the Wpbc dataset, the average misclassification cost obtained by the new algorithm is larger than that of the other algorithms, yet its total average classification cost is lower. This indicates that the new algorithm trades off the test cost against the misclassification cost, which supports the validity of the new algorithm.
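The three reported quantities can be computed as in the Python sketch below. It assumes that the total average cost (AAC) is the sum of the average misclassification cost (AMC) and the average test cost (ATC), and that the cost matrix is indexed as mc[predicted][real], following the definition of mc(0,1) given earlier. The record layout and the name `average_costs` are hypothetical.

```python
def average_costs(records, mc):
    """records: (true_class, predicted_class, test_cost_paid) per test sample."""
    n = len(records)
    # mc[p][t] is the cost of predicting class p when the real class is t.
    amc = sum(mc[p][t] for t, p, _ in records) / n
    atc = sum(cost for _, _, cost in records) / n
    # Assumption: the total average cost is the sum of the two components.
    return amc, atc, amc + atc
```

This makes the Wpbc observation easy to reproduce in principle: a tree can have a higher AMC than its competitors and still a lower AAC if it tests cheaper attributes.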
Experiment 2: This experiment compares the improved AST_DT algorithm with the CS-C4.5, ADP and ACSDT algorithms on different datasets. Figure 6 shows that the AST_DT algorithm is more efficient than the other algorithms on most datasets, and that its advantage is more significant on large datasets. For example, the runtime of the AST_DT algorithm on the Clean, Magic and Eeg datasets is significantly shorter than that of the other algorithms. In addition, the running time of all the algorithms depends on the size of the dataset: the larger the dataset, the longer the running time.
Figure 3. The average mis-cost of different algorithms.
Figure 4. The average test cost of different algorithms.
Figure 5. The total average cost of different algorithms.
Figure 6. Comparison of runtime of different algorithms.
Summary
This paper proposes a new algorithm called the AST_DT algorithm. The improved heuristic function solves the problems of high cost and of bias towards multi-valued attributes. Three adaptive mechanisms (adaptive parameter determination, adaptive cut-point selection and adaptive attribute removal) are introduced into the decision tree building process; they greatly improve the efficiency of node distribution and cut point selection, and the effect is more obvious on big datasets. Finally, the improved "probability refusing pruning" strategy is applied to the decision tree model, which reduces the classification cost while maintaining a certain classification accuracy. Some work remains to be done. On the one hand, the algorithm can only deal with binary classes and cannot handle multi-class problems. On the other hand, the efficiency of the improved algorithm on non-numeric attributes is not greatly improved, and the effect is not obvious. These issues need to be solved in the next step.
Acknowledgement
This paper was supported by the Fundamental Research Funds for the Central Universities (No. 3132017119).
References
[1] Hong Zhao, Xiangju Li. A cost sensitive decision tree algorithm based on weighted class distribution with batch deleting attribute mechanism. Volume 378, 1 February 2017, pp. 303-316.
[2] A. Freitas, A. Costa-Pereira, P. Brazdil. Cost-sensitive decision trees applied to medical data, in: Data Warehousing and Knowledge Discovery. Springer, 2007, pp. 303-312.
[3] William Zhu. A cost sensitive decision tree algorithm with two adaptive mechanisms. Knowledge-Based Systems 88 (2015), pp. 24-33.
[4] Fan Min, Zilong Xu. Research on Attribute Reduction and Decision Tree in Cost-sensitive Learning. June 2014.
[5] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993, pp. 23-30.
[6] A. Frank, A. Asuncion. UCI Machine Learning Repository. University of California, School