• No results found

Learning Bayesian networks with local structure, mixed variables, and exact algorithms

N/A
N/A
Protected

Academic year: 2021

Share "Learning Bayesian networks with local structure, mixed variables, and exact algorithms"

Copied!
27
0
0

Loading.... (view fulltext now)

Full text

(1)

Contents lists available atScienceDirect

International

Journal

of

Approximate

Reasoning

www.elsevier.com/locate/ijar

Learning

Bayesian

networks

with

local

structure,

mixed

variables,

and

exact

algorithms

Topi Talvitie

a

,

Ralf Eggeling

b

,

Mikko Koivisto

a,

aDepartmentofComputerScience,UniversityofHelsinki,Finland

bDepartmentofComputerScience,UniversityofTübingen,Germany

a

r

t

i

c

l

e

i

n

f

o

a

b

s

t

r

a

c

t

Articlehistory: Received16March2019

Receivedinrevisedform11July2019 Accepted8September2019 Availableonline16September2019 Keywords:

Bayesiannetworks Decisiontrees Exactalgorithms Structurelearning

Modern exact algorithms for structure learning in Bayesian networks first computean exact local score of every candidate parent set, and then find anetwork structure by combinatorial optimization so as to maximizethe global score. Thisapproach assumes thateachlocalscorecanbecomputedfast, whichcanbeproblematicwhenthescarcity of the data calls for structured local models or when there are both continuous and discretevariables,forthesecaseshavelackedefficient-to-computelocalscores.Toaddress thischallenge, weintroducea local scorethat is basedon aclassofclassification and regression trees.We showthat under modestrestrictions onthepossible branchings in the tree structure, it is feasible to finda structure that maximizes aBayes score in a rangeofmoderate-sizeprobleminstances.Inparticular, thisenablesglobal optimization oftheBayesiannetworkstructure,includingthelocalstructure.Inaddition,weintroduce arelatedmodel classthat extendsordinary conditionalprobability tables tocontinuous variables by employing anadaptive discretization approach. The two model classes are compared empirically by learning Bayesian networks from benchmark real-world and syntheticdatasets.Wediscusstherelativestrengthsofthemodelclassesintermsoftheir structurelearningcapability,predictiveperformance,andrunningtime.

©2019TheAuthors.PublishedbyElsevierInc.ThisisanopenaccessarticleundertheCC BY-NC-NDlicense(http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Bayesiannetworks(BNs)composeamultivariatedistributionasaproductofunivariateconditionalprobability distribu-tions(CPDs). ThepotentialcomplexityofthelocalCPDs,togetherwiththeglobalacyclicityconstraintoftheBNstructure, make the task oflearning BNs from datachallenging in manyways. Evenif we assume that the available data contain neitherunobservedvariablesnormissingentries—likewewilldothroughoutthisarticle—thereoftenremainthefollowing threeconcerns:

(i) Statistical efficiency:thedatamaybescarce,renderingsimpletabularparameterizationsoftheCPDs,suchasthe

condi-tional probability tables (CPTs),statisticallyinefficient.

(ii) Heterogeneous variables:so-called

hybrid BNs

containbothdiscreteandcontinuousvariables,againrulingout themost convenientandstandardparameterizationsofCPDs.

*

Correspondingauthor.

E-mailaddress:mikko.koivisto@helsinki.fi(M. Koivisto). https://doi.org/10.1016/j.ijar.2019.09.002

0888-613X/©2019TheAuthors.PublishedbyElsevierInc.ThisisanopenaccessarticleundertheCCBY-NC-NDlicense (http://creativecommons.org/licenses/by-nc-nd/4.0/).

(2)

(iii) Computational efficiency: the mostnaturalformulationsof thelearning problemare computationally hard. Thislimits boththedimensionsofthedataandthecomplexityoftheCPDsinrelationtowhatcanbesolvedunderusefulquality guarantees.

Each oftheseissues (i–iii) hasbeenaddressed inthe literature,aswe will describe inthe next paragraphs.The present authorsare,however,notawareofanypriorworkthatwouldsimultaneouslyaddressallthethreeconcerns.

To addresstheissue ofstatisticalefficiency,one canreplace tabularCPDs bymore sophisticatedstructureswherethe effectivenumberofparametersadapts totheamountandcomplexityoftheavailable data.Thenotionofcontext-specific independence[1] offersaparticularlyappealingwaytoachievethis:givencertainconfigurationforasubsetofthe

predictor

variables, the

response

variable isindependentof theremaining predictors.Forstructure learningin BNs,context-specific independences have been represented with decisiontrees [2] and with the somewhat more expressive decision graphs [3]. Both model classesare non-parametric inthe sense their numberof parameters can grow large, letting the models approximateanyCPD(infact,matchCPTsfordiscretevariables overfinitedomains). Thesestudieswere,however,limited todiscretevariablesandgreedyorlocalsearchalgorithms,thusnotaddressingissues(ii)and(iii).Alsoexactalgorithmsto partition tojointstate spaceofdiscretepredictorshavebeenproposed [4],butthesealgorithms becomecomputationally infeasiblealreadywith,say,threepredictorsthattakefourstateseach.

ForhybridBNs,therearetwomainapproaches.Oneistodiscretizeallcontinuousvariables.Inpractice,thismeansthat separatelyforeachcontinuousvariable,itsrangeispartitionedintointervals,i.e.,bins,whicharethentreatedasthestates ofthediscretized variable.Then itiscommontoassignCPTs forthediscretizedvariables,evenifthisignores thenatural orderingofthestates.Simplepreprocessingroutinesofthiskindarecomputationallyconvenient,buttheyarepronetolose information in anuncontrolled way.Andif, asaremedy, oneincreasesthe numberofbins,then thestatisticalefficiency becomes a concern.Thereare alsomoreinvolvedschemes,whichcouplemulti-variablediscretization withgreedysearch forahigh-scoring networkstructure[5];theseschemespotentially yieldbetterdiscretizations,buttheyappeartorequire search heuristicsthatcannotguaranteeoptimalityofthefoundsolution.AnotherapproachistoemployCPDsthancan ac-commodatebothdiscreteandcontinuousvariables,suchasconditionallinearGaussianmodelsandtheirgeneralizationsto discreteresponsesviaappropriatelinkfunctions[6,Ch. 5].Thesemodelsare,however,nonsatisfactoryconcerningstatistical efficiency: onecan arguethat they overparameterizeby introducingseparate linearmodel foreveryconfiguration of dis-cretepredictors,whiletheyalsounderparameterizebyignoringnonlinearinteractionsofcontinuouspredictors.Mixturesof truncatedexponentials(MTEs)offeramoresophisticatedmodel,inwhichadecisiontreepartitionsthespaceofpredictor variablesintoregions,ineachofwhichtheresponseisassignedamixturedistribution[7].Unfortunately,nofastalgorithms tohandleMTEsofneededsizesareknown,andconsequentlyonehastoresorttoproceduresthatdonotofferperformance guarantees.

Thehope forcomputationalefficiencyhastobeinterpretedinrelationtotheNP-hardnessoftheBNstructurelearning problem[8]:arealistic goalistosolvetypicalmoderate-sizeinstancesinpracticewithoutsacrificingtheoptimalityofthe solution, rather than aiming at worst-case polynomial time complexity. Nowadays, we have several exact algorithms, or solvers, that achievethisgoal,especially ifchoosingthebests solverforeach givenprobleminstance [9]. Whiledifferent solvers employdifferentalgorithmictechniques,suchasdynamicprogramming[10–13],A*search[14],integerlinear pro-gramming[15],andconstraintprogramming[16],theyallproceedintwosteps:First,thelocalscoresarecomputedforall nodesandtheir possibleparent sets,oftensettinganupperbound forthenumberofparents.Second, aglobally optimal network structure is found by combinatorialsearch. One challenge in applyingthesesolvers is that the numberof local scores computedinthefirststep cangrowlarge forlarger networks.Partly forthisreason,thesolvers havebeenmostly studiedinsettingswherethelocalscoresareofsomeoftheaforementionedsimpleforms,enablingfastevaluation.Ithas been observed that exact algorithms, when applicable, not only remove theuncertainty on the quality, butalso tend to performbetterinstructurelearninganddensityestimation[17].

Inthisarticle,weaddressallthethreeissues(i–iii)simultaneouslybyputtingforwardalocalmodelbasedon classifica-tionandregressiontrees(CARTs)[18].WederiveaBayesscoremostlyfollowingtheworksonBayesianCART[19–21].The main challenge isthat learning CART models iscomputationally intractable; theexisting algorithms are basedon greedy andlocalsearchandcannotofferusefulqualityguarantees.Here,twoobservationscometorescue.First,intheBNlearning context thenumber ofpredictorscan beassumedto be relativelysmallinpractice.This reducesthenumberof possible CART structures, i.e., decision trees,butis alone not sufficient formaking thesearch computationallytractable. The sec-ond observationis that theso-called dyadic CART, inwhich therecursive splits ofthe rangesof continuous variablesare always atthemidpoint, essentially inheritsthestatistical powerof CARTandyet enablessignificantly faster global opti-mization [22–25]. Buildingon theseobservations,we introduce the

partition–dyadic CART (PCART)

model; here“partition” refersto(recursive)splittingofthestatespacesofcategoricalvariables—tothebestofourknowledge,categoricalpredictors havenotbeenallowedinpreviousworksondyadicCART.

Whenallvariablesarecategorical,onecancomparetheperformanceofPCARTagainststandardCPTsinastraightforward manner. TocomparePCARTagainstanalogoustabular modelsalsointhepresenceofcontinuousvariables,we introducea full-table (FT) variant ofPCART:it discretizes the involvedcontinuous predictors (but not theresponse). While FTadapts thenumberofstatespervariabletothedataathand,itrendersglobaloptimizationcomputationallyfeasible,unlikesome aforementioneddiscretizationschemes[5].

(3)

Inadditiontothesemethodologicaldevelopments,westudyempiricallythefollowingtwoquestions:DoesPCARTboost the statisticalefficiency of structure learning anddensity estimation inpractice, that is, on benchmarkdata sets? What is the price to be paid in terms of computational efficiency, i.e., which kind of problem instances, in terms of model complexityanddatadimensions,canbesolvedinareasonabletime?Onecouldhypothesizethat,thankstoaparsimonious parameterization,PCART enablesmorepowerfuldetectionof arcsandmore robustparameter inference atthecost ofan only modest increase in running time due to efficientPCART optimization algorithms. We will show how the empirical resultssupportthishypothesis.

Partsofthisworkhavebeenpublishedinapreliminaryforminconference proceedings[26].Whiletheconferencepaper onlyconsideredcategoricalandcontinuousvariables,thepresentarticlealsoconsidersordinalvariables—thisamountstoa majorextensiontoboththetheoreticalandempiricalcontributions.Wealsorevisethegeneralexpositionandsubstantially extendtheexperimentalstudy;specifically,thestudiesonpredictiveperformance andhyperparametersensitivityarenew additions. Furthermore, we now also make our implementation of the presented algorithms for computing local scores underthePCARTmodelpubliclyavailable.1

Theremainderofthisarticleisorganizedasfollows.Section2reviewsthenecessarypreliminaryconceptsandnotation regardingBNs, CARTs,andtheBayesianapproach tolearning. We introducethePCART modelinSection 3andourexact algorithms for learning a PCART from data inSection 4.Section 5is devoted to empirical studies. Last, we concludein Section6.Foreaseofreading,somedetailedmaterialispostponedtotheappendix.

2. Preliminaries

ThissectionreviewssomebasicsofBNsanddecisiontrees,includingtheBayesianapproachtolearningthemfromdata. We will focus on the key terminology andnotation neededin later sections; for a more comprehensivediscussion, the readeris referred to the literature on graphicalmodels [6, Chs. 3, 5, and 18] and score-basedlearning of decisiontrees [19,21].

2.1. Bayesian networks

Consider a multivariateprobability distribution p

(

X1

,

X

2

,

. . . ,

X

n

),

whereeach variable Xv takesvaluesfromsome set

Sv.Wewrite V fortheindexset

{

1,2,

. . . ,

n

}

andindexbysubsets:e.g.,ifP

V,wewrite XP forthetuple

(

Xv

)

vP and

SP fortheCartesianproduct

×

vPSv;herewefollowtheconventionthattheindicesaretakeninincreasingorder.

Wesaythatthedistribution

p

andadirectedacyclicgraph(DAG)

G

=

(

V

,

A

)

formaBNifthejointdistributionfactorizes into a product ofthe CPDs p

(

Xv

|

XGv

),

where Gv

= {

u

:

(

u

,

v

)

A

}

is the set of

parents

of v in G. When referring to a

particularCPD,wecall Xv the

response

variableandeach Xu,with

u

Gv,a

predictor

variable.

There are various ways to parameterize a CPD p

(

Xv

|

XP

),

with P

V

\ {

v

}

.Here andhenceforth we may indexthe

predictorvariablessimplyby

P

whenthereisnoneedtoconsidertheDAGon

V

.AmongthesimplestonesareCPTs,which assigna distinctparameterforeachjointconfigurationoftheinvolvedvariables,assumingtheir statespacesare finite.In whatfollows,wewillfocusonmoresophisticatedCARTmodels,whichgeneralizethetabularCPTs.

2.2. Classification and regression trees

InCARTmodels,thestructure isrepresentedasa decisiontree.Adecisiontreeoveramultidimensionalspace SP isa

recursive, axis-parallel,binary partition ofthespace. Moreformally,let T be a rootedbinary tree whereeach node R is a subsetof SP oftheform

×

uPRu.Wecall

T

a

decision tree

of SP ifits rootis SP andevery inner node R anditstwo

children Rand R differinexactlyonedimension

u

P forwhich

{

Ru

,

R

u

}

isapartitionof Ru.The“decision”associated

withnode R iswhetherthevariable Xu belongsto Ru or Ru.Consequently,theleavesof T partition SP.Adecisiontree

Tv of SP andaCPD p

(

Xv

|

XP

)

are

compatible

witheachother if, forall leaves R,we havethat Xv isindependentof XP

conditionallyontheeventXP

R.Inotherwords,theCPDisconstantovereachleaf Randpiecewiseconstantover SP.

Inordertoparameterizethedistributionoftheresponsevariable Xv ineachleaf,wedistinguishbetweenadiscreteand

acontinuousdistribution.Inthiswork,wefocusontwostandardcases:

Categorical distribution: IfSv isfinite,welet

θ

x betheprobabilitythat Xv

=

x,foreach

x

Sv.

Normal distribution: IfSv isthesetofreals, Xvisnormallydistributedwithmean

μ

andvariance

σ

2.

InthecontextofBNswhereeachvariable Xv appearsastheresponse,we understandthattheparametersaredistinct for

all v andallleavesofTv,andomitemphasizingthisinthenotation.Dependingonwhethertheresponseiscategoricalor

continuouswecallthedecisiontreeequippedwiththeCPDa

classification tree

ora

regression tree

,generallya

CART

. Whatismaybenotimmediatefromtheabovedescriptionisthatwealsomodelordinalresponsesbycategorical distri-butions:weenforcetheparameters

θ

x beequalforsomesubsequentvalues

x

.WegivedetailsinSection2.5.
(4)

2.3. Bayesian structure learning

To learn the model structure fromdata, we take a Bayesian score-based approach.We only note in passing that the derivationsofSections3and4alsoapplytootherscoringschemes,especiallypenalizedmaximum-likelihoodscores.

Considerasetofdatapoints

X

i

=

(

Xi

1

,

X

2i

,

. . . ,

X

ni

),

i

=

1,2,

. . . ,

N

.Wemodelthedatapointsasindependentdrawsfrom

adistribution pwhosestructure,namelyaDAG

G

=

(

Gv

)

vV withadecisiontreeTvof SGv foreach

v

V,areunknown;

theparametersoftheCPDsareunknownaswell.Wewilldescribeamodularpriorofthestructureandtheparametersand obtainthescoreastheposteriorprobabilityofthestructure,bymarginalizingouttheparameters.

Consider adecisiontree

T

v ofSP;inourcontext P

=

Gv.Foreachleafof Tv,weassignstandard conjugatepriorsfor

theparametersofthecategoricalandnormaldistributions:

Dirichlet: Let

θ

Dirichlet(

α

),

where

α

=

(

α

x

)

xSv arehyperparameters.

Normal-inverse-gamma: Let

σ

2

IG(

ν

/2,

ν

λ/2)

and

μ

|

σ

2

N(

μ

¯

,

σ

2

/

a

),

where

ν

,

λ,

μ

¯

,and

a

arehyperparameters.

When an ordinal response is modeled by a categorical distribution, a Dirichlet prior is, however, only applied after the valuesarepartitionedintosomenumberofbins;wegivedetailsinSection2.5.

While we could set thehyperparameters separately foreach response variable Xv, inour experiments we will focus

on the casewhereall are setto constant values;see Section 5.1forthe defaultvalueswe used inour experimentsand Section5.6forasensitivityanalysis.

By letting theparameters of all leavesof Tv be independent,we obtaincomputationally convenient formulasfor the

marginallikelihoodfunction.Foraleaf R,let IR

= {

i

:

XiP

R

}

betheindexsetofthedatapointsthatbelongto R.If Xv is

categorical,weobtainthemarginallikelihoodof

T

v as

v

(

Tv

)

=

Rleaf ofTv

(

x

α

x

)

(

NR

+

x

α

x

)

x

(

NRx

+

α

x

)

(

α

x

)

,

where NR

= |

IR

|

and

N

Rx

= |{

i

IR

:

Xiv

=

x

}|

.If Xv iscontinuous,weget

v

(

Tv

)

=

Rleaf ofTv

π

NR/2

ν

)

ν/2

a

NR

+

a

((

NR

+

ν

)/

2

)

(

ν

/

2

)

(

sR

+

tR

+

ν

λ)

(NR)/2

,

where sR

=

(

NR

1)

σ

R2,tR

=

NRa

(

μ

R

− ¯

μ

)

2

/(

NR

+

a

),

and

μ

R isthe sample mean and

σ

R2 the sample variance of the

NR values Xiv with

i

IR [21,Eq. 14].Foran ordinal Xv we derivethelikelihood functionanddiscussitscomputational

complexityinSection2.5.

Furthermore, we let the parameters be independent for each v, so that the marginal likelihood of the structure

(

Gv

,

Tv

)

vV factorizesintoaproductofthelocalmarginallikelihoods

v

(

Tv

).

Wealsoassignamodularstructureprior

p

(

Gv

,

Tv

)

vV

vV

π

v

(

Gv

,

Tv

) ,

whereeach

π

v issome(unnormalized)distribution.Intheliterature,variouspriorshavebeenproposedbothforDAGs(see,

e.g.,AngelopoulosandCussens [27] andreferencestherein)andfordecisiontrees[19,21].Ourchoicesforthepriorsdepend ontherestrictionswemaketotheconsidereddecisiontrees,andwillbedescribedinSection 3.

From analgorithmsstandpoint,thestructurelearningproblemnowbecomes thatoffindingastructure

(

Gv

,

Tv

)

vV so

astomaximizetheposteriorprobability

constant

×

vV

π

v

(

Gv

,

Tv

)

v

(

Tv

) .

(1)

Astheunspecified(normalizing)constantisthesameforallstructures,itcanbeignored. 2.4. Posterior predictive distribution

Givenasetof

N

datapoints X(N)

:=

(

Xi

)

N

i=1,thedistributionforanewdatapoint XN+1 iscalledtheposteriorpredictive

distribution.We willconsidera settingwherewe firstlearnamodelstructure M

ˆ

:=

(

G

ˆ

v

,

T

ˆ

v

)

vV from X(N),andthenthe

distribution of XN+1 is obtained conditionally on both the learned structure andthe N data points. The distribution of interestisthus

p

(

XN+1

|

X(N)

,

M

ˆ

)

=

p

(

X(N+1)

| ˆ

M

)/

p

(

X(N)

| ˆ

M

),

andconsequently,themixeddensity–massfunctionof XN+1 is

obtainedastheratioofthemarginallikelihoodsofM

ˆ

for

X

(N+1)andX(N).This,inturn,amountstoaproductvariable-wise ratiosover

v

V.

Considertheratioforasingle

v

V.Write y forthevalueofXN+1

v andRfortheleafofT

ˆ

vthatmatchesthevaluesof

XGN+1

(5)

(

NR

+

x

α

x

)

(

NR

+

1

+

x

α

x

)

(

NR y

+

1

+

α

y

)

(

NR y

+

α

y

)

=

NR y

+

α

y NR

+

x

α

x

.

If the variable Xv is continuous, we get the densityfunction of a non-standardized Student’st-distribution2 with

ν

degreesof freedom, location parameter

μ

¯

, andscale parameter

λ

(

a

+

1)/a, where

ν

,

μ

¯

,

λ

,and

a

are theposterior counterpartsofthehyperparameters

ν

,

μ

¯

,

λ,

and

a

,obtainedby

a

:=

a

+

NR

,

μ

¯

:=

(

a

μ

¯

+

NR

μ

R

)/

a

,

ν

:=

ν

+

NR

,

λ

:=

sR

+

tR

+

ν

λ

/(

a

ν

) ,

with sR andtR as definedbefore; for a derivation (up to some differences in the parameterization), see,e.g., Murphy’s

notes [28,Sections 5.5and6.5].

Weconsiderthecaseofanordinal Xv inthenextsubsection.

2.5. Extension to ordinal response variables

Theprevioussectionsfocusedonthecasesofcategoricalandcontinuousresponsevariables.Wenowextendthemodel to alsoaccommodate ordinalresponse variables.Tothisend, we introducea Bayesian histogramwherethebinlocations andwidthsaretreatedasauxilliaryparametersand“integratedaway”inthemarginallikelihoodforeachleaf Rinagiven decisiontree

T

v;bymultiplyingtheseleafscoreswegetthelikelihood

v

(

Tv

)

asbefore.

Consider aresponsevariable Xv.Weassume that Xv takesa finitenumberofdifferentvalues, Sv

= {

1,2,

. . . ,

k

}

,with

someinteger

k

.Similarlytoacategoricalresponse,welet

θ

x denotetheprobabilitythat Xv takesvalue

x

.However,instead

ofassigning the vector

θ

a Dirichletdistribution, we assigna prior that constrains the parameters

θ

x be equal forsome

consecutivevalues x. More precisely, givena segmentation B

= {

B1

,

B2

,

. . . ,

B

b

}

of Sv, that is, a partition of Sv intosets

ofconsecutivenumbers,called

bins

,welet

θ

x

:=

β

j

/

|

Bj

|

forall x

Bj.Now,welet

β

Dirichlet(

α

(1)

,

α

(2)

,

. . . ,

α

(b)

)

with

some

α

(j).Observethattheparameters

θ

x sumupto1,asrequired.

Tocompletethemodel,weletthepriorprobability ofa segmentation B with

b

binsbeproportionalto

κ

b,withsome

κ

>

0,andlet

α

(j)

:=

xBj

α

x.Thusthemodelhas

k

+

1 hyperparameters:

α

1

,

α

2

,

. . . ,

α

k,and

κ

.

Proposition1.

The marginal likelihood of a decision tree T

vis given by

v

(

Tv

)

=

R leaf of Tv wv

(

R

) ,

with wv

(

R

)

:=

B segm. of Sv

κ

b−1

(

1

+

κ

)

k−1

(

j

α

(j)

)

(

NR

+

j

α

(j)

)

×

b

j=1

(

N(Rj)

+

α

(j)

)

(

α

(j)

)

N(Rj)

(

NRx

)

xBj

|

Bj

|

N (j) R

,

where N(Rj)

:=

xBjNRxis the total number of data points in Bj.

Proof. Theprobabilityofthedata

(

NR1

,

NR2

,

. . . ,

N

Rk

)

givenasegmentation B

= {

B1

,

B

2

,

. . . ,

B

b

}

canbewrittenas

NR NR1

,

NR2

, . . . ,

NRk

p

|

B

)

b

j=1

β

j

|

Bj

|

N(Rj) d

β ,

where

p

|

B

)

isthedensityfunctionoftheDirichletdistribution.Bywriting

NR NR1

,

NR2

, . . . ,

NRk

=

NR NR(1)

,

NR(2)

, . . . ,

N(Rb)

b j=1

N(Rj)

(

NRx

)

xBj

,

takingtheterms

|

Bj

|

N (j)

R outoftheintegral,andusingtheknownclosed-formexpression fortheremaining integral,we

getthesummandin

w

v

(

R

),

withoutthefactor

κ

b−1

/(1

+

κ

)

k−1,whichisthepriorprobabilityof

B

;theunnormalizedprior

probabilities

κ

b sumupto

κ

(1

+

κ

)

k−1,sincethereareexactly

k−1 b−1

segmentationswith

b

nonemptybins.

2

Proposition2.

Given the counts N

R1

,

NR2

,

. . . ,

NRk, the term wv

(

R

)

can be computed in time O

(

k2logNR

)

.

2 Thedensityfunctionf

(y)ofthenon-standardizedStudent’st-distributionwithddegreesoffreedom,locationparameterm,andscaleparameters,is givenby f(y)= (d+1)/2 (d/2) πds−1/21+(ym)2 ds(d+1)/2

(6)

Fig. 1.An example of a PCART model. It represents the conditional probability densityp(X|Y,Z), whereX(0,10),Y(0,100), andZ∈ {a,b,c,d}. Proof. Since

j

α

(j)equals

α

+

:=

x

α

x forall segmentationsB,theterm

κ

−1

(1

+

κ

)

1−k

(

α

+

)/(

NR

+

α

+

)

inthe sum-manddoesnotdependon

B

andcanbefactoredout.Theremainderofthesummandfactorizesintoaproductof

b

terms, one for each Bj. In total, there are

k+1 2

different terms, one for each bin

[

s

,

t

]

:= {

s

,

s

+

1,

. . . ,

t

}

with1

s

,

t

k. To showhowthesetermscanbeprecomputedwithintheclaimedtime,denoteby

N

R(s,t)and

α

(s,t)thecorrespondingnumbers

N(Rj)and

α

(j),i.e.,when B

j

= [

s

,

t

]

.Observefirstthat thenumbers NR(s,t)and

α

(s,t) aresimplesumsthatcanbecomputed

incrementally, i.e., by addingNRt and

α

t tothe alreadycomputedNR(s,t−1) and

α

(s,t−1).Thus, assumingthe valuesofthe

gammafunctionhavebeenprecomputed,theirratiocanbecomputedinconstanttime perbin.Similarly,themultinomial coefficient is updated bymultiplying it bythe binomialcoefficient

N(Rs,t)

NRt

,which we assumeto be accessibleinconstant time (e.g., viathree evaluationsofthe gammafunction). Finally,the binlength t

s

+

1 israised to theexponent N(Rs,t) by repeatedsquaringintime O

logNR(s,t)

.Let f

(

s

,

t

)

denotetheprecomputed factors,1

s

,

t

k. Considerthe function g

(

t

)

definedby g

(0)

:=

1 and g

(

t

)

:=

ts=1g

(

s

1)f

(

s

,

t

)

for1

t

k.Wehavethat

w

v

(

R

)

=

g

(

k

)

(

α

+

)

(

NR

+

α

+

).

It remainstoobservethat

g

(

k

)

canbeevaluatedintime

O

(

k2

).

2

Wecanreducetherunningtimeto O

(

k2

)

byprecomputingthepowers

l

mforall1

l

kand1

m

N.Eventhought

the precomputation takestime andspace ororder kN, thiscan be advantageous when wv

(

R

)

iscomputed formany v

andR.

Fortheprobability massfunctionoftheposteriorpredictive distribution,we arenot awareany“closed-formformula,” butthefollowingresultisimmediateandsufficientforthecomputationalpurposes:

Proposition3.

Given a decision tree

Tv of Sv and N data points, the response equals y in a new data point with probability

wv

(

R

)/

wv

(

R

)

, where R is the unique leaf of Tvthat matches the new data point, wv

(

R

)

is as defined in Proposition1, and wv

(

R

)

is

the same when NR yis replaced by NR y

+

1. 3. Partition–dyadicCART

Partition–dyadic CART (PCART)is a subclass ofthe CART model, which we obtain by restricting the wayin which the rangeofacontinuouspredictorvariableisrecursivelypartitionedbythedecisiontree.Specifically,weassumetherangeis an interval andonlyallowhalvingit initsmidpoint,whence thenamedyadicsplits. Forthenumberofrecursivedyadic splitspervariablewesetanupperbound,

the maximum number of splits

,denotedby

s

.Incontrast,foracategoricalpredictor weallowarbitrary(binary)partitioning.Fig.1showsanexampleofaPCARTmodel.

PCART generalizes dyadic decision trees [23], which assume all predictors be continuous (or binary) and the response be categorical. The motivation fordyadic splits stems from their ability to guarantee both statistical andcomputational efficiency:dyadicdecisiontreesenablecomputationallyfeasibleglobaloptimizationondatasetswithuptoaroundadozen ofpredictorvariables,andtheyachieve(inaminimaxsense)nearlyoptimalratesofconvergenceforarangeofclassification problems[23,24].

3.1. Structure prior

Havingfixedthefamilyofdecisiontrees,we proceedtospecifyapriordistributionoverthefamily.Toconformto the modularstructureofthelikelihoodfunction,weconstructapriorwheretheprobability

p

(

T

)

ofatree

T

factorizesoverthe leavesof

T

.Thisyieldsa decomposablescoringfunction,whichpropertyiscrucialfortheoptimizationalgorithmwegive inthenextsection.

We consider priors that assign p

(

T

)

γ

l(T), wherel

(

T

)

is the number of leaves of T and 0

<

γ

1 is a constant independent ofT.Importantly, wedonotset

γ

toanyfixedvalue,butwe insteadlet

γ

dependon thenumberofsplits possibleineachinnernode.Withoutsuchdependency,thepriorcoulddistributenearlyallprobabilitytoverylargedecision trees,astheirnumberoverwhelmsthenumberofsmalltrees.
(7)

Moreprecisely,weset

γ

=

1/(4C

),

where

C

isthetotalnumberofpossiblesplitsataninnernode(ignoringsplitsmade inotherinner nodes).Wehavethat

C

isasumwhereeach continuouspredictorcontributes1,eachcategoricalpredictor with

k

possiblevaluescontributes2k−1

1,andeachordinalpredictorwith

k

possiblevaluescontributes

k

1.Thefactor

4 comes fromthe numberofordered fullbinary treeswith

i

innernodes,whichisgivenbythe ithCatalannumberand equalling to 4i up toa factor polynomial in i. We can interpretthe prior as firstdrawing the number ofleaves froma nearly uniformdistribution,then drawing arandom full binary treewiththegivennumber ofleaves, drawinga random splitindependentlyforeachinnernodeofthetree,andacceptingthetreeifitisavaliddecisiontree.

For an illustration, consider the decisiontree shown inFig. 1. In thiscase we have one categoricalpredictor with4 possible values andone continuous predictor.Consequently, the constant C equals 24−1

1

+

1

=

8,whence

γ

=

1/32. Thus, the prior probability of theshown treewith 5leaves is 324

=

220 timessmaller than the prior probability ofthe

independencemodel,i.e.,thetreewheretherootistheonlyleaf. 3.2. PCART as a local model

ToincorporatePCARTaslocalmodelsinBNs,weneedtospecifythecomponents

π

v

(

Gv

,

T

v

)

ofthestructureprior.Let

P

=

Gv

V

\ {

v

}

with

|

P

|

D;the

maximum indegree

parameter

D

controlsthecomputationalandstatisticalcomplexityof

themodel.Forbrevity,write

T

for

T

v.Weusethefactorization

π

v

(

P

,

T

)

=

π

v

(

P

)

π

v

(

T

|

P

)

c

(

P

) ,

where

c

(

P

)

normalizes

π

v

(

T

|

P

)

intoaprobabilitydistributionfor

T

.Itremainstospecify

π

v

(

P

)

and

π

v

(

T

|

P

).

Weconsider

twodifferentchoicesforboth

π

v

(

T

|

P

).

Our first choice is to set

π

v

(

P

)

=

1, i.e., the prior is nearly uniform over parent sets of size at most D.3 This is the

simplestandmost“ignorant”prior, whichtreatsall DAGsequallyapriori;foradeeperdiscussionofgraphpriorswerefer toAngelopoulosandCussens[27] andreferencestherein.Wegetthebasicvariantofourmodel:

PCART (partition–dyadic CART): Let

π

v

(

T

|

P

)

=

(4

C

)

l(T) foralldecisiontrees T of SP withatmostsdyadicsplitsper

con-tinuousvariable.

Oursecondchoiceisgivenby

π

v

(

P

)

=

1

n−1 |P|

,i.e.,theprior is

nearly uniform over the sizes of the parent sets

.4 Thisprior avoidsconcentrating almost allprior probability onthe densestgraphs; see,e.g.,Friedman andKoller [29] for usage and discussionofthisprior.Wewillcallthisthe

penalized

variantandrefertoitas

PCARTp

.

Through thelikelihood function,themodel definesajointlocal score

π

v

(

Gv

,

Tv

)

v

(

Tv

).

This,inturn,induces alocal

scorefortheparentset,

g

v

(

Gv

),

obtainedbymaximizingthejointlocalscoreover

T

v.Therolesofthesetwokindsof local

scoresin structurelearning willbeclarified inthenextsection (Proposition4).We note thattheresultingglobalscoring functionforDAGs,obtainedbytakingaproductoftheparent-wiselocalscores,isnotscoreequivalent,that is,twoDAGs thatencodethesameconditionalindependencerelationsmaygetdifferentscores.Itisanopenquestion,whetherthereare anynontrivialconstraintsunderwhichwewouldachievescoreequivalence.

3.3. Full table variants and adaptive discretization

RecallthatthestandardCPDmodelinBNsisthefulltablemodel,whereeachconfigurationtheparentvariable’sstates isassignedadedicatedcategorical(multinomial)distribution.ForcomparingPCARTtothisstandardmodel,wenextextend thefulltablemodelalsotocaseswheretheresponseorsomepredictorvariablesarenotcategorical.

If all predictor variables are categorical, the PCART and PCARTp models we just defined have the following tabular counterparts:

FT (full table): Supposingallpredictorvariablesarecategorical,let F beamaximaldecisiontreeof SP whereeachleafisa

singleton

{

xP

}

with

x

P

SP.Welet

π

v

(

T

|

P

)

=

1 if

T

=

F and

π

v

(

T

|

P

)

=

0 otherwise.Inaddition,

π

v

(

P

)

=

1.

FTp (penalized full table): LikeFTbutwith

π

v

(

P

)

=

1

n−1 |P|

.

Inotherwords,thesemodelsintroduce,foreachjointconfigurationofthepredictorvariables’states,adedicatedcategorical ornormaldistribution.IntheformercasethiscoincideswithastandardCPT;inthelattercase,itcoincideswithastandard conditionallinearGaussianmodel,albeitofasomewhatdegenerateform.

Tomake thetabular variants applicable tocontinuous and ordinalpredictors aswell, we introducethe following dis-cretization scheme. Foreach continuous or ordinal predictor u

P, discretize its range of values into

k

u bins matching

3 ThepriorisuniformoverDAGsbutnotexactlyuniformoverthepossibleparentsforafixednode,becauseasmallersetofparentsiscompatiblewith alargernumberofDAGs.

(8)

the ku quantilesofthe empiricaldistribution.The numbers

k

u arefree structuralparameters ofthemodel,takingvalues

between 2 and some maximum M (weused M

=

7 inour experiments). We assignuniform prior over the jointvalues

(

ku

)

uP.

Thisdiscretizationschemehastwo importantfeatures.Oneisthatitconsidersjointdiscretizationsofthepredictorsin relationtotheresponse,asopposedtosomesimplerschemesthatconsiderthevariablesseparatelyorinpredictor-response pairs. The other featureis that theschemedoes notfix anysingle discretization ofa continuous variableinthe BN,but the discretization mayvary betweenthe parent sets to which thevariable belongs.Both features enhance thepower of the discretization schemeas compared tothe simple schemes.The prize we have topay is that thecomputational cost of searchingfor optimaldiscretizations grows,as we needto consider all possibleconfigurations

(

ku

)

uP for each P we

encounter;wewilladdressthisissueinSection4.

4. Algorithms

Thestate-of-the-artalgorithmsforfindingagloballyoptimalBNstructureproceedintwosteps:

Step 1 (Candidate parent set scoring and pruning) Foreach v

V and P

V

\ {

v

}

with

|

P

|

D compute the localscore gv

(

P

),

definedintermsofthelocalmodelandthedata(e.g.,priorandlikelihood).Pruneevery setthathasasubset

withatleastaslargescore.Returntheremainingparentsetsandtheirscores. Step 2 (DAG search) ReturnaDAG

G

thatmaximizesthetotalscore

vVgv

(

Gv

).

ForStep 2wemayuseexistingcompletesolvers,e.g.,onesbasedonintegerlinearprogramming [15] orA* [14].Inour experimentsweusedtheintegerlinearprogrammingbasedsolver

GOBNILP

[15],sinceinourpreliminarystudyitappeared toscalewell.ItisworthnotingthatwecannotusetheconstraintprogrammingbasedsolverofvanBeekandHoffmann [16] becauseitassumesscoreequivalence,i.e.,thatthescoringfunctiongivesthesamescoreforallMarkov-equivalentDAGs.5

OurconcernisStep 1.Toagreewiththedefinitionofthescoringfunction(1),wesetthelocalscoresby

gv

(

P

)

=

max T

π

v

(

P

,

T

)

v

(

T

)

,

where

T

runsthroughalllocalstructuresonthepredictors P.Indeed,itiseasytoseethatthisassignmentguaranteesthat thealgorithmwillfindagloballyoptimalstructure:

Proposition4

(Optimality). Let G be a DAG that maximizes

vVgv

(

Gv

)

. Then, for each v

V , there exists a Tv such that the

structure

(

Gv

,

T

v

)

vVmaximizes the scoring function (1).

Proof. ForanyfixedDAG

G

thechoicesforthelocaltrees

T

v areindependent.Therefore,wehavethat

vV max Tv

π

v

(

Gv

,

Tv

)

v

(

Tv

)

=

max (Tv)vV

vV

π

v

(

Gv

,

Tv

)

v

(

Tv

)

.

Thus, choosingforeach

G

v atree

T

v thatmaximizes

π

v

(

Gv

,

T

v

)

v

(

Tv

)

yieldsastructure

(

Gv

,

T

v

)

vV thatmaximizesthe

scoringfunction(1)

2

InthecaseofPCARTeachlocalscorerequiresoptimizationoveralargenumberofdecisiontrees.Moreover,thenumber of required local scores grows rapidly withthe total numberof variables

n

, beingn

dD=0

n−1 d

for maximum indegree D.In theremainder ofthis section we detaila dynamic programmingalgorithm tosolve thisoptimization problem. We havedevelopedasimilaralgorithmforcomputingtherequirednormalizingterm

c

(

P

),

whichisa

sum

overdecisiontrees; however,asthistermisindependentofthedataandfastertocompute,wewillomitamoredetaileddescription.

ForFT (maximaldecisiontree),itsuffices toenumeratethe possiblejointdiscretizations ofthecontinuous predictors. For fixed v and P, thisrequires atmost

(

M

1)|P| evaluationsof a closed-form expression; formoderate valuesof the maximumnumberofbins

M

,thecomputationalcostissmallcomparedtothatofPCARTstructureoptimization.

4.1. PCART structure optimization

WegiveadynamicprogrammingalgorithmtofindanoptimalPCARTstructureforaresponse

v

andasetofpredictors P

V

\ {

v

}

.The algorithm applies toany objectivefunction that factorizes intoa product oflocal scores f

(

R

)

over the leaves R ofthetree.OurBayesscoreisclearly ofthisformasboth thepriorandthelikelihoodfactorizeovertheleaves. Foreverypossibledecisiontreenode R

= ×

uPRu,thealgorithmcomputes,inatop-downfashion,thebestscore f

(

R

)

for

thesubtreebelowR.Thusthescoreoftheroot, f

(

SP

),

isthemaximumscoreoverallPCARTstructures.
(9)

Algorithm1PCARToptimization.

Input:Responsev,predictorsPV\ {v}anddataX=(X1,X2,. . . ,XN)

γ←1/4uPcontinuous1+ uPcategorical(2|Su|−1−1)+ uPordinal(|Su| −1) −1

Computethestructurepriorfactor functionLeaf-Score(y1,y2,. . . ,yN)

returnγ ×marginallikelihoodofresponsevaluesycomputedusingtheformulasinSections2.3and2.5 LetMbeanassociativearray

functionSubtree-Score(R,Y=(Y1,. . . ,YN)) Computesf(R)giventhatY isthesetofdatapointsinnodeR foreachcategoricalorordinalpredictoruPdo Removemissingvaluesfromdiscretevariables

Ru← {Yu1,Yu2,. . . ,YN

u}

ifMdoesnotcontainelementRthen

MR←Leaf-Score(Y1

v,Y2v,. . . ,YN

v ) ConsiderthecasewhereR isaleafnode

ifN>0then

foreachcontinuouspredictoruPdo Considerallsplitsincontinuousvariables LetRu= [x,y),Su= [x,y)

if(xy)/(xy)>2−sthen

LetR,RbesuchthatRu= [x,(x+y)/2),Ru= [(x+y)/2,y)andRP\{u}=RP\{u}=RP\{u} PartitionY into(Y,Y)suchthatYcontainsthedatapointsYisatisfyingYi

u< (x+y)/2

MR←max{MR,Subtree-Score(R,Y)×Subtree-Score(R,Y)}

foreachcategoricalpredictoruPdo Considerallsplitsincategoricalvariables foreachpartition{C,C}ofRpintotwononemptysetsdo

LetR,RbesuchthatRu=C,Ru=CandRP\{u}=RP\{u}=RP\{u} PartitionY into(Y,Y)suchthatYcontainsthedatapointsYisatisfyingYi

uC

MR←max{MR,Subtree-Score(R,Y)×Subtree-Score(R,Y)}

foreachordinalpredictoruPdo Considerallsplitsinordinalvariables foreachpartition{C,C}ofRpintotwononemptysetssuchthatmaxC<minCdo

LetR,RbesuchthatRu=C,Ru=CandRP\{u}=RP\{u}=RP\{u} PartitionY into(Y,Y)suchthatYcontainsthedatapointsYisatisfyingYi

uC

MR←max{MR,Subtree-Score(R,Y)×Subtree-Score(R,Y)} returnMR

Output:Subtree-Score(SP,X)

Thebestsubtreescorefornode Risobtainedbyconsideringallpossibilitiesforthenode:thenodemaybealeafnode, oritisaninnernodewithtwochildren Rand

R

.Thescoreofaleafiscomputedfromthetrainingdatapointscontained in

R

asspecifiedinSections2and3.Foraninnernodethescoreiscomputedrecursivelyastheproduct f

(

R

)

f

(

R

)

ofthe scoresofthechildsubtrees.Therecursioncachesthealreadycomputedsubtreescorestoavoidcomputingthesamescore multipletimes.Aftercomputingthescores,anoptimaldecisiontreeisextractedbytracingthesplitsthatgavetheoptimal scores.

Withourchoiceofscores,ithappensthatanode R thatcontainsnotrainingdatapointshastoalwaysbealeafnode, andthe recursiondoesnot need to advancefurther. Furthermore,iffor some categoricalorordinal variable u there are some valuesin Ru thatdonot appearinthetrainingdatathat iscontainedinR,we cansimplyconsidera smallernode

withtheunused valuesremoved. Thismaycause Ru to consist ofnonadjacentvaluesforsome ordinal variable

u

,butit

doesnotaffecttheresultsince weonlyconsidersplits respectivetotheorderingofthevariable,thatis,all thevaluesin Ru aresmallerthanthe valuesin Ru.Theseoptimizationssignificantly reduce thenumberofnodesthealgorithmhasto consider.ThepseudocodeforthealgorithmisshowninAlgorithm1.

WenextgiveanasymptoticupperboundforthenumberofnodesK thealgorithmconsidersandfortherunningtime ofthealgorithm.Theboundswillberatherloose:theyfailtoaccountforalltheoptimizationsandspecialpropertiesofthe datasets.

Proposition5

(Time complexity). Algorithm

1considers K

min

N2A(k−1)−2C

(

s

+

1)B

,

2Ak+B(s+1)C

(

r

+

1)2C tree nodes, and

the total time complexity is O

n

(2

kA

+

B

+

rC

)

KlogK

, where A, B, and C are the numbers of categorical, continuous, and ordinal predictor variables, respectively, k is the maximum number of categories, s the maximum number of recursive splits per continuous variable, and r the maximum number of values of any ordinal variable.

Proof. Counting the possiblenodes simply asthe product of atmost 2k possible value subsets Ru

Su per categorical

variable

u

P,atmost2s+1 subsetsper continuousvariable,andatmost

r+1 2

(

r

+

1)2

/2 subsets

perordinal variable,

givesan upperbound K

2Ak+B(s+1)C

(

r

+

1)2C. Inthe casewhenthe numberoftraining datapoints N is low,weget abetter boundsimilarly toBlanchardetal. [24] (whoassume allvariables becontinuous): Sincethealgorithmonly con-siders nodesthat contain at least one training datapoint, we can bound the numberof nodes by the total number of containing nodes foreach datapoint. The number of value subsets that contain a givenvalue isat most 2k−1 foreach

categoricalvariable,

s

+

1 foreachcontinuous variable,and

(

r

+

1)2

/4 for

eachordinal variable,yielding anupperbound K

N2A(k−1)−2C

(

s

+

1)B

(

r

+

1)2C.

Now,forevery node the algorithmconsiders all possiblesplits—at most2k foreach categoricalvariable,one foreach continuousvariable,andatmost

r

1 foreachordinalvariable—andfindsthescorefromanassociativearraydatastructure in

O

(log

K

)

time.Thusthetotaltimecomplexityis

O

n

(2

kA

+

B

+

rC

)

KlogK

.

2

(10)

Fig. 2.Anillustrationofthebinarynoderepresentation.TheexamplecorrespondstothePCARTinFig.1,andconsistsoftwoparts:3bitsforthecontinuous variableY (herethemaximumnumberofsplitsis2)and4bitsforthecategoricalvariableZ.Forthecontinuousvariable,thelengthoftheleading1-bits andthefollowing0-bit(ingray)indicatesthelengthoftheinterval,andtheremainingbitsencodethepositionoftheinterval.

Foran illustrationofthe complexitybound,considerthe decisiontreeshowninFig.1.Inthiscasewehaveone cate-goricalpredictorwith4possiblevaluesandonecontinuouspredictor.ByProposition5,thetimecomplexityis

O

(

nKlogK

),

wherethenumberofconsideredtreenodesK isatmostmin

{

64

·

N

(

s

+

1),256

·

2s+1

}

.

4.2. Implementation tricks

The dynamicprogrammingalgorithmforfindingan optimalPCARTstructure iseasilyimplementedusingrecursion,as describedinSection4.1.However,bycarefullytuningthememoryrepresentationofnodes,weareabletooptimizeourC++ implementationtorunanorderofmagnitudefaster.

Ourimplementationstoresnode

R

asabitvectorthatisaconcatenationoffixed-lengthbitvectors,eachofwhichstores the value subset Ru for a variable u

P. Foreach continuous variablewith maximum numberof splits s,the 2s+1

1

possible intervalsare storedas

(

s

+

1)-bit binary numbers.Foreach categoricalvariable u,the subset ofvalues Ru

Su

is storedusing

|

Su

|

bits.AnexampleoftherepresentationisshowninFig.2.The predictorvaluesofthedatapoints are

alsoencodedastheminimalnodescontainingthem.Inpractice,64bitssufficeforstoringthenoderepresentations,which meansthattheoperationsperformedbythealgorithmonthenodestranslateintoefficientbitoperationson64-bitmachine words.Thedatacanbepartitionedin-placeusingsimplebitoperationsandbecauseofthecompactrepresentation,thedata willstaypresentinCPUcaches,makingtheinnermostloopsofthealgorithmveryefficient.

5. Empiricalstudies

Weempiricallycomparedthedifferentlocalmodelsinrelationto(a)structurerecovery,(b)predictiveperformance,and (c)runningtime.Tothisend,weusedbothsyntheticdatageneratedfromknownbenchmarknetworksandrealdatafrom variousdomains.

5.1. Data sets and hyperparameters

Asground-truthbenchmarknetworks,weusedAlarm,Insurance,andWater,allextractedfromthe

bnlearn

[30] repository.6 The networks contain 37, 27,and 32 variables respectively, and are thus suitable for finding the globally optimal DAG withmodern solvers. However,asthesebenchmarknetworkscontain onlycategoricalvariables,we didnotuse themfor evaluatingtheperformanceonmixedvariables.

Forrealdata,weextractedallsuitabledatasetsfromtheUCImachinelearningrepository [31];moreexactly,we consid-eredalltheclassificationandregressiondatasetsthathaveatleast100datapoints,atleast8butfewerthan36variables, andthatconsist ofa singletable thatisfreelyavailable.We preprocessedthedatasetsbynormalizingallthecontinuous variables to havemeanzeroandvariance one andlimitingthenumberofcategoriesin eachcategoricalvariable to5by combiningtherarestcategoriesintoone.Wealsoremovedidentifiervariablesanddatapointsthatcontainmissingvalues, rejectingthedatasetswhereweneedtoremoveamajorityofthedatapoints.WealsorejectedthedatasetsMushroomand

ParkinsonsTelemonitoring,since thenetwork optimizationphaseusing

GOBNILP

took too muchtime forthem.SeeTable B.5

(AppendixB)foracompletelistofthe52datasets.

Wesettheuserparametersofthemodelsasfollows.Thehyperparametersofthepriordistributionsweresetas

α

=

1

/

2 (Jeffreys prior),

μ

¯

=

0,anda

=

ν

=

λ

=

1,exceptfor thesensitivitystudy (Section5.6) wherewe varied thelatter three values;forthesegmentationpriorofordinalresponses weset

κ

=

1

/

4toapproximatelybalancethepriormassonsmaller
(11)

Fig. 3.ComparisonofgloballyoptimalDAGsearch(FT,FTp)withheuristicapproaches(HC,TABU)andaconstraint-basedalgorithm(PC)forthetaskof recoveringaground-truthnetworkstructurebasedondatasamplesofvaryingsize.TheevaluationmetricisanaveragestructuralHammingdistanceover tenindependentdatasamples.Errorbarsindicate(hereandinallfollowingplots)thestandarderrorinbothdirections.

andlargernumbersofbins.Wesetthemaximumnumberofsplitspercontinuousvariableto

s

=

5,themaximumnumber ofbinsinthefulltablevariantto M

=

7,andthemaximumindegreeofaDAGto D

=

4.

5.2. Comparison of exact and heuristic algorithms for Bayesian network structure learning

As a preliminary small-scalestudy, we investigatedthe advantage offinding a globally optimal DAGas opposedto a merely high-scoring DAG.Here werestrict ourselves tostandard CPTbasedlocalscores. Tobe precise,we comparedour implementation of the FT local score (with GOBNILP for searching for the globally optimal DAG) with Scutari’s bnlearn implementation [30] ofhill-climbing(HC),TABUsearch,andtheconstraint-basedPCalgorithm.ExceptforthePCalgorithm, all theseimplementations usethesamescoringfunction (theBDscore withhyperparameters setto1

/

2,combinedwitha uniformstructureprior),thusenablingfaircomparisontoFT.Forreference,wealsocomparedtoglobaloptimizationunder thepenalized FTp score.We usedthePC-implementation withitsdefaultparameters:an asymptoticchi-squared test for themutualinformationteststatisticwithasignificancelevelof0.05.

Fromthethreebenchmarknetworkswesampleddatasetsofdifferentsizes,tenrepetitionseach,andlearnedanetwork structureforeachdatasetandalgorithm.WethencomputedthestructuralHammingdistance(SHD) [32],whichmeasures differencesup to the equivalenceclass, betweeneach learnednetwork andthe groundtruth. Weaveraged the SHDs for eachalgorithmandsamplesizeoverthetenindependentrepetitions.TheresultinglearningcurvesareshowninFig.3.

WeobservethatFTandFTpoutperformtheotherthreeapproachesforAlarmandInsuranceoncethesamplesizeexceeds afew hundreddatapoints.ForWater,however,FTperforms substantiallyworse thanHCandTABU.Thiscanbeexplained by the observation that the BDscore alone (without a nonuniform structure prior) can be a very poorscoring function duetoapreferenceforoverlycomplexmodelsatrelativelysmallsamplesizes. Insuch cases,theinabilityoftheheuristic algorithmstofindtheratherdensegloballyoptimalDAGbecomesafortunateadvantage,asthey stopthesearchatalocal optimumthatisa relativelysparsesubgraph.ThepoorperformanceofFTjustifiestheinclusionoftheFTpvariant,which usesastructurepriorthatisknowntopenalizecomplexityappropriately [33].Unfortunately,

bnlearn

doesnotallowcalling HCorTABUwithanarbitraryscoringfunction,soitisnotpossibletodirectlycomparethegloballyoptimalFTpsolutionto thecorrespondingheuristicresults.

The constraint-based algorithm (PC) hasadvantages atvery smallsample sizes, but it doesnot learn much from an increasedamount ofdataandeventually recovers the generatingnetwork worse thanthe score-basedlearning methods. (Onemight improve the performance of PC by letting its significance levelparameter depend onthe data size in some appropriateway,butthe

blearn

implementationdoesnotdothatbydefault.)

AsidefromtheratherpathologicalcaseoftheWaternetwork,theobservationsfromthepreliminarystudyaddto previ-ouslypublished evidencefortheneedofglobaloptimization[17] andjustifiesourrestrictiontostudylearning withlocal structuressolelywithintheglobaloptimizationframework.

5.3. The effect of local structure on benchmark network structure recovery

Consider thenthesameexperimental settingasintheprevious section,butnowforcomparingthefourlocalmodels, PCART(p)andFT(p),andonlyexactalgorithms.Weobserve(Fig.4)thatbothPCARTvariantsyieldsmallerSHDsthanFTfor allgeneratingnetworksandsamplesizes. Withthesparsity-favoringDAGprioradded,FTpbecomescompetitiveforAlarm. Apossible explanation isthat Alarm hasthe smallestedge/noderatioof only1.24, whereas Insurance andWater have1.93 and2.06, respectively.Asmallaverage indegreeleaveslittle potentialforstructured CPDsasthemajority ofCPTs canbe accuratelyrepresentedwitha (small)probabilitytable.ForInsuranceandWaterthePCARTvariantsyieldfasterconvergence thanFTp.

Insummary,PCART(p)appearstohaveaslightadvantageoverthefull-tablevariantsinrecoveringthenetworkstructure, butnotineachandeverycase.Thisisnotsurprising,sincethereisnoreasontobelievethateveryconditionalprobability

(12)

Fig. 4.Accuracy of PCART(p) with FT(p) in structure recovery from data samples generated from three benchmark networks.

Fig. 5. LearningcurvesproducedbytheIntersection-ValidationmethodforfourUCIdatasets.Thelargestsamplesizeineachplotcorrespondstothe originaldatasize,forwhichallmethodsyieldthesamepartialHammingdistance(ofvaluezero)bydefinition.

table in theground-truthnetworksadmitsan economical tree-structuredrepresentation.Moreover, interms ofSHDit is oftenbeneficial tolearn aconservative(sparse) modelwhenthesample sizeis small [33].Forvery smalldata sets,even attemptingtolearnastructuredlocalmodelmayyieldsmallstructuraloverfittingeffectsthatoutweighthepossiblyhigher expressivenessofPCART.

5.4. Structure recovery on real data

For evaluating the accuracy of structure learning fromreal data, we apply a very recent methodcalled Intersection-Validation [34].ThemethodallowsustodrawlearningcurvessimilartotheexamplesinFig.4evenwhennoground-truth DAGisavailable.Themethodcheckswhichstructuralfeaturesalllearningalgorithms(e.g.,scoringfunctions)thataretobe comparedagreeuponwhenlearningfromtheentiredataset,andtreatsthemassurrogategroundtruth.Structureslearned atsmallersubsamplesoftheentiredatasetarethencomparedagainstthispartialnetworktocomputeanapproximation ofSHD,called

partial Hamming distance

.

For each of the 52 UCI data sets under consideration and foreach ofthe four scoring functions, we learn networks fromsubsamplesof1

/

2,1

/

4,1

/

8,1

/

16,1

/

32,and1

/

64oftheoriginaldatasizeandrepeatthesubsamplingprocesstentimes.The resulting averagedlearning curvesforthe fourscoring functionsincomparisonare showninFig. 5forthefirstfourdata sets according to an alphabeticalranking of their names, andinAppendix A forthe remaining 48 datasets. Byvisually inspectingthelearningcurves,wedrawsimilarconclusionsasinthebenchmarknetworkstudyfromtheprevioussection: FTiswidelyinferiortotheothermethods,likelyduetolearningoverlycomplexmodelsatsmallsample sizes.PCARTlocal modelsperformoften,thoughnotalways,slightlybetterthanFTp.

Inorderto summarizetheresultsfromthe52datasetsinasystematicfashion,weaggregatedthembythefollowing procedure.Weconsideredeachfractionoftheoriginaldatasize(1

/

2,1

/

4,etc.)asaseparategroup,irrespectiveoftheabsolute samplesize.Wethenmadeapairwisecomparisonofthefourlocalmodelsforeachofthesixgroups.Tothisend,weused the followingassessment:method Aperforms better/worsethan methodB(win/loss) ifityields a smaller/larger average partial Hammingdistanceandiftheerrors bars inthelearning curvesdo notoverlap. Iftheydo overlap,we considered bothmodelstoperformequallywell(tie).

We show theseaggregated statistics inTable 1 for the two mostimportant modelcomparisons and inTable A.3for theremainingfourcases.WeobservethatthePCARTvariantsperformconsistentlybetterthanthefull-tablemodels, espe-cially atsmallsamplesizes. ThisdoesnotonlypertaintothecomparisontoFT,butalsotoFTp,whichdemonstratesthat structured CPDshelptorecovercorrectstructuralfeatureswithfewersamples,incomparisontofull-tablemodels.Thisis an interestingresultastheeffectwasnot thatobviousinFig.4,whichindicatesthat realdatasetstendtohavedifferent properties than syntheticdatasampled frombenchmarknetworks. An alternative todeclaring a tie betweenmethods in the eventofoverlappingerrorbarsistousea nonparametric statisticaltest (TableA.4).The numberoftiesdoeshere

(13)

in-Table 1

SummarizedmethodcomparisonbasedontheIntersection-Validationmethod.Samplesizereferstothefractionof thesizeoftheoriginaldataset.Atieoccursiftheerrorbarsoftwomethodsoverlap.Otherwisethemeanpartial Hammingdistancedecidesuponwinorloss.

(a) PCARTp vs FTp

Sample size Win Tie Loss

1/64 21 30 1 1/32 26 24 2 1/16 33 15 4 1/8 29 18 5 1/4 23 22 7 1/2 18 30 4 (b) PCART vs FT

Sample size Win Tie Loss

1/64 50 2 0 1/32 49 3 0 1/16 47 4 1 1/8 40 12 0 1/4 39 10 3 1/2 31 16 5 Table 2

SummaryofpredictionresultsonUCIdata.Atieoccursiftheerrorbarsof twomethodsoverlap.

Win Tie Loss

PCART vs PCARTp 1 51 0 PCART vs FT 20 23 9 PCART vs FTp 19 24 9 PCARTp vs FT 21 20 11 PCARTp vs FTp 18 23 11 FT vs FTp 0 51 1

creaseduetothesmallsamplesize(tenrepetitions),butamonginstanceswithastatisticalsignificantdifference,thePCART methodsoutperformtheirfulltablecounterparts.

TheseresultsonrealdatafurtherimplythatthePCARTvariantsareeffectivenotonlyforcategoricalvariables,butalso inthepresenceofmixed,orcompletelycontinuousdata.

5.5. Predictive performance

Inall previous studies,weviewedthe graphstructureasthelearning target. Anothercommonlyconsidered target for learning isthejointprobabilitydistribution.Onecan evaluatealearneddistribution,e.g.,by computingthe cross-entropy tothegroundtruthdistributionor,inabsenceofgroundtruth,byapproximatingthecross-entropybytheaveragelogloss inacross-validationexperiment.

Wefollowedthisapproachforall the52UCI datasetsandfourlocalmodels.Foreachcombinationwecomputedthe averagelogloss,calledhenceforth

log loss

forshort,byfirstfindingaglobally optimalmodelstructure,includingthelocal structure,inrelationtothetrainingdata,andthentakingthelosswithrespecttotheBayesianposteriorpredictivedensity ofeach testdatapoint.Allloglossvalues,alongwithstandarderrors,areshowninAppendixB.Table2showsawin–tie– losssummary basedonoverlapping errorbarsandTableB.6 showsan alternativebased onatwo-sidedWilcoxon-signed ranktest [35].Incontrasttothesummarystatisticsfortheintersection-validationstudy,werehereobservethesignificance testtoproducelesstiesthantheoverlappingerrorbarcriterion,whichcanbeexplainedbytherelativelylargesamplesize of50repetitions.

Inboth caseswe seethatforpredictionit makes virtuallyno differencewhetherasparsity-favoring structure prioris used ornot: thepenalized and unpenalizedversionsofPCART andFTperform nearly identically.Models that are overly complex accordingto SHDcan still host the trueprobability distribution, andestimatingit from datais effectiveunless theirCPDsaresolargethatmassiveparameteroverfittingoccurs,whichisunlikelyevenformodelslearnedwithFT.

Aside from this aspect, we again observe a slight advantage of PCART compared to full tables, albeit in manycases bothtypesoflocalmodelsperformequallywell.Thisisnotsurprisingasithasbeenobservedthatmeasuringthelearned distributionshowslessdifferencesamongstructurelearningmethodsthanmeasuringthestructuralfeatures [34].

5.6. Hyperparameter sensitivity analysis

AllofthepreviousstudiesusedafixedsetofhyperparametersfortheBayesianpriors.While

α

=

1

/

2(Jeffreysprior)and

¯

μ

=

0 (meanofastandardnormal)canbeconceptuallymotivated,thechoiceof

a

=

ν

=

λ

=

1 wasbasedonsimplicityand decentperformanceinpreliminarytests.Wenowstudyhowvaryingeachofthesethreeparameters(whilekeepingthetwo othersfixedattheir defaultvalue)affectsthepredictiveperformance underthefourlocalmodels,both inabsoluteterms andrelativetoeachother.

The results forthe UCI data set concrete-compressive-strength are shownin Fig.6, whereas the resultsfor 14additional randomly chosen UCI datasets areshown inAppendixC.We findthat varying thehyperparametermayindeed havean effect.Ourdefaultvaluesaregenerallyagoodchoice,albeittheycouldbeimprovedinsomecases.

(14)

Fig. 6.HyperparametersensitivityanalysisforoneUCIdataset.Eachplotvariesonehyperparameter(x-axis)whilekeepingtheothertwofixed,showingthe effecttopredictiveperformanceasmeasuredbylogloss(y-axis).Dashedverticallineindicatesthedefaultvalue.Notethatforthisdatasetthedifference betweenthepenalizedandunpenalizedalgorithmvariantsisextremelysmall.

Fig. 7.Running time in seconds of Step 1 vs Step 2 under the four local models. See text for a description of the included data sets.

The more importantobservation, however,is that thedecision whetherPCART(p)orFT(p) issuperior fora particular datasetdoesnotheavilydependontheparticularchoiceofhyperparametervalues,exceptforthecaseofverylargevalues (suchas

λ

=

103 inFig.6)whereallmodelsperformpoorly.

5.7. Running times

Last,weinvestigatedhowthechoiceofthelocalmodelaffectsthetime requirementsforthelocalscorecomputations (Step 1) and fortheDAG optimization (Step 2,by GOBNILP). Fig. 7showsthe measured running timesforthe datasets we sampledfromthethreebenchmarknetworks(Section5.3) andfortherealUCIdatasetswe includedinthestructure recoverystudy(Section5.4).

Weseethatunderallfourmodels,Step 1tendstoconsumemoretimethanStep 2.Thisobservationhighlightsthefact that

GOBNILP

solves the DAG optimizationstep faston typicalproblem instances,andthus the localscore computations can bea bottleneck,evenunderthestandard CPTmodel(e.g.,FTandFTp oncategoricaldatageneratedfrombenchmark networks).Nevertheless,therearealsodatasetsforwhichStep 1isfasterthanStep 2oralmostequallyfast;importantly, this holds alsounder PCARTandPCARTp. This observation, inturn, highlightsthe fact that DAG optimizationis hard in general, andthereforeonecanaffordcomputationallydemandinglocalscores.Itisworthnotingthatthetimecomplexity ofthelocalscorecomputationsgrowsonlypolynomiallyinthenumberofnodes

n

,assumingaconstantmaximumindegree, whereas thecomplexity oftheDAGoptimizationstepissuspectedtogrowsuperpolynomially, thecurrentbestwort-case boundbeingexponentialin

n

.

5.8. Studies with ordinal variables

Inordertobenchmarkthealgorithmvariantsthatcan handleordinaldata,wesearchedforordinalvariablesinthe52 UCI datasetsfromtheprevious sections.In ordertoavoidadditionaldiscretization,we onlyconsidereddiscrete variables with lessthan 16 possible values.As a consequence,we treat an attribute such as“Age” (in years), which is ordinal in principle, ascontinuous variable nevertheless. From all datasets, we chose sixdata sets thathave atleastthree ordinal variables;seeAppendixD.1fordetails.

ForcomparingtheordinalPCARTvariants(OPCARTandOPCARTp)withtheordinalfulltablemodels(OFTandOFTp),we repeatedtheintersection-validationstudyfromSection5.4andthepredictionstudyfromSection5.5anddisplaytheresults inAppendixD.2andAppendixD.3(TableD.7).WeobservethatthePCARTvariantscontinuetoperformbetterthanthefull table variantsonaverage.Hence,thetreatmentofordinalvariablesassuch,insteadofbeingmodeledaseithercontinuous orcategorical,doesnotchangetherelativeperformanceofPCART-basedmodelsinrelationtotheirfulltablecounterparts.

(15)

Fig. 8.The effect of including ordinal variables on structure learning measured by the intersection-validation method.

Nevertheless,fromtheseresultsitisyetunclearwhethertheextension toordinalvariablesispractically useful.Hence, we investigatedhowwell theordinalvariants ofouralgorithmperforminrelationto thenon-ordinal variantsstudiedin theprevioussections.Whilethepredictiveprobabilitiesanddensitiesarenotcomparableanylongerassoonasavariable changesitstype,theinterpretationofthelearnedDAGsremainsthesame,allowingafaircomparisonamongthemethods. We thus applied the intersection-validation method similar to Section 5.4 and compared PCART, PCARTp, OPCART, and OPCARTp(Fig.8).Weobservethattheordinalvariantsyieldalmostneverasubstantiallyworse partialHammingdistance compared to their non-ordinal counterparts, and often there is even a slight improvement. This demonstrates that the extensionofouralgorithmstoordinalvariablespaysoffindeed.

6. Discussion

ThisworkrevisitedtheideaofincorporatingdecisiontreesaslocalmodelsinBayesiannetworks.Specifically,we intro-duceda newmodelclass,PCART,which canhandleboth discrete andcontinuous predictorandresponse variables,takes a Bayesianapproach,andadmitscomputationallyfeasible globaloptimizationinarangeofproblemsizes. Wealso intro-ducedananalogousfull-tablevariant,FT,whichextendsthestandardCPTofcategoricalvariablestocontinuousandordinal responseandpredictorvariables.

Empiricalcomparisonofthetree-structuredPCARTandthetabularFTshoweda clearadvantage ofPCARTinstructure learningandaslightadvantageindensityestimation.Whiletheresultswere obtainedmainlyunderafixed,globalchoice forthevaluesofthehyperparameters,theresultswereobservedtobe robusttochangesofthesevalues.Thatbeingsaid, thesensitivityanalysisalsosuggeststhattheperformanceofbothPCARTandFTcouldbefurtherenhancedbychoosingthe hyperparametervaluesseparatelyforeachgivendatasetandeachvariable,e.g.,bymeansofempiricalBayes.

Inourexperimentswesettheparametersthatcontrolthecomputationalcomplexityofthemodel(maximumindegree, numberof splits)so that thecomputations ofthelocal scores (Step 1)could be completed within a few hours foreach dataset.Thiswasmotivatedbythefactthatforsomeofthedatasets,thestate-of-the-artsolversrequirehourstofindan optimalDAG (Step 2).Thereare two simplewaysto expediteStep 1,ifneeded:by settingthe complexityparameters to lowervalues,andbyrunningthecomputationsfordifferentparentsetsinparallel.Anopenquestioniswhetheronecould gainfurtherreductionsinthe

References

Related documents