Learning Bayesian networks with local structure, mixed variables, and exact algorithms

(1)

Contents lists available atScienceDirect

International

Journal

of

Approximate

Reasoning

www.elsevier.com/locate/ijar

Learning

Bayesian

networks

with

local

structure,

mixed

variables,

and

exact

algorithms

Topi Talvitie

a

,

Ralf Eggeling

b

,

Mikko Koivisto

a,

∗

a_Department_of_Computer_Science,_University_of_Helsinki,_Finland

b_Department_of_Computer_Science,_University_of_Tübingen,_Germany

a

r

t

i

c

l

e

i

n

f

o

a

b

s

t

r

a

c

t

Articlehistory: Received16March2019

Receivedinrevisedform11July2019 Accepted8September2019 Availableonline16September2019 Keywords:

Bayesiannetworks Decisiontrees Exactalgorithms Structurelearning

Modern exact algorithms for structure learning in Bayesian networks first computean exact local score of every candidate parent set, and then find anetwork structure by combinatorial optimization so as to maximizethe global score. Thisapproach assumes thateachlocalscorecanbecomputedfast, whichcanbeproblematicwhenthescarcity of the data calls for structured local models or when there are both continuous and discretevariables,forthesecaseshavelackedefficient-to-computelocalscores.Toaddress thischallenge, weintroducea local scorethat is basedon aclassofclassification and regression trees.We showthat under modestrestrictions onthepossible branchings in the tree structure, it is feasible to finda structure that maximizes aBayes score in a rangeofmoderate-sizeprobleminstances.Inparticular, thisenablesglobal optimization oftheBayesiannetworkstructure,includingthelocalstructure.Inaddition,weintroduce arelatedmodel classthat extendsordinary conditionalprobability tables tocontinuous variables by employing anadaptive discretization approach. The two model classes are compared empirically by learning Bayesian networks from benchmark real-world and syntheticdatasets.Wediscusstherelativestrengthsofthemodelclassesintermsoftheir structurelearningcapability,predictiveperformance,andrunningtime.

1. Introduction

Bayesiannetworks(BNs)composeamultivariatedistributionasaproductofunivariateconditionalprobability distribu-tions(CPDs). ThepotentialcomplexityofthelocalCPDs,togetherwiththeglobalacyclicityconstraintoftheBNstructure, make the task oflearning BNs from datachallenging in manyways. Evenif we assume that the available data contain neitherunobservedvariablesnormissingentries—likewewilldothroughoutthisarticle—thereoftenremainthefollowing threeconcerns:

(i) Statistical eﬃciency:thedatamaybescarce,renderingsimpletabularparameterizationsoftheCPDs,suchasthe

condi-tional probability tables (CPTs),statisticallyineﬃcient.

(ii) Heterogeneous variables:so-called

hybrid BNs

containbothdiscreteandcontinuousvariables,againrulingout themost convenientandstandardparameterizationsofCPDs.

*

Correspondingauthor.

E-mailaddress:mikko.koivisto@helsinki.ﬁ(M. Koivisto). https://doi.org/10.1016/j.ijar.2019.09.002

(2)

(iii) Computational eﬃciency: the mostnaturalformulationsof thelearning problemare computationally hard. Thislimits boththedimensionsofthedataandthecomplexityoftheCPDsinrelationtowhatcanbesolvedunderusefulquality guarantees.

Each oftheseissues (i–iii) hasbeenaddressed inthe literature,aswe will describe inthe next paragraphs.The present authorsare,however,notawareofanypriorworkthatwouldsimultaneouslyaddressallthethreeconcerns.

To addresstheissue ofstatisticalefficiency,one canreplace tabularCPDs bymore sophisticatedstructureswherethe effectivenumberofparametersadapts totheamountandcomplexityoftheavailable data.Thenotionofcontext-specific independence[1] offersaparticularlyappealingwaytoachievethis:givencertainconfigurationforasubsetofthe

predictor

variables, the

response

variable isindependentof theremaining predictors.Forstructure learningin BNs,context-speciﬁc independences have been represented with decisiontrees [2] and with the somewhat more expressive decision graphs [3]. Both model classesare non-parametric inthe sense their numberof parameters can grow large, letting the models approximateanyCPD(infact,matchCPTsfordiscretevariables overﬁnitedomains). Thesestudieswere,however,limited todiscretevariablesandgreedyorlocalsearchalgorithms,thusnotaddressingissues(ii)and(iii).Alsoexactalgorithmsto partition tojointstate spaceofdiscretepredictorshavebeenproposed [4],butthesealgorithms becomecomputationally infeasiblealreadywith,say,threepredictorsthattakefourstateseach.

ForhybridBNs,therearetwomainapproaches.Oneistodiscretizeallcontinuousvariables.Inpractice,thismeansthat separatelyforeachcontinuousvariable,itsrangeispartitionedintointervals,i.e.,bins,whicharethentreatedasthestates ofthediscretized variable.Then itiscommontoassignCPTs forthediscretizedvariables,evenifthisignores thenatural orderingofthestates.Simplepreprocessingroutinesofthiskindarecomputationallyconvenient,buttheyarepronetolose information in anuncontrolled way.Andif, asaremedy, oneincreasesthe numberofbins,then thestatisticalefficiency becomes a concern.Thereare alsomoreinvolvedschemes,whichcouplemulti-variablediscretization withgreedysearch forahigh-scoring networkstructure[5];theseschemespotentially yieldbetterdiscretizations,buttheyappeartorequire search heuristicsthatcannotguaranteeoptimalityofthefoundsolution.AnotherapproachistoemployCPDsthancan ac-commodatebothdiscreteandcontinuousvariables,suchasconditionallinearGaussianmodelsandtheirgeneralizationsto discreteresponsesviaappropriatelinkfunctions[6,Ch. 5].Thesemodelsare,however,nonsatisfactoryconcerningstatistical efficiency: onecan arguethat they overparameterizeby introducingseparate linearmodel foreveryconfiguration of dis-cretepredictors,whiletheyalsounderparameterizebyignoringnonlinearinteractionsofcontinuouspredictors.Mixturesof truncatedexponentials(MTEs)offeramoresophisticatedmodel,inwhichadecisiontreepartitionsthespaceofpredictor variablesintoregions,ineachofwhichtheresponseisassignedamixturedistribution[7].Unfortunately,nofastalgorithms tohandleMTEsofneededsizesareknown,andconsequentlyonehastoresorttoproceduresthatdonotofferperformance guarantees.

Thehope forcomputationalefficiencyhastobeinterpretedinrelationtotheNP-hardnessoftheBNstructurelearning problem[8]:arealistic goalistosolvetypicalmoderate-sizeinstancesinpracticewithoutsacrificingtheoptimalityofthe solution, rather than aiming at worst-case polynomial time complexity. Nowadays, we have several exact algorithms, or solvers, that achievethisgoal,especially ifchoosingthebests solverforeach givenprobleminstance [9]. Whiledifferent solvers employdifferentalgorithmictechniques,suchasdynamicprogramming[10–13],A*search[14],integerlinear pro-gramming[15],andconstraintprogramming[16],theyallproceedintwosteps:First,thelocalscoresarecomputedforall nodesandtheir possibleparent sets,oftensettinganupperbound forthenumberofparents.Second, aglobally optimal network structure is found by combinatorialsearch. One challenge in applyingthesesolvers is that the numberof local scores computedinthefirststep cangrowlarge forlarger networks.Partly forthisreason,thesolvers havebeenmostly studiedinsettingswherethelocalscoresareofsomeoftheaforementionedsimpleforms,enablingfastevaluation.Ithas been observed that exact algorithms, when applicable, not only remove theuncertainty on the quality, butalso tend to performbetterinstructurelearninganddensityestimation[17].

Inthisarticle,weaddressallthethreeissues(i–iii)simultaneouslybyputtingforwardalocalmodelbasedon classifica-tionandregressiontrees(CARTs)[18].WederiveaBayesscoremostlyfollowingtheworksonBayesianCART[19–21].The main challenge isthat learning CART models iscomputationally intractable; theexisting algorithms are basedon greedy andlocalsearchandcannotofferusefulqualityguarantees.Here,twoobservationscometorescue.First,intheBNlearning context thenumber ofpredictorscan beassumedto be relativelysmallinpractice.This reducesthenumberof possible CART structures, i.e., decision trees,butis alone not sufficient formaking thesearch computationallytractable. The sec-ond observationis that theso-called dyadic CART, inwhich therecursive splits ofthe rangesof continuous variablesare always atthemidpoint, essentially inheritsthestatistical powerof CARTandyet enablessignificantly faster global opti-mization [22–25]. Buildingon theseobservations,we introduce the

partition–dyadic CART (PCART)

model; here“partition” refersto(recursive)splittingofthestatespacesofcategoricalvariables—tothebestofourknowledge,categoricalpredictors havenotbeenallowedinpreviousworksondyadicCART.

Whenallvariablesarecategorical,onecancomparetheperformanceofPCARTagainststandardCPTsinastraightforward manner. TocomparePCARTagainstanalogoustabular modelsalsointhepresenceofcontinuousvariables,we introducea full-table (FT) variant ofPCART:it discretizes the involvedcontinuous predictors (but not theresponse). While FTadapts thenumberofstatespervariabletothedataathand,itrendersglobaloptimizationcomputationallyfeasible,unlikesome aforementioneddiscretizationschemes[5].

(3)

Inadditiontothesemethodologicaldevelopments,westudyempiricallythefollowingtwoquestions:DoesPCARTboost the statisticalefficiency of structure learning anddensity estimation inpractice, that is, on benchmarkdata sets? What is the price to be paid in terms of computational efficiency, i.e., which kind of problem instances, in terms of model complexityanddatadimensions,canbesolvedinareasonabletime?Onecouldhypothesizethat,thankstoaparsimonious parameterization,PCART enablesmorepowerfuldetectionof arcsandmore robustparameter inference atthecost ofan only modest increase in running time due to efficientPCART optimization algorithms. We will show how the empirical resultssupportthishypothesis.

Partsofthisworkhavebeenpublishedinapreliminaryforminconference proceedings[26].Whiletheconferencepaper onlyconsideredcategoricalandcontinuousvariables,thepresentarticlealsoconsidersordinalvariables—thisamountstoa majorextensiontoboththetheoreticalandempiricalcontributions.Wealsorevisethegeneralexpositionandsubstantially extendtheexperimentalstudy;speciﬁcally,thestudiesonpredictiveperformance andhyperparametersensitivityarenew additions. Furthermore, we now also make our implementation of the presented algorithms for computing local scores underthePCARTmodelpubliclyavailable.1

Theremainderofthisarticleisorganizedasfollows.Section2reviewsthenecessarypreliminaryconceptsandnotation regardingBNs, CARTs,andtheBayesianapproach tolearning. We introducethePCART modelinSection 3andourexact algorithms for learning a PCART from data inSection 4.Section 5is devoted to empirical studies. Last, we concludein Section6.Foreaseofreading,somedetailedmaterialispostponedtotheappendix.

2. Preliminaries

ThissectionreviewssomebasicsofBNsanddecisiontrees,includingtheBayesianapproachtolearningthemfromdata. We will focus on the key terminology andnotation neededin later sections; for a more comprehensivediscussion, the readeris referred to the literature on graphicalmodels [6, Chs. 3, 5, and 18] and score-basedlearning of decisiontrees [19,21].

2.1. Bayesian networks

Consider a multivariateprobability distribution p

(

X1

,

X

2

,

. . . ,

X

n

),

whereeach variable Xv takesvaluesfromsome set

Sv.Wewrite V fortheindexset

{

1,2,

. . . ,

n

}

andindexbysubsets:e.g.,ifP

⊆

V,wewrite XP forthetuple

(

Xv

)

v∈P and

SP fortheCartesianproduct

×

v∈PSv;herewefollowtheconventionthattheindicesaretakeninincreasingorder.

Wesaythatthedistribution

p

andadirectedacyclicgraph(DAG)

G

=

(

V

,

A

)

formaBNifthejointdistributionfactorizes into a product ofthe CPDs p

(

Xv

|

XGv

),

where Gv

= {

u

:

(

u

,

v

)

∈

A

}

is the set of

parents

of v in G. When referring to a

particularCPD,wecall Xv the

response

variableandeach Xu,with

u

∈

Gv,a

predictor

variable.

There are various ways to parameterize a CPD p

(

Xv

|

XP

),

with P

⊆

V

\ {

v

}

.Here andhenceforth we may indexthe

predictorvariablessimplyby

P

whenthereisnoneedtoconsidertheDAGon

V

.AmongthesimplestonesareCPTs,which assigna distinctparameterforeachjointconﬁgurationoftheinvolvedvariables,assumingtheir statespacesare ﬁnite.In whatfollows,wewillfocusonmoresophisticatedCARTmodels,whichgeneralizethetabularCPTs.

2.2. Classiﬁcation and regression trees

InCARTmodels,thestructure isrepresentedasa decisiontree.Adecisiontreeoveramultidimensionalspace SP isa

recursive, axis-parallel,binary partition ofthespace. Moreformally,let T be a rootedbinary tree whereeach node R is a subsetof SP oftheform

×

u∈PRu.Wecall

T

a

decision tree

of SP ifits rootis SP andevery inner node R anditstwo

children Rand R differinexactlyonedimension

u

∈

P forwhich

{

R_u

,

R

u

}

isapartitionof Ru.The“decision”associated

withnode R iswhetherthevariable Xu belongsto Ru or Ru.Consequently,theleavesof T partition SP.Adecisiontree

Tv of SP andaCPD p

(

Xv

|

XP

)

are

compatible

witheachother if, forall leaves R,we havethat Xv isindependentof XP

conditionallyontheeventXP

∈

R.Inotherwords,theCPDisconstantovereachleaf Randpiecewiseconstantover SP.

Inordertoparameterizethedistributionoftheresponsevariable Xv ineachleaf,wedistinguishbetweenadiscreteand

acontinuousdistribution.Inthiswork,wefocusontwostandardcases:

Categorical distribution: IfSv isﬁnite,welet

θ

x betheprobabilitythat Xv

=

x,foreach

x

∈

Sv.

Normal distribution: IfSv isthesetofreals, Xvisnormallydistributedwithmean

μ

andvariance

σ

2.

InthecontextofBNswhereeachvariable Xv appearsastheresponse,we understandthattheparametersaredistinct for

all v andallleavesofTv,andomitemphasizingthisinthenotation.Dependingonwhethertheresponseiscategoricalor

continuouswecallthedecisiontreeequippedwiththeCPDa

classiﬁcation tree

ora

regression tree

,generallya

CART

. Whatismaybenotimmediatefromtheabovedescriptionisthatwealsomodelordinalresponsesbycategorical distri-butions:weenforcetheparameters

θ

x beequalforsomesubsequentvalues

x

.WegivedetailsinSection2.5.

(4)

2.3. Bayesian structure learning

To learn the model structure fromdata, we take a Bayesian score-based approach.We only note in passing that the derivationsofSections3and4alsoapplytootherscoringschemes,especiallypenalizedmaximum-likelihoodscores.

Considerasetofdatapoints

X

i

=

(

_Xi

1

,

X

2i

,

. . . ,

X

ni

),

i

=

1,2,

. . . ,

N

.Wemodelthedatapointsasindependentdrawsfrom

adistribution pwhosestructure,namelyaDAG

G

=

(

Gv

)

v∈V withadecisiontreeTvof SGv foreach

v

∈

V,areunknown;

theparametersoftheCPDsareunknownaswell.Wewilldescribeamodularpriorofthestructureandtheparametersand obtainthescoreastheposteriorprobabilityofthestructure,bymarginalizingouttheparameters.

Consider adecisiontree

T

v ofSP;inourcontext P

=

Gv.Foreachleafof Tv,weassignstandard conjugatepriorsfor

theparametersofthecategoricalandnormaldistributions:

Dirichlet: Let

θ

∼

Dirichlet(

α

),

where

α

=

(

α

x

)

x∈Sv arehyperparameters.

Normal-inverse-gamma: Let

σ

2

∼

_IG(

ν

/2,

ν

λ/2)

_and

μ

|

σ

2

∼

_N(

μ

¯

,

σ

2

/

_a

),

_where

ν

_,

λ,

μ

¯

_,_and

a

_are_{hyperparameters.}

When an ordinal response is modeled by a categorical distribution, a Dirichlet prior is, however, only applied after the valuesarepartitionedintosomenumberofbins;wegivedetailsinSection2.5.

While we could set thehyperparameters separately foreach response variable Xv, inour experiments we will focus

on the casewhereall are setto constant values;see Section 5.1forthe defaultvalueswe used inour experimentsand Section5.6forasensitivityanalysis.

By letting theparameters of all leavesof Tv be independent,we obtaincomputationally convenient formulasfor the

marginallikelihoodfunction.Foraleaf R,let IR

= {

i

:

XiP

∈

R

}

betheindexsetofthedatapointsthatbelongto R.If Xv is

categorical,weobtainthemarginallikelihoodof

T

v as

v

(

Tv

)

=

Rleaf ofTv

(

_x

α

x

)

(

NR

+

x

α

x

)

x

(

NRx

+

α

x

)

(

α

x

)

,

where NR

= |

IR

|

and

N

Rx

= |{

i

∈

IR

:

Xiv

=

x

}|

.If Xv iscontinuous,weget

v

(

Tv

)

=

Rleaf ofTv

π

−NR/2

(λ

ν

)

ν/2

√

a

√

NR

+

a

((

NR

+

ν

)/

2

)

(

ν

/

2

)

(

sR

+

tR

+

ν

λ)

−(NR+ν)/2

,

where sR

=

(

NR

−

1)

σ

R2,tR

=

NRa

(

μ

R

− ¯

μ

)

2

/(

NR

+

a

),

and

μ

R isthe sample mean and

σ

R2 the sample variance of the

NR values Xiv with

i

∈

IR [21,Eq. 14].Foran ordinal Xv we derivethelikelihood functionanddiscussitscomputational

complexityinSection2.5.

Furthermore, we let the parameters be independent for each v, so that the marginal likelihood of the structure

(

Gv

,

Tv

)

v∈V factorizesintoaproductofthelocalmarginallikelihoods

v

(

Tv

).

Wealsoassignamodularstructureprior

p

(

Gv

,

Tv

)

v∈V

∝

v∈V

π

v

(

Gv

,

Tv

) ,

whereeach

π

v issome(unnormalized)distribution.Intheliterature,variouspriorshavebeenproposedbothforDAGs(see,

e.g.,AngelopoulosandCussens [27] andreferencestherein)andfordecisiontrees[19,21].Ourchoicesforthepriorsdepend ontherestrictionswemaketotheconsidereddecisiontrees,andwillbedescribedinSection 3.

From analgorithmsstandpoint,thestructurelearningproblemnowbecomes thatofﬁndingastructure

(

Gv

,

Tv

)

v∈V so

astomaximizetheposteriorprobability

constant

×

v∈V

π

v

(

Gv

,

Tv

)

v

(

Tv

) .

(1)

Astheunspeciﬁed(normalizing)constantisthesameforallstructures,itcanbeignored. 2.4. Posterior predictive distribution

Givenasetof

N

datapoints X(N)

:=

(

_Xi

)

N

i=1,thedistributionforanewdatapoint XN+1 iscalledtheposteriorpredictive

distribution.We willconsidera settingwherewe ﬁrstlearnamodelstructure M

ˆ

:=

(

G

ˆ

v

,

T

ˆ

v

)

v∈V from X(N),andthenthe

distribution of XN+1 is obtained conditionally on both the learned structure andthe N data points. The distribution of interestisthus

p

(

XN+1

|

_X(N)

,

_M

ˆ

)

=

_p

(

_X(N+1)

| ˆ

_M

)/

_p

(

_X(N)

| ˆ

_M

),

_and_{consequently,}_the_mixed_{density–mass}_function_of _XN+1 _is

obtainedastheratioofthemarginallikelihoodsofM

ˆ

for

X

(N+1)_and_X(N)_._This,_in_turn,_amounts_to_a_product_{variable-wise} ratiosover

v

∈

V.

Considertheratioforasingle

v

∈

V.Write y forthevalueofXN+1

v andRfortheleafofT

ˆ

vthatmatchesthevaluesof

X_GN+1

(5)

(

NR

+

x

α

x

)

(

NR

+

1

+

x

α

x

)

(

NR y

+

1

+

α

y

)

(

NR y

+

α

y

)

=

NR y

+

α

y NR

+

x

α

x

.

If the variable Xv is continuous, we get the densityfunction of a non-standardized Student’st-distribution2 with

ν

degreesof freedom, location parameter

μ

¯

, andscale parameter

λ

(

a

+

1)/a, where

ν

,

μ

¯

,

λ

,and

a

are theposterior counterpartsofthehyperparameters

ν

,

μ

¯

,

λ,

and

a

,obtainedby

a

:=

a

+

NR

,

μ

¯

:=

(

a

μ

¯

+

NR

μ

R

)/

a

,

ν

:=

ν

+

NR

,

λ

:=

sR

+

tR

+

ν

λ

/(

a

ν

) ,

with sR andtR as deﬁnedbefore; for a derivation (up to some differences in the parameterization), see,e.g., Murphy’s

notes [28,Sections 5.5and6.5].

Weconsiderthecaseofanordinal Xv inthenextsubsection.

2.5. Extension to ordinal response variables

Theprevioussectionsfocusedonthecasesofcategoricalandcontinuousresponsevariables.Wenowextendthemodel to alsoaccommodate ordinalresponse variables.Tothisend, we introducea Bayesian histogramwherethebinlocations andwidthsaretreatedasauxilliaryparametersand“integratedaway”inthemarginallikelihoodforeachleaf Rinagiven decisiontree

T

v;bymultiplyingtheseleafscoreswegetthelikelihood

v

(

Tv

)

asbefore.

Consider aresponsevariable Xv.Weassume that Xv takesa ﬁnitenumberofdifferentvalues, Sv

= {

1,2,

. . . ,

k

}

,with

someinteger

k

.Similarlytoacategoricalresponse,welet

θ

x denotetheprobabilitythat Xv takesvalue

x

.However,instead

ofassigning the vector

θ

a Dirichletdistribution, we assigna prior that constrains the parameters

θ

x be equal forsome

consecutivevalues x. More precisely, givena segmentation B

= {

B1

,

B2

,

. . . ,

B

b

}

of Sv, that is, a partition of Sv intosets

ofconsecutivenumbers,called

bins

,welet

θ

x

:=

β

j

/

|

Bj

|

forall x

∈

Bj.Now,welet

β

∼

Dirichlet(

α

(1)

,

α

(2)

,

. . . ,

α

(b)

)

with

some

α

(j)_._Observe_that_the_parameters

θ

x sumupto1,asrequired.

Tocompletethemodel,weletthepriorprobability ofa segmentation B with

b

binsbeproportionalto

κ

b_,_with_some

κ

>

0,andlet

α

(j)

:=

x∈Bj

α

x.Thusthemodelhas

k

+

1 hyperparameters:

α

1

,

α

2

,

. . . ,

α

k,and

κ

.

Proposition1.

The marginal likelihood of a decision tree T

vis given by

v

(

Tv

)

=

R leaf of Tv wv

(

R

) ,

with wv

(

R

)

:=

B segm. of Sv

κ

b−1

(

1

+

κ

)

k−1

(

_j

α

(j)

)

(

NR

+

j

α

(j)

)

×

b

j=1

(

N(_Rj)

+

α

(j)

)

(

α

(j)

)

N(_Rj)

(

NRx

)

x∈Bj

|

Bj

|

−N (j) R

,

where N(Rj)

:=

x∈BjNRxis the total number of data points in Bj.

Proof. Theprobabilityofthedata

(

NR1

,

NR2

,

. . . ,

N

Rk

)

givenasegmentation B

= {

B1

,

B

2

,

. . . ,

B

b

}

canbewrittenas

NR NR1

,

NR2

, . . . ,

NRk

p

(β

|

B

)

b

j=1

β

j

|

Bj

|

N(_Rj) d

β ,

where

p

(β

|

B

)

isthedensityfunctionoftheDirichletdistribution.Bywriting

NR NR1

,

NR2

, . . . ,

NRk

=

NR N_R(1)

,

N_R(2)

, . . . ,

N(_Rb)

b j=1

N(_Rj)

(

NRx

)

x∈Bj

,

takingtheterms

|

Bj

|

−N (j)

R _out_of_the_integral,_and_using_the_known_closed-form_expression _for_the_remaining _integral,_we

getthesummandin

w

v

(

R

),

withoutthefactor

κ

b−1

/(1

+

κ

)

k−1,whichisthepriorprobabilityof

B

;theunnormalizedprior

probabilities

κ

b _sum_up_to

κ

(1

+

κ

)

k−1_,_since_there_are_exactly

k−1 b−1

segmentationswith

b

nonemptybins.

2

Proposition2.

Given the counts N

R1

,

NR2

,

. . . ,

NRk, the term wv

(

R

)

can be computed in time O

(

k2logNR

)

.

2 _The_density_function_f

(y)ofthenon-standardizedStudent’st-distributionwithddegreesoffreedom,locationparameterm,andscaleparameters,is givenby f(y)= (d+1)/2 (d/2) πds−1/21+(y−m)2 ds −(d+1)/2

(6)

Fig. 1.An example of a PCART model. It represents the conditional probability densityp(X|Y,Z), whereX∈(0,10),Y∈(0,100), andZ∈ {a,b,c,d}. Proof. Since

_j

α

(j)_equals

α

+

:=

x

α

x forall segmentationsB,theterm

κ

−1

(1

+

κ

)

1−k

(

α

+

)/(

NR

+

α

+

)

inthe sum-manddoesnotdependon

B

andcanbefactoredout.Theremainderofthesummandfactorizesintoaproductof

b

terms, one for each Bj. In total, there are

k+1 2

different terms, one for each bin

[

s

,

t

]

:= {

s

,

s

+

1,

. . . ,

t

}

with1

≤

s

,

t

≤

k. To showhowthesetermscanbeprecomputedwithintheclaimedtime,denoteby

N

R(s,t)and

α

(s,t)thecorrespondingnumbers

N(_Rj)and

α

(j)_,_i.e.,_when _B

j

= [

s

,

t

]

.Observeﬁrstthat thenumbers NR(s,t)and

α

(s,t) aresimplesumsthatcanbecomputed

incrementally, i.e., by addingNRt and

α

t tothe alreadycomputedNR(s,t−1) and

α

(s,t−1).Thus, assumingthe valuesofthe

gammafunctionhavebeenprecomputed,theirratiocanbecomputedinconstanttime perbin.Similarly,themultinomial coeﬃcient is updated bymultiplying it bythe binomialcoeﬃcient

N(Rs,t)

NRt

,which we assumeto be accessibleinconstant time (e.g., viathree evaluationsofthe gammafunction). Finally,the binlength t

−

s

+

1 israised to theexponent N(_Rs,t) by repeatedsquaringintime O

logN_R(s,t)

.Let f

(

s

,

t

)

denotetheprecomputed factors,1

≤

s

,

t

≤

k. Considerthe function g

(

t

)

deﬁnedby g

(0)

:=

1 and g

(

t

)

:=

t_s₌₁g

(

s

−

1)f

(

s

,

t

)

for1

≤

t

≤

k.Wehavethat

w

v

(

R

)

=

g

(

k

)

(

α

+

)

(

NR

+

α

+

).

It remainstoobservethat

g

(

k

)

canbeevaluatedintime

O

(

k2

).

2

Wecanreducetherunningtimeto O

(

k2

)

_by_precomputing_the_powers

l

m_for_all₁

≤

_l

≤

_k_and₁

≤

_m

≤

_N_._Even_thought

the precomputation takestime andspace ororder kN, thiscan be advantageous when wv

(

R

)

iscomputed formany v

andR.

Fortheprobability massfunctionoftheposteriorpredictive distribution,we arenot awareany“closed-formformula,” butthefollowingresultisimmediateandsuﬃcientforthecomputationalpurposes:

Proposition3.

Given a decision tree

Tv of Sv and N data points, the response equals y in a new data point with probability

w_v

(

R

)/

wv

(

R

)

, where R is the unique leaf of Tvthat matches the new data point, wv

(

R

)

is as deﬁned in Proposition1, and wv

(

R

)

is

the same when NR yis replaced by NR y

+

1. 3. Partition–dyadicCART

Partition–dyadic CART (PCART)is a subclass ofthe CART model, which we obtain by restricting the wayin which the rangeofacontinuouspredictorvariableisrecursivelypartitionedbythedecisiontree.Speciﬁcally,weassumetherangeis an interval andonlyallowhalvingit initsmidpoint,whence thenamedyadicsplits. Forthenumberofrecursivedyadic splitspervariablewesetanupperbound,

the maximum number of splits

,denotedby

s

.Incontrast,foracategoricalpredictor weallowarbitrary(binary)partitioning.Fig.1showsanexampleofaPCARTmodel.

PCART generalizes dyadic decision trees [23], which assume all predictors be continuous (or binary) and the response be categorical. The motivation fordyadic splits stems from their ability to guarantee both statistical andcomputational eﬃciency:dyadicdecisiontreesenablecomputationallyfeasibleglobaloptimizationondatasetswithuptoaroundadozen ofpredictorvariables,andtheyachieve(inaminimaxsense)nearlyoptimalratesofconvergenceforarangeofclassiﬁcation problems[23,24].

3.1. Structure prior

Havingﬁxedthefamilyofdecisiontrees,we proceedtospecifyapriordistributionoverthefamily.Toconformto the modularstructureofthelikelihoodfunction,weconstructapriorwheretheprobability

p

(

T

)

ofatree

T

factorizesoverthe leavesof

T

.Thisyieldsa decomposablescoringfunction,whichpropertyiscrucialfortheoptimizationalgorithmwegive inthenextsection.

We consider priors that assign p

(

T

)

∝

γ

l(T)_, _where_l

(

_T

)

_is _the _number _of _leaves _of _T _and ₀

<

γ

≤

_{1 is} _a _constant independent ofT.Importantly, wedonotset

γ

toanyﬁxedvalue,butwe insteadlet

γ

dependon thenumberofsplits possibleineachinnernode.Withoutsuchdependency,thepriorcoulddistributenearlyallprobabilitytoverylargedecision trees,astheirnumberoverwhelmsthenumberofsmalltrees.

(7)

Moreprecisely,weset

γ

=

1/(4C

),

where

C

isthetotalnumberofpossiblesplitsataninnernode(ignoringsplitsmade inotherinner nodes).Wehavethat

C

isasumwhereeach continuouspredictorcontributes1,eachcategoricalpredictor with

k

possiblevaluescontributes2k−1

−

_1,_and_each_ordinal_predictor_with

k

_possible_values_contributes

k

−

_1._The_factor

4 comes fromthe numberofordered fullbinary treeswith

i

innernodes,whichisgivenbythe ithCatalannumberand equalling to 4i up toa factor polynomial in i. We can interpretthe prior as ﬁrstdrawing the number ofleaves froma nearly uniformdistribution,then drawing arandom full binary treewiththegivennumber ofleaves, drawinga random splitindependentlyforeachinnernodeofthetree,andacceptingthetreeifitisavaliddecisiontree.

For an illustration, consider the decisiontree shown inFig. 1. In thiscase we have one categoricalpredictor with4 possible values andone continuous predictor.Consequently, the constant C equals 24−1

−

1

+

1

=

8,whence

γ

=

1/32. Thus, the prior probability of theshown treewith 5leaves is 324

=

₂20 _times_smaller _than _the _prior _probability _of_the

independencemodel,i.e.,thetreewheretherootistheonlyleaf. 3.2. PCART as a local model

ToincorporatePCARTaslocalmodelsinBNs,weneedtospecifythecomponents

π

v

(

Gv

,

T

v

)

ofthestructureprior.Let

P

=

Gv

⊆

V

\ {

v

}

with

|

P

|

≤

D;the

maximum indegree

parameter

D

controlsthecomputationalandstatisticalcomplexityof

themodel.Forbrevity,write

T

for

T

v.Weusethefactorization

π

v

(

P

,

T

)

=

π

v

(

P

)

π

v

(

T

|

P

)

c

(

P

) ,

where

c

(

P

)

normalizes

π

v

(

T

|

P

)

intoaprobabilitydistributionfor

T

.Itremainstospecify

π

v

(

P

)

and

π

v

(

T

|

P

).

Weconsider

twodifferentchoicesforboth

π

v

(

T

|

P

).

Our ﬁrst choice is to set

π

v

(

P

)

=

1, i.e., the prior is nearly uniform over parent sets of size at most D.3 This is the

simplestandmost“ignorant”prior, whichtreatsall DAGsequallyapriori;foradeeperdiscussionofgraphpriorswerefer toAngelopoulosandCussens[27] andreferencestherein.Wegetthebasicvariantofourmodel:

PCART (partition–dyadic CART): Let

π

v

(

T

|

P

)

=

(4

C

)

−l(T) foralldecisiontrees T of SP withatmostsdyadicsplitsper

con-tinuousvariable.

Oursecondchoiceisgivenby

π

v

(

P

)

=

1

n−1 |P|

,i.e.,theprior is

nearly uniform over the sizes of the parent sets

.4 Thisprior avoidsconcentrating almost allprior probability onthe densestgraphs; see,e.g.,Friedman andKoller [29] for usage and discussionofthisprior.Wewillcallthisthe

penalized

variantandrefertoitas

PCARTp

.

Through thelikelihood function,themodel deﬁnesajointlocal score

π

v

(

Gv

,

Tv

)

v

(

Tv

).

This,inturn,induces alocal

scorefortheparentset,

g

v

(

Gv

),

obtainedbymaximizingthejointlocalscoreover

T

v.Therolesofthesetwokindsof local

scoresin structurelearning willbeclariﬁed inthenextsection (Proposition4).We note thattheresultingglobalscoring functionforDAGs,obtainedbytakingaproductoftheparent-wiselocalscores,isnotscoreequivalent,that is,twoDAGs thatencodethesameconditionalindependencerelationsmaygetdifferentscores.Itisanopenquestion,whetherthereare anynontrivialconstraintsunderwhichwewouldachievescoreequivalence.

3.3. Full table variants and adaptive discretization

RecallthatthestandardCPDmodelinBNsisthefulltablemodel,whereeachconﬁgurationtheparentvariable’sstates isassignedadedicatedcategorical(multinomial)distribution.ForcomparingPCARTtothisstandardmodel,wenextextend thefulltablemodelalsotocaseswheretheresponseorsomepredictorvariablesarenotcategorical.

If all predictor variables are categorical, the PCART and PCARTp models we just deﬁned have the following tabular counterparts:

FT (full table): Supposingallpredictorvariablesarecategorical,let F beamaximaldecisiontreeof SP whereeachleafisa

singleton

{

xP

}

with

x

P

∈

SP.Welet

π

v

(

T

|

P

)

=

1 if

T

=

F and

π

v

(

T

|

P

)

=

0 otherwise.Inaddition,

π

v

(

P

)

=

1.

FTp (penalized full table): LikeFTbutwith

π

v

(

P

)

=

1

n−1 |P|

.

Inotherwords,thesemodelsintroduce,foreachjointconﬁgurationofthepredictorvariables’states,adedicatedcategorical ornormaldistribution.IntheformercasethiscoincideswithastandardCPT;inthelattercase,itcoincideswithastandard conditionallinearGaussianmodel,albeitofasomewhatdegenerateform.

Tomake thetabular variants applicable tocontinuous and ordinalpredictors aswell, we introducethe following dis-cretization scheme. Foreach continuous or ordinal predictor u

∈

P, discretize its range of values into

k

u bins matching

3 _The_prior_is_uniform_over_DAGs_but_not_exactly_uniform_over_the_possible_parents_for_a_ﬁxed_node,_because_a_smaller_set_of_parents_is_compatible_with alargernumberofDAGs.

(8)

the ku quantilesofthe empiricaldistribution.The numbers

k

u arefree structuralparameters ofthemodel,takingvalues

between 2 and some maximum M (weused M

=

7 inour experiments). We assignuniform prior over the jointvalues

(

ku

)

u∈P.

Thisdiscretizationschemehastwo importantfeatures.Oneisthatitconsidersjointdiscretizationsofthepredictorsin relationtotheresponse,asopposedtosomesimplerschemesthatconsiderthevariablesseparatelyorinpredictor-response pairs. The other featureis that theschemedoes notﬁx anysingle discretization ofa continuous variableinthe BN,but the discretization mayvary betweenthe parent sets to which thevariable belongs.Both features enhance thepower of the discretization schemeas compared tothe simple schemes.The prize we have topay is that thecomputational cost of searchingfor optimaldiscretizations grows,as we needto consider all possibleconﬁgurations

(

ku

)

u∈P for each P we

encounter;wewilladdressthisissueinSection4.

4. Algorithms

Thestate-of-the-artalgorithmsforﬁndingagloballyoptimalBNstructureproceedintwosteps:

Step 1 (Candidate parent set scoring and pruning) Foreach v

∈

V and P

⊆

V

\ {

v

}

with

|

P

|

≤

D compute the localscore gv

(

P

),

deﬁnedintermsofthelocalmodelandthedata(e.g.,priorandlikelihood).Pruneevery setthathasasubset

withatleastaslargescore.Returntheremainingparentsetsandtheirscores. Step 2 (DAG search) ReturnaDAG

G

thatmaximizesthetotalscore

_v_∈_Vgv

(

Gv

).

ForStep 2wemayuseexistingcompletesolvers,e.g.,onesbasedonintegerlinearprogramming [15] orA* [14].Inour experimentsweusedtheintegerlinearprogrammingbasedsolver

GOBNILP

[15],sinceinourpreliminarystudyitappeared toscalewell.ItisworthnotingthatwecannotusetheconstraintprogrammingbasedsolverofvanBeekandHoffmann [16] becauseitassumesscoreequivalence,i.e.,thatthescoringfunctiongivesthesamescoreforallMarkov-equivalentDAGs.5

OurconcernisStep 1.Toagreewiththedeﬁnitionofthescoringfunction(1),wesetthelocalscoresby

gv

(

P

)

=

max T

π

v

(

P

,

T

)

v

(

T

)

,

where

T

runsthroughalllocalstructuresonthepredictors P.Indeed,itiseasytoseethatthisassignmentguaranteesthat thealgorithmwillﬁndagloballyoptimalstructure:

Proposition4

(Optimality). Let G be a DAG that maximizes

_v_∈_Vgv

(

Gv

)

. Then, for each v

∈

V , there exists a Tv such that the

structure

(

Gv

,

T

v

)

v∈Vmaximizes the scoring function (1).

Proof. ForanyﬁxedDAG

G

thechoicesforthelocaltrees

T

v areindependent.Therefore,wehavethat

v∈V max Tv

π

v

(

Gv

,

Tv

)

v

(

Tv

)

=

max (Tv)v∈V

v∈V

π

v

(

Gv

,

Tv

)

v

(

Tv

)

.

Thus, choosingforeach

G

v atree

T

v thatmaximizes

π

v

(

Gv

,

T

v

)

v

(

Tv

)

yieldsastructure

(

Gv

,

T

v

)

v∈V thatmaximizesthe

scoringfunction(1)

2

InthecaseofPCARTeachlocalscorerequiresoptimizationoveralargenumberofdecisiontrees.Moreover,thenumber of required local scores grows rapidly withthe total numberof variables

n

, beingn

dD=0

n−1 d

for maximum indegree D.In theremainder ofthis section we detaila dynamic programmingalgorithm tosolve thisoptimization problem. We havedevelopedasimilaralgorithmforcomputingtherequirednormalizingterm

c

(

P

),

whichisa

sum

overdecisiontrees; however,asthistermisindependentofthedataandfastertocompute,wewillomitamoredetaileddescription.

ForFT (maximaldecisiontree),itsuﬃces toenumeratethe possiblejointdiscretizations ofthecontinuous predictors. For ﬁxed v and P, thisrequires atmost

(

M

−

1)|P| evaluationsof a closed-form expression; formoderate valuesof the maximumnumberofbins

M

,thecomputationalcostissmallcomparedtothatofPCARTstructureoptimization.

4.1. PCART structure optimization

WegiveadynamicprogrammingalgorithmtoﬁndanoptimalPCARTstructureforaresponse

v

andasetofpredictors P

⊆

V

\ {

v

}

.The algorithm applies toany objectivefunction that factorizes intoa product oflocal scores f

(

R

)

over the leaves R ofthetree.OurBayesscoreisclearly ofthisformasboth thepriorandthelikelihoodfactorizeovertheleaves. Foreverypossibledecisiontreenode R

= ×

u∈PRu,thealgorithmcomputes,inatop-downfashion,thebestscore f

(

R

)

for

thesubtreebelowR.Thusthescoreoftheroot, f

(

SP

),

isthemaximumscoreoverallPCARTstructures.

(9)

Algorithm1PCARToptimization.

Input:Responsev,predictorsP⊆V\ {v}anddataX=(X1_,_X2_,_{. . . ,}_XN₎

γ←1/4u∈Pcontinuous1+ u∈Pcategorical(2|Su|−1−1)+ u∈Pordinal(|Su| −1) −1

Computethestructurepriorfactor functionLeaf-Score(y1_,_y2_,_{. . . ,}_yN₎

returnγ ×marginallikelihoodofresponsevaluesycomputedusingtheformulasinSections2.3and2.5 LetMbeanassociativearray

functionSubtree-Score(R,Y=(Y1,. . . ,YN)) Computesf(R)giventhatY isthesetofdatapointsinnodeR foreachcategoricalorordinalpredictoru∈Pdo Removemissingvaluesfromdiscretevariables

Ru← {Yu1,Yu2,. . . ,YN

u}

ifMdoesnotcontainelementRthen

MR←Leaf-Score(Y1

v,Y2v,. . . ,YN

v ) ConsiderthecasewhereR isaleafnode

ifN>0then

foreachcontinuouspredictoru∈Pdo Considerallsplitsincontinuousvariables LetRu= [x,y),Su= [x,y)

if(x−y)/(x−y)>2−s_then

LetR,RbesuchthatRu= [x,(x+y)/2),Ru= [(x+y)/2,y)andRP\{u}=RP\{u}=RP\{u} PartitionY into(Y,Y)suchthatYcontainsthedatapointsYi_satisfying_Yi

u< (x+y)/2

MR←max{MR,Subtree-Score(R,Y)×Subtree-Score(R,Y)}

foreachcategoricalpredictoru∈Pdo Considerallsplitsincategoricalvariables foreachpartition{C,C}ofRpintotwononemptysetsdo

LetR,RbesuchthatRu=C,Ru=CandRP\{u}=RP\{u}=RP\{u} PartitionY into(Y,Y)suchthatYcontainsthedatapointsYi_satisfying_Yi

u∈C

MR←max{MR,Subtree-Score(R,Y)×Subtree-Score(R,Y)}

foreachordinalpredictoru∈Pdo Considerallsplitsinordinalvariables foreachpartition{C,C}ofRpintotwononemptysetssuchthatmaxC<minCdo

LetR,RbesuchthatRu=C,Ru=CandRP\{u}=RP\{u}=RP\{u} PartitionY into(Y,Y)suchthatYcontainsthedatapointsYi_satisfying_Yi

u∈C

MR←max{MR,Subtree-Score(R,Y)×Subtree-Score(R,Y)} returnMR

Output:Subtree-Score(SP,X)

Thebestsubtreescorefornode Risobtainedbyconsideringallpossibilitiesforthenode:thenodemaybealeafnode, oritisaninnernodewithtwochildren Rand

R

.Thescoreofaleafiscomputedfromthetrainingdatapointscontained in

R

asspeciﬁedinSections2and3.Foraninnernodethescoreiscomputedrecursivelyastheproduct f

(

R

)

f

(

R

)

ofthe scoresofthechildsubtrees.Therecursioncachesthealreadycomputedsubtreescorestoavoidcomputingthesamescore multipletimes.Aftercomputingthescores,anoptimaldecisiontreeisextractedbytracingthesplitsthatgavetheoptimal scores.

Withourchoiceofscores,ithappensthatanode R thatcontainsnotrainingdatapointshastoalwaysbealeafnode, andthe recursiondoesnot need to advancefurther. Furthermore,iffor some categoricalorordinal variable u there are some valuesin Ru thatdonot appearinthetrainingdatathat iscontainedinR,we cansimplyconsidera smallernode

withtheunused valuesremoved. Thismaycause Ru to consist ofnonadjacentvaluesforsome ordinal variable

u

,butit

doesnotaffecttheresultsince weonlyconsidersplits respectivetotheorderingofthevariable,thatis,all thevaluesin R_u aresmallerthanthe valuesin R_u.Theseoptimizationssigniﬁcantly reduce thenumberofnodesthealgorithmhasto consider.ThepseudocodeforthealgorithmisshowninAlgorithm1.

WenextgiveanasymptoticupperboundforthenumberofnodesK thealgorithmconsidersandfortherunningtime ofthealgorithm.Theboundswillberatherloose:theyfailtoaccountforalltheoptimizationsandspecialpropertiesofthe datasets.

Proposition5

(Time complexity). Algorithm

1considers K

≤

min

N2A(k−1)−2C

(

_s

+

₁₎B

,

₂Ak+B(s+1)−C

(

_r

+

₁₎2C _{tree nodes, and}

the total time complexity is O

n

(2

kA

+

B

+

rC

)

KlogK

, where A, B, and C are the numbers of categorical, continuous, and ordinal predictor variables, respectively, k is the maximum number of categories, s the maximum number of recursive splits per continuous variable, and r the maximum number of values of any ordinal variable.

Proof. Counting the possiblenodes simply asthe product of atmost 2k possible value subsets Ru

⊂

Su per categorical

variable

u

∈

P,atmost2s+1 _subsets_per _continuous_variable,_and_at_most

r+1 2

≤

(

r

+

1)2

/2 subsets

_per_ordinal _variable,

givesan upperbound K

≤

2Ak+B(s+1)−C

(

r

+

1)2C. Inthe casewhenthe numberoftraining datapoints N is low,weget abetter boundsimilarly toBlanchardetal. [24] (whoassume allvariables becontinuous): Sincethealgorithmonly con-siders nodesthat contain at least one training datapoint, we can bound the numberof nodes by the total number of containing nodes foreach datapoint. The number of value subsets that contain a givenvalue isat most 2k−1 _for_each

categoricalvariable,

s

+

1 foreachcontinuous variable,and

(

r

+

1)2

/4 for

eachordinal variable,yielding anupperbound K

≤

N2A(k−1)−2C

(

_s

+

₁₎B

(

_r

+

₁₎2C_.

Now,forevery node the algorithmconsiders all possiblesplits—at most2k foreach categoricalvariable,one foreach continuousvariable,andatmost

r

−

1 foreachordinalvariable—andﬁndsthescorefromanassociativearraydatastructure in

O

(log

K

)

time.Thusthetotaltimecomplexityis

O

n

(2

k_A

+

_B

+

_rC

)

_K_log_K

_.

2

(10)

Fig. 2.Anillustrationofthebinarynoderepresentation.TheexamplecorrespondstothePCARTinFig.1,andconsistsoftwoparts:3bitsforthecontinuous variableY (herethemaximumnumberofsplitsis2)and4bitsforthecategoricalvariableZ.Forthecontinuousvariable,thelengthoftheleading1-bits andthefollowing0-bit(ingray)indicatesthelengthoftheinterval,andtheremainingbitsencodethepositionoftheinterval.

Foran illustrationofthe complexitybound,considerthe decisiontreeshowninFig.1.Inthiscasewehaveone cate-goricalpredictorwith4possiblevaluesandonecontinuouspredictor.ByProposition5,thetimecomplexityis

O

(

nKlogK

),

wherethenumberofconsideredtreenodesK isatmostmin

{

64

·

N

(

s

+

1),256

·

2s+1

}

.

4.2. Implementation tricks

The dynamicprogrammingalgorithmforﬁndingan optimalPCARTstructure iseasilyimplementedusingrecursion,as describedinSection4.1.However,bycarefullytuningthememoryrepresentationofnodes,weareabletooptimizeourC++ implementationtorunanorderofmagnitudefaster.

Ourimplementationstoresnode

R

asabitvectorthatisaconcatenationofﬁxed-lengthbitvectors,eachofwhichstores the value subset Ru for a variable u

∈

P. Foreach continuous variablewith maximum numberof splits s,the 2s+1

−

1

possible intervalsare storedas

(

s

+

1)-bit binary numbers.Foreach categoricalvariable u,the subset ofvalues Ru

⊆

Su

is storedusing

|

Su

|

bits.AnexampleoftherepresentationisshowninFig.2.The predictorvaluesofthedatapoints are

alsoencodedastheminimalnodescontainingthem.Inpractice,64bitssufficeforstoringthenoderepresentations,which meansthattheoperationsperformedbythealgorithmonthenodestranslateintoefficientbitoperationson64-bitmachine words.Thedatacanbepartitionedin-placeusingsimplebitoperationsandbecauseofthecompactrepresentation,thedata willstaypresentinCPUcaches,makingtheinnermostloopsofthealgorithmveryefficient.

5. Empiricalstudies

Weempiricallycomparedthedifferentlocalmodelsinrelationto(a)structurerecovery,(b)predictiveperformance,and (c)runningtime.Tothisend,weusedbothsyntheticdatageneratedfromknownbenchmarknetworksandrealdatafrom variousdomains.

5.1. Data sets and hyperparameters

Asground-truthbenchmarknetworks,weusedAlarm,Insurance,andWater,allextractedfromthe

bnlearn

[30] repository.6 The networks contain 37, 27,and 32 variables respectively, and are thus suitable for ﬁnding the globally optimal DAG withmodern solvers. However,asthesebenchmarknetworkscontain onlycategoricalvariables,we didnotuse themfor evaluatingtheperformanceonmixedvariables.

Forrealdata,weextractedallsuitabledatasetsfromtheUCImachinelearningrepository [31];moreexactly,we consid-eredalltheclassiﬁcationandregressiondatasetsthathaveatleast100datapoints,atleast8butfewerthan36variables, andthatconsist ofa singletable thatisfreelyavailable.We preprocessedthedatasetsbynormalizingallthecontinuous variables to havemeanzeroandvariance one andlimitingthenumberofcategoriesin eachcategoricalvariable to5by combiningtherarestcategoriesintoone.Wealsoremovedidentiﬁervariablesanddatapointsthatcontainmissingvalues, rejectingthedatasetswhereweneedtoremoveamajorityofthedatapoints.WealsorejectedthedatasetsMushroomand

ParkinsonsTelemonitoring,since thenetwork optimizationphaseusing

GOBNILP

took too muchtime forthem.SeeTable B.5

(AppendixB)foracompletelistofthe52datasets.

Wesettheuserparametersofthemodelsasfollows.Thehyperparametersofthepriordistributionsweresetas

α

=

1

/

₂ (Jeffreys prior),

μ

¯

=

0,anda

=

ν

=

λ

=

1,exceptfor thesensitivitystudy (Section5.6) wherewe varied thelatter three values;forthesegmentationpriorofordinalresponses weset

κ

=

1

/

₄_to_{approximately}_balance_the_prior_mass_on_smaller

(11)

Fig. 3.ComparisonofgloballyoptimalDAGsearch(FT,FTp)withheuristicapproaches(HC,TABU)andaconstraint-basedalgorithm(PC)forthetaskof recoveringaground-truthnetworkstructurebasedondatasamplesofvaryingsize.TheevaluationmetricisanaveragestructuralHammingdistanceover tenindependentdatasamples.Errorbarsindicate(hereandinallfollowingplots)thestandarderrorinbothdirections.

andlargernumbersofbins.Wesetthemaximumnumberofsplitspercontinuousvariableto

s

=

5,themaximumnumber ofbinsinthefulltablevariantto M

=

7,andthemaximumindegreeofaDAGto D

=

4.

5.2. Comparison of exact and heuristic algorithms for Bayesian network structure learning

As a preliminary small-scalestudy, we investigatedthe advantage ofﬁnding a globally optimal DAGas opposedto a merely high-scoring DAG.Here werestrict ourselves tostandard CPTbasedlocalscores. Tobe precise,we comparedour implementation of the FT local score (with GOBNILP for searching for the globally optimal DAG) with Scutari’s bnlearn implementation [30] ofhill-climbing(HC),TABUsearch,andtheconstraint-basedPCalgorithm.ExceptforthePCalgorithm, all theseimplementations usethesamescoringfunction (theBDscore withhyperparameters setto1

/

₂_,_combined_with_a uniformstructureprior),thusenablingfaircomparisontoFT.Forreference,wealsocomparedtoglobaloptimizationunder thepenalized FTp score.We usedthePC-implementation withitsdefaultparameters:an asymptoticchi-squared test for themutualinformationteststatisticwithasigniﬁcancelevelof0.05.

Fromthethreebenchmarknetworkswesampleddatasetsofdifferentsizes,tenrepetitionseach,andlearnedanetwork structureforeachdatasetandalgorithm.WethencomputedthestructuralHammingdistance(SHD) [32],whichmeasures differencesup to the equivalenceclass, betweeneach learnednetwork andthe groundtruth. Weaveraged the SHDs for eachalgorithmandsamplesizeoverthetenindependentrepetitions.TheresultinglearningcurvesareshowninFig.3.

WeobservethatFTandFTpoutperformtheotherthreeapproachesforAlarmandInsuranceoncethesamplesizeexceeds afew hundreddatapoints.ForWater,however,FTperforms substantiallyworse thanHCandTABU.Thiscanbeexplained by the observation that the BDscore alone (without a nonuniform structure prior) can be a very poorscoring function duetoapreferenceforoverlycomplexmodelsatrelativelysmallsamplesizes. Insuch cases,theinabilityoftheheuristic algorithmstoﬁndtheratherdensegloballyoptimalDAGbecomesafortunateadvantage,asthey stopthesearchatalocal optimumthatisa relativelysparsesubgraph.ThepoorperformanceofFTjustiﬁestheinclusionoftheFTpvariant,which usesastructurepriorthatisknowntopenalizecomplexityappropriately [33].Unfortunately,

bnlearn

doesnotallowcalling HCorTABUwithanarbitraryscoringfunction,soitisnotpossibletodirectlycomparethegloballyoptimalFTpsolutionto thecorrespondingheuristicresults.

The constraint-based algorithm (PC) hasadvantages atvery smallsample sizes, but it doesnot learn much from an increasedamount ofdataandeventually recovers the generatingnetwork worse thanthe score-basedlearning methods. (Onemight improve the performance of PC by letting its signiﬁcance levelparameter depend onthe data size in some appropriateway,butthe

blearn

implementationdoesnotdothatbydefault.)

AsidefromtheratherpathologicalcaseoftheWaternetwork,theobservationsfromthepreliminarystudyaddto previ-ouslypublished evidencefortheneedofglobaloptimization[17] andjustiﬁesourrestrictiontostudylearning withlocal structuressolelywithintheglobaloptimizationframework.

5.3. The effect of local structure on benchmark network structure recovery

Consider thenthesameexperimental settingasintheprevious section,butnowforcomparingthefourlocalmodels, PCART(p)andFT(p),andonlyexactalgorithms.Weobserve(Fig.4)thatbothPCARTvariantsyieldsmallerSHDsthanFTfor allgeneratingnetworksandsamplesizes. Withthesparsity-favoringDAGprioradded,FTpbecomescompetitiveforAlarm. Apossible explanation isthat Alarm hasthe smallestedge/noderatioof only1.24, whereas Insurance andWater have1.93 and2.06, respectively.Asmallaverage indegreeleaveslittle potentialforstructured CPDsasthemajority ofCPTs canbe accuratelyrepresentedwitha (small)probabilitytable.ForInsuranceandWaterthePCARTvariantsyieldfasterconvergence thanFTp.

Insummary,PCART(p)appearstohaveaslightadvantageoverthefull-tablevariantsinrecoveringthenetworkstructure, butnotineachandeverycase.Thisisnotsurprising,sincethereisnoreasontobelievethateveryconditionalprobability

(12)

Fig. 4.Accuracy of PCART(p) with FT(p) in structure recovery from data samples generated from three benchmark networks.

Fig. 5. LearningcurvesproducedbytheIntersection-ValidationmethodforfourUCIdatasets.Thelargestsamplesizeineachplotcorrespondstothe originaldatasize,forwhichallmethodsyieldthesamepartialHammingdistance(ofvaluezero)bydeﬁnition.

table in theground-truthnetworksadmitsan economical tree-structuredrepresentation.Moreover, interms ofSHDit is oftenbeneﬁcial tolearn aconservative(sparse) modelwhenthesample sizeis small [33].Forvery smalldata sets,even attemptingtolearnastructuredlocalmodelmayyieldsmallstructuraloverﬁttingeffectsthatoutweighthepossiblyhigher expressivenessofPCART.

5.4. Structure recovery on real data

For evaluating the accuracy of structure learning fromreal data, we apply a very recent methodcalled Intersection-Validation [34].ThemethodallowsustodrawlearningcurvessimilartotheexamplesinFig.4evenwhennoground-truth DAGisavailable.Themethodcheckswhichstructuralfeaturesalllearningalgorithms(e.g.,scoringfunctions)thataretobe comparedagreeuponwhenlearningfromtheentiredataset,andtreatsthemassurrogategroundtruth.Structureslearned atsmallersubsamplesoftheentiredatasetarethencomparedagainstthispartialnetworktocomputeanapproximation ofSHD,called

partial Hamming distance

.

For each of the 52 UCI data sets under consideration and foreach ofthe four scoring functions, we learn networks fromsubsamplesof1

/

₂_,1

/

₄_,1

/

₈_,1

/

₁₆_,1

/

₃₂_,_and1

/

₆₄_of_the_original_data_size_and_repeat_the_subsampling_process_ten_times._The resulting averagedlearning curvesforthe fourscoring functionsincomparisonare showninFig. 5fortheﬁrstfourdata sets according to an alphabeticalranking of their names, andinAppendix A forthe remaining 48 datasets. Byvisually inspectingthelearningcurves,wedrawsimilarconclusionsasinthebenchmarknetworkstudyfromtheprevioussection: FTiswidelyinferiortotheothermethods,likelyduetolearningoverlycomplexmodelsatsmallsample sizes.PCARTlocal modelsperformoften,thoughnotalways,slightlybetterthanFTp.

Inorderto summarizetheresultsfromthe52datasetsinasystematicfashion,weaggregatedthembythefollowing procedure.Weconsideredeachfractionoftheoriginaldatasize(1

/

₂_,1

/

₄_,_etc.)_as_a_separate_group,_irrespective_of_the_absolute samplesize.Wethenmadeapairwisecomparisonofthefourlocalmodelsforeachofthesixgroups.Tothisend,weused the followingassessment:method Aperforms better/worsethan methodB(win/loss) ifityields a smaller/larger average partial Hammingdistanceandiftheerrors bars inthelearning curvesdo notoverlap. Iftheydo overlap,we considered bothmodelstoperformequallywell(tie).

We show theseaggregated statistics inTable 1 for the two mostimportant modelcomparisons and inTable A.3for theremainingfourcases.WeobservethatthePCARTvariantsperformconsistentlybetterthanthefull-tablemodels, espe-cially atsmallsamplesizes. ThisdoesnotonlypertaintothecomparisontoFT,butalsotoFTp,whichdemonstratesthat structured CPDshelptorecovercorrectstructuralfeatureswithfewersamples,incomparisontofull-tablemodels.Thisis an interestingresultastheeffectwasnot thatobviousinFig.4,whichindicatesthat realdatasetstendtohavedifferent properties than syntheticdatasampled frombenchmarknetworks. An alternative todeclaring a tie betweenmethods in the eventofoverlappingerrorbarsistousea nonparametric statisticaltest (TableA.4).The numberoftiesdoeshere

(13)

in-Table 1

SummarizedmethodcomparisonbasedontheIntersection-Validationmethod.Samplesizereferstothefractionof thesizeoftheoriginaldataset.Atieoccursiftheerrorbarsoftwomethodsoverlap.Otherwisethemeanpartial Hammingdistancedecidesuponwinorloss.

(a) PCARTp vs FTp

Sample size Win Tie Loss

1/64 21 30 1 1/32 26 24 2 1/16 33 15 4 1/8 29 18 5 1/4 23 22 7 1/2 18 30 4 (b) PCART vs FT

Sample size Win Tie Loss

1/64 50 2 0 1/32 49 3 0 1/16 47 4 1 1/8 40 12 0 1/4 39 10 3 1/2 31 16 5 Table 2

SummaryofpredictionresultsonUCIdata.Atieoccursiftheerrorbarsof twomethodsoverlap.

Win Tie Loss

PCART vs PCARTp 1 51 0 PCART vs FT 20 23 9 PCART vs FTp 19 24 9 PCARTp vs FT 21 20 11 PCARTp vs FTp 18 23 11 FT vs FTp 0 51 1

creaseduetothesmallsamplesize(tenrepetitions),butamonginstanceswithastatisticalsigniﬁcantdifference,thePCART methodsoutperformtheirfulltablecounterparts.

TheseresultsonrealdatafurtherimplythatthePCARTvariantsareeffectivenotonlyforcategoricalvariables,butalso inthepresenceofmixed,orcompletelycontinuousdata.

5.5. Predictive performance

Inall previous studies,weviewedthe graphstructureasthelearning target. Anothercommonlyconsidered target for learning isthejointprobabilitydistribution.Onecan evaluatealearneddistribution,e.g.,by computingthe cross-entropy tothegroundtruthdistributionor,inabsenceofgroundtruth,byapproximatingthecross-entropybytheaveragelogloss inacross-validationexperiment.

Wefollowedthisapproachforall the52UCI datasetsandfourlocalmodels.Foreachcombinationwecomputedthe averagelogloss,calledhenceforth

log loss

forshort,byfirstfindingaglobally optimalmodelstructure,includingthelocal structure,inrelationtothetrainingdata,andthentakingthelosswithrespecttotheBayesianposteriorpredictivedensity ofeach testdatapoint.Allloglossvalues,alongwithstandarderrors,areshowninAppendixB.Table2showsawin–tie– losssummary basedonoverlapping errorbarsandTableB.6 showsan alternativebased onatwo-sidedWilcoxon-signed ranktest [35].Incontrasttothesummarystatisticsfortheintersection-validationstudy,werehereobservethesignificance testtoproducelesstiesthantheoverlappingerrorbarcriterion,whichcanbeexplainedbytherelativelylargesamplesize of50repetitions.

Inboth caseswe seethatforpredictionit makes virtuallyno differencewhetherasparsity-favoring structure prioris used ornot: thepenalized and unpenalizedversionsofPCART andFTperform nearly identically.Models that are overly complex accordingto SHDcan still host the trueprobability distribution, andestimatingit from datais effectiveunless theirCPDsaresolargethatmassiveparameteroverﬁttingoccurs,whichisunlikelyevenformodelslearnedwithFT.

Aside from this aspect, we again observe a slight advantage of PCART compared to full tables, albeit in manycases bothtypesoflocalmodelsperformequallywell.Thisisnotsurprisingasithasbeenobservedthatmeasuringthelearned distributionshowslessdifferencesamongstructurelearningmethodsthanmeasuringthestructuralfeatures [34].

5.6. Hyperparameter sensitivity analysis

AllofthepreviousstudiesusedaﬁxedsetofhyperparametersfortheBayesianpriors.While

α

=

1

/

₂_(Jeffreys_prior)_and

¯

μ

=

0 (meanofastandardnormal)canbeconceptuallymotivated,thechoiceof

a

=

ν

=

λ

=

1 wasbasedonsimplicityand decentperformanceinpreliminarytests.Wenowstudyhowvaryingeachofthesethreeparameters(whilekeepingthetwo othersﬁxedattheir defaultvalue)affectsthepredictiveperformance underthefourlocalmodels,both inabsoluteterms andrelativetoeachother.

The results forthe UCI data set concrete-compressive-strength are shownin Fig.6, whereas the resultsfor 14additional randomly chosen UCI datasets areshown inAppendixC.We ﬁndthat varying thehyperparametermayindeed havean effect.Ourdefaultvaluesaregenerallyagoodchoice,albeittheycouldbeimprovedinsomecases.

(14)

Fig. 6.HyperparametersensitivityanalysisforoneUCIdataset.Eachplotvariesonehyperparameter(x-axis)whilekeepingtheothertwoﬁxed,showingthe effecttopredictiveperformanceasmeasuredbylogloss(y-axis).Dashedverticallineindicatesthedefaultvalue.Notethatforthisdatasetthedifference betweenthepenalizedandunpenalizedalgorithmvariantsisextremelysmall.

Fig. 7.Running time in seconds of Step 1 vs Step 2 under the four local models. See text for a description of the included data sets.

The more importantobservation, however,is that thedecision whetherPCART(p)orFT(p) issuperior fora particular datasetdoesnotheavilydependontheparticularchoiceofhyperparametervalues,exceptforthecaseofverylargevalues (suchas

λ

=

103 inFig.6)whereallmodelsperformpoorly.

5.7. Running times

Last,weinvestigatedhowthechoiceofthelocalmodelaffectsthetime requirementsforthelocalscorecomputations (Step 1) and fortheDAG optimization (Step 2,by GOBNILP). Fig. 7showsthe measured running timesforthe datasets we sampledfromthethreebenchmarknetworks(Section5.3) andfortherealUCIdatasetswe includedinthestructure recoverystudy(Section5.4).

Weseethatunderallfourmodels,Step 1tendstoconsumemoretimethanStep 2.Thisobservationhighlightsthefact that

GOBNILP

solves the DAG optimizationstep faston typicalproblem instances,andthus the localscore computations can bea bottleneck,evenunderthestandard CPTmodel(e.g.,FTandFTp oncategoricaldatageneratedfrombenchmark networks).Nevertheless,therearealsodatasetsforwhichStep 1isfasterthanStep 2oralmostequallyfast;importantly, this holds alsounder PCARTandPCARTp. This observation, inturn, highlightsthe fact that DAG optimizationis hard in general, andthereforeonecanaffordcomputationallydemandinglocalscores.Itisworthnotingthatthetimecomplexity ofthelocalscorecomputationsgrowsonlypolynomiallyinthenumberofnodes

n

,assumingaconstantmaximumindegree, whereas thecomplexity oftheDAGoptimizationstepissuspectedtogrowsuperpolynomially, thecurrentbestwort-case boundbeingexponentialin

n

.

5.8. Studies with ordinal variables

Inordertobenchmarkthealgorithmvariantsthatcan handleordinaldata,wesearchedforordinalvariablesinthe52 UCI datasetsfromtheprevious sections.In ordertoavoidadditionaldiscretization,we onlyconsidereddiscrete variables with lessthan 16 possible values.As a consequence,we treat an attribute such as“Age” (in years), which is ordinal in principle, ascontinuous variable nevertheless. From all datasets, we chose sixdata sets thathave atleastthree ordinal variables;seeAppendixD.1fordetails.

ForcomparingtheordinalPCARTvariants(OPCARTandOPCARTp)withtheordinalfulltablemodels(OFTandOFTp),we repeatedtheintersection-validationstudyfromSection5.4andthepredictionstudyfromSection5.5anddisplaytheresults inAppendixD.2andAppendixD.3(TableD.7).WeobservethatthePCARTvariantscontinuetoperformbetterthanthefull table variantsonaverage.Hence,thetreatmentofordinalvariablesassuch,insteadofbeingmodeledaseithercontinuous orcategorical,doesnotchangetherelativeperformanceofPCART-basedmodelsinrelationtotheirfulltablecounterparts.

(15)

Fig. 8.The effect of including ordinal variables on structure learning measured by the intersection-validation method.

Nevertheless,fromtheseresultsitisyetunclearwhethertheextension toordinalvariablesispractically useful.Hence, we investigatedhowwell theordinalvariants ofouralgorithmperforminrelationto thenon-ordinal variantsstudiedin theprevioussections.Whilethepredictiveprobabilitiesanddensitiesarenotcomparableanylongerassoonasavariable changesitstype,theinterpretationofthelearnedDAGsremainsthesame,allowingafaircomparisonamongthemethods. We thus applied the intersection-validation method similar to Section 5.4 and compared PCART, PCARTp, OPCART, and OPCARTp(Fig.8).Weobservethattheordinalvariantsyieldalmostneverasubstantiallyworse partialHammingdistance compared to their non-ordinal counterparts, and often there is even a slight improvement. This demonstrates that the extensionofouralgorithmstoordinalvariablespaysoffindeed.

6. Discussion

ThisworkrevisitedtheideaofincorporatingdecisiontreesaslocalmodelsinBayesiannetworks.Speciﬁcally,we intro-duceda newmodelclass,PCART,which canhandleboth discrete andcontinuous predictorandresponse variables,takes a Bayesianapproach,andadmitscomputationallyfeasible globaloptimizationinarangeofproblemsizes. Wealso intro-ducedananalogousfull-tablevariant,FT,whichextendsthestandardCPTofcategoricalvariablestocontinuousandordinal responseandpredictorvariables.

Empiricalcomparisonofthetree-structuredPCARTandthetabularFTshoweda clearadvantage ofPCARTinstructure learningandaslightadvantageindensityestimation.Whiletheresultswere obtainedmainlyunderaﬁxed,globalchoice forthevaluesofthehyperparameters,theresultswereobservedtobe robusttochangesofthesevalues.Thatbeingsaid, thesensitivityanalysisalsosuggeststhattheperformanceofbothPCARTandFTcouldbefurtherenhancedbychoosingthe hyperparametervaluesseparatelyforeachgivendatasetandeachvariable,e.g.,bymeansofempiricalBayes.

Inourexperimentswesettheparametersthatcontrolthecomputationalcomplexityofthemodel(maximumindegree, numberof splits)so that thecomputations ofthelocal scores (Step 1)could be completed within a few hours foreach dataset.Thiswasmotivatedbythefactthatforsomeofthedatasets,thestate-of-the-artsolversrequirehourstoﬁndan optimalDAG (Step 2).Thereare two simplewaysto expediteStep 1,ifneeded:by settingthe complexityparameters to lowervalues,andbyrunningthecomputationsfordifferentparentsetsinparallel.Anopenquestioniswhetheronecould gainfurtherreductionsinthe