Contents lists available atScienceDirect
International
Journal
of
Approximate
Reasoning
www.elsevier.com/locate/ijar
Learning
Bayesian
networks
with
local
structure,
mixed
variables,
and
exact
algorithms
Topi Talvitie
a,
Ralf Eggeling
b,
Mikko Koivisto
a,∗
aDepartmentofComputerScience,UniversityofHelsinki,FinlandbDepartmentofComputerScience,UniversityofTübingen,Germany
a
r
t
i
c
l
e
i
n
f
o
a
b
s
t
r
a
c
t
Articlehistory: Received16March2019
Receivedinrevisedform11July2019 Accepted8September2019 Availableonline16September2019 Keywords:
Bayesiannetworks Decisiontrees Exactalgorithms Structurelearning
Modern exact algorithms for structure learning in Bayesian networks first computean exact local score of every candidate parent set, and then find anetwork structure by combinatorial optimization so as to maximizethe global score. Thisapproach assumes thateachlocalscorecanbecomputedfast, whichcanbeproblematicwhenthescarcity of the data calls for structured local models or when there are both continuous and discretevariables,forthesecaseshavelackedefficient-to-computelocalscores.Toaddress thischallenge, weintroducea local scorethat is basedon aclassofclassification and regression trees.We showthat under modestrestrictions onthepossible branchings in the tree structure, it is feasible to finda structure that maximizes aBayes score in a rangeofmoderate-sizeprobleminstances.Inparticular, thisenablesglobal optimization oftheBayesiannetworkstructure,includingthelocalstructure.Inaddition,weintroduce arelatedmodel classthat extendsordinary conditionalprobability tables tocontinuous variables by employing anadaptive discretization approach. The two model classes are compared empirically by learning Bayesian networks from benchmark real-world and syntheticdatasets.Wediscusstherelativestrengthsofthemodelclassesintermsoftheir structurelearningcapability,predictiveperformance,andrunningtime.
©2019TheAuthors.PublishedbyElsevierInc.ThisisanopenaccessarticleundertheCC BY-NC-NDlicense(http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
Bayesiannetworks(BNs)composeamultivariatedistributionasaproductofunivariateconditionalprobability distribu-tions(CPDs). ThepotentialcomplexityofthelocalCPDs,togetherwiththeglobalacyclicityconstraintoftheBNstructure, make the task oflearning BNs from datachallenging in manyways. Evenif we assume that the available data contain neitherunobservedvariablesnormissingentries—likewewilldothroughoutthisarticle—thereoftenremainthefollowing threeconcerns:
(i) Statistical efficiency:thedatamaybescarce,renderingsimpletabularparameterizationsoftheCPDs,suchasthe
condi-tional probability tables (CPTs),statisticallyinefficient.
(ii) Heterogeneous variables:so-called
hybrid BNs
containbothdiscreteandcontinuousvariables,againrulingout themost convenientandstandardparameterizationsofCPDs.*
Correspondingauthor.E-mailaddress:mikko.koivisto@helsinki.fi(M. Koivisto). https://doi.org/10.1016/j.ijar.2019.09.002
0888-613X/©2019TheAuthors.PublishedbyElsevierInc.ThisisanopenaccessarticleundertheCCBY-NC-NDlicense (http://creativecommons.org/licenses/by-nc-nd/4.0/).
(iii) Computational efficiency: the mostnaturalformulationsof thelearning problemare computationally hard. Thislimits boththedimensionsofthedataandthecomplexityoftheCPDsinrelationtowhatcanbesolvedunderusefulquality guarantees.
Each oftheseissues (i–iii) hasbeenaddressed inthe literature,aswe will describe inthe next paragraphs.The present authorsare,however,notawareofanypriorworkthatwouldsimultaneouslyaddressallthethreeconcerns.
To addresstheissue ofstatisticalefficiency,one canreplace tabularCPDs bymore sophisticatedstructureswherethe effectivenumberofparametersadapts totheamountandcomplexityoftheavailable data.Thenotionofcontext-specific independence[1] offersaparticularlyappealingwaytoachievethis:givencertainconfigurationforasubsetofthe
predictor
variables, theresponse
variable isindependentof theremaining predictors.Forstructure learningin BNs,context-specific independences have been represented with decisiontrees [2] and with the somewhat more expressive decision graphs [3]. Both model classesare non-parametric inthe sense their numberof parameters can grow large, letting the models approximateanyCPD(infact,matchCPTsfordiscretevariables overfinitedomains). Thesestudieswere,however,limited todiscretevariablesandgreedyorlocalsearchalgorithms,thusnotaddressingissues(ii)and(iii).Alsoexactalgorithmsto partition tojointstate spaceofdiscretepredictorshavebeenproposed [4],butthesealgorithms becomecomputationally infeasiblealreadywith,say,threepredictorsthattakefourstateseach.ForhybridBNs,therearetwomainapproaches.Oneistodiscretizeallcontinuousvariables.Inpractice,thismeansthat separatelyforeachcontinuousvariable,itsrangeispartitionedintointervals,i.e.,bins,whicharethentreatedasthestates ofthediscretized variable.Then itiscommontoassignCPTs forthediscretizedvariables,evenifthisignores thenatural orderingofthestates.Simplepreprocessingroutinesofthiskindarecomputationallyconvenient,buttheyarepronetolose information in anuncontrolled way.Andif, asaremedy, oneincreasesthe numberofbins,then thestatisticalefficiency becomes a concern.Thereare alsomoreinvolvedschemes,whichcouplemulti-variablediscretization withgreedysearch forahigh-scoring networkstructure[5];theseschemespotentially yieldbetterdiscretizations,buttheyappeartorequire search heuristicsthatcannotguaranteeoptimalityofthefoundsolution.AnotherapproachistoemployCPDsthancan ac-commodatebothdiscreteandcontinuousvariables,suchasconditionallinearGaussianmodelsandtheirgeneralizationsto discreteresponsesviaappropriatelinkfunctions[6,Ch. 5].Thesemodelsare,however,nonsatisfactoryconcerningstatistical efficiency: onecan arguethat they overparameterizeby introducingseparate linearmodel foreveryconfiguration of dis-cretepredictors,whiletheyalsounderparameterizebyignoringnonlinearinteractionsofcontinuouspredictors.Mixturesof truncatedexponentials(MTEs)offeramoresophisticatedmodel,inwhichadecisiontreepartitionsthespaceofpredictor variablesintoregions,ineachofwhichtheresponseisassignedamixturedistribution[7].Unfortunately,nofastalgorithms tohandleMTEsofneededsizesareknown,andconsequentlyonehastoresorttoproceduresthatdonotofferperformance guarantees.
Thehope forcomputationalefficiencyhastobeinterpretedinrelationtotheNP-hardnessoftheBNstructurelearning problem[8]:arealistic goalistosolvetypicalmoderate-sizeinstancesinpracticewithoutsacrificingtheoptimalityofthe solution, rather than aiming at worst-case polynomial time complexity. Nowadays, we have several exact algorithms, or solvers, that achievethisgoal,especially ifchoosingthebests solverforeach givenprobleminstance [9]. Whiledifferent solvers employdifferentalgorithmictechniques,suchasdynamicprogramming[10–13],A*search[14],integerlinear pro-gramming[15],andconstraintprogramming[16],theyallproceedintwosteps:First,thelocalscoresarecomputedforall nodesandtheir possibleparent sets,oftensettinganupperbound forthenumberofparents.Second, aglobally optimal network structure is found by combinatorialsearch. One challenge in applyingthesesolvers is that the numberof local scores computedinthefirststep cangrowlarge forlarger networks.Partly forthisreason,thesolvers havebeenmostly studiedinsettingswherethelocalscoresareofsomeoftheaforementionedsimpleforms,enablingfastevaluation.Ithas been observed that exact algorithms, when applicable, not only remove theuncertainty on the quality, butalso tend to performbetterinstructurelearninganddensityestimation[17].
Inthisarticle,weaddressallthethreeissues(i–iii)simultaneouslybyputtingforwardalocalmodelbasedon classifica-tionandregressiontrees(CARTs)[18].WederiveaBayesscoremostlyfollowingtheworksonBayesianCART[19–21].The main challenge isthat learning CART models iscomputationally intractable; theexisting algorithms are basedon greedy andlocalsearchandcannotofferusefulqualityguarantees.Here,twoobservationscometorescue.First,intheBNlearning context thenumber ofpredictorscan beassumedto be relativelysmallinpractice.This reducesthenumberof possible CART structures, i.e., decision trees,butis alone not sufficient formaking thesearch computationallytractable. The sec-ond observationis that theso-called dyadic CART, inwhich therecursive splits ofthe rangesof continuous variablesare always atthemidpoint, essentially inheritsthestatistical powerof CARTandyet enablessignificantly faster global opti-mization [22–25]. Buildingon theseobservations,we introduce the
partition–dyadic CART (PCART)
model; here“partition” refersto(recursive)splittingofthestatespacesofcategoricalvariables—tothebestofourknowledge,categoricalpredictors havenotbeenallowedinpreviousworksondyadicCART.Whenallvariablesarecategorical,onecancomparetheperformanceofPCARTagainststandardCPTsinastraightforward manner. TocomparePCARTagainstanalogoustabular modelsalsointhepresenceofcontinuousvariables,we introducea full-table (FT) variant ofPCART:it discretizes the involvedcontinuous predictors (but not theresponse). While FTadapts thenumberofstatespervariabletothedataathand,itrendersglobaloptimizationcomputationallyfeasible,unlikesome aforementioneddiscretizationschemes[5].
Inadditiontothesemethodologicaldevelopments,westudyempiricallythefollowingtwoquestions:DoesPCARTboost the statisticalefficiency of structure learning anddensity estimation inpractice, that is, on benchmarkdata sets? What is the price to be paid in terms of computational efficiency, i.e., which kind of problem instances, in terms of model complexityanddatadimensions,canbesolvedinareasonabletime?Onecouldhypothesizethat,thankstoaparsimonious parameterization,PCART enablesmorepowerfuldetectionof arcsandmore robustparameter inference atthecost ofan only modest increase in running time due to efficientPCART optimization algorithms. We will show how the empirical resultssupportthishypothesis.
Partsofthisworkhavebeenpublishedinapreliminaryforminconference proceedings[26].Whiletheconferencepaper onlyconsideredcategoricalandcontinuousvariables,thepresentarticlealsoconsidersordinalvariables—thisamountstoa majorextensiontoboththetheoreticalandempiricalcontributions.Wealsorevisethegeneralexpositionandsubstantially extendtheexperimentalstudy;specifically,thestudiesonpredictiveperformance andhyperparametersensitivityarenew additions. Furthermore, we now also make our implementation of the presented algorithms for computing local scores underthePCARTmodelpubliclyavailable.1
Theremainderofthisarticleisorganizedasfollows.Section2reviewsthenecessarypreliminaryconceptsandnotation regardingBNs, CARTs,andtheBayesianapproach tolearning. We introducethePCART modelinSection 3andourexact algorithms for learning a PCART from data inSection 4.Section 5is devoted to empirical studies. Last, we concludein Section6.Foreaseofreading,somedetailedmaterialispostponedtotheappendix.
2. Preliminaries
ThissectionreviewssomebasicsofBNsanddecisiontrees,includingtheBayesianapproachtolearningthemfromdata. We will focus on the key terminology andnotation neededin later sections; for a more comprehensivediscussion, the readeris referred to the literature on graphicalmodels [6, Chs. 3, 5, and 18] and score-basedlearning of decisiontrees [19,21].
2.1. Bayesian networks
Consider a multivariateprobability distribution p
(
X1,
X
2,
. . . ,
X
n),
whereeach variable Xv takesvaluesfromsome setSv.Wewrite V fortheindexset
{
1,2,. . . ,
n}
andindexbysubsets:e.g.,ifP⊆
V,wewrite XP forthetuple(
Xv)
v∈P andSP fortheCartesianproduct
×
v∈PSv;herewefollowtheconventionthattheindicesaretakeninincreasingorder.Wesaythatthedistribution
p
andadirectedacyclicgraph(DAG)G
=
(
V,
A
)
formaBNifthejointdistributionfactorizes into a product ofthe CPDs p(
Xv|
XGv),
where Gv= {
u:
(
u,
v
)
∈
A}
is the set ofparents
of v in G. When referring to aparticularCPD,wecall Xv the
response
variableandeach Xu,withu
∈
Gv,apredictor
variable.There are various ways to parameterize a CPD p
(
Xv|
XP),
with P⊆
V\ {
v}
.Here andhenceforth we may indexthepredictorvariablessimplyby
P
whenthereisnoneedtoconsidertheDAGonV
.AmongthesimplestonesareCPTs,which assigna distinctparameterforeachjointconfigurationoftheinvolvedvariables,assumingtheir statespacesare finite.In whatfollows,wewillfocusonmoresophisticatedCARTmodels,whichgeneralizethetabularCPTs.2.2. Classification and regression trees
InCARTmodels,thestructure isrepresentedasa decisiontree.Adecisiontreeoveramultidimensionalspace SP isa
recursive, axis-parallel,binary partition ofthespace. Moreformally,let T be a rootedbinary tree whereeach node R is a subsetof SP oftheform
×
u∈PRu.WecallT
adecision tree
of SP ifits rootis SP andevery inner node R anditstwochildren Rand R differinexactlyonedimension
u
∈
P forwhich{
Ru,
R
u}
isapartitionof Ru.The“decision”associatedwithnode R iswhetherthevariable Xu belongsto Ru or Ru.Consequently,theleavesof T partition SP.Adecisiontree
Tv of SP andaCPD p
(
Xv|
XP)
arecompatible
witheachother if, forall leaves R,we havethat Xv isindependentof XPconditionallyontheeventXP
∈
R.Inotherwords,theCPDisconstantovereachleaf Randpiecewiseconstantover SP.Inordertoparameterizethedistributionoftheresponsevariable Xv ineachleaf,wedistinguishbetweenadiscreteand
acontinuousdistribution.Inthiswork,wefocusontwostandardcases:
Categorical distribution: IfSv isfinite,welet
θ
x betheprobabilitythat Xv=
x,foreachx
∈
Sv.Normal distribution: IfSv isthesetofreals, Xvisnormallydistributedwithmean
μ
andvarianceσ
2.InthecontextofBNswhereeachvariable Xv appearsastheresponse,we understandthattheparametersaredistinct for
all v andallleavesofTv,andomitemphasizingthisinthenotation.Dependingonwhethertheresponseiscategoricalor
continuouswecallthedecisiontreeequippedwiththeCPDa
classification tree
oraregression tree
,generallyaCART
. Whatismaybenotimmediatefromtheabovedescriptionisthatwealsomodelordinalresponsesbycategorical distri-butions:weenforcetheparametersθ
x beequalforsomesubsequentvaluesx
.WegivedetailsinSection2.5.2.3. Bayesian structure learning
To learn the model structure fromdata, we take a Bayesian score-based approach.We only note in passing that the derivationsofSections3and4alsoapplytootherscoringschemes,especiallypenalizedmaximum-likelihoodscores.
Considerasetofdatapoints
X
i=
(
Xi1
,
X
2i,
. . . ,
X
ni),
i
=
1,2,. . . ,
N
.Wemodelthedatapointsasindependentdrawsfromadistribution pwhosestructure,namelyaDAG
G
=
(
Gv)
v∈V withadecisiontreeTvof SGv foreachv
∈
V,areunknown;theparametersoftheCPDsareunknownaswell.Wewilldescribeamodularpriorofthestructureandtheparametersand obtainthescoreastheposteriorprobabilityofthestructure,bymarginalizingouttheparameters.
Consider adecisiontree
T
v ofSP;inourcontext P=
Gv.Foreachleafof Tv,weassignstandard conjugatepriorsfortheparametersofthecategoricalandnormaldistributions:
Dirichlet: Let
θ
∼
Dirichlet(α
),
whereα
=
(
α
x)
x∈Sv arehyperparameters.Normal-inverse-gamma: Let
σ
2∼
IG(ν
/2,
ν
λ/2)
andμ
|
σ
2∼
N(μ
¯
,
σ
2/
a),
whereν
,λ,
μ
¯
,anda
arehyperparameters.When an ordinal response is modeled by a categorical distribution, a Dirichlet prior is, however, only applied after the valuesarepartitionedintosomenumberofbins;wegivedetailsinSection2.5.
While we could set thehyperparameters separately foreach response variable Xv, inour experiments we will focus
on the casewhereall are setto constant values;see Section 5.1forthe defaultvalueswe used inour experimentsand Section5.6forasensitivityanalysis.
By letting theparameters of all leavesof Tv be independent,we obtaincomputationally convenient formulasfor the
marginallikelihoodfunction.Foraleaf R,let IR
= {
i:
XiP∈
R}
betheindexsetofthedatapointsthatbelongto R.If Xv iscategorical,weobtainthemarginallikelihoodof
T
v asv
(
Tv)
=
Rleaf ofTv(
xα
x)
(
NR+
xα
x)
x(
NRx+
α
x)
(
α
x)
,
where NR
= |
IR|
andN
Rx= |{
i∈
IR:
Xiv=
x}|
.If Xv iscontinuous,wegetv
(
Tv)
=
Rleaf ofTvπ
−NR/2(λ
ν
)
ν/2√
a√
NR+
a((
NR+
ν
)/
2)
(
ν
/
2)
(
sR+
tR+
ν
λ)
−(NR+ν)/2,
where sR
=
(
NR−
1)σ
R2,tR=
NRa(
μ
R− ¯
μ
)
2/(
NR+
a),
andμ
R isthe sample mean andσ
R2 the sample variance of theNR values Xiv with
i
∈
IR [21,Eq. 14].Foran ordinal Xv we derivethelikelihood functionanddiscussitscomputationalcomplexityinSection2.5.
Furthermore, we let the parameters be independent for each v, so that the marginal likelihood of the structure
(
Gv,
Tv)
v∈V factorizesintoaproductofthelocalmarginallikelihoodsv
(
Tv).
Wealsoassignamodularstructurepriorp
(
Gv,
Tv)
v∈V∝
v∈V
π
v(
Gv,
Tv) ,
whereeach
π
v issome(unnormalized)distribution.Intheliterature,variouspriorshavebeenproposedbothforDAGs(see,e.g.,AngelopoulosandCussens [27] andreferencestherein)andfordecisiontrees[19,21].Ourchoicesforthepriorsdepend ontherestrictionswemaketotheconsidereddecisiontrees,andwillbedescribedinSection 3.
From analgorithmsstandpoint,thestructurelearningproblemnowbecomes thatoffindingastructure
(
Gv,
Tv)
v∈V soastomaximizetheposteriorprobability
constant
×
v∈Vπ
v(
Gv,
Tv)
v(
Tv) .
(1)Astheunspecified(normalizing)constantisthesameforallstructures,itcanbeignored. 2.4. Posterior predictive distribution
Givenasetof
N
datapoints X(N):=
(
Xi)
Ni=1,thedistributionforanewdatapoint XN+1 iscalledtheposteriorpredictive
distribution.We willconsidera settingwherewe firstlearnamodelstructure M
ˆ
:=
(
Gˆ
v,
Tˆ
v)
v∈V from X(N),andthenthedistribution of XN+1 is obtained conditionally on both the learned structure andthe N data points. The distribution of interestisthus
p
(
XN+1|
X(N),
Mˆ
)
=
p(
X(N+1)| ˆ
M)/
p(
X(N)| ˆ
M),
andconsequently,themixeddensity–massfunctionof XN+1 isobtainedastheratioofthemarginallikelihoodsofM
ˆ
forX
(N+1)andX(N).This,inturn,amountstoaproductvariable-wise ratiosoverv
∈
V.Considertheratioforasingle
v
∈
V.Write y forthevalueofXN+1v andRfortheleafofT
ˆ
vthatmatchesthevaluesofXGN+1
(
NR+
xα
x)
(
NR+
1+
xα
x)
(
NR y+
1+
α
y)
(
NR y+
α
y)
=
NR y+
α
y NR+
xα
x.
If the variable Xv is continuous, we get the densityfunction of a non-standardized Student’st-distribution2 with
ν
degreesof freedom, location parameter
μ
¯
, andscale parameterλ
(
a+
1)/a, whereν
,μ
¯
,λ
,anda
are theposterior counterpartsofthehyperparametersν
,μ
¯
,λ,
anda
,obtainedbya
:=
a+
NR,
μ
¯
:=
(
aμ
¯
+
NRμ
R)/
a,
ν
:=
ν
+
NR,
λ
:=
sR
+
tR+
ν
λ
/(
aν
) ,
with sR andtR as definedbefore; for a derivation (up to some differences in the parameterization), see,e.g., Murphy’s
notes [28,Sections 5.5and6.5].
Weconsiderthecaseofanordinal Xv inthenextsubsection.
2.5. Extension to ordinal response variables
Theprevioussectionsfocusedonthecasesofcategoricalandcontinuousresponsevariables.Wenowextendthemodel to alsoaccommodate ordinalresponse variables.Tothisend, we introducea Bayesian histogramwherethebinlocations andwidthsaretreatedasauxilliaryparametersand“integratedaway”inthemarginallikelihoodforeachleaf Rinagiven decisiontree
T
v;bymultiplyingtheseleafscoreswegetthelikelihoodv
(
Tv)
asbefore.Consider aresponsevariable Xv.Weassume that Xv takesa finitenumberofdifferentvalues, Sv
= {
1,2,. . . ,
k}
,withsomeinteger
k
.Similarlytoacategoricalresponse,weletθ
x denotetheprobabilitythat Xv takesvaluex
.However,insteadofassigning the vector
θ
a Dirichletdistribution, we assigna prior that constrains the parametersθ
x be equal forsomeconsecutivevalues x. More precisely, givena segmentation B
= {
B1,
B2,
. . . ,
B
b}
of Sv, that is, a partition of Sv intosetsofconsecutivenumbers,called
bins
,weletθ
x:=
β
j/
|
Bj|
forall x∈
Bj.Now,weletβ
∼
Dirichlet(α
(1),
α
(2),
. . . ,
α
(b))
withsome
α
(j).Observethattheparametersθ
x sumupto1,asrequired.
Tocompletethemodel,weletthepriorprobability ofa segmentation B with
b
binsbeproportionaltoκ
b,withsomeκ
>
0,andletα
(j):=
x∈Bj
α
x.Thusthemodelhask
+
1 hyperparameters:α
1,
α
2,
. . . ,
α
k,andκ
.Proposition1.
The marginal likelihood of a decision tree T
vis given byv
(
Tv)
=
R leaf of Tv wv(
R) ,
with wv(
R)
:=
B segm. of Svκ
b−1(
1+
κ
)
k−1(
jα
(j))
(
NR+
jα
(j))
×
b j=1(
N(Rj)+
α
(j))
(
α
(j))
N(Rj)(
NRx)
x∈Bj|
Bj|
−N (j) R,
where N(Rj):=
x∈BjNRxis the total number of data points in Bj.
Proof. Theprobabilityofthedata
(
NR1,
NR2,
. . . ,
N
Rk)
givenasegmentation B= {
B1,
B
2,
. . . ,
B
b}
canbewrittenas NR NR1,
NR2, . . . ,
NRkp
(β
|
B)
b j=1β
j|
Bj|
N(Rj) dβ ,
where
p
(β
|
B)
isthedensityfunctionoftheDirichletdistribution.Bywriting NR NR1,
NR2, . . . ,
NRk=
NR NR(1),
NR(2), . . . ,
N(Rb)b j=1 N(Rj)
(
NRx)
x∈Bj,
takingtheterms
|
Bj|
−N (j)R outoftheintegral,andusingtheknownclosed-formexpression fortheremaining integral,we
getthesummandin
w
v(
R),
withoutthefactorκ
b−1/(1
+
κ
)
k−1,whichisthepriorprobabilityofB
;theunnormalizedpriorprobabilities
κ
b sumuptoκ
(1
+
κ
)
k−1,sincethereareexactlyk−1 b−1segmentationswith
b
nonemptybins.2
Proposition2.Given the counts N
R1,
NR2,
. . . ,
NRk, the term wv(
R)
can be computed in time O(
k2logNR)
.2 Thedensityfunctionf
(y)ofthenon-standardizedStudent’st-distributionwithddegreesoffreedom,locationparameterm,andscaleparameters,is givenby f(y)= (d+1)/2 (d/2) πds−1/21+(y−m)2 ds −(d+1)/2
Fig. 1.An example of a PCART model. It represents the conditional probability densityp(X|Y,Z), whereX∈(0,10),Y∈(0,100), andZ∈ {a,b,c,d}. Proof. Since
jα
(j)equalsα
+
:=
xα
x forall segmentationsB,thetermκ
−1(1
+
κ
)
1−k(
α
+)/(
NR+
α
+)
inthe sum-manddoesnotdependonB
andcanbefactoredout.Theremainderofthesummandfactorizesintoaproductofb
terms, one for each Bj. In total, there are k+1 2different terms, one for each bin
[
s,
t]
:= {
s,
s
+
1,. . . ,
t}
with1≤
s,
t≤
k. To showhowthesetermscanbeprecomputedwithintheclaimedtime,denotebyN
R(s,t)andα
(s,t)thecorrespondingnumbersN(Rj)and
α
(j),i.e.,when Bj
= [
s,
t]
.Observefirstthat thenumbers NR(s,t)andα
(s,t) aresimplesumsthatcanbecomputedincrementally, i.e., by addingNRt and
α
t tothe alreadycomputedNR(s,t−1) andα
(s,t−1).Thus, assumingthe valuesofthegammafunctionhavebeenprecomputed,theirratiocanbecomputedinconstanttime perbin.Similarly,themultinomial coefficient is updated bymultiplying it bythe binomialcoefficient
N(Rs,t)NRt
,which we assumeto be accessibleinconstant time (e.g., viathree evaluationsofthe gammafunction). Finally,the binlength t
−
s+
1 israised to theexponent N(Rs,t) by repeatedsquaringintime OlogNR(s,t).Let f(
s,
t)
denotetheprecomputed factors,1≤
s,
t≤
k. Considerthe function g(
t)
definedby g(0)
:=
1 and g(
t)
:=
ts=1g(
s−
1)f(
s,
t)
for1≤
t≤
k.Wehavethatw
v(
R)
=
g(
k)
(
α
+)
(
NR+
α
+).
It remainstoobservethatg
(
k)
canbeevaluatedintimeO
(
k2).
2
Wecanreducetherunningtimeto O
(
k2)
byprecomputingthepowersl
mforall1≤
l≤
kand1≤
m≤
N.Eventhoughtthe precomputation takestime andspace ororder kN, thiscan be advantageous when wv
(
R)
iscomputed formany vandR.
Fortheprobability massfunctionoftheposteriorpredictive distribution,we arenot awareany“closed-formformula,” butthefollowingresultisimmediateandsufficientforthecomputationalpurposes:
Proposition3.
Given a decision tree
Tv of Sv and N data points, the response equals y in a new data point with probabilitywv
(
R)/
wv(
R)
, where R is the unique leaf of Tvthat matches the new data point, wv(
R)
is as defined in Proposition1, and wv(
R)
isthe same when NR yis replaced by NR y
+
1. 3. Partition–dyadicCARTPartition–dyadic CART (PCART)is a subclass ofthe CART model, which we obtain by restricting the wayin which the rangeofacontinuouspredictorvariableisrecursivelypartitionedbythedecisiontree.Specifically,weassumetherangeis an interval andonlyallowhalvingit initsmidpoint,whence thenamedyadicsplits. Forthenumberofrecursivedyadic splitspervariablewesetanupperbound,
the maximum number of splits
,denotedbys
.Incontrast,foracategoricalpredictor weallowarbitrary(binary)partitioning.Fig.1showsanexampleofaPCARTmodel.PCART generalizes dyadic decision trees [23], which assume all predictors be continuous (or binary) and the response be categorical. The motivation fordyadic splits stems from their ability to guarantee both statistical andcomputational efficiency:dyadicdecisiontreesenablecomputationallyfeasibleglobaloptimizationondatasetswithuptoaroundadozen ofpredictorvariables,andtheyachieve(inaminimaxsense)nearlyoptimalratesofconvergenceforarangeofclassification problems[23,24].
3.1. Structure prior
Havingfixedthefamilyofdecisiontrees,we proceedtospecifyapriordistributionoverthefamily.Toconformto the modularstructureofthelikelihoodfunction,weconstructapriorwheretheprobability
p
(
T)
ofatreeT
factorizesoverthe leavesofT
.Thisyieldsa decomposablescoringfunction,whichpropertyiscrucialfortheoptimizationalgorithmwegive inthenextsection.We consider priors that assign p
(
T)
∝
γ
l(T), wherel(
T)
is the number of leaves of T and 0<
γ
≤
1 is a constant independent ofT.Importantly, wedonotsetγ
toanyfixedvalue,butwe insteadletγ
dependon thenumberofsplits possibleineachinnernode.Withoutsuchdependency,thepriorcoulddistributenearlyallprobabilitytoverylargedecision trees,astheirnumberoverwhelmsthenumberofsmalltrees.Moreprecisely,weset
γ
=
1/(4C),
whereC
isthetotalnumberofpossiblesplitsataninnernode(ignoringsplitsmade inotherinner nodes).WehavethatC
isasumwhereeach continuouspredictorcontributes1,eachcategoricalpredictor withk
possiblevaluescontributes2k−1−
1,andeachordinalpredictorwithk
possiblevaluescontributesk
−
1.Thefactor4 comes fromthe numberofordered fullbinary treeswith
i
innernodes,whichisgivenbythe ithCatalannumberand equalling to 4i up toa factor polynomial in i. We can interpretthe prior as firstdrawing the number ofleaves froma nearly uniformdistribution,then drawing arandom full binary treewiththegivennumber ofleaves, drawinga random splitindependentlyforeachinnernodeofthetree,andacceptingthetreeifitisavaliddecisiontree.For an illustration, consider the decisiontree shown inFig. 1. In thiscase we have one categoricalpredictor with4 possible values andone continuous predictor.Consequently, the constant C equals 24−1
−
1+
1=
8,whenceγ
=
1/32. Thus, the prior probability of theshown treewith 5leaves is 324=
220 timessmaller than the prior probability oftheindependencemodel,i.e.,thetreewheretherootistheonlyleaf. 3.2. PCART as a local model
ToincorporatePCARTaslocalmodelsinBNs,weneedtospecifythecomponents
π
v(
Gv,
T
v)
ofthestructureprior.LetP
=
Gv⊆
V\ {
v}
with|
P|
≤
D;themaximum indegree
parameterD
controlsthecomputationalandstatisticalcomplexityofthemodel.Forbrevity,write
T
forT
v.Weusethefactorizationπ
v(
P,
T)
=
π
v(
P)
π
v(
T|
P)
c(
P) ,
where
c
(
P)
normalizesπ
v(
T|
P)
intoaprobabilitydistributionforT
.Itremainstospecifyπ
v(
P)
andπ
v(
T|
P).
Weconsidertwodifferentchoicesforboth
π
v(
T|
P).
Our first choice is to set
π
v(
P)
=
1, i.e., the prior is nearly uniform over parent sets of size at most D.3 This is thesimplestandmost“ignorant”prior, whichtreatsall DAGsequallyapriori;foradeeperdiscussionofgraphpriorswerefer toAngelopoulosandCussens[27] andreferencestherein.Wegetthebasicvariantofourmodel:
PCART (partition–dyadic CART): Let
π
v(
T|
P)
=
(4
C)
−l(T) foralldecisiontrees T of SP withatmostsdyadicsplitspercon-tinuousvariable.
Oursecondchoiceisgivenby
π
v(
P)
=
1 n−1 |P|,i.e.,theprior is
nearly uniform over the sizes of the parent sets
.4 Thisprior avoidsconcentrating almost allprior probability onthe densestgraphs; see,e.g.,Friedman andKoller [29] for usage and discussionofthisprior.Wewillcallthisthepenalized
variantandrefertoitasPCARTp
.Through thelikelihood function,themodel definesajointlocal score
π
v(
Gv,
Tv)
v
(
Tv).
This,inturn,induces alocalscorefortheparentset,
g
v(
Gv),
obtainedbymaximizingthejointlocalscoreoverT
v.Therolesofthesetwokindsof localscoresin structurelearning willbeclarified inthenextsection (Proposition4).We note thattheresultingglobalscoring functionforDAGs,obtainedbytakingaproductoftheparent-wiselocalscores,isnotscoreequivalent,that is,twoDAGs thatencodethesameconditionalindependencerelationsmaygetdifferentscores.Itisanopenquestion,whetherthereare anynontrivialconstraintsunderwhichwewouldachievescoreequivalence.
3.3. Full table variants and adaptive discretization
RecallthatthestandardCPDmodelinBNsisthefulltablemodel,whereeachconfigurationtheparentvariable’sstates isassignedadedicatedcategorical(multinomial)distribution.ForcomparingPCARTtothisstandardmodel,wenextextend thefulltablemodelalsotocaseswheretheresponseorsomepredictorvariablesarenotcategorical.
If all predictor variables are categorical, the PCART and PCARTp models we just defined have the following tabular counterparts:
FT (full table): Supposingallpredictorvariablesarecategorical,let F beamaximaldecisiontreeof SP whereeachleafisa
singleton
{
xP}
withx
P∈
SP.Weletπ
v(
T|
P)
=
1 ifT
=
F andπ
v(
T|
P)
=
0 otherwise.Inaddition,π
v(
P)
=
1.FTp (penalized full table): LikeFTbutwith
π
v(
P)
=
1 n−1 |P| .Inotherwords,thesemodelsintroduce,foreachjointconfigurationofthepredictorvariables’states,adedicatedcategorical ornormaldistribution.IntheformercasethiscoincideswithastandardCPT;inthelattercase,itcoincideswithastandard conditionallinearGaussianmodel,albeitofasomewhatdegenerateform.
Tomake thetabular variants applicable tocontinuous and ordinalpredictors aswell, we introducethe following dis-cretization scheme. Foreach continuous or ordinal predictor u
∈
P, discretize its range of values intok
u bins matching3 ThepriorisuniformoverDAGsbutnotexactlyuniformoverthepossibleparentsforafixednode,becauseasmallersetofparentsiscompatiblewith alargernumberofDAGs.
the ku quantilesofthe empiricaldistribution.The numbers
k
u arefree structuralparameters ofthemodel,takingvaluesbetween 2 and some maximum M (weused M
=
7 inour experiments). We assignuniform prior over the jointvalues(
ku)
u∈P.Thisdiscretizationschemehastwo importantfeatures.Oneisthatitconsidersjointdiscretizationsofthepredictorsin relationtotheresponse,asopposedtosomesimplerschemesthatconsiderthevariablesseparatelyorinpredictor-response pairs. The other featureis that theschemedoes notfix anysingle discretization ofa continuous variableinthe BN,but the discretization mayvary betweenthe parent sets to which thevariable belongs.Both features enhance thepower of the discretization schemeas compared tothe simple schemes.The prize we have topay is that thecomputational cost of searchingfor optimaldiscretizations grows,as we needto consider all possibleconfigurations
(
ku)
u∈P for each P weencounter;wewilladdressthisissueinSection4.
4. Algorithms
Thestate-of-the-artalgorithmsforfindingagloballyoptimalBNstructureproceedintwosteps:
Step 1 (Candidate parent set scoring and pruning) Foreach v
∈
V and P⊆
V\ {
v}
with|
P|
≤
D compute the localscore gv(
P),
definedintermsofthelocalmodelandthedata(e.g.,priorandlikelihood).Pruneevery setthathasasubsetwithatleastaslargescore.Returntheremainingparentsetsandtheirscores. Step 2 (DAG search) ReturnaDAG
G
thatmaximizesthetotalscorev∈Vgv(
Gv).
ForStep 2wemayuseexistingcompletesolvers,e.g.,onesbasedonintegerlinearprogramming [15] orA* [14].Inour experimentsweusedtheintegerlinearprogrammingbasedsolver
GOBNILP
[15],sinceinourpreliminarystudyitappeared toscalewell.ItisworthnotingthatwecannotusetheconstraintprogrammingbasedsolverofvanBeekandHoffmann [16] becauseitassumesscoreequivalence,i.e.,thatthescoringfunctiongivesthesamescoreforallMarkov-equivalentDAGs.5OurconcernisStep 1.Toagreewiththedefinitionofthescoringfunction(1),wesetthelocalscoresby
gv
(
P)
=
max Tπ
v(
P,
T)
v(
T)
,
where
T
runsthroughalllocalstructuresonthepredictors P.Indeed,itiseasytoseethatthisassignmentguaranteesthat thealgorithmwillfindagloballyoptimalstructure:Proposition4
(Optimality). Let G be a DAG that maximizes
v∈Vgv(
Gv)
. Then, for each v∈
V , there exists a Tv such that thestructure
(
Gv,
T
v)
v∈Vmaximizes the scoring function (1).Proof. ForanyfixedDAG
G
thechoicesforthelocaltreesT
v areindependent.Therefore,wehavethat v∈V max Tvπ
v(
Gv,
Tv)
v(
Tv)
=
max (Tv)v∈V v∈Vπ
v(
Gv,
Tv)
v(
Tv)
.
Thus, choosingforeach
G
v atreeT
v thatmaximizesπ
v(
Gv,
T
v)
v
(
Tv)
yieldsastructure(
Gv,
T
v)
v∈V thatmaximizesthescoringfunction(1)
2
InthecaseofPCARTeachlocalscorerequiresoptimizationoveralargenumberofdecisiontrees.Moreover,thenumber of required local scores grows rapidly withthe total numberof variables
n
, beingn dD=0 n−1 dfor maximum indegree D.In theremainder ofthis section we detaila dynamic programmingalgorithm tosolve thisoptimization problem. We havedevelopedasimilaralgorithmforcomputingtherequirednormalizingterm
c
(
P),
whichisasum
overdecisiontrees; however,asthistermisindependentofthedataandfastertocompute,wewillomitamoredetaileddescription.ForFT (maximaldecisiontree),itsuffices toenumeratethe possiblejointdiscretizations ofthecontinuous predictors. For fixed v and P, thisrequires atmost
(
M−
1)|P| evaluationsof a closed-form expression; formoderate valuesof the maximumnumberofbinsM
,thecomputationalcostissmallcomparedtothatofPCARTstructureoptimization.4.1. PCART structure optimization
WegiveadynamicprogrammingalgorithmtofindanoptimalPCARTstructureforaresponse
v
andasetofpredictors P⊆
V\ {
v}
.The algorithm applies toany objectivefunction that factorizes intoa product oflocal scores f(
R)
over the leaves R ofthetree.OurBayesscoreisclearly ofthisformasboth thepriorandthelikelihoodfactorizeovertheleaves. Foreverypossibledecisiontreenode R= ×
u∈PRu,thealgorithmcomputes,inatop-downfashion,thebestscore f(
R)
forthesubtreebelowR.Thusthescoreoftheroot, f
(
SP),
isthemaximumscoreoverallPCARTstructures.Algorithm1PCARToptimization.
Input:Responsev,predictorsP⊆V\ {v}anddataX=(X1,X2,. . . ,XN)
γ←1/4u∈Pcontinuous1+ u∈Pcategorical(2|Su|−1−1)+ u∈Pordinal(|Su| −1) −1
Computethestructurepriorfactor functionLeaf-Score(y1,y2,. . . ,yN)
returnγ ×marginallikelihoodofresponsevaluesycomputedusingtheformulasinSections2.3and2.5 LetMbeanassociativearray
functionSubtree-Score(R,Y=(Y1,. . . ,YN)) Computesf(R)giventhatY isthesetofdatapointsinnodeR foreachcategoricalorordinalpredictoru∈Pdo Removemissingvaluesfromdiscretevariables
Ru← {Yu1,Yu2,. . . ,YN
u}
ifMdoesnotcontainelementRthen
MR←Leaf-Score(Y1
v,Y2v,. . . ,YN
v ) ConsiderthecasewhereR isaleafnode
ifN>0then
foreachcontinuouspredictoru∈Pdo Considerallsplitsincontinuousvariables LetRu= [x,y),Su= [x,y)
if(x−y)/(x−y)>2−sthen
LetR,RbesuchthatRu= [x,(x+y)/2),Ru= [(x+y)/2,y)andRP\{u}=RP\{u}=RP\{u} PartitionY into(Y,Y)suchthatYcontainsthedatapointsYisatisfyingYi
u< (x+y)/2
MR←max{MR,Subtree-Score(R,Y)×Subtree-Score(R,Y)}
foreachcategoricalpredictoru∈Pdo Considerallsplitsincategoricalvariables foreachpartition{C,C}ofRpintotwononemptysetsdo
LetR,RbesuchthatRu=C,Ru=CandRP\{u}=RP\{u}=RP\{u} PartitionY into(Y,Y)suchthatYcontainsthedatapointsYisatisfyingYi
u∈C
MR←max{MR,Subtree-Score(R,Y)×Subtree-Score(R,Y)}
foreachordinalpredictoru∈Pdo Considerallsplitsinordinalvariables foreachpartition{C,C}ofRpintotwononemptysetssuchthatmaxC<minCdo
LetR,RbesuchthatRu=C,Ru=CandRP\{u}=RP\{u}=RP\{u} PartitionY into(Y,Y)suchthatYcontainsthedatapointsYisatisfyingYi
u∈C
MR←max{MR,Subtree-Score(R,Y)×Subtree-Score(R,Y)} returnMR
Output:Subtree-Score(SP,X)
Thebestsubtreescorefornode Risobtainedbyconsideringallpossibilitiesforthenode:thenodemaybealeafnode, oritisaninnernodewithtwochildren Rand
R
.Thescoreofaleafiscomputedfromthetrainingdatapointscontained inR
asspecifiedinSections2and3.Foraninnernodethescoreiscomputedrecursivelyastheproduct f(
R)
f(
R)
ofthe scoresofthechildsubtrees.Therecursioncachesthealreadycomputedsubtreescorestoavoidcomputingthesamescore multipletimes.Aftercomputingthescores,anoptimaldecisiontreeisextractedbytracingthesplitsthatgavetheoptimal scores.Withourchoiceofscores,ithappensthatanode R thatcontainsnotrainingdatapointshastoalwaysbealeafnode, andthe recursiondoesnot need to advancefurther. Furthermore,iffor some categoricalorordinal variable u there are some valuesin Ru thatdonot appearinthetrainingdatathat iscontainedinR,we cansimplyconsidera smallernode
withtheunused valuesremoved. Thismaycause Ru to consist ofnonadjacentvaluesforsome ordinal variable
u
,butitdoesnotaffecttheresultsince weonlyconsidersplits respectivetotheorderingofthevariable,thatis,all thevaluesin Ru aresmallerthanthe valuesin Ru.Theseoptimizationssignificantly reduce thenumberofnodesthealgorithmhasto consider.ThepseudocodeforthealgorithmisshowninAlgorithm1.
WenextgiveanasymptoticupperboundforthenumberofnodesK thealgorithmconsidersandfortherunningtime ofthealgorithm.Theboundswillberatherloose:theyfailtoaccountforalltheoptimizationsandspecialpropertiesofthe datasets.
Proposition5
(Time complexity). Algorithm
1considers K≤
minN2A(k−1)−2C(
s+
1)B,
2Ak+B(s+1)−C(
r+
1)2C tree nodes, andthe total time complexity is O
n(2
kA+
B+
rC)
KlogK, where A, B, and C are the numbers of categorical, continuous, and ordinal predictor variables, respectively, k is the maximum number of categories, s the maximum number of recursive splits per continuous variable, and r the maximum number of values of any ordinal variable.Proof. Counting the possiblenodes simply asthe product of atmost 2k possible value subsets Ru
⊂
Su per categoricalvariable
u
∈
P,atmost2s+1 subsetsper continuousvariable,andatmostr+1 2≤
(
r+
1)2/2 subsets
perordinal variable,givesan upperbound K
≤
2Ak+B(s+1)−C(
r+
1)2C. Inthe casewhenthe numberoftraining datapoints N is low,weget abetter boundsimilarly toBlanchardetal. [24] (whoassume allvariables becontinuous): Sincethealgorithmonly con-siders nodesthat contain at least one training datapoint, we can bound the numberof nodes by the total number of containing nodes foreach datapoint. The number of value subsets that contain a givenvalue isat most 2k−1 foreachcategoricalvariable,
s
+
1 foreachcontinuous variable,and(
r+
1)2/4 for
eachordinal variable,yielding anupperbound K≤
N2A(k−1)−2C(
s+
1)B(
r+
1)2C.Now,forevery node the algorithmconsiders all possiblesplits—at most2k foreach categoricalvariable,one foreach continuousvariable,andatmost
r
−
1 foreachordinalvariable—andfindsthescorefromanassociativearraydatastructure inO
(log
K)
time.ThusthetotaltimecomplexityisO
n(2
kA+
B+
rC)
KlogK.2
Fig. 2.Anillustrationofthebinarynoderepresentation.TheexamplecorrespondstothePCARTinFig.1,andconsistsoftwoparts:3bitsforthecontinuous variableY (herethemaximumnumberofsplitsis2)and4bitsforthecategoricalvariableZ.Forthecontinuousvariable,thelengthoftheleading1-bits andthefollowing0-bit(ingray)indicatesthelengthoftheinterval,andtheremainingbitsencodethepositionoftheinterval.
Foran illustrationofthe complexitybound,considerthe decisiontreeshowninFig.1.Inthiscasewehaveone cate-goricalpredictorwith4possiblevaluesandonecontinuouspredictor.ByProposition5,thetimecomplexityis
O
(
nKlogK),
wherethenumberofconsideredtreenodesK isatmostmin{
64·
N(
s+
1),256·
2s+1}
.4.2. Implementation tricks
The dynamicprogrammingalgorithmforfindingan optimalPCARTstructure iseasilyimplementedusingrecursion,as describedinSection4.1.However,bycarefullytuningthememoryrepresentationofnodes,weareabletooptimizeourC++ implementationtorunanorderofmagnitudefaster.
Ourimplementationstoresnode
R
asabitvectorthatisaconcatenationoffixed-lengthbitvectors,eachofwhichstores the value subset Ru for a variable u∈
P. Foreach continuous variablewith maximum numberof splits s,the 2s+1−
1possible intervalsare storedas
(
s+
1)-bit binary numbers.Foreach categoricalvariable u,the subset ofvalues Ru⊆
Suis storedusing
|
Su|
bits.AnexampleoftherepresentationisshowninFig.2.The predictorvaluesofthedatapoints arealsoencodedastheminimalnodescontainingthem.Inpractice,64bitssufficeforstoringthenoderepresentations,which meansthattheoperationsperformedbythealgorithmonthenodestranslateintoefficientbitoperationson64-bitmachine words.Thedatacanbepartitionedin-placeusingsimplebitoperationsandbecauseofthecompactrepresentation,thedata willstaypresentinCPUcaches,makingtheinnermostloopsofthealgorithmveryefficient.
5. Empiricalstudies
Weempiricallycomparedthedifferentlocalmodelsinrelationto(a)structurerecovery,(b)predictiveperformance,and (c)runningtime.Tothisend,weusedbothsyntheticdatageneratedfromknownbenchmarknetworksandrealdatafrom variousdomains.
5.1. Data sets and hyperparameters
Asground-truthbenchmarknetworks,weusedAlarm,Insurance,andWater,allextractedfromthe
bnlearn
[30] repository.6 The networks contain 37, 27,and 32 variables respectively, and are thus suitable for finding the globally optimal DAG withmodern solvers. However,asthesebenchmarknetworkscontain onlycategoricalvariables,we didnotuse themfor evaluatingtheperformanceonmixedvariables.Forrealdata,weextractedallsuitabledatasetsfromtheUCImachinelearningrepository [31];moreexactly,we consid-eredalltheclassificationandregressiondatasetsthathaveatleast100datapoints,atleast8butfewerthan36variables, andthatconsist ofa singletable thatisfreelyavailable.We preprocessedthedatasetsbynormalizingallthecontinuous variables to havemeanzeroandvariance one andlimitingthenumberofcategoriesin eachcategoricalvariable to5by combiningtherarestcategoriesintoone.Wealsoremovedidentifiervariablesanddatapointsthatcontainmissingvalues, rejectingthedatasetswhereweneedtoremoveamajorityofthedatapoints.WealsorejectedthedatasetsMushroomand
ParkinsonsTelemonitoring,since thenetwork optimizationphaseusing
GOBNILP
took too muchtime forthem.SeeTable B.5(AppendixB)foracompletelistofthe52datasets.
Wesettheuserparametersofthemodelsasfollows.Thehyperparametersofthepriordistributionsweresetas
α
=
1/
2 (Jeffreys prior),μ
¯
=
0,anda=
ν
=
λ
=
1,exceptfor thesensitivitystudy (Section5.6) wherewe varied thelatter three values;forthesegmentationpriorofordinalresponses wesetκ
=
1/
4toapproximatelybalancethepriormassonsmallerFig. 3.ComparisonofgloballyoptimalDAGsearch(FT,FTp)withheuristicapproaches(HC,TABU)andaconstraint-basedalgorithm(PC)forthetaskof recoveringaground-truthnetworkstructurebasedondatasamplesofvaryingsize.TheevaluationmetricisanaveragestructuralHammingdistanceover tenindependentdatasamples.Errorbarsindicate(hereandinallfollowingplots)thestandarderrorinbothdirections.
andlargernumbersofbins.Wesetthemaximumnumberofsplitspercontinuousvariableto
s
=
5,themaximumnumber ofbinsinthefulltablevariantto M=
7,andthemaximumindegreeofaDAGto D=
4.5.2. Comparison of exact and heuristic algorithms for Bayesian network structure learning
As a preliminary small-scalestudy, we investigatedthe advantage offinding a globally optimal DAGas opposedto a merely high-scoring DAG.Here werestrict ourselves tostandard CPTbasedlocalscores. Tobe precise,we comparedour implementation of the FT local score (with GOBNILP for searching for the globally optimal DAG) with Scutari’s bnlearn implementation [30] ofhill-climbing(HC),TABUsearch,andtheconstraint-basedPCalgorithm.ExceptforthePCalgorithm, all theseimplementations usethesamescoringfunction (theBDscore withhyperparameters setto1
/
2,combinedwitha uniformstructureprior),thusenablingfaircomparisontoFT.Forreference,wealsocomparedtoglobaloptimizationunder thepenalized FTp score.We usedthePC-implementation withitsdefaultparameters:an asymptoticchi-squared test for themutualinformationteststatisticwithasignificancelevelof0.05.Fromthethreebenchmarknetworkswesampleddatasetsofdifferentsizes,tenrepetitionseach,andlearnedanetwork structureforeachdatasetandalgorithm.WethencomputedthestructuralHammingdistance(SHD) [32],whichmeasures differencesup to the equivalenceclass, betweeneach learnednetwork andthe groundtruth. Weaveraged the SHDs for eachalgorithmandsamplesizeoverthetenindependentrepetitions.TheresultinglearningcurvesareshowninFig.3.
WeobservethatFTandFTpoutperformtheotherthreeapproachesforAlarmandInsuranceoncethesamplesizeexceeds afew hundreddatapoints.ForWater,however,FTperforms substantiallyworse thanHCandTABU.Thiscanbeexplained by the observation that the BDscore alone (without a nonuniform structure prior) can be a very poorscoring function duetoapreferenceforoverlycomplexmodelsatrelativelysmallsamplesizes. Insuch cases,theinabilityoftheheuristic algorithmstofindtheratherdensegloballyoptimalDAGbecomesafortunateadvantage,asthey stopthesearchatalocal optimumthatisa relativelysparsesubgraph.ThepoorperformanceofFTjustifiestheinclusionoftheFTpvariant,which usesastructurepriorthatisknowntopenalizecomplexityappropriately [33].Unfortunately,
bnlearn
doesnotallowcalling HCorTABUwithanarbitraryscoringfunction,soitisnotpossibletodirectlycomparethegloballyoptimalFTpsolutionto thecorrespondingheuristicresults.The constraint-based algorithm (PC) hasadvantages atvery smallsample sizes, but it doesnot learn much from an increasedamount ofdataandeventually recovers the generatingnetwork worse thanthe score-basedlearning methods. (Onemight improve the performance of PC by letting its significance levelparameter depend onthe data size in some appropriateway,butthe
blearn
implementationdoesnotdothatbydefault.)AsidefromtheratherpathologicalcaseoftheWaternetwork,theobservationsfromthepreliminarystudyaddto previ-ouslypublished evidencefortheneedofglobaloptimization[17] andjustifiesourrestrictiontostudylearning withlocal structuressolelywithintheglobaloptimizationframework.
5.3. The effect of local structure on benchmark network structure recovery
Consider thenthesameexperimental settingasintheprevious section,butnowforcomparingthefourlocalmodels, PCART(p)andFT(p),andonlyexactalgorithms.Weobserve(Fig.4)thatbothPCARTvariantsyieldsmallerSHDsthanFTfor allgeneratingnetworksandsamplesizes. Withthesparsity-favoringDAGprioradded,FTpbecomescompetitiveforAlarm. Apossible explanation isthat Alarm hasthe smallestedge/noderatioof only1.24, whereas Insurance andWater have1.93 and2.06, respectively.Asmallaverage indegreeleaveslittle potentialforstructured CPDsasthemajority ofCPTs canbe accuratelyrepresentedwitha (small)probabilitytable.ForInsuranceandWaterthePCARTvariantsyieldfasterconvergence thanFTp.
Insummary,PCART(p)appearstohaveaslightadvantageoverthefull-tablevariantsinrecoveringthenetworkstructure, butnotineachandeverycase.Thisisnotsurprising,sincethereisnoreasontobelievethateveryconditionalprobability
Fig. 4.Accuracy of PCART(p) with FT(p) in structure recovery from data samples generated from three benchmark networks.
Fig. 5. LearningcurvesproducedbytheIntersection-ValidationmethodforfourUCIdatasets.Thelargestsamplesizeineachplotcorrespondstothe originaldatasize,forwhichallmethodsyieldthesamepartialHammingdistance(ofvaluezero)bydefinition.
table in theground-truthnetworksadmitsan economical tree-structuredrepresentation.Moreover, interms ofSHDit is oftenbeneficial tolearn aconservative(sparse) modelwhenthesample sizeis small [33].Forvery smalldata sets,even attemptingtolearnastructuredlocalmodelmayyieldsmallstructuraloverfittingeffectsthatoutweighthepossiblyhigher expressivenessofPCART.
5.4. Structure recovery on real data
For evaluating the accuracy of structure learning fromreal data, we apply a very recent methodcalled Intersection-Validation [34].ThemethodallowsustodrawlearningcurvessimilartotheexamplesinFig.4evenwhennoground-truth DAGisavailable.Themethodcheckswhichstructuralfeaturesalllearningalgorithms(e.g.,scoringfunctions)thataretobe comparedagreeuponwhenlearningfromtheentiredataset,andtreatsthemassurrogategroundtruth.Structureslearned atsmallersubsamplesoftheentiredatasetarethencomparedagainstthispartialnetworktocomputeanapproximation ofSHD,called
partial Hamming distance
.For each of the 52 UCI data sets under consideration and foreach ofthe four scoring functions, we learn networks fromsubsamplesof1
/
2,1/
4,1/
8,1/
16,1/
32,and1/
64oftheoriginaldatasizeandrepeatthesubsamplingprocesstentimes.The resulting averagedlearning curvesforthe fourscoring functionsincomparisonare showninFig. 5forthefirstfourdata sets according to an alphabeticalranking of their names, andinAppendix A forthe remaining 48 datasets. Byvisually inspectingthelearningcurves,wedrawsimilarconclusionsasinthebenchmarknetworkstudyfromtheprevioussection: FTiswidelyinferiortotheothermethods,likelyduetolearningoverlycomplexmodelsatsmallsample sizes.PCARTlocal modelsperformoften,thoughnotalways,slightlybetterthanFTp.Inorderto summarizetheresultsfromthe52datasetsinasystematicfashion,weaggregatedthembythefollowing procedure.Weconsideredeachfractionoftheoriginaldatasize(1
/
2,1/
4,etc.)asaseparategroup,irrespectiveoftheabsolute samplesize.Wethenmadeapairwisecomparisonofthefourlocalmodelsforeachofthesixgroups.Tothisend,weused the followingassessment:method Aperforms better/worsethan methodB(win/loss) ifityields a smaller/larger average partial Hammingdistanceandiftheerrors bars inthelearning curvesdo notoverlap. Iftheydo overlap,we considered bothmodelstoperformequallywell(tie).We show theseaggregated statistics inTable 1 for the two mostimportant modelcomparisons and inTable A.3for theremainingfourcases.WeobservethatthePCARTvariantsperformconsistentlybetterthanthefull-tablemodels, espe-cially atsmallsamplesizes. ThisdoesnotonlypertaintothecomparisontoFT,butalsotoFTp,whichdemonstratesthat structured CPDshelptorecovercorrectstructuralfeatureswithfewersamples,incomparisontofull-tablemodels.Thisis an interestingresultastheeffectwasnot thatobviousinFig.4,whichindicatesthat realdatasetstendtohavedifferent properties than syntheticdatasampled frombenchmarknetworks. An alternative todeclaring a tie betweenmethods in the eventofoverlappingerrorbarsistousea nonparametric statisticaltest (TableA.4).The numberoftiesdoeshere
in-Table 1
SummarizedmethodcomparisonbasedontheIntersection-Validationmethod.Samplesizereferstothefractionof thesizeoftheoriginaldataset.Atieoccursiftheerrorbarsoftwomethodsoverlap.Otherwisethemeanpartial Hammingdistancedecidesuponwinorloss.
(a) PCARTp vs FTp
Sample size Win Tie Loss
1/64 21 30 1 1/32 26 24 2 1/16 33 15 4 1/8 29 18 5 1/4 23 22 7 1/2 18 30 4 (b) PCART vs FT
Sample size Win Tie Loss
1/64 50 2 0 1/32 49 3 0 1/16 47 4 1 1/8 40 12 0 1/4 39 10 3 1/2 31 16 5 Table 2
SummaryofpredictionresultsonUCIdata.Atieoccursiftheerrorbarsof twomethodsoverlap.
Win Tie Loss
PCART vs PCARTp 1 51 0 PCART vs FT 20 23 9 PCART vs FTp 19 24 9 PCARTp vs FT 21 20 11 PCARTp vs FTp 18 23 11 FT vs FTp 0 51 1
creaseduetothesmallsamplesize(tenrepetitions),butamonginstanceswithastatisticalsignificantdifference,thePCART methodsoutperformtheirfulltablecounterparts.
TheseresultsonrealdatafurtherimplythatthePCARTvariantsareeffectivenotonlyforcategoricalvariables,butalso inthepresenceofmixed,orcompletelycontinuousdata.
5.5. Predictive performance
Inall previous studies,weviewedthe graphstructureasthelearning target. Anothercommonlyconsidered target for learning isthejointprobabilitydistribution.Onecan evaluatealearneddistribution,e.g.,by computingthe cross-entropy tothegroundtruthdistributionor,inabsenceofgroundtruth,byapproximatingthecross-entropybytheaveragelogloss inacross-validationexperiment.
Wefollowedthisapproachforall the52UCI datasetsandfourlocalmodels.Foreachcombinationwecomputedthe averagelogloss,calledhenceforth
log loss
forshort,byfirstfindingaglobally optimalmodelstructure,includingthelocal structure,inrelationtothetrainingdata,andthentakingthelosswithrespecttotheBayesianposteriorpredictivedensity ofeach testdatapoint.Allloglossvalues,alongwithstandarderrors,areshowninAppendixB.Table2showsawin–tie– losssummary basedonoverlapping errorbarsandTableB.6 showsan alternativebased onatwo-sidedWilcoxon-signed ranktest [35].Incontrasttothesummarystatisticsfortheintersection-validationstudy,werehereobservethesignificance testtoproducelesstiesthantheoverlappingerrorbarcriterion,whichcanbeexplainedbytherelativelylargesamplesize of50repetitions.Inboth caseswe seethatforpredictionit makes virtuallyno differencewhetherasparsity-favoring structure prioris used ornot: thepenalized and unpenalizedversionsofPCART andFTperform nearly identically.Models that are overly complex accordingto SHDcan still host the trueprobability distribution, andestimatingit from datais effectiveunless theirCPDsaresolargethatmassiveparameteroverfittingoccurs,whichisunlikelyevenformodelslearnedwithFT.
Aside from this aspect, we again observe a slight advantage of PCART compared to full tables, albeit in manycases bothtypesoflocalmodelsperformequallywell.Thisisnotsurprisingasithasbeenobservedthatmeasuringthelearned distributionshowslessdifferencesamongstructurelearningmethodsthanmeasuringthestructuralfeatures [34].
5.6. Hyperparameter sensitivity analysis
AllofthepreviousstudiesusedafixedsetofhyperparametersfortheBayesianpriors.While
α
=
1/
2(Jeffreysprior)and¯
μ
=
0 (meanofastandardnormal)canbeconceptuallymotivated,thechoiceofa
=
ν
=
λ
=
1 wasbasedonsimplicityand decentperformanceinpreliminarytests.Wenowstudyhowvaryingeachofthesethreeparameters(whilekeepingthetwo othersfixedattheir defaultvalue)affectsthepredictiveperformance underthefourlocalmodels,both inabsoluteterms andrelativetoeachother.The results forthe UCI data set concrete-compressive-strength are shownin Fig.6, whereas the resultsfor 14additional randomly chosen UCI datasets areshown inAppendixC.We findthat varying thehyperparametermayindeed havean effect.Ourdefaultvaluesaregenerallyagoodchoice,albeittheycouldbeimprovedinsomecases.
Fig. 6.HyperparametersensitivityanalysisforoneUCIdataset.Eachplotvariesonehyperparameter(x-axis)whilekeepingtheothertwofixed,showingthe effecttopredictiveperformanceasmeasuredbylogloss(y-axis).Dashedverticallineindicatesthedefaultvalue.Notethatforthisdatasetthedifference betweenthepenalizedandunpenalizedalgorithmvariantsisextremelysmall.
Fig. 7.Running time in seconds of Step 1 vs Step 2 under the four local models. See text for a description of the included data sets.
The more importantobservation, however,is that thedecision whetherPCART(p)orFT(p) issuperior fora particular datasetdoesnotheavilydependontheparticularchoiceofhyperparametervalues,exceptforthecaseofverylargevalues (suchas
λ
=
103 inFig.6)whereallmodelsperformpoorly.5.7. Running times
Last,weinvestigatedhowthechoiceofthelocalmodelaffectsthetime requirementsforthelocalscorecomputations (Step 1) and fortheDAG optimization (Step 2,by GOBNILP). Fig. 7showsthe measured running timesforthe datasets we sampledfromthethreebenchmarknetworks(Section5.3) andfortherealUCIdatasetswe includedinthestructure recoverystudy(Section5.4).
Weseethatunderallfourmodels,Step 1tendstoconsumemoretimethanStep 2.Thisobservationhighlightsthefact that
GOBNILP
solves the DAG optimizationstep faston typicalproblem instances,andthus the localscore computations can bea bottleneck,evenunderthestandard CPTmodel(e.g.,FTandFTp oncategoricaldatageneratedfrombenchmark networks).Nevertheless,therearealsodatasetsforwhichStep 1isfasterthanStep 2oralmostequallyfast;importantly, this holds alsounder PCARTandPCARTp. This observation, inturn, highlightsthe fact that DAG optimizationis hard in general, andthereforeonecanaffordcomputationallydemandinglocalscores.Itisworthnotingthatthetimecomplexity ofthelocalscorecomputationsgrowsonlypolynomiallyinthenumberofnodesn
,assumingaconstantmaximumindegree, whereas thecomplexity oftheDAGoptimizationstepissuspectedtogrowsuperpolynomially, thecurrentbestwort-case boundbeingexponentialinn
.5.8. Studies with ordinal variables
Inordertobenchmarkthealgorithmvariantsthatcan handleordinaldata,wesearchedforordinalvariablesinthe52 UCI datasetsfromtheprevious sections.In ordertoavoidadditionaldiscretization,we onlyconsidereddiscrete variables with lessthan 16 possible values.As a consequence,we treat an attribute such as“Age” (in years), which is ordinal in principle, ascontinuous variable nevertheless. From all datasets, we chose sixdata sets thathave atleastthree ordinal variables;seeAppendixD.1fordetails.
ForcomparingtheordinalPCARTvariants(OPCARTandOPCARTp)withtheordinalfulltablemodels(OFTandOFTp),we repeatedtheintersection-validationstudyfromSection5.4andthepredictionstudyfromSection5.5anddisplaytheresults inAppendixD.2andAppendixD.3(TableD.7).WeobservethatthePCARTvariantscontinuetoperformbetterthanthefull table variantsonaverage.Hence,thetreatmentofordinalvariablesassuch,insteadofbeingmodeledaseithercontinuous orcategorical,doesnotchangetherelativeperformanceofPCART-basedmodelsinrelationtotheirfulltablecounterparts.
Fig. 8.The effect of including ordinal variables on structure learning measured by the intersection-validation method.
Nevertheless,fromtheseresultsitisyetunclearwhethertheextension toordinalvariablesispractically useful.Hence, we investigatedhowwell theordinalvariants ofouralgorithmperforminrelationto thenon-ordinal variantsstudiedin theprevioussections.Whilethepredictiveprobabilitiesanddensitiesarenotcomparableanylongerassoonasavariable changesitstype,theinterpretationofthelearnedDAGsremainsthesame,allowingafaircomparisonamongthemethods. We thus applied the intersection-validation method similar to Section 5.4 and compared PCART, PCARTp, OPCART, and OPCARTp(Fig.8).Weobservethattheordinalvariantsyieldalmostneverasubstantiallyworse partialHammingdistance compared to their non-ordinal counterparts, and often there is even a slight improvement. This demonstrates that the extensionofouralgorithmstoordinalvariablespaysoffindeed.
6. Discussion
ThisworkrevisitedtheideaofincorporatingdecisiontreesaslocalmodelsinBayesiannetworks.Specifically,we intro-duceda newmodelclass,PCART,which canhandleboth discrete andcontinuous predictorandresponse variables,takes a Bayesianapproach,andadmitscomputationallyfeasible globaloptimizationinarangeofproblemsizes. Wealso intro-ducedananalogousfull-tablevariant,FT,whichextendsthestandardCPTofcategoricalvariablestocontinuousandordinal responseandpredictorvariables.
Empiricalcomparisonofthetree-structuredPCARTandthetabularFTshoweda clearadvantage ofPCARTinstructure learningandaslightadvantageindensityestimation.Whiletheresultswere obtainedmainlyunderafixed,globalchoice forthevaluesofthehyperparameters,theresultswereobservedtobe robusttochangesofthesevalues.Thatbeingsaid, thesensitivityanalysisalsosuggeststhattheperformanceofbothPCARTandFTcouldbefurtherenhancedbychoosingthe hyperparametervaluesseparatelyforeachgivendatasetandeachvariable,e.g.,bymeansofempiricalBayes.
Inourexperimentswesettheparametersthatcontrolthecomputationalcomplexityofthemodel(maximumindegree, numberof splits)so that thecomputations ofthelocal scores (Step 1)could be completed within a few hours foreach dataset.Thiswasmotivatedbythefactthatforsomeofthedatasets,thestate-of-the-artsolversrequirehourstofindan optimalDAG (Step 2).Thereare two simplewaysto expediteStep 1,ifneeded:by settingthe complexityparameters to lowervalues,andbyrunningthecomputationsfordifferentparentsetsinparallel.Anopenquestioniswhetheronecould gainfurtherreductionsinthe