Proof of Claims (d,e) of Theorem 4 (Case n = 2)

Claims (d,e) of Theorem 4 state that if n=2, YD=Y for N ≥N0 and Yi >0 for i=1, . . . ,2n,

lnI_[_N,YD_]_{(Eq. 13) is asymptotically equal to NP}₍_Y_|_wML₎₋3

2ln N+O(1)(Eq. 17) for Y 6∈S0 and asymptotically equal to NP(Y_|wML)−3₂ln N+2 ln ln N+O(1) (Eq. 18) for Y ∈S0. Similar to the

proofs of Claims (b,c), we first employ Lemma 6, which relatesI_[_N,YD_]_{with ˜}I_[_N,Y_]_{(Eqs. 28 and 30)}

and Lemma 7, which relates ˜I_[_N,Y_]_with J_[_N,Y_]_{(Eqs. 31 and 32). Consequently, it remains to}

evaluateJ_[_N,Y_{] =}_max_p

0∈U0Jp0[N]. For this task, one needs examine the neighborhoods of arbitrary

minimum points p0∈U0of the function f . From the definition ofϒ,ϒ0and S (Section 4) it follows that S=ϒ0=ϒfor n=2. Note that there is no regular points in this case. We now modify the proofs of type 1 and type 2 singularities to fit to the case n=2.

Type 1 singularity: The zero set U0is the same set as described by Eq. 43 with l=1 and k=2. The analysis of the form of the exponent functionφof the integrand ofJ_p

0[N]gives Eqs. 45 and 46

without the∑l₆=i,jclu2l terms. Thus, by the same analysis, the contribution of these regions to the

Type 2 singularity: The zero set U0=U¯0−∪U¯0+∪U¯01∪U¯02 is the same set as described by Eqs. 34 and 35. Now, however, ¯U0₋, ¯U0+, ¯U01 and ¯U02 are of the same dimension, namely, two. This fact changes the asymptotic approximation.

Consider the cases C1-C5 one by one. There is no change in cases C1 and C3 where the point (x0,u0,s0)lies on the proper two dimensional surfaces U01, U02or U0₋, U0+. Here, the functionφ

can be approximated by 3 variables, resulting in the contribution cN−3/2of these regions toJ_[_N,Y_]_.

The more complex situation is in C2, C4 and C5 cases, where zero planes of the same dimension meet. Generally, the intersection points of zero surfaces of the same dimension are expected to give rise to a ln ln N term. While this is not always a case, e.g., see example in Section 2, the ln ln N term does appear now. We have:

C2: The principal part ofφis x2₁+x2₂+u2₁u2₂, as specified by Eq. 37. Integrating out the x2_i terms we obtain through the analysis of the poles of J(λ) =R

u2₁λu2₂λdu1du2that the largest pole of

J(λ)isλ=₋1/2 with multiplicity m=2. Thus the contribution of this region toJ_[_N,Y_]_is

cN−3/2ln N.

C4: The principal part ofφis x2₁+x2₂+s2u₂2or x2₁+x₂2+s2u2₁ (see Eq. 48). Similarly to the case

C2, the contribution of this region toJ_[_N,Y_]_{is cN}−3/2_{ln N.}

C5: Here, the principal part ofφis x2₁+x2₂+s2u2₁u2₂(see Eq. 49). Once again, we integrate out the xi

variables and analyze the poles of J(λ) =R

s2λu2₁λu2₂λdsdu1du2. The largest pole isλ=−1/2 with multiplicity m=3, and thus the contribution of this region to J_[_N,Y_]_{, including the}

factors from integrating out the xi’s, is cN−3/2ln2N.

Summarizing the contributions of the neighborhoods of various critical points for Y _∈S0, we see thatJ_[_N,Y_]_∼_cN−3/2_ln2_{N and, consequently, ln}I_[_N,Y_{] =}_{N fY}₋3

2ln N+2 ln ln N+O(1). A.6 Proof of Theorem 4f (Case n=1)

Theorem 4f states that if n=1, YD=Y for N≥N0 and Y1,Y2 >0, then lnI[N,YD] (Eq. 13) is asymptotically equal to NP(Y_|wML)−1₂ln N+O(1)(Eq. 19). Once again, we first employ Lemma 6,

which relatesI_[_N,Y_D_]_{with ˜}I_[_N,Y_]_{(Eqs. 28 and 30) and Lemma 7, which relates ˜}I_[_N,Y_]_withJ_[_N,Y_]

(Eqs. 31 and 32). Consequently, it remains to evaluateJ_[_N,Y_{] =}_max_p

0∈U0Jp0[N]. For this task, one

needs examineJ_p

0[N]in the neighborhoods of arbitrary minimum points p0∈U0of the function f .

From the definitions of ϒ,ϒ0 and S0, for n=1, there is no distinction between different type of statistics andϒ=ϒ0=S0. Moreover, according to Theorem 2 the asymptotic form of the in- tegral Jp0[N] =

Uεe−

N(z1(x,u,s)−z01)2dx du ds is determined by the poles of J(λ) =R

Uε(z1(x,u,s)−

z0₁)2λ_{dx du ds, where, in this case, z}

1(x,u,s)−z01=x21. Once again, contributions of points p0∈U0 lying on the boundary of U can be ignored, since their domains of integration are smaller than do- mains of integration of the corresponding internal points. Thus, the largest pole of J(λ)isλ=₋1/2 with multiplicity m=1 and ln I[N,YD]is asymptotically equal to N fY−1₂ln N+O(1).

We can also compute the integralI[N,YD](Eq. 11) directly for n=1 and YD=Y . It is

I_[_N,Y_{] =}

(0,1)3e

where Y1=1−Y0. Ignoring the density µ(a,b,t)by using the assumption of bounded density (A1) and changing the variables to x=at+b(1₋t), we rewriteI_[_N,Y_]_{is asymptotically equivalent form}

˜ I[N,Y] = Z 1 0 Z 1 0 1 b₋a Z b a e N(Y0ln x+Y1ln[1−x])_{dx da db.} Consider now I_1[_N,Y_{] =} Z b a eN(Y0ln x+Y1ln(1−x))_dx

for some 0_≤a<b_≤1 (the case b>a is symmetric). This is the integral of the beta distribution

withα=NY0+1 andβ=NY1+1 (DeGroot, 1970, page 40). Let f(x) =Y0ln x+Y1ln(1−x). The maximum of the integrand function f(x)on[0,1]is achieved at x0=Y0and it is eN f(Y0). There are three cases to consider according to the location of x0relative to(a,b).

1. Internal point, x0=Y0∈(a,b). In this case

f(Y0+x) = f(Y0) +Y0ln 1+ x Y0 + (1₋Y0)ln 1₋₁₋x_Y 0 = f(Y0) +Y0 x Y0− x2 2Y02 +O(x3₎_{+ (}₁₋_Y 0) −x 1−Y0− x2 2(1−Y0)2+O(x 3₎ = f(Y0)₋_2Y 1 0(1−Y0)x 2₊_O₍_x3₎_.

Thus, in the small neighborhood of x0, f can be approximated by quadratic form and the classic Laplace approximation (Lemma 1) can be applied yieldingI_1[_N,Y_]_∼_c₁_eN fY_N−1/2_.

Moreover, since I_1[_N,Y_] _{and e}N f(Y0) _{are continuous functions of N and x}

0=Y0, uniform asymptotic bounds on I₁_[_N,Y_]_{exists for all x}₀ _{in a proper closed subset of} ₍_a,b₎ _{as N} _→

∞. I.e., the integralI_1[_N,Y_]_{is bounded within a constant multiplies of e}N fY_N−12 and these

constants are independent of x0and N for all x0∈[a+ε,b−ε]and N≥1. Note that the above approximation of f is only valid for Y06=0,1 (Assumption A2). Otherwise, the approximation of f is non-quadratic.

2. Border point, x0=Y0∈ {a,b}. The expansion for f(Y0+x)is the same, but the integration is performed only on the half of the interval, which results in half the constant factor to the final approximation compared with the previous case.

3. Maximum of f is outside of [a,b]. Let m denote the maximum of ef(x) _on _[_a,b_]_{, i.e., m}₌

maxx∈[a,b]ef(x). We haveI1[N,Y]≤(b−a)mN<c3eN fYN−1/2, for some appropriate constant

c3.

The above analysis shows thatI₁_[_N,Y_]_<_cuppeN fY_N−1/2_{for some constant c}

uppfor all a and b. Fur-

thermore,I_[_N,Y_]_>_cloweN fY_N−1/2_{for some c}

low>0 for(a,b)∈ {(a,b)|a<Y0,b>Y0,b−a>2ε> 0_}. Since the later region has a non-zero Lebesgue measure, it follows thatI_[_N,Y_]_∼_ceN fY_N−1/2

and lnI_[_N,Y_{] =}_{N fY}₋1

2ln N+O(1). Appendix B. Proof of Theorem 5

Theorem 5 states the asymptotic approximation for the marginal likelihood given a degenerate bi- nary naive Bayesian model M that has m missing links. In order to prove this theorem we examine the log-likelihood function of the degenerate model and decompose it into a degenerate part and

a naive Bayesian part. These parts define two probability functions that are independent and the marginal likelihood of data is computed relevant to each one of them. Combining the results gives Theorem 5.

Letψbe the log-likelihood function of the marginal likelihood integral (Eq. 20) for the degenerate binary naive Bayesian network described in Theorem 5. We have

1 Nψ(a,b,t,c) = ∑xYxlnθx(ω) = ∑xYx lnθ_(x₁_,...,_x_n₋_m₎(a,b,t) +∑n i=n−m+1(xiln ci+ (1−xi)ln(1−ci)) = ∑(x1,...,xn−m) lnθ_(x 1,...,xn−m)(a,b,t)·∑(xn−m+1,...,xn)Yx +∑n i=n−m+1(∑xYxxiln ci+∑xYx(1−xi)ln(1−ci)) = ∑(x1,...,xn−m)Y(x1,...,xn−m)lnθ(x1,...,xn−m)(a,b,t) +∑n i=n−m+1(Yiln ci+ (1−Yi)ln(1−ci))

where(x1, . . . ,xk)are binary vectors of length k, Y(x1,...,xn−m)=∑(xn−m+1,...,xn)Y(x1,...,xn)and Yi=

∑(x1,...,xi−1,1,xi+1,...,xn)Yx. The new statistics Y(x1,...,xn−m)and Yi’s are positive, because Y is positive (A2).

Using the assumptions of bounded density (A1) and stable statistics (A3), the marginal likelihood integralI_[_N,Y_]_{(Eq. 20) can be rewritten as}

I_[_N,YD_]_∼Iˆ_[_N,Y_{] =} " n

∏

i=n−m+1 Z 1 0 cNYi i (1−ci) N(1−Yi)_dci # Z (0,1)2n−2m+1e N∑x˜Yx˜lnθx˜(ω)_dω_. ₍₅₀₎

where ˜x= (x1, . . . ,xn−m). The first m integrals are integrals over the beta distribution (DeGroot,

1970, page 40) and Z 1 0 cNYi i (1−ci)N(1−Yi)dci= Γ(NYi+1)Γ(N(1₋Yi) +1) Γ(N+2)

The asymptotic behavior of Gamma function is well understood and it is described by Stirling formula,Γ(z) =e−zzz−12√2π1+O(z−1)(Murray, 1984, page 38), and thus lnΓ(z) =−z+ (z−

2)ln z+O(1). Using the equality ln(Y N+1) =lnY N+O(1), we obtain

lnΓ(NYi+1)Γ(N(1−Yi)+1) Γ(N+2) = (NYi+12)ln(NYi+1) + (N(1−Yi) +12)ln(N(1−Yi) +1)−(N+32)ln(N+2) +O(1) = (NYi+1₂)ln NYi+ (N(1−Yi) +1₂)ln N(1−Yi)−(N+3₂)ln N+O(1) =−1 2ln N+N(YilnYi+ (1−Yi)ln(1−Yi)) +O(1).

Hence, the contribution of the first m integrals to ln ˆI_[_N,Y_]_{is N ln p}₍_Y_n₋_m+₁_{, . . . ,Y}_n_|_c_ML₎₋m

2ln N. The second integral in Eq. 50 is exactly of the type analyzed in Theorem 4, and the theorem follows by summing up the contributions of these two parts.

References

Shreeram S. Abhyankar. Algebraic Geometry for Scientists and Engineers. Number 35 in Mathe- matical Surveys and Monographs. American Mathematical Society, 1990.

Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic

Control, 19(6):716–723, December 1974.

M.F. Atiyah. Resolution of singularities and division of distributions. Communications on Pure and

Applied Mathematics, 13:145–150, 1970.

Peter Cheeseman and John Stutz. Bayesian classification (AutoClass): Theory and results. In U. Fayyad, G. Piatesky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge

Discovery and Data Mining, pages 153–180. AAAI Press, 1995.

Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, October 1992.

Morris H. DeGroot. Optimal Statistical Decisions. McGraw-Hill Book Company, 1970.

Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian network classifiers. Machine Learn-

ing, 29(2-3):131–163, 1997.

Dan Geiger, David Heckerman, Henry King, and Christopher Meek. Stratified exponential families: Graphical models and model selection. Annals of Statistics, 29(2):505–529, 2001.

Dan Geiger, David Heckerman, and Christopher Meek. Asymptotic model selection for directed networks with hidden variables. In Eric Horvitz and Finn Jensen, editors, Proceedings of the

Twelfth Conference on Uncertainty in Artificial Intelligence, pages 283–290. Morgan Kaufmann

Publishers, Inc., 1996.

Dominique Haughton. On the choice of a model to fit data from an exponential family. Annals of

Statistics, 16(1):342–355, 1988.

David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The com- bination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.

Heisuke Hironaka. Resolution of singularities of an algebraic variety over a field of characteristic zero. Annals of Mathematics, 7(1,2):109–326, 1964.

Christine Keribin. Consistent estimation of the order of mixture models. Sankhya, Series A, 62(1), February 2000.

Serge Lang. Complex Analysis. Springer-Verlag, 3rd edition, 1993.

Steffen L. Lauritzen. Graphical Models. Number 17 in Oxford Statistical Science Series. Clarendon Press, 1996.

James D. Murray. Asymptotic Analysis. Number 48 in Applied Mathematical Sciences. Springer- Verlag, 1984.

Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Mor- gan Kaufmann, 1988.

Dmitry Rusakov and Dan Geiger. Asymptotic model selection for naive Bayesian networks. In Adnan Darwiche and Nir Friedman, editors, Proceedings of the Eighteenth Conference on Un-

Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978. Raffaella Settimi and Jim Q. Smith. On the geometry of Bayesian graphical models with hidden

variables. In Gregory F. Cooper and Serafin Moral, editors, Proceedings of the Fourteenth Con-

ference on Uncertainty in Artificial Intelligence, pages 472–479. Morgan Kaufmann Publishers,

Inc., 1998.

Raffaella Settimi and Jim Q. Smith. Geometry, moments and conditional independence trees with hidden variables. Annals of Statistics, 28:1179–1205, 2000.

Peter Spirtes, T Richardson, and Christopher Meek. The dimensionality of mixed ancestral graphs. Technical Report CMU-PHIL-83, Philosophy Department, Carnegie Mellon University, 1997. Sumio Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation,

13(4):899–933, 2001.

Roderick Wong. Asymptotic Approximations of Integrals. Computer Science and Scientific Com- puting. Academic Press, 1989.

In document Asymptotic Model Selection for Naive Bayesian Networks (Page 30-35)