Formulation of the Tree Approximation Problem as a Detection Problem and Relation between the AUC and Information Theory Divergences

(1)

Formulation of the Tree Approximation Problem as a

Detection Problem and Relation between the AUC

and Information Theory Divergences

Navid Tafaghodi Khajavi and Anthony Kuh

∗

Deptartment of Electrical Engineering, University of Hawaii, Honolulu, HI 96822 Email: {navidt, kuh}@hawaii.edu

Abstract

This paper looks at graphical models and discusses the quality of tree approximations by ex-amining information measures and formulating the problem as a detection problem. One of the widely used algorithms for tree-structured approximation and modeling is the Chow-Liu algo-rithm. While this algorithm is optimal for Gaussian distributions in the sense of the Kullback-Leibler (KL) divergence, it is not optimal when compared with other information divergences and criteria such as Area Under the detection Curve (AUC) and reverse KL divergence. In this paper, we discuss the optimality of tree approximation methods. We show that different information theory divergences and criteria such as the KL divergence, the Jeffreys divergence and the AUC are all related of the correlation approximation matrix (CAM), Δ. We also show some explicit relations between these different information divergences and criteria and investigate the relation between quality of the tree approximation by formulating a detection problem and considering the AUC and the Jefferys divergence which is a distance between two conditional means, as alternative approaches for the tree approximation. The tree structure approximation algorithms have interesting applications. In general, the tree structured enables us to do distributed algorithms such as belief propagation and also to do inference.

Because of computational complexity, it is important to consider simpler graphical models such as trees when modeling systems for many applications. We previously discuss the problem of convergence of the distributed state estimation algorithm for electric distribution grids, “mi-crogrids,” with distributed renewable energy generation. The correlations between distributed renewable energy generators was approximated by a simpler tree-structured graphical model. This paper carries the research further by looking at real spatial solar irradiation data and approximating correlation matrices by a tree structured graph. We also conduct simulations on synthetic data with larger number of nodes.

Keywords: Tree approximation, Detection problem, AUC, Smart Grid, Big Data

∗_{This work was supported in part by NSF grant ECCS-1310634 and the University of Hawaii REIS project.}

Volume 53, 2015, Pages 257–264

2015 INNS Conference on Big Data

Selection and peer-review under responsibility of the Scientiﬁc Programme Committee of INNS-BigData2015 c

(2)

1 Introduction

This paper considers the quality of tree approximations for graphical models. We begin by asking the following important questions: “what is the best closeness divergence or criteria for the tree approximation of the correlation matrix for the Gaussian model and which one of the possible trees is the best approximation of the Gaussian correlation matrix?” Looking into the literature, the Kullback-Leibler (KL) divergence has been proposed as a closeness divergence between the original distribution and its tree approximation distribution [1] in lots of the tree approximation algorithms while the reverse KL divergence is used in variational methods to learn the desired tree structured [2]. In this paper, we bring a diﬀerent perspective to the tree approximation by formulating a detection problem. For Gaussian data, the detection problem leads to calculation of the KL divergence and the reverse KL divergence as well as the area under the detection curve (AUC). The detection problem formulation gives us a broader view as well as diﬀerent ways of determining whether a particular tree is a good approximation or not. Note that, the tree approximation is useful to perform distributed algorithms such as belief propagation and inference on trees. The Chow-Liu minimum spanning tree (MST) [1] is the most popular tree structure learning algorithm which minimizes the KL divergence cost function [3] as the closeness criterion. This algorithm is easy to construct and gives the optimal tree solution by utilizing the Kruskal algorithm [4].

In this paper, we use information theory divergences in order to approximate the tree struc-ture for high-dimensional data. We are asking the following question: “how do we develop

a more general and a broader approach to studying simple approximation models (i.e trees) to model high-dimensional data?” To answer this question, note that each closeness criterion

brings a different prespective to the tree approximation and by formulating a detection problem we gain a broader understanding of the tree approximation problem. The detection problem formulation for the Gaussian data leads to calculation of the KL divergence, the reverse KL divergence and the AUC and also gives us a broader view as well as different ways of determin-ing whether a particular tree is a good approximation or not. A key quantity that we define is the correlation approximation matrix (CAM) as the product of the original correlation matrix and the inverse of the tree approximation correlation matrix. For Gaussian data this matrix contains all the information needed to compute information divergences and the AUC. We also show the relationship between the CAM, the AUC, the Jeffreys divergence [5], the KL diver-gence and the reverse KL diverdiver-gence. Through real solar irradiation data and synthetic data, we observe the behavior of these different performance divergences.

A nice application of the tree approximation is smart grid. Smart grid is a promising solution that delivers reliable energy to consumers through the power grid while there are uncertainties such as stochastic generation such as solar PVs and wind power plants. Smart grid technologies such as smart meters and communication links are added to the power grid in order to obtain the high dimensional, real-time data and information “Big Data,” and overcome uncertainties and unforeseen faults. Learning from high dimensional data requires large computation power which is expensive and not always available due to the hardware limitation. Thus, we need to compromise between the time complexity of the learning algorithm and its accuracy by using the best possible approximation algorithm and imposing structure. Current grid below substation level are mostly radial tree networks and thus it is easy to do distributed state estimation [6]. The future grid will incorporate distributed renewable energy generation (DERG) such as solar PVs, with these sources being highly correlated. Thus, tree approximations for DERG are needed to eﬃciently perform distributed state estimation.

The rest of this paper is organized as follows. The detection problem formulation and the corresponding suﬃcient statistics for the Gaussian random variables are given in Section 2. This

(3)

section talks brieﬂy about calculation of the generalized chi-squared distribution. Moreover, this section shows the relation between information divergences and criteria such as the KL divergence, the reverse KL divergence, the Jeﬀreys divergence and the AUC with the CAM. Section 3 presents some simulation results by looking at real spatial solar irradiation data and synthetic data comparing the CAM and various divergences and criteria that are presented in Section 2. Finally, Section 4 summarizes results of this paper.

2 Detection Problem Formulation

2.1 Preliminaries

We want to approximate a multivariate distribution by the product of lower order component distributions [7]. For the purpose of tree approximation, the maximum order of these lower

order distributions is two, i.e. no more than pairs of variables. Let X∼ N (0, Σ_X) (i.e. jointly

Gaussian with mean 0 and covariance matrix Σ_X) where X∈ Rphave the graph representation

G = (V, E) where sets V and E are the set of all vertices and edges of the graph representing

of X.1 Let X_T ∼ N (0, Σ_X_T) have the graph representationG_T = (V, E_T) where E_T ⊆ E is a

set of edges that represents a tree structure. Let X_l∼ N (0, Σ_X_l) has the graph representation

Gl = (V, El) where El ⊆ ET is the set of all edges in the graph of Xl. The joint probability

density function can be represented by joint pdfs of two variables and marginal PDFs in the following convenient form:

f_X_l(x_l) = (u,v)∈El f_Xu_,Xv(xu, xv) f_Xu(xu)f_Xv(xv) u∈V f_Xu(xu). (1)

Consider the sequence of random variables X_l with 0 ≤ l ≤ |E_T|, where X_l is recursively

generated by augmenting a new edge, (i, j)∈ E_l, to the graph representation of X_l−1. For the

special case of Gaussian distributions, Σ_X

l has the following recursive formulation

Σ−1

X_l= Σ−1X_l−1+ Σ†i,j− Σ†i− Σ†j ,∀ 0 ≤ l ≤ |ET| (2)

where Σ†_i,j= [e_i e_j]Σ−1_i,j[e_i e_j]Tand Σ†_i= e_iΣ−1_i eT_i where e_i is a unitary vector with 1 at the i-th

place and Σ_i,jand Σ_i are the 2-by-2 and 1-by-1 principle sub-matrices of Σ_X, with initial step

Σ_X

0= diag(ΣX) where diag(ΣX) represents a diagonal matrix with diagonal elements of ΣX.

Deﬁnition 1. The correlation approximation matrix, CAM, for the tree approximation is de-ﬁned as Δ Σ_XΣ−1_X

T and is a positive deﬁnite matrix.

Remark: Eigenvalues of the CAM contains all information about the tree approximation.

2.2 Detection Problem Formulation

A way to look at the problem of quantifying the quality of tree approximation algorithm is to formulate it as a detection problem [8]. Given the set of data in the detection problem, the goal is to distinguish between two hypotheses, the null hypothesis and the alternative hypothesis.

In this paper, the known graph hypothesis, H_K, is the hypothesis that parameter of interest

(the covariance of the random vector X) is known and is equal to Σ_X while the tree-structured

graph hypothesis H_T, is the hypothesis that the random vector X follows the approximated

tree-structured distribution is equal to Σ_X

T. Therefore, the CAM is Δ = ΣXΣXT

−1_{. Let the}

eigenvalues of Δ matrix be λ_i> 0 for i∈ {1, . . . , p}.

(4)

Based on the Neyman-Pearson (NP) Lemma [9], likelihood ratio test (LRT) is the most

powerful test statistic which is deﬁned as the ratio of the joint probability underH_K and of the

joint probability underH_T. The suﬃcient test statistic based on the log likelihood ratio test

(LLRT) is:

l(x) = log fX(x|HK) f_X(x|H_T).

This can be simpliﬁed for the Gaussian set up as l(x) = c− 1₂k(x) where c = −1₂log (|Δ|) is

a constant and k(x) = xTKx where K = (Σ_X−1− Σ_X_T−1) is an indeﬁnite matrix with both

positive and negative eigenvalues.

Proposition 1. For the indeﬁnite quadratic forms, l(X):

1. E(l(X)|H_K) =D(f_X(x|H_K)||f_X(x|H_T)) = m_K,

2. E(l(X)|H_T) =−D(f_X(x|H_T)||f_X(x|H_K)) = m_T.

Proof. Proof is based on the deﬁnition of the KL divergence. Note that, the ﬁrst quantity is

the KL divergence while the second one is negative of the reverse KL divergence.

In a regular detection problem framework, the NP decision rule is to accept hypothesis

HK if l(X) exceeds a critical value, and reject it otherwise. Moreover, the critical value is

set based on the rejection probability of hypothesis H_K. Note that, we pursue a diﬀerent

goal in the approximation problem scenario. We approximate a tree structure distribution as close as possible to the known distribution. The closeness criterion is based on the modiﬁed detection problem framework where we compute the suﬃcient statistic l(X) and compare it with a threshold. In ideal case where there is no approximation error, the receiver operating curve (ROC) [10] should be a line of slope 1 passing through the origin.

The random vector X has Gaussian distribution under both hypothesesH_KandH_T. Thus

under both hypotheses, the real random variable, k(X) = XTKX has a generalized chi-squared

distribution which is a weighted sum of chi-squared distributions. Let W = Σ_X−12_{X under}H_K

and Z = Σ_X_T−12_{X under} H_T_{, where Σ}_X12 _{and Σ}_X

T 1

2 _{are the square root of matrices Σ}_X

and Σ_X_T. Then W ∼ N (0, I) and Z ∼ N (0, I) where I is the identity matrix of appropriate

order. Also, let λ_i> 0 for i∈ {1, . . . , p} be eigenvalues of the correlation approximation matrix

matrix, Δ. Thus, the random variable k(X), under hypothesesH_K andH_T can be written as:

k(X|H_K) = p i=1 (1− λ_i)W_i2 and k(X|H_T) = p i=1 (λ−1_i − 1)Z_i2

respectively, where W_i and Z_i are the i-th element of W and Z. Also, W_i2 and Z_i2 have

central chi-squared distribution of order 1. Note that it follows from the Markov property and the product of lower order distributions assumption presented before that the summation of

weights of the Chi-squared distributions is zero under the hypothesisH_K, i.e. p_i=1(1−λ_i) = 0,

and this summation is positive under the hypothesisH_T, i.e. p_i=1(λ−1_i − 1) ≥ 0 [3].

2.3 Approximation of the generalized chi-squared distribution and

computation of the AUC

There are diﬀerent approaches to compute the probability density function (PDF) or Cumula-tive distribution function (CDF) of generalized chi-squared distribution; but, as it is mentioned in [11] for real valued case, it has to be done numerically. There is an expression for a closed form density for both PDF and CDF of generalized chi-squared distribution which is proposed

(5)

in [12]. This method consists of a computation method for finding infinite summations. We sug-gest to approximate either PDF or CDF of generalized chi-squared distribution. An accurate approximation has been proposed in [13] for the distribution of positive definite and indefi-nite quadratic forms for normal random variables which computes the moments of a quadratic form recursively. The other method is to numerically compute the multivariate normal dis-tribution values for ellipsoidal sets [14]. Since we are working with high-dimensional data we suggest a simple PDF approximation based on the moment matching procedure. Our method is to approximate the distribution of the statistic, l(X), under each hypothesis as a Gaussian distribution and matching the first two moments by computing the mean and the variance.

2.3.1 Gaussian Approximation

We denote by N (m_K, σ_K2) andN (m_T, σ_T2), Gaussian approximations of the probability of the

statistic l(X) under hypothesis H_K and hypothesis H_T, respectively. Means and variances of

these two Gaussian distributions can be derived by the moment matching procedure as

m_K=−1 2log(|Δ|) = − 1 2 p i=1 log(λ_i), σ_K2 = 1 2tr((I− Δ) 2_{) =} 1 2 p i=1 (1− λ_i)2, m_T =−1 2(log(|Δ|) + tr(Δ −1₎_{− p) = −}1 2 _p i=1 log(λ_i) + p i=1 λ−1_i − p and σ2_T = 1 2tr((Δ −1_{− I)}2_{) =}1 2 p i=1 (λ−1_i − 1)2.

Note that means have been previously calculated in proposition 1. Here, we give another representation of the means based on the eigenvalues of the CAM.

Lemma 1. For all positive deﬁnite covariance matrices, Σ_X, and Σ_X_T ∈ T , we have: DJ(fX(x|HK), fX(x|HT)) = 1₂tr(Δ−1)−p₂

where Δ = Σ_XΣ_X_T−1 and tr(Δ−1)≥ p.

Proof. The Jeﬀerys Divergence is deﬁned as

DJ(fX(x|HK), fX(x|HT)) =D(fX(x|HT)||fX(x|HK)) +D(fX(x|HK)||fX(x|HT)) = mK− mT

and the proof follows from above.

2.4 Relationship Between Information Divergences and the AUC

cri-terion using the CAM

The AUC is deﬁned as the area under the ROC curve. Here, we compute the AUC using the Gaussian approximation presented previously. The AUC under the Gaussian approximation can be calculated by taking the integral involving the Gaussian Q-function as

(6)

where Q(z) = √1 2π _∞ z e− u2 2 du and τ_G = √mK−mT σ2 K+σ2T

. Note that diﬀerence between means of the

statistic l(X) under two hypotheses, m_K− m_T = 1₂(tr(Δ−1)− p), is the Jeﬀerys divergence.

Let f_ZG(z) =Nm_K− m_T, (σ_K2 + σ_T2) . Then in this case, the approximated AUC using the

Gaussian approximation, AU C_G, can be computed by evaluating the CDF of random variable

Z at zero and then subtract it from one, i.e. AU C_G = 1− F_ZG(0). Note that, random variable Z has a nice interpretation which is the subtraction of the statistic l(X) under both hypotheses

considering the Gaussian approximation, i.e. Z = l(X|H_K)− l(X|H_T).

We want to ﬁnd a tree structured approximation matrix, Σ_X_T, such that the CAM is

approximately equal to the identity matrix, i.e. Δ≈ I, and that the AUC is approximately

one half. An alternative way to look at this problem which we justiﬁed in previous subsection, is to minimize the distance between the conditional means of the random variable l(x), i.e. the Jeﬀreys divergence, which implies the following minimization:

Σ_X

T

∗_{= arg min}

ΣXT∈T

DJ(fX(x|HK), fX(x|HT)). (3)

where T denote the set of all positive deﬁnite covariance matrices that has a tree structured

representation. Unlike the Chow Liu algorithm which minimizes the KL divergence there is not a simple algorithm to minimize the Jeﬀreys divergence.

3 Simulation Results and Discussion

In this section, simulation results are performed based on [15] and synthetic data. Two data sets which are obtained from the NREL website [16] introduced in [15]. The ﬁrst data set is the Oahu solar measurement grid which consists of 19 sensors (17 horizontal sensors and two tilted) and the second one is the NREL solar data for 6 sites near Denver, Colorado. These two data sets are normalized using standard normalization method and the zenith angle normalization

method [15] and then the unbiased estimate of the correlation matrix is computed.2

0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

ROC curves for Oahu and Colorado data sets

P( − | H T ) P( − | H K) Colorado 12:00 (AUC = 0.6228) Colorado 8:00 (AUC = 0.6175) Oahu 9:00 (AUC = 0.9119) Oahu 13:00 (AUC = 0.9066) y=x 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Subset of 6 nodes from Oahu data

P( − | H K) P( − | H T ) First 6 (AUC = 0.6678) Random 6 (AUC = 0.6584) Last 6 (AUC = 0.6115) Random 6 (AUC = 0.5485) y=x

Figure 1: Left: ROC curves with their associated AUC values for Colorado and Oahu data sets at diﬀerent time of day. Right: ROC curves with their associated AUC values for diﬀerent sub-graphs of size 6 the Oahu data-set at 8:00.

Figure 1 (Left) plots the ROC curves for Colorado data and Oahu data at diﬀerent time of the day. It also shows the associated AUC values for the tree approximation using the KL divergence and the shortest path algorithm. The ROC curves are plotted for the data times 8:00 and 12:00 from Colorado data-set and the data times 9:00 and 13:00 from Oahu data-set.

(7)

This figure indicates the quality of the tree approximation does not change much on average for both Oahu and Colorado data-sets. Comparing the results of Colorado data-set and Oahu data-set in this figure, we conclude that since the AUC metric is around 0.61 for the Colorado data-set, the tree approximation is reasonable, while this is not the case for the Oahu data-set where the AUC metric is around 0.91. This value suggests that the AUC depends on the size of the graph. Figure 1 (Right) indicates this by plotting ROC curves for different sub-graphs of the Oahu data-set at 8:00. This figure also shows the ROC curves for four different subsets of size six of Oahu data and their associated AUC values for the tree approximation using the KL divergence and the shortest path algorithm.The simulation shows that the AUC is varying for graphs of size 6 which means that the tree approximation quality varies for different sub-graphs of nodes of Oahu data set. The tree approximation is more reasonable for those sub-graphs that are associated with closer AUC value to 0.5 than the others. Comparing with figure 1 (Left) and the AUC results of Oahu data and Colorado data, we can clearly see that number of nodes in the graph have a large impact on the AUC values which means that the tree approximation for larger network is more likely to have larger AUC value than the smaller network.

−4 −2 0 2 4 0 0.2 0.4 0.6 0.8 1x 10 −3 PDF 12:00 −4 −2 0 2 4 0 0.2 0.4 0.6 0.8 1 8:00 CDF Threshold 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 P( − | H T ) P( − | HK) 8:00 −4 −2 0 2 4 0 0.2 0.4 0.6 0.8 1x 10 −3 PDF 8:00 −4 −2 0 2 4 0 0.2 0.4 0.6 0.8 1 12:00 Threshold CDF 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 12:00 P( − | H T ) P( − | HK) Under HK Under HT mK mT ROC (AUC = 0.6228) y=x ROC (AUC = 0.6175) y=x Under HK Under HT

Figure 2: CDFs, PDFs and ROC curves with their associated AUC values and the associated KL divergences and reverse KL divergences for Colorado data-set at 8:00 and 12:00.

Figure 2 shows CDFs and PDFs under the both two hypotheses and the ROC curves their associated AUC values for the Chow-Liu tree approximation using the KL divergence. From the figure, the distance between means of PDFs is the Jefferys divergence (sum of the KL divergence and the reverse KL divergence). Figure 3 shows the histogram of trees for some synthetic data with 100 nodes. The distribution is computed using Monte Carlo Markov Chain (MCMC) [17] method. Here we generate a covariance matrix with no dominant, increasing eigenvalues and plot the eigenvalues spectrum of the CAM, on the right hand side of this figure. Basis matrix is generated such that its coefficients are normally distributed. We can clearly see that a greedy solution of (3) using the shortest path algorithm can do better than the Chow-Liu tree.

4 Conclusion and Future Directions

In this paper, we formulate a detection problem and investigate the quality of tree approxima-tion using this set up. We presented the CAM, and discuss its relaapproxima-tionship with informaapproxima-tion divergences and the AUC. We show that we only need the knowledge of the CAM eigenvalues to compute the AUC and information divergences such as the KL divergence and Jeﬀreys

(8)

diver-800 850 900 950 1000 1050 1100 1150 1200 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 Jefferys divergence Distribution of Trees

Distribution for Simulation Data with 100 nodes (MCMC) Independent−nodes CL−Tree Jeffrey−Approx Real−Data (MCMC) 0 10 20 30 40 50 60 70 80 90 100 10−3 10−2 10−1 100 101 Index of Eigenvalues Value of Eigenlvalues Spectrum of Eigenvalues Δ (CL−Tree) Δ (Jeffreys−Approx) Known−Cov Tree (Jeffreys−Approx) Tree (CL−Tree)

Figure 3: Right: The eigenvalue spectrum of the CAM. Left: Histogram of trees computed using MCMC for a graph with 100 nodes. Figure shows the position of the optimal tree solution in the sense of KL divergence and a greedy solution of (3) using the shortest path algorithm. gence. The quality of the Chow-Liu tree algorithm is investigated using the proposed detection problem framework. The quality of tree approximation algorithms is investigated through some simulation on synthetic data and real solar irradiation data.

References

[1] C. K. Chow and C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, pages 462–467, 1968.

[2] D. J. C. MacKay. Information theory, inference, and learning algorithms, volume 7. 2003. [3] A. P. Dempster. Covariance selection. Biometrics, 28(1):157–175, March 1972.

[4] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical society, 7(1):48–50, 1956.

[5] H. Jeﬀreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London., 186(1007):453–461, 1946.

[6] N. T. Khajavi and A. Kuh. First order markov chain approximation of microgrid renewable generators covariance matrix. In Proc. of IEEE International Symposium on Information Theory, Istanbul, Turkey (ISIT’ 13), pages 1207–1211, July 2013.

[7] P. M. Lewis-II. Approximating probability distributions to reduce storage requirements. Informa-tion and control, 2(3):214–225, 1959.

[8] E. L. Lehmann and J. P. Romano. Testing statistical hypotheses. springer, 2006.

[9] J. Neyman and E. S. Pearson. On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, 20, 1928.

[10] L. L. Scharf. Statistical signal processing, volume 98. Addison-Wesley Reading, MA, 1991. [11] T. Y. Al-Naﬀouri and B. Hassibi. On the distribution of indeﬁnite quadratic forms in gaussian

random variables. In Information Theory, 2009. ISIT 2009. IEEE International Symposium on, pages 1744–1748. IEEE, 2009.

[12] S. B. Provost and E. M. Rudiuk. The exact distribution of indeﬁnite quadratic forms in noncentral normal vectors. Annals of the Institute of Statistical Mathematics, 48(2):381–394, 1996.

[13] H. T. Ha and S. B. Provost. An accurate approximation to the distribution of a linear combination of non-central chi-square random variables. REVSTAT, 11(3):231–254, 2013.

[14] A. Genz and F. Bretz. Computation of multivariate normal and t probabilities. 2009.

[15] N. T. Khajavi, A. Kuh, and N. P. Santhanam. Spatial correlations for solar pv generation and its tree approximation analysis. In Proc. of the Asia-Paciﬁc Signal and Information Processing Association (APSIPA ASC), pages 1–5, Dec 2014.

[16] NREL. Solar irradiance website. [Online]. Available: http://www.nrel.gov/midc/. [17] B. A. Berg and A. Billoire. Markov Chain Monte Carlo Simulations. Wiley Online, 2008.