Gaussian Case - Information Theory With Large Alphabets And Source Coding Error Exponents

4.4 Examples

4.4.2 Gaussian Case

A similar test channel tension arises in the Gaussian case. This can be seen most clearly by considering the optimization problem over $\rho_{xz}$ for fixed $\sigma_X^2$. In Fig. 4.3 we plot $G_3(\rho_{xz}) = \inf_{\sigma_Y^2} \sup_{\lambda \in \Lambda} \inf_{\rho_{xy}, \rho_{yz}} G[K, \Sigma, \lambda, \Delta, R]$, where we hold $\sigma_X^2 = 1$, and $K = K(1, \sigma_Y, 1, \rho_{xy}, \rho_{yz}, \rho_{xz})$ is the covariance matrix

[Figure 4.3: Test channel optimization for Theorem 19. The plot shows the exponent against $\rho_{xz}$, holding $\sigma_X^2 = 1$ fixed, for $R = 0.4$, $\zeta_{xy} = 0.7$ and $\Delta = 0.4$. Horizontal axis: test channel correlation $\rho_{xz}$.]

Intuitively, $\rho_{xz}$ controls the number of different codewords we use to cover the source sequences. At rate $R$ the scheme allows us to identify at most $\exp(nR)$ codewords uniquely, and binning is required to go beyond this. A large codebook has the advantage that each source sequence can be mapped to a better (i.e. closer) codeword. As we increase the size of the codebook beyond this point, the gains from having a “cleaner” codebook are outweighed by the penalty we pay for binning. From the plot we can see there is an optimum choice around $\rho_{xz} = 0.76$ for these parameters.

[Figure 4.4: A plot of the achievable exponent of Theorem 19. Here $\zeta_{xy} = 0.7$ (the correlation coefficient between the source and side information) and $\Delta = 0.4$. $R(\Delta) = 0.121$ nats for these parameters. Horizontal axis: rate $R$ (nats per sample). Vertical axis: (achievable) error exponent. Curves: Our Exponent, "Informed" Encoder, No Side Information.]

Figure 4.4 shows the exponent, obtained by numerically solving the optimization problem, plotted against the rate. For comparison the upper bound of Theorem 20 is included, as is the exponent for the no side information case, corresponding to the continuous version of Marton’s point-to-point exponent [58]. This result was first proved by Ihara and Kubo [59], who showed the exponent is

$$\inf_{\sigma_X :\, \frac{1}{2}\log(\sigma_X^2/\Delta) > R} D(f_{\sigma_X^2} \| f_1) = \frac{1}{2}\left(\Delta \exp(2R) - \log(\Delta \exp(2R)) - 1\right). \tag{4.22}$$
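As a sanity check on (4.22), the following minimal sketch compares the closed form against a brute-force scan of the infimum. It assumes zero-mean Gaussian densities, so that $D(f_{s} \| f_1) = \frac{1}{2}(s - \log s - 1)$ in nats; the function names, grid size and variance range are illustrative choices, not part of the original text.

```python
import math


def gaussian_divergence(var_p, var_q=1.0):
    """D(N(0, var_p) || N(0, var_q)) in nats."""
    ratio = var_p / var_q
    return 0.5 * (ratio - math.log(ratio) - 1.0)


def exponent_closed_form(rate, delta):
    """Right-hand side of (4.22), valid when delta * exp(2 * rate) >= 1.

    Below that threshold the unit-variance source already satisfies the
    constraint, so the infimum is attained at variance 1 and equals zero.
    """
    s = delta * math.exp(2.0 * rate)
    return gaussian_divergence(s) if s >= 1.0 else 0.0


def exponent_grid_search(rate, delta, max_var=10.0, steps=100_000):
    """Approximate the infimum in (4.22) by scanning a grid of variances."""
    best = math.inf
    for i in range(1, steps + 1):
        var = max_var * i / steps
        if 0.5 * math.log(var / delta) > rate:  # constraint under the infimum
            best = min(best, gaussian_divergence(var))
    return best


if __name__ == "__main__":
    for rate in (0.4, 0.8, 1.2):
        print(rate, exponent_closed_form(rate, 0.4), exponent_grid_search(rate, 0.4))
```

The two values should agree up to the grid resolution. For rates below $\frac{1}{2}\log(1/\Delta)$ both are zero, since the true source variance already satisfies the constraint.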

We can show that our achievable exponent recovers (4.22) by taking the side information to be statistically independent of the source, i.e. $\zeta = 0$. In this case, one can show that $\rho_{xy} = \rho_{yz} = 0$ solve the inner optimization problem of (4.12). Further, since $X \perp\!\!\!\perp Y$, $Y$ cannot help achieve the distortion constraint, so choosing $\sigma_Y = 1$ is nature’s best play. With these choices we see that $D(K \| \bar{K}) = D(f_{\sigma_X^2} \| f_1)$ and we are left with the following equivalent optimization (where we have written $\hat{X} = \alpha Z$):

$$\inf_{\sigma_X} \sup_{\rho_{x\hat{x}},\, \sigma_{\hat{X}}} \begin{cases} D(f_{\sigma_X^2} \| f_1) & \text{if } E[(X - \hat{X})^2] \ge \Delta \text{ or } I(X; \hat{X}) \ge R, \\ \infty & \text{otherwise.} \end{cases}$$

As nature will always pick $\sigma_X$ such that the supremum is finite, we are left with

$$\inf_{\sigma_X :\, R(\sigma_X^2, \Delta) \ge R} D(f_{\sigma_X^2} \| f_1).$$

Expanding the divergence and appealing to the monotonicity of $x - \log x$ gives (4.22).$^2$

Using equation (4.22) and Theorem 20 we can determine the error exponent exactly when the side information is available at both the encoder and decoder. In this case, Wyner [79, section 3] provides a simple scheme to achieve the rate distortion function. The encoder simply subtracts the conditional mean $E[X \mid Y = y]$ from the source. An achievable exponent then follows by computing the point-to-point exponent for the random variable $X \mid Y = y$, which is again Gaussian, with mean $\zeta y$ and variance $1 - \zeta^2$. Our achievable exponent in this case is

$$\inf_{\sigma_X :\, R(\sigma_X^2, \Delta) > R} D(f_{\sigma_X^2} \| f_{1-\zeta^2}) = \frac{1}{2}\left(\frac{\Delta \exp(2R)}{1 - \zeta^2} - \log\frac{\Delta \exp(2R)}{1 - \zeta^2} - 1\right). \tag{4.23}$$

We now show that this is in fact the best we can do, by showing that (4.23) coincides with the upper bound of Theorem 20. The optimization problem of Theorem 20 can be solved as follows. We first note that if $X, Y$ are zero mean with covariance matrix $K$, then $\mathrm{Var}(X \mid Y) = \det(K)/\mathrm{Var}(Y)$. Hence we may write the problem as

$$\inf_{K \succeq 0 :\, g(K, \Delta, R) \le 0} D(K \| \Sigma),$$

where $g(K, \Delta, R) = -\log\det(K) + \log(\Delta) + \log(e_2^T K e_2) + 2R$. The KKT conditions tell us the optimum $K^*$ must satisfy

1. $-\frac{1}{2}(K^*)^{-1} + \frac{1}{2}\Sigma^{-1} + \lambda\left(-(K^*)^{-1} + \begin{pmatrix} 0 & 0 \\ 0 & \frac{1}{e_2^T K^* e_2} \end{pmatrix}\right) = 0$,
2. $\lambda g(K^*) = 0$.

One can solve this system to find

$$K^* = \begin{pmatrix} \zeta^2 + \Delta \exp(2R) & \zeta \\ \zeta & 1 \end{pmatrix}.$$

Evaluating $D(K^* \| \Sigma)$ yields (4.23). Therefore, when the side information is available in both places we have determined the exponent exactly as (4.23).

Footnote 2: Using a virtually identical argument one can show that the exponent of Theorem 17 reduces to Marton’s exponent for the discrete-memoryless case when the side information is independent of the source.
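The claim that the stated $K^*$ solves the KKT system and attains (4.23) can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in this excerpt: that $\Sigma$ is the unit-variance covariance matrix with correlation $\zeta$, that $D(K \| \Sigma)$ is the divergence between zero-mean Gaussians in nats, and that the multiplier is $\lambda = \frac{1}{2}\left(\frac{\Delta \exp(2R)}{1-\zeta^2} - 1\right)$, a value derived for this sketch from the stationarity condition rather than taken from the text.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def gaussian_kl(k, sigma):
    """D(N(0, k) || N(0, sigma)) for 2x2 covariance matrices, in nats."""
    s_inv = inv2(sigma)
    trace = sum(s_inv[i][j] * k[j][i] for i in range(2) for j in range(2))
    return 0.5 * (trace - math.log(det2(k) / det2(sigma)) - 2.0)


def check_solution(zeta, delta, rate):
    """Return (D(K*||Sigma), closed form (4.23), g(K*), KKT residual, lam)."""
    s = delta * math.exp(2.0 * rate)
    sigma = [[1.0, zeta], [zeta, 1.0]]             # assumed source covariance
    k_star = [[zeta ** 2 + s, zeta], [zeta, 1.0]]  # candidate optimum K*
    ratio = s / (1.0 - zeta ** 2)
    closed_form = 0.5 * (ratio - math.log(ratio) - 1.0)

    # Constraint function g(K*, delta, rate); it should be active (zero).
    g = (-math.log(det2(k_star)) + math.log(delta)
         + math.log(k_star[1][1]) + 2.0 * rate)

    # Stationarity residual, using the multiplier assumed in the lead-in.
    lam = 0.5 * (ratio - 1.0)
    k_inv, s_inv = inv2(k_star), inv2(sigma)
    corner = [[0.0, 0.0], [0.0, 1.0 / k_star[1][1]]]
    residual = max(
        abs(-0.5 * k_inv[i][j] + 0.5 * s_inv[i][j]
            + lam * (-k_inv[i][j] + corner[i][j]))
        for i in range(2) for j in range(2)
    )
    return gaussian_kl(k_star, sigma), closed_form, g, residual, lam


if __name__ == "__main__":
    print(check_solution(zeta=0.7, delta=0.4, rate=0.4))
```

With the parameters of Figure 4.4, the divergence and the closed form should agree to floating-point precision, the constraint should be active, and the multiplier should be nonnegative.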

CHAPTER 5

IMPROVED SOURCE CODING EXPONENTS VIA WITSENHAUSEN’S RATE

In this chapter we improve the results of the previous chapter for the special case in which the side information is available fully (i.e. without being encoded) at the decoder; see Fig. 5.1.

© 2011 IEEE. Portions reprinted, with permission, from [Kelly and Wagner, “Improved Source Coding Exponents via Witsenhausen’s Rate”, to appear in IEEE Transactions on Information Theory].

5.1 Notation and Preliminaries

For sets, types, etc., we use the same notation as in the previous chapter. Unless specified otherwise, exponents and logarithms are taken in base 2. We use $\|x\|_\infty$ to denote the supremum norm, i.e. $\|x\|_\infty = \max_i |x_i|$. The notation $T^n_{Q,\epsilon}$ denotes the $(Q, n, \epsilon)$-typical set, i.e. the set of $x \in \mathcal{X}^n$ satisfying $\|Q_x - Q\|_\infty \le \epsilon$.
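To make the typical-set definition concrete, here is a minimal sketch that computes the type (empirical distribution) of a sequence and tests the supremum-norm condition. The function names and the dictionary representation of $Q$ are illustrative assumptions, and `eps` stands for the typicality slack in the definition.

```python
from collections import Counter


def empirical_type(seq, alphabet):
    """Type of seq: the fraction of positions occupied by each symbol."""
    counts = Counter(seq)
    n = len(seq)
    return {a: counts[a] / n for a in alphabet}


def is_typical(seq, q, eps):
    """True if the type of seq is within eps of q in supremum norm.

    q maps each alphabet symbol to its probability; every symbol of seq
    is assumed to belong to that alphabet.
    """
    if any(symbol not in q for symbol in seq):
        raise ValueError("sequence contains a symbol outside the alphabet")
    seq_type = empirical_type(seq, q.keys())
    return max(abs(seq_type[a] - q[a]) for a in q) <= eps


if __name__ == "__main__":
    q = {"a": 0.5, "b": 0.5}
    print(is_typical("aabb", q, 0.0), is_typical("aaab", q, 0.1))
```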

[Figure 5.1: block diagram; surviving labels: Encoder, R, Decoder, X, X.]

A graph $G = (V, E)$ is a pair of sets, where $V$ is the set of vertices and $E \subset V \times V$ is the set of edges. Two vertices $x, y \in V$ are connected iff $(x, y) \in E$. In this chapter we need only consider simple graphs, i.e. undirected graphs without self-loops. The degree of a vertex $v$, $\Delta(v)$, is the number of other vertices to which $v$ is connected. The degree of a graph $G$, denoted $\Delta(G)$, is defined as $\max_{v \in V} \Delta(v)$. A coloring of a graph is an assignment of colors to vertices so that no pair of adjacent vertices share the same color. The chromatic number of $G$, $\gamma(G)$, is defined to be the smallest number of colors needed to color $G$. For $U \subset V$, $G(U)$ is the (vertex-) induced subgraph, i.e. the graph with vertex set $U$ and edge set $E \cap (U \times U)$. An independent set of $G$ is a set of vertices whose induced subgraph contains no edges. The graph $\bar{G}$ is the graph complement of $G$, which has the same vertex set as $G$, and two distinct vertices are connected in $\bar{G}$ if and only if they are not connected in $G$. A clique of $G$ is a subset of the vertices of $G$ such that every two vertices are connected. A graph $G$ is called perfect if the chromatic number of every induced subgraph $G(V')$ is equal to the size of the largest clique of $G(V')$.
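The definitions of chromatic number, clique and perfect graph can be illustrated by brute force on very small graphs. The sketch below is exponential in the number of vertices and is meant only as an illustration; the function names are hypothetical. It uses the 5-cycle (pentagon), a standard example of a graph that is not perfect, and the 4-cycle, which is.

```python
from itertools import combinations, product


def chromatic_number(vertices, edges):
    """Smallest number of colors in a proper coloring (brute force)."""
    vertices = list(vertices)
    if not vertices:
        return 0
    for k in range(1, len(vertices) + 1):
        for colors in product(range(k), repeat=len(vertices)):
            coloring = dict(zip(vertices, colors))
            if all(coloring[u] != coloring[v] for u, v in edges):
                return k


def clique_number(vertices, edges):
    """Size of the largest clique (brute force)."""
    edge_set = {frozenset(e) for e in edges}
    vertices = list(vertices)
    for size in range(len(vertices), 0, -1):
        for cand in combinations(vertices, size):
            if all(frozenset(p) in edge_set for p in combinations(cand, 2)):
                return size
    return 0


def is_perfect(vertices, edges):
    """Check chromatic number == clique number on every induced subgraph."""
    vertices = list(vertices)
    for size in range(1, len(vertices) + 1):
        for subset in combinations(vertices, size):
            members = set(subset)
            sub_edges = [(u, v) for u, v in edges if u in members and v in members]
            if chromatic_number(subset, sub_edges) != clique_number(subset, sub_edges):
                return False
    return True


if __name__ == "__main__":
    pentagon = [(i, (i + 1) % 5) for i in range(5)]
    square = [(i, (i + 1) % 4) for i in range(4)]
    print(chromatic_number(range(5), pentagon), clique_number(range(5), pentagon))
    print(is_perfect(range(5), pentagon), is_perfect(range(4), square))
```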

Let $G = (V, E)$, $H = (V', E')$ be two graphs. The strong product (also called the and product or normal product) $G \wedge H$ is a graph whose vertex set is $V \times V'$ and in which two vertices $(v, v')$, $(u, u')$ are connected iff

1. $v = u$ and $(v', u') \in E'$, or
2. $v' = u'$ and $(v, u) \in E$, or
3. $(v, u) \in E$ and $(v', u') \in E'$.
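The three conditions above can be sketched directly in code. For two distinct product vertices, they are equivalent to requiring that each coordinate pair is either equal or adjacent, which is what the sketch tests. The function name and the edge representation (unordered pairs stored as `frozenset`s) are illustrative choices.

```python
from itertools import combinations, product


def strong_product(v1, e1, v2, e2):
    """Return (vertices, edges) of the strong product of two simple graphs.

    Edges are returned as a set of frozensets of product vertices.
    """
    adj1 = {frozenset(e) for e in e1}
    adj2 = {frozenset(e) for e in e2}
    vertices = list(product(v1, v2))
    edges = set()
    for (v, vp), (u, up) in combinations(vertices, 2):
        first_ok = v == u or frozenset((v, u)) in adj1      # same or adjacent
        second_ok = vp == up or frozenset((vp, up)) in adj2  # same or adjacent
        if first_ok and second_ok:
            edges.add(frozenset(((v, vp), (u, up))))
    return vertices, edges


if __name__ == "__main__":
    pentagon = [(i, (i + 1) % 5) for i in range(5)]
    vertices, edges = strong_product(range(5), pentagon, range(5), pentagon)
    print(len(vertices), len(edges))
```

A quick consistency check is the edge count $|V||E'| + |V'||E| + 2|E||E'|$, one term per condition.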

We will be interested in $G^n = G \wedge G \wedge \cdots \wedge G$ ($n$ factors), the $n$-fold strong product of $G$. One may think of the vertices of $G^n$ as length-$n$ vectors $(v_1, \ldots, v_n)$, with two distinct vertices connected in $G^n$ if each pair of corresponding components is either the same or connected in $G$. The characteristic graph, $G_X$, of a source $P_{XY}$ is the graph whose vertex set is $\mathcal{X}$ and in which two vertices $x, x'$ are connected if there is a $y \in \mathcal{Y}$ such that $P(y|x')P(y|x) > 0$. For a given $y$, the set $Z(y) = \{x : P(x|y) > 0\}$ is the set of ‘confusable’ sequences, i.e. the set of $x$s that can occur with a given $y$.
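As an illustration of the characteristic graph, the sketch below builds it from a conditional distribution $P(y|x)$ stored as nested dictionaries. The representation and function names are hypothetical, and `confusable_set` additionally assumes that every source symbol has positive marginal probability, so that $P(x|y) > 0$ exactly when $P(y|x) > 0$. The example source, in which $y$ equals $x$ or $x + 1$ modulo 5, has the pentagon as its characteristic graph.

```python
from itertools import combinations


def characteristic_graph(p_y_given_x):
    """Edges {x, x'} such that some y has P(y|x) > 0 and P(y|x') > 0.

    p_y_given_x maps x -> {y: P(y|x)}; missing entries mean probability 0.
    """
    edges = set()
    for x, xp in combinations(list(p_y_given_x), 2):
        row, other = p_y_given_x[x], p_y_given_x[xp]
        if any(row[y] > 0 and other.get(y, 0) > 0 for y in row):
            edges.add(frozenset((x, xp)))
    return edges


def confusable_set(p_y_given_x, y):
    """Z(y), assuming every x has positive marginal probability."""
    return {x for x, row in p_y_given_x.items() if row.get(y, 0) > 0}


if __name__ == "__main__":
    source = {x: {x: 0.5, (x + 1) % 5: 0.5} for x in range(5)}
    print(sorted(tuple(sorted(e)) for e in characteristic_graph(source)))
    print(confusable_set(source, 1))
```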

For a graph $G$ and distribution $Q$ on the vertices of $G$, we define the following functional.

Definition 9.
$$\kappa(G, Q) = \max_{W :\, W \ll G,\; QW = Q} H(W \mid Q). \tag{5.1}$$

Note: whenever we write the graph G where a matrix is expected, we abuse notation and refer to the matrix G = A + I where A is the adjacency matrix of graph G and I is the identity matrix.

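A second equivalent definition of $\kappa$ is

$$\kappa(G, Q) = \max_{X, \tilde{X} :\, Q_X = Q_{\tilde{X}} = Q} H(\tilde{X} \mid X), \tag{5.2}$$

where $X$ and $\tilde{X}$ have a common alphabet and $P(\tilde{x}|x) > 0$ only if $(\tilde{x}, x) \in E(G)$ or $x = \tilde{x}$.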
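Computing $\kappa(G, Q)$ in general requires solving the maximization, but any feasible transition matrix gives a lower bound. The sketch below, with hypothetical function names, checks feasibility of a candidate $W$ (support restricted to self-transitions and edges of $G$, stationary distribution $Q$) and evaluates $H(W|Q)$ in bits. The example is the pentagon with uniform $Q$ and $W$ uniform over each vertex and its two neighbors. Since every row of a feasible $W$ is supported on at most three vertices in this example, no feasible $W$ can exceed $\log_2 3$, so the bound is tight here.

```python
import math


def conditional_entropy(q, w):
    """H(W|Q) = sum_x Q(x) H(W(.|x)) in bits."""
    total = 0.0
    for qx, row in zip(q, w):
        total -= qx * sum(p * math.log2(p) for p in row if p > 0)
    return total


def is_feasible(q, w, adjacency, tol=1e-9):
    """Check the support constraint, row-stochasticity, and QW = Q."""
    n = len(q)
    support_ok = all(
        w[x][y] == 0 or x == y or bool(adjacency[x][y])
        for x in range(n) for y in range(n)
    )
    rows_ok = all(min(row) >= 0 and abs(sum(row) - 1.0) < tol for row in w)
    marginal_ok = all(
        abs(sum(q[x] * w[x][y] for x in range(n)) - q[y]) < tol
        for y in range(n)
    )
    return support_ok and rows_ok and marginal_ok


if __name__ == "__main__":
    n = 5
    adjacency = [[1 if (x - y) % n in (1, n - 1) else 0 for y in range(n)]
                 for x in range(n)]
    q = [1.0 / n] * n
    w = [[1.0 / 3 if x == y or adjacency[x][y] else 0.0 for y in range(n)]
         for x in range(n)]
    print(is_feasible(q, w, adjacency), conditional_entropy(q, w))
```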

We remark that similar optimizations arise in the determination of maximum entropy Markov chains subject to moment constraints [80].

We will also make use of the following graph functionals from graph/zero-error information theory.

Definition 10. The graph entropy [64], $H(G, Q)$, of a graph $G$ and a distribution $Q$ on the vertices of $G$ is defined as

$$H(G, Q) = \min_{X \in Z \in \Gamma(G)} I(X; Z), \tag{5.3}$$

where $X$ is a random node in the graph and has distribution $Q$, $\Gamma(G)$ denotes the set of all maximal independent sets of $G$, and the notation $X \in Z$ means $P_{Z|X}(z|x) = 0$ for $x \notin z$.

Definition 11. The complementary graph entropy (or co-entropy or $\pi$-entropy) [62, 63], $\bar{H}(G, Q)$, of a graph $G$ with a distribution $Q$ on the vertices of $G$ is defined as

$$\bar{H}(G, Q) = \lim_{\epsilon \to 0} \limsup_{n \to \infty} \frac{\log \gamma(G^n(T^n_{Q,\epsilon}))}{n}. \tag{5.4}$$

Graph entropy and complementary graph entropy are related as follows (see for example [81, Theorem 4]):

$$\bar{H}(G, Q) \le H(G, Q), \tag{5.5}$$

and equality holds in (5.5) if $G$ is perfect [82, Corollary 12].
