Universal coding for classes of sources


Denver Greene

This work is produced by The Connexions Project and licensed under the Creative Commons Attribution License


We have discussed several parametric sources, and will now start developing mathematical tools in order to investigate properties of universal codes that offer universal compression w.r.t. a class of parametric sources.

1 Preliminaries

Consider a class $\Lambda$ of parametric models, where the parameter set $\theta$ characterizes the distribution for a specific source within this class, $\{p_\theta(\cdot),\ \theta \in \Lambda\}$.

Example 1

Consider the class of memoryless sources over an alphabet $\alpha = \{1, 2, \ldots, r\}$. Here we have

$$\theta = \{p(1), p(2), \ldots, p(r-1)\}. \tag{1}$$

The goal is to find a fixed-to-variable length lossless code that is independent of $\theta$, which is unknown, yet achieves

$$E_\theta\left[\frac{l(X_1^n)}{n}\right] \xrightarrow{n \to \infty} H_\theta(X), \tag{2}$$

where the expectation is taken w.r.t. the distribution implied by $\theta$. We have seen for

$$p(x) = \frac{1}{2} p_1(x) + \frac{1}{2} p_2(x) \tag{3}$$

that a code that is good for two sources (distributions) $p_1$ and $p_2$ exists, modulo the one-bit loss (footnote 1). As an expansion beyond this idea, consider

$$p(x) = \int_\Lambda dw(\theta)\, p_\theta(x), \tag{4}$$

where $w(\theta)$ is a prior.
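The one-bit loss of the two-source mixture in (3) can be illustrated numerically. The following is a minimal sketch, assuming two hypothetical Bernoulli sources (parameters 0.2 and 0.8) and an arbitrary example sequence; none of these values come from the text, and ideal (non-integer) code lengths are used.

```python
import math

def code_length_bits(prob):
    """Ideal (non-integer) code length in bits for an outcome of probability prob."""
    return -math.log2(prob)

def mixture_prob(p1, p2):
    """Equal-weight mixture of two probabilities, as in (3)."""
    return 0.5 * p1 + 0.5 * p2

def bernoulli_seq_prob(x, theta):
    """Probability of the binary sequence x under an i.i.d. Bernoulli(theta) source."""
    ones = sum(x)
    zeros = len(x) - ones
    return theta ** ones * (1.0 - theta) ** zeros

# Hypothetical example: two candidate sources and one observed sequence.
x = [1, 1, 0, 1, 1, 1, 0, 1]
p1 = bernoulli_seq_prob(x, 0.2)
p2 = bernoulli_seq_prob(x, 0.8)

l_mix = code_length_bits(mixture_prob(p1, p2))
l_best = min(code_length_bits(p1), code_length_bits(p2))
overhead = l_mix - l_best  # extra bits paid for not knowing which source is active
```

Because the mixture probability is at least half of the larger of the two probabilities, `overhead` lies between 0 and 1 bit for any sequence.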

Version 1.2: May 16, 2013 12:48 pm -0500. License: http://creativecommons.org/licenses/by/3.0/

Footnote 1: "Source models", (2) <http://cnx.org/content/m46231/latest/#uid1>

Example 2

Let us revisit the memoryless source, choose $r = 2$, and define the scalar parameter

$$\theta = \Pr(X_i = 1) = 1 - \Pr(X_i = 0). \tag{5}$$

Then

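$$p_\theta(x) = \theta^{n_x(1)} \cdot (1-\theta)^{n_x(0)}, \tag{6}$$

where $n_x(a)$ denotes the number of times the symbol $a$ appears in $x$, and

$$p(x) = \int_0^1 d\theta \cdot \theta^{n_x(1)} \cdot (1-\theta)^{n_x(0)}. \tag{7}$$

Moreover, it can be shown that

$$p(x) = \frac{n_x(0)!\, n_x(1)!}{(n+1)!}; \tag{8}$$

this result appears in Krichevsky and Trofimov [2].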
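As a sanity check on (8), the closed form can be compared against a direct numerical evaluation of the integral in (7), and the probabilities of all length-$n$ sequences can be summed to confirm they total one. This is an illustrative sketch; the function names, the midpoint rule, and the example counts are choices made here, not part of the source.

```python
from fractions import Fraction
from math import comb, factorial

def uniform_mixture_prob(n0, n1):
    """Closed form (8): probability of one specific sequence with n0 zeros and n1 ones."""
    n = n0 + n1
    return Fraction(factorial(n0) * factorial(n1), factorial(n + 1))

def numeric_mixture_prob(n0, n1, steps=20_000):
    """Midpoint-rule approximation of the integral in (7)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        total += theta ** n1 * (1.0 - theta) ** n0
    return total * h

n = 10
# Summing over all 2^n sequences, grouped by their number of ones, must give 1.
total = sum(comb(n, k) * uniform_mixture_prob(n - k, k) for k in range(n + 1))
closed = float(uniform_mixture_prob(7, 3))
approx = numeric_mixture_prob(7, 3)
```

Exact rational arithmetic is used for the closed form so that the normalization check is not affected by rounding.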

Is the source $X$ implied by the distribution $p(x)$ an ergodic source? Consider the event $\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} X_i \le \frac{1}{2}$. Owing to symmetry, in the limit of large $n$ the probability of this event under $p(x)$ must be $\frac{1}{2}$,

$$\Pr\left\{\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} X_i \le \frac{1}{2}\right\} = \frac{1}{2}. \tag{9}$$

On the other hand, recall that an ergodic source must allocate probability 0 or 1 to this flavor of event. Therefore, the source implied by $p(x)$ is not ergodic.
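The symmetry argument behind (9) can be made concrete at finite $n$. Combining (8) with the number of sequences that contain $k$ ones shows that, under the mixture, the number of ones is uniformly distributed on $\{0, \ldots, n\}$. The sketch below (function names are illustrative) evaluates the probability that the empirical mean is at most $\frac{1}{2}$ for a few odd values of $n$, where it equals $\frac{1}{2}$ exactly; it illustrates the finite-$n$ behavior rather than the limit itself.

```python
from fractions import Fraction
from math import comb, factorial

def prob_k_ones(n, k):
    """Probability that a length-n sequence drawn from the mixture p(x) has exactly k ones."""
    per_sequence = Fraction(factorial(k) * factorial(n - k), factorial(n + 1))  # from (8)
    return comb(n, k) * per_sequence

def prob_mean_at_most_half(n):
    """Pr{(1/n) * sum X_i <= 1/2} under the mixture."""
    return sum(prob_k_ones(n, k) for k in range(n // 2 + 1))

values = {n: prob_mean_at_most_half(n) for n in (1, 11, 101, 1001)}
```

The probability stays at one half rather than drifting toward 0 or 1 as $n$ grows, which is the behavior that rules out ergodicity.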

Recall the definitions of $p_\theta(x)$ and $p(x)$ in (6) and (7), respectively. Based on these definitions, consider the following,

$$\begin{aligned}
H_\theta(X_1^n) &= -\sum_{x_1^n \in A^n} p_\theta(x_1^n) \log p_\theta(x_1^n) = H(X_1^n \mid \Theta = \theta), \\
H(X_1^n) &= -\sum_{x_1^n \in A^n} p(x_1^n) \log p(x_1^n), \\
H(X_1^n \mid \Theta) &= \int_\Lambda dw(\theta) \cdot H(X_1^n \mid \Theta = \theta).
\end{aligned} \tag{10}$$

We get the following quantity for the mutual information between the random variable $\Theta$ and the random sequence $X_1^n$,

$$I(\Theta; X_1^n) = H(X_1^n) - H(X_1^n \mid \Theta). \tag{11}$$

Note that this quantity represents the gain in bits that the parameter $\theta$ creates; more about this quantity will be mentioned later.

2 Redundancy

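We now define the conditional redundancy,

$$r_n(l, \theta) = \frac{1}{n}\left[E_\theta(l(X_1^n)) - H_\theta(X_1^n)\right], \tag{12}$$

which quantifies how far a coding length function $l$ is from the entropy when the parameter $\theta$ is known. Note that the expected length under the mixture $p$ of (4) satisfies

$$E_p(l(X_1^n)) = \int_\Lambda dw(\theta)\, E_\theta(l(X_1^n)) \ge H(X_1^n \mid \Theta). \tag{13}$$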


Denote by $C_n$ the collection of lossless codes for length-$n$ inputs, and define the expected redundancy of a code $l \in C_n$ by

$$R_n^-(w, l) = \int_\Lambda dw(\theta)\, r_n(l, \theta), \qquad R_n^-(w) = \inf_{l \in C_n} R_n^-(w, l). \tag{14}$$

The asymptotic expected redundancy follows,

$$R^-(w) = \lim_{n\to\infty} R_n^-(w), \tag{15}$$

assuming that the limit exists.

We can also define the minimum redundancy that incorporates the worst prior for the parameter,

$$R_n^- = \sup_{w \in W} R_n^-(w), \tag{16}$$

while keeping the best code. Similarly,

$$R^- = \lim_{n\to\infty} R_n^-. \tag{17}$$

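Let us derive $R_n^-$,

$$\begin{aligned}
R_n^- &= \sup_w \inf_l \int_\Lambda dw(\theta)\, \frac{1}{n}\left[E_\theta(l(X_1^n)) - H(X_1^n \mid \Theta = \theta)\right] \\
&= \sup_w \inf_l \frac{1}{n}\left[E_p(l(X_1^n)) - H(X_1^n \mid \Theta)\right] \\
&= \sup_w \frac{1}{n}\left[H(X_1^n) - H(X_1^n \mid \Theta)\right] \\
&= \sup_w \frac{1}{n} I(\Theta; X_1^n) = \frac{C_n}{n},
\end{aligned} \tag{18}$$

where the third equality uses the fact that the best code for the mixture $p$ attains an expected length of $H(X_1^n)$ (ignoring integer rounding), and $C_n$ is the capacity of the channel whose input is the parameter $\theta$ and whose output is the sequence $x$ [4]. That is, we try to estimate the parameter from the output of a noisy channel.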
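To make the mutual information in (18) tangible, the sketch below evaluates $I(\Theta; X_1^n)$ for a Bernoulli class with a discrete prior. Note the assumptions: the two-point prior on $\{0.1, 0.9\}$ is a hypothetical choice, and the code evaluates the mutual information for that one prior only, so it gives a lower bound on $C_n$ rather than the supremum over priors.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information_bits(thetas, weights, n):
    """I(Theta; X_1^n) in bits for a Bernoulli class with a discrete prior.

    thetas  : candidate parameters, each strictly between 0 and 1
    weights : prior probabilities w(theta), summing to 1
    """
    # H(X_1^n | Theta): average of the per-source entropies n * h2(theta).
    h_cond = sum(w * n * h2(t) for t, w in zip(thetas, weights))
    # H(X_1^n): all sequences with k ones share the same mixture probability.
    h_mix = 0.0
    for k in range(n + 1):
        p_seq = sum(w * t ** k * (1 - t) ** (n - k) for t, w in zip(thetas, weights))
        if p_seq > 0.0:
            h_mix -= math.comb(n, k) * p_seq * math.log2(p_seq)
    return h_mix - h_cond

# Hypothetical two-point prior: the parameter is 0.1 or 0.9 with equal probability.
info = [mutual_information_bits([0.1, 0.9], [0.5, 0.5], n) for n in (1, 5, 20, 50)]
```

With only two equally likely parameters, the mutual information cannot exceed one bit, and it approaches one bit as $n$ grows because the sequence reveals which source is active.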

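In an analogous manner, we define

$$\begin{aligned}
R_n^+ &= \inf_l \sup_{\theta \in \Lambda} r_n(l, \theta) \\
&= \inf_l \sup_\theta \frac{1}{n} E_\theta\left[\log_2 \frac{p_\theta(x^n)}{2^{-l(x^n)}}\right] \\
&= \inf_Q \sup_\theta \frac{1}{n} D(P_\theta \,\|\, Q),
\end{aligned} \tag{19}$$

where $Q$ is the distribution induced by the coding length function $l$, that is, $Q(x^n) = 2^{-l(x^n)}$.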

3 Minimal redundancy

Note that

$$\forall w, l: \quad \sup_\theta r_n(l, \theta) \ge \int_\Lambda w(d\theta)\, r_n(l, \theta) \ge \inf_{l' \in C_n} \int_\Lambda w(d\theta)\, r_n(l', \theta). \tag{20}$$

Therefore,

$$R_n^+ = \inf_l \sup_\theta r_n(l, \theta) \ge \sup_w \inf_l \int_\Lambda w(d\theta)\, r_n(l, \theta) = R_n^-. \tag{21}$$

In fact, Gallager showed that $R_n^+ = R_n^-$. That is, the min-max and max-min redundancies are equal.

Let us revisit the Bernoulli source $p_\theta$ where $\theta \in \Lambda = [0, 1]$. From the mixture defined in (7), which relies on a uniform prior for the sources, i.e., $w(\theta) = 1,\ \forall \theta \in \Lambda$, it can be shown that there exists a universal code with length function $l$ such that

$$E_\theta[l(x^n)] \le n E_\theta\left[h_2\left(\frac{n_x(1)}{n}\right)\right] + \log(n+1) + 2, \tag{22}$$

where $h_2(p) = -p\log(p) - (1-p)\log(1-p)$ is the binary entropy. That is, the redundancy is approximately $\log(n)$ bits. Clarke and Barron [1] studied the weighting approach,

$$p(x) = \int_\Lambda dw(\theta)\, p_\theta(x), \tag{23}$$

and constructed a prior that achieves $R_n^- = R_n^+$ precisely for memoryless sources.
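The bound in (22) can be checked sequence by sequence, which in turn implies the bound in expectation. The sketch below assumes an integer code length of $\lceil -\log_2 p(x) \rceil + 1$ bits for the uniform mixture of (8); this particular length assignment is an assumption made here for illustration (the text does not specify the code), and $n = 64$ is an arbitrary choice.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def uniform_mixture_length(n, k):
    """Assumed integer code length ceil(-log2 p(x)) + 1 for a sequence with k ones."""
    # From (8): -log2 p(x) = log2((n + 1) * C(n, k))
    log_inv_p = math.log2(n + 1) + math.log2(math.comb(n, k))
    return math.ceil(log_inv_p) + 1

def bound_22(n, k):
    """Right-hand side of (22) evaluated for one sequence instead of in expectation."""
    return n * h2(k / n) + math.log2(n + 1) + 2

n = 64
# Slack between the bound and the actual length, for every possible number of ones.
gaps = [bound_22(n, k) - uniform_mixture_length(n, k) for k in range(n + 1)]
```

Every entry of `gaps` is nonnegative because $\log_2 \binom{n}{k} \le n\, h_2(k/n)$, so the only cost beyond the empirical entropy is the $\log(n+1)$ term plus the rounding constant.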

Theorem 5 [1] For a memoryless source with an alphabet of size $r$, $\theta = (p(0), p(1), \cdots, p(r-1))$,

$$n R_n^-(w) = \frac{r-1}{2} \log\left(\frac{n}{2\pi e}\right) + \int_\Lambda w(d\theta) \log\left(\frac{\sqrt{|I(\theta)|}}{w(\theta)}\right) + o_n(1), \tag{24}$$

where $o_n(1)$ vanishes uniformly as $n \to \infty$ for any compact subset of $\Lambda$, and

$$I(\theta) \triangleq E\left[\left(\frac{\partial \ln p_\theta(x_i)}{\partial \theta}\right)\left(\frac{\partial \ln p_\theta(x_i)}{\partial \theta}\right)^T\right] \tag{25}$$

is Fisher's information.

Note that when the parameter is sensitive to change we have large $I(\theta)$, which increases the redundancy. That is, good sensitivity means bad universal compression.
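For the Bernoulli class ($r = 2$) with the uniform prior $w(\theta) = 1$, both sides of (24) can be evaluated and compared. The sketch below is a numerical illustration under stated assumptions: it uses the Bernoulli Fisher information $I(\theta) = \frac{1}{\theta(1-\theta)}$, it applies the formula to a prior supported on all of $[0, 1]$ rather than on a compact subset of the interior, and the function names are chosen here. The left side is computed exactly from (8) and (11).

```python
import math

def log2_binom(n, k):
    """log2 of the binomial coefficient C(n, k), via lgamma to avoid huge integers."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(2)

def exact_mutual_information(n):
    """n * R_n^-(w) = I(Theta; X_1^n) in bits for the Bernoulli class, uniform prior."""
    # Each sequence with k ones has probability 1 / ((n + 1) C(n, k)),
    # and the number of ones k is uniform on {0, ..., n}.
    h_mix = math.log2(n + 1) + sum(log2_binom(n, k) for k in range(n + 1)) / (n + 1)
    # H(X_1^n | Theta) = n * integral of h2(theta) over [0, 1] = n / (2 ln 2) bits.
    h_cond = n / (2 * math.log(2))
    return h_mix - h_cond

def clarke_barron_prediction(n):
    """Right-hand side of (24) without the vanishing term, for r = 2 and w(theta) = 1."""
    # The integral of log2 sqrt(1 / (theta (1 - theta))) over [0, 1] equals log2(e).
    return 0.5 * math.log2(n / (2 * math.pi * math.e)) + math.log2(math.e)

gap_small = abs(exact_mutual_information(10) - clarke_barron_prediction(10))
gap_large = abs(exact_mutual_information(2000) - clarke_barron_prediction(2000))
```

The gap between the exact value and the prediction shrinks as $n$ grows, in line with the vanishing $o_n(1)$ term.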

Denote

$$J(\theta) = \frac{\sqrt{|I(\theta)|}}{\int_\Lambda \sqrt{|I(\theta')|}\, d\theta'}; \tag{26}$$

this is known as Jeffreys' prior. Using $w(\theta) = J(\theta)$, it can be shown that $R_n^- = R_n^+$.

Example 3

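Let us derive the Fisher information $I(\theta)$ for the Bernoulli source. In line with (25), we consider a single symbol $x$, so that $n_x(1), n_x(0) \in \{0, 1\}$, $n_x(1) + n_x(0) = 1$, and $n_x(1)\, n_x(0) = 0$:

$$\begin{aligned}
p_\theta(x) &= \theta^{n_x(1)} \cdot (1-\theta)^{n_x(0)} \\
\Rightarrow \ln p_\theta(x) &= n_x(1) \ln\theta + n_x(0) \ln(1-\theta) \\
\Rightarrow \frac{\partial \ln p_\theta(x)}{\partial\theta} &= \frac{n_x(1)}{\theta} - \frac{n_x(0)}{1-\theta} \\
\Rightarrow \left(\frac{\partial \ln p_\theta(x)}{\partial\theta}\right)^2 &= \frac{n_x^2(1)}{\theta^2} + \frac{n_x^2(0)}{(1-\theta)^2} - \frac{2\, n_x(1)\, n_x(0)}{\theta(1-\theta)} \\
\Rightarrow E\left[\left(\frac{\partial \ln p_\theta(x)}{\partial\theta}\right)^2\right] &= \frac{\theta}{\theta^2} + \frac{1-\theta}{(1-\theta)^2} - \frac{2}{\theta(1-\theta)} E[n_x(1)\, n_x(0)] \\
&= \frac{1}{\theta} + \frac{1}{1-\theta} - 0 \\
&= \frac{1}{\theta(1-\theta)}.
\end{aligned} \tag{27}$$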
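The result of (27) can be cross-checked numerically by estimating the score $\frac{\partial \ln p_\theta(x)}{\partial \theta}$ with central differences and averaging its square over the two possible symbols. This is a small sketch; the step size and the test values of $\theta$ are arbitrary choices.

```python
import math

def log_prob(x, theta):
    """ln p_theta(x) for a single Bernoulli symbol x in {0, 1}."""
    return math.log(theta) if x == 1 else math.log(1.0 - theta)

def fisher_information(theta, eps=1e-6):
    """E[(d/dtheta ln p_theta(X))^2], with the derivative taken by central differences."""
    total = 0.0
    for x, prob in ((1, theta), (0, 1.0 - theta)):
        score = (log_prob(x, theta + eps) - log_prob(x, theta - eps)) / (2 * eps)
        total += prob * score ** 2
    return total

# Each tuple holds (theta, numerical estimate, closed form 1 / (theta (1 - theta))).
checks = [(t, fisher_information(t), 1.0 / (t * (1.0 - t))) for t in (0.1, 0.3, 0.5, 0.9)]
```

The estimates agree with $\frac{1}{\theta(1-\theta)}$, which is smallest at $\theta = \frac{1}{2}$ and grows without bound as $\theta$ approaches 0 or 1.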

Therefore, the Fisher information satisfies $I(\theta) = \frac{1}{\theta(1-\theta)}$.

Example 4

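Recall the Krichevsky-Trofimov coding, which was mentioned in Example 2. Using the definition of Jeffreys' prior (26), we see that $J(\theta) \propto \frac{1}{\sqrt{\theta(1-\theta)}}$. Taking the integral over Jeffreys' prior,

$$\begin{aligned}
p_J(x^n) &= \int_0^1 \frac{c\, d\theta}{\sqrt{\theta(1-\theta)}}\, \theta^{n_x(1)} (1-\theta)^{n_x(0)} \\
&= c \int_0^1 \theta^{n_x(1) - \frac{1}{2}} (1-\theta)^{n_x(0) - \frac{1}{2}}\, d\theta \\
&= \frac{\Gamma\left(n_x(0) + \frac{1}{2}\right) \Gamma\left(n_x(1) + \frac{1}{2}\right)}{\pi\, \Gamma(n+1)},
\end{aligned} \tag{28}$$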

where we used the gamma function. It can be shown that

$$p_J(x^n) = \prod_{t=0}^{n-1} p_J(x_{t+1} \mid x_1^t), \tag{29}$$

where

$$p_J(x_{t+1} \mid x_1^t) = \frac{p_J(x_1^{t+1})}{p_J(x_1^t)}, \quad p_J(x_{t+1} = 0 \mid x_1^t) = \frac{n_x^t(0) + \frac{1}{2}}{t+1}, \quad p_J(x_{t+1} = 1 \mid x_1^t) = \frac{n_x^t(1) + \frac{1}{2}}{t+1}, \tag{30}$$

and $n_x^t(a)$ denotes the number of times the symbol $a$ appears in $x_1^t$.

Similar to before, this universal code can be implemented sequentially. It is due to Krichevsky and Trofimov [2], its redundancy satisfies Theorem 5 by Clarke and Barron [1], and it is commonly used in universal lossless compression.
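The sequential form in (29) and (30) can be sketched in a few lines and compared against the closed form in (28). The code below is an illustrative sketch for binary sequences; the function names and the example sequence are chosen here, and only the ideal code length $-\log_2 p_J(x^n)$ is computed rather than an actual arithmetic coder.

```python
import math

def kt_sequential_prob(x):
    """Product of the conditional probabilities in (30) over a binary sequence x."""
    counts = [0, 0]  # counts[a] = occurrences of symbol a seen so far
    prob = 1.0
    for t, symbol in enumerate(x):
        prob *= (counts[symbol] + 0.5) / (t + 1)
        counts[symbol] += 1
    return prob

def kt_closed_form_prob(x):
    """Closed form (28): Gamma(n0 + 1/2) Gamma(n1 + 1/2) / (pi Gamma(n + 1))."""
    n1 = sum(x)
    n0 = len(x) - n1
    log_p = (math.lgamma(n0 + 0.5) + math.lgamma(n1 + 0.5)
             - math.log(math.pi) - math.lgamma(len(x) + 1))
    return math.exp(log_p)

x = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
p_seq = kt_sequential_prob(x)
p_closed = kt_closed_form_prob(x)
ideal_length_bits = -math.log2(p_seq)
```

The sequential version only needs the running symbol counts, which is what makes the estimator convenient to pair with an arithmetic coder.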

4 Rissanen's bound

Let us consider on an intuitive level why $\frac{C_n}{n} \approx \frac{r-1}{2} \frac{\log(n)}{n}$. Expending $\frac{r-1}{2}\log(n)$ bits allows us to differentiate between $(\sqrt{n})^{r-1}$ parameter vectors. That is, we would differentiate between each of the $r-1$ parameters with $\sqrt{n}$ levels. Now consider a Bernoulli RV with (unknown) parameter $\theta$.

One perspective is that with $n$ drawings of the RV, the standard deviation in the number of 1's is $O(\sqrt{n})$. That is, $\sqrt{n}$ levels differentiate between parameter levels up to a resolution that reflects the randomness of the experiment.

A second perspective is that of coding a sequence of Bernoulli outcomes with an imprecise parameter, where it is convenient to think of a universal code in terms of first quantizing the parameter and then using that (imprecise) parameter to encode the input $x$. For the Bernoulli example, the maximum likelihood parameter $\theta_{ML}$ satisfies

$$\theta_{ML} = \arg\max_\theta \left\{\theta^{n_x(1)} (1-\theta)^{n_x(0)}\right\}, \tag{31}$$

and plugging this parameter $\theta = \theta_{ML}$ into $p_\theta(x)$ minimizes the coding length among all possible parameters, $\theta \in \Lambda$. It is readily seen that

$$\theta_{ML} = \frac{n_x(1)}{n}. \tag{32}$$

Suppose, however, that we were to encode with $\theta' = \theta_{ML} + \Delta$. Then the coding length would be

$$l_{\theta'}(x) = -\log\left((\theta')^{n_x(1)} (1-\theta')^{n_x(0)}\right). \tag{33}$$

It can be shown that this coding length is suboptimal w.r.t. $l_{\theta_{ML}}(x)$ by $n \cdot O(\Delta^2)$ bits. Keep in mind that doubling the number of parameter levels used by our universal encoder requires an extra bit to encode the extra factor of 2 in resolution. It makes sense to expend this extra bit only if it buys us at least one other bit, meaning that $n \cdot O(\Delta^2) = 1$, which implies that we encode $\theta_{ML}$ to a resolution of $1/\sqrt{n}$, corresponding to $O(\sqrt{n})$ levels. Again, this is a redundancy of $\approx \frac{1}{2}\log(n)$ bits per parameter.
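The $n \cdot O(\Delta^2)$ penalty can be observed numerically by evaluating (33) at $\theta' = \theta_{ML} + \Delta$ and subtracting the length obtained with $\theta_{ML}$ itself. In the sketch below, $\theta_{ML} = 0.3$ and the values of $n$ are hypothetical, and the counts are allowed to be non-integer for simplicity.

```python
import math

def bernoulli_length_bits(n1, n0, theta):
    """Ideal code length -log2(theta^n1 (1 - theta)^n0), as in (33)."""
    return -(n1 * math.log2(theta) + n0 * math.log2(1.0 - theta))

def mismatch_penalty_bits(n, theta_ml, delta):
    """Extra bits paid when coding with theta_ml + delta instead of theta_ml."""
    n1 = theta_ml * n
    n0 = n - n1
    return (bernoulli_length_bits(n1, n0, theta_ml + delta)
            - bernoulli_length_bits(n1, n0, theta_ml))

theta_ml = 0.3
sizes = (100, 10_000, 1_000_000)
# Resolution 1/sqrt(n): the penalty stays bounded as n grows.
penalties = {n: mismatch_penalty_bits(n, theta_ml, 1.0 / math.sqrt(n)) for n in sizes}
# Fixed resolution: the penalty grows linearly in n.
fixed_delta = {n: mismatch_penalty_bits(n, theta_ml, 0.05) for n in sizes}
```

With $\Delta = 1/\sqrt{n}$ the penalty remains a few bits regardless of $n$, whereas a fixed $\Delta$ costs a number of bits proportional to $n$; this is the trade-off behind quantizing to roughly $\sqrt{n}$ levels.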


Having described Rissanen's result intuitively, let us formalize matters. Consider $\{p_\theta,\ \theta \in \Lambda\}$, where $\Lambda \subset \mathbb{R}^K$ is a compact set. Suppose that there exists an estimator $\hat{\theta}$ such that

$$\forall n \ge n(c): \quad p_\theta\left\{\left\|\hat{\theta}(x^n) - \theta\right\| > \frac{c}{\sqrt{n}}\right\} \le \delta(c), \tag{34}$$

where $\lim_{c\to\infty} \delta(c) = 0$. Then we have the following converse result.

Theorem 6 (Converse to universal coding [5]) Given a parametric class that satisfies the above condition (34), for all $\epsilon > 0$ and all codes $l$ that do not know $\theta$,

$$r_n(l, \theta) \ge (1-\epsilon) \frac{K}{2} \frac{\log(n)}{n}, \tag{35}$$

except for a class of $\theta$ in $B_\epsilon(n) \subseteq \Lambda$ whose Lebesgue volume shrinks to zero as $n$ increases.

That is, a universal code cannot compress at a redundancy substantially below $\frac{1}{2}\log(n)$ bits per parameter. Rissanen also proved the following achievable result in his seminal paper.

Theorem 7 (Achievable to universal coding [5]) If $p_\theta(x^n)$ is twice differentiable in $\theta$ for every $x^n$, then there exists a universal code such that $\forall \theta \in \Lambda:\ r_n(l, \theta) \le (1+\epsilon) \frac{K}{2} \frac{\log(n)}{n}$.

5 Universal coding for piecewise i.i.d. sources

We have emphasized stationary parametric classes, but a parametric class can be nonstationary. Let us show how universal coding can be achieved for some nonstationary classes of sources by providing an example.

Consider $\Lambda = \{0, 1, \ldots, n\}$ where

$$p_\theta(x^n) = Q_1\left(x_1^\theta\right) \cdot Q_2\left(x_{\theta+1}^n\right), \tag{36}$$

where $Q_1$ and $Q_2$ are both known i.i.d. sources. This is a piecewise i.i.d. source; in each segment it is i.i.d., and there is an abrupt transition in statistics when the first segment ends and the second begins.

Here are two approaches to coding this source.

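1. Encode the best index $\theta_{ML}$ using $\lceil \log(n+1) \rceil$ bits, then encode $x^n$ using $p_{\theta_{ML}}(x^n)$. This is known as a two-part code or plug-in; after encoding the index, we plug the best parameter into the distribution. Clearly,

$$\begin{aligned}
l(x) &= \min_{0 \le \theta \le n} \lceil -\log p_\theta(x) \rceil + \lceil \log(n+1) \rceil \\
&\le -\log p_{\theta_{ML}}(x) + \log(n+1) + 2.
\end{aligned} \tag{37}$$

2. The second approach is a mixture, where we allocate weights for all possible parameters,

$$\begin{aligned}
l(x) &= -\log\left(\frac{1}{n+1} \sum_{i=0}^{n} p_i(x^n)\right) \\
&< -\log\left(\frac{1}{n+1}\, p_{\theta_{ML}}(x^n)\right) \\
&= -\log\left(p_{\theta_{ML}}(x)\right) + \log(n+1).
\end{aligned} \tag{38}$$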

Merhav [3] provided redundancy theorems for this class of sources. Algorithmic approaches to the mixture appear in Shamir and Merhav [6] and Willems [7].
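The plug-in length of (37) and the mixture length of (38) can be compared on a small example. The sketch below assumes binary sources, with $Q_1$ emitting ones with probability 0.1 and $Q_2$ with probability 0.9; these parameters, the example sequence, and the function names are hypothetical. The mixture length is the ideal (non-integer) one.

```python
import math

def segment_log2_prob(bits, q_one):
    """log2 probability of a binary segment under an i.i.d. Bernoulli(q_one) source."""
    ones = sum(bits)
    return ones * math.log2(q_one) + (len(bits) - ones) * math.log2(1.0 - q_one)

def piecewise_log2_probs(x, q1_one, q2_one):
    """log2 p_theta(x) from (36) for every transition index theta in {0, ..., n}."""
    return [segment_log2_prob(x[:theta], q1_one) + segment_log2_prob(x[theta:], q2_one)
            for theta in range(len(x) + 1)]

def two_part_length(log2_probs):
    """Plug-in length of (37): integer-length index plus integer-length data part."""
    n = len(log2_probs) - 1
    return math.ceil(-max(log2_probs)) + math.ceil(math.log2(n + 1))

def mixture_length(log2_probs):
    """Ideal mixture length of (38) with uniform weights 1 / (n + 1)."""
    n = len(log2_probs) - 1
    top = max(log2_probs)
    # Base-2 log-sum-exp keeps the sum numerically stable.
    log2_sum = top + math.log2(sum(2.0 ** (lp - top) for lp in log2_probs))
    return math.log2(n + 1) - log2_sum

# Hypothetical example: the statistics change after the 12th symbol.
x = [0] * 12 + [1] * 8
lps = piecewise_log2_probs(x, 0.1, 0.9)
theta_ml = max(range(len(lps)), key=lambda t: lps[t])
```

On this sequence `theta_ml` recovers the true transition index, and the mixture length is no larger than the two-part length, with both staying within about $\log(n+1)$ bits of $-\log p_{\theta_{ML}}(x)$.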

The theme that is common to both approaches, the plug-in and the mixture, is that they lose approximately $\log(n)$ bits in encoding the location of the transition. Indeed, Merhav showed that the penalty for each transition in universal coding is approximately $\log(n)$ bits [3]. Intuitively, the reason that the redundancy required to encode the location of the transition is larger than the $\frac{1}{2}\log(n)$ from Rissanen [5] is that the location of the transition must be described precisely to prevent paying a big coding length penalty in encoding segments using the wrong i.i.d. statistics. In contrast, in encoding our Bernoulli example an imprecision of $\frac{1}{\sqrt{n}}$ in encoding $\theta_{ML}$ in the first part of the code yields only an $O(1)$ bit penalty in the second part of the code.

It is well known that mixtures out-compress the plug-in. However, in many cases they do so by only a small amount per parameter. For example, Baron et al. showed that the plug-in for i.i.d. sources loses approximately 1 bit per parameter w.r.t. the mixture.

References

[1] B.S. Clarke and A.R. Barron. Jeffreys' prior is asymptotically least favorable under entropy risk. J. Stat. Planning Inference, 41(1):37-60, 1994.

[2] R. Krichevsky and V. Trofimov. The performance of universal encoding. IEEE Trans. Inf. Theory, 27(2):199-207, 1981.

[3] N. Merhav. On the minimum description length principle for sources with piecewise constant parameters. IEEE Trans. Inf. Theory, 39(6):1962-1967, 1993.

[4] N. Merhav and M. Feder. A strong version of the redundancy-capacity theorem of universal coding. IEEE Trans. Inf. Theory, 41(3):714-722, 1995.

[5] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory, 30(4):629-636, Jul. 1984.

[6] G.I. Shamir and N. Merhav. Low-complexity sequential lossless coding for piecewise-stationary memoryless sources. IEEE Trans. Inf. Theory, 45(5):1498-1519, 1999.

[7] F.M.J. Willems. Coding for a binary independent piecewise-identically-distributed source. IEEE Trans. Inf. Theory, 42(6):2210-2217, 1996.
