Universal coding for classes of sources ∗
Denver Greene
This work is produced by The Connexions Project and licensed under the Creative Commons Attribution License †
We have discussed several parametric sources, and will now start developing mathematical tools to investigate the properties of universal codes that offer universal compression with respect to a class of parametric sources.
1 Preliminaries
Consider a class Λ of parametric models, where the parameter set θ characterizes the distribution of a specific source within this class, $\{p_\theta(\cdot),\ \theta \in \Lambda\}$.
Example 1
Consider the class of memoryless sources over an alphabet $\alpha = \{1, 2, \ldots, r\}$. Here we have
\[ \theta = \{p(1), p(2), \ldots, p(r-1)\}. \tag{1} \]
The goal is to find a fixed-to-variable length lossless code that is independent of θ, which is unknown, yet achieves
\[ E_\theta\left[\frac{l(X_1^n)}{n}\right] \xrightarrow{n\to\infty} H_\theta(X), \tag{2} \]
where the expectation is taken w.r.t. the distribution implied by θ. We have seen for
\[ p(x) = \frac{1}{2}\, p_1(x) + \frac{1}{2}\, p_2(x) \tag{3} \]
that a code that is good for both sources (distributions) $p_1$ and $p_2$ exists, modulo the one-bit loss incurred here (see footnote 1). As an expansion beyond this idea, consider
\[ p(x) = \int_\Lambda dw(\theta)\, p_\theta(x), \tag{4} \]
where $w(\theta)$ is a prior.
Example 2
Let us revisit the memoryless source, choose r = 2, and define the scalar parameter
∗ Version 1.2: May 16, 2013 12:48 pm -0500
† http://creativecommons.org/licenses/by/3.0/
1 "Source models", (2) <http://cnx.org/content/m46231/latest/#uid1>
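\[ \theta = \Pr(X_i = 1) = 1 - \Pr(X_i = 0). \tag{5} \]
Then
\[ p_\theta(x) = \theta^{n_x(1)} (1-\theta)^{n_x(0)}, \tag{6} \]
where $n_x(a)$ denotes the number of times the symbol $a$ appears in $x$, and
\[ p(x) = \int_0^1 d\theta\; \theta^{n_x(1)} (1-\theta)^{n_x(0)}. \tag{7} \]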
Moreover, it can be shown that
\[ p(x) = \frac{n_x(0)!\; n_x(1)!}{(n+1)!}; \tag{8} \]
this result appears in Krichevsky and Trofimov [2].
Is the source X implied by the distribution p(x) an ergodic source? Consider the event $\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} X_i \le \frac{1}{2}$. Owing to symmetry, in the limit of large n the probability of this event under p(x) must be 1/2,
\[ \Pr\left\{ \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} X_i \le \frac{1}{2} \right\} = \frac{1}{2}. \tag{9} \]
On the other hand, recall that an ergodic source must allocate probability 0 or 1 to this flavor of event.
Therefore, the source implied by p(x) is not ergodic.
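To illustrate the argument at finite n, the sketch below computes the probability that the fraction of ones is at most 1/2, once under the mixture p(x) of (8) and once under a fixed Bernoulli(θ) source. It is only an illustration under stated assumptions: the function names are hypothetical, and a finite-n probability stands in for the limiting event in (9).

```python
from fractions import Fraction
from math import comb, exp, factorial, lgamma, log


def prob_under_mixture(n: int) -> Fraction:
    """Exact Pr{(1/n) * sum(X_i) <= 1/2} under the mixture p(x) of (8)."""
    total = Fraction(0)
    for k in range(n // 2 + 1):  # k = number of ones, with k / n <= 1/2
        seq_prob = Fraction(factorial(k) * factorial(n - k), factorial(n + 1))
        total += comb(n, k) * seq_prob  # comb(n, k) sequences share this probability
    return total


def prob_under_fixed_theta(n: int, theta: float) -> float:
    """Same event for an i.i.d. Bernoulli(theta) source, assuming 0 < theta < 1."""
    total = 0.0
    for k in range(n // 2 + 1):
        # log of C(n, k) * theta^k * (1 - theta)^(n - k), to avoid overflow
        log_term = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                    + k * log(theta) + (n - k) * log(1.0 - theta))
        total += exp(log_term)
    return total


if __name__ == "__main__":
    for n in (11, 101, 1001):
        print(n, float(prob_under_mixture(n)),
              prob_under_fixed_theta(n, 0.3), prob_under_fixed_theta(n, 0.7))
```

Under the mixture the probability stays near 1/2 as n grows, whereas for a fixed θ different from 1/2 it moves toward 0 or 1, which is the contrast the non-ergodicity argument relies on.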
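Recall the definitions of $p_\theta(x)$ and $p(x)$ in (6) and (7), respectively. Based on these definitions, consider the following,
\[
\begin{aligned}
H_\theta(X_1^n) &= -\sum_{x_1^n \in A^n} p_\theta(x_1^n) \log p_\theta(x_1^n) = H(X_1^n \mid \Theta = \theta), \\
H(X_1^n) &= -\sum_{x_1^n \in A^n} p(x_1^n) \log p(x_1^n), \\
H(X_1^n \mid \Theta) &= \int_\Lambda dw(\theta)\, H(X_1^n \mid \Theta = \theta),
\end{aligned}
\tag{10}
\]
where the sums run over all length-$n$ sequences over the source alphabet $A$.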
We get the following quantity for the mutual information between the random variable Θ and the random sequence $X_1^n$,
\[ I(\Theta; X_1^n) = H(X_1^n) - H(X_1^n \mid \Theta). \tag{11} \]
Note that this quantity represents the gain in bits obtained by knowing the parameter θ; more about this quantity will be mentioned later.
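As a concrete instance of (10) and (11), the sketch below evaluates the mutual information for the Bernoulli class with a uniform prior, using the closed form (8) for the mixture. It is an illustrative sketch only: the function names are hypothetical, entropies are measured in bits, and the integral over θ is approximated with a midpoint rule.

```python
from math import lgamma, log

LN2 = log(2.0)


def log2_binom(n: int, k: int) -> float:
    """log2 of the binomial coefficient C(n, k)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / LN2


def entropy_of_mixture_bits(n: int) -> float:
    """H(X_1^n) in bits when p(x) is the uniform-prior mixture (8)."""
    # A sequence with k ones has probability 1 / ((n + 1) * C(n, k)), and the
    # C(n, k) such sequences together carry probability 1 / (n + 1).
    log2_n_plus_1 = log(n + 1) / LN2
    return sum(log2_n_plus_1 + log2_binom(n, k) for k in range(n + 1)) / (n + 1)


def conditional_entropy_bits(n: int, steps: int = 100_000) -> float:
    """H(X_1^n | Theta) = n * integral over [0, 1] of h2(theta), midpoint rule."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += -(t * log(t) + (1.0 - t) * log(1.0 - t)) / LN2
    return n * total * h


def mutual_information_bits(n: int) -> float:
    """I(Theta; X_1^n) of (11), in bits."""
    return entropy_of_mixture_bits(n) - conditional_entropy_bits(n)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(mutual_information_bits(n), 4))
```

The values grow slowly with n, consistent with the sequence revealing more and more about θ while the per-symbol gain shrinks.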
2 Redundancy
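We now define the conditional redundancy,
\[ r_n(l, \theta) = \frac{1}{n} \left[ E_\theta(l(X_1^n)) - H_\theta(X_1^n) \right], \tag{12} \]
which quantifies how far a coding length function $l$ is from the entropy when the parameter θ is known. Note that
\[ E_p[l(X_1^n)] = \int_\Lambda dw(\theta)\, E_\theta(l(X_1^n)) \ge H(X_1^n \mid \Theta), \tag{13} \]
where $E_p$ denotes expectation under the mixture $p$ of (4).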
Denote by $C_n$ the collection of lossless codes for length-$n$ inputs, and define the expected redundancy of a code $l \in C_n$ by
\[ R_n^-(w, l) = \int_\Lambda dw(\theta)\, r_n(l, \theta), \qquad R_n^-(w) = \inf_{l \in C_n} R_n^-(w, l). \tag{14} \]
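The definitions (12) and (14) can be evaluated for a concrete code. The sketch below assumes, purely for illustration, a code with Shannon lengths $\lceil -\log_2 p(x) \rceil$ matched to the Bernoulli mixture (8) and a uniform prior w; the function names are hypothetical and this particular code is not claimed to be the optimal one in (14).

```python
from math import comb, log2


def h2(p: float) -> float:
    """Binary entropy in bits, with 0 * log(0) taken as 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def shannon_length(n: int, k: int) -> int:
    """ceil(-log2 p(x)) for a sequence with k ones, p(x) as in (8) (assumed code)."""
    inverse_prob = (n + 1) * comb(n, k)  # 1 / p(x) is an integer here
    return (inverse_prob - 1).bit_length()  # smallest L with 2**L >= inverse_prob


def conditional_redundancy(n: int, theta: float) -> float:
    """r_n(l, theta) of (12), in bits per symbol."""
    expected_length = sum(
        comb(n, k) * theta ** k * (1.0 - theta) ** (n - k) * shannon_length(n, k)
        for k in range(n + 1)
    )
    return (expected_length - n * h2(theta)) / n


def expected_redundancy(n: int, steps: int = 2_000) -> float:
    """R_n^-(w, l) of (14) for the uniform prior w, by the midpoint rule."""
    h = 1.0 / steps
    return h * sum(conditional_redundancy(n, (i + 0.5) * h) for i in range(steps))


if __name__ == "__main__":
    n = 20
    for theta in (0.0, 0.1, 0.5):
        print(theta, round(conditional_redundancy(n, theta), 4))
    print("expected redundancy:", round(expected_redundancy(n), 4))
```

Because these lengths satisfy the Kraft inequality, the conditional redundancy is non-negative for every θ; the expected redundancy simply averages it over the prior.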
The asymptotic expected redundancy follows,
\[ R^-(w) = \lim_{n\to\infty} R_n^-(w), \tag{15} \]
assuming that the limit exists.
We can also define the minimum redundancy that incorporates the worst prior for the parameter,
\[ R_n^- = \sup_{w \in W} R_n^-(w), \tag{16} \]
while keeping the best code. Similarly,
\[ R^- = \lim_{n\to\infty} R_n^-. \tag{17} \]
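Let us derive $R_n^-$,
\[
\begin{aligned}
R_n^- &= \sup_w \inf_l \int_\Lambda dw(\theta)\, \frac{1}{n} \left[ E_\theta(l(X_1^n)) - H(X_1^n \mid \Theta = \theta) \right] \\
&= \sup_w \inf_l \frac{1}{n} \left[ E_p(l(X_1^n)) - H(X_1^n \mid \Theta) \right] \\
&= \sup_w \frac{1}{n} \left[ H(X_1^n) - H(X_1^n \mid \Theta) \right] \\
&= \sup_w \frac{1}{n} I(\Theta; X_1^n) = \frac{C_n}{n},
\end{aligned}
\tag{18}
\]
where $C_n$ is the capacity of the channel whose input is the parameter θ and whose output is the sequence x [4]. That is, we try to estimate the parameter from the output of a noisy channel. The third equality takes the best code for the mixture $p$ to have expected length $H(X_1^n)$, which ignores the integer-length overhead of at most one bit.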
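In an analogous manner, we define
\[
\begin{aligned}
R_n^+ &= \inf_l \sup_{\theta \in \Lambda} r_n(l, \theta) \\
&= \inf_l \sup_\theta \frac{1}{n} E_\theta \left[ \log \frac{p_\theta(x^n)}{2^{-l(x^n)}} \right] \\
&= \inf_Q \sup_\theta \frac{1}{n} D(P_\theta \| Q),
\end{aligned}
\tag{19}
\]
where $Q$ is the distribution induced by the coding length function $l$, i.e., $Q(x^n) = 2^{-l(x^n)}$.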
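The inner supremum in (19) can be evaluated numerically for a given Q. The sketch below does so for the Bernoulli class on a grid of θ values, for two candidate distributions: the uniform-prior mixture (8) and, as an assumption brought in only for comparison, the Beta(1/2, 1/2) mixture. It does not carry out the outer infimum over Q, and all function names are hypothetical.

```python
from math import exp, lgamma, log, pi

LN2 = log(2.0)


def log_binom(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_q_uniform(n: int, k: int) -> float:
    """ln Q(x^n) for a sequence with k ones under the uniform-prior mixture (8)."""
    return -(log(n + 1) + log_binom(n, k))


def log_q_beta_half(n: int, k: int) -> float:
    """ln Q(x^n) under a Beta(1/2, 1/2) prior (assumed comparison candidate)."""
    return lgamma(k + 0.5) + lgamma(n - k + 0.5) - lgamma(n + 1) - log(pi)


def divergence_bits(n: int, theta: float, log_q) -> float:
    """D(P_theta || Q) in bits for length-n i.i.d. Bernoulli(theta) sequences."""
    total = 0.0
    for k in range(n + 1):
        # Skip sequences of probability zero (0 * log 0 is taken as 0).
        if (theta == 0.0 and k > 0) or (theta == 1.0 and k < n):
            continue
        log_p_seq = ((k * log(theta) if k else 0.0)
                     + ((n - k) * log(1.0 - theta) if n - k else 0.0))
        weight = exp(log_binom(n, k) + log_p_seq)  # Pr{k ones} under theta
        total += weight * (log_p_seq - log_q(n, k))
    return total / LN2


def worst_case_redundancy(n: int, log_q, grid: int = 200) -> float:
    """max over a theta grid of (1/n) * D(P_theta || Q), in bits per symbol."""
    return max(divergence_bits(n, i / grid, log_q) for i in range(grid + 1)) / n


if __name__ == "__main__":
    n = 100
    print("uniform mixture  :", worst_case_redundancy(n, log_q_uniform))
    print("Beta(1/2) mixture:", worst_case_redundancy(n, log_q_beta_half))
```

For the uniform mixture the worst case sits at the endpoints θ = 0 and θ = 1, where the divergence equals $\log(n+1)$; a prior that places more weight near the endpoints lowers that worst case.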
3 Minimal redundancy
Note that
\[ \forall w, l: \quad \sup_\theta r_n(l, \theta) \ge \int_\Lambda w(d\theta)\, r_n(l, \theta) \ge \inf_{l \in C_n} \int_\Lambda w(d\theta)\, r_n(l, \theta). \tag{20} \]
Therefore,
\[ R_n^+ = \inf_l \sup_\theta r_n(l, \theta) \ge \sup_w \inf_l \int_\Lambda w(d\theta)\, r_n(l, \theta) = R_n^-. \tag{21} \]
In fact, Gallager showed that $R_n^+ = R_n^-$. That is, the min-max and max-min redundancies are equal.
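Let us revisit the Bernoulli source $p_\theta$ where $\theta \in \Lambda = [0, 1]$. From the definition of (7), which relies on a uniform prior for the sources, i.e., $w(\theta) = 1$, $\forall \theta \in \Lambda$, it can be shown that there exists a universal code with length function $l$ such that
\[ E_\theta[l(x^n)] \le n E_\theta\left[ h_2\left( \frac{n_x(1)}{n} \right) \right] + \log(n+1) + 2, \tag{22} \]
where $h_2(p) = -p \log(p) - (1-p) \log(1-p)$ is the binary entropy. Because $h_2$ is concave, Jensen's inequality gives $E_\theta[h_2(n_x(1)/n)] \le h_2(\theta)$, so the expected length exceeds $n h_2(\theta)$ by at most $\log(n+1) + 2$ bits. That is, the redundancy is approximately $\log(n)$ bits. Clarke and Barron [1] studied the weighting approach,
\[ p(x) = \int_\Lambda dw(\theta)\, p_\theta(x), \tag{23} \]
and constructed a prior that achieves $R_n^- = R_n^+$ precisely for memoryless sources.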
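The bound (22) can be checked numerically for one specific code. The sketch below assumes Shannon code lengths $\lceil -\log_2 p(x) \rceil$ for the uniform-prior mixture (8); this is an illustrative choice with hypothetical function names, not necessarily the code the bound was derived for.

```python
from math import comb, log2


def h2(p: float) -> float:
    """Binary entropy in bits, with 0 * log(0) taken as 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def shannon_length(n: int, k: int) -> int:
    """ceil(-log2 p(x)) for a sequence with k ones, p(x) as in (8) (assumed code)."""
    inverse_prob = (n + 1) * comb(n, k)
    return (inverse_prob - 1).bit_length()


def both_sides_of_bound(n: int, theta: float) -> tuple[float, float]:
    """Return (E_theta[l], n * E_theta[h2(n_x(1)/n)] + log2(n + 1) + 2)."""
    pmf = [comb(n, k) * theta ** k * (1.0 - theta) ** (n - k) for k in range(n + 1)]
    lhs = sum(pmf[k] * shannon_length(n, k) for k in range(n + 1))
    rhs = n * sum(pmf[k] * h2(k / n) for k in range(n + 1)) + log2(n + 1) + 2
    return lhs, rhs


if __name__ == "__main__":
    n = 50
    for theta in (0.0, 0.05, 0.3, 0.5):
        lhs, rhs = both_sides_of_bound(n, theta)
        # Excess over the entropy n * h2(theta), to compare with log2(n + 1) + 2
        print(theta, round(lhs, 3), round(rhs, 3), round(lhs - n * h2(theta), 3))
```

The last printed column is the total redundancy in bits for that θ, which can be compared against $\log(n+1) + 2$.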
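Theorem 5 [1] For a memoryless source with an alphabet of size $r$, $\theta = (p(0), p(1), \cdots, p(r-1))$,
\[ n R_n^-(w) = \frac{r-1}{2} \log \frac{n}{2\pi e} + \int_\Lambda w(d\theta) \log \left( \frac{\sqrt{|I(\theta)|}}{w(\theta)} \right) + o_n(1), \tag{24} \]
where $o_n(1)$ vanishes uniformly as $n \to \infty$ for any compact subset of Λ, and
\[ I(\theta) \triangleq E \left[ \left( \frac{\partial \ln p_\theta(x_i)}{\partial \theta} \right) \left( \frac{\partial \ln p_\theta(x_i)}{\partial \theta} \right)^T \right] \tag{25} \]
is Fisher's information.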
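As a sanity check of (24) in the simplest case, the sketch below compares the two leading terms of the formula with an exact computation for the Bernoulli class (r = 2) under a uniform prior. Two assumptions are made: $n R_n^-(w)$ is identified with the mutual information $I(\Theta; X_1^n)$, as in the derivation (18), and the per-symbol Fisher information of the Bernoulli source is taken to be $I(\theta) = 1/(\theta(1-\theta))$. The function names are hypothetical.

```python
from math import e, lgamma, log, pi

LN2 = log(2.0)


def exact_mutual_information_bits(n: int) -> float:
    """I(Theta; X_1^n) in bits for the uniform-prior Bernoulli mixture (8)."""
    log_binoms = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                  for k in range(n + 1))
    entropy_nats = log(n + 1) + sum(log_binoms) / (n + 1)  # H(X_1^n)
    conditional_nats = n / 2.0  # n * integral of h2 over [0, 1], which is 1/2 nat
    return (entropy_nats - conditional_nats) / LN2


def asymptotic_bits(n: int, steps: int = 100_000) -> float:
    """Leading terms of (24) for r = 2, w(theta) = 1, I(theta) = 1/(theta(1-theta))."""
    h = 1.0 / steps
    integral_nats = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        integral_nats += -0.5 * log(t * (1.0 - t))  # ln(sqrt(I(theta)) / w(theta))
    integral_nats *= h
    return (0.5 * log(n / (2.0 * pi * e)) + integral_nats) / LN2


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(exact_mutual_information_bits(n), 4), round(asymptotic_bits(n), 4))
```

The gap between the two columns shrinks as n grows, in line with the vanishing remainder term in (24).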
Note that when the source distribution is sensitive to changes in the parameter, $I(\theta)$ is large, which increases the redundancy.
That is, good sensitivity means bad universal compression.
Denote
\[ J(\theta) = \frac{\sqrt{|I(\theta)|}}{\int_\Lambda \sqrt{|I(\theta')|}\, d\theta'}; \tag{26} \]
this is known as Jeffreys' prior. Using $w(\theta) = J(\theta)$, it can be shown that $R_n^- = R_n^+$.
Example 3
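Let us derive the Fisher information $I(\theta)$ for the Bernoulli source,
\[
\begin{aligned}
& p_\theta(x) = \theta^{n_x(1)} (1-\theta)^{n_x(0)} \\
\Rightarrow\ & \ln p_\theta(x) = n_x(1) \ln\theta + n_x(0) \ln(1-\theta) \\
\Rightarrow\ & \frac{\partial \ln p_\theta(x)}{\partial\theta} = \frac{n_x(1)}{\theta} - \frac{n_x(0)}{1-\theta} \\
\Rightarrow\ & \left( \frac{\partial \ln p_\theta(x)}{\partial\theta} \right)^2 = \frac{n_x^2(1)}{\theta^2} + \frac{n_x^2(0)}{(1-\theta)^2} - \frac{2\, n_x(1)\, n_x(0)}{\theta(1-\theta)}
\end{aligned}
\]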
⇒ E
∂lnp
θ