empty dictionary.
output   new dictionary entry
c        1. c
a        2. a
aa       3. aa
aab      4. aab
aabc     5. aabc
aabb     6. aabb

The decoded text is caaaaabaabcaabb.
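The decoding steps tabulated above can be sketched in a few lines of Python. This is only an illustration: the (index, symbol) input pairs are an assumption, inferred from the dictionary entries listed in the table rather than taken from the text, and the function name `lz78_decode` is hypothetical.

```python
def lz78_decode(pairs):
    """Decode LZ78 (index, symbol) pairs; index 0 refers to the empty string."""
    dictionary = [""]            # entry 0 is the empty string
    output = []
    for index, symbol in pairs:
        entry = dictionary[index] + symbol   # known prefix plus one new symbol
        dictionary.append(entry)             # becomes the next dictionary entry
        output.append(entry)
    return "".join(output), dictionary[1:]

# Assumed input, inferred from the dictionary entries in the table.
pairs = [(0, "c"), (0, "a"), (2, "a"), (3, "b"), (4, "c"), (4, "b")]
text, entries = lz78_decode(pairs)
print(text)      # caaaaabaabcaabb
print(entries)   # ['c', 'a', 'aa', 'aab', 'aabc', 'aabb']
```

Each step outputs exactly the string it adds to the dictionary, which is why the two columns of the table coincide.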
3.11 Other types of Compression
We have only considered compression methods that are lossless: there is no information lost. There are lossy methods of compression that can typically achieve better compression at the expense of losing (usually superfluous) information. For example, JPEG and MPEG use lossy compression. The difficulty with lossy compression is that you have to choose which information to lose. With images and music this is usually done so that the human eye or ear cannot detect the loss: you drop subtle colour differences or extreme frequency values for example. Once these extremes are removed, a Huffman, arithmetic or LZ78 coding of what remains will be much more efficient.
See www.whydomath.org under wavelets for more information.
Chapter 4
Information Theory
As usual we have a source
$S = \{s_1, s_2, \dots, s_q\}$ with probabilities $p_1, p_2, \dots, p_q$.
If we receive a symbol from this source, how much “information” do we receive?
If
\[ p_1 = 1 \quad\text{and}\quad p_2 = \dots = p_q = 0 \]
and $s_1$ is received then we are not at all surprised and we receive no “information”.
But if
\[ p_1 = \epsilon > 0 \quad\text{(very small)} \]
and $s_1$ arrives we are very surprised and receive lots of “information”.
“Information” will measure the amount of surprise or uncertainty about receiving a given symbol: it is somehow inversely related to probability.
Let I(p) be the information contained in an event of probability p.
We make some natural assumptions about I(p) and see where these assumptions lead us.
Assume that
(a) $I(p) \ge 0$ for $0 \le p \le 1$,
(b) $I(p_1 p_2) = I(p_1) + I(p_2)$ for independent events of probability $p_1$ and $p_2$,
(c) $I(p)$ is a continuous function of $p$.
Now if we independently repeat the same event we get
\[ I(p^2) = I(p \cdot p) = I(p) + I(p) = 2I(p) \]
and by induction, for each positive integer $n$ we have
\[ I(p^n) = nI(p). \]
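Next, let $m, n$ be positive integers and let
\[ p^n = x, \quad\text{so}\quad p = x^{1/n}. \]
Then
\[ I(x) = nI(x^{1/n}) \]
and multiplying by $m$ (and using $mI(x^{1/n}) = I(x^{m/n})$) gives
\[ mI(x) = nI(x^{m/n}). \]
Hence
\[ I(x^{m/n}) = \frac{m}{n} I(x) \]
and we have
\[ I(x^{\alpha}) = \alpha I(x) \quad\text{for all positive rational numbers } \alpha. \]
Then by continuity we have
\[ I(x^{\alpha}) = \alpha I(x) \quad\text{for all real } \alpha > 0. \]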
Hence I(x) behaves like a logarithm.
In fact it can be shown that the only function which satisfies (a), (b), (c) is $I(x) = k \ln x$, where $k < 0$ is a constant.
For the binary case (radix 2), define for $0 < p \le 1$,
\[ I(p) = I_2(p) = -\log_2 p = -\frac{1}{\ln 2} \ln p = \log_2 \frac{1}{p}. \]
Here we have chosen the scale factor
\[ k = -\frac{1}{\ln 2}. \]
With this scale factor we say that information is measured in “binary units” of information. Some books shorten this to binits, but like many books we will call them bits.
Note, however, that this is a different use of the word from the earlier one, where bits meant binary digits.
For general radix $r$, define
\[ I_r(p) = -\log_r p = -\frac{1}{\ln r} \ln p = \log_r \frac{1}{p}, \]
measured in radix $r$ units. Here the scale factor used is
\[ k = -\frac{1}{\ln r}. \]
The reason for the differing scale factors will be clearer later on.
We will always use radix 2, and hence log2, unless specifically stated.
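As a small illustration of the definitions above, the following Python sketch evaluates $I_r(p)$ and checks the additivity assumption (b) numerically. The function name `information` and its `radix` parameter are hypothetical names chosen here, not notation from the text.

```python
import math

def information(p, radix=2):
    """Information I_r(p) = log_r(1/p) of an event of probability p, 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")
    return math.log(1 / p) / math.log(radix)

# A certain event carries no information; a rare one carries a lot.
print(information(1.0))                  # 0.0
print(round(information(0.05), 3))       # 4.322 (binary units)

# Assumption (b): information adds for independent events.
p1, p2 = 0.2, 0.5
print(math.isclose(information(p1 * p2), information(p1) + information(p2)))
```

Changing `radix` only rescales the result by the constant factor $1/\ln r$, in line with the scale factor $k$ above.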
Now for our source $S = \{s_1, \dots, s_q\}$ we write
\[ I(s_j) = I(p_j) = -\log_2 p_j \]
for the information contained in observing or receiving the symbol $s_j$ of probability $p_j$.
Define the entropy of the source $S$ to be the average or expected information per symbol,
\[ H(S) = H_2(S) = \sum_{i=1}^{q} p_i I(p_i) = -\sum_{i=1}^{q} p_i \log_2 p_i \quad\text{(binary)} \]
and for radix $r$
\[ H_r(S) = -\sum_{i=1}^{q} p_i \log_r p_i . \]
(We often omit the subscript $r$ when it is clear what the radix/log base should be.)
Example 4.1 Toss a coin. This has 2 outcomes h, t of probability $\frac{1}{2}, \frac{1}{2}$, so the entropy is
\[ H = H(S) = -\frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{2} \log_2 \frac{1}{2} = \log_2 2 = 1 \text{ bit/symbol.} \]
Example 4.2 Toss a biased coin with P (h) = 0.05, P (t) = 0.95. Equivalently, observe a binary source in which P (0) = 0.05, P (1) = 0.95. This gives an entropy of
\[ H = -0.05 \log_2(0.05) - 0.95 \log_2(0.95) = 0.286 \text{ bits/symbol.} \]
Example 4.3 Our initial example on Huffman coding in Chapter 3 had probabilities 0.3, 0.2, 0.2, 0.1, 0.1, 0.1 and so has entropy
\[ H = -0.3 \log_2(0.3) - 2 \times 0.2 \log_2(0.2) - 3 \times 0.1 \log_2(0.1) = 2.446 \text{ bits/symbol (in binary units).} \]
The Huffman code for this had L = 2.5 bits/symbol (binary digits), which is close to the H = 2.446 bits/symbol (binary units of information).
As we will see later, this relationship H < L is no accident, and is related to the dual use of bits/symbol (being the units for code length on the one hand, and units of information on the other).
Example 4.4 Our initial example on arithmetic coding in section 3.10.1 had probabilities 0.4, 0.2, 0.2, 0.1, 0.1 (including the stop symbol) and we encoded a 10-symbol message into a 7-digit decimal number. The decimal entropy of this source is 0.639 digits per symbol, so a 10-symbol message carries about 6.39 decimal digits of information on average. The 7-digit encoding is therefore as close to the entropy as a whole number of digits can get.
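The entropies quoted in Examples 4.1 to 4.4 can be reproduced with a short Python sketch. The function name `entropy` and its `radix` parameter are hypothetical names used only for this illustration; terms with $p = 0$ are skipped, which assumes the usual convention that they contribute nothing.

```python
import math

def entropy(probs, radix=2):
    """Entropy H_r = sum of p * log_r(1/p), skipping zero-probability symbols."""
    return sum(p * math.log(1 / p) / math.log(radix) for p in probs if p > 0)

fair_coin = [0.5, 0.5]                             # Example 4.1
biased_coin = [0.05, 0.95]                         # Example 4.2
huffman_source = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]    # Example 4.3
arithmetic_source = [0.4, 0.2, 0.2, 0.1, 0.1]      # Example 4.4

print(round(entropy(fair_coin), 3))                     # 1.0
print(round(entropy(biased_coin), 3))                   # 0.286
print(round(entropy(huffman_source), 3))                # 2.446
print(round(entropy(arithmetic_source, radix=10), 3))   # 0.639
```

The last line uses radix 10 because Example 4.4 measures information in decimal digits rather than binary units.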
4.1 Some mathematical preliminaries
1. Consider the function $f(x) = -x \log_r x$ for $0 < x \le 1$. What is its graph?
Ignoring scale factors consider
\[ y = f(x) = -x \ln x \]
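[Figure: graph of $y = -x \ln x$ against $x$, with $0$, $1/e$ and $1$ marked on the $x$-axis and $1/e$ marked on the $y$-axis.]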
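Here $\frac{dy}{dx} = -1 - \ln x$ is infinite at $x = 0$ and the graph has a maximum when $-1 = \ln x$; that is, when $x = e^{-1}$. Also
\begin{align*}
\lim_{x \to 0^+} f(x) &= \lim_{x \to 0^+} \frac{-\ln x}{1/x} \\
&= \lim_{x \to 0^+} \frac{-1/x}{-1/x^2} \quad\text{by L'Hôpital} \\
&= \lim_{x \to 0^+} x \\
&= 0.
\end{align*}
So we define $f(x) = -x \log x$ to equal 0 when $x = 0$.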
Note: If p = 0 then I(p) is infinite but this event never occurs so there is no problem.
However, if in our source there is a $p_i = 0$ we want it to contribute nothing to the entropy. Hence we say that $-p \log p = 0$ when $p = 0$.
This means that we can use the formula for H without worrying about whether p = 0 is possible or not.
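The convention $-p \log p = 0$ at $p = 0$ and the location of the maximum can both be checked numerically. The sketch below is illustrative only: the function name `entropy_term` and the grid of 1001 sample points are choices made here, not part of the text.

```python
import math

def entropy_term(x):
    """f(x) = -x ln x on [0, 1], with f(0) defined to be 0."""
    if x == 0:
        return 0.0              # the convention adopted above
    return -x * math.log(x)

# f(x) shrinks towards 0 as x approaches 0 from the right.
for x in (1e-2, 1e-4, 1e-8):
    print(x, entropy_term(x))

# A coarse grid search puts the maximum close to x = 1/e.
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=entropy_term)
print(best, 1 / math.e)
```

The grid search is only a sanity check on the calculus argument; the exact maximum is at $x = e^{-1}$, where $f(x) = e^{-1}$.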
2. Suppose
\[ p_1, p_2, \dots, p_q \quad\text{and}\quad p'_1, p'_2, \dots, p'_q \]
are two probability distributions. Then $0 \le p_i, p'_i \le 1$ for all $i$ and
\[ \sum_{i=1}^{q} p_i = 1 \quad\text{and}\quad \sum_{i=1}^{q} p'_i = 1. \]
This inequality (in either form) is called Gibbs’ inequality and will be used in some proofs in this chapter. Moreover equality occurs in Gibbs’ inequality if and only if $p'_i / p_i = 1$ for all $i$; that is, if and only if $p'_i = p_i$ for $1 \le i \le q$.
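For reference, the standard statement of Gibbs’ inequality and its usual short derivation can be sketched as follows. This sketch assumes the elementary bound $\ln x \le x - 1$ for $x > 0$ (with equality only at $x = 1$), takes $r > 1$, and assumes every $p_i > 0$ for simplicity; the text's own derivation may differ in its details.

```latex
\begin{align*}
\sum_{i=1}^{q} p_i \log_r \frac{p'_i}{p_i}
  &= \frac{1}{\ln r} \sum_{i=1}^{q} p_i \ln \frac{p'_i}{p_i} \\
  &\le \frac{1}{\ln r} \sum_{i=1}^{q} p_i \left( \frac{p'_i}{p_i} - 1 \right) \\
  &= \frac{1}{\ln r} \left( \sum_{i=1}^{q} p'_i - \sum_{i=1}^{q} p_i \right) = 0,
\end{align*}
% Rearranged, the same inequality reads
\[
\sum_{i=1}^{q} p_i \log_r \frac{1}{p_i} \;\le\; \sum_{i=1}^{q} p_i \log_r \frac{1}{p'_i}.
\]
```

The equality condition comes from the bound $\ln x \le x - 1$, which is tight only when $x = p'_i / p_i = 1$.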
4.2 Maximum Entropy Theorem
Theorem 4.1 : Maximum Entropy Theorem
For any source $S$ with $q$ symbols, the base $r$ entropy satisfies
\[ H_r(S) \le \log_r q \]
with equality if and only if all symbols are equally likely.
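Proof: Throughout this proof we write $\log$ for $\log_r$. Now
\begin{align*}
H - \log q &= -\sum_{i=1}^{q} p_i \log p_i - \log q \sum_{i=1}^{q} p_i \\
&= \sum_{i=1}^{q} p_i \log \frac{1}{q p_i} \\
&= \sum_{i=1}^{q} p_i \log \frac{p'_i}{p_i} \\
&\le 0.
\end{align*}
Here $p'_i = \frac{1}{q}$ is an equally likely source and we have used Gibbs’ inequality, in which equality holds if and only if $p_i = p'_i = \frac{1}{q}$ for all $i$.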
The result follows.
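The theorem can also be sanity-checked numerically. The sketch below is an illustration, not a proof: it compares $\log_2 q$ with the entropy of the uniform source and of 1000 randomly generated sources. The helper names `entropy` and `random_distribution`, the choice $q = 6$, and the random seed are all assumptions made for this example.

```python
import math
import random

def entropy(probs, radix=2):
    """Entropy H_r = sum of p * log_r(1/p), skipping zero-probability symbols."""
    return sum(p * math.log(1 / p) / math.log(radix) for p in probs if p > 0)

def random_distribution(q, rng):
    """A random probability distribution on q symbols (normalised weights)."""
    weights = [rng.random() for _ in range(q)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)       # fixed seed so the run is repeatable
q = 6
bound = math.log(q) / math.log(2)    # log_2 q
uniform = [1 / q] * q

# Gap between the bound and the entropy; it should never be negative.
gaps = [bound - entropy(random_distribution(q, rng)) for _ in range(1000)]

print(math.isclose(entropy(uniform), bound))   # equality for the uniform source
print(min(gaps) >= 0)                          # no random source beats the bound
```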