Optimizing the input frequencies

Coding Theory

4.4 Optimizing the input frequencies

As before, let $S = \{s_1,\dots,s_m\}$ be the source alphabet, and $A = \{a_1,\dots,a_n\}$ the code alphabet. $A$ is also the input alphabet of the channel we plan to use. Suppose that the (relative) source frequencies $f_1,\dots,f_m$ are known, and also the optimal channel input frequencies $\hat p_1,\dots,\hat p_n$ of the input letters $a_1,\dots,a_n$. We have the problem of coming up with a “good” encoding scheme, $s_j \to w_j \in A^+$, $j = 1,\dots,m$. The goodness of the scheme is judged with reference to a number of criteria. We have already seen that for unique decodability, we may as well have a scheme that satisfies the prefix condition. For minimizing $\bar\ell = \sum_{j=1}^{m} f_j \operatorname{lgth}(w_j)$, we have Huffman’s algorithm. Now let us consider the requirement that the input frequencies $p_1,\dots,p_n$ of the letters $a_1,\dots,a_n$ should be as close as possible, in some sense, to the optimal input frequencies $\hat p_1,\dots,\hat p_n$.

In particular circumstances we can wrangle over the metric, the sense of “closeness,” to be used, and we can debate the rank of this requirement among the various contending requirements, but it is clear that we will make no progress toward satisfying this requirement if we cannot compute the input frequencies $p_1,\dots,p_n$ arising from a particular encoding scheme. This computation is the subject of the following theorem.

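4.4.1 Theorem Suppose that $s_j \to w_j \in A^+$, $j = 1,\dots,m$, is an encoding scheme. Suppose that $a_i$ occurs exactly $u_{ij}$ times in $w_j$, $i = 1,\dots,n$, $j = 1,\dots,m$. Then, for $i = 1,\dots,n$,

$$p_i = \frac{\sum_{j=1}^{m} u_{ij} f_j}{\sum_{j=1}^{m} f_j \operatorname{lgth}(w_j)} = (\bar\ell)^{-1} \sum_{j=1}^{m} u_{ij} f_j.$$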

Proof: We will have a rather informal proof; logicians and philosophers can be hired later to dignify it.

Suppose we have a block of source text with a large number $N$ of source characters, with the marvelous property that, for each $j = 1,\dots,m$, $s_j$ occurs exactly the expected number of times, $N f_j$. After encoding, the total number of characters in the code text is $\sum_{j=1}^{m} (N f_j) \operatorname{lgth}(w_j) = N\bar\ell$. The number of occurrences of $a_i$ in the code text is $\sum_{j=1}^{m} u_{ij} (N f_j) = N \sum_{j=1}^{m} u_{ij} f_j$. Dividing, we find that the proportion of $a_i$’s in the code text is

$$p_i = \frac{N \sum_{j=1}^{m} u_{ij} f_j}{N\bar\ell} = (\bar\ell)^{-1} \sum_{j=1}^{m} u_{ij} f_j.$$

4.4.2 Example Suppose that $S = \{a,b,c\}$, $A = \{0,1\}$, and $f_a = .6$, $f_b = .3$, and $f_c = .1$. Suppose that the encoding scheme is

$a \to 00$, $b \to 101$, $c \to 010$.

This scheme does not minimize average code word length, but it may have compensatory virtues that suit the situation. Letting the alphabet characters serve as indices, we have

$u_{0a} = 2$, $u_{0b} = 1$, $u_{0c} = 2$, and $u_{1a} = 0$, $u_{1b} = 2$, $u_{1c} = 1$.

Thus the input frequencies will be

$$p_0 = \frac{2(.6) + (.3) + 2(.1)}{2(.6) + 3(.3) + 3(.1)} = \frac{17}{24}$$

and $p_1 = 1 - p_0 = 7/24$. If the channel involved is a binary symmetric channel, then the optimal input frequencies are $\hat p_0 = \hat p_1 = 1/2$ (see 3.4.5), so $p_0$ and $p_1$ here are quite far from optimal.
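The computation in Theorem 4.4.1 is easy to mechanize. The following is a minimal sketch, not part of the original text: the function name `input_frequencies` and the dictionary layout are choices made here for illustration, and exact fractions are used so that the result can be compared with the value 17/24 above.

```python
from fractions import Fraction

def input_frequencies(scheme, freqs, alphabet):
    """Input frequencies p_i of Theorem 4.4.1.

    scheme   maps each source letter s_j to its code word w_j (a string);
    freqs    maps each source letter s_j to its relative frequency f_j;
    alphabet lists the code letters a_i.
    (These names are assumptions of this sketch, not the book's notation.)
    """
    # Average code word length: sum over j of f_j * lgth(w_j).
    lbar = sum(freqs[s] * len(w) for s, w in scheme.items())
    # w.count(a) is u_ij, the number of occurrences of a_i in w_j.
    return {a: sum(freqs[s] * w.count(a) for s, w in scheme.items()) / lbar
            for a in alphabet}

# Data of Example 4.4.2.
freqs = {"a": Fraction(6, 10), "b": Fraction(3, 10), "c": Fraction(1, 10)}
scheme = {"a": "00", "b": "101", "c": "010"}
p = input_frequencies(scheme, freqs, "01")
print(p["0"], p["1"])  # 17/24 7/24
```

Working with `Fraction` rather than floating point avoids rounding questions when two schemes give nearly equal frequencies.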

The code designer may have had good reasons for the choice of this scheme.

Would the designer agree to changing the second digit in each of the code words of the scheme? This would not change any lengths, nor the relationships among the code words. (You might ponder what “relationships” means here.) The new scheme:

$a \to 01$, $b \to 111$, $c \to 000$.

The new input frequencies: $p_0 = 3/8$, $p_1 = 5/8$. These are not optimal, but they are closer to 1/2 than were the former input frequencies, 17/24 and 7/24.

If the new scheme is as good as the original in every other respect, then we may as well use the new scheme.
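The digit-changing trick can be checked mechanically. The sketch below is an illustration only, with helper names (`flip_digit`, `is_prefix_free`, `p0`) invented here: it complements one position in every binary code word, confirms that the prefix condition survives, and recomputes $p_0$ as in Theorem 4.4.1.

```python
from fractions import Fraction

def flip_digit(word, position):
    """Complement the binary digit at 0-based `position`.
    Words too short to have that position are returned unchanged."""
    if position >= len(word):
        return word
    flipped = "1" if word[position] == "0" else "0"
    return word[:position] + flipped + word[position + 1:]

def is_prefix_free(words):
    """True if no word is a prefix of (or equal to) a different word in the list."""
    return not any(i != j and w.startswith(v)
                   for i, v in enumerate(words)
                   for j, w in enumerate(words))

def p0(scheme, freqs):
    """Relative frequency of the letter 0 in the code text (Theorem 4.4.1)."""
    lbar = sum(freqs[s] * len(w) for s, w in scheme.items())
    return sum(freqs[s] * w.count("0") for s, w in scheme.items()) / lbar

freqs = {"a": Fraction(6, 10), "b": Fraction(3, 10), "c": Fraction(1, 10)}
old = {"a": "00", "b": "101", "c": "010"}
# Change the second digit (index 1) of every code word.
new = {s: flip_digit(w, 1) for s, w in old.items()}

print(new)                                  # {'a': '01', 'b': '111', 'c': '000'}
print(is_prefix_free(list(new.values())))   # True
print(p0(old, freqs), p0(new, freqs))       # 17/24 3/8
```

Flipping the same position in every word leaves the lengths alone and cannot create or destroy a prefix relation between two words, which is one reading of the “relationships” mentioned above.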

Optimizing the input frequencies, after minimizing $\bar\ell$, with a prefix-condition code

4.4.3 Problem The input consists of $S$, $A$, the source frequencies $f_1,\dots,f_m$, and the optimal input frequencies $\hat p_1,\dots,\hat p_n$ for the channel of which $A$ is the input alphabet. The output is to be an encoding scheme $s_j \to w_j \in A^+$, $j = 1,\dots,m$, such that

(i) the prefix condition is satisfied;

(ii) $\bar\ell = \sum_{j} f_j \operatorname{lgth}(w_j)$ is minimal, among average code word lengths of schemes satisfying (i); and

(iii) the $n$-tuple $(p_1,\dots,p_n)$, computed as in Theorem 4.4.1, is as close as possible to $(\hat p_1,\dots,\hat p_n)$, by some previously agreed upon measure of closeness. If $d(p,\hat p)$ denotes the distance from $p = (p_1,\dots,p_n)$ to $\hat p = (\hat p_1,\dots,\hat p_n)$, this means that $d(p,\hat p)$ is to be minimal among all such numbers computed from schemes satisfying (i) and (ii).

4.4.4 Usually, $d(p,\hat p) = \sum_{j=1}^{n} (p_j - \hat p_j)^2$, but you can take

$$d(p,\hat p) = \sum_{j=1}^{n} |p_j - \hat p_j|^{\alpha}$$

for some power $\alpha$ other than 2, or $d(p,\hat p) = \max_{1 \le j \le n} |p_j - \hat p_j|$. When $n = 2$, these different measures of distance are equivalent: for any choice of $d$, above, $d(p,\hat p) \le d(p',\hat p)$ if and only if $|p_1 - \hat p_1| \le |p'_1 - \hat p_1|$. See Exercise 4.4.4.
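To make the measures of closeness concrete, here is a small sketch (the names `d_power` and `d_max` are invented for this illustration). It compares the two frequency vectors of Example 4.4.2 with the optimal vector $(1/2, 1/2)$ under several choices of $d$; with $n = 2$ all of them rank the two vectors the same way.

```python
from fractions import Fraction

def d_power(p, p_hat, alpha=2):
    """Sum over j of |p_j - p_hat_j|**alpha; alpha = 2 is the usual choice."""
    return sum(abs(x - y) ** alpha for x, y in zip(p, p_hat))

def d_max(p, p_hat):
    """Largest coordinate-wise deviation |p_j - p_hat_j|."""
    return max(abs(x - y) for x, y in zip(p, p_hat))

p_hat = (Fraction(1, 2), Fraction(1, 2))      # optimal for a BSC
p_old = (Fraction(17, 24), Fraction(7, 24))   # original scheme of Example 4.4.2
p_new = (Fraction(3, 8), Fraction(5, 8))      # scheme with second digits changed

measures = {
    "alpha=1": lambda p, q: d_power(p, q, 1),
    "alpha=2": d_power,
    "alpha=3": lambda p, q: d_power(p, q, 3),
    "max": d_max,
}
for name, d in measures.items():
    # Each line reports whether the new scheme is closer under this measure.
    print(name, d(p_new, p_hat) < d(p_old, p_hat))
```

Integer values of `alpha` keep the arithmetic exact; a non-integer power would need floating point.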

It would be nice to have a slick algorithm to solve Problem 4.4.3, especially in the case $n = 2$, when the output will not vary with different reasonable definitions of $d(p,\hat p)$. Also, the case $n = 2$ is distinguished by the fact that binary channels are in widespread use in the real world.

We have no such good algorithm! Perhaps someone reading this will supply one some day. However, we do have an algorithm; it’s brutish, but it’s an algorithm. Here it is: Supposing $f_1 \ge f_2 \ge \dots \ge f_m$, use Huffman’s algorithm to find all $n$-ary Huffman sequences $\ell_1 \le \dots \le \ell_m$ that minimize $\bar\ell = \sum_{j} f_j \ell_j$; for each of these sequences, we find all possible prefix-condition schemes $s_j \to w_j \in A^{\ell_j}$ and compute $p_i = (\bar\ell)^{-1} \sum_{j=1}^{m} u_{ij} f_j$, $i = 1,\dots,n$. We choose the scheme for which $(p_1,\dots,p_n)$ is closest to $(\hat p_1,\dots,\hat p_n)$.

4.4.5 Example Let’s carry out the brute-force program suggested above in the easy circumstances of Example 4.4.2, assuming that the channel is a BSC. We have $S = \{a,b,c\}$, $A = \{0,1\}$, $f_a = .6$, $f_b = .3$, $f_c = .1$, and $\hat p_0 = \hat p_1 = 1/2$.

There is only one sequence of code word lengths to consider: $\ell_a = 1$, $\ell_b = 2 = \ell_c$. We have $\bar\ell = 1.4$. There are four different prefix-condition schemes to consider; the two that start with $a \to 0$ are: $a \to 0$, $b \to 10$, $c \to 11$ and $a \to 0$, $b \to 11$, $c \to 10$. For the first of these, $p_0 = (1.4)^{-1}(.6 + .3) = 9/14$, and, for the second, $p_0 = (1.4)^{-1}(.6 + .1) = 1/2$. Clearly the second wins! Alternatively, the scheme $a \to 1$, $b \to 00$, $c \to 01$ gives optimal input frequencies.

With the same $S$, $A$, and source frequencies, if the channel had been so oddly constructed that $\hat p_0 = 1/3$, $\hat p_1 = 2/3$, then the optimal scheme of the four candidates would have been $a \to 1$, $b \to 01$, $c \to 00$.
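The brute-force search of this example can be sketched in a few lines. This is an illustration under stated assumptions, not an efficient or authoritative implementation: the function names are invented here, the code word lengths must be supplied in nondecreasing order with the source frequencies listed in the matching order, and squared Euclidean distance is used as the measure of closeness.

```python
from fractions import Fraction
from itertools import product

def prefix_schemes(alphabet, lengths):
    """Yield every tuple of code words with the given (nondecreasing) lengths
    that satisfies the prefix condition."""
    def extend(chosen):
        k = len(chosen)
        if k == len(lengths):
            yield tuple(chosen)
            return
        for letters in product(alphabet, repeat=lengths[k]):
            w = "".join(letters)
            # Earlier words are no longer than w, so one startswith test
            # rules out both repeated words and prefixes.
            if not any(w.startswith(v) for v in chosen):
                yield from extend(chosen + [w])
    yield from extend([])

def input_frequencies(words, freqs, alphabet):
    """p_i as in Theorem 4.4.1, in the order of `alphabet`."""
    lbar = sum(f * len(w) for w, f in zip(words, freqs))
    return [sum(f * w.count(a) for w, f in zip(words, freqs)) / lbar
            for a in alphabet]

def best_schemes(alphabet, lengths, freqs, target):
    """All prefix-condition schemes whose input frequencies are closest to
    `target` in squared Euclidean distance (an assumed choice of d)."""
    def dist(words):
        p = input_frequencies(words, freqs, alphabet)
        return sum((x - t) ** 2 for x, t in zip(p, target))
    schemes = list(prefix_schemes(alphabet, lengths))
    d_min = min(dist(s) for s in schemes)
    return [s for s in schemes if dist(s) == d_min]

freqs = (Fraction(6, 10), Fraction(3, 10), Fraction(1, 10))   # f_a, f_b, f_c
half = Fraction(1, 2)
print(best_schemes("01", (1, 2, 2), freqs, (half, half)))
# [('0', '11', '10'), ('1', '00', '01')]
print(best_schemes("01", (1, 2, 2), freqs, (Fraction(1, 3), Fraction(2, 3))))
# [('1', '01', '00')]
```

The number of candidate schemes grows quickly with the number and length of the code words, which is why the text calls the method brutish.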

4.4.6 Example $S = \{a,b,c,d,e\}$, $A = \{0,1\}$, $\hat p_0 = \hat p_1 = 1/2$, $f_e = .35$, $f_a = .3$, $f_d = .2$, $f_b = .1$, and $f_c = .05$. The sequences $(\ell_e, \ell_a, \ell_d, \ell_b, \ell_c)$ satisfying $\sum_{j \in S} 2^{-\ell_j} \le 1$ for which $\bar\ell = \sum_{j} f_j \ell_j$ is minimal are $(1,2,3,4,4)$ and $(2,2,2,3,3)$. [See Exercise 4.2.3 and Example 4.3.4.] The value of $\bar\ell$ is 2.15.

Both optimal sequences are obtainable from Huffman’s algorithm; the difference arises from the choice of the ordering of the alphabet obtained from the second merge.
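The two optimal length sequences can be confirmed by direct search instead of by running Huffman’s algorithm. The sketch below is an assumption-laden illustration: the function name is invented, it pairs nondecreasing lengths with nonincreasing frequencies, and it only tries lengths up to $m - 1$, which is enough in the binary case with positive frequencies because an optimal binary prefix code on $m$ letters has no word longer than $m - 1$.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

def optimal_length_sequences(freqs, n=2):
    """Nondecreasing length sequences satisfying Kraft's inequality for an
    n-letter code alphabet that minimize the average code word length.
    Returns (list of sequences, minimal average length). Assumes m >= 2."""
    freqs = sorted(freqs, reverse=True)   # longest words go to rarest letters
    m = len(freqs)
    best, best_lbar = [], None
    # combinations_with_replacement yields nondecreasing tuples.
    for lengths in combinations_with_replacement(range(1, m), m):
        if sum(Fraction(1, n ** l) for l in lengths) > 1:
            continue                      # Kraft's inequality fails
        lbar = sum(f * l for f, l in zip(freqs, lengths))
        if best_lbar is None or lbar < best_lbar:
            best, best_lbar = [lengths], lbar
        elif lbar == best_lbar:
            best.append(lengths)
    return best, best_lbar

# Source frequencies of Example 4.4.6: f_e, f_a, f_d, f_b, f_c.
freqs = [Fraction(35, 100), Fraction(30, 100), Fraction(20, 100),
         Fraction(10, 100), Fraction(5, 100)]
seqs, lbar = optimal_length_sequences(freqs)
print(seqs)          # [(1, 2, 3, 4, 4), (2, 2, 2, 3, 3)]
print(float(lbar))   # 2.15
```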

The optimal schemes in this case are associated with $(2,2,2,3,3)$. (There are quite a number of schemes to look at, but, taking into account that $\hat p_0 = \hat p_1 = 1/2$, the possibilities boil down to only eight or nine essentially different schemes.) Here is one of the optimal schemes:

$e \to 01$, $a \to 10$, $d \to 11$, $b \to 000$, $c \to 001$.

Verify that $p_0 = 1.05/2.15 = 21/43$, and that this is as close to 1/2 as you can get in this situation. (Since each $f_j$ is an integer multiple of .05, the numerator of $p_0 = \frac{\sum_{j} u_{0j} f_j}{2.15}$ will be an integer multiple of .05. Thus the closest $p_0$ can be made to 1/2 is $1.05/2.15$ or $1.10/2.15$.)

Observe that in this case we can have $p_0 = p_1 = 1/2$ exactly, with a prefix-condition scheme, if we sacrifice the minimization of $\bar\ell$. For instance, the fixed-length scheme $e \to 0011$, $a \to 1100$, $d \to 0101$, $b \to 1001$, $c \to 1010$ gives unique decodability and $p_0 = p_1 = 1/2$.

4.4.7 In general, whenever the optimal input frequencies $\hat p_1,\dots,\hat p_n$ are rational numbers, we can achieve exact input frequency optimization, $p_i = \hat p_i$, $i = 1,\dots,n$, with a uniquely decodable block code; just make $\ell = \operatorname{lgth}(w_j)$, $j = 1,\dots,m$, so large that it is possible to find $m$ distinct words $w_1,\dots,w_m \in A^{\ell}$ such that the proportion of the occurrences of $a_i$ in each is exactly $\hat p_i$. And, if some of the $\hat p_i$ are irrational, we can approximate $\hat p = (\hat p_1,\dots,\hat p_n)$ by a rational vector $(p_1,\dots,p_n) = p$ (satisfying $\sum_{i=1}^{n} p_i = 1$, $p_i \ge 0$, $i = 1,\dots,n$) as closely as we wish, and then produce a fixed-length scheme from which the $p_i$ arise as the input frequencies of the $a_i$. Thus the variables $p_1,\dots,p_n$ are truly “vary-able,” as we promised in Chapter 3, and arrangements can be made in the code, or input, “language,” so that the relative input frequencies are as close as desired to optimal.
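The fixed-length construction just described can be sketched as follows. The function names are invented for this illustration, `math.lcm` requires Python 3.9 or later, and the sketch assumes rational targets of which at least two are nonzero (otherwise only one word exists at every length and the search would not terminate for $m > 1$).

```python
from fractions import Fraction
from itertools import islice
from math import factorial, lcm

def multinomial(counts):
    """Number of distinct words using letter i exactly counts[i] times."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

def arrangements(alphabet, counts):
    """Yield every distinct word using alphabet[i] exactly counts[i] times."""
    if sum(counts) == 0:
        yield ""
        return
    for i, a in enumerate(alphabet):
        if counts[i] > 0:
            rest = counts[:i] + (counts[i] - 1,) + counts[i + 1:]
            for tail in arrangements(alphabet, rest):
                yield a + tail

def exact_block_code(alphabet, target, m):
    """m distinct words of one common length, each containing alphabet[i]
    in exactly the proportion target[i] (a tuple of Fractions summing to 1)."""
    # The word length must be a multiple of every denominator.
    base = lcm(*(t.denominator for t in target))
    k = 1
    while True:
        counts = tuple(int(t * base * k) for t in target)
        if multinomial(counts) >= m:
            return list(islice(arrangements(alphabet, counts), m))
        k += 1

half = Fraction(1, 2)
words = exact_block_code("01", (half, half), 5)
print(words)  # ['0011', '0101', '0110', '1001', '1010']
```

For five source letters and targets $(1/2, 1/2)$ the words have length 4, the same length as the fixed-length scheme shown in Example 4.4.6; as the text notes, the construction ignores the source frequencies entirely.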

However, the method suggested in the preceding paragraph for approximating the optimal input frequencies is clearly impractical; the code words would have to be quite long, so that the rate of processing of source text would be quite slow, and increasing that rate is generally reckoned to be of greater consequence than the close approximation of the optimal input frequencies.

In the same vein, one might well question the importance of Problem 4.4.3, although in this problem the approximation of the optimal input frequencies is subordinated to minimizing $\bar\ell$, i.e., to speeding up the processing of source text. As long as the scheme is uniquely decodable and $\bar\ell$ is minimized, why fiddle with trying to approximate the optimal input frequencies? Optimizing the average amount of information conveyed by the channel per input letter, with the input stream somewhat artificially regarded as randomly generated, may seem an ivory-tower objective, an academic exercise of doubtful connection to the real world problem of communicating a source stream through a channel.

However, it is an indirect and little-noted consequence of the famed Noisy Channel Theorem, to be explained in Section 4.6, that there is a connection between the practical problems of communication and the problem of encoding the source stream so that the input frequencies are approximately optimal. Not to go into detail, the import of the NCT is that there exist ways of encoding the source stream that simultaneously do about as well as can be done regarding the two most obvious practical problems of communication: keeping pace with the source stream (up to a threshold that depends on the channel capacity), and reducing the error frequency, in the reconstitution of the source stream (decoding) at the receiver of the channel. Although it is not explicitly proven in any of the rigorous treatments of the NCT, the role of the channel capacity in the NCT strongly argues for the information-theoretic folk theorem that the relative input frequencies resulting from those wonderful optimizing coding methods whose existence is asserted by the NCT must be nearly optimal, themselves.

This folk theorem is of particular interest when you realize that all known proofs of the NCT are probabilistic existence proofs; there is no good constructive way known of acquiring those coding methods whose existence is proved. Furthermore, when you understand the nature of those methods, you will understand that they would be totally impractical, even if found. [The situation reminds us of contrived gambling games in which the expected gain per play is infinite, yet the probability of going bankrupt due to accumulated losses is very close to one, even for Bill Gates.] So the problem of effective coding realizing the aspirations expressed in the NCT is still on the agenda, and has been for the 54 years (as this is written) since Shannon’s masterpiece [63]. So far as we know, the indirect approach of aiming, among other things, to get close to the optimal relative input frequencies by astute coding has not been a factor in the progress of the past half-century. In part this has to do with the fact that binary symmetric channels are the only channels that have been seriously considered; also, it has been generally assumed that the relative source frequencies are equal (see the discussion, next section, on the equivalence of Maximum Likelihood Decoding and Nearest Code Word Decoding), and the dazzling algebraic methods used to produce great coding and decoding under these assumptions automatically produce a sort of uniformity that makes $p_0$ and $p_1$ equal or trivially close to 1/2. Perhaps the problem of approximating the optimal input frequencies by astute encoding will become important in the future, as communication engineering ventures away from the simplifying assumption of equal source frequencies.

Exercises 4.4

1. We return to 4.3.4: $S = \{a,b,c,d,e\}$, $f_e = 0.3$, $f_a = 0.25$, $f_d = 0.2$, $f_b = 0.15$, and $f_c = 0.1$. Find a scheme which solves the problem in paragraph 4.4.3 when

(a) $A = \{0,1\}$, $\hat p_0 = 1/2 = \hat p_1$;

(b) $A = \{0,1\}$, $\hat p_0 = 2/3$, $\hat p_1 = 1/3$;

(c) $A = \{0,1,*\}$, $\hat p_0 = \hat p_1 = \hat p_* = 1/3$;

(d) $A = \{0,1,*\}$, $\hat p_0 = \hat p_1 = 2/5$, $\hat p_* = 1/5$.

2. In each of (a)–(d) in the preceding problem, find a uniquely decodable fixed-length scheme which gives the optimal input frequencies exactly. The shorter the length, the better.

3. When $|S| = m = 26$, find the shortest length of a fixed-length prefix-condition scheme, by which the optimal input frequencies are realized exactly, constructed as suggested in 4.4.7, when the code alphabet and optimal input frequencies are as in 1(a)–(d), above. [Notice that the method suggested in 4.4.7 takes no account of the source frequencies.] Compare with the $\bar\ell$ you found in exercise 4.3.2 (a) and (b).

4. Verify the assertion about the case $n = 2$ made in 4.4.4. [Hint: observe that if $p_1 + p_2 = p'_1 + p'_2 = \hat p_1 + \hat p_2$, then if $|p_1 - \hat p_1| \le |p'_1 - \hat p_1|$, it follows that $|p_2 - \hat p_2| \le |p'_2 - \hat p_2|$, since $|p_1 - \hat p_1| = |p_2 - \hat p_2|$ and $|p'_1 - \hat p_1| = |p'_2 - \hat p_2|$.]

5. Suppose the source text is encoded by the scheme $s_j \to w_j \in A^+$, $j = 1,\dots,m$, the source frequencies are $f_1,\dots,f_m$, and the $u_{ij}$ are as in Theorem 4.4.1. We select a letter at random from the source text and look at it; if it is $s_j$, we then select a letter at random from $w_j$. What is the probability that $a_i$ will be selected by this procedure? Is this the same as $(\bar\ell)^{-1} \sum_{j=1}^{m} u_{ij} f_j$? If not, why not?

6. This exercise concerns the efficiency of the brute-force algorithm suggested for solving Problem 4.4.3.

(a) How many prefix-condition binary encoding schemes are there with code word lengths 2,2,3,3,3,3?

(b) How many prefix-condition binary encoding schemes are there with code word lengths 2,2,2,3,4,4?

∗(d) Given $|A| = n \ge 2$ and positive integers $m$, $\ell_1 \le \dots \le \ell_m$ satisfying $\sum_{j=1}^{m} n^{-\ell_j} \le 1$, give a formula, in terms of $n$ and $\ell_1,\dots,\ell_m$, for the number of different prefix-condition encoding schemes for $S \to A$, $S = \{s_1,\dots,s_m\}$, with code word lengths $\ell_1,\dots,\ell_m$. [See Section 1.6 and the proof of Kraft’s Inequality.]

4.5 Error correction, maximum likelihood decoding,

In document Introduction to Information Theory and Data Compression (Page 98-103)