Higher-order Modeling
7.1 Higher-order Huffman encoding
One way to use the hard-won knowledge of the relative frequencies $f(i_1,\dots,i_{k+1})$ would be to treat $S^{k+1}$ as the source alphabet and to produce an encoding scheme using Huffman’s algorithm. This encoding scheme would have $m^{k+1}$ lines.
In kth-order Huffman encoding, $k \ge 1$, we have, instead of one big scheme, rather a lot of little schemes, $m^k$ of them, in fact, each with $m$ lines, so the total hidden cost of kth-order encoding is about the same as that of zeroth-order encoding using the huge source alphabet $S^{k+1}$. Let us call each source word $s_{i_1}\cdots s_{i_k}$ of length $k$ a kth-order context. For each such context, and $1 \le j \le m$, let
$$P(s_j \mid s_{i_1}\cdots s_{i_k}) = \frac{f(i_1,\dots,i_k,j)}{f(i_1,\dots,i_k)},$$
the conditional probability that, if you have just scanned the word $s_{i_1}\cdots s_{i_k}$ in the source text, the next letter will be $s_j$. The $m^k$ encoding schemes come about by applying Huffman’s algorithm to $S = \{s_1,\dots,s_m\}$ equipped with the conditional relative frequencies $P(s_1 \mid s_{i_1}\cdots s_{i_k}),\dots,P(s_m \mid s_{i_1}\cdots s_{i_k})$, for each context $s_{i_1}\cdots s_{i_k}$. Thus there is one scheme per context, which makes $m^k$ of them, and each is an encoding scheme for $S$, and so has $m$ lines.
Once you have all these schemes, how do you encode source text? Each occurrence of the letter $s_j$ is encoded with the code word for $s_j$ in the scheme associated with the context $s_{i_1}\cdots s_{i_k}$, the k-letter word immediately preceding that occurrence of $s_j$. Thus different occurrences of $s_j$ may well be encoded differently. How, then, will the decoder be able to recognize the code for that occurrence of $s_j$, following $s_{i_1}\cdots s_{i_k}$? Very simple: the decoder has already decoded the code text preceding the code for that occurrence of $s_j$, so the decoder knows that it is “in context $s_{i_1}\cdots s_{i_k}$”; the decoder proceeds to scan the code text with reference to the encoding scheme associated with the correct context.
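The encoding and decoding procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-letter alphabet a, b, c, the starter scheme, and the three context schemes below are all made up for the example, and the function names `encode` and `decode` are hypothetical. The starter scheme handles the first k letters, which have no k-letter context.

```python
def encode(text, starter, contexts, k=1):
    """Encode text letter by letter, choosing the scheme by the preceding k letters."""
    out = []
    for pos, letter in enumerate(text):
        # The first k letters have no full context, so they use the starter scheme.
        scheme = starter if pos < k else contexts[text[pos - k:pos]]
        out.append(scheme[letter])
    return "".join(out)


def decode(code, starter, contexts, k=1):
    """Decode by tracking the context formed by the letters decoded so far (k >= 1)."""
    decoded = []
    pos = 0
    while pos < len(code):
        scheme = starter if len(decoded) < k else contexts["".join(decoded[-k:])]
        inverse = {word: letter for letter, word in scheme.items()}
        # Each scheme is prefix-free, so the first matching prefix is the code word.
        end = pos + 1
        while code[pos:end] not in inverse:
            end += 1
            if end > len(code):
                raise ValueError("code text ends inside a code word")
        decoded.append(inverse[code[pos:end]])
        pos = end
    return "".join(decoded)


# Hypothetical schemes for S = {a, b, c}; not taken from the text.
starter = {"a": "0", "b": "10", "c": "11"}
contexts = {
    "a": {"a": "0", "b": "10", "c": "11"},
    "b": {"a": "10", "b": "0", "c": "11"},
    "c": {"a": "11", "b": "10", "c": "0"},
}

code = encode("abbca", starter, contexts)
print(code)                              # 01001111
print(decode(code, starter, contexts))   # abbca
```

Note that the two occurrences of b receive different code words (10 in context a, 0 in context b), and the decoder still recovers them because it always knows which context it is in.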
The discerning reader will have detected that there is a problem with those first k letters in the source text, which are not preceded by a k-letter context.
No problem—decide on some prefix-condition “starter scheme” for S and use
it for those first k letters. (Of course, the decoder will have to be told what the starter scheme is.) It seems reasonable to use the Huffman scheme based on the relative frequencies $f_1,\dots,f_m$ of the source letters, calculable as follows:
$$f_j = \sum_{1 \le i_2,\dots,i_{k+1} \le m} f(j, i_2,\dots,i_{k+1}).$$
(How are these $f_{ij}$ found? By sampling. But these particular $f_{ij}$ were just made up.) We find (how?) that the relative source frequencies of $s_1, s_2, s_3, s_4$ are $f_1 = .4$, $f_2 = .3$, $f_3 = .2$, $f_4 = .1$, so we take the following as the starter scheme:
s1→ 0, s2→ 10, s3→ 110, s4→ 111.
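As a quick check of the starter scheme, here is a minimal Python sketch that runs Huffman's algorithm on the frequencies .4, .3, .2, .1 and reports the code word lengths. The function name `huffman_lengths` is an assumption made for this illustration, and the frequencies are entered in hundredths so that the arithmetic stays exact.

```python
import heapq


def huffman_lengths(weights):
    """Return the code word length of each letter under binary Huffman coding."""
    # Heap items: (weight, unique tie-breaker, indices of the leaves below this node).
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    counter = len(weights)
    while len(heap) > 1:
        w1, _, leaves1 = heapq.heappop(heap)
        w2, _, leaves2 = heapq.heappop(heap)
        # Merging two nodes pushes every leaf below them one level deeper.
        for leaf in leaves1 + leaves2:
            lengths[leaf] += 1
        heapq.heappush(heap, (w1 + w2, counter, leaves1 + leaves2))
        counter += 1
    return lengths


freqs = [40, 30, 20, 10]          # f1..f4 in hundredths
lengths = huffman_lengths(freqs)
average = sum(f * l for f, l in zip(freqs, lengths)) / sum(freqs)
print(lengths)   # [1, 2, 3, 3]
print(average)   # 1.9
```

The lengths 1, 2, 3, 3 agree with the code words 0, 10, 110, 111 of the starter scheme, and the average of 1.9 bits per letter is what zeroth-order encoding achieves for these frequencies.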
Now we compute the context schemes. For context $s_i$, we are supposed to assign to $s_1,\dots,s_4$ the conditional relative frequencies $f_{i1}/f_i,\dots,f_{i4}/f_i$; since these are proportional to $f_{i1},\dots,f_{i4}$, we use the latter to form the Huffman tree. Similarly, in general, in kth-order encoding, the Huffman tree for context $s_{i_1}\cdots s_{i_k}$ can be formed by assigning the $f(i_1,\dots,i_k,j)$ to the $s_j$; it is not necessary to compute $P(s_j \mid s_{i_1}\cdots s_{i_k}) = f(i_1,\dots,i_k,j)/f(i_1,\dots,i_k)$.
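The remark that the unnormalized joint frequencies suffice can be illustrated with a short Python sketch. Everything here is an assumption made for the illustration: the three-letter alphabet, the digram counts (in hundredths), and the helper name `huffman_code`. The sketch builds one scheme per context from the raw row of digram counts, then repeats the construction with the conditional frequencies and checks that the schemes coincide.

```python
import heapq
from fractions import Fraction


def huffman_code(weights):
    """Map each letter to a binary code word, given a dict of letter weights."""
    # Heap items: (weight, unique tie-breaker, partial code words of the leaves below).
    heap = [(w, n, {s: ""}) for n, (s, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        # Prefix 0 to one subtree and 1 to the other.
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, n, merged))
        n += 1
    return heap[0][2]


# Hypothetical digram counts f(i, j), in hundredths; rows are contexts.
digram = {
    "a": {"a": 30, "b": 15, "c": 5},
    "b": {"a": 12, "b": 10, "c": 8},
    "c": {"a": 6, "b": 4, "c": 10},
}

# One scheme per context, built directly from the joint frequencies.
schemes = {ctx: huffman_code(row) for ctx, row in digram.items()}

# The same construction using conditional frequencies f(i, j) / f(i).
normalized = {
    ctx: huffman_code({s: Fraction(w, sum(row.values())) for s, w in row.items()})
    for ctx, row in digram.items()
}

print(schemes == normalized)   # True
```

Dividing every weight in a row by the same positive number does not change which two nodes are lightest at any step, so the merges, and hence the code words, are unchanged.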
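Context s1:
[Figure: Huffman tree for context s1. Leaf weights s1 = .16, s2 = .10, s3 = .10, s4 (weight not shown); merged node weights .14 and .24; root .4.]
Context s2:
[Figure: Huffman tree for context s2. Leaf weights s1 = .08, s2 = .17, s3 = .04, s4 (weight not shown); merged node weights .05 and .13; root .3.]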
(Of course, a different labeling of the edges will give a different scheme, but with the same code word lengths.)
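Context s3:
[Figure: Huffman tree for context s3. Leaf weight s1 = .14; root .2; other weights not shown.]
Context s4:
[Figure: Huffman tree for context s4. Leaf weights s1 = .02, s2 = .02, s3 = .05, s4 (weight not shown); merged node weights .03 and .05; root .10.]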
So, for instance, with the four context schemes and the starter scheme at our disposal, the source text s2s1s1s2s1s4s3s3s1 is encoded 10000100011111010.
Check the encoding, and also check that the decoder can recover the source string from the code string, if supplied with the starter and the context schemes.
7.1.2 Computing the compression ratio Again, $S = \{s_1,\dots,s_m\}$ and the relative frequencies $f(i_1,\dots,i_{k+1})$ of the words in $S^{k+1}$ are given. For a context $s_{i_1}\cdots s_{i_k}$ (assuming $k \ge 1$) and $1 \le j \le m$, let $\ell(i_1,\dots,i_k,j)$ be the length of the code word for $s_j$ in the encoding scheme for the context. The average length of a code word replacing a source letter (neglecting the starter scheme, the effect of which would be negligible with a large source text) is, by elementary considerations (see Section 1.8),
$$\bar\ell^{(k)} = \sum_{1 \le i_1,\dots,i_k,j \le m} f(i_1,\dots,i_k,j)\,\ell(i_1,\dots,i_k,j).$$
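Thus, for instance, in Example 7.1.1, reading the code word lengths $\ell(i,j)$ off the four context schemes gives $\bar\ell^{(1)} = \sum_{i,j} f_{ij}\,\ell(i,j) = 1.72$, whereas the starter scheme alone gives the zeroth-order average $\bar\ell^{(0)} = (.4)(1) + (.3)(2) + (.2)(3) + (.1)(3) = 1.9$. If you apply Huffman’s algorithm to $S^2$, with $s_is_j$ assigned relative frequency $f_{ij}$, you will get $\bar\ell = \bar\ell^{(0)}(S^2) = 3.53$. Thus the compression ratio achievable by this method, $2\bar L/3.53$, assuming the $s_j$ themselves are binary words with average length $\bar L$, is less than that achieved by first-order encoding, $\bar L/1.72$.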
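The comparison between first-order encoding and zeroth-order encoding of pairs can be sketched numerically as follows. This is an illustration under stated assumptions only: the 3×3 table of digram counts is made up (it is not the table of Example 7.1.1), and `huffman_lengths` is a hypothetical helper name. The sketch computes the first-order average as the sum of $f_{ij}\,\ell(i,j)$ over the context schemes, and the average code word length of Huffman's algorithm applied to the pairs.

```python
import heapq


def huffman_lengths(weights):
    """Return the code word length of each letter under binary Huffman coding."""
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    counter = len(weights)
    while len(heap) > 1:
        w1, _, leaves1 = heapq.heappop(heap)
        w2, _, leaves2 = heapq.heappop(heap)
        for leaf in leaves1 + leaves2:
            lengths[leaf] += 1          # every leaf below a merge gets one bit longer
        heapq.heappush(heap, (w1 + w2, counter, leaves1 + leaves2))
        counter += 1
    return lengths


# Hypothetical digram counts f(i, j), in hundredths; row i is context s_i.
f = [[30, 15, 5],
     [12, 10, 8],
     [6, 4, 10]]
total = sum(map(sum, f))
m = len(f)

# First-order: one Huffman scheme per row, weighted by the joint frequencies.
ell = [huffman_lengths(row) for row in f]
first_order = sum(f[i][j] * ell[i][j] for i in range(m) for j in range(m)) / total

# Zeroth-order on pairs: one Huffman scheme for all m*m two-letter words.
pairs = [w for row in f for w in row]
pair_lengths = huffman_lengths(pairs)
avg_pair = sum(w * l for w, l in zip(pairs, pair_lengths)) / total

print(first_order)    # bits per source letter, first-order
print(avg_pair / 2)   # bits per source letter, encoding pairs
```

The second figure is divided by 2 because each code word in the pair scheme replaces two source letters; the smaller of the two per-letter figures corresponds to the better compression ratio.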
Here is an academic question of practical importance. Given $f(i_1,\dots,i_{k+1})$, $1 \le i_1,\dots,i_{k+1} \le m$, let $\bar\ell^{(k)}$ be as defined above, and let $\bar\ell(S^{k+1})$ be the average code word length achieved by Huffman’s algorithm applied to $S^{k+1}$ as the source alphabet equipped with the relative frequencies $f(i_1,\dots,i_{k+1})$. Is it always the case that $\bar\ell^{(k)} \le \frac{1}{k+1}\,\bar\ell(S^{k+1})$? In other words, is the compression achieved by kth-order Huffman encoding always at least as good as the compression achieved by zeroth-order encoding, treating $S^{k+1}$ as the source alphabet? (Notice that when $k = 0$, these two are the same.) It is somewhat surprising that the answer is: not always. See Exercise 7.1.4. Notice that the situation in that exercise is rather extreme. The next question is: under what conditions do we have $\bar\ell^{(k)} \le \frac{1}{k+1}\,\bar\ell(S^{k+1})$? It is a large question that probably does not have a snappy answer given the current state of our knowledge and terminology, but its obvious practical importance makes it worth looking into.
Here is another question of practical importance: in case $k \ge 1$, is it necessarily the case that $\bar\ell^{(k)} \le \bar\ell^{(k-1)}$? Or can increasing the order sometimes give you worse compression? We suspect that $\bar\ell^{(k)} \le \bar\ell^{(k-1)}$ always holds, but we have no proof.
Exercises 7.1
1. Suppose $S = \{s_1, s_2, s_3, s_4\}$, with $s_1 = 000$, $s_2 = 001$, $s_3 = 01$, $s_4 = 1$, and digram frequencies $f(i,j) = f_{ij}$ given by
$$[f_{ij}] = \begin{bmatrix} .2 & .04 & .06 & .05 \\ .05 & .17 & .08 & .02 \\ .07 & .08 & .06 & .02 \\ .03 & .03 & .03 & .01 \end{bmatrix}.$$
(a) Find the single-letter relative frequencies $f_1, f_2, f_3, f_4$, and the compression ratio achieved if Huffman’s algorithm is applied to $S$.
(b) Find the compression ratio achieved if Huffman’s algorithm is applied to $S^2$ (with the relative frequencies $f_{ij}$ given above, of course).
(c) Give the four context schemes for first-order encoding of this source and encode the source string $s_2s_2s_1s_3s_1s_1s_1s_3s_2s_3s_3s_1s_4$. (Use the scheme associated with (a) for the first letter, $s_2$. There are different correct schemes for the starter and the contexts, so, if doing this exercise as part of a problem set, clearly label your schemes.)
(d) Find the compression ratio achieved by first-order encoding of this source alphabet. (This does not mean the compression ratio achieved in part (c) on that small segment of source text, but in general, on the average, on very large “typical” blocks of source text.)
2. Suppose someone were to examine the source text of problem 1 and to discover the single-letter source frequencies, $f_1$, $f_2$, $f_3$, and $f_4$, but to remain ignorant of the digram frequencies $f_{ij}$. Suppose this person applies Huffman’s algorithm to $S^2$, assuming the relative frequency of $s_is_j$ among all two-letter source words to be $f_if_j$.
(a) What compression ratio would this person believe they have achieved, given their assumption about the digram frequencies?
(b) What compression ratio would they actually have achieved?
3. The lazy but earnest person of problem 2 also tries first-order Huffman encoding of the source text of problem 1, again assuming that the relative frequency of $s_is_j$ is $f_if_j$.
(a) What compression ratio does the encoder believe has been achieved by this method?
(b) What is the actual compression ratio achieved?
4. Let $S = \{s_1, s_2, s_3\}$ and suppose the digram frequencies are given by
$$[f_{ij}] = \begin{bmatrix} .7 & .02 & .04 \\ .04 & .05 & .05 \\ .08 & .01 & .01 \end{bmatrix}.$$
Recalling the notation of this section, compute $\bar\ell^{(0)}$, $\bar\ell^{(1)}$, and $\bar\ell(S^2)$ for this source alphabet. Observe that $\bar\ell^{(1)} > \bar\ell(S^2)/2$.