We conclude this chapter by applying the various tests considered here to some real-world data. We used the datasets summarized in Table 2.1, which were taken from [74]. Dealing with real-world data requires us to make some departures from the model considered in this chapter. Firstly, each of the datasets comprises a multi-class classification problem, as opposed to the binary classification considered here. Secondly, the training and test datasets are not all of a common length.
Fortunately, each of the tests we considered can be viewed as a minimum distance test. For example, in the case of the GLRT we decide the class $i$ for which $d(\Lambda_{X_i}, \hat{P}_i) + d(\Lambda_Z, \hat{P}_i)$ is smallest, where $\hat{P}_i$ is the weighted sum of $\Lambda_{X_i}$ and $\Lambda_Z$ (with the weighting given by the length of the training document); in the case of the L2-norm test we decide the class $i$ for which $\|\Lambda_{X_i} - \Lambda_Z\|_2^2$ is smallest.

Dataset   L2     GLRT   Chi-squared   Hellinger
20ng      22     15.7   19            16
r52       6.7    0      5.6           4.4
r8        5.0    1.5    2.6           2.6
webkb     19     11.8   13.6          14
Table 2.2: Classification results for "rare" words (words occurring at most 20 times) only. Figures are percentages of correct classifications.
Table 2.2 shows the results when the tests are applied to data that loosely "fit" the $\alpha = 1$ large-alphabet model. To obtain these results we took the real-world data and kept only those words that occurred fewer than 20 times. This meant that some common words with possibly high discriminatory power were removed from the test and training sets. The results show the L2-norm test performing the best of the various distance metrics.
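The minimum-distance rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the distance $d$ in the GLRT score is taken to be the Kullback-Leibler divergence, the pooled distribution is weighted by the two string lengths, and the function names and the toy documents are hypothetical rather than taken from the experiments.

```python
from collections import Counter
import math

def empirical(tokens, vocab):
    """Empirical distribution (type) of a token list over a fixed vocabulary."""
    counts = Counter(tokens)
    n = len(tokens)
    return [counts[w] / n for w in vocab]

def kl(p, q):
    """KL divergence D(p||q) in nats; terms with p(a) = 0 contribute 0."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def l2_score(train, test, vocab):
    """Squared L2 distance between the two empirical distributions."""
    lx, lz = empirical(train, vocab), empirical(test, vocab)
    return sum((a - b) ** 2 for a, b in zip(lx, lz))

def glrt_score(train, test, vocab):
    """d(L_X, P) + d(L_Z, P) with P the length-weighted mixture (assumed d = KL)."""
    lx, lz = empirical(train, vocab), empirical(test, vocab)
    nx, nz = len(train), len(test)
    pooled = [(nx * a + nz * b) / (nx + nz) for a, b in zip(lx, lz)]
    return kl(lx, pooled) + kl(lz, pooled)

def classify(training, test, score):
    """Return the class label whose training document minimises the score."""
    vocab = sorted(set(test).union(*training.values()))
    return min(training, key=lambda c: score(training[c], test, vocab))

# Hypothetical toy data: two classes and one test document.
training = {
    "sports": "ball goal ball team".split(),
    "finance": "bank rate bank loan".split(),
}
test_doc = "goal team ball".split()
print(classify(training, test_doc, l2_score))
print(classify(training, test_doc, glrt_score))
```

Because every test reduces to "pick the class with the smallest score", the multi-class case needs no change beyond taking the minimum over all classes.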
Table 2.3 shows the results using all of the available data. Also included are results for support vector machines (SVM) [23] reported by [74]. The column GLRT(b) corresponds to a tweaked version of the GLRT that we devised to correct the poor performance on the r52 dataset. We observed that when dealing with skewed training sets (i.e. where the lengths of the training data are very different), the GLRT is systematically biased towards the shorter class. For example, suppose we have training lengths $n_x$ and $n$ with $n_x \ll n$, and the test string is also of length $n$. The GLRT first forms the quantities
\[
\hat{p}(\cdot) = \frac{N(\cdot \mid X^{n_x}) + N(\cdot \mid Z^n)}{n + n_x}, \qquad
\hat{q}(\cdot) = \frac{N(\cdot \mid Y^n) + N(\cdot \mid Z^n)}{n + n},
\]
and then carries out the test
\[
\frac{n_x}{n} D(\Lambda_{X^{n_x}} \| \hat{p}) + D(\Lambda_{Z^n} \| \hat{p}) \lessgtr D(\Lambda_{Y^n} \| \hat{q}) + D(\Lambda_{Z^n} \| \hat{q}).
\]
When the true hypothesis is that $Y^n$ and $Z^n$ are from the same class (i.e. have the same distribution), we observed that the GLRT incorrectly decided that $X^{n_x}$ and $Z^n$ were from the same class. A reason for this appeared to be that $D(\Lambda_{Z^n} \| \hat{p})$ was small because $\hat{p} \approx \Lambda_{Z^n}$. By "repeating" the training data so that all strings were the same length, e.g. by forming
\[
\tilde{p}(\cdot) = \frac{\frac{n}{n_x} N(\cdot \mid X^{n_x}) + N(\cdot \mid Z^n)}{n + n},
\]
we found the bias disappeared; the corresponding results are reported as GLRT(b) in Table 2.3. As can be seen from the table, GLRT(b) performs quite well, outperforming the published SVM results in all but one example.

Dataset   SVM    L2     GLRT   GLRT(b)   Chi-squared   Hellinger
20ng      80.8   52     81.7   82.7      74.8          60.4
r52       92     84.1   1.9    91.2      86.1          74.8
r8        94.5   91.2   87.5   96.2      94.2          90.9
webkb     87.9   74.3   91.1   91.7      82.5          78.7
Table 2.3: Classification results for full datasets. Figures are percentages of correct classifications.
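The effect of the length mismatch on the pooled distribution can be seen in a small numerical sketch. The symbol counts below are hypothetical and chosen only so that the training string is much shorter than the test string; the helper names are likewise illustrative.

```python
import math

def kl(p, q):
    """KL divergence D(p||q) in nats; terms with p(a) = 0 contribute 0."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def normalise(counts):
    """Turn symbol counts N(.|string) into an empirical distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def pooled(counts_x, counts_z, weight_x=1.0):
    """(weight_x * N(.|X) + N(.|Z)), normalised to sum to one."""
    return normalise([weight_x * cx + cz for cx, cz in zip(counts_x, counts_z)])

# Hypothetical counts over a three-symbol alphabet.
counts_x = [8, 1, 1]         # short training string, n_x = 10
counts_z = [200, 400, 400]   # test string, n = 1000
n_x, n = sum(counts_x), sum(counts_z)

lam_z = normalise(counts_z)
p_hat = pooled(counts_x, counts_z)                       # plain GLRT pooling
p_tilde = pooled(counts_x, counts_z, weight_x=n / n_x)   # "repeated" training data

d_hat = kl(lam_z, p_hat)      # nearly zero: p_hat is dominated by the test string
d_tilde = kl(lam_z, p_tilde)  # no longer negligible once the lengths are balanced
print(d_hat, d_tilde)
```

In this toy case the plain pooling makes the test string look close to a class it clearly differs from, which is consistent with the bias described above; reweighting the short string's counts by $n / n_x$ removes that artefact.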
CHAPTER 3
COMPRESSION OF LARGE ALPHABET SOURCES
In this chapter we formulate and study the problem of compression of large alphabet sources. A connection between the results of the previous chapter and the present chapter is apparent if one notices that the set of distributions used in the converse part of Theorem 11 are in fact $\alpha = 1$ large-alphabet distributions. Thus another interpretation of the results in this chapter is that for $0 \le \alpha < 1$ universal compression of $\alpha$-large-alphabet sources is possible; for $\alpha = 1$ it is not.
3.1 Notation and Preliminaries
Throughout, logarithms and exponents are in base $e$. For a distribution $p$ on a finite alphabet $\mathcal{A}$, we use $H(p) = \sum_{a \in \mathcal{A}} -p(a) \log p(a)$ to denote entropy. The notation $\mathcal{A}^{\times n}$ is the $n$-fold Cartesian product of $\mathcal{A}$. We use bold type to denote strings (or vectors), e.g. $\mathbf{x} = x_1 \cdots x_n$; usually the length is clear from the context and will be omitted. We use $\Lambda_{\mathbf{x}}$ to denote the empirical distribution, or type, of the string $\mathbf{x}$. $H_2(x)$ denotes the binary entropy function. For a probability distribution $p$, $\operatorname{supp} p$ denotes the support of $p$, i.e. the set of symbols having positive probability. $\mathcal{P}(\mathcal{A})$ denotes the set of all distributions on the set $\mathcal{A}$. $\mathcal{P}_n(\mathcal{A})$ denotes the set of possible empirical distributions for a string of length $n$ on the alphabet $\mathcal{A}$. For a type $Q \in \mathcal{P}_n(\mathcal{A})$, we use $T_Q^n$ to denote the type class of $Q$, i.e. the set of strings with type $Q$.
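The notation above can be made concrete with a short sketch. The function names are illustrative, not part of the text; the type class size is computed as a multinomial coefficient, which is the standard count of strings sharing a given type.

```python
from collections import Counter
from math import factorial, log

def empirical_type(x, alphabet):
    """Empirical distribution (type) of the string x over the given alphabet."""
    counts = Counter(x)
    return {a: counts[a] / len(x) for a in alphabet}

def entropy(p):
    """H(p) in nats (natural logarithm), skipping zero-probability symbols."""
    return -sum(pa * log(pa) for pa in p.values() if pa > 0)

def support(p):
    """Set of symbols with positive probability."""
    return {a for a, pa in p.items() if pa > 0}

def type_class_size(x, alphabet):
    """Number of strings with the same type as x: n! / prod_a (count_a)!."""
    counts = Counter(x)
    size = factorial(len(x))
    for a in alphabet:
        size //= factorial(counts[a])
    return size

x = "aabac"
alphabet = "abcd"
lam = empirical_type(x, alphabet)
print(lam, entropy(lam), support(lam), type_class_size(x, alphabet))
```

For this string the symbol `d` never occurs, so it lies outside the support of the type even though it belongs to the alphabet.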
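We mainly consider sequences of alphabets $\{\mathcal{A}_n\}$ and distributions $\{p_n \in \mathcal{P}(\mathcal{A}_n)\}$. In this case, unless specified otherwise, when we write the random string $X^n$ we are referring to the triangular array $\{X_{n,m}, 1 \le m \le n\}_{n \ge 1}$, so that $X^n = X_{n,1}, \ldots, X_{n,n}$ and $X_{n,i} \sim p_n$.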
We first formalize the notion of an achievable rate sequence.
Definition 6. Let $\{\mathcal{A}_n\}$ be a sequence of finite alphabets. For a sequence of distributions $\{Q_n\}$, where $Q_n$ is defined on the product space $\mathcal{A}_n^{\times n}$, we say a sequence of rates $\{R_n\}$ is achievable (for source coding) if for every $\delta > 0$, $\epsilon > 0$, there exist a sequence of sets $\{M_n\}$ and a sequence of deterministic maps $\{f_n : \mathcal{A}_n^{\times n} \to M_n,\ g_n : M_n \to \mathcal{A}_n^{\times n}\}$ satisfying
\[
\frac{1}{n} \log |M_n| < R_n + \delta \quad \text{and} \quad Q_n\big(g_n(f_n(X^n)) \ne X^n\big) \le \epsilon
\]
for all $n$ sufficiently large.
Remark: For a given sequence of distributions {Qn}, it is straightforward
to verify that a sequence of rates {Rn} is achievable with deterministic maps iff
{Rn} is achievable with randomized maps.
Using information-spectrum methods [33], the following theorem provides a second characterization of an achievable rate sequence.
Theorem 9. Let $\{\mathcal{A}_n\}$ be a sequence of finite alphabets. Let $\{Q_n\}$ be a sequence of probability measures such that $Q_n$ is a measure on the product space $\mathcal{A}_n^{\times n}$. Suppose $X^n \sim Q_n$. Then the sequence $\{R_n\}$ is achievable for source coding if and only if for every $\delta > 0$
\[
\lim_{n \to \infty} Q_n\left( -\frac{1}{n} \log Q_n(X^n) - R_n > \delta \right) = 0. \tag{3.1}
\]
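Proof. Achievability: Suppose that (3.1) holds. Let
\[
G_n^\delta = \{\mathbf{x} \in \mathcal{A}_n^{\times n} : -n^{-1} \log Q_n(\mathbf{x}) \le R_n + \delta\}.
\]
Then
\[
1 \ge Q_n(G_n^\delta) \ge |G_n^\delta| \exp(-n(R_n + \delta)),
\]
which implies
\[
\frac{1}{n} \log |G_n^\delta| \le R_n + \delta.
\]
Furthermore, by hypothesis, $Q_n\big((G_n^\delta)^c\big) \to 0$ as $n \to \infty$, so that defining $f_n$, $g_n$ to identify those sequences in $G_n^\delta$ suffices.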
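Converse: Assume that there exist $\delta > 0$ and $\epsilon > 0$ so that
\[
\limsup_{n \to \infty} Q_n\left( -\frac{1}{n} \log Q_n(X^n) - R_n > \delta \right) > \epsilon.
\]
Let $B_n = \{\mathbf{x} : -n^{-1} \log Q_n(\mathbf{x}) > R_n + \delta\}$. By Definition 6, the achievability of $\{R_n\}$ implies the existence of a sequence of sets $A_n \triangleq \{\mathbf{x} : g_n(f_n(\mathbf{x})) = \mathbf{x}\}$ satisfying $Q_n(A_n^c) \le \epsilon/4$ for all $n$ sufficiently large. Now, $Q_n(A_n \cap B_n) \ge Q_n(B_n) - Q_n(A_n^c)$ and $Q_n(A_n \cap B_n) < |A_n \cap B_n| \exp(-n(R_n + \delta))$, which together give
\[
\begin{aligned}
n^{-1} \log |A_n| &\ge n^{-1} \log |A_n \cap B_n| \\
&> n^{-1} \log[Q_n(A_n \cap B_n)] + R_n + \delta \\
&\ge n^{-1} \log[Q_n(B_n) - Q_n(A_n^c)] + R_n + \delta.
\end{aligned}
\]
However, for a subsequence $\{n_k\}$ we have $Q_{n_k}(B_{n_k}) \ge \epsilon/2$, and thus $Q_{n_k}(B_{n_k}) - Q_{n_k}(A_{n_k}^c) \ge \epsilon/4$ for all $n_k \ge n_0$. Since $f_n$ is injective on $A_n$, we have $|M_n| \ge |A_n|$. Therefore, for all $n_k \ge n_0$,
\[
n_k^{-1} \log |M_{n_k}| \ge n_k^{-1} \log |A_{n_k}| > n_k^{-1} \log(\epsilon/4) + R_{n_k} + \delta.
\]
Because $n_k^{-1} \log(\epsilon/4) \to 0$, this gives $n^{-1} \log |M_n| > R_n + \delta/2$ for infinitely many $n$, contradicting the achievability of $\{R_n\}$.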
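The set $G_n^\delta$ used in the achievability part can be examined numerically. The sketch below assumes, purely for illustration, that $Q_n$ is an i.i.d. Bernoulli($p$) measure on binary strings and that the rate equals the entropy of that source; neither choice comes from the text. Strings are grouped by their number of ones, so no enumeration of all $2^n$ strings is needed.

```python
from math import comb, exp, log

def good_set_stats(n, p, rate, delta):
    """Size and probability of G = {x : -(1/n) log Q_n(x) <= rate + delta}
    for an i.i.d. Bernoulli(p) source on {0,1}^n (an illustrative choice of Q_n)."""
    size, prob = 0, 0.0
    for k in range(n + 1):  # k = number of ones in the string
        log_q = k * log(p) + (n - k) * log(1 - p)  # log-probability of one such string
        if -log_q / n <= rate + delta:
            size += comb(n, k)
            # Work in the log domain so that large binomial coefficients do not overflow.
            prob += exp(log(comb(n, k)) + log_q)
    return size, prob

p, delta = 0.1, 0.05
rate = -p * log(p) - (1 - p) * log(1 - p)  # entropy of the assumed source, in nats

for n in (200, 2000):
    size, prob = good_set_stats(n, p, rate, delta)
    # The counting bound |G| <= exp(n (R + delta)) from the proof, checked in logs.
    print(n, log(size) <= n * (rate + delta), prob)
```

In this example the probability of the set grows towards one as $n$ increases while its size stays within the $\exp(n(R_n + \delta))$ bound, which is the behaviour the achievability argument relies on.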
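Corollary 1. If there is a code $f : \mathcal{A}_n^{\times n} \to M_n$, $g : M_n \to \mathcal{A}_n^{\times n}$ with rate $n^{-1} \log |M_n| \le R_n$ and probability of error $Q_n(g(f(X^n)) \ne X^n) \le \epsilon$, then
\[
Q_n\big(-n^{-1} \log Q_n(X^n) > R_n + \delta\big) \le \epsilon + \exp(-n\delta).
\]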
Proof. This is implied by the calculations in the converse part of the proof of Theorem 9. Adopting the definitions from that proof, we saw that
\[
Q_n(B_n) \le Q_n(A_n^c) + Q_n(A_n \cap B_n).
\]
By hypothesis $Q_n(A_n^c) \le \epsilon$. Furthermore,
\[
Q_n(A_n \cap B_n) = \sum_{\mathbf{x} \in A_n \cap B_n} Q_n(\mathbf{x}) \le \sum_{\mathbf{x} \in A_n \cap B_n} \exp(-n[R_n + \delta]) \le \exp(-n\delta),
\]
where the final inequality uses the fact that the range of $f$ has size at most $\exp(nR_n)$.
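As a sanity check, the bound in Corollary 1 can be verified by brute force on a small example. The sketch assumes an i.i.d. Bernoulli($p$) source and a particular code that reproduces the $\lfloor \exp(nR) \rfloor$ most probable strings exactly and errs on all others; both the source and the code are illustrative choices, and the corollary itself covers any code of that rate.

```python
from itertools import product
from math import exp, floor, log

def corollary_check(n, p, rate, delta):
    """Return (lhs, rhs) of the bound Q(-(1/n) log Q(X) > R + delta) <= eps + exp(-n delta)
    for an i.i.d. Bernoulli(p) source and a code keeping the most probable strings."""
    # Probability of every binary string of length n, most probable first.
    probs = sorted(
        (p ** sum(x) * (1 - p) ** (n - sum(x)) for x in product((0, 1), repeat=n)),
        reverse=True,
    )
    codebook_size = min(len(probs), floor(exp(n * rate)))
    eps = sum(probs[codebook_size:])  # error probability of this particular code
    lhs = sum(q for q in probs if -log(q) / n > rate + delta)
    return lhs, eps + exp(-n * delta)

lhs, rhs = corollary_check(n=12, p=0.2, rate=0.4, delta=0.1)
print(lhs, rhs, lhs <= rhs)
```

The check enumerates all $2^n$ strings, so it is only practical for small $n$.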