
ECE 534: Elements of Information Theory, Fall 2010 Homework 1

Solutions

Ex. 2.1 (Davide Basilio Bartolini)

Text

Coin Flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips required.

(a) Find the Entropy H(X) in bits

(b) A random variable X is drawn according to this distribution. Find an “efficient” sequence of yes-no questions of the form, “Is X contained in the set S?”. Compare H(X) to the expected number of questions required to determine X.

Solution

(a) The random variable X takes values in $\mathcal{X} = \{1, 2, 3, \ldots\}$ and denotes the number of flips needed to get the first head, i.e. 1 + the number of consecutive tails that appear before the first head.

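Since the coin is said to be fair, we have p(“head”) = p(“tail”) = 1/2 and hence (exploiting the independence of the coin flips):

$$
\begin{aligned}
p(X = 1) &= p(\text{head}) = \frac{1}{2} \\
p(X = 2) &= p(\text{tail}) \cdot p(\text{head}) = \frac{1}{2} \cdot \frac{1}{2} = \left(\frac{1}{2}\right)^2 \\
&\;\;\vdots \\
p(X = n) &= \overbrace{p(\text{tail}) \cdots p(\text{tail}) \cdot p(\text{head})}^{n \text{ factors}} = \frac{1}{2} \cdots \frac{1}{2} \cdot \frac{1}{2} = \left(\frac{1}{2}\right)^n
\end{aligned}
$$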

From this, it is clear that the probability mass function of X is:

$$
p_X(x) = \left(\frac{1}{2}\right)^x
$$

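Once the distribution is known, H(X) can be computed from the definition:

$$
\begin{aligned}
H(X) &= -\sum_{x \in \mathcal{X}} p_X(x) \log_2 p_X(x) \\
&= -\sum_{x=1}^{\infty} \left(\frac{1}{2}\right)^x \log_2 \left(\frac{1}{2}\right)^x \\
&= -\sum_{x=0}^{\infty} \left(\frac{1}{2}\right)^x \log_2 \left(\frac{1}{2}\right)^x && \text{(since the summand equals 0 for } x = 0\text{)} \\
&= -\sum_{x=0}^{\infty} \left(\frac{1}{2}\right)^x x \log_2 \frac{1}{2} && \text{(property of logarithms)} \\
&= -\log_2 \frac{1}{2} \sum_{x=0}^{\infty} x \left(\frac{1}{2}\right)^x \\
&= \sum_{x=0}^{\infty} x \left(\frac{1}{2}\right)^x \\
&= \frac{1/2}{\left(1 - 1/2\right)^2} = 2 \text{ bits} && \left(\text{exploiting } \sum_{x=0}^{\infty} x k^x = \frac{k}{(1 - k)^2} \text{ for } |k| < 1\right)
\end{aligned}
$$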
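As a quick numerical sanity check of the closed-form result, the series can be truncated and summed directly. This is only a sketch: the function name `geometric_entropy_bits` and the truncation length `num_terms` are illustrative choices, not part of the original solution.

```python
import math


def geometric_entropy_bits(num_terms: int = 200) -> float:
    """Approximate H(X) = -sum_{x>=1} p(x) log2 p(x) for p(x) = (1/2)**x.

    The series is truncated after `num_terms` terms (an assumption of this
    sketch); the neglected tail is vanishingly small for 200 terms.
    """
    total = 0.0
    for x in range(1, num_terms + 1):
        p = 0.5 ** x
        total -= p * math.log2(p)
    return total


entropy = geometric_entropy_bits()
```

The truncated sum agrees with the value of 2 bits derived above up to floating-point error.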

(b) Since the most likely value for X is 1 (p(X = 1) = 1/2), the most efficient first question is: “Is X = 1?”; the next question will be “Is X = 2?” and so on, until a positive answer is found. If this strategy is used, the random variable Y representing the number of questions will have the same distribution as X and its expected value will be:

$$
E[Y] = \sum_{y=1}^{\infty} y \left(\frac{1}{2}\right)^y = \frac{1/2}{\left(1 - 1/2\right)^2} = 2
$$

which is exactly equal to the entropy of X.

An interpretation of this fact could be that 2 bits (which is the entropy value for X) are the amount of memory required to store the outcomes of the two binary questions which are enough (on average) to get a positive answer on the value of X.
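The questioning strategy of part (b) can also be checked by simulation. The sketch below assumes a pseudo-random fair coin; the helper names `questions_needed` and `average_questions`, the number of trials and the seed are all illustrative choices.

```python
import random


def questions_needed(rng: random.Random) -> int:
    """Flip a fair coin until the first head and return X.

    Asking "Is X = 1?", "Is X = 2?", ... yields the first "yes" after
    exactly X questions, so the return value is also the question count.
    """
    flips = 1
    while rng.random() < 0.5:  # a tail occurs with probability 1/2
        flips += 1
    return flips


def average_questions(num_trials: int = 200_000, seed: int = 0) -> float:
    """Empirical mean of the number of questions over `num_trials` runs."""
    rng = random.Random(seed)
    return sum(questions_needed(rng) for _ in range(num_trials)) / num_trials


mean_questions = average_questions()
```

The empirical mean should be close to 2, in line with E[Y] = H(X) = 2.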

Exercise 2.4 (Matteo Carminati)

Entropy of functions of a random variable. Let X be a discrete random variable. Show that the entropy of a function of X is less than or equal to the entropy of X by justifying the following steps:

$$
H(X, g(X)) \overset{(a)}{=} H(X) + H(g(X) \mid X) \overset{(b)}{=} H(X) \tag{1}
$$

$$
H(X, g(X)) \overset{(c)}{=} H(g(X)) + H(X \mid g(X)) \overset{(d)}{\geq} H(g(X)) \tag{2}
$$

Thus, H(g(X)) ≤ H(X).

Solution

(a) It comes from entropy’s chain rule applied to random variables X and g(X), i.e. H(X, Y ) = H(X) + H(Y |X), so H(X, g(X)) = H(X) + H(g(X)|X).

(b) Intuitively, since g(X) depends only on X, once the value of X is known g(X) is completely specified, i.e. it has a deterministic value. The entropy of a deterministic value is 0, so H(g(X)|X) = 0 and H(X) + H(g(X)|X) = H(X).

(c) Again, this formula comes from the entropy’s chain rule, in the form: H(X, Y ) = H(Y ) + H(X|Y ).

(d) Proving that H(g(X)) + H(X|g(X)) ≥ H(g(X)) means proving that H(X|g(X)) ≥ 0: non-negativity is one of the properties of entropy and can be proved from its definition by noting that the logarithm of a probability (a quantity always less than or equal to 1) is non-positive. In particular, H(X|g(X)) = 0 if knowing the value of g(X) completely specifies the value of X (for example, if g is an injective function); otherwise H(X|g(X)) > 0 (for example, if g maps two values of X with nonzero probability to the same output).
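The inequality H(g(X)) ≤ H(X) can be illustrated on a small example. The distribution `p_x` and the two functions below (a non-injective square and an injective affine map) are hypothetical choices made only for this sketch.

```python
import math
from collections import defaultdict


def entropy_bits(pmf: dict) -> float:
    """Shannon entropy in bits of a pmf given as {value: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def pushforward(pmf: dict, g) -> dict:
    """Distribution of g(X) when X is distributed according to `pmf`."""
    out = defaultdict(float)
    for x, p in pmf.items():
        out[g(x)] += p
    return dict(out)


# Illustrative distribution on {-1, 0, 1} (an assumption of this example).
p_x = {-1: 0.25, 0: 0.5, 1: 0.25}

h_x = entropy_bits(p_x)
h_square = entropy_bits(pushforward(p_x, lambda x: x * x))      # not injective
h_affine = entropy_bits(pushforward(p_x, lambda x: 2 * x + 1))  # injective
```

Here the non-injective map merges the outcomes −1 and 1, so the entropy strictly decreases, while the injective map leaves it unchanged.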

Ex. 2.7(a) (Davide Basilio Bartolini)

Text

Coin weighing. Suppose that one has n coins, among which there may or may not be one counterfeit coin. If there is a counterfeit coin, it may be either heavier or lighter than the other coins. The coins are to be weighed by a balance.

Find an upper bound on the number of coins n so that k weighings will find the counterfeit coin (if any) and correctly declare it to be heavier or lighter.

Solution

Let X be a string of n characters over the alphabet {−1, 0, 1}, i.e. $X \in \{-1, 0, 1\}^n$, where each character represents one coin. Each character of X may take three different values (say 1 if the coin is heavier than a normal one, 0 if it is regular, −1 if it is lighter). Since at most one of the coins may be counterfeit, X is either the all-0 string (if all the coins are regular) or has a single 1 or −1 at exactly one position. Thus, there are 2n + 1 possible configurations for X.

Under the hypothesis of a uniform distribution over these 2n + 1 configurations, the entropy of X will be:

$$
H(X) = \log_2(2n + 1)
$$

Now let Z = [Z1, Z2, . . . , Zk] be a random variable representing the weighings; each Zi has three possible values, indicating whether the result of the weighing is balanced, left arm heavier or right arm heavier. The entropy of each Zi is upper-bounded by the logarithm of the number of values it can assume, H(Zi) ≤ log2 3, ∀i ∈ [1, k], and for Z (under the hypothesis of independence of the weighings):

$$
H(Z) = H(Z_1, Z_2, \ldots, Z_k) \overset{\text{(chain rule)}}{=} \sum_{i=1}^{k} H(Z_i \mid Z_{i-1}, \ldots, Z_1) \overset{\text{(indep.)}}{=} \sum_{i=1}^{k} H(Z_i) \leq k \log_2 3
$$

Since the weighings must yield at least as much information as is carried by the configuration of X (i.e. enough to find out which coin, if any, is counterfeit and whether it is heavier or lighter), we can write:

$$
\begin{aligned}
H(X) &\leq H(Z) \leq k \log_2 3 \\
\log_2(2n + 1) &\leq k \log_2 3 \\
2n + 1 &\leq 3^k \\
n &\leq \frac{3^k - 1}{2}
\end{aligned}
$$

which is the wanted upper bound.
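The bound is easy to tabulate for small k. The following sketch only evaluates the inequality 2n + 1 ≤ 3^k derived above; the function names are illustrative, and the code does not construct an actual weighing strategy.

```python
def max_coins(k: int) -> int:
    """Largest n satisfying the bound n <= (3**k - 1) / 2 for k weighings."""
    return (3 ** k - 1) // 2


def within_bound(n: int, k: int) -> bool:
    """True when the 2n + 1 configurations fit into the 3**k weighing outcomes."""
    return 2 * n + 1 <= 3 ** k


# Upper bound on the number of coins for k = 1, ..., 5 weighings.
bounds = {k: max_coins(k) for k in range(1, 6)}
```

For example, three weighings give the bound n ≤ 13. Integer arithmetic is used instead of comparing logarithms to avoid floating-point issues at the boundary.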

Ex. 2.12 (Kenneth Palacio)

         Y = 0    Y = 1
X = 0     1/3      1/3
X = 1      0       1/3

Table 1: p(x,y) for problem 2.12.

Find:

(a) H(X), H(Y ).

(b) H(X|Y ), H(Y |X).

(c) H(X, Y ).

(d) H(Y ) − H(Y |X).

(e) I(X; Y ).

(f) Draw a Venn diagram for the quantities in parts (a) through (e).


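Computation of the marginal distributions:

$$
p(x) = \left[\tfrac{2}{3}, \tfrac{1}{3}\right], \qquad p(y) = \left[\tfrac{1}{3}, \tfrac{2}{3}\right]
$$

(a) H(X), H(Y ).

$$
H(X) = -\frac{2}{3} \log_2 \frac{2}{3} - \frac{1}{3} \log_2 \frac{1}{3} \tag{3}
$$

$$
H(X) = 0.918 \text{ bits} \tag{4}
$$

$$
H(Y) = -\frac{1}{3} \log_2 \frac{1}{3} - \frac{2}{3} \log_2 \frac{2}{3} \tag{5}
$$

$$
H(Y) = 0.918 \text{ bits} \tag{6}
$$

Figure 1: H(X), H(Y)

(b) H(X|Y ), H(Y |X).

$$
H(X \mid Y) = \sum_{i=0}^{1} p(y = i) H(X \mid Y = i) \tag{7}
$$

$$
H(X \mid Y) = \frac{1}{3} H(X \mid Y = 0) + \frac{2}{3} H(X \mid Y = 1) \tag{8}
$$

$$
H(X \mid Y) = \frac{1}{3} H(1, 0) + \frac{2}{3} H(1/2, 1/2) \tag{9}
$$

$$
H(X \mid Y) = 2/3 \text{ bits} \tag{10}
$$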


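$$
H(Y \mid X) = \sum_{i=0}^{1} p(x = i) H(Y \mid X = i) \tag{11}
$$

$$
H(Y \mid X) = \frac{2}{3} H(Y \mid X = 0) + \frac{1}{3} H(Y \mid X = 1) \tag{12}
$$

$$
H(Y \mid X) = \frac{2}{3} H(1/2, 1/2) + \frac{1}{3} H(0, 1) \tag{13}
$$

$$
H(Y \mid X) = 2/3 \text{ bits} \tag{14}
$$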

Figure 2: H(X|Y ), H(Y |X)

(c) H(X, Y ).

$$
H(X, Y) = -\sum_{x=0}^{1} \sum_{y=0}^{1} p(x, y) \log_2 p(x, y) \tag{15}
$$

$$
H(X, Y) = -3 \cdot \frac{1}{3} \log_2 \frac{1}{3} \tag{16}
$$

$$
H(X, Y) = 1.5849625 \text{ bits} \tag{17}
$$

Figure 3: H(X,Y)


(d) H(Y ) − H(Y |X).

$$
H(Y) - H(Y \mid X) = 0.9183 - 0.6667 \tag{18}
$$

$$
H(Y) - H(Y \mid X) = 0.2516 \text{ bits} \tag{19}
$$

Figure 4: H(Y ) − H(Y |X)

(e) I(X; Y ).

$$
I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)} \tag{20}
$$

$$
I(X; Y) = \frac{1}{3} \log_2 \frac{1/3}{(2/3)(1/3)} + \frac{1}{3} \log_2 \frac{1/3}{(2/3)(2/3)} + \frac{1}{3} \log_2 \frac{1/3}{(1/3)(2/3)} \tag{21}
$$

$$
I(X; Y) = 0.25162916 \text{ bits} \tag{22}
$$

Figure 5: I(X;Y)

(f) The Venn diagram is already shown for each item.
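All the quantities of parts (a) through (e) can be recomputed directly from Table 1. The sketch below uses the identities H(X|Y) = H(X,Y) − H(Y) and I(X;Y) = H(Y) − H(Y|X); the variable names are illustrative.

```python
import math

# Joint distribution p(x, y) from Table 1, keyed by (x, y).
p_xy = {(0, 0): 1 / 3, (0, 1): 1 / 3, (1, 0): 0.0, (1, 1): 1 / 3}


def entropy_bits(probs) -> float:
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Marginals obtained by summing the joint over the other variable.
p_x = [sum(p for (x, _), p in p_xy.items() if x == v) for v in (0, 1)]
p_y = [sum(p for (_, y), p in p_xy.items() if y == v) for v in (0, 1)]

h_x = entropy_bits(p_x)
h_y = entropy_bits(p_y)
h_xy = entropy_bits(p_xy.values())
h_x_given_y = h_xy - h_y          # chain rule
h_y_given_x = h_xy - h_x          # chain rule
mutual_info = h_y - h_y_given_x   # I(X;Y) = H(Y) - H(Y|X)
```

Computed this way, parts (d) and (e) coincide, as they must.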

Ex. 2.20 (Kenneth Palacio)

Run-length coding. Let X1, X2, . . . , Xn be (possibly dependent) binary random variables. Suppose that one calculates the run lengths R = (R1, R2, . . .) of this sequence (in order as they occur). For example, the sequence X = 0001100100 yields run lengths R = (3, 2, 2, 1, 2). Compare H(X1, X2, ..., Xn), H(R), and H(Xn, R). Show all equalities and inequalities, and bound all the differences.

Solution:

Let's assume that one random variable Xj (0 < j ≤ n) is known. If R is also known, then the pair (Xj, R) carries the same information as (X1, X2, . . . , Xn): the whole sequence X can be completely recovered from the knowledge of Xj and R, while Xj and R are themselves functions of X. For example, with X5 = 1 and the run lengths R = (3, 2, 2, 1, 2) it’s possible to recover the original sequence as follows:

X5 = 1, R = (3, 2, 2, 1, 2) leads to recover the sequence: X = 0001100100.
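The recovery step can be made explicit in code. This is a minimal sketch: the function names `run_lengths` and `reconstruct` and the 1-based indexing of j are assumptions of the example, chosen to match the notation above.

```python
from itertools import groupby


def run_lengths(bits: str) -> tuple:
    """Run lengths of a binary string, in order of occurrence."""
    return tuple(len(list(group)) for _, group in groupby(bits))


def reconstruct(runs, j: int, x_j: int) -> str:
    """Rebuild the binary string from its run lengths and one symbol X_j.

    `j` is 1-indexed. Runs alternate between 0 and 1, so knowing the symbol
    of any single run fixes the symbols of all the others.
    """
    if not 1 <= j <= sum(runs):
        raise ValueError("j must index a position inside the sequence")
    position = 0
    for run_index, length in enumerate(runs):
        position += length
        if j <= position:
            break  # run_index now identifies the run containing position j
    first = x_j if run_index % 2 == 0 else 1 - x_j
    return "".join(str(first ^ (i % 2)) * length for i, length in enumerate(runs))


recovered = reconstruct((3, 2, 2, 1, 2), j=5, x_j=1)
```

With the values of the example, the function returns the original sequence 0001100100.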

It can be concluded that:

$$
H(X_j, R) = H(X_1, X_2, \ldots, X_n)
$$

Using the chain rule, H(Xj, R) can be written as:

$$
H(X_j, R) = H(R) + H(X_j \mid R)
$$

where H(Xj|R) ≤ H(Xj), since conditioning reduces entropy.

Then it’s possible to write:

$$
H(X_j) \geq H(X_1, X_2, \ldots, X_n) - H(R)
$$

$$
H(X_j) + H(R) \geq H(X_1, X_2, \ldots, X_n)
$$

To bound $H(X_j) = -\sum_{x} p(x) \log_2 p(x)$, note that the distribution of Xj is unknown; since Xj is binary, it can be assumed to be a probability of p for Xj = 0 and of (1 − p) for Xj = 1. The entropy is maximized when p = 1/2, leading to max H(Xj) = 1 bit.

Then:

1 + H(R) ≥ H(X1, X2, ...Xn)

Considering the result obtained in problem 2.4, we can also write H(R) ≤ H(X), because R is a function of X. Combining the two results: H(R) ≤ H(X1, X2, . . . , Xn) ≤ H(R) + 1.
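The two bounds can be checked numerically on an arbitrary joint distribution over short binary sequences. In the sketch below, the sequence length n = 4, the random weights and the seed are illustrative assumptions; any other distribution would satisfy the same inequalities.

```python
import math
import random
from collections import defaultdict
from itertools import groupby, product


def run_lengths(bits) -> tuple:
    """Run lengths of a binary sequence, in order of occurrence."""
    return tuple(len(list(group)) for _, group in groupby(bits))


def entropy_bits(probs) -> float:
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropies(n: int = 4, seed: int = 0):
    """Return (H(X), H(R)) for a randomly weighted pmf on {0,1}^n."""
    rng = random.Random(seed)
    sequences = list(product((0, 1), repeat=n))
    weights = [rng.random() for _ in sequences]
    total = sum(weights)
    p_x = {s: w / total for s, w in zip(sequences, weights)}
    p_r = defaultdict(float)
    for s, p in p_x.items():
        p_r[run_lengths(s)] += p  # R is a function of X
    return entropy_bits(p_x.values()), entropy_bits(p_r.values())


h_x, h_r = entropies()
```

For any choice of weights, the computed values satisfy H(R) ≤ H(X) ≤ H(R) + 1, since each run-length vector corresponds to exactly two sequences (one starting with 0 and one starting with 1).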