
Measuring information complexity

In document Complexity of Algorithms (Page 144-153)

Fix an alphabet $\Sigma$. Let $\Sigma_0 = \Sigma \setminus \{*\}$ and consider a two-tape universal Turing machine $T$ over $\Sigma$. We say that the word (program) $q$ over $\Sigma$ prints the word $x$ if, when $q$ is written on the second tape (the program tape) of $T$ and the first tape is left empty, the machine stops in finitely many steps with the word $x$ on its first tape (the data tape).

Let us note right away that every word is printable on $T$. Namely, there is a one-tape (perhaps large, but rather trivial) Turing machine $S_x$ that, when started with the empty tape, writes the word $x$ onto it and halts. This Turing machine can be simulated by a program $q_x$ that, in this way, prints $x$.

By the information complexity (also called Kolmogorov complexity) of a word $x \in \Sigma_0^*$ we mean the length of the shortest word (program) that makes $T$ print the word $x$. We denote the complexity of the word $x$ by $K_T(x)$.

We can also consider the program printing $x$ as a “code” of the word $x$, where the Turing machine $T$ performs the decoding. This kind of code will be called a Kolmogorov code. For the time being, we make no assumptions about how much time this decoding (or encoding, that is, finding the appropriate program) can take.

We would like the complexity to be a characteristic property of the word $x$ and to depend on the machine $T$ as little as possible. It is, unfortunately, easy to make a Turing machine that is obviously “clumsy”. For example, it may use only every second letter of each program and skip the intermediate letters. Such a machine can be universal, but every word will be twice as complex on it as on the machine without this strange behavior.
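The effect of a clumsy machine can be seen on a toy model. The following is only a sketch under strong assumptions: `toy_machine` is a hypothetical decoder that always halts (it is not a universal Turing machine), so the shortest program can be found by brute force, and `clumsy_machine` is the same decoder reading only every second letter of its program.

```python
from itertools import product

LETTERS = "01d"  # hypothetical program alphabet of the toy decoder

def toy_machine(q):
    """Toy decoder: '0' and '1' append a letter, 'd' doubles the output so far."""
    out = ""
    for letter in q:
        out = out + out if letter == "d" else out + letter
    return out

def clumsy_machine(q):
    """Same decoder, but it uses only every second letter of the program."""
    return toy_machine(q[1::2])

def complexity(x, machine, max_len=10):
    """Length of the shortest program printing x on `machine` (brute force)."""
    for n in range(max_len + 1):
        for letters in product(LETTERS, repeat=n):
            if machine("".join(letters)) == x:
                return n
    return None  # no program of length <= max_len prints x

x = "01010101"
print(complexity(x, toy_machine), complexity(x, clumsy_machine))  # prints: 4 8
```

On the toy decoder the shortest program is `01dd`; on the clumsy variant every useful letter must be preceded by a skipped one, so the length doubles.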

We show that if we impose some rather simple conditions on the machine $T$, then it will no longer be essential which universal Turing machine is used for the definition of information complexity. Roughly speaking, it is enough to assume that every input of a computation performable on $T$ can also be submitted as part of the program. To make this more exact, we assume that there is a word (say, DATA) for which the following holds:

(a) Every one-tape Turing machine can be simulated by a program that does not contain the word DATA as a subword;

(b) If the machine is started so that its program tape contains a word of the form $x\,\mathrm{DATA}\,y$, then it halts if and only if it halts when started with $y$ written on the data tape and $x$ on the program tape, and in fact with the same output on the data tape.

It is easy to see that every universal Turing machine can be modified to satisfy the assumptions (a) and (b). In what follows, we will always assume that our universal Turing machine has these properties.

Lemma 8.2.1 There is a constant $c_T$ (depending only on $T$) such that $K_T(x) \le |x| + c_T$.

Proof. $T$ is universal, therefore the (trivial) one-tape Turing machine that does nothing (stops immediately) can be simulated on it by a program $p_0$ (not containing the word DATA). But then, for every word $x \in \Sigma_0^*$, the program $p_0\,\mathrm{DATA}\,x$ will print the word $x$ and stop. Thus the constant $c_T = |p_0| + 4$ satisfies the conditions. □

In what follows we assume, to be specific, that $c_T \le 100$.

Remark 8.2.1 We had to be a little careful since we did not want to restrict what symbols can occur in the word $x$. In BASIC, for example, the instruction PRINT "x" is not good for printing words $x$ that contain the symbol ". We are interested in knowing how concisely the word $x$ can be coded in the given alphabet, and therefore we do not allow the extension of the alphabet.

We prove now the basic theorem showing that the complexity (under the above conditions) does not depend too much on the given machine.

Theorem 8.2.2 (Invariance Theorem) Let $T$ and $S$ be universal Turing machines satisfying the conditions (a), (b). Then there is a constant $c_{TS}$ such that for every word $x$ we have

$$|K_T(x) - K_S(x)| \le c_{TS}.$$

Proof. We can simulate the two-tape Turing machine $S$ by a one-tape Turing machine $S_0$ in such a way that if on $S$ a program $q$ prints a word $x$, then $S_0$, started with $q$ written on its single tape, also stops in finitely many steps with $x$ printed on its tape. Further, we can simulate the work of the Turing machine $S_0$ on $T$ by a program $p_{S_0}$ that does not contain the subword DATA.

Now let $x$ be an arbitrary word from $\Sigma_0^*$ and let $q_x$ be a shortest program printing $x$ on $S$. Consider the program $p_{S_0}\,\mathrm{DATA}\,q_x$ on $T$: this obviously prints $x$ and has length only $|q_x| + |p_{S_0}| + 4$. Hence $K_T(x) \le K_S(x) + |p_{S_0}| + 4$. The inequality in the other direction is obtained similarly. □

On the basis of this theorem, we will not restrict generality if we consider $T$ fixed and do not indicate the index $T$.

Unfortunately, the following theorem shows that the optimal code cannot be found algorithmically.

Theorem 8.2.3 The function $K(x)$ is not recursive.

Proof. The essence of the proof is a classical logical paradox, the so-called typewriter paradox. (This can be formulated simply as follows: let $n$ be the smallest number that cannot be defined with fewer than 100 symbols. We have just defined $n$ with fewer than 100 symbols!)

Assume, by way of contradiction, that $K(x)$ is computable by a Turing machine $S$. Let $c$ be a natural number to be chosen appropriately. Let $(y_1, y_2, \dots)$ be the increasing ordering of $\Sigma_0^*$ (as defined in Section 1.2), so that $y_k$ denotes the $k$-th word according to this ordering, and let $x$ be the first word in this ordering with $K(x) \ge c$. Assuming that our machine can be programmed in the language Pascal, let us consider the following simple program.

Program 8.2.4

    var k: integer;
    function y(k: integer): integer;
      ...
    function Kolm(k: integer): integer;
      ...
    begin
      k := 1;
      while Kolm(k) < c do k := k + 1;
      print(y(k));
    end

The dotted parts stand for subroutines computing the functions $y(k) = y_k$ and $\mathrm{Kolm}(k) = K(y_k)$. The first is easy and could be explicitly included. The second is hypothetical, based on the assumption that $K(x)$ is computable.

This program obviously prints $x$. When determining its length we must take into account the length of the subroutines for the computation of the functions $y(k)$ and $\mathrm{Kolm}(k)$; but this is a constant, independent of $c$. The only part of the program that depends on $c$ is the number $c$ itself, which can be written with about $\log c$ symbols. Thus even taken together, the number of all these symbols is only $\log c + O(1)$. If we take $c$ large enough, this program consists of fewer than $c$ symbols and prints $x$, which contradicts $K(x) \ge c$. □
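The length count at the end of the proof can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that the fixed part of the program (the two subroutines and the main loop) takes `overhead` symbols and that $c$ is written in decimal; both the value 1000 and the function names are hypothetical.

```python
def program_length(c, overhead):
    """Symbols in the program: the fixed part plus the decimal digits of c."""
    return overhead + len(str(c))

def first_contradictory_c(overhead):
    """Smallest c for which the program printing x is shorter than c."""
    c = 1
    while program_length(c, overhead) >= c:
        c += 1
    return c

print(first_contradictory_c(1000))  # prints: 1005
```

Whatever the fixed overhead is, the digits of $c$ grow only logarithmically, so such a $c$ always exists.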

As a simple application of the theorem, we get a new proof for the undecidability of the halting problem. To this end, let's ask the following question: Why is it not possible to compute $K(x)$ as follows? Take all words $y$ in increasing order and check whether $T$ prints $x$ when started with $y$ on its program tape. Return the first $y$ for which this happens; its length is $K(x)$.

We know that something must be wrong here, since $K(x)$ is not computable. The only trouble with this algorithm is that $T$ may never halt with some $y$. If the halting problem were decidable, we could “weed out” in advance the programs on which $T$ would work forever, and not even try these. Thus we could compute $K(x)$.

Thus the halting problem is not decidable.
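The search described above can be sketched on a toy machine for which halting happens to be decidable. Everything named here is a hypothetical stand-in: `run` plays the role of $T$, the letter `l` makes a program loop forever, and `halts` is the halting test that, by the argument above, cannot exist for the real universal machine.

```python
from itertools import count, product

LETTERS = "01dl"  # hypothetical program alphabet; 'l' makes the toy machine loop

def run(q):
    """Toy machine: '0'/'1' append, 'd' doubles the output, 'l' never halts."""
    out = ""
    for letter in q:
        if letter == "l":
            while True:  # diverge: calling run on such a program never returns
                pass
        out = out + out if letter == "d" else out + letter
    return out

def halts(q):
    """Halting test. Decidable for this toy machine only, not for T."""
    return "l" not in q

def shortest_program(x):
    """Try all programs in increasing order, skipping those that never halt."""
    for n in count(0):
        for letters in product(LETTERS, repeat=n):
            q = "".join(letters)
            if halts(q) and run(q) == x:
                return q

print(shortest_program("0101"))  # prints: 01d
```

Without the `halts(q)` guard the search would hang on the first looping program it meets, which is exactly the obstacle named in the text.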

In contrast to Theorem 8.2.3, we show that the complexity $K(x)$ can be very well approximated on the average.

For this, we must first make it precise what we mean by “on the average”. Assume that the input words come from some probability distribution; in other words, every word $x \in \Sigma_0^*$ has a probability $p(x)$. Thus

$$p(x) \ge 0, \qquad \sum_{x \in \Sigma_0^*} p(x) = 1.$$

We assume that $p(x)$ is computable, i.e., each $p(x)$ is a rational number whose numerator and denominator are computable from $x$. A simple example of a computable probability distribution is $p(x_k) = 2^{-k}$, where $x_k$ is the $k$-th word in size order, or $p(x) = (m+1)^{-|x|-1}$, where $m$ is the alphabet size.
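That both examples are probability distributions can be checked with exact rational arithmetic. The sketch below computes partial sums; the function names are illustrative only. For the second example it uses the fact that there are $m^n$ words of length $n$, so the words of length at most $N$ carry total probability $1 - (m/(m+1))^{N+1}$.

```python
from fractions import Fraction

def partial_sum_geometric(n_words):
    """Sum of p(x_k) = 2^-k over the first n_words words."""
    return sum(Fraction(1, 2 ** k) for k in range(1, n_words + 1))

def partial_sum_by_length(m, max_len):
    """Sum of p(x) = (m+1)^(-|x|-1) over all words of length <= max_len.

    There are m^n words of length n over an alphabet of size m.
    """
    return sum(m ** n * Fraction(1, (m + 1) ** (n + 1)) for n in range(max_len + 1))

print(partial_sum_geometric(10))    # prints: 1023/1024
print(partial_sum_by_length(2, 3))  # prints: 65/81
```

Both partial sums tend to 1 as more words are included.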

Remark 8.2.5 There is a more general notion of a computable probability distribution that does not restrict probabilities to rational numbers; for example, $\{e^{-1}, 1 - e^{-1}\}$ could also be considered a computable probability distribution. Without going into details we remark that our theorems would also hold for this more general class.

Theorem 8.2.6 For every computable probability distribution there is an algorithm computing a Kolmogorov code $f(x)$ for every word $x$ such that the expectation of $|f(x)| - K(x)$ is finite.

For simplicity of presentation, assume that $p(x) > 0$ for every word $x$. We will need three orderings of $\Sigma_0^*$. Let $y_1, y_2, \dots$ be all words arranged in increasing order. Let $x_1, x_2, \dots$ be an ordering of the words in $\Sigma_0^*$ for which $p(x_1) \ge p(x_2) \ge \cdots$, and the words with equal probability are, say, in increasing order (since each word has positive probability, for every $x$ there are only a finite number of words with probability at least $p(x)$, and hence this is indeed a single sequence). Let $(z_1, z_2, \dots)$ be an ordering of the words so that $K(z_1) \le K(z_2) \le \cdots$ (we can't compute this ordering, but we don't have to compute it).

We start with the following lemma.

Lemma 8.2.7 (a) Given a word $x$, the index $i$ for which $x = x_i$ is computable.

(b) Given a natural number $i$, the word $x_i$ is computable.

Proof. (a) Given a word $x$, it is easy to find the index $j$ for which $x = y_j$. Next, find the first $k \ge j$ for which

$$p(y_1) + \dots + p(y_k) > 1 - p(y_j). \qquad (8.1)$$

Since the left-hand side converges to 1 while the right-hand side is less than 1, this will occur sooner or later.

Clearly each of the remaining words $y_{k+1}, y_{k+2}, \dots$ has probability less than $p(y_j)$, and hence to determine the index of $x = y_j$ it suffices to order the finite set $\{y_1, \dots, y_k\}$ according to decreasing $p$, and find the index of $y_j$ among them.

(b) Given an index $i$, we can compute the indices of $y_1, y_2, \dots$ using (a) and wait until $i$ shows up. □
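The two parts of the proof translate directly into a short program. The following is a minimal sketch on an assumed toy distribution over the hypothetical alphabet $\{a, b\}$, chosen so that the probability order differs from the increasing order; `rank` follows part (a) and `word_of_rank` follows part (b).

```python
from fractions import Fraction
from itertools import count, product

ALPHABET = "ab"  # hypothetical two-letter alphabet

def words():
    """Yield y_1, y_2, ...: all words by increasing length, then alphabetically."""
    for n in count(0):
        for letters in product(ALPHABET, repeat=n):
            yield "".join(letters)

def p_of_index(j):
    """Toy distribution p(y_j): 1/4, 1/2, 1/16, 1/8, ... (positive, sums to 1)."""
    return Fraction(1, 2 ** (j + 1)) if j % 2 else Fraction(1, 2 ** (j - 1))

def rank(x):
    """Part (a): the index i with x = x_i in the order of decreasing probability."""
    total, j = Fraction(0), None
    for k, y in enumerate(words(), start=1):
        total += p_of_index(k)
        if y == x:
            j = k
        # inequality (8.1): every later word has probability below p(y_j)
        if j is not None and total > 1 - p_of_index(j):
            break
    # order y_1..y_k by decreasing p, ties broken by increasing index
    order = sorted(range(1, k + 1), key=lambda t: (-p_of_index(t), t))
    return order.index(j) + 1

def word_of_rank(i):
    """Part (b): compute the ranks of y_1, y_2, ... until i shows up."""
    for y in words():
        if rank(y) == i:
            return y

print([rank(y) for y in ["", "a", "b", "aa"]])  # prints: [2, 1, 4, 3]
print(repr(word_of_rank(3)))                     # prints: 'aa'
```

Both functions terminate only because every word has positive probability; `rank` would loop forever on a string that is not a word over the alphabet.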

Proof of Theorem 8.2.6. The program of the algorithm in Lemma 8.2.7, together with the number $i$, provides a Kolmogorov code $f(x_i)$ for the word $x_i$. We show that this code satisfies the requirements of the theorem. Obviously, $|f(x)| \ge K(x)$. Furthermore, the expected value of $|f(x)| - K(x)$ is

$$\sum_{i=1}^{\infty} p(x_i)\bigl(|f(x_i)| - K(x_i)\bigr).$$

We want to show that this sum is finite. Since its terms are non-negative, it suffices to show that its partial sums remain bounded, i.e., that

$$\sum_{i=1}^{N} p(x_i)\bigl(|f(x_i)| - K(x_i)\bigr) < C$$

for some $C$ independent of $N$. To this end, write

$$\sum_{i=1}^{N} p(x_i)\bigl(|f(x_i)| - K(x_i)\bigr) \qquad (8.2)$$

$$= \sum_{i=1}^{N} p(x_i)\bigl(|f(x_i)| - \log_m i\bigr) + \sum_{i=1}^{N} p(x_i)\bigl(\log_m i - K(x_i)\bigr). \qquad (8.3)$$

We claim that both sums remain bounded. The difference $|f(x_i)| - \log_m i$ is just the length of the program computing $x_i$ without the length of the parameter $i$, and hence it is an absolute constant.

To estimate the second term in (8.2), we use the following simple but useful principle. Let $a_1 \ge a_2 \ge \cdots \ge a_m$ be a decreasing sequence and let $b_1, \dots, b_m$ be an arbitrary sequence of real numbers. Let $b_1^{*} \ge \cdots \ge b_m^{*}$ be the sequence $b$ ordered decreasingly, and let $b_1^{**} \le \cdots \le b_m^{**}$ be the sequence $b$ ordered increasingly. Then

$$\sum_i a_i b_i^{**} \le \sum_i a_i b_i \le \sum_i a_i b_i^{*}.$$

By this principle,

$$\sum_{i=1}^{N} p(x_i) K(x_i) \ge \sum_{i=1}^{N} p(x_i) K(z_i).$$

The number of words $x$ with $K(x) = k$ is at most $m^k$, and hence the number of words $x$ with $K(x) \le k$ is at most $1 + m + \dots + m^k < m^{k+1}$. This is the same as saying that $i \le m^{K(z_i)+1}$, and hence $K(z_i) \ge \log_m i - 1$. Thus

$$\sum_{i=1}^{N} p(x_i)\bigl(\log_m i - K(x_i)\bigr) \le \sum_{i=1}^{N} p(x_i)\bigl(\log_m i - K(z_i)\bigr) \le \sum_{i=1}^{N} p(x_i) \le 1.$$

This proves the theorem. □
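The rearrangement principle used in the proof is easy to check numerically. The sketch below takes one small decreasing sequence `a` and one arbitrary sequence `b` (both made up for illustration) and verifies the two inequalities over every reordering of `b`.

```python
from itertools import permutations

def weighted_sum(a, b):
    """Sum of a_i * b_i."""
    return sum(ai * bi for ai, bi in zip(a, b))

a = [5, 3, 2, 1]  # decreasing sequence a_1 >= ... >= a_m
b = [4, 7, 1, 6]  # arbitrary sequence

low = weighted_sum(a, sorted(b))                 # b ordered increasingly
high = weighted_sum(a, sorted(b, reverse=True))  # b ordered decreasingly

# every reordering of b gives a value between the two extremes
assert all(low <= weighted_sum(a, perm) <= high for perm in permutations(b))
print(low, weighted_sum(a, b), high)  # prints: 36 49 62
```

Pairing the largest weights with the smallest values minimizes the sum, which is why replacing $K(x_i)$ by the increasingly ordered $K(z_i)$ can only decrease it.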

8.3 *Self-delimiting information complexity

The Kolmogorov code, strictly taken, uses an extra symbol besides the alphabet $\Sigma_0$: it recognizes the end of the program while reading the program tape by encountering the symbol “∗”. We can modify the concept in such a way that this should not be possible: the head reading the program should not run beyond the program. We will call a word self-delimiting if, when it is written on the program tape of our two-tape universal Turing machine, the head does not even try to read any cell beyond it. The length of the shortest self-delimiting program printing $x$ will be denoted by $H_T(x)$. This modified information complexity notion was introduced by Levin and Chaitin. It is easy to see that the Invariance Theorem also holds here, and therefore it is again justified to drop the subscript and use the notation $H(x)$. The functions $K$ and $H$ do not differ too much, as shown by the following lemma:

Lemma 8.3.1

$$K(x) \le H(x) \le K(x) + 2\log_m(K(x)) + O(1).$$

Proof. The first inequality is trivial. To prove the second inequality, let $p$ be a program of length $K(x)$ for printing $x$ on some machine $T$. Let $n = |p|$, and let $u_1 \cdots u_k$ be the form of the number $n$ in the base $m$ number system. Let $u = u_1 0 u_2 0 \cdots u_k 0 1 1$. Then the prefix $u$ of the word $up$ can be uniquely reconstructed, and from it, the length of the word can be determined without having to go beyond its end. Using this, it is easy to write a self-delimiting program of length $2k + n + O(1)$ that prints $x$. Since $k \le \log_m n + 1$, this length is at most $K(x) + 2\log_m K(x) + O(1)$. □
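The encoding $u = u_1 0 u_2 0 \cdots u_k 0 1 1$ from the proof can be sketched for the binary case $m = 2$. The function names are illustrative; the decoder reads from an iterator of symbols, so it is visible that it never consumes a symbol beyond the end of the encoded word.

```python
def self_delimiting(p):
    """Prefix p with its length: each binary digit followed by 0, then 11."""
    digits = format(len(p), "b")  # u_1 ... u_k, the length of p in base 2
    header = "".join(d + "0" for d in digits) + "11"
    return header + p

def read_one(stream):
    """Read one encoded word from an iterator, consuming nothing beyond it."""
    digits = []
    while True:
        a, b = next(stream), next(stream)
        if b == "1":  # the pair 11 ends the header (digit pairs end in 0)
            break
        digits.append(a)
    n = int("".join(digits), 2)
    return "".join(next(stream) for _ in range(n))

stream = iter(self_delimiting("10110") + "XYZ")
print(read_one(stream), "".join(stream))  # prints: 10110 XYZ
```

For a word of length $n$ with $k$ binary digits the encoding has length $2k + 2 + n$, in line with the $2k + n + O(1)$ bound.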

From the foregoing, it may seem that the function $H$ is a slight technical variant of the Kolmogorov complexity. The next lemma shows a significant difference between them.

Lemma 8.3.2

(a) $\sum_x m^{-K(x)} = +\infty$.

(b) $\sum_x m^{-H(x)} \le 1$.

Proof. Statement (a) follows easily from Lemma 8.2.1. In order to prove (b), consider an optimal code $f(x)$ for each word $x$. Due to the self-delimiting property, none of these can be a prefix of another one; thus, (b) follows immediately from the simple but important information-theoretical lemma below. □

Lemma 8.3.3 Let $L \subseteq \Sigma_0^*$ be a language such that none of its words is a prefix of another one. Let $m = |\Sigma_0|$. Then

$$\sum_{y \in L} m^{-|y|} \le 1.$$

Proof. Choose letters $a_1, a_2, \dots$ independently, with uniform distribution from the alphabet $\Sigma_0$; stop if the obtained word is in $L$. The probability that we obtained a word $y \in L$ is exactly $m^{-|y|}$ (since according to the assumption, we did not stop on any prefix of $y$). Since these events are mutually exclusive, their probabilities sum to at most 1, and the statement of the lemma follows. □
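The inequality of the lemma can be checked on small examples with exact arithmetic. The languages below are made up for illustration, over the binary alphabet ($m = 2$); the last one is not prefix-free and shows that the assumption is needed.

```python
from fractions import Fraction

def is_prefix_free(words):
    """True if no word of the list is a proper prefix of another one."""
    return not any(u != v and v.startswith(u) for u in words for v in words)

def kraft_sum(words, m):
    """Sum of m^(-|y|) over the words y."""
    return sum(Fraction(1, m ** len(y)) for y in words)

print(is_prefix_free(["0", "10", "110", "111"]), kraft_sum(["0", "10", "110", "111"], 2))
# prints: True 1
print(is_prefix_free(["0", "01", "1"]), kraft_sum(["0", "01", "1"], 2))
# prints: False 5/4
```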

Some interesting consequences of these lemmas are formulated as exercises. The next theorem shows that the function $H(x)$ can be approximated well.

Theorem 8.3.4 (Coding Theorem) Let $p$ be a computable probability distribution on $\Sigma_0^*$. Then for every word $x$ we have $H(x) \le -\log_m p(x) + O(1)$.

Proof. Let us call $m$-ary rational those rational numbers that can be written with a denominator that is a power of $m$. The $m$-ary rational numbers of the interval $[0,1)$ can be written in the form $0.a_1 \dots a_k$ where $0 \le a_i \le m-1$.

Subdivide the interval $[0,1)$ into consecutive left-closed, right-open intervals

$J(x_1), J(x_2), \dots$ of lengths $p(x_1), p(x_2), \dots$ respectively (where $x_1, x_2, \dots$ is a size-ordering of $\Sigma_0^*$). For every $x \in \Sigma_0^*$ with $p(x) > 0$, there will be an $m$-ary rational number $0.a_1 \dots a_k$ with $0.a_1 \dots a_k \in J(x)$ and $0.a_1 \dots a_{k-1} \in J(x)$. We will call a shortest sequence $a_1 \dots a_k$ with this property the Shannon-Fano code of $x$.

We claim that every word $x$ can be computed easily from its Shannon-Fano code. Indeed, for the given sequence $a_1, \dots, a_k$, for values $i = 1, 2, \dots$, we check consecutively whether $0.a_1 \dots a_i$ and $0.a_1 \dots a_{i-1}$ belong to the same interval $J(x)$; if yes, we print $x$ and stop.

Notice that this program is self-delimiting: we need not know in advance how long the code is, and if $a_1 \dots a_k$ is the Shannon-Fano code of a word $x$ then we will never read beyond the end of the sequence $a_1 \dots a_k$. Thus $H(x)$ is not greater than the combined length of the (constant-length) program of the above algorithm and the Shannon-Fano code of $x$; about the latter, it is easy to see that its length is at most $-\log_m p(x) + 1$. □
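The interval construction can be tried out on a small example. The sketch below is an illustration under stated assumptions, not the exact code of the proof: it works in binary ($m = 2$), on a made-up four-word distribution, and it uses the common interval-containment variant in which a code is accepted when the whole set of numbers starting with $0.a_1 \dots a_k$ lies inside $J(x)$. With this condition the decoder can stop as soon as the word is determined, without reading further.

```python
from fractions import Fraction
from itertools import product

# hypothetical toy distribution on the first four words in size order
PROBS = [("", Fraction(2, 5)), ("a", Fraction(3, 10)),
         ("b", Fraction(1, 5)), ("aa", Fraction(1, 10))]

def intervals(probs):
    """Map each word x to its left-closed, right-open interval J(x) = [lo, hi)."""
    out, lo = {}, Fraction(0)
    for x, p in probs:
        out[x] = (lo, lo + p)
        lo += p
    return out

def cell(code):
    """Interval [0.code, 0.code + 2^-k) of numbers whose expansion starts with code."""
    k = len(code)
    lo = Fraction(int(code, 2), 2 ** k)
    return lo, lo + Fraction(1, 2 ** k)

def encode(x, J):
    """Shortest binary sequence whose cell lies inside J(x); needs p(x) > 0."""
    lo, hi = J[x]
    for k in range(1, 64):
        for bits in product("01", repeat=k):
            a, b = cell("".join(bits))
            if lo <= a and b <= hi:
                return "".join(bits)

def decode(stream, J):
    """Read symbols one at a time; stop once the cell fits inside some J(x)."""
    code = ""
    while True:
        code += next(stream)
        a, b = cell(code)
        for x, (lo, hi) in J.items():
            if lo <= a and b <= hi:
                return x

J = intervals(PROBS)
print([encode(x, J) for x, _ in PROBS])  # prints: ['00', '100', '110', '1111']
```

The four codes form a prefix-free set, and `decode(iter("100" + rest), J)` returns `"a"` without consuming any symbol of `rest`.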

This theorem implies that the expected value of the difference between $H(x)$ and $-\log_m p(x)$ is bounded (compare with Theorem 8.2.6).

Corollary 8.3.5 With the conditions of Theorem 8.3.4,

$$\sum_x p(x)\,|H(x) + \log_m p(x)| = O(1).$$

Proof. Writing $|a|^+ = \max(a, 0)$ and $|a|^- = \max(-a, 0)$, we have

$$\sum_x p(x)\,|H(x) + \log_m p(x)| = \sum_x p(x)\,|H(x) + \log_m p(x)|^+ + \sum_x p(x)\,|H(x) + \log_m p(x)|^-.$$

Here, the first sum can be estimated, according to Theorem 8.3.4, as follows:

$$\sum_x p(x)\,|H(x) + \log_m p(x)|^+ \le \sum_x p(x)\,O(1) = O(1).$$

We estimate the second sum as follows:

$$|H(x) + \log_m p(x)|^- \le m^{-H(x) - \log_m p(x)} = \frac{1}{p(x)}\, m^{-H(x)},$$

and hence according to Lemma 8.3.2,

$$\sum_x p(x)\,|H(x) + \log_m p(x)|^- \le \sum_x m^{-H(x)} \le 1.$$

□

Remark 8.3.1 The following generalization of the coding theorem is due to Levin. We say that $p(x)$ is a semicomputable semimeasure over $\Sigma_0^*$ if $p(x) \ge 0$, $\sum_x p(x) \le 1$ and there is a computable function $g(x, n)$ taking rational values such that $g(x, n)$ is monotonically increasing in $n$ and $\lim_{n \to \infty} g(x, n) = p(x)$.

Levin proved the coding theorem for the more general case when $p(x)$ is a semicomputable semimeasure. Lemma 8.3.2 shows that $m^{-H(x)}$ is a semicomputable semimeasure. Therefore Levin's theorem implies that $m^{-H(x)}$ is maximal, to within a multiplicative constant, among all semicomputable semimeasures. This is a technically very useful characterization of $H(x)$.

Let us show that the complexity $H(x)$ can be well approximated for “almost all” strings, where the meaning of “almost all” is given by some probability distribution. But we will not consider arbitrary probability distributions, only those that can be approximated, at least from below, by a computable sequence.

Definition 8.3.1 Let $f(x)$ be a function on strings, taking real number values. We say that $f(x)$ is enumerable if there is a computable function $g(x, n)$ taking rational values such that $g(x, n)$ is monotonically increasing in $n$ and $\lim_{n \to \infty} g(x, n) = f(x)$.

Proof for the semicomputable case:

Proof. If $p(x)$ is enumerable then the set

$$\{(x, k) : 2^{-k} < p(x)\}$$

is recursively enumerable. Let $\{(z_t, k_t) : t = 1, 2, \dots\}$ be a recursive enumeration of this set without repetition. Then

$$\sum_t 2^{-k_t} = \sum_x \sum \{2^{-k_t} : z_t = x\} \le \sum_x 2p(x) < 2.$$

Let us cut off consecutive adjacent, disjoint intervals $I_1, I_2, \dots$, where $I_t$ has length $2^{-k_t-1}$, from the left side of the interval $[0,1]$. For any binary string $w$ consider the interval $J_w$ delimited by the binary “decimals” $0.w$ and $0.w1$. We define a function $F(w)$ as follows. If $J_w$ is the largest such binary subinterval of some $I_t$ then $F(w) = z_t$. Otherwise $F(w)$ is undefined.
