Learning with Algorithmic Information Theory

Proof. By Remark 3.9, $\ln f \in o(t)$. Applying Corollary 3.53 we get

$$E_P\left[E_t^Q - E_t^P\right] \le 2o(t) + 2\sqrt{2\,E_P E_t^P\,o(t)} \le 2o(t) + 2\sqrt{2\,O(t)\,o(t)} \in o(t).$$

3.6 Learning with Algorithmic Information Theory

Algorithmic information theory provides a theoretical framework to apply the probability theory results from the previous sections. In the following we discuss Solomonoff’s famous theory of induction (Section 3.6.1), the speed prior (Section 3.6.2), and learning with a universal compression algorithm (Section 3.6.3).

3.6.1 Solomonoff Induction

Solomonoff (1964, 1978) proposed a theory of learning, also known as universal induction or Solomonoff induction. It encompasses Ockham’s razor by favoring simple explanations over complex ones, and Epicurus’ principle of multiple explanations by never discarding possible explanations. See Rathmanner and Hutter (2011) for a very readable introduction to Solomonoff’s theory and its philosophical motivations and Sterkenburg (2016) for a critique of its optimality.

At the core of this theory is Solomonoff’s distribution $M$, as defined in Example 3.5. Since $M$ dominates all lower semicomputable semimeasures, we get all the merging and prediction results from Section 3.4 and Section 3.5: when drawing a string from any computable measure $P$, $M$ arrives at the correct belief for any hypothesis.

Corollary 3.55 (Strong Merging for Solomonoff Induction). M merges strongly with every computable measure.

Proof. From Proposition 3.16a and Theorem 3.25.

Corollary 3.56 (Expected Prediction Regret for Solomonoff Induction). For all computable measures $P$,

$$E_P\left[E_t^M - E_t^P\right] \le K(P)\ln 4 + \sqrt{2\,E_P E_t^P\,K(P)\ln 16}.$$

Proof. From Corollary 3.44 and $c_P = 2^{-K(P)}$.
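The constants $\ln 4$ and $\ln 16$ can be traced back to $c_P$. The following is only a sketch: it assumes that Corollary 3.44, which is not restated in this section, bounds the regret by $2d + 2\sqrt{2\,E_P E_t^P\,d}$ with $d := -\ln c_P$. The symbol $d$ is introduced here for illustration.

```latex
\begin{align*}
d &:= -\ln c_P = -\ln 2^{-K(P)} = K(P)\ln 2, \\
2d &= 2K(P)\ln 2 = K(P)\ln 4, \\
2\sqrt{2\,E_P E_t^P\,d}
  &= \sqrt{4 \cdot 2\,E_P E_t^P\,K(P)\ln 2}
   = \sqrt{2\,E_P E_t^P\,K(P)\ln 16}.
\end{align*}
```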

Remark 3.57 (Converging Fast and Slow). The convergence of $M$ to a computable $P$ is fast in the sense of Corollary 3.56: $M$ cannot make many more prediction errors than $P$ in expectation. When predicting an infinite computable sequence $x_{1:\infty}$, the total number of prediction errors is bounded by $|p|\,2\ln 2 \approx 1.4|p|$ where $p$ is a program that generates $x_{1:\infty}$ (Example 3.46).

The convergence of $M$ to $P$ is also slow in the sense that $M(x_t \mid x_{<t}) \to 1$ slower than any computable function since $1 - M(x_t \mid x_{<t}) \ge 2^{-\min_{n \ge t} K(n)}$ for all $t$. ◇

The bound from Corollary 3.56 is not optimal. Even if we knew the program $p$ generating the sequence $x_{1:\infty}$, there might be a shorter program $p'$ that computes $x_{1:\infty}$; hence the improved bound $E_\infty^M \le |p'|\,2\ln 2$ also holds. Since Kolmogorov complexity is incomputable, we can’t find the ‘best’ bound algorithmically.
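The flavor of a bound that scales with program length can be seen in a small computable toy. The sketch below is not Solomonoff's $M$: the hypothesis class (periodic bit patterns), the weights $2^{-2|\text{pattern}|}$, and the names `periodic_hypotheses` and `count_errors` are all assumptions made for illustration. In this deterministic toy every mistake at least halves the surviving prior mass, so the number of errors is at most the code length $-\log_2$ of the true hypothesis's weight, which lies within the $2\ln 2$ factor quoted above.

```python
from itertools import product

def periodic_hypotheses(max_len):
    """Toy hypothesis class: each bit pattern p stands for the sequence ppp...

    The weight 2**(-2*len(p)) plays the role of 2**(-|program|) for a
    program of 2*len(p) bits; the weights sum to less than 1.
    """
    hyps = []
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            hyps.append(("".join(bits), 2.0 ** (-2 * n)))
    return hyps

def count_errors(true_pattern, max_len, steps):
    """Predict the heavier next bit under the mixture and count mistakes.

    Assumes len(true_pattern) <= max_len, so the truth is in the class.
    """
    alive = periodic_hypotheses(max_len)
    errors = 0
    for t in range(steps):
        truth = true_pattern[t % len(true_pattern)]
        mass = {"0": 0.0, "1": 0.0}
        for pattern, weight in alive:
            mass[pattern[t % len(pattern)]] += weight
        guess = "1" if mass["1"] > mass["0"] else "0"
        errors += guess != truth
        # Drop hypotheses contradicted by the observed bit.
        alive = [(p, w) for p, w in alive if p[t % len(p)] == truth]
    return errors

pattern = "0110100"
code_length = 2 * len(pattern)  # bits, since the weight is 2**(-code_length)
errors = count_errors(pattern, max_len=8, steps=200)
print(errors, "errors; code length", code_length, "bits")
```

As in the remark, a shorter description of the same sequence would give a smaller bound; the toy only makes this searchable because its hypothesis class is finite.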

Solomonoff induction may even converge on some incomputable measures.

Example 3.58 ($M$ Converges on Some Incomputable Measures). Let $r$ be an incomputable real number. Then the measure $P := \mathrm{Bernoulli}(r)$ is not computable and $M$ is not absolutely continuous with respect to $P$: for

$$A := \left\{ x \in \mathcal{X}^\infty \;\middle|\; \lim_{t \to \infty} \mathrm{ones}(x_{1:t}) = r \right\}$$

we have $P(A) = 1$ but $M(A) = 0$. Since $M \not\gg P$ we get from Theorem 3.27 that $M$ does not merge with $P$. Nevertheless, $M$ still succeeds at prediction because it dominates $\mathrm{Bernoulli}(q)$ for each rational $q$ and the rationals are dense around $r$. According to Lehrer and Smorodinsky (1996, Lem. 3), this implies that $M$ weakly dominates $P$ and by Theorem 3.38 $M$ almost weakly merges with $P$. ◇
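The role of the rational parameters can be illustrated with a small Bayesian mixture. This is a sketch under loud assumptions: a finite grid of 19 rationals stands in for all rationals, a floating-point value stands in for $r$ (any float is computable, unlike the $r$ of the example), and the input is a deterministic sequence with limiting frequency $r$ rather than random samples, to keep the run reproducible. The function name `bernoulli_mixture_prediction` is hypothetical.

```python
import math

def bernoulli_mixture_prediction(bits, grid):
    """Posterior-predictive probability of a 1 under a uniform prior on grid."""
    weights = [1.0 / len(grid)] * len(grid)
    for b in bits:
        # Bayes update with the likelihood of the observed bit.
        weights = [w * (q if b else 1.0 - q) for w, q in zip(weights, grid)]
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize to avoid underflow
    return sum(w * q for w, q in zip(weights, grid))

r = 1.0 / math.sqrt(2.0)  # stand-in only; the example's r is incomputable
# Deterministic 0/1 sequence whose frequency of ones tends to r.
bits = [math.floor((t + 1) * r) - math.floor(t * r) for t in range(2000)]
grid = [k / 20 for k in range(1, 20)]  # rational parameters 0.05, ..., 0.95
prediction = bernoulli_mixture_prediction(bits, grid)
print(prediction, r)
```

With a finite grid the prediction can only get as close to $r$ as the nearest grid point allows; the density of the rationals is what removes this floor in the example.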

The fact that M does not merge strongly with every Bernoulli(r) process is not a failure of Solomonoff’s prior. Ryabko (2010, p. 7) shows that for the class of all Bernoulli measures there is no probability measure that merges strongly with each of them.

The definition of M has only one parameter: the choice of the universal Turing machine. The effect of this choice on the function K can be uniformly bounded by a constant by the invariance theorem (Li and Vitányi, 2008, Thm. 3.1.1). Hence the choice of the UTM changes the prediction regret bound from Corollary 3.56 only by a constant. This constant can be large, preventing any finite-time guarantees that are independent of the UTM. However, asymptotically Solomonoff induction succeeds even for terrible choices of the UTM.
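Written out for the bound of Corollary 3.56, the dependence on the machine looks as follows. The notation is introduced here for illustration only: $K_U$ and $K_V$ denote complexity relative to two universal machines $U$ and $V$, and $c_{UV}$ is the constant supplied by the invariance theorem, which depends on $U$ and $V$ but not on $P$.

```latex
\begin{align*}
|K_U(P) - K_V(P)| &\le c_{UV}, \\
K_U(P)\ln 4 + \sqrt{2\,E_P E_t^P\,K_U(P)\ln 16}
  &\le \bigl(K_V(P) + c_{UV}\bigr)\ln 4
     + \sqrt{2\,E_P E_t^P\,\bigl(K_V(P) + c_{UV}\bigr)\ln 16}.
\end{align*}
```

The term $E_P E_t^P$ refers to the errors of the informed predictor and does not involve the machine, so only the complexity term shifts.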

The Solomonoff normalization $M_{\mathrm{norm}}$ of $M$ is defined according to Definition 2.16. While $M_{\mathrm{norm}}$ dominates $M$ according to Lemma 2.17 and thus every lower semicomputable semimeasure, in some respects, $M_{\mathrm{norm}}$ behaves a little differently from $M$.

Another way to complete the semimeasure $M$ into a measure is given in the following example.

Example 3.59 (The Measure Mixture; Gács, 1983, p. 74). The measure mixture $\overline{M}$ is defined as

$$\overline{M}(x) := \lim_{n \to \infty} \sum_{y \in \mathcal{X}^n} M(xy). \tag{3.6}$$

It is the same as $M$ except that the contributions by programs that do not produce infinite strings are removed: for any such program $p$, let $k$ denote the length of the finite string generated by $p$. Then for $|xy| > k$, the program $p$ does not contribute to $M(xy)$, hence it is excluded from $\overline{M}(x)$.
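The effect of definition (3.6) can be checked on a toy semimeasure. The sketch below is an illustration under assumptions, not Solomonoff's $M$ (which is incomputable): the four "programs", their weights, and the names `PROGRAMS`, `output_prefix`, `M` and `measure_mixture` are hypothetical. Once $n$ exceeds the length of every finite output, the sum over continuations retains only the programs with infinite output.

```python
from itertools import product

# Hypothetical toy "programs": (weight, output, infinite?).
# An infinite program repeats its output forever; a finite one stops after it.
PROGRAMS = [
    (0.25, "01", True),     # prints 010101...
    (0.25, "1", True),      # prints 1111...
    (0.25, "0110", False),  # prints 0110 and then nothing more
    (0.125, "0", False),    # prints 0 and then nothing more
]

def output_prefix(output, infinite, n):
    """First n symbols of the program's output (fewer if it stops earlier)."""
    if infinite:
        return (output * (n // len(output) + 1))[:n]
    return output[:n]

def M(x):
    """Toy semimeasure: total weight of programs whose output starts with x."""
    return sum(w for w, out, inf in PROGRAMS
               if output_prefix(out, inf, len(x)) == x)

def measure_mixture(x, n):
    """Finite-n version of (3.6): sum of M(xy) over all y of length n."""
    return sum(M(x + "".join(y)) for y in product("01", repeat=n))

for x in ["", "0", "01"]:
    print(repr(x), M(x), measure_mixture(x, 6))
```

In this toy the value assigned to the empty string drops from 0.875 under `M` to 0.5 under `measure_mixture`, which is exactly the weight of the two infinite programs.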

Similarly to $M$, the measure mixture $\overline{M}$ is not a (probability) measure since $\overline{M}(\epsilon) < 1$; but in this case normalization (2.2) is just multiplication with the constant $1/\overline{M}(\epsilon)$,


Even though $M$ merges strongly with any computable measure $P$ with $P$-probability 1, Lattimore and Hutter (2013, 2015) show that generally it does not hold for all Martin-Löf random sequences (which also form a set of $P$-probability 1). Hutter and Muchnik (2007, Thm. 6) construct non-universal lower semicomputable semimeasures that have this convergence property for all $P$-Martin-Löf random sequences. For infinite nonrandom sequences whose bits are selectively predicted by some total recursive function, Lattimore et al. (2011, Thm. 10) show that the normalized Solomonoff measure $M_{\mathrm{norm}}$ converges to 1 on the selected bits. This does not hold for the unnormalized measure $M$ (Lattimore et al., 2011, Thm. 12).

3.6.2 The Speed Prior

Solomonoff’s prior $M$ is incomputable (Theorem 6.3); a computable alternative is the speed prior from Example 3.11. In this section we state merging and prediction results for $S_{Kt}$, a speed prior introduced by Filan et al. (2016) and formally defined in Example 3.11.

It is slightly different from the speed prior defined by Schmidhuber (2002), but for the latter no compatibility properties are known for nondeterministic measures.

Definition 3.60 (Estimable in Polynomial Time). A function $f : \mathcal{X}^* \to \mathbb{R}$ is estimable in polynomial time iff there is a function $g : \mathcal{X}^* \to \mathbb{R}$ computable in polynomial time such that $f \overset{\times}{=} g$.

For a measure $P$ estimable in polynomial time the speed prior $S_{Kt}$ dominates $P$ with coefficients polynomial in $|x| - \log P(x)$ (Filan et al., 2016, Eq. 12). Thus $S_{Kt}$ weakly dominates $P$ and we get the following results.

Corollary 3.61 (Almost Weak Merging for $S_{Kt}$). $S_{Kt}$ almost weakly merges with every measure estimable in polynomial time.

Proof. From Theorem 3.38 and Filan et al. (2016, Eq. 12) since $\log P$ does not grow superexponentially $P$-almost surely.

Corollary 3.62 (Expected Prediction Regret for $S_{Kt}$; Filan et al., 2016, Thm. 9). For all measures $P$ estimable in polynomial time,

$$E_P\left[E_n^{S_{Kt}} - E_n^P\right] \in O\left(\log n + \sqrt{E_P E_\infty^P \log n}\right).$$

Proof. From Corollary 3.44 and Filan et al. (2016, Eq. 14).

3.6.3 Universal Compression

Solomonoff’s distribution can be approximated using a standard compression algorithm, motivated by the similarity $M(x) \approx 2^{-Km(x)}$, where $Km$ denotes monotone Kolmogorov complexity. The function $Km$ is a universal compressor, compressing at least as well as any other recursively enumerable program.

Gács (1983) shows that the similarity $M \approx 2^{-Km}$ is not an equality. However, the difference between $-\log M$ and $Km$ is very small: the best known lower bound is due to Day (2011) who shows that $Km(x) > -\log M(x) + O(\log\log|x|)$ for infinitely many $x \in \mathcal{X}^*$.

Nevertheless, $2^{-Km}$ dominates every computable measure (Li and Vitányi, 2008, Thm. 4.5.4 and Lem. 4.5.6ii(d); originally proved by Levin, 1973). Hence all the strong results that hold for Solomonoff induction (prediction regret and strong merging) also hold for compression: we apply Theorem 3.25 and Corollary 3.44 to get the following results. See Hutter (2006a) for further discussion on using the universal compressor $Km$ for learning.

Corollary 3.63 (Strong Merging for Universal Compression). The distribution $2^{-Km(x)}$ merges strongly with every computable measure.

Corollary 3.64 (Expected Prediction Regret for Universal Compression). For $Q(x) := 2^{-Km(x)}$ and for all computable measures $P$ there is a constant $c_P$ such that

$$E_P\left[E_t^Q - E_t^P\right] \le c_P + \sqrt{c_P\,E_P E_t^P}.$$

This provides a theoretical basis for viewing compression as a general purpose learning algorithm. In this spirit, the Hutter prize is awarded for the compression of a 100MB excerpt from the English Wikipedia (Hutter, 2006c).

Practical compression algorithms (such as the algorithm by Ziv and Lempel (1977) used in gzip) are not universal. Hence they do not dominate every computable distribution. As with the speed prior, what matters is the rate at which $Y_t = Q(x_{1:t})/P(x_{1:t})$ goes to 0, i.e., does the compressor weakly dominate the true distribution in the sense of Definition 3.8?
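One rough way to look at this rate empirically is to read a compressor's output length as a code length. The sketch below makes several assumptions that are not from the text: it takes $Q(x) := 2^{-\ell(x)}$ with $\ell(x)$ the zlib output length in bits (an unnormalized quantity, not claimed to be a semimeasure), and it uses the deterministic source that emits `abab...` with probability 1, so that $-\log_2 Y_t$ equals the code length. The helper names are hypothetical.

```python
import zlib

def code_length_bits(data: bytes) -> int:
    """Length in bits of the zlib (DEFLATE, an LZ77 variant) encoding of data."""
    return 8 * len(zlib.compress(data, 9))

def bits_per_symbol(t: int) -> float:
    """-log2(Y_t) / t for the source that emits 'abab...' with probability 1.

    Here P(x_{1:t}) = 1, so -log2(Y_t) is just the code length when Q(x)
    is taken to be 2**(-code_length_bits(x)).
    """
    x = (b"ab" * t)[:t]
    return code_length_bits(x) / t

for t in (100, 1000, 10000):
    print(t, bits_per_symbol(t))
```

A per-symbol cost that shrinks with $t$ on one easy source says nothing about other sources; it only shows how the ratio $Y_t$ could be measured for a concrete compressor.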

Veness et al. (2015) successfully apply the Lempel-Ziv compression algorithm as a learning algorithm for reinforcement learning; however, some preprocessing of the data is required. More remotely, Vitányi et al. (2009) use standard compression algorithms to classify mammal genomes, languages, and classical music.

In document Nonparametric General Reinforcement Learning (Page 59-62)