Proof. By Remark 3.9, $\ln f \in o(t)$. Applying Corollary 3.53 we get
\[ \mathbb{E}_P\big[E_n^Q - E_n^P\big] \leq 2o(t) + 2\sqrt{2\,\mathbb{E}_P E_n^P\, o(t)} \leq 2o(t) + 2\sqrt{2\,O(t)\,o(t)} \in o(t). \]
3.6 Learning with Algorithmic Information Theory
Algorithmic information theory provides a theoretical framework to apply the probability theory results from the previous sections. In the following we discuss Solomonoff’s famous theory of induction (Section 3.6.1), the speed prior (Section 3.6.2), and learning with a universal compression algorithm (Section 3.6.3).
3.6.1 Solomonoff Induction
Solomonoff (1964, 1978) proposed a theory of learning, also known as universal induction or Solomonoff induction. It encompasses Ockham’s razor by favoring simple explanations over complex ones, and Epicurus’ principle of multiple explanations by never discarding possible explanations. See Rathmanner and Hutter (2011) for a very readable introduction to Solomonoff’s theory and its philosophical motivations and Sterkenburg (2016) for a critique of its optimality.
At the core of this theory is Solomonoff’s distribution M, as defined in Example 3.5. Since M dominates all lower semicomputable semimeasures, we get all the merging and prediction results from Section 3.4 and Section 3.5: when drawing a string from any computable measure P, M arrives at the correct belief for any hypothesis.
Corollary 3.55 (Strong Merging for Solomonoff Induction). M merges strongly with every computable measure.
Proof. From Proposition 3.16a and Theorem 3.25.
Corollary 3.56 (Expected Prediction Regret for Solomonoff Induction). For all computable measures P,
\[ \mathbb{E}_P\big[E_t^M - E_t^P\big] \leq K(P) \ln 4 + \sqrt{2\,\mathbb{E}_P E_t^P\, K(P) \ln 16}. \]
Proof. From Corollary 3.44 and cP = 2−K(P).
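As a rough numerical illustration of Corollary 3.56, the following Python sketch evaluates the right-hand side of the bound. Since K(P) is incomputable, the argument `k_p` is a hypothetical stand-in (for example, the length in bits of some known program for P); the function name and the example values are assumptions made purely for illustration.

```python
import math


def solomonoff_regret_bound(k_p: float, expected_errors_p: float) -> float:
    """Right-hand side of Corollary 3.56.

    k_p               -- assumed value (in bits) standing in for K(P)
    expected_errors_p -- E_P[E_t^P], the expected number of errors of the
                         predictor that knows P, up to time t
    """
    linear_term = k_p * math.log(4)
    sqrt_term = math.sqrt(2 * expected_errors_p * k_p * math.log(16))
    return linear_term + sqrt_term


# Deterministic environment: the informed predictor makes no errors,
# so only the term K(P) ln 4 = |p| * 2 ln 2 remains.
print(solomonoff_regret_bound(100, 0.0))

# Noisy environment: the square-root term grows with the informed
# predictor's own expected error count.
print(solomonoff_regret_bound(100, 10_000.0))
```

With `expected_errors_p = 0` the bound reduces to roughly 1.39 errors per bit of description length, which is the deterministic case discussed in Remark 3.57.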
Remark 3.57 (Converging Fast and Slow). The convergence of M to a computable P is fast in the sense of Corollary 3.56: M cannot make many more prediction errors than P in expectation. When predicting an infinite computable sequence $x_{1:\infty}$, the total number of prediction errors is bounded by $|p| \, 2 \ln 2 \approx 1.4|p|$, where p is a program that generates $x_{1:\infty}$ (Example 3.46).
The convergence of M to P is also slow in the sense that $M(x_t \mid x_{<t}) \to 1$ slower than any computable function, since $1 - M(x_t \mid x_{<t}) \stackrel{\times}{\geq} 2^{-\min_{n \geq t} K(n)}$ for all t. ◇
The bound from Corollary 3.56 is not optimal. Even if we knew the program p generating the sequence $x_{1:\infty}$, there might be a shorter program p′ that computes $x_{1:\infty}$; hence the improved bound $E_\infty^M \leq |p'| \, 2 \ln 2$ also holds. Since Kolmogorov complexity is incomputable, we can’t find the ‘best’ bound algorithmically.
Solomonoff induction may even converge on some incomputable measures.
Example 3.58 (M Converges on Some Incomputable Measures). Let r be an incomputable real number. Then the measure P := Bernoulli(r) is not computable and P is not absolutely continuous with respect to M: for
\[ A := \Big\{ x \in \mathcal{X}^\infty \;\Big|\; \lim_{t \to \infty} \mathrm{ones}(x_{1:t}) = r \Big\} \]
we have P(A) = 1 but M(A) = 0. Since P is not absolutely continuous with respect to M, we get from Theorem 3.27 that M does not merge strongly with P. Nevertheless, M still succeeds at prediction because it dominates Bernoulli(q) for each rational q and the rationals are dense around r. According to Lehrer and Smorodinsky (1996, Lem. 3), this implies that M weakly dominates P and by Theorem 3.38 M almost weakly merges with P. ◇
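The density argument in Example 3.58 can be illustrated with a small Bayesian mixture over Bernoulli(q) for rational q. This is only a toy sketch under loud assumptions: the mixture ranges over finitely many rationals rather than all lower semicomputable semimeasures, the prior is a hypothetical description-length weighting standing in for $2^{-K(q)}$, and the true parameter $r = 1/\sqrt{2}$ is irrational but computable, since an incomputable r cannot be simulated. All function names are made up for this illustration.

```python
import math
import random
from fractions import Fraction


def rational_grid(max_denominator):
    """All rationals q in (0, 1) with denominator at most max_denominator."""
    return sorted({Fraction(a, b)
                   for b in range(2, max_denominator + 1)
                   for a in range(1, b)})


def description_length_prior(params):
    """Hypothetical stand-in for 2^{-K(q)}: charge 2 * bit_length(denominator) bits.

    Returns unnormalized log-weights (natural logarithm).
    """
    return [-2 * q.denominator.bit_length() * math.log(2) for q in params]


def mixture_predictions(bits, params, log_prior):
    """Yield the mixture's probability that the next bit is 1, before each bit is seen."""
    log_w = list(log_prior)
    for bit in bits:
        m = max(log_w)  # subtract the maximum for numerical stability
        w = [math.exp(lw - m) for lw in log_w]
        total = sum(w)
        yield sum(wi * float(q) for wi, q in zip(w, params)) / total
        # Bayesian update: multiply each weight by the likelihood of the observed bit.
        log_w = [lw + math.log(float(q) if bit else 1.0 - float(q))
                 for lw, q in zip(log_w, params)]


# Demo: sample from Bernoulli(r) with an irrational (but computable) r.
random.seed(0)
r = 1 / math.sqrt(2)
sample = [1 if random.random() < r else 0 for _ in range(5000)]
params = rational_grid(20)
preds = list(mixture_predictions(sample, params, description_length_prior(params)))
print("true r:", r, "final prediction:", preds[-1])
```

Although no Bernoulli(q) in the class equals the true measure, nearby rationals receive most of the posterior weight, so the predictive probability should end up close to r. This mirrors, in a very restricted setting, why M can predict well without merging strongly.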
The fact that M does not merge strongly with every Bernoulli(r) process is not a failure of Solomonoff’s prior. Ryabko (2010, p. 7) shows that for the class of all Bernoulli measures there is no probability measure that merges strongly with each of them.
The definition of M has only one parameter: the choice of the universal Turing machine. The effect of this choice on the function K can be uniformly bounded by a constant by the invariance theorem (Li and Vitányi, 2008, Thm. 3.1.1). Hence the choice of the UTM changes the prediction regret bound from Corollary 3.56 only by a constant. This constant can be large, preventing any finite-time guarantees that are independent of the UTM. However, asymptotically Solomonoff induction succeeds even for terrible choices of the UTM.
The Solomonoff normalization $M_{\mathrm{norm}}$ of M is defined according to Definition 2.16. While $M_{\mathrm{norm}}$ dominates M according to Lemma 2.17 and thus every lower semicomputable semimeasure, in some respects, $M_{\mathrm{norm}}$ behaves a little differently from M.
Another way to complete the semimeasure M into a measure is given in the following example.
Example 3.59 (The Measure Mixture; Gács, 1983, p. 74). The measure mixture $\overline{M}$ is defined as
\[ \overline{M}(x) := \lim_{n \to \infty} \sum_{y \in \mathcal{X}^n} M(xy). \tag{3.6} \]
It is the same as M except that the contributions by programs that do not produce infinite strings are removed: for any such program p, let k denote the length of the finite string generated by p. Then for $|xy| > k$, the program p does not contribute to M(xy), hence it is excluded from $\overline{M}(x)$.
Similarly to M, the measure mixture $\overline{M}$ is not a (probability) measure since $\overline{M}(\epsilon) < 1$; but in this case normalization (2.2) is just multiplication with the constant $1/\overline{M}(\epsilon)$,
Even though M merges strongly with any computable measure P with P-probability 1, Lattimore and Hutter (2013, 2015) show that this generally does not hold for all Martin-Löf random sequences (which also form a set of P-probability 1). Hutter and Muchnik (2007, Thm. 6) construct non-universal lower semicomputable semimeasures that have this convergence property for all P-Martin-Löf random sequences. For infinite nonrandom sequences whose bits are selectively predicted by some total recursive function, Lattimore et al. (2011, Thm. 10) show that the normalized Solomonoff measure $M_{\mathrm{norm}}$ converges to 1 on the selected bits. This does not hold for the unnormalized measure M (Lattimore et al., 2011, Thm. 12).
3.6.2 The Speed Prior
Solomonoff’s prior M is incomputable (Theorem 6.3); a computable alternative is the speed prior from Example 3.11. In this section we state merging and prediction results for $S_{Kt}$, a speed prior introduced by Filan et al. (2016) and formally defined in Example 3.11. It is slightly different from the speed prior defined by Schmidhuber (2002), but for the latter no compatibility properties are known for nondeterministic measures.
Definition 3.60 (Estimable in Polynomial Time). A function $f : \mathcal{X}^* \to \mathbb{R}$ is estimable in polynomial time iff there is a function $g : \mathcal{X}^* \to \mathbb{R}$ computable in polynomial time such that $f \stackrel{\times}{=} g$.
For a measure P estimable in polynomial time the speed prior $S_{Kt}$ dominates P with coefficients polynomial in $|x| - \log P(x)$ (Filan et al., 2016, Eq. 12). Thus $S_{Kt}$ weakly dominates P and we get the following results.
Corollary 3.61 (Almost Weak Merging for $S_{Kt}$). $S_{Kt}$ almost weakly merges with every measure estimable in polynomial time.
Proof. From Theorem 3.38 and Filan et al. (2016, Eq. 12) since $\log P$ does not grow superexponentially P-almost surely.
Corollary 3.62 (Expected Prediction Regret for $S_{Kt}$; Filan et al., 2016, Thm. 9). For all measures P estimable in polynomial time,
\[ \mathbb{E}_P\big[E_n^{S_{Kt}} - E_n^P\big] \in O\Big(\log n + \sqrt{\mathbb{E}_P E_\infty^P \log n}\Big). \]
Proof. From Corollary 3.44 and Filan et al. (2016, Eq. 14).
3.6.3 Universal Compression
Solomonoff’s distribution can be approximated using a standard compression algorithm, motivated by the similarity $M(x) \approx 2^{-Km(x)}$, where Km denotes monotone Kolmogorov complexity. The function Km is a universal compressor, compressing at least as well as any other recursively enumerable program.
Gács (1983) shows that the similarity $M \approx 2^{-Km}$ is not an equality. However, the difference between $-\log M$ and Km is very small: the best known lower bound is due to Day (2011) who shows that $Km(x) > -\log M(x) + O(\log \log |x|)$ for infinitely many $x \in \mathcal{X}^*$.
Nevertheless, $2^{-Km}$ dominates every computable measure (Li and Vitányi, 2008, Thm. 4.5.4 and Lem. 4.5.6ii(d); originally proved by Levin, 1973). Hence all the strong results that hold for Solomonoff induction (prediction regret and strong merging) also hold for compression: we apply Theorem 3.25 and Corollary 3.44 to get the following results. See Hutter (2006a) for further discussion on using the universal compressor Km for learning.
Corollary 3.63 (Strong Merging for Universal Compression). The distribution $2^{-Km(x)}$ merges strongly with every computable measure.
Corollary 3.64 (Expected Prediction Regret for Universal Compression). For $Q(x) := 2^{-Km(x)}$ and for all computable measures P there is a constant $c_P$ such that
\[ \mathbb{E}_P\big[E_t^Q - E_t^P\big] \leq c_P + \sqrt{c_P\, \mathbb{E}_P E_t^P}. \]
This provides a theoretical basis for viewing compression as a general purpose learning algorithm. In this spirit, the Hutter prize is awarded for the compression of a 100MB excerpt from the English Wikipedia (Hutter, 2006c).
Practical compression algorithms (such as the algorithm by Ziv and Lempel (1977) used in gzip) are not universal. Hence they do not dominate every computable distribution. As with the speed prior, what matters is the rate at which $Y_t = Q(x_{1:t})/P(x_{1:t})$ goes to 0, i.e., whether the compressor weakly dominates the true distribution in the sense of Definition 3.8.
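To make the idea of a practical compressor as a predictor concrete, here is a minimal Python sketch. It assumes that we interpret the compressed length $\ell(x)$ in bits as a code length and set $Q(x) := 2^{-\ell(x)}$; this interpretation, the use of Python's `zlib` (a DEFLATE implementation based on Lempel-Ziv matching), and the helper names are illustrative assumptions, not a construction from the text. In particular, this Q is not claimed to be a semimeasure, and nothing here shows that it weakly dominates any particular distribution.

```python
import zlib


def code_length_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed data (maximum compression level)."""
    return 8 * len(zlib.compress(data, 9))


def predict_next(history: bytes, alphabet: bytes) -> int:
    """Return the symbol whose continuation of `history` compresses best.

    Under the assumed reading Q(x) = 2^(-code length), this picks the symbol
    with the largest Q(history + symbol). Ties go to the symbol listed first
    in `alphabet`.
    """
    return min(alphabet, key=lambda s: code_length_bits(history + bytes([s])))


# Usage: a highly regular history; the compressor favors a continuation
# that extends an earlier match over one that breaks the pattern.
history = b"ab" * 200
print(chr(predict_next(history, b"ab")))
```

Because compressed lengths are rounded to whole bytes, many continuations tie, which is one concrete way in which such a compressor falls short of the universal compressor Km.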
Veness et al. (2015) successfully apply the Lempel-Ziv compression algorithm as a learning algorithm for reinforcement learning; however, some preprocessing of the data is required. More remotely, Vitányi et al. (2009) use standard compression algorithms to classify mammal genomes, languages, and classical music.