On calibration error of randomized forecasting algorithms

(1)

Contents lists available atScienceDirect

Theoretical Computer Science

journal homepage:www.elsevier.com/locate/tcs

Vladimir V. V’yugin

Institute for Information Transmission Problems, Russian Academy of Sciences, Bol’shoi Karetnyi per. 19, Moscow GSP-4, 127994, Russia

a r t i c l e i n f o Keywords: Machine learning Universal prediction Randomized prediction Algorithmic prediction Calibration Randomized rounding a b s t r a c t

It has been recently shown that calibration with an error less than∆ > 0 is almost surely guaranteed with a randomized forecasting algorithm, where forecasts are obtained by random rounding the deterministic forecasts up to∆. We show that this error cannot be improved for a vast majority of sequences: we prove that, using a probabilistic algorithm, we can effectively generate with probability close to one a sequence ‘‘resistant’’ to any randomized rounding forecasting with an error much smaller than∆. We also reformulate this result by means of a probabilistic game.

1. Introduction

A minimal requirement for testing any prediction algorithm is that it should be calibrated (see Dawid [1]). An informal explanation of calibration can be given as follows: Let a binary sequence

ω

1

, ω

2

, . . . , ω

n−1of outcomes be observed by a

forecaster whose task is to give a probabilitypnof a future event

ω

n

=

1. In a typical examplepnis interpreted as a probability

that it will rain. Forecaster is said to be well-calibrated if it rains as often as he leads us to expect. It should rain about 80% of the days for whichpn

=

0

.

8, and so on. For simplicity, we consider binary sequences, i.e.

ω

n

∈ {

0

,

1

}

for alln. We give a

rigorous definition of calibration later.

If the weather acts adversarially, then Oakes [11] and Dawid [2] show that any deterministic forecasting algorithm will not always be calibrated. V’yugin [20] presented a computable version of this result: he proved that no deterministic algorithm can be calibrated for a vast majority of sequences generated by some probabilistic algorithm.

Foster and Vohra [4] show that calibration is almost surely guaranteed with a randomizing forecasting rule, i.e., where the forecasts are chosen using internal randomization and the forecasts are hidden from the weather until weather makes its decision whether to rain or not.

Kakade and Foster [7] presented ‘‘an almost deterministic’’randomized roundinguniversal forecasting algorithmf. For any sequence of outcomes and for any precision of rounding∆

>

0 an observer can simply randomly round the deterministic forecast up to∆in order to calibrate for this sequence with calibration error less than∆.

The goal of this paper is to complement the result of Kakade and Foster with a lower bound. We prove inTheorems 1 and2that the upper bound∆cannot be improved toc∆forc

<

0

.

25. We also show that when the setting ofTheorem 2is slightly modified calibration becomes impossible.

In Section2 we discuss some approaches to universal prediction of individual sequences in statistics and machine learning. In particular, we give Dawid’s definition of calibration and discuss its generalizations.

In Section3we give the definition of randomized forecasting and randomized rounding and formulate Kakade and Foster’s result; the construction of the universal randomized rounding algorithm is given inAppendix.

In Section4we consider some probabilistic version of the Oakes’ example for randomized rounding algorithms. We present this example in the form of a game between Realty, Forecaster and Skeptic in the manner of Shafer and Vovk’s book [14]. We show that Realty and Skeptic win with probability one if Forecaster randomly rounds deterministic forecasts up to a given positive precision level.

E-mail address:[email protected].

(2)

In Section5the asymptotic calibration is considered in the algorithmic setting. We present a uniform lower bound of calibration error for all computable randomized forecasting schemes. We also show that this uniform lower bound is valid for a vast majority of sequences generated by some probabilistic algorithm. The main result of this paper —Theorem 2, shows that given∆

>

0 it is possible, using the probabilistic algorithm, to effectively generate with probability close to one a sequence ‘‘resistant’’ to any randomized rounding forecasting with an error much smaller than∆. The proof of the main result is given in Section6.Theorem 3shows that when the setting ofTheorem 2is slightly modified the randomized forecasting with calibration error less than 0

.

25 becomes impossible.Theorem 4shows that for a vast majority of sequences the calibration error can be much bigger, namely

≥

0

.

25, irrespective of the precision of rounding we use when forecasts are produced by a deterministic forecasting algorithm.

This paper is an extended version of the conference paper [21]. 2. Prediction of individual sequences

Predicting sequences is the key problem of machine learning and statistics. The learning process proceeds as follows: observing a finite-state sequence given on-line a forecaster assigns subjective probabilities to future states. The method of evaluation of these forecasts depends on an underlying learning approach.

According to the classical approach of statistics, we suppose that the observed sequence is generated by some source — a finite-state stochastic process governed by an unknown (to the forecaster) probability distribution. Asymptotic accuracy of forecasts is achieved when a forecast merges to the objective distribution. In the classical statistical theory of sequential prediction, the sequence of outcomes is assumed to be a realization of a stationary stochastic process. In this case statistical properties of the process based on past observations can be estimated and using this estimation efficient prediction strategies can be constructed. In this case the performance of a prediction strategy is usually evaluated by the expected value of some loss function which measures the distance between the predicted value and the true outcome (see, for example, [10]).

In the case of an arbitrary source distribution, the existence of a universal forecasting scheme was proved in the algorithmic information theory.

Solomonoff [16,17] and Levin [23] have defined a universal priorM

(

x

)

as the probability that the output of a universal Turing machine starts with a sequencexwhen provided with fair coin flips on the input tape. Assume now that the sequences

xare generated by a probability distributionP, i.e. the probability of a sequence starting fromxisP

(

x

) >

0. Then the probability of observing

ω

n

=

1 after observations

ω

1

. . . ω

n−1is

P

(

1

|

ω

₁

. . . ω

_n−1

)

=

P

(ω

1

. . . ω

n−11

)/

P

(ω

1

. . . ω

n−1

).

Solomonoff [17] proved thatP-almost surely the universal posterior

M

(

1

|

ω

₁

. . . ω

_n−1

)

=

M

(ω

1

. . . ω

n−11

)/

M

(ω

1

. . . ω

n−1

)

converges toP

(

1

|

ω

₁

. . . ω

_n−1

)

asn

→ ∞

if the measurePis computable. Solomonoff’s result was further developed by

Hutter [5] and Chernov et al. [6].

Hence,M is a valid predictor in the case of unknown computableP. Unfortunately, the functionM

(

1

|

ω

₁

. . . ω

_n−1

)

is

non-computable, so, it can be helpful only for a theoretical analysis.

According to a different viewpoint, we abandon the assumption that the outcomes are generated by a well-behaved stochastic process and consider the sequence of outcomes as produced by some unspecified mechanism, which could be deterministic, stochastic or adversarially adaptive to our prediction method. This setup where no probabilistic assumption is made on how the sequence is generated is often referred as prediction of individual sequences.

At the same time, without a probabilitic model it is not obvious how to measure the performance of the prediction algorithm. A solution of this problem was proposed by Dawid [1], whoseprequential principlesays that our evaluation of the accuracy of the forecasts should not depend on any model of generating data. Dawid started from an observation that in reality, we have only individual sequence

ω

1

, ω

2

, . . . , ω

n

, . . .

of outcomes and that the corresponding forecasts

p1

,

p2

, . . . ,

pn

, . . .

whose testing is considered may fall short of defining a full probability distribution on the whole set

of all sequences. The prequential principle says that the evaluation of a probability forecaster should depend only on his actual probability forecasts and the corresponding outcomes.

The notion of calibration, originated by Dawid [1,2], checks whether the observed empirical frequencies of state occurrences converge to their forecaster probabilities. The notion of calibration complies with the prequential principle.

LetI

(

p

)

denote the characteristic function of a subintervalI

⊆ [

0

,

1

]

, i.e.,I

(

p

)

=

1 ifp

∈

I, andI

(

p

)

=

0, otherwise. An infinite sequence of forecasts p1

,

p2

, . . .

iswell-calibrated for an infinite sequence of outcomes

ω

1

ω

2

. . .

if for the

characteristic functionI

(

p

)

of any subinterval of

[

0

,

1

]

the calibration errortends to zero, i.e.,

n

P

i=1 I

(

pi

)(ω

i

−

pi

)

n

P

i=1 I

(

pi

)

→

0 (1)

(3)

The indicator functionI

(

pi

)

determines some ‘‘selection rule’’ which selects indicesiwhere we compute the deviation

between forecastspiand outcomes

ω

i. The most general notion of selection rule was considered in Vovk and Shafer [18]: a

selection rule is a functionF

(

p1

, ω

1

, . . . ,

pi−1

, ω

i−1

,

pi

)

with range

{

0

,

1

}

, where

ω

i

∈ {

0

,

1

}

andpi

∈ [

0

,

1

]

,i

=

1

,

2

, . . .

.

The main problem of sequential forecasting is to define a universal forecasting algorithm which computes forecasts

pn given past observations

ω

1

, . . . , ω

n−1 for each n. This universal prediction algorithm should be well-calibrated for

each infinite sequence of outcomes. Oakes [11] proposed arguments (see Dawid [3] for a different proof) that no such algorithm can be well-calibrated for all possible sequences: any forecasting algorithm cannot be calibrated for the sequence

ω

=

ω

₁

ω

₂

. . .

, where

ω

i

=

1 ifpi

<

0

.

5

0 otherwise

andpi are forecasts computed by the algorithm given

ω

1

, . . . , ω

i−1,i

=

1

,

2

, . . .

. The corresponding intervals areI0

=

[

0

,

0

.

5

)

andI1

= [

0

.

5

,

1

]

. It is easy to see that the condition (1) of calibration fails for this

ω

, whereI

=

I0orI

=

I1.

Theorem 4given below in Section5presents an effective method for generating a vast majority of sequences possessing this property. It says that using coin-tossing and a transducer algorithm, we can generate with probability close to one an infinite sequence

ω

such that each forecasting algorithm cannot be calibrated for

ω

.

Foster and Vohra [4] show that calibration is almost surely guaranteed with a randomizing forecasting rule, i.e., where the forecasts are chosen using internal randomization. Kakade and Foster [7] noticed that some calibration results require very little randomization. They defined ‘‘an almost deterministic’’randomized roundinguniversal forecasting algorithmf: an observer can only randomly round the deterministic forecast in order to calibrate for each sequence of outcomes regardless of the nature of the source generating it. This approach was further developed by, among others, Lehrer [8], Sandrony et al. [13]. These papers were only concerned with asymptotic calibration. These authors asked only that that the entire sequence of forecasts and its certain subsequences be properly calibrated.

Non-asymptotic version of randomized forecasting was proposed by Vovk and Shafer [18] and by Vovk et al. [19]. This approach is based on the game-theoretic framework of Shafer and Vovk [14]. The main requirement of this approach that the forecasts resist any betting strategy can be interpreted by saying that they must pass all statistical tests, not only tests of calibration.

We discuss details of the randomized forecasting algorithms in Section3and inAppendix. The game-theoretic framework is partially used in Section4. In this section we also develop a probabilistic version of the Oakes’ example for the randomized rounding algorithms and present the corresponding betting strategy.

3. Randomized forecasting

Forecasting can be thought as a perfect-information game between two players: Forecaster and Realty [18]. Assuming that Realty makes a binary choice at each step,a game of deterministic forecastingis described by the protocol:

FORn

=

1

,

2

, . . .

Forecaster announces a forecastpn

∈ [

0

,

1

]

.

Reality announces an outcome

ω

n

∈ {

0

,

1

}

.

ENDFOR

LetP

[

0

,

1

]

be the set of all probability measures on the unit interval

[

0

,

1

]

supplied with the standard

σ

-algebraF of all Borel subsets of

[

0

,

1

]

.

A game of randomized forecastingrequires a new player — Random Number Generator. It can be described by the protocol:

FORn

=

1

,

2

, . . .

Forecaster announces probability distributionPn

∈

P

[

0

,

1

]

.

Reality announces

ω

n

∈ {

0

,

1

}

.

Random Number Generator announcespn

∈ [

0

,

1

]

distributed according toPn.

ENDFOR

By Sandrony et al. [13]a randomized forecasting schemeis a sequence of functions

ζ

n

: {

0

,

1

}

n−1

× [

0

,

1

]

n−1

→

P

[

0

,

1

]

;

for anyngiven a sequence of past forecastsp1

, . . . ,

pn−1and a sequence of outcomes

ω

n−1

=

ω

1

. . . ω

n−1,

ζ

noutputs a

probabilityPnfor a forecastpn. We denote the randomized forecasting schemeP_ωn−1

(

·|p

1

, . . . ,

pn−1

)

,n

=

1

,

2

, . . .

(

ω

0

=

λ

is the empty sequence).

For any such randomized forecasting scheme by Ionescu–Tulcea theorem (see Shiryaev [15]) for any sequence of outcomes

ω

=

ω

₁

, ω

₂

, . . .

, a unique overall probability measurePron the set

[

0

,

1

]

∞_{of all forecasting paths — infinite}

sequencesp1

,

p2

, . . .

, exists such that for eachn,Pωn−1

(

·|p

₁

, . . . ,

p_n−1

)

is a conditional probability induced byPr, i.e.,

P_ωn−1

(

p_n

∈

A|p₁

, . . . ,

p_n−1

)

=

Pr

(

pn

∈

A|p1

, . . . ,

pn−1

)

for alln, whereAis a measurable subset of

[

0

,

1

]

.

Assume that for eachn, the probability distributionP_ωn−1

(

·|p

₁

, . . . ,

p_n−1

)

is concentrated on a finite subsetDnof

[

0

,

1

]

,

say,Dn

= {p

n,1

, . . . ,

pn,mn

}

. The number∆

=

lim infn→∞∆n, where

∆n

=

inf

{|

pn,i

−

pn,j

| :

i

6=

j

}

,

(4)

In general caseDnis a predictable, i.e., measurable with respect to the

σ

-algebraFn−1, random variable depending on

ω

n−1_.

A typical example is the uniform rounding: for eachn, rational pointsp_n,idivide the unit interval into equal parts of size 0

<

∆

<

1; the probability distributionP_ωn−1

(

·|p

₁

, . . . ,

p_n−1

)

is concentrated on these points. In this case the corresponding

level of discreteness on arbitrary sequence

ω

1

, ω

2

, . . .

equals∆.

Kakade and Foster [7] presented ‘‘an almost deterministic’’ universal randomized rounding forecasting scheme. This scheme, given a partition of the unit interval into equal parts of size∆, randomly rounds a deterministic forecast computed by some algorithm to nearby points of the partition.

Notice that the random forecastpngenerated by Kakade and Foster’s forecasting scheme is independent on the previous

forecastsp1

, . . . ,

pn−1; it depends only on

ω

n−1(seeAppendix). In this case the overall probability distributionPrdefined

by Kakade and Foster’s randomized forecasting schemeP_ωn−1is the product of independent probability distributions.

Proposition 1 (Kakade and Foster). For any infinite sequence

ω

=

ω

₁

ω

₂

. . .

and for the characteristic function I

(

p

)

of any

subinterval of

[

0

,

1

]

the overall probability Pr of the event

1 n n

X

i=1 I

(

pi

)(ω

i

−

pi

)

≤

∆ (2)

tends to1as n

→ ∞, where Pr is the overall probability distribution defined by Kakade and Foster’s randomized forecasting

scheme P_ωn−1, n

=

1

,

2

, . . .

, and p₁

,

p₂

, . . .

is a sequence of forecasts distributed according to Pr.

For example, the forecast 0

.

8512 can be rounded up to second digit to 0

.

86 with probability 0

.

12, and to 0

.

88, at the next moment of time, the forecast 0

.

2688 can be rounded up to second digit to 0

.

12, and to 0

.

88.

Details of the Kakade and Foster’s algorithm are given inAppendix. 4. Oakes’ example for randomized rounding forecasting

In this section, we consider some probabilistic variant of the Oakes’ example in the form of a game. This game will be used in the proof ofTheorem 2.

A generalization of Foster and Vohra’s result was obtained by Vovk and Shafer [18]. They consider a perfect-informatin game of randomized forecasting between the players — Forecaster, Skeptic, Reality, Random Number Generator.

The goal of Forecaster is to state probabilities that pass all possible statistical tests in light of Realty subsequent moves. This goal is formalized by adding a third player — Skeptic, who seeks to refute Forecaster’s probabilities.

We consider some probabilistic version of the game: LetK0

=

1

.

FORn

=

1

,

2

, . . .

Skeptic announcedSn

: [

0

,

1

] →

R.

Forecaster announces a probability distributionPn

∈

P

[

0

,

1

]

.

Reality announces

ω

n

∈ {

0

,

1

}

.

Random Number Generator announcespn

∈ [

0

,

1

]

distributed according toPn.

Skeptic updates his capitalKn

=

Kn−1

+

Sn

(

pn

)(ω

n

−

pn

)

.

ENDFOR

Restriction on Skeptic:Skeptic must choose theSnso that his capitalKnis nonnegative for allnno matter how the other

players move.

Winner:Forecaster wins if Skeptic’s capitalKnstays bounded asn

→ ∞

. Otherwise Realty and Skeptic win. Random

Number Generator is aneutralplayer.1

Theorem 1 of Vovk and Shafer [18] which is stated in a purely game-theoretic setting implies for our mixed setting that there exist a randomized strategy for Forecaster — a sequence of probability distributions

Pn

=

Pωn−1

(

·|p

₁

, . . .

p_n−1

)

,n

=

1

,

2

, . . .

, and an overall probabilityPrfor these probabilities such that for each sequence

ω

=

ω

₁

ω

₂

. . .

announced by Realty the set of forecasting pathsp1

,

p2

, . . .

where Forecaster wins hasPr-probability 1.

By [18] (Theorem 3) and [19]Proposition 1is a corollary of this result.

We prove that when Forecaster uses finite subsets of

[

0

,

1

]

for randomization Realty and Skeptic can defeat Forecaster in this forecasting game.

Define a strategy for Realty: at stepnRealty announces an outcome

ω

n

=

0 ifPn

(

[

0

.

5

,

1

]

)

≥

0

.

5

1 otherwise.

1 Our approach is not purely game-theoretic in sense of Shafer and Vovk [14]. Random Number Generator is used to introduce into consideration the forecasting paths. A purely game-theoretic version of this game and ofTheorem 1(and even ofTheorem 2) can also be obtained, but it is out of the scope of this paper.

(5)

Let

k

=

2−k,k

=

1

,

2

, . . .

. We consider two infinite sequences of Skeptic’s auxiliary strategies:

S_n1,k

(

p

)

= −

kK_n1−,k1

ξ (

p

≥

0

.

5

),

(3)

S_n2,k

(

p

)

=

kK_n2−,k1

ξ (

p

<

0

.

5

),

(4)

where

ξ(

true

)

=

1,

ξ(

false

)

=

0 andK_ns,k−1is the capital of Skeptic at the end of stepn

−

1 when he follows the strategy

S_ns,k₋₁,s

=

1

,

2 andk

=

1

,

2

, . . .

,K₀s,k

=

1.

We combine them in the strategySn

(

p

)

=

1₂

(

Sn1

(

p

)

+

Sn2

(

p

))

, whereSn1

(

p

)

=

P

∞

k=1

kSn1,k

(

p

)

andS2n

(

p

)

=

P

∞

k=1

kSn2,k

(

p

)

.

It follows from the definition thatK_ns,k

(

p

)

≤

2n_and

|S

s,k

n

(

p

)

| ≤

2n

−1_for_s

=

₁

,

_{2 and for all}_p_,_k_and_n_{. Then these sums are}

finite for eachnandp.

The total capital of Skeptic at stepnwhen he follows the strategySn

(

p

)

equals Kn

=

1 2 ∞

X

k=1

k

(

Kn1,k

+

K 2,k n

).

Suppose Forecaster uses a randomized forecasting schemeP_ωn−1

(

·|p

₁

, . . . ,

p_n−1

)

for definingPn. Since

ω

n−1andSni,i

=

1

,

2,

are defined in terms of

σ

-algebraFn−1_{, by Ionescu–Tulcea theorem an overall probability distribution}_Pr_{exists such that}

for eachn,Pn

=

Pr

(

·|p

1

, . . . ,

pn−1

)

is the conditional probability induced byPr.

Theorem 1.Suppose Forecaster’s randomized forecasting scheme has a positive level of discreteness on each infinite sequence

ω

.

Then using the strategies defined above, Realty and Skeptic win with Pr-probability1.

Proof. Suppose that for eachj,pjis distributed according toPj. Define two sequences of random variables

ϑ

n,1

=

n

X

j=1

ξ(

pj

≥

0

.

5

)(ω

j

−

pj

),

ϑ

n,2

=

n

X

j=1

ξ(

pj

<

0

.

5

)(ω

j

−

pj

),

wheren

=

1

,

2

, . . .

.

Suppose that for anyn, the probability distributionPn is concentrated on a finite set

{p

n,1

, . . . ,

pn,mn

}

. Denotep

−

n

=

max

{p

_n,t

:

p_n,t

<

0

.

5

}

andp+_n

=

min

{p

_n,t

:

p_n,t

≥

0

.

5

}

.2By definition

ω

n,p+n andp

−

n are predictable andp

+

n

−

p

−

n

≥

∆for

alln, where∆

>

0. We have

E

(ϑ

_n,1

)

≤

X

ωj=0 Pj

{p

j

≥

0

.

5

}

(

−p

+ j

)

+

X

ωj=1 Pj

{p

j

≥

0

.

5

}

(

1

−

p + j

)

≤ −

0

.

5 n

X

j=1

ξ(ω

j

=

0

)

p + j

+

0

.

5 n

X

j=1

ξ(ω

j

=

1

)(

1

−

p + j

).

E

(ϑ

_n,2

)

≥

X

ωj=0 Pj

{p

j

<

0

.

5

}

(

−p

−j

)

+

X

ωj=1 Pj

{p

j

<

0

.

5

}

(

1

−

p−j

)

≥ −

0

.

5 n

X

j=1

ξ(ω

j

=

0

)

p−j

+

0

.

5 n

X

j=1

ξ(ω

j

=

1

)(

1

−

p−j

),

whereEis the mathematical expectation with respect toPr. Then

E

(ϑ

_n,2

)

−

E

(ϑ

n,1

)

≥

0

.

5∆n (5)

for alln.

By the martingale strong law of large numbers withPr-probability 1, 1 n

ϑ

n,1

−

1 nE

(ϑ

n,1

)

→

0

,

1 n

ϑ

n,2

−

1 nE

(ϑ

n,2

)

→

0 (6)

asn

→ ∞

. Then by (5) and (6) withPr-probability 1 for each

δ >

0, 1

n

ϑ

n,2

−

1

n

ϑ

n,1

≥

0

.

5∆

−

δ

(7)

for all sufficiently largen. In particular, (7) impliesCorollary 1.

(6)

We follow Shafer and Vovk’s [14] method of defining the defensive strategy for Skeptic. For anykandn, K_n1,k

=

n

Y

j=1

(

1

−

_k

ξ(

pj

≥

0

.

5

)(ω

j

−

pj

)),

(8) K_n2,k

=

n

Y

j=1

(

1

+

_k

ξ(

pj

<

0

.

5

)(ω

j

−

pj

)).

(9)

If Skeptic follows the strategy (3) then at stepn

lnK_n1,k

≥ −

_k

ϑ

_n,₁

−

_k2n

.

(10)

If Skeptic follows the strategy (4) then at stepn

lnK_n2,k

≥

_k

ϑ

_n,₂

−

_k2n

.

(11)

Here we have used the inequality ln

(

1

+

r

)

≥

r

−

r2for all

|r| ≤

1. The inequalities (7), (10) and (11) imply

lim sup n→∞ lnK1,k n

+

lnKn2,k n

≥

1 2

k∆

−

2

2 k

≥

2

2 k (12)

is valid withPr-probability 1, where

k

≤

1₈∆.

Using (12), fors

=

1 or fors

=

2 withPr-probability 1 lim sup n→∞ lnK_ns,k n

≥

2 k

.

(13)

By definition fors

=

1

,

2 and for allk,K_ns,k

≥

0 for allnno matter how the other players move. By (13) if

k

≤

1₈∆then

fors

=

1 or fors

=

2, lim sup_n→∞Kns,k

= ∞

withPr-probability 1.

SinceKn is the weighted sum ofKns,k, we have lim supn→∞Kn

= ∞

on a set of forecasting pathsp1

,

p2

, . . .

ofPr

-probability 1.

4

We also obtain a lower bound of calibration error.

Corollary 1. Suppose Forecaster’s randomized forecasting scheme has a positive level of discreteness on each infinite sequence

ω

.

Then using the strategy defined above, Realty announces a sequence

ω

such that, for i

=

1or for i

=

2, with Pr-probability1

lim sup n→∞

1 n

ϑ

n,i

≥

0

.

25∆

.

(14)

This corollary follows from (7).

5. Uniform lower bounds of calibration error

In this section, we study the asymptotic calibration in the algorithmic setting. In particular, we present a computable uniform version ofTheorem 1. We show that the lower bound (14) of calibration error also is a uniform lower bound for all computable randomized forecasting schemes. We also show that this uniform lower bound is valid for a vast majority of sequences generated by some probabilistic algorithm.

LetΩ

= {

0

,

1

}

∞_{be the set of all infinite binary sequences,}_Ξ

= ∪

∞

n=1

{

0

,

1

}

nbe the set of all finite binary sequences,

and

λ

be the empty sequence. For any finite or infinite sequence

ω

=

ω

₁

. . . ω

_n

. . .

, we write

ω

n

=

ω

1

. . . ω

n(we put

ω

0

=

ω

0

=

λ

). Also,l

(ω

n

)

=

ndenotes the length of the sequence

ω

n. Ifxis a finite sequence and

ω

is a finite or infinite

sequence thenx

ω

denotes the concatenation of these sequences;x

v

ω

means thatx

=

ω

n_{for some}_n_.

We need some computability concepts. LetRbe the set of all real numbers extended by adding the infinities

−∞

and

+∞

.

For any set of of finite objectsA, we suppose that elements ofAcan be effectively enumerated by positive integer numbers (see Rogers [12]). In particular, we will identify a computer program with its number. We fix some effective one-to-one enumeration of all pairs (triples, and so on) of nonnegative integer numbers. We identify any pair

(

t

,

s

)

with its number

ht

,

si.

A function

φ

:A

→

Ris called (lower) semicomputable if the set

{

(

r

,

x

)

:

r

< φ(

x

),

ris a rational number

}

is recursively enumerable. This means that there is an algorithm which when fed with a rational numberr and a finite objectxeventually stops ifr

< φ(

x

)

and never stops, otherwise. In other words, the semicomputability off means that if

φ(

x

) >

rthis fact will sooner or later be learned, whereas if

φ(

x

)

≤

rwe may be for ever uncertain. A function

φ

is upper semicomputable if

−

φ

is lower semicomputable.

(7)

A standard argument based on the recursion theory shows that there exist the lower and the upper semicomputable real functions

φ

−

(

_j

,

_x

)

_and

φ

+

(

_k

,

_x

)

_{universal for all lower semicomputable and upper semicomputable functions from}_x

∈

_Ξ_.3 As follows from the definition, for every computable real function

φ(

x

)

there exist a pair

hj

,

kisuch that

φ(

x

)

=

φ

−

(

j

,

x

)

=

φ

+

(

k

,

x

)

for allx. Let

φ

−

s

(

j

,

x

)

be equal to the maximal rational numberrsuch that the triple

(

r

,

j

,

x

)

is enumerated inssteps in the

process of enumeration of the set

{

(

r

,

j

,

x

)

:

r

< φ(

j

,

x

),

ris rational

}

and equals

−∞

, otherwise. Any such function

φ

−

s

(

j

,

x

)

takes only finite number of rational values distinct from

−∞

. By

definition,

φ

_s−

(

j

,

x

)

≤

φ

_s−₊₁

(

j

,

x

)

for allj

,

s

,

x, and

φ

−

(

j

,

x

)

=

lim

s→∞

φ

−

s

(

j

,

x

).

An analogous non-increasing sequence of functions

φ

+_s

(

k

,

x

)

exists for any upper semicomputable function.

A function

φ

:

Ξ

→

Ris computable if there exists an algorithm which givenz

∈

Ξ and a degree of precision

κ

computes a rational approximation of

φ(

z

)

up to

κ

. In more detail, leti

= ht

,

ki. We say that the function

φ

i

(

x

)

isdefined on

xif given any degree of precision — a positive rational number

κ >

0, it holds that

|

φ

_s+

(

t

,

x

)

−

φ

_s−

(

k

,

x

)

| ≤

κ

(15)

for somes;

φ

i

(

x

)

is undefined otherwise. If somesexists such that (15) holds, define

φ

i,κ

(

x

)

=

φ

−s

(

k

,

x

)

for minimal suchs,

φ

i,κ

(

x

)

is undefined otherwise. The function

φ

_i,κ

(

x

)

is called the rational approximation (from below) of

φ

i

(

x

)

up to

κ

.

Any measurePonΩcan be defined as follows. Let us consider intervals Γz

= {

ω

∈

Ω

:

z

v

ω

}

,

wherez

∈

Ξ. We denoteP

(

z

)

=

P

(

Γz

)

and extend this function to all Borel subsets ofΩin a standard way.

We also use the concept ofcomputable operationonΞ

S

Ω[22,23]. LetF

ˆ

be a recursively enumerable set of ordered pairs of finite sequences satisfying the following properties:

•

(i)

(

x

, λ)

∈ ˆ

Ffor eachx;

•

(ii) if

(

x

,

y

)

∈ ˆ

F,

(

x0

,

_y0

)

∈ ˆ

_F_and_x

v

_x0_then_y

v

_y0_or_y0

v

_y_{for all finite binary sequences}_x

,

_x0

,

_y

,

_y0_.

A computable operationFis defined as follows

F

(ω)

=

sup

{y

|

x

v

ω

and

(

x

,

y

)

∈ ˆ

Ffor somex}

,

where

ω

∈

Ω

S

Ξand sup is in the sense of the partial order

v

onΞ.

Informally, the computable operationFis defined by some algorithm; this algorithm when fed with an infinite or a finite sequence

ω

takes it sequentially bit-by-bit, processes it, and produces an output sequence also sequentially bit-by-bit.

A probabilistic algorithmis a pair

(

P

,

F

)

, wherePis a computable measure on the set of all binary sequences andF is a

computable operation. For any probabilistic algorithm

(

P

,

F

)

and a setA

⊆

Ω, we consider the probability

P{

ω

:

F

(ω)

∈

A}

of generating by means ofF a sequence fromAgiven a sequence

ω

distributed according to the computable probability distributionP. In that followsP

=

L, whereL

(

x

)

=

L

(

Γx

)

=

2−l(x)is the uniform measure onΩ.

A deterministic forecasting systemis a functionf

:

Ξ

→ [

0

,

1

]

(see Dawid [1]). In that follows we consider total forecasting

systemsf, i.e., everywhere defined onΞ.

In that follows, we mean bya randomized forecasting systema probability distributionP_xon

[

0

,

1

]

, wherex

∈

Ξ is a parameter.4_{The precise definition of computable probability distribution on}

[

₀

,

₁

]

_{requires some technicalities. In fact,} in the construction below, we computePxonly at one set

[

0

.

5

,

1

]

; so, we call a randomized forecasting systemPxweakly

computableifPx

(

[

0

.

5

,

1

]

)

is a computable function from the parameterx.

LetI0

=

I0

(

p

)

andI1

=

I1

(

p

)

be the characteristic functions of the intervals

[

0

,

0

.

5

)

and

[

0

.

5

,

1

]

, correspondingly.

The following theorem gives a uniform lower bound of calibration error for computable randomized forecasting with positive level of discretness.

Theorem 2.For any

>

0, a probabilistic algorithm

(

L

,

F

)

can be constructed, which with probability

≥

1

−

outputs an

infinite binary sequence

ω

=

ω

₁

ω

₂

. . .

such that for every weakly computable randomized forecasting system Pxwith the level of

3 _{This means that each lower semicomputable function}_φ(_x₎ _{can be represented as}_φ(_x₎ ₌ _φ−₍

j,x)for somej. The same holds for upper semicomputability.

4 Such forecasting systems are used in Kakade and Foster’s universal forecasting algorithm (seeAppendix). We can prove a game-theoretic version of

(8)

discreteness∆on

ω

, for

ν

=

0or for

ν

=

1, Pr-probability of the event lim sup n→∞

1 n n

X

i=1 I_ν

(

pi

)(ω

i

−

pi

)

≥

0

.

25∆ (16)

equals1, where the overall probability Pr is defined by P_ωi−1, i

=

1

,

2

, . . .

, and p₁

,

p₂

, . . .

are distributed according to these

probabilities.

Theorem 2uses forecast-based selection rules —I_ν

(

pi

)

,

ν

=

0

,

1. In the case when we select random forecasts for checking

using their mean value, we obtain a lower bound does not depending on the accuracy of randomized rounding. Let for anyi,E_ωi−1

=

R

1

0 pPωi−1

(

dp

)

be the mean value of the forecasts generated by a randomized forecasting system

P_ωi−1.

In the following theorem we consider randomized forecasting systemsf with computable mathematical expectations, i.e.,E_ωi−1is a computable real function from the parameter

ω

i−1.

Theorem 3. For any

>

(

L

,

F

)

≥

1

−

outputs an infinite

binary sequence

ω

=

ω

₁

ω

₂

. . .

such that for every randomized forecasting system Pxwith computable mathematical expectation,

for

ν

=

0or for

ν

=

1, Pr-probability of the event

lim sup n→∞

1 n n

X

i=1 I_ν

(

E_ωi−1_,_n

)(ω

_i

−

p_i

)

≥

0

.

25 (17)

equals1, where the overall probability Pr is defined by P_ωn−1, n

=

1

,

2

, . . .

, and p1

,

p2

, . . .

are distributed according to these

probabilities.5_{Here, E}

ωi−1_,nis a computable sequence of rational approximations of E_ωi−1up to a given precision

κ

_n, where

κ

_n

→

0

as n

→ ∞, i

=

1

,

2

, . . .

.6

The following theorem is a computable uniform version of the Oakes’ counter example for deterministic forecasting systems. We expand this example to a large set of sequences generated by a probabilistic algorithm. A variant of this result was obtained by V’yugin [20].

An outcome-based selection rule is a function

δ

:

Ξ

→ {

0

,

1

}

.

Theorem 4. For any

>

(

L

,

F

)

≥

1

−

outputs an

infinite binary sequence

ω

1

ω

2

. . .

such that for every deterministic forecasting algorithm f there exists a computable

outcome-based selection rule

δ

such that the following inequality holds

lim sup n→∞

1 n n

X

i=1

δ(ω

i−1

)(ω

i

−

pi

)

≥

0

.

25

,

(18) where pi

=

f

(ω

i−1

)

, i

=

1

,

2

, . . .

.

For each deterministic forecasting systemf, there exists a unique probability measure PonΩ such thatP

(

Γx1

)

=

f

(

x

)

P

(

Γx

)

for allx

∈

Ξ. In other words,f is a version of the conditional probability, according toP, thatxwill be followed

by 1. Dawid’s original notion of calibration can be considered as a version of von Mises’ definition of an individual random sequence with respect to the measureP(see Li and Vitányi [9] for details of algorithmic randomness).

V’yugin [20] proved that for any computable probability measureP, the martingale strong law of large numbers holds for each infinite sequence

ω

random with respect toPin sense of Martin-Löf. Therefore, the probabilistic algorithm

(

L

,

F

)

fromTheorem 4outputs with probability

≥

1

−

an infinite binary sequence

ω

which is not Martin-Löf random (it is not even Dawid random) with respect to each computable probability distribution.

6. Proofs ofTheorems2–4

For any probabilistic algorithm

(

P

,

F

)

, we consider the function

Q

(

x

)

=

P{

ω

:

x

v

F

(ω)

}

.

(19)

It is easy to verify that this function is lower semicomputable and satisfies:

Q

(λ)

≤

1

;

Q

(

x0

)

+

Q

(

x1

)

≤

Q

(

x

)

for all x. Any function satisfying these properties is called semicomputable semimeasure. For any semicomputable semimeasureQ, a probabilistic algorithm

(

L

,

F

)

exists such that (19) holds, whereP

=

L(for the proof see [22,23]).

5 The lower bounds ofTheorems 3and4have been weakened comparing to the conference version [21] (0.25 is used instead of 0.5−). Author does not know how to improve these lower bounds.

(9)

Though the semimeasureQ is not a measure, we consider the corresponding measure on the setΩ. Define

¯

Q

(

Γx

)

=

inf n

X

l(y)=n,xvy Q

(

y

)

for allx

∈

Ξand extend this function to all Borel subsets ofΩ(see [23]). For any semimeasureP, definethe support set of P

EP

= {

ω

∈

Ω

: ∀n

(

P

(ω

n

)

6=

0

)

}

.

It is easy to see thatEPis a closed subset ofΩandP

¯

(

EP

)

= ¯

P

(

Ω

)

.

We will construct a semicomputable semimeasureQ as a some sort of network flow. We define an infinite network on the base of the infinite binary tree. Anyx

∈

Ξdefines two directed edges

(

x

,

x0

)

and

(

x

,

x1

)

of length one. In the construction below we will add to the network extra edges

(

x

,

y

)

of length

>

1, wherex

,

y

∈

Ξ,x

v

yandy

6=

x0

,

x1. By the length of an edge

(

x

,

y

)

we mean the numberl

(

y

)

−

l

(

x

)

. For any edge

σ

=

(

x

,

y

)

we denote by

σ

1

=

xits starting vertex and by

σ

2

=

yits terminal vertex. A computable functionq

(σ )

defined on all edges of length one and on all extra edges and taking

rational values is calleda networkif for everyx

∈

Ξ

X

σ:σ1=x

q

(σ )

≤

1

.

LetGbe the set of all extra edges of the networkq(it is a part of the domain ofq).

Byq-flow we mean the minimal semimeasureP such thatP

≥

R, where the functionRis defined by the following recursive equations R

(λ)

=

1

;

R

(

y

)

=

X

σ:σ2=y q

(σ )

R

(σ

1

)

(20) fory

6=

λ

.

By definitionRis a computable function taking rational values and the semimeasurePis lower semicomputable. A networkqis calledelementaryif the set of extra edges is finite andq

(σ )

=

1

/

2 for almost all edges of unit length. For any networkqwe define thenetwork flow delayfunction (q-delay function)

d

(

x

)

=

1

−

q

(

x

,

x0

)

−

q

(

x

,

x1

).

The construction below works with all programsicomputing the functions

φ

i

(

x

)

.7In the proof (seeLemma 6) we use a

special class of these functions, namely, functions of the type

φ(

x

)

=

Px

(

[

0

.

5

,

1

]

),

(21)

wherePx,x

∈

Ξ, is a weakly computable randomized forecasting system. For any such function

φ

,

φ

=

φ

ifor somei.

The processing of any programiis divided on sessionss

=

1

,

2

, . . .

. At each sessions, the construction works with the rational approximations

φ

_i,κs

(

x

)

of

φ

i

(

x

)

from below up to

κ

sfor allx

∈

Ξ, where

κ

s

=

1

/

s.

Besides, at each sessions, we visit each rational approximation

φ

_i,κ_s

(

x

)

on infinitely many stepsn. To do this, we define some functionp

(

n

)

such that for any positive integer numbermwe havep

(

n

)

=

mfor infinitely manyn. For example, we can definep

(

hm

,

ki

)

=

mandp0

(

hm

,

_ki

)

=

_k_{for all}_m_and_k_{, where}

hm

,

_ki_{is some computable one-to-one enumeration}

of all pairs of nonnegative integer numbers. Then for each stepnwe compute

hi

,

si =p

(

n

)

, whereiis a program andsis a number of a session; so,i

=

p

(

p

(

n

))

ands

=

p0

(

p

(

n

))

.

For a programi, a number of sessions, finite binary sequencesxandy, an elementary networkq, and for a nonnegative integer numbern, the predicateB

(

hi

,

si

,

x

,

y

,

q

,

n

)

istrueif the following holds

•

(i)sl

(

x

) <

n;

•

(ii)l

(

y

)

=

n,x

v

y,

•

(iii)d

(

yk

) <

_{1 for all}_k_{, 1}

≤

_k

≤

_n_{, where}_d_{is the}_q_{-delay function and}_yk

=

_y 1

. . .

yk;

•

(iv) for allksuch thatl

(

x

)

≤

k

<

sl

(

x

)

the values

φ

_i,κs

(

yk

)

are defined in

≤

nsteps and

yk+1

=

0 if

φ

_i,κ_s

(

yk

)

≥

₀

.

₅

1 otherwise. It isfalseotherwise. Define

β(

x

,

q

,

n

)

=

min

{y

:

p

(

l

(

y

))

=

p

(

l

(

x

)),

B

(

hp

(

p

(

l

(

x

))),

p0

(

p

(

l

(

x

)))

i

,

x

,

y

,

q

,

n

)

}

.

Herep

(

p

(

l

(

x

)))

is a program andp0

(

_p

(

_l

(

_x

)))

_{is a number of a session; min is considered for lexicographical ordering of strings;}

we suppose that min

∅

is undefined.

7 _{Recall that}_i _{= h}_j_,_k_i_{for some}_j_,_k_{; we use the lower and upper semicomputable real functions}_φ−₍

j,x)andφ+₍

k,x)universal for all lower semicomputable and upper semicomputable functions fromx∈Ξto compute valuesφi(x).

(10)

Lemma 1. For any total function

φ

i,

β(

x

,

q

,

n

)

is defined for all x

∈

Ξand for all sufficiently large n such that p

(

p

(

n

))

=

i.

Proof. The needed sequenceycan be easily defined for all sufficiently largensequentially bit-by-bit, since

φ

_i,κs

(

z

)

is defined

for allzands.

4

The goal of the construction below is the following. Each extra edge

σ

corresponds to some task numberI

= hi

,

sisuch thatp

(

l

(σ

1

))

=

p

(

l

(σ

2

))

=

I. The goal of the taskIis to define a finite set of extra edges

σ

such that for each infinite binary

sequence

ω

the following holds: either

ω

contains some extra edge as a subword, or the network flow delay functiond

equals 1 on some initial fragment of

ω

.

For each extra edge

σ

added to the networkq,B

(

I

, σ

1

, σ

2

,

qn−1

,

n

)

is true; it is false, otherwise.

Lemma 5shows thatQ

¯

(

EQ

) >

1

−

0

.

5

, whereQ is theq-flow andEQ is its support set.Lemma 6shows that for each

ω

∈

EQ, the event (16) holds with the overall probability 1.

Construction.Let

ρ(

n

)

=

(

n

+

n0

)

2for some sufficiently largen0(the valuen0will be specified below in the proof of

Lemma 5).

Using the mathematical induction byn, we define a sequenceqn_{of elementary networks. Put}_q0

(σ)

=

₁

/

_{2 for all edges}

σ

of length one.

Letn

>

0 and a networkqn−1_{be defined. Let}_dn−1_{be the}_qn−1_{-delay function and let}_Gn−1_{be the set of all extra edges.}

We also suppose thatl

(σ

2

) <

nfor all

σ

∈

Gn−1.

Let us define a networkqn_{. At first, we define a network flow delay function}_dn_{and a set}_Gn_{. The construction can be split}

up into three cases.

Let

w(

I

,

qn−1

)

be equal to the minimalmsuch thatp

(

m

)

=

Iandm

>

sl

(σ

2

)

for each extra edge

σ

∈

Gn−1such that

p

(

l

(σ

1

))) <

I, wheres

=

p0

(

p

(

I

))

is the number of the session assigned with the taskI.

The inequality

w(

I

,

qm

)

6=

w(

I

,

qm−1

)

can be induced by some taskJ

<

Ithat adds an extra edge

σ

=

(

x

,

y

)

such that

l

(

y

) > w(

i

,

qm−1

)

_and_p

(

_l

(

_x

))

=

_p

(

_l

(

_y

))

=

_J_._{Lemma 2}_{(below) will show that this can happen only at finitely many steps of}

the construction.

Case1.

w(

p

(

n

),

qn−1

)

=

n(the goal of this part is to start a new taskI

=

p

(

n

)

or to restart the existing taskI

=

p

(

n

)

if it was destroyed by some taskJ

<

Iat some preceding step).

Putdn

(

_y

)

=

₁

/ρ(

_n

)

_for_l

(

_y

)

=

_n_{and define}_dn

(

_y

)

=

_dn−1

(

_y

)

_{for all other}_y_{. Put}_Gn

=

_Gn−1_.

Case2.

w(

p

(

n

),

qn−1

) <

_n_{(the goal of this part is to process the task}_I

=

_p

(

_n

)

_).

LetCnbe the set of allxsuch that

w(

I

,

qn−1

)

≤

l

(

x

) <

n, 0

<

dn−1

(

x

) <

1, the function

β(

x

,

qn−1

,

n

)

is defined8and

there is no extra edge

σ

∈

Gn−1such that

σ

1

=

x.

In this case for eachx

∈

Cndefinedn

(β(

x

,

qn−1

,

n

))

=

0, and for all otheryof lengthnsuch thatx@ydefine

dn

(

y

)

=

d

n−1

(

_x

)

1

−

dn−1

(

_x

)

.

Definedn

(

_y

)

=

_dn−1

(

_y

)

_{for all other}_y_{. We add an extra edge to}_Gn−1_{, namely, define}

Gn

=

Gn−1

∪ {

(

x

, β(

x

,

qn−1

,

n

))

:

x

∈

Cn

}

.

We say that the taskI

=

p

(

n

)

addsthe extra edge

(

x

, β(

x

,

qn−1

,

_n

))

_{to the network and that all existing tasks}_J

>

_I_are

destroyed by the taskI.

After Case 1 and Case 2, define for any edge

σ

of unit length

qn

(σ)

=

1

2

(

1

−

d

n

(σ

1

))

andqn

(σ)

=

_dn

(σ

1

)

for each extra edge

σ

∈

Gn.

Case3. Cases 1 and 2 do not hold. Definedn

=

_dn−1_,_qn

=

_qn−1_,_Gn

=

_Gn−1_.

As the result of the construction we define the networkq

=

limn→∞qn, the network flow delay functiond

=

limn→∞dn

and the set of extra edgesG

= ∪

_nGn_.

The functionsqanddare computable and the setGis recursive by their definitions. LetQ denote theq-flow.

The following lemma shows that any task can add new extra edges only at finite number of steps. LetG

(

I

)

be the set of all extra edges added by the taskI,

w(

I

,

q

)

=

limn→∞

w(

I

,

qn

)

.

Lemma 2. The set G

(

I

)

is finite and

w(

I

,

q

) <

∞

for all I.

Proof. Note that ifG

(

J

)

is finite for allJ

<

I, then

w(

I

,

q

) <

∞

. Hence, we must prove that the setG

(

I

)

is finite for allI. Suppose the opposite assertion holds. LetIbe the minimal such thatG

(

I

)

is infinite. By choice ofIthe setsG

(

J

)

for allJ

<

I

are finite. Then

w(

I

,

q

) <

∞

.

By definition ifd

(ω

m

)

6=

0 thenpm

=

1

/

d

(ω

m

)

is a positive integer number. Besides, if

(ω

n

,

y

), (ω

m

,

y0

)

∈

G

(

I

)

, where

n

<

mandl

(

y

)

=

m, thenpn

>

pm. Hence, for each

ω

∈

Ωa maximalmexists such that

(ω

m

,

y

)

∈

G

(

I

)

for someyor no

such extra edge exists. In the latter case letm

=

w(

I

,

q

)

. Defineu

(ω)

=

1

/

d

(ω

m

)

_.

(11)

By the construction the integer valued functionu

(ω)

is constant on the intervalΓ_ωm, and then, it is continuous in the

topology generated by such intervals. SinceΩis compact in this topology,u

(ω)

is bounded. Then for somem0_,_u

(ω)

=

_u

(ω

m0

)

for all

ω

. By the construction if any extra edge ofIth type was added toG

(

I

)

at some step thend

(

y

) >

d

(

x

)

holds for some new pair

(

x

,

y

)

such thatx

⊆

y. This gives us a contradiction ifG

(

I

)

is infinite.

4

An infinite sequence

α

∈

Ωis called anI-extensionof a finite sequencexifx

v

α

andB

(

I

,

x

, α

n

,

_n

)

_{is true for almost all}_n_.

A sequence

α

∈

Ωis calledI-closed ifd

(α

n

)

=

_{1 for some}_n_{such that}_p

(

_n

)

=

_I_{, where}_d_{is the}_q_{-delay function. Note}

that if

σ

∈

G

(

I

)

is an extra edge thenB

(

I

, σ

1

, σ

2

,

n

)

is true, wheren

=

l

(σ

2

)

.

Lemma 3. Assume that for each initial fragment

ω

n_{of an infinite sequence}

ω

_{some I-extension exists. Then either the sequence}

ω

will be I-closed in the process of the construction or

ω

contains an extra edge of Ith type (i.e.

σ

2

v

ω

for some

σ

∈

G

(

I

)

).

Proof. Assume a sequence

ω

is notI-closed. ByLemma 2the maximalmexists such thatp

(

m

)

=

Iandd

(ω

m

) >

_{0. Since}

the sequence

ω

m_{has an}_I_{-extension and}_d

(ω

k

) <

_{1 for all}_k_{, by Case 2 of the construction a new extra edge}

(ω

m

,

_y

)

_of_I_th

type must be added to the binary tree. By the constructiond

(

y

)

=

0 andd

(

z

)

6=

0 for allzsuch that

ω

m

v

z,l

(

z

)

=

l

(

y

)

, andz

6=

y. By choice ofmwe havey

v

ω

.

4

Lemma 4. For any y, Q

(

y

)

=

0if and only if q

(σ )

=

0for some edge

σ

of unit length located on y (this edge satisfies

σ

2

v

y).

Proof. The necessary condition is obvious. To prove that this condition is sufficient, assume thatq

(

yn

,

yn+1

)

=

0 for some

n

<

l

(

y

)

butQ

(

y

)

6=

0. Then by definitiond

(

yn

)

=

_{1. Since}_Q

(

_y

)

6=

_{0, an extra edge}

(

_x

,

_z

)

∈

_G_{exists such that}_x

v

_yn_and

yn+1

v

_z_{. But, by the construction, this extra edge cannot be added to the network}_ql(z)−1_since_d

(

_zn

)

=

_{1. This contradiction}

proves the lemma.

4

ByLemma 4the relationQ

(

y

)

=

0 is recursive and

EQ

=

Ω

\ ∪

d(x)=1Γx

,

(22)

whereEQis the support set ofQ.

Lemma 5. It holds thatQ

¯

(

EQ

) >

1

−

0

.

5

.

Proof. We boundQ

¯

(

Ω

)

from below. LetRbe defined by (20). By definition of the network flow delay function, we have

X

u:_l(u)=n+1 R

(

u

)

=

X

u:_l(u)=n

(

1

−

d

(

u

))

R

(

u

)

+

X

σ:_σ∈_G,l(σ2)=n+1 q

(σ)

R

(σ

1

).

(23)

Define an auxiliary sequence

Sn

=

X

u:l(u)=n R

(

u

)

−

X

σ:σ∈G,l(σ2)=n q

(σ )

R

(σ

1

).

At first, we consider the case

w(

p

(

n

),

qn−1

) <

n. If there is no edge

σ

∈

Gsuch thatl

(σ

2

)

=

nthenSn+1

≥

Sn. Assume some

such edge exists. Define

P

(

u

, σ)

⇐⇒

l

(

u

)

=

l

(σ

2

)

&

σ

1

v

u&u

6=

σ

2&

σ

∈

G

.

By definition of the network flow delay function, we have

X

u:l(u)=n d

(

u

)

R

(

u

)

=

X

σ:σ∈G,l(σ2)=n d

(σ

2

)

X

u:P(u,σ ) R

(

u

)

=

X

(σ

1

)

1

−

d

(σ

1

)

X

u:P(u,σ ) R

(

u

)

≤

X

(σ

1

)

R

(σ

1

)

=

X

σ:σ∈G,l(σ2)=n q

(σ)

R

(σ

1

).

(24)

Here we used the inequality

X

u:P(u,σ )

R

(

u

)

≤

R

(σ

1

)

−

d

(σ

1

)

R

(σ

1

)

for all

σ

∈

Gsuch thatl

(σ

2

)

=

n. Combining this bound with (23) we obtainSn+1

≥

Sn.

Let us consider the case

w(

p

(

n

),

qn−1

)

=

_n_{. Then}

X

u:l(u)=n

d

(

u

)

R

(

u

)

≤

ρ(

n

)

=

1

(

n

+

n0

)

2

(12)

Combining (23) and (24) we obtain

Sn+1

≥

Sn

−

1

(

n

+

n0

)

2

for alln. SinceS0

=

1, this implies

Sn

≥

1

−

∞

X

i=1 1

(

i

+

n0

)

2

≥

1

−

0

.

5

for some sufficiently large constantn0. SinceQ

≥

R, it holds that

¯

Q

(

Ω

)

=

inf n

X

l(u)=n Q

(

u

)

≥

inf n Sn

≥

1

−

0

.

5

.

Lemma is proved.

4

Lemma 6. For each weakly computable randomized forecasting system Pxand and for each sequence

ω

∈

EQ, the event(16)

holds with Pr-probability1, where the overall probability Pr is defined by P_ωi−1, i

=

1

,

2

, . . .

.

Proof. Assume that

ω

∈

EQ andPxis a weakly computable randomized forecasting system, i.e., the corresponding

φ

i

(

x

)

=

Px

(

[

0

.

5

,

1

]

)

is defined for allx

∈

Ξ.

Let

φ

_i,κ_sbe a rational approximation of of

φ

ifrom below up to

κ

s

=

1

/

s, and letI

= hi

,

si. Since there are infinitely many

sessionssof the construction when we visit

φ

i, we can consider only stepsn,p

(

n

)

=

I, such thatsis sufficiently large.

By (22),d

(ω

n

) <

1 for alln. Since

ω

is anI-extension of

ω

nfor eachn, byLemma 3there exists an extra edge

σ

∈

G

(

I

)

such that

σ

2

v

ω

. In that followsk

=

l

(σ

1

)

andn

=

sk.

We reproduce the game from the proof ofTheorem 1on the edge

σ

.

Recall that for anyj,p−_j

=

max

{p

_j,t

:

p_j,t

<

0

.

5

}

andp+_j

=

min

{p

_j,t

:

p_j,t

≥

0

.

5

}

, where

{p

_j,1

, . . . ,

pj,mj

}

is the support set

ofP_ωj−1. By definition of precision of roundingp+_j

−

p−_j

≥

∆for allj.

Assume thatpjare distributed according toPωj−1, wherej

=

1

,

2

, . . .

. In that follows, we use the inequality

φ

i,κs

(ω

j−1

)

≤

P_ωj−1

{p

_j

≥

0

.

5

} ≤

φ

_i,κ_s

(ω

j−1

)

+

κ

_s

.

(25)

Consider two sequences of random variables

ϑ

n,1

=

n

X

j=k+1

ξ(

pj

≥

0

.

5

)(ω

j

−

pj

),

(26)

ϑ

n,2

=

n

X

j=k+1

ξ(

pj

<

0

.

5

)(ω

j

−

pj

),

(27)

where

ξ(

true

)

=

1 and

ξ(

false

)

=

0.

We compute the bounds on mathematical expectations of these variables. These expectations are taken with respect to the overall probability distributionPrgenerated by the probability distributionsPr_ωj−1,j

=

1

,

2

, . . .

(

ω

is fixed). Using the definition of the subword

σ

∈

G

(

i

)

of the sequence

ω

, we obtain (k

<

j

≤

n)

E

(ϑ

_n,1

)

≤

X

ωj=0 P_ωj−1

{p

_j

≥

0

.

5

}

(

−p

+_j

)

+