Multiobjective Prediction with Expert Advice

(1)

Multiobjective Prediction with Expert Advice

Alexey Chernov

Computer Learning Research Centre and Department of Computer Science Royal Holloway University of London

(2)

Example: Prediction of Sport Match Outcome

V. Vovk, F. Zhdanov. Predictions with Expert Advice for Brier Game. ICML’08 Bookmakers data:

4 bookmakers, odds for∼10000 tennis matches (2 outcomes)

8 bookmakers, odds for∼9000 football matches (3 outcomes)

Oddsa_i can be transformed to probabilitiesProb[i]:

Prob[i] = P1/ai

j1/aj

The loss is measured by the square (Brier) loss function. Learner’s strategy is the Aggregating Algorithm.

(3)

Tennis Prediction, Square Loss

0 2000 4000 6000 8000 10000 12000 −4 −2 0 2 4 6 8 10 12 14 16 Theoretical bounds Learner Bookmakers

Graph of the negative regretLossEk(T)−Loss(T), 4 Experts

(4)

Tennis Prediction, Log Loss

0 2000 4000 6000 8000 10000 12000 −5 0 5 10 15 20 25 Theoretical bounds Learner Bookmakers

(5)

Tennis Predictions: “Wrong” Losses

Graphs of the negative regretLossEk(T)−Loss(T)

0 2000 4000 6000 8000 10000 12000 −10 −5 0 5 10 15 20 25 Theoretical bounds Learner Bookmakers 0 2000 4000 6000 8000 10000 12000 −5 0 5 10 15 Theoretical bounds Learner Bookmakers

log loss square loss

the AA for the square loss the AA for the log loss

(6)

Aggregating Algorithm with Wrong Losses

Fact

For the game with2outcomes, one can construct a sequence of predictions of2Experts and a sequence of outcomes with the following property. If Learner’s predictions are generated by the Aggregating Algorithm for the log lossthen for almost all T

Loss(T)≥LossE1(T) +T/10,

whereLoss(T)andLossE1(T)are thesquare lossesof Learner and Expert 1.

A similar statement holds for the Aggregating Algorithm for the square loss evaluated by the log loss.

(7)

New Settings

Experts:γ_t(1), . . . , γ_t(k) Learner:γt

Reality:ωt

Many loss functions

Loss(_Em) k (T) = T X t=1 λ(m)(γ_t(k), ωt) Loss(m)(T) = T X t=1 λ(m)(γt, ωt)

Expert Evaluator’s advice

Loss(_Ek) k (T) = T X t=1 λ(k)(γ_t(k), ωt) Loss(k)(T) = T X t=1 λ(k)(γt, ωt)

(8)

New Settings

Experts:γ_t(1), . . . , γ_t(k) Learner:γt

Reality:ωt

Many loss functions

Loss(_Em) k (T) = T X t=1 λ(m)(γ_t(k), ωt) Loss(m)(T) = T X t=1 λ(m)(γt, ωt)

Expert Evaluator’s advice

Loss(_Ek) k (T) = T X t=1 λ(k)(γ_t(k), ωt) Loss(k)(T) = T X t=1 λ(k)(γt, ωt)

(9)

Bound for New Settings

Theorem

Ifλ(k)areη(k)-mixable proper loss functions, k =1, . . . ,K , Learner has a strategy (e. g. the Defensive Forecasting algorithm) that guarantees, for all T and for all k , that

Loss(k)(T)≤Loss(_Ek)

k (T) +

1 η(k)lnK.

Corollary

Ifλ(m)areη(m)-mixable proper loss functions, m=1, . . . ,M, Learner has a strategy that guarantees, for all T , for all k and for all m, that

Loss(m)(T)≤Loss(_Em)

k (T) +

1

(10)

Tennis Predictions, Two Losses

Graphs of the negative regretLoss(_Em)

k (T)−Loss (m)₍_T₎ 0 2000 4000 6000 8000 10000 12000 −4 −2 0 2 4 6 8 10 12 14 16 Theoretical bounds Learner Bookmakers 0 2000 4000 6000 8000 10000 12000 −5 0 5 10 15 20 25 Theoretical bounds Learner Bookmakers

square loss log loss

(11)

Defensive Forecasting Algorithm

∃π∀ω K X k=1 p(_t₋k)₁eη(λ(π,ω)−λ(π(tk),ω))≤₁_,

wherep(_t₋k)₁=p₀(k)eη(Loss(t−1)−LossE_k(t−1))

To get this (from Levin’s Lemma) we need thatλ(π, ω)is continuous

and for allπ, π0

Eπeη(λ(π,·)−λ(π

0_,·₎₎

= X

ω∈Ω

(12)

Defensive Forecasting Algorithm

∃π∀ω K X k=1 p(_t₋k)₁eη(k)(λ(k)(π,ω)−λ(k)(πt(k),ω))≤1_,

wherep(_t₋k)₁=p₀(k)eη(k)(Loss(k)(t−1)−Loss

(k)

E_k(t−1))

To get this (from Levin’s Lemma) we need thatλ(k)(π, ω)is continuous

and for allπ, π0

Eπeη

(k)₍_λ(k)₍_π,·₎_−λ(k)₍_π0_,·₎₎

= X

ω∈Ω

(13)

The DFA and the AA

λis continuous and∀π, π0Eπeη(λ(π,·)−λ(π 0_,·₎₎ ≤1 ⇒ λisη-mixable ∀π, π0Eπeη(λ(π,·)−λ(π 0_,·₎₎ ≤1 ⇒ λis proper λisη-mixable and ? ⇒ ∀π, π0Eπeη(λ(π,·)−λ(π 0_,·₎₎ ≤1

(14)

The DFA and the AA

λis continuous and∀π, π0Eπeη(λ(π,·)−λ(π 0_,·₎₎ ≤1 ⇒ λisη-mixable ∀π, π0Eπeη(λ(π,·)−λ(π 0_,·₎₎ ≤1 ⇒ λis proper

λisη-mixable and proper ⇒ ∀π, π0Eπeη(λ(π,·)−λ(π

0_,·₎₎

(15)

Proper Loss Functions

λis proper if for anyπ, π0 ∈ P(Ω)

Eπλ(π,·)≤Eπλ(π0,·)

Ifω∼πthenEπλ(π0, ω)is the expected loss for predictionπ0.

The expected loss is minimal for the true distribution

⇒the forecaster is encouraged to give the true probabilities

(16)

Example: Hellinger Loss

λHellinger(γ, ω) = 1 2 r X j=1 p γ(j)−q_I_{ω₌_j_}2

The Hellinger loss is√2-mixable

(17)

Proper Mixable Loss Functions

Each mixable loss functionλ(γ, ω)has a proper analogueλproper(π, ω)

such that

1 ∀_π∃_γ∀_{ω λ}proper₍_{π, ω}_{) =}_λ₍_{γ, ω}₎ 2 ∀_π∀_γ E_π_λproper₍_π,·₎≤E_π_λ₍_γ,·₎

For the Hellinger loss, the proper analogue is the spherical loss λspherical(π, ω) =1−_q π(ω)

Pr

j=1(π(j))2 λspherical(π, ω) =λHellinger(γ, ω)forγ(ω) = P(rπ(ω))2

(18)

Proper Mixable Loss Functions

Each mixable loss functionλ(γ, ω)has a proper analogueλproper(π, ω)

such that

1 ∀_π∃_γ∀_{ω λ}proper₍_{π, ω}_{) =}_λ₍_{γ, ω}₎ 2 ∀_π∀_γ E_π_λproper₍_π,·₎≤E_π_λ₍_γ,·₎

For the Hellinger loss, the proper analogue is the spherical loss λspherical(π, ω) =1−_q π(ω)

Pr

j=1(π(j))2 λspherical(π, ω) =λHellinger(γ, ω)forγ(ω) = P(rπ(ω))2

(19)

Example: Mixable and Non-Mixable Losses

Experts 1, . . . ,K predictπ(k)∈ P({0,1}). Experts 1, . . . ,Npredictγ(n)∈ {0,1}.

Learner predicts(π,π˜)∈ P({0,1})× P({0,1})such that ifπ(0)>1/2 thenπ˜(0) =1 and ifπ(1)>1/2 then˜π(1) =1.

There exists a strategy for Learner that guarantees for anyk

T X t=1 λsquare(πt, ωt)≤ T X t=1 λsquare(π(_tk), ωt) +ln(K +N)

and for anym

T X t=1 λabs(˜πt, ωt)≤ T X t=1 λsimple(γ_t(n), ωt) +O( p Tln(K +N) +Tln lnT)

(20)

Tennis Predictions, Square and Absolute Losses

Graphs of the negative regretLoss(_Em)

k (T)−Loss (m)₍_T₎ 0 50 100 150 200 250 300 350 400 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 theoretical bounds DF experts 0 50 100 150 200 250 300 350 400 −2 −1.5 −1 −0.5 0 0.5 DF experts

square loss absolute loss

Learner optimizes for both loss functions, using the DF algorithm with “mixability” and Hoeffding supermartingales.

(21)

The Mixed Supermartingale

1 K +N K X k=1 e2PTt=−11((pt−ωt)2−(p (k) t −ωt)2)_×_e2((p−ω)2−(p (k) T −ω) 2₎ + 1 K +N N X n=1 Z 1/e 0 dη η ln1_η2 eηPTt=−11(|p˜t−ωt|−[γ(tn)6=ωt])−η2/2 ×eη(|˜p−ω|−[γT(n)6=ω])−η 2_/₂ wherept =πt(1),p(_tk)=π_t(k)(1),˜pt = ˜πt(1). [x 6=y] =1 ifx 6=y and[x 6=y] =0 ifx =y.

(22)

References

V. Vovk, F. Zhdanov. Predictions with Expert Advice for Brier Game.

ICML 2008. http://arxiv.org/abs/0710.0485

A. Chernov, Y. Kalnishkan, F. Zhdanov, V. Vovk. Supermartingales in Prediction with Expert Advice. ALT 2008 and TCS.

http://arxiv.org/abs/1003.2218

A. Chernov, V. Vovk. Prediction with expert evaluators’ advice.

ALT 2009. http://arxiv.org/abs/0902.4127

A. Chernov, V. Vovk. Prediction with Advice of Unknown Number of

Experts. UAI 2010. http://arxiv.org/abs/1006.0475