Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts?

Published online in Wiley InterScience

(www.interscience.wiley.com) DOI: 10.1002/qj.210

A. P. Weigel,* M. A. Liniger and C. Appenzeller

Federal Office of Meteorology and Climatology (MeteoSwiss), Zürich, Switzerland

ABSTRACT: The success of multi-model ensemble combination has been demonstrated in many studies. Given that a multi-model contains information from all participating models, including the less skilful ones, the question remains as to why, and under what conditions, a multi-model can outperform the best participating single model. It is the aim of this paper to resolve this apparent paradox.

The study is based on a synthetic forecast generator, allowing the generation of perfectly-calibrated single-model ensembles of any size and skill. Additionally, the degree of ensemble under-dispersion (or overconfidence) can be prescribed. Multi-model ensembles are then constructed from both weighted and unweighted averages of these single-model ensembles. Applying this toy model, we carry out systematic model-combination experiments. We evaluate how multi-model performance depends on the skill and overconfidence of the participating single models. It turns out that multi-model ensembles can indeed locally outperform a ‘best-model’ approach, but only if the single-model ensembles are overconfident. The reason is that multi-model combination reduces overconfidence, i.e. ensemble spread is widened while average ensemble-mean error is reduced. This implies a net gain in prediction skill, because probabilistic skill scores penalize overconfidence. Under these conditions, even the addition of an objectively-poor model can improve multi-model skill. It seems that simple ensemble inflation methods cannot yield the same skill improvement.

Using seasonal near-surface temperature forecasts from the DEMETER dataset, we show that the conclusions drawn from the toy-model experiments hold equally in a real multi-model ensemble prediction system. Copyright © 2008 Royal Meteorological Society

KEY WORDS DEMETER; inflation; probabilistic verification; seasonal predictions; toy model; under-dispersion

Received 6 April 2007; Revised 27 November 2007; Accepted 10 December 2007

1. Introduction

Weather and climate predictions are subject to many uncertainties and sources of forecast error. These can be grouped into two families:

• uncertainties in model initialization, for example due to incomplete data coverage, measurement errors, or inappropriate data-assimilation procedures;

• uncertainties and errors in the model itself, for example due to the parametrization of physical processes, the effect of unresolved scales, or imperfect boundary conditions (Buizza et al., 2005; Schwierz et al., 2006; Weigel et al., 2007a).

For dynamical models, the uncertainties in model initialization can be addressed by applying ensemble techniques, i.e. by repeatedly integrating the model forward in time from slightly-perturbed initial conditions (e.g. Kalnay, 2003), with the perturbations being designed so that they capture as much as possible of the underlying uncertainty. Sophisticated methods of ensemble generation have been developed and successfully implemented for operational numerical weather and short-range climate forecasts (e.g. Tracton and Kalnay, 1993; Toth et al., 1997; Molteni et al., 1996; Buizza, 1997; Pellerin et al., 2003), and they have found a wide range of application in weather and climate risk management.

* Correspondence to: A. P. Weigel, MeteoSwiss, Krähbühlstrasse 58, PO Box 514, CH-8044 Zürich, Switzerland.

E-mail: [email protected]

As far as model uncertainties are concerned, however, there is currently no theoretical concept that would provide accurate estimates of the corresponding probability distributions (Palmer, 2001). Three pragmatic approaches have been pursued to obtain at least a first crude estimate of the range of uncertainties induced by model error:

• the introduction of ‘stochastic physics’, i.e. the random perturbation of parametrized physical processes (Buizza et al., 1999);

• the ‘perturbed parameter’ approach, whereby a model is run several times with different settings of physical parameters (Pellerin et al., 2003);

• the combination of several ensemble prediction systems to form a multi-model super-ensemble (Krishnamurti et al., 1999; Palmer et al., 2004).

The latter technique is often referred to as the multi-model ensemble approach, and is the focus of this study.

Simple multi-model ensembles (MMEs) can be constructed by combining the individual ensemble forecasts with equal weights (Hagedorn et al., 2005). In more sophisticated approaches, the participating single-model ensembles (SMEs) are weighted according to their prior performance (e.g. Rajagopalan et al., 2002; Robertson et al., 2004; Doblas-Reyes et al., 2005; Stephenson et al., 2005). Regardless of which combination method has been applied, all these studies have shown that multi-model ensemble combination (MMEC) does increase prediction skill.

Hagedorn et al. (2005) have investigated the ‘rationale behind the success of multi-model ensembles’. Among other things, the authors ask how it is possible that MMEs on average outperform single models in their skill, given that they take the information from all participating single models, including the less skilful ones. They conclude that the success of MMEC is mainly due to error cancellation and the nonlinearity of the skill metrics applied, i.e. to the fact that multi-model skill generally differs from the average skill of the participating SMEs. But why should one not simply use the best participating SME alone, rather than considering a compound of several SMEs, including the poorer ones? Or, to turn the question around: how is it possible that the addition of poorer models can enhance prediction skill? Hagedorn et al. (2005) argue that this question is wrongly posed, since it is not usually possible to identify a ‘best’ or a ‘poorest’ model from a set of models, as their individual strengths and flaws typically vary with forecasting context (location, predictand, initialization time, etc.). Indeed, for a given specific forecast at a given location, the multi-model can be expected to be outperformed by some of the participating SMEs. It is only in the long run, i.e. averaged over a sufficient number of grid points and forecast realizations, that the multi-model would outperform any single-model strategy.

But what if we were able to identify, say, a ‘poor’ model, i.e. a model that consistently performs worse than average over the whole range of prediction aspects? Hagedorn et al. (2005) state that such a model could not contribute to prediction skill. In other words, if it were always known which model was best, MMEC would not be able to outperform a simple ‘best-model’ approach, i.e. a strategy that always selects the best SME available for any given forecast context.

It is the aim of the present study to revisit this question, and to demonstrate that there are conditions under which a multi-model consistently outperforms a best-model approach, and so really does enhance prediction skill – assuming that the ‘best’ model can be identified, and that the forecast is perfectly calibrated in a climatological sense. This appears to be a paradox, because it implies that the skill of an SME may be enhanced by adding a consistently-poorer model to the ensemble forecasts. We seek to resolve this apparent paradox, and to identify the underlying mechanisms, both by applying a simple synthetic Gaussian forecast ensemble generator (a ‘toy model’) and by evaluating a real seasonal MME prediction system.

The paper is structured as follows. Section 2 provides a detailed description of the Gaussian toy model, as well as a summary of the model combination methods and the verification measure applied. In Section 3, systematic toy-model experiments are presented, identifying the conditions under which MMEC really enhances prediction skill. In Section 4, the findings are substantiated with a real seasonal MME prediction system. Concluding remarks are given in Section 5.

2. Methods

2.1. The synthetic toy model

2.1.1. Formulation

The core of this study is based on a synthetic Gaussian generator of forecast–observation pairs. This toy model is designed in such a way that, for a given ‘observation’ $x$, it generates an $M$-member ensemble forecast

$$\mathbf{f}(x) = (f_1, f_2, \ldots, f_M),$$

fulfilling preset conditions with respect to forecast skill and ensemble properties. These conditions are controlled by two parameters, $\alpha$ and $\beta$, as described below. The toy model has the following properties:

1. Synthetic ‘observations’ $x$ are randomly sampled from a normally-distributed ‘climate’.

2. The corresponding $M$-member ensemble forecasts $\mathbf{f}(x)$ are generated in such a way that the forecasts have the same climatology as the verifying observations, i.e. the ‘model climatology’ is unbiased and perfectly calibrated.

3. The average correlation coefficient between the forecast ensemble members and the observations is prescribed by a model parameter $\alpha$.

4. Since in real operational prediction systems the ensembles are often overconfident (i.e. under-dispersive), meaning that ensemble spread is too narrow while being centred at the wrong value, a second parameter $\beta$ is introduced, prescribing the degree of ensemble overconfidence. A value of $\beta = 0$ corresponds to well-dispersed ensembles covering the entire range of uncertainties inherent to the predictions of a given correlation $\alpha$. As $\beta$ increases, ensemble under-dispersion increases.

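A ‘forecast generator’ fulfilling the above conditions is given by:

$$\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_M \end{pmatrix} = \alpha x + \epsilon_\beta + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_M \end{pmatrix}, \tag{1}$$

with $x \sim N(0, 1)$; $\epsilon_\beta \sim N(0, \beta)$; $\epsilon_1, \ldots, \epsilon_M \sim N(0, \sqrt{1-\alpha^2-\beta^2})$.

Here $\alpha^2 \le 1$ and $0 \le \beta \le \sqrt{1-\alpha^2}$; the notation $N(\mu, \sigma)$ refers to a random number drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$.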

For a given observation $x$ and control parameters $\alpha$ and $\beta$, an ensemble forecast $\mathbf{f}$ is generated by multiplying the value of $x$ by $\alpha$ and adding a vector of perturbations $\epsilon_1, \ldots, \epsilon_M$, as well as a scalar perturbation term $\epsilon_\beta$. The perturbations are randomly sampled from the normal distributions specified above. The expression $\alpha x + \epsilon_\beta$ determines the ‘centre’ of the ensemble distribution. Ensemble spread is controlled by $\sqrt{1-\alpha^2-\beta^2}$, which is the standard deviation of the parent distribution from which the perturbations $\epsilon_1, \ldots, \epsilon_M$ are sampled.
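The generation procedure of Equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function names `toy_ensemble` and `sample_pairs`, the seed handling and the small numerical tolerance are all assumptions made here.

```python
import math
import random


def toy_ensemble(x, alpha, beta, m, rng):
    """Return an m-member ensemble forecast for observation x (Equation (1))."""
    spread_var = 1.0 - alpha**2 - beta**2
    if spread_var < -1e-12:  # tolerance for rounding error (assumption)
        raise ValueError("need alpha^2 + beta^2 <= 1")
    sigma = math.sqrt(max(spread_var, 0.0))
    # Scalar location error shared by all members (overconfidence term).
    eps_beta = rng.gauss(0.0, beta)
    centre = alpha * x + eps_beta
    # Independent member perturbations around the common centre.
    return [centre + rng.gauss(0.0, sigma) for _ in range(m)]


def sample_pairs(n, alpha, beta, m, seed=0):
    """Draw n observations from N(0, 1) and one ensemble forecast for each."""
    rng = random.Random(seed)
    obs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    fcs = [toy_ensemble(x, alpha, beta, m, rng) for x in obs]
    return obs, fcs
```

Calling, for example, `sample_pairs(1000, 0.65, 0.5, 20)` would return 1000 observations together with 1000 overconfident 20-member ensembles.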

It is trivial that property 1 is fulfilled: the observations $x$ are, by construction, sampled from a standardized normal distribution. The fulfillment of conditions 2 and 3 is shown in Appendix A. This leaves property 4, i.e. the meaning of the overconfidence parameter $\beta$.
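As a quick plausibility check of properties 2 and 3 (a sketch only, not a substitute for Appendix A), assume that $x$, $\epsilon_\beta$ and the member perturbations $\epsilon_m$ are mutually independent. For any member $f_m = \alpha x + \epsilon_\beta + \epsilon_m$, it then follows that:

$$\mathrm{E}(f_m) = 0, \qquad \mathrm{Var}(f_m) = \alpha^2 + \beta^2 + (1-\alpha^2-\beta^2) = 1, \qquad \mathrm{Cov}(f_m, x) = \alpha\,\mathrm{Var}(x) = \alpha.$$

Each member thus has the same mean and variance as the observations, and, since both variances equal 1, its correlation with the observations is $\alpha$.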

We begin by considering the case $\beta = 0$. For the toy model of Equation (1), this implies that $\epsilon_\beta = 0$ and that the forecast ensembles are sampled from a normal distribution centred at $\alpha x$ and with standard deviation $\sigma = \sqrt{1-\alpha^2}$. Such a situation is illustrated in Figure 1(a–c). For a given observation $x$ and correlation coefficient $\alpha$, the shaded curves show the distributions from which the forecast ensembles are sampled (henceforth referred to as ‘ensemble parent distributions’ (EPDs)). If $\alpha = 0$ (Figure 1(a)), i.e. in the case of zero correlation between forecasts and observations, the EPD is identical to climatology. As $\alpha$ increases (Figure 1(b–c)), the centre of the EPD shifts toward the value of the observation, while its spread becomes narrower. In the case of perfect correlation ($\alpha = 1$, not shown), the EPD would be given by a delta function peaking at $x$. Note that, for given $\alpha$ and $x$, the EPD is uniquely determined, i.e. there is no uncertainty in its shape or location. Consequently, for $\beta = 0$, the EPDs quantify the full range of uncertainty inherent to ensemble predictions for a given correlation $\alpha$. Ensembles that are randomly sampled from these EPDs can therefore be regarded as perfect (i.e. well-dispersed) ensembles. Note that $\alpha$ is often interpreted as a measure of potential predictability (Kharin and Zwiers, 2003).

Now consider the case of non-zero $\beta$. For given $x$ and $\alpha$, the EPDs (standard deviation $\sigma = \sqrt{1-\alpha^2-\beta^2}$) become sharper as $\beta$ increases. At the same time, there is now some uncertainty in the location of the EPD. The range of this uncertainty is controlled by $\beta$, and grows as $\beta$ increases. Thus, for a given $\alpha$, the ensemble spread is too small and no longer represents the full amount of uncertainty inherent to the predictions: the ensembles are overconfident. The effect of the overconfidence parameter $\beta$ is illustrated in Figure 1(d–f). It is important to stress again that the two aspects characterizing overconfidence in this toy model – too-narrow ensemble spread, and random location errors in the ensemble mean – are both controlled by the same parameter $\beta$, and are interdependent. Because of the requirement of a well-calibrated model climatology, it is not possible to generate forecast ensembles that are always at the ‘right’ location but have the wrong ensemble spread. Equally, ensembles cannot be well-dispersed while being randomly displaced.

[Figure 1 graphic: six panels (a)–(f); horizontal axis: observable; vertical axis: scaled probability density. Panels (a)–(c): $\beta = 0$ with $\alpha = 0$, $0.3$ and $0.65$, EPD standard deviation $\sigma = \sqrt{1-\alpha^2}$. Panel (d): $\alpha = 0.65$, $\beta = 0$. Panels (e) and (f): $\alpha = 0.65$, $\beta = 0.7$, EPD centred at $\alpha x + \epsilon_\beta$ with $\sigma = \sqrt{1-\alpha^2-\beta^2}$.]

Figure 1. Illustration of the toy model. Upper panels: the effect of enhancing the correlation between the toy-model ensemble forecasts and the observations (parameter $\alpha$). The shaded curves are ensemble parent distributions (EPDs) for a given observation $x$ in the case of well-dispersed (i.e. $\beta = 0$) ensemble forecasts, with (a) $\alpha = 0$, (b) $\alpha = 0.3$ and (c) $\alpha = 0.65$. As $\alpha$ increases, the EPDs become narrower, and their centre shifts toward $x$. Lower panels: the effect of adding overconfidence (parameter $\beta$). Panel (d) is the same as (c). Panels (e) and (f) display two realizations of highly overconfident ensemble forecasts with the same $\alpha$ as in (d), but with $\beta = 0.7$. The overconfident EPDs have reduced ensemble spread, but some uncertainty in their location, imposed by the randomly-sampled perturbation term $\epsilon_\beta$. The solid black line is the

Note also that, for a given correlation $\alpha$, the ensemble spread cannot be increased beyond the ‘perfect’ value of $\sqrt{1-\alpha^2}$ without violating the constraint of a perfect model climatology. This means that over-dispersive (i.e. underconfident) forecast ensembles are not possible with this toy model.

2.1.2. Realism

Obviously, this toy model is based on very simplifying assumptions, and cannot represent the complexity that is characteristic of real ensemble predictions. When evaluating and discussing results obtained from it, it is important to be aware of its limitations with respect to reality. The main idealizations are discussed below.

Normality. The climatology and the ensemble distributions are assumed to be normally distributed. The toy model does not allow skewed distributions, which are typical for precipitation, for example; nor does it allow bimodal or multimodal distributions, which are also commonly found (Wilks, 2002), and whose detection actually represents one of the main hopes of ensemble forecasting. These limitations will not be evaluated here and are beyond the scope of this paper. However, since skewed unimodal distributions can easily be normalized – for example, via Box–Cox transformations, as applied by Tippett et al. (2007) – we believe that the toy-model results presented here can, at least in principle, be generalized to skewed unimodal distributions.

Stationary climatology. Both the observed and the model climatology are stationary. In reality, however, climatology reveals fluctuations and trends on many time-scales (e.g. seasonal cycle, global warming trend), allowing the occurrence of record events that would have been considered impossible or highly unlikely in the past – such as the 2003 heatwave in Europe, as discussed by Schär et al. (2004). It is therefore a limitation of our toy model that it is not able to generate and ‘predict’ events that are inconsistent with the prescribed, i.e. ‘observed’, climatology. On the other hand, an appropriate verification of forecasts and observations sampled from a non-stationary climatology is highly problematic. Indeed, Hamill and Juras (2006) have shown that non-stationarities can produce ‘false’ prediction skill when there is actually no skill. Since a robust verification is essential in the present study, we only consider a stationary climatology here.

Well-calibrated model climatology. The toy-model climatology is well-calibrated, in the sense that each ensemble member has the same climatology as the observations. This is in contrast to real prediction systems, which usually reveal systematic biases in mean and variance. Often it is not possible to determine a model climatology at all, because there are not enough independent historical forecast samples. The impact of these uncertainties on the success of MMEC is therefore not covered in this study. Note that, thanks to the well-calibrated model climatology, the ensemble means are distributed as $N(0, \sqrt{\alpha^2+\beta^2})$, so that they have a lower variance than the observations, and for small $\alpha$ and $\beta$ are very unlikely to be located on the tails of the climatology. Consequently, in deterministic forecast strategies that are based on ensemble-mean predictions, it might actually be desirable to have a model climatology that is extended beyond the known climatology, so that the ensemble means cover the true climatology and can also ‘predict’ climatologically-rare events.

Stationary skill. As will be described in Section 3, the parameters $\alpha$ and $\beta$ are kept constant for all samples that feed into one verification experiment. In other words, spread and correlation do not vary from sample to sample, nor do they depend on the value of $x$. This contrasts with reality, where spread does vary from case to case and where prediction skill is often found to be conditioned on the magnitude of the anomaly observed. Seasonal predictability in the Pacific region, for example, depends strongly on the strength of the ENSO forcing (Shukla et al., 2000). However, we do not think that this idealization has an impact on the general conclusions to be drawn in this study, since, at least in principle, any sufficiently-large set of verification samples can be stratified into subsets of constant skill and spread.

Predictable signal and observational errors. The central tendency of a well-dispersed ensemble distribution is often considered as the potentially-predictable signal $\mu$ (Kharin and Zwiers, 2003). It is a major simplifying assumption of the present toy model that it requires the signal to be given by $\alpha x$, and thus to be ‘causally’ determined by the verifying observation. In a more advanced setting, this simplification can be avoided by first sampling $\mu$ and then constructing a forecast–observation pair from $\mu$: forecast ensembles would be constructed by analogy with Equation (1), i.e. $f_n = \mu + \epsilon_\beta + \epsilon_n$, while a verifying observation would be obtained by adding an unpredictable observational-noise term $\epsilon_x$ to $\mu$, i.e. $x = \mu + \epsilon_x$ with $\epsilon_x \sim N(0, \sqrt{1-\alpha^2})$. By this approach, which in essence is equivalent to the statistical model presented by Kharin and Zwiers (2003), observational errors can also be accounted for. As will be pointed out later, this signal-based model leads to conclusions that are qualitatively equivalent to those of the simpler toy model applied in the present study.

2.2. Combination

Using the toy model, we can issue probabilistic categorical ensemble forecasts with respect to predefined forecast categories, by taking the proportion of ensemble members falling into the respective forecast categories. In the following, we explain how we combine such single-model ensemble forecasts to form a multi-model.

Consider $N$ independent ensemble prediction systems, with ensemble sizes $M_1, \ldots, M_N$, which are to be combined. Assume that probabilistic forecasts are issued with respect to $K$ forecast categories. Let $p_k$ be the climatological probability of the event falling into category $k$, with $k \in \{1, \ldots, K\}$. Further, let $m_{n,k}$ denote the number of ensemble members of the $n$th model that forecast the $k$th category; thus

$$\sum_{k=1}^{K} m_{n,k} = M_n.$$

Finally, let $m_{n,k}/M_n$ be the corresponding probability forecast issued by model $n$ for category $k$. Two methods of combining such SME forecasts to an MME forecast $\mathbf{y} = (y_1, \ldots, y_K)$ are examined in this study. These methods are henceforth referred to as POOL and IGN. Both methods are well established and applied operationally: method POOL, for example, in the European Multimodel Seasonal to Interannual Prediction System (EUROSIP) (Vitart et al., 2007), and method IGN at the International Research Institute for Climate and Society (IRI) (Barnston et al., 2003).

2.2.1. Method POOL

In method POOL, MME forecasts are generated by simply pooling together the participating SMEs, with all ensemble members having equal weight (e.g. Hagedorn et al., 2005). The probabilistic multi-model forecast for the event to fall into the $k$th category, $y_k^{\mathrm{POOL}}$, is then the proportion of all ensemble members of all participating SMEs that predict the $k$th category:

$$y_k^{\mathrm{POOL}} = \frac{\sum_{n=1}^{N} m_{n,k}}{\sum_{n=1}^{N} M_n}. \tag{2}$$

2.2.2. Method IGN

The second method is a more sophisticated approach, in that the participating SMEs are weighted according to their prior performance, and the climatological forecast $(p_1, \ldots, p_K)$ is included as an additional ‘model’. Using this method, weighted probabilistic multi-model forecasts for the $k$th category, $y_k^{\mathrm{IGN}}$, are formally constructed by:

$$y_k^{\mathrm{IGN}} = w_0 p_k + \sum_{n=1}^{N} w_n \frac{m_{n,k}}{M_n}. \tag{3}$$

Here $w_0, \ldots, w_N$ are predetermined weights with $\sum_{n=0}^{N} w_n = 1$ and with $w_n \in [0, 1]$. Rajagopalan et al. (2002) have derived a conceptually Bayesian justification for this approach, with climatology being the prior.
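The two combination rules, Equations (2) and (3), can be sketched as follows. This is an illustrative sketch only: the function names `pool_forecast` and `weighted_forecast` and the layout of `counts` (one list of category counts $m_{n,k}$ per model) are assumptions made here, and the weights are taken as given rather than optimized.

```python
def pool_forecast(counts):
    """Equation (2): counts[n][k] is the number of members of model n in category k."""
    n_cat = len(counts[0])
    total_members = sum(sum(c) for c in counts)
    return [sum(c[k] for c in counts) / total_members for k in range(n_cat)]


def weighted_forecast(counts, weights, clim):
    """Equation (3): weights[0] applies to climatology, weights[n] to model n."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    probs = []
    for k in range(len(clim)):
        y = weights[0] * clim[k]
        for w, c in zip(weights[1:], counts):
            y += w * c[k] / sum(c)  # m_{n,k} / M_n
        probs.append(y)
    return probs
```

For two hypothetical models with counts `[[10, 6, 4], [2, 3, 5]]` (ensemble sizes 20 and 10), `pool_forecast` weights each member equally, so the larger ensemble dominates, whereas `weighted_forecast` first converts each model to probabilities and then applies the model weights.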

To determine a set of optimum weights, we follow their suggestion and maximize the posterior likelihood function $L(w_0, \ldots, w_N)$ defined over a common record of $T$ multi-model forecasts and verifying observations. The function to be maximized is given by:

$$L(w_0, w_1, \ldots, w_N) = \prod_{t=1}^{T} y_{k^*(t)}^{\mathrm{IGN}}(t), \tag{4}$$

with $k^*(t)$ representing the category of the verifying observation at time $t$. Rajagopalan et al. (2002) and Robertson et al. (2004) have shown that this Bayesian methodology yields multi-model forecasts that generally outperform equal-weight multi-models constructed with the POOL method.

As an optimization algorithm, we here apply the method of Byrd et al. (1995), a quasi-Newton method that allows box constraints in order to fulfil the conditions $\sum_{n=0}^{N} w_n = 1$ and $w_n \in [0, 1]$.
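To make the weight estimation concrete, the sketch below searches for the weights that maximize the likelihood of Equation (4) by minimizing its mean negative base-2 logarithm. Note that it deliberately swaps the quasi-Newton method of Byrd et al. (1995) for a brute-force grid search over the weight simplex, restricted to two models, purely to keep the example dependency-free; the function names and the data layout (`model_probs[t][n][k]`) are assumptions made here.

```python
import math


def ignorance(weights, clim, model_probs, obs_cat):
    """Mean negative log2 probability assigned to the observed category."""
    total = 0.0
    for probs_t, k_star in zip(model_probs, obs_cat):
        y = weights[0] * clim[k_star]
        for w, p in zip(weights[1:], probs_t):
            y += w * p[k_star]
        if y <= 0.0:
            return math.inf  # zero probability on the outcome: infinite penalty
        total -= math.log2(y)
    return total / len(obs_cat)


def grid_search_weights(clim, model_probs, obs_cat, steps=20):
    """Brute-force search over (w0, w1, w2) with w0 + w1 + w2 = 1 (two models)."""
    best_w, best_ign = None, math.inf
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w = (i / steps, j / steps, (steps - i - j) / steps)
            ign = ignorance(w, clim, model_probs, obs_cat)
            if ign < best_ign:
                best_w, best_ign = w, ign
    return best_w, best_ign
```

The grid resolution `steps` limits the precision of the weights, so in practice a constrained numerical optimizer would be preferable to this search.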

The reason for the name IGN is that the negative base-2 logarithm of Equation (4), divided by the record length $T$, is equivalent to the average ignorance score IGN, as introduced by Roulston and Smith (2002):

$$-\frac{1}{T}\log_2 L = -\frac{1}{T}\sum_{t=1}^{T} \log_2 y_{k^*(t)}^{\mathrm{IGN}}(t) = \mathrm{IGN}. \tag{5}$$

This means that maximizing the quantity in Equation (4) is equivalent to minimizing the ignorance of the multi-model forecasting system. The weights are determined so as to minimize the average information deficit (‘ignorance’) of a user who is in possession of the multi-model forecasts but does not know the true outcome.

2.3. Verification

Our verification is based on a modified version of the widely-used ‘ranked probability skill score’ (RPSS) (Epstein, 1969; Murphy, 1969; Murphy, 1971). The classical RPSS is a squared measure comparing the cumulative probabilities of categorical forecast and observation vectors relative to a climatological forecast strategy. It is defined by:

$$\mathrm{RPSS} = 1 - \frac{\langle \mathrm{RPS} \rangle}{\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle}, \tag{6}$$

where

$$\mathrm{RPS} = \sum_{k=1}^{K} (Y_k - O_k)^2 \quad \text{and} \quad \mathrm{RPS}_{\mathrm{Cl}} = \sum_{k=1}^{K} (P_k - O_k)^2.$$

The angle brackets $\langle \cdot \rangle$ denote the average of the RPS and $\mathrm{RPS}_{\mathrm{Cl}}$ values over a given number of forecast–observation pairs; $Y_k$ is the $k$th component of a cumulative categorical forecast vector $\mathbf{Y}$ (obtained from either an SME or an MME), and $O_k$ is the $k$th component of the corresponding cumulative observation vector $\mathbf{O}$ against which the forecast is verified. That is, $Y_k = \sum_{i=1}^{k} y_i$, with $y_i$ being the probabilistic forecast for the event to fall in category $i$; and $O_k = \sum_{i=1}^{k} o_i$, with $o_i = 1$ if the observation is in category $i$ and $o_i = 0$ if the observation is in a category $j \ne i$. Analogously, $P_k$ is the cumulative climatological probability of the $k$th category. A more detailed description of the RPSS is provided in Wilks (2006). The RPSS is a favourable probabilistic skill score, in that it is sensitive to distance, i.e. a forecast is increasingly penalized the more its cumulative probability differs from the actual outcome. Moreover, the RPSS is strictly proper, meaning that it cannot be optimized by hedging the probabilistic forecasts toward other values against the forecaster’s true belief. A major caveat of the RPSS is its strong negative bias for small ensemble sizes (e.g. Buizza and Palmer, 1998; Richardson, 2001; Kumar et al., 2001; Mason, 2004). The reason for this bias is the intrinsic unreliability (Weigel et al., 2007b) of small ensembles, leading to inconsistencies in the formulation of the RPSS. However, particularly when the performance of multi-model prediction systems is to be assessed, it is important to know whether increases in prediction skill are due to a true gain in potentially-usable information, or whether they are simply an artefact of a negative bias that decreases as the ensemble grows. In other words, a ‘bias-free’ skill score – i.e. one that is insensitive to ensemble size – is required.
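The classical RPSS of Equation (6) can be sketched directly from the definitions of $Y_k$, $O_k$ and $P_k$. The function names `rps` and `rpss` and the use of 0-based category indices are assumptions made for this illustration.

```python
def rps(forecast_probs, obs_cat):
    """Ranked probability score of one forecast; obs_cat is a 0-based category index."""
    score, cum_forecast = 0.0, 0.0
    for k, y in enumerate(forecast_probs):
        cum_forecast += y                          # Y_k
        cum_obs = 1.0 if k >= obs_cat else 0.0     # O_k
        score += (cum_forecast - cum_obs) ** 2
    return score


def rpss(forecasts, obs_cats, clim):
    """Equation (6): skill relative to always issuing the climatological forecast."""
    n = len(obs_cats)
    mean_rps = sum(rps(f, o) for f, o in zip(forecasts, obs_cats)) / n
    mean_rps_cl = sum(rps(clim, o) for o in obs_cats) / n
    return 1.0 - mean_rps / mean_rps_cl
```

The sensitivity to distance is visible in a small case: for the forecast `[0.6, 0.3, 0.1]`, `rps` gives 0.4² + 0.1² = 0.17 if the first category verifies, but 0.6² + 0.9² = 1.17 if the third category verifies.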

Müller et al. (2005) and Weigel et al. (2007b, 2007c) have derived a ‘debiased’ version of the RPSS, the so-called $\mathrm{RPSS}_D$, which lacks the RPSS’s strong dependence on ensemble size while retaining its favourable properties, in particular strict propriety, making it the skill score of choice for the present study. If $K$ equiprobable forecast categories are considered (as is the case in this study), the $\mathrm{RPSS}_D$ assumes a relatively simple analytic form, given by:

$$\mathrm{RPSS}_D = 1 - \frac{\langle \mathrm{RPS} \rangle}{\langle \mathrm{RPS}_{\mathrm{Cl}} \rangle + D_0/M_{\mathrm{eff}}}, \tag{7}$$

where

$$D_0 = \frac{K^2-1}{6K},$$

and $M_{\mathrm{eff}}$ is the ‘effective ensemble size’ of the prediction system. If only a single model of ensemble size $M$ is considered, then $M_{\mathrm{eff}} = M$. If $N$ models with ensemble sizes $M_1, \ldots, M_N$ are combined, with all ensemble members having equal weight (method POOL), then $M_{\mathrm{eff}}$ is given by $M_{\mathrm{eff}}^{\mathrm{POOL}} = \sum_{n=1}^{N} M_n$. For weighted MMEs constructed by method IGN, Weigel et al. (2007c) have shown that:

$$M_{\mathrm{eff}}^{\mathrm{IGN}} = \left( \sum_{n=1}^{N} \frac{w_n^2}{M_n} \right)^{-1}.$$

3. Toy-model experiments

3.1. Methodology

The toy model described in Section 2.1 is now applied in order to study systematically how multi-model skill depends on correlation, overconfidence and the combination method applied. For all our experiments, the procedure applied is as follows. First, 100 000 ‘observations’ are randomly sampled from the normally-distributed climatology. Then, for each of these randomly-sampled observations, the toy model is applied with a given correlation coefficient $\alpha_1$ and overconfidence parameter $\beta_1$, generating 100 000 corresponding forecast ensembles. The ensemble size of the toy-model forecasts has been (arbitrarily) set to 20. The ensemble forecasts obtained are binned into three equiprobable, mutually-exclusive and exhaustive forecast categories. Probability forecasts for each of the three categories are calculated by taking the proportion of ensemble members falling into the respective bins. This procedure is then repeated with (different or equal) toy-model parameters $\alpha_2$ and $\beta_2$, yielding another set of 100 000 probabilistic forecasts. From these two sets of SME forecasts, MME forecasts are finally constructed by applying methods POOL and IGN. The high sample size used in these experiments makes an analysis of statistical significance unnecessary. In reality, of course, verification sets are much smaller, introducing additional uncertainty (see also the discussion in Section 4.2).
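The experimental procedure can be sketched end to end as follows, restricted to method POOL and scored with the debiased skill score of Equation (7). This is a hedged illustration, not the original experiment code: all function names are hypothetical, the default sample size is reduced to 20 000 to keep the pure-Python run short, and the seed is arbitrary.

```python
import random
from statistics import NormalDist

K = 3  # number of equiprobable categories
# Tercile boundaries of the standard normal climatology.
BOUNDS = [NormalDist().inv_cdf(k / K) for k in range(1, K)]
D0 = (K**2 - 1) / (6 * K)


def category(value):
    """0-based index of the climatological category containing value."""
    return sum(value > b for b in BOUNDS)


def toy_counts(x, alpha, beta, m, rng):
    """Category counts of one m-member toy-model ensemble (Equation (1))."""
    sigma = max(0.0, 1.0 - alpha**2 - beta**2) ** 0.5
    centre = alpha * x + rng.gauss(0.0, beta)
    counts = [0] * K
    for _ in range(m):
        counts[category(centre + rng.gauss(0.0, sigma))] += 1
    return counts


def rps(probs, obs_cat):
    score, cum = 0.0, 0.0
    for k, y in enumerate(probs):
        cum += y
        score += (cum - (1.0 if k >= obs_cat else 0.0)) ** 2
    return score


def rpss_d(forecasts, obs_cats, m_eff):
    """Debiased RPSS for K equiprobable categories (Equation (7))."""
    n = len(obs_cats)
    clim = [1.0 / K] * K
    mean_rps = sum(rps(f, o) for f, o in zip(forecasts, obs_cats)) / n
    mean_cl = sum(rps(clim, o) for o in obs_cats) / n
    return 1.0 - mean_rps / (mean_cl + D0 / m_eff)


def experiment(alpha1, beta1, alpha2, beta2, n=20000, m=20, seed=1):
    rng = random.Random(seed)
    obs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    obs_cats = [category(x) for x in obs]
    c1 = [toy_counts(x, alpha1, beta1, m, rng) for x in obs]
    c2 = [toy_counts(x, alpha2, beta2, m, rng) for x in obs]
    p1 = [[c / m for c in row] for row in c1]
    p2 = [[c / m for c in row] for row in c2]
    # Method POOL: all 2m members carry equal weight, so M_eff = 2m.
    pool = [[(a + b) / (2 * m) for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
    return {
        "model1": rpss_d(p1, obs_cats, m),
        "model2": rpss_d(p2, obs_cats, m),
        "pool": rpss_d(pool, obs_cats, 2 * m),
    }
```

A call such as `experiment(0.5, 0.0, 0.8, 0.0)` returns the three skill values for one parameter combination; repeating it over a grid of parameters would give the kind of systematic comparison described above.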

3.2. $\mathrm{RPSS}_D$ as a function of $\alpha$ and $\beta$

Before looking at multi-models, we evaluate how the $\mathrm{RPSS}_D$ of a single toy model depends on correlation and overconfidence. Being a probabilistic skill score, the $\mathrm{RPSS}_D$ measures the shape and location of the ensemble distribution. This means that, for a given correlation coefficient $\alpha$, the $\mathrm{RPSS}_D$ should favour a forecast that correctly estimates the underlying uncertainties, rather than one that has a sharp spread but is located at the wrong place: in other words, the $\mathrm{RPSS}_D$ should penalize overconfidence, with skill decreasing as $\beta$ grows. Meanwhile, for a given value of $\beta$, $\mathrm{RPSS}_D$ should increase as $\alpha$ increases: increasing the correlation coefficient between forecasts and observations should also increase prediction skill.

Figure 2 shows the degree to which these favourable characteristics hold for the $\mathrm{RPSS}_D$. Toy-model SME forecasts have been used to calculate the $\mathrm{RPSS}_D$ as a function of $\alpha$ for several values of $\beta$. Note that as $\beta$ grows, the range of possible $\alpha$ values decreases, owing to the condition $0 \le \beta \le \sqrt{1-\alpha^2}$. The figure shows that the $\mathrm{RPSS}_D$ does generally increase as $\alpha$ increases and $\beta$ decreases. However, inconsistencies can arise for overconfident ensembles if the ensemble spread is extremely small or even zero, i.e. if $\sqrt{1-\alpha^2-\beta^2} \approx 0$ (open circles in Figure 2). Under these conditions, skill can drop despite growing correlation. Note that the same behaviour is observed if the classical RPSS rather than the debiased $\mathrm{RPSS}_D$ is applied (not shown here). A further evaluation of this behaviour is left to future research. In our analyses, we avoid these extreme cases of vanishing ensemble spread and do not consider the parameter combinations indicated as open circles in Figure 2.

3.3. Multi-models constructed from well-dispersed ensembles

In a first set of multi-model experiments, we combine forecasts that are generated with $\beta = 0$, and are therefore well-dispersed. Probabilistic SME forecasts from two ‘models’ (henceforth referred to as Model 1 and Model 2) with parameters $\alpha_1$ and $\alpha_2$ are generated, with $\alpha_1, \alpha_2 \in \{-0.5, -0.4, \ldots, 1\}$. As described above, these two models are then combined to form MMEs in such a way that all possible combinations of $\alpha_1$ and $\alpha_2$ are considered, using methods POOL and IGN. The skill of the probabilistic SME and MME forecasts is visualized on ‘skill matrices’ (e.g. Figure 3), which display the $\mathrm{RPSS}_D$ as a function of $\alpha_1$ and $\alpha_2$, with $\alpha_1$ varying along the horizontal axis and $\alpha_2$ along the vertical axis. We introduce the following notation.

[Figure 2 graphic: $\mathrm{RPSS}_D$ (vertical axis) against $\alpha$ (horizontal axis), with curves for $\beta = 0$, $0.5$, $0.7$ and $0.8$.]

Figure 2. $\mathrm{RPSS}_D$ of the toy model as a function of potential predictability (correlation coefficient) $\alpha$ and overconfidence $\beta$. The open circles indicate parameter combinations that are excluded from the following analysis (see text in Section 3.2 for explanation).

• $\mathrm{MOD}_1^{\beta=0}(\alpha_1, \alpha_2) = \mathrm{MOD}_1^{\beta=0}(\alpha_1)$ is the skill matrix of probabilistic SME forecasts obtained from Model 1. It is independent of $\alpha_2$ by construction.

• $\mathrm{MOD}_2^{\beta=0}(\alpha_1, \alpha_2) = \mathrm{MOD}_2^{\beta=0}(\alpha_2)$ is the skill matrix of probabilistic SME forecasts obtained from Model 2. It is independent of $\alpha_1$ by construction.

• $\mathrm{BEST}^{\beta=0}(\alpha_1, \alpha_2) = \max[\mathrm{MOD}_1^{\beta=0}(\alpha_1), \mathrm{MOD}_2^{\beta=0}(\alpha_2)]$ is the matrix that, for given $\alpha_1$ and $\alpha_2$, selects the better of the two participating SME prediction systems.

• $\mathrm{MEAN}^{\beta=0}(\alpha_1, \alpha_2) = \frac{1}{2}[\mathrm{MOD}_1^{\beta=0}(\alpha_1) + \mathrm{MOD}_2^{\beta=0}(\alpha_2)]$ is the matrix that, for given $\alpha_1$ and $\alpha_2$, represents the average skill of Models 1 and 2.

• $\mathrm{POOL}^{\beta=0}(\alpha_1, \alpha_2)$ is the skill matrix for MME ensemble forecasts constructed from Models 1 and 2 by method POOL.

• $\mathrm{IGN}^{\beta=0}(\alpha_1, \alpha_2)$ is the skill matrix for MME ensemble forecasts constructed from Models 1 and 2 by method IGN.
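The BEST and MEAN matrices defined above follow mechanically from the single-model skill values, as the short sketch below illustrates. The function name `skill_matrices` is an assumption, as is the simplification that both models share one skill curve, and the skill values used in the usage note are invented placeholders, not results from this study.

```python
def skill_matrices(skill_by_alpha):
    """Build BEST and MEAN matrices from single-model skill values.

    skill_by_alpha maps each alpha on the grid to a single-model skill value.
    Rows follow alpha_2 and columns follow alpha_1.
    """
    alphas = sorted(skill_by_alpha)
    best = [[max(skill_by_alpha[a1], skill_by_alpha[a2]) for a1 in alphas]
            for a2 in alphas]
    mean = [[0.5 * (skill_by_alpha[a1] + skill_by_alpha[a2]) for a1 in alphas]
            for a2 in alphas]
    return alphas, best, mean
```

With the made-up input `{0.0: 0.0, 0.5: 0.1, 0.8: 0.4}`, the entry for $\alpha_1 = 0.8$, $\alpha_2 = 0$ is 0.4 in BEST and 0.2 in MEAN, and the two matrices coincide on the diagonal.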

The skill matrices $\mathrm{MOD}_1^{\beta=0}$, $\mathrm{MOD}_2^{\beta=0}$ and $\mathrm{POOL}^{\beta=0}$ are shown in Figures 3(a), 3(b) and 4, respectively. The last of these reveals a plausible structure, in that the skill is highest when both $\alpha_1$ and $\alpha_2$ are close to 1, and lowest when neither of the participating SMEs has skill.

We are interested firstly in whether there are combinations of α1 and α2 such that the multi-model has higher skill than any of the participating SMEs alone. We therefore calculate, for each combination of α1 and α2, the difference between POOL^{β=0} and BEST^{β=0}. The results are displayed in Figure 5(a). The matrix is zero along the diagonal (where α1=α2) and negative elsewhere. Thus there are no combinations of α1 and α2 that would enhance MME prediction skill beyond the maximum of the two participating SMEs. This result is also evident in Figure 5(b), where all matrix elements of POOL^{β=0} are plotted against the corresponding elements of BEST^{β=0}. The points consistently lie on or underneath the bisecting line, indicating that POOL^{β=0} ≤ BEST^{β=0} always.

Figure 3. Toy-model experiments with well-dispersed model ensembles: (a) MOD_1^{β=0}, the skill matrix of Model 1; (b) MOD_2^{β=0}, the skill matrix of Model 2. The size of the dots is proportional to the skill (RPSS_D), as shown in the figure legend.

Figure 4. Toy-model experiments with well-dispersed model ensembles: POOL^{β=0}, the skill matrix of the multi-model constructed with method POOL. The size of the dots is proportional to the skill (RPSS_D), as shown in the figure legend.

It is interesting to note that for any combination of α1 and α2, MME skill is greater than or equal to the mean skill of the two participating SMEs, i.e. POOL^{β=0}(α1, α2) ≥ MEAN^{β=0}(α1, α2). This is evident from Figure 6, which shows the difference between POOL^{β=0} and MEAN^{β=0}. Thus, considering the skill average over all combination experiments carried out and displayed on the matrices, POOL MMEs on average outperform both Model 1 and Model 2, even though there is not a single parameter combination (α1, α2) where the multi-model would be better than both participating SMEs. So the frequently-reported observation that MMEs on average outperform SMEs, regardless of whether temporal or spatial skill averages are considered, does not imply that MMEC ‘really’ enhances prediction skill, i.e. that the multi-model is also best for the individual forecasts feeding into the verification. The skill gain may simply be an averaging effect due to the nonlinearity of the skill metric applied (Hagedorn et al., 2005).
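The averaging argument can be written out in one line. The angle brackets are notation introduced here for illustration only; they denote the average over all matrix elements, i.e. over all combinations (α1, α2) drawn from the same set of values:

```latex
\mathrm{MEAN}^{\beta=0}(\alpha_1,\alpha_2)
  \;\le\; \mathrm{POOL}^{\beta=0}(\alpha_1,\alpha_2)
  \;\le\; \mathrm{BEST}^{\beta=0}(\alpha_1,\alpha_2)
  \quad \text{for all } (\alpha_1,\alpha_2)
\;\;\Longrightarrow\;\;
\bigl\langle \mathrm{POOL}^{\beta=0} \bigr\rangle
  \;\ge\; \bigl\langle \mathrm{MEAN}^{\beta=0} \bigr\rangle
  \;=\; \tfrac{1}{2}\Bigl( \bigl\langle \mathrm{MOD}_1^{\beta=0} \bigr\rangle
        + \bigl\langle \mathrm{MOD}_2^{\beta=0} \bigr\rangle \Bigr)
  \;=\; \bigl\langle \mathrm{MOD}_1^{\beta=0} \bigr\rangle
  \;=\; \bigl\langle \mathrm{MOD}_2^{\beta=0} \bigr\rangle .
```

The last two equalities hold because α1 and α2 run over the same grid, so both single-model matrices have the same average. The pooled forecast can therefore win on average without winning at any individual matrix element.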

We proceed with the IGN method of MMEC to investigate whether multi-models become more successful if a more sophisticated combination algorithm is applied. The results are displayed analogously to Figures 4 and 5: Figure 7 shows the IGN^{β=0} skill matrix, Figure 8(a) shows the difference matrix IGN^{β=0} − BEST^{β=0}, and Figure 8(b) shows all matrix elements of IGN^{β=0} plotted against the corresponding matrix elements of BEST^{β=0}.

Figure 5. Toy-model experiments with well-dispersed model ensembles. (a) POOL^{β=0} − BEST^{β=0}, the pixel-wise difference between multi-model skill (as shown in Figure 4) and the maximum of the two single-model skill matrices shown in Figure 3. The size of the dots is proportional to the skill (RPSS_D), as shown in the figure legend. (b) All matrix elements of POOL^{β=0}, plotted as a function of the corresponding matrix elements of BEST^{β=0}.

Figure 6. Toy-model experiments with well-dispersed model ensembles: POOL^{β=0} − MEAN^{β=0}, the difference between multi-model skill (as shown in Figure 4) and the average of the two single-model skill matrices shown in Figure 3. The size of the dots is proportional to the skill (RPSS_D), as shown in the figure legend.

Figure 7. Toy-model experiments with well-dispersed model ensembles: IGN^{β=0}, the skill matrix of the multi-model constructed with method IGN. The size of the dots is proportional to the skill (RPSS_D), as shown in the figure legend.

If both α1 and α2 are negative, i.e. if both MOD_1^{β=0} and MOD_2^{β=0} are negative, the IGN multi-model consistently has zero skill (matrix elements in the lower-left corner of IGN^{β=0} in Figure 7). This means that, in contrast to the POOL method, there are combinations of α1 and α2 where the multi-model appears to outperform both Model 1 and Model 2 (black dots in lower-left corner of Figure 8(a)). However, it can be shown that this is simply a consequence of the fact that method IGN is a Bayesian method starting with a climatological prior. In other words, if neither Model 1 nor Model 2 has prediction skill, then the combination algorithm assigns zero weight to the two models and simply issues the climatological forecast, i.e. w1 = w2 = 0 and w0 = 1 in Equation (3). Thus, the gain in prediction skill observed for negative α1 and α2 in Figure 8(a) is not an effect of MMEC per se, but rather of the IGN algorithm’s ability to fully ignore Model 1 and Model 2 in this case.
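The climatological fallback can be illustrated with a small sketch. It assumes that Equation (3) amounts to a convex combination of the climatological tercile probabilities and the two model probabilities with weights w0, w1 and w2; the function name `combine` and the example probabilities are hypothetical, and the optimization of the weights is not shown.

```python
import numpy as np


def combine(p_clim, p_model1, p_model2, w0, w1, w2):
    """Weighted combination of categorical probabilities.

    Assumption for illustration: a linear (convex) combination with
    non-negative weights that sum to one.
    """
    weights = np.array([w0, w1, w2], dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to one")
    return (w0 * np.asarray(p_clim, dtype=float)
            + w1 * np.asarray(p_model1, dtype=float)
            + w2 * np.asarray(p_model2, dtype=float))


p_clim = np.array([1 / 3, 1 / 3, 1 / 3])   # three equiprobable categories
p1 = np.array([0.7, 0.2, 0.1])             # made-up forecast of Model 1
p2 = np.array([0.1, 0.3, 0.6])             # made-up forecast of Model 2

# With w0 = 1 and w1 = w2 = 0 the combined forecast is the climatological one,
# whatever the two models predict.
fallback = combine(p_clim, p1, p2, w0=1.0, w1=0.0, w2=0.0)
```

In this limit the combined forecast carries no information from either model, which is why its skill relative to climatology is zero rather than negative.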

If, on the other hand, at least one of α1 and α2 is positive, then the matrix elements of IGN^{β=0} are almost identical to those of BEST^{β=0} (see Figure 8). Indeed, a weight analysis (not shown here) reveals that the combination algorithm assigns almost all weight to the better of Model 1 and Model 2. Thus, the IGN method clearly outperforms the POOL method, and does not ‘spoil’ good forecasts by adding poor ones. However, the central conclusion drawn from the POOL experiments also holds for the IGN method: namely, that for well-dispersed ensemble forecasts there are no combinations of α1 and α2 for which the multi-model has higher skill than the best participating model alone.

3.4. Multi-models constructed from highly overconfident model ensembles

What changes if highly overconfident SMEs are combined, rather than well-dispersed ones as before? To answer this question, the combination experiments described above are repeated, but with a positive overconfidence parameter of β=0.7. By analogy with the notation used above, the resulting skill matrices are denoted MOD_1^{β=0.7}, MOD_2^{β=0.7}, BEST^{β=0.7}, POOL^{β=0.7} and IGN^{β=0.7}. Note that the range of possible values of α is now limited because of the condition β ≤ √(1−α²) in Equation (1). We choose α1, α2 ∈ {−0.5, −0.4, . . . , 0.6}; the choice α1=α2=0.7 is omitted for the reason mentioned in Section 3.2.
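A short check of which grid values of α satisfy the condition on β quoted above may be helpful. The sketch below only evaluates that inequality on the grid used earlier; variable names are illustrative.

```python
import numpy as np

beta = 0.7
alphas = np.round(np.linspace(-0.5, 1.0, 16), 1)   # -0.5, -0.4, ..., 1.0

# The condition beta <= sqrt(1 - alpha^2) is equivalent to a non-negative
# ensemble-member variance 1 - alpha^2 - beta^2.
member_variance = 1.0 - alphas**2 - beta**2
admissible = alphas[member_variance >= 0.0]

print(admissible)
```

On this grid the inequality admits values up to α=0.7, so the restriction to α ≤ 0.6 reflects the additional exclusion discussed in Section 3.2 rather than the inequality alone.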

The skill matrices of overconfident SME forecasts, MOD_1^{β=0.7} and MOD_2^{β=0.7}, are shown in Figure 9, revealing that for given values of α1 and α2 skill is significantly reduced with respect to the well-dispersed ensemble forecasts (β=0) of Figure 3. This is consistent with the conclusions drawn from Figure 2, where it is revealed that the RPSS_D penalizes overconfidence.

The change in skill due to multi-model combination, obtained both with the POOL method and with the IGN method, is displayed in the same way as before: Figure 10(a) shows the difference matrix POOL^{β=0.7} − BEST^{β=0.7}, while in Figure 10(b) all matrix elements of POOL^{β=0.7} are plotted against the corresponding elements of BEST^{β=0.7}. Analogously, Figure 11(a) shows the difference matrix IGN^{β=0.7} − BEST^{β=0.7} and Figure 11(b) displays the matrix elements of IGN^{β=0.7} as a function of the elements of BEST^{β=0.7}.

Figure 8. As Figure 5, but applying method IGN to construct the multi-model: (a) IGN^{β=0} − BEST^{β=0}; (b) all matrix elements of IGN^{β=0}, plotted as a function of the corresponding matrix elements of BEST^{β=0}.

The outcome is fundamentally different from what was shown above in the case of well-dispersed model ensembles: for both the POOL and the IGN multi-models, there now are combinations of α1 and α2 such that the multi-model really has higher skill than any of the participating models alone. This is evident from the black regions in the difference matrices of Figures 10(a) and 11(a), and from the fact that the scatter plots of Figures 10(b) and 11(b) exhibit points well above the bisecting line. If the POOL method is applied, the biggest skill improvement is located along the matrix diagonal, i.e. where α1 ≈ α2. However, there are also parameter combinations (α1, α2) such that the addition of a consistently-poor model of no skill to a clearly-better model of positive skill can further improve prediction skill (visible, for example, in the regions α1 ≈ 0.2 and α2 ≈ 0.6). If the IGN method is applied, all matrix elements have higher skill than the single models alone (again with the constraint that for negative α1 and α2 the gain in skill occurs because the climatological forecast receives all the weight).

3.5. Discussion

In summary, the toy-model experiments described above show that MMEC can enhance prediction skill beyond the best participating single model, but only if the single models are overconfident. This conclusion holds regardless of which combination algorithm is applied. How can this be understood? To illustrate the underlying mechanism as simply as possible, without loss of generality we here consider only the equally-weighted POOL combination method, and assume that α1=α2, i.e. that the two models to be combined have the same potential predictability.

Figure 9. As Figure 3, but with overconfident model ensembles: (a) MOD_1^{β=0.7}; (b) MOD_2^{β=0.7}.

Figure 10. As Figure 5, but with overconfident model ensembles: (a) POOL^{β=0.7} − BEST^{β=0.7}; (b) all matrix elements of POOL^{β=0.7} plotted as a function of the corresponding matrix elements of BEST^{β=0.7}.

Figure 11. As Figure 8, but with overconfident model ensembles: (a) IGN^{β=0.7} − BEST^{β=0.7}; (b) all matrix elements of IGN^{β=0.7} plotted as a function of the corresponding matrix elements of BEST^{β=0.7}.

We start with well-dispersed SMEs (i.e. β=0). For any toy-model forecast, the EPD from which the ensemble members are sampled is then uniquely determined (Equation (1)), since the perturbation term ε_β is zero (see Figure 1(a–c)). Given that here the two participating single models have the same α and refer to the same observation x, the two SMEs to be combined are sampled from the same EPD. In other words, the only effect of MMEC in this case is to enhance the sample size, and thus to provide a better estimate of the EPD: an effect that is not captured by the RPSS_D, which is insensitive to ensemble size. In essence, the EPD of the MME is identical to the EPD of the participating SMEs.

Now consider overconfident SMEs (i.e. β>0). Again, for simplicity assume that α1=α2. In contrast to the case of the well-dispersed ensemble, the two EPDs of the participating single models are not uniquely determined. They have the same well-defined spread (with standard deviation √(1−α²−β²)), but their locations (i.e. their mean values) are randomly perturbed by the displacement term ε_β, which has standard deviation β (see Equation (1) and Figure 1(e–f)). Thus, in the overconfident case, MMEC does modify the shape of the resulting multi-model EPD, since the two participating single-model EPDs are generally not identical. This is illustrated in Figure 12(a–b) for a randomly-sampled observation x and α=0.65 and β=0.7. What happens if more and more overconfident single models with the same α and β are added to the multi-model, thus providing more and more random samples of ε_β? Following the line of argument presented in Appendix A, it is easy to show that for an observation x, the multi-model EPD then approaches the distribution N(αx, √(1−α²)). This is identical to the EPD of a well-dispersed single model with β=0. In other words, the combination of independent overconfident models widens the MME spread while reducing the error in the ensemble location. The larger the number of overconfident models contributing to the MME, the more the multi-model EPD loses its overconfidence characteristics in favour of the characteristics of well-dispersed ensembles (illustrated in Figure 12(c) and (d) for multi-models consisting of 3 and 10 000 single models respectively).
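The limiting behaviour described above can be checked numerically. The following is a minimal sketch that assumes the toy model has the form suggested by this section: each model's EPD is centred at αx plus a model-specific random displacement with standard deviation β, and members scatter around that centre with standard deviation √(1−α²−β²). The function name and the sample sizes are illustrative choices.

```python
import numpy as np


def pooled_epd_sample(x, alpha, beta, n_models, members_per_model, rng):
    """Pool members from n_models overconfident toy-model ensembles.

    Assumed form of the toy model (see the lead-in): one random displacement
    per model, plus member noise with standard deviation
    sqrt(1 - alpha^2 - beta^2).
    """
    spread = np.sqrt(1.0 - alpha**2 - beta**2)
    centres = alpha * x + rng.normal(0.0, beta, size=n_models)
    noise = rng.normal(0.0, spread, size=(n_models, members_per_model))
    return (centres[:, None] + noise).ravel()


rng = np.random.default_rng(0)
x, alpha, beta = 0.5, 0.65, 0.7

single = pooled_epd_sample(x, alpha, beta, 1, 2000, rng)       # one overconfident model
pooled = pooled_epd_sample(x, alpha, beta, 10_000, 20, rng)    # many pooled models

print("single-model spread:", single.std())
print("pooled spread:      ", pooled.std())
print("sqrt(1 - alpha^2):  ", np.sqrt(1.0 - alpha**2))
print("pooled mean vs alpha*x:", pooled.mean(), alpha * x)
```

With many pooled models the sample spread approaches √(1−α²) and the sample mean approaches αx, whereas a single model retains the narrower spread √(1−α²−β²) around a displaced centre.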

Note that the same effect can be observed if the signal-based toy model discussed at the end of Section 2.1.2 is applied: in this case also, the EPDs of well-dispersed ensembles are all centred at the same value (namely at µ) and are therefore uniquely determined, while the EPDs of overconfident forecasts reveal random displacements ε_β around µ that average out as more and more models are combined.

Given that probabilistic skill scores penalize overconfidence, the ‘paradox’ of skill improvement due to MMEC can now be understood: MMEC reduces the errors in ensemble location while widening ensemble spread, thus reducing the overconfidence penalty of the RPSS_D and improving net prediction skill. The correlation α, however, is not improved. It follows that, at least in this simple toy-model framework, the theoretical upper limit of skill achievable by MMEC is given by the skill of a well-dispersed single model that has the same correlation coefficient α but lacks the overconfidence penalty. This is confirmed in Figure 13, which shows how the average RPSS_D of MME predictions changes as the number of contributing single models increases: for all participating single models being well-dispersed (grey line, all SMEs have α=0.65 and β=0); and for all participating SMEs being overconfident (black line, all SMEs have α=0.65 and β=0.7). As expected, for well-dispersed model ensembles, skill is independent of the number of models involved, while for overconfident model ensembles, skill grows monotonically, and asymptotically approaches that of a well-dispersed ensemble. In this sense, it appears meaningful to regard α as a measure of potential predictability, as suggested by Kharin and Zwiers (2003).

Note that this conclusion has been reached under the assumption that the contributing SMEs have the same correlation coefficient α. While in real multi-model prediction systems the correlation skills of the participating single models are relatively similar (Yoo and Kang, 2005), the situation of course changes if two overconfident models of different α (with α1 > α2, say) are combined. In that case, the average correlation coefficient of the MME members is lower than α1, meaning that the gain in skill due to reduced overconfidence is accompanied by a loss of skill due to reduced correlation. Thus, the greater the difference between α1 and α2, the more difficult it is for the multi-model to beat the best participating single model (see, for example, the upper-left and lower-right corners in Figure 10(a)).

Figure 12. The effect of multi-model combination of overconfident (β=0.7) ensemble forecasts. For a given observation x and a correlation coefficient α=0.65, the panels show, from left to right, how a typical SME parent distribution (shaded) becomes modified as 2, 3 and 10 000 equally overconfident models are included in the forecast: (a) 1 model; (b) 2 models; (c) 3 models; (d) 10 000 models. The horizontal axis is the observable and the vertical axis the scaled probability density. The multi-model EPD eventually assumes the shape of a well-dispersed ensemble forecast, as shown in Figure 1(c). The solid black line is the climatological distribution. The EPDs are scaled differently for illustrative purposes.

Figure 13. RPSS_D of multi-model ensemble forecasts as a function of the number of participating single models, for overconfident ensembles (β=0.7, black) and well-dispersed ensemble forecasts (β=0, grey). All participating models have the same correlation coefficient α=0.65. The skill values plotted are obtained from 10 000 random observations and corresponding multi-model ensemble forecasts.

3.6. Deterministic interpretation

As was stated in Section 2.1, overconfident toy-model ensembles are characterized by too-narrow spread and a random additive displacement error of the ensemble mean. Both these features are equivalent manifestations of overconfidence, and affect each other. Consequently, a deterministic forecast evaluation, i.e. an evaluation that only considers ensemble-mean errors, should lead to conclusions concerning MMEC similar to those of the fully probabilistic verification presented above, which considers both ensemble spread and mean errors. We generally prefer the probabilistic view, since it is physically more meaningful and considers all available forecast information, but in the present simple toy-model context both views are equivalent. Indeed, one of the reviewers has provided a convincing deterministic argument on the effects of MMEC. It is based on the mean squared errors (MSEs) of ensemble means, and is outlined below.

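For simplicity, assume that the ensemble size is infinite. In that case, the ensemble mean f of an SME is identical to the centre of the EPD, and is given by:

$$ f = \alpha x + \varepsilon_\beta, \tag{8} $$

where ε_β is the random displacement of the ensemble mean, with standard deviation β. The average MSE of the SME means, i.e. their squared distance from the verifying observations x, is then given by:

$$ \mathrm{MSE} = (\alpha - 1)^2 + \beta^2, $$

assuming that x (which has unit variance) and ε_β are not correlated with each other.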
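The single-model MSE expression above can be verified with a quick Monte Carlo experiment. This is a sketch under the stated assumptions: standard-normal observations x, and an independent, normally distributed displacement of the ensemble mean with standard deviation β. The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, n_cases = 0.65, 0.7, 200_000

x = rng.normal(0.0, 1.0, n_cases)               # observations, unit variance
displacement = rng.normal(0.0, beta, n_cases)   # random error of the ensemble mean
f = alpha * x + displacement                    # ensemble mean of an infinite ensemble

mse_simulated = np.mean((f - x) ** 2)
mse_expected = (alpha - 1.0) ** 2 + beta**2

print(mse_simulated, mse_expected)
```

The two numbers agree to within sampling noise, because the error f − x is the sum of the uncorrelated terms (α−1)x and the displacement.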

Now assume that an MME is constructed from N such SMEs by applying method POOL. Let the participating SMEs have the same overconfidence parameter β, but different correlation coefficients α1, . . . , αN. Assume further that the random displacements of the N SMEs, ε_β^(1), . . . , ε_β^(N), are sampled independently. The mean f_MME of the MME is then given by:

$$ f_{\mathrm{MME}} = \frac{1}{N} \sum_{n=1}^{N} \left( \alpha_n x + \varepsilon_\beta^{(n)} \right), \tag{9} $$

and, since the mean of N independent displacements, each of variance β², has variance β²/N, the average multi-model MSE becomes:

$$ \mathrm{MSE}_{\mathrm{MME}} = (\bar{\alpha} - 1)^2 + \frac{1}{N}\,\beta^2, \tag{10} $$

where

$$ \bar{\alpha} = \frac{1}{N} \sum_{n=1}^{N} \alpha_n $$

is the mean of the values of α of the single models. From Equation (10) one can directly draw conclusions analogous to those presented above for the probabilistic view:

• If β=0, the best single model (with α=αmax) always has an MSE less than or equal to that of the multi-model, and therefore accuracy greater than or equal to that of the multi-model, because ᾱ ≤ αmax. This means that MMEC does not enhance prediction skill beyond the best participating SME.

• If β>0, then the MME can have a lower MSE, and therefore a higher accuracy, than any participating SME alone, especially if the values of α are of similar magnitude for all SMEs. In that case, the first term on the right-hand side of Equation (10) is hardly affected by MMEC, while the second term decreases in inverse proportion to the number of participating SMEs, thus reducing the MSE. In this deterministic view, the addition of more and more overconfident models, i.e. models with random additive errors of their ensemble means, continuously improves the forecast accuracy, which asymptotically approaches the MSE one would obtain for well-dispersed SMEs (with β=0).

3.7. Can ensemble inflation have the same effect as multi-model combination?

Given that the enhanced prediction skill of multi-models is due to a reduction in the overconfidence penalty rather than an improvement in potential predictability, the question arises as to whether the same effect could be obtained by inflating overconfident single-model forecasts, i.e. by an appropriate recalibration. To address this question, we return to the probabilistic view, and apply an inflation method described by Doblas-Reyes et al. (2005) and Kharin and Zwiers (2003). In essence, this method widens the ensemble spread from a value of √(1−α²−β²) to a value of √(1−α²), i.e. from the observed spread to the spread that would be assumed if the ensemble forecasts were not overconfident. The ensembles are thus inflated by a factor

$$ \gamma = \frac{\sqrt{1-\alpha^2}}{\sigma_{\mathrm{ens}}}, $$

where σ_ens is the average observed ensemble spread. Simultaneously, the ensemble means are shifted toward the climatological mean by a factor δ in such a way that the forecast climatology remains well-calibrated. By equating the variances of the raw and inflated forecasts, one obtains δ=α/σ_em, where σ_em is the standard deviation of the ensemble-mean forecasts. The calibrated version of the toy model is then:

$$ \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_M \end{pmatrix} = \delta\,(\alpha x + \varepsilon_\beta) + \gamma \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_M \end{pmatrix}, \tag{11} $$

where

$$ \gamma = \sqrt{\frac{1-\alpha^2}{1-\alpha^2-\beta^2}} \quad \text{and} \quad \delta = \sqrt{\frac{\alpha^2}{\alpha^2+\beta^2}}. $$

As described above, the factor δ<1 reduces the inter-ensemble variance so that the forecast climatology remains identical to the observation climatology. However, this implies that ensemble inflation is coupled with a reduction in the effective correlation between the individual forecast-ensemble members and the observations, from α to δα. In other words, while one may expect some beneficial effect on skill due to a reduction in overconfidence, one would at the same time experience a decrease in skill due to reduced correlation. Thus, the net skill improvement due to inflation depends on the respective magnitudes of these two contrary effects. If α is very small, one would expect a strong net gain in skill, since a priori there is hardly any correlation that could be further reduced. As α grows, however, the absolute loss in effective correlation increases, thus limiting the benefit of inflation. MMEC is fundamentally different, in that it gradually widens ensemble spread and moves the ensemble mean toward truth without reducing the correlation (assuming that all participating models have similar correlation).
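The two effects of inflation can be checked numerically. The sketch below assumes the toy-model form implied by this section (standard-normal observation x, a mean displacement with standard deviation β, and member noise with standard deviation √(1−α²−β²)) and uses γ=√((1−α²)/(1−α²−β²)) and δ=α/√(α²+β²), the latter following from δ=α/σ_em for positive α. Function and variable names are illustrative.

```python
import numpy as np


def inflation_factors(alpha, beta):
    """Spread inflation factor gamma and mean-shrinkage factor delta (assumed forms)."""
    gamma = np.sqrt((1.0 - alpha**2) / (1.0 - alpha**2 - beta**2))
    delta = np.sqrt(alpha**2 / (alpha**2 + beta**2))
    return gamma, delta


alpha, beta = 0.5, 0.7
gamma, delta = inflation_factors(alpha, beta)

# Analytic variance of an inflated ensemble member: shrunken mean part plus inflated noise part.
total_variance = delta**2 * (alpha**2 + beta**2) + gamma**2 * (1.0 - alpha**2 - beta**2)

# Monte Carlo estimate of the correlation between an inflated member and the observation.
rng = np.random.default_rng(3)
n_cases = 200_000
x = rng.normal(0.0, 1.0, n_cases)
eps_beta = rng.normal(0.0, beta, n_cases)
eps_member = rng.normal(0.0, np.sqrt(1.0 - alpha**2 - beta**2), n_cases)
member = delta * (alpha * x + eps_beta) + gamma * eps_member
simulated_correlation = np.corrcoef(member, x)[0, 1]

print(total_variance, simulated_correlation, delta * alpha)
```

The member variance stays at one, so the forecast climatology remains calibrated, while the correlation with the observations drops from α to δα.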

Again, note that the same conclusion can be obtained from the signal-based model described at the end of Section 2.1.2, where any ensemble inflation must be accompanied by a reduction in the signal amplitude µ to keep the model climatology well-calibrated, thus reducing the correlation between forecasts and observations. MMEC, on the other hand, would also leave µ unchanged here.

This conclusion is illustrated in Figure 14. For an overconfidence parameter of β=0.7, the graph shows the RPSS_D as a function of α for ‘raw’ non-inflated SME forecasts, inflated SME forecasts, MME forecasts (POOL method) based on two overconfident SMEs of equal α, and MME forecasts (POOL method) based on an infinite number of overconfident SMEs of equal α. The latter are equivalent to well-dispersed SMEs with β=0 (see above). As expected, the plot shows that inflation does enhance skill with respect to raw SMEs; however, the net gain in skill drops as α grows. Multi-models, on the other hand, reveal a uniform skill improvement over the whole range of α values, with the skill improvement growing as the number of participating SMEs grows. Even a simple MME consisting of only two models outperforms the inflated SME for α>0.15. On the other hand, ensemble inflation inhibits negative skill scores. These results suggest that, for climatologically well-calibrated models, ensemble inflation is most effective when the potential predictability is low and when only SME forecasts are available.

Figure 14. Effect of ensemble inflation, in comparison with multi-model combination, for overconfident (β=0.7) ensemble forecasts: RPSS_D as a function of correlation coefficient α, for: ‘raw’ SME forecasts; inflated SME forecasts; MME forecasts (POOL method) based on two overconfident SMEs; and MME forecasts (POOL method) based on an infinite number of overconfident SMEs. The latter are equivalent to well-dispersed SMEs with β=0.

This conclusion is still not general enough to be transferred to real forecasts, first because it is based on only one exemplary calibration method (there are other common methods that have not been evaluated here, such as the ‘reliability correction’ described in Toth et al. (2003)), and secondly because it is based on a very idealized toy model. Indeed, in the present setting, the success of MMEC is closely linked to the independence of the model-error terms ε_β. In real prediction systems, model errors tend to be correlated (e.g. Yoo and Kang, 2005), raising the question as to whether MMEC would still outperform simple calibration approaches under real conditions. The study of Doblas-Reyes et al. (2005) suggests that it actually does, at least in the context of seasonal forecasts, but a more general investigation of this issue is left to future research.

4. Application to real seasonal forecast data

So far, all results have been obtained on the basis of a simple Gaussian-type toy model. It is the aim of this section to show that the conclusions drawn also hold for real (seasonal) multi-model ensemble predictions.

4.1. Models applied

In the following, ensemble forecasts of two operational seasonal prediction systems are evaluated and combined: ECMWF’s System 2 (S2) (Anderson et al., 2003), and the UK Met Office’s GloSea (GS) (Gordon et al., 2000). Hindcast data for these two models are obtained from the DEMETER database (Palmer et al., 2004). Although this database comprises hindcasts of seven different models, we have restricted ourselves to two models, in order to be consistent with the toy-model simulations and to keep the interpretation of the results as simple as possible. Moreover, the robustness of the weight estimations for method IGN drops drastically as the number of models to be combined increases (Robertson et al., 2004).

We consider hindcasts of mean summer near-surface (2 m) temperature, averaged over the months June, July and August. All hindcasts have been started from 1 May initial conditions. The hindcast period is 1960–2001. Data are verified grid-point-wise against the corresponding ‘observations’ from the 40-year ECMWF re-analysis (ERA40) dataset (Uppala et al., 2005). Both the forecasts and the verifying observations are interpolated on a grid with 2.5° × 2.5° resolution. As in the toy-model experiments above, three equiprobable categories are considered. The terciles separating the three categories are determined from the hindcast and observation data separately.

4.2. Seasonal prediction skill

Ensemble forecasts of the two models are combined to MMEs by applying both the methods POOL and IGN. To reduce the sampling variability of the model weights in the IGN method, a nine-point binomial spatial smoother is applied for the optimization procedure, as described by Robertson et al. (2004). All calculations are carried out in a ‘one-year-out cross-validation’ mode (Wilks, 2006). This means that for each year to be verified, the target year is eliminated from the computation of observation terciles, model terciles and optimum model weights. Skill is calculated grid-point-wise for the two single models, as well as the POOL and IGN multi-model ensembles, using the debiased RPSS_D. The resulting skill maps are shown in Figure 15, and the optimum weights obtained from the IGN combination are displayed in Figure 16. Most notably, both S2 (Figure 15(a)) and GS (Figure 15(b)) reveal large areas with negative skill, i.e. areas where the models perform worse than if the forecast ensembles were simply randomly sampled from climatology, particularly in the extratropics. While the extent and magnitude of these areas is reduced for POOL MME forecasts (Figure 15(c)), they are almost totally eliminated for IGN MME forecasts (Figure 15(d)). This is because of the IGN method’s ability to ignore poor ensemble forecasts and to issue the climatological forecast instead (see Figure 8(a), lower-left corner). Indeed, Figure 16(c) shows that in the regions of poor single-model skill, climatology receives almost all the weighting.
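The cross-validated treatment of the terciles described above can be sketched as follows for a single time series at one grid point. This is an illustrative sketch only: the function name and the synthetic temperature series are hypothetical, and the cross-validated estimation of model terciles and IGN weights is not shown.

```python
import numpy as np


def loo_tercile_categories(series):
    """Assign each year to a tercile category (0 = lower, 1 = middle, 2 = upper).

    The tercile boundaries for a given year are estimated from all other
    years, so the target year never influences its own thresholds.
    """
    series = np.asarray(series, dtype=float)
    categories = np.empty(series.size, dtype=int)
    for i in range(series.size):
        training = np.delete(series, i)                     # leave the target year out
        lower, upper = np.quantile(training, [1 / 3, 2 / 3])
        categories[i] = int(series[i] > lower) + int(series[i] > upper)
    return categories


rng = np.random.default_rng(2)
temperatures = rng.normal(15.0, 1.0, size=42)   # synthetic 42-year series, for illustration
cats = loo_tercile_categories(temperatures)
print(np.bincount(cats, minlength=3))
```

Because the boundaries are re-estimated for every target year, the three categories are only approximately equiprobable in a short record.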

For completeness, spatially-averaged values of single-model and multi-model skill are provided in Table I for a number of regions, applying both the debiased RPSS_D and the classical RPSS. With only 42 hindcast years available, the sample size is very small compared to the toy-model experiments described above, giving rise to uncertainties in the skill estimates obtained. Indeed, a comprehensive quantitative verification requires confidence intervals over the skill estimates (Joliffe, 2007), a task that is particularly complicated in the present case, where not only temporal averages over scores but also spatial averages over statistically-dependent grid points are considered, so that the number of independent forecast–observation pairs is not known (see also the discussion in Weigel et al. (2007c)). Since a quantitative comparison of methods IGN and POOL has already been provided by Rajagopalan et al. (2002) and is not central to this paper, we omit an estimate of confidence intervals. However, the figures in Table I appear to confirm the conclusions of Rajagopalan et al. (2002): that in regional averages multi-models have higher skill than the participating SMEs alone, and that IGN MMEs always outperform POOL MMEs. These conclusions appear to hold for both the classical RPSS and the debiased RPSS_D.

Figure 15. Skill maps (RPSS_D) for real seasonal forecasts (June, July, August) of 2 m temperature, with a lead time of one month, obtained from the DEMETER database for the period 1960–2001. Skill is evaluated for: (a) S2; (b) GS; (c) the multi-model constructed with method POOL (equal weights); and (d) the multi-model constructed with method IGN (optimum weights).

Figure 16. Optimum weights obtained from method IGN for the forecasting context described in Figure 15: (a) weights of the S2 model; (b) weights of the GS model; (c) weights attributed to climatology. The colour scale runs from 0.0 to 1.0.

4.3. Local skill improvement and overconfidence

Given the results of Section 3.3 (in particular Figure 6), the success of MMEC in regional averages as suggested by Table I does not imply that the multi-model would also outperform any of the participating single models locally, i.e. at each individual grid point. We therefore proceed with an evaluation on a grid-point basis. In particular, we want to investigate whether the link between skill improvement and overconfidence observed with the toy model is also visible with real seasonal ensemble predictions.

Figures 10 and 11 from the toy-model experiments have shown that skill improvement due to MMEC is more clearly visible for IGN multi-models than for POOL multi-models. While the latter enhance prediction skill only if α1 ≈ α2 (a prerequisite that is not fulfilled in the general case), the IGN multi-models yield skill improvement over the whole range of combinations (α1, α2). For this reason, we henceforth consider IGN MMEs only.

Figure 17(a) shows those grid points where the multi-model skill (IGN) of the seasonal forecasts is positive and exceeds that of the best single model (i.e. both S2 and GS). Thus, it excludes those grid points where skill improvement is only due to the IGN method’s ability to issue the climatological forecast in the case of poor ensemble forecasts. Despite the considerable scatter, which is not too surprising given the comparatively short verification period of 42 years (i.e. 42 samples), the grid points where IGN locally outperforms all SMEs are clearly organized in larger-scale structures, predominantly along the tropical belt. Apart from tropical Africa and South America, these grid points are mainly found over the oceans. It should be mentioned here that a similar analysis based on POOL ensembles reveals a much noisier picture, which makes it difficult to identify these regions (not shown).

How does this pattern of skill improvement relate to overconfidence? To obtain an estimate of the overconfidence characteristics of the two participating SMEs, we assume Gaussian behaviour of the observed and simulated seasonal averages of near-surface temperature, and fit the toy-model parameters α and β to the joint series of model hindcasts and verifications. The Gaussian assumption is admittedly a very simplifying one, but can be justified as a first rough estimate for the variable considered (Wilks, 2002, 2006) and the small number of verification samples (Scherrer et al., 2006).

The procedure is as follows. First, the ERA40 ‘observations’ are linearly and grid-point-wise scaled so that the climatology of scaled observations has zero mean and unit variance. This means that, if $\bar{x}$ is the mean value of all observations $x$ at a given grid point and $\sigma_x^2$ is the corresponding variance, the scaled observations $\tilde{x}$ are determined from:

$$\tilde{x} = \frac{x - \bar{x}}{\sigma_x}. \tag{12}$$
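As an illustration of the scaling in Equation (12), the following minimal sketch standardizes the observation series at a single grid point. The function name `standardize` and the sample values are hypothetical. The use of the population standard deviation (division by N rather than N − 1) is an assumption, since the text does not specify which estimator is applied.

```python
import statistics


def standardize(series):
    """Scale a series to zero mean and unit variance, as in Equation (12)."""
    mean = statistics.fmean(series)
    # Assumption: population standard deviation (divide by N); the text
    # does not state whether N or N - 1 is used.
    sigma = statistics.pstdev(series)
    if sigma == 0.0:
        raise ValueError("a constant series cannot be scaled to unit variance")
    return [(x - mean) / sigma for x in series]


# Hypothetical seasonal-mean temperatures (K) at one grid point
# (illustration only).
obs = [288.1, 287.6, 288.9, 288.4, 287.0]
scaled = standardize(obs)
print(scaled)
```

Applied grid point by grid point, this yields scaled observations whose climatology has zero mean and unit variance.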

Table I. Average skill of single- and multi-model ensembles in various regions, for seasonal forecasts (June, July, August) of 2 m temperature, with a lead time of one month, obtained from the DEMETER database for the period 1960–2001. The table shows skill measured with the debiased RPSS_D for ECMWF’s System 2 (S2), the Met Office’s GloSea (GS), and multi-models constructed with methods POOL (equal weights) and IGN (optimum weights). The classical RPSS is given in parentheses. Skill values are given in percentage form. For all regions considered, the highest skill occurs with method IGN.

| Region | Longitude | Latitude | S2 RPSS (%) | GS RPSS (%) | POOL RPSS (%) | IGN RPSS (%) |
|---|---|---|---|---|---|---|
| Global | 180°W–180°E | 85°S–85°N | 9.5 (−0.6) | 5.9 (−4.5) | 10.5 (5.6) | 11.1 (9.4) |
| Tropics | 180°W–180°E | 20°S–20°N | 21.0 (12.3) | 18.3 (9.3) | 23.8 (19.6) | 24.9 (22.7) |
| Europe | 15°W–45°E | 35°N–70°N | 5.4 (−5.1) | 3.5 (−7.2) | 5.7 (0.5) | 7.0 (5.3) |
| Russia | 40°E–180°E | 40°N–80°N | 2.0 (−8.8) | −3.4 (−14.8) | 1.5 (−0.4) | 2.3 (1.4) |
| Asia | 60°E–140°E | 0°–40°N | 10.9 (1.0) | 6.4 (−3.9) | 11.6 (6.7) | 12.9 (11.1) |
| Australia | 100°E–180°E | 50°S–10°N | 12.6 (1.9) | 8.8 (−3.3) | 14.1 (9.3) | 16.2 (12.8) |
| Africa | 20°W–50°E | 40°S–35°N | 10.4 (0.5) | 8.2 (−2.0) | 11.9 (7.0) | 12.7 (10.7) |
| N. America | 140°W–60°W | 15°N–75°N | 7.1 (−3.2) | 3.6 (−7.1) | 7.9 (2.7) | 8.4 (6.9) |
| S. America | 90°W–30°W | 60°S–15°N | 13.2 (3.6) | 13.1 (3.4) | 16.4 (11.8) | 17.6 (15.6) |

Figure 17. (a) Regions where the multi-model locally outperforms both participating single models, and (b) regions where the participating single models are overconfident, using the seasonal forecasting context described in Figure 15. The black pixels in (a) indicate grid points where the multi-model (using method IGN, i.e. optimum weights) locally outperforms both participating single models. The black pixels in (b) indicate grid points where both participating single models are highly overconfident (β > 0.63). The grey areas are regions of zero or negative multi-model skill: at these grid points, skill improvement is dominated by the IGN method’s ability to fully ignore poor ensemble forecasts and to issue the