Academic year: 2021

(1)

© NERC All rights reserved

Quality Control on Space Weather Forecasts

Alan W P Thomson

Geomagnetism, British Geological Survey, UK

(2)

Outline

What do we mean by a (space weather) forecast?

A few examples of verified forecasting systems, what they predict and where to find them (non-exhaustive!)

Measuring accuracy of forecasts

Detailed Example 1: Predicting Solar Flares

Detailed Example 2: Predicting Geomagnetic Activity

(3)

What do we mean by a (Space Weather) Forecast?

Forecast here means a prediction of the future state (of the weather, stock market prices, or whatever)

Forecast verification is then the process of assessing the quality of a forecast

Forecasts of

Solar flares (e.g. occurrence time; magnitude)

CME eruption (occurrence time; magnitude)

CME arrival at Earth (time; solar wind parameters)

Radiation storm (onset time, duration, end time; ‘magnitude’)

Geomagnetic storm (onset time, duration, end time; ‘magnitude’)

...

Forecasts are (mostly) of variables that either occur or do not occur and/or are represented as a spatio-temporal series, i.e. a continuous variable

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/

(4)

Examples of Forecasting Systems

Solar flares: -
Coronal mass ejections: ENLIL-CONE
Solar particle events: -
Solar wind/heliosphere: ENLIL-CONE
Radiation belts: -
Magnetosphere: CCMC Various*
Thermosphere: DTM-2009
Ionosphere: -
Geomagnetic field: -

Forecast verification is done for all of these

(5)

WSA-ENLIL-CONE

ENLIL is a 3D MHD model of the heliosphere.

It solves for plasma mass, momentum and energy density, and magnetic field. Its inner radial boundary is beyond the sonic point, typically 21.5 - 30 solar radii.

The ENLIL cone model forecasts CME propagation from the ENLIL inner boundary. The cone model assumes that, close to the Sun, a CME propagates with constant angular and radial velocity.

The WSA component combines a Potential Field Source Surface (PFSS) model with the 'Schatten' current sheet model to produce a model of the global coronal magnetic field. It uses a simple kinematic model to propagate the solar wind and the magnetic polarity from the outer boundary of the Schatten current sheet model into the heliosphere.

(6)

WSA-ENLIL-CONE: Verified

Source: Dusan Odstrcil, ENLIL: Modeling of Heliospheric Space

(7)

Magnetospheric MHD Models

Simulations of BATS-R-US, GUMICS, LFM and OpenGGCM MHD codes

Compared with Cluster (magnetosheath), Geotail (near tail), WIND (far tail) and CPCP (SuperDARN) data

Honkonen et al, Space Weather, 11, 313-326, doi: 10.1002/swe.20055, 2013

(8)

Magnetospheric MHD Models

Honkonen et al, Space Weather, 11, 313-326, doi:

(9)

Magnetospheric MHD Models

(10)

Thermospheric Models

The Drag Temperature Model (DTM) is a semi-empirical model describing the temperature, density, and composition of the thermosphere

DTM2009 and DTM2000, and the COSPAR reference models NRLMSISE-00 and JB2008, are evaluated in order to establish benchmark values for updated DTM models

Compared with high resolution CHAMP and GRACE data

(11)

Thermospheric Models: Verified

Mean and RMS of density ratios and residuals

JB2008 is the most accurate below 300 km, JB2008 and DTM2009 are best at 300–500 km, and NRLMSISE-00 and DTM2009 are best above 500 km

(12)

Properties of ‘Good’ Forecasts

Consistency - the degree to which the forecast corresponds to the forecaster's best judgement about the situation, based upon his/her knowledge base

Quality - the degree to which the forecast corresponds to what actually happened

• Bias • Association • Accuracy • Skill • Reliability • Resolution • Sharpness • Discrimination • Uncertainty

Value - the degree to which the forecast helps a decision maker to realize some incremental economic and/or other benefit

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/

(13)

Types of Forecasts

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/

(14)

Definition of Verification Methods

Visual – a qualitative look at the data

Dichotomous – a simple binary yes/no

Multi-category – extension of binary variables to many categories

Continuous – non-binary variables, e.g. compute the RMS difference between measured and predicted values

Probabilistic – forecasts are probabilities, verified against the event occurring or not occurring

Spatial – e.g. at what scale does the forecast best match reality?

Ensemble – e.g. how well does the ensemble spread of the forecast represent the true variability (uncertainty) in the observations?

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/

(15)

Measuring Accuracy of Forecasts (1)

Dichotomous, or binary forecasts

Construct a contingency table of 4 categories

hit - event forecast to occur, and did occur

miss - event forecast not to occur, but did occur (false negative)

false alarm - event forecast to occur, but did not occur (false positive)

correct negative - event forecast not to occur, and did not occur

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
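The four categories above can be counted directly from paired forecast and observation records. The following is a minimal Python sketch, assuming each record is a boolean (True = event forecast, or event observed); the function name and the ten-day record are hypothetical and purely illustrative.

```python
def contingency_table(forecasts, observations):
    """Count hits, misses, false alarms and correct negatives.

    Both inputs are equal-length sequences of booleans
    (True = event forecast / event observed).
    """
    if len(forecasts) != len(observations):
        raise ValueError("forecasts and observations must have equal length")
    table = {"hits": 0, "misses": 0, "false_alarms": 0, "correct_negatives": 0}
    for forecast_yes, observed_yes in zip(forecasts, observations):
        if forecast_yes and observed_yes:
            table["hits"] += 1
        elif observed_yes:
            table["misses"] += 1            # event occurred but was not forecast
        elif forecast_yes:
            table["false_alarms"] += 1      # event forecast but did not occur
        else:
            table["correct_negatives"] += 1
    return table


# Hypothetical ten-day record of storm forecasts, for illustration only
forecast = [True, True, False, False, True, False, False, True, False, False]
observed = [True, False, False, True, True, False, False, False, False, False]
table = contingency_table(forecast, observed)
print(table)
```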

(16)

Assessing Binary Forecasts

Accuracy (what fraction of forecasts were correct?)

Bias score (how did the forecast frequency of yes events compare to the observed frequency of yes events?)

Probability of detection (what fraction of yes events were correctly forecast?)

False alarm ratio (what fraction of the predicted yes events did not occur?)

Probability of false detection (what fraction of the no events were forecast as yes?)

Success ratio (what fraction of the forecast yes events were correctly observed?)

Threat score (how well did the forecast yes events correspond to the observed yes events?)

Equitable threat score (Gilbert score) (how well did the forecast "yes" events correspond to the observed "yes" events, accounting for hits due to chance?)

Hanssen and Kuipers discriminant (True skill statistic; Peirce's skill score) (how well did the forecast separate the "yes" events from the "no" events?)

Heidke skill score (what was the accuracy of the forecast relative to that of random chance?)

Odds ratio (what is the ratio of the odds of a yes forecast being correct, to the odds of a yes forecast being wrong?)

Odds ratio skill score (what was the improvement of the forecast over random chance?)

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
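Several of the simpler scores listed above are ratios of the four contingency-table counts. The sketch below uses the standard textbook definitions; the function name and the example counts are hypothetical, and no guard against empty categories (division by zero) is included.

```python
def binary_scores(hits, misses, false_alarms, correct_negatives):
    """Basic scores from a 2x2 contingency table (standard definitions)."""
    total = hits + misses + false_alarms + correct_negatives
    pod = hits / (hits + misses)                           # probability of detection
    pofd = false_alarms / (false_alarms + correct_negatives)
    odds_hit = hits * correct_negatives
    odds_miss = misses * false_alarms
    return {
        "accuracy": (hits + correct_negatives) / total,
        "bias": (hits + false_alarms) / (hits + misses),   # forecast yes / observed yes
        "pod": pod,
        "far": false_alarms / (hits + false_alarms),       # false alarm ratio
        "pofd": pofd,                                      # probability of false detection
        "success_ratio": hits / (hits + false_alarms),
        "hanssen_kuipers": pod - pofd,                     # true skill statistic
        "odds_ratio": odds_hit / odds_miss,
        "odds_ratio_skill_score": (odds_hit - odds_miss) / (odds_hit + odds_miss),
    }


# Hypothetical counts, for illustration only
scores = binary_scores(hits=2, misses=1, false_alarms=2, correct_negatives=5)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```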

(17)

Binary Forecast Verification - Detail

Probability of detection (POD)

Range: 0 to 1. Perfect score: 1.

Characteristics: Sensitive to hits, but ignores false alarms. Very sensitive to the climatological frequency of the event. Good for rare events. Can be artificially improved by issuing more "yes" forecasts to increase the number of hits.

False alarm ratio (FAR)

Range: 0 to 1. Perfect score: 0.

Characteristics: Sensitive to false alarms, but ignores misses. Very sensitive to the climatological frequency of the event. Should be used in conjunction with the probability of detection (above).

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/

(18)

Binary Forecast Verification - Detail

Threat score (TS)

Range: 0 to 1, 0 indicates no skill. Perfect score: 1.

Characteristics: Measures the fraction of observed and/or forecast events that were correctly predicted. It can be thought of as the accuracy when correct negatives have been removed from consideration, that is, TS is only concerned with forecasts that count.

Equitable threat score (ETS)

Range: -1/3 to 1, 0 indicates no skill. Perfect score: 1.

Characteristics: Measures the fraction of observed and/or forecast events that were correctly predicted, adjusted for hits associated with random chance (for example, it is easier to correctly forecast rain occurrence in a wet climate than in a dry climate).

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
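The threat score and its chance-adjusted variant can be illustrated with a short Python sketch. It uses the standard definitions, in which the number of hits expected by chance is (observed yes × forecast yes) / total; the function name and counts are hypothetical.

```python
def threat_scores(hits, misses, false_alarms, correct_negatives):
    """Return (threat score, equitable threat score) for a 2x2 table."""
    total = hits + misses + false_alarms + correct_negatives
    # Hits expected from random forecasts with the same yes-frequencies
    hits_random = (hits + misses) * (hits + false_alarms) / total
    events = hits + misses + false_alarms   # correct negatives are ignored
    ts = hits / events
    ets = (hits - hits_random) / (events - hits_random)
    return ts, ets


# Hypothetical counts, for illustration only
ts, ets = threat_scores(hits=2, misses=1, false_alarms=2, correct_negatives=5)
print(f"TS = {ts:.3f}, ETS = {ets:.3f}")
```

Because the chance hits are subtracted from both numerator and denominator, the ETS is never larger than the TS for the same table.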

(19)

Binary Forecast Verification - Detail

Heidke skill score (HSS)

Range: -1 to 1, 0 indicates no skill. Perfect score: 1.

Characteristics: Measures the fraction of correct forecasts after eliminating those forecasts which would be correct due purely to random chance. This is a form of the generalized skill score, where the score in the numerator is the number of correct forecasts, and the reference forecast in this case is random chance.

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
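As an illustration of the "correct by chance" adjustment, the sketch below computes the Heidke skill score from a 2x2 table using the standard definition. The function name and the example counts are hypothetical.

```python
def heidke_skill_score(hits, misses, false_alarms, correct_negatives):
    """Heidke skill score: accuracy relative to random chance."""
    total = hits + misses + false_alarms + correct_negatives
    correct = hits + correct_negatives
    # Number of correct forecasts expected from random forecasts that keep
    # the same forecast and observed frequencies
    expected_correct = (
        (hits + misses) * (hits + false_alarms)
        + (correct_negatives + misses) * (correct_negatives + false_alarms)
    ) / total
    return (correct - expected_correct) / (total - expected_correct)


# Hypothetical counts, for illustration only
example = heidke_skill_score(hits=2, misses=1, false_alarms=2, correct_negatives=5)
perfect = heidke_skill_score(hits=3, misses=0, false_alarms=0, correct_negatives=7)
worst = heidke_skill_score(hits=0, misses=5, false_alarms=5, correct_negatives=0)
print(example, perfect, worst)
```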

(20)

Measuring Accuracy of Forecasts (2)

Multi-category forecasts

Generalisations of, e.g., Accuracy, the Heidke skill score and the Hanssen and Kuipers discriminant all exist

Can also analyse by simply plotting histograms (observed vs. forecast for each category) or by using the Gerrity skill score (what was the accuracy of the forecast in predicting the correct category, relative to that of random chance?)

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
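The multi-category generalisations can be sketched from a K x K contingency table. The example below assumes rows are forecast categories and columns are observed categories, and uses the standard multi-category forms of accuracy, the Heidke skill score and the Hanssen and Kuipers discriminant. The 3x3 table is hypothetical.

```python
def multicategory_scores(table):
    """table[i][j] = number of cases forecast in category i, observed in j."""
    k = len(table)
    total = sum(sum(row) for row in table)
    accuracy = sum(table[i][i] for i in range(k)) / total
    forecast_freq = [sum(table[i]) / total for i in range(k)]
    observed_freq = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    # Fraction correct expected by chance given the marginal frequencies
    chance = sum(f * o for f, o in zip(forecast_freq, observed_freq))
    hss = (accuracy - chance) / (1.0 - chance)
    hk = (accuracy - chance) / (1.0 - sum(o * o for o in observed_freq))
    return accuracy, hss, hk


# Hypothetical three-category table (e.g. quiet / active / storm)
counts = [
    [50, 10, 0],
    [8, 15, 4],
    [2, 5, 6],
]
accuracy, hss, hk = multicategory_scores(counts)
print(f"accuracy = {accuracy:.3f}, HSS = {hss:.3f}, HK = {hk:.3f}")
```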

(21)

Measuring Accuracy of Forecasts (3)

Continuous forecasts – how do numerical forecast values differ from actual values?

Example: Autoregressive prediction of the Ap geomagnetic index, used in the MSIS thermospheric density model by ESA for LEO orbital prediction and control from ~1995-2000

(22)

Assessing Continuous Forecasts

Scatter plot (how well did the forecasts compare with the observed values?)

Box plot (how well did the distribution of forecasts correspond to the distribution of observed values?)

Mean error (what is the average forecast error?)

Bias (how does the average forecast compare to the average observed value?)

Mean absolute error (what is the average magnitude of the forecast errors?)

Root mean square error (what is the average magnitude of the forecast errors?)

Mean square error (mean squared difference between forecast and observation)

Prediction efficiency (how well does the forecast match the observations, relative to the scatter in the observations?)

Linear error in probability space (measures the error in probability space, depending on the cumulative probability density function of the observations, as determined from climatology)

Stable equitable error in probability space (similar to LEPS)

Correlation coefficient (how well did the forecast values correspond to the observed values?)

Skill score (what is the relative improvement of the forecast over some reference forecast?)

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
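Most of the numerical measures in this list follow from the forecast errors. A minimal Python sketch using the standard definitions is given below; the function name and the four forecast/observation pairs are hypothetical, and bias is taken here as the ratio of mean forecast to mean observation.

```python
import math


def continuous_scores(forecasts, observations):
    """Standard error measures for paired forecast/observed values."""
    n = len(forecasts)
    errors = [f - o for f, o in zip(forecasts, observations)]
    mean_f = sum(forecasts) / n
    mean_o = sum(observations) / n
    mse = sum(e * e for e in errors) / n
    covariance = sum((f - mean_f) * (o - mean_o)
                     for f, o in zip(forecasts, observations))
    var_f = sum((f - mean_f) ** 2 for f in forecasts)
    var_o = sum((o - mean_o) ** 2 for o in observations)
    return {
        "mean_error": sum(errors) / n,
        "bias": mean_f / mean_o,                 # multiplicative bias
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "correlation": covariance / math.sqrt(var_f * var_o),
    }


# Hypothetical daily Ap forecasts and observations, for illustration only
ap_forecast = [10, 20, 30, 40]
ap_observed = [12, 18, 33, 37]
stats = continuous_scores(ap_forecast, ap_observed)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```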

(23)

Continuous Forecast Verification

Linear error in probability space (LEPS)

Range: 0 to 1. Perfect score: 0.

Characteristics: Does not discourage forecasting extreme values if they are warranted. Requires knowledge of the climatological PDF.

Skill score

Range: Lower bound depends on what score is being used to compute skill and what reference forecast is used, but the upper bound is always 1; 0 indicates no improvement over the reference forecast. Perfect score: 1.

Characteristics: Implies information about the value or worth of a forecast relative to an alternative (reference) forecast, e.g. persistence (no change from most recent observation) or climatology.

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
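A skill score relative to a persistence reference can be sketched as follows. This example assumes mean square error as the underlying score, so that skill = 1 - MSE(forecast) / MSE(reference); other scores could equally be used. The function names and data values are hypothetical.

```python
def mean_square_error(forecasts, observations):
    return sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / len(forecasts)


def mse_skill_score(forecasts, reference, observations):
    """1 = perfect, 0 = no improvement over the reference, < 0 = worse."""
    return 1.0 - (mean_square_error(forecasts, observations)
                  / mean_square_error(reference, observations))


# Hypothetical daily Ap values, for illustration only
observed = [12, 18, 33, 37]
forecast = [10, 20, 30, 40]
# Persistence: each day's forecast is the previous day's observation
# (10 is an assumed observation for the day before the record starts)
persistence = [10] + observed[:-1]

skill = mse_skill_score(forecast, persistence, observed)
print(f"skill relative to persistence = {skill:.3f}")
```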

(24)

‘PDFLAP’ Performance Statistics

PDFLAP = autoregressive model of Ap/F10.7, with 30/60 coefficients, determined every day from 180/365 days of data

Accuracy compared to expected performance based on model tests on previous two solar cycles of data
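To illustrate what fitting an autoregressive model to a window of recent data involves, the sketch below estimates AR coefficients by the Yule-Walker method (Levinson-Durbin recursion) and iterates the model forward. The slide does not say how the PDFLAP coefficients are estimated, so the estimation method, the function names and the synthetic test series are all assumptions made for illustration.

```python
import random


def autocovariance(series, max_lag):
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]


def fit_ar(series, order):
    """Yule-Walker AR coefficients via the Levinson-Durbin recursion."""
    r = autocovariance(series, order)
    phi = []
    error = r[0]
    for k in range(1, order + 1):
        reflection = (r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))) / error
        phi = [phi[j] - reflection * phi[k - 2 - j] for j in range(k - 1)] + [reflection]
        error *= 1.0 - reflection * reflection
    return phi


def forecast_ar(series, coefficients, steps):
    """Iterate the fitted model forward from the end of the series."""
    mean = sum(series) / len(series)
    history = [v - mean for v in series]
    predictions = []
    for _ in range(steps):
        value = sum(c * history[-1 - j] for j, c in enumerate(coefficients))
        history.append(value)
        predictions.append(value + mean)
    return predictions


# Synthetic AR(1) series with coefficient 0.8, standing in for real index data
random.seed(0)
synthetic = [0.0]
for _ in range(2000):
    synthetic.append(0.8 * synthetic[-1] + random.gauss(0.0, 1.0))

coefficients = fit_ar(synthetic, order=3)
prediction = forecast_ar(synthetic, coefficients, steps=3)
print(coefficients, prediction)
```

In an operational setting of the kind described above, the fit would be repeated each day on the most recent window of data.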

(25)

‘PDFLAP2’ Performance Statistics

An optimised forecast model for Ap (developed ~2000)

lags 1-3: neural net model
lags 4-6: climatological model
lags 7-15: ARMA model
lags 16-27: minimum forecast between ARMA and climatology

Not physically based, therefore needs regular checking for accuracy and relevance

(26)

Detailed Example 1: Solar Flares

Many methods used to predict flares (and much published since 2000)

Poisson statistics

Bayesian statistics

Support vector machines

Discriminant analysis

Neural networks

Wavelets

Superposed epoch analysis

Empirical methods

...

Most methods provide a probability for an X-ray flare with peak flux of some magnitude in some time frame
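As a simple illustration of the first method in the list, Poisson statistics convert an average flaring rate into the probability of at least one flare in a forecast window: P = 1 - exp(-rate × window). The sketch below assumes the rate is already known (for example from historical flare counts for similar active regions). The function name and the rate used are hypothetical.

```python
import math


def flare_probability(rate_per_day, window_days=1.0):
    """Probability of at least one flare in the window, assuming Poisson statistics."""
    return 1.0 - math.exp(-rate_per_day * window_days)


# Hypothetical average rate of 0.5 flares per day above the chosen magnitude
p_24h = flare_probability(0.5, window_days=1.0)
p_48h = flare_probability(0.5, window_days=2.0)
print(f"24 h: {p_24h:.3f}, 48 h: {p_48h:.3f}")
```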

(27)

ASAP

‘Automated Solar Activity Prediction tool’: spaceweather.inf.brad.ac.uk/downloads.html

A machine learning-based system designed to analyze years of sunspot and flare data to create associations that can be represented using computer-based learning rules

An imaging-based real-time system that provides automated detection, grouping, and then classification of recent sunspots based on the McIntosh classification is created and integrated within this system

(28)

ASAP

Tested on 5267 SOHO MDI intensitygram solar images from 1 February 1999 to 31 December 2002.

Verified against the NGDC flare catalogue

(29)

Comparing flare forecast models

Bloomfield et al compare a ‘thresholded-Poisson probability’ model of flaring with other methods

Ordinal logistic regression (Song et al, 2009)

Predictor teams (Huang et al, 2010)

Neural networks (Ahmed et al, 2013)

Highlights the significance of the underlying flare/no-flare probability and therefore the preference for TSS over HSS
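The preference for TSS can be illustrated numerically. In the sketch below the forecast has the same detection and false-detection rates in two samples, but the second sample has ten times as many non-flaring cases. TSS is unchanged while HSS drops, so HSS values are not directly comparable between data sets with different flare/no-flare ratios. The function names and counts are hypothetical.

```python
def true_skill_statistic(hits, misses, false_alarms, correct_negatives):
    return (hits / (hits + misses)
            - false_alarms / (false_alarms + correct_negatives))


def heidke_skill_score(hits, misses, false_alarms, correct_negatives):
    numerator = 2.0 * (hits * correct_negatives - false_alarms * misses)
    denominator = ((hits + misses) * (misses + correct_negatives)
                   + (hits + false_alarms) * (false_alarms + correct_negatives))
    return numerator / denominator


# Same POD (0.8) and POFD (0.1) in both samples; only the event ratio differs
balanced = dict(hits=80, misses=20, false_alarms=10, correct_negatives=90)
rare = dict(hits=80, misses=20, false_alarms=100, correct_negatives=900)

tss_balanced = true_skill_statistic(**balanced)
tss_rare = true_skill_statistic(**rare)
hss_balanced = heidke_skill_score(**balanced)
hss_rare = heidke_skill_score(**rare)
print(f"TSS: {tss_balanced:.3f} vs {tss_rare:.3f}")
print(f"HSS: {hss_balanced:.3f} vs {hss_rare:.3f}")
```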

(30)

Detailed Example 2: Geomagnetic Activity

One to three day ahead forecast of geomagnetic activity by BGS colleagues for various non-paying academic and other users

Has been running since 2000

Initial analysis courtesy of Ellen Clarke, BGS

More to be done and presented at the European Space Weather Week

Forecasts given in 4, noon-to-noon, activity classes (with explanatory text):

ACTIVITY CLASS       Daily Planetary Activity Level (Ap)
QUIET – UNSETTLED    <=15
ACTIVE               16-29
MINOR STORM          30-49
MAJOR STORM          >=50
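For verification, observed daily Ap values have to be mapped onto the same four classes as the forecasts. A minimal sketch of that mapping, using the thresholds in the table above, is shown below. The function name is hypothetical and Ap is assumed to be an integer.

```python
def activity_class(ap):
    """Map a daily Ap value onto the four forecast activity classes."""
    if ap <= 15:
        return "QUIET - UNSETTLED"
    if ap <= 29:
        return "ACTIVE"
    if ap <= 49:
        return "MINOR STORM"
    return "MAJOR STORM"


# Hypothetical daily Ap values, for illustration only
classes = [activity_class(ap) for ap in (4, 15, 16, 29, 30, 49, 50, 120)]
print(classes)
```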

(31)

% Correct Forecasts by Year

(32)

ACTIVE, MINOR and MAJOR STORM days only

(33)

% Correct Forecasts on ACTIVE, MINOR and MAJOR STORM days

(34)

Analysed by Forecaster Team Member

Forecast Team (2000-2012)

• Toby Clark

• Alan Thomson

• Ellen Clarke

• Pam White

• Allan Mackay

• Sarah Reay

• Jess King

• Orsi Baillie

• Brian Hamilton

• Thomas Humphries

• Ewan Dawson

• Gemma Kelly

• Laurence Billingham

(35)

Contingency Tables – Storm Classes

(36)

Evaluation using Forecast Skill Scores

(ACTIVE, MINOR and MAJOR STORM days)

(37)

Evaluation using Forecast Skill Scores

(ACTIVE, MINOR and MAJOR STORM days)

(38)

From Another Angle: the User’s Perspective

Storms in 2002-2003

Use ‘Decision Theory’ approach (e.g. Matthews, 1996, 1997)

K = user-defined ‘loss structure’, measuring the relative costs of complacency and false alarms to the user: K = "Complacency-Cost" / "False-Alarm-Cost"

LR = likelihood ratio, LR = Pr(F|A)/Pr(F|~A); LR > 1 is good

Odds(A) = Pr(A)/Pr(~A), determined from the historical distribution of activity

Criterion: LR * Odds(A) > 1/K

Forecasts are useful where the cost of missing an event is greater than that of a false alarm (K > 1)

[Figure: LR against forecast lead time (1, 2 and 3 days) for BGS, SEC and Persistence]

[Figure: K against forecast lead time (1, 2 and 3 days) for BGS, SEC and Persistence]
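The decision-theory criterion LR * Odds(A) > 1/K can be evaluated directly from a contingency table and an assumed climatological probability of activity. In the sketch below, F is taken to mean "activity was forecast" and A "activity occurred"; the function names, the counts and the value of Pr(A) are hypothetical.

```python
def likelihood_ratio(hits, misses, false_alarms, correct_negatives):
    """LR = Pr(F|A) / Pr(F|~A)."""
    pr_f_given_a = hits / (hits + misses)
    pr_f_given_not_a = false_alarms / (false_alarms + correct_negatives)
    return pr_f_given_a / pr_f_given_not_a


def forecast_meets_criterion(lr, prob_activity, k):
    """Check LR * Odds(A) > 1/K, where K = complacency cost / false alarm cost."""
    odds = prob_activity / (1.0 - prob_activity)
    return lr * odds > 1.0 / k


# Hypothetical counts and climatological probability, for illustration only
lr = likelihood_ratio(hits=8, misses=12, false_alarms=16, correct_negatives=164)
equal_costs = forecast_meets_criterion(lr, prob_activity=0.1, k=1.0)
costly_misses = forecast_meets_criterion(lr, prob_activity=0.1, k=3.0)
print(lr, equal_costs, costly_misses)
```

With these numbers the forecast only meets the criterion for a user whose cost of being caught out is well above the cost of a false alarm.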

(39)

From Another Angle: Briggs-Ruppert Skill Scores

Storms in 2002-2003

c01 is the false alarm cost and c10 is the cost of a ‘false negative’; nxy are the matrix elements of the corresponding contingency table (x = row, y = column)

$$\theta = \frac{c_{01}}{c_{01} + c_{10}}$$

$$S = \frac{(1-\theta)\,n_{11} - \theta\,n_{01}}{(n_{11} + n_{10})(1-\theta)}$$

‘Skill score’ S > 0 implies merit

Forecasts are useful when the normalised cost is < 0.5 (i.e. complacency cost is much more important)

One lesson here: use persistence for one day ahead during this phase of the solar cycle if no CME data are available or if coronal hole effects are expected to dominate!

[Figures: One-Day, Two-Day and Three-Day Ahead BR Skill-Scores; skill score against normalised cost (0.1 to 0.9) for BGS, SEC and PES]
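A cost-weighted skill score of this kind can be sketched as below. The form assumed here is S = [(1 - θ) n11 - θ n01] / [(n11 + n10)(1 - θ)], with θ = c01 / (c01 + c10), n11 = hits, n01 = false alarms and n10 = misses. This is one reading of the Briggs-Ruppert score and should be checked against the original paper before use. The function name and counts are hypothetical.

```python
def cost_weighted_skill(n11, n10, n01, cost_false_alarm, cost_miss):
    """Skill score for given false alarm (c01) and miss (c10) costs.

    n11 = hits, n10 = misses, n01 = false alarms (assumed index convention).
    """
    theta = cost_false_alarm / (cost_false_alarm + cost_miss)  # normalised cost
    return ((1.0 - theta) * n11 - theta * n01) / ((n11 + n10) * (1.0 - theta))


# Hypothetical counts, for illustration only
skill_misses_costly = cost_weighted_skill(8, 12, 16, cost_false_alarm=1.0, cost_miss=4.0)
skill_equal_costs = cost_weighted_skill(8, 12, 16, cost_false_alarm=1.0, cost_miss=1.0)
print(skill_misses_costly, skill_equal_costs)
```

In this example the score is positive when misses cost four times as much as false alarms (θ = 0.2) and negative when the two costs are equal (θ = 0.5).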

(40)

Geomagnetic Activity Forecasts – Summary

• 1, 2 and 3 day ahead forecasts have been analysed from 2000 to 2012

• There is a tendency to underestimate activity levels for ACTIVE and STORM days only (hedging our bets?)

• Manual forecasts prove generally better than using simple persistence/recurrence forecast model(s)

• Further work required to determine a good measure for the bias and forecast performance (e.g. HSS, TSS and GS)

• Decision Theory and Briggs-Ruppert skill scores provide other perspectives on forecast value, usefulness and merit

• This brings the end-user into the process

(41)

Summary and Conclusions

There isn’t a one-stop, sure-fire way to measure ‘goodness’ or ‘usefulness’ of a forecast

The forecaster needs to lay out the options and provide evidence, statistics and commentary

From simple graphs of measured versus forecast – ‘eyeball’ the data

Added to with more complicated skill scores

The user can have a role (e.g. what are the relative costs of ‘false positives’ and ‘false negatives’) and that influences which forecast is ‘best’

Verification remains an essential part of determining where forecast models are ‘good’ or ‘bad’

And perhaps helping decide how or where

(42)

References on Forecast Verification

Jolliffe, I.T., and D.B. Stephenson, 2012: Forecast Verification: A Practitioner's Guide in Atmospheric Science. 2nd Edition. Wiley and Sons Ltd, 274 pp.

Verification of space weather models. Wintoft et al, J. Sp. Weath. Sp. Clim., 2013.

Acknowledgments

A superb, succinct source of material for this lecture: World Weather Research Programme – Forecast Verification Research Group

NOAA Space Weather Prediction Centre - References
