Quality Control on Space Weather Forecasts

(1)


Alan W P Thomson

Geomagnetism, British Geological Survey, UK

(2)

Outline

• What do we mean by a (space weather) forecast?

• A few examples of verified forecasting systems, what they predict and where to find them (non-exhaustive!)

• Measuring accuracy of forecasts

• Detailed Example 1: Predicting Solar Flares

• Detailed Example 2: Predicting Geomagnetic Activity

(3)

What do we mean by a (Space Weather) Forecast?

• Forecast here means a prediction of the future state (of the weather, stock market prices, or whatever)

• Forecast verification is then the process of assessing the quality of a forecast

• Forecasts of
  • Solar flares (e.g. occurrence time; magnitude)
  • CME eruption (occurrence time; magnitude)
  • CME arrival at Earth (time; solar wind parameters)
  • Radiation storm (onset time, duration, end time; ‘magnitude’)
  • Geomagnetic storm (onset time, duration, end time; ‘magnitude’)
  • ...

• Forecasts are (mostly) of variables that either occur or do not occur, and/or that are represented as a spatio-temporal series, i.e. as a continuous variable

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/

(4)

Examples of Forecasting Systems

• Solar flares

• Coronal mass ejections

• Solar particle events

• Solar wind/heliosphere

• Radiation belts

• Magnetosphere

• Thermosphere

• Ionosphere

• Geomagnetic field

Forecast verification is done for all of these

Models listed on the slide: ENLIL-CONE, CCMC (various*), DTM-2009

(5)

WSA-ENLIL-CONE

• ENLIL is a 3D MHD model of the heliosphere. It solves for plasma mass, momentum and energy density, and magnetic field. Its inner radial boundary is beyond the sonic point, typically at 21.5-30 solar radii.

• The ENLIL cone model forecasts CME propagation from the ENLIL inner boundary. The cone model assumes that, close to the Sun, a CME propagates with constant angular and radial velocity.

• The WSA component combines a Potential Field Source Surface model with the 'Schatten' current sheet model to produce a model of the global coronal magnetic field. It uses a simple kinematic model to propagate the solar wind and the magnetic polarity from the outer boundary of the Schatten current sheet model into the heliosphere.

(6)

WSA-ENLIL-CONE: Verified

Source: Dusan Odstrcil, ENLIL: Modeling of Heliospheric Space

(7)

Magnetospheric MHD Models

• Simulations of BATS-R-US, GUMICS, LFM and OpenGGCM MHD codes

• Compared with Cluster (magnetosheath), Geotail (near tail), WIND (far tail) and CPCP (SuperDARN) data

Honkonen et al., Space Weather, 11, 313-326, doi:10.1002/swe.20055, 2013

(8)

Magnetospheric MHD Models


(9)

Magnetospheric MHD Models

(10)

Thermospheric Models

• The Drag Temperature Model (DTM) is a semi-empirical model describing the temperature, density, and composition of the thermosphere

• DTM2009 and DTM2000, and the COSPAR reference models NRLMSISE-00 and JB2008, are evaluated in order to establish benchmark values for updated DTM models

• Compared with high-resolution CHAMP and GRACE data

(11)

Thermospheric Models: Verified

• Mean and RMS of density ratios and residuals

• JB2008 is the most accurate below 300 km, JB2008 and DTM2009 are best at 300–500 km, NRLMSISE-00 and DTM2009 are best above 500 km

(12)

Properties of ‘Good’ Forecasts

• Consistency - the degree to which the forecast corresponds to the forecaster's best judgement about the situation, based upon his/her knowledge base

• Quality - the degree to which the forecast corresponds to what actually happened
  • Bias • Association • Accuracy • Skill • Reliability • Resolution • Sharpness • Discrimination • Uncertainty

• Value - the degree to which the forecast helps a decision maker to realize some incremental economic and/or other benefit

(13)

Types of Forecasts

(14)

Definition of Verification Methods

• Visual – a qualitative look at the data

• Dichotomous – a simple binary yes/no

• Multi-category – extension of binary variables to many categories

• Continuous – non-binary variables, e.g. compute the RMS difference between measured and predicted values

• Probabilistic – forecasts are probabilities, verified against the event occurring or not occurring

• Spatial – e.g. at what scale does the forecast best match reality?

• Ensemble – e.g. how well does the ensemble spread of the forecast represent the true variability (uncertainty) in the observations?

(15)

Measuring Accuracy of Forecasts (1)

• Dichotomous, or binary forecasts

• Construct a contingency table of 4 categories
  • hit - event forecast to occur, and did occur
  • miss - event forecast not to occur, but did occur (false negative)
  • false alarm - event forecast to occur, but did not occur (false positive)
  • correct negative - event forecast not to occur, and did not occur
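The four categories above can be counted directly from paired yes/no forecasts and observations. The following is a minimal sketch; the function name `contingency_table` and the sample data are hypothetical, chosen only for illustration.

```python
from collections import Counter

def contingency_table(forecast, observed):
    """Count the four outcomes for paired yes/no forecasts and observations.

    forecast, observed: equal-length sequences of booleans.
    """
    if len(forecast) != len(observed):
        raise ValueError("forecast and observed must have the same length")
    counts = Counter()
    for f, o in zip(forecast, observed):
        if f and o:
            counts["hit"] += 1               # forecast yes, observed yes
        elif not f and o:
            counts["miss"] += 1              # forecast no, observed yes
        elif f and not o:
            counts["false_alarm"] += 1       # forecast yes, observed no
        else:
            counts["correct_negative"] += 1  # forecast no, observed no
    keys = ("hit", "miss", "false_alarm", "correct_negative")
    return {k: counts[k] for k in keys}

# Made-up data, for illustration only.
forecast = [True, True, False, False, True, False]
observed = [True, False, True, False, True, False]
table = contingency_table(forecast, observed)
```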

(16)

Assessing Binary Forecasts

• Accuracy (what fraction of forecasts were correct?)

• Bias score (how did the forecast frequency of yes events compare to the observed frequency of yes events?)

• Probability of detection (what fraction of yes events were correctly forecast?)

• False alarm ratio (what fraction of the predicted yes events did not occur?)

• Probability of false detection (what fraction of the no events were forecast as yes?)

• Success ratio (what fraction of the forecast yes events were correctly observed?)

• Threat score (how well did the forecast yes events correspond to the observed yes events?)

• Equitable threat score (Gilbert score) (how well did the forecast "yes" events correspond to the observed "yes" events, accounting for hits due to chance?)

• Hanssen and Kuipers discriminant (True skill statistic; Peirce’s skill score) (how well did the forecast separate the "yes" events from the "no" events?)

• Heidke skill score (what was the accuracy of the forecast relative to that of random chance?)

• Odds ratio (what is the ratio of the odds of a yes forecast being correct, to the odds of a yes forecast being wrong?)

• Odds ratio skill score (what was the improvement of the forecast over random chance?)
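Each score in this list is a simple function of the four contingency-table counts. The sketch below uses the standard textbook definitions; the function name `binary_scores` and the counts are illustrative assumptions, and the code assumes no denominator is zero.

```python
def binary_scores(hits, misses, false_alarms, correct_negatives):
    """Standard 2x2 verification scores (assumes no zero denominators)."""
    h, m, f, c = hits, misses, false_alarms, correct_negatives
    n = h + m + f + c
    # Hits expected by chance, given the forecast and observed yes frequencies.
    hits_random = (h + m) * (h + f) / n
    # Correct forecasts (yes and no) expected by chance.
    expected_correct = ((h + m) * (h + f) + (c + m) * (c + f)) / n
    pod = h / (h + m)
    pofd = f / (f + c)
    return {
        "accuracy": (h + c) / n,
        "bias": (h + f) / (h + m),
        "pod": pod,
        "far": f / (h + f),
        "pofd": pofd,
        "success_ratio": h / (h + f),
        "threat_score": h / (h + m + f),
        "ets": (h - hits_random) / (h + m + f - hits_random),
        "hk": pod - pofd,  # Hanssen-Kuipers discriminant / TSS
        "hss": (h + c - expected_correct) / (n - expected_correct),
        "odds_ratio": (h * c) / (m * f),
        "orss": (h * c - m * f) / (h * c + m * f),
    }

# Illustrative counts only.
scores = binary_scores(hits=82, misses=23, false_alarms=38, correct_negatives=222)
```

Note that the success ratio is simply 1 - FAR, and that accuracy, unlike the threat score, is inflated by a large number of correct negatives when events are rare.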

(17)

Binary Forecast Verification - Detail

• Probability of detection (POD)
  • Range: 0 to 1. Perfect score: 1.
  • Characteristics: Sensitive to hits, but ignores false alarms. Very sensitive to the climatological frequency of the event. Good for rare events. Can be artificially improved by issuing more "yes" forecasts to increase the number of hits.

• False alarm ratio (FAR)
  • Characteristics: Sensitive to false alarms, but ignores misses. Very sensitive to the climatological frequency of the event. Should be used in conjunction with the probability of detection (above).

(18)

Binary Forecast Verification - Detail

• Threat score (TS)
  • Range: 0 to 1, 0 indicates no skill. Perfect score: 1.
  • Characteristics: Measures the fraction of observed and/or forecast events that were correctly predicted. It can be thought of as the accuracy when correct negatives have been removed from consideration, that is, TS is only concerned with forecasts that count.

• Equitable threat score (ETS)
  • Range: -1/3 to 1, 0 indicates no skill. Perfect score: 1.
  • Characteristics: Measures the fraction of observed and/or forecast events that were correctly predicted, adjusted for hits associated with random chance (for example, it is easier to correctly forecast rain occurrence in a wet climate than in a dry climate).

(19)

Binary Forecast Verification - Detail

• Heidke skill score (HSS)
  • Range: -1 to 1 for a 2x2 table, 0 indicates no skill. Perfect score: 1.
  • Characteristics: Measures the fraction of correct forecasts after eliminating those forecasts which would be correct due purely to random chance. This is a form of the generalized skill score, where the score in the numerator is the number of correct forecasts, and the reference forecast in this case is random chance.
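For reference, the scores characterised on these detail slides appear to be the probability of detection, false alarm ratio, threat score, equitable threat score and Heidke skill score. Their standard definitions are given below. The symbols are an assumption made here, not taken from the slides: $a$ = hits, $b$ = false alarms, $c$ = misses, $d$ = correct negatives, and $n = a + b + c + d$.

```latex
\begin{align}
\mathrm{POD} &= \frac{a}{a + c} \\
\mathrm{FAR} &= \frac{b}{a + b} \\
\mathrm{TS}  &= \frac{a}{a + b + c} \\
\mathrm{ETS} &= \frac{a - a_r}{a + b + c - a_r},
  \qquad a_r = \frac{(a + b)(a + c)}{n} \\
\mathrm{HSS} &= \frac{(a + d) - e}{n - e},
  \qquad e = \frac{(a + b)(a + c) + (c + d)(b + d)}{n}
\end{align}
```

Here $a_r$ is the number of hits expected by chance and $e$ is the number of correct forecasts (yes and no) expected by chance, which is why ETS and HSS are zero for a forecast with no skill.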

(20)

Measuring Accuracy of Forecasts (2)

• Multi-category forecasts

• Generalisations of e.g. Accuracy, Heidke skill score, and the Hanssen and Kuipers discriminant all exist

• Can also analyse by simply plotting histograms (observed vs. forecast for each category) or by using the Gerrity skill score (what was the accuracy of the forecast in predicting the correct category, relative to that of random chance?)
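As an illustration of how the binary scores generalise, the sketch below computes the multi-category accuracy, Heidke skill score and Hanssen-Kuipers discriminant from a K x K contingency table, using the standard definitions. The function name, the table layout (rows = forecast category, columns = observed category) and the numbers are assumptions made for this example.

```python
def multicategory_scores(table):
    """table[i][j] = number of cases forecast in category i, observed in j."""
    k = len(table)
    n = sum(sum(row) for row in table)
    forecast_totals = [sum(table[i]) for i in range(k)]
    observed_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Fraction correct: the diagonal of the table.
    accuracy = sum(table[i][i] for i in range(k)) / n
    # Fraction correct expected by chance from the marginal frequencies.
    chance = sum(f * o for f, o in zip(forecast_totals, observed_totals)) / n**2
    observed_sq = sum(o * o for o in observed_totals) / n**2
    return {
        "accuracy": accuracy,
        "hss": (accuracy - chance) / (1 - chance),
        "hk": (accuracy - chance) / (1 - observed_sq),
    }

# Made-up 3-category table, for illustration only.
example = [
    [50, 10, 0],
    [10, 20, 5],
    [0, 2, 3],
]
result = multicategory_scores(example)
```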

(21)

Measuring Accuracy of Forecasts (3)

• Continuous forecasts – how do numerical forecast values differ from actual values?

• Example: Autoregressive prediction of the Ap geomagnetic index, used in the MSIS thermospheric density model by ESA for LEO orbital prediction and control from ~1995-2000

(22)

Assessing Continuous Forecasts

• Scatter plot (how well did the forecasts compare with the observed values?)

• Box plot (how well did the distribution of forecasts correspond to the distribution of observed values?)

• Mean error (what is the average forecast error?)

• Bias (how does the average forecast compare to the average observed value?)

• Mean absolute error (what is the average magnitude of the forecast errors?)

• Root mean square error (what is the average magnitude of the forecast errors?)

• Mean square error (mean squared difference between forecast and observation)

• Prediction efficiency (how well does the forecast match the observations, relative to the scatter in the observations?)

• Linear error in probability space (measures the error in probability space, depending on the cumulative probability density function of the observations, as determined from climatology)

• Stable equitable error in probability space (similar to LEPS)

• Correlation coefficient (how well did the forecast values correspond to the observed values?)

• Skill score (what is the relative improvement of the forecast over some reference forecast?)
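Several of the numerical measures in this list can be sketched in a few lines. The definitions below are the common ones, stated here as assumptions: bias is taken as the ratio of mean forecast to mean observation, prediction efficiency as 1 - MSE / variance of the observations, and the skill score as the generic (score - reference) / (perfect - reference). Function names and data are hypothetical.

```python
import math

def continuous_scores(forecast, observed):
    """Basic verification measures for paired continuous values."""
    n = len(forecast)
    errors = [f - o for f, o in zip(forecast, observed)]
    mean_f = sum(forecast) / n
    mean_o = sum(observed) / n
    mse = sum(e * e for e in errors) / n
    var_f = sum((f - mean_f) ** 2 for f in forecast) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    cov = sum((f - mean_f) * (o - mean_o)
              for f, o in zip(forecast, observed)) / n
    return {
        "mean_error": sum(errors) / n,
        "bias": mean_f / mean_o,
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "prediction_efficiency": 1 - mse / var_o,
        "correlation": cov / math.sqrt(var_f * var_o),
    }

def skill_score(score_forecast, score_reference, score_perfect=0.0):
    """Relative improvement over a reference (e.g. persistence) forecast."""
    return (score_forecast - score_reference) / (score_perfect - score_reference)

# Made-up values, for illustration only.
forecast = [10.0, 12.0, 20.0, 8.0]
observed = [12.0, 10.0, 25.0, 9.0]
stats = continuous_scores(forecast, observed)
# Skill of an MSE of 8.5 against a reference forecast with an MSE of 17.0.
mse_skill = skill_score(stats["mse"], 17.0)
```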

(23)

Continuous Forecast Verification

• Linear error in probability space (LEPS)
  • Characteristics: Does not discourage forecasting extreme values if they are warranted. Requires knowledge of the climatological PDF.

• Skill score
  • Range: Lower bound depends on what score is being used to compute skill and what reference forecast is used, but upper bound is always 1; 0 indicates no improvement over the reference forecast. Perfect score: 1.
  • Characteristics: Implies information about the value or worth of a forecast relative to an alternative (reference) forecast, e.g. persistence (no change from most recent observation) or climatology.

Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/

(24)

‘PDFLAP’ Performance Statistics

PDFLAP = autoregressive model of Ap/F10.7, with 30/60 coefficients, determined every day from 180/365 days of data

Accuracy compared to expected performance based on model tests on previous two solar cycles of data
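To make the idea of an autoregressive forecast concrete, here is a minimal sketch that fits AR coefficients by ordinary least squares and iterates the model forward. It is not the PDFLAP implementation, whose fitting method is not described here. The function names, the fitting approach and the synthetic series are all assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ar(series, order):
    """Least-squares fit of x[t] = c + sum_k a[k] * x[t-1-k].

    Returns [c, a[0], ..., a[order-1]], solved via the normal equations.
    """
    rows = [[1.0] + [series[t - 1 - k] for k in range(order)]
            for t in range(order, len(series))]
    targets = [series[t] for t in range(order, len(series))]
    p = order + 1
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    atb = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(p)]
    return solve_linear(ata, atb)

def forecast_ar(series, coeffs, steps):
    """Iterate the fitted model forward, feeding forecasts back in."""
    order = len(coeffs) - 1
    history = list(series)
    out = []
    for _ in range(steps):
        recent = history[-order:][::-1]  # most recent value first
        nxt = coeffs[0] + sum(a * x for a, x in zip(coeffs[1:], recent))
        out.append(nxt)
        history.append(nxt)
    return out

# Synthetic series obeying x[t] = 2 + 0.5 * x[t-1] exactly (illustration only).
series = [10.0]
for _ in range(11):
    series.append(2.0 + 0.5 * series[-1])
coeffs = fit_ar(series, order=1)
prediction = forecast_ar(series, coeffs, steps=3)
```

A daily refit on a trailing window, as the slide describes, would correspond to something like `fit_ar(ap_history[-180:], order=30)`, where `ap_history` is a hypothetical list of daily Ap values.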

(25)

‘PDFLAP2’ Performance Statistics

An optimised forecast model for Ap (developed ~2000)

• lags 1-3: neural net model
• lags 4-6: climatological model
• lags 7-15: ARMA model
• lags 16-27: the lower of the ARMA and climatological forecasts

Not physically based, therefore needs regular checking for accuracy and relevance

(26)

Detailed Example 1: Solar Flares

• Many methods used to predict flares (and much published since 2000)
  • Poisson statistics
  • Bayesian statistics
  • Support vector machines
  • Discriminant analysis
  • Neural networks
  • Wavelets
  • Superposed epoch analysis
  • Empirical methods
  • ...

• Most methods provide a probability for an X-ray flare with peak flux of some magnitude in some time frame

(27)

ASAP

• ‘Automated Solar Activity Prediction tool’: spaceweather.inf.brad.ac.uk/downloads.html

• A machine learning-based system designed to analyze years of sunspot and flare data to create associations that can be represented using computer-based learning rules

• Integrated within this system is an imaging-based real-time system that provides automated detection, grouping, and then classification of recent sunspots, based on the McIntosh classification

(28)

ASAP

• Tested on 5267 SOHO MDI solar intensitygram images from 1 February 1999 to 31 December 2002

• Verified against NGDC flare catalogue

(29)

Comparing flare forecast models

• Bloomfield et al compare a ‘thresholded-Poisson probability’ model of flaring with other methods
  • Ordinal logistic regression (Song et al, 2009)
  • Predictor teams (Huang et al, 2010)
  • Neural networks (Ahmed et al, 2013)

• Highlights the significance of the underlying flare/no-flare probability and therefore the preference for TSS over HSS

(30)

Detailed Example 2: Geomagnetic Activity

• One- to three-day-ahead forecasts of geomagnetic activity by BGS colleagues for various non-paying academic and other users

• Has been running since 2000

• Initial analysis courtesy of Ellen Clarke, BGS
  • More to be done and presented at the European Space Weather Week

• Forecasts given in 4, noon-to-noon, activity classes (with explanatory text):

ACTIVITY CLASS       Daily Planetary Activity Level (Ap)
QUIET – UNSETTLED    <=15
ACTIVE               16-29
MINOR STORM          30-49
MAJOR STORM          >=50
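The class boundaries in this table translate directly into a lookup, which is the first step in verifying a categorical forecast against the observed daily Ap. A minimal sketch, with a hypothetical function name and assuming integer Ap values:

```python
def activity_class(ap):
    """Map a daily Ap value to the forecast activity class in the table."""
    if ap <= 15:
        return "QUIET - UNSETTLED"
    if ap <= 29:
        return "ACTIVE"
    if ap <= 49:
        return "MINOR STORM"
    return "MAJOR STORM"

# Boundary values taken from the table.
classes = [activity_class(ap) for ap in (15, 16, 29, 30, 49, 50)]
```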

(31)

% Correct Forecasts by Year

(32)

ACTIVE, MINOR and MAJOR STORM days only

(33)

% Correct Forecasts on ACTIVE, MINOR and MAJOR STORM days

(34)

Analysed by Forecaster Team Member

Forecast Team (2000-2012)

• Toby Clark

• Alan Thomson

• Ellen Clarke

• Pam White

• Allan Mackay

• Sarah Reay

• Jess King

• Orsi Baillie

• Brian Hamilton

• Thomas Humphries

• Ewan Dawson

• Gemma Kelly

• Laurence Billingham

(35)

Contingency Tables – Storm Classes

(36)

Evaluation using Forecast Skill Scores (ACTIVE, MINOR and MAJOR STORM days)

(37)

Evaluation using Forecast Skill Scores (ACTIVE, MINOR and MAJOR STORM days)

(38)

From Another Angle: the User’s Perspective

Storms in 2002-2003

• Use ‘Decision Theory’ approach (e.g. Matthews, 1996, 1997)

• K = user-defined ‘loss structure’, measuring the relative costs of complacency and false alarms to the user: K = "Complacency-Cost" / "False-Alarm-Cost"

• LR = Pr(F|A)/Pr(F|~A), a likelihood ratio; LR > 1 is good

• Odds(A) = Pr(A)/Pr(~A), determined from the historical distribution of activity

• Criterion: LR * Odds(A) > 1/K

• Forecasts are useful where the cost of missing an event is greater than that of a false alarm (K > 1)

[Figure: LR against forecast lead time (1 to 3 days) for BGS, SEC and Persistence]

[Figure: K against forecast lead time (1 to 3 days) for BGS, SEC and Persistence]
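The quantities on this slide can be estimated from a contingency table. The sketch below takes Pr(F|A) as hits / (hits + misses) and Pr(F|~A) as false alarms / (false alarms + correct negatives). It also estimates Odds(A) from the same table, which is a simplification: the slide derives it from the historical distribution of activity. Function names and counts are hypothetical.

```python
def likelihood_ratio(hits, misses, false_alarms, correct_negatives):
    """LR = Pr(F|A) / Pr(F|~A), estimated from contingency-table counts."""
    pr_f_given_a = hits / (hits + misses)
    pr_f_given_not_a = false_alarms / (false_alarms + correct_negatives)
    return pr_f_given_a / pr_f_given_not_a

def event_odds(hits, misses, false_alarms, correct_negatives):
    """Odds(A) = Pr(A) / Pr(~A); here taken from the same table (assumption)."""
    return (hits + misses) / (false_alarms + correct_negatives)

def satisfies_criterion(lr, odds, k):
    """Check the slide's criterion LR * Odds(A) > 1/K.

    k is the ratio of complacency cost to false-alarm cost.
    """
    return lr * odds > 1.0 / k

# Made-up counts, for illustration only.
counts = dict(hits=30, misses=20, false_alarms=25, correct_negatives=290)
lr = likelihood_ratio(**counts)
odds = event_odds(**counts)
useful_equal_costs = satisfies_criterion(lr, odds, k=1.0)
useful_cheap_misses = satisfies_criterion(lr, odds, k=0.5)
```

With both quantities taken from one table, LR * Odds(A) reduces to hits / false alarms, so the criterion amounts to asking whether hits outnumber false alarms by more than a factor of 1/K.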

(39)

From Another Angle: Briggs-Ruppert Skill Scores

Storms in 2002-2003

• $c_{01}$ is the false alarm cost and $c_{10}$ is the cost of a ‘false negative’; $n_{xy}$ are the elements of the corresponding contingency table (x = row, y = column)

• Normalised cost:

$$\theta = \frac{c_{01}}{c_{01} + c_{10}}$$

• Skill score:

$$S = \frac{(1 - \theta)\,n_{11} - \theta\,n_{01}}{(n_{11} + n_{10})(1 - \theta)}$$

• ‘Skill score’ S > 0 implies merit

• Forecasts are useful when normalised cost < 0.5 (i.e. the complacency cost is the more important)

• One lesson here: use persistence for one day ahead during this phase of the solar cycle if there are no CME data or if coronal hole effects are expected to dominate!

[Figure: One-, Two- and Three-Day Ahead BR Skill-Scores against Normalised Cost (0.1 to 0.9) for BGS, SEC and PES]
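The cost-weighted skill score on this slide can be sketched as follows. The formula is reconstructed from a damaged slide equation and should be treated as an assumption: θ = c01 / (c01 + c10) and S = [(1 - θ) n11 - θ n01] / [(n11 + n10)(1 - θ)], with n11 = hits, n01 = false alarms and n10 = misses. The function name and the counts are hypothetical.

```python
def briggs_ruppert_skill(n11, n01, n10, c01, c10):
    """Cost-weighted skill score, as reconstructed from the slide.

    n11: hits, n01: false alarms, n10: misses.
    c01: cost of a false alarm, c10: cost of a false negative.
    """
    theta = c01 / (c01 + c10)  # normalised cost, between 0 and 1
    numerator = (1 - theta) * n11 - theta * n01
    denominator = (n11 + n10) * (1 - theta)
    return numerator / denominator

# Made-up counts; sweep the normalised cost as on the slide's x-axis.
hits, false_alarms, misses = 30, 25, 20
s_low_cost = briggs_ruppert_skill(hits, false_alarms, misses, c01=3, c10=7)
s_equal_cost = briggs_ruppert_skill(hits, false_alarms, misses, c01=5, c10=5)
s_high_cost = briggs_ruppert_skill(hits, false_alarms, misses, c01=7, c10=3)
```

In this made-up example the score falls as false alarms become relatively more expensive and turns negative at θ = 0.7, which mirrors the qualitative behaviour the slide describes.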

(40)

Geomagnetic Activity Forecasts – Summary

• 1, 2 and 3 day ahead forecasts have been analysed from 2000 to 2012

• There is a tendency to underestimate activity levels for ACTIVE and STORM days only (hedging our bets?)

• Manual forecasts prove generally better than using simple persistence/recurrence forecast model(s)

• Further work required to determine a good measure for the bias and forecast performance (e.g. HSS, TSS and GS)

• Decision Theory and Briggs-Ruppert skill scores provide other perspectives on forecast value, usefulness and merit