© NERC All rights reserved
Quality Control on Space Weather Forecasts
Alan W P Thomson
Geomagnetism,
British Geological Survey, UK
Outline
•
What do we mean by a (space weather) forecast?
•
A few examples of verified forecasting systems, what
they predict and where to find them (non-exhaustive!)
•
Measuring accuracy of forecasts
•
Detailed Example 1: Predicting Solar Flares
•
Detailed Example 2: Predicting Geomagnetic Activity
What do we mean by a (Space Weather) Forecast?
•
Forecast here means a prediction of the future state (of the weather, stock market prices, or whatever)
•
Forecast verification is then the process of assessing the quality of a forecast
•
Forecasts of
• Solar flares (e.g. occurrence time; magnitude)
• CME eruption (occurrence time; magnitude)
• CME arrival at Earth (time; solar wind parameters)
• Radiation storm (onset time, duration, end time; ‘magnitude’)
• Geomagnetic storm (onset time, duration, end time; ‘magnitude’)
• ...
•
Forecasts are (mostly) of variables that either occur or do not occur, and/or are represented as a spatio-temporal series, i.e. a continuous variable
Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
Examples of Forecasting Systems
Forecast verification is done for all of these
Domain                     Example system
Solar flares               -
Coronal mass ejections     ENLIL-CONE
Solar particle events      -
Solar wind/heliosphere     ENLIL-CONE
Radiation belts            -
Magnetosphere              CCMC Various*
Thermosphere               DTM-2009
Ionosphere                 -
Geomagnetic field          -
WSA-ENLIL-CONE
•
ENLIL is a 3D MHD model of the heliosphere. It solves for plasma mass, momentum and energy density, and magnetic field. Its inner radial boundary is beyond the sonic point, typically at 21.5-30 solar radii.
•
The ENLIL cone model forecasts CME propagation from the ENLIL inner boundary. The cone model assumes that, close to the Sun, a CME propagates with constant angular and radial velocity.
•
The WSA component combines a Potential Source Surface Model with the 'Schatten' current sheet model to produce a model of the global coronal magnetic field. It uses a simple kinematic model to propagate the solar wind and the magnetic polarity from the outer boundary of the Schatten current sheet model into the heliosphere.
WSA-ENLIL-CONE: Verified
Source: Dusan Odstrcil, ENLIL: Modeling of Heliospheric Space
Magnetospheric MHD Models
•
Simulations of BATS-R-US, GUMICS, LFM and OpenGGCM MHD codes
•
Compared with Cluster (magnetosheath), Geotail (near tail), WIND (far tail) and CPCP (SuperDARN) data
Honkonen et al, Space Weather, 11, 313-326, doi: 10.1002/swe.20055, 2013
Thermospheric Models
•
The Drag Temperature Model (DTM) is a semi-empirical model describing the temperature, density, and composition of the thermosphere
•
DTM2009 and DTM2000, and the COSPAR reference models NRLMSISE-00 and JB2008, are evaluated in order to establish benchmark values for updated DTM models
•
Compared with high-resolution CHAMP and GRACE data
Thermospheric Models: Verified
•
Mean and RMS of density ratios and residuals
•
JB2008 is the most accurate below 300 km, JB2008 and DTM2009 are best at 300–500 km, NRLMSISE-00 and DTM2009 are best above 500 km
Properties of ‘Good’ Forecasts
•
Consistency
- the degree to which the forecast corresponds to
the forecaster's best judgement about the situation, based upon
his/her knowledge base
•
Quality
- the degree to which the forecast corresponds to what
actually happened
• Bias • Association • Accuracy • Skill • Reliability • Resolution • Sharpness • Discrimination • Uncertainty
•
Value
- the degree to which the forecast helps a decision maker
to realize some incremental economic and/or other benefit
Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
Types of Forecasts
Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
Definition of Verification Methods
•
Visual
– a qualitative look at the data
•
Dichotomous
– a simple binary yes/no
•
Multi-category
– extension of binary variables to many
•
Continuous
– non-binary variables, e.g. compute RMS
difference between measured and predicted
•
Probabilistic
– forecasts are probabilities, verified against
event occurring or not occurring
•
Spatial
– e.g. at what scale does the forecast best
match reality?
•
Ensemble
– e.g. how well does the ensemble spread of the
forecast represent the true variability (uncertainty) in the
observations?
Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
Measuring Accuracy of Forecasts (1)
•
Dichotomous, or binary forecasts
•
Construct a contingency table of 4 categories
• hit - event forecast to occur, and did occur
• miss - event forecast not to occur, but did occur (false negative)
• false alarm - event forecast to occur, but did not occur (false positive)
• correct negative - event forecast not to occur, and did not occur
Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
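The four categories can be counted directly from paired yes/no forecasts and observations. The following is a minimal sketch only: the function name `contingency_table` and the eight-day example data are made up for illustration and do not come from any operational system.

```python
from collections import Counter


def contingency_table(forecast, observed):
    """Count hits, misses, false alarms and correct negatives.

    forecast, observed: equal-length sequences of booleans
    (True = event forecast / event occurred).
    """
    if len(forecast) != len(observed):
        raise ValueError("forecast and observed must have the same length")
    counts = Counter()
    for f, o in zip(forecast, observed):
        if f and o:
            counts["hits"] += 1
        elif not f and o:
            counts["misses"] += 1
        elif f and not o:
            counts["false_alarms"] += 1
        else:
            counts["correct_negatives"] += 1
    keys = ("hits", "misses", "false_alarms", "correct_negatives")
    return {k: counts[k] for k in keys}


# Hypothetical data: eight daily storm / no-storm forecasts.
forecast = [True, True, False, False, True, False, False, True]
observed = [True, False, False, True, True, False, False, False]
table = contingency_table(forecast, observed)
print(table)
```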
Assessing Binary Forecasts
• Accuracy (what fraction of forecasts were correct?)
• Bias score (how did the forecast frequency of yes events compare to the observed frequency of yes events?)
• Probability of detection (what fraction of yes events were correctly forecast?)
• False alarm ratio (what fraction of the predicted yes events did not occur?)
• Probability of false detection (what fraction of the no events were forecast as yes?)
• Success ratio (what fraction of the forecast yes events were correctly observed?)
• Threat score (how well did the forecast yes events correspond to the observed yes events?)
• Equitable threat score (Gilbert score) (how well did the forecast "yes" events correspond to the observed "yes" events (accounting for hits due to chance)?)
• Hanssen and Kuipers discriminant (True skill statistic; Peirce's skill score) (how well did the forecast separate the "yes" events from the "no" events?)
• Heidke skill score (what was the accuracy of the forecast relative to that of random chance?)
• Odds ratio (what is the ratio of the odds of a yes forecast being correct, to the odds of a yes forecast being wrong?)
• Odds ratio skill score (what was the improvement of the forecast over random chance?)
Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
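Each score in the list above is a simple function of the four contingency-table counts. The sketch below uses the standard textbook definitions, which are assumed here because the slide shows no formulas. The function name `binary_scores` and the example counts are illustrative, and the code assumes every denominator is non-zero.

```python
def binary_scores(hits, misses, false_alarms, correct_negatives):
    """Standard 2x2 verification scores (assumes no zero denominators)."""
    h, m, fa, cn = hits, misses, false_alarms, correct_negatives
    n = h + m + fa + cn
    pod = h / (h + m)        # probability of detection
    pofd = fa / (fa + cn)    # probability of false detection
    # Hits expected by chance, used by the equitable threat score.
    hits_random = (h + m) * (h + fa) / n
    # Correct forecasts expected by chance, used by the Heidke score.
    expected_correct = ((h + m) * (h + fa) + (cn + m) * (cn + fa)) / n
    return {
        "accuracy": (h + cn) / n,
        "bias": (h + fa) / (h + m),
        "pod": pod,
        "far": fa / (h + fa),
        "pofd": pofd,
        "success_ratio": h / (h + fa),
        "threat_score": h / (h + m + fa),
        "ets": (h - hits_random) / (h + m + fa - hits_random),
        "tss": pod - pofd,   # Hanssen and Kuipers discriminant
        "hss": (h + cn - expected_correct) / (n - expected_correct),
        "odds_ratio": (h * cn) / (m * fa),
        "orss": (h * cn - m * fa) / (h * cn + m * fa),
    }


# Illustrative counts only (365 daily yes/no forecasts).
scores = binary_scores(hits=82, misses=23, false_alarms=38,
                       correct_negatives=222)
for name, value in scores.items():
    print(f"{name:15s} {value:6.3f}")
```

Reporting several of these together is usually more informative than any single number, since each one ignores at least one cell of the table.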
Binary Forecast Verification - Detail
Probability of detection
• Range: 0 to 1. Perfect score: 1.
• Characteristics: Sensitive to hits, but ignores false alarms. Very sensitive to the climatological frequency of the event. Good for rare events. Can be artificially improved by issuing more "yes" forecasts to increase the number of hits.
False alarm ratio
• Range: 0 to 1. Perfect score: 0.
• Characteristics: Sensitive to false alarms, but ignores misses. Very sensitive to the climatological frequency of the event. Should be used in conjunction with the probability of detection (above).
Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
Binary Forecast Verification - Detail
Threat score (TS)
• Range: 0 to 1, 0 indicates no skill. Perfect score: 1.
• Characteristics: Measures the fraction of observed and/or forecast events that were correctly predicted. It can be thought of as the accuracy when correct negatives have been removed from consideration, that is, TS is only concerned with forecasts that count.
Equitable threat score
• Range: -1/3 to 1, 0 indicates no skill. Perfect score: 1.
• Characteristics: Measures the fraction of observed and/or forecast events that were correctly predicted, adjusted for hits associated with random chance (for example, it is easier to correctly forecast rain occurrence in a wet climate than in a dry climate).
Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
Binary Forecast Verification - Detail
Heidke skill score
• Range: -1 to 1, 0 indicates no skill. Perfect score: 1.
• Characteristics: Measures the fraction of correct forecasts after eliminating those forecasts which would be correct due purely to random chance. This is a form of the generalized skill score, where the score in the numerator is the number of correct forecasts, and the reference forecast in this case is random chance.
Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
Measuring Accuracy of Forecasts (2)
•
Multi-category forecasts
• Generalisations of e.g. Accuracy, the Heidke skill score, and the Hanssen and Kuipers discriminant all exist
• Can also analyse by simply plotting histograms (observed vs. forecast for each category) or with the Gerrity skill score (what was the accuracy of the forecast in predicting the correct category, relative to that of random chance?)
Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
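The multi-category generalisations of Accuracy, the Heidke skill score and the Hanssen and Kuipers discriminant can be sketched for a K x K contingency table as below. The formulas are the standard ones and are assumed here rather than taken from the slide. The row/column orientation, the function name and the 3 x 3 example counts are all illustrative.

```python
def multicategory_scores(table):
    """Accuracy, Heidke and Hanssen-Kuipers scores for a K x K table.

    table[i][j] = number of cases forecast in category i and observed in
    category j (this orientation is an assumption of the sketch).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    forecast_totals = [sum(row) for row in table]
    observed_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    accuracy = sum(table[i][i] for i in range(k)) / n
    # Fraction of forecasts expected to be correct by chance alone.
    chance = sum(f * o for f, o in zip(forecast_totals, observed_totals)) / n ** 2
    observed_sq = sum(o * o for o in observed_totals) / n ** 2
    return {
        "accuracy": accuracy,
        "hss": (accuracy - chance) / (1 - chance),
        "hk": (accuracy - chance) / (1 - observed_sq),
    }


# Made-up three-category example (100 forecasts).
table = [[30, 5, 5],
         [10, 20, 10],
         [0, 5, 15]]
scores = multicategory_scores(table)
print(scores)
```

For K = 2 these expressions reduce to the binary Accuracy, Heidke skill score and Hanssen and Kuipers discriminant.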
Measuring Accuracy of Forecasts (3)
•
Continuous forecasts – how do numerical forecast values differ from actual values?
•
Example: Autoregressive prediction of the Ap geomagnetic index, used in the MSIS thermospheric density model by ESA for LEO orbital prediction and control from ~1995-2000
Assessing Continuous Forecasts
• Scatter plot (how well did the forecasts compare with the observed values?)
• Box plot (how well did the distribution of forecasts correspond to the distribution of observed values?)
• Mean error (what is the average forecast error?)
• Bias (how does the average forecast compare to the average observed value?)
• Mean absolute error (what is the average magnitude of the forecast errors?)
• Root mean square error (what is the average magnitude of the forecast errors?)
• Mean square error (mean squared difference between forecast and observation)
• Prediction efficiency (how well does the forecast match the observations, relative to the scatter in the observations?)
• Linear error in probability space (measures the error in probability space, depending on the cumulative probability density function of the observations, as determined from climatology)
• Stable equitable error in probability space (similar to LEPS)
• Correlation coefficient (how well did the forecast values correspond to the observed values?)
• Skill score (what is the relative improvement of the forecast over some reference forecast?)
Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
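Several of the continuous measures above can be computed in a few lines. This is a sketch under stated assumptions: bias is taken as the ratio of mean forecast to mean observation, and prediction efficiency as 1 - MSE / variance of the observations. Neither definition is given on the slide, and the Ap-like values are invented.

```python
import math


def continuous_scores(forecast, observed):
    """Basic verification measures for paired continuous values."""
    n = len(forecast)
    errors = [f - o for f, o in zip(forecast, observed)]
    mean_f = sum(forecast) / n
    mean_o = sum(observed) / n
    mse = sum(e * e for e in errors) / n
    var_f = sum((f - mean_f) ** 2 for f in forecast) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    cov = sum((f - mean_f) * (o - mean_o)
              for f, o in zip(forecast, observed)) / n
    return {
        "mean_error": sum(errors) / n,
        "bias": mean_f / mean_o,                    # multiplicative bias
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "prediction_efficiency": 1.0 - mse / var_o,
        "correlation": cov / math.sqrt(var_f * var_o),
    }


# Hypothetical daily Ap forecasts and observations.
forecast = [10, 12, 20, 35, 8]
observed = [8, 15, 18, 40, 9]
scores = continuous_scores(forecast, observed)
print(scores)
```

MAE and RMSE answer the same question on the slide, but RMSE weights large errors more heavily, so comparing the two gives a rough indication of how much the error is dominated by a few bad days.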
Continuous Forecast Verification
Linear error in probability space (LEPS)
• Range: 0 to 1. Perfect score: 0.
• Characteristics: Does not discourage forecasting extreme values if they are warranted. Requires knowledge of the climatological PDF.
Skill score
• Range: Lower bound depends on what score is being used to compute skill and what reference forecast is used, but upper bound is always 1; 0 indicates no improvement over the reference forecast. Perfect score: 1.
• Characteristics: Implies information about the value or worth of a forecast relative to an alternative (reference) forecast, e.g. persistence (no change from most recent observation) or climatology
Source: World Weather Research Programme – Forecast Verification Research Group http://www.cawcr.gov.au/projects/verification/
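A generic skill score compares the score of a forecast with that of a reference forecast. The sketch below assumes the usual form (score_forecast - score_reference) / (score_perfect - score_reference), uses mean square error as the underlying score, and uses persistence as the reference. The model values are hypothetical.

```python
def mse(forecast, observed):
    """Mean square error of paired values."""
    return sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(forecast)


def skill_score(score_forecast, score_reference, score_perfect=0.0):
    """Relative improvement over a reference: 1 is perfect, 0 is no gain."""
    return (score_forecast - score_reference) / (score_perfect - score_reference)


# Hypothetical daily Ap observations.
observed = [8, 15, 18, 40, 9, 7]
target = observed[1:]             # days being forecast
persistence = observed[:-1]       # reference: tomorrow equals today
model = [12, 20, 30, 15, 8]       # made-up model forecasts for the target days

ss = skill_score(mse(model, target), mse(persistence, target))
print(f"MSE skill score relative to persistence: {ss:.3f}")
```

A negative value would mean the model does worse than simply repeating the most recent observation.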
‘PDFLAP’ Performance Statistics
PDFLAP = autoregressive model of Ap/F10.7, with 30/60 coefficients, determined every day from 180/365 days of data
Accuracy compared to expected performance based on model tests on previous two solar cycles of data
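To illustrate the idea of an autoregressive forecast refitted from a trailing window of data, here is a minimal least-squares sketch. It is not the PDFLAP implementation, whose details are not given here: the function names, the plain normal-equations solver and the synthetic series are all assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular (no zero pivots).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ar(series, order):
    """Least-squares fit of x[t] = c + a1*x[t-1] + ... + ap*x[t-p].

    Returns [c, a1, ..., ap].
    """
    rows = [[1.0] + [series[t - k] for k in range(1, order + 1)]
            for t in range(order, len(series))]
    targets = [series[t] for t in range(order, len(series))]
    size = order + 1
    # Normal equations (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(size)]
    return solve_linear(xtx, xty)


def forecast_ar(series, coeffs, steps):
    """Iterate the fitted model forward, feeding forecasts back in."""
    order = len(coeffs) - 1
    history = list(series)
    out = []
    for _ in range(steps):
        nxt = coeffs[0] + sum(coeffs[k] * history[-k]
                              for k in range(1, order + 1))
        out.append(nxt)
        history.append(nxt)
    return out


# Synthetic noise-free AR(1) series, used only to check the fit.
series = [10.0]
for _ in range(19):
    series.append(5.0 + 0.8 * series[-1])

coeffs = fit_ar(series, order=1)
prediction = forecast_ar(series, coeffs, steps=3)
print(coeffs, prediction)
```

In a daily scheme of the kind described on the slide, the fit would be repeated each day on the most recent window, for example the last 180 values with `order=30`.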
‘PDFLAP2’ Performance Statistics
An optimised forecast model for Ap (developed ~2000)
• lags 1-3: neural net model
• lags 4-6: climatological model
• lags 7-15: ARMA model
• lags 16-27: minimum forecast between ARMA and climatology
Not physically based, therefore needs regular checking for accuracy and relevance
Detailed Example 1: Solar Flares
•
Many methods used to predict flares (and much published since 2000)
• Poisson statistics
• Bayesian statistics
• Support vector machines
• Discriminant analysis
• Neural networks
• Wavelets
• Superposed epoch analysis
• Empirical methods
• ...
•
Most methods provide a probability for an X-ray flare with peak flux of some magnitude in some time frame
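As an illustration of the simplest entry in the list, Poisson statistics turn an average flaring rate into the probability of at least one flare in a forecast window: P = 1 - exp(-rate x window). The sketch assumes flares occur as a Poisson process; the rate, window and decision threshold are made-up values.

```python
import math


def flare_probability(mean_rate_per_day, window_days=1.0):
    """P(at least one flare in the window) for a Poisson process."""
    return 1.0 - math.exp(-mean_rate_per_day * window_days)


# Hypothetical region producing on average 0.2 flares per day of the
# chosen class; forecast window of 24 hours.
p = flare_probability(0.2, window_days=1.0)

# A probability can be converted to a yes/no forecast with a threshold
# (the value 0.15 is arbitrary), after which the binary scores apply.
forecast_yes = p >= 0.15
print(f"P(flare) = {p:.3f}, forecast yes: {forecast_yes}")
```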
ASAP
•
‘Automated Solar Activity Prediction tool’: spaceweather.inf.brad.ac.uk/downloads.html
•
A machine learning-based system designed to analyze years of
sunspot and flare data to create associations that can be
represented using computer based learning rules
•
An imaging-based real-time system that provides automated
detection, grouping, and then classification of recent sunspots based
on the McIntosh classification is created and integrated within this
system
ASAP
•
Tested on 5267 SOHO MDI intensitygram images from 1 February 1999 to 31 December 2002
•
Verified against the NGDC flare catalogue
Comparing flare forecast models
•
Bloomfield et al compare a ‘thresholded-Poisson probability’ model of flaring with other methods
• Ordinal logistic regression (Song et al, 2009)
• Predictor teams (Huang et al, 2010)
• Neural networks (Ahmed et al, 2013)
•
Highlights the significance of the underlying flare/no-flare probability and therefore the preference for TSS over HSS
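The preference for TSS over HSS can be seen numerically: if the same forecast behaviour (same probability of detection and probability of false detection) is tested on samples with different proportions of flare and no-flare cases, TSS is unchanged while HSS is not. The counts below are invented purely to show this effect.

```python
def tss(h, m, fa, cn):
    """True skill statistic: POD minus POFD."""
    return h / (h + m) - fa / (fa + cn)


def hss(h, m, fa, cn):
    """Heidke skill score for a 2x2 contingency table."""
    n = h + m + fa + cn
    expected = ((h + m) * (h + fa) + (cn + m) * (cn + fa)) / n
    return (h + cn - expected) / (n - expected)


# Same POD (0.8) and POFD (0.1), different event/no-event ratios.
balanced = dict(h=80, m=20, fa=10, cn=90)       # 100 events, 100 non-events
rare = dict(h=80, m=20, fa=1000, cn=9000)       # 100 events, 10000 non-events

for label, counts in (("balanced", balanced), ("rare", rare)):
    print(f"{label:9s} TSS = {tss(**counts):.3f}  HSS = {hss(**counts):.3f}")
```

Because flare samples used by different studies have very different flare/no-flare ratios, a score that depends on that ratio makes the methods hard to compare.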
Detailed Example 2: Geomagnetic Activity
•
One to three day ahead forecast of geomagnetic activity by
BGS colleagues for various non-paying academic and other
users
•
Has been running since 2000
•
Initial analysis courtesy of Ellen Clarke, BGS
•
More to be done and presented at the European Space Weather Week
•
Forecasts given in four noon-to-noon activity classes (with explanatory text):
ACTIVITY CLASS       Daily Planetary Activity Level (Ap)
QUIET – UNSETTLED    <=15
ACTIVE               16-29
MINOR STORM          30-49
MAJOR STORM          >=50
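The class boundaries in the table map directly onto a small lookup, which is the first step in building a four-category contingency table from forecast classes and observed Ap. This is a sketch only: the function names and the five-day example are hypothetical, and the class labels are written with a plain hyphen.

```python
CLASSES = ["QUIET-UNSETTLED", "ACTIVE", "MINOR STORM", "MAJOR STORM"]


def activity_class(ap):
    """Map a daily Ap value to one of the four activity classes."""
    if ap <= 15:
        return "QUIET-UNSETTLED"
    if ap <= 29:
        return "ACTIVE"
    if ap <= 49:
        return "MINOR STORM"
    return "MAJOR STORM"


def class_contingency(forecast_classes, observed_ap):
    """4x4 table: rows = forecast class, columns = observed class."""
    table = [[0] * len(CLASSES) for _ in CLASSES]
    for f_class, ap in zip(forecast_classes, observed_ap):
        row = CLASSES.index(f_class)
        col = CLASSES.index(activity_class(ap))
        table[row][col] += 1
    return table


# Made-up example: five forecast classes and the Ap values that occurred.
forecast_classes = ["QUIET-UNSETTLED", "ACTIVE", "MINOR STORM",
                    "MAJOR STORM", "QUIET-UNSETTLED"]
observed_ap = [8, 32, 30, 45, 18]
table = class_contingency(forecast_classes, observed_ap)
for name, row in zip(CLASSES, table):
    print(f"{name:16s} {row}")
```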
% Correct Forecasts by Year
ACTIVE, MINOR and MAJOR STORM days only
% Correct Forecasts on ACTIVE, MINOR and MAJOR STORM days
Analysed by Forecaster Team Member
Forecast Team (2000-2012)
• Toby Clark
• Alan Thomson
• Ellen Clarke
• Pam White
• Allan Mackay
• Sarah Reay
• Jess King
• Orsi Baillie
• Brian Hamilton
• Thomas Humphries
• Ewan Dawson
• Gemma Kelly
• Laurence Billingham
Contingency Tables – Storm Classes
Evaluation using Forecast Skill Scores
(ACTIVE, MINOR and MAJOR STORM days)
From Another Angle: the User’s Perspective
Storms in 2002-2003
•
Use ‘Decision Theory’ approach (e.g. Matthews, 1996, 1997)
•
K = user-defined ‘loss structure’, measuring the relative costs of complacency and false alarms to the user
•
LR = likelihood ratio; LR > 1 is good
•
Odds(A) determined from the historical distribution of activity
•
Forecasts are useful where the cost of missing an event is greater than that of a false alarm (K>1)
$$\mathrm{LR} = \frac{\Pr(F \mid A)}{\Pr(F \mid \lnot A)}, \qquad \mathrm{Odds}(A) = \frac{\Pr(A)}{\Pr(\lnot A)}, \qquad K = \frac{\text{Complacency-Cost}}{\text{False-Alarm-Cost}}$$
$$\mathrm{LR} \times \mathrm{Odds}(A) > \frac{1}{K}$$
[Figure: LR against forecast lead time (1, 2 and 3 days) for BGS, SEC and Persistence]
[Figure: K against forecast lead time (1, 2 and 3 days) for BGS, SEC and Persistence]
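The decision-theory criterion LR x Odds(A) > 1/K can be evaluated from a contingency table. In this sketch Pr(F|A) and Pr(F|~A) are estimated as sample frequencies, and Odds(A) is estimated from the same table rather than from a separate historical record; both are simplifying assumptions. The function name and the counts are illustrative.

```python
def decision_theory_check(hits, misses, false_alarms, correct_negatives, k):
    """Evaluate LR * Odds(A) > 1/K for a yes/no forecast.

    k = complacency cost / false alarm cost (user-defined).
    """
    events = hits + misses
    non_events = false_alarms + correct_negatives
    pr_f_given_a = hits / events                 # Pr(F|A)
    pr_f_given_not_a = false_alarms / non_events  # Pr(F|~A)
    lr = pr_f_given_a / pr_f_given_not_a
    odds_a = events / non_events                 # Pr(A) / Pr(~A)
    return {
        "LR": lr,
        "odds_A": odds_a,
        "useful": lr * odds_a > 1.0 / k,
        # Smallest cost ratio K at which acting on the forecast pays off.
        "break_even_K": 1.0 / (lr * odds_a),
    }


# Made-up storm forecast counts over 1000 days.
counts = dict(hits=30, misses=20, false_alarms=40, correct_negatives=910)
cautious_user = decision_theory_check(k=2.0, **counts)
indifferent_user = decision_theory_check(k=1.0, **counts)
print(cautious_user)
print(indifferent_user)
```

With these counts the forecast is worth acting on only for users whose complacency cost is at least about 1.33 times their false alarm cost.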
From Another Angle: Briggs-Ruppert Skill Scores
Storms in 2002-2003
•
c01 is the false alarm cost and c10 is the cost of a ‘false negative’; nxy are the matrix elements of the corresponding contingency table (x = row, y = column)
•
‘Skill score’ S>0 implies merit
•
Forecasts are useful when the normalised cost is <0.5 (i.e. complacency cost is much more important)
•
One lesson here: use persistence for one day ahead during this phase of the solar cycle if no CME data are available or if coronal hole effects are expected to dominate!
$$S = \frac{n_{11}(1-\theta) - n_{01}\theta}{(n_{11}+n_{10})(1-\theta)}, \qquad \theta = \frac{c_{01}}{c_{01}+c_{10}}$$
[Figures: One-Day, Two-Day and Three-Day Ahead BR Skill-Scores; Skill-Score against Normalised Cost (0.1 to 0.9) for BGS, SEC and PES]
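The skill score S can be traced out as a function of the normalised cost, as in the plots for this slide. The sketch assumes S = [n11(1 - θ) - n01 θ] / [(n11 + n10)(1 - θ)] with θ = c01 / (c01 + c10). It also assumes that n11 counts hits, n01 false alarms and n10 misses, by analogy with the cost subscripts. The counts are made up.

```python
def normalised_cost(c01, c10):
    """theta = false alarm cost / (false alarm cost + miss cost)."""
    return c01 / (c01 + c10)


def briggs_ruppert_skill(n11, n01, n10, theta):
    """Skill score S; S > 0 implies merit at this normalised cost."""
    return (n11 * (1 - theta) - n01 * theta) / ((n11 + n10) * (1 - theta))


# Made-up counts: 30 hits, 40 false alarms, 20 misses.
counts = dict(n11=30, n01=40, n10=20)
thetas = [i / 10 for i in range(1, 10)]   # 0.1 ... 0.9, as on the plots
curve = [(t, briggs_ruppert_skill(theta=t, **counts)) for t in thetas]
for theta, s in curve:
    print(f"theta = {theta:.1f}  S = {s:+.3f}")
```

For these counts S changes sign at θ = n11 / (n11 + n01), about 0.43, so the forecast only has merit for users whose false alarm cost is small compared with their miss cost.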
Geomagnetic Activity Forecasts – Summary
1, 2 and 3 day ahead forecasts have been analysed from 2000 to 2012
There is a tendency to underestimate activity levels for ACTIVE and STORM days only (hedging our bets?)
Manual forecasts prove generally better than using simple persistence/recurrence forecast model(s)
Further work required to determine a good measure for the bias and forecast performance (e.g. HSS, TSS and GS)
Decision Theory and Briggs-Ruppert skill scores provide other perspectives on forecast value, usefulness and merit
This brings the end-user into the process
Summary and Conclusions
•
There isn’t a one-stop, sure-fire way to
measure ‘goodness’ or ‘usefulness’ of a
forecast
•
The forecaster needs to lay out the options and provide evidence, statistics and commentary
• From simple graphs of measured versus forecast – ‘eyeball’ the data
• Added to with more complicated skill scores
•
The user can have a role (e.g. what are the relative costs of ‘false positives’ and ‘false negatives’) and that influences which forecast is ‘best’
•
Verification remains an essential part of
determining where forecast models are ‘good’
or ‘bad’
•
And perhaps helping decide how or where