2. DEVELOPMENT OF AN IMPROVED PROTOCOL FOR EVALUATING THE
2.4 PROMPT Implementation
2.4.3 PROMPT Implementation Phase 3 (PI-P3)
The goal of PI-P3 is to test if a model estimates the necessary precision for the control requirement estimation, depending on the precision demanded by policies. Some control options may require less precise answers than others. There are cases where the imprecision can mislead policy makers, especially when the imprecision can result in an ambiguous assessment of ozone response to emission changes. Consequently, the issue of ‘precision’ becomes a context-dependent question that cannot be answered by modelers alone. At the same time, modelers cannot get the necessary information without communicating with policy makers. Therefore, modelers should consult with policy makers concerning the ‘precision’ question. For example, our expectation of the model’s desired accuracy should be different when policy makers want to see the effectiveness of two different controls: a specific point source control versus domain wide control. The specific point source control needs much more local precision and accuracy than the domain wide control, which can relieve high accuracy requirements at local scale. Thus, PROMPT requests modelers to have the following information for evaluating the anticipated precision: 1) the description of proposed control options, 2) the range of precursors that the selected modeling system can distinguish for control option tests, and 3) the potential direction of biases in precursor predictions.
In traditional approaches, different types of biases are often treated similarly. Model biases are a set of one-way errors. For example, predicted ozone at a site for an hour cannot be overestimated and underestimated simultaneously, yet the ozone field can have very small overall error with positive and negative biases at different places over a modeling domain. At the same time, more than two biased processes can also form compensating errors if their
magnitudes are similar while their contribution to ozone concentration change is opposite. Therefore, PROMPT defines two kinds of compensating errors.
Type I compensating errors are those that influence summary statistics. For example, one of the commonly used statistics in the EPA’s existing protocol is ‘normalized bias,’ an average of the sum of the difference of observed ozone and predicted ozone normalized to the observed ozone. This measure can mislead modelers. For example, two large biases with different signs can produce a small normalized bias. Therefore, to reveal Type I
compensating errors, the evaluator should report summary statistics with temporal and spatial distribution of the error for ozone and other variables. Potential compensation should be described and documented for use in judging its impact on decisions.
Type II compensating errors are the errors caused by the model’s internal compensation. For example, a PAQM can match history with NOxemissions lower than in reality if the
winds are slower or the mixing height is lower than in reality. Type I error analysis may lead to investigation of Type II errors, however, Type II errors cannot be examined only through investigating chemical concentrations because these concentrations are the product of all model processes involved in ozone formation. We strongly recommend applying PA (Tonnesen, 1995; Wang, 1997; Jang et al., 1995) if a Type II compensating error seems to occur because PA is specifically designed and implemented to investigate Type II
compensating errors.
For observations, there is no good tool to segregate process contributions to final ozone and developing such tools are beyond the scope of this study. Nevertheless, we recommend two types of analyses: (1) the observation-driven constrained steady state (CSS) box model approach (Kleinman, 2005) and (2) diagnostic indicators such as NO/NO2 ratio (Arnold et
al., 2003), if tools for these approaches are implemented or can be implemented for a SIP modeling given resources. The CSS box model approach is promising because it can provide ozone production rate, P(O3), that is directly comparable with the PA’s P(O3) output.
The focus of the procedures described in the following two sections is how to reveal the types of compensating errors and to provide guidance on further analyses. Note that the outcome of evaluating the anticipated precision will likely vary spatially and temporally. Application of the following two analyses to each monitor for every day for the selected episode period will provide necessary information to generate spatiotemporal resolved answers.
2.4.3.1 Performance analysis at observed locations
The main goal of this analysis is to assess the ‘matching history’ of the PAQMs. The most basic graphical presentations of history matching are time series plots for all available species and physical variables at each monitor. Matching history of a chemical concentration can be reliable if the matching history of the meteorological inputs is also acceptable.
Matching the history of surface winds is a good surrogate for matching the history of the meteorological inputs because wind observations are more frequently made than other meteorological factors such as the ventilation factor or mixing height. Considering surface wind speed and direction, the preferred “time series” plot is a hodogram (or hodograph), a plot showing a set of vectors sharing vector head (or tail) positions. Time series plots and hodograms are supplemented with scatter plots of predicted values verses observed values. These are used to detect the overall bias or extreme values. Other more specialized plots such as quantile-quantile plots are also used as needed.
A similar but more complex graphical analysis would be performed with segments of aircraft measurements following the general approach for monitored sites. However, extra tasks are needed to prepare data for the matching history assessment because the nature of aircraft data is different from routine ground monitor data in spatial/temporal resolution and the number of species observed.
The mere creation of the plots mentioned above is not sufficient. Agreements and disagreements must be identified and explained. At a minimum, the following two sets of questions must be asked conditionally on the quality of ‘history matching’ exhibited:
x If model predictions generally match history, is there any way this might be due to compensating errors among processes such that the apparently good match occurs for the wrong reasons? Are the process rates from the modeling system used in the study consistent with those from modeling systems that are apparently working well?
x If model predictions do not match the history, what are the likely causes of the failure? Are the physical conditions correctly simulated by the model? Is the wind speed and direction approximately correct? Is the volume of the mixed layer approximately correct? Is the vertical mixing process too slow or too fast? Are emission and deposition processes or magnitudes atypical? Are the chemical rates as expected?
x After reviewing all the monitors for each day, the results can be divided into three categories:
x Category MH-R: Those monitor sites where the model matches history reasonably well and there are no indications of compensating error.
x Category MH-A: Those monitor sites where the history matching is ambiguous and there is little evidence as to the cause.
x Category MH-U: Those monitor sites where the physical conditions simulated by the modeling system preclude a good history match for chemical concentrations,
especially for secondary products like ozone.
Where, ‘MH’ stands for ‘matching history’. ‘R’, ‘A’, ‘U’ stand for ‘Reliable’, ‘Ambiguous’, and ‘Unreliable’.
The monitors within Category MH-R are candidate sites and days for evaluating the future prediction of the model for their response to the various control strategies that are to be considered. Those monitors within Category MH-A should also be examined in the future case condition, but they would be considered as merely supportive and would not be used to actually decide if a particular control strategy would be successful. The monitors within Category MH-U would not be used in the future case evaluation.
Having determined which monitors on each day are reliable for assessing the creation of secondary pollutants, more standard EPA statistical tests can be applied to only these
monitors according to the type of attainment demonstration being performed. Note that all analyses above are not just relevant to those monitor sites that were exceeding the standards because a reliable model must also accurately predict non-exceedance monitor-days too. If a model is not able to do this, it is a strong indication that any accurate prediction on
exceedance days is the result of compensating errors and these monitor sites even on the exceedance days might not be reliable in future ozone predictions.
2.4.3.2 Analysis of model predictions at non-observed locations
One of the primary reasons for using a complex process-based model is to predict environmental state variables (such as ozone) where no observations are available. The model should be able to predict environmental state variables reliably if (1) it computes the contributions of all the important processes to the state variables (including concentrations of primary and secondary chemical species) over a certain period and (2) it estimates changes of the state variables reflecting the local conditions by summing those contributions. What warrant the reliability of the model’s predictions are the trustworthy methods for computing the contributions of the processes to the changes of the state variables and the availability of accurate inputs. If there is a reason to question the reliability of the computational methods or the accuracy of the model inputs, the model’s predictions where no observations are available become suspicious. Assessment of the model’s reliability where no observations are available is especially important to evaluation of a future control strategy because the success of the future control strategy may hinge on predictions where no observation is available. We note that this is especially true in the new 8 hour ozone test because it relies heavily on a designed value at a monitor, yet the model’s highest ozone and smallest change in ozone in the future case may be at a non-monitored location.
If a PAQM is dealing with a space among a set of monitors where the modelers have already assess their performance using the assessment strategies above, and where it could be concluded that the model predictions are matching historical observations reasonably well, then, modelers would have good reasons to accept that the model’s predictions in the non- observed area among them are reliable. On the other hand, if modelers know that a model will not give correct reduction estimations for the future case at observed locations and can
explain the deviation of a model’s prediction from the history, then modelers would have good evidence that the model’s response in an area among the poorly predicted monitors is not useful for testing policy options. In fact, it is important to know about unreliable predictions because modelers can inform policy makers that the model cannot provide scientifically defensible evidence with regard to certain proposed control options. On the other hand, if modelers are dealing with a space distant from monitors with a different physical environment, more intensive analysis efforts would be needed to conclude how the model would be reliable both in the base case and in the future case. We believe that knowledge of unreliable model performance is better than ignorance of the model’s trustworthiness when it comes to setting policy.
Several procedures, including but not limited to the following, will likely be useful in assessing the reliability of the model prediction at non-observed locations:
x Visualize the model’s inputs for the important processes in the area of interest. This might include plotting the vertical diffusivities over land and water; visualizing the low level and high level emission inputs of the area; visualizing the model’s predicted wind field; and performing dispersion simulations for selected emissions without chemistry to determine how various sources are contributing to the focus area.
x Perform process analysis of the focus region to visualize and understand the
interaction among the physical and chemical processes and to explain the state of the chemical transformations.
x Conduct selected sensitivity analyses by varying important inputs or process representations and determine the effects these have on the model. Each sensitivity
analysis may require the performance analysis component and the prediction analysis component, i.e. a recursive application of the model evaluation.
x Review the state of the science and the alternative representations available and assess if the current representation in the model is adequate. It is desirable to acquire auxiliary tools for this procedure such as a modified version of the PAQM, if possible. After reviewing all such areas for each modeled day, the results can be divided into three categories:
x Category MP-R: Those model locations where the model’s performance is more likely than not adequate and one should accept the predictions as reliable.
x Category MP-A: Those model locations where the model’s performance is ambiguous and there is little evidence as to the cause.
x Category MP-U: Those model locations where the model’s performance is either the physical conditions simulated or chemical conditions simulated by the modeling system are more likely than not resulting in biased results.
Where, ‘MP’ stands for ‘model prediction’. ‘R’, ‘A’, ‘U’ stand for ‘Reliable’, ‘Ambiguous’, and ‘Unreliable’.
After determining which non-monitored areas on which days are likely reliable for assessing the creation of secondary pollutants, those areas may be included in the standard EPA analyses according to the type of attainment demonstration being performed. On the other hand, the assessments arrived above can be major components of a weight-of-evidence argument to explain why the model’s results should not be accepted at face value.