The Fort Collins precipitation rate is a very popular dataset in the hydrology field, which contains the daily precipitation rate in Fort Collins, Colorado, USA. This data is available in extRemes package in R. It will be shifted by removing the zero values for the following analysis as zero is not inside the support of gamma and Weibull distributions.
The feature of this data is that there are multiple choices for the threshold. Scarrott and
MacDonald (2012) has demonstrated the multiple choices of the threshold selection for this data in detail and this data is discussed in Section 2.2.2 as well. In this section, all the positive bounded support models will apply this data with the thresholds of 0.395, 0.85 and 1.2. Katz et al. (2002) suggested the threshold u to be 0.395 and Scarrott and MacDonald (2012) considered the thresholds of 0.85 and 1.2. These 3 different choices will be used as the initial value of the threshold in each model. The best fit model is determined by the maximum likelihood function value (or minimum negative likelihood function value). Figures 6.11 and 6.12 are the corresponding mean residual life and parameter stability plots for the Fort Collins data.
This data is also heavily rounded, so the bandwidth can be zero as the cross validation likelihood is utilized in the boundary correction kernel GPD model. Hence, a small residual has been added to the data. The default simple boundary correction method by Jones (1993) does not work well for this data. I have found that this default method tends to overestimate the bandwidth, which leads to a poor fit to the distribution below the threshold. This may be caused by the cross-validation likelihood which used for bandwidth parameter estimation. This issue requires further investigations. Therefore, I adopt the refection method of Boneva et al. (1971) as the boundary correction method for the boundary correction kernel GPD model.
Figures 6.13 and 6.14 provide the histogram of the data overlaid with the fitting density, which suggests that all models can achieve a reasonable fit. Figures 6.15 and 6.16 are the corresponding return level plot for data. Most mixture models can only give a reasonable fit up to the 99% quantile level. Beyond the 99% quantile level, the parameterised approach based gamma GPD and boundary correction kernel GPD model can still provide a good tail exploration.
Figure 6.11: The mean residual life plot for Fort Collins precipitation data.
Figure 6.13: Histogram of Fort Collins precipitation data fitted by parametric bulk mixture model using parameterised and bulk approaches, where the vertical line represents the corre- sponding threshold for each model.
Figure 6.14: Histogram of Fort Collins precipitation data fitted by dynamically weighted mixture and boundary correction kernel GPD model, where the vertical line represents the corresponding threshold for each model.
Figure 6.15: The return level plot of Fort Collins precipitation data fitted by parametric bulk mixture model using bulk and parameterised approaches, where the vertical line represents the corresponding threshold for each model. The 95%, 99%, 99.9% quantile of the Fort Collins data are also indicated in the plot. The plot is constructed at a Gumbel scale.
Figure 6.16: The return level plot of Fort Collins precipitation data fitted by dynamically weighted mixture and boundary correction kernel GPD model, where the vertical line repre- sents the corresponding threshold for each model. The 95%, 99%, 99.9% quantile of the Fort Collins data are also indicated in the plot. The plot is constructed at a Gumbel scale.
Chapter 7
Conclusion and Future Work
7.1
Summary
This thesis evaluates the majority of the of extreme value mixture models in the literature, in- cluding implementing them in the statistical computing package R and carrying out a simulation study to compare their performance. I extended some of the existing models using different parametric components in the mixture and placing them in a more generalized framework to compare the performance under the various specifications used in the literature. The R package I created called evmix will be available on CRAN to increase the usability of such mixture models and enable these models to be further developed.
Chapter 2 reviews both the extreme value theory and the kernel density estimator. Chapter 3 presents a detailed literature review of existing extreme value models with some discussions of their properties. Some new model extensions which have continuous density at the threshold are also introduced. Chapter 4 details the user’s guide to the evmix package, demonstrating the usage of the package with examples.
The simulation study which investigates the model performance is carried out in Chapter 5. The extreme value mixture models in the evmix package can be classified into three groups: a single tail with full support, a single tail with bounded positive support and two tailed with full support model by the support of the distribution. The simulation study provides some definitive conclusions about the properties of these mixture models. High quantile estimates (99.9%) are fairly robust to the choice of bulk model while the lower quantile estimates (90%, 95%, 99%) are very sensitive to the choice of the bulk model.
The results also indicate that the kernel density estimator based mixture models are superior or close to the best performer. However, Bayesian inference based kernel GPD and boundary corrected kernel GPD suffer huge computational burden. The likelihood inference based kernel GPD and boundary corrected kernel GPD are good alternatives. MacDonald et al. (2011) and MacDonald et al. (2013) develop three kernel density estimator based extreme value mixture models by Bayesian inference. Two of the three models are implemented by Bayesian inference, and all three models are implemented by the likelihood inference as well in the evmix library. However, if the bulk model is correctly specified, then they are preferred due to their simplicity and good performance. Furthermore, in this case the bulk model based tail fraction approach is
beneficial as the tail inference essentially benefits from the extra data in the bulk, but provides a substantial disadvantage if the bulk model is mis-specified. The continuity constraint at the threshold is beneficial if the bulk model is correctly specified, but again is a disadvantage if the bulk model is mis-specified. If the bulk model is mis-specified, it is better to adopt the parameterised tail fraction approach, as it provides more flexibility and reduces the sensitivity of the tail fit on the bulk model.
In the more general situations, it is very difficult to judge whether the bulk model is mis-specified or not before fitting the data (e.g. an unknown population distribution or a complex data with bimodal or multimodal modes). The kernel density estimator based mixture models perform well in general under different population distributions, as the kernel density estimator does not assume any particular function form. A key problem with the kernel density estimator is that it does not perform well under a heavy tailed distribution due to the cross-validation likelihood based bandwidth is biased high. In this case, the kernel density estimator based mixture models can be included to have GPDs for both tails to overcome the issue. Therefore, it is better to adopt the non-parametric form mixture model rather than the parametric and semi-parametric form mixture models.
I have implemented three recent extremal mixture models in the package including two tailed normal GPD (GNG) by Zhao et al. (2010), two tailed kernel GPD by MacDonald et al. (2011) and the boundary corrected kernel GPD model by MacDonald et al. (2013), which are the most relevant models for financial applications. This work may provide a useful new insight for researchers in the field of extreme value statistics and financial models. These mixture models have been applied to the application in Danish fire insurance data and FTSE 100 index data in Chapter 6.