ABSTRACT
RAHA, SOHINI. Spatial Analysis of Duration of Heatwaves and Robust Variable Selection. (Under the direction of Sujit Ghosh and Howard Bondell.)
In Chapter 1, we give a general introduction to the problems addressed in Chapters 2 to 5.
Characterization of heatwaves is becoming increasingly important in environmental research, as heatwaves pose a significant threat to human lives worldwide. Though several quantifications of the extremity of a heatwave have been proposed in the literature, they are mostly ad hoc, and there is no universally accepted definition of a heatwave. In Chapter 2, we devise a probabilistic inferential framework to characterize heatwaves and arrive at a definition that captures the essence of all existing ad hoc definitions. We derive an exact distribution of the frequency of durations for a stationary Markov process, and an approximate distribution of durations for a stationary non-Markov time series. For a given site, using a daily time series (of ambient temperature or heat index), we define a heatwave in terms of the number of sustained days above a given threshold, using the probability distribution of the durations. We illustrate the proposed methodology using a daily time series of ambient temperature for a fixed site (Atlanta) and also using the USCRN data, consisting of 126 sites across the U.S. We also derive an empirical quadratic relationship between expected durations and extreme thresholds.
is computationally challenging, it is of utmost importance to derive a spatial model for the quadratic equations. In Chapter 3, we propose a few potential spatial models, using the idea of a Gaussian Markov Random Field. We compare these models using their predictive performance based on a 6-fold cross-validation using 120 USCRN sites across the U.S. We found that a simple Multivariate Conditionally Auto-regressive model fits our data best, and using this model, we predict the expected lengths of durations of heatwaves in the United States. The procedure described in Chapter 3 can be extended to any other large geographical region, provided we have temperature data for a few locations inside the region. In Chapter 4, we discuss a potential future extension of this spatial modeling, which requires a spatial functional analysis of the durations of heatwaves at various locations.
© Copyright 2019 by Sohini Raha
Spatial Analysis of Duration of Heatwaves and Robust Variable Selection
by Sohini Raha
A dissertation submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Doctor of Philosophy
Statistics
Raleigh, North Carolina 2019
APPROVED BY:
Brian Reich Tao Pang
Sujit Ghosh
DEDICATION
BIOGRAPHY
ACKNOWLEDGEMENTS
The last five years of my life were an extremely challenging time in both professional and personal aspects, and I am very fortunate to have had enormous support throughout this graduate life.
The majority of my accomplishments can be attributed to the constant encouragement, tireless effort and skillful guidance of my doctoral advisors, Dr. Sujit Ghosh and Dr. Howard Bondell. Through inspiring conversations, motivating examples, and dedicated mentoring, Dr. Ghosh helped me shape my academic objectives, along with imparting methods to the madness of research. Working with him has been a great learning experience for me in both my academic and non-academic endeavors. Dr. Bondell introduced me to the complex path of research and guided me patiently despite my naiveté. His gentle and tenacious counsel helped me stay focused on my goals at the most critical juncture of my life. I am immensely grateful to both of them, and I will always treasure my experience working under their guidance.
Also, I am thankful to Dr. Brian Reich, and Dr. Tao Pang, for being a part of my dissertation committee. Their invaluable recommendations and suggestions have been instrumental in the completion of this dissertation.
struggle with academic and personal welfare. Alison’s continually affectionate conversations never failed to cheer me up even in the most down-hearted days. I have always cherished their smiling presence in the department, and I have never had a dull moment with them.
I also had the wonderful opportunity to be a part of SAMSI (The Statistical and Applied Mathematical Sciences Institute) as a Research Assistant. Participating in the working groups at SAMSI has always given me a new perspective to my research problems, and also helped me build professional networks. This material is based upon work partially supported by the National Science Foundation under Grant DMS-1638521 to the Statistical and Applied Mathematical Sciences Institute (SAMSI: https://www.samsi.info/). I consider myself highly privileged to have a wonderful circle of caring friends through this difficult phase of life, and these acknowledgements would not have been complete without naming a few of them. Thank you, Jyotishka and Shalini, for tolerating me at my worst, and for providing me a home away from home when I needed it the most. Thank you, Max, for your unconditional love and adorable support which became the ray of hope in the darkest of times. Thank you, Suman, for your unique advice and guidance that always led me in the right direction. Thank you, Arkaprava, for being the friend that everybody wishes they had. Thank you, Raka, for being my timeless soul-sister in this indelible journey of life. Thank you, Arkopal, for always being there no matter what. Thank you Moumita, Indranil, Arnab, Aniket, Sayan, Sapna, Debarati, Sneha, Aritra, Sumit, Anwesha, for making my life in North Carolina so special. Thank you, Malabi, for all those thought-provoking lively conversations even from afar.
for being an exemplary human being who I can always look up to. Also, I must thank my little brother, Ritam, for keeping the love and adorable bickering alive from halfway around the world. I would like to acknowledge my grandparents for the unlimited love and pampering throughout. Also, I would like to express a special thanks to Delilah, who taught me to never give up, be adventurous and love purely.
TABLE OF CONTENTS
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 On the Probability Distribution of the Duration of Heatwaves
2.1 Introduction
2.1.1 Various Definitions of Heatwave
2.2 Probability Distributions of Duration and Intensity of Heatwave
2.2.1 Exact Distribution of Durations for Markov processes
2.2.2 Hierarchical Model based on Exact Distribution
2.2.3 Approximate Distribution of Durations
2.2.4 Hierarchical Model based on Approximate Distribution
2.3 Simulation study
2.3.1 Comparing Theoretical and Empirical Probabilities of Auto-regressive Process
2.3.2 Effect of Estimating Order of Auto-regressive Process
2.4 Data Analysis
2.4.1 Atlanta Data
2.4.2 Exploratory Analysis
2.4.3 Expected Duration corresponding to the Exact Distribution
2.4.4 Expected Duration corresponding to the Approximate Distribution
2.4.5 Definition of heatwave using expected length of durations
2.4.6 Analysis based on USCRN Data
2.4.7 Description of the USCRN data
2.4.8 Fitting the Hierarchical Model
2.5 Comparing with Real Heatwaves
2.6 Discussion
Chapter 3 Spatial Analysis of duration of Heatwaves
3.1 Introduction
3.2 Introduction to UCAR and MCAR model
3.2.1 Defining Neighborhood
3.2.2 UCAR Model
3.2.3 MCAR model
3.3 Data Analysis
3.3.1 Description of the Data
3.4 Finding the “Best” Model
3.4.1 Model 1: $\Sigma_{R,k} = \mathrm{diag}(\nu_1^2, \nu_2^2, \nu_3^2)$
3.4.2 Model 2: $\Sigma_{R,k} = \Sigma_c$
3.4.3 Model 3: $\Sigma_{R,k} = \mathrm{diag}(\nu_{1,k}^2, \nu_{2,k}^2, \nu_{3,k}^2)$
3.4.4 Cross-Validation
3.5 Prediction of Expected Lengths of Duration in the United States
3.6 Comparing with Real Heatwaves
3.7 Discussion
Chapter 4 Future Work: Spatial Functional Analysis of durations of Heatwaves
4.1 Introduction
4.2 The Motivation
Chapter 5 Outlier Robust Variable Selection without Loss of Estimation Efficiency
5.1 Introduction
5.2 Proposed Approach
5.2.1 Robust Estimator via Generalized Empirical Likelihood
5.2.2 Penalized Saddlepoint Procedure
5.2.3 Computation
5.2.4 Choice Of The Penalty Parameter
5.3 Simulation study
5.3.1 Comparison to existing estimators
5.3.2 Sampling schemes
5.3.3 Performance Measure
5.3.4 Simulation Results
5.4 Data Analysis
5.5 Discussion
References
APPENDICES
Appendix A On the Probability Distribution of the Duration of Heatwaves
A.1 Data Analysis
A.1.1 Exploratory Analysis
A.1.2 Expected Duration corresponding to the Exact Distribution
A.1.3 Expected Duration corresponding to the Approximate Distribution
LIST OF TABLES
Table 3.1 MSE, Prediction Coverage (90%) and Prediction Coverage (95%) for 6-fold cross-validation for all the coefficients for Model 1
Table 3.2 MSE, Prediction Coverage (90%) and Prediction Coverage (95%) for 6-fold cross-validation for all the coefficients for Model 2
Table 3.3 MSE, Prediction Coverage (90%) and Prediction Coverage (95%) for 6-fold cross-validation for all the coefficients for Model 3
Table 3.4 MSE, Prediction Coverage (90%) and Prediction Coverage (95%) for 6-fold cross-validation for all the coefficients for Model 1 for expected lengths
Table 3.5 MSE, Prediction Coverage (90%) and Prediction Coverage (95%) for 6-fold cross-validation for all the coefficients for Model 2 for expected lengths
Table 3.6 MSE, Prediction Coverage (90%) and Prediction Coverage (95%) for 6-fold cross-validation for all the coefficients for Model 3 for expected lengths
Table 5.1 Results for the no outlier scheme: FPR, FNR, RMSPE, AUC averaged over 500 simulations
Table 5.2 Results for the vertical outlier scheme: FPR, FNR, RMSPE, AUC averaged over 500 simulations
Table 5.3 Results for the Leverage Point outlier scheme: FPR, FNR, RMSPE, AUC averaged over 500 simulations
Table 5.4 Results for the Different β scheme: FPR, FNR, RMSPE, AUC averaged over 500 simulations
Table 5.5 Comparison table for LASSO, RLARS, sparseaLTS and our method
LIST OF FIGURES
Figure 1.1 Heatwave formation: borrowed from Wikipedia
Figure 1.2 US Fatalities due to heatwaves (source: WorldAtlas & NOAA)
Figure 2.1 Definition of duration and intensity
Figure 2.2 Empirical and theoretical duration distribution for selected values of transition probabilities and ρ and quantiles for AR(1)
Figure 2.3 Empirical and theoretical duration distribution for selected values of ρ1, ρ2 and quantiles for AR(2)
Figure 2.4 Observed (red) and simulated data (gray) from AR(2)
Figure 2.5 Expected durations with estimated order (red) and known order (blue) with the boxplot showing the Monte Carlo error for 50 simulations
Figure 2.6 Up-crossings corresponding to threshold of 81 quantile in the year 1991
Figure 2.7 The posterior distributions of Chi-square statistic for different orders
Figure 2.8 Expected durations for the quantiles interval (0.70,0.95) for exact distribution
Figure 2.9 Quadratic relationship between expected lengths and the quantiles for exact distributions
Figure 2.10 The comparison of empirical and estimated CDF (The red continuous line corresponds to the estimated Beta distribution and the black dotted line corresponds to the empirical CDF)
Figure 2.11 Plot of expected length vs. quantiles with confidence interval at selected quantiles for approximate distributions
Figure 2.12 Quadratic relationship between expected lengths and the quantiles for approximate distribution
Figure 2.13 Locations of 126 USCRN stations
Figure 2.14 Boxplot of the adjusted R² for all the stations
Figure 2.15 Spatial clustering of the estimated quadratic equations
Figure 2.16 Heatwaves in 2007 at Atlanta (gray being the heatwave days with quantile 0.8 and red being the same with quantile 0.95)
Figure 2.17 Heatwaves in 2012 at Atlanta (gray being the heatwave days with quantile 0.8 and red being the same with quantile 0.95)
Figure 3.1 Expected length of durations for a threshold of 81 percentile at USCRN weather stations
Figure 3.2 Adjacency network for three USCRN stations (marked in red) with circle radius of 200 kilometers
Figure 3.4 Plot of Crossvalidation Groups
Figure 3.5 Prediction Plot of the expected lengths at quantile 0.70
Figure 3.6 Prediction Plot of the expected lengths at quantile 0.75
Figure 3.7 Prediction Plot of the expected lengths at quantile 0.80
Figure 3.8 Prediction Plot of the expected lengths at quantile 0.85
Figure 3.9 Prediction Plot of the expected lengths at quantile 0.90
Figure 3.10 Prediction Plot of the expected lengths at quantile 0.95
Figure 3.11 Heatwave days in Chicago in 2012 using 70 percentile threshold using our definition (July 4-7 marked in red)
Figure 3.12 Heatwave days in Chicago in 2012 using 75 percentile threshold using our definition (July 4-7 marked in red)
Figure 3.13 Heatwave days in Chicago in 2012 using 80 percentile threshold using our definition (July 4-7 marked in red)
Figure 3.14 Heatwave days in Chicago in 2012 using 85 percentile threshold using our definition (July 4-7 marked in red)
Figure 3.15 Heatwave days in Chicago in 2012 using 90 percentile threshold using our definition (July 4-7 marked in red)
Figure 3.16 Heatwave days in Chicago in 2012 using 95 percentile threshold using our definition (July 4-7 marked in red)
Figure 4.1 Locations of 120 USCRN stations
Figure 5.1 ROC Curve for no outliers
Figure 5.2 ROC Curve for vertical outliers
Figure 5.3 ROC Curve for Leverage Points
Figure 5.4 ROC Curve for Different β
Figure A.1 Histograms of durations of 1991, 1997, 2003, and 2009 for 0.81 percentile in Atlanta Data
Figure A.2 Histograms of durations of the year 2002 with different quantiles in Atlanta Data
Figure A.3 Observed vs fitted durations of the years 1991, 1997, 2003, and 2009 in Atlanta Data (the red lines correspond to the observed durations and the black lines correspond to the fitted duration)
Chapter 1
Introduction
Figure 1.1: Heatwave formation: borrowed from Wikipedia
Figure 1.2: US Fatalities due to heatwaves (source: WorldAtlas & NOAA)
occurred (Meehl and Tebaldi [2004]).
While the effect of extreme heat on human health is widely studied, there is still no universally accepted definition of heatwave. There exist hundreds of definitions in the literature to quantify the extremities of these conditions, but not a unique global one. The form of the most popular definitions of heatwave is that the temperature (or a combination of humidity and temperature, e.g. apparent temperature etc.) has to be greater than a threshold, say, T °C for at least d consecutive days where the thresholds
T °C and d days are to be chosen suitably. Also, T and d can vary with locations such as
characterizing heatwave is of utmost importance.
There have been several attempts to model extremes in dependent time series (Chavez-Demoulin and Davison [2012]). But these approaches may fail to capture several existing definitions that use other block quantiles or moments of the distributions and not necessarily the maxima, e.g., HI or apparent temperature. In Chapter 2, we propose a method that provides a probabilistic model for all of the definitions that are based on a threshold of a time series and a threshold on the sustainability of a severe condition. We then validate the model with applications to two types of datasets: one for a fixed location (Atlanta, United States) and the other for varying locations (USCRN weather stations spread across the United States).
quadratic equations at each of these locations. We also explore Univariate and Multivariate Conditionally Auto-Regressive (UCAR and MCAR) models on the coefficients of these quadratic equations.
Chapter 2
On the Probability Distribution of the Duration of Heatwaves
2.1 Introduction
2.1.1 Various Definitions of Heatwave
The form of the most popular definitions of heatwave is that the temperature (or a combination of humidity and temperature, e.g. apparent temperature etc.) has to be greater than a threshold, say, T °C for at least d consecutive days where the thresholds
T °C and d days are to be chosen suitably. Also, T and d can vary with locations such
as countries, and sometimes even within various regions of a large country. In Frich et al. [2002], heatwave is defined as a period of at least 5 consecutive days when the daily maximum ambient temperature exceeds the normal daily temperature (where the normal is defined as the daily average ambient temperature based on a period of 1961-1990) by 5°C (9°F). In Karl and Knight [1997], the distribution of daytime and nighttime apparent temperature were simulated to find the rare cases where the minimum of them stay extremely high for a period of two or three consecutive days. Another slightly complicated definition of heatwave can be found in Peng et al. [2011] which has been used extensively in literature. This definition is based on two thresholds T1 and T2, where T1 is the 97.5 percentile and T2 is 81 percentile of all the daily ambient temperatures available at a given location over a pre-specified time period. The definition says “A heat wave is then defined as the longest period of consecutive days satisfying the following conditions: the daily maximum temperature is above T1 for at least 3 days, the daily maximum temperature is above T2 for every day of the entire period, and the average of daily maximum temperature over the entire period is above T1.” All of these definitions were
mortality rate due to heatwave increases by 4% when the definition is based on having daily mean ambient temperature more than 95 percentile of all available daily mean temperatures for at least 2 days, but it increases by 3% when the definition is based on having daily mean ambient temperature more than 98 percentile of all available daily mean temperatures for at least 2 days, an increase by 7% when the definition is based on having daily mean ambient temperature more than 99 percentile of all available daily mean temperatures for at least 2 days and even an increase by 16% when the definition is based on having daily mean ambient temperature more than 97 percentile of all available daily mean temperatures for at least 5 days. These results aptly show the relative impact of the choice of the thresholds of T °C and d days in defining the heatwaves on the
mortality rates based on data from similar locations.
There is another school of thought in the literature that suggests that relative humidity also plays a significant role in human health along with high temperature. So, a combination of temperature and relative humidity, i.e., the heat index (HI), should be considered in the characterization of heatwaves. In Schoen [2005], a new empirical model for heat index (HI) was proposed as
$$\mathrm{HI} = T - 1.0799\, e^{0.03755\,T}\left[1 - e^{0.0801\,(D - 14)}\right],$$
where HI, $T$, and $D$ are all in °C, $T$ is the ambient air temperature, and $D$ is the dew-point temperature, which is calculated as
$$D = \frac{b\,\alpha}{a - \alpha}, \quad \text{where } \alpha = \frac{a\,T}{b + T} + \ln(\mathrm{RH}),$$
where $a = 17.27$, $b = 237.3$, and RH is the measured relative humidity expressed as a decimal fraction. This formula for dew-point temperature is valid if $0\,°\mathrm{C} < T < 60\,°\mathrm{C}$ and $0.01 < \mathrm{RH} < 1.00$. In Yin and Wang [2018], another formula for computing heat index (HI) in Beijing was used to study its effect on cardiovascular health, which is as follows:
$$\mathrm{HI} = \begin{cases} 1.8T - 0.55\,(1.8T - 26)(1 - 0.6) + 32, & \mathrm{RH} \le 60\% \\ 1.8T - 0.55\,(1.8T - 26)(1 - \mathrm{RH}) + 32, & \mathrm{RH} > 60\% \end{cases}$$
where $T$ is in °C and is the daily maximum temperature, RH is in %, and is the daily
mean relative humidity. Wehner et al. [2018] used a definition based on a threshold on the heat index provided by NOAA, a bi-quadratic empirical function of ambient temperature ($T$) in °F and relative humidity ($R$) in %:
$$\begin{aligned} \mathrm{HI} = {} & -42.379 + 2.04901523\,T + 10.14333127\,R - 0.22475541\,TR - 0.00683783\,T^2 \\ & - 0.05481717\,R^2 + 0.00122874\,T^2R + 0.00085282\,TR^2 - 0.00000199\,T^2R^2, \end{aligned}$$
where $T > 80$°F and $R > 40\%$. If HI is between 80 and 90°F, the condition is known as “Caution”; if HI is between 90 and 103°F, it is known as “Extreme Caution”; if HI is between 103 and 125°F, it is known as “Danger”; and if HI is above 125°F, it is known as “Extreme Danger”.
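As a minimal sketch of how the NOAA heat index and its warning categories might be computed, the following Python functions evaluate the bi-quadratic regression and map the result to a category. The function names are hypothetical, the intercept is taken as −42.379 as in NOAA's published regression, and the treatment of values falling exactly on a category boundary is an assumption, since the text does not specify it.

```python
from typing import Optional


def noaa_heat_index(temp_f: float, rel_humidity_pct: float) -> float:
    """Bi-quadratic heat index (deg F) from temperature (deg F) and relative humidity (%).

    Intended for temp_f > 80 and rel_humidity_pct > 40, as stated in the text.
    """
    T, R = temp_f, rel_humidity_pct
    return (-42.379 + 2.04901523 * T + 10.14333127 * R
            - 0.22475541 * T * R - 0.00683783 * T ** 2
            - 0.05481717 * R ** 2 + 0.00122874 * T ** 2 * R
            + 0.00085282 * T * R ** 2 - 0.00000199 * T ** 2 * R ** 2)


def heat_index_category(hi_f: float) -> Optional[str]:
    """Map a heat index (deg F) to a warning category.

    Assumption: each band includes its lower bound and excludes its upper bound.
    """
    if hi_f > 125:
        return "Extreme Danger"
    if hi_f >= 103:
        return "Danger"
    if hi_f >= 90:
        return "Extreme Caution"
    if hi_f >= 80:
        return "Caution"
    return None  # below the lowest band


hi = noaa_heat_index(90.0, 50.0)  # roughly 94.6 deg F
print(round(hi, 1), heat_index_category(hi))
```

For a 90°F day at 50% relative humidity the index is roughly 94.6°F, which falls in the “Extreme Caution” band.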
produce a reliable result, the selection of the heat index needs to be more judicious. There have been several attempts at finding the probability distribution of extreme rare events. The well-known POT (peak over threshold) method is one of the most commonly used tools in extreme value theory (Leadbetter [1991], Kyselý et al. [2010]). A study on heatwaves was done using a hierarchical Bayesian model for serially dependent extremes by Reich et al. [2014]. One apparent limitation of these approaches is that only the extremes (typically taken as block maxima) of those time series can be modeled using extreme value theory. While these methods give a new perspective on the existing problems, they are not able to capture the essence of most of the existing definitions. In this chapter, we propose a probabilistic framework which is able to capture most of the existing definitions. This framework is not limited to heatwaves; it can be applied to any stationary time series and any event based on exceedance above a threshold. We apply our models to two different datasets, which are discussed in later sections.
2.2 Probability Distributions of Duration and Intensity of Heatwave
One of our primary goals is to encapsulate all the existing definitions using a probabilistic framework. So, we consider a time series $X_t$ for our analysis, where $t$ will usually denote days but may also represent any other unit of time. For example, $X_t$ can be daily maximum temperature or heat index (HI) or any other related time series, depending on the existing definition of heatwave we want to represent. In general, we want $X_t$ to be
be restricted to daily time scale. Since climate time series data tend to be cyclical with a periodicity of one year, we compute the number of heatwaves within a year and their durations and intensities for the analysis. Let $M_0$ be a chosen threshold for the time series $X_t$. $M_0$ can take different forms; for example, it can be a fixed quantity, or it can be the long-term $q$-th percentile of all the available values of that specific time series over several years. Each instance when the time series $X_t$ exceeds the threshold $M_0$ is called an up-crossing. Similarly, a down-crossing is defined as an instance when $X_t$ drops below $M_0$. Now, we define two random variables, “Duration” and “Intensity” of an up-crossing, corresponding to a specific $X_t$ and $M_0$. The “Duration” of an up-crossing is defined as the length of time $X_t$ stays above $M_0$ until the succeeding down-crossing. The “Intensity” of an up-crossing is defined as the amount of exceedance in that up-crossing, i.e., the area under the curve $X_t$ and above $M_0$ in the corresponding up-crossing. Figure 2.1 illustrates these two concepts.
Let $X_{t,j}(s)$ be the value of the time series $X_t$ on day $t$ of year $j$ at a location $s$, where $j = 1, 2, \dots, J(s)$ and $t = 1, 2, \dots, T_j(s)$, where $T_j(s)$ is the total number of days considered within the year $j$ at location $s$. In this section, we always assume that all the analysis is for a fixed location $s$ unless mentioned otherwise. So, for simplicity, we momentarily drop the symbol $s$ from subsequent definitions. Given a quantile level $q \in (0,1)$, let $M(q)$ be the long-term $q$-th quantile of the entire series $\{X_{t,j} : t = 1, \dots, T_j;\ j = 1, \dots, J\}$. $M(q)$ will be a constant function if the threshold is chosen to be a fixed quantity instead of a quantile. We define a dependent Bernoulli sequence $\{B_{t,j}(q) : t = 1, 2, \dots;\ j = 1, 2, \dots\}$, where $B_{t,j}(q) = I(X_{t,j} > M(q))$.
Figure 2.1: Definition of duration and intensity
Then, the times of crossing the threshold in the $j$-th year can be defined as follows:
$$T_{1,j}(q) = \min\{t \ge 1 : B_{t,j}(q) = 1\},$$
$$T_{2,j}(q) = \min\{t \ge T_{1,j}(q) : B_{t,j}(q) = 0\}.$$
More generally, for $m = 1, 2, \dots$ we define
$$T_{2m-1,j}(q) = \min\{t \ge T_{2m-2,j}(q) : B_{t,j}(q) = 1\},$$
$$T_{2m,j}(q) = \min\{t \ge T_{2m-1,j}(q) : B_{t,j}(q) = 0\}.$$
Using the above notations, we define Duration and Intensity of an up-crossing as follows:
Definition 2.2.1. The Duration of the $i$-th up-crossing in the $j$-th year is defined to be the number of consecutive days when $X_{t,j}$ stays above $M(q)$ for that up-crossing and is denoted by $d_{i,j}(q)$. So, $d_{i,j}(q) = T_{2i,j}(q) - T_{2i-1,j}(q)$.
Definition 2.2.2. The Intensity of the $i$-th up-crossing in the $j$-th year is defined to be the amount of exceedance in the $i$-th up-crossing and is denoted by $IN_{i,j}(q)$. So,
$$IN_{i,j}(q) = \sum_{k = T_{2i-1,j}(q)}^{T_{2i,j}(q) - 1} X_{k,j} \; - \; M(q)\,\big(T_{2i,j}(q) - T_{2i-1,j}(q)\big).$$
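Definitions 2.2.1 and 2.2.2 can be illustrated with a short Python sketch that extracts the duration and intensity of every up-crossing from one year of data. The function name and the toy series are hypothetical, and the sketch assumes that a run still above the threshold when the year ends is dropped because its down-crossing is not observed.

```python
from typing import List, Sequence, Tuple


def durations_and_intensities(x: Sequence[float],
                              threshold: float) -> List[Tuple[int, float]]:
    """Return (duration, intensity) for each completed up-crossing in one year.

    duration  = number of consecutive days with x[t] > threshold
    intensity = sum of x over those days minus threshold * duration
    """
    runs = []
    start = None  # index of the first day of the current run above the threshold
    for t, value in enumerate(x):
        above = value > threshold
        if above and start is None:
            start = t                      # up-crossing begins
        elif not above and start is not None:
            duration = t - start           # down-crossing ends the run
            intensity = sum(x[start:t]) - threshold * duration
            runs.append((duration, intensity))
            start = None
    # Assumption: a run that is still open at the end of the series is ignored.
    return runs


series = [70, 82, 85, 79, 90, 91, 92, 60]  # toy daily values
print(durations_and_intensities(series, threshold=80))
# → [(2, 7), (3, 33)]
```

Here the first up-crossing lasts 2 days with exceedance (82 + 85) − 2 × 80 = 7, and the second lasts 3 days with exceedance (90 + 91 + 92) − 3 × 80 = 33.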
Now, using these two definitions, we can express the majority of the existing definitions that are based on a threshold of a time series and on sustainability. For example, the definition used in Frich et al. [2002] can be represented if we set a cut-off for the duration of an up-crossing to be 5 days, consider $X_{t,j}$ to be $M_{t,j} - \mathrm{Avg}(M_{t,j})$, where $M_{t,j}$ is the maximum daily temperature on the $t$-th day in the $j$-th year and $\mathrm{Avg}(M_{t,j})$ is the long-term average value of the $M_{t,j}$'s, and set $M(q)$ to be 5. For the definition used in Wehner et al. [2018], we can set $X_{t,j}$ to be the heat index (HI) of the $t$-th day in the $j$-th year, and the thresholds can be kept as they are. So, it is prudent to derive the probability distributions of the Duration and Intensity of the up-crossings. In the next subsections, we derive an exact and an approximate distribution of duration, since most of the definitions can be captured by the definition of duration alone.
2.2.1 Exact Distribution of Durations for Markov processes
In order to find an exact distribution of the durations of an up-crossing, we assume that the time series $X_t$ is strictly stationary and follows a Markov process of order $k$, where
assumptions on $\{X_t : t = 1, 2, \dots, n\}$, we will show that the probability of the duration of an up-crossing being equal to $l$ is $A \times B^{l-k}$ for all $l \ge k$, for some $A \le 1$ and $B \le 1$.
Theorem 2.2.1. Let $X_{t,j}$ denote the value of the time series $X_t$ on day $t$ in the $j$-th year, $t = 1, 2, \dots, T_j$ and $j = 1, 2, \dots, J$. If for every $j$, $\{X_{t,j} : t = 1, 2, \dots, T_j\}$ follows a Markov process of order $k$ and is strictly stationary, then the $i$-th duration in year $j$ corresponding to the $q$-th quantile, $d_{i,j}(q)$, has the following probability distribution:
$$P(d_{i,j}(q) = l) = \begin{cases} \pi_{j,l}(q), & l < k \\ A_j(q) \times B_j(q)^{\,l-k}, & l \ge k \end{cases}$$
for some sequence $\{\pi_{j,l}(q) : l = 1, 2, \dots, k-1\}$ and some $A_j(q)$ and $B_j(q)$, where $A_j(q) \le 1$, $B_j(q) \le 1$ and $\pi_{j,l}(q) \le 1$ for every $l < k$.
To prove this theorem, we first give the proof in detail for $k$ equal to 1 and 2, and then extend the proof to general $k$ by induction.
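Proof. Case I: $k = 1$. Fix a $j \in \{1, \dots, J\}$. Notice that
$$P(X_{t,j} \in A \mid X_{l,j} \in B_l,\ l = t-1, t-2, \dots) = P(X_{t,j} \in A \mid X_{t-1,j} \in B_{t-1}).$$
Let $B_{t,j}(q) = I(X_{t,j} > M(q))$, where $M(q)$ is the threshold corresponding to the $q$-th quantile. Now, since we assume strict stationarity of the sequence $\{X_{t,j} : t = 1, 2, \dots;\ j = 1, 2, \dots\}$, i.e., the joint distribution of $(X_{t_1,j}, X_{t_2,j}, \dots, X_{t_k,j})$ is the same as that of $(X_{t_1+h,j}, X_{t_2+h,j}, \dots,$
stationary. For $l \ge 1$, and for $m_i = T_{i,j}(q)$,
$$\begin{aligned}
P(d_{i,j}(q) = l) &= P\big(B_{m_i-1,j}(q) = 0,\ B_{m_i,j}(q) = 1,\ \dots,\ B_{m_i+l-1,j}(q) = 1,\ B_{m_i+l,j}(q) = 0\big) \\
&= P\big(B_{m_i+l,j}(q) = 0 \mid B_{m_i+l-1,j}(q) = 1\big) \\
&\quad \times P\big(B_{m_i+l-1,j}(q) = 1 \mid B_{m_i+l-2,j}(q) = 1\big) \times \dots \times P\big(B_{m_i+1,j}(q) = 1 \mid B_{m_i,j}(q) = 1\big) \\
&\quad \times P\big(B_{m_i,j}(q) = 1 \mid B_{m_i-1,j}(q) = 0\big) \times P\big(B_{m_i-1,j} = 0\big) \\
&= P\big(B_{2,j}(q) = 0 \mid B_{1,j}(q) = 1\big) \times P\big(B_{2,j}(q) = 1 \mid B_{1,j}(q) = 1\big)^{l-1} \\
&\quad \times P\big(B_{2,j}(q) = 1 \mid B_{1,j}(q) = 0\big) \times P\big(B_{1,j} = 0\big).
\end{aligned} \tag{2.1}$$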
The last equality follows by strict stationarity. Let us denote the transition probabilities $\alpha_1 = P(B_{2,j}(q) = 1 \mid B_{1,j}(q) = 1)$ and $\alpha_2 = P(B_{2,j}(q) = 1 \mid B_{1,j}(q) = 0)$. Substituting these in Eq (2.1), we get
$$P(d_{i,j}(q) = l) = (1 - \alpha_1)\,\alpha_1^{\,l-1}\,\alpha_2 \times P(B_{1,j} = 0). \tag{2.2}$$
Putting $A_j(q) = (1 - \alpha_1)\,\alpha_2 \times P(B_{1,j} = 0)$ and $B_j(q) = \alpha_1$, the theorem is proved.
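Case II: $k = 2$. We consider the four transition probabilities that will be required to derive the distributions:
$$\alpha_1 = P(B_{3,j}(q) = 1 \mid B_{2,j}(q) = 1,\ B_{1,j}(q) = 1),$$
$$\alpha_2 = P(B_{3,j}(q) = 1 \mid B_{2,j}(q) = 1,\ B_{1,j}(q) = 0),$$
$$\alpha_3 = P(B_{3,j}(q) = 1 \mid B_{2,j}(q) = 0,\ B_{1,j}(q) = 1),$$
$$\alpha_4 = P(B_{3,j}(q) = 1 \mid B_{2,j}(q) = 0,\ B_{1,j}(q) = 0).$$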
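Now, for $l \ge 2$, and for $m_i = T_{i,j}(q)$,
$$\begin{aligned}
P(d_{i,j}(q) = l) &= P\big(B_{m_i-1,j}(q) = 0,\ \dots,\ B_{m_i+l-1,j}(q) = 1,\ B_{m_i+l,j}(q) = 0\big) \\
&= P\big(B_{m_i+l,j}(q) = 0 \mid B_{m_i+l-1,j}(q) = 1,\ B_{m_i+l-2,j}(q) = 1\big) \\
&\quad \times P\big(B_{m_i+l-1,j}(q) = 1 \mid B_{m_i+l-2,j}(q) = 1,\ B_{m_i+l-3,j}(q) = 1\big) \times \dots \\
&\quad \times P\big(B_{m_i+1,j}(q) = 1 \mid B_{m_i,j}(q) = 1,\ B_{m_i-1,j}(q) = 0\big) \\
&\quad \times P\big(B_{m_i,j}(q) = 1,\ B_{m_i-1,j} = 0\big) \\
&= P\big(B_{3,j}(q) = 0 \mid B_{2,j}(q) = 1,\ B_{1,j}(q) = 1\big) \times P\big(B_{3,j}(q) = 1 \mid B_{2,j}(q) = 1,\ B_{1,j}(q) = 1\big)^{l-2} \\
&\quad \times P\big(B_{3,j}(q) = 1 \mid B_{2,j}(q) = 1,\ B_{1,j}(q) = 0\big) \times P\big(B_{2,j}(q) = 1,\ B_{1,j} = 0\big).
\end{aligned} \tag{2.3}$$
Substituting $\alpha_1$ and $\alpha_2$ in Eq (2.3), we get
$$P(d_{i,j}(q) = l) = (1 - \alpha_1)\,\alpha_1^{\,l-2}\,\alpha_2 \times P(B_{2,j} = 1,\ B_{1,j} = 0). \tag{2.4}$$
Putting $A_j(q) = (1 - \alpha_1)\,\alpha_2 \times P(B_{2,j} = 1,\ B_{1,j} = 0)$ and $B_j(q) = \alpha_1$, the theorem is proved for $l \ge 2$. Now, for $l = 1$,
$$P(d_{i,j}(q) = l) = P\big(B_{m_i-1,j}(q) = 0,\ B_{m_i,j}(q) = 1,\ B_{m_i+1,j}(q) = 0\big).$$
From strict stationarity,
$$P(d_{i,j}(q) = l) = P\big(B_{1,j}(q) = 0,\ B_{2,j}(q) = 1,\ B_{3,j}(q) = 0\big).$$
So, if we put $\pi_{j,1} = P(B_{1,j}(q) = 0,\ B_{2,j}(q) = 1,\ B_{3,j}(q) = 0)$, the theorem is proved for $l = 1$.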
Finally, for a general Markov process of order $k$, we will have to define $2^k$ transition probabilities, where
$$\alpha_1 = P(B_{k+1,j}(q) = 1 \mid B_{k,j}(q) = 1,\ B_{k-1,j}(q) = 1,\ \dots,\ B_{1,j} = 1),$$
$$\alpha_2 = P(B_{k+1,j}(q) = 1 \mid B_{k,j}(q) = 1,\ B_{k-1,j}(q) = 1,\ \dots,\ B_{2,j} = 1,\ B_{1,j} = 0),$$
$$\alpha_3 = P(B_{k+1,j}(q) = 1 \mid B_{k,j}(q) = 1,\ B_{k-1,j}(q) = 1,\ \dots,\ B_{2,j} = 0,\ B_{1,j} = 0).$$
Proceeding like this, we will have
$$\alpha_{2^k} = P(B_{k+1,j}(q) = 1 \mid B_{k,j}(q) = 0,\ \dots,\ B_{1,j} = 0).$$
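Now, imitating the proof for orders 1 and 2, we can write, for $l \ge k$,
$$\begin{aligned}
P(d_{i,j}(q) = l) &= P\big(B_{k+1,j}(q) = 0 \mid B_{k,j}(q) = 1,\ \dots,\ B_{1,j}(q) = 1\big) \\
&\quad \times P\big(B_{k+1,j}(q) = 1 \mid B_{k,j}(q) = 1,\ \dots,\ B_{1,j}(q) = 1\big)^{l-k} \\
&\quad \times P\big(B_{k+1,j}(q) = 1 \mid B_{k,j}(q) = 1,\ \dots,\ B_{2,j} = 1,\ B_{1,j}(q) = 0\big) \\
&\quad \times P\big(B_{k,j}(q) = 1,\ \dots,\ B_{2,j} = 1,\ B_{1,j}(q) = 0\big).
\end{aligned} \tag{2.5}$$
So, substituting the values of $\alpha_1$ and $\alpha_2$ in Eq (2.5), we get
$$P(d_{i,j}(q) = l) = (1 - \alpha_1)\,\alpha_1^{\,l-k}\,\alpha_2 \times P\big(B_{k,j}(q) = 1,\ \dots,\ B_{2,j} = 1,\ B_{1,j}(q) = 0\big). \tag{2.6}$$
Putting $A_j(q) = (1 - \alpha_1)\,\alpha_2 \times P(B_{k,j}(q) = 1,\ \dots,\ B_{2,j} = 1,\ B_{1,j}(q) = 0)$ and $B_j(q) = \alpha_1$, as in the cases $k = 1$ and $k = 2$, the theorem is proved for $l \ge k$. Now, for $l < k$,
$$P(d_{i,j}(q) = l) = P\big(B_{1,j}(q) = 0,\ B_{2,j}(q) = 1,\ \dots,\ B_{l+1,j} = 1,\ B_{l+2,j}(q) = 0\big).$$
So, for $l < k$, if we put $\pi_{j,l} = P(B_{1,j}(q) = 0,\ B_{2,j}(q) = 1,\ \dots,\ B_{l+1,j} = 1,\ B_{l+2,j}(q) = 0)$, the theorem is proved for $l < k$.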
So, from this theorem, we can see that the exact distribution follows a generalized geometric-type distribution which is intuitive, since the probability of a stationary time series to stay above a threshold, should decrease after a time point.
Remark 2.2.1. Specifically, this theorem can be applied to an Auto-regressive process of order $k$. For an AR($k$) process with normal innovations, the joint and conditional distributions can be computed easily. Exact expressions of $\pi_{j,l}(q)$, $A_j(q)$ and $B_j(q)$ can be derived using multi-dimensional integrals.
In Section 2.3, we will validate this theoretical distribution with empirical distribution for Auto-regressive processes.
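For intuition, the first-order case of the theorem can be checked numerically with a small simulation. The sketch below is not the validation study of Section 2.3: it assumes a two-state Markov chain for the indicator sequence with hypothetical transition probabilities `alpha1` and `alpha2`, and compares the relative frequency of the pattern (0, 1, ..., 1, 0) with $l$ ones against $(1 - \alpha_1)\,\alpha_1^{\,l-1}\,\alpha_2 \times P(B = 0)$ from Eq (2.2). All function names are illustrative.

```python
import random
from typing import List


def simulate_binary_markov(n: int, alpha1: float, alpha2: float,
                           seed: int = 0) -> List[int]:
    """Simulate a stationary two-state chain with
    alpha1 = P(B_t = 1 | B_{t-1} = 1) and alpha2 = P(B_t = 1 | B_{t-1} = 0)."""
    rng = random.Random(seed)
    pi1 = alpha2 / (1.0 - alpha1 + alpha2)   # stationary P(B = 1)
    b = [1 if rng.random() < pi1 else 0]     # start from the stationary law
    for _ in range(n - 1):
        p_up = alpha1 if b[-1] == 1 else alpha2
        b.append(1 if rng.random() < p_up else 0)
    return b


def pattern_frequency(b: List[int], l: int) -> float:
    """Fraction of start positions at which the pattern 0, 1 (l times), 0 occurs."""
    pattern = [0] + [1] * l + [0]
    width = len(pattern)
    positions = len(b) - width + 1
    hits = sum(1 for t in range(positions) if b[t:t + width] == pattern)
    return hits / positions


def theoretical_probability(l: int, alpha1: float, alpha2: float) -> float:
    """(1 - alpha1) * alpha1^(l-1) * alpha2 * P(B = 0), as in Eq (2.2)."""
    pi0 = (1.0 - alpha1) / (1.0 - alpha1 + alpha2)  # stationary P(B = 0)
    return (1.0 - alpha1) * alpha1 ** (l - 1) * alpha2 * pi0


chain = simulate_binary_markov(200_000, alpha1=0.7, alpha2=0.2)
for l in (1, 2, 3):
    print(l, round(pattern_frequency(chain, l), 4),
          round(theoretical_probability(l, 0.7, 0.2), 4))
```

With these illustrative values the theoretical probabilities are 0.036, 0.0252 and 0.01764 for $l = 1, 2, 3$, decaying geometrically at rate $\alpha_1 = 0.7$; the simulated frequencies should be close to them.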
2.2.2 Hierarchical Model based on Exact Distribution
Our goal is to find a suitable model for the duration of the up-crossings so that we can come up with a probabilistic definition that can capture all the existing definitions of heatwave. We assume that $X_t$ is stationary and follows an Auto-regressive process of order $k$. Now, let us fix a threshold at a long-term $q$-th percentile. Let $f_{j,l}(q)$ denote the
in year $j$. Let $m$ denote the maximum length of an up-crossing (we can consider $m$ to be reasonably large). Then,
$$\big(f_{j,1}(q), f_{j,2}(q), \dots, f_{j,m}(q)\big) \mid I_j(q) \sim \mathrm{Mult}\big(I_j(q),\ (p_{j,1}(q), p_{j,2}(q), \dots, p_{j,m}(q))\big),$$
where we know from the previous subsection that
$$p_{j,l}(q) = \begin{cases} \pi_{j,l}(q), & l < k \\ A_j(q) \times B_j(q)^{\,l-k}, & l \ge k. \end{cases}$$
Now, we need to make sure that the cell probabilities of the Multinomial distribution add up to 1. So, we can consider a reparameterization of the cell probabilities with the help of the following lemma.
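Lemma 2.2.1. Suppose a discrete random variable $X$ has a distribution as follows:
$$P(X = l) = p_l = \begin{cases} \pi_l, & l < k \\ A \times B^{\,l-k}, & l = k, \ldots, m. \end{cases}$$
Then, to make sure that $\sum_{l=1}^{m} p_l = 1$, we can reparametrize as follows. Suppose $\eta = \sum_{l=k}^{m} A B^{\,l-k}$. Then, there exist $(\alpha_1, \alpha_2, \ldots, \alpha_{k-2})$ such that
$$\pi_1 = (1-\eta)\alpha_1,$$
$$\pi_l = (1-\eta)(1-\alpha_1)\cdots(1-\alpha_{l-1})\alpha_l \quad \text{for } l = 2, \ldots, k-2,$$
where the $\alpha_l$'s are required to satisfy $\alpha_l \in [0,1]$.
Notice that for any $l \in \{1, 2, \ldots, k-2\}$,
$$(1-\eta) - \sum_{j=1}^{l} \pi_j = (1-\eta) \prod_{j=1}^{l} (1-\alpha_j);$$
for example,
$$(1-\eta) - \pi_1 = (1-\eta)(1-\alpha_1),$$
$$(1-\eta) - \pi_1 - \pi_2 = (1-\eta)(1-\alpha_1)(1-\alpha_2).$$
So, with $\pi_{k-1}$ taking the remaining mass $(1-\eta)\prod_{j=1}^{k-2}(1-\alpha_j)$,
$$\sum_{l=1}^{k-1} \pi_l = (1-\eta)$$
and hence
$$\sum_{l=1}^{m} p_l = \eta + (1-\eta) = 1.$$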
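The reparametrization in Lemma 2.2.1 can be checked numerically. The following is a minimal sketch, not the author's implementation: the function name `cell_probabilities` and its argument layout are hypothetical, and it assumes that $\pi_{k-1}$ takes the mass left over after $\alpha_1, \ldots, \alpha_{k-2}$ and that $A$ is fixed by $\eta$ and $B$ through $\sum_{l=k}^{m} A B^{l-k} = \eta$.

```python
def cell_probabilities(alphas, eta, B, k, m):
    """Build p_1, ..., p_m from stick-breaking weights (hypothetical helper).

    alphas : k-2 values in [0, 1], i.e. (alpha_1, ..., alpha_{k-2})
    eta    : total mass placed on the geometric tail l = k, ..., m
    B      : geometric decay rate in (0, 1)
    """
    if len(alphas) != k - 2:
        raise ValueError("need exactly k-2 alpha values")
    probs = []
    remaining = 1.0 - eta              # mass available for l = 1, ..., k-1
    for a in alphas:
        probs.append(remaining * a)    # pi_l = (1-eta) * prod_{j<l}(1-alpha_j) * alpha_l
        remaining *= 1.0 - a
    probs.append(remaining)            # assumption: pi_{k-1} takes the leftover mass
    # A is chosen so that sum_{l=k}^{m} A * B**(l-k) equals eta
    A = eta * (1.0 - B) / (1.0 - B ** (m - k + 1))
    probs.extend(A * B ** (l - k) for l in range(k, m + 1))
    return probs


p = cell_probabilities([0.3, 0.5, 0.2], eta=0.4, B=0.6, k=5, m=30)
print(round(sum(p), 12))  # total probability
```

Because every $\alpha_l$ lies in $[0,1]$ and $\eta$ lies in $(0,1)$, each cell probability is non-negative and the total is one by construction, which is what makes independent Beta priors on these parameters convenient.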
Now, using this lemma, we reparameterize the $p_{j,l}(q)$'s and let $\alpha_{j,l}(q)$, $l = 1, 2, \ldots, (k-2)$, be the corresponding parameters. We choose Beta distributions as priors on our $k$ free parameters, the $\alpha_{j,l}(q)$'s, $\eta_j(q)$ and $B_j(q)$, since $A_j(q)$ is determined by $\eta_j(q)$ and $B_j(q)$:
$$A_j(q) = \eta_j(q)\,\frac{1 - B_j(q)}{1 - B_j(q)^{\,m-k+1}}.$$
So, the prior distributions are as follows:
$$\alpha_{j,l}(q) \sim \mathrm{Beta}\big(\mu_{\alpha,l}(q)\tau_{\alpha,l}(q),\ (1-\mu_{\alpha,l}(q))\tau_{\alpha,l}(q)\big) \quad \text{for } l = 1, 2, \ldots, (k-2),$$
$$\eta_j(q) \sim \mathrm{Beta}\big(\mu_\eta(q)\tau_\eta(q),\ (1-\mu_\eta(q))\tau_\eta(q)\big),$$
$$B_j(q) \sim \mathrm{Beta}\big(\mu_B(q)\tau_B(q),\ (1-\mu_B(q))\tau_B(q)\big).$$
For the hierarchical model, we choose the Multinomial probability as described before since that is the most intuitive choice for the counts of the durations of different lengths. We need to keep in mind that this multinomial probability distribution of the frequency of durations of a certain length is conditioned on the total number of up-crossings Ij(q)
of the durations that we derived in the previous subsection. We choose the Beta distribution as the parent distribution, allowing variation among the years, since the exact distribution is similar to a Geometric distribution after a time point and the Beta distribution is conjugate to the Geometric. We validate the goodness of fit of this model using the Chi-square goodness of fit test and, given the data, we choose the order of the AR process, $k$, such that the Chi-square goodness of fit statistic is minimized. To compute the Chi-square statistic, we need to find the population versions of the cell probabilities and compute the Chi-square distance between them and the observed frequencies. So, let $e_l(q)$ denote the population version of the $p_{j,l}(q)$'s, i.e., $e_l(q) = E(p_{j,l}(q))$ for $l = 1, 2, \ldots, m$. Then,
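$$e_1(q) = \mu_{\alpha,1}(q)\,(1-\mu_\eta(q))$$
$$e_i(q) = (1-\mu_\eta(q))(1-\mu_{\alpha,1}(q))\cdots(1-\mu_{\alpha,(i-1)}(q))\,\mu_{\alpha,i}(q) \quad \text{for } i = 2, \ldots, (k-2)$$
$$e_{k-1}(q) = (1-\mu_\eta(q))(1-\mu_{\alpha,1}(q))(1-\mu_{\alpha,2}(q))\cdots(1-\mu_{\alpha,(k-2)}(q))$$
$$e_l(q) = \mu_\eta(q)\,E\!\left[\frac{1}{1-B^{\,m-k+1}}\right] \quad \text{for } l = k, \ldots, m \tag{$*$}$$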
where $B \sim \mathrm{Beta}\big(\mu_B(q)\tau_B(q) + l - k,\ (1-\mu_B(q))\tau_B(q) + 1\big)$. So, for a fixed $k$, for year $j$, the Chi-square statistic can be computed as
$$\chi^2_j = \sum_{l=1}^{m} \frac{\big(f_{j,l}(q) - I_j(q)\,e_l\big)^2}{I_j(q)\,e_l},$$
which approximately follows a $\chi^2_{m-k-1}$ distribution. So, we choose the order of the AR process as the $k$ for which the posterior mean of the $\chi^2_j$ statistic is minimized. Then, fixing $k$, we can compute the expected duration of an up-crossing, $E(d_{i,j}(q)) = \sum_{l=1}^{m} l\,e_l(q)$, and its posterior distribution. The model can be easily implemented using the JAGS programming language.
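As an illustration of the two summaries just defined, the sketch below computes the Pearson Chi-square statistic for one year and the expected duration from a vector of cell probabilities. It is only a sketch: the function names are hypothetical, the inputs are plain Python lists rather than JAGS output, and the total count $I_j(q)$ is taken to be the sum of the frequencies.

```python
def chi_square_statistic(freqs, e):
    """Pearson statistic sum_l (f_l - I*e_l)^2 / (I*e_l), with I = sum(freqs)."""
    total = sum(freqs)  # I_j(q): total number of up-crossings in the year
    return sum((f - total * p) ** 2 / (total * p) for f, p in zip(freqs, e))


def expected_duration(e):
    """E(d) = sum_{l=1}^{m} l * e_l for cell probabilities e_1, ..., e_m."""
    return sum(l * p for l, p in enumerate(e, start=1))


e = [0.5, 0.3, 0.2]          # toy cell probabilities for lengths 1, 2, 3
observed = [6, 2, 2]         # toy frequencies of durations of each length
print(chi_square_statistic(observed, e), expected_duration(e))
```

In the procedure described above these quantities would be evaluated once per posterior draw, so that their posterior means and medians can be read off the resulting samples.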
Now, using JAGS, we can thus get a posterior sample of expected durations, and we can choose the estimated expected duration to be the posterior median. So, for a fixed location, for a fixed quantile q, we can computeE(di,j(q)), the expected duration of an
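Algorithm 1: Finding the posterior distributions of expected durations
1. Fix $q$.
2. Fix order $k \in \{1, 2, \ldots, K\}$.
3. For $r \in \{1, 2, \ldots, R\}$ (e.g., $R \ge 10^3$)
• Get $\theta^{(r)} = (\mu_{\alpha,1}, \tau_{\alpha,1}, \ldots, \mu_{\alpha,(k-2)}, \tau_{\alpha,(k-2)}, \mu_\eta, \tau_\eta, \mu_B, \tau_B)$ using JAGS.
• Compute $e^{(r)} = g(\theta^{(r)})$ as defined in (*).
4. Then obtain $\hat{k} = \arg\min_k D_k$, where $D_k = E\Big[\frac{1}{J}\sum_{j=1}^{J} \frac{\chi^{2\,(r)}_j}{m-k-1} \,\Big|\, \text{Data}\Big]$.
5. Then fix $k = \hat{k}$.
6. For each $q \in (0.7, 0.95)$, for $r \in \{1, 2, \ldots, R\}$
• Compute the expected duration $E_q^{(r)} = \sum_{l=1}^{m} l\,e_l^{(r)}(q)$.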
2.2.3
Approximate Distribution of Durations
Now, let us drop the assumption that the time series has to follow a Markov process, since that might not always be the case. It is very difficult to compute the exact distribution of durations in general, but in this subsection, we will show that an approximate distribution for stationary time series can be computed. All the analysis in this subsection is based on a fixed quantile $q$ unless mentioned otherwise. So we drop the notation $q$ momentarily to keep things simpler. In order to obtain the distribution of the $i$-th duration in the $j$-th year, i.e., $T_{2i,j} - T_{2i-1,j}$, we need to know the distribution of the sum of dependent Bernoulli random variables. This statement can be justified once we express the probability mass function of the $i$-th duration in the $j$-th year at a discrete point in the
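For a given year $j$, for any integer $l$,
$$\begin{aligned}
P(d_{i,j} = l) &= P(T_{2i,j} - T_{2i-1,j} = l) \\
&= \sum_{k=1}^{T_j} P(T_{2i,j} - T_{2i-1,j} = l \mid T_{2i-1,j} = k)\,P(T_{2i-1,j} = k) \\
&= \sum_{k=1}^{T_j} P(B_{k,j} = B_{k+1,j} = \cdots = B_{k+l-1,j} = 1,\ B_{k+l,j} = 0)\,P(T_{2i-1,j} = k) \\
&= \sum_{k=1}^{T_j} P\Big(\sum_{m=k}^{k+l-1} (1 - B_{m,j}) + B_{k+l,j} = 0\Big)\,P(T_{2i-1,j} = k)
\end{aligned} \tag{2.7}$$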
Here, both $B_{t,j}$ and $(1 - B_{t,j})$ are dependent Bernoulli sequences, and we need the distribution of the sum of a subset of these sequences. Though the exact distribution of a sum of dependent Bernoulli variables is not available in simple closed form, there is an extensive literature on Poisson approximation of dependent Bernoulli sequences and its various uses. The precision of the bound of this approximation has been improved over the years, and here we present a lemma proved in Chen et al. [2013] which gives a simple explicit bound for the Poisson approximation.
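Lemma 2.2.2 (Chen et al. [2013]). Let $\{X_\alpha : \alpha \in J\}$ be Bernoulli random variables with success probabilities $p_\alpha$, $\alpha \in J$. Let $W = \sum_{\alpha \in J} X_\alpha$ and $\lambda = EW = \sum_{\alpha \in J} p_\alpha$. Then, for any collection of sets $B_\alpha \subset J$, $\alpha \in J$,
$$d_{TV}\big(\mathcal{L}(W), \mathrm{Poi}(\lambda)\big) \le (1 \wedge \lambda^{-1})(b_1 + b_2) + (1 \wedge 1.4\,\lambda^{-1/2})\,b_3$$
and
$$\big|P[W = 0] - e^{-\lambda}\big| \le (1 \wedge \lambda^{-1})(b_1 + b_2 + b_3),$$
where
$$b_1 = \sum_{\alpha \in J}\sum_{\beta \in B_\alpha} p_\alpha p_\beta, \qquad b_2 = \sum_{\alpha \in J}\sum_{\beta \in B_\alpha,\ \beta \ne \alpha} E(X_\alpha X_\beta), \qquad b_3 = \sum_{\alpha \in J} E\,\big|E(X_\alpha \mid X_\beta, \beta \notin B_\alpha) - p_\alpha\big|.$$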
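Now, we use this lemma to approximate the distribution of the sum of the Bernoulli sequence in Eq (2.7), i.e., $\{(1 - B_{\alpha,j}) : \alpha = k, \ldots, k+l-1\} \cup \{B_{k+l,j}\}$:
$$P\Big(\sum_{m=k}^{k+l-1} (1 - B_{m,j}) + B_{k+l,j} = 0\Big) \approx \exp\Big\{-\Big[\sum_{m=k}^{k+l-1} P(B_{m,j} = 1) + P(B_{k+l,j} = 0)\Big]\Big\} \tag{2.8}$$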
Given our assumption of strict stationarity on the time series, for all $m \in \{k, k+1, \ldots, k+l\}$,
$$P(B_{m,j} = 1) = P(B_{1,j} = 1). \tag{2.9}$$
Also,
$$P(B_{k+l,j} = 0) = 1 - P(B_{1,j} = 1). \tag{2.10}$$
Now, combining Eq (2.9), (2.10) and (2.8), we get
$$P\Big(\sum_{m=k}^{k+l-1} (1 - B_{m,j}) + B_{k+l,j} = 0\Big) \approx e^{-[(l-1)P(B_{1,j} = 1) + 1]} \tag{2.11}$$
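Substituting Eq (2.11) into Eq (2.7), we get that for any positive integer $l$,
$$\begin{aligned}
P(d_{i,j} = l) &\approx \sum_{k=1}^{T_j} e^{-[(l-1)P(B_{1,j} = 1) + 1]}\,P(T_{2i-1,j} = k) \\
&= e^{-[(l-1)P(B_{1,j} = 1) + 1]} \sum_{k=1}^{T_j} P(T_{2i-1,j} = k)
\end{aligned} \tag{2.12}$$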
From Eq (2.12), we conclude that the distribution of the duration can be approximated by a Geometric distribution with success probability $1 - e^{-P(B_{1,j} = 1)}$ if we assume strict stationarity of the time series $\{X_{t,j} : t = 1, 2, \ldots;\ j = 1, 2, \ldots\}$. A closed-form bound on the distance between the exact and approximate distributions can be obtained using Lemma 2.2.2. Therefore, we state the following theorem on the approximate distribution of the duration of an up-crossing.
Theorem 2.2.2. Under the assumption of strict stationarity on the time series $(X_{t,j})$, the distribution of the $i$-th duration in the $j$-th year, $d_{i,j}$, can be approximated by the distribution of another random variable $d^*_{i,j}$, where $d^*_{i,j} \sim \mathrm{Geo}\big(1 - e^{-P(B_{1,j} = 1)}\big)$.
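To make Theorem 2.2.2 concrete, the sketch below tabulates the approximating Geometric distribution for a given exceedance probability $P(B_{1,j} = 1)$. The function names are hypothetical, and the code assumes the convention that the Geometric distribution lives on $l = 1, 2, \ldots$ with $P(d^* = l) = (1-s)^{l-1} s$ and $s = 1 - e^{-P(B_{1,j} = 1)}$, which matches the decay rate in $l$ seen in Eq (2.12).

```python
import math


def geometric_approx_pmf(p_exceed, max_len):
    """P(d* = l) for l = 1..max_len under Geo(s), s = 1 - exp(-p_exceed)."""
    s = 1.0 - math.exp(-p_exceed)   # success probability from Theorem 2.2.2
    return [(1.0 - s) ** (l - 1) * s for l in range(1, max_len + 1)]


def geometric_approx_mean(p_exceed):
    """Mean 1/s of the approximating Geometric distribution on {1, 2, ...}."""
    return 1.0 / (1.0 - math.exp(-p_exceed))


pmf = geometric_approx_pmf(0.2, 500)   # e.g. threshold at the 0.8 quantile
print(pmf[:3], geometric_approx_mean(0.2))
```

Under this convention successive probabilities shrink by the constant factor $e^{-P(B_{1,j} = 1)}$, so higher thresholds (smaller exceedance probability) give a slower relative decay per step but a smaller success probability $s$.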
Though strict stationarity is a sufficient condition for proving Theorem 2.2.2, it need not be a necessary condition for the approximate distribution of durations to be Geometric. Suppose we have a non-stationary time series $\{X_{t,j} : t = 1, 2, \ldots;\ j = 1, 2, \ldots\}$ that we can write as $X_{t,j} = \mu_t + \sigma_t \epsilon_{t,j}$, where $t = 1, \ldots, T_j$ and $j = 1, \ldots, J$. An empirical estimate of $\mu_t$ can be $\hat{\mu}_t = \frac{1}{J}\sum_{j=1}^{J} X_{t,j}$, or the median of $(X_{t,1}, \ldots, X_{t,J})$, or a moving-average-based estimate $\tilde{\mu}_{t,j} = \frac{\sum_{m=t-d}^{t+d} X_{m,j}}{2d+1}$ with $\hat{\mu}_t = \frac{1}{J}\sum_{j=1}^{J} \tilde{\mu}_{t,j}$, where $X_{t,j} = 0$ if $t \le 0$ or $t > T_j$. Thus, $\hat{\mu}_t$ is the long-term average of the $X_{t,j}$'s. Similarly, we estimate $\sigma_t$ by the long-term standard deviation of the $X_{t,j}$'s, i.e., $\hat{\sigma}^2_t = \frac{1}{J-1}\sum_{j=1}^{J} (X_{t,j} - \hat{\mu}_t)^2$, where $t = 1, \ldots, T_j$. Also, we can use a robust version of the standard deviation in case $\hat{\mu}_t$ is the median, i.e., $\hat{\sigma}_t = \frac{1}{J}\sum_{j=1}^{J} |X_{t,j} - \hat{\mu}_t|$.
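The day-wise estimates above can be sketched in a few lines. This is an illustrative sketch only: the function name and the nested-list layout `x[j][t]` (year `j`, day `t`) are assumptions, every year is assumed to have the same number of days with no missing values, and only the mean and sample standard deviation variant is shown. The last returned item is the standardized series implied by $X_{t,j} = \mu_t + \sigma_t \epsilon_{t,j}$.

```python
import statistics


def standardize_by_day(x):
    """Return (mu_hat, sigma_hat, residuals) for x[j][t] = day t of year j."""
    n_days = len(x[0])
    # mu_hat[t]: average over years of the value on calendar day t
    mu = [statistics.mean(year[t] for year in x) for t in range(n_days)]
    # sigma_hat[t]: sample standard deviation (divisor J-1) over years
    sigma = [statistics.stdev(year[t] for year in x) for t in range(n_days)]
    # residuals[j][t] = (x[j][t] - mu_hat[t]) / sigma_hat[t]
    resid = [[(year[t] - mu[t]) / sigma[t] for t in range(n_days)] for year in x]
    return mu, sigma, resid


toy = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]   # 3 years, 2 days
mu_hat, sigma_hat, eps_hat = standardize_by_day(toy)
print(mu_hat, sigma_hat, eps_hat[0])
```

Swapping `statistics.mean` for `statistics.median` and the standard deviation for a mean absolute deviation would give the robust variant mentioned in the text.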
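So instead of $X_{t,j}$, we can work with an estimate of $\epsilon_{t,j}$, namely $\hat{\epsilon}_{t,j} = \frac{X_{t,j} - \hat{\mu}_t}{\hat{\sigma}_t}$, and its quantiles $m(q)$, where $m(q) = \frac{M(q) - \hat{\mu}_t}{\hat{\sigma}_t}$, $M(q)$ being the $q$-th quantile of the $X_{t,j}$'s. But the distribution of $B_{t,j}$ will remain the same since $I(X_{t,j} \ge M(q)) = I(\hat{\epsilon}_{t,j} \ge m(q))$. So,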
2.2.4
Hierarchical Model based on Approximate Distribution
Now, with the support of the framework of Subsection 2.2.3, we design a hierarchical model for durations of the up-crossings for a chosen quantile $q$. We assume
$$d_{i,j}(q) \sim \mathrm{Geo}(p_j(q)),$$
$$p_j(q) \sim \mathrm{Beta}\big(\alpha(q)\tau(q),\ (1-\alpha(q))\tau(q)\big),$$
where $d_{i,j}(q)$ is the $i$-th duration in year $j$ for quantile $q$, and $\alpha(q)$ and $\tau(q)$ need to be estimated. We choose the Geometric distribution for durations in a year since the approximate distribution of durations is theoretically Geometric, based on the material presented in the previous subsection. We assume that there is yearly variation in the “success” probability of the Geometric distributions of durations, but that these probabilities are linked through a parent distribution, which we choose to be Beta since Beta and Geometric distributions form a conjugate family. We later validate this assumption in real case studies. Due to this conjugacy, estimating $\alpha(q)$ and $\tau(q)$ by the maximizer of the marginal likelihood becomes computationally straightforward. The reasoning is as follows. The marginal likelihood of $(\alpha(q), \tau(q))$ is given by
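$$\begin{aligned}
L(\alpha(q), \tau(q)) &= \prod_{j=1}^{J} \int p_j(q)^{\sum_i d_{i,j}(q) - I_j(q)}\,(1 - p_j(q))^{I_j(q)}\ \frac{p_j(q)^{\alpha(q)\tau(q) - 1}\,(1 - p_j(q))^{(1-\alpha(q))\tau(q) - 1}}{B\big(\alpha(q)\tau(q), (1-\alpha(q))\tau(q)\big)}\,dp_j(q) \\
&= \prod_{j=1}^{J} \frac{B\big(\alpha(q)\tau(q) + \sum_i d_{i,j}(q) - I_j(q),\ I_j(q) + (1-\alpha(q))\tau(q)\big)}{B\big(\alpha(q)\tau(q), (1-\alpha(q))\tau(q)\big)}
\end{aligned} \tag{2.13}$$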
where $B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$ and $I_j(q)$ is the number of up-crossings in year $j$ with threshold $M(q)$. As the quantile $q$ increases, the number of up-crossings in a year decreases, and sometimes there are no up-crossings at all for an entire year. One way to deal with this issue is to eliminate that year from the analysis, but this discards valuable information. So instead, we keep that year's input by simply setting $\sum_i d_{i,j}(q)$ to 0 in the likelihood, Eq (2.13). We do not, however, need to estimate the average durations through the marginal likelihood for higher quantiles, because later we show that those can be extrapolated using the quadratic relationship between the average durations and quantiles. The details of this analysis are provided in Section 2.4. We can still estimate the parameters based on the $d_{i,j}(q)$'s even for a year with missing data (or no up-crossings) by computing the conditional likelihood, which is simple because of the conjugacy of these two distributions.
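The marginal likelihood is easy to evaluate on the log scale with the log-Gamma function. The sketch below is an assumption-laden illustration rather than the estimation code used in this work: the function names, the list-of-lists input, and the crude grid search are all hypothetical, and the per-year term is taken to be the ratio of Beta functions obtained by integrating out $p_j(q)$.

```python
from math import lgamma, log


def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal_likelihood(alpha, tau, durations_by_year):
    """Log of the Beta-Geometric marginal likelihood of (alpha, tau).

    durations_by_year: one list of durations per year; an empty list is a
    year with no up-crossings and contributes a factor of 1.
    """
    a, b = alpha * tau, (1.0 - alpha) * tau
    total = 0.0
    for d in durations_by_year:
        n = len(d)          # I_j: number of up-crossings in the year
        s = sum(d) - n      # sum_i d_ij - I_j
        total += log_beta(a + s, b + n) - log_beta(a, b)
    return total


def grid_maximize(durations_by_year, alphas, taus):
    """Crude grid search; alphas must lie strictly inside (0, 1)."""
    return max(((log_marginal_likelihood(a, t, durations_by_year), a, t)
                for a in alphas for t in taus))


data = [[1, 3, 2], [], [4, 1], [2, 2, 5, 1]]   # toy durations for 4 years
print(grid_maximize(data, [i / 10 for i in range(1, 10)], [2, 5, 10, 20]))
```

A year with no up-crossings has $n = 0$ and $s = 0$, so its term is exactly zero on the log scale, which is consistent with keeping such years in the product without affecting the estimate.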
The empirical validation of the assumed hierarchical model is done in Section 2.4 using exploratory analysis of the data sets. Now, given this model, we can quantify the average or expected length of durations corresponding to the chosen threshold $M(q)$:
$$E(d_{i,j}(q) \mid p_j(q)) = \frac{1}{1 - p_j(q)},$$
$$E(d_{i,j}(q)) = E\left[\frac{1}{1 - p_j(q)}\right] = \frac{B\big(\alpha(q)\tau(q), (1-\alpha(q))\tau(q) - 1\big)}{B\big(\alpha(q)\tau(q), (1-\alpha(q))\tau(q)\big)} = \frac{\tau(q) - 1}{(1-\alpha(q))\tau(q) - 1}. \tag{2.14}$$
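We plug the estimates $\hat{\alpha}(q)$ and $\hat{\tau}(q)$ of $\alpha(q)$ and $\tau(q)$ into Eq (2.14) to get the expected length of durations where the threshold is $M(q)$. We estimate the confidence interval using the delta method: the asymptotic variance of $\frac{\hat{\tau}(q) - 1}{(1-\hat{\alpha}(q))\hat{\tau}(q) - 1}$ is $\hat{a}(q)\,\hat{\Sigma}(q)\,\hat{a}(q)^T$, where $\Sigma(q)$ is the asymptotic variance-covariance matrix of $\hat{\alpha}(q)$ and $\hat{\tau}(q)$, and
$$a(q) = \left(\frac{\tau(q)(\tau(q) - 1)}{\big((1-\alpha(q))\tau(q) - 1\big)^2},\ \frac{-\alpha(q)}{\big((1-\alpha(q))\tau(q) - 1\big)^2}\right)$$
is the gradient of Eq (2.14) with respect to $(\alpha(q), \tau(q))$.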
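The closed form in Eq (2.14) and its delta-method variance can be sketched as follows. The function names are hypothetical and the covariance matrix is passed in as a plain 2x2 nested list, which is an assumption about how $\hat{\Sigma}(q)$ would be supplied. Direct differentiation of $(\tau - 1)/((1-\alpha)\tau - 1)$ gives a negative partial derivative with respect to $\tau$, and that sign is what the code uses.

```python
def beta_geometric_mean(alpha, tau):
    """Expected duration (tau - 1) / ((1 - alpha) * tau - 1), as in Eq (2.14)."""
    denom = (1.0 - alpha) * tau - 1.0
    if denom <= 0:
        raise ValueError("mean is finite only when (1 - alpha) * tau > 1")
    return (tau - 1.0) / denom


def mean_gradient(alpha, tau):
    """Partial derivatives of the expected duration w.r.t. (alpha, tau)."""
    denom = (1.0 - alpha) * tau - 1.0
    return (tau * (tau - 1.0) / denom ** 2, -alpha / denom ** 2)


def delta_method_variance(alpha, tau, cov):
    """a * Sigma * a^T for a 2x2 covariance matrix cov of (alpha_hat, tau_hat)."""
    g = mean_gradient(alpha, tau)
    return sum(g[r] * cov[r][c] * g[c] for r in range(2) for c in range(2))


est = beta_geometric_mean(0.5, 10.0)
var = delta_method_variance(0.5, 10.0, [[0.01, 0.0], [0.0, 4.0]])  # toy covariance
print(est, est - 1.96 * var ** 0.5, est + 1.96 * var ** 0.5)
```

The last line forms a normal-approximation interval from the delta-method variance; the 1.96 multiplier for a 95% interval is a conventional choice, not something fixed by the text.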
Now, for a chosen quantile $q$, if the length of a duration exceeds the expected length of duration, we can consider it abnormal, and hence that period of consecutive days can be considered a heatwave. Thus, we fix the minimum length of duration of a heatwave with probabilistic support. Also, the hierarchical model can be used to quantify the likelihood of a heatwave forming in a particular year, and hence can be used for prediction of heatwaves. Moreover, there is a quadratic relationship between the expected length of duration, $E(d_{i,j}(q))$, and the quantile $q$, with which the expected length of duration at a fixed location can easily be determined. We discuss this in greater detail in Section 2.4.
2.3
Simulation study
2.3.1
Comparing Theoretical and Empirical Probabilities of Auto-regressive Process
We first compute the transition probabilities $\alpha_1$ and $\alpha_2$ associated with an Auto-regressive process of order 1, where $\alpha_1 = P(B_{2,j}(q) = 1 \mid B_{1,j}(q) = 1)$ and $\alpha_2 = P(B_{2,j}(q) = 1 \mid B_{1,j}(q) = 0)$, when the Auto-regressive process $X_t$ can be described as $X_t = \rho X_{t-1} + a_t$, the $a_t$'s being the associated innovations, which follow a Normal distribution.
Now, we compute the theoretical probabilities of the durations with these transition probabilities as
$$P(d_{i,j}(q) = l) = (1-\alpha_1)\,\alpha_1^{\,l-2}\,\alpha_2 \times P(B_{2,j} = 1, B_{1,j} = 0) \tag{2.15}$$
Now, we generate 1000 observations of yearly data (365 days) of the Bernoulli sequence for chosen quantiles with these transition probabilities using the Binomial distribution, i.e., $B_t \sim \mathrm{Binomial}(1, p_t)$ where $p_t = \alpha_1(1 - B_{t-1}) + (1 - \alpha_2)B_{t-1}$. We then compute the empirical probabilities of durations using this simulated data. We repeat these steps for different values of $\alpha_1$, $\alpha_2$, $\rho$ and quantiles. We observe that the empirical and theoretical distributions match in each case. Figure 2.2 shows the comparison of theoretical and empirical probabilities for AR(1) processes.
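The simulation step can be sketched with a two-state Markov chain. This is a simplified illustration under stated assumptions, not the original simulation code: it simulates one long chain instead of separate 365-day years, it reads the displayed formula for $p_t$ with `p01` as the probability of moving from 0 to 1 and `p10` as the probability of moving from 1 to 0, and it compares run lengths with the conditional law $(1 - p_{10})^{l-1} p_{10}$ of a completed run rather than with Eq (2.15) itself. All function names are hypothetical.

```python
import random


def simulate_chain(n, p01, p10, rng):
    """Simulate n steps of a 0/1 Markov chain started from state 0."""
    b, out = 0, []
    for _ in range(n):
        p_one = p01 if b == 0 else 1.0 - p10   # P(next state = 1 | current state)
        b = 1 if rng.random() < p_one else 0
        out.append(b)
    return out


def run_lengths(bits):
    """Lengths of completed runs of 1s; a run still open at the end is dropped."""
    runs, current = [], 0
    for b in bits:
        if b:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    return runs


rng = random.Random(1)
runs = run_lengths(simulate_chain(200_000, 0.135, 0.573, rng))
for l in range(1, 6):
    empirical = runs.count(l) / len(runs)
    theoretical = (1 - 0.573) ** (l - 1) * 0.573
    print(l, round(empirical, 3), round(theoretical, 3))
```

The transition values 0.135 and 0.573 are taken from the first panel of Figure 2.2 (rho = 0.5, q = 0.8); any other panel's values can be substituted.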
Now, computation of the four transition probabilities is also easy for an Auto-regressive process of order 2. So we compute the transition probabilities and compute the empirical and theoretical probabilities for AR(2) processes with selected values of $\rho_1$ and $\rho_2$, where the Auto-regressive process $X_t$ can be described as $X_t = \rho_1 X_{t-1} + \rho_2 X_{t-2} + a_t$, $a_t$ being
[Figure 2.2 panels (theo_prob vs. emp_prob, durations 1 to 10): alpha = 0.135, beta = 0.573, rho = 0.5, q = 0.8; alpha = 0.076, beta = 0.674, rho = 0.5, q = 0.9; alpha = 0.041, beta = 0.753, rho = 0.5, q = 0.95; alpha = 0.095, beta = 0.404, rho = 0.75, q = 0.8; alpha = 0.052, beta = 0.493, rho = 0.75, q = 0.9; alpha = 0.029, beta = 0.562, rho = 0.75, q = 0.95; alpha = 0.049, beta = 0.169, rho = 0.95, q = 0.8; alpha = 0.033, beta = 0.199, rho = 0.95, q = 0.9; alpha = 0.025, beta = 0.218, rho = 0.95, q = 0.95]
Figure 2.2: Empirical and theoretical duration distribution for selected values of transition probabilities, ρ, and quantiles for AR(1)
[Figure 2.3 panels (theo_prob vs. emp_prob, durations 1 to 10): rho1 = 0.6, rho2 = 0.09 at q = 0.8, 0.9, 0.95; rho1 = 0.5, rho2 = 0.06 at q = 0.8, 0.9, 0.95; rho1 = 0.2, rho2 = 0.01 at q = 0.8, 0.9, 0.95]
Figure 2.3: Empirical and theoretical duration distribution for selected values of ρ1, ρ2 and quantiles for AR(2)
simulate from an Auto-regressive process as follows:
$$X_{t,j} = 0.79\,X_{t-1,j} - 0.03\,X_{t-2,j} + a_{t,j},$$
where $t \in \{16, 17, \ldots, 350\}$, $j \in \{1, 2, \ldots, 30\}$, and then we add the estimated trend averaged over all 22 years. Figure 2.4 shows that the simulated data and the observed data are similar, so the simulated data provide a reasonable check on the inferences we draw from the actual data.
[Figure 2.4 plot: observed and simulated temperature (y-axis, 0 to 120) against days (x-axis, 50 to 350).]
Figure 2.4: Observed (red) and simulated data (gray) from AR(2)
[Figure 2.5 plot: boxplots for orders 1 to 25; legend: fixed order, estimated order.]
Figure 2.5: Expected durations with estimated order (red) and known order (blue) with the boxplot showing the Monte Carlo error for 50 simulations
2.4
Data Analysis
2.4.1
Atlanta Data
Description of the data. The Atlanta data was obtained from the National Climatic Data Center for the first-order weather station located at the Atlanta Hartsfield International Airport and was used in the study of the association between Emergency Department visits and heatwaves in Chen et al. [2017]. This data consists of daily values of many weather-related variables, including various types of temperatures, for the years 1991 to 2012. The variables reported in the dataset are mostly those which might be used to characterize a heatwave, e.g., maximum and minimum ambient temperature, maximum and minimum dew-point temperature, maximum and minimum apparent temperature, etc. For our analysis, we use daily maximum ambient temperature just for illustration. It may be noted that ambient is defined as “of the surrounding area or environment” (definition by NOAA). Any of the aforementioned variables can also be used since our proposed model does not depend on a specific time series. The Atlanta data also contains hourly values of many other variables (including solar radiation, relative humidity, etc.), but for our analysis, we mostly illustrate our methodology using daily data. Maximum or minimum apparent temperature can be used in case we want to incorporate relative humidity.
2.4.2
Exploratory Analysis
We first set a quantile q and look at the corresponding up-crossings in each individual
year. Following the popular choice (e.g. Peng et al. [2011]) we first set the quantile q
2, 6, 7, 1, 2, 9, 1, 2, 5, 3, and 9 respectively.
[Figure 2.6 plot: maximum ambient temperature (y-axis, 20 to 100) against days (x-axis, 0 to 200), year 1991.]
Figure 2.6: Up-crossings corresponding to the threshold at the 0.81 quantile in the year 1991
2.4.3
Expected Duration corresponding to the Exact Distribution
As we have already seen in the simulation study, estimating the order as described in Algorithm 1 does not change the estimate of expected durations. So, we will estimate the order of the Auto-regressive process first by minimizing the expected Chi-square statistic. Using JAGS, we get the posterior distributions of the $D_k$'s (see Algorithm 1), and Figure 2.7 shows the boxplot of posterior samples of the $D_k$'s where $k \in \{2, 3, \ldots, 13\}$. From Figure 2.7,
[Figure 2.7 plot: boxplots for orders 2 to 13 (x-axis), values from 1 to 7 (y-axis).]
Figure 2.7: The posterior distributions of Chi-square statistic for different orders
we can see that k = 5 minimizes the expected Chi-square statistic. Using this
as thresholds would result in a significantly higher number of up-crossings, which may not be appropriate. Also, in the literature, quantiles range from 0.80 to 0.99, so we would like to base inferences on that range of quantiles. We extend the range slightly to the left, since later in this section we find relationships between the expected lengths and the quantiles, which require a wider range. We also cap the range at 0.95, since analysis with higher thresholds will not be reliable: there will be very few up-crossings in a year, or none at all, if we choose quantiles higher than 0.95. So, the range (0.70, 0.95) is a reasonable range of quantiles for our analysis, and Figure 2.8 shows the expected lengths of durations in this range with their credible intervals.
[Figure 2.8 plot (caption not recovered): expected length (y-axis, 2 to 7) against threshold quantile (x-axis, 0.70 to 0.95).]
From Figure 2.8, we see that the expected length of durations decreases with higher quantiles, which agrees with the fact that the chosen time series cannot stay above higher thresholds for a long time. Also, it is evident that there might be a simple polynomial relationship between the quantiles and the expected length. So, we fit three different relationships between them, a linear, a log-linear, and a quadratic, and choose the one with the highest adjusted $R^2$. We find that the quadratic relationship is the best among them, with an adjusted $R^2$ of 0.99. While fitting the quadratic equation, we assume that the errors are independent, but this may not be the case in practice. Even if we relax this assumption, we will still have unbiased estimates of the coefficients, but they will be less efficient. Equation (2.16) shows the fitted relationship between expected length and quantiles. For $q \in (0.65, 0.99)$,
$$E(d_{ij}(q)) = 3.9 - 4.55\,(-2.16 + 2.61q + 4.01 \times 10^{-15} q^2) + 0.29\,(26.37 - 64.46q + 39.06q^2) \tag{2.16}$$
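Equation (2.16) can be evaluated directly at any quantile in its stated range. The sketch below simply transcribes the printed coefficients; the function name is hypothetical, and the rounded coefficients mean the output is only as accurate as the two-decimal values printed in the equation.

```python
def expected_length_exact(q):
    """Evaluate the fitted relationship in Eq (2.16) at quantile q."""
    if not 0.65 < q < 0.99:
        raise ValueError("Eq (2.16) is stated for q in (0.65, 0.99)")
    linear_term = -2.16 + 2.61 * q + 4.01e-15 * q ** 2
    quadratic_term = 26.37 - 64.46 * q + 39.06 * q ** 2
    return 3.9 - 4.55 * linear_term + 0.29 * quadratic_term


for q in (0.70, 0.81, 0.90, 0.95):
    print(q, round(expected_length_exact(q), 2))
```

The two bracketed polynomials in Eq (2.16) look like orthogonal-polynomial basis terms, which would explain the negligible $q^2$ coefficient in the first one; this reading is an inference from the printed form, not something stated in the text.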
Figure 2.9 represents the estimated quadratic relationship between the observed expected lengths and the quantiles.
We can now compute the expected length of durations at a fixed location using the quadratic equation. Using this relationship, we can provide a definition of heatwave that encompasses all the existing definitions of heatwave.
2.4.4
Expected Duration corresponding to the Approximate Distribution
[Figure 2.9 plot: "Quadratic fit", expected duration (y-axis, 3 to 6) against quantiles (x-axis, 0.70 to 0.95).]
Figure 2.9: Quadratic relationship between expected lengths and the quantiles for exact distributions
and see if the results match or not. As the proposed approximate model suggests that, for a fixed quantile $q$, the durations in a year, $d_{ij}(q)$, follow a Geometric distribution, we first
up-crossings in general, which is consistent with the theory we developed for the approximate distributions. Figures corresponding to this model validation can be found in the Appendix. With this conclusion, we now turn to the validation of the hierarchical model. We use the marginal likelihood method to estimate the parameters $\alpha$ and $\tau$ of the hierarchical Beta distribution. Figure 2.10 compares the empirical cumulative distribution function with the cumulative distribution function of the estimated Beta distribution, suggesting a suitable fit.
[Figure 2.10 plot: "Fitting Beta", CDF (y-axis, 0.0 to 1.0) against value (x-axis, 0.0 to 1.0).]
Figure 2.10: The comparison of empirical and estimated CDF (the red continuous line corresponds to the estimated Beta distribution and the black dotted line corresponds to the empirical CDF)
goodness of fit test (Anderson and Darling [1952]) using the R package ‘goftest’. The p-value corresponding to the Kolmogorov-Smirnov test is 0.76 and that corresponding to the Anderson-Darling test is 0.68. Both p-values are much greater than 0.05, suggesting the proposed hierarchical model is a good fit for the Atlanta data. Though we show this result for a chosen quantile of 0.81, this model is a good fit even if we vary the quantiles between 0.70 and 0.95, the range of quantiles we have chosen. We now compute the expected lengths of duration, $E(d_{ij}(q))$, for different choices of quantiles. Figure 2.11 shows the plot of expected length vs. quantile for the Atlanta data. From Figure 2.11, we see
Figure 2.11: Plot of expected length vs. quantiles with confidence interval at selected quantiles for approximate distributions
the expected lengths are similar to the ones with the exact distribution; the difference between them might be a result of the Geometric approximation of the exact distribution. Equation (2.17) shows the fitted relationship between expected length and quantiles. For $q \in (0.65, 0.99)$,
$$E(d_{ij}(q)) = 4.57 - 5.36\,(-2.16 + 2.61q + 4.01 \times 10^{-15} q^2) + 0.28\,(26.37 - 64.46q + 39.06q^2) \tag{2.17}$$
Figure 2.12 represents the estimated quadratic relationship between the observed expected lengths and the quantiles.
[Figure 2.12 plot: "Quadratic fit", expected_length (y-axis, 3 to 6) against quantiles (x-axis, 0.70 to 0.95).]
Figure 2.12: Quadratic relationship between expected lengths and the quantiles for approximate distribution
the expected length of duration for a given quantile in a fixed location.
2.4.5
Definition of heatwave using expected length of durations
The idea of this definition is to combine all the existing definitions of heatwave using the definition of durations. Given a threshold based on a quantile $q$, we can now get a statistical cut-off for the durations using the definition of expected length.
Definition 2.4.1. At a fixed location, given a quantile $q$, if an up-crossing of a time series of temperature or heat-index lasts longer than the expected length of duration, i.e., $E(d_{ij}(q))$ (as computed from the estimated quadratic equation for that location), we call that up-crossing a heatwave corresponding to the quantile $q \in (0.75, 0.95)$.
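Definition 2.4.1 translates into a simple scan over a daily series. The sketch below is illustrative only: the function name, the list input, and the returned `(start_index, length)` pairs are hypothetical choices, a day counts as above the threshold when the value is greater than or equal to it, and a run still in progress at the end of the series is treated as complete.

```python
def find_heatwaves(series, threshold, expected_length):
    """Return (start_index, length) for up-crossings longer than expected_length."""
    events, start = [], None
    for t, x in enumerate(series):
        if x >= threshold:
            if start is None:
                start = t                      # an up-crossing begins on day t
        elif start is not None:
            length = t - start                 # the up-crossing ended on day t-1
            if length > expected_length:
                events.append((start, length))
            start = None
    if start is not None and len(series) - start > expected_length:
        events.append((start, len(series) - start))
    return events


temps = [70, 95, 96, 70, 95, 96, 97, 98, 99, 70]   # toy daily maxima
print(find_heatwaves(temps, threshold=90, expected_length=4.05))
```

In practice `threshold` would be the $q$-th quantile $M(q)$ of the series and `expected_length` the value of $E(d_{ij}(q))$ from the fitted quadratic for that location.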
Another advantage of using this definition is that we do not have to compute the expected length of durations for every quantile using the data. Less data is available for higher thresholds, so we can simply extrapolate the values of expected lengths corresponding to higher quantiles using the quadratic relationship.
The only problem that now remains is to find the optimal quantile $q$ for a fixed location $s$. One way to do that is to obtain mortality data at a fixed location $s$, compare the mortality rates corresponding to heatwaves under different definitions, and choose the one that has the highest association with the mortality rate or health hazards. Using our definition reduces the number of definitions that need to be compared. This problem will be pursued in separate future work.
2.4.6
Analysis based on USCRN Data
equations for every location in the later sections.
2.4.7
Description of the USCRN data
The USCRN data contains daily and hourly meteorological values collected at 232 weather stations of USCRN, 226 stations in the United States and 6 stations outside the United States. The time period ranges from 2000 to 2017. The data contains the daily values of average ambient temperature (average does not mean hourly average; rather, it represents the average of temperature recorded at 6 different radars) and precipitation, as well as hourly values of the same. For our analysis, we use only the daily values of ambient temperature. Some stations of USCRN are fairly new, so those stations do not have data for the sufficiently large number of years required for our analysis. So we exclude stations with fewer than 10 years of data. Thus, we end up using values at 126 stations with at least 10 years of daily values of meteorological variables.
Since the size of this data is fairly big, there are lots of missing values at many points as expected. So, if we have a missing value of daily average ambient temperature at a certain date of a specific year, we replace the value of that with the median of the daily temperature at the same date of the other years. In that way, we will still have a robust analysis without eliminating the missing values. In a further study, we will use hierarchical Bayesian models that will use posterior predictive distributions to impute missing values.
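The median imputation described above can be sketched as follows. The function name and the data layout are assumptions made for illustration: `x[j][t]` holds the value for calendar day `t` of year `j`, `None` marks a missing value, and all years are assumed to have the same number of days.

```python
import statistics


def impute_by_calendar_day(x):
    """Fill None entries with the median of the same calendar day in other years."""
    n_days = len(x[0])
    filled = [list(year) for year in x]        # copy so the input is not modified
    for t in range(n_days):
        observed = [year[t] for year in x if year[t] is not None]
        if not observed:
            continue                           # no year has this day; leave as None
        day_median = statistics.median(observed)
        for year in filled:
            if year[t] is None:
                year[t] = day_median
    return filled


toy = [[1.0, None], [3.0, 5.0], [None, 7.0]]   # 3 years, 2 days
print(impute_by_calendar_day(toy))
```

Since a missing entry is never part of `observed`, the median is automatically taken over the other years only, as described in the text.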
Figure 2.13: Locations of 126 USCRN stations
2.4.8
Fitting the Hierarchical Model
Now, we do the same analysis for all these 126 stations as we did for the Atlanta data with the approximate distribution. For a fixed quantile of 0.81, we estimated the Beta distribution for all these stations and computed p-values corresponding to the Anderson-Darling test. Except for 8 stations, the p-values turn out to be greater than 0.05, suggesting a good fit in general. Upon visual inspection, it seems that the test fails at those 8 stations because of one value distant from the estimated Beta cumulative distribution. Further investigation is needed to reach a decisive conclusion.
$R^2$. Figure 2.14 shows that the adjusted $R^2$ values for all these stations are greater than 0.95, suggesting a good fit for all of them.
[Figure 2.14 plot: boxplot of adjusted R-squared (axis from 0.90 to 1.00).]
Figure 2.14: Boxplot of the adjusted $R^2$ for all the stations
Now, we need to investigate if there exists a spatial clustering between these estimated coefficients of the quadratic equations. So we plot all the estimated quadratic equations in a plot to check visually if there exists any. Figure 2.15 shows the spatial clustering of the quadratic equation.
[Figure 2.15 plot: "Quadratic Plots", expected length (y-axis, 0 to 25) against quantiles (x-axis, 0.70 to 0.95).]
Figure 2.15: Spatial clustering of the estimated quadratic equations
2.5
Comparing with Real Heatwaves
Now, let us find when heatwaves occurred according to our definition of heatwaves and compare those with the documented heatwaves. Figure 2.16 and Figure 2.17 show the heatwaves that occurred at Atlanta in 2007 and 2012 for two different quantiles (0.8 and 0.95) using our definition with the expected lengths corresponding to the approximate distribution.
[Figure 2.16 plot: maximum ambient temperature (y-axis, 50 to 100) against days (x-axis, 100 to 300), year 2007.]
Figure 2.16: Heatwaves in 2007 at Atlanta (gray being the heatwave days with quantile 0.8 and red being the same with quantile 0.95)
midwest) and Canada recorded a massive heatwave in Summer, 2012 (https://www.bbc.com/news/world-us-canada-18758667, http://www.climatecentral.org/news/coverage-of-the-2012-heat-wave-archived-and-accessable). So our definition is able to capture the real heatwaves efficiently. It will therefore be prudent to use our definition in the future, since it is also based on a probabilistic framework.
2.6
Discussion
[Figure 2.17 plot: maximum ambient temperature (y-axis, 60 to 100) against days (x-axis, 100 to 300), year 2012.]
Figure 2.17: Heatwaves in 2012 at Atlanta (gray being the heatwave days with quantile 0.8 and red being the same with quantile 0.95)
sustainability of a time series above a threshold can be expressed using our definition. We have also found a quadratic relationship between the threshold quantiles and the expected duration of an up-crossing for a fixed location, which will make the identification of heatwaves much simpler.
The next step toward finalizing the definition of a heatwave at a fixed location would be to find an optimal threshold, which might require an analysis of data on mortality and health hazards related to extreme heat and/or relative humidity. But using our definition, the number of comparisons should be much smaller than when scrutinizing every possible combination of thresholds and sustainability.