Spatial Analysis of Duration of Heatwaves and Robust Variable Selection.

ABSTRACT

RAHA, SOHINI. Spatial Analysis of Duration of Heatwaves and Robust Variable Selection. (Under the direction of Sujit Ghosh and Howard Bondell.)

In Chapter 1, we give a general introduction to the problems addressed in Chapters 2 to 5.

Characterization of heatwaves is becoming increasingly important in environmental research, as they pose a significant threat to human lives worldwide. Though several quantifications of the extremity of a heatwave have been proposed in the literature, they are mostly ad hoc, and there is no universally accepted definition of a heatwave. In Chapter 2, we devise a probabilistic inferential framework to characterize heatwaves and arrive at a definition that captures the essence of the existing ad hoc definitions. We derive an exact distribution of the frequency of durations for a stationary Markov process, and an approximate distribution of durations for a stationary non-Markov time series. For a given site, using a daily time series (of ambient temperature or heat index), we define a heatwave in terms of the number of sustained days above a given threshold, using the probability distribution of the durations. We illustrate the proposed methodology using a daily time series of ambient temperature for a fixed site (Atlanta) and also using the USCRN, consisting of 126 sites across the U.S. We further derive an empirical quadratic relationship between expected durations and extreme thresholds.

is computationally challenging, it is of utmost importance to derive a spatial model for the quadratic equations. In Chapter 3, we propose a few potential spatial models, using the idea of a Gaussian Markov Random Field. We compare these models using their predictive performance based on a 6-fold cross-validation using 120 USCRN sites across the U.S. We found that a simple Multivariate Conditionally Auto-regressive model fits our data best and, using this model, we predict the expected lengths of durations of heatwaves in the United States. The procedure described in Chapter 3 can be extended to any other large geographical region, provided we have temperature data for a few locations inside the region. In Chapter 4, we discuss a potential future extension of this spatial modeling, which requires a spatial functional analysis of the durations of heatwaves at various locations.

Spatial Analysis of Duration of Heatwaves and Robust Variable Selection

by Sohini Raha

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Statistics

Raleigh, North Carolina 2019

APPROVED BY:

Brian Reich Tao Pang

Sujit Ghosh

DEDICATION

BIOGRAPHY

ACKNOWLEDGEMENTS

The last five years of my life were extremely challenging, both professionally and personally, and I am very fortunate to have had enormous support throughout my graduate life.

The majority of my accomplishments can be attributed to the constant encouragement, tireless effort and skillful guidance of my doctoral advisors, Dr. Sujit Ghosh and Dr. Howard Bondell. Through inspiring conversations, motivating examples, and dedicated mentoring, Dr. Ghosh helped me shape my academic objectives, along with imparting methods to the madness of research. Working with him has been a great learning experience for me in both my academic and non-academic endeavors. Dr. Bondell introduced me to the complex path of research and guided me patiently despite my naiveté. His gentle and tenacious counsel helped me stay focused on my goals at the most critical juncture of my life. I am immensely grateful to both of them, and I will always treasure my experience working under their guidance.

Also, I am thankful to Dr. Brian Reich, and Dr. Tao Pang, for being a part of my dissertation committee. Their invaluable recommendations and suggestions have been instrumental in the completion of this dissertation.

struggle with academic and personal welfare. Alison’s continually affectionate conversations never failed to cheer me up even in the most down-hearted days. I have always cherished their smiling presence in the department, and I have never had a dull moment with them.

I also had the wonderful opportunity to be a part of SAMSI (The Statistical and Applied Mathematical Sciences Institute) as a Research Assistant. Participating in the working groups at SAMSI has always given me a new perspective on my research problems, and has also helped me build professional networks. This material is based upon work partially supported by the National Science Foundation under Grant DMS-1638521 to the Statistical and Applied Mathematical Sciences Institute (SAMSI: https://www.samsi.info/).

I consider myself highly privileged to have a wonderful circle of caring friends through this difficult phase of life, and these acknowledgements would not have been complete without naming a few of them. Thank you, Jyotishka and Shalini, for tolerating me at my worst, and for providing me a home away from home when I needed it the most. Thank you, Max, for your unconditional love and adorable support which became the ray of hope in the darkest of times. Thank you, Suman, for your unique advice and guidance that always led me in the right direction. Thank you, Arkaprava, for being the friend that everybody wishes they had. Thank you, Raka, for being my timeless soul-sister in this indelible journey of life. Thank you, Arkopal, for always being there no matter what. Thank you, Moumita, Indranil, Arnab, Aniket, Sayan, Sapna, Debarati, Sneha, Aritra, Sumit, Anwesha, for making my life in North Carolina so special. Thank you, Malabi, for all those thought-provoking lively conversations even from afar.

for being an exemplary human being who I can always look up to. Also, I must thank my little brother, Ritam, for keeping the love and adorable bickering alive from halfway around the world. I would like to acknowledge my grandparents for the unlimited love and pampering throughout. Also, I would like to express a special thanks to Delilah, who taught me to never give up, be adventurous and love purely.

LIST OF TABLES

Table 3.1 MSE, Prediction Coverage (90%) and Prediction Coverage (95%) for 6-fold cross-validation for all the coefficients for Model1

Table 3.2 MSE, Prediction Coverage (90%) and Prediction Coverage (95%) for 6-fold cross-validation for all the coefficients for Model2

Table 3.3 MSE, Prediction Coverage (90%) and Prediction Coverage (95%) for 6-fold cross-validation for all the coefficients for Model3

Table 3.4 MSE, Prediction Coverage (90%) and Prediction Coverage (95%) for 6-fold cross-validation for all the coefficients for Model1 for expected lengths

Table 3.5 MSE, Prediction Coverage (90%) and Prediction Coverage (95%) for 6-fold cross-validation for all the coefficients for Model2 for expected lengths

Table 3.6 MSE, Prediction Coverage (90%) and Prediction Coverage (95%) for 6-fold cross-validation for all the coefficients for Model3 for expected lengths

Table 5.1 Results for the no outlier scheme: FPR, FNR, RMSPE, AUC averaged over 500 simulations

Table 5.2 Results for the vertical outlier scheme: FPR, FNR, RMSPE, AUC averaged over 500 simulations

Table 5.3 Results for the Leverage Point outlier scheme: FPR, FNR, RMSPE, AUC averaged over 500 simulations

Table 5.4 Results for the Different β scheme: FPR, FNR, RMSPE, AUC averaged over 500 simulations

Table 5.5 Comparison table for LASSO, RLARS, sparseaLTS and our method

LIST OF FIGURES

Figure 1.1 Heatwave formation: borrowed from Wikipedia

Figure 1.2 US Fatalities due to heatwaves (source: WorldAtlas & NOAA)

Figure 2.1 Definition of duration and intensity

Figure 2.2 Empirical and theoretical duration distribution for selected values of transition probabilities and ρ and quantiles for AR(1)

Figure 2.3 Empirical and theoretical duration distribution for selected values of ρ1, ρ2 and quantiles for AR(2)

Figure 2.4 Observed (red) and simulated data (gray) from AR(2)

Figure 2.5 Expected durations with estimated order (red) and known order (blue) with the boxplot showing the Monte Carlo error for 50 simulations

Figure 2.6 up-crossings corresponding to threshold of 81 quantile in the year 1991

Figure 2.7 The posterior distributions of Chi-square statistic for different orders

Figure 2.8 Expected durations for the quantiles interval (0.70,0.95) for exact distribution

Figure 2.9 Quadratic relationship between expected lengths and the quantiles for exact distributions

Figure 2.10 The comparison of empirical and estimated CDF (The red continuous line corresponds to the estimated Beta distribution and the black dotted line corresponds to the empirical CDF)

Figure 2.11 Plot of expected length vs. quantiles with confidence interval at selected quantiles for approximate distributions

Figure 2.12 Quadratic relationship between expected lengths and the quantiles for approximate distribution

Figure 2.13 Locations of 126 USCRN stations

Figure 2.14 Boxplot of the adjusted R² for all the stations

Figure 2.15 spatial clustering of the estimated quadratic equations

Figure 2.16 Heatwaves in 2007 at Atlanta (gray being the heatwave days with quantile 0.8 and red being the same with quantile 0.95)

Figure 2.17 Heatwaves in 2012 at Atlanta (gray being the heatwave days with quantile 0.8 and red being the same with quantile 0.95)

Figure 3.1 Expected length of durations for a threshold of 81 percentile at USCRN weather stations

Figure 3.2 Adjacency network for three USCRN stations (marked in red) with circle radius of 200 kilometers

Figure 3.4 Plot of Crossvalidation Groups

Figure 3.5 Prediction Plot of the expected lengths at quantile 0.70

Figure 3.11 Heatwave days in Chicago in 2012 using 70 percentile threshold using our definition (July 4-7 marked in red)

Figure 4.1 Locations of 120 USCRN stations

Figure 5.1 ROC Curve for no outliers

Figure 5.2 ROC Curve for vertical outliers

Figure 5.3 ROC Curve for Leverage Points

Figure 5.4 ROC Curve for Different β

Figure A.1 Histograms of durations of 1991, 1997, 2003, and 2009 for 0.81 percentile in Atlanta Data

Figure A.2 Histograms of durations of the year 2002 with different quantiles in Atlanta Data

Figure A.3 observed vs fitted durations of the years 1991, 1997, 2003, and 2009 in Atlanta Data (the red lines correspond to the observed durations and the black lines correspond to the fitted duration)

Chapter 1 Introduction

Figure 1.1: Heatwave formation: borrowed from Wikipedia

Figure 1.2: US Fatalities due to heatwaves (source: WorldAtlas & NOAA)

occurred (Meehl and Tebaldi [2004]).

While the effect of extreme heat on human health is widely studied, there is still no universally accepted definition of heatwave. There exist hundreds of definitions in the literature to quantify the extremities of these conditions, but not a unique global one. The form of the most popular definitions of heatwave is that the temperature (or a combination of humidity and temperature, e.g., apparent temperature) has to be greater than a threshold, say, T °C, for at least d consecutive days, where the thresholds T °C and d days are to be chosen suitably. Also, T and d can vary with locations such as

characterizing heatwave is of utmost importance.

There have been several attempts to model extremes in dependent time series (Chavez-Demoulin and Davison [2012]). But these approaches may fail to capture several existing definitions that use other block quantiles or moments of the distributions and not necessarily the maxima, e.g., HI or apparent temperature. In Chapter 2, we propose a method that provides a probabilistic model for all definitions that are based on a threshold on a time series and a threshold on the sustainability of a severe condition. We then validate the model by applying it to two types of datasets: one for a fixed location (Atlanta, United States) and the other for varying locations (USCRN weather stations spread across the United States).

quadratic equations at each of these locations. We also explore Univariate and Multivariate Conditionally Auto-Regressive (UCAR and MCAR) models on the coefficients of these quadratic equations.

Chapter 2 On the Probability Distribution of the Duration of Heatwaves

2.1 Introduction

2.1.1 Various Definitions of Heatwave

The most popular definitions of heatwave share a common form: the temperature (or a combination of humidity and temperature, e.g., apparent temperature) has to be greater than a threshold, say, T °C, for at least d consecutive days, where the thresholds T °C and d days are to be chosen suitably. Also, T and d can vary with locations such as countries, and sometimes even within various regions of a large country. In Frich et al. [2002], a heatwave is defined as a period of at least 5 consecutive days when the daily maximum ambient temperature exceeds the normal daily temperature (where the normal is defined as the daily average ambient temperature based on the period 1961-1990) by 5°C (9°F). In Karl and Knight [1997], the distributions of daytime and nighttime apparent temperature were simulated to find the rare cases where the minimum of them stays extremely high for a period of two or three consecutive days. Another, slightly more complicated, definition of heatwave can be found in Peng et al. [2011], which has been used extensively in the literature. This definition is based on two thresholds T1 and T2, where T1 is the 97.5th percentile and T2 is the 81st percentile of all the daily ambient temperatures available at a given location over a pre-specified time period. The definition says “A heat wave is then defined as the longest period of consecutive days satisfying the following conditions: the daily maximum temperature is above T1 for at least 3 days, the daily maximum temperature is above T2 for every day of the entire period, and the average of daily maximum temperature over the entire period is above T1.” All of these definitions were

mortality rate due to heatwave increases by 4% when the definition is based on the daily mean ambient temperature exceeding the 95th percentile of all available daily mean temperatures for at least 2 days; it increases by 3% when the threshold is the 98th percentile for at least 2 days, by 7% when the threshold is the 99th percentile for at least 2 days, and by 16% when the threshold is the 97th percentile for at least 5 days. These results aptly show the relative impact of the choice of the thresholds T °C and d days in defining heatwaves on the mortality rates based on data from similar locations.

There is another school of thought in the literature suggesting that relative humidity also plays a significant role in human health along with high temperature. So, a combination of temperature and relative humidity, i.e., the heat index (HI), should be considered in the characterization of heatwaves. In Schoen [2005], a new empirical model for the heat index (HI) was proposed as

\[ HI = T - 1.0799\, e^{0.03755\,T}\left[1 - e^{0.0801\,(D - 14)}\right], \]

where HI, $T$, and $D$ are all in °C, $T$ is the ambient air temperature, and $D$ is the dew-point temperature, which is calculated as

\[ D = \frac{b \times \alpha}{a - \alpha}, \quad \text{where } \alpha = \frac{a \times T}{b + T} + \ln(\mathrm{RH}), \]

with $a = 17.27$, $b = 237.3$, and RH the measured relative humidity expressed as a decimal fraction. This formula for dew-point temperature is valid if $0 < T < 60$ °C and $0.01 < \mathrm{RH} < 1.00$. In Yin and Wang [2018], another formula for computing the heat index (HI) in Beijing was used to study its effect on cardiovascular health, which is as follows:

\[ HI = \begin{cases} 1.8T - 0.55\,(1.8T - 26)(1 - 0.6) + 32, & \mathrm{RH} \le 60\% \\ 1.8T - 0.55\,(1.8T - 26)(1 - \mathrm{RH}) + 32, & \mathrm{RH} > 60\% \end{cases} \]

where $T$ is the daily maximum temperature in °C and RH is the daily mean relative humidity in %. Wehner et al. [2018] used a definition based on a threshold on the heat index (a bi-quadratic empirical function of ambient temperature ($T$) in °F and relative humidity ($R$) in %) provided by NOAA:

\[
\begin{aligned}
HI = {} & -42.379 + 2.04901523\,T + 10.14333127\,R - 0.22475541\,TR - 0.00683783\,T^2 \\
& - 0.05481717\,R^2 + 0.00122874\,T^2 R + 0.00085282\,T R^2 - 0.00000199\,T^2 R^2,
\end{aligned}
\]

where $T > 80$°F and $R > 40\%$. If HI is between 80 and 90°F, the condition is known as “Caution”; if HI is between 90 and 103°F, the condition is known as “Extreme Caution”; if HI is between 103 and 125°F, the condition is known as “Danger”; and if HI is above 125°F, the condition is known as “Extreme Danger”.
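As a rough illustration of the formulas quoted above, the following Python sketch evaluates the dew point, the Schoen [2005] heat index, and the NOAA bi-quadratic heat index. The function names are hypothetical. The dew-point step assumes the standard Magnus form α = aT/(b + T) + ln(RH), the NOAA constant term is taken as −42.379 as in NOAA's published regression, and the half-open category boundaries are an assumption, since the text does not say which class a boundary value belongs to.

```python
import math


def dew_point_c(temp_c, rh_frac, a=17.27, b=237.3):
    """Dew point (deg C) from air temperature (deg C) and RH as a decimal fraction."""
    if not (0.0 < temp_c < 60.0 and 0.01 < rh_frac < 1.0):
        raise ValueError("outside the validity range stated in the text")
    # Assumed Magnus form for alpha; a and b are the constants given in the text.
    alpha = a * temp_c / (b + temp_c) + math.log(rh_frac)
    return b * alpha / (a - alpha)


def schoen_hi_c(temp_c, dew_c):
    """Schoen [2005] heat index (deg C) from temperature and dew point (deg C)."""
    return temp_c - 1.0799 * math.exp(0.03755 * temp_c) * (
        1.0 - math.exp(0.0801 * (dew_c - 14.0))
    )


def noaa_hi_f(temp_f, rh_pct):
    """NOAA bi-quadratic heat index (deg F); intended for T > 80 F and R > 40 %."""
    t, r = temp_f, rh_pct
    return (
        -42.379 + 2.04901523 * t + 10.14333127 * r - 0.22475541 * t * r
        - 0.00683783 * t**2 - 0.05481717 * r**2 + 0.00122874 * t**2 * r
        + 0.00085282 * t * r**2 - 0.00000199 * t**2 * r**2
    )


def noaa_category(hi_f):
    """Map a heat index in deg F to the four classes (boundary handling is assumed)."""
    if hi_f < 80.0:
        return None
    if hi_f < 90.0:
        return "Caution"
    if hi_f < 103.0:
        return "Extreme Caution"
    if hi_f <= 125.0:
        return "Danger"
    return "Extreme Danger"


d = dew_point_c(30.0, 0.5)
print(round(d, 1), round(schoen_hi_c(30.0, d), 1))
hi = noaa_hi_f(90.0, 50.0)
print(round(hi, 1), noaa_category(hi))
```

Note that the Schoen index reduces to the air temperature when the dew point equals 14 °C, which gives a quick sanity check on the implementation.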

produce a reliable result, the selection of the heat index needs to be more judicious. There have been several attempts to find the probability distribution of extreme rare events. The well-known POT (Peaks over Threshold) method is one of the most commonly used tools in extreme value theory (Leadbetter [1991], Kyselý et al. [2010]). A study on heatwaves was done using a hierarchical Bayesian model for serially dependent extremes by Reich et al. [2014]. One apparent limitation of these approaches is that only the extremes (typically taken as block maxima) of those time series can be modeled using extreme value theory. While these methods give a new perspective on the existing problems, they are not able to capture the essence of most of the existing definitions. In this chapter, we propose a probabilistic framework that is able to capture most of the existing definitions. This framework is not limited to the case of heatwaves; it can be applied to any stationary time series and any event based on exceedance above a threshold. We apply our models to two different datasets, which will be discussed in later sections.

2.2 Probability Distributions of Duration and Intensity of Heatwave

One of our primary goals is to encapsulate all the existing definitions in a probabilistic framework. So, we consider a time series $X_t$ for our analysis, where $t$ will usually denote days but may also represent any other unit of time. For example, $X_t$ can be the daily maximum temperature, the heat index (HI), or any other related time series, depending on the existing definition of heatwave we want to represent. In general, we want $X_t$ to be

be restricted to daily time scale. Since climate time series data tend to be cyclical with a periodicity of one year, we compute the number of heatwaves within a year and their durations and intensities for the analysis. Let $M_0$ be a chosen threshold for the time series $X_t$. $M_0$ can take different forms; for example, it can be a fixed quantity, or it can be the long-term $q$-th percentile of all the available values of that time series over several years. Each instance when the time series $X_t$ exceeds the threshold $M_0$ is called an up-crossing. Similarly, a down-crossing occurs when $X_t$ drops below $M_0$. Now, we define two random variables, “Duration” and “Intensity” of an up-crossing, corresponding to a specific $X_t$ and $M_0$. The “Duration” of an up-crossing is defined as the length of time $X_t$ stays above $M_0$ until the succeeding down-crossing. The “Intensity” of an up-crossing is defined as the amount of exceedance in that up-crossing, i.e., the area under the curve $X_t$ and above $M_0$ in the corresponding up-crossing. Figure 2.1 illustrates these two concepts.

Let $X_{t,j}(s)$ be the value of the time series $X_t$ on day $t$ of year $j$ at a location $s$, where $j = 1, 2, \ldots, J(s)$ and $t = 1, 2, \ldots, T_j(s)$, with $T_j(s)$ the total number of days considered within year $j$ at location $s$. In this section, we always assume that all the analysis is for a fixed location $s$ unless mentioned otherwise. So, for simplicity, we momentarily drop the symbol $s$ from subsequent definitions. Given a quantile level $q \in (0,1)$, let $M(q)$ be the long-term $q$-th quantile of the entire series $\{X_{t,j},\ t = 1, \ldots, T_j;\ j = 1, \ldots, J\}$. $M(q)$ will be a constant function if the threshold is chosen to be a fixed quantity instead of a quantile. We define a dependent Bernoulli sequence $\{B_{t,j}(q) : t = 1, 2, \ldots;\ j = 1, 2, \ldots\}$, where $B_{t,j}(q)$

Figure 2.1: Definition of duration and intensity

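Then, the times of crossing the threshold in the $j$-th year can be defined as follows:

\[
\begin{aligned}
T_{1,j}(q) &= \min\{t \ge 1 : B_{t,j}(q) = 1\} \\
T_{2,j}(q) &= \min\{t \ge T_{1,j}(q) : B_{t,j}(q) = 0\}
\end{aligned}
\]

More generally, for $m = 1, 2, \ldots$ we define

\[
\begin{aligned}
T_{2m-1,j}(q) &= \min\{t \ge T_{2m-2,j}(q) : B_{t,j}(q) = 1\} \\
T_{2m,j}(q) &= \min\{t \ge T_{2m-1,j}(q) : B_{t,j}(q) = 0\}
\end{aligned}
\]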

Using the above notations, we define Duration and Intensity of an up-crossing as follows:

Definition 2.2.1. The Duration of the $i$-th up-crossing in the $j$-th year is defined to be the number of consecutive days when $X_{t,j}$ stays above $M(q)$ for that up-crossing and is denoted by $d_{i,j}(q)$. So, $d_{i,j}(q) = T_{2i,j}(q) - T_{2i-1,j}(q)$.

Definition 2.2.2. The Intensity of the $i$-th up-crossing in the $j$-th year is defined to be the amount of exceedance in the $i$-th up-crossing and is denoted by $IN_{i,j}(q)$. So,

\[ IN_{i,j}(q) = \sum_{k = T_{2i-1,j}(q)}^{T_{2i,j}(q) - 1} X_{k,j} - M(q)\,\big(T_{2i,j}(q) - T_{2i-1,j}(q)\big). \]

Now, using these two definitions, we can express the majority of the existing definitions that are based on a threshold on a time series and on sustainability. For example, the definition used in Frich et al. [2002] can be represented if we set the cut-off for the duration of an up-crossing to be 5 days, take $X_{t,j}$ to be $(M_{t,j} - \mathrm{Avg}(M_{t,j}))$, where $M_{t,j}$ is the maximum daily temperature on the $t$-th day of the $j$-th year and $\mathrm{Avg}(M_{t,j})$ is the long-term average value of the $M_{t,j}$'s, and set $M(q)$ to be 5. For the definition used in Wehner et al. [2018], we can set $X_{t,j}$ to be the heat index (HI) of the $t$-th day in the $j$-th year, and the thresholds can be used as they are. So, it is prudent to derive the probability distributions of the Duration and Intensity of the up-crossings. In the next subsections, we derive an exact and an approximate distribution of duration, since most of the definitions can be captured by the definition of duration alone.
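The two definitions above translate directly into a scan over a daily series. The sketch below is a minimal illustration in Python; the function names `up_crossings` and `heatwaves` are hypothetical, and dropping a run that is still above the threshold at the end of the series (so that it has no succeeding down-crossing) is an assumption rather than something the text specifies.

```python
def up_crossings(series, threshold):
    """Return (start_index, duration, intensity) for each completed up-crossing."""
    runs = []
    start = None
    for t, x in enumerate(series):
        if x > threshold:
            if start is None:
                start = t  # the series has just crossed above the threshold
        elif start is not None:
            duration = t - start  # down-crossing time minus up-crossing time
            # Intensity: sum of the values in the run minus threshold * duration.
            intensity = sum(series[start:t]) - threshold * duration
            runs.append((start, duration, intensity))
            start = None
    # Assumption: a run still open at the end of the series is not counted.
    return runs


def heatwaves(series, threshold, min_days):
    """Keep only the up-crossings that last at least min_days."""
    return [run for run in up_crossings(series, threshold) if run[1] >= min_days]


temps = [20, 31, 33, 29, 35, 36, 37, 28]
print(up_crossings(temps, 30))
print(heatwaves(temps, 30, 3))
```

A definition of the Frich et al. [2002] type would correspond to calling `heatwaves` on the series of anomalies (daily maximum minus the long-term normal) with `threshold=5` and `min_days=5`.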

2.2.1 Exact Distribution of Durations for Markov processes

In order to find an exact distribution of the durations of an up-crossing, we assume that the time series $X_t$ is strictly stationary and follows a Markov process of order $k$, where

assumptions on $\{X_t : t = 1, 2, \ldots, n\}$, we will show that the probability of the duration of an up-crossing being equal to $l$ will be $A \times B^{l-k}$ for all $l \ge k$, for some $A \le 1$ and $B \le 1$.

Theorem 2.2.1. Let $X_{t,j}$ denote the value of the time series $X_t$ on day $t$ in the $j$-th year, $t = 1, 2, \ldots, T_j$ and $j = 1, 2, \ldots, J$. If for every $j$, $\{X_{t,j} : t = 1, 2, \ldots, T_j\}$ follows a Markov process of order $k$ and is strictly stationary, then the $i$-th duration in year $j$ corresponding to the $q$-th quantile, $d_{i,j}(q)$, has the following probability distribution:

\[ P(d_{i,j}(q) = l) = \begin{cases} \pi_{j,l}(q), & l < k \\ A_j(q) \times B_j(q)^{l-k}, & l \ge k \end{cases} \]

for some sequence $\{\pi_{j,l}(q) : l = 1, 2, \ldots, (k-1)\}$ and some $A_j(q)$ and $B_j(q)$, where $A_j(q) \le 1$, $B_j(q) \le 1$ and $\pi_{j,l}(q) \le 1$ for every $l < k$.

To prove this theorem, we will first give the proof in detail for $k$ equal to 1 and 2, and then extend the proof to general $k$ by induction.

Proof. Case I: $k = 1$. Fix a $j \in \{1, \ldots, J\}$. Notice that

\[ P(X_{t,j} \in A \mid X_{l,j} \in B_l,\ l = (t-1), (t-2), \ldots) = P(X_{t,j} \in A \mid X_{t-1,j} \in B_{t-1}). \]

Let $B_{t,j}(q) = I(X_{t,j} > M(q))$, where $M(q)$ is the threshold corresponding to the $q$-th quantile. Now, since we assume strict stationarity of the sequence $\{X_{t,j} : t = 1, 2, \ldots;\ j = 1, 2, \ldots\}$, i.e., the joint distribution of $(X_{t_1,j}, X_{t_2,j}, \ldots, X_{t_k,j})$ is the same as that of $(X_{t_1+h,j}, X_{t_2+h,j}, \ldots,$

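stationary. For $l \ge 1$ and for $m_i = T_{i,j}(q)$,

\[
\begin{aligned}
P(d_{i,j}(q) = l) &= P\big(B_{(m_i-1),j}(q) = 0,\ B_{m_i,j}(q) = 1,\ \ldots,\ B_{(m_i+l-1),j}(q) = 1,\ B_{(m_i+l),j}(q) = 0\big) \\
&= P\big(B_{(m_i+l),j}(q) = 0 \mid B_{(m_i+l-1),j}(q) = 1\big) \\
&\quad \times P\big(B_{(m_i+l-1),j}(q) = 1 \mid B_{(m_i+l-2),j}(q) = 1\big) \times \cdots \times P\big(B_{(m_i+1),j}(q) = 1 \mid B_{m_i,j}(q) = 1\big) \\
&\quad \times P\big(B_{m_i,j}(q) = 1 \mid B_{(m_i-1),j}(q) = 0\big) \times P\big(B_{(m_i-1),j}(q) = 0\big) \\
&= P\big(B_{2,j}(q) = 0 \mid B_{1,j}(q) = 1\big) \times P\big(B_{2,j}(q) = 1 \mid B_{1,j}(q) = 1\big)^{l-1} \\
&\quad \times P\big(B_{2,j}(q) = 1 \mid B_{1,j}(q) = 0\big) \times P\big(B_{1,j}(q) = 0\big)
\end{aligned}
\tag{2.1}
\]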

The last equality follows by strict stationarity. Let us denote the transition probabilities $\alpha_1 = P(B_{2,j}(q) = 1 \mid B_{1,j}(q) = 1)$ and $\alpha_2 = P(B_{2,j}(q) = 1 \mid B_{1,j}(q) = 0)$. Substituting these in Eq (2.1), we get

\[ P(d_{i,j}(q) = l) = (1 - \alpha_1)\,\alpha_1^{\,l-1}\,\alpha_2 \times P(B_{1,j}(q) = 0). \tag{2.2} \]

Putting $A_j(q) = (1 - \alpha_1)\,\alpha_2 \times P(B_{1,j}(q) = 0)$ and $B_j(q) = \alpha_1$, the theorem is proved.

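Case II: $k = 2$. We consider the four transition probabilities that will be required to derive the distributions:

\[
\begin{aligned}
\alpha_1 &= P(B_{3,j}(q) = 1 \mid B_{2,j}(q) = 1,\ B_{1,j}(q) = 1) \\
\alpha_2 &= P(B_{3,j}(q) = 1 \mid B_{2,j}(q) = 1,\ B_{1,j}(q) = 0) \\
\alpha_3 &= P(B_{3,j}(q) = 1 \mid B_{2,j}(q) = 0,\ B_{1,j}(q) = 1) \\
\alpha_4 &= P(B_{3,j}(q) = 1 \mid B_{2,j}(q) = 0,\ B_{1,j}(q) = 0)
\end{aligned}
\]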

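Now, for $l \ge 2$ and for $m_i = T_{i,j}(q)$,

\[
\begin{aligned}
P(d_{i,j}(q) = l) &= P\big(B_{(m_i-1),j}(q) = 0,\ \ldots,\ B_{(m_i+l-1),j}(q) = 1,\ B_{(m_i+l),j}(q) = 0\big) \\
&= P\big(B_{(m_i+l),j}(q) = 0 \mid B_{(m_i+l-1),j}(q) = 1,\ B_{(m_i+l-2),j}(q) = 1\big) \\
&\quad \times P\big(B_{(m_i+l-1),j}(q) = 1 \mid B_{(m_i+l-2),j}(q) = 1,\ B_{(m_i+l-3),j}(q) = 1\big) \times \cdots \\
&\quad \times P\big(B_{(m_i+1),j}(q) = 1 \mid B_{m_i,j}(q) = 1,\ B_{(m_i-1),j}(q) = 0\big) \\
&\quad \times P\big(B_{m_i,j}(q) = 1,\ B_{(m_i-1),j}(q) = 0\big) \\
&= P\big(B_{3,j}(q) = 0 \mid B_{2,j}(q) = 1,\ B_{1,j}(q) = 1\big) \times P\big(B_{3,j}(q) = 1 \mid B_{2,j}(q) = 1,\ B_{1,j}(q) = 1\big)^{l-2} \\
&\quad \times P\big(B_{3,j}(q) = 1 \mid B_{2,j}(q) = 1,\ B_{1,j}(q) = 0\big) \times P\big(B_{2,j}(q) = 1,\ B_{1,j}(q) = 0\big)
\end{aligned}
\tag{2.3}
\]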

Substituting $\alpha_1$ and $\alpha_2$ in Eq (2.3), we get

\[ P(d_{i,j}(q) = l) = (1 - \alpha_1)\,\alpha_1^{\,l-2}\,\alpha_2 \times P(B_{2,j}(q) = 1,\ B_{1,j}(q) = 0). \tag{2.4} \]

Putting $A_j(q) = (1 - \alpha_1)\,\alpha_2 \times P(B_{2,j}(q) = 1,\ B_{1,j}(q) = 0)$ and $B_j(q) = \alpha_1$, the theorem is proved for $l \ge 2$. Now, for $l = 1$,

\[ P(d_{i,j}(q) = l) = P\big(B_{(m_i-1),j}(q) = 0,\ B_{m_i,j}(q) = 1,\ B_{(m_i+1),j}(q) = 0\big). \]

From strict stationarity,

\[ P(d_{i,j}(q) = l) = P\big(B_{1,j}(q) = 0,\ B_{2,j}(q) = 1,\ B_{3,j}(q) = 0\big). \]

So, if we put $\pi_{j,1} = P(B_{1,j}(q) = 0,\ B_{2,j}(q) = 1,\ B_{3,j}(q) = 0)$, the theorem is proved for

Finally, for a general Markov process of order $k$, we will have to define $2^k$ transition probabilities, where

\[
\begin{aligned}
\alpha_1 &= P(B_{(k+1),j}(q) = 1 \mid B_{k,j}(q) = 1,\ B_{(k-1),j}(q) = 1,\ \ldots,\ B_{1,j}(q) = 1) \\
\alpha_2 &= P(B_{(k+1),j}(q) = 1 \mid B_{k,j}(q) = 1,\ B_{(k-1),j}(q) = 1,\ \ldots,\ B_{2,j}(q) = 1,\ B_{1,j}(q) = 0) \\
\alpha_3 &= P(B_{(k+1),j}(q) = 1 \mid B_{k,j}(q) = 1,\ B_{(k-1),j}(q) = 1,\ \ldots,\ B_{2,j}(q) = 0,\ B_{1,j}(q) = 0)
\end{aligned}
\]

Proceeding like this, we will have

\[ \alpha_{2^k} = P(B_{(k+1),j}(q) = 1 \mid B_{k,j}(q) = 0,\ \ldots,\ B_{1,j}(q) = 0). \]

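Now, imitating the proof for orders 1 and 2, we can write for $l \ge k$,

\[
\begin{aligned}
P(d_{i,j}(q) = l) &= P\big(B_{(k+1),j}(q) = 0 \mid B_{k,j}(q) = 1,\ \ldots,\ B_{1,j}(q) = 1\big) \\
&\quad \times P\big(B_{(k+1),j}(q) = 1 \mid B_{k,j}(q) = 1,\ \ldots,\ B_{1,j}(q) = 1\big)^{l-k} \\
&\quad \times P\big(B_{(k+1),j}(q) = 1 \mid B_{k,j}(q) = 1,\ \ldots,\ B_{2,j}(q) = 1,\ B_{1,j}(q) = 0\big) \\
&\quad \times P\big(B_{k,j}(q) = 1,\ \ldots,\ B_{2,j}(q) = 1,\ B_{1,j}(q) = 0\big)
\end{aligned}
\tag{2.5}
\]

So, substituting the values of $\alpha_1$ and $\alpha_2$ in Eq (2.5), we get

\[ P(d_{i,j}(q) = l) = (1 - \alpha_1)\,\alpha_1^{\,l-k}\,\alpha_2 \times P\big(B_{k,j}(q) = 1,\ \ldots,\ B_{2,j}(q) = 1,\ B_{1,j}(q) = 0\big). \tag{2.6} \]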

the theorem is proved for $l \ge k$. Now, for $l < k$,

\[ P(d_{i,j}(q) = l) = P\big(B_{1,j}(q) = 0,\ B_{2,j}(q) = 1,\ \ldots,\ B_{(l+1),j}(q) = 1,\ B_{(l+2),j}(q) = 0\big). \]

So, for $l < k$, if we put $\pi_{j,l} = P(B_{1,j}(q) = 0,\ B_{2,j}(q) = 1,\ \ldots,\ B_{(l+1),j}(q) = 1,\ B_{(l+2),j}(q) = 0)$, the theorem is proved for $l < k$.

So, from this theorem, we can see that the exact distribution is of a generalized geometric type, which is intuitive, since the probability that a stationary time series stays above a threshold should eventually decrease as the duration grows.
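To see the geometric shape numerically for the order-1 case, one can simulate the indicator sequence and tabulate the lengths of its runs of ones. The sketch below is an illustration under stated assumptions, not the validation carried out in Section 2.3: it simulates a first-order binary Markov chain for the indicators directly (rather than thresholding a temperature series), the function names are hypothetical, and the reference pmf $(1-\alpha_1)\alpha_1^{l-1}$ is Eq (2.2) divided by $\alpha_2 \times P(B_{1,j} = 0)$, i.e., conditioned on an up-crossing having started.

```python
import random
from collections import Counter


def simulate_indicator_chain(n, alpha1, alpha2, seed=0):
    """Simulate B_1..B_n with alpha1 = P(1 | previous 1), alpha2 = P(1 | previous 0)."""
    rng = random.Random(seed)
    b = [0]  # start below the threshold so every run is preceded by a 0
    for _ in range(n - 1):
        p = alpha1 if b[-1] == 1 else alpha2
        b.append(1 if rng.random() < p else 0)
    return b


def completed_run_lengths(b):
    """Lengths of runs of 1s that are followed by a 0 (an open final run is dropped)."""
    lengths, run = [], 0
    for v in b:
        if v == 1:
            run += 1
        elif run > 0:
            lengths.append(run)
            run = 0
    return lengths


def empirical_pmf(lengths):
    """Relative frequency of each observed duration."""
    counts = Counter(lengths)
    return {l: c / len(lengths) for l, c in sorted(counts.items())}


def geometric_pmf(l, alpha1):
    """Conditional duration pmf for order 1: (1 - alpha1) * alpha1 ** (l - 1)."""
    return (1 - alpha1) * alpha1 ** (l - 1)


chain = simulate_indicator_chain(50_000, alpha1=0.7, alpha2=0.2, seed=1)
pmf = empirical_pmf(completed_run_lengths(chain))
for l in range(1, 6):
    print(l, round(pmf.get(l, 0.0), 3), round(geometric_pmf(l, 0.7), 3))
```

Under this conditional form the mean duration is $1/(1-\alpha_1)$, so a larger probability of staying above the threshold translates directly into longer expected up-crossings.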

Remark 2.2.1. Specifically, this theorem can be applied to an Auto-regressive process of order $k$. For an AR($k$) process with normal innovations, the joint and conditional distributions can be computed easily. Exact expressions of $\pi_{j,l}(q)$, $A_j(q)$ and $B_j(q)$ can be derived using multi-dimensional integrals.

In Section 2.3, we will validate this theoretical distribution with empirical distribution for Auto-regressive processes.

2.2.2 Hierarchical Model based on Exact Distribution

Our goal is to find a suitable model for the duration of the up-crossings so that we can come up with a probabilistic definition that captures all the existing definitions of heatwave. We assume that $X_t$ is stationary and follows an Auto-regressive process of order $k$. Now, let us fix a threshold at a long-term $q$-th percentile. Let $f_{j,l}(q)$ denote the

in year $j$. Let $m$ denote the maximum length of an up-crossing (we can consider $m$ to be reasonably large). Then,

\[ (f_{j,1}(q), f_{j,2}(q), \ldots, f_{j,m}(q)) \mid I_j(q) \sim \mathrm{Mult}\big(I_j(q),\ (p_{j,1}(q), p_{j,2}(q), \ldots, p_{j,m}(q))\big) \]

where we know from the previous subsection that

\[ p_{j,l}(q) = \begin{cases} \pi_{j,l}(q), & l < k \\ A_j(q) \times B_j(q)^{l-k}, & l \ge k \end{cases} \]

Now, we need to make sure that the cell probabilities of the Multinomial distribution add up to 1. So, we can consider a reparameterization of the cell probabilities with the help of the following lemma.

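Lemma 2.2.1. Suppose a discrete random variable $X$ has a distribution as follows:

\[ P(X = l) = p_l = \begin{cases} \pi_l, & l < k \\ A \times B^{l-k}, & l = k, \ldots, m \end{cases} \]

Then, to make sure that $\sum_{l=1}^{m} p_l = 1$, we can reparametrize as follows. Suppose $\eta = \sum_{l=k}^{m} A B^{l-k}$. Then, there exist $(\alpha_1, \alpha_2, \ldots, \alpha_{k-2})$ such that

\[
\begin{aligned}
\pi_1 &= (1 - \eta)\,\alpha_1 \\
\pi_l &= (1 - \eta)(1 - \alpha_1) \cdots (1 - \alpha_{l-1})\,\alpha_l \quad \text{for } l = 2, \ldots, k-2 \\
\pi_{k-1} &= (1 - \eta)(1 - \alpha_1) \cdots (1 - \alpha_{k-2})
\end{aligned}
\]

where the $\alpha_l$'s are required to satisfy $\alpha_l \in [0,1]$.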

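Notice that for any $l \in \{1, 2, \ldots, k-2\}$,

\[ (1 - \eta) - \sum_{j=1}^{l} \pi_j = (1 - \eta) \prod_{j=1}^{l} (1 - \alpha_j), \]

for example,

\[ (1 - \eta) - \pi_1 = (1 - \eta)(1 - \alpha_1), \qquad (1 - \eta) - \pi_1 - \pi_2 = (1 - \eta)(1 - \alpha_1)(1 - \alpha_2). \]

So

\[ \sum_{l=1}^{k-1} \pi_l = (1 - \eta) \]

and hence,

\[ \sum_{l=1}^{m} p_l = \eta + (1 - \eta) = 1. \]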

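Now, using this lemma, we reparameterize the $p_{j,l}(q)$'s and let $\alpha_{j,l}(q)$, $l = 1, 2, \ldots, (k-2)$, be the corresponding parameters after reparameterization. We choose Beta distributions as prior distributions on our $k$ free parameters, the $\alpha_{j,l}(q)$'s, $\eta_j(q)$ and $B_j(q)$, since $A_j(q)$ is determined by $\eta_j(q)$ and $B_j(q)$:

\[ A_j(q) = \eta_j(q)\, \frac{1 - B_j(q)}{1 - B_j(q)^{m-k+1}}. \]

So, the prior distributions are as follows:

\[
\begin{aligned}
\alpha_{j,l}(q) &\sim \mathrm{Beta}\big(\mu_{\alpha,l}(q)\,\tau_{\alpha,l}(q),\ (1 - \mu_{\alpha,l}(q))\,\tau_{\alpha,l}(q)\big) \quad \text{for } l = 1, 2, \ldots, (k-2) \\
\eta_j(q) &\sim \mathrm{Beta}\big(\mu_{\eta}(q)\,\tau_{\eta}(q),\ (1 - \mu_{\eta}(q))\,\tau_{\eta}(q)\big) \\
B_j(q) &\sim \mathrm{Beta}\big(\mu_{B}(q)\,\tau_{B}(q),\ (1 - \mu_{B}(q))\,\tau_{B}(q)\big)
\end{aligned}
\]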
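The reparameterization can be checked numerically: given $\eta$, the $\alpha_l$'s and $B$, the cell probabilities should add up to one. The following Python sketch is a minimal illustration with hypothetical function names. It assumes $k \ge 2$ and $0 < B < 1$, and it assigns to cell $k-1$ whatever remains of $1 - \eta$ after the first $k-2$ cells, which mirrors the form of $e_{k-1}(q)$ used later in this subsection.

```python
import random


def cell_probabilities(eta, alphas, B, k, m):
    """Return [p_1, ..., p_m] built from (eta, alpha_1..alpha_{k-2}, B)."""
    if k < 2 or m < k or len(alphas) != k - 2:
        raise ValueError("need k >= 2, m >= k and exactly k - 2 alphas")
    if not (0.0 < B < 1.0 and 0.0 <= eta <= 1.0):
        raise ValueError("need 0 < B < 1 and 0 <= eta <= 1")
    remaining = 1.0 - eta  # mass available for the cells l < k
    head = []
    for a in alphas:
        head.append(remaining * a)  # pi_l = (1-eta)(1-alpha_1)...(1-alpha_{l-1}) alpha_l
        remaining *= 1.0 - a
    head.append(remaining)  # assumed pi_{k-1}: the mass left over from 1 - eta
    A = eta * (1.0 - B) / (1.0 - B ** (m - k + 1))  # so the tail sums to eta
    tail = [A * B ** (l - k) for l in range(k, m + 1)]
    return head + tail


def draw_beta(mu, tau, rng):
    """One draw from Beta(mu * tau, (1 - mu) * tau), the prior form used above."""
    return rng.betavariate(mu * tau, (1.0 - mu) * tau)


rng = random.Random(0)
eta = draw_beta(0.6, 20.0, rng)  # hypothetical hyperparameter values
B = draw_beta(0.8, 20.0, rng)
alphas = [draw_beta(0.5, 20.0, rng) for _ in range(2)]
p = cell_probabilities(eta, alphas, B, k=4, m=30)
print(len(p), round(sum(p), 12))
```

Writing the head cells by sequential splitting of the mass $1 - \eta$ is what keeps every $\alpha_l$ free to range over $[0,1]$ while the total stays fixed.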

For the hierarchical model, we choose the Multinomial probability as described before, since that is the most intuitive choice for the counts of the durations of different lengths. We need to keep in mind that this multinomial probability distribution of the frequency of durations of a certain length is conditioned on the total number of up-crossings $I_j(q)$

of the durations that we derived in the previous subsection. We choose the Beta distribution as the parent distribution, allowing for variation among the years, since the exact distribution is similar to a Geometric distribution after a time point and Geometric and Beta are conjugate to each other. We validate the goodness of fit of this model using the Chi-square goodness-of-fit test and, given the data, we choose the order of the AR process, $k$, such that the Chi-square goodness-of-fit statistic is minimized. To compute the Chi-square statistic, we need to find the population versions of the cell probabilities and compute the Chi-square distance between them. So, let $e_l(q)$ denote the population version of the $p_{j,l}(q)$'s, i.e., $e_l(q) = E(p_{j,l}(q))$ for $l = 1, 2, \ldots, m$. Then,

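\[
\begin{aligned}
e_1(q) &= \mu_{\alpha,1}(q)\,(1 - \mu_{\eta}(q)) \\
e_i(q) &= (1 - \mu_{\eta}(q))(1 - \mu_{\alpha,1}(q)) \cdots (1 - \mu_{\alpha,(i-1)}(q))\,\mu_{\alpha,i}(q) \quad \text{for } i = 2, \ldots, (k-2) \\
e_{k-1}(q) &= (1 - \mu_{\eta}(q))(1 - \mu_{\alpha,1}(q))(1 - \mu_{\alpha,2}(q)) \cdots (1 - \mu_{\alpha,(k-2)}(q)) \\
e_l(q) &= \mu_{\eta}(q)\, E\!\left[\frac{1}{1 - B^{m-k+1}}\right] \quad \text{for } l = k, \ldots, m
\end{aligned}
\tag{$*$}
\]

where $B \sim \mathrm{Beta}\big(\mu_B(q)\tau_B(q) + l - k,\ (1 - \mu_B(q))\tau_B(q) + 1\big)$. So, for a fixed $k$, for year $j$, the Chi-square statistic can be computed as

\[ \chi^2_j = \sum_{l=1}^{m} \frac{\big(f_{j,l}(q) - I_j(q)\,e_l\big)^2}{I_j(q)\,e_l}, \]

which approximately follows a $\chi^2_{m-k-1}$ distribution. So, we choose the order of the AR process as the $k$ for which the posterior mean of the $\chi^2_j$ statistic is minimized. Then, fixing $k$, we can compute the expected duration of an up-crossing, $E(d_{i,j}(q)) = \sum_{l=1}^{m} l\,e_l(q)$, and its posterior distribution. The model can be easily implemented using the JAGS programming language.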

Now, using JAGS, we can thus get a posterior sample of expected durations, and we can choose the estimated expected duration to be the posterior median. So, for a fixed location, for a fixed quantile q, we can computeE(di,j(q)), the expected duration of an


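Algorithm 1: Finding the posterior distributions of expected durations

1. Fix q.

2. Fix order $k \in \{1, 2, \ldots, K\}$.

3. For $r \in \{1, 2, \ldots, R\}$ (e.g. $R \ge 10^3$)

• Get $\theta^{(r)} = (\mu_{\alpha,1}, \tau_{\alpha,1}, \ldots, \mu_{\alpha,(k-2)}, \tau_{\alpha,(k-2)}, \mu_\eta, \tau_\eta, \mu_B, \tau_B)$ using JAGS.

• Compute $e^{(r)} = g(\theta^{(r)})$ as defined in $(\ast)$.

4. Then obtain $\hat{k} = \arg\min_k D_k$, where $D_k = E\Big[\frac{1}{J}\sum_{j=1}^{J} \frac{\chi^{2\,(r)}_j}{m-k-1} \,\Big|\, \text{Data}\Big]$.

5. Then fix k.

6. For each $q \in (0.7, 0.95)$, for $r \in \{1, 2, \ldots, R\}$

• Compute the expected duration $E_q^{(r)} = \sum_{l=1}^{m} l\, e_l^{(r)}(q)$.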
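Once posterior draws of the cell probabilities are available, the Chi-square statistic and the expected duration in Algorithm 1 reduce to simple arithmetic. The following is a minimal sketch under the assumption that the draws have already been produced (for example by JAGS) and stored as plain Python lists; the function names and the toy numbers are illustrative and are not part of the original analysis.

```python
def chi_square_stat(freqs, n_upcrossings, cell_probs):
    """Pearson statistic sum_l (f_l - I*e_l)^2 / (I*e_l) for one year."""
    return sum(
        (f - n_upcrossings * e) ** 2 / (n_upcrossings * e)
        for f, e in zip(freqs, cell_probs)
    )


def expected_duration(cell_probs):
    """E(d) = sum_l l * e_l, with l running over 1, ..., m."""
    return sum(l * e for l, e in enumerate(cell_probs, start=1))


def posterior_median(values):
    """Median of a list of posterior draws."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])


# Hypothetical posterior draws e^(r) of the cell probabilities (m = 3).
draws = [[0.5, 0.3, 0.2], [0.6, 0.25, 0.15], [0.4, 0.35, 0.25]]
durations = [expected_duration(e) for e in draws]
estimate = posterior_median(durations)
```

The posterior median is used as the point estimate here because the text above selects it; any other posterior summary could be substituted in `posterior_median`.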

2.2.3 Approximate Distribution of Durations

Now, let us drop the assumption that the time series follows a Markov process, since that might not always be the case. It is very difficult to compute the exact distribution of durations in general, but in this subsection we show that an approximate distribution can be computed for a stationary time series. All the analysis in this subsection is based on a fixed quantile q unless mentioned otherwise, so we drop q from the notation for now to keep things simple. In order to obtain the distribution of the i-th duration in the j-th year, i.e., $T_{2i,j} - T_{2i-1,j}$, we need to know the distribution of a sum of dependent Bernoulli random variables. This statement can be justified once we express the probability mass function of the i-th duration in the j-th year at a discrete point in the following form.


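For a given year j, for any integer l,

\[
\begin{aligned}
P(d_{i,j} = l) &= P(T_{2i,j} - T_{2i-1,j} = l)\\
&= \sum_{k=1}^{T_j} P(T_{2i,j} - T_{2i-1,j} = l \mid T_{2i-1,j} = k)\, P(T_{2i-1,j} = k)\\
&= \sum_{k=1}^{T_j} P(B_{k,j} = B_{k+1,j} = \cdots = B_{k+l-1,j} = 1,\ B_{k+l,j} = 0)\, P(T_{2i-1,j} = k)\\
&= \sum_{k=1}^{T_j} P\Big(\sum_{m=k}^{k+l-1} (1 - B_{m,j}) + B_{k+l,j} = 0\Big)\, P(T_{2i-1,j} = k)
\end{aligned}
\tag{2.7}
\]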

Here, both $B_{t,j}$ and $(1 - B_{t,j})$ are dependent Bernoulli sequences, and we need the distribution of the sum of a subset of these sequences. Though the exact distribution of a sum of dependent Bernoulli variables is not available in simple closed form, there is an extensive literature on the Poisson approximation of dependent Bernoulli sequences and its various uses. The bound for this approximation has been sharpened over the years, and here we present a lemma proved in Chen et al. [2013] which gives a simple explicit bound for the Poisson approximation.

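Lemma 2.2.2 (Chen et al. [2013]). Let $\{X_\alpha : \alpha \in J\}$ be Bernoulli random variables with success probabilities $p_\alpha$, $\alpha \in J$. Let $W = \sum_{\alpha \in J} X_\alpha$ and $\lambda = EW = \sum_{\alpha \in J} p_\alpha$. Then, for any collection of sets $B_\alpha \subset J$, $\alpha \in J$,

\[
d_{TV}\big(\mathcal{L}(W), \mathrm{Poi}(\lambda)\big) \le \Big(1 \wedge \frac{1}{\lambda}\Big)(b_1 + b_2) + \Big(1 \wedge \frac{1.4}{\sqrt{\lambda}}\Big) b_3
\]

and

\[
\big|P[W = 0] - e^{-\lambda}\big| \le \Big(1 \wedge \frac{1}{\lambda}\Big)(b_1 + b_2 + b_3),
\]

where

\[
b_1 = \sum_{\alpha \in J} \sum_{\beta \in B_\alpha} p_\alpha p_\beta, \qquad
b_2 = \sum_{\alpha \in J} \sum_{\beta \in B_\alpha \setminus \{\alpha\}} E(X_\alpha X_\beta), \qquad
b_3 = \sum_{\alpha \in J} E\,\big|E(X_\alpha \mid X_\beta, \beta \notin B_\alpha) - p_\alpha\big|.
\]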

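Now, we use this lemma to approximate the distribution of the sum of the Bernoulli sequence in Eq (2.7), i.e., $\{(1 - B_{\alpha,j}) : \alpha \in \{k, \ldots, k+l-1\}\} \cup \{B_{k+l,j}\}$:

\[
P\Big(\sum_{m=k}^{k+l-1} (1 - B_{m,j}) + B_{k+l,j} = 0\Big) \approx \exp\Big(-\Big[\sum_{m=k}^{k+l-1} P(B_{m,j} = 1) + P(B_{k+l,j} = 0)\Big]\Big)
\tag{2.8}
\]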

Given our assumption of strict stationarity of the time series, for all $m \in \{k, k+1, \ldots, k+l\}$,

\[
P(B_{m,j} = 1) = P(B_{1,j} = 1). \tag{2.9}
\]

Also,

\[
P(B_{k+l,j} = 0) = 1 - P(B_{1,j} = 1). \tag{2.10}
\]

Now, combining Eq (2.8), (2.9) and (2.10), we get

\[
P\Big(\sum_{m=k}^{k+l-1} (1 - B_{m,j}) + B_{k+l,j} = 0\Big) \approx e^{-[(l-1)P(B_{1,j} = 1) + 1]}
\tag{2.11}
\]

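Substituting Eq (2.11) into Eq (2.7), we get that for any positive integer l,

\[
\begin{aligned}
P(d_{i,j} = l) &\approx \sum_{k=1}^{T_j} e^{-[(l-1)P(B_{1,j} = 1) + 1]}\, P(T_{2i-1,j} = k)\\
&= e^{-[(l-1)P(B_{1,j} = 1) + 1]} \sum_{k=1}^{T_j} P(T_{2i-1,j} = k)
\end{aligned}
\tag{2.12}
\]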

From Eq (2.12), we conclude that the distribution of the duration can be approximated by a Geometric distribution with success probability $1 - e^{-P(B_{1,j} = 1)}$ if we assume strict stationarity of the time series $\{X_{t,j} : t = 1, 2, \ldots;\ j = 1, 2, \ldots\}$. An explicit bound on the distance between the exact and approximate distributions can be obtained using Lemma 2.2.2. Therefore, we state the following theorem on the approximate distribution of the duration of an up-crossing.

Theorem 2.2.2. Under the assumption of strict stationarity of the time series $(X_{t,j})$, the distribution of the i-th duration in the j-th year, $d_{i,j}$, can be approximated by the distribution of another random variable $d^*_{i,j}$, where $d^*_{i,j} \sim \mathrm{Geo}\big(1 - e^{-P(B_{1,j} = 1)}\big)$.
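To make the approximation in Theorem 2.2.2 concrete, the sketch below evaluates the Geometric probabilities for a given exceedance probability P(B = 1). It assumes the Geometric distribution is supported on {1, 2, ...} with probability mass s(1 - s)^(l-1), which matches the decay rate e^{-(l-1)P(B=1)} in Eq (2.12); this parameterization and the function names are assumptions made for illustration.

```python
import math


def geometric_success_prob(p_exceed):
    """Success probability 1 - exp(-P(B = 1)) from Theorem 2.2.2."""
    return 1.0 - math.exp(-p_exceed)


def approx_duration_pmf(length, p_exceed):
    """P(d* = length) for d* ~ Geo(s) on {1, 2, ...}: s * (1 - s)^(length - 1)."""
    s = geometric_success_prob(p_exceed)
    return s * (1.0 - s) ** (length - 1)


def approx_expected_duration(p_exceed):
    """Mean 1 / s of the approximating Geometric distribution."""
    return 1.0 / geometric_success_prob(p_exceed)


# Illustrative value: a threshold at the 0.8 quantile gives P(B = 1) = 0.2.
pmf_first_ten = [approx_duration_pmf(l, 0.2) for l in range(1, 11)]
mean_length = approx_expected_duration(0.2)
```

Successive probabilities differ by the constant factor exp(-P(B = 1)), so a higher threshold (smaller exceedance probability) gives a slower decay under this approximation.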

Though strict stationarity is a sufficient condition for Theorem 2.2.2, it is not a necessary condition for the approximate distribution of durations to be Geometric. Suppose we have a non-stationary time series $\{X_{t,j} : t = 1, 2, \ldots;\ j = 1, 2, \ldots\}$ that we can write as $X_{t,j} = \mu_t + \sigma_t \epsilon_{t,j}$, where $t = 1, \ldots, T_j$ and $j = 1, \ldots, J$. An empirical estimate of $\mu_t$ can be $\hat{\mu}_t = \frac{1}{J}\sum_{j=1}^{J} X_{t,j}$, or the median of $(X_{t,1}, \ldots, X_{t,J})$, or a moving-average-based estimate

\[
\tilde{\mu}_{t,j} = \frac{\sum_{m=t-d}^{t+d} X_{m,j}}{2d+1} \quad \text{and} \quad \hat{\mu}_t = \frac{1}{J}\sum_{j=1}^{J} \tilde{\mu}_{t,j},
\]

where $X_{t,j} = 0$ if $t \le 0$ or $t > T_j$. Thus, $\hat{\mu}_t$ is the long-term average of the $X_{t,j}$'s. Similarly, we estimate $\sigma_t$ by the long-term standard deviation of the $X_{t,j}$'s, i.e., $\hat{\sigma}_t^2 = \frac{1}{J-1}\sum_{j=1}^{J} (X_{t,j} - \hat{\mu}_t)^2$, where $t = 1, \ldots, T_j$. Also, we can use a robust version of the standard deviation in case $\hat{\mu}_t$ is the median, i.e., $\hat{\sigma}_t = \frac{1}{J}\sum_{j=1}^{J} |X_{t,j} - \hat{\mu}_t|$.

So instead of $X_{t,j}$, we can work with an estimate of $\epsilon_{t,j}$, namely $\hat{\epsilon}_{t,j} = \frac{X_{t,j} - \hat{\mu}_t}{\hat{\sigma}_t}$, and its quantiles $m(q)$, where $m(q) = \frac{M(q) - \hat{\mu}_t}{\hat{\sigma}_t}$, $M(q)$ being the q-th quantile of the $X_{t,j}$'s. But the distribution of $B_{t,j}$ will remain the same since $I(X_{t,j} \ge M(q)) = I(\hat{\epsilon}_{t,j} \ge m(q))$. So,


2.2.4 Hierarchical Model based on Approximate Distribution

Now, building on the framework of Subsection 2.2.3, we design a hierarchical model for the durations of the up-crossings for a chosen quantile q. We assume

\[
\begin{aligned}
d_{i,j}(q) &\sim \mathrm{Geo}(p_j(q)),\\
p_j(q) &\sim \mathrm{Beta}\big(\alpha(q)\tau(q),\ (1-\alpha(q))\tau(q)\big),
\end{aligned}
\]

where $d_{i,j}(q)$ is the i-th duration in year j for quantile q, and $\alpha(q)$ and $\tau(q)$ need to be estimated. We choose the Geometric distribution for durations in a year since the approximate distribution of durations is theoretically Geometric, as shown in the previous subsection. We assume that the "success" probability of the Geometric distribution varies from year to year, but that the yearly probabilities are linked through a parent distribution, which we choose to be Beta since the Beta and Geometric distributions form a conjugate family. We later validate this assumption in real case studies. Because of this conjugacy, estimating $\alpha(q)$ and $\tau(q)$ by maximizing the marginal likelihood becomes computationally straightforward, as we now illustrate. The marginal likelihood of $(\alpha(q), \tau(q))$ is given by

\[
\begin{aligned}
L(\alpha(q), \tau(q)) &= \prod_{j=1}^{J} \int p_j(q)^{\sum_i d_{i,j}(q) - I_j(q)}\, (1 - p_j(q))^{I_j(q)}\, \frac{p_j(q)^{\alpha(q)\tau(q) - 1}\, (1 - p_j(q))^{(1-\alpha(q))\tau(q) - 1}}{B\big(\alpha(q)\tau(q), (1-\alpha(q))\tau(q)\big)}\, dp_j(q)\\
&= \prod_{j=1}^{J} \frac{B\big(\alpha(q)\tau(q) + \sum_i d_{i,j}(q) - I_j(q),\ I_j(q) + (1-\alpha(q))\tau(q)\big)}{B\big(\alpha(q)\tau(q), (1-\alpha(q))\tau(q)\big)}
\end{aligned}
\tag{2.13}
\]

where $B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$ and $I_j(q)$ is the number of up-crossings in year j with threshold $M(q)$. As the quantile q increases, the number of up-crossings in a year decreases, and an entire year can sometimes have no up-crossings at all. One way to deal with this issue is to eliminate that year from the analysis, but this discards some valuable information. So instead, we keep that year's contribution by simply setting $\sum_i d_{i,j}(q)$ to 0 in the likelihood Eq (2.13). We do not, however, need to estimate the average durations through the marginal likelihood for higher quantiles, because we show later that those can be extrapolated from the quadratic relationship between the average durations and the quantiles. The details of this analysis are provided in Section 2.4. We can still estimate the parameters based on the $d_{i,j}(q)$'s even for a year with missing data (or no up-crossings) by computing the conditional likelihood, which is simple because of the conjugacy of these two distributions.
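The closed form in Eq (2.13) can be evaluated on the log scale with log-Gamma functions. The sketch below is one possible way to do this; the crude grid search stands in for a proper numerical optimizer, and the function names, the grid, and the toy durations are assumptions made for illustration rather than the estimation code used in the analysis. A year with no up-crossings is passed as an empty list, so that both its duration sum and its up-crossing count are 0.

```python
import math


def log_beta(a, b):
    """log B(a, b) computed through log-Gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal_likelihood(alpha, tau, durations_by_year):
    """Log of Eq (2.13); durations_by_year holds one list of durations per year."""
    a = alpha * tau
    b = (1.0 - alpha) * tau
    total = 0.0
    for durations in durations_by_year:
        n_up = len(durations)   # I_j(q), number of up-crossings in the year
        d_sum = sum(durations)  # sum_i d_ij(q); 0 for a year without up-crossings
        total += log_beta(a + d_sum - n_up, b + n_up) - log_beta(a, b)
    return total


def grid_maximizer(durations_by_year, alphas, taus):
    """Return the (alpha, tau) grid point with the largest log marginal likelihood."""
    best = max(
        (log_marginal_likelihood(a, t, durations_by_year), a, t)
        for a in alphas
        for t in taus
    )
    return best[1], best[2]


# Hypothetical durations (in days) for four years; the third year has none.
toy = [[2, 5, 1], [3, 3], [], [7, 1, 2, 4]]
alphas = [i / 20 for i in range(1, 20)]
taus = [2.0 + 0.5 * i for i in range(40)]
alpha_hat, tau_hat = grid_maximizer(toy, alphas, taus)
```

In this sketch a year without up-crossings contributes a factor of exactly 1 to the likelihood, since the numerator and denominator Beta functions coincide.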

The empirical validation of the assumed hierarchical model is carried out in Section 2.4 using exploratory analysis of the data sets. Now, given this model, we can quantify the average or expected length of durations corresponding to the chosen threshold $M(q)$:

\[
\begin{aligned}
E(d_{i,j}(q) \mid p_j(q)) &= \frac{1}{1 - p_j(q)},\\
E(d_{i,j}(q)) &= E\left[\frac{1}{1 - p_j(q)}\right]
= \frac{B\big(\alpha(q)\tau(q),\ (1-\alpha(q))\tau(q) - 1\big)}{B\big(\alpha(q)\tau(q),\ (1-\alpha(q))\tau(q)\big)}
= \frac{\tau(q) - 1}{(1-\alpha(q))\tau(q) - 1}.
\end{aligned}
\tag{2.14}
\]
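The last equality in Eq (2.14) follows from the identity $\Gamma(x+1) = x\,\Gamma(x)$. As a short worked step, write $a = \alpha(q)\tau(q)$ and $b = (1-\alpha(q))\tau(q)$ (shorthand introduced only for this step), and assume $b > 1$ so that the expectation is finite:

```latex
\[
\frac{B(a, b-1)}{B(a, b)}
= \frac{\Gamma(a)\,\Gamma(b-1)}{\Gamma(a+b-1)} \cdot \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}
= \frac{\Gamma(a+b)}{\Gamma(a+b-1)} \cdot \frac{\Gamma(b-1)}{\Gamma(b)}
= \frac{a+b-1}{b-1}
= \frac{\tau(q) - 1}{(1-\alpha(q))\tau(q) - 1},
\]
% since a + b = tau(q) and b = (1 - alpha(q)) tau(q).
```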

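We plug the estimates $\hat{\alpha}(q)$ and $\hat{\tau}(q)$ of $\alpha(q)$ and $\tau(q)$ into Eq (2.14) to get the expected length of durations when the threshold is $M(q)$. We estimate the confidence interval of this expected length by the delta method: the estimated asymptotic variance of $\frac{\hat{\tau}(q) - 1}{(1-\hat{\alpha}(q))\hat{\tau}(q) - 1}$ is $\hat{a}(q)\,\hat{\Sigma}(q)\,\hat{a}(q)^T$, where $\Sigma(q)$ is the asymptotic variance-covariance matrix of $\hat{\alpha}(q)$ and $\hat{\tau}(q)$, the estimates of $\alpha(q)$ and $\tau(q)$, and where $a$ is the gradient of Eq (2.14) with respect to $(\alpha(q), \tau(q))$,

\[
a = \left(\frac{\tau(q)\,(\tau(q) - 1)}{\big((1-\alpha(q))\tau(q) - 1\big)^2},\ \frac{-\alpha(q)}{\big((1-\alpha(q))\tau(q) - 1\big)^2}\right).
\]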
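As an illustration of this interval, the sketch below differentiates Eq (2.14) with respect to the two parameters and applies the variance formula with a 2 x 2 covariance matrix. The normal quantile 1.96 (a 95% interval), the function names, and the example numbers are assumptions; the covariance matrix would in practice come from the marginal likelihood fit.

```python
import math


def expected_length(alpha, tau):
    """Eq (2.14): (tau - 1) / ((1 - alpha) * tau - 1)."""
    return (tau - 1.0) / ((1.0 - alpha) * tau - 1.0)


def gradient(alpha, tau):
    """Partial derivatives of Eq (2.14) with respect to alpha and tau."""
    denom = ((1.0 - alpha) * tau - 1.0) ** 2
    return (tau * (tau - 1.0) / denom, -alpha / denom)


def delta_method_interval(alpha, tau, cov, z=1.96):
    """Interval center -/+ z * sqrt(a cov a^T); cov is [[var_a, cov_at], [cov_at, var_t]]."""
    g_alpha, g_tau = gradient(alpha, tau)
    variance = (
        g_alpha * g_alpha * cov[0][0]
        + 2.0 * g_alpha * g_tau * cov[0][1]
        + g_tau * g_tau * cov[1][1]
    )
    half_width = z * math.sqrt(variance)
    center = expected_length(alpha, tau)
    return center - half_width, center + half_width


# Hypothetical estimates and covariance matrix, for illustration only.
lower, upper = delta_method_interval(0.6, 10.0, [[0.01, 0.0], [0.0, 1.0]])
```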

Now, for a chosen quantile q, if the length of a duration exceeds the expected length of duration, we can consider it abnormal, and hence that period of consecutive days can be considered a heatwave. Thus, we fix the minimum length of duration of a heatwave on a probabilistic basis. Also, the hierarchical model can be used to quantify the likelihood of a heatwave forming in a particular year, and hence can be used for the prediction of heatwaves. Moreover, there is a quadratic relationship between the expected length of duration, $E(d_{i,j}(q))$, and the quantile q, with which the expected length of duration at a fixed location can easily be determined. We discuss this in greater detail in Section 2.4.

2.3 Simulation study


2.3.1 Comparing Theoretical and Empirical Probabilities of Auto-regressive Process

We first compute the transition probabilities $\alpha_1$ and $\alpha_2$ associated with an Auto-regressive process of order 1, where $\alpha_1 = P(B_{2,j}(q) = 1 \mid B_{1,j}(q) = 1)$ and $\alpha_2 = P(B_{2,j}(q) = 1 \mid B_{1,j}(q) = 0)$, when the Auto-regressive process $X_t$ can be described as $X_t = \rho X_{t-1} + a_t$, the $a_t$'s being the associated innovations, which follow a Normal distribution.

Now, we compute the theoretical probabilities of the durations with these transition probabilities as

\[
P(d_{i,j}(q) = l) = (1 - \alpha_1)\,\alpha_1^{\,l-2}\,\alpha_2 \times P(B_{2,j} = 1, B_{1,j} = 0). \tag{2.15}
\]

Now, we generate 1000 observations of yearly data (365 days) of the Bernoulli sequence for the chosen quantiles with these transition probabilities using the Binomial distribution with a single trial, i.e., $B_t \sim \mathrm{Binomial}(1, p_t)$ where $p_t = \alpha_1(1 - B_{t-1}) + (1 - \alpha_2)B_{t-1}$. We then compute the empirical probabilities of durations using this simulated data. We repeat these steps for different values of $\alpha_1$, $\alpha_2$, $\rho$ and quantiles. We observe that the empirical and theoretical distributions match in each case. Figure 2.2 shows the comparison of theoretical and empirical probabilities for AR(1) processes.
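The comparison between empirical and theoretical duration probabilities can be sketched for a generic two-state chain as follows. The parameterization here (an entry probability and a stay probability) and the conditional Geometric formula are the textbook ones for a first-order chain and are assumptions of this sketch; they are not a transcription of Eq (2.15), and the numerical values are illustrative.

```python
import random


def run_lengths(bits):
    """Lengths of maximal runs of 1s, closed by a 0 or by the end of the series."""
    runs, current = [], 0
    for b in bits:
        if b == 1:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def simulate_chain(n_days, p_enter, p_stay, rng):
    """Two-state chain with P(1 | previous 0) = p_enter and P(1 | previous 1) = p_stay."""
    bits, prev = [], 0
    for _ in range(n_days):
        p = p_stay if prev == 1 else p_enter
        prev = 1 if rng.random() < p else 0
        bits.append(prev)
    return bits


def empirical_pmf(durations, max_len):
    """Share of runs with length 1, ..., max_len."""
    n = len(durations)
    return [sum(1 for d in durations if d == l) / n for l in range(1, max_len + 1)]


def theoretical_pmf(p_stay, max_len):
    """Given that a run has started, its length l has probability (1 - p_stay) * p_stay^(l-1)."""
    return [(1.0 - p_stay) * p_stay ** (l - 1) for l in range(1, max_len + 1)]


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
bits = simulate_chain(200_000, p_enter=0.135, p_stay=0.427, rng=rng)
emp = empirical_pmf(run_lengths(bits), max_len=5)
theo = theoretical_pmf(0.427, max_len=5)
```

A single long chain is simulated here for simplicity; simulating separate years, as in the text, would additionally truncate any run that is still in progress at the end of a year.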

Now, computation of the four transition probabilities is also easy for an Auto-regressive process of order 2. So we compute the transition probabilities and then the empirical and theoretical probabilities for AR(2) processes with selected values of $\rho_1$ and $\rho_2$, where the Auto-regressive process $X_t$ can be described as $X_t = \rho_1 X_{t-1} + \rho_2 X_{t-2} + a_t$, $a_t$ being

Figure 2.2: Empirical and theoretical duration distribution for selected values of transition probabilities, ρ, and quantiles for AR(1). [Nine panels comparing theo_prob and emp_prob for durations 1 to 10, with panel titles (alpha, beta, rho, q) = (0.135, 0.573, 0.5, 0.8), (0.076, 0.674, 0.5, 0.9), (0.041, 0.753, 0.5, 0.95), (0.095, 0.404, 0.75, 0.8), (0.052, 0.493, 0.75, 0.9), (0.029, 0.562, 0.75, 0.95), (0.049, 0.169, 0.95, 0.8), (0.033, 0.199, 0.95, 0.9), (0.025, 0.218, 0.95, 0.95).]

Figure 2.3: Empirical and theoretical duration distribution for selected values of ρ1, ρ2 and quantiles for AR(2). [Nine panels comparing theo_prob and emp_prob for durations 1 to 10, with panel titles (rho1, rho2) = (0.6, 0.09), (0.5, 0.06), (0.2, 0.01), each at q = 0.8, 0.9, 0.95.]

simulate from an Auto-regressive process as follows:

\[
X_{t,j} = 0.79\,X_{t-1,j} - 0.03\,X_{t-2,j} + a_{t,j},
\]

where $t \in \{16, 17, \ldots, 350\}$, $j \in \{1, 2, \ldots, 30\}$, and then we add the estimated trend averaged over all 22 years. Figure 2.4 shows that the simulated data and the observed data are similar, so the inferences we draw for the actual data are not merely artifacts of chance; the results are validated using the simulated data.

Figure 2.4: Observed (red) and simulated data (gray) from AR(2). [x-axis: days; y-axis: observed and simulated temperature.]

Figure 2.5: Expected durations with estimated order (red) and known order (blue) with the boxplot showing the Monte Carlo error for 50 simulations. [Axis labels: fixed order; estimated order.]

2.4 Data Analysis


2.4.1 Atlanta Data

Description of the data

The Atlanta data was obtained from the National Climatic Data Center for the first-order weather station located at the Atlanta Hartsfield International Airport and was used in the study of the association between Emergency Department visits and heatwaves in Chen et al. [2017]. The data consists of daily values of many weather-related variables, including various types of temperature, for the years 1991 to 2012. Most of the variables reported in the dataset are ones that might be used to characterize a heatwave, e.g., maximum and minimum ambient temperature, maximum and minimum dew-point temperature, and maximum and minimum apparent temperature. For our analysis, we use the daily maximum ambient temperature for illustration. Here, ambient is defined as "of the surrounding area or environment" (definition by NOAA). Any of the other aforementioned variables can also be used, since our proposed model does not depend on a specific time series. The Atlanta data also contains hourly values of many other variables (including solar radiation and relative humidity), but we mostly illustrate our methodology using daily data. Maximum or minimum apparent temperature can be used if we want to incorporate relative humidity.

2.4.2 Exploratory Analysis

We first set a quantile q and look at the corresponding up-crossings in each individual

year. Following the popular choice (e.g. Peng et al. [2011]) we first set the quantile q


2, 6, 7, 1, 2, 9, 1, 2, 5, 3, and 9 respectively.

Figure 2.6: Up-crossings corresponding to the threshold at the 0.81 quantile in the year 1991. [x-axis: days; y-axis: maximum ambient temperature.]


2.4.3 Expected Duration corresponding to the Exact Distribution

As we have already seen in the simulation study, estimating the order as described in Algorithm 1 does not change the estimate of expected durations. So, we estimate the order of the Auto-regressive process first by minimizing the expected Chi-square statistic. Using JAGS, we get the posterior distributions of the $D_k$'s (see Algorithm 1), and Figure 2.7 shows the boxplot of posterior samples of the $D_k$'s, where $k \in \{2, 3, \ldots, 13\}$.

Figure 2.7: The posterior distributions of Chi-square statistic for different orders. [Boxplots for orders 2 to 13.]

From Figure 2.7, we can see that k = 5 minimizes the expected Chi-square statistic. Using this

as thresholds would result in a significantly higher number of up-crossings, which may not be appropriate. Also, quantiles in the literature range from 0.80 to 0.99, so we would like to base inferences on that range of quantiles. We extend the range slightly to the left, since later in this section we fit relationships between the expected lengths and the quantiles, which require a wider range. We also fix the upper end of the range at 0.95, since an analysis with a higher threshold will not be reliable: there will be very few up-crossings in a year, or even none at all, if we choose quantiles higher than 0.95. So, the range (0.70, 0.95) is a reasonable range of quantiles for our analysis, and Figure 2.8 shows the expected lengths of durations in this range with their credible intervals.

[Figure 2.8. x-axis: threshold (quantiles 0.70 to 0.95); y-axis: Expected length.]

From Figure 2.8, we see that the expected length of durations decreases with higher quantiles, which agrees with the fact that the chosen time series cannot stay above higher thresholds for a long time. It also suggests that there might be a simple polynomial relationship between the quantiles and the expected length. So, we fit three different relationships between them (linear, log-linear, and quadratic) and choose the one with the highest adjusted $R^2$. We find that the quadratic relationship is the best among them, with an adjusted $R^2$ of 0.99. While fitting the quadratic equation, we assume that the errors are independent, which may not hold in practice. Even if we relax this assumption, the estimates of the coefficients remain unbiased, although they lose efficiency. Equation (2.16) shows the fitted relationship between expected length and quantiles. For $q \in (0.65, 0.99)$,

\[
E(d_{ij}(q)) = 3.9 - 4.55\,(-2.16 + 2.61q + 4.01 \times 10^{-15} q^2) + 0.29\,(26.37 - 64.46q + 39.06q^2).
\tag{2.16}
\]

Figure 2.9 represents the estimated quadratic relationship between the observed expected lengths and the quantiles.

We can now compute the expected length of durations at a fixed location using the quadratic equation. Using this relationship, we can provide a definition of heatwave that encompasses all the existing definitions of heatwave.
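The model comparison by adjusted $R^2$ described in this subsection can be sketched with ordinary least squares on polynomial terms. The code below is a self-contained illustration that solves the normal equations directly in the raw basis (1, q, q^2) rather than the orthogonal-polynomial form of the fitted equation; it compares only a linear and a quadratic fit, and the expected lengths in the example are made-up numbers, not the Atlanta estimates.

```python
def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def fit_polynomial(q, y, degree):
    """Least-squares coefficients (constant term first) from the normal equations."""
    rows = [[qi ** p for p in range(degree + 1)] for qi in q]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(degree + 1)]
           for i in range(degree + 1)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(degree + 1)]
    return solve_linear(xtx, xty)


def adjusted_r2(q, y, coef):
    """Adjusted R^2 = 1 - (SS_res / (n - k - 1)) / (SS_tot / (n - 1))."""
    n, k = len(y), len(coef) - 1
    fitted = [sum(c * qi ** p for p, c in enumerate(coef)) for qi in q]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))


# Hypothetical expected lengths at six quantiles, for illustration only.
quantiles = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
lengths = [6.1, 5.4, 4.8, 4.1, 3.5, 2.6]
scores = {d: adjusted_r2(quantiles, lengths, fit_polynomial(quantiles, lengths, d))
          for d in (1, 2)}
```

Once the quadratic coefficients are stored, the expected length at any quantile in the fitted range is a single polynomial evaluation, which is what makes extrapolation to sparse high thresholds cheap.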

2.4.4 Expected Duration corresponding to the Approximate Distribution

Figure 2.9: Quadratic relationship between expected lengths and the quantiles for exact distributions. [Panel title: Quadratic fit; x-axis: quantiles; y-axis: expected duration.]

and see if the results match or not. Since the proposed approximate model suggests that, for a fixed quantile q, the durations in a year, $d_{ij}(q)$, follow a Geometric distribution, we first

up-crossings in general, which is consistent with the theory we developed for the approximate distributions. Figures corresponding to this model validation can be found in the Appendix. With this conclusion, we now turn to the validation of the hierarchical model. We use the marginal likelihood method to estimate the parameters α and τ of the hierarchical Beta distribution. Figure 2.10 shows the comparison of the empirical cumulative distribution function and the cumulative distribution function of the estimated Beta distribution, suggesting a suitable fit.

Figure 2.10: The comparison of empirical and estimated CDF (the red continuous line corresponds to the estimated Beta distribution and the black dotted line corresponds to the empirical CDF). [Panel title: Fitting Beta; x-axis: value; y-axis: CDF.]


goodness of fit test (Anderson and Darling [1952]) using the R package 'goftest'. The p-value corresponding to the Kolmogorov-Smirnov test is 0.76 and that corresponding to the Anderson-Darling test is 0.68. Both p-values are much greater than 0.05, so neither test rejects the proposed hierarchical model for the Atlanta data. Though we show this result for the chosen quantile 0.81, the model remains a good fit when we vary the quantile between 0.70 and 0.95, the range we have chosen. We now compute the expected lengths of duration, $E(d_{ij}(q))$, for different choices of quantiles. Figure 2.11 shows the plot of expected length vs. quantile for the Atlanta data.

Figure 2.11: Plot of expected length vs. quantiles with confidence interval at selected quantiles for approximate distributions

From Figure 2.11, we see that the expected lengths are similar to the ones with the exact distribution; the difference between them might be a result of the Geometric approximation of the exact distribution. Equation (2.17) shows the fitted relationship between expected length and quantiles. For $q \in (0.65, 0.99)$,

\[
E(d_{ij}(q)) = 4.57 - 5.36\,(-2.16 + 2.61q + 4.01 \times 10^{-15} q^2) + 0.28\,(26.37 - 64.46q + 39.06q^2).
\tag{2.17}
\]

Figure 2.12 represents the estimated quadratic relationship between the observed expected lengths and the quantiles.

Figure 2.12: Quadratic relationship between expected lengths and the quantiles for approximate distribution. [Panel title: Quadratic fit; x-axis: quantiles; y-axis: expected_length.]

the expected length of duration for a given quantile in a fixed location.

2.4.5 Definition of heatwave using expected length of durations

The idea of this definition is to combine all the existing definitions of heatwave using the definition of durations. Given a threshold defined by a quantile q, we can now get a statistical cut-off for the durations using the expected length.

Definition 2.4.1. At a fixed location, given a quantile q, if an up-crossing of a time series of temperature or heat-index lasts longer than the expected length of duration, i.e., $E(d_{ij}(q))$ (as computed from the estimated quadratic equation for that location), we call that up-crossing a heatwave corresponding to the quantile $q \in (0.75, 0.95)$.
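Definition 2.4.1 translates directly into a scan over a daily series: find each maximal run of days at or above the threshold and keep the runs that last longer than the expected length. The sketch below illustrates this under the assumption that the threshold M(q) and the expected length have already been computed; the function names and the sample temperatures are hypothetical.

```python
def upcrossing_runs(series, threshold):
    """Return (start_index, length) for each maximal run of values >= threshold."""
    runs, start = [], None
    for t, x in enumerate(series):
        if x >= threshold:
            if start is None:
                start = t
        elif start is not None:
            runs.append((start, t - start))
            start = None
    if start is not None:
        runs.append((start, len(series) - start))
    return runs


def heatwaves(series, threshold, expected_length):
    """Up-crossings that last longer than the expected length of duration."""
    return [(s, n) for s, n in upcrossing_runs(series, threshold) if n > expected_length]


# Hypothetical daily maximum temperatures with threshold 90 and expected length 3.2 days.
temps = [70, 91, 92, 88, 93, 94, 95, 96, 80]
flagged = heatwaves(temps, threshold=90, expected_length=3.2)
```

In this example the two-day run starting at index 1 is an up-crossing but not a heatwave, while the four-day run starting at index 4 exceeds the expected length and is flagged.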

Another advantage of using this definition is that we do not have to compute the expected length of durations for every quantile from the data. Less data is available for higher thresholds, so we can simply extrapolate the expected lengths corresponding to higher quantiles using the quadratic relationship.

The only remaining problem is to find the optimal quantile q for a fixed location s. One way to do that is to obtain mortality data at the location s, compare the mortality rates corresponding to heatwaves under different definitions, and choose the definition that has the highest association with the mortality rate or health hazards. Using our definition reduces the number of definitions that need to be compared. This problem will be pursued in separate future work.

2.4.6 Analysis based on USCRN Data


equations for every location in the later sections.

2.4.7 Description of the USCRN data

The USCRN data contains daily and hourly meteorological values collected at 232 weather stations of the USCRN: 226 stations in the United States and 6 stations outside the United States. The time period ranges from 2000 to 2017. The data contains daily values of average ambient temperature (here, average does not mean an hourly average; rather, it is the average of the temperature recorded at 6 different radars) and precipitation, as well as hourly values of the same variables. For our analysis, we use only the daily values of ambient temperature. Some USCRN stations are fairly new, so they do not have data for the number of years required for our analysis. We therefore exclude stations with fewer than 10 years of data. Thus, we end up using 126 stations with at least 10 years of daily values of meteorological variables.

Since this data set is fairly large, it contains many missing values, as expected. If the daily average ambient temperature is missing on a certain date of a specific year, we replace it with the median of the daily temperatures on the same date in the other years. In that way, we retain a robust analysis without discarding the days with missing values. In a further study, we will use hierarchical Bayesian models that impute missing values from posterior predictive distributions.
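The median-based replacement of missing values can be sketched as follows. The data layout (a dictionary mapping each year to a list of daily values, with None marking a missing day) and the function name are assumptions made for this illustration; all years are assumed to cover the same calendar days.

```python
from statistics import median


def impute_by_calendar_day(temps_by_year):
    """Replace each missing value (None) by the median of the same day in the other years."""
    years = list(temps_by_year)
    n_days = len(temps_by_year[years[0]])
    filled = {year: list(values) for year, values in temps_by_year.items()}
    for day in range(n_days):
        for year in years:
            if temps_by_year[year][day] is None:
                others = [
                    temps_by_year[other][day]
                    for other in years
                    if other != year and temps_by_year[other][day] is not None
                ]
                if others:  # leave the gap if no other year has a value
                    filled[year][day] = median(others)
    return filled


# Hypothetical two-day records for three years; 2001 is missing its second day.
records = {2001: [10.0, None], 2002: [12.0, 20.0], 2003: [14.0, 22.0]}
completed = impute_by_calendar_day(records)
```

The medians are always taken over the originally observed values, so an imputed value never feeds into the imputation of another year.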


Figure 2.13: Locations of 126 USCRN stations

2.4.8 Fitting the Hierarchical Model

Now, we repeat for all 126 stations the analysis we carried out for the Atlanta data with the approximate distribution. For a fixed quantile 0.81, we estimated the Beta distribution for each station and computed the p-value corresponding to the Anderson-Darling test. Except for 8 stations, the p-values are greater than 0.05, suggesting a good fit in general. Upon visual inspection, the test appears to fail at those 8 stations because of a single value that lies far from the estimated Beta cumulative distribution. Further investigation is needed to reach a decisive conclusion.

$R^2$. Figure 2.14 shows that the adjusted $R^2$ values for all these stations are greater than 0.95, suggesting a good fit for all of them.

Figure 2.14: Boxplot of the adjusted R2 for all the stations. [Axis label: Adjusted R-squared, from 0.90 to 1.00.]

Now, we need to investigate whether there is spatial clustering among the estimated coefficients of the quadratic equations. So we plot all the estimated quadratic equations together to check this visually. Figure 2.15 shows the spatial clustering of the quadratic equations.

Figure 2.15: Spatial clustering of the estimated quadratic equations. [Panel title: Quadratic Plots; x-axis: quantiles; y-axis: expected length.]

2.5 Comparing with Real Heatwaves

Now, let us find when heatwaves occurred according to our definition and compare those with recorded heatwaves. Figure 2.16 and Figure 2.17 show the heatwaves that occurred at Atlanta in 2007 and 2012 for two different quantiles (0.8 and 0.95) using our definition with the expected lengths corresponding to the approximate distribution.

Figure 2.16: Heatwaves in 2007 at Atlanta (gray being the heatwave days with quantile 0.8 and red being the same with quantile 0.95). [x-axis: days; y-axis: maximum ambient temperature.]

midwest) and Canada recorded a massive heatwave in Summer, 2012 (https://www.bbc.com/news/world-us-canada-18758667, http://www.climatecentral.org/news/coverage-of-the-2012-heat-wave-archived-and-accessable). So our definition is able to capture real heatwaves effectively. It is therefore reasonable to use our definition in the future, since it is also based on a probabilistic framework.

2.6 Discussion

Figure 2.17: Heatwaves in 2012 at Atlanta (gray being the heatwave days with quantile 0.8 and red being the same with quantile 0.95). [x-axis: days; y-axis: maximum ambient temperature.]

sustainability of a time series above a threshold can be expressed using our definition. We have also found a quadratic relationship between the threshold quantiles and the expected duration of an up-crossing for a fixed location, which makes the identification of heatwaves much simpler.

The next step toward finalizing the definition of a heatwave at a fixed location would be to find an optimal threshold, which might require an analysis of data on mortality and health hazards related to extreme heat and/or relative humidity. With our definition, the number of comparisons should be much smaller than when scrutinizing every possible combination of threshold and sustained length.
