3.1 1929 Buller and 1968 Inangahua earthquakes, New Zealand
Chapter 4 - Spatial models of earthquake-triggered landslide probability
4.1 Methodology: Hillslope failure probability modelling
There is a large body of literature exploring controls on the spatial distribution of earthquake-triggered landslides. All these studies first involve processing the data, in order to convert maps of landslide areas or locations into a format suitable for statistical analysis. This commonly involves the calculation of landslide point density or landslide area-density within discrete zones representing regions with similar hillslope stability or seismic forcing characteristics. In calculating landslide density, the response variable for analysis is continuous, indicating the percentage of hillslopes that fail within a given area. This approach is suitable for the majority of studies which consider the influence of variables on landslide occurrence individually using simple statistics designed to model the influence of a single predictor variable (e.g. PGA) on a single
continuous response variable (e.g. landslide density). However, because multiple factors influence the spatial distribution of landslides, bivariate approaches can provide only partial explanations. This limitation can be addressed by analysing hillslope failure probability using multiple regression models.
4.1.1 Introduction to logistic regression
Logistic regression is a type of regression analysis used for predicting the outcome of a categorical response variable (Cox, 1958, Walker and Duncan, 1967) and is widely used in biomedical research (Hosmer and Lemeshow, 2000) and increasingly in the earth sciences (e.g.: Atkinson et al., 1998, Garcia-Rodriguez et al., 2008, Gorsevski et al., 2006, Perkins, 1997, von Ruette et al., 2011) as a method of modelling probability. In the case of binomial logistic regression the dependent variable (𝑌) is binary (0 and 1).
By convention 1 indicates the occurrence of an event of interest while 0 indicates non-occurrence. Logistic regression is used to estimate the coefficients ($b_0, b_1, \dots$) for predicting the probability that 𝑌 = 1, given the values of one or more predictor variables ($x_1, x_2, \dots$), using the logistic function (the inverse of the log-odds function):
Equation 4-1: (Chen et al., 2012a)
$$P(Y = 1) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n)}}$$
The function has an unlimited range for 𝑥, while 𝑃 is restricted to the range 0 to 1. Use of this function in logistic regression prevents predicted probabilities from exceeding 1 or falling below 0. The regression coefficients are estimated by the method of maximum likelihood. Logistic regression carries many of the usual assumptions of ordinary least squares regression. These include the following (Chen et al., 2012a: Chapter 3):
No important variables are omitted from the model
No extraneous variables are included in the model
The independent variables are measured without error
The observations are independent
The independent variables are not linear combinations of each other
It is additionally assumed that the conditional probabilities are a logistic function of the independent variables, i.e. that the log-odds of 𝑌 = 1 are a linear combination of the independent variables. This assumption is best understood by considering the linear form of the model equation:
$$\ln\left(\frac{P(Y = 1)}{1 - P(Y = 1)}\right) = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$$
Where $\ln\left(\frac{P(Y = 1)}{1 - P(Y = 1)}\right)$ is the log-odds of probability, expressed as a linear function of the independent variables. Note that logistic regression carries no assumptions regarding the distribution of either the response or predictor variables. This means that predictor variables can be continuous or categorical.
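The relationship between the two forms of the model can be illustrated with a minimal Python sketch. The coefficient and predictor values are hypothetical and chosen only for illustration, and the function names are assumptions rather than part of any cited method. The sketch shows that taking the log-odds of the predicted probability recovers the linear combination of the predictors.

```python
import math

def logistic(z):
    """Map a linear predictor z (any real number) to a probability between 0 and 1."""
    # Branch on the sign of z so that math.exp never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_odds(p):
    """Inverse of logistic(): the log-odds of a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def linear_predictor(coeffs, xs):
    """Return b0 + b1*x1 + ... + bn*xn, with coeffs = [b0, b1, ..., bn]."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], xs))

# Hypothetical coefficients and predictor values, for illustration only.
coeffs = [-4.0, 0.08, 1.5]  # b0, b1, b2
xs = [35.0, 0.6]            # x1, x2

z = linear_predictor(coeffs, xs)  # linear form (right-hand side of the log-odds equation)
p = logistic(z)                   # probability form (Equation 4-1)

print(f"linear predictor = {z:.3f}, P(Y=1) = {p:.3f}, log-odds = {log_odds(p):.3f}")
```

However large or small the linear predictor becomes, the predicted probability stays within the range 0 to 1.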
Alternatives to logistic regression include the probit model and discriminant analysis.
However, these techniques are generally more difficult to implement in analysis and rely on additional assumptions being met (Harrell, 2001). Although probit regression assumes a similar shape to the logistic function, it involves more cumbersome calculations and there is no natural interpretation of its regression parameters (Harrell, 2001). Discriminant analysis assumes that predictor variables are normally distributed and that jointly the predictors have a multivariate normal distribution. As a result the technique cannot be used with categorical predictors. Additionally, even when all assumptions are met, logistic regression is virtually as accurate as the discriminant model (Harrell and Lee, 1985).
Logistic regression has been widely used as a method to model landslide spatial probability (e.g.: Yesilnacar and Topal, 2005, Dai and Lee, 2003, Garcia-Rodriguez et al., 2008, von Ruette et al., 2011). A first challenge in applying logistic regression to modelling landslide probability is in establishing how the principles of probability and probability modelling can be applied to hillslope failure, and along with this how landslide inventory maps should be converted to a binary variable for analysis.
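As a rough illustration of how a binary response variable enters a maximum-likelihood fit, the following Python sketch estimates the coefficients of a one-predictor model by gradient ascent on the log-likelihood. The data are synthetic, the "true" coefficients are hypothetical, and gradient ascent is used here only because it is easy to read. Statistical packages typically use other optimisers, and this is not the fitting procedure used in the studies cited above.

```python
import math
import random

def logistic(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(b, x):
    """P(Y=1) for one observation x = [x1, ..., xn] and coefficients b = [b0, ..., bn]."""
    return logistic(b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)))

def log_likelihood(b, X, y):
    """Log-likelihood of binary responses y given predictors X and coefficients b."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict(b, xi), eps), 1.0 - eps)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def fit_logistic(X, y, lr=0.5, n_iter=1500):
    """Estimate [b0, b1, ..., bn] by gradient ascent on the mean log-likelihood."""
    n_obs, n_pred = len(X), len(X[0])
    b = [0.0] * (n_pred + 1)
    for _ in range(n_iter):
        grad = [0.0] * (n_pred + 1)
        for xi, yi in zip(X, y):
            resid = yi - predict(b, xi)  # observed outcome minus predicted probability
            grad[0] += resid
            for j, xj in enumerate(xi):
                grad[j + 1] += resid * xj
        b = [bj + lr * g / n_obs for bj, g in zip(b, grad)]
    return b

# Synthetic observations generated from hypothetical "true" coefficients.
rng = random.Random(42)
true_b = [-1.0, 2.0]
X = [[rng.uniform(-2.0, 2.0)] for _ in range(300)]
y = [1 if rng.random() < predict(true_b, xi) else 0 for xi in X]

b_hat = fit_logistic(X, y)
print("estimated coefficients:", [round(v, 2) for v in b_hat])
```

The estimates should lie close to, but not exactly at, the values used to generate the data, because each observation is a random draw.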
4.1.2 Landslide probability
Probability is the number of times an event occurs divided by the number of times an event could occur, within a given time-frame. When using inventories of landslides triggered by a particular event, the time-frame is the period over which landslides are triggered by that event. For co-seismic landslides this is generally assumed to be the duration of seismic shaking. However, where landslides have been mapped some time after an earthquake, this period also includes the additional time following the earthquake.
A common approach to analysing landslide spatial probability is to adopt the landslide area-density principle (e.g.: Meunier et al., 2008, Meunier et al., 2007), in which landslide probability is defined as the area covered by landslides divided by the total area of interest. However, the total area covered by landslides comprises not only the landslide source areas but also the areas of material run-out and deposition, which are the product of different physical mechanisms. In investigating hillslope failure, landslide probability is therefore better defined as the sum of landslide source areas ($A_{\mathrm{source}}$) divided by the total area of interest ($A_{\mathrm{total}}$).
Equation 4-2
$$P(A) = \frac{\sum A_{\mathrm{source}}}{\sum A_{\mathrm{total}}}$$
Where $P(A)$ is the landslide source area probability, which is also:
i. The proportion of the area of interest covered by new landslide source areas following a given earthquake (landslide area density)
ii. The probability that any location in the area of interest undergoes failure during (or shortly following) a given earthquake
For each of the landslide inventory datasets, landslide source areas were therefore separated from the full landslide areas prior to analysis, using the methodology described in Section 3.1.8.
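The area-based definition of probability in Equation 4-2 amounts to a simple ratio, sketched below in Python. The source areas, the total area and the small grid are hypothetical values used only for illustration, and the function names are assumptions. The second function shows the equivalent calculation on a grid of equal-sized cells, where the probability is the fraction of cells classified as landslide source.

```python
def source_area_probability(source_areas, total_area):
    """Sum of landslide source areas divided by the total area of interest."""
    if total_area <= 0:
        raise ValueError("total_area must be positive")
    return sum(source_areas) / total_area

def source_cell_fraction(grid):
    """Fraction of equal-sized cells classified as landslide source (1) rather than not (0)."""
    cells = [value for row in grid for value in row]
    return sum(cells) / len(cells)

# Hypothetical source areas (m^2) inside a hypothetical 2 km^2 area of interest.
source_areas_m2 = [1200.0, 450.0, 8300.0, 50.0]
total_area_m2 = 2_000_000.0
p_area = source_area_probability(source_areas_m2, total_area_m2)

# Hypothetical binary grid: 1 = landslide source cell, 0 = no failure.
grid = [
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
p_grid = source_cell_fraction(grid)

print(f"P(A) from polygon areas: {p_area:.4f}")
print(f"P(A) from grid cells:    {p_grid:.2f}")
```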
4.1.3 Matrix grid sampling of variables for analysis
In order to model the relationship between $P(A)$ and multiple independent variables, the landscape must be divided into discrete areas of equal size to provide individual observations for analysis. These areas must be classified to indicate the presence or absence of landslide source areas (i.e. whether or not hillslope failure has occurred) and associated with values of each of the predictor variables. This was achieved by dividing the landscape into the areas of individual pixels at the available DEM grid scale (e.g.: Dai and Lee, 2003, Lee et al., 2008a, Lee et al., 2008b). This approach captures the minimum scale of variability in the predictor variables, which is set by the variables derived from the DEM, while predictor variables with coarser resolutions can be sampled at the DEM resolution. For predictor variables derived directly from the DEM, such as local hillslope gradient or distance and directional variables calculated using Euclidean functions, no resampling is required prior to analysis. Predictor variables in coarser-resolution raster formats or in vector formats, such as polygons from geological mapping, incrementally contoured data such as isoseismals, and coarsely gridded precipitation data, were resampled to the DEM grid resolution. Here the majority resampling approach was used, in which DEM pixels are assigned the class representing the majority (≥ 50 %) of their plan area (ESRI, 2012). For the response variable, pixels were classified based on whether the majority of their area fell inside (Y=1) or outside (Y=0) of a landslide source zone.
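The majority rule described above can be sketched in Python as follows. This is not the ESRI implementation. It assumes, for illustration, that the area of each DEM pixel is approximated by a block of finer sub-cells, and the function names, class labels and grid values are all hypothetical.

```python
from collections import Counter

def majority_class(values):
    """Most common class among the sub-cells sampled inside one DEM pixel.

    In a tie, the class encountered first in `values` is returned.
    """
    return Counter(values).most_common(1)[0][0]

def block_majority_binary(fine_mask, factor):
    """Classify each DEM pixel as Y=1 where at least half of its sub-cells are landslide source.

    fine_mask is a 2D list of 0/1 values at `factor` times the DEM resolution;
    any rows or columns that do not fill a whole block are ignored.
    """
    n_rows = len(fine_mask) // factor
    n_cols = len(fine_mask[0]) // factor
    coarse = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            block = [
                fine_mask[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            fraction = sum(block) / len(block)  # share of the pixel area inside a source zone
            row.append(1 if fraction >= 0.5 else 0)
        coarse.append(row)
    return coarse

# Hypothetical landslide source mask at twice the DEM resolution.
fine_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
]
response = block_majority_binary(fine_mask, factor=2)

# Hypothetical geological classes sampled within a single DEM pixel.
pixel_geology = majority_class(["greywacke", "greywacke", "granite", "greywacke"])

print(response)
print(pixel_geology)
```

Each DEM pixel ends up with one value of the response variable and one class per categorical predictor, which together form a single observation for the regression.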