Survival Model - Analytic Framework

CHAPTER II RESEARCH FRAMEWORK

2.2 Analytic Framework

2.2.2 Survival Model

Researchers have developed a distinct class of statistical models, known as “survival analysis,” to analyze time-to-event data. Survival analysis is also called “duration analysis” among economists. Time-to-event data measure the length of time until a certain event of interest occurs. Survival analysis is used in a range of research areas, including health, economics, finance, and social science. Time to sale in the housing market is one example of time-to-event data.

The statistical model for predicting the duration of foreclosed property sales needs to consider both how long a foreclosed property remains in the process and when the property is sold. Time-to-event (or duration) data are characterized as censored, meaning that the occurrence of the event is observable only within the time window covered by the data (Wooldridge, 2010). Although duration is always positive, censoring means that the normality of the error terms is often violated and that predicted values could be negative (Greene, 2012). Therefore, a typical regression such as ordinary least squares (OLS) is not appropriate for duration data (Guo, 2010); the subjects in an OLS regression should be fully observed and non-censored (Greene, 2012). A logistic regression can be used for predicting the proportion of foreclosures that exit or the likelihood of a property remaining in foreclosure. However, it does not predict the duration itself, and it ignores the question of “how long” (Guo, 2010). Instead, survival analysis has been employed when researchers are interested in how long a real property stays on the real estate market, when the property is sold, and which covariates (e.g., property attributes) affect the time to event (Benefield & Hardin, 2013; Haurin, 1988).
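The effect of censoring described above can be illustrated with a minimal simulation sketch. All names and values here (an exponential duration with a mean of 12 months, a 12-month observation window, the sample size) are hypothetical assumptions chosen for illustration and are not taken from the data used in this study. The sketch compares an average that treats censored spells as complete with an estimate that accounts for censoring.

```python
import random

def simulate_censored(n, rate, window, seed=0):
    """Draw exponential durations and right-censor them at `window`."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t = rng.expovariate(rate)      # latent (true) duration
        observed = min(t, window)      # duration the analyst actually sees
        event = t <= window            # False means the spell is censored
        data.append((observed, event))
    return data

def naive_mean(data):
    """Average observed duration, treating censored spells as if complete."""
    return sum(t for t, _ in data) / len(data)

def exponential_mle_mean(data):
    """Censoring-aware exponential MLE of the mean: total exposure / events."""
    exposure = sum(t for t, _ in data)
    events = sum(1 for _, e in data if e)
    return exposure / events

# Hypothetical setup: true mean duration of 12 months, 12-month window
data = simulate_censored(n=20000, rate=1 / 12, window=12.0)
print("true mean duration: 12.0")
print("naive mean (ignores censoring):", round(naive_mean(data), 2))
print("censoring-aware estimate:", round(exponential_mle_mean(data), 2))
```

Under these assumptions the naive average understates the true mean duration because every censored spell is recorded at the window length, whereas the censoring-aware estimate stays close to the true value.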

A survival model is built primarily on the hazard function, which is the instantaneous rate at which the event occurs within a short time interval, given that the event has not yet happened at time t (Cleves, Gould, Gutierrez, & Marchenko, 2008; Guo, 2010; Wooldridge, 2010). The hazard function is written as (Wooldridge, 2010):

$$h(t) = \lim_{s \to 0} \frac{\Pr(t \le T < t + s \mid T \ge t)}{s},$$

where T is the length of time until the event occurs, t is a particular time, and s is the time interval.

T has the probability density function $f(t)$, and the cumulative distribution function is $F(t) = \Pr(T \le t) = \int_0^t f(u)\,du$, where t is a particular value of T. The survivor function is $S(t) = \Pr(T > t) = 1 - F(t)$, which is the probability of surviving (no occurrence of the event) until time t. The probability of the event occurring in the time interval can be expressed as (Wooldridge, 2010):

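$$\Pr(t \le T < t + s \mid T \ge t) = \frac{\Pr(t \le T < t + s)}{\Pr(T \ge t)} = \frac{F(t + s) - F(t)}{1 - F(t)}.$$

Dividing by s and taking the limit as s approaches zero gives the hazard function:

$$h(t) = \lim_{s \to 0} \frac{F(t + s) - F(t)}{s} \cdot \frac{1}{1 - F(t)} = \frac{f(t)}{1 - F(t)} = \frac{f(t)}{S(t)}.$$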
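As a numerical check on this result, the following sketch compares the limit definition of the hazard, evaluated with a small interval s, against the ratio of the density to the survivor function. The Weibull distribution and its parameter values are assumptions chosen only for illustration; any continuous duration distribution could be substituted.

```python
import math

SHAPE, SCALE = 1.5, 10.0   # hypothetical Weibull parameters

def cdf(t):
    """F(t) = Pr(T <= t) for the assumed Weibull distribution."""
    return 1.0 - math.exp(-((t / SCALE) ** SHAPE))

def pdf(t):
    """f(t), the derivative of F(t)."""
    z = (t / SCALE) ** SHAPE
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1) * math.exp(-z)

def hazard_from_definition(t, s=1e-6):
    """Pr(t <= T < t + s | T >= t) / s, evaluated for a small interval s."""
    return (cdf(t + s) - cdf(t)) / (1.0 - cdf(t)) / s

def hazard_from_ratio(t):
    """f(t) / S(t), with S(t) = 1 - F(t)."""
    return pdf(t) / (1.0 - cdf(t))

for t in (1.0, 5.0, 20.0):
    print(t, hazard_from_definition(t), hazard_from_ratio(t))
```

The two columns should agree to several decimal places, with the small remaining gap reflecting the finite interval s.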

Since the associations between duration and explanatory variables are of primary interest in this dissertation, the hazard function is considered conditional on a set of explanatory variables, 𝐱. The hazard function is written as (Guo, 2010):

$$h(t, \mathbf{x}) = \frac{f(t \mid \mathbf{x})}{1 - F(t \mid \mathbf{x})} = \frac{f(t \mid \mathbf{x})}{S(t \mid \mathbf{x})}.$$

The hazard can therefore differ according to the values of the covariates. One popular way of specifying the hazard function is the proportional hazards model, which is written as (Wooldridge, 2010):

$$h(t, \mathbf{x}) = h_0(t) \cdot k(\mathbf{x}),$$

where 𝑘(𝐱) is a function of the explanatory variables, 𝐱, and ℎ₀(𝑡) is the baseline hazard function in the absence of explanatory variables. The baseline hazard function can be specified according to the distribution of the survival time, T. Commonly used parametric specifications are based on the exponential, the Weibull, and the log-logistic hazard functions (Cleves et al., 2008). In the case of the exponential distribution, the hazard function is constant; if the Weibull or log-logistic distribution is chosen, the hazard function increases or decreases nonlinearly according to the values of the distribution's parameters (Cleves et al., 2008; Greene, 2012).
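The shapes of these three hazard functions can be compared with a short sketch. The parameterization used here (a rate parameter `lam` and a shape parameter `p`) is one common textbook form and is an assumption of this example, as are the parameter values; other references parameterize the same distributions differently.

```python
def exponential_hazard(t, lam):
    """Constant hazard: the event rate does not depend on elapsed time."""
    return lam

def weibull_hazard(t, lam, p):
    """Monotone hazard: increasing if p > 1, decreasing if p < 1."""
    return lam * p * (lam * t) ** (p - 1)

def loglogistic_hazard(t, lam, p):
    """For p > 1 the hazard first rises and then falls."""
    return lam * p * (lam * t) ** (p - 1) / (1.0 + (lam * t) ** p)

LAM = 0.1  # hypothetical rate parameter (per unit of time)

print("    t    exp  weib(p=1.5)  weib(p=0.5)  loglog(p=2)")
for t in (1.0, 5.0, 10.0, 20.0, 40.0):
    print(
        f"{t:5.1f} "
        f"{exponential_hazard(t, LAM):6.3f} "
        f"{weibull_hazard(t, LAM, 1.5):12.4f} "
        f"{weibull_hazard(t, LAM, 0.5):12.4f} "
        f"{loglogistic_hazard(t, LAM, 2.0):12.4f}"
    )
```

With these illustrative values, the exponential column stays flat, the two Weibull columns move monotonically in opposite directions, and the log-logistic column rises to a peak before declining.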

The Cox proportional hazards model is a popular type of regression in survival analysis. It does not require any assumptions or information regarding the shape of the baseline hazard being studied. The Cox model for the hazard at time t is specified as follows (Cox, 1972):

$$h(t, \mathbf{x}) = h_0(t) \cdot \exp(\mathbf{X}\boldsymbol{\beta}),$$

where 𝑘(𝐱) is defined as exp(𝐗𝛃), and 𝛃 is the vector of coefficients to be estimated. The baseline hazard, ℎ₀(𝑡), cancels out of the partial likelihood used for estimation. This is an advantage: there is no need to assume a shape for the hazard function, and estimation is computationally feasible (Hosmer, Lemeshow, & May, 2008). Given the proportional form, the Cox hazard model can estimate the coefficients of the explanatory variables without knowing the baseline hazard. It is a useful method for estimating the hazard ratio of interest after adjusting for other covariates (Guo, 2010). Therefore, the Cox proportional hazards model is well suited to analyzing which built-environment characteristics of a neighborhood influence the likelihood of an REO property being sold.
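The claim that the coefficients can be estimated without knowing the baseline hazard can be illustrated with a minimal sketch of the Cox partial likelihood for a single covariate. This is not the estimation code used in this dissertation: the simulated data, the function names, and the parameter values are all hypothetical, ties among event times are assumed away, and a statistical package would normally be used in practice. The sketch fits the coefficient by Newton-Raphson and then refits after a monotone transformation of time, which changes the baseline hazard but leaves the ordering of spells intact.

```python
import math
import random

def simulate(n, beta, base_rate, window, seed=1):
    """Hypothetical data: exponential durations with hazard base_rate * exp(beta * x)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1.0 if rng.random() < 0.5 else 0.0          # one binary covariate
        t = rng.expovariate(base_rate * math.exp(beta * x))
        rows.append((min(t, window), t <= window, x))   # (time, event, x)
    return rows

def score_and_information(beta, rows):
    """First derivative and negative second derivative of the partial log-likelihood."""
    weights = [math.exp(beta * x) for _, _, x in rows]
    score, info = 0.0, 0.0
    for t_i, event_i, x_i in rows:
        if not event_i:
            continue                                    # censored spells only enter risk sets
        s0 = s1 = s2 = 0.0
        for (t_j, _, x_j), w_j in zip(rows, weights):
            if t_j >= t_i:                              # subject j is still at risk at t_i
                s0 += w_j
                s1 += w_j * x_j
                s2 += w_j * x_j * x_j
        mean_x = s1 / s0                                # risk-set weighted mean of x
        score += x_i - mean_x
        info += s2 / s0 - mean_x ** 2
    return score, info

def fit_cox(rows, iterations=25, tol=1e-10):
    """Newton-Raphson on the partial likelihood; the baseline hazard never appears."""
    beta = 0.0
    for _ in range(iterations):
        score, info = score_and_information(beta, rows)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta

rows = simulate(n=400, beta=0.7, base_rate=0.1, window=15.0)
beta_hat = fit_cox(rows)

# Squaring time alters the baseline hazard but preserves the ordering of spells
squared = [(t ** 2, event, x) for t, event, x in rows]
beta_hat_squared = fit_cox(squared)

print("estimated coefficient:", round(beta_hat, 3))
print("estimated hazard ratio:", round(math.exp(beta_hat), 3))
print("coefficient after transforming time:", round(beta_hat_squared, 3))
```

Because the partial likelihood depends only on which subjects are at risk at each event time, the two fitted coefficients coincide even though the two data sets imply different baseline hazards.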

The assumption of the Cox proportional hazards model is that an explanatory variable has the same effect at all points in time. This proportional-hazards assumption can be checked by plotting hazard curves and/or testing the correlation between time and “Schoenfeld residuals” (Cleves et al., 2008). However, the assumption is likely to be violated for some variables in many applications. In such cases, the coefficient can be interpreted as an “average effect” of the variable over the time period (Allison, 2010). In some applications, the violation could be critical and should be taken into consideration (Hosmer et al., 2008).
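The proportional-hazards assumption can be made explicit with a short worked equation. The two covariate vectors, written here as 𝐱₁ and 𝐱₀, are notation introduced only for this illustration. Under the Cox specification, the ratio of their hazards is:

```latex
\frac{h(t, \mathbf{x}_1)}{h(t, \mathbf{x}_0)}
  = \frac{h_0(t) \cdot \exp(\mathbf{x}_1 \boldsymbol{\beta})}
         {h_0(t) \cdot \exp(\mathbf{x}_0 \boldsymbol{\beta})}
  = \exp\bigl((\mathbf{x}_1 - \mathbf{x}_0)\boldsymbol{\beta}\bigr)
```

The baseline hazard cancels, so the ratio does not depend on t. A hazard ratio that drifts over time for some variable is therefore a sign that the assumption does not hold for that variable.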

In document From Crisis to Recovery: A Study of Walkability Impacts on Foreclosure in Los Angeles, California (Page 33-37)