Regression Models for Time Series Analysis

Benjamin Kedem¹ and Konstantinos Fokianos²

¹ University of Maryland, College Park, MD
² University of Cyprus, Nicosia, Cyprus

Cox (1975). Partial likelihood. Biometrika, 62, 69–76.

Fahrmeir and Tutz (2001). Multivariate Statistical Modelling Based on GLM. 2nd ed., Springer, NY.

Fokianos (1996). Categorical Time Series: Prediction and Control. Ph.D. Thesis, University of Maryland.

Fokianos and Kedem (1998). Prediction and classification of non-stationary categorical time series. J. Multivariate Analysis, 67, 277–296.

Kedem (1980). Binary Time Series. Marcel Dekker, NY.

McCullagh and Nelder (1989). Generalized Linear Models. 2nd ed., Chapman & Hall, London.

Nelder and Wedderburn (1972). Generalized linear models. JRSS, A, 135, 370–384.

Slud (1982). Consistency and efficiency of inference with the partial likelihood. Biometrika, 69, 547–552.

Slud and Kedem (1994). Partial likelihood analysis of logistic regression and autoregression. Statistica Sinica, 4, 89–106.

Wong (1986). Theory of partial likelihood.

Part I: GLM and Time Series

Overview

Extension of the Nelder and Wedderburn (1972), McCullagh and Nelder (1989) GLM to time series is possible due to:

• Increasing sequence of histories relative to an observer.

• Partial likelihood.

• The partial score is a martingale.

Partial Likelihood

Suppose we observe a pair of jointly distributed time series, $(X_t, Y_t)$, $t = 1, \ldots, N$, where $\{Y_t\}$ is a response series and $\{X_t\}$ is a time-dependent random covariate. Employing the rules of conditional probability, the joint density of all the $X, Y$ observations can be expressed as

$$f_\theta(x_1, y_1, \ldots, x_N, y_N) = f_\theta(x_1) \left[\prod_{t=2}^{N} f_\theta(x_t \mid d_t)\right] \left[\prod_{t=1}^{N} f_\theta(y_t \mid c_t)\right], \tag{1}$$

where $d_t = (y_1, x_1, \ldots, y_{t-1}, x_{t-1})$ and $c_t = (y_1, x_1, \ldots, y_{t-1}, x_{t-1}, x_t)$.

The second product on the right-hand side of (1) constitutes a partial likelihood according to Cox (1975).

An increasing sequence of σ-fields: $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$.

$Y_1, Y_2, \ldots$ is a sequence of random variables on some common probability space such that $Y_t$ is $\mathcal{F}_t$-measurable.

$$Y_t \mid \mathcal{F}_{t-1} \sim f_t(y_t; \theta).$$

$\theta \in \mathbb{R}^p$ is a fixed parameter.

The partial likelihood (PL) function relative to $\theta$, $\mathcal{F}_t$, and the data $Y_1, Y_2, \ldots, Y_N$ is given by the product

$$PL(\theta; y_1, \ldots, y_N) = \prod_{t=1}^{N} f_t(y_t; \theta). \tag{2}$$

The General Regression Problem

$\{Y_t\}$ is a response time series with the corresponding $p$-dimensional covariate process

$$Z_{t-1} = (Z_{(t-1)1}, \ldots, Z_{(t-1)p})'.$$

Define

$$\mathcal{F}_{t-1} = \sigma\{Y_{t-1}, Y_{t-2}, \ldots, Z_{t-1}, Z_{t-2}, \ldots\}.$$

Note: $Z_{t-1}$ may already include past values of $Y_t$.

The conditional expectation of the response given the past:

$$\mu_t = E[Y_t \mid \mathcal{F}_{t-1}].$$

(•) The problem is to relate $\mu_t$ to the covariates.

Time Series Following GLM

1. Random Component. The conditional distribution of the response given the past belongs to the exponential family of distributions in natural or canonical form,

$$f(y_t; \theta_t, \phi \mid \mathcal{F}_{t-1}) = \exp\left\{ \frac{y_t\theta_t - b(\theta_t)}{\alpha_t(\phi)} + c(y_t; \phi) \right\}. \tag{3}$$

Here $\alpha_t(\phi) = \phi/\omega_t$, with dispersion $\phi$ and prior weight $\omega_t$.

2. Systematic Component. There is a monotone function $g(\cdot)$ such that

$$g(\mu_t) = \eta_t = \sum_{j=1}^{p} \beta_j Z_{(t-1)j} = Z'_{t-1}\beta. \tag{4}$$

$g(\cdot)$: the link function

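Typical choices for $\eta_t = Z'_{t-1}\beta$ could be

$$\beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \beta_3 X_t \cos(\omega_0 t)$$

or

$$\beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \beta_3 Y_{t-1} X_t + \beta_4 X_{t-1}$$

or

$$\beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2}^{7} + \beta_3 Y_{t-1} \log(X_{t-12})$$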

GLM Equations:

$$\int f(y; \theta_t, \phi \mid \mathcal{F}_{t-1})\, dy = 1.$$

This implies the relationships:

$$\mu_t = E[Y_t \mid \mathcal{F}_{t-1}] = b'(\theta_t). \tag{5}$$

$$\mathrm{Var}[Y_t \mid \mathcal{F}_{t-1}] = \alpha_t(\phi)\, b''(\theta_t) \equiv \alpha_t(\phi)\, V(\mu_t). \tag{6}$$

Since $\mathrm{Var}[Y_t \mid \mathcal{F}_{t-1}] > 0$, it follows that $b'$ is monotone. Therefore, equation (5) implies that

$$\theta_t = (b')^{-1}(\mu_t). \tag{7}$$

We see that $\theta_t$ itself is a monotone function of $\mu_t$, and hence it can be used to define a link function. The link function

$$g(\mu_t) = \theta_t(\mu_t) = \eta_t = Z'_{t-1}\beta \tag{8}$$

is called the canonical link.
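Relations (5) and (6) follow by differentiating the normalization identity with respect to $\theta_t$. A short derivation sketch, assuming that differentiation and integration may be interchanged (a regularity condition not stated on the slide) and writing $f$ for $f(y; \theta_t, \phi \mid \mathcal{F}_{t-1})$:

```latex
\begin{align*}
0 &= \frac{\partial}{\partial\theta_t}\int f\,dy
   = \int \frac{y - b'(\theta_t)}{\alpha_t(\phi)}\, f\,dy
   \;\Longrightarrow\; E[Y_t \mid \mathcal{F}_{t-1}] = b'(\theta_t), \\
0 &= \frac{\partial^2}{\partial\theta_t^2}\int f\,dy
   = \int \left[ \frac{(y - b'(\theta_t))^2}{\alpha_t^2(\phi)}
        - \frac{b''(\theta_t)}{\alpha_t(\phi)} \right] f\,dy
   \;\Longrightarrow\; \mathrm{Var}[Y_t \mid \mathcal{F}_{t-1}] = \alpha_t(\phi)\, b''(\theta_t).
\end{align*}
```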

Example: Poisson Time Series.

$$f(y_t; \theta_t, \phi \mid \mathcal{F}_{t-1}) = \exp\{(y_t \log\mu_t - \mu_t) - \log y_t!\}.$$

$E[Y_t \mid \mathcal{F}_{t-1}] = \mu_t$, $b(\theta_t) = \mu_t = \exp(\theta_t)$, $V(\mu_t) = \mu_t$, $\phi = 1$, and $\omega_t = 1$. The canonical link is

$$g(\mu_t) = \theta_t(\mu_t) = \log\mu_t = \eta_t = Z'_{t-1}\beta.$$

As an example, if $Z_{t-1} = (1, X_t, Y_{t-1})'$, then

$$\log\mu_t = \beta_0 + \beta_1 X_t + \beta_2 Y_{t-1},$$

with $\{X_t\}$ standing for some covariate process, or a possible trend, or a possible seasonal component.
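To make the recursion concrete, here is a minimal simulation sketch of the example model $\log\mu_t = \beta_0 + \beta_1 X_t + \beta_2 Y_{t-1}$. The function name `simulate_poisson_ts`, the sinusoidal choice of $X_t$, the starting value $Y_0 = 0$, and the coefficient values are illustrative assumptions, not taken from the slides; a negative $\beta_2$ is used so that the simulated means stay bounded.

```python
import numpy as np

def simulate_poisson_ts(beta, x, rng, y0=0):
    """Simulate Y_t | F_{t-1} ~ Poisson(mu_t) with log mu_t = b0 + b1*x_t + b2*Y_{t-1}."""
    b0, b1, b2 = beta
    n = len(x)
    y = np.empty(n, dtype=int)
    mu = np.empty(n)
    prev = y0                                  # assumed initial value Y_0
    for t in range(n):
        mu[t] = np.exp(b0 + b1 * x[t] + b2 * prev)
        y[t] = rng.poisson(mu[t])              # draw given the past
        prev = y[t]
    return y, mu

rng = np.random.default_rng(0)
t = np.arange(1, 301)
x = np.cos(2 * np.pi * t / 52)                 # hypothetical seasonal covariate
y, mu = simulate_poisson_ts((1.0, 0.5, -0.05), x, rng)
```

Each $\mu_t$ depends on the realized $Y_{t-1}$, which is what makes the covariate vector $Z_{t-1}$ random and motivates conditioning on $\mathcal{F}_{t-1}$.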

Example: Binary Time Series

$\{Y_t\}$ takes the values 0, 1. Let $\pi_t = P(Y_t = 1 \mid \mathcal{F}_{t-1})$. Then

$$f(y_t; \theta_t, \phi \mid \mathcal{F}_{t-1}) = \exp\left\{ y_t \log\left(\frac{\pi_t}{1 - \pi_t}\right) + \log(1 - \pi_t) \right\}.$$

The canonical link gives the logistic regression model

$$g(\pi_t) = \theta_t(\pi_t) = \log\frac{\pi_t}{1 - \pi_t} = \eta_t = Z'_{t-1}\beta. \tag{9}$$

Note:

$$\pi_t = F_l(\eta_t). \tag{10}$$

Partial Likelihood Inference

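Given a time series $\{Y_t\}$, $t = 1, \ldots, N$, conditionally distributed as (3).

The partial likelihood of the observed series is

$$PL(\beta) = \prod_{t=1}^{N} f(y_t; \theta_t, \phi \mid \mathcal{F}_{t-1}). \tag{11}$$

Then from (3), the log-partial likelihood, $l(\beta)$, is given by

$$\begin{aligned}
l(\beta) &= \sum_{t=1}^{N} \log f(y_t; \theta_t, \phi \mid \mathcal{F}_{t-1}) \equiv \sum_{t=1}^{N} l_t \\
&= \sum_{t=1}^{N} \left\{ \frac{y_t\theta_t - b(\theta_t)}{\alpha_t(\phi)} + c(y_t, \phi) \right\} \\
&= \sum_{t=1}^{N} \left\{ \frac{y_t\, u(z'_{t-1}\beta) - b(u(z'_{t-1}\beta))}{\alpha_t(\phi)} + c(y_t, \phi) \right\},
\end{aligned} \tag{12}$$

where $\theta_t = u(z'_{t-1}\beta)$.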

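$$\nabla \equiv \left( \frac{\partial}{\partial\beta_1}, \frac{\partial}{\partial\beta_2}, \cdots, \frac{\partial}{\partial\beta_p} \right)'.$$

The partial score is a $p$-dimensional vector,

$$S_N(\beta) \equiv \nabla l(\beta) = \sum_{t=1}^{N} Z_{t-1} \frac{\partial\mu_t}{\partial\eta_t} \frac{(Y_t - \mu_t(\beta))}{\sigma_t^2(\beta)}, \tag{13}$$

with $\sigma_t^2(\beta) = \mathrm{Var}[Y_t \mid \mathcal{F}_{t-1}]$.

The partial score vector process $\{S_t(\beta)\}$, $t = 1, \ldots, N$, is defined from the partial sums

$$S_t(\beta) = \sum_{s=1}^{t} Z_{s-1} \frac{\partial\mu_s}{\partial\eta_s} \frac{(Y_s - \mu_s(\beta))}{\sigma_s^2(\beta)}. \tag{14}$$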

The solution of the score equation,

$$S_N(\beta) = \nabla \log PL(\beta) = 0, \tag{15}$$

is denoted by $\hat\beta$, and is referred to as the maximum partial likelihood estimator (MPLE) of $\beta$. The system of equations (15) is nonlinear and is customarily solved by the Fisher scoring method, an iterative algorithm resembling the Newton–Raphson procedure. Before turning to the Fisher scoring algorithm in our context of conditional inference, it is necessary to introduce several important matrices.

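An important role in partial likelihood inference is played by the cumulative conditional information matrix, $G_N(\beta)$, defined by a sum of conditional covariance matrices,

$$\begin{aligned}
G_N(\beta) &= \sum_{t=1}^{N} \mathrm{Cov}\left[ Z_{t-1} \frac{\partial\mu_t}{\partial\eta_t} \frac{(Y_t - \mu_t(\beta))}{\sigma_t^2(\beta)} \,\Big|\, \mathcal{F}_{t-1} \right] \\
&= \sum_{t=1}^{N} Z_{t-1} \left( \frac{\partial\mu_t}{\partial\eta_t} \right)^2 \frac{1}{\sigma_t^2(\beta)} Z'_{t-1} = Z'W(\beta)Z.
\end{aligned}$$

We also need:

$$H_N(\beta) \equiv -\nabla\nabla' l(\beta).$$

Define $R_N(\beta)$ from the difference

$$H_N(\beta) = G_N(\beta) - R_N(\beta).$$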

Fisher Scoring: In Newton-Raphson, replace $H_N(\beta)$ by its conditional expectation:

$$\hat\beta^{(k+1)} = \hat\beta^{(k)} + G_N^{-1}(\hat\beta^{(k)})\, S_N(\hat\beta^{(k)}).$$

Fisher scoring becomes Newton-Raphson for canonical links.

Fisher scoring simplifies to Iterative Reweighted Least Squares:

$$\hat\beta^{(k+1)} = \left( Z'W(\hat\beta^{(k)})Z \right)^{-1} Z'W(\hat\beta^{(k)})\, q^{(k)}.$$
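The iteration can be sketched for the Poisson model with canonical log link, where $\partial\mu_t/\partial\eta_t = \mu_t$ and $\sigma_t^2 = \mu_t$, so that $W = \mathrm{diag}(\mu_t)$. The slide does not define $q^{(k)}$; the sketch assumes the usual working response $q_t = \eta_t + (y_t - \mu_t)/\mu_t$, for which the weighted least squares step equals the Fisher scoring step. The function name `irls_poisson` and the initialization (which assumes the first column of `Z` is the constant 1) are illustrative choices.

```python
import numpy as np

def irls_poisson(Z, y, n_iter=50, tol=1e-10):
    """IRLS for a Poisson model with log link; rows of Z hold Z_{t-1}'."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(Z.shape[1])
    beta[0] = np.log(y.mean())            # assumes column 0 is an intercept and y.mean() > 0
    for _ in range(n_iter):
        eta = Z @ beta
        mu = np.exp(eta)
        q = eta + (y - mu) / mu           # assumed working response q^(k)
        ZtW = Z.T * mu                    # Z' W with W = diag(mu_t)
        beta_new = np.linalg.solve(ZtW @ Z, ZtW @ q)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    mu = np.exp(Z @ beta)
    G = (Z.T * mu) @ Z                    # G_N(beta_hat) = Z' W Z
    return beta, G
```

For a time series fit, each row of `Z` would contain the covariates known at time $t-1$, which may include lagged responses; the inverse of the returned `G` then serves as an estimate of the covariance of $\hat\beta$.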

Asymptotic Theory

Assumption A

A1. The true parameter $\beta$ belongs to an open set $B \subseteq \mathbb{R}^p$.

A2. The covariate vector $Z_{t-1}$ almost surely lies in a nonrandom compact subset $\Gamma$ of $\mathbb{R}^p$, such that $P\left[\sum_{t=1}^{N} Z_{t-1}Z'_{t-1} > 0\right] = 1$. In addition, $Z'_{t-1}\beta$ lies almost surely in the domain $H$ of the inverse link function $h = g^{-1}$ for all $Z_{t-1} \in \Gamma$ and $\beta \in B$.

A3. The inverse link function $h$, defined in (A2), is twice continuously differentiable and

A4. There is a probability measure $\nu$ on $\mathbb{R}^p$ such that $\int_{\mathbb{R}^p} zz'\,\nu(dz)$ is positive definite, and such that under (3) and (4), for Borel sets $A \subset \mathbb{R}^p$,

$$\frac{1}{N} \sum_{t=1}^{N} I_{[Z_{t-1} \in A]} \to \nu(A)$$

in probability as $N \to \infty$, at the true value of $\beta$.

A4 calls for asymptotically "well behaved" covariates:

$$\frac{\sum_{t=1}^{N} f(Z_{t-1})}{N} \to \int_{\mathbb{R}^p} f(z)\,\nu(dz)$$

in probability as $N \to \infty$. Thus, there exists a $p \times p$ limiting information matrix per observation, $G(\beta)$, such that

$$\frac{G_N(\beta)}{N} \to G(\beta). \tag{16}$$

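Slud and Kedem (1994), Fokianos and Kedem (1998):

1. $\{S_t(\beta)\}$ relative to $\{\mathcal{F}_t\}$, $t = 1, \ldots, N$, is a martingale.

2. $\dfrac{R_N(\beta)}{N} \to 0$.

3. $\dfrac{S_N(\beta)}{\sqrt{N}} \to N_p(0, G(\beta))$.

4. $\sqrt{N}(\hat\beta - \beta) \to N_p(0, G^{-1}(\beta))$.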
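The martingale property in item 1 can be checked directly from the partial sums (14). A brief sketch, assuming that $Z_{t-1}$, $\partial\mu_t/\partial\eta_t$ and $\sigma_t^2(\beta)$ are $\mathcal{F}_{t-1}$-measurable, that the increments are integrable, and that $\beta$ is the true parameter so that $\mu_t(\beta) = E[Y_t \mid \mathcal{F}_{t-1}]$:

```latex
\begin{align*}
E\left[ S_t(\beta) - S_{t-1}(\beta) \mid \mathcal{F}_{t-1} \right]
 &= Z_{t-1}\, \frac{\partial\mu_t}{\partial\eta_t}\,
    \frac{E[Y_t \mid \mathcal{F}_{t-1}] - \mu_t(\beta)}{\sigma_t^2(\beta)} = 0,
 \qquad\text{hence}\qquad
 E\left[ S_t(\beta) \mid \mathcal{F}_{t-1} \right] = S_{t-1}(\beta).
\end{align*}
```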

$100(1 - \alpha)\%$ prediction interval ($h = g^{-1}$):

$$\mu_t(\beta) \doteq \mu_t(\hat\beta) \pm z_{\alpha/2} \frac{|h'(Z'_{t-1}\beta)|}{\sqrt{N}} \sqrt{Z'_{t-1} G^{-1}(\beta) Z_{t-1}}$$

Hypothesis Testing

Let $\tilde\beta$ be the MPLE of $\beta$ obtained under $H_0$ with $r < p$ restrictions, $H_0: \beta_1 = \cdots = \beta_r = 0$. Let $\hat\beta$ be the unrestricted MPLE. The log-partial likelihood ratio statistic

$$\lambda_N = 2\left\{ l(\hat\beta) - l(\tilde\beta) \right\} \tag{17}$$

More generally:

Assume $C$ is a known matrix with full rank $r$, $r < p$.

Under the general linear hypothesis,

$$H_0: C\beta = \beta_0 \quad \text{against} \quad H_1: C\beta \neq \beta_0, \tag{18}$$

Diagnostics

$l(y; y)$: maximum log partial likelihood corresponding to the saturated model.

$l(\hat\mu; y)$: maximum log partial likelihood from the reduced model.

• Scaled Deviance:

$$D \equiv 2\{ l(y; y) - l(\hat\mu; y) \} \sim \chi^2_{N-p}$$

• $AIC(p) = -2 \log PL(\hat\beta) + 2p$,

Analysis of Mortality Count Data in LA

Weekly data from Los Angeles County during a period of 10 years from January 1, 1970, to December 31, 1979: weekly sampled filtered time series. N = 508.

Response
  Y     Total mortality (filtered)

Weather
  T     Temperature
  RH    Relative humidity

Pollution
  CO    Carbon monoxide
  SO₂   Sulfur dioxide
  NO₂   Nitrogen dioxide
  HC    Hydrocarbons
  OZ    Ozone
  KM    Particulates

[Figure: Mortality, Temperature, log(CO). Time series plots of Mortality, Temperature, and log(CO), each with its sample ACF against Lag (Series: Mort, Temp, log(CO)).]

Weekly data of filtered total mortality and temperature, and log-filtered CO, and the corresponding estimated autocorrelation functions.

Covariates and $\eta_t$ used in Poisson regression. S = SO₂, N = NO₂. To recover $\eta_t$, insert the $\beta$'s. For Model 2, $\eta_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2}$, etc.

Model 0   $T_t + RH_t + CO_t + S_t + N_t + HC_t + OZ_t + KM_t$
Model 1   $Y_{t-1}$
Model 2   $Y_{t-1} + Y_{t-2}$
Model 3   $Y_{t-1} + Y_{t-2} + T_{t-1}$
Model 4   $Y_{t-1} + Y_{t-2} + T_{t-1} + \log(CO_t)$
Model 5   $Y_{t-1} + Y_{t-2} + T_{t-1} + T_{t-2} + \log(CO_t)$
Model 6   $Y_{t-1} + Y_{t-2} + T_t + T_{t-1} + \log(CO_t)$

Comparison of 7 Poisson regression models. N = 508.

Model   p   D        df    AIC      BIC
0       9   315.69   499   333.69   371.76
1       2   276.07   506   280.07   288.53
2       3   222.23   505   228.23   240.92
3       4   203.52   504   211.52   228.44
4       5   174.55   503   184.55   205.71
5       6   174.53   502   186.53   211.91
6       6   171.41   502   183.41   208.79

Choose Model 4:

$$\log(\hat\mu_t) = \hat\beta_0 + \hat\beta_1 Y_{t-1} + \hat\beta_2 Y_{t-2} + \hat\beta_3 T_{t-1} + \hat\beta_4 \log(CO_t)$$
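The df, AIC and BIC columns of the comparison table can be recomputed from D and p. The sketch below assumes the deviance-based forms AIC = D + 2p and BIC = D + p log N with N = 508 and df = N − p; these forms are inferred from the tabulated numbers rather than stated on the slides, and the variable names are illustrative.

```python
import math

N = 508
# (model, p, D) copied from the comparison table
models = [(0, 9, 315.69), (1, 2, 276.07), (2, 3, 222.23), (3, 4, 203.52),
          (4, 5, 174.55), (5, 6, 174.53), (6, 6, 171.41)]

rows = []
for model, p, D in models:
    df = N - p                        # residual degrees of freedom
    aic = D + 2 * p                   # assumed deviance-based AIC
    bic = D + p * math.log(N)         # assumed deviance-based BIC
    rows.append((model, p, D, df, round(aic, 2), round(bic, 2)))

for row in rows:
    print(*row)

best_aic = min(rows, key=lambda r: r[4])[0]
best_bic = min(rows, key=lambda r: r[5])[0]
```

Under these assumed forms the recomputed values agree with the table up to rounding of D. BIC is smallest for Model 4, while AIC is slightly smaller for Model 6 (183.41 against 184.55), which has one more parameter.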

[Figure: Poisson Regression of Filtered LA Mortality, Mrt(t) on Mrt(t-1), Mrt(t-2), Tmp(t-1), log(Crb(t)). Solid line: data; dotted line: fit.]

Observed and predicted weekly filtered total mortality from Model 4.

[Figure: Comparison of Residuals. Panels: Model 0 Residuals, Model 4 Residuals, and the sample ACFs against Lag (Series: Resid0, Resid4).]

Working residuals from Models 0 and 4, and their respective estimated autocorrelation.

Part II: Binary Time Series

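Example of a binary time series (bts): two categories obtained by clipping.

$$Y_t \equiv I_{[X_t \in C]} = \begin{cases} 1, & \text{if } X_t \in C \\ 0, & \text{if } X_t \notin C \end{cases} \tag{19}$$

$$Y_t \equiv I_{[X_t \geq r]} = \begin{cases} 1, & \text{if } X_t \geq r \\ 0, & \text{if } X_t < r \end{cases} \tag{20}$$

Other examples: at time t = 1, 2, ...,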

$\{Y_t\}$ taking the values 0 or 1, $t = 1, 2, 3, \ldots$.

$\{Z_{t-1}\}$: $p$-dimensional covariate stochastic data.

Against the backdrop of the general framework presented above, we wish to relate

$$\mu_t(\beta) = \pi_t(\beta) = P_\beta(Y_t = 1 \mid \mathcal{F}_{t-1}) \tag{21}$$

to the covariates. For this we need good links!

Standard logistic distribution,

$$F_l(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}, \qquad -\infty < x < \infty.$$

Then,

$$F_l^{-1}(x) = \log(x/(1 - x)).$$

• Fact: For any bts, there are $\theta_j$ such that

$$\log\left\{ \frac{P(Y_t = 1 \mid Y_{t-1} = y_{t-1}, \ldots, Y_1 = y_1)}{P(Y_t = 0 \mid Y_{t-1} = y_{t-1}, \ldots, Y_1 = y_1)} \right\} = \theta_0 + \theta_1 y_{t-1} + \cdots + \theta_p y_{t-p}$$

or

$$\pi_t(\beta) = \frac{1}{1 + \exp[-(\theta_0 + \theta_1 y_{t-1} + \cdots + \theta_p y_{t-p})]}$$

• Fact: Consider an AR(p) time series

$$X_t = \gamma_0 + \gamma_1 X_{t-1} + \ldots + \gamma_p X_{t-p} + \lambda\epsilon_t$$

where $\epsilon_t$ are i.i.d. logistically distributed. Define $Y_t = I_{[X_t \geq r]}$. Then

π_t(β) =

1
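A derivation sketch of the success probability for the clipped AR(p) series, under the assumptions that $\epsilon_t$ has the standard logistic cdf $F_l$ and is independent of $\mathcal{F}_{t-1}$, that $\lambda > 0$, and that $\mathcal{F}_{t-1}$ contains $X_{t-1}, \ldots, X_{t-p}$. It uses the symmetry $1 - F_l(x) = F_l(-x)$:

```latex
\begin{align*}
P(Y_t = 1 \mid \mathcal{F}_{t-1})
 &= P\left( \epsilon_t \geq \frac{r - \gamma_0 - \sum_{i=1}^{p} \gamma_i X_{t-i}}{\lambda}
    \,\Big|\, \mathcal{F}_{t-1} \right) \\
 &= F_l\left( \frac{\gamma_0 - r + \sum_{i=1}^{p} \gamma_i X_{t-i}}{\lambda} \right)
  = \frac{1}{1 + \exp\left[ -\left( \gamma_0 - r + \sum_{i=1}^{p} \gamma_i X_{t-i} \right) / \lambda \right]}.
\end{align*}
```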


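This motivates logistic regression:

$$\pi_t(\beta) \equiv P_\beta(Y_t = 1 \mid \mathcal{F}_{t-1}) = F_l(\beta'Z_{t-1}) = \frac{1}{1 + \exp[-\beta'Z_{t-1}]}$$

or equivalently, the link function is

$$\mathrm{logit}(\pi_t(\beta)) \equiv \log\left\{ \frac{\pi_t(\beta)}{1 - \pi_t(\beta)} \right\} = \beta'Z_{t-1}$$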

Link functions for binary time series.

logit       $\beta'Z_{t-1} = \log\{\pi_t(\beta)/(1 - \pi_t(\beta))\}$
probit      $\beta'Z_{t-1} = \Phi^{-1}\{\pi_t(\beta)\}$
log-log     $\beta'Z_{t-1} = -\log\{-\log(\pi_t(\beta))\}$
C-log-log   $\beta'Z_{t-1} = \log\{-\log(1 - \pi_t(\beta))\}$

Note: Here all the inverse links are cdf's. In what follows we always assume the inverse link
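The four inverse links can be written out explicitly, which makes it easy to see that each maps the real line into (0, 1). The sketch below uses only the standard library; the function names and the `LINKS` dictionary are illustrative, and the probit inverse link is computed from the error function rather than from a statistics package.

```python
import math

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    # standard normal cdf expressed through the error function
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_loglog(eta):
    return math.exp(-math.exp(-eta))

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

# (link g, inverse link h) pairs for the links with a closed-form g
LINKS = {
    "logit": (lambda p: math.log(p / (1.0 - p)), inv_logit),
    "log-log": (lambda p: -math.log(-math.log(p)), inv_loglog),
    "c-log-log": (lambda p: math.log(-math.log(1.0 - p)), inv_cloglog),
}
```

Applying a link to its inverse returns the linear predictor, for example `LINKS["logit"][0](inv_logit(0.7))` recovers 0.7 up to rounding.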

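The partial likelihood of $\beta$ takes on the simple product form,

$$PL(\beta) = \prod_{t=1}^{N} [\pi_t(\beta)]^{y_t} [1 - \pi_t(\beta)]^{1 - y_t} = \prod_{t=1}^{N} [F(\beta'Z_{t-1})]^{y_t} [1 - F(\beta'Z_{t-1})]^{1 - y_t}$$

We have under Assumption A:

$$\sqrt{N}(\hat\beta - \beta) \to N_p(0, G^{-1}(\beta)).$$

For the canonical link (logistic regression):

$$\frac{G_N(\beta)}{N} \to G(\beta) = \int_{\mathbb{R}^p} \frac{e^{\beta'z}}{(1 + e^{\beta'z})^2}\, zz'\, \nu(dz)$$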

Illustration of asymptotic normality.

$$\mathrm{logit}(\pi_t(\beta)) = \beta_1 + \beta_2 \cos\left( \frac{2\pi t}{12} \right) + \beta_3 Y_{t-1}$$

so that $Z_{t-1} = (1, \cos(2\pi t/12), Y_{t-1})'$.

[Figure: panels (a) and (b), plotted against t from 0 to 200.]

Logistic autoregression with a sinusoidal

[Figure: three histograms, labeled b1, b2, b3.]

Histograms of normalized MPLE's where $\beta = (0.3, 0.75, 1)'$, N = 200. Each histogram
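A simulation of this kind can be sketched as follows: generate one path of the logistic autoregression with $Z_{t-1} = (1, \cos(2\pi t/12), Y_{t-1})'$ and compute the MPLE by Newton-Raphson, which coincides with Fisher scoring for the canonical logit link. The function names, the starting value $Y_0 = 0$, the zero initial guess, and the seed are assumptions made for illustration only.

```python
import numpy as np

def simulate_logistic_ar(beta, N, rng):
    """One path of logit(pi_t) = b1 + b2*cos(2*pi*t/12) + b3*Y_{t-1}, with Y_0 = 0 assumed."""
    y = np.zeros(N + 1, dtype=int)
    Z = np.zeros((N, 3))
    for t in range(1, N + 1):
        Z[t - 1] = (1.0, np.cos(2 * np.pi * t / 12), y[t - 1])
        p = 1.0 / (1.0 + np.exp(-(Z[t - 1] @ beta)))
        y[t] = rng.random() < p
    return Z, y[1:]

def fit_logistic(Z, y, n_iter=50, tol=1e-10):
    """Maximum partial likelihood estimate for the logit link via Newton-Raphson."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        score = Z.T @ (y - pi)                       # S_N(beta)
        G = Z.T @ ((pi * (1.0 - pi))[:, None] * Z)   # G_N(beta)
        step = np.linalg.solve(G, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

beta_true = np.array([0.3, 0.75, 1.0])
rng = np.random.default_rng(0)
Z, y = simulate_logistic_ar(beta_true, 200, rng)
beta_hat = fit_logistic(Z, y)
```

Repeating the last three lines over many independent paths and normalizing $\hat\beta - \beta$ by the estimated standard errors would give the raw material for histograms like those described above.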

Goodness of Fit

$C_1, \ldots, C_k$, a partition of $\mathbb{R}^p$. For $j = 1, \ldots, k$, define

$$M_j \equiv \sum_{t=1}^{N} I_{[Z_{t-1} \in C_j]} Y_t$$

and

$$E_j(\beta) \equiv \sum_{t=1}^{N} I_{[Z_{t-1} \in C_j]} \pi_t(\beta).$$

Put: $M \equiv (M_1, \ldots, M_k)'$, $E(\beta) \equiv (E_1(\beta), \ldots, E_k(\beta))'$.

Slud and Kedem (1994), Kedem and Fokianos (2002): With

$$\sigma_j^2 \equiv \int_{C_j} F(\beta'z)(1 - F(\beta'z))\, \nu(dz),$$

$$\chi^2(\beta) \equiv \frac{1}{N} \sum_{j=1}^{k} (M_j - E_j(\beta))^2 / \sigma_j^2 \to \chi^2_k.$$

In practice need to adjust the df when replacing

When $\hat\beta$ and $(M - E(\beta))$ are obtained from the same data set,

$$E(\chi^2(\hat\beta)) \approx k - \sum_{j=1}^{k} (B'G(\beta)B)_{jj} / \sigma_j^2$$

When $\hat\beta$ and $(M - E(\beta))$ are obtained from independent data sets,

$$E(\chi^2(\hat\beta)) \approx k + \sum_{j=1}^{k}$$

Illustration of the distribution of $\chi^2(\beta)$ using Q-Q plots.

Consider the previous logistic regression model with a periodic component. Use the partition

$C_1 = \{Z : Z_1 = 1,\ -1 \leq Z_2 < 0,\ Z_3 = 0\}$

$C_2 = \{Z : Z_1 = 1,\ -1 \leq Z_2 < 0,\ Z_3 = 1\}$

$C_3 = \{Z : Z_1 = 1,\ 0 \leq Z_2 \leq 1,\ Z_3 = 0\}$

$C_4 = \{Z : Z_1 = 1,\ 0 \leq Z_2 \leq 1,\ Z_3 = 1\}$

Then $k = 4$, $M_j$ is the sum of those $Y_t$'s for which $Z_{t-1}$ is in $C_j$, $j = 1, 2, 3, 4$, and the $E_j(\beta)$ are obtained similarly. Estimate $\sigma_j^2$ by

$$\tilde\sigma_j^2 = \frac{1}{N} \sum_{t=1}^{N} I_{[Z_{t-1} \in C_j]}\, \pi_t(\beta)(1 - \pi_t(\beta))$$
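The statistic for this four-cell partition can be computed in a few lines. The sketch below assumes the logit link, takes the rows of `Z` to be $Z_{t-1}' = (1, \cos(2\pi t/12), Y_{t-1})$, and plugs $\tilde\sigma_j^2$ in place of $\sigma_j^2$; the function name `gof_statistic` is illustrative. Rounding $Z_2$ before the sign test is an implementation choice that keeps floating point values of the cosine near zero in the cells with $0 \leq Z_2 \leq 1$, and every cell is assumed to be non-empty.

```python
import numpy as np

def gof_statistic(Z, y, beta):
    """chi^2(beta) = (1/N) * sum_j (M_j - E_j)^2 / sigma_j^2 for the cells C_1..C_4."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(y)
    pi = 1.0 / (1.0 + np.exp(-(Z @ beta)))
    z2_nonneg = np.round(Z[:, 1], 12) >= 0
    # cell index 0..3 corresponds to C_1..C_4
    cell = 2 * z2_nonneg.astype(int) + (Z[:, 2] == 1).astype(int)
    total = 0.0
    for j in range(4):
        in_cell = cell == j
        M_j = y[in_cell].sum()
        E_j = pi[in_cell].sum()
        sigma2_j = (pi[in_cell] * (1.0 - pi[in_cell])).sum() / N
        total += (M_j - E_j) ** 2 / sigma2_j
    return total / N
```

Evaluating this function on many simulated paths at the true $\beta$ gives a sample whose quantiles can be compared with those of $\chi^2_4$.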

The $\chi^2_4$ approximation is quite good.

[Figure: two Q-Q plots of the chi-square statistic (CHI SQ STATISTIC) against the true chi-square quantiles (TRUE CHI SQ). Panels: N=200, b=(0.3,0.75,1) and N=400, b=(0.3,1,2).]

Example: Modeling successive eruptions of the Old Faithful geyser in Yellowstone National Park, Wyoming.

1 if duration is greater than 3 minutes
0 if duration is less than 3 minutes

101110110101011010110101010 111110101010101010101010101 011111010101011010111011111 011101010101010101010101010 101101010101011101111111011 111011111110101010101011111 101010101110101011010111101 010101110101011011011101010 101101111111010101111011011 101101011101011111011101010 110101111111101010101010101 10


Candidate models $\eta_t = \beta'Z_{t-1}$ for Old Faithful.

1   $\beta_0 + \beta_1 Y_{t-1}$
2   $\beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2}$
3   $\beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \beta_3 Y_{t-3}$
4   $\beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \beta_3 Y_{t-3} + \beta_4 Y_{t-4}$

Comparison of models using the logistic link: Model 2 is "best". Probit regression gives similar results.

Model   p   X²       D        AIC      BIC
1       2   165.00   227.38   231.38   238.46
2       3   165.00   215.53   221.53   232.15
3       4   165.00   215.08   223.08   237.24
4       5   164.97   213.99   223.99   241.69

$$\hat\pi_t = \pi_t(\hat\beta) = \frac{1}{1 + \exp\left\{ -(\hat\beta_0 + \hat\beta_1 Y_{t-1} + \hat\beta_2 Y_{t-2}) \right\}}.$$

Part III: Categorical Time Series

EEG sleep state classified or quantized in four categories as follows.

1 : quiet sleep,
2 : indeterminate sleep,
3 : active sleep,
4 : awake.

Here the sleep state categories or levels are assigned integer values. This is an example of a categorical time series $\{Y_t\}$, $t = 1, \ldots, N$, taking the values 1, ..., 4.

This is an arbitrary integer assignment. Why not the values 7.1, 15.8, 19.24, 71.17? Any other scale?

Assume $m$ categories.

The $t$'th observation of any categorical time series, regardless of the measurement scale, can be represented by the vector

$$Y_t = (Y_{t1}, \ldots, Y_{tq})'$$

of length $q = m - 1$, with elements

$$Y_{tj} = \begin{cases} 1, & \text{if the $j$th category is observed at time $t$} \\ 0, & \text{otherwise} \end{cases}$$

for $t = 1, \ldots, N$ and $j = 1, \ldots, q$.
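The vector representation is a one-hot encoding with the last category dropped. A minimal sketch, assuming the categories are labeled 1, ..., m; the function name `encode_categorical` and the short example series are made up for illustration.

```python
import numpy as np

def encode_categorical(states, m):
    """Map labels in 1..m to indicator vectors Y_t of length q = m - 1."""
    states = np.asarray(states)
    q = m - 1
    Y = np.zeros((len(states), q), dtype=int)
    for j in range(1, q + 1):
        Y[:, j - 1] = (states == j)    # Y_tj = 1 if category j is observed at time t
    return Y

# hypothetical sequence of sleep states with m = 4 categories
Y = encode_categorical([1, 4, 2, 3, 3], m=4)
```

An observation in category m is encoded as the zero vector, so no information is lost by keeping only q = m − 1 components.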

Write, for $j = 1, \ldots, q$,

$$\pi_{tj} = E[Y_{tj} \mid \mathcal{F}_{t-1}] = P(Y_{tj} = 1 \mid \mathcal{F}_{t-1}).$$

Define $\pi_t = (\pi_{t1}, \ldots, \pi_{tq})'$, and

$$Y_{tm} = 1 - \sum_{j=1}^{q} Y_{tj}, \qquad \pi_{tm} = 1 - \sum_{j=1}^{q} \pi_{tj}.$$

Let $\{Z_{t-1}\}$, $t = 1, \ldots, N$, denote a $p \times q$ matrix that represents a covariate process.

$Y_{tj}$ corresponds to a vector of length $p$ of

Assume the general regression model:

$$(*) \quad \pi_t(\beta) = \begin{pmatrix} \pi_{t1}(\beta) \\ \pi_{t2}(\beta) \\ \vdots \\ \pi_{tq}(\beta) \end{pmatrix} = \begin{pmatrix} h_1(Z'_{t-1}\beta) \\ h_2(Z'_{t-1}\beta) \\ \vdots \\ h_q(Z'_{t-1}\beta) \end{pmatrix} = h(Z'_{t-1}\beta).$$

The inverse link function $h$ is defined on $\mathbb{R}^q$ and takes values in $\mathbb{R}^q$.

We shall only examine nominal and ordinal time series.

Nominal Time Series.

Nominal categorical variables lack natural ordering.

A model for nominal time series: Multinomial logit model

$$\pi_{tj}(\beta) = \frac{\exp(\beta'_j z_{t-1})}{1 + \sum_{l=1}^{q} \exp(\beta'_l z_{t-1})}, \qquad j = 1, \ldots, q.$$

Note that

$$\pi_{tm}(\beta) = \frac{1}{1 + \sum_{l=1}^{q} \exp(\beta'_l z_{t-1})}.$$

Multinomial logit model is a special case of (*).

Indeed, define $\beta$ to be the $p \equiv qd$-vector

$$\beta = (\beta'_1, \ldots, \beta'_q)',$$

and $Z_{t-1}$ the $qd \times q$ matrix

$$Z_{t-1} = \begin{pmatrix} z_{t-1} & 0 & \cdots & 0 \\ 0 & z_{t-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & z_{t-1} \end{pmatrix}.$$

Let $h$ stand for the vector-valued function whose components $h_j$, $j = 1, \ldots, q$, are given by

$$\pi_{tj}(\beta) = h_j(\eta_t) = \frac{\exp(\eta_{tj})}{1 + \sum_{l=1}^{q} \exp(\eta_{tl})}, \qquad j = 1, \ldots, q,$$

with $\eta_t = (\eta_{t1}, \ldots, \eta_{tq})' = Z'_{t-1}\beta$.
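The block-diagonal construction can be checked numerically. The sketch below builds $Z_{t-1}$ with a Kronecker product and evaluates the multinomial logit probabilities through $\eta_t = Z'_{t-1}\beta$; the function name and the argument layout (one row of `betas` per $\beta_j$) are illustrative assumptions.

```python
import numpy as np

def multinomial_logit_probs(betas, z):
    """betas: (q, d) array whose rows are beta_j; z: covariate vector z_{t-1} of length d."""
    betas = np.asarray(betas, dtype=float)
    z = np.asarray(z, dtype=float)
    q, d = betas.shape
    beta = betas.reshape(q * d)                  # stacked vector (beta_1', ..., beta_q')'
    Z = np.kron(np.eye(q), z.reshape(d, 1))      # qd x q block-diagonal matrix Z_{t-1}
    eta = Z.T @ beta                             # eta_tj = beta_j' z_{t-1}
    denom = 1.0 + np.exp(eta).sum()
    pi = np.exp(eta) / denom                     # pi_t1, ..., pi_tq
    pi_m = 1.0 / denom                           # probability of the reference category m
    return pi, pi_m, Z
```

The product `Z.T @ beta` returns exactly the $q$ inner products $\beta'_j z_{t-1}$, which is why the multinomial logit model fits the general form (*).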

Ordinal Time Series.

Measured on a scale endowed with a natural ordering.

A model for ordinal time series: Need a latent or auxiliary variable.

Put

$$X_t = -\gamma' z_{t-1} + e_t,$$

1. $e_t \sim$ cdf $F$, i.i.d.

2. $\gamma$: $d$-dimensional vector of parameters.

3. $z_{t-1}$: $d$-dimensional covariate vector.

Define a categorical time series $\{Y_t\}$ from the levels of $\{X_t\}$,

Then

$$\pi_{tj} = P(\theta_{j-1} \leq X_t < \theta_j \mid \mathcal{F}_{t-1}) = F(\theta_j + \gamma' z_{t-1}) - F(\theta_{j-1} + \gamma' z_{t-1}),$$

for $j = 1, \ldots, m$.

There are many possibilities depending on $F$.

Special case: Proportional Odds Model,

$$F(x) = F_l(x) = \frac{1}{1 + \exp(-x)}.$$

Then we have, for $j = 1, \ldots, q$,

$$\log\left\{ \frac{P[Y_t \leq j \mid \mathcal{F}_{t-1}]}{P[Y_t > j \mid \mathcal{F}_{t-1}]} \right\} = \theta_j + \gamma' z_{t-1}$$

Proportional odds model has the form (*) with $p = q + d$:

$$\beta = (\theta_1, \ldots, \theta_q, \gamma')'$$

and $Z_{t-1}$ the $(q + d) \times q$ matrix

$$Z_{t-1} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ z_{t-1} & z_{t-1} & \cdots & z_{t-1} \end{pmatrix}.$$

Now set $h = (h_1, \ldots, h_q)'$, and let, for $j = 2, \ldots, q$,

$$\pi_{t1}(\beta) = h_1(\eta_t) = F(\eta_{t1}),$$

$$\pi_{tj}(\beta) = h_j(\eta_t) = F(\eta_{tj}) - F(\eta_{t(j-1)}),$$

where
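The proportional odds probabilities can be evaluated with the same matrix device. The sketch below assumes $\eta_t = Z'_{t-1}\beta$, in line with the general form (*), uses the logistic cdf for $F$, and requires increasing thresholds $\theta_1 < \cdots < \theta_q$; the function name and the argument layout are illustrative.

```python
import numpy as np

def proportional_odds_probs(theta, gamma, z):
    """theta: increasing thresholds theta_1..theta_q; gamma, z: vectors of length d."""
    theta = np.asarray(theta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    z = np.asarray(z, dtype=float)
    q, d = len(theta), len(z)
    beta = np.concatenate([theta, gamma])                          # (theta_1..theta_q, gamma')'
    Z = np.vstack([np.eye(q), np.tile(z.reshape(d, 1), (1, q))])   # (q + d) x q matrix
    eta = Z.T @ beta                                               # eta_tj = theta_j + gamma' z
    cum = 1.0 / (1.0 + np.exp(-eta))                               # P(Y_t <= j | F_{t-1})
    pi = np.diff(np.concatenate([[0.0], cum, [1.0]]))              # pi_t1, ..., pi_tm
    return pi, eta
```

Because the same $\gamma$ enters every cumulative logit, the $q$ log-odds differ only through the thresholds $\theta_j$.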

Partial likelihood estimation.

Introduce the multinomial probability

$$f(y_t; \beta \mid \mathcal{F}_{t-1}) = \prod_{j=1}^{m} \pi_{tj}(\beta)^{y_{tj}}.$$

The partial likelihood is a product of the multinomial probabilities,

$$PL(\beta) = \prod_{t=1}^{N} f(y_t; \beta \mid \mathcal{F}_{t-1}) = \prod_{t=1}^{N} \prod_{j=1}^{m} \pi_{tj}^{y_{tj}}(\beta),$$

so that the partial log-likelihood is given by

$$l(\beta) \equiv \log PL(\beta) = \sum_{t=1}^{N} \sum_{j=1}^{m} y_{tj} \log \pi_{tj}(\beta).$$

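Under a modified Assumption A:

$$\frac{G_N(\beta)}{N} \to \int_{\mathbb{R}^{p \times q}} Z\, U(\beta)\, \Sigma(\beta)\, U'(\beta)\, Z'\, \nu(dZ) = G(\beta)$$

$$\sqrt{N}(\hat\beta - \beta) \to N_p\left( 0, G^{-1}(\beta) \right)$$

$$\sqrt{N}\left( \pi_t(\hat\beta) - \pi_t(\beta) \right) \to N_q\left( 0, Z_{t-1} D_t(\beta) G^{-1}(\beta) D'_t(\beta) Z'_{t-1} \right)$$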

Example: Sleep State.

Covariates: Heart Rate, Temperature.

N = 700.

1 : quiet sleep, 2 : indeterminate sleep, 3 : active sleep, 4 : awake.

Ordinal CTS: "4" < "1" < "2" < "3".

[Figure: time series plots of State, Heart Rate, and Temperature against time.]

Fit proportional odds models.

Model   Covariates                               AIC
1       $1 + Y_{t-1}$                            401.56
2       $1 + Y_{t-1} + \log R_t$                 401.51
3       $1 + Y_{t-1} + \log R_t + T_t$           403.32
4       $1 + Y_{t-1} + T_t$                      403.52
5       $1 + Y_{t-1} + Y_{t-2} + \log R_t$       407.28
6       $1 + Y_{t-1} + \log R_{t-1}$             403.40
7       $1 + \log R_t$                           1692.31

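Model 2: $1 + Y_{t-1} + \log R_t$

$$\log\left[ \frac{P(Y_t \leq \text{"4"} \mid \mathcal{F}_{t-1})}{P(Y_t > \text{"4"} \mid \mathcal{F}_{t-1})} \right] = \theta_1 + \gamma_1 Y_{(t-1)1} + \gamma_2 Y_{(t-1)2} + \gamma_3 Y_{(t-1)3} + \gamma_4 \log R_t,$$

$$\log\left[ \frac{P(Y_t \leq \text{"1"} \mid \mathcal{F}_{t-1})}{P(Y_t > \text{"1"} \mid \mathcal{F}_{t-1})} \right] = \theta_2 + \gamma_1 Y_{(t-1)1} + \gamma_2 Y_{(t-1)2} + \gamma_3 Y_{(t-1)3} + \gamma_4 \log R_t,$$

$$\log\left[ \frac{P(Y_t \leq \text{"2"} \mid \mathcal{F}_{t-1})}{P(Y_t > \text{"2"} \mid \mathcal{F}_{t-1})} \right] = \theta_3 + \gamma_1 Y_{(t-1)1} + \gamma_2 Y_{(t-1)2} + \gamma_3 Y_{(t-1)3} + \gamma_4 \log R_t,$$

$\hat\theta_1 = -30.352$, $\hat\theta_2 = -23.493$, $\hat\theta_3 = -20.349$, $\hat\gamma_1 = 16.718$, $\hat\gamma_2 = 9.533$, $\hat\gamma_3 = 4.755$, $\hat\gamma_4 = 3.556$.

The corresponding standard errors are 12.051, 12.012, 11.985, 0.872, 0.630, 0.501 and 2.470.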

[Figure: two panels, (a) and (b), each showing State (1 to 4) against time from 0 to 300.]

(a) Observed versus (b) predicted sleep states for Model 2 of Table applied to the testing