Forecasting Financial Time Series

(1)

Manfred Deistler and Christiane Zinner

Canberra, February, 2007

(2)

1 Introduction

Forecasting Financial Time Series: Problems and Approaches

2 Factor Models

The Basic Framework

Principal Component Analysis

The Frisch Model

(3)

Forecasting Financial Time Series: Problems and Approaches

Time series: (relative) returns

\[ r_t = \frac{p_t - p_{t-1}}{p_{t-1}}, \quad t = 1, \dots, T, \]

where $p_t$ denotes the prices of financial assets such as shares, indices, exchange rates.
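As a minimal sketch, the return definition above can be computed with NumPy as follows. The function name `relative_returns` and the sample prices are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def relative_returns(prices):
    """Relative returns r_t = (p_t - p_{t-1}) / p_{t-1}, t = 1, ..., T."""
    p = np.asarray(prices, dtype=float)
    return (p[1:] - p[:-1]) / p[:-1]

# Hypothetical weekly closing prices p_0, ..., p_3
prices = [100.0, 102.0, 99.96, 99.96]
returns = relative_returns(prices)
```

A series of T + 1 prices yields T returns, which is why the index t starts at 1.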

Different time scales: Intraday (high frequency), daily, weekly, monthly data.

(4)

Aims

In (quantitative) portfolio management it is of interest to forecast

Conditional expectations

Conditional variances

(5)

Stylized facts

Return series are not i.i.d., but show little serial correlation. This is supported by "classical" theory:

\[ E(p_{t+1} \mid p_t, p_{t-1}, \dots) = p_t \quad \text{(weak efficiency)} \]

or

\[ E(p_{t+1} \mid I_t) = p_t \quad \text{(semi-strong efficiency)}, \]

where $I_t$ denotes the publicly available information at time $t$.

⇒ Problem: Can we find better forecasting models from data? Can we beat the market?

(6)

Main issues in this context:

Input selection

Modelling of dynamics

Nonlinearities

Structural changes and adaptation

Outliers

Problem-dependent criteria for forecast quality and forecast evaluation

(7)

Input selection

Large number (several thousands) of candidates, and large number of model classes: overfitting and computation time.

What are appropriate model selection criteria and strategies? AIC and BIC do not seem to be appropriate in cases where the number of model classes is of the same order as the sample size. Intelligent search algorithms (An algorithm, genetic programming); the role of prior knowledge.

Modelling of dynamics

Similar to input selection.

Nonlinearities

Neural networks (NN) as benchmarks.

(8)

Structural changes and adaptation

Time varying parameters, adaptive estimation procedures, detection of local trends.

Criteria for forecast quality

Ideally depending on portfolio optimization criteria. Out-of-sample $R^2$ or hit rate are only easy-to-calculate substitutes. Long-run target: identification procedures based on portfolio optimization criteria.
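The two substitute criteria can be sketched as follows. The slides do not give exact definitions, so this block assumes that the out-of-sample R² is measured against the mean of the realized values over the evaluation period and that the hit rate is the share of periods with matching signs; the function names and sample numbers are hypothetical.

```python
import numpy as np

def oos_r2(actual, forecast):
    """Out-of-sample R^2: 1 - SSE(forecast) / SSE(mean of realized values).

    Assumption: the benchmark is the mean of the realized values over the
    evaluation period; other benchmarks (e.g. a zero forecast) are possible.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    sse = np.sum((actual - forecast) ** 2)
    sst = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - sse / sst

def hit_rate(actual, forecast):
    """Share of periods in which forecast and realized return have the same sign."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.sign(actual) == np.sign(forecast)))

# Hypothetical realized returns and forecasts
actual = [0.01, -0.02, 0.03, -0.01]
forecast = [0.005, -0.01, 0.01, 0.002]
r2 = oos_r2(actual, forecast)
hits = hit_rate(actual, forecast)
```

Neither statistic accounts for the size of the position taken or for risk, which is one reason they are only substitutes for portfolio-based criteria.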

(9)

Factor Models

Here we consider factor type models for forecasting return series:

linear

multivariate

(10)

Factor models are used to condense high dimensional data consisting of many variables into a much smaller number of factors. The factors represent the comovement between the single time series or underlying nonobserved variables influencing the observations. Here we consider (static and dynamic)

Principal component models

Frisch or idiosyncratic noise models

Generalized linear factor models

(11)

The model

The basic, common equation for all different kinds of factor models considered here is of the form

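\[ y_t = \Lambda(z)\xi_t + u_t, \tag{1} \]

where

$y_t$ ... observations ($n$-dim.),

$\xi_t$ ... factors (unobserved) ($r$-dim., $r \ll n$),

$\Lambda(z) = \sum_{j=-\infty}^{\infty} \Lambda_j z^j$, $\Lambda_j \in \mathbb{R}^{n \times r}$ ... factor loadings,

$\hat{y}_t = \Lambda(z)\xi_t$ ... latent variables,

$\Lambda = \Lambda_0$ ... (quasi-)static case.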

(12)

Assumptions

Throughout we assume the following:

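$E\xi_t = 0$, $Eu_t = 0$ for all $t \in \mathbb{Z}$.

$E\xi_t u_s' = 0$ for all $s, t \in \mathbb{Z}$.

$(\xi_t)$ and $(u_t)$ are wide-sense stationary and (linearly) regular with covariances $\gamma_\xi(s) = E\xi_t \xi_{t+s}'$ and $\gamma_u(s) = Eu_t u_{t+s}'$ satisfying

\[ \sum_{s=-\infty}^{\infty} |s| \, \|\gamma_\xi(s)\| < \infty, \qquad \sum_{s=-\infty}^{\infty} |s| \, \|\gamma_u(s)\| < \infty. \tag{2} \]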

(13)

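Then the spectral density $f_y$ of $y_t$ exists:

\[ f_y(\lambda) = \Lambda(e^{-i\lambda}) f_\xi(\lambda) \Lambda(e^{-i\lambda})^* + f_u(\lambda). \tag{3} \]

And for the static case we obtain

\[ \Sigma_y = \Lambda \Sigma_\xi \Lambda^* + \Sigma_u, \quad \text{where e.g. } \Sigma_y = E y_t y_t'. \tag{4} \]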

(14)

We will be concerned with the following questions.

Identifiability questions:

Identifiability of $f_{\hat{y}} = \Lambda f_\xi \Lambda^*$ and $f_u$

Identifiability of $\Lambda$ and $f_\xi$

Estimation of integers and real-valued parameters:

Estimation of $r$

Estimation of the free parameters in $\Lambda$, $f_\xi$, $f_u$

Estimation of $\xi_t$

(15)

Principal Component Analysis

The quasi-static case:

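We commence from the eigenvalue decomposition of $\Sigma_y$:

\[ \Sigma_y = O \Omega O' = \underbrace{O_1 \Omega_1 O_1'}_{\Sigma_{\hat{y}}} + \underbrace{O_2 \Omega_2 O_2'}_{\Sigma_u}, \]

where $\Omega_1$ is the $r \times r$-dim. diagonal matrix containing the $r$ largest eigenvalues of $\Sigma_y$. This decomposition is unique for $\omega_r > \omega_{r+1}$.

A special choice for the factor loading matrix is $\Lambda = O_1$; then

\[ y_t = O_1 \xi_t + u_t, \quad \xi_t = O_1' y_t \quad \text{and} \quad u_t = y_t - O_1 O_1' y_t = O_2 O_2' y_t. \]

Note: Here, factors are linear functions of $y_t$.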
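The quasi-static decomposition with the special choice Λ = O₁ can be sketched in NumPy as below. This is an illustration under stated assumptions, not the authors' implementation: the data are simulated, the columns of `Y` are demeaned, and the covariance is estimated as (1/T) Σ yₜyₜ'; the function name `static_pca` is hypothetical.

```python
import numpy as np

def static_pca(Y, r):
    """Quasi-static PCA factor model from a (T x n) data matrix Y with zero-mean columns."""
    T = Y.shape[0]
    Sigma_y = Y.T @ Y / T                  # sample covariance (zero mean assumed)
    omega, O = np.linalg.eigh(Sigma_y)     # eigh returns eigenvalues in ascending order
    order = np.argsort(omega)[::-1]        # reorder to descending
    omega, O = omega[order], O[:, order]
    O1 = O[:, :r]                          # loadings: Lambda = O_1
    xi = Y @ O1                            # factors xi_t = O_1' y_t (one row per t)
    y_hat = xi @ O1.T                      # latent variables O_1 O_1' y_t
    u = Y - y_hat                          # noise O_2 O_2' y_t
    return omega, O1, xi, y_hat, u

# Simulated data purely for illustration
rng = np.random.default_rng(0)
Y = rng.standard_normal((500, 6))
Y -= Y.mean(axis=0)
omega, O1, xi, y_hat, u = static_pca(Y, r=2)
```

By construction the estimated factors and the residuals are orthogonal in the sample, mirroring the orthogonality of O₁ and O₂.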

(16)

Estimation:

Determine $r$ from $\omega_1, \dots, \omega_n$

Estimate $\Lambda$, $\Sigma_\xi$, $\Sigma_u$, $\xi_t$ from the eigenvalue decomposition of the sample covariance matrix $\hat{\Sigma}_y$

(17)

The dynamic case:

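We commence from the canonical representation of the spectral density $f_y$:

\[ f_y(\lambda) = \underbrace{O_1(e^{-i\lambda}) \Omega_1(\lambda) O_1(e^{-i\lambda})^*}_{f_{\hat{y}}(\lambda)} + \underbrace{O_2(e^{-i\lambda}) \Omega_2(\lambda) O_2(e^{-i\lambda})^*}_{f_u(\lambda)}, \]

then

\[ y_t = O_1(z)\xi_t + u_t \quad \text{and} \quad \xi_t = O_1^*(z) y_t. \]

Note: Here $E u_t' u_t$ is minimal among all decompositions where $\operatorname{rk}(f_{\hat{y}}(\lambda)) = r$ a.e.

Whereas for given $r$ the decomposition above is unique and thus $(\hat{y}_t)$ and $(u_t)$ are identifiable, the transfer function $\Lambda(z)$ and the factors $\xi_t$ are only identifiable up to regular dynamic linear transformations.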

(18)

Again $\xi_t = O_1^*(z) y_t$, i.e. factors are linear transformations of $(y_t)$.

Problem: In general, the filter $O_1^*(z)$ will be non-causal and non-rational. Thus, naive forecasting may lead to infeasible forecasts for $y_t$. Restriction to causal filters is required.

(19)

The Frisch Model

Here the additional assumption that $f_u$ is diagonal is imposed in (1).

Interpretation: Factors describe the common effects, the noise $u_t$ takes into account the individual effects, e.g. factors describe market- and sector-specific movements and the noise the firm-specific movements of stock returns.

For given $\hat{y}_t$ the components of $y_t$ are conditionally uncorrelated.

(20)

The quasi-static case:

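Identifiability: More demanding compared to PCA

\[ \Sigma_y = \underbrace{\Lambda \Sigma_\xi \Lambda'}_{\Sigma_{\hat{y}}} + \Sigma_u, \quad \Sigma_u \text{ diagonal} \tag{5} \]

Identifiability of $\Sigma_{\hat{y}}$: Uniqueness of solution of (5)

For given $n$ and $r$, the number of equations (i.e. the number of free elements in $\Sigma_y$) is $n(n+1)/2$. The number of free parameters on the right-hand side is $nr - r(r-1)/2 + n$.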

(21)

Now let

\[ B(r) = \frac{n(n+1)}{2} - \left( nr - \frac{r(r-1)}{2} + n \right) = \frac{1}{2}\left( (n-r)^2 - n - r \right), \]

then the following cases may occur:

$B(r) < 0$: In this case we might expect non-uniqueness of the decomposition

$B(r) \geq 0$: In this case we might expect uniqueness of the decomposition

The argument can be made more precise; in particular, for $B(r) > 0$ generic uniqueness can be shown. Given $\Sigma_y$, if $\Sigma_\xi = I_r$ is assumed, then $\Lambda$ is unique up to postmultiplication by orthogonal matrices (rotation).

Note that, as opposed to PCA, here the factors $\xi_t$, in general, cannot be obtained as a function of the observations $y_t$. Thus, the factors have to be approximated by a linear function of $y_t$.
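The counting argument is easy to evaluate numerically. The sketch below computes B(r) exactly as defined above; the helper `max_factors`, which returns the largest r with B(r) ≥ 0, is an illustrative addition and not part of the slides.

```python
def b(n, r):
    """B(r): number of equations n(n+1)/2 minus free parameters nr - r(r-1)/2 + n."""
    # n(n+1) and r(r-1) are always even, so integer division is exact.
    return n * (n + 1) // 2 - (n * r - r * (r - 1) // 2 + n)

def max_factors(n):
    """Largest r in 0..n with B(r) >= 0 (hypothetical helper for illustration)."""
    return max(r for r in range(n + 1) if b(n, r) >= 0)

# Example: with n = 5 observed series
values = [b(5, r) for r in range(4)]   # B(0), B(1), B(2), B(3)
```

For n = 5 this gives B(1) = 5, B(2) = 1 and B(3) = -2, so the counting condition allows at most two factors.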

(22)

Estimation:

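If $\xi_t$ and $u_t$ were Gaussian white noise, then the (negative logarithm of the) likelihood function has the form

\[ L_T(\Lambda, \Sigma_u) = \frac{1}{2} T \log\det(\Lambda\Lambda' + \Sigma_u) + \frac{1}{2} \sum_{t=1}^{T} y_t' (\Lambda\Lambda' + \Sigma_u)^{-1} y_t = \frac{1}{2} T \log\det(\Lambda\Lambda' + \Sigma_u) + \frac{1}{2} T \operatorname{tr}\left( (\Lambda\Lambda' + \Sigma_u)^{-1} \hat{\Sigma}_y \right). \tag{6} \]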
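A minimal sketch of evaluating criterion (6) in its trace form is given below. It assumes that the sample covariance is (1/T) Σ yₜyₜ' (which is what makes the two forms in (6) equal) and uses simulated data; the function name `neg_loglik` and all parameter values are hypothetical.

```python
import numpy as np

def neg_loglik(Lam, sigma_u, Y):
    """Criterion (6): Lam is (n x r), sigma_u the diagonal of Sigma_u, Y is (T x n)."""
    T = Y.shape[0]
    S = Lam @ Lam.T + np.diag(sigma_u)     # model covariance Lam Lam' + Sigma_u
    Sigma_hat = Y.T @ Y / T                # assumed: (1/T) sum_t y_t y_t'
    _, logdet = np.linalg.slogdet(S)       # numerically stable log-determinant
    return 0.5 * T * logdet + 0.5 * T * np.trace(np.linalg.solve(S, Sigma_hat))

# Simulated data from a static model with r = 2 factors, for illustration only
rng = np.random.default_rng(1)
n, r, T = 5, 2, 200
Lam = rng.standard_normal((n, r))
sigma_u = rng.uniform(0.5, 1.5, size=n)
Y = rng.standard_normal((T, r)) @ Lam.T + rng.standard_normal((T, n)) * np.sqrt(sigma_u)
value = neg_loglik(Lam, sigma_u, Y)
```

In an estimation routine this function would be minimized over Λ and the diagonal of Σᵤ with a numerical optimizer; that step is omitted here.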

(23)

The dynamic case:

Here Equation (1) together with the assumption that $f_u$ is diagonal is considered. Again $u_t$ represents the individual influences and $\xi_t$ the comovements. The only difference to the previous section is that $\Lambda$ is now a dynamic filter and the components of $u_t$ are orthogonal to each other for all leads and lags.

There are still many unsolved problems.

(24)

Motivation:

In a number of applications the cross-sectional dimension is high, possibly exceeding sample size.

Weak dependence (“local” correlation) between noise components should be allowed.

Examples:

cross-country business cycle analysis

asset pricing

forecasting returns of financial instruments

(25)

The model

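For the analysis $n$ is not regarded as fixed. Thus we consider a double sequence $(y_{it} \mid i \in \mathbb{N}, t \in \mathbb{Z})$, where our general assumptions hold true for every vector $y_t^n = (y_{1,t}, y_{2,t}, \dots, y_{n,t})'$ with $n \in \mathbb{N}$.

⇒ Sequence of factor models:

\[ y_t^n = \Lambda^n(z)\xi_t + u_t^n, \quad t \in \mathbb{Z}, \; n \in \mathbb{N}, \tag{7} \]

where $u_t^n$ and $\Lambda^n(z)$ are nested.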

(26)

Additional Assumptions

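(Weak dependence) The largest eigenvalue of $f_u^n$, $\omega_{u,1}^n : [-\pi, \pi] \to \mathbb{R}$ say, is uniformly bounded for all $n \in \mathbb{N}$, i.e. there exists an $\omega \in \mathbb{R}$ such that $\omega_{u,1}^n(\lambda) \leq \omega$ for all $\lambda \in [-\pi, \pi]$ and for all $n \in \mathbb{N}$.

⇒ the variance of $(u_t^n)$ "averages out" for $n \to \infty$.

The first $r$ eigenvalues of $f_{\hat{y}}^n$, $\omega_{\hat{y},j}^n$ say, $j = 1, \dots, r$, diverge a.e. in $[-\pi, \pi]$ as $n \to \infty$.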

(27)

Representation, Forni and Lippi, 2001

The double sequence $\{y_{it} \mid i \in \mathbb{N}, t \in \mathbb{Z}\}$ can be represented by a generalized dynamic factor model if and only if

1. the first $r$ eigenvalues of $f_y^n$, $\omega_{y,j}^n$ say, $j = 1, \dots, r$, diverge a.e. in $[-\pi, \pi]$ as $n \to \infty$, whereas

2. the $(r+1)$-th eigenvalue of $f_y^n$, $\omega_{y,r+1}^n$ say, is uniformly bounded for all $\lambda \in [-\pi, \pi]$ and for all $n \in \mathbb{N}$.

The last result indicates that the spectral densities of $\hat{y}_t$ and $u_t$ are asymptotically (for $n \to \infty$) identified by dynamic PCA, i.e. the canonical representation of $f_y^n$, decomposed as

\[ f_y^n(\lambda) = O_1^n(e^{-i\lambda}) \Omega_1^n(\lambda) O_1^n(e^{-i\lambda})^* + O_2^n(e^{-i\lambda}) \Omega_2^n(\lambda) O_2^n(e^{-i\lambda})^*. \tag{8} \]

(28)

Factor Space, Forni and Lippi, 2001

The space spanned by the first $r$ dynamic principal components of $(y_t^n)$, i.e. $O_1^n(z)^* y_t^n$, converges to the space spanned by the factors $\xi_t$, where space is short for Hilbert space and where convergence of spaces is understood in the sense that the residuals of a projection from one space to the other converge to 0 in mean square.

(29)

Latent variables, Forni et al., 2000

The latent variables from the PCA-model converge to the corresponding generalized dynamic factor model variables as $n \to \infty$. Let $\hat{y}_{it}^n = O_{1,i}^n(z) O_1^n(z)^* y_t^n$ denote the projection of $y_{it}$ onto $O_1^n(z)^* y_t^n$, i.e. the $i$-th element of the latent variable of the corresponding PCA-model at $t$, then

\[ \lim_{n \to \infty} \hat{y}_{it}^n = \hat{y}_{it} \quad \text{for all } i \in \mathbb{N} \text{ and for all } t \in \mathbb{Z}, \tag{9} \]

where $\hat{y}_{it}$ denotes the $i$-th element of $\Lambda^n(z)\xi_t$ (for $n \geq i$), hence the corresponding "true" latent variable.

(30)

Identifiability

Concerning identifiability, the results from dynamic PCA are adopted asymptotically, i.e. asymptotically (as $n$ tends to infinity) the latent variables as well as the idiosyncratic components are identifiable, whereas the transfer function $\Lambda(z)$ and the factors $\xi_t$ are only identifiable up to regular dynamic linear transformations. If the factors are assumed to be white noise with $E\xi_t \xi_t' = I_r$, which is no further restriction, since our assumptions always allow this transformation, they are identifiable up to static rotations.

(31)

For estimation the dynamic PCA estimators are employed, thus e.g. $\hat{\hat{y}}_t^n = \left[ \hat{O}_1^n(z) \hat{O}_1^n(z)^* \right]_t y_t^n$, where $\hat{O}_1^n(e^{-i\lambda})$ denotes the matrix consisting of the first $r$ eigenvectors of a consistent estimator of $f_y^n(\lambda)$, $\hat{f}_y^n(\lambda)$ say.

The filter $\hat{O}_1^n(z) \hat{O}_1^n(z)^*$ is in general two-sided and of infinite order and has to be truncated at lag $t-1$ and lead $T-t$, as $y_t^n$ is not available for $t \leq 0$ and $t > T$. As a consequence of this truncation, convergence of the estimators of $\hat{y}_t$ and $u_t$, as $n$ and $T$ tend to infinity, can only be guaranteed for a "central" part of the observed series, whereas for fixed $t$ the estimators are never consistent.

(32)

Determination of r

So far the number of factors $r$ was considered fixed and known, whereas in practice it has to be determined from the data. The above discussion indicates that the eigenvalues of $\hat{f}_y^n$ could be used for determining the number of factors, and Forni et al. propose a heuristic rule, but no formal testing procedure has been developed yet.

(33)

One-sided model

The fact that the filters occurring in dynamic PCA are in general two-sided and thus non-causal yields infeasible forecasts. One way to overcome this problem is to assume that $\Lambda^n(z)$ is of the form

\[ \Lambda^n(z) = \sum_{j=0}^{p} \Lambda_j^n z^j \]

and that $(\xi_t)$ is of the form

\[ \xi_t = A(z)^{-1}\varepsilon_t, \quad A(z) = I - A_1 z - \dots - A_s z^s, \]

with $A_j \in \mathbb{R}^{r \times r}$, $\det A(z) \neq 0$ for $|z| \leq 1$, $s \leq p+1$, and the innovations $\varepsilon_t$ are $r$-dimensional white noise with $E\varepsilon_t \varepsilon_t' = I$ (and are orthogonal to $u_t$ at any leads and lags).

(34)

Then the model can be written in static form, at the cost of higher dimensional factors,

\[ y_t = \Lambda F_t + u_t = \hat{y}_t + u_t, \quad t \in \mathbb{Z}, \tag{10} \]

where $F_t = (\xi_t', \dots, \xi_{t-p}')'$ is the $q = r(p+1)$-dimensional vector of stacked factors and $\Lambda^n = (\Lambda_0^n, \dots, \Lambda_p^n)$ is the $(n \times q)$-dimensional static factor loading matrix. Note that $(y_t^n)$, $\Lambda^n$, $(\hat{y}_t^n)$, $(u_t^n)$ and their variances and spectra still depend on $n$, but for simplicity we will drop the superscript from now on.

Under the assumptions imposed, $F_t$ and $u_t$ remain orthogonal at any leads and lags.

(35)

For estimation, Stock and Watson propose the static PCA procedure with $q$ factors and they prove consistency for the factor estimates (i.e. the first $q$ sample principal components of $y_t$) up to premultiplication with a nonsingular matrix as $n$ and $T$ tend to infinity. In other words, the space spanned by the true GDFM-factors can be consistently estimated.

(36)

A second estimation method has been proposed by Forni et al. It differs from classical PCA in two respects: firstly in the determination of the covariance and secondly in the determination of the weighting scheme. While classical PCA is based on the sample covariance $\hat{\Sigma}_y$ of $(y_t)$, this approach commences from the estimated spectral density $\hat{f}_y$ decomposed according to the dynamic model. Then the covariance matrices $\Sigma_{\hat{y}}$ and $\Sigma_u$ of the common component and the noise respectively are estimated as

\[ \hat{\Sigma}_{\hat{y}} = \int_{-\pi}^{\pi} \hat{f}_{\hat{y}}(\lambda)\, d\lambda \tag{12} \]

and

\[ \hat{\Sigma}_u = \int_{-\pi}^{\pi} \hat{f}_u(\lambda)\, d\lambda. \tag{13} \]

(37)

In the second step, the factors are estimated as $\hat{F}_t = C' y_t$, where the $(n \times q)$-dimensional weighting matrix $C$ is determined as the first $q$ generalized eigenvectors of the matrices $(\hat{\Sigma}_{\hat{y}}, \hat{\Sigma}_u)$, i.e. $\hat{\Sigma}_{\hat{y}} C = \hat{\Sigma}_u C \Omega_1$, where $\Omega_1$ denotes the diagonal matrix containing the $q$ largest generalized eigenvalues. $C$ is then the matrix that maximizes

\[ C' \hat{\Sigma}_{\hat{y}} C \quad \text{s.t.} \quad C' \hat{\Sigma}_u C = I_q \tag{14} \]

(whereas in static PCA the corresponding weights $C$ maximize $C' \hat{\Sigma}_y C$ s.t. $C'C = I_q$).
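The generalized eigenvector step can be sketched with NumPy alone by reducing it to a standard symmetric eigenproblem via a Cholesky factor of the noise covariance. This is an illustration under assumptions: the noise covariance is taken to be positive definite, the input matrices are simulated stand-ins for the estimates in (12) and (13), and the function name `generalized_pc_weights` is hypothetical.

```python
import numpy as np

def generalized_pc_weights(Sigma_common, Sigma_noise, q):
    """First q generalized eigenvectors C with Sigma_common C = Sigma_noise C Omega_1."""
    L = np.linalg.cholesky(Sigma_noise)            # Sigma_noise = L L'
    L_inv = np.linalg.inv(L)
    M = L_inv @ Sigma_common @ L_inv.T             # equivalent standard eigenproblem
    eigval, V = np.linalg.eigh((M + M.T) / 2)      # symmetrize against rounding error
    top = np.argsort(eigval)[::-1][:q]             # indices of the q largest eigenvalues
    C = L_inv.T @ V[:, top]                        # normalized so that C' Sigma_noise C = I_q
    return C, eigval[top]

# Simulated covariance matrices, for illustration only
rng = np.random.default_rng(2)
n, q = 6, 2
Lam = rng.standard_normal((n, q))
Sigma_common = Lam @ Lam.T
Sigma_noise = np.diag(rng.uniform(0.5, 2.0, size=n))
C, Omega1 = generalized_pc_weights(Sigma_common, Sigma_noise, q)
```

If SciPy is available, `scipy.linalg.eigh(Sigma_common, Sigma_noise)` solves the same generalized problem directly; the Cholesky route is used here only to keep the sketch free of extra dependencies.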

(38)

The matrix $\Lambda$ is then estimated by projecting $y_t$ onto $C' y_t$, i.e. $\hat{\Lambda} = \hat{\Sigma}_y C (C' \hat{\Sigma}_y C)^{-1}$. The common component estimator $\hat{\hat{y}}_t$ defined this way can be shown to be consistent (for $n$ and $T$ tending to infinity). The argument for this procedure is that variables with higher noise variance or lower common component variance respectively get smaller weights and vice versa, thus the common-to-idiosyncratic variance ratio in the resulting latent variables is maximized, which could possibly improve efficiency upon static PCA.

(39)

Boivin and Ng propose a third estimation method called "weighted PCA", which is related to classical PCA as generalized least squares (GLS) is related to ordinary least squares (OLS) in the regression context. Remember that, if the regression residuals are non-spherical but their variance matrix is known, GLS weights the residuals with the inverse of the square root of their variance matrix and yields efficient estimators. Assume for a moment that the noise variance $\Sigma_u$ were known; then the same principle could be applied to PCA by transforming the least squares problem $\min E u_t' u_t$ into the generalized least squares problem $\min E u_t' \Sigma_u^{-1} u_t$.

Since the residual variance matrix of the unweighted PCA-model is singular, it cannot be used. A feasible alternative is e.g. the estimator resulting from dynamic PCA. Again the factor space is consistently estimated as $n$ and $T$ tend to infinity.

(40)

Concerning forecasting, at least two different approaches have been proposed.

First, as suggested by Forni et al., the problem of forecasting $y_{i,t}$ say can be split into forecasting the common component $\hat{y}_{i,t}$ and forecasting the idiosyncratic component $u_{i,t}$ separately. To forecast the common component, $y_{i,t+h}$ is projected onto the space spanned by the factor estimates $\hat{F}_t$.

(41)

Second, as proposed by Stock and Watson, forecasting the common component $\hat{y}_{i,t}$ and the idiosyncratic $u_{i,t}$ may be performed simultaneously. In this case $u_{i,t}$ is supposed to follow the AR(S)-process $b(z) u_{i,t} = \nu_{i,t}$ with $\nu_{i,t}$ being white noise. In the case of one-step-ahead forecasts, the prediction equation

\[ y_{i,t+1} = \Gamma F_t + u_{i,t+1} \tag{15} \]

can then be rewritten as

\[ y_{i,t+1} = \tilde{\Gamma} \tilde{F}_t + \gamma(z) y_{i,t} + \nu_{i,t+1}, \tag{16} \]

where $\gamma(z) = (1 - b(z)) z^{-1}$, and hence $y_{i,t+1}$ is estimated by a "factor augmented AR model" (FAAR).
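A regression of the form (16) can be sketched with ordinary least squares as below. This is a simplified illustration, not the procedure used in the slides: it regresses on the current factor estimates only (no lagged factors), omits an intercept, takes the factors as given, and uses noise-free simulated data; the function name `faar_forecast` is hypothetical.

```python
import numpy as np

def faar_forecast(y, F, p):
    """One-step-ahead forecast from regressing y[t+1] on F[t] and y[t], ..., y[t-p+1]."""
    y = np.asarray(y, dtype=float)
    F = np.asarray(F, dtype=float)
    T = len(y)

    def regressors(t):
        lags = y[t - p + 1:t + 1][::-1]        # y[t], ..., y[t-p+1]; empty if p == 0
        return np.concatenate((F[t], lags))

    start = max(p - 1, 0)                      # first t with a full lag history
    X = np.array([regressors(t) for t in range(start, T - 1)])
    target = y[start + 1:]                     # y[t+1] for each regression row
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(regressors(T - 1) @ coef)     # forecast of the unobserved y[T]

# Noise-free toy data: y[t+1] = F[t] @ gamma + 0.4 * y[t]
rng = np.random.default_rng(3)
T, q = 120, 2
F = rng.standard_normal((T, q))
gamma = np.array([0.5, -0.3])
y = np.zeros(T)
for t in range(T - 1):
    y[t + 1] = F[t] @ gamma + 0.4 * y[t]
forecast = faar_forecast(y, F, p=1)
```

In practice the factor estimates would come from static PCA on the full panel, and the number of factors and AR lags would be chosen by an information criterion.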

(42)

Data and sample period

Targets (i.e. time series to be forecast)

Weekly returns of 5 investable stock indices: Euro STOXX 50, Nikkei 225, S&P 500, Nasdaq and FTSE (data: Monday close prices)

Additional information

Additionally used to estimate the factor space: relative differences (weekly) of sector indices of Euro STOXX 50, Nikkei 225 and S&P 500

(43)

Figure: Target time series: absolute prices (indexed) and returns. Panels "Targets" and "Target returns" show STOXX50, NIKKEI, NASDAQ, S&P and FTSE from 4 Jan 1999 to 22 Jan 2007.

(44)

Estimation details

We calculate 1-week-ahead forecasts recursively based on a rolling 250-week estimation window.

⇒ out-of-sample forecasting period: 171 weeks (10/20/2003 - 01/22/2007)

Two benchmarks

Univariate AR(p)-models (lag selection: recursively computed AIC, BIC with $0 \leq p \leq 15$)

Univariate full ARX-models: all 54 time series are used as regressors (with lag 1), the target time series may enter

(45)

Estimation details

Generalized dynamic factor models

FAAR: simultaneous selection of $q$ (no. of static factors) and $p$ (AR-lags) by recursively computed AIC, BIC applied to the (univariate) forecasting equation.

PCA: selection of $q$ by recursively computed AIC, BIC and modified AIC, BIC (as proposed by Bai and Ng, 2002) applied to the (multivariate) factor model equation of the targets (idiosyncratic model: see AR(p)-model).

GPCA (Generalized PCA): selection of $r$ (no. of dynamic factors) based on dynamic eigenvalues; selection of $q$ and $p$: see PCA.

(46)

Figure: panels "Target returns and forecasts" and "Target (abs.) and forecasts", 11 Sep 2006 to 22 Jan 2007; series: Target, AR, ARX, PCA, FAAR, GPCA.

(47)

Results - Forecasting quality

With respect to o.o.s. $R^2$ and hit rate the GDFM-methods dominate the 2 benchmarks across different series and selection methods; in the case of forecasting root mean square error (RMSE) this does not hold true.

Average o.o.s. statistics over all target time series:

           AR      ARX     FAAR    PCA     GPCA
RMSE       0.0248  0.0302  0.0247  0.0255  0.0258
R2         0.0018  0.0034  0.0080  0.0054  0.0041
Hitrate    0.51    0.51    0.52    0.53    0.53

(48)

We will consider

One-asset long/short portfolios: according to the forecast's direction the whole capital is invested either long or short into the asset.

Equally weighted long/short portfolios: according to the forecast's direction a constant proportion of the capital is invested either long or short into each asset (hence the weights can only be 0.2 or -0.2, as we have 5 assets to be traded).
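Both strategies can be simulated with a few lines of NumPy. The sketch below makes several assumptions that the slides do not state: no transaction costs, a zero risk-free rate in the Sharpe ratio, arithmetic annualization with 52 weeks per year, and a zero weight when a forecast is exactly zero. Function names and sample data are hypothetical; the one-asset strategy is the special case of a single column.

```python
import numpy as np

def long_short_returns(forecasts, realized):
    """Weekly portfolio returns; inputs are (T x n_assets) arrays.

    Each asset gets weight +1/n_assets (long) or -1/n_assets (short)
    according to the sign of its forecast.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    realized = np.asarray(realized, dtype=float)
    n_assets = forecasts.shape[1]
    weights = np.sign(forecasts) / n_assets
    return (weights * realized).sum(axis=1)

def annualized_stats(weekly_returns, periods_per_year=52):
    """Annualized return, volatility and Sharpe ratio (risk-free rate assumed 0)."""
    ret = weekly_returns.mean() * periods_per_year
    vol = weekly_returns.std(ddof=1) * np.sqrt(periods_per_year)
    return ret, vol, ret / vol

# Hypothetical forecasts and realized returns for 2 assets over 3 weeks
forecasts = np.array([[0.01, -0.02], [-0.01, -0.01], [0.02, 0.01]])
realized = np.array([[0.02, 0.01], [0.01, -0.03], [-0.01, 0.02]])
portfolio = long_short_returns(forecasts, realized)
ret_pa, vol_pa, sharpe = annualized_stats(portfolio)
```

With five assets the weights produced by `long_short_returns` are 0.2 or -0.2, as in the equally weighted strategy described above.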

(49)

Portfolio simulations

Portfolio statistics for one-asset long/short strategy (average across series):

                 AR      ARX     FAAR    PCA     GPCA
Return p.a.      0.0383  0.0137  0.0228  0.1404  0.0665
Volatility p.a.  0.1390  0.1394  0.1391  0.1379  0.1389
Sharpe-R. p.a.   0.2814  0.0940  0.1959  1.0188  0.5291

The two GDFM-methods that split the forecasting problem into two separate problems for the latent component and the noise, PCA and GPCA, clearly outperform all other models; FAAR-forecasts are very similar to AR-forecasts but slightly worse, and ARX-forecasts (without input selection!) perform worst, not generating any significant return.

(50)

Figure: series STOXX50, NIKKEI, NASDAQ, S&P and FTSE, 13.10.2003 to 22.01.2007.

(51)

Portfolio simulations

Portfolio statistics for equally weighted long/short strategy:

                 AR      ARX     FAAR    PCA     GPCA
Return p.a.      0.0421  0.0202  0.0267  0.1450  0.1149
Volatility p.a.  0.0905  0.0869  0.0936  0.0873  0.0797
Sharpe-R. p.a.   0.4656  0.2328  0.2857  1.6614  0.9162

Once again, the PCA- and GPCA-forecasts clearly outperform all other models; FAAR-forecasts behave very similarly to AR-forecasts but slightly worse, and ARX-forecasts (without input selection!) perform worst, not generating any significant return.

(52)

Figure: series AR, ARX, PCA, FAAR and GPCA, 13.10.2003 to 22.01.2007.

(53)

Conclusions

Obviously, forecasting financial time series is a very difficult problem.

A large number of candidate inputs and model classes greatly increases the risk of overfitting.

Forecasting models for high-dimensional time series are needed.

Factor models solve, at least in part, the input selection problem as they condense the information contained in the data into a small number of factors. They allow for modelling dynamics and for forecasting high-dimensional time series.

However, in this area there is still a substantial number of unsolved problems.