Nonlinearity or Structural Break? - Data Mining in Evolving Financial Data Sets from a Bayesian Model Combination Perspective


Hao David Zhou

Management Department

Drexel University

Abstract

Much work in data mining has focused on quality data. In reality, data non-stationarity is prevalent and is further complicated by model uncertainty. The emphasis on organizational impact and benefit maximization in data mining urges us to develop models that managerial decision makers can easily understand. To address these application challenges, we propose a new approach that integrates Bayesian structural break models with change point detection methods for chronologically ordered observations. We apply our approach to the prediction of three exchange rates. Our approach incorporates both structural breaks and model uncertainty explicitly. It has not only a clear intuitive appeal but also a fairly firm statistical foundation. The benchmark comparison provides strong empirical evidence that our approach can match the approximating ability of neural networks in mining data with structural breaks. Furthermore, compared to neural networks, our approach provides better interpretability.

1. Introduction

In the past, much work in the theoretical and methodological exploration of data mining has focused on quality data. The input to the data mining algorithm or technique is assumed to be stationary and to contain no incorrect or anomalous observations. The likelihood functions of the approximating models between input and output are assumed to be smooth and single-modal. As we accumulate more and more data and gain access to richer classes of models, two new application challenges emerge that could severely inhibit our ability to use data mining to improve managerial decision-making: (a) how to exploit evolving data in a dynamic business environment (structural break) and (b) how to select the “best” model when numerous competing alternatives coexist (model uncertainty) so as to maximize the benefits of technology solutions.

Structural breaks and structural instability characterize many forecasting models fitted to economic and financial data and have been widely documented in the economics and finance literature (Pesaran and Timmermann [25]). Although evolving relationships between variables in a dynamic business environment are prevalent in reality, very little research explicitly takes this problem into account in knowledge discovery and data mining applications.

Sung et al [29] were the first to justify the need for different bankruptcy prediction models under different economic conditions. They classify economic conditions into normal and crisis conditions. When the prediction model built for the normal condition is applied to data from the crisis condition, its accuracy drops significantly, from 83.3 percent to 66.7 percent or worse. Sung et al [29] state that this non-robustness requires continuous remodeling as the data or the situation changes.

In data mining applications, the non-stationarity challenge is further complicated by the presence of non-linearity and outliers. Many application fields, such as financial time series prediction, are characterized by non-linearity, strong noise, weak signal and a lack of functional structure. In many situations, the likelihood functions of the approximating models between input and output are non-smooth and multi-modal. Granger and Timmermann [9] argue that because flexible modeling approaches such as neural networks search over a much larger space of models, the likelihood of overfitting the data-generating process is much higher. Their statement is further elaborated by Andrew Lo, a finance professor at MIT, who warns: "Given enough time, enough attempts, and enough imagination, almost any pattern can be teased out of any data set." (Lo [16])

Most existing approaches to mining data with structural breaks and model uncertainty neither incorporate model uncertainty explicitly nor offer good interpretability. Black-box modeling greatly restricts their application because many managerial decision makers prefer to adopt models that are easy to understand (Dhar et al [5]). Density forecasting has attracted more attention recently (Pasley et al [24]). In many business applications, a prediction interval with an accepted confidence level is preferred because one wants to know the associated risk in order to make the optimal decision.

In this paper we propose a new approach to mining data with structural breaks and model uncertainty. We apply our approach to the prediction of exchange rates, one of the most extensively studied financial variables. Our approach integrates Bayesian structural break models with change point detection methods for chronologically ordered observations. We use the change point detection methods to detect potential structural breaks. Focusing on the most recent breaks, we then use Bayesian structural break models to incorporate the model uncertainty about the structural breaks. Our research question is whether this integrated approach can match the approximating ability of neural networks in mining data with structural breaks. We argue that our approach has a clear intuitive appeal and a fairly firm statistical foundation. Our results strongly support the claim that our Bayesian structural break model integrated with change point detection methods can match the approximating ability of neural networks. Furthermore, compared to neural networks, our approach provides better interpretability, a competitive advantage in data mining applications.

The remainder of the paper is organized as follows. In section 2 we formally define the non-linearity models, structural break models and outlier models we use and review the existing approaches to mining data with structural breaks. In section 3 we describe the two methods we adopt for mining evolving data. In section 4 we report the results. We conclude and discuss implications and future work in section 5.

2. Background

2.1 Non-linearity models, structural break models and outlier models

Before we proceed, we need to clarify some definitions. In their fundamental contribution, Koop and Potter [15] define structural break models as models whose dynamics change permanently in a way that cannot be predicted from the history of the series. They define nonlinear models as models that allow the dynamics to vary over the business cycle in a predictable way. They further define outlier models as models whose apparent departures from linearity are due to unpredictable large shocks with only temporary effects. Not surprisingly, terminology is used somewhat inconsistently in the literature. For instance, Lubrano [17] classifies nonlinear models into structural break models and threshold regression models. Lubrano defines structural break models as models that combine an abrupt transition function with a time index; the break point may be unknown. He defines threshold models as models in which the abrupt transition is combined not with a time index but with a continuous variable that does not necessarily grow over time. To avoid confusion, we use the definitions of Koop and Potter [15] throughout the remainder of this paper.

Pesaran and Timmermann state that “Financial time series are likely to undergo sudden, large changes reflecting institutional changes, regime switches or breakdowns in a market mechanism as observed during financial crises” (Pesaran and Timmermann [25], page 508). “Breaks or jumps in the parameters that relate security returns to state variables could arise due to a number of factors such as major changes in market sentiments, burst or creation of speculative bubbles, regime switches in a monetary and debt management policies” (Pesaran and Timmermann [25], page 496). Timmermann and Granger [33] further argue that stable forecasting patterns are unlikely to persist for long periods of time and will self-destruct when discovered by a large number of investors.

Pesaran and Timmermann [26] illustrate that ignoring breaks can be very costly and that forecasting approaches that condition on the most recent break are likely to perform better than unconditional approaches. Anomalous or atypical observations can also greatly affect time series prediction and statistical models. Franses and Dijk [7] further point out that neglecting atypical observations has even more impact on out-of-sample forecasts in nonlinear time series than in linear time series, so considerable attention must be paid to such observations when constructing nonlinear models.

In this paper, we focus on models for financial time series that impose a regime-switching structure. Following Koop and Potter [15], we restrict our attention to models that have a clear interpretation and are plausible from an economic perspective. The posterior probabilities of the following models are well known. Because of time constraints, we investigate only models 1, 2 and 3 and their combinations to predict exchange rate changes:

1. Linear model

2. Homoscedastic structural break model with one break

3. Heteroscedastic structural break model with one break

4. Outlier models with one outlier

5. Homoscedastic nonlinear models with one threshold

6. Heteroscedastic nonlinear models with one threshold

A formal specification of the models of the financial time series prediction is described by Koop and Potter [15]. We describe these models briefly here.

$$
Y_t = \begin{cases}
\beta_{10} + \beta_{1p} X_{t-1} + \sigma_1 V_{t-1} & \text{if } I_t = 1 \\
\beta_{00} + \beta_{0p} X_{t-1} + \sigma_0 V_{t-1} & \text{if } I_t = 0
\end{cases}
$$

where $I_t$ is an indicator variable for the regime, $\beta_{i0}$ are the intercept coefficients and $\beta_{ip}$ are the slope coefficients. $V_{t-1}$ is assumed to be standard normal and independent over time.

There are four ways of defining $I_t$:

(1) Model 1 is obtained if we set $I_t = 0$ for all $t$.

(2) The structural break model with one break is obtained if we set $I_t = 1$ when $t < \tau_1$ and $I_t = 0$ when $\tau_1 < t$.

(3) The outlier model is obtained if we set $I_t \neq 0$ for only one value of $t$ and $\beta_{1p} = \beta_{0p}$, $\sigma_1 = \sigma_0$ but $\beta_{10} \neq \beta_{00}$.

(4) Consider the specification

$$
Y_t = \begin{cases}
\beta_{10} + \beta_{1p} X_{t-1|,i} + \sigma_1 V_{t-1} & \text{if } I_t = 1 \\
\beta_{00} + \beta_{0p} X_{t-1|,i} + \sigma_0 V_{t-1} & \text{if } I_t = 0
\end{cases}
$$

where $X_{t-1|,i}$ is obtained by keeping all predictors in $X_{t-1}$ except for $X_{t-1,i}$. The nonlinear model is obtained if we set $I_t = 1$ if $X_{t-1,i} > \gamma_1$ and $I_t = 0$ if $X_{t-1,i} \le \gamma_1$.
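To illustrate how the choice of indicator generates the different model classes, the following minimal Python sketch builds the break, outlier and threshold indicators and simulates a single-predictor series. The function names, the zero-based time index, the assignment of $I_t = 0$ at the break date itself and all parameter values are assumptions made for this example; they are not part of Koop and Potter's specification.

```python
import random

def break_indicator(n, tau):
    """One-break model: I_t = 1 for t < tau and 0 otherwise (assumed convention)."""
    return [1 if t < tau else 0 for t in range(n)]

def outlier_indicator(n, t0):
    """Outlier model: I_t is nonzero at exactly one date t0."""
    return [1 if t == t0 else 0 for t in range(n)]

def threshold_indicator(x_lag_i, gamma):
    """Nonlinear model: I_t = 1 when the lagged threshold variable exceeds gamma."""
    return [1 if x > gamma else 0 for x in x_lag_i]

def simulate(x_lag, indicator, beta1, beta0, sigma1, sigma0, seed=0):
    """Draw Y_t = b_i0 + b_ip * X_{t-1} + sigma_i * V with i = I_t and V ~ N(0, 1)."""
    rng = random.Random(seed)
    y = []
    for x, i in zip(x_lag, indicator):
        (b0, bp), s = (beta1, sigma1) if i == 1 else (beta0, sigma0)
        y.append(b0 + bp * x + s * rng.gauss(0.0, 1.0))
    return y

# Hypothetical example: 8 observations with a break at t = 5.
x_lag = [0.1, -0.3, 0.4, 0.0, 0.2, -0.1, 0.5, -0.2]
ind = break_indicator(len(x_lag), tau=5)
y = simulate(x_lag, ind, beta1=(1.0, 0.5), beta0=(-1.0, 0.5),
             sigma1=0.1, sigma0=0.3)
```

Setting `sigma1 == sigma0` gives the homoscedastic variant, and different values give the heteroscedastic one.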

2.2 Structural break detection methods

In the current knowledge discovery and data mining literature, the common approach to structural breaks is to use an arbitrarily sized sample of recent data to generate updated models. Because forecast bias and forecast variance together determine accuracy when the forecast variable is continuous, processing evolving data sets involves a trade-off between reducing bias and reducing variance: if we discard the older data entirely, we may obtain lower bias but higher variance in the prediction; if we use the older data without any preprocessing, we may obtain lower variance but higher bias. Using different training horizons with daily data for the S&P 500 index, Mehta and Bhattacharyya [18] report that the performance of the discovered patterns is sensitive to the training duration.

Most previous research on modeling structural breaks focuses on in-sample model fit rather than on forecasting the future (Oh and Han [21] [22]). In their fundamental contribution, Pesaran and Timmermann [25] analyze the stability of a model relating US stock returns to lagged values of the dividend yield, the short-run interest rate and the default premium by using the reversed ordered Cusum (ROC) breakpoint method. They compare the ROC method with existing unconditional approaches such as expanding or rolling windows and time-varying parameters. The ROC method appears to work well enough to consistently identify three major breaks in a forecasting model for stock returns.

In their seminal work, Csorgo and Horvath [3] enumerate three major change-point detection methods for detecting various types of changes in chronologically ordered observations: (1) the likelihood ratio test, a parametric method based on the likelihood; (2) the Pettitt test, a nonparametric approach based on a Mann-Whitney type statistic; and (3) the Chow test, for a linear model with linear model restrictions. Oh and Han [21] [22] further propose a new clustering forecasting system that integrates change-point detection with a universal approximating methodology, artificial neural networks. They adopt the above three change-point detection methods to detect a series of change points and group the successive training data into homogeneous groups based on the detected change points. In the next stage, neural networks are trained on the training inputs at time t with the categorical group at t+1 as the output, and the trained networks are applied to forecast the group of newly arriving data. In the final stage, neural networks trained on the corresponding group's data are applied to forecast the magnitude of the newly arriving data's output.

In the machine learning field, active research on tracking context drift (Schlimmer et al [30]) is closely related to our work. Widmer et al [35] present a general two-level learning model that can effectively track structural breaks by detecting clues to a structural break and using these clues to focus the learning process. Harries et al [11] argue that concept drift due to hidden changes in context complicates learning in many applications, and they present an approach that uses an existing batch learner and contextual clustering to identify hidden contexts.

Srivastava et al [32] and Weigend [34] introduce a tool called Scale-Sensitive Gated Experts (SSGE) to analyze time series with structural breaks. SSGE consists of a nonlinear gating neural network and several competing nonlinear experts (each modeled by a neural network). The gating network learns to associate inputs with particular experts, and each expert network predicts the value on the regression surface given the input. The association probability, that is, the probability of associating an input-output pair with a particular expert, is derived by using the principle of maximum entropy.

Most existing research does not consider model uncertainty explicitly. A structural break can be either a sudden large change or a slow small change, and a structural break model can be multi-modal, as we show in figure 3 and figure 4. Most existing approaches to modeling structural breaks try to find a single optimal break point and thus ignore model uncertainty completely. Most of them also do not offer good interpretability.

3. Method specification

In this paper, we adopt two methodologies: neural networks with Bayesian regularization and Bayesian structural models. We compare the advantages and disadvantages of the two methods. As Pesaran and Timmermann [25] note, forecasting with a fixed rolling window of size c may make the window too short or too long if c is not selected appropriately. We therefore determine the rolling window size in two ways. One is a constant rolling window of the 202 most recent observations. The other is a dynamic rolling window whose size is determined by the change point detection methods.
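The two window choices can be sketched as follows. This is a minimal illustration, assuming the history is a Python list ordered from oldest to newest and that the index of the most recent detected break (the hypothetical argument `last_break`) is supplied by a separate change point detection step. It simplifies the procedure to a single break.

```python
def fixed_window(history, size=202):
    """Return the most recent `size` observations (or all, if fewer exist)."""
    return history[-size:]

def break_window(history, last_break=None):
    """Return the observations from the most recent detected break onward.

    `last_break` is the zero-based index of the first post-break
    observation; when no break has been detected, the full history is used.
    """
    if last_break is None:
        return list(history)
    return history[last_break:]

history = list(range(292))           # stand-in for 292 monthly observations
fixed = fixed_window(history)        # the 202 most recent observations
dynamic = break_window(history, last_break=250)
```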

3.1 Reversed Pettitt test and Cusum test

In this paper, we adopt two change point detection methods: the reversed Pettitt test and the reversed ordered Cusum (ROC) test. As Oh and Han [22] state, these two tests are frequently offered by statistical packages and are representative of the nonparametric approach and the linear model approach, respectively. Furthermore, as Oh and Han [21] and Pettitt [27] state, the Pettitt test is preferable for forecasting chaotic series because it is robust to the anomalous or atypical observations frequently found in financial time series data. We deviate from previous research in that we use the change point detection method to detect the two most recent breaks (if more than one break is detected) or the only break (if only one break is found). We then use the detected breaks to segment the data and apply the neural network and the Bayesian structural break model to approximate the data covering the two most recent breaks (if more than one break is detected) or the whole data set (if only one break is found). Change point detection requires choosing a significance level. For instance, in figure 4, different significance levels could determine whether the first break is located around point 40, point 12 or point 1. As noted by Pesaran and Timmermann [25], it may be optimal to include pre-break data in the estimation window as well. By integrating the Bayesian structural break model with the change point detection method, we avoid this concern because the Bayesian structural break model incorporates this uncertainty explicitly.
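For concreteness, a minimal pure-Python sketch of the standard single change point Pettitt test is given below, with the usual approximate significance probability $p \approx 2\exp\{-6K^2/(T^3+T^2)\}$. Applying it to the time-reversed series is our reading of the "reversed" ordering and is an assumption, as are the function names and the default significance level. With several breaks, a single-break test need not pick out the most recent one, so this is only an illustration.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def pettitt(x):
    """Pettitt (Mann-Whitney type) test for a single change point.

    Returns (tau, k, p): tau is the number of observations before the
    most likely change, k = max_t |U_t|, and p is the approximate
    two-sided significance probability.
    """
    n = len(x)
    tau, k = 0, 0
    for t in range(1, n):
        u = sum(sign(x[i] - x[j]) for i in range(t) for j in range(t, n))
        if abs(u) > k:
            tau, k = t, abs(u)
    p = min(1.0, 2.0 * math.exp(-6.0 * k * k / (n ** 3 + n ** 2)))
    return tau, k, p

def segment_after_break(x, alpha=0.05):
    """Run the test on the reversed series and keep the data after the break."""
    tau_rev, _, p = pettitt(x[::-1])
    if p >= alpha:
        return list(x)                # no significant break: keep everything
    return x[len(x) - tau_rev:]       # the last tau_rev observations

# Hypothetical series with a level shift after 20 observations.
series = [0.1 * (i % 3) for i in range(20)] + [5 + 0.1 * (i % 3) for i in range(20)]
tau, k, p = pettitt(series)
recent = segment_after_break(series)
```

The double loop is cubic in the sample size overall, which is acceptable for a few hundred monthly observations but would need an incremental update of $U_t$ for long series.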

3.2 Artificial neural networks with Bayesian regularization

ANNs have attracted scholars from many different fields. An ANN can approximate a nonlinear (or linear) function to an arbitrary degree of accuracy through a network composed of relatively simple functions, provided that an appropriate number of hidden-layer units is selected (Hornik et al [14]). In other words, an ANN is a universal approximator. However, it is also well known that black-box modeling makes neural networks impractical when the application domain requires clear interpretation. Some exciting work extracts rules from neural networks (Baesens [1]). Nevertheless, in many situations a prediction interval with an accepted confidence level is preferred because one wants to know the associated risk in order to make the optimal investment decision. Furthermore, the number of nodes in the hidden layer, the initial values of the parameters and the training period all greatly affect the performance of a neural network, yet there is no rigorous procedure or widely accepted rule for identifying, selecting and testing the model structure of neural networks (Hastie [12]). Determining the structure of a neural network is still an art, and we need to balance the trade-off between over-fitting and under-fitting.

3.3 Bayesian structural models and Bayesian model combination

The seminal work of Bates and Granger, “The combination of forecasts” (Bates and Granger [2]), has inspired extensive work on model combination in the management science, econometrics and artificial intelligence literature.

In most examples of inference and prediction, a model M is used to describe the relationship between dependent variable(s) and independent variable(s). Draper [5] states that a model M typically includes two parts: (1) a structural description S, such as a particular link function in a generalized linear model or a particular form of heteroscedasticity, and (2) a parameter description $\theta$ whose meaning is specific to the chosen structure. In modeling structural breaks, a model can be either heteroscedastic or homoscedastic, and a structural break model can be either single-modal or multi-modal. In practice most statistical methods acknowledge parametric uncertainty about $\theta$ without acknowledging structural uncertainty about S; they search only for a single “best” choice $S^*$ according to some criterion such as R², AIC, SIC, FIC or PIC and then make inferences and predictions as if $S^*$ were known to be correct.

Draper [5] demonstrates that the Bayesian model averaging approach solves the problem of failing to assess and propagate structural uncertainty by treating the entire model M = (S, $\theta$) as a nuisance parameter and integrating over the uncertainty about S and $\theta$, as in the expression

$$
p(y \mid x, \phi) = \sum_{i=1}^{m} \int p(y \mid x, S_i, \theta_i)\, p(S_i, \theta_i \mid x)\, d\theta_i = \sum_{i=1}^{m} p(S_i \mid x)\, p(y \mid x, S_i)
$$

The first factor on the right-hand side is the posterior probability of $S_i$, which is given by:

$$
P(S_i \mid x) = \frac{P(x \mid S_i)\, P(S_i)}{\sum_{j=1}^{m} P(x \mid S_j)\, P(S_j)}
$$

where $P(S_i)$ is the decision-maker’s prior belief that $S_i$ is the correct model, and $P(x \mid S_i)$ is the marginal likelihood of the data.
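Because marginal likelihoods are often extremely small numbers, the posterior model probabilities are most safely computed on the log scale. The following minimal sketch applies the formula above with the log-sum-exp trick. The function name and the numerical inputs are hypothetical, and priors are assumed to be strictly positive.

```python
import math

def posterior_model_probs(log_marginal_likelihoods, priors=None):
    """P(S_i | x) from log P(x | S_i) and priors P(S_i) (equal by default)."""
    m = len(log_marginal_likelihoods)
    if priors is None:
        priors = [1.0 / m] * m
    log_joint = [ll + math.log(p) for ll, p in zip(log_marginal_likelihoods, priors)]
    peak = max(log_joint)                       # subtract the peak to avoid underflow
    weights = [math.exp(v - peak) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log marginal likelihoods for three candidate structures.
probs = posterior_model_probs([-64.2, -63.1, -66.0])
```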

In the knowledge discovery and data mining literature, the model uncertainty problem has long been ignored. Instead, much effort has been devoted to optimization. For instance, Hand et al [10, page 15] state that the components of data mining include model/pattern structure determination, score function judgment, and the optimization and search method. Padmanabhan and Tuzhilin [23] argue that combining optimization methods with data mining can result in more powerful analytical approaches. We agree that optimization provides an opportunity to maximize the benefits of data mining technology. However, we believe that overemphasizing optimization methods could lead to spurious conclusions, especially when model uncertainty is very high. Koop and Potter [15] note that when many nonlinear models, structural break models and outlier models can explain the same data set, likelihood functions can be non-smooth and multi-modal. Bayesian methods address this uncertainty because they not only use information from the entire parameter space but also use posterior model probabilities to combine models.

Bayesian methods can be computationally demanding. As ever faster computers have become available, interest in Bayesian methods has surged in recent years.

Suppose we have a model specification

$$
Y = \sum_{j=1}^{k} \beta_j X_j + \varepsilon
$$

Following George [8] and Holmes et al [13], we use the widely adopted Normal Inverse-Gamma (NIG) distribution as the conjugate choice of joint prior for $\beta$ and $\sigma^2$. In particular, the prior is

$$
p(\beta, \sigma^2) = p(\beta \mid \sigma^2)\, p(\sigma^2) = N(m, \sigma^2 V)\, IG(a, b)
= \frac{b^{a} (\sigma^2)^{-(a + (k/2) + 1)}}{(2\pi)^{k/2}\, |V|^{1/2}\, \Gamma(a)}
\exp\left[ -\frac{(\beta - m)' V^{-1} (\beta - m) + 2b}{2\sigma^2} \right]
$$

where $V = c\,(X'X)^{-1}$.

The marginal likelihood of linear model $M_i$, from which its posterior model probability is obtained, is

$$
p(D \mid M_i) = \frac{p(D \mid \beta_i, \sigma_i^2)\, p(\beta_i, \sigma_i^2)}{p(\beta_i, \sigma_i^2 \mid D)}
= \frac{|V_i^*|^{1/2}\, b_i^{a_i}\, \Gamma(a_i^*)}{|V_i|^{1/2}\, \pi^{n/2}\, \Gamma(a_i)\, (b_i^*)^{a_i^*}}
$$

where, dropping the model subscript $i$,

$$
m^* = (V^{-1} + X'X)^{-1} (V^{-1} m + X'Y), \qquad V^* = (V^{-1} + X'X)^{-1},
$$

$$
a^* = a + n/2, \qquad b^* = b + \left\{ m' V^{-1} m + Y'Y - (m^*)' (V^*)^{-1} m^* \right\} / 2 .
$$

Based on the posterior probability of every model and its estimates, we obtain the combined model’s predictive inference from the following formula:

$$
p(y \mid x, D) = \sum_{i=1}^{M} p(y \mid x, D, M_i)\, p(M_i \mid D)
$$

where

$$
p(M_i \mid D) = \frac{p(D \mid M_i)\, p(M_i)}{\sum_{j} p(D \mid M_j)\, p(M_j)}
$$
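Since the combined predictive density is a mixture of the individual models' predictive densities, its mean and variance follow directly from the component means, component variances and posterior model probabilities. A minimal sketch, with hypothetical names and numbers, is:

```python
def combine_forecasts(means, variances, probs):
    """Mean and variance of the mixture sum_i p(M_i | D) p(y | x, D, M_i)."""
    mean = sum(w * mu for w, mu in zip(probs, means))
    second_moment = sum(w * (v + mu * mu)
                        for w, mu, v in zip(probs, means, variances))
    return mean, second_moment - mean * mean

# Hypothetical predictive means and variances from two models.
mean, var = combine_forecasts(means=[0.02, -0.01],
                              variances=[0.01, 0.01],
                              probs=[0.75, 0.25])
```

The combined variance exceeds the weighted average of the component variances whenever the models disagree about the mean, which is how model uncertainty widens the prediction interval.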

Detailed model descriptions can be found in George [8], Holmes et al [13] and many other books on Bayesian model averaging.

As is well known, c is critical because it determines the preferred size of the model. For a detailed discussion of the importance of the value of c, we refer to Holmes et al [13] and George [8]. We set the value of c in the way proposed by George [8]. Although George [8] sets c in order to select the single best model, we believe this choice also has a clear intuitive appeal and a fairly firm statistical foundation for combining models. We assign each candidate model an equal prior probability. In other words, we adopt a fully data-driven approach.

We adopt the model specification in section 2.1 because analytical results exist for it. Conditional on knowing $I_t$, the structural break model breaks into two standard linear regression models, and analytical posterior results exist for linear regression if we adopt natural conjugate priors for each regime. To obtain posterior results that are not conditional on $I_t$, we need the marginal posterior of the parameters defining $I_t$. The details of deriving the posterior results for the structural break models and outlier models can be found in Koop and Potter [15]. Furthermore, following Koop and Potter [15], we set a discrete uniform prior over all possible break dates in the sample. We also require that at least 30 observations lie in each regime, which ensures that an adequate amount of data is available in each regime.
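The following sketch illustrates how a posterior over break dates arises under a discrete uniform prior with at least 30 observations per regime. It is a deliberately simplified stand-in, not the specification used in our experiments: each regime is an intercept-only regression with its own variance (a heteroscedastic break), the prior mean is m = 0 with V = c/n, and the values of c, a and b are arbitrary assumptions. Factors that are identical for every break date (the powers of π) are dropped because they cancel on normalization. The series must contain at least `2 * min_obs` observations.

```python
import math

def log_ml_intercept(y, c=100.0, a=1.0, b=1.0):
    """Log marginal likelihood of y_t = beta + sigma * e_t under an NIG prior
    with m = 0 and V = c / n, up to a constant that depends only on n."""
    n = len(y)
    s1 = sum(y)
    s2 = sum(v * v for v in y)
    a_star = a + n / 2.0
    b_star = b + 0.5 * (s2 - c * s1 * s1 / (n * (1.0 + c)))
    return (-0.5 * math.log(1.0 + c) + a * math.log(b)
            + math.lgamma(a_star) - math.lgamma(a)
            - a_star * math.log(b_star))

def break_date_posterior(y, min_obs=30):
    """Posterior over break dates tau (regimes y[:tau] and y[tau:])."""
    n = len(y)
    dates = list(range(min_obs, n - min_obs + 1))
    logs = [log_ml_intercept(y[:t]) + log_ml_intercept(y[t:]) for t in dates]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)
    return {t: w / total for t, w in zip(dates, weights)}

# Hypothetical series with a level shift after 40 of 80 observations.
noise = [0.1 * ((i * 7) % 5 - 2) for i in range(80)]
series = [noise[i] if i < 40 else 3.0 + noise[i] for i in range(80)]
posterior = break_date_posterior(series)
```

Plotting `posterior` against the break date gives the kind of curve whose multiple modes would signal more than one break.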

4. Application to the foreign exchange rate

prediction

4.1 Data description

In this study, we apply the proposed methodology to the prediction of foreign exchange rates, one of the most extensively studied variables in the financial economy.

Following Qi and Wu [28], we employ a simple version of the monetary model of exchange rate determination to guide the choice of forecasting variables.

$$
s_{t+h} - s_t = b_h + c_h \left( m_t - a_0 - a_1 y_t + a_2 r_t - s_t \right) + \varepsilon_{t+h}
$$

where $s_{t+h}$ and $s_t$ are the logarithms of the exchange rate (domestic currency price of one unit of foreign currency) at time t+h and time t, $b_h$ and $c_h$ are regression parameters at horizon h, and $\varepsilon_{t+h}$ is the h-period-ahead forecast error. $m_t$ and $y_t$ are, respectively, the natural logarithms of the relative money supply and relative real income between the domestic and foreign countries; $r_t$ is their interest rate differential. Further details of the model specification can be found in Qi and Wu [28]. We study two forecasting horizons (h = 6 months and 12 months) in this paper.
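As a concrete illustration of how the forecasting samples are formed, the sketch below pairs the predictors dated t with the h-period change $s_{t+h} - s_t$. The function name and the toy numbers are hypothetical, and the series are assumed to be aligned lists already in logs (or a differential, for the interest rate).

```python
def make_samples(s, m, y, r, h):
    """Pair inputs (m_t, s_t, y_t, r_t) with the target s_{t+h} - s_t."""
    inputs, targets = [], []
    for t in range(len(s) - h):
        inputs.append((m[t], s[t], y[t], r[t]))
        targets.append(s[t + h] - s[t])
    return inputs, targets

# Toy series of length 4 with horizon h = 2 (real data would use h = 6 or 12).
inputs, targets = make_samples(s=[0.0, 0.1, 0.3, 0.2],
                               m=[1.0, 2.0, 3.0, 4.0],
                               y=[5.0, 6.0, 7.0, 8.0],
                               r=[0.01, 0.02, 0.03, 0.04],
                               h=2)
```

The last h observations have no realized target yet, so the number of usable samples is the series length minus h.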

In their rigorous examination, Qi and Wu [28] provide strong empirical evidence that existing theoretical models of exchange rate determination cannot outperform a random walk in forecasting 6- and 12-month-ahead changes in exchange rates, even when neural networks that incorporate model non-linearity and model uncertainty are used.

Using neural networks and a well-known statistical resampling technique, the bootstrap, White and Racine [36] find that exchange rates do appear to contain information that is exploitable for enhanced point prediction, but that the nature of the predictive relations evolves through time. In the finance literature, there is strong empirical evidence that superior forecasts can be obtained at longer horizons by allowing coefficients to change (Schinasi and Swamy [31]) or by modeling both long memory and structural change (Morana and Beltratti [19]). In this paper, we investigate the predictability of exchange rates over short horizons by using Bayesian structural break models and change point detection methods.

Table 1. Description of variables

| Variable name | Description | Attribute |
| --- | --- | --- |
| $s_{t+h}$ | exchange rate h-period change | Output |
| $m_t$ | relative money supply between the domestic and foreign countries | Input |
| $s_t$ | domestic currency price of one unit of foreign currency | Input |
| $y_t$ | relative real income between the domestic and foreign countries | Input |
| $r_t$ | interest rate differential between the domestic and foreign countries | Input |

All data are monthly and are obtained from the IMF’s International Financial Statistics. Our sample starts in March 1973 and ends in July 1997, with 292 observations. We use the last 90 observations as the test data set. We select the exchange rates between the U.S. dollar and the Japanese yen, the Deutsche mark and the Canadian dollar. Exchange rates are end-of-month U.S. dollar prices of the foreign currencies. Following Qi and Wu [28], we measure money supply by M1 and real income by industrial production in each of the countries. We use Treasury-bill rates (line 60c) for Canada and the U.S. and call money rates (line 60b) for Germany and Japan as alternative measures of the interest rate. Table 1 shows the dependent variable and independent variables used in this study.

[Figure 1: monthly time series plot; series: Germany, Canada]

Figure 1. End-of-month U.S. dollar prices of Deutsche mark and Canadian dollar from Mar. 1973 to Dec. 1998

[Figure 2: monthly time series plot; series: Japan]

Figure 2. End-of-month U.S. dollar prices of Japanese Yen from Mar. 1973 to Dec. 1998

Figure 1 and Figure 2 show the historical trend of end-of-month U.S. dollar prices of the Canadian dollar, Deutsche mark and Japanese Yen from Mar. 1973 to Dec. 1998. As has been widely documented in the literature, exchange rates are characterized by higher volatility, larger noise and more regime switching than other financial variables. To predict exchange rates, we should explicitly take model uncertainty and model instability into account.

The parameters of the neural network are estimated by minimizing the sum of squared errors $\sum \varepsilon_{t+h}^2$. We use Bayesian regularization, a modification of the Levenberg-Marquardt training algorithm, to produce networks that generalize well and to reduce the difficulty of determining the optimum network architecture (MacKay [20]). Furthermore, we split the inputs to the neural network, using one fourth of the data as the validation set and three fourths as the training set; the two sets are picked as equally spaced points throughout the original data. It has been widely documented in the literature that the number of processing units in the hidden layer has a great impact on the forecasting performance of a neural network. To avoid misspecifying the neural network, we tried different configurations with different numbers of processing units in the hidden layer and different initial parameter values. We initially set the number of processing units in the hidden layer to a large number (10) and then kept reducing it. Our conclusions are robust to the choice of neural network architecture.
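The equally spaced split can be sketched as follows. The choice of which position within each block of four goes to the validation set (the hypothetical `offset` argument) is an assumption of this example.

```python
def spaced_split(samples, every=4, offset=3):
    """Send every `every`-th sample to the validation set, the rest to training."""
    train, valid = [], []
    for i, item in enumerate(samples):
        (valid if i % every == offset else train).append(item)
    return train, valid

train, valid = spaced_split(list(range(12)))
```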

To make the forecasts from Bayesian model combination and the neural network more efficient, we scale the inputs (independent variables) and targets (dependent variables) of both methods; that is, we normalize the inputs and targets so that they have zero mean and unit standard deviation.
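A minimal sketch of this normalization is given below. Using the population standard deviation and estimating the scaling constants on the training window only are assumptions of the example; the helper names are hypothetical.

```python
import statistics

def fit_scaler(values):
    """Return the mean and (population) standard deviation of the data."""
    return statistics.mean(values), statistics.pstdev(values)

def scale(values, mean, std):
    """Map values to zero mean and unit standard deviation."""
    return [(v - mean) / std for v in values]

def unscale(values, mean, std):
    """Map scaled forecasts back to the original units."""
    return [v * std + mean for v in values]

train = [1.0, 2.0, 3.0, 4.0]
mu, sd = fit_scaler(train)
z = scale(train, mu, sd)
```

Forecasts produced on the scaled targets are converted back with `unscale` before errors are computed in the original units.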

4.2 Forecasting performance comparison with a fixed window size

We first compare the test-set forecasting performance of the Bayesian structural models and the neural network using a fixed rolling window of the 202 most recent observations. Table 2 and table 3 show the test-set results of forecasting the U.S. dollar prices of the Japanese Yen, Canadian dollar and Deutsche mark at the 12- and 6-month horizons. Results are reported as RMSE in percent (root mean square error × 100). We list the predictions from random walk models, neural networks, linear models, structural break models with one homoscedastic or one heteroscedastic break, and combinations of these models based on their posterior probabilities.
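The reported error measure can be computed as in the sketch below. Treating the random walk benchmark as driftless, so that it predicts a zero change in the log exchange rate, is an assumption of this example, as are the function names.

```python
import math

def rmse_percent(actual, predicted):
    """Root mean square error multiplied by 100."""
    n = len(actual)
    return 100.0 * math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def random_walk_rmse_percent(actual_changes):
    """A driftless random walk predicts no change, so its error is the change itself."""
    return rmse_percent(actual_changes, [0.0] * len(actual_changes))

# Hypothetical realized h-period changes and model forecasts.
changes = [0.10, -0.10, 0.05, -0.05]
model_error = rmse_percent(changes, [0.08, -0.06, 0.05, -0.02])
benchmark_error = random_walk_rmse_percent(changes)
```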

From table 2 and table 3, we can see that after incorporating model instability, both the neural network and Bayesian model combination predict the exchange rates of the Japanese Yen and the Canadian dollar very well. Overall, the Bayesian structural break model matches the state-of-the-art methodology, the neural network, which is widely acknowledged as a universal approximator.

Some observations deserve further interpretation. For the Deutsche mark, the Bayesian structural break model does not offer a superior approach. We note that, because of time constraints, we incorporate only a very limited number of models into the Bayesian combination pool. For instance, we include neither nonlinear models nor structural break models with more than one break. Ignoring these models could greatly affect the prediction performance for the Deutsche mark.

Table 2. 12 month ahead U.S. dollar prices of Japanese Yen, Canadian dollar and Deutsche mark prediction with fixed rolling window size

Model                                  Japan     Canada    Germany
Linear model (1)                       8.1221    3.6953     8.4870
Homo structural break model (2)        6.6825    2.6954    10.4466
Hetero structural break model (3)      7.2611    2.6146    10.2623
(1)+(2)                                8.3592    2.7809    10.5831
(1)+(2)+(3)                            8.3593    2.7596    10.3214
ANN                                    7.7878    2.4475     9.3902
Random Walk                           11.44      4.47      11.13

Table 3. 6 month ahead U.S. dollar prices of Japanese Yen, Canadian dollar and Deutsche mark prediction with fixed rolling window size

In Figure 3 and Figure 4, we show the marginal posterior probability at different potential break points. As these figures show, there is strong evidence that a one-break model is probably not appropriate. A structural break model with more than one break is preferable according to the multi-modal marginal posterior probability. In other words, the fixed rolling window of the 202 most recent observations may be too long for the Bayesian structural break models. An intelligent rolling window size should be detected to improve the fit of the Bayesian structural break model to the data.
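The multi-modality argument can be made concrete with a small check that normalizes the posterior over candidate break points and lists its local maxima. This is only a sketch: the function name posterior_modes and the min_ratio threshold are hypothetical, and the input values below are made up for illustration rather than taken from Figure 3 or Figure 4.

```python
import numpy as np

def posterior_modes(log_post, min_ratio=0.5):
    """Return normalised probabilities and indices of pronounced local maxima.

    A point counts as a mode if it exceeds its left neighbour, is at least
    as large as its right neighbour, and reaches min_ratio of the global peak.
    """
    log_post = np.asarray(log_post, dtype=float)
    p = np.exp(log_post - log_post.max())  # avoids underflow for tiny values
    p /= p.sum()
    modes = []
    for i in range(len(p)):
        left = p[i - 1] if i > 0 else -np.inf
        right = p[i + 1] if i < len(p) - 1 else -np.inf
        if p[i] > left and p[i] >= right and p[i] >= min_ratio * p.max():
            modes.append(i)
    return p, modes

# Made-up unnormalised posterior values on the scale seen in the figures.
values = np.array([1e-28, 5e-28, 1e-28, 1e-28, 4e-28, 1e-28])
probs, modes = posterior_modes(np.log(values))
```

More than one returned index suggests that a single break date does not summarize the posterior well, which is the situation described above.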

Figure 3. Marginal model posterior probability conditioning on the parameters defining I_t (Model 2 fitted to the German currency, plotted at potential break points 1 to 133).

Figure 4. Marginal model posterior probability conditioning on the parameters defining I_t (Model 3 fitted to the German currency, plotted at potential break points 1 to 133).

4.3 Forecasting performance comparison with the window size detected with change detection method

In the next experiments, we compare the test-data forecasting performance of the Bayesian structural break models and the neural network using a rolling window size detected with change point detection methods. Table 4 and Table 5 show the test-set results for forecasting the U.S. dollar prices of the Japanese Yen, Canadian dollar and Deutsche mark at the 12-month and 6-month horizons. The results are reported as RMSE in percent (root mean square error * 100). Using the window size detected with the change point detection method greatly improves the forecasting performance for the Deutsche mark. Our Bayesian structural break model integrated with the change point detection method could match the approximating ability of the neural network.

Furthermore, if we adopt only the Bayesian structural break models, they outperform the neural network in all the forecasts.

Figure 5 shows the rolling window sizes detected with the reversed ROC test in forecasting. The change point detection finds strong evidence of a structural break in the Deutsche mark in the 1980s.
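How a detected change point translates into a rolling window size can be sketched as follows. The sketch does not implement the ROC test itself; it assumes some detection method has already returned the index of the first observation after the most recent break. The function name window_from_break and the bounds min_size=24 and max_size=202 are hypothetical choices for illustration.

```python
def window_from_break(n_obs, last_break, min_size=24, max_size=202):
    """Number of most recent observations to use for estimation.

    n_obs: observations available at the forecast origin.
    last_break: index of the first observation after the most recent
        detected break, or None if no break was detected.
    """
    size = n_obs if last_break is None else n_obs - last_break
    size = min(size, max_size)  # never longer than the fixed window
    size = max(size, min_size)  # keep enough data to estimate the model
    return min(size, n_obs)     # cannot exceed what is available

# With 300 observations and a break detected at index 250,
# only the 50 post-break observations are used.
size = window_from_break(300, 250)
```

The lower bound reflects a design trade-off: a window that starts right after a very recent break uses only post-break data but may leave too few observations to estimate the model reliably.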

Table 4. 12 month ahead U.S. dollar prices of Japanese Yen, Canadian dollar and Deutsche mark prediction with detected window size

Table 5. 6 month ahead U.S. dollar prices of Japanese Yen, Canadian dollar and Deutsche mark prediction with detected window size

Japan Canada Germany

Figure 5. ROC test detected window sizes of 6 month ahead U.S. dollar prices prediction of Japanese Yen, Canadian dollar and Deutsche mark (horizontal axis: forecasting points; vertical axis: window sizes, 0 to 250; one series per currency).


5. Conclusions

Data mining application challenges such as non-stationarity and model uncertainty are prevalent in reality. In this paper we propose an innovative approach integrating the Bayesian structural break model with change point detection methods. Our empirical results provide strong support that the proposed approach could match the universal approximating methodology, artificial neural networks, when there is a structural break in the evolving data. The approach's superior performance is due to its capability to incorporate both model uncertainty (when multiple models and break points compete to explain the same data set) and model instability (when there are structural breaks in the relationship between the variables studied). As we gain access to more data and more powerful computer technology, model instability and model uncertainty could become critical issues for the success of data mining applications. Our proposed approach has not only a clear intuitive appeal but also a fairly firm statistical foundation for addressing these challenges.

In this study, we incorporate only a very limited number of model specifications. We expect to benefit more fully from Bayesian model combination if we incorporate more model specifications, such as the outlier models and nonlinear models specified in Section 2.1. We believe Bayesian model combination offers a very competitive methodology for explicitly incorporating model uncertainty and model instability.

In related work, we compare the performance of Bayesian model averaging, neural networks with Bayesian regularization, decision trees and support vector machines on several financial prediction problems. Our results show that Bayesian model combination is very competitive compared to those state-of-the-art nonlinear approximating methodologies. The Bayesian method penalizes larger models. As we have seen, although model 2 and model 3 (the models with a structural break) have better model fit, the linear model attains a higher posterior probability because of its parsimonious size. How many models are too many in a Bayesian model combination is still an open question. However, this only makes our results more conservative.
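The penalty on model size can be illustrated with the BIC approximation to the log marginal likelihood, log p(y | M) ~ max log-likelihood - (k / 2) log n. This is a stand-in chosen for simplicity, not the marginal likelihood computed in this study, and the log-likelihoods and parameter counts below are hypothetical numbers, not results from the paper.

```python
import math

def bic_log_evidence(max_log_lik, n_params, n_obs):
    """BIC-style approximation to the log marginal likelihood."""
    return max_log_lik - 0.5 * n_params * math.log(n_obs)

def posterior_probs(log_evidences):
    """Posterior model probabilities under equal prior model probabilities."""
    peak = max(log_evidences)
    weights = [math.exp(v - peak) for v in log_evidences]
    total = sum(weights)
    return [w / total for w in weights]

n_obs = 202  # window length used in Section 4.2
# Hypothetical fits: the break model fits better but has more parameters.
linear = bic_log_evidence(max_log_lik=-100.0, n_params=3, n_obs=n_obs)
one_break = bic_log_evidence(max_log_lik=-98.0, n_params=7, n_obs=n_obs)
probs = posterior_probs([linear, one_break])
```

With these made-up inputs the gain in fit of 2 log-likelihood units is outweighed by the penalty for four extra parameters, so the linear model receives the larger posterior probability.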

Many data mining application fields, such as financial time series prediction, are domains characterized by non-stationarity, strong noise, weak signal and lack of functional structure. The emphasis on organizational impact and benefit maximization of data mining urges us to develop models that can be understood by managerial decision makers. Compared to traditional black-box modeling, Bayesian structural break models integrated with change point detection methods could offer better interpretability. This is a competitive advantage because managerial decision makers prefer to adopt models that are easy to understand and that associate return with risk.

We plan to conduct a more rigorous examination of the performance of our proposed approach in the future. For instance, in addition to standard statistical error measures, a trading simulation would measure the benefits of the technology more accurately. Future work also lies in applying the Bayesian structural break model integrated with change point detection methods to other fields. For instance, intrusion detection, analysis of changes in customer purchasing behavior, health science and quality control could all be promising fields for our proposed approach.
