6.3 Methodology adopted for statistical selection

In many cases, it is necessary to preprocess or adjust the data after they have been collected in order

for them to become suitable for quantitative analysis. Most adjustments involve modifications of

the data to eliminate part of the information contained in the raw data. It is only after this stage that

variable selection procedures are applied to the adjusted dataset. The methodology adopted for these steps is described in the following subsections.

6.3.1 Data transformations

The first step of data modification involves transforming current-price data into constant prices using a deflator index, namely the GDP deflator (base year 1990). In many instances, accounting data are recorded in current prices and thus reflect both inflation and real growth or decline in the item concerned. This transformation allows the time series to be examined without being biased by inflation.
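The deflation step can be illustrated with a minimal sketch. It assumes the deflator is expressed as an index equal to 100 in the base year; the function name `to_constant_prices` and all figures are hypothetical and are not taken from the study's dataset.

```python
def to_constant_prices(current_values, deflators, base=100.0):
    """Convert current-price figures to constant prices of the deflator's base year.

    Assumes the deflator is an index that equals `base` in the base year.
    """
    if len(current_values) != len(deflators):
        raise ValueError("series and deflator must have the same length")
    return [value / deflator * base
            for value, deflator in zip(current_values, deflators)]


# Hypothetical figures for illustration only (first period is the base year).
current = [200.0, 220.0, 250.0]
deflator = [100.0, 104.0, 110.0]
constant = to_constant_prices(current, deflator)
```

In the base period the constant-price figure equals the current-price figure; in later periods the growth that is due purely to price increases is stripped out.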

The second step of data preprocessing entails the disaggregation of annual time series to quarterly figures. This is necessary because most macroeconomic data are reported either quarterly or annually, which poses a difficulty when several time series with inconsistent observation frequencies have to be analysed jointly. The need also arises when the observation frequency changes within the same series. For example, a time series may have been observed annually over several years until, because of its increasing importance, the reporting agency decided to observe and release quarterly figures instead. The result is a time series with annual observations in the first portion and quarterly figures in the remainder. Hence, a reasonable course of action is to disaggregate the annual time-series data to quarterly figures, rather than to aggregate quarterly series to yearly totals, which leads to a considerable loss of information. In a recent study (Chan, 1993), several prominent methods of disaggregating annual time series were described and their performances compared. Among the six methods examined, the INTER procedure was one of the two that gave satisfactory disaggregation results. This procedure is therefore chosen for the study and applied to the annual time series in the dataset. The following is a brief description of the INTER procedure.

The procedure was developed by Almon (1988) and provides a method to convert annual series to quarterly figures by interpolation. The annual time series $Y_i$ is first cumulated to $Z_j$, i.e.

$$Z_j = \sum_{i=1}^{j} Y_i$$

A cubic polynomial $p(t)$ is fitted to each successive set of four points of $Z_j$. The values of the polynomial are calculated at the ends of the quarters, and these values are differenced to give a quarterly series consistent with the yearly aggregates. For example, to disaggregate $Y_3$ into four quarterly values, $p(2)$, $p(2.25)$, $p(2.5)$, $p(2.75)$ and $p(3)$ are computed. The quarterly figures are then obtained by $p(2.25) - p(2)$, $p(2.5) - p(2.25)$, ..., $p(3) - p(2.75)$, respectively. The sum of these quarterly figures is $p(3) - p(2) = Z_3 - Z_2 = Y_3$.

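The fitting of a cubic polynomial $p(t) = a + bt + ct^2 + dt^3$ to four successive points of $Z_j$ produces the following equation in matrix format, $A\mathbf{x} = \mathbf{b}$:

$$
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & 2 & 4 & 8 \\
1 & 3 & 9 & 27 \\
1 & 4 & 16 & 64
\end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}
=
\begin{bmatrix} Z_1 \\ Z_2 \\ Z_3 \\ Z_4 \end{bmatrix}
$$

In order to solve for a, b, c and d, the inverse of matrix A is calculated as:

$$
A^{-1} =
\begin{bmatrix}
4 & -6 & 4 & -1 \\
-4.33 & 9.5 & -7 & 1.83 \\
1.5 & -4 & 3.5 & -1 \\
-0.167 & 0.5 & -0.5 & 0.167
\end{bmatrix}
$$

and the equation becomes:

$$\mathbf{x} = A^{-1}\mathbf{b}$$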

By substituting the values of $Z_j$ and the inverse matrix into the above equation, values of a, b, c and d are calculated for each successive set of four points of $Z_j$. As elaborated earlier, the quarterly figures are obtained by computing the $p(t)$ values and then differencing them.
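The calculation for a single window of four points can be sketched as follows. This is an illustration under stated assumptions rather than Almon's implementation: the inverse matrix is entered as exact fractions (which round to the decimals quoted above), only the year between t = 2 and t = 3 of one window is disaggregated, and the handling of the first and last years of a series is not covered. The helper names and the annual figures are hypothetical.

```python
from fractions import Fraction as F

# Inverse of A for a cubic p(t) = a + b*t + c*t**2 + d*t**3 fitted at t = 1, 2, 3, 4.
# Exact fractions; e.g. -13/3 is the -4.33 quoted in the text.
A_INV = [
    [F(4), F(-6), F(4), F(-1)],
    [F(-13, 3), F(19, 2), F(-7), F(11, 6)],
    [F(3, 2), F(-4), F(7, 2), F(-1)],
    [F(-1, 6), F(1, 2), F(-1, 2), F(1, 6)],
]


def cumulate(annual):
    """Return the cumulated series Z_j = Y_1 + ... + Y_j."""
    total, out = 0, []
    for y in annual:
        total += y
        out.append(total)
    return out


def cubic_coefficients(z_window):
    """Compute x = A^{-1} b, where b holds four successive Z values."""
    return [sum(m * z for m, z in zip(row, z_window)) for row in A_INV]


def p(t, coeffs):
    """Evaluate the fitted cubic at t."""
    a, b, c, d = coeffs
    return a + b * t + c * t**2 + d * t**3


def quarters_for_third_year(z_window):
    """Difference p(2), p(2.25), ..., p(3) to get the four quarters of Y_3."""
    coeffs = cubic_coefficients(z_window)
    points = [F(2) + F(k, 4) for k in range(5)]
    values = [p(t, coeffs) for t in points]
    return [float(values[k + 1] - values[k]) for k in range(4)]


annual = [100, 120, 150, 170]   # hypothetical Y_1..Y_4, illustration only
z = cumulate(annual)            # cumulated series Z_1..Z_4
quarters = quarters_for_third_year(z)
```

Because the cubic passes through the cumulated points exactly, the four quarterly figures add up to the annual figure for the third year, which is the consistency property described in the text.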

The third step of data adjustment involves the application of moving averages to eliminate seasonality and randomness from the data series. As the study uses quarterly time series, a centered four-quarter moving average is used. First, a four-quarter moving average is calculated by adding the first four figures in the time series and dividing the total by four. Second, each successive average is obtained by dropping the first figure included in the previous average and adding the next figure in the series. Because an average of four quarters falls between two time periods, an average corresponding to one of the time periods in the original series is obtained by centering, that is, by computing a two-period moving average of the four-quarter moving averages. This smoothing method effectively removes seasonal variations from the data and, at the same time, cancels the effects of irregular factors. It brings the data series to a more stable state and reveals its basic underlying pattern, which can then be examined without the interference of short-term fluctuating components. However, this transformation is only applicable to time series that were originally recorded quarterly; those derived from disaggregating annual time series are excluded, as they do not contain any quarterly seasonal factors in the first place.
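The two averaging passes can be sketched in a few lines. The function names and the sample series are hypothetical; the sketch simply drops the two periods at each end of the series for which no centered average can be formed.

```python
def moving_average(series, window):
    """Simple moving average: one value per full window, sliding one period at a time."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def centered_four_quarter_ma(series):
    """Four-quarter moving average followed by a two-period average to center it.

    The first value of the result corresponds to the third quarter of the input.
    """
    four_quarter = moving_average(series, 4)
    return moving_average(four_quarter, 2)


# Hypothetical quarterly series, illustration only.
quarterly = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
smoothed = centered_four_quarter_ma(quarterly)
```

For a series of n quarters the result has n - 4 values, aligned with quarters 3 to n - 2 of the original series.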

6.3.2 The use of variable selection procedures

For any given number of independent variables, a variable selection procedure should provide the subset whose estimated equation produces the best fit, that is, the minimum residual sum of squares or, equivalently, the maximum coefficient of determination, known as the R-square. The variable selection methods adopted in the study are the Stepwise Selection and the Forward Selection. Both methods are based on the principles of least squares regression, and they have been regarded as the most practical least squares selection procedures (Draper and Smith, 1981). Their basic steps are outlined below.

The Forward Selection method begins by finding the variable that produces the optimum one-variable subset, that is, the variable with the largest R-square. In the second step, the procedure finds the variable which, when added to the already chosen variable, results in the largest reduction in the residual sum of squares or, equivalently, the largest increase in R-square. The third step finds the variable which, when added to the two already chosen, gives the minimum residual sum of squares or largest increase in R-square. The process continues until no variable considered for addition to the model provides a reduction in the sum of squares that is statistically significant at a level specified at the start of the procedure.

The Stepwise Selection procedure begins like the Forward Selection, but after each new variable has been added to the model, the resulting equation is examined to check whether any of the previously introduced variables should be removed because it no longer contributes significantly to the R-square. The process of looking for the next best variable to include, and checking whether previously included variables should be removed, continues until certain criteria are satisfied, such as a previously specified significance level. The iterative procedure stops when no new variable contributes significantly to the R-square and no variable needs to be removed from the current model.
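Both procedures can be sketched with a single routine. This is a simplified illustration, not the statistical package used in the study. As an assumption made to keep the sketch dependent on NumPy only, the entry and exit tests are expressed as critical values of the partial F statistic (`f_enter`, `f_remove`) rather than as p-values; a smaller entry p-value corresponds to a larger critical F. The function names, thresholds and data are hypothetical.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of a least squares fit with an intercept and `cols`."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def partial_f(rss_small, rss_big, n, k_big):
    """Partial F for one variable; k_big is the variable count of the larger model.

    Assumes n > k_big + 1 and an imperfect fit (rss_big > 0).
    """
    return (rss_small - rss_big) / (rss_big / (n - k_big - 1))


def select(X, y, f_enter, f_remove=None):
    """Forward selection if f_remove is None, otherwise stepwise selection."""
    if f_remove is not None and f_remove > f_enter:
        raise ValueError("f_remove must not exceed f_enter")
    n, m = X.shape
    chosen = []
    for _ in range(4 * m):  # iteration cap as a safeguard
        changed = False
        candidates = [c for c in range(m) if c not in chosen]
        if candidates:
            base = rss(X, y, chosen)
            scores = {c: partial_f(base, rss(X, y, chosen + [c]), n, len(chosen) + 1)
                      for c in candidates}
            best = max(scores, key=scores.get)  # largest reduction in RSS
            if scores[best] > f_enter:
                chosen.append(best)
                changed = True
        if f_remove is not None and len(chosen) > 1:
            full = rss(X, y, chosen)
            scores = {c: partial_f(rss(X, y, [v for v in chosen if v != c]),
                                   full, n, len(chosen))
                      for c in chosen}
            worst = min(scores, key=scores.get)  # smallest contribution
            if scores[worst] < f_remove:
                chosen.remove(worst)
                changed = True
        if not changed:
            break
    return chosen


# Synthetic data, illustration only: y depends on columns 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=60)

# 4.0 and 2.8 are approximate critical F values for the 0.05 and 0.10
# levels with one numerator and about 60 denominator degrees of freedom.
forward = select(X, y, f_enter=4.0)
stepwise = select(X, y, f_enter=4.0, f_remove=2.8)
```

Requiring `f_remove` not to exceed `f_enter` is the F-statistic counterpart of keeping the entry level stricter than the exit level, so that a variable that has just entered is not immediately removed.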

The choice of significance levels used in variable selection procedures plays a crucial part in determining which variables are included in the subset. In short, a higher significance level, such as 95 per cent (p-value = 0.05), admits fewer variables into the subset than a lower one, such as 90 per cent (p-value = 0.10). Hence, it may be useful to specify a few different levels of significance for each method to obtain a range of possible subsets. The specification requirements of the Forward Selection method and the Stepwise Selection method differ in two main respects. First, as the latter operates by selecting as well as removing variables from its model, it requires both an entry and an exit level to be specified; the former only requires an entry level. It is possible to use different levels for entry and exit, but the entry level should always be set smaller than the exit level so that variables just entered are not immediately rejected. Second, the default entry level recommended for the Forward Selection method is 0.50, while the entry and exit levels for the Stepwise Selection are 0.10 and 0.15, respectively (Freund and Littell, 1986). In order to experiment with different levels of significance, the ones chosen for each method are given in Table 6.1.

Table 6.1 Levels of significance to be used for the Forward Selection and the Stepwise Selection

FORWARD SELECTION     STEPWISE SELECTION
Entry (0.05)          Entry (0.05); Exit (0.10)
Entry (0.10)          Entry (0.10); Exit (0.15)
Entry (0.50)

Between the two methods, the Stepwise Selection is considered the more stringent variable selection procedure, in that the subset of variables selected at every stage is optimum: variables not contributing significantly to the R-square are removed. In the case of the Forward Selection method, once a variable has been selected, it stays in the model. Hence, the Stepwise Selection procedure generally selects a smaller subset of variables than the Forward Selection method. This helps to justify the choice of these two methods for the study: the idea of choosing a stringent method and a less restrictive one is to allow the procedures to select different possible subsets of variables, ranging from the smallest to the largest. The largest subset can provide insights into the degree of consensus between theoretical deduction and statistical analysis. As these economic indicators have been theoretically identified beforehand, being statistically selected as well strengthens the evidence of their relationship with construction demand. However, for forecasting purposes, it is often not advisable to use too many explanatory variables, as overfitting may occur and give rise to poor predictive ability of the model. Although the use of more variables improves the model's fit to past data, forecasting performance on new data may actually deteriorate. Besides, a model containing a large number of independent variables is generally prone to multicollinearity. A high degree of multiple correlation among several independent variables occurs because, as more variables are included in the model, there is a greater likelihood of some of them measuring similar phenomena. Multicollinearity often results in coefficient estimates that are not statistically significant or that have incorrect signs or magnitudes (Myers, 1986).
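One way to inspect the multiple correlation among independent variables is to regress each variable on all the others and look at the resulting R-square. The sketch below is an illustrative diagnostic under that assumption, not a step reported in the study; the function name and the synthetic data are hypothetical.

```python
import numpy as np


def r_squared_on_others(X):
    """For each column of X, the R-square from regressing it on the other columns."""
    n, m = X.shape
    results = []
    for j in range(m):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        total = float(np.sum((target - target.mean()) ** 2))
        results.append(1.0 - float(resid @ resid) / total)
    return results


# Synthetic data, illustration only: x1 nearly duplicates x0, x2 is unrelated.
rng = np.random.default_rng(1)
x0 = rng.normal(size=80)
x1 = x0 + rng.normal(scale=0.1, size=80)
x2 = rng.normal(size=80)
r2 = r_squared_on_others(np.column_stack([x0, x1, x2]))
```

Values close to 1 indicate that a variable is almost fully explained by the others, which is the situation in which coefficient estimates tend to become unstable.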

In document Construction demand modeling: A systematic approach to using economic indicators and a comparative study of alternative forecasting approaches (Page 132-137)