Results from the Predictive Modeling of the Stock Return

75 3.4.2 Missing Data

4.2 Results from the Predictive Modeling of the Stock Return

4.2.1 Determining the Final Model

In the analysis stage 2, the objective was to find a predictive model by, applying DM methods, that can separate ex ante winners from losers in context of the companies that are selected with GF. For this task the holding period return of the companies from the Nordic GF simulation study was coded as 3-class nominal category variable with: 1=StrongProfit (HPR-%≥30, 2=Profit ( 0≤ HPR-%<30) and 3=Loss. From the initial data matrix a group of influential variables was first extracted. Then a scan for 2-way interactions between these variables was conducted. And those interactions that seemed

promising as terms in the predictive model were then examined carefully with and without the main effects (constituent variables) in the model. When the main effects were also significant and included in the model, the interpretations of the interaction term was to indicate variable’s varying strength of effect for the response variable conditional on the level of the other constituent variable. If the interaction term was significant but the main effects were not, the term needs to make sense as an independent entity as these terms are no longer interpreted as interactions between variables but new variables. A good example is GNP as a product of price and quantity components in the economy.

Nobody thinks that the constituents should be included in the model if GNP is significant predictor.

With this data similar thing was with variables like P/E and P/B. They lacked significant influence as independent single variable but were important elements in interaction variables that proxy some more complicated latent company attributes. In the final phase a group of 14 highly influential terms (7 independent variables and 7 interactions) was separated. These terms were then tested with automated tools to determine the final model that would best meet the predetermined criteria: (1) Separation performance should be high especially for the Loss- category, (2) All model terms, especially the interaction variables, should have intuitive meaning and (3) The model should be as parsimonious as possible (include fewer terms). The final model consists of 8 terms and a constant (2 single variables, 6 interaction variables and a constant).

The rate of significant interaction variables to single variables is not surprising; in fact, it was

anticipated as the modeling task is far from trivial. The interaction variables represent some underlying factors that can only be measured via a proxy. Screening the 2-way interactions of influential variables and assessing their significance both with statistical methods and intuitively, can discover this new information, and put it to practical use.

The two functions in the model were trained to separate the lower (Loss-category) and upper tail StrongProfit-category) in the HPR-% distribution. The hypothesis was that since the tails seem to be long and distinctive from the center, there would be a distinctive set of attributes that could be ex ante predicted with investment time information. The model indeed succeeded very well in this task. The data was divided into three partitions to avoid overtraining, and to ensure that the model performs well with out of sample data. The partitions were training (40%), testing (40%) and validation (20%). With this kind of partitioning the training sample is used for training, testing is used for testing but also refinement of the model and the validation is then used to get the idea how the model performs with out of sample data. The final model was able to predict over all correct 88%, 83% and 87% of the

observations in the partitions respectively, and in the validation partition it got correct 90% of the StrongProfit classifications and 95% of the Loss classifications.

4.2.2 Model Fit Statistics

Commonly the first performance indicator to view for a classification model is the overall prediction accuracy table. The correct and wrong predictions of the model are presented by partition in Table 7.

In this case, as the predictions for the tail categories are weighed more, the best model fit criteria is the matrix that shows the correct and incorrect predictions in each category of the response variable grouped by the partition. This so called coincidence matrix can be used for detecting systematic errors in predictions. The coincidence matrix for this model is presented in Table 8. In the table the column values are defined by the predicted values, and the rows display the actual observed values. The value is the number of records in each pattern of predicted-actual pair.

The numbers in the overall prediction performance table are already very good but the real value of the model is revealed in Table 8 as the validation partition classification performance verifies that the model is capable of functioning according its purpose as it can separate the tail observations. If the model predicts that the stock will be in StrongProfit-category, it is correct 90% of the time. Further, when the stock is predicted to be in Loss-category, the model accuracy is nearly 95%. On the other hand if the prediction is in the center category, Profit-category, the correct categorization percent drops to 78% in validation participation.

An indicator of the model consistency is also that there are no cross-tail classification errors in the validation partition i.e. stocks that actually belonged to Loss-category would have been predicted to be StrongProfit or vice versa. Only in the training and testing partitions is one such misclassification but none in the final phase. The consequences for even one notch misclassifications in applications for Loss-category could be more severe than for StrongProfit-category since in aggressive strategy they would be sold short. On the other hand, a long investment in a stock predicted to be a high-rising and realizing only moderately positive (0-30%), is not a disaster.

Table 7: Overall Prediction Performance

Partition Training Testing Validation

Correct 77 87.5% 66 82.5% 54 87.1%

Wrong 11 12.5% 14 17.5% 8 12.9%

Total 88 80 62

In document Value investing with rule-based stock selection and data mining (Page 97-100)