3.2 Data Mining Methods in this Study

3.2.1 DM History

Data mining (DM) usually refers to the application of statistical, data-analytical and machine learning methods to large data sets. Witten and Frank (2005, p. 8) define it concisely as a process of discovering useful patterns, automatically or semi-automatically, in large quantities of data.

DM is sometimes claimed to be a new discipline, but in fact the term already appears in scientific journals in the early 1980s. Moreover, many of its core techniques are inherited from much older periods.

The search for utilizable patterns in data has continued for centuries, with methods like correspondence analysis, discriminant analysis and logistic regression, or Bayes' theorem, which originates from the 1700s. The term has also been used in a negative sense, for exhausting the data by testing a multitude of variables without any a priori hypothesis of correlation or causality until some combination fits the data. Foster et al. (1997) provide a theoretical analysis of such data exhausting in the context of return predictability. They state that out-of-sample tests are typically believed to provide protection against this, as long as the test observations are not used in the model estimation. Apart from these concerns, the use of DM techniques to discover completely new information and dimensions in the data should be a completely acceptable practice in science.

Traditionally these methods have been applied in, e.g., credit scoring in banking or customer scoring in database marketing when planning campaign offers. As an example of more contemporary uses, Tufféry (2011, p. 2) mentions the pharmaceutical industry, which uses data mining to screen the effects of chemicals and molecules on different diseases, e.g. cancer. Researchers may not know the exact mechanism of effect beforehand, but they get valuable ideas about where to start developing medicines.

Recently, automated inference with model and variable selection algorithms has aroused great enthusiasm in empirical econometrics. Phillips (2005) discusses automated discovery in science and claims that advances in computing power, electronic communication and data collection processes have changed the empirical economics profession, elevated its status and opened new possibilities. In particular, he emphasizes the ability to build econometric models in an automated way according to an algorithm of decision rules. Thousands of regressions and model evaluations may be performed in seconds, statistical inference may be automated according to the properties of the data, and policy decisions can be made and adjusted in real time with the arrival of new data. Empirical modelers are widely adopting modern computing power and tailored software to search systematically for models with superior performance. Phillips also points out the important challenge that researchers face in incorporating economic thinking and methods into the automated model and variable selection process.

Pesaran and Timmermann (2005) employ automated methods in real-time econometrics for the use of businesses, governments, central banks and traders in financial markets, whose focus is on making decisions in real time and who hence have an urgent need to develop robust interactive systems that use econometric models to guide those decisions. They also suggest procedures to mitigate the differences in statistical inference with the traditional approach. Campos et al. (2005) argue in favor of this new empirical research paradigm over the conventional one. In the quotation below they say, in short, that for practical work in econometrics the use of data-driven methods is essential in today's world. The same idea applies to many other areas, and data mining has thus become a mainstream discipline.

“The economy is a complicated, dynamic, nonlinear, simultaneous, high-dimensional, and evolving entity; social systems alter over time; laws change; and technological innovations occur. Thus, the target is not only a moving one; it behaves in a distinctly non-stationary manner, both evolving over time and being subject to sudden and unanticipated shifts.

Economic theories are highly abstract and simplified; and they also change over time, with conflicting rival explanations sometimes coexisting. The data evidence is tarnished: economic magnitudes are inaccurately measured and subject to substantive revisions, and many important variables are not even observable. The “conventional” approach insists on a complete theoretical model of the phenomena of interest prior to data analysis, leaving the empirical evidence as little more than quantitative clothing. Unfortunately, the complexity and non-stationarity of economies makes it improbable that anyone–however brilliant–could deduce a priori the multitude of quantitative equations characterizing the behavior of millions of disparate and competing agents. Without a radical change in the discipline’s methodology, empirical progress seems doomed to remain slow.”

DM methods can be subdivided into two distinct classes, predictive and descriptive, which are also called supervised and unsupervised respectively. In unsupervised methods no target variable is identified for the DM algorithm; instead, patterns and structures are searched for among all the variables. These methods, such as clustering and principal components analysis (PCA), are often used for dimension reduction: the data is described in new dimensions. In supervised methods, on the other hand, (1) there is a particular target variable, and (2) the algorithm is trained with the data to adjust the model parameters for the best predictive properties. The aim is either to predict the level of the target or to classify it into some predefined category. The most important supervised methods are decision trees and neural networks, but classical models like logistic and linear regression also belong here.

The division is illustrated in Figure 6 with just a couple of examples from the large pool of methods.

Figure 6: DM Methods

[Tree diagram: DM methods divide into PREDICTIVE (classification, regression) and DESCRIPTIVE (clustering, factor analysis).]
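The difference between the two classes can be made concrete with a small sketch. It is only an illustration under stated assumptions: the six values, the labels "fall" and "rise", and the function names `two_means`, `fit_class_means` and `predict` are hypothetical and not taken from the study. The descriptive routine never sees a target variable, while the predictive one is trained on one.

```python
from statistics import mean

def two_means(values, iterations=10):
    """Descriptive (unsupervised): split values into two clusters; no target is used.
    Assumes the values are not all identical."""
    c_lo, c_hi = min(values), max(values)
    for _ in range(iterations):
        lo = [v for v in values if abs(v - c_lo) <= abs(v - c_hi)]
        hi = [v for v in values if abs(v - c_lo) > abs(v - c_hi)]
        c_lo, c_hi = mean(lo), mean(hi)
    return c_lo, c_hi

def fit_class_means(values, labels):
    """Predictive (supervised): learn one mean per known class of the target."""
    return {c: mean(v for v, lab in zip(values, labels) if lab == c)
            for c in set(labels)}

def predict(class_means, value):
    """Classify a new value into the class whose learned mean is closest."""
    return min(class_means, key=lambda c: abs(value - class_means[c]))

# Hypothetical data: one explanatory variable and a known target class.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["fall", "fall", "fall", "rise", "rise", "rise"]

centers = two_means(values)                  # structure found without labels
model = fit_class_means(values, labels)      # parameters trained on the target
print(centers, predict(model, 4.0), predict(model, 1.5))
```

Both routines end up with the same two group centers here, but only the supervised one can attach a meaning ("fall" or "rise") to a new observation.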

3.2.2 DM Process

There exist a multitude of different specifications of the DM process. In the common case the starting point is some electronic data storage: a database, data mart or data warehouse, depending on the scope of the task. The version presented here is adapted to the task at hand. Figure 7 presents a quite generic operation flowchart. The link from the end back to the beginning is important because, as new data accumulates, new questions may arise and the process may restart with new or refined objectives. In practical analysis there could be a link from every phase box back to the Problem Definition phase, since this is also a learning process: as knowledge of the data accumulates, the researcher also acquires a better understanding of the most effective ways of working with it.

Figure 7: DM Process (Oracle® Data Mining Concepts)

Giudici and Figini (2009, pp. 1-4) list the tasks related to each process phase. The following list is an adaptation from their book with comments relevant to this particular project.

1. Definition of objectives

 It is important to crystallize the goals of the DM project since this, to a large extent, determines the methods that are applicable. It is not always easy to define the analyzed phenomenon statistically, so this phase is one of the most difficult ones.

 In this study the primary objective was to find a method for filtering out the unsuccessful investments. The secondary goal was to separate the highest-returning stocks from the moderate risers. An intuitive functional form with testable coefficients was also considered important. This last criterion makes multiple or ordinal logistic regression preferable to tree-based classification algorithms and neural networks, but all of them should still be tested for performance on the primary and secondary objectives. A probit model could be an alternative to the logit, but it requires more normally distributed data.
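For reference, the "intuitive functional form with testable coefficients" can be written out for the simplest case. The following is the standard binary logit, given only as a sketch: the symbols (y for the outcome, x_1, ..., x_k for the explanatory variables, \beta_0, ..., \beta_k for the coefficients) are notation assumed for this illustration and do not come from the study itself.

```latex
\ln\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}
  = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
```

Each coefficient \beta_j is the change in the log-odds per unit change in x_j, which is why it can be interpreted and tested individually. The multinomial and ordinal variants extend the same form to more than two classes.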

2. Selection, organization and pre-treatment of the data (data cleaning)

 After the analysis objectives are clear, it is time to identify the available data sources and collect or select the variables for the initial data matrix. There are usually both internal and external sources, openly accessible as well as proprietary, that contain relevant information for the analysis.

 The data needs to be quality controlled before the analysis: some variables may not be suitable, or may have missing or unreliable data. When part of the data for some variable is missing, the analyst needs to study and model the missingness carefully. The distribution of the cases with missing data in terms of the other variables is of interest, and it should be as random as possible. Based on this examination the analyst decides either to delete the variable or to choose an imputation (patching) method for the missing values. Otherwise the results may be biased.

 The internal data source in this study was the information produced by the investment simulation itself, since variables like the stock’s rankings in Earnings Yield and Capital Return, its combined score and an indicator of its previous selections were obviously of interest. As outside sources, several databases of company and macro-economic data were utilized. After the data matrix was acquired, SPSS Statistics® was used to handle the missing data.
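The missing-data check described in this phase can be sketched in a few lines. This is a minimal illustration and not the SPSS procedure used in the study: the rows, the variable names `earnings_yield` and `market_cap`, and the helper names are all hypothetical, and mean imputation stands in for whichever patching method the analyst would actually choose.

```python
from statistics import mean

def missingness_gap(rows, var, other):
    """Difference in the mean of `other` between rows where `var` is missing
    and rows where it is present. A large gap hints that the missingness
    is not random with respect to `other`."""
    missing = [r[other] for r in rows if r[var] is None]
    present = [r[other] for r in rows if r[var] is not None]
    return mean(missing) - mean(present)

def impute_mean(rows, var):
    """Return copies of the rows with missing `var` replaced by the observed mean."""
    fill = mean(r[var] for r in rows if r[var] is not None)
    return [{**r, var: fill if r[var] is None else r[var]} for r in rows]

# Hypothetical data matrix; None marks a missing value.
rows = [
    {"earnings_yield": 0.08, "market_cap": 120.0},
    {"earnings_yield": None, "market_cap": 15.0},
    {"earnings_yield": 0.12, "market_cap": 300.0},
    {"earnings_yield": None, "market_cap": 25.0},
    {"earnings_yield": 0.10, "market_cap": 180.0},
]

gap = missingness_gap(rows, "earnings_yield", "market_cap")
filled = impute_mean(rows, "earnings_yield")
print(gap)  # -180.0: in this made-up sample the missing cases are the small companies
```

In this made-up sample the gap is large, so simply patching with the overall mean would assign small companies the earnings yield of large ones; that is the kind of bias the examination is meant to reveal before an imputation method is chosen.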

3. Data screening, transformation and exploratory analysis

 In this phase the data is screened visually to establish the distribution of the individual variables and to assess the need for transformations. For many analysis methods it is valuable that the data is close to normally distributed, and thus e.g. a log transformation often improves the quality. Anomalous data points that differ from the rest are also detected here. Outliers can have a significant influence on the analysis outcome, so it is important to consider carefully whether a point is erroneous or instead contains valuable information. The data also needs to be screened for multivariate outliers (unusual combinations of variable values), which are not detected by visual inspection but by calculating Mahalanobis distances, i.e. each multivariate data point’s relative distance from the common midpoint. In this phase a need for additional data may be noticed, and the observations made here often also influence the method choices in the next phase.

 Initial variable choices are made based on their power of influence on the dependent variable. Histograms, box plots and scatter plots are important visual aids in data screening, as are pairwise and multivariate charts and tables.

 Most statistical packages have automatic tools for the tasks in this phase, e.g. to screen and rank the variables by their influence on the dependent variable or to handle outliers. They may help in large projects but should be used with caution. The automatic outlier handling routine moves observations that lie beyond a preset number of standard deviations from the center back to that border. If this is done routinely without consideration, it could easily destroy valuable information. A variable’s influence also often varies with the method and with the combination of other explanatory variables.
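The two outlier treatments mentioned in this phase can be sketched as follows. The code is an illustration under stated assumptions rather than the routine of any particular statistical package: the function names, the data and the choice of two standard deviations are hypothetical, and the Mahalanobis function is written out for two variables only so that the inverse of the covariance matrix can be given in closed form.

```python
import math
from statistics import mean, stdev

def clip_to_border(values, k=3.0):
    """Move observations lying more than k standard deviations from the mean
    back to that border (the automatic routine described above)."""
    m, s = mean(values), stdev(values)
    lo, hi = m - k * s, m + k * s
    return [min(max(v, lo), hi) for v in values]

def mahalanobis_2d(xs, ys):
    """Mahalanobis distance of each (x, y) point from the sample centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^-1 [dx dy]^T, with the 2x2 inverse written out
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

# Hypothetical data: the last point (2, 7) is ordinary in x and in y separately,
# but unusual as a combination because the variables otherwise move together.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 2]
ys = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 6.9, 8.0, 7.0]
dist = mahalanobis_2d(xs, ys)
print(dist.index(max(dist)))  # 8: the combination outlier has the largest distance

returns = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
clipped = clip_to_border(returns, k=2.0)
print(clipped[-1])  # the value 50 has been pulled back to the upper border
```

The clipping example also shows why caution is needed: if the observation of 50 was a genuine extreme return, the routine has silently replaced it with a smaller value.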

4. Specification of analysis methods and techniques

 There are numerous statistical methods available, and the choice depends on the information from phases 1 and 3. The DM process depends on the project: its goals and its data. The goals determine whether the purpose of the analysis is to describe the data or to predict. This knowledge guides the selection of the relevant analysis method from one of the main groups in Figure 6. After screening the data, the measurement scales (e.g. ratio, categorical, ordinal) and distributions (e.g. continuous, discrete, multinomial, binomial, categorical) of the variables are known, which aids the analyst in the final model selection, as some methods have requirements concerning the quality of the data inputs and perform differently depending on it. It is customary in DM to model the problem with several models using different algorithms and then to test and rank those models to end up with the best solution.

 Model and variable selection is an iterative process. Different methods require distinct sets of variables for optimal performance. Some algorithms may also work better with certain types of data transformations. Here automated modeling tools may assist the process by quickly testing the model with a number of variables and variable transformations in order to find the most suitable combinations.

 It is important that, when testing the models, the test data comes from a hold-out sample, i.e. a partition that has not been involved in the model specification. This avoids data snooping: eventually, after a large number of tries, arriving at a combination of variables that appears to work well but in reality is not capable of predicting the dependent variable outside the fitting sample.
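The data snooping risk and the protection offered by a hold-out sample can be demonstrated with a small simulation. It is a sketch on purely synthetic data: the target and all candidate signals are coin flips, so by construction there is nothing to predict, and the function names, sample sizes and seed are assumptions made for this illustration.

```python
import random

def agreement(signal, target):
    """Share of observations where the signal equals the target."""
    return sum(s == t for s, t in zip(signal, target)) / len(target)

def snooping_demo(n_fit=40, n_holdout=1000, n_candidates=200, seed=1):
    """Pick the best of many random signals on the fitting sample,
    then score that same signal on observations it was not selected on."""
    rng = random.Random(seed)
    n = n_fit + n_holdout

    def coin_flips():
        return [rng.random() < 0.5 for _ in range(n)]

    target = coin_flips()                                # pure noise
    candidates = [coin_flips() for _ in range(n_candidates)]

    # Data snooping: keep whichever candidate fits the fitting sample best.
    best = max(candidates, key=lambda c: agreement(c[:n_fit], target[:n_fit]))

    in_sample = agreement(best[:n_fit], target[:n_fit])
    hold_out = agreement(best[n_fit:], target[n_fit:])
    return in_sample, hold_out

in_sample, hold_out = snooping_demo()
print(f"fitting sample: {in_sample:.2f}, hold-out sample: {hold_out:.2f}")
```

With enough tries, some random signal will match the small fitting sample well, while on the hold-out sample the same signal should fall back to roughly the 50 percent expected from chance.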

5. Evaluation and choice of the final model

 The model candidates are tested and ranked according to their performance against some predefined criteria. The desired feature can be e.g. the model’s ability to predict the correct classes as a total percentage, or a score that weights right and wrong predictions differently between classes. The target could be to find the model that best separates rising stocks from falling ones, and hence the weight of that ability in the score would be heavier than that of the ability to separate very high gainers from moderate ones.

 When the final model has been determined and found to fulfill the analysis requirements, it is deployed to a production environment, which in the case of this study could be a server with on-line access to a data feed (e.g. Bloomberg, Reuters) and an order placement facility for the relevant stock exchanges. In this kind of setting an automated trading program could easily be created. At least SPSS Modeler® and some other DM software have out-of-the-box functionality that supports this deployment scenario. Otherwise some programming language can be used to tailor the functionality suitably. After all, it is the model’s ability to produce some real, monetary output that defines its usefulness.
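The difference between ranking models by total percentage correct and ranking them by a weighted score can be sketched as follows. Everything in the example is hypothetical: the three classes, the cost values in `COST` and the predictions of the two models are made up for illustration and are not results from the study.

```python
def accuracy(actual, predicted):
    """Share of observations classified correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def weighted_cost(actual, predicted, cost):
    """Average misclassification cost; pairs not listed in `cost` cost nothing."""
    return sum(cost.get((a, p), 0.0) for a, p in zip(actual, predicted)) / len(actual)

# Assumed weights, keyed by (actual class, predicted class): mistaking a falling
# stock for a riser is penalized most, confusing moderate and high gainers least.
COST = {
    ("fall", "moderate"): 3.0, ("fall", "high"): 3.0,
    ("moderate", "fall"): 1.0, ("high", "fall"): 1.0,
    ("moderate", "high"): 0.5, ("high", "moderate"): 0.5,
}

actual  = ["fall", "fall", "moderate", "moderate", "high", "high"]
model_a = ["fall", "fall", "high", "moderate", "moderate", "high"]
model_b = ["moderate", "fall", "moderate", "moderate", "high", "high"]

for name, pred in (("A", model_a), ("B", model_b)):
    print(name, round(accuracy(actual, pred), 3),
          round(weighted_cost(actual, pred, COST), 3))
```

Model B has the higher total percentage correct, but its single error is a falling stock predicted to rise, so under these assumed weights model A would be ranked first.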

In document Value investing with rule-based stock selection and data mining (Page 64-73)