
2.6 Supervised Learning for modelling Consumer Indebtedness

2.6.2 Potential of Data Mining models

As the need to develop fairly accurate quantitative prediction models becomes apparent (Atiya, 2001), the field of economics can benefit from the variety of techniques and models that Data Mining has to offer. Accurate and powerful models such as Random Forests and Neural Networks, which can handle non-linearities in the data (Refenes et al., 1994; Gromping, 2009), are strong candidates for analysing real-world data such as socio-economic data and for providing meaningful answers to the research questions of Consumer Debt Analysis. Since both models are suitable for regression as well as classification, they can handle the task of separating debtors from non-debtors, which can be framed as a classification task, and the tasks of predicting the level of debt and debt repayment, which can be framed as either classification or regression tasks. These two models, with all the advantages they possess, can therefore be used to create a clear, complete and conceptual model of consumer indebtedness, which has not yet emerged despite the many factors influencing consumer debt that have been proposed in the literature (Livingstone and Lunt, 1992). Instead, efforts so far have focused on a small subset of demographic, economic, psychological and situational factors, which limits their predictive ability and their generalisability (Stone and Maury, 2006). As defined in (Stone and Maury, 2006), the purpose of such a financial indebtedness model would be to accurately identify individuals who are at risk of developing personal financial management problems.

Both models have started to be used in the field of economics, replacing traditional statistical modelling. More specifically, Random Forests have been utilised in marketing applications such as (Ghose and Ipeirotis, 2011), where a model was constructed to measure the impact of product reviews on sales and perceived usefulness. Neural Networks, on the other hand, have been used in stock performance modelling (Refenes et al., 1994) and in credit risk assessment (Atiya, 2001), where banks need to predict the probability of default of a potential counterparty before they extend a loan. In credit risk assessment, any improvement in default prediction accuracy can not only lead to increased savings and assist in estimating a fair interest rate, but also help in accurately assessing the credit risk of bank loan portfolios. For that reason, Neural Networks appear to be replacing linear regression in default prediction (Sousa et al., 2007), as part of an ongoing trend towards greater use of Neural Networks for credit risk assessment. A characteristic example of this emerging trend is the Moody’s Public Firm and Risk model, which is now based on Neural Networks (Atiya, 2001).


2.6.2.1 Random Forests

Random Forests (Breiman, 2001) are an ensemble learning method that operates by constructing a multitude of decision trees on random samples of the training data, based on the ideas of bootstrap sampling, bagging and the random selection of features. They have gained a lot of popularity recently because of their ability to handle a large number of variables with a relatively small number of instances and because, in contrast with Neural Networks, they provide a mechanism to assess variable importance (Gromping, 2009; Segal, 2004). They allow non-linearities to be learned from the data without having to be explicitly modelled, as linear regression requires (Gromping, 2009), and manage to achieve exceptional performance (Segal, 2004). In fact, research has shown that a large number of trees can be particularly important when the interest is in diagnostic quantities such as variable importance (Gromping, 2009). Averaging over the trees is responsible for variance reduction, which is enhanced by the reduction in correlation between the predictions of the individual trees caused by the injection of randomness. Hence, Random Forests correct for decision trees’ habit of overfitting to the training set.
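The three ingredients named above (bootstrap sampling, random feature selection and aggregation of the trees' predictions) can be illustrated with a minimal sketch. It is a deliberate simplification and not the implementation used in this work: each "tree" is a single-split stump, the eight households and their two features (income in thousands, debt-to-income ratio) are invented toy data, and the names `fit_stump`, `fit_forest` and `predict` are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, features):
    """Fit a one-split tree, choosing the split with the fewest training errors."""
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
        for t in thresholds:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            l_lab = majority(left)
            r_lab = majority(right) if right else l_lab
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]  # (feature index, threshold, left label, right label)

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Bagging: every stump sees a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]        # sample with replacement
        feats = rng.sample(range(len(rows[0])), n_features)   # random feature subset
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(l_lab if row[f] <= t else r_lab for f, t, l_lab, r_lab in forest)
    return votes.most_common(1)[0][0]

# Invented toy data: (income in thousands, debt-to-income ratio); 1 = debtor.
rows = [(52, 0.10), (48, 0.15), (60, 0.05), (55, 0.20),
        (20, 0.70), (18, 0.80), (25, 0.65), (22, 0.90)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
```

A full Random Forest grows deep trees and draws a fresh feature subset at every split; with stumps there is only one split per tree, so the two notions coincide here.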

2.6.2.2 Neural Networks

The development of artificial neural networks (ANNs) arose from the attempt to simulate biological nervous systems by combining many simple computing elements (neurons) into a highly interconnected system, in the hope that complex phenomena such as “intelligence” would emerge as the result of self-organisation or learning (Sarle, 1994). Neural Networks are capable of processing vast amounts of data and making accurate predictions. In fact, they can serve as universal approximators: given enough hidden neurons and enough data, they can approximate any Borel measurable function to any desired degree of accuracy, as shown in (Hornik et al., 1989).
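To make the universal approximation property concrete, the single-hidden-layer form it refers to can be written out. The notation below is an illustrative assumption rather than the notation of the cited work: $q$ is the number of hidden neurons, $\sigma$ a squashing (sigmoid-type) activation, and $w_j$, $b_j$, $\beta_j$ the weights of the network.

```latex
f(x) \;=\; \sum_{j=1}^{q} \beta_j \,\sigma\!\left(w_j^{\top} x + b_j\right),
\qquad x \in \mathbb{R}^{d}.

% Approximation statement for a continuous target g on a compact set K:
\forall \varepsilon > 0 \;\; \exists\, q,\ \{\beta_j, w_j, b_j\}_{j=1}^{q}
\quad \text{such that} \quad
\sup_{x \in K} \bigl| g(x) - f(x) \bigr| < \varepsilon .
```

The statement guarantees that suitable weights exist for a sufficiently large $q$; it does not by itself say that a training algorithm will find them from a finite sample.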

Their strength lies in their ability to handle non-linearities in the data (Sarle, 1994; Refenes et al., 1994), to allow for extrapolation (Sousa et al., 2007), which makes them suitable for generalisation (Sousa et al., 2007; Refenes et al., 1994), and to deal with the problem of structural instability, that is, the situation in which the relationship between dependent and independent variables changes over time. In addition, a very interesting property is that the topology of the network can be fully parametrised, introducing a logical structure among the neurons that compose the network. This makes it possible to design a network that incorporates the knowledge extracted in the Behavioural Extraction phase into a Behavioural Model suitable for the purposes of our analysis. The same idea has been exploited in (Shifei et al., 2011), where factor analysis is used to define the topology of the network; although this was shown not to improve the precision of the existing neural network, it did speed up the convergence of the algorithm.
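One simple way to picture a parametrised topology is a connectivity mask that decides which inputs each hidden neuron may see. The sketch below is only an illustration under stated assumptions: the four inputs, their grouping into an "economic" and a "psychological" neuron, and all weight values are hypothetical and are not taken from the analysis in this work.

```python
import math

def forward(x, hidden_weights, mask, output_weights):
    """Forward pass of a one-hidden-layer network with masked connections.

    mask[j][i] == 0 removes the connection from input i to hidden neuron j.
    """
    hidden = []
    for w_row, m_row in zip(hidden_weights, mask):
        z = sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        hidden.append(math.tanh(z))
    y = sum(v * h for v, h in zip(output_weights, hidden))
    return hidden, y

# Hypothetical inputs: [income, savings, impulsiveness, risk_attitude]
mask = [
    [1, 1, 0, 0],  # hidden neuron 0 sees only the economic inputs
    [0, 0, 1, 1],  # hidden neuron 1 sees only the psychological inputs
]
hidden_weights = [
    [0.4, -0.3, 0.9, 0.9],   # the last two weights are switched off by the mask
    [0.7, 0.7, 0.5, -0.2],   # the first two weights are switched off by the mask
]
output_weights = [1.0, -1.0]

# Changing a psychological input leaves the economic neuron untouched.
h_a, y_a = forward([1.0, 2.0, 0.5, 0.5], hidden_weights, mask, output_weights)
h_b, y_b = forward([1.0, 2.0, 3.0, 0.5], hidden_weights, mask, output_weights)
```

Because the mask is fixed before training, each hidden neuron keeps a meaning tied to its group of inputs, which is the kind of logical structure described above.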

In contrast with Linear Regression and Random Forests, their biggest disadvantage stems from their inability to provide transparent results. The mechanisms operating inside a Neural Network are hidden from the user, and thus the networks are usually characterised as “black boxes” (Gevrey et al., 2003; Harrington and Wan, 1998). Because of this difficulty, the practical use of Neural Networks is limited in real-world applications (Harrington and Wan, 1998) and is largely precluded in the social sciences, where the interpretation of the models is very significant. However, a series of techniques has been proposed in the literature to assess the importance of input variables in Neural Networks. In (Harrington and Wan, 1998), Sensitivity Analysis identifies important variables by perturbing the input variables and measuring the impact on the outcome variable. In (Gevrey et al., 2003), seven techniques for measuring the contribution of input variables are examined and compared; of these, the Partial Derivatives method and the Profile method offer the most complete results.
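The perturbation idea behind Sensitivity Analysis can be sketched in a few lines. This is a generic illustration, not the exact procedure of the cited papers: `model` is a hypothetical stand-in for a trained network, and the step size `delta` and the three input rows are arbitrary choices.

```python
import math

def model(x):
    """Hypothetical stand-in for a trained network; it ignores x[2] on purpose."""
    return math.tanh(2.0 * x[0] - 0.5 * x[1])

def sensitivity(predict_fn, rows, delta=0.1):
    """Mean absolute change in the output when each input is shifted by delta."""
    n_inputs = len(rows[0])
    scores = []
    for i in range(n_inputs):
        total = 0.0
        for row in rows:
            bumped = list(row)
            bumped[i] += delta  # perturb one input, keep the others fixed
            total += abs(predict_fn(bumped) - predict_fn(row))
        scores.append(total / len(rows))
    return scores

rows = [(0.0, 0.0, 0.0), (0.5, 1.0, 2.0), (-0.5, 0.2, -1.0)]
scores = sensitivity(model, rows)
```

Inputs with larger scores move the prediction more; an input that the model ignores receives a score of zero. The result depends on the scale of each variable, so in practice inputs would normally be standardised before they are compared.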

2.6.2.3 Support Vector Machines

Another model that is very popular in Machine Learning and Data Mining applications is the Support Vector Machine (SVM) (Cortes and Vapnik, 1995). Support Vector Machines have gained momentum recently because of their strong mathematical background: training is a convex optimisation problem, which guarantees that the solution reached is a global optimum. They embody the Structural Risk Minimisation (SRM) principle, which tries to minimise an upper bound on the expected risk, in contrast with the majority of learning algorithms, which try to minimise the error on the training data. This grants them the ability to generalise to unseen data and avoid over-fitting, which is a valuable property in real-world applications (Dibike et al., 2000). In addition, by means of the kernel “trick”, SVMs can be extended to handle non-linearities in the data.
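The kernel “trick” rests on the fact that certain kernel functions return the inner product of two points in a higher-dimensional feature space without constructing that space. A small sketch for the homogeneous quadratic kernel on two-dimensional inputs illustrates this; the function names and the two sample points are chosen purely for illustration.

```python
import math

def poly_kernel(x, z):
    """Quadratic kernel k(x, z) = (x . z)^2, computed in the 2-D input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit 3-D feature map whose inner product equals the quadratic kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

x, z = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly_kernel(x, z)                      # never leaves the input space
via_features = dot(feature_map(x), feature_map(z))  # explicit higher-dimensional route
```

Both routes give the same value up to floating-point rounding, so a linear separator in the feature space corresponds to a non-linear one in the original inputs at the cost of a single kernel evaluation.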

However, in this work we are going to use only the first two models. This is because the parameters of Support Vector Machines are not as interpretable as those of Random Forests can be, and they do not offer the same level of flexibility as Neural Networks when building the model. In addition, as shown in the comparative review of (Meyer et al., 2003), SVMs were not able to outperform Random Forests and Neural Networks in many cases, despite yielding very good performance.