2.4 Credit Scoring Techniques

2.4.1 Statistical Techniques

As previously stated, statistical techniques have been widely used in the field of credit scoring. They are still being used by most financial institutions for building scorecards because they are simple and yet robust. Two of the most commonly used methods are discriminant analysis and logistic regression. They are appropriate techniques when the dependent attribute is a categorical attribute, as is the case for credit scoring where the dependent attribute consists of two groups or classifications, ‘good’ or ‘bad’.

2.4.1.1 Discriminant Analysis

Discriminant analysis is usually used to classify observations into two or more mutually exclusive groups by using the information provided by a set of predictor attributes. When only two groups are involved, this technique is referred to as two-group discriminant analysis; however, when there are three or more groups, it is called multiple discriminant analysis (Hair, Black, Babin, Anderson & Tatham 2006). It should be noted that while only two-group discriminant analysis is used in this research, for the sake of brevity, the term discriminant analysis is used throughout. The discriminant function takes the form:

$$y = c + \sum_{i=1}^{n} w_i x_i \qquad (2.2)$$

where y is the discriminant score, c is a constant, w_i is the weight of the i-th attribute, x_i is the value of the i-th independent attribute and n is the number of attributes. Discrimination is achieved by multiplying each independent attribute by its corresponding weight and adding these products and the constant together.
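The weighted sum in Equation 2.2 can be sketched in a few lines of Python. This is only an illustration: the function name `discriminant_score` and the weights, constant and attribute values are hypothetical and are not taken from the research.

```python
def discriminant_score(x, w, c):
    """Return y = c + sum_i w_i * x_i, as in Equation 2.2."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    # Multiply each attribute by its weight, add the products, then the constant.
    return c + sum(w_i * x_i for w_i, x_i in zip(w, x))


# Hypothetical applicant with three attributes (illustrative values only).
weights = [0.5, -0.25, 2.0]
constant = 1.0
applicant = [4.0, 8.0, 0.5]

print(discriminant_score(applicant, weights, constant))  # 2.0
```

Here the products are 2.0, -2.0 and 1.0, which sum to 1.0; adding the constant gives a score of 2.0.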

The final decision function of discriminant analysis is determined by a cut-off score. That score represents the dividing point used to classify observations into their groups based on the value of y. It is calculated by averaging the group means or centroids, which are in turn obtained by averaging the discriminant scores of all the observations within a particular group (Hair et al. 2006).
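As a rough sketch of this decision rule, the snippet below averages the discriminant scores within each group to obtain the centroids, averages the two centroids to obtain the cut-off, and classifies a new score by the side of the cut-off on which it falls. The function names, the sample scores and the handling of ties are assumptions made for illustration only.

```python
from statistics import mean


def cutoff_score(scores_good, scores_bad):
    """Return (cut-off, good centroid, bad centroid) from the group scores."""
    centroid_good = mean(scores_good)  # centroid = mean score of the group
    centroid_bad = mean(scores_bad)
    return (centroid_good + centroid_bad) / 2.0, centroid_good, centroid_bad


def classify_by_cutoff(score, cutoff, centroid_good, centroid_bad):
    """Assign the group whose centroid lies on the same side of the cut-off."""
    good_is_higher = centroid_good >= centroid_bad
    above = score >= cutoff  # assumption: a score equal to the cut-off counts as above
    return "good" if above == good_is_higher else "bad"


# Hypothetical discriminant scores for observations with known group membership.
scores_good = [2.0, 3.0, 4.0]
scores_bad = [-1.0, 0.0, 1.0]

cutoff, c_good, c_bad = cutoff_score(scores_good, scores_bad)
print(cutoff)                                          # 1.5
print(classify_by_cutoff(2.0, cutoff, c_good, c_bad))  # good
print(classify_by_cutoff(0.5, cutoff, c_good, c_bad))  # bad
```

The simple average of the two centroids treats both groups symmetrically, which matches the description above; it takes no account of differing group sizes.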

Even though discriminant analysis has been reported to be one of the most commonly used data mining techniques for classification problems, one of its significant limitations is that independent attributes which are of a categorical nature cannot be handled properly. Another important restriction is its dependence on a relatively equal distribution of group membership. Discriminant analysis also assumes that the predictor variables follow a normal distribution¹ and have linear as well as homoscedastic² relationships. Due to the categorical nature of the credit data and the fact that the covariance matrices³ of the ‘credit-worthy’ and ‘non-credit-worthy’ classes are not likely to be equal, the use of discriminant analysis for credit scoring has often been criticised (West 2000). However, it was found that the predictive performance of discriminant analysis, when applied to the context of credit scoring, was superior to that of artificial neural networks, genetic algorithms and decision trees (Yobas, Crook & Ross 2000).
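One informal way to see whether the covariance matrices of the two classes differ is to estimate each from the data and compare them entry by entry. The sketch below does this with the standard library only; the helper names and the tiny data sets are hypothetical, and the largest absolute difference is a descriptive measure, not a formal statistical test.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length observations."""
    n = len(rows)
    k = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [
        [
            sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
            for b in range(k)
        ]
        for a in range(k)
    ]


def max_abs_difference(m1, m2):
    """Largest absolute entry-wise difference between two matrices."""
    return max(abs(p - q) for r1, r2 in zip(m1, m2) for p, q in zip(r1, r2))


# Hypothetical observations (two attributes each) for the two classes.
credit_worthy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
non_credit_worthy = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]

cov_good = covariance_matrix(credit_worthy)
cov_bad = covariance_matrix(non_credit_worthy)
print(cov_good)                               # [[1.0, 2.0], [2.0, 4.0]]
print(cov_bad)                                # [[1.0, 0.0], [0.0, 0.0]]
print(max_abs_difference(cov_good, cov_bad))  # 4.0
```

A large difference relative to the scale of the attributes suggests that the equal-covariance assumption is questionable for the data at hand.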

2.4.1.2 Logistic Regression

Logistic regression has emerged as the most suitable technique when the dependent attribute is binary and the independent attributes are continuous, categorical or both. Since the outcome of credit scoring is binary, i.e. to either grant or refuse applicants credit, logistic regression is probably the most suitable statistical approach for credit scoring. In fact, based on our discussion with several credit analysts of a leading Australian bank, logistic regression was found to be the main technique used by most financial institutions for credit scoring. Logistic regression is designed to predict the probability of an event (for example, granting credit) occurring and assumes that the log odds of that event (the logit) are a linear function of the independent attributes. It takes the form:

$$\log\left(\frac{y}{1-y}\right) = c + \sum_{i=1}^{n} w_i x_i \qquad (2.3)$$

where y is the probability of the classification outcome, c is a constant, w_i is the weight of the i-th attribute and x_i is the value of the i-th independent attribute.

Based on Equation 2.3, the value of y can be generated. Since a probability takes a value between 0 and 1, a cut-off point, for example 0.5, is required to classify an applicant as either ‘good’ or ‘bad’.
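Solving Equation 2.3 for y gives y = 1 / (1 + exp(-z)), where z is the linear combination on the right-hand side. The sketch below computes this probability and applies a cut-off of 0.5. The function names, weights and applicants are hypothetical, and treating y as the probability of a ‘good’ outcome is an assumption made for the example.

```python
import math


def logistic_probability(x, w, c):
    """Return y such that log(y / (1 - y)) = c + sum_i w_i * x_i."""
    z = c + sum(w_i * x_i for w_i, x_i in zip(w, x))
    # Two algebraically equal forms, chosen so exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(probability, cutoff=0.5):
    """Assumption: y is the probability of 'good'; ties count as 'good'."""
    return "good" if probability >= cutoff else "bad"


# Hypothetical weights and applicants (illustrative values only).
weights = [0.5, -0.25]
constant = 0.0

print(logistic_probability([2.0, 4.0], weights, constant))            # 0.5
print(classify(logistic_probability([6.0, 4.0], weights, constant)))  # good
print(classify(logistic_probability([0.0, 8.0], weights, constant)))  # bad
```

For the first applicant the linear combination is exactly 0, so the log odds are 0 and the probability is 0.5, which sits on the cut-off itself.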

Unlike other statistical models, logistic regression can suit various kinds of distribution functions, such as Poisson and normal (Press & Wilson 1978), and as such does not require some of the assumptions necessary for discriminant analysis (i.e. normal distribution, linear and homoscedastic relationships between the dependent and independent variables). Therefore, one might expect logistic regression to perform better than discriminant analysis given its stronger theoretical justification. However, it has been found that logistic regression is as efficient as linear discriminant analysis but not more so (Harrell & Lee 1985). One of the most detailed published analyses compared the performance of various methods including discriminant analysis, logistic regression, mathematical programming methods and smoothing non-parametric methods on credit scoring (Hand & Henley 1997, Henley 1995). From the results of their experiments, it was concluded that there is no overall ‘best’ method and that what is best depends on the type of problem, for example, in relation to data structure, characteristics used or objective of the classification. For instance, neural networks are better suited to situations where there is a poor understanding of the data structure while regression and tree-based methods are more applicable to situations where reasons must be given for any decisions reached.

¹ A normal distribution is a purely theoretical continuous probability distribution in which the scores of the variables are clustered around the mean in a symmetrical pattern known as the bell-shaped or normal curve.

² A set of data is said to be homoscedastic if the variance of the error terms appears constant over a range of predictor variables.

In document Investigation of artificial immune systems and variable selection techniques for credit scoring (Page 40-42)