Loan Prediction System Using Decision Tree and Random Forest Algorithms

©IJETIE 2020

Shubham Chaudhary

Student, IT Dept., Galgotias College of Engg and Tech, Greater Noida, UP

Vishal Baliyan

Yatharth Katheria

ABSTRACT

In today’s world, obtaining loans from banks and other financial institutions has become very common. Every day many people apply for loans for a variety of purposes. But not all applicants are reliable, and not everyone can be approved. Every year, some applicants are refused loans by banks or financial institutions because of the risk associated with loan approval. Hence, the idea of this project is to gather loan data from the Lending Club website and use machine learning techniques on this data to extract important information and predict whether a customer would be able to get a loan from a bank.

Most banks are renewing their business models and switching to machine learning methods. In this paper, we discuss classifiers based on machine and deep learning models applied to real data for predicting loan default probability. The most important features from the various models are selected and then used in the modelling process to test the stability of the Random Forest and Decision Tree classifiers by comparing their performance on the data.

Keywords:

Python 3.7.2, Jupyter Notebook, Numpy, Pandas, Matplotlib, Scikit-learn, Random Forest and Decision Tree Classifier.

1.INTRODUCTION

Much research based on data mining and machine learning has been conducted in the financial and banking sector. This section briefly presents a number of these techniques that are employed in loan risk management, along with their findings. In [1], the researchers analyze the data set using data mining techniques. Data mining provides great insight in loan prediction systems, since it can promptly identify the customers who are able to repay the loan amount within a period. Data mining algorithms such as “Bayes net”, “J48 algorithm” and “Naive Bayes” are employed. On applying these algorithms to the datasets, it was shown that the “J48 algorithm” has a high accuracy (percentage correct) of 78.3784%, which helps the banker decide whether the loan can be given to the customer or not.

In paper [2], “loan prediction using Ensemble technique”, the authors used a “Tree model”, “Random forest” and “svm model”, and combined the three models into an ensemble model. A prototype is discussed in paper [2] so that banks can accept or reject loan requests from their customers.

The main method used is real coded genetic algorithms. By combining the algorithms in the ensemble model, loan prediction is performed in a better way. It is found that the tree algorithm provides a high accuracy of 81.25%. In paper [3], using the R language, an improved risk prediction clustering algorithm is used to find bad loan customers, since estimating the probability of default (PD) is the critical step for customers who come for a bank loan. A framework for finding the probability of default in the data frame is therefore provided by the data mining technique. R provides the KNN (K-nearest neighbor) algorithm, which is used to perform multiple imputation when there are missing values in the data set. The paper [4] used a tree model. It helps to determine whether banks will be able to overcome the loan problem with their customers. It provides a high accuracy of 80.87%.

The paper [5] uses a decision tree induction algorithm and found that the algorithm is a good way to evaluate credit risk. To avoid credit risk, bankers use a technique called a “credit score”, which helps lenders keep track of which applicants will be able to repay the amount and which are likely to default. The inputs for credit evaluation were customer data, WEKA software and civil score. The methodology employed in the prediction system was problem and data understanding, data filtering, system modelling and finally system analysis.

This was done on the bank's existing dataset containing 1140 rows and 25 attributes. Finally, the system was tested, and it helps bankers make a correct decision on whether to accept or reject a loan application. The paper [6] used predictive and descriptive modelling techniques to predict loan approval by banks. In the predictive technique, classification and regression were used, and in the descriptive technique, clustering and association were used. The classifiers implement algorithms such as naive Bayes and KNN in the R language, and the regressors implement algorithms such as decision trees, neural networks, etc. In this prediction analysis, out of all these algorithms, naive Bayes produced the most accurate classifier, while the decision tree, neural network and K-NN algorithms were the more accurate regressors. The main goal of the paper is to predict the loan classification based on the type of loan, the loan applicant and the assets (property) that the applicant holds. The analysis obtained an accuracy of 85% through decision tree classification.

The paper [7], “An Exploratory Data Analysis for Loan Prediction Based on Nature of the Clients”, uses EDA to process the dataset. In this paper many comparisons are made based on the annual income of the applicant. In the paper [8], “Prediction of Credit Risks in Lending Bank Loans”, SVM and boosted Decision Tree models are used with an Artificial Neural Network to increase the efficiency of the model. The paper [9], “Loan Approval Prediction based on Machine Learning Approach”, uses many machine learning models, such as Decision Tree, Random Forest, Support Vector Machine, Linear Model, Neural Network and Adaptive Boosting. In the paper [10], “Loan Default Prediction using Machine Learning Techniques”, prediction is based on Logistic model 3 and a Random Forest classifier, with accuracies of 85% and 78%.

2.MACHINE LEARNING

Introduction to Machine Learning

Machine learning is a field of computer science that involves the study of pattern recognition and computational learning theory in AI. Machine learning basically refers to the changes in systems that carry out tasks associated with artificial intelligence (AI). Such tasks include recognition, analysis, planning, mechanism control, forecasting, etc. It explores the study and construction of algorithms that can make predictions on data. Machine learning is used to build programs whose tuning parameters are adjusted automatically so as to improve their behaviour by adapting to previously seen data. Machine learning can be broadly divided into two categories: supervised and unsupervised learning. The steps of building a machine learning model are shown in Fig 1.

Fig 1 Flow chart of Machine Learning Model
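The flow chart itself is not reproduced in this text. As a rough illustration only, a typical supervised workflow can be sketched with Scikit-learn as follows. The steps, the synthetic data from `make_classification` and all variable names are assumptions made for this example; they are not taken from Fig 1 or from the paper's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Collect data (synthetic stand-in for a loan dataset, not the paper's data).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 2. Split into a training set and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 3. Train a model on the training set only.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# 4. Evaluate on data the model has not seen during training.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {test_accuracy:.3f}")
```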

3.METHODS

3.1 Decision Tree Classification

Decision trees are built via an algorithmic approach that identifies ways to split a data set based on different conditions. It is one of the most widely used and practical methods for supervised learning. Tree models where the target variable can take a discrete set of values are called classification trees. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
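The distinction between the two kinds of tree can be illustrated with a minimal sketch, assuming Scikit-learn's `DecisionTreeClassifier` and `DecisionTreeRegressor`. The single toy feature and its values are hypothetical and chosen only so that the split is easy to follow.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# One toy feature (e.g. a hypothetical applicant income in arbitrary units).
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Classification tree: the target takes a discrete set of values (0 or 1).
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
class_pred = clf.predict([[2.5], [10.5]])

# Regression tree: the target takes continuous values.
# With max_depth=1 the tree makes one split and predicts the mean of each side.
y_reg = np.array([1.5, 1.7, 1.6, 8.2, 8.4, 8.3])
reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y_reg)
reg_pred = reg.predict([[2.5], [10.5]])

print("class predictions:", class_pred)
print("regression predictions:", reg_pred)
```

In both cases the tree splits the feature between the two groups of points; the classifier returns a class label for each side, while the regressor returns the mean target value of each side.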

3.2 Random Forest Classification

Random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the model’s prediction. The low correlation between models is the key. Just as investments with low correlations come together to form a portfolio that is greater than the sum of its parts, uncorrelated models can produce ensemble predictions that are more accurate than any of the individual predictions. The reason for this effect is that the trees protect one another from their individual errors (as long as they do not all consistently err in the same direction). While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction.
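The voting idea can be sketched by hand, assuming bootstrap samples and a random feature subset at each split as the sources of low correlation between trees. The synthetic data, the choice of 25 trees and all names below are assumptions for illustration, not the configuration used in this paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the paper's dataset).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" considers a random feature subset at each split,
    # which helps keep the individual trees weakly correlated.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Each tree votes for a class; rows are trees, columns are test samples.
votes = np.array([t.predict(X_test) for t in trees])
# Majority vote for binary labels 0/1 (an odd number of trees avoids ties).
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

forest_acc = (forest_pred == y_test).mean()
mean_tree_acc = np.mean([(v == y_test).mean() for v in votes])
print(f"mean single-tree accuracy: {mean_tree_acc:.3f}")
print(f"majority-vote accuracy:    {forest_acc:.3f}")
```

Scikit-learn's own `RandomForestClassifier` combines trees by averaging their predicted class probabilities instead of counting hard votes, so its output can differ slightly from this hand-rolled version.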

4 PROBLEM WITH THE METHODS

We face many problems while building a machine learning model, such as how to treat missing values, how to treat categorical data, scalability issues and overfitting, but in this paper we mainly focus on the overfitting problem.

4.1 Overfitting Problem

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model’s ability to generalize. Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns. For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting the training data. This problem is often addressed by pruning a tree after it has learned, in order to remove some of the detail it has picked up.
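A minimal sketch of this behaviour, assuming synthetic data with deliberately noisy labels and Scikit-learn's cost-complexity pruning (`ccp_alpha`) as one possible pruning method. The value 0.01 is an arbitrary choice for illustration and is not taken from this paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 assigns a random class to 20% of the samples, i.e. label noise.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree keeps splitting until it memorises the training set.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Cost-complexity pruning removes branches that add little impurity reduction.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
pruned.fit(X_train, y_train)

for name, model in [("unpruned", full), ("pruned", pruned)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```

The unpruned tree fits the noisy training labels perfectly, which is the symptom described above; comparing its training and test scores with those of the pruned tree shows how much of that fit carries over to unseen data.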

5 SOLUTION OF THE PROBLEM

5.1 Parameter Tuning and Hyperparameter

Tuning is the process of maximizing a model’s performance without overfitting or creating too high a variance. In machine learning, this is accomplished by selecting appropriate “hyperparameters.”

Tuning machine learning hyperparameters is a tedious but crucial task, as the performance of an algorithm can be highly dependent on the choice of hyperparameters. Manual tuning takes time away from important steps of the machine learning pipeline such as feature engineering and interpreting results. Grid and random search are hands-off, but require long run times because they waste time evaluating unpromising areas of the search space. Increasingly, hyperparameter tuning is done by automated methods that aim to find optimal hyperparameters in less time using an informed search, with no manual effort necessary beyond the initial set-up.
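Grid search and random search can be sketched with Scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. The parameter grid, the number of folds and the synthetic data are assumptions chosen for illustration; they are not the settings used in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a loan dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hypothetical search space: 4 x 3 = 12 candidate settings.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 20],
}

# Grid search evaluates every combination with 5-fold cross-validation.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X, y)

# Random search evaluates only n_iter randomly drawn combinations.
rand = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                          n_iter=5, cv=5, scoring="accuracy", random_state=0)
rand.fit(X, y)

print("grid search:  ", grid.best_params_, round(grid.best_score_, 3))
print("random search:", rand.best_params_, round(rand.best_score_, 3))
```

Grid search fits 12 candidates here and random search only 5, which is the trade-off between coverage and run time described above.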

In this analysis we use the max_depth parameter with a value of three. It will only create a tree of depth three, which saves our model from overfitting. The results are shown in the figures below.
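The effect of max_depth can be sketched as follows, assuming Scikit-learn's `DecisionTreeClassifier`. The data here is synthetic, so the scores it prints are not the results reported in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (not the Lending Club data used in the paper).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Default settings: the tree grows until every leaf is pure.
untuned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# max_depth=3: no path from the root to a leaf has more than three splits.
tuned = DecisionTreeClassifier(max_depth=3, random_state=0)
tuned.fit(X_train, y_train)

for name, model in [("default", untuned), ("max_depth=3", tuned)]:
    print(f"{name}: depth={model.get_depth()}, "
          f"train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```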

Fig 2 Result before parameter tuning

Fig 3 Result after parameter tuning

6 RESULT

After processing the dataset through the Decision Tree classifier with default settings, we obtained 74.79% efficiency in predicting the result. After parameter tuning with an appropriate hyperparameter, we obtained around 85% efficiency with a 90% F1 test score. After processing the dataset through the Random Forest classifier with default settings, we obtained 78.64% efficiency, and with parameter tuning using an appropriate hyperparameter we obtained an efficiency of 85.3%, which is comparable to the prediction efficiency of the tuned Decision Tree classifier.
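For reference, accuracy and F1 score can be computed as in the following sketch. It assumes that "efficiency" above refers to classification accuracy, which the text does not state explicitly, and the labels are hypothetical values chosen so that the arithmetic is easy to check by hand.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 1 = loan approved, 0 = loan refused (not the paper's data).
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

# Count the four possible outcomes by hand.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

accuracy = (tp + tn) / len(y_true)    # share of all predictions that are right
precision = tp / (tp + fp)            # share of predicted approvals that are right
recall = tp / (tp + fn)               # share of actual approvals that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy = {accuracy:.3f}")   # accuracy = 0.750
print(f"F1 score = {f1:.3f}")         # F1 score = 0.800

# Scikit-learn's metric functions compute the same quantities.
sk_accuracy = accuracy_score(y_true, y_pred)
sk_f1 = f1_score(y_true, y_pred)
```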

Fig 6.7 Random Forest Classifier result with Hyperparameter Tuning

7 PROPOSED SYSTEM

The efficiency of the decision tree and random forest classifiers can be improved by appropriate hyperparameter tuning. We obtain good efficiency with both the Random Forest classifier and the Decision Tree classifier, and applying all the proposed rules to these algorithms gives better results still. Such results are suited to making correct predictions in real-world scenarios, help the bank lend money to the right applicants, and help people obtain loans much faster through our proposed system.

8 REFERENCES

[1] A. Goyal and R. Kaur, “A survey on Ensemble Model for Loan Prediction”, International Journal of Engineering Trends and Applications (IJETA), vol. 3(1), pp. 32-37, 2016.

[2] A. J. Hamid and T. M. Ahmed, “Developing Prediction Model of Loan Risk in Banks using Data Mining”.

[3] G. Shaath, “Credit Risk Analysis and Prediction Modelling of Bank Loans Using R”.

[4] A. Goyal and R. Kaur, “Accuracy Prediction for Loan Risk Using Machine Learning Models”.

[5] M. Sudhakar and C.V.K. Reddy, “Two Step Credit Risk Assessment Model for Retail Bank Loan Applications Using Decision Tree Data Mining Technique”, International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), vol. 5(3), pp. 705-718, 2016.

[6] R. Gerritsen, “Assessing Loan Risks: A Data Mining Case Study”, IT Professional, vol. 1(6), pp. 16-21, 1999.

[7] X. Francis Jency, V.P. Sumathi and Janani Shiva Sri, “An Exploratory Data Analysis for Loan Prediction Based on Nature of the Clients”, International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, vol. 7, issue 4S, November 2018.

[8] Mohit Lakhani, Bhavesh Dhotre and Saurabh Giri, “Prediction of Credit Risks in Lending Bank Loans”.

[9] Kumar Arun, Garg Ishan and Kaur Sanmeet, “Loan Approval Prediction based on Machine Learning Approach”, IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, vol. 18, issue 3, ver. I, May-Jun. 2016.

[10] Vikash V and Mohammad Amir Ahmed, “Loan Default Prediction using Machine Learning Technique”.