• No results found

Research on Supplier Credit Evaluation Based on Data Mining

N/A
N/A
Protected

Academic year: 2020

Share "Research on Supplier Credit Evaluation Based on Data Mining"

Copied!
6
0
0

Loading.... (view fulltext now)

Full text

(1)

2018 International Conference on Computer, Communications and Mechatronics Engineering (CCME 2018) ISBN: 978-1-60595-611-4

Research on Supplier Credit Evaluation Based on Data Mining

Ping-hua YANG, Cai-xia YANG and Jie HE

School of Computer Engineering, Guangzhou College of South China University of Technology, Guangzhou, Guangdong 510800, China

Keywords: Credit scoring, Data mining, Logistic model, Weights of evidence.

Abstract. Data mining method is used to clean the real data of suppliers on cost link. With its 80% random drawing as a training set, a WOE and logistic regression model is set up to distinguish the integrity supplier on the platform according to the new calculation formula of WOE for indicator variable. And using the rest 20% as test data to calculate the accuracy of this model which is as high as 96.9%. Finally, the credit score card is established according to the linear scaling relation between the WOE value and regression coefficient of each indicator variable, which can calculate supplier credit score and provide decision-making reference for the platform and users when selecting suppliers.

Introduction

Credit is an important foundation of the market economy. Credit evaluation is an important basis for enterprises to select and motivate suppliers, and is the basis for establishing long-term stable cooperative relations. How to accurately portray the credit of suppliers is of great significance to the improvement of cooperation efficiency and the establishment of credit mechanism[1]. At present, there are many methods for suppliers’ credit evaluation, such as gray correlation method[2], analytic hierarchy process [3-4], genetic clustering method[5], principal component analysis method[6], fuzzy comprehensive evaluation method[3,7], neural network[8-9], logistics regression method[10] etc. Among them, the logistics regression method has excellent comprehensive performance in terms of efficacy, practicability and well-posedness[11], and it is widely used in the dual credit evaluation model, but it also has defects. For example, in practical applications, the acquisition quality of real data is required to be high, and it is necessary to assume a linear relationship between the intermediate variable and the independent variable. As the distance between the distribution of dishonest sample and the distribution of honest samples can be described by the weights of evidence (WOE) method in information theory, it can be used to estimate the probability of supplier dishonesty[11]. Replacing the independent variable with the WOE value of the independent variable for Logistic regression can overcome the difficulty of not satisfying the linearity. Therefore, this paper combines the evidence weighting method with the classical Logistic regression method, and applies it to the real data of the first comprehensive data interaction platform in the domestic construction industry, the real data of the cost-effectiveness, and establishes its supplier's binary credit evaluation classification model to distinguish the integrity. This paper also establish a credit scorecard with a non-integrity supplier based on the linear proportional relationship between the WOE value and the regression coefficient of each independent variable, and calculate the specific credit value score of each supplier, which provides qualitative and quantitative dual-level decision-making reference for the platform and users when screening suppliers.

Data Processing and Analysis

Data Preparing

(2)
[image:2.595.82.511.130.481.2]

variables, 5 categorical variables). The response variable y is a two-category variable that distinguishes whether the supplier is an honest merchant. The specific variable types and meanings are shown in Table 1.

Table 1. The specific variable types and meanings.

variable name

Variable type

Variable description

y Categorical

variables

Whether the company is a credible merchant on the platform of the cost-effective platform (1 means no, 0 means yes)

1

x Numerical

variable

Registered capital of the enterprise

2

x Categorical

variables

Type of business: 1 means state-owned enterprise; 2 means joint-stock enterprise; 3 means Sino-foreign joint venture; 4 means private enterprise

3

x Numerical

variable

The number of employees

4

x Numerical

variable

Annual output value of the enterprise

5

x Numerical

variable

The number of certificate added by the enterprise (5 or more is recorded as 5)

6

x Categorical

variables

Sales area: 1 indicates the province where the company is located; 2 indicates domestic; 3 indicates domestic and foreign

7

x Categorical

variables

The product types of the company (5 or more are recorded as 5)

8

x Categorical

variables

Whether to carry out integrity certification on the cost platform (1 for certification, 0 for uncertified)

9

x Categorical

variables

Cost-to-user rating of the supplier

10

x Numerical

variable

The number of times the supplier updated the price on the platform within one year

Missing Value Analysis and Processing

The aggr function in the R language can be used to plot the missing values in the data set (Figure 1, red is missing data, blue is normal one).

Figure 1. Data missing situation.

It can be seen that there are only variablesx1 and x4missing values in the data set. Calling the

md. pattern function shows that there are 88 missing values in the dataset (50 ofx1, 38 ofx4). There

[image:2.595.160.440.540.660.2]
(3)

Variable Correlation Analysis

[image:3.595.218.381.165.310.2]

In order to avoid the occurrence of multi-collinearity problems and improve the accuracy of the model, it is necessary to check the correlation between variables before establishing the model. The Corrplot function in the R language is used to plot the correlation between the variables (see Figure 2, the deeper the color, the larger the correlation coefficient, the higher the correlation between the variables).

Figure 2. Variable correlation.

In Figure 2, the intersections of the variables x3andx4, as well asx2 and x4are darker, showing

a higher correlation between the two sets of variables. Therefore, the variablex4is removed, and the remaining variables are tested for correlation. The function chart. Correlation is called to display the correlation between the variables in the form of correlation coefficients (see Figure 3).

Figure 3. Variable correlation coefficient graph.

As can be seen in Figure 3, the most relevant set of variables isx3andx6, the correlation coefficient is 0.28, while the correlation coefficient between the other variables is smaller.

Modeling Data Set Segmentation

80% of the 3795 raw data sets from which the variablex4was removed were randomly selected as the training set, and 20% were used as test sets to establish the model and the test model, respectively.

Supplier Credit Evaluation Model

Establishment of Classical Logistic Regression Model

[image:3.595.156.439.393.577.2]
(4)

1

y is recorded as pP y( 1|X) , then the probability that y0 is 1p. The ratio

1

p p

 is the ratio of the odds of the event, recorded asodds. When taking the natural logarithm of odds,

( ) ln( )

Logit podds。 can be got. When p(0,1), odds(0,) Logit p( )  ( , ). Establish

a linear regression model Logit p( ) with independent variables X (x x x1, 2, 3,...,xn), which is

the logistic regression model: g X( )Logit p( )01 1x 2x2,...,nxn.

The regression coefficients of the variables in the model can be obtained by the maximum likelihood estimation, and the AIC value is 339.08. However, the pvalues of the three variables and the three variables were greater than 0.05 and failed to pass the test. Therefore, the three variablesx1,x3and x7were eliminated and the remaining six variables were re-transformed by Logistic regression. The six variables passed the test, and the new model has an AIC value of 334.41, which is smaller than before, indicating that the new model works better.

WOE-based Logistic Regression Model

In the classical logistic regression model, it is assumed that the intermediate variable Logit p( )has

a linear relationship with the independent variableX (x x x1, 2, 3,...,xn), which is not necessarily met in practice. To this end, imitating the formula (7) in [11], the WOE defining the selected

variablex is as follows: WOE( ) ln[( ( 1)) / ( ( 0))]

( 1) ( 0)

n y n y x

m y m y

 

  Among them, n( ) indicates the number of each category in the training sample, m( ) indicates the number of each category in the total sample. That is, the WOE value is the difference between the logarithm of the non-integrity and integrity supplier in the training sample and the logarithm of the non-integrity and integrity supplier ratio in the total sample. It reflects the extent to which this variable affects the supplier's classification results. An increase in the WOE value of the selected variable implies a decrease in the probability of dishonesty. Using the monotonous WOE value as a new independent variable into the Logistic regression model can overcome the difficulty that the intermediate variable Logit p( )

[image:4.595.107.490.555.726.2]

and the original data may be nonlinear, and the regression coefficients are shown in Table 2. At this time, the AIC value is 288.06.

Table 2. Regression coefficient based on WOE value modeling.

Regression coefficients estimated value Standard deviation Z value P value

0

 1.8024 0.6447 2.796 0.00518

2

 -0.7370 0.2335 -3.156 0.00160

5

 -0.8112 0.3469 -2.338 0.01938

6

 -1.0970 0.2069 -5.301 1.15e-07

8

 -0.9480 0.0905 -10.475 <2e-16

9

 -1.4410 0.3608 -3.994 6.49e-05

10

 -1.0913 0.1551 -7.036 1.98e-12

Model Evaluation

(5)
[image:5.595.212.385.133.214.2]

value is compared to 0.5 using the if else function, and the output value is converted to 0 and 1 types. The confusion matrix that finally obtains the output predicted value and the true value is shown in Table 3.

Table 3. Confusion matrix.

classification y0

y=0

1

y total

y=0 692 10 702

y=1 14 52 66

total 706 62 768

From the above table, the accuracy of each category of the model can be calculated asy0: 692

98%

706 ;y1: 52

83.8%

62 . The recall rates for each category are y0: 692

98.6% 702 ;

1

y : 52 78.8%

66  . The total accuracy of the model is

692 52

96.9% 768

. It can be seen that the

model has a strong predictive ability.

Credit Score

Establishment of Credit Scorecard

The two-category Logistic regression model established above can distinguish honest and untrustworthy merchants, but cannot judge whoever has higher credit in honest merchants. So the score corresponding to each variable can be obtained according to the linear proportional relationship between the WOE value and the regression coefficient of each variable. The specific calculation formula is: multiply the regression coefficient of the variable coe i( )by the WOE value of the variable, multiply by the set scale factork, and finally add the offsete, that is,

WOE ( )

score coe i  k e (1) When the odds ratioodds15 :1, the corresponding score is set 600, and if the score is 10 higher,

the odds ratio is doubled. Which is 600 ln15 610 ln 30

k e

k e

  

 

 , it can be solved that

14.4269, 560.9311

ke

From this, the base score for the merchant credit can be calculated as:k coe (1)+e 593

Then, according to formula (1), the binning range of each variable is scored, and the final scorecard is shown in Table 4.

Table 4. Credit scorecard.

variable x2(type of enterprise) x5(Addition certificate) x6(Sales area)

category 1 2 3 4 1 2 3 4 5 1 2 3

score 28 -10 -14 3 -15 -28 53 54 57 -11 11 19

variable x8(Platform integrity certification) x9(service comment) x10(Updates)

category 0 1 1 2 3 4 5 (0,10] (10,20] (20,30]

score -38 18 -68 -76 1 4 15 -35 0 78

The final credit score of the merchant is calculated as the base score plus the score corresponding to each part of the variable. Based on this scorecard, you can calculate:

Minimum credit score = 593-14-28-11-38-76-35=391

[image:5.595.67.530.602.733.2]
(6)

Scoring Case

Assuming that the data of a supplier of the platform is as shown in Table 5, the credit value of the supplier is divided into 593+3-15-11-38+4-35=501. It can be seen that the credit value of the supplier is relatively general.

Table 5. Variable value and score of a supplier.

variable x2 x5 x6 x8 x9 x10

data 4 1 1 0 4 1

Corresponding score 3 -15 -11 -38 4 -35

Conclusion

Based on the registered data of the first comprehensive data interaction platform in the construction industry, the paper uses R language data mining method to establish its supplier credit score model based on WOE and Logistic regression. Tested by the test set, the model can help the platform to screen and screen the network suppliers, and eliminate the bad companies whose credit value is not good, so as to improve the quality of the platform suppliers and provide the best quality choice environment for users.

References

[1] Fang Q. Credit Assessment of Enterprise Suppliers [J]. Logistics Technology, 2009, 28(12):85-86.

[2] Xu J, Qi Zh F. Analysis of Suppliers' Credit Level and Its Grey Correlation Model [J]. Journal of Shanxi Finance and Economics University, 2003, 25(4): 71-74.

[3] Li ZhK, Liu X. Fuzzy Comprehensive Evaluation of Nonferrous Metal Suppliers Based on Green Supply Chain [J]. Statistics and Decision, 2008, (4): 65-66.

[4] Xu Y, Zhang Y. A online credit evaluation method based on AHP and SPA[J]. Communications in Nonlinear Science & Numerical Simulation, 2009, 14(7):3031-3036.

[5] She Ch Y, Liang X J. Algorithm of Genetic Clustering and its Application to Evaluation of Supplier Credit[J]. Journal of China Three Gorges University (Natural Sciences), 2007, 29(3): 242-244.

[6] Li Y, Wang L H. Supplier Credit Evaluation Based on Principal Component Analysis [J]. Commercial Research, 2009, (8): 209-210.

[7] Liu Y D, Gao J. Fuzzy comprehensive evaluation on the supplier credit for manufactory[J].Information Technology, 2008, (4):101-103.

[8] Fu Y G, Zhu J M. Network provider credit evaluation model based on big data [J]. Journal of Central University of Finance and Economics, 2016, (8): 74-83.

[9] Rong F Q, Guo M F. The Study of Credit Evaluation of Suppliers on Cross-border E-commerce Platform Based on Big Data[J]. Statistics & Information Forum, 2018, 33(3): 100-107.

[10] Shi L L, Li Ch Y, J Ch, etc. Minor enterprise credit assessment based on Logistic regression model[J]. Journal of Science of Teachers' College and University, 2018, 38(5):10-17.

Figure

Figure 1. Data missing situation.
Figure 3. Variable correlation coefficient graph.
Table 2. Regression coefficient based on WOE value modeling.
Table 3. Confusion matrix.

References

Related documents

The citation is then saved to the list of key references, and filed under the appropriate heading, corresponding to one of three research sub-questions: (a) what are the

A RESOLUTION OF THE PLANNING COMMISSION OF THE CITY OF BRENTWOOD APPROVING A CONDITIONAL USE PERMIT (CUP 14-007) TO ALLOW THE ESTABLISHMENT OF A MARTIAL ARTS STUDIO AND

Because most of the energy usage comes from the materials phase, the rest of the life of the garment is considerably less harmful than natural fibers. Refer to the following table

In this lesson you learned how the medians of a set of data can be used to represent the values in a meaningful graph called the box-and-whisker plot. You also learned that two sets

A key advantage of Windows Phone is that Microsoft has solidified the foundation of its OS in con - junction with the chip-set maker, which frees up device makers

i) Roads should not be closed for cycle racing save in very exceptional circumstances (e.g. ‘criterium’ events which normally involve several laps of a closed circuit within a

In order to step up as a true business partner in your organisation, PMO professionals must rise above process and emerge as strategic leaders, driving strategy and

M(-bic) denotes the multivariate test with the number of principal components selected by the BIC approach, whereas M(-0.9) by the variance rule with the cut-off variance 0.9, and