
2 SUPPLIER RISK ASSESSMENT

4.1. A Novel Approach for Supplier Risk Scoring

4.1.3 Risk Scoring Model

4.1.3.1 Variable selection and discretization

The goal is to select the variables that result in the “best” model in the context of supplier risk scoring model development. A simple method is therefore proposed for selecting appropriate independent variables, which will most likely result in a more stable and more easily generalized model. The selected numerical variables are then converted into categorical variables, because numerical data can also increase the redundancy problem.

By definition, supplier risk is the “failure to fulfil the obligatory contracted performance due to realised supply risk during given time period, which cause loss to buyer”. Supplier risk is therefore caused by supply risk; hence the variables that are significant predictors of supply risk can also be significant predictors of supplier risk. The active sum and passive sum indicate how much predictive power a variable has toward a specific class, i.e. the supply risk outcome. Therefore, in the current thesis, the variable selection principle is given as

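$X_i$ is selected if

$$\mathrm{IAS}_r(i) \ge \mathrm{IAS}_r(\text{average}) \;\text{ or }\; I(i) \ge I(\text{average}) \;\text{ or }\; \mathrm{IPS}_r(i) \ge \mathrm{IPS}_r(\text{average}) \tag{4.35}$$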

Variables whose active sum, passive sum, or integration value is at least the corresponding average value are selected. According to the principles explained for variable selection (see section 3.4.4), a selected variable should be logical and predictive, and excluding a variable should cause little information loss. A variable with a high active sum, passive sum, or integration value makes a larger predictive contribution toward a specific supply risk outcome. The supply risk outcome (see Figure 4.2) has a linear relationship with the actual performance of the supplier (equation 4.10). Consequently, it is assumed that a variable with high predictive power for the supply risk outcome also has high predictive power for supplier performance (failure to fulfil the obligatory contracted performance). Furthermore, excluding the variables with lower values is not expected to cause an unacceptable level of information loss.
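The selection rule (4.35) can be illustrated with a minimal Python sketch. The function name, the dictionary keys and the numeric values below are hypothetical and are not taken from the thesis data; the sketch only assumes that an active sum, a passive sum and an integration value are already available for every candidate variable.

```python
from statistics import mean

MEASURES = ("active", "passive", "integration")  # hypothetical key names


def select_variables(scores):
    """Return the names of the variables that satisfy rule (4.35).

    `scores` maps a variable name to a dict holding its active sum,
    passive sum and integration value under the keys in MEASURES.
    """
    # Average of each measure over all candidate variables.
    averages = {m: mean(s[m] for s in scores.values()) for m in MEASURES}
    # Keep a variable if at least one measure reaches its average.
    return [
        name
        for name, s in scores.items()
        if any(s[m] >= averages[m] for m in MEASURES)
    ]


# Illustrative values only.
example = {
    "X1": {"active": 0.9, "passive": 0.2, "integration": 0.5},
    "X2": {"active": 0.1, "passive": 0.8, "integration": 0.3},
    "X3": {"active": 0.2, "passive": 0.2, "integration": 0.1},
}
selected = select_variables(example)
print(selected)
```

In this illustration X1 is kept because of its active sum and X2 because of its passive sum, while X3 is below the average on all three measures and is excluded.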

The selected numerical variables are discretised on the basis of the knowledge discovery results. Each selected numerical variable is converted into a categorical variable (groups or bins) using the cut-off values of that variable that appear in the selected rule set. The numerical values of a selected variable are replaced by categorical values or bins through the following steps:

1. Round the cut-off values to whole numbers if the maximum value of the numerical variable in the data set is greater than 10; otherwise round them to one decimal place. After rounding, apply step 2.

2. Select only the two most frequently repeated cut-off values of the selected variable in the selected rule set and convert the numerical variable into a categorical variable with three groups. Each selected cut-off value must account for at least 20 percent of all cut-off values that appear for the given variable in the rule set. Otherwise, consider only the single most frequently repeated cut-off value that meets the minimum repetition threshold (20%) and convert the selected variable into a binary variable.

3. If no cut-off value meets the minimum repetition threshold (20%), round the whole numbers to the nearest ten and the decimal values to whole numbers, and then apply step 2.

4. If more than two cut-off values have the same repetition, select the two values that give the widest range between the two selected cut-off values.
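The four steps above can be sketched in Python as follows. This is one possible reading of the procedure, not the implementation used in the thesis: the function names and the example cut-off lists are hypothetical, step 3 is interpreted as one coarser rounding pass (nearest ten for whole numbers, nearest whole number for decimals), and ties are resolved by preferring the pair with the widest range. Note that Python's `round` rounds exact halves to the nearest even digit.

```python
from collections import Counter
from itertools import combinations


def choose_cutoffs(cutoffs, var_max, threshold=0.20):
    """Return zero, one or two cut-off values (sorted) for one variable."""
    digits = 0 if var_max > 10 else 1            # step 1
    frequent = {}
    for coarser in range(2):                     # second pass is step 3
        rounded = [round(c, digits - coarser) for c in cutoffs]
        counts = Counter(rounded)
        # Keep values whose share of all cut-offs reaches the threshold.
        frequent = {v: n for v, n in counts.items()
                    if n / len(rounded) >= threshold}
        if frequent:
            break
    if len(frequent) < 2:
        return sorted(frequent)                  # binary variable (or none)

    def pair_rank(pair):
        a, b = pair
        high, low = sorted((frequent[a], frequent[b]), reverse=True)
        return (high, low, abs(a - b))           # step 2, then step 4

    return sorted(max(combinations(frequent, 2), key=pair_rank))


def to_group(value, cuts):
    """Map a numerical value to a group index (0 = lowest group)."""
    if len(cuts) == 2:
        low, high = cuts
        return 0 if value < low else (2 if value > high else 1)
    return 0 if value < cuts[0] else 1


# Hypothetical cut-off values collected from a rule set.
x1_cuts = choose_cutoffs(
    [2.68, 2.71, 2.74, 1.8, 1.83, 0.5, 4.2, 2.66, 1.79, 3.3], var_max=9)
x3_cuts = choose_cutoffs([78, 81, 83, 76, 84, 79, 82], var_max=100)
print(x1_cuts, x3_cuts)
```

With these illustrative inputs the first variable receives two cut-off values and therefore three groups, while the second only reaches the 20% threshold after the coarser rounding of step 3 and becomes binary.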

Table 4.6: Example of discretization based on the proposed method

For example, $X_1$ is a variable selected for risk scoring, with numerical values ranging from 0 to 9. The selected rule set in table 4.1 shows the cut-off values for $X_1$. According to the interesting rule set, the two most frequently repeated cut-off values of $X_1$ are 2.7 and 1.8 (rounded to one decimal place), and both are repeated more often than the required threshold. According to the proposed discretization method, $X_1$ will have three groups after discretization:

$X_1 > 2.7$, $2.7 \ge X_1 \ge 1.8$, $1.8 > X_1$

Similarly, for $X_2$ and $X_3$ respectively:

$X_2 > 4$, $4 \ge X_2 \ge 3$, $3 > X_2$

$80 > X_3$ or $X_3 \ge 80$

4.1.3.2 Model building

In the current thesis, the logistic regression technique is selected for model development to estimate the weights of the independent variables, i.e. $\beta_j$. Estimation and validation of the supplier risk scoring model require data on supplier contracted performance and its potential predictors (risk factors). In the current study there are $n$ observations $(X_i, C_i)$ in the given dataset $S$, where $X_i$ is a $d$-dimensional vector of independent risk factors and $C_i$ is the target variable with binary value (1, 0). It holds the value $C_i = 1$ if the supplier is good (no risk) and $C_i = 0$ otherwise. The probability function $P$ for the model is given as

$$P(X_i) = \frac{e^{\sum_{j=1}^{d} \beta_j X_{ij}}}{1 + e^{\sum_{j=1}^{d} \beta_j X_{ij}}} \tag{4.36}$$

Where,

$P(X_i)$ is the probability that the $i$th supplier is good, i.e. $C_i = 1$

$1 - P(X_i)$ is the probability that the $i$th supplier is bad, i.e. $C_i = 0$

$X_{ij}$ is the $j$th variable of the $i$th supplier

$\beta_j$ is the weight estimator or coefficient parameter for the $j$th variable in the logistic regression.
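As a small illustration of equation (4.36), the probability that a supplier is good can be computed in Python as sketched below. The function name and the example numbers are hypothetical. The equation as written has no separate intercept term; one could be included by adding a constant 1 to each risk-factor vector, which is an assumption and not something stated in the text.

```python
import math


def probability_good(x, beta):
    """P(X_i) from (4.36): e^z / (1 + e^z), with z = sum_j beta_j * x_j."""
    z = sum(b * v for b, v in zip(beta, x))
    # Two algebraically equal forms, chosen so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


# Illustrative weights and risk-factor values only.
p_good = probability_good([1.0, 0.0, 2.0], [0.5, -1.2, 0.3])
p_bad = 1.0 - p_good
print(p_good, p_bad)
```

Here `p_bad` corresponds to $1 - P(X_i)$, the probability that the same supplier is bad.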

The log-likelihood for this model can be expressed as

$$\log L(\beta) = \sum_{i=1}^{n} \left[ C_i \log P(X_i) + (1 - C_i) \log\bigl(1 - P(X_i)\bigr) \right] \tag{4.37}$$
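The log-likelihood can be evaluated numerically as in the following sketch, which uses the standard Bernoulli form in which the second term is $(1 - C_i)\log(1 - P(X_i))$. The function name and the toy data are hypothetical, and the sketch assumes moderate values of the linear predictor (very large magnitudes would overflow `math.exp`).

```python
import math


def log_likelihood(X, C, beta):
    """Sum over observations of C*log(P) + (1 - C)*log(1 - P)."""
    total = 0.0
    for x, c in zip(X, C):
        z = sum(b * v for b, v in zip(beta, x))
        log_p = -math.log1p(math.exp(-z))   # log P(X_i)
        log_q = -math.log1p(math.exp(z))    # log(1 - P(X_i))
        total += c * log_p + (1 - c) * log_q
    return total


# Toy data: one risk factor, one good and one bad supplier.
X = [[1.0], [-1.0]]
C = [1, 0]
ll_zero = log_likelihood(X, C, [0.0])   # every P(X_i) equals 0.5
ll_fit = log_likelihood(X, C, [2.0])    # weights that agree with the labels
print(ll_zero, ll_fit)
```

Weights that agree better with the observed classes give a larger (less negative) log-likelihood, which is the quantity that maximum likelihood estimation increases.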

In the current study, the WEKA data mining workbench is used to build the logistic regression model. Its implementation uses the ridge estimation technique (Cessie and Houwelingen 1992) to estimate the weights $\beta_j$ of the variables. In this technique, the difference between two successive estimated parameters is restricted through the term $(\beta_{j+1} - \beta_j)^2$. The ridge parameter controls the values of $\beta_j$. When the ridge parameter is equal to zero, the solution is the same as the ordinary MLE; as the ridge parameter tends toward infinity, the values of $\beta_j$ tend toward zero. The default setting of the ridge parameter is therefore used with the WEKA logistic regression implementation to obtain a good estimate of $\beta_j$ for model building.
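The shrinking effect of the ridge parameter can be seen in a small pure-Python sketch. It assumes the common ridge form in which `ridge * sum(beta_j ** 2)` is subtracted from the log-likelihood and maximises the result by plain gradient ascent; this is an illustrative assumption and not a description of how WEKA optimises or penalises the model. All names and data are hypothetical.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


def fit_ridge_logistic(X, C, ridge, steps=2000):
    """Gradient ascent on log-likelihood minus ridge * sum(beta_j**2)."""
    d = len(X[0])
    beta = [0.0] * d
    # Safe step size: inverse of an upper bound on the curvature.
    step = 1.0 / (2.0 * ridge + 0.25 * sum(v * v for x in X for v in x))
    for _ in range(steps):
        grad = [-2.0 * ridge * b for b in beta]      # penalty part
        for x, c in zip(X, C):
            p = sigmoid(sum(b * v for b, v in zip(beta, x)))
            for j in range(d):
                grad[j] += (c - p) * x[j]            # likelihood part
        beta = [b + step * g for b, g in zip(beta, grad)]
    return beta


# Toy data with overlapping classes, so the unpenalised estimate is finite.
X = [[-2.0], [-1.0], [1.0], [2.0], [1.0], [-1.0]]
C = [0, 0, 1, 1, 0, 1]
b_none = fit_ridge_logistic(X, C, ridge=0.0)[0]
b_mild = fit_ridge_logistic(X, C, ridge=1.0)[0]
b_strong = fit_ridge_logistic(X, C, ridge=100.0)[0]
print(b_none, b_mild, b_strong)
```

With a ridge value of zero the estimate is the ordinary maximum likelihood solution, and increasing the ridge value pulls the weight toward zero, which matches the behaviour described in the text.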

After the regression model for risk scoring has been developed, its predictive power needs to be evaluated. An appropriate benchmark rate is required against which the predictive power of the risk scoring model can be compared. In general, the benchmark for a dichotomous model is 50 percent because the dependent variable is binary; in most cases, however, the proportions of the target classes in a given population are not equal. Consequently, the method of Neter (1996) is used to calculate the benchmark rate when the target classes are unbalanced. This method assumes that observations can be classified correctly at the same rate as their population proportions. For example, suppose that a known population contains 80 percent good suppliers and 20 percent bad suppliers; the benchmark is then calculated as

0.8 × 0.8 + 0.2 × 0.2 = 68% (4.38)

The predictive accuracy of the developed model should therefore be higher than the calculated benchmark.
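The benchmark calculation in (4.38) generalises to any set of class proportions as the sum of their squares. A minimal Python sketch follows; the function name is hypothetical, and the accuracy value compared against the benchmark is illustrative only.

```python
def chance_benchmark(proportions):
    """Benchmark rate: sum of squared class proportions, as in (4.38)."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("class proportions must sum to 1")
    return sum(p * p for p in proportions)


benchmark = chance_benchmark([0.8, 0.2])     # 80% good, 20% bad suppliers
balanced = chance_benchmark([0.5, 0.5])      # balanced classes

model_accuracy = 0.75                        # illustrative value only
beats_benchmark = model_accuracy > benchmark
print(round(benchmark, 2), balanced, beats_benchmark)
```

For balanced classes the same formula gives the usual 50 percent benchmark, so the 50 percent rule is the special case of equal class proportions.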

In document A novel knowledge discovery based approach for supplier risk scoring with application in the HVAC industry (Page 99-103)