4. Chapter 4: Algorithm Design
4.3. Weighted Stacking Ensemble Method of Classifiers
4.3.3. k-Nearest Neighbours (KNN) Method
In this study, the KNN method is implemented as a class named k_Nearest_Neighbors, which contains a constructor to initialize its values. Table 8.3 in Section 8 lists the names and descriptions of the properties of the k_Nearest_Neighbors class. The class definition of the KNN method is given in Algorithm 13 of Section 8.4. The following pseudo-code shows the flow of the KNN method as defined in Section 8.4.
Algorithm 13: K-Nearest Neighbour (KNN)
Input: Row length, neighbours, TrainDataX, TrainDataY, Random Seed
Output: Predicted Output
Begin
1. Function for calculating the distance: $(\text{row1\_value} - \text{row2\_value})^2$
2. Function for calculating the Euclidean distance:
Input: row1, row2
Distance = 0
For ind = 0 to length of the row
    Distance += distance(row1[ind], row2[ind])
Return the square root of the Distance
3. Function to calculate the nearest neighbours (getNeighbors)
Input: TestRow
k-nearest_neighbour = []
for ind = 0 to length of TrainDataX
    dist = Euclidean_Distance(TestRow, TrainDataX[ind])
    Append (ind, dist) into k-nearest_neighbour
Sort k-nearest_neighbour by dist in ascending order
Return the 5 entries with the smallest distances
4. Function for predicting the transactions
Input: TestRow
k-nearest_neighbours = getNeighbors(TestRow)
count0, count1 = 0, 0
for each neighbour in k-nearest_neighbours:
    if TrainDataY[neighbour[0]] = 0: count0 += 1
    elif TrainDataY[neighbour[0]] = 1: count1 += 1
if count0 > count1: predictions = 0
else predictions = 1
Return predictions
5. Function to get the predictions
Input: TestDataX
Predictions = []
For each TestRow in TestDataX
    Append the predicted value for TestRow into Predictions
Return Predictions
6. Function to return the Class probability
Input: TestDataX
Percentage = []
For each TestRow in TestDataX
    k-nearest_neighbours = getNeighbors(TestRow)
    count0, count1 = 0, 0
    for each Index and Neighbour in k-nearest_neighbours
        if TrainDataY[Index] = 0.0: count0 += 1
        elif TrainDataY[Index] = 1.0: count1 += 1
    if count0 > count1: Percentage.append(count0/5)
    else Percentage.append(count1/5)
Return Percentage
7. Function to get an Accuracy
Input: TestDataY, predictions
Correct = 0
For y = 0 to length of TestDataY
    If TestDataY[y] = predictions[y]
        Correct += 1
Return (Correct / length of TestDataY) * 100
End of the Class
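The flow of Algorithm 13 can be illustrated with a short Python sketch. It is not the class from Section 8.4: the names KNearestNeighbors, predict_row, _count_votes and accuracy are hypothetical, labels are assumed to be 0 or 1, and the number of neighbours is passed as a parameter (default 5) rather than fixed.

```python
import math


class KNearestNeighbors:
    """Illustrative k-NN classifier for binary labels (assumed to be 0 or 1)."""

    def __init__(self, train_x, train_y, neighbours=5):
        self.train_x = train_x
        self.train_y = train_y
        self.neighbours = neighbours  # assumption: k is configurable

    @staticmethod
    def euclidean_distance(row1, row2):
        # Sum the squared per-feature differences, then take the square root.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(row1, row2)))

    def get_neighbors(self, test_row):
        # Pair every training index with its distance to the test row.
        distances = [
            (ind, self.euclidean_distance(test_row, row))
            for ind, row in enumerate(self.train_x)
        ]
        distances.sort(key=lambda pair: pair[1])
        return distances[: self.neighbours]

    def _count_votes(self, test_row):
        # Count how many of the nearest neighbours belong to each class.
        count0 = count1 = 0
        for ind, _dist in self.get_neighbors(test_row):
            if self.train_y[ind] == 0:
                count0 += 1
            elif self.train_y[ind] == 1:
                count1 += 1
        return count0, count1

    def predict_row(self, test_row):
        count0, count1 = self._count_votes(test_row)
        # Ties go to class 1, as in the pseudo-code.
        return 0 if count0 > count1 else 1

    def predict(self, test_x):
        return [self.predict_row(row) for row in test_x]

    def predict_proba(self, test_x):
        # Share of the k neighbours that voted for the winning class.
        return [max(self._count_votes(row)) / self.neighbours for row in test_x]


def accuracy(test_y, predictions):
    # Percentage of predictions that match the true labels.
    correct = sum(1 for actual, pred in zip(test_y, predictions) if actual == pred)
    return correct / len(test_y) * 100
```

Keeping the vote count in one helper means the prediction and the class probability are always derived from the same neighbours.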
4.3.4. Naïve Bayesian (NB) Method
In this research study, the NB method is implemented as a class named Naïve_Bayesian, which contains a constructor to initialize its values. Table 8.4 in Section 8 lists the names and descriptions of the properties of the Naïve_Bayesian class. The class definition of the NB method is given in Algorithm 14 of Section 8.4. The following pseudo-code shows the flow of the NB method as defined in Section 8.4.
Algorithm 14: Naïve Bayesian (NB)
Input: TrainDataX, TrainDataY, Random Seed
Output: Predicted Output
Begin
1. Function to calculate the mean: sum(numbers) / length of numbers
2. Function to calculate the standard deviation: $\sqrt{\frac{\sum_{i=0}^{n} (x_i - \text{mean})^2}{n}}$
3. Function to calculate the Gaussian Distribution (CalculateProbability):
Input: RowValue, mean, standard deviation
Exponent = $\exp\left(-\frac{(\text{RowValue} - \text{mean})^2}{2(\text{standard deviation})^2}\right)$
Return $\frac{1}{\sqrt{2\pi}\,(\text{standard deviation})} \times \text{Exponent}$
4. Function to summarise the input data:
Input: TrainData
Summary = mean(TrainData), standard deviation(TrainData)
Return Summary
5. Function to separate transactions by class value (seperateByClass):
Input: TrainDataX, TrainDataY
Separated = {}
For i = 0 to length of TrainDataX
    Vector = TrainDataX[i]
    If TrainDataY[i] not in Separated
        Separated[TrainDataY[i]] = []
    Append Vector into Separated[TrainDataY[i]]
Return Separated
6. Function to summarise values by class (SummarizeByClass):
Separated = seperateByClass()
Summaries = {}
For each classValue and instances in Separated
    Summaries[classValue] = summarise(instances)
Return Summaries
7. Function to calculate the class probability:
Input: TestDataInstance
Probabilities = {}
Summaries = summarizeByClass()
For each classValue, classSummaries in Summaries
    Probabilities[classValue] = 1
    For ind = 0 to length of classSummaries:
        Mean, standard deviation = classSummaries[ind]
        RowValue = TestDataInstance[ind]
        Probabilities[classValue] *= CalculateProbability(RowValue, Mean, standard deviation)
Return Probabilities
8. Function to predict the output of the transaction:
Input: TestDataInstance
Probabilities = calculateClassProbabilities(TestDataInstance)
bestLabel, bestProb = None, -1
for each classValue, probability in Probabilities
    if bestLabel is None or probability > bestProb
        bestProb = probability
        bestLabel = classValue
Return bestLabel
9. Function to get Predictions
Input: TestDataX
Predictions = []
For i = 0 to length of TestDataX
    Result = predict(TestDataX[i])
    Append Result in Predictions
Return Predictions
10. Function to return probabilities
Input: TestDataX
Probas = []
For each TestDataInstance in TestDataX
    Proba = calculateClassProbabilities(TestDataInstance)
    Proba = maximum likelihood estimator (Proba)
    Append Proba in Probas
Return Probas
11. Function to get an Accuracy
Input: TestDataY, predictions
Correct = 0
For y = 0 to length of TestDataY
    If TestDataY[y] = predictions[y]
        Correct += 1
Return (Correct / length of TestDataY) * 100
End of the Class
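The steps of Algorithm 14 can be sketched in Python as a set of small functions. This is an illustration under stated assumptions, not the class from Section 8.4: the function names are hypothetical, the standard deviation is assumed to use the population form (division by n), every feature is assumed to vary within each class so that the standard deviation is non-zero, and, as in the pseudo-code, class priors are not included.

```python
import math
from collections import defaultdict


def mean(numbers):
    return sum(numbers) / len(numbers)


def stdev(numbers):
    # Assumption: population form (divide by n, not n - 1).
    avg = mean(numbers)
    return math.sqrt(sum((x - avg) ** 2 for x in numbers) / len(numbers))


def calculate_probability(row_value, avg, sd):
    # Gaussian density of row_value for the given mean and standard deviation.
    exponent = math.exp(-((row_value - avg) ** 2) / (2 * sd ** 2))
    return (1 / (math.sqrt(2 * math.pi) * sd)) * exponent


def separate_by_class(train_x, train_y):
    # Group the feature vectors by their class label.
    separated = defaultdict(list)
    for vector, label in zip(train_x, train_y):
        separated[label].append(vector)
    return separated


def summarize_by_class(train_x, train_y):
    # One (mean, standard deviation) pair per feature column, per class.
    summaries = {}
    for label, rows in separate_by_class(train_x, train_y).items():
        summaries[label] = [(mean(col), stdev(col)) for col in zip(*rows)]
    return summaries


def calculate_class_probabilities(summaries, instance):
    # Multiply the per-feature densities for each class (naive independence).
    probabilities = {}
    for label, class_summaries in summaries.items():
        probabilities[label] = 1.0
        for (avg, sd), value in zip(class_summaries, instance):
            probabilities[label] *= calculate_probability(value, avg, sd)
    return probabilities


def predict(summaries, instance):
    # Return the class label with the largest likelihood.
    probabilities = calculate_class_probabilities(summaries, instance)
    return max(probabilities, key=probabilities.get)
```

In this sketch the summaries are computed once and passed to the prediction functions, which avoids recomputing the class statistics for every test instance.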
4.3.5. Decision Tree (DT) Method
The final base model of the stacking ensemble, the DT method, is implemented as a class named Decision_tree, which contains a constructor to initialize its values. Table 8.5 in Section 8 lists the names and descriptions of the properties of the Decision_tree class. The class definition of the DT method is given in Algorithm 15 of Section 8.4. The following pseudo-code shows the flow of the DT method as defined in Section 8.4.
Algorithm 15: Decision Tree (DT)
Input: Gini_Criterion, Max_depth, Min_samples_leaf, Entropy_Criterion, Random Seed
Output: Predicted Output
Begin
1. Function to train the dataset using Gini
Input: criterion, random state, max depth, min sample leaf, TrainX, TrainY
Clf_gini = decisionTreeClassifier(Input)
Fit Clf_gini model with TrainX and TrainY
Return Clf_gini
2. Function to train the dataset using Entropy
Input: criterion, random state, max depth, min sample leaf, TrainX, TrainY
Clf_entropy = decisionTreeClassifier(Input)
Fit Clf_entropy model with TrainX and TrainY
Return Clf_entropy
3. Function to predict the probability
Input: TestX, clf_object
Predict_prob = clf_object.predict_proba(TestX)
Return Predict_prob
4. Function to make predictions
Input: TestX, clf_object
Y_pred = clf_object.predict(TestX)
Return Y_pred
5. Function to calculate the probability
Input: pred_list, pred_proba_list
Proba = []
For each index and value in pred_list
Append pred_proba_list[index][value] into Proba
Return Proba
6. Function to get an Accuracy
Input: TestDataY, predictions
Correct = 0
For y = 0 to length of TestDataY
    If TestDataY[y] = predictions[y]
        Correct += 1
Return (Correct / length of TestDataY) * 100
End of the Class
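The DT pseudo-code delegates tree construction to decisionTreeClassifier, so the two split criteria are not spelled out. As a hedged illustration, the sketch below computes the standard Gini impurity, $1 - \sum_k p_k^2$, and entropy, $-\sum_k p_k \log_2 p_k$, for a list of class labels, and reproduces step 5, which picks the probability of the predicted class for each transaction. The function names are hypothetical and are not taken from Section 8.4.

```python
import math
from collections import Counter


def class_proportions(labels):
    # Share of each class among the labels at a tree node.
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_impurity(labels):
    # 0 for a pure node; 0.5 for an evenly split binary node.
    return 1 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    # 0 for a pure node; 1 for an evenly split binary node.
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def predicted_class_probability(pred_list, pred_proba_list):
    # Step 5: for each row, keep the probability of the class that was predicted.
    # Assumption: predicted labels are integer column indices (0 or 1).
    return [pred_proba_list[index][value] for index, value in enumerate(pred_list)]
```

Both measures are zero for a node that contains a single class and largest when the classes are evenly mixed, which is why either can serve as the split criterion.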
The weighted stacking ensemble combines the base models defined above by voting on whether a given credit card transaction is illegal. The outputs of the base models were used to generate a new dataset, which was stored in a data dictionary whose columns are the names of the base models and whose rows are their predicted outputs. The data dictionary of the new dataset is discussed in the next section.
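A minimal sketch of this step is given below, assuming binary predictions (0 or 1) from each base model. The function names, the column names KNN, NB and DT, and the weights are hypothetical and chosen only for illustration; the actual data dictionary and weighting scheme are those described in the following sections.

```python
def build_meta_dataset(model_predictions):
    """Store base-model outputs column-wise: one column per model, one row per transaction."""
    lengths = {len(preds) for preds in model_predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all base models must predict the same transactions")
    return {name: list(preds) for name, preds in model_predictions.items()}


def weighted_vote(meta_dataset, weights):
    """Combine the columns row by row; the class with the larger total weight wins."""
    n_rows = len(next(iter(meta_dataset.values())))
    results = []
    for row in range(n_rows):
        score1 = sum(weights[name] for name, preds in meta_dataset.items() if preds[row] == 1)
        score0 = sum(weights[name] for name, preds in meta_dataset.items() if preds[row] == 0)
        # Assumption: ties are resolved in favour of class 0.
        results.append(1 if score1 > score0 else 0)
    return results


# Hypothetical predictions for three transactions and hypothetical model weights.
meta = build_meta_dataset({"KNN": [0, 1, 1], "NB": [0, 1, 0], "DT": [1, 0, 0]})
votes = weighted_vote(meta, {"KNN": 0.4, "NB": 0.35, "DT": 0.25})
```

With these illustrative weights, a single model is outvoted whenever the other two agree, since no weight exceeds the sum of the remaining two.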