2. Applied Machine Learning Algorithms: Strengths and Pitfalls

2.3 Critical assessment of applied machine learning algorithms for the 3rd research

2.3.1 The Apriori algorithm

This section describes the most popular association rule learning algorithm, the Apriori algorithm, which was used to conduct the experiments for the third research project. This unsupervised learning method extracts the rules that best explain the observed relationships between variables in a transactional dataset. The algorithm allows the discovery of important and commercially useful associations in large multidimensional datasets, such as the underlying Berka dataset.

Agrawal, Imieliński and Swami (1993) define the concept of association rule learning through a set of formal definitions as follows:

Definition (Association rule) - Let I = {i₁, i₂, i₃, …, iₙ} be a set of n payment types, known as items, and let D = {t₁, t₂, t₃, …, tₘ} be a set of m transactions, known as the database. Every transaction tᵢ in D has a unique transaction ID and consists of a subset of the items in I. A rule is defined as an implication X ⟶ Y, where X and Y are subsets of I (X, Y ⊆ I) that have no element in common, that is, X ∩ Y = ∅. X and Y are the antecedent and the consequent of the rule, respectively.

Figure 2-12: Visualization of frequent itemset generation using Apriori learning algorithm

Figure 2-12 shows a small practical example of frequent itemset generation in which the Apriori learning algorithm is applied to payment transactions. The set of items, I = {Heating, Rent, Insurance, Electricity, Loan}, contains five payment types, and the database consists of six payment transactions.

Each payment transaction is a tuple of 0s (absence of an item) and 1s (presence of an item). On that basis, interesting and significant rules can be identified in a transactional dataset using measures such as support, confidence and lift:
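
The binary representation of transactions can be sketched in Python as follows. The six transactions are hypothetical stand-ins chosen for illustration and are not the data shown in Figure 2-12; the names ITEMS, TRANSACTIONS and encode are likewise assumptions.

```python
# Item order follows the set I given in the text.
ITEMS = ["Heating", "Rent", "Insurance", "Electricity", "Loan"]

# Hypothetical transactions for illustration only (not the data in Figure 2-12).
TRANSACTIONS = [
    {"Heating", "Rent", "Insurance"},
    {"Heating", "Rent", "Insurance", "Electricity"},
    {"Heating", "Rent", "Insurance"},
    {"Heating", "Rent", "Loan"},
    {"Insurance", "Electricity"},
    {"Rent", "Loan"},
]


def encode(transaction, items=ITEMS):
    """Return a tuple with 1 where an item is present and 0 where it is absent."""
    return tuple(int(item in transaction) for item in items)


ENCODED = [encode(t) for t in TRANSACTIONS]

for row in ENCODED:
    print(row)
```

Each row has one position per item in I, so the first hypothetical transaction becomes (1, 1, 1, 0, 0).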

Definition (Support) - The support of an itemset X, supp(X), is the proportion of transactions in the database in which the itemset X appears. It describes the popularity of an itemset.

$$\mathrm{supp}(X) = \frac{\text{Number of transactions in which } X \text{ appears}}{\text{Total number of transactions}}$$

Following the figure above, supp({Heating}) = 4/6 ≈ 0.667.
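
A minimal Python sketch of the support measure is given here. The transactions are hypothetical and only chosen so that Heating appears in four of six transactions; they are not taken from Figure 2-12, and the function name support is an assumption.

```python
from fractions import Fraction

# Hypothetical transactions for illustration only (not the data in Figure 2-12).
TRANSACTIONS = [
    {"Heating", "Rent", "Insurance"},
    {"Heating", "Rent", "Insurance", "Electricity"},
    {"Heating", "Rent", "Insurance"},
    {"Heating", "Rent", "Loan"},
    {"Insurance", "Electricity"},
    {"Rent", "Loan"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


print(float(support({"Heating"}, TRANSACTIONS)))
```

Exact fractions are used instead of floats so that the ratio 4/6 is kept without rounding until it is printed.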

Definition (Confidence) - Confidence of a rule is defined as follows:

$$\mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$$

It gives the likelihood that payment type (item) Y is executed when payment type (item) X is executed. In the example above, the rule {Heating, Rent} → {Insurance} holds for 75% of the payment transactions that contain both Heating and Rent. However, this measure takes only the popularity of itemset X into account, not the popularity of Y. The lift measure overcomes this drawback:
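
The confidence of the rule {Heating, Rent} → {Insurance} can be reproduced with a short Python sketch. The transactions are hypothetical and only constructed so that the rule reaches 75%; they are not the data from Figure 2-12, and the function names are assumptions.

```python
from fractions import Fraction

# Hypothetical transactions for illustration only (not the data in Figure 2-12).
TRANSACTIONS = [
    {"Heating", "Rent", "Insurance"},
    {"Heating", "Rent", "Insurance", "Electricity"},
    {"Heating", "Rent", "Insurance"},
    {"Heating", "Rent", "Loan"},
    {"Insurance", "Electricity"},
    {"Rent", "Loan"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    x = frozenset(antecedent)
    supp_x = support(x, transactions)
    if supp_x == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return support(x | frozenset(consequent), transactions) / supp_x


CONF = confidence({"Heating", "Rent"}, {"Insurance"}, TRANSACTIONS)
print(float(CONF))
```

In this hypothetical data, four transactions contain both Heating and Rent and three of them also contain Insurance, which gives 3/4.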

Definition (Lift) - The lift of a rule is defined as:

$$\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X) \cdot \mathrm{supp}(Y)}$$

It shows the likelihood of itemset Y being executed when itemset X is executed, while taking the popularity of Y into account. A lift value greater than 1 indicates that itemset Y is likely to be executed together with itemset X, whereas a lift value less than 1 means that itemset Y is unlikely to be executed if itemset X is executed.
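
The lift measure can be sketched in Python in the same way. The transactions are again hypothetical and not the data from Figure 2-12; the function names support and lift are assumptions made for this illustration.

```python
from fractions import Fraction

# Hypothetical transactions for illustration only (not the data in Figure 2-12).
TRANSACTIONS = [
    {"Heating", "Rent", "Insurance"},
    {"Heating", "Rent", "Insurance", "Electricity"},
    {"Heating", "Rent", "Insurance"},
    {"Heating", "Rent", "Loan"},
    {"Insurance", "Electricity"},
    {"Rent", "Loan"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def lift(antecedent, consequent, transactions):
    """lift(X -> Y) = supp(X u Y) / (supp(X) * supp(Y))."""
    x, y = frozenset(antecedent), frozenset(consequent)
    denominator = support(x, transactions) * support(y, transactions)
    if denominator == 0:
        raise ValueError("X or Y never occurs, lift is undefined")
    return support(x | y, transactions) / denominator


LIFT = lift({"Heating", "Rent"}, {"Insurance"}, TRANSACTIONS)
print(float(LIFT))
```

For this hypothetical data the supports are 1/2 for the full itemset and 2/3 for each side of the rule, so the lift is 9/8, slightly above 1.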

The general learning process of the Apriori algorithm for frequent itemset generation is illustrated step by step in Figure 2-12. A more detailed explanation of how the entire Apriori algorithm works is provided by Zaki (2001), Fournier-Viger et al. (2012), and Anastasiu, Iverson, Smith and Karypis (2014). If the prerequisites in the above example are met, the algorithm works as follows:

The first step creates a frequency table of all payment types (items) that occur in the six transactions. The second step keeps only those items whose support meets the threshold of ≥ 50%, which yields the single payment types that bank customers execute frequently. Step 3 forms all possible pairs of these frequent payment types, regardless of their order. Step 4 counts the occurrences of each pair across all six transactions. Step 5 retains only those pairs that meet the support threshold of 50%. Finally, step 6 performs a self-join on the remaining pairs to create larger candidate itemsets and applies the threshold of ≥ 50% once more, which completes the process.
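
The level-wise procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in the research: the transactions are hypothetical (not the data in Figure 2-12), and the function name frequent_itemsets and its parameters are invented for this example.

```python
from itertools import combinations

# Hypothetical transactions for illustration only (not the data in Figure 2-12).
TRANSACTIONS = [
    {"Heating", "Rent", "Insurance"},
    {"Heating", "Rent", "Insurance", "Electricity"},
    {"Heating", "Rent", "Insurance"},
    {"Heating", "Rent", "Loan"},
    {"Insurance", "Electricity"},
    {"Rent", "Loan"},
]


def frequent_itemsets(transactions, min_support=0.5):
    """Return {itemset: count} for all itemsets meeting the support threshold."""
    min_count = min_support * len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Steps 1-2: count single items and keep the frequent ones.
    singles = {frozenset([item]) for t in transactions for item in t}
    level = {s for s in singles if count(s) >= min_count}

    frequent = {}
    k = 1
    while level:
        frequent.update({s: count(s) for s in level})
        k += 1
        # Self-join: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        # Count the candidates and apply the threshold again.
        level = {c for c in candidates if count(c) >= min_count}
    return frequent


FREQUENT = frequent_itemsets(TRANSACTIONS, min_support=0.5)
LARGEST = max(FREQUENT, key=len)
print(sorted(LARGEST))
```

With the hypothetical data, Electricity and Loan are dropped in the second step, all three pairs of the remaining items survive, and the largest frequent itemset is {Heating, Insurance, Rent}.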

Figure 2-13: Visualization of association rule generation using Apriori learning algorithm

Continuing the example above, Figure 2-13 depicts the outcome of association rule generation using the Apriori learning algorithm. A detailed explanation of the theory of rule generation using the Apriori algorithm can be found in Hahsler, Grün and Hornik (2005), Hahsler and Chelluboina (2011), Hahsler et al. (2011), Anastasiu, Iverson, Smith, Karypis, et al. (2014), and Johnson (2018). The general two-step approach for finding association rules efficiently works as follows:

The first step, frequent itemset generation, finds all itemsets whose support meets the support threshold, following steps 1-6 described in Figure 2-12. The learning algorithm finally returns the frequent itemset “HRI” (Heating, Rent, Insurance).

The second step, rule generation, creates candidate rules from each frequent itemset by binary partitioning and selects those with high confidence. The frequent itemset in our example consists of three elements (k = 3); all 2^k − 2 = 6 possible candidate association rules derived from “HRI” are shown in Figure 2-13.

For example, one possible association rule is {Heating (H), Rent (R)} → {Insurance (I)}, which means that if transactions for heating and rent are executed, banking clients also tend to perform a transaction for insurance.
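
Rule generation by binary partition can be sketched in Python as follows. The transactions are hypothetical (not the data behind Figure 2-13), the minimum confidence of 0.7 is an arbitrary value chosen for illustration, and the function names are assumptions.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical transactions for illustration only (not the data in the figures).
TRANSACTIONS = [
    {"Heating", "Rent", "Insurance"},
    {"Heating", "Rent", "Insurance", "Electricity"},
    {"Heating", "Rent", "Insurance"},
    {"Heating", "Rent", "Loan"},
    {"Insurance", "Electricity"},
    {"Rent", "Loan"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if frozenset(itemset) <= t)
    return Fraction(hits, len(transactions))


def candidate_rules(itemset):
    """All binary partitions (X, Y) of `itemset` with X and Y non-empty."""
    items = frozenset(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(sorted(items), size):
            x = frozenset(antecedent)
            rules.append((x, items - x))
    return rules


def strong_rules(itemset, transactions, min_confidence):
    """Keep the candidate rules whose confidence meets `min_confidence`."""
    kept = []
    for x, y in candidate_rules(itemset):
        conf = support(x | y, transactions) / support(x, transactions)
        if conf >= min_confidence:
            kept.append((x, y, conf))
    return kept


HRI = {"Heating", "Rent", "Insurance"}
RULES = candidate_rules(HRI)                            # 2**3 - 2 candidates
STRONG = strong_rules(HRI, TRANSACTIONS, Fraction(7, 10))  # assumed threshold

for x, y, conf in STRONG:
    print(sorted(x), "->", sorted(y), float(conf))
```

A frequent itemset with k items yields 2^k − 2 candidate rules because the empty set and the full itemset are excluded as antecedents.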

In document Predictive Modelling of Retail Banking Transactions for Credit Scoring, Cross-Selling and Payment Pattern Discovery (Page 73-76)