Partial Support Trees P-trees - Frequent Pattern Mining

Chapter 2: Literature Review and Related Work

2.4. Frequent Pattern Mining

2.4.3. Partial Support Trees (P-trees)

The most computationally expensive part of association rule mining algorithms such as Apriori and FP-Growth is identifying the subsets of a record that are members of the candidate set being considered, particularly for records that include a large number of attributes [34]. This can be avoided by first counting only the sets that occur as records in the database, without considering their subsets [101].

Let I be the set of n attributes in the database and let i be a subset of I. The partial support P_i of the set i is defined as the number of records whose contents are identical to i, and T_i denotes the total support for i, that is, the number of records that contain i. The total support can be obtained from the partial supports as follows:

$$T_i = \sum_{j \supseteq i} P_j$$

Equation 7 [34]
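
As a minimal sketch of Equation 7, assuming the summation runs over every recorded set that contains i and that records are given as strings of single-letter attributes, the partial supports can be collected in one pass and a total support obtained from them. The toy database and the function names below are hypothetical and are not taken from [34].

```python
from collections import Counter


def partial_supports(records):
    """Count P_i: the number of records whose attribute set is exactly i."""
    return Counter(frozenset(record) for record in records)


def total_support(itemset, partial):
    """Compute T_i by summing P_j over every recorded set j that contains i."""
    target = frozenset(itemset)
    return sum(count for j, count in partial.items() if target <= j)


# Hypothetical toy database over I = {A, B, C, D}.
records = ["ABC", "BC", "BC", "ABCD", "AD", "BCD"]
P = partial_supports(records)

print(P[frozenset("BC")])      # partial support of BC: 2
print(total_support("BC", P))  # total support of BC: 5
```

In this example the six records collapse to five distinct sets, so the summation for a total support touches fewer entries than a scan of the records would.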

For a database of m records, the partial supports can be counted easily in a single database pass, producing m' partial totals for some m' ≤ m. Rymon's set enumeration framework [99] can be used to store all counts in a tree; Figure 8 shows this for I = {A, B, C, D}. To avoid the potentially exponential size of this tree, it is constructed as the database is scanned so that it includes only those nodes that represent sets actually present as records in the database, together with some additional nodes created to preserve the tree structure where necessary. The cost of constructing this tree and its size are therefore linearly related to m rather than to 2^n.

Advantage can be taken of the structural relationships between sets of attributes in the tree by using the construction phase to begin the computation of total supports. As each set is located within the tree during the database pass, it is computationally inexpensive to add to the interim support counts, Q_i, stored for the subsets that precede it in the tree ordering.

The numbers associated with the nodes in Figure 8 are these interim support counts. They are the counts that would be stored in the tree constructed from a database whose records comprise exactly one instance of each possible set. Hence, for instance, Q(BC) = 2 is derived from one instance of BC and one of BCD, and the total support for BC follows as:

$$T(BC) = Q(BC) + P(ABC) + P(ABCD) = Q(BC) + Q(ABC)$$
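
The worked example can be checked with a short sketch. It assumes that the interim count Q_i sums the partial supports of i and of the sets below it in a lexicographically ordered set enumeration tree, that is, the sets whose ordered attribute string starts with i. It also assumes that the example database holds one record for each non-empty subset of I. Both assumptions, and the function names, are illustrative rather than taken from [34].

```python
from itertools import combinations

ITEMS = "ABCD"

# One record for each non-empty subset of I, so every partial support is 1.
P = {"".join(c): 1
     for r in range(1, len(ITEMS) + 1)
     for c in combinations(ITEMS, r)}


def interim_support(i):
    """Q_i under the prefix assumption: sum P_j over i and the sets in its subtree."""
    return sum(p for j, p in P.items() if j.startswith(i))


def total_support(i):
    """T_i by direct summation of P_j over all supersets j of i."""
    return sum(p for j, p in P.items() if set(i) <= set(j))


Q = {i: interim_support(i) for i in P}

print(Q["BC"], Q["ABC"])    # 2 2
print(total_support("BC"))  # 4, equal to Q["BC"] + Q["ABC"]
```

Under these assumptions Q(A) = 8 and Q(B) = 4, and the identity T(BC) = Q(BC) + Q(ABC) holds because ABC and ABCD are the only supersets of BC that lie outside the subtree rooted at BC.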

The structure described above is named the P-tree (partial support tree) and was developed by [34]; the name refers to this incomplete set enumeration tree of interim support counts. The algorithm for constructing the P-tree counts the interim totals, and the resulting tree contains all the relevant data stored in the original database. Research has shown that this concept can be applied in almost any algorithm to complete the summation of total supports [102] [103].

[Figure 8: set enumeration tree for I = {A, B, C, D} with interim support counts. Node counts: A (8), AB (4), ABC (2), ABCD (1), ABD (1), AC (2), ACD (1), AD (1), B (4), BC (2), BCD (1), BD (1), C (2), CD (1), D (1).]

Using the P-tree as a surrogate for the original database offers two potential advantages. First, when n is small (2^n < m), traversing the tree to examine every node will be notably quicker than scanning the whole database. Second, even for large n, if the database includes a high degree of duplication (m' < m), using the tree will be significantly faster than a full database pass, particularly if the duplicated records are densely populated with attributes. In addition, the computation required in each cycle of the algorithm is significantly reduced because of the partial summation already carried out in constructing the tree.

For instance, considering pairs of attributes in the second pass of Apriori, a record including r attributes may require the counts for each of its r(r - 1)/2 subset pairs to be incremented. When examining a node of the P-tree, by contrast, it is only necessary to consider those subsets not already covered by its parent node, which in the best case is only r - 1 subsets. To illustrate this, consider the node ABCD in Figure 8. The partial total for ABCD has already been included in the interim total for ABC, and this will be added to the final totals for the subsets of ABC when the latter node is examined. When examining the node ABCD, therefore, it is only necessary to consider those subsets not covered by its parent, namely those including the attribute D. The gain from this increases with the number of attributes in the set being considered.

The P-tree is similar in structure to the FP-tree described previously; it has a different form but similar properties. The FP-tree is built in two database passes, so that attributes that fail to reach the support threshold can first be eliminated and the remaining attributes ordered by frequency of occurrence. The FP-tree also stores a single attribute at each node, so that each path in the tree represents and counts one or more records in the database. Moreover, it includes additional structural information that links all the nodes representing any one attribute into a list. This structure supports the FP-growth algorithm, which successively generates subtrees from the FP-tree corresponding to each frequent attribute, representing all sets in which the attribute is associated with its predecessors in the tree ordering. The two structures, the FP-tree and the P-tree, were developed separately and are combined in the new algorithm discussed in Chapter 3.
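
Returning to the pair-counting argument for the node ABCD, the saving in the second pass of Apriori can be illustrated with a small sketch. The helper names are hypothetical, and the code only enumerates which pairs would need their counts incremented; it is not the counting procedure of [34].

```python
from itertools import combinations


def all_pairs(record):
    """All r(r - 1)/2 attribute pairs of a record with r attributes."""
    return list(combinations(sorted(record), 2))


def pairs_not_covered_by_parent(node, parent):
    """Pairs of `node` containing at least one attribute absent from `parent`."""
    new_attributes = set(node) - set(parent)
    return [pair for pair in combinations(sorted(node), 2)
            if new_attributes & set(pair)]


print(len(all_pairs("ABCD")))                      # 6 pairs for a raw record
print(pairs_not_covered_by_parent("ABCD", "ABC"))  # only AD, BD and CD remain
```

For r = 4 the raw record yields six pairs, while the node ABCD under its parent ABC contributes only the r - 1 = 3 pairs that include D.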

In document A new strategy for case based reasoning retrieval using classification based on association (Page 54-57)