Graph Representation comparison - Artificial Data Graphs

7.7.2 Artificial Data Graphs

7.7.2.2 Graph Representation comparison

The evaluation in this sub-section compared the various graph representations using the RW graph traversal (RW-SE, RW-SW, RW-SE-SW, RW-EW and RW-CW) with the BF, DF and SAR graph traversal methods. The characteristic behaviour of the RW-based graph representations was evaluated with respect to graph coverage and the performance of the WBD solution produced. The results show that it is possible to control the graph traversal of a dynamic approach to WBD such that the coverage of target and noise pages is influenced to increase the performance of the WBD solution while decreasing the cost of processing unwanted noise. Each graph representation carries an additional cost, which is associated with the particular mechanism used to control the traversal. In some cases this cost is not justified by a significant improvement in the WBD solution produced, when compared with the simple implementation and low cost of the alternative dynamic approaches evaluated.

Table 7.10: A comparison of ICA and IKM algorithms using the BF and RW graph traversal in terms of WBD score performance. Score values are shown for 2k and 10k steps for BF and RW respectively.

Data Set    Graph Traversal    Clustering Algorithm    Score
ADG Set1    BF                 IKM                     0.585
ADG Set1    BF                 ICA                     0.497
ADG Set1    RW                 IKM                     0.791
ADG Set1    RW                 ICA                     0.726
ADG Set2    BF                 IKM                     0.691
ADG Set2    BF                 ICA                     0.501
ADG Set2    RW                 IKM                     0.729
ADG Set2    RW                 ICA                     0.712

The dynamic approaches presented in this sub-section use the IKM algorithm to produce a WBD solution, as it was shown to be more effective in the previous sub-section (7.7.2.1). A substantial number of experiments were conducted using numerous combinations of each edge representation with various adjustments and parameters. In some cases the experiments generated results which either (1) did not vary significantly from a baseline, or (2) did not provide significant insight into the WBD problem. As a consequence, these tests have been omitted from the narrative presented in this section. The evaluated approaches that are reported in this sub-section are listed as follows:

• Breadth First (BF)

• Depth First (DF)

• Random Walk (RW)

• Random Walk using Similarity Weighting (RW-SW)

• Random Walk using Similarity Edges (RW-SE)

• Random Walk using Euclidean Weighting (RW-EW)

• Random Walk using Cluster Weighting (RW-CW)

• Random Walk using Similarity Edges and Similarity Weighting (RW-SE-SW)

• Self Avoiding Random (SAR) Walk
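The random-walk variants listed above differ mainly in how the next edge is chosen. The following is a minimal sketch of that idea, not the implementation used in the evaluation: the function names, the `weight` hook, the toy graph and the fallback rule for a boxed-in self-avoiding walk are all assumptions made for illustration.

```python
import random


def random_walk(graph, start, steps, weight=None, self_avoiding=False, seed=0):
    """Walk `graph` (dict: node -> list of neighbours) for at most `steps` steps.

    `weight(u, v)` is a hypothetical hook returning a positive edge weight;
    leaving it as None gives a plain RW with uniform edge selection.
    Returns the set of visited nodes.
    """
    rng = random.Random(seed)
    current = start
    visited = {start}
    for _ in range(steps):
        neighbours = graph.get(current, [])
        if self_avoiding:
            unvisited = [n for n in neighbours if n not in visited]
            # Assumption: fall back to any neighbour when every one is visited.
            neighbours = unvisited or neighbours
        if not neighbours:
            break
        if weight is None:
            current = rng.choice(neighbours)
        else:
            weights = [weight(current, n) for n in neighbours]
            current = rng.choices(neighbours, weights=weights, k=1)[0]
        visited.add(current)
    return visited


def coverage(visited, targets, all_nodes):
    """Fraction of target, noise and all nodes reached by a traversal."""
    noise = all_nodes - targets
    return {
        "target": len(visited & targets) / len(targets),
        "noise": len(visited & noise) / len(noise),
        "total": len(visited & all_nodes) / len(all_nodes),
    }


# Hypothetical toy graph: three target pages (t*) and two noise pages (n*).
toy_graph = {
    "t1": ["t2", "n1"],
    "t2": ["t1", "t3"],
    "t3": ["t2", "n2"],
    "n1": ["t1", "n2"],
    "n2": ["n1", "t3"],
}
toy_targets = {"t1", "t2", "t3"}


def favour_targets(u, v):
    """Illustrative weighting only: edges into target pages are 5x more likely."""
    return 5.0 if v in toy_targets else 1.0


visited = random_walk(toy_graph, "t1", steps=20, weight=favour_targets)
print(coverage(visited, toy_targets, set(toy_graph)))
```

A weighting such as `favour_targets` stands in for whatever similarity or cluster information a real graph representation would supply; computing that information is where the extra cost per step arises.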

The best performing dynamic approaches presented in this evaluation were RW-SE-SW and RW. The dynamic approach that covered the least noise was RW-CW, but this method has a considerable cost associated with the Cluster Weighting (CW) graph representation and does not produce a high performing WBD solution. The RW dynamic approach produced the best overall WBD solution when the amount of noise covered is weighed against its low cost and consistent performance on the data sets used for evaluation. The RW method was deemed the most appropriate method due to its simple operation, which is fast and does not require a high resource cost in terms of graph representation or selection of edges to traverse.

Table 7.11: The graph traversal and graph representation approaches (as indicated) ordered according to WBD performance score. The average graph coverage, WBD performance score and average time per step is shown for each approach using data set ADG Set1.

Approach    Coverage    Coverage    Coverage    Score    Time (ms)/    Total
            Target      Noise       Total                Step          Steps
RO          1.000       1.000       1.000       0.779    0.191         50000
RW-SE-SW    0.997       0.703       0.845       0.703    0.212         50000
RW          0.997       0.887       0.940       0.657    0.174         50000
RW-SE       0.995       0.692       0.838       0.653    0.167         50000
SAR         0.999       0.649       0.818       0.643    0.189         50000
RW-EW       0.997       0.869       0.931       0.637    0.234         50000
RW-SW       0.987       0.879       0.921       0.627    0.224         50000
DF          1.000       1.000       1.000       0.615    3.047         287
RW-CW       0.894       0.395       0.636       0.601    12.148        50000
BF          1.000       1.000       1.000       0.556    3.215         287

Table 7.12: The graph traversal and graph representation approaches (as indicated) ordered according to WBD performance score. The average graph coverage, WBD performance score and average time per step is shown for each approach using data set ADG Set2.

Approach    Coverage    Coverage    Coverage    Score    Time (ms)/    Total
            Target      Noise       Total                Step          Steps
DF          1.000       1.000       1.000       0.859    5.642         466
RO          1.000       1.000       1.000       0.766    0.259         50000
RW-SE-SW    0.959       0.459       0.731       0.745    0.262         50000
RW          0.979       0.633       0.821       0.676    0.184         50000
RW-EW       0.980       0.643       0.827       0.669    0.235         50000
RW-SW       0.970       0.663       0.827       0.659    0.225         50000
RW-SE       0.959       0.479       0.740       0.658    0.210         50000
BF          1.000       1.000       1.000       0.625    4.869         466
RW-CW       0.659       0.270       0.482       0.614    27.250        50000
SAR         0.984       0.638       0.826       0.557    0.175         50000

Tables 7.11 and 7.12 show the experimental results of each of the dynamic approaches listed above with respect to WBD performance for the data sets ADG set1 and set2 respectively. The tables show the graph coverage in terms of target, noise and total, the WBD performance score, and the average time taken per step for each of the approaches. The Random Ordering (RO) method is included in the tables; see the following section 7.7.2.3 for the evaluation of the RO approach.

The top performing graph representation in terms of WBD performance was the RW-SE-SW approach. The evaluation shows consistent performance on ADG set1 and set2 (Tables 7.11 and 7.12). Using both Similarity Edges (SE) and Similarity Weighting (SW) together in RW-SE-SW gave an increase in performance compared to using each individually (RW-SE and RW-SW). The RW-SE-SW method exhibited an increased average time to complete a step as a consequence of the graph representation used (SE and SW). The RW method produced a WBD performance score just below that of the RW-SE-SW method, and showed a much smaller average time to complete a step, which is due to its comparatively simple operation.

The RO method performs comparatively well on both ADG set1 and set2. Its high performance is due to its randomised clustering approach, which is independent of graph structure. This allows the clustering algorithm to randomise the ordering of the nodes it clusters on a constant basis, which increases cluster performance. The DF method performed better on set2 than on set1 because of the underlying graph structure of set2. The connections between target and noise clusters have fuzzy edges (the clusters are more stringy) in set2. This allows the DF method to gain an improved initial clustering compared to the other methods, as a deep crawl is performed initially.

Figures 7.15 and 7.17 illustrate the graph coverage history of the RW and SAR approaches for data sets ADG set1 and set2 respectively. Figures 7.14 and 7.16 show the history of the RW and SAR methods in terms of WBD performance.

RW-CW is consistently the worst performer in terms of recall for both ADG set1 (Figure 7.14b) and set2 (Figure 7.16b). In contrast, it is the best performer in terms of precision (Figures 7.14c and 7.16c), which can be explained using the graph coverage measures. The overall coverage of the graph using the RW-CW approach is much lower than that of the other approaches evaluated, and the pages covered include a much larger proportion of target than noise web pages. This increases precision, because the ratio of target to noise pages in the website boundary is higher, while recall remains low, because the number of target pages included in the website boundary is small compared to the number of target pages contained in the website.
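The trade-off described above can be illustrated with the standard definitions of precision and recall. The sketch below uses made-up page counts, not values from the evaluation, and makes the simplifying assumption that the website boundary is exactly the set of pages a traversal reaches.

```python
def precision_recall(boundary, targets):
    """Standard precision and recall of a proposed boundary against the true targets."""
    true_pos = len(boundary & targets)
    precision = true_pos / len(boundary) if boundary else 0.0
    recall = true_pos / len(targets) if targets else 0.0
    return precision, recall


# Hypothetical website: 100 target pages; noise pages are named n0, n1, ...
targets = {f"t{i}" for i in range(100)}

# Low coverage, mostly target pages (the behaviour described for RW-CW).
narrow = {f"t{i}" for i in range(60)} | {f"n{i}" for i in range(5)}

# High coverage, including far more noise pages.
broad = {f"t{i}" for i in range(99)} | {f"n{i}" for i in range(40)}

p_narrow, r_narrow = precision_recall(narrow, targets)
p_broad, r_broad = precision_recall(broad, targets)
print(f"narrow: precision={p_narrow:.3f} recall={r_narrow:.3f}")
print(f"broad:  precision={p_broad:.3f} recall={r_broad:.3f}")
```

With these counts the narrow boundary has the higher precision and the lower recall, which matches the pattern reported for RW-CW.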

The RW-SE and RW-SE-SW approaches perform consistently across set1 and set2 in terms of precision (Figures 7.14c and 7.16c), recall (Figures 7.14b and 7.16b) and coverage (Figures 7.15a and 7.17a). The RW-SE-SW approach produced a slightly higher accuracy for the graphs in set2 (Figure 7.16a) than in set1. This implies that adding Similarity Edges (SE) improves the WBD solutions produced when the target pages in the graph are less connected. The SE method provides no observable benefit when the target pages in the graph are densely connected, although the Similarity Weighting (SW) method improved the overall WBD performance, as shown for set1 (Figure 7.14a).

The DF method outperforms the other methods in terms of WBD performance score when evaluated using ADG set2. The high performance of the DF method can be attributed to the fact that the graph structure of set2 lends itself to a depth first crawl. Because of its deep search, the DF method crawls noise and target web pages in an order that reflects an overall view of the data. This ordering is much more beneficial for a clustering algorithm with respect to WBD.

The RW-EW method covers the most target and noise web pages compared to the other RW methods with respect to both set1 and set2 (Figures 7.15c and 7.17c); however, there is no significant increase in the performance of the WBD solutions produced (Figures 7.14a and 7.16a). The Euclidean Weighted (EW) edges have the effect of increasing coverage of the graph compared to the plain RW. This has the adverse effect of covering more noise web pages, which is due to increasing the Euclidean weighting of edges that are not part of the website but exhibit some similarity, and so receive an increased weighting.

The SAR method performs much better on set1 than on set2 in terms of WBD performance accuracy (Figures 7.14a and 7.16a). On set1 the coverage of the SAR method produced a desirable result in terms of reducing noise coverage, as the amount of noise covered is less than that of the other RW methods. In the less dense web graph of set2, the SAR method declined in terms of both WBD performance score (Table 7.12) and graph coverage (Figures 7.17b and 7.17c).

In document Website boundary detection via machine learning (Page 180-184)