
7.6 Issues of the Dynamic Approach to WBD

7.7.2 Artificial Data Graphs

7.7.2.3 Random Ordering

This sub-section reports the WBD performance of the Random Ordering (RO) method of selecting pages from the graph. The RO method is not strictly a dynamic approach to the WBD problem, because the entire graph must be known a priori (see section 7.7.2.3). RO selects pages uniformly at random from all pages in the graph, independently of the link structure. Such a traversal is possible only if all web pages in the graph are known at the outset, which is equivalent to imposing a complete set of edges between all pages. Evaluating RO provides a point of comparison: a method that produces a randomised ordering of pages but does not depend on the link structure of the graph.
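The selection rule described above can be illustrated with a minimal sketch. It assumes that pages are drawn with replacement, so revisits are possible, and that the full page list is available up front. The function name `random_ordering` and the toy page identifiers are hypothetical and are not taken from the thesis.

```python
import random


def random_ordering(pages, steps, seed=None):
    """Yield `steps` pages drawn uniformly at random, with replacement.

    Assumption: the whole page set is known in advance. No edges are
    consulted, so the ordering is independent of the link structure.
    """
    rng = random.Random(seed)
    pages = list(pages)
    for _ in range(steps):
        yield rng.choice(pages)


# Hypothetical toy graph: three target pages (t*) and two noise pages (n*).
pages = ["t1", "t2", "t3", "n1", "n2"]
order = list(random_ordering(pages, steps=8, seed=0))
```

Because every page is equally likely at every step, target and noise pages appear in proportion to their share of the graph, which is the behaviour the comparison with link-following traversals relies on.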

Figure 7.14: The average WBD performance of the dynamic approaches indicated on data set ADG set1.

(a) Accuracy (b) Recall

Figure 7.15: The graph coverage of the dynamic approaches indicated on data set ADG set1.

(a) Total Coverage (b) Target Coverage

Figure 7.16: The average WBD performance of the dynamic approaches indicated on data set ADG set2.

(a) Accuracy (b) Recall

Figure 7.17: The graph coverage of the dynamic approaches indicated on data set ADG set2.

(a) Total Coverage (b) Target Coverage

Tables 7.11 and 7.12 compare the approaches with respect to WBD performance for the data sets ADG set1 and set2 respectively. The tables show: (1) the graph coverage in terms of target pages, noise pages and total pages, (2) the WBD performance score, and (3) the average time taken per step for each approach. The RO method provides the highest performing WBD solution of the approaches considered, provided the number of pages in the graph is not too large. This is shown in Table 7.11, where RO produces the highest performing WBD solution for ADG set1, whereas it fails to do the same for the larger ADG set2, as shown in Table 7.12. Figures 7.18 and 7.19 show the history of WBD performance and graph coverage for the RO, BF, DF, RW and SAR approaches on ADG set1 and set2 respectively. The RO method improves accuracy, recall and precision for set1 within a small number of steps (Figures 7.18a, 7.18b and 7.18c), whereas RW and SAR need a larger number of steps to cover web pages effectively and so randomise the selection of pages enough to produce an increasing accuracy (Figure 7.18a).

The evaluation of the RO method on set2 was consistent with that on set1 in terms of the accuracy of the WBD solution produced: the accuracy was higher than that of the BF, DF, RW and SAR methods (Figure 7.19a). A different result was obtained for recall and precision, where the DF method is the best performing solution (Figures 7.19b and 7.19c). The lower recall and precision of the RO method on ADG set2 arise because RO cannot use the link structure of the graph to explore regions of connected and related pages. When applied to graphs of increasing size, in terms of both target and noise pages, it therefore cannot adjust to visit more target pages and fewer noise pages. The ordering of target and noise pages produced by RO from a large graph makes it difficult to make the correct decisions when clustering target pages into a website cluster. The incremental clustering algorithm rarely gets to reconsider, and thus re-cluster, the target pages, and such re-clustering becomes increasingly infrequent as the graph increases in size. Consequently, the initial clustering decisions have an increasingly adverse effect when the RO method is applied to larger graphs.

The RO method covers the target pages at a much faster rate than the RW and SAR methods, which is consistent across both set1 and set2 (Tables 7.11 and 7.12). Because RO selects pages uniformly at random, nodes can be revisited, so its graph coverage cannot be as fast as the linear-time coverage of the BF and DF methods. The RO method also covers the noise pages of the graph at a much faster rate than the RW and SAR methods, which has an obvious negative effect on the resource cost associated with RO.
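The effect of revisits on coverage can be quantified with a short sketch. It assumes RO draws pages uniformly with replacement from a graph of n pages, in which case the expected fraction of distinct pages seen after t steps is 1 - (1 - 1/n)^t. The function names below are hypothetical, and the simulation is only an illustrative check of that formula, not a reproduction of the reported results.

```python
import random


def expected_coverage(n_pages, steps):
    """Expected fraction of distinct pages seen after `steps` uniform draws
    with replacement from `n_pages` pages."""
    return 1.0 - (1.0 - 1.0 / n_pages) ** steps


def simulated_coverage(n_pages, steps, trials=200, seed=0):
    """Average coverage fraction over `trials` simulated random orderings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        seen = {rng.randrange(n_pages) for _ in range(steps)}
        total += len(seen) / n_pages
    return total / trials


analytic = expected_coverage(100, 100)
empirical = simulated_coverage(100, 100)
```

Under this assumption, after as many steps as there are pages, uniform selection has covered only about 63% of the graph on average, whereas a traversal that never revisits a page, such as BF or DF on a connected graph, covers one new page per step.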

The run time per step of RO is lower than that of the linear-time BF and DF methods, but not as low as that of the randomised RW and SAR methods (Tables 7.11 and 7.12). RO is faster per step than BF and DF because it avoids the overhead of extracting hyperlinks in order to process the edges of the web graph. RO must still extract features from each page it selects, and because it selects noise and target pages at random it is not faster than the RW and SAR methods. The RW and SAR methods have a higher probability of revisiting a web page, and revisiting a page is cheap compared with visiting a new page because its features and edges have already been extracted and cached.
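The cost argument above can be made concrete with a small sketch of a visit cache. The class `PageCache` and the stand-in `extract_features` function are hypothetical illustrations. The source states only that features and edges are cached after the first visit, not how the caching is implemented.

```python
class PageCache:
    """Caches per-page extraction results so that revisits are cheap."""

    def __init__(self, extract):
        self._extract = extract   # expensive function, run once per page
        self._cache = {}
        self.extractions = 0      # counts the expensive calls actually made

    def visit(self, page):
        if page not in self._cache:
            self._cache[page] = self._extract(page)
            self.extractions += 1
        return self._cache[page]


def extract_features(page):
    # Hypothetical stand-in for feature and hyperlink extraction.
    return {"length": len(page)}


cache = PageCache(extract_features)
for page in ["a", "b", "a", "a", "c", "b"]:
    cache.visit(page)
```

In this toy run, six steps trigger only three extractions. A traversal that revisits pages often therefore has a low average cost per step, while a method that mostly selects unseen pages pays the extraction cost at nearly every step.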

Figure 7.18: The WBD performance of the dynamic approaches indicated on data set ADG set1.

(a) Accuracy (b) Recall

7.7.2.4 Summary

The evaluation presented in this section used ADGs that were synthetically generated using the preferential attachment model. It shows that:

1. The IKM algorithm outperformed the ICA in the evaluation presented using the BF and RW graph traversal methods.

2. It is possible to control the graph traversal of a dynamic approach to WBD such that the graph coverage of target and noise pages can be influenced to increase the effectiveness of the WBD solution while decreasing the cost of processing unwanted noise.

3. The RO method produces the highest performing WBD solutions on ADG data sets that are not too large in terms of the number of noise and target pages.

Figure 7.19: The WBD performance of the dynamic approaches indicated on data set ADG set2.

(a) Accuracy (b) Recall

The evaluation reported in the following section presents the WBD performance of the MHRW, BF, DF, RW and SAR methods using the real data graphs. The real data sets increase the complexity of the graphs used for evaluation compared with those used in this section. The MHRW method of graph traversal was evaluated in an attempt to obtain an RO-like method that can be applied in a dynamic context. The following evaluation shows that the MHRW traversal is the most like the RO method, but traverses a local area, which means it is not as adversely affected by noise. It is shown that, of the dynamic approaches presented in this chapter, the MHRW method produces the best WBD performance on the real data graphs.
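The link between MHRW and RO can be sketched as follows. The code assumes the textbook Metropolis-Hastings random walk on an undirected graph, in which a uniformly chosen neighbour v of the current node u is accepted with probability min(1, deg(u)/deg(v)), giving a uniform stationary distribution over nodes. The variant evaluated in the thesis may differ. The function name `mhrw` and the toy adjacency list are hypothetical.

```python
import random


def mhrw(adjacency, start, steps, seed=None):
    """Metropolis-Hastings random walk targeting a uniform node distribution.

    Assumptions: `adjacency` maps each node to a non-empty list of
    neighbours, and the graph is undirected.
    """
    rng = random.Random(seed)
    current = start
    walk = [current]
    for _ in range(steps):
        candidate = rng.choice(adjacency[current])
        # Moves towards higher-degree nodes are accepted less often.
        accept = min(1.0, len(adjacency[current]) / len(adjacency[candidate]))
        if rng.random() < accept:
            current = candidate
        walk.append(current)  # a rejected proposal counts as a revisit
    return walk


# Hypothetical toy graph given as an undirected adjacency list.
adjacency = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["a", "c"],
}
walk = mhrw(adjacency, "a", steps=50, seed=1)
```

Unlike RO, each step needs only the neighbours of the current page, so the walk can run without knowing the whole graph in advance, while the degree correction removes the bias of a simple random walk towards highly linked pages.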

In document Website boundary detection via machine learning (Page 184-189)