Graph Traversal Method Comparison


7.7.3.2 Graph Traversal Method Comparison

The average performance of the dynamic approaches BF, DF, RW, MHRW and SAR using the title feature and the RDG data sets is shown in Table 7.13. Figure 7.22 illustrates the WBD performance history of the approaches. In the initial stages, the average WBD performance score is higher for the BF and DF methods than for RW, MHRW and SAR (Figure 7.22g). This behaviour is also consistent for the Fmeasure and recall values (Figures 7.22b and 7.22e). The higher recall values indicate that the target pages were visited at a faster rate by the BF and DF methods than by RW, MHRW and SAR. The higher performance of the deterministic dynamic approaches BF and DF indicates that the web pages identified from the target website were subsequently being grouped together successfully into a representative website boundary using the title feature representation.

The values of precision and purity produced were approximately the same for BF, DF, RW and MHRW (Figures 7.22f and 7.22d). This indicates that the target cluster (KT) contained a high number of target web pages relative to noise web pages.
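The relationship between the precision, recall and Fmeasure values discussed above can be illustrated with a minimal sketch. It assumes the standard set-based definitions applied to a single target cluster; the thesis's exact formulations are not shown in this section, and the function name and page identifiers are hypothetical.

```python
def cluster_scores(target_cluster, true_target_pages):
    """Precision, recall and F-measure of a predicted target cluster.

    Assumes the standard set-based definitions; the exact definitions
    used in the evaluation are not given in this section.
    """
    predicted = set(target_cluster)
    actual = set(true_target_pages)
    true_positives = len(predicted & actual)

    # Precision: share of pages in the cluster that are genuine target pages.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of all target pages that ended up in the cluster.
    recall = true_positives / len(actual) if actual else 0.0

    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: the cluster holds 4 pages, 3 of which are targets,
# while the website has 6 target pages in total.
precision, recall, f_measure = cluster_scores(
    ["t1", "t2", "t3", "n1"],
    ["t1", "t2", "t3", "t4", "t5", "t6"],
)
print(precision, recall, round(f_measure, 3))
```

In this example precision is high while recall is still low, which mirrors the early stage of a traversal that has reached few noise pages but has not yet visited most of the target pages.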

In the long term, both BF and DF produced WBD performances consistently similar to those of the RW and MHRW methods (Table 7.13). An advantage of the BF and DF methods is that they can traverse the graph structure in linear time, as indicated by the number of end steps. Due to the complexities of the RDGs, coupled with the task of clustering incrementally and the adverse effects of noise, the DF traversal does not show an improved WBD performance. The randomised traversals of RW, MHRW and SAR are subject to the underlying structure of the graph; in RDGs this structure can be complex and unpredictable. The graph coverage of the RW dynamic approach remained high in comparison to MHRW and SAR (Figure 7.21). A closer analysis reveals that MHRW is much slower at covering the total graph than RW and SAR (Figure 7.21a). However, the MHRW traversal covered noise pages at a lower rate (Figure 7.21c), while also maintaining a high coverage of target pages (Figure 7.21b).
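As a rough illustration of the resource trade-off, the time per step and the total steps reported in Table 7.13 can be multiplied to estimate a total traversal time for each approach. The sketch below assumes that the reported time per step is an average over all steps of a run, which this section does not state explicitly; the function and variable names are illustrative only.

```python
# Values copied from Table 7.13:
# approach -> (time per step in ms, total steps, noise coverage)
table_7_13 = {
    "BF": (314.801, 421, 1.000),
    "DF": (320.561, 421, 1.000),
    "RW": (5.957, 25000, 0.992),
    "MHRW": (4.740, 25000, 0.768),
    "SAR": (5.611, 25000, 0.869),
}


def total_time_seconds(ms_per_step, steps):
    """Estimated total run time, assuming ms_per_step is a per-run average."""
    return ms_per_step * steps / 1000.0


for name, (ms_per_step, steps, noise_coverage) in table_7_13.items():
    seconds = total_time_seconds(ms_per_step, steps)
    print(f"{name:5s} estimated total {seconds:7.1f} s, noise coverage {noise_coverage:.3f}")
```

Under this assumption the MHRW estimate is 118.5 seconds and the BF estimate is roughly 132.5 seconds, so the estimated totals are of a similar order while the noise coverage differs markedly.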

Table 7.13: The WBD performance of the dynamic approaches as indicated, ordered by performance score. The graph coverage, WBD performance score, average time per step and total steps are shown for each approach using the average performance on the RDG (LivChem, LivHistory, LivMaths and LivSace) data sets.

Approach   Coverage                    Score    Time (ms)   Total
           Target   Noise    Total              per step    steps
BF         1.000    1.000    1.000     0.717    314.801     421
DF         1.000    1.000    1.000     0.714    320.561     421
RW         0.960    0.992    0.988     0.654    5.957       25000
MHRW       0.941    0.768    0.788     0.646    4.740       25000
SAR        0.890    0.869    0.871     0.495    5.611       25000

7.7.3.3 Summary

To summarise, the MHRW method proved to be the best performing graph traversal method when compared to the other methods tested in terms of the WBD solutions generated in a dynamic context using the RDGs. The MHRW method traverses the graph selecting pages in a random order depending on the hyperlink structure. In contrast to RW, the MHRW method avoids high degree pages with a certain probability. The MHRW method of selecting pages from the graph offered the advantages of:

1. Not visiting a high number of noise web pages.

2. Producing a high WBD performance in comparison to the other methods evaluated.

3. Applicability in a dynamic context by using the hyperlink structure for its traversal of an unknown graph.

4. Randomising the ordering of pages so as to provide an effective order for the IKM algorithm.

The characteristics of the MHRW method translate into a solution that requires fewer resources to execute, as it minimises the downloading and parsing of unnecessary noise pages from the web, while producing a comparatively high quality WBD solution. The cost associated with downloading and processing unwanted noise pages from the web can rise if the WBD problem is scaled up to address larger domains.
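The way in which MHRW avoids high degree pages "with a certain probability" can be sketched as follows. This is the textbook Metropolis Hastings random walk, in which a move from page v to a uniformly chosen neighbour w is accepted with probability min(1, deg(v)/deg(w)); it is an assumption that the variant used in the evaluation follows this form. The sketch also assumes an undirected adjacency list, and the toy graph and function names are hypothetical.

```python
import random


def mhrw_step(graph, current, rng):
    """One Metropolis Hastings random walk step on an undirected adjacency dict."""
    neighbours = graph[current]
    if not neighbours:
        return current
    candidate = rng.choice(neighbours)
    # Moves towards higher degree pages are accepted less often;
    # moves towards equal or lower degree pages are always accepted.
    accept_probability = min(1.0, len(graph[current]) / len(graph[candidate]))
    if rng.random() < accept_probability:
        return candidate
    return current  # rejected: the walk stays on the current page


def mhrw_order(graph, start, steps, seed=0):
    """Sequence of pages visited by the walk, including repeats."""
    rng = random.Random(seed)
    order = [start]
    current = start
    for _ in range(steps):
        current = mhrw_step(graph, current, rng)
        order.append(current)
    return order


# Hypothetical toy graph: "hub" is a high degree page, the rest are low degree.
toy_graph = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub", "d"],
    "d": ["hub", "c"],
}
visits = mhrw_order(toy_graph, "a", 2000)
print(visits[:10])
```

A plain RW would accept every proposed move, so it visits a page in proportion to its degree; the acceptance test is the only difference between the two walks in this sketch.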

Figure 7.21: The average graph coverage for five graph traversals (as indicated) on RDG data sets (LivChem, LivHistory, LivMaths and LivSace) using the title feature. Panels: (a) Total Coverage.

Figure 7.22: The WBD performance for five graph traversals (as indicated) on RDG data sets (LivChem, LivHistory, LivMaths and LivSace) using the title feature over 1000 steps. Panels: (a) Accuracy, (b) Fmeasure, (e) Recall, (f) Precision.

7.8 Evaluation Summary

The evaluation strategy that was used in this chapter, directed at dynamic approaches to the WBD problem, considered an increasingly complex sequence of data sets. The test data was modelled in three particular ways using Binomial Random Graphs (BRGs), Artificial Data Graphs (ADGs) and Real Data Graphs (RDGs). The evaluation strategy was presented in three sections corresponding to the three kinds of data set used.

Binomial Random Graphs Evaluation The evaluated dynamic methods were RW and RW-SW, used in combination with a buffer to store the ordering of pages produced by the graph traversals. The buffer is used to calculate a ratio value which can be used to detect cluster transitions in the BRG data sets.

The evaluation illustrated that: (1) the ordering of pages selected by RW exhibits frequent traversal of highly connected or dense sub-regions of pages in a graph, (2) the ordering of pages produced by the RW can be used to detect cluster transitions, by using a buffer and calculated ratio value, and (3) the graph representation method SW can be used to improve the ability of RW to remain inside a cluster when a cluster is adversely connected (for example when connections across clusters are proportionally high, or connections inside a cluster are proportionally low).

Artificial Data Graph Evaluation The dynamic approaches to the WBD problem, RW and BF, were initially compared in terms of WBD performance using both the ICA and IKM algorithms. The higher performing IKM algorithm was used to compare RW graph traversal using graph representations SE, SW, EW, CW, SE-SW with BF, DF, SAR and RO.

The evaluation demonstrated that: (1) the IKM algorithm outperformed the ICA using the BF and RW graph traversal methods, (2) the graph representation methods evaluated show that it is possible to control the graph traversal using a dynamic approach to WBD such that the graph coverage of target and noise pages can be influenced to increase the quality of the WBD solution produced while decreasing the cost of processing unwanted noise, and (3) the RO method produces the highest quality WBD solution using ADGs that are not proportionally large compared with the number of noise and target pages.

Real Data Graph Evaluation The evaluated dynamic approaches to the WBD problem were BF, DF, RW, RO, MHRW and SAR in combination with the standard hyperlink graph representation using the IKM algorithm.

The characteristics of the MHRW method translate into a solution that requires fewer resources to execute, as it minimises the downloading and parsing of unnecessary noise pages from the web, while producing a comparatively high quality WBD solution. The MHRW method of selecting pages from the graph offered the following advantages: (1) randomising the ordering of pages, which provides an effective order for the IKM algorithm, (2) applicability in a dynamic context, using the hyperlink structure for its traversal of an unknown graph, (3) avoiding visits to a high number of noise web pages, and (4) producing a high WBD performance in comparison to the other methods evaluated in this section.

7.9 Conclusions

This chapter presented an investigation of the WBD problem in the dynamic context. In the dynamic context the web data is not fully available prior to the start of the analysis. The approaches presented in this chapter used various graph traversal techniques to gather portions of web data, which were then clustered incrementally in order to produce a WBD solution using only partial data.

In the dynamic approach the web page data is gathered by traversing the web graph using the hyperlink structure. The web pages are then pre-processed and a feature representation is created for each page. The pages are incrementally clustered as they are traversed, and a website boundary is then identified based on the clusters produced.
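The traverse, represent and cluster loop described above can be sketched as follows. This is a deliberately simplified illustration, not the IKM algorithm evaluated in this chapter: it assumes a breadth first traversal, a bag-of-words title feature, and a single target centroid with a cosine similarity threshold swapped in for incremental kmeans. The page names, titles, threshold and function names are all hypothetical.

```python
from collections import Counter, deque
import math


def title_vector(title):
    """Bag-of-words representation of a page title (a stand-in title feature)."""
    return Counter(title.lower().split())


def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def dynamic_wbd(links, titles, seed_page, max_steps, threshold=0.3):
    """Traverse from seed_page and incrementally assign pages to target or noise."""
    target_centroid = title_vector(titles[seed_page])
    boundary, noise = [seed_page], []
    queue = deque(links.get(seed_page, []))
    seen = {seed_page}
    steps = 0
    while queue and steps < max_steps:
        page = queue.popleft()
        if page in seen:
            continue
        seen.add(page)
        steps += 1
        vector = title_vector(titles[page])  # the page is only "downloaded" when visited
        if cosine(vector, target_centroid) >= threshold:
            boundary.append(page)
            target_centroid.update(vector)  # summed vector; cosine ignores its scale
        else:
            noise.append(page)
        queue.extend(links.get(page, []))
    return boundary, noise


# Hypothetical five page graph with three departmental pages and two noise pages.
links = {
    "home": ["chem1", "news"],
    "chem1": ["chem2", "home"],
    "news": ["sport"],
    "chem2": [],
    "sport": [],
}
titles = {
    "home": "Department of Chemistry Liverpool",
    "chem1": "Chemistry Research Department",
    "chem2": "Chemistry Staff Liverpool",
    "news": "World News Today",
    "sport": "Football Results",
}
boundary, noise = dynamic_wbd(links, titles, "home", max_steps=10)
print(boundary, noise)
```

The point of the sketch is the ordering of operations: each page is represented and clustered at the moment it is visited, so the boundary is available at any step using only the portion of the graph seen so far.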

The evaluation of the dynamic approaches presented in this chapter was performed using three categories of web graph data set: (1) Binomial Random Graphs (BRG), (2) Artificial Data Graphs (ADG), and (3) Real Data Graphs (RDG). The evaluation using each category of data set illustrated the advantages of the random walk based method of graph traversal with respect to WBD performance. This holds in particular for settings where the amount of data is large and not immediately available, and thus an adverse cost is associated with gathering data.

The Metropolis Hastings Random Walk (MHRW) method of graph traversal, coupled with the title feature representation and the incremental kmeans (IKM) algorithm, proved to be the best performing dynamic method in terms of the evaluation conducted using the Real Data Graphs (RDGs). The evaluation specifically concentrated on techniques that can be applied in a dynamic context and that produce acceptable WBD performance while at the same time reducing the number of noise pages visited. The website boundary representation was “as good” as that produced by other methods; however, MHRW visited fewer noise pages, hence efficiency gains were realised. Thus the MHRW method was found to provide the most effective solution to the WBD problem, offering the best compromise between WBD performance and the number of noise pages visited, which reduces the cost in terms of resources.

Chapter 8

Conclusion

This chapter presents a summary of the proposed approaches to the Website Boundary Detection (WBD) problem, a comparison of the approaches, the contribution and main findings of the research, and possible directions for future work. The summary of the research is presented in section 8.1. A comparison of the proposed approaches is presented in section 8.2. The contributions and main findings are given in section 8.3 and the suggested future research directions in section 8.4.

8.1 Summary

This thesis has described research undertaken in the field of web data mining, which is a sub-field of Knowledge Discovery in Databases (KDD). The specific area this thesis contributes to is website mining. The research described was directed at investigating solutions to the WBD problem. The WBD problem is described as the task of identifying the collection of all web pages that are part of a single website. Potential solutions to the WBD problem can be beneficial with respect to the archiving of web content and the automated construction of web directories, amongst many others (see section 1.1).

The findings in this work can be validated by applying the techniques to WBD problems other than those presented in this thesis. In particular, opinion mining is a problem that can benefit from the techniques presented in this work. With respect to opinion mining, the website boundary in question is one that is focused on a certain topic, containing opinions on a product or a brand. This boundary can span multiple physical domains and can have a large magnitude given the prolific use of social media. The boundary detection problem in such a case benefits from techniques that can discover website boundaries in a dynamic setting where the magnitude of the problem can cause traditional techniques to become non-scalable.

A pre-requisite to any practical WBD approach is a definition of a website. This thesis has presented a discussion of previous definitions of websites, which concluded that the current definitions are ambiguous with respect to any practical approach to the WBD problem. A proposed definition of a website, which was used with respect to the approaches presented in this thesis, was thus presented.

The approaches to the WBD problem investigated in this thesis were directed at both the static and dynamic contexts. In the static context the web data to be considered is required to be available prior to the start of any WBD solution process. In the dynamic context the web data is collected as the WBD solution generation process proceeds. Three approaches to the solution of the WBD problem were presented in this thesis, two static approaches and one dynamic approach:

1. Feature analysis based static WBD

2. Graph structure partitioning based static WBD

3. Incremental clustering using graph traversal based dynamic WBD

The latter built upon the initial research findings generated through the evaluation of the two static approaches.

The first static approach presented in this thesis concentrated on the types of features that could be used to represent web pages. This approach presented a practical solution to the WBD problem by applying clustering algorithms to various combinations of features. Further analysis investigated the best combination of features to be used in terms of WBD performance.

The second static approach investigated graph partitioning techniques based on the structural properties of the web graph in order to produce WBD solutions. Two approaches were considered: a hierarchical graph partitioning technique and a method based on the minimum cuts of flow networks.
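The idea behind a minimum cut based boundary can be illustrated with a small sketch. It uses a generic Edmonds-Karp maximum flow computation and takes the pages still reachable from a seed page in the residual graph as the source side of a minimum cut; this is an assumption about the general technique, not a reproduction of the method evaluated in this thesis. Unit capacities, the toy graph and all names are hypothetical.

```python
from collections import deque


def undirected_unit_capacities(edges):
    """Treat each hyperlink as an undirected edge with capacity 1 (an assumption)."""
    capacity = {}
    for u, v in edges:
        capacity.setdefault(u, {})[v] = 1
        capacity.setdefault(v, {})[u] = 1
    return capacity


def min_cut_source_side(capacity, source, sink):
    """Nodes on the source side of a minimum source-sink cut (Edmonds-Karp)."""
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)  # reverse edge for cancelling flow

    def reachable_with_parents():
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    while True:
        parents = reachable_with_parents()
        if sink not in parents:
            # No augmenting path remains: the reachable set is the source side.
            return set(parents)
        # Find the bottleneck capacity along the shortest augmenting path.
        bottleneck = float("inf")
        v = sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        # Push the flow along the path and update the residual capacities.
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u


# Hypothetical graph: a densely linked website joined to external pages by one link.
edges = [
    ("home", "a"), ("home", "b"), ("a", "b"),   # website pages
    ("x", "y"), ("x", "ext"), ("y", "ext"),     # external pages
    ("b", "x"),                                 # the single bridging hyperlink
]
website = min_cut_source_side(undirected_unit_capacities(edges), "home", "ext")
print(sorted(website))
```

Here the cheapest way to separate the seed page from the external page is to cut the single bridging hyperlink, so the densely linked pages around the seed fall on the source side of the cut.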

The final proposed approach to the WBD problem presented in this research considered the dynamic context. The dynamic approach was founded on the findings from the application of the two static approaches. The dynamic approach subsequently produced a solution that incrementally built a website boundary by traversing the web graph structure, while clustering web pages using various feature representations.

In an initial attempt to evaluate the approaches presented in the dynamic context, synthetically generated Artificial Data Graphs (ADGs) were created to provide a controlled environment for analysis. A final evaluation of both the static and dynamic approaches presented in this thesis was conducted using Real Data Graphs (RDGs) gathered from four academic departments hosted by the University of Liverpool (LivChem, LivHistory, LivMaths and LivSace).
