4.3 Real Data Graphs (RDGs)
4.3.4 Comparison of data sets
Table 4.13 shows that, for each data set, the density of the target cluster is greater than that of the noise cluster. The noise cluster has a density similar to that of the whole graph, which indicates a clear structural relationship between the vertices of the target cluster when compared with the baseline density of the whole graph and of the noise cluster. It should also be noted that the LivChem and LivMaths data sets show the highest density values for their respective target clusters (0.242 and 0.279).
With respect to degree, the LivHistory and LivSace collections have a fairly uniform average degree across the identified clusters: the target and noise clusters have values similar to that of the whole graph. LivChem and LivMaths, however, exhibit an increased average degree for the target cluster in comparison to the noise cluster and the whole graph. It can also be seen that, for these two data sets, the noise cluster has a slightly lower average degree than the entire graph. This suggests that both LivChem and LivMaths have a target cluster that is more cohesively connected, in terms of hyperlink structure, than those of the LivHistory and LivSace data sets.
The higher connectivity of LivChem and LivMaths is further corroborated by the values associated with the connected components of each graph. The maximum number of edges in a connected component is higher for the target clusters of LivChem and LivMaths (970 and 1385) than for those of LivHistory and LivSace (392 and 45). It can also be observed that there are more connected components in the noise clusters of LivChem, LivMaths and LivSace than in that of the LivHistory data set. In certain cases, particularly when traversing the graph structure, an increased number of connected components can lead to an increased visitation frequency of vertices.
Table 4.13: Comparison of four University of Liverpool department data sets.
                              LivChem  LivHistory  LivMaths  LivSace
Density
  Whole (W)                     0.037       0.034     0.037    0.038
  Target (CT)                   0.242       0.093     0.279    0.209
  Noise (CN)                    0.039       0.04      0.028    0.039
Avg Degree
  Whole (W)                    18.231      17.063    18.231   18.075
  Target (CT)                  29.932      18.894    35.265   17.142
  Noise (CN)                   16.36       16.779    15.236   18.108
Connected components (CC)
  Whole (W)                         1           1         1        1
  Target (CT)                       1           1         1        1
  Noise (CN)                        5           1         6        4
Max edges in CC
  Whole (W)                      7803        7235      7803     7429
  Target (CT)                     970         392      1385       45
  Noise (CN)                     6013        6153      3912     7174
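The four measures reported in Table 4.13 can be computed for a whole graph or for the subgraph induced by a cluster. The following is a minimal sketch only: it assumes an undirected simple graph and the usual definitions of density, 2|E| / (|V|(|V| - 1)), and average degree, 2|E| / |V|, which may differ from the exact definitions used to produce the table. The function name `graph_stats` and the toy graph are hypothetical.

```python
from collections import defaultdict


def graph_stats(vertices, edges):
    """Density, average degree, component count and max edges per component.

    Assumption: undirected simple graph. Only edges with both endpoints in
    `vertices` are counted, i.e. the statistics describe the induced subgraph.
    """
    vertices = set(vertices)
    adjacency = defaultdict(set)
    for u, v in edges:
        if u in vertices and v in vertices and u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)

    n = len(vertices)
    m = sum(len(nbrs) for nbrs in adjacency.values()) // 2
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    avg_degree = 2 * m / n if n else 0.0

    # Depth-first search over each unvisited vertex finds the components;
    # halving the summed degrees gives the edge count of each component.
    seen = set()
    edges_per_component = []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        degree_sum = 0
        while stack:
            node = stack.pop()
            degree_sum += len(adjacency[node])
            for nbr in adjacency[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        edges_per_component.append(degree_sum // 2)

    return {
        "density": density,
        "avg_degree": avg_degree,
        "components": len(edges_per_component),
        "max_edges_in_component": max(edges_per_component, default=0),
    }


# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d, plus e-f.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("e", "f")]
whole = graph_stats("abcdef", toy_edges)
target = graph_stats("abc", toy_edges)   # plays the role of the target cluster
noise = graph_stats("def", toy_edges)    # plays the role of the noise cluster
```

Passing a cluster's vertex set together with the full edge list restricts the statistics to that cluster, which mirrors the Whole / Target / Noise rows of the table.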
4.4 Summary
This chapter has presented the data sets that were used to evaluate the approaches to the WBD problem proposed in this thesis. Three kinds of data set were used: (i) Binomial Random Graphs, (ii) Artificial Data Graphs and (iii) Real Data Graphs. The first two used synthetic methods to produce data sets that model the WBD problem according to various scenarios. In total 3 BRG data sets and 2 ADG data sets were generated. The 4 RDG data sets were created by collecting and labelling web pages hosted by the University of Liverpool. A statistical analysis concerning each of the data sets (artificial and real) was also presented.
Chapter 5
Static Technique 1: Feature Analysis
This chapter presents an investigation of the WBD problem in the static context. In the static context all the web data is available prior to the start of analysis (as previously explained in section 3.4.2). The approaches in this chapter use various attributes extracted from web pages to represent features; these features are then grouped using a variety of clustering algorithms to produce a WBD solution. A methodology for providing a solution to the WBD problem in a static context is discussed, and two techniques are proposed: (1) the first uses n-feature types to provide a solution to the WBD problem, and (2) the second uses feature discrimination to produce the “best” set of features with respect to the WBD problem. A range of possible attributes to represent the web pages, extracted from content and associated meta-data, is considered. Clustering algorithms were applied to the various features and website boundary solutions were produced. The most relevant features that can be used to model web pages in terms of producing an optimal solution to the WBD problem are identified, along with the most appropriate clustering algorithms. From the experiments presented in this chapter it was found that the most effective features for representing web pages with respect to WBD are: links to scripts (e.g. JavaScript code), image links (e.g. links that reference .png or .jpg images) and resource links (e.g. links that reference CSS styling files). Of the clustering algorithms considered in this work, the most appropriate for WBD in the static context were found to be DBSCAN and k-means.
The rest of this chapter is organised as follows. Section 5.1 gives a formal description, followed by details of the feature representation in section 5.2. The static approach is presented in section 5.3, and two techniques whereby this approach can be implemented are presented in sections 5.4.1 and 5.4.2 respectively. The evaluation of the feature representation and the clustering algorithms used is presented in section 5.5. The chapter is concluded in section 5.6.
5.1 Formal Description
Recall the formal description of the general WBD problem (see section 3.4). Given a collection of web pages $W$ comprising $n$ individual pages, such that $W = \{w_1, w_2, \dots, w_n\}$, where the seed page is $w_s$, the website boundary ($\omega$) is said to be the bounded subset of pages in $W$ that form the website given by $w_s$. Each of the individual web pages in $W = \{w_1, w_2, \dots, w_n\}$ can be described using a vector of length $m$, such that $V = \{v_1, v_2, \dots, v_m\}$. Each dimension of the feature space $v_1, v_2, \dots, v_m$ describes a single feature.
In contrast to the general WBD formal description, in the static context considered in this chapter there is a key characteristic in the composition of the set of features $V$. The set $V$ is constructed from some arbitrary concatenation of global features in the set $F = \{f_1, f_2, \dots, f_b\}$, where $b$ is the total number of sub features. The set $F$ serves the purpose of defining the actual sub features making up a particular composition of set $V$.
The sequence (set) of possible values for a feature $f_i$ is then described by the set of values $\{\phi_{i1}, \phi_{i2}, \dots, \phi_{ik}\}$. The value of $k$ will depend on the nature of the feature; given a binary-valued feature, $k = 2$ and the value set will be $\{0, 1\}$. Given (say) a numeric feature, the value of $k$ may be substantial. An example is shown in Figure 5.1.
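\[
\mathit{VSM} =
\begin{array}{c|cc|c|ccc}
       & \multicolumn{2}{c|}{f_1} & f_2 & \multicolumn{3}{c}{f_3} \\
       & v_1 & v_2 & v_3 & v_4 & \cdots & v_m \\ \hline
w_1    & 1 & 0 & 0 & 1 & \cdots & 1 \\
w_2    & 1 & 0 & 1 & 0 & \cdots & 1 \\
w_3    & 0 & 1 & 1 & 1 & \cdots & 1 \\
w_4    & 1 & 0 & 1 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & & \vdots \\
w_n    & 1 & 0 & 0 & 1 & \cdots & 1
\end{array}
\]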
Figure 5.1: The vector space model for the WBD problem using some formal notation. In this example $k = 2$ for all features and thus a binary value from $\{0, 1\}$ is allocated to each dimension of the features $f_1$, $f_2$ and $f_3$.
The complete set of features $F$ comprises one or more subsets of features (each describing a different aspect of a web page, see section 5.2). The set $F$ gives the maximum dimensions of possible values that can be used to describe a web page. Feature vectors for individual web pages are thus created by concatenating together sub-vectors ($f_1, f_2, \dots$) representing individual features. An illustration is given in Figure 5.2.
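The concatenation of sub-vectors into a single feature vector per page can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: every feature is treated as binary ($k = 2$), the vocabulary of each feature is taken to be the values observed across the collection, and the function name `build_vsm`, the feature keys and the file names are all hypothetical.

```python
def build_vsm(pages, feature_names):
    """Build one binary feature vector per page by concatenating sub-vectors.

    `pages` maps a page name to a dict of {feature name: set of observed items}.
    Assumption: every dimension is binary (1 if the page references the item).
    """
    # Vocabulary of each feature: the sorted union of items seen on any page.
    vocabulary = {
        f: sorted(set().union(*(page.get(f, set()) for page in pages.values())))
        for f in feature_names
    }
    # Column order follows the order of the features, so the dimensions that
    # belong to f1 come first, then those of f2, and so on.
    columns = [(f, item) for f in feature_names for item in vocabulary[f]]
    vsm = {
        name: [1 if item in page.get(f, set()) else 0 for f, item in columns]
        for name, page in pages.items()
    }
    return columns, vsm


# Hypothetical pages described by two of the feature types named in this chapter.
pages = {
    "w1": {"scripts": {"main.js"}, "images": {"logo.png"}},
    "w2": {"scripts": {"main.js", "menu.js"}, "images": {"logo.png"}},
    "w3": {"scripts": {"other.js"}, "images": set()},
}
columns, vsm = build_vsm(pages, ["scripts", "images"])
```

Each row of `vsm` corresponds to one row of the vector space model in Figure 5.1, and `columns` records which feature each dimension belongs to.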
Each web page in the collection $W$ can be represented as a feature vector. Which web page features to include in the feature space is a subject for debate that will be considered later in this chapter (see section 5.2). The more features that are included, the greater the computational overhead. Clearly it is also not desirable to include sub features that are not good discriminators; the question is, which features make good discriminators?
Once a set of features has been selected to represent the web pages as a vector space, the WBD clustering paradigm as described in section 3.4.1 is applied. Recall that the clustering paradigm essentially groups web pages into two sets, $K_T$ and $K_N$, representing the target cluster and noise cluster respectively.
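The split into $K_T$ and $K_N$ can be illustrated with a small two-cluster k-means over binary feature vectors. This is a hedged sketch rather than the procedure of section 3.4.1: it assumes that the target cluster is the cluster containing the seed page $w_s$, it uses a deterministic initialisation chosen only for reproducibility, and the names `two_means`, `squared_distance` and `mean_vector` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def two_means(vsm, seed, iterations=20):
    """Split pages into (target, noise) with k-means, k = 2.

    Assumptions: centroid 0 starts at the seed page's vector, centroid 1 at
    the vector farthest from the seed, and the target cluster is whichever
    cluster ends up containing the seed page.
    """
    names = list(vsm)
    farthest = max(names, key=lambda n: squared_distance(vsm[n], vsm[seed]))
    centroids = [list(vsm[seed]), list(vsm[farthest])]
    labels = {}
    for _ in range(iterations):
        # Assignment step: each page joins its nearest centroid (ties go to 0).
        new_labels = {
            n: min((0, 1), key=lambda c: squared_distance(vsm[n], centroids[c]))
            for n in names
        }
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in (0, 1):
            members = [vsm[n] for n in names if labels[n] == c]
            if members:
                centroids[c] = mean_vector(members)
    target = {n for n in names if labels[n] == labels[seed]}
    noise = set(names) - target
    return target, noise


# Hypothetical binary feature vectors; "w_s" is the seed page.
example_vsm = {
    "w_s": [1, 1, 0, 1],
    "w2": [1, 1, 0, 1],
    "w3": [1, 0, 0, 1],
    "w4": [0, 0, 1, 0],
    "w5": [0, 0, 1, 0],
}
k_target, k_noise = two_means(example_vsm, "w_s")
```

Any clustering algorithm that returns a partition could replace the k-means step; only the final rule that assigns the seed's cluster to $K_T$ and everything else to $K_N$ would be retained.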
Figure 5.2: An example data set $W$ (left), showing a possible cluster configuration using different features (right). Each coloured area represents a related set of pages using a particular sub feature ($f_i$); overlap of clusters occurs as different features express similar relationships.