Fixed source s, iterating sink t - Evaluation of Minimum Cut Approach

6.4 Evaluation of Minimum Cut Approach

6.4.1 Fixed source s, iterating sink t

The evaluation in this sub-section reports on an approach to finding the mincut of a graph when the sink t is unknown: the source s is fixed and the sink t is iterated. Fortunately, the definition of the WBD problem supplies a known vertex in the form of a seed web page (ws). In this case the single seed is fixed as the source node, and the sink is then iterated over all remaining nodes.

For a graph of n vertices labelled {0 . . . n} starting from the seed page ws, the source is fixed at s = vertex 0. The sink t iterates from vertex 1 to n. The results were first analysed in order of sink t selection (1 to n).

The min cut partition produced is such that s ∈ S and t ∈ T. In this evaluation scenario, because the source is the seed page (s = ws), it can be inferred that the set S contains the pages forming the website boundary. Therefore the nature of the generated WBD solutions was measured with respect to the set S.
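The fixed-source, iterated-sink procedure described above can be sketched as follows. This is a minimal illustration only, assuming a directed graph given as a dictionary of edge capacities and using a plain Edmonds-Karp max-flow routine; the function names (`min_cut`, `iterate_sinks`) and the toy graph are hypothetical and are not taken from the original experiments.

```python
from collections import deque


def min_cut(capacity, s, t):
    """Return (cut_capacity, S) for a minimum s-t cut (Edmonds-Karp).

    capacity: dict mapping (u, v) -> non-negative capacity of edge u -> v.
    """
    # Build residual capacities, adding zero-capacity reverse edges.
    residual, adj = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def bfs():
        # Breadth-first search along edges with spare residual capacity.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if t not in parent:
            # No augmenting path left: the vertices reachable from s form S.
            return flow, set(parent)
        # Find the bottleneck on the path from s to t, then augment along it.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck


def iterate_sinks(capacity, vertices, s=0):
    """Fix the source s and compute one minimum cut per candidate sink t."""
    return [(t,) + min_cut(capacity, s, t) for t in vertices if t != s]


# Toy graph with unit capacities (illustrative data, not a thesis data set).
capacity = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (1, 3): 1, (2, 3): 1}
results = iterate_sinks(capacity, range(4), s=0)
for t, cut_value, S in results:
    print(t, cut_value, sorted(S))
```

Each tuple holds the sink t, the capacity of the minimum cut, and the source-side set S, which is the set the evaluation measures.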

6.4.1.1 Standard Flow Graph Model

The WBD performance scores (shown in Figures 6.15a, 6.15b, 6.15c and 6.15d) illustrate consistently poor performance (scores of between 0.3 and 0.6) for all iterations of t in the standard flow graph representation of the web graph. In some iterations the cardinality of the sets S and T can reflect an ideal WBD solution, relative to the number of target vertices. However, the composition of these cuts does not reflect the correct target and noise ratios (Figure 6.16). The identified website boundary cluster (KT) contained either: a high ratio of target to noise pages, but with target pages left in KN; or a low ratio of target to noise pages (low precision), but with a high proportion of the target pages in KT (high recall). Over the data sets tested, no pattern emerged as to which vertices to choose so that cutting the graph gives a high performing WBD solution.
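The two failure modes described above can be illustrated with the standard set-based definitions of precision and recall. This is a sketch only: the function name `precision_recall` and the example sets are hypothetical, and the combined WBD performance score used in the figures is not reproduced here.

```python
def precision_recall(cluster, target):
    """Precision and recall of an identified cluster against the target pages."""
    cluster, target = set(cluster), set(target)
    true_positives = len(cluster & target)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(cluster) if cluster else 0.0
    recall = true_positives / len(target) if target else 0.0
    return precision, recall


# Illustrative data: pages 0-3 are target pages, pages 4-7 are noise.
target = {0, 1, 2, 3}

# A small, clean cluster that leaves target pages behind.
small_cluster = precision_recall({0, 1}, target)

# A large cluster that captures every target page along with noise.
large_cluster = precision_recall({0, 1, 2, 3, 4, 5, 6, 7}, target)

print(small_cluster, large_cluster)
```

The first cluster has high precision but low recall, and the second has the reverse, mirroring the two kinds of cut composition reported for the standard model.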

Figure 6.15: WBD performance score using the standard flow graph model, with a fixed source s = ws (the seed page) and iterating sink t. For a graph of n vertices, s = 0 is fixed, t is then iterated from t = 1 to t = n. The plots show the performance score value (y-axis), plotted for each sink t (x-axis).

(a) LivChem (b) LivHistory

Figure 6.16: WBD performance recall and precision values using the standard flow graph model, with a fixed source s = ws (the seed page) and iterating sink t.

(a) LivChem (b) LivHistory

6.4.1.2 BackLinks Flow Graph Model

The performance using the backlinks flow graph model shows a slightly different story from the poorly performing standard model results presented above. The results generated from the backlinks tests show a clear two-tiered performance (Figure 6.17): either similar to the standard test, with scores of between 0.4 and 0.5, or a score of around 0.7. This “two level” trend is consistent across all data sets tested (as shown in Figures 6.17a, 6.17b, 6.17c and 6.17d).

Closer observation of the compositions of clusters S and T reveals that there were only two types of cut being made in the flow graph. The cut was either:

1. vertex s ∈ S and all remaining vertices in T.
2. vertex t ∈ T and all remaining vertices in S.

Thus the minimum cut is such that a single node (vertex) is being cut from the graph. Given the fixed nature of s and the iteration of t, the cut is either: set S containing only s, and set T containing the remaining nodes; or set T containing only t, and set S containing the remaining nodes. Interpreted as a WBD solution, these two types of cut produce either: (i) a collection containing a single isolated target vertex, the seed node (1 above); or (ii) a collection containing the seed node and a large number of noise pages (2 above).

Because the backlinks representation of the web graph is dense, the lowest capacity cut that separates s from t is, intuitively, simply a cut isolating s or t from the network. The capacity of the mincut is then essentially equal to the degree of the isolated vertex, s or t. Cutting s or t from the main network is much cheaper than performing a cut of potentially very high capacity across the dense connections of the network in order to correctly separate all target nodes from the remaining noise nodes.
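The argument above can be stated as a simple bound. The notation here is an assumption introduced for illustration: V is the vertex set and c(u, v) is the capacity of the directed edge from u to v. Since both ({s}, V \ {s}) and (V \ {t}, {t}) are valid s-t cuts, the minimum cut can never cost more than the cheaper of the two:

```latex
\[
  c(S, T) = \sum_{u \in S} \sum_{v \in T} c(u, v),
  \qquad
  \min_{(S,T)} c(S, T)
  \;\le\;
  \min\!\Bigl( \sum_{v \in V \setminus \{s\}} c(s, v),\;
               \sum_{u \in V \setminus \{t\}} c(u, t) \Bigr).
\]
```

If every edge has unit capacity (an assumption, not stated in the text), the two sums are the out-degree of s and the in-degree of t. In a dense graph any cut through the middle of the network crosses many edges, so the bound tends to be attained by one of the two single-vertex cuts.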

Figure 6.17: WBD score using the backlinks model, with a fixed source s = ws (the seed page) and iterating sink t. For a graph of n vertices, s = 0 is fixed, t is then iterated from t = 1 to t = n−1. The plots show the score value (y-axis) plotted for each possible sink t (x-axis).

(a) LivChem (b) LivHistory

The potential WBD solutions were subsequently analysed in order of: (1) lowest capacity mincut of all iterations of t and (2) highest density of set S. The configuration that featured the lowest capacity mincut was then selected as the most desirable WBD solution in each case. Inspection of the ordered results did not provide any greater insight into how the sink should be selected so as to produce the best WBD solution.

The lowest capacity mincut did not yield the highest score value for the WBD solution (in some cases quite the opposite, in fact). A low capacity min cut implies that a low number of edges were cut to separate s from t. In the reported experiments using both the backlinks and standard flow models, the lowest capacity cuts often resulted in low performing WBD solutions, because a low capacity cut can segment a small portion of the graph that is irrelevant to the WBD purpose.

Consideration of the density value of sets S and T gives slightly better insight into the cluster composition. If a cluster contains a single vertex, then it will have a minimum density value associated with it. Analysing the potential WBD solutions in terms of which produced the highest density set S effectively filters out cuts of the graph that contain single vertices. Although this removed low performing solutions, it was found not to be an effective method for yielding high performing WBD solutions.
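The density-based ordering can be sketched as below. The density measure is an assumption, since this section does not define it: here it is taken to be the number of directed edges inside the set divided by the maximum possible number, |S|(|S|−1), with a single-vertex set assigned the minimum value 0. The function names and the example data are hypothetical.

```python
def density(vertices, edges):
    """Internal edge density of a vertex set (assumed definition).

    edges: iterable of directed (u, v) pairs.
    """
    vertices = set(vertices)
    n = len(vertices)
    if n < 2:
        return 0.0  # a single vertex has no internal edges
    internal = sum(
        1 for u, v in set(edges)
        if u != v and u in vertices and v in vertices
    )
    return internal / (n * (n - 1))


def rank_by_density(candidates, edges):
    """Order candidate sets S from highest to lowest density."""
    return sorted(candidates, key=lambda S: density(S, edges), reverse=True)


# Illustrative data: three candidate S sets from different sink choices.
edges = [(0, 1), (1, 0), (1, 2), (2, 0), (3, 0)]
candidates = [{0}, {0, 1, 2}, {0, 1, 2, 3}]
ranked = rank_by_density(candidates, edges)
print(ranked)
```

Under this definition the single-vertex cut always sorts last, which matches the filtering effect described above. It says nothing about whether the top-ranked set is a good WBD solution.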

By fixing the source s (as the given seed node ws) and iterating the sink t, for a graph of n vertices, the minimum cut approach using the backlinks flow graph model performs a mincut of the graph on each of n−1 iterations. As shown above, this did not yield good performance in terms of the WBD solutions produced. The cuts of the graph did not reflect good performing partitions using either the backlinks or the standard representation of the web graph.

In document Website boundary detection via machine learning (Page 143-147)