
6.4 Evaluation of Minimum Cut Approach

6.4.2 Iterating source s and sink t

An alternative to fixing the source s and iterating the sink t is to iterate both s and t over all combinations of vertices in the input graph. For a graph containing n vertices, this produces n ∗ n combinations of s and t. It is therefore an exhaustive approach to partitioning the graph, in contrast to the previous method, which considered only n − 1 combinations. The same method is used to find the minimum cut of the graph for each s and t combination. A mincut is performed at each iteration, and the lowest cut across all combinations is then the mincut of the graph.

6.4.2.1 Standard Flow Graph Model

For each of the datasets there exists at least one WBD solution with a score of > 0.9 (shown in Figures 6.18a, 6.18b, 6.18c and 6.18d). Sorting the results from the highest to the lowest performing WBD solution shows an interesting trend. For two of the datasets (LivChem, Figure 6.18a, and LivHistory, Figure 6.18b) the performance tails off quite dramatically after around 25,000 combinations. The same trend can be observed in the remaining two datasets, but after a much lower number of combinations (LivMath, Figure 6.18c, and LivSace, Figure 6.18d). For all datasets the performance eventually tails off to the same low score reported in the fixed source testing results presented above (a score of approximately 0.5).

The frequency plots of the top performing results reveal that the sink vertex should be selected from the target website vertices, while the source may be selected from either the target or the noise vertices (Figures 6.19a, 6.19b, 6.19c and 6.19d). In fact, the results shown in the figures indicate that it is usually better to identify the seed page with the sink than with the source. This is contrary to the intuition adopted for the first set of recorded tests, where the source was identified as the given seed page of a website.
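The frequency analysis described above can be sketched as follows. This is a minimal illustration, assuming each result is stored as a (score, source, sink) tuple; the function name `top_k_frequencies` and the sample scores are hypothetical and are not taken from the experiments.

```python
from collections import Counter


def top_k_frequencies(results, k=25):
    """Count how often each vertex appears as source and as sink
    among the k highest scoring (score, source, sink) results."""
    top = sorted(results, key=lambda r: r[0], reverse=True)[:k]
    source_freq = Counter(s for _, s, _ in top)
    sink_freq = Counter(t for _, _, t in top)
    return source_freq, sink_freq


# Hypothetical scores: low vertex ids stand for target pages,
# high vertex ids for noise pages (breadth first order from the seed).
results = [(0.95, 7, 1), (0.93, 2, 1), (0.91, 9, 3), (0.52, 1, 8), (0.50, 3, 9)]
source_freq, sink_freq = top_k_frequencies(results, k=3)
```

Plotting `source_freq` and `sink_freq` against the vertex order would give histograms of the kind shown in Figure 6.19.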

Figure 6.18: WBD performance score using the standard flow model, with iterating source s and sink t. The plots show the score value (y-axis) plotted for each source s and sink t combination (x-axis). All combinations of s and t are shown.

(a) LivChem (b) LivHistory

6.4.2.2 Backlinks Flow Graph Model

The backlinks model does not show any improvement when iterating over all combinations of s and t in the graph. The results exhibit the same two-level score pattern as in the previous experiments (Figures 6.20a, 6.20b, 6.20c and 6.20d), which is consistent with the results from the fixed source, iterating sink tests shown previously (Figure 6.17).

The iterating source and sink WBD solutions presented here show a vast improvement in the quality of the website boundary discovered compared to the previous fixed source, iterating sink solutions. With the standard model, the results showed that there does exist an s and t combination for which the mincut (S, T) of the graph produces a more representative WBD solution.
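The exhaustive iteration over source and sink used in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: every hyperlink is given unit capacity, Edmonds-Karp (breadth first augmenting paths) is chosen only for brevity, and the names `min_cut` and `all_pairs_cuts` are hypothetical. Pairs with s = t are skipped because they define no cut, so the sketch evaluates n(n − 1) rather than n ∗ n combinations.

```python
from collections import deque
from itertools import permutations


def min_cut(edges, s, t):
    """Return (cut value, S) for directed unit-capacity edges, where S is
    the set of vertices still reachable from s once the flow is maximal."""
    capacity = {}
    neighbours = {s: set(), t: set()}
    for u, v in edges:
        capacity[(u, v)] = capacity.get((u, v), 0) + 1
        capacity.setdefault((v, u), 0)  # residual (reverse) edge
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    flow = 0
    while True:
        # Breadth first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in parent and capacity[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow, set(parent)  # no path left: flow is maximal
        # Walk back from t to s and push the bottleneck amount of flow.
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(capacity[e] for e in path)
        for u, v in path:
            capacity[(u, v)] -= push
            capacity[(v, u)] += push
        flow += push


def all_pairs_cuts(vertices, edges):
    """Run a mincut for every ordered (s, t) pair with s != t."""
    return [(s, t) + min_cut(edges, s, t) for s, t in permutations(vertices, 2)]


# Toy graph: vertices 0-2 form a densely linked "target" cluster,
# vertices 3-4 a "noise" cluster reached through the single link 2 -> 3.
edges = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1), (2, 3), (3, 4), (4, 3)]
results = all_pairs_cuts(range(5), edges)
lowest = min(results, key=lambda r: r[2])
```

Each entry of `results` holds the source, the sink, the cut value and the source side S of the partition, so every combination can be scored as a candidate WBD solution rather than only the one with the lowest cut.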

Figure 6.19: Frequency distribution of source and sink for the top 25 scoring WBD solutions using the standard flow model, with iterating source s and sink t. The plots show the frequency value (y-axis) plotted against the vertex range (x-axis). Note that the x-axis depicts vertices in breadth first order from the seed node: the lower range represents target nodes and the higher range represents noise nodes, in accordance with the amount of target and noise in a data set (see section 4.3).

(a) LivChem (b) LivHistory

Figure 6.20: WBD score using the backlinks model, with iterating source s and sink t. The plots show the score value (y-axis) plotted for each source s and sink t combination (x-axis). Only the top 450 results are shown, in order to highlight the two-level trend.

(a) LivChem (b) LivHistory

6.4.3 Evaluation Summary

The evaluation results for the minimum cut approach to graph partitioning are summarised in this subsection. The summary compares the four possible ways of allocating the source s and the sink t using either the target vertices or the noise vertices. Consider a graph G which contains two clusters: CT, containing target pages from the website, and CN, containing noise pages (notation recalled from section 3.4.1). The selection of a source s and sink t from the vertices in G falls into one of four scenarios:

1. s ∈ CT and t ∈ CT - Good performance.

2. s ∈ CT and t ∈ CN - Bad performance.

3. s ∈ CN and t ∈ CN - Bad performance.

4. s ∈ CN and t ∈ CT - Good performance.

The scenarios listed above illustrate the main finding: if the sink vertex is not chosen from the target cluster, the performance of the WBD solution produced is poor.
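The four scenarios can be expressed as a small lookup. This is only an illustrative sketch: the function name `scenario` is hypothetical, and it assumes that every vertex of G outside the target cluster CT belongs to the noise cluster CN.

```python
def scenario(s, t, target):
    """Map a source/sink choice to (scenario number, reported performance).

    `target` is the set of vertices in CT; any other vertex is treated
    as belonging to CN (an assumption of this sketch).
    """
    s_in_target = s in target
    t_in_target = t in target
    if s_in_target and t_in_target:
        return 1, "good"
    if s_in_target:
        return 2, "bad"
    if not t_in_target:
        return 3, "bad"
    return 4, "good"


target = {"a", "b", "c"}  # hypothetical target cluster CT
outcome = scenario("x", "a", target)  # source from noise, sink from target
```

The reported performance depends only on whether t lies in CT, which is the main finding stated above.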

Choosing the sink from the target cluster means that flow is augmented to an end point in the more highly connected (target) region of the graph. If flow is augmented from inside the target cluster (s ∈ CT), a bottleneck is created (and subsequently cut) on the boundary of the target (website) cluster. This is because flow is augmenting inside the more highly connected cluster, with only a small amount of flow augmenting to the noise cluster due to the fewer augmenting paths. If flow is augmented from outside the target cluster (s ∈ CN), a bottleneck is again created on the boundary of the target (website) cluster, because flow is augmented along the few available paths into the target cluster. These paths cause a bottleneck of flow on the edge of the more highly connected cluster.
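The second case (s ∈ CN, t ∈ CT) can be illustrated on a toy graph. The sketch below is an assumption-laden example, not the experimental setup: both clusters are fully linked internally, a single hyperlink joins noise to target, every link has unit capacity, and the helper `flow_and_source_side` is a hypothetical name for a simple depth first Ford-Fulkerson routine.

```python
def flow_and_source_side(edges, s, t):
    """Unit-capacity Ford-Fulkerson. Returns the maximum flow and the set
    of vertices reachable from s in the final residual graph."""
    residual = {s: {}, t: {}}
    for u, v in edges:
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + 1
        residual[v].setdefault(u, 0)  # reverse edge for undoing flow

    def augment(u, seen):
        # Depth first search for one augmenting path from u to t.
        if u == t:
            return True
        seen.add(u)
        for v, cap in residual[u].items():
            if cap > 0 and v not in seen and augment(v, seen):
                residual[u][v] -= 1
                residual[v][u] += 1
                return True
        return False

    flow = 0
    while True:
        seen = set()
        if not augment(s, seen):
            return flow, seen  # failed search visited exactly the source side
        flow += 1


noise = ["n0", "n1", "n2"]
target = ["a", "b", "c", "d"]
edges = [(u, v) for u in noise for v in noise if u != v]
edges += [(u, v) for u in target for v in target if u != v]
edges.append(("n2", "a"))  # the only path from noise into the target cluster

flow, source_side = flow_and_source_side(edges, "n0", "d")
```

In this toy graph the flow saturates the single link n2 → a, so the cut falls on the cluster boundary and the source side is exactly the noise cluster. Real web graphs are far less regular, so this shows the mechanism only and says nothing about the scores obtained in the experiments.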

The backlinks model provided a much more stable graph partitioning than the standard model. This was exhibited as a clear two-tier trend in WBD performance. Stability here refers to the range of source and sink selections that produce a graph partition with the same (stable) WBD performance. Consequently, the WBD solutions produced were found to be less “volatile” than those produced using the standard model. This implies that there is some leniency in the selection of source and sink vertices with respect to the quality of the WBD solutions generated. However, the major drawback of the backlinks model was that it did not perform as well as the standard model in terms of WBD solution performance. This was due to the model creating uniformity in the graph structure by adding extra (back) links, which removes structure that a flow analysis method would otherwise use to advantage with respect to WBD performance.

In document Website boundary detection via machine learning (Page 147-152)