Graph Representation - Website boundary detection via machine learning

This section describes methods that can be used to change the link structure of the internal representation of the visited graph. The methods described can be used dynamically, as the graph is traversed. Table 7.7 presents an example of an edge weight method of graph representation with respect to the dynamic approach (presented above in section 7.2). It will be shown that, by utilising a graph representation method, the traversal of the graph can be affected or “controlled” to a certain degree by creating extra links or weighting links to change the pathways of the graph.

Each of the methods described in this section can be used with the probabilistic graph traversal methods RW, SAR and MHRW. The underlying link structure is modified as the graph is retrieved, thus creating an interpretation of the graph compatible with any traversal method. The deterministic methods of traversal (BF, DF) can in fact be used with a modified graph, but the modification will not affect these traversals: the graph is not available a priori and is modified only once retrieved, so these methods cannot affect a deterministic walk process in advance.

There are two main methods by which the link structure can be modified. The first are edge weighting methods. Edge weights can be assigned using arbitrary values, and can be modified at various points in the traversal of the graph. The second are so-called “artificial” edges, which essentially increase the weight of an edge from its original value of 0 to a value > 0. These are edges that can be added between nodes based on certain criteria; they are not strictly limited to a physical hyperlink existing in the original web graph. Each is described in further detail below.

7.5.1 Edge Weighting

An edge weight is assigned to each edge in the structure as it is traversed. The weight can then be increased or decreased as the walk progresses (see Table 7.7). The edge weighting methods presented in this work include:

Similarity Weighting (SW) Edges are reduced in weight depending on whether the similarity of the nodes they connect is above a threshold. This process is conducted at each step. Each time an edge is used that is considered to lead from a target node to a noise node, its weight is decreased, which reduces the likelihood of a random walk using this edge in the future. The default weight of an edge is 1, which effectively means that all edges initially have equal probability of traversal. The magnitude by which edge weights are decreased was experimentally found to be proportional to the number of connections between clusters (section 7.7.1.2). The weight of an edge is updated upon traversal of that edge.
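The similarity weighting update, and its effect on a single random walk step, can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the dictionary-of-dictionaries graph, the function names, the `is_target` labelling, the decrement `delta` and the lower bound `floor` are all assumptions introduced for the example.

```python
import random

def sw_update(weights, u, v, is_target, delta=0.25, floor=0.05):
    """Decrease the weight of edge (u, v) if it leads from a target node to a noise node.

    `delta` and `floor` are hypothetical values; `floor` keeps every weight
    positive so the edge can still be sampled.
    """
    if is_target[u] and not is_target[v]:
        new_w = max(floor, weights[u][v] - delta)
        weights[u][v] = new_w
        weights.setdefault(v, {})[u] = new_w  # update both directions, as in Table 7.7
    return weights[u][v]

def weighted_step(weights, u, rng=random):
    """Choose the next node with probability proportional to the edge weight."""
    neighbours = list(weights[u])
    probs = [weights[u][n] for n in neighbours]
    return rng.choices(neighbours, weights=probs, k=1)[0]

# Every edge starts at the default weight of 1.
weights = {"a": {"b": 1.0, "c": 1.0}, "b": {"a": 1.0}, "c": {"a": 1.0}}
is_target = {"a": True, "b": True, "c": False}
sw_update(weights, "a", "c", is_target)  # target -> noise, so the weight drops
next_node = weighted_step(weights, "a")
```

With these assumed values, the probability of stepping from `a` to the noise node `c` falls from 1/2 to 0.75/1.75, roughly 0.43, after a single update.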

Algorithm Edge Weighting template (ws)
    KT = {ws}; KN = {};
    set up the process internal state;
    set Q to ws; set qi to null;
    repeat
        { Edge Weighting }
        if Q and qi adhere to Edge Weighting conditions then
            set edge (Q, qi) = value;
            set edge (qi, Q) = value;
        end if
        qi = Q;
        select a page Q from G;
        add Q to KT or KN;
        update the process state;
    until convergence;
    return KT;

Table 7.7: Pseudo code for generic edge weighting method.

Euclidean Weighting (EW) Edges are weighted in inverse proportion to the dissimilarity between nodes, measured as the actual Euclidean distance between the features of the nodes. A random walk therefore has a greater probability of visiting similar nodes than nodes of increasing dissimilarity.
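One way to turn Euclidean distances into edge weights and transition probabilities is sketched below. The mapping 1 / (1 + d) is an assumption chosen for the example because it stays finite when two nodes have identical features; the text above only states that the weight falls as the distance grows. The feature vectors and function names are likewise hypothetical.

```python
import math

def euclidean_weight(x, y):
    """Edge weight that decreases as the Euclidean distance between features grows.

    The form 1 / (1 + d) is an assumed mapping, bounded in (0, 1].
    """
    return 1.0 / (1.0 + math.dist(x, y))

def transition_probabilities(features, current, neighbours):
    """Normalise the edge weights of `current` into random walk probabilities."""
    w = {n: euclidean_weight(features[current], features[n]) for n in neighbours}
    total = sum(w.values())
    return {n: wi / total for n, wi in w.items()}

# Hypothetical two-dimensional feature vectors.
features = {"q": (0.0, 0.0), "near": (0.0, 1.0), "far": (3.0, 4.0)}
probs = transition_probabilities(features, "q", ["near", "far"])
```

Here the distances from `q` are 1 and 5, giving weights 1/2 and 1/6, so the walk moves to the nearer node with probability 0.75.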

Cluster Weighting (CW) This method draws on the clusters produced by the clustering algorithm to reduce the weight of edges running between the clusters KT and KN. At each iteration, the nodes traversed are assigned to either cluster KT or KN, depending on the clustering algorithm used. The edges that lie between nodes of the two clusters are extracted and subsequently reduced in weight.
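The extraction and down-weighting of edges that cross between the two clusters might look like the following sketch. The multiplicative `factor`, the data layout and the function name are assumptions made for illustration; the clustering step that produces `k_t` and `k_n` is taken as given.

```python
def cluster_weighting(weights, k_t, k_n, factor=0.5):
    """Reduce the weight of every edge that runs between clusters K_T and K_N.

    `factor` is a hypothetical reduction; the amount actually used is not
    specified here.
    """
    crossing = [
        (u, v)
        for u in weights
        for v in weights[u]
        if (u in k_t and v in k_n) or (u in k_n and v in k_t)
    ]
    for u, v in crossing:
        weights[u][v] *= factor
    return crossing

# Edge a-b lies inside K_T; edge a-c crosses into K_N.
weights = {"a": {"b": 1.0, "c": 1.0}, "b": {"a": 1.0}, "c": {"a": 1.0}}
crossing = cluster_weighting(weights, k_t={"a", "b"}, k_n={"c"})
```

Edges inside a cluster keep their weight, so a random walk that starts in KT becomes less likely to leave it.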

Note that all of the above methods operate on dynamic graphs, which means that at any two points in the traversal the graph may have a different number of nodes and edges, and edge weights may have changed.

7.5.2 Similarity Edges

The process of adding an “artificial” edge between two nodes is subject to the similarity between these nodes. Such edges are known as Similarity Edges (SE). As the traversal progresses, upon visiting a previously unseen node, edges are added between this node and any visited node whose similarity to it is above a certain threshold. This creates dense clusters of similar nodes that can subsequently be traversed more often (see Table 7.8). The similarity can be based on a multitude of characteristics; in this work it is based on attributes associated with nodes.

Algorithm Similarity Edge template (ws)
    KT = {ws}; KN = {};
    set up the process internal state;
    set Q to ws; set qi to null;
    repeat
        { SE Additions }
        if first visit to Q then
            for qi = 0 to number of nodes visited so far do
                if Similarity(qi, Q) > threshold then
                    add a “similarity edge” (qi, Q) to G;
                end if
            end for
        end if
        select a page Q from G;
        add Q to KT or KN;
        update the process state;
    until convergence;
    return KT;

Table 7.8: Pseudo code for artificial edges based on similarity.
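The “SE Additions” step of Table 7.8 can be sketched in Python as below. Cosine similarity over feature vectors, the threshold of 0.9, the adjacency-set graph and the choice to add each similarity edge in both directions are assumptions made for the example; the similarity measure used in this work is based on node attributes and is not reproduced here.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two feature vectors (an assumed measure)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def add_similarity_edges(graph, features, visited, q, threshold=0.9):
    """On the first visit to q, link it to every visited node that is similar enough."""
    added = []
    if q in visited:  # not a first visit: nothing to add
        return added
    for qi in visited:
        if cosine_similarity(features[qi], features[q]) > threshold:
            graph.setdefault(qi, set()).add(q)
            graph.setdefault(q, set()).add(qi)  # assumed undirected
            added.append(qi)
    visited.add(q)
    return added

# Hypothetical features: b resembles a but not c.
features = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0)}
graph = {"a": {"c"}, "c": {"a"}}
visited = {"a", "c"}
added = add_similarity_edges(graph, features, visited, "b")
```

In this example `b` gains an artificial edge to `a` even though no hyperlink joins them, while no edge is added to the dissimilar node `c`.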
