Implementation of Directed Graphlets and Orbits Counting Al-

3.1 Methods

3.1.4 Implementation of Directed Graphlets and Orbits Counting Al-

We implemented a counting algorithm that counts all graphlets with up to four nodes in a directed network, as defined in Section 3.1.1, and all the orbits that each node in the network touches. As discussed in Section 3.1.1, we count graphlets and orbits in directed networks that can contain anti-parallel pairs of arcs but no multiple edges or self-loops. Therefore, when the network is loaded into a data structure in memory, we pre-process the data to remove all self-loops (and any nodes that were involved only in self-loops) and all multiple edges, if any.
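The pre-processing step described above can be sketched as follows. This is a minimal illustration rather than the original counter (whose source is in Section B.1); the function name `build_adjacency` and the use of Python sets are assumptions made for the example.

```python
from collections import defaultdict


def build_adjacency(arcs):
    """Build successor and predecessor sets from (source, target) arcs.

    Hypothetical helper: self-loops are skipped and repeated arcs collapse
    because sets are used, while anti-parallel arcs (u, v) and (v, u) are kept.
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    for u, v in arcs:
        if u == v:
            continue  # discard self-loops
        succ[u].add(v)  # adding an existing arc again has no effect
        pred[v].add(u)
    # Nodes that appeared only in self-loops never enter succ or pred,
    # so they are dropped from the node list automatically.
    nodes = sorted(set(succ) | set(pred))
    return nodes, succ, pred
```

For instance, the arc list `[("a", "b"), ("b", "a"), ("a", "b"), ("c", "c")]` keeps the anti-parallel pair between `a` and `b`, stores the repeated arc once, and drops node `c` altogether.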

There are different approaches to implementing an algorithm that counts the sub-graphs of a network. Approaches that focus on speed include sampling [171,172], methods based on pattern similarities [173], and reconfigurable hardware accelerators based on Field-Programmable Gate Array (FPGA) chips, where the hardware design was implemented in the Verilog hardware description language [174]. The first counting algorithm for undirected graphlets was based on direct enumeration, with corrections for over-counted graphlets and orbits [175]. One of the more recent undirected graphlet counter implementations is a combinatorial method [176] that uses a system of equations to link the counts of orbits of graphlets with up to five nodes, which allows it to compute all orbit counts by enumerating just a single one. However, for the 40 directed graphlet orbits from Figure 3.1, a similar set of equations can be constructed only for networks without anti-parallel arcs, where graphlets are induced sub-graphs. Our implementation counts graphlets and orbits in networks with anti-parallel pairs of arcs, where graphlets are not strictly induced (recall Figure 3.2), so we cannot establish such a system of equations, as discussed before. Thus, our method of choice is the direct enumeration approach.

For each node in the network, we construct the lists of the node’s successors and predecessors. We then visit each node in the network and update the counts of all orbits of graphlets with up to four nodes as follows. Graphlet G0 contains orbits 0 and 1 (Figure 3.1). When the counting algorithm visits a node in the network, it counts the node’s successors to determine the count of orbit 0 for the visited node. In the same iteration, the counter increments the count of orbit 1 for each of the node’s successors and updates the count of graphlets G0 in the network. Similarly, for each set of orbits that belong to the same graphlet, we update the counts of these orbits and the count of the corresponding graphlet in the same iteration, when visiting a node in the network. To count the three-node orbits of the visited node, the algorithm iterates through the lists of the node’s successors and/or predecessors and checks their relationships, or through the lists of the node’s successors or predecessors and then their successors or predecessors (depending on the orbit counted). Similarly, to update the counts of all four-node orbits of a visited node, the algorithm needs to examine the neighbourhood of the node up to three levels deep. To improve time efficiency, we aim to group graphlets that are sub-graphs of one another and update their counts and orbits in the same iteration. For example, we update the counts of orbits 5 and 6 (graphlet G2) and orbits 25, 26, 27 and 28 (graphlet G9) in the same iteration because graphlet G2 is induced on graphlet G9 (see Figure 3.1).
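The update for graphlet G0 described above can be illustrated with a short sketch. It assumes that `succ` maps each node to the set of its successors and that every arc contributes one graphlet G0; the function name `count_g0` is hypothetical and the code is not taken from the original counter.

```python
from collections import defaultdict


def count_g0(succ):
    """Count orbits 0 and 1 and graphlets G0 in one pass over the nodes.

    Assumption: `succ` maps a node to the set of its successors, and each
    arc is counted as one graphlet G0.
    """
    orbit0 = defaultdict(int)  # orbit 0: the visited node is the arc's source
    orbit1 = defaultdict(int)  # orbit 1: the successor is the arc's target
    g0 = 0
    for n, successors in succ.items():
        orbit0[n] += len(successors)
        for s in successors:
            orbit1[s] += 1
        g0 += len(successors)
    return orbit0, orbit1, g0
```

Both orbits and the graphlet count are updated while visiting a single node, which mirrors the grouping of all orbits of one graphlet into the same iteration.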

Following the approach described above, the graphlets containing automorphism orbits are over-counted during the counting process. An example of this is shown in Algorithm 1 (based on the original counter implementation; the source code is available in the Appendices, Section B.1). On graphlet G2 there are two non-interacting nodes touching orbit 5 and one middle node touching orbit 6 (see graphlet G2 in Figure 3.1). Algorithm 1 updates orbits 5 and 6 and the count of graphlets G2 by visiting each node in the network and looking, among the node’s successors, for pairs of successors that do not interact. When such a pair is found, the count of orbit 6 is updated for the visited node i, and the count of orbit 5 is updated for both nodes in the identified non-interacting pair

Algorithm 1 Updating counts of orbits 5, 6, 25, 26, 27, 28 and graphlets 2 and 9.

input: G, directed graph as an edge list
V = list of nodes from G
pred = container with a vector of predecessors pred(n) for each node n ∈ V
succ = container with a vector of successors succ(n) for each node n ∈ V
//Note: Multiple edges and self-loops are discarded when creating pred and succ.//
output: graphlets, vector of graphlet counts graphlets(i) for each graphlet i ∈ [0,39]
output: orbits, matrix of orbit counts orbit(n)(j) for each node n ∈ V and orbit j ∈ [0,128]
output: dictionary, list of nodes in the order that corresponds to the node indexes from the orbits matrix.
for n ∈ V do
    Updating counts of orbits 5, 6, 25, 26, 27, 28 and graphlets 2 and 9:
    for s1 ∈ succ(n) do
        for s2 ∈ succ(n) do
            if s1 ≠ s2 and s1 ∉ succ(s2) and s1 ∉ pred(s2) then
                orbit(s1)(5)++
                orbit(s2)(5)++
                orbit(n)(6)++
                graphlets(2)++
                for s3 ∈ succ(s1) do
                    if s3 ≠ s2 and s3 ≠ n and s3 ∉ succ(s2) and s3 ∉ pred(s2) and s3 ∉ succ(n) and s3 ∉ pred(n) then
                        orbit(s1)(26)++
                        orbit(s2)(28)++
                        orbit(n)(27)++
                        orbit(s3)(25)++
                        graphlets(9)++
                    end if
                    //Note: In the complete counter we use the last for loop to explore relationships between nodes n, s1, s2 and s3, and update their counts of orbits (49,50,51,52), (65,66,67,68), (46,47,48), (85,86,87) and (88,89,90) and counts of graphlets 18, 22, 17, 27, 28 respectively.//
                end for
            end if
        end for
    end for
end for
Correcting for over-counted graphlets and orbits:
//Note: Here, we correct over-counts only for orbits 5, 6, 25, 26, 27, 28 and corresponding graphlets 2 and 9.//
for n ∈ V do
    orbit(n)(5) = orbit(n)(5)/2
    orbit(n)(6) = orbit(n)(6)/2
end for
graphlets(2) = graphlets(2)/2

of successors. However, we need to distinguish between the nodes in the non-interacting pair of successors of i because, as described above, in the same iteration we also update the counts of orbits 25, 26, 27 and 28 and of the corresponding graphlet G9. This means that if the neighbourhood of the node i also corresponds to orbit 27 on G9, then the count of orbit 26 is updated for one of the nodes in the non-interacting pair of successors of i, while orbit 28 is updated for the other node in the pair. As a result, each non-interacting pair of successors (j, k) of the node i needs to be examined twice: once to check whether the node j corresponds to orbit 26 (in which case the nodes i and k correspond to orbits 27 and 28, respectively) and a second time to check whether the node k corresponds to orbit 26 (in which case the nodes i and j correspond to orbits 27 and 28, respectively). If both scenarios are true, the count of orbit 27 for the node i is updated twice, which corresponds to two different graphlets G9 in the network (on each of them the nodes j and k touch different orbits). However, examining the nodes i, j and k twice results in updating the count of orbit 6 for the node i twice, updating the count of orbit 5 twice for each of the nodes j and k, and counting graphlet G2 twice, although the nodes i, j and k account for only one graphlet G2. Hence, when the counting process is completed for all nodes in a network, we divide the following counts by two: the counts of orbit 5 for all nodes, the counts of orbit 6 for all nodes and the count of graphlets G2 in the network. Similarly, the other graphlets containing automorphism orbits are over-counted, depending on how the algorithm iterates over the nodes. We solve this by correcting for all such orbit and graphlet over-counts after the counting process is finalised.
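The enumeration and correction steps of Algorithm 1 can be written as a short runnable sketch. It is an illustrative transcription under stated assumptions, not the original implementation: `succ` and `pred` are assumed to be dictionaries of sets, `nodes` is assumed to list every node of the network, and the names `count_g2_g9` and `linked` are hypothetical.

```python
from collections import defaultdict


def count_g2_g9(nodes, succ, pred):
    """Sketch of Algorithm 1: orbits 5, 6, 25-28 and graphlets 2 and 9.

    Assumptions: `succ`/`pred` map a node to a set of neighbours and
    `nodes` contains every node that occurs in them.
    """
    def linked(a, b):
        # True if there is an arc between a and b in either direction.
        return a in succ.get(b, ()) or a in pred.get(b, ())

    orbit = {n: defaultdict(int) for n in nodes}
    graphlets = defaultdict(int)
    for n in nodes:
        for s1 in succ.get(n, ()):
            for s2 in succ.get(n, ()):
                if s1 == s2 or linked(s1, s2):
                    continue
                # Every unordered pair {s1, s2} is reached twice here.
                orbit[s1][5] += 1
                orbit[s2][5] += 1
                orbit[n][6] += 1
                graphlets[2] += 1
                for s3 in succ.get(s1, ()):
                    if (s3 != s2 and s3 != n
                            and not linked(s3, s2) and not linked(s3, n)):
                        orbit[s1][26] += 1
                        orbit[s2][28] += 1
                        orbit[n][27] += 1
                        orbit[s3][25] += 1
                        graphlets[9] += 1
    # Correct the over-counts caused by examining each pair twice.
    for n in nodes:
        orbit[n][5] //= 2
        orbit[n][6] //= 2
    graphlets[2] //= 2
    return orbit, graphlets
```

On a network with arcs n→a, n→b and a→c, the pair (a, b) is examined in both orders, so the raw counts of orbits 5 and 6 and of graphlet G2 are doubled before the correction, whereas graphlet G9 is found only once because only a has a successor.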

The complexity of our algorithm is O(N × d^3), where N is the number of nodes in the network and d is the maximum degree over all nodes in the network. However, since the counting algorithm is implemented so that each node in the network is visited separately, the code can easily be parallelised by dividing the nodes in the network into sets and assigning each set of nodes to a separate job. Each job should separately maintain temporary DGDVs for all nodes in the network, so that when the job is counting the orbits that a particular node touches, it can still update the orbits of other nodes, even if they are not within its set. This approach gives jobs the flexibility to be either separate threads/processes on a single CPU or distributed over a cluster resource, adding scalability to our approach. After all the jobs are completed, the values from all temporary DGDVs for each node are added together. All the corrections for the over-counts discussed above should be performed after the temporary vectors are merged.
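The parallelisation scheme can be sketched as follows, restricted to orbits 5 and 6 for brevity. The example is an assumption-laden illustration: the names `count_chunk` and `parallel_count` are hypothetical, the temporary DGDVs are represented by a `Counter` keyed by (node, orbit), and threads are used only to keep the example self-contained. In CPython, threads are unlikely to speed up this CPU-bound loop, so separate processes or cluster jobs would take their place in practice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(chunk, succ, pred):
    """One job: raw (uncorrected) counts of orbits 5 and 6 for `chunk`.

    The job keeps its own temporary counts, keyed by (node, orbit), and may
    update nodes that belong to another job's set.
    """
    partial = Counter()
    for n in chunk:
        for s1 in succ.get(n, ()):
            for s2 in succ.get(n, ()):
                if s1 == s2 or s1 in succ.get(s2, ()) or s1 in pred.get(s2, ()):
                    continue
                partial[(s1, 5)] += 1
                partial[(s2, 5)] += 1
                partial[(n, 6)] += 1
    return partial


def parallel_count(nodes, succ, pred, jobs=2):
    """Split the nodes into `jobs` sets, count each set, merge, then correct."""
    chunks = [nodes[i::jobs] for i in range(jobs)]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        partials = list(pool.map(lambda c: count_chunk(c, succ, pred), chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the counts together
    # The over-count correction is applied only after the merge.
    return {key: value // 2 for key, value in total.items()}
```

Applying the division by two after the merge matters because a node's raw count may be split across several jobs, and halving each partial count separately could round differently.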

3.2 Evaluation of Directed Graphlet-based Methods for

In document Analysing directed network data (Page 89-93)