State of the Art

2.3 Complex Networks

2.3.2 Characterising networks

Any network is characterised by its number of nodes (usually denoted by $n$) and links (denoted by $l$). More precisely, it is fully defined by an adjacency matrix (denoted $A$), of size $n \times n$, whose element $a_{i,j}$ has a value of 1 when a connection between the two nodes $i$ and $j$ exists, and 0 otherwise³. The number of links $l$ can simply be derived from $A$ as $l = \sum_{i,j} a_{i,j}$.
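As an illustration of these definitions, the following minimal Python sketch stores a small network as a nested list and derives $n$ and $l$ from it. The matrix values are hypothetical toy data chosen only for the example.

```python
# Toy directed network with n = 4 nodes (hypothetical example data).
# A[i][j] == 1 means that a link from node i to node j exists.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]

n = len(A)                        # number of nodes
l = sum(sum(row) for row in A)    # number of links: sum of all entries of A

# A is symmetric only if every link i -> j is matched by a link j -> i.
is_symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))
```

Here the link 0 → 1 has no counterpart 1 → 0, so the matrix is not symmetric and each non-zero entry counts as one directed link.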

As previously introduced, the methods that characterise the structure of a network roughly fall into the three aforementioned categories (i.e. the micro-, meso- and macro-scales). We will now review the most common metrics in each category.

³ Note that this matrix may not be symmetric, i.e. $a_{i,j} \neq a_{j,i}$: a connection from node $i$ to node $j$ may exist while the connection $j \to i$ is missing.

Micro-scale metrics

The simplest example of a micro-scale metric is the degree, defined as the number of connections a node has - that is, its number of connected neighbours. The degree k of a node i is calculated as:

\[ k_i = \sum_{j} a_{i,j}. \tag{2.2} \]
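Equation (2.2) amounts to a row sum of the adjacency matrix. A minimal sketch, assuming the matrix is stored as a nested Python list (the function name `degrees` and the toy matrix are hypothetical):

```python
def degrees(A):
    """Return k_i = sum_j a_ij for every node i (row sums of A)."""
    return [sum(row) for row in A]

# Hypothetical undirected network: nodes 0, 1 and 2 form a triangle,
# and node 3 is attached to node 0 only.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
k = degrees(A)   # one degree per node, in node order
```

For a non-symmetric matrix, the row sum counts outgoing links only; the column sum would give the number of incoming links.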

While this measure reveals some information about the importance of a node within the structure of the network (it may, for instance, be of interest to locate the most connected node), additional information is contained in the degree distribution $P(k)$. To illustrate, a long-tailed distribution reflects a hub-and-spoke configuration, in which a few nodes are highly connected while most others have only a few connections, and thereby provides information about the system's resilience to attacks or to random failures. This type of information, expressed as a function, must nevertheless be condensed into a single feature before it can be used in data mining algorithms. The entropy of the degree distribution is a compelling solution [WTGX06]:

\[ H = -\sum_{k} P(k) \log_2 P(k) \tag{2.3} \]

The minimum value $H = 0$ is reached when all nodes have the same degree, while higher values indicate that the degrees are spread more evenly over a larger range of values, i.e. a more heterogeneous network.
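The entropy of Eq. (2.3) can be sketched in a few lines of Python, assuming that the degree distribution is estimated as the empirical frequency of each degree value in the network. The function name `degree_entropy` is hypothetical.

```python
from collections import Counter
from math import log2

def degree_entropy(degree_sequence):
    """Shannon entropy (in bits) of the empirical degree distribution."""
    n = len(degree_sequence)
    counts = Counter(degree_sequence)          # number of nodes per degree value
    return -sum((c / n) * log2(c / n) for c in counts.values())

h_regular = degree_entropy([2, 2, 2, 2])   # every node has the same degree
h_spread = degree_entropy([1, 2, 3, 4])    # four equally likely degree values
```

In the first case a single degree value has probability 1, so the entropy vanishes; in the second case four values each have probability 1/4, which gives 2 bits.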

A second, equally important metric obtained from the analysis of individual degrees is the assortativity (or degree correlation) of a system. It quantifies the presence of degree correlations between connected nodes. High assortativity indicates a tendency of high-degree nodes to connect with other high-degree nodes. A disassortative network (negative assortativity) indicates a tendency of high-degree nodes to link with low-degree ones. This tendency is assessed by comparing the probability of such preferential connections with what would be expected in a random configuration of the network. It has been extensively applied to real-world systems such as social networks (assortative), biological networks (disassortative) or technological networks (disassortative) [New01b, GI95, DYB03, CCGJ02]. Mathematically, it is defined through $P(k' \mid k)$, the probability that a node of degree $k$ is linked to a node of degree $k'$. It can be expressed explicitly through the Pearson correlation coefficient as:

\[ D_c = \frac{1}{l} \sum_{j>i} \frac{1}{2} (k_i + k_j)\, a_{ij}. \tag{2.4} \]
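The notion of degree correlation can be illustrated with a short Python sketch. It assumes an undirected, unweighted network stored as a symmetric 0/1 adjacency matrix, and it uses the standard Pearson correlation between the degrees found at the two ends of every link; this choice is an assumption made for illustration and is not necessarily the exact form intended in Eq. (2.4). The function name `degree_assortativity` is hypothetical.

```python
from math import sqrt

def degree_assortativity(A):
    """Pearson correlation between the degrees at both ends of each link.

    Assumes a symmetric 0/1 adjacency matrix (undirected network).
    Returns NaN when the correlation is undefined (no links, or all
    link ends having the same degree).
    """
    n = len(A)
    k = [sum(row) for row in A]
    # Each undirected link (i, j) contributes (k_i, k_j) and (k_j, k_i).
    ends = [(k[i], k[j]) for i in range(n) for j in range(n) if A[i][j]]
    m = len(ends)
    if m == 0:
        return float("nan")
    mean_x = sum(x for x, _ in ends) / m
    mean_y = sum(y for _, y in ends) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in ends) / m
    var_x = sum((x - mean_x) ** 2 for x, _ in ends) / m
    var_y = sum((y - mean_y) ** 2 for _, y in ends) / m
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)

# Star network: one hub (node 0) linked to three leaves.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
r_star = degree_assortativity(star)
```

A star is the extreme disassortative case: every link joins the hub to a node of degree 1, so the correlation reaches its minimum of -1.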

For the sake of completeness, we also consider the maximum degree, which is the degree of the most connected node: $k_{\max} = \max_i k_i$.

An additional micro-scale metric of interest is the link density, defined as the proportion of actual links in the network relative to the total possible number of links $n^2$. Mathematically, we obtain:

\[ l_d = \frac{\sum_{i,j} a_{i,j}}{n^2} = \frac{l}{n^2} \tag{2.5} \]

This definition shows that the metric lies within the [0, 1] interval. A network with a null link density would be empty (no links at all), whilst a link density of 1 indicates a fully connected network.
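Both the maximum degree and the link density of Eq. (2.5) follow directly from the adjacency matrix. A minimal sketch, with hypothetical function names and toy data:

```python
def max_degree(A):
    """Degree of the most connected node (largest row sum of A)."""
    return max(sum(row) for row in A)

def link_density(A):
    """Number of non-zero entries of A divided by n^2, as in Eq. (2.5)."""
    n = len(A)
    return sum(sum(row) for row in A) / n ** 2

# Hypothetical network: a triangle (0, 1, 2) plus node 3 attached to node 0.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
k_max = max_degree(A)
density = link_density(A)
```

Note that, with $n^2$ in the denominator, a network without self-loops can reach a density of at most $(n-1)/n$.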

Meso-scale metrics

When the focus is put on a group of nodes rather than on the entire set, meso-scale metrics emerge. For the scope of this work, the two most important are the clustering coefficient and the Information Content (IC). Two additional metrics can be mentioned: motifs and communities. However, the concept of community (a group of more densely interconnected nodes) and its assessment are not consistently defined, and communities will not be used in this PhD Thesis. More information about the different definitions can nevertheless be found in [RCC+04, NG04, ZMW05]. Motifs are better defined and relate to the concept of clustering [MSOI+02]. Specifically, motifs are subgraphs of three or four nodes that appear more often than statistically expected. However, this notion will not be used in our work either.

Returning to the two most important ones, the clustering coefficient (or transitivity) measures the density of triangles in the network, that is, the proportion of triangles (groups of three fully interconnected nodes) with respect to the number of connected triples (sets of three nodes in which each node can be reached from the others, directly or indirectly) [New01a]. If we denote by $N_\triangle$ the number of triangles in the network and by $N_3$ the number of connected triples, then:

\[ CC = \frac{3 N_\triangle}{N_3} \tag{2.6} \]

More precisely:

\[ N_\triangle = \sum_{k>j>i} a_{ij} a_{jk} a_{ik}, \quad \text{and} \quad N_3 = \sum_{k>j>i} \left( a_{ij} a_{ik} + a_{jk} a_{ji} + a_{ki} a_{kj} \right) \tag{2.7} \]

A clustering coefficient close to 1 indicates that almost all connected triples are closed into triangles, which translates into network terms the well-known social rule that ‘the friends of my friends are my friends’.
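Equations (2.6) and (2.7) can be evaluated by enumerating all sets of three nodes. The sketch below assumes an undirected network with a symmetric 0/1 adjacency matrix; the function name and the convention of returning 0 when the network contains no triple are assumptions made for the example.

```python
from itertools import combinations

def clustering_coefficient(A):
    """Transitivity: 3 * (number of triangles) / (number of triples)."""
    n = len(A)
    triangles = 0
    triples = 0
    for i, j, k in combinations(range(n), 3):      # guarantees i < j < k
        triangles += A[i][j] * A[j][k] * A[i][k]
        # One term per possible centre of the triple: i, then j, then k.
        triples += A[i][j] * A[i][k] + A[j][k] * A[j][i] + A[k][i] * A[k][j]
    return 3 * triangles / triples if triples else 0.0

# Hypothetical network: a triangle (0, 1, 2) plus node 3 attached to node 0.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
cc = clustering_coefficient(A)
```

In this example there is one triangle and five triples (three inside the triangle, two centred on node 0 and involving node 3), hence a coefficient of 3/5. The enumeration costs $O(n^3)$ operations, which is acceptable only for small networks.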

On the other hand, the Information Content (IC) assesses the presence of meso-scale structures in a broad sense [ZSM14], by looking at recurrences within the adjacency matrix of a network.

In essence, it consists in iteratively merging pairs of nodes with the lowest possible loss of information. In more detail, at each step it identifies the pair of nodes (which generally share a similar connectivity pattern) whose merging would yield the smallest loss of information, from a Shannon information theory perspective. The IC is then computed as the sum of the information lost at each step until the network has shrunk to a single node. Finally, the value is normalised against the average value obtained on an ensemble of random networks. This metric can be seen as the quantity of information encoded in the structure of the network itself. A low information content is representative of regular topologies in which connectivity patterns are repeated, in other words, of meso-scale structures.

Macro-scale metrics

Only the metrics belonging to the macro-scale family remain to be described. They are called ‘macro’ because they account for the overall flow of information within the structure of the system. Examples include the number of steps required to go from one end of the network to the other, or the role that a node plays in the propagation of information through the system. Two of them have already been described as natural extensions of a micro-scale metric, namely the entropy of the degree distribution and the assortativity.

An additional basic macro-scale metric is the average number of jumps necessary to go from one node to another. It is called the average geodesic distance, denoted here by $g$:

\[ g = \frac{1}{n(n-1)} \sum_{i \neq j} d_{ij} \tag{2.8} \]

where $d_{ij}$ is the length of the shortest path between nodes $i$ and $j$.

This intuitive definition has the important disadvantage of diverging when the network is disconnected, since there exist pairs of nodes $i$ and $j$ belonging to two disjoint parts of the network, for which the distance $d_{ij}$ is infinite. To solve this problem, [LM01] proposed to consider the inverse of the harmonic mean of the distances, and called the resulting metric ‘global efficiency’:

\[ E = \frac{1}{n(n-1)} \sum_{i \neq j} \frac{1}{d_{ij}} \tag{2.9} \]

The name comes from the fact that it quantifies the efficiency of the network in transmitting information from one node to another, under the assumption of an unweighted cost of transmission, that is, that the total cost of transmission is proportional to the distance between the pair of nodes considered.

Finally, we can mention the diameter $D$ of a complex network, that is, the greatest distance between any pair of vertices, defined as:

\[ D = \max_{i,j} d_{ij} \tag{2.10} \]
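The three distance-based metrics of Eqs. (2.8) to (2.10) can be sketched together, since they all rely on the shortest-path lengths between pairs of nodes. The example below assumes an unweighted network with at least two nodes and computes the distances with a breadth-first search; the function names are hypothetical, and unreachable pairs are given an infinite distance.

```python
from collections import deque
from math import inf

def distances_from(A, source):
    """Breadth-first search: shortest-path lengths from source to all nodes."""
    n = len(A)
    dist = [inf] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if A[u][v] and dist[v] == inf:     # first visit of node v
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_metrics(A):
    """Return (average geodesic distance, global efficiency, diameter)."""
    n = len(A)
    d = [dij for i in range(n)
         for j, dij in enumerate(distances_from(A, i)) if i != j]
    g = sum(d) / (n * (n - 1))
    E = sum(1 / dij for dij in d) / (n * (n - 1))   # 1 / inf evaluates to 0.0
    D = max(d)
    return g, E, D

path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]     # chain 0 - 1 - 2
split = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]    # node 2 is isolated
g_path, E_path, D_path = distance_metrics(path)
g_split, E_split, D_split = distance_metrics(split)
```

On the disconnected example, the average geodesic distance and the diameter are both infinite, whereas the global efficiency remains finite because unreachable pairs simply contribute zero to the sum.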

Benchmarking

The attentive reader will have noted that some of these metrics are strongly bound to the structure of the network and/or to its number of links. As such, a comparison between two networks can quickly become meaningless when they differ, for instance, in their number of nodes or links. Therefore, in order to benchmark metrics across heterogeneous networks, it is necessary to normalise them.

This is usually done by normalising against a reference model, i.e. the value obtained on average in equivalent random networks. This normalisation is detailed below.

The idea is to create an ensemble of random networks, known as Erdős-Rényi (ER) graphs, with the same number of nodes and links as the network under study. The metric is then computed over all the randomly created networks and averaged, which gives the value that would be expected for the metric in a network equivalent to the studied one (i.e. with the same number of nodes and links). The normalisation is then performed through a Z-Score:

\[ Z_m = \frac{m - \mu}{\sigma}, \tag{2.11} \]

where $m$ is the value of the metric in the studied network, $\mu$ is the average of the metric across the random networks, and $\sigma$ is the corresponding standard deviation.

With this definition, a positive (respectively negative) Z-Score indicates a value that is higher (respectively lower) than expected in equivalent random networks. In this way, two heterogeneous networks can be benchmarked against one another through their corresponding Z-Scores.
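The benchmarking procedure can be sketched as follows. The example assumes an undirected network without self-loops, so that the number of distinct links is half the sum of the adjacency matrix, and it draws random networks with exactly the same number of nodes and links. The ensemble size, the seed, the use of the sample standard deviation and all function names are assumptions made for illustration.

```python
import random
from statistics import mean, stdev

def random_network(n, l, rng):
    """Undirected random network with n nodes and exactly l links."""
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    A = [[0] * n for _ in range(n)]
    for i, j in rng.sample(pairs, l):          # l distinct node pairs
        A[i][j] = A[j][i] = 1
    return A

def z_score(A, metric, n_random=200, seed=0):
    """Z-Score of metric(A) against an ensemble of equivalent random networks."""
    rng = random.Random(seed)
    n = len(A)
    l = sum(sum(row) for row in A) // 2        # each link appears twice in A
    values = [metric(random_network(n, l, rng)) for _ in range(n_random)]
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return float("nan")                    # undefined: no spread in ensemble
    return (metric(A) - mu) / sigma

def max_degree(A):
    return max(sum(row) for row in A)

# Star network with 8 nodes: node 0 is linked to all the others.
star = [[0] * 8 for _ in range(8)]
for j in range(1, 8):
    star[0][j] = star[j][0] = 1

z = z_score(star, max_degree)
```

The maximum degree of the star is far larger than in random networks with the same number of nodes and links, so its Z-Score is positive. Any other metric taking an adjacency matrix as input can be passed in place of `max_degree`.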