Graph Elements, Types, and Density - (the Pragmatic Programmers) Dmitry Zinoviev-Data Science E

If at least one graph edge is directed (say, it connects node A to node B but not the other way around), the graph itself is directed and is called a digraph.

If there are parallel edges in a graph (that is, node A may be connected to node B by more than one edge), the graph is called a multigraph. The edge from node A to itself is called a loop. A graph without loops and parallel edges is called a simple graph.

Weights can be assigned to graph edges. A weight is usually (but not neces-sarily) a number between 0 and 1, inclusive. The larger the weight, the stronger the connection between the nodes. A graph with weighted edges is called a weighted graph.

Edge weight is an example of an edge attribute. Edges may have other attributes, including numeric, Boolean, and string. Graph nodes can have attributes, too.

The degree of a graph node is defined as the number of edges connected (incident) to the node. For a directed graph, indegree (the number of edges coming into the node) and outdegree (the number of edges going out of the node) are distinguished.

Graph density d (0≤d≤1) tells how close the graph is to a complete graph—a cobweb of all possible edges. For example, for a directed graph with ^e edges and ⁿ nodes:

d = ^e

n(n − 1) For an undirected graph:

Chapter 7. Working with Network Data

•

¹²²

faceted objects. To understand them better, it’s important to understand the terminology.

The concept of inter-node connectivity can be expanded by looking at graph walks. A walk is any sequence of edges such that the end of one edge is the beginning of another edge. When we take a bus from node “home” to node

“subway station A,” then take a train from node “subway station A” to node

“subway station B,” and then walk from node “subway station B” to node “our office,” we complete a graph walk (even though only part of the walk involves actual walking). We call a walk that doesn’t intersect itself (in other words, it doesn’t include the same node twice, except perhaps for the first-last node) a path. We call a closed path a loop. It’s a loop that takes us back home at the end of a hard day.

Just like we can measure the distance between two objects in real life, we can measure the distance between any two graph nodes: it is simply the smallest number of edges (sometimes called “hops”) in any path connecting the nodes. This definition doesn’t work in the case of weighted graphs, but sometimes you can pretend that the graph is really not weighted, and still count the edges. In a directed graph, the distance from A to B is not always equal to the distance from B to A. In fact, we may be able to get from A to B, but not back! Living your life from birth to death is a sad but sobering example.

The largest distance between any two nodes in a graph is called the diameter (^D) of the graph. Unlike circles, graphs don’t have area.

A connected component, or simply a component, is a set of all nodes in a graph such that there is a path from each node in the set to each other node in the set. For a directed graph, strongly connected components (connected by actual paths) and weakly connected components (connected by the paths where the edges were converted to undirected) are distinguished.

If a graph has several components, the largest component is called the giant connected component (GCC). Quite often the size of the GCC is huge. When this is the case, it’s often best to work with the GCC rather than with the entire graph to avoid non-connectivity issues.

Sometimes two parts of a graph are connected, but the connection is so subtle that the removal of a single edge breaks the graph apart. Such an edge is called a bridge.

A clique is a set of nodes such that each node is directly connected to each other node in the set. D’Artagnan and Three Musketeers formed a clique, and so does any other band that operates under the “all for one and one for all”

principle. We call the largest clique in a graph the maximum clique. If a clique cannot be enlarged by adding another node to it, it is called a maximal clique.

A maximum clique is always a maximal clique, but the converse is not always true. A complete graph is a maximal clique.

A star is a set of nodes such that one node is connected to all other nodes in the set, but the other nodes are not connected to each other. Stars are com-monly found in hierarchical multilevel systems (for example, corporations, military institutions, and the Internet).

A set of nodes directly connected to a node A is called the neighborhood (G(A) of ^A). Neighborhoods are a key part in snowballing, which is a data acquisition technique whereby the edges of a randomly selected seed node are followed to its neighbors, then to the neighbors of the neighbors (a second-degree neighborhood), and so on.

The local clustering coefficient of a node Â (also known as simply the clustering coefficient) is the number ^CC(A) of edges in Â’s neighborhood (excluding the edges directly connected to Â), divided by the maximum possible number of edges. In other words, it’s the density of the neighborhood of A without the node Â itself. The clustering coefficient of any node in a star is 0. The clustering coefficient of any node in a complete graph is 1. ^CC(A) can be used as a measure of star likeness or completeness of ^G(A).

A network community is a set of nodes such that the number of edges inter-connecting these nodes is much larger than the number of edges crossing the community boundary. Modularity (^m∈[-1/2, 1]) is a measure of quality of community structure. It’s defined as the fraction of the edges that fall within the community minus the expected such fraction if edges were distributed at random. High modularity (^m≈1) characterizes a network with dense and clearly visible communities. Identifying such communities is perhaps the most important outcome of network data analysis (and we take a look at that next in Unit 39, Network Analysis Sequence, on page 126).

Chapter 7. Working with Network Data

•

¹²⁴

several types of centrality measures that measure different aspects of impor-tance. For convenience, centralities are often scaled to the range between 0 (an unimportant, peripheral node) and 1 (an important, central node).

Degree

Degree centrality of A is the number of neighbors of A, which is the same as the degree of ^A or the size of ^G(A). You can scale it by dividing by the maximum possible number of neighbors of ^A, ^n-1.

Closeness

Closeness centrality of ^A is the reciprocal of the average shortest path length L_BA from all other nodes to ^A:

ccA= ^{n − 1}

∑

_B≠A^LBA Betweenness

Betweenness centrality of Â is the fraction of all shortest paths between all pairs of nodes in the network, excluding Â, that contain Â.

Eigenvector

Eigenvector centrality of ^A is defined recursively as the scaled sum of the eigenvector centralities of all neighbors of ^A:

ecA= ¹

∑

_B∈G(A)^ecB

The last two centrality measures are computationally expensive, and it may not be practical to calculate them for large networks.

Unit 39

In document (the Pragmatic Programmers) Dmitry Zinoviev-Data Science Essentials in Python_ Collect - Organize - Explore - Predict - Value-Pragmatic Bookshelf (2016) (Page 135-139)