Dynamical Network Analysis - Handling Inconsistency in Knowledge Bases

CHAPTER 2. METHODS

2.4 Dynamical Network Analysis

Network analysis (sometimes known as graph analysis) encompasses a very widely adopted set of methods for studying the dynamics of systems comprising many individual components, such as computer networks, social networks, biomes and physical systems[43]. Each individual in the system is represented by a node in a network, with edges connecting interacting individuals and the weight of each edge representing an arbitrary measure of the strength of interaction between the individuals it connects.

The definition of nodes in biomolecular systems is arbitrary, but each residue in a protein is most commonly taken as a node[44], while nucleobases and ribophosphate groups are split into separate nodes in nucleic acids. Edges are then placed between nearby nodes and weighted according to the Cartesian covariance between the nodes during an MD simulation:

w_{i,j} = E[(x_i - \bar{x}_i)(x_j - \bar{x}_j)]    (2.22)

where w_{i,j} is the edge weight between nodes i and j, x_i represents the coordinates of node i, \bar{x}_i represents the average coordinates of node i throughout the simulation and E expresses the expectation value of the argument. The network holds key, often subtle, dynamical information about the correlated motions within the system. These motions can be linked to conformational phenomena relevant to biological processes. Many network analysis methodologies exist to address or highlight particular aspects of a network.
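As an illustration of eq. 2.22, the following minimal sketch computes the covariance weight for a single pair of nodes from their per-frame coordinates. The function name and the list-of-tuples trajectory layout are hypothetical, and the product of the two displacement vectors is assumed to be a dot product, which the text does not state explicitly.

```python
def covariance_weight(traj_i, traj_j):
    """Edge weight of eq. 2.22 for one node pair.

    traj_i, traj_j: lists of (x, y, z) tuples, one entry per MD frame.
    Assumption: the displacement product is taken as a dot product.
    """
    n = len(traj_i)
    if n == 0 or n != len(traj_j):
        raise ValueError("trajectories must be non-empty and equally long")

    # Average coordinates of each node over the whole simulation.
    mean_i = [sum(frame[k] for frame in traj_i) / n for k in range(3)]
    mean_j = [sum(frame[k] for frame in traj_j) / n for k in range(3)]

    # Expectation value of the product of the two displacements.
    total = 0.0
    for fi, fj in zip(traj_i, traj_j):
        total += sum((fi[k] - mean_i[k]) * (fj[k] - mean_j[k]) for k in range(3))
    return total / n
```

Two nodes that move together give a positive weight, while nodes that move in opposite directions give a negative one.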

2.4.1 Suboptimal Paths

Communication between two nodes, here termed the source and sink, separated by multiple edges can occur through many different paths in a well-connected network. In the case of dynamical network analysis for MD simulations, the network is undirected, rendering the terms source and sink interchangeable. The shortest path between a source and a sink is referred to as the optimal path. However, there are a number of slightly longer, nearly-optimal paths that will also contribute significantly to communication between the source and sink. These are known as suboptimal paths and they must be taken into account to adequately map the flow of communication between source and sink[44]. For MD system networks, these communication pathways carry the bulk of allosteric transmissions between source and sink.

The length of a path, L_p, is the sum of the lengths of all edges belonging to that path. Edge lengths are defined as -\log(w_{i,j}) because stronger correlations correspond to shorter distances in network space. All suboptimal paths under an arbitrary length cutoff are assumed to contribute significantly to allosteric signaling and combine to form an allosteric tether between source and sink.
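The path search can be sketched as follows. This is an illustrative implementation, not the one used in this work: the function names are hypothetical, edge weights are assumed to lie in (0, 1] so that all edge lengths are non-negative, and the length cutoff is assumed to be expressed as the optimal path length plus a user-chosen tolerance.

```python
import heapq
import math


def edge_lengths(weights):
    """Convert edge weights (assumed to lie in (0, 1]) to lengths -log(w)."""
    return {u: {v: -math.log(w) for v, w in nbrs.items()}
            for u, nbrs in weights.items()}


def optimal_length(lengths, source, sink):
    """Length of the shortest (optimal) path, found with Dijkstra's algorithm."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == sink:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, length in lengths[u].items():
            nd = d + length
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf


def suboptimal_paths(lengths, source, sink, tolerance):
    """All simple paths no longer than the optimal length plus `tolerance`."""
    cutoff = optimal_length(lengths, source, sink) + tolerance
    found = []

    def extend(path, length):
        node = path[-1]
        if node == sink:
            found.append((length, list(path)))
            return
        for nxt, step in lengths[node].items():
            # Prune branches that revisit a node or exceed the cutoff.
            if nxt not in path and length + step <= cutoff:
                path.append(nxt)
                extend(path, length + step)
                path.pop()

    extend([source], 0.0)
    return sorted(found)
```

Because the network is undirected, the `weights` dictionary is expected to list every edge in both directions. The depth-first enumeration is exponential in the worst case, so the tolerance should be kept small for large networks.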

2.4.2 Community Analysis

While network structures can be informative, they are often far too complex to analyze visually for large biomolecules. Community analysis is a tool for segregating groups of nodes into larger, semi-autonomous communities[45]. In a biomolecule, communities indicate heavily self-interacting groups of residues and can be used to visualize general dynamical topology and highlight general functional sites.

Many methods exist to separate networks into communities[45]. One popular method, Girvan-Newman (GN)[46], is especially well-suited for analysis of MD-based networks due to the absence of free parameters in the algorithm, leading to simpler implementation and use. The GN algorithm is executed as follows: edges are assigned betweenness values, where betweenness is calculated as the number of shortest paths between any two nodes that traverse the edge; edges are removed from the network individually, in order of decreasing betweenness, calculating modularity at each iteration and identifying communities (groups of nodes not connected to the rest of the network), until no edges remain; finally, the highest-modularity community structure is returned. The key metric in GN, modularity, is defined as
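The procedure just described can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the implementation used in this work: the graph is treated as unweighted, betweenness is recalculated after every edge removal (as in the original GN formulation), and modularity is evaluated on the original graph with the standard multi-community generalization of the two-community expression. All function names are hypothetical.

```python
from collections import deque


def components(adj):
    """Connected components of a graph given as {node: set(neighbours)}."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        comps.append(comp)
    return comps


def edge_betweenness(adj):
    """Shortest-path edge betweenness (Brandes-style accumulation)."""
    bet = {}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {u: 0.0 for u in order}
        for v in reversed(order):
            for u in preds[v]:
                c = sigma[u] / sigma[v] * (1.0 + delta[v])
                edge = frozenset((u, v))
                bet[edge] = bet.get(edge, 0.0) + c
                delta[u] += c
    return bet  # every node pair is counted twice; the ranking is unaffected


def modularity(original_adj, comps):
    """Multi-community modularity, evaluated on the original graph."""
    m = sum(len(nbrs) for nbrs in original_adj.values()) / 2
    q = 0.0
    for comp in comps:
        inside = sum(1 for u in comp for v in original_adj[u] if v in comp) / 2
        degree = sum(len(original_adj[u]) for u in comp)
        q += inside / m - (degree / (2 * m)) ** 2
    return q


def girvan_newman(adj):
    """Return (best modularity, communities) over the whole edge-removal sequence."""
    work = {u: set(vs) for u, vs in adj.items()}  # working copy, edges get removed
    best = components(work)
    best_q = modularity(adj, best)
    while any(work.values()):
        bet = edge_betweenness(work)
        u, v = max(bet, key=bet.get)  # edge with the highest betweenness
        work[u].discard(v)
        work[v].discard(u)
        comps = components(work)
        q = modularity(adj, comps)
        if q > best_q:
            best_q, best = q, comps
    return best_q, best
```

For two triangles joined by a single bridging edge, the bridge carries the highest betweenness and is removed first, which separates the network into the two triangles.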

Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \frac{s_v s_w + 1}{2}    (2.24)

where Q is modularity[47], k_v is the degree of node v, A_{vw} is the adjacency matrix, s_v is the membership of node v and m is the total number of edges in the graph. The adjacency matrix contains a 1 at position v,w if there is an edge between nodes v and w, and 0 elsewhere. The degree of node v is the number of edges connected to node v. Community membership, s, is set as 1 or -1, depending on which community a node belongs to. It should be noted that eq. 2.24 is defined for two communities, leading to hierarchical community splitting in practice.
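As a worked check of eq. 2.24, the sketch below evaluates the two-community modularity directly from an adjacency matrix and a membership vector of 1 and -1 entries. The function name and the nested-list matrix layout are hypothetical; the factor (s_v s_w + 1)/2 equals 1 when v and w share a community and 0 otherwise.

```python
def two_community_modularity(adjacency, membership):
    """Two-community modularity of eq. 2.24.

    adjacency:  symmetric 0/1 matrix as a list of lists (A_vw).
    membership: list of +1 / -1 values, one per node (s_v).
    """
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]  # k_v
    two_m = sum(degree)                       # 2m, twice the number of edges
    q = 0.0
    for v in range(n):
        for w in range(n):
            expected = degree[v] * degree[w] / two_m
            same = (membership[v] * membership[w] + 1) / 2
            q += (adjacency[v][w] - expected) * same
    return q / two_m
```

Because each row of A_vw - k_v k_w / 2m sums to zero, the same value is obtained from the equivalent form with prefactor 1/4m and the bare product s_v s_w. Placing every node in a single community therefore gives Q = 0.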

While the modularity optimization method rigorously defines the optimal community structure, the number of communities generated from an MD simulation, especially in relatively rigid complexes where thermal noise contributes significantly to the overall dynamics, can be excessive or errant. A minor adaptation of the GN algorithm has been developed for use in this dissertation, wherein a modularity cutoff is set, as a percentage of the optimal modularity, with the final community structure being that which possesses the fewest communities while maintaining a modularity value that lies within the cutoff. Very small reductions in modularity can lead to large reductions in the number of communities GN produces, yielding more interpretable community structures with similarly high modularity scores to the optimal structure.
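The selection step of this adaptation can be sketched as below. The function name and the data layout are hypothetical, and "within the cutoff" is interpreted here as a modularity of at least a given fraction of the optimal (assumed positive) modularity; the candidates would be the community structures recorded at each GN iteration.

```python
def select_by_modularity_cutoff(candidates, cutoff_fraction):
    """Pick the structure with the fewest communities within the cutoff.

    candidates:      list of (modularity, communities) pairs.
    cutoff_fraction: e.g. 0.95 keeps structures with at least 95 % of the
                     optimal modularity (assumed interpretation).
    """
    best_q = max(q for q, _ in candidates)
    threshold = cutoff_fraction * best_q
    eligible = [(q, comms) for q, comms in candidates if q >= threshold]
    # Fewest communities first; ties are broken by higher modularity.
    return min(eligible, key=lambda item: (len(item[1]), -item[0]))
```

With a cutoff fraction of 1.0 the function returns the optimal structure itself, so the unmodified GN result is recovered as a special case.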
