Using Surface Residue Triplets to Identify Putative Binding Sites

3.1 Materials and Methods

3.1.1 Using Surface Residue Triplets to Identify Putative Binding Sites

As previously described, the Delaunay tessellation of a set of protein points yields a convex hull composed of Delaunay tetrahedra. The convex hull defines a set of simplices with one or more triangular faces that are not shared with an adjacent simplex; however, because the convex hull does not accurately describe the shape of the protein, we removed all simplices with edge lengths greater than a distance threshold. After some experimentation, we selected a threshold of 11.5 Å as the minimum distance that allows all residues to retain Delaunay edges.

The resulting hull is no longer convex, but its unshared triangular faces, which we call surface residue triplets, effectively define the solvent-accessible surface of the protein. These surface residue triplets characterize the surface topology of a protein (Figure 3.1A) and provide a unique and critical basis for scoring protein surface residues using SNAPP. Surface residue triplets define a surface topology that depends on the distance threshold for edge removal; although other methods such as α-shapes [36] have been used to remove Delaunay edges, we have found that removing edges based on length is both consistent and computationally simpler.
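The edge-length filter and the extraction of unshared faces can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the tessellation has already been computed (for example with a library such as `scipy.spatial.Delaunay`), that a tetrahedron is discarded when any of its six edges exceeds the threshold, and that residues are identified by integer indices into a coordinate list. The function name `surface_triplets` and the toy coordinates are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import dist

MAX_EDGE = 11.5  # Å, the threshold reported in the text


def surface_triplets(points, tetrahedra, max_edge=MAX_EDGE):
    """Return triangular faces that belong to exactly one retained tetrahedron."""
    # Assumption: drop a tetrahedron if ANY of its six edges is too long.
    kept = [
        tet for tet in tetrahedra
        if all(dist(points[a], points[b]) <= max_edge
               for a, b in combinations(tet, 2))
    ]
    # Count how many retained tetrahedra share each triangular face.
    face_counts = Counter(
        tuple(sorted(face)) for tet in kept for face in combinations(tet, 3)
    )
    # A face seen once is not shared with a neighbour: a surface triplet.
    return sorted(face for face, n in face_counts.items() if n == 1)


# Toy input: two compact tetrahedra sharing face (1, 2, 3), plus one
# tetrahedron with a 25 Å edge that the filter removes.
points = [(0, 0, 0), (5, 0, 0), (0, 5, 0), (0, 0, 5), (5, 5, 5), (30, 0, 0)]
tetrahedra = [(0, 1, 2, 3), (1, 2, 3, 4), (1, 3, 4, 5)]
triplets = surface_triplets(points, tetrahedra)
```

In this toy case the shared face (1, 2, 3) is interior, so the six remaining faces of the two retained tetrahedra are reported as surface triplets.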

By definition, triplets cannot be scored using the four-body SNAPP scoring function. However, a triplet at a protein-protein interface would form a new simplex when tessellated with the binding partner, resulting in a SNAPP-scorable, four-body simplex. Such an interfacial tetrahedron would have a constrained simplex type (limited to type 0, 1, or 3; see Figure 2.1C for type definitions) based on the sequence adjacency of the surface residue triplet. Based on this concept, we define an ad hoc simplex built on the triplet, but we allow the composition of the fourth residue to vary, yielding a modular but SNAPP-capable scoring function. We hypothesize that particular fourth-residue compositions may provide additional stability and a lower binding free energy for the PPI, and that these compositions will also yield higher SNAPP scores. This allows us to identify (1) triplets that are likely to form more favorable interfacial tetrahedra and (2) the composition of potential surface residues that will maximize the stability of a particular interfacial tetrahedron formed with a given triplet.
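The restriction to types 0, 1, and 3 can be illustrated with a short sketch. Figure 2.1C is not reproduced in this section, so the mapping below is an assumption: type 0 means no two triplet residues are adjacent in sequence, type 1 means exactly one adjacent pair, and type 3 means three consecutive residues. The sketch also assumes that all three residues lie on the same chain and are identified by integer sequence positions; the function name `ad_hoc_simplex_type` is hypothetical.

```python
def ad_hoc_simplex_type(triplet_positions):
    """Type of the simplex formed by a surface triplet and a non-adjacent residue X.

    Assumed type numbering (not confirmed against Figure 2.1C):
      0 = no sequence-adjacent residues,
      1 = one adjacent pair,
      3 = three consecutive residues.
    """
    a, b, c = sorted(triplet_positions)
    # Count adjacent pairs among the sorted sequence positions (0, 1, or 2).
    adjacent_pairs = int(b - a == 1) + int(c - b == 1)
    # X is never sequence-adjacent to the triplet, so the types that would
    # require X to be adjacent to a triplet residue cannot occur.
    return {0: 0, 1: 1, 2: 3}[adjacent_pairs]


print(ad_hoc_simplex_type((10, 42, 87)))  # no adjacency
print(ad_hoc_simplex_type((10, 11, 87)))  # one adjacent pair
print(ad_hoc_simplex_type((10, 11, 12)))  # three consecutive residues
```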

Figure 3.1: The CRACLe workflow. (A) Delaunay tessellation of a protein structure using a single-point-per-residue model to identify the solvent-exposed simplex faces, i.e., surface residue triplets. (B) For each surface residue triplet, we evaluate the likelihood of a potential interaction between the triplet and each of the 20 standard amino acids, represented by the imaginary residue X, resulting in a triplet feature vector vT of 20 SNAPP scores. A summation of all triplet feature vectors that contain a single residue in common yields a residue feature vector vR for each surface residue. We then concatenate each surface residue feature vector to form the SNAPP pairing matrix, where each cell contains the pairing potential between a surface residue and a particular amino acid. (C) Each pairing potential in the SNAPP pairing matrix is ranked according to the highest potential. The top N0 pairing potentials are identified, and up to U0 unique surface residues are identified as primary critical residues. The top N1 pairing potentials are then identified as secondary critical residues and mapped onto the protein surface. Binding sites are predicted based on the clustering of primary and secondary critical residues. Both the function-based and the maximum-potential algorithms follow this generic workflow.

Therefore, for each surface residue triplet ti with a given residue composition and sequence adjacency, we define an ad hoc simplex (Figure 3.1B) whose tetrahedral type is defined by the sequence adjacency of the triplet residues and the non-adjacent residue X, thus limiting each ad hoc simplex to type 0, 1, or 3. The composition of each ad hoc simplex is defined by the triplet residues and an ad hoc residue X, where X represents the set of all 20 naturally occurring amino acids, resulting in a 1 × 20 triplet feature vector of SNAPP scores, vT:

$$ v_T(i, j) = q(t_i, X_j), \quad j = 1, \dots, 20 \tag{3.1} $$

Each of the twenty SNAPP scores in vT is a likelihood function of simplex occurrence, but because it is a logarithm, we are able to use vector summation to calculate the likelihood of any two triplets occurring together. Such a summation essentially calculates a local SNAPP score, similar to Equation 2.5, that is dependent on the value of X.
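Building a triplet feature vector and adding two such vectors can be sketched as below. The scoring function `toy_q` is a placeholder invented for this illustration; it does not reproduce SNAPP scores, and a real `q` would look up the four-body log-likelihood for the triplet composition, the residue X, and the simplex type. The alphabet ordering and function names are likewise assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def triplet_feature_vector(triplet, q):
    """Score one triplet against every candidate fourth residue X_j."""
    return [q(triplet, x) for x in AMINO_ACIDS]


def add_vectors(u, v):
    """Element-wise sum; valid for combining scores because they are logarithms."""
    return [a + b for a, b in zip(u, v)]


def toy_q(triplet, x):
    # Placeholder only: counts how often X already occurs in the triplet.
    # It stands in for the SNAPP log-likelihood lookup, which is not shown here.
    return float(triplet.count(x))


v1 = triplet_feature_vector(("L", "L", "K"), toy_q)
v2 = triplet_feature_vector(("K", "E", "D"), toy_q)
combined = add_vectors(v1, v2)  # one summed score per candidate residue X
```

Each entry of `combined` plays the role of a local score for one choice of X across the two triplets.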

All triplet feature vectors that contain a mutual vertex are added together using a vector summation to generate a residue feature vector vR for each surface residue ri (Figure 3.1B). Thus, each of the twenty scores in vR is the summation of SNAPP scores for a simplex composed of a particular residue X and the neighboring triplets. Each vR score estimates the likelihood that the surface residue ri will interact with a particular residue X, and we call this statistical likelihood a pairing potential. A second residue feature vector, vR0, is also created by dividing each residue feature vector vR by the number of contributing triplets.

$$ v_R(r_i) = \sum_{v_T \,:\, r_i \in v_T} v_T \tag{3.2} $$

$$ v_{R0} = \frac{v_R}{|v_T|} \tag{3.3} $$
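Equations 3.2 and 3.3 can be sketched as a single pass over the triplet feature vectors. This is an illustrative sketch under stated assumptions: triplets are tuples of residue identifiers, each triplet maps to a list of scores, and the vectors in the example are shortened to three entries for readability (the text uses 20). The function name `residue_feature_vectors` is hypothetical.

```python
from collections import defaultdict


def residue_feature_vectors(triplet_vectors):
    """Return (summed, averaged) feature vectors keyed by residue id.

    triplet_vectors maps a triplet (tuple of residue ids) to its score list.
    """
    sums = {}
    counts = defaultdict(int)
    for triplet, v_t in triplet_vectors.items():
        for residue in triplet:
            if residue not in sums:
                sums[residue] = [0.0] * len(v_t)
            # Equation 3.2: add every triplet vector that contains the residue.
            sums[residue] = [s + x for s, x in zip(sums[residue], v_t)]
            counts[residue] += 1
    # Equation 3.3: divide by the number of contributing triplets.
    means = {r: [s / counts[r] for s in v] for r, v in sums.items()}
    return sums, means


triplet_vectors = {
    (1, 2, 3): [1.0, 2.0, 3.0],
    (2, 3, 4): [3.0, 0.0, 1.0],
}
v_r, v_r0 = residue_feature_vectors(triplet_vectors)
```

Residues 2 and 3 appear in both triplets, so their summed vectors differ from their averaged vectors, while residues 1 and 4 have identical summed and averaged vectors.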

Both the vR and vR0 feature vectors are independently normalized using a z-score and concatenated to form two independent, protein-specific SNAPP pairing matrices (Fig. 2b) with dimensions NAA × NSR, where NAA is the number of amino acids in the alphabet and NSR is the number of surface residues on the protein. Columns contain the scores for each surface residue (1 × NAA), and rows contain the scores for each ad hoc residue X (1 × NSR).
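The normalization and concatenation step can be sketched as follows. The text does not state the population over which the z-score is computed, so this sketch assumes that each residue feature vector is standardized across its own entries using the population standard deviation; both choices are assumptions, as are the function names. The example vectors are shortened to three entries for readability.

```python
from statistics import mean, pstdev


def z_score(vector):
    """Standardize one feature vector (assumption: per-vector, population SD)."""
    mu, sigma = mean(vector), pstdev(vector)
    if sigma == 0:
        return [0.0] * len(vector)  # constant vector: no spread to scale by
    return [(x - mu) / sigma for x in vector]


def pairing_matrix(residue_vectors):
    """Assemble an N_AA x N_SR matrix: one column per surface residue."""
    residues = sorted(residue_vectors)
    columns = [z_score(residue_vectors[r]) for r in residues]
    # Transpose so that rows index amino acids and columns index residues.
    matrix = [list(row) for row in zip(*columns)]
    return residues, matrix


residues, matrix = pairing_matrix({7: [1.0, 2.0, 3.0], 3: [2.0, 2.0, 2.0]})
```

The same function would be applied separately to the vR and vR0 vectors to obtain the two pairing matrices.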

By definition, each cell contains the pairing potential sij for an interaction between a given surface residue ri and a particular ad hoc amino acid Xj. Both scoring matrices are used to
