The previous representation of the rotamer-pair-energy table was a single contiguous sparse matrix, energy2b (energy, 2-body), indexed through three offset arrays. This table represents only a small subset of the rotamer-pair energies that would be stored for n rotamers in a hypothetical n × n table holding an entry for every rotamer pair. Call this hypothetical n × n table T. The structure of energy2b is easiest to describe by noting the regions of T for which energy2b does not allocate memory.
First, T is symmetric. Therefore energy2b does not allocate space for any part of the lower triangle of T (Figure 5.1a).
Next, pairs of rotamers that originate from the same residue can never be assigned simultaneously, so they cannot interact. energy2b allocates blocks of T. The rotamers for all residues are sorted by the residue they originate from, so that all rotamers on the same residue are adjacent. The blocks of T that represent the interactions between pairs of rotamers from the same residue look like steps along the diagonal (Figure 5.1b). energy2b does not allocate these blocks.
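The effect of these first two omissions can be sketched with a few lines of Python. This is an illustration only, not Rosetta code: the function names and the example rotamer counts are hypothetical, and the sketch counts table entries rather than allocating them.

```python
from itertools import combinations


def full_table_entries(rotamers_per_residue):
    """Entries in the hypothetical n x n table T."""
    n = sum(rotamers_per_residue)
    return n * n


def upper_offdiagonal_entries(rotamers_per_residue):
    """Entries left after dropping the lower triangle of T and the
    same-residue blocks along the diagonal: one block of n_i * n_j
    entries for each residue pair i < j."""
    return sum(a * b for a, b in combinations(rotamers_per_residue, 2))


# Hypothetical example: three residues with 3, 2, and 4 rotamers.
counts = [3, 2, 4]
print(full_table_entries(counts), upper_offdiagonal_entries(counts))
```

For rotamer counts n_1, ..., n_k the remaining entry count equals (n² − Σ n_i²) / 2, which is what the pairwise sum computes.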
Finally, for short-ranged energy functions only, not all rotamers interact. energy2b uses the technique of allocating blocks to store subsections of T, and it uses this technique twice. First, energy2b groups rotamers by the residue they originate from; for example, all rotamers from residue i are grouped together and all rotamers from residue j are grouped together. Then, if no rotamers from i and j interact, energy2b does not allocate space for any rotamer pairs in this block (Figure 5.1c). Second, energy2b groups the rotamers of each residue by their amino acid type; for example, all of the serine rotamers on residue i are grouped together, and all of the arginine rotamers on residue j are grouped together. Then, if no serine rotamers from residue i interact with any arginine rotamers from residue j, energy2b does not allocate space for any rotamer pairs in the block between the serines and arginines on this residue pair (Figure 5.1d).
Since pairs of serine side chains are less likely to interact at long range than a serine/arginine pair or a pair of arginines, this second of the two sparse-matrix-by-blocks techniques represents a significant space savings. I have refactored the original code that supports this technique into its own class, the AminoAcidNeighborSparseMatrix class. I like to think of this technique as keeping track of “amino acid neighbors”: some amino acid pairs may neighbor one another for one residue pair (e.g. a pair of arginines) while other amino acid pairs for that same residue pair might not (e.g. a pair of serines). In instances of protein design, I have observed this technique to yield a 40% reduction in memory relative to the layout shown in Figure 5.1c.
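The amino-acid-neighbor idea can be sketched as a contiguous array plus a table of block offsets. The code below is a minimal illustration under stated assumptions: it uses a single Python dictionary keyed by (residue, amino acid, residue, amino acid) in place of energy2b's three offset arrays, and the names build_layout and lookup are hypothetical, not the interface of AminoAcidNeighborSparseMatrix.

```python
def build_layout(rotamer_counts, aa_neighbors):
    """Assign each neighboring amino-acid block a start offset in one
    contiguous array.

    rotamer_counts[i][aa] is the number of rotamers of type aa on residue i.
    aa_neighbors is a set of (i, aa_i, j, aa_j) keys with i < j.
    Returns (offsets, total_size).
    """
    offsets = {}
    size = 0
    for key in sorted(aa_neighbors):
        i, aa_i, j, aa_j = key
        offsets[key] = size
        size += rotamer_counts[i][aa_i] * rotamer_counts[j][aa_j]
    return offsets, size


def lookup(energies, offsets, rotamer_counts, i, aa_i, r, j, aa_j, s):
    """Energy between rotamer r of type aa_i on residue i and rotamer s of
    type aa_j on residue j (i < j); blocks that were never allocated
    are treated as zero."""
    start = offsets.get((i, aa_i, j, aa_j))
    if start is None:
        return 0.0
    # Row-major within the block: one row per rotamer on residue i.
    return energies[start + r * rotamer_counts[j][aa_j] + s]
```

With 2 serine and 3 arginine rotamers on one residue and 2 serine and 4 arginine rotamers on another, allocating only the ARG/ARG and SER/ARG blocks needs 12 + 8 = 20 entries instead of the 30 that the full residue-pair block would take.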
Because energy2b is allocated as a single contiguous block of memory, it must be allocated before rotamer-pair-energy calculation begins. Therefore, the packer must predict, without actually calculating any interaction energies, which pairs of residues interact and, for the interacting residue pairs, which pairs of amino acids interact. First, the packer examines the Cβ distances for all residue pairs and compares them against a threshold of 16 Å; any residue pair with a Cβ distance less than this threshold is presumed to interact. Second, for each interacting residue pair, the packer compares their Cβ distance against a set of amino-acid-pair-specific cutoffs. For a pair of alanines, their Cβ atoms must be within 8 Å to interact; for a pair of tryptophans, their Cβ atoms must be within 15 Å to interact. After predicting which blocks of T contain interactions, Rosetta allocates the energy2b array, providing exactly enough room to store the energies it has predicted could be non-zero.
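The two-stage prediction can be sketched as follows. Only the 16 Å residue threshold and the alanine/alanine (8 Å) and tryptophan/tryptophan (15 Å) cutoffs come from the text; the fallback to 16 Å for every other amino acid pair, the strict less-than comparison at the boundary, and the function name predict_aa_neighbors are assumptions made for illustration.

```python
import math

RESIDUE_CUTOFF = 16.0  # Angstroms, from the text

# Only these two pair cutoffs are given in the text. Any other pair falls
# back to RESIDUE_CUTOFF, which is an assumption of this sketch.
AA_PAIR_CUTOFF = {
    frozenset(("ALA", "ALA")): 8.0,
    frozenset(("TRP", "TRP")): 15.0,
}


def predict_aa_neighbors(cbeta_coords, aa_types):
    """Predict which amino-acid blocks need storage.

    cbeta_coords[i] is the (x, y, z) position of residue i's C-beta atom.
    aa_types[i] lists the amino acid types allowed on residue i.
    Returns a set of (i, aa_i, j, aa_j) keys with i < j.
    """
    neighbors = set()
    for i in range(len(cbeta_coords)):
        for j in range(i + 1, len(cbeta_coords)):
            d = math.dist(cbeta_coords[i], cbeta_coords[j])
            if d >= RESIDUE_CUTOFF:
                continue  # stage 1: residue pair presumed not to interact
            for aa_i in aa_types[i]:
                for aa_j in aa_types[j]:
                    cutoff = AA_PAIR_CUTOFF.get(
                        frozenset((aa_i, aa_j)), RESIDUE_CUTOFF
                    )
                    if d < cutoff:  # stage 2: amino-acid-pair cutoff
                        neighbors.add((i, aa_i, j, aa_j))
    return neighbors
```

For two residues whose Cβ atoms are 10 Å apart, this sketch keeps the tryptophan/tryptophan block and drops the alanine/alanine block, so no space would be reserved for alanine pairs on that residue pair.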
Figure 5.1: Previous Two-body Energy Table. An illustration of the regions of the complete n × n table for which energy2b allocates space; darkened regions are not allocated. energy2b does not allocate space to store rotamer-pair energies for a) the lower triangle, b) rotamers from the same residue, c) residue pairs that have no interacting rotamers, and d) amino acid pairs with no interacting rotamers. The picture in d) illustrates only one residue pair for which the amino-acid-neighbor sparse matrix representation has carved away space from the remaining n × n table, but the representation applies to the entire table.
Given that energy2b is restricted to not represent both halves of the symmetric energy table, it is laid out for optimal cache efficiency for energy retrieval in simulated annealing. When simulated annealing considers substituting rotamer r with rotamer r′ on residue i, it has to retrieve all energies that rotamer r′ has with each rotamer assigned to the neighboring residues. These neighboring residues can be divided into those with a higher residue number and those with a lower residue number. In energy2b, the energies that r′ has with the rotamers on residues of higher index are in a contiguous block of memory. That is, the location in energy2b of the energies that r′ has with the larger-indexed residues can be visualized as a row going across the upper triangle in Figure 5.1d. In this row, the interaction energy that rotamer r′ has with the assigned rotamer s on residue j is separated from the interaction energy that rotamer r′ has with the assigned rotamer t on residue j + 1 by only the set of interaction energies that rotamer r′ has with the other rotamers of residues j and j + 1, a small enough number that, for the most part, the interactions of r′ with s and t are in the same line of cache. Thus, the energy retrievals for substituting r′ for r on residue i incur one cache miss per residue for those residues with a smaller index, and roughly a single cache miss in total for those residues with a larger index. On average, if each residue has n neighbors, then the average cost of retrieving the interaction energies of a rotamer is n/2 + 1 cache misses.
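The n/2 + 1 estimate can be reproduced with a small counting function. This is an idealized model rather than a measurement: it assumes one cache miss for each lower-indexed neighbor and a single cache miss for the whole run of higher-indexed neighbors, which holds only when that run fits in one line of cache. The function name is hypothetical.

```python
def estimated_cache_misses(residue, neighbors):
    """Idealized cache-miss count for one rotamer substitution on `residue`.

    Each lower-indexed neighbor costs one miss; all higher-indexed
    neighbors together are assumed to cost a single miss because their
    energies sit in one contiguous row of the table.
    """
    lower = sum(1 for j in neighbors if j < residue)
    has_upper = any(j > residue for j in neighbors)
    return lower + (1 if has_upper else 0)


# Hypothetical example: residue 10 with six neighbors, three on each side.
print(estimated_cache_misses(10, [2, 5, 7, 12, 15, 20]))
```

With six neighbors split evenly between lower and higher indices, the model gives 3 + 1 = 4 misses, matching n/2 + 1 for n = 6.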
I postulate that the most time-efficient way to lay out rotamer-pair energies for retrieval in simulated annealing would cost a single cache miss per rotamer substitution. If memory were laid out so that all of the interaction energies that rotamer r′ has with the rest of the rotamers in the protein were stored next to each other, and if these interactions all fit in a single line of cache (they might not), then a retrieval for a rotamer substitution would cost a single cache miss. However, this scheme would use twice as much memory, because each pair energy would be stored once for each of its two rotamers.