4.4 Dynamic Programming on Interaction Graphs: SCPP
4.4.2 Optimizing Dynamic Programming
The running time of my initial implementation of dynamic programming was 180 times longer than that of the final implementation. The same treewidth-3, small-rotamer-library optimization that took 90 minutes on my laptop (1.2 GHz Pentium IV Mobile with 576 MB RAM) the first time took 30 seconds after the work of the months that followed. Two speedups were especially important.
Vertex::eliminate() {
    newHyperedge = new Hyperedge( neighbors )
    backtrackingTable = new BacktrackingTable( neighbors )
    NSV neighborsStateVector( neighbors )
    neighborsStateVector.startAtZero()
    while ( ! neighborsStateVector.visitedAllStates() )
        bestscore = 0
        beststate = 0
        for ( i = 0 : numStates - 1 )
            scoreThisAssignment = 0
            for ( j = 0 : numHyperedges - 1 )
                // score of hyperedge j with this vertex in state i
                scoreThisAssignment += hyperedges[ j ]->getScore( neighborsStateVector, i )
            if ( scoreThisAssignment < bestscore || i == 0 )
                bestscore = scoreThisAssignment
                beststate = i
        newHyperedge.setScore( neighborsStateVector, bestscore )
        backtrackingTable.setBestState( neighborsStateVector, beststate )
        neighborsStateVector.increment()
    dropAllIncidentEdges()
}
Program 1: Pseudocode for vertex elimination in dynamic programming.
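For readers who want something executable, the elimination step can be sketched in Python roughly as follows. This is an illustration under simplifying assumptions, not the implementation discussed in this section: the function name `eliminate`, the dictionary-based score tables, and the `num_states` mapping are all hypothetical, and the dictionaries deliberately ignore the memory-layout concerns addressed below.

```python
from itertools import product

def eliminate(vertex, num_states, hyperedges):
    """Eliminate `vertex` and return (neighbors, new_table, backtrack).

    num_states: dict mapping each vertex id to its number of states.
    hyperedges: list of (vertices, table) pairs incident on `vertex`;
        `vertices` is a tuple of vertex ids and `table` maps a tuple of
        states (one per vertex, in the same order) to a score.
    """
    # The neighbors are every other vertex sharing a hyperedge with `vertex`.
    neighbors = sorted({u for verts, _ in hyperedges for u in verts} - {vertex})
    new_table, backtrack = {}, {}
    # Enumerate neighbor state vectors in lexicographic order.
    for neighbor_states in product(*(range(num_states[u]) for u in neighbors)):
        assignment = dict(zip(neighbors, neighbor_states))
        best_score, best_state = None, None
        for state in range(num_states[vertex]):
            assignment[vertex] = state
            # Sum the scores of all incident hyperedges for this assignment.
            score = sum(table[tuple(assignment[u] for u in verts)]
                        for verts, table in hyperedges)
            if best_score is None or score < best_score:
                best_score, best_state = score, state
        new_table[neighbor_states] = best_score   # score of the new hyperedge
        backtrack[neighbor_states] = best_state   # optimal state, for backtracking
    return tuple(neighbors), new_table, backtrack
```

The returned `new_table` plays the role of the new hyperedge over the neighbors, and `backtrack` records the optimal state of the eliminated vertex for each neighbor state vector.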
The first and most dramatic speedup was due to efficient caching. Modern computer memory caches are designed with the concept of locality of memory reference: when data is accessed from one location in memory, the data in surrounding regions is very likely to be accessed as well. The heuristic employed to capitalize on this locality of reference is to move the block that surrounds a piece of memory onto the processor's cache whenever a piece of data is retrieved from memory, so that subsequent data retrievals are faster. Knowing that processor architecture is optimized to handle localized references, programmers can increase the efficiency of their code by organizing data reference to be local.
The first speedup came from laying out memory so that the large hyperedge scoring tables store next to each other the scores that are accessed sequentially during dynamic programming. The vertex elimination subroutine enumerates all combinations of states for the neighbors of the vertex being eliminated. Enumeration proceeds in lexicographical order, with the neighbors sorted in increasing order by index. That is, if the eliminated vertex had two neighbors, and the lower-indexed neighbor had 2 states and the higher-indexed neighbor had 3 states, then the enumeration of their states would proceed as follows:
0 0 --> 0 1 --> 0 2 --> 1 0 --> 1 1 --> 1 2
For each state assignment to the neighbor vertices, dynamic programming then enumerates all state assignments to the vertex being eliminated, finding the optimal state restricted to the context of the states assigned to the neighbors.
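The lexicographic enumeration can be pictured as an odometer in which the highest-indexed neighbor turns fastest. The sketch below is one plausible way to write the `increment()` step of the neighbor state vector; the function names and the list-based representation are assumptions made for illustration.

```python
def increment(states, state_counts):
    """Advance `states` in place to the next vector in lexicographic order.

    Both lists are ordered by increasing vertex index. Returns False once
    the vector has wrapped around to all zeros, True otherwise.
    """
    for pos in reversed(range(len(states))):
        states[pos] += 1
        if states[pos] < state_counts[pos]:
            return True
        states[pos] = 0  # this position wraps; carry into the next one
    return False

def enumerate_states(state_counts):
    """Return every state vector, in the order the odometer visits them."""
    states = [0] * len(state_counts)
    visited = [tuple(states)]
    while increment(states, state_counts):
        visited.append(tuple(states))
    return visited
```

Calling `enumerate_states([2, 3])` visits the six assignments in the order listed above.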
The hyperedges allocate their tables in row-major order by increasing vertex index. Therefore, the edge that is incident upon vertices {3, 4, 5} stores the score for state assignment $S = (a, b, c)$ next to the score for state assignment $S' = (a, b, c + 1)$ for $c < |S_5| - 1$, and next to the score for state assignment $S' = (a, b + 1, 0)$ for $c = |S_5| - 1$ and $b < |S_4| - 1$, etc.
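A small sketch makes the adjacency concrete. The function name `row_major_index` and the state counts (2, 3 and 4 states for vertices 3, 4 and 5) are hypothetical, chosen only for illustration.

```python
def row_major_index(states, state_counts):
    """Offset of `states` in a table laid out in row-major order.

    Both sequences are ordered by increasing vertex index, so the state
    of the highest-indexed vertex varies fastest.
    """
    index = 0
    for state, count in zip(states, state_counts):
        index = index * count + state
    return index

# Hypothetical state counts for vertices 3, 4 and 5.
counts = (2, 3, 4)
first = row_major_index((0, 1, 2), counts)
second = row_major_index((0, 1, 3), counts)  # c advances by one
third = row_major_index((0, 2, 0), counts)   # c wraps around, b advances
```

Here `second` is `first + 1` and `third` is `second + 1`, so the three scores sit in consecutive table entries.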
The first optimization technique is to assign vertex indices in reverse elimination order, so that dynamic programming traverses the hyperedge scoring tables in a more cache-efficient manner. At the time that vertex v is eliminated, it has the largest index of any remaining vertex, so its dimension in the hyperedge scoring tables for each of its incident edges is the last dimension. Thus retrieving scores while iterating across all the states of v while keeping fixed the states of v's neighbors means retrieving data from several contiguous blocks of memory, one block per hyperedge scoring table.
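The effect of the index assignment can be sketched as follows. The helper names, the vertex labels, and the example elimination order are assumptions for illustration; the point is only that the vertex eliminated first receives the largest index and therefore the unit-stride dimension of every table it appears in.

```python
def assign_indices(elimination_order):
    """Give the first-eliminated vertex the largest index, and so on down."""
    n = len(elimination_order)
    return {v: n - 1 - k for k, v in enumerate(elimination_order)}

def dimension_order(hyperedge, index_of):
    """Order a hyperedge's vertices by increasing index (row-major layout)."""
    return sorted(hyperedge, key=index_of.get)

def strides(state_counts):
    """Row-major strides: how far apart consecutive states of each dimension lie."""
    result = [1] * len(state_counts)
    for k in range(len(state_counts) - 2, -1, -1):
        result[k] = result[k + 1] * state_counts[k + 1]
    return result

# Hypothetical elimination order: C is eliminated first, B last.
index_of = assign_indices(["C", "A", "D", "B"])
dims = dimension_order({"A", "C", "D"}, index_of)  # C ends up last
table_strides = strides([2, 3, 4])                 # last dimension has stride 1
```

Because C occupies the last dimension, iterating over its states with the other two vertices held fixed walks through adjacent table entries.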
This optimization works only if the order of vertices for elimination is known before dynamic programming begins; this is a reasonable requirement for this implementation as it takes the vertex elimination ordering as an input. I describe another version of dynamic programming in Section 4.15 where I compute the vertex elimination ordering on the fly and thus do not lay out memory for cache efficiency.
This first optimization alone brought the running time down from 90 minutes to 1 minute, a 90× speedup.
The second optimization relates to computing indices when retrieving values from the hyperedge scoring tables. After the first optimization, profiling the code revealed that half of the time spent in dynamic programming was spent computing indices into the hyperedge scoring tables; the other half was spent retrieving the values from these tables. Computing the index for an entry in a table of dimension d takes d − 1 multiplications and d − 1 additions. However, in dynamic programming, nowhere near this much work need actually be performed. The traversal of dynamic programming through the hyperedge scoring tables is incredibly simple: for the most part, dynamic programming retrieves the entry in the table next to the entry it just retrieved, and then at regular intervals, it jumps backward and starts over at a certain part of the table.
I developed a scheme I call implicit indexing whereby each hyperedge keeps track of the index for the last score retrieval and then in preparation for the next retrieval either increments that index by one, or decrements that index by some large, easily calculated quantity. The size of the decrement, the jump backward, is a simple function of the number of vertices the edge is incident upon that changed to their first state.
Consider the elimination of a vertex x with five neighbors. For notational convenience, let's say these are vertices 1, 2, 3, 4 and 5 and that vertex $i$ has $s_i$ states. Consider also a particular hyperedge e that is incident upon vertices 1, 4 and 5. Let j be the step at which dynamic programming reaches the state assignment $(1, 2, s_3 - 1, s_4 - 1, s_5 - 1)$, so that the state assignment it considers at step j + 1 is $(1, 3, 0, 0, 0)$. The index that hyperedge e uses to retrieve its score at step j is $1 \cdot s_4 s_5 + (s_4 - 1) s_5 + s_5 - 1 = 2 s_4 s_5 - 1$. The index that hyperedge e uses to retrieve the score at step j + 1 is $1 \cdot s_4 s_5$. Relative to the entry that a plain increment would have reached, $2 s_4 s_5$, the index at step j + 1 lies $s_4 s_5$ entries back. Generally, the index is decremented by the product of the numbers of states of those vertices of the hyperedge that are reset back to zero.
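The update rule can be checked against explicit index computation with a short sketch. It follows the simplification used in the example above and tracks only the neighbor dimensions of the hyperedge. The function names, the use of positions 0 to 4 for vertices 1 to 5, and the concrete state counts are assumptions for illustration, not the implementation itself.

```python
from itertools import product
from math import prod

def explicit_index(states, counts, members):
    """Row-major index into a hyperedge table, recomputed from scratch."""
    index = 0
    for pos in members:
        index = index * counts[pos] + states[pos]
    return index

def implicit_indices(counts, members):
    """Yield the hyperedge's index for each neighbor state vector in
    lexicographic order, updating the previous index rather than recomputing it."""
    member_set = set(members)
    states = [0] * len(counts)
    index = 0
    while True:
        yield index
        # Odometer step: reset trailing positions that are at their last state.
        pivot = len(counts) - 1
        while pivot >= 0 and states[pivot] == counts[pivot] - 1:
            states[pivot] = 0
            pivot -= 1
        if pivot < 0:
            return  # every state vector has been visited
        states[pivot] += 1
        index += 1
        if pivot not in member_set:
            # Jump backward by the product of the state counts of the
            # hyperedge's vertices that were just reset to zero.
            index -= prod(counts[p] for p in members if p > pivot)

# Hypothetical state counts s1..s5; hyperedge e covers vertices 1, 4 and 5.
counts = [2, 4, 2, 3, 2]
members = (0, 3, 4)
fast = list(implicit_indices(counts, members))
slow = [explicit_index(s, counts, members)
        for s in product(*(range(n) for n in counts))]
```

With these counts the two lists agree entry for entry, including the transition from $(1, 2, s_3 - 1, s_4 - 1, s_5 - 1)$ to $(1, 3, 0, 0, 0)$, where the index moves from $2 s_4 s_5 - 1 = 11$ to $s_4 s_5 = 6$.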
The implicit indexing scheme was responsible for a speedup of 2×.