Finally, optimizing side-chain placement is NP-hard. I would like to have proven in this thesis that P=NP, and to have built a polynomial-time solver for the side-chain-placement problem; however, I have not been able to. Instead, in Chapter 4, I put another nail into the tractability coffin of side-chain placement, by proving that this problem is like other NP-hard problems in admitting a dynamic programming solution with time and space requirements exponential in the treewidth of the graph corresponding to each problem instance. While this result at first sounds encouraging, the treewidth of most problem instances in protein design is large, making dynamic programming infeasible.
What I have found encouraging is that simulated annealing performs extremely well at the side-chain-placement problem. In implementing both the exact techniques of dynamic programming and dead-end elimination, I have not found any cases where Rosetta’s simulated annealing protocol produced poor side-chain placements in comparison to the optimal placement produced by either of the exact techniques.
The complexity of the side-chain-placement problem carries over to the hydrogen-placement problem; however, the problem instances that arise in hydrogen placement have dramatically smaller treewidths. This dissertation includes a description of dynamic programming in the hydrogen-placement problem; the dynamic programming algorithm presented here is now part of the Richardsons’ program for hydrogen placement, REDUCE.
Chapter 4
Dynamic Programming on an Interaction Graph
This chapter¹ defines the interaction graph as a model for the side-chain-placement problem. As a model, the interaction graph connects the side-chain-placement problem to a class of NP-hard problems for which certain problem instances can be solved in polynomial time (Bodlaender, 1988; Arnborg and Proskurowski, 1989). The complexity for a single problem instance for this class of graph problems depends on a property of the instance’s graph, a property called treewidth. The treewidth of the problem instance often sits in the exponent of its complexity; those instances with treewidths less than some constant can be solved in polynomial time.
The chapter defines a novel version of dynamic programming specifically tailored to the kinds of problems present in protein design. This version uses less memory and runs in less time at the cost of introducing a guaranteed level of approximation to the energy function. I call this algorithm adaptive dynamic programming.
Finally, the chapter provides an interaction graph formulation of the hydrogen-placement problem that includes a non-pairwise decomposable energy function. It shows that dynamic programming similarly solves this problem. The dynamic programming algorithm described here is now in use and being distributed inside software for hydrogen placement (Word et al., 1999b).

¹ This chapter represents a combination of three publications. The sections on dynamic programming in the side-chain-placement problem and adaptive dynamic programming were published in the Pacific Symposium on Biocomputing (PSB) 2005. This publication was in collaboration with Brian Kuhlman and Jack Snoeyink. The sections on dynamic programming in the hydrogen-placement problems were published initially in the Workshop on Experimental Algorithms (ALENEX) 2004. This publication was in collaboration with Yuanxin Liu and Jack Snoeyink. Following ALENEX’04, the paper was invited for submission into the Journal of Experimental Algorithms, and awaits a second round of review. Xueyi Wang contributed to this expanded publication.
4.1 Graphs, Hypergraphs, and Treewidth
A graph G = {V, E} consists of a set of vertices V and a set of edges E ⊂ V × V. Vertices u and v are adjacent if the pair (u, v) ∈ E. The neighbors of a vertex v are those vertices adjacent to v, and the degree of a vertex is the number of neighbors it has. A hypergraph is a generalization of a graph in which an edge (sometimes called a hyperedge) can contain any number of vertices of V. The degree of a hyperedge is the number of vertices it is incident upon. Figure 4.3b contains an example hypergraph with the vertices drawn as points and the hyperedges drawn as curves encircling them. A multi-graph is a graph in which multiple edges may be incident upon the same vertices, and yet be distinct; each edge must be identified by its set of vertices and an index that identifies which edge in the set it is. A multi-hypergraph is a hypergraph with repeated hyperedges.
The remainder of the discussion on graphs relies on some set notation. Briefly, for the set S and the set T, the cardinality of S is represented by |S| and the set difference of S and T is represented by S \ T.
The treewidth of a graph can be defined in one of two ways: I present both, as both definitions prove useful in the analysis to come.
The first definition of treewidth arises from the definition of a class of graphs called partial k-trees. A k-tree is a recursively defined graph: a (k + 1)-clique is a k-tree, and any graph formed from a k-tree T by connecting a new vertex to each vertex of a k-clique of T is also a k-tree. The 1-trees correspond to ordinary trees (connected graphs with |V| = |E| + 1), except that the graph consisting of a single vertex would not be considered a 1-tree. 2-trees look like triangles that have been glued together at their edges (Figure 4.1). 3-trees look like tetrahedra that have been glued together at their faces. A partial k-tree is a k-tree from which any number of edges have been dropped. The partial 1-trees are forests. Notice that for a graph G with |V| > k + 1, if G is a partial k-tree, it is also a partial (k + 1)-tree. The treewidth of a graph G is the smallest k such that G is a partial k-tree.
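The recursive definition of a k-tree can be made concrete with a short sketch. The Python below is illustrative only and is not taken from the software described in this chapter; the function name build_k_tree, the integer vertex labels, and the random choice of the k-clique to attach to are all assumptions made for the example.

```python
from itertools import combinations
import random

def build_k_tree(k, n, seed=0):
    """Return (edges, k_cliques) for a random k-tree on vertices 0..n-1."""
    if n < k + 1:
        raise ValueError("a k-tree needs at least k + 1 vertices")
    rng = random.Random(seed)
    # Base case: a (k+1)-clique is a k-tree.
    edges = {frozenset(p) for p in combinations(range(k + 1), 2)}
    # Every k-subset of the base clique is a k-clique we may attach to.
    k_cliques = [frozenset(c) for c in combinations(range(k + 1), k)]
    for v in range(k + 1, n):
        # Recursive step: connect the new vertex v to every vertex of a k-clique.
        base = rng.choice(k_cliques)
        edges |= {frozenset((v, u)) for u in base}
        # v forms a new k-clique with each (k-1)-subset of the chosen clique.
        for sub in combinations(base, k - 1):
            k_cliques.append(frozenset(sub) | {v})
    return edges, k_cliques

# A 2-tree on six vertices: triangles glued together along edges.
two_tree_edges, _ = build_k_tree(2, 6)
# A 1-tree on five vertices is an ordinary tree with |V| = |E| + 1.
one_tree_edges, _ = build_k_tree(1, 5)
```

Dropping any subset of the returned edges yields a partial k-tree, so every graph produced this way has treewidth at most k.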
Figure 4.1: A Treewidth-2 Graph. The graph in (b) is a 2-tree, constructed by adjoining triangles (3-cliques) at their edges (2-cliques). Graph (c) demonstrates the relationship between the tree in (a) and the 2-tree in (b): each triangle in (b) corresponds to a vertex in (a), and each edge shared by two triangles in (b) corresponds to an edge in (a).

The second definition of treewidth arises from the definition of a tree decomposition of a graph. The tree decomposition of a (hyper)graph G = {V, E} is a tree T whose vertices X = {X1, X2, . . . , Xm} are subsets of V that satisfy the following three properties:
1. The union of the sets is V: ⋃_{1≤i≤m} Xi = V.
2. Each edge e ∈ E is in some set: e ⊆ Xi for some 1 ≤ i ≤ m.
3. Each vertex v ∈ V occupies a connected part of tree T: for any Xj on a path in T from Xi to Xk, the intersection Xi ∩ Xk ⊆ Xj.
The first two rules say: 1) the tree decomposition must capture all vertices in the graph, and 2) the tree decomposition must capture the edges in the graph. The third rule says that for any vertex v ∈ V the induced subgraph of T formed by taking all vertices {t ∈ X | v ∈ Xt} is connected. That is, for any vertex v, if v ∈ Xi and v ∈ Xk, then v ∈ Xj, for all j on the path from i to k.
The treewidth of a tree decomposition is max_i |Xi| − 1. There are often many ways to create tree decompositions for a single graph. A tree T with only a single vertex X1 = V is a tree decomposition of G with a treewidth of |V| − 1. Such a tree decomposition, however, is uninformative. The treewidth of a graph G is the minimum treewidth over all possible tree decompositions of G. Figure 4.2b illustrates a treewidth-3 tree decomposition of the graph in Fig. 4.2a.
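The three properties and the width formula translate directly into a check. The following Python is a minimal sketch under stated assumptions: the names is_tree_decomposition and width, and the representation of the decomposition as a dictionary of vertex sets plus a list of tree edges, are choices made for illustration. The check also assumes that the supplied tree edges really form a tree, which it does not verify.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """bags maps each tree node to its vertex set Xi; tree_edges lists pairs of nodes."""
    # Property 1: the union of the sets Xi is V.
    if set().union(*bags.values()) != set(vertices):
        return False
    # Property 2: every (hyper)edge lies inside some set Xi.
    for e in edges:
        if not any(set(e) <= bag for bag in bags.values()):
            return False
    # Property 3: the tree nodes containing v form a connected part of T.
    adj = {t: set() for t in bags}
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)
    for v in vertices:
        nodes = {t for t, bag in bags.items() if v in bag}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            t = stack.pop()
            for u in adj[t] & nodes:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        if seen != nodes:
            return False
    return True

def width(bags):
    """Treewidth of a decomposition: the largest set size minus one."""
    return max(len(bag) for bag in bags.values()) - 1

# A 4-cycle a-b-c-d split into two triangles that share the pair {a, c}.
cycle_vertices = ["a", "b", "c", "d"]
cycle_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
good_bags = {0: {"a", "b", "c"}, 1: {"a", "c", "d"}}
# One set per edge arranged on a path: vertex a then sits at both ends only.
bad_bags = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}, 3: {"d", "a"}}
```

The second decomposition covers every vertex and edge but fails the third property, which is the property that dynamic programming relies on.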
Figure 4.2: A Tree Decomposition. The graph in (a) represents the interactions between fifteen residues on the surface of ubiquitin’s beta sheet, as shown in Figure 4.9; vertices are labeled with their residue numbers. The tree in (b) is a tree decomposition of (a). Each tree node contains the vertex label set, Xi. Since max_i |Xi| = 4, this tree decomposition has a treewidth of 3.

Certain tree decompositions prove especially useful for dynamic programming as they define a “vertex elimination order.” I describe vertex elimination orders in greater detail in Section 4.4. These tree decompositions have been called nice tree decompositions due to their usefulness (Bodlaender and Kloks, 1993). A rooted tree decomposition is a tree decomposition where one node t ∈ X is the root, and all edges in T are directed such that they point away from the root. One node i is said to be the parent of another node j if i and j are connected, and j is further from the root than i. A nice tree decomposition is a rooted tree decomposition such that for any child/parent pair i and j, there is exactly one vertex in Xi that is not in Xj; that is, the set difference has size |Xi \ Xj| = 1. Any tree decomposition can be converted into a nice tree decomposition in linear time without changing the treewidth.
Lemma 2. A tree decomposition T with treewidth w of the graph G = {V, E} can be converted in O(w(|T| + |V|)) time into a rooted tree decomposition with treewidth w in which the vertex sets Xi satisfy |Xi \ Xj| = 1 whenever Xj is the parent of Xi.
Proof: I give a constructive proof. Begin by choosing an arbitrary node of T as the root. Now, consider any node Xi and parent Xj for which |Xi \ Xj| ≠ 1. If |Xi \ Xj| = 0, then delete Xi and connect any children of Xi to the parent Xj. Otherwise take one vertex from the set difference, v ∈ Xi \ Xj, and create a new node, Y = (Xi ∩ Xj) ∪ {v}, between Xi and Xj. Node Y now has the desired property, since Y \ Xj = {v}. Tree T is still a tree decomposition: the first two properties for tree decompositions hold trivially, and the third property holds because (Xi ∩ Xj) = (Xi ∩ Y ∩ Xj). The treewidth of the new decomposition is still w since |Y| < |Xi| ≤ (w + 1). Recurse on Xi, whose parent is now Y, until the desired property holds.
This conversion takes O(w(|T| + |V|)) time since each node of T must be visited at least once, and each new node Y added to T corresponds to exactly one vertex of G. The cost of visiting a node of T and the cost of inserting a new node Y into T are both linear in the treewidth w. Assuming that nodes of T have their vertex lists sorted, the set difference operation proceeds in O(w) time. Therefore the time spent visiting nodes of T is O(w|T|). A new node Y is added to T when some vertex v was contained in Xi and not in its parent Xj. By the third property of tree decompositions, since v ∉ Xj, v cannot be contained in any node of T that is not part of the subtree rooted at Xi. After Y’s addition, v is not contained in any node of T other than Y whose parent does not contain v. Since each vertex in V can induce the creation of one new node Y, the number of nodes added to the tree decomposition is O(|V|). The construction of the new node Y is linear in the number of vertices it contains, and the number of vertices of V contained by each new node Y is less than w + 1. Therefore the cost for vertex addition is O(w|V|).
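The construction in the proof can be sketched in a few lines of Python. This is an illustration rather than the implementation used in this work: the function name make_nice, the dictionary representation of the decomposition, and the tuple identifiers ("Y", n) given to inserted nodes are assumptions, and the sketch uses hash-based sets instead of the sorted vertex lists assumed in the running-time analysis.

```python
from itertools import count

def make_nice(bags, tree_edges, root):
    """Return (new_bags, parent) for a rooted decomposition in which every
    non-root node holds exactly one vertex that its parent lacks."""
    adj = {t: set() for t in bags}
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)
    new_bags = {root: set(bags[root])}
    parent = {root: None}
    fresh = count()
    # Each entry: (original node, neighbor it was reached from, parent in output).
    stack = [(c, root, root) for c in adj[root]]
    while stack:
        node, came_from, par = stack.pop()
        bag = set(bags[node])
        extra = list(bag - new_bags[par])
        if not extra:
            # |Xi \ Xj| = 0: drop Xi; its children attach to Xj instead.
            attach = par
        else:
            # Insert nodes Y, each adding one vertex, until Xi differs by one.
            current = bag & new_bags[par]
            for v in extra[:-1]:
                current = current | {v}
                y = ("Y", next(fresh))  # hypothetical id for an inserted node
                new_bags[y] = current
                parent[y] = par
                par = y
            new_bags[node] = bag
            parent[node] = par
            attach = node
        stack.extend((c, node, attach) for c in adj[node] - {came_from})
    return new_bags, parent

# Node 1 differs from the root by two vertices; node 2 is a subset of node 1.
example_bags = {0: {1, 2, 3}, 1: {3, 4, 5}, 2: {3}}
nice_bags, nice_parent = make_nice(example_bags, [(0, 1), (1, 2)], root=0)
```

In this example one intermediate node is inserted between nodes 0 and 1, node 2 is deleted, and the largest set still has three vertices, so the treewidth is unchanged.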
Nice tree decompositions define a partial order on V called an elimination order. The elimination order can be iteratively determined from T as follows: for any leaf node j in T, 1) if j is not the root, then write down the single vertex in Xj \ Xi where i is j’s parent, and remove j from T; 2) if j is the root, then write down the vertices in Xj in any order.
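The leaf-removal procedure can be sketched as follows. The function name elimination_order and the representation of the nice tree decomposition as a dictionary of vertex sets with a parent map are assumptions made for this example.

```python
def elimination_order(bags, parent):
    """bags maps tree node -> vertex set; parent maps tree node -> parent (None at root).
    Assumes a nice tree decomposition: |Xj \\ Xi| = 1 for every child j of i."""
    remaining_children = {t: 0 for t in bags}
    for t, p in parent.items():
        if p is not None:
            remaining_children[p] += 1
    leaves = [t for t, c in remaining_children.items() if c == 0]
    order = []
    while leaves:
        j = leaves.pop()
        i = parent[j]
        if i is None:
            # Rule 2: the root is removed last; its vertices go in any order.
            order.extend(bags[j])
        else:
            # Rule 1: write down the single vertex of Xj that the parent lacks.
            (v,) = bags[j] - bags[i]
            order.append(v)
            remaining_children[i] -= 1
            if remaining_children[i] == 0:
                leaves.append(i)
    return order

# A path-shaped nice tree decomposition rooted at node "r".
chain_bags = {"r": {1, 2, 3}, "m": {3, 4}, "l": {3, 4, 5}}
chain_parent = {"r": None, "m": "r", "l": "m"}
chain_order = elimination_order(chain_bags, chain_parent)
```

Here vertex 5 is written down first, then vertex 4, and finally the root's vertices 1, 2, and 3 in arbitrary order. Because several leaves may be available at once, the procedure fixes only a partial order.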
The elimination order produced by the nice tree decomposition of a graph G provides an intuitive relationship between the two definitions of treewidth: the elimination order for G is the reverse order in which to build a k-tree (with the building rule of adding a vertex and connecting it to a k-clique) from which edges can be dropped to form G as a partial k-tree.
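One way to see this relationship is to run an elimination order forward with the standard elimination procedure: remove each vertex in turn and connect its remaining neighbors into a clique. The largest neighbor set met along the way is the k for which the reversed order builds a k-tree containing G. The sketch below assumes this standard procedure; the function name induced_width is hypothetical.

```python
def induced_width(vertices, edges, order):
    """Largest number of remaining neighbors any vertex has when it is
    eliminated, where eliminating a vertex joins its neighbors pairwise."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    k = 0
    for v in order:
        nbrs = adj.pop(v)
        k = max(k, len(nbrs))
        for u in nbrs:
            adj[u].discard(v)
            # Added edges are those the enclosing k-tree has but G may lack.
            adj[u] |= nbrs - {u}
    return k

path_vertices = [1, 2, 3]
path_edges = [(1, 2), (2, 3)]
ends_first = induced_width(path_vertices, path_edges, [1, 2, 3])
middle_first = induced_width(path_vertices, path_edges, [2, 1, 3])
```

For the three-vertex path, eliminating an end vertex first gives k = 1, matching its treewidth, while eliminating the middle vertex first gives k = 2. The treewidth corresponds to the best order, not to an arbitrary one.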