Lifted Sequential Inference - Sequential Clamping

3.2 Sequential Clamping

3.2.1 Lifted Sequential Inference

When we turn a complex inference task into a sequence of simpler tasks, we repeatedly answer slightly modified queries on the same graph. Because LBP and LWP generally cannot adapt an existing lifted graph and reuse the updated lifted graph for inference, they have to lift the original model from scratch in each of the k iterations, as already shown in Algorithm 3. Each Color Passing run takes O(h · |E|) time, where |E| is the number of edges in the factor graph and h is the number of iterations required. Hence, we can spend up to O(k · h · |E|) time on lifting alone if we clamp k variables sequentially. The previous section already showed how BP-guided decimation fixes one variable after another. Depending on the nature of the CNF, or more generally on the structure of the factor graph, this can require up to k = n iterations.

Let us now consider BP-guided sampling, which can also be cast into the framework of decimation. We will explain its idea in greater detail and show how it poses similar issues for lifting due to changing evidence. Sampling from the joint distribution over k variables can be reduced to a sequence of one-variable samples, each conditioned on a subset of the other variables [146]. Thus, to get a sample for X = (X1, . . . , Xk), we first compute p(X1), then p(X2|X1), . . . , p(Xk|X1, . . . , Xk−1). This problem can also be cast into the framework of Algorithm 3, and one should keep in mind that a CNF represented as a factor graph defines a joint distribution in which each solution has equal probability. Let us now illustrate the idea of BP-guided sampling with a small example.
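Before turning to the example, the chain-rule decomposition above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the conditional marginals are computed exactly by enumerating all solutions of a tiny CNF instead of by (L)BP, clauses use a DIMACS-style literal convention, and the names `solutions`, `conditional_marginal`, and `sequential_sample` are hypothetical and do not appear in the source.

```python
import itertools
import random

def solutions(clauses, n):
    """Enumerate all satisfying assignments of a CNF over variables 1..n.

    Assumed convention: literal v means X_v = True, -v means X_v = False.
    """
    sols = []
    for bits in itertools.product([False, True], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            sols.append(bits)
    return sols

def conditional_marginal(sols, var, evidence):
    """p(X_var = True | evidence) under the uniform distribution over solutions."""
    consistent = [s for s in sols if all(s[v] == val for v, val in evidence.items())]
    return sum(s[var] for s in consistent) / len(consistent)

def sequential_sample(clauses, n, rng):
    """Sample X_1, then X_2 | X_1, ..., then X_n | X_1, ..., X_{n-1}."""
    sols = solutions(clauses, n)
    evidence = {}
    for var in range(n):  # variables are 0-indexed internally
        p_true = conditional_marginal(sols, var, evidence)
        evidence[var] = rng.random() < p_true  # clamp the sampled state
    return tuple(evidence[v] for v in range(n))

clauses = [[1, 2], [-1, 3]]  # (X1 or X2) and (not X1 or X3)
sample = sequential_sample(clauses, 3, random.Random(0))
```

Because every sampled state has non-zero conditional probability, the evidence always stays consistent with at least one solution, so the returned sample is itself a solution. In the lifted setting, the expensive part is that each new piece of evidence changes the symmetries of the model.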

Example 3.4. Assume we want to sample from the joint distribution p(X1, . . . , X6), given the network in Figure 3.4a(top). Further assume that we begin our sequential process by first computing p(X3) from the prior lifted network, i.e., the lifted network when no evidence has been set (depicted in Figure 3.4b(top)). After sampling a state x3, we want to compute p(X\3|x3) as shown in Figure 3.4c(top).

To do so, it is useful to describe BP and its operations in terms of its computation tree (CT), see e.g. [97]. The CT is the unrolling of the (loopy) graph structure where each level i corresponds to the i-th iteration of message passing. Similarly, we can view Color Passing in terms of a colored computation tree (CCT). More precisely, one considers for every node X the CT rooted in X, but now each node in the tree is colored according to the nodes' initial colors (cf. Figures 3.4a to 3.4c(bottom)). Each CCT encodes the root node's local communication pattern, i.e., all the colored paths along which node X communicates in the network. Consequently, CP groups nodes with respect to their CCTs: nodes having the same set of rooted paths of colors (node and factor names neglected) are clustered together.
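The grouping by CCTs can be made concrete with a simplified Color Passing sketch. It is not the implementation used in the source: it assumes symmetric factors, so the position of a variable inside a factor is ignored, and the function name `color_passing` as well as the dictionary-based graph encoding are illustrative choices.

```python
def color_passing(var_colors, factor_colors, factor_scopes, max_iters=100):
    """Simplified Color Passing on a factor graph.

    var_colors:    dict variable -> initial color (evidence changes this color)
    factor_colors: dict factor -> initial color (identical potentials share one)
    factor_scopes: dict factor -> list of variables in its scope
    Assumption: factors are symmetric, so argument positions are ignored.
    Returns a dict variable -> final color id.
    """
    var_colors = dict(var_colors)
    factor_colors = dict(factor_colors)
    neighbors = {v: [] for v in var_colors}
    for f, scope in factor_scopes.items():
        for v in scope:
            neighbors[v].append(f)

    def relabel(signatures):
        # Map each distinct signature to a small integer color.
        ids = {}
        return {k: ids.setdefault(sig, len(ids)) for k, sig in signatures.items()}

    for _ in range(max_iters):
        # Factors collect the colors of their variables.
        factor_colors = relabel({
            f: (factor_colors[f], tuple(sorted(var_colors[v] for v in factor_scopes[f])))
            for f in factor_scopes
        })
        # Variables collect the colors of their factors.
        new_var_colors = relabel({
            v: (var_colors[v], tuple(sorted(factor_colors[f] for f in neighbors[v])))
            for v in var_colors
        })
        # Colors are only ever refined, so an unchanged count means convergence.
        if len(set(new_var_colors.values())) == len(set(var_colors.values())):
            return new_var_colors
        var_colors = new_var_colors
    return var_colors
```

On a chain X1, f1, X2, f2, X3 with identical factors and no evidence, the two end nodes receive the same color and the middle node a different one. Giving X1 a distinct initial color, as clamping would, separates all three nodes.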

Example 3.5. For instance, Figure 3.4a(bottom) shows the CTs rooted in X3 and X5. Because their sets of paths differ, X3 and X5 are clustered into different clusternodes, as indicated by different colors in Figure 3.4b(top). For this prior lifted network, the light green nodes exhibit the same communication pattern in the network, which can be seen in the identical CCTs in Figure 3.4b(bottom), and were consequently grouped together. Now, when we clamp the node X3 to a value x3, we change the communication pattern of every node having a path to X3. Specifically, we change X3’s, and only X3’s, color in all CCTs where X3 is involved, as indicated by the “c” in Figure 3.4c. This affects nodes X1 and X2 differently than X4, and again differently than X5 and X6, for two reasons:

1. they have different communication patterns as they belong to different clusternodes in the prior network

2. more importantly, they have different paths connecting them to X3 in their CCTs

The shortest path is the shortest sequence of factor colors connecting two nodes. Since we are not interested in the paths but whether the paths are identical or not, these sets might as well be represented as colors. Note that in Figure 3.4 we assume identical factors for simplicity. Thus in this case path colors reduce to distances. In the general case, however, we compare the paths, i.e., the sequence of factor colors.

Example 3.6. The prior lifted network can be encoded as the vector l = (0, 0, 1, 1, 0, 0) of node colors. Thus, to get the lifted network for p(X\3|x3), as shown in Figure 3.4c, we only have to consider the vector dist3 of shortest paths distances to X3 (see Figure 3.4d) and refine the initial clusternodes correspondingly. This is done by

1. the element-wise concatenation of two vectors: l ⊕ dist3

2. viewing each resulting number as a new color

For our example, we obtain:

$$(0, 0, 1, 1, 0, 0) \oplus (1, 1, 0, 1, 2, 2) \overset{(1)}{=} (01, 01, 10, 11, 02, 02) \overset{(2)}{=} (3, 3, 4, 5, 6, 6),$$

which corresponds to the lifted network for p(X\3|x3) as shown in Figure 3.4c. Thus, having the shortest path matrix, we can directly update the prior lifted network in linear time without taking the detour through running CP on the ground network. Now, we run inference, sample a state X4 = x4 afterwards, and compute the lifted network for p(X\{3,4}|x4, x3) to draw a sample for p(X1|x4, x3). Essentially, we proceed as before: compute l ⊕ (dist3 ⊕ dist4).
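The two steps of this update can be reproduced with a short sketch. The helper names `concat_colors` and `relabel` are illustrative and not taken from the source. The sketch numbers the new colors by first occurrence, so the color ids differ from those printed above, but the induced partition of the nodes is the same.

```python
def concat_colors(*vectors):
    """Step (1): element-wise concatenation, giving one tuple per node."""
    return list(zip(*vectors))

def relabel(signatures):
    """Step (2): view each distinct tuple as a new color (ids by first occurrence)."""
    ids = {}
    return [ids.setdefault(sig, len(ids)) for sig in signatures]

prior = [0, 0, 1, 1, 0, 0]  # l: colors of X1..X6 in the prior lifted network
dist3 = [1, 1, 0, 1, 2, 2]  # shortest-path distances of X1..X6 to X3

lifted = relabel(concat_colors(prior, dist3))
print(lifted)  # [0, 0, 1, 2, 3, 3]
```

Both steps touch every node once per concatenated vector, which is where the linear-time update comes from.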

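(a) Graph over the nodes X1, . . . , X6. (b) The same graph with X3 clamped. (c) Shortest-path distances between all pairs of nodes:

|    | X1 | X2 | X3 | X4 | X5 | X6 |
|----|----|----|----|----|----|----|
| X1 | 0 | 1 | 1 | 1 | 2 | 2 |
| X2 | 1 | 0 | 1 | 2 | 3 | 2 |
| X3 | 1 | 1 | 0 | 1 | 2 | 1 |
| X4 | 1 | 2 | 1 | 0 | 1 | 1 |
| X5 | 2 | 3 | 2 | 1 | 0 | 1 |
| X6 | 2 | 2 | 1 | 1 | 1 | 0 |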

Figure 3.5: Toy example of a graph where a single run of Shortest-Paths-lifting fails to return the correct lifting.

However, the resulting network might be suboptimal when more than one variable is clamped and variables from the same initial cluster are sampled identically, i.e., take on the same value in the sample.

Example 3.7. The concatenation in the previous example assumed x3 ≠ x4 and, hence, X3 and X4 cannot be in the same clusternode. For x4 = x3, they could be placed in the same clusternode because they were in the same clusternode in the prior network. If X3 and X4 are clamped, this can be checked by dist3 ⊙ dist4, the element-wise sort of two vectors. In our case, this yields l ⊕ (dist3 ⊙ dist4) = l ⊕ l = l: the prior lifted network.
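The effect of the element-wise sort can be checked with a small sketch. Note the assumption: the text gives dist3 but not dist4, so the vector `dist4` below is a hypothetical mirror image of dist3, chosen to match a network in which X3 and X4 play symmetric roles. The helper names are illustrative as well.

```python
def elementwise_sort(*vectors):
    """Element-wise sort: at each position, keep the entries but forget their order."""
    return [tuple(sorted(entries)) for entries in zip(*vectors)]

def relabel(signatures):
    """View each distinct signature as a color (ids by first occurrence)."""
    ids = {}
    return [ids.setdefault(sig, len(ids)) for sig in signatures]

prior = [0, 0, 1, 1, 0, 0]  # l
dist3 = [1, 1, 0, 1, 2, 2]  # from Example 3.6
dist4 = [2, 2, 1, 0, 1, 1]  # ASSUMPTION: not given in the text

# x3 == x4: the two evidence nodes are interchangeable, so sort before concatenating.
same_value = relabel(list(zip(prior, elementwise_sort(dist3, dist4))))
# x3 != x4: keep the order, i.e. plain concatenation of both distance vectors.
different_values = relabel(list(zip(prior, dist3, dist4)))

print(same_value)        # [0, 0, 1, 1, 0, 0]
print(different_values)  # [0, 0, 1, 2, 3, 3]
```

Under this assumed dist4, sorting recovers the prior lifted network, whereas the ordered concatenation splits X3 from X4 and X1, X2 from X5, X6.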

In general, we compute

$$l \oplus \Big( \bigoplus_{x} \Big( \bigoplus_{s} \mathrm{dist}_{x,s} \Big) \Big), \quad \text{where} \quad \mathrm{dist}_{x,s} = \bigodot_{i \in x :\, x_i = s} \mathrm{dist}_i,$$

with clusternodes x and truth values s. For an arbitrary network, however, the shortest paths might be identical although the nodes have to be split, i.e., they differ in a longer path or, in other words, the shortest paths of other nodes to the evidence node are different. Consequently, we apply the shortest-paths lifting iteratively. Let CN_E denote the clusternodes given the set E as evidence. By applying the shortest-paths procedure, we compute CN_E for E = {X1} from CN_∅. This step might cause initial clusternodes to be split into newly formed clusternodes. To incorporate these changes in the network structure, the shortest-paths lifting procedure has to be applied iteratively. Thus, in the next step we compute CN_E for E = {X1} ∪ ∆X1 from the clusternodes obtained for E = {X1}, where ∆X1 denotes the clusternodes changed in the previous step. This procedure is applied iteratively until no new clusternodes are created. We illustrate this issue with another example in which a single run of shortest-paths lifting fails but an iterative application returns the correct lifting.

Example 3.8. The initial lifting of the graph depicted in Figure 3.5a can be encoded as l = (0, 1, 2, 2, 1, 0). If we now clamp variable X3 as shown in Figure 3.5b, we can use the distances in Figure 3.5c to concatenate l and dist3:

$$(0, 1, 2, 2, 1, 0) \oplus (1, 1, 0, 1, 2, 1) = (01, 11, 20, 21, 12, 01) = (0, 1, 2, 3, 4, 0).$$

Yet, this lifting is not correct, as we have to distinguish X1 and X6. We also observe that the clusternodes of X2, X4, and X5 have changed. Therefore, we now have to iteratively apply the shortest paths lifting based on the nodes that changed in the previous iteration:

$$(0, 1, 2, 3, 4, 0) \oplus (1, 0, 1, 2, 3, 2) \oplus (1, 2, 1, 0, 1, 1) \oplus (2, 3, 2, 1, 0, 1) = (0112, 1023, 2112, 3201, 4310, 0211) = (0, 1, 2, 3, 4, 5).$$

Since we are now at the ground level anyhow, we can stop iterating.
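The iteration in this example can be traced with the following sketch, which uses the distances of Figure 3.5c. It is a simplified reading of the procedure for a single evidence node: a node counts as changed when its clusternode has been split, and the sketch keeps iterating until no further node changes. The function name `sps_lifting` and this stopping rule are assumptions.

```python
from collections import Counter

# Shortest-path distances from Figure 3.5c (row i holds the distances to X_{i+1}).
DIST = [
    [0, 1, 1, 1, 2, 2],
    [1, 0, 1, 2, 3, 2],
    [1, 1, 0, 1, 2, 1],
    [1, 2, 1, 0, 1, 1],
    [2, 3, 2, 1, 0, 1],
    [2, 2, 1, 1, 1, 0],
]

def relabel(signatures):
    """View each distinct signature as a color (ids by first occurrence)."""
    ids = {}
    return [ids.setdefault(sig, len(ids)) for sig in signatures]

def sps_lifting(prior, dist, evidence):
    """Iterated shortest-path refinement; returns all intermediate colorings."""
    colors, sources, used, history = list(prior), set(evidence), set(), []
    while sources:
        ordered = sorted(sources)
        used |= sources
        # Concatenate the current colors with the distances to all source nodes.
        refined = relabel([
            (colors[i],) + tuple(dist[s][i] for s in ordered)
            for i in range(len(colors))
        ])
        old_sizes, new_sizes = Counter(colors), Counter(refined)
        # Nodes whose clusternode was split become the sources of the next round.
        sources = {
            i for i in range(len(colors))
            if new_sizes[refined[i]] < old_sizes[colors[i]]
        } - used
        colors = refined
        history.append(colors)
    return history

history = sps_lifting([0, 1, 2, 2, 1, 0], DIST, evidence={2})  # clamp X3 (index 2)
print(history[0])   # [0, 1, 2, 3, 4, 0]
print(history[-1])  # [0, 1, 2, 3, 4, 5]
```

The first round reproduces the incorrect lifting that still merges X1 and X6. The second round uses X2, X4, and X5 as additional sources and reaches the ground network.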

The description above, together with Example 3.8, essentially sketches the proof of the following theorem. The theorem was originally presented, in a slightly different form and together with its proof, by Ahmadi et al. [2].

Theorem 3.1. If the shortest path colors among all nodes and the prior lifted network are given, computing the lifted network for p(X|Xk, . . . , X1), k > 0, takes O((k + h) · n) time, where n is the number of nodes and h is the number of required iterative applications of the concatenation. Furthermore, running LBP produces the same results as running BP on the original model.

Proof. Assume a graph G = (V, E). We have seen above that the concatenation is linear in the number of nodes. When k nodes have been set, we have to concatenate k distance vectors. However, this concatenation can change clusternodes, which in turn requires concatenating the distance vectors of additional nodes. This iterative application can result in h additional concatenations.

When we set new evidence for a node X ∈ V, the color of node X changes in the CCTs of all nodes within the network. If two nodes Y1, Y2 ∈ V were initially clustered together and belonged to the same clusternode, i.e., X(Y1) = X(Y2), they have to be split if their CCTs differ. Now, we have to consider two cases:

1. If the difference in the CCTs is on the shortest path connecting X with Y1 and Y2, respectively, then shortest-path lifting directly provides the new clustering.

2. If the coloring along the shortest paths is identical, the nodes’ CCTs might differ in a longer path. Since X(Y1) = X(Y2), there exists a mapping between the paths of the respective CCTs. In particular, ∃ Z1, Z2 s.t. X(Z1) = X(Z2) from a different clusternode, i.e., X(Zi) ≠ X(Yi), and

$$Y_1, \ldots, \underbrace{Z_1, \ldots, X}_{\Delta_1} \in \mathrm{CCT}(Y_1), \qquad Y_2, \ldots, \underbrace{Z_2, \ldots, X}_{\Delta_2} \in \mathrm{CCT}(Y_2),$$

and ∆1 ∈ CCT(Z1) ≠ ∆2 ∈ CCT(Z2) are the respective shortest paths for Z1 and Z2. Thus, by iteratively applying shortest-path lifting as explained above, the evidence propagates through and we obtain the new clustering.

In fact, SPS-lifting can be quite fast, and our experimental evaluation will now show how the lifting time can be decreased for tasks with evidence arriving sequentially, e.g., lifted satisfiability or lifted sampling of joint configurations. If, however, lifting does not pay off, computing the pairwise distances may produce an overhead.
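To give an idea of this precomputation, the sketch below computes all pairwise distances with one breadth-first search per node. It makes two assumptions that go beyond the text: all factors are identical, so path colors reduce to plain distances, and the adjacency list `ADJ` is read off the distance-1 entries of Figure 3.5c rather than taken from the figure itself. The graph is also assumed to be connected.

```python
from collections import deque

def all_pairs_distances(adjacency):
    """Unweighted all-pairs shortest paths via one BFS per node.

    adjacency: dict node -> list of neighboring nodes (connected graph assumed).
    Returns a dict node -> list of distances to all nodes in sorted order.
    """
    nodes = sorted(adjacency)
    dist = {}
    for source in nodes:
        d = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in d:  # first visit is the shortest path in BFS
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[source] = [d[t] for t in nodes]
    return dist

# ASSUMPTION: edges inferred from the distance-1 entries of Figure 3.5c.
ADJ = {
    1: [2, 3, 4],
    2: [1, 3],
    3: [1, 2, 4, 6],
    4: [1, 3, 5, 6],
    5: [4, 6],
    6: [3, 4, 5],
}
D = all_pairs_distances(ADJ)
print(D[3])  # [1, 1, 0, 1, 2, 1]
```

The distances are computed once before the first clamping step and then reused in every iteration, so the overhead only pays off if lifting is updated several times.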

input: A factor graph G, list of query vars Y
output: An assignment a to the variables in Y

    G ← compress(G);
    D ← calcDistances(G);
    a ← ∅;
    while Y ≠ ∅ do
        b ← runLiftedInference(G);
        y_t ← pickVarsToClamp(b);
        a ← a ∪ y_t;
        Y ← Y \ y_t;
        clamp(G, y_t);
        adaptLifting(G, D);              // SPS-lifting
        // only for SAT
        simplify(G);                     // based on LWP
        if containsContradiction(G) then
            return None;
        end
    end
    return a

Algorithm 4: Lifted Decimation using SPS-lifting
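The control flow of Algorithm 4 can also be written as a Python skeleton. This is a sketch only: `ops` is a hypothetical object bundling the subroutines named in the algorithm, none of which is implemented here, and passing the remaining query variables to `pick_vars_to_clamp` is an added convenience that the listing does not show.

```python
def lifted_decimation(graph, query_vars, ops, sat_mode=True):
    """Control flow of Algorithm 4.

    `ops` is a hypothetical bundle of caller-supplied subroutines:
    compress, calc_distances, run_lifted_inference, pick_vars_to_clamp,
    clamp, adapt_lifting, simplify, contains_contradiction.
    Returns a dict variable -> value, or None if a contradiction is found.
    """
    graph = ops.compress(graph)
    distances = ops.calc_distances(graph)
    assignment = {}
    remaining = set(query_vars)
    while remaining:
        beliefs = ops.run_lifted_inference(graph)
        # Assumed to return a non-empty dict {variable: value} over `remaining`.
        chosen = ops.pick_vars_to_clamp(beliefs, remaining)
        if not chosen:
            raise ValueError("pick_vars_to_clamp must select at least one variable")
        assignment.update(chosen)
        remaining -= set(chosen)
        ops.clamp(graph, chosen)
        ops.adapt_lifting(graph, distances)  # SPS-lifting instead of re-running CP
        if sat_mode:                         # skipped when sampling
            ops.simplify(graph)              # LWP-based simplification
            if ops.contains_contradiction(graph):
                return None
    return assignment
```

Calling the function with `sat_mode=False` corresponds to the sampling variant, in which neither the simplification nor the contradiction check is run.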

3.2.2 Experiments

Having described the idea of SPS-lifting, we now show how it can be integrated into an improved lifted decimation framework and investigate the following questions:

(Q3.3) Can we further support the results from the previous section and show that lifted decimation solves satisfiability problems more efficiently?

(Q3.4) Does lifting improve sequential inference approaches beyond lifted satisfiability?

(Q3.5) Is SPS-lifting even more beneficial than naive lifting based on CP?

Therefore, we run experiments on two AI tasks in which sequential clamping is essential, namely BP-guided decimation for satisfiability problems and sampling configurations from MLNs. With SPS-lifting, both tasks essentially follow the decimation strategy shown in Algorithm 4. However, in the case of sampling, we neither run LWP nor check for contradictions. Additionally, the variable is not clamped based on its magnetization. Instead, we sample the value of the variable based on its marginal belief. The previous section already contained initial experiments for lifted satisfiability; however, the lifted decimation was implemented in a naive way. This initial investigation of Q3.2 will now be extended to additional problem instances and supported by a second set of experiments based on a different inference task.

Lifted Satisfiability

As indicated above, our lifted satisfiability approach based on decimation fits well into the sequential setting. However, we now use a more elaborate variant than the one depicted in Algorithm 3. Most importantly, it avoids re-lifting the entire model from scratch in each iteration. We first run LBP on the lifted graph and use its marginals to fix the next variable. Based on this clamping, the lifting is updated using SPS-lifting. We then run LWP on the modified lifted factor graph to clamp directly implied variables as before. Again, LWP is also used to detect possible contradictions. When LWP finds a contradiction, the algorithm stops and does not return a satisfying configuration. Otherwise, we continue by running LBP again.

| Type | CNF Name | # Iters. | Ground | Naive | SPS | Walksat |
|------|----------|----------|--------|-------|-----|---------|
| structured | ls8-normalized | 26 | 3.17 | 1.12 | 0.95 | 540 |
| structured | ls9-normalized | 13 | 5.47 | 1.65 | 1.47 | 1,139 |
| structured | ls10-normalized | 14 | 10.27 | 1.84 | 1.59 | 1,994 |
| structured | ls11-normalized | 26 | 38.82 | 11.51 | 10.64 | 4,500 |
| structured | ls12-normalized | 35 | 60.83 | 13.15 | 11.57 | 10,351 |
| structured | ls13-normalized | 21 | 55.39 | 9.99 | 8.21 | 30,061 |
| structured | ls14-normalized | 22 | 83.30 | 10.22 | 8.30 | 104,326 |
| structured | 2bitmax_6 | 55 | 2.35 | 1.25 | 1.05 | 379 |
| structured | 5_100_sd_schur | 53 | 111.19 | 75.98 | 64.91 | 1,573,208 |
| random | wff.3.100.150 | 54 | 0.19 | 0.26 | 0.22 | 17 |
| random | wff.4.100.500 | 78 | 1.73 | 2.04 | 1.89 | 33 |
| random | wff.3.150.525 | 126 | 6.36 | 6.76 | 6.56 | 284 |

Table 3.2: Total messages sent (in millions) in SAT experiments (columns Ground, Naive, SPS) and the average number of flips needed by Walksat (last column).

We continue to compare the performance of lifted message passing approaches with the corresponding ground versions on the previously used CNF benchmark from [75]. We use the decimation as described to measure the effectiveness of the algorithms. To assess performance, we again report the number of messages sent. As before, for the typical message sizes, e.g., for binary random variables with low degree, computing color messages is essentially as expensive as computing the actual messages. Therefore, we report both color and (modified) BP messages, treating individual message updates as atomic unit-time operations. We use the parallel message protocol for (L)BP and (L)WP, where messages are passed from each variable to all corresponding factors and back at each step. The convergence threshold was set to 10^−8 for (L)BP and all messages were initialized uniformly. In the case of (L)WP, all messages were initialized with zero, i.e., no warning is sent initially. As mentioned above, it is usually necessary to apply SPS-lifting iteratively to obtain the correct adapted lifted graph. The number of required iterations, however, can be high if long paths occur in the network. Therefore, we apply SPS-lifting only once and then continue with standard CP to determine the new lifting. This can still save several passes of CP because SPS-lifting provides a good head start.

We evaluated (Lifted) WP+BP decimation on different CNFs, ranging from problems with about 450 up to 78,510 edges. The CNFs contain structured problems as well as random instances. The statistics of the runs are shown in Table 3.2. As one can see, naive lifting already yields a significant improvement, further underlining the experiments from the previous section. When applying SPS-lifting, we can do even better by saving additional messages in the compression phases. The savings in messages are visible in running times as well.

[Figure 3.6 panels: (a) Ground vs. Lifted: WP+BP messages (log scale) per iteration for Ground and Naive. (b) Naive vs. SPS-lifting: CP messages per iteration for Naive and SPS. (c) Comparison to Walksat: scaling ratio (log scale) over the size of the Latin square (ls9 to ls14) for LWP+LBP and Walksat.]

Figure 3.6: Experimental results for lifted decimation on Latin squares. (a) Comparison of ground decimation with naive lifted decimation on ls8-normalized. (b) Comparison of naive lifting with SPS-lifting on ls8-normalized. (c) Comparison of the growth in computational costs on increasing problem sizes measured relative to the smallest problem.

Looking at Figure 3.6a, we compare ground decimation with its lifted counterpart. In the lifted case, we send only 33% of the total ground messages. When using SPS-lifting, we can save up to an additional 10% of the messages sent in the compression phase (Figure 3.6b). In the decimation, we always clamp the most magnetized variable, i.e., the variable with the largest difference between the probabilities of the True and False states. We also applied the lifted message passing algorithms to random CNFs (last three rows in Table 3.2). As expected, no lifting was possible because random instances usually do not contain symmetries. In our experiments, we were able to find satisfying solutions for all problems, which validates the effectiveness of the (lifted) decimation.
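The selection rule for the variable to clamp can be stated compactly in code. The following is a minimal sketch that assumes the marginals are available as a dictionary from variable names to the probability of the True state; the function name `most_magnetized` is illustrative.

```python
def most_magnetized(marginals):
    """Return the variable with the largest |p(True) - p(False)| and its preferred state.

    marginals: dict variable -> probability of the True state.
    Ties are broken by dictionary order (an arbitrary choice of this sketch).
    """
    # |p - (1 - p)| = |2p - 1| for a binary variable.
    var = max(marginals, key=lambda v: abs(2.0 * marginals[v] - 1.0))
    return var, marginals[var] >= 0.5

print(most_magnetized({"X1": 0.6, "X2": 0.1, "X3": 0.5}))  # ('X2', False)
```

For sampling, this rule would be replaced by drawing the state at random according to the marginal.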

Although we do not aim at presenting a state-of-the-art SAT solver, we also solved all problems using Walksat [196] and report its results, measured in variable flips, in Table 3.2. Although Walksat requires fewer flips than we send messages, one can see that our lifted decimation strategy still scales well. Figure 3.6c compares the computational effort of Walksat and our lifted decimation on increasing problem sizes. The results indicate that our approach can handle large problem instances without employing complex heuristics and code optimization, simply by exploiting symmetries in the problems.

In combination with the results from the previous section, Q3.3 and Q3.5 have clearly been answered in favor of our algorithms for the task of lifted SAT solving. The next experiments will show that this also holds in the case of lifted sampling.

Lifted Sampling

We investigated BP, LBP, and SPS-LBP for sequentially sampling a joint configuration over a set of variables, i.e., for a sequence of one-variable samples conditioned on a subset. Thus, to get a sample for X = (X1, . . . , Xn), we first compute p(X1), then p(X2|X1), . . . , p(Xn|X1, . . . , Xn−1), as shown in Algorithm 4 for the SPS case. In contrast to the decimation for SAT solving, the variable to clamp is picked solely based on the index of the variables. Additionally, we now sample a state from the computed BP marginals instead of clamping a variable to the most magnetized state. Figure 3.7 summarizes the results for the “Smokers-and-Friends” dynamic MLN with ten people over ten time steps. This MLN over time was described in Section 2.3.1.

[Figure 3.7 panels: (a) (L)BP-guided sampling for a varying sample size: time (seconds) over sample size for BP, LBP, and SPS. (b) (L)BP-guided sampling for a varying number of samples: time (seconds) over number of samples for BP, LBP, and SPS. Third panel: absolute difference per parameter (1 to 10) for BP-guided sampling and Gibbs sampling.]

Figure 3.7: Experimental results for Lifted BP guided sampling.

In our first experiment, we randomly chose 1, 5, 10, 20, 30, . . ., 100 “cancer” nodes over all time steps and sampled from the joint distribution. As one can see in Figure 3.7a, LBP already

In document Graphical models beyond standard settings: lifted decimation, labeling, and counting (Page 78-86)