CHAPTER 4. PARALLEL EXACT BAYESIAN EDGE LEARNING
4.3 Parallel Algorithm
4.3.1 n-D Hypercube Algorithm
4.3.1.2 Parallel Fast Zeta Transforms
In section4.3.1.1, we have assumed Ai(S) for all i /∈ S are precomputed at node S. For any
i ∈ V , computing Ai(S) for any subset S ⊆ V − {i} requires the summation over all subsets of
S with size no more than d (see Equation4.6). If processors in the hypercube compute their Ai
independently, the processor responsible for computing Ai(V −{i}) for all i ∈ V takes O(d2n−1)
time. This certainly nullifies our effort of improving time complexity by parallel algorithm. In this section, we describe parallel algorithms with which all Ai (and Γv) scores can be computed
on the n-D hypercube cluster in O(n2) time.
First, we give definitions for two variants of the well-known zeta transform (Kennes, 1992).
Let V = {1, ..., n}. Let s : 2V → R be a mapping from the subsets of V onto the real numbers.
Let d ≤ |V | be a positive integer.
Definition 4.1 (Truncated Upward Zeta Transform). (Koivisto and Sood, 2004) A function t : 2V → R is the truncated upward zeta transform of s if
t(T ) = X
S⊆T :|S|≤d
s(S), for all T ⊆ V.
Definition 4.2 (Truncated Downward Zeta Transform). (Koivisto, 2006a) A function t : 2V →
R is the truncated downward zeta transform of s if
t(T ) = X
S:T ⊆S⊆V
s(S), for all T ⊆ V with |T | ≤ d.
It is easy to see that the function Ai for all i ∈ V can be viewed as a case of the truncated
upward zeta transform. Similarly, the function Γv(Gv) can be viewed as a case of the truncated
downward zeta transform.
Two techniques introduced in (Koivisto and Sood, 2004) and (Koivisto, 2006a) are able to
realize both transforms in O(d2n) time, respectively. Here, we present the parallel versions
computer cluster. The serial versions of the algorithms are given in (Koivisto and Sood, 2004) and (Koivisto, 2006a), respectively.
Algorithm 4.2 Parallel Truncated Upward Zeta Transform on n-D hypercube
Assumption: each subset S ⊆ V is encoded by an n-bit string ω, where ω[i] = 1 if variable i ∈ S and ω[i] = 0 otherwise. Subset S is computed on processor with id ω.
1: On each , t0(S) ← s(S) for |S| ≤ d and t0(S) ← 0 otherwise. 2: for j ← 1 to n do
3: for each processor ω s.t. |S ∩ {j + 1, ..., n}| ≤ d do
4: tj(S) ← 0
5: if |S ∩ {j, ..., n}| ≤ d then
6: tj(S) ← tj−1(S)
7: end if
8: if j ∈ S then
9: Retrieve tj−1(S − {j}) from processor ω0 = ω ⊕ 2j−1. 10: tj(S) ← tj(S) + tj−1(S − {j})
11: end if
12: end for
13: end for
14: return tn(S) on processor ω
Algorithm 4.3 Parallel Truncated Downward Zeta Transform on n-D hypercube
Assumption: each subset S ⊆ V is encoded by an n-bit string ω, where ω[i] = 1 if variable i ∈ S and ω[i] = 0 otherwise. Subset S is computed on processor with id ω.
1: On each processor, t0(S) ← s(S). 2: for j ← 1 to n do
3: for each processor ω s.t. |S ∩ {1, ..., j}| ≤ d do
4: tj(S) ← tj−1(S) 5: if j /∈ S then
6: Retrieve tj−1(S ∪ {j}) from processor ω0= ω ⊕ 2j−1. 7: tj(S) ← tj(S) + tj−1(S ∪ {j})
8: end if
9: end for
10: end for
11: return tn(S) on processor ω
By our definition in section 4.3.1.1, a subset S is encoded by an n-bit string ω, where
ω[i] = 1 if variable i ∈ S and ω[i] = 0 otherwise. In an n-D hypercube, ω is also used to denote the id of a processor. We can take this natural mapping so that each processor ω is responsible for the corresponding subset S. This forms the basic idea of the two parallel algorithms.
Algorithm 4.2 runs for n + 1 iterations. In each iteration, all 2n processors operate on
their S ⊆ V concurrently (lines 3 to 12). In iteration j, before the computation starts, each processor ω with ω[j] = 0 sends its tj−1(S) to its neighbor ω0 obtained by inverting its ω[j] to
(a) j = 0 (b) j = 1
(c) j = 2 (d) j = 3
Figure 4.3: An illustrative example of parallel truncated upward zeta transform on n-D hypercube. In this case, n = 3, d = 2. The algorithm takes four iterations. Iterations 0, 1, 2 and 3 are show in (a), (b), (c) and (d), respectively. The functions in dashed boxes are the messages sending between the processors. In each iteration, the computed tj(S) is shown underneath each processor.
(a) j = 0 (b) j = 1
(c) j = 2 (d) j = 3
Figure 4.4: An illustrative example of parallel truncated downward zeta transform on n-D hypercube. In this case, n = 3, d = 2. The algorithm takes four iterations. Iterations 0, 1, 2 and 3 are show in (a), (b), (c) and (d), respectively. The functions in dashed boxes are the messages sending between the processors. In each iteration, the computed tj(S) is shown above each processor.
1, i.e., ω0 = ω ⊕ 2j−1 (Line 3).5 The neighbor receiving this tj−1 will perform the addition on
line 10 in iteration j if necessary. Figure4.3 illustrates an example of Algorithm4.2solving a problem with n = 3 and d = 2.
In Algorithm 4.3, the mapping of the computation of S to n-D hypercube is the same as in
Algorithm4.2. In iteration j, after the computation starts, each processor ω with ω[j] = 1 sends its tj−1 to its neighbor ω0 obtained by inverting its ω[j] to 0, i.e., ω0 = ω ⊕ 2j−1. The neighbor
receiving this tj−1 will perform the addition on line 7 in iteration j if necessary. Figure 4.4
illustrates an example of Algorithm 4.3solving a problem with n = 3 and d = 2.
In each iteration, all S ⊆ V are computed concurrently on a n-D hypercube, the com-
putation times for both parallel algorithms being O(n). Further, in both algorithms, the
communications happen only between neighboring processors (two binary strings ω, ω0 differ in
only one bit). Thus, the two algorithms are communication-efficient.
We can use Algorithm4.2to compute Ai(S) for a given i ∈ V and all S ⊆ V − {i} by setting
s(·) = Bi(·) and then computing Ai(S) = qi(S) · t(S).6 Note that Bi(S) for any S ⊆ V − {i}
and |S| ≤ d has also been computed on processor corresponding to S since s(S) = Bi(S). To
compute Ai(S) for all i ∈ V , we run Algorithm 4.2 n times with each time switching to the
corresponding qiand Bifunctions. Thus, Aifor all i ∈ V can be computed in |V |·O(n) = O(n2)
time. Each processor ω computes and keeps the corresponding Ai(S) for all i ∈ V , which is
the assumption we made in section4.3.1.1. Thus, the mapping adopted by the two algorithms
is well suited for our algorithm as it avoids a large number of messages to be passed when the computation transits to the next step.
We will use Algorithm 4.3to compute Γv for a given v ∈ V . However, before applying the
algorithm, we shall first compute qv(S)F (S)R(V − {v} − S) on the processor corresponding to
S (see Equation 4.11). However, F (S) and R(V − {v} − S) are not on the same processor at
the time when we have computed functions F and R. Fortunately, they are on the processors who are neighbors in the hypercube. Thus, the processor ω who has F (S) shall retrieve R(V −
5⊕ stands for the bitwise exclusive or (XOR) between two binary strings. 2j−1 stands for the binary string
of integer 2j−1.
6A
i(S) are defined for all S ⊆ V − {i}, instead of S ⊆ V . However, the algorithm can still be deployed by
Figure 4.5: Retrieve R for computing Γv. The example shows the case in which Γv for v = 1 is
computed.
{v}−S) from its neighbor ω0obtained by inverting its ω[v] to 1, i.e., ω0 = ω ⊕2v−1(see example
in Figure4.5), and compute qv(S)F (S)R(V − {v} − S) before Algorithm 4.3is run. Then with
Algorithm 4.3, computing Γv for any fixed v ∈ V takes O(n). The time for computing Γv(Gv)
for all v ∈ V and Gv ⊆ V − {v} and |Gv| ≤ d is therefore |V | · O(n) = O(n2).