Parallel Fast Zeta Transforms - n-D Hypercube Algorithm

CHAPTER 4. PARALLEL EXACT BAYESIAN EDGE LEARNING

4.3 Parallel Algorithm

4.3.1 n-D Hypercube Algorithm

4.3.1.2 Parallel Fast Zeta Transforms

In Section 4.3.1.1, we assumed that $A_i(S)$ for all $i \notin S$ is precomputed at node $S$. For any $i \in V$, computing $A_i(S)$ for a subset $S \subseteq V - \{i\}$ requires a summation over all subsets of $S$ of size at most $d$ (see Equation 4.6). If the processors in the hypercube computed their $A_i$ independently, the processor responsible for computing $A_i(V - \{i\})$ for all $i \in V$ would take $O(d\,2^{n-1})$ time, which would nullify the effort to improve the time complexity through parallelization. In this section, we describe parallel algorithms with which all $A_i$ (and $\Gamma_v$) scores can be computed on the n-D hypercube cluster in $O(n^2)$ time.

First, we give definitions for two variants of the well-known zeta transform (Kennes, 1992). Let $V = \{1, \ldots, n\}$. Let $s : 2^V \to \mathbb{R}$ be a mapping from the subsets of $V$ to the real numbers. Let $d \le |V|$ be a positive integer.

Definition 4.1 (Truncated Upward Zeta Transform). (Koivisto and Sood, 2004) A function $t : 2^V \to \mathbb{R}$ is the truncated upward zeta transform of $s$ if

$$t(T) = \sum_{S \subseteq T \,:\, |S| \le d} s(S), \quad \text{for all } T \subseteq V.$$

Definition 4.2 (Truncated Downward Zeta Transform). (Koivisto, 2006a) A function $t : 2^V \to \mathbb{R}$ is the truncated downward zeta transform of $s$ if

$$t(T) = \sum_{S \,:\, T \subseteq S \subseteq V} s(S), \quad \text{for all } T \subseteq V \text{ with } |T| \le d.$$
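The two definitions can be checked on small inputs by direct enumeration. The following is a minimal brute-force sketch, not the fast algorithm; the helper names (`subsets`, `truncated_upward_zeta`, `truncated_downward_zeta`) and the toy input with unit weights are assumptions made for illustration only.

```python
from itertools import combinations


def subsets(items, max_size=None):
    """Yield every subset of `items` (as a frozenset) with at most `max_size` elements."""
    items = sorted(items)
    limit = len(items) if max_size is None else min(max_size, len(items))
    for k in range(limit + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def truncated_upward_zeta(s, V, d):
    """Definition 4.1: t(T) = sum of s(S) over S subset of T with |S| <= d, for all T."""
    return {T: sum(s[S] for S in subsets(T, d)) for T in subsets(V)}


def truncated_downward_zeta(s, V, d):
    """Definition 4.2: t(T) = sum of s(S) over T subset of S subset of V, for |T| <= d."""
    V = frozenset(V)
    return {T: sum(s[T | X] for X in subsets(V - T)) for T in subsets(V, d)}


# Toy input (hypothetical): every subset of V = {1, 2, 3} has weight 1.
V = {1, 2, 3}
s = {S: 1 for S in subsets(V)}
up = truncated_upward_zeta(s, V, d=2)
down = truncated_downward_zeta(s, V, d=2)
print(up[frozenset({1, 2, 3})])  # counts subsets of V of size <= 2
print(down[frozenset()])         # counts all supersets of the empty set
```

With unit weights, the upward transform simply counts the subsets of $T$ of size at most $d$, and the downward transform counts the supersets of $T$, which makes the output easy to verify by hand.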

It is easy to see that the function $A_i$ for all $i \in V$ can be viewed as a case of the truncated upward zeta transform. Similarly, the function $\Gamma_v(G_v)$ can be viewed as a case of the truncated downward zeta transform.

Two techniques, introduced in (Koivisto and Sood, 2004) and (Koivisto, 2006a), realize the upward and downward transforms, respectively, in $O(d\,2^n)$ time. Here, we present parallel versions of both on the n-D hypercube computer cluster. The serial versions of the algorithms are given in (Koivisto and Sood, 2004) and (Koivisto, 2006a), respectively.

Algorithm 4.2 Parallel Truncated Upward Zeta Transform on n-D hypercube

Assumption: each subset $S \subseteq V$ is encoded by an n-bit string $\omega$, where $\omega[i] = 1$ if variable $i \in S$ and $\omega[i] = 0$ otherwise. Subset $S$ is computed on the processor with id $\omega$.

 1: On each processor, $t_0(S) \leftarrow s(S)$ for $|S| \le d$ and $t_0(S) \leftarrow 0$ otherwise.
 2: for $j \leftarrow 1$ to $n$ do
 3:     for each processor $\omega$ s.t. $|S \cap \{j+1, \ldots, n\}| \le d$ do
 4:         $t_j(S) \leftarrow 0$
 5:         if $|S \cap \{j, \ldots, n\}| \le d$ then
 6:             $t_j(S) \leftarrow t_{j-1}(S)$
 7:         end if
 8:         if $j \in S$ then
 9:             Retrieve $t_{j-1}(S - \{j\})$ from processor $\omega' = \omega \oplus 2^{j-1}$.
10:             $t_j(S) \leftarrow t_j(S) + t_{j-1}(S - \{j\})$
11:         end if
12:     end for
13: end for
14: return $t_n(S)$ on processor $\omega$
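The listing above can be exercised on a single machine by simulating all $2^n$ processors in a loop. The sketch below is a sequential simulation under stated assumptions, not a distributed implementation: processor ids are Python integers in which bit $j-1$ is set iff variable $j \in S$, "retrieving" a value from a neighbor is an array lookup, and entries that the algorithm never computes are left at 0. The function names are hypothetical.

```python
def popcount(x):
    """Number of set bits in x, i.e. the size of the encoded subset."""
    return bin(x).count("1")


def parallel_upward_zeta(s, n, d):
    """Sequentially simulate Algorithm 4.2; s[omega] holds s(S) for processor omega."""
    size = 1 << n
    # Line 1: t_0(S) = s(S) if |S| <= d, and 0 otherwise.
    t_prev = [s[w] if popcount(w) <= d else 0 for w in range(size)]
    for j in range(1, n + 1):                  # line 2
        bit = 1 << (j - 1)                     # the integer 2^(j-1)
        above_j = ((size - 1) >> j) << j       # bit mask of {j+1, ..., n}
        from_j = above_j | bit                 # bit mask of {j, ..., n}
        t_cur = [0] * size                     # skipped processors keep 0
        for w in range(size):                  # conceptually concurrent
            if popcount(w & above_j) > d:      # line 3 guard
                continue
            value = 0                          # line 4
            if popcount(w & from_j) <= d:      # line 5
                value = t_prev[w]              # line 6
            if w & bit:                        # line 8: j in S
                value += t_prev[w ^ bit]       # lines 9-10: neighbor holds S - {j}
            t_cur[w] = value
        t_prev = t_cur
    return t_prev                              # line 14: t_n(S) per processor


def brute_force_upward(s, n, d):
    """Definition 4.1 by direct enumeration, used only as a cross-check."""
    size = 1 << n
    return [sum(s[x] for x in range(size) if x & w == x and popcount(x) <= d)
            for w in range(size)]


result = parallel_upward_zeta([1] * 8, n=3, d=2)
print(result[0b111])  # 7: the subsets of {1, 2, 3} with at most 2 elements
```

Comparing `parallel_upward_zeta` with `brute_force_upward` on a few random inputs is a quick way to convince oneself that the guards on lines 3 and 5 never read a value that was skipped in the previous iteration.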

Algorithm 4.3 Parallel Truncated Downward Zeta Transform on n-D hypercube

Assumption: each subset $S \subseteq V$ is encoded by an n-bit string $\omega$, where $\omega[i] = 1$ if variable $i \in S$ and $\omega[i] = 0$ otherwise. Subset $S$ is computed on the processor with id $\omega$.

 1: On each processor, $t_0(S) \leftarrow s(S)$.
 2: for $j \leftarrow 1$ to $n$ do
 3:     for each processor $\omega$ s.t. $|S \cap \{1, \ldots, j\}| \le d$ do
 4:         $t_j(S) \leftarrow t_{j-1}(S)$
 5:         if $j \notin S$ then
 6:             Retrieve $t_{j-1}(S \cup \{j\})$ from processor $\omega' = \omega \oplus 2^{j-1}$.
 7:             $t_j(S) \leftarrow t_j(S) + t_{j-1}(S \cup \{j\})$
 8:         end if
 9:     end for
10: end for
11: return $t_n(S)$ on processor $\omega$
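As with the upward transform, the downward listing can be simulated sequentially. This is a minimal sketch under the same assumptions as before (integer processor ids with bit $j-1$ set iff $j \in S$, neighbor retrieval modeled as an array lookup, hypothetical function names); it returns values only for subsets with $|T| \le d$, which is the domain on which Definition 4.2 specifies the result.

```python
def popcount(x):
    """Number of set bits in x, i.e. the size of the encoded subset."""
    return bin(x).count("1")


def parallel_downward_zeta(s, n, d):
    """Sequentially simulate Algorithm 4.3; s[omega] holds s(S) for processor omega."""
    size = 1 << n
    t_prev = list(s)                           # line 1: t_0(S) = s(S)
    for j in range(1, n + 1):                  # line 2
        bit = 1 << (j - 1)                     # the integer 2^(j-1)
        up_to_j = (1 << j) - 1                 # bit mask of {1, ..., j}
        t_cur = [0] * size                     # skipped processors keep 0
        for w in range(size):                  # conceptually concurrent
            if popcount(w & up_to_j) > d:      # line 3 guard
                continue
            value = t_prev[w]                  # line 4
            if not w & bit:                    # line 5: j not in S
                value += t_prev[w ^ bit]       # lines 6-7: neighbor holds S + {j}
            t_cur[w] = value
        t_prev = t_cur
    # Line 11: keep only the entries covered by Definition 4.2.
    return {w: t_prev[w] for w in range(size) if popcount(w) <= d}


def brute_force_downward(s, n, d):
    """Definition 4.2 by direct enumeration, used only as a cross-check."""
    size = 1 << n
    return {w: sum(s[x] for x in range(size) if x & w == w)
            for w in range(size) if popcount(w) <= d}


result = parallel_downward_zeta([1] * 8, n=3, d=2)
print(result[0b000])  # 8: every subset of {1, 2, 3} contains the empty set
```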

By our definition in Section 4.3.1.1, a subset $S$ is encoded by an n-bit string $\omega$, where $\omega[i] = 1$ if variable $i \in S$ and $\omega[i] = 0$ otherwise. In an n-D hypercube, $\omega$ also denotes the id of a processor. We adopt this natural mapping, so that each processor $\omega$ is responsible for the corresponding subset $S$. This is the basic idea behind the two parallel algorithms.
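The subset-to-processor mapping can be illustrated in a few lines. The sketch below assumes that the n-bit string $\omega$ is stored as an integer whose bit $i-1$ corresponds to variable $i$, which matches the neighbor rule $\omega' = \omega \oplus 2^{j-1}$ used in both algorithms; the helper names are hypothetical.

```python
def subset_to_id(S):
    """Processor id for subset S: bit i-1 is set iff variable i is in S."""
    omega = 0
    for i in S:
        omega |= 1 << (i - 1)
    return omega


def id_to_subset(omega):
    """Inverse mapping: the subset handled by processor omega."""
    return {i + 1 for i in range(omega.bit_length()) if (omega >> i) & 1}


def neighbor(omega, j):
    """Id of the hypercube neighbor that differs from omega only in variable j."""
    return omega ^ (1 << (j - 1))


omega = subset_to_id({1, 3})
print(omega)                             # 5, i.e. binary 101
print(id_to_subset(neighbor(omega, 2)))  # {1, 2, 3}: variable 2 added
print(id_to_subset(neighbor(omega, 1)))  # {3}: variable 1 removed
```

Because flipping one bit adds or removes exactly one variable, the subsets $S - \{j\}$ and $S \cup \{j\}$ needed in iteration $j$ always live on a directly connected processor.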

Algorithm 4.2 runs for $n + 1$ iterations, counting the initialization on line 1 as iteration 0. In each iteration, all $2^n$ processors operate on their $S \subseteq V$ concurrently (lines 3 to 12). In iteration $j$, before the computation starts, each processor $\omega$ with $\omega[j] = 0$ sends its $t_{j-1}(S)$ to its neighbor $\omega'$ obtained by inverting its $\omega[j]$ to 1, i.e., $\omega' = \omega \oplus 2^{j-1}$ (line 3).⁵ The neighbor receiving this $t_{j-1}$ will perform the addition on line 10 in iteration $j$ if necessary. Figure 4.3 illustrates an example of Algorithm 4.2 solving a problem with $n = 3$ and $d = 2$.

Figure 4.3: An illustrative example of the parallel truncated upward zeta transform on an n-D hypercube. In this case, $n = 3$, $d = 2$. The algorithm takes four iterations. Iterations 0, 1, 2 and 3 are shown in (a), (b), (c) and (d), respectively. The functions in dashed boxes are the messages sent between the processors. In each iteration, the computed $t_j(S)$ is shown underneath each processor.

Figure 4.4: An illustrative example of the parallel truncated downward zeta transform on an n-D hypercube. In this case, $n = 3$, $d = 2$. The algorithm takes four iterations. Iterations 0, 1, 2 and 3 are shown in (a), (b), (c) and (d), respectively. The functions in dashed boxes are the messages sent between the processors. In each iteration, the computed $t_j(S)$ is shown above each processor.

In Algorithm 4.3, the mapping of the computation of $S$ to the n-D hypercube is the same as in Algorithm 4.2. In iteration $j$, after the computation starts, each processor $\omega$ with $\omega[j] = 1$ sends its $t_{j-1}$ to its neighbor $\omega'$ obtained by inverting its $\omega[j]$ to 0, i.e., $\omega' = \omega \oplus 2^{j-1}$. The neighbor receiving this $t_{j-1}$ will perform the addition on line 7 in iteration $j$ if necessary. Figure 4.4 illustrates an example of Algorithm 4.3 solving a problem with $n = 3$ and $d = 2$.

In each iteration, all $S \subseteq V$ are processed concurrently on the n-D hypercube, and each processor performs only a constant amount of work, so the computation time of both parallel algorithms is $O(n)$. Further, in both algorithms, communication happens only between neighboring processors (processors whose ids $\omega$ and $\omega'$ differ in exactly one bit). Thus, the two algorithms are communication-efficient.

We can use Algorithm 4.2 to compute $A_i(S)$ for a given $i \in V$ and all $S \subseteq V - \{i\}$ by setting $s(\cdot) = B_i(\cdot)$ and then computing $A_i(S) = q_i(S) \cdot t(S)$.⁶ Note that $B_i(S)$ for any $S \subseteq V - \{i\}$ with $|S| \le d$ has also been computed on the processor corresponding to $S$, since $s(S) = B_i(S)$. To compute $A_i(S)$ for all $i \in V$, we run Algorithm 4.2 $n$ times, each time switching to the corresponding $q_i$ and $B_i$ functions. Thus, $A_i$ for all $i \in V$ can be computed in $|V| \cdot O(n) = O(n^2)$ time. Each processor $\omega$ computes and keeps the corresponding $A_i(S)$ for all $i \in V$, which is the assumption we made in Section 4.3.1.1. The mapping adopted by the two algorithms is therefore well suited to our algorithm, as it avoids passing a large number of messages when the computation moves on to the next step.

We will use Algorithm 4.3 to compute $\Gamma_v$ for a given $v \in V$. Before applying the algorithm, we must first compute $q_v(S) F(S) R(V - \{v\} - S)$ on the processor corresponding to $S$ (see Equation 4.11). However, $F(S)$ and $R(V - \{v\} - S)$ are not on the same processor once the functions $F$ and $R$ have been computed. Fortunately, they reside on processors that are neighbors in the hypercube. Thus, the processor $\omega$ that holds $F(S)$ retrieves $R(V - \{v\} - S)$ from its neighbor $\omega'$ obtained by inverting its $\omega[v]$ to 1, i.e., $\omega' = \omega \oplus 2^{v-1}$ (see the example in Figure 4.5), and computes $q_v(S) F(S) R(V - \{v\} - S)$ before Algorithm 4.3 is run. Then, with Algorithm 4.3, computing $\Gamma_v$ for any fixed $v \in V$ takes $O(n)$ time. The time for computing $\Gamma_v(G_v)$ for all $v \in V$ and all $G_v \subseteq V - \{v\}$ with $|G_v| \le d$ is therefore $|V| \cdot O(n) = O(n^2)$.

Figure 4.5: Retrieving $R$ for computing $\Gamma_v$. The example shows the case in which $\Gamma_v$ for $v = 1$ is computed.

⁵ $\oplus$ stands for the bitwise exclusive or (XOR) between two binary strings. $2^{j-1}$ stands for the binary string of the integer $2^{j-1}$.

⁶ $A_i(S)$ are defined for all $S \subseteq V - \{i\}$, instead of $S \subseteq V$. However, the algorithm can still be deployed by

In document Structure Discovery in Bayesian Networks: Algorithms and Applications (Page 82-87)