Computing Bit-Length Messages - Integer Probabilistic Inference

4.3 Integer Probabilistic Inference

4.3.4 Computing Bit-Length Messages

Bit-length messages have computationally convenient properties. In particular, an efficient sparse data structure can exploit the fact that all summands in (4.3) are powers of two. The main bit-length propagation (BLprop) algorithm is mostly identical to LBP (Algorithm 2.1), except that the ordinary messages are replaced by $\beta_{u\to v}(x)$, and $\epsilon$ has to be provided as a rational number to check for convergence. Our algorithm for the actual message computation is shown in Algorithm 4.1. The most important property of this algorithm is the use of a sorted list structure to represent the message. While this might sound odd at first, each message $\beta_{u\to v}(x)$ is the bit-length of a sum of $|\mathcal{X}_u|$ powers of two. Hence, the binary representation of the sum that corresponds to $\beta_{u\to v}(x)$ has at most $|\mathcal{X}_u|$ bits set to 1. Storing the bit positions in a sorted list allows us to access the highest bit, which represents the bit-length of the sum and hence $\beta_{u\to v}(x)$, in time $O(1)$. For the list, we need $O(|\mathcal{X}_u|)$ temporary storage for the computation of arbitrarily large messages.

Note that a plain 64-bit integer will overflow quickly if the number of neighbors or the magnitude of incoming bit-lengths is large. The sparse integer representation of the sum via a list cannot overflow, as long as the underlying type is large enough to hold the position of a bit. As an example, if the underlying integer type has 16 bits, the list allows us to represent any number whose binary representation requires less than $2^{16}$ bits. Without the list structure, only messages of up to 16 bits would be representable. The list itself can be discarded after computation, and only its largest element is propagated.
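The sparse representation described above can be illustrated with a minimal Python sketch. It is not a transcription of Algorithm 4.1: the class name `SparsePowerSum` and its methods are hypothetical, and a plain Python list is assumed as the sorted container. The lookup uses binary search, but inserting into and deleting from a Python list take linear time, so a linked or tree-based structure would be needed to match the stated complexities.

```python
from bisect import bisect_left


class SparsePowerSum:
    """Non-negative integer stored as an ascending list of its set bit positions.

    Hypothetical helper for illustration; not the thesis implementation.
    """

    def __init__(self):
        self.bits = []  # sorted ascending, no duplicates

    def add_power(self, i):
        """Add 2**i to the stored number, propagating carries."""
        pos = bisect_left(self.bits, i)
        # While bit i is already set, clear it and carry into bit i + 1.
        while pos < len(self.bits) and self.bits[pos] == i:
            self.bits.pop(pos)
            i += 1
        self.bits.insert(pos, i)

    def bit_length(self):
        """Position of the highest set bit plus one (0 for the empty sum)."""
        return self.bits[-1] + 1 if self.bits else 0


s = SparsePowerSum()
for e in (3, 3, 4, 100):   # 2**3 + 2**3 + 2**4 + 2**100
    s.add_power(e)
print(s.bits, s.bit_length())  # prints "[5, 100] 101"
```

Only two list entries are stored although the represented number needs 101 bits, which is far more than a 64-bit register could hold.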

Theorem 4.4 (Correctness of Message Computation) Algorithm 4.1 computes the bit-length message $\beta_{u\to v}(x_v)$ for a base-2 exponential family.

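Proof. By definition,

$$
\beta_{u\to v}(x_v) = \operatorname{bl} \sum_{x_u \in \mathcal{X}_u} 2^{\langle \theta_{vu}, \phi(x_{vu}) \rangle + \sum_{w \in \mathcal{N}(u) \setminus \{v\}} \beta_{w\to u}(x_u)} .
$$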

Thus, $\beta_{u\to v}(x_v)$ is the bit-length of a sum of powers of 2. In line 1 we allocate a list structure (or a tree, or a hash map) $B$ to store the sparse integer representation of the message. Here, sparsity is meant w.r.t. the binary encoding of the message itself. The actual summation takes place in lines 2–9. In line 3, we load the exponent $i$ of the current summand, and lines 4–8 represent raw binary addition of $2^i$ to the current sum, which is stored in $B$. Checking whether $i$ is already contained in the list takes at most $O(\log |\mathcal{X}_u|)$ steps via binary search in the sorted list. Since the summation is over $|\mathcal{X}_u|$ terms, the final outcome can have at most $|\mathcal{X}_u|$ non-zero bits. Even if the result of this summation is a 1 million bit number, only $O(|\mathcal{X}_u|)$ storage is required during message computation. In the last line, we return 1 plus the position of the most significant bit stored in $B$, which is the bit-length of a binary number. Accessing the position of the most significant bit is performed in constant time by using a pointer to the end of the list. ■

Note that at the end of each message computation, +1 is added to the most significant bit position. This leads to an increased message magnitude. To see this, consider a simple chain model with $n$ binary state vertices and a zero parameter vector $\theta = 0$. Any outgoing message $\beta_{1\to 2}(x)$ from vertex 1 to vertex 2 has the value 1. The message $\beta_{2\to 3}(x)$ has the value 2, and so on, until the last message $\beta_{(n-1)\to n}(x)$ is $n - 1$. This behavior is not problematic due to our sparse message representation, but it shows that the magnitude of messages depends on the size of the graph. Without our sparse representation, bit-length propagation would be restricted to small graphs and parameters with small magnitude.
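The summation walked through in the proof of Theorem 4.4 can be sketched in Python as follows. This is an illustration under stated assumptions, not Algorithm 4.1 itself: it uses the hash-based variant mentioned in the proof (a `set` plus a running maximum) instead of a sorted list, and it assumes that the inner product $\langle \theta_{vu}, \phi(x_{vu}) \rangle$ is available as a table `theta_vu[x_v][x_u]` of non-negative integers. The function and argument names are hypothetical.

```python
def bit_length_message(theta_vu, incoming, x_v):
    """Sketch of the bit-length message beta_{u->v}(x_v).

    theta_vu[x_v][x_u]: assumed non-negative integer pairwise parameter.
    incoming: messages beta_{w->u} for all neighbors w of u except v,
              each given as a list indexed by x_u.
    """
    bits = set()  # set bit positions of the running sum
    msb = -1      # highest set bit position seen so far
    for x_u in range(len(theta_vu[x_v])):
        # Exponent of the current summand.
        i = theta_vu[x_v][x_u] + sum(beta[x_u] for beta in incoming)
        # Binary addition of 2**i: clear set bits and carry upwards.
        while i in bits:
            bits.remove(i)
            i += 1
        bits.add(i)
        # A carry always ends at or above the bits it cleared,
        # so the running maximum stays valid.
        msb = max(msb, i)
    return msb + 1  # bit-length = most significant bit position + 1


theta = [[0, 2, 5], [1, 1, 1]]       # hypothetical parameters, |X_v| = 2, |X_u| = 3
incoming = [[3, 0, 40], [2, 2, 30]]  # hypothetical messages from two neighbors
print([bit_length_message(theta, incoming, x_v) for x_v in range(2)])
```

The working set never holds more than one entry per summand, regardless of how large the exponents become.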

The above procedure allows for nearly arbitrarily large messages, but it does not guarantee a seamless computation of marginal probabilities. Recall that (vertex) marginals are computed via

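$$
p_v(x) = \frac{2^{\sum_{u \in \mathcal{N}(v)} \beta_{u\to v}(x)}}{\sum_{y \in \mathcal{X}_v} 2^{\sum_{u \in \mathcal{N}(v)} \beta_{u\to v}(y)}} .
$$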

While we can easily compute the (possibly very large) denominator via the sparse integer representation that we used in Algorithm 4.1, the result cannot be processed natively on the CPU due to the fixed word-size ω. In order to allow for native subsequent processing of the marginals (e.g., in another application without sparse integer support), we can shift numerator and denominator to the right until both fit into native CPU registers with, say, ω = 16 bit word-size.

Definition 4.4 (Right Bit-Shifts) Let $a, b \in \mathbb{N}$. The right bit-shift operation $a \gg b$ is defined as

$$
a \gg b = \left\lfloor \frac{a}{2^b} \right\rfloor .
$$

Whenever a $b$-bit sparse integer $a$ has to be transferred to native program code and does not fit into a native $\omega$-bit register, we store $a \gg (b - \omega)$ instead. While this procedure introduces a large error into any single number, shifting two numbers $a$ and $c$ by the same number of bits $b$ leaves their ratio almost intact, i.e., $a/c \approx (a \gg b)/(c \gg b)$. The following lemma formalizes this observation in the context of probability mass functions.
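As a rough illustration of this transfer step, the following Python sketch shifts a sparse integer, given as an ascending list of set bit positions, into a native-sized value. The helper names `shift_sparse` and `to_native` are hypothetical, and the list-of-positions layout is an assumption carried over from the sparse representation used for message computation.

```python
def shift_sparse(bits, s):
    """Compute a >> s for a sparse integer a given by its set bit positions.

    Bits below position s are dropped, which realizes the floor in
    Definition 4.4.
    """
    return sum(1 << (i - s) for i in bits if i >= s)


def to_native(bits, omega):
    """Return a itself if it fits into omega bits, else a >> (b - omega)."""
    b = bits[-1] + 1 if bits else 0  # bit-length of the sparse integer
    return shift_sparse(bits, max(b - omega, 0))


a_bits = [3, 40, 70]  # a = 2**3 + 2**40 + 2**70 (hypothetical numerator)
c_bits = [1, 69, 72]  # c = 2**1 + 2**69 + 2**72 (hypothetical denominator)
s = (c_bits[-1] + 1) - 16  # shift both numbers by the same amount
print(shift_sparse(a_bits, s) / shift_sparse(c_bits, s))
```

Shifting numerator and denominator by the same `s` is what keeps the ratio close to the exact value; shifting each number by its own bit-length would not.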

Lemma 4.4 (Probability Shifting) Let $a$ be some vector of integers, such that $p(x) = 2^{a_x} / \sum_{y \in \mathcal{X}} 2^{a_y}$ for some random variable $X$ with state space $\mathcal{X}$. Let further $\omega$ be the word-size of the underlying CPU, and $b = \operatorname{bl} \sum_{y \in \mathcal{X}} 2^{a_y}$. If $b > \omega$ and

$$
\hat{p}(x) = \frac{2^{a_x} \gg (b - \omega)}{\left( \sum_{y \in \mathcal{X}} 2^{a_y} \right) \gg (b - \omega)}
$$

is a shifted version of $p$, then

$$
|p(x) - \hat{p}(x)| < 2^{-\omega+1} ,
$$

i.e., the error introduced by the shifting is bounded and exponentially small in the word-size.

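Proof. First, assume $(2^{a_x} \gg (b - \omega)) = 0$. In this case,

$$
|p(x) - \hat{p}(x)| = p(x) = \frac{2^{a_x}}{\sum_{y \in \mathcal{X}} 2^{a_y}} < \frac{2^{b-\omega}}{\sum_{y \in \mathcal{X}} 2^{a_y}} \le 2^{-\omega+1} ,
$$

because $2^{a_x} < 2^{b-\omega}$ and $\sum_{y \in \mathcal{X}} 2^{a_y} \ge 2^{b-1}$. Now, let $(2^{a_x} \gg (b - \omega)) > 0$. In this case, according to Definition 4.4, we have $(2^{a_x} \gg (b - \omega)) = 2^{a_x - b + \omega}$. Thus,

$$
|p(x) - \hat{p}(x)| = \frac{2^{a_x - b + \omega}}{\left( \sum_{y \in \mathcal{X}} 2^{a_y} \right) \gg (b - \omega)} - \frac{2^{a_x}}{\sum_{y \in \mathcal{X}} 2^{a_y}} , \tag{4.7}
$$

where the $|\cdot|$ disappears because

$$
\underbrace{\left( \sum_{y \in \mathcal{X}} 2^{a_y} \right) \gg (b - \omega)}_{\lfloor c \rfloor} = \left\lfloor 2^{-b+\omega} \sum_{y \in \mathcal{X}} 2^{a_y} \right\rfloor \le \underbrace{2^{-b+\omega} \left( \sum_{y \in \mathcal{X}} 2^{a_y} \right)}_{c} .
$$

Multiplying $p(x)$ in (4.7) with $1 = 2^{-b+\omega} / 2^{-b+\omega}$, and observing that $\lfloor c \rfloor > c - 1$, yields

$$
|p(x) - \hat{p}(x)| = \frac{2^{a_x - b + \omega}}{\lfloor c \rfloor} - \frac{2^{a_x - b + \omega}}{c} \le \frac{2^{a_x - b + \omega}}{(c - 1)\, c} \le \frac{2^{a_x - b + \omega}}{(2^{b-1} - 1)\, 2^{b-1}} \le 2^{a_x - 3b + 3 + \omega} .
$$

Finally, the lemma follows from $\forall x \in \mathcal{X} : a_x < b$ and $b \ge \omega + 1$ (due to integrality of $b$ and $\omega$). ■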
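The bound of Lemma 4.4 can be checked numerically with a small Python sketch. Here Python's arbitrary-precision integers and `fractions.Fraction` stand in for the sparse integer representation, and the exponent vector `a` is a hypothetical example; the function name `shifted_marginals` is likewise an assumption made for illustration.

```python
from fractions import Fraction


def shifted_marginals(a, omega):
    """Return exact p and shifted p_hat for p(x) = 2**a[x] / sum_y 2**a[y]."""
    denom = sum(1 << a_y for a_y in a)  # exact, arbitrary precision
    b = denom.bit_length()
    s = max(b - omega, 0)               # Lemma 4.4 assumes b > omega
    p = [Fraction(1 << a_x, denom) for a_x in a]
    # Numerator and denominator are shifted by the same amount s.
    p_hat = [Fraction((1 << a_x) >> s, denom >> s) for a_x in a]
    return p, p_hat


a = [3, 120, 118, 95]  # hypothetical exponents
omega = 16             # assumed native word-size
p, p_hat = shifted_marginals(a, omega)
worst = max(abs(px - qx) for px, qx in zip(p, p_hat))
print(float(worst), worst < Fraction(1, 2 ** (omega - 1)))
```

After shifting, the denominator occupies exactly `omega` bits, so both parts of each fraction fit into native registers.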

It is thus rather safe to pass shifted marginals to native code. An orthogonal approach is the restriction of the parameter space to the set $\{0, 1, \dots, k-1\}^d \subset \mathbb{N}^d$. This prevents the parameters and the messages from becoming arbitrarily large, a topic that we will revisit in Section 4.4.

In document Exponential families on resource-constrained systems (Page 127-130)