

4.3 Integer Probabilistic Inference

4.3.4 Computing Bit-Length Messages

Bit-length messages have computationally convenient properties. In particular, all summands in (4.3) are powers of two, which we can exploit with an efficient sparse data structure. The main bit-length propagation (BLprop) algorithm is mostly identical to LBP (Algorithm 2.1), except that the ordinary messages are replaced by $\beta_{u\to v}(x)$ and $\epsilon$ has to be provided as a rational number to check for convergence. Our algorithm for the actual message computation is shown in Algorithm 4.1. Its most important property is the use of a sorted list structure to represent the message. While this might sound odd at first, each message $\beta_{u\to v}(x)$ is the bit-length of a sum of $|\mathcal{X}_v|$ powers of two. Hence, the binary representation of the sum that corresponds to $\beta_{u\to v}(x)$ has at most $|\mathcal{X}_v|$ bits set to 1. Storing the bit positions in a sorted list allows us to access the highest bit, which represents the bit-length of the sum (and hence $\beta_{u\to v}(x)$), in time $O(1)$. For the list, we need $O(|\mathcal{X}_v|)$ temporary storage for the computation of arbitrarily large messages. Note that a plain 64-bit integer will soon overflow if the number of neighbors or the magnitude of incoming bit-lengths is large. The sparse integer representation of the sum via a list cannot overflow, as long as the underlying type is large enough to hold the position of a bit. As an example, if the underlying integer type has 16 bits, the list allows us to represent any number whose binary representation requires less than $2^{16}$ bits. Without the list structure, only messages of up to 16 bits would be representable. The list itself can be discarded after computation, and only its largest element is propagated.
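The sparse representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a transcription of Algorithm 4.1: the function names `add_power_of_two` and `bit_length_of_sum` are hypothetical, bit positions are assumed to be zero-based, and a plain Python list stands in for the sorted list (so insertion and removal cost $O(n)$ here, whereas a linked list or tree would behave differently).

```python
from bisect import bisect_left


def add_power_of_two(bits, i):
    """Add 2**i to the number whose set bit positions are kept,
    in ascending order, in the list `bits` (modified in place)."""
    pos = bisect_left(bits, i)  # binary search for position i
    # Binary addition with carry: if bit i is already set, clear it
    # and carry into bit i + 1, repeating as long as necessary.
    while pos < len(bits) and bits[pos] == i:
        bits.pop(pos)
        i += 1
    bits.insert(pos, i)  # bit i was not set, so set it


def bit_length_of_sum(exponents):
    """Bit-length of sum(2**i for i in exponents), computed without
    ever materializing the (possibly huge) sum itself."""
    bits = []  # sorted positions of the bits that are set to 1
    for i in exponents:
        add_power_of_two(bits, i)
    # Assumption: positions are zero-based, so the bit-length is the
    # largest position plus one (and 0 for an empty sum).
    return bits[-1] + 1 if bits else 0
```

For instance, `bit_length_of_sum([10**6, 10**6])` returns `1000002` while the list never holds more than one element, whereas the sum itself would need more than a million bits.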

Theorem 4.4 (Correctness of Message Computation) Algorithm 4.1 computes the bit-length message $\beta_{u\to v}(x_v)$ for a base-2 exponential family.

Proof. By definition,

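$$
\beta_{u\to v}(x_v) = \operatorname{bl} \sum_{x_u \in \mathcal{X}_u} 2^{\langle \theta_{vu}, \phi(x_{vu}) \rangle + \sum_{w \in \mathcal{N}(u) \setminus \{v\}} \beta_{w\to u}(x_u)} .
$$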

$\beta_{u\to v}(x_v)$ is the bit-length of a sum of powers of 2. In line 1 we allocate a list structure (or a tree, or a hash map) $B$ to store the sparse integer representation of the message. Here, sparsity is meant w.r.t. the binary encoding of the message itself. The actual summation takes place in lines 2–9. In line 3, we load the exponent $i$ of the current summand, and lines 4–8 perform raw binary addition of $2^i$ to the current sum, which is stored in $B$. Checking if $i$ is already contained in the list takes at most $O(\log |\mathcal{X}_u|)$ steps via binary search in the sorted list. Since the summation is over $|\mathcal{X}_u|$ terms, the final outcome can have at most $|\mathcal{X}_u|$ non-zero bits. Even if the result of this summation is a 1 million bit number, only $O(|\mathcal{X}_u|)$ storage is required during message computation. In the last line, we return 1 plus the position of the most significant bit stored in $B$, which is the bit-length of a binary number. Accessing the position of the most significant bit is performed in constant time by using a pointer to the end of the list.

Note that at the end of each message computation, +1 is added to the most significant bit position. This leads to an increased message magnitude. To see this, consider a simple chain model with $n$ binary state vertices and a zero parameter vector $\theta = 0$. Any outgoing message $\beta_{1\to 2}(x)$ from vertex 1 to vertex 2 has the value 1. The message $\beta_{2\to 3}(x)$ has the value 2, and so on, until the last message $\beta_{(n-1)\to n}(x)$ is $n - 1$. This behavior is not problematic due to our sparse message representation, but it shows that the magnitude of messages depends on the size of the graph. Without our sparse representation, the bit-length propagation would be restricted to small graphs and parameters with small magnitude.

The above procedure allows for nearly arbitrarily large messages, but it does not guarantee a seamless computation of marginal probabilities. Recall that (vertex) marginals are computed via

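$$
p_v(x) = \frac{2^{\sum_{u \in \mathcal{N}(v)} \beta_{u\to v}(x)}}{\sum_{y \in \mathcal{X}_v} 2^{\sum_{u \in \mathcal{N}(v)} \beta_{u\to v}(y)}} .
$$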

While we can easily compute the (possibly very large) denominator via the sparse integer representation that we used in Algorithm 4.1, the result cannot be processed natively on the CPU due to the fixed word-size ω. In order to allow for native subsequent processing of the marginals (e.g., in another application without sparse integer support), we can shift numerator and denominator to the right until both fit into native CPU registers with, say, ω = 16 bit word-size.
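The shifting idea can be sketched in Python, whose built-in integers are arbitrary precision and whose `>>` operator on nonnegative integers computes the floor of a division by a power of two. The sketch rests on several assumptions: Python integers stand in for the sparse integer representation, the function name `shift_to_word` is hypothetical, `exponents[x]` is assumed to hold the sum of incoming bit-length messages for state `x`, and the bit-length is taken to be the usual one as returned by `int.bit_length`.

```python
def shift_to_word(exponents, omega=16):
    """Return (numerators, denominator) of the vertex marginal,
    right-shifted by a common amount so that the denominator
    (and hence every numerator) fits into omega bits.

    exponents[x] is assumed to be the nonnegative integer sum of the
    incoming bit-length messages for state x.
    """
    numerators = [1 << e for e in exponents]   # 2**e for each state
    denominator = sum(numerators)              # may be very large
    # Shift only if the denominator exceeds the word size.
    shift = max(denominator.bit_length() - omega, 0)
    return [n >> shift for n in numerators], denominator >> shift
```

With `exponents = [40, 38, 3]` and `omega = 16`, the call returns `([32768, 8192, 0], 40960)`. All four values fit into a 16-bit register, and the first ratio 32768/40960 = 0.8 is very close to the exact marginal, while the smallest state is rounded down to zero.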

Definition 4.4 (Right Bit-Shifts) Let a, b ∈ N. The right bit-shift operation a ≫ b is defined as

$$
a \gg b = \left\lfloor \frac{a}{2^b} \right\rfloor .
$$

Whenever a sparse integer $a$ with bit-length $b$ has to be transferred to native program code but does not fit into a native $\omega$-bit register, we store $a \gg (b - \omega)$ instead. While this procedure introduces a large error into any single number, shifting two numbers $a$ and $c$ by the same number of bits $b$ leaves their ratio almost intact, i.e., $a/c \approx (a \gg b)/(c \gg b)$. The following lemma formalizes this observation in the context of probability mass functions.

Lemma 4.4 (Probability Shifting) Let $a$ be some vector of integers, such that $p(x) = 2^{a_x} / \sum_{y \in \mathcal{X}} 2^{a_y}$ for some random variable $X$ with state space $\mathcal{X}$. Let further $\omega$ be the word-size of the underlying CPU, and $b = \operatorname{bl} \sum_{y \in \mathcal{X}} 2^{a_y}$. If $b > \omega$ and

$$
\hat{p}(x) = \frac{2^{a_x} \gg (b - \omega)}{\left( \sum_{y \in \mathcal{X}} 2^{a_y} \right) \gg (b - \omega)}
$$

is a shifted version of $p$, then

$$
|p(x) - \hat{p}(x)| < 2^{-\omega+1} ,
$$

i.e., the error introduced by the shifting is bounded and exponentially small in the word-size.

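Proof. First, assume $(2^{a_x} \gg (b - \omega)) = 0$. In this case,

$$
|p(x) - \hat{p}(x)| = p(x) = \frac{2^{a_x}}{\sum_{y \in \mathcal{X}} 2^{a_y}} < \frac{2^{b-\omega}}{\sum_{y \in \mathcal{X}} 2^{a_y}} \le 2^{-\omega+1} ,
$$

because $2^{a_x} < 2^{b-\omega}$ and $\sum_{y \in \mathcal{X}} 2^{a_y} \ge 2^{b-1}$. Now, let $(2^{a_x} \gg (b - \omega)) > 0$. In this case, according to Definition 4.4, we have $(2^{a_x} \gg (b - \omega)) = 2^{a_x - b + \omega}$. Thus,

$$
|p(x) - \hat{p}(x)| = \frac{2^{a_x - b + \omega}}{\left( \sum_{y \in \mathcal{X}} 2^{a_y} \right) \gg (b - \omega)} - \frac{2^{a_x}}{\sum_{y \in \mathcal{X}} 2^{a_y}} , \tag{4.7}
$$

where the $|\cdot|$ disappears because

$$
\underbrace{\left( \sum_{y \in \mathcal{X}} 2^{a_y} \right) \gg (b - \omega)}_{\lfloor c \rfloor} = \left\lfloor 2^{-b+\omega} \sum_{y \in \mathcal{X}} 2^{a_y} \right\rfloor \le \underbrace{2^{-b+\omega} \left( \sum_{y \in \mathcal{X}} 2^{a_y} \right)}_{c} .
$$

Multiplying $p(x)$ in (4.7) with $1 = 2^{-b+\omega} / 2^{-b+\omega}$, and observing that $\lfloor c \rfloor > c - 1$, yields

$$
|p(x) - \hat{p}(x)| = \frac{2^{a_x - b + \omega}}{\lfloor c \rfloor} - \frac{2^{a_x - b + \omega}}{c} \le \frac{2^{a_x - b + \omega}}{(c - 1)\,c} \le \frac{2^{a_x - b + \omega}}{(2^{b-1} - 1)\,2^{b-1}} \le 2^{a_x - 3b + 3 + \omega} .
$$

Finally, the lemma follows from $\forall x \in \mathcal{X} : a_x < b$ and $b \ge \omega + 1$ (due to integrality of $b$ and $\omega$).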
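A small worked example may help to see the statement of Lemma 4.4 in action. The numbers below are chosen purely for illustration and are not taken from the text: $a = (5, 4, 0)$ and $\omega = 4$, with $\operatorname{bl}$ assumed to be the usual bit-length (so that $\operatorname{bl} 49 = 6$).

```latex
\begin{align*}
\sum_{y} 2^{a_y} &= 32 + 16 + 1 = 49, \qquad b = \operatorname{bl} 49 = 6, \qquad b - \omega = 2, \\
2^{a} \gg 2 &= (8,\ 4,\ 0), \qquad 49 \gg 2 = 12, \\
p &= \left( \tfrac{32}{49},\ \tfrac{16}{49},\ \tfrac{1}{49} \right) \approx (0.6531,\ 0.3265,\ 0.0204), \\
\hat{p} &= \left( \tfrac{8}{12},\ \tfrac{4}{12},\ \tfrac{0}{12} \right) \approx (0.6667,\ 0.3333,\ 0), \\
|p - \hat{p}| &\approx (0.0136,\ 0.0068,\ 0.0204) < 2^{-\omega+1} = 0.125 .
\end{align*}
```

The third state illustrates the first case of the proof, in which the shifted numerator is zero and the error equals $p(x)$ itself. The other two states fall under the second case.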

It is thus rather safe to pass shifted marginals to native code. An orthogonal approach is the restriction of the parameter space to the set $\{0, 1, \dots, k - 1\}^d \subset \mathbb{N}^d$. This prevents the parameters and the messages from becoming arbitrarily large, a topic that we will revisit in Section 4.4.