The generalised shuttle algorithm

8.2 Terminology and notation

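Let $X = (X_1, X_2, \ldots, X_k)$ be a vector of $k$ discrete random variables cross-classified in a frequency count table $n = \{n(i)\}_{i \in \mathcal{I}}$, where $\mathcal{I} = \mathcal{I}_1 \times \mathcal{I}_2 \times \cdots \times \mathcal{I}_k$ and $X_r$ takes the values $\mathcal{I}_r := \{1, 2, \ldots, I_r\}$. Denote $K = \{1, 2, \ldots, k\}$. For $r \in K$, denote by $\mathcal{P}(\mathcal{I}_r)$ the set of all partitions of $\mathcal{I}_r$, i.e.,

$$\mathcal{P}(\mathcal{I}_r) := \Big\{ \{\mathcal{I}_r^1, \mathcal{I}_r^2, \ldots, \mathcal{I}_r^{l_r}\} : \mathcal{I}_r^l \neq \emptyset \text{ for all } l,\ \bigcup_{j=1}^{l_r} \mathcal{I}_r^j = \mathcal{I}_r,\ \mathcal{I}_r^{j_1} \cap \mathcal{I}_r^{j_2} = \emptyset \text{ if } j_1 \neq j_2 \Big\}.$$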
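To make the definition of the set of partitions concrete, the following minimal sketch enumerates every partition of a small category set. The function name `set_partitions` is an assumption introduced for illustration only; it is not part of the original text.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of non-empty blocks."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for idx in range(len(partial)):
            yield partial[:idx] + [[first] + partial[idx]] + partial[idx + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partial


# All partitions of the categories {1, 2, 3} of a hypothetical variable
for blocks in set_partitions([1, 2, 3]):
    print(blocks)
```

The number of partitions grows quickly with the number of categories (5 for three categories, 15 for four), which is one reason the set of redesigned tables is large even for modest tables.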

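Let $RD$ be the set of marginal tables obtainable by aggregating $n$ not only across variables, but also across categories within variables. We can uniquely determine a table $n' \in RD$ from $n$ by choosing $\mathcal{I}'_1 \in \mathcal{P}(\mathcal{I}_1), \mathcal{I}'_2 \in \mathcal{P}(\mathcal{I}_2), \ldots, \mathcal{I}'_k \in \mathcal{P}(\mathcal{I}_k)$. We write

$$n' = \{ n'(J_1, J_2, \ldots, J_k) : (J_1, J_2, \ldots, J_k) \in \mathcal{I}'_1 \times \mathcal{I}'_2 \times \cdots \times \mathcal{I}'_k \},$$

where the entries of $n'$ are sums of appropriate entries of $n$:

$$n'(J_1, J_2, \ldots, J_k) := \sum_{i_1 \in J_1} \sum_{i_2 \in J_2} \cdots \sum_{i_k \in J_k} n_K(i_1, i_2, \ldots, i_k).$$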
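The aggregation formula can be sketched in a few lines of Python. The representation below (a table as a dictionary from cell tuples to counts, a block as a `frozenset`) and the function name `redesign` are assumptions chosen for illustration, and the counts are made up.

```python
from itertools import product


def redesign(n, partitions):
    """Aggregate table `n` (dict: cell tuple -> count) over chosen partitions.

    `partitions[r]` is the chosen partition of variable r's categories,
    given as a list of blocks. The result is keyed by tuples of frozensets.
    """
    blocks = [[frozenset(b) for b in part] for part in partitions]
    table = {}
    for cell in product(*blocks):
        # Each new entry sums n(i) over all i in J_1 x ... x J_k.
        table[cell] = sum(n.get(i, 0) for i in product(*cell))
    return table


# Hypothetical 2 x 3 table of counts
n = {(1, 1): 3, (1, 2): 1, (1, 3): 4,
     (2, 1): 2, (2, 2): 5, (2, 3): 0}

# Keep variable 1 as is; merge categories 2 and 3 of variable 2.
merged = redesign(n, [[[1], [2]], [[1], [2, 3]]])

# A single block {1, 2} for variable 1 collapses across that variable.
marginal = redesign(n, [[[1, 2]], [[1], [2], [3]]])
```

Choosing singleton blocks for every variable returns the original table, and choosing a single block for every variable returns the grand total.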

We associate the table $n$ with $\mathcal{I}'_r = \{\{1\}, \{2\}, \ldots, \{I_r\}\}$, for $r = 1, \ldots, k$. On the other hand, choosing $\mathcal{I}'_r = \{\mathcal{I}_r\}$ is equivalent to collapsing across the $r$-th variable.

The dimension of $n' \in RD$ is the number of variables cross-classified in $n'$ that have more than one category. For $C \subset K$, we obtain the $C$-marginal $n_C$ of $n$ by taking

$$\mathcal{I}'_r = \begin{cases} \{\{1\}, \{2\}, \ldots, \{I_r\}\}, & \text{if } r \in C, \\ \{\mathcal{I}_r\}, & \text{otherwise,} \end{cases}$$

for $r = 1, 2, \ldots, k$. The dimension of $n_C$ is equal to the number of elements in $C$.

The grand total of n has dimension zero, while n has dimension k.

We introduce the set of tables $RD(n')$ containing the tables $n'' \in RD$ obtainable from $n'$ by table redesign such that $n'$ and $n''$ have the same dimension. We have $n \in RD(n)$ and $RD(n_\emptyset) = \{n_\emptyset\}$, where $n_\emptyset$ is the grand total of $n$. The set $RD$ itself results from aggregating every marginal $n_C$ of $n$ across categories, such that every variable having at least two categories in $n_C$ also has at least two categories in the new redesigned table:

$$RD = \bigcup_{C \subseteq K} RD(n_C).$$

Let $T$ denote the set of cells of all the tables in $RD$. The elements in $T$ are blocks or 'super-cells' formed by joining table entries in $n$.

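These blocks can be viewed as entries in a $k$-dimensional table that cross-classifies the variables $(Y_j : j = 1, 2, \ldots, k)$, where $Y_j$ takes values $y_j \in \{J_j : \emptyset \neq J_j \subseteq \mathcal{I}_j\}$. If the set of cells defining a 'super-cell' $t_2 = t_{J_1^2 \ldots J_k^2} \in T$ includes the set of cells defining another 'super-cell' $t_1 = t_{J_1^1 \ldots J_k^1} \in T$, then we write $t_1 = t_{J_1^1 \ldots J_k^1} \prec t_2 = t_{J_1^2 \ldots J_k^2}$. We formally define the partial ordering $\prec$ on the cells in $T$ by

$$t_{J_1^1 J_2^1 \ldots J_k^1} \prec t_{J_1^2 J_2^2 \ldots J_k^2} \iff J_1^1 \subseteq J_1^2,\ J_2^1 \subseteq J_2^2,\ \ldots,\ J_k^1 \subseteq J_k^2.$$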
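The partial ordering amounts to a componentwise subset test. As a minimal sketch, assume a super-cell is stored as a tuple of `frozenset` blocks, one per variable; that representation and the name `precedes` are illustrative choices, not notation from the text.

```python
def precedes(t1, t2):
    """Return True if t1 precedes t2: each block of t1 is a subset of t2's."""
    return len(t1) == len(t2) and all(a <= b for a, b in zip(t1, t2))


# Super-cells of a hypothetical 2 x 3 table
cell = (frozenset({1}), frozenset({2}))            # a single cell count
row = (frozenset({1}), frozenset({1, 2, 3}))       # a row total
other_row = (frozenset({2}), frozenset({1, 2, 3}))
total = (frozenset({1, 2}), frozenset({1, 2, 3}))  # the grand total

print(precedes(cell, row), precedes(row, total), precedes(row, other_row))
```

Two distinct row totals are not comparable, which is why the relation is only a partial ordering.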

This partial ordering, $(T, \prec)$, has a maximal element, namely the grand total $n_\emptyset = t_{\mathcal{I}_1 \mathcal{I}_2 \ldots \mathcal{I}_k}$ of the table, and several minimal elements, namely the actual cell counts $n(i) = n(i_1, i_2, \ldots, i_k) = t_{\{i_1\}\{i_2\}\ldots\{i_k\}}$. Thus, we can represent the lattice $(T, \prec)$ as a hierarchy with the grand total at the top level and the cell counts $n(i)$ at the bottom level. If $t_1 = t_{J_1^1 J_2^1 \ldots J_k^1}$ [...] joining table entries in $n$. The operator $\oplus$ is equivalent to joining two blocks of cells in $T$ to form a third block, where the blocks to be joined have the same categories in $(k-1)$ dimensions and cannot share any categories in the remaining dimension.
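The verbal description of the joining operator can be turned into a small function. This is a sketch based only on that description: two blocks are joined when they agree in all but one dimension and are disjoint in the remaining one. The tuple-of-`frozenset` representation and the name `join` are assumptions made for illustration.

```python
def join(t1, t3):
    """Return the block obtained by joining t1 and t3, or None if impossible."""
    if len(t1) != len(t3):
        return None
    differing = [r for r, (a, b) in enumerate(zip(t1, t3)) if a != b]
    if len(differing) != 1:
        return None  # must agree in exactly k - 1 dimensions
    r0 = differing[0]
    if t1[r0] & t3[r0]:
        return None  # blocks share a category in the remaining dimension
    return t1[:r0] + (t1[r0] | t3[r0],) + t1[r0 + 1:]


a = (frozenset({1}), frozenset({1}))
b = (frozenset({1}), frozenset({2}))
print(join(a, b))  # the block covering categories 1 and 2 of variable 2
```

Under this reading, whenever `join(t1, t3)` returns a block `t2`, the counts satisfy the additive relation between the three blocks that the algorithm exploits.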

8.3 The generalised shuttle algorithm

The fundamental idea behind the generalised shuttle algorithm (GSA) is that the upper and lower bounds for the cells in T are interlinked, i.e., bounds for some cells in T induce bounds for some other cells in T. We can improve (tighten) the bounds for all the cells in which we are interested until we can make no further adjustment.

Although Buzzigoli and Giusti (1999) introduced this innovative idea, they did not fully exploit the special hierarchical structure of T.

Let $L_0(T) := \{L_0(t) : t \in T\}$ and $U_0(T) := \{U_0(t) : t \in T\}$ be initial lower and upper bounds. By default we set $L_0(t) = 0$ and $U_0(t) = n_\emptyset$, but we can express almost any type of information about the counts in the cells of $T$ using these bounds. For example, a known count $c$ in a cell $t$ of a fixed marginal implies that $L_0(t) = U_0(t) = c$. A cell $t$ that can take only the two values 0 or 1 has $L_0(t) = 0$ and $U_0(t) = 1$.

We denote by $S[L_0(T), U_0(T)]$ the set of integer feasible arrays $V(T) := \{V(t) : t \in T\}$ consistent with $L_0(T)$ and $U_0(T)$: (i) $L_0(t) \le V(t) \le U_0(t)$, for all $t \in T$, and (ii) $V(t_1) + V(t_3) = V(t_2)$, for all $(t_1, t_2, t_3) \in Q(T)$, where

$$Q(T) := \{(t_1, t_2, t_3) \in T \times T \times T : t_1 \oplus t_3 = t_2\}.$$

We let $N \subset T$ be the set of cells in table $n$. A feasible table consistent with the constraints imposed (e.g., fixed marginals) is $\{V(t) : t \in N\}$ where $V(T) \in S[L_0(T), U_0(T)]$.

The sharp integer bounds [L(t), U (t)], t ∈ T, are the solution of the integer optimisation problems:

min{±V (t) : V (T) ∈ S[L0(T), U0(T)]} .
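For very small tables the sharp integer bounds can be found by brute force, which gives a reference point for any bounding method. The sketch below is not the generalised shuttle algorithm: it enumerates every non-negative integer 2 x 2 table with given row and column totals and takes the minimum and maximum of each cell. The function name and the totals are hypothetical.

```python
from itertools import product


def sharp_bounds_2x2(rows, cols):
    """Sharp bounds for cells (n11, n12, n21, n22) given the margins.

    Returns a list of (lower, upper) pairs, or None if no table is feasible.
    """
    total = sum(rows)
    feasible = [
        c for c in product(range(total + 1), repeat=4)
        if (c[0] + c[1], c[2] + c[3]) == tuple(rows)
        and (c[0] + c[2], c[1] + c[3]) == tuple(cols)
    ]
    if not feasible:
        return None
    return [(min(c[j] for c in feasible), max(c[j] for c in feasible))
            for j in range(4)]


print(sharp_bounds_2x2((3, 2), (4, 1)))
```

Enumeration of this kind scales exponentially with the number of cells, so it is only useful for checking results on toy examples.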

We initially set $L(T) = L_0(T)$ and $U(T) = U_0(T)$ and sequentially improve these loose bounds by GSA until we get convergence. Consider $T_0 := \{t \in T : L(t) = U(t)\}$ to be the cells with the current lower and upper bounds equal. We say that the remaining cells in $T \setminus T_0$ are free. As the algorithm progresses, we improve the bounds for the cells in $T$ and add more and more cells to $T_0$. For each $t$ in $T_0$, we assign a value $V(t) := L(t) = U(t)$.

We sequentially go through the dependencies $Q(T)$ and update the upper and lower bounds in the following fashion. Consider a triplet $(t_1, t_2, t_3) \in Q(T)$. We have $t_1 \prec t_2$ and $t_3 \prec t_2$. We update the upper and lower bounds of $t_1$, $t_2$ and $t_3$ so that the new bounds satisfy the dependency $t_1 \oplus t_3 = t_2$.

If all three cells have fixed values, i.e., $t_1, t_2, t_3 \in T_0$, we check whether $V(t_1) + V(t_3) = V(t_2)$. If this equality does not hold, we stop GSA because $S[L_0(T), U_0(T)]$ is empty: there is no integer table consistent with the constraints imposed.

Now assume that $t_1, t_3 \in T_0$ and $t_2 \notin T_0$. Then $t_2$ can take only one value, namely $V(t_1) + V(t_3)$. If $V(t_1) + V(t_3) \notin [L(t_2), U(t_2)]$, we encounter an inconsistency and stop. Otherwise we set $V(t_2) = L(t_2) = U(t_2) := V(t_1) + V(t_3)$ and include $t_2$ in $T_0$. Similarly, if $t_1, t_2 \in T_0$ and $t_3 \notin T_0$, $t_3$ can only be equal to $V(t_2) - V(t_1)$. If $V(t_2) - V(t_1) \notin [L(t_3), U(t_3)]$, we again discover an inconsistency. Otherwise, we set $V(t_3) = L(t_3) = U(t_3) := V(t_2) - V(t_1)$ and $T_0 := T_0 \cup \{t_3\}$. In the case when $t_2, t_3 \in T_0$ and $t_1 \notin T_0$, we proceed in an analogous manner.

Next we examine the situation when at least two of the cells $t_1, t_2, t_3$ do not have a fixed value. Suppose $t_1 \notin T_0$. The new bounds for $t_1$ are

$$U(t_1) := \min\{U(t_1),\ U(t_2) - L(t_3)\}, \qquad L(t_1) := \max\{L(t_1),\ L(t_2) - U(t_3)\}.$$

If $t_3 \notin T_0$, we update $L(t_3)$ and $U(t_3)$ in the same way. Finally, if $t_2 \notin T_0$, we set

$$U(t_2) := \min\{U(t_2),\ U(t_1) + U(t_3)\}, \qquad L(t_2) := \max\{L(t_2),\ L(t_1) + L(t_3)\}.$$
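The update rules for one dependency can be sketched as a single function. Two simplifications are assumed here: the bounds live in plain dictionaries `L` and `U` keyed by cell, and the fixed-value cases are folded into the same interval formulas (a fixed cell is just one whose lower and upper bounds coincide, and an inconsistency shows up as a lower bound exceeding an upper bound). The name `update_triplet` is hypothetical.

```python
def update_triplet(L, U, t1, t2, t3):
    """Tighten bounds for one dependency in which t1 and t3 join to form t2.

    Mutates L and U in place. Returns True if any bound changed and raises
    ValueError if the bounds become inconsistent.
    """
    changed = False
    # Evaluated lazily so each step sees the bounds updated by earlier steps.
    steps = (
        (t1, lambda: (L[t2] - U[t3], U[t2] - L[t3])),
        (t3, lambda: (L[t2] - U[t1], U[t2] - L[t1])),
        (t2, lambda: (L[t1] + L[t3], U[t1] + U[t3])),
    )
    for cell, implied in steps:
        lo, hi = implied()
        new_lo, new_hi = max(L[cell], lo), min(U[cell], hi)
        if new_lo > new_hi:
            raise ValueError(f"inconsistent bounds for cell {cell!r}")
        if (new_lo, new_hi) != (L[cell], U[cell]):
            L[cell], U[cell] = new_lo, new_hi
            changed = True
    return changed


# Hypothetical row with known total 3 and two cells "a" and "b"
L = {"a": 0, "b": 0, "row": 3}
U = {"a": 5, "b": 1, "row": 3}
update_triplet(L, U, "a", "row", "b")
print(L["a"], U["a"])  # bounds for "a" after one pass
```

In this toy case the cell `"a"` is tightened from [0, 5] to [2, 3], because the row total is 3 and `"b"` is at most 1.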

After updating the bounds of some cell t ∈ T, we check whether the new upper bound equals the new lower bound. If this is true, we set V (t) := L(t) = U (t) and include t in T0.

We continue iterating through all the dependencies in $Q(T)$ until the upper bounds no longer decrease, the lower bounds no longer increase and no new cells are added to $T_0$. Therefore the procedure comes to an end if and only if we detect an inconsistency or if we cannot improve the bounds. Because the bounds are integers, every improvement moves a bound by at least one unit within the initial range $[L_0(t), U_0(t)]$, so one of these two events eventually occurs; hence the algorithm stops after a finite number of steps.
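Putting the pieces together, the following is a minimal sketch of the whole iteration for a tiny table, not an efficient or authoritative implementation. It assumes that the set of super-cells consists of all products of non-empty category subsets, builds the dependencies by brute force, and repeats the interval updates until nothing changes. All function names and the example margins are hypothetical.

```python
from itertools import combinations, product


def nonempty_subsets(categories):
    """All non-empty subsets of a variable's categories, as frozensets."""
    cats = sorted(categories)
    return [frozenset(c) for size in range(1, len(cats) + 1)
            for c in combinations(cats, size)]


def join(t1, t3):
    """Join two blocks that differ, disjointly, in exactly one dimension."""
    differing = [r for r, (a, b) in enumerate(zip(t1, t3)) if a != b]
    if len(differing) != 1 or t1[differing[0]] & t3[differing[0]]:
        return None
    r0 = differing[0]
    return t1[:r0] + (t1[r0] | t3[r0],) + t1[r0 + 1:]


def tighten(L, U, cell, lo, hi):
    """Intersect the bounds of `cell` with [lo, hi]; report any change."""
    new_lo, new_hi = max(L[cell], lo), min(U[cell], hi)
    if new_lo > new_hi:
        raise ValueError(f"no integer table is consistent at {cell!r}")
    changed = (new_lo, new_hi) != (L[cell], U[cell])
    L[cell], U[cell] = new_lo, new_hi
    return changed


def gsa(levels, fixed, total):
    """Iterate the bound updates until convergence; return (L, U)."""
    T = list(product(*(nonempty_subsets(c) for c in levels)))
    Q = []
    for t1, t3 in combinations(T, 2):
        t2 = join(t1, t3)
        if t2 is not None:
            Q.append((t1, t2, t3))
    # Default bounds are [0, total]; known counts fix both bounds.
    L = {t: fixed.get(t, 0) for t in T}
    U = {t: fixed.get(t, total) for t in T}
    changed = True
    while changed:
        changed = False
        for t1, t2, t3 in Q:
            changed |= tighten(L, U, t1, L[t2] - U[t3], U[t2] - L[t3])
            changed |= tighten(L, U, t3, L[t2] - U[t1], U[t2] - L[t1])
            changed |= tighten(L, U, t2, L[t1] + L[t3], U[t1] + U[t3])
    return L, U


# Hypothetical 2 x 2 table with row totals (3, 2) and column totals (4, 1)
one, two, both = frozenset({1}), frozenset({2}), frozenset({1, 2})
fixed = {(one, both): 3, (two, both): 2,
         (both, one): 4, (both, two): 1, (both, both): 5}
L, U = gsa([[1, 2], [1, 2]], fixed, total=5)
for cell in product((one, two), repeat=2):
    print(sorted(cell[0]), sorted(cell[1]), L[cell], U[cell])
```

For this example the loop settles on [2, 3] for cell (1, 1), [0, 1] for (1, 2), [1, 2] for (2, 1) and [0, 1] for (2, 2). Enumerating all super-cells is only feasible for toy tables.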

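If we do not encounter any inconsistencies, the algorithm converges to bounds $L_s(T)$ and $U_s(T)$ that are at least as tight as the initial bounds but not necessarily sharp: $L_0(t) \le L_s(t) \le L(t) \le U(t) \le U_s(t) \le U_0(t)$, where $[L(t), U(t)]$ denotes the sharp integer bounds defined above.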

These arrays define the same feasible set of tables as the arrays $L_0(T)$ and $U_0(T)$ we started with, i.e., $S[L_s(T), U_s(T)] = S[L_0(T), U_0(T)]$, since the dependencies $Q(T)$ need to be satisfied.

There exist two particular cases when we can easily prove that GSA converges to sharp integer bounds: (i) the case of a dichotomous $k$-dimensional table with all $(k-1)$-dimensional marginals fixed and (ii) the case when the marginals we fix are the minimal sufficient statistics of a decomposable log-linear model. In both instances explicit formulas for the bounds exist. Employing GSA turns out to be equivalent to calculating the bounds directly, as we prove in the next two sections.

8.4 Computing bounds for dichotomous k-way cross classifications

In document Algebraic and Geometric Methods in Statistics (Page 155-158)