8.2 Terminology and notation

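Let $X = (X_1, X_2, \ldots, X_k)$ be a vector of $k$ discrete random variables cross-classified in a frequency count table $n = \{n(i)\}_{i \in \mathcal{I}}$, where $\mathcal{I} = \mathcal{I}_1 \times \mathcal{I}_2 \times \cdots \times \mathcal{I}_k$ and $X_r$ takes the values $\mathcal{I}_r := \{1, 2, \ldots, I_r\}$. Denote $K = \{1, 2, \ldots, k\}$. For $r \in K$, denote by $\mathcal{P}(\mathcal{I}_r)$ the set of all partitions of $\mathcal{I}_r$, i.e.,

\[
\mathcal{P}(\mathcal{I}_r) := \Bigl\{ \bigl\{\mathcal{I}_r^1, \mathcal{I}_r^2, \ldots, \mathcal{I}_r^{l_r}\bigr\} : \mathcal{I}_r^l \neq \emptyset \text{ for all } l,\ \bigcup_{j=1}^{l_r} \mathcal{I}_r^j = \mathcal{I}_r,\ \mathcal{I}_r^{j_1} \cap \mathcal{I}_r^{j_2} = \emptyset \text{ if } j_1 \neq j_2 \Bigr\}.
\]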

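Let $\mathcal{RD}$ be the set of marginal tables obtainable by aggregating $n$ not only across variables, but also across categories within variables. We can uniquely determine a table $n' \in \mathcal{RD}$ from $n$ by choosing $\mathcal{I}'_1 \in \mathcal{P}(\mathcal{I}_1), \mathcal{I}'_2 \in \mathcal{P}(\mathcal{I}_2), \ldots, \mathcal{I}'_k \in \mathcal{P}(\mathcal{I}_k)$. We write

\[
n' = \bigl\{ n'(J_1, J_2, \ldots, J_k) : (J_1, J_2, \ldots, J_k) \in \mathcal{I}'_1 \times \mathcal{I}'_2 \times \cdots \times \mathcal{I}'_k \bigr\},
\]

where the entries of $n'$ are sums of the appropriate entries of $n$:

\[
n'(J_1, J_2, \ldots, J_k) := \sum_{i_1 \in J_1} \sum_{i_2 \in J_2} \cdots \sum_{i_k \in J_k} n_K(i_1, i_2, \ldots, i_k).
\]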

We associate the table $n$ with $\mathcal{I}'_r = \{\{1\}, \{2\}, \ldots, \{I_r\}\}$, for $r = 1, \ldots, k$. On the other hand, choosing $\mathcal{I}'_r = \{\mathcal{I}_r\}$ is equivalent to collapsing across the $r$-th variable.

The dimension of $n' \in \mathcal{RD}$ is the number of variables cross-classified in $n'$ that have more than one category. For $C \subset K$, we obtain the $C$-marginal $n_C$ of $n$ by taking

\[
\mathcal{I}'_r =
\begin{cases}
\{\{1\}, \{2\}, \ldots, \{I_r\}\}, & \text{if } r \in C, \\
\{\mathcal{I}_r\}, & \text{otherwise,}
\end{cases}
\]

for $r = 1, 2, \ldots, k$. The dimension of $n_C$ is equal to the number of elements in $C$.

The grand total $n_\emptyset$ of $n$ has dimension zero, while $n$ has dimension $k$.
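The aggregation defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table is stored as a dictionary keyed by cell index, each partition is a list of `frozenset` blocks, and the function name `aggregate` and the 2 x 3 counts are hypothetical, not taken from the text.

```python
from itertools import product

def aggregate(n, partitions):
    """Collapse table n using one partition of the categories per variable.

    n maps a cell index (i1, ..., ik) to its count; partitions[r] is a list
    of disjoint frozensets covering the categories of variable r.
    Returns a dict mapping (J1, ..., Jk) to the sum of n over J1 x ... x Jk.
    """
    out = {}
    for blocks in product(*partitions):
        cells = product(*[sorted(J) for J in blocks])
        out[blocks] = sum(n[cell] for cell in cells)
    return out

# Hypothetical 2 x 3 table of counts, categories numbered from 1.
n = {(1, 1): 3, (1, 2): 1, (1, 3): 4,
     (2, 1): 2, (2, 2): 5, (2, 3): 0}

# Keep variable 1 as it is; merge categories 2 and 3 of variable 2.
parts = [[frozenset({1}), frozenset({2})],
         [frozenset({1}), frozenset({2, 3})]]
n_prime = aggregate(n, parts)
print(n_prime[(frozenset({1}), frozenset({2, 3}))])  # 1 + 4 = 5
```

Passing the one-block partition `[frozenset({1, 2, 3})]` for the second variable collapses across it and yields the marginal of the first variable (8 and 7 here); using one-block partitions for both variables returns the grand total, 15.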

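We introduce the set of tables $\mathcal{RD}(n')$ containing the tables $n'' \in \mathcal{RD}$ obtainable from $n'$ by table redesign such that $n'$ and $n''$ have the same dimension. We have $n' \in \mathcal{RD}(n')$ and $\mathcal{RD}(n_\emptyset) = \{n_\emptyset\}$, where $n_\emptyset$ is the grand total of $n$. The set $\mathcal{RD}$ itself results from aggregating every marginal $n_C$ of $n$ across categories, such that every variable having at least two categories in $n_C$ also has at least two categories in the new redesigned table:

\[
\mathcal{RD} = \bigcup_{C \subseteq K} \mathcal{RD}(n_C).
\]

Let $T$ be the set of cells of the tables in $\mathcal{RD}$. The elements in $T$ are blocks or 'super-cells' formed by joining table entries in $n$.

These blocks can be viewed as entries in a $k$-dimensional table that cross-classifies the variables $(Y_j : j = 1, 2, \ldots, k)$, where $Y_j$ takes as values the non-empty subsets $J_j$ of $\mathcal{I}_j$. If the set of cells that define a 'super-cell' $t_2 = t_{J_1^2 \ldots J_k^2} \in T$ includes the set of cells defining another 'super-cell' $t_1 = t_{J_1^1 \ldots J_k^1} \in T$, then we write $t_1 \prec t_2$. We formally define the partial ordering $\prec$ on the cells in $T$ by

\[
t_{J_1^1 J_2^1 \ldots J_k^1} \prec t_{J_1^2 J_2^2 \ldots J_k^2} \iff J_1^1 \subseteq J_1^2,\ J_2^1 \subseteq J_2^2,\ \ldots,\ J_k^1 \subseteq J_k^2.
\]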

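This partial ordering, $(T, \prec)$, has a maximal element, namely the grand total $n_\emptyset = t_{\mathcal{I}_1 \mathcal{I}_2 \ldots \mathcal{I}_k}$ of the table, and several minimal elements: the actual cell counts $n(i) = n(i_1, i_2, \ldots, i_k) = t_{\{i_1\}\{i_2\}\ldots\{i_k\}}$. Thus, we can represent the lattice $(T, \prec)$ as a hierarchy with the grand total at the top level and the cell counts $n(i)$ at the bottom level. If $t_1 = t_{J_1^1 J_2^1 \ldots J_k^1}$ and $t_2 = t_{J_1^2 J_2^2 \ldots J_k^2}$ satisfy $t_1 \prec t_2$ and their index sets differ in exactly one dimension $r_0$, the cells of $t_2$ that are not in $t_1$ form a third block $t_3 = t_{J_1^3 J_2^3 \ldots J_k^3}$, with $J_r^3 = J_r^1$ for $r \neq r_0$ and $J_{r_0}^3 = J_{r_0}^2 \setminus J_{r_0}^1$, and we write $t_1 \oplus t_3 = t_2$. The operator $\oplus$ is equivalent to joining two blocks of cells in $T$ to form a third block, where the blocks to be joined have the same categories in $(k-1)$ dimensions and cannot share any categories in the remaining dimension.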
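As a rough illustration of the ordering and of the operator ⊕ as described above, a super-cell can be represented as a tuple of `frozenset` objects, one index set per variable. The representation and the function names `precedes` and `oplus` are assumptions made for this sketch only.

```python
def precedes(t1, t2):
    """Ordering on super-cells: every index set of t1 lies inside that of t2."""
    return all(a <= b for a, b in zip(t1, t2))

def oplus(t1, t3):
    """Join two blocks, or return None when they cannot be joined.

    Joinable blocks have the same categories in all dimensions but one
    and share no categories in that remaining dimension.
    """
    differing = [r for r, (a, b) in enumerate(zip(t1, t3)) if a != b]
    if len(differing) != 1:
        return None
    r0 = differing[0]
    if t1[r0] & t3[r0]:
        return None  # the blocks overlap in the remaining dimension
    return t1[:r0] + (t1[r0] | t3[r0],) + t1[r0 + 1:]

# Two cells in the first row of a 2 x 2 table and the block that joins them.
a = (frozenset({1}), frozenset({1}))
b = (frozenset({1}), frozenset({2}))
row = oplus(a, b)
print(row == (frozenset({1}), frozenset({1, 2})))  # True
print(precedes(a, row), precedes(b, row))          # True True
```

Cells that differ in more than one dimension, such as `a` and the diagonal cell `(frozenset({2}), frozenset({2}))`, are not joinable, so `oplus` returns `None` for them.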

8.3 The generalised shuttle algorithm

The fundamental idea behind the generalised shuttle algorithm (GSA) is that the upper and lower bounds for the cells in T are interlinked, i.e., bounds for some cells in T induce bounds for some other cells in T. We can improve (tighten) the bounds for all the cells in which we are interested until we can make no further adjustment.

Although Buzzigoli and Giusti (1999) introduced this innovative idea, they did not fully exploit the special hierarchical structure of T.

Let L0(T) := {L0(t) : t ∈ T} and U0(T) := {U0(t) : t ∈ T} be initial lower and upper bounds. By default we set L0(t) = 0 and U0(t) = n∅, the grand total of n, but we can express almost any type of information about the counts in the cells of T using these bounds. For example, a known count c in a cell t with a fixed marginal implies that L0(t) = U0(t) = c. A cell t that can take only the two values 0 or 1 has L0(t) = 0 and U0(t) = 1.

We denote by S[L0(T), U0(T)] the set of integer feasible arrays V(T) := {V(t) : t ∈ T} consistent with L0(T) and U0(T): (i) L0(t) ≤ V(t) ≤ U0(t), for all t ∈ T, and (ii) V(t1) + V(t3) = V(t2), for all (t1, t2, t3) ∈ Q(T), where

Q(T) := {(t1, t2, t3) ∈ T × T × T : t1 ⊕ t3 = t2}.

We let N ⊂ T be the set of cells in table n. A feasible table consistent with the constraints imposed (e.g., fixed marginals) is {V(t) : t ∈ N}, where V(T) ∈ S[L0(T), U0(T)].

The sharp integer bounds [L(t), U(t)], t ∈ T, are the solutions of the integer optimisation problems:

min{±V(t) : V(T) ∈ S[L0(T), U0(T)]}.
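For very small tables the sharp bounds can be found by listing every feasible table and taking the smallest and largest value of each cell. The sketch below does this for a 2 × 2 table with both one-way marginals fixed. It is a brute-force check, not GSA, and the function name and the marginal totals are hypothetical.

```python
from itertools import product

def sharp_bounds_2x2(rows, cols):
    """Sharp bounds for the cells (n11, n12, n21, n22) by exhaustive search.

    rows and cols are the fixed row and column totals. Returns a list of
    (lower, upper) pairs, or None when no non-negative integer table fits.
    """
    total = sum(rows)
    feasible = [
        (a, b, c, d)
        for a, b, c, d in product(range(total + 1), repeat=4)
        if (a + b, c + d) == tuple(rows) and (a + c, b + d) == tuple(cols)
    ]
    if not feasible:
        return None
    return [(min(v[i] for v in feasible), max(v[i] for v in feasible))
            for i in range(4)]

print(sharp_bounds_2x2((3, 2), (4, 1)))
# [(2, 3), (0, 1), (1, 2), (0, 1)]
```

Only two tables satisfy these marginals, (2, 1, 2, 0) and (3, 0, 1, 1), which is why each interval has width one. The search space grows exponentially with the number of cells, so this approach serves only to check results on toy examples.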

We initially set L(T) = L0(T) and U(T) = U0(T) and sequentially improve these loose bounds by GSA until we get convergence. Consider T0 := {t ∈ T : L(t) = U(t)} to be the cells with the current lower and upper bounds equal. We say that the remaining cells in T \ T0 are free. As the algorithm progresses, we improve the bounds for the cells in T and add more and more cells to T0. For each t in T0, we assign a value V(t) := L(t) = U(t).

We sequentially go through the dependencies Q(T) and update the upper and lower bounds in the following fashion. Consider a triplet (t1, t2, t3) ∈ Q(T). We have t1 ≺ t2 and t3 ≺ t2. We update the upper and lower bounds of t1, t2 and t3 so that the new bounds satisfy the dependency t1 ⊕ t3 = t2.

If all three cells have fixed values, i.e., t1, t2, t3 ∈ T0, we check whether V(t1) + V(t3) = V(t2). If this equality does not hold, we stop GSA because S[L0(T), U0(T)] is empty: there is no integer table consistent with the constraints imposed.

Now assume that t1, t3 ∈ T0 and t2 ∉ T0. Then t2 can take only one value, namely V(t1) + V(t3). If V(t1) + V(t3) ∉ [L(t2), U(t2)], we encounter an inconsistency and stop. Otherwise we set V(t2) = L(t2) = U(t2) := V(t1) + V(t3) and include t2 in T0. Similarly, if t1, t2 ∈ T0 and t3 ∉ T0, t3 can only be equal to V(t2) − V(t1).

If V(t2) − V(t1) ∉ [L(t3), U(t3)], we again discover an inconsistency. Otherwise, we set V(t3) = L(t3) = U(t3) := V(t2) − V(t1) and T0 := T0 ∪ {t3}. In the case when t2, t3 ∈ T0 and t1 ∉ T0, we proceed in an analogous manner.

Next we examine the situation when at least two of the cells t1, t2, t3 do not have a fixed value. Suppose t1 ∉ T0. The new bounds for t1 are

U(t1) := min{U(t1), U(t2) − L(t3)}, L(t1) := max{L(t1), L(t2) − U(t3)}.

If t3 ∉ T0, we update L(t3) and U(t3) in the same way. Finally, if t2 ∉ T0, we set

U(t2) := min{U(t2), U(t1) + U(t3)}, L(t2) := max{L(t2), L(t1) + L(t3)}.

After updating the bounds of some cell t ∈ T, we check whether the new upper bound equals the new lower bound. If this is true, we set V (t) := L(t) = U (t) and include t in T0.
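The update for a single dependency can be sketched as follows. This is a simplified reading of the rules above, not the authors' implementation: it applies the min/max formulas to all three cells in turn and treats a lower bound that exceeds its upper bound as the inconsistency signal, which also covers the cases with fixed cells. The names `tighten` and `update_triplet` and the string cell labels are hypothetical.

```python
def tighten(L, U, t, lo, hi):
    """Intersect the interval [L[t], U[t]] with [lo, hi]; return True if it changed."""
    new_lo, new_hi = max(L[t], lo), min(U[t], hi)
    if new_lo > new_hi:
        raise ValueError(f"no integer table is consistent with the bounds at {t!r}")
    changed = (new_lo, new_hi) != (L[t], U[t])
    L[t], U[t] = new_lo, new_hi
    return changed

def update_triplet(L, U, t1, t2, t3):
    """Tighten the bounds of t1, t3 and t2 using the dependency V(t1) + V(t3) = V(t2)."""
    changed = tighten(L, U, t1, L[t2] - U[t3], U[t2] - L[t3])
    changed |= tighten(L, U, t3, L[t2] - U[t1], U[t2] - L[t1])
    changed |= tighten(L, U, t2, L[t1] + L[t3], U[t1] + U[t3])
    return changed

# Hypothetical triplet: t2 is fixed at 3, t1 lies in [0, 5] and t3 in [0, 1].
L = {"t1": 0, "t3": 0, "t2": 3}
U = {"t1": 5, "t3": 1, "t2": 3}
print(update_triplet(L, U, "t1", "t2", "t3"))  # True
print(L["t1"], U["t1"])                        # 2 3
```

A cell whose two bounds coincide after the call belongs to T0. Keeping T0 implicit in this way avoids a separate set, at the cost of rechecking fixed cells on every visit.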

We continue iterating through all the dependencies in Q(T) until the upper bounds no longer decrease, the lower bounds no longer increase and no new cells are added to T0. Therefore the procedure comes to an end if and only if we detect an inconsistency or if we cannot improve the bounds. Because T is finite and the bounds are integers that only ever move inwards from the finite initial bounds L0(T) and U0(T), only finitely many improvements are possible. One of these two events therefore eventually occurs; hence the algorithm stops after a finite number of steps.
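Putting the pieces together, the iteration can be sketched for a small table as below. The code enumerates every super-cell and every dependency explicitly, which is feasible only for toy examples, and it assumes that T consists of all combinations of non-empty category subsets. The function names and the marginal totals are hypothetical.

```python
from itertools import combinations, product

def nonempty_subsets(cats):
    return [frozenset(c) for m in range(1, len(cats) + 1)
            for c in combinations(cats, m)]

def build_dependencies(shape):
    """Return the super-cells T and the triplets (t1, t2, t3) with t1 + t3 = t2."""
    axes = [nonempty_subsets(range(1, size + 1)) for size in shape]
    T = list(product(*axes))
    Q = []
    for t2 in T:
        for r, J in enumerate(t2):
            for A in nonempty_subsets(sorted(J)):
                B = J - A
                if B and min(A) < min(B):  # keep one of each mirrored pair
                    Q.append((t2[:r] + (A,) + t2[r + 1:], t2,
                              t2[:r] + (B,) + t2[r + 1:]))
    return T, Q

def gsa(Q, L, U):
    """Sweep over the dependencies until no bound moves; ValueError if inconsistent."""
    def tighten(t, lo, hi):
        new_lo, new_hi = max(L[t], lo), min(U[t], hi)
        if new_lo > new_hi:
            raise ValueError("no integer table satisfies the constraints")
        moved = (new_lo, new_hi) != (L[t], U[t])
        L[t], U[t] = new_lo, new_hi
        return moved

    changed = True
    while changed:
        changed = False
        for t1, t2, t3 in Q:
            changed |= tighten(t1, L[t2] - U[t3], U[t2] - L[t3])
            changed |= tighten(t3, L[t2] - U[t1], U[t2] - L[t1])
            changed |= tighten(t2, L[t1] + L[t3], U[t1] + U[t3])
    return L, U

# 2 x 2 table with hypothetical fixed one-way marginals.
T, Q = build_dependencies((2, 2))
rows, cols = {1: 3, 2: 2}, {1: 4, 2: 1}
N = sum(rows.values())
full = frozenset({1, 2})
L = {t: 0 for t in T}   # default lower bounds
U = {t: N for t in T}   # default upper bounds: the grand total
for i in (1, 2):
    L[(frozenset({i}), full)] = U[(frozenset({i}), full)] = rows[i]
    L[(full, frozenset({i}))] = U[(full, frozenset({i}))] = cols[i]
L[(full, full)] = U[(full, full)] = N

gsa(Q, L, U)
bounds = {(i, j): (L[(frozenset({i}), frozenset({j}))],
                   U[(frozenset({i}), frozenset({j}))])
          for i in (1, 2) for j in (1, 2)}
print(bounds)
# {(1, 1): (2, 3), (1, 2): (0, 1), (2, 1): (1, 2), (2, 2): (0, 1)}
```

For these marginals the converged intervals coincide with the sharp bounds obtained by listing the two feasible tables, in line with the dichotomous case with fixed (k − 1)-dimensional marginals.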

If we do not encounter any inconsistencies, the algorithm converges to bounds Ls(T) and Us(T) that are at least as tight as the initial bounds but not necessarily sharp: L0(t) ≤ Ls(t) ≤ Us(t) ≤ U0(t).

These arrays define the same feasible set of tables as the arrays L0(T) and U0(T) we started with, i.e., S[Ls(T), Us(T)] = S[L0(T), U0(T)], since the dependencies Q(T) need to be satisfied.

There exist two particular cases when we can easily prove that GSA converges to sharp integer bounds: (i) the case of a dichotomous k-dimensional table with all (k− 1)-dimensional marginals fixed and (ii) the case when the marginals we fix are the minimal sufficient statistics of a decomposable log-linear model. In both instances explicit formulas for the bounds exist. Employing GSA turns out to be equivalent to calculating the bounds directly as we prove in the next two sections.

8.4 Computing bounds for dichotomous k-way cross classifications