Selecting the values - Multiple imputation for missing data and statistical disclosure control

In this section, we discuss how to select the data points that need to be replaced with synthetic values. We focus on the categorical variables because, in most scenarios, the key variables that can lead to identification are assumed to be categorical (Drechsler and Reiter (2008)). Suppose we have categorical variables $x_1, x_2, \ldots, x_p$, where $x_j = (x_{1,j}, x_{2,j}, \ldots, x_{n,j})'$ for variables $j = 1, \ldots, p$ and units $i = 1, \ldots, n$. Each $x_{i,j}$ can take values $1, 2, \ldots, w_j$, where $w_j$ is the number of levels in variable $x_j$. We then arrange the possible combinations of values in anti-lexicographical order, in which the first subscript $x_1$ will vary the fastest, the second subscript $x_2$ will vary the second fastest, and so on (see Schafer (1997) for more details). An illustration of anti-lexicographical order is shown below, where $d_j$ stands for the number of levels of $x_j$ (written $w_j$ in the text).

Table 6.1: Example of anti-lexicographical ordering

x1    x2    ...   xp
1     1     ...   1
2     1     ...   1
...   ...         ...
d1    1     ...   1
1     2     ...   1
2     2     ...   1
...   ...         ...
d1    d2    ...   dp
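The ordering in Table 6.1 can be summarised by an explicit position formula. This is a reconstruction from the table, not a formula stated in the source: $r$ is a hypothetical name for the row position of a combination, and $d_j$ is the number of levels of $x_j$ as in Table 6.1.

```latex
r(x_1, \ldots, x_p) = 1 + \sum_{j=1}^{p} (x_j - 1) \prod_{k=1}^{j-1} d_k ,
\qquad \text{with the empty product } \prod_{k=1}^{0} d_k = 1 .

% Small check with p = 3 and (d_1, d_2, d_3) = (2, 2, 3):
r(2, 1, 3) = 1 + (2 - 1) \cdot 1 + (1 - 1) \cdot 2 + (3 - 1) \cdot 4 = 10 .
```

Under this formula the first row $(1, \ldots, 1)$ has position 1 and the last row $(d_1, \ldots, d_p)$ has position $\prod_{j=1}^{p} d_j$, which matches the layout of Table 6.1.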

We cross-classify all the variables and denote the maximum number of possible combinations of all the variables by $D$, where $D = \prod_{j=1}^{p} w_j$. We can define a variable $h = (h_1, \ldots, h_n)'$, where $h_i \in \{1, \ldots, D\}$ for each unit $i$ corresponds to the particular combination $x_{i,1}, x_{i,2}, \ldots, x_{i,p}$. Now let the frequencies for the different values of $h_i$ be denoted by

$$f_g = \sum_{i=1}^{n} I(h_i = g), \quad g = 1, \ldots, D,$$

where $I(\cdot)$ is the indicator function: $I(A) = 1$ if $A$ is true and $I(A) = 0$ otherwise. The combination value $g$ is unique if $f_g = 1$. Similarly, $f_g = 2$ means combination value $g$ appears twice in the data set, and so on. Let the frequencies of frequencies be denoted by

$$n_r = \sum_{g=1}^{D} I(f_g = r), \quad r = 1, 2, \ldots,$$

where $n_1$ represents the number of unique combinations in the data set.
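The definitions of $h$, $f_g$ and $n_r$ can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `combination_values` is hypothetical, and the index formula assumes the anti-lexicographical ordering of Table 6.1 (first variable varying fastest). The three rows used are those that appear in Table 6.2.

```python
from collections import Counter
from math import prod

def combination_values(rows, levels):
    """Map each row of 1-based category codes to h in {1, ..., D}.

    Assumes anti-lexicographical order: the first variable varies fastest.
    """
    # strides[j] = product of the numbers of levels of the variables before j
    strides = [prod(levels[:j]) for j in range(len(levels))]
    return [1 + sum((x - 1) * s for x, s in zip(row, strides)) for row in rows]

levels = [2, 2, 2, 3]                     # w_1, ..., w_4
rows = [(2, 2, 1, 2), (2, 2, 1, 2), (2, 2, 1, 1)]

h = combination_values(rows, levels)      # combination value of each unit
f = Counter(h)                            # f_g: frequency of combination value g
n = Counter(f.values())                   # n_r: number of combinations with f_g = r

print(h)      # [12, 12, 4]
print(n[1])   # 1, i.e. one unique combination
```

Only combinations that actually occur are stored in `f`, so the $D - 2$ empty cells (with $f_g = 0$) are not counted in `n`.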

Now suppose that subject $i$ is unique, i.e. $h_i = g$, $f_g = 1$, $g \in \{1, \ldots, D\}$. Instead of synthesizing the whole row of data points $(x_{i,1}, x_{i,2}, \ldots, x_{i,p})$ corresponding to this subject $i$, we only synthesize the data points that make this subject $i$ unique. Table 6.2 shows a simple illustration of this idea.

Table 6.2: Illustration of choosing one data point that needs to be synthesized

subject   x1   x2   x3   x4   Combination value (h)   Frequency (fg)
1         2    2    1    2    12                      2
2         2    2    1    2    12                      2
i         2    2    1    1    4                       1

In Table 6.2, $x_1, x_2, x_3$ are binary variables and $x_4$ is a categorical variable with 3 levels. The column $h$ gives the combination value for each unit and the column $f_g$ gives the frequency of that combination value. All the values in columns $h$ and $f_g$ are derived using the method discussed previously. Units 1 and 2 have the same set of data points $(2, 2, 1, 2)$, which corresponds to the combination value 12 ($h_1 = h_2 = 12$), and this combination appears twice, hence the frequency $f_{12} = 2$. We notice that unit $i$ is unique, i.e. $f_4 = 1$, with $g = 4$ being the combination value that corresponds to unit $i$. We now need to find the data points that make this subject $i$ unique. Since unit $i$ differs from units 1 and 2 only in $x_4$, the only data point that we need to replace with synthetic data in this example is $x_{i,4}$ (the entry $x_4 = 1$ in row $i$, circled in the original table).

Let us formalise this idea of selecting the data points that need to be replaced with synthetic data. We first write $h = h(x_1, x_2, \ldots, x_p)$ and let $h_{i \setminus \{x_j\}}$ denote the combination value of unit $i$ when the element $x_j$ is excluded from the set $\{x_1, x_2, \ldots, x_p\}$, i.e. $h_{i \setminus \{x_j\}} = h(x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p)$. The main idea behind this selection procedure is to find a data point $x_{i,j}$ such that, when we recompute $h_{i \setminus \{x_j\}}$ for units $i = 1, \ldots, n$ with a new set of cross-classified values for $x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p$ (i.e. aggregated over variable $x_j$), taking values $g' \in \{1, \ldots, D'\}$, we have $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j\}}$, where $C_s$ is some threshold value. If such a variable $x_j$ can be found, we then replace $x_{i,j}$ with a synthetic value. If we now aggregate over variable $x_4$ in the data set from Table 6.2, we have:

Table 6.3: Data set from Table 6.2 when aggregating over variable x4

subject   x1   x2   x3   Combination value (h)   Frequency (fg')
1         2    2    1    4                       3
2         2    2    1    4                       3
i         2    2    1    4                       3

From Table 6.3 we notice that after aggregating over variable $x_4$, each unit has a new combination value as well as a new frequency value $f_{g'}$. Unit $i$ is no longer unique since $f_{g'} = 3$ where $g' = 4$. Hence, instead of replacing $x_{i,1}, x_{i,2}, x_{i,3}, x_{i,4}$ with synthetic values, we only need to synthesize $x_{i,4}$. In this simple case, we set the threshold value $C_s = 1$, requiring $f_{g'} > 1$, in order to select the data points that need to be replaced with synthetic data for units which are unique, i.e. $f_g = 1$. By setting $C_s = 2$, requiring $f_{g'} > 2$, we select the data points that need to be replaced with synthetic data for units which are unique, i.e. $f_g = 1$, as well as for units whose combination value appears twice in the data set, i.e. $f_g = 2$. In general, by setting $C_s = r$, $r = 1, 2, \ldots$, we select the data points for units $i = 1, \ldots, n$ such that $f_g \le r$.
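The role of the threshold $C_s$ and the single-variable search can be sketched as follows. This is an illustrative sketch under stated assumptions, not the author's code: the helper names `units_at_risk` and `single_variable_candidates` are hypothetical, and rows are compared directly as tuples, which gives the same frequencies as counting combination values because each combination maps to exactly one value of $h$.

```python
from collections import Counter

def units_at_risk(rows, c_s):
    """Indices of units whose full combination occurs at most c_s times (f_g <= c_s)."""
    counts = Counter(rows)
    return [i for i, row in enumerate(rows) if counts[row] <= c_s]

def single_variable_candidates(rows, i, c_s):
    """0-based indices j such that, after aggregating over variable j,
    the reduced combination of unit i occurs more than c_s times."""
    p = len(rows[0])
    selected = []
    for j in range(p):
        reduced = [row[:j] + row[j + 1:] for row in rows]   # drop variable j
        if Counter(reduced)[reduced[i]] > c_s:
            selected.append(j)
    return selected

# Data of Table 6.2; unit i is stored at index 2.
rows = [(2, 2, 1, 2), (2, 2, 1, 2), (2, 2, 1, 1)]

at_risk_1 = units_at_risk(rows, c_s=1)    # only the unique unit
at_risk_2 = units_at_risk(rows, c_s=2)    # also units whose combination appears twice
candidates = single_variable_candidates(rows, i=2, c_s=1)

print(at_risk_1, at_risk_2, candidates)   # [2] [0, 1, 2] [3]
```

Index 3 corresponds to $x_4$, in agreement with Table 6.3.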

If it is not possible to find an $x_{i,j}$ such that $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j\}}$, we can consider a set of variables $\{x_j, x_{j'}\}$, where $j' \neq j$, and try to find data points $x_{i,j}, x_{i,j'}$ such that $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j, x_{j'}\}}$, where we now aggregate over pairs of variables in the data set and $C_s$ is the threshold value. If such variables $x_j$ and $x_{j'}$ can be found, we then replace $x_{i,j}, x_{i,j'}$ with synthetic values. Table 6.4 shows an example where there are 2 data points that we need to replace with synthetic data, namely $x_{i,2}$ and $x_{i,4}$.

Table 6.4: Illustration of choosing two data points that need to be synthesized

subject   x1   x2   x3   x4   Combination value (h)   Frequency (fg)
1         2    2    1    2    12                      2
2         2    2    1    2    12                      2
i         2    1    1    3    18                      1

The data set in Table 6.4 consists of 3 units and 4 variables, $x_1, x_2, x_3, x_4$, where $x_1, x_2, x_3$ are binary variables and $x_4$ is a categorical variable with 3 levels. The column $h$ gives the combination value for each unit and the column $f_g$ gives the frequency of that combination value. We notice that unit $i$ is unique, i.e. $f_{18} = 1$, with $g = 18$ being the combination value that corresponds to unit $i$. We now need to find the data points that make this subject $i$ unique. We first aggregate over variable $x_4$, which gives:

Table 6.5: Data set from Table 6.4 when aggregating over variable x4

subject   x1   x2   x3   Combination value (h)   Frequency (fg')
1         2    2    1    4                       2
2         2    2    1    4                       2
i         2    1    1    2                       1

As before, we notice that after aggregating over variable $x_4$, each unit has a new combination value as well as a new frequency value $f_{g'}$. Unit $i$ is still unique after aggregating over variable $x_4$ since $f_{g'} = 1$ where $g' = 2$. Hence, we need to consider aggregating over pairs of variables so that unit $i$ will no longer be unique afterwards. In this case, aggregating over variable $x_2$ in addition to $x_4$ results in:

Table 6.6: Data set from Table 6.4 when aggregating over variables x2 and x4

subject   x1   x3   Combination value (h)   Frequency (fg')
1         2    1    2                       3
2         2    1    2                       3
i         2    1    2                       3

Hence unit $i$ is no longer unique after we aggregate over variables $x_2$ and $x_4$, since $f_{g'} = 3$ where $g' = 2$. We have therefore found the data points that need to be replaced with synthetic values, which are $x_{i,2}$ and $x_{i,4}$. In general, if we cannot satisfy the criterion $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j, x_{j'}\}}$ by aggregating over the variables $\{x_j, x_{j'}\}$, we keep increasing the number of variables $\{x_j, x_{j'}, x_{j''}, \ldots\}$, where $j, j', j'', \ldots$ are all distinct, until the criterion $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j, x_{j'}, x_{j''}, \ldots\}}$ is satisfied.
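The escalation from single variables to pairs, triples and so on can be sketched as a search over subsets of increasing size. This is one possible reading of the procedure, not the author's implementation: the function name `smallest_subsets` is hypothetical, rows are compared as tuples in place of combination values, and returning every qualifying subset of the smallest size is an assumption made here for illustration.

```python
from collections import Counter
from itertools import combinations

def smallest_subsets(rows, i, c_s):
    """Return all variable subsets (0-based index tuples) of the smallest size
    whose removal makes unit i's reduced combination occur more than c_s times."""
    p = len(rows[0])
    for size in range(1, p + 1):                 # singles, then pairs, then triples, ...
        found = []
        for drop in combinations(range(p), size):
            keep = [j for j in range(p) if j not in drop]
            reduced = [tuple(row[j] for j in keep) for row in rows]
            if Counter(reduced)[reduced[i]] > c_s:
                found.append(drop)
        if found:
            return found
    return []                                    # fewer than c_s + 1 units in total

# Data of Table 6.4; unit i is stored at index 2.
rows = [(2, 2, 1, 2), (2, 2, 1, 2), (2, 1, 1, 3)]
result = smallest_subsets(rows, i=2, c_s=1)
print(result)   # [(1, 3)]
```

The tuple `(1, 3)` corresponds to the pair $x_2, x_4$, so no single variable suffices and the pair found in Table 6.6 is the only qualifying pair. The search is exhaustive, so its cost grows quickly with $p$; a practical implementation would likely stop early or restrict the candidate variables.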

If it is possible to find an $x_{i,j}$ such that $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j\}}$ and an $x_{i,\tilde{j}}$, where $\tilde{j} \neq j$, such that $f_{\tilde{g}} > C_s$ for $\tilde{g} = h_{i \setminus \{x_{\tilde{j}}\}}$, we select both data points $x_{i,j}$ and $x_{i,\tilde{j}}$. Table 6.7 illustrates this case, where we select both $x_{i,2}$ and $x_{i,4}$ because unit $i$ can be made no longer unique by aggregating over either $x_2$ or $x_4$.

Table 6.7: Illustration of choosing both data points that need to be synthesized

subject   x1   x2   x3   x4   Combination value (h)   Frequency (fg)
1         2    2    1    2    12                      2
2         2    2    1    2    12                      2
i         2    1    2    1    6                       1
4         2    1    2    3    22                      2
5         2    1    2    3    22                      2
6         2    2    2    1    8                       3
7         2    2    2    1    8                       3
8         2    2    2    1    8                       3

From Table 6.7, we notice that unit $i$ is unique, i.e. $f_6 = 1$, with $g = 6$ being the combination value that corresponds to unit $i$. We now need to find the data points that make this subject $i$ unique. We first aggregate over variable $x_4$; the resulting data set is tabulated in Table 6.8. Unit $i$ is no longer unique after we aggregate over variable $x_4$, since $f_{g'} = 3$ where $g' = 6$. Now suppose that, instead of aggregating over $x_4$, we aggregate over variable $x_2$; the resulting data set is shown in Table 6.9.

Hence unit $i$ is also no longer unique after we aggregate over variable $x_2$, since $f_{g'} = 4$ where $g' = 4$. This shows that it is possible to find an $x_{i,j}$ such that $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j\}}$ and an $x_{i,\tilde{j}}$, where $\tilde{j} \neq j$, such that $f_{\tilde{g}} > C_s$ for $\tilde{g} = h_{i \setminus \{x_{\tilde{j}}\}}$; in this case, we select both data points $x_{i,2}$ and $x_{i,4}$. In the general case where we have two subsets of variables $A_i, B_i \subseteq \{x_{i,1}, x_{i,2}, \ldots, x_{i,p}\}$ with $f_{g'} > C_s$ for $g' = h_{i \setminus \{A_i\}}$ and $f_{\tilde{g}} > C_s$ for $\tilde{g} = h_{i \setminus \{B_i\}}$, we select the union $A_i \cup B_i$ to synthesize.

Table 6.8: Data set from Table 6.7 when aggregating over variable x4

subject   x1   x2   x3   Combination value (h)   Frequency (fg')
1         2    2    1    4                       2
2         2    2    1    4                       2
i         2    1    2    6                       3
4         2    1    2    6                       3
5         2    1    2    6                       3
6         2    2    2    8                       3
7         2    2    2    8                       3
8         2    2    2    8                       3

Table 6.9: Data set from Table 6.7 when aggregating over variable x2

subject   x1   x3   x4   Combination value (h)   Frequency (fg')
1         2    1    2    6                       2
2         2    1    2    6                       2
i         2    2    1    4                       4
4         2    2    3    12                      2
5         2    2    3    12                      2
6         2    2    1    4                       4
7         2    2    1    4                       4
8         2    2    1    4                       4
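The union rule for the data of Table 6.7 can be checked with a short sketch. This is an illustration under stated assumptions, not the author's implementation: the function name `values_to_synthesize` is hypothetical, rows are compared as tuples in place of combination values, and taking the union only over qualifying subsets of the smallest size is an assumption made here.

```python
from collections import Counter
from itertools import combinations

def values_to_synthesize(rows, i, c_s):
    """Union of all smallest variable subsets (0-based indices) whose removal
    makes unit i's reduced combination occur more than c_s times."""
    p = len(rows[0])
    for size in range(1, p + 1):
        selected = set()
        for drop in combinations(range(p), size):
            keep = [j for j in range(p) if j not in drop]
            reduced = [tuple(row[j] for j in keep) for row in rows]
            if Counter(reduced)[reduced[i]] > c_s:
                selected |= set(drop)            # union A_i ∪ B_i ∪ ...
        if selected:
            return sorted(selected)
    return []

# Data of Table 6.7; unit i is stored at index 2.
rows = [
    (2, 2, 1, 2), (2, 2, 1, 2),
    (2, 1, 2, 1),
    (2, 1, 2, 3), (2, 1, 2, 3),
    (2, 2, 2, 1), (2, 2, 2, 1), (2, 2, 2, 1),
]
chosen = values_to_synthesize(rows, i=2, c_s=1)
print(chosen)   # [1, 3]
```

Indices 1 and 3 correspond to $x_2$ and $x_4$: dropping $x_2$ leaves unit $i$ with frequency 4 (Table 6.9) and dropping $x_4$ leaves it with frequency 3 (Table 6.8), so both $x_{i,2}$ and $x_{i,4}$ are selected.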

In document Multiple imputation for missing data and statistical disclosure control for mixed-mode data using a sequence of generalised linear models (Page 122-127)