In this section, we discuss how to select the data points that need to be replaced with synthetic values. We focus on categorical variables because, in most scenarios, the key variables that can lead to identification are assumed to be categorical (Drechsler and Reiter, 2008). Suppose we have categorical variables $x_1, x_2, \dots, x_p$, where $x_j = (x_{1,j}, x_{2,j}, \dots, x_{n,j})'$ for variables $j = 1, \dots, p$ and units $i = 1, \dots, n$. Each $x_{i,j}$ can take the values $1, 2, \dots, w_j$, where $w_j$ is the number of levels of variable $x_j$. We then arrange the possible combinations of the variables in anti-lexicographical order, in which the first subscript $x_1$ will vary the fastest, the second subscript $x_2$ will vary the second fastest, and so on (see Schafer (1997) for more details). An illustration of anti-lexicographical order is shown below:

Table 6.1: Example of anti-lexicographical ordering

$x_1$   $x_2$   ...   $x_p$
1       1       ...   1
2       1       ...   1
...     ...           ...
$w_1$   1       ...   1
1       2       ...   1
2       2       ...   1
...     ...           ...
$w_1$   $w_2$   ...   $w_p$
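As a minimal sketch of the ordering in Table 6.1, the following Python snippet enumerates all combinations for a given list of level counts so that the first variable varies fastest. The function name `anti_lex_order` and the example level counts are illustrative assumptions, not part of the original text.

```python
from itertools import product


def anti_lex_order(levels):
    """Return all combinations (x1, ..., xp) with x1 varying fastest.

    `levels` is a list [w1, ..., wp] of level counts (hypothetical name).
    """
    # itertools.product varies its LAST argument fastest, so the variables
    # are passed in reverse order and each tuple is flipped back afterwards.
    ranges = [range(1, w + 1) for w in reversed(levels)]
    return [tuple(reversed(combo)) for combo in product(*ranges)]


# Two binary variables and one three-level variable: 2 * 2 * 3 combinations.
combos = anti_lex_order([2, 2, 3])
for g, combo in enumerate(combos[:5], start=1):
    print(g, combo)
```

The position of a combination in the returned list (counting from 1) plays the role of the combination value used later in this section.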
We cross-classify all the variables and denote the maximum number of possible combinations of all the variables by $D$, where $D = \prod_{j=1}^{p} w_j$. We define a vector $h = (h_1, \dots, h_n)'$, where $h_i \in \{1, \dots, D\}$ is the value corresponding to the particular combination $(x_{i,1}, x_{i,2}, \dots, x_{i,p})$ observed for unit $i$. Now let the frequencies of the different values of $h_i$ be denoted by

$$f_g = \sum_{i=1}^{n} I(h_i = g), \quad g = 1, \dots, D,$$

where $I(\cdot)$ is the indicator function: $I(A) = 1$ if $A$ is true and $I(A) = 0$ otherwise. The combination value $g$ is unique if $f_g = 1$. Similarly, $f_g = 2$ means that combination value $g$ appears twice in the data set, and so on. Let the frequencies of frequencies be denoted by

$$n_r = \sum_{g=1}^{D} I(f_g = r), \quad r = 1, 2, \dots,$$

where $n_1$ represents the number of unique combinations in the data set.
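The quantities $h_i$, $f_g$ and $n_r$ can be computed directly. The sketch below assumes that the anti-lexicographical index has the closed form $h_i = 1 + \sum_{j=1}^{p} (x_{i,j} - 1) \prod_{k<j} w_k$, which is one way to realise the ordering of Table 6.1 and reproduces the combination values used in the examples of this section. The function names are hypothetical, and the small data set is taken from Table 6.2.

```python
from collections import Counter


def combination_value(row, levels):
    """1-based index h_i of one row under anti-lexicographical ordering."""
    h, stride = 1, 1
    for x, w in zip(row, levels):
        h += (x - 1) * stride  # the first variable varies fastest
        stride *= w
    return h


def frequencies(data, levels):
    """Return (h, f, n): combination values, their counts f_g, and the
    frequencies of frequencies n_r (only non-zero entries are stored)."""
    h = [combination_value(row, levels) for row in data]
    f = Counter(h)           # f[g] = number of units with h_i = g
    n = Counter(f.values())  # n[r] = number of values g with f_g = r
    return h, f, n


levels = [2, 2, 2, 3]                              # w_1, ..., w_4
data = [(2, 2, 1, 2), (2, 2, 1, 2), (2, 2, 1, 1)]  # rows of Table 6.2
h, f, n = frequencies(data, levels)
print(h, dict(f), dict(n))
```

For the three units of Table 6.2 this gives the combination values 12, 12 and 4, with $f_{12} = 2$, $f_4 = 1$ and hence $n_1 = 1$.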
Now suppose that subject $i$ is unique, i.e. $h_i = g$ and $f_g = 1$ for some $g \in \{1, \dots, D\}$. Instead of synthesizing the whole row of data points $(x_{i,1}, x_{i,2}, \dots, x_{i,p})$ corresponding to this subject, we synthesize only the data points that make subject $i$ unique. Table 6.2 shows a simple illustration of this idea.
Table 6.2: Illustration of choosing one data point that needs to be synthesized

subject   $x_1$   $x_2$   $x_3$   $x_4$   Combination value ($h$)   Frequency ($f_g$)
1         2       2       1       2       12                        2
2         2       2       1       2       12                        2
i         2       2       1       1       4                         1
Here $x_1, x_2, x_3$ are binary variables and $x_4$ is a categorical variable with 3 levels. The column $h$ gives the combination value for each unit, and the column $f_g$ gives the frequency of that combination value. All the values in columns $h$ and $f_g$ are derived using the method discussed previously. Units 1 and 2 have the same set of data points $(2, 2, 1, 2)$, which corresponds to the combination value 12 ($h_1 = h_2 = 12$); this combination appears twice, hence the frequency $f_{12} = 2$. Unit $i$ is unique, i.e. $f_4 = 1$, with $g = 4$ being the combination value corresponding to unit $i$. We now need to find the data points that make subject $i$ unique. Unit $i$ differs from units 1 and 2 only in $x_4$, so the only data point that we need to replace with synthetic data in this example is $x_{i,4}$.

Let us formalise this idea of selecting the data points that need to be replaced with synthetic data. We write $h = h(x_1, x_2, \dots, x_p)$ and let $h_{i \setminus \{x_j\}}$ denote the combination value of unit $i$ when the element $x_j$ is excluded from the set $\{x_1, x_2, \dots, x_p\}$, i.e. when it is computed from $h(x_1, x_2, \dots, x_{j-1}, x_{j+1}, \dots, x_p)$. The main idea behind the selection procedure is to find a data point $x_{i,j}$ such that, when we recompute $h_{i \setminus \{x_j\}}$ for units $i = 1, \dots, n$ from the cross-classification of $x_1, x_2, \dots, x_{j-1}, x_{j+1}, \dots, x_p$ (i.e. after aggregating over variable $x_j$), which takes values $g' \in \{1, \dots, D'\}$ with $D' = D / w_j$, we have $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j\}}$, where $C_s$ is some threshold value. If such a variable $x_j$ can be found, we replace $x_{i,j}$ with a synthetic value. If we now aggregate over variable $x_4$ in the data set from Table 6.2, we have:
Table 6.3: Data set from Table 6.2 when aggregating over variable $x_4$

subject   $x_1$   $x_2$   $x_3$   Combination value ($h$)   Frequency ($f_{g'}$)
1         2       2       1       4                         3
2         2       2       1       4                         3
i         2       2       1       4                         3
From Table 6.3 we see that, after aggregating over variable $x_4$, each unit has a new combination value as well as a new frequency value $f_{g'}$. Unit $i$ is no longer unique since $f_{g'} = 3$ for $g' = 4$. Hence, instead of replacing $x_{i,1}, x_{i,2}, x_{i,3}, x_{i,4}$ with synthetic values, we only need to synthesize $x_{i,4}$. In this simple case, we set the threshold value $C_s = 1$, so that we require $f_{g'} > 1$, in order to select the data points that need to be replaced with synthetic data for units that are unique, i.e. $f_g = 1$. By setting $C_s = 2$, so that we require $f_{g'} > 2$, we select the data points that need to be replaced for units that are unique ($f_g = 1$) as well as for units whose combination value appears twice in the data set ($f_g = 2$). In general, by setting $C_s = r$, $r = 1, 2, \dots$, we select the data points for the units $i = 1, \dots, n$ with $f_g \le r$.
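The single-variable check described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `single_variable_candidates` and the 0-based variable indices are assumptions, and the reduced rows are compared directly rather than through their combination values, which gives the same frequencies because each combination maps to exactly one value of $h$.

```python
from collections import Counter


def single_variable_candidates(data, i, cs=1):
    """Return the 0-based indices j for which aggregating over variable j
    gives unit i a frequency f_{g'} greater than the threshold cs."""
    p = len(data[i])
    found = []
    for j in range(p):
        # Aggregate over variable j: keep every column except column j.
        reduced = [row[:j] + row[j + 1:] for row in data]
        counts = Counter(reduced)
        if counts[reduced[i]] > cs:
            found.append(j)
    return found


# Rows of Table 6.2; unit i is the last row (index 2).
data = [(2, 2, 1, 2), (2, 2, 1, 2), (2, 2, 1, 1)]
print(single_variable_candidates(data, i=2, cs=1))
```

For the data of Table 6.2 the only index returned is 3, i.e. the variable $x_4$, in agreement with Table 6.3.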
If it is not possible to find an $x_{i,j}$ such that $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j\}}$, we consider a pair of variables $\{x_j, x_{j'}\}$ with $j' \neq j$ and try to find data points $x_{i,j}, x_{i,j'}$ such that $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j, x_{j'}\}}$, where we now aggregate over pairs of variables in the data set and $C_s$ is the threshold value. If such variables $x_j$ and $x_{j'}$ can be found, we replace $x_{i,j}$ and $x_{i,j'}$ with synthetic values. Table 6.4 shows an example in which two data points, $x_{i,2}$ and $x_{i,4}$, need to be replaced with synthetic data.
Table 6.4: Illustration of choosing two data points that need to be synthesized

subject   $x_1$   $x_2$   $x_3$   $x_4$   Combination value ($h$)   Frequency ($f_g$)
1         2       2       1       2       12                        2
2         2       2       1       2       12                        2
i         2       1       1       3       18                        1

Table 6.4 shows a data set consisting of 3 units and 4 variables, $x_1, x_2, x_3, x_4$, where $x_1, x_2, x_3$ are binary variables and $x_4$ is a categorical variable with 3 levels. The column $h$ gives the combination value for each unit, and the column $f_g$ gives the frequency of that combination value. Unit $i$ is unique, i.e. $f_{18} = 1$, with $g = 18$ being the combination value corresponding to unit $i$. We now need to find the data points that make subject $i$ unique. We first aggregate over variable $x_4$, which gives:
Table 6.5: Data set from Table 6.4 when aggregating over variable $x_4$

subject   $x_1$   $x_2$   $x_3$   Combination value ($h$)   Frequency ($f_{g'}$)
1         2       2       1       4                         2
2         2       2       1       4                         2
i         2       1       1       2                         1
As before, after aggregating over variable $x_4$, each unit has a new combination value as well as a new frequency value $f_{g'}$. Unit $i$ is still unique after aggregating over variable $x_4$, since $f_{g'} = 1$ for $g' = 2$. Aggregating over $x_1$, $x_2$ or $x_3$ alone also leaves unit $i$ unique. Hence, we need to consider aggregating over pairs of variables so that unit $i$ is no longer unique afterwards. In this case, aggregating over variable $x_2$ in addition to $x_4$ results in:
Table 6.6: Data set from Table 6.4 when aggregating over variables $x_2$ and $x_4$

subject   $x_1$   $x_3$   Combination value ($h$)   Frequency ($f_{g'}$)
1         2       1       2                         3
2         2       1       2                         3
i         2       1       2                         3
Hence unit $i$ is no longer unique after we aggregate over variables $x_2$ and $x_4$, since $f_{g'} = 3$ for $g' = 2$. We have therefore found the data points that need to be replaced with synthetic values, namely $x_{i,2}$ and $x_{i,4}$. In general, if we cannot satisfy the criterion $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j, x_{j'}\}}$ by aggregating over a pair of variables $\{x_j, x_{j'}\}$, we keep increasing the number of variables $\{x_j, x_{j'}, x_{j''}, \dots\}$, where $j, j', j'', \dots$ are all distinct, until the criterion $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j, x_{j'}, x_{j''}, \dots\}}$ is satisfied.
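The idea of enlarging the set of aggregated variables until the criterion holds can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function name `smallest_qualifying_subsets` is hypothetical, variable indices are 0-based, and the search simply tries all subsets of size 1, then size 2, and so on, which is only practical for a small number of variables.

```python
from collections import Counter
from itertools import combinations


def smallest_qualifying_subsets(data, i, cs=1):
    """Return all variable subsets of the smallest size whose removal
    gives unit i a frequency f_{g'} greater than cs (empty list if none)."""
    p = len(data[i])
    for size in range(1, p + 1):
        hits = []
        for subset in combinations(range(p), size):
            # Aggregate over the variables in `subset`: keep the others.
            keep = [k for k in range(p) if k not in subset]
            reduced = [tuple(row[k] for k in keep) for row in data]
            if Counter(reduced)[reduced[i]] > cs:
                hits.append(subset)
        if hits:
            return hits  # stop at the first size that works
    return []


# Rows of Table 6.4; unit i is the last row (index 2).
data = [(2, 2, 1, 2), (2, 2, 1, 2), (2, 1, 1, 3)]
print(smallest_qualifying_subsets(data, i=2, cs=1))
```

For the data of Table 6.4 no single variable qualifies, and the only qualifying pair has indices 1 and 3, i.e. $\{x_2, x_4\}$, as in Table 6.6.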
If it is possible to find an $x_{i,j}$ such that $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j\}}$ and also an $x_{i,\tilde{j}}$ with $\tilde{j} \neq j$ such that $f_{\tilde{g}} > C_s$ for $\tilde{g} = h_{i \setminus \{x_{\tilde{j}}\}}$, we select both data points $x_{i,j}$ and $x_{i,\tilde{j}}$. Table 6.7 illustrates this case: we select both $x_{i,2}$ and $x_{i,4}$, given that unit $i$ is no longer unique after aggregating over either $x_2$ or $x_4$.
Table 6.7: Illustration of choosing both data points that need to be synthesized

subject   $x_1$   $x_2$   $x_3$   $x_4$   Combination value ($h$)   Frequency ($f_g$)
1         2       2       1       2       12                        2
2         2       2       1       2       12                        2
i         2       1       2       1       6                         1
4         2       1       2       3       22                        2
5         2       1       2       3       22                        2
6         2       2       2       1       8                         3
7         2       2       2       1       8                         3
8         2       2       2       1       8                         3

From Table 6.7, unit $i$ is unique, i.e. $f_6 = 1$, with $g = 6$ being the combination value corresponding to unit $i$. We now need to find the data points that make subject $i$ unique. We first aggregate over variable $x_4$; the resulting data set is tabulated in Table 6.8. Unit $i$ is no longer unique after we aggregate over variable $x_4$, since $f_{g'} = 3$ for $g' = 6$. Now suppose that, instead of aggregating over $x_4$, we aggregate over variable $x_2$; the resulting data set is shown in Table 6.9.
Unit $i$ is again no longer unique after we aggregate over variable $x_2$, since $f_{g'} = 4$ for $g' = 4$. This shows that it is possible to find an $x_{i,j}$ such that $f_{g'} > C_s$ for $g' = h_{i \setminus \{x_j\}}$ and an $x_{i,\tilde{j}}$ with $\tilde{j} \neq j$ such that $f_{\tilde{g}} > C_s$ for $\tilde{g} = h_{i \setminus \{x_{\tilde{j}}\}}$. In this case, we select both data points $x_{i,2}$ and $x_{i,4}$. In general, if we have two subsets of variables $A_i, B_i \subseteq \{x_{i,1}, x_{i,2}, \dots, x_{i,p}\}$ such that $f_{g'} > C_s$ for $g' = h_{i \setminus A_i}$ and $f_{\tilde{g}} > C_s$ for $\tilde{g} = h_{i \setminus B_i}$, then we select the union $A_i \cup B_i$ to synthesize.
Table 6.8: Data set from Table 6.7 when aggregating over variable $x_4$

subject   $x_1$   $x_2$   $x_3$   Combination value ($h$)   Frequency ($f_{g'}$)
1         2       2       1       4                         2
2         2       2       1       4                         2
i         2       1       2       6                         3
4         2       1       2       6                         3
5         2       1       2       6                         3
6         2       2       2       8                         3
7         2       2       2       8                         3
8         2       2       2       8                         3
Table 6.9: Data set from Table 6.7 when aggregating over variable $x_2$

subject   $x_1$   $x_3$   $x_4$   Combination value ($h$)   Frequency ($f_{g'}$)
1         2       1       2       6                         2
2         2       1       2       6                         2
i         2       2       1       4                         4
4         2       2       3       12                        2
5         2       2       3       12                        2
6         2       2       1       4                         4
7         2       2       1       4                         4
8         2       2       1       4                         4
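Putting the pieces together, the selection rule, including the union of alternative subsets, might be applied to every at-risk unit as in the sketch below. Several choices here are assumptions made for illustration and are not stated in the text: the function name `cells_to_synthesize`, the 0-based indices, the brute-force search, and in particular the decision to take the union only over qualifying subsets of the smallest size that works.

```python
from collections import Counter
from itertools import combinations


def cells_to_synthesize(data, cs=1):
    """Map each at-risk unit (f_g <= cs) to the set of 0-based variable
    indices whose values would be replaced with synthetic data."""
    p = len(data[0])
    full_counts = Counter(data)  # f_g for the full cross-classification
    selected = {}
    for i, row in enumerate(data):
        if full_counts[row] > cs:
            continue  # unit i is not at risk under threshold cs
        for size in range(1, p + 1):
            union = set()
            for subset in combinations(range(p), size):
                keep = [k for k in range(p) if k not in subset]
                reduced = [tuple(r[k] for k in keep) for r in data]
                if Counter(reduced)[reduced[i]] > cs:
                    # Assumption: union over all qualifying subsets of
                    # this (smallest) size, e.g. A_i and B_i in the text.
                    union.update(subset)
            if union:
                selected[i] = union
                break
        # If no subset qualifies (fewer than cs + 1 units in total),
        # unit i is simply left out of the result.
    return selected


# Rows of Table 6.7; unit i is the third row (index 2).
data = [
    (2, 2, 1, 2), (2, 2, 1, 2), (2, 1, 2, 1), (2, 1, 2, 3),
    (2, 1, 2, 3), (2, 2, 2, 1), (2, 2, 2, 1), (2, 2, 2, 1),
]
print(cells_to_synthesize(data, cs=1))
```

For the data of Table 6.7 with $C_s = 1$, only unit $i$ is at risk, and the selected indices are 1 and 3, i.e. $x_{i,2}$ and $x_{i,4}$, matching Tables 6.8 and 6.9.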