Chapter 3 PPDP: k Jump Strategy for Privacy Preserving Data Publishing
3.4 Data Utility Comparison
3.4.2 Reusing Generalization Functions
With the naive strategy, whether a generalization function satisfies the privacy property is independent of the other functions, so it is meaningless to evaluate the same function more than once. However, we now show that with the k-jump strategy, it is meaningful to reuse a generalization function along the evaluation path. Doing so will either increase the data utility of the original algorithm or lead to new algorithms with incomparable data utility, which enriches the existing family of algorithms. That is, reusing generalization functions may benefit the optimization of data utility.
Theorem 3.4 Given the set of generalization functions, there always exist cases in which the data utility of the algorithm with reusing generalization functions is better than that of the algorithm without reusing, and vice versa.
Proof: Consider two algorithms a1 and a2 that define the functions g1, g2, g3, g4, . . . and g1, g2, g3, g2′, g4, . . ., respectively, where g2(.) and g2′(.) are identical. Suppose both algorithms have the same jump distance k = 1, and the privacy property is not set-monotonic. We can construct the following two evaluation paths.
1. a1(t0): per1(t0) → per2(t0) → ds^1_2(t0) → per3(t0) → per4(t0) . . .
   a2(t0): per1(t0) → per2(t0) → ds^1_2(t0) → per3(t0) → per2′(t0) → ds^1_{2′}(t0) → p(ds^1_{2′}(t0)) = true
2. a1(t0): per1(t0) → per2(t0) → per3(t0) → per4(t0) → ds^1_4(t0) → p(ds^1_4(t0)) = true
   a2(t0): per1(t0) → per2(t0) → per3(t0) → per2′(t0) → per4(t0) → ds^1_4(t0) → p(ds^1_4(t0)) = false
Clearly, the data utility of a1 in the first case is worse than that of a2, while in the second case it is better.
It is worth noting that although the same generalization function is evaluated repeatedly, its disclosure set depends on the functions that appear before it in the evaluation path. Take the identical functions g2 and g2′ above as an example. The disclosure set of g2 is computed by excluding from its permutation set the tables which can be disclosed under g1; however, the disclosure set of g2′ needs to further exclude the tables which can be disclosed under g3. Therefore, ds2′ ⊆ ds2. Generally, dsi′ ⊆ dsi when gi(.) is reused as gi′(.) in a later iteration. This leads to the following.
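The containment can be illustrated with plain set operations. The following is only a toy sketch: the integer table identifiers and the sets of tables assumed to be disclosable under g1 and g3 are hypothetical, and it follows the simplified description above, in which a disclosure set is the permutation set minus the tables disclosable under functions evaluated earlier.

```python
# Toy model: tables are represented by hypothetical integer identifiers.
per2 = set(range(10))        # assumed permutation set of t0 under g2
disclosable_g1 = {0, 1}      # tables assumed disclosable under g1
disclosable_g3 = {1, 4, 7}   # tables assumed disclosable under g3

# First use of g2: only g1 precedes it on the evaluation path.
ds2 = per2 - disclosable_g1

# Reuse of g2 after g3: g3 now also precedes it, so more tables are excluded.
ds2_reused = per2 - disclosable_g1 - disclosable_g3

print(sorted(ds2))
print(sorted(ds2_reused))
print(ds2_reused <= ds2)     # subset test
```

Whatever the excluded sets are, removing additional tables can only shrink the result, which is why the reused function's disclosure set is always contained in the original one.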
Proposition 3.1 With a set-monotonic privacy property, reusing generalization functions in a k-jump algorithm does not affect the data utility under a_jump(1).
Proof: Suppose gi(.) is reused as gi′(.) in a later iteration of the algorithm. For any table t, since dsi′(t) ⊆ dsi(t), p(dsi′(t)) = true implies p(dsi(t)) = true for any set-monotonic privacy property p(.). Therefore, if p(dsi(t)) = true, the algorithm will disclose under gi(.); if p(dsi(t)) = false, then p(dsi′(t)) = false as well, and the algorithm will continue to the next iteration. In both cases, gi′(.) cannot exclude any tables from the permutation set beyond those that gi(.) already excludes; therefore, gi′(.) does not affect the data utility.
On the other hand, when generalization functions are reused at the end of the original sequence of functions, some tables that would lead to disclosing nothing under the original sequence may have a chance to be disclosed under the reused functions, which will improve the data utility.
Proposition 3.2 Reusing a generalization function after the last iteration of an existing k-jump algorithm may improve the data utility when p(.) is not set-monotonic.
Proof: We construct a case in which reusing a function improves the data utility. Consider two algorithms a1 and a2 that define the functions g1, g2, g3 and g1, g2, g3, g2′, respectively, where g2(.) and g2′(.) are identical. Suppose both algorithms have the same jump distance k = 1, and the privacy property is not set-monotonic. We need to construct the following two evaluation paths, by which a1 will disclose nothing while a2 will disclose using g2′.
1. a1(t0): per1(t0) → per2(t0) → ds^1_2(t0) → p(per3(t0)) = false
2. a2(t0): per1(t0) → per2(t0) → ds^1_2(t0) → per3(t0) → per2′(t0) → ds^1_{2′}(t0) → p(ds^1_{2′}(t0)) = true
Table 3.11 shows our construction. The table will lead to disclosing nothing without reusing g2, whereas reusing g2 will lead to a successful disclosure. In this example, the jump distance is 1, and the privacy property is that the highest ratio of any sensitive value is no greater than 1/2.
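One possible reading of this privacy property can be sketched in code. The sketch assumes that the ratio is taken, over a set of candidate tables, as the fraction of tables in which a given individual is paired with a given sensitive value; the function names and the two-individual example are hypothetical and not part of the construction.

```python
from collections import Counter
from fractions import Fraction

def highest_ratio(tables):
    """Largest fraction of tables in which one individual is paired
    with one particular sensitive value (tables must be non-empty)."""
    counts = Counter()
    for table in tables:
        for individual, value in table.items():
            counts[(individual, value)] += 1
    return Fraction(max(counts.values()), len(tables))

def satisfies_privacy(tables, threshold=Fraction(1, 2)):
    """Assumed form of p(.): the highest ratio must not exceed the threshold."""
    return highest_ratio(tables) <= threshold

# Hypothetical set of two candidate tables over individuals X and Y.
tables = [
    {"X": "C3", "Y": "C1"},
    {"X": "C1", "Y": "C3"},
]
print(highest_ratio(tables))      # 1/2
print(satisfies_privacy(tables))  # True
```

Under this reading, a set of tables in which some individual carries the same sensitive value in more than half of the tables fails the property.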
More specifically, the given table, denoted by t0, cannot be disclosed under g1(.) or g3(.) since p(per1) = p(per3) = false. For g2, we have p(per2) = true. The tables in ds2 must be in one of the following three disjoint sets.
1. C has the sensitive value C3. The number of such tables is $\binom{2}{1} \times (\binom{4}{1} \times \binom{3}{1}) = 24$. Denote this set by S1.
2. C does not have C3, and both D and E have C3. There are $\binom{2}{1} \times \binom{2}{1} \times \binom{2}{1} = 8$ such tables. Denote this set by S2.
3. C does not have C3, and both F and G have C3. Similarly, there are 8 such tables. Denote this set by S3.
We then have ds2 = S1 ∪ S2 ∪ S3. The ratio of C being associated with C3 is 24/(24+8+8) = 0.6 > 0.5, so g2(t0) cannot be disclosed, either.
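The counting above can be checked mechanically. The short sketch below only re-evaluates the binomial products and the resulting ratio stated in the text; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

s1 = comb(2, 1) * (comb(4, 1) * comb(3, 1))  # tables in which C has C3
s2 = comb(2, 1) ** 3                         # C lacks C3; D and E have it
s3 = comb(2, 1) ** 3                         # C lacks C3; F and G have it

# C is paired with C3 exactly in the tables of S1.
ratio_c = Fraction(s1, s1 + s2 + s3)

print(s1, s2, s3)                 # 24 8 8
print(ratio_c)                    # 3/5
print(ratio_c > Fraction(1, 2))   # True
```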
Now, consider the case that g2 is reused as g2′. To calculate the disclosure set of g2′, the tables which can be disclosed under g1, g2, and g3 must be excluded from ds2′. After excluding the tables which can be disclosed under g1, the remaining tables in ds2′ are the same as above, that is, S1 ∪ S2 ∪ S3. These tables cannot be disclosed under g2, as mentioned above. We further evaluate whether these tables can be disclosed using g3.
S1 can be further divided into three disjoint subsets as follows.
1. Exactly one of D and E has C3, and exactly one of F and G has C3. This subset has $\binom{2}{1} \times \binom{2}{1} \times \binom{2}{1} \times \binom{2}{1} = 16$ tables, and is denoted by S11.
2. Both D and E have C3. This subset has $\binom{2}{1} \times \binom{2}{1} = 4$ tables, and is denoted by S12.
3. Both F and G have C3. This subset also has 4 tables, and is denoted by S13.
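As a consistency check, the three subset sizes should add up to the size of S1 computed earlier. The sketch below merely re-evaluates the stated products; the variable names are illustrative.

```python
from math import comb

s11 = comb(2, 1) ** 4                        # one of D, E and one of F, G has C3
s12 = comb(2, 1) ** 2                        # both D and E have C3
s13 = comb(2, 1) ** 2                        # both F and G have C3
s1 = comb(2, 1) * (comb(4, 1) * comb(3, 1))  # size of S1 as counted earlier

print(s11, s12, s13)            # 16 4 4
print(s11 + s12 + s13 == s1)    # True
```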
All the tables in S12, S13, S2, and S3 cannot be disclosed under g3 since their permutation sets under g3 do not satisfy the privacy property (the highest ratios of a sensitive value are respectively 0.6, 1.0, 0.6, and 1.0). On the other hand, the tables in S11 can be disclosed under g3. We can reason as follows. Consider each table t in S11 under g3. Since the tables which can be disclosed under g1 must be excluded from ds3(t), the remaining tables in ds3(t) must be in one of the following two disjoint sets.
1. Both A and B have C3. This subset has 3! × 2! = 12 tables, and is denoted by S111.
2. Two of C, D and E have C3. This subset has $(\binom{3}{1} \times \binom{2}{1}) \times \binom{3}{1} \times \binom{2}{1} = 36$ tables, and is denoted by S112.
We must exclude from ds3(t) the tables which can be disclosed using g2. The tables in S111 cannot be disclosed under g2 since their permutation sets under g2 do not satisfy the privacy property. Furthermore, the tables in S112 can be divided into two disjoint subsets based on whether C has C3. The tables in which C has C3 cannot be disclosed using g2 for the same reason as those in S111, while the tables in which C does not have C3 cannot be disclosed using g2 for a reason similar to that for g2(t0). In short, none of the tables in S111 and S112 can be disclosed using g2; accordingly, these tables cannot be excluded from ds3(t). Thus, ds3(t) = S111 ∪ S112. The ratio of C, D, E, F or G being associated with C3 in ds3(t) is 1/2, which is the highest ratio; accordingly, the tables in S11 can be disclosed under g3.
Therefore, the disclosure set under the reused function g2′ must exclude the tables in S11; consequently, ds2′ = S12 ∪ S13 ∪ S2 ∪ S3. The ratios of F and G being associated with C3 are both 0.5, which is the highest ratio. Therefore, g2′(t0) can be safely disclosed.
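The effect of removing S11 can be checked with the subset sizes given above. In the sketch below, the ratio for C follows directly from the definitions (C has C3 only in the subsets of S1). The ratio for F additionally assumes, consistently with the stated value of 0.5, that F is not paired with C3 in any table of S2; this assumption is not derived here.

```python
from fractions import Fraction

# Sizes of the sets that remain in the reused function's disclosure set.
sizes = {"S12": 4, "S13": 4, "S2": 8, "S3": 8}
total = sum(sizes.values())

# C has C3 only in S12 and S13 (the remaining parts of S1).
ratio_c = Fraction(sizes["S12"] + sizes["S13"], total)

# F has C3 in S13 and S3 (assumption: in no table of S2).
ratio_f = Fraction(sizes["S13"] + sizes["S3"], total)

print(total, ratio_c, ratio_f)   # 24 1/3 1/2
```

Removing the 16 tables of S11 lowers the ratio for C from 0.6 to 1/3, which is why the reused function can pass a privacy check that its first evaluation failed.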
QID   g1   g2   g3   g2′
A     C1   C1   C1   C1
B     C2   C2   C2   C2
C     C3   C3   C3   C3
D     C4   C4   C4   C4
E     C5   C5   C5   C5
F     C3   C3   C3   C3
G     C3   C3   C3   C3