Reusing Generalization Functions - Data Utility Comparison

Chapter 3 PPDP: k Jump Strategy for Privacy Preserving Data Publishing

3.4 Data Utility Comparison

3.4.2 Reusing Generalization Functions

With the naive strategy, whether a generalization function satisﬁes the privacy property is independent of other functions. Therefore, it is meaningless to evaluate the same function more than once. However, we now show that with thek-jump strategy, it is mean- ingful toreusea generalization function along the evaluation path. This will either increase the data utility of the original algorithm, or lead to new algorithms with incomparable data utility to enrich the the existing family of algorithms. That is, reusing generalization functions may beneﬁt the optimization of data utility.

Theorem 3.4 Given the set of generalization functions, there always exist cases in which the data utility of the algorithm with reusing generalization functions is better than that of the algorithm without reusing, and vice versa.

Proof: Consider two algorithmsa₁ anda₂ that deﬁne the functionsg₁, g₂, g₃, g₄, . . .

andg₁, g₂, g₃, g₂, g₄, . . ., respectively, whereg₂(.)andg₂()are identical. Suppose both al-

gorithms has the same jump distancek = 1, and the privacy property is not set-monotonic. We can construct the following two evaluation paths.

1. a₁(t₀) :per₁(t₀)→per₂(t₀)→ds1₂(t₀)→per₃(t₀)→per₄(t₀). . .

a₂(t₀) : per₁(t₀) → per₂(t₀) → ds1₂(t₀) → per₃(t₀) → per₂(t₀) → ds1₂(t₀) → p(ds1₂(t₀)) = true

2. a₁(t₀) : per₁(t₀) → per₂(t₀) → per₃(t₀) → per₄(t₀) → ds1₄(t₀) → p(ds1₄(t₀)) = true

a₂(t₀) : per₁(t₀) → per₂(t₀) → per₃(t₀) → per₂(t₀) → per₄(t₀) → ds1₄(t₀) → p(ds1₄(t₀)) = f alse

Clearly, the data utility of a₁ in the ﬁrst case is worse than that ofa₂, while in the second case it is better.

It is worth noting that although the same generalization function is repetitively eval- uated, its disclosure set will depend on the functions that appear before it in the evaluation path. Take the identical functionsg₂ andg₂ above as an example, the disclosure set ofg₂

is computed by excluding from its permutation set the tables which can be disclosed under

g₁; however, the disclosure set ofg₂needs to further exclude tables which can be disclosed underg₃. Therefore,ds₂ ⊆ ds₂. Generally,dsi ⊆ dsi whengi(.)is reused asgi(.)in a

later iteration. This leads to the following.

Proposition 3.1 With a set-monotonic privacy property, reusing generalization functions in ak-jump algorithm does not affect the data utility underajump(1).

Proof: Suppose gi(.) is reused as gi(.) in a later iteration of the algorithm. For

any table t, since dsi(t) ⊆ dsi(t), p(dsi(t)) = true implies p(dsi(t)) = true for any

set-monotonic privacy property p(.). Therefore, if p(dsi(t)) = true, the algorithm will

disclose under gi(.); if p(dsi(t)) = f alse then the algorithm will continue to the next

iteration. In both cases, gi(.) cannot exclude the tables from permutation set other than

gi(.)can do, therefore,gi(.)does not affect the data utility.

On the other hand, when generalization functions are reused at the end of the original sequence of functions, some tables which will lead to disclosing nothing under the original sequence of functions may have a chance to be disclosed under the reused functions, which will improve the data utility.

Proposition 3.2 Reusing a generalization function after the last iteration of an existing

k-jump algorithm may improve the data utility whenp(.)is not set-monotonic.

Proof: We construct a case in which reusing a function will improve the data utility. Consider two algorithms a₁ and a₂ that deﬁne the functions g₁, g₂, g₃ and g₁, g₂, g₃, g₂,

respectively, whereg₂(.)and g₂(.)are identical. Suppose both algorithms have the same

jump distancek = 1, and the privacy property is not set-monotonic. We need to construct the following two evaluation paths by whicha₁will disclose nothing, whilea₂will disclose usingg₂.

1. a₁(t₀) :per₁(t₀)→per₂(t₀)→ds1₂(t₀)→p(per₃(t₀)) = f alse

2. a₂(t₀) : per₁(t₀) → per₂(t₀) → ds1₂(t₀) → per₃(t₀) → per₂(t₀) → ds1₂(t0) → p(ds1₂(t0)) = true

Table 3.11 shows our construction. The table will lead to disclosing nothing without reusing g₂, whereas reusing g₂ will lead to a successful disclosure. In this example, the jump distance is1, and the privacy property is that the highest ratio of any sensitive value is no greater than 1₂.

More speciﬁcally, the given table, denoted byt₀, cannot be disclosed underg₁(.)or

g₃(.)sincep(per₁) =p(per₃) =f alse. Forg₂, we havep(per₂) =true. The tables inds₂

must be in one of the following three disjoint sets.

1. C has the sensitive valueC₃. The number of such tables is 2₁×(4₁×3₁) = 24. Denote this set byS₁.

2. Cdoes not haveC₃, and bothDandE haveC₃. There are2₁×2₁×2₁= 8such tables. Denote it byS₂.

3. C does not haveC₃, and both F andGhave C₃. Similarly, there are8 such tables. Denote this set byS₃.

We then have ds₂ = S₁ ∪ S₂ ∪S₃. The ratio of C being associated with C₃ is

24+8+8 = 0.6>0.5, sog2(t0)cannot be disclosed, either.

Now, consider the case thatg₂is reused asg₂. To calculate the disclosure set ofg₂,

the tables which can be disclosed underg₁, g₂, andg₃ must be excluded fromds₂. After

excluding the tables which can be disclosed underg₁, we have that the remaining tables in

ds₂ are the same as above, that is, S₁ ∪S₂ ∪S₃. These tables cannot be disclosed under g₂ as mentioned above. We further evaluate whether these tables can be disclosed usingg₃.

S₁can be further divided into three disjoint subsets as follows.

1. One and only one ofDandE hasC₃, so doesF andG. This subset has2₁×2₁×

₂ 1

×2₁= 16tables, and is denoted byS₁₁.

2. BothDandE haveC₃. This subset has2₁×2₁ = 4tables, and is denoted byS₁₂. 3. BothF andGhaveC₃. This subset also has4tables, and is denoted byS₁₃.

All the tables inS₁₂, S₁₃, S₂, andS₃ cannot be disclosed under g₃ since their permutation sets underg₃ do not satisfy the privacy property (the highest ratios of a sensitive value are respectively 0.6, 1.0, 0.6, and 1.0). On the other hand, the tables in S₁₁ can be disclosed underg₃. We can reason as follows. Consider each tabletinS₁₁ underg₃. Since

the tables which can be disclosed under g₁ must be excluded from ds₃(t), the remaining tables inds₃(t)must be in one of the following two disjoint sets.

1. BothAandB haveC₃. This subset has3!×2! = 12tables, and is denoted byS₁₁₁. 2. Two ofC, DandE haveC₃. This subset has(3₁×2₁)×3₁×2₁ = 36 tables,

and is denoted byS₁₁₂.

We must exclude fromds₃(t)the tables which can be disclosed usingg₂. The tables inS₁₁₁ cannot be disclosed underg₂since their permutation sets underg₂do not satisfy the privacy property. Furthermore, the tables in S₁₁₂ can be further divided into two disjoint subsets based on whetherChasC₃. The tables in the case thatChasC₃cannot be disclosed using g₂ because of the same reason as those in S₁₁₁, while the tables in the case that C

does not haveC₃ cannot be disclosed usingg₂because of the similar reason asg₂(t₀). In a word, all the tables inS₁₁₁ andS₁₁₂ cannot be disclosed usingg₂, accordingly, these tables cannot be excluded fromds₃(t). Thus,ds₃(t) = S₁₁₁ ∪S₁₁₂. The ratio ofC,D,E,F orG

being associated withC₃ inds₃(t)is 1₂ which is the highest ratio, accordingly, the tables in

S₁₁ can be disclosed underg₃.

Therefore, the disclosure set under the reused functiong₂ must exclude the tables

inS₁₁, consequently,ds₂ = S₁₂ ∪S₁₃ ∪S₂∪S₃. The ratio ofF andGbeing associated

withC₃ are0.5, which is the highest ratio. Therefore,g₂(t₀)can be safely disclosed.

QID g₁ g₂ g₃ g₂ A C₁ C₁ C₁ C₁ B C₂ C₂ C₂ C₂ C C₃ C₃ C₃ C₃ D C₄ C₄ C₄ C₄ E C₅ C₅ C₅ C₅ F C₃ C₃ C₃ C₃ G C₃ C₃ C₃ C₃

In document Preserving Privacy Against Side-Channel Leaks (Page 58-63)