System Inputs - Problem Formulation - Secure Protocols for Privacy-preserving Data Outsourcing,

4.3 Problem Formulation

4.3.2 System Inputs

In this section we give a formal deﬁnition of the input components, namely,diﬀerentially private data

and user count queries. Without loss of generality, we assume that the input data is anonymized using anε-differential privacy model [Dwo06], although our approach supports other privacy mod- els that produce contingency-like tables based on generalization and suppression. We choose ε- differential privacy because it provides a strong privacy guarantee while being insensitive to any specific record. We first describe how to generateε-differentially private records from a relational

Any_Job Professional Artist [18-65) [18-45) [45-65) Age Job [18-99) [18-40) [40-99) Salary Any_Country USA Canada Country Canada Engineer 30K USA Lawyer 45K USA Lawyer 75K Canada Dancer 60K USA Writer 45K Canada Writer 75K 30 55 45 18 30 45 USA Engineer 30 45K Canada Dancer 65 45K

Country Job Age Salary

3 2 4 5 6 7 1 8 PID Coronary Artery Valve Repair Plastic Coronary Artery Urology Valve Repair Plastic Valve Repair Surgery

Figure 2: A raw data tableD and its taxonomy trees.

data, then we explain how to transform the data using taxonomy trees, and ﬁnally we deﬁne the types of count queries the user can submit.

Diﬀerentially Private Data

In this section we review how a data provider can generateε-differentially private records. We utilize the differentially private anonymization algorithm(DiffGen)[MCFY11] to maximize the data utility for classification analysis. Suppose a data provider owns an integrated patient-specific data table D={AI_{, A}pr_{, A}cls_}_{, where}_AI _{is an}_{explicit identifier} _{attribute such as}_SSN _or_Name_{for explicitly}

identifying individuals that will not be used for generating theε-diﬀerentially private data;Acls_{is a}

class attribute that contains the class value; andApr_{is a set of}_k_{predictor attributes whose values are}

used to predict the class attributeAcls_{. We require the class attribute}_Acls_{to be categorical, whereas}

the predictor attributes inApr _{are required to be either categorical or numerical. Furthermore, we}

assume that for each predictor attributeAi∈Apra taxonomy treeTAi is used in order to specify the

hierarchy among the domain values ofAi. Figure2 shows a raw data tableDwith four attributes,

namely,Country,Job,Age, andSalary and the taxonomy tree for each attribute.

The data provider’s objective is to generate an anonymized version ˆD={Aˆpr_{, N Count}_} _{of the}

data table D, where ˆApr _{is the set of} _k_{generalized predictor attributes, and} _{N Count}_{is the noisy}

count of each record in ˆD. The objective of the data miner is to build a classiﬁer to accurately predict the class attributeAcls_{by submitting count queries on generalized predictor attributes ˆ}_Apr_.

׫Cuti= {Any_Country, Any_Job, [18-65), [18-99]}

׫Cuti= {Any_Country, Professional,Artist, [18-65), [18-99]}

Any_Count. Artist [45-65) [18-99) Any_Count. Artist [18-45) [18-99)

Any_Count. Prof. [45-65) [18-99) Any_Count. Prof. [18-45) [18-99)

Any_Count. Professional [18-65) [18-99) Any_Count. Artist [18-65) [18-99) Any_Count. Any_Job [18-65) [18-99)

Country Job Age Salary

Any_Job {Professional, Artist}

[18-65) {[18-45), [45-65)}

Figure 3: AlgorithmDiffGenfor generatingε-differentially private data with noisy counts. Table 3: Differentially-private data table ˆD

Country _Jobˆ _Ageˆ _Salaryˆ _NCount

Any Country Professional [18-45) [18-99) 4 Any Country Professional [45-65) [18-99) 2 Any Country Artist [18-45) [18-99) 1 Any Country Artist [45-65) [18-99) 5

Example 1 Consider the raw data set in Figure 2. Initially, the algorithm creates one root partition containing all the records that are generalized to Any Country, Any Job, [18-65), [18-99)

. ∪Cut(TA_i_{) includes} _{_{Any Country, Any Job, [18-65), [18-99)}_}_{. Let the ﬁrst specialization be} Any Job→ {Professional,Artist}. TheDiﬀGenalgorithm creates two new partitions under the root, as shown in Figure3, and splits data records between them. ∪Cut(TA_i_{) is updated to}_{{Any Country,} Professional, Artist, [18-65), [18-99)}. Suppose that the next specialization is [18-65)→ {[18-40), [40-65)}, which creates further specialized partitions. Finally, the algorithm outputs the equivalence

groups of each leaf partition with their noisy counts as shown in Table3.

Input Data Transformation

We simplify the representation of theε-diﬀerentially private records ˆD={Aˆpr_{, N Count}}_{by mapping}

the values of each attribute to their integer identiﬁers from the corresponding attribute’s taxonomy tree.

Numerical Attributes. The domain of each numerical attribute ˆAi∈Aˆpr, consists of a set of

ranges that are pair-wise disjoint and can be represented as a continuous and ordered sequence of ranges. We define an order-preserving identification functionIDop_{that assigns an integer identifier}

Any_Job (id=1) Professional (id=2)

Lawyer (id=5)

Engineer (id=4) Dancer (id=6) Singer (id=7) Writer (id=8) Civil (id=10)

Software (id=9) Electrical (id=11)

Solution Cut

Job

Artist (id=3)

Figure 4: Taxonomy treeTJ ob _{for attribute}_Job_.

Table 4: Transformed data table ˆD

Country _Jobˆ _Ageˆ _Salaryˆ _NCount

1 2 1 1 4

1 2 2 1 2

1 3 1 1 1

1 3 2 1 5

to each range r = [rmin_{, r}max_{] such that for any two ranges} _r

j and rl, if rmaxj < rlmin, then

IDop_(r

j) < IDop(rl). For example, if the domain of the generalized attribute Ageˆ is Ω( ˆAge) =

[18,45),[45,65), then IDop_([18,_{45)) = 1 and}_IDop_([45,_{65)) = 2.}

Categorical Attributes. The domain of each categorical attribute ˆAi ∈ Aˆpr consists of the

set of values Cut(TA_i_{). We deﬁne a taxonomy tree identiﬁcation function} _IDt _{such that for any}

two nodes vi, vj : vj = vi, if vi is a parent of vj, then IDt(vi)< IDt(vj). If vi is the root node,

thenIDt_(v

i) = 1. Figure4 illustrates the taxonomy treeTJ obfor attributeJob, where each node is

assigned an identiﬁcation value.

Having deﬁned the mapping functions IDop _and _IDt_{, we now transform the} _{ε-diﬀerentially}

private records ˆD by mapping the values in the domain of each attribute to their identifiers. That is, for each numerical attribute Âi ∈ Aˆpr, we map each range r ∈ Ω( Âi) to its corresponding

identiﬁcation valueIDop_{(r). Similarly, for each categorical attribute ˆ}_A

i∈Aˆpr, we map each value in

v∈Ω( Âi) to its identification value from the taxonomy treeIDt(v). Table4shows the differentially

User Count Queries

The goal of the data miners is to build a classiﬁer based on the noisy count of a query over the generalized attributes ˆApr_{. Therefore, they submit count queries to be processed on the}_{ε-diﬀerentially}

private data ˆD and expect to receive a noisy count as a result to each submitted query. We denote byuser count query any data mining’s count query, and it is formally deﬁned as follows:

Deﬁnition 4 (User Count Query.) A user count query u over ˆDis a conjunction of predicates P1∧...∧ Pm where each predicateP = ( ˆAiOp si) expresses a single criterion such that ˆAi∈Aˆpr,

Op is a comparison operator, and si is an operand. If ˆAi is a categorical attribute, then Op

corresponds to the equality operator “ = ” and si is a value from the taxonomy tree TAi. If ˆAi

is a numerical attribute, then si is a numerical range [smini , smaxi ] such that if smini =smaxi then

Op∈ {>,≥, <,≤,=}; otherwise,Opis the equal operator (=).

In general, a user count queryucan be either exact, specific, orgeneric depending on whether it corresponds to an exact record (equivalence class), or whether it partially intersects with one or more records in theε-differentially private data ˆD. Note that bothspecific and generic queries correspond torange queries in the literature. The following is a formal definition of each type of a user count query.

Deﬁnition 5 (Exact User Count Query.) A user count query uisexact if for each predicate

P = ( ˆAiOp si)∈u,si∈Ω( ˆAi).

Definition 6 (Specific User Count Query.) A user count queryuisspecificif for each predicate P = ( ˆAiOp si)∈u:

1. If ˆAi is categorical, thensi∈Ω( ˆAi).

2. If Âi is numerical, thensi ∈Ω( Âi) or there exists exactly one ranger∈Ω( Âi) wheresi∩r=φ

andsi=r.

Deﬁnition 7 (Generic User Count Query.) A user count queryuisgenericif for each predicate P = ( ˆAiOp si)∈u:

1. If ˆAi is categorical, thensi∈TAi.

2. If ˆAi is numerical, then∃rj, rl∈Ω( ˆAi) such thatsi∩rj =φ,si∩rl=φ, andrj =rl.

Example 2 The following are examples of user count queries over the ε-diﬀerentially private data ˆ

D presented in Table3:

Exact: u1= ( ˆJ ob= “Artist”)∧( ˆAge= [45−65))

Speciﬁc: u2= ( ˆJ ob= “Artist”)∧( ˆAge= [50−57))

Generic: u3= ( ˆJ ob= “Lawyer”)∧( ˆAge= [30−70))

Observe that the queries conform neither to the structure nor to the data in ˆD. That is, attributes ˆ

Country andSalaryˆ are missing, the value “Lawyer” is not in the domain Ω( ˆJ ob), and the range [30,70] spans beyond the values covered by all ranges in Ω( ˆAge). All these issues will be addressed in section4.4.3when the data miner submits her user count query for preprocessing.

In document Secure Protocols for Privacy-preserving Data Outsourcing, Integration, and Auditing (Page 54-59)