Statistical Disclosure Control using Post Randomisation: Variants and Measures for Disclosure Risk

(1)

Statistical Disclosure Control using Post Randomisation: Variants

and Measures for Disclosure Risk

Ardo van den Hout, Elsayed A. H. Elamir

Abstract

This paper discusses the post randomisation method (PRAM) as a method for disclosure control. PRAM protects the privacy of respondent by misclassifying specific variables before data are released to researchers outside the statistical agency. Two variants of the initial idea of PRAM are discussed concerning the information about the misclassification that is given along with the released data. The first variant concerns calibration probabilities and the second variant concerns misclassification proportions. The paper shows that the distinction between the univariate case and multivariate case is important. Additionally, the paper discusses two measures for disclosure risk when PRAM is applied.

(2)

Statistical Disclosure Control Using Post

Randomisation: Variants and Measures for Disclosure

Risk

Ardo van den Hout

Department of Methodology and Statistics Faculty of Social Sciences, Utrecht University Heidelberglaan 1, 3584 CS Utrecht, the Netherlands tel. +31 30 2539237, E-mail: [email protected]

Elsayed A.H. Elamir

Southampton Statistical Sciences Research Institute University of Southampton

Southampton SO17 1BJ, United Kingdom

tel: +44 23 8059 4298, E-mail: [email protected]

Abstract

This paper discusses the post randomisation method (PRAM) as a method

for disclosure control. PRAM protects the privacy of respondent by

misclas-sifying specific variables before data are released to researchers outside the

statistical agency. Two variants of the initial idea of PRAM are discussed

concerning the information about the misclassification that is given along with the released data. The first variant concerns calibration probabilities and the

second variant concerns misclassification proportions. The paper shows that

(3)

impor-tant. Additionally, the paper discusses two measures for disclosure risk when

PRAM is applied.

Keywords: calibration; information loss; misclassification; PRAM.

1 Introduction

The post randomisation method (PRAM) is discussed in Gouweleeuw, Kooiman,

Willenborg, and De Wolf (1998) as a method for statistical disclosure control

(SDC). When survey data are released by statistical agencies, SDC protects the

identity of respondents. SDC tries to prevent that a user of the released data can

link the data of a respondent in the survey to a specific person in the

popula-tion. See Willenborg and De Waal (2001) for an introduction into SDC and SDC

methods other than PRAM.

There is a close link between PRAM and randomised response, a method to ask

sensitive questions in a survey, see Warner (1965) and Rosenberg (1979). Van den

Hout and Van der Heijden (2002) sum up some differences and similarities between

randomised response and PRAM.

When SDC is used, there will always be a loss of information. This is inevitable

since SDC tries to determine the information in the data that can lead to the

disclosure of an identity of a respondent, and eliminates this information before

data are released. It is not difficult to prevent disclosure, but it is difficult to prevent

disclosureand release data that is still useful for statistical analysis. Applying SDC

means searching for a balance between disclosure risk and information loss.

The idea of PRAM is to misclassify some of the categorical variables in the

sur-vey using fixed misclassification probabilities and to release the partly misclassified

data together with those probabilities. Say variableX, with categories {1, ..., J},

(4)

together with conditional probabilitiesIP(X∗ ₌_k_|_X ₌ _j_{) for} _{k, j} _{∈ {}₁_{, ..., J}_}_{. In}

this way PRAM introduces uncertainty in the data: the user of the released data

cannot be sure that the information is original or perturbed due to PRAM and

it becomes harder to establish a correct link between a respondent in the survey

and a specific person in the population. Since the user has the misclassification

probabilities he can adjust his analysis by taking into account the perturbation due

to PRAM.

This paper discusses two ideas to make PRAM more efficient with respect to

the balance between disclosure risk and information loss. First, the paper discusses

the use ofcalibration probabilities

IP(true category isj|categoryiis released). (1)

in the analysis of released data and compares this with using misclassification

probabilities

IP(categoryiis released|true category isj). (2)

The idea of using calibration probabilities is discussed by De Wolf, Gouweleeuw,

Kooiman, and Willenborg (1997), who refer to the discussion of calibration

prob-abilities in misclassification literature, see, e.g., Kuha and Skinner (1997). We will

elaborate the discussion and show that the advantage of calibration probabilities is

limited to the univariate case. Secondly, the paper shows that information loss can

be reduced by providingmisclassification proportions along with the released data.

These proportions inform about the actual change in the survey data due to the

ap-plication of PRAM. (Probabilities (1) and (2) inform about the expected change.)

Additionally, the paper discusses two measures for disclosure risk when PRAM is

applied. The first is a general measure presented in Elamir and Skinner (2003)

(5)

sec-ond measure links up with the SDC practice at Statistic Netherlands. Simulation

results are given to illustrate the theory.

The outline of the paper is as follows. Section 2 provides the framework and

the notation. Section 3 describes frequency estimation for PRAM data. Section

4 discusses the use of calibration probabilities. In Section 5 we introduce the use

of misclassification proportions. Section 6 discusses measures for disclosure risk,

whereas information loss is briefly considered in Section 7. Section 8 presents some

simulations, and Section 9 concludes.

2 Framework and notation

In survey data we distinguish between identifying variables and non-identifying

variables. Identifying variables are variables that can be used to re-identify

indi-viduals represented in the data. These variables are assumed to be categorical,

e.g., Gender, Race, Place of Residence. We assume that the sensitive information

of respondents is contained in the non-identifying variables, see Bethlehem, Keller,

and Pannekoek (1990), and that we want to protect this information by applying

PRAM to (a subset of) the identifying variables.

The notation in this paper is the same as in Skinner and Elliot (2002). Units

are selected from a finite populationU and each selected unit has one record in the

microdata sample s⊂U. Let ndenote the number of units ins. Let the

categor-ical variable formed by cross-classifying (a subset of) the identifying variables be

denoted X with values in {1, ..., J}. LetXi denote the value of X for unit i∈U.

Thepopulation frequencies are denoted

Fj =

X

i∈U

I(Xi =j), j∈ {1, ..., J},

(6)

Thesample frequencies are denoted

fj =

X

i∈s

I(Xi=j), j∈ {1, ..., J}.

In the framework of PRAM, we call the sample that is released by the statistical

agency thereleased microdata sample s∗_{. Note that unit}_i_∈_s∗ _{if and only if} _i_∈_s_.

Let X∗ _{denote the released version of} _X _in _s∗_{. By} _{misclassification} _{of unit} _i _we

meanXi 6=Xi∗. Thereleased sample frequencies are denoted

f_k∗ = X

i∈s∗

I(X_i∗=k), k∈ {1, ..., J}.

LetPX denote the J×J transition matrix that contains the conditional

misclas-sification probabilities pkj = IP(X∗ = k|X = j), for k, j ∈ {1, ..., J}. Note that

the columns ofPX sum up to one. The distribution of X∗ conditional on sis the

J-component finite mixture given by

IP(X_i∗=k|i∈s) =

J

X

j=1

IP(X_i∗=k|Xi=j)IP(Xi =j|i∈s), k∈ {1, ..., J},

where the component distributions are given by PX and the component weights

are given by the conditional distribution ofX. The conditional distribution of X

in samples is multinomial with

IP(Xi =j|i∈s) = 1_nfj, j∈ {1, ..., J}.

3 Frequency estimation for PRAM data

When PRAM is applied and some of the identifying variables are misclassified,

standard statistical models do not apply to the released data since these models do

not take into account the perturbation. This section shows how the misclassification

(7)

We have IE[F∗|f] = PXf, where f = (f1, ..., fJ)t and F∗ = (F1∗, ..., FJ∗)t is

the stochastic vector of the released sample frequencies. An unbiased moment

estimator off is given by

ˆ

f =P_X−1f∗, (3)

see Kooiman, Willenborg, and Gouweleeuw (1997). In practice, assuming thatPX

is non-singular does not impose much restriction on the choice of the

misclassifi-cation probabilities. Matrix P_X−1 exists when the diagonal of P_X dominates, i.e., pii>1/2 fori∈ {1, ..., J}. An additional assumption is that the dimensions of f

and f∗ are the same.

PRAM is applied to each variable independently and a transition matrix is

re-leased per variable. When the user of the rere-leased sample assesses a compounded

variable, he can construct its transition matrix using the transition matrices of

the individual variables. For instance, consider identifying variablesX1, with

cat-egories {1, .., J1} and X2, with categories {1, .., J2}, and the cross-classification

X = (X1, X2), i.e., the Cartesian product ofX1 and X2. Since PRAM is applied

independently, we have

IP

µ

X∗ = (k1, k2)|X= (j1, j2)

¶

= IP(X₁∗=k1|X1 =j1)

× IP(X₂∗=k2|X1 =j2), (4)

for k₁, j₁ ∈ {1, .., J₁} and k₂, j₂ ∈ {1, .., J₂}. In matrix notation, we have P_X = P_X₁⊗P_X₂, where⊗is the Kronecker product. Note that when one of two variables is not perturbed by PRAM, the transition matrix of that variable is the identity

matrix.

The variance of (3) equals

V[fˆ|f] =P_X−1V[F∗|f](P_X−1)t=P_X−1

µ_XJ

j=1

fjVj

¶

(8)

where Vj is the J ×J covariance matrix of two released values given the original

valuej, i.e.,

Vj(k1, k2) =

    

p_k₂_j(1−p_k₂_j) if k₁ =k₂

−pk1jpk2j if k1 6=k2

fork1, k2 ∈ {1, ..., J},

see Kooiman et al. (1997). The variance can be estimated by substituting ˆfj for

f_j in (5), forj∈ {1, .., J}.

The variance given by (5) is the extra variance due to PRAM and does not

take into account the sampling distribution. The formulas for the latter are given

in Chaudhuri and Mukerjee (1988) for multinomial sampling and compared to (5)

in Van den Hout and Van der Heijden (2002), see also Appendix B.

4 Calibration probabilities

Literature concerning misclassification shows that calibration probabilities (1) are

more efficient in the analysis of misclassified data than misclassification

probabil-ities (2), see the review paper by Kuha and Skinner (1997). Often, calibration

probabilities have to be estimated. However, when PRAM is applied, the

statisti-cal agency can compute the statisti-calibration probabilities using the sample frequencies.

The idea of using calibration probabilities for PRAM is mentioned in De Wolf et

al. (1997). The following elaborates this idea and makes a comparison with PRAM

as explained in the previous section.

The J ×J matrix with calibration probabilities of univariate variable X is

denoted by←P−X and has entries ←−pjk defined by

IP(Xi=j|Xi∗ =k, i∈s) =

pkjfj

P_J

j0=1pkj0fj0

, j, k∈ {1, ..., J}, (6)

(9)

column sums up to one. We have

f =←−PXIE[F∗|f], (7)

see Appendix A. An unbiased moment estimator off is therefore given by

˜

f =←P−Xf∗. (8)

In general,←P−X 6=PX−1, see Appendix A. The variance of (8) is given by (5) where

P_X−1 is replaced by ←P−_X andf_j is estimated by ˜f_j, forj∈ {1, ..., J}.

In the remainder of this section we compare estimators (3) and (8). The first

difference is that (3) might yield an estimate where some of the entries are negative,

whereas (8) will never yield negative estimates, see, e.g., De Wolfet al. (1997).

Secondly, estimator (8) is more efficient than (3) in the univariate case. This

is already discussed in Kuha and Skinner (1997). Consider the case whereX has

two categories. Say we want to knowπ=IP(X= 1). Let ˆπ be the estimate using

PX and ˜π the estimate using ←P−X. The efficiency of ˆπ relative to ˜π is given by

eff(ˆπ,π˜) = V[˜p]

V[ˆp] = (p11+p22−1)

2₍←−_p

22− ←−p21)2<1. (9)

So ˜π is always more efficient than ˆπ. An important difference with the general

situation of misclassification is that in the situation of PRAM, matrices PX and

←−

PX are given and do not have to be estimated. Comparison (9) is therefore a

simple form of the comparison in Kuha and Skinner (1997, Section 28.5.1.3.).

The third comparison is between the maximum likelihood properties of (3)

and (8). Assume thatX1, ...Xn are independently multinomially distributed with

parameter vectorπ = (π1, .., πJ)t. In the framework of misclassification, Hochberg

(1977) proves that estimator (8) yields a MLE. When (3) yields an estimate in the

interior of the parameter space, the estimate is also a MLE. See Appendix B for the

(10)

corresponding to (8) is different from the likelihood function corresponding to (3),

since the information used is different. This explains why both can be MLE despite

being different estimators off.

The fourth comparison is with respect to transition matrices of Cartesian

prod-ucts and is less favourable for (8). It has already been noted that PX1 ⊗PX2

is the matrix with misclassification probabilities for the Cartesian product X =

(X1, X2), see (4). Analogously, given ←P−X1 and

←−

PX2 the user can construct

ma-trix ←P−_X₁ ⊗←P−_X₂. However, this matrix does not necessarily contain calibration probabilities for X. Note that

IP³Xi = (j1, j2)|Xi∗ = (k1, k2), i∈s

´

= pk1j1pk2j2IP

³

Xi = (j1, j2)|i∈s

´ P_J₁

v

P_J₂

w pk1vpk2wIP ³

X1 = (v, w)|i∈s

´, (10)

It follows that ←P−X =←P−X1 ⊗

←−

PX2 when X1 and X2 are independent. In general,

this independence is not guaranteed and since the user of the released data does

not have the frequencies ofX, he cannot construct←P−_X.

The fifth and last comparison is with respect to the creation of subgroups.

Consider the situation where a user of the released data creates a subgroup by

using a grouping variable that is not part ofX. When the number of categories in

the subgroup is smaller thanJ, estimate (8) is not well-defined. When the number

of categories is equal to J, estimate (8) is biased due to the fact that (7) does not

hold. Note with respect to (7) that the frequencies that are used to construct←P−X

are the frequencies in the whole sample which will differ from the frequencies in the

subgroup, see also Appendix A. The estimator (3) is still valid for the subgroup.

Since calibration probabilities contain information about the distribution of

the samples, they perform better than misclassification probabilities regarding the

(11)

Table 1: Classification by X∗ _and _X

X

X∗ 1 2 total 1 300 200 500 2 100 400 500 total 400 600 1000

Section 8 presents some simulation results.

5 Misclassification proportions

MatricesPX and ←P−X inform about the expected change due to PRAM. As an

al-ternative we can create transition matrices that inform about the actual change due

an application of PRAM. These matrices contain proportions and will be denoted

P◦

X and

←_P−◦

X. Matrix PX◦ contains misclassification proportions and

←_P−◦

X contains

calibration proportions. This section shows how P◦

X and

←_P−◦

X are computed and

discusses properties of these matrices.

We start with an example. Say that X has categories {1,2}. Assume that

applying PRAM yields the cross-classification in Table 1. From this table it follows

that the proportion of records withX= 1 that haveX∗_{= 1 in the released sample}

is 300/400=3/4 and that the proportion of records withX∗ _{= 1 that have} _X _{= 1}

in the original sample is 300/500=3/5. Analogously we get the other entries of

P_X◦ =

Ã

3/4 1/3 1/4 2/3

!

and ←P−_X◦ =

Ã

3/5 1/5 2/5 4/5

!

.

For the general construction ofP◦

X and

←−

P_X◦, let the cell frequencies in the cross-classificationX∗ _by_X _{be denoted}_c

kj, fork, j∈ {1, .., J}. The entries of theJ×J

transition matrices with the proportions are given by

pe_kj = ckj

fj and

←−_pe

jk =

ckj

f∗

k

(12)

wherek, j ∈ {1, .., J}.

It follows that f∗ = P◦

Xf and f =

←−

P_X◦f∗. This is the reason to consider the matrices with the proportions more closely, since it is a great improvement

compared to (3) and (8). Note that when the user of the released sample hasP◦

X

or←P−_X◦, he can reconstruct Table 1.

Conditional on f, P_X◦ and ←P−_X◦ are stochastic, whereas PX and ←P−X are not.

In expectation P_X◦ equals PX, and PX◦1 ⊗P ◦

X equals PX1 ⊗PX2, see appendix C.

However, since f∗ _{is a value of the stochastic vector}_F∗_{, and} _IP₍_F∗

k = 0)6= 0, the

expectation of ←P−_X◦ does not exists. Nevertheless, an approximation shows that

←−

P_X◦ will be close to←P−X, see appendix C.

There is a set back with respect to the use of proportions for Cartesian products

and this is comparable to the problem mentioned in the previous section. Given

P◦

X1 and P ◦

X2 the user can construct P ◦

X1 ⊗P ◦

X2 for X = (X1, X2). However,

P◦

X1⊗P ◦

X2 does not contain proportions as defined above. Note that the user does

not have the cross-classification ofX and X∗, so he cannot derive the proportions

inP_X◦. The same holds for←P−_X◦. The optimal use of misclassification proportions is thereby limited to the univariate case.

Since misclassification proportions contain information about the actual

per-turbation due to PRAM, we expect them to perform well also in the multivariate

case. Section 8 discusses a multivariate example.

6 Disclosure risk

There are several ways to measure disclosure risk, see, e.g., Skinner and Elliot

(2002), and Domingo-Ferrer and Torra (2001). This section discusses two measures

for disclosure risk with respect to PRAM. Section 6.1 discusses an extension of the

general measure of disclosure risk introduced by Skinner and Elliot (2002). Section

(13)

Statistics Netherlands.

6.1 The measure θ

The following describes how the general measure for disclosure risk introduced in

Skinner and Elliot (2002) can be extended to the situation where PRAM is applied

before data are released by the statistical agency. When a disclosure control method

such as PRAM has been applied, a measure for disclosure risk is needed to quantify

the protection that is offered by the control method. Scenarios that may lead to a

disclosure of the identity of a respondent of are about persons that aim at disclosure

and that may have data that overlap the released data. A common scenario is that

a person has a sample from another source and tries to identify respondents in the

released sample by matching records. Using the extension of the measure in Skinner

and Elliot (2002) we can investigate how applying PRAM reduces the disclosure

risk.

Under simple random sampling Skinner and Elliot (2002) introduced the

mea-sure of disclomea-sure riskθ=IP(correct match|unique match) as

θ= X

j

I(fj = 1)

, X

j

Fj I(fj = 1),

where the summations are over j = 1, ..., J. The measure θ is the proportion of

correct matches among those population units which match a sample unique. The

measure is sample dependent and a distribution-free prediction is given by

b

θ=πn1

,µ

πn1+ 2(1−π)n2

¶

,

whereπ is the sampling fraction,n1=

P

jI(fj = 1) is the number of uniques and

n2 =

P

jI(fj = 2) is the number of twins in the sample, see Skinner and Elliot

(14)

misclassifi-cation occurs. The extension is given by

θmm =

X

i∈s

I(fXi = 1, Xi∗=Xi)

Á X

j

Fj I(fj = 1)

and its distribution-free prediction is given by

b

θmm =π

X

j

I(fj = 1)Pjj

Áµ

πn1+ 2 (1−π)n2

¶

,

where Pjj is the diagonal entry (j, j) of the transition matrix P which describes

the misclassification, see Elamir and Skinner(2003)

Section 8 presents some simulation results with respect to the measureθbefore

applying PRAM, andθmm after applying PRAM.

6.2 Spontaneous recognition

Statistics Netherlands releases data in several ways. One way is the releasing of

detailed survey data under contract, i.e., data are released to bona fide research

institutes that sign an agreement in which they promise not to look for disclosure

explicitly, e.g, by matching the data to other data files. In this situation, SDC

only concerns the protection against what is called spontaneous recognition. This

section introduces a measure for disclosure risk for PRAM data that is specific to

the control for spontaneous recognition.

Controlling for spontaneous recognition means that one should prevent that

certain records attract attention. A record may attract attention when a low

dimensional combination of its values has a low frequency. Also, without

cross-classifying, a record may attract attention when one of its values is recognized as

being very rare in the population. Combinations of values with low frequencies in

the sample are calledunsafe combinations.

Statistics Netherlands uses the rule of thumb that a recognition of a combination

(15)

only combinations of three variables are assessed with respect to disclosure control

for spontaneous recognition.

Note that applying PRAM causes two kinds of modifications in the sample that

make disclosure more difficult. First, it is possible that unsafe combinations in the

sample change into apparently safe ones in the released sample, and, secondly, it is

possible that safe combinations in the sample change into apparently unsafe ones.

Since misclassification probabilities are not that large (in order to keep analysis of

the released sample possible) and the frequency of unsafe combinations is typically

low, the effect of the first modification is negligible in expectation. The second

modification is more likely to protect an unsafe combination j when there are a

lot of combinations k,k 6=j, which are misclassified intoj. This is the reason to

focus, for a given recordiwith the unsafe combination of scoresj, on the calibration

probability

µ=IP(Xi =j|Xi∗ =j, i∈s).

When there are hardly any k, k6= j, misclassified into j, this probability will be

large, and, as a consequence, the record is unsafe. Note that combinations with

frequency equal to zero are never unsafe.

Measureµis a simplification since it ignores possible correlation betweenXand

other variables in the sample. Note that X will be a Cartesian product and that

the statistical agency can compute the calibration probabilityµusing (6) since the

agency has the frequencies of X.

7 Information loss

Since we stressed in the introduction that SDC means searching for a balance

be-tween disclosure risk and information loss, this section indicates ways to investigate

(16)

First, the transition matrix PX gives an idea of the loss of information. The

more this matrix resembles the identity matrix, the less information gets lost. In

general, this requires a definition of a distance between two matrices. However, we

can apply PRAM using matrices that are parameterized by one parameter, denoted

pd. The idea is as follows. Each time PRAM is applied, the diagonal probabilities

in the transition matrices are fixed and equal topdfor all selected variables. In the

columns, the probability mass 1−pd is equally divided over the entries that are

not diagonal entries. In this situation 1−p_d is a measure for the deviation from the identity matrix.

Although transition matrices give an idea of the information loss, it is hard to

have an intuition about how a certain deviation from the identity matrix affects

analysis of the released data. A second way to investigate information loss is the

comparison of extra variances due to PRAM with respect to frequency estimation.

The idea here is that when this extra variance is already substantial, more complex

analyses of the released sample is probably not possible. The variance with respect

to frequency estimation can be estimated using (5).

The next section will assess information loss due to PRAM using these two

approaches.

8 Simulation results

The objective of this section is to illustrate the theory in the foregoing sections and

to investigate disclosure risk and information loss for different choices of

misclas-sification parameters. The population is chosen to consist of units with complete

records in the British Household Survey 1996-1999. We have N = 16710 and we

distinguish 5 identifying variables with respect to the household owner: Sex (S),

Marital Status (M), Economic status (D), Socio-Economic Group (E), and Age

(17)

consider simple random sampling without replacement from the population where

the sample fractionπ is equal to 0.05, 0.10 or 0.15. The three samples are denoted

s1,s2 and s3 and have sample sizes 836, 1671 and 2506 respectively.

The transition matrices used to apply PRAM to the selected variables are

mostly of a simple form and determined by one parameter pd, as described in

Section 7. A more sophisticated construction of the transition matrices can reduce

the disclosure risk further. An example of this fine-tuning will be given.

8.1 Disclosure risk and the measure θ

The following discusses disclosure risk by comparing the measureθbefore PRAM is

applied with the measureθmm after PRAM has been applied, see Section 6.1. The

identifying variables are described byX= (S, M, D, E, A) withJ = 7840 possible

categories.

Since the population is known, we can compute the measures and using the

samples we can compute their predictions. Table 2 presents the simulation results

using simple random sampling without replacement using different sampling

frac-tionsπ and different choices of pd. Given a choice ofπ and pd, drawing the sample

and applying PRAM is 100 times simulated. The means of the computed and

pre-dicted measures are reported in Table 2. Note that θ and θbreflect the risk before applying PRAM andθmm andθbmm reflect the risk after applying PRAM.

It is clear from Table 2 that applying PRAM will reduce the risk, for example,

when p_d= 0.80 and π = 0.10, applying PRAM reduces the risk from θ= 0.166 to θ_mm = 0.055. Whenp_ddecreases, disclosure risk decreases too, as one might expect. Note that disclosure risk increases when sample size increases. In a larger sample, a

record with a unique combination of scores is more likely to be a population unique

(18)

Table 2: Simulation results of disclosure risk measures for X = (S, M, D, E, A)

before and after applying PRAM with p_d.

p_d π θ θb θmm θbmm

0.05 0.084 0.087 0.061 0.065 0.95 0.10 0.151 0.147 0.112 0.110 0.15 0.217 0.215 0.165 0.166 0.05 0.087 0.086 0.047 0.051 0.90 0.10 0.148 0.157 0.083 0.087 0.15 0.213 0.216 0.123 0.127 0.05 0.087 0.090 0.028 0.031 0.80 0.10 0.166 0.151 0.055 0.054 0.15 0.213 0.221 0.065 0.064

8.2 Disclosure risk and spontaneous recognition

This following illustrates the measureµ for disclosure risk for spontaneous

recog-nition that is discussed in Section 6.2.

Spontaneous recognition is defined for combinations up to three identifying

variables, see Section 6.1. So there are 10 groups to consider. We will discuss

only one of them, namely the group defined by X = (M, D, E). The number of

categories ofX is 490. The measure for disclosure risk is given by

µ=IP

µ

(M, D, E) = (m, d, e)

¯ ¯ ¯

¯(M∗, D∗, E∗) = (m, d, e) ¶

,

for those combinations of values (m, d, e) that have frequency 1 in sample s₁, s₂ or s3. Note that when PRAM is not applied, µ= 1. Table 3 shows results with

respect to the maximum ofµwhen PRAM is applied toM,DandE. With respect

to X = (M, D, E) the number of unique combinations in s1, s2 or s3 are 48, 44,

and 53 respectively.

We draw two conclusions from the results. First, the results illustrate that the

(19)

Table 3: Maximum of µ for values of (M, D, E) with frequency 1 when applying PRAM toM, Dand E.

p_d

0.95 0.90 0.85 0.80 0.70 0.60 sample

s1 with n= 836 0.94 0.86 0.76 0.65 0.43 0.24

s2 with n= 1671 0.94 0.85 0.74 0.61 0.36 0.18

s3 with n= 2506 0.93 0.83 0.69 0.55 0.31 0.15

size of the sample is important. In order to protect an unsafe combinationj, it is

necessary that there are a lot of combinations that can change intojdue to PRAM.

Note that this is the other way around compared to the measure θwhere a larger

sample size causes a higher disclosure risk. This difference shows that different

concepts of disclosure induce different methods for disclosure control.

The following introduces a method to fine-tune a transition matrix and shows

that this can help to diminish the disclosure risk. The idea is to adjust one or

more columns in the transition matrix of each variable that is part of an unsafe

combination. ConsiderPX1 where variableX1 hasJ1 categories. The column that

is chosen first corresponds to the category of X1 with the highest frequency in

samples, say columnj. Let furthermore kbe the number that corresponds to the

category of X1 with the lowest frequency ins. The columns of PX1 that are not

columnjare constructed as explained in Section 7: p_don the diagonal and (1−p_d) equally divided over the other entries. Columnj is fine-tuned by

plj =

            

p_d if l=j

(1−pd)/η if l=k ,

(1−pd)/(η(J1−2)) if l6=j, k

(11)

for l ∈ {1, .., J1} and η > 1. The idea here is that when we choose η close to

(20)

Table 4: Maximum ofµfor values of(M, D, E)with frequency 1 in samples3 when

using fine-tuning for all three variables.

p_d

0.95 0.90 0.85 0.80 0.70

no fine-tuning 0.93 0.83 0.69 0.55 0.31 fine-tuning 1 column where η= 1.001 0.92 0.80 0.66 0.51 0.28 fine-tuning 2 columns where η= 1.001 0.91 0.74 0.57 0.42 0.20 fine-tuning 3 columns where η= 1.001 0.90 0.72 0.52 0.34 0.15

change into the category with the lowest frequency. Assuming a link between an

unsafe combination and a low frequency in the original sample, this idea explicitly

supports the concept of PRAM: an unsafe combinationcis after PRAM protected

by creating new combinations c from combinations that have high frequencies in

the original sample.

In the same way other columns in PX1 can be fine-tuned. For example, the

second column chosen is the column that corresponds to the category ofX1 with

the second highest frequency in samples, and the chosen row is now the row that

corresponds to the category ofX₁ with the second lowest frequency in sample s. Table 4 presents results for samples₃when the transition matrices ofM,D, and Eare fine-tuned. The advantage of fine-tuning the transition matrices is dependent

of the data and on the size of the sample. One can see that the idea works, e.g., if

pd= 0.80, fine-tuning can decrease the maximum ofµfrom 0.55 to 0.34.

Even after using fine-tuning, the maximum of µis still quite large. Additional

simulations, not reported, show that the maximum of µ decreases rapidly when

sample size is increased. Conclusion and advice: determine a largest tolerated µ

and check all combinations of three identifying variables and use fine-tuning. The

(21)

Table 5: Frequencies before PRAM and standard errors of estimated frequencies after PRAM of the variableM in samples₃.

p_d

0.95 0.90 0.85

f standard error offˆ

using PM ←P−M PM ←P−M PM ←P−M

99 5.27 4.08 7.87 4.77 10.23 4.87 378 6.33 5.62 9.40 7.35 12.13 8.30 353 6.24 5.52 9.27 7.21 11.97 8.11 471 6.65 5.95 9.85 7.83 12.69 8.90 525 6.83 6.12 10.11 8.08 13.01 9.21 551 6.78 6.08 10.04 8.02 12.93 9.13 169 5.55 4.64 8.28 5.77 10.74 6.20

8.3 Information loss in frequency estimation

To investigate information loss due to PRAM, this section discusses an example

with univariate frequency estimation with respect to the variableM and bivariate

frequency estimation with respect to the variables S and E in sample s₃. We illustrate the difference between usingPX and←P−X by comparing standard errors in

estimating the univariate frequencies of variableM. In the following we assume that

the released sample frequencies ofM to be equal to the expected released sample

frequencies. That is, released sample frequenciesf∗are given byIE(F∗|f) =PMf,

where F∗ and f are defined with respect to M. It follows that in this situation

ˆ

f = f, so that (5) can be used to compare variances. To estimate the standard

errors when using calibration, we use←P−M in (5) instead of PM−1. Table 5 presents

standard errors of estimated frequencies for different choices of p_d. The example shows that←P−_M is more efficient thanP_M, a difference that becomes more striking whenp_dis smaller.

In the bivariate situation calibration probabilities and proportions do not always

(22)

0 100 200 300 400 0 100 200 300 400 (b)

Frequencies in the sample

Estimates using calibration probabilities ₀

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 100 200 300 400

0 100 200 300 400 (a)

Frequencies in the sample

Estimates using misclassification probabilities

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Figure 1: Estimating frequencies ofX= (S, E)after applying PRAM toS andE in samples3 withpd=0.85 in 10 simulations. (a)Using misclassification probabilities.

(b)Using calibration probabilities.

variable X = (S, E) that has 14 categories. The χ2-test of independence between S and E yields 529.55, where df = 6 and the p-value < 0.00. It is this lack of

independence between the variables that causes calibration probabilities to perform

badly. PRAM was applied 10 times to bothSandEwithp_d= 0.85. Figure 1 shows the estimation of the frequencies ofX usingX∗ _and _P

S⊗PE versus using X∗ and

←_P−

S⊗←P−E. From the figure it is clear that the misclassification probabilities perform

better, i.e., the points (f_j,fˆ_j) are closer to the identity line then the points (f_j,f˜_j), j ∈ {1, ...,14}. The variance is less when ←P−_S⊗←P−_E is used, but the figure shows that in that case estimates are biased. Violating the independence assumption

(23)

Table 6: Actual coverage percentage w.r.t. X = (S, E) for sample s3 and 1000

simulated perturbed samples, wherep_d= 0.85.

ACP using ACP using

category PS⊗PE PS◦⊗PE◦ category PS⊗PE PS◦⊗PE◦

(1,1) 94.8 98.2 (2,1) 94.5 97.8 (1,2) 95.3 98.1 (2,2) 95.6 97.7 (1,3) 95.7 97.7 (2,3) 95.0 97.9 (1,4) 95.4 98.0 (2,4) 95.7 98.8 (1,5) 95.5 99.1 (2,5) 93.5 97.3 (1,6) 95.1 97.4 (2,6) 93.8 97.5 (1,7) 96.0 98.3 (2,7) 96.0 98.7

Misclassification proportions are close to misclassification probabilities in the

above example. Compare for instance

PS =

Ã

0.85 0.15 0.15 0.15

!

and P_S◦ =

Ã

0.854 0.154 0.156 0.156

!

.

A simulation study can be used to investigate the performance of

misclassifica-tion probabilities versus misclassificamisclassifica-tion propormisclassifica-tions. The study compares using

PS⊗PE versus usingPS◦⊗PE◦ by looking at the actual coverage percentage (ACP),

which is the percentage of the replicated perturbed samples for which the confidence

interval of the estimated frequency covers the actual frequency in the original

sam-ple. We used samples₃,p_d= 0.85 and 1000 simulated perturbed samples. Table 6 shows that misclassification proportions perform better than then misclassification

probabilities. The mean value of ACP when usingP_S⊗P_E equals 95.14, and the mean value of ACP when using P◦

S ⊗PE◦ equals 98.04. (A Paired t-Test yields a

p-value <0.00.) So, although the transition matrices are quite alike at first sight,

(24)

9 Conclusion

The paper shows that the analysis of PRAM data is more efficient when

misclassi-fication proportions are used instead of misclassimisclassi-fication probabilities. Calibration

probabilities and calibration proportions work fine in the univariate case, but cause

serious bias in the multivariate case. Since in most situations the user of PRAM

data will be interested in multivariate analysis, it seems wise not to release

calibra-tion probabilities or calibracalibra-tion proporcalibra-tions along with the PRAM data. The two

measures for disclosure risk that are used in this paper show that PRAM helps in

protecting the identity of respondents.

Given that releasing misclassification proportions makes PRAM more efficient

with respect to information loss, it is still an open question how this works out

when PRAM is compared to other SDC methods, see Domingo-Ferrer and Torra

(2001). It might be worthwhile to state that PRAM was never meant to replace

existing SDC methods. Working with PRAM data and taking into account the

information about the misclassification in the analysis might be quite a burden

for some researchers. However, when researchers are interested in specific details

in data, details that might disappear when, e.g., global recoding is used, PRAM

can be a solution. Note that PRAM is statistically sound. Data are perturbed,

but information about the perturbation can be used. Although estimates will have

extra variance due to the perturbation, they will be unbiased.

Since the misclassification proportions provide more information about the

orig-inal sample than the misclassification probabilities, one should consider the

ques-tion whether providing these proporques-tions increases the disclosure risk. Since the

privacy protection that is offered by PRAM is at the record level, we do not think

that disclosure risk increases when misclassification proportions are released. With

(25)

but these frequencies are not sensitive information. Note also that when one works

with the measures for disclosure risk discussed in Section 6, the risk does not change

when misclassification proportions are released.

Appendix A

The following shows that f = ←P−XIE[F∗|f]. First note that IE[F∗|f] = PXf

and that entries ←−p_jk of ←P−X are defined as ←−pjk = (pkjfj)(

P_J

j0=1pkj0fj0)−1 for

k, j∈ {1, ..., J}. For eachj∈ {1, ..., J} we have

µ

←−

PXIE[F∗|f]

¶

(j) =

J

X

k=1

←−_p

jk

µ

IE[F∗|f]

¶

(k)

=

J

X

k=1

←−_p

jk

µ _XJ

j0=1

pkj0fj0 ¶

=

J

X

k=1

pkjfj =fj,

since the columns ofPX sum up to one. So f =←P−XPXf and f is an eigenvector

of←P−XPX with eigenvalue 1.

In general, ←P−_X 6= P_X−1. To illustrate this, letR = ←P−_XP_X. The entries of R are r_ij = PJ_k₌₁←−p_ikp_kj, for i, j ∈ {1, ..., J}. Assume that the entries of P_X are all > 0 and that f_j > 0, for j ∈ {1, ..., J}. Then ←−p_jk > 0, for j, k ∈ {1, ..., J}

and rij > 0, for i, j ∈ {1, ..., J}. In this case, R is not the identity matrix and

consequently←P−X 6=P_X−1. A more intuitive explanation is that←P−X changes when

the survey data change, whereas PX can be determined independently from the

data and hence does not necessarily change when the data change. Therefore, it is

always possible to cause←P−X 6=PX−1 by changing the data.

Appendix B

The following derives the maximum likelihood properties of (3) and (8). The

(26)

situation calibration probabilities do not have to be estimated. Also, we show that

the reasoning applies both to (3) and to (8).

Assume that X1, ...Xn are independently multinomially distributed with

pa-rameter vector π = (π1, .., πJ)t, where πj >0 for j ∈ {1, .., J}, and PJj=1πj = 1.

Consider the transformationπ∗ ₌_P_π_{, where} _P _{is a} _J_×_J _{transition matrix, i.e.,}

columns sum up to one andpkj ≥0 for k, j∈ {1, .., J}. Assume that P is

nonsin-gular. Let the distribution ofX∗ be given by IP(X∗ =k) =π∗_k, for k∈ {1, .., J}. It follows thatX∗

1, X2∗, ..., Xn∗ are multinomially distributed with parametersnand

π∗_{. Indeed,}_π∗

k =pk1π1+...+pkJπJ >0 for k∈ {1, .., J}and J

X

k=1

π_k∗ =

µ_XJ

l=1

pl1

¶

π1+...+

µ_XJ

l=1

plJ

¶

πJ = 1.

The likelihood L∗ _for _π∗ _{and observed} _x∗ _{= (}_x∗

1, x∗2, ..., x∗n)t is well known. Let

f∗ = (f∗

1, f2∗, ..., fJ∗)t denote the observed cell frequencies. The MLE is given by

ˆ

π∗ =f∗/nand has covariance matrix Ω = [Diag(π∗₎₋_π∗₍_π∗₎t_]_/n_{, where Diag(}_π∗₎

is the diagonal matrix with the diagonal entries given by the elements ofπ∗_.

Next we can use the invariance property of maximum likelihood. Define the

transformation g(π∗) = P−1π∗. Since g is one-to-one, it follows from L∗(π∗|x∗) and π = g(π∗_{) that the likelihood for} _π _{is given by} _L∗₍_g−1₍_π₎_|_x∗_{) which is}

maximized for πˆ = g(πˆ∗) = P−1_π_ˆ∗_{. Consequently, when} _π_ˆ _∈ ₍₀_,₁₎J_{, it is the}

MLE. Since g has a first order derivative, the covariance matrix of πˆ can be

ob-tained using the delta-method, see, e.g., Agresti (1990, Chapter 12). We have

∂g(π∗₎_/∂_π∗_{= (}_P−1₎t_{and the covariance matrix of} _π_ˆ _{is given by} _n−1_P−1_Ω(_P−1₎t_.

So, with respect to (3) maximum likelihood properties are proven by takingP =

PX and obtaining πˆ =g(πˆ∗) =PX−1πˆ∗. With respect to (8), the misclassification

design is described by π = ←P−Xπ∗, so P = ←P−

−1

X and the MLE is given by π˜ =

(27)

Appendix C

LetP◦

kj denote the stochastic variable of thekj-th entry ofPX◦ andCkj the

stochas-tic variable of thekj-th cell in the cross-classificationX∗ _by_X_{. It follows that}_C

kj

has a binomial distribution with parametersfj andpkj. Consequently,IE[P_kj◦|f] =

IE[Ckj/fj|f] =fjpkj/fj =pkj and in expectation PX◦ equals PX. Since Ck1j1 and

Ck2j2 are independent givenf, it follows that IE[Pk◦1j1P ◦

k2j2|f] =pk1j1pk2j2. So, in

expectation,P◦

X1 ⊗P ◦

X equals PX1 ⊗PX2.

We define ←P−_jk◦ = Ckj/(Fk∗ +ε) where ε is a small positive value. Using the

delta method, see, e.g., Rice (1995, Section 4.6), we obtain

IE[←P−_jk◦|f]≈ IE[Ckj|f]

IE[F∗

k|f]

+ 1 IE[F∗

k|f]2

µ

V[F_k∗|f]IE[Ckj|f] IE[F∗

k|f]

−ρ

q

V[Ckj|f]V[F_k∗|f]

¶

,

where IE[C_kj|f] = fjpkj and ρ is the correlation between Ckj and Fk∗. From this

we see that the difference betweenIE[←P−_jk◦|f] and←−p_jk will be small when V[F∗

k|f]

is small andIE[F∗

k|f] is large.

References

Agresti, A. (1990). Categorical Data Analysis. New York: Wiley.

Bethlehem J.G., Keller, W.J. and Pannekoek, J. (1990). Disclosure Control of Microdata. Journal of the American Statistical Association, 85, 38-45.

Chaudhuri, A., and Mukerjee, R. (1988). Randomized Response, Theory and Techniques. New York: Marcel Dekker.

De Wolf, P.-P., Gouweleeuw, J.M., Kooiman, P., and Willenborg, L.C.R.J (1997). Reflections on PRAM. Research paper no. 9742, Voorburg/ Heerlen: Statistics Netherlands.

Domingo-Ferrer, J., and Torra, V. (2001). A Quantitative Comparison of Disclosure Control Methods for Microdata. In Confidentiality, Disclosure, and Data Access, Theory and Practical Application for Statistical Agencies. Eds. Doyle, P., Lane, J., Zayatz, L., and Theeuwes, J., Amsterdam: North-Holland.

(28)

Gouweleeuw, J.M., Kooiman, P., Willenborg, L.C.R.J., and De Wolf, P.-P. (1998). Post Ran-domisation for Statistical Disclosure Control, Theory and Implementation. Journal of Official Statistics, 14, 463-478.

Hochberg, Y. (1977). On the Use of Double Sampling Schemes in Analyzing Categorical Data with Misclassification Errors. Journal of the American Statistical Association, 72, 914-921.

Kooiman, P., Willenborg, L.C.R.J., and Gouweleeuw, J.M. (1997). PRAM, a Method for Dis-closure Limitation of Microdata. Research paper no. 9705, Voorburg/Heerlen: Statistics Nether-lands.

Kuha, J.T., and Skinner, C.J. (1997). Categorical Data Analysis and Misclassification. In Survey Measurement and Process Quality, Eds. Lyberg, L., Biemer, P., Collins, M, De Leeuw, E, Dippo, C, Schwarz, N, and Trewin, D., New York: Wiley.

Rice, J.A. (1995). Mathematical Statistics and Data Analysis. Belmont: Duxbury Press.

Rosenberg, M.J. (1979). Multivariate Analysis by a Randomised Response Technique for Sta-tistical Disclosure Control. Ph.D. dissertation, University of Michigan, Ann Arbor.

Skinner, C.J., and Elliot, M.J. (2002). A Measure of Disclosure Risk for Microdata. Journal of the Royal Statistical Society, B, 855-867.

Van den Hout, A., and Van der Heijden, P.G.M. (2002). Randomized Response, Statistical Dis-closure control and Misclassification, a Review. International Statistical Review, 70, 269-288.

Warner, S.L. (1965). Randomized Response: a Survey Technique for Eliminating Answer Bias. Journal of the American Statistical Association, 60, 63-69.