Statistical Disclosure Control using Post Randomisation: Variants
and Measures for Disclosure Risk
Ardo van den Hout, Elsayed A. H. Elamir
Abstract
This paper discusses the post randomisation method (PRAM) as a method for disclosure control. PRAM protects the privacy of respondent by misclassifying specific variables before data are released to researchers outside the statistical agency. Two variants of the initial idea of PRAM are discussed concerning the information about the misclassification that is given along with the released data. The first variant concerns calibration probabilities and the second variant concerns misclassification proportions. The paper shows that the distinction between the univariate case and multivariate case is important. Additionally, the paper discusses two measures for disclosure risk when PRAM is applied.
Statistical Disclosure Control Using Post
Randomisation: Variants and Measures for Disclosure
Risk
Ardo van den Hout
Department of Methodology and Statistics Faculty of Social Sciences, Utrecht University Heidelberglaan 1, 3584 CS Utrecht, the Netherlands tel. +31 30 2539237, E-mail: [email protected]
Elsayed A.H. Elamir
Southampton Statistical Sciences Research Institute University of Southampton
Southampton SO17 1BJ, United Kingdom
tel: +44 23 8059 4298, E-mail: [email protected]
Abstract
This paper discusses the post randomisation method (PRAM) as a method
for disclosure control. PRAM protects the privacy of respondent by
misclas-sifying specific variables before data are released to researchers outside the
statistical agency. Two variants of the initial idea of PRAM are discussed
concerning the information about the misclassification that is given along with the released data. The first variant concerns calibration probabilities and the
second variant concerns misclassification proportions. The paper shows that
impor-tant. Additionally, the paper discusses two measures for disclosure risk when
PRAM is applied.
Keywords: calibration; information loss; misclassification; PRAM.
1
Introduction
The post randomisation method (PRAM) is discussed in Gouweleeuw, Kooiman,
Willenborg, and De Wolf (1998) as a method for statistical disclosure control
(SDC). When survey data are released by statistical agencies, SDC protects the
identity of respondents. SDC tries to prevent that a user of the released data can
link the data of a respondent in the survey to a specific person in the
popula-tion. See Willenborg and De Waal (2001) for an introduction into SDC and SDC
methods other than PRAM.
There is a close link between PRAM and randomised response, a method to ask
sensitive questions in a survey, see Warner (1965) and Rosenberg (1979). Van den
Hout and Van der Heijden (2002) sum up some differences and similarities between
randomised response and PRAM.
When SDC is used, there will always be a loss of information. This is inevitable
since SDC tries to determine the information in the data that can lead to the
disclosure of an identity of a respondent, and eliminates this information before
data are released. It is not difficult to prevent disclosure, but it is difficult to prevent
disclosureand release data that is still useful for statistical analysis. Applying SDC
means searching for a balance between disclosure risk and information loss.
The idea of PRAM is to misclassify some of the categorical variables in the
sur-vey using fixed misclassification probabilities and to release the partly misclassified
data together with those probabilities. Say variableX, with categories {1, ..., J},
together with conditional probabilitiesIP(X∗ =k|X = j) for k, j ∈ {1, ..., J}. In
this way PRAM introduces uncertainty in the data: the user of the released data
cannot be sure that the information is original or perturbed due to PRAM and
it becomes harder to establish a correct link between a respondent in the survey
and a specific person in the population. Since the user has the misclassification
probabilities he can adjust his analysis by taking into account the perturbation due
to PRAM.
This paper discusses two ideas to make PRAM more efficient with respect to
the balance between disclosure risk and information loss. First, the paper discusses
the use ofcalibration probabilities
IP(true category isj|categoryiis released). (1)
in the analysis of released data and compares this with using misclassification
probabilities
IP(categoryiis released|true category isj). (2)
The idea of using calibration probabilities is discussed by De Wolf, Gouweleeuw,
Kooiman, and Willenborg (1997), who refer to the discussion of calibration
prob-abilities in misclassification literature, see, e.g., Kuha and Skinner (1997). We will
elaborate the discussion and show that the advantage of calibration probabilities is
limited to the univariate case. Secondly, the paper shows that information loss can
be reduced by providingmisclassification proportions along with the released data.
These proportions inform about the actual change in the survey data due to the
ap-plication of PRAM. (Probabilities (1) and (2) inform about the expected change.)
Additionally, the paper discusses two measures for disclosure risk when PRAM is
applied. The first is a general measure presented in Elamir and Skinner (2003)
sec-ond measure links up with the SDC practice at Statistic Netherlands. Simulation
results are given to illustrate the theory.
The outline of the paper is as follows. Section 2 provides the framework and
the notation. Section 3 describes frequency estimation for PRAM data. Section
4 discusses the use of calibration probabilities. In Section 5 we introduce the use
of misclassification proportions. Section 6 discusses measures for disclosure risk,
whereas information loss is briefly considered in Section 7. Section 8 presents some
simulations, and Section 9 concludes.
2
Framework and notation
In survey data we distinguish between identifying variables and non-identifying
variables. Identifying variables are variables that can be used to re-identify
indi-viduals represented in the data. These variables are assumed to be categorical,
e.g., Gender, Race, Place of Residence. We assume that the sensitive information
of respondents is contained in the non-identifying variables, see Bethlehem, Keller,
and Pannekoek (1990), and that we want to protect this information by applying
PRAM to (a subset of) the identifying variables.
The notation in this paper is the same as in Skinner and Elliot (2002). Units
are selected from a finite populationU and each selected unit has one record in the
microdata sample s⊂U. Let ndenote the number of units ins. Let the
categor-ical variable formed by cross-classifying (a subset of) the identifying variables be
denoted X with values in {1, ..., J}. LetXi denote the value of X for unit i∈U.
Thepopulation frequencies are denoted
Fj =
X
i∈U
I(Xi =j), j∈ {1, ..., J},
Thesample frequencies are denoted
fj =
X
i∈s
I(Xi=j), j∈ {1, ..., J}.
In the framework of PRAM, we call the sample that is released by the statistical
agency thereleased microdata sample s∗. Note that uniti∈s∗ if and only if i∈s.
Let X∗ denote the released version of X in s∗. By misclassification of unit i we
meanXi 6=Xi∗. Thereleased sample frequencies are denoted
fk∗ = X
i∈s∗
I(Xi∗=k), k∈ {1, ..., J}.
LetPX denote the J×J transition matrix that contains the conditional
misclas-sification probabilities pkj = IP(X∗ = k|X = j), for k, j ∈ {1, ..., J}. Note that
the columns ofPX sum up to one. The distribution of X∗ conditional on sis the
J-component finite mixture given by
IP(Xi∗=k|i∈s) =
J
X
j=1
IP(Xi∗=k|Xi=j)IP(Xi =j|i∈s), k∈ {1, ..., J},
where the component distributions are given by PX and the component weights
are given by the conditional distribution ofX. The conditional distribution of X
in samples is multinomial with
IP(Xi =j|i∈s) = 1nfj, j∈ {1, ..., J}.
3
Frequency estimation for PRAM data
When PRAM is applied and some of the identifying variables are misclassified,
standard statistical models do not apply to the released data since these models do
not take into account the perturbation. This section shows how the misclassification
We have IE[F∗|f] = PXf, where f = (f1, ..., fJ)t and F∗ = (F1∗, ..., FJ∗)t is
the stochastic vector of the released sample frequencies. An unbiased moment
estimator off is given by
ˆ
f =PX−1f∗, (3)
see Kooiman, Willenborg, and Gouweleeuw (1997). In practice, assuming thatPX
is non-singular does not impose much restriction on the choice of the
misclassifi-cation probabilities. Matrix PX−1 exists when the diagonal of PX dominates, i.e., pii>1/2 fori∈ {1, ..., J}. An additional assumption is that the dimensions of f
and f∗ are the same.
PRAM is applied to each variable independently and a transition matrix is
re-leased per variable. When the user of the rere-leased sample assesses a compounded
variable, he can construct its transition matrix using the transition matrices of
the individual variables. For instance, consider identifying variablesX1, with
cat-egories {1, .., J1} and X2, with categories {1, .., J2}, and the cross-classification
X = (X1, X2), i.e., the Cartesian product ofX1 and X2. Since PRAM is applied
independently, we have
IP
µ
X∗ = (k1, k2)|X= (j1, j2)
¶
= IP(X1∗=k1|X1 =j1)
× IP(X2∗=k2|X1 =j2), (4)
for k1, j1 ∈ {1, .., J1} and k2, j2 ∈ {1, .., J2}. In matrix notation, we have PX = PX1⊗PX2, where⊗is the Kronecker product. Note that when one of two variables is not perturbed by PRAM, the transition matrix of that variable is the identity
matrix.
The variance of (3) equals
V[fˆ|f] =PX−1V[F∗|f](PX−1)t=PX−1
µXJ
j=1
fjVj
¶
where Vj is the J ×J covariance matrix of two released values given the original
valuej, i.e.,
Vj(k1, k2) =
pk2j(1−pk2j) if k1 =k2
−pk1jpk2j if k1 6=k2
fork1, k2 ∈ {1, ..., J},
see Kooiman et al. (1997). The variance can be estimated by substituting ˆfj for
fj in (5), forj∈ {1, .., J}.
The variance given by (5) is the extra variance due to PRAM and does not
take into account the sampling distribution. The formulas for the latter are given
in Chaudhuri and Mukerjee (1988) for multinomial sampling and compared to (5)
in Van den Hout and Van der Heijden (2002), see also Appendix B.
4
Calibration probabilities
Literature concerning misclassification shows that calibration probabilities (1) are
more efficient in the analysis of misclassified data than misclassification
probabil-ities (2), see the review paper by Kuha and Skinner (1997). Often, calibration
probabilities have to be estimated. However, when PRAM is applied, the
statisti-cal agency can compute the statisti-calibration probabilities using the sample frequencies.
The idea of using calibration probabilities for PRAM is mentioned in De Wolf et
al. (1997). The following elaborates this idea and makes a comparison with PRAM
as explained in the previous section.
The J ×J matrix with calibration probabilities of univariate variable X is
denoted by←P−X and has entries ←−pjk defined by
IP(Xi=j|Xi∗ =k, i∈s) =
pkjfj
PJ
j0=1pkj0fj0
, j, k∈ {1, ..., J}, (6)
column sums up to one. We have
f =←−PXIE[F∗|f], (7)
see Appendix A. An unbiased moment estimator off is therefore given by
˜
f =←P−Xf∗. (8)
In general,←P−X 6=PX−1, see Appendix A. The variance of (8) is given by (5) where
PX−1 is replaced by ←P−X andfj is estimated by ˜fj, forj∈ {1, ..., J}.
In the remainder of this section we compare estimators (3) and (8). The first
difference is that (3) might yield an estimate where some of the entries are negative,
whereas (8) will never yield negative estimates, see, e.g., De Wolfet al. (1997).
Secondly, estimator (8) is more efficient than (3) in the univariate case. This
is already discussed in Kuha and Skinner (1997). Consider the case whereX has
two categories. Say we want to knowπ=IP(X= 1). Let ˆπ be the estimate using
PX and ˜π the estimate using ←P−X. The efficiency of ˆπ relative to ˜π is given by
eff(ˆπ,π˜) = V[˜p]
V[ˆp] = (p11+p22−1)
2(←−p
22− ←−p21)2<1. (9)
So ˜π is always more efficient than ˆπ. An important difference with the general
situation of misclassification is that in the situation of PRAM, matrices PX and
←−
PX are given and do not have to be estimated. Comparison (9) is therefore a
simple form of the comparison in Kuha and Skinner (1997, Section 28.5.1.3.).
The third comparison is between the maximum likelihood properties of (3)
and (8). Assume thatX1, ...Xn are independently multinomially distributed with
parameter vectorπ = (π1, .., πJ)t. In the framework of misclassification, Hochberg
(1977) proves that estimator (8) yields a MLE. When (3) yields an estimate in the
interior of the parameter space, the estimate is also a MLE. See Appendix B for the
corresponding to (8) is different from the likelihood function corresponding to (3),
since the information used is different. This explains why both can be MLE despite
being different estimators off.
The fourth comparison is with respect to transition matrices of Cartesian
prod-ucts and is less favourable for (8). It has already been noted that PX1 ⊗PX2
is the matrix with misclassification probabilities for the Cartesian product X =
(X1, X2), see (4). Analogously, given ←P−X1 and
←−
PX2 the user can construct
ma-trix ←P−X1 ⊗←P−X2. However, this matrix does not necessarily contain calibration probabilities for X. Note that
IP³Xi = (j1, j2)|Xi∗ = (k1, k2), i∈s
´
= pk1j1pk2j2IP
³
Xi = (j1, j2)|i∈s
´ PJ1
v
PJ2
w pk1vpk2wIP ³
X1 = (v, w)|i∈s
´, (10)
It follows that ←P−X =←P−X1 ⊗
←−
PX2 when X1 and X2 are independent. In general,
this independence is not guaranteed and since the user of the released data does
not have the frequencies ofX, he cannot construct←P−X.
The fifth and last comparison is with respect to the creation of subgroups.
Consider the situation where a user of the released data creates a subgroup by
using a grouping variable that is not part ofX. When the number of categories in
the subgroup is smaller thanJ, estimate (8) is not well-defined. When the number
of categories is equal to J, estimate (8) is biased due to the fact that (7) does not
hold. Note with respect to (7) that the frequencies that are used to construct←P−X
are the frequencies in the whole sample which will differ from the frequencies in the
subgroup, see also Appendix A. The estimator (3) is still valid for the subgroup.
Since calibration probabilities contain information about the distribution of
the samples, they perform better than misclassification probabilities regarding the
Table 1: Classification by X∗ and X
X
X∗ 1 2 total 1 300 200 500 2 100 400 500 total 400 600 1000
Section 8 presents some simulation results.
5
Misclassification proportions
MatricesPX and ←P−X inform about the expected change due to PRAM. As an
al-ternative we can create transition matrices that inform about the actual change due
an application of PRAM. These matrices contain proportions and will be denoted
P◦
X and
←P−◦
X. Matrix PX◦ contains misclassification proportions and
←P−◦
X contains
calibration proportions. This section shows how P◦
X and
←P−◦
X are computed and
discusses properties of these matrices.
We start with an example. Say that X has categories {1,2}. Assume that
applying PRAM yields the cross-classification in Table 1. From this table it follows
that the proportion of records withX= 1 that haveX∗= 1 in the released sample
is 300/400=3/4 and that the proportion of records withX∗ = 1 that have X = 1
in the original sample is 300/500=3/5. Analogously we get the other entries of
PX◦ =
Ã
3/4 1/3 1/4 2/3
!
and ←P−X◦ =
Ã
3/5 1/5 2/5 4/5
!
.
For the general construction ofP◦
X and
←−
PX◦, let the cell frequencies in the cross-classificationX∗ byX be denotedc
kj, fork, j∈ {1, .., J}. The entries of theJ×J
transition matrices with the proportions are given by
pekj = ckj
fj and
←−pe
jk =
ckj
f∗
k
wherek, j ∈ {1, .., J}.
It follows that f∗ = P◦
Xf and f =
←−
PX◦f∗. This is the reason to consider the matrices with the proportions more closely, since it is a great improvement
compared to (3) and (8). Note that when the user of the released sample hasP◦
X
or←P−X◦, he can reconstruct Table 1.
Conditional on f, PX◦ and ←P−X◦ are stochastic, whereas PX and ←P−X are not.
In expectation PX◦ equals PX, and PX◦1 ⊗P ◦
X equals PX1 ⊗PX2, see appendix C.
However, since f∗ is a value of the stochastic vectorF∗, and IP(F∗
k = 0)6= 0, the
expectation of ←P−X◦ does not exists. Nevertheless, an approximation shows that
←−
PX◦ will be close to←P−X, see appendix C.
There is a set back with respect to the use of proportions for Cartesian products
and this is comparable to the problem mentioned in the previous section. Given
P◦
X1 and P ◦
X2 the user can construct P ◦
X1 ⊗P ◦
X2 for X = (X1, X2). However,
P◦
X1⊗P ◦
X2 does not contain proportions as defined above. Note that the user does
not have the cross-classification ofX and X∗, so he cannot derive the proportions
inPX◦. The same holds for←P−X◦. The optimal use of misclassification proportions is thereby limited to the univariate case.
Since misclassification proportions contain information about the actual
per-turbation due to PRAM, we expect them to perform well also in the multivariate
case. Section 8 discusses a multivariate example.
6
Disclosure risk
There are several ways to measure disclosure risk, see, e.g., Skinner and Elliot
(2002), and Domingo-Ferrer and Torra (2001). This section discusses two measures
for disclosure risk with respect to PRAM. Section 6.1 discusses an extension of the
general measure of disclosure risk introduced by Skinner and Elliot (2002). Section
Statistics Netherlands.
6.1 The measure θ
The following describes how the general measure for disclosure risk introduced in
Skinner and Elliot (2002) can be extended to the situation where PRAM is applied
before data are released by the statistical agency. When a disclosure control method
such as PRAM has been applied, a measure for disclosure risk is needed to quantify
the protection that is offered by the control method. Scenarios that may lead to a
disclosure of the identity of a respondent of are about persons that aim at disclosure
and that may have data that overlap the released data. A common scenario is that
a person has a sample from another source and tries to identify respondents in the
released sample by matching records. Using the extension of the measure in Skinner
and Elliot (2002) we can investigate how applying PRAM reduces the disclosure
risk.
Under simple random sampling Skinner and Elliot (2002) introduced the
mea-sure of disclomea-sure riskθ=IP(correct match|unique match) as
θ= X
j
I(fj = 1)
, X
j
Fj I(fj = 1),
where the summations are over j = 1, ..., J. The measure θ is the proportion of
correct matches among those population units which match a sample unique. The
measure is sample dependent and a distribution-free prediction is given by
b
θ=πn1
,µ
πn1+ 2(1−π)n2
¶
,
whereπ is the sampling fraction,n1=
P
jI(fj = 1) is the number of uniques and
n2 =
P
jI(fj = 2) is the number of twins in the sample, see Skinner and Elliot
misclassifi-cation occurs. The extension is given by
θmm =
X
i∈s
I(fXi = 1, Xi∗=Xi)
Á X
j
Fj I(fj = 1)
and its distribution-free prediction is given by
b
θmm =π
X
j
I(fj = 1)Pjj
Áµ
πn1+ 2 (1−π)n2
¶
,
where Pjj is the diagonal entry (j, j) of the transition matrix P which describes
the misclassification, see Elamir and Skinner(2003)
Section 8 presents some simulation results with respect to the measureθbefore
applying PRAM, andθmm after applying PRAM.
6.2 Spontaneous recognition
Statistics Netherlands releases data in several ways. One way is the releasing of
detailed survey data under contract, i.e., data are released to bona fide research
institutes that sign an agreement in which they promise not to look for disclosure
explicitly, e.g, by matching the data to other data files. In this situation, SDC
only concerns the protection against what is called spontaneous recognition. This
section introduces a measure for disclosure risk for PRAM data that is specific to
the control for spontaneous recognition.
Controlling for spontaneous recognition means that one should prevent that
certain records attract attention. A record may attract attention when a low
dimensional combination of its values has a low frequency. Also, without
cross-classifying, a record may attract attention when one of its values is recognized as
being very rare in the population. Combinations of values with low frequencies in
the sample are calledunsafe combinations.
Statistics Netherlands uses the rule of thumb that a recognition of a combination
only combinations of three variables are assessed with respect to disclosure control
for spontaneous recognition.
Note that applying PRAM causes two kinds of modifications in the sample that
make disclosure more difficult. First, it is possible that unsafe combinations in the
sample change into apparently safe ones in the released sample, and, secondly, it is
possible that safe combinations in the sample change into apparently unsafe ones.
Since misclassification probabilities are not that large (in order to keep analysis of
the released sample possible) and the frequency of unsafe combinations is typically
low, the effect of the first modification is negligible in expectation. The second
modification is more likely to protect an unsafe combination j when there are a
lot of combinations k,k 6=j, which are misclassified intoj. This is the reason to
focus, for a given recordiwith the unsafe combination of scoresj, on the calibration
probability
µ=IP(Xi =j|Xi∗ =j, i∈s).
When there are hardly any k, k6= j, misclassified into j, this probability will be
large, and, as a consequence, the record is unsafe. Note that combinations with
frequency equal to zero are never unsafe.
Measureµis a simplification since it ignores possible correlation betweenXand
other variables in the sample. Note that X will be a Cartesian product and that
the statistical agency can compute the calibration probabilityµusing (6) since the
agency has the frequencies of X.
7
Information loss
Since we stressed in the introduction that SDC means searching for a balance
be-tween disclosure risk and information loss, this section indicates ways to investigate
First, the transition matrix PX gives an idea of the loss of information. The
more this matrix resembles the identity matrix, the less information gets lost. In
general, this requires a definition of a distance between two matrices. However, we
can apply PRAM using matrices that are parameterized by one parameter, denoted
pd. The idea is as follows. Each time PRAM is applied, the diagonal probabilities
in the transition matrices are fixed and equal topdfor all selected variables. In the
columns, the probability mass 1−pd is equally divided over the entries that are
not diagonal entries. In this situation 1−pd is a measure for the deviation from the identity matrix.
Although transition matrices give an idea of the information loss, it is hard to
have an intuition about how a certain deviation from the identity matrix affects
analysis of the released data. A second way to investigate information loss is the
comparison of extra variances due to PRAM with respect to frequency estimation.
The idea here is that when this extra variance is already substantial, more complex
analyses of the released sample is probably not possible. The variance with respect
to frequency estimation can be estimated using (5).
The next section will assess information loss due to PRAM using these two
approaches.
8
Simulation results
The objective of this section is to illustrate the theory in the foregoing sections and
to investigate disclosure risk and information loss for different choices of
misclas-sification parameters. The population is chosen to consist of units with complete
records in the British Household Survey 1996-1999. We have N = 16710 and we
distinguish 5 identifying variables with respect to the household owner: Sex (S),
Marital Status (M), Economic status (D), Socio-Economic Group (E), and Age
consider simple random sampling without replacement from the population where
the sample fractionπ is equal to 0.05, 0.10 or 0.15. The three samples are denoted
s1,s2 and s3 and have sample sizes 836, 1671 and 2506 respectively.
The transition matrices used to apply PRAM to the selected variables are
mostly of a simple form and determined by one parameter pd, as described in
Section 7. A more sophisticated construction of the transition matrices can reduce
the disclosure risk further. An example of this fine-tuning will be given.
8.1 Disclosure risk and the measure θ
The following discusses disclosure risk by comparing the measureθbefore PRAM is
applied with the measureθmm after PRAM has been applied, see Section 6.1. The
identifying variables are described byX= (S, M, D, E, A) withJ = 7840 possible
categories.
Since the population is known, we can compute the measures and using the
samples we can compute their predictions. Table 2 presents the simulation results
using simple random sampling without replacement using different sampling
frac-tionsπ and different choices of pd. Given a choice ofπ and pd, drawing the sample
and applying PRAM is 100 times simulated. The means of the computed and
pre-dicted measures are reported in Table 2. Note that θ and θbreflect the risk before applying PRAM andθmm andθbmm reflect the risk after applying PRAM.
It is clear from Table 2 that applying PRAM will reduce the risk, for example,
when pd= 0.80 and π = 0.10, applying PRAM reduces the risk from θ= 0.166 to θmm = 0.055. Whenpddecreases, disclosure risk decreases too, as one might expect. Note that disclosure risk increases when sample size increases. In a larger sample, a
record with a unique combination of scores is more likely to be a population unique
Table 2: Simulation results of disclosure risk measures for X = (S, M, D, E, A)
before and after applying PRAM with pd.
pd π θ θb θmm θbmm
0.05 0.084 0.087 0.061 0.065 0.95 0.10 0.151 0.147 0.112 0.110 0.15 0.217 0.215 0.165 0.166 0.05 0.087 0.086 0.047 0.051 0.90 0.10 0.148 0.157 0.083 0.087 0.15 0.213 0.216 0.123 0.127 0.05 0.087 0.090 0.028 0.031 0.80 0.10 0.166 0.151 0.055 0.054 0.15 0.213 0.221 0.065 0.064
8.2 Disclosure risk and spontaneous recognition
This following illustrates the measureµ for disclosure risk for spontaneous
recog-nition that is discussed in Section 6.2.
Spontaneous recognition is defined for combinations up to three identifying
variables, see Section 6.1. So there are 10 groups to consider. We will discuss
only one of them, namely the group defined by X = (M, D, E). The number of
categories ofX is 490. The measure for disclosure risk is given by
µ=IP
µ
(M, D, E) = (m, d, e)
¯ ¯ ¯
¯(M∗, D∗, E∗) = (m, d, e) ¶
,
for those combinations of values (m, d, e) that have frequency 1 in sample s1, s2 or s3. Note that when PRAM is not applied, µ= 1. Table 3 shows results with
respect to the maximum ofµwhen PRAM is applied toM,DandE. With respect
to X = (M, D, E) the number of unique combinations in s1, s2 or s3 are 48, 44,
and 53 respectively.
We draw two conclusions from the results. First, the results illustrate that the
Table 3: Maximum of µ for values of (M, D, E) with frequency 1 when applying PRAM toM, Dand E.
pd
0.95 0.90 0.85 0.80 0.70 0.60 sample
s1 with n= 836 0.94 0.86 0.76 0.65 0.43 0.24
s2 with n= 1671 0.94 0.85 0.74 0.61 0.36 0.18
s3 with n= 2506 0.93 0.83 0.69 0.55 0.31 0.15
size of the sample is important. In order to protect an unsafe combinationj, it is
necessary that there are a lot of combinations that can change intojdue to PRAM.
Note that this is the other way around compared to the measure θwhere a larger
sample size causes a higher disclosure risk. This difference shows that different
concepts of disclosure induce different methods for disclosure control.
The following introduces a method to fine-tune a transition matrix and shows
that this can help to diminish the disclosure risk. The idea is to adjust one or
more columns in the transition matrix of each variable that is part of an unsafe
combination. ConsiderPX1 where variableX1 hasJ1 categories. The column that
is chosen first corresponds to the category of X1 with the highest frequency in
samples, say columnj. Let furthermore kbe the number that corresponds to the
category of X1 with the lowest frequency ins. The columns of PX1 that are not
columnjare constructed as explained in Section 7: pdon the diagonal and (1−pd) equally divided over the other entries. Columnj is fine-tuned by
plj =
pd if l=j
(1−pd)/η if l=k ,
(1−pd)/(η(J1−2)) if l6=j, k
(11)
for l ∈ {1, .., J1} and η > 1. The idea here is that when we choose η close to
Table 4: Maximum ofµfor values of(M, D, E)with frequency 1 in samples3 when
using fine-tuning for all three variables.
pd
0.95 0.90 0.85 0.80 0.70
no fine-tuning 0.93 0.83 0.69 0.55 0.31 fine-tuning 1 column where η= 1.001 0.92 0.80 0.66 0.51 0.28 fine-tuning 2 columns where η= 1.001 0.91 0.74 0.57 0.42 0.20 fine-tuning 3 columns where η= 1.001 0.90 0.72 0.52 0.34 0.15
change into the category with the lowest frequency. Assuming a link between an
unsafe combination and a low frequency in the original sample, this idea explicitly
supports the concept of PRAM: an unsafe combinationcis after PRAM protected
by creating new combinations c from combinations that have high frequencies in
the original sample.
In the same way other columns in PX1 can be fine-tuned. For example, the
second column chosen is the column that corresponds to the category ofX1 with
the second highest frequency in samples, and the chosen row is now the row that
corresponds to the category ofX1 with the second lowest frequency in sample s. Table 4 presents results for samples3when the transition matrices ofM,D, and Eare fine-tuned. The advantage of fine-tuning the transition matrices is dependent
of the data and on the size of the sample. One can see that the idea works, e.g., if
pd= 0.80, fine-tuning can decrease the maximum ofµfrom 0.55 to 0.34.
Even after using fine-tuning, the maximum of µis still quite large. Additional
simulations, not reported, show that the maximum of µ decreases rapidly when
sample size is increased. Conclusion and advice: determine a largest tolerated µ
and check all combinations of three identifying variables and use fine-tuning. The
Table 5: Frequencies before PRAM and standard errors of estimated frequencies after PRAM of the variableM in samples3.
pd
0.95 0.90 0.85
f standard error offˆ
using PM ←P−M PM ←P−M PM ←P−M
99 5.27 4.08 7.87 4.77 10.23 4.87 378 6.33 5.62 9.40 7.35 12.13 8.30 353 6.24 5.52 9.27 7.21 11.97 8.11 471 6.65 5.95 9.85 7.83 12.69 8.90 525 6.83 6.12 10.11 8.08 13.01 9.21 551 6.78 6.08 10.04 8.02 12.93 9.13 169 5.55 4.64 8.28 5.77 10.74 6.20
8.3 Information loss in frequency estimation
To investigate information loss due to PRAM, this section discusses an example
with univariate frequency estimation with respect to the variableM and bivariate
frequency estimation with respect to the variables S and E in sample s3. We illustrate the difference between usingPX and←P−X by comparing standard errors in
estimating the univariate frequencies of variableM. In the following we assume that
the released sample frequencies ofM to be equal to the expected released sample
frequencies. That is, released sample frequenciesf∗are given byIE(F∗|f) =PMf,
where F∗ and f are defined with respect to M. It follows that in this situation
ˆ
f = f, so that (5) can be used to compare variances. To estimate the standard
errors when using calibration, we use←P−M in (5) instead of PM−1. Table 5 presents
standard errors of estimated frequencies for different choices of pd. The example shows that←P−M is more efficient thanPM, a difference that becomes more striking whenpdis smaller.
In the bivariate situation calibration probabilities and proportions do not always
0 100 200 300 400 0 100 200 300 400 (b)
Frequencies in the sample
Estimates using calibration probabilities 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 100 200 300 400
0 100 200 300 400 (a)
Frequencies in the sample
Estimates using misclassification probabilities
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Figure 1: Estimating frequencies ofX= (S, E)after applying PRAM toS andE in samples3 withpd=0.85 in 10 simulations. (a)Using misclassification probabilities.
(b)Using calibration probabilities.
variable X = (S, E) that has 14 categories. The χ2-test of independence between S and E yields 529.55, where df = 6 and the p-value < 0.00. It is this lack of
independence between the variables that causes calibration probabilities to perform
badly. PRAM was applied 10 times to bothSandEwithpd= 0.85. Figure 1 shows the estimation of the frequencies ofX usingX∗ and P
S⊗PE versus using X∗ and
←P−
S⊗←P−E. From the figure it is clear that the misclassification probabilities perform
better, i.e., the points (fj,fˆj) are closer to the identity line then the points (fj,f˜j), j ∈ {1, ...,14}. The variance is less when ←P−S⊗←P−E is used, but the figure shows that in that case estimates are biased. Violating the independence assumption
Table 6: Actual coverage percentage w.r.t. X = (S, E) for sample s3 and 1000
simulated perturbed samples, wherepd= 0.85.
ACP using ACP using
category PS⊗PE PS◦⊗PE◦ category PS⊗PE PS◦⊗PE◦
(1,1) 94.8 98.2 (2,1) 94.5 97.8 (1,2) 95.3 98.1 (2,2) 95.6 97.7 (1,3) 95.7 97.7 (2,3) 95.0 97.9 (1,4) 95.4 98.0 (2,4) 95.7 98.8 (1,5) 95.5 99.1 (2,5) 93.5 97.3 (1,6) 95.1 97.4 (2,6) 93.8 97.5 (1,7) 96.0 98.3 (2,7) 96.0 98.7
Misclassification proportions are close to misclassification probabilities in the
above example. Compare for instance
PS =
Ã
0.85 0.15 0.15 0.15
!
and PS◦ =
Ã
0.854 0.154 0.156 0.156
!
.
A simulation study can be used to investigate the performance of
misclassifica-tion probabilities versus misclassificamisclassifica-tion propormisclassifica-tions. The study compares using
PS⊗PE versus usingPS◦⊗PE◦ by looking at the actual coverage percentage (ACP),
which is the percentage of the replicated perturbed samples for which the confidence
interval of the estimated frequency covers the actual frequency in the original
sam-ple. We used samples3,pd= 0.85 and 1000 simulated perturbed samples. Table 6 shows that misclassification proportions perform better than then misclassification
probabilities. The mean value of ACP when usingPS⊗PE equals 95.14, and the mean value of ACP when using P◦
S ⊗PE◦ equals 98.04. (A Paired t-Test yields a
p-value <0.00.) So, although the transition matrices are quite alike at first sight,
9
Conclusion
The paper shows that the analysis of PRAM data is more efficient when
misclassi-fication proportions are used instead of misclassimisclassi-fication probabilities. Calibration
probabilities and calibration proportions work fine in the univariate case, but cause
serious bias in the multivariate case. Since in most situations the user of PRAM
data will be interested in multivariate analysis, it seems wise not to release
calibra-tion probabilities or calibracalibra-tion proporcalibra-tions along with the PRAM data. The two
measures for disclosure risk that are used in this paper show that PRAM helps in
protecting the identity of respondents.
Given that releasing misclassification proportions makes PRAM more efficient
with respect to information loss, it is still an open question how this works out
when PRAM is compared to other SDC methods, see Domingo-Ferrer and Torra
(2001). It might be worthwhile to state that PRAM was never meant to replace
existing SDC methods. Working with PRAM data and taking into account the
information about the misclassification in the analysis might be quite a burden
for some researchers. However, when researchers are interested in specific details
in data, details that might disappear when, e.g., global recoding is used, PRAM
can be a solution. Note that PRAM is statistically sound. Data are perturbed,
but information about the perturbation can be used. Although estimates will have
extra variance due to the perturbation, they will be unbiased.
Since the misclassification proportions provide more information about the
orig-inal sample than the misclassification probabilities, one should consider the
ques-tion whether providing these proporques-tions increases the disclosure risk. Since the
privacy protection that is offered by PRAM is at the record level, we do not think
that disclosure risk increases when misclassification proportions are released. With
but these frequencies are not sensitive information. Note also that when one works
with the measures for disclosure risk discussed in Section 6, the risk does not change
when misclassification proportions are released.
Appendix A
The following shows that f = ←P−XIE[F∗|f]. First note that IE[F∗|f] = PXf
and that entries ←−pjk of ←P−X are defined as ←−pjk = (pkjfj)(
PJ
j0=1pkj0fj0)−1 for
k, j∈ {1, ..., J}. For eachj∈ {1, ..., J} we have
µ
←−
PXIE[F∗|f]
¶
(j) =
J
X
k=1
←−p
jk
µ
IE[F∗|f]
¶
(k)
=
J
X
k=1
←−p
jk
µ XJ
j0=1
pkj0fj0 ¶
=
J
X
k=1
pkjfj =fj,
since the columns ofPX sum up to one. So f =←P−XPXf and f is an eigenvector
of←P−XPX with eigenvalue 1.
In general, ←P−X 6= PX−1. To illustrate this, letR = ←P−XPX. The entries of R are rij = PJk=1←−pikpkj, for i, j ∈ {1, ..., J}. Assume that the entries of PX are all > 0 and that fj > 0, for j ∈ {1, ..., J}. Then ←−pjk > 0, for j, k ∈ {1, ..., J}
and rij > 0, for i, j ∈ {1, ..., J}. In this case, R is not the identity matrix and
consequently←P−X 6=PX−1. A more intuitive explanation is that←P−X changes when
the survey data change, whereas PX can be determined independently from the
data and hence does not necessarily change when the data change. Therefore, it is
always possible to cause←P−X 6=PX−1 by changing the data.
Appendix B
The following derives the maximum likelihood properties of (3) and (8). The
situation calibration probabilities do not have to be estimated. Also, we show that
the reasoning applies both to (3) and to (8).
Assume that X1, ...Xn are independently multinomially distributed with
pa-rameter vector π = (π1, .., πJ)t, where πj >0 for j ∈ {1, .., J}, and PJj=1πj = 1.
Consider the transformationπ∗ =Pπ, where P is a J×J transition matrix, i.e.,
columns sum up to one andpkj ≥0 for k, j∈ {1, .., J}. Assume that P is
nonsin-gular. Let the distribution ofX∗ be given by IP(X∗ =k) =π∗k, for k∈ {1, .., J}. It follows thatX∗
1, X2∗, ..., Xn∗ are multinomially distributed with parametersnand
π∗. Indeed,π∗
k =pk1π1+...+pkJπJ >0 for k∈ {1, .., J}and J
X
k=1
πk∗ =
µXJ
l=1
pl1
¶
π1+...+
µXJ
l=1
plJ
¶
πJ = 1.
The likelihood L∗ for π∗ and observed x∗ = (x∗
1, x∗2, ..., x∗n)t is well known. Let
f∗ = (f∗
1, f2∗, ..., fJ∗)t denote the observed cell frequencies. The MLE is given by
ˆ
π∗ =f∗/nand has covariance matrix Ω = [Diag(π∗)−π∗(π∗)t]/n, where Diag(π∗)
is the diagonal matrix with the diagonal entries given by the elements ofπ∗.
Next we can use the invariance property of maximum likelihood. Define the
transformation g(π∗) = P−1π∗. Since g is one-to-one, it follows from L∗(π∗|x∗) and π = g(π∗) that the likelihood for π is given by L∗(g−1(π)|x∗) which is
maximized for πˆ = g(πˆ∗) = P−1πˆ∗. Consequently, when πˆ ∈ (0,1)J, it is the
MLE. Since g has a first order derivative, the covariance matrix of πˆ can be
ob-tained using the delta-method, see, e.g., Agresti (1990, Chapter 12). We have
∂g(π∗)/∂π∗= (P−1)tand the covariance matrix of πˆ is given by n−1P−1Ω(P−1)t.
So, with respect to (3) maximum likelihood properties are proven by takingP =
PX and obtaining πˆ =g(πˆ∗) =PX−1πˆ∗. With respect to (8), the misclassification
design is described by π = ←P−Xπ∗, so P = ←P−
−1
X and the MLE is given by π˜ =
Appendix C
LetP◦
kj denote the stochastic variable of thekj-th entry ofPX◦ andCkj the
stochas-tic variable of thekj-th cell in the cross-classificationX∗ byX. It follows thatC
kj
has a binomial distribution with parametersfj andpkj. Consequently,IE[Pkj◦|f] =
IE[Ckj/fj|f] =fjpkj/fj =pkj and in expectation PX◦ equals PX. Since Ck1j1 and
Ck2j2 are independent givenf, it follows that IE[Pk◦1j1P ◦
k2j2|f] =pk1j1pk2j2. So, in
expectation,P◦
X1 ⊗P ◦
X equals PX1 ⊗PX2.
We define ←P−jk◦ = Ckj/(Fk∗ +ε) where ε is a small positive value. Using the
delta method, see, e.g., Rice (1995, Section 4.6), we obtain
IE[←P−jk◦|f]≈ IE[Ckj|f]
IE[F∗
k|f]
+ 1 IE[F∗
k|f]2
µ
V[Fk∗|f]IE[Ckj|f] IE[F∗
k|f]
−ρ
q
V[Ckj|f]V[Fk∗|f]
¶
,
where IE[Ckj|f] = fjpkj and ρ is the correlation between Ckj and Fk∗. From this
we see that the difference betweenIE[←P−jk◦|f] and←−pjk will be small when V[F∗
k|f]
is small andIE[F∗
k|f] is large.
References
Agresti, A. (1990). Categorical Data Analysis. New York: Wiley.
Bethlehem J.G., Keller, W.J. and Pannekoek, J. (1990). Disclosure Control of Microdata. Journal of the American Statistical Association, 85, 38-45.
Chaudhuri, A., and Mukerjee, R. (1988). Randomized Response, Theory and Techniques. New York: Marcel Dekker.
De Wolf, P.-P., Gouweleeuw, J.M., Kooiman, P., and Willenborg, L.C.R.J (1997). Reflections on PRAM. Research paper no. 9742, Voorburg/ Heerlen: Statistics Netherlands.
Domingo-Ferrer, J., and Torra, V. (2001). A Quantitative Comparison of Disclosure Control Methods for Microdata. In Confidentiality, Disclosure, and Data Access, Theory and Practical Application for Statistical Agencies. Eds. Doyle, P., Lane, J., Zayatz, L., and Theeuwes, J., Amsterdam: North-Holland.
Gouweleeuw, J.M., Kooiman, P., Willenborg, L.C.R.J., and De Wolf, P.-P. (1998). Post Ran-domisation for Statistical Disclosure Control, Theory and Implementation. Journal of Official Statistics, 14, 463-478.
Hochberg, Y. (1977). On the Use of Double Sampling Schemes in Analyzing Categorical Data with Misclassification Errors. Journal of the American Statistical Association, 72, 914-921.
Kooiman, P., Willenborg, L.C.R.J., and Gouweleeuw, J.M. (1997). PRAM, a Method for Dis-closure Limitation of Microdata. Research paper no. 9705, Voorburg/Heerlen: Statistics Nether-lands.
Kuha, J.T., and Skinner, C.J. (1997). Categorical Data Analysis and Misclassification. In Survey Measurement and Process Quality, Eds. Lyberg, L., Biemer, P., Collins, M, De Leeuw, E, Dippo, C, Schwarz, N, and Trewin, D., New York: Wiley.
Rice, J.A. (1995). Mathematical Statistics and Data Analysis. Belmont: Duxbury Press.
Rosenberg, M.J. (1979). Multivariate Analysis by a Randomised Response Technique for Sta-tistical Disclosure Control. Ph.D. dissertation, University of Michigan, Ann Arbor.
Skinner, C.J., and Elliot, M.J. (2002). A Measure of Disclosure Risk for Microdata. Journal of the Royal Statistical Society, B, 855-867.
Van den Hout, A., and Van der Heijden, P.G.M. (2002). Randomized Response, Statistical Dis-closure control and Misclassification, a Review. International Statistical Review, 70, 269-288.
Warner, S.L. (1965). Randomized Response: a Survey Technique for Eliminating Answer Bias. Journal of the American Statistical Association, 60, 63-69.